# Categorical Data Lab

### Introduction

### Loading our Data

In [4]:
import pandas as pd 
url = "https://raw.githubusercontent.com/jigsawlabs-student/feature-engineering/master/8-categorical-variables-lab/imdb_movies.csv"
movies_df = pd.read_csv(url)

In [5]:
movies_df[:3]

Unnamed: 0,title,genre,budget,runtime,year,month,revenue
0,Avatar,Action,237000000,162.0,2009,12,2787965087
1,Pirates of the Caribbean: At World's End,Adventure,300000000,169.0,2007,5,961000000
2,Spectre,Action,245000000,148.0,2015,10,880674609


Let's start by looking at an overview of the data with the `info` method.

In [34]:
import pandas as pd

df = pd.read_csv('./movies_adjusted.csv', index_col = 0)

In [35]:
df[:2]

Unnamed: 0,title,genre,budget,runtime,year,month,revenue
0,Avatar,Action,237000000.0,162.0,2009,12,2787965000.0
1,Pirates of the Caribbean: At World's End,Adventure,300000000.0,169.0,2007,5,961000000.0
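As a rough sketch of the `info` overview mentioned above, the cell below uses a small stand-in frame. `sample_df` and its values are made up for illustration and are not the lab's data. `info` reports each column's dtype and non-null count, and `isna().sum()` gives the missing counts directly.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the movies data, with a few missing values.
sample_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre": ["Action", "Drama", "Action"],
    "budget": [1.0e6, np.nan, 3.0e6],
    "revenue": [5.0e6, 2.0e6, np.nan],
})

# Prints the dtype and non-null count of every column.
sample_df.info()

# Number of missing values per column.
missing_counts = sample_df.isna().sum()
print(missing_counts)
```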


### Pipelining

In [39]:
from sklearn.impute import MissingIndicator, SimpleImputer

In [None]:
preprocess_pipeline = make_pipeline()

In [None]:
SimpleImputer(strategy="median")

In [None]:
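import numpy as np
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Note: ColumnSelector, TypeSelector, and x_cols are not defined in this lab;
# they are assumed to be custom helpers and a column list defined elsewhere.
preprocess_pipeline = make_pipeline(
    ColumnSelector(columns=x_cols),
    FeatureUnion(transformer_list=[
        ("numeric_features", make_pipeline(
            TypeSelector(np.number),
            SimpleImputer(strategy="median"),
            StandardScaler()
        )),
        ("categorical_features", make_pipeline(
            TypeSelector("category"),
            SimpleImputer(strategy="most_frequent"),
            OneHotEncoder()
        )),
        ("boolean_features", make_pipeline(
            TypeSelector("bool"),
            SimpleImputer(strategy="most_frequent")
        ))
    ])
)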
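
The older `Imputer` class is no longer available in recent scikit-learn releases, and `ColumnSelector` and `TypeSelector` are not part of scikit-learn itself. As a sketch only, and not necessarily the approach this lab intends, a similar preprocessing step could be written with `ColumnTransformer`. The toy frame and the column lists below are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with missing values in both numeric and categorical columns.
toy_df = pd.DataFrame({
    "budget": [1.0, np.nan, 3.0, 4.0],
    "runtime": [90.0, 120.0, np.nan, 100.0],
    "genre": pd.Series(["Action", "Drama", np.nan, "Action"], dtype=object),
})

numeric_cols = ["budget", "runtime"]   # assumed column names
categorical_cols = ["genre"]

preprocess = ColumnTransformer(transformers=[
    # Numeric columns: fill with the median, then standardize.
    ("numeric_features",
     make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
     numeric_cols),
    # Categorical columns: fill with the most frequent label, then one-hot encode.
    ("categorical_features",
     make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OneHotEncoder(handle_unknown="ignore")),
     categorical_cols),
])

features = preprocess.fit_transform(toy_df)
print(features.shape)
```

Each named step receives only the columns listed for it, so numeric and categorical features can be imputed with different strategies.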

In [27]:
is_na_df = movies_df[['budget', 'revenue']].isna()
is_na_df[:3]
# # 	budget	revenue
# 0	False	False
# 1	False	False
# 2	False	False

Unnamed: 0,budget,revenue
0,False,False
1,False,False
2,False,False


Notice that this is precisely the data we want to add to our dataframe.  Let's rename the columns `budget_is_na` and `revenue_is_na`.

In [28]:
cols = [f'{col}_is_na' for col in is_na_df.columns]

In [29]:
is_na_df.columns = cols

In [30]:
is_na_df[:3]
# 	budget_is_na	revenue_is_na
# 0	False	False
# 1	False	False
# 2	False	False

Unnamed: 0,budget_is_na,revenue_is_na
0,False,False
1,False,False
2,False,False


In [31]:
df_with_is_na = pd.merge(updated_df, is_na_df, right_index = True, left_index = True)
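
The steps above (flag the missing values, rename the flag columns, attach them by index) can also be sketched more compactly with `add_suffix` and `join`. The frame `toy_df` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for the movies dataframe.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "budget": [1.0e6, np.nan, 3.0e6],
    "revenue": [5.0e6, 2.0e6, np.nan],
})

# Boolean flags marking missing values, renamed with an `_is_na` suffix.
is_na_flags = toy_df[["budget", "revenue"]].isna().add_suffix("_is_na")

# join aligns on the index, much like pd.merge with left_index and right_index.
toy_with_flags = toy_df.join(is_na_flags)
print(toy_with_flags)
```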

In [29]:
df_with_is_na['revenue'].mean()

184352857.52439708

In [30]:
df_with_is_na['budget'].mean()

64431616.28518124

In [31]:
import numpy as np

replaced_df = df_with_is_na[['budget', 'revenue']].replace(np.nan, df_with_is_na[['budget', 'revenue']].mean())

In [32]:
replaced_df.isna().sum()

budget     0
revenue    0
dtype: int64

In [33]:
(replaced_df['budget'] == 64431616.28518124).sum()
# 124

124

In [34]:
(replaced_df['revenue'] == 184352857.52439708).sum()
# 217

217

Ok, so we can now update the `df_with_is_na` columns.

In [35]:
removed_nas_df =  df_with_is_na.replace(replaced_df)

In [36]:
# removed_nas_df.isna().sum()

In [37]:
df_with_is_na[['budget', 'revenue']] = replaced_df

In [38]:
df_with_is_na.isna().sum()

title            0
genre            0
budget           0
runtime          0
year             0
month            0
revenue          0
budget_is_na     0
revenue_is_na    0
dtype: int64
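
For reference, mean imputation like the above is often written with `fillna`, passing a Series of column means so that each column is filled with its own mean. This is a sketch on made-up values, not the lab's data.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value per column.
toy_df = pd.DataFrame({
    "budget": [1.0, np.nan, 3.0],
    "revenue": [4.0, 6.0, np.nan],
})

cols = ["budget", "revenue"]
column_means = toy_df[cols].mean()  # NaN values are skipped by default

# Each column's missing entries are filled with that column's mean.
filled_df = toy_df.copy()
filled_df[cols] = toy_df[cols].fillna(column_means)
print(filled_df)
```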

### Adding Categorical Values

In [39]:
df_with_is_na[df_with_is_na['year'] > 1999]['genre'].value_counts(normalize = True)

Action             0.235448
Comedy             0.194245
Drama              0.169392
Adventure          0.114454
Animation          0.052322
NA                 0.043819
Thriller           0.043165
Fantasy            0.039241
Crime              0.036625
Horror             0.026815
Science Fiction    0.023545
Romance            0.020929
Name: genre, dtype: float64

Ok, now let's begin working with our `genre` column.  Currently it's of type object.

In [40]:
genre_col = df_with_is_na['genre']

First call `pd.get_dummies` on the genre column.

In [41]:
dummied_genre = pd.get_dummies(genre_col)

In [42]:
dummied_genre[:3]

Unnamed: 0,Action,Adventure,Animation,Comedy,Crime,Drama,Fantasy,Horror,NA,Romance,Science Fiction,Thriller
0,1,0,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,0,0


Ok, now let's drop the first column, `Action`.  This means that every remaining column is measured relative to the value for an Action movie.

In [43]:
dummied_genre_no_action = pd.get_dummies(genre_col, drop_first = True)

In [44]:
dummied_genre_no_action[:3]

Unnamed: 0,Adventure,Animation,Comedy,Crime,Drama,Fantasy,Horror,NA,Romance,Science Fiction,Thriller
0,0,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0


And let's use the `prefix` keyword to precede each column name with the word `genre`.

In [45]:
dummied_genre_no_action_prefix  = pd.get_dummies(genre_col, drop_first = True, prefix = 'genre')

In [46]:
dummied_genre_no_action_prefix[:3]

Unnamed: 0,genre_Adventure,genre_Animation,genre_Comedy,genre_Crime,genre_Drama,genre_Fantasy,genre_Horror,genre_NA,genre_Romance,genre_Science Fiction,genre_Thriller
0,0,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0
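
To see what dropping the first category does, here is a small sketch on a made-up genre column. A row whose dummy columns are all zero belongs to the dropped baseline category. Depending on the pandas version, the dummy values may print as `0`/`1` or as `False`/`True`.

```python
import pandas as pd

# Made-up genre column; "Action" sorts first, so it becomes the baseline.
toy_genres = pd.Series(["Action", "Comedy", "Drama", "Action"], name="genre")

dummies = pd.get_dummies(toy_genres, drop_first=True, prefix="genre")

# Rows where every dummy is 0 belong to the dropped baseline category.
is_baseline = dummies.sum(axis=1) == 0

print(dummies.columns.tolist())
print(is_baseline.tolist())
```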


Ok, now let's merge this back with our original data.

In [47]:
movies_df_with_dummies = pd.merge(df_with_is_na, dummied_genre_no_action_prefix, left_index = True, right_index = True)

Remove the `title` and `genre` columns.  We don't need them.

In [48]:
movies_df_with_dummies_select = movies_df_with_dummies.iloc[:, 2:]

movies_df_with_dummies_select.columns

Index(['budget', 'runtime', 'year', 'month', 'revenue', 'budget_is_na',
       'revenue_is_na', 'genre_Adventure', 'genre_Animation', 'genre_Comedy',
       'genre_Crime', 'genre_Drama', 'genre_Fantasy', 'genre_Horror',
       'genre_NA', 'genre_Romance', 'genre_Science Fiction', 'genre_Thriller'],
      dtype='object')

### Splitting our data

In [49]:
post_2000_movies_df = movies_df_with_dummies_select[movies_df_with_dummies_select['year'] > 1999]

Now let's separate our data into a training set, a validation set, and a test set.  Let's begin with the validation set.  Our validation set should be the most recent data, so sort the data by year, then by month, and select the most recent 20 percent.

In [50]:
movies_df_with_dummies_select[:2]

Unnamed: 0,budget,runtime,year,month,revenue,budget_is_na,revenue_is_na,genre_Adventure,genre_Animation,genre_Comedy,genre_Crime,genre_Drama,genre_Fantasy,genre_Horror,genre_NA,genre_Romance,genre_Science Fiction,genre_Thriller
0,237000000.0,162.0,2009,12,2787965000.0,False,False,0,0,0,0,0,0,0,0,0,0,0
1,300000000.0,169.0,2007,5,961000000.0,False,False,1,0,0,0,0,0,0,0,0,0,0


In [51]:
.2*post_2000_movies_df.shape[0]

305.8

In [52]:
sorted_movies_post_2000 = post_2000_movies_df.sort_values(['year', 'month'], ascending = False)

Let's take a look at when this set of movies is from.

In [53]:
sorted_movies_post_2000['year'].value_counts().sort_index(ascending = False)

2016     45
2015     84
2014     87
2013     94
2012     91
2011    105
2010     95
2009    100
2008     98
2007     77
2006     95
2005    105
2004    104
2003     82
2002     91
2001     92
2000     84
Name: year, dtype: int64

Ok, so by choosing a validation set this large (about 306 movies) we'll eat up roughly the four most recent years of data.  Let's instead choose just the most recent hundred movies for our test set.  Begin by selecting all of the columns except for revenue and assign them to `X`, and assign the revenue column to `y`.

In [54]:
X = sorted_movies_post_2000.drop(columns = ['revenue'])
y = sorted_movies_post_2000['revenue']

Then assign the most recent X and y values to the test set.

In [55]:
X_test = X[:100]
X_test.shape

y_test = y[:100]

In [56]:
X_test['year'][:2]
# 72     2016
# 357    2016
# Name: year, dtype: int64

72     2016
357    2016
Name: year, dtype: int64

Now let's select 150 movies for our validation set.

In [57]:
X_validate = X[100:250]
y_validate = y[100:250]
X_validate.shape

(150, 17)

In [58]:
X_validate['year'][:2]
# 244     2015
# 1348    2015
# Name: year, dtype: int64

244     2015
1348    2015
Name: year, dtype: int64

And the rest of our data can be in our training set.

In [59]:
X_train = X[250:]
y_train = y[250:]
X_train.shape

(1279, 17)
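
The chronological split above can be sketched end to end on a tiny made-up frame. The sizes `n_test` and `n_validate` are hypothetical stand-ins for the 100 and 150 rows used in this lab.

```python
import pandas as pd

# Made-up release dates and revenues.
toy_df = pd.DataFrame({
    "year":  [2014, 2016, 2015, 2016, 2013, 2015, 2014, 2012],
    "month": [5, 8, 1, 2, 7, 11, 3, 9],
    "revenue": [10, 20, 30, 40, 50, 60, 70, 80],
})

# Most recent rows first.
sorted_df = toy_df.sort_values(["year", "month"], ascending=False)

n_test, n_validate = 2, 2  # hypothetical sizes for this toy example

test_df = sorted_df.iloc[:n_test]                          # newest rows
validate_df = sorted_df.iloc[n_test:n_test + n_validate]   # next newest
train_df = sorted_df.iloc[n_test + n_validate:]            # everything older

print(test_df["year"].tolist(), validate_df["year"].tolist(), len(train_df))
```

Splitting by position after sorting keeps every training row older than every validation row, which mirrors how the model would be used on future releases.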

Ok, now let's build and train a model.

In [60]:
from sklearn.linear_model import LinearRegression

In [61]:
model = LinearRegression().fit(X_train, y_train)

In [62]:
model.score(X_validate, y_validate)

0.5757307325837366

### Feature Selection

So we just trained our first model.  Now it's time to perform feature selection.  Let's begin by scaling the training and validation data.

In [63]:
from sklearn.preprocessing import StandardScaler

In [64]:
transformer = StandardScaler()
X_train_scaled = transformer.fit_transform(X_train)

In [66]:
X_train_scaled[:2]

array([[-0.81078714,  0.22687539,  1.68006371,  0.62648705,  0.        ,
         0.        , -0.36174252,  4.22005018, -0.50634481, -0.1842637 ,
        -0.46824826, -0.19959308, -0.17498705, -0.20379142, -0.14119562,
        -0.13828777, -0.20995626],
       [ 1.06532769, -0.08923323,  1.68006371,  0.33308649,  0.        ,
         0.        , -0.36174252, -0.23696401, -0.50634481, -0.1842637 ,
        -0.46824826, -0.19959308, -0.17498705, -0.20379142, -0.14119562,
         7.23129772, -0.20995626]])

In [67]:
X_validate_scaled = transformer.fit_transform(X_validate)
X_validate_scaled[:2]

array([[ 0.64622839,  0.01443836,  1.59658123, -0.48304248,  0.        ,
         0.        , -0.32084447, -0.25264558, -0.39223227, -0.25264558,
        -0.39223227, -0.20412415, -0.11624764, -0.26726124, -0.20412415,
        -0.23735633, -0.25264558],
       [-0.67778457, -0.4728564 ,  1.59658123, -0.48304248,  0.        ,
         0.        , -0.32084447, -0.25264558, -0.39223227, -0.25264558,
         2.54950976, -0.20412415, -0.11624764, -0.26726124, -0.20412415,
        -0.23735633, -0.25264558]])

Next we can train a model, and check that the score is the same.

In [133]:
scaled_model = LinearRegression()
scaled_model.fit(X_train_scaled, y_train)


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [136]:
scaled_model.score(X_validate_scaled, y_validate)

0.46900631798855086

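> It's a significant hit.  The drop comes from calling `fit_transform` on the validation set, which rescales it using its own mean and standard deviation rather than the training set's.  Using `transformer.transform(X_validate)` instead should give the same score as the unscaled model, because a linear regression's predictions do not change when its features are rescaled consistently.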
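
A small sketch on synthetic data illustrates how the choice between `transform` and `fit_transform` on held-out data affects a linear model's score. All values and variable names below are made up; the validation features are deliberately shifted relative to the training features so the effect is visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
true_coefs = np.array([2.0, -1.0, 0.5])

# Synthetic data: validation features are shifted relative to training.
X_train = rng.normal(size=(200, 3))
y_train = X_train @ true_coefs + rng.normal(scale=0.1, size=200)
X_validate = rng.normal(loc=1.0, size=(50, 3))
y_validate = X_validate @ true_coefs + rng.normal(scale=0.1, size=50)

raw_score = LinearRegression().fit(X_train, y_train).score(X_validate, y_validate)

scaler = StandardScaler().fit(X_train)
scaled_model = LinearRegression().fit(scaler.transform(X_train), y_train)

# Reusing the training statistics leaves the score unchanged.
same_scaler_score = scaled_model.score(scaler.transform(X_validate), y_validate)

# Refitting the scaler on the validation set rescales it differently,
# so the model sees features on a different footing and the score changes.
refit_score = scaled_model.score(StandardScaler().fit_transform(X_validate), y_validate)

print(raw_score, same_scaler_score, refit_score)
```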

Let's convert our `X_train_scaled` and `X_validate_scaled` to dataframes.

In [74]:
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns = X.columns)
X_validate_scaled_df = pd.DataFrame(X_validate_scaled, columns = X.columns)

Now sort the columns by the absolute value of their coefficients.

In [75]:
sorted_idcs = np.argsort(np.absolute(model.coef_))[::-1]
sorted_idcs

array([ 7, 15,  6,  8, 12, 10, 13,  9, 14, 16, 11,  1,  3,  2,  0,  5,  4])

In [76]:
sorted_coefs = model.coef_[sorted_idcs]
sorted_coefs

array([ 1.14743905e+08,  4.89120773e+07,  4.45104028e+07,  3.34451892e+07,
        2.65432021e+07, -1.82466420e+07, -1.39596460e+07, -1.34162716e+07,
        6.70959721e+06, -6.58518495e+06,  1.81081552e+06,  1.65956053e+06,
        1.19771870e+06,  9.95636974e+05,  2.73628300e+00, -1.49011612e-08,
        9.31322575e-09])

In [77]:
sorted_cols = X_train_scaled_df.columns[sorted_idcs]
sorted_cols

Index(['genre_Animation', 'genre_Science Fiction', 'genre_Adventure',
       'genre_Comedy', 'genre_Horror', 'genre_Drama', 'genre_NA',
       'genre_Crime', 'genre_Romance', 'genre_Thriller', 'genre_Fantasy',
       'runtime', 'month', 'year', 'budget', 'revenue_is_na', 'budget_is_na'],
      dtype='object')
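
Coefficient magnitudes depend on each feature's units, so a ranking taken from a model fit on unscaled data can differ from one taken after standardizing. The sketch below uses made-up data (the names and numbers are assumptions, not the lab's values) in which a dollar-valued feature drives most of the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 500
budget = rng.uniform(1e6, 2e8, size=n)      # dollars, so very large units
is_animation = rng.integers(0, 2, size=n)   # 0/1 dummy variable

# Made-up relationship: budget drives most of the revenue.
revenue = 2.5 * budget + 1e7 * is_animation + rng.normal(scale=1e6, size=n)

X = np.column_stack([budget, is_animation])

raw_coefs = LinearRegression().fit(X, revenue).coef_
scaled_coefs = LinearRegression().fit(StandardScaler().fit_transform(X), revenue).coef_

# Column indices ordered from largest to smallest absolute coefficient.
raw_order = np.argsort(np.abs(raw_coefs))[::-1]
scaled_order = np.argsort(np.abs(scaled_coefs))[::-1]

print(raw_order, scaled_order)
```

On unscaled data the dummy ranks first only because a one-unit change in a dummy is huge compared with a one-dollar change in budget; after standardizing, the ranking reflects each feature's contribution per standard deviation.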

> Make sure that both X_train_scaled and X_validate_scaled are dataframes.

In [78]:
sorted_X_train_df = X_train_scaled_df[sorted_cols]
sorted_X_validate_df = X_validate_scaled_df[sorted_cols]

In [79]:
sorted_X_train_df.columns[:3], sorted_X_validate_df.columns[:3]

(Index(['genre_Animation', 'genre_Science Fiction', 'genre_Adventure'], dtype='object'),
 Index(['genre_Animation', 'genre_Science Fiction', 'genre_Adventure'], dtype='object'))

Ok, now we can train these models on an increasing number of columns and check the corresponding validation scores.

Let's begin by creating a variable called `training_datasets` that is equal to a list of training datasets, each dataset including one more feature than the last (beginning with `genre_Animation`). 

In [80]:
len(sorted_cols)

17

In [81]:
training_datasets = [sorted_X_train_df.iloc[:, :i] for i in range(1, 18)]

In [82]:
training_datasets[0][:3]
# 	genre_Animation
# 0	4.220050
# 1	-0.236964
# 2	-0.236964

Unnamed: 0,genre_Animation
0,4.22005
1,-0.236964
2,-0.236964


Now we can fit the models.

In [83]:
fitted_models = [LinearRegression().fit(dataset, y_train) for dataset in training_datasets]

And then we evaluate each of the fitted models.

But first we need to create a set of `validation_datasets`, just like our training datasets from earlier.

In [84]:
validation_datasets = [sorted_X_validate_df.iloc[:, :i] for i in range(1, 18)]

In [85]:
scores = [fitted_model.score(validation_dataset, y_validate) for fitted_model, validation_dataset in zip(fitted_models, validation_datasets)]

In [86]:
scores

[-0.06689105131670936,
 -0.044168277532686595,
 -0.01198186562734782,
 -0.007226773607761138,
 -0.005187626138720702,
 0.01375478693516552,
 0.014312630374475723,
 0.026193472901694825,
 0.029739763005278252,
 0.03556946456969612,
 0.03550208758801909,
 0.16226683830651512,
 0.16179049564314152,
 0.1673649784531568,
 0.4784437695592442,
 0.47844376955924395,
 0.47844376955924395]

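Based on the scores, we get a significant bump from the last six columns, mostly from `runtime` and `budget`.  It could be because these are not dummy variables: we ranked the columns using the coefficients of `model`, which was fit on the unscaled data, so a feature measured in dollars like `budget` gets a small coefficient even if it is highly predictive.  Let's try moving these to the front.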

In [87]:
reordered_cols = np.append(sorted_cols[-6:], sorted_cols[:11])
reordered_cols

array(['runtime', 'month', 'year', 'budget', 'revenue_is_na',
       'budget_is_na', 'genre_Animation', 'genre_Science Fiction',
       'genre_Adventure', 'genre_Comedy', 'genre_Horror', 'genre_Drama',
       'genre_NA', 'genre_Crime', 'genre_Romance', 'genre_Thriller',
       'genre_Fantasy'], dtype=object)

In [88]:
training_datasets = [sorted_X_train_df[reordered_cols].iloc[:, :i] for i in range(1, 18)]

In [89]:
validation_datasets = [sorted_X_validate_df[reordered_cols].iloc[:, :i] for i in range(1, 18)]

In [90]:
fitted_models_again = [LinearRegression().fit(dataset, y_train) for dataset in training_datasets]

In [91]:
scores_again = [fitted_model.score(validation_dataset, y_validate) for fitted_model, validation_dataset in zip(fitted_models_again, validation_datasets)]

In [92]:
scores_again[:4]

[0.017804187824331796,
 0.017353667925966265,
 0.02090459108313314,
 0.4814264991512081]

This appears to be a more reasonable progression.  After the top four features, we no longer see a sizable bump in our validation score.  Let's choose those four features from our reordered columns.

In [93]:
selected_cols = reordered_cols[:4]
selected_cols

array(['runtime', 'month', 'year', 'budget'], dtype=object)

And now that we have selected these columns, we can train one last time by combining our training and validation sets, with just these `selected_cols`.

In [94]:
# X_validate[selected_cols]

In [95]:
X_combined = pd.concat([X_train[selected_cols], X_validate[selected_cols]])
X_combined.shape

(1429, 4)

In [96]:
y_combined = pd.concat([y_train, y_validate])
y_combined.shape

(1429,)

Then we'll see how we do on the test set.

In [97]:
model = LinearRegression().fit(X_combined, y_combined)

In [99]:
selected_X_test = X_test[selected_cols]
selected_X_test[:2]

Unnamed: 0,runtime,month,year,budget
72,123.0,8,2016,175000000.0
357,125.0,8,2016,100000000.0


In [209]:
model.score(selected_X_test, y_test)

0.4152086927594558

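So on the test set our model has an R-squared of about 0.42, meaning it explains roughly 42 percent of the variance in revenue that simply predicting the mean would leave unexplained.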
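
The number returned by `score` for a regression model is R-squared, which compares the model's squared error with the squared error of always predicting the mean. A minimal sketch with made-up values:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([4.0, 5.0, 6.0, 9.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of predicting the mean

r_squared = 1 - ss_res / ss_tot
print(r_squared, r2_score(y_true, y_pred))
```

A score of 0 means the model does no better than the mean, and a negative score, like the first few in the feature selection above, means it does worse.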

### Sparse Matrix?

Before finishing up with the genre column, notice that we added 11 new columns just from our genre column.  Is this a problem?

Well, remember that the more features we have, the larger our error due to variance.  So if there are a number of values that don't show up too often, we may wish to simply label them `Other`.

Let's take a look.

In [120]:
genre_col.value_counts(normalize = True)

Action             0.2415
Drama              0.1825
Comedy             0.1795
Adventure          0.1180
Animation          0.0465
NA                 0.0420
Fantasy            0.0400
Crime              0.0380
Thriller           0.0365
Horror             0.0295
Science Fiction    0.0260
Romance            0.0200
Name: genre, dtype: float64

Here, it's not too bad.  We see that the smallest category is Romance, with 2 percent of the data.  If we wanted to get rid of these bottom values, we could do something like the following:

In [132]:
updated_genre = np.where(np.isin(genre_col, ['Romance', 'Science Fiction', 'Horror', 'Thriller', 'Crime']),
                                 'Other', genre_col)

In [133]:
updated_genre[:3]

array(['Action', 'Adventure', 'Action'], dtype=object)

In [134]:
pd.Series(updated_genre).value_counts(normalize = True)

Action       0.2415
Drama        0.1825
Comedy       0.1795
Other        0.1500
Adventure    0.1180
Animation    0.0465
NA           0.0420
Fantasy      0.0400
dtype: float64

And then we can call `pd.get_dummies` on this updated column.  

Eight categories is a reasonable number to try to fit our data to.  We can always reduce the number of variables later if we find a lot of variance in our model.
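
Rather than listing the rare genres by hand, the grouping could also be driven by a frequency cutoff. This is a sketch on a made-up column, and the `min_share` threshold is a hypothetical choice rather than a value from this lab.

```python
import pandas as pd

# Made-up genre column: two common labels and two rare ones.
toy_genres = pd.Series(
    ["Action"] * 5 + ["Drama"] * 3 + ["Romance"] + ["Horror"], name="genre"
)

min_share = 0.15  # hypothetical cutoff for what counts as "rare"

shares = toy_genres.value_counts(normalize=True)
rare_labels = shares[shares < min_share].index

# Keep common labels as they are and replace rare ones with "Other".
grouped = toy_genres.where(~toy_genres.isin(rare_labels), "Other")

grouped_dummies = pd.get_dummies(grouped, prefix="genre", drop_first=True)
print(grouped.value_counts())
print(grouped_dummies.columns.tolist())
```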