# Categorical Data Lab

### Introduction

### Loading our Data

In [2]:
import pandas as pd 
url = "https://raw.githubusercontent.com/jigsawlabs-student/feature-engineering/master/8-categorical-variables-lab/imdb_movies.csv"
movies_df = pd.read_csv(url)

In [3]:
movies_df[:3]

Unnamed: 0,title,genre,budget,runtime,year,month,revenue
0,Avatar,Action,237000000,162.0,2009,12,2787965087
1,Pirates of the Caribbean: At World's End,Adventure,300000000,169.0,2007,5,961000000
2,Spectre,Action,245000000,148.0,2015,10,880674609


### 1. Is the data in the correct format?

Let's start by looking at an overview of the data with the `info` method.

In [4]:


# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 2000 entries, 0 to 1999
# Data columns (total 7 columns):
# title      2000 non-null object
# genre      1916 non-null object
# budget     2000 non-null int64
# runtime    2000 non-null float64
# year       2000 non-null int64
# month      2000 non-null int64
# revenue    2000 non-null int64
# dtypes: float64(1), int64(4), object(2)
# memory usage: 109.5+ KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 7 columns):
title      2000 non-null object
genre      1916 non-null object
budget     2000 non-null int64
runtime    2000 non-null float64
year       2000 non-null int64
month      2000 non-null int64
revenue    2000 non-null int64
dtypes: float64(1), int64(4), object(2)
memory usage: 109.5+ KB


Ok, so it seems like almost all of our data is in the correct format.  The only thing to eventually coerce is the `genre` column which we'll eventually make categorical.

### 2. Do we have null values?

The only column that has a null values is the genre column.  Let's take a look at some of the movies that have null genres to see if there are any patterns.

In [5]:
movies_df[movies_df['genre'].isna()]['title'].tolist()[:20]

['Alice in Wonderland',
 'The Jungle Book',
 'Wreck-It Ralph',
 'Pearl Harbor',
 'Alexander',
 'Madagascar: Escape 2 Africa',
 'Bee Movie',
 'The Revenant',
 'Penguins of Madagascar',
 'How the Grinch Stole Christmas',
 'Stuart Little 2',
 'Public Enemies',
 'Where the Wild Things Are',
 'Green Zone',
 'The Alamo',
 'Enemy at the Gates',
 'Scooby-Doo',
 'Eagle Eye',
 'Fury',
 'Spirit: Stallion of the Cimarron']

It looks like there are a number of kids movies, and some action movies.  Later on, we could fill in these values, but for now, replace our na with the string `NA`.

> Notice that this is different than how we have handled non-categorical values, we'll explain what we're up to here, later.

In [6]:
import numpy as np
# do so here


In [7]:
(movies_df['genre'] == 'NA').sum()
# 84

84

### 3. Do we have missing values we can replace?

Ok, next let's see if some of the other values have some hidden missing values.  Every other column (except title) is numeric, so the easiest check is to plot a histogram of the remaining features: `'budget', 'runtime', 'year', 'month', 'revenue'`.

In [1]:
# plot histograms of those columns here

<img src="./hist-features.png" width="60%">

Ok, while our data certainly does not look like a random sample (take a look at the year column), there also does not appear to be anomalies indicating a missing value.  Let's double check this, by describing these five columns to look for any outliers via the max or min.

In [20]:


# 	budget	runtime	year	month	revenue
# count	2.000000e+03	2000.000000	2000.00000	2000.000000	2.000000e+03
# mean	6.043686e+07	113.267500	2004.42650	7.044000	1.643506e+08
# std	4.692045e+07	21.073728	7.95319	3.398947	2.186816e+08
# min	0.000000e+00	0.000000	1940.00000	1.000000	0.000000e+00
# 25%	3.000000e+07	98.000000	2000.00000	4.000000	3.103408e+07
# 50%	4.800000e+07	110.000000	2005.00000	7.000000	9.400000e+07
# 75%	7.500000e+07	125.000000	2011.00000	10.000000	2.026695e+08
# max	3.800000e+08	254.000000	2016.00000	12.000000	2.787965e+09

Unnamed: 0,budget,runtime,year,month,revenue
count,2000.0,2000.0,2000.0,2000.0,2000.0
mean,60436860.0,113.2675,2004.4265,7.044,164350600.0
std,46920450.0,21.073728,7.95319,3.398947,218681600.0
min,0.0,0.0,1940.0,1.0,0.0
25%,30000000.0,98.0,2000.0,4.0,31034080.0
50%,48000000.0,110.0,2005.0,7.0,94000000.0
75%,75000000.0,125.0,2011.0,10.0,202669500.0
max,380000000.0,254.0,2016.0,12.0,2787965000.0


Hmm, it looks like we missed something.  The budget, runtime, and revenue all have zero values.  Use the `value_counts` method and look for bottom 10 values in the budget column.   

In [21]:
# budget bottom ten


# 0          120
# 25           1
# 28           1
# 30           1
# 110          1
# 500000       1
# 1800000      1
# 2000000      1
# 2600000      1
# 2800000      1
# Name: budget, dtype: int64

0          120
25           1
28           1
30           1
110          1
500000       1
1800000      1
2000000      1
2600000      1
2800000      1
Name: budget, dtype: int64

To avoid recopying the same code a lot, let's turn this into a method, that takes the name of the column as an argument.

In [22]:
def bottom_ten(column_name, df):
    pass

In [23]:
bottom_ten('runtime', movies_df)

# 0.0     1
# 63.0    1
# 72.0    1
# 74.0    1
# 75.0    2
# 76.0    2
# 77.0    1
# 78.0    5
# 79.0    2
# 80.0    6
# Name: runtime, dtype: int64

0.0     1
63.0    1
72.0    1
74.0    1
75.0    2
76.0    2
77.0    1
78.0    5
79.0    2
80.0    6
Name: runtime, dtype: int64

In [24]:
bottom_ten('revenue', movies_df)

# 0         211
# 12          1
# 13          1
# 14          1
# 103         1
# 6399        1
# 73706       1
# 108348      1
# 480314      1
# 528972      1
# Name: revenue, dtype: int64

0         211
12          1
13          1
14          1
103         1
6399        1
73706       1
108348      1
480314      1
528972      1
Name: revenue, dtype: int64

Ok, so for each column, let's replace the values that we simply do not believe with `np.nan`.  

* For budget, it's anything less than 500000.  

In [27]:
budget_nan_series = None

In [29]:
budget_nan_series.isna().sum()
# 124

124

* For revenue its the less than 100000 values.

In [33]:
revenue_nan = None


In [34]:
revenue_nan.isna().sum()
# 217

217

* For runtime, there is only one value that is suspicious, so we don't need to eventually add an `isna` column just for one value.  Instead, we can just replace our zero value with the mean of the budget.

In [35]:
runtime_mean = None
runtime_mean
# 113.2675

113.2675

In [37]:
runtime_replace_zero = None

In [38]:
(runtime_replace_zero == runtime_mean).sum()
# 1

1

Ok, now let's replace our budget, runtime, and revenue columns with our cleaned up series: `runtime_replace_zero`, `budget_nan_series`, `revenue_nan`.

In [39]:
updated_df = None

In [40]:
updated_df.isna().sum()

# title        0
# genre        0
# budget     124
# runtime      0
# year         0
# month        0
# revenue    217
# dtype: int64

title        0
genre        0
budget     124
runtime      0
year         0
month        0
revenue    217
dtype: int64

Ok, so now we do have our `isna` values properly identified.

Now for the budget, and revenue columns, we need corresponding isna columns.  Begin by selecting the budget and revenue columns and then creating a dataframe of `isna` values.

In [42]:
is_na_df = movies_df[['budget', 'revenue']].isna()
is_na_df[:3]
# # 	budget	revenue
# 0	False	False
# 1	False	False
# 2	False	False

Unnamed: 0,budget,revenue
0,False,False
1,False,False
2,False,False


Notice that this is precisely the data we want to add to our dataframe.  Let's rename the columns `budget_is_na` and `revenue_is_na`.

In [43]:
cols = [f'{col}_is_na' for col in is_na_df.columns]

In [44]:
is_na_df.columns = cols

In [45]:
is_na_df[:3]
# 	budget_is_na	revenue_is_na
# 0	False	False
# 1	False	False
# 2	False	False

Unnamed: 0,budget_is_na,revenue_is_na
0,False,False
1,False,False
2,False,False


Then add them to our dataframe.

In [46]:
df_with_is_na = None

Next replace the missing values in `revenue` and `budget` with the mean.

In [47]:
mean_revenue = None

# 184352857.52439708

184352857.52439708

In [48]:
mean_budget = None

# 64431616.28518124

64431616.28518124

Replace the na values accordingly.

In [49]:
replaced_df = None

In [50]:
replaced_df.isna().sum()

# budget     0
# revenue    0
# dtype: int64

budget     0
revenue    0
dtype: int64

In [51]:
(replaced_df['budget'] == 64431616.28518124).sum()
# 124

124

In [52]:
(replaced_df['revenue'] == 184352857.52439708).sum()
# 217

217

Ok, so we can now update the `df_with_is_na` columns with our `replaced_df`.

In [56]:
df_with_is_na.isna().sum()

# title            0
# genre            0
# budget           0
# runtime          0
# year             0
# month            0
# revenue          0
# budget_is_na     0
# revenue_is_na    0
# dtype: int64

title            0
genre            0
budget           0
runtime          0
year             0
month            0
revenue          0
budget_is_na     0
revenue_is_na    0
dtype: int64

### Adding Categorical Values

In [57]:
df_with_is_na[df_with_is_na['year'] > 1999]['genre'].value_counts(normalize = True)

Action             0.235448
Comedy             0.194245
Drama              0.169392
Adventure          0.114454
Animation          0.052322
NA                 0.043819
Thriller           0.043165
Fantasy            0.039241
Crime              0.036625
Horror             0.026815
Science Fiction    0.023545
Romance            0.020929
Name: genre, dtype: float64

Ok, now let's begin working with our `genre` column.  Currently it's of type object.

In [58]:
genre_col = df_with_is_na['genre']

First call `pd.get_dummies` on the genre column.

In [59]:
dummied_genre = None

In [60]:
dummied_genre[:3]
# 	Action	Adventure	Animation	Comedy	Crime	Drama	Fantasy	Horror	NA	Romance	Science Fiction	Thriller
# 0	1	0	0	0	0	0	0	0	0	0	0	0
# 1	0	1	0	0	0	0	0	0	0	0	0	0
# 2	1	0	0	0	0	0	0	0	0	

Unnamed: 0,Action,Adventure,Animation,Comedy,Crime,Drama,Fantasy,Horror,NA,Romance,Science Fiction,Thriller
0,1,0,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,0,0


Ok, now let's drop the first column, `Action`.  This will mean that every column is in relative to the value for an Action movie.

In [61]:
dummied_genre_no_action = None

In [62]:
dummied_genre_no_action[:3]

# 	Adventure	Animation	Comedy	Crime	Drama	Fantasy	Horror	NA	Romance	Science Fiction	Thriller
# 0	0	0	0	0	0	0	0	0	0	0	0
# 1	1	0	0	0	0	0	0	0	0	0	0
# 2	0	0	0	0	0	0	0	0	0	0	0

Unnamed: 0,Adventure,Animation,Comedy,Crime,Drama,Fantasy,Horror,NA,Romance,Science Fiction,Thriller
0,0,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0


And lets use the `prefix` keyword to precede each column with the word `genre`.

In [63]:
dummied_genre_no_action_prefix  = None

In [64]:
dummied_genre_no_action_prefix[:3]


# genre_Adventure	genre_Animation	genre_Comedy	genre_Crime	genre_Drama	genre_Fantasy	genre_Horror	genre_NA	genre_Romance	genre_Science Fiction	genre_Thriller
# 0	0	0	0	0	0	0	0	0	0	0	0
# 1	1	0	0	0	0	0	0	0	0	0	0
# 2	0	0	0	0	0	0	

Unnamed: 0,genre_Adventure,genre_Animation,genre_Comedy,genre_Crime,genre_Drama,genre_Fantasy,genre_Horror,genre_NA,genre_Romance,genre_Science Fiction,genre_Thriller
0,0,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0


Ok, now let's merge this back with our original data.

In [65]:
movies_df_with_dummies = None

Remove the `title` and `genre` columns.  We don't need them.

In [66]:
movies_df_with_dummies_select = movies_df_with_dummies.iloc[:, 2:]

movies_df_with_dummies_select.columns

# Index(['budget', 'runtime', 'year', 'month', 'revenue', 'budget_is_na',
#        'revenue_is_na', 'genre_Adventure', 'genre_Animation', 'genre_Comedy',
#        'genre_Crime', 'genre_Drama', 'genre_Fantasy', 'genre_Horror',
#        'genre_NA', 'genre_Romance', 'genre_Science Fiction', 'genre_Thriller'],
#       dtype='object')

Index(['budget', 'runtime', 'year', 'month', 'revenue', 'budget_is_na',
       'revenue_is_na', 'genre_Adventure', 'genre_Animation', 'genre_Comedy',
       'genre_Crime', 'genre_Drama', 'genre_Fantasy', 'genre_Horror',
       'genre_NA', 'genre_Romance', 'genre_Science Fiction', 'genre_Thriller'],
      dtype='object')

### Splitting our data

We'll first narrow our dataset to movies after 1999.

In [3]:
post_2000_movies_df = movies_df_with_dummies_select[movies_df_with_dummies_select['year'] > 1999]

NameError: name 'movies_df_with_dummies_select' is not defined

Now let's separate our data into a training set, a test set, and a validation set.  Let's begin with the validation set.  Our validation set should be the most recent data.  So sort data by the year, then the month, and select the most recent 20 percent.

In [70]:
sorted_movies_post_2000 = post_2000_movies_df.sort_values(['year', 'month'], ascending = False)

Let's take a look at when this set of movies is from.  Use value_counts to print out a count of movies per year in our dataset, sorted by year.

In [71]:
# write code here 


# 2016     45
# 2015     84
# 2014     87
# 2013     94
# 2012     91
# 2011    105
# 2010     95
# 2009    100
# 2008     98
# 2007     77
# 2006     95
# 2005    105
# 2004    104
# 2003     82
# 2002     91
# 2001     92
# 2000     84

2016     45
2015     84
2014     87
2013     94
2012     91
2011    105
2010     95
2009    100
2008     98
2007     77
2006     95
2005    105
2004    104
2003     82
2002     91
2001     92
2000     84
Name: year, dtype: int64

Ok, so by choosing a validation set this large we'll eat up five years of data.  Let's just choose the most recent hundred values to be in our dataframe.  Begin by selecting all of the columns except for revenue and assign them to `X`, and assign the revenue column to `y`.

In [72]:
X = None
y = None

Then assign the most recent X and y values to the test set.

In [197]:
X_test = None
X_test.shape

y_test = None

In [198]:
X_test['year'][:2]
# 72     2016
# 357    2016
# Name: year, dtype: int64

72     2016
357    2016
Name: year, dtype: int64

Now let's select 150 movies for our validation set.

In [199]:
X_validate = None
y_validate = None
X_validate.shape

# (150, 17)

(150, 17)

In [200]:
X_validate['year'][:2]
# 244     2015
# 1348    2015
# Name: year, dtype: int64

244     2015
1348    2015
Name: year, dtype: int64

And the rest of our data can be in our training set.

In [201]:
X_train = None
y_train = None
X_train.shape

# (1279, 17)

(1279, 17)

Ok, now let's build and train a model.

> Check the score on the validation set.

In [97]:
from sklearn.linear_model import LinearRegression
model = None

# 0.5757307325837366

### Feature Selection

So we just trained our first model, now it's time to perform feature selection.  Let's begin by scaling the training and validation data.

In [88]:
from sklearn.preprocessing import StandardScaler

In [146]:
transformer = StandardScaler()
X_train_scaled = transformer.fit_transform(X_train)

In [147]:
X_validate_scaled = transformer.fit_transform(X_validate)

Next we can train a model, and check that the score is the same.

In [133]:
scaled_model = LinearRegression()
scaled_model.fit(X_train_scaled, y_train)


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [136]:
scaled_model.score(X_validate_scaled, y_validate)
# 0.46900631798855086

0.46900631798855086

> It's a significant hit, but we didn't do anything wrong.

Let's convert our `X_train_scaled` and `X_validate_scaled` to dataframes.

In [150]:
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns = X.columns)
X_validate_scaled_df = pd.DataFrame(X_validate_scaled, columns = X.columns)

In [None]:
X_train_scaled_df[:2]
# array([[-0.81078714,  0.22687539,  1.68006371,  0.62648705,  0.        ,
#          0.        , -0.36174252,  4.22005018, -0.50634481, -0.1842637 ,
#         -0.46824826, -0.19959308, -0.17498705, -0.20379142, -0.14119562,
#         -0.13828777, -0.20995626],
#        [ 1.06532769, -0.08923323,  1.68006371,  0.33308649,  0.        ,
#          0.        , -0.36174252, -0.23696401, -0.50634481, -0.1842637 ,
#         -0.46824826, -0.19959308, -0.17498705, -0.20379142, -0.14119562,
#          7.23129772, -0.20995626]])

In [None]:
X_validate_scaled_df[:2]
# array([[ 0.64622839,  0.01443836,  1.59658123, -0.48304248,  0.        ,
#          0.        , -0.32084447, -0.25264558, -0.39223227, -0.25264558,
#         -0.39223227, -0.20412415, -0.11624764, -0.26726124, -0.20412415,
#         -0.23735633, -0.25264558],
#        [-0.67778457, -0.4728564 ,  1.59658123, -0.48304248,  0.        ,
#          0.        , -0.32084447, -0.25264558, -0.39223227, -0.25264558,
#          2.54950976, -0.20412415, -0.11624764, -0.26726124, -0.20412415,
#         -0.23735633, -0.25264558]])

Now sort the columns by the absolute value of the size of the coefficients.

In [143]:
sorted_idcs = None
# array([ 7, 15,  6,  8, 12, 10, 13,  9, 14, 16, 11,  1,  3,  2,  0,  5,  4])

In [144]:
sorted_coefs = None
sorted_coefs
# array([ 1.14743905e+08,  4.89120773e+07,  4.45104028e+07,  3.34451892e+07,
#         2.65432021e+07, -1.82466420e+07, -1.39596460e+07, -1.34162716e+07,
#         6.70959721e+06, -6.58518495e+06,  1.81081552e+06,  1.65956053e+06,
#         1.19771870e+06,  9.95636974e+05,  2.73628300e+00, -1.49011612e-08,
#         9.31322575e-09])

array([ 1.14743905e+08,  4.89120773e+07,  4.45104028e+07,  3.34451892e+07,
        2.65432021e+07, -1.82466420e+07, -1.39596460e+07, -1.34162716e+07,
        6.70959721e+06, -6.58518495e+06,  1.81081552e+06,  1.65956053e+06,
        1.19771870e+06,  9.95636974e+05,  2.73628300e+00, -1.49011612e-08,
        9.31322575e-09])

In [151]:
sorted_cols = None
sorted_cols

# Index(['genre_Animation', 'genre_Science Fiction', 'genre_Adventure',
#        'genre_Comedy', 'genre_Horror', 'genre_Drama', 'genre_NA',
#        'genre_Crime', 'genre_Romance', 'genre_Thriller', 'genre_Fantasy',
#        'runtime', 'month', 'year', 'budget', 'revenue_is_na', 'budget_is_na'],
#       dtype='object')

Index(['genre_Animation', 'genre_Science Fiction', 'genre_Adventure',
       'genre_Comedy', 'genre_Horror', 'genre_Drama', 'genre_NA',
       'genre_Crime', 'genre_Romance', 'genre_Thriller', 'genre_Fantasy',
       'runtime', 'month', 'year', 'budget', 'revenue_is_na', 'budget_is_na'],
      dtype='object')

> Make sure that both X_train_scaled and X_validate_scaled are dataframes, then sort the column accordingly.

In [152]:
sorted_X_train_df = None
sorted_X_validate_df = None

In [153]:
sorted_X_train_df.columns[:3], sorted_X_validate_df.columns[:3]

(Index(['genre_Animation', 'genre_Science Fiction', 'genre_Adventure'], dtype='object'),
 Index(['genre_Animation', 'genre_Science Fiction', 'genre_Adventure'], dtype='object'))

Ok, now we can train these models on an increasing number of columns and check the corresponding validation scores.

Let's begin by creating a variable called `training_datasets` that is equal to a list of training datasets, each dataset including one more feature than the last (beginning with `genre_Animation`). 

In [261]:
len(sorted_cols)

17

In [154]:
training_datasets = None

In [155]:
training_datasets[0][:3]
# 	genre_Animation
# 0	4.220050
# 1	-0.236964
# 2	-0.236964

Unnamed: 0,genre_Animation
0,4.22005
1,-0.236964
2,-0.236964


Now we can fit the models.

In [156]:
fitted_models = None

And then we evaluate each of the fitted models.

But first we need to create a set of `validation_datasets`, just like our training datasets from earlier.

In [158]:
validation_datasets = None

In [159]:
scores = None

In [160]:
scores

[-0.06689105131670936,
 -0.044168277532686595,
 -0.01198186562734782,
 -0.007226773607761138,
 -0.005187626138720702,
 0.01375478693516552,
 0.014312630374475723,
 0.026193472901694825,
 0.029739763005278252,
 0.03556946456969612,
 0.03550208758801909,
 0.16226683830651512,
 0.16179049564314152,
 0.1673649784531568,
 0.4784437695592442,
 0.47844376955924395,
 0.47844376955924395]

Based on the scores it looks like we get a significant bump from the last six columns.  It could be because these are not dummy variables.  Let's try moving these to the front.

In [174]:
reordered_cols = np.append(sorted_cols[-6:], sorted_cols[:11])
reordered_cols

array(['runtime', 'month', 'year', 'budget', 'revenue_is_na',
       'budget_is_na', 'genre_Animation', 'genre_Science Fiction',
       'genre_Adventure', 'genre_Comedy', 'genre_Horror', 'genre_Drama',
       'genre_NA', 'genre_Crime', 'genre_Romance', 'genre_Thriller',
       'genre_Fantasy'], dtype=object)

In [176]:
training_datasets = [sorted_X_train_df[reordered_cols].iloc[:, :i] for i in range(1, 18)]

In [177]:
validation_datasets = [sorted_X_validate_df[reordered_cols].iloc[:, :i] for i in range(1, 18)]

In [178]:
fitted_models_again = [LinearRegression().fit(dataset, y_train) for dataset in training_datasets]

In [181]:
scores_again = [fitted_model.score(validation_dataset, y_validate) for fitted_model, validation_dataset in zip(fitted_models_again, validation_datasets)]

In [184]:
scores_again[:4]

[0.017804187824331796,
 0.017353667925966265,
 0.02090459108313314,
 0.4814264991512081]

This appears to be a more reasonable progression.  After the top four features, we no longer see a sizable bump in our validation score.  Let's choose those four features from our `reordered_cols`.

In [186]:
selected_cols = None
selected_cols

# array(['runtime', 'month', 'year', 'budget'], dtype=object)

array(['runtime', 'month', 'year', 'budget'], dtype=object)

And now that we have selected these columns, we can train one last time by combining our training and validation sets, with just these `selected_cols`.

In [205]:
X_combined = None
X_combined.shape

# (1429, 4)

(1429, 4)

In [206]:
y_combined = None
y_combined.shape

(1429,)

Then we'll see how we do on the validation set.

In [207]:
model = None

In [208]:
selected_X_test = None
selected_X_test[:2]
# # 	runtime	month	year	budget
# 72	123.0	8	2016	175000000.0
# 357	125.0	8	2016	100000000.0

> Then get the score on the test set.

In [209]:
# check the score on the test set 

# 0.4152086927594558

0.4152086927594558

So our model performs 40 percent better than the mean on the validation set.

### Sparse Matrix?

Before finishing up with the genre column.  Notice that added 11 new columns just from our genre column.  Is this a problem?  

Well remember that the more features we have, the larger our error due to variance.  So there are a number of values that don't show up too often, we may wish to simply label them other.

Let's take a look.

In [120]:
genre_col.value_counts(normalize = True)

Action             0.2415
Drama              0.1825
Comedy             0.1795
Adventure          0.1180
Animation          0.0465
NA                 0.0420
Fantasy            0.0400
Crime              0.0380
Thriller           0.0365
Horror             0.0295
Science Fiction    0.0260
Romance            0.0200
Name: genre, dtype: float64

Here, it's not too bad.  We see that the smallest number is romance with 2 percent of the data.  If we wanted to get rid of these bottom values, we can do something like the following:

In [132]:
updated_genre = np.where(np.isin(genre_col, ['Romance', 'Science Fiction', 'Horror', 'Thriller', 'Crime']),
                                 'Other', genre_col)

In [133]:
updated_genre[:3]

array(['Action', 'Adventure', 'Action'], dtype=object)

In [134]:
pd.Series(updated_genre).value_counts(normalize = True)

Action       0.2415
Drama        0.1825
Comedy       0.1795
Other        0.1500
Adventure    0.1180
Animation    0.0465
NA           0.0420
Fantasy      0.0400
dtype: float64

And then we can call `pd.get_dummies` on this updated column.  

That's a reasonable number to try to fit our data to.  We can always reduce the number of variables later if we find a lot of variance in our model.