# Categorical Data Lab

### Introduction

In this lesson, we'll work through our machine learning process with categorical data.

### Loading our Data

Let's begin by loading up our data.

In [694]:
import pandas as pd 
url = "https://raw.githubusercontent.com/jigsawlabs-student/feature-engineering/master/8-categorical-variables-lab/imdb_movies.csv"
movies_df = pd.read_csv(url)

Call `info` on `movies_df` to get a sense of our datatypes and null values.

In [695]:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 7 columns):
title      2000 non-null object
genre      1916 non-null object
budget     2000 non-null int64
runtime    2000 non-null float64
year       2000 non-null int64
month      2000 non-null int64
revenue    2000 non-null int64
dtypes: float64(1), int64(4), object(2)
memory usage: 109.5+ KB


We can see that our movies include a time component, so let's sort our movies by `year` and then `month` right off the bat.

In [696]:
sorted_movies_df = None
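One possible way to do this sort is with `sort_values`, passing both columns so that ties on `year` are broken by `month`. The sketch below uses a tiny made-up dataframe (`toy_df` and its values are hypothetical, not the lab's data).

```python
import pandas as pd

# Hypothetical toy data standing in for movies_df
toy_df = pd.DataFrame({
    "title": ["c", "a", "b"],
    "year": [2001, 1999, 2001],
    "month": [3, 7, 1],
})

# Sort by year first, then by month within each year
sorted_toy_df = toy_df.sort_values(["year", "month"])
print(sorted_toy_df["title"].tolist())
```

Note that the original index labels travel with the rows, which is why the sorted lab dataframe starts at index 1108 rather than 0.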

In [697]:
sorted_movies_df[:2]

| | title | genre | budget | runtime | year | month | revenue |
|---|---|---|---|---|---|---|---|
| 1108 | Pinocchio | Animation | 2600000 | 88.0 | 1940 | 2 | 84300000 |
| 862 | Lolita | Drama | 2000000 | 153.0 | 1962 | 6 | 9250000 |


### Missing Data Analysis

Remember that this includes:

1. **Plotting histograms** of features to look for:
    * Values that do not fit our domain
    * Values that occur more often than we would expect (and could be placeholders that replaced missing values)

2. **Examining common placeholders** for missing values like:
    * an empty string
    * not available
    * 0, 999, -999


Let's get an overview of our data by plotting a histogram of each of our features.

<img src="./hist-features.png" width="60%">

Ok, each of the distributions above seems reasonable, although we can see from the `year` histogram that we do not have a simple random sample of movies.

Let's use `describe` to see if there are any suspicious values.

In [699]:


| | budget | runtime | year | month | revenue |
|---|---|---|---|---|---|
| count | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 |
| mean | 60436860.0 | 113.2675 | 2004.4265 | 7.044 | 164350600.0 |
| std | 46920450.0 | 21.073728 | 7.95319 | 3.398947 | 218681600.0 |
| min | 0.0 | 0.0 | 1940.0 | 1.0 | 0.0 |
| 25% | 30000000.0 | 98.0 | 2000.0 | 4.0 | 31034080.0 |
| 50% | 48000000.0 | 110.0 | 2005.0 | 7.0 | 94000000.0 |
| 75% | 75000000.0 | 125.0 | 2011.0 | 10.0 | 202669500.0 |
| max | 380000000.0 | 254.0 | 2016.0 | 12.0 | 2787965000.0 |


Here, looking at the min and max values, we see minimum values of 0 in `budget`, `runtime`, and `revenue` that don't make sense.
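A quick way to follow up on a suspicious minimum is to count how often a placeholder such as 0 appears in each numeric column. This is a minimal sketch on hypothetical data (`toy_df` is made up for illustration).

```python
import pandas as pd

# Hypothetical toy data with a few zero placeholders
toy_df = pd.DataFrame({
    "budget": [0, 500000, 1800000, 0],
    "runtime": [88.0, 0.0, 153.0, 110.0],
    "revenue": [0, 12, 9250000, 84300000],
})

# Comparing the whole frame to 0 gives a boolean frame;
# summing it counts the zeros in each column
zero_counts = (toy_df == 0).sum()
print(zero_counts)
```

The same comparison works for other common placeholders, such as 999 or -999.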

1. Revenue

In this case there are a number of revenue values that are equal to 0 or less than 100,000.  Either way, these numbers do not seem very believable.

> Use `value_counts` to display the bottom ten revenues below.

In [700]:


0         211
12          1
13          1
14          1
103         1
6399        1
73706       1
108348      1
480314      1
528972      1
Name: revenue, dtype: int64

Because revenue is our target, let's drop the rows with revenue values less than 100,000 rather than imputing them.

In [2]:
sorted_movies_df = None
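As a rough sketch of this step, a boolean filter keeps only the rows at or above the threshold. The dataframe below is hypothetical, and the threshold is written as `100_000` to match the text above.

```python
import pandas as pd

# Hypothetical toy data; a few revenues fall below the threshold
toy_df = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e", "f"],
    "revenue": [0, 0, 12, 73706, 108348, 9250000],
})

# Lowest revenue values and how often each occurs
lowest = toy_df["revenue"].value_counts().sort_index().iloc[:3]

# Keep only rows whose revenue is at least 100,000
kept_df = toy_df[toy_df["revenue"] >= 100_000]
print(kept_df.shape)
```

Filtering instead of imputing avoids training on target values that we invented ourselves.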

In [4]:
sorted_movies_df.shape
# (1783, 7)

2. Budget

Another column with suspicious values is `budget`.  Take a look at the lowest five values there.

In [703]:
# look at lowest 5 values

0          29
500000      1
1800000     1
2000000     1
2600000     1
Name: budget, dtype: int64

We see that there are 29 values that are 0.  Let's assume that our 0 values are really missing.

Let's apply our technique of replacing our missing values with the mean, and adding a column to indicate missingness.

In [704]:
budget = sorted_movies_df['budget']

First we'll assign a boolean series that indicates whether a budget is `nan` or 0.

In [705]:
budget_is_na = None
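One way to build such an indicator is to combine an equality check with `isna`. The series below is hypothetical and only mimics the shape of the `budget` column.

```python
import pandas as pd

# Hypothetical budgets, with 0 standing in for a missing value
toy_budget = pd.Series([2600000, 0, 2000000], index=[1108, 7, 862], name="budget")

# True where the budget is either 0 or nan
toy_is_na = (toy_budget == 0) | toy_budget.isna()
print(toy_is_na.tolist())
```

Because the result shares the index of the original series, it can be attached to the dataframe with `assign` without any realignment.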

In [706]:
budget_is_na[:3]

1108    False
862     False
1125    False
Name: budget, dtype: bool

Add our `budget_is_na` column to the dataframe.

In [707]:
sorted_movies_df = sorted_movies_df.assign(budget_is_na = budget_is_na)

Next replace the zero values in the budget with the mean.

> First calculate the mean budget (not including the zero values).

In [708]:
mean_budget = None
mean_budget

65950234.977765106

And replace the zero values with the mean.

In [5]:
# do so here
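The two steps above (a mean that ignores the zeros, then replacement) can be sketched on a small hypothetical series. Using `Series.mask` here is just one option; assigning through `.loc` would work as well.

```python
import pandas as pd

# Hypothetical budgets stored as floats, with 0 marking a missing value
toy_budget = pd.Series([100.0, 0.0, 300.0, 0.0], name="budget")

is_zero = toy_budget == 0

# Mean of the non-zero budgets only
toy_mean = toy_budget[~is_zero].mean()

# Replace each zero with that mean, leaving other values untouched
toy_filled = toy_budget.mask(is_zero, toy_mean)
print(toy_filled.tolist())
```

Excluding the zeros matters: including them would pull the mean toward 0 and understate the typical budget.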

In [710]:
(sorted_movies_df['budget'] == 65950234.977765106).sum()

29

3. Runtime

Let's apply the same procedure for runtime.  We'll begin by identifying suspicious values.

In [7]:
sorted_movies_df['runtime'].value_counts().sort_index().iloc[:10]

Here there is only one value equal to 0, so let's just let it be.

### Missing Value Coercion

After handling missing values in three of our columns, let's now move onto our `genre` column.

In [719]:
sorted_movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1783 entries, 1108 to 357
Data columns (total 8 columns):
title           1783 non-null object
genre           1707 non-null object
budget          1783 non-null float64
runtime         1783 non-null float64
year            1783 non-null int64
month           1783 non-null int64
revenue         1783 non-null int64
budget_is_na    1783 non-null bool
dtypes: bool(1), float64(2), int64(3), object(2)
memory usage: 113.2+ KB


So we see that `genre` is the only column with `nan` values.  Note that if we replace the `np.nan` values with a string like `na`, we will get the missing value column for free when we call `get_dummies`.

Let's walk through this.

In [545]:
genre_col = sorted_movies_df['genre']

Begin by replacing the `np.nan` values in `genre_col` with the string `na`.

In [546]:
import numpy as np
replaced_genre = None
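A minimal sketch of this replacement, assuming `fillna` is an acceptable way to do it (the series here is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical genre column with two missing entries
toy_genre = pd.Series(["Drama", np.nan, "Action", np.nan], name="genre")

# Swap every nan for the literal string "na"
toy_replaced = toy_genre.fillna("na")
toy_counts = toy_replaced.value_counts()
print(toy_replaced.tolist())
```

After the replacement, `na` shows up in `value_counts` like any other category, whereas `nan` is excluded from the counts by default.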

In [547]:
# normalize=True returns each genre's share of the rows rather than a raw count
replaced_genre.value_counts(normalize = True)

Action             0.246214
Drama              0.177790
Comedy             0.167134
Adventure          0.123948
Animation          0.048233
na                 0.042625
Fantasy            0.039260
Crime              0.038699
Thriller           0.037016
Horror             0.031408
Science Fiction    0.028603
Romance            0.019069
Name: genre, dtype: float64

Let's see how many different values we now have in the genre column:

In [720]:
replaced_genre.value_counts().shape


(12,)

That gives us 12 distinct values (11 genres plus `na`), each of which makes up more than 1 percent of the data.

Now let's apply one hot encoding to our `replaced_genre` column, dropping the first value.  Assign the result to `genre_df`.

In [584]:
genre_df = None
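For illustration, here is how one hot encoding with a dropped first category might look using `pd.get_dummies` and its `drop_first` parameter. The input series is hypothetical.

```python
import pandas as pd

# Hypothetical genre values, with "na" already substituted for missing entries
toy_replaced = pd.Series(["Drama", "na", "Action", "Comedy"], name="genre")

# One column per category, minus the first category in sorted order
toy_genre_df = pd.get_dummies(toy_replaced, drop_first=True)
print(toy_genre_df.columns.tolist())
```

Categories are ordered before the first one is dropped, so `Action` is the column that disappears here, and `na` remains as its own indicator column.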

In [585]:
genre_df[:2]

# 	Adventure	Animation	Comedy	Crime	Drama	Fantasy	Horror	Romance	Science Fiction	Thriller	na
# 1108	0	1	0	0	0	0	0	0	0	0	0
# 862	0	0	0	0	1	0	0	0	0	0	0

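| | Action | Adventure | Animation | Comedy | Crime | Drama | Fantasy | Horror | Romance | Science Fiction | Thriller | na |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1108 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 862 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |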


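> When the first value is dropped, the dropped value is `Action`, the first genre alphabetically. The recorded outputs in this lab still show an `Action` column, so they appear to come from a run that kept all twelve genre columns.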

Let's take another look at our `sorted_movies_df`.

In [586]:
sorted_movies_df[:2]


| | title | genre | budget | runtime | year | month | revenue | budget_is_na |
|---|---|---|---|---|---|---|---|---|
| 1108 | Pinocchio | Animation | 2600000.0 | 88.0 | 1940 | 2 | 84300000 | False |
| 862 | Lolita | Drama | 2000000.0 | 153.0 | 1962 | 6 | 9250000 | False |


Assign the `revenue` column to our target, `y`.

In [587]:
y = None

In [588]:
y[:3]

1108    84300000
862      9250000
1125    71000000
Name: revenue, dtype: int64

Assign our columns except for `title`, `genre`, and `revenue`, combined with our `genre_df`, to the variable `X`.

In [722]:

X = None
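One way to assemble the target and the feature matrix is to drop the unwanted columns and concatenate the dummy columns side by side. The sketch below uses a two-row hypothetical dataframe; `toy_X` and `toy_y` are illustrative names.

```python
import pandas as pd

# Hypothetical two-row stand-in for sorted_movies_df
toy_df = pd.DataFrame({
    "title": ["Pinocchio", "Lolita"],
    "genre": ["Animation", "Drama"],
    "budget": [2600000.0, 2000000.0],
    "revenue": [84300000, 9250000],
}, index=[1108, 862])
toy_genre_df = pd.get_dummies(toy_df["genre"])

# Target: the revenue column
toy_y = toy_df["revenue"]

# Features: everything except title, genre, and revenue, plus the dummies
toy_X = pd.concat(
    [toy_df.drop(columns=["title", "genre", "revenue"]), toy_genre_df],
    axis=1,
)
print(toy_X.columns.tolist())
```

`pd.concat` with `axis=1` aligns rows on the index, so both pieces need to share the same index labels.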

In [723]:
X[:2]

| | budget | runtime | year | month | budget_is_na | Action | Adventure | Animation | Comedy | Crime | Drama | Fantasy | Horror | Romance | Science Fiction | Thriller | na |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1108 | 2600000.0 | 88.0 | 1940 | 2 | False | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 862 | 2000000.0 | 153.0 | 1962 | 6 | False | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |


In [727]:
X.columns.tolist() == ['budget', 'runtime', 'year', 'month', 'budget_is_na', 'Action',
       'Adventure', 'Animation', 'Comedy', 'Crime', 'Drama', 'Fantasy',
       'Horror', 'Romance', 'Science Fiction', 'Thriller', 'na']

True

### Splitting the Data

Ok, now that we have completed feature engineering, let's split the data into training, validation, and test sets.

Allocate 10 percent of the data for the test set, and 10 percent of the remaining data for the validation set.  Because our movies are sorted by date, make sure that the data is split sequentially rather than randomly (with `shuffle = False`).

In [8]:
# split data here
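A sketch of a sequential split, assuming scikit-learn's `train_test_split` is the tool intended by `shuffle = False` (the lab does not name it). Splitting off 10 percent twice is also an assumption; it happens to reproduce the shapes shown in this lab. The array `toy_rows` is a hypothetical stand-in for the 1783 sorted rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for 1783 rows already sorted by date
toy_rows = np.arange(1783)

# First hold out the last 10 percent as the test set
toy_rest, toy_test = train_test_split(toy_rows, test_size=0.1, shuffle=False)

# Then hold out the last 10 percent of what remains for validation
toy_train, toy_validate = train_test_split(toy_rest, test_size=0.1, shuffle=False)

print(len(toy_train), len(toy_validate), len(toy_test))
```

With `shuffle=False`, the later rows always land in the held-out portion, so the model is validated and tested on movies released after the ones it trained on.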

In [10]:
X_train.shape, X_validate.shape, X_test.shape
# ((1443, 17), (161, 17), (179, 17))

In [733]:
X_train['year'][-1:], X_validate['year'][-1:], X_test['year'][-1:]
# (1800    2012
#  Name: year, dtype: int64,
#  85    2014
#  Name: year, dtype: int64,
#  357    2016
#  Name: year, dtype: int64)

(220    2012
 Name: year, dtype: int64,
 629    2014
 Name: year, dtype: int64,
 357    2016
 Name: year, dtype: int64)

### Scaling the Data

Now let's scale the data.

Initialize a `StandardScaler`.

In [11]:
scaler = None

And fit and transform the `X_train` data.

In [12]:
X_train_scaled = None

In [13]:
X_train_scaled

# array([[-1.42219236, -1.18914405, -8.56884218, ..., -0.16222142,
#         -0.18945647, -0.20463933],
#        [-1.4365378 ,  1.83325971, -5.55202435, ..., -0.16222142,
#         -0.18945647, -0.20463933],
#        [-0.7404252 ,  6.25061905, -5.41489627, ..., -0.16222142,
#         -0.18945647, -0.20463933],

Then, without refitting the scaler, transform the `X_validate` data.

In [738]:
X_validate_scaled = None
X_validate_scaled[:2]

array([[ 0.06973368, -1.42163664,  1.30437981, -0.63232683, -0.12152338,
        -0.55092925, -0.38822378, -0.21366369,  2.18449079, -0.1933473 ,
        -0.48740571, -0.19526776, -0.19141037, -0.13808619, -0.16222142,
        -0.18945647, -0.20463933],
       [-0.5279931 , -0.16617662,  1.30437981, -0.63232683, -0.12152338,
        -0.55092925, -0.38822378, -0.21366369, -0.45777259, -0.1933473 ,
        -0.48740571, -0.19526776, -0.19141037,  7.24185366, -0.16222142,
        -0.18945647, -0.20463933]])
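The fit-on-train, transform-on-validate pattern can be sketched with a tiny hypothetical example. `toy_scaler` and the arrays are illustrative names, not the lab's variables.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
toy_train = np.array([[1.0], [2.0], [3.0]])
toy_validate = np.array([[4.0]])

toy_scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data
toy_train_scaled = toy_scaler.fit_transform(toy_train)

# transform reuses those training statistics; nothing is re-learned here
toy_validate_scaled = toy_scaler.transform(toy_validate)
print(toy_scaler.mean_, toy_validate_scaled)
```

Refitting on the validation data would leak information about it into the preprocessing, and would put the two sets on different scales.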

### Training the Model

Ok, now we have our training, validation, and test data.  The main rule is that we only get to assess our model against our test dataset once.

Let's train our model on our `X_train_scaled` dataset, and score on the `X_validate_scaled` dataset.

In [15]:
# train model here

In [742]:
# score model here 

0.537861413573713
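As a rough illustration of the train-then-score step, here is a `LinearRegression` fit on synthetic data. The data, seed, and `toy_` names are all hypothetical; only the estimator matches the one shown later in this lab.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data with an exact linear relationship
rng = np.random.default_rng(0)
toy_X_train = rng.normal(size=(50, 2))
toy_y_train = 3 * toy_X_train[:, 0] - 2 * toy_X_train[:, 1] + 1
toy_X_validate = rng.normal(size=(10, 2))
toy_y_validate = 3 * toy_X_validate[:, 0] - 2 * toy_X_validate[:, 1] + 1

# Fit on the training split only
toy_model = LinearRegression().fit(toy_X_train, toy_y_train)

# score returns R^2 on whatever data is passed in
toy_score = toy_model.score(toy_X_validate, toy_y_validate)
print(toy_score)
```

Real data is noisier than this toy example, which is why the lab's validation score is well below 1.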

### Feature Selection

We can use the eli5 library for feature importances.

In [743]:
from eli5.sklearn import PermutationImportance
import eli5

perm = PermutationImportance(model).fit(X_validate_scaled, y_validate)

In [744]:
perm

PermutationImportance(cv='prefit',
                      estimator=LinearRegression(copy_X=True,
                                                 fit_intercept=True,
                                                 n_jobs=None, normalize=False),
                      n_iter=5, random_state=None, refit=True, scoring=None)

In [762]:
exp_df = eli5.explain_weights_df(perm, feature_names = list(X_train.columns))
exp_df[:10]

| | feature | weight | std |
|---|---|---|---|
| 0 | budget | 0.709011 | 0.046883 |
| 1 | Animation | 0.03981 | 0.009855 |
| 2 | runtime | 0.030055 | 0.016954 |
| 3 | budget_is_na | 0.014037 | 0.005122 |
| 4 | Adventure | 0.010915 | 0.002343 |
| 5 | Drama | 0.003156 | 0.002714 |
| 6 | Thriller | 0.001565 | 0.00138 |
| 7 | Crime | 0.001394 | 0.002559 |
| 8 | Action | 0.00075 | 0.001126 |
| 9 | month | 0.000315 | 0.000316 |


In [759]:
selected_feats = exp_df[:4]['feature'].values
selected_feats

array(['budget', 'Animation', 'runtime', 'budget_is_na'], dtype=object)
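To give some intuition for the weights used in this selection, here is a hand-rolled sketch of the idea behind permutation importance: shuffle one feature at a time and measure how much the score drops. This is not eli5's implementation, only an illustration on hypothetical synthetic data with a single shuffle per feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data in which feature 0 matters far more than feature 1
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 2))
toy_y = 5 * toy_X[:, 0] + 0.1 * toy_X[:, 1]

toy_model = LinearRegression().fit(toy_X, toy_y)
baseline = toy_model.score(toy_X, toy_y)

# Shuffle one column at a time and record the drop in R^2
drops = []
for col in range(toy_X.shape[1]):
    toy_X_perm = toy_X.copy()
    toy_X_perm[:, col] = rng.permutation(toy_X_perm[:, col])
    drops.append(baseline - toy_model.score(toy_X_perm, toy_y))

print(drops)
```

A feature whose shuffling barely changes the score contributes little to the predictions, which is the rationale for keeping only the top-weighted features.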

In [755]:
model = LinearRegression()
model.fit(X_train[selected_feats], y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [756]:
model.score(X_validate[selected_feats], y_validate)

0.5309831080966628

And then finally, we can score our model on the test data.

In [757]:
model.score(X_test[selected_feats], y_test)

0.5051412736850833