# Data Science at UCSB

# Python for Data Science: Feature Engineering

## Jason Freeberg, Fall 2016

Ahoy! Now that we have a basic understanding of machine learning, today we'll go over feature engineering, or *the process of adding predictors to strengthen the model in a machine learning pipeline*. Let's say a company is building a model to predict trends in the stock market. They will likely begin with historical data on the NASDAQ, the Dow Jones, and major market leaders like Alphabet Inc., Ford, large oil companies, and banks.

Then, as the team attempts to strengthen the predictive power of the model, they could choose to include the Consumer Price Index (CPI), national trends in weather, or even a rolling [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) of the Wall Street Journal, for example. These become new columns, predictors, or *features*, in our data table to help our model predict the stock market.

Depending on who you ask, [interaction terms](https://en.wikipedia.org/wiki/Interaction_(statistics)), scaling, [centering](http://www.theanalysisfactor.com/center-on-the-mean/), and [standardizing](https://en.wikipedia.org/wiki/Standard_score) could also fall under feature engineering.
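
To make the last three transformations concrete, here is a minimal sketch on a made-up `budget` column. The DataFrame `toy` and its values are assumptions for illustration, and min-max scaling is only one of several ways to scale a predictor.

```python
import pandas as pd

# Toy data, made up purely for illustration
toy = pd.DataFrame({"budget": [10.0, 20.0, 30.0, 40.0]})

# Centering: subtract the mean so the column averages to zero
toy["budget_centered"] = toy["budget"] - toy["budget"].mean()

# Scaling (min-max flavor): squeeze the column into the [0, 1] range
col_min, col_max = toy["budget"].min(), toy["budget"].max()
toy["budget_scaled"] = (toy["budget"] - col_min) / (col_max - col_min)

# Standardizing: center, then divide by the standard deviation (z-score)
# ddof=0 uses the population standard deviation
toy["budget_z"] = toy["budget_centered"] / toy["budget"].std(ddof=0)

print(toy)
```

In a real pipeline, the mean, min, max, and standard deviation would normally be computed on the training set only and then reused on the test set.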

Since the general idea here is simple, we will spend most of today working through the nitty gritty of adding predictors to a pandas DataFrame.

#### The Data

I wanted to use a time-series dataset to continue the example from earlier, but the support for time series modeling in Python is limited. Instead, you guys will play with a dataset fit for classification: the historical San Francisco crime breakdown from 2003 to 2015, courtesy of [Kaggle](https://www.kaggle.com/c/sf-crime). Then we'll try adding and mutating our predictors. In a production model, a feature engineer might spend weeks building a web scraper, gathering gigabytes of relevant information, and parsing it down to well-formatted features.

BUT! In today's lab we will simply leave out some given predictors, then add them in to assess their significance ;)

In [88]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
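from sklearn import metrics
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn import model_selection as cross_validation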
from urllib.request import urlopen
import os

seed = 123

# You guys will play with the San Francisco dataset; I'll use this set from IMDB.
# Let's try to predict the score from some of the other predictors.

location = os.path.realpath(os.path.join(os.getcwd(), "movie_metadata.csv"))
movieData = pd.read_csv(location)

print('Nrow = {0}. Ncol = {1}'.format(movieData.shape[0], movieData.shape[1]))
print('-------------------- Column Names and Examples --------------------')
for col in range(movieData.shape[1]):
    # .ix was removed from pandas; .iloc selects by integer position
    print('{0:30}  {1:20}'.format(str(movieData.columns[col]), str(movieData.iloc[6, col])))

Nrow = 5043. Ncol = 28
-------------------- Column Names and Examples --------------------
color                           Color               
director_name                   Sam Raimi           
num_critic_for_reviews          392.0               
duration                        156.0               
director_facebook_likes         0.0                 
actor_3_facebook_likes          4000.0              
actor_2_name                    James Franco        
actor_1_facebook_likes          24000.0             
gross                           336530303.0         
genres                          Action|Adventure|Romance
actor_1_name                    J.K. Simmons        
movie_title                     Spider-Man 3        
num_voted_users                 383056              
cast_total_facebook_likes       46055               
actor_3_name                    Kirsten Dunst       
facenumber_in_poster            0.0                 
plot_keywords                   sandman|spider man|symbio

In [142]:
startingPreds = movieData[['imdb_score', 'color', 'title_year', 'cast_total_facebook_likes', 'budget']].copy()

# Let's print our starting predictors

print(startingPreds.dtypes)
print('------- Missing Data: -------')
print(startingPreds.isnull().sum())

# We could spend a whole tutorial on ways to deal with missing data. But for the sake of time 
# we'll just drop rows with missing predictors.

startingPreds.dropna(axis=0, how='any', inplace=True)

# Encode our categorical variable

colorCode = LabelEncoder().fit(startingPreds['color'])
startingPreds['color'] = colorCode.transform(startingPreds['color'])

# Model building

movieTrain, movieTest = cross_validation.train_test_split(startingPreds, 
                                                          test_size = 0.3, 
                                                          random_state = seed)

# .ix was removed from pandas; .loc accepts the boolean column mask
linearModel = LinearRegression().fit(X = movieTrain.loc[:, movieTrain.columns != 'imdb_score'], 
                                     y = movieTrain['imdb_score'])

imdb_score                   float64
color                         object
title_year                   float64
cast_total_facebook_likes      int64
budget                       float64
dtype: object
------- Missing Data: -------
imdb_score                     0
color                         19
title_year                   108
cast_total_facebook_likes      0
budget                       492
dtype: int64
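
Dropping rows is the quickest fix, but it throws away every movie with a gap in `title_year` or `budget`. As one alternative, here is a minimal sketch of median imputation on made-up data. The DataFrame `toy` and its values are assumptions for illustration; this is not what the lab itself does.

```python
import numpy as np
import pandas as pd

# Toy data with gaps, made up for illustration
toy = pd.DataFrame({
    "title_year": [2001.0, np.nan, 2010.0, 2015.0],
    "budget": [5e6, 2e7, np.nan, 1e8],
})

# Option 1 (what this lab does): drop any row with a missing value
dropped = toy.dropna(axis=0, how="any")

# Option 2: fill each gap with the median of its column
imputed = toy.fillna(toy.median())

print("Rows after dropping:", len(dropped))
print("Rows after imputing:", len(imputed))
```

If you try this on the real data, compute the medians on the training rows only so that no information leaks in from the test set.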


In [149]:
# Test and evaluate model

# .ix was removed from pandas; .loc accepts the boolean column mask
moviePredictions = pd.DataFrame(linearModel.predict(movieTest.loc[:, movieTest.columns != 'imdb_score']),
                                columns = ['predictions'])
firstResults = pd.concat([moviePredictions, movieTest['imdb_score'].reset_index(drop=True)], axis = 1)

firstMSE = metrics.mean_squared_error(y_true = firstResults['imdb_score'],
                                         y_pred = firstResults['predictions'])

firstRootMSE = np.sqrt(firstMSE)

print("Our initial root MSE = {0:.4}".format(firstRootMSE))

Our initial root MSE = 1.112
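
For reference, the root MSE reported above is just the square root of the average squared residual. Here is a minimal sketch of the same arithmetic by hand; the arrays `y_true` and `y_pred` are made-up numbers for illustration, not values from the IMDB data.

```python
import numpy as np

# Made-up scores, for illustration only
y_true = np.array([7.0, 6.5, 8.0, 5.5])
y_pred = np.array([6.5, 7.0, 7.0, 6.0])

# Mean squared error: average of the squared residuals
mse = np.mean((y_true - y_pred) ** 2)

# Root MSE is back in the units of the target (here, score points)
rmse = np.sqrt(mse)

print("MSE = {0:.4}, root MSE = {1:.4}".format(mse, rmse))
```

Because the root MSE is in the same units as `imdb_score`, a value near 1.1 means our predictions are typically off by about one point on the ten-point scale.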


Now that we have a baseline model and test metric, we can start joining more predictors to see their influence on the model. Then we can try making some interaction terms. Remember that interaction terms may increase the overall **predictive power** of the model, but we will be sacrificing **interpretability**. Interactions between variables and the predictive power/interpretability tradeoff are at the crux of [Artificial Neural Networks](https://en.wikipedia.org/wiki/Artificial_neural_network).
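
In its simplest form, an interaction term is the product of two predictors, added as a new column. Here is a minimal sketch on made-up rows. The column names mirror the IMDB set, but the DataFrame `toy`, its values, and the name `budget_x_likes` are assumptions for illustration.

```python
import pandas as pd

# Toy rows, made up for illustration; column names mirror the IMDB set
toy = pd.DataFrame({
    "budget": [1e6, 5e6, 2e7],
    "cast_total_facebook_likes": [100, 2000, 50000],
})

# The interaction term is the elementwise product of the two predictors
toy["budget_x_likes"] = toy["budget"] * toy["cast_total_facebook_likes"]

print(toy)
```

With this column in a linear model, the effect of `budget` on the score is allowed to change with the popularity of the cast, which is exactly why the coefficients become harder to read.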

In [144]:
# If you've used SQL before, then this is a simple join on the indices of the original data and our train/test sets
pd.options.mode.chained_assignment = None  # default = 'warn'

# Select more predictors
newPredictors = movieData[['duration', 'actor_1_facebook_likes', 'num_voted_users',
                           'actor_2_facebook_likes', 'language']]

# Join by the index number
movieTrain2 = pd.merge(movieTrain, 
                       newPredictors, 
                       left_index = True,
                       right_index = True,
                       how = 'left')

movieTest2 = pd.merge(movieTest, 
                      newPredictors, 
                      left_index = True,
                      right_index = True,
                      how = 'left')

# Drop rows with missing values in new features
movieTrain2.dropna(axis = 0, how = 'any', inplace = True)
movieTest2.dropna(axis = 0, how = 'any', inplace  = True)

# Encode language, a categorical variable
languageCode = LabelEncoder().fit(newPredictors.language.dropna())
movieTrain2['language'] = languageCode.transform(movieTrain2['language'])
movieTest2['language'] = languageCode.transform(movieTest2['language'])
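
The index join works because `train_test_split` shuffles the rows but keeps their original index labels. Here is a minimal sketch of the same idea on toy data; the frames `full` and `train` and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the full table and a shuffled training split
full = pd.DataFrame({"duration": [90, 120, 105, 150]}, index=[0, 1, 2, 3])
train = pd.DataFrame({"imdb_score": [7.1, 6.3]}, index=[2, 0])

# Left join on the index: keep every row of `train` and look up
# the row of `full` that carries the same index label
joined = pd.merge(train, full, left_index=True, right_index=True, how="left")

print(joined)
```

`train.join(full, how="left")` should give the same result, since `DataFrame.join` matches on the index by default.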


In [165]:
# Let's rebuild and test a model!

# .ix was removed from pandas; .loc accepts the boolean column mask
model2 = LinearRegression().fit(X = movieTrain2.loc[:, movieTrain2.columns != 'imdb_score'],
                                y = movieTrain2['imdb_score'])

predictions2 = pd.DataFrame(model2.predict(movieTest2.loc[:, movieTest2.columns != 'imdb_score']), 
                            columns = ['prediction'])

MSE2 = metrics.mean_squared_error(y_true = movieTest2.imdb_score,
                                  y_pred = predictions2)

rMSE2 = np.sqrt(MSE2)

print('Old root MSE = {0:.3}.'.format(firstRootMSE))
print('Our improved root MSE = {0:.3}.'.format(rMSE2))
print('An improvement of {0:.4}%.'.format(abs((rMSE2 - firstRootMSE) / firstRootMSE) * 100))

Old root MSE = 1.11.
Our improved root MSE = 0.963.
An improvement of 13.46%.
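
One caveat on the encoding used above: `LabelEncoder` maps each category to an integer, and a linear regression reads those integers as an ordered quantity, which makes little sense for something like `language`. A common alternative is one-hot encoding, sketched below with `pd.get_dummies`. The DataFrame `toy` and its values are made up for illustration, and this is a swapped-in technique, not what the lab's models use.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({"language": ["English", "French", "English", "Mandarin"]})

# One indicator column per category instead of a single integer code
oneHot = pd.get_dummies(toy["language"], prefix="language")

print(oneHot)
```

For a linear model with an intercept, passing `drop_first=True` to `pd.get_dummies` drops one indicator column and avoids perfectly collinear predictors.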


## Your turn!

Take the wheel, sucker!