# Data Science at UCSB

# Python for Data Science: Feature Engineering

## Jason Freeberg, Fall 2016

Ahoy! Now that we have a basic understanding of machine learning, today we'll go over feature engineering, or, *the process of adding predictors to strengthen the model of a machine learning pipeline*. Let's say a company is building a model to predict trends in the stock market. They will likely begin with historical data on the NASDAQ, Dow Jones, and major stock market leaders like Alphabet Inc., Ford, large oil companies, and banks.

Then, as the team attempts to strengthen the predictive power of the model, they could choose to include the Consumer Price Index (CPI), national trends in weather, or even a rolling [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) of the Wall Street Journal. These are now new columns, predictors, or *features*, in our dataframe to help our model predict the stock market.

Depending on who you ask, [interaction terms](https://en.wikipedia.org/wiki/Interaction_(statistics)), scaling, [centering](http://www.theanalysisfactor.com/center-on-the-mean/), and [standardizing](https://en.wikipedia.org/wiki/Standard_score) could also fall under feature engineering.
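To make centering and standardizing concrete, here is a minimal sketch on a toy DataFrame. The data and the column names are made up for illustration and are not part of the lab's datasets.

```python
import pandas as pd

# Hypothetical toy data; the column name is made up for this example
df = pd.DataFrame({'budget': [10.0, 20.0, 30.0, 40.0]})

# Centering: subtract the column mean so the new column has mean 0
df['budget_centered'] = df['budget'] - df['budget'].mean()

# Standardizing: center, then divide by the standard deviation (z-score).
# pandas uses the sample standard deviation (ddof=1) by default.
df['budget_z'] = df['budget_centered'] / df['budget'].std()

print(df)
```

Both new columns carry the same information as the original one. Only the location and scale change, which mostly matters for models that are sensitive to the units of their inputs.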

Since the general idea here is simple, we will spend most of today working through the nitty gritty of adding predictors to a pandas DataFrame.

#### The Data

I wanted to use a time-series dataset to continue the example from earlier, but the support for time series modeling in Python is limited. Instead, you guys will play with a dataset fit for classification: the historical San Francisco crime breakdown from 2003 to 2015, courtesy of [Kaggle](https://www.kaggle.com/c/sf-crime). Then we'll try adding and mutating our predictors. In a production model, a feature engineer might spend weeks building a web scraper, gathering gigabytes of relevant information, and parsing it down to well-formatted features... But in today's lab we will simply leave out some given predictors, then add them in to assess their significance ;)

In [38]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
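# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection provides train_test_split
from sklearn import model_selection as cross_validation, metrics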
from urllib.request import urlopen
import os

seed = 123

# I'll let you guys play with the San Francisco dataset. For the walkthrough, I'll use this set from IMDB.
# Let's try to predict the film's score from some of the predictors.

location = os.path.realpath(os.path.join(os.getcwd(), "movie_metadata.csv"))
movieData = pd.read_csv(location)

print('Nrow = {0}. Ncol = {1}'.format(movieData.shape[0], movieData.shape[1]))
print('-------------------- Column Names and Examples --------------------')
for col in range(movieData.shape[1]):
    # .ix was removed from pandas; .iloc selects by integer position
    print('{0:30}  {1:20}'.format(str(movieData.columns[col]), str(movieData.iloc[6, col])))

Nrow = 5043. Ncol = 28
-------------------- Column Names and Examples --------------------
color                           Color               
director_name                   Sam Raimi           
num_critic_for_reviews          392.0               
duration                        156.0               
director_facebook_likes         0.0                 
actor_3_facebook_likes          4000.0              
actor_2_name                    James Franco        
actor_1_facebook_likes          24000.0             
gross                           336530303.0         
genres                          Action|Adventure|Romance
actor_1_name                    J.K. Simmons        
movie_title                     Spider-Man 3        
num_voted_users                 383056              
cast_total_facebook_likes       46055               
actor_3_name                    Kirsten Dunst       
facenumber_in_poster            0.0                 
plot_keywords                   sandman|spider man|symbio

In [39]:
startingPreds = movieData[['imdb_score', 'color', 'title_year', 'cast_total_facebook_likes', 'budget']].copy()

# Let's print our starting predictors

print(startingPreds.dtypes)
print('------- Missing Data: -------')
print(startingPreds.isnull().sum())

# We could spend a whole tutorial on ways to deal with missing data. But for the sake of time 
# we'll just drop rows with missing predictors.

startingPreds.dropna(axis=0, how='any', inplace=True)

# Encode our categorical variable

colorCode = LabelEncoder().fit(startingPreds['color'])
startingPreds['color'] = colorCode.transform(startingPreds['color'])

# Model building

movieTrain, movieTest = cross_validation.train_test_split(startingPreds, 
                                                          test_size = 0.3, 
                                                          random_state = seed)

linearModel = LinearRegression().fit(X = movieTrain.loc[:, movieTrain.columns != 'imdb_score'], 
                                     y = movieTrain['imdb_score'])

imdb_score                   float64
color                         object
title_year                   float64
cast_total_facebook_likes      int64
budget                       float64
dtype: object
------- Missing Data: -------
imdb_score                     0
color                         19
title_year                   108
cast_total_facebook_likes      0
budget                       492
dtype: int64


In [40]:
# Test and evaluate model

moviePredictions = pd.DataFrame(linearModel.predict(movieTest.loc[:, movieTest.columns != 'imdb_score']),
                                columns = ['predictions'])
firstResults = pd.concat([moviePredictions, movieTest['imdb_score'].reset_index(drop=True)], axis = 1)

firstMSE = metrics.mean_squared_error(y_true = firstResults['imdb_score'],
                                      y_pred = firstResults['predictions'])

firstRootMSE = np.sqrt(firstMSE)

print("Our initial root MSE = {0:.4}".format(firstRootMSE))

Our initial root MSE = 1.112
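
For reference, the root MSE is the square root of the average squared difference between the true values and the predictions. A minimal sketch of the arithmetic with made-up numbers (not taken from the IMDB data):

```python
import math

# Made-up scores for illustration only
y_true = [7.0, 6.0, 8.0, 5.0]
y_pred = [6.0, 6.5, 7.0, 6.5]

# Squared error for each pair, then the mean, then the square root
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)

print('MSE = {0:.4}, root MSE = {1:.4}'.format(mse, rmse))
```

Because the root MSE is expressed in the units of the target, it can be read roughly as the typical size of a prediction error in IMDB points.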


Now that we have a baseline model and test metric, we can start joining more predictors to see their influence on the model. Then we can try making some interaction terms. Remember that interaction terms may increase the overall **predictive power** of the model, but we will be sacrificing **interpretability**. Interaction between variables and the tradeoff between predictive power and interpretability are at the crux of [Artificial Neural Networks](https://en.wikipedia.org/wiki/Artificial_neural_network).
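To see why an interaction term can help, here is a small sketch on synthetic data where the target really does depend on the product of two predictors. Everything in it (the data, the `fit_rmse` helper, the coefficients) is made up for illustration, and the error is measured on the same data the model was fit on, so it only shows the fit, not generalization.

```python
import numpy as np

rng = np.random.RandomState(123)

# Synthetic data: the target depends on the product of x1 and x2
x1 = rng.uniform(0, 1, size=200)
x2 = rng.uniform(0, 1, size=200)
y = 2.0 * x1 * x2 + rng.normal(0, 0.01, size=200)

def fit_rmse(X, y):
    # Ordinary least squares with an intercept column, evaluated on the same data
    X = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sqrt(np.mean((X @ coef - y) ** 2))

rmse_main = fit_rmse(np.column_stack([x1, x2]), y)
rmse_inter = fit_rmse(np.column_stack([x1, x2, x1 * x2]), y)

print('Main effects only: {0:.4}'.format(rmse_main))
print('With interaction:  {0:.4}'.format(rmse_inter))
```

The price is interpretability: once `x1 * x2` is in the model, the effect of `x1` on the prediction is no longer a single coefficient, because it depends on the value of `x2`.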

In [41]:
# If you've used SQL before, then this is a simple join on the indices of the original data and our train/test sets
pd.options.mode.chained_assignment = None  # default = 'warn'

# Select more predictors
newPredictors = movieData[['duration', 'actor_1_facebook_likes', 'num_voted_users',
                           'actor_2_facebook_likes', 'language']]

# Join by the index number
movieTrain2 = pd.merge(movieTrain, 
                       newPredictors, 
                       left_index = True,
                       right_index = True,
                       how = 'left')

movieTest2 = pd.merge(movieTest, 
                      newPredictors, 
                      left_index = True,
                      right_index = True,
                      how = 'left')

# Drop rows with missing values in new features
movieTrain2.dropna(axis = 0, how = 'any', inplace = True)
movieTest2.dropna(axis = 0, how = 'any', inplace  = True)

# Encode language, a categorical variable
languageCode = LabelEncoder().fit(newPredictors.language.dropna())
movieTrain2['language'] = languageCode.transform(movieTrain2['language'])
movieTest2['language'] = languageCode.transform(movieTest2['language'])
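
The index join used in this cell can be hard to picture on a 5,000-row table, so here is a minimal sketch on two tiny frames. The frames `full` and `subset` and their values are hypothetical. `subset` plays the role of a train or test split that kept the original row index.

```python
import pandas as pd

# Hypothetical frames: `subset` plays the role of a train or test split
full = pd.DataFrame({'duration': [90.0, 120.0, None, 150.0]},
                    index=[0, 1, 2, 3])
subset = pd.DataFrame({'imdb_score': [7.1, 6.4, 8.0]},
                      index=[3, 0, 2])

# Left join on the index: keep every row of `subset`, look up matching rows in `full`
joined = pd.merge(subset, full, left_index=True, right_index=True, how='left')
print(joined)

# Rows whose new feature is missing are then dropped
print(joined.dropna(axis=0, how='any'))
```

Row 1 of `full` never appears because it is not in `subset`, and row 2 is removed by `dropna` because its `duration` is missing.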


In [42]:
# Let's rebuild and test a model!

model2 = LinearRegression().fit(X = movieTrain2.loc[:, movieTrain2.columns != 'imdb_score'],
                                y = movieTrain2['imdb_score'])

predictions2 = pd.DataFrame(model2.predict(movieTest2.loc[:, movieTest2.columns != 'imdb_score']), 
                            columns = ['prediction'])

MSE2 = metrics.mean_squared_error(y_true = movieTest2.imdb_score,
                                  y_pred = predictions2)
rMSE2 = np.sqrt(MSE2)

print('Old root MSE = {0:.3}.'.format(firstRootMSE))
print('Our improved root MSE = {0:.3}.'.format(rMSE2))
print('An improvement of {0:.4}%.'.format( abs((rMSE2-firstRootMSE)/firstRootMSE)*100) )

Old root MSE = 1.11.
Our improved root MSE = 0.963.
An improvement of 13.46%.


In [45]:
# Now let's add some interaction terms.

movieTrain2['budgetXduration'] = movieTrain2.duration * movieTrain2.budget
movieTrain2['actor1Xactor2'] = movieTrain2.actor_1_facebook_likes * movieTrain2.actor_2_facebook_likes
movieTrain2['castLikesXbudget'] = movieTrain2.cast_total_facebook_likes * movieTrain2.budget

movieTest2['budgetXduration'] = movieTest2.duration * movieTest2.budget
movieTest2['actor1Xactor2'] = movieTest2.actor_1_facebook_likes * movieTest2.actor_2_facebook_likes
movieTest2['castLikesXbudget'] = movieTest2.cast_total_facebook_likes * movieTest2.budget

model3 = LinearRegression().fit(X = movieTrain2.loc[:, movieTrain2.columns != 'imdb_score'],
                                y = movieTrain2['imdb_score'])

predictions3 = pd.DataFrame(model3.predict(X = movieTest2.loc[:, movieTest2.columns != 'imdb_score']), 
                            columns = ['prediction'])

MSE3 = metrics.mean_squared_error(y_true = movieTest2.imdb_score,
                                  y_pred = predictions3)
rMSE3 = np.sqrt(MSE3)

print('Our first root MSE = {0:.3}.'.format(firstRootMSE))
print('Our second root MSE = {0:.3}.'.format(rMSE2))
print('Our last root MSE = {0:.3}'.format(rMSE3))
print('An improvement of {0:.4}%.'.format( abs((rMSE3-rMSE2)/rMSE2)*100) )

Our first root MSE = 1.11.
Our second root MSE = 0.963.
Our last root MSE = 0.958
An improvement of 0.4346%.


## Your turn!

Your task is to optimize a classifier that predicts the *type* of crime given the location, time, date, and police district of the crime.

Slowly add predictors to improve the model's accuracy, then add interaction terms. We will use the raw accuracy (# correct / # total predictions) to evaluate the model each time. However, there are [other](http://stats.stackexchange.com/questions/51296/how-to-calculate-precision-and-recall-for-multiclass-classification-using-confus) multiclass classification metrics that are worth noting.
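As a rough illustration of raw accuracy next to two of those other metrics, the sketch below computes accuracy plus per-class precision and recall by hand. The label lists are made up, and only the category names are borrowed from the crime data.

```python
from collections import Counter

# Made-up labels for illustration; the category names come from the crime data
y_true = ['ASSAULT', 'BURGLARY', 'ASSAULT', 'WARRANTS', 'BURGLARY', 'ASSAULT']
y_pred = ['ASSAULT', 'ASSAULT', 'ASSAULT', 'WARRANTS', 'BURGLARY', 'BURGLARY']

# Raw accuracy: correct predictions divided by all predictions
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Per-class counts needed for precision and recall
true_counts = Counter(y_true)      # how often each class really occurs
pred_counts = Counter(y_pred)      # how often each class is predicted
hit_counts = Counter(t for t, p in zip(y_true, y_pred) if t == p)

for label in sorted(true_counts):
    precision = hit_counts[label] / pred_counts[label] if pred_counts[label] else 0.0
    recall = hit_counts[label] / true_counts[label]
    print('{0:10} precision = {1:.2f}, recall = {2:.2f}'.format(label, precision, recall))

print('Raw accuracy = {0:.4}'.format(accuracy))
```

With many unbalanced categories, a model can reach a decent raw accuracy by favoring the most common class, which is where the per-class numbers become informative.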

In [101]:
location1 = os.path.realpath(os.path.join(os.getcwd(), "crimeTrain.csv"))
crimeData = pd.read_csv(location1)
print('(Rows, Columns) =', crimeData.shape)
crimeData.head()

crimeData = crimeData[['Dates', 'Category', 'PdDistrict', 'Address', 'X', 'Y']]
crimeData.to_csv(os.path.join(os.getcwd(), 'crimeData.csv'), index = False)


(Rows, Columns) = (676127, 9)


In [102]:
location1 = os.path.realpath(os.path.join(os.getcwd(), "crimeData.csv"))
crimeData = pd.read_csv(location1)
print('(Rows, Columns) =', crimeData.shape)
crimeData.head()

(Rows, Columns) = (676127, 6)


| | Dates | Category | PdDistrict | Address | X | Y |
|---|---|---|---|---|---|---|
| 0 | 2015-05-13 23:53:00 | WARRANTS | NORTHERN | OAK ST / LAGUNA ST | -122.425892 | 37.774599 |
| 1 | 2015-05-13 23:53:00 | OTHER OFFENSES | NORTHERN | OAK ST / LAGUNA ST | -122.425892 | 37.774599 |
| 2 | 2015-05-13 23:33:00 | OTHER OFFENSES | NORTHERN | VANNESS AV / GREENWICH ST | -122.424363 | 37.800414 |
| 3 | 2015-05-13 23:30:00 | LARCENY/THEFT | NORTHERN | 1500 Block of LOMBARD ST | -122.426995 | 37.800873 |
| 4 | 2015-05-13 23:30:00 | LARCENY/THEFT | PARK | 100 Block of BRODERICK ST | -122.438738 | 37.771541 |


In [103]:
# If the model fit is running slowly, you can down sample the dataframe here
# crimeData = crimeData.sample(frac = 0.6, random_state = seed)

# Split up the data into train and test sets
crimeTrain, crimeTest = cross_validation.train_test_split(crimeData,
                                                         test_size = 0.3,
                                                         random_state = seed)
# Fill this list with the predictor column names
predColumns = ['PdDistrict', 'X', 'Y']

# Encode police district, which is a categorical variable
districtCode = LabelEncoder().fit(crimeData['PdDistrict'])
crimeTrain['PdDistrict'] = districtCode.transform(crimeTrain['PdDistrict'])
crimeTest['PdDistrict'] = districtCode.transform(crimeTest['PdDistrict'])

# Fit a simple classifier
kNN = KNeighborsClassifier(n_neighbors = 10).fit(X = crimeTrain[predColumns],
                                                 y = crimeTrain.Category)

# Predict on test set
crimePredictions1 = pd.DataFrame(kNN.predict(X = crimeTest[predColumns]))
accuracy1 = metrics.accuracy_score(y_pred = crimePredictions1,
                             y_true = crimeTest.Category)

print('Raw Accuracy = {0:.4}'.format(accuracy1))

Raw Accuracy = 0.2558


Not a great start, so let's add more predictors. We'll extract the day of week from the *Dates* column. Notice the **.map()** method used on the *Dates* column. **.map()** takes a function as its argument, here a [lambda function](http://www.diveintopython.net/power_of_introspection/lambda_functions.html), and applies that function to each element in the column. Lambda functions are just like the normal functions that we learned about weeks ago, except they don't have a name and contain only one expression.

Since they need to be simple, I used two in a row: one to parse the string for the date and time, and another to extract the day of week from the date.
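Here is a minimal sketch of that chaining on a two-element Series, using timestamps in the same format as the *Dates* column. The named function `weekday_from_string` is a hypothetical helper added only to show that a lambda and a regular function are interchangeable here.

```python
import pandas as pd
from datetime import datetime

# Two timestamps in the same format as the Dates column
dates = pd.Series(['2015-05-13 23:53:00', '2008-10-03 23:00:00'])

# First lambda: string -> datetime. Second lambda: datetime -> weekday (Monday = 0)
day_of_week = (dates
               .map(lambda row: datetime.strptime(row, '%Y-%m-%d %H:%M:%S'))
               .map(lambda row: row.weekday()))

# The same result with a named function instead of two lambdas
def weekday_from_string(row):
    return datetime.strptime(row, '%Y-%m-%d %H:%M:%S').weekday()

print(list(day_of_week))
print(list(dates.map(weekday_from_string)))
```

Splitting the work into two small steps keeps each lambda to a single expression, at the cost of walking over the column twice.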

In [166]:
# Not a great start, let's see how we can improve things.

from datetime import datetime    # for extracting the day of week from timestamp
import re     # for regular expressions

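def parseAddress(row):
    # Return the first street-name token: an all-caps word followed by another
    # all-caps word (MISSION in "MISSION ST"), or a numbered street (17TH)
    match = re.search(r'\b[A-Z]+\b(?=\s[A-Z]+)|\b[0-9]+[A-Z]+\b', row)
    if match is None:
        return None
    else:
        return match.group()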

crimeTrain['DayOfWeek'] = (crimeTrain.Dates
                    .map(lambda row: datetime.strptime(row, '%Y-%m-%d %H:%M:%S'))
                    .map(lambda row: datetime.weekday(row))
                   )

crimeTrain['HourOfDay'] = (crimeTrain.Dates
                           .map(lambda row: datetime.strptime(row, '%Y-%m-%d %H:%M:%S'))
                           .map(lambda row: row.hour)
                          )

# Now we'll parse the address and extract street names. We can't use the raw address strings
# because there are too many unique strings and it would introduce a lot of noise.

crimeTrain['StrippedAddress'] = (crimeTrain.Address
                                 .map(lambda row: re.sub(r'THE|LA', '', row))
                                 .map(lambda row: parseAddress(row))
                                )

crimeTrain['StrippedAddress'].fillna('Other')

N_address = len(crimeTrain.Address.unique())
N_stripAddress = len(crimeTrain.StrippedAddress.unique())
print('Reduced the number of addresses from {0} to {1}.'.format(N_address, N_stripAddress))
crimeTrain.head()

Reduced the number of addresses from 20624 to 1756.


| | Dates | Category | PdDistrict | Address | X | Y | DayOfWeek | HourOfDay | StrippedAddress |
|---|---|---|---|---|---|---|---|---|---|
| 464683 | 2008-10-03 23:00:00 | DRUG/NARCOTIC | 3 | MISSION ST / 17TH ST | -122.419516 | 37.763429 | 4 | 23 | MISSION |
| 552131 | 2007-07-09 10:30:00 | ASSAULT | 9 | 0 Block of MASON ST | -122.409268 | 37.7838 | 0 | 10 | MASON |
| 147589 | 2013-05-19 21:20:00 | BURGLARY | 4 | 800 Block of GROVE ST | -122.430563 | 37.77692 | 6 | 21 | GROVE |
| 388501 | 2009-11-11 06:30:00 | DISORDERLY CONDUCT | 3 | 3400 Block of 18TH ST | -122.420495 | 37.761822 | 2 | 6 | 18TH |
| 600943 | 2006-10-08 13:00:00 | NON-CRIMINAL | 1 | 1ST ST / MARKET ST | -122.399152 | 37.791017 | 6 | 13 | 1ST |
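
To check what the address regex actually picks out, the sketch below runs the same pattern on three addresses from the rows printed above. `parse_address` is a compact restatement of `parseAddress`. The word-bounded pattern at the end is an assumption on my part and is not what this lab uses.

```python
import re

def parse_address(address):
    # Same pattern as parseAddress: an all-caps word followed by another
    # all-caps word, or a numbered street such as 17TH
    match = re.search(r'\b[A-Z]+\b(?=\s[A-Z]+)|\b[0-9]+[A-Z]+\b', address)
    return match.group() if match is not None else None

# Addresses taken from the rows printed above
for address in ['MISSION ST / 17TH ST', '0 Block of MASON ST', '3400 Block of 18TH ST']:
    print('{0:25} -> {1}'.format(address, parse_address(address)))

# 'THE|LA' also matches inside longer words, so LAGUNA loses its first two letters
print(re.sub(r'THE|LA', '', 'OAK ST / LAGUNA ST'))

# A word-bounded alternative (an assumption, not the pattern used in this lab)
# removes THE and LA only when they stand alone
print(re.sub(r'\b(?:THE|LA)\b', '', 'OAK ST / LAGUNA ST'))
```

Switching to the word-bounded pattern would change the number of distinct street names reported above, so treat it as something to experiment with rather than a drop-in replacement.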


In [167]:
# Now do the same for the test set...

crimeTest['DayOfWeek'] = (crimeTest.Dates
                          .map(lambda row: datetime.strptime(row, '%Y-%m-%d %H:%M:%S'))
                          .map(lambda row: datetime.weekday(row))
                         )

crimeTest['HourOfDay'] = (crimeTest.Dates
                          .map(lambda row: datetime.strptime(row, '%Y-%m-%d %H:%M:%S'))
                          .map(lambda row: row.hour)
                         )

crimeTest['StrippedAddress'] = (crimeTest.Address
                                .map(lambda row: re.sub(r'THE|LA', '', row))
                                .map(lambda row: parseAddress(row))
                               )

crimeTest['StrippedAddress'].fillna('Other')

assert crimeTest.shape[1] == 9

In [161]:
# And let's encode StrippedAddress, a new categorical variable. Remember that there might be 
# some addresses present in the training set and absent from the test set, or vice versa. 
# Therefore, we will create a list of the unique levels from both crimeTrain and crimeTest.

allAddresses = list(crimeTrain.StrippedAddress.unique()) + \
                    list(crimeTest.StrippedAddress.unique())

addressCode = LabelEncoder().fit(allAddresses)
crimeTrain['StrippedAddress'] = addressCode.transform(crimeTrain['StrippedAddress'])
crimeTest['StrippedAddress'] = addressCode.transform(crimeTest['StrippedAddress'])


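TypeError: unorderable types: str() > NoneType()

The encoder fails because *StrippedAddress* still contains `None`. `.fillna('Other')` returns a new Series, and the cells above never assign that result back to the column, so the missing values are still there when `LabelEncoder` tries to sort the labels.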
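
A minimal sketch of one way to get past this error, shown on two toy Series rather than the crime data: assign the result of `.fillna('Other')` back before encoding, since the labels can only be sorted once no `None` remains. The dictionary below is a stand-in for `LabelEncoder`, chosen to keep the example dependency-light. Like `LabelEncoder`, it numbers the sorted unique labels.

```python
import pandas as pd

# Toy stand-ins for crimeTrain.StrippedAddress and crimeTest.StrippedAddress
train_addr = pd.Series(['MISSION', None, 'GROVE'])
test_addr = pd.Series(['MASON', 'MISSION', None])

# fillna returns a new Series, so the result has to be assigned back
train_addr = train_addr.fillna('Other')
test_addr = test_addr.fillna('Other')

# Build the codes from the levels of both sets so no test label is unseen
all_levels = sorted(set(train_addr) | set(test_addr))
codes = {level: i for i, level in enumerate(all_levels)}

train_codes = train_addr.map(codes)
test_codes = test_addr.map(codes)

print(codes)
print(list(train_codes), list(test_codes))
```

In the notebook itself, the equivalent change would be to assign the `.fillna('Other')` result back to `crimeTrain['StrippedAddress']` and `crimeTest['StrippedAddress']` before fitting `addressCode`.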

In [158]:
a = [1,2,3]
b = [4,5,6]
a+b

[1, 2, 3, 4, 5, 6]

Dates                0
Category             0
PdDistrict           0
Address              0
X                    0
Y                    0
DayOfWeek            0
HourOfDay            0
StrippedAddress    346
dtype: int64
Dates                0
Category             0
PdDistrict           0
Address              0
X                    0
Y                    0
DayOfWeek            0
HourOfDay            0
StrippedAddress    871
dtype: int64
