In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

import util

# set defaults
plt.style.use('seaborn-v0_8-white')   # seaborn custom plot style (named 'seaborn-white' before matplotlib 3.6)
plt.rc('figure', dpi=100, figsize=(7, 5))   # set default size/resolution
plt.rc('font', size=12)   # font size

# Feature Engineering


## Outline
* Overview: Modeling and Estimation
* Designing Features for your Model
* Different Features for different Data Types

## What have we done so far?

* Data assessment and collection:
    * The data generating process and its relationship to observed data.
    * Data collection techniques (web scraping, APIs)
* Data cleaning and manipulation
    * Pandas and Regex
* Learned ways of understanding and summarizing data
    * Smoothing techniques, visualization, TF-IDF

## Features

* A **feature** is a measurable property or characteristic of a phenomenon being observed.
* Synonyms: (explanatory) variable, attribute
* Examples include:
    - a column of a dataset.
    - a derived value from a dataset, perhaps using additional information.
    
We have been creating features to summarize data!

### Examples of features in SD salary dataset

* Salary of employee
* Employee salaries, standardized by job status (PT/FT)
* Gender/age of employees (derived from SSA names; accurate?)
* Job Family associated to a job title (uses text-techniques)

## What makes a good feature?

* Fidelity to Data Generating Process (Consistency).
* Strongly associated with the phenomenon of interest ("contains information").
* Easily used in standard modeling techniques (e.g. quantitative and scaled).

Datasets often come with weak attributes; features may need to be "engineered" to convey information.

## Feature Engineering

* We already engineered features to summarize and understand data.
    - smoothing, transformations, ad hoc derived properties of data

* What can we do with it?
    - Visualization and summarization
    - Modeling (prediction; inference)

# Modeling: an Overview

Slides: DS100 (Joseph E. Gonzalez)

<img src="imgs/image_6.png">

<img src="imgs/image_7.png">

<img src="imgs/image_8.png">

### Example: Restaurant Tips

* Data: collected by a single waiter over a month
* Why build a model?
    - Predict which tables will tip the highest? (Optimize your service)
    - Predict a waiter's income for the year.
    - Understand relationship between tables and tips.

In [None]:
tips = sns.load_dataset('tips')
print('number of records: ', len(tips))

In [None]:
tips.head()

### Restaurant tips: EDA

In [None]:
fig, axes = plt.subplots(1,2, figsize=(12,5))

sns.distplot(tips.total_bill, rug=True, ax=axes[0])
axes[0].set_xlabel('total bill')

sns.distplot(tips.tip, rug=True, ax=axes[1])
axes[1].set_xlabel('tip in dollars')

fig.suptitle('Understanding tips');

#### Observations:
|Total Bill|Tip Amount|
|---|---|
|Right skewed|Right skewed|
|Mode around \$15|Mean around \$3|
|Mean around \$20|Possibly bimodal (?)|
|No large bills|Large outliers (?)|

<img src="imgs/image_9.png">

<img src="imgs/image_10.png">

<img src="imgs/image_11.png">

<img src="imgs/image_12.png">

# Features in Linear Models

## Predicting child heights

* Recall Francis Galton's obsession with understanding inheritance.
* He wanted to predict a child's *height* from the attributes of their parents.
    - attributes: family id, father height, mother height, number of children, gender, child height.

In [None]:
galton = pd.read_csv('data/galton.csv')
galton.head()

### Heights data: quick EDA
* What could be done to improve this viz?
* Is a linear model suitable for prediction? on which attributes?
* There are multiple granularities (what?); is this a problem?

In [None]:
pd.plotting.scatter_matrix(galton, figsize=(12,8));

### Attempt #1: Predict child's height using father's height

1. Plot a scatterplot with a best-fit line and the confidence band of the regression estimate

In [None]:
sns.lmplot(x='father', y='childHeight', data=galton);

### Attempt #1: Predict child's height using father's height

Let's do the prediction "by hand":

* Recall, a prediction is a function $pred$ from the *features* (father height) to the *target* (child height).
* The quality of our prediction on the dataset is the *root mean square error* (RMSE): $${\rm RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(pred(x_i) - y_i)^2} $$ where $n$ is the number of observations, $x_i$ are the father heights, $pred(x_i)$ are the predicted child heights, and $y_i$ are the *actual* child heights.
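
As a minimal sketch of the RMSE computation (square the errors, average them, then take the square root), the helper below works on any pair of equal-length arrays. The function name `rmse` and the four heights are made up for illustration and are not taken from the Galton data.

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean square error: square the errors, average them, take the root."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual) ** 2))

# hypothetical heights in inches, invented for this example
actual_heights = np.array([66.0, 68.0, 70.0, 72.0])
predicted_heights = np.array([67.0, 67.0, 71.0, 71.0])

# every prediction is off by exactly one inch, so the RMSE is 1.0
example_rmse = rmse(predicted_heights, actual_heights)
print(example_rmse)
```

Because of the averaging step, the RMSE is in the same units as the target (inches here) and does not grow just because the dataset has more rows.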

In [None]:
from scipy.stats import linregress

lm = linregress(galton.father, galton.childHeight)
lm

In [None]:
pred_height = lambda x: lm.slope * x + lm.intercept

In [None]:
pred_height(60)

In [None]:
pred_height(galton.father).head()

In [None]:
rmse = np.sqrt(np.mean((pred_height(galton.father) - galton.childHeight)**2))
rmse

### Visualizing the predictions
* How is our model good? bad?
    - Is a linear model appropriate? good?
    - How might we make it better?

In [None]:
# what is this code doing?
eval_data = (
    pd.concat(
        [galton[['father', 'childHeight']], pred_height(galton.father).rename('prediction')], 
        axis=1
    ).set_index('father')
    .unstack()
    .rename('height')
    .reset_index()
    .rename(columns={'level_0':'type'})
)

eval_data.sample(10)

#### Questions
* How is our model good? bad?
* Is a linear model appropriate? good?
    - How might we make it better?

In [None]:
sns.scatterplot(
    data=eval_data,
    x='father', y='height',
    hue='type'
);

### Attempt #2: adding features

* What if the father is very tall and the mother is short?
* Will adding mother's height help our predictions?
* Try: regression on two variables (mother/father height).
    - "plane of best fit"

In [None]:
# use sklearn
from sklearn.linear_model import LinearRegression

In [None]:
# typical pattern; focus on this later!

lr = LinearRegression() # initialize the linear regression

lr.fit(galton[['mother', 'father']], galton.childHeight) # calculate the weights

predictions = lr.predict(galton[['mother', 'father']]) # calculate predictions

In [None]:
# how good is the prediction?

np.sqrt(np.mean((predictions - galton.childHeight)**2))

In [None]:
util.plot3Dscatter(galton, 'mother', 'father', lr, galton['childHeight'])

In [None]:
# plot results by father height
util.plot_eval_scatter(galton, pd.Series(predictions), galton['childHeight'], 'father')

In [None]:
# plot results by mother height
util.plot_eval_scatter(galton, pd.Series(predictions), galton['childHeight'], 'mother')

### Attempt #3: adding gender to the regression

* Our previous predictions are constant for a given set of parents.
* One would expect male/female children of the same parents to have different heights!
* Is it reasonable to add this attribute? Is it known when the prediction is used?

First plot a scatterplot of 'father height' vs 'child height' by group:

In [None]:
# The regression lines (predictions) are very different for male/female
sns.lmplot(x='father', y='childHeight', data=galton, hue='gender');

### Attempt #3: adding gender to the regression

* Problem: gender is *categorical*, while regression requires *quantitative* inputs!
    - The table contains two values in the column: male/female
* Solution: create a binary column called `gender=male` that:
    - is 1 when `gender` has value male, and
    - is 0 otherwise
    
This is a simple example of *one-hot encoding*.

In [None]:
galton['gender=male'] = (galton.gender == 'male').astype(int)
galton.head()

In [None]:
lr_gender = LinearRegression()
lr_gender.fit(galton[['father', 'mother', 'gender=male']], galton.childHeight)

In [None]:
predictions_gender = lr_gender.predict(galton[['father', 'mother', 'gender=male']])

In [None]:
np.sqrt(np.mean((predictions_gender - galton.childHeight)**2))

In [None]:
# plot results by father height
util.plot_eval_scatter(galton, pd.Series(predictions_gender), galton['childHeight'], 'father')

### Visualizing regression with one-hot encoding

* One-hot encoding "pulls the two genders apart" in the scatterplot, along a 3rd dimension.
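
One way to see the "pulling apart" numerically: with a 0/1 indicator in the regression, the fitted model is two parallel lines, and the coefficient on the indicator is the vertical gap between them. The sketch below uses synthetic, noise-free numbers (not the Galton data) in which the gap is set to 5 by construction; the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data: two groups that share a slope but differ by a constant offset
father = np.array([64.0, 66.0, 68.0, 70.0, 64.0, 66.0, 68.0, 70.0])
is_male = np.array([0, 0, 0, 0, 1, 1, 1, 1])
child = 30.0 + 0.5 * father + 5.0 * is_male

# regress on the quantitative feature and the one-hot (indicator) feature
X = np.column_stack([father, is_male])
model = LinearRegression().fit(X, child)

# slope is shared by both groups; offset is the gap between the two lines
slope, offset = model.coef_
print(slope, offset, model.intercept_)
```

On real data the recovered offset is only an estimate, and a single shared slope is itself a modeling assumption.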

In [None]:
# The regression lines (predictions) are very different for male/female
sns.lmplot(x='father', y='childHeight', data=galton, hue='gender');

In [None]:
lr_gender_2 = LinearRegression()
lr_gender_2.fit(galton[['gender=male', 'father']], galton.childHeight)


In [None]:
util.plot3Dscatter(galton, 'gender=male', 'father', lr_gender_2, galton['childHeight'])

## Feature Engineering

### Modeling setup

Want to estimate a relationship between X and Y.
* X is the observed data (almost anything!)
* Y is a quantitative value (e.g. a correlation coefficient; a predicted value)

<img src="imgs/image_0.png">

### The missing step: data to models

* Modeling techniques typically require *quantitative* input.
* Models require (strong) relationships between X and Y.

<img src="imgs/image_1.png">

There is work to be done transforming data into effective features!

## The goal of feature engineering

* Find transformations that turn data into effective quantitative variables

* Find functions $\phi:X\to\mathbb{R}^d$ where similar points $x,y\in X$ have close images $\phi(x), \phi(y)\in \mathbb{R}^d$

* A "good" choice of features depends on many factors:
    - data type (quantitative, ordinal, nominal),
    - the relationship(s) and association(s) being modeled,
    - the model type (e.g. linear models, decision tree models, neural networks).

<img src="imgs/image_2.png">

<img src="imgs/image_3.png">

## Uninformative feature: `uid`

The `uid` was likely used to join the user information (e.g., `age` and `state`) with some `Reviews` table.  The `uid` raises several questions:
* What is the meaning of the `uid` *number*? 
* Does the magnitude of the `uid` reveal information about the rating?
* Does adding `uid` improve our model?

## Dropping Features

While uncommon, there are certain scenarios where manually dropping features might be helpful:

1. when the feature **does not contain information** associated with the prediction task.  
    - Reduces over-fitting, an issue we will discuss in great detail soon.  

2. when the feature is **not available at prediction time.**  For example, the feature might contain information collected after the user entered a rating.  This is a common scenario in time-series analysis.
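
A small sketch of why an identifier-like feature is a candidate for dropping, using synthetic data (the `uid` column here is a random permutation invented for the example). In ordinary least squares the training $R^2$ cannot decrease when a column is added, so a tiny gain from `uid` is not evidence that it carries information.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200

# synthetic target that depends only on x
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)

# hypothetical identifier: an arbitrary labeling unrelated to y
uid = rng.permutation(n)

X_without = x.reshape(-1, 1)
X_with = np.column_stack([x, uid])

# training R^2 with and without the identifier column
r2_without = LinearRegression().fit(X_without, y).score(X_without, y)
r2_with = LinearRegression().fit(X_with, y).score(X_with, y)
print(r2_without, r2_with)
```

Evaluating on data held out from fitting is the usual way to tell whether such a gain is real.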


## Nominal feature encoding: One hot encoding

* Transform categorical features into many binary features.
* Given a column `col` with values $A_1, A_2, \ldots, A_N$, define the following quantitative binary columns for $i = 1, \ldots, N$:

$$\phi_i(x) = \left\{\begin{array}{ll}1 & {\rm if\ } x = A_i \\ 0 &  {\rm if\ } x\neq A_i \\ \end{array}\right. $$

* *Also called:* dummy encoding; indicator variables.
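
The definition translates almost directly into pandas. The sketch below builds one indicator column per distinct value; the helper name `one_hot` and the toy column are illustrative only.

```python
import pandas as pd

def one_hot(column):
    """Return one 0/1 indicator column per distinct value of `column`."""
    encoded = pd.DataFrame(index=column.index)
    for value in sorted(column.unique()):
        # indicator is 1 where the entry equals this value, 0 otherwise
        encoded[f"{column.name}={value}"] = (column == value).astype(int)
    return encoded

# toy column with three distinct values
col = pd.Series(["A1", "A2", "A1", "A3"], name="col")
encoded = one_hot(col)
print(encoded)
```

Each row of the result contains exactly one 1, which is where the name "one-hot" comes from.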

### Example: one hot encoding States

* A column containing US states transforms into 50 feature columns
* e.g. `phi_CA(x) = 1 if x == 'CA' else 0`
* Oftentimes, many of these columns will be *largely* 0.

<img src="imgs/image_4.png">

### One hot encoding and circuits
<img src="imgs/image_5.png">

### Example: Restaurant tips

* We want to predict `tip` from the attributes using linear regression
    - Previously: predicted `tip` from `total_bill`
* Which columns are nominal?
    - How might you transform them to features for a regression model?
    - What is the domain of your feature transformation functions?

In [None]:
tips.head()

## Baseline models
1. Tips are predicted to be a fixed percentage of the total bill (average percentage)
2. The line of best fit of `tip` vs `total_bill`.

In [None]:
tip_pct = (tips.tip/tips.total_bill).mean()
tip_pct

In [None]:
preds = tips.total_bill * tip_pct

a = pd.concat([tips.total_bill, preds.rename('prediction'), tips.tip], axis=1)
ax = plt.subplot()
a.plot(kind='line', x='total_bill', y='prediction', ax=ax, c='b')
a.plot(kind='scatter', x='total_bill', y='tip', ax=ax, c='r', alpha=0.5);

In [None]:
# RMSE
np.sqrt(np.mean((preds - tips.tip)**2))

In [None]:
sns.lmplot(data=tips, x='total_bill', y='tip')

In [None]:
lr = LinearRegression()
lr.fit(tips[['total_bill']], tips.tip)

In [None]:
# RMSE of regression model (is it better?)
np.sqrt(np.mean((lr.predict(tips[['total_bill']]) - tips.tip)**2))

In [None]:
# R^2 coefficient
lr.score(tips[['total_bill']], tips.tip)

### One hot encoding categorical variables
* Are all of these variables nominal?
* Do we have redundant variables we can drop?
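
On the redundancy question: the indicator columns of a single categorical variable always sum to 1 in every row, so any one of them can be recovered from the others. A minimal sketch with a toy `smoker` column (invented values), using the `drop_first` option of `pd.get_dummies`:

```python
import pandas as pd

# toy binary column, invented for illustration
smoker = pd.Series(["Yes", "No", "No", "Yes"], name="smoker")

# full encoding: one indicator per value
full = pd.get_dummies(smoker, prefix="smoker", prefix_sep="=").astype(int)

# reduced encoding: the first category (alphabetically) is dropped
reduced = pd.get_dummies(
    smoker, prefix="smoker", prefix_sep="=", drop_first=True
).astype(int)

print(full)
print(reduced)
```

Here `smoker=No` is just `1 - smoker=Yes`, so the reduced encoding carries the same information for a linear model with an intercept.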

In [None]:
categorical_cols = ['sex', 'smoker', 'day', 'time']

In [None]:
features = tips.copy().loc[:,['total_bill', 'size']]
for c in categorical_cols:
    for val in tips[c].unique():
        features['%s=%s' %(c, val)] = (tips[c] == val).astype(int)

In [None]:
features.head()

In [None]:
lr = LinearRegression()
lr.fit(features, tips.tip)

In [None]:
lr.score(features, tips.tip)

In [None]:
# Error is a few cents less than previous
np.sqrt(np.mean((lr.predict(features) - tips.tip)**2))

In [None]:
preds = lr.predict(features)
a = pd.concat([tips.total_bill, pd.Series(preds).rename('prediction'), tips.tip], axis=1)
ax = plt.subplot()
a.plot(kind='scatter', x='total_bill', y='prediction', ax=ax, c='b')
a.plot(kind='scatter', x='total_bill', y='tip', ax=ax, c='r', alpha=0.5);

## One hot encoding in Scikit Learn

* One-hot encoding can be done using `sklearn.feature_extraction.DictVectorizer`
    - Takes in dictionary rows as input
* One-hot encoding is also possible with `sklearn.preprocessing.OneHotEncoder`
    - Accepts string or integer categories directly (scikit-learn 0.20 and later)
    - Older versions expected integer-coded categories, which required an integer-encoding step first

In [None]:
from sklearn.feature_extraction import DictVectorizer

In [None]:
d = tips[categorical_cols].to_dict(orient='records')
d[:10]

In [None]:
vec_enc = DictVectorizer()
vec_enc.fit(d)

In [None]:
vec_enc.transform(d).toarray()

In [None]:
vec_enc.get_feature_names_out()   # get_feature_names() was removed in scikit-learn 1.2

In [None]:
pd.DataFrame(vec_enc.transform(d).toarray(), columns=vec_enc.get_feature_names_out()).head()

## Integer encoding for ordinal columns

* If a categorical column has an order, then its values can be mapped to the integers
* The mapped values should have the same order as the number line
    - Be sure to specifically call out the mapping to maintain order!
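
To illustrate why the mapping should be stated explicitly: if pandas is left to infer categories, it sorts them alphabetically, which puts `Fri` before `Thur`. The sketch below, on a toy series of day labels, compares the explicit order with the inferred one.

```python
import pandas as pd

# the intended order, stated explicitly
day_order = ["Thur", "Fri", "Sat", "Sun"]
days = pd.Series(["Sat", "Thur", "Sun", "Fri", "Sat"])

# explicit order: codes follow day_order
ordered = pd.Categorical(days, categories=day_order, ordered=True)
codes = pd.Series(ordered.codes, index=days.index)

# inferred order: categories are sorted alphabetically (Fri, Sat, Sun, Thur)
alphabetical_codes = pd.Series(pd.Categorical(days).codes, index=days.index)

print(codes.tolist())               # [2, 0, 3, 1, 2]
print(alphabetical_codes.tolist())  # [1, 3, 2, 0, 1]
```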

In [None]:
# is day ordinal?
features['day'] = tips.day.map(dict(zip(['Thur', 'Fri', 'Sat', 'Sun'], range(4)))).astype(int)

In [None]:
features.head()

In [None]:
# How much improvement?
lr = LinearRegression()
lr.fit(features, tips.tip)
lr.score(features, tips.tip)

In [None]:
from sklearn.preprocessing import OrdinalEncoder

In [None]:
oe = OrdinalEncoder(categories=[['Thur', 'Fri', 'Sat', 'Sun']])
oe.fit(tips[['day']])

### Transformations of quantitative features
* Transforming quantitative features can enhance 'hidden trends' in data.
* Examples:
    - Growth rates scaled to linear trends (e.g. log, sqrt)
    - Periodic trends separated from growth (e.g. sin)
    - Group-wise scaling
    - Interactions between variables (e.g. polynomial encoding)
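
As a small sketch of the first item, a log transform turns exponential growth into a straight line. The data below are synthetic ($y = 2e^{0.3x}$, no noise), so the fitted line on the log scale recovers the growth rate 0.3 and intercept $\log 2$.

```python
import numpy as np

# synthetic exponential growth, invented for illustration
x = np.arange(1, 11, dtype=float)
y = 2.0 * np.exp(0.3 * x)

# linear correlation before and after the log transform
r_raw = np.corrcoef(x, y)[0, 1]
r_log = np.corrcoef(x, np.log(y))[0, 1]

# a straight-line fit on the log scale recovers the growth rate
slope, intercept = np.polyfit(x, np.log(y), 1)
print(r_raw, r_log, slope, intercept)
```

With noisy real data the relationship on the log scale is only approximately linear, but a linear model is often far more suitable there than on the raw scale.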

### Example: de-trending periodic sales

* Daily sales volume from an e-commerce product
* We would like to predict future sales based on current trends:
    - What seasonality (periodicity) is present?
    - What is long-run growth? (linear? quadratic? exponential?)
    - Can you guess a feature that models these properties?


In [None]:
df = pd.read_csv('data/sinusoidal.csv').sort_values(by='day').reset_index(drop=True)

In [None]:
df.plot(kind='scatter', x='day', y='units sold', title='daily sales volume');

### Example: de-trending periodic sales

* Periodic sales by week (7-day period).
* Sales have ~10x difference between low and high (amplitude).
* Sales is approximately 'linear growth + periodic term'
* Feature:
$$ \phi(x) = x + 5\sin\left(\frac{2\pi\cdot x}{7}\right) $$

In [None]:
def detrend(day):
    '''
    Periodic sales volume by the week.
    Sales sees ~10x weekly difference between low and high.
    '''
    return day + 5 * np.sin(2 * np.pi * day / 7)

In [None]:
df['detrend'] = detrend(df['day'])

In [None]:
df.set_index('day').sort_index().plot()

In [None]:
# feature space vs target space
# linear relationship!
df[['units sold', 'detrend']].plot(kind='scatter', x='detrend', y='units sold');

In [None]:
sns.lmplot(data=df, x='day', y='units sold')
sns.residplot(data=df, x='day', y='units sold', color='r')

In [None]:
sns.lmplot(data=df, x='detrend', y='units sold')
sns.residplot(data=df, x='detrend', y='units sold', color='r')
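
To put a number on what the residual plots show, one can compare the RMSE of a linear fit on `day` with a linear fit on the engineered feature. Since `data/sinusoidal.csv` is not reproduced here, the sketch below generates synthetic sales with the same assumed shape (linear growth plus a 7-day sine term, plus noise); `fit_rmse` is a helper introduced for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# synthetic stand-in for the sales data: linear growth + weekly cycle + noise
day = np.arange(1, 101, dtype=float)
units = day + 5 * np.sin(2 * np.pi * day / 7) + rng.normal(scale=0.5, size=day.size)

def detrend(day):
    """Engineered feature: linear growth plus a 7-day periodic term."""
    return day + 5 * np.sin(2 * np.pi * day / 7)

def fit_rmse(feature, target):
    """Fit a one-feature linear regression and return its training RMSE."""
    X = np.asarray(feature).reshape(-1, 1)
    model = LinearRegression().fit(X, target)
    return np.sqrt(np.mean((model.predict(X) - target) ** 2))

rmse_day = fit_rmse(day, units)
rmse_detrend = fit_rmse(detrend(day), units)
print(rmse_day, rmse_detrend)
```

On this synthetic data the fit on `day` leaves the whole weekly cycle in the residuals, while the fit on the engineered feature leaves roughly only the noise.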