In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

import util

# set defaults
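# seaborn custom plot style; the style was renamed in newer matplotlib releases
style = 'seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white'
plt.style.use(style)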
plt.rc('figure', dpi=100, figsize=(7, 5))   # set default size/resolution
plt.rc('font', size=12)   # font size

# Feature Engineering


## Outline
* Overview: Modeling and Estimation
* Designing Features for your Model
* Different Features for different Data Types

## What have we done so far?

* Data assessment and collection:
    * The data generating process and its relationship to observed data.
    * Data collection techniques (web scraping, APIs)
* Data cleaning and manipulation
    * Pandas and Regex
* Understanding and summarizing data
    * Smoothing techniques, visualization, TF-IDF

## Features

* A **feature** is a measurable property or characteristic of a phenomenon being observed.
* Synonyms: (explanatory) variable, attribute
* Examples include:
    - a column of a dataset.
    - a derived value from a dataset, perhaps using additional information.
    
We have been creating features to summarize data!

### Examples of features in SD salary dataset

* Salary of employee
* Employee salaries, standardized by job status (PT/FT)
* Gender/age of employees (derived from SSA names; accurate?)
* Job Family associated to a job title (uses text-techniques)

## What makes a good feature?

* Fidelity to the Data Generating Process (Consistency).
* Strongly associated with the phenomenon of interest ("contains information").
* Easily used in standard modeling techniques (e.g. quantitative and scaled).

Datasets often come with weak attributes; features may need to be "engineered" to convey information.
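As a rough illustration of a "quantitative and scaled" engineered feature, the sketch below standardizes a salary column within each job status. The table, its column names (`status`, `salary`, `salary_z`) and its values are made up for this example; they are not the SD salary dataset.

```python
import pandas as pd

# Hypothetical toy data: column names and values are invented for illustration.
salaries = pd.DataFrame({
    'status': ['FT', 'FT', 'FT', 'PT', 'PT', 'PT'],
    'salary': [50000.0, 60000.0, 70000.0, 10000.0, 20000.0, 30000.0],
})

# Standardize salary within each job status: subtract the group mean and
# divide by the group standard deviation (pandas uses the sample std).
grouped = salaries.groupby('status')['salary']
salaries['salary_z'] = (
    (salaries['salary'] - grouped.transform('mean')) / grouped.transform('std')
)

print(salaries)
```

After this transformation, full-time and part-time salaries are on a comparable scale: each value says how many standard deviations an employee is from the typical salary for their own status.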

## Feature Engineering

* We already engineered features to summarize and understand data.
    - smoothing, transformations, ad hoc derived properties of data

* What can we do with engineered features?
    - Visualization and summarization
    - Modeling (prediction; inference)

# Modeling: an Overview

Slides: DS100 (Joseph E. Gonzalez)

<img src="imgs/image_0.png">

<img src="imgs/image_1.png">

<img src="imgs/image_2.png">

## Provide insight into complex phenomena
<img src="imgs/image_3.png">

<img src="imgs/image_4.png">

<img src="imgs/image_5.png">

<img src="imgs/image_6.png">

<img src="imgs/image_7.png">

<img src="imgs/image_8.png">

### Example: Restaurant Tips

* Data: collected by a single waiter over a month
* Why build a model?
    - Predict which tables will tip the highest? (Optimize your service)
    - Predict a waiter's income for the year.
    - Understand relationship between tables and tips.

In [None]:
tips = sns.load_dataset('tips')
print('number of records: ', len(tips))

In [None]:
tips.head()

### Restaurant tips: EDA

In [None]:
fig, axes = plt.subplots(1,2, figsize=(12,5))

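# distplot is deprecated in recent seaborn; histplot + rugplot draws a comparable figure
sns.histplot(tips.total_bill, kde=True, stat='density', ax=axes[0])
sns.rugplot(tips.total_bill, ax=axes[0])
axes[0].set_xlabel('total bill')

sns.histplot(tips.tip, kde=True, stat='density', ax=axes[1])
sns.rugplot(tips.tip, ax=axes[1])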
axes[1].set_xlabel('tip in dollars')

fig.suptitle('Understanding tips');

#### Observations:
|Total Bill|Tip Amount|
|---|---|
|Right skewed|Right skewed|
|Mode around \$15|Mean around \$3|
|Mean around \$20|Possibly bimodal (?)|
|No large bills|Large outliers (?)|

<img src="imgs/image_9.png">

<img src="imgs/image_10.png">

<img src="imgs/image_11.png">

<img src="imgs/image_12.png">

# Features in Linear Models

## Predicting child heights

* Recall Francis Galton's obsession with understanding inheritance.
* He wanted to predict a child's *height* from attributes of their parents.
    - columns in the dataset: family id, father height, mother height, number of children, gender, child height.

In [None]:
galton = pd.read_csv('data/galton.csv')
galton.head()

### Heights data: quick EDA
* What could be done to improve this visualization?
* Is a linear model suitable for prediction? On which attributes?
* The data has multiple granularities (which ones?); is this a problem?

In [None]:
pd.plotting.scatter_matrix(galton, figsize=(12,8));

### Attempt #1: Predict child's height using father's height

1. Plot a scatterplot with a best-fit line and its confidence band

In [None]:
sns.lmplot(x='father', y='childHeight', data=galton);

### Attempt #1: Predict child's height using father's height

Let's do the prediction "by hand":

* Recall that a prediction is a function $pred$ from the *features* (father height) to the *target* (child height).
* The quality of our prediction on the dataset is measured by the *root mean square error* (RMSE): $${\rm RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(pred(x_i) - y_i)^2} $$ where $n$ is the number of observations, $x_i$ are the father heights, $pred(x_i)$ are the predicted child heights, and $y_i$ are the *actual* child heights.
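A minimal sketch of the RMSE as a reusable function, taken as the square root of the mean squared difference between predictions and actual values. The helper name `rmse` and the heights below are made up for illustration; they are not taken from the Galton data.

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean square error between predicted and actual values."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual) ** 2))

# Made-up heights in inches, for illustration only.
actual = np.array([64.0, 66.0, 70.0, 72.0])
predicted = np.array([65.0, 67.0, 69.0, 71.0])

# Every prediction is off by exactly one inch, so the RMSE is one inch.
print(rmse(predicted, actual))
```

Because of the mean, the RMSE is in the same units as the target (inches here) and does not grow just because the dataset has more rows.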

In [None]:
from scipy.stats import linregress

lm = linregress(galton.father, galton.childHeight)
lm

In [None]:
def pred_height(x):
    """Predicted child height for father height(s) x, using the fitted line."""
    return lm.slope * x + lm.intercept

In [None]:
pred_height(60)

In [None]:
pred_height(galton.father).head()

In [None]:
# RMSE: square root of the *mean* squared error (not the sum)
rmse = np.sqrt(np.mean((pred_height(galton.father) - galton.childHeight)**2))
rmse

### Visualizing the predictions
* In what ways is our model good? In what ways is it bad?
    - Is a linear model appropriate? Does it fit well?
    - How might we make it better?

In [None]:
# what is this code doing?
eval_data = (
    pd.concat(
        [galton[['father', 'childHeight']], pred_height(galton.father).rename('prediction')], 
        axis=1
    ).set_index('father')
    .unstack()
    .rename('height')
    .reset_index()
    .rename(columns={'level_0':'type'})
)

eval_data.sample(10)

#### Questions
* In what ways is our model good? In what ways is it bad?
* Is a linear model appropriate? Does it fit well?
    - How might we make it better?

In [None]:
sns.scatterplot(
    data=eval_data,
    x='father', y='height',
    hue='type'
);

### Attempt #2: adding features

* What if the father is very tall and the mother is short?
* Will adding mother's height help our predictions?
* Try: regression on two variables (mother/father height).
    - "plane of best fit"
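To make the "plane of best fit" concrete, here is a small sketch that fits an intercept and two weights by least squares using NumPy. The data are synthetic and the "true" coefficients (20, 0.3, 0.4) are arbitrary assumptions for this illustration; the Galton data will give different values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic parent heights in inches; this is not the Galton data.
mother = rng.uniform(58, 70, size=50)
father = rng.uniform(62, 78, size=50)

# Assumed "true" plane, chosen arbitrarily and noise-free for this sketch.
child = 20.0 + 0.3 * mother + 0.4 * father

# Design matrix: a column of ones (intercept) plus one column per feature.
X = np.column_stack([np.ones_like(mother), mother, father])

# Least squares finds the weights that minimize the squared prediction error.
weights, *_ = np.linalg.lstsq(X, child, rcond=None)
intercept, w_mother, w_father = weights
print(intercept, w_mother, w_father)
```

Each prediction is `intercept + w_mother * mother + w_father * father`, which traces out a plane over the (mother, father) grid instead of a line.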

In [None]:
# use sklearn
from sklearn.linear_model import LinearRegression

In [None]:
# typical pattern; focus on this later!

lr = LinearRegression() # initialize the linear regression model

lr.fit(galton[['mother', 'father']], galton.childHeight) # calculate the weights

predictions = lr.predict(galton[['mother', 'father']]) # calculate predictions

In [None]:
# how good is the prediction?

# RMSE: square root of the *mean* squared error
np.sqrt(np.mean((predictions - galton.childHeight)**2))

In [None]:
util.plot3Dscatter(galton, 'mother', 'father', lr, galton['childHeight'])

In [None]:
# plot results by father height
util.plot_eval_scatter(galton, pd.Series(predictions), galton['childHeight'], 'father')

In [None]:
# plot results by mother height
util.plot_eval_scatter(galton, pd.Series(predictions), galton['childHeight'], 'mother')

### Attempt #3: adding gender to the regression

* Our previous predictions are constant for a given set of parents.
* One would expect male/female children of the same parents to have different heights!
* Is it reasonable to add this attribute? Is it known when the prediction is used?

First plot a scatterplot of 'father height' vs 'child height' by group:

In [None]:
# The regression lines (predictions) are very different for male/female
sns.lmplot(x='father', y='childHeight', data=galton, hue='gender');

### Attempt #3: adding gender to the regression

* Problem: gender is *categorical*, while regression requires *quantitative* inputs!
    - The table contains two values in the column: male/female
* Solution: create a binary column called `gender=male` that:
    - is 1 when `gender` has value male, and
    - is 0 otherwise
    
This is a simple example of *one-hot encoding*.
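The same idea extends to columns with more than two categories: one binary column per category value. The sketch below uses `pd.get_dummies` on a made-up `day` column; the data and the `=` separator in the column names are illustrative choices, not part of the Galton example.

```python
import pandas as pd

# Hypothetical categorical column with three distinct values.
df = pd.DataFrame({'day': ['Thur', 'Fri', 'Sat', 'Fri']})

# One binary indicator column per category value, named like `day=Fri`.
one_hot = pd.get_dummies(df['day'], prefix='day', prefix_sep='=').astype(int)

print(one_hot)
```

Every row has exactly one 1 across the indicator columns, which is where the name "one-hot" comes from.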

In [None]:
galton['gender=male'] = (galton.gender == 'male').astype(int)
galton.head()

In [None]:
lr_gender = LinearRegression()
lr_gender.fit(galton[['father', 'mother', 'gender=male']], galton.childHeight)

In [None]:
predictions_gender = lr_gender.predict(galton[['father', 'mother', 'gender=male']])

In [None]:
# RMSE: square root of the *mean* squared error
np.sqrt(np.mean((predictions_gender - galton.childHeight)**2))

In [None]:
# plot results by father height
util.plot_eval_scatter(galton, pd.Series(predictions_gender), galton['childHeight'], 'father')

### Visualizing regression with one-hot encoding

* One-hot encoding "pulls the two genders apart" in the scatterplot, along a third dimension.
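One way to read the fitted model: the weight on the binary column acts as a constant vertical offset between two parallel lines, one per group. The sketch below checks this on synthetic data, where the 5-inch offset and the other coefficients are arbitrary assumptions rather than estimates from the Galton data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (not the Galton data): father heights and a 0/1 indicator.
father = rng.uniform(62, 78, size=40)
is_male = rng.integers(0, 2, size=40)
is_male[:2] = [0, 1]  # make sure both groups are present

# Assumed relationship, chosen arbitrarily: the male group is shifted up by 5.
child = 30.0 + 0.5 * father + 5.0 * is_male

# Fit the intercept, the slope, and the weight on the indicator by least squares.
X = np.column_stack([np.ones_like(father), father, is_male])
(intercept, slope, male_offset), *_ = np.linalg.lstsq(X, child, rcond=None)

# Same father height, different indicator value: predictions differ by male_offset.
pred_female = intercept + slope * 70.0
pred_male = intercept + slope * 70.0 + male_offset
print(pred_male - pred_female)
```

Note that this model forces both groups to share the same slope; the per-group lines drawn by `sns.lmplot(..., hue='gender')` are fitted separately and may have different slopes.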

In [None]:
# The regression lines (predictions) are very different for male/female
sns.lmplot(x='father', y='childHeight', data=galton, hue='gender');

In [None]:
lr_gender_2 = LinearRegression()
lr_gender_2.fit(galton[['gender=male', 'father']], galton.childHeight)


In [None]:
util.plot3Dscatter(galton, 'gender=male', 'father', lr_gender_2, galton['childHeight'])