# Regression Continued 
## Objectives 
- Scaling - chat about fit/transform 
- Use correlations and recursive algorithms to inform feature selection
- More Feature Engineering 
- Creating Interactions between features
- Use `PolynomialFeatures` to build compound features

In [None]:
#imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#this allows plots to appear directly in the notebook
%matplotlib inline
plt.style.use('fivethirtyeight')

#sklearn imports for feature selection, scaling and polynomial features
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

#importing data
wine = pd.read_csv('data/wine.csv')
wine.head(10)

Let's imagine that we're going to try to predict wine quality based on specific features about each wine. 

### Decisions, Decisions, Decisions...

Now: Which columns (predictors) should I choose? 

There are 12 predictors I could choose from. For each of these predictors, I could either use it or not use it in my model, which means that there are $2^{12} = 4096$ _different_ models I could construct! Well, okay, one of these is the "empty model" with no predictors in it. But there are still 4095 models from which I can choose.
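
As a quick sanity check on that count, here is a minimal sketch that adds up the number of non-empty subsets of 12 predictors by subset size. The variable names are only for illustration.

```python
from math import comb

n_predictors = 12  # number of candidate predictors, as stated above

# Count the non-empty subsets by size: C(12, 1) + C(12, 2) + ... + C(12, 12)
n_models = sum(comb(n_predictors, k) for k in range(1, n_predictors + 1))

print(n_models)  # 4095
```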

How can I decide which predictors to use in my model? Let's explore our options. 

1. A first attempt might be to check which features are most strongly _correlated_ with the target.

We can use the correlation coefficient to guide that decision.

In [None]:
plt.figure(figsize=(12,10))
#alternative way
#sns.set(rc={'figure.figsize':(8, 8)})
# annotate each cell of the heatmap with its correlation coefficient
ax = sns.heatmap(wine.corr(), annot=True);



In [None]:
# Let's look at the correlations with 'quality'
# (our dependent variable) in particular.

wine_corrs = wine.corr()['quality'].map(abs).sort_values(ascending=False)
wine_corrs

Let's try using only a subset of the strongest correlated features to make our model.

In [None]:
# Let's choose 'alcohol', 'density', 'volatile acidity' and 'chlorides'.

wine_preds = wine[['alcohol', 'density', 'volatile acidity', 'chlorides']]
wine_target = wine['quality']

In [None]:
lr = LinearRegression()

In [None]:
lr.fit(wine_preds, wine_target)

In [None]:
lr.score(wine_preds, wine_target)

### Let's try recursive feature elimination 

The idea behind recursive feature elimination is to start with all predictive features and then work down to a small set of features one step at a time, each time eliminating the feature whose coefficient is smallest in absolute value.

That is:

1. Start with a model with _all_ $n$ predictors
2. Find the predictor with the smallest effect (the coefficient closest to zero)
3. Throw that predictor out and build a model with the remaining $n-1$ predictors
4. Set $n = n-1$ and repeat until $n$ has the value you want!
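
The steps above can be sketched by hand. This is a minimal illustration on synthetic data, not the wine data, and `manual_rfe` is a hypothetical helper written for this example rather than anything from scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration, not the wine data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
# Only columns 0, 2 and 5 actually drive the target
y_demo = (3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + 1.5 * X_demo[:, 5]
          + rng.normal(scale=0.1, size=200))

# Put every column on the same scale so coefficients are comparable
X_demo_scaled = StandardScaler().fit_transform(X_demo)

def manual_rfe(X, y, n_keep):
    """Hypothetical helper: drop the weakest feature until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        model = LinearRegression().fit(X[:, remaining], y)
        # Position of the coefficient closest to zero among the remaining features
        weakest = int(np.argmin(np.abs(model.coef_)))
        remaining.pop(weakest)
    return remaining

kept = manual_rfe(X_demo_scaled, y_demo, n_keep=3)
print(kept)  # [0, 2, 5]
```

scikit-learn's `RFE` automates this loop and works with any estimator that exposes coefficients or feature importances.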

### But first... we should _scale_ our data
Coefficient sizes are only comparable when the features are on the same scale, so we scale before eliminating features. The idea behind `StandardScaler` is that it transforms your data so that its distribution has a mean of 0 and a standard deviation of 1. In the case of multivariate data (multiple features), this is done feature-wise (independently for each column of the data).
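
To make the fit/transform split concrete, here is a small sketch on toy numbers (made up for illustration). It compares `StandardScaler` with the same calculation done by hand in NumPy.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up): two columns on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

scaler = StandardScaler()
scaler.fit(X_toy)                        # fit: learn each column's mean and std
X_toy_scaled = scaler.transform(X_toy)   # transform: apply (x - mean) / std per column

# The same calculation by hand (StandardScaler uses the population std, ddof=0)
by_hand = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(scaler.mean_)                        # per-column means learned by fit
print(np.allclose(X_toy_scaled, by_hand))  # True
```

Keeping `fit` and `transform` separate means the statistics learned from one dataset can later be applied unchanged to another.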

In [None]:
ss = StandardScaler()
#what is the fit method doing?
ss.fit(wine.drop('quality', axis=1))
#what about transform?
wine_scaled = ss.transform(wine.drop('quality', axis=1))

In [None]:
#initializing a regression instance and RFE
lr_rfe = LinearRegression()
select = RFE(lr_rfe, n_features_to_select=3)

In [None]:
select.fit(X=wine_scaled, y=wine['quality'])

In [None]:
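# pair each predictor name (target column excluded) with whether RFE kept it
list(zip(wine.drop('quality', axis=1).columns, select.support_))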

In [None]:
# pair each predictor name with its RFE ranking (1 means selected)
list(zip(wine.drop('quality', axis=1).columns, select.ranking_))

## Feature Engineering 
> Remember: Domain knowledge can be helpful here! 🧠

In practice this aspect of data preparation can constitute a huge part of the data scientist's work. As we move into data modeling, much of the goal will be a matter of finding––**or creating**––features that are predictive of the targets we are trying to model.

There are infinitely many ways of transforming and combining a starting set of features. Good data scientists will have a nose for which engineering operations will be likely to yield fruit and for which operations won't. And part of the game here may be getting someone else on your team who understands what the data represent better than you!

**Let's do a bit of EDA and look at the chlorides column.**

In [None]:
#looking at the distribution 
wine['chlorides'].hist(bins=20);

In [None]:
wine.describe()

**We'll try building a feature that records whether the level of chlorides is greater than 0.065 (based on "high" being greater than the 75th percentile)**

In [None]:
wine['high_chlorides'] = wine['chlorides'] > 0.065
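
Rather than hard-coding the cutoff, the threshold could also be computed from the data. The sketch below uses a made-up Series of chloride readings (not the wine data) to show the idea with pandas' `quantile`.

```python
import pandas as pd

# Toy chloride readings (made-up values, not the wine data)
chlorides = pd.Series([0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.20])

threshold = chlorides.quantile(0.75)  # 75th percentile of the column
high = chlorides > threshold          # boolean feature: True above the cutoff

print(threshold)
print(high.sum())  # number of readings flagged as "high"
```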

**Now we can check the correlation of this new feature with the target**

In [None]:
wine.corr()['quality']['high_chlorides']

Not bad! We don't seem to have stumbled onto a huge connection here, but this correlation value suggests that this new feature may be helpful in a final model.

## Interactions - Products of features 
Another engineering strategy we might try is **multiplying features together**.
Let's try these two features: `residual sugar` and `total sulfur dioxide`. Note that without domain knowledge or exploration, this is really a guess that this combination will predict `quality` well.

In [None]:
wine['rs*tsd'] = wine['residual sugar'] * wine['total sulfur dioxide']

In [None]:
wine.corr()['quality']['rs*tsd']

In [None]:
wine.corr()['quality']['residual sugar']

In [None]:
wine.corr()['quality']['total sulfur dioxide']

We can see that the product of these two features has a stronger correlation with `quality` than either feature by itself!

## Polynomial Features

Instead of just multiplying features at random, we might consider trying **every possible product of features**. That's what `PolynomialFeatures` does: it generates every product of features up to the specified degree, including powers of each individual feature.

In [None]:
pf = PolynomialFeatures(degree=3)

X = wine.drop('quality', axis=1)
y = wine['quality']

# Fitting the PolynomialFeatures object
pf.fit(X)

In [None]:
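# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists since 1.0
pdf = pd.DataFrame(pf.transform(X), columns=pf.get_feature_names_out())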
pdf

In [None]:
lr = LinearRegression()

lr.fit(pdf, y)

In [None]:
lr.score(pdf, y)

So: Is this a good idea? What are the potential dangers here?
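
One danger worth checking is overfitting: a score computed on the same rows used for fitting can look far better than the model deserves. The sketch below is only an illustration on synthetic data (not the wine data); it holds out some rows and compares the two scores.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption for illustration): only two features matter
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(60, 5))
y_syn = X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=0.5, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=20, random_state=0
)

pf_syn = PolynomialFeatures(degree=3)
X_train_poly = pf_syn.fit_transform(X_train)  # fit on the training rows only
X_test_poly = pf_syn.transform(X_test)        # reuse the same expansion

model = LinearRegression().fit(X_train_poly, y_train)
train_r2 = model.score(X_train_poly, y_train)
test_r2 = model.score(X_test_poly, y_test)

print(X_train_poly.shape)  # (40, 56): more columns than training rows
print(train_r2, test_r2)   # compare the fit on seen vs. unseen rows
```

With more polynomial columns than training rows, the model can reproduce the training targets almost exactly, so the held-out score is the more honest one.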