# Feature Expansion 
## Polynomials and Interactions

---

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

plt.rcParams['figure.figsize'] = (10,6)

#### Let's look again at the Anscombe dataset

In [2]:
df = sns.load_dataset('anscombe')

In [3]:
df

Unnamed: 0,dataset,x,y
0,I,10.0,8.04
1,I,8.0,6.95
2,I,13.0,7.58
3,I,9.0,8.81
4,I,11.0,8.33
5,I,14.0,9.96
6,I,6.0,7.24
7,I,4.0,4.26
8,I,12.0,10.84
9,I,7.0,4.82


In [None]:
# lmplot

In [None]:
# check the statistics of the different datasets

#### Questions to reflect on the fitted models: 
- **Q1**: Check out the means of the different datasets

- **Q2**: What does it mean to say "they are the same linear models"?

- **Q3**: Do the models fit the data equally well?

- **Q4**: Are there obvious ways to fix the models?
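
As a starting point for Q1 and Q2, here is a minimal sketch that compares only datasets I and II. To stay self-contained it hardcodes the published Anscombe values instead of calling `sns.load_dataset`; the names `demo`, `summary` and `fits` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Published values for Anscombe datasets I and II (they share the same x)
x_vals = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_i = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_ii = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

demo = pd.DataFrame({
    'dataset': ['I'] * 11 + ['II'] * 11,
    'x': x_vals + x_vals,
    'y': y_i + y_ii,
})

# Mean and standard deviation of x and y for each dataset
summary = demo.groupby('dataset')[['x', 'y']].agg(['mean', 'std'])
print(summary.round(2))

# Least-squares line per dataset; np.polyfit returns [slope, intercept] for deg=1
fits = {
    name: np.polyfit(group['x'], group['y'], deg=1)
    for name, group in demo.groupby('dataset')
}
for name, (slope, intercept) in fits.items():
    print(f"{name}: y = {intercept:.2f} + {slope:.2f} * x")
```

If the two printed lines agree to two decimals, that is one concrete reading of "the same linear model": the same slope and intercept, even though the scatter plots look very different.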

#### ...and try to fix the model for the second dataset

In [None]:
# .copy() so that adding columns later does not trigger SettingWithCopyWarning
df2 = df.loc[df['dataset']=='II',['x','y']].copy()
df2

In [None]:
sns.scatterplot(x = df2['x'], y = df2['y'])

#### Save our X and y data

In [None]:
y = df2['y']
# .copy() so that X can get new feature columns without touching df2
X = df2[['x']].copy()

### Fit a Linear Regression model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
m = LinearRegression()

In [None]:
m.fit(X,y)

In [None]:
round(m.score(X,y),2)

In [None]:
y_pred = m.predict(X)

In [None]:
df2['y_pred'] = y_pred

In [None]:
sns.scatterplot(data = df2, x = 'x', y = 'y', label = 'y')
sns.lineplot(data = df2, x = 'x', y = 'y_pred', label = 'y pred')

#### Now let's try to fix it by adding polynomials
- For our `x`, define `x^2`

## Polynomials

- Extra features that are **powers** of an existing feature.
- A polynomial is a sum of powers of x, each multiplied by its own coefficient:

$$
a_0 x^0 + a_1 x^1 + a_2 x^2 + \dots
$$

- Can improve the fit of your model
- Also increases the risk of overfitting
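
To make the overfitting risk concrete, the sketch below fits polynomials of increasing degree to Anscombe dataset I (hardcoded from the published values) and reports R² on the same data used for fitting. The helper `train_r2` and the variable names are hypothetical, and `np.polyfit` stands in for the `LinearRegression` workflow used in this notebook.

```python
import numpy as np

# Published values for Anscombe dataset I (a noisy, roughly linear relationship)
x_i = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y_i = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

def train_r2(x, y, degree):
    """R^2 of a least-squares polynomial, scored on the data it was fitted to."""
    coefs = np.polyfit(x, y, deg=degree)
    ss_res = np.sum((y - np.polyval(coefs, x)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Training R^2 for degrees 1, 2 and 3
scores = {degree: train_r2(x_i, y_i, degree) for degree in (1, 2, 3)}
for degree, r2 in scores.items():
    print(f"degree {degree}: training R^2 = {r2:.3f}")
```

Training R² cannot go down when a higher power is added, because the lower-degree model is a special case of the higher-degree one. A better training score is therefore not evidence of a better model; held-out data is needed to judge that.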

In [None]:
X['x^2'] = X['x']**2

In [None]:
X.head()

### Let's fit again with both features

In [None]:
m.fit(X,y)

In [None]:
y_pred_poly = m.predict(X)

In [None]:
df2['y_pred_poly'] = y_pred_poly

In [None]:
sns.scatterplot(x = df2['x'], y = df2['y'], label = 'y')
sns.lineplot(x = df2['x'], y = df2['y_pred_poly'], label = 'y_pred with polynomials' )

In [None]:
m.score(X,y)

---

#### As usual... this is something scikit-learn can do for us

## Polynomial Features with Sklearn 

In [None]:
from sklearn.preprocessing import PolynomialFeatures

##### Create a polynomial feature transformer and specify the degree

In [None]:
pt = PolynomialFeatures(degree=3, include_bias=True, interaction_only=False)

In [None]:
pt.fit(X[['x']])

In [None]:
p_features = pt.transform(X[['x']])

In [None]:
p_features

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pt.get_feature_names_out()

##### Fit transform the data in question, and look at it in a DF with column names

In [None]:
pd.DataFrame(p_features, columns=pt.get_feature_names_out())
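
One common way to combine the transformer with the regression is a scikit-learn pipeline, so the feature expansion and the fit happen in a single `fit` call. The sketch below is one possible arrangement, not the only one; it hardcodes the published values of Anscombe dataset II, and the names `x_ii`, `y_ii` and `poly_model` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Published values for Anscombe dataset II, with x as a single-column 2D array
x_ii = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float).reshape(-1, 1)
y_ii = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

# include_bias=False because LinearRegression already fits its own intercept
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly_model.fit(x_ii, y_ii)

# R^2 on the training data, plus the fitted coefficients for x and x^2
r2 = poly_model.score(x_ii, y_ii)
coefs = poly_model[-1].coef_
print(round(r2, 3), coefs)
```

The pipeline applies the same expansion inside `predict`, so new `x` values can be passed in directly without building the `x^2` column by hand.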

---

# Interaction terms
* If our X data has 2 features, $x_1$ and $x_2$, then a degree-2 polynomial expansion produces the terms:
* $1, x_1, x_2, x_1^2, x_1 x_2, x_2^2$
* Each of these gets its own coefficient
* To keep only the interaction terms (dropping pure powers such as $x_1^2$ and $x_2^2$), use `PolynomialFeatures` with `interaction_only=True`

In [None]:
X

In [None]:
PolynomialFeatures()

In [None]:
pt = PolynomialFeatures(interaction_only=True, include_bias=False)
p_features = pt.fit_transform(X)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(p_features, columns=pt.get_feature_names_out())

---