# Chapter 7 Moving Beyond Linearity
Linearity is almost always an approximation, so we often need more flexible models.

## Polynomial Regression
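+ Extend the linear model with polynomial terms (e.g. $X^2, X^3, \ldots$). 
+ It is still linear in the parameters, but it can model non-linear relationships. 
+ In practice, polynomial terms of degree higher than 3 or 4 are rarely used, because the curve becomes overly flexible, especially near the boundaries of $X$.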
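
To see why a polynomial model is still linear in its parameters, the sketch below builds the columns $1, x, x^2, x^3$ by hand and solves an ordinary least-squares problem with NumPy. The data and the names `x`, `y`, `X_poly`, and `beta_hat` are made up for illustration and are not part of the `Wage` analysis.

```python
import numpy as np

# Synthetic data (an assumption for illustration): a noiseless cubic in x
x = np.linspace(-2, 2, 50)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.25 * x**3

# Design matrix with columns 1, x, x^2, x^3; the model is linear in beta
X_poly = np.vander(x, N=4, increasing=True)

# Ordinary least squares on the expanded features
beta_hat, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(np.round(beta_hat, 3))
```

Because the powers of `x` are just extra columns, any linear-model routine can fit the curve without modification.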



In [None]:
from IPython.display import Image
Image('images/pw52.png', width =700)

In [None]:
Image('images/pw53.png', width =700)

In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

from sklearn.preprocessing import PolynomialFeatures
import statsmodels.api as sm

%matplotlib inline

Let's explore how to fit these models to the `Wage` dataset:

In [None]:
df = pd.read_csv('data/wage.csv')
df.head(3)

+ We first fit the polynomial regression model using the following commands:

In [None]:
X1 = PolynomialFeatures(1).fit_transform(df.age.values.reshape(-1,1))
X2 = PolynomialFeatures(2).fit_transform(df.age.values.reshape(-1,1))
X3 = PolynomialFeatures(3).fit_transform(df.age.values.reshape(-1,1))
X4 = PolynomialFeatures(4).fit_transform(df.age.values.reshape(-1,1))
X5 = PolynomialFeatures(5).fit_transform(df.age.values.reshape(-1,1))

+ This syntax uses the `PolynomialFeatures()` transformer to build design matrices for polynomials in `age` of degree 1 through 5; `X4`, for example, holds the columns $1, \text{age}, \ldots, \text{age}^4$ that we use to predict wage. 
+ `PolynomialFeatures()` lets us avoid writing out a long formula with powers of `age`. 
+ We can then fit our linear model:

In [None]:
fit2 = sm.GLS(df.wage, X4).fit()
fit2.summary().tables[1]

+ Next, consider the task of predicting whether an individual earns more than \$250,000 per year. 
+ First, create the appropriate response vector, and then fit a logistic model using the `GLM()` function from `statsmodels`:

In [None]:
# Create the binary response vector: 1 if wage > 250 (thousand dollars), else 0
y = (df.wage > 250).astype(int).to_numpy()

# Fit logistic model (logit is the default link for the Binomial family)
clf = sm.GLM(y, X4, family=sm.families.Binomial())
res = clf.fit()

+ Create a grid of values for `age` at which we want predictions, and then call the generic `predict()` function for each model:

In [None]:
# Generate a sequence of age values spanning the range
age_grid = np.arange(df.age.min(), df.age.max()).reshape(-1,1)

# Generate test data
X_test = PolynomialFeatures(4).fit_transform(age_grid)

# Predict the value of the generated ages
pred1 = fit2.predict(X_test) # salary
pred2 = res.predict(X_test)  # Pr(wage>250)

+ Finally, plot the data and add the fit from the degree-4 polynomial.

In [None]:
# creating plots
fig, (ax1, ax2) = plt.subplots(1,2, figsize = (16,5))
fig.suptitle('Degree-4 Polynomial', fontsize=14)

# Scatter plot with polynomial regression line
ax1.scatter(df.age, df.wage, facecolor='None', edgecolor='k', alpha=0.3)
ax1.plot(age_grid, pred1, color = 'b')
ax1.set_ylim(ymin=0)

# Logistic regression showing Pr(wage>250) for the age range.
ax2.plot(age_grid, pred2, color='b')

# Rug plot showing the distribution of wage>250 in the training data.
# 'True' on the top, 'False' on the bottom.
ax2.scatter(df.age, y/5, s=30, c='grey', marker='|', alpha=0.7)

ax2.set_ylim(-0.01,0.21)
ax2.set_xlabel('age')
ax2.set_ylabel('Pr(wage>250|age)')

### Deciding on a degree

+ In performing a polynomial regression we must decide on the degree of the polynomial to use. One way to do this is by using hypothesis tests. 
+ Now fit models ranging from linear to a degree-5 polynomial and  determine the simplest model which is sufficient to explain the relationship between `wage` and `age`.
+ Do this using the `anova_lm()` function, which performs an analysis of variance (ANOVA, using an F-test) in order to test the null hypothesis that a model $M_1$ is sufficient to explain the data against the  alternative hypothesis that a more complex model $M_2$ is required. 
+ In order to use the `anova_lm()` function, $M_1$ and $M_2$ must be **nested models**: the predictors in $M_1$ must be a subset of the predictors in $M_2$. 
+ In this case, we fit five different models and sequentially compare each simpler model to the next more complex one.

(*Note:* you may get an *invalid value* `RuntimeWarning` on the first model, because there is no "simpler model" to compare it to):

In [None]:
# With no covariance structure supplied, GLS reduces to ordinary least squares
fit_1 = sm.GLS(df.wage, X1).fit()
fit_2 = sm.GLS(df.wage, X2).fit()
fit_3 = sm.GLS(df.wage, X3).fit()
fit_4 = sm.GLS(df.wage, X4).fit()
fit_5 = sm.GLS(df.wage, X5).fit()

print(sm.stats.anova_lm(fit_1, fit_2, fit_3, fit_4, fit_5, typ=1))

+ The $p$-value comparing the linear Model 1 to the quadratic Model 2 is essentially zero $(<10^{-32})$, indicating that a linear fit is not sufficient. 
+ Similarly the $p$-value comparing the quadratic Model 2 to the cubic Model 3 is very low (0.0017), so the quadratic fit is also insufficient. 
+ The $p$-value comparing the cubic and degree-4 polynomials, Model 3 and Model 4, is approximately 0.05, while the degree-5 polynomial Model 5 seems unnecessary because its $p$-value is 0.37. 
+ Hence, either a cubic or a quartic polynomial appears to provide a reasonable fit to the data, but lower- or higher-order models are not justified.
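
The F-statistic behind each row of such an ANOVA table can be computed by hand from the residual sums of squares of two nested fits. The sketch below does this on synthetic data; the helper names `rss_and_df` and `nested_f_stat` and the simulated quadratic relationship are assumptions made for illustration, not part of the `Wage` analysis.

```python
import numpy as np

def rss_and_df(X, y):
    """Residual sum of squares and residual degrees of freedom of an OLS fit.

    Assumes X has full column rank.
    """
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid), X.shape[0] - X.shape[1]

def nested_f_stat(X_small, X_big, y):
    """F-statistic for H0: the smaller of two nested models is sufficient."""
    rss1, df1 = rss_and_df(X_small, y)
    rss2, df2 = rss_and_df(X_big, y)
    return ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)

# Synthetic example (assumption): the truth is quadratic plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 200)
y = 1 + x + 2 * x**2 + rng.normal(scale=0.5, size=200)

X_lin = np.vander(x, N=2, increasing=True)   # 1, x
X_quad = np.vander(x, N=3, increasing=True)  # 1, x, x^2
X_cub = np.vander(x, N=4, increasing=True)   # 1, x, x^2, x^3

f_12 = nested_f_stat(X_lin, X_quad, y)   # large: the linear fit is not enough
f_23 = nested_f_stat(X_quad, X_cub, y)   # much smaller: the cubic term adds little
print(f_12, f_23)
```

A large statistic is evidence against the simpler model; the $p$-value comes from comparing it with an F distribution on `df1 - df2` and `df2` degrees of freedom.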

## Step Functions
+ Also known as piecewise constant regression.
+ Choose $K$ cut points $c_1, \ldots, c_K$ in the range of $X$, which split it into $K+1$ regions, and fit a constant to each region. 
$$y_i = \beta_0 + \beta_1C_1(x_i) + \beta_2C_2(x_i) + \ldots + \beta_KC_K(x_i) + \epsilon _i$$
where
$$\begin{align*}
 C_0(X) &= I(X<c_1) \\
 C_1(X) &= I(c_1 \leq X < c_2) \\
 & \vdots \\
 C_{K-1}(X) &= I(c_{K-1} \leq X < c_K) \\
 C_{K}(X) &= I(X \geq c_K) 
\end{align*}$$
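+ For an observation in region $k \geq 1$ the prediction reduces to $\hat{y} = \hat{\beta}_0 + \hat{\beta}_k$; $C_0(X)$ is left out of the model because the indicators always sum to one. 
+ $\hat{\beta}_0$ is the mean of $y$ in the region before the first cut point, and $\hat{\beta}_k$ is the difference between the mean in region $k$ and that baseline. 
+ The same approach works for logistic regression, giving a flat probability estimate for each region.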
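
The interpretation of the coefficients can be checked numerically. This minimal sketch, on simulated data with two hypothetical cut points, fits the indicator model by least squares and compares the fitted levels with the plain per-region means.

```python
import numpy as np

# Simulated data (assumption): three flat levels plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300)
y = np.where(x < 3, 1.0, np.where(x < 7, 4.0, 2.0)) + rng.normal(scale=0.3, size=300)

cuts = np.array([3.0, 7.0])      # hypothetical cut points c_1, c_2
region = np.digitize(x, cuts)    # 0 if x < c_1, 1 if c_1 <= x < c_2, 2 if x >= c_2

# Intercept plus one indicator per region after the first (C_1 and C_2)
X_step = np.column_stack(
    [np.ones_like(x)] + [(region == k).astype(float) for k in (1, 2)]
)
beta, *_ = np.linalg.lstsq(X_step, y, rcond=None)

# beta[0] is the mean of region 0; beta[0] + beta[k] is the mean of region k
fitted_levels = np.array([beta[0], beta[0] + beta[1], beta[0] + beta[2]])
region_means = np.array([y[region == k].mean() for k in range(3)])
print(fitted_levels, region_means)
```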



### Example
+ In order to fit a step function, we use the pandas `cut()` function:

In [None]:
df_cut, bins = pd.cut(df.age, 4, retbins = True, right = True)
df_cut.value_counts(sort = False)

+ Here `cut()` automatically picked the cutpoints at 33.5, 49, and 64.5 years of age. 
+ We could also have specified our own cutpoints directly. 
+ Now let's create a set of dummy variables for use in the regression:

In [None]:
df_steps = pd.concat([df.age, df_cut, df.wage], keys = ['age','age_cuts','wage'], axis = 1)

# Create dummy variables for the age groups
df_steps_dummies = pd.get_dummies(df_steps['age_cuts'])

# Statsmodels requires explicit adding of a constant (intercept)
df_steps_dummies = sm.add_constant(df_steps_dummies)

# Drop the (17.938, 33.5] category
df_steps_dummies = df_steps_dummies.drop(df_steps_dummies.columns[1], axis = 1)

df_steps_dummies.head(5)

+ Now to fit the models. 
+ We dropped the `age<33.5` category, so the intercept coefficient can be interpreted as the average salary for those under 33.5 years of age. 
+ The other coefficients can be interpreted as the average additional salary for those in the other age groups. 

In [None]:
fit3 = sm.GLM(df_steps.wage.to_numpy(), df_steps_dummies.astype(int).to_numpy()).fit()
fit3.summary().tables[1]

+ The intercept coefficient of about 94.16 corresponds to an average salary of roughly \$94,160 for those under 33.5 years of age, since `wage` is recorded in thousands of dollars. 
+ Each remaining coefficient is the average additional salary of that age group relative to this baseline.

In [None]:
# Put the test data in the same bins as the training data.
bin_mapping = np.digitize(age_grid.ravel(), bins)

# Get dummies, drop first dummy category, add constant
X_test2 = sm.add_constant(pd.get_dummies(bin_mapping).drop(1, axis = 1)).astype(int).to_numpy()

# Predict the value of the generated ages using the linear model
pred2 = fit3.predict(X_test2)

# And the logistic model
clf2 = sm.GLM(y, df_steps_dummies.astype(int).to_numpy(),
              family=sm.families.Binomial())  # logit is the default link
res2 = clf2.fit()
pred3 = res2.predict(X_test2)

# Plot
fig, (ax1, ax2) = plt.subplots(1,2, figsize = (12,5))
fig.suptitle('Piecewise Constant', fontsize = 14)

# Scatter plot with the step-function (piecewise constant) fit
ax1.scatter(df.age, df.wage, facecolor = 'None', edgecolor = 'k', alpha = 0.3)
ax1.plot(age_grid, pred2, c = 'b')

ax1.set_xlabel('age')
ax1.set_ylabel('wage')
ax1.set_ylim(ymin = 0)

# Logistic regression showing Pr(wage>250) for the age range.
ax2.plot(age_grid, pred3, color = 'b')

# Rug plot showing the distribution of wage>250 in the training data.
# 'True' on the top, 'False' on the bottom.
ax2.scatter(df.age, y/5, s = 30, c = 'grey', marker = '|', alpha = 0.7)

ax2.set_ylim(-0.01, 0.21)
ax2.set_xlabel('age')
ax2.set_ylabel('Pr(wage>250|age)')

## Basis functions
+ Polynomial terms and step functions are both special cases of a basis-function approach. 
+ A basis function, $b_k(X)$, is a fixed, known transformation of $X$; the model is then linear in the transformed variables: 
$$y_i = \beta_0 + \beta_1b_1(x_i) + \beta_2b_2(x_i) + \ldots + \beta_Kb_K(x_i) + \epsilon _i$$
+ For polynomial regression, the basis functions are powers of $X$, $b_k(x) = x^k$; for step functions, each $b_k(x)$ is an indicator variable that equals 1 if $x$ falls in region $k$ and 0 otherwise. 
+ Wavelets and Fourier series are also basis functions.

## Regression Splines
+ Combining piecewise constant regression and polynomial regression

### Piecewise polynomials
+ Fit separate low-degree polynomials over different regions of $X$. 
+ A piecewise cubic polynomial, for example, fits a cubic regression model 
$$y_i = \beta_0 + \beta_1 x_i + \beta_2x_i^2 + \beta_3 x_i^3 + \epsilon_i$$
where the coefficients differ in different parts of the range of $X$.
+ The points where the coefficients change are called knots. 
+ In a regression spline, the polynomials are additionally constrained so that they join smoothly at the knots.
+ Example of a piecewise cubic polynomial with a single knot at point $c$:
$$y_i = \begin{cases}
\beta_{01} + \beta_{11} x_i + \beta_{21}x_i^2 + \beta_{31} x_i^3 + \epsilon_i & \textrm{if }x_i<c\\
\beta_{02} + \beta_{12} x_i + \beta_{22}x_i^2 + \beta_{32} x_i^3 + \epsilon_i & \textrm{if }x_i\geq c
\end{cases}$$





### Splines
+ Piecewise polynomials constrained to be continuous and smooth at each knot; for a cubic spline, the function and its first and second derivatives must all match at the knot.
+ Fitting a spline turns out to be simpler than it seems. 
+ We don't have to fit a separate degree-3 polynomial for each region. 
+ By choosing the basis functions carefully, we can use least squares to solve for all the coefficients at once. 
+ We use the truncated power basis function, defined as 
$$h(x, \xi) = \begin{cases}
(x -\xi)^3, & x > \xi \\
0 & \textrm{otherwise}
\end{cases}$$ 
where $\xi$ is a knot.

+ The equation to send to least squares is 
$$\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 h(x, \xi_1) + \ldots + \beta_{K+3} h(x, \xi_K)$$ where we have $K$ truncated power transformations, one per knot, for a total of $K + 3$ predictors.
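
As a minimal sketch of this idea, the code below builds the truncated power basis with NumPy and recovers the coefficients of a known cubic spline by least squares. The helper name `truncated_power_basis`, the knot location, and the coefficients are assumptions chosen for illustration.

```python
import numpy as np

def truncated_power_basis(x, knots):
    """Columns 1, x, x^2, x^3, followed by (x - xi)^3_+ for each knot xi."""
    x = np.asarray(x, dtype=float)
    poly = np.vander(x, N=4, increasing=True)
    trunc = np.column_stack([np.clip(x - xi, 0.0, None) ** 3 for xi in knots])
    return np.hstack([poly, trunc])

# Synthetic data (assumption): a known cubic spline with one knot at 0.5
x = np.linspace(0, 1, 200)
true_coef = np.array([1.0, -2.0, 0.5, 3.0, -8.0])
B = truncated_power_basis(x, knots=[0.5])
y = B @ true_coef

# One least-squares solve gives all K + 4 coefficients (intercept included)
coef_hat, *_ = np.linalg.lstsq(B, y, rcond=None)
print(np.round(coef_hat, 4))
```

In practice, libraries usually prefer a B-spline basis, which spans the same functions but is numerically better conditioned.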

### How to choose K?
+ The regression spline is most flexible in regions that contain a lot of knots, because in those regions the polynomial coefficients can change rapidly.
+ It is common to place knots in a uniform fashion. 
+ One way to do this is to specify the desired degrees of freedom, and then have the software automatically place the corresponding number of knots at uniform quantiles of the data.
+ To choose the number of knots $K$, use cross-validation:
    + remove a portion of the data (say 10%),
    + fit a spline with a certain number of knots to the remaining data, 
    + use the spline to make predictions for the held-out portion, 
    + repeat until each observation has been left out once, and compute the overall cross-validated RSS, 
    + repeat the whole procedure for different numbers of knots $K$, 
    + choose the value of $K$ giving the smallest RSS.
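
The cross-validation steps above can be sketched as follows. This is an illustrative implementation on simulated data, using a truncated power basis and knots at uniform quantiles; the helper names `spline_basis` and `cv_rss` and the candidate values of $K$ are assumptions. For simplicity the knots are computed once from all observations rather than within each fold.

```python
import numpy as np

def spline_basis(x, knots):
    """Cubic truncated power basis: 1, x, x^2, x^3, (x - xi)^3_+ per knot."""
    cols = [x**p for p in range(4)]
    cols += [np.clip(x - xi, 0.0, None) ** 3 for xi in knots]
    return np.column_stack(cols)

def cv_rss(x, y, n_knots, n_folds=10, seed=0):
    """Cross-validated RSS for a cubic spline with knots at uniform quantiles."""
    knots = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])
    # Assign every observation to exactly one of the folds
    folds = np.random.default_rng(seed).permutation(len(x)) % n_folds
    rss = 0.0
    for k in range(n_folds):
        train, test = folds != k, folds == k
        coef, *_ = np.linalg.lstsq(spline_basis(x[train], knots), y[train], rcond=None)
        rss += np.sum((y[test] - spline_basis(x[test], knots) @ coef) ** 2)
    return rss

# Simulated data (assumption)
rng = np.random.default_rng(42)
x = rng.uniform(0, 1, 400)
y = np.sin(6 * x) + rng.normal(scale=0.3, size=400)

candidates = [1, 2, 3, 5, 8, 12]
scores = {K: cv_rss(x, y, K) for K in candidates}
best_K = min(scores, key=scores.get)
print(best_K, scores[best_K])
```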
    
### Splines vs polynomial regression
+ Splines generally do better. 
+ A complex relationship can still be fit well with a cubic spline by adding more knots while keeping the degree fixed.
+ A polynomial would need a very high degree to reach the same flexibility, and its fit would have higher variance, especially near the boundaries.

In [None]:
Image('images/pw54.png', width =700)

## Smoothing splines
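+ Find a function $g$ that makes the RSS small but is also smooth. 
+ Roughness is measured through the second derivative $g^{\prime \prime}$: a smooth function has a slope that changes slowly, so $g^{\prime \prime}$ stays close to zero. 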
+ We want to find the smoothing spline, $g$, that minimizes
$$\sum_{i=1}^n{(y_i - g(x_i))^2} + \lambda \int{g^{\prime \prime}(t)^2 dt}$$
+ The first term is a *loss function* that encourages $g$ to fit the data well. 
+ The second term is a penalty term that penalizes the variability in $g$.
+ A larger tuning parameter $\lambda$ makes $g$ smoother; as $\lambda \to \infty$, $g$ becomes the least-squares line, and at $\lambda = 0$ it interpolates the data.
+ The function that minimizes this criterion is a natural cubic spline with knots at each unique value of $x$, but with shrunken parameter estimates due to the penalty term. 
+ The tuning (smoothing) parameter controls the bias-variance trade-off, so choose it with cross-validation.
+ The effective degrees of freedom, $df_\lambda$, measure the flexibility of the smoothing spline: a higher $df_\lambda$ means a more flexible fit (lower bias, higher variance).
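
The roles of $\lambda$ and $df_\lambda$ can be illustrated with a discrete analogue of the criterion. The sketch below is not the natural cubic spline solution itself: it assumes equally spaced $x$ values and replaces the integral of $g^{\prime \prime}(t)^2$ with a sum of squared second differences of the fitted values. The function name `penalized_smoother` is hypothetical.

```python
import numpy as np

def penalized_smoother(y, lam):
    """Minimize sum (y_i - g_i)^2 + lam * sum (second differences of g)^2.

    Returns the fitted values g and the effective degrees of freedom,
    i.e. the trace of the smoother matrix S, where g = S @ y.
    """
    n = len(y)
    D2 = np.diff(np.eye(n), n=2, axis=0)   # (n - 2) x n second-difference operator
    S = np.linalg.inv(np.eye(n) + lam * D2.T @ D2)
    return S @ y, np.trace(S)

# Simulated data on an equally spaced grid (assumption)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=60)

g_rough, df_rough = penalized_smoother(y, lam=0.0)     # reproduces the data
g_smooth, df_smooth = penalized_smoother(y, lam=1e6)   # close to a straight line
print(df_rough, df_smooth)
```

With no penalty the effective degrees of freedom equal the number of observations, and with a very large penalty they approach 2, the number of parameters of a straight line.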



In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

from sklearn.preprocessing import PolynomialFeatures
import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline

# Read in the data
df = pd.read_csv('data/wage.csv')

# Generate a sequence of age values spanning the range
age_grid = np.arange(df.age.min(), df.age.max()).reshape(-1,1)

+ In order to fit regression splines in Python, we use the ${\tt dmatrix()}$ function from the ${\tt patsy}$ library. 
+ Regression splines can be fit by constructing an appropriate matrix of basis functions. 
+ The ${\tt bs()}$ function generates the entire matrix of basis functions for splines with the specified set of knots.  
+ Fitting ${\tt wage}$ to ${\tt age}$ using a regression spline is simple:

In [None]:
# Fitting wage to age using a regression spline

from patsy import dmatrix

# Specifying 3 knots
transformed_x1 = dmatrix("bs(df.age, knots=(25,40,60), degree=3, include_intercept=False)",
                        {"df.age": df.age}, return_type='dataframe')

# Build a regular linear model from the splines
fit1 = sm.GLM(df.wage, transformed_x1).fit()
fit1.params

In [None]:
Image('images/pw57.png', width =600)


+ Here we have prespecified knots at ages 25, 40, and 60. 
+ This produces a spline with six basis functions. (Recall that a cubic spline with three knots has seven degrees of freedom; these degrees of freedom are used up by an intercept, plus six basis functions.) 
+ We could also use the ${\tt df}$ option to produce a spline with knots at uniform quantiles of the data:

In [None]:
fit1.summary()

In [None]:
# Use the df option to produce a spline with knots at uniform quantiles of the data
# Specifying 6 degrees of freedom

transformed_x2 = dmatrix("bs(df.age, df=6, include_intercept=False)",
                        {"df.age": df.age}, return_type='dataframe')
fit2 = sm.GLM(df.wage, transformed_x2).fit()
fit2.params

In [None]:
transformed_x2.head(10)

In [None]:
fit2.summary()

+ In this case ${\tt patsy}$ chooses knots at the 25th, 50th, and 75th percentiles of ${\tt age}$. 
+ The function ${\tt bs()}$ also has a ${\tt degree}$ argument, so we can fit splines of any degree, rather than the default degree of 3 (which yields a cubic spline).

+ In order to instead fit a natural spline, we use the ${\tt cr()}$ function. 
+ Here we fit a natural spline with four degrees of freedom:

In [None]:
# To fit a natural spline, we use the cr() function.
# Specifying 4 degrees of freedom

transformed_x3 = dmatrix("cr(df.age, df=4)", {"df.age": df.age}, return_type='dataframe')
fit3 = sm.GLM(df.wage, transformed_x3).fit()
fit3.params

In [None]:
fit3.summary()

+ As with the ${\tt bs()}$ function, we could instead specify the knots directly using the ${\tt knots}$ option.

+ Let's see how these three models stack up:

In [None]:
# Generate a sequence of age values spanning the range
age_grid = np.arange(df.age.min(), df.age.max()).reshape(-1,1)

# Make some predictions
pred1 = fit1.predict(dmatrix("bs(age_grid, knots=(25,40,60), include_intercept=False)",
                             {"age_grid": age_grid}, return_type='dataframe'))
pred2 = fit2.predict(dmatrix("bs(age_grid, df=6, include_intercept=False)",
                             {"age_grid": age_grid}, return_type='dataframe'))
pred3 = fit3.predict(dmatrix("cr(age_grid, df=4)", {"age_grid": age_grid}, return_type='dataframe'))

# Plot the data and the three spline fits
plt.scatter(df.age, df.wage, facecolor='None', edgecolor='k', alpha=0.1)
plt.plot(age_grid, pred1, color='b', label='Specifying three knots')
plt.plot(age_grid, pred2, color='r', label='Specifying df=6')
plt.plot(age_grid, pred3, color='g', label='Natural spline df=4')
plt.vlines([25, 40, 60], 0, 350, linestyles='dashed', lw=2, colors='b')  # knots of the first fit
plt.legend()
plt.xlim(15,85)
plt.ylim(0,350)
plt.xlabel('age')
plt.ylabel('wage')

## Local Regression
+ Fits a separate regression at each target point $x_0$, using only the training points nearest to $x_0$. 
+ It uses weighted least squares: points outside the neighborhood get weight zero, and points inside it get a weight that decreases with their distance from $x_0$. 
+ Usually, low-degree polynomials are fit to these local points. 
+ We need to choose the weight function $K$ and the span $s$, the fraction of points used in each local fit. 
+ The larger the span, the smoother the resulting function.

At a target point $x_0$, local linear regression proceeds as follows:

1. Gather the fraction $s=k/n$ of training points whose $x_i$ are closest to $x_0$.
2. Assign a weight $K_{i0} = K(x_i, x_0)$ to each point in the neighbourhood. All but these $k$ nearest neighbors get weight zero.
3. Fit a weighted least squares regression of the $y_i$ on the $x_i$, by finding $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize
$$\sum_{i=1}^n{K_{i0}(y_i - \beta_0 - \beta_1x_i)^2}$$
4. The fitted value at $x_0$ is $$\hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1x_0$$
+ Local regression can also be fit in two or more variables, but because of the curse of dimensionality there may be too few training points close to $x_0$.
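
The four steps above can be sketched in a few lines of NumPy. The tricube kernel used here is one common choice of weight function and is an assumption, as are the function name `local_linear_fit` and the default span.

```python
import numpy as np

def local_linear_fit(x, y, x0, span=0.3):
    """Local linear regression estimate at x0 using tricube weights."""
    n = len(x)
    k = max(2, int(np.ceil(span * n)))   # number of neighbours, s = k / n
    dist = np.abs(x - x0)
    idx = np.argsort(dist)[:k]           # step 1: the k nearest training points
    d_max = dist[idx].max()
    # Step 2: tricube weights decrease with distance; the farthest neighbour gets 0
    w = (1 - (dist[idx] / d_max) ** 3) ** 3 if d_max > 0 else np.ones(k)
    # Step 3: weighted least squares, done by scaling rows with sqrt(weight)
    X = np.column_stack([np.ones(k), x[idx]])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y[idx] * sw, rcond=None)
    # Step 4: evaluate the local line at x0
    return beta[0] + beta[1] * x0

# Sanity check on exactly linear data (assumption): the local fit should reproduce the line
x = np.linspace(0, 10, 200)
y = 3.0 + 0.5 * x
fit_at_4 = local_linear_fit(x, y, x0=4.0)
print(fit_at_4)
```

To draw a full curve, the function would be called once for every point on a grid of `x0` values.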



In [None]:
Image('images/pw55.png', width =700)

In [None]:
Image('images/pw56.png', width =700)

## Generalized Additive Models
+ All the previous models relate the response to a single predictor. 
+ GAMs extend them to several predictors by fitting a separate function (like the ones above) to each variable and adding the results, which allows multivariate regression and classification. 
+ Each variable gets its own model, and the fitted functions are added together. 
+ Each of these single-variable models is a building block for a GAM.

### GAM for regression
$$y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \ldots + \beta_px_{ip} + \epsilon_i$$
+ Replace each linear component, $\beta_j x_{ij}$, with a smooth nonlinear function, $f_j (x_{ij})$,
$$ y_i = \beta_0 + f_1(x_{i1}) + \ldots + f_p(x_{ip}) + \epsilon_i$$
+ GAM can use the previous methods as building blocks for fitting an additive model.
+ The main limitation of GAMs is that the model is restricted to be additive. With many variables, important interactions can be missed.
+ However, we can manually add interaction terms by including additional predictors of the form $X_j \times X_k$. 
+ Or, we can add low-dimensional interaction functions of the form $f_{jk}(X_j,X_k)$ into the model using two-dimensional smoothers such as local regression, or two-dimensional splines.
+ The same technique can also be used for classification problems.



In [None]:
Image('images/pw58.png', width =700)

# Lab 7.8.1
Recreating plot 7.1

In [None]:
import pandas as pd
import numpy as np

In [None]:
wage = pd.read_csv("data/wage.csv")

In [None]:
# Use sklearn to get regression coefficients
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

In [None]:
poly = PolynomialFeatures(degree=4, include_bias=False)

In [None]:
X = wage[['age']]
y = wage['wage']

In [None]:
model = LinearRegression()

In [None]:
model.fit(poly.fit_transform(X), y)

In [None]:
# Coefficients are the same as in ISLR
model.intercept_, model.coef_

In [None]:
model.intercept_

### Standard error in  Scikit-learn
scikit-learn doesn't report standard errors, so you have to compute them yourself or use statsmodels.

In [None]:
import statsmodels.formula.api as smf

In [None]:
results = smf.ols('wage ~ age + np.power(age, 2) + np.power(age, 3) + np.power(age, 4)', data=wage).fit()

In [None]:
results.summary()

In [None]:
results.bse

In [None]:
results.conf_int()

### Confidence interval for the mean
The confidence interval for the mean (the regression line) differs from the prediction interval for a new observation. Prediction intervals are much wider because they also include the noise in an individual response, whereas the estimated regression line itself varies much less.
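
The difference between the two intervals comes down to one extra term in the variance. The sketch below computes both by hand for a simple linear fit on simulated data; the variable names and the rough multiplier of 2 standard errors are assumptions for illustration.

```python
import numpy as np

# Simulated data (assumption)
rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, n)
y = 2.0 + 1.5 * x + rng.normal(scale=2.0, size=n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])      # residual variance estimate
XtX_inv = np.linalg.inv(X.T @ X)

x0 = np.array([1.0, 5.0])                      # new point: intercept term and x = 5
leverage = x0 @ XtX_inv @ x0
se_mean = np.sqrt(sigma2 * leverage)           # uncertainty of the regression line
se_pred = np.sqrt(sigma2 * (1.0 + leverage))   # adds the noise of a new observation

y0 = x0 @ beta
ci_mean = (y0 - 2 * se_mean, y0 + 2 * se_mean)   # rough 95% band, +/- 2 SE
pi_new = (y0 - 2 * se_pred, y0 + 2 * se_pred)
print(ci_mean, pi_new)
```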

In [None]:
from statsmodels.stats.outliers_influence import summary_table

In [None]:
st, data, ss2 = summary_table(results, alpha=0.05)

In [None]:
fittedvalues = data[:,2]
predict_mean_se  = data[:,3]
predict_mean_ci_low, predict_mean_ci_upp = data[:,4:6].T

In [None]:
order = np.argsort(X.values.flatten())
x_o = X.values.flatten()[order]

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
plt.figure(figsize=(10, 8))
plt.scatter(x_o, y[order])
plt.plot(x_o, fittedvalues[order], 'r', lw=2)
plt.plot(x_o, predict_mean_ci_low[order], 'r--', lw=2)
plt.plot(x_o, predict_mean_ci_upp[order], 'r--', lw=2)

In [None]:
# Which features are necessary
smf.ols('wage ~ age + np.power(age, 2) + np.power(age, 3) + np.power(age, 4)', data=wage).fit().summary()

# Use ANOVA to test whether each additional polynomial term is significant
Models must be nested here, meaning that the predictors of mod1 must be a subset of those of mod2.

In [None]:
from statsmodels.stats.api import anova_lm

In [None]:
mod1 = smf.ols('wage ~ age', data=wage).fit()
mod2 = smf.ols('wage ~ age + np.power(age, 2)', data=wage).fit()
mod3 = smf.ols('wage ~ age + np.power(age, 2) + np.power(age, 3)', data=wage).fit()
mod4 = smf.ols('wage ~ age + np.power(age, 2) + np.power(age, 3) + np.power(age, 4)', data=wage).fit()
mod5 = smf.ols('wage ~ age + np.power(age, 2) + np.power(age, 3) + np.power(age, 4) + np.power(age, 5)', data=wage).fit()

In [None]:
# Same as ISLR
# polynomial terms 4 and 5 are not needed. p > .05
anova_lm(mod1, mod2, mod3, mod4, mod5)

# Logistic regression
Prediction of income greater than 250k

In [None]:
wage['wage_250'] = (wage['wage'] > 250) * 1

In [None]:
results = smf.logit('wage_250 ~ age + np.power(age, 2) + np.power(age, 3) + np.power(age, 4)', data=wage).fit()

In [None]:
results.summary()

In [None]:
# Fitted probabilities; results.fittedvalues holds the linear predictor (log-odds)
y_hat = np.asarray(results.predict())
y = wage['wage_250'].values
x = X['age'].values
x_mean = x.mean()
n = len(y)

In [None]:
sy = np.sqrt(np.sum((y - y_hat)**2) / (n - 2))
sx = np.sum((x - x_mean) ** 2) / n
x_s = (x - x_mean) ** 2

In [None]:
sx = np.sum(x ** 2) - (x.sum() ** 2) / n

In [None]:
err = sy * np.sqrt(1/n + x_s / x_s.sum())

In [None]:
order = np.argsort(x)
x_o = x[order]

In [None]:
plt.figure(figsize=(10, 8))
plt.scatter(x_o, y[order])
plt.plot(x_o, y_hat[order], 'r', lw=2)
plt.plot(x_o, y_hat[order] + 2 * err[order], 'r--', lw=2)
plt.plot(x_o, y_hat[order] - 2 * err[order], 'r--', lw=2)
plt.ylim(0, .07)
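
An alternative way to draw an uncertainty band for a logistic fit is to build it on the log-odds scale and then transform it to probabilities, which keeps the band inside $(0, 1)$. The sketch below shows the computation with made-up coefficients; the function name `logistic_band` and the demo values are assumptions. With a statsmodels fit, the inputs could be taken from `results.model.exog`, `results.params`, and `results.cov_params()`.

```python
import numpy as np

def logistic_band(X, beta, cov, z=2.0):
    """Pointwise band for fitted probabilities, built on the log-odds scale."""
    eta = X @ beta                                       # linear predictor (log-odds)
    se = np.sqrt(np.einsum("ij,jk,ik->i", X, cov, X))    # sqrt of diag(X cov X^T)
    expit = lambda t: 1.0 / (1.0 + np.exp(-t))           # inverse logit
    return expit(eta), expit(eta - z * se), expit(eta + z * se)

# Made-up coefficients and covariance matrix, for illustration only
X_demo = np.column_stack([np.ones(5), np.linspace(-2, 2, 5)])
beta_demo = np.array([-3.0, 0.8])
cov_demo = np.array([[0.04, 0.01], [0.01, 0.09]])

p_hat, p_lo, p_hi = logistic_band(X_demo, beta_demo, cov_demo)
print(np.round(p_lo, 3), np.round(p_hat, 3), np.round(p_hi, 3))
```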

# Step function as in 7.2
use pd.cut

In [None]:
results = smf.ols('wage ~ pd.cut(age, 4)', data=wage).fit()

In [None]:
results.summary()

# Splines

In [None]:
import scipy.interpolate as si

In [None]:
y = wage['wage'].values

In [None]:
order = np.argsort(x)

In [None]:
x_sort = x[order]
y_sort = y[order]
t = np.array([25, 40, 60])

In [None]:
spl = si.LSQUnivariateSpline(x_sort, y_sort, t)

In [None]:
spl(x_sort)

In [None]:
plt.figure(figsize=(10,8))
plt.plot(x_sort, y_sort, 'ro', ms=5)
plt.plot(x_sort, spl(x_sort), 'g-', lw=3);

## Generalized additive models for classification


In [None]:
pip install pygam --user


This dataset (the breast cancer data bundled with scikit-learn) contains 569 observations and 30 features. The target variable is whether the tumor is malignant or benign, and the features are several measurements of the tumor. For showcasing purposes, we keep only the first 6 features.

In [None]:
import pandas as pd        
from pygam import LogisticGAM
from sklearn.datasets import load_breast_cancer

#load the breast cancer data set
data = load_breast_cancer()

#keep first 6 features only
df = pd.DataFrame(data.data, columns=data.feature_names)[['mean radius', 'mean texture', 'mean perimeter', 'mean area','mean smoothness', 'mean compactness']]
target_df = pd.Series(data.target)
df.describe()

Since this is a classification problem, make sure to use pyGAM's `LogisticGAM` class.

In [None]:
X = df[['mean radius', 'mean texture', 'mean perimeter', 'mean area','mean smoothness', 'mean compactness']]
y = target_df

#Fit a model with the default parameters
gam = LogisticGAM().fit(X, y)

In [None]:
gam.summary()

In [None]:
gam.accuracy(X, y)

In [None]:
plt.rcParams['figure.figsize'] = (28, 8)
fig, axs = plt.subplots(1, len(data.feature_names[0:6]))
titles = data.feature_names
for i, ax in enumerate(axs):
    XX = gam.generate_X_grid(term=i)
    pdep, confi = gam.partial_dependence(term=i, width=.95)
    ax.plot(XX[:, i], pdep)
    ax.plot(XX[:, i], confi[:, 0], c='grey', ls='--')
    ax.plot(XX[:, i], confi[:, 1], c='grey', ls='--')
    ax.set_title(titles[i])
plt.show()

### GAM for regression

In [None]:
from pygam import LinearGAM, s, f
from pygam.datasets import wage

X, y = wage(return_X_y=True)

## model
gam = LinearGAM(s(0) + s(1) + f(2))
gam.gridsearch(X, y)


## plotting
fig, axs = plt.subplots(1, 3)

titles = ['year', 'age', 'education']
for i, ax in enumerate(axs):
    XX = gam.generate_X_grid(term=i)
    ax.plot(XX[:, i], gam.partial_dependence(term=i, X=XX))
    ax.plot(XX[:, i], gam.partial_dependence(term=i, X=XX, width=.95)[1], c='r', ls='--')
    if i == 0:
        ax.set_ylim(-30,30)
    ax.set_title(titles[i]);

In [None]:
gam.summary()

More on this: https://pygam.readthedocs.io/en/latest/notebooks/tour_of_pygam.html#Functional-Form:

# KNN Regression

https://towardsdatascience.com/the-basics-knn-for-classification-and-regression-c1e8a6c955

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

wage = pd.read_csv("data/wage.csv")

from sklearn.model_selection import train_test_split
train , test = train_test_split(wage, test_size = 0.3)

X_train = train[['age']]
y_train = train['wage']
X_test = test[['age']]
y_test = test['wage']

In [None]:
from sklearn.neighbors import KNeighborsRegressor
knn=KNeighborsRegressor(n_neighbors=9)
knn.fit(X_train,y_train)

In [None]:
y_pred_knn=knn.predict(X_test)


In [None]:
plt.scatter(X_train,y_train,color="blue")
plt.scatter(X_test,knn.predict(X_test),color="red")
plt.title("Wage Prediction")
plt.xlabel("Age")
plt.ylabel("Wage")
plt.show()

In [None]:
from sklearn import neighbors
from math import sqrt
from sklearn.metrics import mean_squared_error

rmse_val = [] #to store rmse values for different k
for K in range(1, 101):
    model = neighbors.KNeighborsRegressor(n_neighbors = K)
    model.fit(X_train, y_train) #fit the model
    pred=model.predict(X_test) #make prediction on test set
    error = sqrt(mean_squared_error(y_test,pred)) #calculate rmse
    rmse_val.append(error) #store rmse values
    print("RMSE value for k= " , K , "is:", error)

In [None]:
# Plot the RMSE values against k (elbow curve); the index starts at k = 1
curve = pd.DataFrame(rmse_val, index=range(1, len(rmse_val) + 1))
curve.plot()

In [None]:
from sklearn.model_selection import GridSearchCV
params = {"n_neighbors":list(range(1,100))}
knn = neighbors.KNeighborsRegressor()
model = GridSearchCV(knn, params, cv=5)
model.fit(X_train,y_train)
model.best_params_

# KNN Classifier

In [None]:
from sklearn import datasets, neighbors, linear_model

X_digits, y_digits = datasets.load_digits(return_X_y=True)
X_digits = X_digits / X_digits.max()

n_samples = len(X_digits)

X_train = X_digits[:int(.9 * n_samples)]
y_train = y_digits[:int(.9 * n_samples)]
X_test = X_digits[int(.9 * n_samples):]
y_test = y_digits[int(.9 * n_samples):]

knn = neighbors.KNeighborsClassifier()
logistic = linear_model.LogisticRegression(max_iter=1000)

print('KNN score: %f' % knn.fit(X_train, y_train).score(X_test, y_test))
print('LogisticRegression score: %f'
      % logistic.fit(X_train, y_train).score(X_test, y_test))

# Exercises

# 1 Cubic Spline regression
a) For $x \leq \xi$ the truncated term is zero, so $a_1 = \beta_0$, $b_1 = \beta_1$, $c_1 = \beta_2$, and $d_1 = \beta_3$.

b) For $x > \xi$, expand $(x-\xi)^3 = x^3 - 3\xi x^2 + 3\xi^2 x - \xi^3$ and group like powers of $x$: $a_2 = \beta_0 - \beta_4\xi^3$, $b_2 = \beta_1 + 3\beta_4\xi^2$, $c_2 = \beta_2 - 3\beta_4\xi$, and $d_2 = \beta_3 + \beta_4$.

c) At $x=\xi$ the term $\beta_4(x-\xi)^3$ equals 0, so $f_1(\xi) = f_2(\xi)$.

d, e) The two pieces differ by $\beta_4(x-\xi)^3$, whose first derivative $3\beta_4(x-\xi)^2$ and second derivative $6\beta_4(x-\xi)$ both vanish at $x=\xi$, so $f_1'(\xi) = f_2'(\xi)$ and $f_1''(\xi) = f_2''(\xi)$.
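
A quick numerical check of this exercise, using the expansion $(x-\xi)^3 = x^3 - 3\xi x^2 + 3\xi^2 x - \xi^3$: the sketch below builds both cubic pieces from arbitrary coefficients and confirms that the values, first derivatives, and second derivatives agree at the knot. The coefficient values and the knot are chosen only for this check.

```python
import numpy as np

# Arbitrary coefficients and knot, chosen only for this check
b0, b1, b2, b3, b4, xi = 1.0, -2.0, 0.5, 3.0, -4.0, 1.5

# f1 (x <= xi) and f2 (x > xi) as polynomials, highest power first
f1 = np.poly1d([b3, b2, b1, b0])
f2 = np.poly1d([b3 + b4, b2 - 3 * b4 * xi, b1 + 3 * b4 * xi**2, b0 - b4 * xi**3])

# f2 should equal the cubic part plus b4 * (x - xi)^3 to the right of the knot
x = np.linspace(xi, xi + 2, 7)
direct = b0 + b1 * x + b2 * x**2 + b3 * x**3 + b4 * (x - xi) ** 3

value_gap = f2(xi) - f1(xi)
slope_gap = f2.deriv(1)(xi) - f1.deriv(1)(xi)
curv_gap = f2.deriv(2)(xi) - f1.deriv(2)(xi)
print(value_gap, slope_gap, curv_gap)
```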

# 2
a) g = 0  
b) g = mean(y)  
c) g = linear regression with 2 parameters - slope and intercept  
d) g = quadratic regression with 3 parameters, since the penalty forces the third derivative to zero  
e) g = a very flexible function that interpolates the data and gives 0 training error  

# 3

In [None]:
x = np.linspace(-2, 2, 100)

In [None]:
y = 1 + x + -2 * (x - 1) ** 2 * (x >= 1)

In [None]:
plt.plot(x, y)

# 4

In [None]:
b1_1 = (0 <= x) & (x <= 2)
b1_2 = (1 <= x) & (x <= 2)
b2_1 = (3 <= x) & (x <= 4)
b2_2 = (4 < x) & (x <= 5)

In [None]:
# beta0 = 1, beta1 = 1, beta2 = 3; the b2 terms are zero on the plotted range [-2, 2]
y = 1 + 1 * (b1_1 - (x - 1) * b1_2) + 3 * ((x - 3) * b2_1 + b2_2)

In [None]:
plt.plot(x, y)

# 5
a) g2 will have the smaller training RSS, since it is more flexible: as $\lambda$ approaches infinity it can still be up to a cubic, while g1 is limited to a quadratic.

b) We can't tell which model will have the smaller test RSS; this depends on the 'true' relationship between x and y.

c) With no penalty ($\lambda = 0$), g1 and g2 are the same function, so their training and test RSS are equal.

# 6

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn import model_selection

In [None]:
degrees = range(1, 11)
X = wage[['age']]
y = wage['wage']
final_scores = []
for degree in degrees:
    polynomial_features = PolynomialFeatures(degree=degree,
                                             include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])

    scores = model_selection.cross_val_score(pipeline,
                                            X, y, cv=10, scoring='neg_mean_squared_error')
    final_scores.append(-np.mean(scores))

In [None]:
# degree 3 chosen through 10-fold CV
plt.plot(degrees, final_scores);

In [None]:
# compare to anova: already done above. More evidence that 4th and 5th degree polynomial are not needed
mod1 = smf.ols('wage ~ age', data=wage).fit()
mod2 = smf.ols('wage ~ age + np.power(age, 2)', data=wage).fit()
mod3 = smf.ols('wage ~ age + np.power(age, 2) + np.power(age, 3)', data=wage).fit()
mod4 = smf.ols('wage ~ age + np.power(age, 2) + np.power(age, 3) + np.power(age, 4)', data=wage).fit()
mod5 = smf.ols('wage ~ age + np.power(age, 2) + np.power(age, 3) + np.power(age, 4) + np.power(age, 5)', data=wage).fit()
anova_lm(mod1, mod2, mod3, mod4, mod5)

In [None]:
polynomial_features = PolynomialFeatures(degree=3, include_bias=False)
linear_regression.fit(polynomial_features.fit_transform(X), y)

In [None]:
x = np.linspace(X.values.min(), X.values.max(), 1000)

In [None]:
plt.scatter(X, y)
plt.plot(x, linear_regression.predict(polynomial_features.fit_transform(x.reshape(-1, 1))), c='r', lw=3);

In [None]:
cuts = range(1, 41)
X = wage[['age']]
y = wage['wage']
final_scores = []
for cut in cuts:
    X_new = pd.get_dummies(pd.cut(X['age'], cut)).values
    
    linear_regression = LinearRegression(fit_intercept=False)

    scores = model_selection.cross_val_score(linear_regression, X_new, y, cv=10, scoring='neg_mean_squared_error')
    final_scores.append(-np.mean(scores))

In [None]:
# looks like error stops getting better after 7 cuts
plt.plot(cuts, final_scores);

In [None]:
X_new = pd.get_dummies(pd.cut(X['age'], 7)).values
linear_regression = LinearRegression(fit_intercept=False)
linear_regression.fit(X_new, y)
plt.scatter(X, y)
order = np.argsort(X['age'])
plt.plot(X['age'].values[order], linear_regression.predict(X_new[order]), c='r', lw=3);

# 7

In [None]:
wage = pd.read_csv('data/wage.csv')

In [None]:
wage[['maritl', 'jobclass']].head()

In [None]:
X = pd.get_dummies(wage[['maritl', 'jobclass']], drop_first=False)
y = wage['wage']

In [None]:
X.head()

In [None]:
linear_regression = LinearRegression(fit_intercept=True)
linear_regression.fit(X, y)

In [None]:
linear_regression.coef_

In [None]:
linear_regression.intercept_

In [None]:
import statsmodels.api as sm

In [None]:
results_orig = sm.OLS(y, X.astype(float)).fit()
results_orig.summary()

In [None]:
wage[(wage['jobclass'] == '2. Information') & (wage['maritl'] == '3. Widowed')]['wage'].mean()

In [None]:
wage[wage['maritl'] == '1. Never Married']['wage'].mean()

In [None]:
wage[wage['jobclass'] == '1. Industrial']['wage'].mean()

In [None]:
wage[wage['jobclass'] == '2. Information']['wage'].mean()

In [None]:
wage['jobclass'].value_counts()

In [None]:
wage[(wage['jobclass'] == '2. Information') & (wage['maritl'] == '3. Widowed')]['wage'].mean()

In [None]:
27.6 + 82.3

In [None]:
X = pd.get_dummies(wage['maritl'] + ' ' + wage['jobclass'])
y = wage['wage']

In [None]:
results = sm.OLS(y, X.astype(float)).fit()
results.summary()

In [None]:
wage[(wage['jobclass'] == '2. Information') & (wage['maritl'] == '3. Widowed')]['wage'].mean()

In [None]:
results.predict([0, 0, 0, 0, 0, 0, 0, 1, 0, 0])

In [None]:
results_orig.predict([0, 0, 0, 1, 0, 1, 0])