### HW04: Practice with feature engineering, splitting data, and fitting and regularizing linear models

Kai Tsuyoshi 

tsuyoshi@wisc.edu

### Hello Students:

- Start by downloading HW04.ipynb from this folder. Then develop it into your solution.
- Write code where you see "... your code here ..." below.
  (You are welcome to use more than one cell.)
- If you have questions, please ask them in class, office hours, or piazza. Our TA
  and I are very happy to help with the programming (provided you start early
  enough, and provided we are not helping so much that we undermine your learning).
- When you are done, run these Notebook commands:
  - Shift-L (once, so that line numbers are visible)
  - Kernel > Restart and Run All (run all cells from scratch)
  - Esc S (save)
  - File > Download as > HTML
- Turn in:
  - HW04.ipynb to Canvas's HW04.ipynb assignment
  - HW04.html to Canvas's HW04.html assignment
  - As a check, download your files from Canvas to a new 'junk' folder. Try 'Kernel > Restart
  and Run All' on the '.ipynb' file to make sure it works. Glance through the '.html' file.
- Turn in partial solutions to Canvas before the deadline. e.g. Turn in part 1,
  then parts 1 and 2, then your whole solution. That way we can award partial credit
  even if you miss the deadline. We will grade your last submission before the deadline.

In [3]:
import pandas as pd
import numpy as np
from io import StringIO
from sklearn import svm
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn import linear_model

## 1. Feature engineering (one-hot encoding and data imputation)

### 1a. Read the data from [http://www.stat.wisc.edu/~jgillett/451/data/kaggle_titanic_train.csv](http://www.stat.wisc.edu/~jgillett/451/data/kaggle_titanic_train.csv).
- Retain only these columns: Survived, Pclass, Sex, Age, SibSp, Parch.
- Display the first 7 rows.

These data are described at [https://www.kaggle.com/competitions/titanic/data](https://www.kaggle.com/competitions/titanic/data) (click on the small down-arrow to see the "Data Dictionary"), which is where they are from.
- Read that "Data Dictionary" paragraph (with your eyes, not python) so you understand what each column represents.

(We used these data before in HW02:
- There we used `df.dropna()` to drop any observations with missing values; here we use data imputation instead.
- There we manually did one-hot encoding of the categorical `Sex` column by making a `Female` column; here we do the same one-hot encoding with the help of pandas's `df.join(pd.get_dummies())`.
- There we used a decision tree; here we use $k$-NN.

We evaluate how these strategies can improve model performance by allowing us to use columns with categorical or missing data.)

In [4]:
df = pd.read_csv('kaggle_titanic_train.csv')[['Survived','Pclass','Sex','Age','SibSp','Parch']]
print(df.head(7))

   Survived  Pclass     Sex   Age  SibSp  Parch
0         0       3    male  22.0      1      0
1         1       1  female  38.0      1      0
2         1       3  female  26.0      0      0
3         1       1  female  35.0      1      0
4         0       3    male  35.0      0      0
5         0       3    male   NaN      0      0
6         0       1    male  54.0      0      0


### 1b. Try to train a $k$NN model to predict $y=$ 'Survived' from $X=$ these features: 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch'.
- Use $k = 3$ and the (default) euclidean metric.
- Notice at the bottom of the error message that it fails with the error "ValueError: could not convert string to float: 'male'".
- Comment out your .fit() line so the cell can run without error.

In [5]:
knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
#knn.fit(X = df[['Pclass','Sex','Age','SibSp','Parch']], y = df[['Survived']])

### 1c. Try to train again, this time without the 'Sex' feature.
- Notice that it fails because "Input contains NaN".
- Comment out your .fit() line so the cell can run without error.
- Run `X.isna().any()` (where X is the name of your DataFrame of features) to see that
  the 'Age' feature has missing values. (You can see the first missing value in
  the sixth row that you displayed above.)

In [6]:
knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
#knn.fit(X = df[['Pclass','Age','SibSp','Parch']], y = df[['Survived']])
df.isna().any()

Survived    False
Pclass      False
Sex         False
Age          True
SibSp       False
Parch       False
dtype: bool

### 1d. Train without the 'Sex' and 'Age' features.
- Report accuracy on the training data with a line of the form
  `Accuracy on training data is  0.500` (0.500 may not be correct).

In [7]:
knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn.fit(X = df[['Pclass','SibSp','Parch']], y = df[['Survived']])
accuracy = knn.score(X = df[['Pclass','SibSp','Parch']], y = df[['Survived']])
print(f'Accuracy on training data is {accuracy}')

Accuracy on training data is 0.632996632996633


  return self._fit(X, y)


### 1e.  Use one-hot encoding
to include a binary 'male'  feature made from the 'Sex' feature. (Or include a binary 'female'
feature, according to your preference. Using both is unnecessary since either is the logical
negation of the other.) That is, train on these features: 'Pclass', 'SibSp', 'Parch', 'male'.
- Use pandas's df.join(pd.get_dummies())`.
- Report training accuracy as before.

In [8]:
df = df.join(pd.get_dummies(df['Sex'], drop_first=False)['female'])

knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn.fit(X = df[['Pclass','SibSp','Parch','female']], y = df[['Survived']])
accuracy = knn.score(X = df[['Pclass','SibSp','Parch','female']], y = df[['Survived']])
print(f'Accuracy on training data is {accuracy}')

Accuracy on training data is 0.7441077441077442


  return self._fit(X, y)


### 1f. Use data imputation
to include an 'age' feature made from 'Age' but replacing each missing value with the median
of the non-missing ages. That is, train on these features: 'Pclass', 'SibSp', 'Parch', 'male',
'age'.

- Report training accuracy as before.

In [9]:
imp = SimpleImputer(missing_values=np.nan, strategy='median', fill_value=None) 
df[['Age']] = imp.fit_transform(df[['Age']])

knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn.fit(X = df[['Pclass','SibSp','Parch','female','Age']], y = df[['Survived']])
accuracy = knn.score(X = df[['Pclass','SibSp','Parch','female','Age']], y = df[['Survived']])
print(f'Accuracy on training data is {accuracy}')

Accuracy on training data is 0.8608305274971941


  return self._fit(X, y)


## 2. Explore model fit, overfit, and regularization in the context of multiple linear regression

### 2a. Prepare the data:
- Read [http://www.stat.wisc.edu/~jgillett/451/data/mtcars.csv](http://www.stat.wisc.edu/~jgillett/451/data/mtcars.csv) into a DataFrame.
- Set a variable `X` to the subset consisting of all columns except `mpg`.
- Set a variable `y` to the `mpg` column.
- Use `train_test_split()` to split `X` and `y` into `X_train`, `X_test`, `y_train`, and `y_test`.
  - Reserve half the data for training and half for testing.
  - Use `random_state=0` to get reproducible results.

In [10]:
df = pd.read_csv('mtcars.csv',index_col=0)
X = df.loc[ :, df.columns != 'mpg']
y = df.loc[:, df.columns =='mpg']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,random_state=0)
print(X_train)
print(X_test)
print(y_train)
print(y_test)

                     cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
Camaro Z28             8  350.0  245  3.73  3.840  15.41   0   0     3     4
Mazda RX4 Wag          6  160.0  110  3.90  2.875  17.02   0   1     4     4
Volvo 142E             4  121.0  109  4.11  2.780  18.60   1   1     4     2
Duster 360             8  360.0  245  3.21  3.570  15.84   0   0     3     4
Hornet Sportabout      8  360.0  175  3.15  3.440  17.02   0   0     3     2
Honda Civic            4   75.7   52  4.93  1.615  18.52   1   1     4     2
Ferrari Dino           6  145.0  175  3.62  2.770  15.50   0   1     5     6
Toyota Corolla         4   71.1   65  4.22  1.835  19.90   1   1     4     1
Merc 280               6  167.6  123  3.92  3.440  18.30   1   0     4     4
Merc 240D              4  146.7   62  3.69  3.190  20.00   1   0     4     2
Lotus Europa           4   95.1  113  3.77  1.513  16.90   1   1     5     2
Hornet 4 Drive         6  258.0  110  3.08  3.215  19.44   1   0     3     1

### 2b. Train three models on the training data and evaluate each on the test data:
- `LinearRegression()`
- `Lasso()`
- `Ridge()`

The evaluation consists in displaying MSE$_\text{train}, $ MSE$_\text{test}$, and the coefficients $\mathbf{w}$ for each model.

In [13]:
models = [linear_model.LinearRegression(),
              linear_model.Lasso(),
              linear_model.Ridge()] 
df2 = pd.DataFrame(columns=['model', 'MSE_train', 'MSE_test','w'])
for model in models:
    model.fit(X_train, y_train)
    MSE_train = (1/y_train.size) * np.sum((np.array(y_train) - model.predict(X_train))**2)
    MSE_test = (1/y_test.size)  * np.sum((np.array(y_test) - model.predict(X_test))**2)
    df2 = df2.append(pd.DataFrame({'model': model, 'MSE_train': MSE_train,
                                    'MSE_test': MSE_test, 'w': [(model.coef_[1:],pd.set_option('display.max_colwidth', 200))]
                                  }),ignore_index=True)
df2

  df2 = df2.append(pd.DataFrame({'model': model, 'MSE_train': MSE_train,
  df2 = df2.append(pd.DataFrame({'model': model, 'MSE_train': MSE_train,
  df2 = df2.append(pd.DataFrame({'model': model, 'MSE_train': MSE_train,


Unnamed: 0,model,MSE_train,MSE_test,w
0,LinearRegression(),0.385687,30.227426,"([], None)"
1,Lasso(),1129.814516,1227.706526,"([-0.03718772959980314, -0.016817566338234596, 0.0, -0.0, 0.0, 0.0, 0.0, 0.0, -0.8471289103869164], None)"
2,Ridge(),1.985653,11.198178,"([], None)"


### 2c. Answer a few questions about the models:
- Which one best fits the training data?
- Which one best fits the test data?
- Which one does feature selection by setting most coefficients to zero?- 

Linear Regression fits the training data the best, while ridge fits the test data better. Lasso does feature selection, as can be seen by the w's that are for the most part zero with the exception of three nonzero coefficients.