# SI 618: Introduction to Machine Learning

Version 2021.03.12.1.CT

We suggest you use extra markdown blocks or code comments to record your notes.

In [None]:
import pandas as pd

Seaborn (and other packages) come bundled with datasets.  Let's load the infamous Fisher's Iris Dataset:

In [None]:
import seaborn as sns
iris = sns.load_dataset('iris')

In [None]:
iris.head()

### Exercise 1:
Create a 2-d scatterplot of petal_width (on the y-axis) vs. petal_length (on the x-axis) that includes a regression line.

In [None]:
# insert your code here

In [None]:
_ = sns.lmplot(data=iris,x='petal_length',y='petal_width', ci=None)

In [None]:
_ = sns.regplot(data=iris,x='petal_length',y='petal_width',ci=None)

### Exercise 2:
Create a regression model of petal_width as the outcome variable and petal_length as the explanatory variable.  You might find the notebook on correlation and regression to be helpful here.

In [None]:
iris.head()

In [None]:
# insert your code here

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
lm = smf.ols('Q("petal_width") ~ Q("petal_length")',data=iris).fit()
lm.summary()

## Introduction to scikit-learn

Recall the general process for using a scikit-learn estimator:
1. choose appropriate class that implements what you want to do and import it
1. choose model hyperparameters (or accept default ones, but be careful) and instantiate class
1. arrange data into features and labels
1. .fit() your model to the data
1. apply model to new data with .predict() for supervised learning

Let's do that with the regression model we implemented using statsmodels above:



1. choose appropriate class that implements what you want to do and import it

Picking the right class takes a bit of experience, but we'll cover the common estimators over the next few classes.  For now, I'll tell you that we want to use sklearn.linear_model.LinearRegression.  Import only that class into your default namespace:

### Exercise 3: write the correct line to import LinearRegression from the sklearn.linear_model module:

In [None]:
# insert your code here

In [None]:
from sklearn.linear_model import LinearRegression

### Exercise 4: choose model hyperparameters (or accept default ones, but be careful) and instantiate class
It's ok to accept the defaults this time. Let's assign the model to a variable called `lm`.

In [None]:
# insert your code here

In [None]:
lm = LinearRegression()

### Exercise 5: arrange data into features and labels
Create one dataframe for the 'y' values (and call it 'y') and another dataframe for the 'x' values (and call it 'X').

In [None]:
# insert your code here

In [None]:
y = iris[['petal_width']]
X = iris[['petal_length']]

### Exercise 6: .fit() your model to the data

In [None]:
# insert your code here

In [None]:
model = lm.fit(X,y)

### Exercise 7: apply model to new data with .predict()
What's the estimated value for petal_width if the petal_length is 10?

In [None]:
import numpy as np

In [None]:
np.array([10]).reshape(-1,1).shape

In [None]:
lm.predict(np.array([10]).reshape(-1,1))

Great!  But what does our model actually look like?

Every scikit-learn estimator has a .score(X,y) method that reports how well the model fits.  For LinearRegression, the score is the coefficient of determination, R^2:

In [None]:
lm.score(X,y)

In the case of LinearRegression, we can access the coefficients for the equation:

In [None]:
lm.coef_

and the value of the intercept:

In [None]:
lm.intercept_

If we've done everything right, the coefficient and intercept should match the results we got from statsmodels!
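To make the meaning of the coefficient and intercept concrete, here is a minimal sketch that rebuilds a prediction by hand and checks the fit against a plain least-squares solve. It assumes the copy of the iris data bundled with scikit-learn (`load_iris`) rather than the seaborn copy used above, and it swaps in `np.linalg.lstsq` as the independent check instead of statsmodels. Because `y` is a 1-d array here, `coef_` has shape `(1,)`; with the dataframe `y` used in the exercises it has shape `(1, 1)`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

# Assumption: scikit-learn's bundled iris data, whose columns are ordered
# sepal length, sepal width, petal length, petal width.
data = load_iris()
X = data.data[:, [2]]   # petal length, kept 2-d (n_samples, 1)
y = data.data[:, 3]     # petal width, 1-d

lm = LinearRegression().fit(X, y)
slope = lm.coef_[0]
intercept = lm.intercept_

# The fitted model is just the line: petal_width = intercept + slope * petal_length
x_new = np.array([[10.0]])
by_hand = intercept + slope * x_new[0, 0]
by_predict = lm.predict(x_new)[0]

# Independent check: solve the same least-squares problem directly.
design = np.column_stack([np.ones(len(X)), X[:, 0]])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

print(by_hand, by_predict)
print(beta, (intercept, slope))
```

The two printed predictions should agree, and `beta` should hold the same intercept and slope that the estimator reports.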

## Cross-validation

In [None]:
from sklearn.model_selection import cross_validate

In [None]:
result = cross_validate(lm, X, y, scoring='neg_mean_squared_error') # see docstring for more details

In [None]:
result['test_score']

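The values above are negated mean squared errors: scikit-learn scorers follow a higher-is-better convention, so error metrics are reported with the sign flipped.  If you omit the `scoring` argument, `cross_validate` falls back on the estimator's own .score() method, which for LinearRegression is R^2:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html

Note: unlike most other scores, R^2 score may be negative (it need not actually be the square of a quantity R).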

See also https://stats.stackexchange.com/questions/12900/when-is-r-squared-negative
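The sign conventions of the two scorers can be compared side by side. The sketch below is illustrative only: it assumes scikit-learn's bundled iris data (`load_iris`) instead of the seaborn copy, and the variable names are made up for this example. Passing a tuple of scorer names to `cross_validate` returns one `test_<name>` entry per scorer.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Assumption: scikit-learn's bundled iris data (petal length -> petal width).
data = load_iris()
X = data.data[:, [2]]
y = data.data[:, 3]

cv = cross_validate(
    LinearRegression(), X, y, cv=5,
    scoring=("neg_mean_squared_error", "r2"),
)

neg_mse = cv["test_neg_mean_squared_error"]  # always <= 0
r2 = cv["test_r2"]                           # at most 1, can be negative

# Flip the sign to recover ordinary errors, then take the root for RMSE.
mse = -neg_mse
rmse = np.sqrt(mse)

print(neg_mse)
print(rmse)
print(r2)
```

The rows of this dataset are ordered by species and an integer `cv` does not shuffle, so each fold may be dominated by a single species. That is one situation in which per-fold R^2 can turn out much lower than the score on the full data.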


What other scorers are available?

In [None]:
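from sklearn.metrics import get_scorer_names
# sklearn.metrics.SCORERS was removed in recent scikit-learn releases;
# get_scorer_names() (scikit-learn >= 1.0) lists the same scorer names
get_scorer_names()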

# Part II - Machine Learning Pipelines for Regression


## Goal: to predict the flipper length of penguins given a number of features about them.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# to make this notebook's output identical at every run
np.random.seed(42)

In [None]:
penguins = sns.load_dataset('penguins')

In [None]:
penguins.head()

In [None]:
penguins.describe()

In [None]:
penguins.info()

### Task 1
Are there any missing values?  Deal with the missing values.

In [None]:
# insert your code here
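One possible approach to Task 1, shown on a small made-up dataframe rather than the real penguins data: count the missing values per column with `.isna().sum()`, then drop incomplete rows with `.dropna()`. Dropping rows is only one option (imputation is another), and the `toy` frame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the penguins dataframe.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
    "bill_length_mm": [39.1, np.nan, 40.3, 46.5],
    "sex": ["Male", "Female", None, "Female"],
})

# How many values are missing in each column?
missing_per_column = toy.isna().sum()

# Simplest fix: keep only the rows with no missing values.
toy_complete = toy.dropna()

print(missing_per_column)
print(len(toy), "->", len(toy_complete))
```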

### Task 2
Use .value_counts() to get a sense of the distribution of categorical variables.

In [None]:
# insert your code here

### Task 3
Create scatterplots for every pair of numeric variables (hint: sns.pairplot() might be useful).

In [None]:
# insert your code here

### Task 4
Split the data into training and testing sets, ensuring that the same distribution of species exists in the split data sets as the distribution of species in the original dataframe.

In [None]:
# insert your code here
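A stratified split can be sketched with the `stratify` argument of `train_test_split`, which keeps the class proportions of the given column roughly equal in both parts. The dataframe below is made up for illustration, and the 80/20 split size and `random_state` are arbitrary choices, not requirements of the task.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with an unbalanced species column.
toy = pd.DataFrame({
    "species": ["Adelie"] * 50 + ["Gentoo"] * 30 + ["Chinstrap"] * 20,
    "body_mass_g": range(100),
})

train_set, test_set = train_test_split(
    toy, test_size=0.2, stratify=toy["species"], random_state=42
)

# Compare the species proportions in the original data and in each part.
original_share = toy["species"].value_counts(normalize=True)
train_share = train_set["species"].value_counts(normalize=True)
test_share = test_set["species"].value_counts(normalize=True)

print(original_share)
print(train_share)
print(test_share)
```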

### Task 5
Create a design matrix (`penguins_X`) and a label matrix (`penguins_y`, holding the flipper length we want to predict) from the stratified training set.

In [None]:
# insert your code here

### Task 6
Create a pipeline to apply a `StandardScaler()` to all numeric values and a `OneHotEncoder()` to the categorical variables in `penguins_X`. Assign the resulting matrix to a variable called `penguins_prepared`.

In [None]:
# insert your code here
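One way to route numeric and categorical columns to different transformers is scikit-learn's `ColumnTransformer`. The sketch below uses a made-up dataframe; the column lists and the `toy_` names are assumptions for illustration, and `handle_unknown="ignore"` is an optional choice that keeps the encoder from failing on categories it did not see during fitting.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for penguins_X.
toy_X = pd.DataFrame({
    "bill_length_mm": [39.1, 46.5, 50.0, 41.2],
    "body_mass_g": [3750.0, 3500.0, 5700.0, 3900.0],
    "island": ["Torgersen", "Dream", "Biscoe", "Dream"],
})

numeric_cols = ["bill_length_mm", "body_mass_g"]
categorical_cols = ["island"]

# Each entry is (name, transformer, columns it applies to).
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Two scaled numeric columns plus one indicator column per island.
toy_prepared = preprocess.fit_transform(toy_X)
print(toy_prepared.shape)
```

Keeping the fitted `preprocess` object around matters later: the test set should go through `preprocess.transform(...)`, not `fit_transform(...)`, so that it is scaled with the training-set statistics.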

### Task 7
Fit a linear regression to penguins_prepared and penguins_y.

In [None]:
# insert your code here

### Task 8
Use the fitted model to show the predicted values for the first 5 rows of data.

In [None]:
# insert your code here

### Task 9
Use cross-validation to show the mean and standard deviation, across folds, of the root mean squared error for your model.

In [None]:
# Insert your code here
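A minimal sketch of summarizing RMSE over cross-validation folds, assuming synthetic data from `make_regression` in place of the prepared penguin features. It uses the built-in `neg_root_mean_squared_error` scorer and flips the sign before averaging; the choice of 10 folds is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (penguins_prepared, penguins_y).
X_toy, y_toy = make_regression(
    n_samples=200, n_features=3, noise=10.0, random_state=42
)

scores = cross_val_score(
    LinearRegression(), X_toy, y_toy,
    scoring="neg_root_mean_squared_error", cv=10,
)

# Scores are negated errors, so flip the sign to get one RMSE per fold.
rmse_per_fold = -scores
rmse_mean = rmse_per_fold.mean()
rmse_std = rmse_per_fold.std()

print(rmse_per_fold)
print("mean:", rmse_mean, "std:", rmse_std)
```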

### Task 10
Apply your model to the test data (from your train-test split) and report the final root mean squared error (RMSE).

In [None]:
# Insert your code here
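The final evaluation step might look like the sketch below, which again assumes synthetic data and a simplified pipeline (`StandardScaler` followed by `LinearRegression`) rather than the full penguin preprocessing. The point it illustrates is that the pipeline is fitted on the training data only, and the held-out test data is used once, at the end.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the penguin features and flipper lengths.
X_toy, y_toy = make_regression(
    n_samples=200, n_features=3, noise=10.0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Fit scaler and regression on the training data only.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# Predict on the held-out test data and compute the final RMSE.
test_predictions = model.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, test_predictions))
print(test_rmse)
```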