# <img style="float: left; padding-right: 10px; width: 45px" src="https://github.com/Harvard-IACS/2021-s109a/blob/master/lectures/crest.png?raw=true"> CS-S109A Introduction to Data Science 

## Lecture 11: Clustering, Missingness, and Wrapup

**Harvard University**<br>
**Summer 2021**<br>
**Instructors:** Kevin Rader<br>
**Authors:** Rahul Dave, David Sondak, Pavlos Protopapas, Chris Tanner, Eleni Kaxiras, Kevin Rader

---

In [None]:
## RUN THIS CELL TO GET THE RIGHT FORMATTING 
import requests
from IPython.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

# Table of Contents 
<ol start="0">
<li> Learning Goals </li> 
<li> Clustering </li> 
<li> Missingness and Imputation </li> 
</ol>

In [None]:
import sys
import pandas as pd
import numpy as np
import scipy as sp
import sklearn as sk
import matplotlib.pyplot as plt

from scipy.cluster.hierarchy import dendrogram
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier 

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer, MissingIndicator

## Learning Goals

This Jupyter notebook accompanies Lecture 11. By the end of this lecture, you should be more comfortable with:

- performing basic unsupervised clustering algorithms (k-means and hierarchical)
- using basic imputation models to handle missingness

## Part 0: Data 

For this section of the notebook we will be using 2 **unrelated** data sets:

1. The classic Fisher's Iris data set: [Wiki reference](https://en.wikipedia.org/wiki/Iris_flower_data_set)
2. `receiving_2020.csv`: NFL receiving statistics: [Source](https://www.pro-football-reference.com/years/2020/receiving.htm).  Note: receivers with fewer than 10 yards were removed.

Let's take a peek at them both:

In [None]:
# First, the common iris data set (from sklearn)
from sklearn import datasets
iris = datasets.load_iris()
X = pd.DataFrame(iris.data)  
X.columns = iris.feature_names
y = iris.target
print(iris.target_names)
np.unique(y,return_counts=True)

In [None]:
print(X.shape, y.shape)
X.head()

In [None]:
wr = pd.read_csv('../data/receiving_2020.csv')
print(wr.shape)
wr.head()

## Part 1a: *K*-Means Clustering

We first attempt *K*-Means clustering on the iris data set, which has a clear response variable (iris type).  Let's see if we can recover the three types through unsupervised clustering.  This is done using [`KMeans`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) in `sklearn.cluster`.

Let's start with $K=2$...unstandardized:

In [None]:
from sklearn.cluster import KMeans

kmeans2 = KMeans(n_clusters=2, random_state=109).fit(X[['sepal length (cm)','sepal width (cm)']])

# sum of squares from the centroids
print("Sum of Squared Distances =", kmeans2.inertia_)

#predict for new (or current) observations
print("Predicted clusters for each observation =", kmeans2.predict(X[['sepal length (cm)','sepal width (cm)']]))
# print(kmeans2.labels_): predict will match this for the training set

#the centroids
print("Centroid vectors =", kmeans2.cluster_centers_)


In [None]:
plt.scatter(X['sepal length (cm)'],X['sepal width (cm)'],c=kmeans2.labels_);
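
The `inertia_` value printed above is the within-cluster sum of squared distances: each observation's squared Euclidean distance to the centroid of the cluster it was assigned to, summed over all observations. The sketch below recomputes it by hand as a sanity check. It reloads the iris data so that it runs on its own; the names `features`, `km`, and `manual_inertia` are introduced here for illustration, and `n_init=10` is set explicitly only to keep the behavior the same across `sklearn` versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Same two sepal measurements as above, taken as a plain NumPy array
features = load_iris().data[:, :2]

km = KMeans(n_clusters=2, n_init=10, random_state=109).fit(features)

# Look up, for every observation, the centroid of the cluster it was assigned to
assigned_centroids = km.cluster_centers_[km.labels_]

# Sum of squared Euclidean distances to those centroids
manual_inertia = ((features - assigned_centroids) ** 2).sum()

print("inertia_ from sklearn:", km.inertia_)
print("recomputed by hand:   ", manual_inertia)
```

The two numbers should agree up to floating-point error, which is a useful reminder that *K*-means is minimizing exactly this quantity.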

**Q1.1** Edit the code above to re-run the analysis with several different seeds (the `random_state` argument).  Do the results change?

*your answer here*

**Q1.2** Edit the code above to re-run the analysis using the standardized predictors (the first two) and compare the results.

In [None]:
########
# your code here
########
from sklearn.preprocessing import StandardScaler

*your answer here*

**Q1.3** Re-run the analysis from above for $K=3$ and $K=4$.  Which appears to perform the best based on the scatterplots and based on the elbow method?
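
As a reminder of how the elbow method works: fit *K*-means for a range of $K$, record the sum of squared distances for each fit, and look for the $K$ beyond which adding another cluster stops buying a large reduction. The sketch below illustrates this on synthetic data from `make_blobs` (an assumption made here so that the exercise itself is left for you); `X_demo`, `inertias`, and `drops` are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption): 300 points drawn around 4 centers
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=109)

Ks = np.arange(1, 9)
inertias = np.array([
    KMeans(n_clusters=int(k), n_init=10, random_state=109).fit(X_demo).inertia_
    for k in Ks
])

# How much the sum of squared distances falls when one more cluster is added
drops = -np.diff(inertias)
for k, inertia, drop in zip(Ks[1:], inertias[1:], drops):
    print(f"K={k}: inertia={inertia:9.1f}, drop from K={k - 1}: {drop:9.1f}")
```

Plotting `inertias` against `Ks` gives the usual elbow plot; the printed drops show the same information numerically.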

In [None]:
########
# edit and add to the code below
########

kmeans2 = KMeans(n_clusters=2, random_state=109).fit(X[['sepal length (cm)','sepal width (cm)']])



In [None]:
# plot the three scatterplots like the one from above
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize = (18,5))

plt.show();

In [None]:
# plot (y = Sum of Squared Distances) vs. (x = K) and evaluate.

*your answer here*

**Q1.4** Compare the $K=3$ model with the actual 3 classes (`pd.crosstab` may be helpful).  How well did the clustering perform?
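
If `pd.crosstab` is unfamiliar, here is a minimal sketch on made-up labels (the values in `true_class` and `cluster_id` are hypothetical, not the iris results). Note that cluster ids are arbitrary: cluster `2` lining up with class `a` is just as good as cluster `0` lining up with it.

```python
import pandas as pd

# Toy labels (hypothetical): true classes vs. arbitrary cluster ids
true_class = pd.Series(["a", "a", "a", "b", "b", "c", "c", "c"], name="true class")
cluster_id = pd.Series([2, 2, 0, 0, 0, 1, 1, 1], name="cluster")

# Rows are true classes, columns are cluster ids, cells are counts
table = pd.crosstab(true_class, cluster_id)
print(table)
```

A clustering that recovers the classes well has one dominant cell in every row and every column.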

In [None]:
#######
# Your code here
#######

*your answer here*

---

**Q1.5** Perform a *K*-Means clustering analysis on the receiving data using the following variables: `['age','before_catch_perc','depth_of_target','broken_tackles_per_reception','drop_perc','catch_perc','td_per_reception','firstdowns_per_reception','fum_per_reception']`.  Consider using `Ks = np.arange(2,15,1)`.  Which $K$ would you select based on the elbow method?

Note: these variables were chosen specifically because they do not directly measure receiving ability; they are more a representation of play style.

In [None]:
#######
# Your code here
#######




*your answer here*

**Q1.6** Investigate the top 3 players in each of the clusters.  Are there any general patterns you notice (especially in the variables like `yards`, `rank`, and `pos`)?

In [None]:
#######
# Your code here
#######

*your answer here*

---

Side note: here's how to fit a hierarchical clustering model:

In [None]:
# fitting a hierarchical clustering
hier_cluster = AgglomerativeClustering(distance_threshold=0, n_clusters=None).fit(X)
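
`AgglomerativeClustering` does not draw a dendrogram itself, but the fitted model stores the merge history in `children_` and `distances_`, which can be assembled into the linkage matrix that `scipy.cluster.hierarchy.dendrogram` expects. The sketch below follows the approach shown in the scikit-learn documentation; it reloads the iris data to be self-contained, and `iris_features`, `model`, `counts`, and `linkage_matrix` are names chosen here for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris

iris_features = load_iris().data

# distance_threshold=0 with n_clusters=None builds the full tree and stores distances_
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None).fit(iris_features)

# Count how many original observations sit under each merge node
n_samples = len(model.labels_)
counts = np.zeros(model.children_.shape[0])
for i, merge in enumerate(model.children_):
    current_count = 0
    for child_idx in merge:
        if child_idx < n_samples:
            current_count += 1  # the child is an original observation (leaf)
        else:
            current_count += counts[child_idx - n_samples]  # the child is an earlier merge
    counts[i] = current_count

# Columns: child 1, child 2, merge distance, number of observations in the new node
linkage_matrix = np.column_stack([model.children_, model.distances_, counts]).astype(float)

# Show only the top three levels of the tree to keep the plot readable
dendrogram(linkage_matrix, truncate_mode="level", p=3)
plt.xlabel("Observations in node (or index of observation if no parentheses)")
plt.ylabel("Merge distance")
plt.show()
```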




---

## Part 2: Dealing with Missingness


Here, we create data in which the true theoretical regression line is:
$$ Y = 3X_1 - 2X_2 + \varepsilon,\hspace{0.1in} \varepsilon \sim N(0,1)$$

Note: $\rho_{X_1,X_2} = 0.5$
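
To see where the correlation of 0.5 comes from, write the construction used in the code cell below as $X_1 \sim N(0,1)$ and $X_2 = 0.5X_1 + \delta$, with $\delta \sim N(0, 0.75)$ independent of $X_1$ (the symbol $\delta$ is notation introduced here for the extra noise term). Then:

$$\begin{aligned}
\mathrm{Var}(X_2) &= 0.5^2\,\mathrm{Var}(X_1) + \mathrm{Var}(\delta) = 0.25 + 0.75 = 1,\\
\mathrm{Cov}(X_1, X_2) &= 0.5\,\mathrm{Var}(X_1) = 0.5,\\
\rho_{X_1,X_2} &= \frac{\mathrm{Cov}(X_1,X_2)}{\sqrt{\mathrm{Var}(X_1)\,\mathrm{Var}(X_2)}} = \frac{0.5}{\sqrt{1\cdot 1}} = 0.5.
\end{aligned}$$

The sample correlation computed from the simulated data will be close to, but not exactly, 0.5.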

We will be inserting missingness into `x1` in various ways, and analyzing the results.

In [None]:
n = 500
np.random.seed(109)

x1 = np.random.normal(0,1,size=n)
x2 = 0.5*x1+np.random.normal(0,np.sqrt(0.75),size=n)
X = pd.DataFrame(data=np.transpose([x1,x2]),columns=["x1","x2"])

y = 3*x1 - 2*x2 + np.random.normal(0,1,size=n)
y = pd.Series(y)


df = pd.DataFrame(data=np.transpose([x1,x2,y]),columns=["x1","x2","y"])

# Checking the correlation (the bare name `scipy` was never imported, so import stats explicitly)
from scipy import stats
stats.pearsonr(x1,x2) 

In [None]:
fig,(ax1,ax2,ax3) =  plt.subplots(1, 3, figsize = (18,5))
ax1.scatter(x1,y)
ax2.scatter(x2,y)
ax3.scatter(x2,x1,color="orange")
ax1.set_title("y vs. x1")
ax2.set_title("y vs. x2")
ax3.set_title("x1 vs. x2")
plt.show()

### Poke holes in $X_1$ in 3 different ways (roughly 20% of the values are removed each time): 

- MCAR: just take out a random sample of 20% of observations in $X_1$
- MAR: missingness in $X_1$ depends on $X_2$ (which is fully observed), and thus can be recovered in some way
- MNAR: missingness in $X_1$ depends on $X_1$ itself (in the code below, indirectly through $Y$), and thus cannot, in general, be recovered from the observed predictors
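
The three mechanisms can be told apart by asking what the probability of a hole is allowed to depend on. The sketch below is a standalone illustration, not the notebook's own code: it uses its own generator and a larger sample, and its MNAR mask depends on $X_1$ directly, which is an assumption made here to match the definition in the list above. It prints the fraction removed and the mean of the $X_1$ values that remain observed.

```python
import numpy as np

rng = np.random.default_rng(109)  # illustrative generator, separate from the notebook's seed
n = 5000
x1 = rng.normal(0, 1, size=n)
x2 = 0.5 * x1 + rng.normal(0, np.sqrt(0.75), size=n)

# MCAR: every row has the same 20% chance of losing x1
miss_mcar = rng.random(n) < 0.2
# MAR: the chance depends only on x2, which is always observed
miss_mar = rng.random(n) < 0.05 + 0.85 * (x2 > x2.mean() + x2.std())
# MNAR: the chance depends on the very value of x1 that goes missing
miss_mnar = rng.random(n) < 0.05 + 0.85 * (x1 > x1.mean() + x1.std())

for name, miss in [("MCAR", miss_mcar), ("MAR", miss_mar), ("MNAR", miss_mnar)]:
    print(f"{name}: fraction missing = {miss.mean():.3f}, "
          f"mean of x1 among observed rows = {x1[~miss].mean():.3f}")
```

Under MCAR the observed rows remain a representative sample of $X_1$ (true mean 0); under the other two mechanisms the observed mean is pulled downward because large values are removed preferentially.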


In [None]:
x1_mcar = x1.copy()
x1_mar = x1.copy()
x1_mnar = x1.copy()

#missing completely at random
miss_mcar = np.random.choice(n,int(0.2*n),replace=False)
x1_mcar[miss_mcar] = np.nan

#missing at random: one way to do it
miss_mar = np.random.binomial(1,0.05+0.85*(x2>(x2.mean()+x2.std())),n)
x1_mar[miss_mar==1] = np.nan

#missing not at random: one way to do it
miss_mnar = np.random.binomial(1,0.05+0.85*(y>(y.mean()+y.std())),n)
x1_mnar[miss_mnar==1] = np.nan

In [None]:
# Create the 3 datasets with missingness
df_mcar = df.copy()
df_mar = df.copy()
df_mnar = df.copy()

# plug in the appropriate x1 with missingness
df_mcar['x1'] = x1_mcar
df_mar['x1'] = x1_mar
df_mnar['x1'] = x1_mnar

In [None]:
# no missingness: on the full dataset
ols = LinearRegression().fit(df[['x1','x2']],df['y'])
print(ols.intercept_,ols.coef_)

In [None]:
# Fit the linear regression blindly on the dataset with MCAR missingness, see what happens
LinearRegression().fit(df_mcar[['x1','x2']],df_mcar['y'])

**Q1** Why aren't the estimates exactly $\hat{\beta}_1 = 3$ and $\hat{\beta}_2 = -2$ ?  How does sklearn handle missingness?  What would be a first naive approach to handling missingness?

*your answer here*

### What happens when you just drop rows?

In [None]:
# no missingness for comparison sake
ols = LinearRegression().fit(X,y)
print(ols.intercept_,ols.coef_)

In [None]:
# MCAR: drop the rows that have any missingness
ols_mcar = LinearRegression().fit(df_mcar.dropna()[['x1','x2']],df_mcar.dropna()['y'])
print(ols_mcar.intercept_,ols_mcar.coef_)

In [None]:
# MAR: drop the rows that have any missingness
ols_mar = LinearRegression().fit(df_mar.dropna()[['x1','x2']],df_mar.dropna()['y'])
print(ols_mar.intercept_,ols_mar.coef_)

In [None]:
# MNAR: drop the rows that have any missingness
X_mnar_raw = X.copy()
X_mnar_raw['x1'] = x1_mnar
X_mnar = X.iloc[miss_mnar==0]
y_mnar = y[miss_mnar==0]

ols_mnar = LinearRegression().fit(X_mnar,y_mnar)
print(ols_mnar.intercept_,ols_mnar.coef_)

**Q2** How do the estimates compare when just dropping rows?  Are they able to recover the values of $\beta_1$ that they should?  In which form of missingness is the result the worst?

*your answer here*

## Let's Start Imputing

In [None]:
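# Make back-up copies for later since we'll have lots of imputation approaches.
# Each copy holds the two predictors, with NaNs still in x1 and all n rows kept.
X_mcar_raw = df_mcar[['x1','x2']].copy()
X_mar_raw = df_mar[['x1','x2']].copy()
X_mnar_raw = df_mnar[['x1','x2']].copy()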

### Mean Imputation:

Perform mean imputation using the `fillna`, `dropna`, and `mean` functions.
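
For reference, the same result can be obtained with `SimpleImputer` from `sklearn.impute`. The sketch below compares the two routes on a small made-up frame (the values in `toy` are hypothetical). Note that `Series.mean()` skips `NaN` by default, so the `dropna()` call is optional in the pandas version.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (hypothetical values) with two holes in x1
toy = pd.DataFrame({"x1": [1.0, np.nan, 3.0, np.nan, 8.0],
                    "x2": [0.5, 1.5, 2.5, 3.5, 4.5]})

# pandas route: fill the holes in x1 with the mean of the observed x1 values
filled_pandas = toy.fillna({"x1": toy["x1"].mean()})

# sklearn route: fit learns each column's mean, transform fills the holes with it
imputer = SimpleImputer(strategy="mean")
filled_sklearn = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled_pandas)
print(filled_sklearn)
```

The `sklearn` version has the advantage that the means learned on training data can be reused on a test set via `imputer.transform`.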

In [None]:
X_mcar = X_mcar_raw.copy()
X_mcar['x1'] = X_mcar['x1'].fillna(X_mcar['x1'].dropna().mean())

ols_mcar_mean = LinearRegression().fit(X_mcar,y)
print(ols_mcar_mean.intercept_,ols_mcar_mean.coef_)

In [None]:
X_mar = X_mar_raw.copy()


X_mar['x1'] = X_mar['x1'].fillna(X_mar['x1'].dropna().mean())

ols_mar_mean = LinearRegression().fit(X_mar,y)
print(ols_mar_mean.intercept_,ols_mar_mean.coef_)

In [None]:
X_mnar = X_mnar_raw.copy()
X_mnar['x1'] = X_mnar['x1'].fillna(X_mnar['x1'].dropna().mean())

ols_mnar_mean = LinearRegression().fit(X_mnar,y)
print(ols_mnar_mean.intercept_,ols_mnar_mean.coef_)

**Q3** How do the estimates compare when performing mean imputation vs. just dropping rows?  Have things gotten better or worse (for what types of missingness)?

*your answer here*

### Linear Regression Imputation 

This is difficult to keep straight.  There are two models here: 

1. an imputation model based on OLS concerning just the predictors (to predict $X_1$ from $X_2$) and 
2. the model we really care about to predict $Y$ from the 'improved' $X_1$ (now with imputed values) and $X_2$.
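
The imputation step (model 1) can also be delegated to `IterativeImputer`, which regresses each column with holes on the other columns. The sketch below is a standalone comparison under stated assumptions: it simulates its own MCAR data with an illustrative generator, and it passes `estimator=LinearRegression()` so that the imputer uses OLS rather than its default Bayesian ridge model. With only one incomplete column, the two routes should coincide up to floating-point error.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the next import)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(109)  # illustrative generator, separate from the notebook's seed
n = 500
x1 = rng.normal(0, 1, size=n)
x2 = 0.5 * x1 + rng.normal(0, np.sqrt(0.75), size=n)
X_demo = pd.DataFrame({"x1": x1, "x2": x2})
X_demo.loc[rng.random(n) < 0.2, "x1"] = np.nan  # MCAR holes in x1

# Manual route: regress x1 on x2 using the complete rows, then predict the holes
complete = X_demo.dropna()
imputation_model = LinearRegression().fit(complete[["x2"]], complete["x1"])
manual = X_demo.copy()
holes = manual["x1"].isna()
manual.loc[holes, "x1"] = imputation_model.predict(manual.loc[holes, ["x2"]])

# sklearn route: the same idea wrapped in a transformer
auto = IterativeImputer(estimator=LinearRegression(), random_state=109).fit_transform(X_demo)

print("largest difference between the two routes:", np.abs(manual.to_numpy() - auto).max())
```

Either way, the imputed values lie exactly on the fitted line, so they carry less variability than real observations would.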

In [None]:
X_mcar = X_mcar_raw.copy()

# fit the imputation model
ols_imputer_mcar = LinearRegression().fit(X_mcar.dropna()[['x2']],X_mcar.dropna()['x1'])

# perform some imputations
yhat_impute = pd.Series(ols_imputer_mcar.predict(X_mcar[['x2']]))
X_mcar['x1'] = X_mcar['x1'].fillna(yhat_impute)

# fit the model we care about
ols_mcar_ols = LinearRegression().fit(X_mcar,y)
print(ols_mcar_ols.intercept_,ols_mcar_ols.coef_)

In [None]:
X_mar = X_mar_raw.copy()
ols_imputer_mar = LinearRegression().fit(X_mar.dropna()[['x2']],X_mar.dropna()['x1'])

yhat_impute = pd.Series(ols_imputer_mar.predict(X_mar[['x2']]))
X_mar['x1'] = X_mar['x1'].fillna(yhat_impute)

ols_mar_ols = LinearRegression().fit(X_mar,y)
print(ols_mar_ols.intercept_,ols_mar_ols.coef_)

In [None]:

X_mnar = X_mnar_raw.copy()
ols_imputer_mnar = LinearRegression().fit(X_mnar.dropna()[['x2']],X_mnar.dropna()['x1'])

yhat_impute = pd.Series(ols_imputer_mnar.predict(X_mnar[['x2']]))
X_mnar['x1'] = X_mnar['x1'].fillna(yhat_impute)

ols_mnar_ols = LinearRegression().fit(X_mnar,y)
print(ols_mnar_ols.intercept_,ols_mnar_ols.coef_)

**Q4**: How do the estimates compare when performing model-based imputation vs. mean imputation?  Have things gotten better or worse (for what types of missingness)?

*your answer here*

### $k$-NN Imputation ($k$=3)
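
To see what `KNNImputer` does, consider a tiny made-up example (the values in `toy` are hypothetical). For a row with a hole, the imputer finds the nearest rows that have that feature observed, measuring distance on the features both rows share, and averages their values. The cells below use `n_neighbors=3`; the toy example uses 2 to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data (hypothetical values): columns are x1, x2; the last row is missing x1
toy = np.array([[ 0.0,   0.0],
                [10.0,   1.0],
                [20.0,   2.0],
                [np.nan, 3.0]])

imputed = KNNImputer(n_neighbors=2).fit_transform(toy)

# The two rows closest in x2 to the last row have x2 = 2 and x2 = 1,
# so the hole is filled with the mean of their x1 values: (20 + 10) / 2 = 15
print(imputed)
```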

In [None]:
X_mcar = X_mcar_raw.copy()
X_mcar = KNNImputer(n_neighbors=3).fit_transform(X_mcar)

ols_mcar_knn = LinearRegression().fit(X_mcar,y)
print(ols_mcar_knn.intercept_,ols_mcar_knn.coef_)

In [None]:
X_mar = X_mar_raw.copy()
X_mar = KNNImputer(n_neighbors=3).fit_transform(X_mar)

ols_mar_knn = LinearRegression().fit(X_mar,y)
print(ols_mar_knn.intercept_,ols_mar_knn.coef_)

In [None]:
X_mnar = X_mnar_raw.copy()
X_mnar = KNNImputer(n_neighbors=3).fit_transform(X_mnar)

ols_mnar_knn = LinearRegression().fit(X_mnar,y)
print(ols_mnar_knn.intercept_,ols_mnar_knn.coef_)

**Q5**: Which of the 4 methods for handling missingness worked best?  Which worked the worst?  Were the estimates improved or worsened in each of the 3 types of missingness?

*your answer here*

**Q6**: This exercise focused on 'inference' (considering just the estimates of coefficients, not the uncertainty of these estimates, which would be even worse).  What are the ramifications on prediction?  Is the situation more or less concerning?  

*your answer here*

---