# <img style="float: left; padding-right: 10px; width: 45px" src="https://github.com/Harvard-IACS/2018-CS109A/blob/master/content/styles/iacs.png?raw=true"> CS-S109A Introduction to Data Science 

## Lecture 3 (*k*-NN and Linear Regression)

**Harvard University**<br>
**Summer 2020**<br>
**Instructors:** Kevin Rader<br>
**Authors:** Rahul Dave, David Sondak, Will Claybaugh, Pavlos Protopapas, Chris Tanner, Kevin Rader

---

In [None]:
## RUN THIS CELL TO GET THE RIGHT FORMATTING 
import requests
from IPython.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

# Table of Contents 
<ol start="0">
<li> Learning Goals </li>
<li> k-NN</li>
<li> Linear Regression Basics </li>
<li> Model Accuracy and Comparison </li>
<li> Inference in Linear Regression </li>
</ol>

## Learning Goals

This Jupyter notebook accompanies Lecture 3. By the end of this lecture, you should be able to:

- Understand the basics of statistical modeling
- Perform predictions and interpret the results of *k*-NN and simple linear regression models
- Evaluate the accuracy of models and compare them
- Perform basic inferences in linear regression models
- Be comfortable fitting and using models from both `sklearn` and `statsmodels`, when appropriate 

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import sklearn as sk
import statsmodels as sm
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)

## Part 0: Reading in and exploring the data

Two datasets are provided for this notebook: the train and test splits of a simplified data set for regression modeling. Each row represents a county in the US. We would like to build models to predict `votergap` in the 2016 election (Trump-Clinton) from `density` (population density in persons per square mile).

We start by reading in the datasets for you:

**Important note: use the training dataset for all exploratory analysis and model fitting.  Only use the test dataset to evaluate and compare models.**


In [None]:

train = pd.read_csv("../data/county_election_train.csv")
test = pd.read_csv("../data/county_election_test.csv")


**Q0.1:** Look at summary statistics and visuals to explore the distributions of the 2 variables of interest along with visualizing their association.  Briefly summarize what you notice.

In [None]:
######
# your code here
###### 


*your answer here*

Note: density is very right-skewed.  Let's consider using the log scale version:

**Q0.2:** Create the variable `log_density` using `np.log` and recreate the affected visuals from before.  Comment on the appropriateness of *k*-NN and linear regression using the log-transformed and untransformed versions of density.

In [None]:
######
# your code here
######


*your answer here*

## Part 1: Fitting and using sklearn's k-NN model


### Fitting a k-NN model with k = 1

In [None]:
from sklearn.neighbors import KNeighborsRegressor
knn1 = KNeighborsRegressor(n_neighbors=1)
knn1.fit(train['log_density'], train['votergap'])

#Note you will get an error message.  What does this tell you?

In [None]:
#Let's fit the model appropriately
knn1.fit(train['log_density'].values.reshape(-1, 1), train['votergap'])
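
The error in the first attempt arises because scikit-learn estimators expect the feature matrix to be two-dimensional, with shape `(n_samples, n_features)`, while a single column's values are one-dimensional. A minimal sketch of the shape change, using a made-up array `x` as a stand-in for `train['log_density'].values`:

```python
import numpy as np

# Hypothetical stand-in for train['log_density'].values (a 1-D array)
x = np.array([0.5, 1.2, 3.3, 4.1, 6.0])
print(x.shape)   # one dimension: 5 values

# reshape(-1, 1) keeps every value but arranges them as a single column;
# the -1 asks NumPy to infer the number of rows
X = x.reshape(-1, 1)
print(X.shape)   # two dimensions: 5 rows, 1 column
```

Selecting the column with double brackets, `train[['log_density']]`, also gives a two-dimensional object (a one-column DataFrame) that `fit` accepts.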


**Q1.1** Fit three more k-NN models, with k = 10, k = 50, and k = 100, and save these as consistently named objects.

In [None]:
######
# your code here
######



Predictions can be made and saved using the fitted model's `.predict()` method, and the results can be interpreted by plotting the predictions on top of the scatterplot:

In [None]:

yhat1 = knn1.predict(train['log_density'].values.reshape(-1, 1))

x_dummy1 = np.arange(-2,11,0.01)
yhat_dummy1 = knn1.predict(x_dummy1.reshape(-1, 1))

plt.scatter(train['log_density'],train['votergap'])
plt.plot(x_dummy1,yhat_dummy1,c="r")
plt.show()
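
Conceptually, `KNeighborsRegressor` with its default uniform weights predicts at a new point by averaging the responses of the k closest training points. The sketch below does this by hand for a single predictor. The function `knn_predict` and the toy arrays are hypothetical illustrations, not part of `sklearn` or the county data, and ties between equidistant neighbors may be broken differently than in `sklearn`.

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k):
    """Predict at each point in x_new by averaging y over the k nearest x_train values."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    preds = []
    for x0 in np.atleast_1d(x_new):
        dist = np.abs(x_train - x0)        # distance along the single predictor
        nearest = np.argsort(dist)[:k]     # indices of the k closest training points
        preds.append(y_train[nearest].mean())
    return np.array(preds)

# Made-up toy data for illustration only
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([10.0, 12.0, 30.0, 31.0, 50.0])

pred_k1 = knn_predict(x_toy, y_toy, [3.4], k=1)   # uses only the point at x = 3
pred_k3 = knn_predict(x_toy, y_toy, [3.4], k=3)   # averages the points at x = 3, 4, 2
print(pred_k1, pred_k3)
```

With k = 1 the prediction copies the single nearest response, which is why the red curve above is so jagged; larger k averages over more points and smooths the curve.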

**Q1.2** Recreate the scatterplot above but with all 4 k-NN models presented.  Which of the 4 models do you think is most appropriate for predicting `votergap`?

In [None]:
######
# your code here
######



*your answer here*


---

## Part 2: Linear Regression Basics 

In this section we will fit a linear model two ways: using both `sklearn` and `statsmodels`.

First let's fit it using `sklearn`'s [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html):

In [None]:
from sklearn.linear_model import LinearRegression

regress = LinearRegression(fit_intercept=True).fit(train['log_density'].values.reshape(-1, 1), train['votergap'])

print("Beta0 =", regress.intercept_ ,", Beta1 =", regress.coef_)


**Q2.1** Plot the scatterplot with the fitted line.  Interpret the appropriateness of a linear regression model from this plot.

In [None]:
######
# your code here
######



Next, let's use `statsmodels`'s [linear_model.OLS](https://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.OLS.html):

In [None]:
import statsmodels.regression.linear_model as lm

X = sm.tools.add_constant(train['log_density'])
model = lm.OLS(train['votergap'],X)
results = model.fit()
results.params
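
Both libraries compute the ordinary least squares estimates, so their coefficients should agree up to numerical precision. For one predictor the estimates have a closed form, sketched below on made-up toy data (not the county data). The design matrix with a leading column of ones plays the role that `add_constant` plays for `statsmodels`, whereas `sklearn` adds the intercept itself when `fit_intercept=True`.

```python
import numpy as np

# Made-up toy data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form estimates for simple linear regression
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# The same fit via least squares on a design matrix [1, x]
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta0, beta1)   # closed-form intercept and slope
print(coef)           # least squares solution, ordered [intercept, slope]
```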

**Q2.2** Write down the estimated regression model here.  What is the predicted value for Middlesex County in Massachusetts (where Harvard resides)?

In [None]:
######
# your code here
######


*your answer here*

**Q2.3** Plot the scatterplot with fitted regression line for this simple linear regression model.  Does a linear regression model seem appropriate?

In [None]:
######
# your code here
######


*your answer here*

## Part 3: Model Accuracy and Comparison 


`sklearn` provides a convenient way to calculate $R^2$ for a model through the fitted model's `.score()` method:

In [None]:
print(knn1.score(train['log_density'].values.reshape(-1, 1),train['votergap']))

print(regress.score(train['log_density'].values.reshape(-1, 1),train['votergap']))
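
For regressors, `.score()` returns $R^2 = 1 - SSE/SST$, where the total sum of squares is taken around the mean of the responses being scored. A minimal sketch of that definition follows. The helper `r_squared` and the toy values are hypothetical and only illustrate the formula.

```python
import numpy as np

def r_squared(y, yhat):
    """R^2 = 1 - SSE/SST, where SST uses the mean of the y values being scored."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)
    sse = np.sum((y - yhat) ** 2)       # variation left unexplained by the predictions
    sst = np.sum((y - y.mean()) ** 2)   # total variation around the mean
    return 1 - sse / sst

# Made-up values for illustration only
y_obs = np.array([3.0, 5.0, 7.0, 9.0])
r2_perfect = r_squared(y_obs, y_obs)                      # predictions match exactly
r2_mean = r_squared(y_obs, np.full(4, y_obs.mean()))      # always predict the mean
r2_reversed = r_squared(y_obs, y_obs[::-1])               # worse than predicting the mean
print(r2_perfect, r2_mean, r2_reversed)
```

A 1-NN model reproduces each training response (barring counties that share the same predictor value), so its training $R^2$ is essentially 1. This is one reason model comparison should rely on the test set, where $R^2$ can even be negative.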



**Q3.1** On the test set, calculate the $R^2$ score for all models seen so far.  Which of the $k$-NN models wins out?  Does it outperform linear regression?

In [None]:
######
# your code here
######


*your answer here*

**Q3.2** Calculate MSE and $R^2$ manually from the linear regression model.




In [None]:
######
# your code here
######


## Part 4: Inference in Linear Regression 

`statsmodels` is definitely the way to go for inferential calculations in linear regression models.  Use the output below to perform some interpretations:

In [None]:
model.fit().summary()
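
The `coef`, `std err`, `t`, `P>|t|`, and `[0.025  0.975]` columns of the summary table are tied together by the standard formulas for simple linear regression. These are sketched here under the usual assumptions of independent, normally distributed errors with constant variance:

$$
\hat{\beta}_j \pm t^{*}_{n-2}\,\widehat{SE}(\hat{\beta}_j), \qquad
t = \frac{\hat{\beta}_1 - 0}{\widehat{SE}(\hat{\beta}_1)} \quad \text{for } H_0\!: \beta_1 = 0 \ \text{ vs. } \ H_A\!: \beta_1 \neq 0
$$

Here $t^{*}_{n-2}$ is the 0.975 quantile of the $t$ distribution with $n-2$ degrees of freedom, and the two-sided p-value is the probability that a $t_{n-2}$ random variable exceeds $|t|$ in absolute value.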

**Q4.1** Provide 95\% confidence intervals for $\beta_0$ and $\beta_1$.  Interpret the results.

*your answer here*

**Q4.2** Perform a formal hypothesis test to determine whether `votergap` is significantly associated with `density` (on the log scale).

*your answer here*

**Q4.3** Calculate the predicted values and residuals for the train set from the linear regression model.  Plot a histogram of the residuals and a residuals-vs.-predicted scatterplot.  Comment on the assumptions of the linear regression model.

In [None]:
######
# your code here
######

*your answer here*

**Q4.4** *we won't have time for this* Use a bootstrap approach to calculate 95\% confidence intervals for $\beta_0$ and $\beta_1$, and compare them to the probability-based ones from statsmodels above.

In [None]:
######
# your code here
######

*your answer here*