## Week 4, Lab 1: Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 
- However, there are some additional questions along the way that don't fit neatly into the one main example we'll walk through. Any question that isn't explicitly part of the main example is marked with **(detour)** at the start of the question.

---
## Step 1: Define The Problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

Specifically, the professor says "I need to prove that left-handedness is caused by some personality trait. Go find that personality trait and the data to back it up."

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### 1. In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

Answer:
Which questions on the survey correlate with left-handedness?


---
## Step 2: Obtain the data.

### 2. Read in the file titled "data.csv."
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [1]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from IPython.display import display
import sklearn.metrics as metrics
#from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [2]:
# Load the file; despite the .csv extension, the fields are separated by tabs
df = pd.read_csv('./data.csv', sep='\t')

In [3]:
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3


### 3. Suppose that, instead of us giving you this data in a file, you were actually conducting a survey to gather this data yourself. From an ethics/privacy point of view, what are three things you might consider when attempting to gather this data?
> When working with sensitive data like sexual orientation or gender identity, we need to consider how this data could be used if it fell into the wrong hands!

1) It is imperative to anonymize the data, either by preserving only aggregates or by simply not recording personally identifiable details.
2) It is reasonable to assume that people who fall outside whatever society perceives as the norm would be hesitant to participate in this survey, which would skew the data.
3) Respondents should give informed consent: they should know what is being collected and how it will be used, and they should be able to skip sensitive questions.

---
## Step 3: Explore the data.

### 4. Conduct exploratory data analysis on this dataset.
> If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

In [4]:
# Check for nonsensical values. For example, the max age does not make sense.
pd.set_option('display.max_columns', 55)
df.describe()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q20,Q21,Q22,Q23,Q24,Q25,Q26,Q27,Q28,Q29,Q30,Q31,Q32,Q33,Q34,Q35,Q36,Q37,Q38,Q39,Q40,Q41,Q42,Q43,Q44,introelapse,testelapse,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
count,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0
mean,1.962715,3.829589,2.846558,3.186902,2.86544,3.672084,3.216539,3.184512,2.761233,3.522945,2.748805,2.852772,2.657505,3.33413,3.168021,2.93021,2.564771,3.424952,2.928537,3.639818,2.867591,3.595124,3.861138,3.337237,1.999761,3.001434,2.730641,2.624044,2.543738,2.894359,3.002151,2.869503,2.741874,3.022228,3.074092,2.61066,3.465344,2.798757,2.569312,2.984226,3.385277,2.704828,2.676386,2.736616,347.808556,479.994503,1.576243,1.239962,30.370698,2.317878,1.654398,1.833413,5.013623,2.394359,1.190966
std,1.360291,1.551683,1.664804,1.476879,1.545798,1.342238,1.490733,1.387382,1.511805,1.24289,1.443078,1.556284,1.559575,1.522866,1.501683,1.575544,1.61901,1.413236,1.493122,1.414569,1.360858,1.354475,1.291425,1.426095,1.290747,1.48061,1.485883,1.481709,1.611428,1.477968,1.420032,1.659141,1.40567,1.562694,1.5464,1.409707,1.52146,1.413584,1.621772,1.483752,1.423055,1.544345,1.523097,1.471845,5908.901681,3142.178542,0.494212,0.440882,367.201726,0.874264,0.640915,1.303454,1.970996,2.184164,0.495357
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,7.0,1.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,3.0,1.0,2.0,1.0,3.0,2.0,2.0,1.0,3.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0,3.0,2.0,3.0,3.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,6.0,186.0,1.0,1.0,18.0,2.0,1.0,1.0,5.0,1.0,1.0
50%,1.0,5.0,3.0,3.0,3.0,4.0,3.0,3.0,3.0,4.0,3.0,3.0,3.0,4.0,3.0,3.0,2.0,4.0,3.0,4.0,3.0,4.0,4.0,3.0,1.0,3.0,3.0,3.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,2.0,4.0,3.0,2.0,3.0,4.0,3.0,3.0,3.0,12.0,242.0,2.0,1.0,21.0,2.0,2.0,1.0,6.0,2.0,1.0
75%,3.0,5.0,5.0,5.0,4.0,5.0,5.0,4.0,4.0,5.0,4.0,4.0,4.0,5.0,5.0,4.0,4.0,5.0,4.0,5.0,4.0,5.0,5.0,5.0,3.0,4.0,4.0,4.0,4.0,4.0,4.0,5.0,4.0,4.0,5.0,4.0,5.0,4.0,4.0,4.0,5.0,4.0,4.0,4.0,35.0,324.25,2.0,1.0,27.0,3.0,2.0,2.0,6.0,2.0,1.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,252063.0,119834.0,2.0,2.0,23763.0,4.0,3.0,5.0,7.0,7.0,3.0


In [5]:
df['age'].sort_values(ascending=False).head(5)

2690    23763
2137      409
2075      123
2101       86
1736       86
Name: age, dtype: int64

In [6]:
df['hand'].sort_values().head(12)

2081    0
3105    0
4015    0
3098    0
2471    0
2690    0
1846    0
2409    0
2703    0
1322    0
1145    0
2699    1
Name: hand, dtype: int64

In [7]:
# Drop rows with unreasonable age values (100 or more) and missing hand
# values (0). Other columns also contain responses outside the acceptable
# range, but by my estimation leaving those in is unlikely to seriously
# hurt the results.
df = df[(df['age'] < 100) & (df['hand'] > 0)]

---
## Step 4: Model the data.

### 5. Suppose I wanted to use Q1 - Q44 to predict whether or not the person is left-handed. Would this be a classification or regression problem? Why?

Answer: Classification. The response variable (left-handed or not) is a binary category, not a quantity.

### (detour) 6. While this isn't the problem we set out to solve, suppose I wanted to predict the age of the respondent using Q1 - Q44 as my predictors. Would this be a classification or regression problem? Why?

Answer: Regression. Age is a continuous numeric quantity, so we would be predicting a number rather than a class.

### 7. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed based on their responses to Q1 - Q44. Before doing that, however, you remember that it is often a good idea to standardize your variables. In general, why would we standardize our variables? Give an example of when we would standardize our variables.

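Answer: We standardize so that every feature is on a comparable scale. Otherwise, features measured in large units dominate anything that depends on magnitudes, such as the distances in $k$-NN or the penalty in a regularized regression. For example, if we used `age` (tens of years) alongside a 1-5 survey answer in $k$-NN, differences in age would swamp differences in the answer unless both were standardized. The main exception is binary variables; otherwise, most models benefit from scaling all continuous variables.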
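One reason scale matters for $k$-NN is that the model ranks neighbors by distance. The sketch below uses three made-up respondents (the ages and answers are illustrative assumptions, not rows from `data.csv`) and compares Euclidean distances before and after z-scoring each feature.

```python
import math
import statistics

# Hypothetical respondents: (age in years, answer on a 1-5 scale).
people = [(20, 1), (22, 5), (60, 1)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize(rows):
    """Z-score each column: subtract its mean, divide by its standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [tuple((x - m) / s for x, m, s in zip(row, means, stds)) for row in rows]

raw_d01 = euclidean(people[0], people[1])  # differ mostly in the survey answer
raw_d02 = euclidean(people[0], people[2])  # differ only in age

scaled = standardize(people)
scaled_d01 = euclidean(scaled[0], scaled[1])
scaled_d02 = euclidean(scaled[0], scaled[2])

print(f"raw:    {raw_d01:.2f} vs {raw_d02:.2f}")
print(f"scaled: {scaled_d01:.2f} vs {scaled_d02:.2f}")
```

Before scaling, the age gap makes the third respondent look far more distant than the second; after scaling, the two distances are comparable, so the survey answer is no longer drowned out.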

### 8. Give an example of when we might not standardize our variables.

Answer: Dummy variables should not be standardized. They are binary by nature, and standardizing them makes their values harder to interpret.

### 9. Based on your answers to 7 and 8, do you think we should standardize our predictor variables in this case? Why or why not?

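Answer: I believe we should, since none of the predictors are dummy columns. That said, Q1 - Q44 are all answered on the same 1-5 scale, so I don't think standardizing will make much of a difference in this case. `fromgoogle` and `engnat` each have only two possible response categories, so it could be argued they should be exempt, but they are not among the Q1 - Q44 predictors.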
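If standardization is applied, one common way to do it is inside a scikit-learn pipeline, so the scaler is fit on the training rows only. This is a minimal sketch on synthetic data: the array shapes mimic 44 answers on a 1-5 scale, but the values and labels are random stand-ins (an assumption for illustration), so the resulting accuracy means nothing by itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the survey: 200 respondents, 44 answers on a 1-5 scale.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 6, size=(200, 44)).astype(float)
y_demo = rng.integers(0, 2, size=200)  # random labels, for illustration only

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# The scaler's mean and standard deviation are learned from the training rows
# only and then reused to transform the test rows.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pipe.fit(X_tr, y_tr)
demo_accuracy = pipe.score(X_te, y_te)
```

Keeping the scaler inside the pipeline also means cross-validation refits it on each training fold, which avoids leaking test-fold statistics into the model.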

### 10. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed. What munging/cleaning do we need to do to our $y$ variable in order to explicitly answer this question? Do it.

Answer: It depends on whether we count ambidextrous individuals as left-handed. In this case, I think it best to exclude them and keep only explicitly left- or right-handed respondents. We have already removed invalid entries from `hand`; next we drop rows where `hand == 3`.

In [8]:
df = df[df['hand'] < 3]
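The filter above leaves `hand` coded as 1 and 2. If an explicit 0/1 target is preferred, a sketch like the following could be used. It assumes the coding given in the BONUS section (0 = missing, 1 = right, 2 = left, 3 = both); the small `demo` frame and the `left_handed` column name are hypothetical.

```python
import pandas as pd

# Tiny hypothetical frame using the lab's coding: 0=missing, 1=right, 2=left, 3=both.
demo = pd.DataFrame({'hand': [1, 2, 3, 0, 1, 2]})

# Keep only clearly right- or left-handed respondents.
demo = demo[demo['hand'].isin([1, 2])].copy()

# 1 means left-handed, 0 means right-handed.
demo['left_handed'] = (demo['hand'] == 2).astype(int)
```

With this encoding, the positive class is unambiguously "left-handed", which makes later coefficient interpretation easier to state.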

In [9]:
df.corr()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q20,Q21,Q22,Q23,Q24,Q25,Q26,Q27,Q28,Q29,Q30,Q31,Q32,Q33,Q34,Q35,Q36,Q37,Q38,Q39,Q40,Q41,Q42,Q43,Q44,introelapse,testelapse,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
Q1,1.0,-0.124302,0.269071,-0.066977,0.20452,-0.077903,0.167492,-0.124964,0.236268,-0.057231,0.106825,-0.115986,0.176474,-0.067338,0.236642,-0.070227,0.195065,-0.041593,0.156262,-0.176978,0.12575,-0.098872,0.091471,-0.146249,0.204592,-0.062077,0.193494,-0.053706,0.213146,-0.060641,0.167829,-0.088928,0.137246,-0.072017,0.257918,-0.073095,0.176698,-0.057123,0.288909,-0.074934,0.162278,-0.06389,0.201736,-0.099059,-0.019478,-0.004663,0.010163,0.010045,0.022102,0.065105,-0.163622,-0.06039,-0.017253,-0.010767,-0.002313
Q2,-0.124302,1.0,-0.022664,0.275186,-0.043959,0.162859,-0.089243,0.150415,-0.096658,0.148893,-0.058272,0.249166,-0.212233,0.285582,0.057738,0.271666,-0.066861,0.160714,-0.09379,0.345268,-0.088263,0.203966,0.061723,0.188621,-0.192461,0.231407,-0.011939,0.248266,-0.035782,0.230293,-0.180986,0.217824,-0.218595,0.241395,-0.094474,0.206757,-0.080132,0.239199,-0.045601,0.253073,-0.111486,0.233846,-0.013892,0.30415,-0.00059,0.011582,0.015827,-0.108305,-0.062874,-0.084235,0.367083,0.140369,0.019888,0.033727,-0.027727
Q3,0.269071,-0.022664,1.0,0.060398,0.274082,-0.008631,0.212048,-0.118439,0.418912,-0.02513,0.149193,-0.111922,0.062062,0.011475,0.337915,-0.063942,0.326857,0.003689,0.175583,-0.143116,0.163833,-0.070226,0.200171,-0.071173,0.248011,-0.020454,0.262655,-0.008678,0.356149,-0.040715,0.17265,-0.063362,0.136448,-0.055449,0.348796,-0.007798,0.240999,-0.031981,0.495097,-0.037134,0.21153,-0.00176,0.264397,-0.005642,-0.024197,-0.020874,0.018488,-0.0427,-0.04372,-0.019865,-0.133494,0.046367,0.012955,0.059237,-0.011444
Q4,-0.066977,0.275186,0.060398,1.0,0.057696,0.088421,-0.046768,0.149198,-0.078819,0.062452,-0.10603,0.194175,-0.204035,0.247608,0.017517,0.26039,-0.036379,0.11103,-0.108384,0.237703,-0.082419,0.264681,0.051483,0.150588,-0.069949,0.252869,0.020593,0.21967,0.026076,0.171889,-0.078671,0.154047,-0.128801,0.273077,0.036431,0.237448,-0.072106,0.378345,0.024986,0.221926,-0.023,0.32621,0.013101,0.336347,-0.00471,0.00967,-0.001833,-0.071767,-0.033386,-0.003534,0.233218,0.083232,-0.012573,0.046021,-0.034601
Q5,0.20452,-0.043959,0.274082,0.057696,1.0,0.08716,0.256646,-0.04937,0.231433,0.0133,0.062632,-0.08705,0.079736,0.041851,0.330632,-0.002938,0.294067,0.019468,0.071814,-0.059699,0.103751,0.003425,0.128116,-0.012174,0.196734,0.041413,0.262687,-0.008594,0.211169,-0.005272,0.140557,0.009818,0.115514,-0.019399,0.241805,-0.022061,0.180377,0.031906,0.247185,-0.00753,0.20509,0.011803,0.262797,0.011394,-0.01594,0.018046,0.024954,0.015272,-0.086605,-0.013711,-0.063912,0.043021,-0.009748,0.020768,0.014399
Q6,-0.077903,0.162859,-0.008631,0.088421,0.08716,1.0,0.019265,0.084872,-0.025438,0.199091,0.046779,-0.006968,0.00353,0.09162,0.034409,0.058743,-0.033856,0.192405,-0.020798,0.232482,5.6e-05,0.092739,0.127461,0.106845,-0.102611,0.103452,0.043248,0.197746,-0.01927,0.089574,-0.050789,0.077405,0.033034,0.062612,0.016619,0.014088,0.086646,0.043943,-0.009384,-0.011151,0.035119,0.12465,0.043164,0.114912,0.016485,0.019394,0.066942,-0.013522,-0.247847,-0.169286,0.132152,0.111393,0.01206,-0.046476,-0.017914
Q7,0.167492,-0.089243,0.212048,-0.046768,0.256646,0.019265,1.0,-0.063634,0.242772,-0.023548,0.067337,-0.113893,0.084965,-0.062752,0.21882,-0.069096,0.266693,-0.005387,0.12543,-0.125985,0.126014,-0.036119,0.14505,-0.042711,0.159822,-0.07237,0.181439,-0.039516,0.16547,-0.076838,0.162399,-0.050949,0.127407,-0.060815,0.210961,-0.081829,0.1799,-0.064616,0.20186,-0.082247,0.172251,-0.051971,0.183628,-0.059354,-0.004395,-0.016592,0.036002,-0.001464,-0.048768,0.018684,-0.124334,0.021321,-0.010263,-0.022062,-0.001464
Q8,-0.124964,0.150415,-0.118439,0.149198,-0.04937,0.084872,-0.063634,1.0,-0.08485,0.044231,-0.114557,0.225918,-0.069406,0.17181,-0.115529,0.213593,-0.094295,0.034724,-0.154642,0.197825,-0.695749,0.207091,0.026908,0.109272,-0.087255,0.178989,-0.119016,0.120144,-0.136642,0.190857,-0.15584,0.074173,-0.180662,0.155067,-0.124333,0.206746,-0.104952,0.238909,-0.14745,0.185216,-0.092563,0.157875,-0.123841,0.163544,0.001296,-0.010533,0.000549,-0.00756,-0.055087,-0.045406,0.15276,-0.006542,-0.010425,0.032634,-0.04996
Q9,0.236268,-0.096658,0.418912,-0.078819,0.231433,-0.025438,0.242772,-0.08485,1.0,0.005999,0.128511,-0.125061,0.113427,-0.066499,0.343986,-0.142493,0.415953,-0.029557,0.247832,-0.144424,0.113755,-0.114976,0.18316,-0.127871,0.219995,-0.093425,0.250718,-0.080272,0.252062,-0.039119,0.110432,-0.20032,0.116866,-0.141737,0.260695,-0.055607,0.264643,-0.119511,0.36074,-0.104916,0.185546,-0.077494,0.252802,-0.081452,0.006057,-0.015188,0.089405,0.009564,-0.099174,-0.062486,-0.122229,-0.039595,-0.041082,0.010767,-0.025569
Q10,-0.057231,0.148893,-0.02513,0.062452,0.0133,0.199091,-0.023548,0.044231,0.005999,1.0,0.075813,0.055389,-0.047511,0.078092,0.033567,0.079559,-0.02625,0.158681,0.033585,0.192284,0.021317,0.084294,0.085751,0.112755,-0.079869,0.068791,0.035208,0.208562,-0.000793,0.117057,-0.065988,0.063933,-0.033056,0.041389,-0.009898,0.040005,0.086636,0.050451,-0.014009,0.047287,-0.027047,0.133614,0.031689,0.098056,0.015032,0.021846,0.063481,0.008223,-0.153718,-0.117986,0.141838,0.08262,-0.063713,-0.011751,-0.003539


### 11. The professor for whom you work suggests that you set $k = 4$. Why might this be a bad idea in this specific case?

Answer: With two classes, an even $k$ allows tied votes (for example, two neighbors from each class when $k = 4$). A tie has to be broken arbitrarily, so the model cannot reliably predict those cases; an odd $k$ avoids the problem.
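The tie problem can be shown with a simple majority vote. This is a plain-Python sketch of the voting step only, not scikit-learn's implementation; the neighbor labels are made up and follow the lab's coding (1 = right-handed, 2 = left-handed).

```python
from collections import Counter

def vote(neighbor_labels):
    """Return (winning_label, is_tie) for a list of neighbor class labels."""
    counts = Counter(neighbor_labels).most_common()
    is_tie = len(counts) > 1 and counts[0][1] == counts[1][1]
    return counts[0][0], is_tie

label_k4, tie_k4 = vote([1, 1, 2, 2])     # k = 4: two votes each
label_k5, tie_k5 = vote([1, 1, 2, 2, 2])  # k = 5: a clear majority
```

With $k = 4$ the vote is split, so whichever label is returned depends on an arbitrary tie-breaking rule; with $k = 5$ and two classes a tie cannot happen.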

### 12. Let's *(finally)* use $k$-nearest neighbors to predict whether or not a person is left-handed!

> Be sure to create a train/test split with your data!

> Create four separate models, one with $k = 3$, one with $k = 5$, one with $k = 15$, and one with $k = 25$.

> Instantiate and fit your models.

In [10]:
X = df[['Q1','Q2','Q3','Q4','Q5','Q6','Q7','Q8','Q9','Q10','Q11','Q12','Q13',
         'Q14','Q15','Q16','Q17','Q18','Q19','Q20','Q21','Q22','Q23','Q24',
         'Q25','Q26','Q27','Q28','Q29','Q30','Q31','Q32','Q33','Q34','Q35',
         'Q36','Q37','Q38','Q39','Q40','Q41','Q42','Q43','Q44']]
y = df['hand']

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
knn = KNeighborsClassifier(n_neighbors=3)
model = knn.fit(X_train, y_train)
model.score(X_test, y_test)

0.8398398398398398

In [12]:
knn = KNeighborsClassifier(n_neighbors=5)
model = knn.fit(X_train, y_train)
model.score(X_test, y_test)

0.8638638638638638

In [13]:
knn = KNeighborsClassifier(n_neighbors=15)
model = knn.fit(X_train, y_train)
model.score(X_test, y_test)

0.8738738738738738

In [14]:
knn = KNeighborsClassifier(n_neighbors=25)
model = knn.fit(X_train, y_train)
model.score(X_test, y_test)

0.8738738738738738
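Two of these accuracies are identical, which is worth a sanity check: with imbalanced classes, a model that always predicts the majority class can score just as well as a fitted one. Below is a minimal sketch of that baseline. The 87/13 label split is an assumption for illustration, not a figure computed from `data.csv`.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical test labels: 1 = right-handed, 2 = left-handed.
y_demo = [1] * 87 + [2] * 13
baseline = majority_baseline_accuracy(y_demo)
```

On the real split, `majority_baseline_accuracy(list(y_test))` would give the number each model needs to beat.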

Being good data scientists, we know that we might not run just one type of model. We might run many different models and see which is best.

### 13. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, let's check the [documentation for logistic regression in sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Is there default regularization? If so, what is it? If not, how do you know?

Answer: Yes. The documented defaults are `penalty='l2'` and `C=1.0`, so the default regularization is L2 (ridge).

### 14. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, should we standardize our features? Well, the answer is (as always), **it depends**. What is one reason you would standardize? What is one reason you would not standardize?

Answer:
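- I would standardize in logistic regression when the model is regularized, because the penalty acts on the size of the coefficients and those sizes depend on each feature's scale.
- I would not standardize in logistic regression when the model is not regularized and I want to interpret the coefficients in the features' original units.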

### 15. Let's use logistic regression to predict whether or not the person is left-handed.


> Be sure to use the same train/test split with your data as with your $k$NN model above!

> Create four separate models, one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

> Instantiate and fit your models.
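The hint about $\alpha$ refers to the fact that scikit-learn's `LogisticRegression` takes `C`, the inverse of the regularization strength, rather than $\alpha$ itself. A minimal sketch of the conversion (the helper name `alpha_to_C` is hypothetical):

```python
def alpha_to_C(alpha):
    """Convert a regularization strength alpha into sklearn's inverse strength C."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 1.0 / alpha

C_for_alpha_1 = alpha_to_C(1)    # 1.0
C_for_alpha_10 = alpha_to_C(10)  # 0.1
```

So a stronger penalty ($\alpha = 10$) corresponds to a smaller `C` (0.1), not a larger one.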

In [15]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

In [16]:
lr.fit(X_train, y_train)
lr.score(X_train, y_train)

0.8911155644622578

In [17]:
lr.score(X_test, y_test)

0.8738738738738738

In [22]:
lasso1 = LogisticRegression(penalty='l1', C=1, solver='liblinear')  # liblinear supports L1
lasso1.fit(X_train, y_train)
lasso1.score(X_test, y_test)

0.8738738738738738

In [23]:
lasso10 = LogisticRegression(penalty='l1', C=.1, solver='liblinear')  # alpha = 10 -> C = 0.1
lasso10.fit(X_train, y_train)
lasso10.score(X_test, y_test)

0.8738738738738738

In [24]:
ridge1 = LogisticRegression(penalty='l2', C=1)
ridge1.fit(X_train, y_train)
ridge1.score(X_test, y_test)

0.8738738738738738

In [25]:
ridge10 = LogisticRegression(penalty='l2', C=.1)
ridge10.fit(X_train, y_train)
ridge10.score(X_test, y_test)

0.8738738738738738

### (detour) 16. Suppose that, instead of predicting whether or not someone was left-handed, you wanted to predict whether someone was right-handed, left-handed, both, or missing. What type of *(hint: generalized linear)* model would you try to fit here? Why?

Answer: Multinomial Logistic Regression to deal with the multitude of classes
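Multinomial logistic regression computes one linear score per class and converts the scores to probabilities with the softmax function. A minimal sketch with made-up scores (the numbers are hypothetical, not fitted to the survey data):

```python
import math

def softmax(scores):
    """Turn one linear score per class into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical linear scores for the classes missing (0), right (1), left (2), both (3).
probs = softmax([-1.0, 2.0, 0.5, -0.5])
predicted_class = probs.index(max(probs))
```

The predicted class is simply the one with the largest probability, which here is class 1.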

### (detour) 17. Suppose that, instead of predicting whether or not someone was left-handed, you wanted to predict someone's level of education *(1=Less than high school, 2=High school, 3=University degree, 4=Graduate degree)* based on their personality question responses. What type of *(hint: generalized linear)* model would you try to fit here? Why?

Answer: Ordinal Logistic Regression since the result classes are ordinal

### (detour) 18. Suppose that, instead of predicting whether or not someone was left-handed, you wanted to predict someone's age based on their personality question responses. Realistically, we would probably fit a multiple linear regression model. However, if I tried to fit a GLM here, what type of model would be most appropriate? Why?

Answer: Gamma regression, since age is continuous and always positive

---
## Step 5: Evaluate the model(s).

### 19. Using accuracy as your metric, evaluate all eight of your models on both the training and testing sets. Put your scores below. (If you want to be fancy and generate a table in Markdown, there's a [Markdown table generator site linked here](https://www.tablesgenerator.com/markdown_tables#).)

In [26]:
from sklearn.model_selection import GridSearchCV

In [28]:
# Alex's grids
knn_params = [
    {'n_neighbors':[3,5,15,25]}
]
knn = KNeighborsClassifier()
knn_grid = GridSearchCV(knn,knn_params,cv=10,scoring="accuracy",return_train_score=True)
knn_grid.fit(X,y)

GridSearchCV(cv=10, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'n_neighbors': [3, 5, 15, 25]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='accuracy', verbose=0)

In [29]:
logreg_params = [
    {'penalty':['l2','l1'],
     'C':[.0001,0.001, 0.01, 0.1, 1]}
]
logreg = LogisticRegression(intercept_scaling=1, solver='liblinear')
lr_grid = GridSearchCV(logreg,logreg_params,cv=10,scoring="accuracy",return_train_score=True)
lr_grid.fit(X,y)

GridSearchCV(cv=10, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'penalty': ['l2', 'l1'], 'C': [0.0001, 0.001, 0.01, 0.1, 1]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='accuracy', verbose=0)

In [31]:
pd.DataFrame(knn_grid.cv_results_)[['param_n_neighbors','mean_test_score','mean_train_score']].T

Unnamed: 0,0,1,2,3
param_n_neighbors,3.0,5.0,15.0,25.0
mean_test_score,0.847233,0.872527,0.886802,0.886802
mean_train_score,0.902663,0.890419,0.886691,0.886802


### 20. In which of your $k$-NN models is there evidence of overfitting? How do you know?

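Answer: The $k = 3$ model. Its mean training accuracy (0.903) is notably higher than its mean test accuracy (0.847). The gap shrinks for $k = 5$ and disappears for $k = 15$ and $k = 25$.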
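The comparison can be made explicit by computing the gap between training and test accuracy for each $k$. The scores below are copied from the cross-validation table above; the dictionary layout is just one convenient way to hold them.

```python
# Mean cross-validated accuracies copied from the table above.
cv_scores = {
    3: {'train': 0.902663, 'test': 0.847233},
    5: {'train': 0.890419, 'test': 0.872527},
    15: {'train': 0.886691, 'test': 0.886802},
    25: {'train': 0.886802, 'test': 0.886802},
}

# A large positive gap (train much higher than test) is a sign of overfitting.
gaps = {k: round(s['train'] - s['test'], 6) for k, s in cv_scores.items()}
most_overfit_k = max(gaps, key=gaps.get)

for k, gap in gaps.items():
    print(f"k={k:>2}: train - test = {gap:+.6f}")
```

The gap is largest for $k = 3$ and essentially zero for $k = 15$ and $k = 25$.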

### 21. Broadly speaking, how does the value of $k$ in $k$-NN affect the bias-variance tradeoff? (i.e. As $k$ increases, how are bias and variance affected?)

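Answer: Higher k = higher bias and lower variance. Each prediction averages over more neighbors, so it is smoother but less sensitive to local structure.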

### 22. If you have a $k$-NN model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

Answer: Increase k, change your feature set, and increase dataset size

### 22. In which of your logistic regression models is there evidence of overfitting? How do you know?

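Answer: They all seem fine. For the default model, the training accuracy (0.891) is only slightly above the test accuracy (0.874).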

### 23. Broadly speaking, how does the value of $C$ in logistic regression affect the bias-variance tradeoff? (i.e. As $C$ increases, how are bias and variance affected?)

Answer: $C$ is the inverse of the regularization strength, so a higher C means weaker regularization: higher variance, lower bias

### 24. If you have a logistic regression model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

Answer: Reduce C, reduce feature space, get more data

---
## Step 6: Answer the problem.

### 25. Suppose you want to understand which psychological features are most important in determining left-handedness. Would you rather use $k$-NN or logistic regression? Why?

Answer: Logistic regression since you can evaluate the coefficients

### 26. Select your best logistic regression model. Interpret the coefficient for `Q1`.

In [37]:
list(zip(X.columns, lasso1.coef_[0]))

[('Q1', 0.011791265987770427),
 ('Q2', 0.0021493643673874126),
 ('Q3', 0.007896199988834304),
 ('Q4', -0.09190854770710016),
 ('Q5', 0.061125165564982334),
 ('Q6', -0.006297461326663085),
 ('Q7', 0.03305874631149391),
 ('Q8', -0.1277265739938772),
 ('Q9', -0.04427879842213442),
 ('Q10', 0.06417044240315972),
 ('Q11', -0.039938866239977215),
 ('Q12', 0.023495591855121924),
 ('Q13', -0.033203675780452635),
 ('Q14', -0.005413944635204452),
 ('Q15', -0.03149943124477349),
 ('Q16', 0.026050929096954938),
 ('Q17', 0.00618245532392069),
 ('Q18', 0.022415720791309796),
 ('Q19', 0.0),
 ('Q20', -0.08856885535735282),
 ('Q21', -0.04465890547936698),
 ('Q22', -0.051934088820694516),
 ('Q23', -0.06348910982725656),
 ('Q24', 0.01477617457016404),
 ('Q25', 0.037825669663428485),
 ('Q26', 0.08574016824320177),
 ('Q27', 0.09240837038752049),
 ('Q28', -0.02309459372208871),
 ('Q29', 0.023865834203809545),
 ('Q30', -0.0166467211604919),
 ('Q31', 0.0032840426126001243),
 ('Q32', 0.028094944908570122),
 ('

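Answer: The `Q1` coefficient is about 0.0118 on the log-odds scale. Holding the other answers fixed, a one-point increase in the `Q1` response multiplies the estimated odds of being left-handed by $e^{0.0118} \approx 1.012$, roughly a 1.2% increase. That effect is tiny, so `Q1` is essentially irrelevant so far as I can tell.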
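To put a number on the `Q1` coefficient, it can be converted from the log-odds scale to an odds ratio. This sketch assumes the coefficient describes the log-odds of the class with the larger label, `hand == 2` (left-handed), which is how scikit-learn orders the classes in a binary problem.

```python
import math

# Coefficient for Q1 as printed in the output above.
q1_coef = 0.011791265987770427

# exp(coefficient) is the multiplicative change in the odds
# per one-point increase in the Q1 response, other answers held fixed.
odds_ratio = math.exp(q1_coef)
percent_change = (odds_ratio - 1) * 100

print(f"odds ratio: {odds_ratio:.4f}, change in odds: {percent_change:.2f}%")
```

An odds ratio this close to 1 means the `Q1` answer barely shifts the predicted odds.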

### 27. If you have to select one model overall to be your *best* model, which model would you select? Why?

Answer: Any of the logistic regression models works well. They fit well and offer insight into which questions had a bearing on the class determination.

### 28. Circle back to the three specific and conclusively answerable questions you came up with in Q1. Answer these for the professor based on the model you selected!

In [41]:
pd.DataFrame(list(zip(X.columns, abs(lasso1.coef_[0])))).sort_values(1, ascending=False)

Unnamed: 0,0,1
42,Q43,0.151506
37,Q38,0.147426
7,Q8,0.127727
26,Q27,0.092408
3,Q4,0.091909
19,Q20,0.088569
34,Q35,0.087767
25,Q26,0.08574
33,Q34,0.074245
41,Q42,0.071866


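Answer: The questions in the list above have the largest absolute coefficients, so they had the strongest impact on the model and the most predictive value in determining left-handedness. This shows association only; it does not prove that any personality trait causes left-handedness.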

### BONUS:
Looking for more to do? Probably not - you're busy! But if you want to, consider exploring the following:
- Suppose this data were in a `SQL` database named `data` and a table named `inventory`. What `SQL` query would return the count of people who were right-handed, left-handed, both, or missing with their class labels of 1, 2, 3, and 0, respectively? (You can assume you've already logged into the database.)
- Fit and evaluate one or more of the generalized linear models discussed above.
- Create a plot comparing training and test metrics for various values of $k$ and various regularization schemes in logistic regression.
- Rather than just evaluating models based on accuracy, consider using sensitivity, specificity, etc.
- In the context of predicting left-handedness, why are unbalanced classes concerning? If you were to re-do this process given those concerns, what changes might you make?
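
As a starting point for the last two bonus items, here is a minimal sketch of sensitivity and specificity computed by hand. The labels are made up (1 = right-handed, 2 = left-handed, with left-handed treated as the positive class), and the "model" is a hypothetical one that never predicts left-handed.

```python
def sensitivity_specificity(y_true, y_pred, positive=2):
    """Compute sensitivity (recall on the positive class) and specificity."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical labels: nine right-handed respondents and one left-handed.
y_true = [1] * 9 + [2]
always_right = [1] * 10  # a model that never predicts left-handed

accuracy = sum(t == p for t, p in zip(y_true, always_right)) / len(y_true)
sens, spec = sensitivity_specificity(y_true, always_right)
```

Here accuracy is 0.9 and specificity is 1.0, yet sensitivity is 0.0: the model never finds a left-handed person. That is why accuracy alone can be misleading when classes are unbalanced.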