## Week 4, Lab 1: Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

---
## Step 1: Define The Problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

Specifically, the professor says "I need to prove that left-handedness is caused by some personality trait. Go find that personality trait and the data to back it up."

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### 1. In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

Always be specific when posing data science problems: state the question as precisely as possible so that it is the "right" question to answer.

Answer:

1. Are people who show an affinity for weapons, such as sharp objects and guns, more likely to be left-handed?
2. Are more logical people more likely to be left-handed?
3. Are more hardworking people more likely to be left-handed?

---
## Step 2: Obtain the data.

### 2. Read in the file titled "data.csv."
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!
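
One quick way to see why a plain `pd.read_csv()` struggles is to look at the first line of the file and count candidate separators. The sketch below is only an illustration: `guess_delimiter` is a hypothetical helper, and the sample header is made up to resemble a tab-separated file rather than copied from `data.csv`.

```python
def guess_delimiter(first_line, candidates=(",", "\t", ";", "|")):
    """Return the candidate separator that occurs most often in the line."""
    counts = {sep: first_line.count(sep) for sep in candidates}
    return max(counts, key=counts.get)

# Hypothetical header line shaped like a tab-separated survey file.
sample_header = "Q1\tQ2\tQ3\tage\thand"
delimiter = guess_delimiter(sample_header)
print(repr(delimiter))
```

The detected character can then be passed along as `pd.read_csv('data.csv', sep=delimiter)`.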

In [33]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, PowerTransformer

In [3]:
pd.set_option('display.max_columns', None)

In [4]:
# pd.read_table() also works, because it uses a tab as its default separator.

In [5]:
df = pd.read_csv('data.csv', sep = "\t")
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q20,Q21,Q22,Q23,Q24,Q25,Q26,Q27,Q28,Q29,Q30,Q31,Q32,Q33,Q34,Q35,Q36,Q37,Q38,Q39,Q40,Q41,Q42,Q43,Q44,introelapse,testelapse,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,1,1,5,5,5,1,5,1,5,1,5,1,1,1,5,5,5,1,5,1,1,1,1,5,5,1,1,1,5,5,5,1,5,1,91,232,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,2,5,3,4,1,4,1,1,1,5,2,4,4,4,1,2,1,2,1,3,1,5,2,4,4,4,4,4,1,3,1,4,4,5,17,247,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,4,5,4,3,4,1,2,3,1,3,3,3,4,5,3,2,2,2,1,4,3,3,4,4,2,2,4,2,1,4,2,2,2,2,11,6774,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,1,3,2,3,1,5,2,2,5,5,2,3,2,2,1,4,1,1,1,3,4,1,3,5,5,1,3,4,1,2,1,1,1,3,14,1072,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,1,1,5,5,5,1,5,1,5,2,5,1,5,1,5,5,5,1,5,1,5,1,5,5,5,1,1,1,5,5,5,1,5,1,10,226,US,2,1,22,3,1,1,3,2,3


### 3. Suppose that, instead of us giving you this data in a file, you were actually conducting a survey to gather this data yourself. From an ethics/privacy point of view, what are three things you might consider when attempting to gather this data?
> When working with sensitive data like sexual orientation or gender identity, we need to consider how this data could be used if it fell into the wrong hands!

Answer:  
Do not collect the names or IDs of the survey respondents. Mask or do not record their personal information, such as name and address.  
Consider whether race and religion need to be asked at all.

---
## Step 3: Explore the data.

### 4. Conduct exploratory data analysis on this dataset.
> If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

In [6]:
df.isnull().sum()

Q1             0
Q2             0
Q3             0
Q4             0
Q5             0
Q6             0
Q7             0
Q8             0
Q9             0
Q10            0
Q11            0
Q12            0
Q13            0
Q14            0
Q15            0
Q16            0
Q17            0
Q18            0
Q19            0
Q20            0
Q21            0
Q22            0
Q23            0
Q24            0
Q25            0
Q26            0
Q27            0
Q28            0
Q29            0
Q30            0
Q31            0
Q32            0
Q33            0
Q34            0
Q35            0
Q36            0
Q37            0
Q38            0
Q39            0
Q40            0
Q41            0
Q42            0
Q43            0
Q44            0
introelapse    0
testelapse     0
country        0
fromgoogle     0
engnat         0
age            0
education      0
gender         0
orientation    0
race           0
religion       0
hand           0
dtype: int64

In [7]:
df.describe()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q20,Q21,Q22,Q23,Q24,Q25,Q26,Q27,Q28,Q29,Q30,Q31,Q32,Q33,Q34,Q35,Q36,Q37,Q38,Q39,Q40,Q41,Q42,Q43,Q44,introelapse,testelapse,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
count,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0
mean,1.962715,3.829589,2.846558,3.186902,2.86544,3.672084,3.216539,3.184512,2.761233,3.522945,2.748805,2.852772,2.657505,3.33413,3.168021,2.93021,2.564771,3.424952,2.928537,3.639818,2.867591,3.595124,3.861138,3.337237,1.999761,3.001434,2.730641,2.624044,2.543738,2.894359,3.002151,2.869503,2.741874,3.022228,3.074092,2.61066,3.465344,2.798757,2.569312,2.984226,3.385277,2.704828,2.676386,2.736616,347.808556,479.994503,1.576243,1.239962,30.370698,2.317878,1.654398,1.833413,5.013623,2.394359,1.190966
std,1.360291,1.551683,1.664804,1.476879,1.545798,1.342238,1.490733,1.387382,1.511805,1.24289,1.443078,1.556284,1.559575,1.522866,1.501683,1.575544,1.61901,1.413236,1.493122,1.414569,1.360858,1.354475,1.291425,1.426095,1.290747,1.48061,1.485883,1.481709,1.611428,1.477968,1.420032,1.659141,1.40567,1.562694,1.5464,1.409707,1.52146,1.413584,1.621772,1.483752,1.423055,1.544345,1.523097,1.471845,5908.901681,3142.178542,0.494212,0.440882,367.201726,0.874264,0.640915,1.303454,1.970996,2.184164,0.495357
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,7.0,1.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,3.0,1.0,2.0,1.0,3.0,2.0,2.0,1.0,3.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0,3.0,2.0,3.0,3.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,6.0,186.0,1.0,1.0,18.0,2.0,1.0,1.0,5.0,1.0,1.0
50%,1.0,5.0,3.0,3.0,3.0,4.0,3.0,3.0,3.0,4.0,3.0,3.0,3.0,4.0,3.0,3.0,2.0,4.0,3.0,4.0,3.0,4.0,4.0,3.0,1.0,3.0,3.0,3.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,2.0,4.0,3.0,2.0,3.0,4.0,3.0,3.0,3.0,12.0,242.0,2.0,1.0,21.0,2.0,2.0,1.0,6.0,2.0,1.0
75%,3.0,5.0,5.0,5.0,4.0,5.0,5.0,4.0,4.0,5.0,4.0,4.0,4.0,5.0,5.0,4.0,4.0,5.0,4.0,5.0,4.0,5.0,5.0,5.0,3.0,4.0,4.0,4.0,4.0,4.0,4.0,5.0,4.0,4.0,5.0,4.0,5.0,4.0,4.0,4.0,5.0,4.0,4.0,4.0,35.0,324.25,2.0,1.0,27.0,3.0,2.0,2.0,6.0,2.0,1.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,252063.0,119834.0,2.0,2.0,23763.0,4.0,3.0,5.0,7.0,7.0,3.0
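
The summary above contains values that deserve a second look: every question column has a minimum of 0 even though the maximum is 5, and `age` reaches 23763. A minimal sketch of a range check follows. The expected ranges are assumptions (the codebook should confirm them), `out_of_range` is a hypothetical helper, and the rows are made up to mimic those extremes.

```python
# Assumed valid ranges; check the codebook before relying on these.
EXPECTED_RANGES = {"Q1": (1, 5), "age": (13, 120)}

def out_of_range(rows, expected=EXPECTED_RANGES):
    """Return {column: [row indices]} for values outside the expected range."""
    flagged = {}
    for i, row in enumerate(rows):
        for col, (low, high) in expected.items():
            if not low <= row[col] <= high:
                flagged.setdefault(col, []).append(i)
    return flagged

# Made-up rows that mimic the extremes reported by describe().
rows = [
    {"Q1": 4, "age": 22},
    {"Q1": 0, "age": 30},      # 0 falls outside the assumed 1-5 answer scale
    {"Q1": 2, "age": 23763},   # implausible age
]
flagged = out_of_range(rows)
print(flagged)
```

Whether flagged rows should be dropped, recoded, or kept is a judgment call that depends on what the codebook says those values mean.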


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4184 entries, 0 to 4183
Data columns (total 56 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Q1           4184 non-null   int64 
 1   Q2           4184 non-null   int64 
 2   Q3           4184 non-null   int64 
 3   Q4           4184 non-null   int64 
 4   Q5           4184 non-null   int64 
 5   Q6           4184 non-null   int64 
 6   Q7           4184 non-null   int64 
 7   Q8           4184 non-null   int64 
 8   Q9           4184 non-null   int64 
 9   Q10          4184 non-null   int64 
 10  Q11          4184 non-null   int64 
 11  Q12          4184 non-null   int64 
 12  Q13          4184 non-null   int64 
 13  Q14          4184 non-null   int64 
 14  Q15          4184 non-null   int64 
 15  Q16          4184 non-null   int64 
 16  Q17          4184 non-null   int64 
 17  Q18          4184 non-null   int64 
 18  Q19          4184 non-null   int64 
 19  Q20          4184 non-null 

---
## Step 4: Model the data.

### 5. Suppose I wanted to use Q1 - Q44 to predict whether or not the person is left-handed. Would this be a classification or regression problem? Why?

Answer:   
Classification.    
This is because we are classifying each person as either left-handed (1) or not left-handed (0).   
The target is discrete rather than continuous, and its categories have no natural order.

### 6. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed based on their responses to Q1 - Q44. Before doing that, however, you remember that it is often a good idea to standardize your variables. In general, why would we standardize our variables? Give an example of when we would standardize our variables.

Answer:   
We want to put the variables/ features on the same scale.  
Therefore we standardise them. `StandardScaler` shifts and rescales each feature to a mean of 0 and a standard deviation of 1 (so the variance is also 1). 
This is because some machine learning algorithms, such as KNN, are very sensitive to the scale of the data. KNN looks at how 'close' the data points are to one another. If we do not scale and two features have very different ranges, KNN will base its answer mainly on the feature with the larger scale.  
Regularised models are another example: sklearn's logistic regression applies a penalty by default, and because the coefficients appear in the loss function, features on different scales are penalised unevenly.
In both situations we need to standardise.
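
The effect of scale on nearest-neighbour distances can be seen with a small library-free sketch. The two features below are made up for illustration (a 1-5 answer and an elapsed time in seconds), and `standardize`, `euclidean`, and `nearest` are hypothetical helpers that mimic what a standard scaler followed by a distance computation would do.

```python
import math
import statistics

def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(points, query_index, candidates):
    """Index of the candidate point closest to points[query_index]."""
    return min(candidates, key=lambda j: euclidean(points[query_index], points[j]))

# Made-up features on very different scales.
answer = [1, 1, 5, 3]             # 1-5 survey answer
elapsed = [200, 260, 205, 6000]   # seconds spent on the test

raw_points = list(zip(answer, elapsed))
scaled_points = list(zip(standardize(answer), standardize(elapsed)))

# Who is closer to person 0: person 1 (same answer) or person 2 (similar time)?
raw_nearest = nearest(raw_points, 0, [1, 2])
scaled_nearest = nearest(scaled_points, 0, [1, 2])
print(raw_nearest, scaled_nearest)
```

On the raw values the elapsed time dominates, so person 2 looks closest despite giving the opposite answer; after standardising, person 1 (identical answer, slightly different time) becomes the nearest neighbour.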

### 7. Give an example of when we might not standardize our variables.

Answer:   
When the variables are already on the same scale, or when we have strong reasons to leave them in their original units.   
*Standardisation does not mean forcing the data into a Gaussian distribution!*

### 8. Based on your answers to 6 and 7, do you think we should standardize our predictor variables in this case? Why or why not?

Answer:   

They are already on the same scale, so standardising is not necessary here.  
(Scale of 1 to 5)

### 9. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed. What munging/cleaning do we need to do to our $y$ variable in order to explicitly answer this question? Do it.

Answer:   
We need to convert `hand` into a binary target.   
We should drop the rows where `hand` is 0, which means the value is missing, as there are too few of them to be useful.
Left-handed respondents (`hand == 2`) are mapped to 1; everyone else (right-handed, `hand == 1`, and ambidextrous, `hand == 3`) is mapped to 0.

In [9]:
df.hand.value_counts()

1    3542
2     452
3     179
0      11
Name: hand, dtype: int64

In [10]:
df['y'] = [1 if i ==2 else 0 for i in df['hand']]

In [12]:
df['y'].value_counts()

0    3732
1     452
Name: y, dtype: int64

In [13]:
df = df[df['hand'] != 0].reset_index()

In [16]:
df.shape

(4173, 58)
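
Before modelling, it helps to know how unbalanced the target is. The sketch below reuses the counts printed by `df.hand.value_counts()` above and computes the share of left-handed respondents after dropping the `hand == 0` rows, along with the accuracy of a trivial rule that always predicts "not left-handed". The variable names are illustrative only.

```python
# Counts copied from the df.hand.value_counts() output above.
hand_counts = {1: 3542, 2: 452, 3: 179, 0: 11}

# Drop the rows with hand == 0, then treat hand == 2 as the positive class.
kept = {code: n for code, n in hand_counts.items() if code != 0}
total = sum(kept.values())
left = kept[2]

left_share = left / total
baseline_accuracy = 1 - left_share  # accuracy of always predicting "not left-handed"

print(total, round(left_share, 4), round(baseline_accuracy, 4))
```

Any model whose accuracy sits near this baseline is doing little more than predicting the majority class.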

### 10. The professor for whom you work suggests that you set $k = 4$. In this specific case, why might this be a bad idea?

Answer:   
With two classes, an even $k$ can produce ties (for example, two neighbours voting left-handed and two voting not left-handed), so we typically set $k$ to an odd number. Otherwise, there may be situations where the algorithm cannot decide meaningfully which class a point belongs to.
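
The tie problem can be illustrated with a plain majority vote. This is a sketch of the voting step only, not of how any particular library resolves ties; `vote` is a hypothetical helper.

```python
from collections import Counter

def vote(neighbor_labels):
    """Return (winning label, tie flag) for a list of neighbor labels."""
    ranked = Counter(neighbor_labels).most_common()
    tied = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
    return ranked[0][0], tied

even_k = vote([1, 1, 0, 0])  # k = 4: two votes each, so the winner is arbitrary
odd_k = vote([1, 1, 0])      # k = 3: with two classes a strict majority always exists
print(even_k, odd_k)
```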

### 11. Let's *(finally)* use $k$-nearest neighbors to predict whether or not a person is left-handed!

> Be sure to create a train/test split with your data!

> Create four separate models, one with $k = 3$, one with $k = 5$, one with $k = 15$, and one with $k = 25$.

> Instantiate and fit your models.

In [17]:
X = df.drop(columns=['index', 'introelapse', 'testelapse', 'country',
                     'fromgoogle', 'engnat', 'age', 'education', 'gender',
                     'orientation', 'race', 'religion', 'hand', 'y'])

y = df['y']

In [25]:
X.shape

(4173, 44)

In [26]:
y.shape

(4173,)

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [28]:
X_train.shape

(3338, 44)

In [29]:
y_train.shape

(3338,)

In [30]:
k_3 = KNeighborsClassifier(n_neighbors = 3)
k_3.fit(X_train, y_train)

k_5 = KNeighborsClassifier(n_neighbors = 5)
k_5.fit(X_train, y_train)

k_15 = KNeighborsClassifier(n_neighbors = 15)
k_15.fit(X_train, y_train)

k_25 = KNeighborsClassifier(n_neighbors = 25)
k_25.fit(X_train, y_train)

In [31]:
# When to stratify: when the classes are very imbalanced, for example, when most respondents are right-handed.
# When the proportions are very uneven and you want every class to be represented in both splits, consider a stratified split.
# Stratification is usually used for classification problems, but not usually for regression problems.
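
To make the idea of a stratified split concrete, here is a library-free sketch that splits each class separately so that the test set keeps roughly the same class mix as the full data. `stratified_split` is a hypothetical helper and the 90/10 labels are made up; it is not the split used in this lab.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly the same class mix in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Made-up labels with a 90/10 imbalance, similar in spirit to the handedness data.
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels)
test_positive_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_positive_share)
```

In scikit-learn the same effect is available by passing `stratify=y` to `train_test_split`.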

Being good data scientists, we know that we might not run just one type of model. We might run many different models and see which is best.

### 12. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, let's check the [documentation for logistic regression in sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Is there default regularization? If so, what is it? If not, how do you know?

Answer:   
Yes, there is default regularisation.
penalty = l2 ==> ridge penalty.

### 13. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, should we standardize our features?

Answer:  
Yes, in general. sklearn regularises by default,   
so we should standardise our features first. Otherwise the penalty will weigh features unevenly because of their different scales.  
However, this dataset is already on the same scale of 1 to 5,  
so it is probably not necessary here.  


# Very important
# Regularisation: simplifying the model by adding a penalty term to the loss.
# Standardisation: recentring and rescaling the data so each feature has a mean of 0 and a standard deviation of 1 (and therefore a variance of 1, since var = std^2).

### 14. Let's use logistic regression to predict whether or not the person is left-handed.


> Be sure to use the same train/test split with your data as with your $k$-NN model above!

> Create four separate models, one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

> Instantiate and fit your models.
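
The hint about $\alpha$ matters because scikit-learn's `LogisticRegression` does not take $\alpha$ directly: its `C` parameter is the inverse of the regularization strength. The sketch below assumes the convention $C = 1/\alpha$ that this lab relies on; `alpha_to_C` is a hypothetical helper.

```python
def alpha_to_C(alpha):
    """Convert a penalty strength alpha into the inverse strength C = 1 / alpha."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 1.0 / alpha

for alpha in (1, 10):
    print(f"alpha = {alpha} -> C = {alpha_to_C(alpha)}")
```

So a stronger penalty ($\alpha = 10$) corresponds to a smaller `C` (0.1), not a larger one.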

In [34]:
lasso_1 = LogisticRegression(penalty = 'l1', C = 1.0, solver = 'liblinear')
lasso_1.fit(X_train, y_train)

lasso_10 = LogisticRegression(penalty = 'l1', C = 0.1, solver = 'liblinear')
lasso_10.fit(X_train, y_train)

ridge_1 = LogisticRegression(penalty = 'l2', C = 1.0, solver = 'liblinear')
ridge_1.fit(X_train, y_train)

ridge_10 = LogisticRegression(penalty = 'l2', C = 0.1, solver = 'liblinear')
ridge_10.fit(X_train, y_train)

### lasso ==> l1, ridge ==> l2
lasso ==> feature selection ==> shrinks some coefficients exactly to 0  
ridge ==> shrinks all coefficients toward 0, reducing the contribution of noisy features  

[source](https://www.datacamp.com/tutorial/tutorial-lasso-ridge-regression)  
[source's source](https://online.stat.psu.edu/stat508/book/export/html/749)  
In the usual geometric picture of lasso and ridge, the elliptical contours are level sets of the cost function, and the constrained region is a diamond for lasso and a circle for ridge. Relaxing the constraint introduced by the penalty enlarges the constrained region. Relaxing it far enough, the region reaches the center of the ellipses, where both lasso and ridge give the same result as an unpenalized linear regression model.

However, both methods determine the coefficients at the first point where the elliptical contours touch the constrained region. Because the lasso region is a diamond with corners on the axes, the contours often touch it at a corner, and at a corner at least one coefficient is exactly zero. The ridge region is a circle with no corners, so coefficients can be shrunk close to zero but are generally not exactly zero.
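
The same contrast shows up in a one-coefficient simplification with squared-error loss (not the logistic loss used in this lab). Under that assumption the lasso solution is soft-thresholding and the ridge solution is proportional shrinkage. The function names and the starting coefficients below are made up for illustration.

```python
def lasso_shrink(b, lam):
    """Minimizer of 0.5 * (b - beta)**2 + lam * abs(beta): soft-thresholding."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

def ridge_shrink(b, lam):
    """Minimizer of 0.5 * (b - beta)**2 + 0.5 * lam * beta**2: proportional shrinkage."""
    return b / (1.0 + lam)

unpenalized = [2.0, 0.3, -0.1]   # made-up unpenalized coefficients
lam = 0.5
lasso = [lasso_shrink(b, lam) for b in unpenalized]
ridge = [ridge_shrink(b, lam) for b in unpenalized]
print(lasso)
print(ridge)
```

The small coefficients are set exactly to zero by the lasso rule, while the ridge rule only scales them down.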

---
## Step 5: Evaluate the model(s).

### 15. Before calculating any score on your data, take a step back. Think about your $X$ variable and your $Y$ variable. Do you think your $X$ variables will do a good job of predicting your $Y$ variable? Why or why not? What impact do you think this will have on your scores?

Answer:   
No. We are asking whether a person is left-handed,  
but the variables available to answer this question are psychological traits,  
which are not likely to be strongly related to handedness.

### 16. Using accuracy as your metric, evaluate all eight of your models on both the training and testing sets. Put your scores below. (If you want to be fancy and generate a table in Markdown, there's a [Markdown table generator site linked here](https://www.tablesgenerator.com/markdown_tables#).)
- Note: Your answers here might look a little weird. You didn't do anything wrong; that's to be expected!

Answer:  

In [38]:
knns = {3: k_3, 5: k_5, 15: k_15, 25: k_25}
for k, knn in knns.items():
    # Label each score with the k that actually produced it.
    print(f"k-nearest neighbors training accuracy with k = {k}: {knn.score(X_train, y_train)}")
    print(f"k-nearest neighbors testing accuracy with k = {k}: {knn.score(X_test, y_test)}")

regs = {"LASSO penalty, alpha = 1": lasso_1, "LASSO penalty, alpha = 10": lasso_10,
        "Ridge penalty, alpha = 1": ridge_1, "Ridge penalty, alpha = 10": ridge_10}
for label, reg in regs.items():
    # Score the model from the loop rather than lasso_1 on every pass.
    print(f"logistic regression training accuracy with {label}: {reg.score(X_train, y_train)}")
    print(f"logistic regression testing accuracy with {label}: {reg.score(X_test, y_test)}")

k-nearest neighbors training accuracy with k = 3: 0.9077291791491912
k-nearest neighbors testing accuracy with k = 3: 0.858682634730539
k-nearest neighbors training accuracy with k = 5: 0.890653085680048
k-nearest neighbors testing accuracy with k = 5: 0.8826347305389222
k-nearest neighbors training accuracy with k = 15: 0.8921509886159377
k-nearest neighbors testing accuracy with k = 15: 0.888622754491018
k-nearest neighbors training accuracy with k = 25: 0.8924505692031156
k-nearest neighbors testing accuracy with k = 25: 0.888622754491018
logistic regression training accuracy with LASSO penalty, alpha = 1: 0.8927501497902935
logistic regression testing accuracy with LASSO penalty, alpha = 1: 0.888622754491018
logistic regression 

### 17. In which of your $k$-NN models is there evidence of overfitting? How do you know?

Answer:  

k = 3 shows clear evidence of overfitting.  
For k = 3, the training score (about 0.908) is much higher than the test score (about 0.859).  
This means the model does not generalise well, hence it is overfitted.  
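
The gap between training and testing accuracy can be tabulated directly. The sketch below copies the $k$-NN scores from the printed output in question 16, taking them in the order the models were looped over (k = 3, 5, 15, 25); the variable names are illustrative.

```python
# (train, test) accuracy pairs copied from the output above, in loop order.
knn_scores = {
    3: (0.9077291791491912, 0.858682634730539),
    5: (0.890653085680048, 0.8826347305389222),
    15: (0.8921509886159377, 0.888622754491018),
    25: (0.8924505692031156, 0.888622754491018),
}

gaps = {k: train - test for k, (train, test) in knn_scores.items()}
for k, gap in gaps.items():
    print(f"k = {k}: train - test = {gap:.4f}")

largest_gap_k = max(gaps, key=gaps.get)
print("largest gap at k =", largest_gap_k)
```

The gap is several times larger at k = 3 than at the other values, which is the pattern the answer above describes.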

### 18. Broadly speaking, how does the value of $k$ in $k$-NN affect the bias-variance tradeoff? (i.e. As $k$ increases, how are bias and variance affected?)

Answer:    
As k increases, predictions become smoother, because more points take part in each decision.  
Predictions can then drift away from the actual values.  
Hence the bias increases and the variance decreases: predictions made from different training samples will likely be similar to one another, but may be far from the truth.   

k increase ==> bias increase, variance decrease.   
k decrease ==> bias decrease, variance increase.   
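
The two extremes of $k$ can be checked on a toy example. The sketch below is a minimal one-dimensional $k$-NN written from scratch on made-up data; `knn_predict` and `training_accuracy` are hypothetical helpers, not the scikit-learn classifier used in this lab.

```python
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to query (1-D)."""
    order = sorted(range(len(train_x)), key=lambda i: (abs(train_x[i] - query), i))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def training_accuracy(train_x, train_y, k):
    """Share of training points whose own label is predicted correctly."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(train_x, train_y))
    return hits / len(train_x)

# Made-up 1-D data: nine points, five labelled 0 and four labelled 1.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1, 0]

acc_k1 = training_accuracy(xs, ys, k=1)  # each point is its own nearest neighbor
acc_k9 = training_accuracy(xs, ys, k=9)  # every prediction is the overall majority class
print(acc_k1, acc_k9)
```

With k = 1 the model memorises the training set (low bias, high variance); with k equal to the number of points it predicts the majority class everywhere (high bias, low variance).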


### 19. If you have a $k$-NN model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

Answer:   
Overfitting ==> high variance ==> decrease variance ==> increase k  
Overfitting ==> maybe too many features ==> decrease the number of questions used  
Overfitting ==> change algorithm, e.g. to lasso logistic regression

### 20. In which of your logistic regression models is there evidence of overfitting? How do you know?

Answer:  
None. All of the logistic regression models have the same training and testing scores, and the training and testing scores are very close to each other. 

### 21. Broadly speaking, how does the value of $C$ in logistic regression affect the bias-variance tradeoff? (i.e. As $C$ increases, how are bias and variance affected?)

Answer:  
C ==> Inverse of regularization strength (documentation)  
higher C, less regularisation  
lower C, more regularisation  
regularisation ==> simplify model and generalise model ==> decrease variance of model   

C increase, bias decrease, variance increase  
C decrease, bias increase, variance decrease

### 22. For your logistic regression models, play around with the regularization hyperparameter, $C$. As you vary $C$, what happens to the fit and coefficients in the model? What do you think this means in the context of this specific problem?

Answer:  
Changing C does not change model performance.  
There are two possibilities: a small C already does the job well enough, or C has no effect on model performance.  
What we know:
1. The questions are not likely to have anything to do with left-handedness, so a small C (strong regularisation) does the job well.
2. Regularising gives the same answer ==> the features are very poor predictors for this problem 

### 23. If you have a logistic regression model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

Answer:  
1. Regularize by changing C.
2. Use lasso to remove redundant features.
3. Use ridge to shrink the coefficients.
4. Get more data points.
5. Remove features manually. 

---
## Step 6: Answer the problem.

### 24. Suppose you want to understand which psychological features are most important in determining left-handedness. Would you rather use $k$-NN or logistic regression? Why?

Answer:

Logistic regression, because its coefficients show how each feature affects the predicted odds of being left-handed. $k$-NN has no coefficients to interpret.

### 25. Select your logistic regression model that utilized LASSO regularization with $\alpha = 1$. Interpret the coefficient for `Q1`.

Answer:  
Q1 coeff: -0.00285275

In [40]:
lasso_1.coef_

array([[-0.00285275, -0.01310503, -0.01285059, -0.05758483,  0.04475204,
        -0.00491152, -0.01517637, -0.19023224, -0.02191352,  0.05086993,
        -0.00329174,  0.00115864, -0.06389773,  0.04175196, -0.06561856,
         0.0557567 ,  0.03591048, -0.03664534, -0.00217176, -0.06435958,
        -0.06477397, -0.08994628, -0.0165969 , -0.01854693,  0.03806211,
         0.09595788,  0.03229995, -0.02861076,  0.03007233,  0.01468657,
         0.00457009,  0.00390563, -0.02584619, -0.02599571,  0.04276224,
        -0.05755718, -0.03721368,  0.09632804, -0.06808699, -0.09537414,
        -0.04158133, -0.0576506 , -0.10603634,  0.01843732]])

In [41]:
import numpy as np

np.exp(lasso_1.coef_[0][0])

0.9971513157950116

We use `np.exp()` because logistic regression models the log-odds (the natural log of the odds), so exponentiating a coefficient converts it back to an odds ratio.

1. As the value for `Q1` increases by 1, holding the other answers fixed, the log-odds of being left-handed decrease by about 0.00285. (small change only/ no change)
2. As the value for `Q1` increases by 1, the odds of being left-handed are multiplied by about 0.9972, i.e. they are 99.72% of what they were.  (Basically no change if the value increases by 1)
3. As the value for `Q1` increases by 1, the odds of being left-handed drop by about 0.28%. (small change/ no change)
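
The arithmetic behind an odds-ratio reading of the coefficient can be reproduced with the standard library alone. The sketch below hard-codes the `Q1` coefficient as printed in `lasso_1.coef_` above (rounded to eight decimals), so the results are approximate to that precision.

```python
import math

q1_coef = -0.00285275            # first entry of lasso_1.coef_ as printed above
odds_ratio = math.exp(q1_coef)   # multiplicative change in the odds per one-unit increase in Q1
percent_change = (odds_ratio - 1) * 100

print(f"change in log-odds: {q1_coef:.5f}")
print(f"odds ratio: {odds_ratio:.5f}")
print(f"percent change in odds: {percent_change:.3f}%")
```

A coefficient this close to zero gives an odds ratio very close to 1, which is another way of saying that `Q1` barely moves the prediction.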

### 26. If you have to select one model overall to be your *best* model, which model would you select? Why?
- Usually in the "real world," you'll fit many types of models but ultimately need to pick only one! (For example, a client may not understand what it means to have multiple models, or if you're using an algorithm to make a decision, it's probably pretty challenging to use two or more algorithms simultaneously.) It's not always an easy choice, but you'll have to make it soon enough. Pick a model and defend why you picked this model!

Answer:

### 27. Circle back to the three specific and conclusively answerable questions you came up with in Q1. Answer one of these for the professor based on the model you selected!

Answer:  
Logistic regression. The scores are the same for every value of C, which suggests the model is already well regularised, and its accuracy is comparable to that of $k$-NN.  
If it must be $k$-NN, then k = 5, as a bias-variance trade-off.

### BONUS:
Looking for more to do? Probably not - you're busy! But if you want to, consider exploring the following. (They could make for a blog post!)
- Create a visual plot comparing training and test metrics for various values of $k$ and various regularization schemes in logistic regression.
- Rather than just evaluating models based on accuracy, consider using sensitivity, specificity, etc.
- In the context of predicting left-handedness, why are unbalanced classes concerning? If you were to re-do this process given those concerns, what changes might you make?
- Fit and evaluate a generalized linear model other than logistic regression (e.g. Poisson regression).
- Suppose this data were in a `SQL` database named `data` and a table named `inventory`. What `SQL` query would return the count of people who were right-handed, left-handed, both, or missing with their class labels of 1, 2, 3, and 0, respectively? (You can assume you've already logged into the database.)
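
For the `SQL` bullet, one possible query is a simple `GROUP BY` on the `hand` column. The sketch below runs it from Python against an in-memory SQLite database that stands in for the `data` database; the table contents are made up, and the column name `hand` is an assumption carried over from the DataFrame.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the `data` database
conn.execute("CREATE TABLE inventory (hand INTEGER)")
# Made-up rows; codes follow the prompt: 1 right, 2 left, 3 both, 0 missing.
conn.executemany("INSERT INTO inventory (hand) VALUES (?)",
                 [(1,), (1,), (1,), (2,), (3,), (0,)])

query = """
SELECT hand, COUNT(*) AS n
FROM inventory
GROUP BY hand
ORDER BY hand;
"""
counts = conn.execute(query).fetchall()
print(counts)
conn.close()
```

Each returned row pairs a class label with the number of people who have that label.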