## Week 4, Lab 1: Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 
- However, there are some additional questions along the way that don't fit neatly into the one main example we'll walk through. Any question that isn't explicitly part of the main example is marked with **(detour)** at the start of the question.

---
## Step 1: Define The Problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

Specifically, the professor says "I need to prove that left-handedness is caused by some personality trait. Go find that personality trait and the data to back it up."

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### 1. In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

1. What personality traits are more or less commonly identified with left-handed people?
2. Is there an observable relationship between personality traits and preferred writing hand?
3. What personality traits are predictors of preferred writing hand?

---
## Step 2: Obtain the data.

### 2. Read in the file titled "data.csv."
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
import numpy as np

In [2]:
# The file is tab-separated despite its .csv extension.
df = pd.read_csv('data.csv', sep='\t')

In [3]:
pd.options.display.max_rows = 99
pd.options.display.max_columns = 99

In [4]:
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q20,Q21,Q22,Q23,Q24,Q25,Q26,Q27,Q28,Q29,Q30,Q31,Q32,Q33,Q34,Q35,Q36,Q37,Q38,Q39,Q40,Q41,Q42,Q43,Q44,introelapse,testelapse,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,1,1,5,5,5,1,5,1,5,1,5,1,1,1,5,5,5,1,5,1,1,1,1,5,5,1,1,1,5,5,5,1,5,1,91,232,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,2,5,3,4,1,4,1,1,1,5,2,4,4,4,1,2,1,2,1,3,1,5,2,4,4,4,4,4,1,3,1,4,4,5,17,247,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,4,5,4,3,4,1,2,3,1,3,3,3,4,5,3,2,2,2,1,4,3,3,4,4,2,2,4,2,1,4,2,2,2,2,11,6774,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,1,3,2,3,1,5,2,2,5,5,2,3,2,2,1,4,1,1,1,3,4,1,3,5,5,1,3,4,1,2,1,1,1,3,14,1072,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,1,1,5,5,5,1,5,1,5,2,5,1,5,1,5,5,5,1,5,1,5,1,5,5,5,1,1,1,5,5,5,1,5,1,10,226,US,2,1,22,3,1,1,3,2,3


In [5]:
df.iloc[:, [-1]]

Unnamed: 0,hand
0,3
1,1
2,2
3,2
4,3
5,1
6,1
7,1
8,1
9,2


### 3. Suppose that, instead of us giving you this data in a file, you were actually conducting a survey to gather this data yourself. From an ethics/privacy point of view, what are three things you might consider when attempting to gather this data?
> When working with sensitive data like sexual orientation or gender identity, we need to consider how this data could be used if it fell into the wrong hands!

Answer: 

1) Anonymize the data: do not collect names, addresses, or other explicitly personal information. If follow-up is needed, store identifiers separately from the responses (for example, through a double-blind mechanism) so that the two cannot be linked by anyone analyzing the results.

2) Follow data privacy best practices: once collection is complete, delete any remaining identifying information (for example, IP addresses and the technical metadata at the end of this dataset).

3) Reduce bias in data collection: for example, if giving this questionnaire in person to passersby on a sidewalk, require researchers to ask every 10th person who passes, regardless of the researcher's preference or perception of that person's availability.
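
As a rough illustration of points 1 and 2, the sketch below strips identifying columns from a toy survey table before it is stored. The `respondent_ip` column is hypothetical (it is not in the lab's dataset), and which columns count as identifying is an assumption that would need review for a real survey.

```python
import pandas as pd

# Toy stand-in for collected survey data. `respondent_ip` is a hypothetical
# identifying field; the other column names follow the lab's dataset.
survey = pd.DataFrame({
    "respondent_ip": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "country": ["US", "CA", "NL"],
    "Q1": [4, 1, 1],
    "hand": [3, 1, 2],
})

# Drop direct and quasi-identifiers before the data is stored or shared.
identifying_columns = ["respondent_ip", "country"]
anonymized = survey.drop(columns=identifying_columns)

# Shuffle rows and reset the index so row order cannot be tied to collection order.
anonymized = anonymized.sample(frac=1, random_state=0).reset_index(drop=True)
print(anonymized)
```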


---
## Step 3: Explore the data.

### 4. Conduct exploratory data analysis on this dataset.
> If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4184 entries, 0 to 4183
Data columns (total 56 columns):
Q1             4184 non-null int64
Q2             4184 non-null int64
Q3             4184 non-null int64
Q4             4184 non-null int64
Q5             4184 non-null int64
Q6             4184 non-null int64
Q7             4184 non-null int64
Q8             4184 non-null int64
Q9             4184 non-null int64
Q10            4184 non-null int64
Q11            4184 non-null int64
Q12            4184 non-null int64
Q13            4184 non-null int64
Q14            4184 non-null int64
Q15            4184 non-null int64
Q16            4184 non-null int64
Q17            4184 non-null int64
Q18            4184 non-null int64
Q19            4184 non-null int64
Q20            4184 non-null int64
Q21            4184 non-null int64
Q22            4184 non-null int64
Q23            4184 non-null int64
Q24            4184 non-null int64
Q25            4184 non-null int64
Q26            418

In [7]:
df.isnull().sum().sum()

0
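
`df.isnull()` only finds values pandas reads as missing. The bonus section of this lab mentions that `hand` uses 0 as a "missing" code, so a fuller check would also count coded values. Below is a minimal sketch on a toy frame built from the first five rows shown earlier; on the real data you would run the same calls on `df`.

```python
import pandas as pd

# Values copied from the first five rows of df.head() shown earlier.
sample = pd.DataFrame({
    "Q1": [4, 1, 1, 1, 5],
    "age": [22, 14, 30, 18, 22],
    "hand": [3, 1, 2, 2, 3],
})

# How often does each handedness code occur?
hand_counts = sample["hand"].value_counts().sort_index()

# Missing handedness is coded as 0 rather than NaN, so count it explicitly.
missing_hand = int((sample["hand"] == 0).sum())

# Survey items are on a 1-5 scale; anything else deserves a closer look.
out_of_range = int(((sample["Q1"] < 1) | (sample["Q1"] > 5)).sum())

print(hand_counts)
print(sample.describe())
print(missing_hand, out_of_range)
```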

---
## Step 4: Model the data.

### 5. Suppose I wanted to use Q1 - Q44 to predict whether or not the person is left-handed. Would this be a classification or regression problem? Why?

Answer: Classification. Our $y$ variable, whether or not the person is left-handed, is a discrete category rather than a continuous quantity.

### (detour) 6. While this isn't the problem we set out to solve, suppose I wanted to predict the exact age of the respondent using Q1 - Q44 as my predictors. Would this be a classification or regression problem? Why?

Answer: Regression - Age is a continuous variable and is our y variable. 

### 7. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed based on their responses to Q1 - Q44. Before doing that, however, you remember that it is often a good idea to standardize your variables. In general, why would we standardize our variables? Give an example of when we would standardize our variables.

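Answer: We standardize so that variables measured on much larger scales do not carry disproportionate weight in the model. This matters most for distance-based models such as $k$-NN, for example when one feature is the square footage of a plot of land and another is the number of automobiles in a garage.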
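
To make the square-footage example concrete, here is a small sketch with made-up numbers (the three properties are hypothetical). It compares Euclidean distances before and after `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical properties: [square footage of land, cars in garage]
X = np.array([
    [10_000.0, 1.0],
    [10_050.0, 4.0],   # similar land size, three more cars
    [12_000.0, 1.0],   # same number of cars, much more land
])

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# On the raw scale, the three-car difference barely registers.
raw_cars = distance(X[0], X[1])
raw_land = distance(X[0], X[2])

# After standardizing, both differences are of comparable size.
X_scaled = StandardScaler().fit_transform(X)
scaled_cars = distance(X_scaled[0], X_scaled[1])
scaled_land = distance(X_scaled[0], X_scaled[2])

print(f"raw:    cars={raw_cars:.2f}  land={raw_land:.2f}")
print(f"scaled: cars={scaled_cars:.2f}  land={scaled_land:.2f}")
```

Without scaling, a $k$-NN model on these features would in effect choose neighbors by land size alone.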

### 8. Give an example of when we might not standardize our variables.

Answer: When all of the variables are already on the same scale (or have already been standardized), or when the model is not sensitive to feature scale and we want to interpret coefficients in the original units, as in an unregularized linear regression.

### 9. Based on your answers to 7 and 8, do you think we should standardize our predictor variables in this case? Why or why not?

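Answer: It is not strictly necessary, because every prompt in Q1-Q44 uses the same 1-5 ordinal scale, so no single question dominates the distance calculation. Standardizing is still harmless, and the $k$-NN code below does it. Whether the responses are normally distributed is not what decides this.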

### 10. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed. What munging/cleaning do we need to do to our $y$ variable in order to explicitly answer this question? Do it.

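Answer: The `hand` column is already an integer, but it has more than two codes (1 = right, 2 = left, 3 = both, 0 = missing). To answer the question explicitly, we create a binary target that is 1 when `hand == 2` (left-handed) and 0 otherwise.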

In [8]:

df.isnull().sum()

Q1             0
Q2             0
Q3             0
Q4             0
Q5             0
Q6             0
Q7             0
Q8             0
Q9             0
Q10            0
Q11            0
Q12            0
Q13            0
Q14            0
Q15            0
Q16            0
Q17            0
Q18            0
Q19            0
Q20            0
Q21            0
Q22            0
Q23            0
Q24            0
Q25            0
Q26            0
Q27            0
Q28            0
Q29            0
Q30            0
Q31            0
Q32            0
Q33            0
Q34            0
Q35            0
Q36            0
Q37            0
Q38            0
Q39            0
Q40            0
Q41            0
Q42            0
Q43            0
Q44            0
introelapse    0
testelapse     0
country        0
fromgoogle     0
engnat         0
age            0
education      0
gender         0
orientation    0
race           0
religion       0
hand           0
dtype: int64

In [9]:
df.dtypes

Q1              int64
Q2              int64
Q3              int64
Q4              int64
Q5              int64
Q6              int64
Q7              int64
Q8              int64
Q9              int64
Q10             int64
Q11             int64
Q12             int64
Q13             int64
Q14             int64
Q15             int64
Q16             int64
Q17             int64
Q18             int64
Q19             int64
Q20             int64
Q21             int64
Q22             int64
Q23             int64
Q24             int64
Q25             int64
Q26             int64
Q27             int64
Q28             int64
Q29             int64
Q30             int64
Q31             int64
Q32             int64
Q33             int64
Q34             int64
Q35             int64
Q36             int64
Q37             int64
Q38             int64
Q39             int64
Q40             int64
Q41             int64
Q42             int64
Q43             int64
Q44             int64
introelapse     int64
testelapse

In [10]:
# Binary target: 1 = left-handed (hand == 2), 0 = everything else.
df['y'] = (df['hand'] == 2).astype(int)


In [12]:
df['y'].value_counts()

0    3732
1     452
Name: y, dtype: int64
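
These counts give a useful reference point for every accuracy score that follows. The sketch below uses only the two counts printed above to compute the accuracy of a trivial model that always predicts the majority class.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
class_counts = pd.Series({0: 3732, 1: 452})

# Accuracy of a model that always predicts the majority class (not left-handed).
baseline_accuracy = class_counts.max() / class_counts.sum()
print(round(baseline_accuracy, 3))  # 0.892
```

Any model whose accuracy is near or below this baseline has learned little beyond the class balance.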

In [13]:
df.drop(columns=['country'], inplace=True)

### 11. The professor for whom you work suggests that you set $k = 4$. Why might this be a bad idea in this specific case?

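Answer: With two classes and an even $k$, the neighbors' votes can tie (for example, 2 left-handed versus 2 not), leaving the prediction to an arbitrary tie-break. An odd $k$ avoids this.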
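
The tie can be shown with a plain majority vote. The `majority_vote` helper below is a hypothetical illustration, not how any particular library implements $k$-NN. It returns `None` when the top two labels receive the same number of votes.

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the winning label, or None when the top two labels tie."""
    counts = Counter(neighbor_labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

print(majority_vote([1, 1, 0, 0]))      # None: a 2-2 tie with k = 4
print(majority_vote([1, 1, 0, 0, 0]))   # 0: k = 5 cannot tie with two classes
```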

### 12. Let's *(finally)* use $k$-nearest neighbors to predict whether or not a person is left-handed!

> Be sure to create a train/test split with your data!

> Create four separate models, one with $k = 3$, one with $k = 5$, one with $k = 15$, and one with $k = 25$.

> Instantiate and fit your models.

In [17]:
X = df.iloc[:, :44]
y = df['y']

In [18]:
X.shape

(4184, 44)

In [19]:
y.shape

(4184,)

In [20]:
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler


In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [22]:
ss  = StandardScaler()

In [23]:
ss.fit(X_train)
X_train_scaled = ss.transform(X_train)
X_test_scaled = ss.transform(X_test)


In [24]:
knn = KNeighborsClassifier()

In [25]:
cross_val_score(knn, X_train_scaled, y_train, cv=3)

array([0.87476099, 0.87476099, 0.87762906])

In [26]:

cross_val_score(knn, X_train, y_train, cv=3).mean()

0.877310388782664

In [27]:
knn.fit(X_train_scaled, y_train)
knn.score(X_train_scaled, y_train)

0.8910133843212237

In [28]:

knn.score(X_test_scaled, y_test)

0.8804971319311663

In [29]:
knn = KNeighborsClassifier()
cross_val_score(knn, X_train_scaled, y_train, cv=5)

array([0.88216561, 0.86942675, 0.87261146, 0.87559809, 0.8708134 ])

In [30]:
knn.fit(X_train_scaled, y_train)
knn.score(X_train_scaled, y_train)

0.8910133843212237

In [31]:
knn.score(X_test_scaled, y_test)

0.8804971319311663

In [32]:

cross_val_score(knn, X_train_scaled, y_train, cv=5).mean()

0.8741230609819279

In [33]:
knn.fit(X_train_scaled, y_train)
knn.score(X_train_scaled, y_train)

0.8910133843212237

In [34]:
knn.score(X_test_scaled, y_test)

0.8804971319311663

In [35]:
knn = KNeighborsClassifier()
cross_val_score(knn, X_train_scaled, y_train, cv=15).mean()

0.8741186299081036

In [36]:
knn.fit(X_train_scaled, y_train)
knn.score(X_train_scaled, y_train)

0.8910133843212237

In [37]:

knn.score(X_test_scaled, y_test)

0.8804971319311663

In [38]:
knn = KNeighborsClassifier()
cross_val_score(knn, X_train_scaled, y_train, cv=25).mean()

0.8731829185867896

In [39]:
knn.fit(X_train_scaled, y_train)
knn.score(X_train_scaled, y_train)

0.8910133843212237

In [40]:
knn.score(X_test_scaled, y_test)

0.8804971319311663
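
Note that the cells above vary `cv` (the number of cross-validation folds) while every `KNeighborsClassifier()` keeps its default `n_neighbors=5`, which is why the training and test scores repeat. A minimal sketch of varying $k$ itself follows. It uses synthetic data from `make_classification` as a stand-in, so the printed scores are not results for the lab's dataset; swap in the lab's own `X` and `y` to get those.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for X (Q1-Q44) and y.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=44, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

knn_scores = {}
for k in (3, 5, 15, 25):
    # The pipeline fits the scaler on the training data only.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    model.fit(X_tr, y_tr)
    knn_scores[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for k, (train_acc, test_acc) in knn_scores.items():
    print(f"k={k:>2}  train={train_acc:.3f}  test={test_acc:.3f}")
```

Passing `stratify=y_demo` keeps the class proportions the same in both splits, which matters when one class is rare.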

Being good data scientists, we know that we might not run just one type of model. We might run many different models and see which is best.

### 13. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, let's check the [documentation for logistic regression in sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Is there default regularization? If so, what is it? If not, how do you know?

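Answer: Yes. By default, `LogisticRegression` applies an L2 (ridge) penalty with `C=1.0`, as the `penalty='l2'` and `C=1.0` defaults in the constructor show. The documentation describes the class as implementing "regularized logistic regression using the ‘liblinear’ library, ‘newton-cg’, ‘sag’ and ‘lbfgs’ solvers."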

### 14. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, should we standardize our features? Well, the answer is (as always), **it depends**. What is one reason you would standardize? What is one reason you would not standardize?

Answer:
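- An example of when I would standardize in logistic regression is when the model is regularized and the features are on widely different scales, because the penalty acts on coefficient size and would otherwise treat the features unevenly.
- An example of when I would not standardize in logistic regression is when all the variables already share a scale, or when I want to interpret the coefficients in the features' original units.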

### 15. Let's use logistic regression to predict whether or not the person is left-handed.


> Be sure to use the same train/test split with your data as with your $k$-NN model above!

> Create four separate models, one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

> Instantiate and fit your models.
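
The hint about $\alpha$ refers to the fact that scikit-learn's `LogisticRegression` takes `C`, the inverse of the regularization strength, so $\alpha = 10$ corresponds to `C = 0.1`. Below is a sketch of the four requested models under that mapping. It is fit on synthetic stand-in data, and the `liblinear` solver is chosen here because it supports both L1 and L2 penalties.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; replace with the lab's X_train and y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

models = {}
for penalty_name, penalty in (("lasso", "l1"), ("ridge", "l2")):
    for alpha in (1, 10):
        # C is the inverse of the regularization strength alpha.
        models[(penalty_name, alpha)] = LogisticRegression(
            penalty=penalty, C=1 / alpha, solver="liblinear"
        ).fit(X_demo, y_demo)

for (penalty_name, alpha), model in models.items():
    print(penalty_name, alpha, model.C, round(model.score(X_demo, y_demo), 3))
```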

In [62]:
from sklearn.linear_model import Ridge, Lasso, LogisticRegression, RidgeCV, LassoCV, LogisticRegressionCV

In [63]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
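# liblinear supports the L1 penalty; newer scikit-learn releases default to a solver that does not.
lasso_1 = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
lasso_1.fit(X_train, y_train)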

In [65]:
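# Note: RidgeCV is a linear regression model, not a classifier,
# so .score() below reports R^2 rather than accuracy.
ridge = RidgeCV(cv=10)  # searches the default alphas (0.1, 1.0, 10.0)

ridge.fit(X_train, y_train)

ridge.alpha_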

10.0

In [44]:
ridge.score(X_train, y_train)


0.024492024450613603

In [45]:
ridge.score(X_test,y_test)

-0.02263601731585374

In [46]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [66]:
LogisticRegression()

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [70]:
# Note: LassoCV is a linear regression model, not a classifier,
# so .score() below reports R^2 rather than accuracy.
# cv must be at least 2: with cv=1, scikit-learn raises "k-fold cross-validation
# requires at least one train/test split by setting n_splits=2 or more, got n_splits=1."
lasso = LassoCV(cv=10)

lasso.fit(X_train, y_train)

lasso.alpha_

0.009158384318626287

In [71]:
lasso.score(X_train, y_train)

0.01079083984967577

In [72]:
lasso.score(X_test, y_test)

-0.004252037735239078

---
## Step 5: Evaluate the model(s).

### 16. Before calculating any score on your data, take a step back. Think about your $X$ variable and your $Y$ variable. Do you think your $X$ variables will do a good job of predicting your $Y$ variable? Why or why not?

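Answer: No; I don't expect personality-trait responses to strongly predict preferred writing hand. The $k$-NN accuracy of about 0.88 looks high, but about 89% of respondents (3,732 of 4,184) are not left-handed, so a model that always predicts "not left-handed" would score about as well. The near-zero scores from ridge and lasso are consistent with the predictors carrying little signal.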

### 17. Using accuracy as your metric, evaluate all eight of your models on both the training and testing sets. Put your scores below. (If you want to be fancy and generate a table in Markdown, there's a [Markdown table generator site linked here](https://www.tablesgenerator.com/markdown_tables#).)

Answer:

In [50]:
knn.score(X_train, y_train)

0.8951561504142767

In [51]:
knn.score(X_test, y_test)

0.8824091778202677

In [52]:
lasso.score(X_train, y_train)

0.009722970029035438

In [53]:
lasso.score(X_test, y_test)

-0.001878814827788844

In [54]:
ridge.score(X_train, y_train)


0.012895265263130118

In [55]:
ridge.score(X_test, y_test)


0.010414523341270798
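
The ridge and lasso numbers above are not accuracies. `RidgeCV` and `LassoCV` are regression estimators, so their `.score()` method returns $R^2$, while a classifier's `.score()` returns accuracy. The sketch below checks this on synthetic stand-in data (not the lab's dataset) by comparing each `.score()` with the matching metric function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, RidgeCV
from sklearn.metrics import accuracy_score, r2_score

# Synthetic binary-classification data as a stand-in.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=1)

# Regression estimator: .score() is R^2 of the continuous predictions.
ridge_reg = RidgeCV().fit(X_demo, y_demo)
r2_from_score = ridge_reg.score(X_demo, y_demo)
r2_direct = r2_score(y_demo, ridge_reg.predict(X_demo))

# Classifier: .score() is the share of correctly predicted labels.
log_reg = LogisticRegression(solver="liblinear").fit(X_demo, y_demo)
acc_from_score = log_reg.score(X_demo, y_demo)
acc_direct = accuracy_score(y_demo, log_reg.predict(X_demo))

print(f"RidgeCV score (R^2):         {r2_from_score:.3f}")
print(f"LogisticRegression accuracy: {acc_from_score:.3f}")
```

For this reason only the `LogisticRegression` and $k$-NN scores can be compared directly as accuracies.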

In [73]:
lasso_1 = LogisticRegression(penalty = 'l1', C = 1.0)
lasso_1.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l1', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [75]:
lasso_1.score(X_train, y_train)

0.8878266411727215

In [76]:
lasso_1.score(X_test, y_test)

0.9053537284894837

In [77]:
lasso_10 = LogisticRegression(penalty = 'l1', C = 0.1)
lasso_10.fit(X_train, y_train)



LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l1', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [78]:
lasso_10.score(X_train, y_train)

0.8878266411727215

In [79]:
lasso_10.score(X_test, y_test)

0.9053537284894837

### 18. In which of your $k$-NN models is there evidence of overfitting? How do you know?

Answer: My $k$-NN models show only very slight signs of overfitting (about 0.89 on the training set versus 0.88 on the test set), and that gap is small. My ridge and lasso scores are $R^2$ values near zero on both sets, which points to high bias (underfitting) more than overfitting, although the training scores are still slightly higher than the test scores.

### 19. Broadly speaking, how does the value of $k$ in $k$-NN affect the bias-variance tradeoff? (i.e. As $k$ increases, how are bias and variance affected?)

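Answer: As $k$ increases, bias tends to increase and variance tends to decrease. A small $k$ follows the training data closely (low bias, high variance), while a large $k$ averages over many neighbors and smooths the predictions.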
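
One way to see the effect of $k$ empirically is to compare training and test accuracy across several values. The sketch below uses noisy synthetic data (an assumption, not the lab's dataset). With $k = 1$ each training point is its own nearest neighbor, so training accuracy is perfect even though the labels are noisy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# flip_y adds label noise, so a perfect training fit must be memorization.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

train_acc, test_acc = {}, {}
for k in (1, 5, 25, 101):
    knn_demo = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    train_acc[k] = knn_demo.score(X_tr, y_tr)
    test_acc[k] = knn_demo.score(X_te, y_te)
    print(f"k={k:>3}  train={train_acc[k]:.3f}  test={test_acc[k]:.3f}")
```

A large gap between training and test accuracy at small $k$ is the signature of high variance; as $k$ grows, the two scores move closer together.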

### 20. If you have a $k$-NN model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

Answer: Increase $k$, reduce the number of features (independent variables) that may be adding excess variance, or gather more training data.

### 21. In which of your logistic regression models is there evidence of overfitting? How do you know?

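Answer: As with the $k$-NN models in question 18, there is no evidence of overfitting: the logistic regression models score about 0.888 on the training set and about 0.905 on the test set, so test accuracy is not lower than training accuracy.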

### 22. Broadly speaking, how does the value of $C$ in logistic regression affect the bias-variance tradeoff? (i.e. As $C$ increases, how are bias and variance affected?)

Answer: As $C$ increases, we regularize less. This increases variance but reduces bias.

Conversely, as $C$ decreases, we regularize more, which decreases variance but increases bias.

### 23. For your logistic regression models, play around with the regularization hyperparameter, $C$. As you vary $C$, what happens to the fit and coefficients in the model? What might this mean in the context of this problem?

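Answer: See the answer to question 22.

In this model, our independent variables do not predict our dependent variable, handedness, well, and altering the hyperparameters does not meaningfully change this: the LASSO models with $C = 1$ and $C = 0.1$ produce identical training and test accuracy.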
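
To look at the coefficients as well as the fit, one option is to count how many survive the L1 penalty at each value of $C$. This is a sketch on synthetic stand-in data, so the counts it prints are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with a few informative features among many.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=4, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

nonzero = {}
for C in (0.001, 0.1, 100.0):
    # Smaller C means a stronger L1 penalty and more coefficients set to zero.
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear").fit(X_demo, y_demo)
    nonzero[C] = int(np.count_nonzero(clf.coef_))
    print(f"C={C:<7} nonzero coefficients: {nonzero[C]} of {clf.coef_.size}")
```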

### 24. If you have a logistic regression model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

Answer: 

1) Remove excess features that add variance without improving accuracy.

2) Gather more data to construct a more effective model.

3) Increase regularization (a smaller $C$), using a LASSO or Ridge penalty.

---
## Step 6: Answer the problem.

### 25. Suppose you want to understand which psychological features are most important in determining left-handedness. Would you rather use $k$-NN or logistic regression? Why?

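Answer: Logistic regression. It produces a coefficient for each feature, so we can see the direction and size of each question's relationship with left-handedness; $k$-NN gives no comparable per-feature summary.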

### 26. Select a logistic regression model. Interpret the coefficient for `Q1`.

In [82]:
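# Answer: the coefficient for Q1 is 0. The L1 penalty removed Q1 from the model, so,
# holding the other answers constant, a change in Q1 does not change the predicted
# log-odds of being left-handed.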
lasso_10.coef_

array([[ 0.        , -0.0384512 ,  0.        , -0.02653476,  0.02179383,
         0.        ,  0.        , -0.13609342, -0.03705975,  0.        ,
         0.        ,  0.        , -0.01449523,  0.        , -0.03249939,
         0.00595395,  0.03221683, -0.04112862, -0.01968344, -0.01531831,
        -0.03223186, -0.08093993, -0.06440988, -0.00652029,  0.00781393,
         0.08587605,  0.        ,  0.        ,  0.        ,  0.        ,
         0.00891907, -0.01967009,  0.        ,  0.        ,  0.        ,
         0.        , -0.01555616,  0.01685028, -0.02974151, -0.07184281,
        -0.02674465, -0.02716541, -0.08282914, -0.00560656]])
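
Logistic regression coefficients are on the log-odds scale, so exponentiating one gives an odds ratio. The sketch below does this for `Q1` and, for contrast, `Q8`, using values copied from the array above (`Q8` is the eighth entry because `X` holds Q1 through Q44 in order).

```python
import numpy as np

# Coefficients for Q1 and Q8, copied from the lasso_10.coef_ output above.
coefficients = {"Q1": 0.0, "Q8": -0.13609342}

# exp(coefficient) is the multiplicative change in the odds of y = 1
# for a one-point increase in the response, holding other answers fixed.
odds_ratios = {name: float(np.exp(beta)) for name, beta in coefficients.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

An odds ratio of 1 for `Q1` means no change in the odds. The ratio for `Q8` is roughly 0.87, meaning each one-point increase in that response is associated with about 13% lower odds of being classified as left-handed in this model.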

### 27. If you have to select one model overall to be your *best* model, which model would you select? Why?

Answer: My LASSO-penalized logistic regression models performed best, with a test accuracy of about 0.905 compared with about 0.88 for $k$-NN.

### 28. Circle back to the three specific and conclusively answerable questions you came up with in Q1. Answer these for the professor based on the model you selected!

### BONUS:
Looking for more to do? Probably not - you're busy! But if you want to, consider exploring the following:
- Suppose this data were in a `SQL` database named `data` and a table named `inventory`. What `SQL` query would return the count of people who were right-handed, left-handed, both, or missing with their class labels of 1, 2, 3, and 0, respectively? (You can assume you've already logged into the database.)
- Fit and evaluate one or more of the generalized linear models discussed above.
- Create a plot comparing training and test metrics for various values of $k$ and various regularization schemes in logistic regression.
- Rather than just evaluating models based on accuracy, consider using sensitivity, specificity, etc.
- In the context of predicting left-handedness, why are unbalanced classes concerning? If you were to re-do this process given those concerns, what changes might you make?

In [None]:
SELECT hand, COUNT(*) AS respondents
FROM inventory
GROUP BY hand;
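
For the bonus items on sensitivity, specificity, and unbalanced classes, the sketch below uses made-up labels (not model output from this lab) to show why accuracy alone can mislead when one class is rare.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = left-handed, 0 = not left-handed.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # a model that never predicts left-handed

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # share of left-handed respondents detected
specificity = tn / (tn + fp)   # share of other respondents correctly identified
print(accuracy, sensitivity, specificity)  # 0.8 0.0 1.0
```

The model scores 80% accuracy while finding none of the left-handed respondents. Possible remedies to explore include stratified splits and the `class_weight='balanced'` option on `LogisticRegression`.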