## Week 4, Lab 1: Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

One way to define the data science process is as follows:

1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

---
## Step 1: Define the problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

**Answer:**

1. How much more (or less) likely is someone to be left-handed given their answers to a list of personality questions?

2. Do left-handed people agree with certain personality questions more, less, or about the same relative to right-handed people?

3. Do left-handed people agree with violent personality propositions more, less, or about the same relative to right-handed people?

---
## Step 2: Obtain the data.

### Read in the file titled "data.csv":
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [1]:
# library imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression, LogisticRegression

plt.style.use('fivethirtyeight')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [6]:
df = pd.read_csv('./data.csv', sep='\t')

In [7]:
# Check data
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3



In [15]:
df.columns

Index(['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9', 'Q10', 'Q11',
       'Q12', 'Q13', 'Q14', 'Q15', 'Q16', 'Q17', 'Q18', 'Q19', 'Q20', 'Q21',
       'Q22', 'Q23', 'Q24', 'Q25', 'Q26', 'Q27', 'Q28', 'Q29', 'Q30', 'Q31',
       'Q32', 'Q33', 'Q34', 'Q35', 'Q36', 'Q37', 'Q38', 'Q39', 'Q40', 'Q41',
       'Q42', 'Q43', 'Q44', 'introelapse', 'testelapse', 'country',
       'fromgoogle', 'engnat', 'age', 'education', 'gender', 'orientation',
       'race', 'religion', 'hand'],
      dtype='object')

In [16]:
df.isnull().sum().head()

Q1    0
Q2    0
Q3    0
Q4    0
Q5    0
dtype: int64

In [17]:
df.describe()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,testelapse,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
count,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,...,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0
mean,1.962715,3.829589,2.846558,3.186902,2.86544,3.672084,3.216539,3.184512,2.761233,3.522945,...,479.994503,1.576243,1.239962,30.370698,2.317878,1.654398,1.833413,5.013623,2.394359,1.190966
std,1.360291,1.551683,1.664804,1.476879,1.545798,1.342238,1.490733,1.387382,1.511805,1.24289,...,3142.178542,0.494212,0.440882,367.201726,0.874264,0.640915,1.303454,1.970996,2.184164,0.495357
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,7.0,1.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,3.0,1.0,2.0,1.0,3.0,2.0,2.0,1.0,3.0,...,186.0,1.0,1.0,18.0,2.0,1.0,1.0,5.0,1.0,1.0
50%,1.0,5.0,3.0,3.0,3.0,4.0,3.0,3.0,3.0,4.0,...,242.0,2.0,1.0,21.0,2.0,2.0,1.0,6.0,2.0,1.0
75%,3.0,5.0,5.0,5.0,4.0,5.0,5.0,4.0,4.0,5.0,...,324.25,2.0,1.0,27.0,3.0,2.0,2.0,6.0,2.0,1.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,119834.0,2.0,2.0,23763.0,4.0,3.0,5.0,7.0,7.0,3.0


In [18]:
df['country'].unique()

array(['US', 'CA', 'NL', 'GR', 'GB', 'KR', 'SE', 'NO', 'DE', 'NZ', 'CH',
       'RO', 'IL', 'IN', 'ZA', 'TR', 'JM', 'AU', 'BE', 'PL', 'CZ', 'RS',
       'TW', 'A2', 'MX', 'PH', 'ES', 'AT', 'JP', 'IT', 'SG', 'MY', 'HK',
       'FR', 'EU', 'DK', 'AE', 'EC', 'TH', 'IE', 'PK', 'BR', 'ID', 'EG',
       'NI', 'FI', 'CN', 'RU', 'SI', 'AR', 'PT', 'LB', 'DO', 'PF', 'LT',
       'BG', 'GE', 'CL', 'SK', 'EE', 'KE', 'UZ', 'LV', 'BB', 'BN', 'PR',
       'HR', 'NP', 'A1', 'PE', 'UA', 'HU', 'VN', 'TZ', 'KH', 'UY', 'VE',
       'IS', 'MP', 'CO', 'JO', 'TN', 'KW', 'CY', 'FJ', 'LK', 'VI', 'ZW',
       'IM', 'ZM', 'QA', 'DZ', 'LY', 'SA'], dtype=object)

In [19]:
df['engnat'].head()

0    1
1    1
2    2
3    1
4    1
Name: engnat, dtype: int64

**Changing 'no' from 2 to 0 in the engnat and fromgoogle columns, as this is what I am accustomed to.**

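In [22]:
# Assign the result back; calling replace(..., inplace=True) on a selected column may not update df in newer pandas
df['engnat'] = df['engnat'].replace(2, 0)

In [23]:
df['fromgoogle'] = df['fromgoogle'].replace(2, 0)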

In [24]:
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,0,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,0,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,0,0,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,0,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,0,1,22,3,1,1,3,2,3


In [25]:
col_list = list(df.columns)

In [26]:
answer_vals = [list(df[col].unique()) for col in col_list]

In [27]:
answer_vals[0]

[4, 1, 5, 3, 2, 0]

In [28]:
new_dict = dict(zip(col_list, answer_vals))

#### Some respondents answered 0 for education, gender, orientation, race, and hand, which does not correspond to any option in the survey. How many 0s are there in each of these columns?

#### Some responses to the age question are impossible or improbable.

#### Some respondents' testelapse (time between test start and test submission) is very low.
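
One way to quantify these issues is sketched below. It runs on a tiny made-up DataFrame rather than on `df`, and both the column list `coded_cols` and the cutoff `max_plausible_age` are illustrative assumptions, not values taken from the codebook.

```python
import pandas as pd

# Toy stand-in for the survey data (values are made up for illustration)
toy = pd.DataFrame({
    'education': [2, 0, 3, 1],
    'gender':    [1, 2, 0, 0],
    'hand':      [1, 1, 2, 0],
    'age':       [22, 409, 18, 35],
})

# Columns where 0 is assumed not to match any listed answer
coded_cols = ['education', 'gender', 'hand']

# Count how many 0 responses each coded column contains
zero_counts = (toy[coded_cols] == 0).sum()
print(zero_counts)

# Flag ages above a hypothetical plausibility cutoff
max_plausible_age = 100
implausible = toy[toy['age'] > max_plausible_age]
print(implausible)
```

The same two expressions could be applied to `df` by swapping in the real column names.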

In [30]:
df['education'].value_counts()

2    2055
3    1086
1     546
4     446
0      51
Name: education, dtype: int64

In [31]:
df['gender'].value_counts()

2    2212
1    1586
3     304
0      82
Name: gender, dtype: int64

In [32]:
df['orientation'].value_counts()

1    2307
2     833
5     349
3     335
4     237
0     123
Name: orientation, dtype: int64

In [33]:
df['race'].value_counts()

6    2793
1     393
2     383
7     342
3     168
0      66
4      33
5       6
Name: race, dtype: int64

In [34]:
df['hand'].value_counts()

1    3542
2     452
3     179
0      11
Name: hand, dtype: int64

In [38]:
set(df['age'].unique())

{13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 57,
 58,
 59,
 60,
 61,
 62,
 63,
 64,
 65,
 67,
 68,
 69,
 70,
 73,
 76,
 77,
 78,
 85,
 86,
 123,
 409,
 23763}

In [35]:
df[(df['age'] == 123)]

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
2075,3,4,5,4,3,4,5,4,3,3,...,US,0,1,123,1,1,5,7,7,3


In [36]:
df[(df['age'] == 409)]

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
2137,1,2,5,4,3,1,5,1,5,3,...,US,0,1,409,2,1,5,6,1,1


In [37]:
df[(df['age'] == 23763)]

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
2690,2,5,5,1,5,5,5,5,4,2,...,US,0,0,23763,4,1,2,7,7,0


In [42]:
find_avg_list = list(df['age'])

In [43]:
find_avg_list.remove(123)

In [44]:
find_avg_list.remove(409)

In [45]:
find_avg_list.remove(23763)

In [46]:
np.mean(find_avg_list)

24.5816790241569

In [58]:
#set(df['age'].unique())

In [54]:
# Select rows by the age value itself; df['age'][409] would target the row labeled 409 instead
df.loc[df['age'] == 409, 'age'] = int(round(np.mean(find_avg_list)))

In [57]:
df.loc[df['age'] == 23763, 'age'] = int(round(np.mean(find_avg_list)))


---

## Step 3: Explore the data.

### Conduct background research:

Domain knowledge is irreplaceable. Figuring out what information is relevant to a problem, or what data would be useful to gather, is a major part of any end-to-end data science project! For this lab, you'll be using a dataset that someone else has put together, rather than collecting the data yourself.

Do some background research about personality and handedness. What features, if any, are likely to help you make good predictions? How well do you think you'll be able to model this? Write a few bullet points summarizing what you believe, and remember to cite external sources.

You don't have to be exhaustive here. Do enough research to form an opinion, and then move on.

> You'll be using the answers to Q1-Q44 for modeling; you can disregard other features, e.g. country, age, internet browser.

**Answer:**

This is a classification problem because we are trying to determine someone's class (left- or right-handed). The dependent variable, whether or not someone is left-handed, is categorical and unordered.

### Conduct exploratory data analysis on this dataset:

If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

You might use this section to perform data cleaning if you find it to be necessary.

### Calculate and interpret the baseline accuracy rate:
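
A minimal sketch of the calculation, assuming "baseline" means always predicting the most common class. The counts are the ones reported by `df['hand'].value_counts()` earlier in this lab; the name `hand_counts` is introduced here only for illustration.

```python
import pandas as pd

# Class counts as reported by df['hand'].value_counts() earlier in this lab
hand_counts = pd.Series({1: 3542, 2: 452, 3: 179, 0: 11})

# Baseline accuracy: the share of the most common class
baseline = hand_counts.max() / hand_counts.sum()
print(round(baseline, 4))
```

Under that assumption, a model is only useful if its test accuracy exceeds this value.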

### Short answer questions:

In this lab you'll use $k$-nearest neighbors and logistic regression to model handedness based on psychological factors. Answer the following related questions; your answers may be in bullet points.

#### Describe the difference between regression and classification problems:

#### Considering $k$-nearest neighbors, describe the relationship between $k$ and the bias-variance tradeoff:

#### Why do we often standardize predictor variables when using $k$-nearest neighbors?

**Answer:**

We standardize our variables so that they are weighed on a similar scale. Because $k$-nearest neighbors classifies each point by its distance to other points, this ensures that no feature has undue influence on the model solely because of the manner in which it is measured.

One example of when we would standardize is predicting someone's income from their age and high school GPA: age might range from 14 to 100, while GPA can only range from 0.0 to 4.0, so the two variables sit on very different scales.
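
The age and GPA example can be sketched with scikit-learn's `StandardScaler`. The numbers below are made up for illustration and are not drawn from the survey data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up ages and GPAs: two features on very different scales
X = np.array([[14, 3.9],
              [35, 2.1],
              [62, 3.0],
              [100, 1.5]], dtype=float)

# Rescale each column to mean 0 and standard deviation 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for both columns
print(X_scaled.std(axis=0))   # approximately 1 for both columns
```

With a train-test split, the scaler would typically be fit on the training set only and then applied to the test set with `scaler.transform`.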

#### Do you think we should standardize the explanatory variables for this problem? Why or why not?

#### How do we settle on $k$ for a $k$-nearest neighbors model?
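
One common approach, sketched here under the assumption that accuracy is the metric of interest, is to compare cross-validated scores across candidate values of $k$. The data comes from `make_classification` as a synthetic stand-in for the survey responses, and `cv_scores` and `best_k` are names chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the survey responses
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Mean 5-fold cross-validated accuracy for each candidate k
cv_scores = {}
for k in [3, 5, 15, 25]:
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(knn, X, y, cv=5).mean()

# Keep the k with the highest mean cross-validated accuracy
best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_k)
```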

#### What is the default type of regularization for logistic regression as implemented in scikit-learn? (You might [check the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).)

#### Describe the relationship between the scikit-learn `LogisticRegression` argument `C` and regularization strength:
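
In scikit-learn, `C` is the inverse of regularization strength, so a smaller `C` penalizes the coefficients more heavily. The sketch below illustrates this on synthetic data from `make_classification`; the two `C` values are arbitrary choices for the comparison.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the survey responses
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Small C means strong regularization; large C means weak regularization
strong = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)
weak = LogisticRegression(C=100, max_iter=1000).fit(X, y)

# Stronger regularization shrinks the coefficients toward zero
strong_norm = np.linalg.norm(strong.coef_)
weak_norm = np.linalg.norm(weak.coef_)
print(strong_norm, weak_norm)
```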

#### Describe the relationship between regularization strength and the bias-variance tradeoff:

#### Logistic regression is considered more interpretable than $k$-nearest neighbors. Explain why.
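
Part of the reason is that each logistic regression coefficient is a change in log-odds, so exponentiating it gives an odds ratio for a one-unit increase in that feature. A minimal sketch on synthetic data follows; the feature indices carry no meaning and stand in for questions like Q1-Q44.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with four features standing in for survey questions
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X, y)

# exp(coefficient) is the multiplicative change in the odds per one-unit increase
odds_ratios = np.exp(model.coef_[0])
for i, ratio in enumerate(odds_ratios):
    print(f"feature {i}: odds multiply by {ratio:.2f} per one-unit increase")
```

$k$-nearest neighbors has no comparable per-feature parameter to read off.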

---

## Step 4 & 5 Modeling: $k$-nearest neighbors

### Train-test split your data:

Your explanatory variables should be the answers to Q1-Q44.

In [59]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler

In [61]:
target = df['hand']

In [62]:
features = col_list[0:44]

In [63]:
X = df[features]
y = df['hand']

In [64]:
y.value_counts(normalize=True)

1    0.846558
2    0.108031
3    0.042782
0    0.002629
Name: hand, dtype: float64

In [65]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

#### Create and fit four separate $k$-nearest neighbors models: one with $k = 3$, one with $k = 5$, one with $k = 15$, and one with $k = 25$:

**K = 3**

In [90]:
knn = KNeighborsClassifier(n_neighbors=3)

In [91]:
knn.fit(X_train, y_train)

In [92]:
knn.score(X_train, y_train)

0.8639260675589547

In [93]:
knn.score(X_test, y_test)

0.8154875717017208

In [94]:
cross_val_score(knn, X_train, y_train, cv = 3).mean()

0.811344805608668

**K = 5**

In [102]:
knn = KNeighborsClassifier(n_neighbors=5)

In [103]:
knn.fit(X_train,y_train)

In [104]:
knn.score(X_train,y_train)

0.851497769279796

In [105]:
knn.score(X_test, y_test)

0.8413001912045889

In [106]:
cross_val_score(knn, X_train, y_train, cv = 3).mean()

0.831421287444232

**K = 15**

In [100]:
knn = KNeighborsClassifier(n_neighbors=15)

In [101]:
knn.fit(X_train,y_train)

In [82]:
knn.score(X_train,y_train)

0.8467176545570427

In [83]:
knn.score(X_test, y_test)

0.8460803059273423

In [84]:
cross_val_score(knn, X_train, y_train, cv = 3).mean()

0.8467176545570427

**K = 25**

In [85]:
knn = KNeighborsClassifier(n_neighbors=25)

In [86]:
knn.fit(X_train,y_train)

In [87]:
knn.score(X_train,y_train)

0.8467176545570427

In [88]:
knn.score(X_test, y_test)

0.8460803059273423

In [89]:
cross_val_score(knn, X_train, y_train, cv = 3).mean()

0.8467176545570427

### Evaluate your models:

Evaluate each of your four models on the training and testing sets, and interpret the four scores. Are any of your models overfit or underfit? Do any of your models beat the baseline accuracy rate?
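
A compact way to line up the comparison is sketched below on synthetic data, so the printed numbers will not match the scores above. A large gap between training and testing accuracy suggests overfitting, while low accuracy on both suggests underfitting. The `results` dictionary is a name chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the survey responses
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y)

# Record training and testing accuracy for each k
results = {}
for k in [3, 5, 15, 25]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc = knn.score(X_train, y_train)
    test_acc = knn.score(X_test, y_test)
    results[k] = (train_acc, test_acc)
    print(f"k={k}: train={train_acc:.3f}, test={test_acc:.3f}, "
          f"gap={train_acc - test_acc:.3f}")
```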

---

## Step 4 & 5 Modeling: logistic regression

#### Create and fit four separate logistic regression models: one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

Note: You can use the same train and test data as above.
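
A minimal sketch of the four models, assuming $\alpha$ is translated to scikit-learn's `C` as `C = 1 / alpha` and that the `liblinear` solver is used because it supports both penalties. The data is synthetic, and `scores` is a name chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the survey responses
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y)

scores = {}
for penalty in ['l1', 'l2']:  # 'l1' is LASSO, 'l2' is Ridge
    for alpha in [1, 10]:
        # scikit-learn takes C, the inverse of regularization strength
        model = LogisticRegression(penalty=penalty, C=1 / alpha,
                                   solver='liblinear')
        model.fit(X_train, y_train)
        scores[(penalty, alpha)] = (model.score(X_train, y_train),
                                    model.score(X_test, y_test))

for key, (train_acc, test_acc) in scores.items():
    print(key, round(train_acc, 3), round(test_acc, 3))
```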

### Evaluate your models:

Evaluate each of your four models on the training and testing sets, and interpret the four scores. Are any of your models overfit or underfit? Do any of your models beat the baseline accuracy rate?

---

## Step 6: Answer the problem.

Are any of your models worth moving forward with? What are the "best" models?