## Week 4, Lab 1: Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

One way to define the data science process is as follows:

1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

---
## Define The Problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

**Answer:**

> 1.How much more is left handed given their answers to a list of questions?

> 2.Do people who are left handed agree with certain personality questions more than the right handed people?
    
> 3.Do people who are left handed agree with violent personality more than the right handed people?

---
## Step 2: Obtain the data.

### Read in the file titled "data.csv":
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [1]:
# library imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

from sklearn.linear_model import LinearRegression, LogisticRegression
warnings.simplefilter(action='ignore', category=FutureWarning)

plt.style.use('fivethirtyeight')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [2]:
df = pd.read_csv('./data.csv', sep='\t')

In [3]:
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3


In [4]:
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3


In [5]:
df.columns

Index(['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9', 'Q10', 'Q11',
       'Q12', 'Q13', 'Q14', 'Q15', 'Q16', 'Q17', 'Q18', 'Q19', 'Q20', 'Q21',
       'Q22', 'Q23', 'Q24', 'Q25', 'Q26', 'Q27', 'Q28', 'Q29', 'Q30', 'Q31',
       'Q32', 'Q33', 'Q34', 'Q35', 'Q36', 'Q37', 'Q38', 'Q39', 'Q40', 'Q41',
       'Q42', 'Q43', 'Q44', 'introelapse', 'testelapse', 'country',
       'fromgoogle', 'engnat', 'age', 'education', 'gender', 'orientation',
       'race', 'religion', 'hand'],
      dtype='object')

In [6]:
df.isnull().sum().head()

Q1    0
Q2    0
Q3    0
Q4    0
Q5    0
dtype: int64

In [7]:
df.describe()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,testelapse,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
count,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,...,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0
mean,1.962715,3.829589,2.846558,3.186902,2.86544,3.672084,3.216539,3.184512,2.761233,3.522945,...,479.994503,1.576243,1.239962,30.370698,2.317878,1.654398,1.833413,5.013623,2.394359,1.190966
std,1.360291,1.551683,1.664804,1.476879,1.545798,1.342238,1.490733,1.387382,1.511805,1.24289,...,3142.178542,0.494212,0.440882,367.201726,0.874264,0.640915,1.303454,1.970996,2.184164,0.495357
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,7.0,1.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,3.0,1.0,2.0,1.0,3.0,2.0,2.0,1.0,3.0,...,186.0,1.0,1.0,18.0,2.0,1.0,1.0,5.0,1.0,1.0
50%,1.0,5.0,3.0,3.0,3.0,4.0,3.0,3.0,3.0,4.0,...,242.0,2.0,1.0,21.0,2.0,2.0,1.0,6.0,2.0,1.0
75%,3.0,5.0,5.0,5.0,4.0,5.0,5.0,4.0,4.0,5.0,...,324.25,2.0,1.0,27.0,3.0,2.0,2.0,6.0,2.0,1.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,119834.0,2.0,2.0,23763.0,4.0,3.0,5.0,7.0,7.0,3.0


In [8]:
df['country'].unique()

array(['US', 'CA', 'NL', 'GR', 'GB', 'KR', 'SE', 'NO', 'DE', 'NZ', 'CH',
       'RO', 'IL', 'IN', 'ZA', 'TR', 'JM', 'AU', 'BE', 'PL', 'CZ', 'RS',
       'TW', 'A2', 'MX', 'PH', 'ES', 'AT', 'JP', 'IT', 'SG', 'MY', 'HK',
       'FR', 'EU', 'DK', 'AE', 'EC', 'TH', 'IE', 'PK', 'BR', 'ID', 'EG',
       'NI', 'FI', 'CN', 'RU', 'SI', 'AR', 'PT', 'LB', 'DO', 'PF', 'LT',
       'BG', 'GE', 'CL', 'SK', 'EE', 'KE', 'UZ', 'LV', 'BB', 'BN', 'PR',
       'HR', 'NP', 'A1', 'PE', 'UA', 'HU', 'VN', 'TZ', 'KH', 'UY', 'VE',
       'IS', 'MP', 'CO', 'JO', 'TN', 'KW', 'CY', 'FJ', 'LK', 'VI', 'ZW',
       'IM', 'ZM', 'QA', 'DZ', 'LY', 'SA'], dtype=object)

In [9]:
df['engnat'].head()

0    1
1    1
2    2
3    1
4    1
Name: engnat, dtype: int64

**Changing 'no' from 2 to 0 in the engnat and fromgoogle columns, as this is what I am accustomed to.**

In [10]:
df['engnat'].replace(2, 0, inplace=True)

In [11]:
df['fromgoogle'].replace(2, 0, inplace=True)

In [12]:
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,0,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,0,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,0,0,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,0,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,0,1,22,3,1,1,3,2,3


In [13]:
columns_list = list(df.columns)

In [14]:
answer_vals = [list(df[col].unique()) for col in columns_list]

In [15]:
answer_vals[0]

[4, 1, 5, 3, 2, 0]

In [16]:
new_dict = dict(zip(columns_list, answer_vals))

#### The answered 0 for education, 0 for gender, 0 for orientation, 0 for race, and 0 for hand, which do not equate to anything in the survey. How many 0's there are for each of these columns?

####  Some impossible/improbable responses to the age question.

#### Some respondent's testelapse (time between test start and test submission) is very low.

In [17]:
df['education'].value_counts()

2    2055
3    1086
1     546
4     446
0      51
Name: education, dtype: int64

In [18]:
df['gender'].value_counts()

2    2212
1    1586
3     304
0      82
Name: gender, dtype: int64

In [19]:
df['orientation'].value_counts()

1    2307
2     833
5     349
3     335
4     237
0     123
Name: orientation, dtype: int64

In [20]:
df['race'].value_counts()

6    2793
1     393
2     383
7     342
3     168
0      66
4      33
5       6
Name: race, dtype: int64

In [21]:
# Just only 11 = 0, I might think my result will not be changed alot
df['hand'].value_counts()

1    3542
2     452
3     179
0      11
Name: hand, dtype: int64

In [22]:
# Check age 123,409,23763
# set(df['age'].unique())

In [23]:
df[(df['age'] == 123)]

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
2075,3,4,5,4,3,4,5,4,3,3,...,US,0,1,123,1,1,5,7,7,3


In [24]:
df[(df['age'] == 409)]

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
2137,1,2,5,4,3,1,5,1,5,3,...,US,0,1,409,2,1,5,6,1,1


In [25]:
df[(df['age'] == 23763)]

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
2690,2,5,5,1,5,5,5,5,4,2,...,US,0,0,23763,4,1,2,7,7,0


In [26]:
# Find the avg age 
find_avg_list = list(df['age'])

In [27]:
find_avg_list.remove(123)

In [28]:
find_avg_list.remove(409)

In [29]:
find_avg_list.remove(23763)

In [30]:
np.mean(find_avg_list)

24.5816790241569

In [31]:
#set(df['age'].unique())

In [87]:
df.loc[409, 'age'] = np.mean(find_avg_list)

In [88]:
df.loc[ 23763,'age'] = np.mean(find_avg_list)

In [90]:
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand,409
0,4.0,1.0,5.0,1.0,5.0,1.0,5.0,1.0,4.0,1.0,...,0.0,1.0,22.0,3.0,1.0,1.0,3.0,2.0,3.0,
1,1.0,5.0,1.0,4.0,2.0,5.0,5.0,4.0,1.0,5.0,...,0.0,1.0,14.0,1.0,2.0,2.0,6.0,1.0,1.0,
2,1.0,2.0,1.0,1.0,5.0,4.0,3.0,2.0,1.0,4.0,...,0.0,0.0,30.0,4.0,1.0,1.0,1.0,1.0,2.0,
3,1.0,4.0,1.0,5.0,1.0,4.0,5.0,4.0,3.0,5.0,...,0.0,1.0,18.0,2.0,2.0,5.0,3.0,2.0,2.0,
4,5.0,1.0,5.0,1.0,5.0,1.0,5.0,1.0,3.0,1.0,...,0.0,1.0,22.0,3.0,1.0,1.0,3.0,2.0,3.0,


---

## Step 3: Explore the data.

### Conduct background research:

Domain knowledge is irreplaceable. Figuring out what information is relevant to a problem, or what data would be useful to gather, is a major part of any end-to-end data science project! For this lab, you'll be using a dataset that someone else has put together, rather than collecting the data yourself.

Do some background research about personality and handedness. What features, if any, are likely to help you make good predictions? How well do you think you'll be able to model this? Write a few bullet points summarizing what you believe, and remember to cite external sources.

You don't have to be exhaustive here. Do enough research to form an opinion, and then move on.

> You'll be using the answers to Q1-Q44 for modeling; you can disregard other features, e.g. country, age, internet browser.

**Answer:**

This would be a classification problem because we are trying to determine someone's class (left or right-handed).  The dependent variable, whether or not someone is left-handed, is categorical and unordered.

### Conduct exploratory data analysis on this dataset:

If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

You might use this section to perform data cleaning if you find it to be necessary.

**Answer:**

> Observe the dataset. 

> Find missing values.

> Categorize the values, such as 
> - Categorical: Categorical variables can have a set number of values.
> - Continuous: Continuous variables can have an infinite number of values.Discrete: 
> - Discrete variables can have a set number of values that must be numeric.

> Find the shape of  dataset.

> Identifying relationships between variables.

> Select the right model


### Calculate and interpret the baseline accuracy rate:

### Short answer questions:

In this lab you'll use K-nearest neighbors and logistic regression to model handedness based off of psychological factors. Answer the following related questions; your answers may be in bullet points.

#### Describe the difference between regression and classification problems:

**Answer:**

Fundamentally, classification is about predicting a label and regression is about predicting a quantity

#### Considering $k$-nearest neighbors, describe the relationship between $k$ and the bias-variance tradeoff:

**Answer:**
K-nearest neighboor has realationship with bias variance when more K the model can be underfiting.

#### Why do we often standardize predictor variables when using $k$-nearest neighbors?

**Answer:**

If the scale of features is very different then normalization is required. This is because the distance calculation done in KNN uses feature values. When the one feature values are large than other, that feature will dominate the distance hence the outcome of the KNN

#### Do you think we should standardize the explanatory variables for this problem? Why or why not?

**Answer:**

An example of when we might not standardize our variables are when they are all being weighed on the same scale.

#### How do we settle on $k$ for a $k$-nearest neighbors model?

**Answer:**

The k-nearest neighbor algorithm is imported from the scikit-learn package.
Create feature and target variables.
Split data into training and test data.
Generate a k-NN model using neighbors value.
Train or fit the data into the model.
Predict the future.

#### What is the default type of regularization for logistic regression as implemented in scikit-learn? (You might [check the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).)

**Answer:**

Default is L2 - Ridge regularization

#### Describe the relationship between the scikit-learn `LogisticRegression` argument `C` and regularization strength:

**Answer:**

Agrument C - refers to an Inverse of Alpha(regulazation parameter)
C= 1/alpha if alpha = 1 , so C is = 1 and if alpha = 0.1, so C is = 10

#### Describe the relationship between regularization strength and the bias-variance tradeoff:

**Answer:**

When is High model complexcity which is causing overfiting.

#### Logistic regression is considered more interpretable than $k$-nearest neighbors. Explain why.

**Answer:**

LogReg has coeff(stat parameters) which are sometimes refered as betas.KNN is non-parameter and has no coeff or (stat parameters)
Conceptually,KNN isn't too hard to explian to a wide audience.

Explaning relationship in meaningful and actionable way?
KNN fails and LogReg is better.

---

## Step 4 & 5 Modeling: $k$-nearest neighbors

### Train-test split your data:

Your explanatory variables should be 

In [34]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler

In [35]:
target = df['hand']

In [37]:
columns_list[0:44]

['Q1',
 'Q2',
 'Q3',
 'Q4',
 'Q5',
 'Q6',
 'Q7',
 'Q8',
 'Q9',
 'Q10',
 'Q11',
 'Q12',
 'Q13',
 'Q14',
 'Q15',
 'Q16',
 'Q17',
 'Q18',
 'Q19',
 'Q20',
 'Q21',
 'Q22',
 'Q23',
 'Q24',
 'Q25',
 'Q26',
 'Q27',
 'Q28',
 'Q29',
 'Q30',
 'Q31',
 'Q32',
 'Q33',
 'Q34',
 'Q35',
 'Q36',
 'Q37',
 'Q38',
 'Q39',
 'Q40',
 'Q41',
 'Q42',
 'Q43',
 'Q44']

In [39]:
features = columns_list[0:44]

In [40]:
X = df[features]
y = df['hand']

In [41]:
y.value_counts(normalize=True)

1    0.846558
2    0.108031
3    0.042782
0    0.002629
Name: hand, dtype: float64

In [42]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

#### Create and fit four separate $k$-nearest neighbors models: one with $k = 3$, one with $k = 5$, one with $k = 15$, and one with $k = 25$:

In [43]:
k_scores = pd.DataFrame(columns=['k','train_score','test_score'])

for n, k in enumerate([3,5,15,25]):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_score = knn.score(X_train,y_train)
    test_score = knn.score(X_test,y_test)
    k_scores.loc[n] = [k,train_score,test_score]

In [44]:
k_scores

Unnamed: 0,k,train_score,test_score
0,3.0,0.863926,0.815488
1,5.0,0.851498,0.8413
2,15.0,0.846718,0.84608
3,25.0,0.846718,0.84608


**K = 3**

In [45]:
knn = KNeighborsClassifier(n_neighbors=3)

In [46]:
knn.fit(X_train, y_train)

In [47]:
knn.score(X_train, y_train)

0.8639260675589547

In [48]:
knn.score(X_test, y_test)

0.8154875717017208

In [49]:
cross_val_score(knn, X_train, y_train, cv = 3).mean()

0.811344805608668

**K = 5**

In [50]:
knn = KNeighborsClassifier(n_neighbors=5)

In [51]:
knn.fit(X_train,y_train)

In [52]:
knn.score(X_train,y_train)

0.851497769279796

In [53]:
knn.score(X_test, y_test)

0.8413001912045889

In [54]:
cross_val_score(knn, X_train, y_train, cv = 3).mean()

0.831421287444232

**K = 15**

In [55]:
knn = KNeighborsClassifier(n_neighbors=15)

In [56]:
knn.fit(X_train,y_train)

In [57]:
knn.score(X_train,y_train)

0.8467176545570427

In [58]:
knn.score(X_test, y_test)

0.8460803059273423

In [59]:
cross_val_score(knn, X_train, y_train, cv = 3).mean()

0.8467176545570427

**K = 25**

In [60]:
knn = KNeighborsClassifier(n_neighbors=25)

In [61]:
knn.fit(X_train,y_train)

In [62]:
knn.score(X_train,y_train)

0.8467176545570427

In [63]:
knn.score(X_test, y_test)

0.8460803059273423

In [64]:
cross_val_score(knn, X_train, y_train, cv = 3).mean()

0.8467176545570427

### Evaluate your models:

Evaluate each of your four models on the training and testing sets, and interpret the four scores. Are any of your models overfit or underfit? Do any of your models beat the baseline accuracy rate?

**Answer:**

The Lasso with alpha = 1, and both Ridge logistic regressions showed evidence of overfitting - all three models had cross_val_test scores higher on their training set than their testing set.

---

## Step 4 & 5 Modeling: logistic regression

#### Create and fit four separate logistic regression models: one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

Note: You can use the same train and test data as above.

In [65]:
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.feature_selection import SelectFromModel

**LASSO and $\alpha = 1$**

In [66]:
logreg = LogisticRegression(C=1, penalty='l1',solver='liblinear')

In [67]:
logreg.fit(X_train, y_train)

In [68]:
logreg.score(X_test, y_test)

0.8460803059273423

In [69]:
cross_val_score(logreg, X_train, y_train, cv = 3).mean()

0.8467176545570427

**LASSO and $\alpha = 10$**

In [70]:
logreg = LogisticRegression(C = (1/10),penalty='l1',solver='liblinear')

In [71]:
logreg.fit(X_train, y_train)

In [72]:
logreg.score(X_test, y_test)

0.8460803059273423

In [73]:
cross_val_score(logreg, X_train, y_train, cv = 3).mean()

0.8460803059273423

**Ridge and $\alpha = 1$**

In [74]:
logreg = LogisticRegression(C = 1,penalty='l1',solver='liblinear')

In [75]:
logreg.fit(X_train, y_train)

In [76]:
logreg.score(X_train, y_train)

0.8467176545570427

In [77]:
logreg.score(X_test, y_test)

0.8460803059273423

In [78]:
cross_val_score(logreg, X_train, y_train, cv = 3).mean()

0.8467176545570427

**Ridge and $\alpha = 10$**

In [79]:
logreg = LogisticRegression(C =(1/10),penalty='l1',solver='liblinear')

In [80]:
logreg.fit(X_train, y_train)

In [81]:
logreg.score(X_train, y_train)

0.8460803059273423

In [82]:
logreg.score(X_test, y_test)

0.8460803059273423

In [83]:
cross_val_score(logreg, X_train, y_train, cv = 3).mean()

0.8460803059273423

### Evaluate your models:

Evaluate each of your four models on the training and testing sets, and interpret the four scores. Are any of your models overfit or underfit? Do any of your models beat the baseline accuracy rate?

no overfitting  on this case.

---

## Step 6: Answer the problem.

Are any of your models worth moving forward with? What are the "best" models?

**Answer:**

If I wanted to understand which psychological features are most important in determining left-handedness, I would rather use a logistic regression because it will learn a linear classifier.