## Week 4, Lab 1: Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

One way to define the data science process is as follows:

1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

---
## Step 1: Define the problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

1. Does a left-handed person play more video games than a right-handed person?
2. Does a left-handed person work harder in school than a right-handed person?
3. Is left-handedness more common in males than females?

---
## Step 2: Obtain the data.

### Read in the file titled "data.csv":
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [1]:
# library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.preprocessing import StandardScaler

from sklearn.neighbors import KNeighborsClassifier


In [2]:
df = pd.read_csv("data.csv", sep="\t")  # separator is "\t" (tab) instead of the default comma (,)
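
To see why the hint matters, here is a minimal sketch that reads the same tab-separated text with and without `sep="\t"`. The three-column sample string is a hypothetical stand-in for the first lines of `data.csv`, not the real file.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample standing in for the first lines of data.csv.
sample = "Q1\tQ2\thand\n4\t1\t3\n1\t5\t1\n"

# With the default comma separator, each row is read as a single column.
wrong = pd.read_csv(io.StringIO(sample))

# With sep="\t", the columns are split as intended.
right = pd.read_csv(io.StringIO(sample), sep="\t")

print(wrong.shape)
print(right.shape)
```

If the shape of a freshly loaded DataFrame shows only one column, a wrong separator is a likely cause.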

In [3]:
df.shape

(4184, 56)

In [4]:
# Note that there is no missing data.
#df.isnull().sum()
#df.info()

---

## Step 3: Explore the data.

### Conduct background research:

Domain knowledge is irreplaceable. Figuring out what information is relevant to a problem, or what data would be useful to gather, is a major part of any end-to-end data science project! For this lab, you'll be using a dataset that someone else has put together, rather than collecting the data yourself.

Do some background research about personality and handedness. What features, if any, are likely to help you make good predictions? How well do you think you'll be able to model this? Write a few bullet points summarizing what you believe, and remember to cite external sources.

You don't have to be exhaustive here. Do enough research to form an opinion, and then move on.

> You'll be using the answers to Q1-Q44 for modeling; you can disregard other features, e.g. country, age, internet browser.

In [5]:
# Left- and right-handed people may differ somewhat in personality, partly due to cultural discrimination.
# Left-handedness can also be an advantage in some sports, such as boxing, because it is rare (~10% of people).
# Historically, left-handed people were often forced to use their right hand as the norm.
# Hence, a considerable proportion of left-handed people can use both hands.

### Conduct exploratory data analysis on this dataset:

If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

You might use this section to perform data cleaning if you find it to be necessary.

In [6]:
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3


### Calculate and interpret the baseline accuracy rate:

In [7]:
df["hand"].value_counts()

1    3542
2     452
3     179
0      11
Name: hand, dtype: int64

In [8]:
df = df[df["hand"] != 0]

In [9]:
df["hand"].value_counts()

1    3542
2     452
3     179
Name: hand, dtype: int64

In [10]:
df.groupby("gender")["hand"].value_counts(normalize =True).round(4)*100

gender  hand
0       1       73.42
        2       20.25
        3        6.33
1       1       83.58
        2       12.26
        3        4.17
2       1       86.86
        2        9.47
        3        3.67
3       1       80.26
        2       10.86
        3        8.88
Name: hand, dtype: float64

In [11]:
# Gender 0 : Missing, Gender 1: Male, Gender 2: Female, Gender 3:Others
# Hand: 1 = right handed, 2 = left handed, 3 = ambidextrous(both).

Males have a slightly higher proportion of left-handedness (12.26%) than females (9.47%).

In [12]:
# For simplicity, this model groups ambidextrous respondents (3) with left-handed respondents (2).
df["hand"] = df["hand"].map({1:1,2:2,3:2})

In [13]:
df.groupby("gender")["hand"].value_counts(normalize =True).round(4)*100

gender  hand
0       1       73.42
        2       26.58
1       1       83.58
        2       16.42
2       1       86.86
        2       13.14
3       1       80.26
        2       19.74
Name: hand, dtype: float64

In [14]:
df["hand"].value_counts(normalize=True)*100

1    84.878984
2    15.121016
Name: hand, dtype: float64
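
The baseline accuracy is the share of the majority class, that is, the accuracy of always predicting "right-handed". A minimal sketch that recomputes it from the class counts printed earlier, with the ambidextrous group already merged into the left-handed group:

```python
import pandas as pd

# Class counts copied from the value_counts() output above;
# ambidextrous (179) is merged into left-handed (452).
counts = pd.Series({1: 3542, 2: 452 + 179})

# Baseline accuracy = proportion of the most frequent class.
baseline = counts.max() / counts.sum()
print(f"Baseline accuracy: {baseline:.4f}")
```

Any model worth keeping should score above this number on unseen data.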

### Short answer questions:

In this lab you'll use K-nearest neighbors and logistic regression to model handedness based off of psychological factors. Answer the following related questions; your answers may be in bullet points.

#### Describe the difference between regression and classification problems:

In [15]:
# Regression predicts numeric values while classification predicts the label, group or class of data. 

#### Considering $k$-nearest neighbors, describe the relationship between $k$ and the bias-variance tradeoff:

In [16]:
# k controls the bias-variance tradeoff:
# a small k gives low bias and high variance (prone to overfitting);
# a large k gives high bias and low variance (prone to underfitting).
# We need to tune k to find the best balance between the two.
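
The effect of $k$ can be sketched on synthetic data. The dataset below comes from `make_classification` and is an assumption for illustration only, not the lab's data. With $k = 1$ each training point is its own nearest neighbor, so the training score is perfect, which is the high-variance extreme.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption for illustration, not the lab's dataset).
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

# Record (train accuracy, test accuracy) for a range of k values.
scores = {}
for k in (1, 5, 25, 101):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"k={k:>3}  train={scores[k][0]:.3f}  test={scores[k][1]:.3f}")
```

A large gap between train and test accuracy suggests overfitting; similar but low scores on both suggest underfitting.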

#### Why do we often standardize predictor variables when using $k$-nearest neighbors?

In [17]:
# kNN is a distance-based method, so its results depend heavily on the units (scale) of each feature.
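
A small sketch of why scale matters for distances. The two features (age in years, income in dollars) and their values are hypothetical. Before scaling, the feature with the larger numeric range dominates the Euclidean distance; after `StandardScaler`, both features contribute on comparable terms.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales: age (years), income (dollars).
X_demo = np.array([
    [25.0, 40_000.0],
    [60.0, 41_000.0],  # very different age, similar income
    [26.0, 80_000.0],  # similar age, very different income
])


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))


# Raw distances from the first row: income dominates.
raw_01 = dist(X_demo[0], X_demo[1])
raw_02 = dist(X_demo[0], X_demo[2])
print(f"raw:    {raw_01:.1f} vs {raw_02:.1f}")

# After standardizing, each column has mean 0 and unit variance.
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_01 = dist(X_scaled[0], X_scaled[1])
scaled_02 = dist(X_scaled[0], X_scaled[2])
print(f"scaled: {scaled_01:.2f} vs {scaled_02:.2f}")
```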

#### Do you think we should standardize the explanatory variables for this problem? Why or why not?

In [18]:
# No because the explanatory variables are already in the same unit/scale of 1 to 5.

#### How do we settle on $k$ for a $k$-nearest neighbors model?

In [19]:
# We find the optimal k by iterating over candidate values of k and using
# cross-validation to pick the one that best balances bias and variance.
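
One way to sketch this search is with `cross_val_score` over a short list of candidate values. The synthetic dataset and the candidate list are assumptions for illustration; in the lab you would pass your own training data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption for illustration, not the lab's dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)

# Mean 5-fold cross-validated accuracy for each candidate k.
candidate_ks = [3, 5, 15, 25]
cv_means = {}
for k in candidate_ks:
    fold_scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5
    )
    cv_means[k] = fold_scores.mean()

# Pick the k with the highest mean cross-validated accuracy.
best_k = max(cv_means, key=cv_means.get)
print(cv_means)
print(f"Best k: {best_k}")
```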

#### What is the default type of regularization for logistic regression as implemented in scikit-learn? (You might [check the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).)

In [20]:
# L2 or Ridge is the default regularization for logistic regression.

#### Describe the relationship between the scikit-learn `LogisticRegression` argument `C` and regularization strength: (You might [check the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).)

In [21]:
# The LogisticRegression argument C is the inverse of the regularization strength.
# Smaller values of C specify stronger regularization.
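
Since `C` is the inverse of the regularization strength, a penalty written as $\alpha$ translates to $C = 1/\alpha$. A minimal sketch, where `alpha_to_C` is a hypothetical helper and not part of scikit-learn:

```python
from sklearn.linear_model import LogisticRegression


def alpha_to_C(alpha):
    """Hypothetical helper: convert a penalty strength alpha to scikit-learn's C."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 1.0 / alpha


# A Ridge-penalized (default L2) logistic regression with alpha = 10.
ridge_alpha_10 = LogisticRegression(C=alpha_to_C(10), max_iter=1000)
print(ridge_alpha_10.C)
```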

#### Describe the relationship between regularization strength and the bias-variance tradeoff:

In [22]:
# Regularization strength has a positive relationship with bias.
# Regularization strength has an inverse relationship with variance.

# Zero to low regularization strength tends to have low bias and high variance.
# High regularization strength tends to have high bias and low variance.

# Moderate regularization strength tends to have the best results.
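
A rough way to see this is to watch the coefficients shrink as regularization gets stronger (smaller `C`): a more constrained model is less flexible, which raises bias and lowers variance. The sketch below uses synthetic data, which is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration, not the lab's dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=2)

# Fit L2-penalized models from weak (C=100) to strong (C=0.01) regularization
# and record the Euclidean norm of the coefficient vector.
coef_norms = {}
for C in (100, 1, 0.01):
    model = LogisticRegression(C=C, max_iter=5000).fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:<6} coefficient norm={coef_norms[C]:.3f}")
```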

#### Logistic regression is considered more interpretable than $k$-nearest neighbors. Explain why.

In [23]:
# Logistic regression has coefficients that can be used to explain the impact
# of each factor, while kNN offers no clear, easy explanation of the factors.

# Reading feature effects off a logistic regression is intuitive, while for kNN it is not.
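
As a sketch of what that interpretation can look like, exponentiating a logistic regression coefficient gives an odds ratio: the multiplicative change in the odds of the positive class for a one-unit increase in that feature, holding the others fixed. The data is synthetic and the column names `Q1` to `Q4` are hypothetical labels borrowed for readability.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with hypothetical column names (not the lab's survey answers).
X_arr, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=0, random_state=3
)
X_demo = pd.DataFrame(X_arr, columns=["Q1", "Q2", "Q3", "Q4"])

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# exp(coefficient) = odds ratio for a one-unit increase in the feature.
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X_demo.columns)
print(odds_ratios.sort_values(ascending=False))
```

An odds ratio above 1 means the feature pushes predictions toward the positive class; below 1, away from it. kNN has no comparable per-feature summary.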

---

## Step 4 & 5 Modeling: $k$-nearest neighbors

### Train-test split your data:

Your explanatory variables (X) should be everything except hand; in this notebook, X is restricted to the answers to Q1-Q44, as noted earlier.
Your target variable (y) is hand. After the recoding above, 1 = right-handed and 2 = left-handed or ambidextrous.

In [24]:
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,2
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,2


In [25]:
df.columns

Index(['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9', 'Q10', 'Q11',
       'Q12', 'Q13', 'Q14', 'Q15', 'Q16', 'Q17', 'Q18', 'Q19', 'Q20', 'Q21',
       'Q22', 'Q23', 'Q24', 'Q25', 'Q26', 'Q27', 'Q28', 'Q29', 'Q30', 'Q31',
       'Q32', 'Q33', 'Q34', 'Q35', 'Q36', 'Q37', 'Q38', 'Q39', 'Q40', 'Q41',
       'Q42', 'Q43', 'Q44', 'introelapse', 'testelapse', 'country',
       'fromgoogle', 'engnat', 'age', 'education', 'gender', 'orientation',
       'race', 'religion', 'hand'],
      dtype='object')

In [26]:
X_drop = ['introelapse', 'testelapse', 'country',
       'fromgoogle', 'engnat', 'age', 'education', 'gender', 'orientation',
       'race', 'religion', 'hand']
X = df.drop(columns = X_drop)
y = df["hand"]

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state= 42, stratify= y )
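
The `stratify=y` argument keeps the class proportions roughly equal in the training and testing sets, which matters when one class is rare. A minimal sketch on a made-up 85/15 target (the counts are an assumption chosen to resemble the imbalance in this lab):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 850 right-handed (1), 150 left-handed (2).
y_demo = np.array([1] * 850 + [2] * 150)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

# Share of the minority class in each split.
train_share = (y_tr == 2).mean()
test_share = (y_te == 2).mean()
print(f"train: {train_share:.3f}  test: {test_share:.3f}")
```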

#### Create and fit four separate $k$-nearest neighbors models: one with $k = 3$, one with $k = 5$, one with $k = 15$, and one with $k = 25$:

In [28]:
# Instantiate models
knn3 = KNeighborsClassifier(n_neighbors= 3)
knn5 = KNeighborsClassifier(n_neighbors= 5)
knn15 = KNeighborsClassifier(n_neighbors= 15)
knn25 = KNeighborsClassifier(n_neighbors= 25)

# Train or Fit models
knn3.fit(X_train,y_train)
knn5.fit(X_train,y_train)
knn15.fit(X_train,y_train)
knn25.fit(X_train,y_train)

KNeighborsClassifier(n_neighbors=25)

### Evaluate your models:

Evaluate each of your four models on the training and testing sets, and interpret the four scores. Are any of your models overfit or underfit? Do any of your models beat the baseline accuracy rate?

In [29]:
# Model 1: kNN (n=3)
print(f"Train Score: {knn3.score(X_train,y_train)}")
print(f"Test Score: {knn3.score(X_test,y_test)}")

Train Score: 0.8718440396292745
Test Score: 0.7998084291187739


In [30]:
# Model 2: kNN (n=5)
print(f"Train Score: {knn5.score(X_train,y_train)}")
print(f"Test Score: {knn5.score(X_test,y_test)}")

Train Score: 0.8497922658996484
Test Score: 0.8362068965517241


In [31]:
# Model 3: kNN (n=15)
print(f"Train Score: {knn15.score(X_train,y_train)}")
print(f"Test Score: {knn15.score(X_test,y_test)}")

Train Score: 0.8497922658996484
Test Score: 0.8496168582375478


In [32]:
# Model 4: kNN (n=25)
print(f"Train Score: {knn25.score(X_train,y_train)}")
print(f"Test Score: {knn25.score(X_test,y_test)}")

Train Score: 0.8488334931287952
Test Score: 0.8486590038314177


In [33]:
# kNN (k = 15) has nearly the same test score as kNN (k = 25); the difference is tiny.
knn15.score(X_test,y_test) - knn25.score(X_test,y_test)

0.0009578544061301653
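
It can help to translate such a small accuracy gap into a count of predictions. The sketch below uses the two test scores printed above and a test-set size of 1044, taken from the support column of the classification reports printed later in this lab.

```python
# Test scores copied from the outputs above.
score_knn15 = 0.8496168582375478
score_knn25 = 0.8486590038314177

# Test-set size, taken from the classification reports later in this lab.
n_test = 1044

# Number of test observations that separate the two models.
diff_predictions = round((score_knn15 - score_knn25) * n_test)
print(diff_predictions)
```

A gap of a single prediction is well within what a different random split could produce.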

---

## Step 4 & 5 Modeling: logistic regression

#### Create and fit four separate logistic regression models: one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

Note: You can use the same train and test data as above.

In [34]:
# Instantiate models
# L1: Lasso, L2: Ridge. C is the inverse of alpha, so alpha = 10 becomes C = 0.1.
logreg = LogisticRegression(max_iter=1000)
lasso1 = LogisticRegression(C=1, penalty='l1', solver='liblinear')
lasso10 = LogisticRegression(C=0.10, penalty='l1', solver='liblinear')
ridge1 = LogisticRegression(C=1, penalty = "l2", max_iter=1000)
ridge10 = LogisticRegression(C=0.10, penalty = "l2", max_iter=1000)

# Train or fit model
logreg.fit(X_train,y_train)
lasso1.fit(X_train,y_train)
lasso10.fit(X_train,y_train)
ridge1.fit(X_train,y_train)
ridge10.fit(X_train,y_train)

LogisticRegression(C=0.1, max_iter=1000)
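
One practical difference between the two penalties is that LASSO (L1) can set coefficients exactly to zero, while Ridge (L2) only shrinks them. A minimal sketch on synthetic data (an assumption for illustration), using the same `penalty` and `solver` arguments as the cell above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with many uninformative features (not the lab's dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=3, n_redundant=0, random_state=5
)

# Same regularization strength, different penalty type.
lasso = LogisticRegression(C=0.05, penalty="l1", solver="liblinear").fit(X_demo, y_demo)
ridge = LogisticRegression(C=0.05, penalty="l2", max_iter=1000).fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print(f"Zero coefficients: lasso={n_zero_lasso}, ridge={n_zero_ridge}")
```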

### Evaluate your models:

Evaluate each of your four models on the training and testing sets, and interpret the four scores. Are any of your models overfit or underfit? Do any of your models beat the baseline accuracy rate?

In [35]:
# Model 1: Lasso (l1, alpha = 1)
print(f"Train Score: {lasso1.score(X_train,y_train)}")
print(f"Test Score: {lasso1.score(X_test,y_test)}")

Train Score: 0.8488334931287952
Test Score: 0.8486590038314177


In [36]:
# Model 2: Lasso (l1, alpha = 10)
print(f"Train Score: {lasso10.score(X_train,y_train)}")
print(f"Test Score: {lasso10.score(X_test,y_test)}")

Train Score: 0.8485139022051774
Test Score: 0.8496168582375478


In [37]:
# Model 3: Ridge (l2, alpha = 1)
print(f"Train Score: {ridge1.score(X_train,y_train)}")
print(f"Test Score: {ridge1.score(X_test,y_test)}")

Train Score: 0.8488334931287952
Test Score: 0.8486590038314177


In [38]:
# Model 4: Ridge (l2, alpha = 10)
print(f"Train Score: {ridge10.score(X_train,y_train)}")
print(f"Test Score: {ridge10.score(X_test,y_test)}")

Train Score: 0.8488334931287952
Test Score: 0.8486590038314177


In [39]:
# Model 5: Default Logistic Regression
print(f"Train Score: {logreg.score(X_train,y_train)}")
print(f"Test Score: {logreg.score(X_test,y_test)}")

Train Score: 0.8488334931287952
Test Score: 0.8486590038314177


In [40]:
ridge10.score(X_test,y_test) - knn15.score(X_test,y_test)

-0.0009578544061301653

In [41]:
lasso10.score(X_test,y_test) - knn15.score(X_test,y_test)

0.0

---

## Step 6: Answer the problem.

Are any of your models worth moving forward with? What are the "best" models?

#### No, none of our models are worth moving forward with, as none meaningfully beats the baseline of 84.88% obtained by always predicting the majority class.

#### The best models are kNN with 15 neighbors and logistic regression with a Lasso penalty at alpha = 10, as they give the highest test score (84.96%).
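
A majority-class baseline can also be scored on the test split directly with scikit-learn's `DummyClassifier`, so that models and baseline are compared on the same rows. The sketch below uses a made-up 85/15 target as a stand-in for the lab's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 850 right-handed (1), 150 left-handed (2).
y_demo = np.array([1] * 850 + [2] * 150)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

# Always predict the most frequent class seen during training.
baseline_model = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_test_acc = baseline_model.score(X_te, y_te)
print(f"Baseline test accuracy: {baseline_test_acc:.4f}")
```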

In [42]:
from sklearn.metrics import classification_report
knn15_preds = knn15.predict(X_test)
print(classification_report(y_test,knn15_preds))

              precision    recall  f1-score   support

           1       0.85      1.00      0.92       886
           2       1.00      0.01      0.01       158

    accuracy                           0.85      1044
   macro avg       0.92      0.50      0.47      1044
weighted avg       0.87      0.85      0.78      1044
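
The report above can be read more concretely through a confusion matrix. The labels below are an illustrative reconstruction, not the notebook's actual predictions: they are chosen to be consistent with the report (886 right-handed all predicted correctly, 1 of 158 left-handed predicted correctly) and with the test score of about 0.8496.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Illustrative reconstruction (an assumption) consistent with the report above.
y_true = np.array([1] * 886 + [2] * 158)
y_pred = np.array([1] * 886 + [2] * 1 + [1] * 157)

# Rows are true classes, columns are predicted classes, in the order [1, 2].
cm = confusion_matrix(y_true, y_pred, labels=[1, 2])
print(cm)

# Metrics for the left-handed class (label 2).
recall_left = recall_score(y_true, y_pred, pos_label=2)
precision_left = precision_score(y_true, y_pred, pos_label=2)
accuracy = (y_true == y_pred).mean()
print(f"recall={recall_left:.4f} precision={precision_left:.2f} accuracy={accuracy:.4f}")
```

Under this reading, the model is essentially a majority-class predictor: high accuracy, but almost no ability to find left-handed respondents.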



In [44]:
lasso10_preds = lasso10.predict(X_test)  # predict with the Lasso model itself, not knn15
print(classification_report(y_test,lasso10_preds))

              precision    recall  f1-score   support

           1       0.85      1.00      0.92       886
           2       1.00      0.01      0.01       158

    accuracy                           0.85      1044
   macro avg       0.92      0.50      0.47      1044
weighted avg       0.87      0.85      0.78      1044



In [43]:
logreg_preds = logreg.predict(X_test)
print(classification_report(y_test,logreg_preds))

              precision    recall  f1-score   support

           1       0.85      1.00      0.92       886
           2       0.00      0.00      0.00       158

    accuracy                           0.85      1044
   macro avg       0.42      0.50      0.46      1044
weighted avg       0.72      0.85      0.78      1044





In [45]:
knn5_preds = knn5.predict(X_test)
print(classification_report(y_test,knn5_preds))

              precision    recall  f1-score   support

           1       0.85      0.97      0.91       886
           2       0.30      0.06      0.10       158

    accuracy                           0.84      1044
   macro avg       0.58      0.52      0.51      1044
weighted avg       0.77      0.84      0.79      1044


