## Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

One way to define the data science process is as follows:

1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

---
## Step 1: Define the problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

1. Does an association exist between personality type (introvert/extravert) and left-handedness?
2. Do left-handed people tend to have more creativity than right-handed people?
3. Do left-handed people tend to have a lower EQ than right-handed people?

---
## Step 2: Obtain the data.

### Read in the file titled "data.csv":
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [126]:
# library imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression, LogisticRegression

In [2]:
df = pd.read_csv('./data.csv', sep='\t')

In [3]:
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3


In [43]:
# checking the nan values 
df.isnull().sum()[df.isnull().sum()!=0]

Series([], dtype: int64)

---

## Step 3: Explore the data.

### Conduct background research:

Domain knowledge is irreplaceable. Figuring out what information is relevant to a problem, or what data would be useful to gather, is a major part of any end-to-end data science project! For this lab, you'll be using a dataset that someone else has put together, rather than collecting the data yourself.

Do some background research about personality and handedness. What features, if any, are likely to help you make good predictions? How well do you think you'll be able to model this? Write a few bullet points summarizing what you believe, and remember to cite external sources.

You don't have to be exhaustive here. Do enough research to form an opinion, and then move on.

> You'll be using the answers to Q1-Q44 for modeling; you can disregard other features, e.g. country, age, internet browser.

### Conduct exploratory data analysis on this dataset:

If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

You might use this section to perform data cleaning if you find it to be necessary.

In [47]:
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3


In [51]:
df.isnull().sum()[df.isnull().sum() !=0]

Series([], dtype: int64)

In [53]:
df.columns

Index(['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9', 'Q10', 'Q11',
       'Q12', 'Q13', 'Q14', 'Q15', 'Q16', 'Q17', 'Q18', 'Q19', 'Q20', 'Q21',
       'Q22', 'Q23', 'Q24', 'Q25', 'Q26', 'Q27', 'Q28', 'Q29', 'Q30', 'Q31',
       'Q32', 'Q33', 'Q34', 'Q35', 'Q36', 'Q37', 'Q38', 'Q39', 'Q40', 'Q41',
       'Q42', 'Q43', 'Q44', 'introelapse', 'testelapse', 'country',
       'fromgoogle', 'engnat', 'age', 'education', 'gender', 'orientation',
       'race', 'religion', 'hand'],
      dtype='object')

In [63]:
df.groupby('hand')['testelapse'].describe()

# hand: "What hand do you use to write with?" 1=Right, 2=Left, 3=Both
# A value of 0 indicates a missing answer. These rows will be dropped
# because they make up only a small proportion of the dataset.

hand,count,mean,std,min,25%,50%,75%,max
0,11.0,312.363636,165.227887,153.0,216.5,275.0,319.5,725.0
1,3542.0,505.453981,3378.542199,7.0,187.0,244.0,324.75,119834.0
2,452.0,356.329646,1376.12076,14.0,177.0,234.0,322.0,28465.0
3,179.0,298.782123,252.093643,110.0,175.5,239.0,329.0,2369.0


In [75]:
# drop the missing value (0) from hand columns
df = df[df['hand'] != 0]

# checking the work
df.groupby('hand')['testelapse'].describe()

hand,count,mean,std,min,25%,50%,75%,max
1,3542.0,505.453981,3378.542199,7.0,187.0,244.0,324.75,119834.0
2,452.0,356.329646,1376.12076,14.0,177.0,234.0,322.0,28465.0
3,179.0,298.782123,252.093643,110.0,175.5,239.0,329.0,2369.0


### Calculate and interpret the baseline accuracy rate:

In [94]:
df['hand'].value_counts(normalize=True).mul(100)

hand
1    84.878984
2    10.831536
3     4.289480
Name: proportion, dtype: float64

- The baseline accuracy is about 84.9%: a model that always predicts the most frequent class, 1 (right-handed people), would be correct about 84.9% of the time.
- If a model's accuracy is higher than the baseline, this indicates that the model is learning patterns from the dataset beyond the class frequencies. If not, the model may need improvement.
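As a quick check of the baseline, the sketch below rebuilds the `hand` column from the class counts reported in the `describe()` output above and computes the share of the most frequent class. The variable names are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Class counts taken from the describe() output above (1=Right, 2=Left, 3=Both).
counts = {1: 3542, 2: 452, 3: 179}
hand = pd.Series(
    [label for label, n in counts.items() for _ in range(n)], name="hand"
)

# The baseline accuracy is the proportion of the most frequent class:
# a model that always predicts that class is right this often.
proportions = hand.value_counts(normalize=True)
majority_class = proportions.idxmax()
baseline_accuracy = proportions.max()

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```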

### Short answer questions:

In this lab, you'll use K-nearest neighbors and logistic regression to model handedness based on psychological factors. 

Answer the following related questions; your answers may be in bullet points.

#### Describe the difference between regression and classification problems:

Answer here: 
- **Regression** is a predictive modeling problem in which the target (output) is a continuous number, such as a housing price, a temperature, or a weight. To evaluate a regression model, we use metrics that compare the predicted values with the actual values, such as RMSE, MAE, and other loss functions.
- **Classification** is a predictive modeling problem in which the target is a category, such as female/male, the species of a flower, or whether a patient has cancer. We can use tools such as accuracy or a confusion matrix to measure the performance of the model.

#### Considering $k$-nearest neighbors, describe the relationship between $k$ and the bias-variance tradeoff:

Answer here:  
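- As $k$ increases, bias increases and variance decreases: each prediction averages over more neighbors, so the model becomes smoother and more generalized (and can underfit). As $k$ decreases, the opposite happens: bias decreases and variance increases, because each prediction depends on only a few nearby points (and can overfit).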
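The effect of $k$ can be seen by fitting the same classifier with several values of $k$ and comparing training and testing accuracy. This is a minimal sketch on synthetic data from `make_classification` (an assumption; it is not the handedness dataset), so the exact numbers will differ from this lab.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 600 rows, 5 numeric features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

train_scores = {}
test_scores = {}
for k in [1, 5, 25, 101]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    train_scores[k] = model.score(X_tr, y_tr)  # accuracy on the data it was fit on
    test_scores[k] = model.score(X_te, y_te)   # accuracy on held-out data
    print(f"k={k:>3}  train={train_scores[k]:.3f}  test={test_scores[k]:.3f}")
```

With $k = 1$ every training point is its own nearest neighbor, so the training accuracy is perfect (low bias, high variance). Larger values of $k$ trade some of that flexibility for stability.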

#### Why do we often standardize predictor variables when using $k$-nearest neighbors?

Answer here:
- The kNN algorithm makes its predictions based on the distances between data points. Without standardizing, the distances are dominated by the features with larger scales (for example, a weight of 40-90 kg next to a height of 1.5-1.9 m), which reduces the model's effectiveness.
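A small numeric illustration of this point, using made-up people described by weight (kg) and height (m): before scaling, the weight column dominates the Euclidean distance; after `StandardScaler`, both columns contribute comparably. The data values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical people: columns are weight in kg and height in metres.
people = np.array([
    [60.0, 1.50],  # A
    [60.5, 1.90],  # B: almost the same weight as A, very different height
    [63.0, 1.51],  # C: almost the same height as A, 3 kg heavier
    [90.0, 1.70],
    [40.0, 1.60],
])

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# On the raw scale, weight differences swamp height differences.
raw_ab = distance(people[0], people[1])
raw_ac = distance(people[0], people[2])

# After standardizing, each column has mean 0 and standard deviation 1.
scaled = StandardScaler().fit_transform(people)
scaled_ab = distance(scaled[0], scaled[1])
scaled_ac = distance(scaled[0], scaled[2])

print(f"raw:    d(A,B)={raw_ab:.2f}  d(A,C)={raw_ac:.2f}")
print(f"scaled: d(A,B)={scaled_ab:.2f}  d(A,C)={scaled_ac:.2f}")
```

On the raw data B looks like A's nearest neighbor even though their heights differ by 40 cm; after scaling, C is the closer point.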

#### Do you think we should standardize the explanatory variables for this problem? Why or why not?

Answer here:
- In this case, if we want to use `age` as one of the features, then yes: `age` is the only variable among those considered that is not on the same scale as the others. To avoid biasing the distances toward `age`, we need to standardize the features.
  

#### How do we settle on $k$ for a $k$-nearest neighbors model?

Answer here:
There are several ways to select $k$ for a kNN model:
1. Use cross-validation (CV) to compare candidate values of $k$.
2. Choose the $k$ that gives the best value of the evaluation metric (for a classification problem like this one, for example, accuracy or other metrics derived from the confusion matrix).
3. Choose the $k$ at the sweet spot that balances bias and variance.
4. Plot the error against $k$ and pick the value where the error is lowest.

All of these methods can be used to find the most appropriate value of $k$. Additionally, a grid search can be used to tune $k$ efficiently together with any other hyperparameters.
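One way to combine cross-validation and grid search is sketched below. It uses a scikit-learn `Pipeline` so the scaler is refit inside each fold, and synthetic data as a stand-in (an assumption); the candidate values of $k$ are the four used later in this lab.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption; not the handedness dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])
k_grid = {"knn__n_neighbors": [3, 5, 15, 25]}

search = GridSearchCV(pipe, k_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

best_k = search.best_params_["knn__n_neighbors"]
print(f"Best k: {best_k}")
print(f"Mean CV accuracy for best k: {search.best_score_:.3f}")
print(f"Held-out test accuracy: {search.score(X_te, y_te):.3f}")
```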

#### What is the default type of regularization for logistic regression as implemented in scikit-learn? (You might [check the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).)

Answer here:
- The default is L2 (Ridge) regularization. It penalizes the sum of the squared coefficients, which helps reduce overfitting by shrinking the coefficients toward zero.
- The alternative, L1 (LASSO) regularization, penalizes the sum of the absolute values of the coefficients. It has the effect of performing feature selection by shrinking some coefficients to exactly zero.


#### Describe the relationship between the scikit-learn `LogisticRegression` argument `C` and regularization strength:

Answer here:
- The `C` parameter in `LogisticRegression` is the inverse of the regularization strength: a lower `C` means stronger regularization, and a higher `C` means weaker regularization.
- When `C` is large, the regularization penalty is small and the model is given more freedom to fit the training data, so the model will be more complex (higher variance).
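The relationship can be checked empirically by fitting the same model with different values of `C` and comparing the size of the fitted coefficients. This is a minimal sketch on synthetic data (an assumption) using the default penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), standardized so coefficients are comparable.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    # Euclidean norm of the coefficient vector: smaller means more shrinkage.
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:>6}  ||coef||={coef_norms[C]:.3f}")
```

The smallest `C` (strongest regularization) should give the smallest coefficient norm.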

#### Describe the relationship between regularization strength and the bias-variance tradeoff:

Answer here:
- The greater the regularization strength, the more the coefficients are penalized and driven toward zero. This leads to a simpler model with lower variance and higher bias, which is more prone to underfitting than overfitting.

#### Logistic regression is considered more interpretable than $k$-nearest neighbors. Explain why.

Answer here:
- Logistic regression is considered more interpretable than $k$-nearest neighbors because it has coefficients. For example, if a coefficient is positive, then as that variable increases, the predicted probability of the outcome increases as well.
- In logistic regression, each coefficient gives direct insight into the direction and size of a feature's effect, while in kNN there is no straightforward way to assess feature importance because predictions are based on the nearest neighbors.
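To make the interpretability point concrete, the sketch below fits a binary logistic regression on synthetic data and converts each coefficient to an odds ratio with `np.exp`. The feature names are hypothetical placeholders, and the data is not the handedness dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical feature names, used only for labeling the output.
feature_names = ["feature_a", "feature_b", "feature_c"]

X_demo, y_demo = make_classification(
    n_samples=400, n_features=3, n_informative=3, n_redundant=0, random_state=1
)
X_demo = StandardScaler().fit_transform(X_demo)

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# For a binary problem coef_ has shape (1, n_features).
coefs = model.coef_[0]
# exp(coefficient) is the multiplicative change in the odds of class 1
# for a one-standard-deviation increase in that (standardized) feature.
odds_ratios = np.exp(coefs)

for name, coef, ratio in zip(feature_names, coefs, odds_ratios):
    direction = "raises" if coef > 0 else "lowers"
    print(f"{name}: coef={coef:+.3f}, odds ratio={ratio:.3f} ({direction} the odds)")
```

A kNN model has no comparable per-feature quantity to report.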
  

---

## Step 4 & 5 Modeling: $k$-nearest neighbors

### Train-test split your data:

Your features should be:

In [118]:
df.columns

Index(['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9', 'Q10', 'Q11',
       'Q12', 'Q13', 'Q14', 'Q15', 'Q16', 'Q17', 'Q18', 'Q19', 'Q20', 'Q21',
       'Q22', 'Q23', 'Q24', 'Q25', 'Q26', 'Q27', 'Q28', 'Q29', 'Q30', 'Q31',
       'Q32', 'Q33', 'Q34', 'Q35', 'Q36', 'Q37', 'Q38', 'Q39', 'Q40', 'Q41',
       'Q42', 'Q43', 'Q44', 'introelapse', 'testelapse', 'country',
       'fromgoogle', 'engnat', 'age', 'education', 'gender', 'orientation',
       'race', 'religion', 'hand'],
      dtype='object')

In [122]:
X = df[['age', 'education', 'gender']]
y = df['hand']

In [128]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

In [130]:
# instantiate the scaler object 
sc = StandardScaler()

# fit the scaler on X_train and transform it, then transform X_test with the same scaler;
# the results are assigned to X_train_sc and X_test_sc
X_train_sc = sc.fit_transform(X_train)
X_test_sc  = sc.transform(X_test)

In [132]:
# Instantiate KNN
knn = KNeighborsClassifier()

In [134]:
# how accurate the model is at predicting class labels (10-fold cross-validation)
print(cross_val_score(knn, X_train_sc, y_train, cv=10))
cross_val_score(knn, X_train_sc, y_train, cv=10).mean()

[0.84025559 0.8370607  0.82428115 0.83386581 0.8370607  0.84025559
 0.80830671 0.83386581 0.84025559 0.83974359]


0.8334951257475218

In [136]:
print (cross_val_score(knn, X_test_sc, y_test, cv=10))
cross_val_score(knn, X_test_sc, y_test, cv=10).mean()

[0.82857143 0.82857143 0.83809524 0.83809524 0.85576923 0.84615385
 0.85576923 0.84615385 0.84615385 0.83653846]


0.8419871794871794

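The mean cross-validated accuracy is similar on the training data (0.833) and the test data (0.842). Note that the second call fits new models on folds of the test set, so it is a second cross-validation rather than an evaluation of a model trained on the training set.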

#### Create and fit four separate $k$-nearest neighbors models: 
- one with $k = 3$
- one with $k = 5$
- one with $k = 15$
- one with $k = 25$:

In [139]:
# fit (train) the kNN model
knn.fit(X_train_sc, y_train)

In [141]:
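def kclass(k):
    # Fit a kNN model with k neighbors and print its training accuracy.
    knn = KNeighborsClassifier(n_neighbors=k)

    knn.fit(X_train_sc, y_train)
    print(f'Training score: {knn.score(X_train_sc, y_train)}')

    # Note: the model is refit on the test set here, so the score below is an
    # in-sample score on the test data rather than a held-out evaluation.
    knn.fit(X_test_sc, y_test)
    print(f'Testing score: {knn.score(X_test_sc, y_test)}')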

In [143]:
kclass(3)

Training score: 0.8437200383509108
Testing score: 0.8496168582375478


In [145]:
kclass(5)

Training score: 0.8382869926494088
Testing score: 0.8505747126436781


In [147]:
kclass(15)

Training score: 0.8488334931287952
Testing score: 0.8486590038314177


In [153]:
kclass(25)

Training score: 0.8488334931287952
Testing score: 0.8486590038314177
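The `kclass` function above calls `fit` on the test set before scoring it, so its testing score is not a held-out score. A possible alternative, sketched here on synthetic data (an assumption, so the numbers will not match the outputs above), fits once on the training data and scores both splits. The helper name `knn_scores` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

def knn_scores(k, X_train, y_train, X_test, y_test):
    """Fit kNN on the training data only and return (train, test) accuracy."""
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)  # the test set is never used for fitting
    return knn.score(X_train, y_train), knn.score(X_test, y_test)

# Synthetic stand-in data (assumption; not the handedness dataset).
X_demo, y_demo = make_classification(n_samples=600, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler()
X_tr_sc = scaler.fit_transform(X_tr)
X_te_sc = scaler.transform(X_te)

results = {}
for k in [3, 5, 15, 25]:
    results[k] = knn_scores(k, X_tr_sc, y_tr, X_te_sc, y_te)
    print(f"k={k:>2}  train={results[k][0]:.3f}  test={results[k][1]:.3f}")
```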


### Evaluate your models:

Evaluate each of your four models on the training and testing sets, and interpret the four scores. 

Are any of your models overfit or underfit? 

Do any of your models beat the baseline accuracy rate? (84.878984%)

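**Answer**: For every $k$ the training and testing scores are within about 1.2 percentage points of each other, so none of the models shows clear overfitting. With $k = 3$ and $k = 5$ the testing score is slightly higher than the training score (0.850 vs. 0.844 and 0.851 vs. 0.838); because `kclass` refits on the test set before scoring it, these testing scores are in-sample scores. The models with $k = 15$ and $k = 25$ produce identical scores (0.8488 training, 0.8487 testing), which are essentially the baseline of 84.88%. This is consistent with predicting the majority class for almost every observation, which points to underfitting. None of the models meaningfully beats the baseline: the best testing score (0.8506, $k = 5$) exceeds it by less than 0.2 percentage points.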

---

## Step 4 & 5 Modeling: logistic regression

#### Create and fit four separate logistic regression models: one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

Note: You can use the same train and test data as used above with kNN.

In [165]:
# fit model

# 1. Logistic Regression with LASSO and alpha = 1
L_alpha1 = LogisticRegression(penalty='l1', C=1, solver='saga', max_iter=1000)
print(L_alpha1.fit(X_train_sc, y_train))

# 2. Logistic Regression with LASSO and alpha = 10
L_alpha10 = LogisticRegression(penalty='l1', C=0.1, solver='saga', max_iter=1000) 
print(L_alpha10.fit(X_train_sc, y_train))

# 3. Logistic Regression with Ridge and alpha = 1
R_alpha1 = LogisticRegression(penalty='l2', C=1, solver='lbfgs', max_iter=1000)
print(R_alpha1.fit(X_train_sc, y_train))

# 4. Logistic Regression with Ridge and alpha = 10
R_alpha10 = LogisticRegression(penalty='l2', C=0.1, solver='lbfgs', max_iter=1000) 
print(R_alpha10.fit(X_train_sc, y_train))

LogisticRegression(C=1, max_iter=1000, penalty='l1', solver='saga')
LogisticRegression(C=0.1, max_iter=1000, penalty='l1', solver='saga')
LogisticRegression(C=1, max_iter=1000)
LogisticRegression(C=0.1, max_iter=1000)
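The hint about $\alpha$ refers to the fact that scikit-learn takes `C`, the inverse of the regularization strength, which is why $\alpha = 10$ is passed as `C=0.1` above. A small helper, hypothetical and not part of scikit-learn, makes the conversion explicit; it assumes the convention $C = 1/\alpha$ and relies on the default penalty.

```python
from sklearn.linear_model import LogisticRegression

def alpha_to_C(alpha):
    """Convert a regularization strength alpha into scikit-learn's C (assumes C = 1 / alpha)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 1.0 / alpha

# Build one (unfitted) model per alpha; the default penalty is used here.
ridge_models = {
    alpha: LogisticRegression(C=alpha_to_C(alpha), max_iter=1000)
    for alpha in (1, 10)
}

for alpha, model in ridge_models.items():
    print(f"alpha={alpha:>2} -> C={model.C}")
```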


In [180]:
print("LASSO (alpha=1) Training Score:", L_alpha1.score(X_train_sc, y_train))
print("LASSO (alpha=1) Testing Score:", L_alpha1.score(X_test_sc, y_test))
print(f'Different: {L_alpha1.score(X_test_sc, y_test) - L_alpha1.score(X_train_sc, y_train)}')
print()

print("LASSO (alpha=10) Training Score:", L_alpha10.score(X_train_sc, y_train))
print("LASSO (alpha=10) Testing Score:", L_alpha10.score(X_test_sc, y_test))
print(f'Different: {L_alpha10.score(X_test_sc, y_test) - L_alpha10.score(X_train_sc, y_train)}')
print()

print("Ridge (alpha=1) Training Score:", R_alpha1.score(X_train_sc, y_train))
print("Ridge (alpha=1) Testing Score:", R_alpha1.score(X_test_sc, y_test))
print(f'Different: {R_alpha1.score(X_test_sc, y_test) - R_alpha1.score(X_train_sc, y_train)}')
print()

print("Ridge (alpha=10) Training Score:", R_alpha10.score(X_train_sc, y_train))
print("Ridge (alpha=10) Testing Score:", R_alpha10.score(X_test_sc, y_test))
print(f'Different: {R_alpha10.score(X_test_sc, y_test) - R_alpha10.score(X_train_sc, y_train)}')

LASSO (alpha=1) Training Score: 0.8488334931287952
LASSO (alpha=1) Testing Score: 0.8486590038314177
Different: -0.0001744892973775114

LASSO (alpha=10) Training Score: 0.8488334931287952
LASSO (alpha=10) Testing Score: 0.8486590038314177
Different: -0.0001744892973775114

Ridge (alpha=1) Training Score: 0.8488334931287952
Ridge (alpha=1) Testing Score: 0.8486590038314177
Different: -0.0001744892973775114

Ridge (alpha=10) Training Score: 0.8488334931287952
Ridge (alpha=10) Testing Score: 0.8486590038314177
Different: -0.0001744892973775114


### Evaluate your models:

Evaluate each of your four models on the training and testing sets, and interpret the four scores. 

Are any of your models overfit or underfit? 

Do any of your models beat the baseline accuracy rate?

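**Answer**: All four models have the same scores (0.8488 training, 0.8487 testing), so the gap between training and testing is negligible and none of the models is overfit. However, these scores are essentially equal to the baseline accuracy of 84.88%, so none of the models beats the baseline in any meaningful way. Identical scores across both penalties and both strengths are consistent with every model predicting the majority class (right-handed) for nearly all observations.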
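One way to check whether a classifier is doing more than predicting the majority class is to compare it with scikit-learn's `DummyClassifier` and to look at which classes it actually predicts. The sketch below uses synthetic, imbalanced data whose features carry no signal (an assumption chosen to mimic roughly 85/11/4 class proportions), not the handedness dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): random features, imbalanced labels.
rng = np.random.default_rng(42)
n_rows = 1000
X_demo = rng.normal(size=(n_rows, 3))
y_demo = rng.choice([1, 2, 3], size=n_rows, p=[0.85, 0.11, 0.04])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

# Always predicts the most frequent training class: this is the baseline model.
dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
logreg = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

dummy_score = dummy.score(X_te, y_te)
logreg_score = logreg.score(X_te, y_te)
predicted_classes = np.unique(logreg.predict(X_te))

print(f"Dummy (majority class) accuracy: {dummy_score:.4f}")
print(f"Logistic regression accuracy:    {logreg_score:.4f}")
print(f"Classes predicted by logistic regression: {predicted_classes}")
```

If the model's accuracy matches the dummy's and it only ever predicts one class, it has not learned anything beyond the class frequencies.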

---

## Step 6: Answer the problem.

Are any of your models worth moving forward with? 

What are the "best" models?

In [186]:
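# Find the best model: all four logistic regression models have the same train/test
# difference (-0.0001744892973775114), so none of them stands out as the best.
# Logistic regression and KNN perform about the same (close to the 84.88% baseline),
# so if I have to choose one to move forward with, I will choose logistic regression
# because it is interpretable.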