## Week 4, Lab 1: Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

One way to define the data science process is as follows:

1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

---
## Step 1: Define the problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


---
## Step 2: Obtain the data.

### Read in the file titled "data.csv":
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [None]:
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/lab-4.01-classification-master/data.csv', delimiter='\t')


In [None]:
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3
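
The file loads correctly only once the delimiter is set, because `data.csv` is tab-separated despite its extension. A minimal sketch of the difference, using a made-up three-row sample rather than the real file:

```python
import io
import pandas as pd

# Made-up sample in a tab-separated layout (not the real data.csv)
sample = "Q1\tQ2\thand\n4\t1\t3\n1\t5\t1\n1\t2\t2\n"

# Default comma separator: each row lands in a single column
wrong = pd.read_csv(io.StringIO(sample))

# Tab separator: each field gets its own column
right = pd.read_csv(io.StringIO(sample), sep="\t")

print(wrong.shape)  # (3, 1)
print(right.shape)  # (3, 3)
```

Checking `df.shape` or `df.columns` right after reading is a quick way to catch a wrong delimiter.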


In [None]:
df.isnull().sum()

Q1             0
Q2             0
Q3             0
Q4             0
Q5             0
Q6             0
Q7             0
Q8             0
Q9             0
Q10            0
Q11            0
Q12            0
Q13            0
Q14            0
Q15            0
Q16            0
Q17            0
Q18            0
Q19            0
Q20            0
Q21            0
Q22            0
Q23            0
Q24            0
Q25            0
Q26            0
Q27            0
Q28            0
Q29            0
Q30            0
Q31            0
Q32            0
Q33            0
Q34            0
Q35            0
Q36            0
Q37            0
Q38            0
Q39            0
Q40            0
Q41            0
Q42            0
Q43            0
Q44            0
introelapse    0
testelapse     0
country        0
fromgoogle     0
engnat         0
age            0
education      0
gender         0
orientation    0
race           0
religion       0
hand           0
dtype: int64

In [None]:
#1=Right, 2=Left, 3=Both

df['hand'].value_counts()

1    3542
2     452
3     179
0      11
Name: hand, dtype: int64

---

## Step 3: Explore the data.

### Conduct background research:

Domain knowledge is irreplaceable. Figuring out what information is relevant to a problem, or what data would be useful to gather, is a major part of any end-to-end data science project! For this lab, you'll be using a dataset that someone else has put together, rather than collecting the data yourself.

Do some background research about personality and handedness. What features, if any, are likely to help you make good predictions? How well do you think you'll be able to model this? Write a few bullet points summarizing what you believe, and remember to cite external sources.

You don't have to be exhaustive here. Do enough research to form an opinion, and then move on.

> You'll be using the answers to Q1-Q44 for modeling; you can disregard other features, e.g. country, age, internet browser.

The link between personality and handedness has been studied for a long time, and left-handed people have often been associated with negative traits. I assume the fixation on left-handed people stems from their rarity: only about 10% of the population is left-handed. I don't expect this data to show that left-handed people are more strongly associated with behaviors that fall under neuroticism. I also tend to believe that we lose the ability to be ambidextrous because we don't practice using both hands: if you don't use it, you lose it.

### Is left-handedness associated with impulsive behavior?
### Is left-handedness associated with risky behavior?
### Is left-handedness associated with apathy?

In [None]:
text = pd.read_csv('/content/drive/MyDrive/lab-4.01-classification-master/codebook.txt', delimiter='|')
text

Unnamed: 0,This data was collected from an interactive version of the Open Sex Role Inventory in 2014.
0,The following items were rated on a five point...
1,Q1\tI have studied how to win at gambling.
2,Q2\tI have thought about dying my hair.
3,"Q3\tI have thrown knives, axes or other sharp ..."
4,Q4\tI give people handmade gifts.
5,Q5\tI have day dreamed about saving someone fr...
6,Q6\tI get embarrassed when people read things ...
7,Q7\tI have been very interested in historical ...
8,Q8\tI know the birthdays of my friends.
9,Q9\tI like guns.


### Conduct exploratory data analysis on this dataset:

If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

You might use this section to perform data cleaning if you find it to be necessary.

In [None]:
df.describe()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,testelapse,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
count,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,...,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0
mean,1.962715,3.829589,2.846558,3.186902,2.86544,3.672084,3.216539,3.184512,2.761233,3.522945,...,479.994503,1.576243,1.239962,30.370698,2.317878,1.654398,1.833413,5.013623,2.394359,1.190966
std,1.360291,1.551683,1.664804,1.476879,1.545798,1.342238,1.490733,1.387382,1.511805,1.24289,...,3142.178542,0.494212,0.440882,367.201726,0.874264,0.640915,1.303454,1.970996,2.184164,0.495357
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,7.0,1.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,3.0,1.0,2.0,1.0,3.0,2.0,2.0,1.0,3.0,...,186.0,1.0,1.0,18.0,2.0,1.0,1.0,5.0,1.0,1.0
50%,1.0,5.0,3.0,3.0,3.0,4.0,3.0,3.0,3.0,4.0,...,242.0,2.0,1.0,21.0,2.0,2.0,1.0,6.0,2.0,1.0
75%,3.0,5.0,5.0,5.0,4.0,5.0,5.0,4.0,4.0,5.0,...,324.25,2.0,1.0,27.0,3.0,2.0,2.0,6.0,2.0,1.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,119834.0,2.0,2.0,23763.0,4.0,3.0,5.0,7.0,7.0,3.0


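### Calculate and interpret the baseline accuracy rate:

The baseline accuracy is the score a model would get by always predicting the majority class. Only 452 of the 4,184 respondents report being left-handed, so always predicting "not left-handed" would be correct about 89% of the time. A model is only useful if it beats that rate.

In [None]:
# Share of respondents who are not left-handed (hand != 2)
baseline = round(float((df['hand'] != 2).mean()), 3)
baseline


0.892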
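
A majority-class baseline can be expressed as a model with scikit-learn's `DummyClassifier`, which ignores the features and always predicts the most frequent class. The sketch below rebuilds the labels from the `hand` counts shown earlier, assuming the target is left-handed (`hand == 2`) versus everyone else; the placeholder feature column is purely illustrative.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts taken from the value_counts() output:
# 452 left-handed respondents out of 4,184 total
n_total, n_left = 4184, 452
y = np.array([1] * n_left + [0] * (n_total - n_left))

# DummyClassifier ignores the features, so a placeholder column is enough
X_placeholder = np.zeros((n_total, 1))

dummy = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y)
baseline_accuracy = dummy.score(X_placeholder, y)
print(round(baseline_accuracy, 3))  # 0.892
```

Any real model's accuracy should be read against this number rather than against 50%.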

### Short answer questions:

In this lab you'll use K-nearest neighbors and logistic regression to model handedness based off of psychological factors. Answer the following related questions; your answers may be in bullet points.

#### Describe the difference between regression and classification problems:

Regression predicts a continuous numeric target. Classification predicts a discrete category (a class label).




#### Considering $k$-nearest neighbors, describe the relationship between $k$ and the bias-variance tradeoff:

- A small $k$ gives low bias and high variance, which can lead to overfitting.
- A large $k$ gives high bias and low variance; the model can become too simple to capture patterns in the data (underfitting).
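
The effect of $k$ can be seen by comparing training and testing accuracy across several values. This is a rough sketch on synthetic data from `make_classification` (not the lab's survey data); the values of $k$ are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the lab's survey data)
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

scores = {}
for k in (1, 5, 25, 101):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Store (training accuracy, testing accuracy) for each k
    scores[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print(f"k={k:>3}  train={scores[k][0]:.3f}  test={scores[k][1]:.3f}")
```

With $k = 1$ every training point is its own nearest neighbor, so training accuracy is perfect; a large gap between training and testing accuracy is the usual symptom of high variance.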



#### Why do we often standardize predictor variables when using $k$-nearest neighbors?

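$k$-nearest neighbors classifies points by distance, so a predictor measured on a larger scale contributes more to the distance and can drown out the others. Standardizing puts every predictor on a comparable scale so that each one contributes on similar terms.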
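
A small numeric sketch of why scale matters for distances. The three rows are hypothetical respondents, each with one answer on a five-point scale and one timing in seconds.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical respondents: [answer on a 1-5 scale, seconds spent on the test]
X = np.array([
    [1.0, 200.0],
    [5.0, 210.0],   # opposite answer, similar time
    [1.0, 400.0],   # same answer, very different time
])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Raw scale: the seconds column dominates, so row 2 looks far more
# distant from row 0 than row 1 does
print(dist(X[0], X[1]), dist(X[0], X[2]))

# Standardized: each column has mean 0 and unit variance, so the two
# distances end up similar in size
Z = StandardScaler().fit_transform(X)
print(dist(Z[0], Z[1]), dist(Z[0], Z[2]))
```

In a modeling workflow the scaler would normally be fit on the training data only, for example inside a scikit-learn `Pipeline`.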

#### Do you think we should standardize the explanatory variables for this problem? Why or why not?

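Standardizing is a cheap safeguard, but it matters less here than usual: Q1-Q44 are all rated on the same five-point scale, so no single question dominates the distance calculation. It would matter far more if differently scaled columns such as `age` or `testelapse` were included as features.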

#### How do we settle on $k$ for a $k$-nearest neighbors model?

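Treat $k$ as a hyperparameter: fit the model for several candidate values and compare their scores on a validation set or with cross-validation, keeping the value that generalizes best. Plotting the error against $k$ and looking for the "elbow" where improvement levels off is a common visual aid, and domain knowledge can narrow the candidates. The distance metric (Euclidean or Manhattan) is a separate hyperparameter that can be tuned the same way.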
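
One common way to automate that comparison is a cross-validated grid search. A minimal sketch on synthetic data, assuming the same four candidate values used later in this lab; the 5-fold setting is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the lab's survey data)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Candidate values of k; the grid itself is an arbitrary choice
param_grid = {"n_neighbors": [3, 5, 15, 25]}

# 5-fold cross-validation scores each candidate on held-out folds
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5,
                      scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`search.cv_results_` holds the mean score for every candidate, which is the data an elbow plot would be drawn from.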



#### What is the default type of regularization for logistic regression as implemented in scikit-learn? (You might [check the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).)

Default regularization is L2. 

#### Describe the relationship between the scikit-learn `LogisticRegression` argument `C` and regularization strength:

`C` is the inverse of regularization strength ($C = 1/\alpha$). Smaller `C` values mean stronger regularization; larger `C` values mean weaker regularization.
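
The inverse relationship can be checked by watching the coefficients shrink as the penalty grows. This is a sketch on synthetic data using the default L2 penalty; the three $\alpha$ values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the lab's survey data)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

norms = {}
for alpha in (0.01, 1, 100):
    C = 1 / alpha  # scikit-learn takes the inverse of the penalty strength
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    # Size of the coefficient vector under this penalty
    norms[alpha] = float(np.linalg.norm(model.coef_))
    print(f"alpha={alpha:<6} C={C:<6} ||coef||={norms[alpha]:.3f}")
```

A larger $\alpha$ (smaller `C`) pulls the coefficient vector toward zero.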

#### Describe the relationship between regularization strength and the bias-variance tradeoff:

Regularization strength is a dial on the bias-variance tradeoff in models such as linear or logistic regression. Stronger regularization shrinks the coefficients, which raises bias and lowers variance; weaker regularization does the opposite.

A well-chosen strength balances the two, so the model makes fewer mistakes on unseen data.

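#### Logistic regression is considered more interpretable than $k$-nearest neighbors. Explain why.

Logistic regression is generally considered more interpretable than $k$-nearest neighbors because it produces one coefficient per feature.

Each coefficient has a clear, direct reading: it is the change in the log-odds of the positive class for a one-unit increase in that feature, holding the others fixed. The model also outputs probabilities, which are easy to understand. $k$-nearest neighbors has no coefficients; a prediction depends only on which training points happen to be nearby.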
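
Coefficients are often easier to read as odds ratios, obtained by exponentiating them. A minimal sketch on synthetic data; the column names `Q1` to `Q4` are hypothetical labels, not the survey's real answers.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; the column names are hypothetical
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=["Q1", "Q2", "Q3", "Q4"])

model = LogisticRegression(max_iter=1000).fit(X, y)

# exp(coefficient) is the multiplicative change in the odds of class 1
# for a one-unit increase in that feature, holding the others fixed
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X.columns)
print(odds_ratios.round(3))
```

An odds ratio above 1 means the feature pushes predictions toward the positive class; below 1 means it pushes away from it.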






---

## Step 4 & 5 Modeling: $k$-nearest neighbors

### Train-test split your data:

Your explanatory variables should be the answers to Q1-Q44.

The target variable (y) is `hand`. Its unique values are 3, 2, 1, and 0:

0 - no response

1 - Right handed

2 - Left handed

3 - Both

In [None]:
df['hand'].unique()

array([3, 1, 2, 0])

In [None]:
# Create a binary column for handedness: left-handed = 1, everyone else (right, both, no response) = 0

In [None]:
# create a new column 'left_hand_dummy' with default value 0
df['left_hand_dummy'] = 0

# set the value to 1 for left-handed individuals
df.loc[df['hand'] == 2, 'left_hand_dummy'] = 1


In [None]:
df['left_hand_dummy']

0       0
1       0
2       1
3       1
4       0
       ..
4179    0
4180    0
4181    0
4182    0
4183    0
Name: left_hand_dummy, Length: 4184, dtype: int64
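
The same 0/1 indicator can be built in one step from a boolean comparison. The sketch below uses a hypothetical five-row sample with the coding described above, and it also drops the "no response" rows first, which is an optional assumption rather than something the lab requires.

```python
import pandas as pd

# Hypothetical mini-sample using the same coding:
# 0 = no response, 1 = right, 2 = left, 3 = both
demo = pd.DataFrame({"hand": [3, 1, 2, 2, 0]})

# Optionally drop rows with no recorded handedness (an assumption,
# not something the lab requires)
demo = demo[demo["hand"] != 0].copy()

# Boolean comparison cast to int gives the 0/1 indicator
demo["left_hand_dummy"] = (demo["hand"] == 2).astype(int)
print(demo["left_hand_dummy"].tolist())  # [0, 0, 1, 1]
```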

In [None]:
df.dtypes

In [None]:
df.drop('hand', axis=1, inplace=True)
df.info()

In [None]:
df.drop('country', axis=1, inplace=True)


In [None]:
non_features = ['left_hand_dummy', 'introelapse', 'testelapse', 'fromgoogle', 'engnat', 'age', 'education', 'religion', 'gender', 'orientation', 'race']
features = [col for col in df.select_dtypes(include=['float64', 'int64']).columns if col not in non_features]


In [None]:
# Target vector (y) and the list of feature column names (X)
y = df['left_hand_dummy']
X = features


In [None]:
# Check to make sure all features are present and left_hand_dummy not included
print(X)

['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9', 'Q10', 'Q11', 'Q12', 'Q13', 'Q14', 'Q15', 'Q16', 'Q17', 'Q18', 'Q19', 'Q20', 'Q21', 'Q22', 'Q23', 'Q24', 'Q25', 'Q26', 'Q27', 'Q28', 'Q29', 'Q30', 'Q31', 'Q32', 'Q33', 'Q34', 'Q35', 'Q36', 'Q37', 'Q38', 'Q39', 'Q40', 'Q41', 'Q42', 'Q43', 'Q44']


### Create and fit four separate $k$-nearest neighbors models: one with $k = 3$, one with $k = 5$, one with $k = 15$, and one with $k = 25$:

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

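# Create train/test split
# Note: this passes the full DataFrame, which still contains left_hand_dummy
# and the demographic/timing columns; df[features] would restrict it to Q1-Q44
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=42)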

# Define KNN models (3,5,15,25)
knn_3 = KNeighborsClassifier(n_neighbors=3)
knn_5 = KNeighborsClassifier(n_neighbors=5)
knn_15 = KNeighborsClassifier(n_neighbors=15)
knn_25 = KNeighborsClassifier(n_neighbors=25)

# Fit the models on the training data
knn_3.fit(X_train, y_train)
knn_5.fit(X_train, y_train)
knn_15.fit(X_train, y_train)
knn_25.fit(X_train, y_train)
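
If the intent is to model on the question columns only, one way to make sure neither the target nor the metadata columns reach the model is to index the DataFrame by the feature list before splitting. A minimal sketch on a made-up DataFrame with two question columns; `stratify=y` is an optional extra that keeps the class proportions similar in both splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df: two question columns, one metadata
# column, and the binary target
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Q1": rng.integers(1, 6, size=200),
    "Q2": rng.integers(1, 6, size=200),
    "age": rng.integers(13, 80, size=200),
    "left_hand_dummy": rng.integers(0, 2, size=200),
})

features = ["Q1", "Q2"]
X = demo[features]               # feature matrix: question columns only
y = demo["left_hand_dummy"]      # target kept out of X

# stratify=y keeps the class proportions similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(list(X_train.columns))  # ['Q1', 'Q2']
```

A quick check that the target column is absent from `X_train.columns` guards against leaking the answer into the features.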


### Evaluate your models:

Evaluate each of your four models on the training and testing sets, and interpret the four scores. Are any of your models overfit or underfit? Do any of your models beat the baseline accuracy rate?

In [None]:
print("KNN (k=3) score: {:.2f}".format(knn_3.score(X_test, y_test)))
print("KNN (k=5) score: {:.2f}".format(knn_5.score(X_test, y_test)))
print("KNN (k=15) score: {:.2f}".format(knn_15.score(X_test, y_test)))
print("KNN (k=25) score: {:.2f}".format(knn_25.score(X_test, y_test)))


KNN (k=3) score: 0.85
KNN (k=5) score: 0.87
KNN (k=15) score: 0.88
KNN (k=25) score: 0.88
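
With an imbalanced target, accuracy only means something next to the rate achieved by always predicting the majority class. The sketch below compares the test scores printed above with that rate, computed from the `hand` counts (452 of 4,184 left-handed). It assumes the test split has roughly the same class balance as the full dataset.

```python
# Test accuracies as printed above (rounded to two decimals)
knn_scores = {3: 0.85, 5: 0.87, 15: 0.88, 25: 0.88}

# Majority-class rate from the hand counts: 452 of 4,184 are left-handed.
# Assumption: the test split has roughly the same class balance.
baseline = 1 - 452 / 4184

for k, score in knn_scores.items():
    verdict = "beats" if score > baseline else "does not beat"
    print(f"k={k}: {score:.2f} {verdict} baseline {baseline:.3f}")
```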


---

## Step 4 & 5 Modeling: logistic regression

### Create and fit four separate logistic regression models: one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

Note: You can use the same train and test data as above.

In [None]:
from sklearn.linear_model import LogisticRegression
# Received warning to increase iterations
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
# Create four separate logistic regression models
lasso_1 = LogisticRegression(penalty='l1', solver='liblinear', C=1, max_iter=1000)
lasso_10 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1, max_iter=1000)
ridge_1 = LogisticRegression(penalty='l2', solver='liblinear', C=1, max_iter=1000)
ridge_10 = LogisticRegression(penalty='l2', solver='liblinear', C=0.1, max_iter=1000)


# Fit the models on the training data
lasso_1.fit(X_train, y_train)
lasso_10.fit(X_train, y_train)
ridge_1.fit(X_train, y_train)
ridge_10.fit(X_train, y_train)
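
One practical difference between the two penalties is that LASSO (L1) can set coefficients exactly to zero, and it does so more often as $\alpha$ grows. A rough sketch on synthetic data, using the same `C = 1 / alpha` conversion as the models above; the three $\alpha$ values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the lab's survey data)
X, y = make_classification(n_samples=500, n_features=20, n_informative=4,
                           random_state=0)

zero_counts = {}
for alpha in (1, 10, 100):
    model = LogisticRegression(penalty="l1", solver="liblinear",
                               C=1 / alpha, max_iter=1000).fit(X, y)
    # Count coefficients the L1 penalty has driven exactly to zero
    zero_counts[alpha] = int(np.sum(model.coef_ == 0))
    print(f"alpha={alpha:<4} zeroed coefficients: {zero_counts[alpha]} of 20")
```

Features whose coefficients survive a strong L1 penalty are the ones the model relies on most.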


### Evaluate your models:

Evaluate each of your four models on the training and testing sets, and interpret the four scores. Are any of your models overfit or underfit? Do any of your models beat the baseline accuracy rate?

In [None]:

# Score the Lasso and Ridge models 
lasso_1_score = lasso_1.score(X_test, y_test)
lasso_10_score = lasso_10.score(X_test, y_test)
ridge_1_score = ridge_1.score(X_test, y_test)
ridge_10_score = ridge_10.score(X_test, y_test)

# Print scores
print("Lasso (alpha=1) test score: {:.3f}".format(lasso_1_score))
print("Lasso (alpha=10) test score: {:.3f}".format(lasso_10_score))
print("Ridge (alpha=1) test score: {:.3f}".format(ridge_1_score))
print("Ridge (alpha=10) test score: {:.3f}".format(ridge_10_score))

Lasso (alpha=1) test score: 1.000
Lasso (alpha=10) test score: 1.000
Ridge (alpha=1) test score: 1.000
Ridge (alpha=10) test score: 1.000


---

## Step 6: Answer the problem.

Are any of your models worth moving forward with? What are the "best" models?

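All four Lasso and Ridge models score a perfect 1.000 on the test set. That is less a sign of overfitting than of target leakage: the train/test split was made on `df`, which still contains the `left_hand_dummy` column, so the models could read the answer directly from the features. I wouldn't move forward with any of these models. The KNN models scored best at $k = 15$ and $k = 25$ (0.88), but only 452 of the 4,184 respondents are left-handed, so always predicting "not left-handed" would already score about 0.89; none of the KNN models beats that baseline. The question "Are left-handed people more impulsive?" could not be answered convincingly with this set of questions, since many of the questions (features) seem irrelevant to handedness.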