##  4 : Predicting Left-Handedness from Psychological Factors

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

---
## Step 1: Define The Problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

Specifically, the professor says "I need to prove that left-handedness is caused by some personality trait. Go find that personality trait and the data to back it up."

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### 1. In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> Check out the codebook in the repo for some inspiration.

> You'll be asked to answer one of these questions later on, so make sure your questions are based on the data provided (specifically Q1 - Q44)!

Answer: 
   - Which question(s) among Q1-Q44 have the highest association with left-handedness?
   - As one of the questions gets more respondents, does it mean that those people are more likely to be left-handed?
   - How does the distribution of left-handed people vary with the responses to Q6?
    

---
## Step 2: Obtain the data.

### 2. Read in the file titled "data.csv."
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

> Notice that the data is separated by *tabs* (not commas, like most .csv files). Check out the parameters to see if there is anything that might help you parse this.

In [79]:
import pandas as pd
from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier

In [80]:
# Load the data (tab-separated, despite the .csv extension)
df = pd.read_csv('data.csv', delimiter='\t')


In [81]:
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3


### 3. Suppose that, instead of us giving you this data in a file, you were actually conducting a survey to gather this data yourself. From an ethics/privacy point of view, what are three things you might consider when attempting to gather this data?
> When working with sensitive data like sexual orientation or gender identity, we need to consider how this data could be used if it fell into the wrong hands!

Answer:

 - Collect sensitive personal information (such as gender identity) responsibly, or avoid collecting it if it is not critically needed.
 - Gather sensitive personal information anonymously.
 - Make sensitive personal questions optional for respondents in the survey questionnaire.


---
## Step 3: Explore the data.

### 4. Conduct exploratory data analysis on this dataset.
> If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

In [82]:
# Check the data types and missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4184 entries, 0 to 4183
Data columns (total 56 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Q1           4184 non-null   int64 
 1   Q2           4184 non-null   int64 
 2   Q3           4184 non-null   int64 
 3   Q4           4184 non-null   int64 
 4   Q5           4184 non-null   int64 
 5   Q6           4184 non-null   int64 
 6   Q7           4184 non-null   int64 
 7   Q8           4184 non-null   int64 
 8   Q9           4184 non-null   int64 
 9   Q10          4184 non-null   int64 
 10  Q11          4184 non-null   int64 
 11  Q12          4184 non-null   int64 
 12  Q13          4184 non-null   int64 
 13  Q14          4184 non-null   int64 
 14  Q15          4184 non-null   int64 
 15  Q16          4184 non-null   int64 
 16  Q17          4184 non-null   int64 
 17  Q18          4184 non-null   int64 
 18  Q19          4184 non-null   int64 
 19  Q20          4184 non-null 

In [83]:
df.dtypes

Q1              int64
Q2              int64
Q3              int64
Q4              int64
Q5              int64
Q6              int64
Q7              int64
Q8              int64
Q9              int64
Q10             int64
Q11             int64
Q12             int64
Q13             int64
Q14             int64
Q15             int64
Q16             int64
Q17             int64
Q18             int64
Q19             int64
Q20             int64
Q21             int64
Q22             int64
Q23             int64
Q24             int64
Q25             int64
Q26             int64
Q27             int64
Q28             int64
Q29             int64
Q30             int64
Q31             int64
Q32             int64
Q33             int64
Q34             int64
Q35             int64
Q36             int64
Q37             int64
Q38             int64
Q39             int64
Q40             int64
Q41             int64
Q42             int64
Q43             int64
Q44             int64
introelapse     int64
testelapse

In [84]:
# Generate general descriptive statistics (the count row also reveals missing values)
df.describe()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,testelapse,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
count,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,...,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0
mean,1.962715,3.829589,2.846558,3.186902,2.86544,3.672084,3.216539,3.184512,2.761233,3.522945,...,479.994503,1.576243,1.239962,30.370698,2.317878,1.654398,1.833413,5.013623,2.394359,1.190966
std,1.360291,1.551683,1.664804,1.476879,1.545798,1.342238,1.490733,1.387382,1.511805,1.24289,...,3142.178542,0.494212,0.440882,367.201726,0.874264,0.640915,1.303454,1.970996,2.184164,0.495357
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,7.0,1.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,3.0,1.0,2.0,1.0,3.0,2.0,2.0,1.0,3.0,...,186.0,1.0,1.0,18.0,2.0,1.0,1.0,5.0,1.0,1.0
50%,1.0,5.0,3.0,3.0,3.0,4.0,3.0,3.0,3.0,4.0,...,242.0,2.0,1.0,21.0,2.0,2.0,1.0,6.0,2.0,1.0
75%,3.0,5.0,5.0,5.0,4.0,5.0,5.0,4.0,4.0,5.0,...,324.25,2.0,1.0,27.0,3.0,2.0,2.0,6.0,2.0,1.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,119834.0,2.0,2.0,23763.0,4.0,3.0,5.0,7.0,7.0,3.0


In [85]:
# Check whether there are any missing values
df.isnull().sum()

Q1             0
Q2             0
Q3             0
Q4             0
Q5             0
Q6             0
Q7             0
Q8             0
Q9             0
Q10            0
Q11            0
Q12            0
Q13            0
Q14            0
Q15            0
Q16            0
Q17            0
Q18            0
Q19            0
Q20            0
Q21            0
Q22            0
Q23            0
Q24            0
Q25            0
Q26            0
Q27            0
Q28            0
Q29            0
Q30            0
Q31            0
Q32            0
Q33            0
Q34            0
Q35            0
Q36            0
Q37            0
Q38            0
Q39            0
Q40            0
Q41            0
Q42            0
Q43            0
Q44            0
introelapse    0
testelapse     0
country        0
fromgoogle     0
engnat         0
age            0
education      0
gender         0
orientation    0
race           0
religion       0
hand           0
dtype: int64
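
The summary statistics above show a minimum of 0 in the question columns, although the questions are answered on a 1 to 5 scale, and a maximum `age` of 23763. A minimal sketch of counting such values, using a small made-up frame (the `toy` data and the age cutoff of 100 are assumptions for illustration, not part of the lab data):

```python
import pandas as pd

# Toy stand-in for the survey data; all values are invented for illustration.
toy = pd.DataFrame({
    "Q1": [4, 0, 1, 5],
    "Q2": [1, 5, 0, 2],
    "age": [22, 14, 23763, 30],
})

question_cols = ["Q1", "Q2"]

# Count responses coded 0 (outside the 1-5 scale) in each question column.
zero_counts = (toy[question_cols] == 0).sum()

# Flag ages above an assumed plausibility cutoff.
implausible_age = toy[toy["age"] > 100]

print(zero_counts)
print(implausible_age)
```

On the real data, the same two checks could be run on `df` with the full list of question columns.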

---
## Step 4: Model the data.

### 5. Suppose I wanted to use Q1 - Q44 to predict whether or not the person is left-handed. Would this be a classification or regression problem? Why?

Answer:
 - This would be a classification problem because the outcome is either left-handed or not, which means it is discrete.

### 6. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed based on their responses to Q1 - Q44. Before doing that, however, you remember that it is often a good idea to standardize your variables. In general, why would we standardize our variables? Give an example of when we would standardize our variables.

Answer:
 - We standardize variables when the features in the input data set have large differences between their ranges, or when they are measured in different units. Standardizing rescales each feature to have a mean of 0 and a standard deviation of 1. This matters for distance-based methods such as $k$-NN, where features with larger ranges would otherwise dominate the distance. For example, if we need to predict purchasing capacity based on country, age, and salary information, we would standardize because the input features have different units (e.g., age in years and salary in USD).
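
As a small illustration of standardization, here is a sketch using scikit-learn's `StandardScaler` on invented age and salary values (the numbers are assumptions chosen only to show two very different ranges):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented example: age in years and salary in dollars.
features = np.array([
    [25, 40_000],
    [35, 60_000],
    [45, 80_000],
    [55, 100_000],
], dtype=float)

scaler = StandardScaler()
scaled = scaler.fit_transform(features)

# After scaling, each column has mean 0 and standard deviation 1.
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

In a modeling workflow, the scaler would normally be fit on the training set only and then applied to the test set.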

### 7. Give an example of when we might not standardize our variables.

Answer:
- We might not need to standardize our variables when they are already on the same scale.

### 8. Based on your answers to 6 and 7, do you think we should standardize our predictor variables in this case (remember we're only using Q1 - Q44 as predictor variables)? Why or why not?

Answer:
The `Q1`-`Q44` predictors are all on the same scale (1 to 5), so they do not need to be standardized. Leaving them unscaled also preserves the interpretability of a one-unit increase in the response to an individual question.

### 9. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed. What munging/cleaning do we need to do to our $y$ variable in order to explicitly answer this question? Do it.

> Note: Think critically about how to clean the $y$ variable based on your problem statement.
   
   > Be sure to provide some explanation/justification for your choice.

Answer: 

In [86]:
# Check the hand data
df['hand'].value_counts()

1    3542
2     452
3     179
0      11
Name: hand, dtype: int64

- The variable `hand` has four categories (1, 2, 3, 0). Because the aim is to predict whether a person is left-handed or not, the target needs only two categories, 1 and 0.
  
- Let us keep 0 as 0 and map 1, 2 and 3 to 1.

In [87]:
# Keep 0 as 0; map 1, 2 and 3 to 1
df['hand_category'] = [1 if i !=0 else 0 for i in df['hand']]



In [88]:
df['hand_category'].value_counts()

1    4173
0      11
Name: hand_category, dtype: int64
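
The counts above leave only 11 rows in class 0. As a hedged alternative, assuming the codebook codes `hand` as 1 = right, 2 = left, 3 = both and 0 = no answer (an assumption to verify against the codebook), a target that isolates left-handed respondents could be sketched as follows. The series is rebuilt from the `hand` value counts printed earlier, and `is_left` is an illustrative name:

```python
import pandas as pd

# Rebuild the hand column from the value counts shown earlier in the lab.
hand = pd.Series([1] * 3542 + [2] * 452 + [3] * 179 + [0] * 11, name="hand")

# Assumed coding: 2 = left-handed; everything else is treated as "not left".
is_left = (hand == 2).astype(int)

print(is_left.value_counts())
```

Whether ambidextrous respondents (code 3 under this assumption) and non-responses belong in the "not left" class, or should be dropped, is a judgment call that depends on the problem statement.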

In [89]:
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand,hand_category
0,4,1,5,1,5,1,5,1,4,1,...,2,1,22,3,1,1,3,2,3,1
1,1,5,1,4,2,5,5,4,1,5,...,2,1,14,1,2,2,6,1,1,1
2,1,2,1,1,5,4,3,2,1,4,...,2,2,30,4,1,1,1,1,2,1
3,1,4,1,5,1,4,5,4,3,5,...,2,1,18,2,2,5,3,2,2,1
4,5,1,5,1,5,1,5,1,3,1,...,2,1,22,3,1,1,3,2,3,1


### 10. The professor for whom you work suggests that you set $k = 4$. In this specific case, why might this be a bad idea?

Answer: Setting $k = 4$ may lead to ties: with two classes, two of the four neighbors could vote left-handed and two right-handed, making both outcomes equally likely. So it is preferable to avoid an even value of $k$ when predicting a binary outcome.
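
To make the tie problem concrete, here is a minimal sketch of majority voting over neighbor labels. The helper `majority_vote` is a hypothetical illustration, not scikit-learn's implementation, and real libraries have their own tie-breaking rules:

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return (winning_label, is_tie) for a list of neighbor class labels."""
    counts = Counter(neighbor_labels).most_common()
    top_label, top_count = counts[0]
    # A tie occurs when the runner-up has as many votes as the winner.
    is_tie = len(counts) > 1 and counts[1][1] == top_count
    return top_label, is_tie

# k = 4 with two classes can split 2-2; k = 3 cannot tie with two classes.
even_k_result = majority_vote([1, 1, 0, 0])
odd_k_result = majority_vote([1, 1, 0])

print(even_k_result)
print(odd_k_result)
```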

### 11. Let's *(finally)* use $k$-nearest neighbors to predict whether or not a person is left-handed!

> Be sure to create a train/test split with your data!

> Create four separate models, one with $k = 3$, one with $k = 5$, one with $k = 15$, and one with $k = 25$.

> Instantiate and fit your models.

In [90]:
# Extract features (the first 40 columns, Q1-Q40) for X and hand_category for y
X = df.iloc[:, :40]

y = df['hand_category']

In [91]:
X.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,Q31,Q32,Q33,Q34,Q35,Q36,Q37,Q38,Q39,Q40
0,4,1,5,1,5,1,5,1,4,1,...,1,1,1,5,5,1,1,1,5,5
1,1,5,1,4,2,5,5,4,1,5,...,1,5,2,4,4,4,4,4,1,3
2,1,2,1,1,5,4,3,2,1,4,...,3,3,4,4,2,2,4,2,1,4
3,1,4,1,5,1,4,5,4,3,5,...,4,1,3,5,5,1,3,4,1,2
4,5,1,5,1,5,1,5,1,3,1,...,5,1,5,5,5,1,1,1,5,5
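
The positional slice above stops at `Q40`, while the lab asks for Q1 - Q44. A sketch of selecting the question columns by name instead, shown on a toy frame (the `toy` frame and `X_all` name are assumptions for illustration; the recorded results in this lab were produced with the 40-column `X`):

```python
import pandas as pd

# Toy frame with question columns Q1..Q44 followed by two other columns.
columns = [f"Q{i}" for i in range(1, 45)] + ["introelapse", "hand"]
toy = pd.DataFrame([[3] * len(columns)], columns=columns)

# Select by name so the selection does not depend on column positions.
question_cols = [f"Q{i}" for i in range(1, 45)]
X_all = toy[question_cols]

print(X_all.shape)
```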


In [92]:
# Create train and test data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [93]:
# Create four separate models, with k = 3, k = 5, k = 15 and k = 25
k3_model = KNeighborsClassifier(n_neighbors=3)
k3_model.fit(X_train, y_train)

k5_model = KNeighborsClassifier(n_neighbors=5)
k5_model.fit(X_train, y_train)

k15_model = KNeighborsClassifier(n_neighbors=15)
k15_model.fit(X_train, y_train)

k25_model = KNeighborsClassifier(n_neighbors=25)
k25_model.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=25)

Being good data scientists, we know that we might not run just one type of model. We might run many different models and see which is best.

### 12. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, let's check the [documentation for logistic regression in sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Is there default regularization? If so, what is it? If not, how do you know?

Answer: 
There is default regularization! 
- `penalty = 'l2'` indicates the L2 or Ridge penalty.
- `C = 1.0` indicates that the inverse of regularization strength is 1. Note that $C = \frac{1}{\alpha} \Rightarrow 1 = \frac{1}{\alpha} \Rightarrow \alpha = 1$.

The loss function would include $\alpha\sum_{i=1}^p \hat{\beta}_i^2$ as a penalty, where $\alpha = 1$.

### 13. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, should we standardize our features?

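Answer: Yes, in general. Logistic regression in sklearn applies regularization by default (`penalty = 'l2'`, the L2 or Ridge penalty), and a penalty treats all coefficients alike only when the features are on the same scale. In this case Q1 - Q44 already share the same 1 to 5 scale, so no additional standardization is needed.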

### 14. Let's use logistic regression to predict whether or not the person is left-handed.


> Be sure to use the same train/test split with your data as with your $k$-NN model above!

> Create four separate logistic regression models, one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

> Instantiate and fit your models.

In [94]:
from sklearn.linear_model import LogisticRegression


In [95]:
# Create separate logistic regression models with alpha = 1 and alpha = 10 (C = 1 / alpha)
LASSO_1 = LogisticRegression(penalty = 'l2', C = 1.0, solver='liblinear')
LASSO_1.fit(X_train, y_train)

LASSO_10 = LogisticRegression(penalty = 'l2', C = 0.1, solver='liblinear')
LASSO_10.fit(X_train, y_train)

Ridge_1 = LogisticRegression(penalty = 'l2', C = 1.0, solver='liblinear')
Ridge_1.fit(X_train, y_train)

Ridge_10 = LogisticRegression(penalty = 'l2', C = 0.1,solver='liblinear')
Ridge_10.fit(X_train, y_train)

LogisticRegression(C=0.1, solver='liblinear')
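
Every model in the cell above is created with `penalty = 'l2'`, which is why the `LASSO_*` and `Ridge_*` coefficient arrays printed in this section are identical. In scikit-learn, LASSO corresponds to `penalty='l1'`, which the `liblinear` solver supports. A sketch of the intended four configurations follows, fitted on synthetic data; `make_model`, `X_demo` and `y_demo` are hypothetical names, and the results will differ from the recorded outputs:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data stands in for the survey features and target.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

def make_model(penalty, alpha):
    """Build a logistic regression with C set to the inverse of alpha."""
    return LogisticRegression(penalty=penalty, C=1.0 / alpha, solver="liblinear")

models = {
    "lasso_alpha_1": make_model("l1", 1),
    "lasso_alpha_10": make_model("l1", 10),
    "ridge_alpha_1": make_model("l2", 1),
    "ridge_alpha_10": make_model("l2", 10),
}

for name, model in models.items():
    model.fit(X_demo, y_demo)
    print(name, model.penalty, model.C)
```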

In [96]:
# Look at the coefficients for alpha = 1
LASSO_1.coef_

array([[-0.32293604,  0.02945205,  0.23377735,  0.01272328, -0.09852271,
         0.05477166, -0.12335061,  0.38489623,  0.03142363, -0.06710267,
         0.27749186,  0.26257252, -0.00110213, -0.22574504, -0.06215569,
        -0.26517928, -0.04251817,  0.14862419, -0.05400182,  0.17031888,
         0.42778387, -0.04406387,  0.40404285,  0.11451983,  0.06681423,
        -0.11254221,  0.81629147, -0.10389965,  0.38183553, -0.21355201,
        -0.15832209, -0.06165789, -0.25792209,  0.49373811, -0.00331766,
         0.52291343,  0.5232108 , -0.07286639, -0.74000514, -0.06996833]])

In [97]:
# Look at the coefficients for alpha = 10
LASSO_10.coef_

array([[-0.13810637,  0.02036947,  0.10757276,  0.0054843 , -0.06423386,
         0.08777703, -0.00792648,  0.17988053,  0.06495145,  0.03613372,
         0.21447323,  0.17301221,  0.03743953, -0.11407916, -0.04163679,
        -0.12765837, -0.02896922,  0.08496414, -0.0082932 ,  0.10946702,
         0.21694251,  0.05784412,  0.25839327,  0.09501031,  0.05446041,
        -0.04002524,  0.42744958, -0.02211373,  0.22236734, -0.08499255,
        -0.04227952, -0.04901205, -0.08965768,  0.29825841, -0.0275634 ,
         0.2870823 ,  0.28874343, -0.0074286 , -0.38004349, -0.03410301]])

In [98]:
# Look at the coefficients for alpha = 1
Ridge_1.coef_

array([[-0.32293604,  0.02945205,  0.23377735,  0.01272328, -0.09852271,
         0.05477166, -0.12335061,  0.38489623,  0.03142363, -0.06710267,
         0.27749186,  0.26257252, -0.00110213, -0.22574504, -0.06215569,
        -0.26517928, -0.04251817,  0.14862419, -0.05400182,  0.17031888,
         0.42778387, -0.04406387,  0.40404285,  0.11451983,  0.06681423,
        -0.11254221,  0.81629147, -0.10389965,  0.38183553, -0.21355201,
        -0.15832209, -0.06165789, -0.25792209,  0.49373811, -0.00331766,
         0.52291343,  0.5232108 , -0.07286639, -0.74000514, -0.06996833]])

In [99]:
# Look at the coefficients for alpha = 10
Ridge_10.coef_

array([[-0.13810637,  0.02036947,  0.10757276,  0.0054843 , -0.06423386,
         0.08777703, -0.00792648,  0.17988053,  0.06495145,  0.03613372,
         0.21447323,  0.17301221,  0.03743953, -0.11407916, -0.04163679,
        -0.12765837, -0.02896922,  0.08496414, -0.0082932 ,  0.10946702,
         0.21694251,  0.05784412,  0.25839327,  0.09501031,  0.05446041,
        -0.04002524,  0.42744958, -0.02211373,  0.22236734, -0.08499255,
        -0.04227952, -0.04901205, -0.08965768,  0.29825841, -0.0275634 ,
         0.2870823 ,  0.28874343, -0.0074286 , -0.38004349, -0.03410301]])

---
## Step 5: Evaluate the model(s).

### 15. Before calculating any score on your data, take a step back. Think about your $X$ variable and your $Y$ variable. Do you think your $X$ variables will do a good job of predicting your $Y$ variable? Why or why not? What impact do you think this will have on your scores?

> For this question, consider your own thoughts (or research) on the relationship between psychological factors and handedness.

> When evaluating your models later on, consider whether a high score always means the variables are good predictors. Is this always the case?

Answer: I don't think the X variables (psychological factors) will do a good job of predicting y (handedness).
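
One way to put any accuracy score in context is the majority-class baseline: the accuracy obtained by always predicting the most common class. A quick sketch using the class counts reported by `df['hand_category'].value_counts()` earlier in the lab:

```python
# Class counts taken from the hand_category value counts earlier in the lab.
n_class_1 = 4173
n_class_0 = 11

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = n_class_1 / (n_class_1 + n_class_0)

print(round(baseline_accuracy, 4))
```

A model whose accuracy is close to this baseline has learned little beyond the class proportions, however high the score looks.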

### 16. Using accuracy as your metric, evaluate all eight of your models on both the training and testing sets. Put your scores below. (If you want to be fancy and generate a table in Markdown, there's a [Markdown table generator site linked here](https://www.tablesgenerator.com/markdown_tables#).)
- Note: Your answers here might look a little weird. You didn't do anything wrong; that's to be expected!

In [100]:
print("k-NN training accuracy k = 3 : " + str(k3_model.score(X_train, y_train)))
print("k-NN testing accuracy  k = 3 : " + str(k3_model.score(X_test, y_test)))

print("k-NN training accuracy k = 5 : " + str(k5_model.score(X_train, y_train)))
print("k-NN testing accuracy  k = 5 : " + str(k5_model.score(X_test, y_test)))

print("k-NN training accuracy k =15 : " + str(k15_model.score(X_train, y_train)))
print("k-NN testing accuracy  k =15 : " + str(k15_model.score(X_test, y_test)))

print("k-NN training accuracy k =25 : " + str(k25_model.score(X_train, y_train)))
print("k-NN testing accuracy  k =25 : " + str(k25_model.score(X_test, y_test)))


print("Log.Regression training accuracy LASSO penalty, 𝛼=1: " + str(LASSO_1.score(X_train, y_train)))
print("Log.Regression testing accuracy LASSO penalty, 𝛼=1 : " + str(LASSO_1.score(X_test, y_test)))

print("Log.Regression training accuracy LASSO penalty, 𝛼=10: " + str(LASSO_10.score(X_train, y_train)))
print("Log.Regression testing accuracy LASSO penalty, 𝛼=10 : " + str(LASSO_10.score(X_test, y_test)))


print("Log.Regression training accuracy Ridge penalty, 𝛼=1 : " + str(Ridge_1.score(X_train, y_train)))
print("Log.Regression testing accuracy Ridge penalty, 𝛼=1 : " + str(Ridge_1.score(X_test, y_test)))

print("Log.Regression training accuracy Ridge penalty, 𝛼=10 : " + str(Ridge_10.score(X_train, y_train)))
print("Log.Regression testing accuracy  Ridge penalty, 𝛼=10 : " + str(Ridge_10.score(X_test, y_test)))

k-NN training accuracy k = 3 : 0.9971459150909739
k-NN testing accuracy  k = 3 : 0.997827661115134
k-NN training accuracy k = 5 : 0.9971459150909739
k-NN testing accuracy  k = 5 : 0.997827661115134
k-NN training accuracy k =15 : 0.9971459150909739
k-NN testing accuracy  k =15 : 0.997827661115134
k-NN training accuracy k =25 : 0.9971459150909739
k-NN testing accuracy  k =25 : 0.997827661115134
Log.Regression training accuracy LASSO penalty, 𝛼=1: 0.9971459150909739
Log.Regression testing accuracy LASSO penalty, 𝛼=1 : 0.99637943519189
Log.Regression training accuracy LASSO penalty, 𝛼=10: 0.9971459150909739
Log.Regression testing accuracy LASSO penalty, 𝛼=10 : 0.99637943519189
Log.Regression training accuracy Ridge penalty, 𝛼=1 : 0.9971459150909739
Log.Regression testing accuracy Ridge penalty, 𝛼=1 : 0.99637943519189
Log.Regression training accuracy Ridge penalty, 𝛼=10 : 0.9971459150909739
Log.Regression testing accuracy  Ridge penalty, 𝛼=10 : 0.99637943519189
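
The repeated `print` calls above can also be collected into a single table. A sketch of that pattern on synthetic data (the `candidates` dictionary and the `X_demo`/`y_demo` data are assumptions for illustration; with the lab's fitted models, the same loop would iterate over them without refitting):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data stands in for the survey features and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

candidates = {
    "knn_k3": KNeighborsClassifier(n_neighbors=3),
    "knn_k15": KNeighborsClassifier(n_neighbors=15),
    "logreg_C1": LogisticRegression(C=1.0, max_iter=1000),
}

rows = []
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    rows.append({
        "model": name,
        "train_accuracy": model.score(X_tr, y_tr),
        "test_accuracy": model.score(X_te, y_te),
    })

# One row per model, one column per data split.
scores = pd.DataFrame(rows).set_index("model")
print(scores)
```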


### 17. In which of your $k$-NN models is there evidence of overfitting? How do you know?

Answer: Overfitting shows up as training accuracy that is noticeably better than testing accuracy. For all four k-NN models the training accuracy (0.9971) is not better than the testing accuracy (0.9978), so there is no evidence of overfitting.

### 18. Broadly speaking, how does the value of $k$ in $k$-NN affect the bias-variance tradeoff? (i.e. As $k$ increases, how are bias and variance affected?)

Answer:
- As $k$ increases, bias increases and variance decreases.
- As $k$ decreases, bias decreases and variance increases.
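
A small sketch of this tradeoff on synthetic, noisy data (the data set and the chosen values of `k` are assumptions for illustration). With `k = 1` each training point is its own nearest neighbor, so the training accuracy is perfect, which is the high-variance extreme:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with some label noise (flip_y) to make overfitting visible.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)

train_scores = {}
test_scores = {}
for k in (1, 5, 25):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    train_scores[k] = knn.score(X_tr, y_tr)
    test_scores[k] = knn.score(X_te, y_te)
    print(k, train_scores[k], test_scores[k])
```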

### 19. If you have a $k$-NN model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

Answer: Increase $k$ to decrease variance, reduce the number of predictors, or use a less flexible model than k-NN.

### 20. In which of your logistic regression models is there evidence of overfitting? How do you know?

Answer:
All the LASSO and Ridge models have slightly better training accuracy (0.9971) than testing accuracy (0.9964), so there might be very slight overfitting. For these results, however, I would say there is no meaningful overfitting, since the difference between the train and test scores is well under 2% for every model.


### 21. Broadly speaking, how does the value of $C$ in logistic regression affect the bias-variance tradeoff? (i.e. As $C$ increases, how are bias and variance affected?)

Answer:

- As $C$ increases, regularization becomes weaker. Less regularization increases variance and decreases bias.

- As $C$ decreases, regularization becomes stronger. More regularization decreases variance and increases bias.
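
The effect of $C$ on the fitted coefficients can be sketched on synthetic data by comparing the size of the coefficient vector across a few values (the data and the values of `C` are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data stands in for the survey features and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    # Euclidean length of the coefficient vector for this value of C.
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(C, coef_norms[C])
```

Smaller values of `C` pull the coefficients toward zero, which is the mechanism behind the lower variance and higher bias.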

### 22. For your logistic regression models, play around with the regularization hyperparameter, $C$. As you vary $C$, what happens to the fit and coefficients in the model? What do you think this means in the context of this specific problem?

Answer:

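In this case, lowering the regularization hyperparameter $C$ from 1.0 to 0.1 shrinks the coefficients toward zero (for example, the `Q1` coefficient moves from about -0.32 to about -0.14), but the training and testing accuracy do not change. This may be an indicator that the X variables are probably not good predictors of y (handedness).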

### 23. If you have a logistic regression model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

Answer:

- Remove or combine features that are not relevant.

- Increase regularization.

- Get more data (evidence).

---
## Step 6: Answer the problem.

### 24. Suppose you want to understand which psychological features are most important in determining left-handedness. Would you rather use $k$-NN or logistic regression? Why?

Answer:

I prefer logistic regression, as it estimates coefficients that indicate the effect of a one-unit change in each psychological factor on handedness.

### 25. Select your logistic regression model that utilized LASSO regularization with $\alpha = 1$. Interpret the coefficient for `Q1`.

In [101]:
# Evaluate the coefficient
LASSO_1.coef_

array([[-0.32293604,  0.02945205,  0.23377735,  0.01272328, -0.09852271,
         0.05477166, -0.12335061,  0.38489623,  0.03142363, -0.06710267,
         0.27749186,  0.26257252, -0.00110213, -0.22574504, -0.06215569,
        -0.26517928, -0.04251817,  0.14862419, -0.05400182,  0.17031888,
         0.42778387, -0.04406387,  0.40404285,  0.11451983,  0.06681423,
        -0.11254221,  0.81629147, -0.10389965,  0.38183553, -0.21355201,
        -0.15832209, -0.06165789, -0.25792209,  0.49373811, -0.00331766,
         0.52291343,  0.5232108 , -0.07286639, -0.74000514, -0.06996833]])

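Answer: 
The LASSO models perform more or less the same, so I picked two coefficients to present as an example. The coefficient for `Q1` is -0.32294: holding the other responses constant, a one-unit increase in the `Q1` response is associated with a decrease of about 0.32 in the log-odds of `hand_category = 1`. For comparison, the coefficient for `Q40` is -0.069968.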
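
Logistic regression coefficients are on the log-odds scale, so exponentiating one gives the multiplicative change in the odds for a one-unit increase in that feature. A short sketch using the `Q1` coefficient copied from the output above:

```python
import math

# Q1 coefficient copied from the LASSO_1.coef_ output above.
q1_coef = -0.32293604

# exp(coefficient) is the multiplicative change in odds per one-unit increase.
q1_odds_ratio = math.exp(q1_coef)

print(round(q1_odds_ratio, 3))
```

An odds ratio below 1 means the odds of `hand_category = 1` fall as the `Q1` response rises, holding the other responses constant.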

### 26. If you have to select one model overall to be your *best* model, which model would you select? Why?
- Usually in the "real world," you'll fit many types of models but ultimately need to pick only one! (For example, a client may not understand what it means to have multiple models, or if you're using an algorithm to make a decision, it's probably pretty challenging to use two or more algorithms simultaneously.) It's not always an easy choice, but you'll have to make it soon enough. Pick a model and defend why you picked this model!

Answer:

Of the models fit in this lab, I would select the k-NN models, as I do not see any evidence of overfitting in them compared with the other models.

### 27. Circle back to the three specific and conclusively answerable questions you came up with in Q1. Answer one of these for the professor based on the model you selected!

In [102]:
pd.pivot_table(df[['Q1', 'hand', 'religion']], index = 'hand', columns = 'Q1', aggfunc = 'count')

Unnamed: 0_level_0,religion,religion,religion,religion,religion,religion
Q1,0,1,2,3,4,5
hand,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
0,,6.0,2.0,1.0,,2.0
1,2.0,2159.0,332.0,408.0,377.0,264.0
2,2.0,277.0,37.0,50.0,58.0,28.0
3,,79.0,14.0,35.0,26.0,25.0


Answer:
These were the questions:
- Which question(s) among Q1-Q44 have the highest association with left-handedness? Logistic regression and k-NN are both reasonable choices for predicting left-handedness.
- As one of the questions gets more respondents, does it mean that those people are more likely to be left-handed? There may be only a few left-handed people in the data.
- How does the distribution of left-handed people vary with the responses to Q6? Based on the `LASSO_1` model, the coefficient for Q6 is about 0.055, so each one-unit increase in the Q6 response raises the log-odds of `hand_category = 1` by about 0.055.