## Lab 6: Classification Models

The purpose of this lab is to use scikit-learn to build classification models, that is, models that predict the value of a categorical variable.



### I. The Titanic Data Set

If you are unfamiliar with the Titanic disaster or need a refresher, I recommend reading https://en.wikipedia.org/wiki/Sinking_of_the_Titanic before starting this lab.

For this lab we will analyze a well-known data set that contains data on the passengers of the Titanic. You can read about the data set (and the famous Kaggle machine learning competition that uses it) here: https://www.kaggle.com/competitions/titanic/data.

The data set itself is split into two sub-data sets: ```titanic_train``` and ```titanic_test```. We will build our models by fitting them to ```titanic_train```. It is common practice to test a model's performance on the data that it hasn't seen (```titanic_test```) but we won't do so in this lab.



#### Preprocessing.

Before uploading ```titanic_train```, open and view the data in Excel. You will notice that missing data is a real problem with this data set, which means that we will have to do more preprocessing than usual. There are two passengers for whom the ```Embarked``` variable is missing. Do an internet search to determine what these values should be, enter them into the raw data set in Excel, and save the modified ```titanic_train.csv``` file on your local machine.

In [None]:
# Import the titanic_train dataset
from google.colab import files

uploaded = files.upload()

Saving titanic_train.csv to titanic_train.csv


In [None]:
# View dataset
import pandas as pd

titanic_train = pd.read_csv("titanic_train.csv")
titanic_train.head()

Unnamed: 0,Name,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,"Braund, Mr. Owen Harris",0,3,male,22.0,1,0,A/5 21171,7.25,,S
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
2,"Heikkinen, Miss. Laina",1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,1,female,35.0,1,0,113803,53.1,C123,S
4,"Allen, Mr. William Henry",0,3,male,35.0,0,0,373450,8.05,,S


**Exercise 1**. At how many different ports did the Titanic's passengers embark?

In [None]:
#S,Q,C
#3

We will use the ```Pclass```, ```Sex```, ```Age```, ```SibSp```, ```Parch```, ```Fare```, and ```Embarked``` variables to predict ```Survived```. In order to make the categorical variables ```Pclass```, ```Sex```, and ```Embarked``` viable model inputs we need to represent them using numbers. I have included a block of code that will do this for you below. Run it and read this blog post: https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/.

In [None]:
# Preprocess the data set: Do not change this code!

from sklearn.preprocessing import OneHotEncoder

df = titanic_train.drop(["Name","Ticket", "Cabin"], axis = 1)
encoder = OneHotEncoder(handle_unknown='ignore')
encoder_df = pd.DataFrame(encoder.fit_transform(df[['Pclass', 'Embarked']]).toarray())
final_df = df.join(encoder_df)
final_df = final_df.drop(["Pclass", "Embarked"], axis = 1)
final_df['Sex'] = final_df['Sex'].map({'male': 0, 'female': 1})  # assign back instead of replacing in place on a column selection
final_df.columns = ["Survived", "Sex", "Age", "SibSp", "Parch", "Fare", "Pclass1", "Pclass2", "Pclass3", "EmbarkC", "EmbarkQ", "EmbarkS"]
titanic_train = final_df # titanic_train is still a pandas dataframe
titanic_train.head()

Unnamed: 0,Survived,Sex,Age,SibSp,Parch,Fare,Pclass1,Pclass2,Pclass3,EmbarkC,EmbarkQ,EmbarkS
0,0,0,22.0,1,0,7.25,0.0,0.0,1.0,0.0,0.0,1.0
1,1,1,38.0,1,0,71.2833,1.0,0.0,0.0,1.0,0.0,0.0
2,1,1,26.0,0,0,7.925,0.0,0.0,1.0,0.0,0.0,1.0
3,1,1,35.0,1,0,53.1,1.0,0.0,0.0,0.0,0.0,1.0
4,0,0,35.0,0,0,8.05,0.0,0.0,1.0,0.0,0.0,1.0
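To see what the encoder does in isolation, here is a minimal sketch on a made-up four-row column. The names `toy`, `toy_encoder`, and `toy_encoded` are hypothetical and the data is not taken from ```titanic_train.csv```.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, not taken from titanic_train.csv
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

toy_encoder = OneHotEncoder(handle_unknown="ignore")
toy_encoded = toy_encoder.fit_transform(toy[["Embarked"]]).toarray()

# Categories are sorted, so the three output columns correspond to C, Q, S
print(toy_encoder.categories_[0])

# Each row has a single 1 in the column of its category
print(toy_encoded)
```

Each category gets its own 0/1 column, which is why ```Pclass``` and ```Embarked``` each turn into three columns in the table above.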


**Exercise 2**. What is the type of encoding that was used to express the ```Pclass``` and ```Embarked``` variables as numbers? What is the logic for doing this?

In [None]:
# Construct the design matrix X and y vector here.
# Column 0 is Survived (the response); columns 1 through 11 are the predictors.
X = titanic_train.to_numpy()[:, 1:12]
y = titanic_train.to_numpy()[:, 0]

rows, columns = X.shape

print("The number of rows in X is:", rows)
print("The number of columns in X is:", columns)
print(X)
print()
print(y)


# The type of encoding used in this code is one-hot encoding.
#
# One-hot encoding converts a categorical variable into a numerical representation
# by creating one binary (dummy) column per category. Here 'Pclass' and 'Embarked'
# each become three 0/1 columns. The logic is that integer codes such as 1, 2, 3
# would impose an order and spacing on the categories that the model would treat
# as meaningful.

The number of rows in X is: 891
The number of columns in X is: 11
[[ 0. 22.  1. ...  0.  0.  1.]
 [ 1. 38.  1. ...  1.  0.  0.]
 [ 1. 26.  0. ...  0.  0.  1.]
 ...
 [ 1. nan  1. ...  0.  0.  1.]
 [ 0. 26.  0. ...  1.  0.  0.]
 [ 0. 32.  0. ...  0.  1.  0.]]

[0. 1. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 1. 0. 1. 0. 1. 0. 1. 1. 1.
 0. 1. 0. 0. 1. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 1.
 0. 0. 0. 0. 1. 1. 0. 1. 1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0.
 0. 0. 1. 0. 0. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 1. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0.
 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1. 0.
 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 1. 0. 0. 1. 0.
 1. 1. 1. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 1. 0. 1. 0. 0. 0. 1.
 1. 0. 1. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0.
 0. 1. 0. 0

We dropped the ```Ticket``` and ```Cabin``` variables because there were so many missing values. We are using the ```Age``` variable in our design matrix but it is also missing values. In class we have:

1. Replaced missing values with column means.

2. Dropped rows with missing values.

A third option is to use regression to estimate the missing ```Age``` values from the other variables. For this data set we will use a method called K-Nearest-Neighbors (KNN), which estimates each missing value as the average of the ```Age``` variable in the *k* closest vectors. This can be seen as a more refined (though more computationally costly) version of option 1 above.

In [None]:
# Fill in missing age values using KNN- just run this code.

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors = 5)
X = imputer.fit_transform(X)
print(X)

[[ 0.  22.   1.  ...  0.   0.   1. ]
 [ 1.  38.   1.  ...  1.   0.   0. ]
 [ 1.  26.   0.  ...  0.   0.   1. ]
 ...
 [ 1.  16.8  1.  ...  0.   0.   1. ]
 [ 0.  26.   0.  ...  1.   0.   0. ]
 [ 0.  32.   0.  ...  0.   1.   0. ]]
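If you want to see the imputation on something small enough to check by hand, the sketch below uses a made-up 4 x 2 matrix (the names `toy_ages`, `toy_imputer`, and `toy_filled` are hypothetical). Column 0 plays the role of ```Age``` and has one missing entry; column 1 is fully observed and determines which rows are "closest".

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy matrix: column 0 plays the role of Age, column 1 is fully observed
toy_ages = np.array([
    [20.0, 1.0],
    [30.0, 1.1],
    [np.nan, 1.05],
    [60.0, 5.0],
])

# With k = 2, the missing entry is filled with the mean of column 0
# over the two rows that are nearest in the observed column
toy_imputer = KNNImputer(n_neighbors=2)
toy_filled = toy_imputer.fit_transform(toy_ages)
print(toy_filled)
```

Rows 0 and 1 are the two nearest neighbors of row 2, so the missing value is replaced by the average of 20 and 30. Observed entries are left unchanged.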


As we saw in class, standardizing the design matrix by changing it to z-scores will improve the logistic regression algorithm's performance. Change ```X``` into ```X_std``` as we did in class.

In [None]:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler() # construct the scaler object
X_std = scaler.fit_transform(X) #deploy and calculate
rows, columns = X_std.shape
print(rows)
print(columns)

print(X_std)


891
11
[[-0.73769513 -0.59541179  0.43279337 ... -0.4838099  -0.30974338
   0.61930636]
 [ 1.35557354  0.5833991   0.43279337 ...  2.06692751 -0.30974338
  -1.61470971]
 [ 1.35557354 -0.30070907 -0.4745452  ... -0.4838099  -0.30974338
   0.61930636]
 ...
 [ 1.35557354 -0.97852532  0.43279337 ... -0.4838099  -0.30974338
   0.61930636]
 [-0.73769513 -0.30070907 -0.4745452  ...  2.06692751 -0.30974338
  -1.61470971]
 [-0.73769513  0.14134501 -0.4745452  ... -0.4838099   3.22847904
  -1.61470971]]
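As a quick sanity check on what "changing to z-scores" means, the sketch below compares `StandardScaler` with the formula (value - column mean) / column standard deviation on a made-up matrix. The names `toy_matrix`, `toy_scaler`, `toy_std`, and `manual_z` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two columns on very different scales
toy_matrix = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

toy_scaler = StandardScaler()
toy_std = toy_scaler.fit_transform(toy_matrix)

# Manual z-scores; StandardScaler divides by the population standard deviation (ddof=0)
manual_z = (toy_matrix - toy_matrix.mean(axis=0)) / toy_matrix.std(axis=0)

print(np.allclose(toy_std, manual_z))
print(toy_std.mean(axis=0), toy_std.std(axis=0))
```

After standardizing, every column has mean 0 and standard deviation 1, so no single input dominates just because of its units.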


### II. The Logistic Regression Model

Build a logistic regression model called ```titanic_model``` in python using ```sklearn```. The full model will have the form:

\begin{align} \hat{p}(\vec{x}) = \frac{e^{z}}{1 + e^{z}}, \qquad z = \hat{\beta_0} + \hat{\beta_1}\text{Sex} + \hat{\beta_2}\text{Age} + \hat{\beta_3}\text{SibSp} + \hat{\beta_4}\text{Parch} + \hat{\beta_5}\text{Fare} + \hat{\beta_6}\text{Pclass1} + \hat{\beta_7}\text{Pclass2} + \hat{\beta_8}\text{Pclass3} + \hat{\beta_9}\text{EmbarkC} + \hat{\beta_{10}}\text{EmbarkQ} + \hat{\beta_{11}}\text{EmbarkS} \end{align}


In [None]:
from sklearn.linear_model import LogisticRegression

titanic_model = LogisticRegression().fit(X_std, y)

print(titanic_model.coef_)
print()
print(titanic_model.intercept_)


[[ 1.2547952  -0.52444834 -0.34613562 -0.07558441  0.09700901  0.51818866
   0.08302305 -0.5140426   0.07182477  0.05503385 -0.0976545 ]]

[-0.65349996]
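The link between the fitted betas and the predicted probabilities can be checked directly. The sketch below is an illustration on synthetic data (not the Titanic data): it fits a logistic regression, plugs the intercept and coefficients into the logistic formula by hand, and compares the result with `predict_proba`. All `demo_` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data, used only to illustrate the formula
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(200, 3))
noise = rng.normal(scale=0.5, size=200)
demo_y = (demo_X[:, 0] - 0.5 * demo_X[:, 1] + noise > 0).astype(int)

demo_model = LogisticRegression().fit(demo_X, demo_y)

# Linear predictor: intercept plus the sum of beta_j * x_j for every row
linear_part = demo_model.intercept_ + demo_X @ demo_model.coef_.ravel()
p_manual = np.exp(linear_part) / (1 + np.exp(linear_part))

# Column 1 of predict_proba is the estimated probability of class 1
p_sklearn = demo_model.predict_proba(demo_X)[:, 1]
print(np.allclose(p_manual, p_sklearn))
```

The same calculation applies to ```titanic_model```, as long as the inputs are the standardized values in ```X_std``` rather than the raw ones.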


**Exercise 3**. State $\hat{\beta_1}$. What does its sign tell you about the model's prediction of a female surviving relative to a male?

In [None]:
import numpy as np


p_female = 1 / (1 + np.exp(- (titanic_model.intercept_ + 1.2547952 * 1)))
print(p_female)
print()
p_male = 1 / (1 + np.exp(- (titanic_model.intercept_)))
print(p_male)

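# beta_hat_1 is approximately 1.25. It is positive, so the model predicts a higher probability of survival for a female than for a male.
# Note: the two probabilities above set every other standardized input to 0 and plug in Sex as raw 0/1,
# so they illustrate the direction of the effect rather than actual model predictions.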

[0.64595258]

[0.34220126]


**Exercise 4**. State $\hat{\beta_2}$. The sign tells you that a passenger had a better chance of survival the ____ they were. Given what you know about who got priority getting into the lifeboats, why does this make sense?

In [None]:
# beta_hat_2 is approximately -0.52 (the second entry of titanic_model.coef_).
# The negative sign means the model predicts lower log-odds of survival as the age of the passenger increases,
# so a passenger had a better chance of survival the younger they were.
# This is consistent with what we know about the Titanic disaster: women and children were given priority getting into the lifeboats.


**Exercise 5**. Which value of the original ```Embarked``` variable gave a passenger the worst chance of survival, according to the model's betas? Explain your answer.

In [None]:
# Embarked coefficients from titanic_model.coef_: EmbarkC = 0.07182477, EmbarkQ = 0.05503385, EmbarkS = -0.0976545
p_C = 1 / (1 + np.exp(- (titanic_model.intercept_ + 0.07182477 * 1)))
p_Q = 1 / (1 + np.exp(- (titanic_model.intercept_ + 0.05503385 * 1)))
p_S = 1 / (1 + np.exp(- (titanic_model.intercept_ + -0.0976545 * 1)))
print("p_C: ", p_C)
print("p_Q: ", p_Q)
print("p_S: ", p_S)
# EmbarkS has the lowest (and the only negative) coefficient of the three,
# so passengers who embarked at S (Southampton) have the worst predicted chance of survival.
# Plugging each beta into p = 1 / (1 + np.exp(- (titanic_model.intercept_ + b_hat * 1))) gives the same ordering: p_S < p_Q < p_C.


p_C:  [0.35854722]
p_Q:  [0.3546947]
p_S:  [0.3205698]


**Exercise 6**. Find the model's accuracy.



In [None]:
from sklearn import metrics

y_hat = titanic_model.predict(X_std)
confusion_matrix = metrics.confusion_matrix(y,y_hat, labels = titanic_model.classes_)

print(confusion_matrix)

[[481  68]
 [100 242]]


In [None]:
tn, fp, fn, tp = confusion_matrix.ravel()
accuracy = (tp + tn)/X.shape[0]
print(accuracy)
#0.8114

0.8114478114478114


**Exercise 7**. Find the model's F1-score.

In [None]:
precision = tp/(tp+fp)
recall = tp/(tp + fn)

f_1 = 2*precision*recall/(precision + recall)
print(f_1)

#0.7423


0.7423312883435583
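The hand calculations in Exercises 6 and 7 can be cross-checked against scikit-learn's built-in metric functions. The sketch below does not need the fitted model: it rebuilds label vectors from the four confusion matrix counts printed above (the names `y_true_demo` and `y_pred_demo` are hypothetical) and should agree with the manual values of about 0.8114 and 0.7423.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Counts copied from the confusion matrix printed above
tn, fp, fn, tp = 481, 68, 100, 242

# Rebuild label vectors that produce exactly those counts
y_true_demo = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred_demo = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

acc_demo = accuracy_score(y_true_demo, y_pred_demo)
f1_demo = f1_score(y_true_demo, y_pred_demo)  # F1 for the positive class (survived)
print(acc_demo)
print(f1_demo)
```

In the notebook itself, `metrics.accuracy_score(y, y_hat)` and `metrics.f1_score(y, y_hat)` give the same numbers without going through the confusion matrix.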


**Exercise 8**. A "null" model that predicts "did not survive" for any value of inputs would achieve an accuracy of ___.  

In [None]:
num_survive = sum(y)
print(num_survive)

accuracy_null = (X.shape[0]-num_survive)/X.shape[0]
print(accuracy_null)
#0.616


342.0
0.6161616161616161
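scikit-learn also has a ready-made "null" model, `DummyClassifier`, which is a convenient baseline for any classifier. The sketch below is one way to reproduce the number above using only the label counts (549 did not survive, 342 survived). The names `y_demo`, `X_demo`, and `null_model` are hypothetical, and the feature matrix is a placeholder because the null model ignores its inputs.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Label counts from the lab: 549 passengers did not survive, 342 survived
y_demo = np.array([0] * 549 + [1] * 342)
X_demo = np.zeros((y_demo.size, 1))  # placeholder features; the null model ignores them

# Always predict class 0 ("did not survive")
null_model = DummyClassifier(strategy="constant", constant=0).fit(X_demo, y_demo)
null_accuracy = null_model.score(X_demo, y_demo)  # fraction of labels equal to 0
print(null_accuracy)
```

Any model worth keeping should beat this baseline, which is why the 0.81 accuracy from Exercise 6 is best read against 0.62 rather than against 0.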


**Exercise 9**. Jack Dawson, a 20-year-old man in Southampton, wins a 3rd class ticket (valued at $8) in a poker game. He’s traveling without relatives. What does ```titanic_model``` estimate for his likelihood of survival? (Hint: See revised class notes for lecture 9w. I messed this up in class: I forgot to standardize ```mom```!)

In [None]:
#12%
import numpy as np

# Feature order: Sex, Age, SibSp, Parch, Fare, Pclass1, Pclass2, Pclass3, EmbarkC, EmbarkQ, EmbarkS
jack = np.array([[0,20,0,0,8,0,0,1,0,0,1]])
jack_std = scaler.transform(jack)  # reuse the scaler fitted on the training data
print(jack_std)
print(titanic_model.predict_proba(jack_std))  # columns: P(did not survive), P(survived)

[[-0.73769513 -0.74276315 -0.4745452  -0.47367361 -0.48734416 -0.56568542
  -0.51015154  0.90258736 -0.4838099  -0.30974338  0.61930636]]
[[0.87528967 0.12471033]]

### III. A Neural Network Model

A more computationally expensive way to do classification is to use neural networks and deep learning. Read about the multi-layer perceptron here: https://scikit-learn.org/stable/modules/neural_networks_supervised.html.


**Exercise 10**. Build and fit an ```MLPClassifier``` called ```titanic_nn```. Use the "adam" solver, with an alpha of 1e-4, hidden layer sizes of (36, 8), maximum iterations of 10,000, and a random state of 1 (for reproducibility). What F1-score does this model achieve?

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score
# Create the MLPClassifier object
titanic_nn = MLPClassifier(solver='adam', alpha=1e-4, hidden_layer_sizes=(36, 8), max_iter=10000, random_state=1)
titanic_nn.fit(X_std, y)

y_hat = titanic_nn.predict(X_std)
print("y_hat", y_hat)

confusion_matrix = metrics.confusion_matrix(y,y_hat, labels = titanic_nn.classes_)

print(confusion_matrix)


tn, fp, fn, tp = confusion_matrix.ravel()
accuracy = (tp + tn)/X.shape[0]
print("accuracy: ", accuracy)


precision = tp/(tp+fp)
recall = tp/(tp + fn)

f_1 = 2*precision*recall/(precision + recall)
print("f_1: ", f_1)


#0.8432

y_hat [0. 1. 0. 1. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 0. 1. 0.
 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 1. 0. 0. 1.
 0. 0. 0. 0. 1. 1. 0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0.
 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 1. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1. 0.
 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 0.
 0. 1. 1. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1.
 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0.
 0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0. 0.
 1. 0. 0. 0. 1. 1. 0. 0. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0.
 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 0. 1. 1. 1. 0. 1. 1. 1.
 1. 0. 0. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0.

### IV. Course Evaluations

**Exercise 11**. You should have received a link for the course evaluation form. Please complete the form for 1 point of extra credit on this assignment.

To determine whether this was a good portfolio for the investor, we need to calculate the expected return and the risk of the portfolio.
The expected return for stock A is 10% and for stock B is -2%. Thus, the expected return on the portfolio is:

Expected return = 0.3 x 10% + 0.7 x (-2%) = 2.8%

The risk of the portfolio can be measured by calculating its standard deviation. The formula for calculating the standard deviation of a two-asset portfolio is:

σp = sqrt(w1^2σ1^2 + w2^2σ2^2 + 2w1w2ρσ1σ2)

Where w1 and w2 are the weights of stock A and stock B respectively, σ1 and σ2 are the standard deviations of stocks A and B respectively, ρ is the correlation coefficient between the returns of the two stocks.

Using the above formula with the given values, we get:

σp = sqrt((0.3^2x10^2 + 0.7^2x(-2)^2 + 2x0.3x0.7x0x10x(-2))

σp=8.007%

As the expected return of the portfolio is less than the average rate of return, but before taking the risk into account, the portfolio might appear to be a profitable one. However, after accounting for risk (as shown through the high standard deviation), the investor may be better off exploring other investment opportunities.

Assuming that short-selling is not allowed, we can use the capital asset pricing model (CAPM) to find the optimal portfolio allocation. The CAPM formula is:
Expected return = Rf + β(Rm - Rf)

where Rf is the risk-free rate, β is the beta of the stock, Rm is the market return.

To derive the required inputs for this equation, first, we calculate the beta for each stock:

βA = 0 (not given)
βB = -2%/10% = -0.2

Next, based on the definition in question, the treasury bill rate can be used as the risk-free rate. This value has been mentioned as 6%.

Finally, there is no mention of the market return in the problem statement. As such, we will assume that the expected market return over the past year was 8% (a typical value for an S&P index fund).

Substituting the values in the CAPM formula, we obtain:

Expected return of A = 6% + 0 × (8% - 6%) = 6%
Expected return of B = 6% + (-0.2) × (8% - 6%) = 5.6%

Thus, a better portfolio choice would have been to allocate 100% of the investment to stock A.

The Sharpe Ratio measures the excess return generated by an investment per unit of volatility. It is calculated as:
Sharpe Ratio = (Expected portfolio return – Risk-free rate) / Portfolio standard deviation

Substituting the values from the previous answers, we obtain:

SR = (2.8% – 6%) / 8.007% = -0.475

A negative Sharpe Ratio suggests that the portfolio had a poor risk-return tradeoff and thus would not have been an ideal option for an investor.

