# Section 0: Import

In [70]:
import pandas as pd

In [71]:
data = pd.read_csv("space_titanic_clean.csv")

In [72]:
data.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [73]:
#check label distribution
print(str(data[data["Transported"] == True].shape[0]) + " passengers were Transported")
print(str(data[data["Transported"] == False].shape[0]) + " passengers were NOT transported")

3327 passengers were Transported
3279 passengers were NOT transported
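
The same check can be done in one call with `value_counts`. The sketch below uses a small toy Series (the name `labels` and its values are made up for illustration) in place of `data["Transported"]`:

```python
import pandas as pd

# Toy label column standing in for data["Transported"] (values are illustrative)
labels = pd.Series([True, False, True, True, False, False], name="Transported")

counts = labels.value_counts()                # absolute count per class
shares = labels.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares.round(3))
```

Roughly equal shares, as in our dataset, mean that accuracy is a reasonable first metric.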


# Section 1: Data Processing + Feature Engineering
Before we start building and training our model, we have to process our dataset a bit to get it ready for input to the model. More specifically, we should aim to do a few things:

1. Remove unnecessary features
2. Scale features
3. Encode categorical data (categorical --> numerical data)

## Remove Unnecessary Features
More often than not, a dataset contains features that will not help us predict our label. Identifying them usually requires some domain knowledge; in this case, we will make some assumptions. We will remove the following features first:

1. ```PassengerId```
2. ```Name```

In [74]:
#remove the PassengerId and Name columns from the dataset
data = data.drop(["PassengerId", "Name"], axis=1, inplace=False)

We may also want to remove features that would be too difficult to encode as numerical data. This might include long text data (reviews, comments) or categorical data with too many distinct values (high cardinality).

Let's take a look at the ```Cabin``` column. Counting the unique values, we see that there are 5305 different cabins. In this case, it might be best to remove the ```Cabin``` feature from our dataset.

In [75]:
#check number of unique cabins
data["Cabin"].nunique()

5305

In [76]:
#remove Cabin column from dataset
data = data.drop(["Cabin"], axis=1, inplace=False)
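
Dropping the column is the simplest option. As an alternative sketch, and assuming every cabin value follows the three-part `X/N/Y` pattern seen in the sample rows above, the column could be split into lower-cardinality parts. The column names `CabinDeck`, `CabinNum`, and `CabinSide` are hypothetical labels chosen for this example:

```python
import pandas as pd

# Toy cabins in the same "X/N/Y" shape as the sample rows shown earlier
cabins = pd.DataFrame({"Cabin": ["B/0/P", "F/0/S", "A/0/S", "F/1/S"]})

# Split on "/" into three columns; the column names are illustrative assumptions
parts = cabins["Cabin"].str.split("/", expand=True)
parts.columns = ["CabinDeck", "CabinNum", "CabinSide"]
parts["CabinNum"] = parts["CabinNum"].astype(int)

# Each part has far fewer distinct values than the full cabin string
print(parts.nunique())
```

We do not use this in the workshop; it only shows that high cardinality does not always force us to discard a feature.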

## Feature Engineering (Encoding)
Now that we've removed the features we don't need, let's move on to encoding our categorical variables.

The most popular method is One-hot Encoding, and it is the one we will use. However, know that for different situations, we may want to use other methods of encoding.

In [77]:
#one-hot encode 'Destination' and 'HomePlanet' columns
encoded = pd.get_dummies(data[["Destination", "HomePlanet"]], drop_first=True, dtype=int)
data.drop(["Destination", "HomePlanet"], axis=1, inplace=True)
data = data.join(encoded)
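
To see what `drop_first=True` does, here is a small sketch on a toy frame (the frame `toy` is made up for illustration). With three categories, two indicator columns are enough, because a row of all zeros identifies the dropped category:

```python
import pandas as pd

toy = pd.DataFrame({"HomePlanet": ["Earth", "Europa", "Mars", "Earth"]})

full = pd.get_dummies(toy, dtype=int)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)  # first category dropped

print(full.columns.tolist())
print(reduced.columns.tolist())
print(reduced)
```

This matches our encoded dataset, which has `HomePlanet_Europa` and `HomePlanet_Mars` columns but no column for Earth.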

Lastly, we need to label encode our True/False columns. We'll use 0 for False and 1 for True.

In [78]:
#Label encode CryoSleep, VIP, and Transported columns
data['CryoSleep'] = data['CryoSleep'].map(lambda val: 1 if val else 0)
data['VIP'] = data['VIP'].map(lambda val: 1 if val else 0)
data['Transported'] = data['Transported'].map(lambda val: 1 if val else 0)
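
For a column that contains only `True` and `False`, `astype(int)` gives the same result as the lambda. The sketch below compares both on a toy Series (`flags` is an illustrative name). One caveat: a missing value (`NaN`) is truthy in Python, so the lambda would silently map it to 1. That only matters if a column still contains missing values.

```python
import pandas as pd

flags = pd.Series([True, False, True])

via_map = flags.map(lambda val: 1 if val else 0)  # approach used above
via_astype = flags.astype(int)                    # equivalent for a clean boolean column

print(via_map.tolist(), via_astype.tolist())
```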

Now that our dataset is all cleaned and processed, let's move on to modeling!

In [79]:
data

Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,HomePlanet_Europa,HomePlanet_Mars
0,0,39.0,0,0.0,0.0,0.0,0.0,0.0,0,0,1,1,0
1,0,24.0,0,109.0,9.0,25.0,549.0,44.0,1,0,1,0,0
2,0,58.0,1,43.0,3576.0,0.0,6715.0,49.0,0,0,1,1,0
3,0,33.0,0,0.0,1283.0,371.0,3329.0,193.0,0,0,1,1,0
4,0,16.0,0,303.0,70.0,151.0,565.0,2.0,1,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6601,0,41.0,1,0.0,6819.0,0.0,1643.0,74.0,0,0,0,1,0
6602,1,18.0,0,0.0,0.0,0.0,0.0,0.0,0,1,0,0,0
6603,0,26.0,0,0.0,0.0,1872.0,1.0,0.0,1,0,1,0,0
6604,0,32.0,0,0.0,1049.0,0.0,353.0,3235.0,0,0,0,1,0


# Section 3: Creating Training, Validation, and Testing Datasets

The following command splits a given dataset into two distinct sets (training and testing): 
```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
```
* `X` is the set of feature variables in the dataset
* `y` is the target variable in the dataset 
* `test_size` is the fraction of the original dataset that should be reserved for testing
* `random_state` fixes the random shuffle so that the split is reproducible

**Note** Typically in machine learning we also perform what is called validation. Validation during training is like giving "mini" tests during the learning process, and it helps us detect overfitting and choose model settings. We will skip validation for our first model and come back to it in Section 5.

In [80]:
from sklearn.model_selection import train_test_split

In [81]:
#First, separate features from label
X = data.drop("Transported", axis=1, inplace=False) #every column except the label
y = data["Transported"] #just the label

Now use `X` and `y` to perform the `train_test_split` and obtain your training and test sets.

In [82]:
# type your answer here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
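
Our labels are close to balanced, so a plain random split works well here. When the classes are imbalanced, the `stratify` argument keeps the class proportions roughly the same in both sets. A minimal sketch on toy arrays (`X_toy` and `y_toy` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 30% positive labels
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

print(X_tr.shape, X_te.shape)    # 75 training rows, 25 test rows
print(y_tr.mean(), y_te.mean())  # both close to 0.3
```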

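One more thing: it is good practice to standardize our data before modeling, since many models (including the logistic regression and support vector classifier we use below) are sensitive to feature scale. We fit the scaler on the training set only and then apply it to both sets, so that no information from the test set leaks into training.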

In [83]:
from sklearn.preprocessing import StandardScaler

In [84]:
#initialize scaler
scaler = StandardScaler()

In [85]:
#fit scaler to our training set
scaler.fit(X_train)

In [86]:
#use transform to scale both the training and test set
X_train = scaler.transform(X_train)

X_test = scaler.transform(X_test)

In [87]:
print("Size of training set: " + str(X_train.shape[0]))
print("Size of test set: " + str(X_test.shape[0]))

Size of training set: 4954
Size of test set: 1652
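
One way to make sure the scaler is always fitted on training data only is to bundle it with the model in a scikit-learn pipeline. This is a sketch of that alternative, not what the rest of the workshop uses; the synthetic data from `make_classification` and the names `X_syn`, `y_syn`, and `pipe` are assumptions made for the example:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for our dataset
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

# fit() scales with statistics from X_tr only, then trains the model;
# score() reuses those same statistics to scale X_te before predicting
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

print(round(pipe.score(X_te, y_te), 3))
```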


Now we're ready to build our model!

# Section 4: Modeling
When modeling in machine learning, we rarely use only one model! In this case, we'll be trying a few different models:

1. Logistic Regression
2. Support Vector Classifier
3. Decision Tree

Note, there are many, many different models for classification. These are some of the more popular ones.

## Logistic Regression

Let's start by creating a simple logistic regression model. We can import it and create an instance of it.

In [88]:
#type your answer here
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression(solver="saga")

### Fit
After we've created our model, we now need to train it using our training set. Use `.fit(X_train, y_train)`

In [89]:
#type your answer here
lr_model.fit(X_train, y_train)

### Predict
Now that the model has been trained, we can test its performance using the test set. Use `.predict(X_test)`

In [90]:
#get predictions from test set
lr_predicted = lr_model.predict(X_test)

### Evaluation
There are many ways to evaluate how well our model performs. These methods vary between classification and regression as well. 

For classification, a very popular and intuitive metric is the accuracy, which is simply the percentage of observations that the model correctly classified (predicted).

In [91]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Now let's use `accuracy_score(...)` to obtain the accuracy of our model on the test set!

In [92]:
#type your answer here
lr_accuracy = accuracy_score(y_test, lr_predicted)

In [93]:
print("Logistic Regression achieved an accuracy of: " + str(round(lr_accuracy *100, 2)) + "%!")

Logistic Regression achieved an accuracy of: 77.66%!
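
To make the definition concrete, accuracy can be computed by hand as the fraction of predictions that match the true labels. The toy arrays below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# 6 of the 8 predictions match the true label
manual = (y_true == y_pred).mean()

print(manual, accuracy_score(y_true, y_pred))
```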


For classification, other popular metrics are precision, recall, and the F1-score. All three are derived from the counts in a confusion matrix.

Use `classification_report(...)` to obtain these metrics for each class.

In [94]:
#type your answer here
mtrx = pd.DataFrame(classification_report(y_test, lr_predicted, output_dict=True))
mtrx

Unnamed: 0,0,1,accuracy,macro avg,weighted avg
precision,0.78803,0.765882,0.776634,0.776956,0.777023
recall,0.760529,0.792935,0.776634,0.776732,0.776634
f1-score,0.774036,0.779174,0.776634,0.776605,0.776589
support,831.0,821.0,0.776634,1652.0,1652.0
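
The report above gives the metrics directly. To show where they come from, this sketch builds a confusion matrix for toy labels (made up for illustration) and derives precision, recall, and F1 for class 1 from its four counts:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Rows are true classes, columns are predicted classes (labels sorted: 0, 1)
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of everything predicted 1, how much was really 1
recall = tp / (tp + fn)     # of everything really 1, how much we found
f1 = 2 * precision * recall / (precision + recall)

print(cm)
print(precision, recall, f1)
```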


### Interpretation
Besides evaluation, interpretation is also a key step. Given our model and its output, what can we deduce from it? For example:
1. What are the most important features towards determining whether a passenger was transported or not?
2. What are the least important features?

This idea is called feature importance. Typically there's a trade-off between interpretability of a model and its complexity. 

Use your model's `.coef_[0]` attribute to get the feature coefficients. Put them alongside the name of the features (hint: use a dataframe).

In [95]:
#type your answer here
lr_coef = pd.DataFrame({"weight": lr_model.coef_[0]})
lr_coef["coef"] = X.columns
lr_coef = lr_coef[["coef", "weight"]]
lr_coef.sort_values(by="weight", ascending=False)

Unnamed: 0,coef,weight
10,HomePlanet_Europa,0.96497
4,FoodCourt,0.860186
0,CryoSleep,0.625439
5,ShoppingMall,0.393804
11,HomePlanet_Mars,0.278842
2,VIP,-0.130018
1,Age,-0.14218
8,Destination_PSO J318.5-22,-0.160249
9,Destination_TRAPPIST-1e,-0.273173
3,RoomService,-0.932039


From the coefficient weights, we can tell that the most important features (the features with the highest weight **magnitude**) are ```Spa```, ```VRDeck```, and ```HomePlanet_Europa```. Because the features were standardized, the weights are on a comparable scale: a larger magnitude means a stronger impact on the prediction, and the sign tells us whether the feature pushes the prediction toward or away from Transported.
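
Since importance here depends on magnitude rather than sign, it can help to sort by absolute value. A minimal sketch with hypothetical weights (the feature names and numbers below are invented for illustration and are not the fitted values):

```python
import pandas as pd

# Hypothetical weights for illustration only; not the fitted values above
weights = pd.DataFrame({
    "feature": ["a", "b", "c", "d"],
    "weight": [0.9, -1.4, 0.1, -0.3],
})

# Rank by magnitude so strong negative weights are not buried at the bottom
weights["abs_weight"] = weights["weight"].abs()
ranked = weights.sort_values(by="abs_weight", ascending=False)

print(ranked)
```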

Let's see if we can do better with a more complex model!

# Section 5: Validation with Support Vector Classifier

Now we'll do something similar to the above, except we will introduce a few more concepts:
1. Validation
2. Hyperparameter tuning

### Validation
In addition to a training and test set, we typically use a validation set to test and choose the best hyperparameters for our model. 

Hyperparameters are sort of like 'settings' for our model. Each model has a variety of hyperparameters that affect its performance. When validating hyperparameters, we normally define lists of possible values, iterate through each possible combination, and measure the model's accuracy on the validation set. We keep track of the combination of values that yields the highest accuracy.

In [96]:
# Perform a train-test-split on the training set to get the validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

In [97]:
from sklearn.svm import SVC

In [98]:
#hyperparameter grid
C = [0.01, 0.1, 0.25, 0.5, 0.75, 1.0]
gamma = [0.01, 0.1, 0.15, 0.2, 0.3]
kernels = ["linear", "rbf", "poly"]

#keep track of best accuracy, best parameters
best_svc = None
best_val_acc = 0.0
best_C, best_gamma, best_kernel = None,None,None

#loop through each possible hyperparameter combination
for kernel in kernels:
    for c in C:
        for gamm in gamma:
            #create an SVC model using the hyperparameters
            #fit the model
            #get predictions from the validation set
            #get the accuracy from the validation set (call it svc_accuracy)
            # ---- type your answer here --- 
            #note: gamma has no effect when kernel is "linear", so those runs repeat
            svc_model = SVC(kernel=kernel, class_weight="balanced", C=c, gamma=gamm)
            svc_model.fit(X_train, y_train)
            svc_predicted = svc_model.predict(X_val)
            svc_accuracy = accuracy_score(y_val, svc_predicted)
            # --- type your answer here ---
            
            print("Accuracy: " + str(round(svc_accuracy *100, 2)) + "% with C=" + str(c) + ", gamma=" + str(gamm) + " and kernel=" + kernel)
            
            #check to see if accuracy improved
            if svc_accuracy > best_val_acc:
                best_svc = svc_model
                best_val_acc = svc_accuracy
                best_kernel = kernel
                best_C = c
                best_gamma = gamm
                
print("BEST Accuracy: " + str(round(best_val_acc*100, 2)) + "% with C=" + str(best_C) + ", gamma=" + str(best_gamma) + " and kernel=" + best_kernel)

Accuracy: 75.95% with C=0.01, gamma=0.01 and kernel=linear
Accuracy: 75.95% with C=0.01, gamma=0.1 and kernel=linear
Accuracy: 75.95% with C=0.01, gamma=0.15 and kernel=linear
Accuracy: 75.95% with C=0.01, gamma=0.2 and kernel=linear
Accuracy: 75.95% with C=0.01, gamma=0.3 and kernel=linear
Accuracy: 76.76% with C=0.1, gamma=0.01 and kernel=linear
Accuracy: 76.76% with C=0.1, gamma=0.1 and kernel=linear
Accuracy: 76.76% with C=0.1, gamma=0.15 and kernel=linear
Accuracy: 76.76% with C=0.1, gamma=0.2 and kernel=linear
Accuracy: 76.76% with C=0.1, gamma=0.3 and kernel=linear
Accuracy: 77.16% with C=0.25, gamma=0.01 and kernel=linear
Accuracy: 77.16% with C=0.25, gamma=0.1 and kernel=linear


Accuracy: 77.16% with C=0.25, gamma=0.15 and kernel=linear
Accuracy: 77.16% with C=0.25, gamma=0.2 and kernel=linear
Accuracy: 77.16% with C=0.25, gamma=0.3 and kernel=linear
Accuracy: 77.24% with C=0.5, gamma=0.01 and kernel=linear
Accuracy: 77.24% with C=0.5, gamma=0.1 and kernel=linear
Accuracy: 77.24% with C=0.5, gamma=0.15 and kernel=linear
Accuracy: 77.24% with C=0.5, gamma=0.2 and kernel=linear
Accuracy: 77.24% with C=0.5, gamma=0.3 and kernel=linear
Accuracy: 77.32% with C=0.75, gamma=0.01 and kernel=linear
Accuracy: 77.32% with C=0.75, gamma=0.1 and kernel=linear
Accuracy: 77.32% with C=0.75, gamma=0.15 and kernel=linear
Accuracy: 77.32% with C=0.75, gamma=0.2 and kernel=linear
Accuracy: 77.32% with C=0.75, gamma=0.3 and kernel=linear
Accuracy: 77.24% with C=1.0, gamma=0.01 and kernel=linear
Accuracy: 77.24% with C=1.0, gamma=0.1 and kernel=linear
Accuracy: 77.24% with C=1.0, gamma=0.15 and kernel=linear
Accuracy: 77.24% with C=1.0, gamma=0.2 and kernel=linear
Accuracy: 77.24%
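
scikit-learn can run this kind of search for us with `GridSearchCV`. Note that it is a different technique from the loop above: instead of a single held-out validation set, it uses k-fold cross-validation on the data it is given. The sketch below is a minimal example on synthetic data (`make_classification`, `X_syn`, `y_syn`, and the small grid are assumptions made to keep it fast):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for our training data
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=0)

# gamma only applies to the "rbf" and "poly" kernels here, so the linear grid omits it
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1.0]},
    {"kernel": ["rbf", "poly"], "C": [0.1, 1.0], "gamma": [0.01, 0.1]},
]

# 3-fold cross-validation over every combination in the grid
search = GridSearchCV(SVC(), param_grid, cv=3, scoring="accuracy")
search.fit(X_syn, y_syn)

print(search.best_params_, round(search.best_score_, 3))
```

Splitting the grid into two dictionaries avoids refitting the linear kernel once per `gamma` value, which is why the linear rows in the output above repeat.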

Now let's see how well our **validated** model performs on the test set!

In [99]:
#get predictions from test set
svc_pred = best_svc.predict(X_test)

In [100]:
#let's get our model accuracy
svc_test_accuracy = accuracy_score(y_test, svc_pred)

In [101]:
print("SVC achieved an accuracy of: " + str(round(svc_test_accuracy*100, 2)) + "%!")

SVC achieved an accuracy of: 78.39%!


# Section 6: Other models

## Decision Tree
Your turn! Let's do everything one more time, now with the Decision Tree Classifier.

In [102]:
from sklearn.tree import DecisionTreeClassifier

In [103]:
#try it yourself!
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)
dt_predicted = dt_model.predict(X_val)
dt_accuracy = accuracy_score(y_val, dt_predicted)

In [104]:
print("Decision Tree Classifier achieved a validation accuracy of: " + str(int(round(dt_accuracy * 100))) + "%!")

Decision Tree Classifier achieved a validation accuracy of: 72%!
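
A decision tree with default settings tends to overfit, and `max_depth` is one hyperparameter we could validate in the same way as the SVC settings. The sketch below tries a few depths on synthetic data; `make_classification`, the variable names, and the depth values are assumptions chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for our training and validation sets
X_syn, y_syn = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

depth_scores = {}
for depth in [2, 4, 8, None]:  # None places no limit on the depth of the tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    depth_scores[depth] = accuracy_score(y_va, tree.predict(X_va))

# Keep the depth with the highest validation accuracy
best_depth = max(depth_scores, key=depth_scores.get)
print(depth_scores, best_depth)
```

Setting `random_state` makes the tree reproducible from run to run.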


# That's a wrap!
There are **many** more types of models used for classification, as well as different metrics used to capture model performance. While we only went through a few, we encourage you to explore more models in the scikit-learn documentation!
