## Project 5: Classification

## Instructions

### Description

Practice classification on the Titanic dataset.

### Grading

For grading purposes, we will clear all outputs from all your cells and then run them all from the top.  Please test your notebook in the same fashion before turning it in.

### Submitting Your Solution

To submit your notebook, first clear all the cell outputs (this won't matter too much this time, but for larger data sets in the future, it will make the file smaller).  Then use the File->Download As->Notebook to obtain the notebook file.  Finally, submit the notebook file on Canvas.


In [None]:
import pandas as pd
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

### Introduction

On April 15, 1912, the Titanic, then the largest passenger liner ever built, sank during her maiden voyage after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck resulted in such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others.

Intro Videos: 
https://www.youtube.com/watch?v=3lyiZMeTKIo
and
https://www.youtube.com/watch?v=ItjXTieWKyI 

The `titanic_data.csv` file contains data for `887` of the real Titanic passengers. Each row represents one person. The columns describe different attributes about the person including whether they survived (`0=No`), their age, their passenger-class (`1=1st Class, Upper`), gender, and the fare they paid (£s*). For more on the currency: http://www.statisticalconsultants.co.nz/blog/titanic-fare-data.html

We are going to try to see if there are correlations between the feature data provided (find a best subset of features) and passenger survival.

### Problem 1: Load and understand the data (35 points)

#### Your task (some of this is the work you completed for L14 - be sure to copy that work into here as needed)
Conduct some preprocessing steps to explore the following and provide code/answers in the below cells:
1. Load the `titanic_data.csv` file into a pandas dataframe
2. Explore the data provided (e.g., looking at statistics using describe(), value_counts(), histograms, scatter plots of various features, etc.) 
3. What are the names of feature columns that appear to be usable for learning?
4. What is the name of the column that appears to represent our target?
5. Formulate a hypothesis about the relationship between given feature data and the target
6. How did Pclass affect passengers' chances of survival?
7. What is the age distribution of survivors?
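
For steps 6 and 7, one possible approach is a `groupby` on the target column. The sketch below uses a tiny hand-made dataframe (an assumption for illustration, not the real `titanic_data.csv`) that reuses the `Survived`, `Pclass`, and `Age` column names.

```python
import pandas as pd

# Tiny hand-made sample (NOT the real Titanic data), reusing the same column names.
sample = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1, 0, 0],
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Age":      [29.0, 54.0, 4.0, 35.0, 22.0, 19.0, 40.0, 31.0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate per class.
rate_by_class = sample.groupby("Pclass")["Survived"].mean()
print(rate_by_class)

# Summary statistics of age, restricted to survivors.
survivor_ages = sample.loc[sample["Survived"] == 1, "Age"]
print(survivor_ages.describe())
```

Comparing rates rather than raw counts helps when the classes hold very different numbers of passengers.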

In [None]:
# Step 1. Load the `titanic_data.csv` file into a pandas dataframe
data = pd.read_csv('titanic_data.csv')
X = data.iloc[:, 1:]
y = data["Survived"]
data.head()

In [None]:
# Step 2. Explore the data provided (e.g., looking at statistics using describe(), value_counts(), histograms, scatter plots of various features, etc.) 
data.describe()

In [None]:
fig, axs = plt.subplots(nrows=2, ncols=3)

axs[0,0].hist(data.loc[data["Survived"] == 1, "Pclass"], 15)
axs[0,0].set_xlabel("Pclass")

axs[0,1].hist(data.loc[data["Survived"] == 1, "Fare"], 15)
axs[0,1].set_xlabel("Fare")

axs[1,0].hist(data.loc[data["Survived"] == 1, "Age"], 50)
axs[1,0].set_xlabel("Age")
axs[1,0].set_xticks(np.arange(0,max(data["Age"])+10, 10))

axs[1,1].hist(data.loc[data["Survived"] == 1, "Parents/Children Aboard"], 15)
axs[1,1].set_xlabel("Parents/Children Aboard")

axs[0,2].hist(data.loc[data["Survived"] == 1, "Siblings/Spouses Aboard"], 15)
axs[0,2].set_xlabel("Siblings/Spouses Aboard")

axs[1,2].hist(data.loc[data["Survived"] == 1, "Sex"], 2)
axs[1,2].set_xlabel("Sex")

fig.set_constrained_layout(True)
fig.suptitle("Histograms of those who survived")
fig.supylabel("Number of People")
plt.show()

In [None]:
fig, axs = plt.subplots(nrows=2, ncols=3)

axs[0,0].hist(data.loc[data["Survived"] == 0, "Pclass"], 15)
axs[0,0].set_xlabel("Pclass")

axs[0,1].hist(data.loc[data["Survived"] == 0, "Fare"], 15)
axs[0,1].set_xlabel("Fare")

axs[1,0].hist(data.loc[data["Survived"] == 0, "Age"], 50)
axs[1,0].set_xlabel("Age")
axs[1,0].set_xticks(np.arange(0,max(data["Age"])+10, 10))

axs[1,1].hist(data.loc[data["Survived"] == 0, "Parents/Children Aboard"], 15)
axs[1,1].set_xlabel("Parents/Children Aboard")

axs[0,2].hist(data.loc[data["Survived"] == 0, "Siblings/Spouses Aboard"], 15)
axs[0,2].set_xlabel("Siblings/Spouses Aboard")

axs[1,2].hist(data.loc[data["Survived"] == 0, "Sex"], 2)
axs[1,2].set_xlabel("Sex")

fig.set_constrained_layout(True)
fig.suptitle("Histograms of those who didn't survive")
fig.supylabel("Number of People")
plt.show()

In [None]:
fig, axs = plt.subplots(nrows=2, ncols=2)

axs[0,0].plot(data["Pclass"], data["Fare"], "b.")
axs[0,0].set_xlabel("Pclass")
axs[0,0].set_ylabel("Fare")

axs[1,0].plot(data["Pclass"], data["Age"], "b.")
axs[1,0].set_xlabel("Pclass")
axs[1,0].set_ylabel("Age")

axs[0,1].plot(data["Siblings/Spouses Aboard"], data["Parents/Children Aboard"], "b.")
axs[0,1].set_xlabel("Siblings/Spouses Aboard")
axs[0,1].set_ylabel("Parents/Children Aboard")

fig.set_constrained_layout(True)
fig.suptitle("Scatter plot Comparisons")
plt.show()

---

**Edit this cell to provide answers to the following steps:**

---

Step 3. What are the names of feature columns that appear to be usable for learning?

    I think that the features that appear the most valuable for learning would be Age, Sex, and Pclass.

Step 4. What is the name of the column that appears to represent our target?

    Survived

Step 5. Formulate a hypothesis about the relationship between given feature data and the target

    My hypothesis is that three variables contributed most to survival: younger passengers were more likely to survive, women were more likely to survive, and passengers in Pclass 3 were the least likely to survive.
    
Step 6. How did Pclass affect passengers' chances of survival?

    Passengers in Pclass 3 had a significantly lower chance of survival than those in the other classes.


In [None]:
#Step 6. How did Pclass affect passengers' chances of survival?
#Show your work with a bar plot, dataframe selection, or visual of your choice.
Pclass_died = data.loc[data["Survived"] == 0, "Pclass"].value_counts()
Pclass_survived = data.loc[data["Survived"] == 1, "Pclass"].value_counts()
Pclass_data = pd.DataFrame({
    "Pclass_died": Pclass_died, "Pclass_survived": Pclass_survived
})
Pclass_data.plot(kind="bar")
plt.title("How many people survived in each Pclass")
plt.ylabel("Number of People")
plt.show()

In [None]:
#Step 7. What is the age distribution of survivors?
#Show your work with a dataframe operation and/or histogram plot.

plt.hist(data.loc[data["Survived"] == 1, "Age"], 50)
plt.xlabel("Age")
plt.xticks(np.arange(0,max(data["Age"])+10, 10))
plt.title("Age distribution of survivors")
plt.show()

### Problem 2: transform the data (10 points)
The `Sex` column is categorical, meaning its data are separable into groups, but not numerical. To be able to work with this data, we need numbers, so your task is to transform the `Sex` column into numerical data with pandas' `get_dummies` function and remove the original categorical `Sex` column.
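
As a minimal sketch of what `get_dummies` does, here it is applied to a three-row toy dataframe (made up for illustration, not taken from `titanic_data.csv`). Passing `columns=["Sex"]` replaces the original column with one indicator column per category.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (illustrative values only).
toy = pd.DataFrame({"Sex": ["male", "female", "female"], "Age": [22.0, 38.0, 26.0]})

# Each category of `Sex` becomes its own indicator column; the original column is dropped.
encoded = pd.get_dummies(toy, columns=["Sex"])
print(encoded.columns.tolist())
print(encoded)
```

The two indicator columns carry the same information (one is the complement of the other), so a model could also be trained on just one of them.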

In [None]:
data_with_dummies = pd.get_dummies(data, columns=["Sex"])
data_with_dummies.head()

### Problem 3: Classification (30 points)
Now that the data is transformed, we want to run various classification experiments on it. The first is `K Nearest Neighbors`, which you will conduct by:

1. Define input and target data by creating lists of dataframe columns (e.g., inputs = ['Pclass', etc.)
2. Split the data into training and testing sets with `train_test_split()`
3. Create a `KNeighborsClassifier` using `5` neighbors at first (you can experiment with this parameter)
4. Train your model by passing the training dataset to `fit()`
5. Calculate predicted target values(y_hat) by passing the testing dataset to `predict()`
6. Print the accuracy of the model with `score()`

**Note:** If a function call that uses the `Y`, `trainY`, or `testY` vector raises the Python warning "DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, )", look up how to flatten the vector with `trainY.values.ravel()`, `trainY.values.flatten()`, or a similar function.
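
The six steps above can be sketched as follows. The data here is randomly generated and the "survival" rule is invented purely so the example runs on its own; only the column names mirror the assignment. Selecting the target with a single column name yields a 1-D Series, which is one way to avoid the `DataConversionWarning` mentioned in the note.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Randomly generated stand-in data (an assumption, not the Titanic passengers).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Age": rng.uniform(1, 80, size=n),
    "Sex_female": rng.integers(0, 2, size=n),
})
toy["Survived"] = toy["Sex_female"]  # invented rule, for illustration only

# Step 1: input columns as a list, target as a single column name (1-D Series).
inputs = ["Pclass", "Age", "Sex_female"]
target = "Survived"

# Step 2: hold out 20% of the rows for testing; random_state makes the split repeatable.
trainX, testX, trainY, testY = train_test_split(
    toy[inputs], toy[target], test_size=0.2, random_state=0
)

# Steps 3-6: create, train, predict, and score the classifier.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(trainX, trainY)
y_hat = knn.predict(testX)
accuracy = knn.score(testX, testY)
print(accuracy)
```

Because `train_test_split` shuffles the rows, the accuracy changes from run to run unless `random_state` is fixed.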

In [None]:
inputs = data_with_dummies[["Pclass", "Sex_female", "Sex_male", "Age"]]
target = data_with_dummies["Survived"]

In [None]:
from sklearn.model_selection import train_test_split
trainX, testX, trainY, testY = train_test_split(inputs, target, test_size = 0.2)
print(trainX.shape)
print(testX.shape)
print(trainY.shape)
print(testY.shape)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
k = 5
model = KNeighborsClassifier(k)
model.fit(trainX, trainY)
y_hat = model.predict(testX)
print(model.score(testX, testY))

### Problem 4: Cross validation, classification report (15 points)
- Using the concepts from the 17-model_selection slides and the [`cross_val_score`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function from scikit-learn, estimate the f-score ([`f1-score`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score)) (you can use however many folds you wish). 

    `cross_val_score` is a handy utility, but it can be confusing. It doesn't return a model like `KNeighborsClassifier(...)`. Instead, it uses a model and dataset that you provide and runs the whole train-predict-score process for each of `k` folds (refer to the notes on Cross Validation for more information). The function returns an array of scores, one for each of the k folds. To get a single score, you can take the mean or median of these scores. There are also more involved statistical techniques. However, it is also correct to just provide the list of scores.

    By default, the `cross_val_score` utility will apply the default scoring metric (accuracy) to every cross validation fold. To get it to apply `f1-score` instead, you will need to create a "scorer" that calculates f1-scores using [`make_scorer`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html#sklearn.metrics.make_scorer), and then pass this object to the `scoring` parameter of `cross_val_score`. Since this has a few parts to it, let me just give you that scorer object: ```scorerVar = make_scorer(f1_score, pos_label=1)```

- Using the concepts from the end of the 14-classification slides, output a confusion matrix.

- Also, output a classification report [`classification_report`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) from sklearn.metrics showing more of the metrics: precision, recall, f1-score for both of our classes.
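
To make the behaviour of `cross_val_score` concrete, here is a minimal sketch on synthetic data from `make_classification` (a stand-in chosen for illustration, not the Titanic set). The name `f1_scorer` and the choice of 5 folds are likewise assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
f1_scorer = make_scorer(f1_score, pos_label=1)

# Returns one f1-score per fold; a fresh copy of the model is fitted for each fold,
# so `knn` itself is left unfitted.
fold_scores = cross_val_score(knn, X, y, cv=5, scoring=f1_scorer)
print(fold_scores)

# Two common ways to reduce the per-fold scores to a single number.
print(fold_scores.mean(), np.median(fold_scores))
```

Since `cross_val_score` does not fit the model you pass in, the confusion matrix and classification report still need predictions from a model you trained yourself with `fit()`.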

In [None]:
from sklearn import metrics
from sklearn.metrics import confusion_matrix, f1_score, classification_report, make_scorer
from sklearn import model_selection

scorerVar = make_scorer(f1_score, pos_label=1)
folds = 10
cvScore = model_selection.cross_val_score(model, trainX, trainY, cv = folds, scoring=scorerVar)
print("Output of Cross_val_score using f1_scorer and cv={}: \n{}".format(folds , cvScore))

In [None]:
conf_matrix = confusion_matrix(testY, y_hat)
print("Confusion Matrix: \n{}".format(conf_matrix))

In [None]:
classi_report = classification_report(testY, y_hat, target_names=["Died-0", "Survived-1"])
print("Classification Report: \n{}".format(classi_report))

### Problem 5: Support Vector Machines (15 points)
Now, repeat the above experiment using a Support Vector classifier ([`SVC`](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)) with default parameters (RBF kernel) in scikit-learn, and output:

- The fit accuracy (using the `score` method of the model)
- The f-score (using the [`cross_val_score`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function)
- The confusion matrix
- The precision, recall, and f-measure for the 1 class (you can just print the results of the [`classification_report`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) function from sklearn.metrics)
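
Because the same four outputs are requested for each model, one option is to wrap them in a small function. The helper `evaluate` below is hypothetical (it is not part of scikit-learn or the course material), and the data is synthetic; treat it as a sketch rather than the required solution.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix, f1_score, make_scorer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC


def evaluate(model, trainX, trainY, testX, testY, folds=5):
    """Hypothetical helper: fit `model` and return the four requested outputs."""
    model.fit(trainX, trainY)
    y_hat = model.predict(testX)
    scorer = make_scorer(f1_score, pos_label=1)
    return {
        "accuracy": model.score(testX, testY),
        "cv_f1": cross_val_score(model, trainX, trainY, cv=folds, scoring=scorer),
        "confusion_matrix": confusion_matrix(testY, y_hat),
        "report": classification_report(testY, y_hat),
    }


# Synthetic two-class data (an assumption for illustration, not the Titanic set).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
trainX, testX, trainY, testY = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

results = evaluate(SVC(), trainX, trainY, testX, testY)
for name, value in results.items():
    print(name, value, sep=":\n")
```

Any other scikit-learn classifier can be passed in place of `SVC()`, which keeps the comparison between models consistent.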

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

#create a model object
model_SVC = SVC()
#train our model
model_SVC.fit(trainX, trainY)
#evaluate the model 
y_hat_SVC = model_SVC.predict(testX)
model_score_SVC = model_SVC.score(testX,testY)
print("Model fit Accuracy: \n{}".format(model_score_SVC))
print()
#setup to get f-score and cv
folds_SVC = 10
cvScore_SVC = model_selection.cross_val_score(model_SVC, trainX, trainY, cv = folds_SVC, scoring=scorerVar)
print("Output of Cross_val_score using f1_scorer and cv={}: \n{}".format(folds_SVC , cvScore_SVC))
print()
#confusion matrix
conf_matrix_SVC = confusion_matrix(testY, y_hat_SVC)
print("Confusion Matrix: \n{}".format(conf_matrix_SVC))
print()
#classification report
classi_report_SVC = classification_report(testY, y_hat_SVC, target_names=["Died-0", "Survived-1"])
print("Classification Report: \n{}".format(classi_report_SVC))

### Problem 6: Logistic Regression (15 points)

Now, repeat the above experiment using the [`LogisticRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model in scikit-learn, and output:

- The fit accuracy (using the `score` method of the model)
- The f-score (using the [`cross_val_score`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function)
- The confusion matrix
- The precision, recall, and f-measure for the 1 class (you can just print the results of the [`classification_report`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) function from sklearn.metrics)

In [None]:
from sklearn.linear_model import LogisticRegression

#create a model object
model_LR = LogisticRegression()
#train our model
model_LR.fit(trainX, trainY)
#evaluate the model 
y_hat_LR = model_LR.predict(testX)
model_score_LR = model_LR.score(testX,testY)
print("Model fit Accuracy: \n{}".format(model_score_LR))
print()
#setup to get f-score and cv
folds_LR = 10
cvScore_LR = model_selection.cross_val_score(model_LR, trainX, trainY, cv = folds_LR, scoring=scorerVar)
print("Output of Cross_val_score using f1_scorer and cv={}: \n{}".format(folds_LR , cvScore_LR))
print()
#confusion matrix
conf_matrix_LR = confusion_matrix(testY, y_hat_LR)
print("Confusion Matrix: \n{}".format(conf_matrix_LR))
print()
#classification report
classi_report_LR = classification_report(testY, y_hat_LR, target_names=["Died-0", "Survived-1"])
print("Classification Report: \n{}".format(classi_report_LR))


### Problem 7: Comparison and Discussion (5 points)
Edit this cell to provide a brief discussion (3-5 sentences at most):
1. What was the model/algorithm that performed best for you?
2. What features and hyperparameters were used to achieve that performance?
3. What insights did you gain from your experimentation about the predictive power of this dataset and did it match your original hypothesis about the relationship between given feature data and the target?

    The model that performed best for me was the K Nearest Neighbors algorithm with a k value of 5 and the features Age, Sex, and Pclass. It reached an accuracy of 85%.
    Running these experiments taught me how useful data can be for predicting outcomes. I did not expect such a high accuracy for such an unpredictable event. I also think my hypothesis was correct, since the features I chose produced an accuracy of 85%.


### Questionnaire
1) How long did you spend on this assignment?
<br><br>
    I spent about 3 hours on this assignment
<br><br>
2) What did you like about it? What did you not like about it?
<br><br>
    I liked learning just how powerful machine learning algorithms could be.
<br><br>
3) Did you find any errors or is there anything you would like changed?
<br><br>
    Nothing that I can think of
<br><br>