# Would YOU Have Survived the Titanic?

![Alt text](titanic.jpg)

Before the fun begins, let us set the stage with some imports. These import statements bring the tools we need into this notebook.

### Imports

In [None]:
#For reading in the spreadsheet data
import pandas as pd
#For data manipulation
import random
import numpy as np
#For the real machine learning tricks
#Note: sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn import datasets, svm, model_selection, tree, preprocessing, metrics
#For visualizing data
import matplotlib.pyplot as plt
%matplotlib inline

PANDAS: we will use read_csv from pandas to "read" in the data from the train and test csv files. The "head" call returns the first N rows, and N=5 by default.

### Reading the Data

In [None]:
train_df = pd.read_csv('train.csv', index_col=None, na_values=['NA'])
test_df = pd.read_csv('test.csv', index_col=None, na_values=['NA'])
train_df.head()

The point of having a data scientist process data, rather than a machine, is to prevent the plot of Terminator from coming true. Just kidding, we are really trying to stop the plot of I, Robot from coming true. Ha. Ha. Ha.

Okay, enough with the attempts at witty jokes.

The REAL reason to have a data scientist process the data is that a human can make better calls than a machine. YOU need to understand the data you are given and adapt your analysis based on your own judgement.

Therefore, here is what each column heading means:
<ul>
   <li>survival - binary (0 = No, 1 = Yes)</li>
   <li>class - passenger class (1 = 1st - upper, 2 = 2nd, 3 = 3rd)</li>
   <li>name - passenger's name</li>
   <li>sex - sex</li>
   <li>age - age</li>
   <li>sibsp - number of siblings/spouses aboard</li>
   <li>parch - number of parents/children aboard</li>
   <li>ticket - ticket number</li>
   <li>fare - passenger fare</li>
   <li>cabin - cabin</li>
   <li>embarked - point of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)</li>
   <li>boat - lifeboat (if survived)</li>
   <li>body - body number (if did not survive and body was recovered)</li>
</ul>

In [None]:
#Before we do anything else, let us check out the survival rate on the Titanic
train_df['Survived'].mean()
#Here, we just called the "Survived" column and took the mean

So what does this mean? Only 38% of the passengers survived :(
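
Why does a plain mean give us a survival rate? Because the column only holds 0s and 1s, so the mean is just the share of 1s. Here is a tiny sketch on made-up rows (the `toy` frame is an assumption for illustration, not the Titanic file):

```python
import pandas as pd

# Toy data (hypothetical, not the Titanic file): 1 = survived, 0 = did not
toy = pd.DataFrame({"Survived": [1, 0, 0, 1, 0]})

# Mean of a 0/1 column = number of 1s / number of rows = 2 / 5
survival_rate = toy["Survived"].mean()

# Same number, read off the normalized counts of each value
share_by_value = toy["Survived"].value_counts(normalize=True)

print(survival_rate)          # 0.4
print(share_by_value.loc[1])  # 0.4
```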

Now, it is time to think. What do we know about the early 20th century? 

Social classes! 

We know that 1st class areas on the Titanic were off limits to 2nd and 3rd class passengers. 1st class passengers were the rich folks, 2nd class passengers were the middle class, and 3rd class passengers bought the cheapest tickets. I would guess that 1st class passengers had better chances of survival. Let's see if I am right.

In [None]:
#Here, "groupby" will group by values from column "Pclass" => group by class
#Select "Survived" before taking the mean, so text columns such as "Name" are never averaged
class_grouping = train_df.groupby('Pclass')['Survived'].mean()
class_grouping.plot.bar()

Alright, it looks like the 63% survival rate in 1st class beats 47% in 2nd class and 24% in 3rd class.

Let us see what we can infer from the other data.

In [None]:
#TODO: Group the survival rates by sex. Pull up the plot for an easier view.


That's right. Titanic officers, by the good old tradition, prioritized women and children when lifeboat evacuations came about. Our statistical results clearly reflect the first part of this policy, as across all classes women were much more likely to survive than the men. 
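
To check a claim like "across all classes", one option is to group by two columns at once. The sketch below uses a handful of made-up rows (the `toy` frame and its values are assumptions); in the notebook you would swap in train_df.

```python
import pandas as pd

# Toy rows (hypothetical values); the real notebook would use train_df instead
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male",
                 "female", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Survival rate within each (class, sex) pair;
# unstack() turns the inner index level (Sex) into columns
by_class_and_sex = toy.groupby(["Pclass", "Sex"])["Survived"].mean().unstack()
print(by_class_and_sex)
```

Calling `by_class_and_sex.plot.bar()` on the result should draw one pair of bars per class, which makes the gap between women and men easy to compare class by class.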

In [None]:
#Now, let us drop some features we deem to be less important.

In [None]:
train_combed_df = train_df.drop(['Name', 'PassengerId', 'SibSp', 'Parch', 'Ticket', 'Fare'], axis = 1)
test_combed_df = test_df.drop(['Name', 'PassengerId', 'SibSp', 'Parch', 'Ticket', 'Fare'], axis = 1)
train_combed_df.head()

Alright, much cleaner! 

Now, we can get back to the goal of the Kaggle competition - predicting whether someone will live or die - a binary output (0 = dead, 1 = survived).

### Modeling, Predicting, and Solving

In [None]:
# machine learning
from sklearn.linear_model import LogisticRegression

In [None]:
#Getting the data ready
X_train = train_combed_df.drop("Survived", axis = 1)
Y_train = train_combed_df["Survived"]

X_test = test_combed_df.copy()
X_train.shape, Y_train.shape, X_test.shape

In [None]:
#Logistic Regression

logRegModel = LogisticRegression()
logRegModel.fit(X_train, Y_train)
Y_pred = logRegModel.predict(X_test)
acc_log = round(logRegModel.score(X_train, Y_train) * 100, 2)
acc_log

Oh no! What's wrong?

In [None]:
X_train

Some of our data is missing, and a model cannot be fit on missing values. Thus, we will drop every row that contains a NaN with a simple command. We run it on the combed train set, before splitting off the labels, so that the features and the labels keep the same number of rows:

In [None]:
together_DA = train_combed_df.dropna()

.dropna() removes every row that has a NaN in any of the remaining columns/features.
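
Since a single gap is enough to lose the whole row, it is worth counting the gaps first. Below is a minimal sketch on made-up values (the `toy` frame is an assumption); it also shows the `subset` argument of dropna, which only looks at the columns you name.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in two columns
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age":      [22.0, np.nan, 26.0, 35.0],
    "Cabin":    [np.nan, "C85", np.nan, "C123"],
})

# Number of missing values per column
print(toy.isnull().sum())

# Keeps only the rows with no NaN in any column
all_cols = toy.dropna()

# Keeps every row where Age is present, whatever Cabin holds
age_only = toy.dropna(subset=["Age"])

print(len(toy), len(all_cols), len(age_only))  # 4 1 3
```

Comparing `train_combed_df.shape` with `together_DA.shape` in the notebook is a quick way to see how many passengers the drop costs us.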

In [None]:
together_DA

In [None]:
#Split the cleaned training data into (1) the features X_train and (2) the labels Y_train.
X_train = together_DA.drop("Survived", axis = 1)
Y_train = together_DA["Survived"]

X_test = test_combed_df.copy()

X_train

Let us try creating a logistic regression model once more.

In [None]:
#Logistic Regression

logRegModel = LogisticRegression()
logRegModel.fit(X_train, Y_train)
Y_pred = logRegModel.predict(X_test)
acc_log = round(logRegModel.score(X_train, Y_train) * 100, 2)
acc_log

![Alt text](blob.jpg)

#### WHAT'S WRONG NOW?

![Alt text](confused_cat.jpg)

Well, we are trying to perform math on words - notice the contents of the "Sex," "Cabin," and "Embarked" columns - not numbers, are they? For now, we take the easy way out for the "Cabin" and "Embarked" columns and simply drop them...

In [None]:
X_train_bare = together_DA.drop(['Cabin', "Embarked"], axis = 1)
X_train_bare.head()

X_test_DA = X_test.dropna()
X_test_bare = X_test_DA.drop(['Cabin', "Embarked"], axis = 1)

In [None]:
combined = [X_train_bare, X_test_bare]

for dataset in combined:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
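
# A possible alternative to dropping "Embarked" (a sketch on made-up rows, not what
# this notebook does): pandas can turn a text column into one 0/1 column per value.

```python
import pandas as pd

# Toy rows (hypothetical values) with two text columns
toy = pd.DataFrame({
    "Sex":      ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Two categories: a single 0/1 column is enough
toy["Sex"] = toy["Sex"].map({"female": 1, "male": 0})

# Three categories: one indicator column per port
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)

print(encoded.columns.tolist())
# ['Sex', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
```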

In [None]:
#Split the data into (1) X_train_bare and (2) Y_train_bare sets to train the features against the survival outcomes.
Y_train_bare = X_train_bare["Survived"]
X_train_bare = X_train_bare.drop(["Survived"], axis = 1)



Now we can go ahead and create a logistic regression model.

Model 1: Logistic Regression

In [None]:
#Logistic Regression

logRegModel = LogisticRegression()
logRegModel.fit(X_train_bare, Y_train_bare)
Y_pred = logRegModel.predict(X_test_bare)
acc_log = round(logRegModel.score(X_train_bare, Y_train_bare) * 100, 2)

print("The training accuracy of our model is", acc_log)

Model 2: Support Vector Machines (aka SVC)

In [None]:
#SVC

from sklearn.svm import SVC, LinearSVC

svc = SVC()

svc.fit(X_train_bare, Y_train_bare)
Y_pred = svc.predict(X_test_bare)
acc_svc = round(svc.score(X_train_bare, Y_train_bare) * 100, 2)

print("The training accuracy of our model is", acc_svc)

Model 3: Linear SVC

In [None]:
#Linear SVC

linear_svc = LinearSVC()
linear_svc.fit(X_train_bare, Y_train_bare)
Y_pred = linear_svc.predict(X_test_bare)
acc_linear_svc = round(linear_svc.score(X_train_bare, Y_train_bare) * 100, 2)
acc_linear_svc

Model 4: K-Nearest Neighbors (KNN)

In [None]:
#KNN

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train_bare, Y_train_bare)
Y_pred = knn.predict(X_test_bare)
acc_knn = round(knn.score(X_train_bare, Y_train_bare) * 100, 2)
acc_knn

Model 5: Random Forest

In [None]:
#Random Forest

from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train_bare, Y_train_bare)
Y_pred = random_forest.predict(X_test_bare)
acc_random_forest = round(random_forest.score(X_train_bare, Y_train_bare) * 100, 2)
acc_random_forest

Model 6: Decision Tree Classifier

In [None]:
#Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train_bare, Y_train_bare)
Y_pred = decision_tree.predict(X_test_bare)
acc_decision_tree = round(decision_tree.score(X_train_bare, Y_train_bare) * 100, 2)
acc_decision_tree
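
One caveat about all six numbers: each model is scored on the very rows it was trained on, so a flexible model like a decision tree can look near-perfect just by memorizing them. A fairer comparison is cross-validation, sketched below. The data here is synthetic (an assumption, generated with make_classification); in the notebook you would pass X_train_bare and Y_train_bare instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption); swap in X_train_bare / Y_train_bare
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

tree_model = DecisionTreeClassifier(random_state=0)

# Accuracy on the same rows the tree was fit on
train_acc = tree_model.fit(X, y).score(X, y)

# Mean accuracy over 5 folds, each scored on rows held out from fitting
cv_acc = cross_val_score(tree_model, X, y, cv=5).mean()

print(round(train_acc * 100, 2), round(cv_acc * 100, 2))
```

The gap between the two printed numbers gives a rough idea of how much of the training accuracy comes from memorization rather than from patterns that carry over to new passengers.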