<a href="https://www.kaggle.com/code/amirmotefaker/titanic-machine-learning-from-disaster-best?scriptVersionId=144953116" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Titanic - Machine Learning from Disaster
- The sinking of the Titanic is one of the most infamous shipwrecks in history.

- On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

- While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

- In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).


# Import Libraries

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm # Statistical computations and models for Python

from statsmodels.nonparametric.kde import KDEUnivariate
from statsmodels.nonparametric import smoothers_lowess

from pandas import Series, DataFrame

from patsy import dmatrices # A Python package for describing statistical models and for building design matrices.

from sklearn import datasets, svm

# from KaggleAux import predict as ka # see github.com/agconti/kaggleaux for more details
# KaggleAux is a collection of statistical tools to aid Data Science competitors in Kaggle Competitions.
# pip install gnureadline, ipython, matplotlib, mock, nose, numpy, pandas, pyparsing, python-dateutil, pytz, six, wsgiref, scipy, statsmodels, patsy

import matplotlib.pyplot as plt
%matplotlib inline

# Data Handling

In [None]:
# Let's read our data in using pandas:
df = pd.read_csv("/kaggle/input/titanic/train.csv") 

In [None]:
# Show an overview of our data:
df

#### Looking at the data frame above:

- First, it lets us know we have 891 observations, or passengers, to analyze here:

    - Int64Index: 891 entries, 0 to 890

- Next, it shows us all of the columns in the DataFrame. Each column tells us something about each of our observations, like their name, sex, or age. These columns are called the features of our dataset.


- After each feature it lets us know how many values it contains. While most of our features have complete data on every observation, like the survived feature here:

    - survived    891  non-null values 

- some are missing information, like the age feature:

    - age         714  non-null values 


- These missing values are represented as NaNs.
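
The per-column counts described above can also be computed directly. The sketch below uses a small hand-made frame (the values are invented for illustration and are not rows from the Titanic data) together with the pandas methods `notna()` and `isna()`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data; the values are made up.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

non_null_counts = toy.notna().sum()  # number of non-null values per column
missing_counts = toy.isna().sum()    # number of NaN values per column

print(non_null_counts)
print(missing_counts)
```

Running the same two calls on `df` should give the full picture for the real data.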

#### Take care of missing values:

- The 'Cabin' feature is missing for most passengers, and 'Ticket' is a free-form ticket number, so neither adds much value to our analysis. To handle this, we will drop both from the data frame.


- To do that we'll use this line of code to drop the features entirely:

    - df = df.drop(['Ticket','Cabin'], axis=1)


- This line of code then removes every observation (row) that still has a NaN in any of the remaining columns/features:

    - df = df.dropna()

- Now we have a clean and tidy dataset that is ready for analysis. Because .dropna() removes an observation even if only one of its features is NaN, it would have removed most of our dataset had we not dropped the sparsely populated Cabin feature first.
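
To see why the order of the two steps matters, here is a minimal sketch on a toy frame. The values are invented for illustration; only the column names mirror the real data.

```python
import numpy as np
import pandas as pd

# Toy frame: 'Cabin' is mostly missing, 'Age' is missing once.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Calling dropna() straight away discards every row with any NaN.
rows_dropna_only = len(toy.dropna())

# Dropping the sparse column first keeps far more rows.
rows_drop_then_dropna = len(toy.drop(["Cabin"], axis=1).dropna())

print(rows_dropna_only, rows_drop_then_dropna)
```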

In [None]:
df = df.drop(['Ticket','Cabin'], axis=1)

In [None]:
# Remove NaN values
df = df.dropna() 

In [None]:
df

# Take a Look at your data graphically

### Distribution of Survival, (1 = Survived)


In [None]:
# specifies the parameters of our graphs
fig = plt.figure(figsize=(18,6), dpi=1600) 
alpha=alpha_scatterplot = 0.2 
alpha_bar_chart = 0.55

# lets us plot many differently shaped graphs together
ax1 = plt.subplot2grid((2,3),(0,0))

# plots a bar graph of those who survived vs those who did not.
df.Survived.value_counts().plot(kind='bar', alpha=alpha_bar_chart)

# this nicely sets the margins in matplotlib to deal with a recent bug 1.3.1
ax1.set_xlim(-1, 2)

# puts a title on our graph
plt.title("Distribution of Survival, (1 = Survived)")

### Survival by Age,  (1 = Survived)


In [None]:
# specifies the parameters of our graphs
fig = plt.figure(figsize=(18,6), dpi=1600) 
alpha=alpha_scatterplot = 0.2 
alpha_bar_chart = 0.55

plt.subplot2grid((2,3),(0,1))
plt.scatter(df.Survived, df.Age, alpha=alpha_scatterplot)

# sets the y axis label
plt.ylabel("Age")

# formats the grid line style of our graphs                          
plt.grid(visible=True, which='major', axis='y')  
plt.title("Survival by Age,  (1 = Survived)")

### Class Distribution


In [None]:
# specifies the parameters of our graphs
fig = plt.figure(figsize=(18,6), dpi=1600) 
alpha=alpha_scatterplot = 0.2 
alpha_bar_chart = 0.55

ax3 = plt.subplot2grid((2,3),(0,2))
df.Pclass.value_counts().plot(kind="barh", alpha=alpha_bar_chart)
ax3.set_ylim(-1, len(df.Pclass.value_counts()))
plt.title("Class Distribution")

### Age Distribution within classes


In [None]:
# specifies the parameters of our graphs
fig = plt.figure(figsize=(18,6), dpi=1600) 
alpha=alpha_scatterplot = 0.2 
alpha_bar_chart = 0.55

plt.subplot2grid((2,3),(1,0), colspan=2)

# plots a kernel density estimate of passenger age for each of the three classes
df.Age[df.Pclass == 1].plot(kind='kde')    
df.Age[df.Pclass == 2].plot(kind='kde')
df.Age[df.Pclass == 3].plot(kind='kde')

# sets the x axis label
plt.xlabel("Age")    
plt.title("Age Distribution within classes")

# sets our legend for our graph.
plt.legend(('1st Class', '2nd Class','3rd Class'),loc='best')

### Passengers per boarding location


In [None]:
# specifies the parameters of our graphs
fig = plt.figure(figsize=(18,6), dpi=1600) 
alpha=alpha_scatterplot = 0.2 
alpha_bar_chart = 0.55

ax5 = plt.subplot2grid((2,3),(1,2))
df.Embarked.value_counts().plot(kind='bar', alpha=alpha_bar_chart)
ax5.set_xlim(-1, len(df.Embarked.value_counts()))

# puts a title on our graph
plt.title("Passengers per boarding location")

# Exploratory Visualization:

- The point of this competition is to predict if an individual will survive based on the features in the data like:

    - Traveling Class (called pclass in the data)
    - Sex
    - Age
    - Fare Price

- Let’s see if we can gain a better understanding of who survived and died.

### A bar graph of those who survived versus those who died.

In [None]:
# create a single figure and axes (a separate plt.figure() call would leave an empty extra figure)
fig, ax = plt.subplots(figsize=(6,4))
df.Survived.value_counts().plot(kind='barh', color="blue", alpha=.65)
ax.set_ylim(-1, len(df.Survived.value_counts())) 
plt.title("Survival Breakdown (1 = Survived, 0 = Died)")

#### Let’s break the previous graph down by gender

In [None]:
fig = plt.figure(figsize=(18,6))

#create a plot of two subsets, male and female, of the survived variable.
#After we do that we call value_counts() so it can be easily plotted as a bar graph. 
#'barh' is just a horizontal bar graph
df_male = df.Survived[df.Sex == 'male'].value_counts().sort_index()
df_female = df.Survived[df.Sex == 'female'].value_counts().sort_index()

ax1 = fig.add_subplot(121)
df_male.plot(kind='barh',label='Male', alpha=0.55)
df_female.plot(kind='barh', color='#FA2379',label='Female', alpha=0.55)
plt.title("Who Survived? with respect to Gender, (raw value counts) "); plt.legend(loc='best')
ax1.set_ylim(-1, 2) 

#adjust graph to display the proportions of survival by gender
ax2 = fig.add_subplot(122)
(df_male/float(df_male.sum())).plot(kind='barh',label='Male', alpha=0.55)  
(df_female/float(df_female.sum())).plot(kind='barh', color='#FA2379',label='Female', alpha=0.55)
plt.title("Who Survived proportionally? with respect to Gender"); plt.legend(loc='best')

ax2.set_ylim(-1, 2)

- Here it’s clear that although far more men than women died in raw counts, women had a much higher survival rate (roughly 75%) than men (roughly 20%).

- Can we capture more of the structure by using Pclass? 
- Here we will bucket classes as the lowest class or any of the high classes (classes 1 - 2). 3 is the lowest class. 
- Let’s break it down by Gender and what Class they were traveling in.
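
The same breakdowns can be checked numerically before plotting. The sketch below assumes a toy frame with invented values, and `ClassBucket` is a hypothetical helper column that is not part of the dataset. It relies on the fact that the mean of a 0/1 column is the share of 1s.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the Titanic data.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Pclass": [3, 3, 1, 3, 1, 2],
    "Survived": [0, 0, 1, 0, 1, 1],
})

# Survival rate by gender.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

# Bucket the classes: 3 is the low class, 1 and 2 are the high classes.
toy["ClassBucket"] = toy["Pclass"].map(lambda c: "low" if c == 3 else "high")

# Survival rate for each gender/class-bucket pair.
rate_by_sex_class = toy.groupby(["Sex", "ClassBucket"])["Survived"].mean()

print(rate_by_sex)
print(rate_by_sex_class)
```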

### Who Survived? with respect to Gender and Class

In [None]:
fig = plt.figure(figsize=(18,4), dpi=1600)
alpha_level = 0.65

# Building on the previous code, here we create an additional subset within the gender subset 
# we created for the survived variable. That's a lot of subsets. After we do that we call value_counts() 
# so it can be easily plotted as a bar graph. 
# this is repeated for each gender class pair.
ax1=fig.add_subplot(141)
female_highclass = df.Survived[df.Sex == 'female'][df.Pclass != 3].value_counts()
female_highclass.plot(kind='bar', label='female, highclass', color='#FA2479', alpha=alpha_level)
ax1.set_xticklabels(["Survived", "Died"], rotation=0)
ax1.set_xlim(-1, len(female_highclass))
plt.title("Who Survived? with respect to Gender and Class"); plt.legend(loc='best')

ax2=fig.add_subplot(142, sharey=ax1)
female_lowclass = df.Survived[df.Sex == 'female'][df.Pclass == 3].value_counts()
female_lowclass.plot(kind='bar', label='female, low class', color='pink', alpha=alpha_level)
ax2.set_xticklabels(["Died","Survived"], rotation=0)
ax2.set_xlim(-1, len(female_lowclass))
plt.legend(loc='best')

ax3=fig.add_subplot(143, sharey=ax1)
male_lowclass = df.Survived[df.Sex == 'male'][df.Pclass == 3].value_counts()
male_lowclass.plot(kind='bar', label='male, low class',color='lightblue', alpha=alpha_level)
ax3.set_xticklabels(["Died","Survived"], rotation=0)
ax3.set_xlim(-1, len(male_lowclass))
plt.legend(loc='best')

ax4=fig.add_subplot(144, sharey=ax1)
male_highclass = df.Survived[df.Sex == 'male'][df.Pclass != 3].value_counts()
male_highclass.plot(kind='bar', label='male, highclass', alpha=alpha_level, color='steelblue')
ax4.set_xticklabels(["Died","Survived"], rotation=0)
ax4.set_xlim(-1, len(male_highclass))
plt.legend(loc='best')

- Now we have a lot more information on who survived and died in the tragedy. 
- With this deeper understanding, we are better equipped to build better, more insightful models.
- This is a typical process in interactive data analysis.

    - First, you start small and understand the most basic relationships and slowly increment the complexity of your analysis as you discover more and more about the data you’re working with. 

In [None]:
fig = plt.figure(figsize=(18,12), dpi=1600)
a = 0.65

# Step 1
ax1 = fig.add_subplot(341)
df.Survived.value_counts().plot(kind='bar', color="blue", alpha=a)
ax1.set_xlim(-1, len(df.Survived.value_counts()))
plt.title("Step. 1")

# Step 2
# Who Survived? with respect to Gender
ax2 = fig.add_subplot(345)
df.Survived[df.Sex == 'male'].value_counts().plot(kind='bar',label='Male')
df.Survived[df.Sex == 'female'].value_counts().plot(kind='bar', color='#FA2379',label='Female')
ax2.set_xlim(-1, 2)
plt.title("Step. 2 \nWho Survived? with respect to Gender."); plt.legend(loc='best')

# Who Survived proportionally?
ax3 = fig.add_subplot(346)
(df.Survived[df.Sex == 'male'].value_counts()/float(df.Sex[df.Sex == 'male'].size)).plot(kind='bar',label='Male')
(df.Survived[df.Sex == 'female'].value_counts()/float(df.Sex[df.Sex == 'female'].size)).plot(kind='bar', color='#FA2379',label='Female')
ax3.set_xlim(-1,2)
plt.title("Who Survived proportionally?"); plt.legend(loc='best')


# Step 3
# Who Survived? with respect to Gender and Class
ax4 = fig.add_subplot(349)
female_highclass = df.Survived[df.Sex == 'female'][df.Pclass != 3].value_counts()
female_highclass.plot(kind='bar', label='female highclass', color='#FA2479', alpha=a)
ax4.set_xticklabels(["Survived", "Died"], rotation=0)
ax4.set_xlim(-1, len(female_highclass))
plt.title("Who Survived? with respect to Gender and Class"); plt.legend(loc='best')

ax5 = fig.add_subplot(3,4,10, sharey=ax1)
female_lowclass = df.Survived[df.Sex == 'female'][df.Pclass == 3].value_counts()
female_lowclass.plot(kind='bar', label='female, low class', color='pink', alpha=a)
ax5.set_xticklabels(["Died","Survived"], rotation=0)
ax5.set_xlim(-1, len(female_lowclass))
plt.legend(loc='best')

ax6 = fig.add_subplot(3,4,11, sharey=ax1)
male_lowclass = df.Survived[df.Sex == 'male'][df.Pclass == 3].value_counts()
male_lowclass.plot(kind='bar', label='male, low class',color='lightblue', alpha=a)
ax6.set_xticklabels(["Died","Survived"], rotation=0)
ax6.set_xlim(-1, len(male_lowclass))
plt.legend(loc='best')

ax7 = fig.add_subplot(3,4,12, sharey=ax1)
male_highclass = df.Survived[df.Sex == 'male'][df.Pclass != 3].value_counts()
male_highclass.plot(kind='bar', label='male highclass', alpha=a, color='steelblue')
ax7.set_xticklabels(["Died","Survived"], rotation=0)
ax7.set_xlim(-1, len(male_highclass))
plt.legend(loc='best')

# Supervised Machine Learning

### Logistic Regression

- Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables, which are usually (but not necessarily) continuous, by using probability scores as the predicted values of the dependent variable. 
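
As a rough illustration of how a linear combination of features turns into a probability score, here is a minimal sketch of the logistic function. The coefficients below are hypothetical and chosen only for illustration; they are not the values fitted later in this notebook.

```python
import numpy as np

def logistic(z):
    """Map a linear predictor z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, for illustration only.
intercept = 1.0
coef_male = -2.5
coef_age = -0.04

def survival_probability(is_male, age):
    """Predicted probability of survival under the hypothetical coefficients."""
    z = intercept + coef_male * is_male + coef_age * age
    return logistic(z)

p_female_30 = survival_probability(is_male=0, age=30)
p_male_30 = survival_probability(is_male=1, age=30)
print(p_female_30, p_male_30)
```

A negative coefficient lowers the linear predictor and therefore the predicted probability, which is how the signs in a fitted model's summary table can be read.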

#### Logistic Regression Model

In [None]:
# model formula for Logistic Regression Model

# here the ~ sign is an = sign, and the features of our dataset
# are written as a formula to predict survived. The C() lets our 
# regression know that those variables are categorical.
# Ref: http://patsy.readthedocs.org/en/latest/formulas.html
formula = 'Survived ~ C(Pclass) + C(Sex) + Age + SibSp  + C(Embarked)' 

# create a results dictionary to hold our regression results for easy analysis later        
results = {} 
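
To get a feel for what marking a variable as categorical does, the sketch below builds a comparable design matrix by hand. It uses `pandas.get_dummies` as a stand-in for patsy's `C()`, so the column names differ from the ones patsy generates (such as `C(Pclass)[T.3]`), and the toy values are invented.

```python
import pandas as pd

# Toy data with made-up values.
toy = pd.DataFrame({"Pclass": [1, 2, 3, 3], "Age": [38.0, 27.0, 22.0, 4.0]})

# One 0/1 column per class level; the first level (class 1) is dropped
# and acts as the reference level.
dummies = pd.get_dummies(toy["Pclass"], prefix="Pclass", drop_first=True, dtype=int)

# Assemble the design matrix: intercept, numeric feature, dummy columns.
design = pd.concat([toy[["Age"]], dummies], axis=1)
design.insert(0, "Intercept", 1.0)

print(design)
```

Each dummy coefficient is then interpreted relative to the dropped reference level.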

In [None]:
# create a regression friendly dataframe using patsy's dmatrices function
y,x = dmatrices(formula, data=df, return_type='dataframe')

# instantiate our model
model = sm.Logit(y,x)

# fit our model to the training data
res = model.fit()

# save the result for outputting predictions later
results['Logistic'] = [res, formula]
res.summary()

In [None]:
# Plot Predictions Vs Actual
plt.figure(figsize=(18,4));
plt.subplot(121, facecolor="#DBDBDB")

# Generate predictions from our fitted model
ypred = res.predict(x)
plt.plot(x.index, ypred, 'bo', x.index, y, 'mo', alpha=.25);
plt.grid(color='white', linestyle='dashed')
plt.title('Logit predictions\nBlue: fitted/predicted values, Magenta: actual values');

# Residuals
ax2 = plt.subplot(122, facecolor="#DBDBDB")
plt.plot(res.resid_dev, 'r-')
plt.grid(color='white', linestyle='dashed')
ax2.set_xlim(-1, len(res.resid_dev))
plt.title('Logit Residuals');

#### Look at the predictions we generated graphically:

In [None]:
# Distribution of our Predictions
fig = plt.figure(figsize=(18,9), dpi=1600)
a = .2

fig.add_subplot(221, facecolor="#DBDBDB")
kde_res = KDEUnivariate(res.predict())
kde_res.fit()
plt.plot(kde_res.support,kde_res.density)
plt.fill_between(kde_res.support,kde_res.density, alpha=a)
plt.title("Distribution of our Predictions")

In [None]:
# The Change of Survival Probability by Gender (1 = Male)
fig = plt.figure(figsize=(18,9), dpi=1600)
a = .2

fig.add_subplot(222, facecolor="#DBDBDB")
plt.scatter(res.predict(),x['C(Sex)[T.male]'] , alpha=a)
plt.grid(visible=True, which='major', axis='x')
plt.xlabel("Predicted chance of survival")
plt.ylabel("Gender Bool")
plt.title("The Change of Survival Probability by Gender (1 = Male)")

In [None]:
# The Change of Survival Probability by Lower Class (1 = 3rd Class)
fig = plt.figure(figsize=(18,9), dpi=1600)
a = .2

fig.add_subplot(223, facecolor="#DBDBDB")
plt.scatter(res.predict(),x['C(Pclass)[T.3]'] , alpha=a)
plt.xlabel("Predicted chance of survival")
plt.ylabel("Class Bool")
plt.grid(visible=True, which='major', axis='x')
plt.title("The Change of Survival Probability by Lower Class (1 = 3rd Class)")

In [None]:
# The Change of Survival Probability by Age
fig = plt.figure(figsize=(18,9), dpi=1600)
a = .2

fig.add_subplot(224, facecolor="#DBDBDB")
plt.scatter(res.predict(),x.Age , alpha=a)
plt.grid(True, linewidth=0.15)
plt.title("The Change of Survival Probability by Age")
plt.xlabel("Predicted chance of survival")
plt.ylabel("Age")

# Use our model to predict the test set values

#### Read the test data

In [None]:
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")

#### Examine our data frame

In [None]:
test_data

- Add our dependent variable to our test data as a placeholder column. (It is left out by Kaggle because it is the value you are trying to predict; the value 1.23 below is arbitrary.)

In [None]:
test_data['Survived'] = 1.23

#### Our binned results data:

In [None]:
results

#### Use your model to make predictions on our test set (Only on Kaggle with kaggleaux)

In [None]:
# compared_results = ka.predict(test_data, results, 'Logit')
# compared_results = Series(compared_results)  # convert our model to a series for easy output

#### Output and submit to kaggle (Only on Kaggle with kaggleaux)

In [None]:
# compared_results.to_csv("data/output/logitregres.csv")

In [None]:
# Create an acceptable formula for our machine learning algorithms
formula_ml = 'Survived ~ C(Pclass) + C(Sex) + Age + SibSp + Parch + C(Embarked)'

# Support Vector Machine (SVM)

In [None]:
# Set plotting parameters
plt.figure(figsize=(8,6))

In [None]:
# Create a regression friendly data frame
y, x = dmatrices(formula_ml, data=df, return_type='matrix')

In [None]:
# select which features we would like to analyze
# try changing the selection here for different output.
# Choose: [2,3] - pretty sweet decision boundaries (DBs), [3,1] - standard DBs, [7,3] - very cool DBs,
# [3,6] - very long, complex DBs, could take over an hour to calculate!
feature_1 = 2
feature_2 = 3

X = np.asarray(x)
X = X[:,[feature_1, feature_2]]  


y = np.asarray(y)
# needs to be 1-dimensional so we flatten; it comes out of dmatrices as a 2-D column.
y = y.flatten()      

n_sample = len(X)

np.random.seed(0)
order = np.random.permutation(n_sample)

X = X[order]
y = y[order].astype(float)

In [None]:
# hold out 10% of the shuffled sample as a test set (a single train/test split, not a full cross validation)
ninety_percent_of_sample = int(.9 * n_sample)
X_train = X[:ninety_percent_of_sample]
y_train = y[:ninety_percent_of_sample]
X_test = X[ninety_percent_of_sample:]
y_test = y[ninety_percent_of_sample:]
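
The split above holds out a single 10% slice. For comparison, a k-fold cross validation rotates the held-out slice so that every observation is used for testing exactly once. The following is a minimal NumPy-only sketch; `k_fold_indices` is a hypothetical helper written for this illustration, not a library function.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) index arrays for a shuffled k-fold split."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, n_folds)
    for i in range(n_folds):
        test_idx = folds[i]
        # Training indices are all folds except the held-out one.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, n_folds=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

A model would be fitted on each `train_idx` and scored on the matching `test_idx`, and the scores averaged.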

In [None]:
# create a list of the types of kernels we will use for our analysis
types_of_kernels = ['linear', 'rbf', 'poly']

In [None]:
# specify our color map for plotting the results
color_map = plt.cm.RdBu_r

In [None]:
# fit the model
for fig_num, kernel in enumerate(types_of_kernels):
    clf = svm.SVC(kernel=kernel, gamma=3)
    clf.fit(X_train, y_train)

    plt.figure(fig_num)
    plt.scatter(X[:, 0], X[:, 1], c=y, zorder=10, cmap=color_map)

    # circle out the test data (an explicit edge color keeps the hollow markers visible)
    plt.scatter(X_test[:, 0], X_test[:, 1], s=80, facecolors='none', edgecolors='k', zorder=10)
    
    plt.axis('tight')
    x_min = X[:, 0].min()
    x_max = X[:, 0].max()
    y_min = X[:, 1].min()
    y_max = X[:, 1].max()

    XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
    Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])

    # put the result into a color plot
    Z = Z.reshape(XX.shape)
    plt.pcolormesh(XX, YY, Z > 0, cmap=color_map)
    plt.contour(XX, YY, Z, colors=['k', 'k', 'k'], linestyles=['--', '-', '--'],
               levels=[-.5, 0, .5])

    plt.title(kernel)
    plt.show()

- The color map is RdBu_r, which runs from blue (low values) to red (high values), so points and regions in red correspond to survival (1), while those in blue correspond to passengers who did not survive (0).
- Check out the graph for the linear kernel. It created its decision boundary right on 50%! That guess from earlier turned out to be pretty good. As you can see, the remaining decision boundaries are much more complex than our original linear decision boundary. These more complex boundaries may be able to capture more structure in the dataset if that structure exists, and so might create a more powerful predictive model.

In [None]:
# Here you can output whichever result you would like by changing the kernel and clf.predict lines.
# Change kernel here to poly, rbf or linear;
# adjusting the gamma level also changes how tightly the model is fitted.
clf = svm.SVC(kernel='poly', gamma=3).fit(X_train, y_train)
# note: dmatrices drops test rows that contain NaN (e.g. a missing Age)
y, x = dmatrices(formula_ml, data=test_data, return_type='dataframe')

# Predict on the same two design-matrix columns the model was trained on
# (feature_1 and feature_2 are column positions). To explore other feature
# pairs, change them above and refit rather than changing only this line.
res_svm = clf.predict(x.iloc[:, [feature_1, feature_2]])

res_svm = DataFrame(res_svm, columns=['Survived'])
res_svm.to_csv("/kaggle/working/svm_poly.csv") # saves the results for you, change the name as you please.

# Random Forest

In [None]:
# import the machine learning library that holds the random forest
import sklearn.ensemble as ske

# Create the random forest model and fit the model to our training data
y, x = dmatrices(formula_ml, data=df, return_type='dataframe')
# RandomForestClassifier expects a 1-dimensional NumPy array, so we convert
y = np.asarray(y).ravel()
# instantiate and fit our model
results_rf = ske.RandomForestClassifier(n_estimators=100).fit(x, y)

# Score the results (on the same data the model was fitted to, so this is training accuracy)
score = results_rf.score(x, y)
print("Mean accuracy of Random Forest predictions on the training data was: {0}".format(score))
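
Accuracy measured on the data a model was fitted to tends to be optimistic, especially for a random forest. The sketch below shows one way to compare training and held-out accuracy. It runs on synthetic data generated purely for illustration (the `X_demo` and `y_demo` arrays are assumptions, not Titanic features), and the 80/20 split and `random_state` values are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends on the first feature plus noise.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# Hold out 20% of the rows for evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

train_acc = forest.score(X_tr, y_tr)  # accuracy on the data used for fitting
test_acc = forest.score(X_te, y_te)   # accuracy on rows the model has not seen

print("train accuracy: {0:.3f}, held-out accuracy: {1:.3f}".format(train_acc, test_acc))
```

Applying the same pattern to `x` and `y` from the cell above would give a more honest estimate of how the forest might do on unseen passengers.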