## Titanic Data

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One reason the shipwreck caused such loss of life was that there were not enough lifeboats for the passengers and crew. Although luck played some part in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and upper-class passengers.

In this project, we will analyze what sorts of people were likely to survive. In particular, we will apply the tools of machine learning to predict which passengers survived the tragedy.

A good first step is to think logically about the columns and what we're trying to predict. Which variables might logically affect the Survived outcome? (Reading more about the Titanic might help here.)

We know that women and children were more likely to survive. Thus, Age and Sex are probably good predictors. It's also logical to think that passenger class might affect the outcome, as first class cabins were closer to the deck of the ship. Fare is tied to passenger class, and will probably be highly correlated with it, but might add some additional information. Number of siblings and parents/children will probably be correlated with survival one way or the other, as either there are more people to help you, or more people to think about and try to save.
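
One way to check hunches like these is to compare survival rates across groups. The sketch below uses a small made-up frame (`toy`, which is not the real data) with the same column names as the Titanic training set; on the real data, the same `groupby` calls would presumably be run on `train` once it is loaded.

```python
import pandas as pd

# Hypothetical toy rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    'Sex':      ['female', 'female', 'male', 'male', 'male', 'female'],
    'Pclass':   [1, 3, 3, 1, 3, 2],
    'Survived': [1, 1, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the share of survivors in each group.
rate_by_sex = toy.groupby('Sex')['Survived'].mean()
rate_by_class = toy.groupby('Pclass')['Survived'].mean()

print(rate_by_sex)
print(rate_by_class)
```

A large gap between groups suggests the column carries signal; a small gap on a sample this tiny would mean nothing either way.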

There's a less clear link between survival and columns like Embarked (maybe there is some information about how close to the top of the ship people's cabins were here), Ticket, and Name.

This step is generally known as acquiring domain knowledge, and it is fairly important to most machine learning tasks. We're looking to engineer the features so that we maximize the information we have about what we're trying to predict.
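
As a small illustration of feature engineering, the sibling and parent/child counts can be combined into one family-size feature. This is only a sketch: `FamilySize` and `IsAlone` are hypothetical column names that the notebook itself does not create, the rows in `toy` are made up, and `SibSp` and `Parch` are assumed to be the names of the two count columns.

```python
import pandas as pd

# Hypothetical toy rows; SibSp = siblings/spouses aboard, Parch = parents/children aboard.
toy = pd.DataFrame({'SibSp': [1, 0, 3], 'Parch': [0, 0, 2]})

# Family size counts the passenger plus all relatives travelling with them.
toy['FamilySize'] = toy['SibSp'] + toy['Parch'] + 1

# Flag passengers travelling without any relatives.
toy['IsAlone'] = (toy['FamilySize'] == 1).astype(int)

print(toy)
```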

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
train = pd.read_csv('../data/titanic_train.csv')
train.shape

In [None]:
train.head()

In [None]:
train.info()

In [None]:
train.describe()

- The Age column contains only 714 non-null values, although the dataset has 891 rows.
- Clearly, some rows have no value in the Age column. We cannot simply drop these rows, because we need as much data as possible to train a good model.
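
The missing values can also be counted directly rather than read off the summary output. A minimal sketch on a made-up frame (`toy` is hypothetical); the same two calls should work on `train`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with two missing ages.
toy = pd.DataFrame({
    'Age':  [22.0, np.nan, 26.0, np.nan],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

# Number of missing entries per column.
missing_count = toy.isnull().sum()

# Share of missing entries per column (mean of a boolean mask).
missing_share = toy.isnull().mean()

print(missing_count)
print(missing_share)
```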

In [None]:
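#CORRELATE NUMERIC COLUMNS ONLY; RECENT PANDAS RAISES AN ERROR ON TEXT COLUMNS OTHERWISE
sns.heatmap(train.corr(numeric_only=True), annot=True)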

In [None]:
train.corr(numeric_only=True)["Survived"]

In [None]:
sns.set_style('whitegrid')
#COUNTPLOT TO SEE SURVIVAL RATES(GENDER-BASED)
sns.countplot(x='Survived', data=train, hue='Sex', palette='RdBu_r')

In [None]:
#COUNTPLOT TO SEE SURVIVAL RATES(CLASS BASED)
sns.countplot(x='Survived', data=train, hue='Pclass', palette='rainbow')

In [None]:
train['Age'].hist(bins=30, color='darkred', alpha=0.5)

In [None]:
sns.countplot(x='SibSp', data=train)

In [None]:
train[train['SibSp'] == 0]['Age'].hist(bins=30)

In [None]:
train['Fare'].hist(color='g', bins=50, figsize=(12, 6))

In [None]:
train[train['Fare']<70]['Fare'].hist(color='g', bins=50, figsize=(12, 6))

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(x='Pclass', y='Age', data=train)

In [None]:
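#FILL MEAN AGE IN EACH EMPTY SPACE IN AGE COLUMN ACCORDING TO CLASS
def fill_age(col):
    #ACCESS BY LABEL; POSITIONAL col[0] ON A LABELLED SERIES IS DEPRECATED IN RECENT PANDAS
    Age = col['Age']
    Pclass = col['Pclass']

    if pd.isnull(Age):
        #APPROXIMATE AGES PER CLASS, READ OFF THE BOXPLOT ABOVE
        if Pclass == 1:
            return 38
        elif Pclass == 2:
            return 30
        elif Pclass == 3:
            return 25

    #KNOWN AGES (AND MISSING AGES WITH AN UNEXPECTED CLASS) ARE RETURNED UNCHANGED
    return Age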

In [None]:
train['Age'] = train[['Age','Pclass']].apply(fill_age,axis=1)
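
The hard-coded ages above can alternatively be derived from the data. The sketch below is a swapped-in variant, not the notebook's method: it fills each missing age with the median age of the passenger's class using `groupby(...).transform`. The frame `toy` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: one missing age in class 1 and one in class 3.
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age':    [40.0, 36.0, np.nan, 20.0, np.nan, 30.0],
})

# Median age of each row's class, aligned to the original index (NaNs are skipped).
class_median = toy.groupby('Pclass')['Age'].transform('median')

# Replace only the missing ages with their class median.
toy['Age'] = toy['Age'].fillna(class_median)

print(toy)
```

This avoids a row-wise `apply` and keeps the fill values in sync with the data if the dataset changes.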

In [None]:
train['Age'].head(20)

In [None]:
del train['Cabin']

In [None]:
train.dropna(inplace=True)

In [None]:
#CONVERSION OF SEX INTO 0 AND 1
pd.get_dummies(train['Sex']).head()

In [None]:
pd.get_dummies(train['Sex'], drop_first=True).head()

In [None]:
#STORE THE DUMMY COLUMN 'male' (MALE=1, FEMALE=0) IN 'sex'; IT IS CONCATENATED TO train LATER
sex = pd.get_dummies(train['Sex'], drop_first=True, dtype=int)
sex.head()

In [None]:
train['Embarked'].value_counts()

In [None]:
embark = pd.get_dummies(train['Embarked'], drop_first=True)

In [None]:
embark.head()
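
To see what `drop_first=True` does, the sketch below encodes a made-up three-value column with and without it. The series `ports` is hypothetical and `dtype=int` is added only so the output prints as 0/1.

```python
import pandas as pd

# Hypothetical toy column with three categories.
ports = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

# One indicator column per category: C, Q, S.
full = pd.get_dummies(ports, dtype=int)

# The first category (C) is dropped; a row with Q=0 and S=0 therefore means C.
reduced = pd.get_dummies(ports, drop_first=True, dtype=int)

print(full)
print(reduced)
```

Dropping one column loses no information and avoids perfectly collinear inputs, which matters for linear models such as logistic regression.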

In [None]:
train.drop(['Sex', 'PassengerId', 'Name', 'Ticket'], axis=1, inplace=True)

In [None]:
train.head()

In [None]:
train = pd.concat([train, sex, embark], axis=1)

In [None]:
del train['Embarked']

In [None]:
train.head(30)

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(train.drop('Survived', axis=1), train['Survived'], test_size=0.3)
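
The split above is random, so the reported scores change from run to run. A hedged sketch of a reproducible variant on made-up data: `random_state=42` is an arbitrary seed, and `stratify` keeps the survivor share similar in both parts. Neither argument is used in the notebook itself.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 12 non-survivors and 8 survivors.
toy = pd.DataFrame({
    'Fare': list(range(20)),
    'Survived': [0] * 12 + [1] * 8,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop('Survived', axis=1),
    toy['Survived'],
    test_size=0.3,
    random_state=42,           # fixed seed so the split is repeatable
    stratify=toy['Survived'],  # preserve the class balance in both parts
)

print(len(X_tr), len(X_te))
print(y_te.value_counts())
```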

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logmodel = LogisticRegression(solver="saga", max_iter=100000)

In [None]:
logmodel.fit(X_train, Y_train)

In [None]:
prediction = logmodel.predict(X_test)

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(Y_test, prediction))

**According to the proposed model, given the information available when a passenger boarded the Titanic, we can be about 81% sure of the prediction that the person would die in the shipwreck.**

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
print(confusion_matrix(Y_test, prediction))
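
The figures in the classification report can be recomputed from the confusion matrix by hand. The sketch below uses made-up counts (`cm` is hypothetical, not this model's output) and assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, with class 0 (died) first.

```python
import numpy as np

# Hypothetical counts, NOT the output of the model above.
# Rows: true label (0 = died, 1 = survived); columns: predicted label.
cm = np.array([[50, 10],
               [15, 25]])

# For a 2x2 matrix, ravel() yields the cells row by row.
died_ok, died_missed, surv_missed, surv_ok = cm.ravel()

# Share of all predictions that were correct.
accuracy = (died_ok + surv_ok) / cm.sum()

# Of all passengers predicted to die, the share who actually died.
precision_died = died_ok / (died_ok + surv_missed)

# Of all passengers who actually died, the share the model caught.
recall_died = died_ok / (died_ok + died_missed)

print(accuracy, precision_died, recall_died)
```

A statement such as "X% sure that the person would die" corresponds to the precision for class 0, not to the overall accuracy.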