# Problem Statement
*- To predict which Titanic passengers survived*

# Description
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

# 1. Importing Libraries and Data

In [None]:
import numpy as np # For Linear Algebra
import pandas as pd # Data

# For visualization
import matplotlib.pyplot as plt 
plt.rc("font", size=14)
import seaborn as sns
sns.set(style="whitegrid", color_codes=True) # white background with grid lines for seaborn plots

# Model building
from sklearn import preprocessing
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split, cross_val_score

In [None]:
# Read CSV train data file into DataFrame
train_df = pd.read_csv("../input/train.csv")

# Read CSV test data file into DataFrame
test_df = pd.read_csv("../input/test.csv")

# preview train data
train_df.head()

In [None]:
print(f'The number of records in the train data is {train_df.shape[0]}.')
print(f'The number of records in the test data is {test_df.shape[0]}.')

In [None]:
# preview test data
test_df.head()

## Variable Description
<b>pclass</b>: A proxy for socio-economic status (SES)
                             1st = Upper
                            2nd = Middle
                            3rd = Lower

<b>age</b>: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

<b>sibsp</b>:  The dataset defines family relations in this way...
                            Sibling = brother, sister, stepbrother, stepsister
                            Spouse = husband, wife (mistresses and fiancés were ignored)

<b>parch</b>: The dataset defines family relations in this way...
                            Parent = mother, father
                            Child = daughter, son, stepdaughter, stepson
                            Some children travelled only with a nanny, therefore parch=0 for them.
         
 <b>Survived</b>: 0 = Did not survive
                                  1 = Survived

# 2. Data Quality & Missing Value Treatment


In [None]:
# check missing values in train data
train_df.isnull().sum()

### Age Column

In [None]:
# percent of missing "Age" 
print('Percent of missing "Age" records is %.2f%%' %((train_df['Age'].isnull().sum()/
                                                      train_df.shape[0])*100))

In [None]:
# Distribution of the Age column
%matplotlib inline
ax = train_df["Age"].hist(bins=15, density=True, stacked=True, color='teal', alpha=0.6)
train_df["Age"].plot(kind='density', color='teal')
ax.set(xlabel='Age')
plt.xlim(-10,85)

- Age is right-skewed, so we fill the missing values with the median, which is less sensitive to the long tail than the mean.
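
To illustrate why the median is the safer fill value under right skew, here is a minimal sketch. The ages are hypothetical values chosen for the example, not taken from the dataset.

```python
from statistics import mean, median

# Hypothetical right-skewed ages: mostly young adults plus a few much older passengers
ages = [18, 21, 22, 24, 25, 27, 30, 71, 80]

# The mean is pulled toward the long tail; the median stays with the bulk of the data
print(f"mean   = {mean(ages):.1f}")
print(f"median = {median(ages)}")
```

Filling gaps with the mean here would assign an age older than most of the sample, while the median stays typical.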

In [None]:
# Impute with the training median, computed once and reused for the test set
age_median = train_df["Age"].median(skipna=True)
train_df["Age"] = train_df["Age"].fillna(age_median)
test_df["Age"] = test_df["Age"].fillna(age_median)

### Embarked Column

In [None]:
# percent of missing "Embarked" 
print('Percent of missing "Embarked" records is %.2f%%' %((train_df['Embarked'].isnull().sum()/
                                                           train_df.shape[0])*100))

In [None]:
print('Passengers by Port of Embarkation: ')
print(train_df['Embarked'].value_counts())
sns.countplot(x='Embarked', data=train_df, palette='rainbow')
plt.show()

- Most passengers boarded at port S (Southampton), so we impute the missing values with S.

In [None]:
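# Impute with the most common training port, computed once and reused for the test set
embarked_mode = train_df['Embarked'].value_counts().idxmax()
train_df["Embarked"] = train_df["Embarked"].fillna(embarked_mode)
test_df["Embarked"] = test_df["Embarked"].fillna(embarked_mode)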
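
The Age and Embarked imputations follow the same pattern: learn the fill value from the training data, then apply it to both sets. One possible way to bundle that is sketched below. The helper `fit_fill_values` and the toy frames are hypothetical and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def fit_fill_values(train: pd.DataFrame) -> dict:
    """Learn imputation values from the training frame only (hypothetical helper)."""
    return {
        "Age": train["Age"].median(),
        "Embarked": train["Embarked"].mode()[0],
    }

# Toy frames standing in for train_df / test_df
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0],
                          "Embarked": ["S", "S", "C", None]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0],
                         "Embarked": [None, "Q"]})

fill_values = fit_fill_values(toy_train)

# DataFrame.fillna accepts a {column: value} mapping
toy_train = toy_train.fillna(fill_values)
toy_test = toy_test.fillna(fill_values)  # the test set reuses the training statistics
```

Keeping the statistics in one dictionary makes it harder to accidentally compute them from the test set.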

### Cabin Column

In [None]:
# percent of missing "Cabin" 
print('Percent of missing "Cabin" records is %.2f%%' %((train_df['Cabin'].isnull().sum()/
                                                        train_df.shape[0])*100))

- Since about 77% of the values in the Cabin column are missing, we drop this column.
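
The same decision can be expressed as a rule rather than made column by column. The sketch below drops every column whose missing fraction exceeds a cutoff. The 0.5 cutoff and the toy frame are assumptions for illustration and do not come from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only "Cabin" is mostly empty
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Cabin": [None, None, None, "C85"],
    "Pclass": [3, 1, 3, 1],
})

MAX_MISSING = 0.5  # assumed cutoff, not a value from the notebook

# isnull().mean() gives the fraction of missing values per column
missing_frac = toy.isnull().mean()
too_sparse = missing_frac[missing_frac > MAX_MISSING].index.tolist()
reduced = toy.drop(columns=too_sparse)
print(too_sparse)
```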

In [None]:
train_df.drop('Cabin', axis=1, inplace=True)
test_df.drop('Cabin', axis=1, inplace=True)

### Fare 

In [None]:
train_df.drop('Fare', axis=1, inplace=True)
test_df.drop('Fare', axis=1, inplace=True)

In [None]:
# check missing values in adjusted train data
train_df.isnull().sum()

## Additional Columns

In [None]:
## Create categorical variable for traveling alone
train_df['TravelAlone']=np.where((train_df["SibSp"]+train_df["Parch"])>0, 0, 1)
train_df.drop('SibSp', axis=1, inplace=True)
train_df.drop('Parch', axis=1, inplace=True)

test_df['TravelAlone']=np.where((test_df["SibSp"]+test_df["Parch"])>0, 0, 1)
test_df.drop('SibSp', axis=1, inplace=True)
test_df.drop('Parch', axis=1, inplace=True)
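
Because the same three steps run on both frames, they could be wrapped in a small function. This is only a sketch: the name `add_travel_alone` is hypothetical, and unlike the cell above it returns a copy instead of modifying the frame in place.

```python
import numpy as np
import pandas as pd

def add_travel_alone(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with TravelAlone added and SibSp/Parch removed."""
    out = df.copy()
    # 1 when the passenger has no siblings, spouse, parents, or children aboard
    out["TravelAlone"] = np.where((out["SibSp"] + out["Parch"]) > 0, 0, 1)
    return out.drop(columns=["SibSp", "Parch"])

# Toy rows: alone, with one sibling/spouse, with two parents/children
toy = pd.DataFrame({"SibSp": [0, 1, 0], "Parch": [0, 0, 2]})
result = add_travel_alone(toy)
print(result["TravelAlone"].tolist())  # [1, 0, 0]
```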

In [None]:
#create categorical variables and drop some variables
training=pd.get_dummies(train_df, columns=["Pclass","Embarked","Sex"])
training.drop('Sex_female', axis=1, inplace=True)
training.drop('PassengerId', axis=1, inplace=True)
training.drop('Name', axis=1, inplace=True)
training.drop('Ticket', axis=1, inplace=True)

final_train = training
final_train.head()

testing=pd.get_dummies(test_df, columns=["Pclass","Embarked","Sex"])
testing.drop('Sex_female', axis=1, inplace=True)
testing.drop('PassengerId', axis=1, inplace=True)
testing.drop('Name', axis=1, inplace=True)
testing.drop('Ticket', axis=1, inplace=True)

final_test = testing
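
Encoding train and test separately only yields matching columns when both files contain every category. If that might not hold, one option is to reindex the test frame to the training columns. The toy frames below are hypothetical.

```python
import pandas as pd

toy_train = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Age": [22.0, 38.0, 26.0]})
toy_test = pd.DataFrame({"Embarked": ["S", "S"], "Age": [30.0, 41.0]})  # no "C" or "Q"

train_enc = pd.get_dummies(toy_train, columns=["Embarked"])
test_enc = pd.get_dummies(toy_test, columns=["Embarked"])

# Give the test frame exactly the training columns, in the same order;
# categories absent from the test set become all-zero columns
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
```

A model fitted on `train_enc` can then score `test_enc` without a column mismatch.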

# 3. Exploratory Data Analysis

In [None]:
plt.figure(figsize=(15,8))
# fill=True replaces the deprecated shade=True argument
ax = sns.kdeplot(final_train["Age"][final_train.Survived == 1], color="darkturquoise", fill=True)
sns.kdeplot(final_train["Age"][final_train.Survived == 0], color="lightcoral", fill=True)
plt.legend(['Survived', 'Died'])
plt.title('Density Plot of Age for Surviving Population and Deceased Population')
ax.set(xlabel='Age')
plt.xlim(-10,85)
plt.show()

- Considering the survival rate of passengers aged 16 and under, we add another categorical variable to the dataset: "IsMinor".

In [None]:
final_train['IsMinor']=np.where(final_train['Age']<=16, 1, 0)
final_test['IsMinor']=np.where(final_test['Age']<=16, 1, 0)

In [None]:
sns.barplot(x='Pclass', y='Survived', data=train_df, color="orange")
plt.show()

- First-class passengers had the highest survival rate.

In [None]:
sns.barplot(x='Embarked', y='Survived', data=train_df, color="teal")
plt.show()

- Passengers who boarded at Cherbourg, France appear to have the highest survival rate; this may be related to Pclass.

In [None]:
sns.barplot(x='Sex', y='Survived', data=train_df, color="green")
plt.show()

- Female passengers survived at a much higher rate than male passengers.
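
By default, `sns.barplot` draws the mean of `y` for each category, so each bar is a survival rate rather than a count. The numbers behind such a plot can be read off with a `groupby`, as in this sketch. The rows are hypothetical, not real Titanic records.

```python
import pandas as pd

# Hypothetical toy rows
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Pclass": [1, 3, 1, 3, 3],
    "Survived": [1, 1, 1, 0, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()
print(rate_by_sex)
print(rate_by_class)
```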

# 4. Feature Selection (RFECV)

In [None]:
import warnings
warnings.filterwarnings('ignore')

X = training.drop('Survived', axis=1) # Independent variables
y = training['Survived'] # Dependent variable

# Let's choose Logistic Regression
rfecv = RFECV(estimator=LogisticRegression(), step=1, cv=10, scoring='accuracy')
rfecv.fit(X, y)

print(f'Number of optimal features: {rfecv.n_features_}')
print(f'Selected optimal features: {list(X.columns[rfecv.support_])}')
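
The selection result only affects the model if later steps restrict the data to the chosen columns. A minimal sketch of that step follows. It uses synthetic data from `make_classification`, and the column names `f0` to `f5`, `cv=5`, and `max_iter=1000` are assumptions chosen to keep the example small.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for X and y; the column names are hypothetical
features, target = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(features, columns=[f"f{i}" for i in range(6)])

selector = RFECV(estimator=LogisticRegression(max_iter=1000), step=1, cv=5, scoring="accuracy")
selector.fit(X_demo, target)

# support_ is a boolean mask over the input columns
selected = list(X_demo.columns[selector.support_])
X_selected = X_demo[selected]  # keep only the selected features for later modelling
print(selected)
```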

# 5. Model Evaluation using Logistic Regression

In [None]:
# Splitting the data into train and test to evaluate our model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

# Logistic Regression
lg = LogisticRegression(n_jobs=-1)

# Training (Finding the optimal weights)
lg.fit(X_train, y_train)

# Predictions
y_pred = lg.predict(X_test)

In [None]:
# Review our predictions
y_pred

In [None]:
# Evaluation
print(f'Accuracy: {metrics.accuracy_score(y_test, y_pred)}')
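# ROC AUC is computed from the predicted probability of class 1, not from hard 0/1 labels
print(f'ROC AUC Score: {metrics.roc_auc_score(y_test, lg.predict_proba(X_test)[:, 1])}')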
print(f'Classification Report:\n{metrics.classification_report(y_test, y_pred)}')
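
ROC AUC measures how well the model ranks positives above negatives, so it carries more information when computed from predicted probabilities than from thresholded 0/1 predictions. The pure-Python sketch below uses the pairwise definition of AUC to show the difference. The function `pairwise_auc` and the toy values are hypothetical and are not the scikit-learn implementation.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities of class 1
y_true = [0, 0, 1, 1]
proba = [0.1, 0.4, 0.45, 0.8]
labels = [1 if p >= 0.5 else 0 for p in proba]  # thresholded 0/1 predictions

auc_from_proba = pairwise_auc(y_true, proba)
auc_from_labels = pairwise_auc(y_true, labels)
print(auc_from_proba, auc_from_labels)  # 1.0 0.75
```

The probabilities rank every positive above every negative, but thresholding at 0.5 turns one positive into a tie with the negatives and lowers the score.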

- Satisfying results! 

# 6. Submission

In [None]:
predictions = lg.predict(testing)
ID = pd.read_csv('../input/test.csv').PassengerId
submit_df = pd.DataFrame()
submit_df['PassengerId'] = ID
submit_df['Survived'] = predictions

submit_df.head()

In [None]:
# Save the submission file
submit_df.to_csv('submission.csv', index=False)
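
Before uploading, it can help to verify that the file has the two expected columns, one row per test passenger, and only 0/1 predictions. The sketch below does this on a toy frame. The helper `check_submission` and the example IDs are hypothetical, and the CSV is written to an in-memory buffer rather than to disk.

```python
import io
import pandas as pd

def check_submission(submission: pd.DataFrame, expected_ids: pd.Series) -> None:
    """Raise AssertionError if the frame does not look like a valid submission."""
    assert list(submission.columns) == ["PassengerId", "Survived"]
    assert len(submission) == len(expected_ids)
    assert submission["PassengerId"].tolist() == expected_ids.tolist()
    assert submission["Survived"].isin([0, 1]).all()

# Hypothetical toy submission
toy_ids = pd.Series([892, 893, 894], name="PassengerId")
toy_submission = pd.DataFrame({"PassengerId": toy_ids, "Survived": [0, 1, 0]})

# Round-trip through CSV text to check what would actually be written
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
buffer.seek(0)
check_submission(pd.read_csv(buffer), toy_ids)
```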