# TITANIC SURVIVAL PREDICTION PROJECT
## Business Understanding / Problem Definition

**Goal:**

Predict whether each Titanic passenger survived, using the best-performing machine learning model.

**Variables:**

i)  ***Categorical:***

a) Nominal:

- Survived: 1 = survived, 0 = died
- Sex: male / female in the raw data (encoded later as male = 1, female = 0)
- Embarked: Port of embarkation -> C = Cherbourg, Q = Queenstown, S = Southampton

b) Ordinal:

- Pclass: Ticket class -> 1 = 1st, 2 = 2nd, 3 = 3rd (1st class indicates the richest passengers)

c) Useless:

- Ticket: Ticket number (although the value looks numerical, a ticket number does not measure any quantity)
- Cabin: Cabin number (this value may contain information about the responsibility of the crew and may be categorical)

ii)  ***Numerical:***

- Age: Age in years
- SibSp: # of siblings / spouses aboard the Titanic
- Parch: # of parents / children aboard the Titanic
- Fare: Passenger fare (this value should be classified into groups, but there are 248 distinct values)



**Variable Notes:**

Pclass: A proxy for socio-economic status (SES)
- 1st = Upper
- 2nd = Middle
- 3rd = Lower

Age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

SibSp: The dataset defines family relations in this way...
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)

Parch: The dataset defines family relations in this way...
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, so Parch = 0 for them.

# Data Understanding (Exploratory Data Analysis)

## Importing Libraries

In [None]:
# data analysis libraries:
import numpy as np
import pandas as pd

# data visualization libraries:
import matplotlib.pyplot as plt
import seaborn as sns

# to ignore warnings:
import warnings
warnings.filterwarnings('ignore')

# to display all columns:
pd.set_option('display.max_columns', None)

# to make the model
from sklearn.model_selection import train_test_split, GridSearchCV

## Loading Data

In [None]:
pwd

In [None]:
# Read train and test data with pd.read_csv():
train_data = pd.read_csv("/kaggle/input/titanic-traintest-data/train.csv")
test_data = pd.read_csv("/kaggle/input/titanic-traintest-data/test.csv")

In [None]:
# copy data in order to avoid any change in the original:
train = train_data.copy()
test = test_data.copy()

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.info()

## Analysis and Visualization of Numeric and Categorical Variables

### Basic summary statistics about the numerical data

In [None]:
train.describe().T

### Classes of some categorical variables

In [None]:
train['Pclass'].value_counts()

In [None]:
train['Sex'].value_counts()

In [None]:
train['SibSp'].value_counts()

In [None]:
train['Parch'].value_counts()

In [None]:
train['Ticket'].value_counts()

In [None]:
train['Cabin'].value_counts()

In [None]:
train['Embarked'].value_counts()

In [None]:
train['Age'].value_counts()

### Visualization

In general, bar plots are used for categorical variables, while histograms, density plots and box plots are used for numerical variables.

#### Pclass vs survived:

In [None]:
sns.barplot(x = 'Pclass', y = 'Survived', data = train);
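
By default `sns.barplot` draws the mean of `y` for each category of `x`, so with a 0/1 target the bar heights are survival rates. A minimal sketch of the same numbers without the plot, on a small made-up frame (`toy` is hypothetical, not the Titanic data):

```python
import pandas as pd

# Hypothetical toy data standing in for the train set.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column per group is the share of survivors in that group.
survival_rate = toy.groupby("Pclass")["Survived"].mean()
print(survival_rate)
```

The same `groupby` pattern applies to the SibSp, Parch and Sex plots that follow.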

#### SibSp vs survived:

In [None]:
sns.barplot(x = 'SibSp', y = 'Survived', data = train);

#### Parch vs survived:

In [None]:
sns.barplot(x = 'Parch', y = 'Survived', data = train);

#### Sex vs survived:

In [None]:
sns.barplot(x = 'Sex', y = 'Survived', data = train);

#### Sex vs survived (including Pclass):

In [None]:
sns.barplot(x = 'Sex', y = 'Survived', hue = "Pclass" , data = train);

# Data Preparation

## Deleting Unnecessary Variables

In [None]:
train.head()

### Ticket and Cabin

In [None]:
# Since we classified the Ticket and Cabin variables as useless,
# we set them aside in separate variables in case they are needed later
train_Ticket = train["Ticket"]
test_Ticket = test["Ticket"]
train_Cabin = train["Cabin"]
test_Cabin = test["Cabin"]

In [None]:
train_Ticket.head()

In [None]:
# Now we drop the Ticket and Cabin feature
train = train.drop(['Ticket' , 'Cabin'], axis = 1)
test = test.drop(['Ticket' , 'Cabin'], axis = 1)

train.head()

## Missing Value Treatment

In [None]:
train.isnull().sum()

### Age

In [None]:
train["Age"] = train["Age"].fillna(train["Age"].mean())

In [None]:
test["Age"] = test["Age"].fillna(test["Age"].mean())
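
The two cells above fill each set with its own mean. A common alternative, sketched here on made-up data, is to compute the statistic on the train set only and reuse it for the test set, so that both sets are filled with the same value. The frames `toy_train` and `toy_test` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames with missing ages.
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0]})

# mean() skips NaN, so this is the mean of the observed train ages.
train_age_mean = toy_train["Age"].mean()

# Use the train statistic for both sets.
toy_train["Age"] = toy_train["Age"].fillna(train_age_mean)
toy_test["Age"] = toy_test["Age"].fillna(train_age_mean)
```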

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

#### Embarked (2 missing in train data) and Fare (1 missing in test data)

In [None]:
train["Embarked"].value_counts()

In [None]:
# Fill NA with the most frequent value:
train["Embarked"] = train["Embarked"].fillna("S")

In [None]:
test[test["Fare"].isnull()]

In [None]:
test[["Pclass","Fare"]].groupby("Pclass").mean()

In [None]:
# Fill with the class mean Fare read from the table above:
test["Fare"] = test["Fare"].fillna(12.46)
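
The value 12.46 is typed in by hand. A hedged sketch of how the same kind of fill could be computed instead, so that each missing fare receives the mean fare of its own class (`toy_test` is a made-up frame, not the real test set):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing fare in class 3.
toy_test = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Fare":   [80.0, 100.0, 8.0, 12.0, np.nan],
})

# Mean fare of each passenger's own class, aligned row by row.
class_mean_fare = toy_test.groupby("Pclass")["Fare"].transform("mean")

# Only the missing entries are replaced; existing fares stay untouched.
toy_test["Fare"] = toy_test["Fare"].fillna(class_mean_fare)
```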

In [None]:
test["Fare"].isnull().sum()

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

In [None]:
# There is no missing data now!

## Outlier Treatment

In [None]:
train.describe().T

In [None]:
# Let's try to catch any outlier data of the variables:
sns.boxplot(x = train['Age']);

In [None]:
sns.boxplot(x = train['Fare']);

In [None]:
# There are outliers in both Fare and Age. We use the interquartile range (IQR) to set limits for them.

In [None]:
# For Age Data
Q1 = train['Age'].quantile(0.25)
Q3 = train['Age'].quantile(0.75)
IQR = Q3 - Q1

age_lower_limit = Q1- 1.5*IQR
print(age_lower_limit)

age_upper_limit = Q3 + 1.5*IQR
print(age_upper_limit)

In [None]:
Q1 = train['Fare'].quantile(0.25)
Q3 = train['Fare'].quantile(0.75)
IQR = Q3 - Q1

fare_lower_limit = Q1- 1.5*IQR
print(fare_lower_limit)

fare_upper_limit = Q3 + 1.5*IQR
print(fare_upper_limit)
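
The Age and Fare cells repeat the same arithmetic: the limits are Q1 - 1.5*IQR and Q3 + 1.5*IQR. A small sketch that wraps it in one helper; the name `iqr_limits` and the toy series are assumptions made for illustration.

```python
import pandas as pd

def iqr_limits(values: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)

# Hypothetical toy data with one obvious outlier.
toy = pd.Series([1, 2, 3, 4, 100])
lower, upper = iqr_limits(toy)
print(lower, upper)
```

With the real data the calls would be `iqr_limits(train['Age'])` and `iqr_limits(train['Fare'])`.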

In [None]:
# We will not use the lower limits --> only values above the upper limits are treated

In [None]:
# observations with Age data higher than the upper limit:

df_a = train['Age'] > (age_upper_limit)
df_a.value_counts()

In [None]:
# Only 42 of the 891 observations (almost 5 %) exceed the upper limit.
# They could be fixed to upper_limit, but first look closer:
train.sort_values("Age", ascending=False).head(42)
# The distribution looks fine; only the single value 80 stands out --> replace it with the nearest value, 74

In [None]:
train['Age'] = train['Age'].replace(80, 74)
train.sort_values("Age", ascending=False).head()

In [None]:
test.sort_values("Age", ascending=False).head()

In [None]:
test['Age'] = test['Age'].replace(76, 67)
test.sort_values("Age", ascending=False).head()

In [None]:
# observations with Fare data higher than the upper limit:

df_f = train['Fare'] > (fare_upper_limit)
df_f.value_counts()

In [None]:
# More than 10 % of the observations exceed the limit --> only adjust the 3 largest values manually
train.sort_values("Fare", ascending=False).head()

In [None]:
train['Fare'] = train['Fare'].replace(512.3292, 263)
train.sort_values("Fare", ascending=False).head()

In [None]:
test.sort_values("Fare", ascending=False).head()

In [None]:
test['Fare'] = test['Fare'].replace(512.3292, 263)
test.sort_values("Fare", ascending=False).head()
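
The replacements above target one exact value. As a hedged alternative, `Series.clip` caps every value above a chosen limit in one step and does not depend on matching a float exactly. The toy series and the cap of 263 (taken from the cell above) are for illustration only.

```python
import pandas as pd

# Hypothetical toy fares, including the extreme value seen above.
toy_fare = pd.Series([7.25, 71.28, 263.0, 512.3292])

cap = 263.0  # assumed cap, mirroring the replacement value used above
capped = toy_fare.clip(upper=cap)  # values above the cap become the cap
print(capped.tolist())
```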

In [None]:
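# Restrict the correlation to numeric columns (Name, Sex and Embarked are still strings here):
sns.heatmap(train.corr(numeric_only = True), annot = True);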

In [None]:
# Survived shows a considerable correlation only with Sex and Pclass (Parch vs SibSp can be ignored)

## Variable Transformation

### Sex

In [None]:
# Convert Sex values into 1-0:
# (Male:1 Female:0)

from sklearn import preprocessing

lbe = preprocessing.LabelEncoder()
train["Sex"] = lbe.fit_transform(train["Sex"])
# Reuse the encoder fitted on train so both sets share the same codes:
test["Sex"] = lbe.transform(test["Sex"])
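
`LabelEncoder` assigns codes in sorted order of the labels, which is why female becomes 0 and male becomes 1. A minimal sketch of an explicit mapping that gives the same codes and makes the direction visible in the code; the toy frames and the name `sex_mapping` are assumptions.

```python
import pandas as pd

# Hypothetical toy frames.
toy_train = pd.DataFrame({"Sex": ["male", "female", "female"]})
toy_test = pd.DataFrame({"Sex": ["female", "male"]})

# Same codes as a label encoder fitted on these two labels.
sex_mapping = {"female": 0, "male": 1}

toy_train["Sex"] = toy_train["Sex"].map(sex_mapping)
toy_test["Sex"] = toy_test["Sex"].map(sex_mapping)
```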

In [None]:
train.head()

### AgeGroup

In [None]:
bins = [0, 5, 12, 18, 24, 35, 60, np.inf]
mylabels = ['Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']
train['AgeGroup'] = pd.cut(train["Age"], bins, labels = mylabels)
test['AgeGroup'] = pd.cut(test["Age"], bins, labels = mylabels)

In [None]:
# Map each Age value to a numerical value:
age_mapping = {'Baby': 1, 'Child': 2, 'Teenager': 3, 'Student': 4, 'Young Adult': 5, 'Adult': 6, 'Senior': 7}
train['AgeGroup'] = train['AgeGroup'].map(age_mapping)
test['AgeGroup'] = test['AgeGroup'].map(age_mapping)
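
The labelling and mapping above take two steps. If only the numeric group is needed, `pd.cut` with `labels=False` returns the bin index directly. A sketch on made-up ages, using the same `bins` as above:

```python
import numpy as np
import pandas as pd

bins = [0, 5, 12, 18, 24, 35, 60, np.inf]

# Hypothetical toy ages; bins are right-closed, so 5 falls into (0, 5].
toy_age = pd.Series([0.42, 5, 6, 30, 80])

# labels=False gives 0-based bin indices; add 1 to match the 1..7 coding.
age_group = pd.cut(toy_age, bins, labels=False) + 1
print(age_group.tolist())
```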

In [None]:
train.head()

In [None]:
#drop the Age feature:
train = train.drop(['Age'], axis = 1)
test = test.drop(['Age'], axis = 1)

In [None]:
train["Fare"].head()

### Fare

In [None]:
train["Fare"]

In [None]:
# Map Fare values into groups of numerical values:
train['FareBand'] = pd.qcut(train['Fare'], 10, labels = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
test['FareBand'] = pd.qcut(test['Fare'], 10, labels = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
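
Calling `pd.qcut` separately on train and test computes separate quantiles, so the same fare can land in different bands in the two sets. One way to keep the bands comparable, sketched here on made-up fares, is to take the bin edges from the train set (`retbins=True`) and apply them to the test set with `pd.cut`. All names in the block are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy fares.
toy_train_fare = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
toy_test_fare = pd.Series([0.5, 3.0, 100.0])

# Quartile bands on train; retbins=True also returns the bin edges.
train_codes, edges = pd.qcut(toy_train_fare, 4, labels=False, retbins=True)

# Open the outer edges so test fares outside the train range still get a band.
edges = np.concatenate(([-np.inf], edges[1:-1], [np.inf]))

train_band = train_codes + 1
test_band = pd.cut(toy_test_fare, bins=edges, labels=False) + 1
print(train_band.tolist(), test_band.tolist())
```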

In [None]:
# Drop Fare values
train = train.drop(['Fare'], axis = 1)
test = test.drop(['Fare'], axis = 1)

In [None]:
train["FareBand"]

In [None]:
# Also drop the Name column --> not used:
train = train.drop(['Name'], axis = 1)
test = test.drop(['Name'], axis = 1)


In [None]:
train.head()

In [None]:
# Map each Embarked value to a numerical value:

embarked_mapping = {"S": 1, "C": 2, "Q": 3}

train['Embarked'] = train['Embarked'].map(embarked_mapping)
test['Embarked'] = test['Embarked'].map(embarked_mapping)

In [None]:
train.head()

In [None]:
?sns.heatmap

In [None]:
sns.heatmap(train.corr(), annot = True , cbar = True , square=True );

In [None]:
sns.heatmap(test.corr(), annot = True );

## Feature Engineering

### Family Size

In [None]:
train.head()

In [None]:
train["FamilySize"] = train_data["SibSp"] + train_data["Parch"] + 1

In [None]:
test["FamilySize"] = test_data["SibSp"] + test_data["Parch"] + 1

In [None]:
# Create new feature of family size:

train['Single'] = train['FamilySize'].map(lambda s: 1 if s == 1 else 0)
train['SmallFam'] = train['FamilySize'].map(lambda s: 1 if  s == 2  else 0)
train['MedFam'] = train['FamilySize'].map(lambda s: 1 if 3 <= s <= 4 else 0)
train['LargeFam'] = train['FamilySize'].map(lambda s: 1 if s >= 5 else 0)

In [None]:
train.head()

In [None]:
# Create new feature of family size:

test['Single'] = test['FamilySize'].map(lambda s: 1 if s == 1 else 0)
test['SmallFam'] = test['FamilySize'].map(lambda s: 1 if  s == 2  else 0)
test['MedFam'] = test['FamilySize'].map(lambda s: 1 if 3 <= s <= 4 else 0)
test['LargeFam'] = test['FamilySize'].map(lambda s: 1 if s >= 5 else 0)
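
The train and test cells repeat the same four lines. A sketch of one helper that can be applied to both frames; the function name `add_family_flags` and the toy data are assumptions, and the size thresholds are the ones used above.

```python
import pandas as pd

def add_family_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with FamilySize and the four size flags."""
    out = df.copy()
    size = out["SibSp"] + out["Parch"] + 1
    out["FamilySize"] = size
    out["Single"] = (size == 1).astype(int)
    out["SmallFam"] = (size == 2).astype(int)
    out["MedFam"] = size.between(3, 4).astype(int)  # inclusive on both ends
    out["LargeFam"] = (size >= 5).astype(int)
    return out

# Hypothetical toy rows with family sizes 1, 2, 4 and 6.
toy = pd.DataFrame({"SibSp": [0, 1, 1, 3], "Parch": [0, 0, 2, 2]})
flags = add_family_flags(toy)
print(flags)
```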

In [None]:
test.head()

### Embarked

In [None]:
# Convert Embarked into dummy variables:

train = pd.get_dummies(train, columns = ["Embarked"], prefix="Em")

In [None]:
train.head()

In [None]:
test = pd.get_dummies(test, columns = ["Embarked"], prefix="Em")
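
Because `pd.get_dummies` is called on each set separately, a category that is absent from one set produces no column there. A hedged sketch of aligning the test columns to the train columns with `reindex`; the toy frames are made up, and with real data the target column would have to be left out of the column list.

```python
import pandas as pd

# Hypothetical toy frames: category 3 does not occur in the test set.
toy_train = pd.DataFrame({"Embarked": [1, 2, 3, 1]})
toy_test = pd.DataFrame({"Embarked": [1, 2, 1]})

train_d = pd.get_dummies(toy_train, columns=["Embarked"], prefix="Em")
test_d = pd.get_dummies(toy_test, columns=["Embarked"], prefix="Em")

# Give test the same columns in the same order; missing dummies become 0.
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_d.columns))
```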

In [None]:
test.head()

### Pclass

In [None]:
# Create categorical values for Pclass:
train["Pclass"] = train["Pclass"].astype("category")
train = pd.get_dummies(train, columns = ["Pclass"],prefix="Pc")

In [None]:
test["Pclass"] = test["Pclass"].astype("category")
test = pd.get_dummies(test, columns = ["Pclass"],prefix="Pc")

In [None]:
train.head()

In [None]:
test.head()

### Alone Feature

In [None]:
train['personnum']=train.SibSp + train.Parch
train.head()

In [None]:
Alone = []
for i in train["personnum"]:
    if i ==0:
        Alone.append(1)
    else:
        Alone.append(0)


In [None]:
Alone = pd.DataFrame(Alone)
#Alone.head()
Alone.describe().T

In [None]:
#train = train.drop(["0"], axis = 1)
#train = train.drop(["Alone"] , axis=1)

In [None]:
Alone.columns = ["Alone"]
#Al = Alone.rename(columns={'0': 'Alone'}
#Alone.column_name = "Alone"
Alone.head()

In [None]:
train = pd.concat((train,Alone) , axis=1)
#train = train.drop(["personnum"] , axis = 1)
train.head()

In [None]:
# personnum was only a helper for Alone, so drop it (the same is done for test):
train = train.drop(["personnum"] , axis = 1)

In [None]:
train.head()

In [None]:
#Test Data
test['personnum']=test.SibSp + test.Parch

Alone_t = []
for i in test["personnum"]:
    if i ==0:
        Alone_t.append(1)
    else:
        Alone_t.append(0)
Alone_t = pd.DataFrame(Alone_t)
Alone_t.describe().T
Alone_t.columns = ["Alone"]
test = pd.concat((test,Alone_t) , axis=1)

In [None]:
test.head()

In [None]:
test = test.drop(["personnum"] , axis = 1)
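
The loop, the extra DataFrame and the concat can be replaced by one vectorized comparison. A minimal sketch on made-up rows (`toy` is hypothetical). Note that Alone is 1 exactly when FamilySize is 1, so it carries the same information as the Single flag.

```python
import pandas as pd

# Hypothetical toy rows.
toy = pd.DataFrame({"SibSp": [0, 1, 0], "Parch": [0, 0, 2]})

# 1 when the passenger has no siblings, spouse, parents or children aboard.
toy["Alone"] = ((toy["SibSp"] + toy["Parch"]) == 0).astype(int)
print(toy)
```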

# Modeling, Evaluation and Model Tuning

## Splitting the train data

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
predictors = train.drop(['Survived', 'PassengerId'], axis=1)
target = train["Survived"]
x_train, x_test, y_train, y_test = train_test_split(predictors, target, test_size = 0.20, random_state = 0)

In [None]:
x_train.shape

In [None]:
x_test.shape

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
acc_logreg = round(accuracy_score(y_test, y_pred) * 100, 2)
print(acc_logreg)

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

randomforest = RandomForestClassifier()
randomforest.fit(x_train, y_train)
y_pred = randomforest.predict(x_test)
acc_randomforest = round(accuracy_score(y_test, y_pred) * 100, 2)
print(acc_randomforest)

## Gradient Boosting Classifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

gbk = GradientBoostingClassifier()
gbk.fit(x_train, y_train)
y_pred = gbk.predict(x_test)
acc_gbk = round(accuracy_score(y_test, y_pred) * 100, 2)
print(acc_gbk)

In [None]:
xgb_params = {
        'n_estimators': [200, 500],
        'subsample': [0.6, 1.0],
        'max_depth': [2,5,8],
        'learning_rate': [0.1,0.01,0.02],
        "min_samples_split": [2,5,10]}

In [None]:
xgb = GradientBoostingClassifier()

xgb_cv_model = GridSearchCV(xgb, xgb_params, cv = 10, n_jobs = -1, verbose = 2)

In [None]:
xgb_cv_model.fit(x_train, y_train)
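
This search can take a while. The number of model fits follows directly from the grid: every combination of values is tried on each of the 10 folds. A small arithmetic sketch using only the standard library, with the grid copied from above:

```python
from math import prod

xgb_params = {
    'n_estimators': [200, 500],
    'subsample': [0.6, 1.0],
    'max_depth': [2, 5, 8],
    'learning_rate': [0.1, 0.01, 0.02],
    'min_samples_split': [2, 5, 10],
}

# Number of parameter combinations: product of the list lengths.
n_candidates = prod(len(values) for values in xgb_params.values())

# Each candidate is fitted once per fold (cv = 10).
n_fits = n_candidates * 10
print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` fits the best combination once more on the whole of `x_train` after the search.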

In [None]:
xgb_cv_model.best_params_

In [None]:
# Pass the best parameters found by the grid search straight to the classifier:
xgb = GradientBoostingClassifier(**xgb_cv_model.best_params_)

In [None]:
xgb_tuned =  xgb.fit(x_train,y_train)

In [None]:
y_pred = xgb_tuned.predict(x_test)
acc_gbk = round(accuracy_score(y_test, y_pred) * 100, 2)
print(acc_gbk)

In [None]:
models = pd.DataFrame({
    'Model': ['Logistic Regression','Random Forest', 'Gradient Boosting Classifier'],
    'Score': [acc_logreg, acc_randomforest,  acc_gbk]})
models.sort_values(by='Score', ascending=False)
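
The scores in this table come from a single 80/20 split, so small differences between models may be due to the split itself. A hedged sketch of comparing models with k-fold cross-validation instead; it uses synthetic data from `make_classification` rather than the Titanic features, and `max_iter=1000` is an assumption to help the solver converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (predictors, target).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Accuracy on each of 5 folds; the spread shows how much a single split can vary.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

With the notebook's data the call would take `predictors` and `target` in place of `X` and `y`.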

# Submission

In [None]:
train

In [None]:
test

In [None]:
#set ids as PassengerId and predict survival 
ids = test['PassengerId']
predictions = xgb_tuned.predict(test.drop('PassengerId', axis=1))

#set the output as a dataframe and convert to csv file named MHsubmission.csv
output = pd.DataFrame({ 'PassengerId' : ids, 'Survived': predictions })
output.to_csv('MHsubmission.csv', index=False)
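
Before uploading, it can help to check that the file has the two columns built above, one row per test passenger, and only 0/1 predictions. A minimal sketch; the helper name `check_submission` and the toy ids are assumptions.

```python
import pandas as pd

def check_submission(sub: pd.DataFrame, expected_ids: pd.Series) -> bool:
    """Return True if the frame has the expected columns, ids and 0/1 labels."""
    has_columns = list(sub.columns) == ["PassengerId", "Survived"]
    has_ids = sub["PassengerId"].tolist() == expected_ids.tolist()
    has_binary = bool(sub["Survived"].isin([0, 1]).all())
    return has_columns and has_ids and has_binary

# Hypothetical toy submission.
toy_ids = pd.Series([892, 893, 894])
toy_sub = pd.DataFrame({"PassengerId": toy_ids, "Survived": [0, 1, 0]})
print(check_submission(toy_sub, toy_ids))
```

With the notebook's objects the call would be `check_submission(output, test['PassengerId'])`.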

In [None]:
output.head()