# Business Understanding / Problem Definition

**Titanic Survival Prediction:**

Use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

**Variables and Their Types:**

Survived: Survival -> 0 = No, 1 = Yes

Pclass: Ticket class -> 1 = 1st, 2 = 2nd, 3 = 3rd

Sex: Sex

Age: Age in years

SibSp: # of siblings / spouses aboard the Titanic

Parch: # of parents / children aboard the Titanic

Ticket: Ticket number

Fare: Passenger fare

Cabin: Cabin number

Embarked: Port of Embarkation -> C = Cherbourg, Q = Queenstown, S = Southampton

**Variable Notes:**

Pclass: A proxy for socio-economic status (SES)
- 1st = Upper
- 2nd = Middle
- 3rd = Lower

Age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

SibSp: The dataset defines family relations in this way...
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)

Parch: The dataset defines family relations in this way...
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

# Data Understanding (Exploratory Data Analysis)

## Importing Libraries

In [None]:
# data analysis libraries:
import numpy as np
import pandas as pd

# data visualization libraries:
import matplotlib.pyplot as plt
import seaborn as sns

# to ignore warnings:
import warnings
warnings.filterwarnings('ignore')

# to display all columns:
pd.set_option('display.max_columns', None)

from sklearn.model_selection import train_test_split, GridSearchCV

## Loading Data

In [None]:
# Read train and test data with pd.read_csv():
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")

In [None]:
# copy data in order to avoid any change in the original:
train = train_data.copy()
test = test_data.copy()

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.info()

## Analysis and Visualization of Numeric and Categorical Variables

### Basic summary statistics about the numerical data

In [None]:
train.describe().T

### Classes of some categorical variables

In [None]:
train['Pclass'].value_counts()

In [None]:
train['Sex'].value_counts()

In [None]:
train['SibSp'].value_counts()

In [None]:
train['Parch'].value_counts()

In [None]:
train['Ticket'].value_counts()

In [None]:
train['Cabin'].value_counts()

In [None]:
train['Embarked'].value_counts()

### Visualization

In general, bar plots suit categorical variables, while histograms, density plots and box plots suit numerical data.

#### Pclass vs survived:

In [None]:
sns.barplot(x = 'Pclass', y = 'Survived', data = train);

#### SibSp vs survived:

In [None]:
sns.barplot(x = 'SibSp', y = 'Survived', data = train);

#### Parch vs survived:

In [None]:
sns.barplot(x = 'Parch', y = 'Survived', data = train);

#### Sex vs survived:

In [None]:
sns.barplot(x = 'Sex', y = 'Survived', data = train);

# Data Preparation

## Deleting Unnecessary Variables

In [None]:
train.head()

### Ticket

In [None]:
# We can drop the Ticket feature since it is unlikely to have useful information
train = train.drop(['Ticket'], axis = 1)
test = test.drop(['Ticket'], axis = 1)

train.head()

## Outlier Treatment

In [None]:
train.describe().T

In [None]:
# The maximum Fare looks suspiciously high compared with the quartiles. Visualize with a boxplot.
sns.boxplot(x = train['Fare']);

In [None]:
Q1 = train['Fare'].quantile(0.25)
Q3 = train['Fare'].quantile(0.75)
IQR = Q3 - Q1

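lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR

# a bare expression is displayed only when it is the last line of a cell, so print both limits:
print(lower_limit, upper_limit)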

In [None]:
# observations with Fare higher than the upper limit:

train[train['Fare'] > upper_limit]

In [None]:
train.sort_values("Fare", ascending=False).head()

In [None]:
# The boxplot shows many values above the upper limit, so we cannot adjust them all. Cap only the highest value (512.3292) at 300:
train['Fare'] = train['Fare'].replace(512.3292, 300)
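
The IQR limits and the capping step can also be wrapped in a small helper. This is a minimal sketch on made-up fares, not the notebook's data; the helper name `iqr_limits` and the sample values are illustrative assumptions.

```python
import pandas as pd

def iqr_limits(values, k=1.5):
    """Return the (lower, upper) IQR fences for a numeric Series."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# hypothetical fares, chosen only for illustration
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 31.0, 512.33])
lower, upper = iqr_limits(fares)
n_outliers = int((fares > upper).sum())  # values above the upper fence
capped = fares.clip(upper=300)           # cap every value at 300
print(n_outliers, capped.max())
```

Unlike `replace`, which changes one exact value, `clip` caps every value above the threshold.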

In [None]:
train.sort_values("Fare", ascending=False).head()

In [None]:
test.sort_values("Fare", ascending=False)

In [None]:
test['Fare'] = test['Fare'].replace(512.3292, 300)

In [None]:
test.sort_values("Fare", ascending=False)

## Missing Value Treatment

In [None]:
train.isnull().sum()

### Age

In [None]:
train["Age"] = train["Age"].fillna(train["Age"].mean())

In [None]:
test["Age"] = test["Age"].fillna(test["Age"].mean())
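
Filling each set with its own mean uses a different fill value for train and test. One alternative, sketched here on toy frames (the data and the `_demo` names are made up), is to compute the mean on the training set and reuse it for both:

```python
import numpy as np
import pandas as pd

train_demo = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
test_demo = pd.DataFrame({"Age": [np.nan, 50.0]})

age_mean = train_demo["Age"].mean()  # computed on train only; NaN is skipped
train_demo["Age"] = train_demo["Age"].fillna(age_mean)
test_demo["Age"] = test_demo["Age"].fillna(age_mean)
print(test_demo["Age"].tolist())
```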

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

### Embarked

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

In [None]:
train["Embarked"].value_counts()

In [None]:
# Fill NA with the most frequent value:
train["Embarked"] = train["Embarked"].fillna("S")

In [None]:
test["Embarked"] = test["Embarked"].fillna("S")

### Fare

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

In [None]:
test[test["Fare"].isnull()]

In [None]:
test[["Pclass","Fare"]].groupby("Pclass").mean()

In [None]:
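# Fill the missing Fare with 12, roughly the mean fare of that passenger's Pclass in the table above:
test["Fare"] = test["Fare"].fillna(12)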
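
Instead of hard-coding the fill value, the class mean can be looked up programmatically. A sketch on a toy frame (all values made up), assuming `groupby(...).transform("mean")` is an acceptable way to get each row's class mean:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Fare": [80.0, 100.0, 10.0, 14.0, np.nan],
})
# mean Fare of each row's own Pclass (NaN values are skipped by mean)
class_mean = demo.groupby("Pclass")["Fare"].transform("mean")
demo["Fare"] = demo["Fare"].fillna(class_mean)
print(demo["Fare"].tolist())
```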

In [None]:
test["Fare"].isnull().sum()

### Cabin

In [None]:
# Create a CabinBool variable that states whether a passenger has Cabin data (1) or not (0):

train["CabinBool"] = (train["Cabin"].notnull().astype('int'))
test["CabinBool"] = (test["Cabin"].notnull().astype('int'))

train = train.drop(['Cabin'], axis = 1)
test = test.drop(['Cabin'], axis = 1)

train.head()

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

## Variable Transformation

### Embarked

In [None]:
# Map each Embarked value to a numerical value:

embarked_mapping = {"S": 1, "C": 2, "Q": 3}

train['Embarked'] = train['Embarked'].map(embarked_mapping)
test['Embarked'] = test['Embarked'].map(embarked_mapping)

In [None]:
train.head()

### Sex

In [None]:
# Convert Sex values into 0/1. Fit the encoder on train and reuse it on test
# so that both sets share the same codes:

from sklearn import preprocessing

lbe = preprocessing.LabelEncoder()
train["Sex"] = lbe.fit_transform(train["Sex"])
test["Sex"] = lbe.transform(test["Sex"])
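
A `LabelEncoder` assigns codes from the classes it sees during fitting, so an encoder fitted separately on each set can give the same label different codes. The sketch below uses made-up series to show the difference; in this dataset both sets contain both classes, so the effect only appears in the toy data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train_sex = pd.Series(["male", "female", "female"])
test_sex = pd.Series(["male", "male"])  # toy case: only one class present

enc = LabelEncoder().fit(train_sex)             # classes_: ['female', 'male']
refit = LabelEncoder().fit_transform(test_sex)  # re-fitting: 'male' -> 0
reused = enc.transform(test_sex)                # reusing:    'male' -> 1
print(list(refit), list(reused))
```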

In [None]:
train.head()

### Name - Title

In [None]:
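# Extract the title: the word in Name that is directly followed by a period.
# Raw strings avoid an invalid escape sequence warning for "\.".
train["Title"] = train["Name"].str.extract(r' ([A-Za-z]+)\.', expand=False)
test["Title"] = test["Name"].str.extract(r' ([A-Za-z]+)\.', expand=False)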
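
To see what the pattern captures, here is a small sketch on three sample name strings written in the dataset's `Surname, Title. Given names` layout:

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])
# a space, then letters (captured), then a literal period
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```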

In [None]:
train.head()

In [None]:
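# 'Lady' belongs to the Royal group only; listing it under Rare as well would stop it from ever reaching Royal.
train['Title'] = train['Title'].replace(['Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Jonkheer', 'Dona'], 'Rare')
train['Title'] = train['Title'].replace(['Countess', 'Lady', 'Sir'], 'Royal')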
train['Title'] = train['Title'].replace('Mlle', 'Miss')
train['Title'] = train['Title'].replace('Ms', 'Miss')
train['Title'] = train['Title'].replace('Mme', 'Mrs')

In [None]:
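# 'Lady' belongs to the Royal group only; listing it under Rare as well would stop it from ever reaching Royal.
test['Title'] = test['Title'].replace(['Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Jonkheer', 'Dona'], 'Rare')
test['Title'] = test['Title'].replace(['Countess', 'Lady', 'Sir'], 'Royal')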
test['Title'] = test['Title'].replace('Mlle', 'Miss')
test['Title'] = test['Title'].replace('Ms', 'Miss')
test['Title'] = test['Title'].replace('Mme', 'Mrs')

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train[["Title","PassengerId"]].groupby("Title").count()

In [None]:
train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

In [None]:
# Map each of the title groups to a numerical value (note that Royal and Rare share the value 5):

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Royal": 5, "Rare": 5}

train['Title'] = train['Title'].map(title_mapping)

In [None]:
train.isnull().sum()

In [None]:
test['Title'] = test['Title'].map(title_mapping)

In [None]:
test.head()

In [None]:
train = train.drop(['Name'], axis = 1)
test = test.drop(['Name'], axis = 1)

In [None]:
train.head()

### AgeGroup

In [None]:
bins = [0, 5, 12, 18, 24, 35, 60, np.inf]
mylabels = ['Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']
train['AgeGroup'] = pd.cut(train["Age"], bins, labels = mylabels)
test['AgeGroup'] = pd.cut(test["Age"], bins, labels = mylabels)
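
By default `pd.cut` builds right-closed intervals, so an age equal to a bin edge falls into the lower group. A short sketch with the same bins and labels on a few made-up ages:

```python
import numpy as np
import pandas as pd

bins = [0, 5, 12, 18, 24, 35, 60, np.inf]
labels = ['Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']

# made-up ages, including values that sit exactly on a bin edge
ages = pd.Series([0.42, 5, 5.5, 18, 60, 80])
groups = pd.cut(ages, bins, labels=labels)
print(groups.tolist())
```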

In [None]:
# Map each Age value to a numerical value:
age_mapping = {'Baby': 1, 'Child': 2, 'Teenager': 3, 'Student': 4, 'Young Adult': 5, 'Adult': 6, 'Senior': 7}
train['AgeGroup'] = train['AgeGroup'].map(age_mapping)
test['AgeGroup'] = test['AgeGroup'].map(age_mapping)

In [None]:
train.head()

In [None]:
# Drop the Age feature for now (this might change):
train = train.drop(['Age'], axis = 1)
test = test.drop(['Age'], axis = 1)

In [None]:
train.head()

### Fare

In [None]:
# Map Fare values into groups of numerical values:
train['FareBand'] = pd.qcut(train['Fare'], 4, labels = [1, 2, 3, 4])
test['FareBand'] = pd.qcut(test['Fare'], 4, labels = [1, 2, 3, 4])
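
Calling `pd.qcut` on train and test separately computes the quartile edges from each set, so the same fare can land in different bands. One alternative, sketched on toy values (the fares below are made up), is to take the edges from the training set with `retbins=True` and apply them to the test set with `pd.cut`:

```python
import numpy as np
import pandas as pd

train_fare = pd.Series([7.25, 8.05, 13.0, 26.0, 31.0, 71.28, 151.55, 263.0])
test_fare = pd.Series([7.0, 14.5, 30.0, 300.0])
band_labels = [1, 2, 3, 4]

# quartile edges are learned from the training fares only
train_band, edges = pd.qcut(train_fare, 4, labels=band_labels, retbins=True)

# widen the outer edges so test fares outside the training range still get a band
edges[0], edges[-1] = -np.inf, np.inf
test_band = pd.cut(test_fare, bins=edges, labels=band_labels)
print(test_band.tolist())
```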

In [None]:
# Drop Fare values:
train = train.drop(['Fare'], axis = 1)
test = test.drop(['Fare'], axis = 1)

In [None]:
train.head()

## Feature Engineering

### Family Size

In [None]:
train.head()

In [None]:
# SibSp and Parch are still present in the working copies, so use them directly:
train["FamilySize"] = train["SibSp"] + train["Parch"] + 1

In [None]:
test["FamilySize"] = test["SibSp"] + test["Parch"] + 1

In [None]:
# Create indicator features from family size:

train['Single'] = train['FamilySize'].map(lambda s: 1 if s == 1 else 0)
train['SmallFam'] = train['FamilySize'].map(lambda s: 1 if  s == 2  else 0)
train['MedFam'] = train['FamilySize'].map(lambda s: 1 if 3 <= s <= 4 else 0)
train['LargeFam'] = train['FamilySize'].map(lambda s: 1 if s >= 5 else 0)

In [None]:
train.head()

In [None]:
# Create the same indicator features for the test set:

test['Single'] = test['FamilySize'].map(lambda s: 1 if s == 1 else 0)
test['SmallFam'] = test['FamilySize'].map(lambda s: 1 if  s == 2  else 0)
test['MedFam'] = test['FamilySize'].map(lambda s: 1 if 3 <= s <= 4 else 0)
test['LargeFam'] = test['FamilySize'].map(lambda s: 1 if s >= 5 else 0)
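
The same flags are built twice, once per set. A hypothetical helper (the name `add_family_flags` is not part of the notebook) could apply the logic to any frame that has `SibSp` and `Parch`; this sketch uses vectorized comparisons in place of `map` with a lambda:

```python
import pandas as pd

def add_family_flags(df):
    """Return a copy of df with FamilySize and four size indicator columns."""
    out = df.copy()
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    size = out["FamilySize"]
    out["Single"] = (size == 1).astype(int)
    out["SmallFam"] = (size == 2).astype(int)
    out["MedFam"] = size.between(3, 4).astype(int)  # inclusive on both ends
    out["LargeFam"] = (size >= 5).astype(int)
    return out

# made-up rows covering each family-size group
demo = pd.DataFrame({"SibSp": [0, 1, 1, 3], "Parch": [0, 0, 2, 2]})
flags = add_family_flags(demo)
print(flags[["FamilySize", "Single", "SmallFam", "MedFam", "LargeFam"]])
```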

In [None]:
test.head()

### Embarked & Title

In [None]:
# Convert Title and Embarked into dummy variables:

train = pd.get_dummies(train, columns = ["Title"])
train = pd.get_dummies(train, columns = ["Embarked"], prefix="Em")

In [None]:
train.head()

In [None]:
test = pd.get_dummies(test, columns = ["Title"])
test = pd.get_dummies(test, columns = ["Embarked"], prefix="Em")
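
`pd.get_dummies` only creates columns for the values present in the frame it is given, so train and test can end up with different columns when a category is missing from one of them. A sketch on toy frames (values made up) of aligning the test columns to the train columns with `reindex`:

```python
import pandas as pd

train_demo = pd.get_dummies(pd.DataFrame({"Title": [1, 2, 5]}), columns=["Title"])
test_demo = pd.get_dummies(pd.DataFrame({"Title": [1, 2, 2]}), columns=["Title"])

# add any column seen in train but missing in test, filled with 0,
# and put the columns in the same order as in train
test_demo = test_demo.reindex(columns=train_demo.columns, fill_value=0)
print(test_demo.columns.tolist())
```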

In [None]:
test.head()

### Pclass

In [None]:
# Create categorical values for Pclass:
train["Pclass"] = train["Pclass"].astype("category")
train = pd.get_dummies(train, columns = ["Pclass"],prefix="Pc")

In [None]:
test["Pclass"] = test["Pclass"].astype("category")
test = pd.get_dummies(test, columns = ["Pclass"],prefix="Pc")

In [None]:
train.head()

In [None]:
test.head()

# Modeling, Evaluation and Model Tuning

## Splitting the train data

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
predictors = train.drop(['Survived', 'PassengerId'], axis=1)
target = train["Survived"]
x_train, x_test, y_train, y_test = train_test_split(predictors, target, test_size = 0.20, random_state = 0)

In [None]:
x_train.shape

In [None]:
x_test.shape

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
acc_logreg = round(accuracy_score(y_test, y_pred) * 100, 2)  # signature is (y_true, y_pred)
print(acc_logreg)

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

randomforest = RandomForestClassifier()
randomforest.fit(x_train, y_train)
y_pred = randomforest.predict(x_test)
acc_randomforest = round(accuracy_score(y_test, y_pred) * 100, 2)  # signature is (y_true, y_pred)
print(acc_randomforest)

## Gradient Boosting Classifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

gbk = GradientBoostingClassifier()
gbk.fit(x_train, y_train)
y_pred = gbk.predict(x_test)
acc_gbk = round(accuracy_score(y_test, y_pred) * 100, 2)  # signature is (y_true, y_pred)
print(acc_gbk)

In [None]:
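# Hyperparameter grid for scikit-learn's GradientBoostingClassifier
# (despite the xgb_ prefix, the XGBoost library is not used here):
xgb_params = {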
        'n_estimators': [200, 500],
        'subsample': [0.6, 1.0],
        'max_depth': [2,5,8],
        'learning_rate': [0.1,0.01,0.02],
        "min_samples_split": [2,5,10]}

In [None]:
xgb = GradientBoostingClassifier()

xgb_cv_model = GridSearchCV(xgb, xgb_params, cv = 10, n_jobs = -1, verbose = 2)
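
Before starting the search it helps to know how many models it will fit. The sketch below repeats the grid from the cell above so that it runs on its own; the variable names are illustrative.

```python
from math import prod

# same grid as above, repeated so the sketch is self-contained
grid = {
    'n_estimators': [200, 500],
    'subsample': [0.6, 1.0],
    'max_depth': [2, 5, 8],
    'learning_rate': [0.1, 0.01, 0.02],
    'min_samples_split': [2, 5, 10],
}
cv_folds = 10

n_candidates = prod(len(values) for values in grid.values())  # parameter combinations
n_fits = n_candidates * cv_folds                              # one fit per combination and fold
print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` adds one more fit of the best combination on the full training split.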

In [None]:
xgb_cv_model.fit(x_train, y_train)

In [None]:
xgb_cv_model.best_params_

In [None]:
# Rebuild the classifier with the best parameters found by the grid search:
xgb = GradientBoostingClassifier(**xgb_cv_model.best_params_)

In [None]:
xgb_tuned =  xgb.fit(x_train,y_train)

In [None]:
y_pred = xgb_tuned.predict(x_test)
acc_gbk = round(accuracy_score(y_test, y_pred) * 100, 2)  # signature is (y_true, y_pred)
print(acc_gbk)

In [None]:
# Refit the tuned model on all of the training data:

In [None]:
xgb_tuned =  xgb.fit(predictors,target)

In [None]:
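# Note: x_test is now part of the data the model was fitted on,
# so this accuracy is optimistic and not a held-out estimate.
y_pred = xgb_tuned.predict(x_test)
acc_gbk = round(accuracy_score(y_test, y_pred) * 100, 2)
print(acc_gbk)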
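
Scoring a model on rows it was fitted on tends to overstate its accuracy. One way to use every row and still score only on unseen rows is cross-validation. This is a minimal sketch on synthetic data from `make_classification`, not on the Titanic features, and the settings (`n_estimators=50`, `cv=5`) are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the predictors and target
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

model = GradientBoostingClassifier(n_estimators=50, random_state=0)
# each of the 5 scores comes from a fold the model was not fitted on
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(round(scores.mean() * 100, 2))
```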

# Deployment

In [None]:
test

In [None]:
#set ids as PassengerId and predict survival 
ids = test['PassengerId']
predictions = xgb_tuned.predict(test.drop('PassengerId', axis=1))

#set the output as a dataframe and convert to csv file named submission.csv
output = pd.DataFrame({ 'PassengerId' : ids, 'Survived': predictions })
output.to_csv('submission.csv', index=False)

In [None]:
output.head()