# <center>Hey! Thank you for choosing my notebook 📑
> <center>😎 It is a bunch of fun, so read it and feel free to comment. Your upvotes make me work harder 🚀🔥

**Workflow:**

**Exploratory Data Analysis:**
* Survival rate
* Pclass
* Name
* Sex
* Age
* SibSp, Parch
* Ticket
* Fare
* Cabin
* Embarked

**Feature Engineering:**
* Imputation on Embarked and Age columns
* Title extraction
* Ticket first letters
* Cabin first letters
* Encoding sex column
* Family size
* One Hot Encoding for all categorical variables

**Machine Learning:**
* Split data into train and test sets
* Initialize a Random Forest Classifier
* Hyperparameter Tuning with Grid Search
* Prediction


In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

# I will keep the resulting plots
%matplotlib inline

# Enable Jupyter Notebook's intellisense
%config IPCompleter.greedy=True

# We want to see whole content (non-truncated)
pd.set_option('display.max_colwidth', None)

# 1. Exploratory Data Analysis

In [None]:
train = pd.read_csv("../input/titanic/train.csv")

display(train.head())

# info() prints its summary directly and returns None, so no print() is needed
train.info()
print(train.describe())

**Notes:**

* The Age, Embarked and Cabin columns have missing values.
* We do not need the PassengerId column.
* The survival rate in our dataset is about 38.4%.

**Survived**

Let's start with the Survived column. It contains the integers 1 and 0, which indicate survival (1 = Survived, 0 = Not Survived).

In [None]:
# Visualize with a countplot
sns.countplot(x="Survived", data=train)
plt.show()

# Print the proportions
print(train["Survived"].value_counts(normalize=True))

**Pclass**

The Pclass column holds the socioeconomic status of the passengers, which might be predictive for our model.

*  1 = Upper
*  2 = Middle
*  3 = Lower

In [None]:
# Visualize with a countplot
sns.countplot(x="Pclass", hue="Survived", data=train)
plt.show()

# Proportion of people survived for each class
print(train["Survived"].groupby(train["Pclass"]).mean())

# How many people we have in each class?
print(train["Pclass"].value_counts())

As I expected, first class passengers have a higher survival rate. We will use this information in our training data.

**Name**

At first glance, I thought the titles in the names could be useful.

In [None]:
# Display first five rows of the Name column
display(train[["Name"]].head())

We can extract the titles from names.

In [None]:
# Get titles
train["Title"] = train['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]

# Print title counts
print(train["Title"].value_counts())
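As a hedged alternative to the chained `split` calls, a single regular expression can capture the title. This is only a sketch: the sample strings below are illustrative values in the dataset's `Surname, Title. Given names` format, and the pattern assumes every name contains a comma followed by a title that ends with a period.

```python
import pandas as pd

# Illustrative names in the "Surname, Title. Given names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture everything between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

The capture group excludes periods, so multi-word titles such as "the Countess" are kept whole.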


Is there any relationship between titles and survival?

In [None]:
# Print the Surviving rates by title
print(train["Survived"].groupby(train["Title"]).mean().sort_values(ascending=False))

Apparently, there is a relationship between titles and survival rate. In the feature engineering part, I will group the titles by their survival rates as follows:

* higher = the Countess, Mlle, Lady, Ms, Sir, Mme, Mrs, Miss, Master
* neutral = Major, Col, Dr
* lower = Mr, Rev, Jonkheer, Don, Capt

**Age**

In [None]:
# Print the missing values in Age column
print(train["Age"].isnull().sum())

There are 177 missing values in the Age column; we will impute them in the feature engineering part.
Now, let's look at the distribution of ages by survival.

In [None]:
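# Age distribution of survivors (missing ages are dropped before plotting)
sns.histplot(train.loc[train.Survived == 1, "Age"].dropna(), color="y", bins=7, kde=True, stat="density", label="1")

# Age distribution of non-survivors
sns.histplot(train.loc[train.Survived == 0, "Age"].dropna(), bins=7, kde=True, stat="density", label="0")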
plt.legend()
plt.title("Age Distribution")
plt.show()

**Sex**

Is sex important for surviving?

In [None]:
# Visualize with a countplot
sns.countplot(x="Sex", hue="Survived", data=train)
plt.show()

# Survival rate for each sex
print(train["Survived"].groupby(train["Sex"]).mean())

# How many passengers of each sex do we have?
print(train["Sex"].value_counts())

Obviously, there is a relationship between sex and surviving.

**SibSp & Parch**

* SibSp = number of siblings or spouses aboard
* Parch = number of parents or children aboard

I decided to make a new feature called family size by summing the SibSp and Parch columns

In [None]:
print(train["SibSp"].value_counts())

print(train["Parch"].value_counts())

train["family_size"] = train["SibSp"] + train["Parch"]

print(train["family_size"].value_counts())

# Survival rate for each family size
print(train["Survived"].groupby(train["family_size"]).mean().sort_values(ascending=False))

Apparently, family size matters for survival. I am going to group the values in the feature engineering step as follows:

* **big family** = family size > 3
* **small family** = family size > 0 and family size <= 3
* **alone** = family size == 0
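The three groups above can also be produced with `pd.cut`. This is a minimal sketch on made-up family sizes, not the helper function used later in the notebook; the bin edges are chosen so that the right-closed intervals match the rules in the list.

```python
import pandas as pd

# Made-up family sizes for illustration
family_size = pd.Series([0, 1, 3, 4, 10])

# Right-closed bins: (-1, 0] -> alone, (0, 3] -> small family, (3, inf] -> big family
labels = pd.cut(
    family_size,
    bins=[-1, 0, 3, float("inf")],
    labels=["alone", "small family", "big family"],
)
print(labels.tolist())
```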

**Ticket**

At first, I thought I would drop this column, but after exploring it I found useful features.

In [None]:
# Print the first 15 rows of the Ticket column
print(train["Ticket"].head(15))

I extracted only first letters of the tickets because I thought that they would indicate the ticket type.



In [None]:
# Get first letters of the tickets
train["Ticket_first"] = train["Ticket"].apply(lambda x: str(x)[0])

# Print value counts
print(train["Ticket_first"].value_counts())

# Surviving rates of first letters
print(train.groupby("Ticket_first")["Survived"].mean().sort_values(ascending=False))

The first letters of the tickets are somewhat correlated with the survival rate. I am going to group them as follows:

* higher survival rate = F, 1, P, 9
* neutral = S, C, 2
* lower survival rate = else

**Fare**

We can plot a histogram to see the Fare distribution.

In [None]:
# Print 3 bins of Fare column
print(pd.cut(train['Fare'], 3).value_counts())

# Plot the histogram with a density curve
sns.histplot(train["Fare"], kde=True, stat="density")
plt.show()

# Print binned Fares by surviving rate
print(train['Survived'].groupby(pd.cut(train['Fare'], 3)).mean())

There is also a correlation between ticket fare and survival.
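Because fares are heavily right-skewed, equal-width bins from `pd.cut` tend to put most passengers in the first bin. A hedged sketch of the difference between `pd.cut` and quantile-based `pd.qcut`, using a handful of made-up fare values rather than the real column:

```python
import pandas as pd

# Made-up, right-skewed fares for illustration
fares = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 30.0, 71.3, 263.0, 512.3])

# Equal-width bins: each bin spans the same fare range
equal_width = pd.cut(fares, 3).value_counts().sort_index()

# Quantile bins: each bin holds roughly the same number of passengers
equal_count = pd.qcut(fares, 3).value_counts().sort_index()

print(equal_width)
print(equal_count)
```

With quantile bins the per-bin survival rates are each estimated from a similar number of passengers, which can make them easier to compare.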

**Cabin**

![Titanic cabin locations](https://raw.githubusercontent.com/Bhasfe/titanic/ae0e2f00f9945227a26005447626f3f6a703c60b/images/titanic.png)

I found this figure on wikiwand.com. It shows the most affected parts of the Titanic and the cabin locations. Although there are many missing values in the Cabin column, I decided to extract the cabin information to see whether it helps.

In [None]:
# Print the unique values in the Cabin column
print(train["Cabin"].unique())

# Get the first letters of Cabins
train["Cabin_first"] = train["Cabin"].apply(lambda x: str(x)[0])

# Print value counts of first letters
print(train["Cabin_first"].value_counts())

# Surviving rate of Cabin first letters
print(train.groupby("Cabin_first")["Survived"].mean().sort_values(ascending=False))

According to the survival rates, I will group the cabins as follows:

* higher survival rate = D, E, B, F, C
* neutral = G, A
* lower survival rate = else
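One detail worth seeing in isolation: `str(x)[0]` turns a missing cabin into the letter "n" (the first character of the string "nan"), so missing cabins end up in the "else" group. The sketch below uses made-up cabin values; the `fillna("Unknown")` variant is an assumption about how one might make the missing category explicit, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up cabin values, two of them missing
cabins = pd.Series(["C85", np.nan, "E46", np.nan])

# Approach used in this notebook: str(nan) is "nan", so the first letter is "n"
first = cabins.apply(lambda x: str(x)[0])
print(first.tolist())

# Hypothetical alternative: label missing cabins explicitly before slicing
explicit = cabins.fillna("Unknown").str[0]
print(explicit.tolist())
```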

**Embarked**

Embarked is a categorical feature that shows the port of embarkation.

* C = Cherbourg
* Q = Queenstown
* S = Southampton

In [None]:
# Make a countplot
sns.countplot(x="Embarked", hue="Survived", data=train)
plt.show()

# Print the value counts
print(train["Embarked"].value_counts())

# Surviving rates of Embarked
print(train["Survived"].groupby(train["Embarked"]).mean())

No doubt, C has the highest survival rate.
We will definitely use this information.

# 2. Feature Engineering
We have learned a lot from exploratory data analysis. Now we can start feature engineering. Firstly, let's load the train and the test sets.

In [None]:
# Load the train and the test datasets
train = pd.read_csv("../input/titanic/train.csv")
test = pd.read_csv("../input/titanic/test.csv")

print(test.info())

There is one missing value in the Fare column of the test set. I imputed it with the mean fare of the training set.

In [None]:
# Fill the missing value with the mean fare of the training set
test['Fare'] = test['Fare'].fillna(train['Fare'].mean())

I used two imputers from sklearn: IterativeImputer for Age, and SimpleImputer (with the most-frequent strategy) for Embarked.

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Imputers
imp_embarked = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
imp_age = IterativeImputer(max_iter=100, random_state=34, n_nearest_features=2)

# Impute Embarked (ravel flattens the (n, 1) output into a 1-D column)
train["Embarked"] = imp_embarked.fit_transform(train[["Embarked"]]).ravel()
test["Embarked"] = imp_embarked.transform(test[["Embarked"]]).ravel()

# Impute Age
train["Age"] = np.round(imp_age.fit_transform(train[["Age"]])).ravel()
test["Age"] = np.round(imp_age.transform(test[["Age"]])).ravel()
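IterativeImputer estimates each feature from the other features, so when it only sees the Age column it has nothing to model from and falls back to its initial fill (the mean by default). A minimal sketch of passing several numeric columns instead, assuming a made-up toy frame; the column choice and `max_iter=10` are illustrative, not the notebook's settings.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Made-up rows; two ages are missing
toy = pd.DataFrame({
    "Age":    [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "Pclass": [3, 1, 3, 1, 3, 1],
    "SibSp":  [1, 1, 0, 1, 0, 0],
    "Fare":   [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
})

# Missing ages are now predicted from Pclass, SibSp and Fare
imputer = IterativeImputer(max_iter=10, random_state=34)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled["Age"].round())
```

Observed ages are left unchanged; only the missing entries are replaced.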

We also encode the sex column.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Initialize a Label Encoder
le = LabelEncoder()

# Encode Sex: fit on the training set, then reuse the same mapping for the test set
train["Sex"] = le.fit_transform(train["Sex"])
test["Sex"] = le.transform(test["Sex"])

In the EDA, we decided to use a family size feature.

In [None]:
# Family Size
train["Fsize"] = train["SibSp"] + train["Parch"]
test["Fsize"] = test["SibSp"] + test["Parch"]

Ticket first letters and Cabin first letters are also needed

In [None]:
# Ticket first letters
train["Ticket"] = train["Ticket"].apply(lambda x: str(x)[0])
test["Ticket"] = test["Ticket"].apply(lambda x: str(x)[0])

# Cabin first letters
train["Cabin"] = train["Cabin"].apply(lambda x: str(x)[0])
test["Cabin"] = test["Cabin"].apply(lambda x: str(x)[0])

Extract the titles from the names

In [None]:
# Titles
train["Title"] = train['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
test["Title"] = test['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]

Now, we need some helper functions to group our categories

In [None]:
# Group the family_size column
def assign_passenger_label(family_size):
    if family_size == 0:
        return "Alone"
    elif family_size <=3:
        return "Small_family"
    else:
        return "Big_family"
    
# Group the Ticket column
def assign_label_ticket(first):
    if first in ["F", "1", "P", "9"]:
        return "Ticket_high"
    elif first in ["S", "C", "2"]:
        return "Ticket_middle"
    else:
        return "Ticket_low"
    
# Group the Title column    
def assign_label_title(title):
    if title in ["the Countess", "Mlle", "Lady", "Ms", "Sir", "Mme", "Mrs", "Miss", "Master"]:
        return "Title_high"
    elif title in ["Major", "Col", "Dr"]:
        return "Title_middle"
    else:
        return "Title_low"
    
# Group the Cabin column  
def assign_label_cabin(cabin):
    if cabin in ["D", "E", "B", "F", "C"]:
        return "Cabin_high"
    elif cabin in ["G", "A"]:
        return "Cabin_middle"
    else:
        return "Cabin_low"

Apply the functions.

In [None]:
# Family size
train["Fsize"] = train["Fsize"].apply(assign_passenger_label)
test["Fsize"] = test["Fsize"].apply(assign_passenger_label)

# Ticket
train["Ticket"] = train["Ticket"].apply(assign_label_ticket)
test["Ticket"] = test["Ticket"].apply(assign_label_ticket)

# Title
train["Title"] = train["Title"].apply(assign_label_title)
test["Title"] = test["Title"].apply(assign_label_title)

# Cabin
train["Cabin"] = train["Cabin"].apply(assign_label_cabin)
test["Cabin"] = test["Cabin"].apply(assign_label_cabin)

It's time to use One Hot Encoding

In [None]:
train = pd.get_dummies(columns=["Pclass", "Embarked", "Ticket", "Cabin","Title", "Fsize"], data=train, drop_first=True)
test = pd.get_dummies(columns=["Pclass", "Embarked", "Ticket", "Cabin", "Title", "Fsize"], data=test, drop_first=True)
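Encoding train and test separately only yields matching columns when every category appears in both sets. A hedged sketch of one way to guard against a mismatch, using hypothetical toy frames (`train_toy`, `test_toy`) in which the test set lacks one port:

```python
import pandas as pd

# Hypothetical toy frames: "Q" never appears in the test set
train_toy = pd.DataFrame({"Embarked": ["S", "C", "Q"]})
test_toy = pd.DataFrame({"Embarked": ["S", "C", "S"]})

train_enc = pd.get_dummies(train_toy, columns=["Embarked"])
test_enc = pd.get_dummies(test_toy, columns=["Embarked"])

# Reindex the test frame to the training columns; unseen categories become all-zero columns
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)
```

When `drop_first=True` is used, as in this notebook, the same reindexing step applies after encoding.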

Drop the columns that are no longer needed

In [None]:
target = train["Survived"]
train.drop(["Survived", "SibSp", "Parch", "Name", "PassengerId"], axis=1, inplace=True)
test.drop(["SibSp", "Parch", "Name","PassengerId"], axis=1, inplace=True)

Final look

In [None]:
display(train.head())
display(test.head())

print(train.info())
print(test.info())

# 3. Machine Learning

To evaluate our model's performance, we need to split our train data into training and test sets.

In [None]:
from sklearn.model_selection import train_test_split

# Select the features and the target
X = train.values
y = target.values

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=34, stratify=y)
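The `stratify=y` argument keeps the share of survivors the same in both splits. A small sketch on made-up labels (60 zeros and 40 ones) to show the effect; the toy arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 40% positive class
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=34, stratify=y_toy
)

# Both splits keep the original class ratio
print(y_tr.mean(), y_te.mean())
```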

I have used GridSearchCV for tuning my Random Forest Classifier

In [None]:
# Import Necessary libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix, classification_report

# Initialize a RandomForestClassifier
rf = RandomForestClassifier(random_state=34)

params = {'n_estimators': [50, 100, 200, 300, 350],
          'max_depth': [3,4,5,7, 10,15,20],
          'criterion':['entropy', 'gini'],
          'min_samples_leaf' : [1, 2, 3, 4, 5, 10],
          'max_features':['sqrt'],  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
          'min_samples_split': [3, 5, 10, 15, 20],
          'max_leaf_nodes':[2,3,4,5],
          }

clf = GridSearchCV(estimator=rf,param_grid=params,cv=10, n_jobs=-1)

clf.fit(X_train, y_train.ravel())

print(clf.best_estimator_)
print(clf.best_score_)

rf_best = clf.best_estimator_

# Predict from the test set
y_pred = clf.predict(X_test)

# Print the accuracy with accuracy_score function
print("Accuracy: ", accuracy_score(y_test, y_pred))

# Print the confusion matrix
print("\nConfusion Matrix\n")
print(confusion_matrix(y_test, y_pred))
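The search above can take a while, because GridSearchCV fits one model per parameter combination per fold. A rough sketch of the arithmetic, assuming the same grid shape as the cell above and `cv=10`:

```python
from math import prod

# Same grid shape as the search above
params = {
    "n_estimators": [50, 100, 200, 300, 350],
    "max_depth": [3, 4, 5, 7, 10, 15, 20],
    "criterion": ["entropy", "gini"],
    "min_samples_leaf": [1, 2, 3, 4, 5, 10],
    "max_features": ["sqrt"],
    "min_samples_split": [3, 5, 10, 15, 20],
    "max_leaf_nodes": [2, 3, 4, 5],
}

# Number of parameter combinations, and total fits with 10-fold cross-validation
n_candidates = prod(len(values) for values in params.values())
n_fits = n_candidates * 10
print(n_candidates, n_fits)
```

Since `max_leaf_nodes` caps each tree at five leaves, a tree can be at most four levels deep, so the larger `max_depth` values probably cannot change the fitted model and could be trimmed to shorten the search.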

Save the model



In [None]:
# Use a context manager so the file is closed after writing
with open("model.pkl", "wb") as f:
    pickle.dump(rf_best, f)
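To reuse a saved model later, it can be loaded back with `pickle.load`. A minimal round-trip sketch with a tiny stand-in model trained on made-up data; the file name `toy_model.pkl` and the temporary directory are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny stand-in model trained on made-up data
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 5)
y_toy = np.array([0, 0, 1, 1] * 5)
model = RandomForestClassifier(n_estimators=10, random_state=34).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")
    # Save the fitted model
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # Load it back
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should predict exactly like the original
same_predictions = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

A pickled model should generally be loaded with the same scikit-learn version that saved it.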


We can look at the feature importances.

In [None]:
# Create a pandas series with feature importances
importance = pd.Series(rf_best.feature_importances_,index=train.columns).sort_values(ascending=False)

sns.barplot(x=importance, y=importance.index)
# Add labels to your graph
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title("Important Features")
plt.show()

Train the model again with the entire training data.



In [None]:
# Arguments not listed here keep their default values
last_clf = RandomForestClassifier(criterion='gini', max_depth=4, max_features='sqrt',
                       max_leaf_nodes=5, min_samples_split=15,
                       n_estimators=350, oob_score=True, random_state=34)

# Fit on plain arrays so that prediction on test.values matches the training input
last_clf.fit(train.values, target)
print("%.4f" % last_clf.oob_score_)

Prepare the submission file



In [None]:
# Store passenger ids
ids = pd.read_csv("../input/titanic/test.csv")[["PassengerId"]].values

# Make predictions
predictions = last_clf.predict(test.values)

# Print the predictions
print(predictions)

# Create a dictionary with passenger ids and predictions
df = {'PassengerId': ids.ravel(), 'Survived':predictions}

# Create a DataFrame named submission
submission = pd.DataFrame(df)

# Display the first five rows of submission
display(submission.head())

# Save the file
submission.to_csv("submission_last.csv", index=False)