1. **Imports**

Import all necessary libraries.

Note: I am not a fan of visualizations(you will rarely find one in my notebook) and I tend to understand the data better when I look at tables and raw numbers.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt

Read the data into dataframes

In [2]:
## Reading data to dataframes
train = pd.read_csv("/kaggle/input/titanic/train.csv")

test = pd.read_csv("/kaggle/input/titanic/test.csv")


**2. Understanding the Data**

I combined both the dataframes for my EDA, this helps you understand and identify pattern better and also find edge cases if any

In [3]:
## Combining test and train df into a isngle dataframe
data = pd.DataFrame()
data = data.combine(train,test)
print(data.info())
print("\n\n")
print(data.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None



       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.48

This returns the null count from each column of our combined dataset. As can be seen the dataset is fairly good, the majority of nulls lie in the three columns namely: "Age","Cabin" and "Embarked". It might be a good idea to drop the column "Cabin"(approx 75% of the data is missing).

In [4]:
data.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Creating a table to understand the dependence of numerical features on Survival. It can be deduced that all these features have some correlation on the probability of survival.

In [5]:
## Survival dependence based on numerical variables
pd.pivot_table(data, index = 'Survived', values = ['Age','SibSp','Parch','Fare'])

Unnamed: 0_level_0,Age,Fare,Parch,SibSp
Survived,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,30.626179,22.117887,0.32969,0.553734
1,28.34369,48.395408,0.464912,0.473684


Similarly, performing the same exercise for categorical features. Sex of the Passenger seems to have significant impact on his/her probability of survival.

In [6]:
## Comparing impact on survival based on categorical variables
print(data[['Sex','Survived']].groupby(['Sex'] , as_index = False).mean())

      Sex  Survived
0  female  0.742038
1    male  0.188908


From the table in #2, we can deduce that Fare has a high SD(standard deviation). Hense to reduce SD, it might be a better idea to divide it into groups based on a quantile system, pd.qcut() helps us achieve exactly that.



In [7]:
# print(data[['Fare','Survived']].groupby(['Fare'] , as_index = False).mean())

data['Fare_Range'] = pd.qcut(data['Fare'], 4)

print(data[['Fare_Range','Survived']].groupby(['Fare_Range'] , as_index = False).mean())

        Fare_Range  Survived
0   (-0.001, 7.91]  0.197309
1   (7.91, 14.454]  0.303571
2   (14.454, 31.0]  0.454955
3  (31.0, 512.329]  0.581081


Pclass dependance

In [8]:
print(data[['Pclass','Survived']].groupby(['Pclass'] , as_index = False).mean())

   Pclass  Survived
0       1  0.629630
1       2  0.472826
2       3  0.242363


3. **Feature Engineering** 

We create a new column 'Family_size' and check it's impact on Survival

In [9]:
data['Family_size'] = data['Parch'] + data['SibSp'] + 1
    
print(data[['Family_size','Survived']].groupby(['Family_size'] , as_index = False).mean())

   Family_size  Survived
0            1  0.303538
1            2  0.552795
2            3  0.578431
3            4  0.724138
4            5  0.200000
5            6  0.136364
6            7  0.333333
7            8  0.000000
8           11  0.000000


Creating new column 'Is_Alone' and checking it's impact on Survival

In [10]:
data['Is_Alone'] = np.where(data['Family_size'] == 1 , 1 , 0)


print(data[['Is_Alone','Survived']].groupby(['Is_Alone'] , as_index = False).mean())


   Is_Alone  Survived
0         0  0.505650
1         1  0.303538


Categorizing titles 

In [11]:
data['Titles'] = data['Name'].str.extract(r', (\w+\.)')

## Categorizing titles

data['Titles'] = data['Titles'].replace(['Capt.', 'Col.',  'Don.',  'Dr.', 'Jonkheer.',  'Lady.',  'Major.',  'Master.',
                                           'Rev.',  'Sir.', np.nan] , 'Special')

data['Titles'] = data['Titles'].replace(['Mlle.','Mlle','Ms.','Miss.'],'Miss')
data['Titles'] = data['Titles'].replace(['Mme.','Mme','Mrs.'],'Mrs')
data['Titles'] = data['Titles'].replace('Mr.','Mr')


print(data[['Titles','Survived']].groupby(['Titles'] , as_index = False).mean())

    Titles  Survived
0     Miss  0.702703
1       Mr  0.156673
2      Mrs  0.793651
3  Special  0.492063


In [12]:
##Replace age nulls with median
data['Age'].fillna(data['Age'].median(), inplace=True)


In [13]:
##Analysing nulls in Embarked column
data[data['Embarked'].isna()]


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Fare_Range,Family_size,Is_Alone,Titles
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,,"(31.0, 512.329]",1,1,Miss
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,,"(31.0, 512.329]",1,1,Mrs


Identifying passengers with Sex = Female and Fare<80, to see where majority of the passengers embarked from..

Replacing nulls in column 'Embarked' with 'S'. 

In [14]:
print(data.loc[(data['Sex'] == "female") & (data['Fare'] <=  80)])

data['Embarked'].fillna("S", inplace = True)

     PassengerId  Survived  Pclass  \
1              2         1       1   
2              3         1       3   
3              4         1       1   
8              9         1       3   
9             10         1       2   
..           ...       ...     ...   
880          881         1       2   
882          883         0       3   
885          886         0       3   
887          888         1       1   
888          889         0       3   

                                                  Name     Sex   Age  SibSp  \
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                               Heikkinen, Miss. Laina  female  26.0      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0      0   
9                  Nasser, Mrs. Nicholas (Adele Achem)  female  14.0      1   
..                                                 ...     ...   ... 

In [15]:
## Now that we have more or less removed all the nulls and analysed the data, we will apply the above transformation to our test-train dataset

In [16]:
combine = [test,train]

In [17]:
for dfs in combine:
    print(dfs)

     PassengerId  Pclass                                          Name  \
0            892       3                              Kelly, Mr. James   
1            893       3              Wilkes, Mrs. James (Ellen Needs)   
2            894       2                     Myles, Mr. Thomas Francis   
3            895       3                              Wirz, Mr. Albert   
4            896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)   
..           ...     ...                                           ...   
413         1305       3                            Spector, Mr. Woolf   
414         1306       1                  Oliva y Ocana, Dona. Fermina   
415         1307       3                  Saether, Mr. Simon Sivertsen   
416         1308       3                           Ware, Mr. Frederick   
417         1309       3                      Peter, Master. Michael J   

        Sex   Age  SibSp  Parch              Ticket      Fare Cabin Embarked  
0      male  34.5      0      0 

In [18]:
for dfs in combine:
    
    dfs['Family_size'] = dfs['Parch'] + dfs['SibSp'] + 1
    
    
    dfs['Is_Alone'] = np.where(dfs['Family_size'] == 1 , 1 , 0)
    
    
    dfs['Titles'] = dfs['Name'].str.extract(r', (\w+\.)')

    dfs['Titles'] = dfs['Titles'].replace(['Capt.', 'Col.',  'Don.',  'Dr.', 'Jonkheer.',  'Lady.',  'Major.',
                                           'Rev.',  'Sir.', np.nan] , 'Special')
    dfs['Titles'] = dfs['Titles'].replace(['Mlle.','Mlle','Ms.','Miss.'],'Miss')
    dfs['Titles'] = dfs['Titles'].replace(['Mme.','Mme','Mrs.'],'Mrs')
    dfs['Titles'] = dfs['Titles'].replace(['Mr.','Master.'],'Mr')

    
    dfs['Age'].fillna(dfs['Age'].median(), inplace=True)
    
    
    dfs['Embarked'].fillna("S", inplace = True)
    
## Mapping to numericals

    dfs['Sex'] = dfs['Sex'].replace(['male','female'],[0,1])
    
    title_mapping = {"Miss" : 1, "Mr" : 2, "Mrs" : 3, "Special" : 4}
    dfs["Titles"] = dfs["Titles"].map(title_mapping)
    dfs["Titles"] = dfs["Titles"].fillna(0)

    dfs['Embarked'] = dfs['Embarked'].replace(['S','C','Q'],[0,1,2])
    
    
    dfs.loc[ dfs['Fare'] <= 7.91, 'Fare'] = 0
    dfs.loc[(dfs['Fare'] > 7.91) & (dfs['Fare'] <= 14.454), 'Fare'] = 1
    dfs.loc[(dfs['Fare'] > 14.454) & (dfs['Fare'] <= 31), 'Fare']   = 2
    dfs.loc[ dfs['Fare'] > 31, 'Fare'] = 3
    dfs['Fare'] = dfs['Fare'].astype(int, errors='ignore').fillna(0)
    
    
    
    dfs.loc[ dfs['Age'] <= 16, 'Age'] = 0
    dfs.loc[(dfs['Age'] > 16) & (dfs['Age'] <= 32), 'Age'] = 1
    dfs.loc[(dfs['Age'] > 32) & (dfs['Age'] <= 48), 'Age'] = 2
    dfs.loc[(dfs['Age'] > 48) & (dfs['Age'] <= 64), 'Age'] = 3
    dfs.loc[ dfs['Age'] > 64, 'Age']

    
    

In [19]:
## Find correlation between features

corr = train.corr()
corr.style.background_gradient(cmap='coolwarm')


Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Family_size,Is_Alone,Titles
PassengerId,1.0,-0.005007,-0.035144,-0.042939,-0.011045,-0.057527,-0.001652,-0.023689,-0.030467,-0.040143,0.057462,0.076338
Survived,-0.005007,1.0,-0.338481,0.543351,-0.06577,-0.035322,0.081629,0.295875,0.106811,0.016639,-0.203367,-0.030926
Pclass,-0.035144,-0.338481,1.0,-0.1319,-0.112962,0.083081,0.018443,-0.628459,0.045702,0.065997,0.135207,-0.161782
Sex,-0.042939,0.543351,-0.1319,1.0,-0.086111,0.114631,0.245489,0.24894,0.116569,0.200988,-0.303646,-0.16918
Age,-0.011045,-0.06577,-0.112962,-0.086111,1.0,-0.065076,-0.041678,0.022975,0.038244,-0.065298,0.061521,0.065145
SibSp,-0.057527,-0.035322,0.083081,0.114631,-0.065076,1.0,0.414838,0.394248,-0.059961,0.890712,-0.584471,-0.030595
Parch,-0.001652,0.081629,0.018443,0.245489,-0.041678,0.414838,1.0,0.393048,-0.078665,0.783111,-0.583398,0.026855
Fare,-0.023689,0.295875,-0.628459,0.24894,0.022975,0.394248,0.393048,1.0,-0.091096,0.465815,-0.568942,0.135275
Embarked,-0.030467,0.106811,0.045702,0.116569,0.038244,-0.059961,-0.078665,-0.091096,1.0,-0.080281,0.017807,-0.11377
Family_size,-0.040143,0.016639,0.065997,0.200988,-0.065298,0.890712,0.783111,0.465815,-0.080281,1.0,-0.690922,-0.007495


High correlation between SibSP,Parch and Family_Size.

In [20]:
print(test)

     PassengerId  Pclass                                          Name  Sex  \
0            892       3                              Kelly, Mr. James    0   
1            893       3              Wilkes, Mrs. James (Ellen Needs)    1   
2            894       2                     Myles, Mr. Thomas Francis    0   
3            895       3                              Wirz, Mr. Albert    0   
4            896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)    1   
..           ...     ...                                           ...  ...   
413         1305       3                            Spector, Mr. Woolf    0   
414         1306       1                  Oliva y Ocana, Dona. Fermina    1   
415         1307       3                  Saether, Mr. Simon Sivertsen    0   
416         1308       3                           Ware, Mr. Frederick    0   
417         1309       3                      Peter, Master. Michael J    0   

     Age  SibSp  Parch              Ticket  Fare Ca

Dropping columns not relevant to our models

In [21]:
drop_dict = ['Name', 'Ticket', 'Cabin', 'SibSp',\
                 'Parch','Family_size']



train = train.drop(drop_dict, axis = 1)
test = test.drop(drop_dict, axis = 1)


In [22]:
print(train)

print("\n" + "__"+ "\n")

print(test)

     PassengerId  Survived  Pclass  Sex  Age  Fare  Embarked  Is_Alone  Titles
0              1         0       3    0  1.0     0         0         0       2
1              2         1       1    1  2.0     3         1         0       3
2              3         1       3    1  1.0     1         0         1       1
3              4         1       1    1  2.0     3         0         0       3
4              5         0       3    0  2.0     1         0         1       2
..           ...       ...     ...  ...  ...   ...       ...       ...     ...
886          887         0       2    0  1.0     1         0         1       4
887          888         1       1    1  1.0     2         0         1       1
888          889         0       3    1  1.0     2         0         0       1
889          890         1       1    0  1.0     2         1         1       2
890          891         0       3    0  1.0     0         2         1       2

[891 rows x 9 columns]

__

     PassengerId  Pclas

4. **Model Selection**

Preparing dataset for Model

In [23]:
X_train = train.drop(["Survived","PassengerId"], axis=1)
Y_train = train["Survived"]
X_test  = test.drop("PassengerId", axis=1).copy()


X_train.shape, Y_train.shape, X_test.shape

((891, 7), (891,), (418, 7))

Trying out different models

In [24]:
## KNN prediction test
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
predn_knn = knn.predict(X_test)
acc_knn = knn.score(X_train, Y_train)

acc_knn

0.8507295173961841

In [25]:
## Decision Tree test
from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
predn_dt = decision_tree.predict(X_test)
acc_decision_tree = decision_tree.score(X_train, Y_train)

acc_decision_tree

0.8664421997755332

In [26]:
# Random Forest Test
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
predn_rf = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = random_forest.score(X_train, Y_train)

acc_random_forest

0.8664421997755332

In [27]:
## Gradient Boosting Test
from sklearn.ensemble import GradientBoostingClassifier

GradientBoosting = GradientBoostingClassifier(n_estimators=100)
GradientBoosting.fit(X_train, Y_train)
predn_gb = GradientBoosting.predict(X_test)
GradientBoosting.score(X_train, Y_train)
acc_GradientBoosting = GradientBoosting.score(X_train, Y_train)

acc_GradientBoosting


0.8395061728395061

In [28]:
## SVM
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(X_train, Y_train)
predn_svc = clf.predict(X_test)

acc_svc = clf.score(X_train, Y_train)

acc_svc

0.813692480359147

In [29]:
## Voting Classifier Test
from sklearn.ensemble import VotingClassifier

ensemble = VotingClassifier(estimators=[('KNN', KNeighborsClassifier(n_neighbors=3)),
                                        ('GB',GradientBoostingClassifier(n_estimators=100)),
                                        ('RF', RandomForestClassifier(n_estimators=100, random_state=0)),
                                        ('DT', DecisionTreeClassifier(random_state=0)),],
                           voting='soft').fit(X_train, Y_train)
predn_vc = ensemble.predict(X_test)

acc_vc = ensemble.score(X_train, Y_train)

acc_vc

0.8630751964085297

Table to calculate percentage accuracy for training dataset

In [30]:
table = pd.DataFrame({
    'Model': ['KNN','Random Forest','Decision Tree','GradientBoosting','SVC','Voting Classifier'],
    'Score': [ acc_knn, acc_random_forest, acc_decision_tree , acc_GradientBoosting,acc_svc,acc_vc]})
table['Score'] = table['Score']*100

print(table)




               Model      Score
0                KNN  85.072952
1      Random Forest  86.644220
2      Decision Tree  86.644220
3   GradientBoosting  83.950617
4                SVC  81.369248
5  Voting Classifier  86.307520


In [31]:
# submission = pd.DataFrame({
#         "PassengerId": test["PassengerId"],
#         "Survived": predn_gb
#     })

# submission.to_csv("submission.csv" , index = False)

# submission

**If you find this notebook useful, support with an upvote👍**