https://www.kaggle.com/c/titanic/overview

## DATASET DETAILS

The dataset is split into training and test data. The training data has 11 features plus the target variable, Survived; the test data has the same 11 features without the target. The aim is to predict whether each passenger survived.
The training set has 891 examples and the test set has 418.

### COLUMNS: 

-  PassengerId - ID of the passenger
-  Pclass - passenger class (1 = Upper, 2 = Middle, 3 = Lower)
-  Sex - gender of the passenger
-  Age - age of the passenger
-  Name - name of the passenger
-  SibSp - number of siblings/spouses on board
-  Parch - number of parents/children on board
-  Ticket - ticket number
-  Fare - passenger fare
-  Cabin - cabin number
-  Embarked - port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Target Variable - Survived (0 or 1)

## 1. IMPORTING LIBRARIES

In [1]:
#import the required libraries

# linear algebra
import numpy as np 
# data processing
import pandas as pd 
# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

In [2]:
#importing the decision tree module 

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

## 2. LOADING DATASET

In [3]:
#Loading the dataset both train and test data

test_data = pd.read_csv("test.csv")
train_data = pd.read_csv("train.csv")

In [4]:
#DataFrame.append was removed in pandas 2.0, so use pd.concat to stack train and test rows
combined = pd.concat([train_data, test_data])

In [5]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [6]:
#Look at the data

train_data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## 3. DATA PREPARATION AND PREPROCESSING

In [7]:
#check for the number of null values

train_data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [8]:
test_data.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
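
The raw counts are easier to compare as a share of the rows: in the training data, 687 of 891 Cabin values (about 77%) and 177 of 891 Age values (about 20%) are missing. A minimal sketch of that calculation follows; the frame `toy` is a hypothetical stand-in for train_data, not the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_data; the real frame comes from train.csv.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# isnull() gives a boolean frame; its column mean is the fraction of missing values.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting in descending order puts the columns that are candidates for dropping at the top.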

In [9]:
#drop Cabin because most of its values are null

train_data = train_data.drop(columns='Cabin')
test_data = test_data.drop(columns='Cabin')

In [10]:
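#fill the missing ages with the mean age of the combined data
#(assign the result back; fillna(inplace=True) on a selected column may not update the frame in recent pandas)
mean_age = combined['Age'].mean()
train_data['Age'] = train_data['Age'].fillna(mean_age)
test_data['Age'] = test_data['Age'].fillna(mean_age)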
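
The cell above takes the fill value from the combined train and test rows. A common alternative, shown here only as a sketch, is to compute the fill value from the training rows alone and reuse it for the test rows, so that no test-set information enters preprocessing. The frames `toy_train` and `toy_test` are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the Age columns of train_data and test_data.
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0]})

# mean() skips NaN by default, so this is the mean of the known training ages.
age_fill = toy_train["Age"].mean()

# Assign the filled column back instead of using inplace=True on a selection.
toy_train["Age"] = toy_train["Age"].fillna(age_fill)
toy_test["Age"] = toy_test["Age"].fillna(age_fill)
print(age_fill, toy_test["Age"].tolist())
```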

In [11]:
#Embarked has only 2 missing values, so fill them with the most frequent value (mode)

train_data['Embarked'] = train_data['Embarked'].fillna(combined['Embarked'].mode()[0])

In [12]:
#only one Fare value is missing (in the test data), so fill it with 0

test_data['Fare'] = test_data['Fare'].fillna(0)

In [13]:
#check if the results are satisfactory, not handling survived as it is our target variable

train_data.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

In [14]:
test_data.isnull().sum()

PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

In [15]:
#data transformation of categorical variables

train_data.replace({'Sex':{'male':0,'female':1}, 'Embarked':{'S':0,'C':1,'Q':2}}, inplace=True)
test_data.replace({'Sex':{'male':0,'female':1}, 'Embarked':{'S':0,'C':1,'Q':2}}, inplace=True)
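
Mapping Embarked to 0, 1, 2 gives the ports an order that has no real meaning. One alternative, sketched below under the assumption that one indicator column per port is acceptable, is one-hot encoding with `pd.get_dummies`. The frame `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in with the two categorical columns from the notebook.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# A binary variable can stay a single 0/1 column.
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})

# One indicator column per port instead of an arbitrary 0/1/2 ordering.
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)
print(encoded)
```

If this encoding is used, the same columns have to be produced for both the training and the test data.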

In [16]:
train_data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,1
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,0
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,0


In [17]:
#convert age from float to int
train_data['Age'] = train_data['Age'].astype(int)
test_data['Age'] = test_data['Age'].astype(int)
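
Note that `astype(int)` truncates toward zero rather than rounding, so a fractional age below 1 becomes 0. A small sketch with made-up ages, comparing truncation with rounding first:

```python
import pandas as pd

# Made-up ages, including fractional values below one year.
ages = pd.Series([29.88, 0.42, 0.92])

truncated = ages.astype(int)         # drops the fractional part
rounded = ages.round().astype(int)   # rounds to the nearest integer first
print(truncated.tolist(), rounded.tolist())
```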

In [18]:
train_data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",0,22,1,0,A/5 21171,7.25,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38,1,0,PC 17599,71.2833,1
2,3,1,3,"Heikkinen, Miss. Laina",1,26,0,0,STON/O2. 3101282,7.925,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35,1,0,113803,53.1,0
4,5,0,3,"Allen, Mr. William Henry",0,35,0,0,373450,8.05,0


## 4. MODEL BUILDING

In [19]:
#remove PassengerId, Name and Ticket, which are not used for the prediction, and separate the target Survived

X_train = train_data.drop(columns=['PassengerId','Name','Ticket','Survived'])
Y_train = train_data['Survived']

In [21]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    891 non-null    int64  
 1   Sex       891 non-null    int64  
 2   Age       891 non-null    int32  
 3   SibSp     891 non-null    int64  
 4   Parch     891 non-null    int64  
 5   Fare      891 non-null    float64
 6   Embarked  891 non-null    int64  
dtypes: float64(1), int32(1), int64(5)
memory usage: 45.4 KB


In [22]:
X_test = test_data.drop(columns=['PassengerId','Name','Ticket'])

In [23]:
#create the decision tree classifier 

decision_tree = DecisionTreeClassifier() 
decision_tree.fit(X_train, Y_train) 

DecisionTreeClassifier()
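
With default settings the tree grows until its leaves are pure, and ties between equally good splits are broken randomly, so repeated runs can differ. A minimal sketch of a depth-limited, reproducible tree is shown below. The data is synthetic, and the values `max_depth=3` and `random_state=0` are illustrative assumptions, not tuned settings.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / Y_train: 200 rows, 4 numeric features.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# max_depth caps the tree size; random_state makes the fit repeatable.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
shallow_tree.fit(X_toy, y_toy)
print(shallow_tree.get_depth(), shallow_tree.get_n_leaves())
```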

## 5. PREDICTION

In [24]:
#predicting with the test data

Y_pred = decision_tree.predict(X_test) 
print(Y_pred)

[0 0 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1 0 0 0 1 0 1 1 0
 0 0 1 0 0 0 1 1 0 1 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 0 0
 1 0 0 1 0 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0
 1 1 1 1 1 0 1 0 1 1 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0
 0 0 1 0 0 1 0 0 1 0 0 1 1 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 1 1 0 1
 0 1 0 0 0 0 0 1 0 1 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 1 0 1 0
 1 1 1 0 0 1 0 0 0 1 1 0 0 0 1 1 1 1 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1
 0 0 0 1 1 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 1 1 0
 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1 1
 0 1 0 0 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 1 0 1 0
 0 1 1 1 1 0 0 1 0 0 1]
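
The Kaggle competition expects a CSV with two columns, PassengerId and Survived. A sketch of building that file from the predictions follows; `toy_ids` and `toy_pred` are hypothetical stand-ins for `test_data['PassengerId']` and `Y_pred`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for test_data['PassengerId'] and Y_pred.
toy_ids = pd.Series([892, 893, 894], name="PassengerId")
toy_pred = np.array([0, 1, 0])

submission = pd.DataFrame({"PassengerId": toy_ids, "Survived": toy_pred})

# index=False keeps the row index out of the file; pass a path to write to disk.
csv_text = submission.to_csv(index=False)
print(csv_text)
```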


## 6. EVALUATING PERFORMANCE

In [25]:
#accuracy on the training data the model was fitted on; this is not an estimate of accuracy on unseen passengers

acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
print(acc_decision_tree)

97.64
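
The score above is computed on the rows the tree was trained on, and an unpruned tree can nearly memorise those rows. The test data has no Survived column, so a held-out split of the training data is one way to get a more realistic estimate. The sketch below uses `train_test_split`, which is imported at the top but not used, together with `cross_val_score`. The data is synthetic, and the split size and seeds are assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / Y_train.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 5))
y_toy = (X_toy[:, 0] + rng.normal(size=300) > 0).astype(int)

# Hold out 20% of the labelled rows for validation.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

tree = DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)
fit_acc = tree.score(X_fit, y_fit)   # accuracy on rows the tree has seen
val_acc = tree.score(X_val, y_val)   # accuracy on held-out rows

# 5-fold cross-validation averages over several such splits.
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X_toy, y_toy, cv=5)
print(fit_acc, val_acc, cv_scores.mean())
```

A large gap between `fit_acc` and `val_acc` is a sign of overfitting.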
