#### Titanic Survival
The dataset contains details of the Titanic's passengers, including whether each one survived.

We will train a decision tree model to predict whether a passenger survived.

The dataset is available <a href="https://github.com/IronStark007/Datasets/blob/main/titanic.csv">here</a>.

#### Preprocessing the data

In [1]:
#importing necessary modules for preprocessing the data & analysis
import pandas as pd 
import numpy as np

In [2]:
# loading the dataset using pandas
df=pd.read_csv(r'C:/Users/ansar/Downloads/titanic.csv')

# reading first five rows of the dataset for better understanding
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The dataset contains several columns that we do not need for training: PassengerId, Name, SibSp, Parch, Ticket, Cabin and Embarked. We will simply drop them.

Our independent columns are Pclass, Sex, Age and Fare,
and the dependent column is Survived.

In [3]:
#drop unnecessary columns from the dataset
df=df.drop(labels=['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'Embarked'],axis=1)

#checking the dataset after dropping
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare
0,0,3,male,22.0,7.25
1,1,1,female,38.0,71.2833
2,1,3,female,26.0,7.925
3,1,1,female,35.0,53.1
4,0,3,male,35.0,8.05


In [4]:
#checking the info about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    object 
 3   Age       714 non-null    float64
 4   Fare      891 non-null    float64
dtypes: float64(2), int64(2), object(1)
memory usage: 34.9+ KB


Of the 891 entries, only the Age column has null values (177 are missing). We have to either fill them with the mean, median, mode or zero, or drop the rows that contain them.

The Sex column is categorical, so we convert it to a numerical column by using 1 for male and 0 for female.
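
The filling options listed above can be compared on a toy column before choosing one. The sketch below uses a small made-up `Age` series (not the Titanic data), and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Toy column with two missing values (made-up numbers, not from titanic.csv)
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0], name="Age")

filled_mean = ages.fillna(ages.mean())      # mean of the observed values
filled_median = ages.fillna(ages.median())  # median of the observed values
filled_zero = ages.fillna(0)                # constant fill
dropped = ages.dropna()                     # remove the rows instead of filling

print(filled_mean.tolist())
print(filled_median.tolist())
print(filled_zero.tolist())
print(len(dropped))
```

Filling keeps all rows but pulls the column toward the chosen value, while dropping keeps only genuine ages at the cost of fewer rows.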

In [5]:
#checking the statistics of the dataset
df.describe(include='all')

Unnamed: 0,Survived,Pclass,Sex,Age,Fare
count,891.0,891.0,891,714.0,891.0
unique,,,2,,
top,,,male,,
freq,,,577,,
mean,0.383838,2.308642,,29.699118,32.204208
std,0.486592,0.836071,,14.526497,49.693429
min,0.0,1.0,,0.42,0.0
25%,0.0,2.0,,20.125,7.9104
50%,0.0,3.0,,28.0,14.4542
75%,1.0,3.0,,38.0,31.0


The Age column has a mean of about 29.7 and a median of 28. Either can be used to fill the null values; here we use the median.

In [6]:
# Filling the Age null values with its median (28)
df['Age']=df['Age'].fillna(28)

# Checking that no null values remain
df['Age'].isnull()

0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Name: Age, Length: 891, dtype: bool
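
The boolean Series above only displays the first and last few rows, so a single count is easier to read. A minimal sketch on a stand-in frame (the values are made up; only the column name matches the notebook):

```python
import numpy as np
import pandas as pd

# Stand-in frame with the same column name as the notebook (values are made up)
demo = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0, np.nan]})

missing_before = int(demo["Age"].isnull().sum())  # number of null values
demo["Age"] = demo["Age"].fillna(demo["Age"].median())
missing_after = int(demo["Age"].isnull().sum())   # should now be zero

print(missing_before, missing_after)
```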

Now we have to convert the Sex column into a numerical column.

In [7]:
#checking the values in Sex
df['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [8]:
# encoding Sex as numbers: male -> 1, female -> 0
df['Sex']=df['Sex'].map({'male':1,'female':0})

In [9]:
#checking the column
df['Sex'].value_counts()

1    577
0    314
Name: Sex, dtype: int64

In [10]:
# Checking again the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    int64  
 3   Age       891 non-null    float64
 4   Fare      891 non-null    float64
dtypes: float64(2), int64(3)
memory usage: 34.9 KB


Now all the values are filled and every column is numeric. It is time to split the data into train and test sets.
#### Splitting the dataset

In [11]:
#importing train_test_split from sklearn
from sklearn.model_selection import train_test_split

In [12]:
#dividing the data into independent(X) and dependent(Y) variable
X=df[['Pclass', 'Sex', 'Age', 'Fare']]
Y=df['Survived']

#checking the shape
print('X:',X.shape)
print('Y:',Y.shape)

X: (891, 4)
Y: (891,)


In [13]:
#splitting into train and test
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3,random_state=1)
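
Survived is not balanced (about 38% of passengers survived, per the `describe()` output), so a plain random split can give train and test sets with slightly different class ratios. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic stand-in arrays, not the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 40% positives (not the Titanic data)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 40 + [0] * 60)

# stratify=y_demo keeps the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # share of positives in each part
```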

#### Training the model
Now it's time to train the model using a decision tree.

In [14]:
#importing decision tree module
from sklearn.tree import DecisionTreeClassifier

#creating an instance of the decision tree
model=DecisionTreeClassifier(random_state=100)

In [15]:
# fitting the train data into the model
model.fit(X_train,Y_train)

DecisionTreeClassifier(random_state=100)

After fitting the model on the training data, we check its accuracy by comparing the true values with the predicted values
using the accuracy_score function from sklearn.

In [16]:
#importing necessary library 
from sklearn.metrics import accuracy_score

In [17]:
#Checking the accuracy on train data
y_pred_train=model.predict(X_train)
acc=accuracy_score(Y_train,y_pred_train)
print('Accuracy on train data :',acc*100)

Accuracy on train data : 98.7158908507223


In [18]:
#Checking the accuracy on test data
y_pred_test=model.predict(X_test)
acc=accuracy_score(Y_test,y_pred_test)
print('Accuracy on test data :',acc*100)

Accuracy on test data : 73.88059701492537


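So our model gives about 98.7% accuracy on the train data and about 73.9% on the test data. The test score is not bad for a first model, but the large gap between the two suggests the tree is overfitting the train data.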
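
A gap this large between train and test accuracy is typical of a fully grown tree. One common way to narrow it is to limit the depth with `max_depth`. The sketch below is only an illustration on synthetic data from `make_classification`; the depth of 3 is an arbitrary choice, and the scores it prints say nothing about the Titanic model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in data (not the Titanic dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0,
    flip_y=0.2, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# Same classifier, once fully grown and once limited to depth 3
full = DecisionTreeClassifier(random_state=100).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=100).fit(X_tr, y_tr)

for name, clf in [("full", full), ("max_depth=3", shallow)]:
    print(name, clf.get_depth(), clf.score(X_tr, y_tr), clf.score(X_te, y_te))
```

Comparing the two rows shows how much train accuracy is given up by the shallower tree and whether test accuracy moves closer to it.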

Let's look at the confusion matrix of the model.

In [19]:
# importing confusion matrix from sklearn
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(Y_test,y_pred_test)
print('Confusion matrix of test data :\n',cm)

Confusion matrix of test data :
 [[126  27]
 [ 43  72]]


In [20]:
cm=confusion_matrix(Y_train,y_pred_train)
print('Confusion matrix of train data :\n',cm)

Confusion matrix of train data :
 [[395   1]
 [  7 220]]
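
The confusion matrix holds more than the accuracy. In scikit-learn's layout the rows are the true classes and the columns are the predicted classes, so the test matrix above can be turned into precision and recall for the "survived" class by hand. The variable names below are only illustrative.

```python
# Test-set confusion matrix copied from the output above:
# rows = true class, columns = predicted class
tn, fp = 126, 27   # true class 0: predicted 0, predicted 1
fn, tp = 43, 72    # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all correct predictions
precision = tp / (tp + fp)                  # predicted survivors who did survive
recall = tp / (tp + fn)                     # actual survivors the model found

print(f"{accuracy:.4f} {precision:.4f} {recall:.4f}")
```

This prints `0.7388 0.7273 0.6261`. The accuracy agrees with the 73.88% reported earlier, and the lower recall shows that the model misses more than a third of the passengers who survived.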


In [24]:
#Tree plotting 
from sklearn import tree

tree.

'|--- feature_1 <= 0.50\n|   |--- feature_0 <= 2.50\n|   |   |--- feature_2 <= 2.50\n|   |   |   |--- feature_0 <= 1.50\n|   |   |   |   |--- class: 0\n|   |   |   |--- feature_0 >  1.50\n|   |   |   |   |--- class: 1\n|   |   |--- feature_2 >  2.50\n|   |   |   |--- feature_3 <= 28.86\n|   |   |   |   |--- feature_3 <= 28.23\n|   |   |   |   |   |--- feature_2 <= 21.50\n|   |   |   |   |   |   |--- class: 1\n|   |   |   |   |   |--- feature_2 >  21.50\n|   |   |   |   |   |   |--- feature_2 <= 26.50\n|   |   |   |   |   |   |   |--- feature_2 <= 25.50\n|   |   |   |   |   |   |   |   |--- feature_3 <= 19.50\n|   |   |   |   |   |   |   |   |   |--- class: 0\n|   |   |   |   |   |   |   |   |--- feature_3 >  19.50\n|   |   |   |   |   |   |   |   |   |--- class: 1\n|   |   |   |   |   |   |   |--- feature_2 >  25.50\n|   |   |   |   |   |   |   |   |--- class: 0\n|   |   |   |   |   |   |--- feature_2 >  26.50\n|   |   |   |   |   |   |   |--- feature_2 <= 37.00\n|   |   |   |   |   | 
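
The printed rules above use generic names such as `feature_1`, which makes them hard to read. If the text comes from `sklearn.tree.export_text` (an assumption, since the call in the cell above is cut off), passing `feature_names` and printing the result gives readable, properly indented rules. A sketch on a tiny made-up sample with the same column names:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny made-up sample with the notebook's column names (not the real data)
demo = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 3, 2],
    "Sex":      [1, 0, 0, 0, 1, 1],
    "Age":      [22.0, 38.0, 26.0, 35.0, 35.0, 54.0],
    "Fare":     [7.25, 71.28, 7.93, 53.10, 8.05, 13.00],
    "Survived": [0, 1, 1, 1, 0, 0],
})
features = ["Pclass", "Sex", "Age", "Fare"]

clf = DecisionTreeClassifier(max_depth=2, random_state=100)
clf.fit(demo[features], demo["Survived"])

# feature_names replaces feature_0, feature_1, ... with the column names;
# print() renders the line breaks instead of showing "\n"
rules = export_text(clf, feature_names=features)
print(rules)
```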

#### Validating the model
We can validate the model by giving it a new input and checking whether it predicts the correct outcome.

In [None]:
#giving input (values taken from the second row of the dataset)
input_data=(1,0,38,71.2833)

# converting the input into numpy array 
inp_array=np.asarray(input_data)

#checking the shape of the input
print(inp_array.shape)

In [None]:
#converting the input into 2d array
inp_reshape=inp_array.reshape(1,-1)

#checking the shape of the array
print(inp_reshape.shape)

In [None]:
#predicting the output
output=model.predict(inp_reshape)
if output==1:
    print('Survived')
else:
    print('Not Survived')

So our model predicts the correct output (the input is the 2nd row of the dataset, a passenger who survived).
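
Because the model was fitted on a DataFrame, recent scikit-learn versions warn about missing feature names when `predict` receives a plain NumPy array. Passing a one-row DataFrame with the same column names avoids this and makes the column order explicit. The sketch below fits a throwaway tree on a few made-up rows (only the queried row is taken from the dataset), so it stands in for the notebook's `model` rather than reproducing it.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

features = ["Pclass", "Sex", "Age", "Fare"]

# Small stand-in training set; the second row mirrors row 2 of the dataset
train = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 3],
    "Sex":      [1, 0, 0, 0, 1],
    "Age":      [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare":     [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Survived": [0, 1, 1, 1, 0],
})
clf = DecisionTreeClassifier(random_state=100)
clf.fit(train[features], train["Survived"])

# One passenger as a one-row DataFrame with matching column names
passenger = pd.DataFrame([[1, 0, 38.0, 71.2833]], columns=features)
pred = clf.predict(passenger)[0]  # take the single prediction out of the array

print("Survived" if pred == 1 else "Not Survived")
```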

## Thank You