<a href="https://colab.research.google.com/github/DevanshParmar/Data-Science-Summer-Camp-2021/blob/main/Decision%20Tree%20Model%20on%20Titanic%20Survival%20Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Decision Tree Model on Titanic Survival Dataset**
This notebook implements a decision tree classifier from scratch, using NumPy and pandas, and applies it to the Kaggle Titanic survival dataset.

#### **Uploads**
Importing the required libraries and downloading the dataset files from Kaggle into Google Drive.

In [None]:
import numpy as np
import pandas as pd
import zipfile
from google.colab import drive
import os

In [None]:
drive.mount('/content/gdrive')
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/Kaggle"
%cd /content/gdrive/My Drive/Kaggle
!kaggle competitions download -c titanic
!ls

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
/content/gdrive/My Drive/Kaggle
train.csv: Skipping, found more recently modified local copy (use --force to force download)
test.csv: Skipping, found more recently modified local copy (use --force to force download)
gender_submission.csv: Skipping, found more recently modified local copy (use --force to force download)
gender_submission.csv  kaggle.json  test.csv  train.csv


#### **Data Preprocessing**
The next cell loads the three CSV files, renames their columns, drops the columns that are not used as features (PassId, Name, Ticket, Cabin and Embarked) and encodes Sex as male = 0, female = 1. Filling missing Age values with the median is also intended, but the cell below does not do so, so those values remain NaN.

In [None]:
print("Input Dataframes:")
data = pd.read_csv('/content/gdrive/MyDrive/Kaggle/train.csv')
col_names1 = ['PassId', 'Survived', 'PClass', 'Name', 'Sex', 'Age', 'SibSp', 'ParCh', 'Ticket', 'Fare', 'Cabin', 'Embarked']
data.columns = col_names1
print(data.head())
print(" ")

test = pd.read_csv('/content/gdrive/MyDrive/Kaggle/test.csv')
col_names2 = ['PassId', 'PClass', 'Name', 'Sex', 'Age', 'SibSp', 'ParCh', 'Ticket', 'Fare', 'Cabin', 'Embarked']
test.columns = col_names2
print(test.head())
print(" ")

test_survived = pd.read_csv('/content/gdrive/MyDrive/Kaggle/gender_submission.csv')
test_survived.columns = ['PassId', 'Survived']
print(test_survived.head())
print(" ")

print("Output Dataframes:")
# Drop the columns that are not used as features
drop_cols = ["PassId", "Name", "Ticket", "Cabin", "Embarked"]
data = data.drop(columns=drop_cols)
# Encode Sex as male = 0, female = 1
data["Sex"] = data["Sex"].map({"male": 0, "female": 1})
print(data.head())
print(" ")

test = test.drop(columns=drop_cols)
test["Sex"] = test["Sex"].map({"male": 0, "female": 1})
# Attach the labels from gender_submission.csv
# (assumes its rows are in the same order as test.csv)
test["Survived"] = test_survived["Survived"].values
print(test.head())
print(" ")

Input Dataframes:
   PassId  Survived  PClass  ...     Fare Cabin  Embarked
0       1         0       3  ...   7.2500   NaN         S
1       2         1       1  ...  71.2833   C85         C
2       3         1       3  ...   7.9250   NaN         S
3       4         1       1  ...  53.1000  C123         S
4       5         0       3  ...   8.0500   NaN         S

[5 rows x 12 columns]
 
   PassId  PClass  ... Cabin Embarked
0     892       3  ...   NaN        Q
1     893       3  ...   NaN        S
2     894       2  ...   NaN        Q
3     895       3  ...   NaN        S
4     896       3  ...   NaN        S

[5 rows x 11 columns]
 
   PassId  Survived
0     892         0
1     893         1
2     894         0
3     895         0
4     896         1
 
Output Dataframes:
   Survived  PClass  Sex   Age  SibSp  ParCh     Fare
0         0       3    0  22.0      1      0   7.2500
1         1       1    1  38.0      1      0  71.2833
2         1       3    1  26.0      0      0   7.9250
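
The preprocessing description mentions filling missing Age values with the median. A minimal sketch of that step, on a toy frame rather than the Kaggle files, could look like the following. The names `toy_train`, `toy_test` and `age_median` are hypothetical, and reusing the training median for the test set is an assumption, not something the notebook specifies.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the training and test frames (not the Kaggle data)
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 26.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 30.0]})

# Median of the known training ages; NaN values are skipped by default
age_median = toy_train["Age"].median()

# Fill both frames with the training median (assumption: the test set
# reuses the training statistic instead of computing its own)
toy_train["Age"] = toy_train["Age"].fillna(age_median)
toy_test["Age"] = toy_test["Age"].fillna(age_median)

print(toy_train["Age"].tolist())  # [22.0, 26.0, 38.0, 26.0]
print(toy_test["Age"].tolist())   # [26.0, 30.0]
```

Without a step like this, rows with a missing Age compare as False against any threshold, which affects how they are routed through the tree.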

#### **Decision Tree Functions**
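The next three blocks define:
1. The entropy function, which measures how mixed the Survived labels are in a set of rows.
2. The division function, which splits the rows into two groups around a threshold value of one column.
3. The information gain function, which measures how much a given split reduces the entropy.

The model uses these three functions to choose the split at every node of the tree.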
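
For reference, the quantities computed by these functions can be written as below. The notation is introduced here for illustration and does not appear in the code: $S$ is the set of rows at a node, $p_c$ is the fraction of rows in $S$ with Survived $= c$ (only classes that are present contribute to the sum), and $S_L$ and $S_R$ are the rows whose value is below, or at or above, the threshold.

$$
H(S) = -\sum_{c \in \{0, 1\}} p_c \log_2 p_c
$$

$$
IG(S) = H(S) - \left( \frac{|S_L|}{|S|} H(S_L) + \frac{|S_R|}{|S|} H(S_R) \right)
$$

A set containing only one class has $H(S) = 0$, and an evenly mixed set has $H(S) = 1$ bit.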

In [None]:
def entropy(target_col):
    # Shannon entropy (in bits) of the labels in target_col
    elements, counts = np.unique(target_col, return_counts=True)
    total = 0.0
    n = np.sum(counts)
    for i in counts:
        p = i / n
        total = total - (p * np.log2(p))
    return total

In [None]:
def division(input_data, title, mean):
    # Split the rows on a threshold: value >= mean goes right, the rest go left.
    # NaN compares as False, so rows with a missing value end up in the left group.
    # (DataFrame.append was removed in pandas 2.0, so a boolean mask is used instead.)
    mask = input_data[title] >= mean
    right = input_data[mask]
    left = input_data[~mask]
    return right, left

In [None]:
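def iGain(input_data, title, mean):
    # Information gain of splitting input_data on column `title` at threshold `mean`
    right, left = division(input_data, title, mean)
    # A split that leaves one side empty is useless; the sentinel makes any valid split win
    if left.shape[0] == 0 or right.shape[0] == 0:
        return -99999
    k = input_data.shape[0]
    left_ratio = float(left.shape[0]) / k
    right_ratio = float(right.shape[0]) / k
    # Parent entropy minus the size-weighted entropy of the two children
    igain = entropy(input_data.Survived) - (left_ratio * entropy(left.Survived) + right_ratio * entropy(right.Survived))
    return igain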
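
To make the information gain concrete, here is a small self-contained sketch on a made-up four-row frame. The helpers `toy_entropy` and `toy_gain` are hypothetical stand-ins written for this example (they follow the same mean-threshold rule but are not the notebook's functions), and the data is invented.

```python
import numpy as np
import pandas as pd

def toy_entropy(labels):
    # Shannon entropy in bits of a label column
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum()) + 0.0  # "+ 0.0" turns -0.0 into 0.0

def toy_gain(df, column, threshold, label="Survived"):
    # Rows with value >= threshold go right, the rest go left
    mask = df[column] >= threshold
    right, left = df[mask], df[~mask]
    if len(left) == 0 or len(right) == 0:
        return -99999  # same sentinel as the notebook: no usable split
    weighted = (len(left) * toy_entropy(left[label])
                + len(right) * toy_entropy(right[label])) / len(df)
    return toy_entropy(df[label]) - weighted

# Invented data: Sex separates the labels perfectly, PClass not at all
toy = pd.DataFrame({
    "Sex":      [0, 0, 1, 1],
    "PClass":   [1, 3, 1, 3],
    "Survived": [0, 0, 1, 1],
})

gain_sex = toy_gain(toy, "Sex", toy["Sex"].mean())
gain_pclass = toy_gain(toy, "PClass", toy["PClass"].mean())
print(gain_sex)     # 1.0
print(gain_pclass)  # 0.0
```

The parent set has an entropy of 1 bit. Splitting on Sex yields two pure groups, so the whole bit is gained, while splitting on PClass leaves both groups evenly mixed and gains nothing.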

#### **Modeling**
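The next block defines the decision tree model as a class with three methods:
1. The first method initialises a node with its depth and the maximum depth of the tree.
2. The second method, train_model, is the main training routine. It picks the feature with the highest information gain, splits the data at that feature's mean, and recursively trains the child nodes until the maximum depth is reached or a split leaves one side empty.
3. The third method, predictions, walks a single row down the tree and returns the predicted class.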

In [None]:
class DT:
    def __init__(self, depth=0, max_depth=5):
        self.left = None          # subtree for rows with value < threshold
        self.right = None         # subtree for rows with value >= threshold
        self.title_name = None    # feature this node splits on
        self.mean_val = None      # split threshold: mean of that feature at this node
        self.depth = depth
        self.max_depth = max_depth
        self.target = None        # majority class of the rows that reached this node

    def train_model(self, input_train):
        features = ['PClass', 'Sex', 'Age', 'SibSp', 'ParCh', 'Fare']
        # Majority class at this node; it is the prediction if the node is a leaf
        self.target = 1 if input_train.Survived.mean() >= 0.5 else 0

        # Pick the feature whose mean-threshold split gives the highest information gain
        iGains = [iGain(input_train, col, input_train[col].mean()) for col in features]
        self.title_name = features[np.argmax(iGains)]
        self.mean_val = input_train[self.title_name].mean()

        r_data, l_data = division(input_train, self.title_name, self.mean_val)
        r_data = r_data.reset_index(drop=True)
        l_data = l_data.reset_index(drop=True)

        # Stop if the split leaves one side empty or the depth limit is reached
        if l_data.shape[0] == 0 or r_data.shape[0] == 0 or self.depth >= self.max_depth:
            return

        self.left = DT(self.depth + 1, self.max_depth)
        self.left.train_model(l_data)
        self.right = DT(self.depth + 1, self.max_depth)
        self.right.train_model(r_data)

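    def predictions(self, test_df):
        # Leaf node: return the stored majority class
        if self.left is None or self.right is None:
            return self.target
        # Same rule as division(): value >= threshold goes right; everything else,
        # including NaN (which compares as False), goes left. This way a value equal
        # to the threshold or a missing value still gets a prediction instead of None.
        if test_df[self.title_name] >= self.mean_val:
            return self.right.predictions(test_df)
        return self.left.predictions(test_df)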

In [None]:
model = DT()
model.train_model(data)

#### **Predictions and Accuracy**
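The next two blocks compute the model's accuracy, loss (the number of misclassified rows), F1 score, sensitivity (recall) and precision on the training and test sets. The test labels are taken from gender_submission.csv, which is Kaggle's sample submission (it assumes that all and only female passengers survive) rather than the true outcomes, so the test statistics measure agreement with that baseline.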

In [None]:
def stats(dataset):
    prediction = []
    for i in range(dataset.shape[0]):
        prediction.append(model.predictions(dataset.loc[i]))
    prediction = np.array(prediction)
    survive_data = np.array(dataset['Survived'])
    #                              
    loss = 0
    f_neg = 0
    f_pos = 0 
    t_neg = 0
    t_pos = 0
    #                              
    for i, j in zip(prediction, survive_data):
        if i == 1 and j == 1:
            t_pos+=1
        elif i == 1 and j == 0:
            f_pos+=1
            loss+=1
        elif i==0 and j == 1:
            f_neg+=1
            loss+=1
        else:
            t_neg+=1
    #                              
    rec = t_pos / (t_pos + f_neg)
    prc = t_pos / (t_pos + f_pos)
    acc = (t_pos + t_neg) / (t_pos + t_neg + f_pos + f_neg)
    f1s = 2 * prc * rec / (prc + rec)
    #                              
    print('   Accuracy is {:.2f}%'.format(100*acc))
    print('       Loss is',loss)
    print('   F1 Score is {:.4f}'.format(f1s))
    print('Sensitivity is {:.4f}'.format(rec))
    print('  Precision is {:.4f}'.format(prc))

In [None]:
print("Statistics for Training dataset are:")
print(" ")
stats(data)
print(" ")
print(" ")
print("Statistics for Test dataset are:")
print(" ")
stats(test)

Statistics for Training dataset are:
 
   Accuracy is 85.30%
       Loss is 131
   F1 Score is 0.7706
Sensitivity is 0.7432
  Precision is 0.8000
 
 
Statistics for Test dataset are:
 
   Accuracy is 92.34%
       Loss is 32
   F1 Score is 0.8841
Sensitivity is 0.9313
  Precision is 0.8414
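
As a worked illustration of how these statistics follow from the four confusion-matrix counts, here is a short sketch. The helper `classification_metrics` is hypothetical, the counts are made up and not taken from the Titanic run, and returning 0.0 when a denominator is zero is an assumption of this sketch.

```python
def classification_metrics(t_pos, f_pos, f_neg, t_neg):
    # Derive the summary statistics from the four confusion-matrix counts
    total = t_pos + f_pos + f_neg + t_neg
    accuracy = (t_pos + t_neg) / total
    # Guard against empty denominators (assumption: report 0.0 in that case)
    precision = t_pos / (t_pos + f_pos) if (t_pos + f_pos) else 0.0
    recall = t_pos / (t_pos + f_neg) if (t_pos + f_neg) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "loss": f_pos + f_neg,
            "precision": precision, "recall": recall, "f1": f1}

# Made-up counts for illustration only
m = classification_metrics(t_pos=8, f_pos=2, f_neg=4, t_neg=6)
print('Accuracy is {:.2f}%'.format(100 * m["accuracy"]))  # Accuracy is 70.00%
print('Loss is', m["loss"])                               # Loss is 6
print('F1 Score is {:.4f}'.format(m["f1"]))               # F1 Score is 0.7273
print('Sensitivity is {:.4f}'.format(m["recall"]))        # Sensitivity is 0.6667
print('Precision is {:.4f}'.format(m["precision"]))       # Precision is 0.8000
```

Precision asks how many of the predicted survivors actually survived, while sensitivity asks how many of the actual survivors were found. The F1 score is the harmonic mean of the two.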


#### **References**

1. Gagan Panwar's YouTube playlist on the same topic was a great help: www.youtube.com/playlist?list=PL9mhv0CavXYg3KFKct0JnslSwBCpAd_g0
2. Some Towards Data Science (TDS) articles were helpful, especially: www.towardsdatascience.com/decision-trees-for-classification-id3-algorithm-explained-89df76e72df1
3. This Exsilio blog post was very helpful in visualising the final statistics of the model: www.blog.exsilio.com/all/accuracy-precision-recall-f1-score-interpretation-of-performance-measures/