<a href="https://colab.research.google.com/github/Ankur-v-2004/Industrial-data-science/blob/main/Titanic_Transformer_%26_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Application of Decision Tree Classifier on Titanic Dataset
1. Holistic approach to data preprocessing
2. **Missing Value imputation** using **Column Tranformer** and deploying mean as strategy on numerical variables
3. **Missing Value imputation** usingusing **Column Tranformer** and deploying most frequent value as strategy for nominal variable
4. Use of **Decision Tree Classifier** for prediction
5. Use of **Column Transformer for data preprocessing**
6. Use of **Pipeline for data preprocessing**
7. Use of **pipeline for fitting the model algorithm**
8. Use the **pipeline for prediction**
9. Use the **pipeline for cross validation**
10. Display the steps performed in the **pipeline**
11. Use the **classification report** and **accuracy score** for model evaluation





## Import the necessary libraries

In [1]:
import seaborn as sns
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
import warnings
warnings.filterwarnings('ignore')
from sklearn.metrics import accuracy_score

### Load the Titanic dataset from Seaborn and have a first look at the dataset and Display the first few rows of the dataset


In [2]:
tdf = sns.load_dataset('titanic')

tdf.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


### It appears that there are large number of columnns which might be replicated.Therefore, it is pertinant to look at the listing the columns of dataset to screen for necessary columns

In [3]:
tdf.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

### Dropping the unnecessary and replicated columns


In [4]:
tdf.drop(columns=['class','deck','who','adult_male','deck','embarked', 'alive',
       'alone'],axis=1,inplace=True)

### Get an overview of the dataset for accessing the missing values as well as the data types of the columns

In [5]:
tdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     891 non-null    int64  
 1   pclass       891 non-null    int64  
 2   sex          891 non-null    object 
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64  
 5   parch        891 non-null    int64  
 6   fare         891 non-null    float64
 7   embark_town  889 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB


In [6]:
tdf.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embark_town
0,0,3,male,22.0,1,0,7.25,Southampton
1,1,1,female,38.0,1,0,71.2833,Cherbourg
2,1,3,female,26.0,0,0,7.925,Southampton
3,1,1,female,35.0,1,0,53.1,Southampton
4,0,3,male,35.0,0,0,8.05,Southampton


### Splitting the data into Independent and dependent variables

In [7]:
X = tdf.drop('survived', axis=1)
y = tdf['survived']

### Split the data into train and test dataset

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [9]:
X_train.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,embark_town
331,1,male,45.5,0,0,28.5,Southampton
733,2,male,23.0,0,0,13.0,Southampton
382,3,male,32.0,0,0,7.925,Southampton
704,3,male,26.0,1,0,7.8542,Southampton
813,3,female,6.0,4,2,31.275,Southampton


### Impute the missing values in the age column with the mean of the column and the missing values in the embarked_town column with the mode of the column using Column Transformer


In [10]:
trf1 = ColumnTransformer([
    ('impute_age', SimpleImputer(), [2]),
    ('impute_embarked', SimpleImputer(strategy='most_frequent'), [6])],
    remainder='passthrough'
)

### Creating a tranformer for OneHot Encoding of sex and Embarked columns


In [11]:
trf2 = ColumnTransformer([
    ('ohe_sex_embarked', OneHotEncoder(sparse=False, handle_unknown='ignore'), [1,6])],
    remainder='passthrough'
                         )

### Appling the MinMax Scaler to the whole data

In [12]:
trf3 = ColumnTransformer([
    ('scale', MinMaxScaler(), slice(0,10))]
                         )

### Applying the Decision Tree classifier algorithm

In [13]:
trf4 = DecisionTreeClassifier()

### Creating a pipeline by combining tranformers and algorithm

In [14]:
pipe = Pipeline([
    ('trf1', trf1),
    ('trf2', trf2),
    ('trf3', trf3),
    ('trf4', trf4)
    ])

### Displaying the steps performed in the pipeline

In [15]:
pipe

### Use the whole data to fit the in the pipe and cross validate the model with 5 folds

In [16]:
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

0.6364258364195594

### Now fit the train dataset in the pipeline and display the fitted pipe

In [17]:
pipe.fit(X_train, y_train)

### Based on the pipeline created above, prediction is made using test dataset and stored in the variable y_pred

In [18]:
y_pred = pipe.predict(X_test)

### Evaluate the model accuracy using accuracy score and classification report

In [19]:
accuracy_score(y_test, y_pred)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.64      0.83      0.72       105
           1       0.58      0.34      0.43        74

    accuracy                           0.63       179
   macro avg       0.61      0.58      0.57       179
weighted avg       0.62      0.63      0.60       179

