Problem Statement 1: Load the 'content/titanic.csv' dataset into a DataFrame and perform the
following tasks:
1. Identify the null values and remove the rows and columns that contain them by using the
dropna() function
2. Considering the 'Survived' column as the target, separate the target variable from the
independent variables
3. Select only the numeric columns from the input variables
4. Split the data into five folds using the KFold() function
5. Build a decision tree classifier model and print model accuracies for all the data folds
6. Find the accuracies of the model for all the folds using a cross validator and compare them
with the per-fold accuracies from step 5


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold

In [None]:
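# The problem statement refers to 'content/titanic.csv'; adjust the path to match your copy of the file
df = pd.read_csv("titanic_dataset.csv")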
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [None]:
# dropna() defaults to axis=0: drop every row that has at least one missing value
df.dropna(inplace=True)
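
Task 1 mentions removing both null rows and null columns, while the call above only drops rows. The following is a minimal sketch of the difference on a hypothetical toy frame (the name `toy` and its values are illustrative assumptions, not the Titanic data). It also shows one possible two-step variant: dropping a mostly empty column first (Cabin is missing in 687 of 891 rows here) generally leaves more rows to train on than dropping incomplete rows straight away.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the Titanic data) with gaps in two columns.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, 5.0, 6.0, 7.0],
    "c": [np.nan, np.nan, np.nan, 8.0],
})

# axis=0 (default): keep only rows with no missing values.
rows_dropped = toy.dropna()

# axis=1: keep only columns with no missing values.
cols_dropped = toy.dropna(axis=1)

# Two-step variant: keep columns with at least 3 non-null values,
# then drop the rows that are still incomplete.
two_step = toy.dropna(axis=1, thresh=3).dropna()

print(rows_dropped.shape, cols_dropped.shape, two_step.shape)
```

Whether the two-step variant is appropriate depends on whether the sparse column is needed as a feature; the notebook itself uses the plain row-wise `dropna()`.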

In [None]:
df

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...,...
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


In [None]:
df_numerics_only = df.select_dtypes(include=np.number)
df_numerics_only.head()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
1,2,1,1,38.0,1,0,71.2833
3,4,1,1,35.0,1,0,53.1
6,7,0,1,54.0,0,0,51.8625
10,11,1,3,4.0,1,1,16.7
11,12,1,1,58.0,0,0,26.55


In [None]:
# Independent variables: every numeric column except the target
X = df_numerics_only.drop(columns='Survived')
X.head()

,PassengerId,Pclass,Age,SibSp,Parch,Fare
1,2,1,38.0,1,0,71.2833
3,4,1,35.0,1,0,53.1
6,7,1,54.0,0,0,51.8625
10,11,3,4.0,1,1,16.7
11,12,1,58.0,0,0,26.55


In [None]:
y = df['Survived']
y.head()

1     1
3     1
6     0
10    1
11    1
Name: Survived, dtype: int64
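
Before building the model, it may help to see how KFold() partitions the rows. The sketch below assumes 183 rows remain after dropna(); that count is not printed anywhere in this notebook, but it is consistent with the fold accuracies reported at the end (for example 0.75 = 27/36 and 0.5135... = 19/37). The names `n_rows` and `placeholder` are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 183  # assumption: number of rows left after dropna()
placeholder = np.zeros((n_rows, 1))  # KFold only needs the number of samples

# shuffle=False by default, so each test fold is a contiguous block of rows.
kf = KFold(n_splits=5)
fold_sizes = [len(test_idx) for _, test_idx in kf.split(placeholder)]
first_test_idx = next(kf.split(placeholder))[1]

print(fold_sizes)
print(first_test_idx[:5], first_test_idx[-1])
```

Because the folds are contiguous and unshuffled, any ordering in the file carries over into the folds; passing `shuffle=True` with a `random_state` is a common alternative when row order is not meaningful.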

In [None]:
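from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_validate

# Five contiguous, unshuffled folds (KFold defaults to shuffle=False)
kf = KFold(n_splits=5)

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # Fit a fresh tree on each training split
    dt = DecisionTreeClassifier(random_state=42)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    # accuracy_score expects (y_true, y_pred)
    print("Accuracy:", accuracy_score(y_test, y_pred))

# Note: for a classifier, cv=5 uses StratifiedKFold, so these folds differ
# from the KFold splits above; pass cv=kf to score on identical splits.
c = cross_validate(dt, X, y, cv=5, scoring='accuracy')
print("Cross Validation Accuracy:", c['test_score'])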


Accuracy: 0.5135135135135135
Accuracy: 0.6216216216216216
Accuracy: 0.6486486486486487
Accuracy: 0.75
Accuracy: 0.7777777777777778
Cross Validation Accuracy: [0.35135135 0.64864865 0.40540541 0.63888889 0.72222222]
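
The two sets of accuracies above differ because they are computed on different splits: with an integer `cv` and a classifier, scikit-learn uses StratifiedKFold, whereas the manual loop uses plain KFold. A minimal sketch of the comparison, using a synthetic dataset from `make_classification` as a stand-in for the Titanic features (an assumption for illustration; `X_demo`, `y_demo` and the score variables are hypothetical names):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption; not the Titanic dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Manual loop over plain KFold splits, as in the cell above.
kf = KFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in kf.split(X_demo):
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # For classifiers, score() returns mean accuracy on the given data.
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

# Passing the KFold object makes cross_validate reuse exactly the same splits.
cv_same = cross_validate(
    DecisionTreeClassifier(random_state=42), X_demo, y_demo, cv=kf, scoring="accuracy"
)

# cv=5 selects StratifiedKFold for a classifier, so the splits generally differ.
cv_stratified = cross_validate(
    DecisionTreeClassifier(random_state=42), X_demo, y_demo, cv=5, scoring="accuracy"
)

print("Manual KFold:     ", np.round(manual_scores, 3))
print("cross_validate kf:", np.round(cv_same["test_score"], 3))
print("cross_validate 5: ", np.round(cv_stratified["test_score"], 3))
```

With `cv=kf` the cross-validator should reproduce the manual per-fold accuracies, since the tree uses a fixed `random_state`; with `cv=5` the scores are a separate, stratified estimate rather than a check on the manual loop.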
