## Day 60 - DIY Solution

**Q1. Problem Statement: K-Fold Cross-Validation**

Load the '/content/titanic.csv' dataset into a DataFrame and perform the following tasks:
1.	Identify the null values and remove the rows and columns that contain them by using the dropna() function
2.	Considering the 'Survived' column as the target, separate the target variable from the independent variables
3.	Select only the numeric columns from the input variables
4.	Split the data into five folds using the KFold() function
5.	Build a decision tree classifier model and print the model accuracy for each fold
6.	Find the accuracies of the model for all the folds using a cross validator (cross_val_score()) and compare them with the accuracies from the manual K-fold loop


**Step-1:** Importing the required libraries.

In [1]:
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

**Step-2:** Loading the CSV data into a DataFrame.

In [2]:
df = pd.read_csv('/content/titanic.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893.0,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894.0,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895.0,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896.0,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  417 non-null    float64
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int64  
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          332 non-null    float64
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         417 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     418 non-null    object 
dtypes: float64(3), int64(4), object(5)
memory usage: 39.3+ KB


**Step-3:** Identifying the null values and removing them by using the dropna() function.

In [4]:
df.isnull().sum()

PassengerId      1
Survived         0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [5]:
# Preview only: the result is not assigned, so df itself is unchanged.
# (how and thresh cannot both be passed in recent pandas versions.)
df.dropna(axis=0, how='any')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
12,904.0,1,1,"Snyder, Mrs. John Pillsbury (Nelle Stevenson)",female,23.0,1,0,21228,82.2667,B45,S
14,906.0,1,1,"Chaffee, Mrs. Herbert Fuller (Carrie Constance...",female,47.0,1,0,W.E.P. 5734,61.1750,E31,S
24,916.0,1,1,"Ryerson, Mrs. Arthur Larned (Emily Maria Borie)",female,48.0,1,3,PC 17608,262.3750,B57 B59 B63 B66,C
26,918.0,1,1,"Ostby, Miss. Helene Ragnhild",female,22.0,0,1,113509,61.9792,B36,C
28,920.0,0,1,"Brady, Mr. John Bertram",male,41.0,0,0,113054,30.5000,A21,S
...,...,...,...,...,...,...,...,...,...,...,...,...
404,1296.0,0,1,"Frauenthal, Mr. Isaac Gerald",male,43.0,1,0,17765,27.7208,D40,C
405,1297.0,0,2,"Nourney, Mr. Alfred (Baron von Drachstedt"")""",male,20.0,0,0,SC/PARIS 2166,13.8625,D38,C
407,1299.0,0,1,"Widener, Mr. George Dunton",male,50.0,1,1,113503,211.5000,C80,C
411,1303.0,1,1,"Minahan, Mrs. William Edward (Lillian E Thorpe)",female,37.0,1,0,19928,90.0000,C78,Q


In [6]:
df.drop(['Age', 'Cabin', 'PassengerId'], axis=1, inplace=True) # Remove these columns, which contain null values ('Fare' is kept)
df.head()

Unnamed: 0,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare,Embarked
0,0,3,"Kelly, Mr. James",male,0,0,330911,7.8292,Q
1,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,1,0,363272,7.0,S
2,0,2,"Myles, Mr. Thomas Francis",male,0,0,240276,9.6875,Q
3,0,3,"Wirz, Mr. Albert",male,0,0,315154,8.6625,S
4,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,1,1,3101298,12.2875,S


In [7]:
df = df.dropna(how='any') # Drop the remaining row with a null value (the missing 'Fare')

In [8]:
df.isnull().sum()

Survived    0
Pclass      0
Name        0
Sex         0
SibSp       0
Parch       0
Ticket      0
Fare        0
Embarked    0
dtype: int64

In [9]:
df.shape

(417, 9)
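
The order of the two clean-up steps matters: calling dropna() row-wise while a mostly empty column such as 'Cabin' is still present removes almost every row, whereas dropping the sparse columns first keeps most of the data. The sketch below illustrates this on a small made-up frame; the column names `a`, `sparse`, and `b` are hypothetical and are not part of the Titanic file.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up data): one mostly empty column and one nearly complete column.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "sparse": [np.nan, np.nan, np.nan, 9.0],
    "b": [10.0, np.nan, 30.0, 40.0],
})

# Row-wise dropna on the full frame: any NaN in a row removes that row.
rows_only = toy.dropna(axis=0, how="any")

# Drop the sparse column first, then remove the rows that still contain NaN.
cols_then_rows = toy.drop(columns=["sparse"]).dropna(axis=0, how="any")

print(rows_only.shape)       # (1, 3)
print(cols_then_rows.shape)  # (3, 2)
```

On the toy frame the first approach keeps one row and the second keeps three, which mirrors why the notebook drops 'Age' and 'Cabin' before calling dropna().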

**Step-4:** Considering the 'Survived' column as the target, separating the target variable from the independent variables.

In [10]:
y = df['Survived'] # Target variable
X = df.drop(['Survived'], axis=1) # Removing target variable from training data

In [11]:
y

0      0
1      1
2      0
3      0
4      1
      ..
413    0
414    1
415    0
416    0
417    0
Name: Survived, Length: 417, dtype: int64

In [12]:
X.head()

Unnamed: 0,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare,Embarked
0,3,"Kelly, Mr. James",male,0,0,330911,7.8292,Q
1,3,"Wilkes, Mrs. James (Ellen Needs)",female,1,0,363272,7.0,S
2,2,"Myles, Mr. Thomas Francis",male,0,0,240276,9.6875,Q
3,3,"Wirz, Mr. Albert",male,0,0,315154,8.6625,S
4,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,1,1,3101298,12.2875,S


**Step-5:** Selecting the numeric columns from the input variables.

In [13]:
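# Select numeric columns only
# Note: the column names are taken from df, which still contains 'Survived',
# so the target ends up in X again (see the output below)
numeric_cols = [cname for cname in df.columns if df[cname].dtype in ['int64', 'float64']]
X = df[numeric_cols].copy()
X.head()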

Unnamed: 0,Survived,Pclass,SibSp,Parch,Fare
0,0,3,0,0,7.8292
1,1,3,1,0,7.0
2,0,2,0,0,9.6875
3,0,3,0,0,8.6625
4,1,3,1,1,12.2875


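The resulting X contains 417 rows and 5 numeric columns: Survived, Pclass (ticket class), SibSp (siblings/spouses aboard), Parch (parents/children aboard), and Fare (ticket fare). The goal is to predict which passengers survived the Titanic shipwreck from the available data. Note that the numeric columns were selected from df rather than from the X built in Step-4, so the target 'Survived' is still among the inputs; PassengerId is not, because it was dropped in Step-3.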
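
One way to keep the target out of the inputs is to pick the numeric columns from the frame that no longer contains 'Survived'. The sketch below uses a small made-up frame and pandas' `select_dtypes`; it is an illustration under those assumptions, not the code that produced the outputs in this notebook.

```python
import pandas as pd

# Made-up rows in the shape of the notebook's df (not the real file).
toy = pd.DataFrame({
    "Survived": [0, 1, 0, 1],
    "Pclass": [3, 3, 2, 1],
    "Sex": ["male", "female", "male", "female"],
    "Fare": [7.8292, 7.0, 9.6875, 82.2667],
})

y_toy = toy["Survived"]                  # target
X_toy = toy.drop(columns=["Survived"])   # inputs without the target

# Select numeric columns from the inputs, not from the original frame.
X_num = X_toy.select_dtypes(include="number")

print(list(X_num.columns))  # ['Pclass', 'Fare']
```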

**Step-6:** Splitting the data into five folds using the KFold() function.

In [14]:
kf = KFold(n_splits=5)
print("Data is split into the following number of folds:")
kf.get_n_splits(X, y)

Data is split into the following number of folds:


5
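
To see what the splitter actually produces, the sketch below runs KFold on ten made-up samples and prints the row indices of each fold. The names `X_demo` and `kf_demo` are illustrative. Without shuffling, the test folds are consecutive blocks of rows.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)  # ten made-up samples, one feature
kf_demo = KFold(n_splits=5)            # no shuffling: folds are contiguous blocks

folds = [(train.tolist(), test.tolist()) for train, test in kf_demo.split(X_demo)]
for i, (train, test) in enumerate(folds, start=1):
    print(f"Fold {i}: train={train} test={test}")
```

Each sample appears in exactly one test fold. If the rows are ordered in some meaningful way, `KFold(n_splits=5, shuffle=True, random_state=0)` is a common alternative.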

**What Is the Model Score Using KFold?**

We evaluate a decision tree classifier by cross-validation in two ways: first with an explicit loop over the KFold splits, then with cross_val_score(). In both cases the model is scored on every fold using accuracy, and averaging the fold scores gives the overall score.

**Step-7:** Building a decision tree classifier model.

In [15]:
clf = DecisionTreeClassifier()
print("Accuracies for each fold of data are:")
for train_index, test_index in kf.split(X, y):
    # Refit the tree on the training rows of this fold
    clf.fit(X.iloc[train_index], y.iloc[train_index])
    # Score it on the held-out rows of the same fold
    pred = clf.predict(X.iloc[test_index])
    print(round(accuracy_score(y.iloc[test_index], pred), 3))

Accuracies for each fold of data are:
1.0
1.0
1.0
1.0
1.0
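
The loop above prints one accuracy per fold. To obtain the overall score, the fold accuracies can be collected and averaged. The following is a minimal sketch on synthetic data from `make_classification` (an assumption, used so the example runs without the CSV file); `clone` gives each fold a fresh, unfitted copy of the model.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic file).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

kf_demo = KFold(n_splits=5)
base_model = DecisionTreeClassifier(random_state=0)

fold_scores = []
for train_idx, test_idx in kf_demo.split(X_demo):
    model = clone(base_model)  # fresh, unfitted copy for every fold
    model.fit(X_demo[train_idx], y_demo[train_idx])
    pred = model.predict(X_demo[test_idx])
    fold_scores.append(accuracy_score(y_demo[test_idx], pred))

mean_score = float(np.mean(fold_scores))
print([round(s, 3) for s in fold_scores])
print("Mean accuracy:", round(mean_score, 3))
```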


**Step-8:** Finding and validating the accuracies of the model for all the folds using cross_val_score().

In [16]:
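# With an integer cv and a classifier, cross_val_score uses StratifiedKFold,
# so these folds are not necessarily the same as the kf splits used above
cv = cross_val_score(DecisionTreeClassifier(), X, y, cv=5, scoring='accuracy')
print("Accuracies of all the folds after the cross validation are:")
cv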

Accuracies of all the folds after the cross validation are:


array([1., 1., 1., 1., 1.])
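
When comparing these scores with the manual loop, note that passing `cv=5` for a classifier makes cross_val_score use stratified folds, while the loop in Step-7 uses plain KFold. Passing the KFold object itself as `cv` makes both approaches score the same splits. The sketch below checks this on synthetic data (an assumption); a fixed `random_state` makes the tree deterministic so the two sets of scores can be compared directly.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic file).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
kf_demo = KFold(n_splits=5)
tree = DecisionTreeClassifier(random_state=0)

# Manual loop over the KFold splits.
manual = []
for train_idx, test_idx in kf_demo.split(X_demo):
    fitted = clone(tree).fit(X_demo[train_idx], y_demo[train_idx])
    manual.append(accuracy_score(y_demo[test_idx], fitted.predict(X_demo[test_idx])))

# Same splitter passed to cross_val_score: identical folds, identical scores.
same_splits = cross_val_score(tree, X_demo, y_demo, cv=kf_demo, scoring="accuracy")

# Integer cv: stratified folds, so the scores may differ from the manual loop.
stratified = cross_val_score(tree, X_demo, y_demo, cv=5, scoring="accuracy")

print("Manual loop     :", np.round(manual, 3))
print("cv=kf_demo      :", np.round(same_splits, 3))
print("cv=5 (stratified):", np.round(stratified, 3))
print("Manual equals cv=kf_demo:", np.allclose(manual, same_splits))
```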

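**Observation:** The model shows 100% accuracy on every fold, both in the manual loop and with cross_val_score(). The small size of the dataset does not explain this. In Step-5, X was rebuilt from df and therefore still contains the 'Survived' column, so the tree can read the target directly from its inputs (target leakage). These scores do not measure predictive performance; to get a meaningful estimate, select the numeric columns from the inputs after 'Survived' has been removed and score the folds again.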
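
A perfect score on every fold is worth checking against the inputs: the X.head() output in Step-5 lists 'Survived' among the feature columns. The sketch below shows the effect on made-up data, where the features are pure noise and the target is random. Copying the target into the inputs is enough for a decision tree to score 1.0 on every fold. All names and data here are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
noise = rng.normal(size=(200, 3))         # features unrelated to the target
y_demo = rng.integers(0, 2, size=200)     # random binary target

leaky = np.column_stack([noise, y_demo])  # target copied into the inputs

kf_demo = KFold(n_splits=5)
tree = DecisionTreeClassifier(random_state=0)

leaky_scores = cross_val_score(tree, leaky, y_demo, cv=kf_demo, scoring="accuracy")
clean_scores = cross_val_score(tree, noise, y_demo, cv=kf_demo, scoring="accuracy")

print("With target in inputs   :", leaky_scores)
print("Without target in inputs:", np.round(clean_scores, 3))
```

With the leaked column the tree only needs a single split on it, while the noise-only features leave it close to guessing.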