                                  # Train_Test_Split Methods

When working with machine learning models, it's important to split your data into training and testing sets to evaluate the performance of your model. Here are some common methods for splitting data:

In [136]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

In [137]:
df=pd.read_csv("IrisData.csv")

In [138]:
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [139]:
df.shape

(150, 6)

In [140]:
df.isnull().sum()

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64

In [141]:
df=df.drop('Id',axis=1)

In [142]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   SepalLengthCm  150 non-null    float64
 1   SepalWidthCm   150 non-null    float64
 2   PetalLengthCm  150 non-null    float64
 3   PetalWidthCm   150 non-null    float64
 4   Species        150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [143]:
df.describe()

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


In [144]:
df['Species'].value_counts()

Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: Species, dtype: int64

# Holdout Method -Basic

Randomly splits the dataset into two parts: training set and testing set.
It is Simple and fast.

In [145]:
X=df.drop('Species',axis=1)
y=df['Species']

In [146]:
X

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [147]:
y

0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
145    Iris-virginica
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
Name: Species, Length: 150, dtype: object

In [148]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df['Species'] = label_encoder.fit_transform(df['Species'])


In [149]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

In [150]:
X_train.shape,y_train.shape

((120, 4), (120,))

In [151]:
X_test.shape,y_test.shape

((30, 4), (30,))

In [152]:
y_train.value_counts(), y_test.value_counts()

(Iris-versicolor    41
 Iris-setosa        40
 Iris-virginica     39
 Name: Species, dtype: int64,
 Iris-virginica     11
 Iris-setosa        10
 Iris-versicolor     9
 Name: Species, dtype: int64)

In [153]:
model1=DecisionTreeClassifier()

In [154]:
model1.fit(X_train,y_train)

In [155]:
y_pred=model1.predict(X_test)

In [156]:
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)


In [157]:
accuracy

1.0

In [158]:
print(report)

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        10
Iris-versicolor       1.00      1.00      1.00         9
 Iris-virginica       1.00      1.00      1.00        11

       accuracy                           1.00        30
      macro avg       1.00      1.00      1.00        30
   weighted avg       1.00      1.00      1.00        30



# Startify Method

Similar to K-Fold Cross-Validation but maintains the same class distribution in each fold as in the original dataset. This is particularly useful for imbalanced datasets.

In [159]:
X_train,X_tes,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42,stratify=y)

In [160]:
y_train.value_counts()

Iris-setosa        40
Iris-virginica     40
Iris-versicolor    40
Name: Species, dtype: int64

In [161]:
y_test.value_counts()

Iris-setosa        10
Iris-virginica     10
Iris-versicolor    10
Name: Species, dtype: int64

In [162]:
model2=DecisionTreeClassifier()

In [163]:
model2.fit(X_train,y_train)

In [164]:
y_pred2=model2.predict(X_test)

In [165]:
accuracy2=accuracy_score(y_test,y_pred2)
report2=classification_report(y_test,y_pred2)

In [166]:
accuracy2

0.4

In [167]:
print(report2)

                 precision    recall  f1-score   support

    Iris-setosa       0.30      0.30      0.30        10
Iris-versicolor       0.44      0.40      0.42        10
 Iris-virginica       0.45      0.50      0.48        10

       accuracy                           0.40        30
      macro avg       0.40      0.40      0.40        30
   weighted avg       0.40      0.40      0.40        30



If you have a small dataset, stratified sampling might result in very small sample sizes for each class in the training set. This can make it difficult for the decision tree to learn meaningful patterns, leading to poor performance.
Decision trees are prone to overfitting, especially if they are not pruned or if they are allowed to grow very deep. Overfitting means that the model captures noise in the training data, which can lead to poor generalization to the test data. 
This can be addressed using hyperparameter tuning or pruning the tree.

In [184]:
clf = DecisionTreeClassifier(max_depth=3, min_samples_split=2,min_samples_leaf=1)
clf.fit(X_train, y_train)

In [185]:
pred=clf.predict(X_test)

In [186]:
score=accuracy_score(y_test,pred)

In [187]:
score

0.96

# K-Fold Cross Validation Method

Splits the dataset into k equally sized folds. The model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times with each fold used exactly once as the test set.

In [172]:
folds=StratifiedKFold(n_splits=3)

In [173]:
for train_index,test_index in folds.split(X,y):
    X_train,X_test,y_train,y_test=X.iloc[train_index],X.iloc[test_index],y.iloc[train_index],y.iloc[test_index]


In [174]:
X_train.shape,X_test.shape

((100, 4), (50, 4))

In [175]:
model3=DecisionTreeClassifier()

In [176]:
model3.fit(X_train,y_train)

In [177]:
y_pred3=model3.predict(X_test)

In [178]:
accuracy3=accuracy_score(y_test,y_pred3)

In [179]:
accuracy3

1.0