## **Overfitting in ML**


Overfitting in ML refers to the phenomenon in which a model performs well on its training data but poorly on unseen (test) data. <br>
**Reasons for overfitting:**<br>
1.   The training set is small and does not contain enough samples to cover the range of possible input values.
2.   The training data contains a lot of irrelevant information, i.e., noise. The model learns the noise as well, which leads to incorrect predictions on new data.

An overfit model has high variance and low bias. Low bias means that its predictions are close to the true values on the training data, so it scores well on the training set. High variance means that those predictions change substantially when the training data changes, so the model generalizes poorly to new data.
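To make the low-bias, high-variance pattern concrete, here is a minimal sketch on synthetic data (not the dataset used in this notebook). It swaps in polynomial regression with NumPy instead of a classifier, purely because the effect is easy to see there. The signal `true_fn`, the noise level, and the polynomial degrees are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def true_fn(x):
    """Underlying signal the models should recover (an assumed example)."""
    return np.sin(2 * np.pi * x)


# Small, noisy training set and a separate test set from the same process
x_train = np.linspace(0.0, 1.0, 10)
y_train = true_fn(x_train) + rng.normal(scale=0.3, size=x_train.size)
x_test = np.linspace(0.05, 0.95, 50)
y_test = true_fn(x_test) + rng.normal(scale=0.3, size=x_test.size)


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


results = {}
for degree in (3, 9):
    # Degree 9 has as many coefficients as there are training points,
    # so it can pass through every noisy training sample.
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree={degree}  train MSE={results[degree][0]:.4f}  test MSE={results[degree][1]:.4f}")
```

The degree-9 fit reproduces the ten training points almost exactly, noise included, so its training error is close to zero while its test error is not. That gap is the signature of overfitting described above.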



# Demonstrating overfitting in ML

In [None]:
#Required modules:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#Let us first load the dataset:
data=pd.read_csv('/content/drive/MyDrive/practise/housing.csv')

In [None]:
pd.set_option('display.max_columns',None)

In [None]:
#Understanding the data:
data.head()

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64  
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    int64  
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(11), int64(3)
memory usage: 55.5 KB


In [None]:
data.describe()

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.613524,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534,356.674032,12.653063,22.532806
std,8.601545,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946,91.294864,7.141062,9.197104
min,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73,5.0
25%,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4,375.3775,6.95,17.025
50%,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05,391.44,11.36,21.2
75%,3.677083,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2,396.225,16.955,25.0
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97,50.0


In [None]:
data.shape

(506, 14)

### **Checking for NULL values:**

In [None]:
data.isnull().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

No column contains NULL values, so no columns need to be dropped.

### Detecting outliers:

In [None]:
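#Keep only the numeric feature columns; 'id' and 'diagnosis' are dropped only if they are present
numeric_columns=data.select_dtypes(include='number').drop(columns=['id','diagnosis'],errors='ignore')
plt.figure(figsize=(8,6))
sns.boxplot(data=numeric_columns)
plt.show()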

In [None]:
#Using Z-score to detect outliers:
z_score=np.abs((numeric_columns-numeric_columns.mean())/numeric_columns.std())
threshold_val=3
outliers=z_score>threshold_val
#printing the outliers:
data[outliers.any(axis=1)]

We observe that there are 74 records containing outlier values.
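As a quick check of how the Z-score rule behaves, the sketch below applies the same computation to a small, hypothetical DataFrame (`toy`, with made-up columns `a` and `b`) in which exactly one row holds an extreme value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "a" has one extreme value (200) in the last row
toy = pd.DataFrame({
    "a": list(range(1, 20)) + [200],
    "b": list(range(20)),
})

# Absolute Z-score per cell: distance from the column mean in standard deviations
z_score = np.abs((toy - toy.mean()) / toy.std())

# A row is flagged if any of its columns exceeds the threshold
threshold_val = 3
outliers = z_score > threshold_val
outlier_rows = toy[outliers.any(axis=1)]
cleaned = toy[~outliers.any(axis=1)]

print(outlier_rows)
print(cleaned.shape)
```

Only the row containing 200 is flagged. Note that with the sample standard deviation, the largest possible Z-score among n values is (n-1)/sqrt(n), so a threshold of 3 cannot flag anything in a column with 10 or fewer rows.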

### Removing the outliers:

In [None]:
data=data[~outliers.any(axis=1)]

In [None]:
data.shape

We observe that the outliers have been removed.

## Building the Decision Tree Classifier to demonstrate overfitting:

In [None]:
data.dtypes

In [None]:
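#Convert diagnosis to a numerical value, i.e. 0 for B and 1 for M.
#This assumes the DataFrame has a 'diagnosis' column with 'M'/'B' labels; the housing columns listed above do not include one.
data=data.copy()  #work on a copy of the filtered rows to avoid SettingWithCopyWarning
data['diagnosis']=data['diagnosis'].map({"M":1,"B":0})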

In [None]:
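#Drop the target and, if present, the 'id' column, which carries no predictive information
X=data.drop(columns=['id','diagnosis'],errors='ignore')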
y=data['diagnosis']

In [None]:
#Splitting the data into training and testing.
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=1234)

In [None]:
#Building the model:
model=DecisionTreeClassifier(max_depth=None)

In [None]:
#Training the model:
model.fit(X_train,y_train)

In [None]:
#Predictions on the training set
y_train_pred=model.predict(X_train)

In [None]:
train_accuracy=accuracy_score(y_train,y_train_pred)

In [None]:
print("The accuracy of the training set is observed to be: ",train_accuracy)

In [None]:
y_pred=model.predict(X_test)
test_accuracy=accuracy_score(y_test,y_pred)
print("The accuracy of the test data is observed to be: ",test_accuracy)

We observe that the model works well on the training data, but its accuracy on the test data is only 0.8. This gap between training and test performance indicates overfitting.
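The same gap can be reproduced without the notebook's data file. The following is a minimal sketch on synthetic data from `make_classification`; the sample size, the label-noise level `flip_y=0.2`, and the depths tried are assumptions for illustration, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with deliberately noisy labels
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

scores = {}
for depth in (2, 4, 8, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, tree.predict(X_tr))
    test_acc = accuracy_score(y_te, tree.predict(X_te))
    scores[depth] = (train_acc, test_acc)
    print(f"max_depth={str(depth):>4}  train={train_acc:.3f}  test={test_acc:.3f}")
```

With `max_depth=None` the tree memorizes the training set, noisy labels included, so the training accuracy reaches 1.0 while the test accuracy stays below it. Comparing the rows shows how the gap changes as the depth limit is relaxed.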

In [None]:
#Compare each true test label with its prediction (the last column is 1 when they match)
for t,p in zip(y_test,y_pred):
  print(f'{t:17}{p:17}{1 if t==p else 0}')

In [None]:
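from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
#labels=[1,0] orders rows and columns as M (1) then B (0), matching the index and column names
df=pd.DataFrame(confusion_matrix(y_test,y_pred,labels=[1,0]),index=['M','B'], columns=['M','B'])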

In [None]:
df

In [None]:
print(classification_report(y_test,y_pred))

# **How to Avoid Overfitting?**

Regularization: Use techniques like pruning (limiting the depth of the tree) or setting minimum samples per leaf to avoid growing the tree too deep and reduce overfitting.

Cross-validation: Perform cross-validation to get a more reliable estimate of the model's performance on unseen data.

Feature selection: Consider selecting only the most relevant features to avoid noise and irrelevant information.

Ensemble methods: Utilize ensemble methods like Random Forests or Gradient Boosting, which can help mitigate overfitting compared to a single Decision Tree.

Hyperparameter tuning: Fine-tune hyperparameters to find the optimal configuration that balances performance and generalization.
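The regularization, hyperparameter tuning, and ensemble ideas above can be sketched together as follows. This is an illustration on synthetic data, not the notebook's dataset; the parameter grid, the number of folds, and the forest size are assumptions, and which model scores higher will depend on the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with some label noise
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Regularization + tuning: search over tree depth and minimum leaf size,
# scoring each combination with 5-fold cross-validation on the training set
param_grid = {"max_depth": [2, 4, 6, None], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_tr, y_tr)
tuned_test_acc = search.score(X_te, y_te)  # uses the refitted best tree
print("Best parameters:", search.best_params_)
print(f"Tuned tree test accuracy: {tuned_test_acc:.3f}")

# Ensemble: a random forest averages many trees trained on bootstrap samples
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)
forest_test_acc = forest.score(X_te, y_te)
print(f"Random forest test accuracy: {forest_test_acc:.3f}")
```

The test set is touched only once per model, after the cross-validated search has chosen the hyperparameters, so the reported test accuracy is not used for model selection.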

In [None]:
#Using cross-validation
from sklearn.model_selection import cross_val_score
#X and y are the features and target defined earlier
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=4567)
model=DecisionTreeClassifier(max_depth=4)
k=6
cv_scores=cross_val_score(model,X_train,y_train,cv=k)

In [None]:
avg_cv_score=np.mean(cv_scores)

In [None]:
for fold, score in enumerate(cv_scores, start=1):
    print(f"Fold {fold}: {score:.4f}")

In [None]:
# Fit the model on the full training data
model.fit(X_train, y_train)
# Make predictions on the test data
y_pred = model.predict(X_test)
# Evaluate the model's performance on the test data
test_accuracy = np.mean(y_pred == y_test)
print(f"Test Accuracy: {test_accuracy:.4f}")

In [None]:
print("Mean cross-validation accuracy: ",avg_cv_score)
print("Test Accuracy: ",test_accuracy)

In [None]:
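#labels=[1,0] orders rows and columns as M (1) then B (0), matching the index and column names
df=pd.DataFrame(confusion_matrix(y_test,y_pred,labels=[1,0]),index=['M','B'], columns=['M','B'])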
print(df)
print(classification_report(y_test,y_pred))

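Thus, we observe that the model's performance has improved. The improvement comes from limiting the tree depth (max_depth=4); K-fold cross-validation does not prevent overfitting by itself, but it gives a more reliable estimate of how the model performs on unseen data.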
