The holdout method is a model evaluation technique in which the dataset is split into two subsets: a training set and a validation set. Its advantages are simplicity and speed. However, because the estimate depends on a single validation set, it can have high variance and may not reflect the model's performance on unseen data.
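
The dependence on a single split can be made visible with a small experiment. The sketch below is illustrative only: it uses synthetic data from `make_classification` rather than the stroke dataset, and the names `X_demo`, `y_demo` and `holdout_scores` are assumptions introduced here. The same model is evaluated on five different random splits, and only the split changes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in data (not the stroke dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0
)

holdout_scores = []
for seed in range(5):
    # Same data and model; only the random split changes
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.34, random_state=seed
    )
    model = GaussianNB().fit(X_tr, y_tr)
    holdout_scores.append(model.score(X_te, y_te))

print("Hold-out accuracy per split:", [round(s, 3) for s in holdout_scores])
print("Spread (max - min):", round(max(holdout_scores) - min(holdout_scores), 3))
```

Any spread between the five scores comes purely from which rows happened to land in the held-out set.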

Model evaluation with validation involves splitting the dataset into three subsets: a training set, a validation set, and a test set. The training set is used to train the model, the validation set is used to tune hyperparameters and compare candidate models, and the test set is used only for the final evaluation. The advantages of this approach include a better estimate of model performance on unseen data, the ability to fine-tune hyperparameters, and a reduced risk of overfitting. However, it requires a larger dataset, adds computational complexity, and may introduce bias if the validation set is not representative of the overall data distribution. Compared with other evaluation techniques such as cross-validation, it strikes a balance between simplicity and accuracy.
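
A three-way split can be sketched with two calls to `train_test_split`. This is a minimal illustration on synthetic data. The 60/20/20 proportions and all `*_demo` names are assumptions chosen for the example, not values used elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the stroke dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=0)

# First hold out the test set (20% of all rows)
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

# Then carve the validation set out of the remainder:
# 0.25 of the remaining 80% equals 20% of the full data
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))  # prints: 600 200 200
```

Hyperparameters would be chosen using `X_val_demo`, and `X_test_demo` would be touched only once, for the final score.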



Kewan Sulaiman Saleh

In [1]:
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt


In [2]:

new_df = pd.read_csv('stroke-data.csv')
# Drop rows with missing values (no NaN remains afterwards, so nothing needs filling)
new_df = new_df.dropna()

# Remove duplicate rows
new_df = new_df.drop_duplicates()


# Label-encode every categorical (object-typed) column as integers
from sklearn.preprocessing import LabelEncoder

d_list = new_df.select_dtypes(include=['object']).columns.tolist()
le = LabelEncoder()

for col in d_list:
    new_df[col] = le.fit_transform(new_df[col])


new_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4909 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 4909 non-null   int64  
 1   gender             4909 non-null   int64  
 2   age                4909 non-null   float64
 3   hypertension       4909 non-null   int64  
 4   heart_disease      4909 non-null   int64  
 5   ever_married       4909 non-null   int64  
 6   work_type          4909 non-null   int64  
 7   Residence_type     4909 non-null   int64  
 8   avg_glucose_level  4909 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     4909 non-null   int64  
 11  stroke             4909 non-null   int64  
dtypes: float64(3), int64(9)
memory usage: 498.6 KB


In [3]:
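# Separate features and target, then hold out 34% of the rows for testing
# Note: the 'id' column remains in X as a feature
X = new_df.drop("stroke", axis=1)
y = new_df["stroke"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, train_size=0.66, random_state=42
)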

In [4]:
# Step 4: Instantiate the Gaussian Naive Bayes classifier
naive_bayes = GaussianNB()

# Step 5: Train the model on the training set
naive_bayes.fit(X_train, y_train)

# Step 6: Predict labels for the test set
y_pred = naive_bayes.predict(X_test)



In [5]:


#plt.figure(figsize=(12, 12))
#plot_tree(clf, feature_names=X.columns, class_names=['No Stroke', 'Stroke'], filled=True, rounded=True)
#plt.show()

In [6]:
# Accuracy metrics for the hold-out split
# Train the Naive Bayes model and predict labels for the test set
naive_bayes.fit(X_train, y_train)
y_pred = naive_bayes.predict(X_test)

accuracy = (y_pred == y_test).mean()

# Count true positives, true negatives, false positives, and false negatives
tp = ((y_pred == 1) & (y_test == 1)).sum()
tn = ((y_pred == 0) & (y_test == 0)).sum()
fp = ((y_pred == 1) & (y_test == 0)).sum()
fn = ((y_pred == 0) & (y_test == 1)).sum()

# Calculate precision
precision = tp / (tp + fp)

# Calculate recall
recall = tp / (tp + fn)

# Calculate F1-score
f1_score = 2 * (precision * recall) / (precision + recall)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1_score)

Accuracy: 0.9161676646706587
Precision: 0.19148936170212766
Recall: 0.21951219512195122
F1-score: 0.20454545454545456
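
The gap between the high accuracy and the low precision, recall and F1-score above is typical when one class dominates. The helper below is a sketch, not part of the original analysis: `metrics_from_counts` and the counts passed to it are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Guard each ratio so that an empty denominator yields 0.0 instead of an error
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1

# Hypothetical counts: 100 cases, 8 of them positive
demo_metrics = metrics_from_counts(tp=2, tn=90, fp=2, fn=6)
print(demo_metrics)

# A model that never predicts the positive class still reaches 92% accuracy
always_negative = metrics_from_counts(tp=0, tn=92, fp=0, fn=8)
print(always_negative)  # prints: (0.92, 0.0, 0.0, 0.0)
```

Because accuracy counts the many true negatives, it stays high even when almost no positive cases are found, which is why precision, recall and F1-score are reported alongside it.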


In [7]:
# Perform 3-fold cross-validation and obtain out-of-fold predicted labels

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

naive_bayes2 = GaussianNB()

# Each row is predicted once, by the model trained on the other two folds
predicted = cross_val_predict(naive_bayes2, X, y, cv=3)

# One confusion matrix over the pooled predictions of all folds
cm = confusion_matrix(y, predicted)
tn = cm[0, 0]
fp = cm[0, 1]
fn = cm[1, 0]
tp = cm[1, 1]

# Metrics computed from the pooled counts
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_score = 2 * (precision * recall) / (precision + recall)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1_score)


Accuracy: 0.9232022815237318
Precision: 0.17938931297709923
Recall: 0.22488038277511965
F1-score: 0.19957537154989388


Results: The 3-fold cross-validation result is more reliable than the hold-out result because training and testing are repeated three times, and each time a different third of the data serves as the test set.
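
How the test set moves through the data can be sketched in plain Python. The function `k_fold_indices` below is a hypothetical helper written for illustration. It builds contiguous, unshuffled folds and ignores stratification, whereas scikit-learn uses a stratified variant for classifiers when `cv` is an integer.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for i, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {i}: test={test_idx} train={train_idx}")
```

Every index appears in exactly one test fold, so each row is used for testing once and for training in the remaining iterations.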

In [8]:
# Perform 5-fold cross-validation and obtain out-of-fold predicted labels

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

naive_bayes3 = GaussianNB()

# Each row is predicted once, by the model trained on the other four folds
predicted = cross_val_predict(naive_bayes3, X, y, cv=5)

# One confusion matrix over the pooled predictions of all folds
cm = confusion_matrix(y, predicted)
tn = cm[0, 0]
fp = cm[0, 1]
fn = cm[1, 0]
tp = cm[1, 1]

# Metrics computed from the pooled counts
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_score = 2 * (precision * recall) / (precision + recall)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1_score)


Accuracy: 0.9234059889997963
Precision: 0.1776061776061776
Recall: 0.2200956937799043
F1-score: 0.19658119658119658


Results: The 5-fold cross-validation result is more reliable than the hold-out result because training and testing are repeated five times. The test set moves to a different fifth of the dataset on each iteration, so every row is tested exactly once.
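
`cross_val_predict` returns one prediction per row, pooled across folds, so the figures above describe all folds together. To see how much the scores vary from fold to fold, one option is `cross_validate` with several scorers. The sketch below runs on synthetic data, not the stroke dataset, and `X_demo`, `y_demo` and `cv_results` are names assumed for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in data (not the stroke dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0
)

metric_names = ["accuracy", "precision", "recall", "f1"]
cv_results = cross_validate(
    GaussianNB(), X_demo, y_demo, cv=5, scoring=metric_names
)

# Each entry holds one score per fold
for name in metric_names:
    fold_scores = cv_results[f"test_{name}"]
    print(f"{name}: mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```

The standard deviation across folds gives a rough sense of how stable each metric is, which a single hold-out split cannot show.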

Two ways to improve the Naive Bayes algorithm are:

1-Handle class imbalance: If the dataset has imbalanced classes, consider applying techniques like oversampling (e.g., SMOTE) or undersampling to balance the class distribution. This can help improve the model's performance, especially if the minority class is important.

2-Hyperparameter tuning: Experiment with different hyperparameter settings for the Naive Bayes classifier. Techniques like grid search or random search can be used to explore these settings.
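
For the first point, the simplest form of oversampling duplicates minority-class rows at random. The sketch below shows that plain random oversampling, not SMOTE, which instead synthesizes new minority points by interpolating between neighbours. The function `random_oversample` and the toy data are hypothetical and serve only as illustration.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are equally frequent."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        # Rows of this class that can be duplicated
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

# Hypothetical toy data: six negatives, two positives
toy_rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [2.0], [2.2]]
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1]
res_rows, res_labels = random_oversample(toy_rows, toy_labels)
print(Counter(res_labels))
```

Resampling should be applied to the training split only. The test split keeps its original class distribution so that the evaluation reflects real data.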


In [9]:
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from imblearn.over_sampling import SMOTE

# Create a Naive Bayes classifier
naive_bayes = GaussianNB()

# Address class imbalance in the training data with SMOTE
smt = SMOTE(random_state=0)
x_train_res, y_train_res = smt.fit_resample(X_train, y_train)

# Define the hyperparameter grid
param_grid = {
    'var_smoothing': [1e-9, 1e-8, 1e-7]  # Example list of hyperparameter values to try
}

# Perform grid search
# Note: the search is fitted on the original X_train, y_train, so the resampled
# x_train_res, y_train_res are not used and the results below reflect tuning only
grid_search = GridSearchCV(estimator=naive_bayes, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best hyperparameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Evaluate the best model on the test set
y_pred = best_model.predict(X_test)

accuracy = (y_pred == y_test).mean()

cm = confusion_matrix(y_test, y_pred)

# Extract true negatives, false positives, false negatives, and true positives
tn = cm[0, 0]
fp = cm[0, 1]
fn = cm[1, 0]
tp = cm[1, 1]
# Calculate precision
precision = tp / (tp + fp)
# Calculate recall
recall = tp / (tp + fn)
# Calculate F1-score
f1_score = 2 * (precision * recall) / (precision + recall)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1_score)

Accuracy: 0.925748502994012
Precision: 0.18181818181818182
Recall: 0.14634146341463414
F1-score: 0.16216216216216217


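Results: With hyperparameter tuning, the accuracy of the Naive Bayes model improved from 0.916 (hold-out) to 0.926. However, precision fell from 0.191 to 0.182, recall from 0.220 to 0.146, and F1-score from 0.205 to 0.162, so the tuned model predicts the stroke class less often. The SMOTE-resampled training data is created but not passed to the grid search in the cell above, so these figures reflect hyperparameter tuning alone.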