In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
import matplotlib.pyplot as plt

# Load the kidney disease dataset
# Replace this path with the actual location of your copy of the dataset
data_path = '/kaggle/input/ckdisease/kidney_disease.csv'
df = pd.read_csv(data_path)

# Handling missing values
df = df.dropna()

# Encode categorical variables (the encoder is refit for each column in turn)
for col in ['rbc', 'pc', 'pcc', 'ba']:
    label_encoder = LabelEncoder()
    df[col] = label_encoder.fit_transform(df[col])
# ... Repeat this for other categorical columns

# Select features and target variable
X = df.drop('classification', axis=1)
y = df['classification']

# Convert categorical variables to numerical using one-hot encoding
X = pd.get_dummies(X, drop_first=True)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Create a Gaussian Naive Bayes classifier
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = nb_classifier.predict(X_test)

# Evaluate the model
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# Display the confusion matrix
conf_matrix = metrics.confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

# Display classification report
# (stored as `report` so the name does not shadow metrics.classification_report)
report = metrics.classification_report(y_test, y_pred)
print("Classification Report:\n", report)

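# Visualize the confusion matrix using a heatmap
# Tick labels come from the fitted classifier's classes_ (the sorted target labels),
# not from label_encoder, which only remembers the last feature column it encoded
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues",
            xticklabels=nb_classifier.classes_, yticklabels=nb_classifier.classes_)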
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

This Python script uses scikit-learn to train a Gaussian Naive Bayes classifier on a kidney disease dataset. The goal is to predict each patient's 'classification' label, which indicates the presence or absence of kidney disease.

The script starts by importing the necessary libraries. It then loads the kidney disease dataset from a CSV file at the specified path into a pandas DataFrame. Every row that contains at least one missing value is then removed with `dropna()`, so only complete records are kept.
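
Because `dropna()` discards a whole row for a single missing cell, it can be worth checking how many rows survive. The following is a minimal sketch on a toy frame; the column names are borrowed from the script, but the values are invented for illustration and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; only the column names mirror the script.
toy = pd.DataFrame({
    "age": [48.0, np.nan, 62.0, 51.0],
    "rbc": ["normal", "abnormal", None, "normal"],
    "classification": ["ckd", "ckd", "notckd", "notckd"],
})

# Count missing values per column before dropping anything.
missing_per_column = toy.isna().sum()

# dropna() keeps only the rows that have no missing value in any column.
complete = toy.dropna()

print(missing_per_column)
print(f"Rows before: {len(toy)}, rows after dropna(): {len(complete)}")
```

If too many rows disappear on real data, imputing values instead of dropping rows may be a better fit, though that is a different preprocessing choice from the one the script makes.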

The script then encodes categorical variables using a LabelEncoder, which maps each category to an integer because the Naive Bayes classifier requires numerical input. The columns 'rbc', 'pc', 'pcc', and 'ba' are encoded in this way. Since the encoder is fitted once per column, `label_encoder` ends up holding only the mapping for the last column, 'ba'. Any other categorical columns can either be encoded in the same manner or left for the one-hot encoding step that follows.
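
One way to keep every column's mapping available is to store a separate fitted encoder per column. This is a sketch under that assumption, using a toy frame with invented values; the `encoders` dictionary is a hypothetical helper and not part of the original script.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with invented values; only the column names mirror the script.
toy = pd.DataFrame({
    "rbc": ["normal", "abnormal", "normal"],
    "ba": ["notpresent", "present", "notpresent"],
})

# One encoder per column, kept in a dict so each mapping can be looked up later.
encoders = {}
for col in ["rbc", "ba"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

print(toy)
# LabelEncoder sorts the categories, so integer 0 is the alphabetically first one.
print(list(encoders["rbc"].classes_))
print(list(encoders["ba"].classes_))
```

With the encoders stored this way, `encoders["rbc"].inverse_transform(...)` can turn the integers back into the original category names.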

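The 'classification' column, which is the target variable, is separated from the rest of the dataset and stored in `y`. The remaining columns, which are the features, are stored in `X`. Any feature columns that are still non-numeric are converted with one-hot encoding via `pd.get_dummies`. With `drop_first=True`, the first category of each encoded column is dropped, since it is implied when all the other indicator columns are zero.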
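
The effect of `drop_first=True` is easiest to see on a small example. The sketch below uses a toy frame with one numeric and one categorical column; the values are invented for illustration.

```python
import pandas as pd

# Toy frame with invented values: one numeric and one categorical feature.
toy = pd.DataFrame({
    "age": [48, 62, 51],
    "appet": ["good", "poor", "good"],
})

# Numeric columns pass through unchanged; the categorical column becomes
# indicator columns, and drop_first=True removes the first category ("good").
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

Here a single `appet_poor` column is enough, because a row that is not "poor" must be "good".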

The dataset is then split into a training set and a test set, with 80% of the data used for training and 20% used for testing. This is done using the `train_test_split` function from scikit-learn. The `stratify` parameter is set to `y`, which means that the split preserves the proportion of each target class in both the training and test sets, and `random_state=42` makes the split reproducible.
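
To illustrate what stratification does, the sketch below splits 100 invented labels (60 of one class, 40 of the other) and compares the class proportions on each side. The data is made up; only the `train_test_split` arguments match the script.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 invented labels with a 60/40 class balance and a dummy feature column.
y = pd.Series(["ckd"] * 60 + ["notckd"] * 40)
X = pd.DataFrame({"feature": range(100)})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y, both parts keep the same 60/40 class balance.
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

Without `stratify`, the proportions in a small test set can drift away from those of the full dataset, which makes the reported metrics less representative.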

A Gaussian Naive Bayes classifier is created and trained using the training data. The trained classifier is then used to make predictions on the test set.
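
As a rough illustration of what training involves, Gaussian Naive Bayes estimates a mean and variance per feature for each class, plus a prior per class, and predicts the class with the highest resulting probability. The sketch below fits the classifier on invented one-feature data and inspects those quantities.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Invented one-feature data: class 0 clusters near 1.0, class 1 near 5.0.
X = np.array([[0.9], [1.0], [1.1], [4.9], [5.0], [5.1]])
y = np.array([0, 0, 0, 1, 1, 1])

model = GaussianNB()
model.fit(X, y)

# Per-class feature means estimated during fit (one row per class).
print(model.theta_)
# Class priors estimated from the label frequencies.
print(model.class_prior_)

# Predicted classes and class probabilities for two new points.
new_points = np.array([[1.2], [4.5]])
print(model.predict(new_points))
print(model.predict_proba(new_points))
```

`predict_proba` is useful when a hard label is not enough, for example when a different decision threshold is wanted.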

The accuracy of these predictions is calculated by comparing them to the actual classifications in the test set. The confusion matrix, which shows the number of true positive, true negative, false positive, and false negative predictions, is also displayed. A classification report, which includes precision, recall, f1-score, and support for each class, is printed as well.
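
The relationship between these metrics can be checked by hand on a tiny example. The labels below are invented; the `labels` argument is passed explicitly so that the row and column order of the matrix is unambiguous.

```python
from sklearn import metrics

# Six invented true labels and predictions: four correct, two wrong.
y_true = ["ckd", "ckd", "ckd", "notckd", "notckd", "notckd"]
y_pred = ["ckd", "ckd", "notckd", "notckd", "notckd", "ckd"]

labels = ["ckd", "notckd"]

# Rows are true labels, columns are predicted labels, both in `labels` order.
conf_matrix = metrics.confusion_matrix(y_true, y_pred, labels=labels)
accuracy = metrics.accuracy_score(y_true, y_pred)

print(conf_matrix)
print(f"Accuracy: {accuracy:.2f}")
print(metrics.classification_report(y_true, y_pred, labels=labels))
```

The diagonal of the matrix holds the correct predictions, so accuracy equals the sum of the diagonal divided by the total number of samples.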

Finally, the confusion matrix is visualized using a heatmap from the seaborn library. The x-axis shows the predicted label and the y-axis the true label, so each axis ranges over the classes of the target variable. The color of each cell reflects the count in that cell: a darker blue means more test instances with that combination of true and predicted label.
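
For the tick labels to name the target classes, they need to come from the target variable rather than from an encoder fitted on a feature column. One possible approach, sketched here on invented data, is to take the labels from the fitted classifier's `classes_` attribute and wrap the matrix in a labelled DataFrame; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB

# Invented one-feature data with string target labels.
X = np.array([[0.9], [1.0], [1.1], [4.9], [5.0], [5.1]])
y = np.array(["ckd", "ckd", "ckd", "notckd", "notckd", "notckd"])

clf = GaussianNB().fit(X, y)
y_pred = clf.predict(X)

# clf.classes_ holds the sorted target labels; passing it as `labels`
# pins the row and column order of the confusion matrix to the same order.
labels = clf.classes_
conf_matrix = metrics.confusion_matrix(y, y_pred, labels=labels)

# A labelled frame keeps the class names attached to the counts.
labelled = pd.DataFrame(conf_matrix, index=labels, columns=labels)
labelled.index.name = "True Label"
labelled.columns.name = "Predicted Label"
print(labelled)
```

When a DataFrame like `labelled` is passed to `sns.heatmap`, seaborn uses its index and column names as tick labels, so the class names do not have to be supplied separately.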