# Diabetes Dataset Analysis and Naïve Bayes Regression Model Report

## 1. Introduction

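This report analyzes the diabetes dataset and fits a Gaussian Naïve Bayes model to it. The workflow loads and inspects the data, splits it into training and test sets, trains the model, and evaluates its predictions. Although the report refers to the model as "Naïve Bayes regression", scikit-learn's `GaussianNB` is a classifier: it predicts the 0/1 `Outcome` label, and those predictions are then scored with regression metrics (MSE and R2).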

## 2. Libraries Used

The following libraries were used in the project:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

## 3. Dataset Overview

The dataset used in this project is the diabetes dataset. It contains multiple features related to patient health metrics, and the goal is to predict the `Outcome` (whether the patient has diabetes or not).

### 3.1 Number of Columns

The dataset consists of 9 columns:

- `Pregnancies`

- `Glucose`

- `BloodPressure`

- `SkinThickness`

- `Insulin`

- `BMI`

- `DiabetesPedigreeFunction`

- `Age`

- `Outcome`

### 3.2 Relationship Between Columns

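- `Glucose`, `BMI`, and `Insulin` are clinical measurements commonly associated with diabetes.

- `Pregnancies` and `Age` are expected to correlate with diabetes risk; this report does not compute those correlations.

- `DiabetesPedigreeFunction` summarizes family history of diabetes and serves as a proxy for genetic likelihood.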
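The relationships listed above can be checked numerically. The following is a minimal sketch, not part of the original notebook: it builds a small frame from the five rows printed by `data.head()` in Section 4 and ranks the features by their Pearson correlation with `Outcome`. The helper name `outcome_correlations` is hypothetical, and five rows are far too few to draw conclusions; on the real data the same function would be called on the full `data` frame.

```python
import pandas as pd

# The five rows shown by data.head() in this report (illustration only).
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})


def outcome_correlations(df, target="Outcome"):
    """Pearson correlation of every feature with the target, strongest first."""
    return df.corr()[target].drop(target).sort_values(ascending=False)


corr = outcome_correlations(sample)
print(corr)
```

Passing the full dataset instead of `sample` would give correlations that are actually meaningful for judging which features matter.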

## 4. Basic Analysis

The dataset was loaded with `pd.read_csv`, and its first five rows were inspected with `head()`:

In [2]:
data = pd.read_csv(r"C:\Users\Shaik Sakhlaih\Downloads\diabetes (4).csv")
print(data.head())

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  


### 4.1 Checking for Null Values

In [3]:
data.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
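A null check alone can miss placeholder values. In the rows printed above, `Insulin` and `SkinThickness` contain zeros, which are physiologically implausible and may stand in for missing measurements. The sketch below counts such zeros and converts them to `NaN`; the list of columns in `zero_as_missing` is an assumption, not something the original notebook defines, and the frame again holds only the five rows from `data.head()`.

```python
import numpy as np
import pandas as pd

# Five rows from data.head(), restricted to the columns of interest.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero in these columns means "not measured".
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count zeros per column.
zero_counts = (sample[zero_as_missing] == 0).sum()
print(zero_counts)

# Replace the zeros with NaN so that isnull() reports them.
cleaned = sample.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
print(cleaned.isnull().sum())
```

How to handle the resulting `NaN` values (dropping rows, imputing a median, and so on) is a separate modeling decision.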

## 5. Model Building

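The `Outcome` column is the target variable (`y`), while the remaining eight columns are the features (`x`). The data is split 80/20 into training and test sets with a fixed `random_state` so the split is reproducible.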

In [4]:
from sklearn.model_selection import train_test_split

x = data.drop(['Outcome'], axis=1)
y = data['Outcome']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

print("x_train:", x_train.shape)
print("x_test:", x_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)

x_train: (614, 8)
x_test: (154, 8)
y_train: (614,)
y_test: (154,)
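The split above is purely random, so the share of positive cases can differ between the training and test sets. One common option is a stratified split, which keeps the class proportions roughly equal in both parts. The sketch below uses synthetic stand-in data with the same shape as the dataset (768 rows, 8 features); the names `X_demo` and `y_demo` and the 35% positive rate are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the diabetes data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(768, 8))
y_demo = (rng.random(768) < 0.35).astype(int)

# stratify=y_demo keeps the positive rate similar in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print("train:", X_tr.shape, "positive rate:", y_tr.mean())
print("test: ", X_te.shape, "positive rate:", y_te.mean())
```

On the real data the equivalent call would pass `stratify=y` to the existing `train_test_split` line.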


### 5.1 Applying Gaussian Naïve Bayes

In [5]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(x_train, y_train)
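To make clear what `GaussianNB` does when it is fitted, here is a simplified from-scratch sketch in NumPy. It estimates a prior, a mean, and a variance per class and feature, then predicts the class with the highest log posterior under the assumption that features are independent Gaussians within each class. The function names and the toy data are hypothetical, and the constant `1e-9` is a simplification of scikit-learn's variance smoothing, so this is an illustration of the idea rather than the library's implementation.

```python
import numpy as np


def fit_gaussian_nb(X, y):
    """Estimate class priors and per-class feature means and variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small constant avoids division by zero for constant features.
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances


def predict_gaussian_nb(model, X):
    """Pick the class maximizing log prior + summed Gaussian log likelihoods."""
    classes, priors, means, variances = model
    diff = X[:, None, :] - means[None, :, :]            # (samples, classes, features)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :]).sum(axis=2)
    log_post = np.log(priors)[None, :] + log_lik         # (samples, classes)
    return classes[np.argmax(log_post, axis=1)]


# Toy one-feature data: class 0 clusters near 1.0, class 1 near 3.0.
X_toy = np.array([[1.0], [1.2], [0.8], [3.0], [3.2], [2.8]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

model = fit_gaussian_nb(X_toy, y_toy)
pred = predict_gaussian_nb(model, np.array([[1.1], [2.9]]))
print(pred)
```

Because the output is a class label chosen by `argmax`, the model is a classifier; it never produces a continuous estimate of `Outcome`.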

## 6. Model Evaluation

### 6.1 Predictions

In [6]:
y_pred = gnb.predict(x_test)

In [7]:
y_pred

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

### 6.2 Mean Squared Error (MSE)

In [8]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.4f}")

Mean Squared Error: 0.2078
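Because both `y_test` and `y_pred` contain only 0 and 1, each squared error is either 0 (correct) or 1 (wrong), so the MSE equals the misclassification rate. The reported value of 0.2078 therefore corresponds to 32 of 154 test samples being misclassified, an accuracy of about 0.7922. The sketch below demonstrates the identity on small made-up label arrays (the arrays are illustrative, not taken from the notebook) and also prints a confusion matrix, which separates false positives from false negatives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error

# Made-up binary labels: two of the ten predictions are wrong.
y_true = np.array([1, 0, 1, 0, 1, 0, 0, 1, 0, 0])
y_hat = np.array([1, 0, 0, 0, 1, 0, 1, 1, 0, 0])

mse_demo = mean_squared_error(y_true, y_hat)
acc_demo = accuracy_score(y_true, y_hat)
cm_demo = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class

print("MSE:     ", mse_demo)
print("Accuracy:", acc_demo)
print(cm_demo)
```

On the real predictions, `accuracy_score(y_test, y_pred)` and `confusion_matrix(y_test, y_pred)` would give the same information in a form that is easier to interpret for a classifier.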


### 6.3 R2 Score

In [9]:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print(f"R2 Score: {r2}")

R2 Score: 0.020083515609465197
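R2 compares the model's squared error with that of always predicting the mean of `y_test`: R2 = 1 - MSE / Var(`y_test`). For 0/1 labels the variance is at most 0.25, so even a fairly accurate classifier gets a low R2. With the reported MSE of 0.2078 and R2 of about 0.0201, the implied variance of the test labels is about 0.21. The following sketch checks the identity on made-up labels (illustrative values only), where 80% accuracy yields an R2 of only about 0.17.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up binary labels: 80% of predictions are correct.
y_true = np.array([1, 0, 1, 0, 1, 0, 0, 1, 0, 0])
y_hat = np.array([1, 0, 0, 0, 1, 0, 1, 1, 0, 0])

mse_demo = mean_squared_error(y_true, y_hat)
var_demo = np.var(y_true)                 # population variance of the labels
r2_manual = 1 - mse_demo / var_demo       # R2 from its definition
r2_sklearn = r2_score(y_true, y_hat)

print("R2 (manual): ", r2_manual)
print("R2 (sklearn):", r2_sklearn)
```

This is why R2 is rarely used to judge classifiers: it is scaled for continuous targets and understates how often the labels are right.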


## 7. Conclusion

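- The dataset contained **no null values**, which simplified preprocessing, although zeros in columns such as `Insulin` and `SkinThickness` (visible in the first rows) may stand in for missing measurements.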

- A Gaussian Naïve Bayes model (`GaussianNB`) was trained on 614 samples and evaluated on 154.

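- The **Mean Squared Error** of the model is **0.2078**. Because the labels and predictions are both 0 or 1, this equals the misclassification rate, so about 79% of the test samples were classified correctly.

- The **R2 Score** is **0.0201**, which suggests poor performance by this metric: the model's squared error is only about 2% lower than that of a simple mean-based prediction. R2 is designed for continuous targets, so it understates the quality of a classifier.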

- **Naïve Bayes may not be the best model** for this dataset, and alternative classifiers (e.g., Logistic Regression, Decision Trees) should be explored and compared using classification metrics such as accuracy.
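A comparison of the alternatives mentioned above could look like the following sketch. It uses 5-fold cross-validated accuracy and synthetic data from `make_classification` with the same shape as the diabetes dataset, so the printed scores say nothing about the real data. The variable names and model settings are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 768 samples, 8 features, binary target.
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=5, random_state=0
)

models = {
    "GaussianNB": GaussianNB(),
    # Scaling helps the logistic regression solver converge.
    "LogisticRegression": make_pipeline(StandardScaler(), LogisticRegression()),
    "DecisionTree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each model.
scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Replacing `X_demo` and `y_demo` with the report's `x` and `y` would run the same comparison on the diabetes data.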

This report covers loading, inspecting, modeling, and evaluating the diabetes dataset with a Gaussian Naïve Bayes model.