# About Dataset

* This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases.
* The objective of the dataset is to diagnostically predict whether a patient has diabetes based on certain diagnostic measurements included in the dataset.
* Several constraints were placed on the selection of these instances from a larger database.
* In particular, all patients here are females at least 21 years old of Pima Indian heritage.

# **Importing Libraries**

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Loading the dataset

In [None]:
# Raw string (r'...') so that backslashes in the Windows path are not treated as escape sequences
data = pd.read_csv(r'C:\Users\ASUS\Downloads\diabetes.csv')
data.head(n=10)

# **Exploratory Data Analysis**

In [None]:
# Check the shape of dataset
data.shape

In [None]:
# List the column names
data.columns

In [None]:
# Count the unique values in each column
data.nunique()

In [None]:
# Check the type of data to decide what algorithm to use for training
data.info()

# Observation-
# All columns are numerical and the target (Outcome) is binary, so we will use Logistic Regression

In [None]:
# Descriptive Statistics of the Dataset
data.describe()

### <u>Observation</u>-
* The min row shows zeros in several columns, which could indicate missing values or data entry errors, because some of these measurements (Glucose, BloodPressure, SkinThickness, Insulin, and BMI) should logically not be zero.

### Handling Missing Values: check for NaN or null values and handle them if any are present.

In [None]:
# Checking total null values in the overall dataset
data.isnull().sum()

In [None]:
# Check for rows with zero values
rows_with_zeros = data[(data == 0).any(axis=1)]

# Display the rows with zero values
print("Rows with zero values:")
print(rows_with_zeros)

### Note-
* In this example, the (data == 0) part creates a boolean DataFrame where each element is True if the corresponding element in the original DataFrame equals zero and False otherwise.
* The any(axis=1) part checks whether there is at least one True value along each row (axis=1).
* The result is a boolean Series, which is then used to index the original DataFrame, selecting only the rows where at least one element is zero. Because zero is a valid value for Pregnancies and Outcome, this selection also includes rows that have no implausible measurements.
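
The masking steps described above can be tried on a tiny example. This is only a sketch: the frame `df` and its values are made up for illustration, and the list `implausible_cols` is an assumption about which columns should never be zero.

```python
import pandas as pd

# Toy frame standing in for the diabetes data (values are made up).
df = pd.DataFrame({
    "Pregnancies": [0, 2, 1],
    "Glucose": [148, 0, 85],
    "BMI": [33.6, 26.6, 0.0],
    "Outcome": [1, 0, 0],
})

# Element-wise comparison gives a boolean DataFrame of the same shape.
is_zero = df == 0

# any(axis=1) collapses each row to a single True/False value.
any_zero_mask = is_zero.any(axis=1)

# Restricting the check to columns where zero is implausible avoids
# flagging rows only because Pregnancies or Outcome is legitimately 0.
implausible_cols = ["Glucose", "BMI"]
implausible_mask = (df[implausible_cols] == 0).any(axis=1)

print(df[any_zero_mask])
print(df[implausible_mask])
```

In this toy frame every row contains some zero, but only the last two rows contain a zero in a column where it cannot be a real measurement.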

### Solution -
* Result: A zero in some columns does not make sense and likely indicates a missing value. In particular, Glucose, BloodPressure, SkinThickness, Insulin, and BMI should not contain zeros, because such values are not physiologically plausible (for example, a glucose or blood pressure measurement of zero). These zeros should therefore be replaced with NaN so that the "false non-missing values" are accounted for correctly.

In [None]:
# Create a copy of the DataFrame to avoid modifying the original data
data_copy = data.copy(deep=True)

# Replace zeros with NaN in specific columns
cols_to_replace_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
data_copy[cols_to_replace_zeros] = data_copy[cols_to_replace_zeros].replace(0, np.nan)

# Check missing values again after replacement
print(data_copy.isnull().sum())

### **Result:**
* The output above is the result of calling isnull().sum() on the modified DataFrame; it gives the number of NaN values in each column.
* The columns Glucose, BloodPressure, SkinThickness, Insulin, and BMI now have varying numbers of NaN values, because the zeros in these columns were treated as missing or invalid and replaced by NaN.
* For example, Glucose has 5 NaN values, BloodPressure has 35, SkinThickness has 227, Insulin has 374, and BMI has 11. Other columns, such as Pregnancies, DiabetesPedigreeFunction, Age, and Outcome, have no NaN values because zeros were not replaced in these columns.

# **Data Visualization**-

In [None]:
# Heatmap before replacement
sns.heatmap(data.isnull())

In [None]:
# Heatmap after replacement
sns.heatmap(data_copy.isnull())

**Comment**:
This heatmap visualizes missing data in the copy of the diabetes dataset. Each cell is either missing or present, so the plot uses only two colors, and the density of highlighted cells in a column reflects how many NaN values it contains. It shows significant missing data for 'Insulin' and 'SkinThickness', moderate for 'BloodPressure' and 'BMI', and minimal for 'Glucose'. The other features, 'Pregnancies', 'DiabetesPedigreeFunction', 'Age', and 'Outcome', show no missing data. The heatmap highlights the areas in the dataset that may require imputation or further data cleansing.
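
A numeric summary can complement the heatmap. The sketch below computes the share of missing values per column; the frame `toy` and its values are made up and merely stand in for `data_copy`.

```python
import numpy as np
import pandas as pd

# Toy frame with NaNs standing in for data_copy (values are made up).
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 85.0, 89.0],
    "Insulin": [np.nan, np.nan, 94.0, np.nan],
    "Age": [50, 31, 21, 33],
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the percentage of missing entries per column.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the columns that need the most imputation at the top.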

In [None]:
# Visualize the histograms before replacement
data.hist(figsize=(20,20))
plt.show()

In [None]:
# Replace NaN values with the mean or median of the corresponding columns
# Replace with mean when the data is approximately normally distributed
# Replace with median when the data is skewed or contains outliers.
# Assign the result back instead of calling fillna(..., inplace=True) on a column
# selection, which newer pandas versions warn about and may not apply to the DataFrame.
data_copy['Glucose'] = data_copy['Glucose'].fillna(data_copy['Glucose'].mean())
data_copy['BloodPressure'] = data_copy['BloodPressure'].fillna(data_copy['BloodPressure'].mean())
data_copy['SkinThickness'] = data_copy['SkinThickness'].fillna(data_copy['SkinThickness'].median())
data_copy['Insulin'] = data_copy['Insulin'].fillna(data_copy['Insulin'].median())
data_copy['BMI'] = data_copy['BMI'].fillna(data_copy['BMI'].mean())

# Visualize the histograms after replacement
p = data_copy.hist(figsize=(20,20))

**Result:**
- Most individuals have a low to moderate number of pregnancies.
- Glucose and blood pressure levels cluster around common values, suggesting a roughly bell-shaped distribution without extreme variations.
- Skin thickness and insulin levels show a rightward skew in their distribution, with a few individuals having significantly higher values than the majority.
- BMI values are predominantly on the higher side, suggesting a prevalence of overweight conditions in the studied population.
- The diabetes pedigree function, which reflects genetic risk, is generally low with a few higher cases distributed sporadically.
- The age distribution is mainly concentrated in the younger to middle-aged bracket, with fewer older individuals.
- There are more non-diabetic than diabetic cases in the dataset.
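
The mean-versus-median rule used for the imputation, and the right skew noted for skin thickness and insulin, can be checked numerically instead of by eye. The following is a minimal sketch: the helper `impute_by_skew`, the skewness threshold of 1.0, and the toy values are all assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def impute_by_skew(frame, columns, threshold=1.0):
    """Fill NaNs with the median if |skew| > threshold, else with the mean.

    The threshold is an arbitrary illustrative choice.
    """
    filled = frame.copy()
    chosen = {}
    for col in columns:
        use_median = abs(filled[col].skew()) > threshold
        value = filled[col].median() if use_median else filled[col].mean()
        filled[col] = filled[col].fillna(value)
        chosen[col] = "median" if use_median else "mean"
    return filled, chosen

# Made-up data: one symmetric column and one with a large outlier.
toy = pd.DataFrame({
    "symmetric": [10.0, 11.0, 12.0, 13.0, 14.0, np.nan],
    "skewed": [1.0, 1.0, 2.0, 2.0, 50.0, np.nan],
})

filled, chosen = impute_by_skew(toy, ["symmetric", "skewed"])
print(chosen)
print(filled)
```

The outlier in the skewed column would pull a mean-based fill value far above the typical values, which is why the median is the safer choice there.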

In [None]:
# Plot scatter matrix of the uncleaned data
P = scatter_matrix(data, figsize=(20,20))

In [None]:
# Plotting pair plots for the data
sns.pairplot(data_copy, hue='Outcome')
plt.show()

**Result:**

1- A high Glucose level in pregnancy increases the risk of diabetes.

2- A BMI above 30 and a high Glucose level together increase the risk of diabetes.

3- We can see here that an increasing Glucose level is the key factor that increases the risk of diabetes.

4- A high Glucose level along with other variables increases the risk of diabetes.

In [None]:
# Histplot of dataset - each variable's relation with Outcome
plt.figure(figsize=(18,20))
features = [col for col in data.columns if col != 'Outcome']   # the 8 predictor columns
for pno, col in enumerate(features, start=1):
    plt.subplot(3,3,pno)
    ax = sns.histplot(data=data, x=col, hue='Outcome', kde=True)
    plt.xlabel(col)
    for container in ax.containers:                            # to set a label on top of the bars
        ax.bar_label(container)

# **<u>Outcome</u>:**

1. When the number of pregnancies increases, the risk of diabetes also increases.
2. When the Glucose level rises above 125, the risk of diabetes also increases.
3. Blood pressure values between 60 and 90 contain more diabetic people than other ranges.
4. The risk of diabetes increases as skin thickness increases.
5. The Insulin level affects diabetes: as it increases, the risk of diabetes also increases.
6. When BMI rises above 30, the risk of diabetes also increases.
7. The histogram with kernel density overlay indicates that higher DiabetesPedigreeFunction values are more common among individuals with diabetes than among those without.
8. The risk of diabetes increases with age.
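
Observations like these can be cross-checked with a per-class summary table. The sketch below groups a small made-up frame by `Outcome` and compares medians; the values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Made-up rows for illustration: three non-diabetic, three diabetic.
toy = pd.DataFrame({
    "Glucose": [85, 89, 110, 148, 183, 168],
    "BMI": [26.6, 28.1, 25.0, 33.6, 23.3, 38.0],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

# Median of each feature within each Outcome class.
summary = toy.groupby("Outcome")[["Glucose", "BMI"]].median()
print(summary)
```

A feature whose median differs clearly between the two rows of `summary` is a candidate for the kind of relationship described in the list.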

In [None]:
correlation = data.corr()
print(correlation)

In [None]:
#Visualizing the Correlation Matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()

### Observation-
* Only Age with Pregnancies and Glucose with Outcome show a reasonably strong correlation with each other.
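
Reading the strongest pairs off a heatmap is easy to get wrong, so a ranked list can help. This sketch uses synthetic data (the `Pregnancies` column is deliberately built to correlate with `Age`), walks the upper triangle of the correlation matrix so that each pair appears once, and sorts by absolute correlation.

```python
import numpy as np
import pandas as pd

# Synthetic data, generated only for illustration.
rng = np.random.default_rng(0)
n = 200
age = rng.normal(35, 10, n)
toy = pd.DataFrame({
    "Age": age,
    "Pregnancies": 0.1 * age + rng.normal(0, 0.5, n),  # built to correlate with Age
    "Noise": rng.normal(0, 1, n),
})

corr = toy.corr()
cols = corr.columns

# Indices of the strict upper triangle: every pair once, no self-pairs.
rows_idx, cols_idx = np.triu_indices(len(cols), k=1)
pairs = sorted(
    ((cols[i], cols[j], corr.iloc[i, j]) for i, j in zip(rows_idx, cols_idx)),
    key=lambda item: abs(item[2]),
    reverse=True,
)

for first, second, value in pairs:
    print(f"{first} vs {second}: {value:.2f}")
```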

In [None]:
fig, ax = plt.subplots(1,2, figsize=(14,7))
sns.countplot(data = data, x= "Outcome", ax = ax[0])
data["Outcome"].value_counts().plot.pie(explode= [0.1,0], autopct= "%1.2F%%", labels= ["No ","Yes"], shadow= True, ax=ax[1])
plt.show()

# **Preprocessing**-

In [None]:
# Split the Data:
# Splitting the dataset into features (X) and the target variable (y).
# Note: this uses the original `data`; swap in `data_copy` to train on the imputed values.

X = data.drop("Outcome", axis=1)
y = data['Outcome']

In [None]:
# Splitting the dataset into a training set and a testing set
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=23)

In [None]:
print(y_test.shape)   # 1D Shape
print(X_test.shape)   # 2D shape
print(y_train.shape)  # 1D Shape
X_train.shape         # 2D shape
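
Since the Outcome classes are imbalanced (more non-diabetic than diabetic cases), one option worth considering is a stratified split, which keeps the class ratio the same in the training and testing sets. The sketch below uses synthetic labels, not the real data; `stratify` is a standard `train_test_split` parameter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (35% positives), for illustration only.
y_demo = np.array([1] * 35 + [0] * 65)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo preserves the 35/65 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=23, stratify=y_demo
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

Without `stratify`, a small test set can end up with a noticeably different share of positive cases purely by chance.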

# **Logistic Regression**-

In [None]:
# Create the model
model = LogisticRegression(max_iter=768)

# Fit the model on the training set
model.fit(X_train, y_train)
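
`StandardScaler` is imported at the top of the notebook but never applied. One possible way to use it, shown here as a sketch on synthetic data from `make_classification` rather than the diabetes file, is to chain it with the classifier in a pipeline so the scaler is fitted on the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 8 features, like the diabetes predictors.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=23
)

# The scaler learns mean/std from X_tr only, then the same
# transformation is applied to X_te when scoring or predicting.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

acc = pipe.score(X_te, y_te)
print(f"test accuracy: {acc:.3f}")
```

Scaling often lets the solver converge in fewer iterations, which may make a large `max_iter` unnecessary.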

In [None]:
#Testing the model on remaining 20%
y_pred = model.predict(X_test)
y_pred

In [None]:
y_test

# **Evaluation Metrics**-

In [None]:
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
lr_train_acc = round(accuracy_score(y_train, model.predict(X_train))*100, 2)
lr_test_acc = round(accuracy_score(y_test, y_pred)*100, 2)
# classification_report expects (y_true, y_pred) in that order
print(classification_report(y_test, y_pred))
print('Training Accuracy = ', lr_train_acc, ' %')
print('Testing Accuracy = ', lr_test_acc, ' %')
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)          # cbar=False hides the color bar on the right side
plt.title('Logistic Regression Confusion Matrix');
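
The imports at the top include `roc_curve`, `auc`, and `average_precision_score`, which the notebook does not use. A sketch of how they could complement accuracy is given below; it runs on synthetic data, so the numbers it prints say nothing about the diabetes model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, average_precision_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, for illustration only.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=23
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Threshold-free metrics need scores, not hard 0/1 predictions:
# take the predicted probability of the positive class.
proba = clf.predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, proba)
roc_auc = auc(fpr, tpr)
avg_precision = average_precision_score(y_te, proba)

print(f"ROC AUC: {roc_auc:.3f}")
print(f"Average precision: {avg_precision:.3f}")
```

Both metrics summarize performance across all decision thresholds, which is useful when the classes are imbalanced and accuracy alone can look better than it is.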

# **Prediction**-

In [None]:
new_data = pd.DataFrame({
    'Pregnancies': [3],
    'Glucose': [115],
    'BloodPressure': [60],
    'SkinThickness': [20],
    'Insulin': [90],
    'BMI': [23.4],
    'DiabetesPedigreeFunction': [0.25],
    'Age': [23]
})

new_predictions = model.predict(new_data)
print("Predictions for new data:", new_predictions)

# Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf=RandomForestClassifier(n_estimators=10)
rf.fit(X_train,y_train)
y_pred=rf.predict(X_test)


In [None]:
y_pred = rf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
rf_train_acc = round(accuracy_score(y_train, rf.predict(X_train))*100, 2)
rf_test_acc = round(accuracy_score(y_test, y_pred)*100, 2)
# classification_report expects (y_true, y_pred) in that order
print(classification_report(y_test, y_pred))
print('Training Accuracy = ', rf_train_acc, ' %')
print('Testing Accuracy = ', rf_test_acc, ' %')
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)          # cbar=False hides the color bar on the right side
plt.title('Random Forest Classifier Confusion Matrix');

In [None]:
new_data = pd.DataFrame({
    'Pregnancies': [5],
    'Glucose': [190],
    'BloodPressure': [110],
    'SkinThickness': [20],
    'Insulin': [100],
    'BMI': [36.4],
    'DiabetesPedigreeFunction': [0.25],
    'Age': [29]
})

new_predictions = rf.predict(new_data)
print("Predictions for new data:", new_predictions)

# Hyperparameter Tuning - GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV

param= {'n_estimators': [10,20,30,40,50], 'bootstrap':[True,False]}
gsc= GridSearchCV(estimator=rf,
                  param_grid=param,
                  scoring='accuracy')
gsc.fit(X_train,y_train)

In [None]:
gsc.best_params_

In [None]:
gsc.best_score_
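
Beyond `best_params_` and `best_score_`, the fitted search object keeps the cross-validated score of every parameter combination in `cv_results_`. The sketch below shows one way to inspect it; it uses synthetic data and a smaller grid than the notebook, and the fixed `random_state` and explicit `cv=5` are assumptions added for reproducibility.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 30], "bootstrap": [True, False]},
    scoring="accuracy",
    cv=5,  # 5-fold cross-validation on the data passed to fit()
)
grid.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(grid.cv_results_)[
    ["param_n_estimators", "param_bootstrap", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
```

Note that `best_score_` is a mean cross-validation score on the training data, so it is not directly comparable to the held-out testing accuracy.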

In [None]:
y_pred = gsc.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
gsc_train_acc = round(accuracy_score(y_train, gsc.predict(X_train))*100, 2)
gsc_test_acc = round(accuracy_score(y_test, y_pred)*100, 2)
# classification_report expects (y_true, y_pred) in that order
print(classification_report(y_test, y_pred))
print('Training Accuracy = ', gsc_train_acc, ' %')
print('Testing Accuracy = ', gsc_test_acc, ' %')
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)          # cbar=False hides the color bar on the right side
plt.title('GridSearchCV Confusion Matrix');

In [None]:
models = pd.DataFrame({
    'Model': [
        'Logistic Regression','Random Forest','Hyperparameter Tuning'
    ],
    'Training Accuracy': [
        lr_train_acc,rf_train_acc,gsc_train_acc
    ],
    'Testing Accuracy': [
       lr_test_acc,rf_test_acc,gsc_test_acc
    ]
})

In [None]:
models

In [None]:
models.sort_values(by=['Testing Accuracy','Training Accuracy'], ascending=False).style.background_gradient(
        cmap='coolwarm')

### Note- We can try many more machine learning algorithms on the above dataset, such as support vector machines, decision trees, and XGBoost, and compare their prediction accuracy.
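
A comparison along the lines of the note above could be sketched as follows. It runs on synthetic data, and scikit-learn's `GradientBoostingClassifier` is used as a stand-in for XGBoost, which is a separate package and is not assumed to be installed. The model names and settings are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

candidates = {
    "SVM": make_pipeline(StandardScaler(), SVC()),  # SVMs are sensitive to feature scale
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),  # stand-in for XGBoost
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, estimator in candidates.items()
}

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Cross-validation gives a more stable comparison than a single train/test split, at the cost of fitting each model several times.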