## Diabetes Prediction 

Diabetes prediction uses data such as medical records, demographic information, lifestyle factors, and genetic predisposition to estimate how likely an individual is to develop diabetes. Machine learning is commonly used for this task: a model is trained on historical records of individuals who have been diagnosed with diabetes or have risk factors for it, and the trained model then makes predictions on new, unseen data.



This notebook follows these steps:

* Import the libraries
* Load the diabetes dataset
* Split the data into training, validation, and test sets (six files in total)
* Save the data files
* Define the models
* Define the pipeline
* Train the models
* Save the models
* Evaluate the models on the training and validation sets

#### Import Library

In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.pipeline import make_pipeline
from joblib import dump, load

import pickle

#### Reading the dataset

In [27]:
diabetes_df = pd.read_csv('diabetes.csv')

In [28]:
diabetes_df.head()

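| | Index | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |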


In [29]:
diabetes_df.sample(20)

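| | Index | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|---|
| 177 | 177 | 0 | 129 | 110 | 46 | 130 | 67.1 | 0.319 | 26 | 1 |
| 408 | 408 | 8 | 197 | 74 | 0 | 0 | 25.9 | 1.191 | 39 | 1 |
| 711 | 711 | 5 | 126 | 78 | 27 | 22 | 29.6 | 0.439 | 40 | 0 |
| 756 | 756 | 7 | 137 | 90 | 41 | 0 | 32.0 | 0.391 | 39 | 0 |
| 166 | 166 | 3 | 148 | 66 | 25 | 0 | 32.5 | 0.256 | 22 | 0 |
| 154 | 154 | 8 | 188 | 78 | 0 | 0 | 47.9 | 0.137 | 43 | 1 |
| 46 | 46 | 1 | 146 | 56 | 0 | 0 | 29.7 | 0.564 | 29 | 0 |
| 240 | 240 | 1 | 91 | 64 | 24 | 0 | 29.2 | 0.192 | 21 | 0 |
| 322 | 322 | 0 | 124 | 70 | 20 | 0 | 27.4 | 0.254 | 36 | 1 |
| 184 | 184 | 4 | 141 | 74 | 0 | 0 | 27.6 | 0.244 | 40 | 0 |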


In [30]:
diabetes_df.shape

(772, 10)

In [31]:
diabetes_df.describe(include='all')

| | Index | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 772.0 | 772.0 | 772.0 | 772.0 | 772.0 | 772.0 | 772.0 | 772.0 | 772.0 | 772.0 |
| mean | 385.479275 | 3.836788 | 120.86658 | 69.099741 | 20.534974 | 79.531088 | 31.986788 | 0.471049 | 33.233161 | 0.348446 |
| std | 222.965942 | 3.364851 | 31.905935 | 19.308651 | 15.934368 | 115.05793 | 7.868856 | 0.33068 | 11.748667 | 0.476787 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 192.75 | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.244 | 24.0 | 0.0 |
| 50% | 385.5 | 3.0 | 117.0 | 72.0 | 23.0 | 27.0 | 32.0 | 0.37 | 29.0 | 0.0 |
| 75% | 578.25 | 6.0 | 140.0 | 80.0 | 32.0 | 126.25 | 36.6 | 0.6245 | 41.0 | 1.0 |
| max | 767.0 | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [35]:
diabetes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 772 entries, 0 to 771
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Index                     772 non-null    int64  
 1   Pregnancies               772 non-null    int64  
 2   Glucose                   772 non-null    int64  
 3   BloodPressure             772 non-null    int64  
 4   SkinThickness             772 non-null    int64  
 5   Insulin                   772 non-null    int64  
 6   BMI                       772 non-null    float64
 7   DiabetesPedigreeFunction  772 non-null    float64
 8   Age                       772 non-null    int64  
 9   Outcome                   772 non-null    int64  
dtypes: float64(2), int64(8)
memory usage: 60.4 KB


In [36]:
diabetes_df.duplicated().value_counts()

False    768
True       4
Name: count, dtype: int64

In [37]:
# The result is not assigned back, so diabetes_df still holds all 772 rows afterwards
diabetes_df.drop_duplicates()

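| | Index | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |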
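
`drop_duplicates()` returns a new DataFrame by default and does not modify the frame it is called on, so the call above leaves `diabetes_df` at 772 rows unless the result is assigned back. A minimal sketch on a toy frame (the values are made up for illustration):

```python
import pandas as pd

# Toy frame with one exact duplicate row (values are illustrative only)
toy_df = pd.DataFrame({
    "Index": [0, 1, 1, 2],
    "Glucose": [148, 85, 85, 183],
    "Outcome": [1, 0, 0, 1],
})

# Calling drop_duplicates() without assignment leaves the original unchanged
toy_df.drop_duplicates()
rows_without_assignment = len(toy_df)

# Assigning the result keeps only the first copy of each duplicated row
deduped_df = toy_df.drop_duplicates().reset_index(drop=True)
rows_after_assignment = len(deduped_df)

print(rows_without_assignment, rows_after_assignment)
```

Whether the four duplicated rows should be removed before splitting is a modelling decision; the later cells in this notebook work with all 772 rows.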


In [38]:
diabetes_df.isnull().value_counts()

Index  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  BMI    DiabetesPedigreeFunction  Age    Outcome
False  False        False    False          False          False    False  False                     False  False      772
Name: count, dtype: int64

In [39]:
diabetes_df.isnull().sum()

Index                       0
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
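
The null counts above are all zero, yet the `describe()` output shows a minimum of 0 for `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, and `BMI`. A zero in these measurements most likely stands for a reading that was not recorded, which `isnull()` cannot detect. The sketch below counts such zeros on a small made-up frame; treating zero as missing is an assumption about the dataset, not something this notebook does.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the values are made up, not read from diabetes.csv
toy_df = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 66],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# Columns where a zero is assumed to mean "not recorded"
zero_as_missing_cols = ["Glucose", "BloodPressure", "BMI"]

# Count zeros per column; isnull() would report 0 for all of them
zero_counts = (toy_df[zero_as_missing_cols] == 0).sum()
print(zero_counts)

# Optionally mark those zeros as NaN so later steps can impute or drop them
marked_df = toy_df.copy()
for col in zero_as_missing_cols:
    marked_df[col] = marked_df[col].replace(0, np.nan)
print(marked_df.isnull().sum())
```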

#### Separate Features and Target variable

In [40]:
# 'Index' appears to be a row identifier, but it stays in X and is treated as a feature downstream
X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']

#### Split the data

In [41]:
# Hold out 40% for testing, then split the remaining 60% evenly into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=123)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.50, random_state=123)

#### Save the data

In [42]:
X_train.to_csv('Data/X_train.csv', index=False)
X_test.to_csv('Data/X_test.csv', index=False)
y_train.to_csv('Data/y_train.csv', index=False)
y_test.to_csv('Data/y_test.csv', index=False)
X_val.to_csv('Data/X_val.csv', index=False)
y_val.to_csv('Data/y_val.csv', index=False)



In [43]:
print(X_train.shape,X_test.shape,X_val.shape,y_train.shape,y_test.shape,y_val.shape)

(231, 9) (309, 9) (232, 9) (231,) (309,) (232,)
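
The shapes above follow from the two `test_size` values: 40% of the 772 rows go to the test set, and the remaining rows are halved into training and validation sets. The sketch below reproduces that arithmetic, assuming the held-out count is rounded up to a whole row (an assumption that matches the printed shapes). `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size.

    Assumes the held-out count is rounded up to a whole row.
    """
    n_held_out = math.ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out

# First split: hold out 40% of the 772 rows as the test set
n_remaining, n_test = split_sizes(772, 0.40)

# Second split: half of what remains becomes the validation set
n_train, n_val = split_sizes(n_remaining, 0.50)

print(n_train, n_test, n_val)

# Share of the full dataset that each subset ends up with
for name, size in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(f"{name}: {size} rows ({size / 772:.1%})")
```

The effective split is therefore roughly 30% training, 30% validation, and 40% test.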


In [44]:
X_train

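| | Index | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|---|
| 191 | 191 | 9 | 123 | 70 | 44 | 94 | 33.1 | 0.374 | 40 |
| 582 | 582 | 12 | 121 | 78 | 17 | 0 | 26.5 | 0.259 | 62 |
| 675 | 675 | 6 | 195 | 70 | 0 | 0 | 30.9 | 0.328 | 31 |
| 197 | 197 | 3 | 107 | 62 | 13 | 48 | 22.9 | 0.678 | 23 |
| 451 | 451 | 2 | 134 | 70 | 0 | 0 | 28.9 | 0.542 | 23 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 238 | 238 | 9 | 164 | 84 | 21 | 0 | 30.8 | 0.831 | 32 |
| 429 | 429 | 1 | 95 | 82 | 25 | 180 | 35.0 | 0.233 | 43 |
| 8 | 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 |
| 56 | 56 | 7 | 187 | 68 | 39 | 304 | 37.7 | 0.254 | 41 |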


In [45]:
y_train

191    0
582    0
675    1
197    1
451    1
      ..
238    1
429    1
8      1
56     1
467    0
Name: Outcome, Length: 231, dtype: int64

#### Scaling Numerical data

In [46]:
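# Select the numeric feature columns (every column in this dataset is numeric)
numerical_cols = X_train.select_dtypes(exclude='object')

# Fit the scaler on the training features only
scaler = StandardScaler()
numerical_scaler = scaler.fit(numerical_cols)

# Save the fitted StandardScaler so the same scaling can be reused later
dump(numerical_scaler, 'Model/standard_scaler.pkl')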


['Model/standard_scaler.pkl']

In [47]:

scaled_data=numerical_scaler.transform(numerical_cols)

In [48]:
numerical_scaled_data=pd.DataFrame(scaled_data,columns=numerical_cols.columns)
numerical_scaled_data

| | Index | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.886600 | 1.593253 | 0.056784 | -0.009729 | 1.424548 | 0.023425 | 0.155281 | -0.257791 | 0.555341 |
| 1 | 0.890831 | 2.511062 | -0.003525 | 0.451260 | -0.263956 | -0.762742 | -0.633330 | -0.627427 | 2.430587 |
| 2 | 1.313597 | 0.675444 | 2.227906 | -0.009729 | -1.327087 | -0.762742 | -0.107590 | -0.405646 | -0.211805 |
| 3 | -0.859325 | -0.242365 | -0.425687 | -0.470717 | -0.514104 | -0.361295 | -1.063481 | 0.719331 | -0.893712 |
| 4 | 0.295324 | -0.548302 | 0.388484 | -0.009729 | -1.327087 | -0.762742 | -0.346562 | 0.282197 | -0.893712 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 226 | -0.672945 | 1.593253 | 1.293118 | 0.797001 | -0.013807 | -0.762742 | -0.119538 | 1.211107 | -0.126566 |
| 227 | 0.195315 | -0.854238 | -0.787541 | 0.681754 | 0.236342 | 0.742684 | 0.382305 | -0.710996 | 0.811057 |
| 228 | -1.718493 | -0.548302 | 2.288215 | -0.009729 | 1.487085 | 3.778625 | -0.155384 | -0.952063 | 1.663441 |
| 229 | -1.500292 | 0.981380 | 1.986670 | -0.124976 | 1.111862 | 1.779755 | 0.704918 | -0.643498 | 0.640580 |
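
Each scaled value above is a z-score: the training-column mean is subtracted and the result is divided by the training-column standard deviation. The sketch below checks this against `StandardScaler` using the first five `Glucose` values shown in the `X_train` preview. Using only five rows is a simplification for illustration, so the numbers differ from the table above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five Glucose values from the X_train preview (a small illustrative subset)
glucose = np.array([[123.0], [121.0], [195.0], [107.0], [134.0]])

# Manual z-score: StandardScaler uses the population standard deviation (ddof=0)
manual_scaled = (glucose - glucose.mean(axis=0)) / glucose.std(axis=0, ddof=0)

# The same transformation through scikit-learn
scaler = StandardScaler().fit(glucose)
sklearn_scaled = scaler.transform(glucose)

print(np.allclose(manual_scaled, sklearn_scaled))

# New data must reuse the statistics learned from the training data
new_glucose = np.array([[140.0]])
new_scaled = scaler.transform(new_glucose)
print(new_scaled)
```

Reusing the fitted scaler on validation or test data, and not refitting it there, keeps information from those sets out of the training statistics.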


In [49]:
# Despite its name, Outcome holds the scaled training features (not the target column)
Outcome = numerical_scaled_data

#### Define the models

In [50]:
models = [
    ('Decision Tree', DecisionTreeClassifier()),
    ('Random Forest', RandomForestClassifier()),
    ('Logistic Regression', LogisticRegression())
]

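# Define the pipeline: scaling followed by recursive feature elimination (RFE).
# RFE recursively drops the weakest features until 3 remain, then predicts with the wrapped estimator.
pipeline = make_pipeline(StandardScaler(), RFE(estimator=LogisticRegression(), n_features_to_select=3))

# Train and save each model
for name, model in models:
    # Swap the estimator wrapped by RFE, then refit the whole pipeline
    pipeline.set_params(rfe__estimator=model)
    pipeline.fit(Outcome, y_train)

    # Save the fitted pipeline (note that the file names contain spaces)
    with open(f'Model/{name}_model.pkl', 'wb') as f:
        pickle.dump(pipeline, f)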

# Load the saved model
with open('Model/Decision Tree_model.pkl', 'rb') as f:
    decision_tree_model = pickle.load(f)

with open('Model/Random Forest_model.pkl', 'rb') as f:
    random_forest_model = pickle.load(f)

with open('Model/Logistic Regression_model.pkl', 'rb') as f:
    logistic_regression_model = pickle.load(f)
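
The loop above refits one shared pipeline object and pickles it after each fit. An alternative is to build a fresh pipeline per model and inspect which features RFE kept. The sketch below does this on synthetic data from `make_classification`; the data, the `build_pipeline` helper, and the underscore-based file names are assumptions for illustration, not this notebook's actual setup.

```python
import tempfile
from pathlib import Path

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (9 features, like X_train)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=9, n_informative=4, random_state=123
)

def build_pipeline(estimator):
    """Hypothetical helper: scaling followed by RFE around the given estimator."""
    return make_pipeline(StandardScaler(), RFE(estimator=estimator, n_features_to_select=3))

models = {
    "decision_tree": DecisionTreeClassifier(random_state=123),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=123),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

selected = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, estimator in models.items():
        pipe = build_pipeline(estimator).fit(X_demo, y_demo)

        # Boolean mask of the 3 features RFE kept for this estimator
        selected[name] = pipe.named_steps["rfe"].support_

        # Save and reload to confirm the round trip works
        model_path = Path(tmp_dir) / f"{name}_model.pkl"
        dump(pipe, model_path)
        reloaded = load(model_path)
        assert (reloaded.predict(X_demo) == pipe.predict(X_demo)).all()

for name, mask in selected.items():
    print(name, mask.nonzero()[0])
```

Building one pipeline per model keeps each fitted object independent, and file names without spaces are easier to handle in shell commands.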

In [51]:
from sklearn import metrics

# Training-set accuracy (in percent) for the decision tree pipeline
pred_train = decision_tree_model.predict(Outcome)
acc = metrics.accuracy_score(y_train, pred_train) * 100
print("Accuracy of Decision_tree_model Train is {}".format(acc))

Accuracy of Decision_tree_model Train is 100.0


In [52]:
from sklearn import metrics

# Training-set accuracy (in percent) for the random forest pipeline
pred_train = random_forest_model.predict(Outcome)
acc = metrics.accuracy_score(y_train, pred_train) * 100
print("Accuracy of Random_forest_model Train is {}".format(acc))

Accuracy of Random_forest_model Train is 100.0


In [53]:
from sklearn import metrics

# Training-set accuracy (in percent) for the logistic regression pipeline
pred_train = logistic_regression_model.predict(Outcome)
acc = metrics.accuracy_score(y_train, pred_train) * 100
print("Accuracy of Logistic_regression_model Train is {}".format(acc))

Accuracy of Logistic_regression_model Train is 80.08658008658008


In [57]:
from evalution_test import evaluvation_metrics

In [58]:
y_pred,test_score=evaluvation_metrics('Data/X_val.csv','Data/y_val.csv',"Model/Decision Tree_model.pkl")
test_score

64.22413793103449

In [59]:
y_pred,test_score=evaluvation_metrics('Data/X_val.csv','Data/y_val.csv',"Model/Random Forest_model.pkl")
test_score

67.67241379310344

In [60]:
y_pred,test_score=evaluvation_metrics('Data/X_val.csv','Data/y_val.csv',"Model/Logistic Regression_model.pkl")
test_score

71.98275862068965
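
Putting the reported training and validation scores side by side makes the generalization gap explicit. The sketch below uses only the numbers printed in this notebook (rounded to two decimals) and assumes both sets of scores are accuracies in percent.

```python
# Scores reported above, rounded to two decimals: (training, validation)
reported_scores = {
    "Decision Tree": (100.00, 64.22),
    "Random Forest": (100.00, 67.67),
    "Logistic Regression": (80.09, 71.98),
}

# Gap between training and validation score for each model
gaps = {name: round(train - val, 2) for name, (train, val) in reported_scores.items()}

# Model with the highest validation score
best_model = max(reported_scores, key=lambda name: reported_scores[name][1])

for name, (train, val) in reported_scores.items():
    print(f"{name:<20} train={train:6.2f}  val={val:6.2f}  gap={gaps[name]:5.2f}")
print("Best on validation:", best_model)
```

The two tree-based models fit the training set perfectly but score more than 30 points lower on the validation set, a typical sign of overfitting. Logistic regression has the smallest gap and the highest validation score of the three.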