# BUSINESS PROBLEM

We are creating a machine learning model to forecast a person's likelihood of acquiring diabetes based on their demographic data and past health measurements. This model's primary goal is to help medical professionals identidy people who are more likely to experience problems from diabetes within in a given time frame. Using characteristics including age, BMI, diabetes pedigree function,skin thickness, blood pressure, insulin levels, glucose levels, and pregnancies.To lessen the effects and development of diabetes the model will make it possible to implement preventive intervention techniques, individualized treatment regimens, and focused health education initatives.

# DATA COLLECTION

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer 
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score as acc, classification_report as clr, confusion_matrix as cm, recall_score as rs, precision_score as ps, f1_score as fs, roc_auc_score as ras
from IPython.display import display
import warnings
warnings.filterwarnings('ignore')

1. Data analysis and manipluation are the primary uses of pandas.
2. We mostly import GridSearchCV and train_test_split from the sklearn.model_selection module. It is mostly used to divide the dataset into test and train sets.
3. By eliminating the mean and scaling the data to the unit variance.
4. ColumnTransformer adds transformers to a Dataframe or array's designated columns.
5. sklearn.svm's SVC for classification problems, support vector classifier is used.
6. A decision tree classifier for classification tasks is represented by the DecisionTreeClassifier class from the sklearn.tree module.
7. A Random forest classifier for classification tasks.
8. The Random Forest Classifier class, mostly utilized for classification tasks, is derived from the sklearn.ensemble module.
9. The accuracy score aids in calculating the classification model's accuracy.
10. The primary classification metrics are displayed in a text report that is generated by the classification report.
11. The display function from the IPython.display module is usually used to produce visualizations or show dataframes in Jupter notebooks or Ipython environments.

In [2]:
df1 = pd.read_csv(r"C:\Users\Mohana Krishnan\Downloads\Machine Learning code\Dataset\diabetes.csv")

In [3]:
display(df1.head())

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


The above dataset contains 9 columns and 768 rows or records. In that we can easily distingush between the Independent Variable (IV) and the Dependent Variable (DV):

IV:- Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction and Age.

DV:- Outcome

In [4]:
df= df1.copy()

We copy the dataframe (df) into a new variable (df1) dataframe in order not to lose the original value of the dataset in the future or any misleading in the procedure of the model building.

In [5]:
print("The size of the df_train is: ", (df_train :=df.sample(frac=0.8)).shape, "\nThe size of the df_test is: ", (df_test :=df.drop(df_train.index)).shape)

The size of the df_train is:  (614, 9) 
The size of the df_test is:  (154, 9)


We used **train_test_split** to divide our dataset **(df)** into an **80%** training set and a **20%** testing set for effective model.

**NOTE**

1. We build the Model using train data not with the test data.

2. Both train & test data will have same number of columns.

# DATA EXPLORATION 

In [6]:
df_train.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

We check the data types of each columns in the dataset for the better understanding of the dataset.

Here we have only two type int64 and float64 both comes under numeric category 

In [7]:
null_values =df_train.isnull().sum()
print(null_values)

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64


The dataset doesn't contain any **null** values.

In [8]:
df_train.duplicated().sum()

0

The datset doesn't contain any **duplicate** values.

In [9]:
information = df_train.info()
print("The Overview is: ",information)

<class 'pandas.core.frame.DataFrame'>
Index: 614 entries, 560 to 76
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               614 non-null    int64  
 1   Glucose                   614 non-null    int64  
 2   BloodPressure             614 non-null    int64  
 3   SkinThickness             614 non-null    int64  
 4   Insulin                   614 non-null    int64  
 5   BMI                       614 non-null    float64
 6   DiabetesPedigreeFunction  614 non-null    float64
 7   Age                       614 non-null    int64  
 8   Outcome                   614 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 48.0 KB
The Overview is:  None


In [10]:
display(df_train.describe().T)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Pregnancies,614.0,3.776873,3.309591,0.0,1.0,3.0,6.0,17.0
Glucose,614.0,121.262215,32.401156,0.0,99.0,116.0,141.0,199.0
BloodPressure,614.0,69.144951,19.322543,0.0,64.0,72.0,80.0,122.0
SkinThickness,614.0,20.23127,16.102909,0.0,0.0,22.0,32.0,99.0
Insulin,614.0,79.301303,118.738937,0.0,0.0,7.0,121.5,846.0
BMI,614.0,31.96645,7.817192,0.0,27.3,31.95,36.5,67.1
DiabetesPedigreeFunction,614.0,0.474655,0.338948,0.084,0.245,0.365,0.637,2.42
Age,614.0,32.822476,11.410168,21.0,24.0,29.0,40.0,72.0
Outcome,614.0,0.358306,0.479894,0.0,0.0,0.0,1.0,1.0


**DECODING EACH OF THE ROWS**


1. **COUNT**:- Allow us to know the number of non-null values are there in each dataset column.
2. **MEAN**:- It determines the dataset's average value for every column.
3. **STD**:- The dispersion, or spread of the values around the mean value is primarily measured by the standard deviation.
4. **MIN**:- It determines the lowest value in every dataset column.
5. **25%**:- 25% of the values are less than or equal to the first quartile of the data.
6. **50%**:- 50% of the results are less than or equal to this value, which is the second quartile of the data.
7. **75%**:- 75% of the values in the data are less than or equal to this value, which is the third quartile.
8. **MAX**:- The maximum value in each column.

In [11]:
df_train['Outcome'].value_counts()

Outcome
0    394
1    220
Name: count, dtype: int64

0 - Non-Diabetic

1- Diabetic

In [12]:
df_train.groupby('Outcome').mean()

Unnamed: 0_level_0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
Outcome,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,3.22335,109.893401,68.274112,19.187817,68.286802,30.175635,0.42784,30.65736
1,4.768182,141.622727,70.704545,22.1,99.027273,35.173636,0.558495,36.7


We are checking the mean value of Outcome column with respect to other columns in the dataset

# DATA PREPROCESSING

In [13]:
display('X_train:', (x_train := df_train.drop(columns='Outcome')),
       'Y_train:', (y_train := df_train['Outcome']),
        'X_test:', (x_test := df_test.drop(columns='Outcome')),
       'Y_test:', (y_test := df_test['Outcome']))

'X_train:'

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
560,6,125,76,0,0,33.8,0.121,54
398,3,82,70,0,0,21.1,0.389,25
303,5,115,98,0,0,52.9,0.209,28
114,7,160,54,32,175,30.5,0.588,39
639,1,100,74,12,46,19.5,0.149,28
...,...,...,...,...,...,...,...,...
636,5,104,74,0,0,28.8,0.153,48
442,4,117,64,27,120,33.2,0.230,24
424,8,151,78,32,210,42.9,0.516,36
145,0,102,75,23,0,0.0,0.572,21


'Y_train:'

560    1
398    0
303    1
114    1
639    0
      ..
636    0
442    0
424    1
145    0
76     0
Name: Outcome, Length: 614, dtype: int64

'X_test:'

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
14,5,166,72,19,175,25.8,0.587,51
28,13,145,82,19,110,22.2,0.245,57
33,6,92,92,0,0,19.9,0.188,28
34,10,122,78,31,0,27.6,0.512,45
36,11,138,76,0,0,33.2,0.420,35
...,...,...,...,...,...,...,...,...
751,1,121,78,39,74,39.0,0.261,28
752,3,108,62,24,0,26.0,0.223,25
754,8,154,78,32,0,32.4,0.443,45
755,1,128,88,39,110,36.5,1.057,37


'Y_test:'

14     1
28     0
33     0
34     0
36     0
      ..
751    0
752    0
754    1
755    1
756    0
Name: Outcome, Length: 154, dtype: int64

In [14]:
class_distribution = df["Outcome"].value_counts()
total_instances = len(df)
class_ratios = class_distribution / total_instances
print("Class Ratios:")
print(class_ratios)

Class Ratios:
Outcome
0    0.651042
1    0.348958
Name: count, dtype: float64


In [15]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
x_train, y_train = smote.fit_resample(x_train, y_train)

I used over sampling here as my dataset is not **balanced**. You can clearly witness from my **class_ratios** score for my target column **Outcome**. So for class **0** it is almost **65%** and for class **1** it is **35%** which is **unbalanced**. So there is a higher chance for my model to predict on side that is patient is **non-diabetic** as it refer to 0 class. That's why I'm using here **over_sampling** inorder to produce more **resampling** data for mydataset. Next time my model predict prefectly.

# FEATURE ENGINEERING 

In [16]:
numerical_attributes = x_train.select_dtypes(include=['int64','float64']).columns

In [17]:
x_train = ColumnTransformer([('Standard_Scaling', StandardScaler(), numerical_attributes)]).fit_transform(x_train)
x_test =  ColumnTransformer([('Standard_Scaling', StandardScaler(), numerical_attributes)]).fit_transform(x_test)

print("The size of the x_train is: ", x_train.shape)
print("The size of the x_test is: ", x_test.shape)

The size of the x_train is:  (788, 8)
The size of the x_test is:  (154, 8)


We are performing feature scaling on the numerical attributes of the dataset using **StandardScaler** and **ColumnTransformer** function from the **sklearn.preprocessing** module.

**Function**

1.**StandardScaler**:- By subtracting the mean and scaling to the unit variance, the feature is scaled.

2.**ColumnTransformer**:- Allows different transformations to be applied to different columns of the dataset.

# MODEL TRAINING

## MODEL IS TRAINED WITH SVM ALGORITHM

In [18]:
svm_para = {'C' : [0.1, 1, 10, 100, 1000], 'kernel' : ['rbf'], 'gamma' : ['scale','auto']}
svm_grid_searching = GridSearchCV(SVC(), svm_para, cv=5, n_jobs=-1,verbose=1)
svm_grid_searching.fit(x_train,y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


1. **svm_para** In this variable only we define few parameters for our algorithm training.
     
     a. **c** : The primary purpose of this regularization parameter is to forecast the regularization's intensity, which is inversely proportional to C.
     
     b. **kernel** : Precompute the kernel matrix from data matrices is its primary usage
     
     c. **gamma** : It is mostly used to define the rbf kernel coefficient, which affects the model's flexibility and decision boundary shape. 

2. **svm_grid_searching** The ***GridSearchCV*** object, purpose is to initialize the variable, is to do a grid search to determine the optimal set of hyperparameters for the SVM model.
     
     a. **SVC()** : we initializes the ***Support Vector Classifier*** object inside our variable, which is used for the svm model.
     
     b. **svm_para** : This is the dictionary which we defined already and used for the parameter tuning. 
     
     c. **cv=5** : The number of folds for cross-validation is specified here.
     
     d. **n_jobs=-1** : This parameter Utilizes parallel computing to speed up the grid search process by using all available                           CPU cores.
     
     e. **verbose=1** : It regulates the gridsearch procedures verbosity.

3. **svm_grid_searching.fit(x_train,y_train)** : Fitting the training data to the svm_grid_search object for cross validation.     

In [19]:
best_parameter =svm_grid_searching.best_params_
model_best_score =svm_grid_searching.best_score_

print("The Best Parameters for SVM is:")
print(best_parameter)
print("Best Cross_Validation Score is: {:.2f}".format(model_best_score))

The Best Parameters for SVM is:
{'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
Best Cross_Validation Score is: 0.78


## MODEL IS TRAINED WITH DECISION TREE

In [20]:
dt_params = {'criterion': ['gini','entropy'], 'splitter': ['best','random'], 'max_depth': [None,5,10,20,50], 'min_samples_split': [2,5,10,20], 'min_samples_leaf': [1,2,4,8]}
dt_grid_searching = GridSearchCV(DecisionTreeClassifier(), dt_params, cv=5, n_jobs = -1, verbose=1)
dt_grid_searching.fit(x_train,y_train)

Fitting 5 folds for each of 320 candidates, totalling 1600 fits


1. **dt_params** : his dictionary describes the parameters and values that can be used for the Decision Tree grid search.

   a. **criterion** :*Entropy* assesses impurity by information gain; the *gini* utilizes the gini impurity criteria.
   
   b. **splitter** : It split at each node for the given parameter.
   
   c. **max_depth** : It determines the decision tree's maximum depth.
   
   d. **min_samples_split** : To separate an internal node, minimum number of samples is required.
   
   e. **min_samples_leaf** : Number of samples need to be present at leaf node.
   
2. **dt_grid_searching** : this variable initializes a ***GridSearchCV*** object for performing hyperparameter tuning using grid                           search.

   a. **DecisionTreeClassifier()** : The model is adjusted to the Decision Tree Classifier object.
   
   b. **dt_params** : This is the dictionary containing the parameters to be tuned, as defined earlier.
   
   c. **cv=5** : The number of folds for cross-validation is specified here
     
   d. **n_jobs=-1** : This parameter Utilizes parallel computing to speed up the grid search process by using all available                           CPU cores.
     
   e. **verbose=1** : It regulates the gridsearch procedures verbosity.

3. **dt_grid_searching.fit(x_train,y_train)** : Fitting the training data to the dt_grid_search object for cross validation.
   

In [21]:
best_parameter =dt_grid_searching.best_params_
model_best_score =dt_grid_searching.best_score_

print("The Best Parameters for Decision Tree is:")
print(best_parameter)
print("Best Cross_Validation Score is: {:.2f}".format(model_best_score))

The Best Parameters for Decision Tree is:
{'criterion': 'entropy', 'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2, 'splitter': 'random'}
Best Cross_Validation Score is: 0.78


## MODEL IS TRAINED WITH K-NEAREST NEIGHBORS 

In [22]:
knn_params= {'n_neighbors' : [3,5,7,10,15],'weights' : ['uniform', 'distance'], 'algorithm' : ['auto','ball_tree', 'kd_tree'], 'leaf_size' : [30,50,100]}
knn_grid_searching = GridSearchCV(KNeighborsClassifier(),knn_params,cv=5,n_jobs=-1,verbose=1)
knn_grid_searching.fit(x_train,y_train)

Fitting 5 folds for each of 90 candidates, totalling 450 fits


1. **knn_params** : This dictionary describes the parameters and values that can be used for the K-Nearest Neighbors Classifier grid search.

   a. **n_neighbors** : This parameter represents the number of neighbors to consider when making predictions. 
   
   b. **weigths** :  It's set to *uniform* indicating that all points in each neighborhood are weighthed equally.
   
   c. **algorithm** : It is set to *auto* which chooses the best algorithm on its own based on the parameters supplied to the fit method.
   
   d. **leaf_size** : This parameter controls the leaf size passed to *BallTree* or *KDTree* which are data structures used for                       efficient neighbor searches. It is set to 30 meaning the grid search will explore this single value.
   
2. **knn_grid_searching** : This variable initializes a **GridSearchCV** object for performing hyperparameter tuning using grid                              search for the **KNN Classifier**.

   a. **KNeighborsClassifier** : This function initializes a K-Nearest Classifier object.
   
   b. **knn_params** : This is the dictionary containing the parameters to be tuned, as defined earlier.
   
   c.  **cv=5** : The number of folds for cross-validation is specified here.
     
   d. **n_jobs=-1** : It regulates the gridsearch procedures verbosity.
     
   e. **verbose=1** : It controls the verbosity of the grid search process.A value of *1* indicates that progress message will                         be printed during the search.
   
3. **knn_grid_search.fit(x_train,y_train)** : Fitting the training data to the knn_grid_search object for cross validation.

In [23]:
best_parameter =knn_grid_searching.best_params_
model_best_score =knn_grid_searching.best_score_

print("The Best Parameters for K-Nearest Neighbors is:")
print(best_parameter)
print("Best Cross_Validation Score is: {:.2f}".format(model_best_score))

The Best Parameters for K-Nearest Neighbors is:
{'algorithm': 'auto', 'leaf_size': 30, 'n_neighbors': 3, 'weights': 'distance'}
Best Cross_Validation Score is: 0.79


## MODEL IS TRAINED WITH RANDOM FOREST CLASSIFIER

In [24]:
rf_model ={'n_estimators':[100,200],'criterion':['gini','entropy'],'max_features':['sqrt','auto'], 'max_depth' : [None,10], 'min_samples_split' : [2,5], 'min_samples_leaf' : [1,2], 'max_features' : [True, False]}
rf_grid_searching = GridSearchCV(RandomForestClassifier(random_state=42),rf_model,cv=5,n_jobs=-1,verbose=1)
rf_grid_searching.fit(x_train,y_train)

Fitting 5 folds for each of 64 candidates, totalling 320 fits


1. **rf_model** : This is a dictionary contains the hyperparameters that will be tuned during the grid search for the Random Forest model.

   a. **n_estimators** : Tell us how many trees are in the forest.  Considering that the value is set to 100 and 200 in this instance, here will be a total of 100,200 decision trees.     
   
   b. **criterion** :It's set to *gini*and *entropy* which measures impurity using the Gini impurity criterion and provide information gain using the enrtopy.
   
   c. **max_features** : It's set to *sqrt* and *auto* meaning it will consider the square root of the total number of features and auto for all features.
   
   d. **max_depth** : We are setting up maximum depth of the tree that algorithm need to be trained.
   
   e. **min_samples_leaf** : Number of samples need to be present at leaf node.
   
   f. **min_samples_split** : To separate an internal node, minimum number of samples is required.
   
2. **rf_grid_searching** : This initializes a **GridSearchCV** object for performing hyperparameter tuning using grid search for the **Random Forest Classifier**.

   a. **RandomForestClassifier()** :The model is adjusted to the Random Forest Classifier object. **random_state** : For results that can be repeated, this argument determines the seed that the random number generator uses.
   
   b. **rf_model** : This is the dictionary containing the parameters to be tuned, as defined earlier.
   
   c.  **cv=5** : The numberof folds for cross-validation is specified here.
     
   d. **n_jobs=-1** : This parameter Utilizes parallel computing to speed up the grid search process by using all available                           CPU cores.
     
   e. **verbose=1** : It regulates the gridsearch procedures verbosity.
   
3. **rf_grid_searching.fit(x_train,y_train)** : Fitting the training data to the rf_grid_search object for cross validation.

In [25]:
best_parameter =rf_grid_searching.best_params_
model_best_score =rf_grid_searching.best_score_

print("The Best Parameters for RandomForestClassifier is:")
print(best_parameter)
print("Best Cross_Validation Score is: {:.2f}".format(model_best_score))

The Best Parameters for RandomForestClassifier is:
{'criterion': 'gini', 'max_depth': None, 'max_features': True, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
Best Cross_Validation Score is: 0.81


### CONCLUSION

Out of these four supervised algorthim that is **SVM, Decision Tree, K-Nearest Classifier and RandomForestClassifier**. RandomForestClassifier algorithm fits better for this model as it score 81% when compared to other Three algorithm score.

# MODEL EVALUATION

In [26]:
y_pred = rf_grid_searching.predict(x_test)
accuracy = acc(y_test,y_pred)

In [27]:
print("The Accuracy of the model is : {:.2f}".format(accuracy))

The Accuracy of the model is : 0.73


In [28]:
prec = ps(y_test, y_pred)
print("The Precision of the model is: {:.2f}".format(prec))

The Precision of the model is: 0.54


In [29]:
f1_sc = fs(y_test,y_pred)
print("The F1 Score is: {:.2f}".format(f1_sc))

The F1 Score is: 0.65


In [30]:
confusion_mat = cm(y_test,y_pred)
print("The confusion matrix is below: ")
print(confusion_mat)

The confusion matrix is below: 
[[73 33]
 [ 9 39]]


In [31]:
cl_report = clr(y_test,y_pred)
print("The classification report is:")
print(cl_report)

The classification report is:
              precision    recall  f1-score   support

           0       0.89      0.69      0.78       106
           1       0.54      0.81      0.65        48

    accuracy                           0.73       154
   macro avg       0.72      0.75      0.71       154
weighted avg       0.78      0.73      0.74       154



In [32]:
recall = rs(y_test, y_pred)
print("The Score of Recall is: {:.2f}".format(recall))

The Score of Recall is: 0.81


In [33]:
r_a_s = ras(y_test, y_pred)
print("The Score of ROC_AUC is: {:.2f}".format(r_a_s))

The Score of ROC_AUC is: 0.75


# CONCLUSION

Thus, the diabetes prediciton project implemented using Python and Random Forest Classifier algorithm has demonstrated the efficiency of the algorithm based on machine learning in the healthcare fied.Through meticulous data analaysis, feature selection and model training, we have developed a robust predictive tool capable of identifying potential diabetes cases with high accuracy. Leveraging Random Forest Classifier abaility to handle complex datasets and find optimal hyperplane boundaries, the project not only showcases the power of advanced algorithms but also highlights the importance of data-driven decision-making in the field of medicine. By harnessing the potential of this predicitive model, healthcare professionals can proactively identify individuals at risk, enabling early intervention and personalized care strategies, ultimately leading to improved patient outcomes and a significant positive impact on public health.

# DATASET SOURCE LINK

https://www.kaggle.com/datasets/saurabh00007/diabetescsv