**PREDICTING DIABETES WITH THE PIMA INDIANS DIABETES DATASET FROM KAGGLE**

**OBJECTIVE:** The goal is to predict whether a patient has diabetes from diagnostic measurements such as the number of pregnancies the patient has had, their BMI, insulin level, and age. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

There is one **TARGET** variable (dependent feature), "Outcome", and eight **PREDICTOR** variables (independent features): Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age.

**Using the Random Forest and XGBoost algorithms**

**IMPORTING LIBRARIES**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

**How to read a CSV file**

In [None]:
data=pd.read_csv("/kaggle/input/pima-indians-diabetes-database/diabetes.csv")

**How to find the total rows and columns in a dataset**

In [None]:
data.shape

**How to print the first 5 rows of the dataset**

In [None]:
data.head()

**Finding the number of ones and zeros from "Outcome" column**

In [None]:
Outcome_1_count = len(data.loc[data['Outcome']== 1])
Outcome_0_count = len(data.loc[data['Outcome']== 0])

**The class counts below show that the dataset is not imbalanced enough to be a problem for the algorithms we are using (Random Forest and XGBoost)**

In [None]:
(Outcome_1_count,Outcome_0_count)
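Class balance is often easier to judge from proportions than from raw counts. The sketch below uses a small made-up DataFrame (`toy`, with hypothetical values) in place of `data`; the same two calls should work on the real "Outcome" column.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `data` DataFrame.
toy = pd.DataFrame({"Outcome": [0, 0, 0, 1, 0, 1, 0, 1, 0, 0]})

# Absolute number of rows per class.
counts = toy["Outcome"].value_counts()

# Share of each class; the shares sum to 1.
proportions = toy["Outcome"].value_counts(normalize=True)

print(counts.to_dict())
print(proportions.to_dict())
```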

**Checking if we have Null values in the dataset**

In [None]:
data.isnull().values.any()

**Finding the correlation using heatmap (Seaborn Library)**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
corrmat=data.corr()
top_corr_features= corrmat.index
plt.figure(figsize=(10,10))
#Plotting heatmap
g=sns.heatmap(data[top_corr_features].corr(),annot=True,cmap='RdYlGn')

**To get the correlation matrix as a plain table of numeric values, call corr() directly**

In [None]:
data.corr()
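When only the relationship with the target matters, the "Outcome" column of the correlation matrix can be pulled out and sorted. This is a minimal sketch on a made-up DataFrame (`toy`, hypothetical values); the column names are borrowed from the dataset for readability.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the dataset.
toy = pd.DataFrame({
    "Glucose": [85, 90, 150, 160, 170, 95],
    "Age": [50, 21, 33, 45, 29, 60],
    "Outcome": [0, 0, 1, 1, 1, 0],
})

# Correlation of every feature with the target, strongest positive first.
target_corr = (
    toy.corr()["Outcome"]
    .drop("Outcome")              # the target's correlation with itself is always 1
    .sort_values(ascending=False)
)

print(target_corr)
```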

**Checking each column for zeros, which this dataset uses in place of missing values**

In [None]:
# Zeros stand in for missing measurements in most columns.
# A zero in Pregnancies is a valid value, so its count is informational only.
print("Total number of rows: {0}".format(len(data)))
for column in ['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age']:
    print("Number of rows with zero {0}: {1}".format(column, (data[column] == 0).sum()))
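One way to make these placeholder zeros explicit is to convert them to NaN in the columns where a zero cannot be a real measurement. The sketch below runs on a made-up DataFrame (`toy`); which columns to treat this way (`zero_means_missing`) is an assumption of the example, not something the dataset states.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the column names match the dataset.
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 5],
    "Glucose": [148, 0, 183],
    "BMI": [33.6, 26.6, 0.0],
})

# Assumption: a zero in these columns means "not measured".
zero_means_missing = ["Glucose", "BMI"]

cleaned = toy.copy()
for column in zero_means_missing:
    cleaned[column] = cleaned[column].replace(0, np.nan)

# Pregnancies keeps its zeros; Glucose and BMI now report one missing value each.
print(cleaned.isnull().sum())
```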

**Train_Test_split**

In [None]:
from sklearn.model_selection import train_test_split
feature_columns=['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age']
predicted_class=['Outcome']

In [None]:
X= data[feature_columns].values
y= data[predicted_class].values

# Performing the split 
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.30,random_state=10)
# (Without a fixed random_state, each run produces a different train/test split, which makes issues hard to reproduce and debug. Setting random_state keeps the split the same on every run.)

In [None]:
# Displaying the shapes: because test_size is 0.30, 30% of the rows go to the test set and 70% to the training set.
print(X_train.shape)
print(X_test.shape)

print(y_train.shape)
print(y_test.shape)
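If the class proportions should be the same in both splits, `train_test_split` accepts a `stratify` argument. The sketch below uses synthetic data (`X_demo`, `y_demo`, both hypothetical) rather than the diabetes data, and is an optional variation, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: 100 samples, 30% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo keeps the share of positives roughly equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=10, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```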

**Imputation** replaces missing values with a provided constant or with a statistic (mean, median, or most frequent value) of the column in which the missing values are located.
In this case, we use the mean.

In [None]:
from sklearn.impute import SimpleImputer
fill_values= SimpleImputer(missing_values=0,strategy='mean')
# Learn the column means from the training data only, then reuse them on the test data
X_train= fill_values.fit_transform(X_train)
X_test=fill_values.transform(X_test)
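A common convention is to learn the imputation statistics from the training split only and reuse them on the test split, so that no test information leaks into preprocessing. The toy arrays below (`train_demo`, `test_demo`) are hypothetical and small enough to check by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy arrays; 0 marks a missing measurement.
train_demo = np.array([[100.0, 30.0], [0.0, 20.0], [120.0, 0.0]])
test_demo = np.array([[0.0, 40.0], [90.0, 0.0]])

imputer = SimpleImputer(missing_values=0, strategy="mean")

# fit_transform learns the per-column means of the non-missing training values.
train_filled = imputer.fit_transform(train_demo)

# transform reuses those training means; nothing is learned from the test rows.
test_filled = imputer.transform(test_demo)

print(imputer.statistics_)  # per-column means learned from train_demo
print(test_filled)
```

Here the training means are 110 and 25, so the zeros in `test_demo` are filled with those values rather than with means computed from the test rows.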

Applying the **RandomForest** algorithm
  The ravel() function returns a contiguous flattened array (a 1D array containing all elements of the input, with the same type). A copy is made only if needed.
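The effect of ravel() on a target column can be seen on a tiny array. `y_col` below is a hypothetical stand-in for a single-column target with shape (n, 1).

```python
import numpy as np

# Hypothetical column vector, shaped like a one-column target array.
y_col = np.array([[0], [1], [1], [0]])

# ravel() flattens it to the 1D shape that scikit-learn estimators expect for y.
y_flat = y_col.ravel()

print(y_col.shape)   # (4, 1)
print(y_flat.shape)  # (4,)
```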

In [None]:
from sklearn.ensemble import RandomForestClassifier
random_forest_model= RandomForestClassifier(random_state=2)
#(Fixing random_state, here to 2, makes the forest reproducible: with the same training data,
# the same model is built on every run)

random_forest_model.fit(X_train,y_train.ravel())

Finding the **Accuracy** of the RandomForest algorithm

In [None]:
# Predictions are made on the held-out test set
predict_test_data=random_forest_model.predict(X_test)
from sklearn import metrics
print("Accuracy={0:.3f}".format(metrics.accuracy_score(y_test,predict_test_data)))
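Accuracy alone can hide how a model treats the positive class, so precision and recall are often reported next to it. The labels below (`y_true_demo`, `y_pred_demo`) are made up for illustration and are not results from this notebook.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 4 actual positives, 3 predicted positives.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0]

accuracy = accuracy_score(y_true_demo, y_pred_demo)     # correct / total
precision = precision_score(y_true_demo, y_pred_demo)   # TP / (TP + FP)
recall = recall_score(y_true_demo, y_pred_demo)         # TP / (TP + FN)

print("Accuracy={0:.3f} Precision={1:.3f} Recall={2:.3f}".format(accuracy, precision, recall))
```

In this toy case 7 of 10 predictions are correct, but only 2 of the 4 actual positives are found, which the accuracy figure by itself does not reveal.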

**XGBCLASSIFIER Algorithm**

**Hyperparameter optimization:**
  It is the search for a single set of well-performing hyperparameters that we can use to configure the model.

**Below are the booster parameters to search over**

In [None]:
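params={
    "learning_rate" : [0.05,0.10,0.15,0.20,0.25,0.30], 
    "max_depth"     : [3,4,5,6,7,8,9,10],
    "min_child_weight": [1,3,5,7],
    "gamma":[0.0,0.1,0.2,0.3,0.4],
    # XGBoost has no "colsample" parameter; the per-tree column subsampling ratio is "colsample_bytree"
    "colsample_bytree":[0.3,0.4,0.5,0.7] 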
    
}
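To see why a randomized search is used instead of an exhaustive one, it helps to count the combinations in the grid. The sketch below copies the value lists into a separate dictionary (`grid`, a name chosen for this example) and writes the column-subsampling key as `colsample_bytree`, XGBoost's parameter name.

```python
from math import prod

# Copy of the search space, kept separate from `params` for this illustration.
grid = {
    "learning_rate": [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth": [3, 4, 5, 6, 7, 8, 9, 10],
    "min_child_weight": [1, 3, 5, 7],
    "gamma": [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.7],
}

# An exhaustive grid search would try every combination of values.
n_combinations = prod(len(values) for values in grid.values())

# A randomized search with n_iter=5 samples only a handful of them.
n_iter = 5
print(n_combinations)
print("{0:.2%} of the grid is sampled".format(n_iter / n_combinations))
```

With 6 x 8 x 4 x 5 x 4 = 3840 combinations, five iterations cover a very small fraction of the space, so a larger `n_iter` may be worth the extra run time.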

**Hyperparameter optimization using RandomizedSearchCV**

In [None]:
from sklearn.model_selection import RandomizedSearchCV
import xgboost

In [None]:
classifier=xgboost.XGBClassifier()

In [None]:
random_search=RandomizedSearchCV(classifier,param_distributions=params,n_iter=5,scoring="roc_auc",n_jobs=-1,cv=5,verbose=3)

In [None]:
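from datetime import datetime

def timer(start_time=None):
    # Called without an argument: return the current time as the start mark
    if not start_time:
        start_time = datetime.now()
        return start_time
    # Called with a start mark: print the time elapsed since then
    else:
        thour,temp_sec = divmod((datetime.now()- start_time).total_seconds(),3600)
        tmin,tsec = divmod(temp_sec,60)
        print('\n Time taken: %i hour %i minutes and %s seconds.'% (thour,tmin,round(tsec,2)))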
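The timer relies on two `divmod` calls to break a number of seconds into hours, minutes, and seconds. The helper below (`split_elapsed`, a hypothetical name) isolates just that arithmetic so it can be checked on a known value.

```python
def split_elapsed(total_seconds):
    """Split a duration in seconds into (hours, minutes, seconds)."""
    # First divmod: whole hours, plus the seconds left over.
    hours, remainder = divmod(total_seconds, 3600)
    # Second divmod: whole minutes, plus the remaining seconds.
    minutes, seconds = divmod(remainder, 60)
    return int(hours), int(minutes), round(seconds, 2)

# 3725.5 s = 1 h (3600 s) + 2 min (120 s) + 5.5 s
print(split_elapsed(3725.5))  # (1, 2, 5.5)
```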

In [None]:
from datetime import datetime
start_time=timer(None)
random_search.fit(X_train,y_train.ravel())
timer(start_time)


**Finding the best estimator for XGBclassifier**

In [None]:
random_search.best_estimator_
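Besides `best_estimator_`, a fitted search object also exposes `best_params_` and `best_score_`. The sketch below shows them on synthetic data, with scikit-learn's `GradientBoostingClassifier` standing in for XGBoost so that the example needs only scikit-learn; the data, the grid, and all names in it are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical synthetic data with 8 features, like the diabetes dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Small illustrative search space (not the grid used in this notebook).
param_grid = {
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
    "subsample": [0.7, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=30, random_state=0),
    param_distributions=param_grid,
    n_iter=3, scoring="roc_auc", cv=3, random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)           # the sampled combination with the best mean CV score
print(round(search.best_score_, 3))  # mean cross-validated ROC AUC of that combination

# With the default refit=True, best_estimator_ is already refit on all of X_demo.
best_model = search.best_estimator_
```

Using `best_estimator_` directly avoids copying parameter values by hand into a new classifier.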

In [None]:
# Parameters taken from the random_search.best_estimator_ output above.
# "colsample" is not an XGBoost parameter, so its value is passed as colsample_bytree here.
classifier=xgboost.XGBClassifier(base_score=0.5, booster='gbtree',
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.7,
              gamma=0.4, gpu_id=-1, importance_type='gain',
              interaction_constraints='', learning_rate=0.05, max_delta_step=0,
              max_depth=10, min_child_weight=5, missing=0,
              monotone_constraints='()', n_estimators=100, n_jobs=4,
              num_parallel_tree=1, random_state=0, reg_alpha=0, reg_lambda=1,
              scale_pos_weight=1, subsample=1, tree_method='exact',
              validate_parameters=1, verbosity=None)

In [None]:
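from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
# Note: this cell re-encodes X and y, but the models below are still fit on
# X_train/X_test from the earlier split, so it does not change their results.
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
transformer = ColumnTransformer([('one_hot_encoder', OneHotEncoder(), [0])],remainder='passthrough')
# np.float was removed in NumPy 1.24; the built-in float is equivalent
X = np.array(transformer.fit_transform(X), dtype=float)
labelencoder_Y = LabelEncoder()
y = labelencoder_Y.fit_transform(y.ravel())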

In [None]:
classifier.fit(X_train,y_train.ravel())

In [None]:
y_pred=classifier.predict(X_test)

**Finding the confusion matrix and Accuracy_score**

In [None]:
from sklearn.metrics import confusion_matrix,accuracy_score

cm=confusion_matrix(y_test,y_pred)
Accuracy=accuracy_score(y_test,y_pred)

print(cm)
print(Accuracy)

In [None]:
from sklearn.model_selection import cross_val_score
Accuracy=cross_val_score(classifier,X_train,y_train.ravel(),cv=10)

In [None]:
Accuracy

**Finding the mean cross-validated accuracy of XGBoost**

In [None]:
Accuracy.mean()
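The mean of the fold scores is easier to interpret together with their spread. The sketch below reports both on synthetic data, using a `RandomForestClassifier` as a stand-in model; the data and variable names are hypothetical, and the printed numbers are not results for the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data with 8 features.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# One accuracy score per fold (10 folds).
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_demo, y_demo, cv=10,
)

# The standard deviation shows how much the score varies between folds.
print("Accuracy: {0:.3f} +/- {1:.3f}".format(scores.mean(), scores.std()))
```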