![Diabetes](https://www.diabetes.co.uk/wp-content/uploads/2019/01/How-to-Bring-Down-High-Blood-Sugar-Levels-1.png)

Hello Kagglers, <br>
    In this notebook I try to build an optimal model for diabetes prediction from the given .csv data.<br>
    Points covered are:<br>
     0] Exploratory Data Analysis and Visualization<br>
     1] Normalizing the Data Distribution<br>
     2] Up-Sampling for Imbalanced Data<br>
     3] Feature Engineering and Selection<br>
     4] Fine-Tuning of Models<br>
 <br>
 If you found this notebook helpful, your *Upvote Will Encourage Me* !!! 😀😇😊
 
 ## Problem Statement : Diabetes Prediction
 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelBinarizer , StandardScaler ,MinMaxScaler
from sklearn.model_selection import train_test_split
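from sklearn.metrics import classification_report , accuracy_score
try:
    from sklearn.metrics import plot_confusion_matrix
except ImportError:
    # plot_confusion_matrix was removed in scikit-learn 1.2; wrap its replacement under the old name.
    from sklearn.metrics import ConfusionMatrixDisplay
    def plot_confusion_matrix(estimator,X,y,**kwargs):
        return ConfusionMatrixDisplay.from_estimator(estimator,X,y,**kwargs)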

from xgboost import XGBClassifier, plot_importance
from sklearn.linear_model import LogisticRegression , RidgeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier 
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC


from sklearn.utils import resample
from sklearn.ensemble import AdaBoostClassifier , GradientBoostingClassifier , VotingClassifier , RandomForestClassifier

In [None]:
data = pd.read_csv("/kaggle/input/pima-indians-diabetes-database/diabetes.csv")
data.head(10)

# Overview of the whole dataset in a single view.

In [None]:
sns.pairplot(data)

It looks like the majority of the attributes are not normally distributed.
The data is also noticeably imbalanced.

In [None]:
data["Outcome"].hist()

*Checking for null values distribution...*
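
Keep in mind that `isnull()` only flags real NaN values. The sketch below assumes that zeros in columns such as `Glucose`, `SkinThickness` and `Insulin` stand for missing readings (the imputation cell near the end of this notebook treats them that way). The tiny frame `toy` and the list `suspect_cols` are made up for illustration; they are not taken from diabetes.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample with the same column names as diabetes.csv.
toy = pd.DataFrame({
    "Glucose":       [148, 85, 0, 89],
    "SkinThickness": [35, 0, 0, 23],
    "Insulin":       [0, 0, 0, 94],
    "Outcome":       [1, 0, 1, 0],
})

# Columns where a zero is assumed to be physically implausible.
suspect_cols = ["Glucose", "SkinThickness", "Insulin"]

# isnull() finds nothing, because the gaps are stored as 0 rather than NaN.
nan_counts = toy.isnull().sum()

# Counting zeros exposes the hidden gaps.
zero_counts = (toy[suspect_cols] == 0).sum()

# Optionally convert them to NaN so a heatmap of isnull() shows them.
toy_nan = toy.copy()
toy_nan[suspect_cols] = toy_nan[suspect_cols].replace(0, np.nan)

print(zero_counts)
```

On the real data the same two lines (`isnull().sum()` and the zero count) can be run on `data` directly.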

In [None]:
plt.figure(figsize=(8,8))
sns.heatmap(data.isnull())

In [None]:
data.describe()

*Checking the feature importance of the attributes by feeding the data to a classifier...*

In [None]:

X = data.drop(["Outcome"],axis=1)
Y = data["Outcome"]

In [None]:
XGBR = XGBClassifier()
XGBR.fit(X,Y)
features = XGBR.feature_importances_
Columns = X.columns
for i,j in enumerate(features):
    print(Columns[i],"->",j)

plt.figure(figsize=(16,6))
plt.title(label="XGB")
#plt.bar([x for x in range(len(features))],features)
plt.bar([x for x in (Columns)],features)
plt.show()

plot_importance(XGBR)

From the above graph we can say "SkinThickness" is the least important attribute.

"Glucose", "DiabetesPedigreeFunction" and "BMI" are among the most important attributes.

Let's benchmark the dataset, i.e. train the classifier without any explicit feature engineering or modification of the data.


In [None]:
X = data.drop(["Outcome"],axis=1)
Y = data["Outcome"]
X_train , X_test , y_train , y_test = train_test_split(X,Y,test_size=0.20,random_state=0)

XGBR = XGBClassifier()
XGBR.fit(X_train,y_train)
y_pred = XGBR.predict(X_test)
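# classification_report expects (y_true, y_pred); swapping them swaps precision and recall.
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
# Labels follow the sorted class order: 0 = Non-Diabetic, 1 = Diabetic.
plot_confusion_matrix(XGBR,X_test,y_test,display_labels=["Non-Diabetic","Diabetic"],cmap=plt.cm.Blues)
plot_confusion_matrix(XGBR,X_test,y_test,display_labels=["Non-Diabetic","Diabetic"],cmap=plt.cm.Blues,normalize='true')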

To improve performance, let's adjust the distribution of these attributes->

"Pregnancies","Glucose","SkinThickness","Insulin","DiabetesPedigreeFunction","Age"

Taking the log of the data points of these features compresses their long right tails and brings them closer to a normal distribution.
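
A quick way to see what a log transform does to a skewed feature is to compare the skewness before and after. This is only a sketch on synthetic data: the series `skewed` is randomly generated as a stand-in for a column like `Insulin`, and it uses `np.log1p` (log of 1 + x) as one common way to handle zeros, which is an alternative to the zero check used in the cell below.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed, non-negative feature (not real diabetes data).
skewed = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=2000))
skewed.iloc[:100] = 0  # include some zeros, as the real columns have

# log1p(x) = log(1 + x) is defined at 0 and maps 0 to 0.
transformed = np.log1p(skewed)

# Skewness near 0 means a roughly symmetric distribution.
skew_before = skewed.skew()
skew_after = transformed.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A lower absolute skew means the histogram is more symmetric; it does not by itself prove the data is normal.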

In [None]:
#"Pregnancies","Glucose","SkinThickness","Insulin","DiabetesPedigreeFunction","Age"
cols = ["Pregnancies","Glucose","SkinThickness","Insulin","DiabetesPedigreeFunction","Age"]
data_new = data.copy()

# Histograms before the transformation.
for col in cols:
    data_new[col].hist()
    plt.show()

# Natural log of every non-zero value; zeros stay at 0 because log(0) is undefined.
for col in cols:
    data_new[col] = [np.log(i) if i!=0 else 0 for i in data_new[col]]

print("="*10,"\nAfter log transformation\n")

# Histograms after the transformation.
for col in cols:
    data_new[col].hist()
    plt.show()

sns.pairplot(data_new)

In [None]:
from scipy.stats import boxcox
from scipy.special import inv_boxcox

cols = ["Pregnancies","Glucose","SkinThickness","Insulin","DiabetesPedigreeFunction","Age"]
data_boxcox = data.copy()

for col in cols:
    # Box-Cox needs strictly positive input, so zeros are replaced by 1 first.
    to_convert = [i if i!=0 else 1 for i in data_boxcox[col].values]
    # lmbda=None lets SciPy estimate the lambda that maximizes the log-likelihood.
    data_boxcox[col],fitted_lambda = boxcox(to_convert,lmbda=None)
    # inv_boxcox(data_boxcox[col],fitted_lambda) maps the values back to the original scale if needed.
    data_boxcox[col].hist()
    plt.show()

sns.pairplot(data_boxcox)

In [None]:
print(data_new.describe())
print(data_boxcox.describe())

# What "normally distributed data" means <br>
1] A plot of the data gives a bell curve that is symmetric around the mean. <br>
2] After standardization (subtract the mean, divide by the standard deviation) the data has mean = 0 and standard deviation = 1.<br>

![Bell Curve](https://i.pinimg.com/originals/dd/5e/f9/dd5ef94c82281d75ff0bce252c6be136.jpg)
<br>
<br>
The intuition behind the normal distribution, in simple language, is that most of the data lies<br>
near the mean of the whole data. <br>

68% of data points lie within 1 standard deviation of the mean (between -1 and 1 after standardization)<br>
95% of data points lie within 2 standard deviations of the mean (between -2 and 2 after standardization)<br>

The standard deviation measures how tightly the data points are grouped around the mean.<br>
Standardizing to standard deviation = 1 does not remove outliers by itself. Outliers are the points which do not fit <br>
in the normal range of points, i.e. they lie far away from the mean.<br>


From the above graphs, both the logarithm method and the BoxCox method pull the extreme values closer to the rest of the data.<br>
The graphs show that the data is closest to a normal distribution after the BoxCox method, <br>
whereas the summary statistics show that the log transform has compressed the outliers. <br>

We will try both methods.
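
The 68% / 95% rule above can be checked numerically. This is a small sketch on a synthetic normal sample; the mean of 120 and spread of 30 are arbitrary values chosen for illustration, not statistics of the diabetes data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic normal sample with an arbitrary mean and spread.
x = rng.normal(loc=120.0, scale=30.0, size=100_000)

# Standardize: subtract the mean, divide by the standard deviation.
z = (x - x.mean()) / x.std()

# Share of points within 1 and 2 standard deviations of the mean.
within_1sd = np.mean(np.abs(z) <= 1)
within_2sd = np.mean(np.abs(z) <= 2)
print(f"within 1 SD: {within_1sd:.3f}, within 2 SD: {within_2sd:.3f}")
```

Running the same two lines on a transformed column of the dataset gives a rough feel for how close that column is to a bell curve.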

In [None]:
X = data_boxcox.drop(["Outcome"],axis=1)
Y = data_boxcox["Outcome"]
X_train , X_test , y_train , y_test = train_test_split(X,Y,test_size=0.20,random_state=0)

XGBR = XGBClassifier()
XGBR.fit(X_train,y_train)
y_pred = XGBR.predict(X_test)
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
# 0 = Non-Diabetic, 1 = Diabetic
plot_confusion_matrix(XGBR,X_test,y_test,display_labels=["Non-Diabetic","Diabetic"],cmap=plt.cm.Blues)
plot_confusion_matrix(XGBR,X_test,y_test,display_labels=["Non-Diabetic","Diabetic"],cmap=plt.cm.Blues,normalize='true')

# Understanding Classification report
![Metrics](https://static.packt-cdn.com/products/9781785282287/graphics/B04223_10_02.jpg)
![Accuracy](https://miro.medium.com/max/1594/0*qLxAWTs-gZjQvTi4.jpg)
![Precision](https://miro.medium.com/max/1104/1*5PvyyMvH5n42XICQrlXOzw.png)
![Recall](https://lawtomated.com/wp-content/uploads/2019/10/Recall_1.png)
![F1 Score](https://datascience103579984.files.wordpress.com/2019/04/capture3-24.png)
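
The formulas in the images can be reproduced by hand from a confusion matrix. This is a sketch with made-up labels (`y_true` and `y_hat` are invented for illustration), treating 1 = diabetic as the positive class, and cross-checking the hand-computed values against scikit-learn's own functions.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# Made-up labels: 1 = diabetic (positive), 0 = non-diabetic (negative).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all correct predictions
precision = tp / (tp + fp)                   # of predicted positives, how many are right
recall = tp / (tp + fn)                      # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
print(accuracy_score(y_true, y_hat), precision_score(y_true, y_hat),
      recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```

Note that all scikit-learn metric functions take the true labels first and the predictions second.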



In [None]:
X = data_new.drop(["Outcome"],axis=1)
Y = data_new["Outcome"]
X_train , X_test , y_train , y_test = train_test_split(X,Y,test_size=0.20,random_state=0)

XGBR = XGBClassifier()
XGBR.fit(X_train,y_train)
y_pred = XGBR.predict(X_test)
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
# 0 = Non-Diabetic, 1 = Diabetic
plot_confusion_matrix(XGBR,X_test,y_test,display_labels=["Non-Diabetic","Diabetic"],cmap=plt.cm.Blues)
plot_confusion_matrix(XGBR,X_test,y_test,display_labels=["Non-Diabetic","Diabetic"],cmap=plt.cm.Blues,normalize='true')

Accuracy increased significantly!!!.😀 <br>
Another noticeable thing is the improvement in the model's performance on the second class. <br>
Precision, Recall and F1 Score have increased not only for the first class but also for the second class. <br>

Now let's tackle the second problem, which is Imbalanced Data.


In [None]:
print("Count of Negative class: ",list(data["Outcome"]).count(0))
print("Count of Positive class: ",list(data["Outcome"]).count(1))
data["Outcome"].hist()


The data is biased towards the Negative outcome, i.e. "0", over the Positive outcome "1".<br>
Out of 768 records, 500 hold a Negative outcome, whereas only 268 hold a Positive outcome. <br>

*To overcome this problem we can Up-Sample the minority class or Down-Sample the majority class.*
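
One way to up-sample without letting copies of the same row appear in both the training and the test set is to split first and resample only the training part. The sketch below is an assumption about how this could be arranged, not the approach used in the cells that follow; the frame `df` is synthetic and only mimics the 500 / 268 class counts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Synthetic imbalanced frame: 500 negatives, 268 positives.
df = pd.DataFrame({
    "feature": rng.normal(size=768),
    "Outcome": np.r_[np.zeros(500, dtype=int), np.ones(268, dtype=int)],
})

# 1) Hold out the test set first; stratify keeps the class ratio in both parts.
train_df, test_df = train_test_split(
    df, test_size=0.20, stratify=df["Outcome"], random_state=0
)

# 2) Up-sample the minority class inside the training split only.
train_major = train_df[train_df["Outcome"] == 0]
train_minor = train_df[train_df["Outcome"] == 1]
train_minor_up = resample(
    train_minor, replace=True, n_samples=len(train_major), random_state=123
)
train_balanced = pd.concat([train_major, train_minor_up])

print(train_balanced["Outcome"].value_counts())
```

The test set keeps the original class ratio, so its score reflects what the model would see on new patients.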

In [None]:
#Work on a copy so the BoxCox data stays as it is for later use.
data_bal = data_boxcox.copy()

#Separate the rows by Outcome: 0 is the majority class, 1 is the minority class.
df_majority = data_bal[data_bal.Outcome==0]
df_minority = data_bal[data_bal.Outcome==1]

#Down-sample the majority class,
#i.e. draw (without replacement) as many majority rows as there are minority rows.

df_majority_downsampled = resample(df_majority,replace=False,n_samples=268,random_state=123)
df_downsampled = pd.concat([df_majority_downsampled,df_minority])
print("Downsampled data:->\n",df_downsampled.Outcome.value_counts())

#Up-sample the minority class,
#i.e. draw minority rows with replacement until they match the majority count.
df_minority_upsampled = resample(df_minority,replace=True,n_samples=500,random_state=123)
df_upsampled = pd.concat([df_majority,df_minority_upsampled])
print("Upsampled data:->\n",df_upsampled.Outcome.value_counts())

In [None]:
X = df_downsampled.drop(["Outcome"],axis=1)
Y = df_downsampled["Outcome"]
X_train , X_test , y_train , y_test = train_test_split(X,Y,test_size=0.20,random_state=0)

XGBR = XGBClassifier()
XGBR.fit(X_train,y_train)
y_pred = XGBR.predict(X_test)
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
# 0 = Non-Diabetic, 1 = Diabetic
plot_confusion_matrix(XGBR,X_test,y_test,display_labels=["Non-Diabetic","Diabetic"],cmap=plt.cm.Blues)
plot_confusion_matrix(XGBR,X_test,y_test,display_labels=["Non-Diabetic","Diabetic"],cmap=plt.cm.Blues,normalize='true')

In [None]:
X = df_upsampled.drop(["Outcome"],axis=1)
Y = df_upsampled["Outcome"]
X_train , X_test , y_train , y_test = train_test_split(X,Y,test_size=0.20,random_state=0)

XGBR = XGBClassifier()
XGBR.fit(X_train,y_train)
y_pred = XGBR.predict(X_test)
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
# 0 = Non-Diabetic, 1 = Diabetic
plot_confusion_matrix(XGBR,X_test,y_test,display_labels=["Non-Diabetic","Diabetic"],cmap=plt.cm.Blues)
plot_confusion_matrix(XGBR,X_test,y_test,display_labels=["Non-Diabetic","Diabetic"],cmap=plt.cm.Blues,normalize='true')

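Accuracy and the other metrics say it all. <br>
Up-Sampling is helpful in our case to balance the data. <br>
Keep in mind that the minority rows are duplicated before the train/test split, so copies of the same row can land in both sets and make the test score optimistic. <br>
Now, to create an optimal model, fine-tuning of the classifier is needed.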
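
Fine-tuning can also be automated with `GridSearchCV`, which is imported at the top of this notebook. The sketch below is hedged: the data comes from `make_classification` as a stand-in for the 8-feature table, and the small `param_grid` is a hypothetical example, not the set of values tuned in the cells that follow.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the 8-feature diabetes table.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# A deliberately small, hypothetical grid so the search finishes quickly.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 6, None],
}

# Every combination is scored with 3-fold stratified cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is the model refit on all of the data passed to `fit` with the best parameters.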

In [None]:
X = df_upsampled.drop(["Outcome"],axis=1)
Y = df_upsampled["Outcome"]
X_train , X_test , y_train , y_test = train_test_split(X,Y,test_size=0.20,random_state=0)

XGBR = XGBClassifier(learning_rate =0.1,n_estimators=100000,max_depth=6,min_child_weight=6,gamma=0,subsample=0.6,colsample_bytree=0.8,
 reg_alpha=0.005, objective= 'binary:logistic', nthread=2, scale_pos_weight=1, seed=27)
XGBR.fit(X_train,y_train)
y_pred = XGBR.predict(X_test)
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
# 0 = Non-Diabetic, 1 = Diabetic
plot_confusion_matrix(XGBR,X_test,y_test,display_labels=["Non-Diabetic","Diabetic"],cmap=plt.cm.Blues)
plot_confusion_matrix(XGBR,X_test,y_test,display_labels=["Non-Diabetic","Diabetic"],cmap=plt.cm.Blues,normalize='true')

In [None]:
X = df_upsampled.drop(["Outcome"],axis=1)
Y = df_upsampled["Outcome"]
X_train , X_test , y_train , y_test = train_test_split(X,Y,test_size=0.20,random_state=27)

#RF = RandomForestClassifier(n_estimators=1000,random_state=0,n_jobs=-1,max_depth=70,bootstrap=True)
# n_jobs=-1 uses all available CPU cores.
RF = RandomForestClassifier(n_estimators=10000,random_state=42,n_jobs=-1,max_depth=70,bootstrap=True)
RF.fit(X_train,y_train)
y_pred = RF.predict(X_test)
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
# 0 = Non-Diabetic, 1 = Diabetic
plot_confusion_matrix(RF,X_test,y_test,display_labels=["Non-Diabetic","Diabetic"],cmap=plt.cm.Blues)
plot_confusion_matrix(RF,X_test,y_test,display_labels=["Non-Diabetic","Diabetic"],cmap=plt.cm.Blues,normalize='true')



A gain of 1 point over XGB is also a good step.<br>
# Peak scores so far:-> <br>
Benchmark : 75.97 <br>
XGB : 87.50 <br>
RF : 88.50

In [None]:
#Lets scale the data
StSc = StandardScaler()
MnMx = MinMaxScaler()

X = df_upsampled.drop(["Outcome"],axis=1)
Y = df_upsampled["Outcome"]
X_train , X_test , y_train , y_test = train_test_split(X,Y,test_size=0.20,random_state=27)

# Fit the scaler on the training data only and reuse the same scaling for the test data.
X_train , X_test = MnMx.fit_transform(X_train) , MnMx.transform(X_test)

RF = RandomForestClassifier(n_estimators=1000,random_state=0,n_jobs=-1,max_depth=70,bootstrap=True)
#RF = RandomForestClassifier(n_estimators=10000,random_state=42,n_jobs=-1,max_depth=70,bootstrap=True)
RF.fit(X_train,y_train)
y_pred = RF.predict(X_test)
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
# 0 = Non-Diabetic, 1 = Diabetic
plot_confusion_matrix(RF,X_test,y_test,display_labels=["Non-Diabetic","Diabetic"],cmap=plt.cm.Blues)
plot_confusion_matrix(RF,X_test,y_test,display_labels=["Non-Diabetic","Diabetic"],cmap=plt.cm.Blues,normalize='true')



In [None]:
#Heat map of the pairwise correlations between the features (BoxCox data)
matrix = data_boxcox.drop(["Outcome"],axis=1).corr()
#f , ax = plt.subplots(figsize=(18,6))
plt.figure(figsize=(18,8))
sns.heatmap(matrix,vmax=0.8,square=True,cmap="BuPu")

Let's do some Feature Engineering. <br>
Remember, the attributes "Pregnancies", "SkinThickness" and "Insulin" have low importance, so let's give them a rest. The next cell also drops "BloodPressure".
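
Dropping weak features can also be done from the importances themselves instead of by eye. This is a sketch under assumptions: the frame `X_demo` is synthetic (one informative column and two noise columns), and the "keep everything above the mean importance" rule is a hypothetical threshold, not the rule used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic frame: "signal" drives the label, the "noise_*" columns do not.
X_demo = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
y_demo = (X_demo["signal"] + 0.3 * rng.normal(size=n) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_demo, y_demo)

# Importances sum to 1; index them by column name for readability.
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)

# Hypothetical rule: keep the features whose importance is above the mean.
keep = importances[importances > importances.mean()].index.tolist()
X_selected = X_demo[keep]

print(importances.sort_values(ascending=False))
print("kept:", keep)
```

It is worth re-checking the score after each drop, because a feature with low importance can still help in combination with others.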

In [None]:
#X = df_upsampled.drop(["Outcome" , "Pregnancies" , "SkinThickness" ,"Insulin"],axis=1) # 0.89
X = df_upsampled.drop(["Outcome" ,"BloodPressure", "Pregnancies"  ,"SkinThickness" ,"Insulin"],axis=1)
Y = df_upsampled["Outcome"]
X_train , X_test , y_train , y_test = train_test_split(X,Y,test_size=0.20,random_state=27)

# n_jobs=-1 uses all available CPU cores.
RF = RandomForestClassifier(n_estimators=1000,random_state=0,n_jobs=-1,max_depth=70,bootstrap=True)
#RF = RandomForestClassifier(n_estimators=10000,random_state=42,n_jobs=-1,max_depth=70,bootstrap=True)
RF.fit(X_train,y_train)
y_pred = RF.predict(X_test)
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
# 0 = Non-Diabetic, 1 = Diabetic
plot_confusion_matrix(RF,X_test,y_test,display_labels=["Non-Diabetic","Diabetic"],cmap=plt.cm.Blues)
plot_confusion_matrix(RF,X_test,y_test,display_labels=["Non-Diabetic","Diabetic"],cmap=plt.cm.Blues,normalize='true')


In [None]:
X = df_upsampled.drop(["Outcome" ,"BloodPressure", "Pregnancies"  ,"SkinThickness" ,"Insulin"],axis=1)
Y = df_upsampled["Outcome"]
X_train , X_test , y_train , y_test = train_test_split(X,Y,test_size=0.20,random_state=27)


# Base models for the voting ensemble (default settings).
models = []
models.append(("XGB",XGBClassifier()))
models.append(("RF",RandomForestClassifier()))
models.append(("DT",DecisionTreeClassifier()))
models.append(("ADB",AdaBoostClassifier()))
models.append(("GB",GradientBoostingClassifier()))

# Classifiers to compare, each paired with the label printed next to its accuracy.
candidates = [
    ("Voting Ensemble", VotingClassifier(estimators=models)),
    ("SVM", SVC(kernel="linear",class_weight="balanced",probability=True)),
    ("XGBoost", XGBClassifier(learning_rate=0.1,n_estimators=10000,max_depth=4,min_child_weight=6,gamma=0,subsample=0.6,colsample_bytree=0.8,
                              reg_alpha=0.005,objective='binary:logistic',nthread=2,scale_pos_weight=1,seed=27)),
    ("RandomForestClassifier", RandomForestClassifier(n_estimators=1000,random_state=0,n_jobs=-1,max_depth=70,bootstrap=True)),
    ("GradientBoostingClassifier", GradientBoostingClassifier(random_state=0)),
    # min_impurity_split and presort no longer exist in recent scikit-learn releases, so they are not passed.
    ("DecisionTreeClassifier", DecisionTreeClassifier(criterion='gini',max_depth=100,max_features=1.0,max_leaf_nodes=10,
                                                      min_samples_leaf=1,min_samples_split=2,min_weight_fraction_leaf=0.10,
                                                      random_state=27,splitter='best')),
    ("AdaBoostClassifier", AdaBoostClassifier()),
    ("LinearDiscriminantAnalysis", LinearDiscriminantAnalysis()),
    ("KNeighborsClassifier", KNeighborsClassifier(leaf_size=1,p=2,n_neighbors=20)),
    ("GaussianNB", GaussianNB()),
    # The remaining LogisticRegression arguments were library defaults.
    ("Logistic Regression", LogisticRegression(C=1.0,solver='liblinear',max_iter=100,tol=0.0001)),
]

for name, model in candidates:
    model.fit(X_train,y_train)
    y_pred = model.predict(X_test)
    # classification_report expects (y_true, y_pred).
    print(classification_report(y_test,y_pred))
    print(name + ":>",accuracy_score(y_test,y_pred))



# Magical Outliers Removing Technique 🚀
<br>
In this method the outliers in the original data (implausibly low "SkinThickness" and zero "Insulin" values) are replaced by class means, and the result is fitted directly to the models without any<br>
explicit feature engineering or sampling.
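
Filling values with per-class means uses the target label, which is not available for new patients. As a hedged alternative, the sketch below fills the gaps with the training-set median regardless of class. It is an assumption about how a label-free variant could look, not the method of the referenced notebooks; the frame `toy` is made up and only borrows two column names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up rows using two of the notebook's column names.
toy = pd.DataFrame({
    "SkinThickness": [35, 0, 29, 0, 23, 32, 0, 45, 30, 0],
    "Insulin":       [0, 94, 0, 168, 88, 0, 543, 0, 846, 175],
    "Outcome":       [1, 0, 1, 0, 0, 1, 1, 0, 1, 0],
})

# Treat zeros as missing values.
X_toy = toy.drop(columns="Outcome").replace(0, np.nan)
y_toy = toy["Outcome"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

# Learn the medians from the training rows only, then reuse them on the test rows.
imputer = SimpleImputer(strategy="median")
X_tr_filled = imputer.fit_transform(X_tr)
X_te_filled = imputer.transform(X_te)

print(imputer.statistics_)
```

Scores from this variant are expected to be lower than with class-wise means, since the filled values no longer carry information about the label.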

In [None]:

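data_ = data.copy()
# Replace SkinThickness readings below 5 with the integer mean of the same Outcome class.
data_.loc[(data_.SkinThickness<5)& (data_.Outcome==0), 'SkinThickness']=int(data_[(data_.Outcome==0)]['SkinThickness'].mean())
data_.loc[(data_.SkinThickness<5)& (data_.Outcome==1), 'SkinThickness']=int(data_[(data_.Outcome==1)]['SkinThickness'].mean())
# Replace zero Insulin readings with the integer mean of the same Outcome class.
# Note: the fill value depends on Outcome, so label information enters these two features.
data_.loc[(data_.Insulin==0)& (data_.Outcome==0), 'Insulin']=int(data_[(data_.Outcome==0)]['Insulin'].mean())
data_.loc[(data_.Insulin==0)& (data_.Outcome==1), 'Insulin']=int(data_[(data_.Outcome==1)]['Insulin'].mean())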

In [None]:
X = np.array(data_[["Pregnancies","BloodPressure","Glucose","SkinThickness","Insulin","BMI","DiabetesPedigreeFunction","Age"]])
Y = np.array(data_.Outcome)
X_train , X_test , y_train , y_test = train_test_split(X,Y,test_size=0.20,random_state=0)


# Base models for the voting ensemble (default settings).
models = []
models.append(("XGB",XGBClassifier()))
models.append(("RF",RandomForestClassifier()))
models.append(("DT",DecisionTreeClassifier()))
models.append(("ADB",AdaBoostClassifier()))
models.append(("GB",GradientBoostingClassifier()))

# Classifiers to compare, each paired with the label printed next to its accuracy.
candidates = [
    ("Voting Ensemble", VotingClassifier(estimators=models)),
    ("SVM", SVC(kernel="linear",class_weight="balanced",probability=True)),
    ("XGBoost", XGBClassifier(learning_rate=0.1,n_estimators=10000,max_depth=4,min_child_weight=6,gamma=0,subsample=0.6,colsample_bytree=0.8,
                              reg_alpha=0.005,objective='binary:logistic',nthread=2,scale_pos_weight=1,seed=27)),
    ("RandomForestClassifier", RandomForestClassifier(n_estimators=1000,random_state=0,n_jobs=-1,max_depth=70,bootstrap=True)),
    ("GradientBoostingClassifier", GradientBoostingClassifier(random_state=0)),
    # min_impurity_split and presort no longer exist in recent scikit-learn releases, so they are not passed.
    ("DecisionTreeClassifier", DecisionTreeClassifier(criterion='gini',max_depth=100,max_features=1.0,max_leaf_nodes=10,
                                                      min_samples_leaf=1,min_samples_split=2,min_weight_fraction_leaf=0.10,
                                                      random_state=27,splitter='best')),
    ("AdaBoostClassifier", AdaBoostClassifier()),
    ("LinearDiscriminantAnalysis", LinearDiscriminantAnalysis()),
    ("KNeighborsClassifier", KNeighborsClassifier(leaf_size=1,p=2,n_neighbors=20)),
    ("GaussianNB", GaussianNB()),
    # The remaining LogisticRegression arguments were library defaults.
    ("Logistic Regression", LogisticRegression(C=1.0,solver='liblinear',max_iter=100,tol=0.0001)),
]

for name, model in candidates:
    model.fit(X_train,y_train)
    y_pred = model.predict(X_test)
    # classification_report expects (y_true, y_pred).
    print(classification_report(y_test,y_pred))
    print(name + ":>",accuracy_score(y_test,y_pred))



# Results 💹📈
<br>
Benchmark : 75.97 <---> *Without any processing* <br>
XGBoost : 87.50 <---> *After Distribution Normalization + Up-Sampling + Feature Selection* <br>
XGBoost & Random Forest : 89.00 <---> *After Distribution Normalization + Up-Sampling + Feature Selection + Fine Tuning + Random State in Data Splitting*<br>
Gradient Boosting Classifier : 92.20 <---> *After removing outliers*
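
Because a single train/test split depends on `random_state`, a repeated cross-validation gives a steadier number to compare models with. The sketch below uses `RepeatedStratifiedKFold`, which is imported at the top of this notebook; the data from `make_classification` is a synthetic stand-in, so the printed score says nothing about the results listed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for the 8-feature diabetes table.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# 5 folds repeated 2 times gives 10 accuracy values instead of one.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
scores = cross_val_score(
    GradientBoostingClassifier(random_state=0),
    X_demo, y_demo, scoring="accuracy", cv=cv,
)

print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} folds")
```

Reporting the mean together with the spread makes it easier to tell a real improvement from split-to-split noise.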

In [None]:
np.mean(list(data.SkinThickness))

For outlier removal techniques, visit<br> 
https://www.kaggle.com/akhileshdkapse/starter-guide-eda-acc-87-precision-92/notebook#Removing-outliers-! <br>
And <br>
https://www.kaggle.com/abdulrahmanahajj/diabetes-acc-92-auc-0-914 