# Using Machine Learning to Predict Diabetes

Every dataset has a story. In this dataset, I want to figure out the effect of the columns Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age on the Outcome column. Feature importance will be used to see which columns carry more weight in predicting the Outcome column. This project will use machine learning to predict whether a person has diabetes.

There are six parts to this project, listed in the Project Key. Each part covers one of the steps taken to complete it.

The higher the score, the better the model. The data comes from Kaggle, a well-known machine learning website and data repository. For more information on the dataset, follow the link below.


## [Diabetes Dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database)

# Project Key

1. Preparing the Dataset
2. Running Models WITHOUT Neural Networks*
3. Running Models WITH Neural Networks*
4. Predicting
5. Graphs  
6. Conclusion

*Running Models WITH Neural Networks is separated from Running Models WITHOUT Neural Networks because neural networks are more complicated.

# 1. Preparing the Dataset

In [None]:
#Load the dataset
import pandas as pd
diabetes_data = 'https://raw.githubusercontent.com/Crazy-Coding-Physicist/Reviews/main/diabetes%5B1%5D.csv'
df = pd.read_csv(diabetes_data)
df.head()

To clear things up, the 'DiabetesPedigreeFunction' column is a score that reflects family history of diabetes. The 'BMI' column stands for body mass index. The 'Outcome' column has only two values, 0 or 1, and describes whether a person has diabetes. Both accuracy and precision will be used to score the models.
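
As a quick illustration of how the two scores differ, here is a minimal sketch that computes both by hand. The labels are made-up values, not rows from the diabetes dataset, and the helper names `accuracy` and `precision` are hypothetical.

```python
# Toy labels (hypothetical values, not taken from the diabetes dataset)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0]


def accuracy(y_true, y_pred):
    """Fraction of all predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision(y_true, y_pred):
    """Of the rows predicted as 1, the fraction that are truly 1."""
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    pred_pos = sum(p == 1 for p in y_pred)
    return true_pos / pred_pos if pred_pos else 0.0


acc = accuracy(y_true, y_pred)
prec = precision(y_true, y_pred)
print(f"Accuracy: {acc:.3f}")
print(f"Precision: {prec:.3f}")
```

Accuracy counts every correct prediction, while precision only looks at the rows the model flagged as 1, so the two can disagree on the same predictions.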

In [None]:
#Size of dataset
df.shape

In [None]:
#Checking for null values
df.isna().sum()
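
`isna()` only flags values that are actually stored as NaN. If a dataset records a missing measurement as 0 instead (an assumption worth checking here for columns such as Insulin), counting zeros is a useful complement. A minimal sketch on a toy frame with made-up values; only the column names mirror the dataset:

```python
import pandas as pd

# Toy rows with hypothetical values; only the column names mirror the dataset
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "Insulin": [0, 0, 94, 168],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# NaN-based check: reports nothing, because no value is stored as NaN
null_counts = toy.isna().sum()

# Zero-based check: counts entries equal to 0 in each column
zero_counts = (toy == 0).sum()

print(null_counts)
print(zero_counts)
```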

In [None]:
#Checking for type of dataset
df.info()

In [None]:
#@title Feature Importance
# random forest feature importance on the diabetes classification problem
from sklearn.ensemble import RandomForestClassifier
from matplotlib import pyplot
# define features (every column except Outcome) and target (Outcome)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
# define the model
model = RandomForestClassifier()
# fit the model
model.fit(X, y)
# get importance
importance = model.feature_importances_
# summarize feature importance by column name
for name, score in zip(X.columns, importance):
    print('Feature: %s, Score: %.5f' % (name, score))
# plot feature importance
pyplot.bar(X.columns, importance)
pyplot.xticks(rotation=45)
pyplot.show()

Feature importance shows which columns carry more weight when the model predicts the Outcome column.
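
To make the idea concrete, here is a minimal sketch that ranks features by importance. It uses synthetic data from `make_classification`, and the column names (`feat_a` to `feat_d`) are hypothetical placeholders rather than the diabetes columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the column names are hypothetical placeholders
X_arr, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=["feat_a", "feat_b", "feat_c", "feat_d"])

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Pair each importance with its column name and sort from most to least important
ranked = pd.Series(forest.feature_importances_, index=X_demo.columns)
ranked = ranked.sort_values(ascending=False)
print(ranked)
```

The importances of a random forest add up to 1, so each value can be read as a share of the total weight.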

In [None]:
#Selecting columns
y = df.iloc[:, -1]
X = df.iloc[:, :-1]
X.head()

In [None]:
#Checking to see if 'y' was selected
y.head()

In [None]:
#Checking the different values in the Outcome column
df['Outcome'].value_counts()

In [None]:
y.describe()

In [None]:
#Selecting Model
from sklearn.model_selection import cross_val_score as CVS
from sklearn.linear_model import LogisticRegression as LoRe
from xgboost import XGBClassifier as XGBC
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.neighbors import KNeighborsClassifier as KNC 
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier as MLPC
from sklearn.model_selection import KFold as KF
from keras.models import Sequential
from keras.layers import Dense, Dropout
import tensorflow as tf

In [None]:
#Using train_test_split for Prediction(5) https://colab.research.google.com/drive/13JgTPzQ7kWb8vCsJ2pOmDGJGpYHBWccS#scrollTo=1K1KHAUMM2ud
from sklearn.model_selection import train_test_split as TTS
X_train, X_test, y_train, y_test = TTS(X, y)
model = RFC()
model.fit(X_train, y_train)
#For classification, import accuracy_score
from sklearn.metrics import accuracy_score

In [None]:
#Creating function for classifying using accuracy (the default scoring for classifiers)
def classifier(model):
  scores = CVS(model, X, y)
  print(f'Scores: {scores}')
  print(f'Mean score: {scores.mean()}')

In [None]:
#Creating function for classifying using precision
def classifier_precision(model):
  scores = CVS(model, X, y, scoring='precision')
  print(f'Scores: {scores}')
  print(f'Mean score: {scores.mean()}')

In [None]:
kf = KF(shuffle=True)
y_pred = model.predict(X_test)

# 2. Running Models WITHOUT Neural Networks

In [None]:
#Creating a Classification Report
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, y_pred))

### Highest Score Without Finetuning: LoRe 0.77

In [None]:
#Getting a simple accuracy score
accuracy_score(y_test, y_pred)

Accuracy will still be used as the score, but now with different models, such as LogisticRegression and XGBoost.
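
The comparison described above can be sketched as a loop over candidate models. This is an illustration on synthetic data from `make_classification`, not the diabetes columns, and it leaves out XGBoost to stay within scikit-learn. The names `candidates` and `mean_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix and the 0/1 target
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "KNeighbors": KNeighborsClassifier(),
}

# Mean 5-fold cross-validated accuracy for each candidate
mean_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name}: {mean_scores[name]:.3f}")

best_name = max(mean_scores, key=mean_scores.get)
print(f"Best by mean accuracy: {best_name}")
```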

In [None]:
classifier(LoRe(max_iter=1000))

In [None]:
classifier(XGBC())

In [None]:
classifier(DTC())

In [None]:
classifier(RFC())

In [None]:
classifier(KNC())

Highest Precision Score: 0.719

In [None]:
classifier_precision(LoRe(max_iter=10000))

In [None]:
classifier_precision(XGBC())

### Highest Score With Finetuning: 0.7553

In [None]:
"""
params = {'max_depth': [2, 3, 5, 6],
          'min_samples_split': [5, 6, 8], 
          'max_features': [0.1, 0.075, 0.5, 0.0025],
          'min_impurity_decrease': [0.0025, 0.005, 0.0075, 0.01],
          'min_samples_leaf': [3, 4, 5]}
"""
import numpy as np

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 2000, num = 25)]
# Number of features to consider at every split ('auto' was removed in newer scikit-learn releases)
max_features = ['sqrt', 'log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the parameter grid
params = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

clf = RFC()

grid_clf = GridSearchCV(clf, params, cv=5)

grid_clf.fit(X_train, y_train)

best_params = grid_clf.best_params_

best_score = grid_clf.best_score_

print(f'Best params: {best_params}')
print(f'Best score: {best_score}')

In [None]:
from sklearn.model_selection import GridSearchCV

params = {'max_depth': [2, 3, 5, None],
          'min_samples_split': [2, 4, 6, 8], 
          'max_features': [None, 0.1],
          'min_impurity_decrease': [0.0, 0.0025, 0.005, 0.0075, 0.01],
          'min_samples_leaf': [1, 2, 3, 4, 5]}


clf = DTC()

grid_clf = GridSearchCV(clf, params, cv=5)

grid_clf.fit(X_train, y_train)

best_params = grid_clf.best_params_

best_score = grid_clf.best_score_

print(f'Best params: {best_params}')
print(f'Best score: {best_score}')

In [None]:
#This will make it easier to do GridSearchCV; the score it prints is an RMSE, so lower is better
def grid_search(params, model=XGBC(objective='reg:squarederror')):
  
  # Initialize model
  grid = GridSearchCV(model, params, scoring='neg_mean_squared_error')
  
  # Fit model on data
  grid.fit(X, y)
  
  # Extract best params
  best_params = grid.best_params_

  # Print best params
  print("Best params:", best_params)

  # Compute best score
  best_score = grid.best_score_
  
  # Turn score into RMSE (root mean squared error)
  best_score = (-best_score.mean())**0.5

  # Print best score
  print("Best score: {:.5f}".format(best_score))
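
The `grid_search` helper above scores with `neg_mean_squared_error` and prints its square root, so its output is an RMSE rather than an accuracy. Assuming the classifier's `predict` returns hard 0/1 labels, the two are related: every wrong prediction contributes a squared error of 1, so the MSE equals 1 minus the accuracy. A small sketch with made-up labels:

```python
import math

# Made-up labels for illustration (not from the diabetes dataset)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

n = len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
rmse = math.sqrt(mse)

print(f"Accuracy: {accuracy}")
print(f"MSE: {mse}")
print(f"RMSE: {rmse}")
```

Here an accuracy of 0.75 corresponds to an RMSE of 0.5, and a lower RMSE means a better model.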

In [None]:
params = {'n_estimators':[750]}
grid_search(params)

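Surprisingly, this score is higher than the scores from the other GridSearch tests. Keep in mind that `grid_search` reports an RMSE, where lower is better, so it is not directly comparable to the accuracy scores above. Let's do another test.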

In [None]:
params = {'n_estimators':[750]}
grid_search(params)

In [None]:
params = {'n_estimators':[50]}
grid_search(params)

In [None]:
params = {'n_estimators':[800]}
grid_search(params)

In [None]:
params = {'n_estimators':[1000]}
grid_search(params)

In [None]:
params = {'n_estimators':[850]}
grid_search(params)

In [None]:
params = {'learning_rate':[0.05, 0.1, 0.15], 'n_estimators':[750]}
grid_search(params)

In [None]:
params = {'learning_rate':[0.05, 0.0075, 0.15, 0.45], 
          'colsample_bylevel': [0.5, 0.75, 1],
          'colsample_bytree': [0.5, 0.75, 1],
          'colsample_bynode': [0.5, 0.75, 1],
          'n_estimators':[750]}
grid_search(params)

In [None]:
params = {'learning_rate':[0.05, 0.0075, 0.15, 0.45],
          'colsample_bylevel': [0.5, 0.75, 1],
          'colsample_bytree': [0.5, 0.75, 1],
          'colsample_bynode': [0.5, 0.75, 1],
          'n_estimators':[750]}
grid_search(params)

# 3. Running Models WITH Neural Networks

### Highest Score Without Finetuning: 0.699

In [None]:
scores = CVS(MLPC(), X, y, cv=kf)
score = scores.mean()
print(score)

In [None]:
scores = CVS(MLPC(), X, y, cv=kf)
score = scores.mean()
print(score)

In [None]:
scores = CVS(MLPC(), X, y, cv=kf)
score = scores.mean()
print(score)

There's not much you can do with MLPC without finetuning...

### Highest Score With Finetuning: 0.69
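
One way to fine-tune an MLPClassifier is the same GridSearchCV approach used earlier for the tree models. The following is a minimal sketch on synthetic data; the grid values are illustrative assumptions, not settings taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the feature matrix and the 0/1 target
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Illustrative grid: layer layout and L2 penalty strength (assumed values)
param_grid = {
    "hidden_layer_sizes": [(8,), (16, 8)],
    "alpha": [1e-4, 1e-2],
}

search = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(f"Best params: {search.best_params_}")
print(f"Best score: {search.best_score_:.3f}")
```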



In [None]:
scores = CVS(MLPC(max_iter=5000), X, y, cv=kf)
score = scores.mean()
print(score)

In [None]:
scores = CVS(MLPC(max_iter=1000,), X, y, cv=kf)
score = scores.mean()
print(score)

In [None]:
scores = CVS(MLPC(max_iter=500,), X, y, cv=kf)
score = scores.mean()
print(score)

In [None]:
scores = CVS(MLPC(max_iter=250,), X, y, cv=kf)
score = scores.mean()
print(score)

In [None]:
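scores = CVS(MLPC(hidden_layer_sizes=(100,), max_iter=10000, random_state=2), X, y, cv=kf)
score = scores.mean()
print(score)

In [None]:
#Cross-validate with CVS; XGBC is a classifier, not a cross-validation helper
scores = CVS(MLPC(hidden_layer_sizes=(100,), max_iter=10000, random_state=2), X, y, cv=kf)
score = scores.mean()
print(score)

In [None]:
#Same model with the default cross-validation splitter
scores = CVS(MLPC(hidden_layer_sizes=(100,), max_iter=10000, random_state=2), X, y)
score = scores.mean()
print(score)

In [None]:
scores = CVS(MLPC(hidden_layer_sizes=(100,), max_iter=10000, random_state=2), X, y, cv=kf)
score = scores.mean()
print(score)

In [None]:
scores = CVS(MLPC(hidden_layer_sizes=(100,), max_iter=10000, random_state=2, solver='adam', verbose=True), X, y, cv=kf)
score = scores.mean()
print(score)

In [None]:
#MLPClassifier accepts only 'lbfgs', 'sgd', or 'adam'; 'ftrl' is not a valid solver, so 'sgd' is used here instead
scores = CVS(MLPC(hidden_layer_sizes=(100,), max_iter=10000, random_state=2, solver='sgd', verbose=True), X, y, cv=kf)
score = scores.mean()
print(score)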

Deep learning with TensorFlow and Keras.
High Score: 0.68

In [None]:
num_cols = X.shape[1]
model = Sequential()
model.add(Dense(2, input_shape=(num_cols,), activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
#print(model.summary())
model.compile(optimizer='ftrl', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=250)
model.evaluate(X_test, y_test)

In [None]:
num_cols = X.shape[1]
model = Sequential()
model.add(Dense(1, input_shape=(num_cols,), activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))
#print(model.summary())
model.compile(optimizer='Ftrl', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=250)
model.evaluate(X_test, y_test)

Notice how small the neural nets are? There are only a few layers, with just one or two neurons per layer. Sometimes, when there is only a small amount of data, it's best to have only a small number of neurons.
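
To put a number on "small", the trainable parameters of a stack of Dense layers can be counted by hand: each layer has (inputs + 1) × units weights, where the + 1 is the bias. A minimal sketch, assuming 8 input features (the eight predictor columns); `dense_param_count` is a hypothetical helper, not a Keras function. Dropout layers add no parameters, so they are left out.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_sizes:
        # One weight per input plus one bias, for every unit in the layer
        total += (fan_in + 1) * units
        fan_in = units
    return total


# Four layers with one neuron each: 9 + 2 + 2 + 2
tiny = dense_param_count(8, [1, 1, 1, 1])

# Layers of 8, 4, and 1 neurons: 72 + 36 + 5
wider = dense_param_count(8, [8, 4, 1])

print(f"One neuron per layer: {tiny} parameters")
print(f"8-4-1 network: {wider} parameters")
```

Even the wider 8-4-1 layout has only a little over a hundred parameters, which is still tiny by neural network standards.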

In [None]:
num_cols = X.shape[1]
model = Sequential()
model.add(Dense(8, input_shape=(num_cols,), activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(4, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
#print(model.summary())
model.compile(optimizer='Ftrl', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=1000)
model.evaluate(X_test, y_test)

In [None]:
num_cols = X.shape[1]
model = Sequential()
model.add(Dense(8, input_shape=(num_cols,), activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(4, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
#print(model.summary())
model.compile(optimizer='Ftrl', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10000)
model.evaluate(X_test, y_test)

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import EarlyStopping
# MaxNorm is the class name; the lowercase maxnorm alias is missing in newer Keras releases
from keras.constraints import MaxNorm

num_cols = X.shape[1]

# Initialize model
model = Sequential()

# Adding initial hidden layer
model.add(Dense(8, input_shape = (num_cols,), activation='relu', kernel_constraint=MaxNorm(3)))

# Dropout prevents overfitting
model.add(Dropout(0.1))

# Add more hidden layers
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Define early stopping monitor
early_stopping_monitor = EarlyStopping(patience=100)

# Fit the model on data
model.fit(X_train, y_train, epochs=2000, batch_size=20, validation_split=0.2, callbacks=[early_stopping_monitor])

# Score the model
model.evaluate(X_test, y_test)


In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import EarlyStopping
# MaxNorm is the class name; the lowercase maxnorm alias is missing in newer Keras releases
from keras.constraints import MaxNorm

num_cols = X.shape[1]

# Initialize model
model = Sequential()

# Adding initial hidden layer
model.add(Dense(8, input_shape = (num_cols,), activation='relu', kernel_constraint=MaxNorm(3)))

# Dropout prevents overfitting
model.add(Dropout(0.1))

# Add more hidden layers
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Define early stopping monitor
early_stopping_monitor = EarlyStopping(patience=100)

# Fit the model on data
model.fit(X_train, y_train, epochs=10000, batch_size=20, validation_split=0.2, callbacks=[early_stopping_monitor])

# Score the model
model.evaluate(X_test, y_test)


### Highest *precision* score: 0.60



In [None]:
num_cols = X.shape[1]
model1 = Sequential()
model1.add(Dense(8, input_shape=(num_cols,), activation='relu'))
model1.add(Dropout(0.2))
model1.add(Dense(4, activation='relu'))
model1.add(Dense(1, activation='sigmoid'))
#print(model.summary())
model1.compile(optimizer='adam', loss='binary_crossentropy', metrics=[tf.keras.metrics.Precision()])
model1.fit(X_train, y_train, epochs=250)
model1.evaluate(X_test, y_test)

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import EarlyStopping
# MaxNorm is the class name; the lowercase maxnorm alias is missing in newer Keras releases
from keras.constraints import MaxNorm

num_cols = X.shape[1]

# Initialize model
model = Sequential()

# Adding initial hidden layer
model.add(Dense(4, input_shape = (num_cols,), activation='relu', kernel_constraint=MaxNorm(3)))

# Dropout prevents overfitting
model.add(Dropout(0.1))

# Add more hidden layers
model.add(Dense(2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=[tf.keras.metrics.Precision()])

# Define early stopping monitor
early_stopping_monitor = EarlyStopping(patience=100)

# Fit the model on data
model.fit(X_train, y_train, epochs=2000, batch_size=20, validation_split=0.2, callbacks=[early_stopping_monitor])

# Score the model
model.evaluate(X_test, y_test)


In [None]:
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping

num_cols = X.shape[1]

# Initialize model
model = Sequential()

# Adding hidden layers
model.add(Dense(1, input_shape = (num_cols,), activation='relu'))
model.add(Dense(1, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=[tf.keras.metrics.Precision()])

# Define early stopping monitor
early_stopping_monitor = EarlyStopping(patience=100)

# Fit the model on data
model.fit(X_train, y_train, epochs=2000, batch_size=20, validation_split=0.2, callbacks=[early_stopping_monitor])

# Score the model
model.evaluate(X_test, y_test)

# 4. Predicting Diabetes

In [None]:
#To jog memory about dataset; Correctly pick values
df.head()

In [None]:
#Testing out first row
model.predict([[6, 148, 72, 35, 0, 33.6, 0.627, 50]])

In [None]:
#Inputting random values
model.predict([[5, 148, 35, 23, 0, 10.6, 0.167, 21]])

In [None]:
X_test.head()

In [None]:
#Predicting X_test
model.predict(X_test) 

In [None]:
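#predict_proba is a scikit-learn classifier method; a Keras Sequential model's predict() already returns the sigmoid output
model.predict_proba(X_test)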

In [None]:
model.predict_proba([[6, 148, 72, 35, 0, 33.6, 0.627, 50]])

# 5. Graphs

In [None]:
# Get an idea of what might go in the graph
df.describe()

In [None]:
#@title Columns on Columns
from pandas.plotting import scatter_matrix
p=scatter_matrix(df,figsize=(25, 25))

In [None]:
import seaborn as sns
sns.histplot(data=df, x="Outcome")

In [None]:
#Age Range
import seaborn as sns
sns.lineplot(data=df["Age"])

In [None]:
#Age on BMI
sns.lineplot(
    data=df,
    x="Age", y="BMI"
)

In [None]:
#Age on BloodPressure
sns.lineplot(
    data=df,
    x="Age", y="BloodPressure"
)

In [None]:
#Age on DiabetesPedigreeFunction
sns.lineplot(
    data=df,
    x="Age", y="DiabetesPedigreeFunction"
)

In [None]:
import matplotlib.pyplot as plt

side_length = 9
data = df
# Size the figure, then generate the heatmap
plt.figure(figsize=(side_length, side_length))
sns.heatmap(data)
plt.show()

In [None]:
import seaborn as sns

# Large bandwidth (fill and bw_method replace the deprecated shade and bw arguments)
sns.kdeplot(df['BloodPressure'], fill=True, bw_method=.5, color="olive")

# Narrower bandwidth
sns.kdeplot(df['BloodPressure'], fill=True, bw_method=.05, color="olive")


In [None]:
#BloodPressure on Outcome
import seaborn as sns

# Customize the inside plot; options are: "scatter" | "reg" | "resid" | "kde" | "hex"
sns.jointplot(x=df["BloodPressure"], y=df["Outcome"], kind='scatter')
sns.jointplot(x=df["BloodPressure"], y=df["Outcome"], kind='hex')
sns.jointplot(x=df["BloodPressure"], y=df["Outcome"], kind='kde')
sns.jointplot(x=df["BloodPressure"], y=df["Outcome"], kind='scatter', s=200, color='m', edgecolor="skyblue", linewidth=2)

# Customize the color
sns.set(style="white", color_codes=True)
sns.jointplot(x=df["BloodPressure"], y=df["Outcome"], kind='kde', color="skyblue")


# 6. Conclusion

To conclude, every dataset tells a story. Taking raw data and using it to predict the future is jaw-dropping. This project used advanced classification techniques, including XGBoost (Extreme Gradient Boosting), MLPClassifier, and neural networks built with TensorFlow and Keras.

-Key Takeaway-
* The factor that most affects the outcome is blood pressure.


If you use this notebook, please reference it.
