##  Diabetes Risk Stratification - Demographic with SIM

In [None]:
import pandas as pd               # pandas: high-performance data manipulation
import matplotlib.pyplot as plt   # matplotlib: plotting
import numpy as np                # numpy: array processing
import seaborn as sns             # seaborn: statistical visualization
import story_board as sb

# Import the data set 

In [None]:
definition ='''

## Introduction to Diabetes Analysis - from WebMD

### What Is Diabetes?

Diabetes is a group of diseases that involve problems with the hormone insulin. There is no cure for diabetes.

### What Are the Different Types of Diabetes?

There are three major types of diabetes: **type 1 diabetes**, **type 2 diabetes**, and **gestational diabetes**.

### What Is Diabetes Insipidus?

Diabetes insipidus causes you to have an almost unquenchable thirst and your body to make a lot of urine.

### What Is Gestagenic Diabetes Insipidus?

Gestational DI, or gestagenic diabetes insipidus, is a rare disorder that happens in pregnancy, usually in the third trimester.

'''

sb.start_story(definition)


In [None]:
definition ='''

# 🧠 Importing the PIMA Indians Dataset - Website KAGGLE.COM

The data comes from the National Institute of Diabetes and Digestive and Kidney Diseases, but it was heavily curated when it was published on Kaggle.

The dataset is intended for predicting whether a patient has diabetes based on the other variables in the dataset.

This data is very specific to a sub-population. **IT IS NOT** representative of a **real-world diabetes data set**; it is for learning ONLY.

## 🧠 The PIMA Indians Data Content 

The observations in the dataset are records of females over 21 years of age with a Pima Indian heritage and variables 
such as: blood pressure, insulin levels, body mass index, etc. 

## Attribute Normal Value Ranges
1. **Glucose**: Glucose (< 140) = Normal, Glucose (140-200) = Pre-Diabetic, Glucose (> 200) = Diabetic
2. **BloodPressure**: B.P (< 60) = Below Normal, B.P (60-80) = Normal, B.P (80-90) = Stage 1 Hypertension, B.P (90-120) = Stage 2 Hypertension, B.P (> 120) = Hypertensive Crisis
3. **SkinThickness**: SkinThickness (< 10) = Below Normal, SkinThickness (10-30) = Normal, SkinThickness (> 30) = Above Normal
4. **Insulin** : Insulin (< 200) = Normal, Insulin (> 200) = Above Normal
5. **Body Mass Index**: BMI (< 18.5) = Underweight, BMI (18.5-25) = Normal, BMI (25-30) = Overweight, BMI (> 30) = Obese

'''

sb.outmd(definition)
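
The value ranges listed in the cell above can be turned into categorical bands with `pd.cut`. The following is only a sketch: the `sample` frame and the `BMI_band` / `Glucose_band` column names are hypothetical, and treating each interval as left-closed is an assumption, because the ranges as written do not say which side a boundary value belongs to.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values; in the notebook these would come from
# dataset["BMI"] and dataset["Glucose"].
sample = pd.DataFrame({"BMI": [17.0, 22.4, 27.9, 35.1],
                       "Glucose": [95, 150, 210, 139]})

# Bin edges follow the ranges listed above. right=False makes each interval
# left-closed (e.g. 18.5 <= BMI < 25), which is an assumption.
sample["BMI_band"] = pd.cut(
    sample["BMI"],
    bins=[-np.inf, 18.5, 25, 30, np.inf],
    labels=["Underweight", "Normal", "Overweight", "Obese"],
    right=False,
)
sample["Glucose_band"] = pd.cut(
    sample["Glucose"],
    bins=[-np.inf, 140, 200, np.inf],
    labels=["Normal", "Pre-Diabetic", "Diabetic"],
    right=False,
)
print(sample)
```

Banding like this is mainly useful for descriptive tables and plots; the models later in the notebook use the raw numeric columns.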


In [None]:
# dataset = pd.read_csv("C:/Documents/datasets/diabetes.csv")
dataset = pd.read_csv("diabetes.csv")

# Examine the data set - Descriptive Statistics

In [None]:
dataset.describe()

In [None]:
dataset.head(5)   #provides top 5 rows

In [None]:
dataset.shape #print the shape of the matrix

In [None]:
dataset.isnull().values.any()  # determine if any of the dataset is null

# Establish a correlation Matrix 

In [None]:
correlation_matrix = dataset.corr()     # correlation matrix across all numeric fields
plt.figure(figsize=(10, 10))
# Annotated heatmap of the pairwise correlations
sns.heatmap(correlation_matrix, annot=True, cmap="RdYlGn")


In [None]:
dataset.corr()
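
A full correlation matrix can be hard to scan, so one option is to pull out the column for the target and rank features by absolute correlation. This sketch uses a synthetic `toy` frame with made-up column names (`strong_feature`, `noise_feature`); only the `Outcome` name is taken from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature tied to the label, one pure noise.
rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)
toy = pd.DataFrame({
    "strong_feature": signal + rng.normal(scale=0.1, size=n),
    "noise_feature": rng.normal(size=n),
    "Outcome": (signal > 0).astype(int),
})

# Correlation of every feature with the target, strongest first.
target_corr = (
    toy.corr()["Outcome"]
    .drop("Outcome")
    .abs()
    .sort_values(ascending=False)
)
print(target_corr)
```

Pearson correlation only captures linear association, so a low value here does not rule out a feature being useful to a tree-based model.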

In [None]:
diabetes_map = {True: 1, False: 0}
dataset['Diabetes'] = dataset['Outcome'].map(diabetes_map)
dataset.head(5)

In [None]:
diabetes_true_count = len(dataset.loc[dataset['Diabetes'] == True])
diabetes_false_count = len(dataset.loc[dataset['Diabetes'] == False])
(diabetes_true_count,diabetes_false_count)
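
The same class counts, plus their proportions, can be obtained in one step with `value_counts`. The labels below are hypothetical placeholders, not values from the dataset.

```python
import pandas as pd

# Hypothetical 0/1 labels standing in for dataset["Outcome"].
outcome = pd.Series([1, 0, 0, 1, 0, 0, 0, 1], name="Outcome")

counts = outcome.value_counts()                 # absolute counts per class
shares = outcome.value_counts(normalize=True)   # fraction of rows per class
print(counts)
print(shares)
```

Knowing the class proportions helps when reading accuracy later: a model that always predicts the majority class already scores the majority share.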

In [None]:
## Train Test Split
from sklearn.model_selection import train_test_split
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'SkinThickness']
predicted_class = ['Diabetes']

In [None]:
X = dataset[feature_columns].values
y = dataset[predicted_class].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=10)
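
When the classes are imbalanced, passing `stratify` to `train_test_split` keeps the class proportions similar in both splits. This is a sketch on synthetic arrays (`X_demo`, `y_demo` are made up), not a change to the split used above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 35% positive labels.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 35 + [0] * 65)

# stratify=y_demo asks for roughly the same positive rate in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=10, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```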

In [None]:
print("total number of rows : {0}".format(len(dataset)))
print("number of rows missing Glucose: {0}".format(len(dataset.loc[dataset['Glucose'] == 0])))
print("number of rows missing BloodPressure: {0}".format(len(dataset.loc[dataset['BloodPressure'] == 0])))
print("number of rows missing Insulin: {0}".format(len(dataset.loc[dataset['Insulin'] == 0])))
print("number of rows missing BMI: {0}".format(len(dataset.loc[dataset['BMI'] == 0])))
print("number of rows missing DiabetesPedigreeFunction: {0}".format(len(dataset.loc[dataset['DiabetesPedigreeFunction'] == 0])))
print("number of rows missing Age: {0}".format(len(dataset.loc[dataset['Age'] == 0])))
print("number of rows missing SkinThickness: {0}".format(len(dataset.loc[dataset['SkinThickness'] == 0])))
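
Because the missing values are stored as zeros, `isnull()` cannot see them. One way to make them visible is to replace zeros with `NaN` in the columns where zero is not a plausible measurement. The `frame` below is hypothetical, and the choice of which columns to treat this way is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using column names from the notebook's dataset.
frame = pd.DataFrame({
    "Pregnancies": [0, 2, 5],
    "Glucose": [148, 0, 183],
    "BMI": [33.6, 26.6, 0.0],
})

# Columns where zero is assumed to mean "not recorded".
# Pregnancies is left out because zero is a valid count.
zero_means_missing = ["Glucose", "BMI"]

cleaned = frame.copy()
cleaned[zero_means_missing] = cleaned[zero_means_missing].replace(0, np.nan)

missing_per_column = cleaned.isnull().sum()
print(missing_per_column)
```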

In [None]:
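from sklearn.impute import SimpleImputer
# Treat zeros as missing and replace them with the column mean
fill_values = SimpleImputer(missing_values=0, strategy="mean")

X_train = fill_values.fit_transform(X_train)   # learn the means from the training split only
X_test = fill_values.transform(X_test)         # reuse the training means to avoid test-set leakage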
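
An imputer configured with `missing_values=0` also rewrites legitimate zeros, such as a Pregnancies count of zero. One way to limit imputation to selected columns, while fitting only on the training rows, is a `ColumnTransformer`. The arrays and the step name `zero_to_mean` below are hypothetical.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical arrays: column 0 = Pregnancies, column 1 = Glucose.
train = np.array([[0, 100], [2, 0], [4, 140]], dtype=float)
test = np.array([[0, 0], [1, 120]], dtype=float)

# Impute zeros only in column 1; pass column 0 through untouched.
imputer = ColumnTransformer(
    transformers=[
        ("zero_to_mean", SimpleImputer(missing_values=0, strategy="mean"), [1]),
    ],
    remainder="passthrough",
)

train_filled = imputer.fit_transform(train)   # mean learned from training rows
test_filled = imputer.transform(test)         # same mean applied to test rows

# Note: the output puts the imputed column first, then the passthrough column.
print(test_filled)
```

The column reordering matters if later code indexes features by position, so it may be simpler to work with a DataFrame and column names in that case.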

In [None]:
## Apply Algorithm

from sklearn.ensemble import RandomForestClassifier
random_forest_model = RandomForestClassifier(random_state=10)

random_forest_model.fit(X_train, y_train.ravel())

In [None]:
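# Model representation echoed by an older scikit-learn release. Recent releases no longer
# accept min_impurity_split or max_features='auto', so this cell may fail on a current install.
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=10, verbose=0, warm_start=False)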

In [None]:
predict_test_data = random_forest_model.predict(X_test)   # predictions on the held-out test split

from sklearn import metrics

print("Accuracy = {0:.3f}".format(metrics.accuracy_score(y_test, predict_test_data)))
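
Accuracy alone hides which kind of error the model makes, which matters for a screening task. A confusion matrix with precision and recall gives a fuller picture. The labels below are hypothetical and are not output from the model above.

```python
from sklearn import metrics

# Hypothetical true labels and predictions (1 = diabetic, 0 = not diabetic).
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

# For binary labels, ravel() returns tn, fp, fn, tp in that order.
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

accuracy = metrics.accuracy_score(y_true, y_pred)
precision = metrics.precision_score(y_true, y_pred)   # tp / (tp + fp)
recall = metrics.recall_score(y_true, y_pred)         # tp / (tp + fn)

print(tn, fp, fn, tp)
print(accuracy, precision, recall)
```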

In [None]:
## Hyper Parameter Optimization

# Search space for the XGBoost classifier tuned below
params = {
    "learning_rate":    [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth":        [3, 4, 5, 6, 8, 10, 12, 15],
    "min_child_weight": [1, 3, 5, 7],
    "gamma":            [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.7],
}
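
To get a feel for how large this search space is, the number of distinct combinations is the product of the list lengths. The sketch below repeats the grid so it runs on its own; `n_iter = 5` mirrors the value passed to `RandomizedSearchCV` later in the notebook.

```python
import math

# Same grid as above, repeated so the snippet is self-contained.
params = {
    "learning_rate":    [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth":        [3, 4, 5, 6, 8, 10, 12, 15],
    "min_child_weight": [1, 3, 5, 7],
    "gamma":            [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.7],
}

# Total number of distinct parameter combinations in the grid.
n_combinations = math.prod(len(values) for values in params.values())

n_iter = 5  # candidates sampled by the randomized search
coverage = n_iter / n_combinations

print(n_combinations)        # 6 * 8 * 4 * 5 * 4 = 3840
print(f"{coverage:.4%}")
```

With so few candidates sampled, the result of the search can vary noticeably from run to run, so a larger `n_iter` or a fixed `random_state` may be worth considering.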

In [1]:
import story_board as sb
definition ='''
## Explanation of Each Model 

1. **Logistic Regression**: A linear model used for binary classification that estimates the probability of a sample belonging to a particular class.

2. **Decision Tree**: A tree-like model that splits the data into subsets based on the value of input features, making decisions based on feature values to classify instances.

3. **K-Nearest Neighbor (KNN)**: A non-parametric method used for classification by finding the 'k' nearest data points in the feature space and assigning the most common class among them to the query point.

4. **Gaussian Naive Bayes**: A probabilistic classifier based on Bayes' theorem that assumes features are independent and normally distributed within each class, which suits continuous features.

5. **Multinomial Naive Bayes**: Similar to Gaussian Naive Bayes but specifically designed for classification tasks with discrete features, such as word counts in text classification.

6. **Support Vector Classifier (SVC)**: A supervised learning algorithm that finds the hyperplane that best separates classes in a high-dimensional space, often used for binary classification.

7. **Random Forest**: An ensemble learning method that constructs multiple decision trees during training and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

8. **XGBoost**: An optimized gradient boosting library that implements machine learning algorithms under the Gradient Boosting framework, known for its speed and performance in handling large datasets.

9. **Multi-layer Perceptron (MLP)**: A type of artificial neural network composed of multiple layers of nodes (neurons) that can learn non-linear relationships between input and output data.

10. **Gradient Boosting Classifier**: A machine learning technique that builds an ensemble of weak learners (typically decision trees) in a sequential manner, with each tree correcting the errors of its predecessors, resulting in a strong predictive model.

'''

sb.outmd(definition)
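
Several of the models described above can be compared under the same cross-validation scheme. This is a rough sketch on synthetic data from `make_classification`, not on the Pima dataset; the model settings are illustrative defaults, and XGBoost and MLP are left out to keep the example light.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem with 8 features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=10)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=10),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Gaussian Naive Bayes": GaussianNB(),
    "SVC": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=10),
    "Gradient Boosting": GradientBoostingClassifier(random_state=10),
}

mean_scores = {}
for name, model in models.items():
    # Scaling is wrapped in a pipeline so it is refit inside every fold.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
    mean_scores[name] = scores.mean()

for name, score in sorted(mean_scores.items(), key=lambda item: -item[1]):
    print(f"{name:22s} {score:.3f}")
```

Scaling has no effect on the tree-based models but helps the distance-based and linear ones, so wrapping every model the same way keeps the comparison simple.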



In [None]:
## Hyperparameter optimization using RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
import xgboost

In [None]:
classifier=xgboost.XGBClassifier()


In [None]:
random_search = RandomizedSearchCV(classifier, param_distributions=params, n_iter=5,
                                   scoring='roc_auc', n_jobs=-1, cv=5, verbose=3)

In [None]:
from datetime import datetime

def timer(start_time=None):
    """Return the current time when called without an argument; otherwise print the time elapsed since start_time."""
    if not start_time:
        return datetime.now()
    thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
    tmin, tsec = divmod(temp_sec, 60)
    print('\n Time taken: %i hours %i minutes and %s seconds.' % (thour, tmin, round(tsec, 2)))
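
An alternative way to time a block is a small context manager built on `time.perf_counter`, which is intended for measuring elapsed intervals. The `stopwatch` helper below is hypothetical and is not part of the notebook.

```python
import time
from contextlib import contextmanager

@contextmanager
def stopwatch(label="block"):
    """Time the enclosed block and store the elapsed seconds in a dict."""
    result = {}
    start = time.perf_counter()
    try:
        yield result
    finally:
        # Runs even if the block raises, so the timing is always recorded.
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.2f} seconds")

# Usage: wrap any block of work, for example a model fit.
with stopwatch("demo") as timing:
    total = sum(range(100_000))
```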

In [None]:
from datetime import datetime
# Here we go
start_time = timer(None) # timing starts from this point for "start_time" variable
random_search.fit(X,y.ravel())
timer(start_time) # timing ends here for "start_time" variable

In [None]:
random_search.best_estimator_
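
Besides `best_estimator_`, a fitted search exposes `best_params_` and `best_score_`, which are often easier to read. The sketch below uses scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier` so it runs without xgboost installed, on synthetic data and with a deliberately small, made-up grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data in place of the notebook's X and y.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=10)

search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=10),
    param_distributions={
        "learning_rate": [0.05, 0.10, 0.20],
        "max_depth": [3, 4, 5],
    },
    n_iter=3,
    scoring="roc_auc",
    cv=3,
    random_state=10,   # fixes which candidates are sampled
)
search.fit(X_demo, y_demo)

print(search.best_params_)            # the winning parameter combination
print(round(search.best_score_, 3))   # mean cross-validated ROC AUC for it
```

Passing `search.best_params_` back into the estimator constructor avoids retyping the chosen values by hand.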

In [None]:
classifier=xgboost.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.3, gamma=0.0, learning_rate=0.25,
       max_delta_step=0, max_depth=3, min_child_weight=7, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
       subsample=1)

In [None]:
from sklearn.model_selection import cross_val_score
score=cross_val_score(classifier,X,y.ravel(),cv=10)

In [None]:
score
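
The array of per-fold scores is usually summarized as a mean and a spread. The values below are placeholders chosen for illustration, not results from this notebook.

```python
import numpy as np

# Placeholder per-fold scores; in the notebook this would be `score`.
fold_scores = np.array([0.70, 0.75, 0.80, 0.75])

mean_score = fold_scores.mean()
spread = fold_scores.std(ddof=1)   # sample standard deviation across folds

print(f"accuracy: {mean_score:.3f} +/- {spread:.3f}")
```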