# Diabetes Prediction with CART Algorithm
 * **Diabetes**, is a group of metabolic disorders in which there are high blood sugar levels over a prolonged period. Symptoms of high blood sugar include frequent urination, increased thirst, and increased hunger. If left untreated, diabetes can cause many complications. Acute complications can include diabetic ketoacidosis, hyperosmolar hyperglycemic state, or death. Serious long-term complications include cardiovascular disease, stroke, chronic kidney disease, foot ulcers, and damage to the eyes.

# Data Set and Story
 * This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

 * Pregnancies: Number of times pregnant
 * Glucose: Glucose
 * BloodPressure: Blood pressure 
 * SkinThickness: Triceps skin fold thickness
 * Insulin: Insulin
 * BMI: Body mass index 
 * DiabetesPedigreeFunction: Diabetes pedigree function
 * Age: Age (years)
 * Outcome: The knowledge of whether there is diabetes (this is our target) 

In [1]:
!pip install skompiler
import numpy as np # linear algebra
import pandas as pd

import warnings
import pandas as pd
import numpy as np
from skompiler import skompile
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier, export_graphviz, export_text
from sklearn.model_selection import train_test_split
from sklearn.metrics import *
from sklearn.model_selection import *
from sklearn.exceptions import ConvergenceWarning
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter("ignore", category=ConvergenceWarning)

Collecting skompiler
[?25l  Downloading https://files.pythonhosted.org/packages/77/31/cbc7fc4fe064425efbf2044b94bd1973472d8aac85b4d9346c7330e03dfb/SKompiler-0.5.5.tar.gz (48kB)
[K     |██████▊                         | 10kB 15.9MB/s eta 0:00:01[K     |█████████████▌                  | 20kB 14.1MB/s eta 0:00:01[K     |████████████████████▎           | 30kB 11.6MB/s eta 0:00:01[K     |███████████████████████████     | 40kB 9.9MB/s eta 0:00:01[K     |████████████████████████████████| 51kB 3.5MB/s 
Building wheels for collected packages: skompiler
  Building wheel for skompiler (setup.py) ... [?25l[?25hdone
  Created wheel for skompiler: filename=SKompiler-0.5.5-cp37-none-any.whl size=57149 sha256=f5532b46f8703071c05f20d8e98058d1b14c744136b2f963fe29050808d1679d
  Stored in directory: /root/.cache/pip/wheels/21/eb/99/6e32f89da503a823f6bc1a985abfdaeaa01b3a6b5ac5776d15
Successfully built skompiler
Installing collected packages: skompiler
Successfully installed skompiler-0.5.5


In [2]:
df = pd.read_csv("/content/sample_data/diabetes.csv")

# Looking at the first 5 rows of the data set
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [4]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Pregnancies,768.0,3.845052,3.369578,0.0,1.0,3.0,6.0,17.0
Glucose,768.0,120.894531,31.972618,0.0,99.0,117.0,140.25,199.0
BloodPressure,768.0,69.105469,19.355807,0.0,62.0,72.0,80.0,122.0
SkinThickness,768.0,20.536458,15.952218,0.0,0.0,23.0,32.0,99.0
Insulin,768.0,79.799479,115.244002,0.0,0.0,30.5,127.25,846.0
BMI,768.0,31.992578,7.88416,0.0,27.3,32.0,36.6,67.1
DiabetesPedigreeFunction,768.0,0.471876,0.331329,0.078,0.24375,0.3725,0.62625,2.42
Age,768.0,33.240885,11.760232,21.0,24.0,29.0,41.0,81.0
Outcome,768.0,0.348958,0.476951,0.0,0.0,0.0,1.0,1.0


In [5]:
df[["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]] = \
    df[["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]].replace(0, np.NaN)

In [6]:
df.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

In [7]:
def grab_col_names(dataframe, cat_th=10, car_th=20):
    cat_cols = [col for col in dataframe.columns if dataframe[col].dtypes == "O"]

    num_but_cat = [col for col in dataframe.columns if dataframe[col].nunique() < cat_th and
                   dataframe[col].dtypes != "O"]

    cat_but_car = [col for col in dataframe.columns if dataframe[col].nunique() > car_th and
                   dataframe[col].dtypes == "O"]

    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]

    num_cols = [col for col in dataframe.columns if dataframe[col].dtypes != "O"]
    num_cols = [col for col in num_cols if col not in num_but_cat]
    
    print(f'cat_cols: {len(cat_cols)}')
    print(f'num_cols: {len(num_cols)}')
    print(f'cat_but_car: {len(cat_but_car)}')
    print(f'num_but_cat: {len(num_but_cat)}')
    
    return cat_cols, cat_but_car, num_cols, num_but_cat


In [8]:
cat_cols, cat_but_car, num_cols, num_but_cat = grab_col_names(df)

cat_cols: 1
num_cols: 8
cat_but_car: 0
num_but_cat: 1


In [9]:
# Setting an upper and lower limit for outliers
def outlier_thresholds(dataframe, variable):
    quartile1 = dataframe[variable].quantile(0.25)
    quartile3 = dataframe[variable].quantile(0.75)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit

In [10]:
# The function that examines whether there is an outlier according to the threshold values we have determined.
def check_outlier(dataframe, col_name):
    low_limit, up_limit = outlier_thresholds(dataframe, col_name)
    if dataframe[(dataframe[col_name] > up_limit) | (dataframe[col_name] < low_limit)].any(axis=None):
        return True
    else:
        return False

In [11]:
for col in num_cols:
    print(col, check_outlier(df, col))

Pregnancies True
Glucose False
BloodPressure True
SkinThickness True
Insulin True
BMI True
DiabetesPedigreeFunction True
Age True


In [12]:
# Replacing outliers with upper and lower limit
def replace_with_thresholds(dataframe, variable):
    low_limit, up_limit = outlier_thresholds(dataframe, variable)
    dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit

In [13]:
for col in num_cols:
        replace_with_thresholds(df, col)

## Missing Values

In [14]:
df.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

In [15]:
df.pivot_table(df, index=["Outcome"])

Unnamed: 0_level_0,Age,BMI,BloodPressure,DiabetesPedigreeFunction,Glucose,Insulin,Pregnancies,SkinThickness
Outcome,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,31.135,30.841141,70.810811,0.420264,110.643863,125.186553,3.298,27.227147
1,37.052239,35.262406,75.210317,0.531022,142.319549,189.782692,4.843284,32.733333


In [16]:
for col in df.columns:
    df.loc[(df["Outcome"] == 0) & (df[col].isnull()), col] = df[df["Outcome"] == 0][col].median()
    df.loc[(df["Outcome"] == 1) & (df[col].isnull()), col] = df[df["Outcome"] == 1][col].median()

In [17]:

df.loc[(df["BMI"] < 18.5), "NEW_BMI_CAT"] = "Underweight"
df.loc[(df["BMI"] > 18.5) & (df["BMI"] < 25), "NEW_BMI_CAT"] = "Normal"
df.loc[(df["BMI"] > 25) & (df["BMI"] < 30), "NEW_BMI_CAT"] = "Overweight"
df.loc[(df["BMI"] > 30) & (df["BMI"] < 40), "NEW_BMI_CAT"] = "Obese"

df.loc[(df["Glucose"] < 70), "NEW_GLUCOSE_CAT"] = "Low"
df.loc[(df["Glucose"] > 70) & (df["Glucose"] < 99), "NEW_GLUCOSE_CAT"] = "Normal"
df.loc[(df["Glucose"] > 99) & (df["Glucose"] < 126), "NEW_GLUCOSE_CAT"] = "Secret"
df.loc[(df["Glucose"] > 126) & (df["Glucose"] < 200), "NEW_GLUCOSE_CAT"] = "High"

df.loc[df['SkinThickness'] < 30, "NEW_SKIN_THICKNESS"] = "Normal"
df.loc[df['SkinThickness'] >= 30, "NEW_SKIN_THICKNESS"] = "HighFat"

df.loc[df['Pregnancies'] == 0, "NEW_PREGNANCIES"] = "NoPregnancy"
df.loc[((df['Pregnancies'] > 0) & (df['Pregnancies'] <= 4)), "NEW_PREGNANCIES"] = "StdPregnancy"
df.loc[(df['Pregnancies'] > 4), "NEW_PREGNANCIES"] = "OverPregnancy"

df.loc[(df['SkinThickness'] < 30) & (df['BloodPressure'] < 80), "NEW_CIRCULATION_LEVEL"] = "Normal"
df.loc[(df['SkinThickness'] >= 30) & (df['BloodPressure'] >= 80), "NEW_CIRCULATION_LEVEL"] = "CircularAtHighRisk"
df.loc[((df['SkinThickness'] < 30) & (df['BloodPressure'] >= 80))
       | ((df['SkinThickness'] >= 30) & (df['BloodPressure'] < 80)), "NEW_CIRCULATION_LEVEL"] = "CircularAtMediumRisk"

df["Pre_Age_Cat"] = df["Age"] * df["Pregnancies"]

df["Ins_Glu_Cat"] = df["Glucose"] * df["Insulin"]

In [18]:
def label_encoder(dataframe, binary_col):
    labelencoder = preprocessing.LabelEncoder()
    dataframe[binary_col] = labelencoder.fit_transform(dataframe[binary_col])
    return dataframe

In [19]:
binary_cols = [col for col in df.columns if df[col].dtypes == "O"
               and len(df[col].unique()) == 2]

In [23]:
def one_hot_encoder(dataframe, categorical_cols, drop_first=False):
    dataframe = pd.get_dummies(dataframe, columns=categorical_cols, drop_first=drop_first)
    return dataframe  

# One-Hot Encoding

In [24]:
ohe_cols = [col for col in df.columns if 10 >= len(df[col].unique()) > 2]

In [25]:
one_hot_encoder(df, ohe_cols, drop_first=True)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,NEW_SKIN_THICKNESS,Pre_Age_Cat,Ins_Glu_Cat,NEW_BMI_CAT_Obese,NEW_BMI_CAT_Overweight,NEW_BMI_CAT_Underweight,NEW_GLUCOSE_CAT_Low,NEW_GLUCOSE_CAT_Normal,NEW_GLUCOSE_CAT_Secret,NEW_PREGNANCIES_OverPregnancy,NEW_PREGNANCIES_StdPregnancy,NEW_CIRCULATION_LEVEL_CircularAtMediumRisk,NEW_CIRCULATION_LEVEL_Normal
0,6,85,21,27,103,122,350,29,1,HighFat,300.0,25086.0,1,0,0,0,0,0,1,0,1,0
1,1,22,18,21,66,61,196,10,0,Normal,31.0,8712.5,0,1,0,0,1,0,0,1,0,1
2,8,120,16,24,103,29,368,11,1,HighFat,256.0,31018.5,0,0,0,0,0,0,1,0,1,0
3,1,26,18,15,61,76,53,0,0,Normal,21.0,8366.0,0,1,0,0,1,0,0,1,0,1
4,0,74,3,27,102,208,489,12,1,HighFat,0.0,23016.0,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
763,10,38,25,40,109,117,55,42,0,HighFat,630.0,18180.0,1,0,0,0,0,1,1,0,1,0
764,2,59,20,19,66,154,187,6,0,Normal,54.0,12505.0,1,0,0,0,0,1,0,1,0,1
765,5,58,21,15,71,57,115,9,0,Normal,150.0,13552.0,0,1,0,0,0,1,1,0,0,1
766,1,63,13,24,103,94,195,26,1,HighFat,47.0,21357.0,1,0,0,0,0,0,0,1,1,0


In [26]:
y = df["Outcome"]
X = df.drop(["Outcome"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=17)