# Diabetes Prediction Using Machine Learning 
This is a module of the Miloo Workshop: **Diabetes Prediction Using Machine Learning**.

This module walks through an example of building a model that predicts whether a patient has diabetes. It covers importing the necessary libraries, reading and manipulating the data, training and testing the model, and evaluating the model.

For more information about the dataset, see: https://www.kaggle.com/code/vincentlugat/pima-indians-diabetes-eda-prediction-0-906/data

## 1. Import Necessary Libraries
For this practice, we use pandas, numpy, scikit-learn, and matplotlib. The scikit-learn modules are imported in the cells where they are first used.

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt


**Age Categorization**
1. Boomers : 58-82
2. Gen X : 46-57
3. Millennials : 27-45
4. Gen Z : 7-26

In [None]:
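### age categorization pythonic way
def categorize_age(age):
  """Map an age in years to a generation label; return NaN if it cannot be categorized."""
  try:
    age = int(age)
  except (TypeError, ValueError, OverflowError):
    # age is missing or not a number
    return np.nan

  if 7 <= age <= 26:
    return "Gen Z"
  elif 27 <= age <= 45:
    return "Millennials"
  elif 46 <= age <= 57:
    return "Gen X"
  elif 58 <= age <= 82:
    return "Boomers"

  # age falls outside every defined range
  return np.nan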

**BMI Categorization** 
1. Underweight : 0-18.5
2. Normal : 18.5-22.9
3. Overweight : 23-24.9
4. Obesity Rank 1 : 25-29.9
5. Obesity Rank 2 : 30-100

In [None]:
### bmi categorization pythonic way
def categorize_bmi(bmi):
  """Map a BMI value to a weight category; return NaN if it cannot be categorized."""
  try:
    bmi = float(bmi)
  except (TypeError, ValueError):
    # bmi is missing or not a number
    return np.nan

  if 0 <= bmi < 18.5:
    return "Underweight"
  elif 18.5 <= bmi < 23:
    return "Normal"
  elif 23 <= bmi < 25:
    return "Overweight"
  elif 25 <= bmi < 30:
    return "Obesity Rank 1"
  elif 30 <= bmi < 100:
    return "Obesity Rank 2"

  # bmi is NaN or falls outside every defined range
  return np.nan
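
The same binning can also be written without a hand-written function. The sketch below is an alternative, not the workshop's method: it uses `pd.cut` with `right=False` so that each bin includes its lower edge, which matches the comparisons in `categorize_bmi`. The names `bmi_bins`, `bmi_labels`, and `sample_bmi`, as well as the sample values, are made up for illustration.

```python
import pandas as pd

# Bin edges and labels taken from the BMI categorization list above
bmi_bins = [0, 18.5, 23, 25, 30, 100]
bmi_labels = ["Underweight", "Normal", "Overweight", "Obesity Rank 1", "Obesity Rank 2"]

# Hypothetical sample values, chosen to sit on or near the bin edges
sample_bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 33.6])

# right=False makes each interval closed on the left: [0, 18.5), [18.5, 23), ...
sample_categories = pd.cut(sample_bmi, bins=bmi_bins, labels=bmi_labels, right=False)
print(sample_categories.tolist())
```

Values outside every bin become `NaN`, which mirrors the fallback of the function above.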

## 2. Read data 
This section focuses on loading the data into the notebook, either from Google Drive or from GitHub

### 2.1 Read data from local directory or drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df = pd.read_csv("/content/drive/MyDrive/Miloo/Workshop/Miloo Bootcamp Beginer/diabetes.csv")
df.head(10)

### 2.2 Read data from GitHub

In [None]:
!wget https://raw.githubusercontent.com/Miloo-workshop/miloo-workshop-beginner/master/diabetes.csv 

In [None]:
df = pd.read_csv("diabetes.csv")
df

## 3. Data Understanding

This section shows how to inspect the dataset: how many null values it contains, the minimum and maximum value of each feature, and other summary statistics


### 3.1 Check Data Info

In [None]:
df.info()

### 3.2 Check Null Data

In [None]:
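# count() returns the number of non-null values per column;
# a column with fewer values than the number of rows contains nulls
df.count()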
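
A more direct way to count nulls is `isna().sum()`. The sketch below uses a small hypothetical frame, `toy_df`, rather than the workshop data. It also counts zeros, on the assumption that a value of 0 in a column such as `Glucose` or `BMI` may stand in for a missing measurement; that assumption should be checked against the dataset description.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one real null and two suspicious zeros
toy_df = pd.DataFrame({
    "Glucose": [148, 0, 183, np.nan],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# Number of null values per column
null_counts = toy_df.isna().sum()

# Number of exact zeros per column (NaN == 0 evaluates to False)
zero_counts = (toy_df == 0).sum()

print(null_counts)
print(zero_counts)
```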

### 3.3 Statistical Description

In [None]:
df.describe()

## 4. Feature Engineering
Now we manipulate the features so they can be used for machine learning modeling: binning numeric data into categories, label-encoding the categorical features, and selecting the features used for modeling

### 4.1 Understanding Feature Types

In [None]:
numerical_features = ['Pregnancies',
 'Glucose',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age']
categorical_features = ['Outcome','Manipulated_Age','Manipulated_BMI']

### 4.2 Data Manipulation

Convert numeric data to categorical

##### Column Age





In [None]:
df["Manipulated_Age"] = df["Age"].apply(categorize_age)

##### Column BMI

In [None]:
df["Manipulated_BMI"] = df["BMI"].apply(categorize_bmi)

##### Column Outcome

In [None]:
df

In [None]:
df.columns.to_list()

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(df["Manipulated_Age"])
le.classes_

In [None]:
le_age = le.transform(df["Manipulated_Age"])

In [None]:
# le.inverse_transform(le_age)

In [None]:
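# Note: fit_transform refits `le` on the BMI categories, so after this cell
# `le` can decode le_BMI but can no longer decode le_age
le_BMI = le.fit_transform(df['Manipulated_BMI'])
# le_BMI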

In [None]:
# le.inverse_transform(le_BMI)

In [None]:
# add the label-encoded features to the dataframe
df['le_BMI'] = le_BMI
df['le_age'] = le_age
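
To make the encoding step concrete, here is a minimal sketch of what `LabelEncoder` does to a list of category names. The list `ages` and the name `age_encoder` are hypothetical. Keeping one encoder per column, as done here, is one possible way to keep `inverse_transform` usable for that column later.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of generation labels
ages = ["Gen Z", "Boomers", "Millennials", "Gen Z"]

# A dedicated encoder for this column
age_encoder = LabelEncoder()
codes = age_encoder.fit_transform(ages)

# classes_ holds the labels in sorted order; each code is an index into it
print(list(age_encoder.classes_))
print(codes.tolist())

# inverse_transform maps the integer codes back to the original labels
decoded = age_encoder.inverse_transform(codes)
print(list(decoded))
```

The integer codes follow alphabetical order of the labels, not the order of the age ranges, so they should not be read as a ranking.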

In [None]:
df

### 4.3 Data Selection

Select the data used for modeling

In [None]:
selected_features = ['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','DiabetesPedigreeFunction','le_BMI','le_age']
target_features = ['Outcome']
selected_df =  df[selected_features+target_features]
selected_df

## 5. Model Creation
This section will focus on how we train, test, and evaluate the machine learning model

### 5.1 Train Test Data Selection

In [None]:
# set X for features, y for target

X = df[selected_features]
y = df[target_features]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
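
The sketch below shows what `test_size=0.33` does to the row counts, using hypothetical arrays `X_demo` and `y_demo` instead of the workshop data. It also passes `stratify`, which is not used in the cell above; it is shown here only as an option that keeps the class proportions similar in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 rows, 2 features, 30 negatives and 20 positives
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.array([0] * 30 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

# The test set gets ceil(0.33 * 50) = 17 rows, the training set the remaining 33
print(X_tr.shape, X_te.shape)

# Share of positive labels overall, in the training split, and in the test split
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```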

##### Training Model

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

clf = DecisionTreeClassifier(max_leaf_nodes=6, random_state=0)
clf.fit(X_train, y_train)
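
To see what `max_leaf_nodes=6` changes, the sketch below compares a capped tree with an uncapped one. It runs on synthetic data from `make_classification`, not on the diabetes data, and the names `X_demo`, `y_demo`, `small_tree`, and `full_tree` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 8 features
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Tree limited to at most 6 leaves, as in the cell above
small_tree = DecisionTreeClassifier(max_leaf_nodes=6, random_state=0).fit(X_demo, y_demo)

# Tree with no limit on the number of leaves
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

print("leaves:", small_tree.get_n_leaves(), full_tree.get_n_leaves())
print("training accuracy:", small_tree.score(X_demo, y_demo), full_tree.score(X_demo, y_demo))
```

A capped tree is easier to read when plotted and is less prone to memorizing the training data, usually at the cost of a lower training accuracy.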

In [None]:
text_representation = tree.export_text(clf,feature_names=list(df[selected_features].columns))
print(text_representation)

In [None]:
# tree.plot_tree(clf)
# plt.show()

fig = plt.figure(figsize=(25,15))
_ = tree.plot_tree(clf, 
                   feature_names=list(df[selected_features].columns),  
                  #  class_names=list(df[target_features].columns),
                   rounded=True,
                   filled=True)

##### Test Model

In [None]:
# clf.predict_log_proba(X_test)

In [None]:
y_true = y_test
y_pred = clf.predict(X_test)

##### Evaluate model 

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
plt.show()

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true,y_pred))

In [None]:
from sklearn.metrics import f1_score
print("f1_macro     = ",f1_score(y_true, y_pred, average='macro'))
print("f1_micro     = ",f1_score(y_true, y_pred, average='micro'))
print("f1_weighted  = ",f1_score(y_true, y_pred, average='weighted'))
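
The three averages can differ noticeably when the classes are imbalanced. The sketch below works through a small hand-made example; the labels `y_true_demo` and `y_pred_demo` are hypothetical. For single-label classification like this, the micro-averaged F1 equals the accuracy.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: four negatives and two positives
y_true_demo = [0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true_demo, y_pred_demo)

# Per-class F1 is 0.75 for class 0 and 0.5 for class 1
f1_macro = f1_score(y_true_demo, y_pred_demo, average="macro")        # unweighted mean of per-class F1
f1_micro = f1_score(y_true_demo, y_pred_demo, average="micro")        # computed from global counts
f1_weighted = f1_score(y_true_demo, y_pred_demo, average="weighted")  # mean weighted by class size

print(acc, f1_macro, f1_micro, f1_weighted)
```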


## 6. Conclusion

Based on this practice, here is what we learned:
1. To build a model, we first have to understand the condition of the data
2. Manipulate the data where necessary
3. Feature selection filters which features are needed for modeling
4. The modeling process consists of:
   - Training: the model learns patterns from the training data in order to predict outcomes
   - Testing: the trained model predicts on new data it has not seen
   - Evaluation: measuring whether the model's performance is good or bad
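
The three modeling steps listed above can be sketched end to end as follows. This is a compact illustration on synthetic data from `make_classification`, not a rerun of the workshop notebook, and every variable name ending in `_syn` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the selected features and the target
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)
X_train_syn, X_test_syn, y_train_syn, y_test_syn = train_test_split(
    X_syn, y_syn, test_size=0.33, random_state=42
)

# Training: the model learns patterns from the training split
model = DecisionTreeClassifier(max_leaf_nodes=6, random_state=0)
model.fit(X_train_syn, y_train_syn)

# Testing: the trained model predicts on rows it has not seen
y_pred_syn = model.predict(X_test_syn)

# Evaluation: compare the predictions with the held-out labels
test_accuracy = accuracy_score(y_test_syn, y_pred_syn)
print("test accuracy:", test_accuracy)
```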
