#### Case 1: Diabetes Classification Analysis


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

##### 1. Determine the accuracy of predicting the outcome when only one of the following feature pairs is considered at a time:
a. Glucose and Blood Pressure

b. Glucose and Insulin

c. Insulin and BMI

d. BMI and Diabetes Pedigree Function

In [3]:
# Load the Datasets

df = pd.read_csv('../../datasets/diabetes.csv')
test_df = pd.read_csv('../../datasets/diabetes_data_table.csv')

In [13]:
feature_pairs = [
    ('Glucose', 'BloodPressure'),
    ('Glucose', 'Insulin'),
    ('Insulin', 'BMI'),
    ('BMI', 'DiabetesPedigreeFunction'),
]

y = df['Outcome']

for pair in feature_pairs:
    # Use only the two columns of the current pair as features
    X = df[list(pair)]
    # Default split is 75% train / 25% test; the fixed seed makes it reproducible
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)

    print('Accuracy of K-NN classifier on training set: {:.2f}%'.format(
        knn.score(X_train, y_train) * 100))
    print('Accuracy of K-NN classifier on test set: {:.2f}%'.format(
        knn.score(X_test, y_test) * 100))

    # Predicted class labels (0 or 1) for the rows of the separate test_df table
    print('Predicting the accuracy based on the test set:')
    print(knn.predict(test_df[list(pair)]))

    # Per-class probabilities; columns follow the order of knn.classes_
    print('Predicting the probability based on the test set:')
    print(knn.predict_proba(test_df[list(pair)]))
    print("")

Accuracy of K-NN classifier on training set: 78.82%
Accuracy of K-NN classifier on test set: 73.44%
Predicting the accuracy based on the test set:
[0 1 1 0 1 0 1 0 0 1 1 0 1 0 1]
Predicting the probability based on the test set:
[[0.8 0.2]
 [0.4 0.6]
 [0.  1. ]
 [1.  0. ]
 [0.2 0.8]
 [1.  0. ]
 [0.4 0.6]
 [1.  0. ]
 [0.8 0.2]
 [0.4 0.6]
 [0.4 0.6]
 [1.  0. ]
 [0.  1. ]
 [0.8 0.2]
 [0.4 0.6]]

Accuracy of K-NN classifier on training set: 77.43%
Accuracy of K-NN classifier on test set: 73.96%
Predicting the accuracy based on the test set:
[0 0 1 0 1 0 1 0 0 0 1 0 1 0 1]
Predicting the probability based on the test set:
[[1.  0. ]
 [0.8 0.2]
 [0.2 0.8]
 [0.8 0.2]
 [0.4 0.6]
 [1.  0. ]
 [0.4 0.6]
 [1.  0. ]
 [0.8 0.2]
 [0.6 0.4]
 [0.4 0.6]
 [0.8 0.2]
 [0.2 0.8]
 [0.8 0.2]
 [0.4 0.6]]

Accuracy of K-NN classifier on training set: 77.08%
Accuracy of K-NN classifier on test set: 68.23%
Predicting the accuracy based on the test set:
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
Predicting the probability ba
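
The probabilities printed above are all multiples of 0.2. A minimal pure-Python sketch of unweighted k-nearest-neighbor voting suggests why: with `n_neighbors=5`, each class probability is the fraction of the five nearest training points that carry that label. The function name `knn_vote` and the toy points below are hypothetical, made up for illustration. They are not rows from the diabetes dataset, and this is not scikit-learn's actual implementation.

```python
import math
from collections import Counter

def knn_vote(train_points, train_labels, query, k=5):
    """Return (predicted_label, {label: fraction}) from the k nearest points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the labels of the k closest training points
    nearest = [label for _, label in sorted(distances, key=lambda d: d[0])[:k]]
    votes = Counter(nearest)
    # Each neighbor contributes 1/k, so every fraction is a multiple of 1/k
    proba = {label: votes.get(label, 0) / k for label in sorted(set(train_labels))}
    predicted = max(proba, key=proba.get)
    return predicted, proba

# Made-up toy data: two features per point, label 0 or 1
points = [(80, 60), (90, 65), (115, 75), (135, 40), (180, 65), (170, 75), (200, 70)]
labels = [0, 0, 0, 1, 1, 1, 1]

label, proba = knn_vote(points, labels, query=(150, 70))
print(label, proba)  # 1 {0: 0.2, 1: 0.8}
```

Four of the five nearest toy points have label 1, so the sketch reports 0.8 for class 1 and 0.2 for class 0, the same kind of values that appear in the `predict_proba` output.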

##### 2. Read through the article about diabetes and answer the following questions:

a. What is the main difference between diabetes type 1 and type 2?

The main difference between type 1 and type 2 diabetes is that in type 1 diabetes, your body can't produce the insulin it needs to use glucose as energy for your cells, because the body's immune system attacks the insulin-producing cells in your pancreas.

In type 2 diabetes, your body still releases insulin but can't use it efficiently. Over time, the demand for insulin outpaces the body's ability to produce it, resulting in insulin deficiency.

b. What is lifestyle diabetes and how do we prevent it?

Lifestyle diabetes is diabetes brought on by poor day-to-day lifestyle choices. It is primarily type 2 diabetes, since that is the type that depends on the condition of your body, especially your age. You can prevent it with a proper diet, regular exercise, and a sensible weight-loss regimen if advised by a doctor.

c. Which part of the body is affected by diabetes? Explain.

The complications of diabetes start in your pancreas, since the disease affects its ability to produce insulin. Over the long term, diabetes can also affect the heart, eyes, kidneys, and feet, because excess glucose stays in your body instead of being converted into energy with the help of insulin. Symptoms can also include:

- Wounds that heal more slowly than average
- Frequent urination, which burdens the kidneys
- Constant hunger
- Overall fatigue
- Blurry vision

d. In your opinion, how will data mining be able to contribute to the research and development of cures and procedures in treating diabetes?

Researchers are still working out why a patient's body attacks the cells in the pancreas that are responsible for making insulin (type 1 diabetes), and data mining can help determine what the causes are. Knowing the causes would let us spread awareness of how to prevent diabetes from happening in the first place. Treatment methods should also improve as more of the disease is understood. Data mining can also help predict whether particular lifestyle choices are likely to lead to diabetes later in a person's life.


##### 3. Create a program that will accept the features in the data set. Then, using K Nearest Neighbor Aggregation, determine if the patient is diabetic or not.

a. Glucose and Blood Pressure

b. Glucose and Insulin

c. Glucose and Age

d. Insulin and BMI

e. Insulin and Age

In [14]:
# Create one model for each feature pair
pairs: list[tuple[str, str]] = [
    ('Glucose', 'BloodPressure'),
    ('Glucose', 'Insulin'),
    ('Glucose', 'Age'),
    ('Insulin', 'BMI'),
    ('Insulin', 'Age'),
]


y = df['Outcome']

models: list[KNeighborsClassifier] = []

for pair in pairs:
    X = df[list(pair)]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)

    models.append(knn)

In [17]:
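import pickle

# Save each model under a file name derived from its feature pair
for pair, model in zip(pairs, models):
    model_name = f'../../models/knn_{pair[0].lower()}_{pair[1].lower()}_model.pkl'
    # The with-block closes the file even if pickling fails
    with open(model_name, 'wb') as f:
        pickle.dump(model, f)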
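
Item 3 asks for a program that accepts feature values, while the cells above only train and save the models. One possible sketch of the remaining step follows. It assumes the `.pkl` files written above exist and that an `Outcome` of 1 means the patient is diabetic (an assumption, since the notebook does not state the encoding). The helper names `model_path`, `load_model`, and `classify_patient` are hypothetical.

```python
import pickle

def model_path(pair, model_dir='../../models'):
    """Rebuild the file name used when the models were saved."""
    return f'{model_dir}/knn_{pair[0].lower()}_{pair[1].lower()}_model.pkl'

def load_model(pair, model_dir='../../models'):
    """Load the pickled classifier for one feature pair."""
    with open(model_path(pair, model_dir), 'rb') as f:
        return pickle.load(f)

def classify_patient(model, pair, features):
    """Classify one patient from a {feature_name: value} mapping."""
    # Order the values the same way the columns were ordered during training
    row = [[float(features[name]) for name in pair]]
    label = int(model.predict(row)[0])
    # Find the probability column for label 1 via model.classes_
    class_index = list(model.classes_).index(1)
    probability = float(model.predict_proba(row)[0][class_index])
    # Assumption: Outcome == 1 marks a diabetic patient
    return {'diabetic': label == 1, 'probability_diabetic': probability}

# Example use (requires the saved model file to exist):
#   pair = ('Glucose', 'Age')
#   model = load_model(pair)
#   print(classify_patient(model, pair, {'Glucose': 150, 'Age': 45}))
```

Because the models were fitted on a DataFrame, recent scikit-learn versions may warn about missing feature names when given a plain list. Wrapping the row in a `pd.DataFrame` with the pair as column names should avoid that. Only unpickle model files you created yourself, since loading a pickle can execute arbitrary code.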
