## E1. Exercise: Let's work on a different set of features, including "Age", and repeat

**Create a new Jupyter Notebook and do the following:**

Following the above hands-on, initialize and train Logistic Regression and MLP classifiers for predicting diabetes using the following columns:

**Feature Columns (Input):**
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- Age: Age (years)

**Label (Output):**
- Outcome: Class variable (0 or 1)

What accuracy figures are you getting for the two classifiers? Are they very different from the accuracy figures we got in Section 3 above? Write your answer down in a markdown block.

In [32]:
import pandas as pd

train_df = pd.read_csv("train.csv")
train_df.head()

Unnamed: 0.1,Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,466,0,74,52,10,36,27.8,0.269,22,0
1,719,5,97,76,27,0,35.6,0.378,52,1
2,319,6,194,78,0,0,23.5,0.129,59,1
3,402,5,136,84,41,88,35.0,0.286,35,1
4,752,3,108,62,24,0,26.0,0.223,25,0
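
In the preview above, Insulin is 0 in three of the five rows. A serum insulin of exactly zero is unlikely to be a real measurement, so these zeros may be standing in for missing values; that is an assumption about the dataset, not something the exercise states. A minimal sketch for counting zeros per feature column, using a small frame that copies the five preview rows (the name `preview` is introduced here for illustration):

```python
import pandas as pd

# The five preview rows, restricted to the feature columns used in this exercise.
preview = pd.DataFrame({
    "Glucose": [74, 97, 194, 136, 108],
    "BloodPressure": [52, 76, 78, 84, 62],
    "Insulin": [36, 0, 0, 88, 0],
    "BMI": [27.8, 35.6, 23.5, 35.0, 26.0],
    "Age": [22, 52, 59, 35, 25],
})

# Count how many entries in each column are exactly zero.
zero_counts = (preview == 0).sum()
print(zero_counts)
```

The same expression applied to `train_features` would give the counts for the full training set.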


In [33]:
train_features = train_df[["Glucose","BloodPressure", "Insulin", "BMI", "Age"]]

train_labels = train_df["Outcome"]

train_features.head()
train_labels.head()

0    0
1    1
2    1
3    1
4    0
Name: Outcome, dtype: int64
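
A notebook cell displays only the value of its last expression, which is why the output above shows `train_labels.head()` but not `train_features.head()`. A minimal sketch that prints both and keeps the column list in one place, so the same selection can be reused for the test set. `FEATURE_COLUMNS`, `LABEL_COLUMN`, and `split_features_labels` are hypothetical names introduced here, and the tiny frame stands in for `train.csv`:

```python
import pandas as pd

# Hypothetical names, introduced for illustration only.
FEATURE_COLUMNS = ["Glucose", "BloodPressure", "Insulin", "BMI", "Age"]
LABEL_COLUMN = "Outcome"


def split_features_labels(df):
    """Return (features, labels) selected from df by column name."""
    return df[FEATURE_COLUMNS], df[LABEL_COLUMN]


# Stand-in for pd.read_csv("train.csv"): two rows copied from the preview above.
demo_df = pd.DataFrame({
    "Pregnancies": [0, 5],
    "Glucose": [74, 97],
    "BloodPressure": [52, 76],
    "SkinThickness": [10, 27],
    "Insulin": [36, 0],
    "BMI": [27.8, 35.6],
    "DiabetesPedigreeFunction": [0.269, 0.378],
    "Age": [22, 52],
    "Outcome": [0, 1],
})

demo_features, demo_labels = split_features_labels(demo_df)

# print() shows both results; bare expressions would only show the last one.
print(demo_features.head())
print(demo_labels.head())
```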

In [34]:
# Now let's define our models
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier


lr_classifier = LogisticRegression(solver='lbfgs',max_iter=10000)
mlp_classifier = MLPClassifier(solver='lbfgs', alpha=1e-5,
                               hidden_layer_sizes=(8, 2), random_state=11,max_iter=10000)


# train our models
lr_classifier.fit(train_features.to_numpy(),train_labels.to_numpy())
mlp_classifier.fit(train_features.to_numpy(),train_labels.to_numpy())

MLPClassifier(alpha=1e-05, hidden_layer_sizes=(8, 2), max_iter=10000,
              random_state=11, solver='lbfgs')
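
The selected features sit on quite different scales (BMI values around 20 to 40, Glucose values up to nearly 200 in the preview), and neural networks are often sensitive to that. One common option, not part of the original hands-on, is to standardize the inputs before the MLP. A minimal sketch using scikit-learn's `make_pipeline` and `StandardScaler`, on synthetic data generated here purely so the block runs on its own. Whether this helps on the diabetes data is something to check, not a given:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 5 features on very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) * np.array([30.0, 12.0, 100.0, 7.0, 11.0])
y_demo = (X_demo[:, 0] + X_demo[:, 3] > 0).astype(int)

# Standardize each feature to zero mean and unit variance, then fit the MLP.
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(solver="lbfgs", alpha=1e-5, hidden_layer_sizes=(8, 2),
                  random_state=11, max_iter=10000),
)
scaled_mlp.fit(X_demo, y_demo)

demo_predictions = scaled_mlp.predict(X_demo)
print(demo_predictions[:10])
```

Because the scaler lives inside the pipeline, it is fitted on the training data only and then reapplied automatically at prediction time.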

In [35]:
from sklearn.metrics import accuracy_score

# load test data
test_df = pd.read_csv("test.csv")

# Extract the input features
test_inputs = test_df[["Glucose","BloodPressure", "Insulin", "BMI", "Age"]]

y_actual = test_df["Outcome"]

# predict using the logistic regression model
y_predicted_lr = lr_classifier.predict(test_inputs.to_numpy())
lr_accuracy_score = accuracy_score(y_actual, y_predicted_lr)

# predict using the MLP model
y_predicted_mlp = mlp_classifier.predict(test_inputs.to_numpy())
mlp_accuracy_score = accuracy_score(y_actual, y_predicted_mlp)

print(f"Accuracy of the Logistic Classifier = {lr_accuracy_score}")
print(f"Accuracy of the MLP Classifier = {mlp_accuracy_score}")

Accuracy of the Logistic Classifier = 0.7402597402597403
Accuracy of the MLP Classifier = 0.6298701298701299
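
Accuracy is easier to interpret next to a baseline. One simple reference point is the accuracy of always predicting the most frequent class. A minimal pure-Python sketch; `majority_baseline_accuracy` is a hypothetical helper, and the label list is made up for illustration rather than taken from `test.csv`:

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Made-up labels: six negatives and four positives.
example_labels = [0, 0, 0, 1, 0, 1, 1, 0, 0, 1]
baseline = majority_baseline_accuracy(example_labels)
print(f"Majority-class baseline accuracy = {baseline}")
```

On the real data, `y_actual.tolist()` could be passed instead; a classifier scoring close to this value has learned little beyond the class frequencies.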


For this new set of input features, the logistic regression classifier is once again more accurate than the MLP classifier (about 0.74 versus 0.63 on the test set). However, the logistic regression classifier is less accurate than it was when we used more features. The MLP classifier, on the other hand, was more accurate with these features than with the earlier set.
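
Both figures come from a single test split, so part of the gap may be sampling noise. A rough way to gauge that is the binomial standard error of an accuracy estimate, sqrt(p(1 - p) / n). A minimal sketch; `accuracy_standard_error` is a hypothetical helper, and the test-set size of 154 is an assumption (it is not printed above), so substitute `len(y_actual)`:

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


# Accuracies reported above; n_test = 154 is an assumed test-set size.
n_test = 154
lr_se = accuracy_standard_error(0.7402597402597403, n_test)
mlp_se = accuracy_standard_error(0.6298701298701299, n_test)
print(f"Logistic regression: 0.740 +/- {lr_se:.3f}")
print(f"MLP: 0.630 +/- {mlp_se:.3f}")
```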

In [36]:
# Store the best model (logistic regression) on disk
import pickle

file_to_write = open("E1_diabetes_best_model.saved","wb")
pickle.dump(lr_classifier,file_to_write)
file_to_write.close()
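
A `with` block closes the file even if `pickle.dump` raises, which an explicit `open`/`close` pair does not guarantee. A minimal round-trip sketch on a toy model; the toy data and the temporary file path are made up for illustration:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: one feature, the label is 1 when the feature is large.
X_toy = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
toy_model = LogisticRegression(solver="lbfgs", max_iter=10000).fit(X_toy, y_toy)

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.saved")

    with open(model_path, "wb") as file_to_write:
        pickle.dump(toy_model, file_to_write)

    with open(model_path, "rb") as model_file:
        restored_model = pickle.load(model_file)

# The restored model should give the same predictions as the original.
same_predictions = bool(
    (restored_model.predict(X_toy) == toy_model.predict(X_toy)).all()
)
print(same_predictions)
```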

In [37]:
import pickle
import numpy

model_file = open("E1_diabetes_best_model.saved","rb")
model = pickle.load(model_file)
model_file.close()

# Let's prepare a sample input
# (only the five features this model was trained on, in the same order)
glucose = 200
bp = 66
insulin = 95
bmi = 32.9
age = 28

input_data = numpy.array([[glucose, bp, insulin, bmi, age]])

# predict with the model loaded from disk
y_predicted = model.predict(input_data)

if y_predicted[0] == 1:
    print("The person is likely to have diabetes in the near future")
else:
    print("The person is not likely to have diabetes in the near future")

The person is likely to have diabetes in the near future
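
A hard 0/1 prediction hides how confident the model is. scikit-learn's `LogisticRegression` also offers `predict_proba`, which returns one probability per class. A minimal sketch on toy data made up for illustration, with a single feature standing in for the five used above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: the label is 1 when the single feature is large.
X_toy = np.array([[80.0], [90.0], [100.0], [160.0], [180.0], [200.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
toy_model = LogisticRegression(solver="lbfgs", max_iter=10000).fit(X_toy, y_toy)

# predict_proba returns one row per sample; columns follow toy_model.classes_.
sample = np.array([[190.0]])
probabilities = toy_model.predict_proba(sample)[0]
positive_index = list(toy_model.classes_).index(1)
print(f"Estimated probability of class 1 = {probabilities[positive_index]:.2f}")
```

With the loaded diabetes model, the equivalent call would be `model.predict_proba(input_data)`.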
