# Predicting Diabetes

In [3]:
from pathlib import Path
import pandas as pd

In [4]:
data = Path('../Resources/diabetes.csv')
df = pd.read_csv(data)
df.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


## Separate the Features (X) from the Target (y)

In [5]:
y = df.Outcome
X = df.drop(columns='Outcome')

## Split our data into training and testing sets

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
X_train.shape

(576, 8)
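
The shape above follows from the defaults of `train_test_split`: when no `test_size` is given, 25% of the rows are held out for testing, and `stratify=y` keeps the class ratio the same in both parts. A minimal sketch on stand-in data, assuming 768 rows with 268 positive outcomes (`X_demo` and `y_demo` are hypothetical arrays, not the contents of `diabetes.csv`):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 768 rows, 8 features, 500 negatives and 268 positives
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(768, 8))
y_demo = np.array([0] * 500 + [1] * 268)

# Same call as in the notebook: default 75/25 split, stratified on the target
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

print(X_tr.shape, X_te.shape)    # sizes of the two parts
print(y_tr.mean(), y_te.mean())  # share of positives in each part
```

With stratification, the share of positive cases printed on the second line should be the same for the training and test parts.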

## Create a Logistic Regression Model

In [7]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(solver='lbfgs', max_iter=200, random_state=1)
classifier

LogisticRegression(max_iter=200, random_state=1)

## Fit (train) our model using the training data

In [8]:
classifier.fit(X_train, y_train)

LogisticRegression(max_iter=200, random_state=1)

## Score the model using the test data

In [9]:
print(f" Training Data Score: {classifier.score(X_train, y_train)}")
print(f" Test Data Score: {classifier.score(X_test, y_test)}")

 Training Data Score: 0.7829861111111112
 Test Data Score: 0.7760416666666666
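
For a classifier, `score` returns the mean accuracy: the fraction of rows whose predicted label matches the true label. A small sketch on synthetic data that checks this equivalence (the `make_classification` dataset and the names `X_demo`, `y_demo`, and `demo_clf` are illustrative assumptions, not this notebook's data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic binary classification data (assumption, for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1)
demo_clf = LogisticRegression(max_iter=200, random_state=1).fit(X_demo, y_demo)

demo_pred = demo_clf.predict(X_demo)

score = demo_clf.score(X_demo, y_demo)       # what the notebook prints
manual = np.mean(demo_pred == y_demo)        # fraction of correct predictions
via_metric = accuracy_score(y_demo, demo_pred)

print(score, manual, via_metric)
```

All three printed values should agree, which is why the test score can be read directly as "share of test patients classified correctly".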


## Make predictions

In [10]:
predictions = classifier.predict(X_test)
print(predictions)

[0 1 0 1 0 0 1 1 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 1 0 0 0
 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 0 1 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0
 1 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 0 0 0
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 1 0 0 0 0]
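
`predict` returns hard 0/1 labels. For a binary logistic regression, those labels correspond to thresholding the predicted probability of class 1 at 0.5, and `predict_proba` exposes the probabilities themselves. A hedged sketch on synthetic data (the dataset and the names `X_demo`, `y_demo`, and `demo_clf` are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (assumption, not the diabetes data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1)
demo_clf = LogisticRegression(max_iter=200, random_state=1).fit(X_demo, y_demo)

# Column 0 is P(class 0), column 1 is P(class 1); each row sums to 1
proba = demo_clf.predict_proba(X_demo)
labels_from_proba = (proba[:, 1] > 0.5).astype(int)

print(proba[:5].round(3))
print(np.array_equal(labels_from_proba, demo_clf.predict(X_demo)))
```

Looking at the probabilities shows how confident the model is in each individual prediction, which the 0/1 labels alone hide.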


In [11]:
pd.DataFrame({"Prediction": predictions, "Actual": y_test})

,Prediction,Actual
118,0,0
132,1,1
3,0,0
693,1,1
654,0,0
106,0,0
715,1,1
670,1,0
154,1,1
75,0,0


In [12]:
result = pd.DataFrame({"Prediction": predictions, "Actual": y_test}).reset_index(drop=True)
result.head()

,Prediction,Actual
0,0,0
1,1,1
2,0,0
3,1,1
4,0,0


### How sure are you that models can actually predict diabetes?

The model scores about 78% accuracy on the test data (0.776), so it predicts diabetes reasonably well but is far from certain.

### Would you feel comfortable giving the diagnosis of diabetes based off the predictions of the model?

No. The predictions are not 100% accurate, so there is room for error, including false positives.

----------

# Diagnosing the Model

Evaluate the accuracy and health of the logistic regression model by creating a confusion matrix and a classification report that describe the model's performance.

### Confusion Matrix

In [13]:
from sklearn.metrics import confusion_matrix

In [14]:
# Create confusion matrix
confusion_matrix(y_test, predictions)

array([[113,  12],
       [ 31,  36]], dtype=int64)
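
In scikit-learn's `confusion_matrix`, rows are the actual classes and columns are the predicted classes, in sorted label order (0, then 1). One way to make that explicit is to wrap the array from the output above in a labeled DataFrame. The row and column labels below are chosen for illustration and are not produced by the library:

```python
import numpy as np
import pandas as pd

# Values copied from the confusion matrix output above
cm = np.array([[113, 12],
               [31, 36]])

cm_df = pd.DataFrame(
    cm,
    index=["Actual 0 (No Diabetes)", "Actual 1 (Diabetes)"],
    columns=["Predicted 0", "Predicted 1"],
)
print(cm_df)

# For a binary problem, ravel() yields the counts in this order
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Read this way, 31 patients who actually have diabetes were predicted as class 0 (false negatives), and 12 patients without diabetes were predicted as class 1 (false positives).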

### Classification Report

In [15]:
from sklearn.metrics import classification_report

In [16]:
# Create the classification report
target_names = ["No Diabetes", "Diabetes"]
print(classification_report(y_test, predictions, target_names=target_names))

              precision    recall  f1-score   support

 No Diabetes       0.78      0.90      0.84       125
    Diabetes       0.75      0.54      0.63        67

    accuracy                           0.78       192
   macro avg       0.77      0.72      0.73       192
weighted avg       0.77      0.78      0.77       192
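
The Diabetes row and the accuracy line of the report can be reproduced by hand from the four confusion matrix counts reported earlier (113, 12, 31, 36). A short sketch, with variable names chosen for illustration:

```python
# Counts taken from the confusion matrix: rows are actual, columns are predicted
tn, fp, fn, tp = 113, 12, 31, 36

accuracy = (tp + tn) / (tn + fp + fn + tp)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # of predicted Diabetes, how many are right
recall = tp / (tp + fn)                      # of actual Diabetes, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.78 precision=0.75 recall=0.54 f1=0.63
```

The recall of 0.54 comes from 36 / (36 + 31): the 31 false negatives are what pull it down.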



#### Conclusion:

The model predicts patients without diabetes more reliably than patients with diabetes (recall of 0.90 vs. 0.54, F1-score of 0.84 vs. 0.63).

However, the low recall of 54% for the Diabetes class means a high number of false negatives: 31 of the 67 test patients who actually have diabetes were predicted as not having it.