# Pima Diabetes Data Set 

Columns

* Pregnancies - Number of times pregnant
* Glucose - Plasma glucose concentration at 2 hours in an oral glucose tolerance test
* BloodPressure - Diastolic blood pressure (mm Hg)
* SkinThickness - Triceps skin fold thickness (mm)
* Insulin - 2-Hour serum insulin (mu U/ml)
* BMI - Body mass index (weight in kg/(height in m)^2)
* DiabetesPedigreeFunction - Diabetes pedigree function
* Age - Age (years)
* Outcome - Class variable (0 or 1); class value 1 is interpreted as "tested positive for diabetes"

Class distribution: 
* 0 : 500 (healthy)
* 1 : 268 (diabetic)

Data characteristics:

* The database contains only data about female patients of Pima heritage who are 21 or older
* All the attributes are numeric
* The data may contain invalid or null values
* The total number of cases is 768

## Loading Libraries

* your task is to load the libraries that you usually need (pandas, numpy, matplotlib.pyplot)
* you can add the 'as' keyword so that you can use an abbreviation when calling functions from the libraries
* for example: import numpy as np

In [None]:
# show plots inside the notebook  
%matplotlib inline 





## Data Loading

* your task is to load the dataset
* the dataset is located here: "data/diabetes.csv"
* it is a CSV file, so you need the pandas function that loads CSV files
* and you need to store the result into a variable
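
A minimal sketch of the loading step. So that it runs without the file, it reads a few made-up rows from an in-memory string; in the notebook you would pass the path "data/diabetes.csv" to the same function. The variable name `df` and the sample values are assumptions made for illustration only.

```python
import io
import pandas as pd

# Stand-in for data/diabetes.csv: made-up rows with the documented column names.
sample_csv = io.StringIO(
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "2,120,70,30,100,28.5,0.45,31,0\n"
    "5,165,80,0,0,33.1,0.60,47,1\n"
    "1,95,64,22,60,24.9,0.20,23,0\n"
)

# In the notebook: df = pd.read_csv("data/diabetes.csv")
df = pd.read_csv(sample_csv)
print(df.shape)
```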

Data Set Information:

Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima heritage. 

Attribute Information:

1. Number of times pregnant 
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test 
3. Diastolic blood pressure (mm Hg) 
4. Triceps skin fold thickness (mm) 
5. 2-Hour serum insulin (mu U/ml) 
6. Body mass index (weight in kg/(height in m)^2) 
7. Diabetes pedigree function 
8. Age (years) 
9. Class variable (0 or 1) - Outcome - Target

## Data Checking

* you have created the dataset object; now use the method .info() to get more information about the dataset

* display the first 5 rows of the dataset by using the method .head() on your dataset object

* now use the method .describe()

* your task now is to see if the target distribution is balanced: do we have the same amount of 0s and 1s in the Outcome column? There are several ways to achieve this; in the solution you will see a new way, but use the one you prefer or google an alternative solution

using value_counts()

Now we know that there are 768 people with an uneven distribution of the outcome (healthy:sick = 500:268). 
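
The checks above can be sketched as follows. The tiny DataFrame is made up and contains only two of the columns, so the printed numbers will not match the real dataset; the name `df` is an assumption.

```python
import pandas as pd

# Made-up miniature dataset; only two columns are included for brevity.
df = pd.DataFrame({
    "Glucose": [120, 165, 95, 140, 88, 180],
    "Outcome": [0, 1, 0, 1, 0, 0],
})

df.info()             # column dtypes and non-null counts
print(df.head())      # first 5 rows
print(df.describe())  # summary statistics for each numeric column

# Count how often each class occurs in the target column.
counts = df["Outcome"].value_counts()
print(counts)
```

Passing `normalize=True` to `value_counts()` returns proportions instead of counts, which can make an imbalance easier to read.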

## Data Stratification

When we split the dataset into training and test sets, the split is completely random, so the number of instances of each class that ends up in each set is random too. We may end up with many instances of class 0 and few instances of class 1 in the training data. The classifier may then make accurate predictions for class 0 but not for class 1. To avoid this we stratify the split, so that both the training and the test data contain the classes in the same proportions as the full dataset.

* YOUR TURN import the train_test_split function from sklearn.model_selection

* YOUR TURN create X and y (the features and the target variable) using the code that we used in the previous notebook

* YOUR TURN divide the data into training and testing sets and use STRATIFY with the column Outcome
* use random_state=101 if you want to have the same output at the end as mine
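
A minimal sketch of a stratified split, using random features and a label array with the same 500:268 class balance as the dataset. The arrays are synthetic stand-ins; in the notebook, X and y would come from your DataFrame instead.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class balance as the dataset (500 vs 268).
rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))
y = np.array([0] * 500 + [1] * 268)

# stratify=y keeps the class ratio in both parts.
# test_size is left at its default of 0.25.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=101
)

# Class counts in each part: both should show roughly 65% class 0.
print(np.bincount(y_train), np.bincount(y_test))
```

Without `stratify`, the same call would still produce a valid split, but the class proportions in the two parts could drift apart.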

In [None]:
# divide into training and testing data - use stratify


## Make the Model

* YOUR TURN import the LogisticRegression class from sklearn.linear_model

* YOUR TURN create the model by calling LogisticRegression() and assigning the result to a variable called model

* YOUR TURN fit the model to the training data (X_train and y_train)

* YOUR TURN collect the predictions for your X_test data in a variable called prediction
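
The model steps above might look like this. The data comes from `make_classification`, a synthetic stand-in chosen so that the snippet runs on its own; the results will differ from those on the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes data (8 features, roughly 65% class 0).
X, y = make_classification(
    n_samples=768, n_features=8, weights=[0.65, 0.35], random_state=101
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=101
)

model = LogisticRegression()         # default settings
model.fit(X_train, y_train)          # learn from the training data
prediction = model.predict(X_test)   # class labels (0 or 1) for the test data
print(prediction[:10])
```

On unscaled real data the default solver may warn that it did not converge; raising `max_iter` is one common response.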

**accuracy**

... is the percentage of correct predictions.

* YOUR TURN import the module metrics from sklearn

In [None]:
# calculate accuracy


YOUR TURN calculate and print the accuracy: either use metrics.accuracy_score with y_test and prediction, or simply call model.score(X_test, y_test)
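
A small sketch of what the accuracy score computes, using made-up label arrays in place of the notebook's y_test and prediction.

```python
import numpy as np
from sklearn import metrics

# Made-up labels for illustration; in the notebook use y_test and prediction.
y_test = np.array([0, 0, 1, 1, 0, 1, 0, 0])
prediction = np.array([0, 0, 1, 0, 0, 1, 1, 0])

# Fraction of positions where the prediction equals the true label.
accuracy = metrics.accuracy_score(y_test, prediction)
manual = (y_test == prediction).mean()
print(accuracy, manual)
```

For a fitted scikit-learn classifier, `model.score(X_test, y_test)` predicts on X_test internally and returns this same accuracy.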

**Confusion Matrix**

YOUR TURN from sklearn.metrics import confusion_matrix

YOUR TURN print the confusion matrix

The visual representation of the confusion matrix helps a lot. We can see the following in the table:

index 0 = class 0: Person will NOT have diabetes in 5 years<br />
index 1 = class 1: Person will have diabetes in 5 years

**True Positives (TP)**: (35) we correctly predicted that they do have diabetes<br />
**True Negatives (TN)**: (112) we correctly predicted that they don't have diabetes<br />
**False Positives (FP)**: (13) we incorrectly predicted that they do have diabetes<br />
**False Negatives (FN)**: (32) we incorrectly predicted that they don't have diabetes
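
The layout of scikit-learn's confusion matrix for a binary problem can be sketched with made-up labels. The arrays below are illustrative and do not reproduce the counts listed above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration; in the notebook use y_test and prediction.
y_test = np.array([0, 0, 1, 1, 0, 1, 0, 0])
prediction = np.array([0, 0, 1, 0, 0, 1, 1, 0])

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_test, prediction)

# ravel() flattens the 2x2 matrix row by row.
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)
```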

From these values we can calculate the following classification metrics:

**Sensitivity** (aka "True Positive Rate" or "Recall"): When the actual value is positive, how often is the prediction correct?
- Something we want to maximize
- How "sensitive" is the classifier to detecting positive instances?
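
Sensitivity is TP / (TP + FN). A short sketch with made-up labels, compared against scikit-learn's `recall_score`:

```python
import numpy as np
from sklearn.metrics import recall_score

# Made-up labels for illustration; in the notebook use y_test and prediction.
y_test = np.array([0, 0, 1, 1, 0, 1, 0, 0])
prediction = np.array([0, 0, 1, 0, 0, 1, 1, 0])

# Count the actual positives that were found (TP) and missed (FN).
tp = np.sum((y_test == 1) & (prediction == 1))
fn = np.sum((y_test == 1) & (prediction == 0))

sensitivity = tp / (tp + fn)
print(sensitivity, recall_score(y_test, prediction))
```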




### Classification Report

YOUR TURN from sklearn.metrics import classification_report

YOUR TURN print the classification report of your y_test and prediction
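
A sketch of the report call on made-up labels. The `output_dict=True` variant is shown only as one way to read individual values programmatically.

```python
import numpy as np
from sklearn.metrics import classification_report

# Made-up labels for illustration; in the notebook use y_test and prediction.
y_test = np.array([0, 0, 1, 1, 0, 1, 0, 0])
prediction = np.array([0, 0, 1, 0, 0, 1, 1, 0])

# Text table with precision, recall, f1-score and support for each class.
print(classification_report(y_test, prediction))

# The same numbers as a nested dictionary, keyed by class label as a string.
report = classification_report(y_test, prediction, output_dict=True)
print(report["1"]["recall"])
```

The recall row for class 1 is the sensitivity; the recall row for class 0 is the share of actual negatives that were correctly identified.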

Right now we have a model that is quite good at correctly detecting whether a patient will be healthy in 5 years (112 true negatives, a recall of 0.90 for class 0). But I think it is more important to find the people who will be sick in 5 years, so that preventive measures can be applied. We can adjust our model by changing the classification threshold.

**Classification Threshold**

YOUR TURN save in a variable called save_predictions_proba the predicted probability of having diabetes for your X_test (column 1 of the predicted probabilities)

YOUR TURN print the histogram of the predicted probabilities so that you can have an idea of what other threshold you can potentially use to construct a model with higher recall
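
A sketch of both steps on synthetic data from `make_classification`, so the histogram will not look like the one for the diabetes data. The bin count and axis labels are arbitrary choices.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes data.
X, y = make_classification(
    n_samples=768, n_features=8, weights=[0.65, 0.35], random_state=101
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=101
)
model = LogisticRegression().fit(X_train, y_train)

# predict_proba returns one column per class; column 1 is P(class 1).
save_predictions_proba = model.predict_proba(X_test)[:, 1]

# Histogram of the predicted probabilities of class 1.
counts, bin_edges, _ = plt.hist(save_predictions_proba, bins=10, range=(0, 1))
plt.xlabel("Predicted probability of class 1")
plt.ylabel("Frequency")
```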

In [None]:
# histogram of predicted probabilities


Only a small number of observations have a probability > 0.5; most observations have a probability < 0.5 and would be predicted "no diabetes" in our case. We can increase the sensitivity (increase the number of TP) of the classifier by decreasing the threshold for predicting diabetes.

YOUR TURN from sklearn.preprocessing import binarize

YOUR TURN use the binarize function that you imported to store in a variable prediction2 the new predictions with a threshold of 0.3 (remember that you need to reshape the save_predictions_proba you created before)
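
How `binarize` applies a threshold can be sketched with a handful of made-up probabilities. The name `default_prediction` is introduced only for comparison.

```python
import numpy as np
from sklearn.preprocessing import binarize

# Made-up probabilities of class 1; in the notebook use save_predictions_proba.
save_predictions_proba = np.array([0.10, 0.25, 0.35, 0.45, 0.55, 0.80])

# binarize expects a 2D array, so reshape to one column and flatten afterwards.
# Values strictly greater than the threshold become 1, all others become 0.
as_column = save_predictions_proba.reshape(-1, 1)
prediction2 = binarize(as_column, threshold=0.3).ravel()
default_prediction = binarize(as_column, threshold=0.5).ravel()

print(prediction2)
print(default_prediction)
```

Lowering the threshold from 0.5 to 0.3 turns two more of these observations into positive predictions, which is how sensitivity rises at the cost of more false positives.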

YOUR TURN print the new confusion matrix (y_test and prediction2). PS: you have already imported confusion_matrix

YOUR TURN print the classification report

Observations: 

- Threshold of 0.5 is used by default (for binary problems) to convert predicted probabilities into class predictions
- Threshold can be adjusted to increase sensitivity

- Adjusting the threshold should be one of the last steps you do in the model-building process