# Analyzing Medical Data

In this exercise, you train an ML model to determine whether a patient might have diabetes.

The exercise uses the _Pima Indians Diabetes_ data set from the National Institute of Diabetes and Digestive and Kidney Diseases.
The data set consists of 768 cases of medical data for female patients with and without diabetes.
The data includes features such as blood pressure, body mass index (BMI), and age.
The data set is available as a CSV file in this repository.

Explore the data to determine whether you can use it to train a model that identifies patients who have diabetes.

> _NOTE: In the interest of time, this notebook performs a simple and superficial analysis of the data.
A more detailed study would require more time._

### 1. Import the required libraries and load the data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Load the data set into a Pandas data frame called "data"
data = pd.read_csv("data/diabetes.csv")

# Obtain the length (rows) and width (columns) of the data set
data.shape

The data contains 768 rows and 9 columns.  

Use the `head` method of the Pandas data frame to preview the first five rows.

In [None]:
data.head()

### 2. Inspect basic information

Use standard data analysis methods to start exploring the data.

Inspect the column names and associated data types.
The `info` method of a Pandas data frame displays the column names, non-null counts, and data types.

In [None]:
data.info()

Note the different value types:

* `Pregnancies`, `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, `Age`, and `Outcome` contain integer values.
* Body mass index (`BMI`) and `DiabetesPedigreeFunction` contain float values.

Use the `describe` method to see basic statistical information for each column, such as percentiles, mean, and standard deviation.

In [None]:
data.describe()


The dataset consists of several medical variables, which are the input features, and one target variable: `Outcome`.

* `Pregnancies`:     Number of times pregnant
* `Glucose`:         Plasma glucose concentration in an oral glucose tolerance test
* `BloodPressure`:   Diastolic blood pressure (mm Hg)
* `SkinThickness`:   Triceps skin fold thickness (mm)
* `Insulin`:         2-Hour serum insulin (mu U/ml)
* `BMI`:             Body mass index (weight in kg/(height in m)^2)
* `DiabetesPedigreeFunction`: Score for the likelihood of diabetes based on family history
* `Age`:             Age (years)
* `Outcome`:         Target variable. Whether the patient has diabetes (`1`) or not (`0`)

Count the number of diabetes cases.

In [None]:
data.Outcome.value_counts()

268 of 768 cases are diabetes cases.
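
The same method can also report relative frequencies, which makes the class imbalance easier to read (268 of 768 is roughly 35%). The following is a minimal sketch on a made-up `Outcome` column, not on the real data; the `outcome`, `counts`, and `percentages` names are illustrative only.

```python
import pandas as pd

# Toy stand-in for the `Outcome` column (hypothetical values, not the real data)
outcome = pd.Series([1, 0, 0, 1, 0, 0, 0, 1, 0, 0], name="Outcome")

# Absolute number of cases per class
counts = outcome.value_counts()

# Share of each class, expressed as a percentage
percentages = outcome.value_counts(normalize=True) * 100

print(counts)
print(percentages.round(1))
```

On the notebook's data frame, the equivalent call would be `data.Outcome.value_counts(normalize=True)`.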

### 3. Identify missing data

Plot the data to visualize the data distribution.
Use the `hist` method to plot a histogram.
You can use histograms to see how the data is distributed for each variable and detect outliers.

In [None]:
# Plot histograms of the columns on multiple subplots
plt.close('all')
data.hist(bins=20, figsize=(10, 8))

The dataset is evenly distributed except for some outliers in `BMI`, `BloodPressure`, and `DiabetesPedigreeFunction`.

Additionally, note the high number of `0` values in the `SkinThickness` and `Insulin` features.
In this particular scenario, the `0` values might indicate missing data for those features.
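
Outliers that are visible in a histogram can also be flagged numerically. One common heuristic, which this notebook does not use, is the 1.5 × IQR rule. The sketch below applies it to a made-up `BMI`-like column; the values and the 1.5 factor are assumptions chosen for illustration.

```python
import pandas as pd

# Toy BMI-like values (hypothetical, not taken from the real data set)
bmi = pd.Series([22.1, 25.3, 27.8, 30.2, 31.5, 33.0, 35.4, 67.1], name="BMI")

# Interquartile range: distance between the 25th and 75th percentiles
q1 = bmi.quantile(0.25)
q3 = bmi.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = bmi[(bmi < lower) | (bmi > upper)]
print(outliers)
```

Note that this rule would also flag the `0` values in some columns, so it does not distinguish genuine extreme measurements from missing data.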


Reuse the `head` method to see the `0` values in the data. Print the first 20 rows.

In [None]:
data.head(20)

Print the last 20 rows of the dataset and determine if those rows also contain `0` values.

In [None]:
data.tail(20)

Determine the number of `0` values in the dataset.

In [None]:
# Select all the rows and only the feature columns
feature_data = data.iloc[:, :-1]

# Count the total number of rows
num_cases = data.shape[0]

# Number and percentage of 0 values for each feature
numZero = (feature_data == 0).sum()
perZero = numZero / num_cases * 100

print(f"\nRows, Feature columns: {feature_data.shape}")
print("\n== Number of 0's:")
print(numZero)
print("\n== Percentage of 0's:")
print(perZero)

The data set contains 227 zero values for `SkinThickness` and 374 zero values for `Insulin`.
Approximately half of the cases have missing insulin values.

To build and train a reliable ML model, you should address these missing values.
However, for the sake of simplicity, this exercise keeps the `0` values in the dataset.
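
For reference, one possible way to address such values is to mark them as missing and then impute them. This is a minimal sketch under stated assumptions, not the approach of this exercise: it uses made-up rows, and median imputation is only one of several options. It should only be applied to columns where `0` is not a plausible measurement (`Pregnancies`, for example, can legitimately be `0`).

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values) that mimic the two affected columns
toy = pd.DataFrame({
    "SkinThickness": [35, 0, 29, 0, 23],
    "Insulin": [0, 0, 94, 168, 0],
})

# Treat 0 as "not recorded" so that pandas excludes it from statistics
with_nan = toy.replace(0, np.nan)

# Fill each gap with the median of the recorded values in the same column
imputed = with_nan.fillna(with_nan.median())

print(with_nan.isna().sum())
print(imputed)
```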

Verify whether the missing data values are correlated with each other.

### 4. Look for correlations

Compute the _standard correlation coefficient_ between every pair of attributes by using the `corr` method.
Note that, for bigger data sets, especially those with many features, computing the correlations might take a very long time.
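
By default, `corr` computes the Pearson correlation coefficient. To make that concrete, the following sketch computes the coefficient by hand for two made-up columns and compares it with the result from `corr`. The values and the `manual_r` and `pandas_r` names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the real data set)
toy = pd.DataFrame({
    "Age": [21, 25, 33, 41, 50],
    "Pregnancies": [0, 1, 3, 4, 8],
})

# Center both columns around their means
x = toy["Age"].to_numpy(dtype=float)
y = toy["Pregnancies"].to_numpy(dtype=float)
x_c = x - x.mean()
y_c = y - y.mean()

# Pearson r: covariance divided by the product of the standard deviations
manual_r = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# The same coefficient, taken from the pandas correlation matrix
pandas_r = toy.corr().loc["Age", "Pregnancies"]

print(round(manual_r, 3), round(pandas_r, 3))
```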

In [None]:
corrM = data.corr()
corrM.style.background_gradient().format(precision=3)

The data does not show strong correlations, which suggests a notable degree of independence among the features.

The `Pregnancies` and `Age` features have a moderate correlation.

The `Insulin` and `SkinThickness` columns, which include many missing values, have a correlation value of 0.437, which might be caused by the high number of `0` values in these two features.
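
One way to check this hypothesis is to count how often the `0` values occur in the same rows. The sketch below does this with a cross-tabulation on made-up rows; the values and the `skin_missing`, `insulin_missing`, and `overlap` names are assumptions for illustration.

```python
import pandas as pd

# Toy rows (hypothetical values) that mimic the two affected columns
toy = pd.DataFrame({
    "SkinThickness": [35, 0, 29, 0, 23, 0, 32, 0],
    "Insulin": [0, 0, 94, 0, 846, 0, 88, 0],
})

# Boolean flags that mark where each feature is recorded as 0
skin_missing = (toy["SkinThickness"] == 0).rename("SkinThickness == 0")
insulin_missing = (toy["Insulin"] == 0).rename("Insulin == 0")

# Contingency table: how often the two flags coincide
overlap = pd.crosstab(skin_missing, insulin_missing)
print(overlap)

# Number of rows in which both features are 0
both_missing = (skin_missing & insulin_missing).sum()
print(both_missing)
```

If most rows with a `0` in one column also have a `0` in the other, the shared zeros alone can inflate the correlation. Recomputing the correlation on rows where both values are nonzero would show how much of it remains.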

## References

* Pima Indians Diabetes Database: Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261-265). IEEE Computer Society Press. Available at: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database