[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Clinical-Informatics-Interest-Group/CLiC.notebooks/blob/main/notebooks/uci_diabetes_data.ipynb)

# 1. Hospital Readmission Data

The original publication for the dataset we will explore investigated the impact of HbA1c measurement on hospital readmission rates, modeling the relationship with multivariable logistic regression.

The open dataset is provided by the University of California, Irvine:
https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008

The original publication describing the data source:
https://www.hindawi.com/journals/bmri/2014/781670/

The publication authors' description of the data:
https://www.hindawi.com/journals/bmri/2014/781670/tab1/

In [1]:
# The first cell is reserved for importing useful libraries that we
# will need for machine learning and working with lots of data.
import pandas as pd
import numpy as np
import zipfile, requests, io  # We'll use these to get and extract the data from its repository
from pathlib import Path

In [8]:
# Run this block to download the data into the working directory.
# In Colab, this is a directory on the temporary runtime's disk, not in
# your Google Drive (unless you mount Drive and change into it).
# The 'get_data' function checks whether the data has already been
# downloaded and downloads it if not.
def get_data():
    if Path('./dataset_diabetes/diabetic_data.csv').is_file():
        return  # Data already present; nothing to do.
    r = requests.get('https://archive.ics.uci.edu/ml/machine-learning-databases/00296/dataset_diabetes.zip')
    r.raise_for_status()  # Fail loudly if the download did not succeed.
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall()

get_data()

In [None]:
# Load the CSV into a pandas DataFrame and assign it to the variable "df".
df = pd.read_csv('./dataset_diabetes/diabetic_data.csv')
df.head()  # Display the first 5 rows for a first look

In [None]:
# Count the rows that are exact duplicates of an earlier row
df.duplicated().sum()
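
Called without arguments, `duplicated()` only flags rows that match an earlier row in every column. The sketch below uses a small invented frame (the values are made up, and the column names `encounter_id` and `patient_nbr` are assumed to mirror the dataset's identifier columns) to show how the `subset` argument can instead count repeated patients.

```python
import pandas as pd

# Toy data invented for illustration; not taken from the real dataset.
toy = pd.DataFrame({
    "encounter_id": [1, 2, 3, 4],
    "patient_nbr": [10, 10, 11, 12],
    "age": ["[50-60)", "[50-60)", "[70-80)", "[70-80)"],
})

# Rows identical to an earlier row in every column.
full_row_duplicates = int(toy.duplicated().sum())

# Rows whose patient identifier already appeared in an earlier row.
repeat_patients = int(toy.duplicated(subset=["patient_nbr"]).sum())

print(full_row_duplicates, repeat_patients)
```

A result of zero for the full-row check therefore does not rule out several encounters belonging to the same patient.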

In [None]:
# List the distinct values in each column (skipping the first two columns)
# to spot placeholders that stand in for missing data.
for col in df.columns[2:]:
    print(col, 'labels:', df[col].unique(), "\n")


In [None]:
# Some of the data entries are '?'. Let's tell pandas to treat these as missing values when reading the file.
df = pd.read_csv('./dataset_diabetes/diabetic_data.csv', na_values=['?'])

The last block threw an interesting error. We'll investigate this shortly with "df.dtypes" (an attribute, so it takes no parentheses).
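
To make the effect of `na_values` concrete, here is a minimal sketch on a tiny invented CSV held in memory. The rows are made up and the column names are only assumed to resemble the dataset's. It reads the same text twice and compares the inferred dtypes and the missing-value counts.

```python
import io

import pandas as pd

# Tiny invented CSV; '?' plays the role of the missing-data placeholder.
csv_text = (
    "weight,race,num_procedures\n"
    "?,Caucasian,1\n"
    "[75-100),?,0\n"
    "?,Asian,3\n"
)

# Without na_values, '?' is kept as an ordinary string.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, '?' is converted to NaN while parsing.
clean = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

print(raw.dtypes)
print(clean.dtypes)

raw_missing = int(raw.isnull().sum().sum())
clean_missing = clean.isnull().sum()
print(raw_missing)
print(clean_missing)
```

Note that `dtypes` is accessed without parentheses; calling it as `df.dtypes()` raises a `TypeError`.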

In [None]:
# Now we can find the number of samples missing data for each feature
df.isnull().sum()

We already knew that some values were missing; see https://www.hindawi.com/journals/bmri/2014/781670/tab1/

We will have to consider whether to drop these feature columns altogether. To be sure how many "samples" we have, let's check the total number of rows.

In [None]:
len(df)
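
One way to weigh the missing counts against the number of rows is to express them as fractions. The sketch below does this on a small invented frame. The data is made up, and the 0.5 cutoff is a hypothetical threshold chosen only for illustration, not a recommendation from the publication.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration; not taken from the real dataset.
toy = pd.DataFrame({
    "weight": [np.nan, np.nan, np.nan, "[75-100)"],
    "race": ["Caucasian", np.nan, "Asian", "Caucasian"],
    "num_procedures": [1, 0, 3, 2],
})

# The mean of a boolean mask is the fraction of missing entries per column.
missing_fraction = toy.isnull().mean()

# Hypothetical cutoff: drop columns missing in more than half of the rows.
threshold = 0.5
cols_to_drop = missing_fraction[missing_fraction > threshold].index.tolist()
reduced = toy.drop(columns=cols_to_drop)

print(missing_fraction)
print(cols_to_drop)
print(reduced.shape)
```

Columns below the cutoff keep their missing entries, so those rows would still need to be imputed or dropped later.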