[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Clinical-Informatics-Interest-Group/CLiC.notebooks/blob/main/notebooks/notebook1.ipynb)

# 1. Hospital Readmission Data

The original publication for the dataset we will explore investigated the impact of HbA1c measurement on hospital readmission rates, modeling the relationship with multivariable logistic regression.

The open dataset is provided by the University of California, Irvine, here:
https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008

The original publication of the data source:
https://www.hindawi.com/journals/bmri/2014/781670/

The publication authors' description of the data:
https://www.hindawi.com/journals/bmri/2014/781670/tab1/

In [1]:
# The first cell is reserved for importing useful libraries that we
# will need for machine learning and working with lots of data.
import pandas as pd
import numpy as np
import zipfile, requests, io  # We'll use these to get and extract the data from its repository
from pathlib import Path

In [2]:
# Run this block to download the data into the working directory.
# In Colab, this is a directory on the temporary runtime's disk, not in
# your Google Drive (unless you mount Drive yourself).
# The 'get_data' function checks whether the data has already been
# downloaded and downloads it if not.
def get_data():
    if Path('./dataset_diabetes/diabetic_data.csv').is_file():
        return  # Data already present; nothing to do.
    r = requests.get('https://archive.ics.uci.edu/ml/machine-learning-databases/00296/dataset_diabetes.zip')
    r.raise_for_status()  # Stop early if the download failed.
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall()

get_data()

In [3]:
# Load the data into a pandas DataFrame and bind it to the variable "df".
df = pd.read_csv('./dataset_diabetes/diabetic_data.csv')
df.head()  # Display the first 5 rows of the data for a first look

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),?,6,25,1,1,...,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),?,1,1,7,3,...,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),?,1,1,7,2,...,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),?,1,1,7,2,...,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),?,1,1,7,1,...,No,Steady,No,No,No,No,No,Ch,Yes,NO


In [4]:
# Count the rows that are exact duplicates of an earlier row
df.duplicated().sum()

0
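
Note that `duplicated()` with no arguments only flags rows that match an earlier row in every column. The following is a minimal sketch on a made-up toy frame (the values are invented for illustration and are not taken from the real dataset) showing how the `subset` argument changes what counts as a duplicate.

```python
import pandas as pd

# Toy data, made up for illustration; only the column names echo the dataset.
toy = pd.DataFrame({
    "encounter_id": [1, 2, 3, 4],
    "patient_nbr": [100, 100, 200, 300],
    "age": ["[50-60)", "[50-60)", "[60-70)", "[70-80)"],
})

# No two rows match in every column, so the whole-row check finds nothing.
full_row_dupes = int(toy.duplicated().sum())

# Restricting the comparison to one column flags the repeated patient number.
repeat_patients = int(toy.duplicated(subset=["patient_nbr"]).sum())

print(full_row_dupes, repeat_patients)
```

A count of zero from the whole-row check therefore does not rule out repeated values in an individual column; whether that matters depends on the analysis.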

In [5]:
# Print the unique labels in each column to decide whether any missing-value markers should be dropped
for col in df.columns[2:]:
    print(col, 'labels:', np.unique(df[col]), "\n")


race labels: ['?' 'AfricanAmerican' 'Asian' 'Caucasian' 'Hispanic' 'Other'] 

gender labels: ['Female' 'Male' 'Unknown/Invalid'] 

age labels: ['[0-10)' '[10-20)' '[20-30)' '[30-40)' '[40-50)' '[50-60)' '[60-70)'
 '[70-80)' '[80-90)' '[90-100)'] 

weight labels: ['>200' '?' '[0-25)' '[100-125)' '[125-150)' '[150-175)' '[175-200)'
 '[25-50)' '[50-75)' '[75-100)'] 

admission_type_id labels: [1 2 3 4 5 6 7 8] 

discharge_disposition_id labels: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 22 23 24 25
 27 28] 

admission_source_id labels: [ 1  2  3  4  5  6  7  8  9 10 11 13 14 17 20 22 25] 

time_in_hospital labels: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14] 

payer_code labels: ['?' 'BC' 'CH' 'CM' 'CP' 'DM' 'FR' 'HM' 'MC' 'MD' 'MP' 'OG' 'OT' 'PO' 'SI'
 'SP' 'UN' 'WC'] 

medical_specialty labels: ['?' 'AllergyandImmunology' 'Anesthesiology' 'Anesthesiology-Pediatric'
 'Cardiology' 'Cardiology-Pediatric' 'DCPTEAM' 'Dentistry' 'Dermatology'
 'Emergency/Trauma' 'Endocrinolog

In [6]:
# Some of the data entries are '?'. Re-read the file with na_values so that pandas treats these as missing data.
df = pd.read_csv('./dataset_diabetes/diabetic_data.csv', na_values=['?'])

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


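The last cell did not fail, but it printed an interesting message (truncated above), most likely a warning rather than an error. We'll investigate this shortly with `df.dtypes` (an attribute, so no parentheses).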
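
To see what `na_values` changes, here is a minimal sketch on a tiny made-up CSV held in memory. The CSV text and the variable names are assumptions for illustration; only the `'?'` placeholder and the column names mirror the real file.

```python
import io
import pandas as pd

# A tiny made-up CSV that mimics the '?' placeholder seen in the real data.
csv_text = "race,weight,time_in_hospital\nCaucasian,?,3\n?,[75-100),1\n"

# Without na_values, '?' is read as an ordinary string.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, every '?' is converted to NaN on load.
clean = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

raw_missing = int(raw.isnull().sum().sum())
clean_missing = int(clean.isnull().sum().sum())
print(raw_missing, clean_missing)
```

Only the second read lets `isnull()` see the placeholders, which is why the file is re-read before counting missing values.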

In [7]:
# Now we can find the number of samples missing data for each feature
df.isnull().sum()

encounter_id                    0
patient_nbr                     0
race                         2273
gender                          0
age                             0
weight                      98569
admission_type_id               0
discharge_disposition_id        0
admission_source_id             0
time_in_hospital                0
payer_code                  40256
medical_specialty           49949
num_lab_procedures              0
num_procedures                  0
num_medications                 0
number_outpatient               0
number_emergency                0
number_inpatient                0
diag_1                         21
diag_2                        358
diag_3                       1423
number_diagnoses                0
max_glu_serum                   0
A1Cresult                       0
metformin                       0
repaglinide                     0
nateglinide                     0
chlorpropamide                  0
glimepiride                     0
acetohexamide 

We already knew from the authors' description that the data is missing some values; see https://www.hindawi.com/journals/bmri/2014/781670/tab1/

We will have to consider whether to drop these feature columns altogether. To be sure how many "samples" we have, let's check the total number of rows.

In [8]:
len(df)

101766
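
With the row count in hand, the raw missing-value counts can be read as fractions: for example, `weight` is missing in 98569 of 101766 rows, or about 97%. The sketch below shows one possible way to compute such fractions and drop sparse columns. It runs on a made-up toy frame, and the `threshold` value is a hypothetical choice for illustration, not a recommendation from the original publication.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; not taken from the real dataset.
toy = pd.DataFrame({
    "weight": [np.nan, np.nan, np.nan, "[75-100)"],
    "race": ["Caucasian", np.nan, "Asian", "Other"],
    "gender": ["Female", "Male", "Male", "Female"],
})

# The mean of a boolean mask is the fraction of missing entries per column.
missing_fraction = toy.isnull().mean()

# Hypothetical cutoff: drop any column missing in more than half of the rows.
threshold = 0.5
to_drop = missing_fraction[missing_fraction > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)

print(missing_fraction)
print(to_drop)
```

The same two lines (`isnull().mean()` and `drop(columns=...)`) would apply to `df`; where to set the cutoff is a judgment call that depends on how informative each feature is expected to be.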

In [10]:
df.dtypes

encounter_id                 int64
patient_nbr                  int64
race                        object
gender                      object
age                         object
weight                      object
admission_type_id            int64
discharge_disposition_id     int64
admission_source_id          int64
time_in_hospital             int64
payer_code                  object
medical_specialty           object
num_lab_procedures           int64
num_procedures               int64
num_medications              int64
number_outpatient            int64
number_emergency             int64
number_inpatient             int64
diag_1                      object
diag_2                      object
diag_3                      object
number_diagnoses             int64
max_glu_serum               object
A1Cresult                   object
metformin                   object
repaglinide                 object
nateglinide                 object
chlorpropamide              object
glimepiride         