## Length of the code {-}
No restriction

**Delete this section from the report when using this template.**

## Data quality check / cleaning / preparation 

Put code with comments. The comments should explain the code such that it can be easily understood. You may put text *(in a markdown cell)* before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. **Put the name of the person / persons who contributed to each code chunk / set of code chunks.** An example is given below.

### Data quality check
*By Elton John*

The code below visualizes the distribution of all the variables in the dataset, and their association with the response.

In [None]:
# Loading relevant libraries 
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm
import random

# Loading the data
data = pd.read_csv("dataset_diabetes/diabetic_data.csv")
ID = pd.read_csv('dataset_diabetes/IDs_mapping.csv')

In [7]:
#...Distribution of continuous variables...#

In [8]:
#...Distribution of categorical variables...#

In [9]:
#...Association of the response with the predictors...#

### Data cleaning
*By Anastasia Wei*

From the data quality check we realized that:

1. Some of the columns that should have contained only numeric values, specifically <>, <>, and <> have special characters such as \*, #, %. We'll remove these characters, and convert the datatype of these columns to numeric.

2. Some of the columns have more than 60% missing values, and it is very difficult to impute their values, as the values seem to be missing at random with negligible association with other predictors. We'll remove such columns from the data.

3. The column `number_of_bedrooms` has some unreasonably high values such as 15. As our data consist of single-family homes in Evanston, we suspect that any value greater than 5 may be incorrect. We'll replace all values that are greater than 5 with an estimate obtained using the $K$-nearest neighbor approach.

4. The column `house_price` has some unreasonably high values. We'll tag all values greater than 1 billion dollars as "potentially incorrect observation", to see if they distort our prediction / inference later on.

The code below implements the above cleaning.

In [None]:
######-----------Changing the IDs into a three column format------------#########
IDs = pd.DataFrame(index = range(63), columns = ['ID_types', 'ID_num', 'Description'])

IDs.loc[:8, 'ID_types'] = ['admission_type_id'] * 9
IDs.loc[:8, 'ID_num'] = ID.loc[:8,'admission_type_id'].values
IDs.loc[:8, 'Description'] = ID.loc[:8,'description'].values

IDs.loc[8:38, 'ID_types'] = ['discharge_disposition_id'] * 31
IDs.loc[8:38, 'ID_num'] = ID.loc[10:40,'admission_type_id'].values
IDs.loc[8:38, 'Description'] = ID.loc[10:40,'description'].values

IDs.loc[38:, 'ID_types'] = ['admission_source_id'] * 25
IDs.loc[38:, 'ID_num'] = ID.loc[42:,'admission_type_id'].values
IDs.loc[38:, 'Description'] = ID.loc[42:,'description'].values

# Saving the cleaned IDs to a new csv file
IDs.to_csv('IDs_clean.csv')

######-----------Dropping the columns w/ more than 50% values missing------------#########
data.drop(['weight','medical_specialty'], axis = 1, inplace = True)

######-----------Removing duplicate records for the same patient------------#########
print('Length before removing Duplicates:', len(data))
data.drop_duplicates(['patient_nbr'], keep = 'first', inplace = True)
print('Length after removing Duplicates:', len(data))

######-----------Changing readmission to two levels instead of three------------#########

# Checking the values of the readmitted column and changing it to two levels
# Readmitted is defined here as '1' if the patient returns to hospital within 30 days
data.readmitted.value_counts()
data['readmitted'] = data['readmitted'].apply(lambda x: 0 if x == 'NO' or x == '>30'
                                              else 1)

######-----------Imputing Missing Values in Race by drawing randomly from the race in the observations------------#########

# Checking the value counts of race and where values are missing 
races = data['race'].loc[data['race'] != '?'].values
data['race'].value_counts()

# Applying a lambda function to the race column to impute missing values 
data['race'] = data['race'].apply(lambda x: random.choice(races) if x == '?' else x)
data['race'].value_counts()

# Re-indexing
data.index = np.arange(0,len(data))
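
As a hedged illustration of how the columns dropped above could be identified rather than hard-coded, the sketch below computes the share of missing values per column. It assumes missing entries are coded as `'?'` (as they are for `race`); the toy data frame, the helper name `missing_share`, and the 50% threshold are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

def missing_share(df, missing_token="?"):
    """Return, per column, the fraction of entries that are NaN or equal to missing_token."""
    is_missing = df.isna() | df.eq(missing_token)
    return is_missing.mean()

# Toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "weight": ["?", "?", "?", "[75-100)"],
    "race": ["Caucasian", "?", "AfricanAmerican", "Caucasian"],
    "age": ["[50-60)", "[60-70)", "[70-80)", "[80-90)"],
})

# Columns whose missing share exceeds the (assumed) 50% threshold
share = missing_share(toy)
cols_to_drop = share[share > 0.5].index.tolist()
toy_clean = toy.drop(columns=cols_to_drop)
```

On the real data, the same helper could be applied to `data` before deciding which columns to drop.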


### Data preparation
*By Anastasia Wei*

The following data preparation steps helped us to prepare our data for implementing various modeling / validation techniques:

1. Since we need to predict house price, we derived some new predictors *(from existing predictors)* that intuitively seem to be helpful to predict house price.

2. We have shuffled the dataset to prepare it for K-fold cross validation.

3. We have created a standardized version of the dataset, as we will use it to develop Lasso / Ridge regression models.
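
Steps 2 and 3 can be sketched as follows. This is a minimal illustration on made-up numbers, assuming pandas is used for both steps; the column names, the seed, and the helper `standardize` are hypothetical.

```python
import pandas as pd

def standardize(df):
    """Center each column at 0 and scale it to unit (sample) standard deviation."""
    return (df - df.mean()) / df.std()

# Toy numeric predictors (values are made up)
toy = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0], "x2": [10.0, 20.0, 30.0, 40.0]})

# Shuffle the rows (fixed seed for reproducibility) and reset the index
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)

# Standardized copy, useful for Lasso / Ridge because their penalties depend on scale
scaled = standardize(shuffled)
```

When a train / test split is used, the means and standard deviations would normally be computed on the training data only and then applied to the test data.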

In [3]:
######-----------Replacing the age range with the middle of the interval------------#########
replaceDict = {'[0-10)' : 5,'[10-20)' : 15, '[20-30)' : 25, 
               '[30-40)' : 35,'[40-50)' : 45,'[50-60)' : 55,
               '[60-70)' : 65,'[70-80)' : 75,'[80-90)' : 85,
               '[90-100)' : 95}

data['age'] = data['age'].apply(lambda x : replaceDict[x])
print(data['age'].head())

######-----------Using domain knowledge on diag_1, diag_2, and diag_3 to bin the diagnoses------------#########

# Defining a helper function to determine if a number is a float
def isfloat(num):
    try:
        float(num)
        return True
    except ValueError:
        return False

# Defining a function to bin the diagnoses columns
def diag_transform(x):
    if str(x)[0] == 'V' or str(x)[0] == 'E':
        return 'other'
    elif isfloat(x):
        if int(float(x)) in range(390, 460) or int(float(x)) == 785:
            return 'circulatory'
        elif int(float(x)) in range(460, 520) or int(float(x)) == 786:
            return 'respiratory'
        elif int(float(x)) in range(520, 580) or int(float(x)) == 787:
            return 'digestive'
        elif int(float(x)) == 250:
            return 'diabetes'
        elif int(float(x)) in range(800, 1000):
            return 'injury'
        elif int(float(x)) in range(710, 740):
            return 'musculoskeletal'
        elif int(float(x)) in range(580, 630) or int(float(x)) == 788:
            return 'genitourinary'
        elif int(float(x)) in range(140, 240):
            return 'neoplasms'
        elif int(float(x)) in range(630, 680):
            return 'pregnancy'
        else:
            return 'other'
    else:
        return 'other'

# Applying the functions to the appropriate columns in the dataframe 
data['diag_1'] = data['diag_1'].apply(diag_transform)
data['diag_2'] = data['diag_2'].apply(diag_transform)
data['diag_3'] = data['diag_3'].apply(diag_transform)

In [None]:
######-----------Using domain knowledge to bin admission_type_id, discharge_disposition_id, admission_source_id------------#########

# Changing the datatypes of the appropriate columns 
data['admission_type_id'] = data['admission_type_id'].astype('int')
data['admission_source_id'] = data['admission_source_id'].astype('int')
data['discharge_disposition_id'] = data['discharge_disposition_id'].astype('int')

# Defining helper functions to transform the data 
def ad_type_transform(x):
    if x in [2,7]:
        return 1
    elif x in [6,8]:
        return 5
    else:
        return x
    
def ad_source_transform(x):
    if x in [2,3]:
        return 1
    elif x in [5,6,10,22,25]:
        return 4
    elif x in [15,17,20,21]:
        return 9
    elif x in [13,14]:
        return 11
    else:
        return x
    
def discharge_transform(x):
    if x in [6, 8, 9, 13]:
        return 1
    elif x in [3, 4, 5, 14, 22, 23, 24]:
        return 2
    elif x in [12, 15, 16, 17]:
        return 10
    elif x in [19, 20, 21]:
        return 11
    elif x in [25, 26]:
        return 18
    else:
        return x

    
# Applying the helper functions to the appropriate columns
data['admission_type_id'] = data['admission_type_id'].apply(ad_type_transform)
data['admission_source_id'] = data['admission_source_id'].apply(ad_source_transform)
data['discharge_disposition_id'] = data['discharge_disposition_id'].apply(discharge_transform)

In [None]:
######-----Creating number of changes variable (number of 'ups' and 'downs' in medication)-------#########

# Creating a list of the drugs in the dataset
druglist = ['metformin','repaglinide', 'nateglinide', 'chlorpropamide',
            'glimepiride','acetohexamide', 'glipizide', 'glyburide', 
            'tolbutamide','pioglitazone', 'rosiglitazone', 'acarbose', 
            'miglitol', 'troglitazone','tolazamide', 'examide', 'citoglipton', 
            'insulin', 'glyburide-metformin', 'glipizide-metformin',
            'glimepiride-pioglitazone', 'metformin-rosiglitazone',
            'metformin-pioglitazone']

# Counting, for each row, how many drugs had their dosage changed ('Up' or 'Down')
data['num_of_changes'] = data[druglist].isin(['Down', 'Up']).sum(axis = 1)

In [None]:
######-----Saving the cleaned data as a CSV-------#########
data.to_csv('diabetes_cleaned.csv')

######-----Creating test and train datasets-------#########

# Using 80% for train, 20% for test
print(len(data))
print(len(data)*0.2)

# Creating the test data (20% of the rows, sampled without replacement)
n_rows = len(data)
n_test = round(n_rows * 0.2)
pop = list(range(n_rows))
test_loc = random.sample(pop, k = n_test)
test = data.iloc[test_loc]
test.index = np.arange(0, len(test))

# Creating the train data (all remaining rows)
train_loc = list(set(pop) - set(test_loc))
train = data.iloc[train_loc]
train.index = np.arange(0, len(train))

# Saving both dfs as CSVs
test.to_csv('test.csv')
train.to_csv('train.csv')
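
The split above is not seeded, so it changes on every run. A minimal sketch of a reproducible 80/20 split using only the standard library is given below; the helper `split_indices` and the seed value are hypothetical and not part of the original code.

```python
import random

def split_indices(n_rows, test_frac=0.2, seed=42):
    """Return (train_idx, test_idx): disjoint, sorted row positions covering range(n_rows)."""
    rng = random.Random(seed)              # local generator, so the global state is untouched
    n_test = round(n_rows * test_frac)     # number of test rows
    test_idx = sorted(rng.sample(range(n_rows), k=n_test))
    test_set = set(test_idx)
    train_idx = [i for i in range(n_rows) if i not in test_set]
    return train_idx, test_idx

# 71518 is the number of rows used in the split above
train_idx, test_idx = split_indices(71518)
```

The returned positions could then be passed to `data.iloc[...]` to build the train and test data frames.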

## Exploratory data analysis
*By Anastasia Wei, Lila Wells, Kaitlyn Hung, and Amy Wang*

Put code with comments. The comments should explain the code such that it can be easily understood. You may put text *(in a markdown cell)* before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. **Put the name of the person / persons who contributed to each code chunk / set of code chunks.**

## Developing the model
*By Anastasia Wei, Lila Wells, Kaitlyn Hung, and Amy Wang*

Put code with comments. The comments should explain the code such that it can be easily understood. You may put text *(in a markdown cell)* before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. **Put the name of the person / persons who contributed to each code chunk / set of code chunks.**

### Code fitting the final model

Put the code(s) that fit the final model(s) in separate cell(s), i.e., the code with the `.ols()` or `.logit()` functions.

## Conclusions and Recommendations to stakeholder(s)
*By Anastasia Wei, Lila Wells, Kaitlyn Hung, and Amy Wang*

Put the odds-ratio calculations here
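
As a hedged sketch of the odds-ratio calculation, assuming the final model is a logistic regression: the odds ratio for a predictor is the exponential of its coefficient. The coefficient values below are made up for illustration and are not results from this analysis.

```python
import math

# Hypothetical logit coefficients (log-odds scale); NOT fitted values from this report
coefs = {"num_of_changes": 0.10, "age": 0.01}

# Odds ratio = exp(coefficient): the multiplicative change in the odds of readmission
# for a one-unit increase in the predictor, holding the other predictors fixed
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    pct_change = (ratio - 1) * 100
    print(f"{name}: OR = {ratio:.3f} ({pct_change:+.1f}% change in odds per unit)")
```

An odds ratio above 1 indicates that the odds of readmission increase with the predictor; a value below 1 indicates that they decrease.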