# Introduction

To make the dataset easier for the algorithms in the final analysis to process, it helps to make a few adjustments to its format. These include mapping the features and, later on, under-sampling.

## Preparing to Map Features

Before running a logistic regression, it is important to map each categorical feature into separate binary variables, one per category (more on that process below). First, though, a few variables need preparation. The number of lab procedures ranges from 1 to 132, but for some values very few patients had that exact number of procedures. To keep a desirable ratio of new binary variables to actual observations, we'll bin the number of lab procedures into 13 categories, so that mapping produces at most 13 new variables instead of one per distinct count. Fewer, better-populated categories give the models more observations per variable.

There are also a few ID features (e.g., admission type ID) that are functionally categorical but have numeric values. We'll change their data types so that we can map those variables accordingly, too. 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# read in wrangled data set
readmit = pd.read_csv('readmit_for_map.csv')
readmit.head()

Unnamed: 0,race,gender,age,admission_type_id,discharge_disposition_id,admission_source_id,days_in_hospital,num_lab_procedures,num_procedures,num_medications,...,glimepiride_pioglitazone,metformin_rosiglitazone,metformin_pioglitazone,change,diabetesMed,readmit30,num_visits,first_diag,second_diag,third_diag
0,Caucasian,Female,[50-60),2,1,1,8,77,6,33,...,No,No,No,Ch,Yes,1,2,circulatory,injury,digestive
1,Caucasian,Female,[50-60),3,1,1,2,49,1,11,...,No,No,No,No,No,0,1,musculoskeletal,other,diabetes
2,Caucasian,Female,[80-90),1,3,7,4,68,2,23,...,No,No,No,No,Yes,0,1,injury,respiratory,other
3,Caucasian,Female,[80-90),1,1,7,3,46,0,20,...,No,No,No,Ch,Yes,0,1,neoplasms,circulatory,circulatory
4,AfricanAmerican,Female,[30-40),1,1,7,5,49,0,5,...,No,No,No,No,Yes,0,1,genitourinary,neoplasms,diabetes


In [3]:
# bin number of lab procedures variable
# write function
def bin_labs(col):
    # each branch covers one 10-wide bin (bounds inclusive)
    if 1 <= col <= 10:
        return '[1-10]'
    if 11 <= col <= 20:
        return '[11-20]'
    if 21 <= col <= 30:
        return '[21-30]'
    if 31 <= col <= 40:
        return '[31-40]'
    if 41 <= col <= 50:
        return '[41-50]'
    if 51 <= col <= 60:
        return '[51-60]'
    if 61 <= col <= 70:
        return '[61-70]'
    if 71 <= col <= 80:
        return '[71-80]'
    if 81 <= col <= 90:
        return '[81-90]'
    if 91 <= col <= 100:
        return '[91-100]'
    if 101 <= col <= 110:
        return '[101-110]'
    if 111 <= col <= 120:
        return '[111-120]'
    # fallback: any remaining value (expected to be 121-132) lands in the last bin
    return '[121-132]'
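
As a point of comparison, the same binning can be sketched with `pd.cut`. This is an alternative, not the method used in this notebook, and the values in `labs` are hypothetical stand-ins for `readmit['num_lab_procedures']`. One difference worth noting: `pd.cut` returns a missing value for anything outside the bin edges, whereas `bin_labs` sends every unmatched value to the last bin.

```python
import pandas as pd

# Hypothetical toy values standing in for readmit['num_lab_procedures'].
labs = pd.Series([1, 10, 11, 49, 77, 120, 121, 132])

# Right-inclusive bin edges: (0, 10], (10, 20], ..., (110, 120], (120, 132].
edges = list(range(0, 121, 10)) + [132]

# Build labels in the same '[low-high]' style used by bin_labs.
labels = [f'[{lo + 1}-{hi}]' for lo, hi in zip(edges[:-1], edges[1:])]

# Cast back to object so the result behaves like the string column above.
binned = pd.cut(labs, bins=edges, labels=labels).astype(object)
print(binned.tolist())
```

Because the edges and labels are generated from one list, changing the bin width only requires changing the `range` step.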

In [4]:
# apply function to relevant variable, check df
readmit['num_lab_procs'] = readmit.num_lab_procedures.apply(bin_labs)
readmit.head()

Unnamed: 0,race,gender,age,admission_type_id,discharge_disposition_id,admission_source_id,days_in_hospital,num_lab_procedures,num_procedures,num_medications,...,metformin_rosiglitazone,metformin_pioglitazone,change,diabetesMed,readmit30,num_visits,first_diag,second_diag,third_diag,num_lab_procs
0,Caucasian,Female,[50-60),2,1,1,8,77,6,33,...,No,No,Ch,Yes,1,2,circulatory,injury,digestive,[71-80]
1,Caucasian,Female,[50-60),3,1,1,2,49,1,11,...,No,No,No,No,0,1,musculoskeletal,other,diabetes,[41-50]
2,Caucasian,Female,[80-90),1,3,7,4,68,2,23,...,No,No,No,Yes,0,1,injury,respiratory,other,[61-70]
3,Caucasian,Female,[80-90),1,1,7,3,46,0,20,...,No,No,Ch,Yes,0,1,neoplasms,circulatory,circulatory,[41-50]
4,AfricanAmerican,Female,[30-40),1,1,7,5,49,0,5,...,No,No,No,Yes,0,1,genitourinary,neoplasms,diabetes,[41-50]
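
Since the final branch of `bin_labs` catches every value not matched earlier, it may be worth confirming that nothing falls outside the expected 1 to 132 range before trusting the last bin. The following is a minimal sketch on hypothetical values (0 and 140 are deliberately out of range for illustration); in the notebook the check would run on `readmit['num_lab_procedures']`.

```python
import pandas as pd

# Hypothetical toy column; 0 and 140 are included only to show what the check catches.
labs = pd.Series([1, 49, 77, 132, 0, 140])

# between() is inclusive on both ends by default; missing values count as out of range.
outside = labs[~labs.between(1, 132)]
print(outside.tolist())
```

An empty result would mean every observation was binned by a range that actually describes it.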


In [5]:
# drop old lab-procedures variable
readmit_procs = readmit.drop(columns=['num_lab_procedures'])

In [6]:
# change categorical variables from numeric to object where necessary
id_cols = ['admission_type_id', 'discharge_disposition_id', 'admission_source_id']
readmit_procs[id_cols] = readmit_procs[id_cols].astype(object)
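
To illustrate why the dtype change matters, here is a small sketch on a hypothetical two-column frame (the values are made up; only the column names echo the dataset). With default settings, `pd.get_dummies` encodes object and category columns and passes integer columns through unchanged, so a numeric-coded ID would otherwise be treated as a quantity.

```python
import pandas as pd

# Hypothetical toy frame: one numeric-coded ID column and one true count.
toy = pd.DataFrame({'admission_type_id': [1, 2, 3, 1],
                    'num_procedures': [0, 6, 1, 2]})

# While both columns are integers, get_dummies leaves the frame as it is.
untouched = pd.get_dummies(toy)

# After casting the ID column to object, only that column is expanded.
toy['admission_type_id'] = toy['admission_type_id'].astype(object)
expanded = pd.get_dummies(toy)

print(list(untouched.columns))
print(list(expanded.columns))
```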

## Mapping the Features

Before performing logistic regression, we need to map variables with text values to a series of binary features. For example, suppose there are three admission types: A, B, and C. In the original admission-type column, each patient encounter has A, B, or C. But most machine-learning algorithms expect numeric input (binaries and, where appropriate, numerical categories) instead of letters and words. The encoding process creates two new columns: A and B. A patient encounter has 1 in A and 0 in B to indicate admission type A. The converse encoding indicates type B, and 0 in both columns indicates type C, which serves as the reference category. This process is called one-hot encoding (with one category dropped, it is also known as dummy coding), and it will be helpful in other types of analysis, such as tree-based methods, as well.
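
The A/B/C example can be sketched on a hypothetical toy frame as follows. The column name `admission_type` and its values are illustrative only, and `dtype=int` is passed so the output shows 0/1 regardless of pandas version. Note that `drop_first=True` drops the first category, so in this sketch A (not C) becomes the reference; the logic is otherwise the same as described above.

```python
import pandas as pd

# Hypothetical toy column with three admission types.
toy = pd.DataFrame({'admission_type': ['A', 'B', 'C', 'A']})

# Full one-hot encoding: one binary column per category.
full = pd.get_dummies(toy, dtype=int)

# n-1 encoding: the first category (A) is dropped and acts as the reference.
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(list(full.columns))
print(reduced)
```

In `reduced`, a row of all zeros identifies the dropped category, which avoids the perfect collinearity that a full set of dummies would introduce in a regression with an intercept.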

In [7]:
# check variable types -- only integer and object
print(readmit_procs.dtypes.unique())

[dtype('O') dtype('int64')]


In [8]:
# one-hot encoding for each object-column value
# columns=None converts all object and category columns; drop_first=True creates n-1 dummies for n categories of a variable
one_hot = pd.get_dummies(readmit_procs, columns=None, drop_first=True)
one_hot.head()

Unnamed: 0,days_in_hospital,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,number_diagnoses,readmit30,num_visits,race_Asian,...,num_lab_procs_[111-120],num_lab_procs_[121-132],num_lab_procs_[21-30],num_lab_procs_[31-40],num_lab_procs_[41-50],num_lab_procs_[51-60],num_lab_procs_[61-70],num_lab_procs_[71-80],num_lab_procs_[81-90],num_lab_procs_[91-100]
0,8,6,33,0,0,0,8,1,2,0,...,0,0,0,0,0,0,0,1,0,0
1,2,1,11,0,0,0,3,0,1,0,...,0,0,0,0,1,0,0,0,0,0
2,4,2,23,0,0,0,9,0,1,0,...,0,0,0,0,0,0,1,0,0,0
3,3,0,20,0,0,0,9,0,1,0,...,0,0,0,0,1,0,0,0,0,0
4,5,0,5,0,0,0,3,0,1,0,...,0,0,0,0,1,0,0,0,0,0


In [9]:
# write one-hot-encoded dataframe to csv for use in other analyses
one_hot.to_csv('diabetes_readmission_onehot.csv')
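
A note on the `Unnamed: 0` column that appears in the outputs above: this is the name pandas assigns when a CSV's first column has no header, which typically happens when a frame is written with its index and read back without `index_col`. Whether that is how it arose in this dataset is an assumption. The sketch below uses a hypothetical two-row frame and an in-memory buffer to show the effect of `index=False`.

```python
import io
import pandas as pd

# Hypothetical toy frame; the column names echo the dataset, the values are made up.
toy = pd.DataFrame({'readmit30': [1, 0], 'num_visits': [2, 1]})

# Default to_csv writes the index as an unnamed first column.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False writes only the data columns.
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

If downstream analyses do not need the row index, writing with `index=False` keeps an extra integer column from being picked up as a feature.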