# Logistic Regression

## Mapping the Features

Before fitting a logistic regression, we need to map each variable with text values to a set of binary features. For example, suppose there are three admission types: A, B, and C. In the original admission-type column of the data set, each patient encounter has the value A, B, or C. Most machine-learning algorithms, however, require numeric inputs (binary indicators, or numeric categories where appropriate) rather than letters and words. The encoding process creates two new columns, A and B. A patient encounter has 1 in A and 0 in B to indicate admission type A; the converse (0 in A and 1 in B) indicates type B; and 0 in both columns indicates type C. This process is called one-hot encoding (when one category is left out as the reference level, as here, it is also known as dummy coding), and it will be useful in other types of analysis, such as tree-based methods, as well.
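
As a small illustration of the A/B/C example above, the following sketch encodes a toy column with `pd.get_dummies`. The frame and the column name `admission_type` are made up for the illustration and are not taken from the readmission data set.

```python
import pandas as pd

# Toy column with the three admission types from the example above
# (hypothetical column name, not from the real data set).
encounters = pd.DataFrame({"admission_type": ["A", "B", "C", "A"]})

# Fix the category set explicitly so all three indicator columns are created.
encounters["admission_type"] = pd.Categorical(
    encounters["admission_type"], categories=["A", "B", "C"]
)
dummies = pd.get_dummies(encounters["admission_type"], prefix="admission_type", dtype=int)

# Keep A and B only; C becomes the reference level (0 in both columns).
encoded = dummies.drop(columns="admission_type_C")

print(encoded)
```

Each row now has a 1 in at most one of the two columns, and a row of zeros stands for type C.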

In [29]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

In [45]:
#read in wrangled data set 
readmit = pd.read_csv('readmit_for_map.csv')

# separate df without dependent variable
#readmit_noY = readmit.drop('readmit30', axis = 1)

In [46]:
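#split data into training (about 80%) and test (about 20%) with a random boolean mask
split = np.random.rand(len(readmit)) < 0.8

#copy the slices so later column assignments do not act on views of readmit
readmit_train = readmit[split].copy()
readmit_test = readmit[~split].copy()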
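
A random mask gives a split that is only approximately 80/20 and that changes on every run. As an alternative sketch, the `train_test_split` function imported earlier can produce an exact, repeatable split. The toy frame below stands in for `readmit`, and the `random_state` value is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the readmit DataFrame; the real data comes from readmit_for_map.csv.
toy = pd.DataFrame({
    "num_lab_procedures": np.arange(100),
    "readmit30": np.arange(100) % 2,
})

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable.
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=0)

print(len(toy_train), len(toy_test))
```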

In [52]:
#map each categorical or object variable to n-1 binary variables, where n = no. of categories/values in variable
def oneHotEncode2(df, le_dict=None):
    #le_dict=None means training data: fit new encoders. Pass the fitted dictionary back in for test data.
    train = le_dict is None
    if train:
        le_dict = {}
        columnsToEncode = list(df.select_dtypes(include=['category','object'])) #capture categorical and object data
    else:
        columnsToEncode = list(le_dict.keys())

    df = df.copy() #work on a copy so the caller's DataFrame is left unchanged
    for feature in columnsToEncode:
        try:
            if train:
                le_dict[feature] = LabelEncoder() #integer-code the categories seen in training data
                codes = le_dict[feature].fit_transform(df[feature])
            else:
                codes = le_dict[feature].transform(df[feature]) #fails on values not seen in training data
            #fix the full set of codes so training and test data produce the same dummy columns
            n_categories = len(le_dict[feature].classes_)
            codes = pd.Series(pd.Categorical(codes, categories=range(n_categories)), index=df.index)
            #drop_first=True keeps n-1 columns; the dropped category is the all-zeros reference level
            dummies = pd.get_dummies(codes, prefix=feature, drop_first=True)
            df = pd.concat([df.drop(feature, axis=1), dummies], axis=1) #replace original feature with dummies
        except Exception as err:
            print('Error encoding ' + feature + ': ' + str(err)) #e.g. feature missing from test data, or unseen value
            if feature in df.columns:
                df[feature] = pd.to_numeric(df[feature], errors='coerce') #fall back to numeric values (NaN where not convertible)
    return (df, le_dict)

In [53]:
readmit_train, le_dict = oneHotEncode2(readmit_train)

In [54]:
readmit_test, _ = oneHotEncode2(readmit_test, le_dict)
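
After encoding, it is worth checking that the training and test frames ended up with the same columns, since a category that never occurs in one of them can leave a dummy column missing. The sketch below shows one way to check and repair this with `DataFrame.reindex`. The toy frames and their column names are hypothetical.

```python
import pandas as pd

# Toy encoded frames (hypothetical column names): the test frame lacks one
# dummy column because that category never occurred in it.
train_encoded = pd.DataFrame({
    "age": [50, 60],
    "admission_type_0": [1, 0],
    "admission_type_1": [0, 1],
})
test_encoded = pd.DataFrame({"age": [70], "admission_type_0": [1]})

# Columns present in one frame but not in the other
mismatched = set(train_encoded.columns) ^ set(test_encoded.columns)
print(mismatched)

# Give the test frame the training columns, filling absent dummies with 0
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(test_aligned)
```

Filling with 0 is appropriate for indicator columns only; a missing numeric feature would need a different treatment.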

In [58]:
readmit.num_lab_procedures.value_counts()

1      2155
43     1796
44     1562
45     1532
38     1451
46     1448
47     1416
40     1404
39     1360
37     1357
41     1356
42     1324
48     1316
49     1303
35     1244
51     1242
50     1240
36     1223
54     1206
55     1184
52     1161
56     1153
53     1153
57     1140
58     1100
59     1069
34     1061
61     1057
60     1010
62      994
       ... 
88       72
87       63
89       49
90       46
91       45
93       38
94       36
92       34
95       33
97       21
96       19
98       18
100      11
101       7
103       6
99        6
102       4
106       4
108       4
105       3
109       2
111       2
113       2
104       2
120       1
132       1
107       1
114       1
118       1
121       1
Name: num_lab_procedures, dtype: int64