## Introduction

TEAMMATES: Akshat and Annie

The overall goal is to predict whether a payment by a company to a medical doctor or facility
was made as part of a research project or not.

### Imports

In [2]:
# data loading and manipulation
import pandas as pd
import numpy as np
import random
# from dirty_cat import TargetEncoder
from category_encoders import TargetEncoder

# scikit learn
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error, confusion_matrix, roc_auc_score, roc_curve, auc, average_precision_score
from sklearn.ensemble import RandomForestClassifier

# unbalanced sets
from imblearn.under_sampling import RandomUnderSampler

# plotting
import matplotlib.pyplot as plt

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

### Load data

The positive class (class 1) consists of payments made by a company to a doctor or facility as part of a **research project**. The negative class (class 0) consists of the **general payments**.

In the original data sets, the ratio of the positive class to the negative class is roughly 1/20, making the positive class the minority class.

Because the data sets are so large, we subsample both classes at the same rate in order to maintain that ratio. Sampling about 20% of each file gives roughly 2.1M data points from class 0 and roughly 120K data points from class 1.

In [3]:
# # Import 20% data randomly
# p = 0.2
# df0 = pd.read_csv('../payments2017/d0.csv', skiprows=lambda i: i>0 and random.random() > p)
# df1 = pd.read_csv('../payments2017/d1.csv', skiprows=lambda i: i>0 and random.random() > p)

# # Write sampled data for future use
# df0.to_csv('../payments2017/gen_payments_sampled.csv')
# df1.to_csv('../payments2017/res_payments_sampled.csv')
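
The commented-out sampling cell above draws from the module-level `random` state, so each run selects different rows, and `to_csv` with its default `index=True` writes the row index as an extra column (a likely source of the `Unnamed: 0` column that is dropped later). Below is a minimal sketch of a reproducible variant. It assumes a seeded `random.Random` and uses a small in-memory CSV in place of the real files; the helper name `sample_csv` is hypothetical.

```python
import io
import random

import pandas as pd


def sample_csv(source, p, seed=42):
    """Keep the header and roughly a fraction p of the data rows."""
    rng = random.Random(seed)  # private, seeded generator for repeatable draws
    return pd.read_csv(source, skiprows=lambda i: i > 0 and rng.random() > p)


# Toy stand-in for a payments file: 1000 rows, two columns.
csv_text = "id,amount\n" + "\n".join(f"{i},{i * 1.5}" for i in range(1000))

sample = sample_csv(io.StringIO(csv_text), p=0.2)

# index=False avoids writing the row index as an unnamed extra column.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
reloaded = pd.read_csv(io.StringIO(buffer.getvalue()))
```

With the same seed, a second call returns the same rows, which makes the sampled files reproducible without having to store them.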

In [4]:
# Import from sampled files
df0 = pd.read_csv('../payments2017/gen_payments_sampled.csv')
df1 = pd.read_csv('../payments2017/res_payments_sampled.csv')

In [5]:
df0.shape

(2135022, 76)

In [6]:
df1.shape

(120693, 177)
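
As a quick check of the class ratio, the arithmetic below uses only the two row counts printed above; the variable names are ours.

```python
# Row counts taken from the df0.shape and df1.shape outputs above.
n_negative = 2135022  # general payments (class 0)
n_positive = 120693   # research payments (class 1)

negatives_per_positive = n_negative / n_positive
positive_share = n_positive / (n_negative + n_positive)

print(f"negatives per positive: {negatives_per_positive:.1f}")
print(f"positive share of all rows: {positive_share:.2%}")
```

This gives about 17.7 general payments per research payment, so the positive class makes up about 5.35% of the sampled rows, close to the 1/20 ratio quoted earlier.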

## Feature Intersection

What features should be excluded because they leak the target information?

There are 75 features present in the negative class and 176 in the positive class (the shapes above presumably also count the index column written when the samples were saved). To combine the data sets for the positive and negative classes, we take the intersection of their features.

In [7]:
notPrs = list(set(list(df1.columns)).difference(list(df0.columns)))
featureIntersection = list(set(list(df1.columns)).difference(notPrs))
print("There are {} features present in the intersection of the two dataframes.".format(len(featureIntersection) - 1))

df1 = df1[featureIntersection]
df0 = df0[featureIntersection]

There are 64 features present in the intersection of the two dataframes.
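
Because `set` objects are unordered, the column order produced by the cell above can differ between runs. A minimal sketch of an order-preserving intersection follows. The helper name `shared_columns` and the toy column lists are assumptions for illustration, not the real schemas.

```python
import pandas as pd


def shared_columns(left, right):
    """Columns present in both frames, in the order they appear in `left`."""
    right_cols = set(right.columns)
    return [col for col in left.columns if col in right_cols]


# Toy frames with made-up schemas that overlap only partially.
research = pd.DataFrame(columns=["Record_ID", "Name_of_Study", "Change_Type", "Program_Year"])
general = pd.DataFrame(columns=["Change_Type", "Record_ID", "Nature_of_Payment", "Program_Year"])

common = shared_columns(research, general)
research, general = research[common], general[common]
```

Both frames end up with the same columns in the same order, which keeps the later `pd.concat` and any column-position logic deterministic.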


Before we concatenate the two data sets, we add an indicator variable to each one specifying which class the data belongs to. We call this feature **Target**; it is equal to 1 for the positive class and 0 for the negative class.

In [8]:
df1['Target'] = 1
df0['Target'] = 0

df = pd.concat([df1, df0], axis=0)
df.shape

(2255715, 66)

## NA Columns

We examine the missing values of the data and see that a lot of the features have the majority of their data missing.

In [23]:
NAs = df.isna().mean().sort_values(ascending=False)

In [24]:
NAs

Recipient_Province                                                  0.999946
Recipient_Postal_Code                                               0.999930
Physician_License_State_code5                                       0.999841
Physician_License_State_code4                                       0.999206
Associated_Drug_or_Biological_NDC_5                                 0.997693
Physician_License_State_code3                                       0.995460
Product_Category_or_Therapeutic_Area_5                              0.993670
Indicate_Drug_or_Biological_or_Device_or_Medical_Supply_5           0.993548
Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_5            0.993480
Covered_or_Noncovered_Indicator_5                                   0.993370
Teaching_Hospital_CCN                                               0.987726
Teaching_Hospital_Name                                              0.987726
Teaching_Hospital_ID                                                0.987726
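
The missing-value shares computed above are what the later sections threshold on (any missing value for the baseline, more than half missing for the engineered features). A small sketch of that selection on a toy frame, with made-up column names:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, "x"],
    "half_missing": [1.0, np.nan, 3.0, np.nan],
    "complete": [1, 2, 3, 4],
})

# Fraction of missing values per column, largest first.
na_share = toy.isna().mean().sort_values(ascending=False)

any_missing = list(na_share[na_share > 0].index)      # strict rule: any NA at all
mostly_empty = list(na_share[na_share > 0.5].index)   # looser rule: more than half NA
```

Note that a strict `>` comparison keeps a column that is exactly half missing, as `half_missing` is here.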

## Train-Test Split

In [31]:
features = df.drop(columns='Target')
target = df['Target']
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=42)
model_scores = dict()

## Random Undersampling

In [32]:
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)
X_train_rus = pd.DataFrame(X_train_rus, columns = X_train.columns)

X_train, X_val, y_train, y_val = train_test_split(X_train_rus, y_train_rus, random_state=42)
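
To make the undersampling step concrete, here is a hand-written sketch of the idea: keep every minority-class row and draw an equally sized random subset of the majority class. It is a stand-in for illustration, not the implementation inside `RandomUnderSampler`, and the helper name `undersample` is hypothetical.

```python
import numpy as np
import pandas as pd


def undersample(X, y, seed=42):
    """Randomly drop majority-class rows until every class has the minority count."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(y)
    n_keep = pd.Series(labels).value_counts().min()
    # Work with positions, so duplicate index labels (e.g. after concat) are harmless.
    keep = np.concatenate([
        rng.choice(np.flatnonzero(labels == label), size=n_keep, replace=False)
        for label in np.unique(labels)
    ])
    keep.sort()  # preserve the original row order
    return X.iloc[keep], pd.Series(labels[keep], index=X.index[keep])


X_toy = pd.DataFrame({"amount": np.arange(21, dtype=float)})
y_toy = pd.Series([1] + [0] * 20)  # one positive for every twenty negatives

X_bal, y_bal = undersample(X_toy, y_toy)
```

One consequence worth keeping in mind: because the validation split above is taken from the undersampled data, it is balanced too, so scores on it are not directly comparable to scores on the untouched test set.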

## Feature Identification ~ Task 1

We identify possibly irrelevant columns by looking for features that are names or IDs, such as the hospital ID, record ID, postal codes, or physician names.

In [34]:
columns_to_drop = ['Recipient_Province', 
'Recipient_Postal_Code', 
'Recipient_Primary_Business_Street_Address_Line2',
'Teaching_Hospital_Name', 
'Teaching_Hospital_CCN',
'Teaching_Hospital_ID',
'Physician_Name_Suffix',       
'Program_Year', 
'Physician_Profile_ID', 
'Physician_Last_Name', 
'Physician_First_Name',
'Record_ID',
'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_ID',
'Physician_Profile_ID',
'Recipient_Zip_Code',
'Date_of_Payment',
'Physician_Middle_Name',
'Payment_Publication_Date', 
'Unnamed: 0' # Index col from one of the DFs
]

As part of our baseline estimate, we will also drop every column that has any missing values. Later on we keep some of these columns and impute the missing values instead.

In [35]:
nan_columns = NAs[NAs > 0] 
nan_columns = np.array(nan_columns.index)
to_drop_baseline = list(set(nan_columns) | set(columns_to_drop))

In [36]:
X_train_Baseline = X_train.drop(columns=to_drop_baseline, axis ='columns')

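We check how well each variable predicts the target on its own, using 5-fold cross-validated accuracy (the default score for a classifier), to identify leakage issues.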

In [46]:
objVars = ['Covered_Recipient_Type', # leaking target info
            'Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name', # leaking target info
            'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name', # leaking target info
           'Form_of_Payment_or_Transfer_of_Value',
           'Dispute_Status_for_Publication', 
           'Delay_in_Publication_Indicator',
           'Related_Product_Indicator',
           'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country',
           'Change_Type',
           'Total_Amount_of_Payment_USDollars']

single_var_auc = dict()
single_var_acc = dict()

for var in objVars:
    
    if var != 'Total_Amount_of_Payment_USDollars':
        baseline_pipe = Pipeline([
                                ("dummies", OneHotEncoder(handle_unknown='ignore')),
                                ("logreg", LogisticRegression(solver='lbfgs'))])
    else:
        baseline_pipe = Pipeline([('scalar', StandardScaler()),
                                   ("logreg", LogisticRegression(solver='lbfgs'))])

    # Baseline Training and testing
#     logreg = baseline_pipe.fit(X_train[[var]], y_train)
#     y_score = logreg.predict_proba(X_val[[var]])
    
    # Store in dict
#     single_var_auc[var] = roc_auc_score(y_val, y_score[:, 1])
    
    single_var_acc[var] = np.mean(cross_val_score(baseline_pipe, X_train[[var]], y_train, cv=5))
    
single_var_acc

{'Covered_Recipient_Type': 0.9776754060061915,
 'Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name': 0.7644055216048586,
 'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name': 0.8555550430863548,
 'Form_of_Payment_or_Transfer_of_Value': 0.8169737585085819,
 'Dispute_Status_for_Publication': 0.500808064079488,
 'Delay_in_Publication_Indicator': 0.5006391044268736,
 'Related_Product_Indicator': 0.541409525582002,
 'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country': 0.549585674391527,
 'Change_Type': 0.5181226628317186,
 'Total_Amount_of_Payment_USDollars': 0.752034873891773}

In [47]:
{key: single_var_acc[key] for key in single_var_acc if single_var_acc[key] > 0.8}

{'Covered_Recipient_Type': 0.9776754060061915,
 'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name': 0.8555550430863548,
 'Form_of_Payment_or_Transfer_of_Value': 0.8169737585085819}

We see that three of the scores stand out: 

- Covered_Recipient_Type,
- Form_of_Payment_or_Transfer_of_Value, and 
- Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name

It is likely that these features leak target information, and so we drop them from the sample. 
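
The single-variable check can be illustrated on synthetic data. In the sketch below, the column `leaky` is constructed directly from the label and `noise` is unrelated to it; both names and the data are made up, and we score with ROC AUC rather than accuracy.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "leaky": np.where(y == 1, "research", "general"),  # copies the label
    "noise": rng.choice(["a", "b", "c"], size=n),      # independent of the label
})


def single_feature_auc(frame, target, column):
    """5-fold cross-validated ROC AUC of a one-feature logistic regression."""
    pipe = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                         LogisticRegression(max_iter=1000))
    return cross_val_score(pipe, frame[[column]], target, scoring="roc_auc", cv=5).mean()


scores = {column: single_feature_auc(toy, y, column) for column in toy.columns}
```

A feature that scores near 1.0 by itself, as `leaky` does here, is a candidate for leakage, while an uninformative one stays near 0.5. A high score alone is not proof of leakage, so the flagged columns still need a look at what they actually record.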

In [54]:
X_train_Baseline.nunique()

Dispute_Status_for_Publication                                          2
Delay_in_Publication_Indicator                                          1
Change_Type                                                             3
Related_Product_Indicator                                               2
Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country       20
Total_Amount_of_Payment_USDollars                                   35659
Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name             818
dtype: int64

Additionally, we can remove the feature Delay_in_Publication_Indicator as it only has one unique value and would not add important information to the model.

In [56]:
columns_to_drop += ['Covered_Recipient_Type', # leaking target info
                    'Form_of_Payment_or_Transfer_of_Value', # leaking target info
                    'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name', # leaking target info
                    'Delay_in_Publication_Indicator' # single unique value
                    ]

In [57]:
to_drop_baseline = list(set(nan_columns) | set(columns_to_drop))

In [58]:
X_train_Baseline = X_train.drop(columns=to_drop_baseline, axis ='columns')

Our new baseline model consists of 6 features, 5 of which are categorical and 1 of which is continuous.

In [59]:
X_train_Baseline.shape

(136128, 6)

In [60]:
X_train_Baseline.head()

Unnamed: 0,Dispute_Status_for_Publication,Change_Type,Related_Product_Indicator,Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country,Total_Amount_of_Payment_USDollars,Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name
118253,No,UNCHANGED,Yes,United States,10619.1,"Takeda Pharmaceuticals U.S.A., Inc."
179387,No,UNCHANGED,No,United States,15.0,Incyte Corporation
110251,No,UNCHANGED,Yes,United States,1011.15,Eisai Inc.
92018,No,UNCHANGED,Yes,United States,522.48,Alcon Research Ltd
29002,No,UNCHANGED,Yes,United States,13.03,Sanofi and Genzyme US Companies


## Baselining ~ Task 2

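We can handle the categorical variables in multiple ways. For the baseline, we one-hot encode the five categorical variables and pass the continuous payment amount through unchanged.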

In [64]:
# Defining continuous and categorical variables
objVars = ['Dispute_Status_for_Publication', 'Change_Type',
       'Related_Product_Indicator',
       'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country',
       'Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name']

contVars = ['Total_Amount_of_Payment_USDollars']

In [65]:
# Defining ColumnTransformer
preprocessor = ColumnTransformer(transformers=[("dummies", OneHotEncoder(handle_unknown='ignore'), objVars)],
                                 remainder='passthrough')

# Create pipelines
baseline_pipe = Pipeline(steps=[('preprocessor', preprocessor),
                          ("logreg", LogisticRegression(C=1000000, solver='lbfgs', max_iter=1000))
                         ])
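
The baseline passes the dollar amount through unscaled via `remainder='passthrough'`, while the later models standardize it. A minimal sketch of the scaled variant on a toy frame follows; the rows are made up (the `Change_Type` levels other than `UNCHANGED` are assumptions), and `toy_preprocessor` is a hypothetical name.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "Change_Type": ["NEW", "UNCHANGED", "CHANGED", "UNCHANGED"],
    "Related_Product_Indicator": ["Yes", "No", "Yes", "Yes"],
    "Total_Amount_of_Payment_USDollars": [10619.10, 15.00, 1011.15, 522.48],
})
cat_cols = ["Change_Type", "Related_Product_Indicator"]
num_cols = ["Total_Amount_of_Payment_USDollars"]

toy_preprocessor = ColumnTransformer(transformers=[
    ("dummies", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("scaler", StandardScaler(), num_cols),
])

encoded = toy_preprocessor.fit_transform(toy)
# One output column per category level, plus one for the scaled amount.
n_dummy_columns = sum(toy[col].nunique() for col in cat_cols)
```

Standardizing the amount puts it on the same scale as the 0/1 dummy columns, which tends to help `lbfgs` converge in fewer iterations.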

In [76]:
# Baseline Training and testing
# logreg = baseline_pipe.fit(pd.DataFrame(X_train_rus, columns = X_train.columns), y_train_rus)
# y_score = logreg.predict_proba(pd.DataFrame(X_train_rus, columns = X_train.columns))
# baseline= roc_auc_score(y_train_rus, y_score[:, 1])

baseline_acc = cross_val_score(baseline_pipe, X_train_Baseline, y_train, cv=5)
baseline_cv_score_acc = np.mean(baseline_acc)
model_scores['baseline_cv_acc'] = baseline_cv_score_acc

In [77]:
baseline_roc = cross_val_score(baseline_pipe, X_train_Baseline, y_train, scoring= 'roc_auc', cv=5)
baseline_cv_score_roc = np.mean(baseline_roc)
model_scores['baseline_cv_roc'] = baseline_cv_score_roc

In [78]:
model_scores

{'baseline_cv_acc': 0.8829996800013659, 'baseline_cv_roc': 0.9507626293414436}

## Feature engineering ~ Task 3

**Handling the NAs:**

We impute missing categorical values with the constant category 'Missing'. Columns in which more than half of the values are missing are still dropped. Some of these columns could leak target information in principle, but with so many missing values they carry little usable signal either way.

In [79]:
nan_columns = NAs[NAs > 0.5] 
nan_columns = np.array(nan_columns.index)
columns_to_drop = list(set(nan_columns) | set(columns_to_drop))
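
For the columns that are kept, constant imputation turns missingness into its own category level. A small sketch of that effect on a toy column (the values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame(
    {"Recipient_State": pd.Series(["NY", np.nan, "CA", np.nan], dtype=object)}
)

impute_then_encode = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="Missing"),
    OneHotEncoder(handle_unknown="ignore"),
)

dummies = impute_then_encode.fit_transform(toy).toarray()
# The learned levels include "Missing" alongside the observed states.
levels = list(impute_then_encode.named_steps["onehotencoder"].categories_[0])
```

The model can then learn a separate coefficient for the "Missing" level, which matters when missingness itself differs between the two payment types.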

In [80]:
X_train_engineered = X_train.drop(columns=columns_to_drop)

In [81]:
obj_vars = X_train_engineered.drop(columns=['Total_Amount_of_Payment_USDollars']).columns.values
cont_vars = ['Total_Amount_of_Payment_USDollars']

In [85]:
X_train_engineered.columns

Index(['Physician_Primary_Type',
       'Indicate_Drug_or_Biological_or_Device_or_Medical_Supply_1',
       'Dispute_Status_for_Publication',
       'Product_Category_or_Therapeutic_Area_1', 'Recipient_State',
       'Change_Type', 'Related_Product_Indicator',
       'Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_1',
       'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State',
       'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country',
       'Associated_Drug_or_Biological_NDC_1', 'Physician_License_State_code1',
       'Recipient_Primary_Business_Street_Address_Line1', 'Recipient_City',
       'Total_Amount_of_Payment_USDollars',
       'Covered_or_Noncovered_Indicator_1', 'Recipient_Country',
       'Physician_Specialty',
       'Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name'],
      dtype='object')

### Identify high cardinality categorical variables

In [28]:
# Identify which variables to target encode
target_based_encoding = []
for col in obj_vars:
    print(col, len(X_train_engineered[col].unique()))
    
    if len(X_train_engineered[col].unique()) > 100:
        target_based_encoding.append(col)

len(target_based_encoding)

Physician_License_State_code1 60
Physician_License_State_code2 54
Covered_or_Noncovered_Indicator_5 3
Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State 47
Recipient_City 15144
Product_Category_or_Therapeutic_Area_5 202
Associated_Drug_or_Biological_NDC_2 537
Form_of_Payment_or_Transfer_of_Value 6
Recipient_State 60
Covered_or_Noncovered_Indicator_4 3
Covered_or_Noncovered_Indicator_1 3
Recipient_Primary_Business_Street_Address_Line1 299591
Covered_or_Noncovered_Indicator_3 3
Recipient_Country 15
Physician_Primary_Type 7
Physician_License_State_code5 32
Indicate_Drug_or_Biological_or_Device_or_Medical_Supply_1 5
Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_5 526
Physician_License_State_code4 43
Physician_Specialty 374
Change_Type 3
Indicate_Drug_or_Biological_or_Device_or_Medical_Supply_5 5
Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_1 8260
Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_4 988
Name_of_Drug_or_Biological_or_Device_or_Medical_Suppl

17
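
The same split can be computed without a loop. The sketch below is a vectorized alternative on a toy frame with made-up values; `dropna=False` is used so that NaN counts as a level, matching `len(col.unique())` in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Recipient_City": [f"city_{i}" for i in range(150)],
    "Change_Type": ["NEW", "UNCHANGED", "CHANGED"] * 50,
})

threshold = 100  # same cut-off as in the loop above
cardinality = toy.nunique(dropna=False)
high_cardinality = list(cardinality[cardinality > threshold].index)
low_cardinality = list(cardinality[cardinality <= threshold].index)
```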

In [29]:
# Final categorical variables
categorical = [cols for cols in obj_vars if cols not in target_based_encoding]
len(categorical) + len(target_based_encoding)

43

In [30]:
# # Train-Test split
# target = X_train_engineered['Target']
# features = X_train_engineered.drop(columns='Target')
# X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=42)

# Random undersampling
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train_engineered, y_train)
X_train_rus = pd.DataFrame(X_train_rus, columns = X_train_engineered.columns)

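We check for target leakage again, since we have added features that were not used before (this covers all variables except the target-encoded ones). Note that the AUC in this cell is computed on the same rows the model was fit on, so it is optimistic for features with many levels.

Missing values in the categorical data are imputed with the value "Missing".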

In [83]:
single_var = dict()

cat_pipe = Pipeline([
                        ('Impute', SimpleImputer(strategy='constant', fill_value="Missing")),
                        ("dummies", OneHotEncoder(handle_unknown='ignore')),
                        ("logreg", LogisticRegression(C=1000000, solver='lbfgs', max_iter=1000))])

cont_pipe = Pipeline([
                        ('scalar', StandardScaler()),
                        ("logreg", LogisticRegression(C=1000000, solver='lbfgs', max_iter=1000))])

for var in X_train_rus.columns:
    
    if var != 'Total_Amount_of_Payment_USDollars':
        # Baseline Training and testing
        logreg = cat_pipe.fit(X_train_rus[[var]], y_train_rus)
        y_score = logreg.predict_proba(X_train_rus[[var]])
            
    else:
        # Baseline Training and testing
        logreg = cont_pipe.fit(X_train_rus[[var]], y_train_rus)
        y_score = logreg.predict_proba(X_train_rus[[var]])
    
    # Store in dict
    single_var[var] = roc_auc_score(y_train_rus, y_score[:, 1])
    
single_var    

{'Form_of_Payment_or_Transfer_of_Value': 0.8162809022541647,
 'Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name': 0.8614237063860364,
 'Covered_Recipient_Type': 0.9791609259787764,
 'Change_Type': 0.5311575726709323,
 'Delay_in_Publication_Indicator': 0.5,
 'Dispute_Status_for_Publication': 0.5000940359106548,
 'Related_Product_Indicator': 0.5408447743691297,
 'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name': 0.9425137176068114,
 'Total_Amount_of_Payment_USDollars': 0.8966281827879227,
 'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country': 0.5521105709016974}

We add the leaking variables to the list of columns to be removed. A feature is flagged as leaking target information when its single-feature ROC AUC exceeds 0.7. For features that come in numbered groups (suffixes 1-5), we remove the whole group.

In [84]:
{key: single_var[key] for key in single_var if single_var[key] > 0.7}

{'Form_of_Payment_or_Transfer_of_Value': 0.8162809022541647,
 'Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name': 0.8614237063860364,
 'Covered_Recipient_Type': 0.9791609259787764,
 'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name': 0.9425137176068114,
 'Total_Amount_of_Payment_USDollars': 0.8966281827879227}
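
One caveat when reading these numbers: an AUC computed on the training rows can look like leakage when the feature simply has many levels. The sketch below uses a synthetic column with one distinct value per row (a made-up `row_id`) and compares the in-sample AUC with a cross-validated one.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

n = 300
y = np.tile([0, 1], n // 2)
# One distinct level per row: the feature carries no generalizable signal.
ids = pd.DataFrame({"row_id": [f"id_{i}" for i in range(n)]})

pipe = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                     LogisticRegression(max_iter=1000))

# In-sample: the model memorizes each row's label through its own dummy column.
train_auc = roc_auc_score(y, pipe.fit(ids, y).predict_proba(ids)[:, 1])
# Out-of-sample: every held-out level is unseen, so predictions are constant.
cv_auc = cross_val_score(pipe, ids, y, scoring="roc_auc", cv=5).mean()
```

The in-sample score is near 1.0 while the cross-validated score sits at chance, so a cross-validated AUC is the safer basis for the 0.7 threshold.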

In [79]:
columns_to_drop += ['Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_1', 
                    'Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_2', 
                    'Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_3',
                    'Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_4',
                    'Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_5',
                    
                    'Physician_License_State_code1',
                    'Physician_License_State_code2',
                    'Physician_License_State_code3',
                    'Physician_License_State_code4',
                    'Physician_License_State_code5', 
                    
                    'Recipient_Primary_Business_Street_Address_Line1',
                    'Recipient_Primary_Business_Street_Address_Line2',
                    'Recipient_Primary_Business_Street_Address_Line3',
                    'Recipient_Primary_Business_Street_Address_Line4',
                    'Recipient_Primary_Business_Street_Address_Line5',
                    
                    
                    
                   'Physician_Primary_Type',
                   'Physician_Specialty',
                   'Physician_License_State_code1', 
                   #Associated_Drug_or_Biological_NDC_1, 
                   #Form_of_Payment_or_Transfer_of_Value, 
                   ]

In [33]:
X_train_engineered = X_train.drop(columns=columns_to_drop, errors='ignore')  # skip listed names that are not in X_train

In [34]:
obj_vars = X_train_engineered.drop(columns=['Total_Amount_of_Payment_USDollars']).columns.values
cont_vars = ['Total_Amount_of_Payment_USDollars']

In [35]:
# Identify which variables to target encode
target_based_encoding = []
for col in obj_vars:
    print(col, len(X_train_engineered[col].unique()))
    
    if len(X_train_engineered[col].unique()) > 100:
        target_based_encoding.append(col)

len(target_based_encoding)

Physician_License_State_code2 54
Covered_or_Noncovered_Indicator_5 3
Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State 47
Recipient_City 15144
Product_Category_or_Therapeutic_Area_5 202
Associated_Drug_or_Biological_NDC_2 537
Form_of_Payment_or_Transfer_of_Value 6
Recipient_State 60
Covered_or_Noncovered_Indicator_4 3
Covered_or_Noncovered_Indicator_1 3
Covered_or_Noncovered_Indicator_3 3
Recipient_Country 15
Physician_License_State_code5 32
Indicate_Drug_or_Biological_or_Device_or_Medical_Supply_1 5
Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_5 526
Physician_License_State_code4 43
Change_Type 3
Indicate_Drug_or_Biological_or_Device_or_Medical_Supply_5 5
Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_4 988
Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_2 2550
Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_3 1785
Associated_Drug_or_Biological_NDC_4 189
Associated_Drug_or_Biological_NDC_5 82
Product_Category_or_Therapeutic_Area_1 1588
Delay

14

In [36]:
# Final categorical variables
categorical = [cols for cols in obj_vars if cols not in target_based_encoding]
len(categorical) + len(target_based_encoding)

38

In [37]:
# Random undersampling
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train_engineered, y_train)
X_train_rus = pd.DataFrame(X_train_rus, columns = X_train_engineered.columns)
X_train_rus = X_train_rus[categorical + cont_vars]

In [59]:
pd.DataFrame(X_train_rus.columns)

Unnamed: 0,0
0,Physician_License_State_code2
1,Covered_or_Noncovered_Indicator_3
2,Applicable_Manufacturer_or_Applicable_GPO_Maki...
3,Recipient_State
4,Associated_Drug_or_Biological_NDC_5
5,Applicable_Manufacturer_or_Applicable_GPO_Maki...
6,Covered_or_Noncovered_Indicator_5
7,Dispute_Status_for_Publication
8,Physician_License_State_code3
9,Form_of_Payment_or_Transfer_of_Value


### Model without high cardinality categorical variables

In [104]:
# Model without high cardinality categorical variables
# Defining ColumnTransformer
preprocessor = ColumnTransformer(transformers=[("scalar", StandardScaler(), cont_vars),
                                              ("dummies", make_pipeline(SimpleImputer(strategy='constant', fill_value="Missing"),
                                                                        OneHotEncoder(handle_unknown='ignore')), categorical)
                                             ])

# Create pipelines
take2_pipe = Pipeline(steps=[('preprocessor', preprocessor),
                             ("logreg", LogisticRegression(C=1000000, solver='lbfgs', max_iter=500))
                            ])



In [105]:
# Baseline Training and testing
imputed_model = cross_val_score(take2_pipe, X_train_rus, y_train_rus, scoring='roc_auc', cv=5)
imputed_model_cv_score = np.mean(imputed_model)
model_scores['imputed_model_cv_score'] = imputed_model_cv_score

In [106]:
model_scores

{'baseline_cv': 0.9362845959438688,
 'imputed_model_cv_score': 0.9535514508509628}

### Including Categorical Columns with high cardinality with Target Encoding

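We run the target encoding as a separate step, outside the pipeline, because it takes a long time. Since the encoder is fit on all training rows before cross-validation, the validation folds' labels inform the encoding, and the cross-validated scores below are likely somewhat optimistic.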

In [38]:
# Random undersampling
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train_engineered, y_train)
X_train_rus = pd.DataFrame(X_train_rus, columns = X_train_engineered.columns)

In [39]:
# Target Encoding
# Takes a long time (~10 minutes), but does the job

# Fill NAs of the high-cardinality categorical variables with the string "None"
for col in target_based_encoding:
    X_train_rus[col] = X_train_rus[col].fillna("None")

# Fitting target encoder
target_enc = TargetEncoder(verbose=1, cols=target_based_encoding, return_df=True, handle_unknown='ignore')
targets_encoded = target_enc.fit_transform(X_train_rus, y_train_rus)
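
To show what target encoding does, and one way to keep a row's own label out of its encoding, here is a hand-written out-of-fold sketch. It is not the `category_encoders` implementation: the helper name, the additive smoothing formula, and the toy data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def out_of_fold_target_encode(values, target, n_splits=5, smoothing=10.0, seed=42):
    """Encode each row with target means learned from the other folds only."""
    values = pd.Series(np.asarray(values, dtype=object)).fillna("None")
    target = pd.Series(np.asarray(target, dtype=float))
    encoded = pd.Series(np.nan, index=values.index)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(values):
        prior = target.iloc[fit_idx].mean()
        stats = target.iloc[fit_idx].groupby(values.iloc[fit_idx]).agg(["sum", "count"])
        # Shrink rare levels toward the fold's overall positive rate.
        level_means = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Levels unseen in the fitting folds fall back to the prior.
        encoded.iloc[enc_idx] = (
            values.iloc[enc_idx].map(level_means).fillna(prior).to_numpy()
        )
    return encoded


cities = ["Boston", "Austin", None, "Boston", "Austin",
          "Reno", "Boston", "Austin", None, "Boston"]
labels = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
city_encoded = out_of_fold_target_encode(cities, labels, n_splits=5)
```

Each high-cardinality column becomes a single numeric column, which is why it can be passed through to the model without one-hot encoding.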


In [40]:
# Defining ColumnTransformer
preprocessor = ColumnTransformer(transformers=[("scalar", StandardScaler(), cont_vars),
                                              ("dummies", make_pipeline(SimpleImputer(strategy='constant', fill_value="Missing"),
                                                                        OneHotEncoder(handle_unknown='ignore')), categorical)
                                             ], remainder='passthrough')

# Create pipelines
take3_pipe = Pipeline(steps=[('preprocessor', preprocessor),
                             ("logreg", LogisticRegression(C=1000000, solver='lbfgs', max_iter=500))
                            ])



In [41]:
# Baseline Training and testing
target_enc_model = cross_val_score(take3_pipe, targets_encoded, y_train_rus, scoring='roc_auc', cv=5)
target_enc_model_cv_score = np.mean(target_enc_model)
model_scores['target_enc_model'] = target_enc_model_cv_score


In [42]:
model_scores

{'target_enc_model': 0.981209215106308}

## Linear SVC

In [112]:
# Defining ColumnTransformer
preprocessor = ColumnTransformer(transformers=[("scalar", StandardScaler(), cont_vars),
                                              ("dummies", make_pipeline(SimpleImputer(strategy='constant', fill_value="Missing"),
                                                                        OneHotEncoder(handle_unknown='ignore')), categorical)
                                             ], remainder='passthrough')

# Create pipelines
svc_pipe = Pipeline(steps=[("preprocessor", preprocessor),
                           ("SVC", LinearSVC()) 
                          ])

In [115]:
# Baseline training and testing
svc_model = cross_val_score(svc_pipe, targets_encoded, y_train_rus, scoring='roc_auc', cv=5)
svc_model_cv_score = np.mean(svc_model)
model_scores['svc_model'] = svc_model_cv_score

In [116]:
model_scores

{'baseline_cv': 0.9362845959438688,
 'imputed_model_cv_score': 0.9535514508509628,
 'target_enc_model': 0.9829174974299428,
 'svc_model': 0.9811154450064498}

## Random Forest Classifier

Baseline Random Forest

In [44]:
# Defining ColumnTransformer
rf_preprocessor = ColumnTransformer(transformers=[
                                                ("dummies", make_pipeline(SimpleImputer(strategy='constant', fill_value="Missing"),
                                                                        OneHotEncoder(handle_unknown='ignore')), categorical)
                                             ], remainder='passthrough')

# Create pipelines (use the forest-specific preprocessor defined above)
rf_pipe = Pipeline(steps=[("preprocessor", rf_preprocessor),
                           ("randomForest", RandomForestClassifier()) 
                          ])

In [47]:
# training and testing
rf_model = cross_val_score(rf_pipe, targets_encoded, y_train_rus, scoring='roc_auc', cv=5)
rf_model_cv_score = np.mean(rf_model)
rf_model_cv_score

0.9932242352899097

Grid Search
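
A grid search over the forest's hyperparameters could be set up along the following lines. This is a sketch on synthetic data from `make_classification`; the parameter values are arbitrary placeholders rather than tuned choices, and no results for the payments data are implied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)

toy_pipe = Pipeline(steps=[("randomForest", RandomForestClassifier(random_state=42))])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {
    "randomForest__n_estimators": [20, 50],
    "randomForest__max_depth": [3, None],
}

search = GridSearchCV(toy_pipe, param_grid, scoring="roc_auc", cv=3)
search.fit(X_toy, y_toy)
best_params, best_score = search.best_params_, search.best_score_
```

The same `<step>__<parameter>` naming would apply to `rf_pipe`, with the preprocessing step kept inside the pipeline so that it is refit on every fold.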

## XGB Classifier

## Feature Importance ~ Task 5 

## Code for drawing ROC curve

In [None]:
y_score = logreg.fit(pd.DataFrame(X_train_rus, columns = X_train.columns), 
                     pd.DataFrame(y_train_rus)).predict_proba(X_test)

In [None]:
preds = logreg.predict(X_test)
tn, fp, fn, tp  = confusion_matrix(y_test, preds).ravel()
print([tn, fp])
print([fn, tp])

In [None]:
roc_auc_score(y_test, y_score[:, 1])

In [None]:
# Average precision needs the positive-class scores, not the full 2-column array
average_precision_score(y_test, y_score[:, 1])

In [None]:
def plot_roc(y_test, y_score):
    """Plot the ROC curve for the positive-class scores in y_score."""
    fpr, tpr, thresholds = roc_curve(y_test, y_score)
    
    roc_auc = auc(fpr, tpr)
    
    lw = 2
    plt.plot(fpr, tpr, color='darkorange',
             lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.show()

In [None]:
plot_roc(y_test, list(y_score[:, 1]))

In [None]:
y_prob = logreg.predict_proba(X_test)
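
As a sanity check on the ROC helpers used above, the area can be verified by hand on four made-up points. `roc_curve` expects a one-dimensional array of positive-class scores, which is why the cells above slice `[:, 1]` out of the `predict_proba` output.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])
y_prob_pos = np.array([0.1, 0.4, 0.35, 0.8])  # made-up positive-class scores

fpr, tpr, thresholds = roc_curve(y_true, y_prob_pos)
area = auc(fpr, tpr)
```

Of the four (positive, negative) pairs, three are ranked correctly (0.35 vs 0.1, 0.8 vs 0.1, 0.8 vs 0.4) and one is not (0.35 vs 0.4), so the area is 3/4 = 0.75, which matches `roc_auc_score(y_true, y_prob_pos)`.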