## CADE Implementation

A novel implementation of the methodologies described in the following paper:
<br>
###### Classifier-Adjusted Density Estimation for Anomaly Detection and One-Class Classification: Lisa Friedland, Amanda Gentzel, and David Jensen
###### https://pdfs.semanticscholar.org/e4e6/033069a8569ba16f64da3061538bcb90bec6.pdf
<br>
Briefly, the following steps will be performed:
<br>
<br>
Step 1 - "Duplicate" the current data set by replacing each variable with values sampled uniformly over its observed range
<br>
Step 2 - Label all original data as 0 and all duplicated data as 1
<br>
Step 3 - Combine data sets
<br>
Step 4 - Train a classifier on combined set
<br>
Step 5 - Score the original data with the classifier; the predicted probability of class 1 (fake) serves as the anomaly score
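
The score from Step 5 can also be read as a density estimate. By Bayes' rule, when the real and fake sets are the same size, the density of the real data at a point equals the uniform density multiplied by the classifier's odds of "real" versus "fake". The sketch below illustrates that relationship; `cade_density` and its parameters are hypothetical names, not part of this notebook.

```python
def cade_density(p_fake, uniform_density, eps=1e-12):
    """Convert a classifier's P(fake | x) into a density estimate for the real data.

    Assumes the real and fake sets are the same size, so the class priors cancel.
    """
    # Clip to avoid division by zero when the classifier is fully confident
    p_fake = min(max(p_fake, eps), 1.0 - eps)
    odds_real = (1.0 - p_fake) / p_fake
    return uniform_density * odds_real


# A uniform distribution over a box of volume 10 has density 0.1 inside the box
print(cade_density(0.5, 0.1))  # classifier undecided: estimate equals the uniform density
print(cade_density(0.9, 0.1))  # likely fake: low density, so the point looks anomalous
```

A high P(fake) therefore corresponds to a low estimated density, which is why ranking by that probability surfaces anomalies.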

#### Packages

In [17]:
import pandas as pd
import numpy as np
from dfply import *
from matplotlib import pyplot
import random


from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
# import xgboost as xgb  # uncomment to use model = "xgb" in train_model (requires the xgboost package)

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from IPython.display import clear_output

#### Load data and preprocess

In [2]:
# Read in data
df = pd.read_csv("/Users/mfairb/Documents/ML Projects/Project - HR Analytics/hr_analytics.csv")

# Preprocess data function
def preprocess(df):
    
    df = df.dropna().copy()
    
    # Create X & y
    X = df >> select(df.no_of_trainings, df.age, df.previous_year_rating, df.length_of_service, df.avg_training_score)
    y = df >> select(df.is_promoted)
    
    return(X, y)

# Preprocess data and split into train and test
X, y = preprocess(df)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=273)

clear_output()

In [3]:
X_train.shape
X_test.shape

(34062, 5)

(14598, 5)

### Complete CADE Function

###### Func: uniform_transform(col)

In [4]:
# Helper function to create uniform version of features
def uniform_transform(col):
    
    # Define col_length as length of the column
    col_length = len(col) 
    
    # If data type is integer, sample integer values uniformly between lower and upper limits
    if all(isinstance(c, int) for c in col):
        new_var = random.choices( range( min(col), max(col)+1 ), k = col_length )
        
    # Else if data type is float
    elif all(isinstance(c, float) for c in col):
        
        # Check to see if data should be integer (all values are actually integer values just stored as float) and if so, sample integer values uniformly between lower and upper limits
        if all(c.is_integer() for c in col): # Check to see if all can actually be integers
            col = list(map(int, col)) # Convert to integer
            new_var = random.choices( range( min(col), max(col)+1 ), k = col_length )
            
        # If actually a float, sample uniformly between lower and upper limits
        else:
            new_var = np.random.uniform( low = min(col), high = max(col), size = col_length )
       
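    # Else if string values, sample uniformly from the available levels
    elif all(isinstance(c, str) for c in col):
        new_var = random.choices( np.unique(col), k = col_length )

    # Any other case (e.g. a mixed-type column) is not supported
    else:
        raise TypeError("uniform_transform supports only int, float, or str columns")
    
    return(new_var)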
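
Because `uniform_transform` draws from the global `random` and `np.random` state, the fake data changes on every run. The sketch below shows one way to make the sampling reproducible for a numeric column. `uniform_like` and its `seed` parameter are hypothetical, and only the integer and float cases are covered.

```python
import random


def uniform_like(values, seed=None):
    """Sample len(values) points uniformly between min(values) and max(values)."""
    rng = random.Random(seed)  # local generator, leaves the global state untouched
    lo, hi = min(values), max(values)

    # Integer-valued columns (including floats such as 3.0) get integer samples
    if all(float(v).is_integer() for v in values):
        return [rng.randint(int(lo), int(hi)) for _ in values]

    # Otherwise sample continuous values in [lo, hi]
    return [rng.uniform(lo, hi) for _ in values]


ratings = [1.0, 3.0, 5.0, 4.0]
fake_ratings = uniform_like(ratings, seed=273)
print(fake_ratings)  # four integers between 1 and 5
```

Calling `uniform_like` twice with the same seed returns the same sample, which makes CADE scores comparable across runs.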

###### Func: create_cade_sets(X, y)

In [5]:
# Complete CADE function - prepares X/y to be modeled
def create_cade_sets(X, y):
    
    # Apply uniform transformation to all columns to create X_fake
    X_fake = X.apply(lambda x: uniform_transform(x), axis = 0).copy()
    
    # Create 0/1 responses. The zeroes will be assigned to the real data. The ones to the fake data
    zeros = np.zeros(X.shape[0])
    ones = np.ones(X_fake.shape[0])
    
    # Combine data sets and response vars. This creates a df with the actual data on top and the fake data under it,
    # so appending the ones array after the zeros array yields the matching response.
    # pd.concat is used because DataFrame.append was removed in pandas 2.0
    X_cade = pd.concat([X, X_fake])
    y_cade = np.append(zeros, ones)
    
    return(X_cade, y_cade)
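
A small check can confirm that the labels line up with the stacked rows. The sketch below rebuilds the stacking step on a toy frame with `pd.concat`; the column names, the values, and the `stack_real_and_fake` helper are made up for illustration.

```python
import numpy as np
import pandas as pd


def stack_real_and_fake(X_real, X_fake):
    """Stack real rows above fake rows and return matching 0/1 labels."""
    X_all = pd.concat([X_real, X_fake], ignore_index=True)
    y_all = np.concatenate([np.zeros(len(X_real)), np.ones(len(X_fake))])
    return X_all, y_all


X_real = pd.DataFrame({"age": [25, 40, 31], "score": [60, 85, 72]})
X_fake = pd.DataFrame({"age": [38, 27, 33], "score": [81, 66, 70]})

X_all, y_all = stack_real_and_fake(X_real, X_fake)
print(X_all.shape)  # (6, 2)
print(y_all)        # [0. 0. 0. 1. 1. 1.]
```

Here `ignore_index=True` gives the stacked frame a clean 0 to n-1 index, whereas the notebook's version keeps the original index labels, so each label appears twice.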

###### Func: train_model(X_train, y_train, model)

In [6]:
# Function to fit ML models
def train_model(X_train, y_train, model):
    
    # Logistic regression
    if model == "log":
        
        log = LogisticRegression() # Create model
        log.fit(X_train,y_train) # Fit model
        return(log)
    
    # Random Forest
    elif model == "rf":    
        
        rf = RandomForestClassifier(n_estimators=500)
        rf.fit(X_train,y_train)
        return(rf)
    
    # XGBoost
    elif model == "xgb":
        
        xgb_mod = xgb.XGBClassifier(random_state=273,learning_rate=0.01)
        xgb_mod.fit(X_train, y_train)
        return(xgb_mod) # Return the fitted model, not the xgb module
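
`train_model` silently returns `None` when `model` is not one of the three recognized strings. One alternative is a lookup table that raises on unknown names. This is only a sketch: `MODEL_FACTORIES` and `train_model_strict` are hypothetical names, XGBoost is left out so the block runs with scikit-learn alone, and the hyperparameters (`max_iter=1000`, `n_estimators=50`) are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Map each model key to a zero-argument factory so a fresh estimator is built per call
MODEL_FACTORIES = {
    "log": lambda: LogisticRegression(max_iter=1000),
    "rf": lambda: RandomForestClassifier(n_estimators=50, random_state=273),
}


def train_model_strict(X_train, y_train, model):
    """Fit the estimator registered under `model`, or raise if the key is unknown."""
    if model not in MODEL_FACTORIES:
        raise ValueError(f"Unknown model {model!r}; expected one of {sorted(MODEL_FACTORIES)}")
    estimator = MODEL_FACTORIES[model]()
    estimator.fit(X_train, y_train)
    return estimator


X_toy = [[0.0], [0.2], [0.9], [1.0]]
y_toy = [0, 0, 1, 1]
fitted = train_model_strict(X_toy, y_toy, model="log")
print(fitted.predict_proba([[0.5]]).shape)  # (1, 2)
```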

###### Func: cade(X, y, X_test, model)

In [7]:
def cade(X, y, X_test, model):
    
    # Train the selected classifier ("log", "rf", or "xgb") on the X_cade and y_cade dfs
    model = train_model(X, y, model = model)
    
    # Predict probabilities on test set using above model
    X_test = X_test.reset_index(drop=True)
    X_test['probs'] = pd.DataFrame(model.predict_proba(X_test)).iloc[:,1]

    return(X_test)
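
The whole procedure can be tried end to end on synthetic data where the anomaly is known in advance. The following sketch assumes a two-dimensional Gaussian cluster as the real data and uses a random forest; every name and value here is illustrative and unrelated to the HR data set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(273)

# Real data: a tight cluster around the origin
X_real = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# Fake data: each column sampled uniformly over the real column's observed range
X_fake = rng.uniform(low=X_real.min(axis=0), high=X_real.max(axis=0), size=X_real.shape)

# Real rows are labeled 0, fake rows 1, then the two sets are stacked
X_cade = np.vstack([X_real, X_fake])
y_cade = np.concatenate([np.zeros(len(X_real)), np.ones(len(X_fake))])

clf = RandomForestClassifier(n_estimators=100, random_state=273)
clf.fit(X_cade, y_cade)

# Column 1 of predict_proba is P(fake | x), used here as the anomaly score
points = np.array([[0.0, 0.0],   # center of the cluster
                   [4.0, 4.0]])  # far from the cluster
scores = clf.predict_proba(points)[:, 1]
print(scores)  # the second score should be the larger one
```

A linear model such as logistic regression cannot separate a centered cluster from the surrounding uniform background, which is one reason to prefer a tree-based classifier for this kind of data.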

In [21]:
# Create cade sets
X_cade, y_cade = create_cade_sets(X_train, y_train)

# Perform CADE - return test set with predicted probabilities
ins_cade = cade(X = X_cade, y = y_cade, X_test = X_train, model = 'log')

In [22]:
ins_cade >> arrange('probs', ascending = False)

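| | no_of_trainings | age | previous_year_rating | length_of_service | avg_training_score | probs |
|---|---|---|---|---|---|---|
| 31535 | 10 | 42 | 3.0 | 13 | 49 | 0.999994 |
| 23797 | 10 | 36 | 1.0 | 10 | 66 | 0.999994 |
| 21559 | 10 | 28 | 3.0 | 2 | 71 | 0.999959 |
| 16271 | 9 | 27 | 3.0 | 4 | 75 | 0.999882 |
| 8756 | 9 | 26 | 3.0 | 3 | 73 | 0.999845 |
| ... | ... | ... | ... | ... | ... | ... |
| 28516 | 1 | 28 | 5.0 | 1 | 48 | 0.007433 |
| 33340 | 1 | 27 | 5.0 | 1 | 48 | 0.007342 |
| 21849 | 1 | 27 | 5.0 | 1 | 48 | 0.007342 |
| 27022 | 1 | 27 | 5.0 | 1 | 48 | 0.007342 |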
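
The same ranking can be produced without `dfply` by using plain pandas. The sketch below uses a toy scored frame whose values are made up for illustration; only the `probs` column name is taken from the notebook.

```python
import pandas as pd

# Toy scored frame; the values are made up for illustration
scored = pd.DataFrame({
    "no_of_trainings": [1, 10, 3],
    "probs": [0.01, 0.99, 0.40],
})

# Highest P(fake) first, so the most anomalous rows are at the top
ranked = scored.sort_values("probs", ascending=False)
top_two = ranked.head(2)
print(top_two)
```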
