## Machine Learning Foundations

# Machine Learning Model Building Workflow
Sumudu Tennakoon, PhD

<hr>
To learn more about Python, refer to the following website:

* Python : www.python.org

To learn more about the Python packages we explore in this notebook, refer to the following websites:

* NumPy : www.numpy.org
* Matplotlib : www.matplotlib.org
* Pandas : https://pandas.pydata.org
* Scikit-Learn : https://scikit-learn.org/
* Pickle: https://docs.python.org/3/library/pickle.html
* Joblib : https://joblib.readthedocs.io/

In [1]:
import numpy as np
import pandas as pd

# Data visualization
from matplotlib import pyplot
import matplotlib.pyplot as plt
from matplotlib import cm # Colormaps
import seaborn as sns

# Classifier algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Train/test split
from sklearn.model_selection import train_test_split

# Model evaluation
from sklearn import metrics

# 1. Load Data

In [2]:
file_name = 'https://raw.githubusercontent.com/SumuduTennakoon/MLFoundations/main/Datasets/income_data.csv'

# Load CSV File

data = pd.read_csv(file_name)
data.sample(5)

Unnamed: 0.1,Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,class
39363,14548,42,Private,220531,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0.0,0.0,32.0,United-States,<=50K.
2968,2968,49,Self-emp-inc,102771,Bachelors,13,Married-civ-spouse,Sales,Husband,White,Male,0.0,0.0,52.0,United-States,>50K
11856,11860,38,Private,357173,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,0.0,0.0,40.0,United-States,<=50K
30939,6121,29,Private,249948,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Female,0.0,0.0,50.0,United-States,<=50K.
39305,14490,21,?,221418,Some-college,10,Never-married,?,Own-child,White,Female,0.0,0.0,40.0,United-States,<=50K.


# 2. Pre-process Data for Model Training
Reuse the code from the Feature Engineering notebook.

## Cleaning

In [3]:
# Drop unwanted column
data.drop(labels='Unnamed: 0', axis=1, inplace=True)

# Drop rows with missing values
data.dropna(how='any', axis=0, inplace=True)

# Remove leading and trailing spaces in string values
def remove_spaces(data, columns):
    for column in columns:
        data[column] = data[column].str.strip()
    return data

columns = ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country', 'class']
data = remove_spaces(data, columns)

# Resolve duplicate representation in 'class' column
# (assign the result back instead of calling replace(..., inplace=True) on a column selection)
data['class'] = data['class'].replace({'>50K.': '>50K', '<=50K.': '<=50K'})

# Convert 'class' column to Binary Column
# "<=50K" -> 0
# ">50K"  -> 1
data['earn_gt_50K'] = np.where(data['class']=='>50K',1,0)

# Create Unique ID for Each Row
data['ID'] = data.index+1

data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,class,earn_gt_50K,ID
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,United-States,<=50K,0,1
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,United-States,<=50K,0,2
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,United-States,<=50K,0,3
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,United-States,<=50K,0,4
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K,0,5
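
The label clean-up in this cell can be checked on a small example. The sketch below is only an illustration: the `toy` DataFrame and its four rows are hypothetical and do not come from the income dataset. It strips spaces, maps the dotted label variants to the plain ones, and derives the binary target in the same way as the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that mimics the raw 'class' labels (not from the dataset)
toy = pd.DataFrame({'class': [' >50K', '<=50K.', '>50K. ', '<=50K']})

# Strip surrounding spaces, then map the dotted variants to the plain labels
toy['class'] = toy['class'].str.strip()
toy['class'] = toy['class'].replace({'>50K.': '>50K', '<=50K.': '<=50K'})

# Binary target: 1 for '>50K', 0 otherwise
toy['earn_gt_50K'] = np.where(toy['class'] == '>50K', 1, 0)
print(toy)
```

Stripping has to happen before the replacement; otherwise a value such as `'>50K. '` would not match the dictionary key `'>50K.'`.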


## Feature Engineering: One-hot Encode

In [4]:
#####################################
# Numeric to Categorical (Binning)
#####################################

# age
labels = ['<20', '20-30', '30-40', '40-50', '50-60', '>60']
bin_edges = [0, 20, 30, 40, 50, 60, np.inf]
data['age_group'] = pd.cut(x=data['age'], bins=bin_edges, labels=labels)

# hours_per_week
labels = ['<25', '20-35', '35-45', '45-60', '>60']
bin_edges = [0, 20, 35, 45, 60, np.inf]
data['hours_per_week_group'] = pd.cut(x=data['hours_per_week'], bins=bin_edges, labels=labels)

#####################################
# Re-group Categorical Column Values
#####################################

# education
data['education_group'] = data['education'].replace({
    'Preschool':'school', '1st-4th':'school', '5th-6th':'school', '7th-8th':'school', '9th':'school',
    '10th':'h_school', '11th':'h_school', '12th':'h_school', 'HS-grad':'h_school',
    'Some-college':'university_eq', 'Assoc-acdm':'university_eq', 'Assoc-voc':'university_eq', 
    'Bachelors':'university', 'Masters':'university', 'Doctorate':'university', 'Prof-school':'university'
})

#sex
data['is_male'] = np.where(data['sex']=='Male', 1,0)

# workclass
data['workclass_group'] = data['workclass'].replace({'?':'Other', 'Without-pay':'Other', 'Never-worked':'Other', 'Local-gov':'Local-State-gov', 'State-gov':'Local-State-gov'})

# marital_status
data['marital_status_group'] = data['marital_status'].replace({'Divorced':'Divorced-Separated-Widowed-Absent', 'Separated':'Divorced-Separated-Widowed-Absent', 'Widowed':'Divorced-Separated-Widowed-Absent', 
    'Married-spouse-absent':'Divorced-Separated-Widowed-Absent',
    'Married-civ-spouse':'Married-civ-AF-spouse', 'Married-AF-spouse':'Married-civ-AF-spouse'
})

# occupation
data['occupation_group'] = data['occupation'].replace({'Prof-specialty':'Exec-managerial-Prof-specialty', 'Exec-managerial':'Exec-managerial-Prof-specialty', 
    'Protective-serv':'Armed-Forces-Protective-serv', 'Armed-Forces':'Armed-Forces-Protective-serv',
    'Priv-house-serv':'Priv-house-serv-Handlers-cleaners-Other', 'Handlers-cleaners':'Priv-house-serv-Handlers-cleaners-Other', 
    'Other-service':'Priv-house-serv-Handlers-cleaners-Other', '?':'Priv-house-serv-Handlers-cleaners-Other',
    'Farming-fishing':'Farming-fishing-Machine-op-inspct', 'Machine-op-inspct':'Farming-fishing-Machine-op-inspct',
})

################
# One-hot Encode
################

# Get Dummies
categorical_columns = ['workclass_group', 'marital_status_group', 'occupation_group', 'is_male', 'age_group', 'hours_per_week_group', 'education_group']
dummy_columns_df = pd.get_dummies(data[categorical_columns], drop_first=False)

dummy_columns_list = list(dummy_columns_df.columns)

# Merge Dummy Values with the Data
data = pd.concat([data, dummy_columns_df], axis=1)
data

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,...,age_group_>60,hours_per_week_group_<25,hours_per_week_group_20-35,hours_per_week_group_35-45,hours_per_week_group_45-60,hours_per_week_group_>60,education_group_h_school,education_group_school,education_group_university,education_group_university_eq
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,...,0,0,0,1,0,0,0,0,1,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,...,0,1,0,0,0,0,0,0,1,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,...,0,0,0,1,0,0,1,0,0,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,...,0,0,0,1,0,0,1,0,0,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,...,0,0,0,1,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41090,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,...,0,0,0,1,0,0,0,0,1,0
41091,64,?,321403,HS-grad,9,Widowed,?,Other-relative,Black,Male,...,1,0,0,1,0,0,1,0,0,0
41092,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,...,0,0,0,0,1,0,0,0,1,0
41093,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,...,0,0,0,1,0,0,0,0,1,0
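
It helps to see how `pd.cut` treats values that sit exactly on a bin edge. The sketch below reuses the `hours_per_week` bin edges from the cell above on a hypothetical series of five values. The labels are illustrative interval names chosen for this example, not the ones used in the notebook. By default each bin is open on the left and closed on the right, so a value of exactly 0 falls outside the first bin.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, chosen to hit the bin edges
hours = pd.Series([0, 20, 21, 40, 99])

# Same edges as the hours_per_week binning; labels are illustrative only
bin_edges = [0, 20, 35, 45, 60, np.inf]
labels = ['(0,20]', '(20,35]', '(35,45]', '(45,60]', '>60']

groups = pd.cut(x=hours, bins=bin_edges, labels=labels)
print(groups.tolist())
```

If zero values should land in the first bin, `pd.cut` accepts `include_lowest=True`. Whether that is appropriate depends on the data, so treat it as an option rather than a required change.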


## Data Pre-processing Function (Pipeline)

In [5]:
def target_varible_pre_processing(data):    

    # Remove leading and trailing spaces in string values
    def remove_spaces(data, columns):
        for column in columns:
            data[column] = data[column].str.strip()
        return data

    data = remove_spaces(data, ['class'])

    # Resolve duplicate representation in 'class' column
    # (assign the result back instead of calling replace(..., inplace=True) on a column selection)
    data['class'] = data['class'].replace({'>50K.': '>50K', '<=50K.': '<=50K'})

    # Convert 'class' column to Binary Column
    data['earn_gt_50K'] = np.where(data['class']=='>50K',1,0) 

    return data

def pre_process(data):   

    input_columns = ['age', 'hours_per_week', 'workclass', 'education', 'marital_status', 'occupation', 'sex']
    data = data[input_columns]
    
    ##########
    # Cleaning
    ##########
    # Create Unique ID for Each Row
    data['ID'] = data.index+1

    # Remove leading and trailing spaces in string values
    def remove_spaces(data, columns):
        for column in columns:
            data[column] = data[column].str.strip()
        return data

    string_columns = ['workclass', 'education', 'marital_status', 'occupation', 'sex']
    data = remove_spaces(data, string_columns)

    # Drop unwanted column
    data.drop(labels='Unnamed: 0', axis=1, inplace=True, errors='ignore')

    # Drop rows with missing values
    data.dropna(how='any', axis=0, inplace=True)

    #####################################
    # Numeric to Categorical (Binning)
    #####################################

    # age
    labels = ['<20', '20-30', '30-40', '40-50', '50-60', '>60']
    bin_edges = [0, 20, 30, 40, 50, 60, np.inf]
    data['age_group'] = pd.cut(x=data['age'], bins=bin_edges, labels=labels)

    # hours_per_week
    labels = ['<25', '20-35', '35-45', '45-60', '>60']
    bin_edges = [0, 20, 35, 45, 60, np.inf]
    data['hours_per_week_group'] = pd.cut(x=data['hours_per_week'], bins=bin_edges, labels=labels)

    #####################################
    # Re-group Categorical Column Values
    #####################################

    # education
    data['education_group'] = data['education'].replace({
        'Preschool':'school', '1st-4th':'school', '5th-6th':'school', '7th-8th':'school', '9th':'school',
        '10th':'h_school', '11th':'h_school', '12th':'h_school', 'HS-grad':'h_school',
        'Some-college':'university_eq', 'Assoc-acdm':'university_eq', 'Assoc-voc':'university_eq', 
        'Bachelors':'university', 'Masters':'university', 'Doctorate':'university', 'Prof-school':'university'
    })

    #sex
    data['is_male'] = np.where(data['sex']=='Male', 1,0)

    # workclass
    data['workclass_group'] = data['workclass'].replace({'?':'Other', 'Without-pay':'Other', 'Never-worked':'Other', 'Local-gov':'Local-State-gov', 'State-gov':'Local-State-gov'})

    # marital_status
    data['marital_status_group'] = data['marital_status'].replace({'Divorced':'Divorced-Separated-Widowed-Absent', 'Separated':'Divorced-Separated-Widowed-Absent', 'Widowed':'Divorced-Separated-Widowed-Absent', 
        'Married-spouse-absent':'Divorced-Separated-Widowed-Absent',
        'Married-civ-spouse':'Married-civ-AF-spouse', 'Married-AF-spouse':'Married-civ-AF-spouse'
    })

    # occupation
    data['occupation_group'] = data['occupation'].replace({'Prof-specialty':'Exec-managerial-Prof-specialty', 'Exec-managerial':'Exec-managerial-Prof-specialty', 
        'Protective-serv':'Armed-Forces-Protective-serv', 'Armed-Forces':'Armed-Forces-Protective-serv',
        'Priv-house-serv':'Priv-house-serv-Handlers-cleaners-Other', 'Handlers-cleaners':'Priv-house-serv-Handlers-cleaners-Other', 
        'Other-service':'Priv-house-serv-Handlers-cleaners-Other', '?':'Priv-house-serv-Handlers-cleaners-Other',
        'Farming-fishing':'Farming-fishing-Machine-op-inspct', 'Machine-op-inspct':'Farming-fishing-Machine-op-inspct',
    })

    ################
    # One-hot Encode
    ################

    # Get Dummies
    categorical_columns = ['workclass_group', 'marital_status_group', 'occupation_group', 'is_male', 'age_group', 'hours_per_week_group', 'education_group']
    dummy_columns_df = pd.get_dummies(data[categorical_columns], drop_first=False)

    # Merge Dummy Values with the Data
    data = pd.concat([data, dummy_columns_df], axis=1)

    ################
    # Select Columns
    ################
    X_variables = ['is_male',
                    'workclass_group_Federal-gov',
                    'workclass_group_Local-State-gov',
                    'workclass_group_Private',
                    'workclass_group_Self-emp-inc',
                    'workclass_group_Self-emp-not-inc',
                    'marital_status_group_Married-civ-AF-spouse',
                    'marital_status_group_Never-married',
                    'occupation_group_Craft-repair',
                    'occupation_group_Farming-fishing-Machine-op-inspct',
                    'occupation_group_Priv-house-serv-Handlers-cleaners-Other',
                    'occupation_group_Exec-managerial-Prof-specialty',
                    'occupation_group_Armed-Forces-Protective-serv',
                    'occupation_group_Sales',
                    'occupation_group_Tech-support',
                    'occupation_group_Transport-moving',
                    'age_group_20-30',
                    'age_group_30-40',
                    'age_group_40-50',
                    'age_group_50-60',
                    'age_group_>60',
                    'hours_per_week_group_20-35',
                    'hours_per_week_group_35-45',
                    'hours_per_week_group_45-60',
                    'hours_per_week_group_>60',
                    'education_group_university',
                    'education_group_school',
                    'education_group_university_eq'
    ]
    
    #############################
    # Assign 0 to missing columns
    #############################
    for x in list(set(X_variables) - set(data.columns)):
        data[x] = 0
        
    return data[X_variables]
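
The final step of `pre_process` exists because a small scoring batch may not contain every category seen during training, so `pd.get_dummies` can produce fewer columns than the model expects. The sketch below shows one possible alternative to the explicit loop, using `DataFrame.reindex`. The column list and the `batch` frame are hypothetical, shortened for illustration.

```python
import pandas as pd

# Hypothetical, shortened feature list fixed at training time
expected_columns = [
    'workclass_group_Private',
    'workclass_group_Federal-gov',
    'workclass_group_Self-emp-inc',
]

# Hypothetical scoring batch: 'Federal-gov' and 'Self-emp-inc' never occur
batch = pd.DataFrame({'workclass_group': ['Private', 'Private', 'Other']})
dummies = pd.get_dummies(batch, columns=['workclass_group'], dtype=int)

# Keep only the expected columns, in the expected order; fill absent ones with 0
aligned = dummies.reindex(columns=expected_columns, fill_value=0)
print(aligned)
```

`reindex` adds the missing columns, drops the unexpected ones (here `workclass_group_Other`), and fixes the column order in a single call.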

# 3. Setup Model Training

## Select Variables

In [6]:
X_variables = ['is_male',
                'workclass_group_Federal-gov',
                'workclass_group_Local-State-gov',
                'workclass_group_Private',
                'workclass_group_Self-emp-inc',
                'workclass_group_Self-emp-not-inc',
                'marital_status_group_Married-civ-AF-spouse',
                'marital_status_group_Never-married',
                'occupation_group_Craft-repair',
                'occupation_group_Farming-fishing-Machine-op-inspct',
                'occupation_group_Priv-house-serv-Handlers-cleaners-Other',
                'occupation_group_Exec-managerial-Prof-specialty',
                'occupation_group_Armed-Forces-Protective-serv',
                'occupation_group_Sales',
                'occupation_group_Tech-support',
                'occupation_group_Transport-moving',
                'age_group_20-30',
                'age_group_30-40',
                'age_group_40-50',
                'age_group_50-60',
                'age_group_>60',
                'hours_per_week_group_20-35',
                'hours_per_week_group_35-45',
                'hours_per_week_group_45-60',
                'hours_per_week_group_>60',
                'education_group_university',
                'education_group_school',
                'education_group_university_eq'
]

y_variable = 'earn_gt_50K'

# Train Test Split

In [7]:
data_train, data_test = train_test_split(data, test_size=0.3, random_state=42)

print(F"Train sample size = {len(data_train)}")
print(F"Test sample size  = {len(data_test)}")

Train sample size = 28765
Test sample size  = 12329
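
The split above is purely random. When the positive class is a minority, as it is here, one common option is to stratify on the target so that both parts keep the same class ratio. The sketch below illustrates this on a hypothetical toy frame with 25% positives. The notebook itself does not stratify, so this is a variation and not the author's method.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 100 rows, 25 positives and 75 negatives
toy = pd.DataFrame({'x': range(100), 'y': [1] * 25 + [0] * 75})

# stratify keeps the share of positives the same in both parts
train, test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy['y']
)

print(len(train), len(test))
print(train['y'].mean(), test['y'].mean())
```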


# Model Training Function

In [8]:
def model_train(model_name, model, data_train, data_test, X_variables, y_variable):
    # Split datasets into Features (X) and Target (y)
    X_train = data_train[X_variables]
    y_train = data_train[y_variable]
    X_test = data_test[X_variables]
    y_test = data_test[y_variable]

    # train Model
    model.fit(X_train, y_train)

    # Make Predictions
    y_pred = model.predict(X_test)
    y_pred_prob = model.predict_proba(X_test)[:, 1]
    test_result = pd.DataFrame(data={'y_act':y_test, 'y_pred':y_pred, 'y_pred_prob':y_pred_prob})

    #Evaluate
    accuracy = metrics.accuracy_score(test_result['y_act'], test_result['y_pred']) 
    precision = metrics.precision_score(test_result['y_act'], test_result['y_pred'], average='binary', pos_label=1)
    recall = metrics.recall_score(test_result['y_act'], test_result['y_pred'], average='binary', pos_label=1)
    f1_score = metrics.f1_score(test_result['y_act'], test_result['y_pred'], average='weighted')  #weighted accounts for label imbalance.
    roc_auc = metrics.roc_auc_score(test_result['y_act'], test_result['y_pred_prob'])

    return ({'model_name':model_name, 
                'model':model, 
                'accuracy':accuracy, 
                'precision':precision,
                'recall':recall,
                'f1_score':f1_score,
                'roc_auc':roc_auc,
    })

In [9]:
model0 = model_train('rf_new', RandomForestClassifier(n_estimators=500, max_depth=10, n_jobs=3, verbose=1), data_train, data_test, X_variables, y_variable)
model0

[Parallel(n_jobs=3)]: Using backend ThreadingBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:    0.2s
[Parallel(n_jobs=3)]: Done 194 tasks      | elapsed:    1.2s
[Parallel(n_jobs=3)]: Done 444 tasks      | elapsed:    2.8s
[Parallel(n_jobs=3)]: Done 500 out of 500 | elapsed:    3.2s finished
[Parallel(n_jobs=3)]: Using backend ThreadingBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:    0.0s
[Parallel(n_jobs=3)]: Done 194 tasks      | elapsed:    0.1s
[Parallel(n_jobs=3)]: Done 444 tasks      | elapsed:    0.2s
[Parallel(n_jobs=3)]: Done 500 out of 500 | elapsed:    0.3s finished
[Parallel(n_jobs=3)]: Using backend ThreadingBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:    0.0s
[Parallel(n_jobs=3)]: Done 194 tasks      | elapsed:    0.1s
[Parallel(n_jobs=3)]: Done 444 tasks      | elapsed:    0.3s
[Parallel(n_jobs=3)]: Done 500 out of 500 | elapsed:    0.3s finished


{'model_name': 'rf_new',
 'model': RandomForestClassifier(max_depth=10, n_estimators=500, n_jobs=3, verbose=1),
 'accuracy': 0.832833157595912,
 'precision': 0.7199201198202696,
 'recall': 0.49014276002719237,
 'f1_score': 0.8209429783519845,
 'roc_auc': 0.8844529444187714}
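
Note that `f1_score` in this dictionary is the weighted average over both classes, while `precision` and `recall` refer to the positive class only. The short calculation below, using the precision and recall values copied from the output above, shows what the F1 score of the positive class alone would be.

```python
# Values copied from the result dictionary above
precision = 0.7199201198202696
recall = 0.49014276002719237
weighted_f1 = 0.8209429783519845

# F1 for the positive class is the harmonic mean of its precision and recall
f1_positive = 2 * precision * recall / (precision + recall)
print(f"positive-class F1 = {f1_positive:.3f}, weighted F1 = {weighted_f1:.3f}")
```

The positive-class F1 is noticeably lower than the weighted figure, because the weighted average is dominated by the larger `<=50K` class.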

# Fitting Multiple Models with Different Hyperparameters

## [A] Manually explore hyperparameter space

In [10]:
models = []
models.append(model_train('lgr1', LogisticRegression(n_jobs=3, verbose=1), data_train, data_test, X_variables, y_variable))
models.append(model_train('rf1', RandomForestClassifier(n_estimators=100, max_depth=None, n_jobs=3, verbose=1), data_train, data_test, X_variables, y_variable))
models.append(model_train('rf2', RandomForestClassifier(n_estimators=500, max_depth=None, n_jobs=3, verbose=1), data_train, data_test, X_variables, y_variable))
models.append(model_train('rf3', RandomForestClassifier(n_estimators=500, max_depth=10, n_jobs=3, verbose=1), data_train, data_test, X_variables, y_variable))
models.append(model_train('rf4', RandomForestClassifier(n_estimators=500, max_depth=20, n_jobs=3, verbose=1), data_train, data_test, X_variables, y_variable))
models = pd.DataFrame(models)
models

[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done   1 out of   1 | elapsed:    2.1s finished
[Parallel(n_jobs=3)]: Using backend ThreadingBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:    0.5s
[Parallel(n_jobs=3)]: Done 100 out of 100 | elapsed:    1.1s finished
[Parallel(n_jobs=3)]: Using backend ThreadingBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:    0.0s
[Parallel(n_jobs=3)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=3)]: Using backend ThreadingBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:    0.0s
[Parallel(n_jobs=3)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=3)]: Using backend ThreadingBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:    0.4s
[Parallel(n_jobs=3)]: Done 194 tasks      | elapsed:    2.3s
[Parallel(n_jobs=3)]

Unnamed: 0,model_name,model,accuracy,precision,recall,f1_score,roc_auc
0,lgr1,"LogisticRegression(n_jobs=3, verbose=1)",0.833725,0.694251,0.541808,0.826238,0.884696
1,rf1,"(DecisionTreeClassifier(max_features='auto', r...",0.823587,0.648356,0.56968,0.819535,0.864033
2,rf2,"(DecisionTreeClassifier(max_features='auto', r...",0.824154,0.650936,0.567301,0.819854,0.865115
3,rf3,"(DecisionTreeClassifier(max_depth=10, max_feat...",0.833482,0.720377,0.493882,0.821861,0.884383
4,rf4,"(DecisionTreeClassifier(max_depth=20, max_feat...",0.824641,0.655008,0.560163,0.819743,0.868135


## [B] Use of Grid Search

In [11]:
from sklearn.model_selection import GridSearchCV

parameters = {'n_estimators': [100,500], 'max_depth': [None, 10, 20]}
gs_model = GridSearchCV(RandomForestClassifier(), parameters, n_jobs=2, verbose=3, pre_dispatch=2)
gs_model.fit(data_train[X_variables], data_train[y_variable])

Fitting 5 folds for each of 6 candidates, totalling 30 fits


GridSearchCV(estimator=RandomForestClassifier(), n_jobs=2,
             param_grid={'max_depth': [None, 10, 20],
                         'n_estimators': [100, 500]},
             pre_dispatch=2, verbose=3)

In [12]:
# Best Model Parameters
print(gs_model.best_params_) 

{'max_depth': 10, 'n_estimators': 500}
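
The message "Fitting 5 folds for each of 6 candidates, totalling 30 fits" follows from the grid size multiplied by the number of cross-validation folds. The sketch below reproduces that arithmetic on a small synthetic dataset. The data, the decision-tree estimator, and the grid values are assumptions chosen to keep the example fast; they are not the notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the income dataset (assumption for illustration)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# 2 x 3 = 6 parameter combinations
param_grid = {'max_depth': [2, 4], 'min_samples_leaf': [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

n_candidates = len(search.cv_results_['params'])
n_fits = n_candidates * search.n_splits_  # candidates x folds
print(search.best_params_, round(search.best_score_, 3), n_fits)
```

After fitting, `best_score_` holds the mean cross-validated score of the best candidate, and `best_estimator_` is that candidate refitted on the full training data (the default `refit=True`).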


In [13]:
from sklearn.metrics import classification_report, confusion_matrix 
 
y_pred = gs_model.predict(data_test[X_variables]) 

print(classification_report(data_test[y_variable], y_pred)) 
print(confusion_matrix(data_test[y_variable], y_pred)) 

              precision    recall  f1-score   support

           0       0.86      0.94      0.90      9387
           1       0.72      0.50      0.59      2942

    accuracy                           0.83     12329
   macro avg       0.79      0.72      0.74     12329
weighted avg       0.82      0.83      0.82     12329

[[8824  563]
 [1485 1457]]
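
The figures in the classification report can be recomputed by hand from the confusion matrix. In scikit-learn's layout, rows are actual classes and columns are predicted classes, so the four cells printed above map to the counts used in the calculation below.

```python
# Confusion matrix printed above: rows = actual class, columns = predicted class
tn, fp = 8824, 563    # actual 0: predicted 0, predicted 1
fn, tp = 1485, 1457   # actual 1: predicted 0, predicted 1

precision = tp / (tp + fp)                   # of predicted >50K, share that is correct
recall = tp / (tp + fn)                      # of actual >50K, share that is found
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all rows classified correctly

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

These match the class `1` row and the accuracy line of the report: the model misses roughly half of the actual `>50K` cases.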


# Select Best Model

In [14]:
# Select best model 
model = models.query("model_name=='rf3'")
model 

Unnamed: 0,model_name,model,accuracy,precision,recall,f1_score,roc_auc
3,rf3,"(DecisionTreeClassifier(max_depth=10, max_feat...",0.833482,0.720377,0.493882,0.821861,0.884383


In [15]:
model = model['model'].values[0]
model

RandomForestClassifier(max_depth=10, n_estimators=500, n_jobs=3, verbose=1)
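
Here the model is picked by name. Which model counts as "best" depends on the metric, and the sketch below makes that explicit. It rebuilds the comparison table from the metric values printed earlier (model objects omitted) and uses a hypothetical helper, `best_by`, to look up the top model for a given metric.

```python
import pandas as pd

# Metric values copied from the comparison table above (model objects omitted)
models_summary = pd.DataFrame({
    'model_name': ['lgr1', 'rf1', 'rf2', 'rf3', 'rf4'],
    'accuracy':   [0.833725, 0.823587, 0.824154, 0.833482, 0.824641],
    'precision':  [0.694251, 0.648356, 0.650936, 0.720377, 0.655008],
    'recall':     [0.541808, 0.569680, 0.567301, 0.493882, 0.560163],
    'roc_auc':    [0.884696, 0.864033, 0.865115, 0.884383, 0.868135],
})

def best_by(metric):
    """Return the name of the model with the highest value of `metric`."""
    return models_summary.loc[models_summary[metric].idxmax(), 'model_name']

for metric in ['accuracy', 'precision', 'recall', 'roc_auc']:
    print(metric, '->', best_by(metric))
```

With these numbers, `rf3` leads on precision only. Logistic regression is marginally ahead on accuracy and ROC AUC, so the choice of `rf3` reflects a preference and not a clear win on every metric.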

# Saving Best Model

## [A] Use Pickle

In [16]:
import pickle

save_file = 'model_rf_test.pickle'
# Use a context manager so the file is closed after writing
with open(save_file, 'wb') as f:
    pickle.dump(model, f)

In [17]:
# Load from file (the context manager closes the file after reading)
with open(save_file, 'rb') as f:
    model_ = pickle.load(f)
model_

RandomForestClassifier(max_depth=10, n_estimators=500, n_jobs=3, verbose=1)

## [B] Use Joblib (recommended)

In [18]:
import joblib

save_file = 'model_rf_test.joblib'
# joblib accepts a file path directly, so no open file handle is left behind
joblib.dump(model, save_file)

In [19]:
# loading from file
model_ = joblib.load(save_file)
model_

RandomForestClassifier(max_depth=10, n_estimators=500, n_jobs=3, verbose=1)
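
A quick way to gain confidence in a saved model is to check that the restored object makes the same predictions as the original. The sketch below does this for a tiny decision tree written to a temporary directory. The toy data, the estimator, and the file name are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data and model, used only to demonstrate the round trip
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1])
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Save to a temporary location and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model.joblib')
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(clf.predict(X), restored.predict(X))
print(same_predictions)
```

Both pickle and joblib files should only be loaded from trusted sources, and ideally with the same library versions that were used to save them.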

# 4. Predict on Sample Data

In [20]:
#Load Fresh Dataset
sample_input = pd.read_csv(file_name).sample(10)
sample_input

Unnamed: 0.1,Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,class
1975,1975,40,Private,121956,Bachelors,13,Married-spouse-absent,Prof-specialty,Not-in-family,Asian-Pac-Islander,Male,13550.0,0.0,40.0,Cambodia,>50K
40988,16174,64,Local-gov,47298,Doctorate,16,Married-civ-spouse,Protective-serv,Husband,White,Male,0.0,0.0,45.0,United-States,>50K.
28262,3444,46,Private,37353,Some-college,10,Married-civ-spouse,Handlers-cleaners,Husband,White,Male,0.0,0.0,45.0,United-States,<=50K.
38583,13767,32,Federal-gov,72630,Prof-school,15,Never-married,Prof-specialty,Not-in-family,White,Male,14084.0,0.0,55.0,United-States,>50K.
10758,10761,33,Private,191385,Bachelors,13,Never-married,Prof-specialty,Not-in-family,White,Male,0.0,0.0,40.0,Canada,<=50K
27855,3037,54,?,99208,Preschool,1,Married-civ-spouse,?,Husband,White,Male,0.0,0.0,16.0,United-States,<=50K.
8211,8213,37,Private,108140,Masters,14,Married-civ-spouse,Exec-managerial,Husband,White,Male,15024.0,0.0,45.0,United-States,>50K
30189,5371,24,Private,479296,9th,5,Never-married,Sales,Own-child,White,Male,0.0,0.0,48.0,United-States,<=50K.
3804,3804,19,Private,262515,11th,7,Never-married,Other-service,Other-relative,White,Male,0.0,0.0,20.0,United-States,<=50K
226,226,60,?,24215,10th,6,Divorced,?,Not-in-family,Amer-Indian-Eskimo,Female,0.0,0.0,10.0,United-States,<=50K


In [21]:
pre_process(sample_input)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['ID'] = data.index+1
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[column] = data[column].str.strip()
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data.drop(labels='Unnamed: 0', axis=1, inplace=True, errors='ignore')
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata

Unnamed: 0,is_male,is_male.1,workclass_group_Federal-gov,workclass_group_Local-State-gov,workclass_group_Private,workclass_group_Self-emp-inc,workclass_group_Self-emp-not-inc,marital_status_group_Married-civ-AF-spouse,marital_status_group_Never-married,occupation_group_Craft-repair,...,age_group_40-50,age_group_50-60,age_group_>60,hours_per_week_group_20-35,hours_per_week_group_35-45,hours_per_week_group_45-60,hours_per_week_group_>60,education_group_university,education_group_school,education_group_university_eq
1975,1,1,0,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
40988,1,1,0,1,0,0,0,1,0,0,...,0,0,1,0,1,0,0,1,0,0
28262,1,1,0,0,1,0,0,1,0,0,...,1,0,0,0,1,0,0,0,0,1
38583,1,1,1,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,1,0,0
10758,1,1,0,0,1,0,0,0,1,0,...,0,0,0,0,1,0,0,1,0,0
27855,1,1,0,0,0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,1,0
8211,1,1,0,0,1,0,0,1,0,0,...,0,0,0,0,1,0,0,1,0,0
30189,1,1,0,0,1,0,0,0,1,0,...,0,0,0,0,0,1,0,0,1,0
3804,1,1,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
226,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
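
The `SettingWithCopyWarning` messages come from assigning new columns to a DataFrame that was itself created by selecting columns from another DataFrame. One common way to avoid the warning is to take an explicit copy right after the selection. The sketch below shows the pattern on a hypothetical two-row frame. It is a possible adjustment, not a change made in this notebook.

```python
import pandas as pd

# Hypothetical raw input with one column that is not needed downstream
raw = pd.DataFrame({'age': [25, 40], 'sex': [' Male', 'Female '], 'fnlwgt': [1, 2]})

# .copy() makes the selection an independent DataFrame,
# so the assignments below clearly modify `subset` and never `raw`
subset = raw[['age', 'sex']].copy()
subset['sex'] = subset['sex'].str.strip()
subset['ID'] = subset.index + 1

print(subset)
print(raw)
```

The original frame is left untouched, which also means the caller's data is not modified as a side effect of pre-processing.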


In [22]:
model.predict_proba(pre_process(sample_input))

array([[0.58145236, 0.41854764],
       [0.41312671, 0.58687329],
       [0.6892787 , 0.3107213 ],
       [0.47758209, 0.52241791],
       [0.73379372, 0.26620628],
       [0.88620221, 0.11379779],
       [0.2629664 , 0.7370336 ],
       [0.96176188, 0.03823812],
       [0.99410978, 0.00589022],
       [0.96885159, 0.03114841]])

# Score Function

In [23]:
def score(input_data, model):
    return model.predict_proba(input_data)

In [24]:
prediction = score(input_data=pre_process(sample_input), model=model)
prediction

array([[0.58145236, 0.41854764],
       [0.41312671, 0.58687329],
       [0.6892787 , 0.3107213 ],
       [0.47758209, 0.52241791],
       [0.73379372, 0.26620628],
       [0.88620221, 0.11379779],
       [0.2629664 , 0.7370336 ],
       [0.96176188, 0.03823812],
       [0.99410978, 0.00589022],
       [0.96885159, 0.03114841]])

# 5. Post-processing Function for Prediction

In [25]:
prediction = score(input_data=pre_process(sample_input), model=model)

output = []
for i in range(len(prediction)):
    if prediction[i][1]>prediction[i][0]:
        output.append(F">50K ({prediction[i][1]:.2f})")
    else:
        output.append(F"<=50K ({prediction[i][0]:.2f})")
output

['<=50K (0.58)',
 '>50K (0.59)',
 '<=50K (0.69)',
 '>50K (0.52)',
 '<=50K (0.73)',
 '<=50K (0.89)',
 '>50K (0.74)',
 '<=50K (0.96)',
 '<=50K (0.99)',
 '<=50K (0.97)']

In [26]:
def post_process(prediction):
    output = []
    for i in range(len(prediction)):
        if prediction[i][1]>prediction[i][0]:
            output.append(F">50K ({prediction[i][1]:.2f})")
        else:
            output.append(F"<=50K ({prediction[i][0]:.2f})")
    
    if len(output)==1:
        return output[0]
    else:
        return output
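
The same labelling logic can also be written with NumPy array operations instead of an index loop. The sketch below is one possible variant under a hypothetical name, `post_process_array`. Unlike `post_process`, it always returns a list, even for a single row. Ties go to `<=50K`, matching the strict `>` comparison in the original.

```python
import numpy as np

def post_process_array(prediction):
    """Turn an (n, 2) array of class probabilities into labelled strings."""
    prediction = np.asarray(prediction)
    # argmax returns the first index on ties, so equal probabilities map to '<=50K'
    winner = prediction.argmax(axis=1)
    labels = np.array(['<=50K', '>50K'])[winner]
    confidence = prediction.max(axis=1)
    return [f"{label} ({p:.2f})" for label, p in zip(labels, confidence)]

# Three rows copied from the predict_proba output shown earlier
sample = [[0.58145236, 0.41854764],
          [0.41312671, 0.58687329],
          [0.2629664, 0.7370336]]
print(post_process_array(sample))
```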

In [27]:
# Output value 
sample_output = post_process(score(input_data=pre_process(sample_input), model=model))
sample_output

['<=50K (0.58)',
 '>50K (0.59)',
 '<=50K (0.69)',
 '>50K (0.52)',
 '<=50K (0.73)',
 '<=50K (0.89)',
 '>50K (0.74)',
 '<=50K (0.96)',
 '<=50K (0.99)',
 '<=50K (0.97)']

In [28]:
# Create new column in input dataset
sample_input['prediction'] = post_process(model.predict_proba(pre_process(sample_input)))
sample_input

Unnamed: 0.1,Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,class,prediction
1975,1975,40,Private,121956,Bachelors,13,Married-spouse-absent,Prof-specialty,Not-in-family,Asian-Pac-Islander,Male,13550.0,0.0,40.0,Cambodia,>50K,<=50K (0.58)
40988,16174,64,Local-gov,47298,Doctorate,16,Married-civ-spouse,Protective-serv,Husband,White,Male,0.0,0.0,45.0,United-States,>50K.,>50K (0.59)
28262,3444,46,Private,37353,Some-college,10,Married-civ-spouse,Handlers-cleaners,Husband,White,Male,0.0,0.0,45.0,United-States,<=50K.,<=50K (0.69)
38583,13767,32,Federal-gov,72630,Prof-school,15,Never-married,Prof-specialty,Not-in-family,White,Male,14084.0,0.0,55.0,United-States,>50K.,>50K (0.52)
10758,10761,33,Private,191385,Bachelors,13,Never-married,Prof-specialty,Not-in-family,White,Male,0.0,0.0,40.0,Canada,<=50K,<=50K (0.73)
27855,3037,54,?,99208,Preschool,1,Married-civ-spouse,?,Husband,White,Male,0.0,0.0,16.0,United-States,<=50K.,<=50K (0.89)
8211,8213,37,Private,108140,Masters,14,Married-civ-spouse,Exec-managerial,Husband,White,Male,15024.0,0.0,45.0,United-States,>50K,>50K (0.74)
30189,5371,24,Private,479296,9th,5,Never-married,Sales,Own-child,White,Male,0.0,0.0,48.0,United-States,<=50K.,<=50K (0.96)
3804,3804,19,Private,262515,11th,7,Never-married,Other-service,Other-relative,White,Male,0.0,0.0,20.0,United-States,<=50K,<=50K (0.99)
226,226,60,?,24215,10th,6,Divorced,?,Not-in-family,Amer-Indian-Eskimo,Female,0.0,0.0,10.0,United-States,<=50K,<=50K (0.97)


# 6. Prediction Function for Application (Inference Pipeline)
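Put everything together: chain pre-processing, scoring, and post-processing into a single function that an application can call.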

In [29]:
def app_prediction_function(input_data, model):
    return post_process(score(input_data=pre_process(input_data), model=model))

In [30]:
input_data = pd.read_csv(file_name).sample(1)

print(input_data.transpose())
app_prediction_function(input_data, model)

                              39268
Unnamed: 0                    14453
age                              44
workclass               Federal-gov
fnlwgt                       259307
education                 Bachelors
education_num                    13
marital_status   Married-civ-spouse
occupation          Exec-managerial
relationship                Husband
race                          White
sex                            Male
capital_gain                    0.0
capital_loss                    0.0
hours_per_week                 40.0
native_country        United-States
class                         >50K.


'>50K (0.80)'

## Sample Input as Dictionary

In [31]:
input_data = pd.read_csv(file_name).sample(1)

input_data = input_data.to_dict(orient='records')[0]
input_data

{'Unnamed: 0': 8132,
 'age': 30,
 'workclass': ' Private',
 'fnlwgt': 110239,
 'education': ' 10th',
 'education_num': 6,
 'marital_status': ' Married-civ-spouse',
 'occupation': ' Transport-moving',
 'relationship': ' Husband',
 'race': ' White',
 'sex': ' Male',
 'capital_gain': 0.0,
 'capital_loss': 0.0,
 'hours_per_week': 55.0,
 'native_country': ' United-States',
 'class': ' <=50K'}

## Convert Input Data to DataFrame

In [32]:
input_data = pd.DataFrame([input_data])
input_data

Unnamed: 0.1,Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,class
0,8132,30,Private,110239,10th,6,Married-civ-spouse,Transport-moving,Husband,White,Male,0.0,0.0,55.0,United-States,<=50K


## Convert Input Data to JSON

In [33]:
input_data_json = input_data.to_json(orient='records')
input_data_json

'[{"Unnamed: 0":8132,"age":30,"workclass":" Private","fnlwgt":110239,"education":" 10th","education_num":6,"marital_status":" Married-civ-spouse","occupation":" Transport-moving","relationship":" Husband","race":" White","sex":" Male","capital_gain":0.0,"capital_loss":0.0,"hours_per_week":55.0,"native_country":" United-States","class":" <=50K"}]'
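
An application may receive the same record as a dictionary, a JSON string, or a DataFrame. The sketch below shows one way to normalize these into a DataFrame before it is passed to a prediction function. The helper name `to_input_frame` and the shortened example record are hypothetical and are not part of the notebook.

```python
import json

import pandas as pd


def to_input_frame(payload):
    """Hypothetical helper: accept a DataFrame, dict, list of dicts, or JSON text."""
    if isinstance(payload, pd.DataFrame):
        return payload.copy()
    if isinstance(payload, str):
        payload = json.loads(payload)   # JSON text -> dict or list of dicts
    if isinstance(payload, dict):
        payload = [payload]             # single record -> one-row frame
    return pd.DataFrame(payload)


# Shortened record for illustration (a subset of the columns shown above)
record = {"age": 30, "workclass": " Private", "hours_per_week": 55.0}

from_dict = to_input_frame(record)
from_json = to_input_frame(json.dumps([record]))
print(from_dict.equals(from_json))
```

Both paths yield a one-row DataFrame with the same columns, so the downstream pre-processing step sees a consistent input type.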

## Get Prediction

In [34]:
app_prediction_function(input_data, model)

'<=50K (0.78)'

# 7. Repeat Training

In [35]:
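# Reload the raw data and drop rows with missing values
data = pd.read_csv(file_name)
data.dropna(how='any', axis=0, inplace=True)

# Create the target variable
data = target_varible_pre_processing(data)

# Append the engineered features to the original columns
data = pd.concat([data, pre_process(data)], axis=1)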


X_variables = ['is_male',
                'workclass_group_Federal-gov',
                'workclass_group_Local-State-gov',
                'workclass_group_Private',
                'workclass_group_Self-emp-inc',
                'workclass_group_Self-emp-not-inc',
                'marital_status_group_Married-civ-AF-spouse',
                'marital_status_group_Never-married',
                'occupation_group_Craft-repair',
                'occupation_group_Farming-fishing-Machine-op-inspct',
                'occupation_group_Priv-house-serv-Handlers-cleaners-Other',
                'occupation_group_Exec-managerial-Prof-specialty',
                'occupation_group_Armed-Forces-Protective-serv',
                'occupation_group_Sales',
                'occupation_group_Tech-support',
                'occupation_group_Transport-moving',
                'age_group_20-30',
                'age_group_30-40',
                'age_group_40-50',
                'age_group_50-60',
                'age_group_>60',
                'hours_per_week_group_20-35',
                'hours_per_week_group_35-45',
                'hours_per_week_group_45-60',
                'hours_per_week_group_>60',
                'education_group_university',
                'education_group_school',
                'education_group_university_eq'
]

y_variable = 'earn_gt_50K'

data_train, data_test = train_test_split(data, test_size=0.3, random_state=42)

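model = LogisticRegression()

# NOTE: "retrained_rf" is only a label; the estimator trained here is LogisticRegression, not a random forest
model_dict = model_train("retrained_rf", model, data_train, data_test, X_variables, y_variable)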

model_dict

{'model_name': 'retrained_rf',
 'model': LogisticRegression(),
 'accuracy': 0.8337253629653663,
 'precision': 0.6942508710801394,
 'recall': 0.5418082936777702,
 'f1_score': 0.8262376137174818,
 'roc_auc': 0.8846960232619898}
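
As a quick consistency check, the F1 score of a single class is the harmonic mean of that class's precision and recall. The sketch below applies that formula to the precision and recall printed above, assuming both refer to the positive (>50K) class.

```python
# Values copied from the model_dict output above
precision = 0.6942508710801394
recall = 0.5418082936777702

# F1 for one class is the harmonic mean of its precision and recall
f1_positive = 2 * precision * recall / (precision + recall)
print(f"{f1_positive:.4f}")
```

The result is about 0.61, which differs from the reported `f1_score` of about 0.83. One possible explanation, stated here only as an assumption because `model_train` is defined earlier in the notebook, is that the helper computes F1 with a different averaging (for example, weighted across both classes).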

<hr>
Last update 2022-11-15 by Sumudu Tennakoon

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.