<a href="https://colab.research.google.com/github/SamuelMiller413/Coding-Nomads-Deep-Learning-Miniproject-1/blob/master/DL_Miniproject_Tabular_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
# Introduction
---

## Problem Statement:
> An automobile company has plans to enter new markets with their existing products (P1, P2, P3, P4 and P5). After intensive market research, they’ve deduced that the behavior of the new market is similar to their existing market.

>In their existing market, the sales team has classified all customers into 4 segments (A, B, C, D). Then, they performed segmented outreach and communication for different segments of customers. This strategy has worked exceptionally well for them. They plan to use the same strategy on new markets and have identified 2627 new potential customers.

>You are required to help the manager predict the right group for the new customers.

## Resources
>Data    (from Kaggle)
* [Customer Segmentation Dataset](https://www.kaggle.com/datasets/abisheksudarshan/customer-segmentation?select=train.csv) 

> Notebooks of Influence
* [Janatahack](https://www.kaggle.com/code/abisheksudarshan/av-janatahack-customer-segmentation/data) 
* [Seun Ayegboyin](https://www.kaggle.com/code/seunayegboyin/customer-segmentation-with-kmeans-and-pca) (PCA/Clustering)

---
# Work Flow
---

### NOTES
* Write 
    * `def grep()`
    * `def greps_all()`
* OHE all
* Impute NaN
* Train Models
* Train DL Models


### Workflow Outline

---
#### 1. Pre-Training
---
* Setup
    * Data Loading
    * Download the Dataset
    * Split Data
    * Custom Functions
* Initial EDA / Data Visualization
* Feature Engineering and Transformation
* Continued EDA / Data Visualization
* Pipelines
<br>
---
#### 2. Training
---
* Traditional ML Modeling
* Pure Torch Model
* High-level Libraries and Tabular Frameworks
<br>
---
#### 3. Testing
---
* Model Selection and Test Set Evaluation
* Notes and Findings



---
# 1. Pre-Training
---
* Setup
    * Data Loading
    * Download the Dataset
    * Split Data
    * Custom Functions
* Initial EDA / Data Visualization
* Feature Engineering and Transformation
* Continued EDA / Data Visualization
* Pipelines

## Setup

#### Retrieval and Cloud Connection

In [1]:
# mount drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# upload 'kaggle.json'
from google.colab import files
files.upload()

Saving kaggle.json to kaggle (1).json


{'kaggle.json': b'{"username":"<redacted>","key":"<redacted>"}'}

In [3]:
# make kaggle directory (no error if it already exists)
! mkdir -p ~/.kaggle
# store JSON
! cp kaggle.json ~/.kaggle/
# change permissions
! chmod 600 ~/.kaggle/kaggle.json

# download data from kaggle
! kaggle datasets download -d 'abisheksudarshan/customer-segmentation'
# make directory for data (no error if it already exists)
! mkdir -p customer-segmentation

# unzip into directory, overwriting existing files without prompting
! unzip -o customer-segmentation.zip -d customer-segmentation/
# remove .zip
! rm customer-segmentation.zip

Downloading customer-segmentation.zip to /content
  0% 0.00/98.7k [00:00<?, ?B/s]
100% 98.7k/98.7k [00:00<00:00, 70.1MB/s]
Archive:  customer-segmentation.zip


#### Imports

In [2]:
%%capture
# import libraries
import numpy as np 
import pandas as pd 
import pprint
import inspect

# PRE-PROCESSING                                                                          
from sklearn.preprocessing import StandardScaler
from sklearn_pandas import CategoricalImputer

#  MODELS
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb

# FEATURE SELECTION                                                                          
from sklearn.feature_selection import SelectFromModel, mutual_info_regression, RFE, RFECV

# PIPELINE                                     
from sklearn.pipeline import Pipeline
                                          
# NEURAL NETWORK                                                   
from sklearn.neural_network import MLPRegressor

# CROSS VALIDATION
from sklearn.model_selection import train_test_split, cross_val_score, ShuffleSplit, StratifiedKFold
from sklearn.model_selection import learning_curve, cross_val_predict
from sklearn.model_selection import KFold, RandomizedSearchCV

# EVALUATION
from sklearn.metrics import mean_squared_log_error, mean_squared_error, roc_auc_score, accuracy_score, log_loss, classification_report
from sklearn.metrics import silhouette_score
from sklearn.metrics import SCORERS

# PLOTTING
import random
import matplotlib.pyplot as plt
import seaborn as sns

# FROM example janatahack notebook 

# IGNORE WARNINGS
import warnings
warnings.filterwarnings("ignore")
from collections import Counter


#### Download Dataset

In [5]:
csv_train = '/content/customer-segmentation/train.csv'
csv_test = '/content/customer-segmentation/test.csv'

#### Split Data

In [258]:
'''
    data is pre-split into train/test csv files: 
    1. train:   contains all columns
    2. test:    contains no label column

    > Treating 'train' as the full dataset, since it is the only file with labels

    Instead of an X, y split, 
        using: 
            X = feature_cols    
            y = label_col   
'''
# read csv
df = pd.read_csv(csv_train)

# train_test split 80/20
train_df, test_df = train_test_split(
    df, test_size=0.20, random_state=42
    )

# label
label_col = 'Segmentation'

# features: every column except the label
feature_cols = df.columns.tolist()
feature_cols.remove(label_col)
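
The split above is purely random, so the segment proportions in the two parts can drift apart. A minimal sketch of a stratified split, assuming a small toy frame (`toy`, hypothetical values) in place of `train.csv`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy stand-in for train.csv (hypothetical values, 10 rows per segment)
toy = pd.DataFrame({
    'Age': range(40),
    'Segmentation': ['A', 'B', 'C', 'D'] * 10,
})

# stratify on the label so both parts keep the same segment proportions
toy_train, toy_test = train_test_split(
    toy, test_size=0.20, random_state=42, stratify=toy['Segmentation']
)

print(toy_train['Segmentation'].value_counts().to_dict())
print(toy_test['Segmentation'].value_counts().to_dict())
```

With four equally sized segments, each segment contributes the same number of rows to the test part. Whether stratification matters for the real data depends on how balanced the four segments are.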

In [253]:
train_df[feature_cols]

| | ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 462809 | Male | No | 22 | No | Healthcare | 1.0 | Low | 4.0 | Cat_4 |
| 1 | 462643 | Female | Yes | 38 | Yes | Engineer | NaN | Average | 3.0 | Cat_4 |
| 2 | 466315 | Female | Yes | 67 | Yes | Engineer | 1.0 | Low | 1.0 | Cat_6 |
| 3 | 461735 | Male | Yes | 67 | Yes | Lawyer | 0.0 | High | 2.0 | Cat_6 |
| 4 | 462669 | Female | Yes | 40 | Yes | Entertainment | NaN | High | 6.0 | Cat_6 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8063 | 464018 | Male | No | 22 | No | NaN | 0.0 | Low | 7.0 | Cat_1 |
| 8064 | 464685 | Male | No | 35 | No | Executive | 3.0 | Low | 4.0 | Cat_4 |
| 8065 | 465406 | Female | No | 33 | Yes | Healthcare | 1.0 | Low | 1.0 | Cat_6 |
| 8066 | 467299 | Female | No | 27 | Yes | Healthcare | 1.0 | Low | 4.0 | Cat_6 |


In [None]:
train_df[label_col]

## Functions

### Retrieve Variable Name

In [12]:
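def retrieve_name(var):
    '''
    Doc:
    Returns the name(s) bound to 'var' in the frame two levels up,
    i.e. in the caller of the function that called retrieve_name.
    Intended to be called from inside a helper such as summarize().
    '''
    callers_local_vars = inspect.currentframe().f_back.f_back.f_locals.items()
    return [var_name for var_name, var_val in callers_local_vars if var_val is var]
    # above code --> https://stackoverflow.com/questions/18425225/getting-the-name-of-a-variable-as-a-string
                            # user: scohe001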
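
To make the two-frame lookup concrete, here is a small self-contained sketch. `describe` and `run` are hypothetical helpers that stand in for `summarize()` and the notebook cell that calls it:

```python
import inspect

def retrieve_name(var):
    # two frames up: the caller of the function that called retrieve_name
    callers_local_vars = inspect.currentframe().f_back.f_back.f_locals.items()
    return [name for name, val in callers_local_vars if val is var]

def describe(obj):
    # hypothetical wrapper, standing in for summarize()
    return retrieve_name(obj)

def run():
    # hypothetical caller, standing in for a notebook cell
    customers = [1, 2, 3]
    alias = customers  # second name bound to the same object
    return describe(customers)

names = run()
print(names)
```

Every name bound to the same object is returned, so taking element `[0]` only gives a predictable answer when the object has a single name in the calling scope.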

### Summarize

In [13]:
def summarize(subset=None):
    '''
    Doc:
    Summarizes EDA for Pandas dataframe.

        Parameters
        -------------
            subset      : (dataframe)
                Specify the dataframe or subset of dataframe to summarize
        Returns
        -------------
            None. Prints shape, null counts, dtypes, info, describe,
            numeric correlations, and per-column value counts.
    '''
    def section(title):
        print(f"\n--------\n{title.upper()}\n--------\n")

    # display name of subset (fall back to a generic name if the lookup fails)
    names = retrieve_name(subset)
    print(f"\nSummary of Subset:  '{names[0] if names else 'subset'}'\n")
    # attributes
    section('shape')
    print(subset.shape)
    section('isnull')
    print(subset.isnull().sum())
    section('dtypes')
    print(subset.dtypes)
    # summary functions; info() prints directly and returns None
    section('info')
    subset.info()
    section('describe')
    print(subset.describe())
    # correlate numeric columns only; object columns cannot be correlated
    section('corr')
    print(subset.select_dtypes('number').corr())
    # value counts per column, skipping the first (ID) and last (label) columns
    section('value_counts')
    for col in subset.columns[1:-1]:
        print(subset[col].value_counts())

### Data Viz

In [15]:
# initialize plotting dict
graph_ = {
    'relplot': sns.relplot,
    'scatterplot': sns.scatterplot,
    'lineplot': sns.lineplot,
    'displot': sns.displot,
    'histplot': sns.histplot,
    'kdeplot': sns.kdeplot,
    'ecdfplot': sns.ecdfplot,
    'rugplot': sns.rugplot,
    'catplot': sns.catplot,
    'stripplot': sns.stripplot,
    'swarmplot': sns.swarmplot,
    'boxplot': sns.boxplot,
    'violinplot': sns.violinplot,
    'boxenplot': sns.boxenplot,
    'pointplot': sns.pointplot,
    'barplot': sns.barplot,
    'countplot': sns.countplot,
    'lmplot': sns.lmplot,
    'regplot': sns.regplot,
    'residplot': sns.residplot,
    'heatmap': sns.heatmap,
    'clustermap': sns.clustermap,
    'FacetGrid': sns.FacetGrid,
    'pairplot': sns.pairplot,
    'PairGrid': sns.PairGrid,
    'jointplot': sns.jointplot,
    'JointGrid': sns.JointGrid
    }
# plot sizes    
sizes = {'s':((10,6)),'m':(14,8), 'l':(18,10)}

def plot_grep(feature, label='Segmentation', plot_type='countplot', fig_size='s',subset=train_df): 
    '''
    Doc:
    Greps feature with label.
    Displays a seaborn plot of grep.

        Parameters
        -------------
            feature   : (str)
                Feature to plot 'label' by
            label     : (str)    
                Label (or additional feature) to plot 'feature' by  
                default   = 'Segmentation'
            plot_type : (str)
                Key of the seaborn plot function to use from {graph_}
                    default: 'countplot'      
            fig_size  : {'s', 'm', 'l'}
                Choose a small , medium, or large figure size
                default   : {'s'=10x6, 'm'=14x8, 'l'=18x10}
            subset    : (dataframe)
                Specify which subset of the dataframe to use
                default   = 'train_df'
                examples  : 'train_df', 'test_df', etc.
        
        Returns
        -------------
            Seaborn plot of grep(feature,label)

        Plot Options: 
        -------------
        --> from Seaborn API Reference : https://seaborn.pydata.org/api.html
        
        Relational Plots
        ---
        'relplot'     - Figure-level interface for drawing relational plots onto a FacetGrid.
        'scatterplot' - Draw a scatter plot with possibility of several semantic groupings.
        'lineplot'    - Draw a line plot with possibility of several semantic groupings.

        Distribution Plots
        ---
        'displot'     - Figure-level interface for drawing distribution plots onto a FacetGrid.
        'histplot'    - Plot univariate or bivariate histograms to show distributions of datasets.
        'kdeplot'     - Plot univariate or bivariate distributions using kernel density estimation.
        'ecdfplot'    - Plot empirical cumulative distribution functions.
        'rugplot'     - Plot marginal distributions by drawing ticks along the x and y axes.

        Categorical Plots
        ---
        'catplot'     - Figure-level interface for drawing categorical plots onto a FacetGrid.
        'stripplot'   - Draw a scatterplot where one variable is categorical.
        'swarmplot'   - Draw a categorical scatterplot with non-overlapping points.
        'boxplot'     - Draw a box plot to show distributions with respect to categories.
        'violinplot'  - Draw a combination of boxplot and kernel density estimate.
        'boxenplot'   - Draw an enhanced box plot for larger datasets
        'pointplot'   - Show point estimates and confidence intervals using scatter plot glyphs.
        'barplot'     - Show point estimates and confidence intervals as rectangular bars.
        'countplot'   - Show the counts of observations in each categorical bin using bars.

        Regression Plots
        ---
        'lmplot'      - Plot data and regression model fits across a FacetGrid.
        'regplot'     - Plot data and a linear regression model fit.
        'residplot'   - Plot the residuals of a linear regression.

        Matrix Plots
        ---
        'heatmap'     - Plot rectangular data as a color-encoded matrix.
        'clustermap'  - Plot a matrix dataset as a hierarchically-clustered heatmap.

        MULTI-PLOT GRIDS

        FacetGrid
        ---
        'FacetGrid'   - Multi-plot grid for plotting conditional relationships.

        Pair Grids
        ---
        'pairplot'    - Plot pairwise relationships in a dataset.
        'PairGrid'    - Subplot grid for plotting pairwise relationships in a dataset.

        Joint 'Grids'
        ---
        'jointplot'   - Draw a plot of two variables with bivariate and univariate graphs.
        'JointGrid'   - Grid for drawing a bivariate plot with marginal univariate plots.
    '''
    # filter input for matching 
    feature = feature.title()
    label = label.title()
    # increase size for age
    if feature == 'Age' and fig_size == 's':
        resp = input(f'This {feature} feature has quite the range!\n\
        Would you like to increase the size for better readability?\n\n(y) or (n)   --> ')
        if resp.lower() == 'n':
            print("You know best!")
        else:
            print("Great idea!")
            fig_size = 'm'
    # plot: pass column names by keyword so seaborn reads them from 'subset'
    fig, ax = plt.subplots(figsize=sizes[fig_size])
    graph_[plot_type](data=subset, x=feature, hue=label)
    plt.show() 

def plot_all_grep(feature, plot_type='countplot', fig_size='s',subset=train_df): 
    '''
    Doc:
    Plots every column of 'subset' except the first and last,
    one figure per column, using 'feature' as the hue.

        Parameters
        -------------
            feature   : (str)
                Column used as hue in every plot
            plot_type : (str)
                Key of the seaborn plot function to use from {graph_}
            fig_size  : {'s', 'm', 'l'}
                Figure size; 'Age' is bumped from 's' to 'm'
            subset    : (dataframe)
                Specify which subset of the dataframe to use
    '''
    # filter input for matching 
    feature = feature.title()
    for col in subset.columns[1:-1]:
        # increase size for age
        size = 'm' if (col == 'Age' and fig_size == 's') else fig_size
        # plot
        fig, ax = plt.subplots(figsize=sizes[size])
        graph_[plot_type](data=subset, x=col, hue=feature)
        plt.show() 

### OHE

In [None]:
def ohe(subset=None):
    '''
    Doc:
    One-hot encodes every object column of 'subset' except the label ('Segmentation').
    Asks for confirmation first; returns 'subset' unchanged if the answer is 'n'.
    '''
    # must be run before fill_nan
    obj_list = [
        col for col in subset.select_dtypes(include='object').columns
        if col != 'Segmentation'
        ]
    print(f"\nOHE of Subset:  '{retrieve_name(subset)[0]}'\n\nColumns Include:")
    for col in obj_list:
        print(f"   {col}")
    resp = input("\nProceed with OHE?     y/n \n\n  -->  ")
    if resp.lower() == 'n':
        print("\nnevermind then...\n")
    else:
        print("\nLet's do it!\n")
        subset = pd.get_dummies(data=subset, columns=obj_list)
        print("...")
    return subset
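
Because `ohe` runs before imputation, it helps to see what `pd.get_dummies` does with missing categories. A minimal sketch on a toy column (hypothetical values):

```python
import pandas as pd

# toy column with one missing profession (hypothetical values)
toy = pd.DataFrame({'Profession': ['Artist', None, 'Doctor', 'Artist']})

# default: a missing value gets no column, so its row is all zeros
encoded = pd.get_dummies(toy, columns=['Profession'])

# dummy_na=True adds an explicit indicator column for missing values
with_nan = pd.get_dummies(toy, columns=['Profession'], dummy_na=True)

print(encoded.columns.tolist())
print(with_nan.columns.tolist())
```

With the default settings a row with a missing category is encoded as all zeros, so the categorical NaN values no longer show up as NaN after this step.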

### Encode Label

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
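
A short sketch of what the encoder does with segment labels. `le_demo` is a separate, hypothetical instance so the notebook's `le` is left untouched:

```python
from sklearn.preprocessing import LabelEncoder

# separate demo instance (hypothetical name)
le_demo = LabelEncoder()
codes = le_demo.fit_transform(['D', 'A', 'C', 'B', 'A'])

# classes_ holds the sorted labels; a label's position is its integer code
print(le_demo.classes_.tolist())
print(codes.tolist())

# map predicted codes back to segment letters
print(le_demo.inverse_transform([0, 3]).tolist())
```

Keeping the fitted encoder around is what allows model predictions to be turned back into the A to D segment letters.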

### Impute

In [82]:
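def fill_nan(subset=None):
    '''
    Doc:
    Fills NaN values in every column of 'subset' that has any,
    using that column's most frequent value (its mode).

    Run after ohe(): by then the columns that still hold NaN
    should be numeric (e.g. 'Work_Experience', 'Family_Size').
    '''
    # pandas is used here in place of CategoricalImputer; the strategy
    # (most frequent value) is assumed to match that imputer's default
    subset = subset.copy()
    # columns with at least one NaN
    nan_list = [col for col in subset.columns if subset[col].isnull().any()]
    for feature in nan_list:
        # impute NaN values with the most frequent value
        subset[feature] = subset[feature].fillna(subset[feature].mode().iloc[0])
    return subset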
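
Most-frequent imputation can be illustrated on a toy frame. This is a sketch in plain pandas with hypothetical values, not the notebook's exact imputer:

```python
import numpy as np
import pandas as pd

# toy numeric columns with gaps (hypothetical values)
toy = pd.DataFrame({
    'Work_Experience': [1.0, np.nan, 1.0, 0.0],
    'Family_Size': [4.0, 3.0, np.nan, 3.0],
})

filled = toy.copy()
# columns with at least one NaN
nan_cols = [col for col in filled.columns if filled[col].isnull().any()]
for col in nan_cols:
    # mode() returns the most frequent value(s); take the first
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```

For skewed numeric columns the median is a common alternative to the mode; which one suits `Work_Experience` and `Family_Size` is a modeling choice.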

### Scale


In [None]:
scaler = StandardScaler()
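
A scaler learns its mean and standard deviation from whatever data it is fitted on. The usual practice is to fit on the training rows only and reuse those statistics for the test rows. A minimal sketch with toy ages (hypothetical values; `demo_scaler` is a separate instance):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy ages (hypothetical values)
train_ages = np.array([[20.0], [30.0], [40.0]])
test_ages = np.array([[30.0], [50.0]])

# fit on train only, then apply the same mean and std to both parts
demo_scaler = StandardScaler().fit(train_ages)
train_scaled = demo_scaler.transform(train_ages)
test_scaled = demo_scaler.transform(test_ages)

print(demo_scaler.mean_, demo_scaler.scale_)
print(test_scaled.ravel())
```

A test age equal to the training mean maps to 0, and values outside the training range simply land further from 0.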

### Transformation Script

In [246]:
def transform_df(subset=None):
    # ohe
    subset_1 = ohe(subset)

    # impute
    subset_1 = fill_nan(subset_1)

    # feature cols
    feature_cols = subset_1.columns.tolist()
    feature_cols.remove('Segmentation')
    # label
    label_col = 'Segmentation'

    # scale
    scaled_array = scaler.fit_transform(subset_1[feature_cols])
    # return to df, keeping the original index so rows stay aligned with their labels
    scaled_df = pd.DataFrame(scaled_array, columns=feature_cols, index=subset.index)
    # concat dfs
    out = pd.concat([scaled_df, subset[label_col]], axis=1)

    # encode label (A, B, C, D --> 0, 1, 2, 3)
    out[label_col] = le.fit_transform(out[label_col])
    
    return out
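
When train and test are one-hot encoded separately, a category missing from one part produces a different set of columns. One way to guard against that, sketched on toy frames with hypothetical values, is to reindex the test columns against the train columns:

```python
import pandas as pd

# toy frames: the test part happens to contain only one category (hypothetical values)
train_raw = pd.DataFrame({'Var_1': ['Cat_1', 'Cat_2', 'Cat_3']})
test_raw = pd.DataFrame({'Var_1': ['Cat_2', 'Cat_2']})

train_ohe = pd.get_dummies(train_raw, columns=['Var_1'])
test_ohe = pd.get_dummies(test_raw, columns=['Var_1'])

# add the columns the test part is missing (filled with 0) and match the order
test_aligned = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)

print(test_ohe.columns.tolist())
print(test_aligned.columns.tolist())
```

The same idea applies to the scaler and label encoder: fitting them once on the training part and only transforming the test part keeps both parts on the same footing.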

### Cross Validate ###

In [None]:
cv = KFold(n_splits = 3, shuffle=True)

## Initial EDA & Data Visualization

In [None]:
summarize(train_df)

In [None]:
plot_all_grep('segmentation')

In [None]:
plot_grep(feature='gender', label='spending_score')

In [None]:
train_df['Gender'].value_counts()

In [None]:
plot_grep('gender')

## Feature engineering and transformation

### Transform : Impute, OHE, Scale

In [31]:
# train
train_df = transform_df(train_df)
# test
test_df = transform_df(test_df)

#### Update Column Groups

In [33]:
# categorical
cat_cols = ['Gender_Female', 'Gender_Male', 'Ever_Married_No',
       'Ever_Married_Yes', 'Graduated_No', 'Graduated_Yes',
       'Profession_Artist', 'Profession_Doctor', 'Profession_Engineer',
       'Profession_Entertainment', 'Profession_Executive',
       'Profession_Healthcare', 'Profession_Homemaker', 'Profession_Lawyer',
       'Profession_Marketing', 'Spending_Score_Average', 'Spending_Score_High',
       'Spending_Score_Low', 'Var_1_Cat_1', 'Var_1_Cat_2', 'Var_1_Cat_3',
       'Var_1_Cat_4', 'Var_1_Cat_5', 'Var_1_Cat_6', 'Var_1_Cat_7']

# continuous
cont_cols = ['Age','Work_Experience', 'Family_Size']

# features
feature_cols = train_df.columns.tolist()
feature_cols.remove('Segmentation')
# label
label_col = 'Segmentation'

# print(feature_cols)
# print('---')
# print(label_col)

## Continued EDA & Data Visualization

In [34]:
# correlation object
corr = train_df.corr().round(2)
# boolean mask that hides the upper triangle (and diagonal) of the correlation matrix
up_tri = np.triu(np.ones_like(corr, dtype=bool))

# PLOT
f, ax = plt.subplots(figsize=(30,30))
sns.heatmap(corr, vmin=-1, vmax=1, cmap='gist_stern', annot=True, linewidth=0.5, square=True, mask=up_tri)
plt.show()
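
One detail worth checking when building a heatmap mask with `np.triu`: seaborn hides a cell where the mask is truthy, so a mask made from the correlation values themselves leaves any upper-triangle cell whose correlation is exactly 0 visible. A small numpy sketch with a hypothetical 3x3 matrix:

```python
import numpy as np

# hypothetical correlation matrix with one zero correlation
corr_demo = np.array([
    [1.0, 0.0, 0.5],
    [0.0, 1.0, -0.2],
    [0.5, -0.2, 1.0],
])

# mask built from the values: the 0.0 entry is falsy, so that cell would stay visible
value_mask = np.triu(corr_demo)

# mask built from booleans: every upper-triangle cell is True regardless of its value
bool_mask = np.triu(np.ones_like(corr_demo, dtype=bool))

print(value_mask[0, 1], bool_mask[0, 1])
```

After rounding to two decimals, zero correlations are plausible among one-hot columns, which is why the boolean form is the safer choice.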


In [35]:
# sns.pairplot(train)

---
# 2. Training
---
* Traditional ML Modeling
* Pure Torch Model
* High-level Libraries and Tabular Frameworks

## Traditional ML modeling

### Dummy Classifier

In [None]:
# init model
dummy = DummyClassifier(strategy='stratified')
# fit model
dummy.fit(train_df[feature_cols], train_df[label_col])
# score
eval_score = accuracy_score(test_df[label_col], dummy.predict(test_df[feature_cols]))
# display
print('Eval ACC: {}'.format(eval_score))
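
As a sanity check on the baseline, the expected accuracy of a `stratified` dummy can be worked out by hand: it guesses class k with probability p_k, so it is right with probability equal to the sum of p_k squared (assuming the test set has the same class proportions as the training set). A sketch with hypothetical class counts:

```python
from collections import Counter

# hypothetical label distribution, not the real segment counts
labels = ['A'] * 30 + ['B'] * 20 + ['C'] * 25 + ['D'] * 25

counts = Counter(labels)
n = len(labels)
priors = {k: c / n for k, c in counts.items()}

# stratified: guess class k with probability p_k, correct with probability p_k
expected_stratified = sum(p ** 2 for p in priors.values())
# most_frequent: always guess the largest class
expected_most_frequent = max(priors.values())

print(round(expected_stratified, 4), round(expected_most_frequent, 4))
```

For four roughly balanced segments both baselines sit near 0.25, which is the number any trained model here needs to beat.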

### Random Forest Classifier

In [179]:
# base model
rfc = RandomForestClassifier(n_estimators=50,max_depth=100, n_jobs=-1)

'rfc' added to 'pipes'


##### Parameters


In [40]:
# parameter lists
criterion_list = ['gini','entropy']
n_estimators_list = [int(x) for x in np.linspace(start = 100, stop = 200, num = 50)]
max_features_list = ['sqrt', 'log2']
max_depth_list = [int(x) for x in np.linspace(10, 110, num=11)]
min_samples_split_list = [2, 5]
min_samples_leaf_list = [1, 4]
bootstrap_list = [True, False]

# grid
grid_rfc = {
            'n_estimators': n_estimators_list,
            'criterion': criterion_list,
            'max_features': max_features_list,
            'max_depth': max_depth_list,
            'min_samples_split': min_samples_split_list,
            'min_samples_leaf': min_samples_leaf_list,
            'bootstrap': bootstrap_list
            }
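
To get a feel for how much of this grid a random search can cover, the number of combinations can be counted directly. The sizes below are the lengths of the lists defined above; `n_iter = 10` is the value passed to the searcher:

```python
from math import prod

# length of each parameter list in grid_rfc
grid_sizes = {
    'n_estimators': 50,
    'criterion': 2,
    'max_features': 2,
    'max_depth': 11,
    'min_samples_split': 2,
    'min_samples_leaf': 2,
    'bootstrap': 2,
}

total_combinations = prod(grid_sizes.values())
n_iter = 10  # candidates sampled by RandomizedSearchCV
coverage = n_iter / total_combinations

print(total_combinations, f"{coverage:.4%}")
```

Ten draws sample only a small fraction of the grid, so the reported best parameters are better read as "a reasonable candidate" than as the optimum of the grid.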

##### Random Search (RCV)


In [41]:
# searcher
rcv_rfc = RandomizedSearchCV(estimator=rfc, param_distributions=grid_rfc, 
                                 n_iter=10, verbose=2, cv=cv, random_state=42, 
                                 n_jobs=-1)

# fit to model
# %%time
rcv_rfc.fit(train_df[feature_cols], train_df[label_col])

# best-imator
best_rfc = rcv_rfc.best_estimator_

##### CV Scores

In [46]:
# get scores
mean_score_rfc = rcv_rfc.cv_results_['mean_test_score']
std_score_rfc = rcv_rfc.cv_results_['std_test_score']
params_rfc = rcv_rfc.cv_results_['params']

# frame
cv_score_df = pd.DataFrame(params_rfc)
cv_score_df['mean_score_rfc'] = mean_score_rfc
cv_score_df['std_score_rfc'] = std_score_rfc

# breakdown
breakdown_rfc = {"Index": rcv_rfc.best_index_,
"Params" : rcv_rfc.best_params_,
"Estimator" : rcv_rfc.best_estimator_,
"Score" : rcv_rfc.best_score_}

In [None]:
# display scores
for k,v in breakdown_rfc.items():
    print(f"\n{k}:")
    print(f"\n{v}")

#### Test

In [50]:
# init model
rfc = best_rfc
# fit model
rfc.fit(train_df[feature_cols], train_df[label_col])
# score
rfc_score = accuracy_score(test_df[label_col], rfc.predict(test_df[feature_cols]))
# display
print('Eval ACC: {}'.format(rfc_score))

Pipeline(steps=[('scaler', StandardScaler()),
                ('rfc',
                 RandomForestClassifier(criterion='entropy', max_depth=10,
                                        max_features='sqrt', n_estimators=153,
                                        n_jobs=-1))])

### LGBM

#### Parameters

In [3]:
# parameter lists
learning_rate_list = [0.02,0.03,0.04]
max_depth_list = [int(x) for x in np.linspace(10, 110, num=11)]
n_estimators_list = [int(x) for x in np.linspace(start = 100, stop = 200, num = 50)]
boosting_type_list = ['gbdt', 'rf', 'dart', 'goss']
subsample_list = [0.5,0.6,0.7,0.8]
colsample_bytree_list = [0.5,0.6,0.7,0.8,0.9]
min_data_in_leaf_list = [50,100,150]
reg_alpha_list = [0, 1, 1.5]
reg_lambda_list = [0, 1]

# grid
grid_lgbm = {
    'learning_rate': learning_rate_list,
    'max_depth': max_depth_list,
    'n_estimators': n_estimators_list,
    'boosting_type': boosting_type_list,
    'subsample': subsample_list,
    'colsample_bytree': colsample_bytree_list,
    'min_data_in_leaf': min_data_in_leaf_list,
    'reg_alpha': reg_alpha_list,
    'reg_lambda': reg_lambda_list,
            }

# alt
    # params = {}
    # params['learning_rate'] = 0.03
    # params['max_depth'] = 25
    # params['n_estimators'] = 3000
    # params['objective'] = 'multiclass'
    # params['boosting_type'] = 'gbdt'
    # params['subsample'] = 0.7
    # params['random_state'] = 42
    # params['colsample_bytree']=.9
    # params['min_data_in_leaf'] = 100
    # params['reg_alpha'] = 1.7
    # params['reg_lambda'] = 1.11
    # #params['class_weight']: {0: 0.44, 1: 0.4, 2: 0.37}

##### Random Search (RCV)


In [None]:
# base model
lgbm = lgb.LGBMClassifier(random_state=42)

# searcher
rcv_lgbm = RandomizedSearchCV(estimator=lgbm, param_distributions=grid_lgbm, 
                                 n_iter=10, verbose=2, cv=cv, random_state=42, 
                                 n_jobs=-1)

# fit to model
# %%time
rcv_lgbm.fit(train_df[feature_cols], train_df[label_col])

# best-imator
best_lgbm = rcv_lgbm.best_estimator_

##### CV Scores

In [None]:
# get scores
mean_score_lgbm = rcv_lgbm.cv_results_['mean_test_score']
std_score_lgbm = rcv_lgbm.cv_results_['std_test_score']
params_lgbm = rcv_lgbm.cv_results_['params']

# frame
cv_score_df = pd.DataFrame(params_lgbm)
cv_score_df['mean_score_lgbm'] = mean_score_lgbm
cv_score_df['std_score_lgbm'] = std_score_lgbm

# breakdown
breakdown_lgbm = {"Index": rcv_lgbm.best_index_,
"Params" : rcv_lgbm.best_params_,
"Estimator" : rcv_lgbm.best_estimator_,
"Score" : rcv_lgbm.best_score_}

In [None]:
# display scores
for k,v in breakdown_lgbm.items():
    print(f"\n{k}:")
    print(f"\n{v}")


#### Test

In [269]:
# init model
lgbm = best_lgbm
# fit model
# note: early_stopping_rounds and verbose are fit() keywords of older LightGBM releases;
# newer releases are expected to take early stopping and logging as callbacks instead
lgbm.fit(
    train_df[feature_cols], train_df[label_col],
    early_stopping_rounds=100,
    eval_set=[
        (train_df[feature_cols], train_df[label_col]),
        (test_df[feature_cols], test_df[label_col])
        ],
    eval_metric='multi_error',
    verbose=True,
    categorical_feature=cat_cols
    )
# score
lgbm_score = accuracy_score(test_df[label_col], lgbm.predict(test_df[feature_cols]))
# display
print('Eval ACC: {}'.format(lgbm_score))

[1]	valid_0's multi_error: 0.719008	valid_0's multi_logloss: 1.37182	valid_1's multi_error: 0.718711	valid_1's multi_logloss: 1.37231
Training until validation scores don't improve for 100 rounds.
[2]	valid_0's multi_error: 0.719008	valid_0's multi_logloss: 1.36105	valid_1's multi_error: 0.718711	valid_1's multi_logloss: 1.36227
[3]	valid_0's multi_error: 0.683678	valid_0's multi_logloss: 1.34951	valid_1's multi_error: 0.682776	valid_1's multi_logloss: 1.3513
[4]	valid_0's multi_error: 0.617355	valid_0's multi_logloss: 1.33891	valid_1's multi_error: 0.620198	valid_1's multi_logloss: 1.34126
[5]	valid_0's multi_error: 0.585537	valid_0's multi_logloss: 1.32879	valid_1's multi_error: 0.58767	valid_1's multi_logloss: 1.33178
[6]	valid_0's multi_error: 0.559091	valid_0's multi_logloss: 1.31927	valid_1's multi_error: 0.560099	valid_1's multi_logloss: 1.32294
[7]	valid_0's multi_error: 0.523347	valid_0's multi_logloss: 1.30932	valid_1's multi_error: 0.530359	valid_1's multi_logloss: 1.31363
...

## Pure `torch` model

In [None]:
# Your code here

## High-level libraries and tabular frameworks

In [79]:
from fastai.tabular.all import *


In [80]:
TabularDataLoaders.from_df?

In [82]:
train_df1 = pd.read_csv(csv_train)


In [84]:
train_df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8068 entries, 0 to 8067
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               8068 non-null   int64  
 1   Gender           8068 non-null   object 
 2   Ever_Married     7928 non-null   object 
 3   Age              8068 non-null   int64  
 4   Graduated        7990 non-null   object 
 5   Profession       7944 non-null   object 
 6   Work_Experience  7239 non-null   float64
 7   Spending_Score   8068 non-null   object 
 8   Family_Size      7733 non-null   float64
 9   Var_1            7992 non-null   object 
 10  Segmentation     8068 non-null   object 
dtypes: float64(2), int64(2), object(7)
memory usage: 693.5+ KB


In [85]:
cat_cols = ['Gender','Ever_Married','Graduated','Profession','Spending_Score']
cont_cols = ['Age','Work_Experience', 'Family_Size']
label_col = ['Segmentation']

In [86]:
dls = TabularDataLoaders.from_df(train_df1, procs = [Categorify, FillMissing, Normalize], cat_names=cat_cols, cont_names=cont_cols, y_names=label_col)

In [87]:
dls.show_batch()

| | Gender | Ever_Married | Graduated | Profession | Spending_Score | Work_Experience_na | Family_Size_na | Age | Work_Experience | Family_Size | Segmentation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | Yes | Yes | Artist | Average | False | False | 41.0 | 1.0 | 3.0 | B |
| 1 | Female | No | Yes | Artist | Low | False | False | 45.0 | 4.0 | 1.0 | A |
| 2 | Female | No | No | Entertainment | Low | False | False | 19.000001 | 1.0 | 5.0 | D |
| 3 | Male | Yes | Yes | Entertainment | Average | False | False | 40.0 | 1.0 | 2.0 | A |
| 4 | Male | Yes | Yes | Entertainment | Low | False | False | 48.0 | 9.0 | 1.0 | D |
| 5 | Male | No | Yes | Artist | Low | False | False | 37.0 | 8.0 | 3.0 | C |
| 6 | Female | No | Yes | Artist | Low | False | False | 43.0 | -8.914188e-08 | 1.0 | A |
| 7 | Female | Yes | Yes | Marketing | Low | False | False | 51.0 | 9.0 | 1.0 | A |
| 8 | Female | Yes | Yes | Artist | High | False | False | 69.0 | -8.914188e-08 | 2.0 | C |
| 9 | Male | No | Yes | Healthcare | Low | False | False | 25.000001 | 1.0 | 3.0 | D |


In [None]:
dls.one_batch()

In [92]:
# create learner 
learn = tabular_learner(dls=dls,metrics=accuracy)

In [None]:
learn.fit_one_cycle(61, lr_max=1e-3, start_epoch=60)

In [None]:
learn.recorder.plot_loss()

In [107]:
learn.model

TabularModel(
  (embeds): ModuleList(
    (0): Embedding(3, 3)
    (1): Embedding(3, 3)
    (2): Embedding(3, 3)
    (3): Embedding(10, 6)
    (4): Embedding(4, 3)
    (5): Embedding(3, 3)
    (6): Embedding(3, 3)
  )
  (emb_drop): Dropout(p=0.0, inplace=False)
  (bn_cont): BatchNorm1d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): LinBnDrop(
      (0): Linear(in_features=27, out_features=200, bias=False)
      (1): ReLU(inplace=True)
      (2): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): LinBnDrop(
      (0): Linear(in_features=200, out_features=100, bias=False)
      (1): ReLU(inplace=True)
      (2): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (2): LinBnDrop(
      (0): Linear(in_features=100, out_features=4, bias=True)
    )
  )
)
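
The `in_features=27` of the first linear layer can be reproduced from the printout above: each categorical input passes through an embedding, the embedding outputs are concatenated, and the three continuous columns are appended. The sketch below assumes fastai picks embedding widths with its default rule of thumb, re-implemented here as the hypothetical helper `emb_sz_rule_demo` rather than imported:

```python
def emb_sz_rule_demo(n_cat):
    # assumed re-implementation of fastai's default embedding-width heuristic
    return min(600, round(1.6 * n_cat ** 0.56))

# cardinalities read off the Embedding(n, width) lines in the printout:
# five categorical columns plus the two *_na indicator columns
cardinalities = [3, 3, 3, 10, 4, 3, 3]
widths = [emb_sz_rule_demo(n) for n in cardinalities]

n_cont = 3  # Age, Work_Experience, Family_Size
in_features = sum(widths) + n_cont

print(widths, in_features)
```

The two extra embeddings come from the `Work_Experience_na` and `Family_Size_na` columns that `FillMissing` adds, which is why there are seven embeddings for five named categorical columns.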

In [None]:
# add model hook right before last lin layer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    

---
# 3. Testing
---
* Model Selection and Test Set Evaluation
* Notes and Findings

## Model Selection and Test Set Evaluation

In [None]:
# Your code here
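
One way to approach this step is to pick the candidate with the best validation score and then score it once on the held-out test set. The sketch below uses only the standard library. The model names (`traditional_ml`, `pure_torch`, `tabular_framework`), the validation scores, and the label lists are hypothetical placeholders for illustration, not results from this notebook; the helper functions `accuracy`, `confusion_matrix`, and `select_best` are also assumptions rather than part of any library.

```python
from collections import Counter

# The four customer segments from the problem statement.
SEGMENTS = ["A", "B", "C", "D"]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true segments, columns are predicted segments."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def select_best(val_scores):
    """Return the name of the model with the highest validation score."""
    return max(val_scores, key=val_scores.get)


# Placeholder validation accuracies (assumed values, not measured results).
val_scores = {
    "traditional_ml": 0.50,
    "pure_torch": 0.52,
    "tabular_framework": 0.53,
}
best_name = select_best(val_scores)

# Toy test labels and predictions standing in for the selected model's output.
y_test = ["A", "B", "C", "D", "A", "B", "C", "D"]
y_pred = ["A", "B", "C", "A", "A", "D", "C", "D"]

test_acc = accuracy(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, SEGMENTS)

print(f"Selected model: {best_name}")
print(f"Test accuracy: {test_acc:.2f}")
for segment, row in zip(SEGMENTS, cm):
    print(segment, row)
```

Selecting on validation scores and touching the test set only once keeps the reported test accuracy from being biased by the selection itself. In the notebook, `y_pred` would come from the chosen model's predictions on the test split rather than a hard-coded list.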

## Notes and Findings

What did you learn?