<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-master-table" data-toc-modified-id="Load-master-table-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load master table</a></span></li><li><span><a href="#Select-columns" data-toc-modified-id="Select-columns-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Select columns</a></span></li><li><span><a href="#Convert-the-pandas-dataframe-to-my-helper-function-object" data-toc-modified-id="Convert-the-pandas-dataframe-to-my-helper-function-object-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Convert the pandas dataframe to my helper function object</a></span><ul class="toc-item"><li><span><a href="#Set-the-target-variable" data-toc-modified-id="Set-the-target-variable-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Set the target variable</a></span></li><li><span><a href="#We-can-also-track-the-actions-performed" data-toc-modified-id="We-can-also-track-the-actions-performed-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>We can also track the actions performed</a></span></li><li><span><a href="#Check-the-na-values" data-toc-modified-id="Check-the-na-values-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Check the na values</a></span></li></ul></li><li><span><a href="#Models" data-toc-modified-id="Models-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Models</a></span><ul class="toc-item"><li><span><a href="#Logistic-regression-model" data-toc-modified-id="Logistic-regression-model-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Logistic regression model</a></span><ul class="toc-item"><li><span><a href="#Model-results" data-toc-modified-id="Model-results-4.1.1"><span class="toc-item-num">4.1.1&nbsp;&nbsp;</span>Model results</a></span></li><li><span><a href="#Taking-a-look-at-the-train-dataset-used" data-toc-modified-id="Taking-a-look-at-the-train-dataset-used-4.1.2"><span class="toc-item-num">4.1.2&nbsp;&nbsp;</span>Taking a look at the train dataset 
used</a></span></li></ul></li><li><span><a href="#Decision-tree" data-toc-modified-id="Decision-tree-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Decision tree</a></span><ul class="toc-item"><li><span><a href="#Model-Performance" data-toc-modified-id="Model-Performance-4.2.1"><span class="toc-item-num">4.2.1&nbsp;&nbsp;</span>Model Performance</a></span></li></ul></li><li><span><a href="#SVM-model" data-toc-modified-id="SVM-model-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>SVM model</a></span><ul class="toc-item"><li><span><a href="#Model-performance" data-toc-modified-id="Model-performance-4.3.1"><span class="toc-item-num">4.3.1&nbsp;&nbsp;</span>Model performance</a></span></li></ul></li><li><span><a href="#Random-Forest" data-toc-modified-id="Random-Forest-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Random Forest</a></span><ul class="toc-item"><li><span><a href="#Model-performance" data-toc-modified-id="Model-performance-4.4.1"><span class="toc-item-num">4.4.1&nbsp;&nbsp;</span>Model performance</a></span></li><li><span><a href="#Extracting-feature-importances" data-toc-modified-id="Extracting-feature-importances-4.4.2"><span class="toc-item-num">4.4.2&nbsp;&nbsp;</span>Extracting feature importances</a></span></li></ul></li><li><span><a href="#Random-forest-randomized-search" data-toc-modified-id="Random-forest-randomized-search-4.5"><span class="toc-item-num">4.5&nbsp;&nbsp;</span>Random forest randomized search</a></span><ul class="toc-item"><li><span><a href="#Best-estimator-from-the-random-search" data-toc-modified-id="Best-estimator-from-the-random-search-4.5.1"><span class="toc-item-num">4.5.1&nbsp;&nbsp;</span>Best estimator from the random search</a></span></li><li><span><a href="#Best-estimator-performance" data-toc-modified-id="Best-estimator-performance-4.5.2"><span class="toc-item-num">4.5.2&nbsp;&nbsp;</span>Best estimator performance</a></span></li><li><span><a href="#Model-performance-after-grid-search" 
data-toc-modified-id="Model-performance-after-grid-search-4.5.3"><span class="toc-item-num">4.5.3&nbsp;&nbsp;</span>Model performance after grid search</a></span></li></ul></li></ul></li></ul></div>

The following code is taken from my GitHub repo: https://github.com/SCK22/HelperFunctions/blob/development/ML/HelperFunctionsMLClass.py

In [1]:
	"""Helper functions for Machine Learning"""
	import numpy as np
	import pandas as pd
	from sklearn.model_selection import train_test_split
	from sklearn.tree import DecisionTreeClassifier
	from sklearn.linear_model import LogisticRegression
	from sklearn.impute import SimpleImputer
	from sklearn.preprocessing import StandardScaler
	from sklearn.preprocessing import OneHotEncoder
	from sklearn.pipeline import Pipeline
	from sklearn.compose import ColumnTransformer
	# NOTE: precision_score is not imported here, so compute_mertics falls back to precision = None
	from sklearn.metrics import accuracy_score, recall_score, f1_score

	class HelperFunctionsML:
		"""Helper functions for Machine Learning and EDA"""

		def __init__(self, dataset):
			"""Helper Functions to do ML"""
			self.dataset = dataset
			self.target = None
			self.col_names = self.dataset.columns
			self.cat_num_extracted = False
			self.dummies = False
			self.nrows = dataset.shape[0]
			self.ncols = dataset.shape[1]
			self.actions_performed = {}
			self.model_columns = []
			self.model = None

		def update_actions_performed(self, dict_key, attributes = {}, needed_for_test = False):
			self.actions_performed[dict_key] = {"attributes" : attributes, "needed_for_test" : needed_for_test }
		# setter methods
		
		def set_target(self, col_name):
			self.target = col_name
			self.update_actions_performed("set_target", attributes =  {"col_name" : col_name})
			return "Succesfully set target column!"
		
		def set_pos_label(self, level_name):
			self.pos_label = level_name
			self.update_actions_performed("set_pos_label", attributes =  {"level_name" : level_name})
			return "Succesfully set pos_label!"
		
		def test(self, col_name):
			temp = getattr(self, "set_target")(col_name)
			# print("getattr(self.set_target)(col_name) : {}".format(temp))

		def set_X_train(self, X_train):
			self.X_train = X_train
			self.update_actions_performed("set_X_train", attributes =  {"X_train" : X_train})
		
		def set_X_validation(self, X_validation):
			self.X_validation = X_validation
			self.update_actions_performed("set_X_validation", attributes =   {"X_validation" : X_validation})

		def set_y_train(self, y_train):
			self.y_train = y_train
			self.update_actions_performed("set_y_train", attributes =  {"y_train" : y_train})
		
		def set_y_validation(self, y_validation):
			self.y_validation = y_validation
			self.update_actions_performed("set_y_validation", attributes =  {"y_validation" : y_validation})

		def check_has_na_values(self):
			has_na_values = True if np.sum(self.dataset.isnull().sum())>0 else False
			self.has_na_values = has_na_values
			self.update_actions_performed("check_has_na_values")
			if self.has_na_values:
				print(self.dataset.isnull().sum())
			else:
				print("No Null values in the dataset.")
			return has_na_values
		
		# @staticmethod
		def cat_num_extract(self):
			"""This function returns the names of the categorical and numeric attributes, in that order."""
			f_name = "cat_num_extract"
			self.cat_cols = [i for i in self.dataset.columns if self.dataset[i].dtype in ['O', 'object'] and i != self.target]
			self.num_cols = [i for i in self.dataset.columns if (i not in self.cat_cols) and i != self.target]
			self.cat_num_extracted = True
			self.update_actions_performed(f_name)
			return {"cat_cols" : self.cat_cols, "num_cols": self.num_cols}
		
		def extend_cat_cols(self, new_cols = []):
			# if not self.cat_num_extracted:
			# 	print("Running cat_num_extract")
			# 	self.cat_num_extract()
			self.cat_cols.extend(new_cols)

		def get_num_cols(self):
			return len(self.num_cols)
		def get_cat_cols(self):
			return len(self.cat_cols)
		def extend_num_cols(self, new_cols = []):
			# if not self.cat_num_extracted:
			# 	print("Running cat_num_extract")
			# 	self.cat_num_extract()
			self.num_cols.extend(new_cols)
			
		@staticmethod
		def list_of_na_cols(dataset, per=0.3):
			"""This function will return the columns whose fraction of na values exceeds per"""
			na_fraction = dataset.isnull().mean()
			na_cols = na_fraction[na_fraction > per]
			return list(na_cols.index)

		def get_mode(self, x):
			values, counts = np.unique(x.dropna(), return_counts=True)
			m = counts.argmax()
			return values[m]
		
		def numeric_pipeline(self,strategy = "median"):
			numeric_transformer = Pipeline(steps=[
			('imputer', SimpleImputer(strategy=strategy)),
			('scaler', StandardScaler())])
			self.numeric_transformer = numeric_transformer

		def categorical_pipeline(self,strategy = "most_frequent"):
			categorical_transformer = Pipeline(steps=[
			('imputer', SimpleImputer(strategy=strategy,fill_value="missing_value")),
			('onehot', OneHotEncoder(handle_unknown='ignore'))])
			self.categorical_transformer = categorical_transformer

		def set_pipeline(self):
			self.preprocessor  = ColumnTransformer(
			transformers=[
			('num', self.numeric_transformer	, self.num_cols),
			('cat', self.categorical_transformer, self.cat_cols)])
		
		def set_imputer_numeric(self, strategy = "median"):
			self.imputer_numeric = SimpleImputer(strategy = strategy)
			# record the strategy actually used, not a hard-coded value
			self.actions_performed["set_imputer_numeric"] = {"attributes" : {"strategy" : strategy}}

		def set_imputer_categorical(self, strategy = "most_frequent"):
			self.imputer_categorical = SimpleImputer(strategy = strategy)
			self.actions_performed["set_imputer_categorical"] = {"attributes" : {"strategy" : strategy}}
		
		def train_imputer_numeric(self):
			self.imputer_numeric.fit(self.dataset.loc[:,self.num_cols])
			self.update_actions_performed("train_imputer_numeric")
		
		def train_imputer_categorical(self):
			self.imputer_categorical.fit(self.dataset.loc[:,self.cat_cols])
			self.update_actions_performed("train_imputer_categorical")

		def impute_numeric_cols(self,test_dataset= None, return_frames = False):
			"""This function replaces the na values with the column mean or median, based on the strategy of the fitted imputer.
			This function might not be needed anymore as sklearn 0.22 has KNN imputation, which I would prefer to use."""
			f_name = "impute_numeric_cols"
			num_cols = self.num_cols
			print("num_cols : {}".format(num_cols))
			if test_dataset is not None:
				return self.imputer_numeric.transform(test_dataset.loc[:,num_cols])
			else:
				self.dataset.loc[:,num_cols] = self.imputer_numeric.transform(self.dataset.loc[:,num_cols])
			self.update_actions_performed(dict_key = f_name, needed_for_test = True)
			if return_frames:
				return self.dataset
		
		def impute_categorical_cols(self,test_dataset= None, return_frames = False):
			f_name = "impute_categorical_cols"
			print("cat_cols : {}".format(self.cat_cols))
			if test_dataset is not None:
				return self.imputer_categorical.transform(test_dataset.loc[:,self.cat_cols])
			else:
				self.dataset.loc[:,self.cat_cols] = self.imputer_categorical.transform(self.dataset.loc[:,self.cat_cols])
			self.update_actions_performed(dict_key = f_name, needed_for_test = True)
			if return_frames:
				return self.dataset

		def create_dummy_data_frame(self, categorical_attributes = None, return_frames = False):
			"""This function returns a dataframe of dummified columns. Pass the column names, or leave empty to use self.cat_cols."""
			if categorical_attributes is None:
				categorical_attributes = self.cat_cols
				print("cat_cols : {}".format(categorical_attributes))
			if categorical_attributes:
				dummy_df =pd.get_dummies(self.dataset.loc[:,categorical_attributes]) 
				self.catergorical_dummies = dummy_df
				self.update_actions_performed(dict_key = "create_dummy_data_frame", attributes = {"columns_present" : list(dummy_df.columns)}, needed_for_test = True)
				self.dummies = True
				if return_frames:
					return dummy_df
			else:
				self.dummies = False
				if return_frames:
					return self.dataset

		def set_target_type(self, col_type):
			if self.target is not None:
				self.dataset.loc[:, [self.target]] = self.dataset.loc[:, [self.target]].astype(col_type)
			else:
				print("target not set, call the function set_target with the name of the target column")
		
		# def create_data_for_model(self):
		# 	if self.dummies:
		# 		self.model_data = pd.concat([self.dataset.loc[:, self.num_cols], self.catergorical_dummies, self.dataset.loc[:, self.target]], axis = 1)
		# 	else:
		# 		self.model_data = self.dataset
		
		def create_train_test_split(self,cat_cols, num_cols, validation_size = 0.3, random_state= 42, return_frames= False):
			# self.create_data_for_model()
			self.model_data = self.dataset
			if self.target is not None:
				cols_for_models= []
				cols_for_models.extend(num_cols)
				cols_for_models.extend(cat_cols)
				cols_for_models = list(set(cols_for_models))
				self.cols_for_models = cols_for_models
				X = self.dataset.loc[:, [i for i in self.dataset.columns if i in cols_for_models]]
				y = self.model_data.loc[:, [self.target]]
				X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size = validation_size, random_state= random_state) 
				self.X_train = X_train
				self.X_validation = X_validation
				self.y_train  = y_train
				self.y_validation = y_validation
				self.nrowstrain, self.ncolstrain = X_train.shape
				self.nrowvalidation, self.ncolsvalidation = X_validation.shape
			else:
				print("target not set, call the function set_target with the name of the target column")
			if return_frames:
				return X_train, X_validation, y_train, y_validation

		def compute_mertics(self, preds, trues):
			"""Compute various metrics"""
			try:
				acc = accuracy_score(y_pred=preds, y_true=trues)  # Accuracy for the passed set
			except: 
				acc = None
			try:
				prec = precision_score(y_pred=preds, y_true=trues, average='macro',pos_label = self.pos_label)  # precision for the passed set
			except: 
				prec = None
			try:
				rec = recall_score(y_pred=preds, y_true=trues, average='macro',pos_label = self.pos_label)  # recall for the passed set
			except:
				rec = None
			try:
				f1 = f1_score(y_pred=preds, y_true=trues, pos_label = self.pos_label)  # F1 score for the passed set
			except: 
				f1 = None

			model_performance = {}
			model_performance["precision"] = prec
			model_performance["f1_score"] = f1
			model_performance["recall"] = rec
			model_performance["accuracy"] = acc
			return model_performance	
		
		def apply_model_predict_validate(self, model_obj, model_name = None, feature_names= None):
			"""This function
			1.Fits the preprocessing pipeline and the specified model on the train data.
			2.Predicts on the train and validation sets.
			3.Returns the predictions along with some scores.
			"""
			# if feature_names is None:
			# 	feature_names = []
			# 	feature_names = self.num_cols
			# 	feature_names.extend(self.cat_cols)
			# 	print("feature_names", feature_names)
			# 	if self.cat_num_extracted == False:
			# 		self.cat_num_extract()
			# 	self.model_columns.extend(self.cat_cols)
			# 	self.model_columns.extend(self.num_cols)
			# 	feature_names = self.model_columns
			# print("Using {} columns for model building".format(feature_names))
			# fit the model
			# self.create_train_test_split()
			self.numeric_pipeline()
			self.categorical_pipeline()
			self.set_pipeline()
			model_obj = Pipeline(steps=[('preprocessor', self.preprocessor),('classifier', model_obj)])
			self.model = model_obj
			self.model.fit(self.X_train, self.y_train)
			# evaluation metrics
			pred_train = self.model.predict(self.X_train)  # predict on the train set
			# self.pred_train = pred_train
			pred_val = self.model.predict(self.X_validation)  # predict on the validation set
			# self.pred_val = pred_val
			performance = {}
			performance["train"] = self.compute_mertics(pred_train, self.y_train)
			performance["validation"] = self.compute_mertics(pred_val, self.y_validation)
			if model_name is None:
				try:
					if "sklearn" in type(self.model).__module__:
						performance["model_obj"] = type(self.model).__name__
				except:
					performance["model_obj"] = self.model
			else:
				performance["model_obj"] = model_name
			# print("model_performance : {}".format(model_performance))	
			# print(pd.DataFrame(performance))
			
			return [pred_train,pred_val, performance]

		def apply_log_reg(self, feature_names = None):
			"""apply basic logistic regression model"""
			model = LogisticRegression()
			return self.apply_model_predict_validate(model, feature_names = feature_names)

		def apply_dtree_class(self, feature_names = None):
			"""apply basic Decision tree classification model"""
			model = DecisionTreeClassifier()
			return self.apply_model_predict_validate(model, feature_names = feature_names)

		def compare_model_performance(self, model_objs_list = [], model_names_list = [], feature_names = None):
			"""pass a list of model objects, this function will apply all the models and return a dataframe with the performance 
			measures
			Input : list of model objects
					list of model names (optional)
					features to be used for model building
			"""
			if len(model_objs_list) == 1:
				return self.apply_model_predict_validate(model_objs_list[0], feature_names = feature_names)
			if len(model_objs_list) > 1:
				model_perf_list = []
				if model_names_list:
					for model, name in zip(model_objs_list,model_names_list):
						_,_,model_perf = self.apply_model_predict_validate(model, model_name=name, feature_names = feature_names)
						model_perf_list.append(model_perf)
				else:
					for model in model_objs_list:
						_,_,model_perf = self.apply_model_predict_validate(model, feature_names = feature_names)
						model_perf_list.append(model_perf)
			df = pd.DataFrame(model_perf_list)
			# df = df.set_index(df["model_obj"])
			# df.reset_index()
			return df

		# Experimental
		# Replace values in an attribute with other values
		def replace_attribute_values(self, target_attribute, originals, replace_with):
			"""Experimental:
			This function takes a pandas series object and replaces the specified values with specified values."""
			if len(originals) == len(replace_with):
				for i in range(len(originals)):
					self.X_train[target_attribute].replace(originals[i], replace_with[i], inplace=True)
					self.X_validation[target_attribute].replace(originals[i], replace_with[i], inplace=True)
			else:
				raise ValueError("replacement values do not match the size of originals")
			return 


In [2]:
"""Helper Functions for Plotting"""
import numpy as np

import plotly
import plotly.offline as pyoff
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode, iplot, plot
import plotly.graph_objs as go
init_notebook_mode(connected = True)

def generate_layout_bar(col_name):
    """
    Generate a layout object for bar chart
    """
    layout_bar = go.Layout(
        autosize=False,  # auto size the graph? use False if you are specifying the height and width
        width=800,  # width of the figure in pixels
        height=600,  # height of the figure in pixels
        title="Distribution of {} column".format(col_name),  # title of the figure
        # more granular control on the title font
        titlefont=dict(
            family='Courier New, monospace',  # font family
            size=14,  # size of the font
            color='black'  # color of the font
        ),
        # granular control on the axes objects
        xaxis=dict(
            tickfont=dict(
                family='Courier New, monospace',  # font family
                size=14,  # size of ticks displayed on the x axis
                color='black'  # color of the font
            )
        ),
        yaxis=dict(
            #         range=[0,100],
            title='Percentage',
            titlefont=dict(
                size=14,
                color='black'
            ),
            tickfont=dict(
                family='Courier New, monospace',  # font family
                size=14,  # size of ticks displayed on the y axis
                color='black'  # color of the font
            )
        ),
        font=dict(
            family='Courier New, monospace',  # font family
            color="white",  # color of the font
            size=12  # size of the font displayed on the bar
        )
    )
    return layout_bar


def plot_count_bar(dataframe_name, col_name, top_n=None):
    """
    Plot a bar chart for the categorical columns
    Arguments:
    dataframe name
    categorical column name
    Output:
    Plot
    """
    # create a table with value counts
    temp = dataframe_name[col_name].value_counts()
    if top_n is not None:
        temp = temp.head(top_n)
    # creating a Bar chart object of plotly
    data = [go.Bar(
            x=temp.index.astype(str),  # x axis values
            y=np.round(temp.values.astype(float) / temp.values.sum(), 4) * 100,  # y axis values
            text=['{}%'.format(i) for i in np.round(temp.values.astype(float) / temp.values.sum(), 4) * 100],
            # text to be displayed on the bar, we are doing this to display the '%' symbol along with the number on the bar
            textposition='auto',  # specify at which position on the bar the text should appear
            marker=dict(color='#0047AB'),)]  # change color of the bar
    # color used here Cobalt Blue
    layout_bar = generate_layout_bar(col_name=col_name)
    fig = go.Figure(data=data, layout=layout_bar)
    return iplot(fig)

def plot_bar(dataframe_name, cat_col_name, num_col_name, top_n = 20):
    """
    Plot a bar chart with the mentioned columns
    Arguments:
    dataframe name
    categorical column name
    numeric column name
    Output:
    Plot
    """
    # sort by the numeric column and keep the top_n rows
    dataframe_name = dataframe_name.sort_values(by = num_col_name, ascending = False)
    dataframe_name = dataframe_name.head(top_n)
    x = dataframe_name[cat_col_name]
    y = dataframe_name[num_col_name]
    # creating a Bar chart object of plotly
    data = [go.Bar(
            x=x,  # x axis values
            y=y,  # y axis values
            text=['{}%'.format(np.round(i,2)) for i in y],
            # text to be displayed on the bar, we are doing this to display the '%' symbol along with the number on the bar
            textposition='auto',  # specify at which position on the bar the text should appear
            marker=dict(color='#0047AB'),)]  # change color of the bar
    # color used here Cobalt Blue
    layout_bar = generate_layout_bar(col_name=cat_col_name)
    fig = go.Figure(data=data, layout=layout_bar)
    return iplot(fig)


def plot_hist(dataframe, col_name):
    """Plot histogram"""
    data = [go.Histogram(x=dataframe[col_name],
                         marker=dict(
        color='#CC0E1D',  # Lava (#CC0E1D)
        #         color = 'rgb(200,0,0)'   # you can provide the color in HEX or rgb format; HEX is generally preferred as it is a single string value and easy to pass as a variable
    ))]
    layout = go.Layout(title="Histogram of {}".format(col_name))
    fig = go.Figure(data=data, layout=layout)
    return iplot(fig)


def plot_multi_box(dataframe, col_name, num_col_name):
    """Plot multiple box plots based on the levels in a column"""
    data = []
    for i in dataframe[col_name].unique():
        trace = go.Box(y=dataframe[num_col_name][dataframe[col_name] == i],
                       name=i)
        data.append(trace)
    layout = go.Layout(title="Boxplot of levels in {} for {} column".format(col_name, num_col_name))
    fig = go.Figure(data=data, layout=layout)
    return (iplot(fig))

## Load master table

In [3]:
import os
import sys


In [4]:
def load_data(file_name):
    return pd.read_csv(f"{file_name}", na_values=[" ", "", np.nan, "nan", "Na", "NaN"])

In [5]:
# from ML.HelperFunctionsMLClass import HelperFunctionsML

In [6]:
file_paths_dict = {}
for dirname, _, filenames in os.walk('model_data'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        file_paths_dict[filename] =os.path.join(dirname, filename)

print(file_paths_dict)

model_data\master.csv
model_data\notifications_flat.csv
model_data\users.csv
model_data\X_train.csv
model_data\X_validation.csv
model_data\y_train.csv
model_data\y_validation.csv
{'master.csv': 'model_data\\master.csv', 'notifications_flat.csv': 'model_data\\notifications_flat.csv', 'users.csv': 'model_data\\users.csv', 'X_train.csv': 'model_data\\X_train.csv', 'X_validation.csv': 'model_data\\X_validation.csv', 'y_train.csv': 'model_data\\y_train.csv', 'y_validation.csv': 'model_data\\y_validation.csv'}


In [7]:
file_paths_dict

{'master.csv': 'model_data\\master.csv',
 'notifications_flat.csv': 'model_data\\notifications_flat.csv',
 'users.csv': 'model_data\\users.csv',
 'X_train.csv': 'model_data\\X_train.csv',
 'X_validation.csv': 'model_data\\X_validation.csv',
 'y_train.csv': 'model_data\\y_train.csv',
 'y_validation.csv': 'model_data\\y_validation.csv'}

In [8]:
train = load_data(file_paths_dict["master.csv"])

In [9]:
train.isnull().sum()

user_id                                        0
birth_year                                     0
country                                        0
city                                           0
created_date                                   0
user_settings_crypto_unlocked                  0
plan                                           0
attributes_notifications_marketing_push     6422
attributes_notifications_marketing_email    6422
num_contacts                                   0
num_referrals                                  0
num_successful_referrals                       0
age                                            0
METAL                                          0
METAL_FREE                                     0
PREMIUM                                        0
PREMIUM_FREE                                   0
PREMIUM_OFFER                                  0
STANDARD                                       0
Android                                        0
Apple               

## Select columns 

In [10]:
train.columns

Index(['user_id', 'birth_year', 'country', 'city', 'created_date',
       'user_settings_crypto_unlocked', 'plan',
       'attributes_notifications_marketing_push',
       'attributes_notifications_marketing_email', 'num_contacts',
       'num_referrals', 'num_successful_referrals', 'age', 'METAL',
       'METAL_FREE', 'PREMIUM', 'PREMIUM_FREE', 'PREMIUM_OFFER', 'STANDARD',
       'Android', 'Apple', 'num_notifications', 'num_notified_15',
       'num_notified_30', 'num_notified_45', 'num_notified_60',
       'num_notified_90', 'num_notified_180', 'num_channles_of_com',
       'num_reasons_com', 'num_transactions_Fri', 'num_transactions_Mon',
       'num_transactions_Sat', 'num_transactions_Sun', 'num_transactions_Thu',
       'num_transactions_Tue', 'num_transactions_Wed', 'INBOUND_count',
       'OUTBOUND_count', 'CANCELLED_count', 'COMPLETED_count',
       'DECLINED_count', 'FAILED_count', 'PENDING_count', 'REVERTED_count',
       'is_user_engaged', 'median_usd', 'mean_usd', 'min_us

In [11]:
train.city.nunique()

5744

I am dropping the columns with na values, attributes_notifications_marketing_push and attributes_notifications_marketing_email. These attributes are not helping the model building, and the information about the notifications is already captured in the engineered features.

std_usd (standard deviation of the amount) can be imputed using the median, but for now I am dropping it as well.

city has too many levels (5744 unique values), so I am dropping it for now, as we have the country information.
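
The two checks behind these decisions can be sketched on a small frame. This is only an illustration: the `toy` dataframe and its values are made up and are not the master table. It counts the levels per categorical column and shows what median imputation of std_usd would look like if the column were kept instead of dropped.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for the master table (values are made up for illustration)
toy = pd.DataFrame({
    "city": ["London", "Paris", "Berlin", "Madrid"],
    "country": ["GB", "FR", "DE", "ES"],
    "std_usd": [1.5, np.nan, 3.0, 10.0],
})

# Cardinality check: number of distinct levels in each categorical column
levels_per_column = toy[["city", "country"]].nunique()
print(levels_per_column)

# Median imputation as an alternative to dropping std_usd
imputer = SimpleImputer(strategy="median")
toy["std_usd"] = imputer.fit_transform(toy[["std_usd"]]).ravel()
print(toy["std_usd"].tolist())
```

On the real data the same `nunique()` call is what gives 5744 levels for city.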

In [12]:
cols_needed = ['user_id','age','num_contacts',
       'num_referrals', 'num_successful_referrals', 'METAL',
       'METAL_FREE', 'PREMIUM', 'PREMIUM_FREE', 'PREMIUM_OFFER', 'STANDARD',
       'Android', 'Apple', 'num_notifications', 'num_notified_15',
       'num_notified_30', 'num_notified_45', 'num_notified_60',
       'num_notified_90', 'num_notified_180', 'num_channles_of_com',
       'num_reasons_com', 'num_transactions_Fri', 'num_transactions_Mon',
       'num_transactions_Sat', 'num_transactions_Sun', 'num_transactions_Thu',
       'num_transactions_Tue', 'num_transactions_Wed', 'INBOUND_count',
       'OUTBOUND_count', 'CANCELLED_count', 'COMPLETED_count',
       'DECLINED_count', 'FAILED_count', 'PENDING_count', 'REVERTED_count',
       'is_user_engaged', 'median_usd', 'mean_usd', 'min_usd', 'max_usd']

In [13]:
train = train.loc[:,cols_needed]

## Convert the pandas dataframe to my helper function object

In [14]:
# I have developed a package for handling the dataset, preprocessing, model building and evaluation. For more info, refer to my GitHub link provided above.
train = HelperFunctionsML(train)

In [15]:
train.nrows, train.ncols

(18337, 42)

### Set the target variable

In [16]:
train.dataset.columns

Index(['user_id', 'age', 'num_contacts', 'num_referrals',
       'num_successful_referrals', 'METAL', 'METAL_FREE', 'PREMIUM',
       'PREMIUM_FREE', 'PREMIUM_OFFER', 'STANDARD', 'Android', 'Apple',
       'num_notifications', 'num_notified_15', 'num_notified_30',
       'num_notified_45', 'num_notified_60', 'num_notified_90',
       'num_notified_180', 'num_channles_of_com', 'num_reasons_com',
       'num_transactions_Fri', 'num_transactions_Mon', 'num_transactions_Sat',
       'num_transactions_Sun', 'num_transactions_Thu', 'num_transactions_Tue',
       'num_transactions_Wed', 'INBOUND_count', 'OUTBOUND_count',
       'CANCELLED_count', 'COMPLETED_count', 'DECLINED_count', 'FAILED_count',
       'PENDING_count', 'REVERTED_count', 'is_user_engaged', 'median_usd',
       'mean_usd', 'min_usd', 'max_usd'],
      dtype='object')

In [17]:
train.set_target("is_user_engaged")

'Succesfully set target column!'

### We can also track the actions performed

In [18]:
train.actions_performed

{'set_target': {'attributes': {'col_name': 'is_user_engaged'},
  'needed_for_test': False}}
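
One possible use of this log is to pick out the steps that have to be replayed on a test set. The sketch below uses a plain dictionary with the same layout as `actions_performed`; the second entry is added by hand for illustration and is not part of the output above.

```python
# Same layout as HelperFunctionsML.actions_performed; the second entry is a
# hand-written example of a step flagged as needed at test time.
actions_performed = {
    "set_target": {"attributes": {"col_name": "is_user_engaged"}, "needed_for_test": False},
    "impute_numeric_cols": {"attributes": {}, "needed_for_test": True},
}

# Keep only the steps that must be repeated on unseen data
steps_for_test = [name for name, info in actions_performed.items() if info["needed_for_test"]]
print(steps_for_test)
```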

### Check the na values

In [19]:
train.check_has_na_values()

No Null values in the dataset.


False

In [20]:
train.set_pos_label("NOT_ENGAGED")

'Succesfully set pos_label!'

In [21]:
train.set_target("is_user_engaged")

'Succesfully set target column!'

## Models

In [22]:
cat_cols = []
num_cols = train.dataset.columns.tolist()
num_cols.remove("user_id")

In [23]:
num_cols.remove("is_user_engaged")

In [24]:
train.create_train_test_split(cat_cols, num_cols)

In [25]:
# train.X_train.to_csv("model_data/X_train.csv",index=False)
# train.y_train.to_csv("model_data/y_train.csv",index=False)
# train.X_validation.to_csv("model_data/X_validation.csv",index=False)
# train.y_validation.to_csv("model_data/y_validation.csv",index=False)

In [26]:
train.num_cols = num_cols

In [27]:
train.cat_cols = cat_cols

### Logistic regression model

In [152]:
train.dataset.dtypes

user_id                      object
age                           int64
num_contacts                  int64
num_referrals                 int64
num_successful_referrals      int64
METAL                         int64
METAL_FREE                    int64
PREMIUM                       int64
PREMIUM_FREE                  int64
PREMIUM_OFFER                 int64
STANDARD                      int64
Android                       int64
Apple                         int64
num_notifications             int64
num_notified_15             float64
num_notified_30             float64
num_notified_45             float64
num_notified_60             float64
num_notified_90             float64
num_notified_180            float64
num_channles_of_com           int64
num_reasons_com               int64
num_transactions_Fri          int64
num_transactions_Mon          int64
num_transactions_Sat          int64
num_transactions_Sun          int64
num_transactions_Thu          int64
num_transactions_Tue        

In [153]:
log_reg = LogisticRegression()

In [154]:
_,_,model_performance = train.apply_model_predict_validate(model_obj = log_reg, model_name = "Logreg")


A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().


lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Note that pos_label (set to 'NOT_ENGAGED') is ignored when average != 'binary' (got 'macro'). You may use labels=[pos_label] to specify a single positive class.



In [155]:
model_performance

{'train': {'precision': None,
  'f1_score': 0.548466690165668,
  'recall': 0.709870093158768,
  'accuracy': 0.9001947798987144},
 'validation': {'precision': None,
  'f1_score': 0.5434949961508853,
  'recall': 0.7054466523451486,
  'accuracy': 0.8922210105416212},
 'model_obj': 'Logreg'}

In [156]:
pd.DataFrame(model_performance)

Unnamed: 0,train,validation,model_obj
accuracy,0.900195,0.892221,Logreg
f1_score,0.548467,0.543495,Logreg
precision,,,Logreg
recall,0.70987,0.705447,Logreg
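
Two of the warnings printed by the logistic regression run have simple remedies. The sketch below uses synthetic data because the internals of `apply_model_predict_validate` are not shown here; the synthetic `X` and `y`, the exaggerated feature scale, and `max_iter=1000` are all assumptions for illustration. It flattens the single-column target with `ravel()` and puts a `StandardScaler` in front of the model so that lbfgs is more likely to converge.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for train.X_train / train.y_train (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X = pd.DataFrame(X * 1000.0)              # exaggerate the feature scale
y = pd.DataFrame({"is_user_engaged": y})  # single-column frame, shape (500, 1)

# ravel() turns the (n_samples, 1) column vector into shape (n_samples,),
# which is what the "column-vector y" warning asks for.
y_1d = y.values.ravel()

# Scaling the features and raising max_iter are the two remedies
# suggested by the lbfgs convergence warning.
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_reg.fit(X, y_1d)
print(scaled_log_reg.score(X, y_1d))
```

If the helper already scales the features inside its pipeline, only the `max_iter` change would be needed.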


#### Model results

I will use recall on the positive class (NOT_ENGAGED) as the metric for this problem, because recall measures how many of the users who are actually NOT_ENGAGED the model manages to identify.

Recall is about 71% on both train (0.710) and validation (0.705), which implies that there is no overfitting. I will use this as the baseline and try to improve on it.
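
The `pos_label` warning in the output says that recall was computed with `average='macro'`, so the reported figure is the unweighted mean of the per-class recalls rather than the recall of NOT_ENGAGED alone. A small sketch of the difference, using hand-made labels (the class name `ENGAGED` for the other class is an assumption):

```python
from sklearn.metrics import recall_score

# Toy labels: 3 users are truly NOT_ENGAGED, 7 are truly ENGAGED.
y_true = ["NOT_ENGAGED"] * 3 + ["ENGAGED"] * 7
y_pred = ["NOT_ENGAGED", "ENGAGED", "ENGAGED",   # 1 of 3 NOT_ENGAGED found
          "ENGAGED", "ENGAGED", "ENGAGED",
          "ENGAGED", "ENGAGED", "ENGAGED",
          "NOT_ENGAGED"]                         # 6 of 7 ENGAGED found

# Recall of the positive class only: 1 / 3.
pos_recall = recall_score(y_true, y_pred, pos_label="NOT_ENGAGED", average="binary")

# Macro recall: mean of the two per-class recalls, (1/3 + 6/7) / 2.
macro_recall = recall_score(y_true, y_pred, average="macro")

print(pos_recall, macro_recall)
```

On imbalanced data the macro figure can look considerably better than the positive-class recall, so it is worth checking which of the two the helper reports.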

#### Taking a look at the train dataset used

In [157]:
train.X_train.head()

Unnamed: 0,age,num_contacts,num_referrals,num_successful_referrals,METAL,METAL_FREE,PREMIUM,PREMIUM_FREE,PREMIUM_OFFER,STANDARD,...,CANCELLED_count,COMPLETED_count,DECLINED_count,FAILED_count,PENDING_count,REVERTED_count,median_usd,mean_usd,min_usd,max_usd
17408,33,1,0,0,0,0,0,0,0,1,...,0,1,0,0,0,1,4.405,4.405,0.09,8.72
11998,65,4,0,0,0,0,0,0,0,1,...,0,4,0,0,0,1,5.01,6.2,1.0,10.0
9238,57,0,0,0,0,0,0,0,0,1,...,0,29,9,3,0,13,136.58,217.016667,0.85,1159.22
12032,31,5,0,0,0,0,0,0,0,1,...,0,13,2,0,0,6,7.63,16.222381,0.0,85.11
872,24,5,0,0,0,0,0,0,0,1,...,0,97,1,3,1,1,10.95,27.442136,0.0,250.0


### Decision tree 

In [158]:
dtree_max_depth = DecisionTreeClassifier(max_depth = 3)
_, _ ,model_performance = train.apply_model_predict_validate(model_obj = dtree_max_depth)


Note that pos_label (set to 'NOT_ENGAGED') is ignored when average != 'binary' (got 'macro'). You may use labels=[pos_label] to specify a single positive class.



In [159]:
model_performance

{'train': {'precision': None,
  'f1_score': 0.5497981157469718,
  'recall': 0.7168055977630414,
  'accuracy': 0.8957537982080249},
 'validation': {'precision': None,
  'f1_score': 0.5655070318282753,
  'recall': 0.7211723505191551,
  'accuracy': 0.8933115230825155},
 'model_obj': 'Pipeline'}

In [160]:
pd.DataFrame(model_performance)

Unnamed: 0,train,validation,model_obj
accuracy,0.895754,0.893312,Pipeline
f1_score,0.549798,0.565507,Pipeline
precision,,,Pipeline
recall,0.716806,0.721172,Pipeline


#### Model Performance

The recall for both train and validation increased by about one percentage point, so there is no significant difference compared with the logistic regression model.

### SVM model

In [161]:
from sklearn import svm

In [162]:
svc_model = svm.SVC()

In [163]:
_, _ ,model_performance = train.apply_model_predict_validate(svc_model)


A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().


Note that pos_label (set to 'NOT_ENGAGED') is ignored when average != 'binary' (got 'macro'). You may use labels=[pos_label] to specify a single positive class.



In [164]:
pd.DataFrame(model_performance)

Unnamed: 0,train,validation,model_obj
accuracy,0.893494,0.880407,Pipeline
f1_score,0.443177,0.406137,Pipeline
precision,,,Pipeline
recall,0.648983,0.631943,Pipeline


#### Model performance

Recall for the SVC model (without hyperparameter tuning) is about 65% on train and 63% on validation, lower than for the logistic regression model.


### Random Forest

In [165]:
from sklearn.ensemble import RandomForestClassifier

In [166]:
rf = RandomForestClassifier()

In [167]:
_, _ ,model_performance = train.apply_model_predict_validate(rf)


A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().


Note that pos_label (set to 'NOT_ENGAGED') is ignored when average != 'binary' (got 'macro'). You may use labels=[pos_label] to specify a single positive class.



In [168]:
pd.DataFrame(model_performance)

Unnamed: 0,train,validation,model_obj
accuracy,1.0,0.909306,Pipeline
f1_score,1.0,0.620532,Pipeline
precision,,,Pipeline
recall,1.0,0.744053,Pipeline


#### Model performance

Recall is 100% on train but only 74% on validation, which is a clear case of overfitting, so I will tune the model.
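
A minimal sketch of how the train/validation gap can be measured, and how limiting tree growth tends to shrink it. The data is synthetic, and `max_depth=5` and `min_samples_leaf=5` are arbitrary illustrative values, not tuned settings for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Noisy synthetic data (flip_y mislabels about 10% of rows) as a stand-in.
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

def recall_gap(model):
    """Fit the model and return (train recall, validation recall, gap)."""
    model.fit(X_tr, y_tr)
    train_recall = recall_score(y_tr, model.predict(X_tr))
    val_recall = recall_score(y_val, model.predict(X_val))
    return train_recall, val_recall, train_recall - val_recall

# Fully grown trees (the default) versus depth- and leaf-limited trees.
default_rf = recall_gap(RandomForestClassifier(random_state=0))
shallow_rf = recall_gap(
    RandomForestClassifier(max_depth=5, min_samples_leaf=5, random_state=0)
)
print("default:", default_rf)
print("shallow:", shallow_rf)
```

A smaller gap does not guarantee a better validation score; it only indicates that the model is memorising less of the training set.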

#### Extracting feature importances

In [169]:
train.X_train.columns

Index(['age', 'num_contacts', 'num_referrals', 'num_successful_referrals',
       'METAL', 'METAL_FREE', 'PREMIUM', 'PREMIUM_FREE', 'PREMIUM_OFFER',
       'STANDARD', 'Android', 'Apple', 'num_notifications', 'num_notified_15',
       'num_notified_30', 'num_notified_45', 'num_notified_60',
       'num_notified_90', 'num_notified_180', 'num_channles_of_com',
       'num_reasons_com', 'num_transactions_Fri', 'num_transactions_Mon',
       'num_transactions_Sat', 'num_transactions_Sun', 'num_transactions_Thu',
       'num_transactions_Tue', 'num_transactions_Wed', 'INBOUND_count',
       'OUTBOUND_count', 'CANCELLED_count', 'COMPLETED_count',
       'DECLINED_count', 'FAILED_count', 'PENDING_count', 'REVERTED_count',
       'median_usd', 'mean_usd', 'min_usd', 'max_usd'],
      dtype='object')

In [170]:
train.model[-1].feature_importances_

array([3.96076837e-02, 4.98923789e-02, 0.00000000e+00, 0.00000000e+00,
       1.74951740e-04, 4.75735683e-05, 6.04883389e-04, 0.00000000e+00,
       3.37341481e-06, 1.06404604e-03, 6.72685414e-03, 7.24290414e-03,
       2.13260667e-02, 1.93505869e-02, 1.74446611e-02, 1.80062647e-02,
       1.85730698e-02, 2.67140109e-02, 6.93773451e-02, 1.53605509e-02,
       2.91969268e-02, 4.22102473e-02, 3.30033872e-02, 3.84210956e-02,
       4.26598142e-02, 2.84755901e-02, 2.81746328e-02, 2.62422605e-02,
       5.57178869e-02, 7.07650891e-02, 3.10620496e-03, 7.99783286e-02,
       4.01227840e-02, 1.36965529e-02, 6.97051833e-03, 2.29826106e-02,
       4.35071913e-02, 3.32832700e-02, 4.09048163e-02, 9.06358747e-03])

In [171]:
feature_importances_df = pd.DataFrame(zip(train.X_train.columns, train.model[-1].feature_importances_))
feature_importances_df.columns = ["Column", "Importance"]

In [172]:
feature_importances_df.head()

Unnamed: 0,Column,Importance
0,age,0.039608
1,num_contacts,0.049892
2,num_referrals,0.0
3,num_successful_referrals,0.0
4,METAL,0.000175


In [173]:
plot_bar(feature_importances_df, "Column","Importance" )
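
The bar chart is easier to read when the features are ordered by importance. A short sketch of the sorting step, using five (column, importance) pairs copied from the output above; `plot_bar` is my helper and is not reproduced here, so the sketch stops at the sorted frame.

```python
import pandas as pd

# Five pairs taken from train.X_train.columns and feature_importances_.
columns = ["age", "num_contacts", "num_referrals", "OUTBOUND_count", "COMPLETED_count"]
importances = [3.96076837e-02, 4.98923789e-02, 0.0, 7.07650891e-02, 7.99783286e-02]

feature_importances_df = pd.DataFrame({"Column": columns, "Importance": importances})

# Largest importance first; reset the index so row 0 is the top feature.
top_features = (
    feature_importances_df
    .sort_values("Importance", ascending=False)
    .reset_index(drop=True)
)
print(top_features)
```

Calling `.head(10)` on the sorted frame before plotting would restrict the chart to the ten most important features.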

### Random forest randomized search

In [174]:
X_train = train.X_train
y_train = train.y_train

In [186]:
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = list(range(30,300,10))
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
# max_depth = [int(x) for x in np.linspace(10, 100, num =5)]
max_depth = []
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}


In [187]:
print(random_grid)

{'n_estimators': [30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290], 'max_features': ['auto', 'sqrt'], 'max_depth': [None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]}


In [189]:
%%time
rf = RandomForestClassifier()

rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   53.6s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  5.6min finished

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().



Wall time: 5min 43s


RandomizedSearchCV(cv=3, error_score=nan,
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    ccp_alpha=0.0,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    max_samples=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
               

#### Best estimator from the random search

In [190]:
best_random = rf_random.best_estimator_

In [191]:
best_random

RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='sqrt',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=2, min_samples_split=5,
                       min_weight_fraction_leaf=0.0, n_estimators=210,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

#### Best estimator performance

In [192]:
_,_, model_performance= train.apply_model_predict_validate(best_random)


A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().


Note that pos_label (set to 'NOT_ENGAGED') is ignored when average != 'binary' (got 'macro'). You may use labels=[pos_label] to specify a single positive class.



In [193]:
pd.DataFrame(model_performance)

Unnamed: 0,train,validation,model_obj
accuracy,0.99961,0.909487,Pipeline
f1_score,0.998556,0.621005,Pipeline
precision,,,Pipeline
recall,0.999044,0.74416,Pipeline


#### Model performance after grid search

With the best estimator from the random search, recall is ~100% on train and 74.4% on validation, so the model is still overfitting. Note that the search only tried `max_depth=None`, so tree depth was never constrained.
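
One possible next step, sketched on synthetic data: let the search try bounded depths and score candidates on positive-class recall instead of the default accuracy. The parameter values, the small `n_iter`, and the label name `ENGAGED` are assumptions for illustration. `'auto'` is left out of `max_features` because recent scikit-learn releases no longer accept it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import RandomizedSearchCV

# Imbalanced synthetic stand-in for X_train / y_train with string labels.
X, y_int = make_classification(
    n_samples=400, n_features=10, weights=[0.85, 0.15], flip_y=0.05, random_state=0
)
y = np.where(y_int == 1, "NOT_ENGAGED", "ENGAGED")

# Score each candidate by recall on the NOT_ENGAGED class.
pos_recall = make_scorer(recall_score, pos_label="NOT_ENGAGED")

# Unlike the grid above, max_depth includes bounded values.
param_distributions = {
    "n_estimators": [30, 60, 90],
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 2, 4, 8],
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,            # kept small so the sketch runs quickly
    scoring=pos_recall,
    cv=3,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Here `best_score_` is the mean cross-validated recall for NOT_ENGAGED, so it is directly comparable with the metric chosen in the model results section.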