# ANALYSIS REPORT
##### Onigbanjo Mowalola  - 9397115
#####  HULT International Business School
#####  DAT-5303 | Machine Learning 
______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

## Case Introduction
Apprentice Chef, Inc. is an innovative company that offers busy professionals gourmet meals, delivered with disposable cookware, that can be prepared in 30 minutes or less. Orders are placed through its various online platforms.

## Regression Model
Median meal rating captures the satisfaction rating given by customers, which is useful for understanding their tastes, preferences, and general opinion of the products.
The features most positively correlated with revenue are average prep video time (0.64), median meal rating (0.61), and total photos viewed (0.47). The strongest negative correlation is with average clicks per visit, implying that the more customers click around the site, the less they purchase. This may point to problems with navigation or user-friendliness, among other things.
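
As a minimal sketch of how such correlations can be ranked, the snippet below builds a small made-up DataFrame and sorts each feature's Pearson correlation with revenue. The values and the lower-case column names are illustrative assumptions, not the Apprentice Chef data.

```python
import pandas as pd

# toy data: values are invented purely for illustration
toy = pd.DataFrame({
    'revenue'              : [1000, 1500, 2000, 2500, 3000],
    'avg_prep_vid_time'    : [ 100,  140,  210,  240,  310],
    'avg_clicks_per_visit' : [  18,   16,   13,   12,    9],
})

# Pearson correlation of every column with revenue, strongest positive first
revenue_corr = (toy.corr(method = 'pearson')['revenue']
                   .drop('revenue')
                   .sort_values(ascending = False)
                   .round(decimals = 2))

print(revenue_corr)
```

In this toy data, prep video time rises with revenue and clicks per visit falls, so the first lands at the top of the ranking and the second at the bottom.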
______________________________________________________________________________
## Classification Model
With regard to cross-sell success, the professional group, which represents customers registered with a company email address, has the highest correlation, implying greater promotion success. The junk group has a negative correlation: these may be inactive email addresses, so those customers may not be aware of the promotion.
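
The email groups could be derived along the following lines. This is only a sketch: the `EMAIL` column name, the domain lists, and the `email_group` helper are hypothetical and invented for illustration.

```python
import pandas as pd

# hypothetical domain lists, invented for this sketch
PERSONAL_DOMAINS = {'gmail.com', 'yahoo.com', 'protonmail.com'}
JUNK_DOMAINS     = {'me.com', 'aol.com', 'hotmail.com'}


def email_group(email):
    """Labels an email address as personal, junk, or professional."""
    domain = email.split('@')[-1].lower()

    if domain in PERSONAL_DOMAINS:
        return 'personal'
    if domain in JUNK_DOMAINS:
        return 'junk'

    # anything else is assumed to be a company address
    return 'professional'


# made-up addresses for illustration
toy = pd.DataFrame({'EMAIL': ['ann@ge.org', 'bob@gmail.com', 'cy@aol.com']})
toy['email_group'] = toy['EMAIL'].apply(email_group)

# one-hot encoding gives one flag column per group
toy = toy.join(pd.get_dummies(toy['email_group']))

print(toy)
```

Treating every unlisted domain as professional is a simplifying assumption; a real grouping would need an explicit list of company domains.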

## Actionable Recommendations 
Based on these insights, revenue is affected by prep time; presumably, customers are more likely to purchase meals that take less time to prepare. The better the median meal rating, the more likely a purchase becomes. I recommend that the company employ tactics to generate more feedback, such as rewards for completing surveys or leaving feedback.
I also recommend improving the site to make it easier and friendlier for customers: adding features that make re-ordering easier and faster, sending meal image recommendations, and curating recommendations based on previous ratings. The promotion should also target more professional users.


## Best Models

Regression Model - 0.861 - The model explains about 86% of the variation in revenue.

Classification Model - 0.715 - The tree predicts the success or failure of the promotion with a score of about 70%.
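
For context, the regression score is R-square, the share of variation in the response that the model explains. The sketch below computes it by hand on made-up numbers (not the report's data) to show what a score such as 0.861 measures.

```python
import numpy as np

# made-up actual and predicted revenue values, for illustration only
y_true = np.array([1000.0, 1500.0, 2000.0, 2500.0])
y_pred = np.array([1100.0, 1400.0, 2100.0, 2400.0])

# R-square = 1 - (residual sum of squares / total sum of squares)
ss_res   = np.sum((y_true - y_pred) ** 2)
ss_tot   = np.sum((y_true - y_true.mean()) ** 2)
r_square = 1 - ss_res / ss_tot

print(round(r_square, 3))
```

With these toy values the residual sum of squares is 40,000 and the total sum of squares is 1,250,000, giving an R-square of 0.968.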

## REGRESSION MODEL

In [None]:
# importing libraries
import pandas as pd # data science essentials
import numpy as np # data science essentials
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # enhanced data visualization
import statsmodels.formula.api as smf # linear regression (statsmodels)
import sklearn.linear_model # linear models
from sklearn.model_selection import train_test_split # train/test split
from sklearn.linear_model import LinearRegression # linear regression (scikit-learn)


# setting pandas print options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

#importing the dataset 

# specifying the path and file name
file = "./datasets/Chef_Data.xlsx"

# reading the file into Python
Chef_Data = pd.read_excel(io = file)

# checking the file
#Chef_Data.head(n=5)

# preparing response variable
new_chef_target= Chef_Data['REVENUE']

new_chef = Chef_Data.drop(['REVENUE', 'log_REVENUE', 'NAME', 'FIRST_NAME', 
                            'FAMILY_NAME', 'log_CANCELLATIONS_BEFORE_NOON',
                           'log_CANCELLATIONS_AFTER_NOON', 'log_MOBILE_LOGINS', 
                           'log_WEEKLY_PLAN', 'log_EARLY_DELIVERIES', 'log_LATE_DELIVERIES', 
                            'log_MASTER_CLASSES_ATTENDED', 'log_TOTAL_PHOTOS_VIEWED'], axis = 1)

  


## setting up more than one train-test split ##
###############################################
# FULL X-dataset (normal Y)
x_train_FULL, x_test_FULL, y_train_FULL, y_test_FULL = train_test_split(
            new_chef,     # x-variables
            new_chef_target,   # y-variable
            test_size = 0.25,
            random_state = 219)
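# INSTANTIATING a model object
# note: the normalize argument was removed in scikit-learn 1.2,
# so this call requires an earlier version
lasso_model = sklearn.linear_model.Lasso(alpha = 1.0,
                                         normalize = True) # default magnitude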

# FITTING to the training data
lasso_fit = lasso_model.fit(x_train_FULL, y_train_FULL)


# PREDICTING on new data
lasso_pred = lasso_fit.predict(x_test_FULL)


# SCORING the results
lasso_train_score = lasso_model.score(x_train_FULL, y_train_FULL).round(4)
lasso_test_score  = lasso_model.score(x_test_FULL, y_test_FULL).round(4)


print('Training Score:', lasso_train_score)
print('Testing Score:',  lasso_test_score)

# displaying and saving the gap between training and testing (R-square)
lasso_test_gap = abs(lasso_train_score - lasso_test_score).round(4)
print('Lasso Train-Test Gap :', lasso_test_gap)




In [None]:
# zipping each feature name to its coefficient
# (using the x-variable columns so names line up with the fitted coefficients)
lasso_model_values = zip(x_train_FULL.columns, lasso_fit.coef_.round(decimals = 2))


# setting up a placeholder list to store model features
lasso_model_lst = [('intercept', lasso_fit.intercept_.round(decimals = 2))]


# appending each feature-coefficient pair one by one
for val in lasso_model_values:
    lasso_model_lst.append(val)
    

# checking the results
for pair in lasso_model_lst:
    print(pair)

In [None]:
# dropping coefficients that are equal to zero
# (building a new list avoids skipping items, which happens when
#  removing from a list while looping over it)
lasso_model_lst = [(feature, coefficient)
                   for feature, coefficient in lasso_model_lst
                   if coefficient != 0]


# checking the results
for pair in lasso_model_lst:
    print(pair)

In [None]:
print(f"""
Model                   Train Score           Test Score
-----                   -----------            ----------
Lasso                 {lasso_train_score}     {lasso_test_score}
""")


# creating a dictionary for model results
model_performance = {
    
    'Model Type'    : [ 'Lasso'],
           
    'Training' : [ lasso_train_score],
           
    'Testing'  : [lasso_test_score],
                    
    'Train-Test Gap' : [lasso_test_gap],
                    
    'Model Size' : [len(lasso_model_lst)],
                    
    'Model' : [lasso_model_lst]}


# converting model_performance into a DataFrame
linear_model_performance = pd.DataFrame(model_performance)


# sending model results to Excel
linear_model_performance.to_excel('./model_results/linear_model_performance.xlsx',
                           index = False)

## CLASSIFICATION MODEL

In [None]:
################################################################################
# Importing Packages                                                           #
################################################################################

from sklearn.metrics import confusion_matrix             # confusion matrix
from sklearn.metrics import roc_auc_score                # Calculating the ROC and AUC
from sklearn.tree import DecisionTreeClassifier          # classification trees
from sklearn.ensemble import RandomForestClassifier      # Random Forest for classification   







# setting pandas print options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 100)

# loading data
chef_df = pd.read_excel('./datasets/Apprentice_Chef_Dataset.xlsx')


# displaying the head of the dataset
#chef_df.head(n = 5)

########################################
# visual_cm
########################################
def visual_cm(true_y, pred_y, labels = None):
    """
    Creates a visualization of a confusion matrix.

    PARAMETERS
    ----------
    true_y : true values for the response variable
    pred_y : predicted values for the response variable
    labels : list of class labels for the heatmap axes, default None
    """
    # visualizing the confusion matrix

    # setting labels
    lbls = labels
    

    # declaring a confusion matrix object
    cm = confusion_matrix(y_true = true_y,
                          y_pred = pred_y)


    # heatmap
    sns.heatmap(cm,
                annot       = True,
                xticklabels = lbls,
                yticklabels = lbls,
                cmap        = 'Blues',
                fmt         = 'g')


    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Confusion Matrix of the Classifier')
    plt.show()
    


In [None]:
chef_df.head(n=5)

In [None]:
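# checking the correlations between the variables and CROSS_SELL_SUCCESS
# (numeric columns only, since text columns cannot be correlated)
df_corr = (chef_df.select_dtypes(include = 'number')
                  .corr(method = 'pearson')
                  .round(decimals = 2))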

df_corr['CROSS_SELL_SUCCESS'].sort_values(ascending = False)

In [None]:
# dictionary with the significant variables
variable_dict = {
                       
    'logit_sig' : [ 'MOBILE_NUMBER', 'CANCELLATIONS_BEFORE_NOON', 'CONTACTS_W_CUSTOMER_SERVICE',
                   'TASTES_AND_PREFERENCES', 'PC_LOGINS', 'EARLY_DELIVERIES',
                   'REFRIGERATED_LOCKER', 'professional', 'personal','Name_Length','weekend_fighter']
}

# preparing explanatory and response data using the significant variables
chef_df_data   =  chef_df.loc[ : , variable_dict['logit_sig']]
chef_df_target =  chef_df.loc[ : , 'CROSS_SELL_SUCCESS']


# train/test split, stratified on the response variable
x_train, x_test, y_train, y_test = train_test_split(
            chef_df_data,
            chef_df_target,
            test_size    = 0.25,
            random_state = 219,
            stratify     = chef_df_target)


# building a model based on hyperparameter tuning results

# INSTANTIATING a classification tree with tuned values
tree_tuned = DecisionTreeClassifier(max_depth=5, min_samples_leaf=17, random_state=219,
                       splitter='best', criterion= 'entropy')


# FIT step (training data only, so the testing set stays unseen)
tree_tuned_fit = tree_tuned.fit(x_train, y_train)

# PREDICTING based on the testing set
tree_tuned_pred = tree_tuned.predict(x_test)


# SCORING the results and saving them for future use
tree_tuned_train_score = tree_tuned.score(x_train, y_train).round(4) # accuracy
tree_tuned_test_score  = tree_tuned.score(x_test, y_test).round(4)   # accuracy
tree_tuned_auc         = roc_auc_score(y_true  = y_test,
                                       y_score = tree_tuned_pred).round(4) # auc


print('Training ACCURACY:', tree_tuned_train_score)
print('Testing  ACCURACY:', tree_tuned_test_score)
print('AUC Score        :', tree_tuned_auc)


# unpacking the confusion matrix
tuned_tree_tn, \
tuned_tree_fp, \
tuned_tree_fn, \
tuned_tree_tp = confusion_matrix(y_true = y_test, y_pred = tree_tuned_pred).ravel()


# printing each result one-by-one
print(f"""
True Negatives : {tuned_tree_tn}
False Positives: {tuned_tree_fp}
False Negatives: {tuned_tree_fn}
True Positives : {tuned_tree_tp}
""")

# calling the visual_cm function
# (labels follow the matrix order: class 0 first, then class 1)
visual_cm(true_y = y_test,
          pred_y = tree_tuned_pred,
          labels = ['NOT CROSS_SELL_SUCCESS', 'CROSS_SELL_SUCCESS'])


## FINAL RESULTS

In [None]:
# sorting the regression results (linear_model_performance is already a DataFrame)
total_performance = linear_model_performance.sort_values(by = 'Testing',
                                                         ascending = False)


# sending model results to Excel
total_performance.to_excel('./datasets/linear_model_performance.xlsx',
                           index = False)


# checking the results
total_performance

In [None]:
# declaring model performance objects
tree_train_acc = tree_tuned.score(x_train, y_train).round(4)
tree_test_acc  = tree_tuned.score(x_test, y_test).round(4)
tree_auc       = roc_auc_score(y_true  = y_test,
                              y_score = tree_tuned_pred).round(4)


# creating a dictionary for the classification model results
model_performance =  {     'Model Name'        : ['Tuned Tree'],
                           'Training Accuracy' : [tree_train_acc],
                           'Testing Accuracy'  : [tree_test_acc],
                           'AUC Score'         : [tree_auc],
                           'Confusion Matrix'  : [(tuned_tree_tn,
                                                  tuned_tree_fp,
                                                  tuned_tree_fn,
                                                  tuned_tree_tp)]}


# converting model_performance into a DataFrame
model_performance = pd.DataFrame(model_performance)
# saving the DataFrame to Excel
model_performance.to_excel('./model_results/classification_model_performance.xlsx',
                           index = False)
# checking the results
model_performance