
INTRODUCTION
The Cross_Sell_Success dataset lets the business predict whether a cross-selling promotion will succeed or fail for a given customer. It contains 1,946 entries and 17 columns, with no missing values. Our primary goal is to train a machine learning model on this dataset that accurately predicts the success of future cross-sell promotions. To accomplish this, we focus on identifying the most effective model type and tuning it for the best possible AUC score. A model that performs well on new, unseen data can then inform business decisions and improve overall performance.

DATA ANALYSIS
The data analysis process began with the importation of all necessary libraries. The dataset was then described in detail, providing insight into its contents. Feature engineering was done to improve the performance and accuracy of the machine learning models. Each customer's email domain was assigned to one of three groups by iterating over the rows, and the group number was stored in a new 'mail' column. One-hot encoding was then applied to the 'mail' variable in the sell_success DataFrame, and the resulting columns were renamed n_email_1, n_email_2, and n_email_3. This technique converts categorical data into numerical data, making it easier for machine learning models to process and analyze.
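The grouping and encoding steps described above can also be sketched without a row-by-row loop. The snippet below is a minimal illustration on a few made-up email addresses (the sample rows are hypothetical, not taken from the dataset); the domain groups are the ones used in the notebook's loop.

```python
import pandas as pd

# hypothetical sample rows; the real data comes from the Cross_Sell_Success file
emails = pd.DataFrame({'EMAIL': ['a@gmail.com', 'b@aol.com',
                                 'c@ge.org', 'd@yahoo.com']})

# domain groups as defined in the notebook's loop
group_1 = ['gmail.com', 'yahoo.com', 'protonmail.com']
group_2 = ['me.com', 'aol.com', 'hotmail.com', 'live.com',
           'msn.com', 'passport.com']

# map each known domain to its group number
domain_to_group = {**{d: 1 for d in group_1}, **{d: 2 for d in group_2}}

# extract the domain and look up its group; unknown domains fall into group 3
domains = emails['EMAIL'].str.split('@').str[-1]
emails['mail'] = domains.map(domain_to_group).fillna(3).astype(int)

# one-hot encode the group number into n_email_1, n_email_2, n_email_3
dummies = pd.get_dummies(emails['mail'], prefix='n_email').astype(int)
emails = emails.join(dummies)

print(emails)
```

Mapping through a dictionary keeps the group definitions in one place, and the prefix argument produces the n_email_* column names directly, so no separate rename step is needed.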

A correlation analysis was conducted, revealing low or non-existent correlations between the X-variables and the Y-variable (Cross_Sell_Success). The variable with the highest positive correlation was n_email_3, with a correlation value of 0.19. The variable with the highest negative correlation was n_email_2, with a correlation value of -0.28.

The data was then split into two separate data frames: one containing the explanatory variables (features) and the other containing the response variable (target). The X-data included the names of the variables deemed the most relevant predictors of cross-sell success, based on the previous analysis. This step was crucial in preparing the data for the predictive model, as it involved selecting the most relevant variables and separating the response variable from the explanatory variables. This separation is vital because the explanatory variables are the model's inputs, while the response variable is what the model learns to predict and what its predictions are scored against.

In the development of a machine learning model, it is essential to split the data into training and testing sets so that the model is trained on one subset of the data and tested on an independent subset. This is a crucial step in the model development process because it helps to detect overfitting and allows the model's performance to be evaluated on new, unseen data.

To achieve this, the Scikit-learn library provides a useful function called train_test_split, which splits the data into training and testing sets. In this particular case, the function was used to split the data into 90% for training and 10% for testing. Additionally, the function was passed the sell_success_data and sell_success_target parameters to represent the independent and dependent variables, respectively. The random_state parameter was set to 219 to ensure that the split is always the same, and the stratify parameter was set to sell_success_target to ensure that the balance of classes in the target variable is preserved in both the training and testing sets.
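To illustrate what the stratify parameter does, here is a minimal sketch on synthetic data (the arrays X and y are made up for the example and are not the real dataset). It uses the same test_size and random_state values as the notebook and checks that the share of the positive class is the same in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in: 200 rows, 70% of them in the positive class
X = np.arange(200).reshape(-1, 1)
y = np.array([0] * 60 + [1] * 140)

# 90/10 split, stratified on the target
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size    = 0.10,
    random_state = 219,
    stratify     = y)

# share of the positive class in each subset
train_rate = y_train.mean()
test_rate  = y_test.mean()

print('Training positive rate:', round(train_rate, 4))
print('Testing  positive rate:', round(test_rate, 4))
```

Without stratify, a 10% test set of this size could easily end up with a noticeably different class balance than the training set, which would make the test scores harder to interpret.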

After the data was prepared, logistic regression was used to predict the likelihood of a customer making a cross-sell purchase. Logistic regression models the relationship between a binary dependent variable and one or more independent variables. Here the independent variables were the email domain group indicators (n_email_2 and n_email_3), cancellations after noon, mobile logins, average time per site visit, unique meals purchased, PC logins, and revenue.

The logistic regression analysis revealed that n_email_3, CANCELLATIONS_AFTER_NOON, and n_email_2 had the lowest p-values (0.0000 to four decimal places) and were statistically significant predictors of cross-sell success. REVENUE and UNIQUE_MEALS_PURCH were not statistically significant, meaning the data provide little evidence that they are related to cross-sell success once the other variables are accounted for.

The train_test_split function and the logistic regression analysis were essential steps in developing the machine learning model. Splitting the data ensured that the testing set was representative of the dataset as a whole, while the logistic regression analysis identified which customer attributes were significant to cross-sell success. Only n_email_3, CANCELLATIONS_AFTER_NOON, MOBILE_LOGINS, AVG_TIME_PER_SITE_VISIT, PC_LOGINS, and n_email_2 were significant. These findings can be used to optimize the cross-sell strategy and improve the success rate in the future.

A second logistic regression model was then built to predict the likelihood of a customer purchasing the meal package offered in the cross-sell promotion. The features the model uses to make its predictions were selected from the dataset: sell_success_data stores the features and sell_success_target stores the target variable that we want to predict.

The train_test_split function is used to split the dataset into two parts, one for training the model and the other for testing it. The stratify parameter ensures that the training and test datasets have similar proportions of the target variable.

The logistic regression model was instantiated with hyperparameters chosen for this problem: solver set to 'newton-cg', C set to 1, and random_state set to 219. The model was fit to the training data and then used to predict the test data.

The model's performance was evaluated using accuracy, which measures the proportion of correct predictions. The training and testing accuracy scores were 72% and 71% respectively, for a train-test gap of 0.0136.

The confusion matrix is a useful tool for evaluating the performance of a classification model because it breaks its predictions down by type of error. In this case, there were 119 true positives, 44 false positives, 13 false negatives, and 19 true negatives.
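As a quick check, the four confusion matrix counts reported above can be turned into the usual summary metrics by hand. The sketch below uses only those four numbers; the variable names are chosen for illustration.

```python
# counts reported for the logistic regression model on the test set
tp, fp, fn, tn = 119, 44, 13, 19

total = tp + fp + fn + tn

# proportion of all predictions that were correct
accuracy = (tp + tn) / total

# of the customers predicted to buy, how many actually did
precision = tp / (tp + fp)

# of the customers who actually bought, how many were found (true positive rate)
recall = tp / (tp + fn)

# of the customers who did not buy, how many were correctly ruled out
specificity = tn / (tn + fp)

print('Test observations:', total)
print('Accuracy   :', round(accuracy, 4))
print('Precision  :', round(precision, 4))
print('Recall     :', round(recall, 4))
print('Specificity:', round(specificity, 4))
```

The accuracy of about 0.71 matches the reported testing accuracy. The gap between the high recall and the low specificity shows that the model predicts success for most customers, including many who did not buy.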

To further evaluate the performance of the model, the AUC (Area Under the Curve) score for the logistic regression model's predictions on the test dataset was calculated. The roc_auc_score function from the sklearn.metrics module was used, which takes two parameters: y_true, representing the true target values, and y_score, representing the predicted scores for the target (here, the class predictions stored in logreg_pred). The calculated AUC score was rounded to four decimal places.

The AUC score is a measure of how well a model can distinguish between positive and negative classes. A score of 1 indicates perfect classification, while a score of 0.5 indicates random guessing. In this case, the obtained AUC score of 0.6016 suggests that the model has a limited ability to distinguish between the positive and negative classes. The score of 0.6016 was saved for future use in the variable logreg_auc_score.
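Because the notebook passes the predicted class labels (logreg_pred) as y_score, the ROC curve has a single interior point, and the area under it reduces to the average of the true positive rate and the true negative rate. The sketch below reproduces the reported score from the confusion matrix counts alone; the variable names are illustrative.

```python
# counts reported for the logistic regression model on the test set
tp, fp, fn, tn = 119, 44, 13, 19

# true positive rate and true negative rate
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)

# with hard 0/1 predictions the ROC curve runs (0,0) -> (FPR,TPR) -> (1,1),
# so the area under it equals the mean of TPR and TNR
auc_from_labels = (tpr + tnr) / 2

print('AUC from class labels:', round(auc_from_labels, 4))
```

Passing predicted probabilities instead, for example logreg_fit.predict_proba(x_test)[:, 1], would score how well the model ranks customers across all thresholds and would generally give a different value from the one reported here.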

Other models were instantiated to check whether any could give a higher AUC score, including decision tree, random forest, and gradient boosting models. With the exception of the random forest discussed next, these models had an AUC score lower than the 0.6016 obtained with the logistic regression model.

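Interestingly, the random forest model had a training score of 0.8633, a testing score of 0.0028, and an AUC score of 0.6334. Because the notebook instantiates it as a RandomForestRegressor, its .score() method returns R-squared rather than accuracy, and the sharp drop from training to testing points to overfitting. The logistic regression model was therefore still preferred due to its higher and more stable testing performance.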
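The difference between R-squared and accuracy on a binary target can be seen on a small made-up example. The values below are hypothetical and are not output from the notebook's models; they only show that the two metrics measure different things for the same predictions.

```python
# hypothetical binary targets and continuous predictions from a regressor
y_true  = [1, 1, 1, 0, 1, 0, 1, 1]
y_score = [0.9, 0.6, 0.4, 0.5, 0.8, 0.3, 0.7, 0.55]

# R-squared: 1 - (residual sum of squares / total sum of squares)
mean_true = sum(y_true) / len(y_true)
ss_res = sum((t - s) ** 2 for t, s in zip(y_true, y_score))
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r_squared = 1 - ss_res / ss_tot

# accuracy: threshold the scores at 0.5 and count correct labels
y_label = [1 if s >= 0.5 else 0 for s in y_score]
accuracy = sum(t == p for t, p in zip(y_true, y_label)) / len(y_true)

print('R-squared:', round(r_squared, 4))
print('Accuracy :', round(accuracy, 4))
```

The same predictions look weak by R-squared and reasonable by accuracy, which is why a regressor's score should not be compared directly with a classifier's accuracy.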

In summary, the confusion matrix and AUC score were used to evaluate the performance of a classification model. The logistic regression model was found to have limited ability to distinguish between positive and negative classes, with an AUC score of 0.6016. Although the random forest model had a higher AUC score, the logistic regression model was still preferred due to its higher testing accuracy.


In [None]:
# importing libraries
import numpy as np  
import pandas as pd 
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split 
import statsmodels.formula.api as smf 


# loading data
sell_success = pd.read_excel(io ='./__storage/Cross_Sell_Success_Dataset_2023.xlsx')



# setting pandas print options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 100)


# displaying the head of the dataset
sell_success.head(n = 5)

In [None]:
sell_success.info(verbose = True)

In [None]:
sell_success.columns

In [None]:
# describing the dataset
sell_success.describe(include = 'number').round(decimals = 2)

In [None]:
# Creating a new column called 'mail' in the sell_success DataFrame and setting all the values to 0
sell_success['mail'] = 0

# Iterate over each row in the sell_success DataFrame
for index, row in sell_success.iterrows():
    # Extract the domain name from the email address using split() and set it to email_domain
    email_domain = row['EMAIL'].split('@')[-1]
    # If the email domain is one of these values, set the 'mail' value for that row to 1
    if email_domain in ['gmail.com', 'yahoo.com', 'protonmail.com']:
        sell_success.loc[index, 'mail'] = 1
    # If the email domain is one of these values, set the 'mail' value for that row to 2
    elif email_domain in ['me.com', 'aol.com', 'hotmail.com', 'live.com', 'msn.com', 'passport.com']:
        sell_success.loc[index, 'mail'] = 2
    # If the email domain is not one of the above values, set the 'mail' value for that row to 3
    else:
        sell_success.loc[index, 'mail'] = 3

In [None]:
# Converting the 'mail' column in the sell_success DataFrame to integer type
sell_success['mail'] = sell_success['mail'].astype(int)

In [None]:
# one hot encoding variables
one_hot_email       = pd.get_dummies(sell_success['mail'])

# joining codings together
sell_success = sell_success.join(other = [one_hot_email])



# checking results
sell_success.columns

In [None]:
# renaming the one-hot encoded columns
sell_success = sell_success.rename(columns={1: 'n_email_1',2 :'n_email_2', 3:'n_email_3'})

In [None]:
#Getting information about Data set
sell_success.info(verbose = True)

In [None]:
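# calculating the correlation coefficients
# (restricted to numeric and boolean columns, since text columns such as EMAIL cannot be correlated)
sell_corr = (sell_success.select_dtypes(include = ['number', 'bool'])
                         .corr(method = "pearson")
                         .round(decimals = 2))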

sell_corr['CROSS_SELL_SUCCESS'].sort_values(ascending = False)

In [None]:
# listing all explanatory variables
# (a list rather than a set: pandas needs an ordered list to select columns)
X_data = ['REVENUE', 'TOTAL_MEALS_ORDERED', 'UNIQUE_MEALS_PURCH', 'CONTACTS_W_CUSTOMER_SERVICE', 'PRODUCT_CATEGORIES_VIEWED',
          'AVG_TIME_PER_SITE_VISIT', 'CANCELLATIONS_AFTER_NOON', 'PC_LOGINS', 'MOBILE_LOGINS', 'WEEKLY_PLAN', 'LATE_DELIVERIES ',
          'AVG_PREP_VID_TIME', 'LARGEST_ORDER_SIZE', 'AVG_MEAN_RATING', 'TOTAL_PHOTOS_VIEWED',
          'n_email_2', 'n_email_3']

# declaring explanatory variables
sell_success_data = sell_success[X_data]

# declaring response variable
sell_success_target = sell_success.loc[:, "CROSS_SELL_SUCCESS"]


In [None]:
# train-test split with stratification
x_train, x_test, y_train, y_test = train_test_split(
          sell_success_data,
          sell_success_target,
            test_size    = 0.10,
            random_state = 219,
            stratify     = sell_success_target) # preserving balance


# merging training data for statsmodels
sell_success_train = pd.concat([x_train, y_train], axis = 1)

In [None]:
# instantiating a logistic regression model object
logistic_small = smf.logit(formula = """ CROSS_SELL_SUCCESS ~ n_email_3+CANCELLATIONS_AFTER_NOON+MOBILE_LOGINS+
                                                             AVG_TIME_PER_SITE_VISIT+UNIQUE_MEALS_PURCH+PC_LOGINS+n_email_2+REVENUE""",
                           data    = sell_success_train)


# fitting the model object
results_logistic = logistic_small.fit()


# checking the results SUMMARY
results_logistic.summary2()

In [None]:
# importing libraries
import pandas            as pd                       
import matplotlib.pyplot as plt                      
import seaborn           as sns                      
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LogisticRegression  
import statsmodels.formula.api as smf               
from sklearn.metrics import confusion_matrix         
from sklearn.metrics import roc_auc_score            
from sklearn.neighbors import KNeighborsClassifier   
from sklearn.neighbors import KNeighborsRegressor    
from sklearn.preprocessing import StandardScaler     
from sklearn.tree import DecisionTreeClassifier      
from sklearn.tree import plot_tree                   
from sklearn.ensemble import GradientBoostingRegressor 
from sklearn.ensemble import RandomForestRegressor

In [None]:
# creating a dictionary to store success models
success_dict = {

 # full model
 'logit_full'   : ['REVENUE', 'TOTAL_MEALS_ORDERED', 'UNIQUE_MEALS_PURCH', 'CONTACTS_W_CUSTOMER_SERVICE', 'PRODUCT_CATEGORIES_VIEWED',
                   'AVG_TIME_PER_SITE_VISIT', 'CANCELLATIONS_AFTER_NOON', 'PC_LOGINS', 'MOBILE_LOGINS', 'WEEKLY_PLAN', 'LATE_DELIVERIES ',
                   'AVG_PREP_VID_TIME', 'LARGEST_ORDER_SIZE', 'AVG_MEAN_RATING', 'TOTAL_PHOTOS_VIEWED',
                    'n_email_2', 'n_email_3'],
 

 # significant variables only (set 1)
 'logit_sig'    : ['n_email_3','CANCELLATIONS_AFTER_NOON','MOBILE_LOGINS',
                  'AVG_TIME_PER_SITE_VISIT','UNIQUE_MEALS_PURCH','PC_LOGINS',
                   'n_email_2','REVENUE'],
    
    
 # significant variables only (set 2)
 'logit_sig_2'  : ['n_email_3','CANCELLATIONS_AFTER_NOON','MOBILE_LOGINS',
                  'AVG_TIME_PER_SITE_VISIT','PC_LOGINS',
                   'n_email_2','REVENUE']

}

In [None]:
# printing success variable sets
print(f"""
/--------------------------\\
|Explanatory Variable Sets |
\\--------------------------/

Full Model:
-----------
{success_dict['logit_full']}


First Significant p-value Model:
--------------------------------
{success_dict['logit_sig']}


Second Significant p-value Model:
---------------------------------
{success_dict['logit_sig_2']}
""")

In [None]:
# train/test split with the first significant-variable set (logit_sig)
sell_success_data   =  sell_success.loc[ : , success_dict['logit_sig']]
sell_success_target =  sell_success.loc[ : , 'CROSS_SELL_SUCCESS']


# this is the exact code we were using before
x_train, x_test, y_train, y_test = train_test_split(
            sell_success_data,
            sell_success_target,
            random_state = 219,
            test_size    = 0.10,
            stratify     = sell_success_target)


# INSTANTIATING a logistic regression model
logreg = LogisticRegression(solver = 'newton-cg',
                            C = 1,
                            random_state = 219)


# FITTING the training data
logreg_fit = logreg.fit(x_train, y_train)


# PREDICTING based on the testing set
logreg_pred = logreg_fit.predict(x_test)


# SCORING the results
print('LogReg Training ACCURACY:', logreg_fit.score(x_train, y_train).round(4))
print('LogReg Testing  ACCURACY:', logreg_fit.score(x_test, y_test).round(4))

# saving scoring data for future use
logreg_train_score = logreg_fit.score(x_train, y_train).round(4) # accuracy
logreg_test_score  = logreg_fit.score(x_test, y_test).round(4)   # accuracy


# displaying and saving the gap between training and testing
print('LogReg Train-Test Gap   :', abs(logreg_train_score - logreg_test_score).round(4))
logreg_test_gap = abs(logreg_train_score - logreg_test_score).round(4)

In [None]:
# creating a confusion matrix
print(confusion_matrix(y_true = y_test,
                       y_pred = logreg_pred))

In [None]:
# unpacking the confusion matrix
logreg_tn, \
logreg_fp, \
logreg_fn, \
logreg_tp = confusion_matrix(y_true = y_test, y_pred = logreg_pred).ravel()


# printing each result one-by-one
print(f"""
True Negatives : {logreg_tn}
False Positives: {logreg_fp}
False Negatives: {logreg_fn}
True Positives : {logreg_tp}
""")

In [None]:
# area under the roc curve (auc)
print(roc_auc_score(y_true  = y_test,
                    y_score = logreg_pred).round(decimals = 4))


# saving AUC score for future use
logreg_auc_score = roc_auc_score(y_true  = y_test,
                                 y_score = logreg_pred).round(decimals = 4)

In [None]:
# zipping each feature name to its coefficient
# (using the training columns so the names line up with the fitted coefficients)
logreg_model_values = zip(x_train.columns,
                          logreg_fit.coef_.ravel().round(decimals = 2))


# setting up a placeholder list to store model features
logreg_model_lst = [('intercept', logreg_fit.intercept_[0].round(decimals = 2))]


# appending each feature-coefficient pair to the list
for val in logreg_model_values:
    logreg_model_lst.append(val)
    

# checking the results
for pair in logreg_model_lst:
    print(pair)

In [None]:
#######################################
# plot_feature_importances
########################################
def plot_feature_importances(model, train, export = False):
    """
    Plots the importance of features from a CART model.
    
    PARAMETERS
    ----------
    model  : CART model
    train  : explanatory variable training data
    export : whether or not to export as a .png image, default False
    """
    
    # declaring the number of features, taken from the data passed in
    n_features = train.shape[1]
    
    # setting plot window
    fig, ax = plt.subplots(figsize=(12,9))
    
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), train.columns)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
    
    if export:
        plt.savefig('Tree_Leaf_50_Feature_Importance.png')

In [None]:
# INSTANTIATING a classification tree object
full_tree = DecisionTreeClassifier()


# FITTING the training data
full_tree_fit = full_tree.fit(x_train, y_train)


# PREDICTING on new data
full_tree_pred = full_tree_fit.predict(x_test)


# SCORING the model
print('Full Tree Training ACCURACY:', full_tree_fit.score(x_train,
                                                     y_train).round(4))

print('Full Tree Testing ACCURACY :', full_tree_fit.score(x_test,
                                                     y_test).round(4))

print('Full Tree AUC Score:', roc_auc_score(y_true  = y_test,
                                            y_score = full_tree_pred).round(4))


# saving scoring data for future use
full_tree_train_score = full_tree_fit.score(x_train, y_train).round(4) # accuracy
full_tree_test_score  = full_tree_fit.score(x_test, y_test).round(4)   # accuracy


# saving AUC
full_tree_auc_score   = roc_auc_score(y_true  = y_test,
                                      y_score = full_tree_pred).round(4) # auc

In [None]:
# INSTANTIATING a logistic regression model with default hyperparameters
# (the full_tree_* names are reused from the decision tree cell)
full_tree = LogisticRegression()


# FITTING the training data
full_tree_fit = full_tree.fit(x_train, y_train)


# PREDICTING on new data
full_tree_pred = full_tree_fit.predict(x_test)


# SCORING the model
print('Full Tree Training ACCURACY:', full_tree_fit.score(x_train,
                                                     y_train).round(4))

print('Full Tree Testing ACCURACY :', full_tree_fit.score(x_test,
                                                     y_test).round(4))

print('Full Tree AUC Score:', roc_auc_score(y_true  = y_test,
                                            y_score = full_tree_pred).round(4))


# saving scoring data for future use
full_tree_train_score = full_tree_fit.score(x_train, y_train).round(4) # accuracy
full_tree_test_score  = full_tree_fit.score(x_test, y_test).round(4)   # accuracy


# saving AUC
full_tree_auc_score   = roc_auc_score(y_true  = y_test,
                                      y_score = full_tree_pred).round(4) # auc

In [None]:
# INSTANTIATING a gradient boosting model
# (a regressor, so .score() below returns R-squared, not accuracy)
full_tree = GradientBoostingRegressor()


# FITTING the training data
full_tree_fit = full_tree.fit(x_train, y_train)


# PREDICTING on new data
full_tree_pred = full_tree_fit.predict(x_test)


# SCORING the model
print('Full Tree Training ACCURACY:', full_tree_fit.score(x_train,
                                                     y_train).round(4))

print('Full Tree Testing ACCURACY :', full_tree_fit.score(x_test,
                                                     y_test).round(4))

print('Full Tree AUC Score:', roc_auc_score(y_true  = y_test,
                                            y_score = full_tree_pred).round(4))


# saving scoring data for future use
full_tree_train_score = full_tree_fit.score(x_train, y_train).round(4) # accuracy
full_tree_test_score  = full_tree_fit.score(x_test, y_test).round(4)   # accuracy


# saving AUC
full_tree_auc_score   = roc_auc_score(y_true  = y_test,
                                      y_score = full_tree_pred).round(4) # auc

In [None]:
# INSTANTIATING a k-nearest neighbors model
# (a regressor, so .score() below returns R-squared, not accuracy)
full_tree = KNeighborsRegressor()


# FITTING the training data
full_tree_fit = full_tree.fit(x_train, y_train)


# PREDICTING on new data
full_tree_pred = full_tree_fit.predict(x_test)


# SCORING the model
print('Full Tree Training ACCURACY:', full_tree_fit.score(x_train,
                                                     y_train).round(4))

print('Full Tree Testing ACCURACY :', full_tree_fit.score(x_test,
                                                     y_test).round(4))

print('Full Tree AUC Score:', roc_auc_score(y_true  = y_test,
                                            y_score = full_tree_pred).round(4))


# saving scoring data for future use
full_tree_train_score = full_tree_fit.score(x_train, y_train).round(4) # accuracy
full_tree_test_score  = full_tree_fit.score(x_test, y_test).round(4)   # accuracy


# saving AUC
full_tree_auc_score   = roc_auc_score(y_true  = y_test,
                                      y_score = full_tree_pred).round(4) # auc

In [None]:
# INSTANTIATING a random forest model
# (a regressor, so .score() below returns R-squared, not accuracy)
full_tree = RandomForestRegressor()


# FITTING the training data
full_tree_fit = full_tree.fit(x_train, y_train)


# PREDICTING on new data
full_tree_pred = full_tree_fit.predict(x_test)


# SCORING the model
print('Full Tree Training ACCURACY:', full_tree_fit.score(x_train,
                                                     y_train).round(4))

print('Full Tree Testing ACCURACY :', full_tree_fit.score(x_test,
                                                     y_test).round(4))

print('Full Tree AUC Score:', roc_auc_score(y_true  = y_test,
                                            y_score = full_tree_pred).round(4))


# saving scoring data for future use
full_tree_train_score = full_tree_fit.score(x_train, y_train).round(4) # accuracy
full_tree_test_score  = full_tree_fit.score(x_test, y_test).round(4)   # accuracy


# saving AUC
full_tree_auc_score   = roc_auc_score(y_true  = y_test,
                                      y_score = full_tree_pred).round(4) # auc

In [None]:
print(f""" The final model adopted has the following attributes: \n

    Model Type:          Logistic Regression ({logreg})
    Training Accuracy:   {logreg_train_score}
    Testing Accuracy:    {logreg_test_score}
    Train-Test Gap:      {logreg_test_gap}
    AUC Score:           {logreg_auc_score}
    Confusion Matrix:    TN: {logreg_tn}, FP: {logreg_fp}, FN: {logreg_fn}, TP:{logreg_tp}
    """)
