# Apprentice Chef

### A1: Regression Model Development
Estrella Spaans | Machine Learning

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

After three years serving customers across the San Francisco Bay Area, the executives at Apprentice Chef have decided to take on an analytics project to better understand how much revenue to expect from each customer within their first year of using their services.Thus, they have hired you on a full-time contract to analyze their data, develop your top insights, and build a machine learning model to predict revenue over the first year of each customer’s life cycle.They have explained to you that for this project, they are not interested in a time series analysis and instead would like to “keep things simple” by providing you with a data set of aggregated customer information.


## 1. Data Preperation

In here, I imported the essential packages and uploaded the file as a dataframe. In the description, it was mentioned that one of th columns was mislabeled, so I changed 'LARGEST_ORDER_SIZE' to 'AVERAGE_MEALS_ORDERED'. I also changed the columns names to lowercase as this is easier for the analysis. 

In [1]:
### PACKAGES, FILE, CHANGES AND SHOWING DATA 
## 1. IMPORTANT PACKAGES
import pandas as pd # essential datascience package
import matplotlib.pyplot as plt # data visualization
import numpy as np # mathimatical functions 
import seaborn as sns  # enhanced graphical output
import random as rand# random number generation
import sklearn.linear_model # to run different models
import statsmodels.formula.api as smf # linear regression (statsmodels)
from sklearn.model_selection import train_test_split  # train/test split
from sklearn.neighbors import KNeighborsRegressor # KNN for Regression
from sklearn.preprocessing import StandardScaler # standard scaler

# pip install gender_guesser (remove # if you need to install it)
import gender_guesser.detector as gender # guess gender based on (given) name

# setting pandas print options (columns, rows, and display width)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

## 2. IMPORTING THE DATESET
# Specifying the path and file name
file = './data/Apprentice_Chef_Dataset.xlsx'

# Reading the file into Python
ap_customers = pd.read_excel(io=file)

## 3. CHANGING MISLABELD COLUMN NAME AND COLUMN PRESENTATION
# Changing the name of largest_order_size to average_meals_ordered
ap_customers.rename(columns={'LARGEST_ORDER_SIZE':'AVERAGE_MEALS_ORDERED'}, 
                    inplace=True)

# Changing the capitalized columns to lowercase (personal preference)
ap_customers.columns = map(str.lower, ap_customers.columns)

## 4. SHOWING THE DATAFRAME
# Checking if the data was imported and changed correctly 
ap_customers.head(n = 5)

Unnamed: 0,revenue,cross_sell_success,name,email,first_name,family_name,total_meals_ordered,unique_meals_purch,contacts_w_customer_service,product_categories_viewed,avg_time_per_site_visit,mobile_number,cancellations_before_noon,cancellations_after_noon,tastes_and_preferences,pc_logins,mobile_logins,weekly_plan,early_deliveries,late_deliveries,package_locker,refrigerated_locker,avg_prep_vid_time,average_meals_ordered,master_classes_attended,median_meal_rating,avg_clicks_per_visit,total_photos_viewed
0,393.0,1,Saathos,saathos@unitedhealth.com,Saathos,Saathos,14,6,12,10,48.0,1,3,1,1,5,2,0,0,2,0,0,33.4,1,0,1,17,0
1,1365.0,1,Alysanne Osgrey,alysanne.osgrey@ge.org,Alysanne,Osgrey,87,3,8,8,40.35,1,0,0,1,5,1,12,0,2,0,0,84.8,1,0,3,13,170
2,800.0,1,Edwyd Fossoway,edwyd.fossoway@jnj.com,Edwyd,Fossoway,15,7,11,5,19.77,1,3,0,1,6,1,1,0,1,0,0,63.0,1,0,2,16,0
3,600.0,1,Eleyna Westerling,eleyna.westerling@ge.org,Eleyna,Westerling,13,6,11,5,90.0,1,2,0,1,6,1,14,0,3,0,0,43.8,1,0,2,14,0
4,1490.0,1,Elyn Norridge,elyn.norridge@jnj.com,Elyn,Norridge,47,8,6,10,40.38,1,0,0,0,5,1,5,0,8,0,0,84.8,1,1,3,12,205


## 2. Understanding the Data

In the exploratory analysis, I made sure to get familiar with the data, checking the data types, null-values, definitions so that I can catogize them into continious, count/interval, and categorical variable types.


In [2]:
## Info & Descriptive Analysis 

## The number of non-null values / data type of each variable (remove # to run)
# ap_customers.info()

## The descriptive statistics of numeric variables (remove # to run)
# ap_customers.describe(include = 'number').round(decimals=2)

## The descriptive statistics of non-numeric variables (remove # to run)
# ap_customers.describe(include = object)

In [3]:
## Checking the distribution of the response variable "Revenue" 
# developing a distribution of REVENUE
sns.displot(data   = ap_customers,
            x      = 'revenue',
            height = 5,
            aspect = 2)

plt.title(label   = f"""Distribution of reponse variable "Revenue" """)
plt.xlabel(xlabel = "Revenue") # avoiding using dataset labels
plt.ylabel(ylabel = "Count")

# displaying the histogram
plt.close()

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /> 

Due to the fact that Apprentice Chef want to know how much revenue they can expect from first-year customers, <b> the response variable will be revenue</b>. Based on the outputs above, I identified the data type of each original variable in the dataset: 

| Continuous                | Count/Interval              | Categorical  
|:-------------------------:|:---------------------------:|:----------------------:|
| revenue                   | avg_clicks_per_visit        | cross_sell_success     | 
| avg_time_per_site_visit   | unique_meals_purch          | name                   |
| avg_prep_vid_time         | contacts_w_customer_service | email                  |
|                           | product_categories_viewed   | first_name             |
|                           | cancellations_before_noon   | family_name            |
|                           | cancellations_after_noon    | mobile_number          |
|                           | pc_logins	                  | tastes_and_preferences |
|                           | mobile_logins               | package_locker         |
|                           | weekly_plan                 | refrigerated_locker    |
|                           | early_deliveries            |                        |
|                           | late_deliveries             |                        |
|                           | master_classes_attended     |                        |
|                           | average_meals_ordered       |                        |
|                           | total_meals_ordered         |                        |
|                           | total_photos_viewed         |                        |
|                           | median_meals_rating.        |                        |
|<img width=300/>|<img width=300/>|<img width=300/>|

<h4> Other Insights </h4>

* There are 47 people who did not have their family name registered in their profile.
* The response variable is skewed towards the right and seems bi-modal. 

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /> 

## 3. Preperation Feature Engineering

<b> Functions & Variables </b>

In [4]:
## Creating Specific placeholders for each of the variables 

# Creating an placeholder for delivery variables 
delivery_variables = ['early_deliveries', 'late_deliveries']

# Creating an placeholder for cancellations variables 
cancellation_variables = ['cancellations_before_noon','cancellations_after_noon']

# Creating an placeholder for behavior_variables
behavior_variables = ['pc_logins', 'mobile_logins','avg_clicks_per_visit','product_categories_viewed', 'total_photos_viewed', 'median_meal_rating']

# Creating an placeholder for service_variables
purchase_variables = ['unique_meals_purch','average_meals_ordered','weekly_plan', "total_meals_ordered"]

# Creating an placeholder for service_variables
service_variables = ['contacts_w_customer_service','master_classes_attended']

# Creating an placeholder for categorical_variables
categorical_variables = ['name','email', 'first_name','family_name','mobile_number','cross_sell_success','tastes_and_preferences','package_locker', 'refrigerated_locker']

In [5]:
## Count the number of 0 values of each variables 
early_deliveries_no   = ap_customers['early_deliveries'].isin([0]).sum() 
late_deliveries_no = ap_customers['late_deliveries'].isin([0]).sum()
cancellations_before_noon_no = ap_customers['cancellations_before_noon'].isin([0]).sum() 
cancellations_after_noon_no = ap_customers['cancellations_after_noon'].isin([0]).sum()
total_photos_viewed_no = ap_customers['total_photos_viewed'].isin([0]).sum() 
product_categories_viewed_no = ap_customers['product_categories_viewed'].isin([0]).sum()
pc_logins_no = ap_customers['pc_logins'].isin([0]).sum()
mobile_logins_no = ap_customers['mobile_logins'].isin([0]).sum()
avg_clicks_per_visit_no = ap_customers['avg_clicks_per_visit'].isin([0]).sum()
total_meals_ordered_no = ap_customers['total_meals_ordered'].isin([0]).sum()
average_meals_ordered_no = ap_customers['average_meals_ordered'].isin([0]).sum()
weekly_plan_no = ap_customers['weekly_plan'].isin([0]).sum()
unique_meals_purch_no = ap_customers['unique_meals_purch'].isin([0]).sum()
master_classes_attended_no = ap_customers['master_classes_attended'].isin([0]).sum()
contacts_w_customer_service_no = ap_customers['contacts_w_customer_service'].isin([0]).sum()
median_meal_rating_no = ap_customers['median_meal_rating'].isin([0]).sum()

In [6]:
## Copy the dataset to make changes with other variables 

# Make a copy
ap_customer_2 = ap_customers.copy()


## 4. Feature Engineering

In [7]:
# FEATURE ENGIGEERING: New variables 
##############################################################################
#Variable: total cancellations
ap_customer_2['total_cancellations'] = ap_customer_2['cancellations_before_noon'] + ap_customer_2['cancellations_after_noon']

##############################################################################
# Variable: Occasion 
# STEP 1: splitting personal emails

# placeholder list
placeholder_lst = []

# looping over each email address
for index, col in ap_customer_2.iterrows():
    
    # splitting email domain at '@'
    split_email = ap_customer_2.loc[index, 'email'].split(sep = '@')
    
    # appending placeholder_lst with the results
    placeholder_lst.append(split_email)
    

# converting placeholder_lst into a DataFrame to convert the email
email_df = pd.DataFrame(placeholder_lst)

# Creating a new list
placeholder_lst2 = []

#defining which emails belong to professional 
professional = ['mmm.com','amex.com','apple.com','boeing.com','caterpillar.com',\
                'chevron.com','cisco.com','cocacola.com','disney.com','dupont.com',\
                'exxon.com','ge.org','goldmansacs.com','homedepot.com','ibm.com',\
                'intel.com@jnj.com','jpmorgan.com','mcdonalds.com','merck.com',\
                'microsoft.com','nike.com','pfizer.com','pg.com','travelers.com',\
                'unitedtech.com','unitedhealth.com','verizon.com','visa.com',\
                'walmart.com']

#defining which emails belong to personal
personal = ['gmail.com','yahoo.com','protonmail.com']

# loop over each variable to determine which category it is and put it in
# a list
for row in email_df[1]:
    if row in professional: 
        placeholder_lst2.append('work')
    elif row in personal:
        placeholder_lst2.append('personal')
    else: 
        placeholder_lst2.append('other')

# Adding a new column to the dataframe 
ap_customer_2['occasion'] = placeholder_lst2  

# Setting a placeholder
occasion = ['occasion']

# Getting the dummies for each of the occasions. 
occasion_dummies = pd.get_dummies(ap_customer_2['occasion'])

# Dropping the original column
ap_customer_2 = ap_customer_2.drop(columns= occasion)

# Adding the dummies to the feature engieering dataset
ap_customer_2 = ap_customer_2.join([occasion_dummies])

##############################################################################
# Creating a new variable: total order with a calculation of other variables
ap_customer_2['total_orders'] = round(ap_customer_2['total_meals_ordered'] / ap_customer_2['average_meals_ordered'],2)


In [8]:
# FEATURE ENGINEERING: Log variables 
# Creating a dataframe for all log values
log_variables = ap_customer_2.copy()

# Specifying which variables data types need to be changed to a float
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

# loop over datatypes to change the datatype
for c in log_variables.columns: 
    if log_variables[c].dtype in numerics:
        log_variables[c] = log_variables[c].astype(float)

# Dropping all the variables that do not need changing
log_variables = log_variables.drop(columns = categorical_variables) 

# Specifying whihch columns have 0 in the dataset.
value_change_columns = ["cancellations_before_noon","cancellations_after_noon","mobile_logins","weekly_plan","total_photos_viewed", "late_deliveries", "early_deliveries", "master_classes_attended", "total_cancellations"]

# Changing the 0 to 0.01 for a proper log transformation
log_variables[value_change_columns] = log_variables[value_change_columns].replace({0.0:0.01})

#loop over all values in columns to change it into log; 
for column in log_variables.columns:
    try:
        log_variables[column] = np.log10(log_variables[column])
    except (ValueError, AttributeError):
        pass

# Adding log to each of the columsn to make it clear which are transformations
log_variables.columns = [str(col) + '_log' for col in log_variables.columns]

# Adding these columns to the feature engineering dataframe
ap_customer_2 = pd.concat([ap_customer_2, log_variables],axis = 1)


  result = getattr(ufunc, method)(*inputs, **kwargs)


In [9]:
# FEATURE ENGINEERING: Continious variables 

# dummy variable for spending time on the website
ap_customer_2['length_time_spent_website'] = 0

# iterating over each original column to
# change values in the new feature columns
for index, value in ap_customer_2.iterrows():
    
    # people that spend more than 60 minutes on the website
    if ap_customer_2.loc[index, 'avg_time_per_site_visit'] > 60:
        ap_customer_2.loc[index, 'length_time_spent_website'] = 1


  # people that spend more less than 60 minutes on the website
    if ap_customer_2.loc[index, 'avg_time_per_site_visit'] <=60:
        ap_customer_2.loc[index, 'length_time_spent_website'] = 0

In [10]:
# FEATURE ENGINEERING: Count/Internal Variables 

#############################################################################
#Delivery variables 

# dummy variable for having a basement.
ap_customer_2['has_early_deliveries'] = 0
ap_customer_2['has_late_deliveries'] = 0

# iterating over each original column to
# change values in the new feature columns
for index, value in ap_customer_2.iterrows():
    
    # has_early_deliveries 
    if ap_customer_2.loc[index, 'early_deliveries'] > 0:
        ap_customer_2.loc[index, 'has_early_deliveries'] = 1


  # has_late_deliveries 
    if ap_customer_2.loc[index, 'late_deliveries'] > 0:
        ap_customer_2.loc[index, 'has_late_deliveries'] = 1

#############################################################################
#Cancellations variable

#Creating a dummy variable
ap_customer_2['has_cancellations']   = 0

# iterating over each original column to
# change values in the new feature columns
for index, value in ap_customer_2.iterrows():
    
    # has_cancellations_b_noon
    if ap_customer_2.loc[index, 'total_cancellations'] > 0:
        ap_customer_2.loc[index, 'has_cancellations'] = 1
        
#############################################################################
# Behavior Variables 

#Creating  dummy variables
ap_customer_2['has_total_photos_viewed']   = 0
ap_customer_2['has_mobile_logins']         = 0

# iterating over each original column to
# change values in the new feature columns
for index, value in ap_customer_2.iterrows():
    
    # has_early_deliveries 
    if ap_customer_2.loc[index, 'mobile_logins'] > 0:
        ap_customer_2.loc[index, 'has_mobile_logins'] = 1


  # has_late_deliveries 
    if ap_customer_2.loc[index, 'total_photos_viewed'] > 0:
        ap_customer_2.loc[index, 'has_total_photos_viewed'] = 1

#############################################################################
# Purchase Variables 

#Creating a dummy variable
ap_customer_2['has_weekly_plan']   = 0

# iterating over each original column to
# change values in the new feature columns
for index, value in ap_customer_2.iterrows():
    
    # has_early_deliveries 
    if ap_customer_2.loc[index, 'weekly_plan'] > 0:
        ap_customer_2.loc[index, 'has_weekly_plan'] = 1
        

#############################################################################
# Service Variables 

#Creating a dummy variable
ap_customer_2['has_master_classes_attended']   = 0

# iterating over each original column to
# change values in the new feature columns
for index, value in ap_customer_2.iterrows():
    
    # has_early_deliveries 
    if ap_customer_2.loc[index, 'master_classes_attended'] > 0:
        ap_customer_2.loc[index, 'has_master_classes_attended'] = 1
        
#############################################################################        
#Splitting up the variables for ranking Because of high correlation

# Using pd.get_dummies to get the different rankings
ratings = pd.get_dummies(ap_customer_2['median_meal_rating'])

#Giving the columns a name for each of the rankings 
ratings.columns = ['one_star_rank', 'two_star_rank', 'three_star_rank', 'four_star_rank','five_star_rank'] 

# Dropping the orihinal column
ap_customer_2 = ap_customer_2.drop(columns=['median_meal_rating'])

#Adding the new created columns to the feature engineering dataset
ap_customer_2 = ap_customer_2.join([ratings])

In [11]:
# FEATURE ENGINEERING Categorical Variables 

# dropping categorical variables after they've been encoded
categorical_variables2 = ['name', 'first_name','email','family_name']

# Cropping the columns that are not needed
ap_customer_2 = ap_customer_2.drop(columns= categorical_variables2)

In [12]:
#Checking_Final Dataset 

# Creating a variables that shows all the columns that need to be dropped
drop_final = ['other_log','personal_log','work_log','five_star_rank']
# Dropping the final variables that are not supposed to be created
ap_customer_2=ap_customer_2.drop(columns=drop_final)

#Showing the final dataset with feature engineering variables. 
ap_customer_2.head(n=5)

Unnamed: 0,revenue,cross_sell_success,total_meals_ordered,unique_meals_purch,contacts_w_customer_service,product_categories_viewed,avg_time_per_site_visit,mobile_number,cancellations_before_noon,cancellations_after_noon,tastes_and_preferences,pc_logins,mobile_logins,weekly_plan,early_deliveries,late_deliveries,package_locker,refrigerated_locker,avg_prep_vid_time,average_meals_ordered,master_classes_attended,avg_clicks_per_visit,total_photos_viewed,total_cancellations,other,personal,work,total_orders,revenue_log,total_meals_ordered_log,unique_meals_purch_log,contacts_w_customer_service_log,product_categories_viewed_log,avg_time_per_site_visit_log,cancellations_before_noon_log,cancellations_after_noon_log,pc_logins_log,mobile_logins_log,weekly_plan_log,early_deliveries_log,late_deliveries_log,avg_prep_vid_time_log,average_meals_ordered_log,master_classes_attended_log,median_meal_rating_log,avg_clicks_per_visit_log,total_photos_viewed_log,total_cancellations_log,total_orders_log,length_time_spent_website,has_early_deliveries,has_late_deliveries,has_cancellations,has_total_photos_viewed,has_mobile_logins,has_weekly_plan,has_master_classes_attended,one_star_rank,two_star_rank,three_star_rank,four_star_rank
0,393.0,1,14,6,12,10,48.0,1,3,1,1,5,2,0,0,2,0,0,33.4,1,0,17,0,4,0,0,1,14.0,2.594393,1.146128,0.778151,1.079181,1.0,1.681241,0.477121,0.0,0.69897,0.30103,-2.0,-2.0,0.30103,1.523746,0.0,-2.0,0.0,1.230449,-2.0,0.60206,1.146128,0,0,1,1,0,1,0,0,1,0,0,0
1,1365.0,1,87,3,8,8,40.35,1,0,0,1,5,1,12,0,2,0,0,84.8,1,0,13,170,0,0,0,1,87.0,3.135133,1.939519,0.477121,0.90309,0.90309,1.605844,-2.0,-2.0,0.69897,0.0,1.079181,-2.0,0.30103,1.928396,0.0,-2.0,0.477121,1.113943,2.230449,-2.0,1.939519,0,0,1,0,1,1,1,0,0,0,1,0
2,800.0,1,15,7,11,5,19.77,1,3,0,1,6,1,1,0,1,0,0,63.0,1,0,16,0,3,1,0,0,15.0,2.90309,1.176091,0.845098,1.041393,0.69897,1.296007,0.477121,-2.0,0.778151,0.0,0.0,-2.0,0.0,1.799341,0.0,-2.0,0.30103,1.20412,-2.0,0.477121,1.176091,0,0,1,1,0,1,1,0,0,1,0,0
3,600.0,1,13,6,11,5,90.0,1,2,0,1,6,1,14,0,3,0,0,43.8,1,0,14,0,2,0,0,1,13.0,2.778151,1.113943,0.778151,1.041393,0.69897,1.954243,0.30103,-2.0,0.778151,0.0,1.146128,-2.0,0.477121,1.641474,0.0,-2.0,0.30103,1.146128,-2.0,0.30103,1.113943,1,0,1,1,0,1,1,0,0,1,0,0
4,1490.0,1,47,8,6,10,40.38,1,0,0,0,5,1,5,0,8,0,0,84.8,1,1,12,205,0,1,0,0,47.0,3.173186,1.672098,0.90309,0.778151,1.0,1.606166,-2.0,-2.0,0.69897,0.0,0.69897,-2.0,0.90309,1.928396,0.0,0.0,0.477121,1.079181,2.311754,-2.0,1.672098,0,0,1,0,1,1,1,1,0,0,1,0


<hr style="height:.9px;border:none;color:#333;background-color:#333;" /> 

## 5. Model Development

### Distribution of Response Variable

In [13]:
# Plot distribution of revenue and log_revenue 
## CHANGE plt.close() to plt.show() TO DISPLAY THE GRAPHS

# Setting size of the plots and how many plots are shown next to each other:
fig, ax = plt.subplots(figsize=(20, 6), ncols=2)

#Developing a plot for the distribution of log_revenue 
sns.histplot(data   = ap_customer_2,
             x      = 'revenue_log', 
             ax     = ax[0]) #showing the location (first plot)

#Developing a plot for the distribution of log_revenue 
sns.histplot(data   = ap_customer_2,
             x      = 'revenue', 
             color = "skyblue",
             ax     = ax[1]) #showing the location (second plot)

# Setting the titles for each plot
ax[0].set_title('Distribution of Log Revenue') 
ax[1].set_title('Distribution of Original Revenue')

# displaying the histogram
plt.close()

When performing regressions, it is important that the response variable is normally distrubited and does not show any kurtosis or skewness. In this case, "revenue" has some skewness to the right. Therefore, I applied a log transformation to the variable to see whether this would make a difference. The bi-modelness of the variable will be ignored for now and changed, if needed, after creating the first regression models.


### Correlation Matrix

In [14]:
# CORRELATIONS 
# Creating correlation matrix, round them, and sort them
df_corr = ap_customer_2.corr().round(2).sort_values('revenue')

# Create overview that shows the correlation with log revenue and revenue
df_corr.loc[:'avg_prep_vid_time',['revenue','revenue_log']   ]

Unnamed: 0,revenue,revenue_log
avg_clicks_per_visit_log,-0.56,-0.58
avg_clicks_per_visit,-0.55,-0.58
two_star_rank,-0.37,-0.42
one_star_rank,-0.2,-0.28
unique_meals_purch_log,-0.12,-0.13
unique_meals_purch,-0.06,-0.08
cancellations_after_noon_log,-0.04,-0.04
cancellations_after_noon,-0.04,-0.04
has_weekly_plan,-0.03,-0.03
weekly_plan_log,-0.02,-0.02


In [15]:
# BEST OLS Model According to StatsModels 
# Creating a new dataset to make sure everything works fine. 
model_test = ap_customer_2.copy()

#Dropping the reponse variables from the dataset 
model_data = ap_customer_2.drop(columns=['revenue', 'revenue_log'])

#Setting the target response (either revenue or log revenue)
target_t1 = model_test.loc[ : ,'revenue']
target_t2 = model_test.loc[ : , 'revenue_log'] # ready for use later

# Splitting up the data for training and testing focus: log_revenue
x_train_test, x_test_test, y_train_test, y_test_test = train_test_split(
            model_data,
            target_t2,
            test_size = 0.25,
            random_state = 219)

# Merging training data together for Linear Regression from Statsmodel to work
test_train = pd.concat([x_train_test, y_train_test], axis = 1)


# Step 1: build a model in StatsModels 
lm_best = smf.ols(formula =  """revenue_log ~ 
                            total_orders +
                            total_meals_ordered_log +
                            unique_meals_purch_log +
                            contacts_w_customer_service +
                            avg_time_per_site_visit_log +
                            master_classes_attended_log +
                            total_photos_viewed_log+
                            length_time_spent_website +
                            one_star_rank +
                            two_star_rank +
                            four_star_rank""",data = test_train)

# Step 2: fit the model based on the data
results = lm_best.fit()


# Step 3: analyze the summary output
print(results.summary())

# Creating a list with all the variables used: x_variables
x_variables = ['avg_prep_vid_time','average_meals_ordered', 'total_orders', 'total_meals_ordered_log','unique_meals_purch_log', 'contacts_w_customer_service','avg_time_per_site_visit_log', 'master_classes_attended_log','total_photos_viewed_log','length_time_spent_website','one_star_rank', 'two_star_rank', 'four_star_rank']

                            OLS Regression Results                            
Dep. Variable:            revenue_log   R-squared:                       0.715
Model:                            OLS   Adj. R-squared:                  0.713
Method:                 Least Squares   F-statistic:                     330.3
Date:                Tue, 09 Feb 2021   Prob (F-statistic):               0.00
Time:                        23:52:21   Log-Likelihood:                 1127.8
No. Observations:                1459   AIC:                            -2232.
Df Residuals:                    1447   BIC:                            -2168.
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept         

### Model 1: OLS

In [16]:
# MODEL 1 - OLS Rsponse: revenue_log 
#############################################################################
model_1 = ap_customer_2.copy()

model_1_data = model_1[['avg_prep_vid_time','average_meals_ordered', 'total_orders', 'total_meals_ordered_log','unique_meals_purch_log', 'contacts_w_customer_service','avg_time_per_site_visit_log', 'master_classes_attended_log','total_photos_viewed_log','length_time_spent_website','one_star_rank', 'two_star_rank', 'four_star_rank']]
                                        
target_1 = model_1.loc[ : ,'revenue']
target_2 = model_1.loc[ : , 'revenue_log'] # ready for use later

x_train_model_1, x_test_model_1, y_train_model_1, y_test_model_1 = train_test_split(
    model_1_data,
    target_2,
    test_size = 0.25,
    random_state = 219)

#############################################################################
lr1 = sklearn.linear_model.LinearRegression()


# FITTING to the training data
lr1_fit = lr1.fit(x_train_model_1, y_train_model_1)


# PREDICTING on new data
lr1_pred = lr1_fit.predict(x_test_model_1)


# SCORING the results
print('OLS Training Score :', lr1.score(x_train_model_1, y_train_model_1).round(4))  # using R-square
print('OLS Testing Score  :', lr1.score(x_test_model_1, y_test_model_1).round(4)) # using R-square

lr1_train_score = lr1.score(x_train_model_1, y_train_model_1).round(4)
lr1_test_score  = lr1.score(x_test_model_1, y_test_model_1).round(4)

# displaying and saving the gap between training and testing
print('OLS Train-Test Gap :', abs(lr1_train_score - lr1_test_score).round(4))
lr1_test_gap = abs(lr1_train_score - lr1_test_score).round(4)

# zipping each feature name to its coefficient
lr1_model_values = zip(model_1_data[x_variables].columns,
                      lr1_fit.coef_.round(decimals = 4))

# setting up a placeholder list to store model features
lr1_model_lst = [('intercept', lr1_fit.intercept_.round(decimals = 4))]


# printing out each feature-coefficient pair one by one
for val in lr1_model_values:
    lr1_model_lst.append(val)
    
# checking the results
for pair in lr1_model_lst:
    print(pair)

OLS Training Score : 0.7459
OLS Testing Score  : 0.7569
OLS Train-Test Gap : 0.011
('intercept', 2.5528)
('avg_prep_vid_time', 0.0014)
('average_meals_ordered', -0.0256)
('total_orders', -0.0036)
('total_meals_ordered_log', 0.3998)
('unique_meals_purch_log', -0.1363)
('contacts_w_customer_service', 0.0059)
('avg_time_per_site_visit_log', 0.0503)
('master_classes_attended_log', 0.0183)
('total_photos_viewed_log', 0.0059)
('length_time_spent_website', -0.0815)
('one_star_rank', -0.0884)
('two_star_rank', -0.0518)
('four_star_rank', 0.1261)


### Model 2: Lasso

In [17]:
# MODEL 2 - Lasso Response: Revenue 

# Creating a new dataset to make sure everything works fine. 
model_2 = ap_customer_2.copy()

# Subsetting the data in order to have the right variables 
model_2_data = model_2[['avg_prep_vid_time','average_meals_ordered', 'total_orders', 'total_meals_ordered_log','unique_meals_purch_log', 'contacts_w_customer_service','avg_time_per_site_visit_log', 'master_classes_attended_log','total_photos_viewed_log','length_time_spent_website','one_star_rank', 'two_star_rank', 'four_star_rank']]

# Creating target variables 
target_1 = model_2.loc[ : ,'revenue']
target_2 = model_2.loc[ : , 'revenue_log'] # ready for use later

# Splitting up the data for training and testing focus: log_revenue
x_train_model_2, x_test_model_2, y_train_model_2, y_test_model_2 = train_test_split(
            model_2_data,
            target_1,
            test_size = 0.25,
            random_state = 219)

# CINSTANTIATING a lasso model
lasso_model = sklearn.linear_model.Lasso()

# FITTING to the training data
lasso_fit = lasso_model.fit(x_train_model_2, y_train_model_2)

# PREDICTING on new data
lasso_pred = lasso_fit.predict(x_test_model_2)


# SCORING the results
print('Lasso Training Score :', lasso_model.score(x_train_model_2, y_train_model_2).round(4))
print('Lasso Testing Score  :', lasso_model.score(x_test_model_2, y_test_model_2).round(4))


# saving scoring data for future use
lasso_train_score = lasso_model.score(x_train_model_2, y_train_model_2).round(4) # using R-square
lasso_test_score  = lasso_model.score(x_test_model_2, y_test_model_2).round(4)   # using R-square


# displaying and saving the gap between training and testing
print('Lasso Train-Test Gap :', abs(lasso_train_score - lasso_test_score).round(4))
lasso_test_gap = abs(lasso_train_score - lasso_test_score).round(4)


# zipping each feature name to its coefficient
lasso_model_values = zip(model_2_data[x_variables].columns,
                      lasso_fit.coef_.round(decimals = 4))


# setting up a placeholder list to store model features
lasso_model_lst = [('intercept', lasso_fit.intercept_.round(decimals = 4))]


# printing out each feature-coefficient pair one by one
for val in lasso_model_values:
    lasso_model_lst.append(val)
    

# checking the results
for pair in lasso_model_lst:
    print(pair)

Lasso Training Score : 0.6996
Lasso Testing Score  : 0.7398
Lasso Train-Test Gap : 0.0402
('intercept', -1454.1014)
('avg_prep_vid_time', 8.8439)
('average_meals_ordered', -122.5332)
('total_orders', -9.5141)
('total_meals_ordered_log', 1576.6577)
('unique_meals_purch_log', -694.8131)
('contacts_w_customer_service', 66.1907)
('avg_time_per_site_visit_log', 250.2256)
('master_classes_attended_log', 87.0299)
('total_photos_viewed_log', 25.9429)
('length_time_spent_website', -458.1933)
('one_star_rank', -159.7424)
('two_star_rank', -131.0571)
('four_star_rank', 912.339)


In [18]:
# comparing OLS & LASSO results 

# Printing the results
print(f"""
Model                  Train Score        Test Score
-----                  -----------        ----------
OLS (revenue_log)      {lr1_train_score}             {lr1_test_score}
Lasso (revenue)        {lasso_train_score}             {lasso_test_score}
""")


# creating a dictionary for model results
performance = {
    
    'Model Type'    : ['OLS', 'Lasso'],
           
    'Training' : [lr1_train_score, lasso_train_score],
           
    'Testing'  : [lr1_test_score, lasso_test_score],
                    
    'Train-Test Gap' : [lr1_test_gap, lasso_test_gap],
                    
    'Model Size' : [len(lr1_model_lst), len(lasso_model_lst)],
                    
    'Model' : [lr1_model_lst, lasso_model_lst]}


# converting model_performance into a DataFrame
performance = pd.DataFrame(performance)


# sending model results to Excel
performance.to_excel('./performance_model.xlsx',
                           index = False)


Model                  Train Score        Test Score
-----                  -----------        ----------
OLS (revenue_log)      0.7459             0.7569
Lasso (revenue)        0.6996             0.7398



### Model 3: KNN Non-Standardized Data

In [19]:
# MODEL 3 - KNN Non-Standardized Data Response: revenue_log 
model_3 = ap_customer_2.copy()

model_3_data = model_3[['avg_prep_vid_time','average_meals_ordered', 'total_orders', 'total_meals_ordered_log','unique_meals_purch_log', 'contacts_w_customer_service','avg_time_per_site_visit_log', 'master_classes_attended_log','total_photos_viewed_log','length_time_spent_website','one_star_rank', 'two_star_rank', 'four_star_rank']]
                                        
target_1 = model_3.loc[ : ,'revenue']
target_2 = model_3.loc[ : , 'revenue_log'] # ready for use later

x_train_model_3, x_test_model_3, y_train_model_3, y_test_model_3 = train_test_split(
    model_3_data,
    target_2,
    test_size = 0.25,
    random_state = 219)

knn_reg = KNeighborsRegressor(algorithm = 'auto',
                              n_neighbors = 10)


##############################################################################
# FITTING the model based on the training data
knn_reg_fit = knn_reg.fit(x_train_model_3, y_train_model_3)


# PREDITCING on new data
knn_reg_pred = knn_reg_fit.predict(x_test_model_3)


# SCORING the results
print('KNN Training Score:', knn_reg.score(x_train_model_3, y_train_model_3).round(4))
print('KNN Testing Score :',  knn_reg.score(x_test_model_3, y_test_model_3).round(4))


# saving scoring data for future use
knn_reg_score_train = knn_reg.score(x_train_model_3, y_train_model_3).round(4)
knn_reg_score_test  = knn_reg.score(x_test_model_3, y_test_model_3).round(4)


# displaying and saving the gap between training and testing
print('KNN Train-Test Gap:', abs(knn_reg_score_train - knn_reg_score_test).round(4))
knn_reg_test_gap = abs(knn_reg_score_train - knn_reg_score_test).round(4)


KNN Training Score: 0.7019
KNN Testing Score : 0.6645
KNN Train-Test Gap: 0.0374


### Model 4: KNN Standardized Data

In [20]:
# MODEL 4 - KNN Standardized Data Response: revenue_log 

#SUBSETTING original dataset
model_4_data = ap_customer_2[['avg_prep_vid_time','average_meals_ordered', 'total_orders', 'total_meals_ordered_log','unique_meals_purch_log', 'contacts_w_customer_service','avg_time_per_site_visit_log', 'master_classes_attended_log','total_photos_viewed_log','length_time_spent_website','one_star_rank', 'two_star_rank', 'four_star_rank']]

# INSTANTIATING a StandardScaler() object
scaler = StandardScaler()

# FITTING the scaler with housing_data
scaler.fit(model_4_data)


# TRANSFORMING our data after fit
X_scaled = scaler.transform(model_4_data)

# converting scaled data into a DataFrame
X_scaled_df = pd.DataFrame(X_scaled)

#New training data
X_train_STAND, X_test_STAND, y_train_STAND, y_test_STAND = train_test_split(
            X_scaled_df,
            target_2,
            test_size = 0.25,
            random_state = 219)

# INSTANTIATING a model with the optimal number of neighbors
knn_stand = KNeighborsRegressor(algorithm = 'auto',
                   n_neighbors = 23)


# FITTING the model based on the training data
knn_stand_fit = knn_stand.fit(X_train_STAND, y_train_STAND)


# PREDITCING on new data
knn_stand_pred = knn_stand_fit.predict(X_test_STAND)


# SCORING the results
print('KNN Training Score:', knn_stand.score(X_train_STAND,y_train_STAND).round(4))
print('KNN Testing Score :', knn_stand.score(X_test_STAND, y_test_STAND).round(4))


# saving scoring data for future use
knn_stand_score_train = knn_stand.score(X_train_STAND,y_train_STAND).round(4)
knn_stand_score_test  = knn_stand.score(X_test_STAND, y_test_STAND).round(4)


# displaying and saving the gap between training and testing
print('KNN Train-Test Gap:', abs(knn_stand_score_train - knn_stand_score_test).round(4))
knn_stand_test_gap = abs(knn_stand_score_train - knn_stand_score_test).round(4)

KNN Training Score: 0.7864
KNN Testing Score : 0.7849
KNN Train-Test Gap: 0.0015


In [21]:
# comparing KNN results 

print(f"""
KNN Model             Neighbors     Train Score      Test Score
----------------      ---------     ----------       ----------
Non-Standardized      10             {knn_reg_score_train}            {knn_reg_score_test}
Standardized          23             {knn_stand_score_train}           {knn_stand_score_test}
""")


# creating a dictionary for model results
model_performance = {
    
    'Model Type'    : ['KNN_Not_Standardized', 'KNN_Standardized_Opt FINAL MODEL'],
           
    
    'Training' : [knn_reg_score_train,
                  knn_stand_score_train],
           
    
    'Testing'  : [knn_reg_score_test,
                  knn_stand_score_test],
                    
    
    'Train-Test Gap' : [knn_reg_test_gap,
                        knn_stand_test_gap],
                   
    
    'Model Size' : ["NA", " NA"],
                    
    'Model'      : ["NA", "NA"] }


# converting model_performance into a DataFrame
model_performance = pd.DataFrame(model_performance)


KNN Model             Neighbors     Train Score      Test Score
----------------      ---------     ----------       ----------
Non-Standardized      10             0.7019            0.6645
Standardized          23             0.7864           0.7849



In [22]:
# converting model_performance into a DataFrame
model_performance = pd.DataFrame(model_performance)


# concatenating with former performance DataFrame
total_performance = pd.concat([performance, model_performance],
                              axis = 0)


total_performance.sort_values(by = 'Testing',
                              ascending = False)


# sending model results to Excel
total_performance.to_excel('./performance.xlsx',
                           index = False)


# checking the results
total_performance

Unnamed: 0,Model Type,Training,Testing,Train-Test Gap,Model Size,Model
0,OLS,0.7459,0.7569,0.011,14.0,"[(intercept, 2.5528), (avg_prep_vid_time, 0.00..."
1,Lasso,0.6996,0.7398,0.0402,14.0,"[(intercept, -1454.1014), (avg_prep_vid_time, ..."
0,KNN_Not_Standardized,0.7019,0.6645,0.0374,,
1,KNN_Standardized_Opt FINAL MODEL,0.7864,0.7849,0.0015,,
