1. Introduction to Regression Model Development

A predictive model is an algorithm or mathematical representation of a system that predicts future outcomes from past data. It entails identifying and using relationships between a dependent variable (what we are attempting to predict, in this case "Revenue") and independent variables (inputs that influence the dependent variable). A predictive model's objective is to generate a mathematical equation that best describes the relationship between the inputs and outputs, and then use that equation to predict future outcomes (Kumar, 2016). These models help organizations make data-driven decisions in a variety of fields, including finance, marketing, and healthcare. Linear regression models, more complex non-linear models, and machine learning models such as decision trees, random forests, and neural networks can all be used as predictive models (Kumar, 2016). The model chosen is usually determined by the nature of the problem, the quality and quantity of available data, and the desired prediction accuracy. In this project, we build a linear regression model to predict "Revenue", the total revenue generated from each Apprentice Chef customer, which is a continuous outcome variable.
The aim is to build a model that has a high R-squared score while keeping the gap between training and testing scores at or below 0.05. We opted for a simple OLS model because it satisfied all the project criteria: adding Recursive Feature Elimination (RFE) for feature selection produced models restricted to the top-ranked features but with a significantly lower R-squared. A linear regression model works by fitting a line to the observed data that best predicts the dependent variable (here, Revenue) from the independent variables (Özkale, 2015). The goal is to find the line that minimizes the sum of the squared differences between the dependent variable's actual and predicted values.
The line is represented by the equation:
y = b0 + b1 * x1 + b2 * x2 + ... + bn * xn
where:
•	y is the predicted value of the dependent variable
•	b0 is the intercept, and b1, b2, ..., bn are the coefficients of the independent variables
•	x1, x2, ..., xn are the independent variables
The coefficients are estimated using data and represent the effect of each independent variable on the dependent variable. After estimating the coefficients, the model can be utilized to make predictions by plugging in the values of the predictor variables. Linear regression is based on the assumption of a linear relationship between the independent variables and the dependent variable, so it works best when the relationship is indeed linear (Özkale, 2015). However, the relationship may not be linear in some cases, in which case non-linear regression or other more complex models could be more appropriate.
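As a minimal sketch of how the coefficients in the equation above can be estimated, the snippet below solves the least-squares problem with NumPy on a small synthetic dataset. The data, the variable names, and the use of numpy.linalg.lstsq are illustrative assumptions; this is not the project's data or code.

```python
import numpy as np

# Synthetic, noise-free data generated from y = 2 + 3*x1 - 1*x2 (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]

# Prepend a column of ones so that the intercept b0 is estimated as well
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution: minimizes the sum of squared residuals
coefs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
b0, b1, b2 = coefs

# Predict for a new observation by plugging its values into the equation
x_new = np.array([1.0, 2.0])
y_new = b0 + b1 * x_new[0] + b2 * x_new[1]
print(coefs, y_new)
```

Because the synthetic data contain no noise, the estimated coefficients recover the values used to generate them; with real data the estimates would only approximate the underlying relationship.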

2. The Model Building Process

Our model-building process in Python involved the following steps:
1) Importing the required libraries: We started by importing the libraries needed for the analysis, such as pandas for data manipulation, numpy for numerical computations, and statsmodels for the OLS regression summary. The model itself comes from scikit-learn, a popular open-source machine-learning library in Python that provides simple and efficient tools for data mining and data analysis (Hao & Ho, 2019). It is built on top of the Python libraries NumPy, SciPy, and matplotlib, and offers a variety of algorithms for regression, classification, clustering, dimensionality reduction, and model selection (Hao & Ho, 2019). The library is intended to be simple to use and understand, with an emphasis on accessibility and interpretability.
2) Loading the data: We loaded the data into a pandas data frame and inspected it to ensure it was loaded correctly. The dataset is stored as an .xlsx file, which pandas can read directly with the read_excel() function.
3) Data cleaning and EDA: This entails checking for missing values, outliers, and other anomalies in the data and cleaning the data as necessary. Exploratory Data Analysis (EDA) is intended to understand the relationships between the dependent and independent variables and identify any potential confounding variables.
4) Feature selection: This step involves selecting the independent variables that will be used in the model based on their relevance and importance. In our case, we found that keeping only the most relevant variables (the top 5 predictors) resulted in a model with a significantly lower R-squared. We therefore opted to include as many predictors in the model as possible.
5) Model building: We fit the linear regression with scikit-learn's LinearRegression and fit the same OLS specification with the statsmodels library to obtain the coefficient summary. We specified the dependent variable (Revenue), the predictor variables, and the other parameters needed for the analysis.
6) Model evaluation: We then evaluated the model's performance by calculating the R-squared, adjusted R-squared, Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and other metrics.
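The evaluation metrics named in the last step can be computed along the following lines. This is a sketch using scikit-learn's metric functions on made-up actual and predicted values; the arrays are hypothetical and do not come from the Apprentice Chef data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted revenue values (not from the project data)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

# R-squared: share of the variance in y_true explained by the predictions
r2 = r2_score(y_true, y_pred)

# RMSE: square root of the mean squared error, in the units of the target
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# MAE: mean of the absolute errors, less sensitive to large errors than RMSE
mae = mean_absolute_error(y_true, y_pred)

print(f"R-squared: {r2:.3f}, RMSE: {rmse:.1f}, MAE: {mae:.1f}")
```

RMSE and MAE are expressed in the same units as the target, which makes them easier to relate to revenue than R-squared alone.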

3. Model Performance and Summary of Predictors

Our model was evaluated by calculating the training score, the testing score, and the train-test gap. Evaluating a regression model with both the training and testing scores (the latter is sometimes referred to as the validation score) provides a more complete picture of the model's performance. The training score indicates how well the model fits the training data, whereas the testing score indicates how well the model generalizes to new, previously unseen data (Mazen, 2021). A high training score combined with a low testing score indicates overfitting, which occurs when the model learns the training data too closely and fails to generalize to new data. Low training and testing scores indicate underfitting, which occurs when the model is too simple to capture the patterns in the data. A good model should have high training and testing scores, indicating a good fit to the training data and good generalization to new data (Mazen, 2021). In our case, the model achieves a training score of 0.622, a testing score of 0.638, and a train-test gap of -0.016.
The train-test gap is the difference between the training and testing scores, so a negative value means that the testing score is higher than the training score. A small gap such as -0.016 indicates that the model is not overfitting the training data and is generalizing well to new, previously unseen data (Mazen, 2021). A testing score slightly above the training score most likely reflects how the random split happened to fall rather than a genuinely better fit on unseen data. In other words, the model has learned the underlying patterns and relationships in the data rather than simply memorizing the training data.
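The gap and the 0.05 criterion can be checked with a few lines of Python. The helper name train_test_gap and the decision to compare the absolute value of the gap against the threshold are assumptions made for this sketch.

```python
def train_test_gap(train_score: float, test_score: float) -> float:
    """Return the training score minus the testing score, rounded to three decimals."""
    return round(train_score - test_score, 3)

# Scores reported for the model in this project
gap = train_test_gap(0.622, 0.638)

# Assumed reading of the project criterion: the size of the gap must not exceed 0.05
within_threshold = abs(gap) <= 0.05

print(gap, within_threshold)
```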
The R-squared and adjusted R-squared achieved by our model are 0.622 and 0.618, respectively. This means that 62.2% of the variation in the dependent variable (Revenue) is explained by the independent variables used in building the model; the adjusted value of 0.618 additionally accounts for the number of predictors. We can consider this a relatively good fit for our linear regression model. Lastly, we look at the summary of the coefficients of the independent variables. From the coefficient table, we can see that the significant predictors of revenue are the total number of meals ordered by a customer (positive coefficient), the number of unique meal sets ordered by each customer (positive coefficient), the number of times each customer made contact with customer service (positive coefficient), the average time a customer spends per website or mobile app visit (positive coefficient), the average time in seconds that the meal prep instruction video was playing (negative coefficient), the average rating of meal sets by a customer (negative coefficient), the average number of meals ordered per customer (negative coefficient), and the total number of clicks on photos across all website or mobile app visits (positive coefficient). The p-value is less than the significance level of 0.05 in all these cases, i.e., p < 0.05.
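The relationship between the two statistics can be checked by hand. The sketch below applies the standard adjusted R-squared formula to the figures shown in the regression summary (1459 observations, 13 predictors); the function name adjusted_r2 is illustrative.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Inputs taken from the regression summary; R-squared is already rounded to 0.622
adj = adjusted_r2(0.622, n_obs=1459, n_predictors=13)
print(round(adj, 4))
```

Because the R-squared fed into the formula is itself rounded, the result can differ from the reported 0.618 in the third decimal place.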
A significant positive coefficient in our linear regression model means that, holding the other predictors constant, the predicted value of the response variable (revenue) increases as the value of the predictor variable increases. In other words, the predictor variable and the response variable have a positive relationship. A significant negative coefficient, on the other hand, indicates that the predicted value of the response variable decreases as the value of the predictor variable increases, again with the other predictors held constant. It therefore means, for instance, that revenue is expected to go up if a customer orders more unique meal sets, whereas revenue is expected to go down if the average playing time of the meal preparation instruction video goes up.
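To make this interpretation concrete, the sketch below uses a hypothetical intercept and two hypothetical coefficients (not the fitted values from this project) and shows that raising one predictor by one unit, with the other held fixed, changes the prediction by exactly that predictor's coefficient.

```python
import numpy as np

# Hypothetical intercept and coefficients for two predictors (illustrative values only)
intercept = 500.0
coefs = np.array([5.0, -2.0])  # positive effect for x1, negative effect for x2

def predict(x):
    """Linear prediction: intercept plus the weighted sum of the predictors."""
    return intercept + coefs @ np.asarray(x, dtype=float)

base = predict([10.0, 100.0])
x1_up = predict([11.0, 100.0])   # x1 raised by one unit, x2 unchanged
x2_up = predict([10.0, 101.0])   # x2 raised by one unit, x1 unchanged

print(x1_up - base)  # change equals the coefficient of x1
print(x2_up - base)  # change equals the coefficient of x2
```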

4. References
Hao, J., & Ho, T. K. (2019). Machine learning made easy: A review of Scikit-learn package in Python programming language. Journal of Educational and Behavioral Statistics, 44(3), 348-361.
Kumar, A. (2016). Learning predictive analytics with Python. Packt Publishing Ltd.
Mazen, A. (2021, October 21). Evaluating your Regression Model in Python. Retrieved from https://linguisticmaz.medium.com/evaluating-your-regression-model-in-python-66c58abc4fb
Özkale, M. R. (2015). Predictive performance of linear regression models. Statistical Papers, 56(2), 531-567.


# APPRENTICE CHEF PREDICTIVE MODEL
# importing the necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm

# loading the dataset
df = pd.read_excel(r'C:\Users\User\Desktop\Apprentice_Chef_Dataset_2023.xlsx')

# specifying the independent variables 
independent_vars = ['TOTAL_MEALS_ORDERED', 'UNIQUE_MEALS_PURCH', 'CONTACTS_W_CUSTOMER_SERVICE', 'PRODUCT_CATEGORIES_VIEWED', 
                    'AVG_TIME_PER_SITE_VISIT', 'CANCELLATIONS_AFTER_NOON', 'PC_LOGINS', 'MOBILE_LOGINS', 
                    'WEEKLY_PLAN', 'AVG_PREP_VID_TIME', 'LARGEST_ORDER_SIZE', 
                    'AVG_MEAN_RATING', 'TOTAL_PHOTOS_VIEWED']

# specifying the dependent variable
dependent_var = 'REVENUE'

# selecting the features and target, then splitting into training and testing sets
X = df[independent_vars]
y = df[dependent_var]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=219, test_size=0.25)

# fitting a linear regression model
reg = LinearRegression()
reg.fit(X_train, y_train)

# making predictions
y_train_pred = reg.predict(X_train)
y_test_pred = reg.predict(X_test)

# evaluating the model
train_score = round(reg.score(X_train, y_train), 3)
test_score = round(reg.score(X_test, y_test), 3)
train_test_gap = round(train_score - test_score, 3)

# printing results
print("Final Model Type: OLS Regression")
print(f"Training Score: {train_score}")
print(f"Testing Score: {test_score}")
print(f"Train-Test Gap: {train_test_gap}")

# refitting the same specification with statsmodels (intercept added) to get the coefficient summary
X_train_const = sm.add_constant(X_train)
results = sm.OLS(y_train, X_train_const).fit()
print(results.summary())


Final Model Type: OLS Regression
Training Score: 0.622
Testing Score: 0.638
Train-Test Gap: -0.016
                            OLS Regression Results                            
Dep. Variable:                REVENUE   R-squared:                       0.622
Model:                            OLS   Adj. R-squared:                  0.618
Method:                 Least Squares   F-statistic:                     182.5
Date:                Wed, 08 Feb 2023   Prob (F-statistic):          1.88e-293
Time:                        16:40:24   Log-Likelihood:                -11637.
No. Observations:                1459   AIC:                         2.330e+04
Df Residuals:                    1445   BIC:                         2.338e+04
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
---------------