# Cili Lado Data Analytics and Machine Learning Project - Group 8
Table of Contents:

Step 1: Acquire the dataset

Step 2: Import the libraries

Step 3: Import the dataset

Step 4a: Feature Selection (For customer flow and sales)

Step 5a: Clean the data by identifying and handling missing values, redundancy and outliers

Step 6a: Encode the categorical data

Step 7a: Feature Scaling

Step 8a: Splitting dataset into training and testing sets, train the model and measure the accuracy

Step 4b: Feature Selection (For new customer and repeat customer)

Step 5b: Clean the data by identifying and handling missing values, redundancy and outliers

Step 6b: Encode the categorical data

Step 7b: Feature Scaling

Step 8b: Splitting dataset into training and testing sets, train the model and measure the accuracy

Step 9: Combine both Multiple Linear Regression models into one, and build the same Multiple Linear Regression with an RNN

Step 10: Random Forest Model

The flowchart of our work is shown in the diagram below.

![FlowChart](CiliLadoData/FlowChart.png)

# Step 1: Acquire the dataset

We obtained the data from this Google Drive link: https://drive.google.com/drive/folders/16BK8_d1V-A3M1WQ0neaeCwqrHPfzH7QS?usp=sharing . The link was provided by Wei Shen, who received it from Mr. Afiq, the founder of Cili Lado.

Not all datasets in the Google Drive folder were used; we selected only those relevant to our analysis, namely the Product Overview files from the Product folder.

We downloaded all the datasets to our local drive as a zip file.

![DownloadAll](CiliLadoData/DownloadAll.png)

All the datasets are contained in this zip file.

![ZippedFile](CiliLadoData/ZippedFile.png)
![DownloadedFile](CiliLadoData/DownloadedFile.png)

The zip file includes the Product Overview datasets from May 2023 to September 2023. Each of them has 22 columns but a different number of rows: the May, June, July, August and September files have 32, 31, 32, 32 and 31 rows respectively, including the header row. The column names are:

1. Date
2. Product Visitors (Visit)
3. Product Page Views
4. Items Visited
5. Product Bounce Visitors
6. Product Bounce Rate
7. Search Clicks
8. Likes
9. Product Visitors (Add to Cart)
10. Units (Add to Cart)
11. Conversion Rate (Add to Cart)
12. Buyers (Placed Order)
13. Units (Placed Order)
14. Items Placed
15. Sales (Placed Order)(MYR)
16. Conversion Rate (Placed Order)
17. Buyers (Confirmed Order)
18. Units (Confirmed Order)
19. Items Confirmed
20. Sales (Confirmed Order)(MYR)
21. Conversion Rate (Confirmed Order)
22. Conversion Rate (Placed to Confirmed)

We first combine all 5 datasets by copying and pasting them into a new Excel file called MergedFile.xlsx.

![CopiedFile](CiliLadoData/CopiedFile.png)

![PasteFile](CiliLadoData/PasteFile.png)


![MayJune](CiliLadoData/MayJune.png)

To check that the datasets were merged correctly, we add up the number of data rows in each file plus one header row, which gives 31 + 30 + 31 + 31 + 30 + 1 = 154. This matches our MergedFile, which has 154 rows, so the merged file contains all the needed data.
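The same row-count check can be reproduced programmatically. The following is a minimal sketch that assumes each monthly file holds one data row per calendar day, which matches the counts quoted above; it uses only Python's standard `calendar` module and does not read the actual files. The names `MONTHS`, `data_rows` and `expected_total` are illustrative.

```python
import calendar

# Assumption: one data row per calendar day, May to September 2023.
MONTHS = [5, 6, 7, 8, 9]

# Number of days (data rows) expected in each monthly file.
data_rows = {calendar.month_name[m]: calendar.monthrange(2023, m)[1] for m in MONTHS}

# The merged sheet keeps a single header row on top of all the data rows.
expected_total = sum(data_rows.values()) + 1

print(data_rows)
print("Expected rows in MergedFile.xlsx:", expected_total)
```

If the merged sheet reports a different row count, a month was probably pasted twice or a header row was copied along with the data.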

However, to fulfil our objective, we also required additional data from the Dashboard dataset for the year 2023.

![Dashboard2023](CiliLadoData/Dashboard2023.png)

The following columns were extracted from the Dashboard dataset:

1. Numbers of buyers
2. Numbers of new buyers
3. Numbers of existing buyers

These columns were then added to the merged file.

![AddedColumn](CiliLadoData/AddedColumn.png)

While going through the dataset, we found that the repeat purchase rate figures were inaccurate. We therefore derived two new columns, the percentage of new buyers and the percentage of repeat buyers, from the data already in the dataset using Excel functions.

![NewCalculation](CiliLadoData/NewCalculation.png)

![RepeatCalculation](CiliLadoData/RepeatCalculation.png)

We use the IF function in our calculation because dividing by a number of buyers equal to zero would cause a division-by-zero error. If the number of buyers equals zero, the output is set to zero; otherwise, the division proceeds. We also convert the result to a percentage using this function.

![Percentage](CiliLadoData/Percentage.png)
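For readers who prefer to do the same calculation in pandas, the guarded division can be sketched as follows. This is an illustration only: it assumes each percentage is the relevant buyer count divided by `Numbers of buyers`, the helper `safe_percentage` is a hypothetical name, and the values are toy numbers rather than Cili Lado data.

```python
import pandas as pd

# Toy values for illustration only; these are not Cili Lado figures.
buyers = pd.DataFrame({
    "Numbers of buyers": [10, 0, 4],
    "Numbers of new buyers": [7, 0, 1],
    "Numbers of existing buyers": [3, 0, 3],
})

def safe_percentage(part: pd.Series, total: pd.Series) -> pd.Series:
    """Return part / total as a percentage, or 0 where total is 0."""
    nonzero_total = total.where(total != 0)  # zero totals become NaN
    return part.mul(100).div(nonzero_total).fillna(0)

buyers["Percentage of new buyers"] = safe_percentage(
    buyers["Numbers of new buyers"], buyers["Numbers of buyers"])
buyers["Percentage of repeat buyers"] = safe_percentage(
    buyers["Numbers of existing buyers"], buyers["Numbers of buyers"])

print(buyers)
```

Masking the zero totals before dividing plays the same role as the Excel IF: rows with no buyers end up with 0 instead of an error or an infinite value.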

We decided to complete all these steps in Excel instead of Python because it was faster. In addition, we wanted to make permanent changes to the dataset file rather than temporary in-memory changes.

At this point, the dataset has 27 columns and 154 rows.

Initially, almost all the data in the Excel file was stored as text rather than as numbers.

![ConvertData](CiliLadoData/ConvertData.png)

Therefore, to prevent potential errors, we converted all the data in the dataset into numerical values by selecting the "Convert to Number" option in Excel. Non-numeric cells can be identified by the green mark in the upper-left corner of the cell.

![Number](CiliLadoData/Number.png)

If all the cells are white, the data has been successfully converted into numerical values, and we can proceed to the data preprocessing steps in Python.

We also converted the Date column to a date format so that Python recognises it as a date type rather than as a number, which prevents unintended calculations on it.

![Date](CiliLadoData/Date.png)
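The same two conversions can also be done after loading the data, should the Excel step be skipped. The sketch below is not part of our pipeline; it uses a toy frame with text-typed cells and the real pandas functions `pd.to_numeric` and `pd.to_datetime`. The date format string is an assumption about how the dates are written.

```python
import pandas as pd

# Toy rows mimicking cells that Excel stored as text; not real data.
raw = pd.DataFrame({
    "Date": ["2023-05-01", "2023-05-02", "2023-05-03"],
    "Product Page Views": ["120", "95", "143"],
})

# Convert text to numbers; unparseable cells would become NaN instead of raising.
raw["Product Page Views"] = pd.to_numeric(raw["Product Page Views"], errors="coerce")

# Parse the date strings so pandas treats the column as datetimes.
raw["Date"] = pd.to_datetime(raw["Date"], format="%Y-%m-%d")

print(raw.dtypes)
```

Checking `dtypes` afterwards is the pandas counterpart of looking for green-marked cells in Excel.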

Now the data can be used for the next few steps.


# Step 2: Import the libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from scipy.stats import pearsonr
from scipy.stats import shapiro
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from keras.models import Sequential
from keras.layers import LSTM, Dense
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Step 3: Import the dataset

We import the dataset from a local directory. We created a folder called CiliLadoData and stored all the datasets and images we use in that folder.

In [None]:
df = pd.read_excel('MergedFile.xlsx')

# Use .info() to show the info of the Excel file
print(df.info())

In [None]:
# Use print df to show the dataset
print(df)

# Step 4: Feature Selection

The objective of our assignment is to increase the sales of Cili Lado. To gain clearer insights from our analysis, we break it down into two objectives, a and b. This step-by-step approach lets us examine specific aspects of the dataset in turn and gives a clearer understanding of the factors influencing Cili Lado's sales.

In part **A**, we **analyse the customer behaviour** by comparing the number of customers visiting Cili Lado (human flow) with the sales, using Multiple Linear Regression to see whether the two are directly proportional.

In part **B**, our goal is to study the **relationship between the number of returning customers and sales** using Simple Linear Regression and Multiple Linear Regression. Finally, we combine all factors using a neural network and Multiple Linear Regression.

# Step 5: Clean the data by identifying and handling missing values, redundancy and outliers
After careful consideration, we opted for **imputation** as our preferred method of handling outliers instead of removing rows. We made this decision after experimenting with row removal, which eliminated 50 rows from a dataset that is already relatively small. Given the trade-off between a more accurate but smaller dataset and a larger dataset with imputed outliers, we prioritised preserving more data for analysis. This choice aims to balance data accuracy and quantity, maximising the available information while mitigating the impact of outliers on our analysis.

Additionally, we impute outliers separately for each column. By treating each column's outliers individually, we avoid distorting valid data in the other columns, which gives a more targeted and precise way of maintaining data integrity.
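To make the idea concrete, here is a minimal sketch of per-column outlier imputation on a single toy column. It assumes the common 1.5 x IQR fence rule and replacement with the median; the numbers are invented for illustration and are not taken from the dataset.

```python
import numpy as np

# Toy column with one obvious outlier; not real data.
values = np.array([10, 12, 11, 13, 12, 100])

# Quartiles and the interquartile range of this column only.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Replace anything outside the fences with the column median.
median = np.median(values)
imputed = np.where((values < lower) | (values > upper), median, values)

print("Fences:", lower, upper)
print("Imputed:", imputed)
print("Rows kept:", len(imputed), "of", len(values))
```

Because the outlier is overwritten rather than dropped, the row count is unchanged, which is the property we wanted to keep for a small dataset.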

# Step 6: Encode the categorical data
This step transforms categorical data into numerical data, since most machine learning models only understand numerical values. However, while preparing the data in Excel beforehand, we found that all of the data is numerical except the Date column.

# Step 7: Feature Scaling
Feature scaling normalises or standardises the range of the features in the dataset so that the machine learning model can work properly, and it also helps to improve the model's training speed and performance.

We chose MinMaxScaler over other scaling methods because it maps each feature to the range 0 to 1, giving non-negative values. This is in contrast to scalers whose output can include negative values, such as MaxAbsScaler (range -1 to 1) or StandardScaler (centred on zero). The non-negative range matches our preference, making MinMaxScaler the suitable choice.
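As a small worked example of the difference, the sketch below applies the min-max formula and, as one alternative for comparison, zero-mean standardisation to the same toy column using plain NumPy. The values are invented; the formulas are the standard ones, (x - min) / (max - min) and (x - mean) / std.

```python
import numpy as np

# Toy feature column; not real data.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max scaling: every value lands between 0 and 1.
minmax = (x - x.min()) / (x.max() - x.min())

# Standardisation: zero mean and unit variance, so some values are negative.
standardised = (x - x.mean()) / x.std()

print("Min-max:     ", minmax)
print("Standardised:", standardised)
```

The min-max result stays non-negative, while the standardised result is symmetric around zero, which is the behaviour we wanted to avoid.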

# Step 8: Splitting dataset into training and testing sets, train the model and measure the accuracy

We split the dataset into training and testing sets. Following the Pareto principle (the 80/20 rule), 80% of the data is used for training and the remaining 20% is used for testing.
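To show what an 80/20 split means for a dataset of this size, the sketch below runs scikit-learn's `train_test_split` on placeholder arrays. It assumes 153 data rows (the 154 spreadsheet rows minus the header); the arrays `X` and `y` are stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumption: 153 data rows (154 spreadsheet rows minus the header).
X = np.arange(153).reshape(-1, 1)  # placeholder feature
y = np.arange(153)                 # placeholder target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print("Training rows:", len(X_train))
print("Testing rows:", len(X_test))
```

With so few test rows, each metric reported later rests on roughly thirty observations, so small differences between models should be read with caution.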

# Step 4a: Feature Selection

We analyse customer behaviour by comparing the number of customers visiting Cili Lado (human flow) and the number of sales.

In [None]:
# DataFrame for objective 1
df1 = df.copy()

# Remove unwanted data columns that are irrelevant to the analysis
drop_columns = ['Product Bounce Visitors', 'Product Bounce Rate','Likes', 'Product Visitors (Add to Cart)',
       'Units (Add to Cart)', 'Conversion Rate (Add to Cart)','Buyers (Placed Order)', 'Units (Placed Order)', 'Items Placed',
       'Sales (Placed Order) (MYR)', 'Conversion Rate (Placed Order)','Buyers (Confirmed Order)', 'Units (Confirmed Order)',
       'Items Confirmed','Conversion Rate (Confirmed Order)','Conversion Rate (Placed to Confirmed)', 'Numbers of buyers',
       'Numbers of new buyers', 'Numbers of existing buyers','Percentage of new buyers', 'Percentage of repeat buyers']

df1.drop(columns=drop_columns, inplace=True)

display(df1)

# Step 5a: Clean the data by identifying and handling missing values, redundancy and outliers

In [None]:
# Determine the missing value of each column by using .isna(), use .sum() to sum all the missing value
print("Find missing value of each column using isna()")
print (df1.isna().sum())

Based on the above output, there are **no missing values** in the dataset, so we do not need to use .dropna() to drop any rows.

In [None]:
# Determine any redundancy in the dataset
# Use .duplicated() to check for duplicate rows; transpose first to check for duplicate columns
duplicate_rows = df1.duplicated().sum()
duplicate_columns = df1.T.duplicated().sum()

print("Number of duplicate rows:", duplicate_rows)
print("Number of duplicate columns:", duplicate_columns)

Based on the above result, there is **no duplicate data** in this dataset, so there is no redundancy to remove.

In [None]:
# To check for outliers in the data

# Work on a copy; the 'Date' column stays in the copy and is excluded only when plotting
outliersdf = df1.copy()

# Create a boxplot to visualize the outliers
plt.figure(figsize=(15, 8))
sns.boxplot(data=outliersdf.select_dtypes(include=np.number))
plt.title("Boxplot of Data for Objective 1")
plt.xticks(rotation=90)
plt.show()

The figure above is the boxplot used to identify outliers in the dataset. The circles that lie beyond the whiskers (the horizontal end lines) are outliers.


In [None]:
# 'outliersdf' is the copy of df1 created in the previous cell

# Calculate the first quartile (Q1), third quartile (Q3) and interquartile range (IQR)
Q1 = outliersdf.quantile(0.25, numeric_only=True)
Q3 = outliersdf.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1

# Function to replace outliers with the median (or mean)
def impute_outlier_with_median(outliersdf, q1, q3, iqr):
    for col in outliersdf.select_dtypes(include=np.number).columns:
        lower_bound = q1[col] - 1.5 * iqr[col]
        upper_bound = q3[col] + 1.5 * iqr[col]
        median_value = outliersdf[col].median()

        # Replace outliers with median (you can also use mean or other metrics)
        outliersdf[col] = np.where((outliersdf[col] < lower_bound) | (outliersdf[col] > upper_bound), median_value, outliersdf[col])
    return outliersdf

# Impute outliers in the DataFrame
df_imputed = impute_outlier_with_median(outliersdf.copy(), Q1, Q3, IQR)

df1 = df_imputed.copy()

# Create a boxplot to visualize the DataFrame with imputed outliers
plt.figure(figsize=(15, 8))
sns.boxplot(data=df_imputed)
plt.title("Boxplot of Data for Objective 1 with Imputed Outliers")
plt.xticks(rotation=90)
plt.show()

The figure above is the boxplot of the dataset after outlier detection and imputation. The number of circles beyond the whiskers is reduced. The boxes also look taller, but only because the y-axis scale is smaller now that the extreme values are gone; the boxes themselves cover roughly the same range of values as before.

In [None]:
# Now you can display df1 with cleaned data
display(df1)

Based on the above output, we can observe that the number of rows remains the same since we choose to impute the outliers instead of removing the outliers.

# Step 6a: Encode the categorical data

In [None]:
# We need to determine the categorical data inside the dataset first
# However, by observing the dataset it does not have any categorical data but we can double check it by using .dtypes
print(df1.dtypes)

Based on the above output, all columns other than Date are in numerical format, so we do not need to do any encoding.

# Step 7a: Feature Scaling

In [None]:
# Extract the date column
date_column = df1.iloc[:, 0]

# Min-Max scale all columns except the date column
minmax_data = MinMaxScaler().fit_transform(df1.iloc[:, 1:])

# Combine the Min-Max scaled data with the date column
minmax_frame = pd.DataFrame(data=minmax_data, columns=df1.columns[1:])
minmax_frame.insert(0, df1.columns[0], date_column)

# Print the datasets after feature scaling
print(minmax_frame)

df1 = minmax_frame.copy()

# Step 8a: Splitting dataset into training and testing sets, train the model and measure the accuracy
The Multiple Linear Regression model below is used to analyze the relationship between the Product Visitors, Product Page Views, Items Visited, Search Clicks and the Sales in confirmed order. Scatter plots are then plotted to visualize the results. The performance metrics such as MAE, MSE, R2 and RMSE are used to assess the performance and effectiveness of the model.

In [None]:
# Multiple Linear Regression to determine the relationship between the Product Visitors, Product Page Views, Items Visited, Search Clicks and the Sales in confirmed order

# Select independent variable (x) and dependent variable (y)
x_a = df1[['Product Visitors (Visit)','Product Page Views','Items Visited','Search Clicks']]
y_a = df1['Sales (Confirmed Order) (MYR)'] 

# Split the data into training and testing sets
x_train_a, x_test_a, y_train_a, y_test_a = train_test_split(x_a,y_a,test_size=0.2, random_state=42)

# Create a linear regression model
model_a = LinearRegression()

# Train the model
model_a.fit(x_train_a, y_train_a)

# Predict the test set result using the trained model
y_pred_a = model_a.predict(x_test_a)

# Plot the graph for the predicted vs actual values
plt.scatter(y_test_a, y_pred_a)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs. Actual Values (r = {0:0.2f})'.format(pearsonr(y_test_a, y_pred_a)[0]))
plt.show()

# Calculate Mean Absolute Error (MAE)
mae_a = mean_absolute_error(y_test_a, y_pred_a)
print(f"Mean Absolute Error (MAE): {mae_a}")

# Calculate Mean Squared Error (MSE)
mse_a = mean_squared_error(y_test_a, y_pred_a)
print(f"Mean Squared Error (MSE): {mse_a}")

# Calculate R-squared (R2)
r2_a = r2_score(y_test_a, y_pred_a)
print(f"R-squared (R2): {r2_a}")

# Calculate Root Mean Squared Error (RMSE)
rmse_a = np.sqrt(mean_squared_error(y_test_a, y_pred_a))
print(f"Root Mean Squared Error (RMSE): {rmse_a}")

In Multiple Linear Regression, the Pearson correlation coefficient is commonly computed between the predicted values and the actual values, rather than between each independent variable and the dependent variable individually as in Simple Linear Regression. This quantity is usually called the multiple correlation coefficient.

The result of r = 0.64 indicates a moderately strong positive linear relationship between the predicted values and the actual values.

The R-squared value of 0.3617 reveals that approximately 36.17% of the total variability in sales can be explained by the four customer-flow features (Product Visitors, Product Page Views, Items Visited and Search Clicks). The remaining 63.83% of the variability is due to factors not included in the model.

The Mean Absolute Error (MAE: 0.1646), Mean Squared Error (MSE: 0.0597), and Root Mean Squared Error (RMSE: 0.2443) are measured on the Min-Max scaled sales values (0 to 1). They are relatively low, suggesting reasonable predictive accuracy, although an RMSE of 0.2443 still corresponds to about a quarter of the scaled range.
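For readers unfamiliar with these metrics, the sketch below computes them from their definitions in plain Python on toy values, then cross-checks that the reported RMSE is the square root of the reported MSE. The function name `regression_metrics` and the toy lists are illustrative; only the 0.0597 figure comes from the results above.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R-squared for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total variation
    r2 = 1 - ss_res / ss_tot
    return mae, mse, rmse, r2

# Toy scaled values for illustration only; not the model's real predictions.
y_true = [0.2, 0.4, 0.6, 0.8]
y_pred = [0.3, 0.4, 0.5, 0.9]
mae, mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE={mae:.4f}  MSE={mse:.4f}  RMSE={rmse:.4f}  R2={r2:.4f}")

# Cross-check: RMSE should equal the square root of the reported MSE.
reported_rmse = round(math.sqrt(0.0597), 4)
print("sqrt(0.0597) =", reported_rmse)
```

Because MAE and RMSE are in the same units as the scaled target, they are easiest to interpret as a fraction of the 0 to 1 range.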

The insights from this graph indicate a positive correlation between sales and customer-flow measures such as the number of items visited and product page views on Shopee. Improving the online shopping experience so that product page views and visits increase is therefore associated with higher sales. To capitalize on this relationship, Cili Lado should consider investing in strategies that make their Shopee storefront more visually appealing and engaging for potential customers.

# Step 4b: Feature Selection
We will analyse the relationship between the percentage of new and repeat customers and the sales in confirmed order.

In [None]:
# DataFrame for objective 2
df2 = df.copy()

# Remove unwanted data columns that are irrelevant to the analysis
drop_columns = ['Product Visitors (Visit)', 'Product Page Views','Items Visited', 'Product Bounce Visitors', 'Product Bounce Rate',
       'Search Clicks', 'Likes', 'Product Visitors (Add to Cart)','Units (Add to Cart)', 'Conversion Rate (Add to Cart)',
       'Buyers (Placed Order)', 'Units (Placed Order)', 'Items Placed','Sales (Placed Order) (MYR)', 'Conversion Rate (Placed Order)',
       'Buyers (Confirmed Order)', 'Units (Confirmed Order)','Items Confirmed','Conversion Rate (Confirmed Order)','Conversion Rate (Placed to Confirmed)',
       'Numbers of buyers', 'Numbers of new buyers','Numbers of existing buyers']

df2.drop(columns=drop_columns, inplace=True)

display(df2)

# Step 5b: Clean the data by identifying and handling missing values, redundancy and outliers

In [None]:
# Determine the missing value of each column by using .isna(), use .sum() to sum all the missing value
print("Find missing value of each column using isna()")
print (df2.isna().sum())

Based on the above output, there are **no missing values** in the dataset, so we do not need to use .dropna() to drop any rows.

In [None]:
# Determine any redundancy in the dataset
# Use .duplicated() to check for duplicate rows; transpose first to check for duplicate columns
duplicate_rows = df2.duplicated().sum()
duplicate_columns = df2.T.duplicated().sum()

print("Number of duplicate rows:", duplicate_rows)
print("Number of duplicate columns:", duplicate_columns)

Based on the above result, there is **no duplicate data** in this dataset, so there is no redundancy to remove.

In [None]:
# To check for outliers in the data

# Work on a copy; the 'Date' column stays in the copy and is excluded only when plotting
outliersdf2 = df2.copy()

# Create a boxplot to visualize the outliers
plt.figure(figsize=(15, 8))
sns.boxplot(data=outliersdf2.select_dtypes(include=np.number))
plt.title("Boxplot of Data for Objective 2")
plt.xticks(rotation=90)
plt.show()

The figure above is the boxplot used to identify outliers in the dataset. The circles that lie beyond the whiskers (the horizontal end lines) are outliers. The boxes for Percentage of new buyers and Percentage of repeat buyers are not visible because their values are very small compared to Sales.

In [None]:
# 'outliersdf2' is the copy of df2 created in the previous cell

# Calculate the first quartile (Q1), third quartile (Q3) and interquartile range (IQR)
Q1 = outliersdf2.quantile(0.25, numeric_only=True)
Q3 = outliersdf2.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1

# Function to replace outliers with the median (or mean)
def impute_outlier_with_median(outliersdf2, q1, q3, iqr):
    for col in outliersdf2.select_dtypes(include=np.number).columns:
        lower_bound = q1[col] - 1.5 * iqr[col]
        upper_bound = q3[col] + 1.5 * iqr[col]
        median_value = outliersdf2[col].median()

        # Replace outliers with median (you can also use mean or other metrics)
        outliersdf2[col] = np.where((outliersdf2[col] < lower_bound) | (outliersdf2[col] > upper_bound), median_value, outliersdf2[col])
    return outliersdf2

# Impute outliers in the DataFrame
df_imputed2 = impute_outlier_with_median(outliersdf2.copy(), Q1, Q3, IQR)

df2 = df_imputed2.copy()

# Create a boxplot to visualize the DataFrame with imputed outliers
plt.figure(figsize=(15, 8))
sns.boxplot(data=df_imputed2)
plt.title("Boxplot of Data for Objective 2 with Imputed Outliers")
plt.xticks(rotation=90)
plt.show()

The figure above is the boxplot of the dataset after outlier detection and imputation. The number of circles beyond the whiskers is reduced. The maximum of every column has also decreased, so the axis scale is smaller. Visually, the third quartile of Sales moves from above 400 to below 400.

In [None]:
# Now you can display df2 with cleaned data
display(df2)

Based on the above output, we can observe that the number of rows remains the same since we choose to impute the outliers instead of removing the outliers.

# Step 6b: Encode the categorical data

In [None]:
# We need to determine the categorical data inside the dataset first
# However, by observing the dataset it does not have any categorical data but we can double check it by using .dtypes
print(df2.dtypes)

Based on the above output, all columns other than Date are in numerical format, so we do not need to do any encoding.

# Step 7b: Feature Scaling

In [None]:
# Extract the date column
date_column = df2.iloc[:, 0]

# Min-Max scale all columns except the date column
minmax_data = MinMaxScaler().fit_transform(df2.iloc[:, 1:])

# Combine the Min-Max scaled data with the date column
minmax_frame = pd.DataFrame(data=minmax_data, columns=df2.columns[1:])
minmax_frame.insert(0, df2.columns[0], date_column)

# Print the datasets after feature scaling
print(minmax_frame)

df2 = minmax_frame.copy()

# Step 8b: Splitting dataset into training and testing sets, train the model and measure the accuracy
The Simple Linear Regression model below is used to analyze the relationship between the percentage of new buyers and sales in confirmed order. Scatter plots and regression lines are then plotted to visualize the results. Pearson's correlation coefficient is used to measure the strength and direction of a linear relationship between two variables. The performance metrics such as MAE, MSE, R2 and RMSE are used to assess the performance and effectiveness of the model.
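As a reminder of what Pearson's correlation coefficient measures, the sketch below computes it from its definition (covariance divided by the product of the standard deviations) in plain Python. The helper `pearson_r` and the toy lists are illustrative and unrelated to the Cili Lado data; the notebook itself uses `scipy.stats.pearsonr`.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy values for illustration only.
x = [1, 2, 3, 4, 5]
y_noisy = [2, 4, 5, 4, 5]    # loosely increasing with x
y_line = [3, 5, 7, 9, 11]    # exactly 2 * x + 1

r_noisy = pearson_r(x, y_noisy)
r_line = pearson_r(x, y_line)
print(f"Noisy relationship: r = {r_noisy:.4f}")
print(f"Perfect line:       r = {r_line:.4f}")
```

Values of r near 1 or -1 mean the points lie close to a straight line, while values near 0 mean little linear relationship.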

In [None]:
# Simple Linear Regression to determine the relationship between the percentage of new buyers and the sales in confirmed order

# Select independent variable (x) and dependent variable (y)
x_b1 = df2['Percentage of new buyers']
y_b1 = df2['Sales (Confirmed Order) (MYR)']

# Split the data into training and testing sets
x_train_b1, x_test_b1, y_train_b1, y_test_b1 = train_test_split(x_b1.values.reshape(-1, 1), y_b1, test_size=0.20, random_state=42)

# Create a linear regression model
model_b1 = LinearRegression()

# Train the model
model_b1.fit(x_train_b1, y_train_b1)

# Predict the test set result using the trained model
y_pred_b1 = model_b1.predict(x_test_b1)

# Extract the intercept of the fitted line
intercept_b1 = model_b1.intercept_

# Extract the coefficient (slope) of the fitted line
coefficient_b1 = model_b1.coef_

print("Intercept Value: ", intercept_b1)
print("Coefficient: ", coefficient_b1)

# Plot the actual data and regression line
plt.scatter(x_test_b1, y_test_b1, color='black', label='Actual Data')
plt.plot(x_test_b1, y_pred_b1, color='blue', linewidth=3, label='Regression Line')
plt.xlabel('Percentage of new buyers')
plt.ylabel('Sales (Confirmed Order) (MYR)')
plt.title('Relationship between Percentage of new buyers and Sales (Confirmed Order) (MYR)')
plt.legend()
plt.show()

# Calculate Pearson correlation coefficient (r) between variables
correlation_coefficient, _ = pearsonr(x_b1, y_b1)
print(f"Pearson Correlation Coefficient (r): {correlation_coefficient}")

# Calculate Mean Absolute Error (MAE)
mae_b1 = mean_absolute_error(y_test_b1, y_pred_b1)
print(f"Mean Absolute Error (MAE): {mae_b1}")

# Calculate Mean Squared Error (MSE)
mse_b1 = mean_squared_error(y_test_b1, y_pred_b1)
print(f"Mean Squared Error (MSE): {mse_b1}")

# Calculate R-squared (R2)
r2_b1 = r2_score(y_test_b1, y_pred_b1)
print(f"R-squared (R2): {r2_b1}")

# Calculate Root Mean Squared Error (RMSE)
rmse_b1 = np.sqrt(mean_squared_error(y_test_b1, y_pred_b1))
print(f"Root Mean Squared Error (RMSE): {rmse_b1}")

The Pearson correlation coefficient of 0.5628 shows that there is a moderate positive linear relationship between the percentage of new buyers and sales in confirmed orders in MYR. The positive value indicates that as the percentage of new buyers increases, sales in confirmed orders tend to increase, though the relationship is not particularly strong.

The R-squared value of 0.3409 reveals that approximately 34.1% of the total variability in sales can be explained by the variation in the percentage of new buyers. This implies that while the presence of new buyers is a significant factor, 65.9% of the variability is influenced by other factors not considered in our model.

The Mean Absolute Error (MAE: 0.1557), Mean Squared Error (MSE: 0.0617), and Root Mean Squared Error (RMSE: 0.2483), measured on the Min-Max scaled sales values (0 to 1), are all relatively low, suggesting reasonable predictive accuracy.

The analysis shows a moderate relationship between sales and the percentage of new buyers. We suggest that Cili Lado develop targeted marketing campaigns aimed at attracting new customers, such as social media advertising, collaborations with influencers, or first-time buyer discounts. Cili Lado could also use the customer flow data to identify the peak times and channels that attract the most visitors, and run its most aggressive campaigns during those periods, for example during 11.11 on Shopee.

The Simple Linear Regression model below is used to analyze the relationship between the percentage of repeat buyers and sales in confirmed order. Scatter plots and regression lines are then plotted to visualize the results. Pearson's correlation coefficient is used to measure the strength and direction of a linear relationship between two variables. The performance metrics such as MAE, MSE, R2 and RMSE are used to assess the performance and effectiveness of the model.

In [None]:
# Simple Linear Regression to determine the relationship between the percentage of repeat buyers and the sales in confirmed order

# Select independent variable (x) and dependent variable (y)
x_b2 = df2['Percentage of repeat buyers']
y_b2 = df2['Sales (Confirmed Order) (MYR)']

# Split the data into training and testing sets
x_train_b2, x_test_b2, y_train_b2, y_test_b2 = train_test_split(x_b2.values.reshape(-1,1), y_b2, test_size=0.2, random_state=42)

# Create a linear regression model
model_b2 = LinearRegression()

# Train the model
model_b2.fit(x_train_b2, y_train_b2)

# Predict the test set result using the trained model
y_pred_b2 = model_b2.predict(x_test_b2)

# Extract the intercept of the fitted line
intercept_b2 = model_b2.intercept_

# Extract the coefficient (slope) of the fitted line
coefficient_b2 = model_b2.coef_

print("Intercept Value: ", intercept_b2)
print("Coefficient: ", coefficient_b2)

# Plot the actual data and regression line
plt.scatter(x_test_b2, y_test_b2, color='black', label='Actual Data')
plt.plot(x_test_b2, y_pred_b2, color='blue', linewidth=3, label='Regression Line')
plt.title('Relationship between Percentage of repeat buyers and Sales (Confirmed Order) (MYR)')
plt.xlabel('Percentage of repeat buyers')
plt.ylabel('Sales (Confirmed Order) (MYR)')
plt.legend()
plt.show()

# Calculate Pearson correlation coefficient (r) between variables
correlation_coefficient, _ = pearsonr(x_b2, y_b2)
print(f"Pearson Correlation Coefficient (r): {correlation_coefficient}")

# Calculate Mean Absolute Error (MAE)
mae_b2 = mean_absolute_error(y_test_b2, y_pred_b2)
print(f"Mean Absolute Error (MAE): {mae_b2}")

# Calculate Mean Squared Error (MSE)
mse_b2 = mean_squared_error(y_test_b2, y_pred_b2)
print(f"Mean Squared Error (MSE): {mse_b2}")

# Calculate R-squared (R2)
r2_b2 = r2_score(y_test_b2, y_pred_b2)
print(f"R-squared (R2): {r2_b2}")

# Calculate Root Mean Squared Error (RMSE)
rmse_b2 = np.sqrt(mean_squared_error(y_test_b2, y_pred_b2))
print(f"Root Mean Squared Error (RMSE): {rmse_b2}")


The Pearson correlation coefficient of 0.2957 shows that there is a weak positive linear relationship between the percentage of repeat buyers and sales in confirmed orders in MYR. The positive value indicates that as the percentage of repeat buyers increases, sales in confirmed orders tend to increase, though the relationship is weak.

The R-squared value of 0.07 reveals that approximately 7% of the total variability in sales can be explained by the variation in the percentage of repeat buyers. This implies that the presence of repeat buyers is merely a small factor; 93% of the variability is influenced by other factors not considered in our model.

The Mean Absolute Error (MAE: 0.2251), Mean Squared Error (MSE: 0.0862), and Root Mean Squared Error (RMSE: 0.2935), measured on the Min-Max scaled sales values (0 to 1), are all relatively low, suggesting reasonable predictive accuracy.

The weaker relationship between repeat buyers and sales suggests that Cili Lado also needs to improve customer retention. Implementing loyalty programs, such as rewards for frequent purchases, personalized discounts, or exclusive offers for returning customers, could raise repeat purchase rates. Flavour is also likely the most important factor for returning customers, so improving the flavour of the chili may attract more customers to buy again.

Here, we combine the two Simple Linear Regression models created above into a Multiple Linear Regression model to analyze the relationship between the percentage of new and repeat buyers and sales in confirmed order in MYR. Scatter plots are then plotted to visualize the results. The performance metrics such as MAE, MSE, R2 and RMSE are used to assess the performance and effectiveness of the model.

In [None]:
# Multiple linear regression for Percentage of new buyers, Percentage of repeat buyers and Sales in confirmed order

# Select independent variable (x) and dependent variable (y)
x_mlr = df2[['Percentage of new buyers', 'Percentage of repeat buyers']]
y_mlr = df2['Sales (Confirmed Order) (MYR)']

# Split dataset into training and testing sets
x_train_mlr, x_test_mlr, y_train_mlr, y_test_mlr = train_test_split(x_mlr, y_mlr, test_size=0.2, random_state=42)

# Create a linear regression model
model_mlr = LinearRegression()

# Train the model
model_mlr.fit(x_train_mlr, y_train_mlr)

# Predict the test set result using the trained model
y_pred_mlr = model_mlr.predict(x_test_mlr)

# Extract Intercept Value
intercept_mlr = model_mlr.intercept_

# Extract Coefficient Values
coefficients_mlr = model_mlr.coef_

print("Intercept Value: ", intercept_mlr)
print("Coefficients: ", coefficients_mlr)

# Visualize scatter plot for predictions vs actual values
plt.scatter(y_test_mlr, y_pred_mlr, color='blue', label='Test samples')
plt.xlabel('Actual Sales (Confirmed Order) (MYR)')
plt.ylabel('Predicted Sales (Confirmed Order) (MYR)')
plt.title('Predicted vs. Actual Values (r = {0:0.2f})'.format(pearsonr(y_test_mlr, y_pred_mlr)[0]))
plt.legend()
plt.show()

# Calculate Mean Absolute Error (MAE)
mae_mlr = mean_absolute_error(y_test_mlr, y_pred_mlr)
print(f"Mean Absolute Error (MAE): {mae_mlr}")

# Calculate Mean Squared Error (MSE)
mse_mlr = mean_squared_error(y_test_mlr, y_pred_mlr)
print(f"Mean Squared Error (MSE): {mse_mlr}")

# Calculate R-squared (R2)
r2_mlr = r2_score(y_test_mlr, y_pred_mlr)
print(f"R-squared (R2): {r2_mlr}")

# Calculate Root Mean Squared Error (RMSE)
rmse_mlr = np.sqrt(mean_squared_error(y_test_mlr, y_pred_mlr))
print(f"Root Mean Squared Error (RMSE): {rmse_mlr}")

In Multiple Linear Regression, the Pearson correlation coefficient is commonly used to assess the relationship between the predicted values and the actual values, rather than the linear relationship between each independent variable and the dependent variable individually, as is done in Simple Linear Regression. In this role it is usually called the multiple correlation coefficient.

The result of 0.76 shows a strong linear relationship between the overall predictions of the MLR model and the actual outcomes. This indicates that the model is performing well in explaining the variation in the data.

The R-squared value of 0.3563 reveals that approximately 35.6% of the total variability in sales can be explained by the variation in the percentage of new and repeat buyers. However, the presence of other unaccounted factors contributes to the remaining 64.4% of variability.
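
The gap between r = 0.76 (so r² is about 0.58) and an R-squared of about 0.36 deserves a note. On the training data of an ordinary least-squares fit with an intercept, the squared correlation between predictions and actual values equals R-squared. On a held-out test set, R-squared can only be equal to or lower than r². The sketch below illustrates this with plain NumPy on synthetic data; the data, coefficients and variable names are assumptions for illustration and are not taken from the Cili Lado dataset.

```python
import numpy as np

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=1.0, size=n)

X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]


def add_intercept(features):
    """Prepend a column of ones so the fit includes an intercept."""
    return np.column_stack([np.ones(len(features)), features])


# Ordinary least squares fitted on the training rows only
coef, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)


def r_and_r2(features, target):
    """Return (Pearson r between target and prediction, R-squared)."""
    y_hat = add_intercept(features) @ coef
    r = np.corrcoef(target, y_hat)[0, 1]
    ss_res = np.sum((target - y_hat) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return r, 1.0 - ss_res / ss_tot


r_train, r2_train = r_and_r2(X_train, y_train)
r_test, r2_test = r_and_r2(X_test, y_test)

print(f"train: r^2 = {r_train ** 2:.4f}, R2 = {r2_train:.4f}")
print(f"test:  r^2 = {r_test ** 2:.4f}, R2 = {r2_test:.4f}")
```

A test-set R-squared well below r² indicates that the predictions follow the direction of the actual values but are offset or scaled relative to them.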

The Mean Absolute Error (MAE: 0.1487), Mean Squared Error (MSE: 0.0602), and Root Mean Squared Error (RMSE: 0.2454) are all relatively low, suggesting reasonable predictive accuracy.

Here are the performance metrics obtained from the two Simple Linear Regression models:

First Model
1. Mean Absolute Error (MAE): 0.1557287631083225
2. Mean Squared Error (MSE): 0.061653663866352265
3. R-squared (R2): 0.34088722548671835
4. Root Mean Squared Error (RMSE): 0.24830155832445408

Second Model
1. Mean Absolute Error (MAE): 0.22513521776646409
2. Mean Squared Error (MSE): 0.08615952543290184
3. R-squared (R2): 0.07890561083393577
4. Root Mean Squared Error (RMSE): 0.29352942856364816

This Multiple Linear Regression has the lowest MAE, MSE and RMSE and the highest R-squared of the three models. This shows that the Multiple Linear Regression model is a better fit than either Simple Linear Regression model in this case.
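
To make the comparison concrete, the short sketch below puts the reported metrics side by side and computes how much the RMSE improves over the better of the two single-variable models. The numbers are the full-precision values printed by the notebook; the dictionary layout and the names `reported`, `best_by_rmse` and `rmse_reduction` are only illustrative.

```python
# Metrics as reported by the notebook for the three models
reported = {
    "First SLR": {"mae": 0.1557287631083225, "mse": 0.061653663866352265,
                  "r2": 0.34088722548671835, "rmse": 0.24830155832445408},
    "Second SLR": {"mae": 0.22513521776646409, "mse": 0.08615952543290184,
                   "r2": 0.07890561083393577, "rmse": 0.29352942856364816},
    "MLR": {"mae": 0.148739393014352, "mse": 0.06020542214623809,
            "r2": 0.3563697541549179, "rmse": 0.24536793218804712},
}

# Model with the smallest test RMSE
best_by_rmse = min(reported, key=lambda name: reported[name]["rmse"])

# Relative RMSE reduction of the MLR over the better single-variable model
best_slr_rmse = min(reported["First SLR"]["rmse"], reported["Second SLR"]["rmse"])
rmse_reduction = (best_slr_rmse - reported["MLR"]["rmse"]) / best_slr_rmse

print(f"Lowest RMSE: {best_by_rmse}")
print(f"RMSE reduction vs. best SLR: {rmse_reduction:.2%}")
```

The relative reduction is roughly 1%, so "better" here means a modest improvement over the first model rather than a large jump.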


# Step 9: Combine all Multiple Linear Regression as one Multiple Linear Regression
Here, we combine the two Multiple Linear Regression models from Part A and Part B. The combined Multiple Linear Regression model is used to analyze how customer behaviour and the percentages of new and repeat buyers relate to sales (confirmed orders, in MYR). A scatter plot is then drawn to visualize the results. Performance metrics (MAE, MSE, R2 and RMSE) are used to assess the model.
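
One detail to watch when combining the two frames: `pd.concat(..., axis=1)` aligns rows by index label, not by position. If the cleaning steps in Part A and Part B dropped different rows from `df1` and `df2`, the combined frame will contain NaN values, which `LinearRegression` cannot fit. The sketch below shows the effect on two small hypothetical frames (`traffic`, `buyers` and their values are made up for illustration) and one way to keep only the rows present in both.

```python
import pandas as pd

# Hypothetical frames: 'buyers' lost row 2 during cleaning
traffic = pd.DataFrame({"visitors": [10, 12, 9, 15]}, index=[0, 1, 2, 3])
buyers = pd.DataFrame({"pct_new": [0.6, 0.7, 0.5]}, index=[0, 1, 3])

# Default outer join keeps every index label and fills the gaps with NaN
combined = pd.concat([traffic, buyers], axis=1)
n_incomplete = int(combined.isna().any(axis=1).sum())

# Keep only rows that exist in both frames
aligned = pd.concat([traffic, buyers], axis=1, join="inner")

print(combined)
print(f"Rows with missing values: {n_incomplete}")
print(aligned)
```

If rows are dropped this way, the target column must be filtered to the same index before calling `train_test_split`.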

In [None]:
# Concatenate columns from both datasets
x_9 = pd.concat([df1[['Product Visitors (Visit)', 'Product Page Views', 'Items Visited', 'Search Clicks']],
               df2[['Percentage of new buyers', 'Percentage of repeat buyers']]], axis=1)

# Target variable
y_9 = df1['Sales (Confirmed Order) (MYR)']

# Split the data into training and testing sets
x_train_9, x_test_9, y_train_9, y_test_9 = train_test_split(x_9, y_9, test_size=0.2, random_state=42)

# Create a linear regression model
model_9 = LinearRegression()

# Train the model
model_9.fit(x_train_9, y_train_9)

# Predict the test set result using the trained model
y_pred_9 = model_9.predict(x_test_9)

# Visualize scatter plot for predicted vs actual values
plt.scatter(y_test_9, y_pred_9)
plt.xlabel('Actual Sales (Confirmed Order) (MYR)')
plt.ylabel('Predicted Sales (Confirmed Order) (MYR)')
plt.title('Predicted vs. Actual Values (r = {0:0.2f})'.format(pearsonr(y_test_9, y_pred_9)[0]))
plt.show()

# Calculate Mean Absolute Error (MAE)
mae_9 = mean_absolute_error(y_test_9, y_pred_9)
print(f"Mean Absolute Error (MAE): {mae_9}")

# Calculate Mean Squared Error (MSE)
mse_9 = mean_squared_error(y_test_9, y_pred_9)
print(f"Mean Squared Error (MSE): {mse_9}")

# Calculate R-squared (R2)
r2_9 = r2_score(y_test_9, y_pred_9)
print(f"R-squared (R2): {r2_9}")

# Calculate Root Mean Squared Error (RMSE)
rmse_9 = np.sqrt(mean_squared_error(y_test_9, y_pred_9))
print(f"Root Mean Squared Error (RMSE): {rmse_9}")

The result of 0.70 shows a strong linear relationship between the overall predictions of the MLR model and the actual outcomes, indicating that the model explains a good share of the variation in the data. However, this value is lower than the 0.76 obtained before the customer behaviour features were incorporated. This suggests that repeat customers play an important part in overall sales.

The R-squared value of 0.4183 reveals that approximately 41.83% of the total variability in sales can be explained by the variation in the customer behaviour. However, the presence of other unaccounted factors contributes to the remaining 58.17% of variability.

The Mean Absolute Error (MAE: 0.1430), Mean Squared Error (MSE: 0.0544), and Root Mean Squared Error (RMSE: 0.2333) are all relatively low, suggesting reasonable predictive accuracy.
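
Whether these errors are "low" depends on the scale of the target, so it helps to see how the four metrics are defined and what baseline they imply. The sketch below computes them by hand with NumPy on a small toy example (the toy values and the helper name `regression_metrics` are assumptions for illustration), then uses the relationship R2 = 1 - MSE / Var(y) with the rounded figures reported above to recover the RMSE of a model that always predicts the test-set mean.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R-squared from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)
    ss_res = np.sum(errors ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return {
        "mae": float(np.mean(np.abs(errors))),
        "mse": float(mse),
        "rmse": float(np.sqrt(mse)),
        "r2": float(1.0 - ss_res / ss_tot),
    }


# Toy example, not project data
toy = regression_metrics([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(toy)

# Rounded values reported above for the combined model
reported_mse, reported_r2 = 0.0544, 0.4183

# R2 = 1 - MSE / Var(y_test)  =>  Var(y_test) = MSE / (1 - R2)
implied_baseline_rmse = float(np.sqrt(reported_mse / (1.0 - reported_r2)))
print(f"Implied RMSE of a mean-only baseline: {implied_baseline_rmse:.4f}")
```

Comparing the model's RMSE of 0.2333 with this implied baseline is a more informative yardstick than judging the error values in isolation.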

Here are the performance metrics obtained from the Multiple Linear Regression on Sales and Percentage of New and Repeat Customers:

1. Mean Absolute Error (MAE): 0.148739393014352
2. Mean Squared Error (MSE): 0.06020542214623809
3. R-squared (R2): 0.3563697541549179
4. Root Mean Squared Error (RMSE): 0.24536793218804712

This Multiple Linear Regression has a lower MAE, MSE and RMSE and a noticeably higher R2 than the Multiple Linear Regression on Sales and Percentage of New and Repeat Customers. This means that after incorporating more factors, such as Product Visitors, Product Page Views, Items Visited and Search Clicks, the model becomes more accurate.

Combining all factors helps us identify which ones have the most significant impact on sales. The marketing budget can then be allocated accordingly, focusing on high-impact strategies that drive both traffic and sales.

These results suggest that the best way to increase sales is to attract more customers, since better marketing brings in more customers and, in turn, more sales.

In [None]:
# Plot the graph for residuals

# Calculate residuals
residuals_9 = y_test_9 - y_pred_9

# Plot residuals against predicted values
plt.scatter(y_pred_9, residuals_9)
plt.title('Predicted vs Residuals (For assessing model accuracy)')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.axhline(y=0, color='r', linestyle='--', linewidth=2)
plt.show()

The graph above shows the relationship between predicted values and residuals, which helps us assess how well the trained model predicts outcomes. Predicted values are what the model thinks the result should be, while residuals are the differences between the actual values and the predictions.

Some of the data points are far from 0.0, which means that the predictions sometimes overestimate or underestimate the actual value. This variation might be influenced by external variables such as occasional promotional activity, changes in customer behaviour, competition dynamics and economic changes, all of which can affect sales unpredictably.

Based on the above residuals plot, we can observe that the residuals are more concentrated in the middle of the plot. 
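
Because the residuals are computed as actual minus predicted, the sign tells which way the model missed: a positive residual is an underestimate and a negative one an overestimate. The sketch below counts both cases on a few made-up values; the arrays and variable names are assumptions for illustration only.

```python
import numpy as np

# Made-up actual and predicted values (scaled sales), for illustration only
actual = np.array([0.20, 0.50, 0.90, 0.40])
predicted = np.array([0.30, 0.45, 0.60, 0.40])

# Same convention as the notebook: residual = actual - predicted
residuals = actual - predicted

underestimated = int(np.sum(residuals > 0))   # model predicted too low
overestimated = int(np.sum(residuals < 0))    # model predicted too high
largest_miss = int(np.argmax(np.abs(residuals)))

print(f"Underestimated: {underestimated}, overestimated: {overestimated}")
print(f"Largest miss at position {largest_miss}: {residuals[largest_miss]:+.2f}")
```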

In [None]:
# Plot a histogram of the residuals; bins is the number of bars
residuals_9 = y_test_9 - y_pred_9
sns.histplot(residuals_9, bins=50, stat='density')
plt.xlabel('Residuals')
plt.ylabel('Density')

# shapiro() returns (test statistic, p-value); index 1 selects the p-value
plt.title('Histogram of Residuals (Shapiro W p-value = {0:0.3f})'.format(shapiro(residuals_9)[1]))
plt.show()

# A p-value below 0.05 rejects the null hypothesis that the residuals are normally distributed

The Shapiro-Wilk test assesses whether sample data fit a normal distribution. The histogram of residuals above has a p-value of 0.000 (to three decimal places), which shows that the residuals deviate significantly from a normal distribution.
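
A p-value shown as 0.000 is not exactly zero; it is a small number rounded by the `{0:0.3f}` format used in the plot title. The sketch below shows the rounding on a hypothetical p-value and one possible way to label it more informatively. The value `0.0004` and the helper `format_p_value` are assumptions for illustration.

```python
def format_p_value(p_value, threshold=0.001):
    """Report very small p-values as '< threshold' instead of rounding to zero."""
    if p_value < threshold:
        return f"< {threshold}"
    return f"{p_value:.3f}"


p_small = 0.0004  # hypothetical p-value

label_fixed = "{0:0.3f}".format(p_small)   # three decimals hide the value
label_sci = "{0:.2e}".format(p_small)      # scientific notation keeps it
label_helper = format_p_value(p_small)

print(label_fixed)
print(label_sci)
print(label_helper)
```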

To make full use of the Date column, a recurrent neural network (RNN) is used. We also tune its hyperparameters, including the number of epochs, the batch size and the number of layers, to find the best-performing RNN.

In [None]:
df1['Year'] = df1['Date'].dt.year
df1['Month'] = df1['Date'].dt.month
df1['Day'] = df1['Date'].dt.day

# Concatenate columns from both datasets, including the extracted date features
x_9rnn = pd.concat([df1[['Year', 'Month', 'Day', 'Product Visitors (Visit)', 'Product Page Views', 'Items Visited', 'Search Clicks']],
               df2[['Percentage of new buyers', 'Percentage of repeat buyers']]], axis=1)

y_9rnn = df1['Sales (Confirmed Order) (MYR)']

# Normalize the features
scaler = MinMaxScaler(feature_range=(0, 1))
x_scaled = scaler.fit_transform(x_9rnn)

# Reshape data for RNN: [samples, time steps, features]
x_9rnn = np.reshape(x_scaled, (x_scaled.shape[0], 1, x_scaled.shape[1]))

# Split the data into training and testing sets
x_train_9rnn, x_test_9rnn, y_train_9rnn, y_test_9rnn = train_test_split(x_9rnn, y_9rnn, test_size=0.2, random_state=42)

# Build the RNN model
model_9rnn = Sequential()
model_9rnn.add(LSTM(50, return_sequences=True, input_shape=(1, x_9rnn.shape[2])))
model_9rnn.add(LSTM(50))
model_9rnn.add(Dense(1))

# Compile the model
model_9rnn.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])

# Train the model
model_9rnn.fit(x_train_9rnn, y_train_9rnn, epochs=100, batch_size=32, verbose=1)

# Predict the test set result using the trained model
y_pred_9rnn = model_9rnn.predict(x_test_9rnn)

# Visualize scatter plot for predicted vs actual values
plt.scatter(y_test_9rnn, y_pred_9rnn)
plt.xlabel('Actual Sales (Confirmed Order) (MYR)')
plt.ylabel('Predicted Sales (Confirmed Order) (MYR)')
plt.title('RNN Predicted vs. Actual Values')
plt.show()

# Calculate and print the performance metrics for the recurrent neural network model
mae_9rnn = mean_absolute_error(y_test_9rnn, y_pred_9rnn)
mse_9rnn = mean_squared_error(y_test_9rnn, y_pred_9rnn)
r2_9rnn = r2_score(y_test_9rnn, y_pred_9rnn)
rmse_9rnn = np.sqrt(mse_9rnn)

print(f"RNN Mean Absolute Error (MAE): {mae_9rnn}")
print(f"RNN Mean Squared Error (MSE): {mse_9rnn}")
print(f"RNN R-squared (R2): {r2_9rnn}")
print(f"RNN Root Mean Squared Error (RMSE): {rmse_9rnn}")

After training and testing several machine learning models, we conclude that the RNN model performs better than the multiple linear regression model and is the preferred model for our project. It has lower MAE, MSE and RMSE and a higher R-squared than multiple linear regression. A likely reason is that it also uses the date, which is an important factor in business.
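
In the cell above each sample is reshaped to a single time step, so the LSTM layers receive one day at a time and the date only enters through the Year, Month and Day columns. One possible alternative, sketched below under the assumption that the rows are sorted by date, is to feed the network sliding windows of several consecutive days and to split chronologically instead of randomly. The helper `make_windows`, the window length of 4 and the toy arrays are hypothetical and not part of the original notebook.

```python
import numpy as np


def make_windows(features, targets, window):
    """Build [samples, time steps, features] windows; each target is the last day of its window."""
    features = np.asarray(features, dtype=float)
    targets = np.asarray(targets, dtype=float)
    n_windows = len(features) - window + 1
    if n_windows <= 0:
        raise ValueError("window is longer than the number of rows")
    x = np.stack([features[i:i + window] for i in range(n_windows)])
    y = targets[window - 1:]
    return x, y


# Toy data: 10 days, 3 scaled features per day (illustration only)
daily_features = np.arange(30, dtype=float).reshape(10, 3)
daily_sales = np.arange(10, dtype=float)

x_seq, y_seq = make_windows(daily_features, daily_sales, window=4)

# Chronological split: the earliest 80% of windows train, the latest 20% test
split = int(0.8 * len(x_seq))
x_train_seq, x_test_seq = x_seq[:split], x_seq[split:]
y_train_seq, y_test_seq = y_seq[:split], y_seq[split:]

print(x_seq.shape, y_seq.shape)
print(x_train_seq.shape, x_test_seq.shape)
```

With windows like these, the LSTM's `input_shape` would be `(window, n_features)` rather than `(1, n_features)`.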

# Step 10: Random Forest Model
All the Multiple Linear Regression models created above show that customer behaviour is important. However, in business, where resources such as capital, time and effort are constrained, strategic prioritization becomes crucial. Therefore, a Random Forest model is used.

It can capture non-linear relationships that Multiple Linear Regression might miss. This could reveal more complex patterns in how customer behaviors interact to influence sales, allowing for better marketing strategies.

Also, it is more robust to outliers than Multiple Linear Regression, possibly providing a more accurate picture of the sales drivers because the data has outliers and is not normally distributed.

With a better understanding of what drives sales, thanks to feature importance, Cili Lado can create more targeted strategies. For instance, if Search Clicks is a key feature, Cili Lado might work on more engaging content to keep customers clicking through to its page.


In [None]:
x_10 = pd.concat([df1[['Product Visitors (Visit)', 'Product Page Views', 'Items Visited', 'Search Clicks']],
               df2[['Percentage of new buyers', 'Percentage of repeat buyers']]], axis=1)

# Target variable
y_10 = df1['Sales (Confirmed Order) (MYR)']

x_train_10, x_test_10, y_train_10, y_test_10 = train_test_split(x_10, y_10, test_size=0.2, random_state=42)

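# Create a Random Forest Regression model (fixed seed so the search is reproducible)
model_rf = RandomForestRegressor(random_state=42)
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    # 'auto' was removed in scikit-learn 1.3; for regressors it meant all features, i.e. 1.0
    'max_features': [1.0, 'sqrt']
}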

# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=model_rf, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)

# Fit the grid search to the data
grid_search.fit(x_train_10, y_train_10)

# Print the best parameters
print("Best parameters found: ", grid_search.best_params_)

# Use the best parameters to create a new model
best_rf = RandomForestRegressor(**grid_search.best_params_, random_state=42)

# Train the model
best_rf.fit(x_train_10, y_train_10)

# Predict the test set result using the trained model
y_pred_rf = best_rf.predict(x_test_10)

# Plotting the predicted vs actual values for Random Forest Model
plt.scatter(y_test_10, y_pred_rf)
plt.xlabel('Actual Sales (Confirmed Order) (MYR)')
plt.ylabel('Predicted Sales (Confirmed Order) (MYR)')
plt.title('Random Forest Predicted vs. Actual Values')
plt.show()

# Calculate and print the performance metrics for Random Forest Model
mae_rf = mean_absolute_error(y_test_10, y_pred_rf)
mse_rf = mean_squared_error(y_test_10, y_pred_rf)
r2_rf = r2_score(y_test_10, y_pred_rf)
rmse_rf = np.sqrt(mse_rf)

print(f"Random Forest Mean Absolute Error (MAE): {mae_rf}")
print(f"Random Forest Mean Squared Error (MSE): {mse_rf}")
print(f"Random Forest R-squared (R2): {r2_rf}")
print(f"Random Forest Root Mean Squared Error (RMSE): {rmse_rf}")


We use a grid search to find the best parameters for the random forest. This avoids manual trial and error, covers the parameter combinations systematically and is faster than testing them by hand. Here are some of the parameters in random forest.

1. n_estimators: The number of trees in the forest. Generally, more trees give more stable predictions and do not increase the risk of overfitting, but they do increase computational cost.

2. max_depth: The maximum depth of each tree. Deeper trees can model more complex patterns but can lead to overfitting.

3. min_samples_split: The minimum number of samples required to split an internal node. Higher values prevent creating nodes that only fit a small number of instances.

4. min_samples_leaf: The minimum number of samples required to be at a leaf node. Setting this higher can smooth the model, especially for regression.

5. max_features: The number of features to consider when looking for the best split. Trying different values can affect both performance and overfitting.
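
To get a feel for the cost of the search, the sketch below counts how many parameter combinations the grid contains and how many model fits `GridSearchCV` performs with `cv=3`. The grid mirrors the one defined earlier, with `1.0` standing for "all features" as the first `max_features` option; the variable names are illustrative.

```python
from itertools import product

# Same grid shape as the search above (1.0 means "use all features")
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 20, 30, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": [1.0, "sqrt"],
}

# Every combination of one value per parameter is a candidate model
n_candidates = len(list(product(*param_grid.values())))

# Each candidate is trained once per cross-validation fold
cv_folds = 3
n_fits = n_candidates * cv_folds

print(f"Candidates: {n_candidates}, total fits: {n_fits}")
```

The search is exhaustive, so adding one more value to any parameter multiplies the number of fits accordingly.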


In [None]:
best_rf_model = RandomForestRegressor(
    n_estimators=200,
    max_depth=20,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features='sqrt',
    random_state=42
)

# Train the model
best_rf_model.fit(x_train_10, y_train_10)

# Make predictions on the test set
y_pred_rf = best_rf_model.predict(x_test_10)


After obtaining the best parameters from the grid search, we compute the feature importance of each factor.

In [None]:
feature_importances = best_rf_model.feature_importances_
features = x_10.columns  # Make sure this refers to the correct DataFrame used for training the model
importance_df = pd.DataFrame({'Feature': features, 'Importance': feature_importances})

# Plotting Feature Importances
importance_df = importance_df.sort_values(by='Importance', ascending=False)
plt.figure(figsize=(10, 6))
plt.bar(importance_df['Feature'], importance_df['Importance'])
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Feature Importances in Random Forest Model')
plt.xticks(rotation=90)
plt.show()

print("Feature Importances:")
print(importance_df)


We found that Product Visitors (Visit), Search Clicks and Product Page Views are the three most important features related to sales.

Therefore, based on this graph, Cili Lado should focus its marketing on increasing Product Visitors and Search Clicks first, as these two are the most important factors for increasing its sales.
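
The `feature_importances_` values used above are impurity-based and are computed from the training data. A common cross-check is permutation importance: shuffle one feature at a time on held-out data and measure how much the error grows. scikit-learn offers this as `sklearn.inspection.permutation_importance`; the sketch below implements the idea by hand with NumPy so that it stays self-contained. It uses synthetic data and a least-squares model as a stand-in for the random forest, so the data, coefficients and names are assumptions for illustration only.

```python
import numpy as np

# Synthetic data: feature 0 matters most, feature 1 a little, feature 2 not at all
rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]


def design(features):
    """Add an intercept column."""
    return np.column_stack([np.ones(len(features)), features])


# Stand-in model: ordinary least squares fitted on the training rows
coef, *_ = np.linalg.lstsq(design(X_train), y_train, rcond=None)


def mse(features, target):
    return float(np.mean((target - design(features) @ coef) ** 2))


baseline = mse(X_test, y_test)

importances = []
for j in range(X.shape[1]):
    increases = []
    for _ in range(20):  # repeat to average out shuffle noise
        X_perm = X_test.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        increases.append(mse(X_perm, y_test) - baseline)
    importances.append(float(np.mean(increases)))

for j, value in enumerate(importances):
    print(f"feature {j}: MSE increase when shuffled = {value:.4f}")
```

If the ranking from a check like this agrees with the bar chart above, the recommendation to prioritize the top features rests on firmer ground.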

In [None]:
# Plot the graph for residuals

# Calculate residuals
residuals_rf = y_test_10 - y_pred_rf

# Plot residuals against predicted values
plt.scatter(y_pred_rf, residuals_rf)
plt.title('Predicted vs Residuals (For assessing model accuracy)')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.axhline(y=0, color='r', linestyle='--', linewidth=2)
plt.show()

The graph above shows the relationship between predicted values and residuals, which helps us assess how well the trained model predicts outcomes. Predicted values are what the model thinks the result should be, while residuals are the differences between the actual values and the predictions.

Some of the data points are far from 0.0, indicating that the model's predictions are sometimes higher and sometimes lower than the actual values. This variation might be influenced by external variables such as occasional promotional activity, changes in customer behaviour, competition dynamics and economic changes, all of which can affect sales unpredictably.

Based on the above residuals plot, we can observe that the residuals have a more uniform scatter across the entire range of predicted values.

This residuals plot is **considered slightly better** than the one above because the residuals are more evenly distributed around the zero line and the model is equally likely to overestimate and underestimate throughout the entire prediction range. In addition, the vertical distribution in the middle is smaller, indicating higher prediction accuracy for these points.

In [None]:
# Plot a histogram of the residuals; bins is the number of bars
residuals_rf = y_test_10 - y_pred_rf
sns.histplot(residuals_rf, bins=50, stat='density')
plt.xlabel('Residuals')
plt.ylabel('Density')

# shapiro() returns (test statistic, p-value); index 1 selects the p-value
plt.title('Histogram of Residuals (Shapiro W p-value = {0:0.3f})'.format(shapiro(residuals_rf)[1]))
plt.show()

# A p-value below 0.05 rejects the null hypothesis that the residuals are normally distributed

The Shapiro-Wilk test assesses whether sample data fit a normal distribution. The histogram of residuals above has a p-value of **0.002**, which suggests that the residuals are not normally distributed, but it is **slightly better** than the one above, which has a p-value of 0.000. In addition, the deviation from normality may be **less severe** than in the one above.

# Step 11: Summary
For Part A, the important findings are listed below:

1. Optimize the Shopee Experience: Improve the design and user experience of the Shopee store to attract customers, which directly influences sales. Test different layouts, designs and call-to-action placements to increase page views and visits, thereby enhancing the overall Shopee shopping experience. Advertising and marketing that make more people aware of Cili Lado are also important; XiaoHongShu, Facebook and Instagram can be considered as advertising and marketing channels.

2. Search Optimization: Given the importance of search clicks as a feature, it is crucial to optimize how Cili Lado appears in search. To stand out among competitors, consider using distinctive graphics or design that prompt users to click on Cili Lado. By combining functionality with eye-catching aesthetics, the aim is to exceed user expectations, increasing clicks and user satisfaction. For example, Cili Lado could be presented not only as a food but also as a cultural experience in Malaysia.

For Part B, the important findings are as follows:

1. Loyalty Programs: If repeat customers contribute significantly to sales, developing a loyalty program or subscription model could improve customer retention. Improving the flavour, or adding more flavours, could also bring repeat customers back for more.

2. Targeted Marketing: Create different marketing campaigns for new and existing customers. Attract new customers by taking part in food fairs or roadshows, so that more people get to know the Cili Lado brand. Surveys can also be conducted to analyse customer age and preferences, to improve marketing strategies.

The following are the key findings for the combined model:


1. Holistic Marketing Strategy: The combined model can identify which factors have the most significant impact on sales. Focus on the most important factor first, then move on to the next most important feature. By knowing the combined effect, Cili Lado can prioritize which aspects of the customer experience need improvement to boost sales.

2. Product Development and Inventory Management: By understanding the full customer journey from initial visit to repeat purchase, you can make informed decisions about product development and inventory management to ensure that you're meeting customer demand.

3. Resource Allocation: Because certain features, like product visitors and search clicks, have a stronger relationship with sales, Cili Lado can allocate more resources to improving these areas. For instance, it could create higher-quality marketing videos or graphics to attract customers.

4. Marketing Strategy: The combined model can help refine marketing strategies. For example, if new customer rates are a significant predictor of sales, strategies could include targeted ads to attract new customers or promotions to convert first-time site visitors into purchasers.

5. Customer Retention: If repeat customer rates significantly impact sales, then customer retention programs, loyalty rewards, or personalized marketing could be areas to invest in, aiming to increase repeat purchases.
