This notebook predicts customer sentiment and revenue generation from customers' browsing behavior. By analyzing session-level features, we aim to understand the key drivers behind customer decisions and provide actionable insights for the business.

## Company Context
Our company is an e-commerce platform that sells a variety of products online. Understanding customer behavior is crucial for us to enhance user experience, optimize marketing strategies, and ultimately increase revenue. This analysis will help us identify patterns in how customers interact with our website and what leads to a purchase.

## Business Insights
The primary goal of this analysis is to gain actionable insights into what drives customer sentiment and revenue. We want to answer questions like:
* Which marketing channels are most effective?
* What is the impact of website performance on customer decisions?
* How does browsing behavior differ between new and returning visitors?
* What are the characteristics of customers who are most likely to make a purchase?

In [None]:
import pandas
import matplotlib.pyplot as mat
import seaborn
import numpy 
from matplotlib.gridspec import GridSpec as grid
from matplotlib.ticker import MultipleLocator as ml

In [None]:
# Importing data
data = pandas.read_csv(r"C:\Users\Bhavik Parmar\OneDrive\Desktop\Data\Shoppers_Behaviour_and_Revenue.csv")

## Exploratory Data Analysis

In [None]:
data.info()

In [None]:
data.describe()

## Feature Engineering
To better understand customer behavior, we will create new features from the existing data. This includes:
* **Total Pages Viewed:** Combining administrative, informational, and product-related pages to get a sense of overall engagement.
* **Total Duration:** Summing up the time spent on different page types to measure session length.
* **Engagement Score:** A custom metric that combines session duration, page values, and bounce rates to quantify user engagement.
* **Seasonal Analysis:** Converting month data into seasons to identify any seasonal trends in customer behavior.
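
The features listed above can be sketched on a tiny hand-made frame. This is a minimal illustration rather than the notebook's exact implementation: the three sample rows and the helper names `safe_ratio` and `month_to_season` are assumptions made for the example, and the season boundaries assume the meteorological convention (December to February as winter).

```python
import numpy as np
import pandas as pd

# Hypothetical three-session sample that reuses the dataset's column names
sample = pd.DataFrame({
    "Administrative": [0, 2, 1],
    "Informational": [0, 1, 0],
    "ProductRelated": [0, 7, 3],
    "Administrative_Duration": [0.0, 40.0, 10.0],
    "Informational_Duration": [0.0, 20.0, 0.0],
    "ProductRelated_Duration": [0.0, 140.0, 90.0],
    "Month": ["Feb", "June", "Nov"],
})

def safe_ratio(numerator, denominator):
    """Divide two Series and return 0 wherever the denominator is 0 (hypothetical helper)."""
    result = numerator / denominator.replace(0, np.nan)
    return result.fillna(0.0)

def month_to_season(month):
    """Map a month label to a season name (hypothetical helper)."""
    seasons = {
        "Winter": {"Dec", "Jan", "Feb"},
        "Spring": {"Mar", "Apr", "May"},
        "Summer": {"June", "Jul", "Aug"},
    }
    for season, months in seasons.items():
        if month in months:
            return season
    return "Fall"

page_cols = ["Administrative", "Informational", "ProductRelated"]
time_cols = ["Administrative_Duration", "Informational_Duration", "ProductRelated_Duration"]

# Overall engagement: pages viewed and time spent across all page types
sample["totalPagesViewed"] = sample[page_cols].sum(axis=1)
sample["totalDuration"] = sample[time_cols].sum(axis=1)

# Sessions with zero pages get 0 instead of NaN
sample["avgTimePerPage"] = safe_ratio(sample["totalDuration"], sample["totalPagesViewed"])

# Seasonal grouping of the month label
sample["season"] = sample["Month"].map(month_to_season)

print(sample[["totalPagesViewed", "totalDuration", "avgTimePerPage", "season"]])
```

Guarding the denominator up front keeps zero-page sessions from producing `NaN` ratios that would otherwise need a separate imputation step.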

In [None]:
data.columns

In [None]:
data.head()

In [None]:
avgExitRate = data['ExitRates'].mean()
avgExitRate

In [None]:
# Shared helper functions used throughout the notebook

def apply_isBounce(row):
    if row['BounceRates'] == 1 and row['ProductRelated'] == 1 and row['ExitRates'] == 1:
        return 'Yes'
    elif row['BounceRates'] != 1 and row['ProductRelated'] != 1 and row['ExitRates'] != 1:
        return 'No'
    else:
        return 'MayBe'
    
def apply_isExistedReally(row, avgExitRate):

    if row['ExitRates'] > avgExitRate:
        return 'Yes'
    else:
        return 'No'
    
def apply_hasLongSession(row, avgtotal):
    if row['totalDuration'] > avgtotal:
        return 'Yes'
    else:
        return 'No'

In [None]:
# Function for feature Engineering --> First function 


def first_engineering(data):

    # Basic Transformation of the columns    
    data['totalPagesViewed'] = data['Informational'] + data['Administrative'] + data['ProductRelated']
    
    data['totalDuration'] = data['Informational_Duration'] + data['Administrative_Duration'] + data['ProductRelated_Duration']
    
    data['avgTimePerPage'] = data['totalDuration'] / data['totalPagesViewed']

    return data
        

In [None]:
data1 = first_engineering(data = data) # Execution of the first function

In [None]:
# Second function for data engineering
def second_engineering(data):
    
    avgTotalDuration = data['totalDuration'].mean()
    data['hasLongSession'] = data.apply(lambda row: apply_hasLongSession(row, avgTotalDuration), axis=1)

    data['productFocus'] = data['ProductRelated'] / data['totalPagesViewed']

    data['productTimeRatio'] = data['ProductRelated_Duration'] / data['totalDuration']

    # Flag-based (categorical) columns
    # Did the visitor bounce?
    avgExitRate = data['ExitRates'].mean()
    data['isBounce'] = data.apply(lambda row: apply_isBounce(row), axis=1)
    
    data['isExit'] = data.apply(lambda row: apply_isExistedReally(row, avgExitRate), axis=1)


    return data

In [None]:
data2 = second_engineering(data1)

In [None]:
def third_featureEngineering(data):
     
    # 1. Pages per Category Ratio
    data['adminPageRatio'] = data['Administrative'] / data['totalPagesViewed']
    data['infoPageRatio'] = data['Informational'] / data['totalPagesViewed']
    
    # 2. Time Distribution Ratios
    data['adminTimeRatio'] = data['Administrative_Duration'] / data['totalDuration']
    data['infoTimeRatio'] = data['Informational_Duration'] / data['totalDuration']
    
    # 3. Interaction Efficiency: Value per Page
    data['pageValuePerView'] = data['PageValues'] / (data['totalPagesViewed'] + 1e-6)
    
    # 4. Engagement Score (Custom metric combining duration, value, and low bounce rate)
    data['engagementScore'] = (
        (data['totalDuration'] / (data['totalDuration'].max() + 1e-6)) * 0.4 +
        (data['PageValues'] / (data['PageValues'].max() + 1e-6)) * 0.4 +
        ((1 - data['BounceRates']) / (1 + data['BounceRates'].max())) * 0.2
    )
    
    # 5. Month number and season
    month_map = {
        'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'June': 6,
        'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12
    }
    data['monthNum'] = data['Month'].map(month_map)
    
    def assign_season(m):
        if m in [12, 1, 2]: return 'Winter'
        elif m in [3, 4, 5]: return 'Spring'
        elif m in [6, 7, 8]: return 'Summer'
        else: return 'Fall'
    
    data['season'] = data['monthNum'].apply(assign_season)
    
    # 6. High Value Visitor Flag
    avg_page_value = data['PageValues'].mean()
    data['isHighValueVisitor'] = numpy.where(data['PageValues'] > avg_page_value, 'Yes', 'No')
    
    # 7. Time to Value Ratio
    data['timeToValueRatio'] = data['totalDuration'] / (data['PageValues'] + 1e-6)
    
    # 8. Weekend + High Engagement Interaction
    data['weekendHighEngagement'] = numpy.where(
        (data['Weekend'] == True) & (data['engagementScore'] > data['engagementScore'].mean()),
        'Yes', 'No'
    )

    return data

In [None]:
data = third_featureEngineering(data2)

In [None]:
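# Imputing missing values
haslongsessionMode = data['hasLongSession'].mode()[0]
data['hasLongSession']= data['hasLongSession'].fillna(haslongsessionMode)
# productFocus is a ratio between 0 and 1, so int() truncates its mean (normally to 0);
# that value then fills every remaining NaN in the frame, not only productFocus
productFocusMean = int(data['productFocus'].mean())
data.fillna(productFocusMean, inplace=True)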

In [None]:
# Replacing NaN values
data['productTimeRatio'] = data['productTimeRatio'].fillna(0)

## Exploratory Data Analysis ---> Continued

In [None]:
pandas.set_option('display.max_columns', None)
data.head(5)

In [None]:
# Product Related 
overview = data['ProductRelated'].describe()
print(overview)
maxVal = data['ProductRelated'].max()
print('Max Value', maxVal)

# Product Page Visits
avg_val = data['ProductRelated'].mean()
specific = data.loc[data['ProductRelated'] > avg_val]

seaborn.histplot(y=specific['ProductRelated'], x=specific['Month'])

meanforEachday = specific.groupby('Month')['ProductRelated'].mean()
for i, month in enumerate(meanforEachday.index):
    mat.plot(i, meanforEachday[month], marker='o', color='red', label='Mean' if i == 0 else "")

mat.legend()
mat.show()

In [None]:
# Plotting countplot
countByMonths = specific.groupby('Month')['ProductRelated'].count()
pointer = 300
colorSeq = [ 'red' if val > pointer else 'blue' for val in countByMonths.values.tolist() ]

# Plotting count plot
seaborn.barplot(
    x = countByMonths.index.tolist(),
    y = countByMonths.values.tolist(),
    hue= countByMonths.index.tolist(),
    palette=colorSeq,
    legend=False
)
mat.title('Product Page Visits Per Month')
mat.show()

In [None]:
"""Grouping by revenue and months"""
# This is for the red bar (Months)
highMonthRevenue = specific.loc[
    (specific['Month'] == 'Dec') | (specific['Month'] == 'Mar') | (specific['Month'] == 'May') | (specific['Month'] == 'Nov') # Getting specific months
].groupby('Revenue')['ProductRelated'].count()

# This is for the remaining months
lowMonthRevenue = specific.loc[
    ~specific['Month'].isin(['Dec', 'Mar', 'May', 'Nov']) # Every month except the four above
].groupby('Revenue')['ProductRelated'].count()

# Creating dataframe
tempDataframe1 = pandas.DataFrame(
    pandas.concat(
        [highMonthRevenue.rename('High Month Revenue'),
        lowMonthRevenue.rename('Low Month Revenue')],
        axis=1)
)

tempDataframe1.index.name = 'Revenue'

newDf = tempDataframe1.reset_index().melt(
    id_vars='Revenue',             
    var_name='MonthType',          
    value_name='Count'       
)

mat.figure(figsize=(12, 5))

seaborn.barplot(
    x = newDf['Revenue'],
    y = newDf['Count'],
    hue = newDf['MonthType']
)

customAxis = [val for val in range(0, newDf['Count'].max(), 500)]
mat.yticks(customAxis)

mat.show()

print(tempDataframe1)

In [None]:
# Months where we generated the revenue (Full data)
RevenuebyMonth = data.loc[data['Revenue'] == True].groupby('Month')['Revenue'].count()

fig, ax = mat.subplots(1, 1, figsize=(15, 5))

bar = seaborn.barplot(
    x=RevenuebyMonth.index.tolist(),
    y = RevenuebyMonth.values.tolist(),
    hue = RevenuebyMonth.values.tolist(),
    palette='coolwarm',
    ax=ax
)

# Adding values to the plot ----> For better intuition and interpretation
for labelsToTheBar in bar.containers:
    bar.bar_label(labelsToTheBar, fmt='%.0f', padding=3)

ax.set_title('Revenue Generated Counts Per Month')

In [None]:
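# Sessions that generated revenue in the peak months (Nov, Dec, Mar, May).
# Every other session goes to nonrevenuecustomer, which therefore still contains buyers from the remaining months.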
pointfinder = data.loc[
    ((data['Month'] == 'Nov') | (data['Month'] == 'Dec') | (data['Month'] == 'Mar') | (data['Month'] == 'May')) & (data['Revenue'] == True)
]

nonrevenuecustomer = data.loc[
    ~(((data['Month'] == 'Nov') | (data['Month'] == 'Dec') | (data['Month'] == 'Mar') | (data['Month'] == 'May')) & (data['Revenue'] == True))
]

pointfinder

In [None]:
# Buyers and non-buyers behaviour -> Comparison

# Where the revenue was generated
newVisitor = pointfinder.loc[pointfinder['VisitorType'] == 'New_Visitor']
existingVisitor = pointfinder.loc[pointfinder['VisitorType'] == 'Returning_Visitor']

fig, ax = mat.subplots(2, 2, figsize=(15, 10))

seaborn.scatterplot(x = newVisitor['ProductRelated'], y = newVisitor['ProductRelated_Duration'], hue = newVisitor['VisitorType'], ax = ax[0, 0])
seaborn.scatterplot(x = existingVisitor['ProductRelated'], y = existingVisitor['ProductRelated_Duration'], hue = existingVisitor['VisitorType'], ax = ax[0, 1])

ax[0, 0].set_title('New Visitors (Revenue Generated)')
ax[0, 1].set_title('Existing Visitors (Revenue Generated)')

# Where the revenue was not generated
newVisitornon = nonrevenuecustomer.loc[nonrevenuecustomer['VisitorType'] == 'New_Visitor']
existingVisitornon = nonrevenuecustomer.loc[nonrevenuecustomer['VisitorType'] == 'Returning_Visitor']

seaborn.scatterplot(x = newVisitornon['ProductRelated'], y = newVisitornon['ProductRelated_Duration'], hue = newVisitornon['VisitorType'], ax = ax[1, 0])
seaborn.scatterplot(x = existingVisitornon['ProductRelated'], y = existingVisitornon['ProductRelated_Duration'], hue = existingVisitornon['VisitorType'], ax = ax[1, 1])

ax[1, 0].set_title('New Visitors (Revenue Not Generated)')
ax[1, 1].set_title('Existing Visitors (Revenue Not Generated)')

mat.show()

In [None]:
# Setting Canvas Figure
fig, ax = mat.subplots(1, 2, figsize=(15, 7))

# Insights on whether the new visitors are buyers or non-buyers
new_visitor_nonbuyer = data.loc[(data['VisitorType'] == 'New_Visitor') & (data['Revenue'] == False)]
new_visitor_buyer = data.loc[(data['VisitorType'] == 'New_Visitor') & (data['Revenue'] == True)]

# Plotting the counts --> New Visitors 
newVisitorCountData = [len(new_visitor_buyer), len(new_visitor_nonbuyer)] 
countPlot = seaborn.barplot(data = newVisitorCountData ,ax=ax[0])
countPlot.set_title('New Visitor Buyer vs Non-Buyer Counts')
countPlot.set_xticks([0, 1])
countPlot.set_xticklabels(['Buyer', 'Non-Buyer'])

# Adding numbers to the bars
for thelabels in countPlot.containers:
    countPlot.bar_label(thelabels, fmt='%.0f')

# Insights on whether the existing visitors purchased or not
exist_visitor_nonbuyer = data.loc[(data['VisitorType'] == 'Returning_Visitor') & (data['Revenue'] == False)]
exist_visitor_buyer = data.loc[(data['VisitorType'] == 'Returning_Visitor') & (data['Revenue'] == True)]

# Plotting the counts --> Existing Visitors
existing = [len(exist_visitor_buyer), len(exist_visitor_nonbuyer)]
colors = ['red', 'yellow'] 
countExisting = seaborn.barplot(data = existing, ax=ax[1])
countExisting.set_title('Existing Visitor Buyer vs Non-Buyer Counts')
countExisting.set_xticks([0, 1])
countExisting.set_xticklabels(['Buyer', 'Non-Buyer'])
for ExistingCountLabels in countExisting.containers:
    countExisting.bar_label(ExistingCountLabels, fmt='%.0f', padding=1)
mat.show()

In [None]:
data.head()

In [None]:
# Data
badRetention = data.loc[(data['Revenue'] == False) & (data['VisitorType'] == 'Returning_Visitor')]
goodRetention = data.loc[(data['Revenue'] == True) & (data['VisitorType'] == 'Returning_Visitor')]

In [None]:
# Functions to be used in the plots
def apply_auto(array):

    # Plot settings -> For Bad Retention
    colors = []
    system = []
    
    # Subtraction is also hardcoded!
    for val in range(len(array) - 1): 
        if val == 0 or val == 1: # Values are hardcoded!
            colors.append('darkblue')
            system.append('Desktop')
        elif val == 3 or val == 4:
            colors.append('blue')
            system.append('Mobile')
        else:
            colors.append('skyblue')
            system.append('Tab or Book')
    
    return colors, system


# MONTH
# Exploring Page Value -> Who Purchased Something
fig, ax = mat.subplots(3, 2, figsize=(20, 15))

month_to_pagevalue = badRetention.groupby('Month')['PageValues'].mean().sort_values(ascending=False)
month_to_pagevalue_good = goodRetention.groupby('Month')['PageValues'].mean().sort_values(ascending=False)
pagevalue_mean = goodRetention.groupby('Month')['PageValues'].mean().sort_values(ascending=False).mean()

seaborn.barplot(
    x = month_to_pagevalue.index.tolist(),
    y = month_to_pagevalue.values.tolist(),
    ax=ax[0, 0]
)
ax[0, 0].axhline(y=pagevalue_mean, color='red', linestyle='--', label='Average Page Value', linewidth=2)
ax[0, 0].set_title('Page Value by Month (Bad Retention)')
ax[0, 0].set_xlabel('Month')
ax[0, 0].set_ylabel('Average Page Value')
ax[0, 0].legend()

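# Page value by month for returning visitors who purchased
seaborn.barplot(
    x = month_to_pagevalue_good.index.tolist(), 
    y = month_to_pagevalue_good.values.tolist(), 
    hue = month_to_pagevalue_good.index.tolist(), 
    ax=ax[0, 1]
)
ax[0, 1].set_title('Page Value by Month (Good Retention)')
ax[0, 1].set_xlabel('Month')
ax[0, 1].set_ylabel('Average Page Value')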


# Operating Systems
# Operating System and its relation with page value
os_pagevalue = badRetention.groupby('OperatingSystems')['PageValues'].sum().sort_values(ascending=False)
os_pagevalue_good = goodRetention.groupby('OperatingSystems')['PageValues'].sum().sort_values(ascending=False)

# Converting the values to text
os_map = {
    2: 'Mac OS',
    3: 'Linux',
    4: 'Chrome OS',
    5: 'Android',
    6: 'iOS',
    7: 'Other',
    8: 'Embedded'
} 

os_pagevalue.index = os_pagevalue.index.map(os_map) # Value Changed
clr1, syst1 = apply_auto(os_pagevalue) # Colors and System Type

os_pagevalue_good.index = os_pagevalue_good.index.map(os_map) # Value Changed
clr2, syst2 = apply_auto(os_pagevalue_good) # Colors and System Type

seaborn.barplot(
    x = os_pagevalue.index.tolist(),
    y = os_pagevalue.values.tolist(),
    ax=ax[1, 0],
    hue = os_pagevalue.index.tolist(),
    palette=clr1
)

ax[1, 0].set_title('Page Value by Operating System (Bad Retention)')
ax[1, 0].set_xlabel('Operating System')
ax[1, 0].set_ylabel('Total Page Value')
ax[1, 0].legend(title='System Type', labels=syst1)

seaborn.barplot(
    x = os_pagevalue_good.index.tolist(),
    y = os_pagevalue_good.values.tolist(),
    ax=ax[1, 1],
    hue = os_pagevalue_good.index.tolist(),
    palette=clr2
)
ax[1, 1].set_title('Page Value by Operating System (Good Retention)')
ax[1, 1].set_xlabel('Operating System')
ax[1, 1].set_ylabel('Total Page Value')
ax[1, 1].legend(title='System Type', labels=syst2)


# Browsers
browser_pagevalue = badRetention.groupby('Browser')['PageValues'].sum().sort_values(ascending=False)
browser_pagevalue_good = goodRetention.groupby('Browser')['PageValues'].sum().sort_values(ascending=False)

# Updating the values
browser_mapping = {
    1: 'Chrome',
    2: 'Safari',
    3: 'Firefox',
    4: 'Edge',
    5: 'Internet Explorer',
    6: 'Opera',
    7: 'Samsung Internet',
    8: 'UC Browser',
    9: 'Android Browser',
    10: 'Brave',
    11: 'Vivaldi',
    12: 'Maxthon',
    13: 'Other / Unknown'
}

browser_pagevalue.index = browser_pagevalue.index.map(browser_mapping) 
browser_pagevalue_good.index = browser_pagevalue_good.index.map(browser_mapping)

badBrowser = seaborn.barplot(
    x = browser_pagevalue.index.tolist(),
    y = browser_pagevalue.values.tolist(),
    ax = ax[2, 0],
    hue = browser_pagevalue.index.tolist()
)
for labels in badBrowser.containers:
    badBrowser.bar_label(labels, fmt='%.0f', padding=1)
ax[2, 0].set_title('Pagevalues by Browser (Bad Retention)')
ax[2, 0].set_xlabel('Browsers')
ax[2, 0].set_ylabel('Total Page Value')


# Good Retention Browser based page values
goodBrowser = seaborn.barplot(
    x = browser_pagevalue_good.index.tolist(),
    y = browser_pagevalue_good.values.tolist(),
    ax = ax[2, 1],
    hue = browser_pagevalue_good.index.tolist()
)
for labels in goodBrowser.containers:
    goodBrowser.bar_label(labels, fmt='%.0f', padding=1)
ax[2, 1].set_title('Pagevalues by Browser (Good Retention)')
ax[2, 1].set_xlabel('Browsers')
ax[2, 1].set_ylabel('Total Page Value')

mat.tight_layout()
mat.show()

In [None]:
data.head()

In [None]:
# Weekend and Revenue
revenueOnWeekend = data.loc[(data['Weekend'] == True) & (data['Revenue'] == True)]
existing = revenueOnWeekend.loc[revenueOnWeekend['VisitorType'] == 'Returning_Visitor']
newCustomer = revenueOnWeekend.loc[revenueOnWeekend['VisitorType'] == 'New_Visitor']
print('Existing Customer Length -', len(existing))
print('Existing Customer revenue (PageValue)', existing['PageValues'].sum())
print('New Customer Length -', len(newCustomer))
print('New Customer revenue (PageValue)', newCustomer['PageValues'].sum())


In [None]:
# No Weekend and Revenue
revenueOnNoWeekend = data.loc[(data['Weekend'] == False) & (data['Revenue'] == True)]

existing = revenueOnNoWeekend.loc[revenueOnNoWeekend['VisitorType'] == 'Returning_Visitor']
newCustomer = revenueOnNoWeekend.loc[revenueOnNoWeekend['VisitorType'] == 'New_Visitor']
print('Existing Customer Length -', len(existing))
print('Existing Customer revenue (PageValue)', existing['PageValues'].sum())
print('New Customer Length -', len(newCustomer))
print('New Customer revenue (PageValue)', newCustomer['PageValues'].sum())

## Model Training and Testing
We will train and test several regression models to predict customer sentiment and revenue, using `PageValues` as the target. The models we will use include:
* Linear Regression
* Polynomial Regression
* RANSAC Regressor (with Linear Regression and Lasso as base estimators)
* Random Forest Regressor
* Gradient Boosting Regressor

We will evaluate the performance of these models using metrics like RMSE and R-squared to determine the most accurate model for our business needs.
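
To make the two metrics concrete, here is a small hand-computed sketch. RMSE is the square root of the mean squared residual, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The helper names `rmse` and `r_squared` and the four sample values are assumptions for illustration; the notebook itself relies on scikit-learn's `root_mean_squared_error` and `r2_score`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical targets and predictions: only the last prediction is off, by 2
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]

error = rmse(y_true, y_pred)
fit = r_squared(y_true, y_pred)

# RMSE relative to the spread of the target, as a percentage
target_range = max(y_true) - min(y_true)
error_pct = error / target_range * 100

print("RMSE:", error)
print("R-squared:", fit)
print("RMSE as % of target range:", error_pct)
```

Expressing RMSE as a share of the target range makes errors comparable across targets measured on different scales.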

In [None]:
# Machine learning Libraries

# Preprocessing Methods
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import PolynomialFeatures

# Importing Pipeline
from sklearn.pipeline import make_pipeline

# Machine learning Models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import RANSACRegressor
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, mean_squared_error, r2_score, root_mean_squared_error


In [None]:
def convert_Xtrain(data_var):

    # Saving encoders for specific columns
    lbEncoder = {}
    scEncoder = {}
    prEncoder = {}

    if not data_var.empty:

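        # Copies avoid pandas' SettingWithCopyWarning when the columns are overwritten below;
        # 'number' selects every integer and float width regardless of platform
        catData = data_var.select_dtypes(include=['bool', 'object']).copy()
        numData = data_var.select_dtypes(include='number').copy()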

        # First Categorical Data
        catCol = catData.columns

        for eachColumn in catCol:

            # Fitting the method
            encoder = LabelEncoder()
            new_Column_Value = encoder.fit_transform(catData[eachColumn])
            lbEncoder[eachColumn] = encoder

            # Scaling the encoded ones
            scaler = StandardScaler()
            scaled_New_values = scaler.fit_transform(pandas.DataFrame(new_Column_Value, columns=[eachColumn]))
            scEncoder[eachColumn] = scaler

            # Passing it to the respective column
            catData[eachColumn] = scaled_New_values.flatten()

            # Transferring each transformers

        # Second Continuous Data
        numCol = numData.columns

        for eachColumns in numCol:

            # Yeo-Johnson handles both right- and left-skewed columns and
            # standardizes the output, so no separate scaler is needed.
            # Fitting it for every column also guarantees that the stored
            # transformer belongs to the column it is saved under.
            transformer = PowerTransformer(method='yeo-johnson')
            numScaledValues = transformer.fit_transform(numData[[eachColumns]])
            numData[eachColumns] = numScaledValues.ravel()

            # Transferring the transformer
            prEncoder[eachColumns] = transformer

        # Concatenating both the data
        concat_data = pandas.DataFrame(
            pandas.concat([catData, numData], axis=1)
        )

        return concat_data, lbEncoder, scEncoder, prEncoder

    else:
        raise ValueError('please pass the data in this function')

In [None]:
# Second function for X test
def convert_Xtest(data_var, en1, en2, en3):
    
    # Checking that the data and all three encoder dictionaries were passed
    if not data_var.empty and all(x is not None for x in (en1, en2, en3)):

        # Working on a copy so the caller's frame is left untouched
        data_var = data_var.copy()

        for col, label in en1.items():
            data_var[col] = label.transform(data_var[col])

        for col, scaler in en2.items():
            data_var[col] = scaler.transform(data_var[[col]]).ravel()

        for col, transform in en3.items():
            data_var[col] = transform.transform(data_var[[col]]).ravel()

        return data_var
        

In [None]:
# Lets Compare the RMSE
def rangeCheck(yTest, prediction):
    target_range = numpy.max(yTest) - numpy.min(yTest)
    rmse_test = numpy.sqrt(numpy.mean((yTest - prediction) ** 2))
    rmse_percentage = (rmse_test / target_range) * 100

    print("Target Range:", target_range)
    print("Test RMSE:", rmse_test)
    print("RMSE as % of Target Range:", rmse_percentage)

In [None]:
# Model Training
# X and Y initialization
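# NOTE: features derived from PageValues (pageValuePerView, engagementScore,
# isHighValueVisitor, timeToValueRatio, weekendHighEngagement) stay in x and leak the target
x = data.drop(columns=['PageValues'])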
y = data['PageValues']

# Splitting the data
X_train, X_test, ry_train, ry_test = train_test_split(x, y, test_size=0.2, random_state=57)

# X_training preparations
X_train, encoded, scaled, transformed = convert_Xtrain(X_train)
x_transfromed_columns = list(X_train.columns)
X_test = convert_Xtest(X_test, encoded, scaled, transformed)[x_transfromed_columns]

In [None]:
# Transforming the y_train and y_test both to check the model performance
y_train = numpy.log1p(ry_train)
y_test = numpy.log1p(ry_test)
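
Because the target is transformed with `log1p`, every RMSE reported below is on the log scale. The sketch that follows shows, under assumed values, how `expm1` inverts the transform so an error can also be read in original `PageValues` units. The sample targets and the constant 0.1 prediction offset are hypothetical and do not come from the dataset.

```python
import numpy as np

# Hypothetical raw target values (original PageValues units)
raw_target = np.array([0.0, 5.0, 20.0, 360.0])

# Forward transform used for training: log(1 + y), which is safe at y = 0
log_target = np.log1p(raw_target)

# Inverse transform: exp(y) - 1 recovers the original values
restored = np.expm1(log_target)

# Hypothetical predictions that are off by a constant 0.1 on the log scale
log_pred = log_target + 0.1
raw_pred = np.expm1(log_pred)

# The same predictions scored on both scales
rmse_log = float(np.sqrt(np.mean((log_target - log_pred) ** 2)))
rmse_raw = float(np.sqrt(np.mean((raw_target - raw_pred) ** 2)))

print("Restored targets:", restored)
print("RMSE on the log scale:", rmse_log)
print("RMSE in original units:", rmse_raw)
```

A constant error on the log scale becomes a multiplicative error after inversion, so large targets dominate the RMSE in original units.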

### Linear Model Training and Prediction with Evaluation

In [None]:
# Training the model
linearModel = LinearRegression()
linearModel.fit(X_train, y_train)

# Training data fit
trainedPrediction = linearModel.predict(X_train)
train_rmse = root_mean_squared_error(y_train, trainedPrediction)
train_r2 = r2_score(y_train, trainedPrediction)

# Test data
predict = linearModel.predict(X_test)
test_rmse = root_mean_squared_error(y_test, predict)
test_r2 = r2_score(y_test, predict)

print('Trained Prediction Statistics')
print('Train RMSE Score - ', train_rmse)
print('Train R2 Score - ', train_r2, '\n')

print('Test Prediction Statistics')
print('Test RMSE - ', test_rmse)
print('Test r2 - ', test_r2, '\n')

# RMSE Comparison
print('RMSE Comparison')
rangeCheck(y_test, predict)

### Polynomial Regression with evaluation

In [None]:
# Degree Initialization
degree = [1, 2, 3]

# Score Collectors
score = {}

# Y_prediction Variable for plotting
y_pred = {}

for eachDegree in degree:
    
    # Training the model
    poly = PolynomialFeatures(degree=eachDegree, include_bias=False)
    x_train_poly = poly.fit_transform(X_train)
    x_test_poly = poly.transform(X_test)

    poly_reg = LinearRegression()
    poly_reg.fit(x_train_poly, y_train)

    # Training data fit
    p_train_pred = poly_reg.predict(x_train_poly)
    p_train_rmse = root_mean_squared_error(y_train, p_train_pred)
    p_train_r2 = r2_score(y_train, p_train_pred)

    # Test data
    p_predict = poly_reg.predict(x_test_poly)
    p_test_rmse = root_mean_squared_error(y_test, p_predict)
    p_test_r2 = r2_score(y_test, p_predict)

    print('Degree - ', eachDegree)
    print('Training Set')
    print('training RMSE - ', p_train_rmse)
    print('training R2 - ', p_train_r2)

    print()
    print('Testing Set')
    print('Testing RMSE - ', p_test_rmse)
    print('Testing R2 - ', p_test_r2)

    # Sending the score for each degree
    score[eachDegree] = {
        'train': [p_train_rmse, p_train_r2],
        'test': [p_test_rmse, p_test_r2]
    }

    # Transferring the variables
    y_pred[f'{eachDegree}'] = p_predict

    rangeCheck(y_test, p_predict)
    print()

### RANSAC (Random Sample Consensus) Training and Testing with model evaluation

In [None]:
# Residuals Finder
def findResidual(model, x_train, x_test, y_train, y_test):

    # Every argument is required, so all of them must be present
    if all(x is not None for x in (model, x_train, x_test, y_train, y_test)):

        # fitting the training set
        model.fit(x_train, y_train)

        # Predicting Training and testing 
        xtrainPredict = model.predict(x_train)
        xtestPredict = model.predict(x_test)

        # Residuals Calculations
        trainResiduals = numpy.abs(y_train - xtrainPredict)
        testResiduals = numpy.abs(y_test - xtestPredict)
    
        return trainResiduals, testResiduals

In [None]:
# Preparing the model and Model training

# Parameters for the Random Sample Consensus
base_model = [LinearRegression(), Lasso()]
modelNames = ['Linear Model', 'L1 Regularization or Lasso']

# Residual Threshold for training and testing
count = 0
dataLen = len(data)

inlinearAndOutlier = {}

# Y_pred score
lassoPredict = []

for model in base_model:

    xtrain_residualThreshold, xtest_residualThreshold = findResidual(
            model, X_train, X_test, y_train, y_test
        )
    
    xtrain_percentile =  numpy.percentile(xtrain_residualThreshold, 90)
    xtest_percentile = numpy.percentile(xtest_residualThreshold, 90)
    
    randomSampleConsencus = RANSACRegressor(
        estimator=model,
        min_samples=0.5,
        residual_threshold=xtrain_percentile,
        max_trials = 10,
        random_state=42
    )

    randomSampleConsencus.fit(X_train, y_train)

    # Training Set and Its Score
    rm_train_pred = randomSampleConsencus.predict(X_train)
    rm_rmse = root_mean_squared_error(y_train, rm_train_pred)
    rm_r2 = r2_score(y_train, rm_train_pred)

    # Testing Set and Its score
    rm_test_pred = randomSampleConsencus.predict(X_test)
    rm_test_rmse = root_mean_squared_error(y_test, rm_test_pred)
    rm_test_r2 = r2_score(y_test, rm_test_pred)

    print(modelNames[count], '\n')
    print('Trained Prediction Statistics')
    print('Train RMSE Score - ', rm_rmse)
    print('Train R2 Score - ', rm_r2, '\n')

    print('Test Prediction Statistics')
    print('Test RMSE - ', rm_test_rmse)
    print('Test r2 - ', rm_test_r2, '\n')

    # Transferring the values
    inlineMask = randomSampleConsencus.inlier_mask_
    outlierMask = numpy.logical_not(inlineMask)

    inlinearAndOutlier[modelNames[count]] = {
        'inline_mask': inlineMask,
        'outline_mask': outlierMask
    }

    if modelNames[count] == 'L1 Regularization or Lasso':
        lassoPredict.append(rm_test_pred)

    # Range Check Comparison
    rangeCheck(y_test, rm_test_pred)

    count += 1

### Random Forest Regressor

In [None]:
# Preparing the model
randomForest = RandomForestRegressor(
    max_depth=5,
    min_samples_leaf=26,
    min_samples_split=17,
    max_features='sqrt',
    n_estimators=453,
    random_state=42,
    warm_start=True
)

randomForest.fit(X_train, y_train)

# Training Evaluation
train_predict = randomForest.predict(X_train)
trainrmse = root_mean_squared_error(y_train, train_predict)
trainr2 = r2_score(y_train, train_predict)

# Testing Evaluation
test_predict = randomForest.predict(X_test)
testrmse = root_mean_squared_error(y_test, test_predict)
testr2 = r2_score(y_test, test_predict)

print('Training Scores')
print('RMSE - ', trainrmse)
print('R2 Score - ', trainr2, '\n')

print('Test Score')
print('RMSE - ', testrmse)
print('R2 Score - ', testr2)
print()
rangeCheck(y_test, test_predict)

### Gradient Boost Regressor

In [None]:
# Preparing the model for the training and testing
gradientReg = GradientBoostingRegressor(
    max_depth=3,
    min_samples_leaf=13,
    min_samples_split=5,
    max_features='sqrt',
    n_estimators=313,
    random_state=42,
    warm_start=True,
    learning_rate=0.01
)
gradientReg.fit(X_train, y_train)

# Training Evaluation
gtrain_predict = gradientReg.predict(X_train)
gtrainrmse = root_mean_squared_error(y_train, gtrain_predict)
gtrainr2 = r2_score(y_train, gtrain_predict)

# Testing Evaluation
gtest_predict = gradientReg.predict(X_test)
gtestrmse = root_mean_squared_error(y_test, gtest_predict)
gtestr2 = r2_score(y_test, gtest_predict)

print('Training Scores')
print('RMSE - ', gtrainrmse)
print('R2 Score - ', gtrainr2, '\n')

print('Test Score')
print('RMSE - ', gtestrmse)
print('R2 Score - ', gtestr2)

print()
rangeCheck(y_test, gtest_predict)

In [None]:
# Let's check the variance
variance = numpy.var(y_train)
print('The Variance - ', variance)

# Visualizing the variance
mat.hist(y_train, bins=20, edgecolor='k')
mat.title('Target Variable Distribution')
mat.xlabel('Target Variable')
mat.ylabel('Frequency')
mat.show()

In [None]:
def train_model(xtrain, xtest, ytest, ytrain):

    # Linear Model
    # Training the model
    linearModel = LinearRegression()
    linearModel.fit(xtrain, ytrain)

    # Training data fit
    trainedPrediction = linearModel.predict(xtrain)
    train_rmse = root_mean_squared_error(ytrain, trainedPrediction)
    train_r2 = r2_score(ytrain, trainedPrediction)

    # Test data
    predict = linearModel.predict(xtest)
    test_rmse = root_mean_squared_error(ytest, predict)
    test_r2 = r2_score(ytest, predict)

    # Polynomial Regression
    # Degree Initialization
    degree = [1, 2, 3]
    
    # Score Collectors
    score = {}

    # Y_prediction Variable for plotting
    y_pred = {}

    for eachDegree in degree:
        
        # Training the model
        poly = PolynomialFeatures(degree=eachDegree, include_bias=False)
        x_train_poly = poly.fit_transform(xtrain)
        x_test_poly = poly.transform(xtest)

        poly_reg = LinearRegression()
        poly_reg.fit(x_train_poly, ytrain)

        # Training data fit
        p_train_pred = poly_reg.predict(x_train_poly)
        p_train_rmse = root_mean_squared_error(ytrain, p_train_pred)
        p_train_r2 = r2_score(ytrain, p_train_pred)

        # Test data
        p_predict = poly_reg.predict(x_test_poly)
        p_test_rmse = root_mean_squared_error(ytest, p_predict)
        p_test_r2 = r2_score(ytest, p_predict)

        # Transferring the variables
        y_pred[f'{eachDegree}'] = [p_test_rmse, p_train_rmse]

    # Random Sample Consensus (RANSAC)
    # Base models for the Random Sample Consensus regressor
    base_model = [LinearRegression(), Lasso()]
    modelNames = ['Linear Model', 'L1 Regularization or Lasso']

    # Index into modelNames for the base model currently being fitted
    count = 0

    inlinearAndOutlier = {}

    # [test RMSE, train RMSE] for the Lasso-based RANSAC model
    lassoPredict = []

    for model in base_model:

        xtrain_residualThreshold, xtest_residualThreshold = findResidual(
                model, xtrain, xtest, ytrain, ytest
            )
        
        xtrain_percentile = numpy.percentile(xtrain_residualThreshold, 90)
        xtest_percentile = numpy.percentile(xtest_residualThreshold, 90)
        
        randomSampleConsencus = RANSACRegressor(
            estimator=model,
            min_samples=0.5,
            residual_threshold=xtrain_percentile,
            max_trials=10,
            random_state=42
        )

        randomSampleConsencus.fit(xtrain, ytrain)

        # Training Set and Its Score
        rm_train_pred = randomSampleConsencus.predict(xtrain)
        rm_rmse = root_mean_squared_error(ytrain, rm_train_pred)
        rm_r2 = r2_score(ytrain, rm_train_pred)

        # Testing Set and Its score
        rm_test_pred = randomSampleConsencus.predict(xtest)
        rm_test_rmse = root_mean_squared_error(ytest, rm_test_pred)
        rm_test_r2 = r2_score(ytest, rm_test_pred)

        # Transferring the values
        inlineMask = randomSampleConsencus.inlier_mask_
        outlierMask = numpy.logical_not(inlineMask)

        inlinearAndOutlier[modelNames[count]] = {
            'inline_mask': inlineMask,
            'outline_mask': outlierMask
        }

        if modelNames[count] == 'L1 Regularization or Lasso':
            lassoPredict.append([rm_test_rmse, rm_rmse])

        # Range Check Comparison
        rangeCheck(ytest, rm_test_pred)

        count += 1
    
    # Gradient Boosting
    # Preparing the model for the training and testing
    gradientReg = GradientBoostingRegressor(
        max_depth=3,
        min_samples_leaf=13,
        min_samples_split=5,
        max_features='sqrt',
        n_estimators=313,
        random_state=42,
        warm_start=True,
        learning_rate=0.01
    )
    gradientReg.fit(xtrain, ytrain)

    # Training Evaluation
    gtrain_predict = gradientReg.predict(xtrain)
    gtrainrmse = root_mean_squared_error(ytrain, gtrain_predict)
    gtrainr2 = r2_score(ytrain, gtrain_predict)

    # Testing Evaluation
    gtest_predict = gradientReg.predict(xtest)
    gtestrmse = root_mean_squared_error(ytest, gtest_predict)
    gtestr2 = r2_score(ytest, gtest_predict)

    rangeCheck(ytest, gtest_predict)

    # Random Forest Regression
    # Preparing the model
    randomForest = RandomForestRegressor(
        max_depth=5,
        min_samples_leaf=26,
        min_samples_split=17,
        max_features='sqrt',
        n_estimators=453,
        random_state=42,
        warm_start=True
    )

    randomForest.fit(xtrain, ytrain)

    # Training Evaluation
    train_predict = randomForest.predict(xtrain)
    trainrmse = root_mean_squared_error(ytrain, train_predict)
    trainr2 = r2_score(ytrain, train_predict)

    # Testing Evaluation
    test_predict = randomForest.predict(xtest)
    testrmse = root_mean_squared_error(ytest, test_predict)
    testr2 = r2_score(ytest, test_predict)

    dict2 = {
        'Linear Model': [test_rmse, train_rmse],
        'Random Sample Consensus (Lasso Regularization)': lassoPredict[0],
        'Polynomial Regression': y_pred['2'],
        'Random Forest Regression': [testrmse, trainrmse],
        'Gradient Boost Regression': [gtestrmse, gtrainrmse]
    }
    

    return dict2 
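
The fit, predict and score pattern is repeated for every model in `train_model`, which makes it easy to mix up variables between models. One way to reduce that risk is a small helper that returns the scores in the same `[test RMSE, train RMSE]` order. This is a minimal sketch on synthetic data, not a drop-in replacement: `rmse`, `score_model`, `demo_models` and the generated dataset are hypothetical names introduced for illustration.

```python
import numpy
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import train_test_split


def rmse(y_true, y_hat):
    """Root mean squared error computed directly with numpy."""
    diff = numpy.asarray(y_true) - numpy.asarray(y_hat)
    return float(numpy.sqrt(numpy.mean(diff ** 2)))


def score_model(model, xtrain, xtest, ytrain, ytest):
    """Fit `model` and return [test RMSE, train RMSE]."""
    model.fit(xtrain, ytrain)
    return [rmse(ytest, model.predict(xtest)), rmse(ytrain, model.predict(xtrain))]


# Synthetic regression data, for illustration only
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
xtr, xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Hypothetical model selection; any scikit-learn regressor could be added here
demo_models = {
    'Linear Model': LinearRegression(),
    'Lasso': Lasso(alpha=0.1),
}
demo_scores = {name: score_model(m, xtr, xte, ytr, yte) for name, m in demo_models.items()}
print(demo_scores)
```

Because each model's scores come from a single call, a train score from one model cannot accidentally end up in another model's row.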

In [None]:
rmseScore = train_model(X_train, X_test, y_test, y_train)

In [None]:
print('RMSE Score (Root Mean Squared Error)')
training_test_score = pandas.DataFrame.from_dict(rmseScore, orient="index", columns=['TestScore', 'TrainScore'])
training_test_score

In [None]:
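# Compare against a baseline that always predicts the training-set mean
# dtype=float keeps the mean from being truncated if y_test has an integer dtype
baseline = numpy.full_like(y_test, y_train.mean(), dtype=float)
baseline_rmse = float(root_mean_squared_error(y_test, baseline))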

# Checking this baseline rmse with the rmse score we are getting
for key, values in compData['RMSE'].items():

    if values >= baseline_rmse:
        print(f'{key} underperforms the baseline mean')
    else:
        print(f'{key} outperforms the baseline mean: baseline RMSE {baseline_rmse} > model RMSE {values}')
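
The same mean baseline can also be obtained from scikit-learn's `DummyRegressor`, which fits into the usual `fit`/`predict` workflow. The sketch below uses synthetic targets to check that it matches the manual calculation; every `*_demo` name is an assumption made for the example.

```python
import numpy
from sklearn.dummy import DummyRegressor

# Synthetic targets for illustration only
rng = numpy.random.default_rng(0)
y_train_demo = rng.normal(loc=5.0, scale=2.0, size=200)
y_test_demo = rng.normal(loc=5.0, scale=2.0, size=50)

# DummyRegressor ignores the features, so a placeholder column is enough
X_train_demo = numpy.zeros((len(y_train_demo), 1))
X_test_demo = numpy.zeros((len(y_test_demo), 1))

# Baseline that always predicts the training-set mean
dummy = DummyRegressor(strategy='mean')
dummy.fit(X_train_demo, y_train_demo)
dummy_pred = dummy.predict(X_test_demo)
dummy_rmse = float(numpy.sqrt(numpy.mean((y_test_demo - dummy_pred) ** 2)))

# Manual baseline for comparison
manual_pred = numpy.full(len(y_test_demo), y_train_demo.mean())
manual_rmse = float(numpy.sqrt(numpy.mean((y_test_demo - manual_pred) ** 2)))

print('DummyRegressor RMSE - ', dummy_rmse)
print('Manual baseline RMSE - ', manual_rmse)
```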
    

In [None]:
# Let's perform scatter plot analysis for each model

# Linear Model
# Create figure with custom GridSpec layout
fig = mat.figure(figsize=(15, 10))
gs = fig.add_gridspec(2, 3)

# Top row
ax1 = fig.add_subplot(gs[0, 0])
ax2 = fig.add_subplot(gs[0, 1])
ax3 = fig.add_subplot(gs[0, 2])

# Bottom row: [1, 0] separate, [1, 1] and [1, 2] merged
ax4 = fig.add_subplot(gs[1, 0])
ax5 = fig.add_subplot(gs[1, 1:])

# Scatter plots
ax1.scatter(y_test, predict)
ax1.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         color='red', linestyle='--', linewidth=2)
ax1.set_title('Linear Model Scatter Plot')

ax2.scatter(y_test, y_pred['2'])
ax2.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         color='red', linestyle='--', linewidth=2)
ax2.set_title('Polynomial Regression Scatter Plot')

ax3.scatter(y_test, lassoPredict)
ax3.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         color='red', linestyle='--', linewidth=2)
ax3.set_title('Lasso Regularization (L1) Model Scatter Plot')

ax4.scatter(y_test, test_predict)
ax4.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         color='red', linestyle='--', linewidth=2)
ax4.set_title('Random Forest Regressor Scatter Plot')

ax5.scatter(y_test, gtest_predict)
ax5.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         color='red', linestyle='--', linewidth=2)
ax5.set_title('Gradient Boost Regressor Scatter Plot')

mat.tight_layout()
mat.show()

## Conclusion
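This analysis provides an overview of customer behavior on our e-commerce platform. Using feature engineering and five regression approaches (linear, polynomial, RANSAC with a Lasso base model, random forest, and gradient boosting), we compared training and test RMSE for each model and checked the results against a baseline that always predicts the training-set mean. A model adds predictive value only when its test RMSE falls below that baseline, so the comparison table above should guide which model, if any, is used going forward. The insights gained from this analysis will help us make data-driven decisions to improve our platform and grow our business.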