<h1>Table of Contents</h1>

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#import_libraries">Import Libraries</a></li>
        <li><a href="#import_dataset">Import "Online Retail" Dataset</a></li>
        <li><a href="#information">Information about the Dataset</a></li>
        <li><a href="#pre-processing">Pre-processing</a></li>
        <li><a href="#eda">Exploratory Data Analysis (EDA)</a></li>                 
    </ol>
</div>
<br>
<hr>

<div id="import_libraries"> 
    <h2>Import Libraries</h2>    
</div>

In [None]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt  
import seaborn as sns
import matplotlib.dates as mdates

import warnings
warnings.filterwarnings("ignore")

<hr>
<div id="import_dataset"> 
    <h2>Import "Online Retail" Dataset</h2>         
</div>

### Overview of the Online Retail Dataset  

The **"Online Retail"** dataset is commonly used for data analysis in e-commerce, featuring the following attributes:  

- **InvoiceNo**: Unique transaction identifier, grouping items purchased together.  
- **StockCode**: Unique product identifier, useful for tracking sales and inventory.  
- **Description**: Textual description of products, aiding in understanding product types and preferences.  
- **Quantity**: Number of units sold in a transaction, important for analyzing sales volume.  
- **InvoiceDate**: Date and time of the transaction, crucial for time series and sales trend analysis.  
- **UnitPrice**: Price per unit, vital for revenue calculations and pricing strategies.  
- **CustomerID**: Unique customer identifier, valuable for customer behavior analysis and segmentation.  
- **Country**: Customer location, helping in geographic market analysis.  

### Uses of the Dataset  

1. **Sales Analysis**: Identifying bestsellers and trends.  
2. **Customer Segmentation**: Clustering based on buying behavior.  
3. **Inventory Management**: Optimizing stock levels and turnover.  
4. **Market Basket Analysis**: Understanding product bundling.  
5. **Predictive Modeling**: Forecasting future sales and behavior.  
6. **Time Series Analysis**: Modeling sales trends over time.  

With 541,909 observations, the dataset is large enough to support each of the analyses listed above.
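
As a rough illustration of the sales-analysis use case, the sketch below derives revenue per product from `Quantity` and `UnitPrice`. The rows are made up for illustration, and the `Revenue` column is a hypothetical helper that is not part of the dataset.

```python
import pandas as pd

# Toy rows that mimic the dataset's columns (values are invented for illustration)
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2"],
    "StockCode": ["S1", "S2", "S1"],
    "Quantity": [2, 1, 5],
    "UnitPrice": [3.0, 10.0, 3.0],
})

# 'Revenue' is a hypothetical derived column: units sold times price per unit
toy["Revenue"] = toy["Quantity"] * toy["UnitPrice"]

# Total revenue per product, highest first
revenue_by_product = (
    toy.groupby("StockCode")["Revenue"].sum().sort_values(ascending=False)
)
print(revenue_by_product)
```

The same pattern should carry over to the full dataframe once it has been cleaned in the pre-processing section.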

In [None]:
# Load the dataset
or_df = pd.read_excel('Online_Retail.xlsx')

display(or_df.head())

<hr>
<div id="information"> 
    <h2>Information about the Dataset</h2>    
</div>

In [None]:
# Get the shape of the dataset, which returns the number of rows and columns
shape_of_the_dataset = or_df.shape
print("\nThe shape of the dataset -->", shape_of_the_dataset)

In [None]:
# Show summary statistics for the dataset
# This includes count, mean, standard deviation, minimum, 25%, 50%, 75%, and maximum values for numeric columns
# For object columns it includes count, unique (number of distinct values), top (the mode), and freq (frequency of the mode)
print('\nThe dataset description:\n')

data_describe = or_df.describe(include = 'all')
display(data_describe)

In [None]:
# Display a concise summary of the dataset
# This summary includes the index dtype, column dtypes, non-null values, and memory usage 
print('\nMore information about the dataset:\n')

# info() prints the summary itself and returns None, so call it directly
or_df.info()

In [None]:
# Calculate the number of unique values in each column of the dataset
print('\nNumber of unique data in the dataset:\n')

unique_data = or_df.nunique()
print(unique_data)

<hr>
<div id="pre-processing"> 
    <h2>Pre-processing</h2>    
</div>
<div>
    <ol>
        <li><a href="#convert_data_types">Convert Data Types</a></li>
        <li><a href="#duplicates">Duplicate Tuples</a></li>
        <li><a href="#missing_values">Handling Missing Values</a></li>
        <li><a href="#filter_out">Filter Out Unnecessary Data</a></li>        
        <li><a href="#outliers">Detecting Outliers (Noise)</a></li>                      
    </ol>
</div>
<br>

<hr>
<div id="convert_data_types"> 
    <h2>Convert Data Types</h2>    
</div>

In [None]:
# Convert 'StockCode' column to categorical
or_df['StockCode'] = or_df['StockCode'].astype('category')

# Convert 'Description' column to categorical   
or_df['Description'] = or_df['Description'].astype('category') 

# Convert 'Country' column to categorical
or_df['Country'] = or_df['Country'].astype('category')

# Convert 'CustomerID' column to categorical
or_df['CustomerID'] = or_df['CustomerID'].astype('category')

# Display the data types for each column after conversion
print("\nData types after conversion:")
display(or_df.dtypes)

<hr>
<div id="duplicates"> 
    <h2>Duplicate Tuples</h2>    
</div>
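
The cells in this section rely on `duplicated()` and `drop_duplicates()`. The minimal sketch below, on a made-up four-row frame (values are illustrative, not taken from the dataset), shows how the `keep` argument changes which rows are flagged.

```python
import pandas as pd

# Rows 0, 1 and 3 are identical; row 2 is unique
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A1"],
    "Quantity": [1, 1, 3, 1],
})

# keep=False flags every copy of a duplicated row, including the first one
all_copies = toy.duplicated(keep=False)

# The default keep='first' flags only the repeats after the first occurrence
repeats_only = toy.duplicated()

# drop_duplicates() removes exactly the rows flagged by the default duplicated()
deduplicated = toy.drop_duplicates()

print(all_copies.tolist())
print(repeats_only.tolist())
print(len(deduplicated))
```

The number of rows removed by `drop_duplicates()` equals `duplicated().sum()`, which is the check performed on the shapes at the end of this section.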

In [None]:
# Calculate the number of duplicate rows in the dataframe
Num_of_duplicate_rows = or_df.duplicated().sum()
print("\nThe number of duplicate rows -->", Num_of_duplicate_rows)

In [None]:
# Identify all duplicated rows in the dataframe  
# 'duplicated(keep=False)' marks every copy of a duplicated row as True, including the first occurrence
df_all_duplicate = or_df[or_df.duplicated(keep=False)]
print("\nAll the rows and their duplicates:\n")
display(df_all_duplicate)

In [None]:
# Identify only the duplicated rows in the dataframe
# 'duplicated()' with default arguments (keep='first') flags only the repeats, so first occurrences are excluded
duplicate = or_df[or_df.duplicated()]
print("\nJust duplicate rows:\n")
display(duplicate)

In [None]:
# Drop duplicate rows, keeping the first occurrence of each
# df_ADD --> df_after dropping duplicates
df_ADD = or_df.drop_duplicates()
print("\nThe dataset after dropping the duplicate tuples:\n")
display(df_ADD)

In [None]:
# Check the shape to see how many duplicate rows were removed  
print(f'Dataset shape before dropping rows: {or_df.shape}')
print(f'Dataset shape after dropping rows: {df_ADD.shape}')

<hr>
<div id="missing_values"> 
    <h2>Handling Missing Values</h2>    
</div>
<div>
    <ol>
        <li><a href="#bf_ff">Backward Fill (bfill) and Forward Fill (ffill) methods</a></li>
        <li><a href="#mode">The Mode Method</a></li>
        <li><a href="#combination">Combination of Both Methods</a></li>             
        <li><a href="#output">Output the results</a></li>    
    </ol>
</div>
<br>

In [None]:
# Check for missing values in the dataframe
isna = pd.DataFrame(df_ADD.isna().sum(axis=0))
print(isna)

In [None]:
# Find missing values stored in other forms
# Define placeholder values that should be treated as null/missing
unwanted_values = ['?', '!', '$', 'None', 'null', '', '*'] 

# Replace unwanted values with NaN   
for col in df_ADD.columns: 
    df_ADD.loc[:, col] = df_ADD[col].replace(unwanted_values, np.nan)

In [None]:
# Check for any NaN values now present in the dataframe  
missing_values_count = df_ADD.isna().sum() 

# Display the count of missing values for each column  
print("\nCount of missing values in each column:")  
print(missing_values_count[missing_values_count > 0])

In [None]:
# Summary of missing data (percentage of missing values)
missing_summary = df_ADD.isnull().mean() * 100
print(f"\nPercentage of missing values in each column:\n{missing_summary}")

In [None]:
# Display rows with missing values  
rows_with_missing = df_ADD[df_ADD.isna().any(axis=1)]  
print("\nRows with missing values:")  
display(rows_with_missing)

<div id="bf_ff"> 
    <h2>Backward Fill (bfill) and Forward Fill (ffill) methods</h2>    
</div>

In [None]:
# Fill missing values in 'Description' and 'CustomerID' using forward fill, then backward fill
df_fill = df_ADD.copy()              # Create a copy of the dataframe

# .ffill()/.bfill() replace the deprecated fillna(method=...) form
df_fill['Description'] = df_fill['Description'].ffill().bfill()
df_fill['CustomerID'] = df_fill['CustomerID'].ffill().bfill()

# Check if there are still any missing values  
print(df_fill[['Description', 'CustomerID']].isnull().sum())  

<div id="mode"> 
    <h2>The Mode Method</h2>    
</div>

In [None]:
# Create a copy of the dataframe
df_mode = df_ADD.copy()

# Impute missing values in 'Description' column with the mode
# Assign the result back instead of calling fillna(inplace=True) on a column selection
mode_description = df_mode['Description'].mode()[0]
df_mode['Description'] = df_mode['Description'].fillna(mode_description)

# Impute missing values in 'CustomerID' column with the mode
mode_customer_id = df_mode['CustomerID'].mode()[0]
df_mode['CustomerID'] = df_mode['CustomerID'].fillna(mode_customer_id)

# Verify the imputation
print(df_mode[['Description', 'CustomerID']].isnull().sum())

<div id="combination"> 
    <h2>Combination of Both Methods</h2>    
</div>

### Rationale for Combining Mode and Forward/Backward Fill Methods

This project focuses on gaining insights into customer behavior, sales performance, and product trends. While imputation methods for missing values can impact data quality, the 'Description' and 'CustomerID' columns hold varying levels of importance relative to our primary analysis.
To ensure clarity and robustness in data preprocessing, the following approach was adopted:

1. **Mode Imputation for 'Description'**: Given that the 'Description' column had fewer missing values and is less critical for the primary analysis, the mode imputation method was used. This ensures consistency by replacing missing values with the most frequently occurring category.
2. **Forward/Backward Fill for 'CustomerID'**: The 'CustomerID' column had a significant number of missing values. Filling all of them with the mode would assign every gap to a single customer and over-represent that ID. Forward fill followed by backward fill instead takes each missing ID from the nearest neighbouring row, which keeps the filled values close in time to the transactions around them. The filled IDs are approximations borrowed from adjacent rows, not recovered identities.

Documenting and comparing these methods keeps the imputation choices transparent and makes them easier to revisit later in the analysis.
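
The two steps can be sketched on a tiny made-up frame before applying them to the real data. The values below are invented for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing Description and two missing CustomerIDs
toy = pd.DataFrame({
    "Description": ["MUG", np.nan, "MUG", "BAG"],
    "CustomerID": [np.nan, 17850.0, np.nan, 13047.0],
})

# Step 1: mode imputation for 'Description' (mode() ignores NaN by default)
mode_value = toy["Description"].mode()[0]
toy["Description"] = toy["Description"].fillna(mode_value)

# Step 2: forward fill, then backward fill for 'CustomerID'
# ffill cannot fill the first row (nothing precedes it), so bfill handles it
toy["CustomerID"] = toy["CustomerID"].ffill().bfill()

print(toy)
```

In this sketch the first row takes its ID from the row below it and the third row from the row above it, which is the neighbour-borrowing behaviour described in the rationale.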

In [None]:
# Create a copy of the dataframe
df_combined = df_ADD.copy()

# Impute missing values in 'Description' column with the mode
# Assign the result back instead of calling fillna(inplace=True) on a column selection
mode_description = df_combined['Description'].mode()[0]
df_combined['Description'] = df_combined['Description'].fillna(mode_description)

# Impute missing values in 'CustomerID' column using forward fill, then backward fill
df_combined['CustomerID'] = df_combined['CustomerID'].ffill().bfill()

# Verify the imputation
print(df_combined[['Description', 'CustomerID']].isnull().sum())

<div id="output"> 
    <h2>Output the results</h2>    
</div>

In [None]:
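# Descriptive statistics: compare the imputed columns across the datasets
# describe() on the full frame summarises only numeric (and datetime) columns, which imputation leaves unchanged,
# so restrict the comparison to the columns that were actually filled
imputed_cols = ['Description', 'CustomerID']

print('\nBaseline outcome: \n')
display(df_ADD[imputed_cols].describe())

print('\nForward Fill/Backward Fill method outcome: \n')
display(df_fill[imputed_cols].describe())

print('\nThe mode method outcome: \n')
display(df_mode[imputed_cols].describe())

print('\nCombination of both methods outcome: \n')
display(df_combined[imputed_cols].describe())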

In [None]:
print("\nContinuing with the combined-method dataset after comparing the alternatives:\n")
display(df_combined)

<hr>
<div id="filter_out"> 
    <h2>Filter Out Unnecessary Data</h2>    
</div>

In [None]:
# Keep only rows with a positive Quantity and a positive UnitPrice
df_filter_out = df_combined.copy()              # Create a copy of the dataframe
df_filter_out = df_filter_out[df_filter_out['Quantity'] > 0]
df_filter_out = df_filter_out[df_filter_out['UnitPrice'] > 0]

In [None]:
# Check the shape to see how many rows were filtered out
print(f'Dataset shape before filter out: {df_combined.shape}')
print(f'Dataset shape after filter out: {df_filter_out.shape}')

<hr>
<div id="outliers"> 
    <h2>Detecting Outliers (Noise)</h2>    
</div>
<div>
    <ol>
        <li><a href="#plot-boxplot">Plot Boxplot</a></li>
        <li><a href="#z_score">Z-score method</a></li>          
    </ol>
</div>
<br>

<div id="plot-boxplot"> 
    <h2>Plot Boxplot</h2>    
</div>

In [None]:
# Plot a boxplot to detect outliers in numerical features  
# Set up the figure with 1 row and 2 columns for boxplots, sharing the y-axis
fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=True)        # Set the figure size (width, height)

# Create the first boxplot for 'UnitPrice'
sns.boxplot(x='UnitPrice', data=df_filter_out, ax=axes[0], 
            boxprops=dict(facecolor='#d62728', alpha=0.6))
axes[0].set_title('Boxplot of UnitPrice')

# Create the second boxplot for 'Quantity'
sns.boxplot(x='Quantity', data=df_filter_out, ax=axes[1], 
            boxprops=dict(facecolor='#1f77b4', alpha=0.6))
axes[1].set_title('Boxplot of Quantity')

plt.tight_layout()         # Adjust the spacing between plots
plt.show() 
  

<div id="z_score"> 
    <h2>Z-score method</h2>    
</div>
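
As a rough illustration of the rule used in this section, the sketch below computes Z-scores by hand with the standard library and flags values more than three standard deviations from the mean. The quantities are made up. It uses the population standard deviation, which matches the default `ddof=0` of `scipy.stats.zscore`.

```python
import statistics

# Made-up values: twenty small orders and one very large one
quantities = [1] * 20 + [100]

mean = statistics.fmean(quantities)
std = statistics.pstdev(quantities)   # population standard deviation (ddof=0)

# Z-score: how many standard deviations each value lies from the mean
z_scores = [(q - mean) / std for q in quantities]

# Same rule as the notebook: flag |z| > 3
is_outlier = [abs(z) > 3 for z in z_scores]

print(round(z_scores[-1], 2))
print(sum(is_outlier))
```

Only the single large order is flagged here. Because extreme values also inflate the mean and standard deviation, the Z-score rule can be conservative on heavily skewed columns.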

In [None]:
# Create a copy of the dataset
df_zscore = df_filter_out.copy()                    

# Calculate Z-scores for the specified columns  
# Z-score indicates how many standard deviations an element is from the mean
z_scores = stats.zscore(df_zscore[['Quantity', 'UnitPrice']])  

In [None]:
# Create a new column 'is_outlier' in the dataset  
# Set it to True if any Z-score for the row is greater than 3 or less than -3  
# This indicates that the row is an outlier in at least one of the two columns  
df_zscore['is_outlier'] = (abs(z_scores) > 3).any(axis=1) 

In [None]:
# Filter the dataset to show only the rows that are identified as outliers  
outliers_only = df_zscore[df_zscore['is_outlier']]   

# Output of the outliers  
display(outliers_only) 

In [None]:
# Dataframe after removing outliers  
df_zscore = df_zscore[~df_zscore['is_outlier']]                # Use bitwise NOT to select non-outlier rows  

# Reset the index of the final dataset for cleaner indexing  
df_zscore.reset_index(drop=True, inplace=True)
df_zscore = df_zscore.drop('is_outlier', axis=1)               # Drop 'is_outlier' column

# Display the dataset after removing outliers 
display(df_zscore)

In [None]:
# Output the shape of the dataset after outlier detection
print(f'Dataset shape before removing outliers: {df_filter_out.shape}')
print(f'Dataset shape after removing outliers: {df_zscore.shape}')

<hr>
<div id="eda"> 
    <h2>Exploratory Data Analysis (EDA)</h2>    
</div>
<div>
    <ol>
        <li><a href="#visualize-data-distributions">Visualize Data Distributions</a></li>
        <li><a href="#explore-relationships">Explore Relationships</a></li>       
        <li><a href="#analyze-trends">Analyze Trends</a></li>
        <li><a href="#cohort-analysis">Cohort Analysis</a></li>
        <li><a href="#time-based-heatmap">Time-based Heatmap</a></li>        
        <li><a href="#seasonal-holiday-trends-analysis">Seasonal & Holiday Trends Analysis</a></li>                          
    </ol>
</div>
<br>

<div id="visualize-data-distributions"> 
    <h2>Visualize Data Distributions</h2>    
</div>

### Numerical Features 

In [None]:
# Plot a histogram for numerical features
# Set up the figure with 1 row and 2 columns for histograms, sharing the y-axis  
fig, axes = plt.subplots(1, 2, figsize=(12, 6), sharey=True)                      # Set the figure size (width, height)

# Create the first histogram for 'UnitPrice'
sns.histplot(df_zscore['UnitPrice'], bins=20, kde=True, color='darkred', ax=axes[0])   # Create histogram for UnitPrice with KDE
axes[0].set_title('Histogram of Unit Price')
axes[0].set_xlabel('Unit Price')
axes[0].set_ylabel('Frequency')   

# Create the second histogram for 'Quantity' 
sns.histplot(df_zscore['Quantity'], bins=20, kde=True, color='darkblue', ax=axes[1])   # Create histogram for Quantity with KDE
axes[1].set_title('Histogram of Quantity')
axes[1].set_xlabel('Quantity')
axes[1].set_ylabel('Frequency')  
  
plt.tight_layout()           # Adjust the spacing between plots 
plt.show()   

### Categorical Features

In [None]:
# Plot a bar chart for categorical features --> Country
# Count occurrences of each country and sort in descending order  
country_counts = df_zscore['Country'].value_counts().reset_index()  
country_counts.columns = ['Country', 'Count']                             # Rename columns  
country_counts = country_counts.sort_values(by='Count', ascending=False)  # Sort by count

# Number of top countries to display (top 10)
top_n = 10  
country_counts = country_counts.head(top_n)

# Set the figure size (width, height)  
plt.figure(figsize=(12, 6)) 
# Create a bar plot for 'Country' counts
ax = sns.barplot(data=country_counts, x='Country', y='Count', palette='viridis', order=country_counts['Country'])   

# Add value labels on top of each bar
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2, p.get_height()),
                ha='center', va='bottom', fontsize=10, color='black', fontweight='bold')    

# Set title and labels  
plt.title(f'Top {top_n} Countries by Count')    
plt.xlabel('')
plt.ylabel('')
ax.set_yticks([]) 
plt.xticks(rotation=45, ha='right')         # Rotate x-axis labels for better readability   
plt.show()   

In [None]:
# Plot a bar chart for categorical features --> CustomerID
# Count occurrences of each CustomerID and sort in descending order
customer_counts = df_zscore['CustomerID'].value_counts().reset_index()
customer_counts.columns = ['CustomerID', 'Count']           # Rename columns
customer_counts = customer_counts.sort_values(by='Count', ascending=False).head(20)    # Keep top 20 customers

# Set the figure size
plt.figure(figsize=(14, 6))
# Create a bar plot for 'CustomerID' counts
ax = sns.barplot(data=customer_counts, x='CustomerID', y='Count', palette='coolwarm', order=customer_counts['CustomerID'])

# Add value labels on top of each bar
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2, p.get_height()),
                ha='center', va='bottom', fontsize=8, color='black', fontweight='bold')  

# Set title and labels
plt.title('Top 20 Customers by Purchase Count')  
plt.xlabel('')
plt.ylabel('')
ax.set_yticks([])
plt.xticks(rotation=45, ha='right')       # Rotate x-axis labels for better readability
plt.show()

In [None]:
# Plot a bar chart for categorical features --> StockCode
# Count occurrences of each StockCode and sort in descending order
stock_counts = df_zscore['StockCode'].value_counts().reset_index()
stock_counts.columns = ['StockCode', 'Count']         # Rename columns
stock_counts = stock_counts.sort_values(by='Count', ascending=False).head(20)    # Keep top 20 best-selling products

# Set the figure size
plt.figure(figsize=(14, 6))
# Create a bar plot for 'StockCode' counts
ax = sns.barplot(data=stock_counts, x='StockCode', y='Count', palette='magma', order=stock_counts['StockCode'])

# Add value labels on top of each bar
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2, p.get_height()),
                ha='center', va='bottom', fontsize=8, color='black', fontweight='bold')

# Set title and labels
plt.title('Top 20 Best-Selling Products by StockCode')  
plt.xlabel('')
plt.ylabel('')
ax.set_yticks([])
plt.xticks(rotation=45, ha='right')       # Rotate x-axis labels for better readability
plt.show()

<div id="explore-relationships"> 
    <h2>Explore Relationships</h2>    
</div>

### Numerical Features

In [None]:
# Plot a heatmap to explore relationships between variables --> Quantity & UnitPrice
# Compute the correlation matrix
correlation_matrix = df_zscore[['Quantity', 'UnitPrice']].corr()

# Set the figure size
plt.figure(figsize=(6, 4))

# Create a heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5, square=True)

# Set title
plt.title('Correlation Heatmap')
plt.show()

In [None]:
# Plot a scatterplot to check how variables interact --> Quantity & UnitPrice
# Set figure size
plt.figure(figsize=(8, 5))

# Create scatter plot
sns.scatterplot(x=df_zscore['Quantity'], y=df_zscore['UnitPrice'], alpha=0.5, color='darkblue')

# Set titles and labels
plt.title('Scatter Plot of Quantity vs. Unit Price')
plt.xlabel('Quantity')
plt.ylabel('Unit Price')
plt.show()

### Numerical & Categorical Features

In [None]:
# Plot box and bar plots to view relationships between numerical and categorical variables --> Quantity & Country
# Count occurrences of each country and sort in descending order  
country_counts = df_zscore['Country'].value_counts().reset_index()  
country_counts.columns = ['Country', 'Count']                             # Rename columns  
country_counts = country_counts.sort_values(by='Count', ascending=False).head(10)    # Keep top 10 countries   

# Set figure size
fig, axes = plt.subplots(1, 2, figsize=(18, 6))         # 1 row, 2 columns

# Boxplot: Distribution of Quantity by Country
sns.boxplot(x='Country', y='Quantity', data=df_zscore, palette='Greens_r', 
            order=country_counts['Country'], ax=axes[0])
# Set titles and labels
axes[0].set_title(f'Quantity Distribution by Country (Top {10})')
axes[0].set_xlabel('')
axes[0].set_ylabel('')
axes[0].tick_params(axis='x', rotation=45)        # Rotate x-axis labels for better readability

# Barplot: Average Quantity Sold per Country
sns.barplot(x='Country', y='Quantity', data=df_zscore, palette='Greens_r', 
            order=country_counts['Country'], estimator=np.mean, ax=axes[1])
# Set titles and labels
axes[1].set_title(f'Average Quantity Sold by Country (Top {10})')
axes[1].set_xlabel('')
axes[1].set_ylabel('')
axes[1].tick_params(axis='x', rotation=45)        # Rotate x-axis labels for better readability

plt.tight_layout()               # Adjust the spacing between plots
plt.show()

In [None]:
# Plot box and bar plots to view relationships between numerical and categorical variables --> UnitPrice & Country
# Count occurrences of each country and sort in descending order  
country_counts = df_zscore['Country'].value_counts().reset_index()  
country_counts.columns = ['Country', 'Count']                             # Rename columns  
country_counts = country_counts.sort_values(by='Count', ascending=False).head(10)    # Keep top 10 countries

# Set figure size
fig, axes = plt.subplots(1, 2, figsize=(18, 6))         # 1 row, 2 columns

# Boxplot: Distribution of UnitPrice by Country
sns.boxplot(x='Country', y='UnitPrice', data=df_zscore, palette='YlGn_r', 
            order=country_counts['Country'], ax=axes[0])
# Set titles and labels
axes[0].set_title(f'UnitPrice Distribution by Country (Top {10})')
axes[0].set_xlabel('')
axes[0].set_ylabel('')
axes[0].tick_params(axis='x', rotation=45)        # Rotate x-axis labels for better readability

# Barplot: Average UnitPrice Sold per Country
sns.barplot(x='Country', y='UnitPrice', data=df_zscore, palette='YlGn_r', 
            order=country_counts['Country'], estimator=np.mean, ax=axes[1])
# Set titles and labels
axes[1].set_title(f'Average UnitPrice Sold by Country (Top {10})')
axes[1].set_xlabel('')
axes[1].set_ylabel('')
axes[1].tick_params(axis='x', rotation=45)        # Rotate x-axis labels for better readability

plt.tight_layout()               # Adjust the spacing between plots
plt.show()

In [None]:
# Plot box and bar plots to view relationships between numerical and categorical variables --> Quantity & CustomerID
# Count occurrences of each CustomerID and sort in descending order
customer_counts = df_zscore['CustomerID'].value_counts().reset_index()
customer_counts.columns = ['CustomerID', 'Count']           # Rename columns
customer_counts = customer_counts.sort_values(by='Count', ascending=False).head(10)    # Keep top 10 customers   

# Set figure size
fig, axes = plt.subplots(1, 2, figsize=(18, 6))         # 1 row, 2 columns

# Boxplot: Distribution of Quantity by CustomerID
sns.boxplot(x='CustomerID', y='Quantity', data=df_zscore, palette='YlOrBr_r', 
            order=customer_counts['CustomerID'], ax=axes[0])
# Set titles and labels
axes[0].set_title(f'Quantity Distribution by CustomerID (Top {10})')
axes[0].set_xlabel('')
axes[0].set_ylabel('')
axes[0].tick_params(axis='x', rotation=45)        # Rotate x-axis labels for better readability

# Barplot: Average Quantity Sold per CustomerID
sns.barplot(x='CustomerID', y='Quantity', data=df_zscore, palette='YlOrBr_r', 
            order=customer_counts['CustomerID'], estimator=np.mean, ax=axes[1])
# Set titles and labels
axes[1].set_title(f'Average Quantity Sold by CustomerID (Top {10})')
axes[1].set_xlabel('')
axes[1].set_ylabel('')
axes[1].tick_params(axis='x', rotation=45)        # Rotate x-axis labels for better readability

plt.tight_layout()               # Adjust the spacing between plots
plt.show()

In [None]:
# Plot box and bar plots to view relationships between numerical and categorical variables --> UnitPrice & CustomerID
# Count occurrences of each CustomerID and sort in descending order
customer_counts = df_zscore['CustomerID'].value_counts().reset_index()
customer_counts.columns = ['CustomerID', 'Count']           # Rename columns
customer_counts = customer_counts.sort_values(by='Count', ascending=False).head(10)    # Keep top 10 customers   

# Set figure size
fig, axes = plt.subplots(1, 2, figsize=(18, 6))         # 1 row, 2 columns

# Boxplot: Distribution of UnitPrice by CustomerID
sns.boxplot(x='CustomerID', y='UnitPrice', data=df_zscore, palette='YlOrRd_r', 
            order=customer_counts['CustomerID'], ax=axes[0])
# Set titles and labels
axes[0].set_title(f'UnitPrice Distribution by CustomerID (Top {10})')
axes[0].set_xlabel('')
axes[0].set_ylabel('')
axes[0].tick_params(axis='x', rotation=45)        # Rotate x-axis labels for better readability

# Barplot: Average UnitPrice Sold per CustomerID
sns.barplot(x='CustomerID', y='UnitPrice', data=df_zscore, palette='YlOrRd_r', 
            order=customer_counts['CustomerID'], estimator=np.mean, ax=axes[1])
# Set titles and labels
axes[1].set_title(f'Average UnitPrice Sold by CustomerID (Top {10})')
axes[1].set_xlabel('')
axes[1].set_ylabel('')
axes[1].tick_params(axis='x', rotation=45)        # Rotate x-axis labels for better readability

plt.tight_layout()               # Adjust the spacing between plots
plt.show()

In [None]:
# Plot box and bar plots to view relationships between numerical and categorical variables --> Quantity & StockCode
# Count occurrences of each StockCode and sort in descending order
stock_counts = df_zscore['StockCode'].value_counts().reset_index()
stock_counts.columns = ['StockCode', 'Count']         # Rename columns
stock_counts = stock_counts.sort_values(by='Count', ascending=False).head(10)    # Keep top 10 best-selling products

# Set figure size
fig, axes = plt.subplots(1, 2, figsize=(18, 6))         # 1 row, 2 columns

# Boxplot: Distribution of Quantity by StockCode
sns.boxplot(x='StockCode', y='Quantity', data=df_zscore, palette='dark:salmon', 
            order=stock_counts['StockCode'], ax=axes[0])
# Set titles and labels
axes[0].set_title(f'Quantity Distribution by StockCode (Top {10})')
axes[0].set_xlabel('')
axes[0].set_ylabel('')
axes[0].tick_params(axis='x', rotation=45)        # Rotate x-axis labels for better readability

# Barplot: Average Quantity Sold per StockCode
sns.barplot(x='StockCode', y='Quantity', data=df_zscore, palette='dark:salmon', 
            order=stock_counts['StockCode'], estimator=np.mean, ax=axes[1])
# Set titles and labels
axes[1].set_title(f'Average Quantity Sold by StockCode (Top {10})')
axes[1].set_xlabel('')
axes[1].set_ylabel('')
axes[1].tick_params(axis='x', rotation=45)        # Rotate x-axis labels for better readability

plt.tight_layout()               # Adjust the spacing between plots
plt.show()

In [None]:
# Plot box and bar plots to view relationships between numerical and categorical variables --> UnitPrice & StockCode
# Count occurrences of each StockCode and sort in descending order
stock_counts = df_zscore['StockCode'].value_counts().reset_index()
stock_counts.columns = ['StockCode', 'Count']         # Rename columns
stock_counts = stock_counts.sort_values(by='Count', ascending=False).head(10)    # Keep top 10 best-selling products

# Set figure size
fig, axes = plt.subplots(1, 2, figsize=(18, 6))         # 1 row, 2 columns

# Boxplot: Distribution of UnitPrice by StockCode
sns.boxplot(x='StockCode', y='UnitPrice', data=df_zscore, palette='rocket', 
            order=stock_counts['StockCode'], ax=axes[0])
# Set titles and labels
axes[0].set_title(f'UnitPrice Distribution by StockCode (Top {10})')
axes[0].set_xlabel('')
axes[0].set_ylabel('')
axes[0].tick_params(axis='x', rotation=45)        # Rotate x-axis labels for better readability

# Barplot: Average UnitPrice Sold per StockCode
sns.barplot(x='StockCode', y='UnitPrice', data=df_zscore, palette='rocket', 
            order=stock_counts['StockCode'], estimator=np.mean, ax=axes[1])
# Set titles and labels
axes[1].set_title(f'Average UnitPrice Sold by StockCode (Top {10})')
axes[1].set_xlabel('')
axes[1].set_ylabel('')
axes[1].tick_params(axis='x', rotation=45)        # Rotate x-axis labels for better readability

plt.tight_layout()               # Adjust the spacing between plots
plt.show()

### Categorical Features
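
The heatmaps in this subsection are built from `pd.crosstab`, which counts how often each pair of categories occurs together. The sketch below uses a made-up five-row frame (the country and stock code values are illustrative only).

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["UK", "UK", "France", "UK", "France"],
    "StockCode": ["S1", "S2", "S1", "S1", "S1"],
})

# Rows are countries, columns are stock codes, cells are co-occurrence counts
counts = pd.crosstab(toy["Country"], toy["StockCode"])

# Restrict to the most frequent country, as the notebook does with its top-10 lists
top_countries = toy["Country"].value_counts().head(1).index
filtered = counts.loc[top_countries]

print(counts)
print(filtered)
```

Pairs that never occur together appear as zeros, which is why the full cross-tabulation is filtered to the top categories before plotting.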

In [None]:
# Plot a heatmap to view relationships between variables --> Country & StockCode
# Create a cross-tabulation between Country and StockCode
country_stock = pd.crosstab(df_zscore['Country'], df_zscore['StockCode'])

# Filter top 10 countries and top 10 stock codes for better readability
top_countries = df_zscore['Country'].value_counts().head(10).index
top_stocks = df_zscore['StockCode'].value_counts().head(10).index
filtered_data = country_stock.loc[top_countries, top_stocks]

# Plot heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(filtered_data, annot=True, fmt='d', cmap='crest')

# Set titles and labels
plt.title('Top 10 Countries vs Top 10 Products (StockCode)')
plt.xlabel('')
plt.ylabel('')
plt.tight_layout()
plt.show()

In [None]:
# Plot a bar chart to view relationships between variables --> Country & StockCode
# Select a country (e.g. UK)
selected_country = 'United Kingdom'

# Filter for that country
df_country = df_zscore[df_zscore['Country'] == selected_country]

# Group by StockCode and Count
stockcode_counts = df_country['StockCode'].value_counts().head(10).reset_index()           # Keep top 10
stockcode_counts.columns = ['StockCode', 'Count']

# Plot barplot
plt.figure(figsize=(10, 6))
sns.barplot(data=stockcode_counts, x='StockCode', y='Count', palette='crest_r',
            order=stockcode_counts['StockCode'])

# Set titles and labels
plt.title(f'Top 10 StockCodes in {selected_country}')
plt.xlabel('')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Plot a heatmap to view relationships between variables --> Country & CustomerID
# Create a cross-tabulation between Country and CustomerID
country_customer = pd.crosstab(df_zscore['Country'], df_zscore['CustomerID'])

# Filter top 10 countries and top 10 customerIDs for better readability
top_countries = df_zscore['Country'].value_counts().head(10).index
top_CustomerIDs = df_zscore['CustomerID'].value_counts().head(10).index
filtered_data = country_customer.loc[top_countries, top_CustomerIDs]

# Plot heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(filtered_data, annot=True, fmt='d', cmap='mako_r')

# Set titles and labels
plt.title('Top 10 Countries vs Top 10 CustomerIDs')
plt.xlabel('')
plt.ylabel('')
plt.tight_layout()
plt.show()

In [None]:
# Plot a bar chart to view relationships between variables --> Country & CustomerID
# Select a country (e.g. UK)
selected_country = 'United Kingdom'

# Filter for that country
df_country = df_zscore[df_zscore['Country'] == selected_country]

# Group by CustomerID and Count
customerID_counts = df_country['CustomerID'].value_counts().head(10).reset_index()           # Keep top 10
customerID_counts.columns = ['CustomerID', 'Count']

# Plot barplot
plt.figure(figsize=(10, 6))
sns.barplot(data=customerID_counts, x='CustomerID', y='Count', palette='mako',
            order=customerID_counts['CustomerID'])

# Set titles and labels
plt.title(f'Top 10 Customers in {selected_country}')
plt.xlabel('')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Plot a heatmap to view relationships between variables --> StockCode & CustomerID
# Create a cross-tabulation between StockCode and CustomerID
stock_customer = pd.crosstab(df_zscore['StockCode'], df_zscore['CustomerID'])

# Filter top 10 stock codes and top 10 customerIDs for better readability
top_stocks = df_zscore['StockCode'].value_counts().head(10).index
top_CustomerIDs = df_zscore['CustomerID'].value_counts().head(10).index
filtered_data = stock_customer.loc[top_stocks, top_CustomerIDs]

# Plot heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(filtered_data, annot=True, fmt='d', cmap='flare')

# Set titles and labels
plt.title('Top 10 Stock Codes vs Top 10 CustomerIDs')
plt.xlabel('')
plt.ylabel('')
plt.tight_layout()
plt.show()
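
The heatmap cells above build the full cross-tabulation and then keep only the top rows and columns. A minimal sketch on made-up data (the `toy` frame and its values are hypothetical) suggests that filtering the rows first and tabulating only the subset gives the same table, which may matter when the full table is large:

```python
import pandas as pd

# Hypothetical purchases, not taken from the Online Retail data
toy = pd.DataFrame({
    'StockCode': ['A', 'A', 'B', 'B', 'C', 'A', 'A'],
    'CustomerID': [1, 2, 1, 1, 3, 1, 2],
})

# Two most frequent stock codes and customers
top_stocks = toy['StockCode'].value_counts().head(2).index
top_customers = toy['CustomerID'].value_counts().head(2).index

# Approach used above: full table first, then select rows and columns
full_then_filter = pd.crosstab(toy['StockCode'], toy['CustomerID']).loc[top_stocks, top_customers]

# Alternative sketch: keep only the relevant rows, then tabulate
subset = toy[toy['StockCode'].isin(top_stocks) & toy['CustomerID'].isin(top_customers)]
filter_then_tab = (pd.crosstab(subset['StockCode'], subset['CustomerID'])
                     .reindex(index=top_stocks, columns=top_customers, fill_value=0))

print(full_then_filter)
print(filter_then_tab)
```

The `reindex(..., fill_value=0)` step keeps the same row and column order as the first approach and fills combinations that never occur with 0.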

In [None]:
# Plot a bar chart to view relationships between variables --> StockCode & CustomerID
# Select a StockCode (e.g. 85123A)
selected_stockcode = '85123A'

# Filter for that stockcode
df_stockcode = df_zscore[df_zscore['StockCode'] == selected_stockcode]

# Group by CustomerID and Count
customerID_counts = df_stockcode['CustomerID'].value_counts().head(10).reset_index()           # Keep top 10
customerID_counts.columns = ['CustomerID', 'Count']

# Plot barplot
plt.figure(figsize=(10, 6))
sns.barplot(data=customerID_counts, x='CustomerID', y='Count', palette='flare_r',
            order=customerID_counts['CustomerID'])

# Set titles and labels
plt.title(f'Top 10 CustomerIDs for StockCode {selected_stockcode}')
plt.xlabel('')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

<div id="analyze-trends"> 
    <h2>Analyze Trends</h2>    
</div>

In [None]:
# Plot a time series plot to analyze trends over time --> Monthly 'Quantity' trend 
# Create an 'InvoiceMonth' column for monthly trend
df_zscore['InvoiceMonth'] = df_zscore['InvoiceDate'].dt.to_period('M').dt.to_timestamp()

# Group by month and calculate total quantity sold
monthly_trend = df_zscore.groupby('InvoiceMonth')['Quantity'].sum().reset_index()

# Plot the monthly trend
plt.figure(figsize=(12, 6))
sns.lineplot(data=monthly_trend, x='InvoiceMonth', y='Quantity', marker='o', color='royalblue')

# Set titles and labels
plt.title('Monthly Total Quantity Trend')
plt.xlabel('')
plt.ylabel('Total Quantity Sold')
plt.tight_layout()
plt.show()

In [None]:
# Plot a time series plot to analyze trends over time --> Monthly 'TotalPrice' trend 
# Create 'TotalPrice' feature 
df_zscore['TotalPrice'] = df_zscore['Quantity'] * df_zscore['UnitPrice']

# Group by month and calculate total price sold
monthly_trend = df_zscore.groupby('InvoiceMonth')['TotalPrice'].sum().reset_index()

# Plot the monthly trend
plt.figure(figsize=(12, 6))
sns.lineplot(data=monthly_trend, x='InvoiceMonth', y='TotalPrice', marker='o', color='darkblue')

# Set titles and labels
plt.title('Monthly Total Price Trend')
plt.xlabel('')
plt.ylabel('Total Price Sold')
plt.tight_layout()
plt.show()

In [None]:
# Plot a time series plot to analyze trends over time --> Weekly 'TotalPrice' trend
# Create an 'InvoiceWeek' column for weekly trend
df_zscore['InvoiceWeek'] = df_zscore['InvoiceDate'].dt.to_period('W').dt.to_timestamp()

# Group by week and calculate total price sold
weekly_trend = df_zscore.groupby('InvoiceWeek')['TotalPrice'].sum().reset_index()

# Plot the weekly trend
plt.figure(figsize=(12, 6))
sns.lineplot(data=weekly_trend, x='InvoiceWeek', y='TotalPrice', marker='o', color='green')

# Set the date format for x-axis
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%W'))     # Format: Year-Week
plt.gca().xaxis.set_major_locator(mdates.WeekdayLocator(interval=1))   # Adjust the interval for weekly ticks

# Set titles and labels
plt.title('Weekly Total Price Trend')
plt.xlabel('')
plt.ylabel('Total Price Sold')
plt.tight_layout()
plt.show()
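
In pandas, `dt.to_period('W')` uses weeks that end on Sunday by default, so `to_timestamp()` returns the Monday that starts each week. A small check on made-up timestamps (the `stamps` values are hypothetical) illustrates how invoices are assigned to weeks:

```python
import pandas as pd

# Hypothetical timestamps: a Wednesday, the following Sunday, and the Monday after
stamps = pd.Series(pd.to_datetime(['2011-01-05 10:00', '2011-01-09 15:30', '2011-01-10 09:00']))

# Same transformation as the 'InvoiceWeek' column
week_start = stamps.dt.to_period('W').dt.to_timestamp()

print(week_start.tolist())
print(week_start.dt.dayofweek.tolist())   # 0 = Monday
```

The Wednesday and the Sunday fall into the same week, while the Monday opens a new one.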

In [None]:
# Plot a time series plot to analyze trends over time --> Monthly 'TotalPrice' trend based on 'Country'
# Get top 5 countries by total price
top_countries = df_zscore.groupby('Country')['TotalPrice'].sum().sort_values(ascending=False).head(5).index.tolist()     # Top 5 countries

# Filter data for only top countries
df_top_countries = df_zscore[df_zscore['Country'].isin(top_countries)].copy()

# Group by month and country, then sum total price
monthly_country_trend = df_top_countries.groupby(['InvoiceMonth', 'Country'])['TotalPrice'].sum().reset_index()

# Plot the monthly trend
plt.figure(figsize=(14, 6))
sns.lineplot(data=monthly_country_trend, x='InvoiceMonth', y='TotalPrice', hue='Country', marker='o',
             hue_order=top_countries)       

# Set titles and labels
plt.title('Monthly Total Price Trend for Top 5 Countries')
plt.xlabel('')
plt.ylabel('Total Price Sold')
plt.tight_layout()
plt.show()

In [None]:
# Plot a time series plot to analyze trends over time --> Monthly 'TotalPrice' trend based on 'StockCode'
# Get top 5 most sold products by total price
top_stockcodes = df_zscore.groupby('StockCode')['TotalPrice'].sum().sort_values(ascending=False).head(5).index.tolist()     # Top 5 stockcodes

# Filter data for only top products (stockcodes)
df_top_stockcodes = df_zscore[df_zscore['StockCode'].isin(top_stockcodes)].copy()

# Group by month and stockcode, then sum total price
monthly_stockcode_trend = df_top_stockcodes.groupby(['InvoiceMonth', 'StockCode'])['TotalPrice'].sum().reset_index()

# Plot the monthly trend
plt.figure(figsize=(14, 6))
sns.lineplot(data=monthly_stockcode_trend, x='InvoiceMonth', y='TotalPrice', hue='StockCode', marker='o',
             hue_order=top_stockcodes)

# Set titles and labels
plt.title('Monthly Total Price Trend for Top 5 StockCodes')
plt.xlabel('')
plt.ylabel('Total Price Sold')
plt.tight_layout()
plt.show()

In [None]:
# Plot a time series plot to analyze trends over time --> Monthly 'TotalPrice' trend based on 'CustomerID'
# Get top 5 customerIDs by total price
top_customerIDs = df_zscore.groupby('CustomerID')['TotalPrice'].sum().sort_values(ascending=False).head(5).index.tolist()     # Top 5 customerIDs

# Filter data for only top customerIDs
df_top_customerIDs = df_zscore[df_zscore['CustomerID'].isin(top_customerIDs)].copy()

# Group by month and customerID, then sum total price
monthly_customerIDs_trend = df_top_customerIDs.groupby(['InvoiceMonth', 'CustomerID'])['TotalPrice'].sum().reset_index()

# Plot the monthly trend
plt.figure(figsize=(14, 6))
sns.lineplot(data=monthly_customerIDs_trend, x='InvoiceMonth', y='TotalPrice', hue='CustomerID', marker='o',
             hue_order=top_customerIDs)

# Set titles and labels
plt.title('Monthly Total Price Trend for Top 5 CustomerIDs')
plt.xlabel('')
plt.ylabel('Total Price Sold')
plt.tight_layout()
plt.show()

<div id="cohort-analysis"> 
    <h2>Cohort Analysis</h2>    
</div>
In this analysis, customers are grouped into cohorts by the month of their first purchase. For each cohort, we then measure what share of its customers return to purchase in each subsequent month.
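
As a minimal sketch of the idea on made-up transactions (the `toy` frame and its three customers are hypothetical, not part of the dataset), the cohort month, the month index and the retention share could be computed like this:

```python
import pandas as pd

# Hypothetical toy transactions (not from the Online Retail data)
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 3],
    'InvoiceDate': pd.to_datetime([
        '2011-01-05', '2011-03-10',   # customer 1: first purchase in January, returns in March
        '2011-01-20', '2011-02-02',   # customer 2: first purchase in January, returns in February
        '2011-02-15',                 # customer 3: first purchase in February
    ]),
})

# Month of each invoice and month of each customer's first purchase
toy['InvoiceMonth'] = toy['InvoiceDate'].dt.to_period('M').dt.to_timestamp()
toy['CohortMonth'] = toy.groupby('CustomerID')['InvoiceMonth'].transform('min')

# Months elapsed since the first purchase, starting at 1
toy['CohortIndex'] = (
    (toy['InvoiceMonth'].dt.year - toy['CohortMonth'].dt.year) * 12
    + (toy['InvoiceMonth'].dt.month - toy['CohortMonth'].dt.month)
    + 1
)

# Unique customers per cohort and month index, divided by the cohort's first-month size
toy_counts = (toy.groupby(['CohortMonth', 'CohortIndex'])['CustomerID']
                 .nunique()
                 .unstack('CohortIndex'))
toy_retention = toy_counts.divide(toy_counts.iloc[:, 0], axis=0)
print(toy_retention)
```

In this toy case the January cohort has two customers, one of whom is active in month 2 and one in month 3, so both later months show a retention of 0.5. Cells with no activity stay empty (NaN).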

In [None]:
# Cohort analysis to identify customer behaviors at different times 
# Create the 'CohortMonth': the month of the customer's first purchase
df_zscore['CohortMonth'] = df_zscore.groupby('CustomerID')['InvoiceMonth'].transform('min')

# Calculate the number of months between the invoice date and the cohort month
def get_month_diff(df):
    # Use the DataFrame passed in, rather than the global df_zscore
    return (df['InvoiceMonth'].dt.year - df['CohortMonth'].dt.year) * 12 + \
           (df['InvoiceMonth'].dt.month - df['CohortMonth'].dt.month)

df_zscore['CohortIndex'] = get_month_diff(df_zscore) + 1      # Month index starts at 1

# Count unique customers in each CohortMonth and CohortIndex
cohort_data = df_zscore.groupby(['CohortMonth', 'CohortIndex'])['CustomerID'].nunique().reset_index()

# Pivot the data to create a retention matrix
cohort_counts = cohort_data.pivot(index='CohortMonth', columns='CohortIndex', values='CustomerID')

# Get cohort sizes (number of customers in the first month)
cohort_sizes = cohort_counts.iloc[:, 0]

# Calculate retention rates
retention = cohort_counts.divide(cohort_sizes, axis=0).round(3)

# Format CohortMonth to show only YYYY-MM
retention.index = retention.index.to_series().dt.strftime('%Y-%m')

# Plot the retention matrix heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(retention, annot=True, fmt='.0%', cmap='BuGn', cbar=True)

# Set titles and labels
plt.title('Cohort Analysis - Customer Retention')
plt.xlabel('Cohort Month Index')
plt.ylabel('Cohort Start Month')
plt.show()

<div id="time-based-heatmap"> 
    <h2>Time-based Heatmap</h2>    
</div>

In [None]:
# Plot a heatmap to show what times of the week customers had the most interaction
# Extract Hour and Day of Week
df_zscore['Hour'] = df_zscore['InvoiceDate'].dt.hour
df_zscore['DayOfWeek'] = df_zscore['InvoiceDate'].dt.dayofweek  # Monday=0, Sunday=6

# Map day numbers to names
day_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
df_zscore['DayOfWeek'] = df_zscore['DayOfWeek'].apply(lambda x: day_labels[x])

# Create pivot table: number of distinct invoices per DayOfWeek and Hour
# ('nunique' counts each InvoiceNo once; 'count' would count every line item)
time_heatmap = df_zscore.pivot_table(index='DayOfWeek', columns='Hour', 
                                     values='InvoiceNo', aggfunc='nunique').fillna(0)

# Reorder days for visual clarity
day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
time_heatmap = time_heatmap.reindex(day_order)

# Plot the heatmap
plt.figure(figsize=(14, 6))
sns.heatmap(time_heatmap, cmap='YlGnBu', annot=True, fmt='.0f')

# Set titles and labels
plt.title('Time-based Heatmap: Invoice Count by Day and Hour')
plt.xlabel('Hour of Day')
plt.ylabel('')
plt.show()
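
Because each invoice spans several rows (one per product), counting rows and counting distinct invoices can give different pictures. A minimal sketch with made-up invoice lines (the `lines` frame is hypothetical) shows the difference between `aggfunc='count'` and `aggfunc='nunique'` in `pivot_table`:

```python
import pandas as pd

# Hypothetical invoice lines: invoice 'A' has three line items, invoice 'B' has one
lines = pd.DataFrame({
    'InvoiceNo': ['A', 'A', 'A', 'B'],
    'DayOfWeek': ['Mon', 'Mon', 'Mon', 'Tue'],
    'Hour': [10, 10, 10, 10],
})

# Number of rows (line items) per day and hour
row_counts = lines.pivot_table(index='DayOfWeek', columns='Hour',
                               values='InvoiceNo', aggfunc='count')

# Number of distinct invoices per day and hour
invoice_counts = lines.pivot_table(index='DayOfWeek', columns='Hour',
                                   values='InvoiceNo', aggfunc='nunique')

print(row_counts)
print(invoice_counts)
```

Which of the two is more appropriate depends on whether "interaction" is meant as purchased line items or as separate orders.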

In [None]:
# Plot a heatmap to show what times of the week the highest amount of sales occurred
# Create pivot table: total sales per DayOfWeek and Hour
sales_heatmap = df_zscore.pivot_table(index='DayOfWeek', columns='Hour',
                                      values='TotalPrice', aggfunc='sum').fillna(0)

# Reorder days for visual clarity
day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
sales_heatmap = sales_heatmap.reindex(day_order)

# Plot the heatmap
plt.figure(figsize=(14, 6))
sns.heatmap(sales_heatmap, cmap='YlOrRd', annot=True, fmt='.0f')

# Set titles and labels
plt.title('Time-based Heatmap: Total Sales by Day and Hour')
plt.xlabel('Hour of Day')
plt.ylabel('')
plt.show()

In [None]:
# Plot a heatmap to show at what times of the week the most products were sold
# Create pivot table: total quantity sold per DayOfWeek and Hour
quantity_heatmap = df_zscore.pivot_table(index='DayOfWeek', columns='Hour',
                                         values='Quantity', aggfunc='sum').fillna(0)

# Reorder days for visual clarity
day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
quantity_heatmap = quantity_heatmap.reindex(day_order)

# Plot the heatmap
plt.figure(figsize=(14, 6))
sns.heatmap(quantity_heatmap, cmap='PuBuGn', annot=True, fmt='.0f')

# Set titles and labels
plt.title('Time-based Heatmap: Quantity Sold by Day and Hour')
plt.xlabel('Hour of Day')
plt.ylabel('')
plt.show()

<div id="seasonal-holiday-trends-analysis"> 
    <h2>Seasonal & Holiday Trends Analysis</h2>    
</div>

In [None]:
# Plot bar & line plots to find seasonal sales trend
# Extract season (1=Winter, 2=Spring, etc.)
df_zscore['Season'] = df_zscore['InvoiceDate'].dt.month % 12 // 3 + 1

# Map numeric seasons to names
season_map = {1: 'Winter', 2: 'Spring', 3: 'Summer', 4: 'Autumn'}
df_zscore['Season'] = df_zscore['Season'].map(season_map)

# Calculate total sales for each season
season_sales = df_zscore.groupby('Season')['TotalPrice'].sum().reindex(['Winter', 'Spring', 'Summer', 'Autumn'])

# Plot combined barplot and lineplot
plt.figure(figsize=(10, 6))

# Plot the barplot
ax = sns.barplot(x=season_sales.index, y=season_sales.values, palette='Paired', label='Bar Plot')

# Plot the lineplot
ax2 = ax.twinx()
sns.lineplot(x=season_sales.index, y=season_sales.values, color='darkred', marker='o', ax=ax2, label='Line Plot')

# Add value labels on top of each bar
for p in ax.patches:
    ax.annotate(f'{p.get_height():,.0f}', (p.get_x() + p.get_width() / 2, p.get_height()),
                ha='center', va='bottom', fontsize=8, color='black', fontweight='bold')

# Set title and labels
ax.set_title('Seasonal Sales Trend')
ax.set_ylabel('Total Sales (Bar Plot)')
ax2.set_ylabel('Total Sales (Line Plot)', color='darkred')
ax.set_xlabel('')
ax.set_yticks([])
ax2.set_yticks([])
plt.show()
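
The expression `month % 12 // 3 + 1` folds December into the same group as January and February. A quick check over all twelve months (the helper names `months`, `seasons` and `month_to_season` are only for illustration) confirms the mapping used above:

```python
import pandas as pd

season_map = {1: 'Winter', 2: 'Spring', 3: 'Summer', 4: 'Autumn'}

# Apply the same arithmetic to every calendar month
months = pd.Series(range(1, 13))
seasons = (months % 12 // 3 + 1).map(season_map)

month_to_season = dict(zip(months.tolist(), seasons.tolist()))
print(month_to_season)
```

These are meteorological seasons for the Northern Hemisphere, with each season covering three full calendar months.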

In [None]:
# Plot bar & line plots to find seasonal customer activity
# Count unique customers per season
active_customers = df_zscore.groupby('Season')['CustomerID'].nunique().reindex(['Winter', 'Spring', 'Summer', 'Autumn'])

# Plot combined barplot and lineplot
plt.figure(figsize=(10, 6))

# Plot the barplot
ax = sns.barplot(x=active_customers.index, y=active_customers.values, palette='tab20', label='Bar Plot')

# Plot the lineplot
ax2 = ax.twinx()
sns.lineplot(x=active_customers.index, y=active_customers.values, color='darkred', marker='o', ax=ax2, label='Line Plot')

# Add value labels on top of each bar
for p in ax.patches:
    ax.annotate(f'{p.get_height():.0f}', (p.get_x() + p.get_width() / 2, p.get_height()),
                ha='center', va='bottom', fontsize=8, color='black', fontweight='bold')

# Set title and labels
ax.set_title('Active Customers by Season')
ax.set_ylabel('Number of Unique Customers (Bar Plot)')
ax2.set_ylabel('Number of Unique Customers (Line Plot)', color='darkred')
ax.set_xlabel('')
ax.set_yticks([])
ax2.set_yticks([])
plt.show()

In [None]:
# Plot lines to compare between sales on weekends (Saturday, Sunday) and normal days (Weekdays)
# Extract date-only column for InvoiceDate
df_zscore['Date'] = df_zscore['InvoiceDate'].dt.date

# Identify weekends
df_zscore['DayOfWeek'] = df_zscore['InvoiceDate'].dt.dayofweek  # 0=Monday, 6=Sunday
df_zscore['DayType'] = df_zscore['DayOfWeek'].apply(lambda x: 'Weekend' if x in [5, 6] else 'Weekday')

# Aggregate daily sales by DayType
daily_sales = df_zscore.groupby(['Date', 'DayType'])['TotalPrice'].sum().reset_index()

# Create pivot table for easier plotting
pivot_df = daily_sales.pivot(index='Date', columns='DayType', values='TotalPrice')
pivot_df = pivot_df.fillna(0)

# Apply rolling average for smoothing
pivot_df['Weekday_Smoothed'] = pivot_df['Weekday'].rolling(window=14, center=True).mean()
pivot_df['Weekend_Smoothed'] = pivot_df['Weekend'].rolling(window=14, center=True).mean()

# Plot smoothed lines
plt.figure(figsize=(14,6))
plt.plot(pivot_df.index, pivot_df['Weekday_Smoothed'], label='Weekday Sales', color='darkred')
plt.plot(pivot_df.index, pivot_df['Weekend_Smoothed'], label='Weekend Sales', color='darkblue')

# Set title and labels
plt.title('Sales Trend: Weekdays vs Weekends')
plt.xlabel('')
plt.ylabel('Total Sales')
plt.legend()
plt.tight_layout()
plt.show()
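
One thing to keep in mind when reading this plot: after `fillna(0)`, every weekday date carries a 0 in the `Weekend` column and vice versa, so the rolling mean averages those zeros in. The sketch below uses made-up data (a constant 100 per day, and a 7-day window chosen only for illustration) to compare that behaviour with a possible alternative that leaves the gaps as NaN:

```python
import pandas as pd

# Hypothetical daily totals for two weeks starting on a Monday (not real data)
dates = pd.date_range('2011-01-03', periods=14, freq='D')
daily = pd.DataFrame({'Date': dates, 'TotalPrice': 100.0})
daily['DayType'] = daily['Date'].dt.dayofweek.map(lambda d: 'Weekend' if d >= 5 else 'Weekday')

wide = daily.pivot(index='Date', columns='DayType', values='TotalPrice')

# Approach used above: fill gaps with 0, then average over every calendar day in the window
diluted = wide.fillna(0).rolling(window=7, center=True).mean()

# Alternative sketch: keep gaps as NaN so the mean only covers days of that type
per_type = wide.rolling(window=7, center=True, min_periods=1).mean()

print(diluted.dropna().head(1))
print(per_type.iloc[[3]])
```

With zeros filled in, the smoothed weekday line sits near 5/7 of the true daily level and the weekend line near 2/7, so the gap between the two lines partly reflects how many days of each type fall in the window rather than how much is sold on them.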

In [None]:
# Plot a boxplot to compare the sales distribution between weekdays and weekends
# Extract DayType and TotalPrice for boxplot
boxplot_data = df_zscore[['DayType', 'TotalPrice']]

# Plot the boxplot
plt.figure(figsize=(12, 6))
sns.boxplot(data=boxplot_data, x='DayType', y='TotalPrice', palette=['#3498db', '#e74c3c'])

# Set title and labels
plt.title('Comparison of Sales Distribution: Weekdays vs Weekends')
plt.xlabel('')
plt.ylabel('Total Sales')
plt.tight_layout()
plt.show()

In [None]:
# Plot sales trends to find the effect of holidays on sales
# Define holiday dates
holiday_dates = {
    'Christmas': pd.to_datetime('2010-12-25'),
    'Valentine\'s Day': pd.to_datetime('2011-02-14'),
    'Black Friday': pd.to_datetime('2011-11-25')
}

# Group total sales by day
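# (the day is derived from 'InvoiceDate' here so the cell does not depend on a column created elsewhere)
daily_sales = (df_zscore.groupby(df_zscore['InvoiceDate'].dt.normalize().rename('InvoiceDayOnly'))['TotalPrice']
               .sum().reset_index())
daily_sales['InvoiceDayOnly'] = pd.to_datetime(daily_sales['InvoiceDayOnly'])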

# Calculate average daily sales across all days with sales (holidays included)
avg_daily_sales = daily_sales['TotalPrice'].mean()

# Plot sales trends around holidays and compare with average
plt.figure(figsize=(14, 6))
for holiday_name, holiday_date in holiday_dates.items():
    window_start = holiday_date - pd.Timedelta(days=7)
    window_end = holiday_date + pd.Timedelta(days=7)

    sales_window = daily_sales[
        (daily_sales['InvoiceDayOnly'] >= window_start) &
        (daily_sales['InvoiceDayOnly'] <= window_end)
    ]

    plt.plot(sales_window['InvoiceDayOnly'],
             sales_window['TotalPrice'],
             marker='o',
             label=f"{holiday_name} Window")

    # Highlight the exact holiday point
    exact_point = sales_window[sales_window['InvoiceDayOnly'] == holiday_date]
    if not exact_point.empty:
        plt.scatter(holiday_date,
                    exact_point['TotalPrice'].values[0],
                    color='red',
                    s=120,
                    edgecolors='black',
                    zorder=5,
                    label=f"{holiday_name}")

# Plot average daily sales as a horizontal line
plt.axhline(y=avg_daily_sales, color='darkred', linestyle='--', linewidth=2, label='Average Daily Sales')

# Set title and labels
plt.title('Sales Around Holidays Compared to Regular Daily Sales')
plt.xlabel('')
plt.ylabel('Total Sales')
plt.legend()
plt.tight_layout()
plt.show()