<a href="https://colab.research.google.com/github/Nataliia-data-analyst/Nataliia-data-analyst/blob/main/Online_Retail_Dataset_Data_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# datetime, warnings and csv ship with Python, and matplotlib.pyplot is part of matplotlib,
# so none of them is installed separately
!pip install pandas numpy matplotlib seaborn scikit-learn statsmodels
import pandas as pd
import warnings
import csv
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from datetime import datetime
from statsmodels.tsa.arima.model import ARIMA
from sklearn.cluster import KMeans
warnings.filterwarnings("ignore")

In [None]:
from google.colab import drive
drive.mount('/content/drive')

We will be working with the "Online Retail" dataset. It contains online store transactions and provides detailed information about each purchase: the invoice number, product code, quantity of items sold, date, product price, customer ID, and the buyer's country.

The dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/352/online+retail), titled "Online Retail", contains transactional data of a UK-based online retail company. Below is the description of the columns in this dataset:

InvoiceNo: A unique number assigned to each invoice. It indicates the specific transaction. If the invoice starts with a 'C', it denotes a cancellation.

StockCode: A unique code assigned to each product (item) in the transaction.

Description: A description of the product (item) associated with the StockCode.

Quantity: The number of products (items) per transaction (can be negative for product returns).

InvoiceDate: The date and time when the transaction occurred.

UnitPrice: The price of a single product (item) at the time of the transaction.

CustomerID: A unique identifier for each customer.

Country: The country where the customer is located.

This dataset contains transactions between 01/12/2010 and 09/12/2011, primarily involving wholesale customers across different countries.
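
The column notes above say that an InvoiceNo starting with 'C' marks a cancellation. Here is a minimal sketch of how such rows could be flagged. The three sample rows are hypothetical and only illustrate the idea; the real values come from the CSV file.

```python
import pandas as pd

# Hypothetical sample rows (not taken from the dataset)
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536381"],
    "Quantity": [6, -1, 2],
})

# Cast to str first, because pandas may parse purely numeric invoice numbers as integers
sample["IsCancellation"] = sample["InvoiceNo"].astype(str).str.startswith("C")

# Keep only the cancelled invoices
cancelled = sample[sample["IsCancellation"]]
print(cancelled)
```

The same expression should work on df once the file is loaded, assuming the column is named InvoiceNo as described above.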

In [None]:
df = pd.read_csv('/content/drive/MyDrive/PythonMarathon/Colab Notebooks/Online Retail.csv')

In [None]:
type(df)

View the first 5 records:

In [None]:
df.head() # df.head(5)

We can review specific rows in the data as follows:

In [None]:
df[100:110]

Let's display the first 3 rows of data from the variable df.

In [None]:
df.head(3)

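Let's display the rows at positions 2010 through 2015 from the variable df. Note that these are row positions, not years, and that the end of the slice (2016) is not included.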

In [None]:
df[2010:2016]

Let's find out some general characteristics of the columns.

In [None]:
df.info()

In [None]:
type(df.StockCode.loc[0])

The df.info() output above shows the data types in each column, whether all data is filled, and how much memory the data occupies in RAM. Now let's check the size of the table: the number of rows and columns.

In [None]:
df.shape

## Let's start the analysis!

What countries are represented in this dataset?

In [None]:
df.Country

Let's **count how many times each country** is represented.

In [None]:
df.Country.value_counts()

Let's take a look at the **last 5 countries**.

In [None]:
df.Country.value_counts()[-5:]

Let's display the top 10 countries by the number of rows in the variable df, along with the corresponding number of rows.

In [None]:
df.Country.value_counts()[:10]

What are **the most popular products** in the store?

In [None]:
df.StockCode.value_counts()

## Data filtering

It would be interesting to find out what this popular product is. Let's take a look at the description.

In [None]:
df.StockCode=='85123A'

In [None]:
df[df.StockCode=='85123A']

We see that the descriptions are different. Let's check what descriptions exist and how many there are.

In [None]:
df[df.StockCode=='85123A'].Description.value_counts()

We see that there are data entries where something went wrong (there are few of them, and they differ from the majority). We can manually correct or clean them if they are not needed for the analysis.

Let's take a look at the other records.

In [None]:
df[df.StockCode==22423]

Something went wrong. It seems that the value has a different type. How can we check this?

In [None]:
df.StockCode.value_counts().index

We see that the value 22423 is surrounded by quotes, which means it is a string and NOT a number. In Python, values in quotes are always strings.

In [None]:
type(22423), type('22423'), type('85123A')

In [None]:
22423 != '22423'

That's why our data wasn't filtered. Let's fix that!

In [None]:
df[df.StockCode=='22423']
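
To make the string-versus-number point concrete, here is a small sketch on a hypothetical Series of stock codes. It inspects the Python type of each value and shows that comparing with the integer matches nothing, while comparing with the string matches.

```python
import pandas as pd

# Hypothetical stock codes, stored as strings like in the dataset
codes = pd.Series(["22423", "85123A", "22423"], name="StockCode")

# Inspect the Python type of every value in the column
value_types = codes.map(type).unique()
print(value_types)

# Comparing strings with an integer never matches
matches_int = int((codes == 22423).sum())

# Comparing with the string does match
matches_str = int((codes == "22423").sum())

print(matches_int, matches_str)
```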

Let's find out what descriptions the purchased goods with a **StockCode** value of **22423** have, and how many times each of the description values occurs?

In [None]:
df[df.StockCode=='22423'].Description.value_counts()

Let's calculate the mean unit price of this product for the United Kingdom. Both conditions are combined into a single filter with &, and each condition is wrapped in parentheses.

In [None]:
df[(df.Country=='United Kingdom') & (df.StockCode=='22423')].UnitPrice.mean()

Let's calculate the mean for France.

In [None]:
df[(df.Country=='France') & (df.StockCode=='22423')].UnitPrice.mean()

Let's calculate the mean for Germany.

In [None]:
df[(df.Country=='Germany') & (df.StockCode=='22423')].UnitPrice.mean()

Let's find out what the average price of goods with a **StockCode** value of **22423** is in **Spain**.

In [None]:
df[(df.Country=='Spain') & (df.StockCode=='22423')].UnitPrice.mean()

Let's select the rows for all of the chosen countries at once. We will calculate the average per country a few cells below.

In [None]:
df[df.Country.isin(['Germany', 'France', 'United Kingdom']) & (df.StockCode=='22423')][['Country', 'UnitPrice']]

It's time to simplify the code. To avoid writing a large piece of code every time, let's store this structure in a variable:

In [None]:
df_filtered = df[df.Country.isin(['Germany', 'France', 'United Kingdom']) & (df.StockCode=='22423')][['Country', 'UnitPrice']]
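
A side note on filtering with several conditions. Applying one boolean filter after another, as in df[cond1][cond2], passes the second mask to an already shortened table, so pandas has to realign it. Combining the conditions with & in a single filter avoids that. The sketch below uses a hypothetical toy table; toy, mask and selected are illustration names only.

```python
import pandas as pd

# Hypothetical toy table with the same column names as the dataset
toy = pd.DataFrame({
    "Country": ["France", "Germany", "France", "Spain"],
    "StockCode": ["22423", "22423", "85123A", "22423"],
    "UnitPrice": [12.75, 10.95, 2.55, 12.75],
})

# Build one mask from both conditions; each condition needs its own parentheses
mask = toy["Country"].isin(["France", "Germany"]) & (toy["StockCode"] == "22423")

# Select rows by the mask and columns by name in a single .loc call
selected = toy.loc[mask, ["Country", "UnitPrice"]]
print(selected)
```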

In [None]:
df[['Country', 'UnitPrice']]

Let's display the filtered rows: each row shows the country and the unit price of the product in one transaction.

In [None]:
df_filtered

Now let's GROUP THE DATA BY COUNTRY! Then we will apply an aggregate function (here, the mean) to each group.

In [None]:
df_filtered.groupby('Country').mean()
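
Here is a small sketch of what groupby does, on hypothetical prices. Rows are split by country, and the aggregate functions are applied to each group separately. Asking for the count next to the mean also shows how many rows each average is based on.

```python
import pandas as pd

# Hypothetical prices for one product in two countries
toy = pd.DataFrame({
    "Country": ["France", "France", "Germany"],
    "UnitPrice": [10.0, 14.0, 11.0],
})

# One row per country, one column per aggregate function
summary = toy.groupby("Country")["UnitPrice"].agg(["mean", "count"])
print(summary)
```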

Without the intermediate variable, everything can be written directly on df in one expression:

In [None]:
df[df.Country.isin(['Germany', 'France', 'United Kingdom']) & (df.StockCode=='22423')][['Country', 'UnitPrice']].groupby('Country').mean()

Or like this

In [None]:
df[
    df.Country.isin(['Germany', 'France', 'United Kingdom'])
    & (df.StockCode=='22423')
  ].groupby('Country').UnitPrice.mean()

What is the average price of ALL products in the countries Germany, France, and the United Kingdom?

In [None]:
df[df.Country.isin(['Germany', 'France', 'United Kingdom'])].groupby('Country').UnitPrice.mean()

## Data sorting

What is the cheapest product that is purchased in France?

In [None]:
df[df.Country=='France'].sort_values(by='StockCode') # Sorting by StockCode does not answer the question. Which field should we sort by?

In [None]:
df[df.Country=='France'].sort_values(by='UnitPrice')

Manual entries (StockCode 'M') are not very interesting to us. Let's remove them from this analysis.

In [None]:
df[(df.Country=='France') & (df.StockCode!='M')].sort_values(by='UnitPrice')

Let's save this DataFrame in a variable so that we can work with it another time

In [None]:
df_fr_price_sorted = df[(df.Country=='France') & (df.StockCode!='M')].sort_values(by='UnitPrice')

Let's sort in descending order:

In [None]:
df_fr_price_sorted.sort_values(by='UnitPrice', ascending=False).iloc[0]

In [None]:
df_fr_price_sorted

Let's find the product that was purchased in the largest quantity in the DataFrame df_fr_price_sorted. We'll display the product and write a conclusion about which product it is (with which InvoiceNo and StockCode).

In [None]:
df_fr_price_sorted.sort_values('Quantity', ascending=False).iloc[0]

In [None]:
df.to_csv('/content/drive/My Drive/mydata.csv', index=False)

In [None]:
df_fr_price_sorted.to_csv('/content/drive/My Drive/online_retail_Fr_UnitPrice_sorted.csv')

In [None]:
new_df = pd.read_csv('/content/drive/My Drive/online_retail_Fr_UnitPrice_sorted.csv', index_col=0)

In [None]:
new_df.head()

Attention! Files written to the Colab session's local disk (anywhere outside /content/drive) exist only in this session. To keep a file, either save it directly to your Google Drive, as in the next cell, or download it locally and upload it to Google Drive manually.

In [None]:
df_fr_price_sorted.to_csv('/content/drive/MyDrive/online_retail_Fr_UnitPrice_sorted.csv')

## Descriptive Analysis
1. Sales Overview:

In [None]:
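# Keep only rows with a positive price and quantity (this drops returns and cancellations)
# .copy() gives an independent table, so adding a column below does not trigger a pandas warning
df = df[(df['UnitPrice'] > 0) & (df['Quantity'] > 0)].copy()
# Calculate total revenue
df['TotalRevenue'] = df['UnitPrice'] * df['Quantity']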
total_revenue = df['TotalRevenue'].sum()

print(f"Total sales revenue generated in the dataset: £{total_revenue:.2f}")

In [None]:
#How many unique products were sold?
# Count unique products sold based on the 'StockCode' column
unique_products = df['StockCode'].nunique()

print(f"Number of unique products sold: {unique_products}")

2. Customer Insights:

  2.1. How many unique customers are there in the dataset?




In [None]:
# Count unique customers based on the 'CustomerID' column
# nunique() ignores missing (NaN) CustomerID values by default
unique_customers = df['CustomerID'].nunique()

print(f"Number of unique customers in the dataset: {unique_customers}")

  2.2. What is the average order value for customers?


In [None]:
# Remove any rows where 'UnitPrice' or 'Quantity' is negative or NaN
df = df[(df['UnitPrice'] > 0) & (df['Quantity'] > 0)]

# Calculate total revenue for each transaction
df['TotalRevenue'] = df['UnitPrice'] * df['Quantity']

# Group by 'InvoiceNo' and calculate the sum of total revenue for each order
order_values = df.groupby('InvoiceNo')['TotalRevenue'].sum()

# Calculate the average order value
average_order_value = order_values.mean()

print(f"Average order value for customers: £{average_order_value:.2f}")
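
Why group by InvoiceNo before taking the mean? One order usually spans several rows, so averaging the rows directly gives the average line value, not the average order value. A minimal sketch on hypothetical invoice lines:

```python
import pandas as pd

# Hypothetical invoice lines: order A has two lines, order B has one
lines = pd.DataFrame({
    "InvoiceNo": ["A", "A", "B"],
    "UnitPrice": [5.0, 10.0, 10.0],
    "Quantity": [2, 1, 3],
})
lines["TotalRevenue"] = lines["UnitPrice"] * lines["Quantity"]

# Sum the lines of each order first, then average over orders
order_values = lines.groupby("InvoiceNo")["TotalRevenue"].sum()
average_order_value = order_values.mean()

# Averaging the lines directly answers a different question
average_line_value = lines["TotalRevenue"].mean()

print(f"Average order value: {average_order_value:.2f}")
print(f"Average line value: {average_line_value:.2f}")
```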

3. Product Analysis:

  3.1. What are the top-selling products by quantity sold?

In [None]:
# Remove any rows where 'Quantity' is negative or NaN
df = df[df['Quantity'] > 0]

# Group by 'StockCode' or 'Description' and calculate total quantity sold
top_selling_products = df.groupby('Description')['Quantity'].sum().reset_index()

# Sort the results by quantity sold in descending order
top_selling_products = top_selling_products.sort_values(by='Quantity', ascending=False)

# Display the top 10 selling products
top_10_products = top_selling_products.head(10)

print("Top-selling products by quantity sold:")
print(top_10_products)

  3.2. Which products have the highest revenue?

In [None]:
# Remove any rows where 'UnitPrice' or 'Quantity' is negative or NaN
df = df[(df['UnitPrice'] > 0) & (df['Quantity'] > 0)]

# Calculate total revenue for each transaction
df['TotalRevenue'] = df['UnitPrice'] * df['Quantity']

# Group by 'Description' and calculate total revenue
revenue_by_product = df.groupby('Description')['TotalRevenue'].sum().reset_index()

# Sort the results by revenue in descending order
revenue_by_product = revenue_by_product.sort_values(by='TotalRevenue', ascending=False)

# Display the top 10 products by revenue
top_revenue_products = revenue_by_product.head(10)

print("Products with the highest revenue:")
print(top_revenue_products)

## Temporal Analysis
4. Sales Trends:

  4.1. How do sales figures vary month over month?

In [None]:
# Remove any rows where 'UnitPrice' or 'Quantity' is negative or NaN
df = df[(df['UnitPrice'] > 0) & (df['Quantity'] > 0)]

# Convert 'InvoiceDate' to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Calculate total revenue for each transaction
df['TotalRevenue'] = df['UnitPrice'] * df['Quantity']

# Create a new DataFrame with month and year
df['Month'] = df['InvoiceDate'].dt.to_period('M')

# Group by month and calculate total revenue
monthly_sales = df.groupby('Month')['TotalRevenue'].sum().reset_index()

# Convert 'Month' back to datetime for plotting
monthly_sales['Month'] = monthly_sales['Month'].dt.to_timestamp()

# Display the monthly sales
print("Monthly sales revenue:")
print(monthly_sales)

# Plotting the monthly sales figures
plt.figure(figsize=(12, 6))
plt.plot(monthly_sales['Month'], monthly_sales['TotalRevenue'], marker='o')
plt.title('Monthly Sales Revenue')
plt.xlabel('Month')
plt.ylabel('Total Revenue (£)')
plt.xticks(rotation=45)
plt.grid()
plt.tight_layout()
plt.show()

  4.2. What are the peak sales periods (days, weeks, or months) in the dataset?

In [None]:
# Remove any rows where 'UnitPrice' or 'Quantity' is negative or NaN
df = df[(df['UnitPrice'] > 0) & (df['Quantity'] > 0)]

# Convert 'InvoiceDate' to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Calculate total revenue for each transaction
df['TotalRevenue'] = df['UnitPrice'] * df['Quantity']

# Group by day and calculate total revenue
daily_sales = df.groupby(df['InvoiceDate'].dt.date)['TotalRevenue'].sum().reset_index()
daily_sales.columns = ['Date', 'TotalRevenue']

# Group by week (weeks ending on Monday) and calculate total revenue
weekly_sales = df.resample('W-MON', on='InvoiceDate')['TotalRevenue'].sum().reset_index()
weekly_sales.columns = ['Week', 'TotalRevenue']

# Group by month and calculate total revenue
monthly_sales = df.resample('M', on='InvoiceDate')['TotalRevenue'].sum().reset_index()
monthly_sales.columns = ['Month', 'TotalRevenue']

# Identify peak sales periods
peak_daily = daily_sales.loc[daily_sales['TotalRevenue'].idxmax()]
peak_weekly = weekly_sales.loc[weekly_sales['TotalRevenue'].idxmax()]
peak_monthly = monthly_sales.loc[monthly_sales['TotalRevenue'].idxmax()]

# Display peak periods
print(f"Peak sales day: {peak_daily['Date']} with revenue: £{peak_daily['TotalRevenue']:.2f}")
print(f"Peak sales week: {peak_weekly['Week']} with revenue: £{peak_weekly['TotalRevenue']:.2f}")
print(f"Peak sales month: {peak_monthly['Month']} with revenue: £{peak_monthly['TotalRevenue']:.2f}")

# Optional: Plotting the results for visualization
plt.figure(figsize=(12, 6))
plt.plot(daily_sales['Date'], daily_sales['TotalRevenue'], label='Daily Sales', marker='o', alpha=0.7)
plt.title('Daily Sales Revenue')
plt.xlabel('Date')
plt.ylabel('Total Revenue (£)')
plt.xticks(rotation=45)
plt.grid()
plt.tight_layout()
plt.legend()
plt.show()

plt.figure(figsize=(12, 6))
plt.plot(weekly_sales['Week'], weekly_sales['TotalRevenue'], label='Weekly Sales', marker='o', alpha=0.7)
plt.title('Weekly Sales Revenue')
plt.xlabel('Week')
plt.ylabel('Total Revenue (£)')
plt.xticks(rotation=45)
plt.grid()
plt.tight_layout()
plt.legend()
plt.show()

plt.figure(figsize=(12, 6))
plt.plot(monthly_sales['Month'], monthly_sales['TotalRevenue'], label='Monthly Sales', marker='o', alpha=0.7)
plt.title('Monthly Sales Revenue')
plt.xlabel('Month')
plt.ylabel('Total Revenue (£)')
plt.xticks(rotation=45)
plt.grid()
plt.tight_layout()
plt.legend()
plt.show()
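
To see how the daily and weekly grouping works, here is a sketch on four hypothetical transactions. With the 'W-MON' frequency each weekly bin ends on a Monday and is labelled with that Monday's date. The names toy, daily and weekly are illustration names only.

```python
import pandas as pd

# Hypothetical transactions; 2011-01-03 is a Monday
toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(["2011-01-03", "2011-01-04", "2011-01-04", "2011-01-11"]),
    "TotalRevenue": [10.0, 12.0, 8.0, 5.0],
})

# Daily totals: group by the calendar date
daily = toy.groupby(toy["InvoiceDate"].dt.date)["TotalRevenue"].sum()

# Weekly totals: each bin ends on a Monday and is labelled with that Monday
weekly = toy.resample("W-MON", on="InvoiceDate")["TotalRevenue"].sum()

# idxmax() returns the label of the largest total
print("Peak day:", daily.idxmax(), daily.max())
print("Peak week ending:", weekly.idxmax().date(), weekly.max())
```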

5. Customer Purchase Patterns:

  5.1. How often do customers make repeat purchases?

In [None]:
# Display the first few rows of the DataFrame to understand its structure
print(df.head())

# Remove any rows where 'UnitPrice' or 'Quantity' is negative or NaN
df = df[(df['UnitPrice'] > 0) & (df['Quantity'] > 0)]

# Group by 'CustomerID' and count the number of unique invoices per customer
repeat_purchases = df.groupby('CustomerID')['InvoiceNo'].nunique().reset_index()

# Rename the columns for clarity
repeat_purchases.columns = ['CustomerID', 'NumUniquePurchases']

# Count the frequency of repeat purchases
repeat_counts = repeat_purchases['NumUniquePurchases'].value_counts().sort_index()

# Display the number of customers making repeat purchases
print("Frequency of Repeat Purchases:")
print(repeat_counts)

# Plotting the distribution of repeat purchases
plt.figure(figsize=(10, 6))
repeat_counts.plot(kind='bar')
plt.title('Distribution of Repeat Purchases by Number of Unique Purchases')
plt.xlabel('Number of Unique Purchases')
plt.ylabel('Number of Customers')
plt.xticks(rotation=0)
plt.grid(axis='y')
plt.tight_layout()
plt.show()

# Calculate percentage of customers with repeat purchases
num_repeat_customers = repeat_counts[repeat_counts.index > 1].sum()
total_customers = repeat_purchases['CustomerID'].nunique()
percentage_repeat_customers = (num_repeat_customers / total_customers) * 100

print(f"Percentage of customers making repeat purchases: {percentage_repeat_customers:.2f}%")

  5.2. What is the average time between a customer's first and last purchase?




In [None]:
# Remove any rows where 'UnitPrice' or 'Quantity' is negative or NaN
df = df[(df['UnitPrice'] > 0) & (df['Quantity'] > 0)]

# Convert 'InvoiceDate' to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Group by 'CustomerID' and get the first and last purchase date
customer_purchase_dates = df.groupby('CustomerID')['InvoiceDate'].agg(['min', 'max']).reset_index()

# Calculate the time difference between first and last purchase
customer_purchase_dates['TimeDifference'] = customer_purchase_dates['max'] - customer_purchase_dates['min']

# Calculate the average time difference
average_time_difference = customer_purchase_dates['TimeDifference'].mean()

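# Display the average time difference in days, including the fractional part
# (.days alone would drop everything after the whole days)
average_days = average_time_difference / pd.Timedelta(days=1)
print(f"Average time between a customer's first and last purchase: {average_days:.2f} days")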
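
A note on converting a time difference to days. The .days attribute of a Timedelta returns only the whole days, so hours are lost. Dividing by a one-day Timedelta keeps the fraction. A small sketch on hypothetical purchase dates:

```python
import pandas as pd

# Hypothetical purchases: customer 1 buys twice, 1.5 days apart; customer 2 buys once
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "InvoiceDate": pd.to_datetime(["2011-01-01 00:00", "2011-01-02 12:00", "2011-03-01 09:00"]),
})

# First and last purchase per customer, and the span between them
dates = toy.groupby("CustomerID")["InvoiceDate"].agg(["min", "max"])
spans = dates["max"] - dates["min"]
mean_span = spans.mean()

# Whole days only
whole_days = mean_span.days

# Days as a float
fractional_days = mean_span / pd.Timedelta(days=1)

print(whole_days, fractional_days)
```

Customers with a single purchase have a span of zero and pull the average down, which is worth keeping in mind when reading the result.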

## Geographic Analysis
6. Sales by Country:
  
  6.1. Which countries generate the most revenue?


In [None]:
# Remove any rows where 'UnitPrice' or 'Quantity' is negative or NaN
df = df[(df['UnitPrice'] > 0) & (df['Quantity'] > 0)]

# Calculate total revenue for each transaction
df['TotalRevenue'] = df['UnitPrice'] * df['Quantity']

# Group by country and calculate total revenue
country_revenue = df.groupby('Country')['TotalRevenue'].sum().reset_index()

# Sort the countries by revenue in descending order
country_revenue = country_revenue.sort_values(by='TotalRevenue', ascending=False)

# Display the top countries by revenue
print("Countries generating the most revenue:")
print(country_revenue.head(10))

# Plotting the revenue by country for visualization
plt.figure(figsize=(12, 6))
plt.barh(country_revenue['Country'][:10], country_revenue['TotalRevenue'][:10], color='skyblue')
plt.title('Top 10 Countries by Revenue')
plt.xlabel('Total Revenue (£)')
plt.ylabel('Country')
plt.grid(axis='x')
plt.tight_layout()
plt.show()

  6.2. What is the average order value per country?



In [None]:
# Remove any rows where 'UnitPrice' or 'Quantity' is negative or NaN
df = df[(df['UnitPrice'] > 0) & (df['Quantity'] > 0)]

# Calculate total revenue for each transaction
df['TotalRevenue'] = df['UnitPrice'] * df['Quantity']

# Group by country to calculate total revenue and number of unique orders
country_summary = df.groupby('Country').agg(
    TotalRevenue=('TotalRevenue', 'sum'),
    UniqueOrders=('InvoiceNo', 'nunique')
).reset_index()

# Calculate the average order value per country
country_summary['AverageOrderValue'] = country_summary['TotalRevenue'] / country_summary['UniqueOrders']

# Display the average order value per country
print("Average Order Value per Country:")
print(country_summary[['Country', 'AverageOrderValue']].sort_values(by='AverageOrderValue', ascending=False))

# Optional: Plotting the average order value per country for visualization
plt.figure(figsize=(12, 6))
plt.barh(country_summary['Country'], country_summary['AverageOrderValue'], color='lightgreen')
plt.title('Average Order Value per Country')
plt.xlabel('Average Order Value (£)')
plt.ylabel('Country')
plt.grid(axis='x')
plt.tight_layout()
plt.show()

## Customer Behavior
7. Segmentation:

  7.1. Can we segment customers based on their purchasing behavior (e.g., frequent vs. infrequent buyers)?

In [None]:
# Remove any rows where 'UnitPrice' or 'Quantity' is negative or NaN
df = df[(df['UnitPrice'] > 0) & (df['Quantity'] > 0)]

# Convert 'InvoiceDate' to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Calculate total revenue for each transaction
df['TotalRevenue'] = df['UnitPrice'] * df['Quantity']

# Group by customer and calculate relevant metrics
customer_segments = df.groupby('CustomerID').agg(
    NumPurchases=('InvoiceNo', 'nunique'),
    TotalSpent=('TotalRevenue', 'sum'),
    LastPurchaseDate=('InvoiceDate', 'max')
).reset_index()

# Calculate recency in days
current_date = df['InvoiceDate'].max()  # Get the most recent date in the dataset
customer_segments['Recency'] = (current_date - customer_segments['LastPurchaseDate']).dt.days

# Define segments based on purchasing behavior
def segment_customers(row):
    if row['NumPurchases'] >= 10:
        return 'Frequent Buyer'
    elif row['NumPurchases'] >= 3:
        return 'Moderate Buyer'
    else:
        return 'Infrequent Buyer'

customer_segments['Segment'] = customer_segments.apply(segment_customers, axis=1)

# Display segmented customers
print("Customer Segmentation:")
print(customer_segments[['CustomerID', 'NumPurchases', 'TotalSpent', 'Recency', 'Segment']].head(10))

# Optional: Visualize the segments
plt.figure(figsize=(10, 6))
customer_segments['Segment'].value_counts().plot(kind='bar', color='lightblue')
plt.title('Customer Segmentation by Purchasing Behavior')
plt.xlabel('Customer Segment')
plt.ylabel('Number of Customers')
plt.xticks(rotation=0)
plt.grid(axis='y')
plt.tight_layout()
plt.show()
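
The same segments can also be assigned without a row-by-row function. The sketch below is one possible alternative, not the method used above: it uses pd.cut with bin edges chosen to match the thresholds (1 to 2, 3 to 9, and 10 or more purchases). The purchase counts are hypothetical.

```python
import pandas as pd

# Hypothetical purchase counts per customer
num_purchases = pd.Series([1, 2, 3, 9, 10, 25], name="NumPurchases")

# Bins are (0, 2], (2, 9], (9, inf), which matches the thresholds >= 3 and >= 10
segments = pd.cut(
    num_purchases,
    bins=[0, 2, 9, float("inf")],
    labels=["Infrequent Buyer", "Moderate Buyer", "Frequent Buyer"],
)

print(pd.concat([num_purchases, segments.rename("Segment")], axis=1))
```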


  7.2. How does customer behavior vary by country or region?


In [None]:
# Remove any rows where 'UnitPrice' or 'Quantity' is negative or NaN
df = df[(df['UnitPrice'] > 0) & (df['Quantity'] > 0)]

# Calculate total revenue for each transaction
df['TotalRevenue'] = df['UnitPrice'] * df['Quantity']

# Group by country to calculate total revenue, number of unique customers, and average order value
country_summary = df.groupby('Country').agg(
    TotalRevenue=('TotalRevenue', 'sum'),
    UniqueCustomers=('CustomerID', 'nunique'),
    NumPurchases=('InvoiceNo', 'nunique')
).reset_index()

# Calculate average order value (AOV)
country_summary['AverageOrderValue'] = country_summary['TotalRevenue'] / country_summary['NumPurchases']

# Sort countries by total revenue for better visualization
country_summary = country_summary.sort_values(by='TotalRevenue', ascending=False)

# Display the summary
print("Customer Behavior by Country:")
print(country_summary)

# Visualize total revenue by country
plt.figure(figsize=(12, 6))
plt.barh(country_summary['Country'], country_summary['TotalRevenue'], color='lightblue')
plt.title('Total Revenue by Country')
plt.xlabel('Total Revenue (£)')
plt.ylabel('Country')
plt.grid(axis='x')
plt.tight_layout()
plt.show()

# Visualize average order value by country
plt.figure(figsize=(12, 6))
plt.barh(country_summary['Country'], country_summary['AverageOrderValue'], color='lightgreen')
plt.title('Average Order Value by Country')
plt.xlabel('Average Order Value (£)')
plt.ylabel('Country')
plt.grid(axis='x')
plt.tight_layout()
plt.show()

# Visualize number of unique customers by country
plt.figure(figsize=(12, 6))
plt.barh(country_summary['Country'], country_summary['UniqueCustomers'], color='salmon')
plt.title('Number of Unique Customers by Country')
plt.xlabel('Number of Unique Customers')
plt.ylabel('Country')
plt.grid(axis='x')
plt.tight_layout()
plt.show()

8. Abandoned Carts:

Are there indications of abandoned carts or incomplete purchases in the data?

In [None]:
# Remove any rows where 'UnitPrice' or 'Quantity' is negative or NaN
df = df[(df['UnitPrice'] > 0) & (df['Quantity'] > 0)]

# Count line items per invoice ('count' counts rows, not unique products)
invoice_counts = df.groupby('InvoiceNo').agg(
    NumItems=('StockCode', 'count'),
    TotalRevenue=('TotalRevenue', 'sum'),
    UniqueCustomers=('CustomerID', 'nunique')
).reset_index()

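# Potential indicators of abandoned carts
# (a rough proxy only: the dataset records invoiced transactions, so abandoned carts are not directly visible)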
abandoned_carts = invoice_counts[(invoice_counts['NumItems'] == 1) |
                                  (invoice_counts['TotalRevenue'] == 0)]

# Display abandoned carts
print("Potential Abandoned Carts or Incomplete Purchases:")
print(abandoned_carts)

# Visualizing potential abandoned carts
plt.figure(figsize=(10, 6))
plt.hist(invoice_counts['NumItems'], bins=range(1, invoice_counts['NumItems'].max() + 2), alpha=0.7, color='orange', edgecolor='black')
plt.axvline(x=1, color='red', linestyle='dashed', linewidth=2)
plt.title('Number of Items per Invoice')
plt.xlabel('Number of Items')
plt.ylabel('Frequency')
plt.xticks(range(1, invoice_counts['NumItems'].max() + 1))
plt.grid(axis='y')
plt.tight_layout()
plt.show()

## Product Performance
9. Product Returns:

  9.1. What is the return rate for different products?

In [None]:
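# Earlier cells kept only positive quantities in df, which removed every return.
# Reload the raw data so that rows with negative quantities are available again.
df_all = pd.read_csv('/content/drive/MyDrive/PythonMarathon/Colab Notebooks/Online Retail.csv')

# Data cleaning: keep rows with a non-zero quantity and a positive unit price
df_all = df_all[(df_all['Quantity'] != 0) & (df_all['UnitPrice'] > 0)]

# Identify returns: rows with negative quantities
returns = df_all[df_all['Quantity'] < 0].copy()

# Convert returned quantities to positive numbers for analysis
returns['Quantity'] = returns['Quantity'].abs()

# Group purchases (positive quantities only) by product (StockCode) to calculate total sold
sales = df_all[df_all['Quantity'] > 0].groupby('StockCode').agg(
    TotalSold=('Quantity', 'sum')
).reset_index()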

# Group returns by product to calculate total returned
returns_summary = returns.groupby('StockCode').agg(
    TotalReturned=('Quantity', 'sum')
).reset_index()

# Merge sales and returns dataframes
return_rates = pd.merge(sales, returns_summary, on='StockCode', how='left').fillna(0)

# Calculate return rate
return_rates['ReturnRate'] = return_rates['TotalReturned'] / return_rates['TotalSold']

# Sort return rates in descending order
return_rates = return_rates.sort_values(by='ReturnRate', ascending=False)

# Display return rates for the top 10 products
print("Return Rates for Different Products:")
print(return_rates[['StockCode', 'TotalSold', 'TotalReturned', 'ReturnRate']].head(10))

# Visualize return rates for the top 20 products
plt.figure(figsize=(12, 6))
plt.bar(return_rates['StockCode'].astype(str).head(20), return_rates['ReturnRate'].head(20), color='lightcoral')
plt.title('Return Rates for Different Products')
plt.xlabel('Stock Code')
plt.ylabel('Return Rate')
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.tight_layout()
plt.show()

print(f"Number of products with a calculated return rate: {len(return_rates)}")
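
The return rate only makes sense when sold and returned quantities are summed separately. If negative quantities stay in the "sold" total, the denominator shrinks and the rate is overstated. A minimal sketch on hypothetical rows, where product A sells 10 units and gets 2 back:

```python
import pandas as pd

# Hypothetical rows: negative quantities are returns
toy = pd.DataFrame({
    "StockCode": ["A", "A", "B"],
    "Quantity": [10, -2, 5],
})

# Sum purchases and returns separately
sold = toy[toy["Quantity"] > 0].groupby("StockCode")["Quantity"].sum().rename("TotalSold")
returned = (
    toy[toy["Quantity"] < 0].groupby("StockCode")["Quantity"].sum().abs().rename("TotalReturned")
)

# Products without returns get 0 instead of NaN
rates = pd.concat([sold, returned], axis=1).fillna(0)
rates["ReturnRate"] = rates["TotalReturned"] / rates["TotalSold"]

print(rates)
```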



  9.2. Are there specific products that have a high incidence of returns?




In [None]:
# Display the first few rows of the dataset
print("First few rows of the dataset:")
print(df.head())

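# Reload the raw data: earlier cells kept only positive quantities in df, which removed every return
df_all = pd.read_csv('/content/drive/MyDrive/PythonMarathon/Colab Notebooks/Online Retail.csv')

# Keep rows with a non-zero quantity, a positive unit price, and a known invoice and customer
df_all = df_all[(df_all['Quantity'] != 0) & (df_all['UnitPrice'] > 0)]
df_all = df_all.dropna(subset=['InvoiceNo', 'CustomerID'])

# Identify returns: rows with negative quantities
returns = df_all[df_all['Quantity'] < 0].copy()

# Convert returned quantities to positive numbers for analysis
returns['Quantity'] = returns['Quantity'].abs()

# Group purchases (positive quantities only) by product (StockCode) to calculate total sold
sales = df_all[df_all['Quantity'] > 0].groupby('StockCode').agg(
    TotalSold=('Quantity', 'sum')
).reset_index()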

# Group by product to calculate total returned
returns_summary = returns.groupby('StockCode').agg(
    TotalReturned=('Quantity', 'sum')
).reset_index()

# Merge sales and returns dataframes
return_rates = pd.merge(sales, returns_summary, on='StockCode', how='left').fillna(0)

# Calculate return rate
return_rates['ReturnRate'] = return_rates['TotalReturned'] / return_rates['TotalSold']

# Identify products with high return rates (e.g., > 20%)
high_return_threshold = 0.20
high_return_products = return_rates[return_rates['ReturnRate'] > high_return_threshold]

# Sort by return rate
high_return_products = high_return_products.sort_values(by='ReturnRate', ascending=False)

# Display high return products
print("Products with High Return Rates (>20%):")
print(high_return_products[['StockCode', 'TotalSold', 'TotalReturned', 'ReturnRate']].head(10))

# Visualize high return rates
plt.figure(figsize=(12, 6))
plt.bar(high_return_products['StockCode'].astype(str).head(10), high_return_products['ReturnRate'].head(10), color='lightcoral')
plt.title('Products with High Return Rates')
plt.xlabel('Stock Code')
plt.ylabel('Return Rate')
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.tight_layout()
plt.show()

print(f"Number of products with high return rates: {len(high_return_products)}")


10. Seasonality:

Are there seasonal trends in product sales (e.g., holiday spikes)?




In [None]:
# Remove rows where 'InvoiceDate' or 'Quantity' is NaN, then keep only positive quantities
df.dropna(subset=['InvoiceDate', 'Quantity'], inplace=True)
df = df[df['Quantity'] > 0]

# Convert 'InvoiceDate' to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Create a new column for Month-Year
df['MonthYear'] = df['InvoiceDate'].dt.to_period('M')

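# Revenue per row is price times quantity (summing UnitPrice alone would ignore quantities)
df['TotalRevenue'] = df['UnitPrice'] * df['Quantity']

# Group by Month-Year to get total quantity sold and total revenue
monthly_sales = df.groupby('MonthYear').agg(
    TotalSales=('Quantity', 'sum'),
    TotalRevenue=('TotalRevenue', 'sum')
).reset_index()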

# Visualizing Monthly Sales
plt.figure(figsize=(12, 6))
plt.plot(monthly_sales['MonthYear'].astype(str), monthly_sales['TotalRevenue'], marker='o', color='blue')
plt.title('Monthly Sales Revenue Over Time')
plt.xlabel('Month-Year')
plt.ylabel('Total Revenue')
plt.xticks(rotation=45)
plt.grid()
plt.tight_layout()
plt.show()

# Visualizing Monthly Sales Quantity
plt.figure(figsize=(12, 6))
plt.plot(monthly_sales['MonthYear'].astype(str), monthly_sales['TotalSales'], marker='o', color='orange')
plt.title('Monthly Sales Quantity Over Time')
plt.xlabel('Month-Year')
plt.ylabel('Total Quantity Sold')
plt.xticks(rotation=45)
plt.grid()
plt.tight_layout()
plt.show()

## Predictive Analysis
11. Forecasting:

  11.1. What are the forecasts for future sales based on historical data?
  

In [None]:
# Remove rows where 'InvoiceDate' or 'Quantity' is NaN, then keep only positive quantities
df.dropna(subset=['InvoiceDate', 'Quantity'], inplace=True)
df = df[df['Quantity'] > 0]

# Convert 'InvoiceDate' to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Create a new column for Month-Year
df['MonthYear'] = df['InvoiceDate'].dt.to_period('M')

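# Revenue per row is price times quantity (summing UnitPrice alone would ignore quantities)
df['TotalRevenue'] = df['UnitPrice'] * df['Quantity']

# Group by Month-Year to get total sales revenue
monthly_sales = df.groupby('MonthYear').agg(
    TotalRevenue=('TotalRevenue', 'sum')
).reset_index()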

# Convert MonthYear to datetime for modeling
monthly_sales['MonthYear'] = monthly_sales['MonthYear'].dt.to_timestamp()

# Set MonthYear as the index
monthly_sales.set_index('MonthYear', inplace=True)

# Fit the ARIMA model (order can be adjusted based on data characteristics)
model = ARIMA(monthly_sales['TotalRevenue'], order=(1, 1, 1))
model_fit = model.fit()

# Forecast future sales for the next 12 months
forecast = model_fit.forecast(steps=12)

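# Create a Series for the forecasted values, labelled with the 12 months after the last observed month
# ('MS' = month start, matching the historical index; the values are passed as a plain array
# so that pandas does not realign them to the new index and fill the result with NaN)
forecast_index = pd.date_range(start=monthly_sales.index[-1] + pd.DateOffset(months=1), periods=12, freq='MS')
forecast_series = pd.Series(np.asarray(forecast), index=forecast_index)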

# Plot the historical sales and the forecasted sales
plt.figure(figsize=(12, 6))
plt.plot(monthly_sales.index, monthly_sales['TotalRevenue'], label='Historical Sales', color='blue')
plt.plot(forecast_series.index, forecast_series, label='Forecasted Sales', color='orange', linestyle='--')
plt.title('Sales Forecast for the Next 12 Months')
plt.xlabel('Month-Year')
plt.ylabel('Total Revenue')
plt.legend()
plt.grid()
plt.tight_layout()
plt.show()
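
One pandas detail matters when forecast values get new date labels. If the object passed to pd.Series is itself a Series, pandas aligns it to the new index by label, and labels that do not match produce NaN. Passing the raw values keeps them in order. The sketch below uses a hypothetical two-month forecast and does not call the model itself.

```python
import pandas as pd

# Hypothetical forecast output, labelled with month-start dates
forecast = pd.Series([1.0, 2.0], index=pd.to_datetime(["2012-01-01", "2012-02-01"]))

# New labels that do not match the old ones (month-end dates)
new_index = pd.to_datetime(["2012-01-31", "2012-02-29"])

# Passing a Series: values are looked up by label, nothing matches
misaligned = pd.Series(forecast, index=new_index)

# Passing the raw values: values are assigned by position
aligned = pd.Series(forecast.to_numpy(), index=new_index)

print(misaligned)
print(aligned)
```

An empty forecast line on the plot is a typical symptom of the first case.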

  11.2. How can we predict future customer behavior based on past purchasing patterns?



In [None]:
# Data Cleaning
df.dropna(subset=['InvoiceDate', 'Quantity'], inplace=True)
df = df[df['Quantity'] > 0]

# Convert 'InvoiceDate' to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Create a 'TotalAmount' column
df['TotalAmount'] = df['Quantity'] * df['UnitPrice']

# Create a reference date
reference_date = df['InvoiceDate'].max() + pd.DateOffset(days=1)

# RFM Calculation
rfm_df = df.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (reference_date - x.max()).days,
    'InvoiceNo': 'nunique',  # Frequency = number of distinct invoices, not number of line items
    'TotalAmount': 'sum'
}).rename(columns={
    'InvoiceDate': 'Recency',
    'InvoiceNo': 'Frequency',
    'TotalAmount': 'Monetary'
})

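# Standardize the RFM features (z-scores) so no single feature dominates the distance metric
rfm_df_scaled = (rfm_df - rfm_df.mean()) / rfm_df.std()

# Clustering with K-Means (n_init is set explicitly because its default differs across scikit-learn versions)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
rfm_df['Cluster'] = kmeans.fit_predict(rfm_df_scaled)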

# Visualizing the segments
plt.figure(figsize=(10, 6))
sns.scatterplot(data=rfm_df, x='Recency', y='Monetary', hue='Cluster', palette='Set2')
plt.title('Customer Segments based on RFM Analysis')
plt.xlabel('Recency (Days Since Last Purchase)')
plt.ylabel('Monetary Value (£)')
plt.grid()
plt.show()

# Profiling a segment
# Example: summarize one cluster as a starting point for anticipating its future behavior
cluster_0 = rfm_df[rfm_df['Cluster'] == 0]
print("Cluster 0 Summary:\n", cluster_0.describe())
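
To make the three RFM measures concrete, here is a minimal sketch on toy data. All values are hypothetical, and Frequency is counted as distinct invoices, which is one common convention (counting rows instead would count line items).

```python
import pandas as pd

# Toy transactions (hypothetical values): customer 1 has two invoices, customer 2 has one
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(["2011-01-01", "2011-01-01", "2011-01-10", "2011-01-05"]),
    "TotalAmount": [10.0, 5.0, 20.0, 7.5],
})

# Reference date: one day after the latest transaction in the data
reference_date = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm_toy = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda x: (reference_date - x.max()).days),
    Frequency=("InvoiceNo", "nunique"),  # distinct orders, not line items
    Monetary=("TotalAmount", "sum"),
)
print(rfm_toy)
```

Customer 1 bought most recently and most often, so it ends up with the lowest Recency and the highest Frequency and Monetary values.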

## Operational Insights
12. Inventory Management:
  


  12.1. What are the inventory needs based on sales data?


In [None]:
# Display the first few rows and info about the dataset
print(df.head())
df.info()  # info() prints its report directly and returns None

# Data cleaning: drop rows with missing values, then keep only positive quantities and prices
df = df.dropna(subset=['Quantity', 'UnitPrice'])
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)].copy()

# Create a new column for total sales amount
df['TotalSales'] = df['Quantity'] * df['UnitPrice']

# Calculate total sales and average daily sales for each product
sales_per_product = df.groupby('StockCode').agg({
    'TotalSales': 'sum',
    'Quantity': 'sum',
    'InvoiceDate': 'nunique'
}).reset_index()

# Calculate average daily sales over the full date span of the dataset
# (assumes each product was available for the whole period)
days_active = (df['InvoiceDate'].max() - df['InvoiceDate'].min()).days + 1  # +1 to include last day
sales_per_product['AverageDailySales'] = sales_per_product['Quantity'] / days_active

# Assuming a lead time of 14 days for restocking and a safety stock of 20 units
lead_time = 14
safety_stock = 20

# Calculate Reorder Point (ROP)
sales_per_product['ReorderPoint'] = (sales_per_product['AverageDailySales'] * lead_time) + safety_stock

# Simulate current inventory (for demonstration; in practice, this would come from your inventory records)
# For this example, let's assume a random current inventory level between 0 and 150 for each product
np.random.seed(42)  # For reproducibility
sales_per_product['CurrentInventory'] = np.random.randint(0, 150, size=len(sales_per_product))

# Calculate the inventory needs (how much to reorder)
sales_per_product['InventoryNeeds'] = sales_per_product['ReorderPoint'] - sales_per_product['CurrentInventory']

# Filter for products that need reordering
inventory_needs = sales_per_product[sales_per_product['InventoryNeeds'] > 0]

# Display inventory needs
print("Inventory Needs Based on Sales Data:")
print(inventory_needs[['StockCode', 'CurrentInventory', 'ReorderPoint', 'InventoryNeeds']])

# Visualizing Inventory Needs
plt.figure(figsize=(12, 6))
sns.barplot(data=inventory_needs, x='StockCode', y='InventoryNeeds', color='orange')
plt.title('Inventory Needs for Products Based on Sales Data')
plt.xlabel('Product (StockCode)')
plt.ylabel('Inventory Needs (Units)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
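
The reorder-point arithmetic used in this section can be checked by hand. A minimal sketch, where the function names `reorder_point` and `units_to_reorder` and the input numbers are illustrative assumptions:

```python
def reorder_point(average_daily_sales: float, lead_time_days: int, safety_stock: float) -> float:
    """Stock level at which a new order should be placed."""
    return average_daily_sales * lead_time_days + safety_stock


def units_to_reorder(current_inventory: float, rop: float) -> float:
    """Units needed to bring inventory back up to the reorder point (never negative)."""
    return max(rop - current_inventory, 0)


# A product selling 5 units per day, with a 14-day lead time and 20 units of safety stock
rop = reorder_point(average_daily_sales=5, lead_time_days=14, safety_stock=20)
print(rop)                         # 90
print(units_to_reorder(60, rop))   # 30
print(units_to_reorder(120, rop))  # 0
```

Clamping at zero mirrors the `InventoryNeeds > 0` filter: products already above their reorder point need nothing.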

12.2. Which products are at risk of stockouts?



In [None]:
# Display the first few rows and info about the dataset
print(df.head())
df.info()  # info() prints its report directly and returns None

# Data cleaning: drop rows with missing values, then keep only positive quantities and prices
df = df.dropna(subset=['Quantity', 'UnitPrice'])
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)].copy()

# Create a new column for total sales amount
df['TotalSales'] = df['Quantity'] * df['UnitPrice']

# Calculate total sales and average daily sales for each product
sales_per_product = df.groupby('StockCode').agg({
    'TotalSales': 'sum',
    'Quantity': 'sum',
    'InvoiceDate': 'nunique'
}).reset_index()

# Calculate average daily sales over the full date span of the dataset
# (assumes each product was available for the whole period)
days_active = (df['InvoiceDate'].max() - df['InvoiceDate'].min()).days + 1  # +1 to include last day
sales_per_product['AverageDailySales'] = sales_per_product['Quantity'] / days_active

# Assuming a lead time of 14 days for restocking and a safety stock of 20 units
lead_time = 14
safety_stock = 20

# Calculate Reorder Point (ROP)
sales_per_product['ReorderPoint'] = (sales_per_product['AverageDailySales'] * lead_time) + safety_stock

# Simulate current inventory (for demonstration; in practice, this would come from your inventory records)
# For this example, let's assume a random current inventory level between 0 and 150 for each product
np.random.seed(42)  # For reproducibility
sales_per_product['CurrentInventory'] = np.random.randint(0, 150, size=len(sales_per_product))

# Identify products at risk of stockouts
sales_per_product['AtRiskOfStockout'] = sales_per_product['CurrentInventory'] < sales_per_product['ReorderPoint']

# Filter products at risk
at_risk_products = sales_per_product[sales_per_product['AtRiskOfStockout']]

# Display products at risk of stockouts
print("Products at Risk of Stockouts:")
print(at_risk_products[['StockCode', 'CurrentInventory', 'ReorderPoint', 'AverageDailySales']])

# Visualizing the inventory levels and reorder points
plt.figure(figsize=(12, 6))
sns.barplot(data=at_risk_products, x='StockCode', y='ReorderPoint', color='lightblue', label='Reorder Point')
sns.barplot(data=at_risk_products, x='StockCode', y='CurrentInventory', color='salmon', label='Current Inventory')
plt.title('Inventory Levels vs. Reorder Points for Products at Risk of Stockouts')
plt.xlabel('Product (StockCode)')
plt.ylabel('Units')
plt.xticks(rotation=45)
plt.legend()
plt.tight_layout()
plt.show()

## Customer Feedback (if applicable)
13. Feedback Analysis:

How do customer reviews (if available) correlate with sales figures?



In [None]:
# Data cleaning: drop rows with missing values, then keep only positive quantities and prices
df = df.dropna(subset=['Quantity', 'UnitPrice'])
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)].copy()

# Create a new column for total sales amount
df['TotalSales'] = df['Quantity'] * df['UnitPrice']

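# The dataset contains no review data, so simulate average review scores (1 to 5) for each product.
# These scores are random placeholders: the resulting correlation only demonstrates the method.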
np.random.seed(42)
stock_codes = df['StockCode'].unique()
review_scores = pd.DataFrame({
    'StockCode': stock_codes,
    'AverageReviewScore': np.random.uniform(1, 5, size=len(stock_codes))  # Random scores between 1 and 5
})

# Calculate total sales and average sales for each product
sales_per_product = df.groupby('StockCode').agg({
    'TotalSales': 'sum',
    'Quantity': 'sum'
}).reset_index()

# Merge sales data with review scores
merged_data = pd.merge(sales_per_product, review_scores, on='StockCode')

# Calculate the correlation between average review scores and total sales
correlation = merged_data['AverageReviewScore'].corr(merged_data['TotalSales'])
print(f"Correlation between Average Review Score and Total Sales: {correlation:.2f}")

# Visualizing the relationship (scores are simulated, so no real pattern is expected)
plt.figure(figsize=(10, 6))
sns.scatterplot(data=merged_data, x='AverageReviewScore', y='TotalSales', alpha=0.6)
plt.title('Simulated Review Scores vs Total Sales')
plt.xlabel('Average Review Score (simulated)')
plt.ylabel('Total Sales (in currency units)')
plt.xlim(1, 5)  # Review scores range from 1 to 5
plt.ylim(0, merged_data['TotalSales'].max() * 1.1)
plt.grid(True)
plt.show()
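
Since the review scores here are drawn at random, independently of sales, the correlation should come out close to zero. The sketch below illustrates that expectation with fully synthetic data; the product count and the log-normal sales distribution are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)
n_products = 10_000  # hypothetical number of products

# Simulated review scores, drawn independently of sales
scores = rng.uniform(1, 5, size=n_products)
# Simulated, right-skewed sales totals (log-normal shape is an assumption)
sales = rng.lognormal(mean=5.0, sigma=1.0, size=n_products)

# Pearson correlation between two independent variables
r = np.corrcoef(scores, sales)[0, 1]
print(f"Pearson r between independent scores and sales: {r:.3f}")
```

A meaningful answer to the question in this section requires real review data joined to sales on `StockCode`.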

## Data Quality and Integrity
14. Data Quality Checks:

  14.1. Are there missing values or duplicates in the dataset?
  

In [None]:
# Display the first few rows of the dataset
print("First few rows of the dataset:")
print(df.head())

# Check for missing values
print("\nMissing values in each column:")
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])  # Display only columns with missing values

# Check for duplicates
duplicates = df.duplicated().sum()
print(f"\nNumber of duplicate rows in the dataset: {duplicates}")

# Summary of the dataset
print(f"\nTotal number of rows in the dataset: {len(df)}")
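
Raw counts are easier to judge as shares of the dataset. A small sketch on a toy frame (hypothetical values) that reports the percentage of missing values per column and lists every copy of a duplicated row:

```python
import numpy as np
import pandas as pd

# Toy data: rows 0 and 1 are exact duplicates; Description and CustomerID have gaps
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A3"],
    "Description": ["MUG", "MUG", None, "PLATE"],
    "CustomerID": [1.0, 1.0, np.nan, np.nan],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isnull().mean() * 100

# keep=False marks every copy of a duplicated row, which helps when inspecting them
duplicate_rows = toy[toy.duplicated(keep=False)]

print(missing_pct)
print(f"Duplicate rows (excluding first occurrences): {toy.duplicated().sum()}")
print(duplicate_rows)
```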

  14.2. How do missing values affect the overall analysis?

**Missing values affect the analysis in five main ways:**

**1. Loss of Information**

*   **Reduction in Dataset Size:** When rows with missing values are removed, it reduces the overall size of the dataset, which can lead to loss of valuable information.

*   **Bias in Analysis:** If the missing values are not randomly distributed (e.g., concentrated in certain groups), the analysis may become biased, leading to incorrect conclusions.


**2. Impact on Statistical Validity**

* **Decreased Statistical Power:**  Fewer data points can reduce the ability to detect true effects or relationships. This is particularly important in hypothesis testing.

* **Altered Distribution:** Missing values can distort the distribution of the data, which affects summary statistics such as the mean, median, and standard deviation.

**3. Inaccuracy in Predictions**
* **Model Performance:** Missing values can reduce the accuracy of predictive models. Many machine learning algorithms cannot accept missing values at all, so rows must be dropped or imputed before training.

* **Imputation Issues:** Techniques used to fill in missing values (imputation) can introduce biases if not done properly, potentially skewing results.

**4. Complexity in Data Analysis**

* **Increased Complexity:** Handling missing data often adds complexity to the analysis process. Analysts must decide how to handle missing values (e.g., removal, imputation), which can complicate workflows and analyses.

* **Different Approaches:** Different strategies for dealing with missing values (e.g., mean imputation, regression imputation, deletion) can yield different results, leading to inconsistencies.

**5. Data Quality and Trustworthiness**

* **Trust in Data:** A dataset with a high percentage of missing values may lead to skepticism about the reliability of the data, making it harder to draw actionable insights.

* **Need for Transparency:** It's crucial to document how missing values were handled and their potential impact on the analysis, ensuring transparency in the analytical process.
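
The loss-of-information and bias effects can be measured directly. The sketch below uses a toy frame with hypothetical values to show how dropping rows without a `CustomerID` changes both the row count and the revenue total when the gaps are not spread evenly:

```python
import numpy as np
import pandas as pd

# Toy data: the two rows without a CustomerID happen to be the largest sales
toy = pd.DataFrame({
    "CustomerID": [1.0, 2.0, np.nan, np.nan],
    "Country": ["United Kingdom", "France", "United Kingdom", "United Kingdom"],
    "TotalPrice": [100.0, 50.0, 200.0, 150.0],
})

revenue_all = toy["TotalPrice"].sum()
complete = toy.dropna(subset=["CustomerID"])
revenue_complete = complete["TotalPrice"].sum()

# Compare how many rows and how much revenue disappear after deletion
rows_lost_pct = (1 - len(complete) / len(toy)) * 100
revenue_lost_pct = (1 - revenue_complete / revenue_all) * 100

print(f"Rows lost: {rows_lost_pct:.0f}%")
print(f"Revenue lost: {revenue_lost_pct:.0f}%")
```

When the revenue share lost is much larger than the row share lost, the missing values are concentrated in high-value rows, and any total computed after deletion understates the true figure.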

## Strategies to Handle Missing Values

To mitigate the negative impacts of missing values, here are some common strategies:

**1. Deletion:**

* **Listwise Deletion:** Remove rows with missing values, which is simple but can lead to loss of data.

* **Pairwise Deletion:** For each calculation, use every row that has values for the variables involved, which retains more data than listwise deletion.

**2. Imputation:**

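* **Mean/Median Imputation:** Replace missing values with the mean or median of the column. This is simple but shrinks the column's variance, and the median is the safer choice when the data contain outliers.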

* **Regression Imputation:** Use regression models to predict and fill in missing values based on other variables.

* **K-Nearest Neighbors (KNN):** Use the nearest neighbors to estimate missing values based on similar data points.

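**3. Using Algorithms that Support Missing Values:**
Some tree-based implementations can handle missing values natively, such as the gradient-boosted trees in XGBoost and LightGBM. Support in decision trees and random forests depends on the library and version, so check before relying on it.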

**4. Data Collection Improvements:**
Enhancing data collection processes to minimize the occurrence of missing values in the future.
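
Two of the strategies listed here, listwise deletion and median imputation, can be compared on a toy frame. The values are hypothetical and include one outlier to show why the median is often preferred over the mean:

```python
import numpy as np
import pandas as pd

# Toy data with one gap per column and an outlier (100.0) in Quantity
toy = pd.DataFrame({
    "Quantity": [1.0, 2.0, np.nan, 4.0, 100.0],
    "UnitPrice": [2.5, np.nan, 3.0, 1.0, 0.5],
})

# Listwise deletion: keep only rows with no missing values
listwise = toy.dropna()

# Median imputation: fill each column's gaps with that column's median
imputed = toy.fillna(toy.median())

print(f"Rows after listwise deletion: {len(listwise)} of {len(toy)}")
print(f"Quantity mean: {toy['Quantity'].mean()}, median: {toy['Quantity'].median()}")
print(imputed)
```

Listwise deletion discards two of the five rows, while imputation keeps all five. The outlier pulls the mean of `Quantity` far above the typical value, which is why the median is used as the fill value in this sketch.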

## Conclusion

Missing values can substantially change the outcome of an analysis, so they need to be handled deliberately. Understanding where and why they occur lets analysts choose between deletion and imputation on evidence, which keeps the analysis robust and improves the quality of the insights drawn from the data.

## Summary Insights


Total Sales Revenue: 8,911,407.90

Top Selling Product: PAPER CRAFT , LITTLE BIRDIE

Top Product Percentage of Revenue: 0.91 (this figure is units sold divided by total revenue, times 100, so it is not the product's share of revenue)

Unique Customers: 4,338

Average Order Value: 480.87

Top Countries Revenue: United Kingdom: 7,308,391.55; Netherlands: 285,446.34; EIRE: 265,545.90

Repeat Purchase Percentage: 98.36% (customers who appear on more than one transaction row, which includes single orders containing several items)

At Risk Stockout Products (partial list; the full table in the output below can be downloaded for detailed analysis): 'GREEN GIANT GARDEN THERMOMETER', 'HEARTS WRAPPING TAPE ', 'LARGE CAKE TOWEL PINK SPOTS', BABUSHKA 65CMx65CM', 'POMPOM CURTAIN', 'POP ART PUSH DOWN RUBBER ', 'RETROSPOT S DOILEY', 'SET 36 COLOUR PENCILS DOLLY GIRL', 'SET 36 COLOUR PENCILS LOVE LONDON', 'SET 36 COLOURING PENCILS DOILEY'

In [None]:
# Display the first few rows and check for columns
print("First few rows of the dataset:")
print(df.head())
print("\nColumns in the dataset:")
print(df.columns)

# Clean the dataset: Remove rows with missing values in key columns
columns_to_check = ['InvoiceNo', 'CustomerID', 'Quantity', 'UnitPrice']
df.dropna(subset=columns_to_check, inplace=True)

# 1. Calculate total sales revenue
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
total_revenue = df['TotalPrice'].sum()

# 2. Identify the top-selling product (by units) and its share of total revenue
top_products = df.groupby('Description')['Quantity'].sum().sort_values(ascending=False)
top_product = top_products.idxmax()
top_product_sales = df.loc[df['Description'] == top_product, 'TotalPrice'].sum()  # revenue, not units
top_product_percentage = (top_product_sales / total_revenue) * 100

# 3. Calculate unique customers and average order value
unique_customers = df['CustomerID'].nunique()
average_order_value = df.groupby('InvoiceNo')['TotalPrice'].sum().mean()

# 4. Revenue by country
country_revenue = df.groupby('Country')['TotalPrice'].sum().sort_values(ascending=False)
top_countries = country_revenue.head(3)

# 5. Calculate repeat purchases: customers with more than one distinct invoice
invoices_per_customer = df.groupby('CustomerID')['InvoiceNo'].nunique()
repeat_customer_count = (invoices_per_customer > 1).sum()
repeat_purchase_percentage = (repeat_customer_count / unique_customers) * 100

# 6. Flag low-volume products: total units sold below the average across all products
# (a simple placeholder; a real stockout check needs inventory levels, which this dataset lacks)
product_sales = df.groupby('Description')['Quantity'].sum()
low_stock_threshold = product_sales.mean()  # A simple threshold for example
at_risk_products = product_sales[product_sales < low_stock_threshold].index.tolist()

# Summary Insights
summary_insights = {
    'Total Sales Revenue': total_revenue,
    'Top Selling Product': top_product,
    'Top Product Percentage of Revenue': top_product_percentage,
    'Unique Customers': unique_customers,
    'Average Order Value': average_order_value,
    'Top Countries Revenue': top_countries.to_dict(),  # Convert to dict for better display
    'Repeat Purchase Percentage': repeat_purchase_percentage,
    'At Risk Stockout Products': at_risk_products
}

# Output the summary insights
print("\nSummary Insights:")
for key, value in summary_insights.items():
    print(f"{key}: {value}")
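
The repeat-purchase figure depends on what counts as a repeat. The sketch below uses toy data (hypothetical values) to compare a row-based count, where any customer appearing on more than one row qualifies, with an invoice-based count, where a customer needs more than one distinct invoice:

```python
import pandas as pd

# Toy data: customer 1 has one invoice with two items, customer 2 has two invoices,
# customer 3 has a single one-item invoice
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3],
    "InvoiceNo": ["A1", "A1", "B1", "B2", "C1"],
})

total_customers = toy["CustomerID"].nunique()

# Row-based: customers appearing on more than one row
row_based = toy[toy.duplicated(["CustomerID"], keep=False)]["CustomerID"].nunique()

# Invoice-based: customers with more than one distinct invoice
invoices_per_customer = toy.groupby("CustomerID")["InvoiceNo"].nunique()
invoice_based = int((invoices_per_customer > 1).sum())

print(f"Row-based repeat rate: {row_based / total_customers * 100:.1f}%")
print(f"Invoice-based repeat rate: {invoice_based / total_customers * 100:.1f}%")
```

The row-based rate counts customer 1 as a repeat buyer even though they placed a single order, so it overstates repeat purchasing whenever orders contain several items.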