# PySpark Online Retail II Dataset Analysis

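This notebook loads the Online Retail II dataset into a PySpark DataFrame and runs a basic exploratory analysis. It is written for Google Colab but also handles local runs, falling back to pandas when Spark jobs fail.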


## 1. Install Required Packages

First, install PySpark, along with pandas and openpyxl, which are needed to read the Excel workbook.


In [None]:
# Install required packages
%pip install pyspark pandas openpyxl


## 2. Import Libraries and Initialize Spark Session


In [8]:
# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import col, sum as spark_sum, count, when, isnan, isnull, desc, min as spark_min, max as spark_max
import pandas as pd

# Initialize Spark session
# Configure Spark for both local and Colab environments
spark = SparkSession.builder \
    .appName("OnlineRetailAnalysis") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .config("spark.sql.execution.arrow.maxRecordsPerBatch", "10000") \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.memory", "2g") \
    .config("spark.sql.execution.arrow.pyspark.fallback.enabled", "true") \
    .config("spark.python.worker.timeout", "300") \
    .config("spark.python.worker.reuse", "true") \
    .getOrCreate()

# Set log level to reduce output noise
spark.sparkContext.setLogLevel("WARN")

print("Spark session initialized successfully!")
print(f"Spark version: {spark.version}")


Spark session initialized successfully!
Spark version: 4.0.1


## 3. Load Data from GitHub

PySpark has no built-in Excel reader, so we read the workbook from GitHub with pandas and then convert the result to a Spark DataFrame.


In [15]:
# Use pandas to read Excel file from GitHub
print("Reading Excel file from GitHub...")

# GitHub repository information
github_user = "Hachi630"
github_repo = "BDAS"
file_path = "online_retail_II.xlsx"

# Construct GitHub raw URL
github_url = f"https://raw.githubusercontent.com/{github_user}/{github_repo}/main/{file_path}"

# Read Excel file with multiple sheets
print("Loading data from both sheets (2009-2010 and 2010-2011)...")
excel_data = pd.read_excel(github_url, sheet_name=None)  # Read all sheets

# Get the two sheets
sheet_2009_2010 = excel_data['Year 2009-2010']
sheet_2010_2011 = excel_data['Year 2010-2011']

print(f"2009-2010 data shape: {sheet_2009_2010.shape}")
print(f"2010-2011 data shape: {sheet_2010_2011.shape}")

# Combine both datasets
pandas_df = pd.concat([sheet_2009_2010, sheet_2010_2011], ignore_index=True)
print(f"Combined data shape: {pandas_df.shape}")

# Convert pandas DataFrame to Spark DataFrame
# Ensure DataFrame is named df for consistency
df = spark.createDataFrame(pandas_df)

print("Data successfully loaded from GitHub into Spark DataFrame!")


Reading Excel file from GitHub...
Loading data from both sheets (2009-2010 and 2010-2011)...
2009-2010 data shape: (525461, 8)
2010-2011 data shape: (541910, 8)
Combined data shape: (1067371, 8)


  Could not convert 'C489449' with type str: tried to convert to int64
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warn(msg)


Data successfully loaded from GitHub into Spark DataFrame!
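
The Arrow warning above is consistent with `Invoice` holding both integers and strings such as `'C489449'` in a single object column, which pushes Spark onto the slower non-Arrow conversion path. One possible remedy, sketched here on a small stand-in frame, is to cast object columns that mix Python types to strings before calling `spark.createDataFrame`. The helper name `stringify_mixed_columns` and the sample values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def stringify_mixed_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which object columns holding several Python types become strings.

    Missing values are left untouched so they can still map to nulls in Spark.
    """
    result = frame.copy()
    for name in result.columns:
        column = result[name]
        if column.dtype != object:
            continue
        non_null = column.dropna()
        if non_null.map(type).nunique() > 1:
            # Keep missing entries as they are; convert everything else to str.
            result[name] = column.where(column.isna(), column.astype(str))
    return result


# Stand-in for the real data: Invoice mixes ints with a "C"-prefixed string.
sample = pd.DataFrame({
    "Invoice": [489434, "C489449", 489435],
    "Quantity": [12, -1, 24],
})
cleaned = stringify_mixed_columns(sample)
print(cleaned["Invoice"].tolist())
```

In the notebook this would be applied as `spark.createDataFrame(stringify_mixed_columns(pandas_df))`; whether that removes the warning on the full dataset has not been verified here.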


## 4. Individual Table Analysis

Inspect each sheet on its own, then summarize the combined data.


In [16]:
# Analyze individual tables
print("=== Individual Table Analysis ===")

print("\n--- 2009-2010 Data ---")
print(f"Shape: {sheet_2009_2010.shape}")
print(f"Columns: {list(sheet_2009_2010.columns)}")
print("First 3 rows:")
print(sheet_2009_2010.head(3))

print("\n--- 2010-2011 Data ---")
print(f"Shape: {sheet_2010_2011.shape}")
print(f"Columns: {list(sheet_2010_2011.columns)}")
print("First 3 rows:")
print(sheet_2010_2011.head(3))

print("\n--- Combined Data Summary ---")
print(f"Total records: {len(pandas_df):,}")
print(f"Total columns: {len(pandas_df.columns)}")
print(f"Columns: {list(pandas_df.columns)}")


=== Individual Table Analysis ===

--- 2009-2010 Data ---
Shape: (525461, 8)
Columns: ['Invoice', 'StockCode', 'Description', 'Quantity', 'InvoiceDate', 'Price', 'Customer ID', 'Country']
First 3 rows:
  Invoice StockCode                          Description  Quantity  \
0  489434     85048  15CM CHRISTMAS GLASS BALL 20 LIGHTS        12   
1  489434    79323P                   PINK CHERRY LIGHTS        12   
2  489434    79323W                  WHITE CHERRY LIGHTS        12   

          InvoiceDate  Price  Customer ID         Country  
0 2009-12-01 07:45:00   6.95      13085.0  United Kingdom  
1 2009-12-01 07:45:00   6.75      13085.0  United Kingdom  
2 2009-12-01 07:45:00   6.75      13085.0  United Kingdom  

--- 2010-2011 Data ---
Shape: (541910, 8)
Columns: ['Invoice', 'StockCode', 'Description', 'Quantity', 'InvoiceDate', 'Price', 'Customer ID', 'Country']
First 3 rows:
  Invoice StockCode                         Description  Quantity  \
0  536365    85123A  WHITE HANGING HEART

## 5. Check Data Dimensions

Determine the number of rows and columns in the combined dataset.


In [17]:
# Check data dimensions with error handling
print("=== Data Dimension Information ===")

# Get row count with error handling
try:
    row_count = df.count()
    print(f"Dataset row count: {row_count:,}")
    USE_SPARK = True
except Exception as e:
    print(f"Error getting row count with Spark: {e}")
    print("Using pandas DataFrame for all analysis...")
    row_count = len(pandas_df)
    print(f"Dataset row count (from pandas): {row_count:,}")
    USE_SPARK = False

# Get column count
if USE_SPARK:
    column_count = len(df.columns)
    column_names = df.columns
else:
    column_count = len(pandas_df.columns)
    column_names = list(pandas_df.columns)

print(f"Dataset column count: {column_count}")
print(f"Column names: {column_names}")


=== Data Dimension Information ===
Error getting row count with Spark: An error occurred while calling o150.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 4.0 failed 1 times, most recent failure: Lost task 2.0 in stage 4.0 (TID 82) (windows10.microdone.cn executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:252)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:143)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:158)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:178)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:261)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
	at org.
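
On Windows, a `Python worker failed to connect back` error like the one above is often caused by Spark launching a different Python interpreter than the one running the notebook, or failing to find one at all. A common remedy, offered here as an assumption about this environment rather than a confirmed diagnosis, is to point `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` at the current interpreter before the Spark session is created. The helper name `pin_pyspark_interpreter` is hypothetical.

```python
import os
import sys


def pin_pyspark_interpreter() -> str:
    """Make Spark workers and the driver use the interpreter running this code."""
    os.environ["PYSPARK_PYTHON"] = sys.executable
    os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
    return sys.executable


# Run this before SparkSession.builder...getOrCreate(); a session that is
# already running keeps the interpreter it was started with.
interpreter = pin_pyspark_interpreter()
print(f"PySpark will use: {interpreter}")
```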

## 6. Complete Analysis with Error Handling

The cells below rely on the `USE_SPARK` flag set in the previous step: they run against the Spark DataFrame when Spark is working and fall back to the pandas DataFrame otherwise.
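
The `if USE_SPARK: ... else: ...` branching repeated in the cells below could also be factored into a small dispatcher. The following is only a sketch: `run_with_fallback` is a hypothetical helper that the notebook does not define, and the two callables stand in for real Spark and pandas computations.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_fallback(use_spark: bool,
                      spark_fn: Callable[[], T],
                      pandas_fn: Callable[[], T]) -> T:
    """Run spark_fn when Spark is enabled; use pandas_fn if disabled or if Spark raises."""
    if use_spark:
        try:
            return spark_fn()
        except Exception as exc:
            print(f"Spark path failed ({exc}); falling back to pandas.")
    return pandas_fn()


def failing_spark_count() -> int:
    # Simulates the worker failure seen in the previous step.
    raise RuntimeError("Python worker failed to connect back")


rows = run_with_fallback(True, failing_spark_count, lambda: 1067371)
print(rows)
```

Unlike a flag that is set once, this variant also recovers when a later Spark action fails even though the first one succeeded.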


In [18]:
# Preview data - show first 5 rows
print("=== Data Preview (First 5 Rows) ===")
if USE_SPARK:
    df.show(5, truncate=False)
else:
    print(pandas_df.head())

# Print data schema to verify data types
print("\n=== Data Schema ===")
if USE_SPARK:
    df.printSchema()
else:
    print(pandas_df.dtypes)

# Display basic statistical summary for numeric columns
print("\n=== Numeric Columns Statistical Summary ===")
if USE_SPARK:
    df.describe().show()
else:
    print(pandas_df.describe())

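# Check for missing values
print("\n=== Missing Values Check ===")
if USE_SPARK:
    # isnan() applies only to float/double columns (it fails on the timestamp
    # column InvoiceDate), so every other column is checked with isnull() alone.
    float_cols = {name for name, dtype in df.dtypes if dtype in ("float", "double")}
    missing_values = df.select([
        spark_sum(when((isnull(c) | isnan(c)) if c in float_cols else isnull(c), 1).otherwise(0)).alias(c)
        for c in df.columns
    ])
    missing_values.show()
else:
    missing_values = pandas_df.isnull().sum()
    print(missing_values)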


=== Data Preview (First 5 Rows) ===
  Invoice StockCode                          Description  Quantity  \
0  489434     85048  15CM CHRISTMAS GLASS BALL 20 LIGHTS        12   
1  489434    79323P                   PINK CHERRY LIGHTS        12   
2  489434    79323W                  WHITE CHERRY LIGHTS        12   
3  489434     22041         RECORD FRAME 7" SINGLE SIZE         48   
4  489434     21232       STRAWBERRY CERAMIC TRINKET BOX        24   

          InvoiceDate  Price  Customer ID         Country  
0 2009-12-01 07:45:00   6.95      13085.0  United Kingdom  
1 2009-12-01 07:45:00   6.75      13085.0  United Kingdom  
2 2009-12-01 07:45:00   6.75      13085.0  United Kingdom  
3 2009-12-01 07:45:00   2.10      13085.0  United Kingdom  
4 2009-12-01 07:45:00   1.25      13085.0  United Kingdom  

=== Data Schema ===
Invoice                object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
Price       

In [19]:
# Check negative values in Quantity column (returns)
print("=== Quantity Column Analysis ===")
if USE_SPARK:
    quantity_stats = df.select(
        spark_min("Quantity").alias("Min Quantity"),
        spark_max("Quantity").alias("Max Quantity"),
        count(when(col("Quantity") < 0, 1)).alias("Return Records Count"),
        count(when(col("Quantity") > 0, 1)).alias("Normal Sales Records Count")
    )
    quantity_stats.show()
else:
    quantity_stats = {
        "Min Quantity": pandas_df['Quantity'].min(),
        "Max Quantity": pandas_df['Quantity'].max(),
        "Return Records Count": (pandas_df['Quantity'] < 0).sum(),
        "Normal Sales Records Count": (pandas_df['Quantity'] > 0).sum()
    }
    for key, value in quantity_stats.items():
        print(f"{key}: {value}")

# Check unit price range (this dataset names the column "Price", not "UnitPrice")
print("\n=== UnitPrice Column Analysis ===")
if USE_SPARK:
    price_stats = df.select(
        spark_min("Price").alias("Min Unit Price"),
        spark_max("Price").alias("Max Unit Price"),
        count(when(col("Price") < 0, 1)).alias("Negative Price Records Count"),
        count(when(col("Price") == 0, 1)).alias("Zero Price Records Count")
    )
    price_stats.show()
else:
    price_stats = {
        "Min Unit Price": pandas_df['Price'].min(),
        "Max Unit Price": pandas_df['Price'].max(),
        "Negative Price Records Count": (pandas_df['Price'] < 0).sum(),
        "Zero Price Records Count": (pandas_df['Price'] == 0).sum()
    }
    for key, value in price_stats.items():
        print(f"{key}: {value}")


=== Quantity Column Analysis ===
Min Quantity: -80995
Max Quantity: 80995
Return Records Count: 22950
Normal Sales Records Count: 1044421

=== UnitPrice Column Analysis ===
Min Unit Price: -53594.36
Max Unit Price: 38970.0
Negative Price Records Count: 5
Zero Price Records Count: 6202
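
The negative quantities counted above as returns can be cross-checked against the invoice numbers: in this dataset, an invoice code beginning with `C` marks a cancellation (the `'C489449'` value in the earlier Arrow warning is one such code). The sketch below, run on a small made-up frame rather than the real data, separates cancellations from other negative-quantity rows. The helper name `flag_cancellations`, the added column names, and the sample rows are illustrative assumptions.

```python
import pandas as pd


def flag_cancellations(frame: pd.DataFrame) -> pd.DataFrame:
    """Add boolean columns marking cancelled invoices and negative quantities."""
    flagged = frame.copy()
    # Invoice may hold ints and strings, so compare on the string form.
    flagged["IsCancellation"] = flagged["Invoice"].astype(str).str.startswith("C")
    flagged["IsNegativeQty"] = flagged["Quantity"] < 0
    return flagged


sample = pd.DataFrame({
    "Invoice": [489434, "C489449", 489500],
    "Quantity": [12, -12, -5],
})
flagged = flag_cancellations(sample)

# Negative quantities that are not cancellations deserve a separate look.
unexplained = flagged[flagged["IsNegativeQty"] & ~flagged["IsCancellation"]]
print(unexplained[["Invoice", "Quantity"]])
```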


In [20]:
# Display record counts by country
print("=== Record Count by Country (Top 10) ===")
if USE_SPARK:
    df.groupBy("Country").count().orderBy(desc("count")).show(10)
else:
    country_counts = pandas_df['Country'].value_counts().head(10)
    print(country_counts)

# Display record counts by customer
print("\n=== Record Count by Customer (Top 10) ===")
if USE_SPARK:
    df.groupBy("Customer ID").count().orderBy(desc("count")).show(10)
else:
    customer_counts = pandas_df['Customer ID'].value_counts().head(10)
    print(customer_counts)


=== Record Count by Country (Top 10) ===
Country
United Kingdom    981330
EIRE               17866
Germany            17624
France             14330
Netherlands         5140
Spain               3811
Switzerland         3189
Belgium             3123
Portugal            2620
Australia           1913
Name: count, dtype: int64

=== Record Count by Customer (Top 10) ===
Customer ID
17841.0    13097
14911.0    11613
12748.0     7307
14606.0     6709
14096.0     5128
15311.0     4717
14156.0     4130
14646.0     3890
13089.0     3438
16549.0     3255
Name: count, dtype: int64
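
Customer IDs print as floats (`17841.0`), most likely because the column contains missing values and pandas stores such integer columns as `float64`. Note also that `value_counts()` drops missing entries by default, so rows without a customer do not appear in the ranking above. A minimal sketch of one way to tidy this, using pandas' nullable `Int64` dtype on a made-up sample:

```python
import pandas as pd

# Made-up sample with one missing customer, mimicking the float IDs above.
sample = pd.DataFrame({"Customer ID": [17841.0, 14911.0, None, 17841.0]})

# Nullable Int64 keeps missing IDs as <NA> instead of forcing floats.
customer_ids = sample["Customer ID"].astype("Int64")

# dropna=False makes the number of rows without a customer visible.
counts = customer_ids.value_counts(dropna=False)
print(counts)
```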


In [21]:
print("=== Analysis Complete ===")
print("Dataset basic information summary:")
print(f"- Total records: {row_count:,}")
print(f"- Column count: {column_count}")
print("- Main columns: Invoice, StockCode, Description, Quantity, InvoiceDate, Price, Customer ID, Country")
print("- Data types verified through schema")
print("- Statistical summary shows distribution of numeric columns")
print("- Missing values and anomalies checked")
if USE_SPARK:
    print("- Analysis performed using PySpark")
else:
    print("- Analysis performed using pandas (PySpark failed)")

# Stop Spark session if it was started
if USE_SPARK:
    try:
        spark.stop()
        print("Spark session stopped.")
    except Exception:
        pass


=== Analysis Complete ===
Dataset basic information summary:
- Total records: 1,067,371
- Column count: 8
- Main columns: Invoice, StockCode, Description, Quantity, InvoiceDate, Price, Customer ID, Country
- Data types verified through schema
- Statistical summary shows distribution of numeric columns
- Missing values and anomalies checked
- Analysis performed using pandas (PySpark failed)
