# PySpark Online Retail II Dataset Analysis

This notebook demonstrates how to load and analyze the Online Retail II dataset using PySpark in Google Colab.


## 1. Install Required Packages

First, install PySpark together with pandas and openpyxl, which pandas uses to read the Excel file.


In [None]:
# Install required packages
!pip install pyspark pandas openpyxl


## 2. Import Libraries and Initialize Spark Session


In [None]:
# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import col, sum as spark_sum, count, when, isnan, isnull, desc, min as spark_min, max as spark_max
import pandas as pd

# Initialize Spark session
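# Adaptive query execution (AQE) lets Spark merge small shuffle partitions at runtime.
# Both settings are already on by default since Spark 3.2, so they only matter on older versions.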
spark = SparkSession.builder \
    .appName("OnlineRetailAnalysis") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .getOrCreate()

# Set log level to reduce output noise
spark.sparkContext.setLogLevel("WARN")

print("Spark session initialized successfully!")
print(f"Spark version: {spark.version}")


## 3. Load Data

PySpark has no built-in Excel reader, so we read the file with pandas and convert the result to a Spark DataFrame.


In [None]:
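# Read the Excel file with pandas
print("Reading Excel file...")
pandas_df = pd.read_excel('online_retail_II.xlsx')

# Text columns can mix ints, strings, and NaN, which Spark's schema inference
# cannot merge into one type; normalize them to str (or None when missing)
for c in pandas_df.select_dtypes(include="object").columns:
    pandas_df[c] = pandas_df[c].map(lambda v: None if pd.isna(v) else str(v))

# Convert the pandas DataFrame to a Spark DataFrame named df, which all later cells use
df = spark.createDataFrame(pandas_df)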

print("Data successfully loaded into Spark DataFrame!")
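
One caveat with the load step: `pd.read_excel` reads only the first worksheet unless `sheet_name` says otherwise, and the Online Retail II workbook is usually distributed with one sheet per year. The following is a minimal sketch of stacking every sheet into one pandas DataFrame before the conversion. It assumes all sheets share the same columns; `combine_sheets`, the `SourceSheet` column, the sheet names, and the demo rows are illustrative assumptions, not taken from this notebook.

```python
import pandas as pd

def combine_sheets(sheets):
    """Stack the {sheet_name: DataFrame} dict returned by
    pd.read_excel(path, sheet_name=None) into a single DataFrame."""
    frames = [
        frame.assign(SourceSheet=name)  # hypothetical column recording each row's sheet
        for name, frame in sheets.items()
    ]
    return pd.concat(frames, ignore_index=True)

# Small in-memory stand-in for the workbook (made-up rows, assumed sheet names)
demo_sheets = {
    "Year 2009-2010": pd.DataFrame({"Invoice": ["489434"], "Quantity": [12]}),
    "Year 2010-2011": pd.DataFrame({"Invoice": ["536365", "C536379"], "Quantity": [6, -1]}),
}
combined = combine_sheets(demo_sheets)
print(combined)
```

With the real file, the input would come from `pd.read_excel('online_retail_II.xlsx', sheet_name=None)`, and the combined frame would take the place of `pandas_df`.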


## 4. Check Data Dimensions

Determine the number of rows and columns in the dataset.


In [None]:
# Check data dimensions
print("=== Data Dimension Information ===")

# Get row count
row_count = df.count()
print(f"Dataset row count: {row_count:,}")

# Get column count
column_count = len(df.columns)
print(f"Dataset column count: {column_count}")

# Display column names
print(f"Column names: {df.columns}")


## 5. Preview Data

Display the first few rows to understand the content structure.


In [None]:
# Preview data - show first 5 rows
print("=== Data Preview (First 5 Rows) ===")
df.show(5, truncate=False)


## 6. Data Schema

Print the DataFrame schema to verify data types.


In [None]:
# Print data schema to verify data types
print("=== Data Schema ===")
df.printSchema()
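
Reading the printed schema by eye is easy to get wrong on wider tables. As a hedged sketch, the `(name, type)` pairs that `df.dtypes` returns can be compared against the types you expect. The helper name `schema_mismatches` and the expected types below are assumptions for illustration, not part of the dataset's documentation.

```python
def schema_mismatches(actual, expected):
    """Return {column: (expected_type, actual_type)} for every column whose
    Spark type string differs from the expectation or is missing entirely."""
    actual_types = dict(actual)
    return {
        name: (want, actual_types.get(name))
        for name, want in expected.items()
        if actual_types.get(name) != want
    }

# Stand-in for df.dtypes, which is a list of (column name, type string) pairs
demo_dtypes = [("Quantity", "bigint"), ("InvoiceDate", "timestamp"), ("Customer ID", "double")]

# Assumed expectations; adjust to the types your analysis needs
expected_types = {"Quantity": "bigint", "InvoiceDate": "timestamp", "Customer ID": "bigint"}

problems = schema_mismatches(demo_dtypes, expected_types)
print(problems)
```

In the notebook, pass `df.dtypes` as the first argument; an empty result means every listed column has the expected type.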


## 7. Statistical Summary of Numeric Columns

Get basic statistical information for numeric columns.


In [None]:
# Display basic statistical summary for numeric columns
print("=== Numeric Columns Statistical Summary ===")
# describe() reports count, mean, stddev, min, and max; with no arguments it covers string columns as well as numeric ones
df.describe().show()


In [None]:
# summary() adds approximate quartiles (25%, 50%, 75%) to the statistics that describe() reports
print("=== Detailed Statistical Summary ===")
df.summary().show()


## 8. Missing Values Check

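Count the missing values in each column. Both null and NaN are counted, because pandas represents missing numbers as NaN and those can arrive in Spark as NaN rather than null.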


In [None]:
# Check for missing values
print("=== Missing Values Check ===")

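# Calculate missing value count for each column.
# isnan() is meant for float/double columns and raises an error on timestamp
# columns such as InvoiceDate, so other columns are checked with isnull() alone.
float_cols = {name for name, dtype in df.dtypes if dtype in ("float", "double")}
missing_values = df.select([
    spark_sum(when((isnull(c) | isnan(c)) if c in float_cols else isnull(c), 1).otherwise(0)).alias(c)
    for c in df.columns
])
missing_values.show()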
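
Raw counts are hard to judge without the table size. A minimal sketch of turning them into percentages follows; it works on a plain dict such as `missing_values.first().asDict()` together with `row_count` from section 4. The helper name `missing_percentages` and the demo numbers are made up for illustration.

```python
def missing_percentages(counts, total_rows):
    """Convert per-column missing counts into percentages of total_rows,
    keeping only columns with at least one missing value, largest first."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    shares = {
        name: round(100.0 * n / total_rows, 2)
        for name, n in counts.items()
        if n > 0
    }
    return dict(sorted(shares.items(), key=lambda item: item[1], reverse=True))

# Made-up counts standing in for missing_values.first().asDict()
demo_counts = {"Invoice": 0, "Description": 25, "Customer ID": 200}
print(missing_percentages(demo_counts, total_rows=1000))
```

Columns without missing values are dropped from the result, so the output stays short even for wide tables.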


## 9. Specific Column Analysis

Examine edge cases in the quantity and unit price columns: negative quantities (returns) and zero or negative prices.


In [None]:
# Check negative values in Quantity column (returns)
print("=== Quantity Column Analysis ===")

quantity_stats = df.select(
    spark_min("Quantity").alias("Min Quantity"),
    spark_max("Quantity").alias("Max Quantity"),
    count(when(col("Quantity") < 0, 1)).alias("Return Records Count"),
    count(when(col("Quantity") > 0, 1)).alias("Normal Sales Records Count")
)
quantity_stats.show()


In [None]:
# Check the unit price column range.
# Online Retail II names this column "Price"; the older Online Retail file uses "UnitPrice".
price_col = "Price" if "Price" in df.columns else "UnitPrice"
print(f"=== {price_col} Column Analysis ===")

price_stats = df.select(
    spark_min(price_col).alias("Min Unit Price"),
    spark_max(price_col).alias("Max Unit Price"),
    count(when(col(price_col) < 0, 1)).alias("Negative Price Records Count"),
    count(when(col(price_col) == 0, 1)).alias("Zero Price Records Count")
)
price_stats.show()


## 10. Group Analysis

Count records per country and per customer.


In [None]:
# Display record counts by country
print("=== Record Count by Country (Top 10) ===")
df.groupBy("Country").count().orderBy(desc("count")).show(10)


In [None]:
# Display record counts by customer
print("=== Record Count by Customer (Top 10) ===")
df.groupBy("Customer ID").count().orderBy(desc("count")).show(10)


## 11. Summary

A recap of the dataset's basic information.


In [None]:
print("=== Analysis Complete ===")
print("Dataset basic information summary:")
print(f"- Total records: {row_count:,}")
print(f"- Column count: {column_count}")
print(f"- Columns: {', '.join(df.columns)}")
print("- Data types verified through printSchema()")
print("- Statistical summary shows distribution of numeric columns")
print("- Missing values and anomalies checked")

# Stop Spark session (optional, usually not needed in Colab)
# spark.stop()
