<a href="https://colab.research.google.com/github/KursaDSc/-Amazon-E-Commerce-Sales-Data-Analysis-using-PySpark/blob/main/Amazon_Sales_Dataset_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Amazon Sales Data Analysis Project

## Introduction
This study examines large-scale Amazon sales data from the rapidly growing e-commerce sector. The project aims not only to present analysis results but also to explore the technological infrastructure behind modern data science. Starting from the fundamental principles of Big Data (Volume, Velocity, Variety), it uses Apache Spark (PySpark), the industry-standard framework for distributed, high-performance data processing. Along the way, it investigates the differences between Spark's core building blocks, RDDs and DataFrames, and discusses data loading in the context of distributed storage solutions such as HDFS and S3. Finally, the data is processed with PySpark's filtering, aggregation, and joining capabilities, the findings are presented through visualization tasks, and the work concludes with a mini analysis report. The goal is to use PySpark to uncover critical business trends and operational insights in the Amazon sales data.

## Project Details

* **Framework:** Apache PySpark
* **Environment:** Google Colab / Jupyter Notebook
* **Visualization:** Matplotlib & Seaborn
* **Output:** Single .ipynb file + Short Summary Report

## Setup and Environment Configuration

In [10]:
## COMPLETE PYSPARK SETUP: VERIFIED METHOD FOR COLAB

print("1. Installing OpenJDK...")
# 1. Install OpenJDK (Java is required for Spark)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

print("2. Downloading Apache Spark...")
# 2. Download Apache Spark 3.5.1 (direct link to the Apache archive)
!wget -q https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz -O spark.tgz

# 3. Extract the Spark file
print("3. Extracting Spark files...")
!tar xf spark.tgz

# 4. Clean up the archive file
!rm spark.tgz

# 5. Install findspark library
print("4. Installing findspark...")
!pip install -q findspark

# 6. Set Environment Variables
print("5. Setting environment variables...")
import os
import findspark

# Set SPARK_HOME to the extracted folder (Note: folder name is long, be careful)
os.environ["SPARK_HOME"] = "/content/spark-3.5.1-bin-hadoop3"

# Set JAVA_HOME
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

# Initialize findspark so that the pyspark package under SPARK_HOME becomes importable
findspark.init()

# Import SparkSession only after findspark.init() has added Spark to sys.path
from pyspark.sql import SparkSession

# 7. Create a Spark Session
print("6. Initializing Spark Session...")
spark = SparkSession.builder\
        .master("local[*]")\
        .appName("AmazonAnalysis")\
        .getOrCreate()

# VERIFICATION
print("\n--- Spark Setup Complete ---")
print("Spark Session successfully created:")
spark

1. Installing OpenJDK...
2. Downloading Apache Spark...
3. Extracting Spark files...
4. Installing findspark...
5. Setting environment variables...
6. Initializing Spark Session...

--- Spark Setup Complete ---
Spark Session successfully created:
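
The setup above depends on `SPARK_HOME` and `JAVA_HOME` pointing at directories that actually exist, and a typo in the long Spark folder name is an easy mistake. The following is a minimal sketch of a check that could be run before `findspark.init()`. The helper name `missing_env_paths` is hypothetical and not part of the original notebook.

```python
import os

def missing_env_paths(env, names=("SPARK_HOME", "JAVA_HOME")):
    """Return the variable names that are unset or do not point to a directory."""
    missing = []
    for name in names:
        path = env.get(name)
        # A variable is a problem if it is absent, empty, or not an existing directory.
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing

problems = missing_env_paths(os.environ)
if problems:
    print(f"Check these variables before calling findspark.init(): {problems}")
else:
    print("SPARK_HOME and JAVA_HOME both point to existing directories.")
```

Passing the environment as an argument keeps the helper easy to try with a plain dictionary instead of the real `os.environ`.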


## Section 1 — What is Big Data? (Volume, Velocity, Variety)

In [11]:
## DATA UPLOAD

# Upload the Amazon sales data file from your local machine to the Colab environment.
# A file selector will appear after executing this cell.
from google.colab import files
uploaded = files.upload()

Saving Amazon Sale Report.csv to Amazon Sale Report.csv
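
`files.upload()` returns a dictionary keyed by the uploaded file names, so the path can be taken from that result instead of being typed by hand. This matters if Colab stores the file under a slightly different name, for example after a repeated upload. The sketch below assumes exactly one CSV file is uploaded. The helper `pick_uploaded_csv` and the stand-in dictionary are hypothetical illustrations, not part of the original notebook.

```python
def pick_uploaded_csv(uploaded):
    """Return the single .csv file name found in a files.upload() result."""
    csv_names = [name for name in uploaded if name.lower().endswith(".csv")]
    if len(csv_names) != 1:
        raise ValueError(f"Expected exactly one CSV upload, found: {csv_names}")
    return csv_names[0]

# Stand-in for the dict returned by files.upload() (file name -> file bytes).
example_upload = {"Amazon Sale Report.csv": b"index,Order ID\n"}

FILE_PATH = pick_uploaded_csv(example_upload)
print(FILE_PATH)
```

In the notebook itself, the call would be `pick_uploaded_csv(uploaded)` with the real upload result.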


In [12]:
FILE_PATH = "Amazon Sale Report.csv"

try:
    amazon_df = spark.read.csv(
        FILE_PATH,
        header=True,
        inferSchema=True
    )

except Exception as e:
    print(f"An error occurred during file loading: {e}")
    print("Please ensure the FILE_PATH variable matches the uploaded file name exactly.")

In [17]:
# Check the data structure and schema
print("--- DataFrame Schema (Structure) ---")
amazon_df.printSchema()

# Check the first few records
print("\n--- First 5 Rows of Data ---")
amazon_df.show(5)

# Check the total number of records
print(f"\nTotal Records: {amazon_df.count()}")

--- DataFrame Schema (Structure) ---
root
 |-- index: integer (nullable = true)
 |-- Order ID: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Status: string (nullable = true)
 |-- Fulfilment: string (nullable = true)
 |-- Sales Channel : string (nullable = true)
 |-- ship-service-level: string (nullable = true)
 |-- Style: string (nullable = true)
 |-- SKU: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Size: string (nullable = true)
 |-- ASIN: string (nullable = true)
 |-- Courier Status: string (nullable = true)
 |-- Qty: integer (nullable = true)
 |-- currency: string (nullable = true)
 |-- Amount: double (nullable = true)
 |-- ship-city: string (nullable = true)
 |-- ship-state: string (nullable = true)
 |-- ship-postal-code: double (nullable = true)
 |-- ship-country: string (nullable = true)
 |-- promotion-ids: string (nullable = true)
 |-- B2B: boolean (nullable = true)
 |-- fulfilled-by: string (nullable = true)
 |-- Unnamed: 22: boolea
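
The schema shows column names that contain spaces, hyphens, a colon, and (in `Sales Channel `) a trailing space. Such names have to be quoted with backticks in Spark SQL expressions and are easy to mistype. One option is to normalize them once after loading. The sketch below is an assumption about how that could be done, not a step from the original notebook; `normalize_column_name` and `normalize_columns` are hypothetical helpers, and only `DataFrame.toDF` and `DataFrame.columns` are Spark API calls.

```python
import re

def normalize_column_name(name):
    """Lower-case a column name and replace runs of non-alphanumerics with '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()

def normalize_columns(df):
    """Return a Spark DataFrame with the same data and normalized column names."""
    return df.toDF(*[normalize_column_name(c) for c in df.columns])

# A few names taken from the schema above, including the trailing space.
sample = ["Order ID", "Sales Channel ", "ship-postal-code", "Unnamed: 22"]
print([normalize_column_name(c) for c in sample])
```

With this in place, `normalize_columns(amazon_df)` would expose columns such as `order_id` and `ship_postal_code`. Note that two different original names could collapse to the same normalized name, which this sketch does not guard against.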

In [18]:
## 1. ANALYSIS OF DIMENSIONS (RELATION TO VOLUME)
# Calculate the total number of rows (Volume proxy) and columns.
row_count = amazon_df.count()
col_count = len(amazon_df.columns)

# Display the results.
print(f"DataFrame Dimensions (Rows, Columns): ({row_count}, {col_count})")

# --- COMMENTARY ON VOLUME ---
# The row count printed above relates directly to the **Volume** characteristic of Big Data.
# It reflects the scale of transactions accumulated over the observation period,
# which motivates using a distributed computing framework such as PySpark.

DataFrame Dimensions (Rows, Columns): (128975, 24)


In [22]:
from collections import Counter

## VARIETY ANALYSIS: LIST ALL TYPES AND SUMMARIZE FREQUENCIES

# Get the schema as a list of field objects (StructField objects)
data_schema = amazon_df.schema
data_type_list = [str(field.dataType) for field in data_schema]

# 1. LIST ALL COLUMNS AND THEIR TYPES (Aligned Output)
print("--- 1. Detailed Column-to-Type Mapping (Variety) ---")

# Find the length of the longest column name for dynamic alignment
max_name_length = max(len(field.name) for field in data_schema)

# Iterate and print each column with aligned formatting
for field in data_schema:
    # {field.name:<{max_name_length}} ensures left alignment based on the longest name
    print(f"Column: {field.name:<{max_name_length}} | Data Type: {field.dataType}")


# 2. DATA TYPE FREQUENCY SUMMARY (Table Output)
print("\n--- 2. Data Type Frequency Summary (Variety Count) ---")

# Use Counter to count the occurrences of each unique data type
type_counts = Counter(data_type_list)

# Determine max lengths for alignment in the summary table
max_type_len = max(len(type_name) for type_name in type_counts.keys())
header_format = f"| {'Data Type':<{max_type_len}} | {'Count':<5} |"
separator = "-" * (max_type_len + 12)

# Print Summary Table
print(separator)
print(header_format)
print(separator)

# Print Data
for data_type, count in type_counts.items():
    print(f"| {data_type:<{max_type_len}} | {count:<5} |")
print(separator)

--- 1. Detailed Column-to-Type Mapping (Variety) ---
Column: index              | Data Type: IntegerType()
Column: Order ID           | Data Type: StringType()
Column: Date               | Data Type: StringType()
Column: Status             | Data Type: StringType()
Column: Fulfilment         | Data Type: StringType()
Column: Sales Channel      | Data Type: StringType()
Column: ship-service-level | Data Type: StringType()
Column: Style              | Data Type: StringType()
Column: SKU                | Data Type: StringType()
Column: Category           | Data Type: StringType()
Column: Size               | Data Type: StringType()
Column: ASIN               | Data Type: StringType()
Column: Courier Status     | Data Type: StringType()
Column: Qty                | Data Type: IntegerType()
Column: currency           | Data Type: StringType()
Column: Amount             | Data Type: DoubleType()
Column: ship-city          | Data Type: StringType()
Column: ship-state         | Data Type: Stri

In [25]:
from pyspark.sql.functions import min as spark_min, max as spark_max, datediff, to_date, lit, round as spark_round

## 3. ANALYZE DATE RANGE AND TRANSACTION RATE (RELATION TO VELOCITY)

# --- Prerequisites: Assuming 'amazon_df' is loaded and 'Date' column is present ---
row_count = amazon_df.count()

# Convert the 'Date' column to a proper Date type (MM-dd-yy format assumed from previous context)
df_with_date = amazon_df.withColumn("Order_Date", to_date(amazon_df["Date"], "MM-dd-yy"))

# Find the earliest and latest order dates
date_analysis = df_with_date.agg(
    spark_min("Order_Date").alias("Min_Date"),
    spark_max("Order_Date").alias("Max_Date")
).collect()[0]

min_date = date_analysis['Min_Date']
max_date = date_analysis['Max_Date']

# Calculate the number of days between the first and last transaction.
# collect() returned Python date objects, so plain subtraction on the driver
# gives the span without launching another Spark job over every row.
days_span = (max_date - min_date).days

# --- NEW CALCULATION: AVERAGE TRANSACTION RATE ---

# Calculate the average number of transactions per day
if days_span > 0:
    avg_tx_per_day = round(row_count / days_span)
else:
    avg_tx_per_day = row_count # Handle case where all transactions are on the same day

print(f"Earliest Order Date: {min_date}")
print(f"Latest Order Date: {max_date}")
print(f"Total Span of Data (Days): {days_span}")
print(f"Total Transactions: {row_count:,}")

# Print the new metric, rounded to the nearest whole number
print(f"Average Transactions per Day: {avg_tx_per_day:,.0f}")

Earliest Order Date: 2022-03-31
Latest Order Date: 2022-06-29
Total Span of Data (Days): 90
Total Transactions: 128,975
Average Transactions per Day: 1,433
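
The figures above can be reproduced with plain date arithmetic, which also makes the choice of denominator explicit. The sketch below uses only the values printed in the output. The variable names are illustrative, and the "calendar days" variant is an alternative convention, not what the notebook computes.

```python
from datetime import date

first_order = date(2022, 3, 31)
last_order = date(2022, 6, 29)
total_transactions = 128_975

# Difference between the two dates, as used in the cell above.
elapsed_days = (last_order - first_order).days
# Alternative convention: count both the first and the last day.
calendar_days = elapsed_days + 1

rate_elapsed = total_transactions / elapsed_days
rate_calendar = total_transactions / calendar_days

print(f"Elapsed days:  {elapsed_days}, rate: {rate_elapsed:,.1f} per day")
print(f"Calendar days: {calendar_days}, rate: {rate_calendar:,.1f} per day")
```

Dividing by the 90-day difference gives the reported average of about 1,433 per day. Counting both endpoints (91 days) gives about 1,417 per day, so the convention should be stated whenever the rate is quoted.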


### The 3 V's Justify PySpark

The analysis of the Amazon sales data confirms its classification as **Big Data** due to the presence of the defining **3 V's**, which collectively necessitate the use of a distributed framework like Apache Spark.

* **Volume:** The **128,975** transactions accumulated within a short observation period illustrate the data **Volume**. As such records keep accumulating, they become increasingly difficult to process or analyze efficiently with traditional single-machine tools.

* **Velocity:** An average of approximately **1,433 transactions per day** over a **90-day** span reflects the **Velocity** at which new records are generated. Analyzing trends in data produced at this rate calls for the high-throughput capabilities that PySpark provides.

* **Variety:** The dataset demonstrates **Variety** through its blend of data types, including **Strings** (for categorical features like 'Category' and 'ship-city'), **Numeric** types (for 'Qty' and 'Amount'), and **Boolean** flags (such as 'B2B'). While currently structured, this inherent diversity, along with the potential future inclusion of unstructured data, reinforces the need for PySpark’s flexible DataFrame API.

In conclusion, the challenge presented by the data's **Volume, Velocity, and Variety** confirms that **PySpark is the appropriate technology** to manage, transform, and extract meaningful business insights from this large-scale e-commerce data.

## Section 2 — Hadoop & MapReduce

## Section 3 — Apache Spark (PySpark): RDD vs DataFrame

## Section 4 — Spark SQL & Streaming

## Section 5 — Distributed Storage (HDFS & S3)

## Section 6 — Data Processing with PySpark (Filter, Aggregation, Joins)

## Section 7 — Loading & Reading Data on S3

## Section 8 — Visualization Tasks

## Section 9 — Mini Analysis Report (Short Comments)