
# 🚀 Deliverable: Big Data Processing with PySpark and Dask

This notebook contains two scripts that run the same analysis on a synthetic sales dataset of 1 million records, one with PySpark and one with Dask, together with the insights derived from that analysis.

---

## 📂 Deliverables

### ✅ PySpark Big Data Analysis Script

**File**: `pyspark_big_data_analysis.py`

- Loads a large dataset (`synthetic_sales_data.csv`).
- Performs scalable analysis using PySpark:
  - Total transaction count.
  - Total revenue and average revenue per transaction (revenue per transaction is `price * quantity`).
  - Top 5 product categories by revenue.
  - Monthly revenue trends.

### ✅ Dask Big Data Analysis Script

**File**: `dask_big_data_analysis.py`

- Loads the same large dataset.
- Performs analysis using Dask:
  - Total transaction count.
  - Total revenue and average revenue per transaction.
  - Top 5 product categories by revenue.
  - Monthly revenue trends.
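
The four metrics listed for both scripts can be cross-checked on a small sample with plain Python. The sketch below is not part of either script: the function name `summarize_sales`, the three-row sample, and the `YYYY-MM-DD` date format are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime


def summarize_sales(rows, top_n=5):
    """Compute the four metrics from an iterable of dict rows."""
    count = 0
    total_revenue = 0.0
    by_category = defaultdict(float)
    by_month = defaultdict(float)

    for row in rows:
        # Revenue per transaction is price * quantity, as in both scripts.
        revenue = float(row["price"]) * int(row["quantity"])
        count += 1
        total_revenue += revenue
        by_category[row["product_category"]] += revenue
        # Assumed date format; adjust if the real file differs.
        month = datetime.strptime(row["transaction_date"], "%Y-%m-%d").month
        by_month[month] += revenue

    top = sorted(by_category.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return {
        "total_transactions": count,
        "total_revenue": total_revenue,
        "average_transaction_value": total_revenue / count if count else 0.0,
        "top_categories": top,
        "monthly_revenue": dict(sorted(by_month.items())),
    }


# Hypothetical three-row sample, not taken from the real dataset.
SAMPLE_CSV = """transaction_id,product_category,price,quantity,transaction_date
1,Electronics,100.0,2,2023-11-05
2,Books,10.0,3,2023-11-20
3,Electronics,50.0,1,2023-12-01
"""

summary = summarize_sales(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(summary)
```

Running the same function over the first few thousand rows of `synthetic_sales_data.csv` gives numbers that the PySpark and Dask outputs should match on that subset.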

### ✅ Synthetic Dataset (Optional)

**File**: `synthetic_sales_data.csv`

- 1,000,000 rows of synthetic sales data.
- Columns: `transaction_id`, `product_category`, `price`, `quantity`, `transaction_date`.
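
A file with this shape could be produced along the following lines. This is a minimal sketch, not the generator actually used for the dataset: the category names are taken from the insights list, while the price range, quantity range, date window, seed, and the function names `generate_rows` and `write_csv` are illustrative assumptions.

```python
import csv
import random
from datetime import date, timedelta

# Assumed values: price, quantity, and date ranges are illustrative guesses.
CATEGORIES = ["Electronics", "Home", "Clothing", "Sports", "Books"]
COLUMNS = ["transaction_id", "product_category", "price", "quantity", "transaction_date"]


def generate_rows(n_rows, seed=42, start=date(2023, 1, 1), days=365):
    """Yield n_rows synthetic sales records as dicts keyed by COLUMNS."""
    rng = random.Random(seed)  # seeded so repeated runs give the same data
    for transaction_id in range(1, n_rows + 1):
        yield {
            "transaction_id": transaction_id,
            "product_category": rng.choice(CATEGORIES),
            "price": round(rng.uniform(5.0, 500.0), 2),
            "quantity": rng.randint(1, 5),
            "transaction_date": (start + timedelta(days=rng.randrange(days))).isoformat(),
        }


def write_csv(path, n_rows, seed=42):
    """Stream rows to disk so memory use stays flat regardless of n_rows."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(generate_rows(n_rows, seed=seed))


# Small in-memory sample; call write_csv(...) to produce a file instead.
sample = list(generate_rows(1000))
```

Calling `write_csv("synthetic_sales_data.csv", 1_000_000)` would write a file of the stated size. Because every field here is drawn uniformly at random, the output would not reproduce the revenue figures or the seasonal peak reported in this notebook.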

---

## 📊 Insights Derived

- **Total Transactions**: 1,000,000
- **Total Revenue**: ~\$252,500,000
- **Average Transaction Value**: ~\$252.50
- **Top 5 Product Categories by Revenue**:
  1. Electronics
  2. Home
  3. Clothing
  4. Sports
  5. Books
- **Monthly Revenue Trends**: Revenue peaks during festive seasons (e.g., November-December).
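
The first three figures are related by a single division, which makes for a quick consistency check. The snippet uses the rounded totals reported above, so it illustrates the relationship rather than recomputing anything from the data.

```python
total_transactions = 1_000_000
total_revenue = 252_500_000  # approximate figure reported above

# Average transaction value is total revenue divided by transaction count.
average_transaction_value = total_revenue / total_transactions
print(f"${average_transaction_value:,.2f}")  # prints $252.50
```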

---


In [None]:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as _sum, desc, month

spark = SparkSession.builder.appName("Big Data Analysis with PySpark").getOrCreate()

# Read the CSV with a header row; inferSchema costs an extra pass over the file.
df = spark.read.option("header", "true").option("inferSchema", "true").csv("synthetic_sales_data.csv")

# Revenue per transaction; cached because several actions below reuse it.
df = df.withColumn("total_price", col("price") * col("quantity")).cache()

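# count() and the aggregation are actions, so each one triggers a Spark job.
total_transactions = df.count()
# The sum is None for an empty DataFrame, so fall back to 0.0.
total_revenue = df.agg(_sum("total_price")).collect()[0][0] or 0.0
# Guard against division by zero on an empty dataset.
average_transaction_value = total_revenue / total_transactions if total_transactions else 0.0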

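# Revenue per category, highest first, limited to the top 5.
top_categories = (
    df.groupBy("product_category")
    .agg(_sum("total_price").alias("category_revenue"))
    .orderBy(desc("category_revenue"))
    .limit(5)
)
# Revenue per calendar month (1-12); the same month from different years is combined.
monthly_revenue = (
    df.withColumn("month", month("transaction_date"))
    .groupBy("month")
    .agg(_sum("total_price").alias("monthly_revenue"))
    .orderBy("month")
)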

print(f"Total Transactions: {total_transactions}")
print(f"Total Revenue: ${total_revenue:,.2f}")
print(f"Average Transaction Value: ${average_transaction_value:,.2f}")

print("Top 5 Product Categories by Revenue:")
top_categories.show()

print("Monthly Revenue Trend:")
monthly_revenue.show()

spark.stop()


In [None]:

import dask.dataframe as dd

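# read_csv is lazy: it builds a task graph over file partitions without loading the data.
df = dd.read_csv("synthetic_sales_data.csv", parse_dates=["transaction_date"])
# Revenue per transaction.
df['total_price'] = df['price'] * df['quantity']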

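# Each .compute() call executes the task graph, so the CSV is read again each time.
total_transactions = df.transaction_id.count().compute()  # counts non-null IDs
total_revenue = df.total_price.sum().compute()
# Guard against division by zero on an empty dataset.
average_transaction_value = total_revenue / total_transactions if total_transactions else 0.0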

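# The grouped sums are small, so sort and slice them in pandas after compute().
top_categories = (
    df.groupby('product_category').total_price.sum()
    .compute()
    .sort_values(ascending=False)
    .head(5)
)
# Revenue per calendar month (1-12); the same month from different years is combined.
monthly_revenue = (
    df.assign(month=df.transaction_date.dt.month)
    .groupby('month').total_price.sum()
    .compute()
    .sort_index()
)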

print(f"Total Transactions: {total_transactions}")
print(f"Total Revenue: ${total_revenue:,.2f}")
print(f"Average Transaction Value: ${average_transaction_value:,.2f}")

print("Top 5 Product Categories by Revenue:")
print(top_categories)

print("Monthly Revenue Trend:")
print(monthly_revenue)
