# Big Data Management and Processing

## Bitcoin Market Trends

### Batch analytics using Apache Spark

#### Why Apache Spark?

Apache Spark is used here to demonstrate scalable batch analytics.
Although this dataset fits on a single machine, the same DataFrame code can run
across a cluster: only the session's master setting (`local[*]` below) has to
point at a cluster manager, while the transformations themselves stay unchanged.

Note: Running Spark requires a local Java runtime and a working Spark configuration. If Spark fails to initialize, treat this notebook as a conceptual demonstration of batch analytics; all results can be reproduced with the MongoDB-based notebook.
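
The fallback described in the note can be made explicit in code. The sketch below is one possible way to do it and is not part of the original notebook: `try_create_spark` is a hypothetical helper name, and it returns `None` when PySpark is not installed or the session cannot start (for example, when no Java runtime is found), so later cells can check for `None` and skip the Spark steps.

```python
def try_create_spark(app_name="Bitcoin Big Data Analysis", master="local[*]"):
    """Return a SparkSession, or None if Spark cannot be initialized.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    try:
        # Import inside the function so a missing PySpark install is handled too.
        from pyspark.sql import SparkSession

        return (
            SparkSession.builder
            .master(master)
            .appName(app_name)
            .config("spark.driver.bindAddress", "127.0.0.1")
            .config("spark.ui.showConsoleProgress", "false")
            .getOrCreate()
        )
    except Exception as exc:  # ImportError, missing Java, bad configuration, ...
        print(f"Spark unavailable ({type(exc).__name__}): {exc}")
        return None


spark = try_create_spark()
if spark is None:
    print("Skipping Spark cells; see the MongoDB-based notebook for the results.")
```

Catching a broad `Exception` is a deliberate choice here, since the failure type differs between a missing package, a missing Java runtime, and a configuration error.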

In [3]:
# Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Bitcoin Big Data Analysis") \
    .config("spark.driver.bindAddress", "127.0.0.1") \
    .config("spark.ui.showConsoleProgress", "false") \
    .getOrCreate()

In [4]:
# Load the data
spark_df = spark.read.csv(
    r"C:\Users\user\Downloads\btc_daily.csv",
    header=True,
    inferSchema=True)

#### Batch Aggregation Strategy

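The following transformations use Spark's distributed DataFrame API.
Spark splits the data into partitions, aggregates each partition in parallel,
and then shuffles and merges the partial results by key. This MapReduce-style
pattern is what lets group-and-summarize queries scale to large volumes of data.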
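
To make the MapReduce-style description concrete, here is a minimal pure-Python sketch of the group-and-aggregate pattern: map each row to a (key, value) pair, shuffle the pairs by key, then reduce each group. It is a conceptual illustration only, not how Spark is implemented; Spark builds an optimized execution plan and distributes these phases across partitions. The rows are made-up toy values, and the column names `day`, `tx_count`, and `total_fee_satoshis` are taken from the aggregation cell in this notebook.

```python
from collections import defaultdict

# Toy rows (hypothetical values, not taken from btc_daily.csv).
rows = [
    {"day": "2009-01-10", "tx_count": 10, "total_fee_satoshis": 0},
    {"day": "2009-01-11", "tx_count": 30, "total_fee_satoshis": 100},
    {"day": "2010-05-22", "tx_count": 200, "total_fee_satoshis": 5000},
    {"day": "2010-05-23", "tx_count": 400, "total_fee_satoshis": 7000},
]


def map_phase(row):
    """Emit (year, (tx_count, fee, 1)) so that averages can be rebuilt later."""
    year = int(row["day"][:4])
    return year, (row["tx_count"], row["total_fee_satoshis"], 1)


def shuffle_phase(pairs):
    """Group the mapped values by key, as a shuffle does across partitions."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Combine one group's values into avg_tx_count and total_fees."""
    tx_total = sum(v[0] for v in values)
    fee_total = sum(v[1] for v in values)
    n = sum(v[2] for v in values)
    return {"avg_tx_count": tx_total / n, "total_fees": fee_total}


grouped = shuffle_phase(map_phase(r) for r in rows)
yearly = {year: reduce_phase(vals) for year, vals in sorted(grouped.items())}

for year, stats in yearly.items():
    print(year, stats)
```

Emitting a count of 1 alongside each value is what allows an average to be merged from partial results: sums and counts can be added across groups of rows, whereas averages cannot.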

In [5]:
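# Yearly aggregation
# Import Spark's sum under an alias so it does not shadow Python's built-in sum()
from pyspark.sql.functions import year, to_date, avg, sum as spark_sum

# Convert the "day" column to a date type
spark_df = spark_df.withColumn("date", to_date("day"))

# Average daily transaction count and total fees per calendar year
yearly_trends = (
    spark_df.groupBy(year("date").alias("year"))
    .agg(
        avg("tx_count").alias("avg_tx_count"),
        spark_sum("total_fee_satoshis").alias("total_fees"),
    )
    .orderBy("year")
)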

yearly_trends.show(10)

+----+------------------+--------------+
|year|      avg_tx_count|    total_fees|
+----+------------------+--------------+
|2009|  91.3659217877095|     287000000|
|2010| 507.6849315068493|    4398957094|
|2011| 5210.315068493151|  308651762554|
|2012|23095.765027322403|  679745946734|
|2013|53817.098630136985| 1527463598088|
|2014| 69215.67123287672|  463657408928|
|2015|125134.30958904109|  820011053248|
|2016| 225755.8005464481| 2255558583994|
|2017| 285104.7369863014|10037068608288|
|2018|223001.74246575343| 2570482391966|
+----+------------------+--------------+
only showing top 10 rows


In [6]:
# Spike detection: the 10 days with the highest transaction counts
spark_df.orderBy(
    spark_df.tx_count.desc()).select("day", "tx_count").show(10)

+----------+--------+
|       day|tx_count|
+----------+--------+
|2024-04-23|  927010|
|2024-09-08|  910083|
|2024-07-21|  859629|
|2024-05-26|  852655|
|2024-07-23|  838977|
|2024-05-25|  835040|
|2024-10-21|  835011|
|2024-07-29|  826129|
|2024-07-22|  824999|
|2024-11-19|  810805|
+----------+--------+
only showing top 10 rows
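
Sorting by `tx_count` returns the busiest days overall, which for a growing network tend to cluster in the most recent period, as the output above shows. If a spike is meant as a day that stands out from its recent past, one option is to compare each day with a trailing average. The following pure-Python sketch illustrates that idea on made-up numbers; the window length, the 1.5x threshold, and the function name `find_spikes` are assumptions chosen for illustration, not values from the notebook.

```python
def find_spikes(series, window=3, ratio=1.5):
    """Return (day, value) pairs whose value exceeds ratio x the trailing mean.

    series: list of (day, value) tuples sorted by day.
    window: number of preceding days used as the baseline (assumed value).
    ratio:  multiple of the baseline that counts as a spike (assumed value).
    """
    spikes = []
    for i in range(window, len(series)):
        baseline = sum(v for _, v in series[i - window:i]) / window
        day, value = series[i]
        if baseline > 0 and value > ratio * baseline:
            spikes.append((day, value))
    return spikes


# Toy data (hypothetical values, not taken from btc_daily.csv).
toy = [
    ("2024-01-01", 100),
    ("2024-01-02", 110),
    ("2024-01-03", 90),
    ("2024-01-04", 300),  # well above the mean of the three previous days
    ("2024-01-05", 120),
    ("2024-01-06", 115),
]

print(find_spikes(toy))
```

On the full dataset, the same comparison could be expressed in Spark with a window function over rows ordered by date, which keeps the computation distributed.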
