##  Project Objective
The objective of this data engineering project is to build a scalable, end-to-end data pipeline that ingests, processes, and enriches Bitcoin’s daily market data. This pipeline will serve as the foundation for analytical reporting, time-series feature engineering, and predictive modeling related to Bitcoin's market behavior.

##  Benefits
For a Bitcoin investment advisory firm, this pipeline:
- Gives the advisory team data it can trust for research, reports, and investor recommendations, with less manual wrangling and less room for error.
- Helps the firm detect trends, shifts, and patterns in the Bitcoin market, so analysts can make better-informed buy/sell/hold calls.
- Enables risk analysts to flag abnormal behavior early, protecting clients and improving the firm's credibility.
- Enhances the firm’s client reporting and supports data-driven decision-making.
- Reduces operational overhead, freeing analysts to focus on strategy instead of data preparation.

##  Description
This notebook handles the ingestion of raw Bitcoin daily market data. It reads the dataset from a CSV source, infers schema, performs initial inspection, and writes the output to a raw (bronze) layer for further processing.
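
As one possible shape for the "initial inspection" step, the sketch below checks each row for internal consistency: `Open` and `Close` should lie between `Low` and `High`, and `Volume` should not be negative. It is written in plain Python so it runs without a cluster. The function name `find_inconsistent_rows` and the rules themselves are assumptions for illustration, not part of the original notebook; the two sample rows are copied from the data preview printed later in this notebook.

```python
from datetime import date


def find_inconsistent_rows(rows):
    """Return the rows whose OHLC values contradict each other (hypothetical helper)."""
    bad = []
    for row in rows:
        low, high = row["Low"], row["High"]
        # Open and Close must both fall inside the day's [Low, High] range
        in_range = all(low <= row[col] <= high for col in ("Open", "Close"))
        if not in_range or row["Volume"] < 0:
            bad.append(row)
    return bad


# First two rows of the preview output shown further down
sample_rows = [
    {"Date": date(2014, 9, 17), "Close": 457.3340148925781, "High": 468.17401123046875,
     "Low": 452.4219970703125, "Open": 465.864013671875, "Volume": 21056800},
    {"Date": date(2014, 9, 18), "Close": 424.44000244140625, "High": 456.8599853515625,
     "Low": 413.10400390625, "Open": 456.8599853515625, "Volume": 34483200},
]

inconsistent = find_inconsistent_rows(sample_rows)
print(f"{len(inconsistent)} inconsistent row(s) out of {len(sample_rows)}")
```

On the full dataset the same rules could be expressed as a Spark filter; the plain-Python form is used here only to keep the example self-contained.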

##  Tools & Technologies Used in This Notebook
| Tool / Library | Purpose |
|----------------|---------|
| **Apache Spark (PySpark)** | Distributed data processing for ingesting and handling large datasets |
| **Databricks / Jupyter Notebook** | Interactive environment for writing and testing code |
| **Delta Lake / Parquet** | Optimized storage format for raw (bronze) layer |
| **DBFS or Cloud Storage (e.g., `/Volumes/bitcoin/market_data/coin/`)** | Storage location for the input dataset |

In [0]:
%fs ls /Volumes/bitcoin/market_data/coin/Bitcoin_history_data.csv    # Confirm the source CSV exists in the volume and show its size and modification time


path,name,size,modificationTime
dbfs:/Volumes/bitcoin/market_data/coin/Bitcoin_history_data.csv,Bitcoin_history_data.csv,335981,1753691038000
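
The `size` and `modificationTime` columns in the listing are raw numbers. A minimal sketch for making them readable follows, assuming `size` is in bytes and `modificationTime` is in milliseconds since the Unix epoch (which matches the magnitude of the value shown). The variable names are illustrative.

```python
from datetime import datetime, timezone

# Values copied from the listing output above
size_bytes = 335981
modification_time_ms = 1753691038000

# Assumption: modificationTime is epoch milliseconds, so divide by 1000 for seconds
modified_at = datetime.fromtimestamp(modification_time_ms / 1000, tz=timezone.utc)
size_kib = size_bytes / 1024

print(f"Last modified (UTC): {modified_at.isoformat()}")
print(f"Size: {size_kib:.1f} KiB")
```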


In [0]:
# Read the Bitcoin history CSV file into a Spark DataFrame
# - header=True: uses the first row as column names
# - inferSchema=True: automatically infers data types for each column
df_raw = spark.read.csv("/Volumes/bitcoin/market_data/coin/Bitcoin_history_data.csv", header=True, inferSchema=True)

# Display the schema of the DataFrame to understand the structure and data types
df_raw.printSchema()

# Show the first 5 rows of the DataFrame for a quick preview of the data
df_raw.show(5)

# Write the raw DataFrame to a Delta Lake table in the 'bronze' layer
# - format("delta"): saves the data in Delta format for ACID transactions and versioning
# - mode("overwrite"): replaces any existing data in the target table
# - saveAsTable(): registers the table in the metastore under the specified name
df_raw.write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("bitcoin.market_data.bronze_bitcoin")

root
 |-- Date: date (nullable = true)
 |-- Close: double (nullable = true)
 |-- High: double (nullable = true)
 |-- Low: double (nullable = true)
 |-- Open: double (nullable = true)
 |-- Volume: long (nullable = true)
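
Schema inference costs an extra pass over the file and can change silently if the source data changes. One alternative, sketched here under the assumption that the inferred types above are the intended ones, is to pin the schema with a DDL-formatted string, which `spark.read.csv` accepts through its `schema` parameter. The names `BRONZE_SCHEMA_DDL` and `read_bitcoin_csv` are hypothetical and not part of the original notebook.

```python
# DDL-formatted schema string mirroring the inferred schema printed above
BRONZE_SCHEMA_DDL = (
    "Date DATE, Close DOUBLE, High DOUBLE, "
    "Low DOUBLE, Open DOUBLE, Volume BIGINT"
)


def read_bitcoin_csv(spark, path):
    """Read the CSV with a fixed schema instead of inferSchema=True.

    `spark` is an existing SparkSession (predefined in Databricks notebooks).
    """
    return spark.read.csv(path, header=True, schema=BRONZE_SCHEMA_DDL)
```

In the notebook this could be called as `df_raw = read_bitcoin_csv(spark, "/Volumes/bitcoin/market_data/coin/Bitcoin_history_data.csv")`. With a fixed schema, values that do not match the declared type are read as null under Spark's default permissive mode, so a null count per column is a useful follow-up check.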

+----------+------------------+------------------+------------------+-----------------+--------+
|      Date|             Close|              High|               Low|             Open|  Volume|
+----------+------------------+------------------+------------------+-----------------+--------+
|2014-09-17| 457.3340148925781|468.17401123046875| 452.4219970703125| 465.864013671875|21056800|
|2014-09-18|424.44000244140625| 456.8599853515625|   413.10400390625|456.8599853515625|34483200|
|2014-09-19| 394.7959899902344| 427.8349914550781| 384.5320129394531|424.1029968261719|37919700|
|2014-09-20|408.90399169921875| 423.2959899902344|389.88299560546875|394.6730041503906|36863600|
|2014-09-21| 398.8210144042969| 412.4259948730469| 393.1809997558594|408.0849914550781|26580100|
+---