<a href="https://colab.research.google.com/github/Shazizan/portfolio/blob/master/etl_vault_ps_realtime_crypto_by_coingecko.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Real-Time Crypto (CoinGecko) ETL with PySpark and GitHub**

# **Setup & Data Preparation**

## **1 - Setup Configuration: Install Java & PySpark**

Insight:
- !apt-get → the leading ! runs the line as a shell command in Colab; apt-get is Ubuntu's package manager, used here to install Java.
- !pip install → installs Python packages into the Colab runtime.

In [1]:
# Install Java 11 (required by Spark)
!apt-get install openjdk-11-jdk -y

# Install PySpark
!pip install pyspark

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  fonts-dejavu-core fonts-dejavu-extra libatk-wrapper-java
  libatk-wrapper-java-jni libxt-dev libxtst6 libxxf86dga1 openjdk-11-jre
  x11-utils
Suggested packages:
  libxt-doc openjdk-11-demo openjdk-11-source visualvm mesa-utils
The following NEW packages will be installed:
  fonts-dejavu-core fonts-dejavu-extra libatk-wrapper-java
  libatk-wrapper-java-jni libxt-dev libxtst6 libxxf86dga1 openjdk-11-jdk
  openjdk-11-jre x11-utils
0 upgraded, 10 newly installed, 0 to remove and 38 not upgraded.
Need to get 5,367 kB of archives.
After this operation, 15.2 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 fonts-dejavu-core all 2.37-2build1 [1,041 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 fonts-dejavu-extra all 2.37-2build1 [2,041 kB]
Get:3 http://archive.ubuntu.com/ubuntu jam
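
Since Spark will not start without Java, it can help to confirm that the install actually made a java executable visible to the runtime. The following is a minimal sketch using only the standard library; the helper name java_available is hypothetical and not part of the original notebook.

```python
import shutil


def java_available() -> bool:
    """Return True if a `java` executable is found on PATH (hypothetical helper)."""
    return shutil.which("java") is not None


if java_available():
    print("Java found at:", shutil.which("java"))
else:
    print("Java not found; rerun the install cell above.")
```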

## **2 - Import PySpark & Create Spark Session**

Insight:
- SparkSession is like the “engine” that lets us run PySpark commands.
- If the cell prints “Spark is ready!”, the session was created and we’re all set.

In [2]:
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("CryptoETL").getOrCreate()

print("Spark is ready!")

Spark is ready!


# **Extract**

## **Extract data using Python (requests library)**

Insight:
- PySpark doesn’t call REST APIs directly, so we usually fetch the data with plain Python first and then convert it into a Spark DataFrame.

1️⃣ Import required modules

In [4]:
import requests
from pyspark.sql import SparkSession

2️⃣ Fetch data from the CoinGecko API (the response is in JSON format)

In [5]:
url = "https://api.coingecko.com/api/v3/simple/price?ids=bitcoin,ethereum&vs_currencies=usd"
response = requests.get(url, timeout=10)  # fail fast instead of hanging on a stalled connection

# Check if request was successful
if response.status_code == 200:
    print("Data fetched successfully!")
    data = response.json()
    print(data)
else:
    print("Error fetching data:", response.status_code)

Data fetched successfully!
{'bitcoin': {'usd': 114546}, 'ethereum': {'usd': 4141.59}}
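
The request URL above hard-codes the coin IDs and the currency. As a hedged sketch of how it could be parameterized, the helper below (build_price_url, a hypothetical name) assembles the same query string from lists using only the standard library. It assumes the simple/price endpoint keeps accepting comma-separated ids and vs_currencies, as in the URL used in the notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.coingecko.com/api/v3/simple/price"


def build_price_url(coin_ids, vs_currencies=("usd",)):
    """Build the simple/price URL for the given coins and currencies (hypothetical helper)."""
    params = {
        "ids": ",".join(coin_ids),
        "vs_currencies": ",".join(vs_currencies),
    }
    # safe="," keeps the comma separators readable instead of encoding them as %2C
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


url = build_price_url(["bitcoin", "ethereum"])
print(url)
```

Adding another coin then only means extending the list passed to the helper; the rest of the Extract step stays unchanged.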


# **Transformation**

## **Transform / convert JSON into PySpark-friendly format**

Insight:
- spark.createDataFrame accepts a list of dictionaries (one dictionary per row), so we reshape the nested JSON into that form.

In [8]:
# Convert JSON to list of dicts
data_list = [{"coin": k, "usd": v["usd"]} for k, v in data.items()]

In [9]:
print(data_list)

[{'coin': 'bitcoin', 'usd': 114546}, {'coin': 'ethereum', 'usd': 4141.59}]


## **Create PySpark DataFrame**

In [10]:
# The Spark session already exists (created in step 2), so it is not recreated here.
# spark = SparkSession.builder.appName("CryptoETL").getOrCreate()

# Converting the list of dicts directly raises the error shown below,
# so the line is left commented out:
# df = spark.createDataFrame(data_list)

PySparkTypeError: [CANNOT_MERGE_TYPE] Can not merge type `LongType` and `DoubleType`.

Insight:
This error is common in PySpark when we create a DataFrame from a list of dictionaries and the values in one column have mixed Python types. Here:

- bitcoin's usd value is an integer (114546), which PySpark infers as LongType
- ethereum's usd value is a float (4141.59), which PySpark infers as DoubleType

PySpark's schema inference does not merge LongType and DoubleType automatically, so it throws CANNOT_MERGE_TYPE.
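
The mixed types can be spotted in plain Python before Spark ever sees the data. The sketch below uses the same values as the data_list printed earlier; column_types is a hypothetical helper, not part of PySpark.

```python
# Same shape and values as the data_list printed earlier in the notebook
data_list = [
    {"coin": "bitcoin", "usd": 114546},
    {"coin": "ethereum", "usd": 4141.59},
]


def column_types(rows, column):
    """Return the set of Python type names found in one column (hypothetical helper)."""
    return {type(row[column]).__name__ for row in rows}


types_found = column_types(data_list, "usd")
print(types_found)
if len(types_found) > 1:
    print("Mixed types in 'usd'; schema inference is likely to fail.")
```

A check like this could run right after the conversion step, so that a type mismatch surfaces with a readable message instead of a Spark stack trace.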

## **Fixing the Error in the Transformation / Conversion Step**

- Fix the error by casting every usd value to float during the conversion.
- Then repeat the step that creates the Spark DataFrame.

In [11]:
# Ensure all usd values are float
data_list = [{"coin": k, "usd": float(v["usd"])} for k, v in data.items()]

# Create PySpark DataFrame
df = spark.createDataFrame(data_list)

In [20]:
df.show()

+--------+--------+
|    coin|     usd|
+--------+--------+
| bitcoin|114546.0|
|ethereum| 4141.59|
+--------+--------+



## **Optional - Add Timestamp Column**

In [22]:
from pyspark.sql.functions import current_timestamp

df = df.withColumn("timestamp", current_timestamp())

In [23]:
df.show()

+--------+--------+--------------------+
|    coin|     usd|           timestamp|
+--------+--------+--------------------+
| bitcoin|114546.0|2025-10-01 07:32:...|
|ethereum| 4141.59|2025-10-01 07:32:...|
+--------+--------+--------------------+
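
One thing to keep in mind: current_timestamp() is evaluated when Spark runs a query, and because DataFrames are lazy, each action (df.show(), the later write) evaluates it again. If the goal is to record when the price was fetched, one alternative is to stamp the rows in Python before creating the DataFrame. This is a sketch under that assumption, not what the notebook does; stamp_rows and the fetched_at column name are hypothetical, and the fixed date is only an illustrative value.

```python
from datetime import datetime, timezone


def stamp_rows(rows, fetched_at=None):
    """Return copies of rows with a 'fetched_at' ISO-8601 UTC string (hypothetical helper)."""
    if fetched_at is None:
        fetched_at = datetime.now(timezone.utc)
    stamp = fetched_at.isoformat()
    # Build new dicts so the input rows are left untouched
    return [{**row, "fetched_at": stamp} for row in rows]


rows = [{"coin": "bitcoin", "usd": 114546.0}, {"coin": "ethereum", "usd": 4141.59}]
fixed_time = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)  # illustrative value only
stamped = stamp_rows(rows, fixed_time)
print(stamped[0])
```

The stamped list can then be passed to spark.createDataFrame in place of data_list, and the value stays the same no matter how many times the DataFrame is evaluated.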



# **Load**

Here I save the result as a JSON file and push it to my GitHub repository (pipeline-vault).

## **1 - Save PySpark DataFrame as JSON**

In [24]:
# Save DataFrame as JSON locally in Colab
df.coalesce(1).write.mode("overwrite").json("crypto_prices")

Insight:
- This creates a folder called crypto_prices in Colab.
- Inside it, there will be a file like part-00000-xxxx.json containing my data.
- .coalesce(1) → PySpark splits a DataFrame across multiple partitions so it can process data in parallel, and by default it writes one file per partition. .coalesce(1) tells PySpark to merge all partitions into a single partition before writing, so only one part file is produced.

## **2 - Rename JSON file for GitHub upload**

In [26]:
# These are standard-library modules, so no extra installation is needed.
import os      # interacting with the operating system (folders, paths)
import glob    # finds files/folders matching a pattern
import shutil  # moving or copying files

# Find the JSON file inside the folder
json_file = glob.glob("crypto_prices/part-*.json")[0]
# Rename/move it to a simple file
shutil.move(json_file, "crypto_prices.json")
print("Saved as crypto_prices.json")


Saved as crypto_prices.json


Insight:
- glob.glob(pattern) → returns a list of file paths matching the pattern.
- The pattern "crypto_prices/part-*.json" means:
  - (1) Go inside the folder crypto_prices
  - (2) Look for any file starting with part- and ending with .json

Why is this needed?
- When PySpark writes JSON, it creates a file like part-00000-xxxx.json inside the folder.
- We need to find this file automatically because its name has a random suffix.

Another insight:
- [0] takes the first file from the list returned by glob.glob.
- In our case, there should be only one JSON file because we used .coalesce(1).

✅ After running the code above, I should have a single JSON file ready to push.
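
Taking [0] assumes the list is non-empty; if the write step failed or the folder is stale, it raises a bare IndexError. A slightly more defensive variant is sketched below. The helper name find_single_part_file is hypothetical, and the demo runs against a throwaway folder that only mimics Spark's output layout.

```python
import glob
import os
import tempfile


def find_single_part_file(folder):
    """Return the only part-*.json file in folder, or raise if there is not exactly one."""
    matches = glob.glob(os.path.join(folder, "part-*.json"))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"Expected exactly one part file in {folder!r}, found {len(matches)}"
        )
    return matches[0]


# Demo on a temporary folder containing one fake part file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "part-00000-demo.json")
    with open(demo_path, "w") as f:
        f.write('{"coin":"bitcoin","usd":114546.0}\n')
    found = find_single_part_file(tmp)
    print(os.path.basename(found))
```

In the notebook, the call would be find_single_part_file("crypto_prices") in place of the glob.glob(...)[0] line.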

## **3 - Upload JSON file to GitHub using API**

Requirement:
- GitHub username
- Repository name
- Personal Access Token (PAT) with the repo scope
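
Because the token grants write access to the repository, it is safer not to paste it into a cell that may later be shared or committed. A minimal sketch of reading it from an environment variable instead follows; the variable name GITHUB_TOKEN and the helper load_github_token are assumptions for illustration, and the demo passes a dummy mapping so it runs without a real token.

```python
import os


def load_github_token(env=None, var_name="GITHUB_TOKEN"):
    """Read the token from an environment variable (the variable name is an assumption)."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable before uploading.")
    return token


# Demo with an explicit mapping so the example runs without a real token
token = load_github_token({"GITHUB_TOKEN": "dummy-token-for-illustration"})
print("Token loaded, length:", len(token))
```

Calling load_github_token() with no arguments reads from the real environment and fails early with a clear message when the variable is missing.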

In [28]:
import base64
import requests

# Read JSON file
with open("crypto_prices.json", "r") as f:
    content = f.read()

# Encode content to base64 (GitHub API requirement)
encoded_content = base64.b64encode(content.encode()).decode()

# GitHub info
username = "Shazizan"
repo = "pipeline-vault"
path = "crypto_prices.json"  # destination path in repo
token = "REPLACE_WITH_YOUR_OWN_TOKEN"  # never commit a real token

# Build the contents API URL from the variables above
url = f"https://api.github.com/repos/{username}/{repo}/contents/{path}"

# Upload JSON to GitHub
response = requests.put(
    url,
    headers={"Authorization": f"token {token}"},
    json={
        "message": "Add latest crypto prices",
        "content": encoded_content
    }
)

print(response.json())

{'content': {'name': 'crypto_prices.json', 'path': 'crypto_prices.json', 'sha': '05ce7cd2f7d348a019de99bf9a580bc250c9cc33', 'size': 146, 'url': 'https://api.github.com/repos/Shazizan/pipeline-vault/contents/crypto_prices.json?ref=main', 'html_url': 'https://github.com/Shazizan/pipeline-vault/blob/main/crypto_prices.json', 'git_url': 'https://api.github.com/repos/Shazizan/pipeline-vault/git/blobs/05ce7cd2f7d348a019de99bf9a580bc250c9cc33', 'download_url': 'https://raw.githubusercontent.com/Shazizan/pipeline-vault/main/crypto_prices.json', 'type': 'file', '_links': {'self': 'https://api.github.com/repos/Shazizan/pipeline-vault/contents/crypto_prices.json?ref=main', 'git': 'https://api.github.com/repos/Shazizan/pipeline-vault/git/blobs/05ce7cd2f7d348a019de99bf9a580bc250c9cc33', 'html': 'https://github.com/Shazizan/pipeline-vault/blob/main/crypto_prices.json'}}, 'commit': {'sha': 'c01139e40fe9eece77a96ac9129583ea7e6b8fe4', 'node_id': 'C_kwDOPznHZNoAKGMwMTEzOWU0MGZlOWVlY2U3N2E5NmFjOTEyOTU4M2

## **4 - Validate / Check GitHub**

- Go to the repository (pipeline-vault) → the crypto_prices.json file with the latest data should be available there.
- In principle, this notebook can be executed at any time to retrieve the latest prices and update the data on GitHub (updating a file that already exists additionally requires its current sha in the request body). This marks my first independent, hands-on experiment in managing real-time data.
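
For re-runs, the upload cell needs one adjustment: when crypto_prices.json already exists, the GitHub contents API expects the file's current blob sha alongside the new content. The sketch below shows only the request body for both cases; build_contents_payload is a hypothetical helper, and the sha string is a placeholder rather than a real value.

```python
import base64


def build_contents_payload(text, message, sha=None):
    """Build the JSON body for a PUT to the repository contents endpoint (hypothetical helper)."""
    payload = {
        "message": message,
        "content": base64.b64encode(text.encode("utf-8")).decode("ascii"),
    }
    if sha is not None:
        # Needed when a file already exists at the destination path
        payload["sha"] = sha
    return payload


sample_text = '{"coin":"bitcoin"}\n'

# First upload: no sha. Later runs: pass the sha of the file currently in the repo.
create_body = build_contents_payload(sample_text, "Add latest crypto prices")
update_body = build_contents_payload(
    sample_text, "Update crypto prices", sha="<sha from GET response>"
)
print(sorted(create_body), sorted(update_body))
```

In the notebook, the sha could be obtained by sending requests.get to the same contents URL with the same Authorization header: a 200 response carries the value in its "sha" field, while a 404 means the file does not exist yet and no sha is needed.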