# 🚀 ETL Pipeline on Azure with PySpark (CSV Only)
---
This notebook demonstrates an **ETL pipeline** on **Azure Data Lake Storage Gen2 (ADLS Gen2)** using **PySpark**. Unlike the Parquet-based pipeline, here we focus on **CSV ingestion, transformation, and saving back as CSV**.

### 🔑 Key Steps
1. Configure ADLS access (OAuth-based)
2. Extract raw CSV data (5 Olympic datasets)
3. Perform data exploration & transformations
4. Save transformed data back to ADLS in CSV format
5. Verify outputs

⚠️ **Note:** Before uploading this notebook to GitHub, store the client ID and secret securely in Azure Key Vault or Databricks secrets. The cells below use placeholders only; do not replace them with hardcoded credentials in a committed notebook.

## 📦 Import Dependencies

In [None]:
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

## 🔑 Configure Azure Data Lake Access
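We configure Spark to authenticate with Azure Data Lake Storage Gen2 using **OAuth credentials** (the client-credentials flow of a service principal, via `ClientCredsTokenProvider`).

👉 Replace the `<storage_account>`, `<client-id>`, `<client-secret>`, and `<tenant-id>` placeholders with your own values.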
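
One way to keep real credentials out of the notebook is to build the settings in this section from values read at run time. The following is a minimal sketch, not part of the original notebook: the helper names `build_oauth_conf` and `apply_oauth_conf_from_env` and the environment variable names (`AZURE_STORAGE_ACCOUNT`, `AZURE_CLIENT_ID`, `AZURE_CLIENT_SECRET`, `AZURE_TENANT_ID`) are assumptions chosen for illustration. On Databricks, a secret scope would be a more typical source than environment variables.

```python
import os


def build_oauth_conf(storage_account, client_id, client_secret, tenant_id):
    """Return the ADLS OAuth settings of this section as a {key: value} dict.

    Hypothetical helper: it only builds strings, so it can be checked
    without a Spark session or real credentials.
    """
    suffix = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def apply_oauth_conf_from_env(spark, env=os.environ):
    """Read credentials from environment variables (assumed names) and apply them."""
    conf = build_oauth_conf(
        storage_account=env["AZURE_STORAGE_ACCOUNT"],
        client_id=env["AZURE_CLIENT_ID"],
        client_secret=env["AZURE_CLIENT_SECRET"],
        tenant_id=env["AZURE_TENANT_ID"],
    )
    for key, value in conf.items():
        spark.conf.set(key, value)
    return conf
```

With this in place, a single call such as `apply_oauth_conf_from_env(spark)` would stand in for the individual `spark.conf.set` calls, and no secret appears in the notebook source.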

In [None]:
spark.conf.set("fs.azure.account.auth.type.<storage_account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage_account>.dfs.core.windows.net",
              "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage_account>.dfs.core.windows.net", "<client-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage_account>.dfs.core.windows.net", "<client-secret>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage_account>.dfs.core.windows.net",
              "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

## 📥 Step 1: Extract Raw CSV Data
We read 5 Olympic datasets (Athletes, Coaches, EntriesGender, Medals, Teams) from the **raw-data** zone in ADLS.

In [None]:
athletes = spark.read.csv("abfss://<container>@<storage_account>.dfs.core.windows.net/raw-data/Athletes.csv",
                          header=True, inferSchema=True)
coaches = spark.read.csv("abfss://<container>@<storage_account>.dfs.core.windows.net/raw-data/Coaches.csv",
                         header=True, inferSchema=True)
entries_gender = spark.read.csv("abfss://<container>@<storage_account>.dfs.core.windows.net/raw-data/EntriesGender.csv",
                                header=True, inferSchema=True)
medals = spark.read.csv("abfss://<container>@<storage_account>.dfs.core.windows.net/raw-data/Medals.csv",
                        header=True, inferSchema=True)
teams = spark.read.csv("abfss://<container>@<storage_account>.dfs.core.windows.net/raw-data/Teams.csv",
                       header=True, inferSchema=True)
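
The five reads above differ only in the file name, so the path can be built in one place. This is a rough sketch under that assumption; `RAW_DATASETS`, `abfss_path`, and `read_raw_datasets` are hypothetical names that do not appear in the original notebook, and the container and storage account remain placeholders you supply.

```python
RAW_DATASETS = ["Athletes", "Coaches", "EntriesGender", "Medals", "Teams"]


def abfss_path(container, storage_account, *parts):
    """Build an abfss:// URI for a path inside an ADLS Gen2 container."""
    base = f"abfss://{container}@{storage_account}.dfs.core.windows.net"
    return "/".join([base, *parts])


def read_raw_datasets(spark, container, storage_account, names=RAW_DATASETS):
    """Read each <name>.csv from the raw-data zone into a dict of DataFrames."""
    return {
        name: spark.read.csv(
            abfss_path(container, storage_account, "raw-data", f"{name}.csv"),
            header=True,
            inferSchema=True,
        )
        for name in names
    }
```

Calling `read_raw_datasets(spark, "<container>", "<storage_account>")` would return a dict keyed by dataset name, for example `raw["Medals"]`, instead of five separate variables.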

## 🔍 Step 2: Data Exploration
Quick look at row counts and schema.

In [None]:
print("Athletes rows:", athletes.count())
print("Coaches rows:", coaches.count())
print("EntriesGender rows:", entries_gender.count())
print("Medals rows:", medals.count())
print("Teams rows:", teams.count())

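athletes.show(5)
medals.show(5)

# Print inferred schemas; inferSchema=True guesses each column's type from the data
athletes.printSchema()
entries_gender.printSchema()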
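
If the DataFrames are kept in a dict, the same exploration can be collected in one pass. The helper below is a small sketch, and `summarize` is a hypothetical name; it relies only on `DataFrame.count()` and `DataFrame.columns`.

```python
def summarize(datasets):
    """Return {name: (row_count, column_names)} for a dict of DataFrames."""
    return {
        name: (df.count(), list(df.columns))
        for name, df in datasets.items()
    }
```

For example, `summarize({"Athletes": athletes, "Medals": medals})` would give one entry per dataset that can be printed or compared later.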

## 🔄 Step 3: Transformations
- Cast the `Female`, `Male`, and `Total` counts to integers
- Rank countries by gold medals
- Calculate the female and male share of total entries for each row (`Female / Total`, `Male / Total`)

In [None]:
# Cast numeric columns
entries_gender = entries_gender.withColumn("Female", col("Female").cast(IntegerType())) \
                               .withColumn("Male", col("Male").cast(IntegerType())) \
                               .withColumn("Total", col("Total").cast(IntegerType()))

# Top countries by gold medals
top_gold_medal_countries = medals.orderBy("Gold", ascending=False) \
                                 .select("TeamCountry", "Gold")
top_gold_medal_countries.show(10)

# Female and male share of total entries per row (a ratio, despite the Avg_ prefix)
average_entries_by_gender = entries_gender.withColumn("Avg_Female", col("Female") / col("Total")) \
                                               .withColumn("Avg_Male", col("Male") / col("Total"))
average_entries_by_gender.show(5)
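
To make the meaning of `Avg_Female` and `Avg_Male` concrete, here is the same arithmetic for a single row in plain Python. This is only an illustration: the numbers are made up rather than taken from the dataset, and `gender_shares` is a hypothetical helper. It also shows one way to handle a missing or zero `Total`, which the cell above does not guard against.

```python
def gender_shares(female, male, total):
    """Return (female_share, male_share) of total entries for one row.

    Returns (None, None) when total is missing or zero, so the caller
    never divides by zero.
    """
    if not total:
        return None, None
    return female / total, male / total


# Made-up row for illustration only: 30 female and 70 male entries.
female_share, male_share = gender_shares(female=30, male=70, total=100)
print(female_share, male_share)
```

Each value is a fraction between 0 and 1 for that row, and the two shares sum to 1 whenever `Female + Male == Total`.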

## 💾 Step 4: Save Transformed Data (CSV)
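We write all five datasets to the **transformed-data** zone. Only `entries_gender` was changed above (integer casts); the derived DataFrames `top_gold_medal_countries` and `average_entries_by_gender` are displayed in Step 3 but not written here.

⚠️ `.repartition(1)` makes Spark write a single part file per dataset, which is handy for demos. Each output path is still a directory containing one `part-*.csv` file, and writing through one partition gives up parallelism on large data.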

In [None]:
athletes.repartition(1).write.mode("overwrite").option("header", "true").csv("abfss://<container>@<storage_account>.dfs.core.windows.net/transformed-data/Athletes")
coaches.repartition(1).write.mode("overwrite").option("header", "true").csv("abfss://<container>@<storage_account>.dfs.core.windows.net/transformed-data/Coaches")
entries_gender.repartition(1).write.mode("overwrite").option("header", "true").csv("abfss://<container>@<storage_account>.dfs.core.windows.net/transformed-data/EntriesGender")
medals.repartition(1).write.mode("overwrite").option("header", "true").csv("abfss://<container>@<storage_account>.dfs.core.windows.net/transformed-data/Medals")
teams.repartition(1).write.mode("overwrite").option("header", "true").csv("abfss://<container>@<storage_account>.dfs.core.windows.net/transformed-data/Teams")
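
The five write statements share the same options, so they can be expressed as a loop over a dict of DataFrames. The sketch below assumes that layout; `transformed_path`, `write_single_csv`, and `write_all` are hypothetical names, and the writer calls are the same ones used in the cell above.

```python
def transformed_path(container, storage_account, name):
    """Build the output directory URI for one dataset in the transformed-data zone."""
    return (
        f"abfss://{container}@{storage_account}.dfs.core.windows.net"
        f"/transformed-data/{name}"
    )


def write_single_csv(df, path):
    """Write a DataFrame as one headered CSV part file under the directory `path`."""
    df.repartition(1).write.mode("overwrite").option("header", "true").csv(path)


def write_all(datasets, container, storage_account):
    """Write every DataFrame in `datasets` (name -> DataFrame); return the paths used."""
    paths = {}
    for name, df in datasets.items():
        paths[name] = transformed_path(container, storage_account, name)
        write_single_csv(df, paths[name])
    return paths
```

A call such as `write_all({"Athletes": athletes, "Medals": medals}, "<container>", "<storage_account>")` would write each dataset and return the directories to check in the verification step.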

## ✅ Step 5: Verification
Reload one dataset from the transformed zone to confirm schema and data.

In [None]:
athletes_check = spark.read.csv("abfss://<container>@<storage_account>.dfs.core.windows.net/transformed-data/Athletes",
                               header=True, inferSchema=True)

athletes_check.show(5)
athletes_check.printSchema()
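
Beyond eyeballing a few rows, a row-count comparison gives a quick check that nothing was lost on the round trip. This is a minimal sketch; `compare_row_counts` is a hypothetical helper that expects two dicts of DataFrames keyed by the same dataset names.

```python
def compare_row_counts(originals, reloaded):
    """Return {name: (original_count, reloaded_count)} for datasets whose counts differ."""
    mismatches = {}
    for name, df in originals.items():
        before = df.count()
        after = reloaded[name].count()
        if before != after:
            mismatches[name] = (before, after)
    return mismatches
```

An empty result from `compare_row_counts({"Athletes": athletes}, {"Athletes": athletes_check})` would mean the counts match. Because CSV stores no type information, the reloaded schema is inferred again from the text, so comparing `printSchema()` output against the original is still worthwhile.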

## 📌 Summary
In this notebook we:
- Configured ADLS access via OAuth
- Extracted raw CSV datasets
- Performed basic transformations & analytics
- Saved outputs back to ADLS in CSV format

👉 Next Steps:
- Automate pipeline with **Azure Data Factory**
- Add **CI/CD integration** for production
- Build **Power BI dashboards** for reporting