# 🏦 ETL Project: Largest Banks Data Pipeline

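This notebook performs a full **Extract–Transform–Load (ETL)** process on global banking data from Wikipedia.  
It scrapes Wikipedia's table of the world's largest banks, converts the USD figures into GBP, EUR, and INR using exchange rates read from a local CSV file, and loads the final dataset into both a **CSV file** and a **SQLite database**. The output columns keep the `MC_` (market capitalization) prefix, although the column scraped in this run is the page's total assets figure in US$ billion.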

## ⚙️ Import Required Libraries

We’ll use popular Python libraries for web scraping, data manipulation, and database operations:
- `requests` and `BeautifulSoup` for web scraping  
- `pandas` and `numpy` for data transformation  
- `io.StringIO` for handing the scraped HTML text to `pandas.read_html`  
- `sqlite3` for database loading  
- `datetime` for logging process timestamps  

In [1]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from datetime import datetime
from io import StringIO
import sqlite3

## 🌐 Configuration and Global Variables

Below are the file paths, URLs, headers, and table attribute names used throughout the ETL pipeline.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_largest_banks'
header = {"User-Agent": "Mozilla/5.0"}

attributes = ["Name", "MC_USD_Billion"]
table_attributes = ["Name", "MC_USD_Billion",
                    "MC_GBP_Billion", "MC_EUR_Billion", "MC_INR_Billion"]

output_csv_path = "Largest_banks_data.csv"
database_name = "Banks.db"
table_name = "Largest_banks"
log_file = "code_log.txt"

## 🧱 Helper Functions

We’ll define the following functions:
- `log_progress()` — Records process updates with timestamps.
- `extract()` — Fetches and parses the table from Wikipedia.
- `transform()` — Cleans the dataset and converts values to multiple currencies.
- `load_to_csv()` — Saves the transformed data to a CSV file.
- `load_to_db()` — Loads the data into a SQLite database.
- `run_queries()` — Executes SQL queries for validation.

In [3]:
def log_progress(message):
    timestamp_format = '%Y-%m-%d %H:%M:%S'
    now = datetime.now()
    timestamp = now.strftime(timestamp_format)
    with open(log_file, "a") as file:
        file.write(f"{timestamp}: {message}\n")


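def extract(url):
    # A timeout keeps the request from hanging; raise_for_status surfaces HTTP errors
    r = requests.get(url, headers=header, timeout=30)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")
    # Collect every table with class "wikitable" and let pandas parse them
    tables = str(soup.find_all("table", class_="wikitable"))
    dfs = pd.read_html(StringIO(tables))
    # Keep the first wikitable on the page
    df = dfs[0]
    return df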


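def transform(df, csv_path):
    # Exchange rates file: one row per currency, with "Currency" and "Rate" columns
    rates = pd.read_csv(csv_path, index_col="Currency")
    df = df.rename(columns={
        "Bank name": "Name", "Total assets (2025) (US$ billion)": "MC_USD_Billion"})
    # Add one converted column per currency, rounded to 2 decimal places
    for currency, rate in rates["Rate"].items():
        df[f"MC_{currency}_Billion"] = np.round(df["MC_USD_Billion"]*rate, 2)
    # Keep only the output columns, in a fixed order
    df = df[table_attributes]
    return df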


def load_to_csv(df, output_path):
    df.to_csv(output_path, index=False)


def load_to_db(df, sql_connection, table_name):
    df.to_sql(table_name, sql_connection, if_exists="replace", index=False)


def run_queries(query_statement, sql_connection):
    log_progress("Running Query")
    print(pd.read_sql(query_statement, sql_connection))
    log_progress("Querying Successful")

## 🚀 Running the ETL Pipeline

This section runs the full ETL process in sequence:
1. **Extract** — Pull raw HTML data from Wikipedia.  
2. **Transform** — Clean and enrich the dataset with currency conversions.  
3. **Load** — Export results to CSV and SQLite.  
4. **Query** — Validate and preview the loaded data.  

Each step is logged, and a `try`/`except`/`else`/`finally` block records any failure and closes the database connection at the end.

In [None]:
# Running ETL

log_progress("Preliminaries complete. Initiating ETL process")

# Defined up front so the finally block can tell whether a connection was opened
sql_conn = None

try:
    extracted = extract(url)
    log_progress("Data extraction complete. Initiating Transformation process")

    transformed = transform(extracted, "exchange_rate.csv")
    log_progress("Data transformation complete. Initiating Loading process")

    load_to_csv(transformed, output_csv_path)
    log_progress("Data saved to CSV file")

    sql_conn = sqlite3.connect(database_name)
    log_progress("SQL Connection initiated")

    load_to_db(transformed, sql_conn, table_name)
    log_progress("Data loaded to Database as a table, Executing queries")

except Exception as e:
    log_progress(f"Error Found: {e}")
    print(f"Error found during ETL Process: {e}")

else:
    run_queries("SELECT * FROM Largest_banks", sql_conn)
    run_queries(
        "SELECT AVG(MC_GBP_Billion) FROM Largest_banks", sql_conn)
    run_queries("SELECT Name from Largest_banks LIMIT 5", sql_conn)
    log_progress("Process Complete")

finally:
    # Close the connection only if sqlite3.connect() was actually reached
    if sql_conn is not None:
        sql_conn.close()
        log_progress("Server Connection closed")

### 🔍 Inspecting the Wikipedia Table

Before extraction, the HTML structure of the Wikipedia page was inspected using **Developer Tools** (right-click → *Inspect*).  
The table we scraped is a `<table>` element with the class **`wikitable`**. `extract()` collects every table with that class using BeautifulSoup and keeps the first one.

![Wikipedia HTML Inspection](wiki_inspect.png)

## 🧩 Extraction Phase

In this step, we scrape the Wikipedia page for the list of largest banks.  
We then parse the HTML using `BeautifulSoup` and convert the table into a Pandas DataFrame for easier processing.

### 🧾 Sample Output (Extracted DataFrame)

| Rank | Bank name                              | Total assets (2025) (US$ billion) |
|------|----------------------------------------|-----------------------------------:|
| 1    | Industrial and Commercial Bank of China | 6688.74 |
| 2    | Agricultural Bank of China              | 5923.76 |
| 3    | China Construction Bank                 | 5558.38 |
| 4    | Bank of China                           | 4803.51 |
| 5    | JPMorgan Chase                          | 4002.81 |
| ...  | ...                                     | ... |
| 96   | SEB Group                               | 339.65 |
| 97   | Raiffeisen Group                        | 337.25 |
| 98   | Banco Bradesco                          | 331.96 |
| 99   | VTB Bank                                | 330.43 |
| 100  | First Abu Dhabi Bank                    | 330.32 |
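
To make the extraction step concrete, here is a minimal sketch of pulling the rows out of the first `wikitable` on a page. It is not the notebook's method: it swaps in the standard library's `html.parser` in place of `BeautifulSoup` and `pandas.read_html` so that it runs without extra packages. The HTML snippet and the `WikitableParser` class are made up for illustration; the real page is far larger and messier.

```python
from html.parser import HTMLParser

# Made-up HTML standing in for the Wikipedia page (assumption, not the real markup)
SAMPLE_HTML = """
<table class="infobox"><tr><td>ignored</td></tr></table>
<table class="wikitable">
  <tr><th>Rank</th><th>Bank name</th><th>Total assets (2025) (US$ billion)</th></tr>
  <tr><td>1</td><td>Industrial and Commercial Bank of China</td><td>6688.74</td></tr>
  <tr><td>2</td><td>Agricultural Bank of China</td><td>5923.76</td></tr>
</table>
"""


class WikitableParser(HTMLParser):
    """Hypothetical helper: collects the rows of the first <table class="wikitable">."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._in_table = False  # True while inside the target table
        self._done = False      # True once the first wikitable has closed
        self._row = None        # cells of the row currently being read
        self._cell = None       # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and "wikitable" in classes and not self._done:
            self._in_table = True
        elif self._in_table and tag == "tr":
            self._row = []
        elif self._in_table and tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside a cell of the target table is kept
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if not self._in_table:
            return
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "table":
            self._in_table = False
            self._done = True


parser = WikitableParser()
parser.feed(SAMPLE_HTML)
header_row, *records = parser.rows
print(header_row)
print(records)
```

The sketch ignores nested tables, footnote markers, and merged cells, which is part of why the notebook hands this work to `pandas.read_html`.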

## 🔄 Transformation Phase

In this step, we rename columns for clarity, convert the USD figures (the page's total assets column, stored under the `MC_` prefix) into GBP, EUR, and INR using exchange rates from `exchange_rate.csv`, round the results to two decimal places, and keep only the name and value columns.

### 🧾 Sample Output (Transformed DataFrame)

| Name                                   | MC_USD_Billion | MC_GBP_Billion | MC_EUR_Billion | MC_INR_Billion |
|----------------------------------------|----------------:|----------------:|----------------:|----------------:|
| Industrial and Commercial Bank of China | 6688.74 | 5350.99 | 6220.53 | 554830.98 |
| Agricultural Bank of China              | 5923.76 | 4739.01 | 5509.10 | 491375.89 |
| China Construction Bank                 | 5558.38 | 4446.70 | 5169.29 | 461067.62 |
| Bank of China                           | 4803.51 | 3842.81 | 4467.26 | 398451.15 |
| JPMorgan Chase                          | 4002.81 | 3202.25 | 3722.61 | 332033.09 |
| ...                                     | ... | ... | ... | ... |
| SEB Group                               | 339.65 | 271.72 | 315.87 | 28173.97 |
| Raiffeisen Group                        | 337.25 | 269.80 | 313.64 | 27974.89 |
| Banco Bradesco                          | 331.96 | 265.57 | 308.72 | 27536.08 |
| VTB Bank                                | 330.43 | 264.34 | 307.30 | 27409.17 |
| First Abu Dhabi Bank                    | 330.32 | 264.26 | 307.20 | 27400.04 |
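
The conversion can be reproduced on two rows as a quick check. This is a sketch under assumptions: the rates below (GBP 0.8, EUR 0.93, INR 82.95) are inferred by dividing the converted columns in the sample output by the USD column, and are not confirmed contents of `exchange_rate.csv`. The inline `RATES_CSV` string stands in for that file.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Assumed rates, inferred from the sample output (not read from the real file)
RATES_CSV = """Currency,Rate
EUR,0.93
GBP,0.8
INR,82.95
"""

rates = pd.read_csv(StringIO(RATES_CSV), index_col="Currency")["Rate"]

banks = pd.DataFrame({
    "Name": ["Industrial and Commercial Bank of China", "JPMorgan Chase"],
    "MC_USD_Billion": [6688.74, 4002.81],
})

# Same loop as transform(): one rounded column per currency
for currency, rate in rates.items():
    banks[f"MC_{currency}_Billion"] = np.round(banks["MC_USD_Billion"] * rate, 2)

print(banks)
```

With these assumed rates, the two rows come out the same as the corresponding rows of the sample table, for example 6688.74 × 0.8 ≈ 5350.99 for GBP.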

## 💾 Loading Phase

After transforming the dataset, the final step is to **load** the processed data into persistent storage formats for future analysis and querying.

In this project:
- A CSV file named **`Largest_banks_data.csv`** was created to store the cleaned and transformed data locally.  
- The same dataset was also loaded into a **SQLite database (`Banks.db`)** under the table name **`Largest_banks`** for SQL-based validation and analytics.

Both operations were logged in the `code_log.txt` file to track ETL progress and completion.
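
Both load targets can be tried without touching the disk. The sketch below assumes an in-memory text buffer in place of `Largest_banks_data.csv` and an in-memory SQLite database (`":memory:"`) in place of `Banks.db`; the two-row DataFrame is a stand-in for the transformed data.

```python
import sqlite3
from io import StringIO

import pandas as pd

banks = pd.DataFrame({
    "Name": ["Industrial and Commercial Bank of China", "JPMorgan Chase"],
    "MC_USD_Billion": [6688.74, 4002.81],
})

# CSV target: an in-memory buffer stands in for the output file
csv_buffer = StringIO()
banks.to_csv(csv_buffer, index=False)
csv_text = csv_buffer.getvalue()

# SQLite target: ":memory:" stands in for the database file
conn = sqlite3.connect(":memory:")
try:
    # if_exists="replace" drops and recreates the table on every run
    banks.to_sql("Largest_banks", conn, if_exists="replace", index=False)
    loaded = pd.read_sql("SELECT * FROM Largest_banks", conn)
finally:
    conn.close()

print(csv_text)
print(loaded)
```

Because `if_exists="replace"` recreates the table each time, re-running the pipeline overwrites the previous load instead of appending duplicate rows.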

## 🧠 Query Section

### 🧩 Query 1 — Display All Records  

**SQL Command:**
```sql
SELECT * FROM Largest_banks;
```

```text
                                       Name  MC_USD_Billion  MC_GBP_Billion  MC_EUR_Billion  MC_INR_Billion
0   Industrial and Commercial Bank of China         6688.74         5350.99         6220.53       554830.98
1                Agricultural Bank of China         5923.76         4739.01         5509.10       491375.89
2                   China Construction Bank         5558.38         4446.70         5169.29       461067.62
3                             Bank of China         4803.51         3842.81         4467.26       398451.15
4                            JPMorgan Chase         4002.81         3202.25         3722.61       332033.09
..                                      ...             ...             ...             ...             ...
95                                SEB Group          339.65          271.72          315.87        28173.97
96                         Raiffeisen Group          337.25          269.80          313.64        27974.89
97                           Banco Bradesco          331.96          265.57          308.72        27536.08
98                                 VTB Bank          330.43          264.34          307.30        27409.17
99                     First Abu Dhabi Bank          330.32          264.26          307.20        27400.04

[100 rows x 5 columns]
```

### 🧮 Query 2 — Average of `MC_GBP_Billion`

**SQL Command:**
```sql
SELECT AVG(MC_GBP_Billion) FROM Largest_banks;
```

```text
   AVG(MC_GBP_Billion)
0              945.115
```

### 🏦 Query 3 — Display the Names of the First 5 Banks

**SQL Command:**
```sql
SELECT Name FROM Largest_banks LIMIT 5;
```

```text
                                      Name
0  Industrial and Commercial Bank of China
1               Agricultural Bank of China
2                  China Construction Bank
3                            Bank of China
4                           JPMorgan Chase
```
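
Query 3 returns the first five rows in whatever order SQLite reads them, because SQL does not guarantee row order without `ORDER BY`. Here the result matches the ranking only because the table was loaded in ranked order. A hedged sketch of an explicit "top N" query follows, using a small in-memory table whose rows are deliberately inserted out of order; the table contents are a three-row stand-in, not the full dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Largest_banks (Name TEXT, MC_USD_Billion REAL)")

# Rows inserted out of rank order on purpose
conn.executemany(
    "INSERT INTO Largest_banks VALUES (?, ?)",
    [
        ("JPMorgan Chase", 4002.81),
        ("Bank of China", 4803.51),
        ("Industrial and Commercial Bank of China", 6688.74),
    ],
)

# ORDER BY makes the ranking explicit; the limit is passed as a bound parameter
top_two = conn.execute(
    "SELECT Name FROM Largest_banks ORDER BY MC_USD_Billion DESC LIMIT ?",
    (2,),
).fetchall()
conn.close()

print(top_two)
```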

## ✅ Conclusion

We successfully implemented a full ETL workflow that:
- Extracted live data from Wikipedia  
- Transformed and enriched it with multiple currency conversions  
- Loaded the results into both CSV and SQLite formats  

This process demonstrates a repeatable data pipeline: every step is a plain function, so the same code could be run on a schedule or adapted into a larger data engineering workflow.