# Initial setup

The solution needs the following libraries:

- requests - Fetches the page content from the URL.

- bs4 - Provides the BeautifulSoup class used for web scraping.

- pandas - Processes the extracted data, stores it in the required formats, and communicates with the database.

- sqlite3 - Creates the connection to the SQLite database file.

- numpy - Supports the numerical rounding required by the objectives.

- datetime - Provides the datetime class used to generate timestamps for logging.

In [212]:
# !pip install pandas
# !pip install numpy
# !pip install bs4

In [213]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import sqlite3
from datetime import datetime 

In [214]:
# initialize all the known entities

url = 'https://web.archive.org/web/20230908091635/https://en.wikipedia.org/wiki/List_of_largest_banks'
table_attribs = ["Name", "MC_USD_Billion"]
db_name = 'Banks.db'
table_name = 'Largest_banks'
csv_path = './Largest_banks_data.csv'

# Task 1: Extracting information

Information is extracted from the web page by web scraping. I inspected the page to find where the required table sits and how to reach the values in it. The following points are worth noting:

- On the given web page, the required table is the first one, at index 0. From it, the entries under 'Bank name' and 'Market cap' are needed.

- In each row, the bank name is the second anchor element of the second column; the first anchor wraps the country flag image. To extract the name, I call find_all('a') on that cell to get a list of all its anchor elements and then select the second one.
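
The second point can be checked on a single row. The sketch below is a standard-library stand-in: it uses `html.parser` rather than BeautifulSoup so that it runs without third-party packages. The `CellAnchorCollector` class and the shortened `row_html` string are hypothetical, written only for this illustration. It shows that the first anchor in the second cell wraps the flag image and carries no text, while the second anchor holds the bank name.

```python
from html.parser import HTMLParser


class CellAnchorCollector(HTMLParser):
    """Collect the text of every <a> element, grouped by <td> cell."""

    def __init__(self):
        super().__init__()
        self.cells = []          # one list of anchor texts per <td>
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.cells.append([])
        elif tag == "a" and self.cells:
            self._in_anchor = True
            self.cells[-1].append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        if self._in_anchor:
            self.cells[-1][-1] += data


# Shortened, hypothetical version of one table row
row_html = (
    '<tr><td>1</td>'
    '<td><span class="flagicon"><a href="/wiki/United_States">'
    '<img alt="United States"/></a></span> '
    '<a href="/wiki/JPMorgan_Chase">JPMorgan Chase</a></td>'
    '<td>432.92</td></tr>'
)

parser = CellAnchorCollector()
parser.feed(row_html)

anchors = parser.cells[1]                         # anchors of the second cell
name = anchors[1] if len(anchors) > 1 else None   # the second anchor is the bank name
print(anchors)
print(name)
```

The guard on `len(anchors)` mirrors the one in `extract`: a cell with fewer than two anchors yields `None` instead of raising an `IndexError`.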


In [215]:
def extract(url, table_attribs):
    ''' This function extracts the required
    information from the website and saves it to a dataframe. The
    function returns the dataframe for further processing. '''

    page = requests.get(url).text  # Fetch the web page as text
    data = BeautifulSoup(page, 'html.parser')  # Parse the text into an HTML object
    tables = data.find_all('tbody')
    rows = tables[0].find_all('tr')  # The required table is the first one on the page
    records = []
    for row in rows:
        display(row)  # Show the raw HTML of the row for inspection
        col = row.find_all('td')  # Header rows contain only <th> cells, so col is empty for them
        if len(col) != 0:
            anchors = col[1].find_all('a')  # Find all <a> tags in the bank-name cell
            records.append({
                "Name": anchors[1].get_text(strip=True) if len(anchors) > 1 else None,  # The second <a> tag holds the bank name
                "MC_USD_Billion": float(col[2].contents[0]),  # float() ignores the trailing newline
            })
    # Build the dataframe once, with the columns named in table_attribs
    df = pd.DataFrame(records, columns=table_attribs)
    return df

In [216]:
exchange_rates_df = pd.read_csv("./exchange_rate.csv") # get the exchange rates dataframe
exchange_rates = {
    'GBP': exchange_rates_df.loc[exchange_rates_df['Currency'] == 'GBP', 'Rate'].values[0],
    'EUR': exchange_rates_df.loc[exchange_rates_df['Currency'] == 'EUR', 'Rate'].values[0],
    'INR': exchange_rates_df.loc[exchange_rates_df['Currency'] == 'INR', 'Rate'].values[0],
}
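
The three lookups above follow one pattern, so the dictionary can also be built in a single pass over the file. Below is a minimal sketch using only the standard-library `csv` module. It assumes the file has the `Currency` and `Rate` columns used above; the in-memory sample data, including the rate values, is illustrative and stands in for exchange_rate.csv.

```python
import csv
import io

# Illustrative stand-in for exchange_rate.csv; the rates are assumed sample values.
sample_csv = io.StringIO(
    "Currency,Rate\n"
    "EUR,0.93\n"
    "GBP,0.8\n"
    "INR,82.95\n"
)

# Map every currency code in the file to its rate as a float.
rates = {row["Currency"]: float(row["Rate"]) for row in csv.DictReader(sample_csv)}

print(rates["GBP"], rates["EUR"], rates["INR"])
```

With pandas, the same mapping could be obtained with `exchange_rates_df.set_index('Currency')['Rate'].to_dict()`.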

# Task 2: Transform information

In [217]:
def transform(df, exchange_rates):
    ''' This function adds market capitalization columns in GBP, EUR
    and INR to the dataframe by scaling the USD (Billions) column with
    the corresponding exchange rates, rounding to 2 decimal places.
    The function returns the transformed dataframe.'''

    # Make sure the USD column is numeric before scaling it
    df["MC_USD_Billion"] = df["MC_USD_Billion"].astype(float)
    df['MC_GBP_Billion'] = (df['MC_USD_Billion'] * exchange_rates['GBP']).round(2)
    df['MC_EUR_Billion'] = (df['MC_USD_Billion'] * exchange_rates['EUR']).round(2)
    df['MC_INR_Billion'] = (df['MC_USD_Billion'] * exchange_rates['INR']).round(2)

    return df
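
As a worked example of the conversion, the snippet below applies the same multiply-and-round step to a single value in plain Python. The USD figure is HDFC Bank's market cap from the scraped table; the EUR rate of 0.93 is an assumption chosen for illustration (the actual value comes from exchange_rate.csv), and `convert` is a hypothetical helper.

```python
def convert(usd_billion, rate):
    """Scale a USD amount by an exchange rate and round to 2 decimal places."""
    return round(usd_billion * rate, 2)


mc_usd = 157.91   # HDFC Bank, US$ billion
eur_rate = 0.93   # assumed rate, for illustration only

mc_eur = convert(mc_usd, eur_rate)
print(mc_eur)
```

With this assumed rate the result is 146.86, the same figure the notebook reports for the 5th largest bank.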

# Task 3: Loading information

In [218]:
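def load_to_csv(df, csv_path):
    ''' This function saves the final dataframe as a CSV file
    at the provided path. The function returns nothing.'''
    df.to_csv(csv_path)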

In [219]:
#save the transformed dataframe as a table in the database.

def load_to_db(df, sql_connection, table_name):
    df.to_sql(table_name, sql_connection, if_exists='replace', index=False)

# Task 4: Querying the database table

In [220]:
def run_query(query_statement, sql_connection):
    display(query_statement)
    query_output = pd.read_sql(query_statement, sql_connection)
    display(query_output)
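
To show what such queries do without pandas or the scraped data, here is a minimal sketch using only `sqlite3` with an in-memory database. The table is filled with three rows taken from the scraped table and is only a stand-in for the real `Largest_banks` table. The explicit `ORDER BY` is an addition of this sketch, since a bare `LIMIT` relies on the order in which the rows were inserted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database, nothing is written to disk
conn.execute("CREATE TABLE Largest_banks (Name TEXT, MC_USD_Billion REAL)")
conn.executemany(
    "INSERT INTO Largest_banks VALUES (?, ?)",
    [
        ("JPMorgan Chase", 432.92),
        ("Bank of America", 231.52),
        ("Industrial and Commercial Bank of China", 194.56),
    ],
)

# Aggregate query: average market cap over all rows
avg_usd = conn.execute("SELECT AVG(MC_USD_Billion) FROM Largest_banks").fetchone()[0]

# Top-N query with an explicit ordering
top_2 = [
    name
    for (name,) in conn.execute(
        "SELECT Name FROM Largest_banks ORDER BY MC_USD_Billion DESC LIMIT 2"
    )
]

print(avg_usd)
print(top_2)
conn.close()
```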

# Task 5: Logging progress

In [221]:
def log_progress(message): 
    timestamp_format = '%Y-%b-%d-%H:%M:%S' # Year-Monthname-Day-Hour:Minute:Second 
    now = datetime.now() # get current timestamp 
    timestamp = now.strftime(timestamp_format) 
    with open("./code_log.txt","a") as f: 
        f.write(timestamp + ' : ' + message + '\n')
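
To see what a log line looks like, the sketch below formats a fixed, hypothetical timestamp instead of the current time. It uses `%b` for the abbreviated month name; `%h` behaves the same on some platforms but is not among the format codes that the Python documentation lists as portable. `format_log_line` is a hypothetical helper, and the month abbreviation assumes the default (English) locale.

```python
from datetime import datetime

timestamp_format = '%Y-%b-%d-%H:%M:%S'  # Year-Monthname-Day-Hour:Minute:Second


def format_log_line(message, now):
    """Build one log line from a message and a datetime."""
    return now.strftime(timestamp_format) + ' : ' + message + '\n'


# Hypothetical fixed timestamp so the result is reproducible
line = format_log_line('Data saved to CSV file', datetime(2023, 9, 8, 9, 16, 35))
print(line, end='')
```

Separating the formatting from the file write, as done here, also makes the log format easy to check without touching code_log.txt.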

In [226]:
log_progress('Preliminaries complete. Initiating ETL process')

df = extract(url, table_attribs)

log_progress('Data extraction complete. Initiating Transformation process')

df = transform(df, exchange_rates)

mkt_cap = df['MC_EUR_Billion'][4]

log_progress('Retrieved market capitalization of the 5th largest bank in billion EUR successfully')

display(f"market capitalization of the 5th largest bank in billion EUR: {mkt_cap}")

log_progress('Data transformation complete. Initiating loading process')

load_to_csv(df, csv_path)

log_progress('Data saved to CSV file')

sql_connection = sqlite3.connect(db_name)

log_progress('SQL Connection initiated.')

load_to_db(df, sql_connection, table_name)

log_progress('Data loaded to Database as table. Running the query')

query_statement = f"SELECT * from {table_name}"
run_query(query_statement, sql_connection)

#Print the average market capitalization of all the banks in Billion USD.
query_avg_usd = f"SELECT AVG(MC_USD_Billion) from {table_name}"
run_query(query_avg_usd, sql_connection)

query_avg_gbp = f"SELECT AVG(MC_GBP_Billion) from {table_name}"
run_query(query_avg_gbp, sql_connection)

query_avg_eur = f"SELECT AVG(MC_EUR_Billion) from {table_name}"
run_query(query_avg_eur, sql_connection)

query_avg_inr = f"SELECT AVG(MC_INR_Billion) from {table_name}"
run_query(query_avg_inr, sql_connection)

query_top_5 = f"SELECT Name from {table_name} LIMIT 5"
run_query(query_top_5, sql_connection)

london_query = f"SELECT Name, MC_GBP_Billion from {table_name}"
run_query(london_query, sql_connection)

berlin_query = f"SELECT Name, MC_EUR_Billion from {table_name}"
run_query(berlin_query, sql_connection)

delhi_query = f"SELECT Name, MC_INR_Billion from {table_name}"
run_query(delhi_query, sql_connection)

log_progress('Process Complete.')

sql_connection.close()

<tr>
<th data-sort-type="number">Rank
</th>
<th>Bank name
</th>
<th>Market cap<br/>(US$ billion)
</th></tr>

<tr>
<td>1
</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/web/20230908091635/https://en.wikipedia.org/wiki/United_States" title="United States"><img alt="United States" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//web.archive.org/web/20230908091635im_/https://upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//web.archive.org/web/20230908091635im_/https://upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //web.archive.org/web/20230908091635im_/https://upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/></a></span></span> <a href="/web/20230908091635/https://en.wikipedia.org/wiki/JPMorgan_Chase" title="JPMorgan Chase">JPMorgan Chase</a>
</td>
<td>432.92
</td></tr>

(Rows 2 to 10 follow the same structure; their raw HTML is omitted.)

'market capitalization of the 5th largest bank in billion EUR: 146.86'

'SELECT * from Largest_banks'

| | Name | MC_USD_Billion | MC_GBP_Billion | MC_EUR_Billion | MC_INR_Billion |
|---|---|---|---|---|---|
| 0 | JPMorgan Chase | 432.92 | 346.34 | 402.62 | 35910.71 |
| 1 | Bank of America | 231.52 | 185.22 | 215.31 | 19204.58 |
| 2 | Industrial and Commercial Bank of China | 194.56 | 155.65 | 180.94 | 16138.75 |
| 3 | Agricultural Bank of China | 160.68 | 128.54 | 149.43 | 13328.41 |
| 4 | HDFC Bank | 157.91 | 126.33 | 146.86 | 13098.63 |
| 5 | Wells Fargo | 155.87 | 124.7 | 144.96 | 12929.42 |
| 6 | HSBC Holdings PLC | 148.9 | 119.12 | 138.48 | 12351.26 |
| 7 | Morgan Stanley | 140.83 | 112.66 | 130.97 | 11681.85 |
| 8 | China Construction Bank | 139.82 | 111.86 | 130.03 | 11598.07 |
| 9 | Bank of China | 136.81 | 109.45 | 127.23 | 11348.39 |


'SELECT AVG(MC_USD_Billion) from Largest_banks'

| | AVG(MC_USD_Billion) |
|---|---|
| 0 | 189.982 |


'SELECT AVG(MC_GBP_Billion) from Largest_banks'

| | AVG(MC_GBP_Billion) |
|---|---|
| 0 | 151.987 |


'SELECT AVG(MC_EUR_Billion) from Largest_banks'

| | AVG(MC_EUR_Billion) |
|---|---|
| 0 | 176.683 |


'SELECT AVG(MC_INR_Billion) from Largest_banks'

| | AVG(MC_INR_Billion) |
|---|---|
| 0 | 15759.007 |


'SELECT Name from Largest_banks LIMIT 5'

| | Name |
|---|---|
| 0 | JPMorgan Chase |
| 1 | Bank of America |
| 2 | Industrial and Commercial Bank of China |
| 3 | Agricultural Bank of China |
| 4 | HDFC Bank |


'SELECT Name, MC_GBP_Billion from Largest_banks'

| | Name | MC_GBP_Billion |
|---|---|---|
| 0 | JPMorgan Chase | 346.34 |
| 1 | Bank of America | 185.22 |
| 2 | Industrial and Commercial Bank of China | 155.65 |
| 3 | Agricultural Bank of China | 128.54 |
| 4 | HDFC Bank | 126.33 |
| 5 | Wells Fargo | 124.7 |
| 6 | HSBC Holdings PLC | 119.12 |
| 7 | Morgan Stanley | 112.66 |
| 8 | China Construction Bank | 111.86 |
| 9 | Bank of China | 109.45 |


'SELECT Name, MC_EUR_Billion from Largest_banks'

| | Name | MC_EUR_Billion |
|---|---|---|
| 0 | JPMorgan Chase | 402.62 |
| 1 | Bank of America | 215.31 |
| 2 | Industrial and Commercial Bank of China | 180.94 |
| 3 | Agricultural Bank of China | 149.43 |
| 4 | HDFC Bank | 146.86 |
| 5 | Wells Fargo | 144.96 |
| 6 | HSBC Holdings PLC | 138.48 |
| 7 | Morgan Stanley | 130.97 |
| 8 | China Construction Bank | 130.03 |
| 9 | Bank of China | 127.23 |


'SELECT Name, MC_INR_Billion from Largest_banks'

| | Name | MC_INR_Billion |
|---|---|---|
| 0 | JPMorgan Chase | 35910.71 |
| 1 | Bank of America | 19204.58 |
| 2 | Industrial and Commercial Bank of China | 16138.75 |
| 3 | Agricultural Bank of China | 13328.41 |
| 4 | HDFC Bank | 13098.63 |
| 5 | Wells Fargo | 12929.42 |
| 6 | HSBC Holdings PLC | 12351.26 |
| 7 | Morgan Stanley | 11681.85 |
| 8 | China Construction Bank | 11598.07 |
| 9 | Bank of China | 11348.39 |


# Code execution and expected output

Execute all cells using **Run All**. The expected output is the one shown above.

# Conclusion

In this project, I performed Extract, Transform, and Load (ETL) operations on real-world data. I was able to:

- Extract the relevant information from https://web.archive.org/web/20230908091635/https://en.wikipedia.org/wiki/List_of_largest_banks using web scraping with the requests and bs4 libraries.
- Transform the data into the required format.
- Load the processed data to a local file and as a database table.
- Query the database table using Python.
- Create detailed logs of all operations conducted.