# Project Scenario:
You have been hired as a data engineer by a research organization. Your boss has asked you to write code that compiles the list of `the top 10 largest banks in the world ranked by market capitalization in billion USD`. The data must also be transformed and stored in `GBP, EUR and INR`, in accordance with the exchange rate information that has been made available to you as a CSV file. The processed table is to be saved locally in CSV format and as a database table.<br>

The required data is available at the URL below:<br>

URL 'https://web.archive.org/web/20230908091635/https://en.wikipedia.org/wiki/List_of_largest_banks'<br>

In [2]:
# Standard Libraries
import sqlite3
from datetime import datetime # import this for the timestamp function

# third party
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

## Declare the Attributes

In [3]:
csv_url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-PY0221EN-Coursera/labs/v2/exchange_rate.csv'
url = 'https://web.archive.org/web/20230908091635/https://en.wikipedia.org/wiki/List_of_largest_banks'
table_attribs = ['Name', 'MC_USD_Billion'] # 'MC_GBP_Billion', 'MC_EUR_Billion', 'MC_INR_Billion'
db_name = 'Banks.db'
table_name = 'Largest_banks'
csv_path = 'Largest_banks_data.csv'

## Extracting information using the web scraping process

Identify the position of the required table under the heading `By market capitalization`. Write the function extract() to retrieve the information of the table to a Pandas data frame.<br>

Note: Remember to strip the trailing newline character ('\n') from the `Market cap` column contents and typecast the value to `float`.<br>
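As a small illustration of the note above, the sketch below cleans one raw cell value. The helper name `parse_market_cap` and the sample string are assumptions made for this example; they are not part of the lab code.

```python
def parse_market_cap(raw_text):
    """Strip trailing newline characters from a raw cell value and return a float."""
    # rstrip('\n') removes only trailing newlines, so a value that has
    # no newline at the end is left unchanged before the conversion.
    return float(raw_text.rstrip('\n'))


sample = '432.92\n'  # hypothetical raw text of one market cap cell
value = parse_market_cap(sample)
print(value)
```

Compared with slicing off the last character, `rstrip('\n')` does not drop a digit when a cell happens to have no trailing newline.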

In [None]:
 <tr>
    <th data-sort-type="number">Rank</th>
    <th>Bank name</th>
    <th>Market cap<br/>(US$ billion)</th>
 </tr>,

 <tr>
    <td>1</td>
    <td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/web/20230908091635/https://en.wikipedia.org/wiki/United_States" title="United States"><img alt="United States" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//web.archive.org/web/20230908091635im_/https://upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//web.archive.org/web/20230908091635im_/https://upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //web.archive.org/web/20230908091635im_/https://upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/></a></span></span> <a href="/web/20230908091635/https://en.wikipedia.org/wiki/JPMorgan_Chase" title="JPMorgan Chase">JPMorgan Chase</a></td>
    <td>432.92</td>
 </tr>,

 <tr>
    <td>2</td>
    <td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/web/20230908091635/https://en.wikipedia.org/wiki/United_States" title="United States"><img alt="United States" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//web.archive.org/web/20230908091635im_/https://upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//web.archive.org/web/20230908091635im_/https://upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //web.archive.org/web/20230908091635im_/https://upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/></a></span></span> <a href="/web/20230908091635/https://en.wikipedia.org/wiki/Bank_of_America" title="Bank of America">Bank of America</a></td>
    <td>231.52</td>
 </tr>
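The rows above show that each bank cell holds two links: the flag first, then the bank. A minimal sketch of reading such rows with BeautifulSoup follows; `sample_html` is a trimmed, hypothetical copy of the markup (most attributes removed), so it only approximates the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified copy of the table markup shown above.
sample_html = """<table><tbody>
<tr><th>Rank</th><th>Bank name</th><th>Market cap</th></tr>
<tr><td>1</td><td><span class="flagicon"><a href="/wiki/United_States" title="United States"><img alt="United States"/></a></span> <a href="/wiki/JPMorgan_Chase" title="JPMorgan Chase">JPMorgan Chase</a></td><td>432.92
</td></tr>
<tr><td>2</td><td><span class="flagicon"><a href="/wiki/United_States" title="United States"><img alt="United States"/></a></span> <a href="/wiki/Bank_of_America" title="Bank of America">Bank of America</a></td><td>231.52
</td></tr>
</tbody></table>"""

soup = BeautifulSoup(sample_html, "html.parser")
records = []
for row in soup.find("tbody").find_all("tr"):
    cells = row.find_all("td")
    if len(cells) < 3:
        continue  # the header row has only 'th' cells
    links = cells[1].find_all("a")
    if not links:
        continue  # no bank link in this row
    records.append({
        "Name": links[-1].get("title"),  # last link is the bank, not the flag
        "MC_USD_Billion": float(cells[2].get_text().strip()),
    })

print(records)
```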

In [38]:
def extract(url, table_attribs):
    ''' This function extracts the required
    information from the website and saves it to a dataframe. The
    function returns the dataframe for further processing. '''
    
    page = requests.get(url).text  # Download the web page as text
    data = BeautifulSoup(page, 'html.parser')  # Parse the text into an HTML object
    tables = data.find_all('tbody')  # Collect every 'tbody' element of the HTML object
    rows = tables[0].find_all('tr')  # Take all rows ('tr' elements) of the table at index 0

    records = []
    for row in rows:
        col = row.find_all('td')
        # Skip the header row (no 'td' cells) and rows without a bank link
        if len(col) != 0 and col[1].find('a') is not None:
            records.append({
                # The cell holds two links (flag, then bank); the last one names the bank
                "Name": col[1].find_all('a')[-1].get('title'),
                # Strip the trailing newline from the market cap text
                "MC_USD_Billion": col[2].contents[0].strip('\n'),
            })

    # Build the DataFrame once, then typecast the market cap column to float
    df = pd.DataFrame(records, columns=table_attribs)
    df["MC_USD_Billion"] = df["MC_USD_Billion"].astype(float)
    return df

df = extract(url, table_attribs)
df

Unnamed: 0,Name,MC_USD_Billion
0,JPMorgan Chase,432.92
1,Bank of America,231.52
2,Industrial and Commercial Bank of China,194.56
3,Agricultural Bank of China,160.68
4,HDFC Bank,157.91
5,Wells Fargo,155.87
6,HSBC,148.9
7,Morgan Stanley,140.83
8,China Construction Bank,139.82
9,Bank of China,136.81


## Transforming the information
* Transform the dataframe by adding columns for Market Capitalization in `GBP`, `EUR` and `INR`, `rounded to 2 decimal places`, based on the exchange rate information shared as a CSV file.
* Write the code for a function transform() to perform the said task.<br>
* Execute a function call to transform() and verify the output.<br>
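The conversion step can be sketched on a single value using only the standard library. The CSV text below is a hypothetical stand-in for the exchange rate file, and its rates are illustrative assumptions; with these values the result agrees with the JPMorgan Chase row in the output shown further down. Note that this sketch uses Python's built-in `round` rather than NumPy.

```python
import csv
import io

# Hypothetical stand-in for the exchange rate CSV; the rates are assumed values.
sample_csv = "Currency,Rate\nEUR,0.93\nGBP,0.8\nINR,82.95\n"

# Map each currency code (first column) to its rate (second column).
rates = {
    row["Currency"]: float(row["Rate"])
    for row in csv.DictReader(io.StringIO(sample_csv))
}

mc_usd = 432.92  # market cap in billion USD
converted = {
    f"MC_{code}_Billion": round(mc_usd * rates[code], 2)
    for code in ("GBP", "EUR", "INR")
}
print(converted)
```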

In [48]:
def transform(df):
    
    ''' This function accesses the CSV file for exchange rate
    information, and adds three columns to the data frame, each
    containing the transformed version of Market Cap column to
    respective currencies'''

    # Read the exchange rate CSV file and convert it to a dictionary that maps
    # each currency code (first column) to its rate (second column).
    er = pd.read_csv('exchange_rate.csv')
    er_dict = er.set_index('Currency').to_dict()['Rate']

    # Convert the USD column to each target currency, rounded to 2 decimal places
    df['MC_GBP_Billion'] = (df['MC_USD_Billion'] * er_dict['GBP']).round(2)
    df['MC_EUR_Billion'] = (df['MC_USD_Billion'] * er_dict['EUR']).round(2)
    df['MC_INR_Billion'] = (df['MC_USD_Billion'] * er_dict['INR']).round(2)
    

    return df

transformed_df = transform(df)
transformed_df

Unnamed: 0,Name,MC_USD_Billion,MC_GBP_Billion,MC_EUR_Billion,MC_INR_Billion
0,JPMorgan Chase,432.92,346.34,402.62,35910.71
1,Bank of America,231.52,185.22,215.31,19204.58
2,Industrial and Commercial Bank of China,194.56,155.65,180.94,16138.75
3,Agricultural Bank of China,160.68,128.54,149.43,13328.41
4,HDFC Bank,157.91,126.33,146.86,13098.63
5,Wells Fargo,155.87,124.7,144.96,12929.42
6,HSBC,148.9,119.12,138.48,12351.26
7,Morgan Stanley,140.83,112.66,130.97,11681.85
8,China Construction Bank,139.82,111.86,130.03,11598.07
9,Bank of China,136.81,109.45,127.23,11348.39


## Loading information to a CSV file and a database

In [49]:
def load_to_csv(df, csv_path):
    ''' This function saves the final dataframe as a `CSV` file 
    in the provided path. Function returns nothing.'''
    df.to_csv(csv_path)

load_to_csv(transformed_df,csv_path)
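One detail worth knowing about `to_csv`: by default it also writes the row index as a first column with an empty header. The sketch below compares the two header lines using an in-memory buffer; the one-row frame is a hypothetical sample, not the full table.

```python
import io

import pandas as pd

# Hypothetical one-row sample of the bank table.
sample_df = pd.DataFrame({"Name": ["JPMorgan Chase"], "MC_USD_Billion": [432.92]})

with_index = io.StringIO()
sample_df.to_csv(with_index)  # default: index written as a leading column

without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)  # index column omitted

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]
print(header_with_index)     # ,Name,MC_USD_Billion
print(header_without_index)  # Name,MC_USD_Billion
```

Whether to keep the index is a design choice; passing `index=False` keeps the CSV columns identical to the database table written by `load_to_db()`.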

In [50]:
def load_to_db(df, sql_connection, table_name):
    ''' This function saves the final dataframe as a database table
    with the provided name. Function returns nothing.'''
    df.to_sql(table_name, sql_connection, if_exists='replace', index=False)
    
sql_connection = sqlite3.connect(db_name)
load_to_db(transformed_df, sql_connection, table_name)

## Querying the database table

In [52]:
def run_query(query_statement, sql_connection):
    ''' This function runs the stated query on the database table and
    prints the output on the terminal. Function returns nothing. '''
    query_output = pd.read_sql(query_statement, sql_connection)
    print(query_output)
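To try queries without touching `Banks.db`, a sketch like the following can use an in-memory SQLite database. It calls the standard `sqlite3` module directly instead of `pd.read_sql`, and it loads only three rows copied from the output tables, so the average it computes is for that sample only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database, nothing is written to disk
conn.execute(
    "CREATE TABLE Largest_banks (Name TEXT, MC_USD_Billion REAL, MC_GBP_Billion REAL)"
)
sample_rows = [
    ("JPMorgan Chase", 432.92, 346.34),
    ("Bank of America", 231.52, 185.22),
    ("Industrial and Commercial Bank of China", 194.56, 155.65),
]
conn.executemany("INSERT INTO Largest_banks VALUES (?, ?, ?)", sample_rows)

all_rows = conn.execute("SELECT * FROM Largest_banks").fetchall()
avg_gbp = conn.execute("SELECT AVG(MC_GBP_Billion) FROM Largest_banks").fetchone()[0]
# ORDER BY makes "top" explicit instead of relying on insertion order.
top_two = [
    name
    for (name,) in conn.execute(
        "SELECT Name FROM Largest_banks ORDER BY MC_USD_Billion DESC LIMIT 2"
    )
]
conn.close()

print(all_rows)
print(avg_gbp)
print(top_two)
```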

## Logging progress
The log_progress() function is called multiple times throughout the execution of this code; each call appends a log entry to a .txt file, `code_log.txt`. Each entry has the following format: `<Time_stamp> : <message_text>`

Take a screenshot of the code for the log_progress() function and save it to your local machine as Task_1_log_function.png.

In [53]:
def log_progress(message): 
    timestamp_format = '%Y-%b-%d-%H:%M:%S' # Year-Monthname-Day-Hour:Minute:Second ('%b' is the documented abbreviated month name)
    now = datetime.now() # get current timestamp 
    timestamp = now.strftime(timestamp_format) 
    with open("./code_log.txt","a") as f: 
        f.write(timestamp + ' : ' + message + '\n')
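To check the `<Time_stamp> : <message_text>` format without waiting on the clock, the formatting can be separated from the file write, as in the sketch below. The helper `format_log_entry`, the fixed date, and the temporary log path are assumptions for this example; the sketch uses `%b`, the documented directive for the abbreviated month name.

```python
import os
import re
import tempfile
from datetime import datetime


def format_log_entry(message, moment):
    """Return one log line in the '<Time_stamp> : <message_text>' format."""
    timestamp = moment.strftime('%Y-%b-%d-%H:%M:%S')
    return timestamp + ' : ' + message + '\n'


fixed_moment = datetime(2023, 9, 8, 9, 16, 35)  # hypothetical, fixed for repeatability
entry = format_log_entry('Preliminaries complete. Initiating ETL process', fixed_moment)

# Append to a log file in a temporary directory instead of ./code_log.txt.
log_path = os.path.join(tempfile.mkdtemp(), 'code_log.txt')
with open(log_path, 'a') as f:
    f.write(entry)

with open(log_path) as f:
    lines = f.readlines()
print(lines)
```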

## Final Function Call

In [56]:
log_progress('Preliminaries complete. Initiating ETL process')
df = extract(url, table_attribs)

log_progress('Data extraction complete. Initiating Transformation process')
transformed_df = transform(df)

log_progress('Data transformation complete. Initiating loading process')
load_to_csv(transformed_df, csv_path)
log_progress('Data saved to CSV file')

sql_connection = sqlite3.connect(db_name)
log_progress('SQL Connection initiated.')
load_to_db(transformed_df, sql_connection, table_name)
log_progress('Data loaded to Database as table. Running the query')

log_progress('Printing the contents of the entire table...')
query_statement = f"SELECT * from {table_name}"
run_query(query_statement, sql_connection)

log_progress('Printing the average market capitalization of all the banks in Billion GBP...')
query_statement = f"SELECT AVG(MC_GBP_Billion) FROM {table_name}"
run_query(query_statement, sql_connection)

log_progress('Printing only the names of the top 5 banks...')
query_statement = f"SELECT Name from {table_name} LIMIT 5"
run_query(query_statement, sql_connection)

log_progress('Process Complete.')
sql_connection.close()

                                      Name  MC_USD_Billion  MC_GBP_Billion  \
0                           JPMorgan Chase          432.92          346.34   
1                          Bank of America          231.52          185.22   
2  Industrial and Commercial Bank of China          194.56          155.65   
3               Agricultural Bank of China          160.68          128.54   
4                                HDFC Bank          157.91          126.33   
5                              Wells Fargo          155.87          124.70   
6                                     HSBC          148.90          119.12   
7                           Morgan Stanley          140.83          112.66   
8                  China Construction Bank          139.82          111.86   
9                            Bank of China          136.81          109.45   

   MC_EUR_Billion  MC_INR_Billion  
0          402.62        35910.71  
1          215.31        19204.58  
2          180.94        16138.75