## Objectives

- Use web scraping to get bank information

### Import libraries
We will use the following Python libraries:
- pandas
- requests
- BeautifulSoup
- html5lib

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import html5lib

## Extract Data Using Web Scraping

The Wikipedia webpage https://web.archive.org/web/20200318083015/https://en.wikipedia.org/wiki/List_of_largest_banks provides information about the largest banks in the world by various parameters. Scrape the data from the table 'By market capitalization' and store it in a JSON file.

### Webpage Contents

Gather the contents of the webpage in text format using the `requests` library and assign it to the variable `html_data`.

In [2]:
url = 'https://web.archive.org/web/20200318083015/https://en.wikipedia.org/wiki/List_of_largest_banks'
html_data = requests.get(url).text
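
The request above assumes the archive responds quickly and successfully. A slightly more defensive variant is sketched below. The helper name `fetch_html`, the 30-second timeout, and the injectable `get` parameter are assumptions made for illustration and are not part of the original exercise.

```python
import requests


def fetch_html(url, timeout=30, get=requests.get):
    """Return the body of `url` as text, raising on HTTP error statuses.

    `get` defaults to requests.get; it is a parameter only so the helper
    can be exercised without network access (hypothetical design choice).
    """
    response = get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently returning an error page as if it were the article.
    response.raise_for_status()
    return response.text
```

With this helper, the cell above could be written as `html_data = fetch_html(url)`.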

In [1]:
# html_data

### Scraping the Data

Using the page contents and `BeautifulSoup`, load the data from the `By market capitalization` table into a `pandas` dataframe. The dataframe should have `Name` and `Market Cap (US$ Billion)` as column names. Display the first five rows using `head()`.

Parse the contents of the webpage using BeautifulSoup.

In [4]:
soup = BeautifulSoup(html_data, "html.parser")

In [None]:
print(soup.prettify())

Load the data from the `By market capitalization` table into a pandas dataframe with the columns `Name` and `Market Cap (US$ Billion)`. Starting from the empty dataframe `data`, use the given loop to extract the necessary data from each row and add it to the dataframe.

In [11]:
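data = pd.DataFrame(columns=["Name", "Market Cap (US$ Billion)"])

# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
rows = []
for row in soup.find_all('tbody')[2].find_all('tr'):
    col = row.find_all('td')
    if len(col) == 0:
        continue  # skip rows without <td> cells (e.g. the header row)
    rows.append({"Name": col[1].text.strip(), "Market Cap (US$ Billion)": col[2].text.strip()})

# Build the dataframe in one step, reusing the column names defined above
data = pd.DataFrame(rows, columns=data.columns)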
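
The loop above selects the table by position (`find_all('tbody')[2]`), which depends on the exact layout of the page. As a rough alternative, the table can be located relative to its section heading. The sketch below runs on a small hand-written HTML sample, and it assumes the heading carries the id `By_market_capitalization`; that id should be checked against the archived page before relying on it. The helper name `find_table_after_heading` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written sample that mimics the assumed page structure.
sample_html = """
<h2><span class="mw-headline" id="Other">Other</span></h2>
<table><tbody>
<tr><td>1</td><td>Ignored Bank</td><td>1.0</td></tr>
</tbody></table>
<h2><span class="mw-headline" id="By_market_capitalization">By market capitalization</span></h2>
<table><tbody>
<tr><th>Rank</th><th>Bank name</th><th>Market cap</th></tr>
<tr><td>1</td><td>JPMorgan Chase</td><td>390.934</td></tr>
</tbody></table>
"""


def find_table_after_heading(parsed, heading_id):
    """Return the first <table> that follows the element with `heading_id`."""
    heading = parsed.find(id=heading_id)
    if heading is None:
        raise ValueError(f"No element with id {heading_id!r} found")
    table = heading.find_next("table")
    if table is None:
        raise ValueError(f"No table found after {heading_id!r}")
    return table


sample_soup = BeautifulSoup(sample_html, "html.parser")
table = find_table_after_heading(sample_soup, "By_market_capitalization")

# Keep only rows that contain data cells, as in the loop above.
rows = [
    [td.text.strip() for td in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find_all("td")
]
```

Looking the table up by heading avoids a silent mismatch if another table is added earlier on the page, at the cost of depending on the heading id instead.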

In [12]:
data.head()

|   | Name | Market Cap (US$ Billion) |
|---|------|--------------------------|
| 0 | JPMorgan Chase | 390.934 |
| 1 | Industrial and Commercial Bank of China | 345.214 |
| 2 | Bank of America | 325.331 |
| 3 | Wells Fargo | 308.013 |
| 4 | China Construction Bank | 257.399 |
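
Because the values are extracted with `.text.strip()`, the market cap column holds strings rather than numbers. If numeric operations such as sorting or summing are needed, the column can be converted first. The following is a minimal sketch on a small sample copied from the output above; the variable name `sample` is used only for illustration.

```python
import pandas as pd

# Two rows copied from the scraped output, stored as strings.
sample = pd.DataFrame(
    {
        "Name": ["JPMorgan Chase", "Industrial and Commercial Bank of China"],
        "Market Cap (US$ Billion)": ["390.934", "345.214"],
    }
)

# errors="coerce" turns any value that cannot be parsed into NaN
# instead of raising an exception.
sample["Market Cap (US$ Billion)"] = pd.to_numeric(
    sample["Market Cap (US$ Billion)"], errors="coerce"
)
```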


### Loading the Data

Load the `pandas` dataframe created above into a JSON file named `bank_market_cap.json` using the `to_json()` method.

In [13]:
data.to_json("bank_market_cap.json")
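
By default, `to_json()` on a dataframe uses the `columns` orientation, which nests values under each column name and row index. If a list of one object per bank is preferred, `orient="records"` is an option. The sketch below writes a small sample to a temporary directory (an assumption, to avoid overwriting the real output file) and reads it back with the standard `json` module.

```python
import json
import os
import tempfile

import pandas as pd

sample = pd.DataFrame(
    {
        "Name": ["JPMorgan Chase", "Industrial and Commercial Bank of China"],
        "Market Cap (US$ Billion)": ["390.934", "345.214"],
    }
)

# Write to a temporary location so the real output file is left untouched.
out_path = os.path.join(tempfile.mkdtemp(), "bank_market_cap.json")
sample.to_json(out_path, orient="records")

# Read the file back to confirm it contains one object per row.
with open(out_path) as f:
    records = json.load(f)
```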