# ETL process project

The main idea is to apply an ETL (Extract, Transform, Load) process on a small scale, using Python and handling several file formats (.csv, .json).

The task is to build a dataset containing the 50 largest companies in the world by revenue, including their revenue in several currencies (USD, EUR, GBP, JPY, BRL, ARS) and two columns with data about each company's country of origin (population and GDP per capita).

To accomplish this task, the project is divided into three main sections:

**1. Extract**: data from different sources is collected and extracted into our local environment.
- **Web scraping:** a list of the 50 largest companies by revenue is scraped from Wikipedia using the library "Beautiful Soup".

Link: https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue

- **API communication:** the exchange rates from EUR to several currencies are downloaded in .json format from the "Exchange rates" API.

Link: http://api.exchangeratesapi.io

- **Downloading datasets:** two datasets (population, GDP per capita) are downloaded from Gapminder.

Link: https://www.gapminder.org/data/

> Two modules (***collect_data.py*** and ***etl_module.py***) are written to perform the operations in this section.

**2. Transform:** some data manipulation operations are carried out to build the main .csv file
- Data cleaning
- Currency conversion
- Merging

**3. Load:** export the resulting DataFrame to a single .csv file called "final_dataset.csv".
> To load the dataset into an RDBMS, the data can first be normalized.

##### Import modules

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import glob
from datetime import datetime
from collect_data import scraper, talk_to_api
from etl_module import extract_csv, extract, load, log

log("ETL Job Started")

2022-Apr-18-20:20:34, ETL Job Started
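
The `log` helper is imported from `etl_module.py`, whose source is not shown here. A minimal sketch of what such a helper might look like, assuming it only prefixes the message with a timestamp in the format seen in the output above. The name `format_log_line` and the optional `logfile` parameter are hypothetical:

```python
from datetime import datetime

# Matches the timestamps in the notebook output, e.g. 2022-Apr-18-20:20:34
# (%b gives the abbreviated month name in the current locale)
TIMESTAMP_FORMAT = "%Y-%b-%d-%H:%M:%S"


def format_log_line(message, now=None):
    """Return 'timestamp, message' (hypothetical helper)."""
    now = now or datetime.now()
    return f"{now.strftime(TIMESTAMP_FORMAT)}, {message}"


def log(message, logfile=None):
    """Print the log line and, if a path is given, append it to that file."""
    line = format_log_line(message)
    print(line)
    if logfile is not None:
        with open(logfile, "a", encoding="utf-8") as f:
            f.write(line + "\n")


example_line = format_log_line("ETL Job Started", datetime(2022, 4, 18, 20, 20, 34))
print(example_line)
```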



## 1. Extract

### 1.1 Web scraping

In [2]:
log("Extract phase started")

soup = scraper("https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue")

columns = ["Company", "Industry", "Revenue (USD millions)", "Employees", "Country"]
rows = []

table = soup.find_all("tbody")[0]
for row in table.find_all("tr"):
    col = row.find_all("td")
    if col:  # skip rows without <td> cells
        rows.append({
            "Company": col[0].text.strip(),
            "Industry": col[1].text.strip(),
            # The thousands separator is turned into a decimal point ("559,151" -> 559.151),
            # so the stored figure is effectively in USD billions
            "Revenue (USD millions)": float(col[2].text.strip().replace("$", "").replace(",", ".")),
            "Employees": int(col[4].text.strip().replace(",", "")),
            "Country": col[5].text.strip(),
        })

# DataFrame.append was removed in pandas 2.0, so the rows are collected first
df = pd.DataFrame(rows, columns=columns)
df.to_csv("list_of_largest_companies_by_revenue.csv")

2022-Apr-18-20:20:37, Extract phase started
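
The revenue cell scraped from the page is a string such as "$559,151". A small sketch of two ways to parse it, assuming that format (a dollar sign and a comma as thousands separator, figures in USD millions). `parse_revenue_millions` and `parse_revenue_billions` are hypothetical names, not part of the project's modules:

```python
def parse_revenue_millions(text):
    """Parse a string like '$559,151' into a float in USD millions."""
    return float(text.strip().replace("$", "").replace(",", ""))


def parse_revenue_billions(text):
    """Same value expressed in USD billions."""
    return parse_revenue_millions(text) / 1000


millions = parse_revenue_millions("$559,151")
billions = parse_revenue_billions("$559,151")
print(millions, billions)
```

Dropping the comma keeps the unit explicit, while dividing by 1000 gives the same magnitude as the figures shown in this notebook.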



### 1.2. Communicate with the API

In [3]:
talk_to_api(key = "4937cfda4cab4fc7dfc45a0e5ccc882a", csv_name = "eur_exchange_rates")

Status code (if 200 then success):  200
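
The source of `talk_to_api` is not shown here. The sketch below illustrates how a rates payload could be turned into a one-column CSV like the one read in the next step. The payload structure (a JSON object with `"base"` and `"rates"` keys) is an assumption about the API response, and `rates_to_csv` is a hypothetical helper. The sample is parsed from a literal string, so no request is made:

```python
import csv
import io
import json

# Sample payload; the structure is an assumption, the values are rates shown in this notebook
payload = json.loads('{"base": "EUR", "rates": {"AED": 3.96, "GBP": 0.828, "USD": 1.078}}')


def rates_to_csv(payload):
    """Write one 'currency,rate' row per entry and return the CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["", "Rates"])  # empty header for the index column
    for code, rate in sorted(payload["rates"].items()):
        writer.writerow([code, rate])
    return buffer.getvalue()


csv_text = rates_to_csv(payload)
print(csv_text)
```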


### 1.3. Extract files

In [4]:
# Companies csv
companies = extract("list_of_largest_companies_by_revenue", 
                    columns = ["Company", "Industry", "Revenue (USD millions)", "Employees", "Country"],
                    sep = ",")
print(companies.head())

# Exchange rates
rates = extract("eur_exchange_rates", columns = ["Rates"], sep = ",")
print(rates.head())

# Population
population = extract("population_total", columns = ["Population"], sep = ";")
print(population.head())

# GDP
gdp = extract("gdppercapita_us_inflation_adjusted", columns = ["GDP"], sep = ";")
print(gdp.head())

                    Company                        Industry  \
0                   Walmart                          Retail   
1                State Grid                     Electricity   
2                    Amazon  Retail, Information Technology   
3  China National Petroleum                     Oil and gas   
4             Sinopec Group                     Oil and gas   

   Revenue (USD millions)  Employees        Country  
0                 559.151    2300000  United States  
1                 386.617     896360          China  
2                 386.064    1608000  United States  
3                 283.958    1242245          China  
4                 283.728     553833          China  
       Rates
AED    3.960
AFN   94.333
ALL  120.850
AMD  508.140
ANG    1.962
                     Population
Afghanistan               38.9M
Angola                    32.9M
Albania                   2.88M
Andorra                   77.3k
United Arab Emirates      9.89M
               GDP
Aruba   

In [5]:
log("Extract phase finished")

2022-Apr-18-20:20:47, Extract phase finished



## 2. Transform

In [6]:
log("Transform phase started")

2022-Apr-18-20:20:49, Transform phase started



##### Extract currencies

In [7]:
currencies = ["ARS", "BRL", "GBP", "USD", "JPY"]
currencies_dict = {}

for i in rates.index:
    if i in currencies:
        currencies_dict[i] = rates.loc[i, "Rates"]

currencies_dict        

{'ARS': 123.179, 'BRL': 5.022, 'GBP': 0.828, 'JPY': 136.856, 'USD': 1.078}

##### Convert USD to EUR (reference currency)

In [8]:
companies["Revenue (EUR millions)"] = round(companies["Revenue (USD millions)"] / currencies_dict["USD"], 3)
companies.head()

| | Company | Industry | Revenue (USD millions) | Employees | Country | Revenue (EUR millions) |
|---|---|---|---|---|---|---|
| 0 | Walmart | Retail | 559.151 | 2300000 | United States | 518.693 |
| 1 | State Grid | Electricity | 386.617 | 896360 | China | 358.643 |
| 2 | Amazon | Retail, Information Technology | 386.064 | 1608000 | United States | 358.13 |
| 3 | China National Petroleum | Oil and gas | 283.958 | 1242245 | China | 263.412 |
| 4 | Sinopec Group | Oil and gas | 283.728 | 553833 | China | 263.199 |


##### Build other columns

In [9]:
currencies.remove("USD")
for currency in currencies:
    companies[f"Revenue ({currency} millions)"] = round(companies["Revenue (EUR millions)"] * currencies_dict[currency], 3)

companies.head()

| | Company | Industry | Revenue (USD millions) | Employees | Country | Revenue (EUR millions) | Revenue (ARS millions) | Revenue (BRL millions) | Revenue (GBP millions) | Revenue (JPY millions) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Walmart | Retail | 559.151 | 2300000 | United States | 518.693 | 63892.085 | 2604.876 | 429.478 | 70986.249 |
| 1 | State Grid | Electricity | 386.617 | 896360 | China | 358.643 | 44177.286 | 1801.105 | 296.956 | 49082.446 |
| 2 | Amazon | Retail, Information Technology | 386.064 | 1608000 | United States | 358.13 | 44114.095 | 1798.529 | 296.532 | 49012.239 |
| 3 | China National Petroleum | Oil and gas | 283.958 | 1242245 | China | 263.412 | 32446.827 | 1322.855 | 218.105 | 36049.513 |
| 4 | Sinopec Group | Oil and gas | 283.728 | 553833 | China | 263.199 | 32420.59 | 1321.785 | 217.929 | 36020.362 |
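
The rates are quoted per 1 EUR, so a USD amount is first divided by the USD rate to get EUR, and the EUR amount is then multiplied by each target rate. A short worked sketch of that two-step conversion for the Walmart row, using the rates printed above (`convert_from_usd` is a hypothetical helper name):

```python
rates_per_eur = {"ARS": 123.179, "BRL": 5.022, "GBP": 0.828, "JPY": 136.856, "USD": 1.078}


def convert_from_usd(amount_usd, rates):
    """Convert a USD amount to EUR, then from EUR to every other currency in `rates`."""
    eur = round(amount_usd / rates["USD"], 3)
    converted = {"EUR": eur}
    for currency, rate in rates.items():
        if currency != "USD":
            converted[currency] = round(eur * rate, 3)
    return converted


walmart = convert_from_usd(559.151, rates_per_eur)
print(walmart)
```

Because every target currency is derived from the rounded EUR value, the results match the first row of the table above.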


##### Data cleaning: population and GDP per capita

In [10]:
# Select countries of interest
countries = companies["Country"].unique()
countries

array(['United States', 'China', 'Japan', 'Germany', 'Saudi Arabia',
       'South Korea', 'United Kingdom', 'Netherlands', 'Taiwan',
       'Singapore', 'Switzerland', 'France'], dtype=object)

In [11]:
# Get and clean values from populations
# Gapminder abbreviates values: "77.3k", "2.88M", "1.44B"
multipliers = {"k": 1_000, "M": 1_000_000, "B": 1_000_000_000}
population_dict = {}
for country in countries:
    value = population.loc[country, "Population"]
    if value[-1] in multipliers:
        # Scale by the suffix so any number of decimals is handled ("5.85M" -> 5850000)
        population_dict[country] = round(float(value[:-1]) * multipliers[value[-1]])
    else:
        population_dict[country] = int(value)

population_dict

{'United States': 331000000,
 'China': 1440000000,
 'Japan': 126000000,
 'Germany': 83800000,
 'Saudi Arabia': 34800000,
 'South Korea': 51300000,
 'United Kingdom': 67900000,
 'Netherlands': 17100000,
 'Taiwan': 23800000,
 'Singapore': 5850000,
 'Switzerland': 8650000,
 'France': 65300000}

In [12]:
# Build the column "Population" by looking up each company's country
companies["Population"] = companies["Country"].map(population_dict)
companies.head()

| | Company | Industry | Revenue (USD millions) | Employees | Country | Revenue (EUR millions) | Revenue (ARS millions) | Revenue (BRL millions) | Revenue (GBP millions) | Revenue (JPY millions) | Population |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Walmart | Retail | 559.151 | 2300000 | United States | 518.693 | 63892.085 | 2604.876 | 429.478 | 70986.249 | 331000000 |
| 1 | State Grid | Electricity | 386.617 | 896360 | China | 358.643 | 44177.286 | 1801.105 | 296.956 | 49082.446 | 1440000000 |
| 2 | Amazon | Retail, Information Technology | 386.064 | 1608000 | United States | 358.13 | 44114.095 | 1798.529 | 296.532 | 49012.239 | 331000000 |
| 3 | China National Petroleum | Oil and gas | 283.958 | 1242245 | China | 263.412 | 32446.827 | 1322.855 | 218.105 | 36049.513 | 1440000000 |
| 4 | Sinopec Group | Oil and gas | 283.728 | 553833 | China | 263.199 | 32420.59 | 1321.785 | 217.929 | 36020.362 | 1440000000 |


In [13]:
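# Get and clean values from gdp (GDP per capita)
gdp_dict = {}
for country in countries:
    try:
        value = gdp.loc[country, "GDP"]
        # "58.5k" -> 58500; values without a suffix are kept as they are
        if value.endswith("k"):
            gdp_dict[country] = round(float(value[:-1]) * 1000)
        else:
            gdp_dict[country] = round(float(value))
    except (KeyError, AttributeError):
        # Missing row or empty value; in this run only Taiwan takes this branch
        # Taiwan GDP: 33004 (Wikipedia)
        gdp_dict[country] = 33004

gdp_dict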

{'United States': 58500,
 'China': 10400,
 'Japan': 34400,
 'Germany': 41300,
 'Saudi Arabia': 18700,
 'South Korea': 31300,
 'United Kingdom': 41800,
 'Netherlands': 46300,
 'Taiwan': 33004,
 'Singapore': 58100,
 'Switzerland': 85700,
 'France': 35800}

In [14]:
# Build the column "GDP" by looking up each company's country
companies["GDP"] = companies["Country"].map(gdp_dict)
companies.head()

| | Company | Industry | Revenue (USD millions) | Employees | Country | Revenue (EUR millions) | Revenue (ARS millions) | Revenue (BRL millions) | Revenue (GBP millions) | Revenue (JPY millions) | Population | GDP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Walmart | Retail | 559.151 | 2300000 | United States | 518.693 | 63892.085 | 2604.876 | 429.478 | 70986.249 | 331000000 | 58500 |
| 1 | State Grid | Electricity | 386.617 | 896360 | China | 358.643 | 44177.286 | 1801.105 | 296.956 | 49082.446 | 1440000000 | 10400 |
| 2 | Amazon | Retail, Information Technology | 386.064 | 1608000 | United States | 358.13 | 44114.095 | 1798.529 | 296.532 | 49012.239 | 331000000 | 58500 |
| 3 | China National Petroleum | Oil and gas | 283.958 | 1242245 | China | 263.412 | 32446.827 | 1322.855 | 218.105 | 36049.513 | 1440000000 | 10400 |
| 4 | Sinopec Group | Oil and gas | 283.728 | 553833 | China | 263.199 | 32420.59 | 1321.785 | 217.929 | 36020.362 | 1440000000 | 10400 |


In [15]:
log("Transform phase finished")

2022-Apr-18-20:20:56, Transform phase finished



## 3. Load

In [16]:
load(companies, "final_dataset")
log("ETL process finished")

2022-Apr-18-20:20:58, ETL process finished
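
As noted in the overview, the flat dataset can be normalized before loading it into a relational database. A minimal sketch of one possible layout, using an in-memory SQLite database and three rows from the tables above. The table and column names are hypothetical, and this is not part of the project's `load` function:

```python
import sqlite3

rows = [
    # (company, industry, revenue, employees, country, population, gdp)
    ("Walmart", "Retail", 559.151, 2300000, "United States", 331000000, 58500),
    ("State Grid", "Electricity", 386.617, 896360, "China", 1440000000, 10400),
    ("Amazon", "Retail, Information Technology", 386.064, 1608000, "United States", 331000000, 58500),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries (
        id INTEGER PRIMARY KEY,
        name TEXT UNIQUE NOT NULL,
        population INTEGER,
        gdp INTEGER
    );
    CREATE TABLE companies (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        industry TEXT,
        revenue REAL,
        employees INTEGER,
        country_id INTEGER NOT NULL REFERENCES countries(id)
    );
""")

for company, industry, revenue, employees, country, population, gdp in rows:
    # Country attributes are stored once instead of being repeated on every company row
    conn.execute(
        "INSERT OR IGNORE INTO countries (name, population, gdp) VALUES (?, ?, ?)",
        (country, population, gdp),
    )
    (country_id,) = conn.execute(
        "SELECT id FROM countries WHERE name = ?", (country,)
    ).fetchone()
    conn.execute(
        "INSERT INTO companies (name, industry, revenue, employees, country_id) "
        "VALUES (?, ?, ?, ?, ?)",
        (company, industry, revenue, employees, country_id),
    )
conn.commit()

country_count = conn.execute("SELECT COUNT(*) FROM countries").fetchone()[0]
company_count = conn.execute("SELECT COUNT(*) FROM companies").fetchone()[0]
print(country_count, company_count)
```

Splitting the country columns into their own table removes the repetition of population and GDP values across companies from the same country.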



# Author

Santiago Vallespir