# Data Collection and Preparation

The following workbook collects all data required for the analysis. This includes: 

 * Economic data: GDP per capita growth rates and inflation rates
 * Coronavirus data: daily case numbers, total cases, deaths
 * Fear and greed index: related to crypto assets and stocks
 * Asset price data: Bitcoin, Ethereum, S&P 500 and the overall cryptocurrency market capitalisation figure

---

### Import the modules required for the analysis

The data collection process relies on several Python modules, chiefly pandas.

In [1]:
# Import the modules required for the analysis
import pandas as pd
import datetime as dt
import time
import numpy as np
import os
import requests
import json
from pathlib import Path
import bs4 as bs
import yfinance as yf
from visual import create_forecast

### Coronavirus Data

This section of the data collection process collects time series data related to COVID-19. The objective of collecting this data is to prepare the dataset for multiple visualisations (such as geographic representation and time-series comparison against other indicators). 

In [2]:
# Specify the URL to the raw github content
url = 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv'
geo_url = 'https://raw.githubusercontent.com/albertyw/avenews/master/old/data/average-latitude-longitude-countries.csv'

# Set path for writing CSV files to data
write_path = "./Data/"

# Read the covid data into a dataframe
covid = pd.read_csv(url)

# Read the geographic data into a dataframe
geo_data = pd.read_csv(geo_url)

# Print datatypes
print(covid.dtypes)

# Display data
covid.head(5)

iso_code                                    object
continent                                   object
location                                    object
date                                        object
total_cases                                float64
                                            ...   
human_development_index                    float64
excess_mortality_cumulative_absolute       float64
excess_mortality_cumulative                float64
excess_mortality                           float64
excess_mortality_cumulative_per_million    float64
Length: 67, dtype: object


Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
0,AFG,Asia,Afghanistan,2020-02-24,5.0,5.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
1,AFG,Asia,Afghanistan,2020-02-25,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
2,AFG,Asia,Afghanistan,2020-02-26,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
3,AFG,Asia,Afghanistan,2020-02-27,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
4,AFG,Asia,Afghanistan,2020-02-28,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,


Convert the date column to a datetime format using pandas' *to_datetime* function.

The code then filters for the time period used in this report. For the purposes of analysis and data availability, the filter keeps dates from **1st January 2020 to 31st December 2021**. The source data is updated continuously, so it must be filtered to capture the correct dates.

In [3]:
# Convert date to datetime
covid['date'] = pd.to_datetime(covid['date'], infer_datetime_format = True)

# Filter 'date' based on assessment period specified above
covid = covid[(covid['date'] >= '2020-01-01') & (covid['date'] <= '2021-12-31')]
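The conversion and filter can also be written with an explicit date format and `Series.between`. The following is a minimal sketch on made-up rows, not the OWID data; the frame `sample` and the bounds `start` and `end` are assumptions chosen to mirror the cell above.

```python
import pandas as pd

# Hypothetical rows standing in for the OWID 'date' and 'new_cases' columns
sample = pd.DataFrame({
    'date': ['2019-12-31', '2020-01-01', '2021-12-31', '2022-01-01'],
    'new_cases': [1.0, 2.0, 3.0, 4.0],
})

# Parse ISO dates with an explicit format rather than relying on inference
sample['date'] = pd.to_datetime(sample['date'], format='%Y-%m-%d')

# Assumed bounds, mirroring the filter used in the cell above
start, end = pd.Timestamp('2020-01-01'), pd.Timestamp('2021-12-31')

# between() is inclusive on both ends by default
in_period = sample[sample['date'].between(start, end)]
print(in_period)
```

Stating the format explicitly makes a malformed date fail loudly instead of being parsed under a guessed format.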

The time series data is now ready to use. Next, the data is aggregated per country for each of 2020 and 2021.

In [4]:
# Select only relevant columns, group by location and year, and sum new cases and new deaths
covid_grouped = covid.iloc[:,[2,3,5,8]].groupby(['location', covid.date.dt.year]).sum().reset_index().\
                                        rename(columns = {'new_cases' : 'Cases',
                                                          'new_deaths' : 'Deaths',
                                                          'location' : 'Country'})
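Selecting columns by position depends on the column order of the source file. As a hedged alternative, the same yearly aggregation can be sketched with column names on a small made-up frame; `daily` and `yearly` are illustrative names, not objects from this workbook.

```python
import pandas as pd

# Hypothetical daily data with the same column names as the OWID file
daily = pd.DataFrame({
    'location': ['A', 'A', 'A', 'B'],
    'date': pd.to_datetime(['2020-03-01', '2020-03-02', '2021-01-05', '2020-06-01']),
    'new_cases': [5.0, 3.0, 10.0, 7.0],
    'new_deaths': [0.0, 1.0, 2.0, 0.0],
})

# Add an explicit year column, then sum the daily figures per country and year
yearly = (
    daily
    .assign(year=daily['date'].dt.year)
    .groupby(['location', 'year'], as_index=False)[['new_cases', 'new_deaths']]
    .sum()
    .rename(columns={'new_cases': 'Cases',
                     'new_deaths': 'Deaths',
                     'location': 'Country'})
)
print(yearly)
```

Naming the summed columns keeps the datetime column out of the aggregation and makes the result independent of column order.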

The following code merges the *covid_grouped* dataframe with the *geo_data* dataframe to produce a combined dataframe with Covid-19 data and locational data.

In [5]:
# Join the locational data from geo_data with the covid_grouped dataframe
covid_grouped = pd.merge(covid_grouped, geo_data, on = 'Country', how = 'inner')

# Write the Covid-19 data to CSV format
covid_grouped.to_csv(write_path + "covid_yoy_data.csv", index = False)

# Display joined results
covid_grouped

Unnamed: 0,Country,date,Cases,Deaths,ISO 3166 Country Code,Latitude,Longitude
0,Afghanistan,2020,52330.0,2189.0,AF,33.0,65.0
1,Afghanistan,2021,105754.0,5167.0,AF,33.0,65.0
2,Albania,2020,58316.0,1181.0,AL,41.0,20.0
3,Albania,2021,151908.0,2036.0,AL,41.0,20.0
4,Algeria,2020,99610.0,2756.0,DZ,28.0,3.0
...,...,...,...,...,...,...,...
378,Yemen,2021,8027.0,1374.0,YE,15.0,48.0
379,Zambia,2020,20725.0,388.0,ZM,-15.0,30.0
380,Zambia,2021,233549.0,3346.0,ZM,-15.0,30.0
381,Zimbabwe,2020,13867.0,363.0,ZW,-20.0,30.0
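An inner join silently drops any country whose name is spelled differently in the two sources. One way to inspect what would be dropped is an outer merge with `indicator=True`. The sketch below uses invented rows and a deliberately mismatched spelling; none of the values come from the actual datasets.

```python
import pandas as pd

# Hypothetical frames; the spelling mismatch is invented for illustration
cases = pd.DataFrame({'Country': ['Albania', 'Vietnam', 'World'],
                      'Cases': [10, 20, 30]})
coords = pd.DataFrame({'Country': ['Albania', 'Viet Nam'],
                       'Latitude': [41.0, 16.0]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
check = pd.merge(cases, coords, on='Country', how='outer', indicator=True)

# Rows that an inner join would have discarded
unmatched = check.loc[check['_merge'] != 'both', ['Country', '_merge']]
print(unmatched)
```

Reviewing `unmatched` before the inner join shows whether names need harmonising or whether a code-based key would be safer.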


### Economic data

There are two primary economic indicators that will be used to answer a number of questions relating to this analysis. These are Gross Domestic Product (GDP) per capita (per capita translates to *'per person'*) and inflation (the increase in the price of goods and services).

In [6]:
# Read real GDP growth data 
real_gdp = pd.read_csv(write_path + 'real_gdp_growth.csv')

# Read inflation rate data (World Bank) from CSV
inflation = pd.read_csv(write_path + 'inflation_annual.csv')

The data is in wide format (one column per year) rather than panel (long) format, so it needs to be reshaped. The following code stacks the year columns into rows.

In [7]:
# Stack data, reset index and rename columns
real_gdp = real_gdp.set_index(['Country','Country Code', 'Indicator Name', 'Indicator Code']).\
                    stack().reset_index().rename(columns = {'level_4' : 'Year', 
                                                            0 : 'GDP per capita growth (annual %)'})

# Select only relevant columns within the real_gdp dataframe
real_gdp = real_gdp.iloc[:,[0,1,4,5]]

# Apply the same logic to the inflation data
inflation = inflation.set_index(['Country','Country Code', 'Indicator Name', 'Indicator Code']).\
                      stack().reset_index().rename(columns = {'level_4' : 'Year', 
                                                              0 : 'Inflation Rate'})

# Select only relevant columns within the inflation dataframe
inflation = inflation.iloc[:,[0,1,4,5]]
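The wide-to-long reshape can also be expressed with `DataFrame.melt`, which names the new columns directly. This is a sketch on a made-up two-country frame with the same identifier columns; `wide`, `id_cols` and `long_df` are illustrative names, and `melt` is swapped in for the `stack` approach used above.

```python
import pandas as pd

# Hypothetical wide-format data: one row per country, one column per year
wide = pd.DataFrame({
    'Country': ['Country A', 'Country B'],
    'Country Code': ['AAA', 'BBB'],
    'Indicator Name': ['GDP per capita growth (annual %)'] * 2,
    'Indicator Code': ['X', 'X'],
    '2019': [1.5, -0.5],
    '2020': [-4.0, None],
})

id_cols = ['Country', 'Country Code', 'Indicator Name', 'Indicator Code']
value_col = 'GDP per capita growth (annual %)'

# melt turns each year column into rows; dropna removes the missing years
long_df = (
    wide
    .melt(id_vars=id_cols, var_name='Year', value_name=value_col)
    .dropna(subset=[value_col])
    [['Country', 'Country Code', 'Year', value_col]]
    .reset_index(drop=True)
)
print(long_df)
```

Selecting the output columns by name avoids the positional `iloc` step.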

Convert year column to datetime format.

In [8]:
# List of dataframes
economic_data = [real_gdp, inflation]

# Initiate for loop
for eco in economic_data:
    
    # Convert to datetime
    eco['Year'] = pd.to_datetime(eco['Year'], infer_datetime_format = True)

Join the dataframes.

In [9]:
# Merge dataframes
eco_data = pd.merge(real_gdp, inflation, on = ['Country','Year','Country Code'] , how = 'inner')

# Take the average inflation rate and GDP per capita growth rate for 2019 and 2020 only (to join with Covid-19 data)
eco_avg = eco_data[eco_data['Year'] >= '2019-01-01'].groupby(['Country','Country Code',eco_data.Year.dt.year]).\
                                                     mean().rename(columns = {
                                        'GDP per capita growth (annual %)' : 'Average GDP Per Capita Growth Rate',
                                        'Inflation Rate' : 'Average Inflation Rate'}).reset_index()

# Display results
eco_avg

Unnamed: 0,Country,Country Code,Year,Average GDP Per Capita Growth Rate,Average Inflation Rate
0,Afghanistan,AFG,2019,1.535637,2.302373
1,Africa Eastern and Southern,AFE,2019,-0.570661,3.923372
2,Africa Eastern and Southern,AFE,2020,-5.394391,4.978097
3,Africa Western and Central,AFW,2019,0.501092,1.758565
4,Africa Western and Central,AFW,2020,-3.502433,2.425007
...,...,...,...,...,...
396,West Bank and Gaza,PSE,2020,-13.631959,-0.735332
397,World,WLD,2019,1.480856,2.167730
398,World,WLD,2020,-4.395210,1.936941
399,Zambia,ZMB,2019,-1.451364,9.150316


---

### Combined Economic and Covid Data

The following code combines both Coronavirus data prepared earlier in this workbook with the economic data prepared above. Data is joined based on country name. The following approach has been undertaken to deal with data quality issues: 

 * Nations with a year of data missing will be excluded for completeness

There are also a number of limitations to mention. These include:

 * GDP per capita figures are not available for every country, so some countries are excluded from the dataset
 * Using country name as the joining key works for most countries, but names may differ between datasets
 
The following code combines the abovementioned dataframes.

In [10]:
# Combine dataframes using country as the key
eco_covid = pd.merge(eco_avg, 
                     covid_grouped, 
                     how = 'left', 
                     left_on = ['Country','Year'], 
                     right_on = ['Country','date'])

# Count number of country occurrences to exclude from dataset (if not containing both 2019 and 2020 data)
count_country = eco_covid.groupby('Country').size().reset_index(name = 'Count')

# Join count to eco_covid dataframe
eco_covid = pd.merge(eco_covid, count_country, on = 'Country', how = 'inner')

# Remove any countries with missing 2019 or 2020 data
eco_covid = eco_covid[eco_covid['Count'] > 1].drop(columns = 'Count')

# Fill in NaN 2019 data with identical latitude and longitude data for a more complete dataset
eco_covid[['Latitude','Longitude']] = eco_covid.groupby('Country')[['Latitude','Longitude']].bfill()

# Remove rows that still have no latitude after the fill
eco_covid = eco_covid[eco_covid['Latitude'].notna()]

# Assess economic impact of GDP per capita and inflation for each country
eco_covid = eco_covid.iloc[:,[0,3,4,9,10]].set_index(['Country','Latitude','Longitude'])

# Calculate difference
eco_covid = eco_covid.groupby(eco_covid.index).diff().dropna(how = 'any').rename(columns = {
                                        'Average GDP Per Capita Growth Rate' : 'Change in GDP (Year on Year)',
                                        'Average Inflation Rate' : 'Change in Inflation Rate (Year on Year)'}).reset_index()

# Identify GDP per capita impact (whether positive or negative): to be used in Mapbox
eco_covid['Impact'] = eco_covid['Change in GDP (Year on Year)'].apply(lambda x : 'Negative' if x < 0 else 'Positive')

# Convert GDP change to positive (to ensure mapbox 'size' argument is satisfied (does not accept negative float))
eco_covid['Change in GDP (Year on Year)'] = eco_covid['Change in GDP (Year on Year)'].\
                                                apply(lambda x : (x * -1) if x < 0 else x)
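The year-on-year step reduces to a per-country difference, a sign label and an absolute value for the marker size. A minimal sketch on invented growth rates follows; the frame `rates` and the column names `growth`, `change` and `size` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical growth rates for two countries over two years
rates = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B'],
    'Year': [2019, 2020, 2019, 2020],
    'growth': [2.0, -3.0, 1.0, 1.5],
})

# Sort so that diff() subtracts the earlier year from the later one
rates = rates.sort_values(['Country', 'Year'])
rates['change'] = rates.groupby('Country')['growth'].diff()

# The first year of each country has no predecessor, so its change is NaN
yoy = rates.dropna(subset=['change']).copy()

# Label the direction before discarding the sign
yoy['Impact'] = yoy['change'].apply(lambda x: 'Negative' if x < 0 else 'Positive')

# abs() gives a non-negative value suitable for a marker size argument
yoy['size'] = yoy['change'].abs()
print(yoy)
```

Keeping the signed change in its own column alongside the absolute size means the direction stays available after plotting.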

---

### Cryptocurrency and stock index data

The following code reads in the overall market capitalisation data for the entire cryptocurrency market, as well as Bitcoin and Ethereum data.

Other stock index data is also included in the analysis, specifically stock indices such as the Standard and Poor's 500 (S&P 500) index.

In [11]:
# Specify the path used where the data is located
crypto_tmc_path = Path("./Data/crypto_tmc.csv")

# Read the CSV file
crypto_tmc_data = pd.read_csv(crypto_tmc_path, 
                              index_col = "date", 
                              infer_datetime_format = True, 
                              parse_dates = True)

# Reset the index
crypto_tmc_data.reset_index(inplace = True)

# Generate sample data
crypto_tmc_data.head(5)

Unnamed: 0,date,open,high,low,close,MA,MA.1,Volume,Volume MA,OnBalanceVolume,RSI
0,2014-03-19,7734802000.0,7869879000.0,7623081000.0,7700141000.0,,,8512398.32,,-50927719.95,57.701674
1,2014-03-20,7688016000.0,7704364000.0,7290326000.0,7378310000.0,,,8512398.32,17958504.08,-59440118.27,50.721708
2,2014-03-21,7380354000.0,7667154000.0,7002819000.0,7179603000.0,,,8512398.32,16100911.41,-67952516.59,46.945709
3,2014-03-22,7180253000.0,7210748000.0,6808422000.0,7125488000.0,,,8512398.32,15695300.61,-76464914.91,45.94262
4,2014-03-23,7131450000.0,7197064000.0,7034979000.0,7105506000.0,,,22428897.94,15985514.79,-98893812.86,45.55555


For overall market capitalisation figures, only the date and close columns are relevant. The following code selects those two columns so that the data can be joined to other datasets, such as the fear and greed index.

In [12]:
# Select only relevant columns
crypto = crypto_tmc_data[['date','close']]

# Generate sample data
crypto

Unnamed: 0,date,close
0,2014-03-19,7.700141e+09
1,2014-03-20,7.378310e+09
2,2014-03-21,7.179603e+09
3,2014-03-22,7.125488e+09
4,2014-03-23,7.105506e+09
...,...,...
2839,2022-01-01,2.250000e+12
2840,2022-01-02,2.240000e+12
2841,2022-01-03,2.210000e+12
2842,2022-01-04,2.200000e+12


In [13]:
# Write to CSV
crypto.to_csv(write_path + 'cryptotmc_data_cleaned.csv')

The following code reads in historical S&P 500, Bitcoin and Ethereum data into a dataframe format. Further analysis and cleaning is performed on the dataset to prepare for future joins in the report. Relevant columns are selected (close) as part of the analysis.

In [14]:
# Obtain S&P500, Bitcoin and Ethereum data 
hist_data = yf.download("BTC-USD ETH-USD ^GSPC", 
                        start = "2018-01-01", 
                        end = "2021-12-31")

# Select only closing values
hist_data = hist_data['Close']

# Reset index
hist_data = hist_data.reset_index()

# Convert date column to datetime
hist_data['Date'] = pd.to_datetime(hist_data['Date'], infer_datetime_format = True)

# Display data
hist_data.dtypes

[*********************100%***********************]  3 of 3 completed


Date       datetime64[ns]
BTC-USD           float64
ETH-USD           float64
^GSPC             float64
dtype: object

The following code scrapes the Wikipedia table of companies included in the S&P 500 index. Once the tickers are captured, price data is downloaded for each stock, and the top ten and bottom ten performers over the download period (February 2020 to December 2021) are identified.

In [15]:
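# Request the Wikipedia page listing S&P 500 companies
sp_url = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')

# Create soup object (name the parser explicitly to avoid bs4's parser-guessing warning)
soup = bs.BeautifulSoup(sp_url.text, 'html.parser')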

# Create empty list
sp_tickers = []

# Identify the table with stored tickers
table = soup.find('table', {'class' : 'wikitable sortable'})

# Obtain rows of tickers
rows = table.findAll('tr')[1:]

# Initiate for loop
for row in rows:
    
    # Obtain ticker symbol in text format
    ticker = row.findAll('td')[0].text
    
    # Append list with ticker name
    sp_tickers.append(ticker[:-1])

The following code downloads price data for every company in the S&P 500 index and produces a dataframe. The call passes the tickers to yfinance as a single whitespace-separated string, so the list items are first joined into one string and assigned to a variable.

In [16]:
# Convert list of tickers to string value for YF API to read
ticker_string = ' '.join([str(ticker) for ticker in sp_tickers])

# Obtain S&P500 data
sp_all_data = yf.download(ticker_string, start = "2020-02-01", end = "2021-12-31")

# Select only closing values
sp_all_data = sp_all_data['Close']

# Display data
sp_all_data

[*********************100%***********************]  505 of 505 completed

2 Failed downloads:
- BRK.B: No data found, symbol may be delisted
- BF.B: No data found for this date range, symbol may be delisted


Date,A,AAL,AAP,AAPL,ABBV,ABC,ABMD,ABT,ACN,ADBE,...,XEL,XLNX,XOM,XRAY,XYL,YUM,ZBH,ZBRA,ZION,ZTS
2020-01-31,82.559998,26.840000,131.750000,77.377502,81.019997,85.559998,186.289993,87.139999,205.210007,351.140015,...,69.190002,84.480003,62.119999,56.000000,81.660004,105.769997,147.899994,239.020004,45.490002,134.210007
2020-02-03,82.150002,27.160000,132.649994,77.165001,82.300003,85.699997,185.949997,87.059998,207.800003,358.000000,...,69.449997,85.050003,60.730000,55.619999,83.339996,106.410004,148.509995,242.550003,46.110001,135.520004
2020-02-04,83.519997,28.430000,131.559998,79.712502,84.360001,88.089996,190.899994,88.230003,212.529999,366.739990,...,69.300003,85.779999,59.970001,56.570000,86.510002,106.709999,156.770004,247.869995,46.680000,138.970001
2020-02-05,84.930000,29.100000,137.020004,80.362503,86.629997,91.440002,190.729996,89.559998,212.220001,365.549988,...,69.300003,88.230003,62.730000,57.220001,87.610001,106.779999,157.699997,247.809998,48.009998,137.889999
2020-02-06,84.820000,28.299999,134.369995,81.302498,87.180000,92.480003,196.009995,89.470001,214.160004,367.459991,...,69.300003,87.720001,61.880001,57.599998,83.120003,103.739998,159.009995,252.089996,47.009998,138.970001
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-12-23,157.800003,18.260000,232.130005,176.279999,133.089996,129.529999,352.209991,139.160004,403.309998,569.619995,...,66.610001,216.110001,61.020000,55.430000,117.500000,135.339996,126.989998,582.409973,62.360001,242.509995
2021-12-27,158.740005,18.170000,236.500000,180.330002,134.410004,131.899994,357.829987,141.460007,415.329987,577.679993,...,66.820000,222.779999,61.889999,55.950001,118.290001,138.009995,127.809998,606.330017,63.009998,246.509995
2021-12-28,159.179993,18.540001,238.130005,179.289993,134.389999,132.360001,357.440002,140.470001,415.269989,569.359985,...,67.620003,220.270004,61.689999,56.029999,119.519997,137.979996,128.210007,597.320007,63.110001,244.250000
2021-12-29,160.649994,18.049999,241.029999,179.380005,135.360001,133.339996,361.839996,141.190002,415.420013,569.289978,...,67.959999,217.619995,61.150002,56.650002,119.360001,138.660004,128.229996,601.119995,63.450001,247.029999
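The download log above reports failures for BRK.B and BF.B. A likely cause, stated here as an assumption rather than something confirmed in this workbook, is that Yahoo Finance writes share-class tickers with a hyphen (for example BRK-B) while the Wikipedia table uses a dot. The helper below is a hypothetical sketch of normalising the scraped symbols before the download.

```python
# Hypothetical helper; the name is not part of the workbook or of yfinance
def to_yahoo_symbol(ticker: str) -> str:
    """Strip surrounding whitespace and swap '.' for '-' in a ticker."""
    return ticker.strip().replace('.', '-')

# Sample of scraped cell text, including the trailing newline seen in the table
scraped = ['MMM\n', 'BRK.B\n', 'BF.B\n']

# Build the whitespace-separated string used for the download call
ticker_string = ' '.join(to_yahoo_symbol(t) for t in scraped)
print(ticker_string)
```

Using `strip()` rather than slicing off the last character also copes with cells that have no trailing newline.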


The following code drops rows that contain no data and calculates each stock's return between the first and last rows of the downloaded range (early 2020 to the end of 2021), which covers the key Covid-19 period.

In [17]:
# Remove rows where every value is NaN
sp_all_data.dropna(how = 'all', inplace = True)

# Select the first and last rows of the dataframe
sp_all_data = sp_all_data.iloc[[0,-1]]

# Create a new dataframe to capture the movement in share price between the first and last rows
sp_movement = sp_all_data.pct_change()

# Drop NA values
sp_movement.dropna(how = 'all', inplace = True)

# Present data in stacked format
sp_movement = sp_movement.stack().reset_index().rename(columns = {'level_1' : 'Ticker',
                                                                  0 : 'Percentage Change (%)'}).drop(columns = 'Date')

# Multiply percentage change by 100
sp_movement['Percentage Change (%)'] = sp_movement['Percentage Change (%)'] * 100

# Display results
sp_movement

Unnamed: 0,Ticker,Percentage Change (%)
0,A,94.585754
1,AAL,-32.749631
2,AAP,82.944971
3,AAPL,131.824496
4,ABBV,67.069867
...,...,...
495,YUM,31.095781
496,ZBH,-13.299526
497,ZBRA,151.493592
498,ZION,39.481201
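The percentage change between the first and last rows can be checked by hand on a tiny example. The sketch below uses two invented tickers and prices; it computes the same first-to-last return directly by division instead of `pct_change`.

```python
import pandas as pd

# Hypothetical closing prices for two made-up tickers
prices = pd.DataFrame(
    {'AAA': [100.0, 110.0, 150.0], 'BBB': [50.0, 45.0, 40.0]},
    index=pd.to_datetime(['2020-01-31', '2020-06-30', '2021-12-30']),
)

# Keep only the first and last rows, as in the cell above
first_last = prices.iloc[[0, -1]]

# Return over the whole window, expressed as a percentage
change = (first_last.iloc[-1] / first_last.iloc[0] - 1) * 100

# Reshape to one row per ticker
movement = change.rename_axis('Ticker').reset_index(name='Percentage Change (%)')
print(movement)
```

Here AAA rises from 100 to 150 (a 50% gain) and BBB falls from 50 to 40 (a 20% loss), which is the behaviour expected of the workbook's figures.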


In [18]:
# Select the most impacted stocks over the assessment period
poorest_performers = sp_movement.nsmallest(10,
                                           'Percentage Change (%)',
                                           keep = 'all')

# Add in performance rating of bottom and top 10 for visualisations in main script
poorest_performers['Performance'] = 'Bottom 10'

# Select the best performing stocks over the assessment period
best_performers = sp_movement.nlargest(10,
                                       'Percentage Change (%)',
                                       keep = 'all')

# Add in performance rating of bottom and top 10 for visualisations in main script
best_performers['Performance'] = 'Top 10'

# Regroup the data for a complete dataset
performance_data = pd.concat([best_performers, 
                              poorest_performers])

# Write to CSV
performance_data.to_csv(write_path + "PERFORMANCE_DATA.csv", index = False)

---

### Fear and Greed Index

The following code reads in the fear and greed index data.

In [19]:
# Import CSV file
fear_greed = pd.read_csv(write_path + "fear_greed_crypto.csv")

# Rename columns
fear_greed.rename(columns = {'date' : 'Date',
                             'fng_value' : 'Fear and Greed Value',
                             'fng_classification' : 'Fear and Greed Classification'}, inplace = True)

# Display results
fear_greed

Unnamed: 0,Date,Fear and Greed Value,Fear and Greed Classification
0,2/01/2018,30,Fear
1,3/01/2018,38,Fear
2,4/01/2018,16,Extreme Fear
3,5/01/2018,56,Greed
4,6/01/2018,24,Extreme Fear
...,...,...,...
1421,26/12/2021,37,Fear
1422,27/12/2021,40,Fear
1423,28/12/2021,41,Fear
1424,29/12/2021,27,Fear


Convert the date column to datetime format.

In [20]:
# Convert date format to datetime format using Pandas
fear_greed['Date'] = pd.to_datetime(fear_greed['Date'], infer_datetime_format = True)

# Assess data types
fear_greed.dtypes

Date                             datetime64[ns]
Fear and Greed Value                      int64
Fear and Greed Classification            object
dtype: object
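Date strings such as `2/01/2018` are ambiguous between day-first and month-first readings, and pandas defaults to month-first when both are valid. Which reading is correct for `fear_greed_crypto.csv` is not stated here, so the sketch below only illustrates the difference on a single made-up string.

```python
import pandas as pd

raw = '2/01/2018'

# Default behaviour: ambiguous strings are read month-first
month_first = pd.to_datetime(raw)

# dayfirst=True reads the same string as 2 January
day_first = pd.to_datetime(raw, dayfirst=True)

# An explicit format removes the ambiguity altogether
explicit = pd.to_datetime(raw, format='%d/%m/%Y')

print(month_first.date(), day_first.date(), explicit.date())
```

Comparing a few parsed dates against a known reference value is a cheap way to confirm which convention a file uses before joining on the date.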

In [21]:
# Join Bitcoin and Ethereum data onto the fear and greed index to produce new dataset
fg_crypto = pd.merge(fear_greed, hist_data, on = 'Date', how = 'inner')

# Rename columns for dataset
fg_crypto = fg_crypto.rename(columns = {'BTC-USD' : 'Bitcoin',
                                        'ETH-USD' : 'Ethereum',
                                        '^GSPC' : 'S&P 500'}).sort_values('Date', ascending = True)

# Apply moving average for fear and greed index to smooth out volatility
fg_crypto['Moving Avg FG Index'] = fg_crypto['Fear and Greed Value'].rolling(window = 30).mean()

# Write to CSV
fg_crypto.to_csv(write_path + 'fear_greed_crypto_clean.csv', index = False)

# Display results
fg_crypto

Unnamed: 0,Date,Fear and Greed Value,Fear and Greed Classification,Bitcoin,Ethereum,S&P 500,Moving Avg FG Index
0,2018-02-01,30,Fear,9170.540039,1036.790039,2821.979980,
11,2018-02-02,15,Extreme Fear,8830.750000,915.784973,2762.129883,
38,2018-02-03,40,Fear,9174.910156,964.018982,,
68,2018-02-04,24,Extreme Fear,8277.009766,834.682007,,
94,2018-02-05,11,Extreme Fear,6955.270020,697.950989,2648.939941,
...,...,...,...,...,...,...,...
1421,2021-12-26,37,Fear,50809.515625,4067.328125,,28.533333
1422,2021-12-27,40,Fear,50640.417969,4037.547607,4791.189941,29.166667
1423,2021-12-28,41,Fear,47588.855469,3800.893066,4786.350098,29.633333
1424,2021-12-29,27,Fear,46444.710938,3628.531738,4793.060059,29.433333
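The empty moving average values in the first rows above follow from how `rolling` works: a window of 30 yields no result until 30 observations are available. The sketch below shows the same effect with a window of 3 on five invented values, along with the optional `min_periods` argument.

```python
import pandas as pd

# Hypothetical index values; a window of 3 keeps the example short
values = pd.Series([30, 15, 40, 24, 11])

# Default: the first window - 1 results are NaN
strict = values.rolling(window=3).mean()

# min_periods=1 averages whatever is available so far
partial = values.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({'value': values, 'strict': strict, 'partial': partial}))
```

The default treats every reported average as a full-window average; `min_periods` trades that guarantee for fewer gaps at the start of the series.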


### Final remarks

With the removal of null values and countries with missing data, the dataset for assessing Covid-19 impacts is ready for use. The following code confirms the available data and the structure of the new dataframe that will be used in the report.

In [22]:
# Print sample size and code completion
print(f"After the data preparation process, there are {(eco_covid['Country'].count())/2} countries with available data.")
print(f"Code executed without error at {dt.datetime.now()}")

# Print to CSV
eco_covid.to_csv(write_path + "eco_covid.csv", index = False)

# Display results
eco_covid

After the data preparation process, there are 62.0 countries with available data.
Code executed without error at 2022-01-09 22:13:04.988204


Unnamed: 0,Country,Latitude,Longitude,Change in GDP (Year on Year),Change in Inflation Rate (Year on Year),Impact
0,Albania,41.00,20.00,5.948067,0.209796,Negative
1,Algeria,28.00,3.00,5.892408,0.463363,Negative
2,Armenia,40.00,45.00,14.954258,-0.232011,Negative
3,Australia,-27.00,133.00,1.825701,-0.763862,Negative
4,Austria,47.33,13.33,8.165388,-0.148985,Negative
...,...,...,...,...,...,...
119,United Kingdom,54.00,-2.00,11.302967,-0.748618,Negative
120,United States,38.00,-97.00,5.676563,-0.578626,Negative
121,Uruguay,-33.00,-56.00,6.173625,1.874418,Negative
122,Vietnam,16.00,106.00,4.021806,0.425111,Negative


In [44]:
# Request the GDP per capita page
gdp_url = requests.get('https://tradingeconomics.com/country-list/gdp-per-capita?continent=world')

# Create soup object (name the parser explicitly to avoid bs4's parser-guessing warning)
soup = bs.BeautifulSoup(gdp_url.text, 'html.parser')

# Create empty lists
country_data = []
current_gdp_list = []
previous_gdp_list = []

# Identify the table with GDP per capita data
table = soup.find('table', {'class' : 'table-heatmap'})

# Obtain the data rows (skip the header row)
gdp_rows = table.findAll('tr')[1:]

# Initiate for loop
for rows in gdp_rows:
    
    # Obtain country name in text format
    country = rows.findAll('td')[0].text
    
    # Obtain recent GDP
    recent_gdp = rows.findAll('td')[1].text
    
    # Obtain previous GDP
    previous_gdp = rows.findAll('td')[2].text
    
    # Append lists with relevant data
    country_data.append(country[:3:])
    current_gdp_list.append(recent_gdp)
    previous_gdp_list.append(previous_gdp)

In [47]:
gdp_data = pd.DataFrame(list(zip(country_data, current_gdp_list, previous_gdp_list)))
gdp_data.columns = ['Country', '2021 GDP', '2020 GDP']
gdp_data

Unnamed: 0,Country,2021 GDP,2020 GDP
0,\n\r\n,101207,104584
1,\n\r\n,85682,88413
2,\n\r\n,83536,81438
3,\n\r\n,78558,75113
4,\n\r\n,75059,76085
...,...,...,...
185,\n\r\n,501,513
186,\n\r\n,456,489
187,\n\r\n,411,418
188,\n\r\n,395,403
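The Country column in the output above holds only whitespace characters, which suggests that the slice `country[:3:]` kept the leading newline characters of each cell rather than the country name. The sketch below compares slicing with `str.strip()` on an invented cell value; the exact padding and the name are assumptions, not scraped data.

```python
# Hypothetical cell text padded with whitespace, similar in shape to the output above
cell_text = '\n\r\n                    Exampleland\n'

# Taking the first three characters returns only the leading whitespace
sliced = cell_text[:3:]

# strip() removes leading and trailing whitespace and keeps the name
stripped = cell_text.strip()

print(repr(sliced), repr(stripped))
```

Appending `country.strip()` in the loop would be one way to populate the column with names, assuming the real cells are padded in the same way.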
