## Web Scraping & Transformations on HTML
### Kaylynn Mosier
### 25 April 2024

In [2]:
# Import required packages
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [3]:
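# Load the saved HTML page and parse it with BeautifulSoup
with open("C:/Users/kayly/OneDrive/Desktop/MSDS/DSC540/Tem Project/World Development Indicators _ The World Bank.html", "r", encoding='utf8') as fd:
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(fd, "html.parser")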

### Parsing Datatable from HTML

In [4]:
# Finds all tables and saves them to a variable
all_tables = soup.find_all('table')
# Find the number of tables
print("Total number of tables: {}".format(len(all_tables)))

Total number of tables: 3


In [5]:
# Finds header table
headers= soup.find("table", {"id":"fixedTable"})
print(type(headers))

<class 'bs4.element.Tag'>


In [6]:
# Finds all tr tags within the header table
levels = headers.find_all('tr')
levels

[<tr class="level0"> <th class="first"></th> <th class="separator" colspan="2"><div class="spacer"><a data-text="Metadata:Forest area" href="javascript:void(0)" onclick="loadWDIMetaData('AG.LND.FRST.K2', 'S', 'Series', 'Forest area', 'Forest area@Mammal species, threatened@Bird species, threatened@Fish species, threatened@Plant species (higher), threatened@Terrestrial protected areas@Marine protected areas@', 'AG.LND.FRST.K2@EN.MAM.THRD.NO@EN.BIR.THRD.NO@EN.FSH.THRD.NO@EN.HPT.THRD.NO@ER.LND.PTLD.ZS@ER.MRN.PTMR.ZS@')">Forest area</a></div></th> <th class="separator" colspan="4"><div class="spacer">Threatened species</div></th> <th class="separator" colspan="1"><div class="spacer"><a data-text="Metadata:Terrestrial protected areas" href="javascript:void(0)" onclick="loadWDIMetaData('ER.LND.PTLD.ZS', 'S', 'Series', 'Terrestrial protected areas', 'Forest area@Mammal species, threatened@Bird species, threatened@Fish species, threatened@Plant species (higher), threatened@Terrestrial protecte

In [7]:
header_levels2 = [[div.get_text().strip() for div in tr.find_all('div')] for tr in levels]
header_levels2

[['Forest area',
  'Threatened species',
  'Terrestrial protected areas',
  'Marine protected areas'],
 ['', 'Mammals', 'Birds', 'Fishes', 'Higher plants', '', ''],
 ['sq. km thousands',
  '',
  '',
  '',
  '',
  '% of total land area',
  '% of territorial waters'],
 ['1990', '2021', '2018', '2018', '2018', '2018', '2022', '2022']]
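
The header levels above have different lengths (4, 7, 7 and 8 entries) because cells such as "Threatened species" span several columns through the `colspan` attribute. Below is a minimal sketch of one way to line the levels up by repeating each label `colspan` times. The HTML snippet and the helper name `expand_header_row` are made up for illustration and are not taken from the World Bank page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified header markup (not copied from the real page)
sample_html = """
<table id="fixedTable">
  <tr><th></th><th colspan="2"><div>Forest area</div></th><th colspan="4"><div>Threatened species</div></th></tr>
  <tr><th></th><th><div>1990</div></th><th><div>2021</div></th><th><div>2018</div></th><th><div>2018</div></th><th><div>2018</div></th><th><div>2018</div></th></tr>
</table>
"""

def expand_header_row(tr):
    """Return the row's labels, repeating each label once per spanned column."""
    labels = []
    for th in tr.find_all("th"):
        span = int(th.get("colspan", 1))  # cells without colspan cover one column
        labels.extend([th.get_text().strip()] * span)
    return labels

sample_soup = BeautifulSoup(sample_html, "html.parser")
expanded = [expand_header_row(tr) for tr in sample_soup.find_all("tr")]
for level in expanded:
    print(len(level), level)
```

With every level expanded to the same length, the levels could be zipped together into one label per column.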

In [8]:
# Finds datatable that contains needed data
data_table = soup.find("table", {"id":"scrollTable"})
print(type(data_table))

<class 'bs4.element.Tag'>


In [9]:
# Finds all tr data in data table
row = data_table.find_all('tr')
row

[<tr> <td class="country"><div class="spacer"><a class="metaLink" data-text="Metadata:Afghanistan" href="javascript:void(0)" onclick="loadMetaData('AFG', 'C' ,'Country',  'Afghanistan')">Afghanistan</a></div></td> <td class=""><div class="spacer">12</div></td> <td class=""><div class="spacer">12</div></td> <td class=""><div class="spacer">11</div></td> <td class=""><div class="spacer">16</div></td> <td class=""><div class="spacer">4</div></td> <td class=""><div class="spacer">5</div></td> <td class=""><div class="spacer">3.6</div></td> <td><div class="spacer">..</div></td> </tr>,
 <tr> <td class="country"><div class="spacer"><a class="metaLink" data-text="Metadata:Albania" href="javascript:void(0)" onclick="loadMetaData('ALB', 'C' ,'Country',  'Albania')">Albania</a></div></td> <td class=""><div class="spacer">8</div></td> <td class=""><div class="spacer">8</div></td> <td class=""><div class="spacer">3</div></td> <td class=""><div class="spacer">8</div></td> <td class=""><div class="sp

In [10]:
# Finds text from each row of the data table
data_rows1 = [[td.get_text().strip() for td in tr.findAll('td')] for tr in row]
data_rows1

[['Afghanistan', '12', '12', '11', '16', '4', '5', '3.6', '..'],
 ['Albania', '8', '8', '3', '8', '44', '4', '18.6', '2.8'],
 ['Algeria', '17', '20', '14', '15', '41', '22', '4.6', '0.1'],
 ['American Samoa', '0', '0', '1', '8', '12', '1', '15.8', '8.7'],
 ['Andorra', '0', '0', '2', '3', '0', '0', '26.9', '..'],
 ['Angola', '793', '661', '18', '32', '53', '34', '7.0', '0.0'],
 ['Antigua and Barbuda', '0', '0', '2', '3', '31', '4', '19.9', '0.3'],
 ['Argentina', '352', '285', '38', '52', '42', '70', '8.7', '11.8'],
 ['Armenia', '3', '3', '9', '14', '3', '74', '24.7', '..'],
 ['Aruba', '0', '0', '2', '2', '24', '2', '18.9', '0.0'],
 ['Australia', '1,339', '1,340', '63', '52', '125', '108', '20.4', '44.3'],
 ['Austria', '38', '39', '3', '13', '11', '17', '29.5', '..'],
 ['Azerbaijan', '10', '11', '8', '17', '14', '44', '10.2', '0.4'],
 ['Bahamas, The', '5', '5', '5', '10', '43', '7', '36.6', '7.9'],
 ['Bahrain', '0', '0', '3', '7', '14', '0', '13.0', '21.1'],
 ['Bangladesh', '19', '19', '

In [11]:
# Create data frame of development indicator data
development_data = pd.DataFrame(data_rows1)
development_data

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,Afghanistan,12,12,11,16,4,5,3.6,..
1,Albania,8,8,3,8,44,4,18.6,2.8
2,Algeria,17,20,14,15,41,22,4.6,0.1
3,American Samoa,0,0,1,8,12,1,15.8,8.7
4,Andorra,0,0,2,3,0,0,26.9,..
...,...,...,...,...,...,...,...,...,...
221,Sub-Saharan Africa,7340,6232,967,993,2064,4862,16.4,..
222,Low income,3524,2971,584,578,963,2291,12.2,..
223,Lower middle income,6286,5824,1255,1436,2408,4886,14.1,1.6
224,Upper middle income,21415,20679,980,1502,2179,6952,15.6,11.0


## Transformation 1- Add column names

I tried repeatedly to use hierarchical indexing for this table because that's how it appeared on the website, but was unsuccessful. I was able to create the multi-index but then had a hard time correctly accessing the needed elements of the dataframe. In the end, I decided it was better to name the columns directly rather than use hierarchical indexing.
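
For reference, here is a minimal sketch of how hierarchical column labels can be built and accessed in pandas. The two-row toy frame and the tuple labels are illustrative assumptions, not the approach used in the rest of this notebook.

```python
import pandas as pd

# Toy data: two rows from the scraped table, limited to four columns
toy = pd.DataFrame(
    [["Albania", "8", "8", "3"], ["Algeria", "17", "20", "14"]]
)

# Each column label is a (top level, sub level) tuple
toy.columns = pd.MultiIndex.from_tuples(
    [
        ("Country", ""),
        ("Forest area", "1990"),
        ("Forest area", "2021"),
        ("Threatened species", "Mammals"),
    ]
)

# Select every column under one top-level label (returns a DataFrame)
forest = toy["Forest area"]

# Select a single column with the full tuple (returns a Series)
forest_2021 = toy[("Forest area", "2021")]

# Flatten back to single-level names when the hierarchy gets in the way
flat = toy.copy()
flat.columns = [" ".join(part for part in col if part) for col in toy.columns]
print(flat.columns.tolist())
```

The flattening step at the end produces names similar to the hand-written ones below, so both routes lead to a single-level header.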

In [12]:
# Add column names
development_data.columns = ['Country', 'Forest Area (sq.km thousands) 1990', 'Forest Area (sq.km thousands) 2021', 'Threatened Mammals', 'Threatened Birds', 'Threatened Fishes', 'Threatened Higher Plants', 'Terresterial protected areas (% of total land area) 2022', 'Marine protected areas (% of total territorial waters) 2022']
development_data

Unnamed: 0,Country,Forest Area (sq.km thousands) 1990,Forest Area (sq.km thousands) 2021,Threatened Mammals,Threatened Birds,Threatened Fishes,Threatened Higher Plants,Terresterial protected areas (% of total land area) 2022,Marine protected areas (% of total territorial waters) 2022
0,Afghanistan,12,12,11,16,4,5,3.6,..
1,Albania,8,8,3,8,44,4,18.6,2.8
2,Algeria,17,20,14,15,41,22,4.6,0.1
3,American Samoa,0,0,1,8,12,1,15.8,8.7
4,Andorra,0,0,2,3,0,0,26.9,..
...,...,...,...,...,...,...,...,...,...
221,Sub-Saharan Africa,7340,6232,967,993,2064,4862,16.4,..
222,Low income,3524,2971,584,578,963,2291,12.2,..
223,Lower middle income,6286,5824,1255,1436,2408,4886,14.1,1.6
224,Upper middle income,21415,20679,980,1502,2179,6952,15.6,11.0


## Transformation 2- Remove income levels from country column

In my other datasets, information is filtered by country. Information on income level is not important to me in this investigation.

In [13]:
# Sort columns alphabetically by name (axis=1 sorts the column labels, not the rows)
development_data = development_data.sort_index(axis=1)
development_data

Unnamed: 0,Country,Forest Area (sq.km thousands) 1990,Forest Area (sq.km thousands) 2021,Marine protected areas (% of total territorial waters) 2022,Terresterial protected areas (% of total land area) 2022,Threatened Birds,Threatened Fishes,Threatened Higher Plants,Threatened Mammals
0,Afghanistan,12,12,..,3.6,16,4,5,11
1,Albania,8,8,2.8,18.6,8,44,4,3
2,Algeria,17,20,0.1,4.6,15,41,22,14
3,American Samoa,0,0,8.7,15.8,8,12,1,1
4,Andorra,0,0,..,26.9,3,0,0,2
...,...,...,...,...,...,...,...,...,...
221,Sub-Saharan Africa,7340,6232,..,16.4,993,2064,4862,967
222,Low income,3524,2971,..,12.2,578,963,2291,584
223,Lower middle income,6286,5824,1.6,14.1,1436,2408,4886,1255
224,Upper middle income,21415,20679,11.0,15.6,1502,2179,6952,980


In [14]:
# Use Country as the index so rows can be dropped by label
development_data = development_data.set_index('Country')
development_data

Country,Forest Area (sq.km thousands) 1990,Forest Area (sq.km thousands) 2021,Marine protected areas (% of total territorial waters) 2022,Terresterial protected areas (% of total land area) 2022,Threatened Birds,Threatened Fishes,Threatened Higher Plants,Threatened Mammals
Afghanistan,12,12,..,3.6,16,4,5,11
Albania,8,8,2.8,18.6,8,44,4,3
Algeria,17,20,0.1,4.6,15,41,22,14
American Samoa,0,0,8.7,15.8,8,12,1,1
Andorra,0,0,..,26.9,3,0,0,2
...,...,...,...,...,...,...,...,...
Sub-Saharan Africa,7340,6232,..,16.4,993,2064,4862,967
Low income,3524,2971,..,12.2,578,963,2291,584
Lower middle income,6286,5824,1.6,14.1,1436,2408,4886,1255
Upper middle income,21415,20679,11.0,15.6,1502,2179,6952,980


In [15]:
# Check shape of dataframe
development_data.shape

(226, 8)

In [16]:
# Drop rows containing income level data
development_data = development_data.drop(index=['Low income', 'Lower middle income', 'Upper middle income', 'High income'])
development_data

Country,Forest Area (sq.km thousands) 1990,Forest Area (sq.km thousands) 2021,Marine protected areas (% of total territorial waters) 2022,Terresterial protected areas (% of total land area) 2022,Threatened Birds,Threatened Fishes,Threatened Higher Plants,Threatened Mammals
Afghanistan,12,12,..,3.6,16,4,5,11
Albania,8,8,2.8,18.6,8,44,4,3
Algeria,17,20,0.1,4.6,15,41,22,14
American Samoa,0,0,8.7,15.8,8,12,1,1
Andorra,0,0,..,26.9,3,0,0,2
...,...,...,...,...,...,...,...,...
Latin America & Caribbean,10700,9296,19.4,24.1,1117,1716,5439,629
Middle East & North Africa,205,230,1.3,5.1,290,672,374,228
North America,6507,6567,12.8,12.3,118,322,536,62
South Asia,826,900,0.5,8.7,253,397,794,252


In [17]:
# Checks shape of dataframe to confirm rows were dropped
development_data.shape

(222, 8)
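
The same filtering can be sketched without moving `Country` into the index, using a boolean mask. The toy frame below is an illustrative assumption built from a few rows shown earlier; only the four income-level labels come from the cell above.

```python
import pandas as pd

# Toy frame with two countries and two income-level aggregates
toy = pd.DataFrame(
    {
        "Country": ["Afghanistan", "Albania", "Low income", "Upper middle income"],
        "Threatened Mammals": ["11", "3", "584", "980"],
    }
)

income_levels = ["Low income", "Lower middle income", "Upper middle income", "High income"]

# Keep only rows whose Country is not one of the income-level labels
countries_only = toy[~toy["Country"].isin(income_levels)].reset_index(drop=True)
print(countries_only)
```

Unlike `drop(index=[...])`, which raises a `KeyError` by default when a label is absent, the mask simply ignores labels that do not appear in the data.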

In [18]:
# Resets index
development_data = development_data.reset_index()
development_data

Unnamed: 0,Country,Forest Area (sq.km thousands) 1990,Forest Area (sq.km thousands) 2021,Marine protected areas (% of total territorial waters) 2022,Terresterial protected areas (% of total land area) 2022,Threatened Birds,Threatened Fishes,Threatened Higher Plants,Threatened Mammals
0,Afghanistan,12,12,..,3.6,16,4,5,11
1,Albania,8,8,2.8,18.6,8,44,4,3
2,Algeria,17,20,0.1,4.6,15,41,22,14
3,American Samoa,0,0,8.7,15.8,8,12,1,1
4,Andorra,0,0,..,26.9,3,0,0,2
...,...,...,...,...,...,...,...,...,...
217,Latin America & Caribbean,10700,9296,19.4,24.1,1117,1716,5439,629
218,Middle East & North Africa,205,230,1.3,5.1,290,672,374,228
219,North America,6507,6567,12.8,12.3,118,322,536,62
220,South Asia,826,900,0.5,8.7,253,397,794,252


## Transformation 3- Fix blank values

In [19]:
# Check number of Na values in each column
development_data.isna().sum()

Country                                                        0
Forest Area (sq.km thousands) 1990                             0
Forest Area (sq.km thousands) 2021                             0
Marine protected areas (% of total territorial waters) 2022    0
Terresterial protected areas (% of total land area) 2022       0
Threatened Birds                                               0
Threatened Fishes                                              0
Threatened Higher Plants                                       0
Threatened Mammals                                             0
dtype: int64

This dataset does not contain NA values; instead, missing values are marked with two periods (..). I need to replace these with NaN so the data can be sorted correctly.

In [20]:
# Replace '..' values with NaN
development_data = development_data.replace('..', np.nan)
development_data

Unnamed: 0,Country,Forest Area (sq.km thousands) 1990,Forest Area (sq.km thousands) 2021,Marine protected areas (% of total territorial waters) 2022,Terresterial protected areas (% of total land area) 2022,Threatened Birds,Threatened Fishes,Threatened Higher Plants,Threatened Mammals
0,Afghanistan,12,12,,3.6,16,4,5,11
1,Albania,8,8,2.8,18.6,8,44,4,3
2,Algeria,17,20,0.1,4.6,15,41,22,14
3,American Samoa,0,0,8.7,15.8,8,12,1,1
4,Andorra,0,0,,26.9,3,0,0,2
...,...,...,...,...,...,...,...,...,...
217,Latin America & Caribbean,10700,9296,19.4,24.1,1117,1716,5439,629
218,Middle East & North Africa,205,230,1.3,5.1,290,672,374,228
219,North America,6507,6567,12.8,12.3,118,322,536,62
220,South Asia,826,900,0.5,8.7,253,397,794,252


In [21]:
# Find number of NaN values after fixing dataframe 
development_data.isna().sum()

Country                                                         0
Forest Area (sq.km thousands) 1990                             11
Forest Area (sq.km thousands) 2021                              3
Marine protected areas (% of total territorial waters) 2022    45
Terresterial protected areas (% of total land area) 2022        4
Threatened Birds                                                2
Threatened Fishes                                               2
Threatened Higher Plants                                        2
Threatened Mammals                                              2
dtype: int64

## Transformation 4- Remove rows that have NaN values

To work with this data effectively, I need to remove rows that contain NaN values.

In [22]:
# Use Country as the index again
development_data = development_data.set_index('Country')

# Drop rows that contain at least one NaN value
development_data = development_data.dropna()
development_data

Country,Forest Area (sq.km thousands) 1990,Forest Area (sq.km thousands) 2021,Marine protected areas (% of total territorial waters) 2022,Terresterial protected areas (% of total land area) 2022,Threatened Birds,Threatened Fishes,Threatened Higher Plants,Threatened Mammals
Albania,8,8,2.8,18.6,8,44,4,3
Algeria,17,20,0.1,4.6,15,41,22,14
American Samoa,0,0,8.7,15.8,8,12,1,1
Angola,793,661,0.0,7.0,32,53,34,18
Antigua and Barbuda,0,0,0.3,19.9,3,31,4,2
...,...,...,...,...,...,...,...,...
Europe & Central Asia,10232,10576,10.7,14.2,678,1239,1306,350
Latin America & Caribbean,10700,9296,19.4,24.1,1117,1716,5439,629
Middle East & North Africa,205,230,1.3,5.1,290,672,374,228
North America,6507,6567,12.8,12.3,118,322,536,62


In [23]:
development_data.isna().sum()

Forest Area (sq.km thousands) 1990                             0
Forest Area (sq.km thousands) 2021                             0
Marine protected areas (% of total territorial waters) 2022    0
Terresterial protected areas (% of total land area) 2022       0
Threatened Birds                                               0
Threatened Fishes                                              0
Threatened Higher Plants                                       0
Threatened Mammals                                             0
dtype: int64
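
Because dropping rows changes which countries survive, it can help to record what was removed and to know the stricter and looser options `dropna` offers. Below is a minimal sketch on a toy frame; the values come from rows shown earlier, while the shortened column names and variable names are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Forest Area 2021": ["12", "8", "0"],
        "Marine protected areas": [np.nan, "2.8", np.nan],
        "Threatened Birds": ["16", "8", "3"],
    },
    index=pd.Index(["Afghanistan", "Albania", "Andorra"], name="Country"),
)

# Rows with at least one NaN: these are what dropna() removes by default (how="any")
dropped = toy[toy.isna().any(axis=1)]
print(dropped.index.tolist())

# Default behaviour: keep only fully populated rows
complete = toy.dropna()

# thresh keeps rows that have at least that many non-NaN values
mostly_complete = toy.dropna(thresh=2)

# subset restricts the NaN check to the named columns
forest_known = toy.dropna(subset=["Forest Area 2021"])
```

Saving `dropped.index` before calling `dropna()` gives an explicit list of the excluded countries that can be reported alongside the results.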

## Transformation 5- Change data types of columns

In [24]:
# Check data types
development_data.dtypes

Forest Area (sq.km thousands) 1990                             object
Forest Area (sq.km thousands) 2021                             object
Marine protected areas (% of total territorial waters) 2022    object
Terresterial protected areas (% of total land area) 2022       object
Threatened Birds                                               object
Threatened Fishes                                              object
Threatened Higher Plants                                       object
Threatened Mammals                                             object
dtype: object

In [25]:
# Loop through each column: cast to string, remove thousands separators, then convert to float
for column in development_data:
    development_data[column] = development_data[column].astype(str)
    development_data[column] = development_data[column].str.replace(',','')
    development_data[column] = development_data[column].astype(float)
development_data

Country,Forest Area (sq.km thousands) 1990,Forest Area (sq.km thousands) 2021,Marine protected areas (% of total territorial waters) 2022,Terresterial protected areas (% of total land area) 2022,Threatened Birds,Threatened Fishes,Threatened Higher Plants,Threatened Mammals
Albania,8.0,8.0,2.8,18.6,8.0,44.0,4.0,3.0
Algeria,17.0,20.0,0.1,4.6,15.0,41.0,22.0,14.0
American Samoa,0.0,0.0,8.7,15.8,8.0,12.0,1.0,1.0
Angola,793.0,661.0,0.0,7.0,32.0,53.0,34.0,18.0
Antigua and Barbuda,0.0,0.0,0.3,19.9,3.0,31.0,4.0,2.0
...,...,...,...,...,...,...,...,...
Europe & Central Asia,10232.0,10576.0,10.7,14.2,678.0,1239.0,1306.0,350.0
Latin America & Caribbean,10700.0,9296.0,19.4,24.1,1117.0,1716.0,5439.0,629.0
Middle East & North Africa,205.0,230.0,1.3,5.1,290.0,672.0,374.0,228.0
North America,6507.0,6567.0,12.8,12.3,118.0,322.0,536.0,62.0
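
An alternative sketch for the same conversion uses `pd.to_numeric`, which with `errors="coerce"` turns any leftover non-numeric string into NaN instead of raising an error. The toy frame, the shortened column names, the helper name `to_number` and the "Hypothetical row" entry are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Forest 1990": ["1,339", "38", ".."],  # ".." stands in for a stray missing-value marker
        "Threatened Mammals": ["63", "3", "2"],
    },
    index=pd.Index(["Australia", "Austria", "Hypothetical row"], name="Country"),
)

def to_number(column):
    """Strip thousands separators, then convert; unparseable strings become NaN."""
    cleaned = column.str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")

# DataFrame.apply calls the helper once per column
numeric = toy.apply(to_number)
print(numeric.dtypes)
```

One difference from the loop above: columns that contain only whole numbers stay integers here, whereas `astype(float)` makes every column a float.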


## Transformation 6- Add column for change in forest area from 1990 to 2021

In [26]:
# Add a column with the difference in forest area (1990 minus 2021); positive values mean forest was lost
development_data['Change in Forest Area 1990 to 2021'] = development_data['Forest Area (sq.km thousands) 1990'] - development_data['Forest Area (sq.km thousands) 2021']
development_data

Country,Forest Area (sq.km thousands) 1990,Forest Area (sq.km thousands) 2021,Marine protected areas (% of total territorial waters) 2022,Terresterial protected areas (% of total land area) 2022,Threatened Birds,Threatened Fishes,Threatened Higher Plants,Threatened Mammals,Change in Forest Area 1990 to 2021
Albania,8.0,8.0,2.8,18.6,8.0,44.0,4.0,3.0,0.0
Algeria,17.0,20.0,0.1,4.6,15.0,41.0,22.0,14.0,-3.0
American Samoa,0.0,0.0,8.7,15.8,8.0,12.0,1.0,1.0,0.0
Angola,793.0,661.0,0.0,7.0,32.0,53.0,34.0,18.0,132.0
Antigua and Barbuda,0.0,0.0,0.3,19.9,3.0,31.0,4.0,2.0,0.0
...,...,...,...,...,...,...,...,...,...
Europe & Central Asia,10232.0,10576.0,10.7,14.2,678.0,1239.0,1306.0,350.0,-344.0
Latin America & Caribbean,10700.0,9296.0,19.4,24.1,1117.0,1716.0,5439.0,629.0,1404.0
Middle East & North Africa,205.0,230.0,1.3,5.1,290.0,672.0,374.0,228.0,-25.0
North America,6507.0,6567.0,12.8,12.3,118.0,322.0,536.0,62.0,-60.0
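
A relative change can complement the absolute difference, since large countries dominate the absolute numbers. Below is a minimal sketch, assuming percent change is defined as (2021 minus 1990) divided by 1990, times 100, so that negative values mean forest loss. This sign convention is the opposite of the subtraction order used in the cell above (1990 minus 2021). The short column names are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"forest_1990": [793.0, 17.0, 0.0], "forest_2021": [661.0, 20.0, 0.0]},
    index=pd.Index(["Angola", "Algeria", "American Samoa"], name="Country"),
)

# Treat a 1990 area of 0 as missing so the division yields NaN rather than inf or 0/0
baseline = toy["forest_1990"].replace(0, np.nan)

toy["pct_change"] = (toy["forest_2021"] - toy["forest_1990"]) / baseline * 100
print(toy["pct_change"].round(1))
```

Rows with a zero baseline come out as NaN, which keeps them from being mistaken for "no change".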


## Writing final table to CSV file

In [27]:
# Writing dataframe to a csv file
development_data.to_csv('DevelopmentData', sep=',', encoding='utf-8', index=True)

In [28]:
# Checking that writing to file worked correctly
csvFile = pd.read_csv("C:/Users/kayly/OneDrive/Desktop/MSDS/DSC540/Tem Project/DevelopmentData")
csvFile

Unnamed: 0,Country,Forest Area (sq.km thousands) 1990,Forest Area (sq.km thousands) 2021,Marine protected areas (% of total territorial waters) 2022,Terresterial protected areas (% of total land area) 2022,Threatened Birds,Threatened Fishes,Threatened Higher Plants,Threatened Mammals,Change in Forest Area 1990 to 2021
0,Albania,8.0,8.0,2.8,18.6,8.0,44.0,4.0,3.0,0.0
1,Algeria,17.0,20.0,0.1,4.6,15.0,41.0,22.0,14.0,-3.0
2,American Samoa,0.0,0.0,8.7,15.8,8.0,12.0,1.0,1.0,0.0
3,Angola,793.0,661.0,0.0,7.0,32.0,53.0,34.0,18.0,132.0
4,Antigua and Barbuda,0.0,0.0,0.3,19.9,3.0,31.0,4.0,2.0,0.0
...,...,...,...,...,...,...,...,...,...,...
166,Europe & Central Asia,10232.0,10576.0,10.7,14.2,678.0,1239.0,1306.0,350.0,-344.0
167,Latin America & Caribbean,10700.0,9296.0,19.4,24.1,1117.0,1716.0,5439.0,629.0,1404.0
168,Middle East & North Africa,205.0,230.0,1.3,5.1,290.0,672.0,374.0,228.0,-25.0
169,North America,6507.0,6567.0,12.8,12.3,118.0,322.0,536.0,62.0,-60.0
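
Beyond inspecting the output by eye, the round trip can be checked programmatically. Below is a minimal sketch that writes a toy frame to an in-memory buffer instead of a file and compares the result with `pd.testing.assert_frame_equal`; the toy values come from rows shown above, and the buffer is an assumption made to keep the example self-contained.

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {"Threatened Birds": [8.0, 15.0], "Threatened Mammals": [3.0, 14.0]},
    index=pd.Index(["Albania", "Algeria"], name="Country"),
)

# Write to an in-memory text buffer, then rewind it for reading
buffer = io.StringIO()
toy.to_csv(buffer, index=True)
buffer.seek(0)

# index_col restores Country as the index instead of a regular column
restored = pd.read_csv(buffer, index_col="Country")

# Raises AssertionError if values, dtypes, or labels differ
pd.testing.assert_frame_equal(toy, restored)
```

Passing `index_col="Country"` when reading restores the index that `to_csv` wrote, so the two frames have the same shape.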


## Ethical Implications 

Overall, this was a fairly clean dataset. There were not many missing values, and those that were missing were clearly labeled. I chose to remove rows with NaN values to make further processing easier down the line. Because I will be combining datasets based on country, it is important that the countries I keep have all necessary data.

I did not search for outliers within this dataset. It would be almost impossible to correctly categorize values as outliers because so much of this data depends on the country the values are gathered from. For example, comparing the forest area of Angola and the United States would show that Angola has a small forest area while the United States has a massive one. This depends on the size of the country. Just because one value is massively bigger than another does not mean it is an outlier.

I chose to be conservative when transforming data and only changed values that obviously needed it. The main change I made was to convert .. to NaN and then drop the affected rows from the dataset. I don't consider this very risky because I didn't alter any values in the dataset. I did end up dropping about 50 countries or regions from the dataset because of missing values. In this case, I see it as more ethically sound to remove these rows from the dataset rather than alter them. In other instances, this could be seen as skewing data.

My data was sourced from the World Bank, which is a well-known and reputable source of data. It often accumulates data from multiple studies for further evaluation by data scientists. I am unsure how the original data was collected, so I cannot be sure there are no ethical breaches there. As far as I know, there are no legal or regulatory implications in my data.