# **Web Scraping with Python using the BeautifulSoup Library**

*   **What to scrape** : The current reported COVID-19 *cases* and *deaths* across the *world*.
*   **Where to scrape** : Wikipedia, link: https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data#covid-19-pandemic-data



## **Importing essential libraries**

The libraries required for this project are:


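1.   **requests** (to request the page from the web)
2.   **bs4** (BeautifulSoup, to parse the HTML and scrape the data)
3.   **lxml** (the HTML parser that BeautifulSoup is asked to use below)
4.   **pandas** (to create and manipulate the DataFrame)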





In [1]:
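# installing the dependencies
!pip install requests
# beautifulsoup4 is the package that provides the bs4 module
!pip install beautifulsoup4
# lxml is the parser passed to BeautifulSoup below
!pip install lxml
!pip install pandas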



In [2]:
# importing the libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

## **Collecting HTML Data of COVID-19 pandemic data from Wikipedia**

In [3]:
# getting html data using requests
html = requests.get('https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data#covid-19-pandemic-data').text
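
A plain requests.get(...).text call does not fail on an HTTP error status, and Wikimedia sites ask automated clients to send a descriptive User-Agent header. The following is a minimal sketch of a more defensive fetch. The helper name 'fetch_html', the User-Agent string, and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests

# Hypothetical identification string; replace with your own project and contact details.
HEADERS = {"User-Agent": "covid-table-tutorial/0.1 (contact: you@example.com)"}

def fetch_html(url, timeout=10):
    """Return the page body as text, raising an exception on HTTP errors."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # raise_for_status() turns 4xx/5xx responses into requests.HTTPError
    response.raise_for_status()
    return response.text
```

It could be called as html = fetch_html('https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data'). The '#covid-19-pandemic-data' fragment can be dropped, since URL fragments are handled by the browser and are not sent to the server.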

## **Scraping the data**

In [4]:
# creating a BeautifulSoup object using lxml parser to scrape the data
scrape = BeautifulSoup(html, 'lxml')

Filtering the required data from the HTML page

In [5]:
# filtering table body from the html text
table = scrape.find_all('table')[0].find('tbody')

In [6]:
# filtering rows in the table from the table body
rows = table.find_all('tr')
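
To make the two filtering steps concrete, here is a small offline sketch that runs them on a hand-written HTML fragment shaped like the Wikipedia table. The fragment and the variable names are illustrative assumptions, and Python's built-in 'html.parser' is used instead of 'lxml' only so that the example needs no extra install.

```python
from bs4 import BeautifulSoup

sample_html = """
<table>
  <tbody>
    <tr><th></th><th>Location</th><th>Cases</th><th>Deaths</th></tr>
    <tr><td></td><th>World</th><td>521,127,460</td><td>6,263,321</td></tr>
    <tr><td colspan="4">footnotes</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
# first <table> in the document, then its <tbody>
sample_table = soup.find_all("table")[0].find("tbody")
# every <tr> inside that body: header row, data row, footnote row
sample_rows = sample_table.find_all("tr")
print(len(sample_rows))
```

Indexing with [0] assumes the data table is the first table on the page; if the page layout changes, that index is the first thing to re-check.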

In [7]:
print(rows)

[<tr class="sticky-row">
<th class="unsortable">
</th>
<th scope="col">Location
</th>
<th scope="col">Cases
</th>
<th scope="col">Deaths
</th></tr>, <tr class="sorttop static-row-header">
<td data-sort-value="World" style="text-align: center;"><img alt="" data-file-height="20" data-file-width="20" decoding="async" height="16" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/83/OOjs_UI_icon_globe.svg/16px-OOjs_UI_icon_globe.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/83/OOjs_UI_icon_globe.svg/24px-OOjs_UI_icon_globe.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/8/83/OOjs_UI_icon_globe.svg/32px-OOjs_UI_icon_globe.svg.png 2x" width="16"/>
</td>
<th scope="row"><a href="/wiki/COVID-19_pandemic" title="COVID-19 pandemic">World</a><sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[a]</a></sup>
</th>
<td data-sort-value="521127460">521,127,460
</td>
<td data-sort-value="6263321">6,263,321
</td></tr>, <tr>
<td data-sort-value="European Unio

Removing the first and last items from the rows list:


1.   Removing the first row, which contains the column titles.
2.   Removing the last row, which holds the table's footnotes rather than data.



In [8]:
# removing the first item
rows.pop(0)
# removing the last item
rows.pop(-1)

<tr class="sortbottom static-row-header" style="text-align: left;">
<td colspan="4" style="width: 0;"><style data-mw-deduplicate="TemplateStyles:r1011085734">.mw-parser-output .reflist{font-size:90%;margin-bottom:0.5em;list-style-type:decimal}.mw-parser-output .reflist .references{font-size:100%;margin-bottom:0;list-style-type:inherit}.mw-parser-output .reflist-columns-2{column-width:30em}.mw-parser-output .reflist-columns-3{column-width:25em}.mw-parser-output .reflist-columns{margin-top:0.3em}.mw-parser-output .reflist-columns ol{margin-top:0}.mw-parser-output .reflist-columns li{page-break-inside:avoid;break-inside:avoid-column}.mw-parser-output .reflist-upper-alpha{list-style-type:upper-alpha}.mw-parser-output .reflist-upper-roman{list-style-type:upper-roman}.mw-parser-output .reflist-lower-alpha{list-style-type:lower-alpha}.mw-parser-output .reflist-lower-greek{list-style-type:lower-greek}.mw-parser-output .reflist-lower-roman{list-style-type:lower-roman}</style><div class="reflist
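
Note that pop() modifies 'rows' in place, so running the cell a second time removes two more rows. As a sketch of an alternative, slicing builds a new list and can be re-run safely. The list contents and the name 'data_rows' below are stand-ins for illustration.

```python
# stand-in for the list of <tr> elements
rows = ["header", "World", "European Union", "footnotes"]

# keep everything except the first and last items; 'rows' itself is unchanged
data_rows = rows[1:-1]
print(data_rows)
```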

Extracting the scraped data into 'data' list

In [9]:
data = [] # list to store the collected data
for row in rows:
  # from each row in the 'rows' list
  # we will extract:
  # 1. Location, 2. Total reported cases, 3. Deaths occurred
  location = row.find('th').text.replace('\n','')
  cases = row.find_all('td')[1].text.replace('\n','')
  deaths = row.find_all('td')[-1].text.replace('\n','')
  # we will store the scraped data into a temporary list called 'record'
  record = [location, cases, deaths]
  # appending each record list we get into 'data' list
  data.append(record)
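
The loop body can be tried on a single hand-written row. This sketch uses a simplified copy of the 'World' row printed earlier (an assumption about the general row shape) with the built-in 'html.parser'. It also shows one optional way to drop the footnote marker '[a]' by removing the sup tag before reading the text; that extra step is not part of the original notebook.

```python
from bs4 import BeautifulSoup

row_html = """
<tr>
<td data-sort-value="World"><img alt=""/>
</td>
<th scope="row"><a href="/wiki/COVID-19_pandemic">World</a><sup class="reference"><a href="#cite_note-2">[a]</a></sup>
</th>
<td data-sort-value="521127460">521,127,460
</td>
<td data-sort-value="6263321">6,263,321
</td></tr>
"""

row = BeautifulSoup(row_html, "html.parser").find("tr")

# same extraction as in the loop above
location = row.find("th").text.replace("\n", "")
cases = row.find_all("td")[1].text.replace("\n", "")
deaths = row.find_all("td")[-1].text.replace("\n", "")
print([location, cases, deaths])

# optional: remove footnote markers such as [a] before reading the name
for sup in row.find("th").find_all("sup"):
    sup.decompose()
clean_location = row.find("th").get_text(strip=True)
print(clean_location)
```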

In [10]:
# printing the data we scraped
print(data)

[['World[a]', '521,127,460', '6,263,321'], ['European Union[b]', '140,148,968', '1,084,893'], ['United States', '82,437,716', '999,570'], ['India', '43,121,599', '524,214'], ['Brazil', '30,682,094', '665,104'], ['France', '29,215,091', '147,337'], ['Germany', '25,729,848', '137,499'], ['United Kingdom', '22,255,282', '177,425'], ['Russia', '17,989,065', '369,961'], ['South Korea', '17,782,061', '23,709'], ['Italy', '17,030,147', '165,182'], ['Turkey', '15,053,168', '98,890'], ['Spain', '12,127,122', '105,444'], ['Vietnam', '10,695,036', '43,065'], ['Argentina', '9,101,319', '128,729'], ['Japan', '8,334,859', '30,036'], ['Netherlands', '8,161,310', '22,399'], ['Iran', '7,227,683', '141,216'], ['Australia', '6,590,066', '7,794'], ['Colombia', '6,095,316', '139,821'], ['Indonesia', '6,050,519', '156,453'], ['Poland', '6,003,297', '116,207'], ['Mexico', '5,745,652', '324,465'], ['Ukraine', '5,040,518', '112,459'], ['Malaysia', '4,475,873', '35,612'], ['Thailand', '4,373,846', '29,472'], ['

## **Creating DataFrame**

In [11]:
# creating a DataFrame named as 'covid_data' using the 'data' list (of lists) we scraped
covid_data = pd.DataFrame(data, columns = ['Location', 'Cases', 'Deaths'])

In [12]:
# first five rows of the DataFrame
covid_data.head()

|   | Location | Cases | Deaths |
|---|---|---|---|
| 0 | World[a] | 521127460 | 6263321 |
| 1 | European Union[b] | 140148968 | 1084893 |
| 2 | United States | 82437716 | 999570 |
| 3 | India | 43121599 | 524214 |
| 4 | Brazil | 30682094 | 665104 |


In [13]:
# last five rows of the DataFrame
covid_data.tail()

|   | Location | Cases | Deaths |
|---|---|---|---|
| 212 | Macau | 82 | — |
| 213 | Vatican City | 29 | 0 |
| 214 | Marshall Islands | 17 | — |
| 215 | Federated States of Micronesia | 7 | 0 |
| 216 | Saint Helena, Ascension and Tristan da Cunha | 4 | — |


## **Data Preprocessing**

In [14]:
# shape of the DataFrame
covid_data.shape

(217, 3)

In [15]:
# Info of the DataFrame
covid_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217 entries, 0 to 216
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Location  217 non-null    object
 1   Cases     217 non-null    object
 2   Deaths    217 non-null    object
dtypes: object(3)
memory usage: 5.2+ KB


In [16]:
# Data types of the columns in the DataFrame
covid_data.dtypes

Location    object
Cases       object
Deaths      object
dtype: object

In [17]:
# checking if there is any null value
covid_data.isnull().sum()

Location    0
Cases       0
Deaths      0
dtype: int64
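
isnull() only detects real missing values (NaN or None), so a text placeholder such as '—' is counted as present. A small sketch on a hand-made frame shows how such placeholders can be counted separately. The three rows are copied from the tail() output above, and the name 'sample' is illustrative.

```python
import pandas as pd

sample = pd.DataFrame(
    [["Macau", "82", "—"], ["Vatican City", "29", "0"], ["Marshall Islands", "17", "—"]],
    columns=["Location", "Cases", "Deaths"],
)

# no NaN/None values, so isnull() reports zero for every column
null_counts = sample.isnull().sum()
# the dash placeholder has to be looked for explicitly
placeholder_count = int((sample["Deaths"] == "—").sum())
print(null_counts["Deaths"], placeholder_count)
```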

In the DataFrame we have created:

1.   There are 217 rows and 3 columns.
2.   There are no null values.

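However, the 'Cases' and 'Deaths' columns have the dtype 'object' (strings) rather than an integer type, and their values are not yet in a numeric format: they contain commas as thousands separators, and some entries are the placeholder '—' instead of a number.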
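
Why the string dtype matters: strings are compared character by character, so sorting or comparing the raw values gives wrong answers. A minimal illustration with two values from the scraped data (Argentina and the United States); the variable names are illustrative.

```python
argentina = "9,101,319"
united_states = "82,437,716"

# as strings, '9' > '8' decides the comparison, so Argentina looks larger
as_text = argentina > united_states
print(as_text)

# as integers (commas removed), the comparison is numerically correct
as_numbers = int(argentina.replace(",", "")) > int(united_states.replace(",", ""))
print(as_numbers)
```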



Converting the values in both the 'Cases' and 'Deaths' columns into the right format.

In [18]:
# creating a function to change the format of the values.
def valToNum(val):
  # Our objective is to
  # 1. Remove the commas and
  # 2. Replace the value with 0 if it is the '—' placeholder.
  val = val.replace(',','')
  val = val.replace('—','0')
  return val
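
Replacing '—' with 0 records "no data" as "zero deaths", which may or may not be what is wanted. As an alternative sketch (not the notebook's approach), pandas can keep such entries as missing values by using to_numeric with errors='coerce' and the nullable 'Int64' dtype. The sample Series below is illustrative.

```python
import pandas as pd

deaths = pd.Series(["6,263,321", "—", "0"])

# strip thousands separators; anything that is still not a number becomes NaN
without_commas = deaths.str.replace(",", "", regex=False)
numeric = pd.to_numeric(without_commas, errors="coerce").astype("Int64")
print(numeric)
```

With this variant, the placeholder rows show up in isnull() and are skipped by sum() instead of being counted as zeros.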

Applying this function to every value in the '*Cases*' and '*Deaths*' columns

In [19]:
# changing the data format of 'Cases' using apply() function in pandas
covid_data['Cases'] = covid_data['Cases'].apply(valToNum)

In [20]:
# changing the data format of 'Deaths' using apply() function in pandas
covid_data['Deaths'] = covid_data['Deaths'].apply(valToNum)

In [21]:
# changing the dtype of both the columns to pandas int64 type
covid_data['Cases'] = covid_data['Cases'].astype('int64')
covid_data['Deaths'] = covid_data['Deaths'].astype('int64')

Checking that the columns now have the right dtype

In [22]:
covid_data.dtypes

Location    object
Cases        int64
Deaths       int64
dtype: object

Previewing the covid_data DataFrame and saving it into a '.csv' file

In [23]:
# top 5 rows in the dataset
covid_data.head()

|   | Location | Cases | Deaths |
|---|---|---|---|
| 0 | World[a] | 521127460 | 6263321 |
| 1 | European Union[b] | 140148968 | 1084893 |
| 2 | United States | 82437716 | 999570 |
| 3 | India | 43121599 | 524214 |
| 4 | Brazil | 30682094 | 665104 |


In [24]:
covid_data.to_csv('scraped_covid_data.csv', index = False)
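
The index = False argument keeps the row index out of the file. As a quick sketch of the difference, the example below writes a one-row frame to in-memory text (io.StringIO is used here only to avoid creating files) and reads it back. The frame and variable names are illustrative.

```python
import io
import pandas as pd

frame = pd.DataFrame({"Location": ["India"], "Cases": [43121599], "Deaths": [524214]})

# with the default index=True, the index is written as an extra unnamed column
with_index = pd.read_csv(io.StringIO(frame.to_csv()))
# with index=False, only the three data columns are written
without_index = pd.read_csv(io.StringIO(frame.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```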

## **Conclusion**

We have successfully scraped **COVID-19 Pandemic Data** from Wikipedia and saved it into a '.csv' file using the *Requests*, *BeautifulSoup*, and *Pandas* libraries.