# HTML PARSER

## AIM

Web scraping collects information from websites programmatically. The aim here is to learn how to scrape and harvest data effectively using the Python libraries requests and Beautiful Soup.

Steps:

- Using the browser’s developer tools, inspect the HTML structure of the target site
- Scrape and parse the data from the web using requests and Beautiful Soup
- Develop a web scraping pipeline from start to finish
- Build a script that fetches data from the web and displays relevant information in your console


## Web scraping

Challenges of web scraping:
- Variety: Every website is different and unique
- Durability: Websites constantly change


### Worldometers Website

Build a web scraper that fetches the real-time Covid-19 data from the Worldometers website.

The web scraper will parse the HTML on the site to pick out the relevant information.

You can scrape any site on the Internet that you can look at, but the difficulty of doing so depends on the site.

### Inspect Data Source 

The first step is to get to know the website that you want to scrape. You need to understand the site structure to extract the information that’s relevant to you.

### Inspect using Developer Tool

Understand the page structure so that you can pick what you want from the HTML response.

Use the browser’s developer tools to explore the structure of a website.

In Chrome on Windows and Linux, you can open them by clicking the top-right menu button (⋮) and selecting More Tools → Developer Tools, or by pressing Ctrl+Shift+I.

Developer tools let you interactively explore the site’s document object model (DOM) to better understand your source.

### Install Libraries

You need the following Python libraries:

    BeautifulSoup4
    Requests
    pandas
    lxml


In [1]:
pip install BeautifulSoup4

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install Requests

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install lxml

Note: you may need to restart the kernel to use updated packages.


### Import libraries 

In [5]:
from bs4 import BeautifulSoup

In [6]:
import requests

### Request the Page

In [7]:
url = 'https://www.worldometers.info/coronavirus/'
# requests.get returns a Response object; the page source is in its .text attribute
html = requests.get(url)
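
The request above assumes the server answers quickly and successfully. A minimal sketch of a more defensive fetch, with a timeout and an HTTP status check, could look like the following. `fetch_html` and its `session` parameter are hypothetical names introduced for illustration, not part of the notebook.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the page body as text, raising an exception on HTTP errors.

    `session` is an optional object with a requests-style .get method
    (for example a requests.Session); by default the requests module is used.
    """
    client = session or requests
    response = client.get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()                  # raise on 4xx/5xx status codes
    return response.text


# Example call (performs a real network request, so it is left commented out):
# page_source = fetch_html('https://www.worldometers.info/coronavirus/')
```

Passing a `requests.Session` as `session` would reuse the connection across several requests.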

In [8]:
print(html.text)


<!DOCTYPE html>
<!--[if IE 8]> <html lang="en" class="ie8"> <![endif]-->
<!--[if IE 9]> <html lang="en" class="ie9"> <![endif]-->
<!--[if !IE]><!-->
<html lang="en">
<!--<![endif]-->



<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">

    <title>COVID - Coronavirus Statistics - Worldometer</title>
    <meta name="description" content="Daily and weekly updated statistics tracking the number of COVID-19 cases, recovered, and deaths. Historical data with cumulative charts, graphs, and updates.">


    
	<!-- Favicon -->
	<link rel="shortcut icon" href="/favicon/favicon.ico" type="image/x-icon">
	<link rel="apple-touch-icon" sizes="57x57" href="/favicon/apple-icon-57x57.png">
	<link rel="apple-touch-icon" sizes="60x60" href="/favicon/apple-icon-60x60.png">
	<link rel="apple-touch-icon" sizes="72x72" href="/favicon/apple-icon-72x72.png">
	<link rel="apple-touch-icon" sizes="

In [9]:
soup = BeautifulSoup(html.text,'lxml')
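
The 'lxml' parser requires the lxml package to be installed. As a rough sketch, one could fall back to Python's built-in 'html.parser' when lxml is unavailable. `make_soup` is a hypothetical helper name used only for this example.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if available, otherwise with the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed
        return BeautifulSoup(markup, "html.parser")


sample = "<html><head><title>COVID - Coronavirus Statistics - Worldometer</title></head></html>"
print(make_soup(sample).title.text)
```

The two parsers can build slightly different trees for malformed HTML, so results may differ on messy pages.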

In [10]:
print(soup.prettify())

<!DOCTYPE html>
<!--[if IE 8]> <html lang="en" class="ie8"> <![endif]-->
<!--[if IE 9]> <html lang="en" class="ie9"> <![endif]-->
<!--[if !IE]><!-->
<html lang="en">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   COVID - Coronavirus Statistics - Worldometer
  </title>
  <meta content="Daily and weekly updated statistics tracking the number of COVID-19 cases, recovered, and deaths. Historical data with cumulative charts, graphs, and updates." name="description"/>
  <!-- Favicon -->
  <link href="/favicon/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
  <link href="/favicon/apple-icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/>
  <link href="/favicon/apple-icon-60x60.png" rel="apple-touch-icon" sizes="60x60"/>
  <link href="/favicon/apple-icon-72x72.png" rel="apple-touch-icon" sizes="72x72"/>
  <link href="/favicon/apple-icon-76x

## Inspect H1 Elements

In [11]:
soup.h1

<h1>Coronavirus Cases:</h1>

In [12]:
soup.title

<title>COVID - Coronavirus Statistics - Worldometer</title>

In [13]:
soup.title.text

'COVID - Coronavirus Statistics - Worldometer'

In [14]:
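# Each headline counter sits in an element with id="maincounter-wrap"
header_h1 = soup.find_all(id="maincounter-wrap")
for head_h1 in header_h1:
    print(head_h1.h1.text)                      # counter label
    print(head_h1.div.span.text, end="\n"*2)    # counter value, then a blank line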

Coronavirus Cases:
678,525,540        

Deaths:
6,790,241

Recovered:
651,065,750
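
The counters above are printed as text. A small sketch of collecting them into a dictionary of integers is shown below. It runs on a made-up HTML snippet that only mimics the label/value nesting used in the loop above, and `read_counters` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up snippet that imitates the nesting used above (h1 label, div > span value)
SAMPLE_HTML = """
<div id="maincounter-wrap"><h1>Coronavirus Cases:</h1><div><span>678,525,540 </span></div></div>
<div id="maincounter-wrap"><h1>Deaths:</h1><div><span>6,790,241</span></div></div>
"""


def read_counters(soup):
    """Map each counter label to its value as an int."""
    counters = {}
    for block in soup.find_all(id="maincounter-wrap"):
        label = block.h1.get_text(strip=True).rstrip(":")
        digits = block.div.span.get_text(strip=True).replace(",", "")
        counters[label] = int(digits)
    return counters


counters = read_counters(BeautifulSoup(SAMPLE_HTML, "html.parser"))
print(counters)
```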



In [15]:
scrp_table = soup.find('table', id='main_table_countries_today')
scrp_table

<table class="table table-bordered table-hover main_table_countries" id="main_table_countries_today" style="width:100%;margin-top: 0px !important;display:none;">
<thead>
<tr>
<th width="1%">#</th>
<th width="100">Country,<br/>Other</th>
<th width="20">Total<br/>Cases</th>
<th width="30">New<br/>Cases</th>
<th width="30">Total<br/>Deaths</th>
<th width="30">New<br/>Deaths</th>
<th width="30">Total<br/>Recovered</th>
<th width="30">New<br/>Recovered</th>
<th width="30">Active<br/>Cases</th>
<th width="30">Serious,<br/>Critical</th>
<th width="30">Tot Cases/<br/>1M pop</th>
<th width="30">Deaths/<br/>1M pop</th>
<th width="30">Total<br/>Tests</th>
<th width="30">Tests/<br/>
<nobr>1M pop</nobr>
</th>
<th width="30">Population</th>
<th style="display:none" width="30">Continent</th>
<th width="30">1 Case<br/>every X ppl</th><th width="30">1 Death<br/>every X ppl</th><th width="30">1 Test<br/>every X ppl</th>
<th width="30">New Cases/1M pop</th>
<th width="30">New Deaths/1M pop</th>
<th width
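
Because websites change, a lookup by id can stop matching, in which case `soup.find` returns `None` and later attribute access fails with a less helpful error. A minimal sketch of failing early with a clearer message follows. `find_table` is a hypothetical helper, and the sample markup is made up.

```python
from bs4 import BeautifulSoup


def find_table(soup, table_id):
    """Return the <table> with the given id, or raise if it is missing."""
    table = soup.find("table", id=table_id)
    if table is None:
        raise LookupError(
            f"no <table> with id={table_id!r}; the page layout may have changed"
        )
    return table


# Made-up one-cell table for demonstration
demo_soup = BeautifulSoup(
    '<table id="main_table_countries_today"><tr><td>USA</td></tr></table>',
    "html.parser",
)
print(find_table(demo_soup, "main_table_countries_today").td.text)
```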

### Create Column List

In [16]:
# Collect the text of every <th> cell as a column name
headers = []
for i in scrp_table.find_all('th'):
    title = i.text
    headers.append(title)

In [17]:
headers

['#',
 'Country,Other',
 'TotalCases',
 'NewCases',
 'TotalDeaths',
 'NewDeaths',
 'TotalRecovered',
 'NewRecovered',
 'ActiveCases',
 'Serious,Critical',
 'Tot\xa0Cases/1M pop',
 'Deaths/1M pop',
 'TotalTests',
 'Tests/\n1M pop\n',
 'Population',
 'Continent',
 '1 Caseevery X ppl',
 '1 Deathevery X ppl',
 '1 Testevery X ppl',
 'New Cases/1M pop',
 'New Deaths/1M pop',
 'Active Cases/1M pop']

In [18]:
# Replace the non-breaking space (\xa0) in column 10 with a regular space
headers[10] = 'Tot Cases/1M pop'

# Remove the stray newlines from column 13
headers[13] = 'Tests/1M pop'


In [19]:
headers

['#',
 'Country,Other',
 'TotalCases',
 'NewCases',
 'TotalDeaths',
 'NewDeaths',
 'TotalRecovered',
 'NewRecovered',
 'ActiveCases',
 'Serious,Critical',
 'Tot Cases/1M pop',
 'Deaths/1M pop',
 'TotalTests',
 'Tests/1M pop',
 'Population',
 'Continent',
 '1 Caseevery X ppl',
 '1 Deathevery X ppl',
 '1 Testevery X ppl',
 'New Cases/1M pop',
 'New Deaths/1M pop',
 'Active Cases/1M pop']
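
Fixing individual headers by index depends on the column order staying the same. As an alternative sketch, the header text can be normalized for every column at once: `get_text(separator=" ", strip=True)` puts a space where a `<br/>` splits the text, and a split/join collapses non-breaking spaces and newlines. The sample markup below is a made-up excerpt modeled on the header cells shown earlier, and `clean_headers` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up excerpt modeled on the header row shown earlier
SAMPLE_HEAD = """
<table><thead><tr>
<th>#</th>
<th>Total<br/>Cases</th>
<th>Tot&nbsp;Cases/<br/>1M pop</th>
<th>Tests/<br/>
<nobr>1M pop</nobr>
</th>
</tr></thead></table>
"""


def clean_headers(table):
    """Return header names with line breaks and odd whitespace normalized."""
    names = []
    for th in table.find_all("th"):
        text = th.get_text(separator=" ", strip=True)
        # str.split() with no arguments also splits on \xa0 and newlines
        names.append(" ".join(text.split()))
    return names


head_soup = BeautifulSoup(SAMPLE_HEAD, "html.parser")
clean = clean_headers(head_soup.table)
print(clean)
```

Note that this yields names such as 'Total Cases' rather than 'TotalCases', so later code that refers to columns by name would need to use the spaced form.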

### Create Dataframe and Fill 

In [20]:
import pandas as pd

In [21]:
scrapdata = pd.DataFrame(columns = headers)

In [22]:
# Skip the header row, then append each table row to the DataFrame
for tr in scrp_table.find_all('tr')[1:]:
    row_data = tr.find_all('td')
    row = [td.text for td in row_data]
    length = len(scrapdata)
    scrapdata.loc[length] = row

In [23]:
scrapdata

Unnamed: 0,#,"Country,Other",TotalCases,NewCases,TotalDeaths,NewDeaths,TotalRecovered,NewRecovered,ActiveCases,"Serious,Critical",...,TotalTests,Tests/1M pop,Population,Continent,1 Caseevery X ppl,1 Deathevery X ppl,1 Testevery X ppl,New Cases/1M pop,New Deaths/1M pop,Active Cases/1M pop
0,,\nNorth America\n,124197357,,1607360,,119343703,+2063,3246294,8032,...,,,,North America,\n,,,,,
1,,\nAsia\n,213887922,+10658,1534611,+12,198418561,+19517,13934750,15585,...,,,,Asia,\n,,,,,
2,,\nEurope\n,245776136,,2014485,,241313786,+4373,2447865,6443,...,,,,Europe,\n,,,,,
3,,\nSouth America\n,67915987,+169,1349265,+7,66106343,+813,460379,10212,...,,,,South America,\n,,,,,
4,,\nOceania\n,13959634,,25943,,13814460,,119231,71,...,,,,Australia/Oceania,\n,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
242,,Total:,67915987,+169,1349265,+7,66106343,+813,460379,10212,...,,,,South America,,,,,,
243,,Total:,13959634,,25943,,13814460,,119231,71,...,,,,Australia/Oceania,,,,,,
244,,Total:,12787783,,258562,,12068191,,461030,547,...,,,,Africa,,,,,,
245,,Total:,721,,15,,706,,0,0,...,,,,,,,,,,
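
Appending to a DataFrame one row at a time works, but it gets slow for large tables. A sketch of an alternative is to collect the rows in a plain list and build the DataFrame once at the end. The miniature table below is made up for the example (the real table has many more columns), and `table_to_frame` is a hypothetical helper name.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up miniature table; the real one has many more columns and rows
SAMPLE_TABLE = """
<table id="main_table_countries_today">
  <thead><tr><th>Country</th><th>TotalCases</th><th>Continent</th></tr></thead>
  <tbody>
    <tr><td>USA</td><td>104958987</td><td>North America</td></tr>
    <tr><td>India</td><td>44684775</td><td>Asia</td></tr>
  </tbody>
</table>
"""


def table_to_frame(table):
    """Build a DataFrame of strings from an HTML table in a single step."""
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == len(headers):  # skips the header row, which has no <td>
            rows.append(cells)
    return pd.DataFrame(rows, columns=headers)


table_soup = BeautifulSoup(SAMPLE_TABLE, "html.parser")
frame = table_to_frame(table_soup.find("table", id="main_table_countries_today"))
print(frame)
```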


In [24]:
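# Drop the first 8 rows, which hold regional summaries rather than countries
scrapdata.drop(scrapdata.index[0:8], inplace=True)
# Drop trailing summary rows; note that [228:236] leaves the last two 'Total:' rows in place
scrapdata.drop(scrapdata.index[228:236], inplace=True)
scrapdata.reset_index(inplace=True, drop=True)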

In [25]:
# Drop the '#' column
scrapdata.drop('#', inplace=True, axis=1)

In [26]:
scrapdata

Unnamed: 0,"Country,Other",TotalCases,NewCases,TotalDeaths,NewDeaths,TotalRecovered,NewRecovered,ActiveCases,"Serious,Critical",Tot Cases/1M pop,...,TotalTests,Tests/1M pop,Population,Continent,1 Caseevery X ppl,1 Deathevery X ppl,1 Testevery X ppl,New Cases/1M pop,New Deaths/1M pop,Active Cases/1M pop
0,USA,104958987,,1142370,,102261450,,1555167,2866,313493,...,1162998821,3473657,334805269,North America,3,293,0,,,4645
1,India,44684775,,530757,,44152151,,1867,,31767,...,917510608,652275,1406631776,Asia,31,2650,2,,,1
2,France,39582057,,164712,,39347955,,69390,869,603527,...,271490188,4139547,65584518,Europe,2,398,0,,,1058
3,Germany,38002611,,167289,,37582900,+3400,252422,,453040,...,122332384,1458359,83883596,Europe,2,501,1,,,3009
4,Brazil,36987682,,698047,,36106527,,183108,,171753,...,63776166,296146,215353593,South America,6,309,3,,,850
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
226,Vatican City,29,,,,29,,0,,36295,...,,,799,Europe,28,,,,,
227,Western Sahara,10,,1,,9,,0,,16,...,,,626161,Africa,62616,626161,,,,
228,Total:,12787783,,258562,,12068191,,461030,547,,...,,,,Africa,,,,,,
229,Total:,721,,15,,706,,0,0,,...,,,,,,,,,,
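
The last two rows of this output are still 'Total:' summary rows, because dropping by position depends on the exact number of rows on the page. One possible alternative, sketched here on a small made-up frame, is to filter rows by their content, and to convert strings such as '+3400' to numbers. `drop_total_rows` and `to_number` are hypothetical helper names.

```python
import pandas as pd

# Made-up sample mimicking a few scraped rows (all values are strings)
sample = pd.DataFrame({
    "Country,Other": ["USA", "Germany", "Total:", "Total:"],
    "NewRecovered": ["", "+3400", "", ""],
    "Continent": ["North America", "Europe", "Africa", ""],
})


def drop_total_rows(frame, country_col="Country,Other"):
    """Keep only rows whose country cell is not the 'Total:' label."""
    is_total = frame[country_col].str.strip() == "Total:"
    return frame[~is_total].reset_index(drop=True)


def to_number(series):
    """Strip '+' and ',' and convert to numbers; blanks become NaN."""
    cleaned = series.str.replace("+", "", regex=False).str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")


countries = drop_total_rows(sample)
countries["NewRecovered"] = to_number(countries["NewRecovered"])
print(countries)
```

Filtering by label keeps working if the number of countries on the page changes, although it still assumes the summary rows are labeled 'Total:'.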


## Export Dataframe to CSV

In [28]:
scrapdata.to_csv('covid_data.csv', index=False)  # Export to CSV

csv_data = pd.read_csv('covid_data.csv')  # Read the CSV back to check the export

In [29]:
csv_data

Unnamed: 0,"Country,Other",TotalCases,NewCases,TotalDeaths,NewDeaths,TotalRecovered,NewRecovered,ActiveCases,"Serious,Critical",Tot Cases/1M pop,...,TotalTests,Tests/1M pop,Population,Continent,1 Caseevery X ppl,1 Deathevery X ppl,1 Testevery X ppl,New Cases/1M pop,New Deaths/1M pop,Active Cases/1M pop
0,USA,104958987,,1142370,,102261450,,1555167,2866,313493,...,1162998821,3473657,334805269,North America,3,293,0.0,,,4645
1,India,44684775,,530757,,44152151,,1867,,31767,...,917510608,652275,1406631776,Asia,31,2650,2.0,,,1
2,France,39582057,,164712,,39347955,,69390,869,603527,...,271490188,4139547,65584518,Europe,2,398,0.0,,,1058
3,Germany,38002611,,167289,,37582900,+3400,252422,,453040,...,122332384,1458359,83883596,Europe,2,501,1.0,,,3009
4,Brazil,36987682,,698047,,36106527,,183108,,171753,...,63776166,296146,215353593,South America,6,309,3.0,,,850
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
226,Vatican City,29,,,,29,,0,,36295,...,,,799,Europe,28,,,,,
227,Western Sahara,10,,1,,9,,0,,16,...,,,626161,Africa,62616,626161,,,,
228,Total:,12787783,,258562,,12068191,,461030,547,,...,,,,Africa,,,,,,
229,Total:,721,,15,,706,,0,0,,...,,,,,,,,,,
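
In the output above, the '1 Testevery X ppl' column comes back as 0.0, 2.0, and so on. By default `read_csv` infers a numeric type, and a column with empty cells is read as float so that the blanks can be stored as NaN. A small sketch of the round trip, using an in-memory buffer and made-up two-row data, shows the default behavior next to reading everything back as text.

```python
import io

import pandas as pd

# Made-up two-row frame of strings, modeled on the scraped data
original = pd.DataFrame({
    "Country,Other": ["USA", "Vatican City"],
    "1 Testevery X ppl": ["0", ""],
})

csv_text = original.to_csv(index=False)  # write to a string instead of a file

# Default read: types are inferred, blanks become NaN, so the column is float
default_read = pd.read_csv(io.StringIO(csv_text))

# Read everything as text and keep blanks as empty strings
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)

print(default_read["1 Testevery X ppl"].tolist())
print(as_text["1 Testevery X ppl"].tolist())
```

Which variant is preferable depends on whether the next step needs numbers or the original strings.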
