<a href="https://colab.research.google.com/github/Rachita-G/Python_Practice/blob/main/Web_Scraping/01_Covid_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is web scraping?
In simple terms, web scraping is the process of extracting desired data from a web
resource. It involves several steps: interacting with the web resource, choosing the
appropriate data, obtaining information from that data, and converting it to the
desired format. Of these steps, this notebook focuses on pulling the required data
out of semistructured data.

# Dos and don'ts of web scraping
Scraping a web resource is not always welcomed by its owners. Some companies
restrict the use of bots on their sites. It is good etiquette to follow certain rules
while scraping. The following are the dos and don'ts of web scraping:
* Do refer to the terms and conditions: Before we begin scraping, we should visit
the website's terms and conditions page and find out whether it prohibits
scraping. If it does, it's better to back off.
* Don't bombard the server with a lot of requests: Every website runs on a
server that can handle only a limited workload. Sending a large number of
requests in a short span of time is rude and may bring the server down. Wait
for some time between requests instead. Some sites limit the maximum number
of requests processed per minute and will ban the sender's IP address if this
limit is exceeded.
* Do track the web resource from time to time: A website doesn't always stay
the same. Sites change over time according to their usability and the
requirements of their users. If the website has been altered, our scraping
code may fail. Remember to track the changes made to the site and modify the
scraper script accordingly.
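
The first two rules can be partly automated. The following is a minimal sketch, not a complete crawler: it checks URLs against robots.txt rules and pauses between requests. The robots.txt text, the example URLs, and the helper names `build_robot_rules` and `polite_fetch` are assumptions made for illustration; the `fetch` and `sleep` arguments are injected so the sketch runs without network access.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def build_robot_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def polite_fetch(urls, fetch, rules, user_agent="*", default_delay=1.0, sleep=time.sleep):
    """Fetch only the allowed URLs, pausing between consecutive requests."""
    # Use the site's Crawl-delay if it declares one, else a default pause.
    delay = rules.crawl_delay(user_agent) or default_delay
    results = {}
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # robots.txt disallows this path, so skip it
        if results:
            sleep(delay)  # wait before every request after the first one
        results[url] = fetch(url)
    return results

rules = build_robot_rules(ROBOTS_TXT)
pages = polite_fetch(
    ["https://example.com/a", "https://example.com/private/x", "https://example.com/b"],
    fetch=lambda u: "body of " + u,  # stand-in for a real HTTP call
    rules=rules,
    sleep=lambda seconds: None,      # no real waiting in this demo
)
print(list(pages))
```

With `requests`, the `fetch` argument could be something like `lambda u: requests.get(u, timeout=10).text`. Note that robots.txt does not replace reading the terms and conditions.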

Web scraping is most useful when we need information from a web resource that
does not provide an API.

Before we begin, let's get to know some important concepts that will help us
reach our goal. First, take a look at the response to a request, which
will introduce us to a particular type of data:

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
url="https://www.worldometers.info/coronavirus/"
response=requests.get(url)

In [None]:
response

<Response [200]>

In [None]:
response.request.headers

{'User-Agent': 'python-requests/2.23.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
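
The headers above show that `requests` identifies itself with a default `User-Agent`. As a hedged sketch, a scraper could announce itself more clearly by setting its own header on a session. The User-Agent string below is a made-up placeholder, and the request is only prepared, not sent, so nothing touches the network.

```python
import requests

url = "https://www.worldometers.info/coronavirus/"

session = requests.Session()
# Hypothetical identification string; replace it with your own project and contact.
session.headers.update({"User-Agent": "covid-notebook-example/0.1 (contact: you@example.com)"})

# Build the request without sending it, to inspect what would go over the wire.
prepared = session.prepare_request(requests.Request("GET", url))
print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])

# To actually send it, one option is:
# response = session.send(prepared, timeout=10)
# response.raise_for_status()
```

Passing a `timeout` and calling `raise_for_status()` keeps a hung connection or an error page from silently flowing into the parsing steps.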

# Types of data
In most cases, we deal with three types of data when working with web sources.
They are as follows:

1. **Structured data**

Structured data is data that exists in an organized form. Normally, it has a
predefined format and is machine readable. Because a specific format is imposed
on it, each piece of structured data has a defined relation to the others. This
makes it easier and faster to access different parts of the data, and it helps
reduce redundancy when dealing with huge amounts of data.
Relational databases typically hold structured data, and SQL can be used to access
it. Census records are an example of structured data: they contain information
about the date of birth, gender, place, income, and so on, of the people of a country.

2. **Unstructured data**

In contrast to structured data, unstructured data either lacks a standard
format or stays unorganized even though a specific format is imposed on it. For
this reason, working with different parts of the data is difficult and tedious.
Techniques such as text analytics, Natural Language Processing (NLP), and data
mining are used to handle unstructured data. Images, scientific data, and
text-heavy content (such as newspapers, health records, and so on) fall under
the unstructured data type.

3. **Semistructured data**

Semistructured data is data that follows an irregular pattern or has a structure
that changes rapidly. It can be self-describing: it uses tags and other markers
to establish semantic relationships among its elements. Semistructured data may
contain information gathered from different sources. Scraping is the technique
used to extract information from this type of data. The information available on
the Web is a typical example of semistructured data.
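
To make the contrast concrete, here is a small sketch using only the standard library. The CSV text and the HTML fragment are illustrative stand-ins built from figures shown elsewhere in this notebook; the class name `CellCollector` is an assumption made for this example.

```python
import csv
import io
from html.parser import HTMLParser

# Structured: fixed columns, so a value is addressed by its column name.
csv_text = "country,total_cases\nUSA,2218902\nBrazil,934769\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["total_cases"])

# Semistructured: tags describe the data, but nesting and attributes vary,
# so the values have to be dug out of the markup.
html_text = (
    '<table><tr><td><a class="mt_a" href="country/us/">USA</a></td>'
    "<td>2,218,902</td></tr></table>"
)

class CellCollector(HTMLParser):
    """Collect the text found inside each <td> cell."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data

collector = CellCollector()
collector.feed(html_text)
print(collector.cells)
```

Writing a parser by hand like this gets tedious quickly, which is the gap a library such as BeautifulSoup fills.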



# What is BeautifulSoup?
The BeautifulSoup library is a simple yet powerful web scraping library. Given an
HTML or XML document, it can extract the desired data. It provides methods for
searching and navigating the parsed document, which make web scraping tasks
much easier.

In [None]:
from bs4 import BeautifulSoup
html=response.text
html

'\n<!DOCTYPE html>\n<!--[if IE 8]> <html lang="en" class="ie8"> <![endif]-->\n<!--[if IE 9]> <html lang="en" class="ie9"> <![endif]-->\n<!--[if !IE]><!-->\n<html lang="en">\n<!--<![endif]-->\n<head>\n<meta charset="utf-8">\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<title>Coronavirus Update (Live): 8,325,920 Cases and 448,002 Deaths from COVID-19 Virus Pandemic - Worldometer</title>\n<meta name="description" content="Live statistics and coronavirus news tracking the number of confirmed cases, recovered patients, tests, and death toll due to the COVID-19 coronavirus from Wuhan, China. Coronavirus counter with new cases, deaths, and number of tests per 1 Million population. Historical data and info. Daily charts, graphs, news and updates">\n\n<link rel="shortcut icon" href="/favicon/favicon.ico" type="image/x-icon">\n<link rel="apple-touch-icon" sizes="57x57" href="/favicon/apple-icon-57x57.png">\n<link re

In [None]:
soup=BeautifulSoup(html,'lxml')
soup

<!DOCTYPE html>
<!--[if IE 8]> <html lang="en" class="ie8"> <![endif]--><!--[if IE 9]> <html lang="en" class="ie9"> <![endif]--><!--[if !IE]><!--><html lang="en">
<!--<![endif]-->
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>Coronavirus Update (Live): 8,325,920 Cases and 448,002 Deaths from COVID-19 Virus Pandemic - Worldometer</title>
<meta content="Live statistics and coronavirus news tracking the number of confirmed cases, recovered patients, tests, and death toll due to the COVID-19 coronavirus from Wuhan, China. Coronavirus counter with new cases, deaths, and number of tests per 1 Million population. Historical data and info. Daily charts, graphs, news and updates" name="description"/>
<link href="/favicon/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<link href="/favicon/apple-icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/>
<link href="/favicon/app

In [None]:
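# A small standalone example: parse a single table cell.
# Naming the parser explicitly avoids the "no parser was explicitly specified" warning.
soupy=BeautifulSoup('<td style="font-weight: bold; font-size:15px; text-align:left;"><a class="mt_a" href="country/us/">USA</a></td>','lxml')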
soupy

<html><body><td style="font-weight: bold; font-size:15px; text-align:left;"><a class="mt_a" href="country/us/">USA</a></td></body></html>

In [None]:
print(soupy)
print(soupy.html.body)
tag=soupy.a
print(tag)
print(type(tag))
print(tag.name)
print(tag['class'])
print(tag.string)

<html><body><td style="font-weight: bold; font-size:15px; text-align:left;"><a class="mt_a" href="country/us/">USA</a></td></body></html>
<body><td style="font-weight: bold; font-size:15px; text-align:left;"><a class="mt_a" href="country/us/">USA</a></td></body>
<a class="mt_a" href="country/us/">USA</a>
<class 'bs4.element.Tag'>
a
['mt_a']
USA
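
Beyond `name`, `['class']`, and `string`, a tag offers a few safer accessors. The sketch below reuses the same table cell but parses it with the built-in `html.parser` (an assumption made so the example has no extra dependency); the variable names are illustrative.

```python
from bs4 import BeautifulSoup

snippet = '<td style="font-weight: bold;"><a class="mt_a" href="country/us/">USA</a></td>'
cell = BeautifulSoup(snippet, "html.parser")
link = cell.a

href = link.get("href")              # attribute lookup that does not raise
missing = link.get("title")          # absent attribute gives None, not a KeyError
classes = link.get("class", [])      # class is multi-valued, so it comes back as a list
text = link.get_text(strip=True)     # text content with surrounding whitespace removed
parent_name = link.parent.name       # navigate up to the enclosing tag

print(href, missing, classes, text, parent_name)
```

Using `.get()` instead of `tag['attr']` is a small defensive habit that helps when a page's markup changes.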


**Back to the scraped page**

In the preceding examples, the response content is semistructured data
represented with HTML tags, which lets us access the different sections of the
web page individually. Now, let's locate the table that holds the per-country
statistics and collect its rows.

In [None]:
get_table=soup.find("table",id="main_table_countries_today")
containers=get_table.find_all("tr")
containers

[<tr>
 <th width="1%">#</th>
 <th width="100">Country,<br/>Other</th>
 <th width="20">Total<br/>Cases</th>
 <th width="30">New<br/>Cases</th>
 <th width="30">Total<br/>Deaths</th>
 <th width="30">New<br/>Deaths</th>
 <th width="30">Total<br/>Recovered</th>
 <th width="30">New<br/>Recovered</th>
 <th width="30">Active<br/>Cases</th>
 <th width="30">Serious,<br/>Critical</th>
 <th width="30">Tot Cases/<br/>1M pop</th>
 <th width="30">Deaths/<br/>1M pop</th>
 <th width="30">Total<br/>Tests</th>
 <th width="30">Tests/<br/>
 <nobr>1M pop</nobr>
 </th>
 <th width="30">Population</th>
 <th style="display:none" width="30">Continent</th>
 <th width="30">1 Case<br/>every X ppl</th><th width="30">1 Death<br/>every X ppl</th><th width="30">1 Test<br/>every X ppl</th>
 </tr>,
 <tr class="total_row_world row_continent" data-continent="North America" style="display: none">
 <td></td>
 <td style="text-align:left;">
 <nobr>North America</nobr>
 </td>
 <td>2,556,591</td>
 <td>+16,978</td>
 <td>148,198</

In [None]:
print(len(containers))

232


In [None]:
containers[1]

<tr class="total_row_world row_continent" data-continent="North America" style="display: none">
<td></td>
<td style="text-align:left;">
<nobr>North America</nobr>
</td>
<td>2,556,591</td>
<td>+16,978</td>
<td>148,198</td>
<td>+1,064</td>
<td>1,120,569</td>
<td>+4,464</td>
<td>1,287,824</td>
<td>19,465</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td data-continent="North America" style="display:none;">North America</td>
<td>
</td>
<td></td>
<td></td>
</tr>

In [None]:
print(containers[2].prettify())

<tr class="total_row_world row_continent" data-continent="South America" style="display: none">
 <td>
 </td>
 <td style="text-align:left;">
  <nobr>
   South America
  </nobr>
 </td>
 <td>
  1,520,566
 </td>
 <td>
  +6,878
 </td>
 <td>
  63,659
 </td>
 <td>
  +397
 </td>
 <td>
  820,273
 </td>
 <td>
  +13,294
 </td>
 <td>
  636,634
 </td>
 <td>
  12,169
 </td>
 <td>
 </td>
 <td>
 </td>
 <td>
 </td>
 <td>
 </td>
 <td>
 </td>
 <td data-continent="South America" style="display:none;">
  South America
 </td>
 <td>
 </td>
 <td>
 </td>
 <td>
 </td>
</tr>



In [None]:
container=containers[9]
container

<tr style="">
<td style="font-size:12px;color: grey;text-align:center;vertical-align:middle;">1</td>
<td style="font-weight: bold; font-size:15px; text-align:left;"><a class="mt_a" href="country/us/">USA</a></td>
<td style="font-weight: bold; text-align:right">2,218,902</td>
<td style="font-weight: bold; text-align:right;background-color:#FFEEAA;">+10,502</td>
<td style="font-weight: bold; text-align:right;">119,374 </td>
<td style="font-weight: bold; 
                                    text-align:right;background-color:red; color:white">+242</td>
<td style="font-weight: bold; text-align:right">903,616</td>
<td style="font-weight: bold; text-align:right;background-color:#c8e6c9; color:#000">+575</td>
<td style="text-align:right;font-weight:bold;">1,195,912</td>
<td style="font-weight: bold; text-align:right">16,695</td>
<td style="font-weight: bold; text-align:right">6,705</td>
<td style="font-weight: bold; text-align:right">361</td>
<td style="font-weight: bold; text-align:right">25,

In [None]:
country=containers[200].find_all("a",{"class":"mt_a"})
country[0].text

'Grenada'

In [None]:
cases=container.find_all("td")
cases[14].text

'330,928,170 '
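
Positional indexes such as `cases[14]` break as soon as the site adds or reorders a column. One possible alternative, sketched below, is to read the header row and look columns up by name. The HTML here is a cut-down stand-in for the real table (the real one has many more columns), and `html.parser` is used only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the real table, trimmed to four columns.
html = """
<table id="main_table_countries_today">
  <tr><th>#</th><th>Country,<br/>Other</th><th>Total<br/>Cases</th><th>Total<br/>Deaths</th></tr>
  <tr><td>1</td><td><a class="mt_a" href="country/us/">USA</a></td><td>2,218,902</td><td>119,374 </td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="main_table_countries_today")
rows = table.find_all("tr")

# Build {header text: column position} from the first row.
# The " " separator joins text that is split across <br/> tags.
headers = [th.get_text(" ", strip=True) for th in rows[0].find_all("th")]
col = {name: i for i, name in enumerate(headers)}

cells = rows[1].find_all("td")
record = {
    "country": cells[col["Country, Other"]].get_text(strip=True),
    "total_cases": cells[col["Total Cases"]].get_text(strip=True),
    "total_deaths": cells[col["Total Deaths"]].get_text(strip=True),
}
print(headers)
print(record)
```

If a header is renamed, the lookup fails loudly with a `KeyError` instead of silently reading the wrong column.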

In [None]:
for i in range(9,109):
    country=containers[i].find_all("a",{"class":"mt_a"})
    print("country",country[0].text)

country USA
country Brazil
country Russia
country India
country UK
country Spain
country Italy
country Peru
country Iran
country Germany
country Chile
country Turkey
country France
country Mexico
country Pakistan
country Saudi Arabia
country Canada
country Bangladesh
country Qatar
country South Africa
country Belgium
country Belarus
country Colombia
country Sweden
country Netherlands
country Ecuador
country Egypt
country UAE
country Indonesia
country Singapore
country Portugal
country Kuwait
country Argentina
country Ukraine
country Switzerland
country Poland
country Philippines
country Afghanistan
country Oman
country Ireland
country Iraq
country Dominican Republic
country Romania
country Panama
country Bolivia
country Israel
country Bahrain
country Armenia
country Japan
country Austria
country Nigeria
country Kazakhstan
country Moldova
country Ghana
country Serbia
country Denmark
country S. Korea
country Algeria
country Azerbaijan
country Guatemala
country Czechia
country Cameroon
co

In [None]:
for i in range(9,109):
    country=containers[i].find_all("a",class_="mt_a")
    print("country",country[0].text)
    cases=containers[i].find_all("td")
    print("total_cases",cases[2].text)
    print("total_deaths",cases[4].text)
    print("total_recovered",cases[6].text)
    print("active_cases",cases[8].text)
    print("critical_cases",cases[9].text)
    print("total_tests",cases[12].text)
    print("population",cases[14].text)
    print('\n')

country USA
total_cases 2,218,902
total_deaths 119,374 
total_recovered 903,616
active_cases 1,195,912
critical_cases 16,695
total_tests 25,834,677
population 330,928,170 


country Brazil
total_cases 934,769
total_deaths 45,585 
total_recovered 477,364
active_cases 411,820
critical_cases 8,318
total_tests 1,709,468
population 212,500,470 


country Russia
total_cases 553,301
total_deaths 7,478 
total_recovered 304,342
active_cases 241,481
critical_cases 2,300
total_tests 15,679,724
population 145,932,234 


country India
total_cases 360,483
total_deaths 12,058 
total_recovered 191,446
active_cases 156,979
critical_cases 8,944
total_tests 6,084,256
population 1,379,455,941 


country UK
total_cases 299,251
total_deaths 42,153 
total_recovered N/A
active_cases N/A
critical_cases 379
total_tests 7,121,976
population 67,872,439 


country Spain
total_cases 291,408
total_deaths 27,136 
total_recovered N/A
active_cases N/A
critical_cases 617
total_tests 4,826,516
population 46,754,133 


co

total_deaths 51 
total_recovered 3,700
active_cases 1,470
critical_cases 
total_tests 
population 9,527,545 


country DRC
total_cases 5,100
total_deaths 115 
total_recovered 640
active_cases 4,345
critical_cases 
total_tests 
population 89,421,061 


country Guinea
total_cases 4,639
total_deaths 26 
total_recovered 3,327
active_cases 1,286
critical_cases 24
total_tests 14,407
population 13,115,087 


country Haiti
total_cases 4,547
total_deaths 80 
total_recovered 24
active_cases 4,443
critical_cases 
total_tests 9,353
population 11,396,738 


country Djibouti
total_cases 4,545
total_deaths 43 
total_recovered 3,411
active_cases 1,091
critical_cases 
total_tests 40,855
population 987,384 


country North Macedonia
total_cases 4,482
total_deaths 210 
total_recovered 1,803
active_cases 2,469
critical_cases 34
total_tests 46,445
population 2,083,377 


country Gabon
total_cases 4,114
total_deaths 29 
total_recovered 1,432
active_cases 2,653
critical_cases 15
total_tests 23,741
population

In [None]:
dic={}
for i in range(9,109):
    key=i
    country=containers[i].find_all("a",class_="mt_a")
    print("country",country[0].text)
    cases=containers[i].find_all("td")
    print("total_cases",cases[2].text)
    print("total_deaths",cases[4].text)
    print("total_recovered",cases[6].text)
    print("active_cases",cases[8].text)
    print("critical_cases",cases[9].text)
    print("total_tests",cases[12].text)
    print("population",cases[14].text)
    print('\n')
    values=[country[0].text,cases[2].text,cases[4].text,cases[6].text,cases[8].text,cases[9].text,cases[12].text,cases[14].text]
    dic[key]=values

country USA
total_cases 2,218,902
total_deaths 119,374 
total_recovered 903,616
active_cases 1,195,912
critical_cases 16,695
total_tests 25,834,677
population 330,928,170 


country Brazil
total_cases 934,769
total_deaths 45,585 
total_recovered 477,364
active_cases 411,820
critical_cases 8,318
total_tests 1,709,468
population 212,500,470 


country Russia
total_cases 553,301
total_deaths 7,478 
total_recovered 304,342
active_cases 241,481
critical_cases 2,300
total_tests 15,679,724
population 145,932,234 


country India
total_cases 360,483
total_deaths 12,058 
total_recovered 191,446
active_cases 156,979
critical_cases 8,944
total_tests 6,084,256
population 1,379,455,941 


country UK
total_cases 299,251
total_deaths 42,153 
total_recovered N/A
active_cases N/A
critical_cases 379
total_tests 7,121,976
population 67,872,439 


country Spain
total_cases 291,408
total_deaths 27,136 
total_recovered N/A
active_cases N/A
critical_cases 617
total_tests 4,826,516
population 46,754,133 


co

total_cases 12,198
total_deaths 279 
total_recovered 10,774
active_cases 1,145
critical_cases 15
total_tests 1,132,823
population 51,267,604 


country Algeria
total_cases 11,268
total_deaths 799 
total_recovered 7,943
active_cases 2,526
critical_cases 39
total_tests 
population 43,815,597 


country Azerbaijan
total_cases 10,991
total_deaths 133 
total_recovered 6,075
active_cases 4,783
critical_cases 66
total_tests 397,399
population 10,135,522 


country Guatemala
total_cases 10,706
total_deaths 418 
total_recovered 2,096
active_cases 8,192
critical_cases 5
total_tests 31,427
population 17,900,654 


country Czechia
total_cases 10,154
total_deaths 333 
total_recovered 7,399
active_cases 2,422
critical_cases 12
total_tests 505,272
population 10,708,259 


country Cameroon
total_cases 9,864
total_deaths 276 
total_recovered 5,570
active_cases 4,018
critical_cases 28
total_tests 
population 26,513,830 


country Honduras
total_cases 9,656
total_deaths 330 
total_recovered 1,075
active_

In [None]:
dic

{9: ['USA',
  '2,218,902',
  '119,374 ',
  '903,616',
  '1,195,912',
  '16,695',
  '25,834,677',
  '330,928,170 '],
 10: ['Brazil',
  '934,769',
  '45,585 ',
  '477,364',
  '411,820',
  '8,318',
  '1,709,468',
  '212,500,470 '],
 11: ['Russia',
  '553,301',
  '7,478 ',
  '304,342',
  '241,481',
  '2,300',
  '15,679,724',
  '145,932,234 '],
 12: ['India',
  '360,483',
  '12,058 ',
  '191,446',
  '156,979',
  '8,944',
  '6,084,256',
  '1,379,455,941 '],
 13: ['UK',
  '299,251',
  '42,153 ',
  'N/A',
  'N/A',
  '379',
  '7,121,976',
  '67,872,439 '],
 14: ['Spain',
  '291,408',
  '27,136 ',
  'N/A',
  'N/A',
  '617',
  '4,826,516',
  '46,754,133 '],
 15: ['Italy',
  '237,828',
  '34,448 ',
  '179,455',
  '23,925',
  '163',
  '4,773,408',
  '60,464,907 '],
 16: ['Peru',
  '237,156',
  '7,056 ',
  '125,205',
  '104,895',
  '1,121',
  '1,396,605',
  '32,952,301 '],
 17: ['Iran',
  '195,051',
  '9,185 ',
  '154,812',
  '31,054',
  '2,789',
  '1,319,920',
  '83,947,823 '],
 18: ['Germany',
  '

In [None]:
import pandas as pd
data=pd.DataFrame(dic)
data

Unnamed: 0,9,10,11,12,13,14,15,16,17,18,...,99,100,101,102,103,104,105,106,107,108
0,USA,Brazil,Russia,India,UK,Spain,Italy,Peru,Iran,Germany,...,Somalia,Kyrgyzstan,CAR,Mayotte,Cuba,Croatia,Maldives,Mauritania,Estonia,Sri Lanka
1,2218902,934769,553301,360483,299251,291408,237828,237156,195051,189027,...,2658,2562,2410,2333,2280,2258,2094,2057,1977,1924
2,119374,45585,7478,12058,42153,27136,34448,7056,9185,8918,...,88,30,14,29,84,107,8,93,69,11
3,903616,477364,304342,191446,,,179455,125205,154812,173600,...,649,1902,396,2058,1999,2141,1670,373,1743,1397
4,1195912,411820,241481,156979,,,23925,104895,31054,6509,...,1921,630,2000,246,197,10,416,1591,165,516
5,16695,8318,2300,8944,379,617,163,1121,2789,419,...,2,12,2,13,5,,9,8,,1
6,25834677,1709468,15679724,6084256,7121976,4826516,4773408,1396605,1319920,4694147,...,,150612,18921,8800,138831,70712,35533,13842,99000,90010
7,330928170,212500470,145932234,1379455941,67872439,46754133,60464907,32952301,83947823,83774027,...,15870989,6519466,4826038,272499,11326859,4106085,540120,4643639,1326503,21409881


In [None]:
data=data.T
data

Unnamed: 0,0,1,2,3,4,5,6,7
9,USA,2218902,119374,903616,1195912,16695,25834677,330928170
10,Brazil,934769,45585,477364,411820,8318,1709468,212500470
11,Russia,553301,7478,304342,241481,2300,15679724,145932234
12,India,360483,12058,191446,156979,8944,6084256,1379455941
13,UK,299251,42153,,,379,7121976,67872439
...,...,...,...,...,...,...,...,...
104,Croatia,2258,107,2141,10,,70712,4106085
105,Maldives,2094,8,1670,416,9,35533,540120
106,Mauritania,2057,93,373,1591,8,13842,4643639
107,Estonia,1977,69,1743,165,,99000,1326503


In [None]:
data.columns=['country','total_cases','total_deaths','total_recovered','active_cases','critical_cases','total_tests','population']

In [None]:
data.columns

Index(['country', 'total_cases', 'total_deaths', 'total_recovered',
       'active_cases', 'critical_cases', 'total_tests', 'population'],
      dtype='object')

In [None]:
data

Unnamed: 0,country,total_cases,total_deaths,total_recovered,active_cases,critical_cases,total_tests,population
9,USA,2218902,119374,903616,1195912,16695,25834677,330928170
10,Brazil,934769,45585,477364,411820,8318,1709468,212500470
11,Russia,553301,7478,304342,241481,2300,15679724,145932234
12,India,360483,12058,191446,156979,8944,6084256,1379455941
13,UK,299251,42153,,,379,7121976,67872439
...,...,...,...,...,...,...,...,...
104,Croatia,2258,107,2141,10,,70712,4106085
105,Maldives,2094,8,1670,416,9,35533,540120
106,Mauritania,2057,93,373,1591,8,13842,4643639
107,Estonia,1977,69,1743,165,,99000,1326503


In [None]:
data.index=range(0,len(data))
data

Unnamed: 0,country,total_cases,total_deaths,total_recovered,active_cases,critical_cases,total_tests,population
0,USA,2218902,119374,903616,1195912,16695,25834677,330928170
1,Brazil,934769,45585,477364,411820,8318,1709468,212500470
2,Russia,553301,7478,304342,241481,2300,15679724,145932234
3,India,360483,12058,191446,156979,8944,6084256,1379455941
4,UK,299251,42153,,,379,7121976,67872439
...,...,...,...,...,...,...,...,...
95,Croatia,2258,107,2141,10,,70712,4106085
96,Maldives,2094,8,1670,416,9,35533,540120
97,Mauritania,2057,93,373,1591,8,13842,4643639
98,Estonia,1977,69,1743,165,,99000,1326503
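
The transpose, rename, and reindex steps above can also be done in one expression. This is a sketch of an alternative, not the notebook's original method: it uses `DataFrame.from_dict` with `orient="index"` on a two-row sample of the dictionary (values copied from the output shown earlier).

```python
import pandas as pd

columns = ["country", "total_cases", "total_deaths", "total_recovered",
           "active_cases", "critical_cases", "total_tests", "population"]

# Two entries in the same shape as `dic`: row position -> list of cell texts.
sample = {
    9: ["USA", "2,218,902", "119,374 ", "903,616", "1,195,912", "16,695", "25,834,677", "330,928,170 "],
    10: ["Brazil", "934,769", "45,585 ", "477,364", "411,820", "8,318", "1,709,468", "212,500,470 "],
}

# orient="index" makes each dictionary value a row, so no transpose is needed;
# reset_index(drop=True) renumbers the rows from 0.
table = pd.DataFrame.from_dict(sample, orient="index", columns=columns).reset_index(drop=True)
print(table)
```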


In [None]:
# Only the names used below are imported; chart_studio is a separate package and is not needed here.
import plotly.graph_objs as go
from plotly.offline import iplot

In [None]:
data['total_cases']

0     2,218,902
1       934,769
2       553,301
3       360,483
4       299,251
        ...    
95        2,258
96        2,094
97        2,057
98        1,977
99        1,924
Name: total_cases, Length: 100, dtype: object
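
The `dtype: object` above means the counts are still strings with thousands separators, and some cells are empty or `N/A`. A minimal cleaning helper might look like the following; the name `to_int` and the choice to return `None` for missing values are assumptions made for this sketch.

```python
def to_int(raw):
    """Convert a scraped count such as '2,218,902 ' to an int; None if unavailable."""
    cleaned = raw.replace(",", "").replace("+", "").strip()
    if not cleaned or cleaned.upper() == "N/A":
        return None
    return int(cleaned)

samples = ["2,218,902", "119,374 ", "N/A", "", "+10,502"]
print([to_int(s) for s in samples])
# prints [2218902, 119374, None, None, 10502]
```

On a DataFrame column, the helper could be applied with something like `data['total_cases'].map(to_int)`.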

In [None]:
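df=data.copy()
# The scraped counts are strings such as '119,374 '; convert them to numbers for the color scale.
deaths = pd.to_numeric(df['total_deaths'].str.replace(',', '', regex=False).str.strip(), errors='coerce')
dt = dict(
        type = 'choropleth',
        colorscale = 'Viridis',
        reversescale = True,
        locations = df['country'],
        locationmode = "country names",
        z = deaths,
        text = df['country'],
        colorbar = {'title' : 'COVID Deaths'},
      ) 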

In [None]:
layout = dict(
    title = 'COVID Death Cases',
    geo = dict(
        showframe = False,
        projection = {'type':'natural earth'}
    )
)

In [None]:
choromap = go.Figure(data = [dt],layout = layout)
iplot(choromap)