## B. NANDANA 

### 01. Web Scraping
#### Can you write a Python script that scrapes information about the 10 countries most affected by COVID-19 (total cases, total deaths, and total recovered) from worldometers.info and writes it to a CSV file?


In [153]:
import os # module for navigating your machine eg: file directories
import requests # module for requesting uris
import csv # for handling csv files
import pandas as pd # pandas for handling data
from datetime import datetime # for working with dates and times
from bs4 import BeautifulSoup as soup # for parsing web pages

print ("Imported all required modules")

Imported all required modules


In [154]:
# Define the URL where the web page can be accessed
url = "https://www.worldometers.info/coronavirus/#countries"  # website that provides real-time information about the coronavirus pandemic

# Request the web page from the URL with a GET request.
# allow_redirects=True (the default for GET) makes requests follow any redirects
# returned by the server. This is useful when a page has moved, but it also means
# the client can end up at an unexpected location.
response = requests.get(url, allow_redirects = True)

# response holds the server's reply to the GET request.
# Its status_code attribute shows whether the request succeeded: 200 means the
# server returned the requested content, while other codes indicate a
# redirection or an error.
response.status_code

200

HTTP status codes fall into five classes:

- 100s - Informational: the request was received and more input is expected from the client or server
- 200s - Success: the client's request was successful
- 300s - Redirection: the requested URL is located elsewhere; further action by the client may be needed
- 400s - Client Error: client-side error
- 500s - Server Error: server-side error, or the server is incapable of performing the request
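The grouping above can be sketched as a small lookup. The helper name `status_class` is hypothetical and is not part of `requests`; it only illustrates that the class is determined by the first digit of the code.

```python
def status_class(code: int) -> str:
    """Return the name of the HTTP status class for a three-digit status code."""
    classes = {
        1: "Informational",
        2: "Success",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid HTTP status code: {code}")
    # The hundreds digit selects the class
    return classes[code // 100]


for code in (200, 301, 404, 503):
    print(code, status_class(code))
```

In the notebook itself, `response.ok` already covers the common case: it is true for any status code below 400.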

In [155]:
if response.ok:
    print('Proceeding with web scraping')
    # proceed with web scraping
else:
    print('An error occurred')

# You should also check that the content is available before proceeding with web scraping.

Proceeding with web scraping


In [156]:
import time
time.sleep(1) # sleep for 1 second

# It is good practice to add a delay between requests to a website, a simple form
# of "rate limiting", to avoid overloading the server or getting blocked.
# time.sleep() pauses the script for the given number of seconds between requests.
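A minimal sketch of how such a delay might be applied when several pages are requested in a row. The function `fetch_politely`, the stand-in `fake_fetch`, and the example.com URLs are all hypothetical; in a real script the fetcher could be something like `lambda u: requests.get(u, timeout=10).text`.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_politely(urls: Iterable[str],
                   fetch: Callable[[str], str],
                   delay: float = 1.0) -> Dict[str, str]:
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # wait before every request except the first
        results[url] = fetch(url)
    return results


def fake_fetch(url: str) -> str:
    """Stand-in fetcher so the sketch runs without network access."""
    return f"<html>{url}</html>"


pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
    delay=0.1,
)
print(list(pages))
```

Passing the fetcher in as an argument keeps the pacing logic separate from the HTTP library, which also makes it easy to try out offline.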

The response.headers attribute returns a Python dictionary-like object that contains the headers of the server's response. The headers contain information about the server and the requested resource, such as the content type, the date the resource was last modified, and the size of the resource. 

In [157]:
response.headers

{'Connection': 'Keep-Alive', 'Set-Cookie': 'mobile_detect=desktop; expires=Thu, 09-Feb-2023 10:54:29 GMT; Max-Age=2592000; path=/; secure', 'Content-Type': 'text/html; charset=UTF-8', 'Etag': '"925287914-1673348069;br"', 'X-Litespeed-Cache': 'miss', 'Transfer-Encoding': 'chunked', 'Content-Encoding': 'br', 'Vary': 'Accept-Encoding', 'Date': 'Tue, 10 Jan 2023 10:54:29 GMT', 'Server': 'LiteSpeed', 'Alt-Svc': 'quic=":443"; ma=2592000; v="43,46", h3-Q043=":443"; ma=2592000, h3-Q046=":443"; ma=2592000, h3-Q050=":443"; ma=2592000, h3-25=":443"; ma=2592000, h3-27=":443"; ma=2592000'}

Because response.headers is a case-insensitive, dictionary-like structure of key-value pairs, a single header can be read by name, for example response.headers["Content-Type"].
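As an illustration of what a single header carries, the sketch below splits the `Content-Type` value shown in the output above into its media type and charset. It uses the standard library's `email.message.Message` rather than anything from `requests`, and the helper name `parse_content_type` is hypothetical.

```python
from email.message import Message


def parse_content_type(value: str):
    """Split a Content-Type header value into (media type, charset)."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the charset parameter in lower case
    return msg.get_content_type(), msg.get_content_charset()


media_type, charset = parse_content_type("text/html; charset=UTF-8")
print(media_type, charset)
```

Knowing the charset is helpful when the decoded text looks garbled, since it tells you how the server says the body is encoded.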

In [158]:
response.text[:1000]

'\n<!DOCTYPE html>\n<!--[if IE 8]> <html lang="en" class="ie8"> <![endif]-->\n<!--[if IE 9]> <html lang="en" class="ie9"> <![endif]-->\n<!--[if !IE]><!-->\n<html lang="en">\n<!--<![endif]-->\n\n\n\n<head>\n    <meta charset="utf-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n\n    <title>COVID Live - Coronavirus Statistics - Worldometer</title>\n    <meta name="description" content="Live statistics and coronavirus news tracking the number of confirmed cases, recovered patients, tests, and death toll due to the COVID-19 coronavirus from Wuhan, China. Coronavirus counter with new cases, deaths, and number of tests per 1 Million population. Historical data and info. Daily charts, graphs, news and updates">\n\n\n    \n\t<!-- Favicon -->\n\t<link rel="shortcut icon" href="/favicon/favicon.ico" type="image/x-icon">\n\t<link rel="apple-touch-icon" sizes="57x57" href="/favicon/apple-icon-57x57.png">\n\t<

### Parsing the web page

In [159]:
# Extract the contents of the web page from the response

soup_response = soup(response.text, "html.parser") #Parse the text as a Beautiful Soup object
soup_sample = soup(response.text[:1000], "html.parser") # Parse a sample of the text
soup_sample


<!DOCTYPE html>

<!--[if IE 8]> <html lang="en" class="ie8"> <![endif]-->
<!--[if IE 9]> <html lang="en" class="ie9"> <![endif]-->
<!--[if !IE]><!-->
<html lang="en">
<!--<![endif]-->
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>COVID Live - Coronavirus Statistics - Worldometer</title>
<meta content="Live statistics and coronavirus news tracking the number of confirmed cases, recovered patients, tests, and death toll due to the COVID-19 coronavirus from Wuhan, China. Coronavirus counter with new cases, deaths, and number of tests per 1 Million population. Historical data and info. Daily charts, graphs, news and updates" name="description"/>
<!-- Favicon -->
<link href="/favicon/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<link href="/favicon/apple-icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/>
	&lt;link rel="apple-touch-icon" </head></html>

### Extracting information

In [160]:
sections = soup_response.find_all("div", id = "maincounter-wrap")
sections

[<div id="maincounter-wrap" style="margin-top:15px">
 <h1>Coronavirus Cases:</h1>
 <div class="maincounter-number">
 <span style="color:#aaa">669,167,798        </span>
 </div>
 </div>,
 <div id="maincounter-wrap" style="margin-top:15px">
 <h1>Deaths:</h1>
 <div class="maincounter-number">
 <span>6,716,659</span>
 </div>
 </div>,
 <div id="maincounter-wrap" style="margin-top:15px;">
 <h1>Recovered:</h1>
 <div class="maincounter-number" style="color:#8ACA2B ">
 <span>640,431,448</span>
 </div>
 </div>]

In [161]:
len(sections)

3

In [162]:
for section in sections:
    print("------")
    print(section)
    print("------")
    print() # print a blank line for better formatting

------
<div id="maincounter-wrap" style="margin-top:15px">
<h1>Coronavirus Cases:</h1>
<div class="maincounter-number">
<span style="color:#aaa">669,167,798        </span>
</div>
</div>
------

------
<div id="maincounter-wrap" style="margin-top:15px">
<h1>Deaths:</h1>
<div class="maincounter-number">
<span>6,716,659</span>
</div>
</div>
------

------
<div id="maincounter-wrap" style="margin-top:15px;">
<h1>Recovered:</h1>
<div class="maincounter-number" style="color:#8ACA2B ">
<span>640,431,448</span>
</div>
</div>
------



In [163]:
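# Remove the padding spaces from each counter. Note that for cases the thousands
# separators are then replaced with spaces, while deaths and recoveries keep their commas.
cases = sections[0].find("span").text.replace(" ", "").replace(",", " ")
deaths = sections[1].find("span").text.replace(" ", "")
recov = sections[2].find("span").text.replace(" ", "")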
print("No. of cases: {}; death: {}; and recoveries: {}.".format(cases, deaths,recov))


No. of cases: 669 167 798; death: 6,716,659; and recoveries: 640,431,448.
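The three counters above end up formatted differently (spaces in one, commas in the others). If numeric values are wanted, one option is to normalise every counter to an integer. This is only a sketch: `parse_count` is a hypothetical helper, and the input strings are copied from the scraped `<span>` contents shown earlier.

```python
def parse_count(text: str) -> int:
    """Convert a scraped counter such as '669,167,798   ' to an integer."""
    cleaned = text.strip().replace(",", "").replace(" ", "")
    return int(cleaned)


cases = parse_count("669,167,798        ")
deaths = parse_count("6,716,659")
recovered = parse_count("640,431,448")

# Format with thousands separators only when displaying the numbers
print(f"No. of cases: {cases:,}; deaths: {deaths:,}; recoveries: {recovered:,}")
```

Storing plain integers also avoids the quoted fields that appear in the CSV when values contain commas.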


#### Saving Results from the Scrape

In [164]:
# create a document folder

import os

directory = "./Documents"

if not os.path.exists(directory):
    os.makedirs(directory)
else:
    print("Folder already exists")


Folder already exists
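The existence check can also be folded into the call itself with the `exist_ok` argument of `os.makedirs`. The sketch below works in a temporary directory so it does not touch the real `./Documents` folder; the variable names are illustrative only.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    directory = os.path.join(base, "Documents")
    os.makedirs(directory, exist_ok=True)  # creates the folder
    os.makedirs(directory, exist_ok=True)  # second call does nothing and raises no error
    created = os.path.isdir(directory)

print(created)
```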


In [165]:
# Write the results to a CSV file
date = datetime.now().strftime("%d-%m-%Y")  # get today's date
print(date)

10-01-2023


In [166]:
variables = ["Total Cases", "Total Deaths", "Total Recoveries"] # define the variable (column) names for the file
out_file = "./Documents/covid-19-statistics" + date + ".csv" # define the file the results are written to
observation = cases, deaths, recov # define an observation (row)
print(observation)

('669 167 798', '6,716,659', '640,431,448')
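The file name built above joins the stem and the date with no separator and uses day-month-year order. A possible alternative, shown only as a sketch, puts a hyphen before the date and uses the ISO `YYYY-MM-DD` form, so that files sort chronologically by name. `dated_csv_path` is a hypothetical helper and the date is fixed to make the example reproducible.

```python
from datetime import date
from pathlib import Path


def dated_csv_path(folder: str, stem: str, day: date) -> Path:
    """Build '<folder>/<stem>-YYYY-MM-DD.csv'."""
    return Path(folder) / f"{stem}-{day.isoformat()}.csv"


example = dated_csv_path("./Documents", "covid-19-statistics", date(2023, 1, 10))
print(example.name)
```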


In [167]:
with open(out_file, "w", newline="") as file:   
    writer = csv.writer(file)
    writer.writerow(variables)
    writer.writerow(observation)

In [168]:
# check if the file is present in documents folder
os.listdir("./Documents")

['covid-19-10-country-statistics10-01-2023.csv',
 'covid-19-country-statistics10-01-2023.csv',
 'covid-19-statistics10-01-2023.csv']

In [169]:
with open(out_file, "r") as file:
    data = file.read()
    
print(data)


Total Cases,Total Deaths,Total Recoveries
669 167 798,"6,716,659","640,431,448"
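The quotation marks in the file above come from the `csv` module, which quotes any field that contains the delimiter. The sketch below reproduces this in memory with `io.StringIO` (an assumption made so that no file is written) and shows that reading the data back with `csv.DictReader` restores the original strings.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Total Cases", "Total Deaths", "Total Recoveries"])
writer.writerow(["669 167 798", "6,716,659", "640,431,448"])

raw_text = buffer.getvalue()  # fields containing commas are wrapped in quotes

buffer.seek(0)
rows = list(csv.DictReader(buffer))  # the reader removes the quotes again
print(rows[0]["Total Deaths"])
```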



## Country-level COVID-19 Data

In [170]:
date = datetime.now().strftime("%d-%m-%Y") 

Extract the information contained in each row of the table:

In [171]:
table = soup_response.find("table",id = "main_table_countries_today").find("tbody")
rows = table.find_all("tr", style ="")

In [172]:
global_info = []
for row in rows:
    columns = row.find_all("td")
    country_info = [column.text.strip() for column in columns]
    del country_info[7] # drop the cell at index 7 (the one that follows Total Recovered)
    global_info.append(country_info)
    
print(global_info[0:10])
print("\r")
print("No. of rows in a table : {}".format(len(global_info)))
print("\r")

del global_info[0] # delete the first row, which holds the world statistics


[['', 'World', '669,167,798', '+177,404', '6,716,659', '+517', '640,431,448', '22,019,691', '48,532', '85,848', '861.7', '', '', '', 'All', '', '', '', '', '', ''], ['1', 'USA', '103,123,617', '', '1,121,298', '', '100,027,148', '1,975,171', '4,874', '308,011', '3,349', '1,155,691,065', '3,451,831', '334,805,269', 'North America', '3', '299', '0', '', '', '5,899'], ['2', 'India', '44,681,355', '', '530,722', '', '44,147,174', '3,459', '698', '31,765', '377', '912,192,538', '648,494', '1,406,631,776', 'Asia', '31', '2,650', '2', '', '', '2'], ['3', 'France', '39,409,429', '', '162,990', '', '38,808,127', '438,312', '869', '600,895', '2,485', '271,490,188', '4,139,547', '65,584,518', 'Europe', '2', '402', '0', '', '', '6,683'], ['4', 'Germany', '37,540,072', '', '162,975', '', '36,905,700', '471,397', '1,406', '447,526', '1,943', '122,332,384', '1,458,359', '83,883,596', 'Europe', '2', '515', '1', '', '', '5,620'], ['5', 'Brazil', '36,515,758', '', '694,949', '', '35,247,755', '573,054',

Save this scrape to a file:

In [173]:
import os

try:
    os.mkdir("./Documents")
except OSError as error:
    print("Unable to create the folder..It already exists")
    
variables = ["No.", "Country","Total Cases","Total Deaths",
             "Total Recovered"] 

date = datetime.now().strftime("%d-%m-%Y")  # get today's date
out_file = "./Documents/covid-19-country-statistics" + date + ".csv" # define the file the results are written to
with open(out_file, "w") as file:
        file.write(",".join(variables))
print(out_file)

Unable to create the folder..It already exists
./Documents/covid-19-country-statistics10-01-2023.csv


In [174]:
with open(out_file, "w", newline="") as file:   
    writer = csv.writer(file)
    writer.writerow(variables)
    for country in global_info:
        writer.writerow(country)

In [175]:
data = pd.read_csv(out_file, encoding = "ISO-8859-1", index_col = False)
data.head(10)


  data = pd.read_csv(out_file, encoding = "ISO-8859-1", index_col = False)


Unnamed: 0,No.,Country,Total Cases,Total Deaths,Total Recovered
0,1,USA,103123617,,1121298
1,2,India,44681355,,530722
2,3,France,39409429,,162990
3,4,Germany,37540072,,162975
4,5,Brazil,36515758,,694949
5,6,Japan,30647859,75504.0,60411
6,7,S. Korea,29599747,60041.0,32669
7,8,Italy,25279682,,185417
8,9,UK,24210131,,201028
9,10,Russia,21832768,3032.0,394168
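In the table above, Total Deaths is mostly empty and Total Recovered holds what look like death counts. A likely cause is that the header has five names while each scraped row has twenty values, so the names line up with the first five cells (number, country, total cases, new cases, total deaths). One way to address this, sketched below, is to keep only the wanted cells before writing. The positions `[0, 1, 2, 4, 6]` are an assumption read off the row layout printed earlier and would need rechecking if the site changes its table; `select_columns` is a hypothetical helper, and the two sample rows are the first seven cells of the USA and India rows from that output.

```python
import csv
import io

# First seven cells of two scraped rows, copied from the printed output
sample_rows = [
    ["1", "USA", "103,123,617", "", "1,121,298", "", "100,027,148"],
    ["2", "India", "44,681,355", "", "530,722", "", "44,147,174"],
]
variables = ["No.", "Country", "Total Cases", "Total Deaths", "Total Recovered"]
wanted = [0, 1, 2, 4, 6]  # assumed positions of the five variables in each row


def select_columns(row, positions):
    """Return only the cells at the given positions."""
    return [row[i] for i in positions]


buffer = io.StringIO()  # in-memory stand-in for the output file
writer = csv.writer(buffer)
writer.writerow(variables)
for row in sample_rows:
    writer.writerow(select_columns(row, wanted))

buffer.seek(0)
records = list(csv.DictReader(buffer))
print(records[0])
```

With header and rows the same length, each value lands under the column it belongs to.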


In [176]:
data[data["Country"] == "India"]


Unnamed: 0,No.,Country,Total Cases,Total Deaths,Total Recovered
1,2,India,44681355,,530722


Create a new CSV file containing only the first ten countries and their data:

In [179]:
import pandas as pd

data = pd.read_csv(out_file, encoding = "ISO-8859-1", index_col = False)
item = data.head(10)
# Specify the name of the output file and the directory where it should be saved
output_file = "./Documents/covid-19-10-country-statistics" + date + ".csv" # define the file the results are written to

# Use the to_csv() method of the DataFrame to save the data to a CSV file
item.to_csv(output_file, index=False)



  data = pd.read_csv(out_file, encoding = "ISO-8859-1", index_col = False)
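Taking the first ten rows relies on the site already listing countries in descending order of total cases. If that ordering cannot be assumed, the rows can be sorted explicitly first. The sketch below uses plain Python rather than pandas, a hypothetical helper `top_n_by_cases`, and three case counts taken from the output above.

```python
def top_n_by_cases(rows, n=10):
    """Return the n rows with the highest 'Total Cases' value."""
    def cases(row):
        # Remove thousands separators before converting to an integer
        return int(row["Total Cases"].replace(",", ""))
    return sorted(rows, key=cases, reverse=True)[:n]


rows = [
    {"Country": "India", "Total Cases": "44,681,355"},
    {"Country": "USA", "Total Cases": "103,123,617"},
    {"Country": "France", "Total Cases": "39,409,429"},
]
top_two = top_n_by_cases(rows, n=2)
print([row["Country"] for row in top_two])
```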
