In [1]:
import pandas as pd
import bs4 as bs
import numpy as np
import requests
import re

**Use BeautifulSoup and Requests or Pandas to scrape the table “COVID-19 cases, deaths, and rates by location” under Statistics / Total cases, deaths, and death rates by country (Our World in Data) on this Wikipedia page: https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory**

In [2]:
source = requests.get("https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory")
print(source)

<Response [200]>


*   **Convert the scraped data to a Pandas DataFrame.**



In [3]:
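from io import StringIO

soup = bs.BeautifulSoup(source.content, features='html.parser')

# Locate the table by its id (taken from the page source; it may change when the article is edited).
table = soup.find('table', attrs = {'id':'table65150380'})
# Wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string.
covid_df = pd.read_html(StringIO(str(table)))[0]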
print(covid_df.columns)
covid_df = covid_df.loc[:,['Country.1', 'Deathsper million', 'Deaths', 'Cases']]
print(covid_df.head())

Index(['Country', 'Country.1', 'Deathsper million', 'Deaths', 'Cases',
       'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7'],
      dtype='object')
                Country.1 Deathsper million   Deaths      Cases
0                World[a]               687  5414213  282800017
1                    Peru              6070   202524    2279299
2                Bulgaria              4445    30657     737233
3  Bosnia and Herzegovina              4095    13365     288876
4                 Hungary              4037    38894    1246689


*   **Transform the Pandas DataFrame so that it has country, deaths_per_million, deaths, cases as columns (no other data should be present). Print the head of the resulting DataFrame.**



In [4]:
covid_df.columns = ['country', 'deaths_per_million', 'deaths', 'cases']
covid_df.head(3)

    country deaths_per_million   deaths      cases
0  World[a]                687  5414213  282800017
1      Peru               6070   202524    2279299
2  Bulgaria               4445    30657     737233


*   **Drop any row that does not contain country or region information, i.e. drop all rows that do not contain numerical data.**
*   **Drop all rows of countries with zero recorded deaths or non-numeric death data.**

In [5]:
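# Drop row 217, the non-data row in this snapshot of the table (the index is hard-coded).
covid_df = covid_df.drop([217], axis = 0)
print(covid_df.head())
# Treat the em dash placeholder as zero, then drop every row with zero deaths.
covid_df = covid_df.replace('—', '0')
covid_df = covid_df[covid_df.deaths != '0']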


                  country deaths_per_million   deaths      cases
0                World[a]                687  5414213  282800017
1                    Peru               6070   202524    2279299
2                Bulgaria               4445    30657     737233
3  Bosnia and Herzegovina               4095    13365     288876
4                 Hungary               4037    38894    1246689
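The hard-coded row index and the em dash replacement above depend on one particular snapshot of the page. As a rough alternative sketch, the same filtering can be expressed with `pd.to_numeric(..., errors="coerce")`, which turns anything non-numeric into `NaN`. The `sample` frame, its footer row, and its placeholder values are made up for illustration and are not the scraped data.

```python
import pandas as pd

# Hypothetical sample mimicking the scraped table: every value is a string,
# missing data is an em dash, and the last row is a non-data footer.
sample = pd.DataFrame({
    "country": ["Peru", "Bulgaria", "Vatican City", "Tuvalu", "Footer text"],
    "deaths_per_million": ["6070", "4445", "0", "—", "Footer text"],
    "deaths": ["202524", "30657", "0", "—", "Footer text"],
    "cases": ["2279299", "737233", "27", "—", "Footer text"],
})

numeric_cols = ["deaths_per_million", "deaths", "cases"]

# Coerce anything that is not a number to NaN instead of raising an error.
converted = sample[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Keep rows where every numeric column parsed and deaths is greater than zero.
keep = converted.notna().all(axis=1) & (converted["deaths"] > 0)

# Reattach the country names and store the numbers as integers.
cleaned = sample.loc[keep, ["country"]].join(converted.loc[keep].astype(int))
print(cleaned)
```

Because the filter is driven by the values rather than by a row position, it should keep working if rows are added to or removed from the table.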


*   **Use string formatting to remove the square bracket information from region names (e.g. World[a], European Union[b] should be World, European Union).**



In [6]:
# Use a raw string so the backslashes reach the regex engine unchanged.
covid_df.country = covid_df['country'].replace(r"\[.*\]", "", regex=True)
print(covid_df.head())

                  country deaths_per_million   deaths      cases
0                   World                687  5414213  282800017
1                    Peru               6070   202524    2279299
2                Bulgaria               4445    30657     737233
3  Bosnia and Herzegovina               4095    13365     288876
4                 Hungary               4037    38894    1246689
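A small standalone sketch of the bracket removal, using `Series.str.replace` on made-up names. The pattern here is a slightly stricter variant than the one in the cell above: `[^\]]*` stops at the first closing bracket, so a name with two footnote markers loses only the markers and not the text between them.

```python
import pandas as pd

# Hypothetical region names with footnote markers, as they appear in the table.
names = pd.Series(["World[a]", "European Union[b]", "Peru"])

# Remove each "[...]" group; the raw string keeps the backslashes literal.
cleaned_names = names.str.replace(r"\[[^\]]*\]", "", regex=True).str.strip()
print(cleaned_names.tolist())  # ['World', 'European Union', 'Peru']
```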


*   **Assign the DataFrame index to be the country name.**



In [7]:
covid_df = covid_df.set_index('country')

*   **Convert the datatype of all DataFrame values from objects to integers.**



In [8]:
covid_df = covid_df.astype({'deaths_per_million':int, 'deaths':int, 'cases':int})

*   **Create a new column called cases_per_deaths and assign it the value number of cases divided by deaths.**



In [9]:
# Create a new column called cases_per_deaths and assign it the value number of cases divided by deaths.
covid_df['cases_per_deaths'] = round(covid_df['cases']/covid_df['deaths'])
print(covid_df.head())

                        deaths_per_million  ...  cases_per_deaths
country                                     ...                  
World                                  687  ...              52.0
Peru                                  6070  ...              11.0
Bulgaria                              4445  ...              24.0
Bosnia and Herzegovina                4095  ...              22.0
Hungary                               4037  ...              32.0

[5 rows x 4 columns]


*  **Sort the DataFrame so that the countries with the highest number of cases_per_deaths come first. Print the first 20 rows of your sorted DataFrame**



In [10]:
covid_df.sort_values('cases_per_deaths', ascending = False).head(20)

                      deaths_per_million  deaths   cases  cases_per_deaths
country
Greenland                             17       1    2437            2437.0
Bhutan                                 3       3    2660             887.0
Cayman Islands                       165      11    8386             762.0
Burundi                                3      38   26999             710.0
Iceland                              107      37   25314             684.0
Faroe Islands                        265      13    5261             405.0
Qatar                                210     616  248802             404.0
Maldives                             481     262   95222             363.0
United Arab Emirates                 216    2160  754911             349.0
Singapore                            151     825  278409             337.0


**What does the cases per death number indicate for countries with a high value and for countries with a low value?**

The 'cases_per_deaths' column gives the number of recorded cases for every recorded death in a country.

Thus, the highest cases_per_deaths value shows that for every one death, 2437 cases were recorded. In other words, one death was seen among 2437 cases, so the death rate among recorded cases is very low.

Conversely, a low cases_per_deaths value shows that the death rate is comparatively high: a value of about 5 means there is 1 death for approximately every 5 cases.
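The relationship can be checked with a little arithmetic: the share of recorded cases that ended in death is the reciprocal of cases_per_deaths. The sketch below uses the Greenland and Peru figures from the outputs above; the `fatality_pct` column name is an illustrative choice, not part of the assignment.

```python
import pandas as pd

# Deaths and cases copied from the notebook outputs above.
rows = pd.DataFrame(
    {"deaths": [1, 202524], "cases": [2437, 2279299]},
    index=["Greenland", "Peru"],
)

# Cases recorded for every death (unrounded here).
rows["cases_per_deaths"] = rows["cases"] / rows["deaths"]

# The inverse view: percentage of recorded cases that ended in death.
rows["fatality_pct"] = 100 * rows["deaths"] / rows["cases"]

print(rows.round(2))
```

Greenland's 2437 cases per death corresponds to roughly 0.04% of recorded cases ending in death, while Peru's value of about 11 corresponds to roughly 8.9%. Both figures rest on recorded cases only, so they also reflect how much testing each country carried out.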


