# COVID-19 Dataset

The purpose of this notebook is to demonstrate web scraping with the BeautifulSoup library: extracting the country-wise data from the table on the page and converting the columns to data types suitable for further analysis.
The resulting dataset is exported to an Excel file.

The dataset reflects the figures shown on the date the webpage was accessed.
Data obtained from [Worldometers](https://www.worldometers.info/coronavirus/#countries).

The following details about coronavirus cases are collected for each country:
1. Country Name
2. Total Cases
3. Total Deaths
4. Total Recovered
5. Active Cases
6. Critical Cases
7. Total Tests

In [28]:
import pandas as pd
import numpy as np
import requests               # To send HTTP requests
from bs4 import BeautifulSoup # For scraping the webpage

print('Libraries imported.')

Libraries imported.


### Scraping data from Worldometers

In [29]:
website = 'https://www.worldometers.info/coronavirus/#countries'
page_html = requests.get(website).text          # raw HTML of the page
soup = BeautifulSoup(page_html, 'html.parser')  # parse it into a searchable tree
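The request above has no timeout and does not check the HTTP status. A minimal sketch of a more defensive fetch follows. The helper name `fetch_soup` and the 10-second default are assumptions made for illustration, not part of the original notebook. The `#countries` fragment is dropped because URL fragments are handled by the browser and are not sent to the server.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://www.worldometers.info/coronavirus/"


def fetch_soup(url=URL, timeout=10):
    """Download a page and parse it, failing loudly on HTTP errors.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return BeautifulSoup(response.text, "html.parser")
```

Calling `soup = fetch_soup()` would then stand in for the request and parsing steps, with a failed download raising an exception instead of silently parsing an error page.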

In [31]:
#soup

In [32]:
my_table = soup.find('tbody')
#my_table

Collecting the HTML table rows as a list of dictionaries, one per row

In [33]:
table_data = []

for row in my_table.find_all('tr'):
    # Text of every <td> cell in this row
    row_data = [cell.text for cell in row.find_all('td')]

    if row_data:  # skip rows that contain no data cells
        # Positional mapping: assumes the cells arrive in exactly this order
        data_item = {
            "Country": row_data[0],
            "TotalCases": row_data[1],
            "NewCases": row_data[2],
            "TotalDeaths": row_data[3],
            "NewDeaths": row_data[4],
            "TotalRecovered": row_data[5],
            "ActiveCases": row_data[6],
            "CriticalCases": row_data[7],
            "Totcase1M": row_data[8],
            "Totdeath1M": row_data[9],
            "TotalTests": row_data[10],
            "Tottest1M": row_data[11],
        }
        table_data.append(data_item)

Converting the collected rows into a DataFrame

In [34]:
df = pd.DataFrame(table_data)
df.head(15)

| | Country | TotalCases | NewCases | TotalDeaths | NewDeaths | TotalRecovered | ActiveCases | CriticalCases | Totcase1M | Totdeath1M | TotalTests | Tottest1M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | `\nNorth America\n` | 1603816 | 2960.0 | 97945 | 273.0 | 395758.0 | 1110113.0 | 17404 | | | |
| 1 | | `\nEurope\n` | 1730530 | 11857.0 | 159280 | 212.0 | 729034.0 | 842216.0 | 12500 | | | |
| 2 | | `\nSouth America\n` | 377660 | 471.0 | 20054 | 16.0 | 130956.0 | 226650.0 | 10341 | | | |
| 3 | | `\nAsia\n` | 744527 | 5569.0 | 23715 | 123.0 | 420389.0 | 300423.0 | 4964 | | | |
| 4 | | `\nAfrica\n` | 77086 | 84.0 | 2579 | 1.0 | 28241.0 | 46266.0 | 288 | | | |
| 5 | | `\nOceania\n` | 8621 | 31.0 | 119 | | 7854.0 | 648.0 | 20 | | | |
| 6 | | `\n\n` | 721 | | 15 | | 651.0 | 55.0 | 4 | | | |
| 7 | | World | 4542961 | 20972.0 | 303707 | 625.0 | 1712883.0 | 2526371.0 | 45521 | 583.0 | 39.0 | |
| 8 | 1.0 | USA | 1457593 | | 86912 | | 318027.0 | 1052654.0 | 16240 | 4407.0 | 263.0 | 10638893.0 |
| 9 | 2.0 | Spain | 272646 | | 27321 | | 186480.0 | 58845.0 | 1376 | 5832.0 | 584.0 | 2467761.0 |
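In the preview above, the values sit one column to the right of their labels: `Country` holds a rank number and `TotalCases` holds the country name. This suggests that each row starts with a cell the positional mapping does not account for. One way to avoid fixed positions is to take the column names from the table's own header row, sketched below under stated assumptions. The HTML snippet, its header labels, and the helper name `rows_as_dicts` are made up for illustration and are not the real page structure. `get_text(strip=True)` also removes the surrounding newlines seen in values such as `\nEurope\n`.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of a table whose rows start with a rank column.
SAMPLE_HTML = """
<table>
  <thead><tr><th>#</th><th>Country</th><th>TotalCases</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>USA</td><td>1,457,593</td></tr>
    <tr><td>2</td><td>Spain</td><td>272,646</td></tr>
  </tbody>
</table>
"""


def rows_as_dicts(table):
    """Pair every body cell with the header text in the same position."""
    headers = [th.get_text(strip=True)
               for th in table.find("thead").find_all("th")]
    records = []
    for row in table.find("tbody").find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) == len(headers):  # skip rows that do not match the header
            records.append(dict(zip(headers, cells)))
    return records


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = rows_as_dicts(sample_soup.find("table"))
print(records[0])
```

Because the keys come from the header, an extra leading column shows up as its own key instead of shifting every other value.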


In [35]:
df.shape

(223, 12)

In [36]:
df1=df.loc[8:219,['Country','TotalCases','TotalDeaths','TotalRecovered','ActiveCases','CriticalCases','TotalTests']]
df1.head(10)

| | Country | TotalCases | TotalDeaths | TotalRecovered | ActiveCases | CriticalCases | TotalTests |
|---|---|---|---|---|---|---|---|
| 8 | 1 | USA | | | 318027.0 | 1052654.0 | 263 |
| 9 | 2 | Spain | | | 186480.0 | 58845.0 | 584 |
| 10 | 3 | Russia | 10598.0 | 113.0 | 58226.0 | 202199.0 | 17 |
| 11 | 4 | UK | | | | | 495 |
| 12 | 5 | Italy | | | 115288.0 | 76440.0 | 519 |
| 13 | 6 | Brazil | 247.0 | 6.0 | 79479.0 | 109687.0 | 66 |
| 14 | 7 | France | | | 59605.0 | 91840.0 | 420 |
| 15 | 8 | Germany | | | 151700.0 | 15347.0 | 95 |
| 16 | 9 | Turkey | | | 104030.0 | 36712.0 | 48 |
| 17 | 10 | Iran | | | 90539.0 | 17140.0 | 82 |
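The slice `8:219` relies on the summary rows (continents, the blank row and `World`) sitting at fixed positions, which can change whenever the page layout changes. A hedged alternative is to filter by label instead. The sketch below uses a small hand-built frame, and the `AGGREGATES` set is an assumption about which labels count as summaries.

```python
import pandas as pd

# Small stand-in for the scraped frame; labels mimic the preview shown earlier.
raw = pd.DataFrame({
    "Country": ["\nEurope\n", "\nAsia\n", "\n\n", "World", "USA", "Spain"],
    "TotalCases": ["1,730,530", "744,527", "721", "4,542,961",
                   "1,457,593", "272,646"],
})

# Assumed list of non-country labels to drop (the empty string covers blank rows).
AGGREGATES = {"North America", "Europe", "South America", "Asia",
              "Africa", "Oceania", "World", ""}

names = raw["Country"].str.strip()          # remove the surrounding newlines
countries = raw[~names.isin(AGGREGATES)].reset_index(drop=True)
print(countries)
```

Filtering by name keeps working if rows are added or reordered, at the cost of maintaining the list of labels to exclude.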


In [37]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 212 entries, 8 to 219
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Country         212 non-null    object
 1   TotalCases      212 non-null    object
 2   TotalDeaths     212 non-null    object
 3   TotalRecovered  212 non-null    object
 4   ActiveCases     212 non-null    object
 5   CriticalCases   212 non-null    object
 6   TotalTests      212 non-null    object
dtypes: object(7)
memory usage: 11.7+ KB


In [38]:
#df1.to_excel('covid19_data.xlsx')

In [39]:
df1.dtypes

Country           object
TotalCases        object
TotalDeaths       object
TotalRecovered    object
ActiveCases       object
CriticalCases     object
TotalTests        object
dtype: object

In [40]:
def removecomma(col_name):
    """Strip thousands separators from df1[col_name] and convert it to numbers."""
    cleaned = df1[col_name].str.replace(',', '', regex=False)
    # Values that cannot be parsed as numbers (e.g. '' or text) become NaN
    df1[col_name] = pd.to_numeric(cleaned, errors='coerce', downcast='integer')
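The helper above modifies the global `df1` in place. As a hedged alternative, the same cleaning can be written as a function that takes a frame and a list of columns and returns a converted copy, which makes it easier to reuse and test. The function name `to_numeric_columns` and the sample strings are illustrative assumptions.

```python
import pandas as pd


def to_numeric_columns(frame, columns):
    """Return a copy of `frame` with the given string columns converted to numbers."""
    out = frame.copy()
    for col in columns:
        cleaned = (out[col].astype(str)
                           .str.replace(",", "", regex=False)
                           .str.strip())
        # Anything that is not a number after cleaning becomes NaN
        out[col] = pd.to_numeric(cleaned, errors="coerce")
    return out


# Illustrative input: two comma-formatted numbers and one non-numeric entry.
sample = pd.DataFrame({
    "Country": ["USA", "Spain", "UK"],
    "TotalRecovered": ["318,027", "186,480", "N/A"],
})
clean = to_numeric_columns(sample, ["TotalRecovered"])
print(clean.dtypes)
```

The original frame is left untouched, so the raw strings stay available for checking which entries were coerced to NaN.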


In [41]:
colnames=df1.columns

In [42]:
for col in colnames[1:7]:
    removecomma(col)

In [43]:
df1.head()

| | Country | TotalCases | TotalDeaths | TotalRecovered | ActiveCases | CriticalCases | TotalTests |
|---|---|---|---|---|---|---|---|
| 8 | 1 | | | | 318027.0 | 1052654.0 | 263.0 |
| 9 | 2 | | | | 186480.0 | 58845.0 | 584.0 |
| 10 | 3 | | 10598.0 | 113.0 | 58226.0 | 202199.0 | 17.0 |
| 11 | 4 | | | | | | 495.0 |
| 12 | 5 | | | | 115288.0 | 76440.0 | 519.0 |


In [44]:
df1.dtypes

Country            object
TotalCases        float64
TotalDeaths       float64
TotalRecovered    float64
ActiveCases       float64
CriticalCases     float64
TotalTests        float64
dtype: object

#### Filling the NaN values in TotalTests, TotalRecovered, CriticalCases and TotalDeaths with 0

In [45]:
df1["TotalTests"] = df1["TotalTests"].fillna(0)
df1["TotalRecovered"] = df1["TotalRecovered"].fillna(0)
df1["CriticalCases"] = df1["CriticalCases"].fillna(0)
df1["TotalDeaths"] = df1["TotalDeaths"].fillna(0)
df1.head(10)

| | Country | TotalCases | TotalDeaths | TotalRecovered | ActiveCases | CriticalCases | TotalTests |
|---|---|---|---|---|---|---|---|
| 8 | 1 | | 0.0 | 0.0 | 318027.0 | 1052654.0 | 263.0 |
| 9 | 2 | | 0.0 | 0.0 | 186480.0 | 58845.0 | 584.0 |
| 10 | 3 | | 10598.0 | 113.0 | 58226.0 | 202199.0 | 17.0 |
| 11 | 4 | | 0.0 | 0.0 | | 0.0 | 495.0 |
| 12 | 5 | | 0.0 | 0.0 | 115288.0 | 76440.0 | 519.0 |
| 13 | 6 | | 247.0 | 6.0 | 79479.0 | 109687.0 | 66.0 |
| 14 | 7 | | 0.0 | 0.0 | 59605.0 | 91840.0 | 420.0 |
| 15 | 8 | | 0.0 | 0.0 | 151700.0 | 15347.0 | 95.0 |
| 16 | 9 | | 0.0 | 0.0 | 104030.0 | 36712.0 | 48.0 |
| 17 | 10 | | 0.0 | 0.0 | 90539.0 | 17140.0 | 82.0 |
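The four `fillna` assignments can be collapsed into one by selecting the columns as a list. The sketch below shows this on a small frame whose numbers are placeholders rather than scraped values. Columns left out of `fill_cols` keep their NaN, which matters because a filled 0 can no longer be told apart from a reported zero.

```python
import numpy as np
import pandas as pd

# Illustrative frame; the values are placeholders.
sample = pd.DataFrame({
    "Country": ["A", "B"],
    "TotalDeaths": [113.0, np.nan],
    "TotalTests": [np.nan, 584.0],
    "ActiveCases": [np.nan, 58845.0],
})

# Columns where a missing value is treated as 0; ActiveCases is deliberately excluded.
fill_cols = ["TotalDeaths", "TotalTests"]
sample[fill_cols] = sample[fill_cols].fillna(0)
print(sample)
```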


In [46]:
# Rank by total cases (largest first), then number the rows from 1
df1.sort_values(by='TotalCases', ascending=False, inplace=True)
df1.index = np.arange(1, len(df1) + 1)
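A variant of the same ranking step that avoids in-place mutation is sketched below: sort, drop the old index, then shift the fresh index so it starts at 1. The sample frame is illustrative, with a missing value standing in for an unparsed entry, and `na_position="last"` makes explicit where such rows end up.

```python
import pandas as pd

# Illustrative frame with one missing value.
sample = pd.DataFrame({
    "Country": ["Spain", "USA", "UK"],
    "TotalCases": [272646, 1457593, None],
})

ranked = (sample
          .sort_values("TotalCases", ascending=False, na_position="last")
          .reset_index(drop=True))
ranked.index = ranked.index + 1   # ranks start at 1 instead of 0
print(ranked)
```

Deriving the index from the sorted frame means no row count needs to be hard-coded.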

In [47]:
df1.head()

| | Country | TotalCases | TotalDeaths | TotalRecovered | ActiveCases | CriticalCases | TotalTests |
|---|---|---|---|---|---|---|---|
| 1 | 1 | | 0.0 | 0.0 | 318027.0 | 1052654.0 | 263.0 |
| 2 | 2 | | 0.0 | 0.0 | 186480.0 | 58845.0 | 584.0 |
| 3 | 3 | | 10598.0 | 113.0 | 58226.0 | 202199.0 | 17.0 |
| 4 | 4 | | 0.0 | 0.0 | | 0.0 | 495.0 |
| 5 | 5 | | 0.0 | 0.0 | 115288.0 | 76440.0 | 519.0 |


In [48]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 212 entries, 1 to 212
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Country         212 non-null    object 
 1   TotalCases      0 non-null      float64
 2   TotalDeaths     212 non-null    float64
 3   TotalRecovered  212 non-null    float64
 4   ActiveCases     208 non-null    float64
 5   CriticalCases   212 non-null    float64
 6   TotalTests      212 non-null    float64
dtypes: float64(6), object(1)
memory usage: 13.2+ KB


In [50]:
df1.to_excel('Covid19_data_15052020.xlsx', index=True)
print("The excel file is Generated !!!")

The excel file is Generated !!!
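Since the dataset depends on the access date, the date suffix in the file name can be generated rather than typed by hand. The sketch below assumes the suffix `15052020` is day, month, year. The helper name `dated_filename` is hypothetical.

```python
from datetime import date


def dated_filename(day, prefix="Covid19_data", ext="xlsx"):
    """Build a file name with a DDMMYYYY suffix (assumed date layout)."""
    return f"{prefix}_{day.strftime('%d%m%Y')}.{ext}"


print(dated_filename(date(2020, 5, 15)))  # Covid19_data_15052020.xlsx
```

Passing `dated_filename(date.today())` to `df1.to_excel(...)` would then stamp each export with the day it was produced.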


#### Thank You