### **Data Scraping for COVID-19 on the Malaysian Outbreak**

Hi, I would like to retrieve COVID-19 datasets from day 1, in tabular format, from this website: https://www.outbreak.my/. I would like to use the data for data analytics purposes. I would appreciate your help if possible.

I'm mainly interested in the data shown in the location map at this link (https://www.outbreak.my/map), where the website lists the locations of the confirmed cases with details such as the place, reported date, active state of the case, and source of information. I would like that data in .csv format.

There is also a chance that I might be able to get the Malaysia statistics data from this link (https://www.outbreak.my/stats), also in tabular format (.csv).

In [0]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import datetime 
import time 

In [0]:
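url = "https://www.outbreak.my/stats"
response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error
soup = BeautifulSoup(response.text, 'html.parser')  # parse the downloaded HTML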

In [0]:
htmltable = soup.find('table', { 'class' : 'table table-striped card-table table-vcenter text-nowrap datatable' })

In [0]:
def tableDataText(table):
    """Return the table as a list of rows (lists of cell strings), header row first if present."""
    rows = []
    trs = table.find_all('tr')
    if not trs:  # empty table: nothing to extract
        return rows
    header_row = [th.get_text(strip=True) for th in trs[0].find_all('th')]  # header cells
    if header_row:  # if there is a header row, include it first
        rows.append(header_row)
        trs = trs[1:]
    for tr in trs:  # for every remaining table row
        rows.append([td.get_text(strip=True) for td in tr.find_all('td')])  # data cells
    return rows

In [77]:
list_table = tableDataText(htmltable)
list_table[:2]

[['Case',
  'Age',
  'Gender',
  'Nationality',
  'Status',
  'Confirmed Date',
  'Recovered Date',
  'Deceased'],
 ['2162',
  '61',
  'Male',
  'Malaysian',
  'Death',
  '24th, Mar 2020',
  '-',
  '28th, Mar 2020']]

In [80]:
dftable = pd.DataFrame(list_table[1:], columns=list_table[0])
dftable.head(1000)


Unnamed: 0,Case,Age,Gender,Nationality,Status,Confirmed Date,Recovered Date,Deceased
0,2162,61,Male,Malaysian,Death,"24th, Mar 2020",-,"28th, Mar 2020"
1,2032,83,Male,Malaysian,Death,"25th, Mar 2020",-,"27th, Mar 2020"
2,1840,62,Male,Malaysian,Death,"23rd, Mar 2020",-,"26th, Mar 2020"
3,1797,48,Male,Malaysian,Death,"23rd, Mar 2020",-,"26th, Mar 2020"
4,1625,56,Male,Malaysian,Death,"20th, Mar 2020",-,"25th, Mar 2020"
...,...,...,...,...,...,...,...,...
158,5,36,Female,China,Recovered,"28th, Jan 2020","14th, Feb 2020",-
159,4,40,Male,China,Recovered,"25th, Jan 2020","8th, Feb 2020",-
160,3,65,Female,China,Recovered,"24th, Jan 2020","14th, Feb 2020",-
161,2,2,Male,China,Recovered,"24th, Jan 2020","14th, Feb 2020",-
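The date columns come back as text such as "24th, Mar 2020", with "-" where no date applies. For analytics it may help to turn them into real dates. The following is a minimal sketch using only the standard library; the helper name `parse_outbreak_date` is hypothetical, and it assumes English month abbreviations (the `%b` directive depends on the locale).

```python
import re
from datetime import date, datetime
from typing import Optional


def parse_outbreak_date(text: str) -> Optional[date]:
    """Parse strings such as '24th, Mar 2020'; return None for the '-' placeholder."""
    text = text.strip()
    if text in ("", "-"):
        return None
    # Drop the ordinal suffix and the comma: '24th, Mar 2020' -> '24 Mar 2020'
    cleaned = re.sub(r"^(\d{1,2})(?:st|nd|rd|th),?\s+", r"\1 ", text)
    return datetime.strptime(cleaned, "%d %b %Y").date()


print(parse_outbreak_date("24th, Mar 2020"))  # 2020-03-24
print(parse_outbreak_date("-"))               # None
```

Something like `dftable['Confirmed Date'].map(parse_outbreak_date)` should apply the same conversion to a whole column.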


In [84]:
!ls

exportDataFrames.csv	 exportDataset.csv     exportDatasets.csv
exportDataFrames.gsheet  exportDataset.gsheet


In [83]:
cd /content/drive/My Drive/web-scrapping

/content/drive/My Drive/web-scrapping


In [0]:
dftable.to_csv('/content/drive/My Drive/web-scrapping/covid19.csv', index=False, header=True)

In [87]:
dftable.isnull().sum()

Case              0
Age               0
Gender            0
Nationality       0
Status            0
Confirmed Date    0
Recovered Date    0
Deceased          0
dtype: int64

In [88]:
dftable.isna().sum()

Case              0
Age               0
Gender            0
Nationality       0
Status            0
Confirmed Date    0
Recovered Date    0
Deceased          0
dtype: int64

In [89]:
dftable.isna().any()

Case              False
Age               False
Gender            False
Nationality       False
Status            False
Confirmed Date    False
Recovered Date    False
Deceased          False
dtype: bool

In [90]:
dftable.isna().any(axis = None)

False
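The checks above report no missing values because the site uses the string "-" as a placeholder, and pandas treats that as ordinary text. A possible way to make the gaps visible is to replace the placeholder with NaN first. This sketch uses a toy frame named `sample` (a hypothetical name) with two rows copied from the table above, not the full dataset.

```python
import numpy as np
import pandas as pd

# Two rows copied from the scraped table, for illustration only
sample = pd.DataFrame(
    {
        "Case": ["2162", "5"],
        "Status": ["Death", "Recovered"],
        "Recovered Date": ["-", "14th, Feb 2020"],
        "Deceased": ["28th, Mar 2020", "-"],
    }
)

# '-' is an ordinary string, so isna() does not flag it
before = sample.isna().sum()

# Treat '-' as missing, then count again
cleaned = sample.replace("-", np.nan)
after = cleaned.isna().sum()

print(before.sum())             # 0
print(after["Recovered Date"])  # 1
print(after["Deceased"])        # 1
```

The same `replace("-", np.nan)` call could be tried on `dftable` before running the `isna()` checks.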

In [91]:
dftable.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163 entries, 0 to 162
Data columns (total 8 columns):
Case              163 non-null object
Age               163 non-null object
Gender            163 non-null object
Nationality       163 non-null object
Status            163 non-null object
Confirmed Date    163 non-null object
Recovered Date    163 non-null object
Deceased          163 non-null object
dtypes: object(8)
memory usage: 10.3+ KB
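Every column is stored as `object` (text), so numeric operations and sorting on `Case` or `Age` work on strings rather than numbers. One option is `pd.to_numeric` with `errors='coerce'`. The sketch below uses a toy frame named `sample` (a hypothetical name) with illustrative values, not actual rows, including a "-" to show how unparseable entries become NaN.

```python
import pandas as pd

# Illustrative values only; '-' stands in for an unparseable entry
sample = pd.DataFrame(
    {
        "Case": ["2162", "5", "1006", "101"],
        "Age": ["61", "36", "-", "2"],
    }
)

# As text, '1006' sorts before '101' and '5' comes last
text_order = sample["Case"].sort_values().tolist()

# errors='coerce' turns anything that is not a number into NaN
sample["Case"] = pd.to_numeric(sample["Case"], errors="coerce")
sample["Age"] = pd.to_numeric(sample["Age"], errors="coerce")

# As numbers, the order is the expected one
numeric_order = sample["Case"].sort_values().tolist()

print(text_order)     # ['1006', '101', '2162', '5']
print(numeric_order)  # [5, 101, 1006, 2162]
```

After a conversion like this, `groupby` and `sort_values` on `Case` or `Age` should order the groups numerically.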


In [0]:
import matplotlib.pyplot as plt
%matplotlib inline

In [96]:
location_group = dftable.groupby(['Case'])['Age'].count().reset_index()

location_average = dftable.groupby(['Recovered Date'])['Deceased'].count().reset_index()


display(location_group, round(location_average, 2))

Unnamed: 0,Case,Age
0,1,1
1,10,1
2,1006,1
3,101,1
4,1031,1
...,...,...
158,95,1
159,96,1
160,97,1
161,98,1


Unnamed: 0,Recovered Date,Deceased
0,-,121
1,"11th, Mar 2020",1
2,"12th, Mar 2020",6
3,"13th, Mar 2020",1
4,"14th, Feb 2020",4
5,"14th, Mar 2020",2
6,"15th, Mar 2020",7
7,"16th, Feb 2020",1
8,"17th, Feb 2020",1
9,"18th, Feb 2020",4


In [0]:
# Count cases per age; Age is scraped as text, so convert it to numbers first
age_counts = (
    pd.to_numeric(dftable['Age'], errors='coerce')
    .dropna()
    .astype(int)
    .value_counts()
    .sort_index()
)

plt.rcdefaults()
fig, ax = plt.subplots(figsize=(6, 10))

ages = age_counts.index.tolist()
y_pos = np.arange(len(ages))
cases = age_counts.tolist()

# Horizontal bars: one bar per age, bar length = number of cases
ax.barh(y_pos, cases, align='center', alpha=0.5)
ax.set_yticks(y_pos)
ax.set_yticklabels(ages)
ax.invert_yaxis()  # youngest age at the top
ax.set_xlabel('Number of Cases')
ax.set_ylabel('Age')
ax.set_title('Number of Cases by Age')

plt.show()