### Task 1 - Data Collection
Here you will obtain the required data for the analysis. As described in the project instructions, you will perform a web scrap to obtain data from the NCDC website, import data from the John Hopkins repository, and import the provided external data.


In [46]:
# Import all libraries in this cell
import requests
import numpy as np
import urllib.request
import pandas as pd
import csv
from bs4 import BeautifulSoup
import seaborn as sns
sns.set_style("darkgrid")
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')  
import warnings
warnings.filterwarnings('ignore')

### A - NCDC Website scrap
Website - https://covid19.ncdc.gov.ng/

In [47]:
url = "https://covid19.ncdc.gov.ng/"
page = requests.get(url).text
# Initializing the BeautifulSoup package with the specific parser
soup = BeautifulSoup(page, 'lxml')
print(soup)


<!DOCTYPE html>
<html lang="en">
<head><meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<title>NCDC Coronavirus COVID-19 Microsite</title>
<!--[if lt IE 11]>
    	<script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
    	<script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
    	<![endif]-->
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0, user-scalable=0, minimal-ui" name="viewport"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="" name="description"/>
<meta content="" name="keywords"/>
<meta content="Codedthemes" name="author"/>
<!-- Google Tag Manager -->
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
  new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
  j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
  'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
  })(

In [48]:
table = soup.find("table", id="custom1")

# Getting the Table header names 

table_headers = table.thead.findAll("tr")
for k in range(len(table_headers)):
    data = table_headers[k].find_all("th")
    column_names = [j.string.strip() for j in data]
    
print(column_names)

['States Affected', 'No. of Cases (Lab Confirmed)', 'No. of Cases (on admission)', 'No. Discharged', 'No. of Deaths']


In [49]:
# Extracting the keys ('States Affected' column) and the Values from the table using the <td> tag
values = []
keys = []
table_data = table.tbody.findAll('tr')
for k in range(len(table_data)):
    key = table_data[k].find_all("td")[0].string.strip()
    keys.append(key)
    value = [j.string.strip() for j in table_data[k].find_all("td")]
    values.append(value)
    
# Creating an enumerated dictionary object to create our dataframe
dataframe_dict = dict(enumerate(values))
dataframe_dict

{0: ['Lagos', '57,900', '471', '56,990', '439'],
 1: ['FCT', '19,697', '453', '19,080', '164'],
 2: ['Plateau', '9,029', '23', '8,949', '57'],
 3: ['Kaduna', '8,984', '52', '8,867', '65'],
 4: ['Rivers', '6,977', '30', '6,847', '100'],
 5: ['Oyo', '6,838', '212', '6,503', '123'],
 6: ['Edo', '4,892', '8', '4,699', '185'],
 7: ['Ogun', '4,620', '8', '4,563', '49'],
 8: ['Kano', '3,918', '15', '3,793', '110'],
 9: ['Ondo', '3,226', '1,083', '2,080', '63'],
 10: ['Kwara', '3,120', '251', '2,814', '55'],
 11: ['Delta', '2,613', '798', '1,744', '71'],
 12: ['Osun', '2,544', '30', '2,462', '52'],
 13: ['Nasarawa', '2,378', '1,992', '373', '13'],
 14: ['Enugu', '2,259', '257', '1,973', '29'],
 15: ['Katsina', '2,097', '14', '2,049', '34'],
 16: ['Gombe', '2,034', '4', '1,986', '44'],
 17: ['Ebonyi', '2,008', '11', '1,965', '32'],
 18: ['Anambra', '1,909', '64', '1,826', '19'],
 19: ['Akwa Ibom', '1,788', '75', '1,699', '14'],
 20: ['Abia', '1,677', '11', '1,645', '21'],
 21: ['Imo', '1,655', 

In [50]:
#convert to dataframe
df_ncdc = pd.DataFrame(dataframe_dict).T
df_ncdc.columns = column_names
df_ncdc.head()


Unnamed: 0,States Affected,No. of Cases (Lab Confirmed),No. of Cases (on admission),No. Discharged,No. of Deaths
0,Lagos,57900,471,56990,439
1,FCT,19697,453,19080,164
2,Plateau,9029,23,8949,57
3,Kaduna,8984,52,8867,65
4,Rivers,6977,30,6847,100


### B - John Hopkins Data Repository
Here you will obtain data from the John Hopkins repository. Your task here involves saving the data from the GitHub repo link to DataFrame for further analysis. Find the links below. 
* Global Daily Confirmed Cases - Click [Here](https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv)
* Global Daily Recovered Cases - Click [Here](https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv)
* Global Daily Death Cases - Click [Here](https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv)

In [51]:
# Using the raw user github user content
confd = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'

recov = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv'

death = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv'

In [52]:
df_confd = pd.read_csv(confd)
df_confd.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,3/30/21,3/31/21,4/1/21,4/2/21,4/3/21,4/4/21,4/5/21,4/6/21,4/7/21,4/8/21
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,56384,56454,56517,56572,56595,56676,56717,56779,56873,56943
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,124723,125157,125506,125842,126183,126531,126795,126936,127192,127509
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,117061,117192,117304,117429,117524,117622,117739,117879,118004,118116
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,11944,12010,12053,12115,12174,12231,12286,12328,12363,12409
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,22182,22311,22399,22467,22579,22631,22717,22885,23010,23108


In [53]:
df_recov = pd.read_csv(recov)
df_recov.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,3/30/21,3/31/21,4/1/21,4/2/21,4/3/21,4/4/21,4/5/21,4/6/21,4/7/21,4/8/21
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,51473,51550,51788,51798,51802,51885,51902,51928,51940,51956
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,90617,91271,91875,92500,93173,93842,94431,95035,95600,96129
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,81442,81538,81632,81729,81813,81896,81994,82096,82192,82289
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,11276,11315,11365,11401,11428,11474,11523,11570,11616,11692
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,20446,20493,20508,20867,20871,20879,21452,21489,21545,21557


In [54]:
df_death = pd.read_csv(death)
df_death.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,3/30/21,3/31/21,4/1/21,4/2/21,4/3/21,4/4/21,4/5/21,4/6/21,4/7/21,4/8/21
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,51473,51550,51788,51798,51802,51885,51902,51928,51940,51956
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,90617,91271,91875,92500,93173,93842,94431,95035,95600,96129
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,81442,81538,81632,81729,81813,81896,81994,82096,82192,82289
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,11276,11315,11365,11401,11428,11474,11523,11570,11616,11692
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,20446,20493,20508,20867,20871,20879,21452,21489,21545,21557


### C - External Data 
* Save the external data to a DataFrame
* External Data includes but not limited to: `covid_external.csv`, `Budget data.csv`, `RealGDP.csv`

In [55]:
df_budget = pd.read_csv("Budget data.csv")
df_budget.head()

Unnamed: 0,states,Initial_budget (Bn),Revised_budget (Bn)
0,Abia,136.6,102.7
1,Adamawa,183.3,139.31
2,Akwa-Ibom,597.73,366.0
3,Anambra,137.1,112.8
4,Bauchi,167.2,128.0


In [56]:
df_external = pd.read_csv('covid_external.csv')
df_external.head()

Unnamed: 0,states,region,Population,Overall CCVI Index,Age,Epidemiological,Fragility,Health System,Population Density,Socio-Economic,Transport Availability,Acute IHR
0,FCT,North Central,4865000,0.3,0.0,0.9,0.4,0.6,0.9,0.6,0.2,0.79
1,Plateau,North Central,4766000,0.4,0.5,0.4,0.8,0.3,0.3,0.5,0.3,0.93
2,Kwara,North Central,3524000,0.3,0.4,0.3,0.2,0.4,0.2,0.6,0.7,0.93
3,Nassarawa,North Central,2783000,0.1,0.3,0.5,0.9,0.0,0.1,0.6,0.5,0.85
4,Niger,North Central,6260000,0.6,0.0,0.6,0.3,0.7,0.1,0.8,0.8,0.84


In [57]:
df_gdp = pd.read_csv('RealGDP.csv')
df_gdp.head()

Unnamed: 0,Year,Q1,Q2,Q3,Q4
0,2014,15438679.5,16084622.31,17479127.58,18150356.45
1,2015,16050601.38,16463341.91,17976234.59,18533752.07
2,2016,15943714.54,16218542.41,17555441.69,18213537.29
3,2017,15797965.83,16334719.27,17760228.17,18598067.07
4,2018,16096654.19,16580508.07,18081342.1,19041437.59


### Task 2 - View the data
Obtain basic information about the data using the `head()` and `info()` method.

In [58]:
df_confd.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 274 entries, 0 to 273
Columns: 447 entries, Province/State to 4/8/21
dtypes: float64(2), int64(443), object(2)
memory usage: 957.0+ KB


In [59]:
df_recov.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 259 entries, 0 to 258
Columns: 447 entries, Province/State to 4/8/21
dtypes: float64(2), int64(443), object(2)
memory usage: 904.6+ KB


In [60]:
df_death.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 259 entries, 0 to 258
Columns: 447 entries, Province/State to 4/8/21
dtypes: float64(2), int64(443), object(2)
memory usage: 904.6+ KB


In [61]:
df_external.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   states                   37 non-null     object 
 1   region                   37 non-null     object 
 2   Population               37 non-null     int64  
 3   Overall CCVI Index       37 non-null     float64
 4   Age                      37 non-null     float64
 5   Epidemiological          37 non-null     float64
 6   Fragility                37 non-null     float64
 7   Health System            37 non-null     float64
 8   Population Density       37 non-null     float64
 9   Socio-Economic           37 non-null     float64
 10   Transport Availability  37 non-null     float64
 11  Acute IHR                37 non-null     float64
dtypes: float64(9), int64(1), object(2)
memory usage: 3.6+ KB


In [62]:
df_gdp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Year    7 non-null      int64  
 1   Q1      7 non-null      float64
 2   Q2      7 non-null      float64
 3   Q3      7 non-null      float64
 4   Q4      7 non-null      float64
dtypes: float64(4), int64(1)
memory usage: 408.0 bytes


In [63]:
df_budget.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   states               37 non-null     object 
 1   Initial_budget (Bn)  37 non-null     float64
 2   Revised_budget (Bn)  37 non-null     float64
dtypes: float64(2), object(1)
memory usage: 1016.0+ bytes


In [64]:
df_ncdc.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37 entries, 0 to 36
Data columns (total 5 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   States Affected               37 non-null     object
 1   No. of Cases (Lab Confirmed)  37 non-null     object
 2   No. of Cases (on admission)   37 non-null     object
 3   No. Discharged                37 non-null     object
 4   No. of Deaths                 37 non-null     object
dtypes: object(5)
memory usage: 1.7+ KB


### Task 3 - Data Cleaning and Preparation
From the information obtained above, you will need to fix the data format. 
<br>
Examples: 
* Convert to appropriate data type.
* Rename the columns of the scraped data.
* Remove comma(,) in numerical data
* Extract daily data for Nigeria from the Global daily cases data

TODO A - Clean the scraped data

In [65]:
#[Write Your Code Here]


TODO B - Get a Pandas DataFrame for Daily Confirmed Cases in Nigeria. Columns are Date and Cases

TODO C - Get a Pandas DataFrame for Daily Recovered Cases in Nigeria. Columns are Date and Cases

TODO D - Get a Pandas DataFrame for Daily Death Cases in Nigeria. Columns are Date and Cases

### Task 4 - Analysis
Here you will perform some analyses on the datasets. You are welcome to communicate findings in charts and summary. 
<br>
We have included a few TODOs to help with your analysis. However, do not let this limit your approach, feel free to include more, and be sure to support your findings with chart and summary 

TODO A - Generate a plot that shows the Top 10 states in terms of Confirmed Covid cases by Laboratory test

TODO B - Generate a plot that shows the Top 10 states in terms of Discharged Covid cases. Hint - Sort the values

TODO D - Plot the top 10 Death cases

TODO E - Generate a line plot for the total daily confirmed, recovered and death cases in Nigeria

TODO F - 
* Determine the daily infection rate, you can use the Pandas `diff` method to find the derivate of the total cases.
* Generate a line plot for the above

TODO G - 
* Calculate maximum infection rate for a day (Number of new cases)
* Find the date

TODO H - Determine the relationship between the external dataset and the NCDC COVID-19 dataset. 
Here you will generate a line plot of top 10 confirmed cases and the overall community vulnerability index on the same axis. From the graph, explain your observation.
<br>
Steps
* Combine the two dataset together on a common column(states)
* Create a new dataframe for plotting. This DataFrame will contain top 10 states in terms of confirmed cases i.e sort by confirmed cases. ** Hint: Check out Pandas [nlargest](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nlargest.html) function. This [tutorial](https://cmdlinetips.com/2019/03/how-to-select-top-n-rows-with-the-largest-values-in-a-columns-in-pandas/) can help out ** 
* Plot both variable on the same axis. Check out this [tutorial](http://kitchingroup.cheme.cmu.edu/blog/2013/09/13/Plotting-two-datasets-with-very-different-scales/)

TODO I - Determine the relationship between the external dataset and the NCDC COVID-19 dataset. 
* Here you will generate a regression plot between two variables to visualize the linear relationships - Confirmed Cases and Population Density.
Hint: Check out Seaborn [Regression Plot](https://seaborn.pydata.org/generated/seaborn.regplot.html).
* Provide a summary of your observation

TODO J - 
* Provide more analyses by extending TODO G & H. Meaning, determine relationships between more features.
* Provide a detailed summary of your findings. 
* Note that you can have as many as possible.

### TODO L - 
Determine the effect of the Pandemic on the economy. To do this, you will compare the Real GDP value Pre-COVID-19 with Real GDP in 2020 (COVID-19 Period, especially Q2 2020)
<br>
Steps
* From the Real GDP Data, generate a `barplot` using the GDP values for each year & quarters. For example: On x-axis you will have year 2017 and the bars will be values of each quarters(Q1-Q4). You expected to have subplots of each quarters on one graph.
<br>
Hint: Use [Pandas.melt](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) to create your plot DataFrame 
* Set your quarter legend to lower left.
* Using `axhline`, draw a horizontal line through the graph at the value of Q2 2020.
* Write out your observation

### Note: Do not limit your analysis to the provided TODOs. Perform more analyses e.g 
* Check for more external dataset
* Ask more questions & find the right answers by exploring the data