# Segmenting and Grouping Neighborhoods Vulnerable to the Covid-19 Pandemic

<h1>By Isaac Etungu</h1>

<h2>Part One: Model Development</h2>

In this section, I develop a <b>Simple Linear Regression Model</b> that predicts the number of deaths due to the Covid-19 pandemic worldwide from the number of reported cases. The result is only an estimate, but it should give an objective idea of how much the pandemic will cost in lives.

In Data Analytics, we often use Model Development to help us predict future observations from the data we have. A model helps us understand the relationship between different variables and how these variables are used to predict the result.

<b>First, I install lxml with conda</b>

In [1]:
conda install -y lxml

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.9.2
  latest version: 4.10.1

Please update conda by running

    $ conda update -n base conda



## Package Plan ##

  environment location: /srv/conda/envs/notebook

  added / updated specs:
    - lxml


The following packages will be UPDATED:

  ca-certificates    anaconda::ca-certificates-2020.10.14-0 --> conda-forge::ca-certificates-2021.5.30-ha878542_0
  certifi                anaconda::certifi-2020.6.20-py36_0 --> conda-forge::certifi-2021.5.30-py36h5fab9bb_0


Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Note: you may need to restart the kernel to use updated packages.


<b>Then import libraries</b>

In [2]:
import requests
import lxml.html as lh
import pandas as pd

<b>I retrieve the web page to be parsed in the notebook</b>

In [3]:
WHO_url = 'https://www.ecdc.europa.eu/en/geographical-distribution-2019-ncov-cases' # URL of the ECDC page that holds the table
#WHO_url = 'https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases'

page = requests.get(WHO_url) # download the page

doc = lh.fromstring(page.content) # parse the HTML content into an element tree

tr_elements = doc.xpath('//tr') # select every table row (<tr>) in the HTML

[len(T) for T in tr_elements[:12]] # number of cells in each of the first 12 rows

[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]

<b>Checking the table headers</b>

In [4]:
tr_elements = doc.xpath('//tr') # select every table row; the first row is the header

col = [] # list of (column name, values) pairs
i = 0

for t in tr_elements[0]: # for each header cell, store its name and an empty list
    i+=1
    name=t.text_content()
    print("%d:%s" % (i,name))
    col.append((name,[]))

1:Region 
2:Places reporting cases
3:Cases
4:Deaths
5:Confirmed cases
			 during
			the 14-days
6:Reporting period
			YYYY-WW


<b>Collect the data from the other rows</b>

In [5]:
for j in range(1,len(tr_elements)): # the header is the first row, so data is stored in the subsequent rows
    T = tr_elements[j] #T is the j'th row
    
    if len(T)!=6: #if the row does not have 6 cells, the //tr data is not from the table
        break
        
    i = 0 #i is the index of the current column, starting at the first
    
    for t in T.iterchildren(): #iterate through each cell of the row
        data=t.text_content()
            
        col[i][1].append(data) #append the data to the list of the i'th column
            
        i+=1 #increment i for the next column


<b>What about the numbers of rows and columns</b>

In [6]:
[len(C) for (title,C) in col]

[216, 216, 216, 216, 216, 216]

<b>Build the data frame with six columns.</b>

In [7]:
Dict = {title:column for (title,column) in col}
df = pd.DataFrame(Dict)


Display the first 12 rows

In [8]:
df.head(12)

Unnamed: 0,Region,Places reporting cases,Cases,Deaths,Confirmed cases\n\t\t\t during\n\t\t\tthe 14-days,Reporting period\n\t\t\tYYYY-WW
0,Africa,Algeria,133388,3571,4663,2021-22 and 2021-23
1,,Angola,36600,825,2420,2021-22 and 2021-23
2,,Benin,8109,102,51,2021-22 and 2021-23
3,,Botswana,62040,896,5727,2021-22 and 2021-23
4,,Burkina_Faso,13459,167,28,2021-22 and 2021-23
5,,Burundi,5013,8,259,2021-22 and 2021-23
6,,Cameroon,79904,1302,1922,2021-22 and 2021-23
7,,Cape_Verde,31571,273,1212,2021-22 and 2021-23
8,,Central_African_Republic,10987,98,3896,2021-22 and 2021-23
9,,Chad,4942,174,13,2021-22 and 2021-23


Display last 10 Rows

In [9]:
df.tail(10)

Unnamed: 0,Region,Places reporting cases,Cases,Deaths,Confirmed cases\n\t\t\t during\n\t\t\tthe 14-days,Reporting period\n\t\t\tYYYY-WW
206,,Micronesia_(Federated_States_of),0,0,0,2021-22 and 2021-23
207,,New_Caledonia,128,0,0,2021-22 and 2021-23
208,,New_Zealand,2709,26,392,2021-22 and 2021-23
209,,Northern_Mariana_Islands,183,2,0,2021-22 and 2021-23
210,,Papua_New_Guinea,16682,165,834,2021-22 and 2021-23
211,,Solomon_Islands,20,0,0,2021-22 and 2021-23
212,,Vanuatu,3,0,0,2021-22 and 2021-23
213,,Wallis_and_Futuna,454,7,9,2021-22 and 2021-23
214,Other,Cases_on_an_international_conveyance_Japan,705,6,0,2021-22 and 2021-23
215,Total,,176702468,3813133,5649195,


<b>Checking the shapes</b>

In [10]:
df.shape  #checking the number of rows, columns in the data set

(216, 6)

<h2>Let's Clean the dataframe</h2>

In [11]:
import pandas as pd
import numpy as np

<b>Remove Row 215 that has</b> Total <b>record</b>

In [12]:
# Delete row at index position Region: Total
df2 = df.drop([df.index[215]])
df2

Unnamed: 0,Region,Places reporting cases,Cases,Deaths,Confirmed cases\n\t\t\t during\n\t\t\tthe 14-days,Reporting period\n\t\t\tYYYY-WW
0,Africa,Algeria,133388,3571,4663,2021-22 and 2021-23
1,,Angola,36600,825,2420,2021-22 and 2021-23
2,,Benin,8109,102,51,2021-22 and 2021-23
3,,Botswana,62040,896,5727,2021-22 and 2021-23
4,,Burkina_Faso,13459,167,28,2021-22 and 2021-23
...,...,...,...,...,...,...
210,,Papua_New_Guinea,16682,165,834,2021-22 and 2021-23
211,,Solomon_Islands,20,0,0,2021-22 and 2021-23
212,,Vanuatu,3,0,0,2021-22 and 2021-23
213,,Wallis_and_Futuna,454,7,9,2021-22 and 2021-23


# Data Analysis

<b>I will now import the libraries needed to analyse the data</b>

In [13]:
!conda install -c anaconda seaborn -y
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,PolynomialFeatures
%matplotlib inline

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.9.2
  latest version: 4.10.1

Please update conda by running

    $ conda update -n base conda



## Package Plan ##

  environment location: /srv/conda/envs/notebook

  added / updated specs:
    - seaborn


The following packages will be SUPERSEDED by a higher-priority channel:

  ca-certificates    conda-forge::ca-certificates-2021.5.3~ --> anaconda::ca-certificates-2020.10.14-0
  certifi            conda-forge::certifi-2021.5.30-py36h5~ --> anaconda::certifi-2020.6.20-py36_0


Preparing transaction: done
Verifying transaction: done
Executing transaction: done


In [14]:
%%capture
! pip install seaborn

In [15]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline

In [16]:
#sns.regplot(x="Cases", y="Deaths", data=df2) # data must be the DataFrame itself, not the string "df2"
#plt.ylim(0, )

<h2>And now let's Explore the Data</h2>

In [17]:
df2.describe()

Unnamed: 0,Region,Places reporting cases,Cases,Deaths,Confirmed cases\n\t\t\t during\n\t\t\tthe 14-days,Reporting period\n\t\t\tYYYY-WW
count,215.0,215,215,215,215,215
unique,7.0,215,214,191,190,1
top,,Nicaragua,20,0,0,2021-22 and 2021-23
freq,209.0,1,2,11,18,215
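
The `count / unique / top / freq` summary above is what `describe()` reports for text columns, which suggests the scraped values are still strings. Here is a small sketch of converting them to numbers, assuming the cells contain plain digit strings. The frame below is a made-up miniature, not the notebook's `df2`.

```python
import pandas as pd

# Made-up miniature frame: scraped cells arrive as text, as in the notebook
toy = pd.DataFrame({"Cases": ["133388", "36600", "8109"],
                    "Deaths": ["3571", "825", "102"]})

# Convert every column; text that cannot be parsed becomes NaN instead of raising
numeric = toy.apply(pd.to_numeric, errors="coerce")
print(numeric.dtypes)
print(numeric.describe())
```

After conversion, `describe()` reports mean, standard deviation and quartiles, which are more useful for judging the spread of cases and deaths.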


<b> I check if we have any missing values for the columns.</b>

In [18]:
df2.get('Region', default='Absent') # returns 'Absent': the scraped header was printed as 'Region ' (trailing space), so the key 'Region' likely does not match

'Absent'
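
The header listing printed earlier shows `Region ` with a trailing space and two headers with embedded newlines and tabs, so an exact lookup of `'Region'` would fail. A hedged sketch of normalising the column names is shown below. The toy frame only mimics those headers and is an assumption, not the real `df2`.

```python
import pandas as pd

# Made-up frame whose headers mimic the scraped ones (trailing space, embedded newlines/tabs)
toy = pd.DataFrame({"Region ": ["Africa"],
                    "Confirmed cases\n\t\t\t during\n\t\t\tthe 14-days": ["4663"]})

# Collapse every run of whitespace to one space, then trim both ends
toy.columns = toy.columns.str.replace(r"\s+", " ", regex=True).str.strip()
print(list(toy.columns))
print(toy.get("Region", default="Absent"))
```

Applied to the real frame, the same line would make `df2['Region']` and `df2['Confirmed cases during the 14-days']` addressable, provided the whitespace is indeed the cause.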

In [19]:
df2.isnull().sum(axis = 0) # count missing (NaN) values in every column; empty strings are not counted as missing
#print("number of NaN values for the column Region :", df2['Region'].isnull().sum())
#print("number of NaN values for the column Places reporting cases:", df2['Places reporting cases'].isnull().sum())
#print("number of NaN values for the column Cases :", df2['Cases'].isnull().sum())
#print("number of NaN values for the column Deaths :", df2['Deaths'].isnull().sum())
#print("number of NaN values for the column Confirmed Confirmed cases during the 14 days :", df2['Confirmed cases during the 14 days'].isnull().sum())

Region                                               0
Places reporting cases                               0
Cases                                                0
Deaths                                               0
Confirmed cases\n\t\t\t during\n\t\t\tthe 14-days    0
Reporting period\n\t\t\tYYYY-WW                      0
dtype: int64
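
A count of zero here does not mean every cell is filled: `isnull()` ignores empty strings, and the table prints the region label only on the first row of each group. The sketch below counts blank cells and fills them, assuming a blank means "same region as the row above". The frame and the short column name `Place` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up miniature: the region label appears only on the first row of each group
toy = pd.DataFrame({"Region": ["Africa", "", "", "Other"],
                    "Place": ["Algeria", "Angola", "Benin",
                              "Cases_on_an_international_conveyance_Japan"]})

blank = toy["Region"].str.strip().eq("")
print("blank region cells:", int(blank.sum()))

# Treat blanks as missing, then carry the last seen region label forward
toy["Region"] = toy["Region"].mask(blank, np.nan).ffill()
print(toy["Region"].value_counts())
```

With the labels filled in, `value_counts()` on the region column gives the number of reporting places per region.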

# Exploratory data analysis

<b>Identifying the number of places reporting cases by region</b>

In [20]:
#df2['Region'].value_counts().to_frame() # the key 'Region' is not found; the scraped header likely carries trailing whitespace

# Model Development

In [21]:
#Import libraries 
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression 
%matplotlib inline

In [22]:
!conda install -c anaconda xlrd --yes

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.9.2
  latest version: 4.10.1

Please update conda by running

    $ conda update -n base conda



# All requested packages already installed.



# Simple Linear Regression Development

Calculating the R^2 for <b>Deaths</b> using <b>Cases</b>

In [23]:
X = df2[['Cases']] # predictor: cumulative cases
Y = df2['Deaths'] # response: cumulative deaths
lm = LinearRegression()
lm.fit(X,Y)
lm.score(X, Y) # R^2 on the data used for fitting

0.8591020688490538

<b>Using identified cases to predict deaths</b>

In [24]:
X = df2[['Cases']] # predictor: cumulative cases
Y = df2['Deaths'] # response: cumulative deaths
lm = LinearRegression()
lm.fit(X,Y)

Yhat=lm.predict(X) # fitted deaths for every row
Yhat[0:5]

array([5492.29241779, 3771.125482  , 3264.47417019, 4223.52130428,
       3359.61244257])

Calculating the value of the <b>intercept</b>

In [25]:
lm.intercept_

3120.2730018427264

Calculating the value of the <b>slope</b>

In [26]:
lm.coef_

array([0.01778285])

# The final estimated linear model

Using the formula <b>Y = mX + c</b>

Replacing the values, we get:

<b>Deaths = 0.01778285 X Cases + 3120.2730018427264</b>, based on the data as updated on 17th June 2021 at 09:17pm. The fitted coefficients will change as the numbers of cases and deaths increase.
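
As a worked check of the formula, the sketch below plugs Algeria's case count from the table (133388) into the reported slope and intercept. The function name `predict_deaths` is a hypothetical helper, not part of the notebook.

```python
# Coefficients copied from the fitted model reported above
SLOPE = 0.01778285
INTERCEPT = 3120.2730018427264

def predict_deaths(cases):
    """Evaluate Deaths = SLOPE * Cases + INTERCEPT (hypothetical helper name)."""
    return SLOPE * cases + INTERCEPT

algeria = predict_deaths(133388)    # Algeria's case count from the table
print(round(algeria, 2))
print(round(predict_deaths(0), 2))  # with zero cases the model still returns the intercept
```

For Algeria this reproduces the first fitted value shown earlier (about 5492), against 3571 reported deaths. The intercept also means that a place with almost no cases is still assigned roughly 3120 deaths, which is worth keeping in mind when reading the fit.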

# Measures for In-Sample Evaluation of Linear Regression

<h1>a). R-squared</h1>

R squared, also known as the <b>coefficient of determination</b>, is a measure to indicate how close the data is to the fitted regression line. The value of the R-squared is the percentage (%) of variation of the response variable (y) that is explained by a linear model.
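
The definition can be made concrete with a short sketch that computes R-squared by hand as one minus the ratio of unexplained to total variation. The helper name `r_squared` and the numbers are made up for illustration and are not taken from the dataset.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)     # variation left unexplained by the model
    ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Made-up numbers, not taken from the dataset
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(r_squared(y_true, y_pred))
```

For a linear regression, `lm.score(X, Y)` returns this same quantity, so the hand calculation is mainly useful for seeing what the number measures.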

In [27]:
#Cases_fit
lm.fit(X, Y)
# Find the R^2
print('The R-square is: ', lm.score(X, Y))

The R-square is:  0.8591020688490538


I conclude that ~<b>85.9%</b> of the variation in <b>Deaths</b> is explained by this simple linear model "Cases_fit". Because the score is computed on the same data used for fitting, it is an in-sample measure.

<h1>b). Mean Squared Error (MSE)</h1>

The Mean Squared Error is the <b>average of the squared errors</b>, where each error is the difference between the actual value (y) and the estimated value (ŷ).

<b>Let's calculate the MSE of our data</b>


We can predict the output i.e., "yhat" using the <b>predict method</b>, where X is the input variable:

In [28]:
Yhat=lm.predict(X)
print('The output of the first five predicted value is: ', Yhat[0:5])

The output of the first five predicted value is:  [5492.29241779 3771.125482   3264.47417019 4223.52130428 3359.61244257]



Let's import the function <b>mean_squared_error</b> from the module metrics

In [29]:
from sklearn.metrics import mean_squared_error

<b>We compare the predicted results with the actual results</b>

In [30]:

mse = mean_squared_error(df2['Deaths'], Yhat)
print('The mean square error of Death and predicted value is: ', mse)

The mean square error of Death and predicted value is:  579868953.4089152
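
The MSE is expressed in squared deaths, which is hard to read directly. The sketch below shows the calculation on made-up values, then takes the square root of the MSE reported above to get a typical error in deaths. The helper name `mean_squared_error_manual` is hypothetical.

```python
import math

def mean_squared_error_manual(y, y_hat):
    """Average of squared differences between actual and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

# Made-up values to show the calculation
print(mean_squared_error_manual([3, 5, 7], [2, 5, 9]))

# Square root of the MSE reported above, expressed in deaths rather than deaths squared
rmse = math.sqrt(579868953.4089152)
print(round(rmse))
```

A root mean squared error of roughly 24,000 deaths is large next to the death counts of most places in the table, so the high R-squared is probably driven by the few places with very large totals.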


# Conclusion

The scope of this analysis is limited to a <b>Simple Linear Regression Model</b>.
Fitting a <b>Multiple Linear Regression model</b> and comparing the results would be a better way to <b>predict Deaths</b>
from our dataset.
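
As a rough idea of what the multiple-regression step could look like, the sketch below fits two predictors and an intercept by least squares with NumPy. The data is made up and generated from a known formula so the result can be checked. In the notebook, the predictors might be the cumulative cases and the 14-day confirmed cases, but that choice is an assumption.

```python
import numpy as np

# Made-up data generated from y = 2*x1 + 3*x2 + 1, so the true answer is known
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = 2.0 * x1 + 3.0 * x2 + 1.0

# Design matrix: one column per predictor plus a column of ones for the intercept
A = np.column_stack([x1, x2, np.ones_like(x1)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
b1, b2, intercept = coef
print(b1, b2, intercept)
```

scikit-learn's `LinearRegression` handles the same case when `X` is given several columns, so the existing `lm.fit(X, Y)` pattern would carry over once the extra columns are numeric.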

Model for predicting deaths based on cases is:

# Deaths = 0.01778285 X Cases + 3120.2730018427264

As of 17th June 2021 at 09:17pm. However, this model keeps changing as the data is updated daily.

# Thank you

# Author:

This notebook was written by <b>Isaac Etungu</b>.
Isaac is a student at Uganda Technology & Management University in Uganda and holds a bachelor's degree in Agricultural Science & Entrepreneurship.
He is currently pursuing a Master's in Information Technology at Uganda Technology & Management University.