<a id="section-one"></a>
**<font color = DarkBlue size = 5>Import related libraries</font>**

First, we import the libraries we need:


- NumPy: a library for working with arrays. It also provides functions for linear algebra, Fourier transforms, and matrices.

- Pandas: a library that offers data structures and operations for manipulating numerical tables and time series.

- Pandas_Profiling: an open-source Python library that produces a quick exploratory data analysis with just a few lines of code.

- Matplotlib: a plotting library for the Python programming language and its numerical mathematics extension NumPy.

- Sklearn: scikit-learn is a free machine learning library for the Python programming language.


In [2]:
# import libraries
import numpy as np
import pandas as pd 
from pandas_profiling import ProfileReport
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

<a id="section-two"></a>
**<font color = DarkBlue size = 5>Read and Inspect Data</font>**

In [3]:
#load dataset
df = pd.read_csv("owid-covid-data.csv")

In [4]:
#Show first five rows of data
df.head()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
0,AFG,Asia,Afghanistan,2020-02-24,5.0,5.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
1,AFG,Asia,Afghanistan,2020-02-25,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
2,AFG,Asia,Afghanistan,2020-02-26,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
3,AFG,Asia,Afghanistan,2020-02-27,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
4,AFG,Asia,Afghanistan,2020-02-28,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,


In [5]:
# Information of data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132644 entries, 0 to 132643
Data columns (total 65 columns):
 #   Column                                   Non-Null Count   Dtype  
---  ------                                   --------------   -----  
 0   iso_code                                 132644 non-null  object 
 1   continent                                124170 non-null  object 
 2   location                                 132644 non-null  object 
 3   date                                     132644 non-null  object 
 4   total_cases                              125410 non-null  float64
 5   new_cases                                125408 non-null  float64
 6   new_cases_smoothed                       124365 non-null  float64
 7   total_deaths                             114293 non-null  float64
 8   new_deaths                               114489 non-null  float64
 9   new_deaths_smoothed                      124365 non-null  float64
 10  total_cases_per_million         

In [6]:
# Pandas Profiling
rep = ProfileReport(df)
rep

In [7]:
# View column data types
df.dtypes

iso_code                                    object
continent                                   object
location                                    object
date                                        object
total_cases                                float64
                                            ...   
human_development_index                    float64
excess_mortality_cumulative_absolute       float64
excess_mortality_cumulative                float64
excess_mortality                           float64
excess_mortality_cumulative_per_million    float64
Length: 65, dtype: object

Result:
- The dataset has 65 columns with object and float64 data types
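
To see which columns are non-numeric without scrolling through the full `dtypes` listing, something like the following could be used. This is a minimal sketch on a hypothetical three-column frame (`toy` and its values are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame with two text columns and one numeric column
toy = pd.DataFrame({
    "iso_code": ["AFG"],
    "date": ["2020-02-24"],
    "total_cases": [5.0],
})

# Split the column names by dtype family
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print(non_numeric_cols)
print(numeric_cols)
```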

In [8]:
# Check Duplicates
df.duplicated().sum()

0

Result:
- The dataset does not contain any duplicate rows.
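
For reference, `duplicated()` flags every row that repeats an earlier row in full, so summing it counts the extra copies. A small sketch on a hypothetical frame (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy frame: the last row repeats the first one exactly
toy = pd.DataFrame({
    "iso_code": ["AFG", "ALB", "AFG"],
    "new_cases": [5.0, 2.0, 5.0],
})

# The first occurrence is not flagged, only the later repeat
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() would remove the repeat if there were any
deduplicated = toy.drop_duplicates()

print(n_duplicates, len(deduplicated))
```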


In [9]:
# Get statistical data about each column
df.describe(include="all") 

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
count,132644,124170,132644,132644,125410.0,125408.0,124365.0,114293.0,114489.0,124365.0,...,89329.0,88037.0,57713.0,104208.0,123313.0,114771.0,4656.0,4656.0,4656.0,4656.0
unique,237,6,237,681,,,,,,,...,,,,,,,,,,
top,ARG,Africa,Peru,2021-07-14,,,,,,,...,,,,,,,,,,
freq,681,32809,681,234,,,,,,,...,,,,,,,,,,
mean,,,,,2038985.0,8353.112888,8370.887531,50452.5,184.004114,168.641989,...,10.589701,32.748528,50.865182,3.029443,73.260475,0.726207,32052.628565,8.667889,15.883224,781.114218
std,,,,,11639100.0,43547.802492,43035.659063,258963.6,871.591559,817.796127,...,10.502683,13.512865,31.821458,2.455593,7.532205,0.150042,90422.6518,15.939853,31.06281,1166.907295
min,,,,,1.0,-74347.0,-6223.0,1.0,-1918.0,-232.143,...,0.1,7.7,1.188,0.1,53.28,0.394,-31959.4,-27.35,-95.92,-1749.128494
25%,,,,,2388.0,3.0,10.286,80.0,0.0,0.143,...,1.9,21.6,20.859,1.3,67.92,0.602,-167.75,-0.9725,-0.835,-43.100925
50%,,,,,27148.5,104.0,130.143,741.0,2.0,2.0,...,6.3,31.4,49.839,2.4,74.62,0.744,2315.9,5.215,6.465,364.906628
75%,,,,,264101.0,1077.0,1133.429,6468.0,22.0,18.571,...,19.3,41.3,83.241,4.0,78.74,0.845,20475.2,13.58,22.3,1351.300112


Result:
- The dataset contains 65 columns, and almost every column contains NaN values


In [10]:
# Draw Histogram to show distribution of data for each feature
df.hist(figsize=(35,35))
plt.suptitle('Histograms', fontsize=18);
plt.savefig("Histogram.png")

Result:
- A histogram of each numeric column (`df.hist` skips the object columns)


In [12]:
# keep a copy of the original dataset in another variable
df_original=df.copy()

In [13]:
# restore the dataset from the copy, because we sometimes have to start again from this point
df=df_original.copy()
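
Note that plain assignment between two DataFrame variables does not copy any data; both names point at the same object, whereas `DataFrame.copy()` creates an independent one. A minimal sketch on a hypothetical toy frame (`toy`, `alias`, `backup` and the `flag` column are illustrative names only):

```python
import pandas as pd

toy = pd.DataFrame({"date": ["2020-02-24"], "new_deaths": [0.0]})

alias = toy          # same object under a second name, nothing is copied
backup = toy.copy()  # independent copy of the data

toy["flag"] = 1      # in-place change to the original frame

# The alias sees the change, the real copy does not
alias_has_flag = "flag" in alias.columns
backup_has_flag = "flag" in backup.columns

print(alias_has_flag, backup_has_flag)
```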

In [11]:
df.dtypes

iso_code                                    object
continent                                   object
location                                    object
date                                        object
total_cases                                float64
                                            ...   
human_development_index                    float64
excess_mortality_cumulative_absolute       float64
excess_mortality_cumulative                float64
excess_mortality                           float64
excess_mortality_cumulative_per_million    float64
Length: 65, dtype: object

Result:
- The dataset has columns of different data types


In [12]:
# we don't need this column because the row order already stands in for time and date
df=df.drop('date', axis=1)
df.shape

(132644, 64)

Result:
- After dropping the 'date' column we have 132644 rows and 64 columns


In [13]:
# remove this column as well
df=df.drop('location', axis=1)
df.shape

(132644, 63)

Result:
- After dropping the 'location' column we have 132644 rows and 63 columns


In [14]:
# remove this column as well
df=df.drop('continent', axis=1)
df.shape

(132644, 62)

Result:
- After dropping the 'continent' column we have 132644 rows and 62 columns


In [15]:
# remove this column as well
df=df.drop('iso_code', axis=1)
df.shape

(132644, 61)

Result:
- After dropping the 'iso_code' column we have 132644 rows and 61 columns


In [16]:
# remove this column as well
df=df.drop('tests_units', axis=1)
df.shape

(132644, 60)

Result:
- After dropping the 'tests_units' column we have 132644 rows and 60 columns
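
The five drops above could also be written as a single call, since `DataFrame.drop` accepts a list of column names. A sketch on a hypothetical one-row frame (the frame `toy` and the list name `non_numeric` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical one-row frame with the five text columns and two numeric ones
toy = pd.DataFrame({
    "iso_code": ["AFG"],
    "continent": ["Asia"],
    "location": ["Afghanistan"],
    "date": ["2020-02-24"],
    "tests_units": [None],
    "new_cases": [5.0],
    "new_deaths": [0.0],
})

# Drop all non-numeric columns in one step
non_numeric = ["date", "location", "continent", "iso_code", "tests_units"]
numeric_only = toy.drop(columns=non_numeric)

print(numeric_only.shape)
```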


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132644 entries, 0 to 132643
Data columns (total 60 columns):
 #   Column                                   Non-Null Count   Dtype  
---  ------                                   --------------   -----  
 0   total_cases                              125410 non-null  float64
 1   new_cases                                125408 non-null  float64
 2   new_cases_smoothed                       124365 non-null  float64
 3   total_deaths                             114293 non-null  float64
 4   new_deaths                               114489 non-null  float64
 5   new_deaths_smoothed                      124365 non-null  float64
 6   total_cases_per_million                  124766 non-null  float64
 7   new_cases_per_million                    124764 non-null  float64
 8   new_cases_smoothed_per_million           123726 non-null  float64
 9   total_deaths_per_million                 113662 non-null  float64
 10  new_deaths_per_million          

In [18]:
df.dtypes

total_cases                                float64
new_cases                                  float64
new_cases_smoothed                         float64
total_deaths                               float64
new_deaths                                 float64
new_deaths_smoothed                        float64
total_cases_per_million                    float64
new_cases_per_million                      float64
new_cases_smoothed_per_million             float64
total_deaths_per_million                   float64
new_deaths_per_million                     float64
new_deaths_smoothed_per_million            float64
reproduction_rate                          float64
icu_patients                               float64
icu_patients_per_million                   float64
hosp_patients                              float64
hosp_patients_per_million                  float64
weekly_icu_admissions                      float64
weekly_icu_admissions_per_million          float64
weekly_hosp_admissions         

In [19]:
#count NaN 
df.isna().sum()

total_cases                                  7234
new_cases                                    7236
new_cases_smoothed                           8279
total_deaths                                18351
new_deaths                                  18155
new_deaths_smoothed                          8279
total_cases_per_million                      7878
new_cases_per_million                        7880
new_cases_smoothed_per_million               8918
total_deaths_per_million                    18982
new_deaths_per_million                      18786
new_deaths_smoothed_per_million              8918
reproduction_rate                           28218
icu_patients                               116697
icu_patients_per_million                   116697
hosp_patients                              113952
hosp_patients_per_million                  113952
weekly_icu_admissions                      131304
weekly_icu_admissions_per_million          131304
weekly_hosp_admissions                     130518


Result:
- Each column contains NaN values


In [20]:
# replace NaN with 0
df =df.fillna(0)


Result:
- NaN values are replaced with zero because the dataset must not contain NaN values when it is passed to the algorithm
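
Filling with zero is simple, but it does change the column statistics, because the filled rows now count as real zeros. A small sketch on a hypothetical column (the four values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical column where half of the entries are missing
toy = pd.DataFrame({"icu_patients": [np.nan, 10.0, np.nan, 30.0]})

missing_before = int(toy["icu_patients"].isna().sum())
mean_before = toy["icu_patients"].mean()   # NaN is skipped: (10 + 30) / 2

filled = toy.fillna(0)
mean_after = filled["icu_patients"].mean()  # zeros are counted: (0 + 10 + 0 + 30) / 4

print(missing_before, mean_before, mean_after)
```

For columns that are mostly empty, such as 'weekly_icu_admissions' in the listing above, this effect can be large.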


In [21]:
#count NaN 
df.isna().sum()

total_cases                                0
new_cases                                  0
new_cases_smoothed                         0
total_deaths                               0
new_deaths                                 0
new_deaths_smoothed                        0
total_cases_per_million                    0
new_cases_per_million                      0
new_cases_smoothed_per_million             0
total_deaths_per_million                   0
new_deaths_per_million                     0
new_deaths_smoothed_per_million            0
reproduction_rate                          0
icu_patients                               0
icu_patients_per_million                   0
hosp_patients                              0
hosp_patients_per_million                  0
weekly_icu_admissions                      0
weekly_icu_admissions_per_million          0
weekly_hosp_admissions                     0
weekly_hosp_admissions_per_million         0
new_tests                                  0
total_test

Result:
- After the replacement, there are no NaN values left


In [22]:
# checking the min and max values
df['new_deaths'].max()

18007.0

Result:
- The maximum number of deaths in a single day is 18007


In [23]:
# Max value at index
df['new_deaths'].idxmax()

130561

Result:
- Index of the row in the dataset that contains the 18007 deaths in a day


In [24]:
# min value
df['new_deaths'].min()

-1918.0

Result:
- The minimum number of deaths in a single day is -1918, which is wrong: the minimum number of deaths should be zero, not less than zero


In [25]:
# min value at index
df['new_deaths'].idxmin()

111824

Result:
- Index of the row in the dataset that contains the -1918 deaths in a day
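
The index returned by `idxmax()` or `idxmin()` can be passed to `.loc` to inspect the whole row. A sketch on a hypothetical frame that still has a 'location' column (in the notebook that column has already been dropped at this point, so this is only an illustration):

```python
import pandas as pd

# Hypothetical toy frame; values are invented
toy = pd.DataFrame({
    "location": ["A", "B", "C"],
    "new_deaths": [3.0, 18.0, -2.0],
})

idx_max = toy["new_deaths"].idxmax()  # index label of the largest value
idx_min = toy["new_deaths"].idxmin()  # index label of the smallest value

# Look up the complete rows behind those labels
row_max = toy.loc[idx_max]
row_min = toy.loc[idx_min]

print(row_max["location"], row_min["location"])
```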

In [26]:
# total number of negative values
(df['new_deaths'] < 0).sum()

124

Result:
- Total number of negative values in 'new_deaths'

In [27]:
# replace negative values with 0
df['new_deaths']=df['new_deaths'].clip(lower=0)

Result:
- All negative values in 'new_deaths' are replaced with 0
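
The count-then-clip step can be tried out on a short hypothetical series (the five values below are invented, apart from reusing -1918 as an example of a negative entry):

```python
import pandas as pd

# Hypothetical daily death counts with two invalid negative entries
new_deaths = pd.Series([3.0, -2.0, 0.0, -1918.0, 7.0])

# Comparing the whole series at once gives a boolean mask; summing it counts the True values
n_negative = int((new_deaths < 0).sum())

# clip(lower=0) raises every value below 0 up to 0 and leaves the rest unchanged
clipped = new_deaths.clip(lower=0)

print(n_negative, clipped.tolist())
```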

In [28]:
# check the total number of negative values again
(df['new_deaths'] < 0).sum()

0

Result:
- Total number of negative values in 'new_deaths' after replacing with 0

In [29]:
# convert the complete dataset to float
df =df.astype(float)

Result:
- The dataset is converted to the float data type for further use by the algorithms

In [30]:
df.dtypes

total_cases                                float64
new_cases                                  float64
new_cases_smoothed                         float64
total_deaths                               float64
new_deaths                                 float64
new_deaths_smoothed                        float64
total_cases_per_million                    float64
new_cases_per_million                      float64
new_cases_smoothed_per_million             float64
total_deaths_per_million                   float64
new_deaths_per_million                     float64
new_deaths_smoothed_per_million            float64
reproduction_rate                          float64
icu_patients                               float64
icu_patients_per_million                   float64
hosp_patients                              float64
hosp_patients_per_million                  float64
weekly_icu_admissions                      float64
weekly_icu_admissions_per_million          float64
weekly_hosp_admissions         

In [31]:
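# reset the index; the old index is kept as a new 'index' column
df = df.reset_index()

Result:
- The index is reset and the old row index is kept as an extra 'index' column. This step does not itself handle NaN or infinite values; NaN values were already replaced with 0 above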
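
By default `reset_index()` moves the old index into a new column, which is why the column count grows by one each time it is called. A sketch on a hypothetical two-row frame; `drop=True` is shown as one way to avoid the extra column, not as what the notebook does:

```python
import pandas as pd

toy = pd.DataFrame({"new_cases": [5.0, 0.0]})

# Default behaviour: the old index becomes a column named 'index'
with_index_col = toy.reset_index()

# A second call cannot reuse the name 'index', so pandas names the new column 'level_0'
twice = with_index_col.reset_index()

# drop=True discards the old index instead of keeping it as a column
dropped = toy.reset_index(drop=True)

print(list(with_index_col.columns))
print(list(twice.columns))
print(list(dropped.columns))
```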

In [32]:
# save one column as our label (target column)
target_col=np.array(df['new_deaths'])

Result:
- We set the 'new_deaths' column as our target column because we are going to train an algorithm to predict the number of new deaths that may occur in the near future.

In [33]:
# remove column
df=df.drop('new_deaths', axis=1)
df.shape

(132644, 60)

Result:
- Because this column is the target column, we keep it as 'Y' and use the rest of the dataset as 'X'
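
The separation into features and target can be sketched on a hypothetical two-column frame (the names `X` and `y` follow the 'X'/'Y' wording above and are not variables from the notebook):

```python
import pandas as pd

# Hypothetical toy frame; values are invented
toy = pd.DataFrame({
    "new_cases": [5.0, 0.0, 12.0],
    "new_deaths": [1.0, 0.0, 2.0],
})

y = toy["new_deaths"].to_numpy()    # target values as a NumPy array
X = toy.drop(columns="new_deaths")  # everything except the target

print(X.shape, y.shape)
```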

In [34]:
# reset the index again; the previous index is kept as one more column
df = df.reset_index()

In [35]:
# copy the dataset for further use
mydataset2 = df.copy()
mydataset2.shape

(132644, 61)

In [36]:
# save the dataset column names as a list, because we will use them in the feature selection process
mydataset_list=list(df.columns)

In [37]:
#convert dataset into array
mydataset=np.array(df)

In [38]:
# copy the dataset, which is now in array form
copy_mydataset=mydataset.copy()


In [39]:
# split to 80% for training and 20% for testing 
mydataset_train, mydataset_test, target_col_train, target_col_test=train_test_split(mydataset,target_col, test_size=0.20, random_state=7)

Result:
- Split the dataset into train and test. 
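
As a rough illustration of how `train_test_split` divides the rows, here is a sketch on a hypothetical 10-row array (`X` and `y` are illustrative names, not variables from the notebook). With `test_size=0.20`, 2 of the 10 rows go to the test set and the remaining 8 to the training set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 hypothetical samples with 2 features each
y = np.arange(10)                 # one label per sample

# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=7
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Rows are shuffled before splitting, so neither part keeps the original row order.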

In [40]:
# print the shapes of the split data
print('Training Features Shape:', mydataset_train.shape)
print('Training Labels Shape:', target_col_train.shape)
print('Testing Features Shape:', mydataset_test.shape)
print('Testing Labels Shape:', target_col_test.shape)
print('Dataset Shape:', mydataset.shape)

Training Features Shape: (106115, 61)
Training Labels Shape: (106115,)
Testing Features Shape: (26529, 61)
Testing Labels Shape: (26529,)
Dataset Shape: (132644, 61)


In [None]:
# fit the classifier
nNNAC = KNeighborsClassifier(n_neighbors=20) 
nNNAC.fit(mydataset_train, target_col_train)


Result:
- Apply the K Nearest Neighbor classifier with 'n_neighbors=20'; this hyperparameter value can be tuned.
- Train it on the training samples
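
To make the fit/predict cycle concrete, here is a minimal sketch on a hypothetical one-feature dataset with two well-separated groups (the data and `n_neighbors=3` are chosen for illustration only). `KNeighborsClassifier` treats every distinct target value as a separate class label and predicts by majority vote among the nearest training points:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: three points near 0 (class 0), three near 5 (class 1)
X_train = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Each query point takes the majority class of its 3 nearest training points
pred = knn.predict(np.array([[0.05], [4.9]]))

print(pred)
```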

In [None]:
# predict on the test data
predictions1 = nNNAC.predict(mydataset_test)

Result:
- Feed unseen data to the algorithm

In [None]:
# Plot of deaths forecast by the K-NN classifier
plt.figure()
plt.ylabel('Death')
plt.xlabel('Days')
plt.plot(target_col_test,'-g',label='Actual')
plt.plot(predictions1,'-r',label='KNN')
plt.title('Death_K-NN')
plt.gca().set_xlim(left=1)
plt.legend(fancybox=True, framealpha=0.5)
plt.savefig('DeathForecastingKNN.png',bbox_inches='tight',transparent=True)
plt.show()

In [None]:
# evaluation metrics
print(metrics.accuracy_score(target_col_test, predictions1))

0.4849787025519243


Result:
- Prediction accuracy on the test data
- The accuracy is low, probably because the dataset is very large and contains a large number of features
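
When reading this score, it may help to remember that `accuracy_score` counts only exact matches, so a prediction that is off by a single death counts as wrong. The sketch below compares it with the mean absolute error on hypothetical values (the arrays are invented; `mean_absolute_error` is shown only as a possible complementary metric, it is not used in the notebook):

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted daily death counts
y_true = np.array([0, 2, 10, 22])
y_pred = np.array([0, 2, 11, 20])

# Fraction of predictions that match exactly: 2 of 4
accuracy = metrics.accuracy_score(y_true, y_pred)

# Average size of the error: (0 + 0 + 1 + 2) / 4
mae = metrics.mean_absolute_error(y_true, y_pred)

print(accuracy, mae)
```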


<a id="subsection-three"></a>
**<font color = Teal size = 4 >Summary</font>**

**We load the dataset, analyse it, generate a concise report, clean and pre-process it, split it in an 80%/20% ratio for training and testing, train the K Nearest Neighbor algorithm, and evaluate it. The same approach can also be used to predict the number of upcoming new cases. This is beneficial for governments, which can predict new cases and deaths and take appropriate action to reduce those figures.**