# COVID-19 Project #

The goal of the project is to understand this dataset and gain some insight from it, then to use scikit-learn (sklearn) to train some models on the COVID-19 data and make predictions.

The dataset is from https://health-infobase.canada.ca/covid-19/ and consists of data related to COVID-19. It includes several variables, such as the number of cases per day, deaths per day, provincial rates, and many others.

# 1) Setup #

In this section we import the libraries, load the data file, and run a preliminary analysis to get a better understanding of the data we will be working with.

In [21]:
# Import the libraries: pandas (data processing), NumPy, and matplotlib/seaborn (plotting)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import datetime as dt

# Import scikit-learn models (linear, polynomial, logistic regression), data splitting, and metrics
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.metrics import r2_score

#### Import data ####

Download .csv file from https://health-infobase.canada.ca/covid-19/ to follow along.

In [22]:
# Path to the .csv file. The file is obtained from https://health-infobase.canada.ca/covid-19/
path = 'covid19.csv'

#place data into a DataFrame
df = pd.read_csv(path, header=None)
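
Reading with `header=None` keeps the header line as an ordinary data row, which is why every column shows up as `object` in the cells that follow. As a hedged alternative, pandas can parse the header line itself. The sketch below is only an illustration: it uses a small inline sample copied from the first rows shown in this notebook instead of the real `covid19.csv`, and the names `sample_csv` and `df_sample` are made up for the example. The rest of the notebook keeps `header=None` and refers to columns by position.

```python
import io
import pandas as pd

# Illustrative sample: the header line plus three rows from the output shown in this notebook.
sample_csv = io.StringIO(
    "pruid,prname,prnameFR,date,numconf,numprob,numdeaths,numtotal,"
    "numtested,numrecover,percentrecover,ratetested,numtoday,percentoday\n"
    "35,Ontario,Ontario,31-01-2020,3,0,0,3,,,,,3,3\n"
    "59,British Columbia,Colombie-Britannique,31-01-2020,1,0,0,1,,,,,1,1\n"
    "1,Canada,Canada,31-01-2020,4,0,0,4,,,,,4,4\n"
)

# With the default header=0, the first line becomes the column names,
# so numeric columns are parsed as numbers instead of strings.
df_sample = pd.read_csv(sample_csv)

print(df_sample.dtypes)
```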

#### Data Description #### 

Show data header

In [23]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,pruid,prname,prnameFR,date,numconf,numprob,numdeaths,numtotal,numtested,numrecover,percentrecover,ratetested,numtoday,percentoday
1,35,Ontario,Ontario,31-01-2020,3,0,0,3,,,,,3,3
2,59,British Columbia,Colombie-Britannique,31-01-2020,1,0,0,1,,,,,1,1
3,1,Canada,Canada,31-01-2020,4,0,0,4,,,,,4,4
4,35,Ontario,Ontario,2020-02-08,3,0,0,3,,,,,0,0


#### Data information ####

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 658 entries, 0 to 657
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       658 non-null    object
 1   1       658 non-null    object
 2   2       658 non-null    object
 3   3       658 non-null    object
 4   4       658 non-null    object
 5   5       658 non-null    object
 6   6       648 non-null    object
 7   7       658 non-null    object
 8   8       587 non-null    object
 9   9       59 non-null     object
 10  10      55 non-null     object
 11  11      1 non-null      object
 12  12      658 non-null    object
 13  13      628 non-null    object
dtypes: object(14)
memory usage: 72.1+ KB


#### Show statistical information of dataset ####

This is not very important at this moment, but I'm including it to remind myself of good practice.

In [25]:
df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
count,658,658,658,658,658,658,648,658,587,59,55,1,658,628
unique,16,16,16,57,295,36,103,302,456,39,43,1,158,224
top,35,British Columbia,Canada,2020-04-12,0,0,0,0,0,0,0,ratetested,0,0
freq,56,56,56,15,90,537,414,78,38,7,3,1,242,232


# 2) Data Cleaning # 

We will be cleaning the data here, making it easier for us to develop code for the data.

Find the missing data and present it as a percentage. From this information we can determine which data should not be used.

In [26]:
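print('Percent data missing: ')
# Row 0 holds the header names (the file was read with header=None), so it is not counted as data
print((df.isnull().sum()/(len(df) - 1))*100)

Percent data missing: 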
0       0.000000
1       0.000000
2       0.000000
3       0.000000
4       0.000000
5       0.000000
6       1.522070
7       0.000000
8      10.806697
9      91.171994
10     91.780822
11    100.000000
12      0.000000
13      4.566210
dtype: float64
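
The same calculation can be written without an explicit row count. This is a minimal sketch on a made-up toy frame (the column names `cases`, `tested`, and `recovered` are hypothetical, not taken from the dataset). Note that in this notebook the header line counts as a row because of `header=None`, so the numbers above are computed against the 657 data rows rather than all 658 rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are made up for illustration.
toy = pd.DataFrame({
    "cases": [1, 2, 3, 4],
    "tested": [10, np.nan, 30, np.nan],
    "recovered": [np.nan, np.nan, np.nan, 1],
})

# isna() marks missing cells as True; the mean of booleans per column
# is the fraction of missing values, so multiplying by 100 gives a percentage.
percent_missing = toy.isna().mean() * 100

print(percent_missing)
```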


#### Data removal ####

I am removing several attributes (the numbers are the column positions):

(13) Percent per day: this is not important because we can calculate it ourselves.

(11) Test rate: 100% of this data is missing, so remove it.

(10) Percent recovered: about 92% of the data is missing. When more data becomes available, I would include this.

(9) Number recovered: about 91% of the data is missing. When more data becomes available, I would include this.

(8) Number tested: about 11% of the data is missing. When more data becomes available, I would include this.

(2) French names: removed, not required.

I am also removing the 'Repatriated travellers' rows because they have very incomplete information.

In [27]:
# Drop the unwanted columns: these are mostly NaN, redundant (French names),
# or hold the percent of cases per day, which can be calculated if needed.
df.drop(columns=[13, 11, 10, 9, 8, 2], inplace=True)

# Drop row 0, which holds the original header names (the file was read with header=None)
df.drop(index=0, inplace=True)

# Remove the 'Repatriated travellers' rows; copy so later column assignments act on an independent frame
df = df[df[1] != 'Repatriated travellers'].copy()
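
The columns above were chosen by hand. As a rough sketch of how the "mostly missing" part of that decision could be automated, the example below drops every column whose missing fraction exceeds a cut-off. The toy data, the column names, and the `max_missing` value of 0.5 are all assumptions for illustration, not the rule used in this notebook (which also drops columns for other reasons).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are made up for illustration.
toy = pd.DataFrame({
    "cases": [1, 2, 3, 4],
    "tested": [10, np.nan, 30, 40],
    "recovered": [np.nan, np.nan, np.nan, 1],
})

max_missing = 0.5  # hypothetical cut-off: drop columns with more than 50% missing

# Fraction of missing values per column
missing_fraction = toy.isna().mean()

# Labels of the columns above the cut-off, then drop them
to_drop = missing_fraction[missing_fraction > max_missing].index
cleaned = toy.drop(columns=to_drop)

print(list(cleaned.columns))
```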

#### Redefine headers ####

Redefining the headers to something more descriptive.

In [28]:
#Set header names for data frame
headers = ["ProvinceID","ProvinceNameEN","Date","ConfirmedCases", "ProbableCases","Deceased",
         "Total","TotalToday"]
df.columns = headers
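
Assigning a list to `df.columns` relies on the order of the remaining columns. A hedged alternative is to rename through an explicit mapping from the old positional labels to the new names, so a mistake in the drop step cannot silently shift the names. The sketch below uses a one-row toy frame and only four of the columns; `name_map` is a hypothetical name.

```python
import pandas as pd

# Toy frame that mimics a few of the positional column labels left after the drop step.
toy = pd.DataFrame([[35, "Ontario", "31-01-2020", 3]], columns=[0, 1, 3, 4])

# Explicit mapping: old positional label -> descriptive name
name_map = {0: "ProvinceID", 1: "ProvinceNameEN", 3: "Date", 4: "ConfirmedCases"}
renamed = toy.rename(columns=name_map)

print(list(renamed.columns))
```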

#### Missing data ####

Repeat the missing-data check to ensure that no missing values remain.

In [29]:
print('Percent data missing: ')
print((df.isnull().sum()/len(df))*100)

Percent data missing: 
ProvinceID        0.0
ProvinceNameEN    0.0
Date              0.0
ConfirmedCases    0.0
ProbableCases     0.0
Deceased          0.0
Total             0.0
TotalToday        0.0
dtype: float64


#### Changing dtypes ####

Convert the count columns from strings to integers and the date column to datetime.

In [30]:
# Change the count columns to integers
int_columns = ["ProvinceID", "ConfirmedCases", "ProbableCases", "Deceased", "Total", "TotalToday"]
df[int_columns] = df[int_columns].astype("int")

# Change the date column to datetime; recent pandas versions reject the unit-less "datetime64"
df["Date"] = df["Date"].astype("datetime64[ns]")

df.dtypes

ProvinceID                 int32
ProvinceNameEN            object
Date              datetime64[ns]
ConfirmedCases             int32
ProbableCases              int32
Deceased                   int32
Total                      int32
TotalToday                 int32
dtype: object
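
The raw data shown earlier mixes two date layouts (`31-01-2020` and `2020-02-08`). If the conversion fails or guesses wrongly in a given pandas version, one option is to parse each layout explicitly and combine the results. This is a minimal sketch on a made-up series; it assumes that the dashed strings starting with a day are day-month-year and that the others are year-month-day, which the source does not confirm.

```python
import pandas as pd

# Made-up examples of the two layouts seen in the raw data, plus one invalid value.
dates = pd.Series(["31-01-2020", "2020-02-08", "not a date"])

# Parse each layout separately; values that do not match become NaT.
day_first = pd.to_datetime(dates, format="%d-%m-%Y", errors="coerce")
iso = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# Take the day-first result where it parsed, otherwise fall back to the ISO result.
parsed = day_first.fillna(iso)

print(parsed)
```

Values that match neither layout stay as `NaT`, so they can be inspected instead of being converted silently.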

In [31]:
df.head()

Unnamed: 0,ProvinceID,ProvinceNameEN,Date,ConfirmedCases,ProbableCases,Deceased,Total,TotalToday
1,35,Ontario,2020-01-31,3,0,0,3,3
2,59,British Columbia,2020-01-31,1,0,0,1,1
3,1,Canada,2020-01-31,4,0,0,4,4
4,35,Ontario,2020-02-08,3,0,0,3,0
5,59,British Columbia,2020-02-08,4,0,0,4,3


#### New data format ####

Here I set the province name as the index of the DataFrame. I think this format is cleaner and more organized.

In [32]:
# Use the province name as the index so rows can be selected by province
df.set_index("ProvinceNameEN", inplace=True)
df.head()

ProvinceNameEN,ProvinceID,Date,ConfirmedCases,ProbableCases,Deceased,Total,TotalToday
Ontario,35,2020-01-31,3,0,0,3,3
British Columbia,59,2020-01-31,1,0,0,1,1
Canada,1,2020-01-31,4,0,0,4,4
Ontario,35,2020-02-08,3,0,0,3,0
British Columbia,59,2020-02-08,4,0,0,4,3
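
With the province name as the index, rows for one province can be selected by label. The sketch below rebuilds a few of the rows shown above as a toy frame (only three columns are kept) and shows a label lookup and a per-province aggregate. The variable names are illustrative.

```python
import pandas as pd

# Toy frame built from the rows shown above, indexed by province name.
toy = pd.DataFrame({
    "ProvinceNameEN": ["Ontario", "British Columbia", "Canada", "Ontario", "British Columbia"],
    "Date": pd.to_datetime(["2020-01-31", "2020-01-31", "2020-01-31", "2020-02-08", "2020-02-08"]),
    "ConfirmedCases": [3, 1, 4, 3, 4],
}).set_index("ProvinceNameEN")

# All rows for a single province, selected by index label
ontario = toy.loc["Ontario"]

# Highest confirmed-case count per province, grouping on the index
max_confirmed = toy.groupby(level=0)["ConfirmedCases"].max()

print(ontario)
print(max_confirmed)
```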


# Saving data frame #

Saving the dataframe to a .csv file.

In [36]:
df.to_csv('Covid_19_cleaned_data.csv',index=True)
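
A CSV file does not store dtypes or the index, so both have to be restored when the cleaned file is read back. The following is a small sketch of that round trip on a toy frame, written to an in-memory buffer instead of `Covid_19_cleaned_data.csv` so that it runs on its own. The `buffer` and `restored` names are illustrative.

```python
import io
import pandas as pd

# Toy frame shaped like the cleaned data (only a few columns kept).
toy = pd.DataFrame({
    "ProvinceNameEN": ["Ontario", "Canada"],
    "Date": pd.to_datetime(["2020-01-31", "2020-01-31"]),
    "Total": [3, 4],
}).set_index("ProvinceNameEN")

# Write to an in-memory buffer; index=True keeps the province names in the file.
buffer = io.StringIO()
toy.to_csv(buffer, index=True)
buffer.seek(0)

# Read back: restore the index column and parse the dates again.
restored = pd.read_csv(buffer, index_col="ProvinceNameEN", parse_dates=["Date"])

print(restored.dtypes)
```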