## Objective: 
### Understand the latest Covid-19 ongoing-outbreak dataset for Ontario and draw insights from it


## Import the libraries

In [1]:
import pandas as pd

## Load the data

In [2]:
df = pd.read_csv('ongoing_outbreaks_phu.csv')
df.tail()

Unnamed: 0,date,phu_name,phu_num,outbreak_group,number_ongoing_outbreaks
25677,2021-06-12,WINDSOR-ESSEX COUNTY,2268,2 Congregate Living,1
25678,2021-06-12,YORK REGION,2270,5 Recreational,4
25679,2021-06-12,YORK REGION,2270,4 Workplace,15
25680,2021-06-12,YORK REGION,2270,3 Education,2
25681,2021-06-12,YORK REGION,2270,1 Congregate Care,1


In [3]:
# Check the shape of the data
df.shape

(25682, 5)

#### The Covid dataset consists of 25682 rows and 5 columns.

In [4]:
df.dtypes

date                        object
phu_name                    object
phu_num                      int64
outbreak_group              object
number_ongoing_outbreaks     int64
dtype: object

##### The dataset contains the following types of variables:
##### Categorical variables: date, phu_name, and outbreak_group (date is stored as a string)
##### Quantitative variables: phu_num and number_ongoing_outbreaks, both discrete integers. Note that phu_num is an identifier code for the public health unit, so arithmetic on it is not meaningful.
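The variable roles described above could be made explicit in the dtypes. The following is a minimal sketch on a hypothetical three-row sample (the column names match the dataset; the frame `sample` and the count of 7 are made up for illustration), converting the text columns to `category` and the date strings to `datetime64`:

```python
import pandas as pd

# Hypothetical three-row sample with the same columns as the dataset.
sample = pd.DataFrame({
    "date": ["2021-06-12", "2021-06-12", "2021-06-11"],
    "phu_name": ["YORK REGION", "YORK REGION", "TORONTO"],
    "phu_num": [2270, 2270, 3895],
    "outbreak_group": ["4 Workplace", "3 Education", "4 Workplace"],
    "number_ongoing_outbreaks": [15, 2, 7],
})

# Store the repeated labels as categories and parse the ISO date strings.
typed = sample.astype({"phu_name": "category", "outbreak_group": "category"})
typed["date"] = pd.to_datetime(typed["date"], format="%Y-%m-%d")

print(typed.dtypes)
```

Note that the later cells in this notebook rely on `date` being a string (for example `.str.split`), so a conversion like this would need those cells to be adapted as well.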

In [5]:
# Basic information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25682 entries, 0 to 25681
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   date                      25682 non-null  object
 1   phu_name                  25682 non-null  object
 2   phu_num                   25682 non-null  int64 
 3   outbreak_group            25682 non-null  object
 4   number_ongoing_outbreaks  25682 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 1003.3+ KB


## Data Cleaning

In [6]:
df.isnull().sum()

date                        0
phu_name                    0
phu_num                     0
outbreak_group              0
number_ongoing_outbreaks    0
dtype: int64

### There are no null values in the dataset.

## Descriptive Statistics

In [7]:
df.describe()

Unnamed: 0,phu_num,number_ongoing_outbreaks
count,25682.0,25682.0
mean,2466.00658,6.288996
std,680.605949,10.778658
min,2226.0,1.0
25%,2238.0,1.0
50%,2253.0,2.0
75%,2265.0,6.0
max,5183.0,123.0


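### df.describe() provides descriptive statistics for all numerical columns of the dataset.
### A single row (one region, one outbreak group, one day) records a minimum of 1 and a maximum of 123 ongoing outbreaks. The statistics for phu_num are not meaningful because it is an identifier.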
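Since `phu_num` is an identifier, the summary could be restricted to the count column. This is a small sketch on a hypothetical five-row sample (the values are illustrative and only the column names come from the dataset):

```python
import pandas as pd

# Hypothetical five-row sample; only the column names come from the dataset.
sample = pd.DataFrame({
    "phu_num": [2226, 2227, 2270, 3895, 2268],
    "number_ongoing_outbreaks": [1, 1, 2, 6, 123],
})

# Describe only the column for which mean, quartiles, etc. are meaningful.
count_stats = sample["number_ongoing_outbreaks"].describe()
print(count_stats[["min", "50%", "max"]])
```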

---------------------
## Exploring the dataset and answering the objective queries

## 1. How many regions are there in Ontario province?

In [8]:
df['phu_name'].nunique()

34

### The dataset covers 34 distinct regions (public health units) in Ontario.

## 2. Which region in the Ontario Province has seen the worst Covid outbreak?

In [9]:
s = df.groupby('phu_name')['number_ongoing_outbreaks'].sum().reset_index().sort_values(by='number_ongoing_outbreaks', ascending=False).head(5)
s.style.background_gradient(cmap='Purples')

Unnamed: 0,phu_name,number_ongoing_outbreaks
29,TORONTO,30828
21,PEEL REGION,28965
33,YORK REGION,19651
5,DURHAM REGION,8000
4,CITY OF OTTAWA,7816


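### 'TORONTO' has the largest total with 30828, followed by 'PEEL REGION' with 28965. These figures sum the daily counts of ongoing outbreaks over the whole period, so an outbreak that lasts several days is counted once per day; they are not case counts.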
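Summing the column over all dates is one way to rank regions; the busiest single day is another, and the two can disagree. The sketch below uses a hypothetical two-region sample (`REGION A`, `REGION B`, and all counts are made up) to compare the cumulative total with the peak daily total per region:

```python
import pandas as pd

# Hypothetical sample: two regions observed on two days.
sample = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-01", "2021-01-02"],
    "phu_name": ["REGION A", "REGION A", "REGION A", "REGION B", "REGION B"],
    "outbreak_group": ["4 Workplace", "3 Education", "4 Workplace",
                       "4 Workplace", "4 Workplace"],
    "number_ongoing_outbreaks": [3, 2, 4, 6, 1],
})

# Step 1: total ongoing outbreaks per region per day (across groups).
daily = sample.groupby(["phu_name", "date"])["number_ongoing_outbreaks"].sum()

# Step 2: per region, the sum over days and the largest single-day total.
summary = daily.groupby(level="phu_name").agg(["sum", "max"])
print(summary)
```

In this toy sample `REGION A` leads on the cumulative sum while `REGION B` leads on the peak day, which is why it helps to state which of the two measures "worst" refers to.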

## 3. Which group has witnessed the worst outbreak in the Ontario province?

In [10]:
df['outbreak_group'].unique()

array(['4 Workplace', '1 Congregate Care', '5 Recreational',
       '6 Other/Unknown', '3 Education', '2 Congregate Living'],
      dtype=object)

### There are 6 unique groups in the dataset.
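The group labels carry a numeric prefix (for example '4 Workplace'). If plain names are preferred for reporting, the prefix could be stripped with a regular expression. This is a sketch only; the names `groups` and `group_names` are introduced here for illustration:

```python
import pandas as pd

# The six labels as they appear in the dataset.
groups = pd.Series([
    "4 Workplace", "1 Congregate Care", "5 Recreational",
    "6 Other/Unknown", "3 Education", "2 Congregate Living",
])

# Drop the leading digit(s) and the whitespace that follows them.
group_names = groups.str.replace(r"^\d+\s+", "", regex=True)
print(group_names.tolist())
```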

In [11]:
df.groupby(['outbreak_group']).number_ongoing_outbreaks.sum().reset_index().sort_values(by='number_ongoing_outbreaks',ascending=False)

Unnamed: 0,outbreak_group,number_ongoing_outbreaks
0,1 Congregate Care,49035
3,4 Workplace,46974
2,3 Education,32209
1,2 Congregate Living,20040
4,5 Recreational,9673
5,6 Other/Unknown,3583


### The 'Congregate Care' group has the largest total with 49035, followed by the 'Workplace' group with 46974. As before, these are sums of daily ongoing-outbreak counts, not case counts.

## 4. Which region has witnessed the worst one-day spike in ongoing outbreaks in Ontario?

In [12]:
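# Select the row(s) holding the maximum value instead of hard-coding 123
df_filter = df[df['number_ongoing_outbreaks'] == df['number_ongoing_outbreaks'].max()]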

df_filter

Unnamed: 0,date,phu_name,phu_num,outbreak_group,number_ongoing_outbreaks
19194,2021-04-14,TORONTO,3895,3 Education,123


### The highest single-day count was 123 ongoing outbreaks, recorded in the 'TORONTO' region for the 'Education' group on April 14th, 2021.
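The row with the largest count can also be located directly with `idxmax`, without knowing the maximum in advance. A minimal sketch on a hypothetical three-row sample (the TORONTO row matches the output above; the other two counts are made up):

```python
import pandas as pd

# Hypothetical sample; only the TORONTO row reflects the real dataset.
sample = pd.DataFrame({
    "date": ["2021-04-13", "2021-04-14", "2021-04-14"],
    "phu_name": ["PEEL REGION", "TORONTO", "YORK REGION"],
    "outbreak_group": ["4 Workplace", "3 Education", "3 Education"],
    "number_ongoing_outbreaks": [40, 123, 35],
})

# idxmax returns the index label of the first maximum; loc fetches that row.
peak_row = sample.loc[sample["number_ongoing_outbreaks"].idxmax()]
print(peak_row)
```

`idxmax` returns only the first occurrence, so a comparison against `.max()` is the safer choice when ties are possible.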

## 5. What is the total number of ongoing outbreaks in Ontario as of June 12th, 2021?

In [13]:
mask = df['date'] == '2021-06-12'  # named mask to avoid shadowing the built-in filter()
df_new = df[mask]
df_new.head()

Unnamed: 0,date,phu_name,phu_num,outbreak_group,number_ongoing_outbreaks
25612,2021-06-12,ALGOMA DISTRICT,2226,1 Congregate Care,1
25613,2021-06-12,ALGOMA DISTRICT,2226,4 Workplace,1
25614,2021-06-12,BRANT COUNTY,2227,2 Congregate Living,1
25615,2021-06-12,BRANT COUNTY,2227,1 Congregate Care,1
25616,2021-06-12,BRANT COUNTY,2227,4 Workplace,2


In [14]:
df_new['number_ongoing_outbreaks'].sum()

191

### There were 191 ongoing outbreaks in Ontario as of June 12th, 2021.
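The snapshot date does not have to be typed in by hand. Because the dates are ISO-formatted strings, their maximum is also the latest date. A sketch on a hypothetical sample (all counts are illustrative):

```python
import pandas as pd

# Hypothetical sample spanning two days.
sample = pd.DataFrame({
    "date": ["2021-06-11", "2021-06-12", "2021-06-12"],
    "phu_name": ["YORK REGION", "YORK REGION", "YORK REGION"],
    "number_ongoing_outbreaks": [9, 4, 15],
})

# ISO 'YYYY-MM-DD' strings sort chronologically, so max() is the latest day.
latest = sample["date"].max()
snapshot = sample[sample["date"] == latest]
total = snapshot["number_ongoing_outbreaks"].sum()
print(latest, total)
```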

## 6. Which region has the largest number of ongoing outbreaks as of June 12th, 2021?

In [15]:
data = df_new.groupby(['phu_name']).number_ongoing_outbreaks.sum().reset_index().sort_values(by='number_ongoing_outbreaks',ascending=False)
data.head(20).style.background_gradient(cmap='Reds')

Unnamed: 0,phu_name,number_ongoing_outbreaks
13,PEEL REGION,40
22,YORK REGION,22
18,TORONTO,21
4,DURHAM REGION,15
3,CITY OF OTTAWA,12
12,NIAGARA REGION,12
16,SIMCOE MUSKOKA DISTRICT,11
15,PORCUPINE,8
2,CITY OF HAMILTON,8
8,HALTON REGION,8


### 'PEEL REGION' reported the highest number of ongoing outbreaks (40) in Ontario on June 12th, 2021.

## 7. Which group is most susceptible to Covid-19 exposure in Ontario as of June 12th, 2021?

In [16]:
df_new.groupby(['outbreak_group']).number_ongoing_outbreaks.sum().reset_index().sort_values(by='number_ongoing_outbreaks',ascending=False)

Unnamed: 0,outbreak_group,number_ongoing_outbreaks
3,4 Workplace,85
2,3 Education,30
1,2 Congregate Living,29
0,1 Congregate Care,23
4,5 Recreational,22
5,6 Other/Unknown,2


### The 'Workplace' group has the highest number of ongoing outbreaks (85). The 'Education' and 'Congregate Living' groups have almost equal numbers, 30 and 29 respectively.
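Expressing the counts as shares of the day's total makes the comparison between groups easier to read. The sketch below re-enters the six totals from the output above by hand (the Series name `counts` is introduced here for illustration):

```python
import pandas as pd

# Totals per group on 2021-06-12, copied from the table above.
counts = pd.Series({
    "4 Workplace": 85,
    "3 Education": 30,
    "2 Congregate Living": 29,
    "1 Congregate Care": 23,
    "5 Recreational": 22,
    "6 Other/Unknown": 2,
})

# Percentage of all ongoing outbreaks on that day, rounded to one decimal.
share = (counts / counts.sum() * 100).round(1)
print(share)
```

On the full DataFrame, `value_counts(normalize=True)` would not apply here because each row carries a count; dividing the grouped sums by their total, as above, is the equivalent step.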

## 8. What is the timeframe of the dataset?

In [17]:
date_min = df['date'].min()
date_min

'2020-11-01'

In [18]:
date_max = df['date'].max()
date_max

'2021-06-12'

### The dataset covers ongoing outbreaks from Nov 1st, 2020 to June 12th, 2021.
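Parsing the two boundary dates as timestamps also gives the length of the timeframe. A minimal sketch using the two dates reported above (the variable names are illustrative):

```python
import pandas as pd

# First and last dates in the dataset, as reported above.
start = pd.to_datetime("2020-11-01")
end = pd.to_datetime("2021-06-12")

# Subtracting two timestamps yields a Timedelta.
span = end - start
print(span.days)      # days between the two dates
print(span.days + 1)  # calendar days covered, counting both endpoints
```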

In [19]:
# Split the ISO date string into year, month, and day columns (the values remain strings)
df[["year", "month", "day"]] = df["date"].str.split("-", expand=True)
df = df.drop(columns=['date'])
print("\nNew DataFrame:")
df.head(5)


New DataFrame:


Unnamed: 0,phu_name,phu_num,outbreak_group,number_ongoing_outbreaks,year,month,day
0,BRANT COUNTY,2227,4 Workplace,1,2020,11,1
1,BRANT COUNTY,2227,1 Congregate Care,2,2020,11,1
2,CHATHAM-KENT,2240,5 Recreational,1,2020,11,1
3,CHATHAM-KENT,2240,6 Other/Unknown,1,2020,11,1
4,CHATHAM-KENT,2240,4 Workplace,1,2020,11,1
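An alternative to splitting the string is to parse the date once and read the components through the `.dt` accessor. This keeps the `date` column and yields integer components. The sketch below uses a hypothetical two-row sample:

```python
import pandas as pd

# Hypothetical sample with the first and last dates of the dataset.
sample = pd.DataFrame({
    "date": ["2020-11-01", "2021-06-12"],
    "number_ongoing_outbreaks": [1, 15],
})

# Parse once, then derive integer year, month, and day columns.
sample["date"] = pd.to_datetime(sample["date"], format="%Y-%m-%d")
sample["year"] = sample["date"].dt.year
sample["month"] = sample["date"].dt.month
sample["day"] = sample["date"].dt.day
print(sample)
```

With this approach a later filter would compare against the integer `2021` rather than the string `'2021'`.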


## 9. Which year has seen the highest number of ongoing outbreaks in Ontario?

In [20]:
df.groupby('year')['number_ongoing_outbreaks'].sum().reset_index().sort_values(by='number_ongoing_outbreaks',ascending=False)

Unnamed: 0,year,number_ongoing_outbreaks
1,2021,116427
0,2020,45087


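### The year 2021 has the higher total (116427 versus 45087). Note that the dataset covers only November and December of 2020 but January through June 12th of 2021, so the two years are not directly comparable.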

## 10. Which month of 2021 has seen the highest total of ongoing outbreaks?

In [21]:
# Keep only the 2021 rows (year is stored as a string)
df_month = df[df['year'] == '2021']

month_wise = df_month.groupby('month')['number_ongoing_outbreaks'].sum().reset_index().sort_values(by='number_ongoing_outbreaks',ascending=False)
month_wise.head(5).style.background_gradient(cmap='Blues')

Unnamed: 0,month,number_ongoing_outbreaks
0,1,32170
3,4,23984
1,2,19611
4,5,19146
2,3,18072


### January has the highest monthly total in 2021 with 32170, followed by April with 23984.
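Months differ in length, so the monthly totals could also be compared per day. The sketch below re-enters the five totals from the table above and divides each by the number of days in that month. It assumes every calendar day of each month is present in the dataset, which this notebook has not verified:

```python
import calendar

import pandas as pd

# Monthly totals for 2021, copied from the table above (month number -> total).
monthly_total = pd.Series({1: 32170, 2: 19611, 3: 18072, 4: 23984, 5: 19146})

# Days in each month of 2021; monthrange returns (first weekday, number of days).
days_in_month = pd.Series(
    {month: calendar.monthrange(2021, month)[1] for month in monthly_total.index}
)

# Average daily total per month (assumes no missing days in the data).
per_day = (monthly_total / days_in_month).round(1)
print(per_day.sort_values(ascending=False))
```

Under that assumption the ranking is unchanged: January first, then April, February, May, and March.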