## Gather Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns
import plotly.express as px  # needed for the px.bar, px.pie, px.line and px.scatter calls below
import plotly.graph_objects as go

In [None]:
# Covid-19 Dataset for Saudi Arabia taken from 
# King Abdullah Petroleum Studies and Research Center (KAPSARC) Website
df = pd.read_csv('saudi-arabia-coronavirus-disease-covid-19-situation.csv', sep=';')

#### First, investigate the dataset. We can see that the dataset is not ordered, and some columns need to be converted to more appropriate datatypes.

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
df.head(10)

In [None]:
df.tail(10)

In [None]:
df.info()

#### We can also find how many distinct values each categorical column contains, and how often each one occurs, using the pandas value_counts() method.

In [None]:
df.Indicator.value_counts()

In [None]:
df.Event.value_counts()

In [None]:
df.City.value_counts()

In [None]:
df.region.value_counts()
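The per-column checks above can also be run in one pass. The sketch below uses a small made-up frame (`toy`, with illustrative values only) rather than the real dataset, and assumes the same column names.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "Indicator": ["Cases", "Cases", "Recoveries", "Mortalities"],
    "region": ["Riyadh", "Mecca", "Riyadh", None],
})

categorical_cols = ["Indicator", "region"]

# Number of distinct non-null values per categorical column.
distinct_counts = toy[categorical_cols].nunique()
print(distinct_counts)

# Frequency of each value; dropna=False keeps missing entries visible.
for col in categorical_cols:
    print(toy[col].value_counts(dropna=False))
```

Passing `dropna=False` is a design choice here: it surfaces missing values during the same inspection step instead of hiding them.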

#### Check the region, City and Cases columns for null values.

In [None]:
df[df.region.isnull()]

In [None]:
df[df.City.isnull()]

In [None]:
df[df.Cases.isnull()]

#### The null values seem to be critical cases on the 6 dates shown above. I'm going to remove them, since I don't have any information on which region or city they belong to.
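A more compact way to get the same overview is to count nulls for every column at once and then pull out the affected rows. This is a minimal sketch on a made-up frame (`toy`), assuming the same column names as the dataset.

```python
import pandas as pd
import numpy as np

# Toy frame with a few missing entries (values are illustrative only).
toy = pd.DataFrame({
    "region": ["Riyadh", None, "Mecca"],
    "City": ["Riyadh", None, "Jeddah"],
    "Cases": [10.0, 3.0, np.nan],
})

# Null count per column.
null_counts = toy.isnull().sum()
print(null_counts)

# Every row that has at least one null value in any column.
rows_with_nulls = toy[toy.isnull().any(axis=1)]
print(rows_with_nulls)
```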

In [None]:
# Count fully duplicated rows; 0 means there are none.
df.duplicated().sum()

#### There seem to be no duplicates in the dataset.

## Clean Dataset

1. Replace NaNs in Event column with 'No Event'.
2. Convert Date column to datetime object.
3. Convert Cases column to int datatype.
4. Rename 'Daily / Cumulative' to Daily_Cumulative.
5. Rename region column to Region.
6. Drop rows with null values in [Region, City, Cases].
7. Sort data by Date (ascending).
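The steps listed above could also be written as one method chain. The sketch below runs on a small made-up frame (`raw`, including an invented 'First case' event label) and is only one possible arrangement. It drops null rows before the int conversion because NaN cannot be cast to int.

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the raw column layout (values are illustrative only).
raw = pd.DataFrame({
    "Date": ["2020-03-03", "2020-03-02", "2020-03-04"],
    "Daily / Cumulative": ["Daily", "Daily", "Daily"],
    "Indicator": ["Cases", "Cases", "Critical cases"],
    "Event": [None, "First case", None],
    "region": ["Riyadh", "Eastern Region", None],
    "City": ["Riyadh", "Qatif", None],
    "Cases": [2.0, 1.0, np.nan],
})

cleaned = (
    raw.rename(columns={"Daily / Cumulative": "Daily_Cumulative",
                        "region": "Region"})
       .assign(
           Event=lambda d: d["Event"].fillna("No Event"),   # step 1
           Date=lambda d: pd.to_datetime(d["Date"]),         # step 2
       )
       .dropna(subset=["Region", "City", "Cases"])           # step 6, done before the cast
       .astype({"Cases": int})                               # step 3
       .sort_values("Date")                                  # step 7
       .reset_index(drop=True)
)
print(cleaned)
```

Working on a chain that returns a new frame avoids `inplace=True` calls, which can silently fail to update the original frame when applied to a single column.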

In [None]:
# Make copy of dataframe.
df_clean = df.copy()

In [None]:
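# 1. Replace NaNs in Event column with 'No Event'.
# Assign the result back instead of using inplace=True on the column, which may not update df_clean.
df_clean['Event'] = df_clean['Event'].fillna('No Event')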

In [None]:
# 2. Convert Date column to datetime object.
# pd.to_datetime returns a new Series, so it must be assigned back.
df_clean['Date'] = pd.to_datetime(df_clean['Date'])

In [None]:
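# 3. Convert Cases column to int datatype.
# NaN cannot be cast to int, so rows with a null Cases value are dropped first.
df_clean = df_clean.dropna(subset=['Cases'])
df_clean['Cases'] = df_clean['Cases'].astype(int)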

In [None]:
# 4. Rename 'Daily / Cumulative' to Daily_Cumulative
df_clean.rename(columns={'Daily / Cumulative': 'Daily_Cumulative'}, inplace=True)

In [None]:
# 5. Rename region column to Region.
df_clean.rename(columns={'region': 'Region'}, inplace=True)

In [None]:
# 6. Drop rows with null values in [Region, City, Cases].
df_clean = df_clean.dropna(subset=['Region', 'City', 'Cases'])

In [None]:
# 7. Sort data by Date (ascending).
df_clean.sort_values(by=['Date'], inplace=True, ascending=True)

In [None]:
df_clean.info()

In [None]:
df_clean

## Exploratory Analysis

### Univariate Exploration

To start off the exploratory analysis, I made two count plots of the Region and Indicator columns. These only show how many rows each value of these categorical columns has in the dataset. Since the number of cases differs for each row, the row counts are not an accurate indication of case totals and should not be read as insights on their own.

- The Region count plot shows us that there are 13 Saudi regions, plus the Total value, which makes 14 values. The 3 regions with the most rows in the dataset are Riyadh, Eastern Region and Mecca.
- The Indicator count plot shows us that there are 5 indicators, ordered by row count: Cases, Recoveries, Active cases, Mortalities and lastly Critical cases. This hints at recoveries being high and critical cases being relatively low in the country, although row counts alone cannot confirm it.
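To illustrate why row counts and case totals can disagree, the sketch below compares `value_counts()` with a sum of the Cases column per region. The frame (`toy`) is made up, with numbers chosen only to show the effect.

```python
import pandas as pd

# Toy frame: Riyadh has more rows, Mecca has more cases (illustrative values only).
toy = pd.DataFrame({
    "Region": ["Riyadh", "Riyadh", "Riyadh", "Mecca"],
    "Indicator": ["Cases"] * 4,
    "Daily_Cumulative": ["Daily"] * 4,
    "Cases": [5, 3, 2, 40],
})

# What a count plot shows: number of rows per region.
row_counts = toy["Region"].value_counts()

# Total daily cases per region, which weights each row by its Cases value.
daily_cases = toy[(toy["Indicator"] == "Cases") & (toy["Daily_Cumulative"] == "Daily")]
case_totals = daily_cases.groupby("Region")["Cases"].sum().sort_values(ascending=False)

print(row_counts)
print(case_totals)
```

In this toy data the region with the most rows is not the region with the most cases, which is the caveat noted above.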

In [None]:
sns.set(style='whitegrid')  # set the style before creating the figure so it applies to these axes
fig, ax = plt.subplots(nrows=2, figsize = [16,16])
sns.countplot(data = df_clean, x = 'Region', order = df_clean['Region'].value_counts().index, ax = ax[0])
sns.countplot(data = df_clean, x = 'Indicator', order = df_clean['Indicator'].value_counts().index, ax = ax[1])

plt.show()

### Bivariate Exploration

For the bivariate exploration, I made graphs of the total cases and active cases as of the latest date available in the dataset, which is the 5th of August 2020.

- This graph only shows the cities in the Mecca region. As of the 5th of August, Jeddah has the most active cases of Covid-19, with 3675 cases. Mecca is second with 1584 cases, followed by Taif with 872. You can also zoom in on the graph to see that Hadda, the city with the fewest active cases, has only 1 case.

In [None]:
data = df_clean[df_clean.Region != "Total"]
data = data[data.Region == "Mecca"]
data = data[data.Indicator == "Active cases"]
data = data[data.Date == "2020-08-05"]
fig = px.bar(data, x="City", y="Cases", title="Total Active Cases in Makkah Region - 5th August 2020", template="plotly_white")
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- This graph shows the cumulative number of Covid-19 cases up to the latest date in the dataset, which is the 5th of August. Similar to our univariate count plot, the top 3 regions are Mecca, Eastern Region and Riyadh, although their order differs. Mecca and Eastern Region alone take up half the pie chart, meaning that half the cases in the country up to the 5th of August are in those 2 regions.

In [None]:
data = df_clean.query("Daily_Cumulative=='Cumulative'")
data = data.query("Indicator=='Cases'")
data = data[data.Date == "2020-08-05"]
data = data[data.Region != "Total"]
fig = px.pie(data, values='Cases', names='Region', title='Total Cases by Region - 5th August 2020')
fig.show()
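A statement like "two regions take up half the pie" can be checked numerically by computing each region's share of the total. The sketch below uses made-up region names and counts (`snapshot`), not the real figures.

```python
import pandas as pd

# Toy single-date snapshot of cumulative cases (illustrative values only).
snapshot = pd.DataFrame({
    "Region": ["A", "B", "C", "D"],
    "Cases": [300, 200, 250, 250],
})

# Share of the national total held by each region, largest first.
share = (
    snapshot.set_index("Region")["Cases"] / snapshot["Cases"].sum()
).sort_values(ascending=False)

# Combined share of the two largest regions.
top2_share = share.iloc[:2].sum()
print(share)
print(f"Top 2 regions hold {top2_share:.0%} of cases")
```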

### Multivariate Exploration

I used the Indicator column on various graphs for my multivariate exploration. The Indicator column has 5 values.
1. Active cases
2. Cases
3. Recoveries
4. Mortalities
5. Critical cases

- The following graph shows the cumulative number of cases for each indicator, from the first case of Covid-19 on the 2nd of March 2020 up to the latest date in the dataset, which is the 5th of August. Recoveries have been higher than active cases since the 17th of May 2020. We can also see that Mortalities and Critical cases have been relatively low, which is consistent with the count plots in the univariate exploration.

In [None]:
data = df_clean.query("City=='Total'")
data = data.query("Daily_Cumulative=='Cumulative'")
fig = px.line(data, x="Date", y="Cases", title='Total Cases in Saudi Arabia ', color='Indicator', template="plotly_white")
fig.show()
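The date on which recoveries overtake active cases can also be found programmatically instead of being read off the chart. This is a minimal sketch on a made-up frame (`toy`) with invented dates and counts. It pivots the indicators into columns and takes the first date where one exceeds the other.

```python
import pandas as pd

# Toy cumulative series for two indicators (illustrative dates and values only).
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2020-01-01"] * 2 + ["2020-01-02"] * 2 + ["2020-01-03"] * 2
    ),
    "Indicator": ["Active cases", "Recoveries"] * 3,
    "Cases": [100, 80, 95, 90, 90, 110],
})

# One row per date, one column per indicator.
wide = toy.pivot(index="Date", columns="Indicator", values="Cases")

# Dates on which recoveries exceed active cases, and the earliest of them.
crossed = wide[wide["Recoveries"] > wide["Active cases"]]
first_crossover = crossed.index.min()
print(first_crossover)
```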

- The following bubble chart shows July 2020's daily cases of Covid-19. The size of each bubble is determined by the daily count for each of the 3 indicators shown: Cases, Recoveries and Mortalities. The Active cases indicator does not seem to be present in July; the Cases indicator seems to have replaced it. The highest recovery count for one day was on the 13th of July 2020, with 7718 recoveries.

In [None]:
data = df_clean.query("City=='Total'")
data = data.query("Daily_Cumulative=='Daily'")
fig = px.scatter(data, x="Date", y="Cases", size="Cases", size_max=30, title='Daily Cases in Saudi Arabia in July 2020', color='Indicator', template="plotly_white", range_x=['2020-07-01','2020-07-31'])
fig.show()
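The day with the highest recovery count can be looked up directly with `idxmax()` rather than by hovering over the chart. The sketch below uses a made-up frame (`toy`) with invented dates and counts.

```python
import pandas as pd

# Toy daily data (illustrative dates and values only).
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-02"]),
    "Indicator": ["Recoveries", "Recoveries", "Recoveries", "Cases"],
    "Cases": [10, 50, 30, 99],
})

# Keep only the Recoveries rows, then select the row with the largest count.
recoveries = toy[toy["Indicator"] == "Recoveries"]
peak = recoveries.loc[recoveries["Cases"].idxmax()]
print(peak["Date"], peak["Cases"])
```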

- The dataset has an Event column indicating whether an event happened on a given date, for example the start or end of a curfew for some regions or cities. I plotted these events against the number of daily cases to see whether any event had an impact on the number of daily cases. I noticed a spike in cases after the partial lifting of the curfew in all cities except Makkah, and after the curfew was lifted in all regions. The cases have been gradually decreasing since the 7th of July.

In [None]:
data = df_clean.query("City=='Total'")
data = data.query("Indicator=='Cases'")
data = data.query("Daily_Cumulative=='Daily'")
# The x-axis is not restricted to July here, so the title describes the full date range.
fig = px.line(data, x="Date", y="Cases", title='Daily Cases in Saudi Arabia with Events', template="plotly_white")
event = data.query("Event!='No Event'")
x_val = event.Date
y_val = event.Cases

hover_text= (event.Event).to_numpy()
fig.add_trace(go.Scatter(x=x_val, y=y_val,
                    mode='markers',
                    name='Event',
                    hovertext = hover_text,
                    hoverinfo="text",
                    showlegend=True))
fig.show()