# 3. Bivariate Analysis

This section analyzes pairs of variables. The countries with the highest number of cases are identified and visualized through scatter plots and line graphs, and the trend in each of these countries is analyzed.

This section contains the following: 

    3.1. Country/Region and Confirmed Cases
    3.2. Top five countries in detail

In [1]:
# Importing the required modules

import pandas as pd
import numpy as np
import plotly as ply
import plotly.express as px

In [4]:
# Reading the dataset into a dataframe

bivariate_analysis = pd.read_csv("C:\\Users\\Sharmila\\Desktop\\Dataset for Analysis.csv")

In [5]:
# Printing the dataset

print(bivariate_analysis)

       Unnamed: 0 Country/Region Province/State   Lat  Long Confirmed  \
0               0    Afghanistan              ?  33.0  65.0       0.0   
1               1    Afghanistan              ?  33.0  65.0       0.0   
2               2    Afghanistan              ?  33.0  65.0       0.0   
3               3    Afghanistan              ?  33.0  65.0       0.0   
4               4    Afghanistan              ?  33.0  65.0       0.0   
...           ...            ...            ...   ...   ...       ...   
23491       23491       Zimbabwe              ? -20.0  30.0      23.0   
23492       23492       Zimbabwe              ? -20.0  30.0      23.0   
23493       23493       Zimbabwe              ? -20.0  30.0      24.0   
23494       23494       Zimbabwe              ? -20.0  30.0      25.0   
23495       23495       Zimbabwe              ? -20.0  30.0      25.0   

      Recovered Deaths  
0           0.0    0.0  
1           0.0    0.0  
2           0.0    0.0  
3           0.0    0.0 

## 3.1. Country/Region and Confirmed Cases

This section is divided into: 

    3.1.1. Scatter plot of Country/Region and Confirmed Cases
    3.1.2. Trend in most frequent cases
    3.1.3. Countries with maximum number of Cases

### 3.1.1. Scatter plot of Country/Region and Confirmed Cases

In [6]:
# Overview of the Country/Region and Confirmed columns

country_overview = bivariate_analysis['Country/Region']
print(country_overview.describe())

confirmed_overview = bivariate_analysis['Confirmed']
print(confirmed_overview)

count     23496
unique      185
top       China
freq       2937
Name: Country/Region, dtype: object
0         0.0
1         0.0
2         0.0
3         0.0
4         0.0
         ... 
23491    23.0
23492    23.0
23493    24.0
23494    25.0
23495    25.0
Name: Confirmed, Length: 23496, dtype: object


In [7]:
# Counting the missing (NaN) values in the Country/Region and Confirmed columns

print(country_overview.isna().sum())
print(confirmed_overview.isna().sum())

0
0


Both counts are zero because missing values in this dataset are represented by '?' rather than NaN, so `isna()` does not detect them. The following code counts the '?' entries in each column.

In [8]:
# Frequency of missing values in Country and Confirmed Columns

print(len(bivariate_analysis[bivariate_analysis['Country/Region'] == '?']))
print(len(bivariate_analysis[bivariate_analysis['Confirmed'] == '?']))

0
89
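
Because '?' is an ordinary string, the `Confirmed` column is read as text (dtype `object`). One possible way to handle this, shown here as a minimal sketch on a small hypothetical frame rather than on the real dataset, is to convert the column with `pd.to_numeric` so that '?' becomes NaN and can be counted or dropped in the usual way. The toy values and the names `toy`, `cleaned`, `missing_before`, and `missing_after` are assumptions made for illustration.

```python
import pandas as pd

# Toy frame (hypothetical values) that mimics the '?' placeholder convention
toy = pd.DataFrame({
    "Country/Region": ["Afghanistan", "Afghanistan", "Zimbabwe", "Zimbabwe"],
    "Confirmed": ["0.0", "?", "23.0", "25.0"],
})

# isna() does not see the placeholder, because '?' is a regular string
missing_before = int(toy["Confirmed"].isna().sum())

# Convert to numbers; anything that cannot be parsed (here '?') becomes NaN
cleaned = toy.copy()
cleaned["Confirmed"] = pd.to_numeric(cleaned["Confirmed"], errors="coerce")
missing_after = int(cleaned["Confirmed"].isna().sum())

# Drop the rows whose confirmed count is unknown
cleaned = cleaned.dropna(subset=["Confirmed"])

print(missing_before, missing_after, len(cleaned))
```

Note that `errors="coerce"` turns every unparseable entry into NaN, not only '?', so it is worth checking the placeholder count first, as the cell above does.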


In [9]:
# Accessing rows for which column Country/Region has '?'

country_updated = bivariate_analysis[bivariate_analysis['Country/Region'] == '?' ].index

# Deleting these row indexes from the dataframe

bivariate_analysis.drop(country_updated, inplace=True)

In [10]:
# Verifying that there are no more missing values in the Country/Region column
print(len(bivariate_analysis[bivariate_analysis['Country/Region'] == '?']))

0

In [13]:
# Importing the required modules to plot the graph

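import plotly.graph_objects as go

# Setting X and Y axes: the full columns are used so that each point pairs a
# row's country with its confirmed count. 'Confirmed' is stored as text, so it
# is converted to numbers ('?' becomes NaN).

x = bivariate_analysis['Country/Region']
y = pd.to_numeric(bivariate_analysis['Confirmed'], errors='coerce')

# Plotting the graph

fig = go.Figure(data=go.Scatter(x=x, y=y, mode='markers'))
fig.update_layout(title='Confirmed cases by country',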
                   xaxis_title='Country',
                   yaxis_title='Confirmed cases')
fig.show()

In [12]:
# Importing the required modules to plot the graph

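import plotly.graph_objects as go

# Setting X and Y axes: the full columns are used so that each point pairs a
# row's country with its recovered count. The column is converted to numbers
# in case it is stored as text ('?' becomes NaN).

x = bivariate_analysis['Country/Region']
y = pd.to_numeric(bivariate_analysis['Recovered'], errors='coerce')

# Plotting the graph

fig = go.Figure(data=go.Scatter(x=x, y=y, mode='markers'))
fig.update_layout(title='Recovered cases by country',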
                   xaxis_title='Country',
                   yaxis_title='Recovered cases')
fig.show()

In [14]:
# Importing the required modules to plot the graph

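import plotly.graph_objects as go

# Setting X and Y axes: the full columns are used so that each point pairs a
# row's country with its death count. The column is converted to numbers
# in case it is stored as text ('?' becomes NaN).

x = bivariate_analysis['Country/Region']
y = pd.to_numeric(bivariate_analysis['Deaths'], errors='coerce')

# Plotting the graph

fig = go.Figure(data=go.Scatter(x=x, y=y, mode='markers'))
fig.update_layout(title='Deaths by country',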
                   xaxis_title='Country',
                   yaxis_title='Death cases')
fig.show()

In [16]:
# Most frequent country 

cases = bivariate_analysis['Country/Region'].value_counts(dropna=False)
print(cases)

China              2937
Canada             1335
United Kingdom      979
France              979
Australia           712
                   ... 
North Macedonia      89
Algeria              89
Western Sahara       89
Tunisia              89
New Zealand          89
Name: Country/Region, Length: 185, dtype: int64


### 3.1.3. Countries with maximum number of cases
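
The frequency table above counts rows per country, which reflects how many provinces and days a country has in the dataset rather than how many cases it reported. A rough sketch of ranking countries by confirmed cases instead is given below. It assumes that `Confirmed` holds cumulative counts, so the largest value per province is its latest total, and that a country's total is the sum over its provinces. The toy values and the names `toy`, `province_peak`, `country_total`, and `top_countries` are hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical values) with the same column names as the dataset
toy = pd.DataFrame({
    "Country/Region": ["China", "China", "China", "China",
                       "Canada", "Canada", "Zimbabwe"],
    "Province/State": ["Anhui", "Anhui", "Zhejiang", "Zhejiang",
                       "Ontario", "Ontario", "?"],
    "Confirmed": [60.0, 990.0, 104.0, 1268.0, 3.0, 11.0, 25.0],
})

# Highest count recorded for each province (assumes counts are cumulative)
province_peak = toy.groupby(["Country/Region", "Province/State"])["Confirmed"].max()

# Add up the provinces of each country and sort from largest to smallest
country_total = (
    province_peak.groupby(level="Country/Region").sum().sort_values(ascending=False)
)

# Keep the countries with the most confirmed cases
top_countries = country_total.head(2)
print(top_countries)
```

On the real dataframe, `Confirmed` would first need to be converted to a numeric type, and `head(5)` or `head(10)` would select the top five or ten countries.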

## 3.2. Top five countries in detail

The trend in the number of cases in China, Canada, the United Kingdom, France, and Australia is plotted.
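
The columns printed earlier do not include a date, so a trend line needs some stand-in for time. The sketch below is one possible approach, not necessarily the one intended here: it assumes the rows of each province are already in chronological order, numbers them with `cumcount`, and sums the provinces of a country for each day. The toy values and the names `toy`, `day`, and `trend` are hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical values): two provinces, three consecutive days each
toy = pd.DataFrame({
    "Country/Region": ["China"] * 6,
    "Province/State": ["Anhui"] * 3 + ["Zhejiang"] * 3,
    "Confirmed": [1.0, 9.0, 15.0, 10.0, 27.0, 43.0],
})

# Day index within each province (assumes rows are in chronological order)
toy["day"] = toy.groupby(["Country/Region", "Province/State"]).cumcount()

# Country-level series: sum the provinces for each day
trend = toy.groupby(["Country/Region", "day"], as_index=False)["Confirmed"].sum()
print(trend)
```

The resulting frame can then be drawn with `px.line(trend, x="day", y="Confirmed", color="Country/Region")` to compare several countries on one chart.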

### China

In [33]:
# Importing the required libraries

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Extracting all rows of China cases

country_china = bivariate_analysis['Country/Region'] == "China"

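# Create variable with TRUE if the row has at least one confirmed case
# ('Confirmed' is stored as text, so it is converted first; '?' becomes NaN)
china_confirmed_cases = pd.to_numeric(bivariate_analysis['Confirmed'], errors='coerce') > 0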

final = bivariate_analysis[country_china & china_confirmed_cases]
print(final)

      Unnamed: 0 Country/Region Province/State      Lat      Long Confirmed  \
5073        5073          China          Anhui  31.8257  117.2264       1.0   
5074        5074          China          Anhui  31.8257  117.2264       9.0   
5075        5075          China          Anhui  31.8257  117.2264      15.0   
5076        5076          China          Anhui  31.8257  117.2264      39.0   
5077        5077          China          Anhui  31.8257  117.2264      60.0   
...          ...            ...            ...      ...       ...       ...   
8005        8005          China       Zhejiang  29.1832  120.0934    1268.0   
8006        8006          China       Zhejiang  29.1832  120.0934    1268.0   
8007        8007          China       Zhejiang  29.1832  120.0934    1268.0   
8008        8008          China       Zhejiang  29.1832  120.0934    1268.0   
8009        8009          China       Zhejiang  29.1832  120.0934    1268.0   

     Recovered Deaths  
5073       0.0    0.0  
507

In [35]:
histogram_china = final['Country/Region']
fig = px.histogram(histogram_china, x= "Country/Region")
fig.update_layout(
    title_text='Histogram of cases in China', # title of plot
    xaxis_title_text='Name of the Country', # xaxis label
    yaxis_title_text='Count', # yaxis label
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1 # gap between bars of the same location coordinates
)
fig.show()

In [47]:
import plotly.express as px

# Highest confirmed count recorded in each Chinese province
data = final.copy()
data['Confirmed'] = pd.to_numeric(data['Confirmed'], errors='coerce')
data = data.groupby('Province/State', as_index=False)['Confirmed'].max()

fig = px.bar(data, x='Province/State', y='Confirmed',
             labels={'Confirmed': 'Confirmed cases'}, height=400)
fig.show()