# 2. Univariate Analysis

This notebook contains: 

    2.1. Statistical Overview of the Entire Dataset
    2.2. Univariate Analysis for Categorical Variables
    2.3. Univariate Analysis for Numerical Variables
    2.4. Insights

## 2.1. Statistical Overview of the Entire Dataset

This section provides a brief statistical overview of the entire preprocessed dataset. 

The required packages are imported here:

In [3]:
import pandas as pd
import numpy as np
import plotly as ply
import plotly.express as px

The downloaded preprocessed dataset *Dataset for Analysis* is then read into a Pandas DataFrame:

In [4]:
univariate_analysis = pd.read_csv("C:\\Users\\Sharmila\\Desktop\\Dataset for Analysis.csv")
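
The `Unnamed: 0` column and the `?` placeholders seen in the outputs of this notebook both come from how the file is read. As a minimal sketch (not the approach used in this notebook), the saved index can be reused and `?` can be parsed as a missing value at load time. The three-row CSV below is a hypothetical stand-in for the real file, and the variable name `sample` is illustrative only:

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for "Dataset for Analysis.csv"
sample_csv = io.StringIO(
    ",Country/Region,Province/State,Lat,Long,Confirmed,Recovered,Deaths\n"
    "0,Afghanistan,?,33.0,65.0,0.0,0.0,0.0\n"
    "1,Canada,British Columbia,49.0,-123.0,?,?,1.0\n"
    "2,Zimbabwe,?,-20.0,30.0,25.0,2.0,3.0\n"
)

# index_col=0 reuses the saved index instead of creating "Unnamed: 0";
# na_values="?" turns the "?" placeholder into NaN while parsing
sample = pd.read_csv(sample_csv, index_col=0, na_values="?")

print(sample.dtypes)
```

With `?` parsed as missing, the count columns load as `float64` rather than `object`, so numerical summaries can be computed on them directly.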

In [5]:
# Viewing the entire dataset

print(univariate_analysis)

       Unnamed: 0 Country/Region Province/State   Lat  Long Confirmed  \
0               0    Afghanistan              ?  33.0  65.0       0.0   
1               1    Afghanistan              ?  33.0  65.0       0.0   
2               2    Afghanistan              ?  33.0  65.0       0.0   
3               3    Afghanistan              ?  33.0  65.0       0.0   
4               4    Afghanistan              ?  33.0  65.0       0.0   
...           ...            ...            ...   ...   ...       ...   
23491       23491       Zimbabwe              ? -20.0  30.0      23.0   
23492       23492       Zimbabwe              ? -20.0  30.0      23.0   
23493       23493       Zimbabwe              ? -20.0  30.0      24.0   
23494       23494       Zimbabwe              ? -20.0  30.0      25.0   
23495       23495       Zimbabwe              ? -20.0  30.0      25.0   

      Recovered Deaths  
0           0.0    0.0  
1           0.0    0.0  
2           0.0    0.0  
3           0.0    0.0 

In [6]:
print(univariate_analysis.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23496 entries, 0 to 23495
Data columns (total 8 columns):
Unnamed: 0        23496 non-null int64
Country/Region    23496 non-null object
Province/State    23496 non-null object
Lat               23496 non-null float64
Long              23496 non-null float64
Confirmed         23496 non-null object
Recovered         23496 non-null object
Deaths            23496 non-null object
dtypes: float64(2), int64(1), object(5)
memory usage: 1.4+ MB
None


In [7]:
# First five rows

print(univariate_analysis.head())

   Unnamed: 0 Country/Region Province/State   Lat  Long Confirmed Recovered  \
0           0    Afghanistan              ?  33.0  65.0       0.0       0.0   
1           1    Afghanistan              ?  33.0  65.0       0.0       0.0   
2           2    Afghanistan              ?  33.0  65.0       0.0       0.0   
3           3    Afghanistan              ?  33.0  65.0       0.0       0.0   
4           4    Afghanistan              ?  33.0  65.0       0.0       0.0   

  Deaths  
0    0.0  
1    0.0  
2    0.0  
3    0.0  
4    0.0  


In [8]:
# Last five rows

print(univariate_analysis.tail())

       Unnamed: 0 Country/Region Province/State   Lat  Long Confirmed  \
23491       23491       Zimbabwe              ? -20.0  30.0      23.0   
23492       23492       Zimbabwe              ? -20.0  30.0      23.0   
23493       23493       Zimbabwe              ? -20.0  30.0      24.0   
23494       23494       Zimbabwe              ? -20.0  30.0      25.0   
23495       23495       Zimbabwe              ? -20.0  30.0      25.0   

      Recovered Deaths  
23491       1.0    3.0  
23492       1.0    3.0  
23493       2.0    3.0  
23494       2.0    3.0  
23495       2.0    3.0  


## 2.2. Univariate Analysis of Categorical Variables 

This section analyzes two categorical variables: Country/Region and Province/State. Briefly, they are:
    
    Country/Region: The name of the country where a confirmed case was reported.
    Province/State: The name of the specific region in the country where the case was reported.
    
Tasks performed in this section are:

    2.2.1. Histogram of cases
    2.2.2. Histogram of cases in each region
    2.2.3. Histogram of cases in the same country and province
    


### 2.2.1. Histogram of cases

In [9]:
# Number of unique countries

print(len(univariate_analysis['Country/Region'].unique()))

185


In [10]:
# Frequency of rows (reported entries) for each country

univariate_analysis['Country/Region'].value_counts()

China                   2937
Canada                  1335
United Kingdom           979
France                   979
Australia                712
                        ... 
Italy                     89
Burundi                   89
United Arab Emirates      89
Afghanistan               89
West Bank and Gaza        89
Name: Country/Region, Length: 185, dtype: int64
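
Note that `value_counts()` counts rows, not case totals. Every count in the output above is a multiple of 89 (for example 2937 = 33 × 89), which suggests, as an assumption, that each location contributes 89 rows and that countries with larger counts simply have more provinces listed. A minimal sketch on hypothetical toy data shows how the two quantities can be separated:

```python
import pandas as pd

# Hypothetical miniature of the dataset: two rows per location
toy = pd.DataFrame({
    "Country/Region": ["China", "China", "China", "China", "Italy", "Italy"],
    "Province/State": ["Gansu", "Gansu", "Tibet", "Tibet", "?", "?"],
})

# Number of rows per country (what value_counts reports)
rows_per_country = toy["Country/Region"].value_counts()

# Number of distinct provinces per country ("?" counts as one location)
locations_per_country = toy.groupby("Country/Region")["Province/State"].nunique()

print(rows_per_country)
print(locations_per_country)
```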

The count of cases in each country can be represented as a histogram below:

In [12]:
# Plotting the histogram

histogram_countries = univariate_analysis['Country/Region']
fig = px.histogram(histogram_countries, x= "Country/Region")
fig.update_layout(
    title_text='Histogram of cases in each country', # title of plot
    xaxis_title_text='Name of the Country', # xaxis label
    yaxis_title_text='Count', # yaxis label
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1 # gap between bars of the same location coordinates
)
fig.show()

### 2.2.2. Histogram of cases in each region

In [11]:
# Number of unique regions

print(len(univariate_analysis['Province/State'].unique()))

83


In [13]:
# Frequency of cases in each region

print(univariate_analysis['Province/State'].value_counts())

?                               16198
Shanxi                             89
Northern Territory                 89
British Columbia                   89
Gansu                              89
                                ...  
New South Wales                    89
Bermuda                            89
Mayotte                            89
Australian Capital Territory       89
Tibet                              89
Name: Province/State, Length: 83, dtype: int64
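
Most rows carry the placeholder `?` instead of a province name, so that single category dominates any plot of this column. A minimal sketch, on hypothetical toy data, of counting only the named provinces:

```python
import pandas as pd

# Hypothetical toy column with the same "?" placeholder as the dataset
toy = pd.DataFrame({"Province/State": ["?", "?", "?", "Gansu", "Gansu", "Tibet"]})

# Keep only rows with a named province, then count them
named = toy.loc[toy["Province/State"] != "?", "Province/State"]
province_counts = named.value_counts()

print(province_counts)
```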


The count of cases in each region can be represented in the histogram below:

In [72]:
# Plotting the histogram

histogram_region = univariate_analysis['Province/State']
fig = px.histogram(histogram_region, x="Province/State")
fig.update_layout(
    title_text='Histogram of cases in each region', # title of plot
    xaxis_title_text='Name of the Region', # xaxis label
    yaxis_title_text='Count', # yaxis label
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1 # gap between bars of the same location coordinates
)
fig.show()

### 2.2.3. Histogram of cases in the same country and province

In [15]:
# Number of unique countries (this cell uses only the Country/Region column)

print(len(univariate_analysis['Country/Region'].unique()))

185


In [16]:
# Rows whose Country/Region value already appeared in an earlier row

multiple_cases = pd.DataFrame(univariate_analysis, columns=['Country/Region'])
multiple_cases_output = multiple_cases[multiple_cases.duplicated()]
print(multiple_cases_output)

      Country/Region
1        Afghanistan
2        Afghanistan
3        Afghanistan
4        Afghanistan
5        Afghanistan
...              ...
23491       Zimbabwe
23492       Zimbabwe
23493       Zimbabwe
23494       Zimbabwe
23495       Zimbabwe

[23311 rows x 1 columns]
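
The cell above checks duplicates on `Country/Region` alone. To treat a country and province together as one location, `duplicated` and `drop_duplicates` accept a `subset` of columns. The following is a minimal sketch on hypothetical toy data, not a rerun of the notebook:

```python
import pandas as pd

# Hypothetical toy data: two provinces in one country, one country without provinces
toy = pd.DataFrame({
    "Country/Region": ["Canada", "Canada", "Canada", "Kenya", "Kenya"],
    "Province/State": ["Alberta", "Alberta", "Ontario", "?", "?"],
})

pair_columns = ["Country/Region", "Province/State"]

# Rows whose (country, province) pair already appeared in an earlier row
repeated_pairs = toy[toy.duplicated(subset=pair_columns)]

# One row per distinct (country, province) pair
unique_pairs = toy.drop_duplicates(subset=pair_columns)

print(len(repeated_pairs))
print(len(unique_pairs))
```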


In [17]:
# Frequency of rows for each country

print(univariate_analysis['Country/Region'].value_counts())

China                  2937
Canada                 1335
United Kingdom          979
France                  979
Australia               712
                       ... 
Mongolia                 89
Sierra Leone             89
Serbia                   89
Kenya                    89
Antigua and Barbuda      89
Name: Country/Region, Length: 185, dtype: int64


In [18]:
# Plotting the histogram

histogram_region = univariate_analysis['Country/Region']
fig = px.histogram(histogram_region, x="Country/Region")
fig.update_layout(
    title_text='Histogram of cases in each country and province', # title of plot
    xaxis_title_text='Name of the Country along with province', # xaxis label
    yaxis_title_text='Count', # yaxis label
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1 # gap between bars of the same location coordinates
)
fig.show()

## 2.3. Univariate Analysis of Numerical Variables

This section analyzes five numerical variables: Lat, Long, Confirmed, Recovered, Deaths. 

This section is organized by variable as follows:

    2.3.1. Lat
    2.3.2. Lat and Long
    2.3.3. Confirmed, Recovered, Deaths

### 2.3.1. Lat

A statistical overview of the column is followed by a histogram.

In [19]:
# Statistical overview of the column 'Lat'

univariate_analysis.Lat.describe()

count    23496.000000
mean        21.529941
std         24.745761
min        -51.796300
25%          7.405000
50%         23.659750
75%         41.227200
max         71.706900
Name: Lat, dtype: float64

In [21]:
# Plotting the histogram

histogram_lat = univariate_analysis['Lat']
fig = px.histogram(histogram_lat, x="Lat")
fig.update_layout(
    title_text='Histogram of latitude occurrence', # title of plot
    xaxis_title_text='Lat', # xaxis label
    yaxis_title_text='Count', # yaxis label
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1 # gap between bars of the same location coordinates
)
fig.show()

### 2.3.2. Lat and Long

This section contains the following subsections: 

    2.3.2.1. Minimum latitude
    2.3.2.2. Maximum latitude
    2.3.2.3. Minimum longitude
    2.3.2.4. Maximum longitude
    2.3.2.5. Entries with same latitude and longitude

#### 2.3.2.1. Minimum latitude

In [22]:
# Finding the minimum latitude value

print(univariate_analysis['Lat'].min())

-51.7963


In [23]:
# Accessing the index of that value

latitude_min = univariate_analysis.loc[univariate_analysis['Lat'] == '-51.7963']
print(latitude_min)

Empty DataFrame
Columns: [Unnamed: 0, Country/Region, Province/State, Lat, Long, Confirmed, Recovered, Deaths]
Index: []



elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
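
The lookup above returns an empty DataFrame because `Lat` is a `float64` column while `'-51.7963'` is a string, so no element compares equal. A minimal sketch of the same lookup with a numeric comparison, using hypothetical toy rows and placeholder country names:

```python
import pandas as pd

# Hypothetical toy rows; only the Lat values are taken from the outputs above
toy = pd.DataFrame({
    "Country/Region": ["Country A", "Country B", "Country C"],
    "Lat": [33.0, -51.7963, 71.7069],
})

# Compare against the float minimum itself rather than a string literal
latitude_min = toy.loc[toy["Lat"] == toy["Lat"].min()]

print(latitude_min)
```

Comparing against `toy["Lat"].min()` also avoids retyping the value by hand, which matters for floats that are printed with rounding.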



#### 2.3.2.2. Maximum latitude

In [24]:
# Finding maximum latitude

print(univariate_analysis['Lat'].max())

71.7069


#### 2.3.2.3. Minimum longitude

In [25]:
# Finding the minimum longitude value

print(univariate_analysis['Long'].min())

-135.0


In [26]:
# Accessing the index of that value

longitude_min = univariate_analysis.loc[univariate_analysis['Long'] == '-135']
print(longitude_min)

Empty DataFrame
Columns: [Unnamed: 0, Country/Region, Province/State, Lat, Long, Confirmed, Recovered, Deaths]
Index: []
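
As with latitude, the empty result comes from comparing a float column with the string `'-135'`. An alternative sketch uses `idxmin()`, which returns the index label of the minimum directly. It reports only the first occurrence if the minimum is repeated. The toy rows and country names below are hypothetical:

```python
import pandas as pd

# Hypothetical toy rows; only the Long values are taken from the outputs above
toy = pd.DataFrame({
    "Country/Region": ["Country A", "Country B", "Country C"],
    "Long": [65.0, -135.0, 178.065],
})

# Index label of the (first) row holding the minimum longitude
min_label = toy["Long"].idxmin()

# Select that row as a one-row DataFrame
longitude_min = toy.loc[[min_label]]

print(longitude_min)
```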


#### 2.3.2.4. Maximum longitude

In [27]:
# Finding maximum longitude

print(univariate_analysis['Long'].max())

178.065


#### 2.3.2.5. Entries with same latitude and longitude

The following code checks whether multiple rows share exactly the same geographical coordinates.

In [28]:
# Creating a new dataframe with latitude and longitude values from the main dataframe univariate_analysis

multiple_cases = pd.DataFrame(univariate_analysis, columns=['Lat', 'Long'])

# Creating another dataframe with only the duplicate values from multiple_cases

rows = multiple_cases[multiple_cases.duplicated()]
print(rows)

        Lat  Long
1      33.0  65.0
2      33.0  65.0
3      33.0  65.0
4      33.0  65.0
5      33.0  65.0
...     ...   ...
23491 -20.0  30.0
23492 -20.0  30.0
23493 -20.0  30.0
23494 -20.0  30.0
23495 -20.0  30.0

[23236 rows x 2 columns]
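
Since `duplicated()` flags every occurrence after the first, 23236 duplicates out of 23496 rows implies 260 distinct coordinate pairs. A minimal sketch, on hypothetical toy data, of counting rows per coordinate pair directly with `groupby`:

```python
import pandas as pd

# Hypothetical toy data: three rows at one coordinate pair, two at another
toy = pd.DataFrame({
    "Lat": [33.0, 33.0, 33.0, -20.0, -20.0],
    "Long": [65.0, 65.0, 65.0, 30.0, 30.0],
})

# Number of rows recorded at each distinct (Lat, Long) pair
rows_per_coordinate = toy.groupby(["Lat", "Long"]).size()

print(rows_per_coordinate)

# Total rows minus distinct pairs equals the number of rows flagged by duplicated()
print(len(toy) - len(rows_per_coordinate))
```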


In [31]:
# Creating a new dataframe by excluding missing values from Longitude column

rows_updated_long = rows[rows.Long != '?']
print(rows_updated_long)

# Creating a new dataframe by excluding missing values in the Latitude column of rows_updated_long

rows_updated_lat = rows_updated_long[rows_updated_long.Lat != '?']
print(rows_updated_lat)

        Lat  Long
1      33.0  65.0
2      33.0  65.0
3      33.0  65.0
4      33.0  65.0
5      33.0  65.0
...     ...   ...
23491 -20.0  30.0
23492 -20.0  30.0
23493 -20.0  30.0
23494 -20.0  30.0
23495 -20.0  30.0

[23236 rows x 2 columns]
        Lat  Long
1      33.0  65.0
2      33.0  65.0
3      33.0  65.0
4      33.0  65.0
5      33.0  65.0
...     ...   ...
23491 -20.0  30.0
23492 -20.0  30.0
23493 -20.0  30.0
23494 -20.0  30.0
23495 -20.0  30.0

[23236 rows x 2 columns]


### 2.3.3. Confirmed, Recovered, and Deaths

    2.3.3.1. Minimum Confirmed, Recovered, and Deaths 
    2.3.3.2. Maximum Confirmed, Recovered, and Deaths
    2.3.3.3. Frequency of Confirmed, Recovered, and Deaths
    2.3.3.4. Histogram of Confirmed, Recovered, and Deaths

In [33]:
# Overview of the Confirmed column

print(univariate_analysis['Confirmed'].describe())

count     23496
unique     2614
top         0.0
freq      10045
Name: Confirmed, dtype: object


In [34]:
# Overview of the Recovered column

print(univariate_analysis['Recovered'].describe())

count     23496
unique     1334
top         0.0
freq      13137
Name: Recovered, dtype: object


In [35]:
# Overview of the Deaths column

print(univariate_analysis['Deaths'].describe())

count     23496
unique      759
top         0.0
freq      16141
Name: Deaths, dtype: object


#### 2.3.3.1. Minimum Confirmed, Recovered, and Deaths

In [36]:
# Finding the minimum Confirmed

print(univariate_analysis['Confirmed'].min())

-1.0


In [37]:
# Finding the minimum Recovered

print(univariate_analysis['Recovered'].min())

0.0


In [38]:
# Finding the minimum Deaths

print(univariate_analysis['Deaths'].min())

-1.0
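
Because `Confirmed`, `Recovered`, and `Deaths` have dtype `object`, `describe()` reports `unique`/`top`/`freq` instead of a mean and quartiles, and `min()` compares the values as text. A minimal sketch of converting such a column to numbers first, assuming the values are stored as strings with `?` marking missing entries (the toy Series is hypothetical):

```python
import pandas as pd

# Hypothetical toy column stored as strings, with "?" for a missing entry
toy = pd.Series(["0.0", "-1.0", "25.0", "?", "100.0"], name="Confirmed")

# On strings, min/max are lexicographic, so "?" is the "largest" value
print(toy.min(), toy.max())

# errors="coerce" converts unparseable entries such as "?" to NaN
confirmed_numeric = pd.to_numeric(toy, errors="coerce")

print(confirmed_numeric.min(), confirmed_numeric.max())
print(confirmed_numeric.isna().sum())
```

After conversion, `describe()` on the numeric Series returns the usual count, mean, standard deviation, and quartiles.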


#### 2.3.3.3. Frequency of Confirmed, Recovered, and Deaths

In [40]:
# Counting the frequency of each value in the Confirmed column
confirmed_count = univariate_analysis['Confirmed'].value_counts()
print(confirmed_count)

0.0        10045
1.0         1011
3.0          437
2.0          436
4.0          295
           ...  
2994.0         1
5690.0         1
8928.0         1
724.0          1
31728.0        1
Name: Confirmed, Length: 2614, dtype: int64


In [41]:
# Counting the frequency of each value in the Recovered column
recovered_count = univariate_analysis['Recovered'].value_counts()
print(recovered_count)

0.0        13137
?           1246
1.0         1076
2.0          564
4.0          310
           ...  
77000.0        1
52096.0        1
525.0          1
63612.0        1
244.0          1
Name: Recovered, Length: 1334, dtype: int64


In [43]:
# Counting the frequency of each value in the Deaths column
death_count = univariate_analysis['Deaths'].value_counts()
print(death_count)

0.0        16141
1.0         1632
2.0          870
3.0          627
6.0          470
           ...  
1368.0         1
724.0          1
15887.0        1
210.0          1
244.0          1
Name: Deaths, Length: 759, dtype: int64


In [44]:
# Frequency of each Confirmed value, sorted by value (as text, since the column has dtype object)

print(univariate_analysis.groupby('Confirmed').size())

Confirmed
-1.0          8
0.0       10045
1.0        1011
10.0        182
100.0        11
          ...  
997.0         1
9976.0        1
998.0         2
999.0         2
?            89
Length: 2614, dtype: int64
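
The ordering above (`1.0`, `10.0`, `100.0`, ..., `999.0`, `?`) is lexicographic because the values are compared as text. A minimal sketch of obtaining a numerically ordered frequency table, assuming string values with `?` for missing entries (the toy Series is hypothetical):

```python
import pandas as pd

# Hypothetical toy column stored as strings
toy = pd.Series(["0.0", "1.0", "10.0", "2.0", "2.0", "?", "100.0"], name="Confirmed")

# Grouping the strings sorts the keys as text: "100.0" comes before "2.0"
lexicographic = toy.groupby(toy).size()
print(lexicographic)

# Convert to numbers first ("?" becomes NaN and is dropped by value_counts),
# then sort the frequency table by the numeric value
numeric = pd.to_numeric(toy, errors="coerce")
numeric_order = numeric.value_counts().sort_index()
print(numeric_order)
```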


#### 2.3.3.4. Histogram of Confirmed Cases

In [45]:
histogram_confirmed = univariate_analysis['Confirmed']
fig = px.histogram(histogram_confirmed, x="Confirmed")
fig.update_layout(
    title_text='Histogram of Confirmed cases', # title of plot
    xaxis_title_text='Case number', # xaxis label
    yaxis_title_text='Count', # yaxis label
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1 # gap between bars of the same location coordinates
)
fig.show()

In [46]:
import plotly.graph_objects as go

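# One slice per distinct Confirmed value, sized by how often that value occurs
confirmed_counts = univariate_analysis['Confirmed'].value_counts()
labels = confirmed_counts.index
values = confirmed_counts.values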
print(len(values))


2614


In [47]:
# A pie chart of the distribution of Confirmed cases

fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.show()
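
With 2614 distinct values, a pie chart of raw `Confirmed` values is hard to read. One option, sketched here on hypothetical numeric toy data, is to group the values into a few ranges with `pd.cut` and chart the range counts instead. The bin edges and labels are illustrative choices, not taken from this notebook:

```python
import pandas as pd

# Hypothetical toy values, already numeric
confirmed = pd.Series([0.0, 0.0, 1.0, 5.0, 50.0, 500.0, 31728.0], name="Confirmed")

# Illustrative bin edges; intervals are right-closed, e.g. (0, 10]
bins = [-float("inf"), 0, 10, 100, 1000, float("inf")]
labels = ["0 or less", "1-10", "11-100", "101-1000", "over 1000"]

# Assign each value to a range, then count rows per range in bin order
binned = pd.cut(confirmed, bins=bins, labels=labels)
bin_counts = binned.value_counts().sort_index()

print(bin_counts)
```

The resulting `bin_counts.index` and `bin_counts.values` could then be passed as `labels` and `values` to `go.Pie`.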