# Exploratory Data Analysis Cookbook

## Chapter 1: Summary Statistics

We are working with Covid-19 data for this exercise.

In [1]:
# import libraries that we will need to load and manipulate the data
import numpy as np
import pandas as pd

In [4]:
covid_data = pd.read_csv("covid-data.csv")
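# note: head is written here without parentheses, so pandas shows the bound method
# (and the frame's repr) instead of calling it; covid_data.head() returns the first five rows
covid_data.head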

<bound method NDFrame.head of      iso_code continent     location        date  total_cases  new_cases  \
0         AFG      Asia  Afghanistan  24/02/2020            5          5   
1         AFG      Asia  Afghanistan  25/02/2020            5          0   
2         AFG      Asia  Afghanistan  26/02/2020            5          0   
3         AFG      Asia  Afghanistan  27/02/2020            5          0   
4         AFG      Asia  Afghanistan  28/02/2020            5          0   
...       ...       ...          ...         ...          ...        ...   
5813      NGA    Africa      Nigeria  06/10/2022       265741        236   
5814      NGA    Africa      Nigeria  07/10/2022       265741          0   
5815      NGA    Africa      Nigeria  08/10/2022       265816         75   
5816      NGA    Africa      Nigeria  09/10/2022       265816          0   
5817      NGA    Africa      Nigeria  10/10/2022       265816          0   

      new_cases_smoothed  total_deaths  new_deaths  new_d

In [5]:
covid_data_sub = covid_data[['iso_code', 'continent', 'location', 'date', 'total_cases', 'new_cases']]

In [6]:
covid_data_sub.head(5)

  iso_code continent     location        date  total_cases  new_cases
0      AFG      Asia  Afghanistan  24/02/2020            5          5
1      AFG      Asia  Afghanistan  25/02/2020            5          0
2      AFG      Asia  Afghanistan  26/02/2020            5          0
3      AFG      Asia  Afghanistan  27/02/2020            5          0
4      AFG      Asia  Afghanistan  28/02/2020            5          0


In [8]:
covid_data_sub.dtypes

iso_code       object
continent      object
location       object
date           object
total_cases     int64
new_cases       int64
dtype: object

In [9]:
covid_data_sub.shape

(5818, 6)

__Calculate mean and median__

Both numpy and pandas provide functions for calculating the mean and the median; the cells below use numpy.

In [10]:
data_mean = np.mean(covid_data_sub['new_cases'])

In [11]:
data_mean

8814.365761430045

In [12]:
data_median = np.median(covid_data_sub['new_cases'])
data_median

261.0
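The same two statistics are also available as pandas Series methods. A minimal sketch, using a small made-up Series (`new_cases_demo` is a hypothetical stand-in, not the Covid data):

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for covid_data_sub['new_cases']
new_cases_demo = pd.Series([0, 0, 5, 12, 261, 900, 40000])

# numpy functions, as used in the cells above
np_mean = np.mean(new_cases_demo)
np_median = np.median(new_cases_demo)

# equivalent pandas Series methods
pd_mean = new_cases_demo.mean()
pd_median = new_cases_demo.median()

print(np_mean, pd_mean)
print(np_median, pd_median)
```

One practical difference: the pandas methods skip missing values (NaN) by default, which matters once a column has gaps.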

__Identifying the mode__

In [13]:
# first, import an additional module from scipy
from scipy import stats

In [14]:
data_mode = stats.mode(covid_data_sub['new_cases'])
data_mode

ModeResult(mode=0, count=805)

In [17]:
# stats.mode returns a ModeResult with two fields (mode, count); the first entry is the new_cases mode
data_mode[0]

0

In [19]:
# learning! stats.mode only accepts numeric input (as of SciPy 1.11), so it fails on a text column like continent
continent_mode = stats.mode(covid_data_sub['continent'])
continent_mode[0]

TypeError: Argument `a` is not recognized as numeric. Support for input that cannot be coerced to a numeric array was deprecated in SciPy 1.9.0 and removed in SciPy 1.11.0. Please consider `np.unique`.
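For a text column, pandas can find the most frequent value directly. A minimal sketch, assuming a small made-up Series (`continent_demo` is a hypothetical stand-in for the continent column):

```python
import pandas as pd

# hypothetical stand-in for covid_data_sub['continent']
continent_demo = pd.Series(['Asia', 'Africa', 'Asia', 'Europe', 'Asia', 'Africa'])

# Series.mode() returns a Series because several values can tie; take the first entry
continent_mode = continent_demo.mode()[0]

# value_counts() lists how often each value occurs, most frequent first
continent_counts = continent_demo.value_counts()

print(continent_mode)
print(continent_counts)
```

`value_counts()` is often the more informative of the two, since it shows how dominant the mode actually is.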

__Checking the variance of the data set__

Both pandas and numpy can calculate variance and standard deviation, although their defaults differ (numpy divides by n, pandas by n - 1).

In [20]:
data_variance = np.var(covid_data_sub['new_cases'])
data_variance

451321915.92810047
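One detail worth checking when switching between the two libraries is the `ddof` ("delta degrees of freedom") argument. The sketch below uses a small made-up Series (`values` is hypothetical) to show the population variance (`ddof=0`, the numpy default) next to the sample variance (`ddof=1`, the pandas default):

```python
import numpy as np
import pandas as pd

# hypothetical example values, not the Covid data
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# numpy default: population variance, dividing by n
pop_var = np.var(values)

# pandas default: sample variance, dividing by n - 1
sample_var = values.var()

# numpy gives the sample variance when ddof=1 is passed explicitly
sample_var_np = np.var(values, ddof=1)

print(pop_var, sample_var, sample_var_np)
```

With several thousand rows the two results are nearly identical, but on small samples the gap is noticeable.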

__Standard deviation__

Standard deviation is the square root of the variance, so it is expressed in the same units as the original column. This makes it easier to interpret than the variance.

Variance is the average of the squared distances between each value and the mean. Taking its square root gives the standard deviation, which the cell below shows to be about 21,000 new cases. Roughly speaking, a typical daily count in the data set lies about 21k cases from the mean. With the mean of about 8.8k calculated above, one standard deviation either side spans roughly -12k to 30k, or 0 to 30k in practice, since new cases cannot be negative. Because the mean (about 8.8k) sits far above the median (261), the data are strongly right-skewed, so this range is only a rough guide.

In [21]:
data_sd = np.std(covid_data_sub['new_cases'])
data_sd

21244.33844411495
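To make the relationship between variance and standard deviation concrete, the calculation can be spelled out step by step. This is a sketch on a small made-up array (`values` is hypothetical), using the population formulas that `np.var` and `np.std` apply by default:

```python
import numpy as np

# hypothetical example values, not the Covid data
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# step 1: the mean of the values
mean = values.mean()

# step 2: squared distance of each value from the mean
squared_deviations = (values - mean) ** 2

# step 3: variance is the average squared distance
variance = squared_deviations.mean()

# step 4: standard deviation is the square root of the variance
sd = np.sqrt(variance)

print(variance, sd)
```

Here the variance is 4.0 and the standard deviation is 2.0, matching `np.var(values)` and `np.std(values)`.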

__Max and Min__

How far apart are the largest and smallest values of new cases in our data set?

In [22]:
data_max = np.max(covid_data_sub['new_cases'])
data_min = np.min(covid_data_sub['new_cases'])

print(data_min, data_max)

0 287149


In [24]:
# in this case, since data_min = 0, data_range = data_max
data_range = data_max - data_min
data_range

287149

__Percentiles__

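Percentiles divide the sorted data into 100 equal-sized groups: the 60th percentile is the value below which about 60% of the observations fall. Pandas offers the same calculation through its quantile method, which takes a fraction between 0 and 1 instead of a percentage.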

In [29]:
data_percentile = np.percentile(covid_data_sub['new_cases'], 60)
data_percentile

591.3999999999996
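The numpy and pandas versions differ only in how the cut point is written: 60 for `np.percentile`, 0.60 for `Series.quantile`. A minimal sketch on a made-up Series (`values` is hypothetical):

```python
import numpy as np
import pandas as pd

# hypothetical example values, not the Covid data
values = pd.Series([10, 20, 30, 40, 50])

# numpy: the percentile is given on a 0-100 scale
p60_numpy = np.percentile(values, 60)

# pandas: the same cut point is given as a fraction between 0 and 1
p60_pandas = values.quantile(0.60)

print(p60_numpy, p60_pandas)
```

Both default to linear interpolation between neighbouring values, which is why a percentile can be a number that does not appear in the data, as with the 591.4 result above.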

__Quartiles__

Quartiles divide the data set into 4 equal portions, while percentiles divide it into 100. Quantile is the general term for any cut point that divides data this way. The cell below calculates the third quartile (the 75th percentile).

In [31]:
data_quartile = np.quantile(covid_data_sub['new_cases'], 0.75)
data_quartile

3666.0
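`np.quantile` also accepts a list of fractions, so all three quartiles can be requested in one call. A sketch on a small made-up array (`values` is hypothetical):

```python
import numpy as np

# hypothetical example values, not the Covid data
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# first, second, and third quartiles in a single call
q1, q2, q3 = np.quantile(values, [0.25, 0.5, 0.75])

print(q1, q2, q3)
```

The second quartile is the same value as the median.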

__IQR__

The interquartile range (IQR) is the distance between the 1st and 3rd quartiles, so it describes the spread of the middle 50% of the data. Because it ignores the extremes, it is less sensitive to outliers than the range.

In [32]:
# scipy's stats module provides a ready-made iqr function
data_iqr = stats.iqr(covid_data_sub['new_cases'])
data_iqr

3642.0
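The IQR can also be calculated by hand as the third quartile minus the first. The sketch below uses a made-up array with one deliberate outlier (`values` is hypothetical) to compare the manual result with `stats.iqr`, and to contrast the IQR with the range:

```python
import numpy as np
from scipy import stats

# hypothetical example values with one large outlier (100)
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# IQR by hand: third quartile minus first quartile
q1, q3 = np.percentile(values, [25, 75])
iqr_manual = q3 - q1

# the same calculation using scipy
iqr_scipy = stats.iqr(values)

# the range, for comparison, is dominated by the outlier
data_range = values.max() - values.min()

print(iqr_manual, iqr_scipy, data_range)
```

The outlier stretches the range to 99 while the IQR stays at 4.5, which illustrates why the IQR is the more robust measure of spread.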