# Basic Statistical Concepts with COVID-19 Data

## Importing Libraries 

In [1]:
import numpy as np
import pandas as pd
from scipy import stats

## Reading CSV and Dataset Info

In [2]:
# Load the dataset, then keep only the columns used in this notebook
covid_data = pd.read_csv("covid-data.csv")
covid_data = covid_data[['iso_code', 'continent', 'location', 'date', 'total_cases', 'new_cases']]

covid_data.head()

| | iso_code | continent | location | date | total_cases | new_cases |
|---|---|---|---|---|---|---|
| 0 | AFG | Asia | Afghanistan | 24/02/2020 | 5 | 5 |
| 1 | AFG | Asia | Afghanistan | 25/02/2020 | 5 | 0 |
| 2 | AFG | Asia | Afghanistan | 26/02/2020 | 5 | 0 |
| 3 | AFG | Asia | Afghanistan | 27/02/2020 | 5 | 0 |
| 4 | AFG | Asia | Afghanistan | 28/02/2020 | 5 | 0 |


In [3]:
covid_data.dtypes

iso_code       object
continent      object
location       object
date           object
total_cases     int64
new_cases       int64
dtype: object

In [4]:
covid_data.shape

(5818, 6)

## Statistical Concepts

### Mean

The mean is the average of a dataset: the sum of all the data points divided by the number of data points. It is very sensitive to outliers. An unusually high or low value changes the sum without changing the number of data points, so a few extreme values can pull the mean far away from the typical value.

In [5]:
data_mean = np.mean(covid_data['new_cases'])
data_mean

8814.365761430045
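The definition and the outlier sensitivity described above can be checked on a tiny array. This is only a sketch: the values in `sample` are made up for illustration and are not taken from the COVID-19 dataset.

```python
import numpy as np

# Hypothetical values, invented for illustration only
sample = np.array([2, 4, 6, 8, 10])

# Mean by definition: sum of the values divided by how many there are
manual_mean = sample.sum() / len(sample)
print(manual_mean, np.mean(sample))  # 6.0 6.0

# Append one extreme value: the sum jumps, the count only grows by one
with_outlier = np.append(sample, 1000)
outlier_mean = np.mean(with_outlier)
print(outlier_mean)  # far above every original value
```

A single extreme value moves the mean from 6 to well over 100, even though five of the six values are at most 10.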

### Median

The median is the middle value of a sorted dataset (ascending or descending order). Half of the data points are less than the median, and the other half are greater. Unlike the mean, the median is robust to outliers. To calculate it, we sort the dataset and select the middle value if the number of data points is odd. If it is even, we take the two middle values and average them.

In [6]:
data_median = np.median(covid_data["new_cases"])
data_median

261.0
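The odd and even cases can be written out by hand and compared with `np.median`. The helper `manual_median` and both arrays below are hypothetical, introduced only to illustrate the rule.

```python
import numpy as np

# Hypothetical values, invented for illustration only
odd_sample = np.array([7, 1, 5, 3, 9])        # 5 values
even_sample = np.array([7, 1, 5, 3, 9, 11])   # 6 values

def manual_median(values):
    """Median by definition: sort, then take the middle value(s)."""
    ordered = np.sort(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value
        return float(ordered[mid])
    # Even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_median = manual_median(odd_sample)    # sorted: 1, 3, 5, 7, 9
even_median = manual_median(even_sample)  # sorted: 1, 3, 5, 7, 9, 11
print(odd_median, even_median)  # 5.0 6.0
```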

### Mode

The mode is the value that occurs most frequently in the dataset, or simply put, the most common value in a dataset. Unlike the mean and median, which must be applied to numeric values, the mode can be applied to both numeric and non-numeric values since the focus is on the frequency at which a value occurs. The mode provides quick insights into the most common value.

In [7]:
data_mode = stats.mode(covid_data['new_cases'], keepdims=False)
data_mode

ModeResult(mode=0, count=805)
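Because the mode only counts how often each value occurs, it also works on text. The sketch below uses pandas rather than `scipy.stats.mode`, since pandas handles string data directly; the `continents` series is a made-up example, not a slice of the dataset.

```python
import pandas as pd

# Hypothetical labels, invented for illustration only
continents = pd.Series(["Asia", "Europe", "Asia", "Africa", "Asia", "Europe"])

# Frequency of each distinct value, most frequent first
counts = continents.value_counts()

# Series.mode() returns every value tied for the highest count;
# here there is a single winner, so take the first entry
most_common = continents.mode()[0]
print(most_common, counts[most_common])  # Asia 3
```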

### Data Variance

Just as we may want to know where the center of a dataset lies, we may also want to know how widely spread the dataset is, that is, how far apart its values are from each other. That is the role of the variance. The variance is the average of the squared deviations from the mean, and it gives us a sense of the spread, or variability, of a dataset.

In [8]:
data_variance = np.var(covid_data['new_cases'])
data_variance

451321915.92810047
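To make the calculation concrete, the variance can be computed step by step and compared with `np.var`. Note that `np.var` divides by the number of data points by default (`ddof=0`, the population variance); passing `ddof=1` divides by one less than that and gives the sample variance. The `sample` array is made up for illustration.

```python
import numpy as np

# Hypothetical values, invented for illustration only
sample = np.array([2, 4, 6, 8, 10])

# Step 1: distance of each value from the mean
deviations = sample - sample.mean()

# Step 2: average of the squared distances
manual_variance = np.mean(deviations ** 2)

population_variance = np.var(sample)      # default ddof=0, divides by n
sample_variance = np.var(sample, ddof=1)  # divides by n - 1

print(manual_variance, population_variance, sample_variance)  # 8.0 8.0 10.0
```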

### Standard Deviation

The standard deviation is derived from the variance and is simply the square root of the variance. The standard deviation is typically more intuitive because it is expressed in the same units as the dataset, for example, kilometers (km). The variance, on the other hand, is expressed in the square of those units, for example, kilometers squared (km²), which makes it harder to interpret.

In [9]:
data_sd = np.std(covid_data['new_cases'])
data_sd

21244.33844411495
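The square-root relationship can be verified directly. The array `distances_km` is a hypothetical set of distances in kilometers, chosen only to make the units explicit.

```python
import numpy as np

# Hypothetical distances in kilometers, invented for illustration only
distances_km = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

variance_km2 = np.var(distances_km)  # units: km squared
sd_km = np.std(distances_km)         # units: km, same as the data

# The standard deviation is the square root of the variance
print(variance_km2, sd_km, np.sqrt(variance_km2))
```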

### Range

The range also helps us understand the spread of a dataset or how far apart the dataset’s numbers are from each other. It is the difference between the minimum and maximum values within a dataset.

In [10]:
data_max = np.max(covid_data['new_cases'])
data_min = np.min(covid_data['new_cases'])

print(data_max, data_min)

287149 0


In [11]:
data_range = data_max - data_min
data_range

287149
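NumPy also provides `np.ptp` ("peak to peak"), which returns the same maximum-minus-minimum difference in one call. A small sketch on made-up values:

```python
import numpy as np

# Hypothetical values, invented for illustration only
sample = np.array([3, 12, 7, 0, 25])

manual_range = sample.max() - sample.min()  # 25 - 0
ptp_range = np.ptp(sample)                  # same result in a single call

print(manual_range, ptp_range)  # 25 25
```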

### Percentile

The percentile is an interesting statistic because it can be used to measure the spread of a dataset and, at the same time, identify its center. Percentiles divide the sorted dataset into 100 equal portions, and 99 percentile values act as the cut points between those portions. The p-th percentile is the value below which roughly p% of the data points fall, which lets us determine the values in a dataset above or below a certain limit. The value of the 50th percentile is the same as the median.

In [12]:
data_percentile = np.percentile(covid_data['new_cases'],80)
data_percentile

10130.000000000002
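Two of the claims above are easy to check on a small array: the 50th percentile matches the median, and roughly 80% of the values sit at or below the 80th percentile. The `sample` values are made up, and `np.percentile` is used with its default linear interpolation.

```python
import numpy as np

# Hypothetical values, invented for illustration only
sample = np.array([1, 3, 5, 7, 9, 11])

p50 = np.percentile(sample, 50)
median = np.median(sample)
print(p50, median)  # 6.0 6.0

p80 = np.percentile(sample, 80)
# Fraction of values at or below the 80th percentile
# (close to 0.8, not exact, because the sample is tiny)
share_at_or_below = np.mean(sample <= p80)
print(p80, share_at_or_below)
```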

### Quartile

The quartile is like the percentile because it can be used to measure the spread and identify the center of a dataset. Percentiles and quartiles are both types of quantiles. While percentiles divide the dataset into 100 equal portions, quartiles divide it into 4 equal portions. Three quartiles (Q1, Q2, and Q3) are the cut points between those four portions, and Q2 is the median.

In [13]:
data_quartile = np.quantile(covid_data['new_cases'], 0.75)  # 3rd quartile (Q3)
data_quartile

3666.0
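All three quartiles can be requested in one call by passing a list of fractions to `np.quantile`. The sketch below uses a made-up array of the integers 1 to 9 and also shows that Q3 is the same value as the 75th percentile.

```python
import numpy as np

# Hypothetical values, invented for illustration only
sample = np.arange(1, 10)  # 1, 2, ..., 9

# Q1, Q2 (the median), and Q3 in a single call
q1, q2, q3 = np.quantile(sample, [0.25, 0.5, 0.75])
print(q1, q2, q3)  # 3.0 5.0 7.0

# Quartiles are percentiles under another name: Q3 is the 75th percentile
p75 = np.percentile(sample, 75)
print(p75 == q3)  # True
```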

### Interquartile Range (IQR)

The IQR also measures the spread or variability of a dataset. It is simply the distance between the first and third quartiles (Q3 minus Q1). The IQR is a very useful statistic, especially when we need to identify where the middle 50% of values in a dataset lie. Unlike the range, which can be skewed by very high or low numbers (outliers), the IQR is largely unaffected by outliers because it focuses on the middle 50% of the data. It is also useful when we need to identify outliers in a dataset.

In [14]:
data_IQR = stats.iqr(covid_data['new_cases'])
data_IQR

3642.0
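One common way to use the IQR for outlier detection is the 1.5 × IQR rule: values more than 1.5 IQRs below Q1 or above Q3 are flagged. That threshold is a widely used convention and an assumption here, not something the notebook itself specifies. The `sample` array is made up for illustration.

```python
import numpy as np

# Hypothetical values with one obvious extreme, invented for illustration only
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1, q3 = np.quantile(sample, [0.25, 0.75])
iqr = q3 - q1  # distance between the first and third quartiles

# Assumed convention: flag values beyond 1.5 * IQR from the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = sample[(sample < lower_fence) | (sample > upper_fence)]
print(iqr, lower_fence, upper_fence)  # 4.5 -3.5 14.5
print(outliers)  # [100]
```

The extreme value 100 stretches the range to 99 but leaves the IQR at 4.5, which is why the fences built from it can single that value out.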