# **Exploratory Data Analytics:**

## Generating Summary Statistics:
- Analyzing the mean of a dataset
- Checking the median of a dataset
- Identifying the mode of a dataset
- Checking the variance of a dataset
- Identifying the standard deviation of a dataset
- Generating the range of a dataset
- Identifying the percentiles of a dataset
- Checking the quartiles of a dataset
- Analyzing the interquartile range (IQR) of a dataset

- Analyzing the mean of a dataset
    - the sum of all values divided by the number of values
    - sensitive to outliers

In [13]:
import numpy as np
import pandas as pd 
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df = pd.read_csv('../data/covid-data.csv')
df = df[['iso_code','continent','location','date','total_cases', 'new_cases']]
df.head()

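|   | iso_code | continent | location | date | total_cases | new_cases |
|---|----------|-----------|-------------|------------|-------------|-----------|
| 0 | AFG | Asia | Afghanistan | 24/02/2020 | 5 | 5 |
| 1 | AFG | Asia | Afghanistan | 25/02/2020 | 5 | 0 |
| 2 | AFG | Asia | Afghanistan | 26/02/2020 | 5 | 0 |
| 3 | AFG | Asia | Afghanistan | 27/02/2020 | 5 | 0 |
| 4 | AFG | Asia | Afghanistan | 28/02/2020 | 5 | 0 |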


In [3]:
df.dtypes

iso_code       object
continent      object
location       object
date           object
total_cases     int64
new_cases       int64
dtype: object

In [4]:
df.shape

(5818, 6)

In [6]:
df_mean = np.mean(df['new_cases'])
df_mean

np.float64(8814.365761430045)
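
The mean is the sum of the values divided by how many values there are. A minimal sketch of that definition, using hypothetical toy values rather than the COVID dataset:

```python
import numpy as np

# Hypothetical toy values, not taken from the COVID dataset
values = np.array([5, 0, 0, 0, 5])

# Mean computed by hand: total divided by the number of values
manual_mean = values.sum() / values.size

# np.mean should give the same result
numpy_mean = np.mean(values)

print(manual_mean, numpy_mean)
```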

- Checking the median of a dataset
  - the middle value once the data is sorted
  - much less sensitive to outliers than the mean

In [11]:
df_median = np.median(df['new_cases'])
df_median

np.float64(261.0)
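
To illustrate why the median resists outliers while the mean does not, here is a small sketch with hypothetical toy values (not taken from the COVID dataset):

```python
import numpy as np

# Hypothetical toy values, not taken from the COVID dataset
values = np.array([1, 2, 3, 4, 5])
with_outlier = np.array([1, 2, 3, 4, 500])  # same data, one extreme value

# Without the outlier, mean and median agree
print(np.mean(values), np.median(values))              # 3.0 3.0

# With the outlier, the mean jumps but the median does not move
print(np.mean(with_outlier), np.median(with_outlier))  # 102.0 3.0
```

The notebook outputs show a similar pattern: the mean of `new_cases` (about 8814) is far larger than the median (261), which suggests that a relatively small number of very large daily counts pull the mean upward.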

- Identifying the mode of a dataset
    - the most frequently occurring value
    - it can be applied to both numeric and categorical values

In [20]:
df_mode = stats.mode(df['new_cases'])
df_mode[0]

np.int64(0)
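
The notebook computes the mode of a numeric column. As a sketch of the categorical case, pandas' `Series.mode()` can be used on strings. The column below is a hypothetical toy example, not the dataset's `continent` column:

```python
import pandas as pd

# Hypothetical toy column, not taken from the COVID dataset
continents = pd.Series(['Asia', 'Europe', 'Asia', 'Africa', 'Asia', 'Europe'])

# Series.mode() returns a Series because several values can tie for first place
most_common = continents.mode()
print(most_common.tolist())

# value_counts() shows the frequencies behind the result
print(continents.value_counts())
```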

- Checking the variance of a dataset
    - measures how far the values are spread around the mean

In [24]:
df_var = np.var(df['new_cases'])
df_var

np.float64(451321915.9280954)
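
One detail worth knowing: `np.var` divides by `n` by default (`ddof=0`, the population variance), while pandas' `Series.var()` divides by `n - 1` by default (`ddof=1`, the sample variance). A minimal sketch of the difference, using hypothetical toy values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not taken from the COVID dataset
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

population_var = np.var(values)        # divides by n (ddof=0)
sample_var = np.var(values, ddof=1)    # divides by n - 1
pandas_var = pd.Series(values).var()   # pandas defaults to ddof=1

print(population_var, sample_var, pandas_var)
```

The `np.var` call in the notebook uses the default `ddof=0`, so the value above is the population variance.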

- Identifying the standard deviation of a dataset
    - the square root of the variance
    - expressed in the same units as the data points, which makes it easier to interpret than the variance

In [26]:
df_std = np.std(df['new_cases'])
df_std

np.float64(21244.338444114834)
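
The relationship between the two statistics can be checked directly. A small sketch with hypothetical toy values:

```python
import numpy as np

# Hypothetical toy values, not taken from the COVID dataset
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

std = np.std(values)                    # population standard deviation (ddof=0)
sqrt_of_var = np.sqrt(np.var(values))   # square root of the population variance

print(std, sqrt_of_var)
```

The notebook outputs are consistent with this: the square root of the variance reported earlier (about 451321915.93) is roughly 21244.34.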

- Generating the range of a dataset
    - helps to understand the spread of the dataset, that is, how far apart its smallest and largest values are
    - the difference between the maximum and minimum values in the dataset
    - most useful alongside the variance and standard deviation, because on its own it depends only on the two most extreme values

In [28]:
df_max = np.max(df['new_cases'])
df_min = np.min(df['new_cases'])
df_max

np.int64(287149)

In [29]:
df_min

np.int64(0)

In [32]:
data_range = df_max - df_min
data_range

np.int64(287149)
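
NumPy also provides `np.ptp` ("peak to peak"), which computes the same maximum-minus-minimum difference in one call. A short sketch with hypothetical toy values:

```python
import numpy as np

# Hypothetical toy values, not taken from the COVID dataset
values = np.array([3, 7, 1, 9, 4])

manual_range = values.max() - values.min()  # maximum minus minimum
ptp_range = np.ptp(values)                  # same calculation in one call

print(manual_range, ptp_range)  # 8 8
```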

- Identifying the percentiles of a dataset
    - Percentiles can be used to measure the spread of a dataset and, at the same time, identify its center.
    - Percentiles divide the sorted dataset into 100 equal portions, so the p-th percentile is the value below which roughly p% of the observations fall.
    - It takes 99 percentile cut points to split a dataset into 100 equal portions. The 50th percentile is the same value as the median.

In [34]:
df_percentile = np.percentile(df['new_cases'], 60)
df_percentile

np.float64(591.3999999999996)
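
As a sketch of two points made above, the 50th percentile matches the median, and `np.percentile` accepts a list to return several percentiles at once. The values are a hypothetical toy array, and the results rely on NumPy's default linear interpolation:

```python
import numpy as np

# Hypothetical toy values: the integers 1 to 101
values = np.arange(1, 102)

p50 = np.percentile(values, 50)
median = np.median(values)
print(p50, median)  # 51.0 51.0

# Several percentiles in a single call
p10, p90 = np.percentile(values, [10, 90])
print(p10, p90)
```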

- Checking the quartiles of a dataset
  - Quartiles are like percentiles in that they can be used to measure the spread and identify the center of a dataset.
  - Percentiles and quartiles are both kinds of quantiles.
  - While percentiles divide the dataset into 100 equal portions, quartiles divide it into 4 equal portions.
  - Three quartiles (the 25th, 50th, and 75th percentiles) split the dataset into four equal portions.

In [35]:
df_quartiles = np.quantile(df['new_cases'], 0.75)
df_quartiles

np.float64(3666.0)
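
The cell above computes only the third quartile. As a sketch, all three quartiles can be obtained in one call by passing a list of fractions to `np.quantile`. The values are a hypothetical toy array:

```python
import numpy as np

# Hypothetical toy values: the integers 1 to 101
values = np.arange(1, 102)

# 0.25, 0.5 and 0.75 correspond to the first, second and third quartiles
q1, q2, q3 = np.quantile(values, [0.25, 0.5, 0.75])
print(q1, q2, q3)

# The second quartile is the median
print(np.median(values))
```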

- Analyzing the interquartile range (IQR) of a dataset
    - The IQR also measures the spread or variability of a dataset.
    - It is simply the distance between the first and third quartiles.
    - The IQR is a very useful statistic, especially when we need to identify where the middle 50% of values in a dataset lie.
    - Unlike the range, which can be skewed by very high or low numbers (outliers), the IQR is largely unaffected by outliers because it focuses on the middle 50% of the data. It is also useful for detecting outliers in a dataset.

In [36]:
df_iqr = stats.iqr(df['new_cases'])
df_iqr

np.float64(3642.0)
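
One common way to use the IQR for outlier detection is the 1.5 × IQR rule, which flags values lying more than 1.5 IQRs below the first quartile or above the third. This rule is a widely used convention and not something the notebook itself applies. A minimal sketch with hypothetical toy values:

```python
import numpy as np

# Hypothetical toy values with one obvious extreme point
values = np.array([10, 12, 13, 15, 16, 18, 19, 21, 22, 200])

# First and third quartiles, then the IQR as their difference
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences of the 1.5 * IQR rule (a convention, assumed here for illustration)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(iqr, lower_fence, upper_fence, outliers)
```

For the notebook's data, the reported third quartile (3666.0) and IQR (3642.0) imply a first quartile of 24.0.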