## Descriptive Statistics Part-2

### Quantiles and Percentiles
- Quantiles are cut points that divide a sorted set of numerical data into groups, with each group containing an equal number of observations. Quantiles are measures of position that can be used to understand the distribution of data, to summarize and compare different datasets, and to identify outliers.

- There are several types of quantiles used in statistical analysis, including:
1. Quartiles: Divide the data into four equal parts: Q1 (25th percentile), Q2 (50th percentile or median), and Q3 (75th percentile).

2. Deciles: Divide the data into ten equal parts: D1 (10th percentile), D2 (20th percentile), ..., D9 (90th percentile).

3. Percentiles: Divide the data into 100 equal parts: P1 (1st percentile), P2 (2nd percentile), ..., P99 (99th percentile).

4. Quintiles: Divide the data into five equal parts, with cut points at the 20th, 40th, 60th, and 80th percentiles.

Things to remember while calculating these measures:
- The data should be sorted from low to high.
- The calculation first finds the location (position) of the quantile in the sorted data.
- The resulting quantile need not be an actual value in the data; it can fall between two observations.
- All other quantiles can be derived from percentiles.
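As a rough illustration of the last point, the sketch below picks quartiles, quintiles, and deciles out of a single list of percentiles. It uses Python's standard-library `statistics.quantiles`, whose default `method='exclusive'` places cut points at fractions of N + 1. The sample values and the helper `from_percentiles` are made up for this example.

```python
import statistics

# Hypothetical sample data (11 observations), chosen only for illustration
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 18]

# n=100 returns the 99 cut points P1 ... P99
percentiles = statistics.quantiles(data, n=100)

def from_percentiles(ranks):
    """Pick the requested percentile ranks (1 to 99) out of the percentile list."""
    return [percentiles[r - 1] for r in ranks]

quartiles = from_percentiles([25, 50, 75])
quintiles = from_percentiles([20, 40, 60, 80])
deciles = from_percentiles(range(10, 100, 10))

print("Quartiles:", quartiles)
print("Quintiles:", quintiles)
print("Deciles:  ", deciles)

# Asking for the quartiles directly gives the same cut points
print(quartiles == statistics.quantiles(data, n=4))
```

Other libraries may use a different interpolation convention by default, so their results can differ slightly from these on small datasets.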

#### Percentiles
- A percentile is the value below which a given percentage of the observations in a dataset fall. For example, the 75th percentile is the value below which 75% of the observations in the dataset fall.

##### Formula to Calculate the Percentile Location

$$
P_L = \frac{p}{100} \times (N + 1)
$$

Where:
- **$P_L$** = the location (position in the sorted data) of the desired percentile
- **N** = the total number of observations in the dataset
- **p** = the percentile to find, expressed as a percentage (for example, 75 for the 75th percentile)
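The location formula can be tried out with a small sketch. The formula only gives a position, so this example makes one extra assumption that the text does not state: when the position is not a whole number, the value is linearly interpolated between the two neighbouring observations. The function names and the sample scores are hypothetical.

```python
def percentile_location(p, n):
    """Location of the p-th percentile among n sorted observations: (p / 100) * (n + 1)."""
    return (p / 100) * (n + 1)

def percentile_value(data, p):
    """Value at the p-th percentile (assumption: linear interpolation between neighbours)."""
    ordered = sorted(data)              # the data must be sorted from low to high
    n = len(ordered)
    loc = percentile_location(p, n)
    loc = min(max(loc, 1), n)           # keep the location inside positions 1..n
    lower = int(loc)                    # 1-based position just below the location
    frac = loc - lower                  # how far the location lies past that position
    if lower >= n:
        return float(ordered[-1])
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

# Hypothetical exam scores, for illustration only
scores = [78, 82, 84, 88, 91, 93, 94, 96, 98, 99]

print(percentile_location(75, len(scores)))  # 8.25: a quarter of the way from the 8th to the 9th value
print(percentile_value(scores, 75))          # 96.5
print(percentile_value(scores, 50))          # 92.0 (the median)
```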


### Five-Number Summary
- The five-number summary is a set of descriptive statistics that summarizes a dataset. It consists of the minimum, the three quartiles (which divide the sorted data into four equal parts), and the maximum. The five-number summary includes the following values:

1. Minimum value: The smallest value in the dataset.

2. First quartile (Q1): The value that separates the lowest 25% of the data from the rest of the dataset.

3. Median (Q2): The value that separates the lowest 50% from the highest 50% of the data.

4. Third quartile (Q3): The value that separates the lowest 75% of the data from the highest 25% of the data.

5. Maximum value: The largest value in the dataset.

- Interquartile range (IQR): The IQR is a measure of variability based on the five-number summary of a dataset. It is defined as the difference between the third quartile and the first quartile, IQR = Q3 - Q1, so it is the width of the middle 50% of the data.
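A minimal sketch of the five-number summary and the IQR is shown below. It assumes the quartiles are computed with Python's `statistics.quantiles` (default `method='exclusive'`, which is based on N + 1 positions); the function name `five_number_summary` and the data are hypothetical.

```python
import statistics

def five_number_summary(data):
    """Return the minimum, Q1, median, Q3, and maximum of a dataset."""
    q1, median, q3 = statistics.quantiles(data, n=4)  # the three quartile cut points
    return {
        "min": min(data),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(data),
    }

# Hypothetical sample data, for illustration only
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 18]

summary = five_number_summary(data)
iqr = summary["q3"] - summary["q1"]   # IQR = Q3 - Q1

print(summary)       # {'min': 2, 'q1': 4.0, 'median': 8.0, 'q3': 12.0, 'max': 18}
print("IQR:", iqr)   # IQR: 8.0
```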


### Boxplot
- A box plot, also known as a box-and-whisker plot, is a graphical representation of a dataset that shows the distribution of the data. The box plot displays a summary of the data, including the minimum and maximum values, the first quartile (Q1), the median (Q2), and the third quartile (Q3).

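- Benefits of a box plot:
1. Gives a quick view of the distribution of the data
2. Shows the skewness of the data
3. Helps identify outliers
4. Allows comparing the distributions of two or more categories

- Note: A box plot commonly flags as outliers the points that lie more than 1.5 × IQR below Q1 or above Q3.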
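The outlier check behind a box plot can be sketched numerically. This example assumes the common 1.5 × IQR convention (Tukey's fences); the function name `iqr_outliers`, the multiplier parameter `k`, and the data are hypothetical choices for illustration.

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr   # values below this are flagged
    upper_fence = q3 + k * iqr   # values above this are flagged
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers

# Hypothetical data with one unusually large value
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 40]

lower, upper, outliers = iqr_outliers(data)
print("Fences:", lower, upper)   # Fences: -8.0 24.0
print("Outliers:", outliers)     # Outliers: [40]
```

Plotting libraries may compute the quartiles with a different interpolation convention, so the fences they draw can differ slightly from these values.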

### Covariance

- Covariance is a statistical measure that describes the direction of the linear relationship between two variables. It measures how the two variables change together: when one variable increases, does the other tend to increase or to decrease?

- If the covariance between two variables is positive, it means that the variables tend to move together in the same direction. If the covariance is negative, it means that the variables tend to move in opposite directions. A covariance of zero indicates that the variables are not linearly related.

### How Is Covariance Calculated?

#### Covariance Formula

#### Population
$$
\sigma_{xy} = \frac{\sum (X - \mu_x)(Y - \mu_y)}{N}
$$

Where:
- $X, Y$ = the values of X and Y in the population
- $\mu_x, \mu_y$ = the population means of X and Y
- $N$ = the total number of observations in the population

---

#### Sample
$$
s_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1}
$$

Where:
- $X, Y$ = the values of X and Y in the sample data
- $\bar{X}, \bar{Y}$ = the sample means of X and Y
- $n$ = the total number of observations in the sample
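Both formulas can be written out directly in plain Python, as a minimal sketch. The function name `covariance` and its `sample` flag are hypothetical; the `x` and `y` values match the example data used in the cells further down.

```python
def covariance(xs, ys, sample=True):
    """Covariance of two equal-length sequences.

    sample=True divides by n - 1 (sample formula);
    sample=False divides by N (population formula).
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if sample and n < 2:
        raise ValueError("sample covariance needs at least two observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of the deviations from each mean
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)

x = [12, 25, 68, 42, 113]
y = [11, 29, 58, 121, 100]

print(covariance(x, y))                # sample covariance, approximately 1148.75
print(covariance(x, y, sample=False))  # population covariance, approximately 919.0
```

The sample result should agree with `np.cov(x, y)[0, 1]`, since `np.cov` divides by n - 1 by default.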


#### Disadvantage of Covariance
- One limitation of covariance is that it does not tell us about the strength of the relationship between two variables, since the magnitude of covariance is affected by the scale of the variables.

- Note: The covariance of a variable with itself is the variance of that variable.
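The scale dependence can be made explicit with a short derivation from the sample formula. The constants $a$ and $b$ are introduced here only for illustration. Since the mean of $aX$ is $a\bar{X}$ and the mean of $bY$ is $b\bar{Y}$:

$$
s_{(aX)(bY)} = \frac{\sum (aX - a\bar{X})(bY - b\bar{Y})}{n - 1} = ab \cdot \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1} = ab \, s_{xy}
$$

Doubling both variables ($a = b = 2$) therefore multiplies the covariance by 4, even though the points follow exactly the same pattern. Setting $Y = X$ in the sample formula gives the sample variance:

$$
s_{xx} = \frac{\sum (X - \bar{X})^2}{n - 1} = s_x^2
$$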


In [2]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [4]:
df = pd.DataFrame()

In [6]:
# Two small numeric series used in the covariance examples
x = pd.Series([12, 25, 68, 42, 113])
y = pd.Series([11, 29, 58, 121, 100])

In [7]:
df['x'] = x
df['y'] = y

In [None]:
df

|   | x | y |
|---|---|---|
| 0 | 12 | 11 |
| 1 | 25 | 29 |
| 2 | 68 | 58 |
| 3 | 42 | 121 |
| 4 | 113 | 100 |


In [None]:
# Example 1: covariance depends on the scale of the variables.
# Doubling both x and y keeps the same pattern but quadruples the covariance.

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))

# np.cov returns the 2x2 sample covariance matrix; entry [0, 1] is cov(x, y)
cov_original = np.cov(df['x'], df['y'])[0, 1]
cov_scaled = np.cov(df['x'] * 2, df['y'] * 2)[0, 1]

# Plot a scatterplot on each axes
ax1.scatter(df['x'], df['y'])
ax2.scatter(df['x'] * 2, df['y'] * 2)

ax1.set_title("Covariance - " + str(cov_original))
ax2.set_title("Covariance - " + str(cov_scaled))

print(cov_original)
print(cov_scaled)

In [None]:
# Example 2: the covariance of x with itself (its variance) shown next to
# the covariance of x and y at the original scale and at double scale.

fig, ax = plt.subplots(1, 3, figsize=(15, 3))

# Plot a scatterplot on each axes
ax[0].scatter(df['x'], df['x'])
ax[1].scatter(df['x'], df['y'])
ax[2].scatter(df['x'] * 2, df['y'] * 2)

# Entry [0, 1] of the covariance matrix is the covariance of the two inputs
ax[0].set_title("Covariance - " + str(np.cov(df['x'], df['x'])[0, 1]))
ax[1].set_title("Covariance - " + str(np.cov(df['x'], df['y'])[0, 1]))
ax[2].set_title("Covariance - " + str(np.cov(df['x'] * 2, df['y'] * 2)[0, 1]))