# Statistics-2

$\normalsize \text{by } \textit{Manideep Bangaru}$       
$\small \text{on 17/12/2020}$

# Numerical Descriptive Measures

### Loading required libraries

In [63]:
import numpy as np

### Defining a list of elements

In [64]:
vec = [20,34,56,23,45,22,60,23,56,78,23,45]

### Mean
The arithmetic mean (typically referred to as the mean) is the most common measure of central tendency.

$\large \bar X = \frac{\text{Sum of the values}}{\text{Number of values}} = \frac{\sum_{i=1}^{n} x_i}{n}$

In [189]:
print("Average is : " + str(round(np.mean(vec),5)))

Average is : 40.41667


### Median
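The median is the middle value in an ordered array of data that has been ranked from smallest to largest or largest to smallest. When n is even, it is the average of the two middle values; for `vec` (n = 12) that is the average of the 6th and 7th ranked values, 34 and 45.

$\large \text{Median} = \large \frac{(n+1)}{2} \text{ ranked value}$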

In [66]:
print("Median is : "+str(round(np.median(vec),2)))

Median is : 39.5


* For symmetrically distributed data, the mean, median and mode are almost equal in value
* For moderately asymmetrical (skewed) data, the following relationship holds approximately:
* Mode = 3 * Median - 2 * Mean (or)
* Mean - Mode = 3 * (Mean - Median)
* The above relation is called the empirical relation. If any two of the measures are known, it gives an estimate of the third
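As a rough check of the empirical relation, the sketch below estimates the mode of `vec` from its mean and median and compares it with the actual mode. It uses the standard-library `statistics` module rather than the notebook's own helpers. For a small discrete sample like this one the estimate can be far from the actual mode, so treat the relation as an approximation only.

```python
import statistics

vec = [20, 34, 56, 23, 45, 22, 60, 23, 56, 78, 23, 45]

mean_ = statistics.mean(vec)
median_ = statistics.median(vec)

# Empirical relation: Mode is approximately 3 * Median - 2 * Mean
estimated_mode = 3 * median_ - 2 * mean_
actual_mode = statistics.mode(vec)

print("Estimated mode : " + str(round(estimated_mode, 2)))
print("Actual mode    : " + str(actual_mode))
```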

### Harmonic Mean
The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals

$\Large\frac {\Large n}{\Large \sum_{\Large i=1}^{\Large n}\Large (\frac{\Large 1}{\Large x_i})}$

In [67]:
def hm_(lis):
    sum_ = 0
    for i in range(len(lis)):
        sum_ = sum_ + (1/lis[i]) 
    return(len(lis)/sum_)
        

In [68]:
print("Harmonic mean is : " + str(round(hm_(vec),2)))

Harmonic mean is : 32.88
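One way to cross-check `hm_` is the standard library's `statistics.harmonic_mean`, which computes the same quantity. This is only a validation sketch; it redefines `vec` so that it runs on its own.

```python
import statistics

vec = [20, 34, 56, 23, 45, 22, 60, 23, 56, 78, 23, 45]

# n divided by the sum of reciprocals, as in the formula above
hm = statistics.harmonic_mean(vec)
print("Harmonic mean is : " + str(round(hm, 2)))
```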


### Geometric Mean
When you want to measure the rate of change of a variable over time, you need to use the geometric mean instead of the arithmetic mean.

$\large \bar X_g = (X_1 * X_2 * X_3 * \cdots * X_n )^\frac{1}{n}$


In [69]:
def gm_(lis):
    return(np.prod(lis)**(1/len(lis)))

In [70]:
print('Geometric mean is : '+ str(gm_(vec)))

Geometric mean is : 36.399618874358126
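Multiplying many values can overflow integer types on longer lists. A common alternative, sketched here under the assumption that all values are positive, is to average the logarithms and exponentiate. The helper name `gm_log` is introduced only for this illustration.

```python
import numpy as np

vec = [20, 34, 56, 23, 45, 22, 60, 23, 56, 78, 23, 45]

def gm_log(lis):
    # exp(mean(log(x))) equals (x_1 * ... * x_n) ** (1/n) for positive x
    return float(np.exp(np.mean(np.log(lis))))

gm = gm_log(vec)
print("Geometric mean is : " + str(round(gm, 5)))
```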


* The arithmetic mean is appropriate if the values have the same units
* The geometric mean is appropriate if the values have differing units
* The harmonic mean is appropriate if the data values are ratios of two variables with different measures, called rates
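To illustrate the last point, consider average speed over two legs of equal distance. The speeds and distance below are made-up values for the example. The true average speed (total distance over total time) matches the harmonic mean of the speeds, not the arithmetic mean.

```python
import statistics

speeds = [60, 40]   # hypothetical speeds in km/h, one per leg
distance = 120      # hypothetical length of each leg in km

# True average speed = total distance / total time
total_time = sum(distance / s for s in speeds)
true_average_speed = (distance * len(speeds)) / total_time

harmonic = statistics.harmonic_mean(speeds)
arithmetic = statistics.mean(speeds)

print("True average speed : " + str(round(true_average_speed, 2)))
print("Harmonic mean      : " + str(round(harmonic, 2)))
print("Arithmetic mean    : " + str(arithmetic))
```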

## Variation and Shape

### Range
Range is the difference between largest and smallest element. It is the simplest descriptive measure of variation.\
$\sf \normalsize Range = (\large x_{max}-\large x_{min})$

In [71]:
def range_(lis):
    return (np.max(lis) - np.min(lis))

In [72]:
print("Range is : "+str(range_(vec)))

Range is : 58


### Variance
Variance measures the average scatter around the mean.\
$\large S^2 (\text{sample variance}) \small =\large \frac{\sum_{i=1}^{n}{(x_i-\bar X)^2}}{n-1}$

In [197]:
def variance_(lis):
    mean_ = np.mean(lis)
    numerator = 0
    for i in range(len(lis)):
        numerator += ((lis[i] - mean_)**2)
    return numerator/(len(lis)-1)

In [198]:
print("Variance is : "+ str(round(variance_(vec),5)))

Variance is : 366.44697


### Standard deviation
It is the square root of the variance.\
$\large S (\text{sample standard deviation}) \small =\large \sqrt{\frac{\sum_{i=1}^{n}{(x_i-\bar X)^2}}{n-1}}$

In [199]:
def stddev_(lis):
    return np.sqrt(variance_(lis))

In [200]:
print("Standard deviation is : "+str(round(stddev_(vec),2)))

Standard deviation is : 19.14


In [201]:
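# Comparing with NumPy: np.std defaults to the population formula (ddof=0),
# so it is smaller than the sample value above; np.std(vec, ddof=1) gives 19.14
round(np.std(vec),2)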

18.33
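The gap between 19.14 and 18.33 comes from the divisor: `stddev_` divides by n-1, while `np.std` divides by n unless `ddof=1` is passed. A short sketch showing both settings side by side:

```python
import numpy as np

vec = [20, 34, 56, 23, 45, 22, 60, 23, 56, 78, 23, 45]

pop_std = np.std(vec)              # ddof=0: divides by n (population)
sample_std = np.std(vec, ddof=1)   # ddof=1: divides by n-1 (sample)
sample_var = np.var(vec, ddof=1)   # sample variance, same divisor

print("Population std : " + str(round(pop_std, 2)))
print("Sample std     : " + str(round(sample_std, 2)))
print("Sample variance: " + str(round(sample_var, 5)))
```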

### Coefficient of Variation
It measures the scatter in the data relative to the mean, expressed as a percentage.\
$\large CV = (\frac {S}{\bar X})*100$

In [202]:
def CoeffVar_(lis):
    return (stddev_(lis)/np.mean(lis))*100

In [203]:
print("Coefficient of Variation is : " + str(round(CoeffVar_(vec),2)))

Coefficient of Variation is : 47.36


### Z-Score
* The Z score of a value is the difference between that value and the mean, divided by the standard deviation
* A Z score of 0 indicates that the value is the same as the mean
* If a Z score is a positive or negative number, it indicates whether the value is above or below the mean and by how many standard deviations

$\large Z = \large(\frac{x-\mu}{\sigma})$


In [204]:
def zscore_(lis):
    # Compute the mean and sample standard deviation once, outside the loop
    mean_ = np.mean(lis)
    std_ = stddev_(lis)
    return [(ele - mean_)/std_ for ele in lis]
        

In [205]:
zscore_(vec)

[-1.0665452134482747,
 -0.33519992422660055,
 0.8140569588360304,
 -0.9098283657579159,
 0.23942851730471487,
 -0.9620673149880355,
 1.0230127557565087,
 -0.9098283657579159,
 0.8140569588360304,
 1.963313841898661,
 -0.9098283657579159,
 0.23942851730471487]

In [206]:
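from scipy.stats import zscore
# scipy's zscore defaults to the population standard deviation (ddof=0),
# so its values differ slightly from zscore_; zscore(vec, ddof=1) matches
zscore(vec)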

array([-1.11397014, -0.3501049 ,  0.85025476, -0.95028474,  0.25007493,
       -1.00484654,  1.06850198, -0.95028474,  0.85025476,  2.05061443,
       -0.95028474,  0.25007493])
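The two sets of Z scores differ only in the standard deviation used. The vectorized NumPy sketch below reproduces both: `ddof=1` gives the sample version computed by `zscore_`, and `ddof=0` gives the population version that scipy uses by default.

```python
import numpy as np

vec = np.array([20, 34, 56, 23, 45, 22, 60, 23, 56, 78, 23, 45])

# Sample standard deviation (n-1 divisor), as in zscore_
z_sample = (vec - vec.mean()) / vec.std(ddof=1)

# Population standard deviation (n divisor), scipy's default
z_pop = (vec - vec.mean()) / vec.std(ddof=0)

print(np.round(z_sample, 4))
print(np.round(z_pop, 4))
```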

### Skewness
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, zero, negative, or undefined.

* Mean < Median ----> left skewed or negative skew
* Mean = Median ----> symmetrical distribution (zero skewness)
* Mean > Median ----> right skewed or positive skew

_**As a general rule, a Z score that is less than -3.0 or greater than +3.0 indicates an outlier value**_

In [207]:
from scipy.stats import skew
skew(vec)

0.5074242414080993
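With its default settings, `skew` returns the third central moment divided by the second central moment raised to the power 1.5. A minimal NumPy sketch of that formula, which should reproduce the value above:

```python
import numpy as np

vec = np.array([20, 34, 56, 23, 45, 22, 60, 23, 56, 78, 23, 45])

dev = vec - vec.mean()
m2 = np.mean(dev**2)   # second central moment (population variance)
m3 = np.mean(dev**3)   # third central moment

skewness = m3 / m2**1.5
print(skewness)

# Mean > Median agrees with the positive sign of the skewness
print(vec.mean() > np.median(vec))
```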

In [208]:
from plotly import express as px
px.line(vec)

### Kurtosis
* A distribution that has a sharper-rising center peak than the peak of a normal distribution has positive kurtosis, a kurtosis value that is greater than zero, and is called **leptokurtic**
* A distribution that has a slower-rising (flatter) center peak than the peak of a normal distribution has negative kurtosis, a kurtosis value that is less than zero, and is called **platykurtic**
* A **leptokurtic distribution** has a **higher concentration** of values near the mean of the distribution compared to a normal distribution, while a **platykurtic distribution** has a **lower concentration** compared to a normal distribution

In [209]:
from scipy.stats import kurtosis,kurtosistest
# kurtosis() returns excess kurtosis by default (Fisher definition: normal distribution = 0)
print(kurtosis(vec))
# The distribution with a higher kurtosis has a heavier tail

print(kurtosistest([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]))

-0.9116954950386589
KurtosistestResult(statistic=-1.7058104152122062, pvalue=0.08804338332528348)
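The kurtosis value printed above is excess kurtosis: the fourth central moment divided by the squared second central moment, minus 3. A NumPy sketch of that formula for `vec`:

```python
import numpy as np

vec = np.array([20, 34, 56, 23, 45, 22, 60, 23, 56, 78, 23, 45])

dev = vec - vec.mean()
m2 = np.mean(dev**2)   # second central moment
m4 = np.mean(dev**4)   # fourth central moment

# Subtracting 3 makes a normal distribution score 0
excess_kurtosis = m4 / m2**2 - 3
print(excess_kurtosis)
```

The negative result means `vec` is platykurtic.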


### Quartiles

$\normalsize \sf The\ quartiles\ Q1,\ Q2\ and\ Q3\ split\ the\ ranked\ data\ into\ 4\ equal\ parts$

$\normalsize Q1 = \frac {n+1}{4} \text{ ranked value} \ \ \ \ \ \ \ \ Q2 = 2(\frac {n+1}{4}) \text{ ranked value (the median)}$

$\normalsize Q3 = 3(\frac {n+1}{4}) \text{ ranked value}$

### Percentile
Percentiles divide the ranked data into 100 equal parts. Q1, Q2 and Q3 are the 25th, 50th and 75th percentiles.

### The Interquartile Range (IQR)
The interquartile range is the difference between third quartile **Q3** and first quartile **Q1**\
$\textbf{IQR} = \textbf{Q3 - Q1}$
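A quick way to get quartiles and the IQR for `vec` is `np.percentile`. Note that its default method interpolates linearly between ranked values, which is a different positioning rule from the (n+1)/4 one, so results can differ on other data. For `vec` the two happen to agree.

```python
import numpy as np

vec = [20, 34, 56, 23, 45, 22, 60, 23, 56, 78, 23, 45]

# 25th, 50th and 75th percentiles are Q1, Q2 (median) and Q3
q1, q2, q3 = np.percentile(vec, [25, 50, 75])
iqr = q3 - q1

print("Q1 : " + str(q1))
print("Q2 : " + str(q2))
print("Q3 : " + str(q3))
print("IQR: " + str(iqr))
```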

### The Empirical rule

The Empirical rule states that :

In a Normal distribution,

* Approximately 68% of the values are within $\bf{\pm1\sigma}$ of the mean
* Approximately 95% of the values are within $\bf{\pm2\sigma}$ of the mean
* Approximately 99.7% of the values are within $\bf{\pm3\sigma}$ of the mean
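These percentages follow from the normal distribution itself: the probability of falling within k standard deviations of the mean is erf(k/√2). A small sketch using only the standard library (the helper name `normal_coverage` is made up for this example):

```python
import math

def normal_coverage(k):
    # P(|Z| < k) for a standard normal variable Z
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    pct = round(normal_coverage(k) * 100, 2)
    print("Within +/-" + str(k) + " sigma : " + str(pct) + "%")
```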

### Chebyshev's theorem

For heavily skewed datasets that do not appear to be normally distributed, you should use Chebyshev's theorem:

Regardless of the shape of the distribution, the percentage of values found within k standard deviations of the mean (for any k > 1) must be at least

$\Large (1-\frac{1}{k^2}) * 100$
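The bound can be checked against `vec`. The sketch below computes the Chebyshev lower bound for k = 2 and the percentage of values actually within 2 standard deviations of the mean. It uses the population standard deviation, and the helper name `chebyshev_bound` is made up for this example.

```python
import numpy as np

def chebyshev_bound(k):
    # Minimum percentage of values within k standard deviations of the mean
    return (1 - 1 / k**2) * 100

vec = np.array([20, 34, 56, 23, 45, 22, 60, 23, 56, 78, 23, 45])
mean_, std_ = vec.mean(), vec.std()   # population standard deviation (ddof=0)

k = 2
within = np.mean(np.abs(vec - mean_) < k * std_) * 100

print("Chebyshev lower bound : " + str(chebyshev_bound(k)) + "%")
print("Observed in vec       : " + str(round(within, 2)) + "%")
```

The observed share is well above the bound, as expected, because the theorem is a worst-case guarantee.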

### The Covariance and the Coefficient of Correlation

#### The Covariance

It measures the strength of a linear relationship between two numerical variables

Sample,$\large \sf \ cov(X,Y) = \frac{\sum_{i=1}^{n}(X_i-\bar X)(Y_i-\bar Y)}{n-1}$

Population,$\large \sf \ cov(X,Y) = \frac{\sum_{i=1}^{n}(X_i-\bar X)(Y_i-\bar Y)}{n}$

In [210]:
# find out Covariance of these two lists
vec1 = [23,45,34,23,34,56,78,65,45,34]
vec2 = [65,54,34,45,23,45,67,88,96,33]

In [211]:
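def cov_(lis1,lis2,sample):
    # Compute both means once instead of on every iteration
    mean1, mean2 = np.mean(lis1), np.mean(lis2)
    num = 0
    for i,j in zip(lis1,lis2):
        num += (i - mean1)*(j - mean2)
    # sample=True divides by n-1, sample=False divides by n
    if sample:
        return num/(len(lis1)-1)
    return num/len(lis1)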

In [212]:
print("Sample Covariance is : " + str(cov_(vec1,vec2,True)))
print("Population Covariance is : "+str(cov_(vec1,vec2,False)))

Sample Covariance is : 196.77777777777777
Population Covariance is : 177.1
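Both values can be cross-checked with `np.cov`, which returns the full covariance matrix. The off-diagonal entry is the covariance between the two lists. By default it divides by n-1, and `ddof=0` switches to the population formula.

```python
import numpy as np

vec1 = [23, 45, 34, 23, 34, 56, 78, 65, 45, 34]
vec2 = [65, 54, 34, 45, 23, 45, 67, 88, 96, 33]

# Element [0, 1] of the 2x2 matrix is cov(vec1, vec2)
sample_cov = np.cov(vec1, vec2)[0, 1]
population_cov = np.cov(vec1, vec2, ddof=0)[0, 1]

print("Sample covariance     : " + str(round(sample_cov, 4)))
print("Population covariance : " + str(round(population_cov, 4)))
```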


#### The Coefficient of Correlation

It measures the relative strength of a linear relationship between two numerical variables

* Values range from -1 to +1
* -1, perfect negative correlation
* +1, perfect positive correlation

Sample correlation, $\large \bf r = \frac {Cov(X,Y)}{S_XS_Y}$

In [213]:
def corr_(lis1,lis2):
    # Use the sample covariance so it is consistent with the sample standard deviations
    return cov_(lis1,lis2,True)/(stddev_(lis1)*stddev_(lis2))

In [217]:
print("Coefficient of Correlation is : " + str(round(corr_(vec1,vec2),2)))

Coefficient of Correlation is : 0.46
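The Pearson coefficient does not depend on whether n or n-1 is used, as long as the covariance and both standard deviations use the same divisor, because the divisor cancels. A sketch with `np.corrcoef`, which handles this consistently, alongside the same ratio written out from sums of deviations:

```python
import numpy as np

vec1 = np.array([23, 45, 34, 23, 34, 56, 78, 65, 45, 34])
vec2 = np.array([65, 54, 34, 45, 23, 45, 67, 88, 96, 33])

# Element [0, 1] of the correlation matrix is r(vec1, vec2)
r = np.corrcoef(vec1, vec2)[0, 1]

# Same ratio from raw sums: the n or n-1 divisor cancels out
d1, d2 = vec1 - vec1.mean(), vec2 - vec2.mean()
r_manual = np.sum(d1 * d2) / np.sqrt(np.sum(d1**2) * np.sum(d2**2))

print("Coefficient of Correlation is : " + str(round(r, 2)))
```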


<font color='olive'>$\Large \textbf{Thank you !!!}$</font>