# Descriptive Statistics
### Measures of central tendency: mean, median, mode
### Measures of variation: SD, variance, z-score, CV
### Measures of position: quartiles
### Measures of shape: skewness, kurtosis

## Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statistics as st
import scipy # scientific python
from scipy import stats
from scipy.stats import zscore, kurtosis, variation
from scipy.stats.mstats import gmean  
# sklearn # sci-kit learn

## Read the data

In [2]:
df = pd.read_csv("Data.csv") # pandas can also read Excel, URL, text, and JSON sources
df

Unnamed: 0,Annual income
0,62000.0
1,64000.0
2,49000.0
3,324000.0
4,1264000.0
5,54330.0
6,64000.0
7,51000.0
8,55000.0
9,48000.0
10,53000.0


In [5]:
df.shape

(11, 1)

In [4]:
df.head(7)

Unnamed: 0,Annual income
0,62000.0
1,64000.0
2,49000.0
3,324000.0
4,1264000.0
5,54330.0
6,64000.0


In [8]:
df.describe(include = 'all') # arithmetic mean (AM), SD, and the five-number summary used in a boxplot
# 1.898482e+05 = 1.898482 * 10^5 = 189848.2 (the mean)
# Applied Computational Statistics (Nitin Malik) (Jan-June 2021)

Unnamed: 0,Annual income
count,11.0
mean,189848.2
std,365285.4
min,48000.0
25%,52000.0
50%,55000.0
75%,64000.0
max,1264000.0


In [4]:
df.mean() # arithmetic mean (AM)

Annual income    189848.181818
dtype: float64

In [5]:
df.mean().round()

Annual income    189848.0
dtype: float64

## Geometric & Harmonic mean

In [22]:
stats.gmean(df) # geometric mean of each column
# gmean also accepts a plain list of numbers:
# create the list, then pass it as an argument to the function

array([86292.96812607])

In [9]:
stats.gmean(df.iloc[:, 2:4], axis=0) # axis=0 gives the GM column-wise; iloc[rows, columns]
# df has only one column, so the column slice 2:4 selects nothing and the result is empty

array([], dtype=float64)

In [18]:
stats.gmean(df.iloc[:, 0:1], axis=1) # axis=1 gives the GM row-wise
# with a single column, the GM of each row is just the value itself

array([  62000.,   64000.,   49000.,  324000., 1264000.,   54330.,
         64000.,   51000.,   55000.,   48000.,   53000.])

In [23]:
stats.hmean(df) # harmonic mean

array([65647.45574851])
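
Both means can also be computed straight from their definitions. This is a minimal sketch, assuming the `Annual income` column holds the 11 values in `incomes` (the eleventh, 53000, is taken from the row-wise output above); it uses the standard library only.

```python
import math
import statistics as st

# Assumed contents of the "Annual income" column (11 rows)
incomes = [62000, 64000, 49000, 324000, 1264000, 54330,
           64000, 51000, 55000, 48000, 53000]

n = len(incomes)

am = sum(incomes) / n                                   # arithmetic mean
gm = math.exp(sum(math.log(x) for x in incomes) / n)    # exp of the mean of the logs
hm = n / sum(1 / x for x in incomes)                    # n over the sum of reciprocals

print(round(am, 2), round(gm, 2), round(hm, 2))

# The statistics module offers the same calculations
print(st.geometric_mean(incomes), st.harmonic_mean(incomes))
```

For positive data the ordering is always HM <= GM <= AM, and the gap widens when the data contain large outliers, as they do here.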

## Median

In [24]:
df.median()

Annual income    55000.0
dtype: float64

## Mode

In [25]:
df.mode()

Unnamed: 0,Annual income
0,64000.0


In [14]:
data_range = df.max() - df.min() # range = max - min; the name `range` would shadow Python's built-in
data_range

Annual income    1216000.0
dtype: float64

## Sample Variance

In [33]:
df.var() # sample variance (pandas default ddof=1): var = sum((x - mean)^2) / (N - 1)

Annual income    1.334334e+11
dtype: float64

## Standard Deviation

In [32]:
df.std() # sample SD (ddof=1): std = sqrt(var)

Annual income    365285.380951
dtype: float64

In [34]:
s=df.var()
np.sqrt(s) ## SD using formula

Annual income    365285.380951
dtype: float64
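
The difference between the sample and population formulas can be checked by hand. The sketch below assumes the `Annual income` column holds the 11 values listed in the `incomes` list (the eleventh, 53000, is taken from the row-wise output earlier in the notebook) and uses only the standard library; the variable names are illustrative.

```python
import math
import statistics as st

# Assumed contents of the "Annual income" column (11 rows)
incomes = [62000, 64000, 49000, 324000, 1264000, 54330,
           64000, 51000, 55000, 48000, 53000]

n = len(incomes)
mean = sum(incomes) / n
# Sum of squared deviations from the mean
ss = sum((x - mean) ** 2 for x in incomes)

sample_var = ss / (n - 1)   # divides by N - 1, like df.var()
population_var = ss / n     # divides by N

print(sample_var, math.sqrt(sample_var))          # sample variance and SD
print(population_var, math.sqrt(population_var))  # population variance and SD
```

The sample SD printed here should agree with `df.std()`; the population SD is smaller by a factor of sqrt((N - 1) / N).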

## Bivariate analysis

In [6]:
df.cov() # covariance matrix; covariance is not the same as the coefficient of variation (CV)
# with a single column, the only entry is that column's variance

Unnamed: 0,Annual income
Annual income,133433400000.0


In [7]:
df.corr() # correlation coefficient matrix; a column's correlation with itself is 1

Unnamed: 0,Annual income
Annual income,1.0


## Measure of Position

In [5]:
df.quantile() # default is q=0.5, i.e. Q2 (the median)
# quantile (fractile) is the general term; quartiles, deciles, and percentiles are special cases

Annual income    55000.0
Name: 0.5, dtype: float64

In [4]:
df.quantile(q=0.50, axis=0) # Q2 (median)
# default axis=0

Annual income    55000.0
Name: 0.5, dtype: float64

In [10]:
df.quantile(q=0.25) # Q1 25%

Annual income    52000.0
Name: 0.25, dtype: float64

In [7]:
df.quantile(q=0.2, axis=0) # 20th percentile

Annual income    51000.0
Name: 0.2, dtype: float64

In [22]:
df.describe().loc['50%']

Annual income    55000.0
Name: 50%, dtype: float64

In [5]:
df.quantile(q=0.25, axis=0) # Q1

Annual income    52000.0
Name: 0.25, dtype: float64

In [11]:
df.quantile(q=0.1, axis=0) # 10th percentile (first decile)

Annual income    49000.0
Name: 0.1, dtype: float64

In [15]:
IQR=df.quantile(0.75) - df.quantile(0.25)
IQR
# IQR= Q3-Q1

Annual income    12000.0
dtype: float64
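
The quantile values above can be reproduced with linear interpolation between sorted values, which is the default method in pandas. This is a minimal sketch, assuming the column holds the 11 values in `incomes`; the `quantile` helper is hypothetical and written only for illustration.

```python
import math

# Assumed contents of the "Annual income" column (11 rows)
incomes = [62000, 64000, 49000, 324000, 1264000, 54330,
           64000, 51000, 55000, 48000, 53000]

def quantile(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    data = sorted(values)
    pos = q * (len(data) - 1)      # fractional index into the sorted data
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return data[lower] + frac * (data[upper] - data[lower])

q1 = quantile(incomes, 0.25)
q2 = quantile(incomes, 0.50)
q3 = quantile(incomes, 0.75)
iqr = q3 - q1                      # IQR = Q3 - Q1

print(q1, q2, q3, iqr)
```

For q = 0.25 the fractional index is 2.5, so Q1 falls halfway between the third and fourth sorted values (51000 and 53000).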

## z-scores

In [12]:
df

Unnamed: 0,Annual income
0,62000.0
1,64000.0
2,49000.0
3,324000.0
4,1264000.0
5,54330.0
6,64000.0
7,51000.0
8,55000.0
9,48000.0
10,53000.0


In [39]:
zscore(df)
# directly using the function

array([[-0.36707821],
       [-0.3613358 ],
       [-0.40440386],
       [ 0.38517724],
       [ 3.084109  ],
       [-0.38910035],
       [-0.3613358 ],
       [-0.39866146],
       [-0.38717664],
       [-0.40727507],
       [-0.39291905]])

In [4]:
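z_manual = (df - df.mean()) / df.std() # z-score = (value - mean) / SD
# df.std() is the sample SD (ddof=1), while scipy's zscore defaults to the population SD (ddof=0),
# so these values are slightly smaller in magnitude than the ones above.
# A separate name avoids overwriting the imported zscore function.
z_manual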

Unnamed: 0,Annual income
0,-0.349995
1,-0.34452
2,-0.385584
3,0.367252
4,2.940583
5,-0.370993
6,-0.34452
7,-0.380109
8,-0.369158
9,-0.388322
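
The two sets of z-scores differ only in which standard deviation sits in the denominator. The sketch below, assuming the column holds the 11 values in `incomes`, computes both versions with the standard library so the gap can be seen side by side.

```python
import math
import statistics as st

# Assumed contents of the "Annual income" column (11 rows)
incomes = [62000, 64000, 49000, 324000, 1264000, 54330,
           64000, 51000, 55000, 48000, 53000]

mean = st.fmean(incomes)
pop_sd = st.pstdev(incomes)     # divides by N (scipy's zscore default, ddof=0)
sample_sd = st.stdev(incomes)   # divides by N - 1 (pandas' std default, ddof=1)

z_pop = [(x - mean) / pop_sd for x in incomes]
z_sample = [(x - mean) / sample_sd for x in incomes]

# z-scores of the largest income (row 4) under both conventions
print(round(z_pop[4], 6), round(z_sample[4], 6))
```

Every population z-score is the matching sample z-score multiplied by sqrt(N / (N - 1)), so the ranking of observations is unchanged.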


In [10]:
# Critical value of z in standard normal distribution
stats.norm.interval(0.95, loc=0, scale=1)[1] # confidence level c= 0.95

1.959963984540054

When we ask for a 0.95 confidence interval, we are asking for the two x-axis values, placed symmetrically about the mean, that enclose 95% of the area under the curve, i.e. 95% of the probability.

Each boundary is a point on the x axis. Its distance from the mean, measured in standard deviations, is the z-score.

In this case the critical z-score is 1.96, because 95% of the values of a standard normal distribution lie between -1.96 and +1.96 standard deviations of the mean.
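
The same critical value can be recovered from the inverse CDF of the standard normal distribution. This is a small sketch using `statistics.NormalDist` from the standard library instead of scipy; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)
confidence = 0.95

# A central 95% interval leaves 2.5% in each tail,
# so the upper bound is the 0.975 quantile.
z_crit = std_normal.inv_cdf(0.5 + confidence / 2)

# Area between -z_crit and +z_crit
area = std_normal.cdf(z_crit) - std_normal.cdf(-z_crit)

print(z_crit, area)
```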

In [47]:
stats.norm.interval(0.95, loc=0, scale=np.std(df))[1] # upper bound = 0 + 1.96 * SD (np.std gives the population SD here)

array([682627.9087859])

In [49]:
stats.norm.interval(0.95, loc=np.mean(df), scale=1)[1] # upper bound = mean + 1.96 * 1

array([189850.14178217])
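
Both results follow from one relation: upper bound = loc + z * scale, where z is the critical value. The sketch below checks this with the standard library, assuming the column holds the 11 values in `incomes` and that the scale passed above is the population SD.

```python
import statistics as st
from statistics import NormalDist

# Assumed contents of the "Annual income" column (11 rows)
incomes = [62000, 64000, 49000, 324000, 1264000, 54330,
           64000, 51000, 55000, 48000, 53000]

z_crit = NormalDist().inv_cdf(0.975)   # about 1.96
mean = st.fmean(incomes)
pop_sd = st.pstdev(incomes)

upper_scaled = 0 + z_crit * pop_sd     # loc=0, scale=SD
upper_shifted = mean + z_crit * 1      # loc=mean, scale=1

# The same bound, read directly from a normal distribution with that scale
upper_direct = NormalDist(mu=0, sigma=pop_sd).inv_cdf(0.975)

print(upper_scaled, upper_shifted, upper_direct)
```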

## Why is the area under the standard normal distribution curve equal to 1?
The area under a probability density curve is the total probability of all possible outcomes, which must add up to 1.
Say a bag holds 3 red and 2 blue balls, and this is your entire population. The probability of picking a red ball is 3/5 and the probability of picking a blue ball is 2/5.

$\frac{3}{5} + \frac{2}{5} = 1$
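
A quick numerical check of both statements, as a sketch: exact fractions for the bag of balls, and a trapezoidal approximation of the area under the standard normal density. The integration range of -8 to 8 and the step count are arbitrary choices made for this illustration.

```python
from fractions import Fraction
from statistics import NormalDist

# Discrete case: probabilities of all outcomes sum to 1
p_red = Fraction(3, 5)
p_blue = Fraction(2, 5)
total = p_red + p_blue

# Continuous case: approximate the area under the standard normal pdf
std_normal = NormalDist()
steps = 16000
width = 16 / steps
xs = [-8 + i * width for i in range(steps + 1)]
area = sum((std_normal.pdf(a) + std_normal.pdf(b)) / 2 * width
           for a, b in zip(xs, xs[1:]))

print(total, round(area, 6))
```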

## Coefficient of variation (CV)

In [15]:
variation(df) # CV = SD / mean, using the population SD (ddof=0) by default

array([1.83454981])

In [14]:
cv = df.std() / df.mean()
cv # differs from variation(df) because df.std() is the sample SD (ddof=1)

Annual income    1.924092
dtype: float64
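
The two CV values can be reconciled by computing the ratio with each standard deviation in turn. A minimal sketch, assuming the column holds the 11 values in `incomes`:

```python
import statistics as st

# Assumed contents of the "Annual income" column (11 rows)
incomes = [62000, 64000, 49000, 324000, 1264000, 54330,
           64000, 51000, 55000, 48000, 53000]

mean = st.fmean(incomes)

cv_population = st.pstdev(incomes) / mean   # population SD, like scipy's variation default
cv_sample = st.stdev(incomes) / mean        # sample SD, like df.std() / df.mean()

print(round(cv_population, 6), round(cv_sample, 6))
```

A CV above 1 means the standard deviation exceeds the mean, which here reflects the two very large incomes.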

## Skewness

In [36]:
df.skew() # bias-corrected sample skewness; a positive value means a long right tail

Annual income    3.058906
dtype: float64

## Kurtosis

In [53]:
kurtosis(df) # excess (Fisher) kurtosis; a normal distribution gives 0

array([5.2450895])
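
Skewness and kurtosis are both built from central moments. The sketch below assumes the column holds the 11 values in `incomes`, that pandas reports the bias-adjusted skewness, and that scipy's default is the biased excess kurtosis; the `central_moment` helper is hypothetical.

```python
import math

# Assumed contents of the "Annual income" column (11 rows)
incomes = [62000, 64000, 49000, 324000, 1264000, 54330,
           64000, 51000, 55000, 48000, 53000]

n = len(incomes)
mean = sum(incomes) / n

def central_moment(k):
    """k-th central moment, dividing by N."""
    return sum((x - mean) ** k for x in incomes) / n

m2, m3, m4 = central_moment(2), central_moment(3), central_moment(4)

g1 = m3 / m2 ** 1.5                                      # biased skewness
adjusted_skew = g1 * math.sqrt(n * (n - 1)) / (n - 2)    # bias-adjusted form
excess_kurtosis = m4 / m2 ** 2 - 3                       # 0 for a normal distribution

print(round(adjusted_skew, 6), round(excess_kurtosis, 6))
```

Both values are large and positive because a single income of 1264000 sits far to the right of the rest of the data.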