<a href="https://colab.research.google.com/github/sijuswamy/PyWorks/blob/main/Descriptive_Statistics_with_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

>**Lesson Outcome** Upon successful completion of this session, the participant will be able to:
- Understand the role of descriptive statistics in data analysis
- Use `Python` for descriptive data analysis

>**Descriptive statistics** is about describing and summarizing data. It uses two main approaches:

- The **quantitative approach** describes and summarizes data numerically.
- The **visual approach** illustrates data with charts, plots, histograms, and other graphs.

### Univariate, bivariate and multivariate analysis

- Univariate analysis describes and summarizes a single variable.
- Bivariate analysis searches for statistical relationships between a pair of variables.
- Multivariate analysis is concerned with multiple variables at once.

## Measures in descriptive statistics

- **Central tendency** describes the center of the data. Useful measures include the `mean`, `median`, and `mode`.

- **Variability** describes the spread of the data. Useful measures include `variance` and `standard deviation`.

- **Correlation or joint variability** describes the relation between a pair of variables in a dataset. Useful measures include covariance and the correlation coefficient.

## `Python` libraries for Descriptive Data Analysis

In [8]:
import numpy as np
import scipy.stats as sp
import pandas as pd

In [17]:
iDdata=[1,2,3,4,10,5]

In [18]:
# finding mean
print(np.mean(iDdata))


4.166666666666667


In [19]:
#calculating mean after excluding `nan`
print(np.nanmean(iDdata))


4.166666666666667
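`np.mean` and `np.nanmean` agree here because `iDdata` contains no missing values. The following minimal sketch uses a hypothetical list, `data_with_nan`, which is not part of the original notebook, to show where the two functions differ.

```python
import numpy as np

# Hypothetical data with one missing value (assumption, for illustration only)
data_with_nan = [1, 2, 3, 4, np.nan, 5]

plain_mean = np.mean(data_with_nan)    # NaN propagates through the sum
nan_mean = np.nanmean(data_with_nan)   # NaN entries are ignored

print(plain_mean)  # nan
print(nan_mean)    # 3.0
```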


### Harmonic mean

The harmonic mean of $n$ items $x_1,x_2,x_3,\cdots, x_n$ is defined as 

$$HM=\dfrac{n}{\sum\limits_{i=1}^n \frac{1}{x_i}}$$

In [20]:
# calculating harmonic mean
print(sp.hmean(iDdata))

2.5174825174825175
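As a cross-check, the formula can be evaluated directly and compared with SciPy. This is only a sketch; the names `data` and `hm_manual` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

data = [1, 2, 3, 4, 10, 5]  # same values as iDdata

# HM = n / sum(1 / x_i)
n = len(data)
hm_manual = n / sum(1 / x for x in data)
hm_scipy = stats.hmean(data)

print(hm_manual, hm_scipy)
```

The harmonic mean is only defined this way for positive values, since any zero makes a reciprocal undefined.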


### Geometric mean

The geometric mean is defined as $$GM=\sqrt[n]{\prod_{i=1}^n x_i}$$

In [21]:
sp.gmean(iDdata)

3.2598444275495897
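The definition can be checked in two equivalent ways: taking the $n$-th root of the product, or exponentiating the mean of the logarithms. The sketch below assumes all values are positive; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

data = [1, 2, 3, 4, 10, 5]  # same values as iDdata

# n-th root of the product
gm_root = np.prod(data) ** (1 / len(data))

# exp of the mean log, which avoids overflow for long datasets
gm_log = np.exp(np.mean(np.log(data)))

print(gm_root, gm_log, stats.gmean(data))
```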

### Weighted Average

In [25]:
# reading weights
w=[0.1,0.3,0.4,0.5,0.6,0.7]
y = np.average(iDdata, weights=w)
print("Weighted Average",round(y,3))

Weighted Average 5.154
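`np.average` divides the weighted sum by the sum of the weights, so the weights do not have to add up to 1. A minimal sketch of the same calculation written out by hand (the name `manual` is illustrative):

```python
import numpy as np

data = [1, 2, 3, 4, 10, 5]            # same values as iDdata
w = [0.1, 0.3, 0.4, 0.5, 0.6, 0.7]    # same weights as above

# weighted mean = sum(w_i * x_i) / sum(w_i)
manual = sum(wi * xi for wi, xi in zip(w, data)) / sum(w)

print(round(manual, 3))                        # 5.154
print(round(np.average(data, weights=w), 3))   # 5.154
```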


### Median
The sample median is the middle element of a sorted dataset; when the number of elements is even, it is the mean of the two middle elements. The dataset can be sorted in increasing or decreasing order.

In [26]:
md=np.median(iDdata)  # median of the original data
print("Median is:", md)

Median is: 3.5
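The definition of the median can be sketched in plain Python. The helper `sample_median` is a hypothetical name introduced here, not a library function.

```python
import numpy as np

def sample_median(values):
    """Return the middle value of the sorted data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

data = [1, 2, 3, 4, 10, 5]   # same values as iDdata
print(sample_median(data))   # 3.5
print(np.median(data))       # 3.5
```

Unlike the mean (about 4.17 for this data), the median is not pulled upward by the single large value 10.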


### Mode
The sample mode is the value in the dataset that occurs most frequently. If there isn’t a single such value, then the set is multimodal since it has multiple modal values.

In [34]:
u = [2, 3, 2, 8, 12,2,6,3]
mode_ = max((u.count(item), item) for item in set(u))[1]
print(mode_)

2
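The `max`-based expression returns a single value even when several values tie for the highest count. One possible alternative, assuming Python 3.8 or later, is `statistics.multimode` from the standard library, which returns every modal value. The list `v` below is a hypothetical example.

```python
from statistics import multimode

u = [2, 3, 2, 8, 12, 2, 6, 3]
v = [1, 1, 2, 2, 3]  # hypothetical bimodal data (assumption)

modes_u = multimode(u)
modes_v = multimode(v)

print(modes_u)  # [2]
print(modes_v)  # [1, 2]
```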


## Measures of Variability
The measures of central tendency aren’t sufficient to describe data. You’ll also need the measures of variability that quantify the spread of data points.

- Variance
- Standard deviation
- Skewness
- Percentiles
- Ranges

### Variance

In [37]:
var_ = np.var(u)  # population variance (ddof=0); pass ddof=1 for the sample variance
print(var_)

11.6875
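By default `np.var` divides by $n$ (population variance), while the standard deviation cell in this notebook uses `ddof=1`, which divides by $n-1$ (sample variance). A short sketch of the difference:

```python
import numpy as np

u = [2, 3, 2, 8, 12, 2, 6, 3]

pop_var = np.var(u)             # divides by n
sample_var = np.var(u, ddof=1)  # divides by n - 1

print(pop_var)      # 11.6875
print(sample_var)   # about 13.357

# The sample standard deviation is the square root of the sample variance
print(np.isclose(np.std(u, ddof=1) ** 2, sample_var))  # True
```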


### Standard Deviation
The sample standard deviation is another measure of data spread. It’s connected to the sample variance, as standard deviation, $s$, is the positive square root of the sample variance. The standard deviation is often more convenient than the variance because it has the same unit as the data points.

In [42]:
Sdv=np.std(u,ddof=1)
print(Sdv)

3.654742515847438


### Skewness
The sample skewness measures the asymmetry of a data sample.

In [45]:
Sk=sp.skew(u, bias=False)
print(Sk) #positively skewed

1.3432111531262334


### Kurtosis

In [50]:
kur=sp.kurtosis(u,bias=False)
print(kur) # excess kurtosis (Fisher definition, normal = 0); the positive value suggests heavier tails than normal (leptokurtic)

0.9941605421945154
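`scipy.stats.kurtosis` uses Fisher's definition by default, which subtracts 3 so that a normal distribution scores 0. Passing `fisher=False` gives Pearson's definition, where a normal distribution scores 3. A short sketch comparing the two:

```python
import numpy as np
from scipy import stats

u = [2, 3, 2, 8, 12, 2, 6, 3]

excess = stats.kurtosis(u, bias=False)                  # Fisher: normal = 0
pearson = stats.kurtosis(u, fisher=False, bias=False)   # Pearson: normal = 3

print(excess)
print(pearson)
print(np.isclose(pearson - excess, 3.0))  # True
```

With only eight observations, any such classification should be read with caution.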


### Percentiles
The sample $p$th percentile is the element in the dataset such that $p$% of the elements in the dataset are less than or equal to that value. Also, $(100 - p)$% of the elements are greater than or equal to that value.

In [54]:
np.percentile(u, [25, 50, 75])

array([2. , 3. , 6.5])
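The value 6.5 does not occur in `u`: by default `np.percentile` interpolates linearly between neighbouring sorted values. The helper `percentile_linear` below is a hypothetical sketch of that default rule, not NumPy's actual implementation.

```python
import numpy as np

def percentile_linear(values, p):
    """Linear-interpolation percentile for 0 <= p <= 100."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * p / 100   # fractional index into the sorted data
    lo = int(pos)                        # index of the lower neighbour
    hi = min(lo + 1, len(ordered) - 1)   # index of the upper neighbour
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])

u = [2, 3, 2, 8, 12, 2, 6, 3]
manual = [percentile_linear(u, p) for p in (25, 50, 75)]

print(manual)                           # [2.0, 3.0, 6.5]
print(np.percentile(u, [25, 50, 75]))
```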

### Ranges
The range of data is the difference between the maximum and minimum element in the dataset. You can get it with the function `np.ptp()`:

In [56]:
print(np.ptp(u))  # same as np.max(u) - np.min(u)

10


### Interquartile range
The interquartile range is the difference between the third and first quartiles. Once you calculate the quartiles, you can take their difference:

In [62]:
quartiles = np.quantile(u, [0.25,0.5, 0.75])
IQR=quartiles[2] - quartiles[0]
print(IQR)

4.5
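SciPy also provides `scipy.stats.iqr`, which by default computes the same difference between the 75th and 25th percentiles. A brief sketch comparing it with the manual calculation:

```python
import numpy as np
from scipy import stats

u = [2, 3, 2, 8, 12, 2, 6, 3]

q1, q3 = np.percentile(u, [25, 75])
iqr_manual = q3 - q1
iqr_scipy = stats.iqr(u)

print(iqr_manual, iqr_scipy)  # 4.5 4.5
```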


### Summary of Descriptive Statistics

In [67]:
des=sp.describe(u)
des.mean

4.75
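The object returned by `describe` holds more than the mean. The sketch below prints its other fields. Note that, by default, `describe` reports the sample variance (`ddof=1`) and the biased skewness and kurtosis (`bias=True`), so those values can differ from the ones computed earlier with other settings.

```python
import numpy as np
from scipy import stats

u = [2, 3, 2, 8, 12, 2, 6, 3]
des = stats.describe(u)

print("nobs:", des.nobs)
print("min, max:", des.minmax)
print("mean:", des.mean)
print("variance (ddof=1):", des.variance)
print("skewness (biased):", des.skewness)
print("excess kurtosis (biased):", des.kurtosis)
```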

## Multivariate analysis

### Measures of Correlation Between Pairs of Data
You’ll often need to examine the relationship between the corresponding elements of two variables in a dataset.

In [68]:
x = list(range(-10, 11))
y = [0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]
x_, y_ = np.array(x), np.array(y)

In [70]:
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
cov_xy = (sum((x[k] - mean_x) * (y[k] - mean_y) for k in range(n))/ (n - 1))
cov_xy

19.95

In [71]:
cov_matrix = np.cov(x_, y_)
cov_matrix

array([[38.5       , 19.95      ],
       [19.95      , 13.91428571]])

### Correlation Coefficient
The correlation coefficient, or Pearson product-moment correlation coefficient, is denoted by the symbol $r$. It is the covariance of the two variables divided by the product of their standard deviations, so it always lies between $-1$ and $1$.

In [75]:
corr_matrix = np.corrcoef(x_, y_)
corr_matrix

array([[1.        , 0.86195001],
       [0.86195001, 1.        ]])

In [78]:
r=corr_matrix[0,1]
print(r)

0.8619500056316061
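The relation $r = s_{xy} / (s_x s_y)$ can be verified numerically. This is a sketch under the assumption that the sample covariance and sample standard deviations (`ddof=1`) are used throughout; `scipy.stats.pearsonr` is included as an independent check and also returns a p-value.

```python
import numpy as np
from scipy import stats

x = list(range(-10, 11))
y = [0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]

cov_xy = np.cov(x, y)[0, 1]                                  # sample covariance
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))  # r = s_xy / (s_x * s_y)

r_scipy, p_value = stats.pearsonr(x, y)

print(r_manual, r_scipy)
```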


## Working with two-dimensional data

In [80]:
# with numpy array
a=np.array([[ 1,  1,  1],
       [ 2,  3,  1],
       [ 4,  9,  2],
       [ 8, 27,  4],
       [16,  1,  1]])

In [81]:
np.sum(a,axis=0)

array([31, 41,  9])

In [82]:
np.mean(a,axis=1)

array([ 1.,  2.,  5., 13.,  6.])

In [84]:
sp.hmean(a,axis=0)

array([2.58064516, 2.01492537, 1.33333333])

In [85]:
np.var(a,axis=0)

array([29.76, 96.96,  1.36])

In [86]:
np.cov(a)  # rows are treated as variables by default (rowvar=True)

array([[  0. ,   0. ,   0. ,   0. ,   0. ],
       [  0. ,   1. ,   3.5,  11.5,   0. ],
       [  0. ,   3.5,  13. ,  44. ,  -7.5],
       [  0. ,  11.5,  44. , 151. , -37.5],
       [  0. ,   0. ,  -7.5, -37.5,  75. ]])
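The 5×5 result arises because `np.cov` treats each row as a variable by default. If the columns are the variables, as in a typical data table, passing `rowvar=False` gives a 3×3 matrix instead. A short sketch:

```python
import numpy as np

a = np.array([[ 1,  1,  1],
              [ 2,  3,  1],
              [ 4,  9,  2],
              [ 8, 27,  4],
              [16,  1,  1]])

row_cov = np.cov(a)                 # rows as variables: 5 x 5
col_cov = np.cov(a, rowvar=False)   # columns as variables: 3 x 3

print(row_cov.shape)  # (5, 5)
print(col_cov.shape)  # (3, 3)
print(col_cov)
```

The diagonal of `col_cov` holds the sample variances (`ddof=1`) of the columns, so it differs from `np.var(a, axis=0)`, which uses `ddof=0`.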

In [87]:
# working with dataframes
row_names = ['first', 'second', 'third', 'fourth', 'fifth']
col_names = ['A', 'B', 'C']
df = pd.DataFrame(a, index=row_names, columns=col_names)
df

| | A | B | C |
|---|---|---|---|
| first | 1 | 1 | 1 |
| second | 2 | 3 | 1 |
| third | 4 | 9 | 2 |
| fourth | 8 | 27 | 4 |
| fifth | 16 | 1 | 1 |


In [88]:
df.mean()  # column means

A    6.2
B    8.2
C    1.8
dtype: float64

In [89]:
df.mean(axis=1)

first      1.0
second     2.0
third      5.0
fourth    13.0
fifth      6.0
dtype: float64

In [90]:
#pandas summary statistics
df.describe()

| | A | B | C |
|---|---|---|---|
| count | 5.0 | 5.0 | 5.0 |
| mean | 6.2 | 8.2 | 1.8 |
| std | 6.09918 | 11.009087 | 1.30384 |
| min | 1.0 | 1.0 | 1.0 |
| 25% | 2.0 | 1.0 | 1.0 |
| 50% | 4.0 | 3.0 | 1.0 |
| 75% | 8.0 | 9.0 | 2.0 |
| max | 16.0 | 27.0 | 4.0 |
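The `std` row reported by `describe()` uses `ddof=1`, unlike NumPy's default of `ddof=0`. The sketch below shows how the pandas and NumPy defaults relate, and how to get column-wise covariance and correlation matrices directly from the DataFrame.

```python
import numpy as np
import pandas as pd

a = np.array([[ 1,  1,  1],
              [ 2,  3,  1],
              [ 4,  9,  2],
              [ 8, 27,  4],
              [16,  1,  1]])
df = pd.DataFrame(a,
                  index=['first', 'second', 'third', 'fourth', 'fifth'],
                  columns=['A', 'B', 'C'])

sample_std = df.std()       # ddof=1 by default, matches describe()
pop_var = df.var(ddof=0)    # matches np.var(a, axis=0)
cov_df = df.cov()           # column-wise sample covariance
corr_df = df.corr()         # column-wise Pearson correlation

print(sample_std)
print(pop_var)
print(cov_df)
print(corr_df)
```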
