# Descriptive Statistics, Transformations

Types of descriptive statistics:
- Measures of central tendency
- Measures of variability (dispersion)

## Measures of Central Tendency

A measure of central tendency represents the whole dataset by a single value. <br>
It gives the location of the centre of the data. The common measures are:

- Mean
- Mode
- Median

### a. Mean

It is the sum of the observations divided by the total number of observations. <br>
It is also called the average.

![mean formula](mean.png)

where, n = number of items

**Example**

```python
import numpy as np

# sample data
arr = [5, 6, 11]

# mean
mean = np.mean(arr)
print("Mean: ", mean)
```

```
Mean:  7.333333333333333
```


### b. Mode

It is the value that has the highest frequency in the given dataset. <br>
The dataset has no mode if every data point occurs with the same frequency. <br>

**Example**

```python
from scipy import stats

# sample data
arr = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]

# mode
mode = stats.mode(arr, keepdims=True)
print("Mode: ", mode)
```

```
Mode:  ModeResult(mode=array([4]), count=array([4]))
```
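
The rule that a dataset has no mode when all values are equally frequent can be checked by counting frequencies directly. The sketch below is one possible way to do this with the standard library; the helper name `find_modes` is hypothetical, and returning an empty list for "no mode" is a convention chosen for this example.

```python
from collections import Counter


def find_modes(data):
    """Return every value with the highest frequency, or [] when there is no mode."""
    if not data:
        return []
    counts = Counter(data)
    highest = max(counts.values())
    modes = sorted(value for value, count in counts.items() if count == highest)
    # if every distinct value occurs equally often, treat the dataset as having no mode
    if len(modes) == len(counts) and len(counts) > 1:
        return []
    return modes


print(find_modes([1, 2, 2, 3, 3, 3, 4, 4, 4, 4]))  # one mode
print(find_modes([1, 2, 3, 4]))                    # all frequencies equal
print(find_modes([1, 1, 2, 2, 3]))                 # two values tie for the highest frequency
```

Returning a list also covers datasets where several values share the highest frequency.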


### c. Median

It is the middle value of the sorted dataset. <br>
It splits the data into two halves: if the number of elements in the dataset is odd, the centre element is the median. <br>
If it is even, the median is the average of the two central elements.

![median formula](median.png)

where, n = number of terms

**Example**

```python
import numpy as np

# sample data
arr = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# median
median = np.median(arr)

print("Median: ", median)
```

```
Median:  5.0
```
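
The example above has an odd number of elements. As a small sketch of the even case, the code below averages the two central elements by hand and compares the result with `statistics.median` from the standard library; the variable names are illustrative only.

```python
import statistics

# even number of elements: the two central values are 4 and 5
arr_even = [1, 2, 3, 4, 5, 6, 7, 8]

# manual calculation on the sorted data
sorted_arr = sorted(arr_even)
mid = len(sorted_arr) // 2
manual_median = (sorted_arr[mid - 1] + sorted_arr[mid]) / 2

library_median = statistics.median(arr_even)

print("Manual median: ", manual_median)
print("Library median: ", library_median)
```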


## Measures of Dispersion

They describe how spread out the data are.

- <strong>Range:</strong> The difference between the highest and lowest values in the dataset
- <strong>Variance:</strong> The average of the squared differences from the mean
- <strong>Standard Deviation:</strong> Shows how the data are spread around the mean
>- it is the square root of the variance

### 1. Range

The range is the difference between the largest and smallest data points in the dataset.
The bigger the range, the wider the spread.

>> <code>Range = largest data value - smallest data value</code>

**Example**

```python
import numpy as np

# sample data
arr = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2]

maximum = np.max(arr)
minimum = np.min(arr)

# named data_range so the built-in range() is not shadowed
data_range = maximum - minimum
print("Maximum data value: {}, Minimum data Value: {}, Range: {}".format(maximum, minimum, data_range))
```

```
Maximum data value: 9, Minimum data Value: 1, Range: 8
```


### 2. Variance

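It's defined as the average squared deviation from the mean.<br>
To calculate it, find the difference between every data point and the mean, square each difference, add the squares together, and divide the total by the number of data values N. This gives the population variance. When the data are a sample, the total is divided by n - 1 instead (the sample variance), which is what `statistics.variance` in the example below computes.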

![variance formula](variance.png)

Where, N = number of elements<br>
μ (mu) = mean<br>
x = a single data value

**Example**

```python
import statistics

# sample data
arr = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# sample variance (divides by n - 1); statistics.pvariance divides by N instead
var = statistics.variance(arr)
print("Variance: ", var)
```

```
Variance:  7.5
```
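
To make the difference between dividing by N and dividing by n - 1 concrete, here is a minimal sketch that computes both versions by hand for the same data and prints the standard library results next to them. The variable names are illustrative only.

```python
import statistics

arr = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# squared difference between every data point and the mean
mean = sum(arr) / len(arr)
squared_diffs = [(x - mean) ** 2 for x in arr]

population_var = sum(squared_diffs) / len(arr)    # divide by N
sample_var = sum(squared_diffs) / (len(arr) - 1)  # divide by n - 1

print("Population variance: ", population_var)
print("Sample variance: ", sample_var)

# the standard library offers both versions
print("statistics.pvariance: ", statistics.pvariance(arr))
print("statistics.variance: ", statistics.variance(arr))
```

The sample variance is always the larger of the two, because the same sum of squares is divided by a smaller number.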


### 3. Standard Deviation

It's a measure of how dispersed the data is in relation to the mean.

- a low standard deviation means the data are clustered closely around the mean
- a high standard deviation indicates the data are more spread out

The standard deviation is never negative; it is zero only when all values are identical. <br>
It is the square root of the `variance` of the dataset.

**General formula**

![standard deviation formula](std_dev1.png)

Where, N = number of data values/points<br>
μ (mu) = mean<br>
σ (sigma) = standard deviation

**Example**

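```python
import statistics

# sample data
arr = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# sample standard deviation (square root of the sample variance, which divides by n - 1)
std_dev = statistics.stdev(arr)
print("Standard Deviation: ", std_dev)
```

```
Standard Deviation:  2.7386127875258306
```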
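
As a rough illustration of low versus high standard deviation, the sketch below compares two made-up datasets that share the same mean but differ in spread. It also checks that the standard deviation equals the square root of the variance. The datasets and variable names are invented for this example.

```python
import math
import statistics

tight = [48, 49, 50, 51, 52]  # values close to the mean
wide = [10, 30, 50, 70, 90]   # values far from the mean

results = {}
for name, data in (("tight", tight), ("wide", wide)):
    std_dev = statistics.stdev(data)
    # the standard deviation is the square root of the variance
    matches_variance = math.isclose(std_dev, math.sqrt(statistics.variance(data)))
    results[name] = (statistics.mean(data), std_dev, matches_variance)
    print(name, "mean:", results[name][0], "std dev:", std_dev)
```

Both datasets have a mean of 50, yet the second one has a much larger standard deviation.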


**Standard Deviation Formulas**
- Standard Deviation of Sample Data
- Standard Deviation of Population Data

The formula for the standard deviation of population data is,

![standard deviation](std_dev.png)

Where,
- s = population standard deviation
- x<sub>i</sub> = i<sup>th</sup> observation
- x̄ (x bar) = mean of the observations
- N = number of observations
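
For reference, the two formulas listed above can be written out as follows. This uses the common textbook convention, which is an assumption here and may differ from the symbols in the figures: σ and μ for the population standard deviation and mean, s and x̄ for the sample standard deviation and mean.

```latex
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}
```

The only structural difference is the divisor: N for a full population, n - 1 for a sample.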

## Quartiles

Quartiles are the values that divide the sorted data points into four equal parts.<br>
In statistics they are used to split a dataset into four quarters.

There are three quartiles:

- First / Lower Quartile
- Second Quartile / Median
- Third / Upper Quartile

The steps to find the quartiles are:

- <strong>Step 1:</strong> Sort the given data in ascending order
- <strong>Step 2:</strong> Find the quartile terms you need from the formulas below

>- first quartile = ((n + 1) / 4)<sup>th</sup> term
>- second quartile = ((n + 1) / 2)<sup>th</sup> term
>- third quartile = (3(n + 1) / 4)<sup>th</sup> term

where ***n*** is the total number of values in the dataset
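
The two steps above can be sketched in plain Python as follows. The function names `quartile_term` and `quartiles` are hypothetical. The text does not say what to do when a position such as 2.5 falls between two terms, so this sketch assumes linear interpolation between the neighbouring terms.

```python
def quartile_term(sorted_data, position):
    """Return the value at a 1-based position in sorted_data.

    Assumption of this sketch: fractional positions are resolved by
    linear interpolation between the two neighbouring terms.
    """
    n = len(sorted_data)
    position = min(max(position, 1), n)  # keep the position inside the data
    lower = int(position)                # whole part of the position
    fraction = position - lower          # fractional part of the position
    upper = min(lower + 1, n)
    low_value = sorted_data[lower - 1]
    high_value = sorted_data[upper - 1]
    return low_value + fraction * (high_value - low_value)


def quartiles(data):
    """Return (Q1, Q2, Q3) using the (n + 1)-based term formulas."""
    if not data:
        raise ValueError("quartiles() needs at least one value")
    ordered = sorted(data)  # Step 1: sort in ascending order
    n = len(ordered)
    # Step 2: look up the three terms
    q1 = quartile_term(ordered, (n + 1) / 4)
    q2 = quartile_term(ordered, (n + 1) / 2)
    q3 = quartile_term(ordered, 3 * (n + 1) / 4)
    return q1, q2, q3


print(quartiles([3, 7, 8, 5, 12, 14, 21]))      # positions 2, 4 and 6
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8, 9]))   # positions 2.5, 5 and 7.5
```

Library functions may use a different interpolation rule by default, so their results can differ slightly from this formula on small datasets.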

## Transformation

Rescaling the distribution of numeric values often helps an algorithm converge faster or reach a more accurate solution.
Rescaling changes the range of the feature values and can change their variance.
You can rescale in two ways:

- statistical `standardization` (z-score normalization)
>- Standardization typically means rescaling data to have a mean of 0 and a standard deviation of 1 (unit variance)
- min-max transformation / `normalization`
>- Normalization typically means rescaling the values into the range [0, 1]
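
Both rescaling methods can be sketched with the standard library as shown below. The function names `standardize` and `min_max_normalize` are hypothetical, the feature values are made up, and the sketch assumes the values are not all identical (otherwise both formulas divide by zero). Using the population standard deviation for the z-score is also a choice made for this example.

```python
import statistics


def standardize(values):
    """Z-score: subtract the mean, then divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std_dev = statistics.pstdev(values)
    return [(v - mean) / std_dev for v in values]


def min_max_normalize(values):
    """Min-max: map the smallest value to 0 and the largest value to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


feature = [10, 20, 30, 40, 50]

z_scores = standardize(feature)
scaled = min_max_normalize(feature)

print("Standardized: ", z_scores)
print("Normalized: ", scaled)
```

After standardization the values have a mean of 0 and a standard deviation of 1; after min-max normalization they lie between 0 and 1.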