# Descriptive Statistics Notebook

## Index

1. [Central Location Measures](#Central-Location-Measures)
    - Mean
    - Median
    - Mode

2. [Non-Central Location Measures](#Non-Central-Location-Measures)
    - Min, max
    - Quartiles
    - Percentiles

3. [Dispersion Measures](#Dispersion-Measures)
    - IQR - Inter Quartile Range
    - Variance
    - Standard Deviation
    - Coefficient of Variation
    - Advanced - Extra


In [None]:
# Importing packages

import numpy as np
import matplotlib.pyplot as plt

# Central Location Measures

## Mean

In [None]:
dataset = np.random.randint(20,80,size = 500)
dataset

In [None]:
# Calculate the mean (average) by hand: the sum of the values divided by their count

mean = np.sum(dataset) / len(dataset)
print(mean)

# NumPy provides this directly
print(dataset.mean())

# The mean is a measure of location (center).
# Drawback: it is very sensitive to outliers.
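
To make the sensitivity to outliers concrete, here is a minimal sketch. The array `ages` and the outlier value 300 are made up for illustration and are not part of the notebook's dataset.

```python
import numpy as np

# Hypothetical small sample (illustrative values only)
ages = np.array([22, 25, 27, 30, 31])

# Add a single extreme value to the sample
ages_with_outlier = np.append(ages, 300)

mean_clean = ages.mean()
mean_outlier = ages_with_outlier.mean()

print("mean without outlier:", mean_clean)
print("mean with outlier:   ", mean_outlier)
```

A single extreme value is enough to pull the mean far away from where most of the data sits.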

## Median

In [None]:
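# Median: the middle value of the sorted data.
# The median is barely affected by outliers because an outlier only counts
# as "one more element" and hardly shifts the middle position.
sorted_data = np.sort(dataset)
n = len(sorted_data)
# With an even number of elements (n = 500) the median is the average
# of the two middle values, at indices n//2 - 1 and n//2.
print(sorted_data[n // 2 - 1])
print(sorted_data[n // 2])
print(np.median(dataset))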
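
As a small check of the reasoning above, the sketch below computes the median by hand for an even-sized sample and then replaces the largest value with an extreme one. The arrays are invented for illustration.

```python
import numpy as np

# Hypothetical sample with an even number of elements (illustrative values only)
ages = np.array([22, 25, 27, 30, 31, 34])

# Manual median: average of the two middle values of the sorted data
sorted_ages = np.sort(ages)
n = len(sorted_ages)
manual_median = (sorted_ages[n // 2 - 1] + sorted_ages[n // 2]) / 2

# Replace the largest value with an extreme outlier
ages_with_outlier = ages.copy()
ages_with_outlier[-1] = 300

print("manual median:      ", manual_median)
print("np.median:          ", np.median(ages))
print("median with outlier:", np.median(ages_with_outlier))
```

The outlier still occupies only one position at the end of the sorted data, so the two middle values, and therefore the median, do not move.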

## Mode

In [None]:
# The mode is the most frequent value, i.e. the one that appears most often.
# NumPy doesn't have a dedicated mode function, so we use SciPy.
# https://www.scipy.org/
from scipy import stats

stats.mode(dataset)
# stats.mode returns two fields: the mode and the number of times it occurs
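
If SciPy is not available, one possible way to get the same two pieces of information with NumPy alone is `np.unique` with `return_counts=True`. This is only a sketch on a made-up array; in case of a tie, `np.argmax` picks the first (smallest) of the tied values.

```python
import numpy as np

# Hypothetical sample (illustrative values only)
values = np.array([3, 7, 7, 2, 7, 3])

# Sorted unique values and how often each one occurs
uniques, counts = np.unique(values, return_counts=True)

# The mode is the unique value with the highest count
mode_value = uniques[np.argmax(counts)]
mode_count = counts.max()

print("mode:", mode_value, "count:", mode_count)
```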

In [None]:
dataset

# Non-Central Location Measures

## Min and max

In [None]:
print(dataset.min())
print(dataset.max())

## Quartiles and percentiles

In [None]:
# Calculate the lower quartile: the value below which 25% of the data falls
q1 = np.quantile(dataset, 0.25)#, interpolation='midpoint')
print("the first quartile is", q1)
q2 = np.quantile(dataset, 0.50)
print("the second quartile is", q2)
q3 = np.quantile(dataset, 0.75)
print("the third quartile is", q3)

In [None]:
# Plotting a histogram is a good way to visualize the distribution of the data.
# A histogram splits the value range into bins ("buckets") and counts how many values fall into each bin.
counts, _, _ = plt.hist(dataset, bins = 20)
#plt.show()

#plt.hist(dataset, bins = 20, cumulative = True)
#plt.hlines([125,250,375],20,85,colors = 'r')
# Draw the quartiles as vertical lines no taller than the tallest bar
# (a height of 500 only makes sense for the cumulative histogram above)
plt.vlines([q1,q2,q3],0,counts.max(),colors='r', label='quartiles')
plt.legend()
plt.show()
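
To see the "buckets" without a plot, `np.histogram` returns the same counts and bin edges as numbers. This is a minimal sketch on a made-up array with 4 bins instead of 20.

```python
import numpy as np

# Hypothetical sample (illustrative values only)
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# counts[i] is the number of values between edges[i] and edges[i + 1]
counts, edges = np.histogram(values, bins=4)

print("bin edges:", edges)
print("counts:   ", counts)
print("total:    ", counts.sum())
```

Every value lands in exactly one bin, so the counts add up to the size of the sample.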

In [None]:
# The box plot brings all of this information together: median, quartiles, and extreme values
plt.boxplot(dataset)
plt.show()
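
By default, matplotlib's box plot draws the whiskers out to the most extreme points that lie within 1.5 times the IQR of the box, and plots anything beyond that as individual points. The sketch below reproduces that rule by hand on a made-up array; the names `lower_fence` and `upper_fence` are chosen here for illustration.

```python
import numpy as np

# Hypothetical sample with one suspicious value (illustrative values only)
values = np.array([10, 12, 13, 14, 15, 16, 18, 40])

# Quartiles and interquartile range
q1_s, q3_s = np.quantile(values, [0.25, 0.75])
iqr_s = q3_s - q1_s

# Points beyond these limits are drawn as outliers by the default box plot
lower_fence = q1_s - 1.5 * iqr_s
upper_fence = q3_s + 1.5 * iqr_s

outliers = values[(values < lower_fence) | (values > upper_fence)]

print("fences:  ", lower_fence, upper_fence)
print("outliers:", outliers)
```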

# Dispersion Measures

## IQR - Inter Quartile Range

In [None]:
# Range: the difference between the largest and the smallest value
print(dataset.max() - dataset.min())

# Interquartile range (IQR)
# What does the IQR tell us about how spread out the data is?
# About 50% of the data lies between the lower and the upper quartile,
# so the IQR measures how spread out the middle 50% of the data is.
iqr = q3 - q1
print(iqr)
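
One reason to prefer the IQR over the plain range is robustness. The sketch below compares both on a made-up array before and after replacing its largest value with an outlier; `iqr_of` is a helper defined here for illustration.

```python
import numpy as np

def iqr_of(values):
    """Interquartile range: third quartile minus first quartile."""
    q1_v, q3_v = np.quantile(values, [0.25, 0.75])
    return q3_v - q1_v

# Hypothetical sample: the integers 1..9 (illustrative values only)
base = np.arange(1, 10)

# Same sample with the largest value replaced by an outlier
with_outlier = base.copy()
with_outlier[-1] = 100

range_base = base.max() - base.min()
range_outlier = with_outlier.max() - with_outlier.min()

print("range:", range_base, "->", range_outlier)
print("IQR:  ", iqr_of(base), "->", iqr_of(with_outlier))
```

The range jumps because it depends only on the two extremes, while the IQR ignores the outer quarters of the data and stays the same.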

In [None]:
dataset-np.mean(dataset)

In [None]:
np.square(dataset-np.mean(dataset))

## Variance

In [None]:
# Variance: the average of the squared differences between each point and the mean

# We square the difference between each point and the mean
squared_differences = np.square(dataset-np.mean(dataset))
variance = np.sum(squared_differences)/len(dataset)
print(variance)
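
The manual calculation above divides by the number of elements, which is the population variance and matches `np.var` with its default `ddof=0`. The sketch below, on a made-up array, also shows the sample variance (`ddof=1`), which divides by `n - 1` instead.

```python
import numpy as np

# Hypothetical sample (illustrative values only)
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Manual population variance: mean of the squared deviations
manual_variance = np.sum(np.square(values - np.mean(values))) / len(values)

# NumPy equivalents
population_variance = np.var(values)       # divides by n (ddof=0, the default)
sample_variance = np.var(values, ddof=1)   # divides by n - 1

print(manual_variance, population_variance, sample_variance)
```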

## Standard Deviation

In [None]:
# Is the variance a meaningful indicator? Why or why not?
# The units of the variance are the square of the units of the elements,
# which makes it hard to relate to the mean.

standard_deviation = np.sqrt(variance)

# The standard deviation has the same units as the elements.

print('The "typical" variation of ages in this data set is', standard_deviation)
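
A short sketch of the same idea on a made-up array: `np.std` is the square root of `np.var`, and because it is in the same units as the data, it can be used directly to ask how much of the data lies within one standard deviation of the mean.

```python
import numpy as np

# Hypothetical sample (illustrative values only)
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

std_manual = np.sqrt(np.var(values))
std_numpy = np.std(values)

# Share of the values within one standard deviation of the mean
within_one_std = np.abs(values - values.mean()) <= std_numpy
share_within = within_one_std.mean()

print("standard deviation:", std_numpy)
print("share within one std of the mean:", share_within)
```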


In [None]:
dataset2 = []  # placeholder: fill in a second dataset to compare; an empty list yields nan in the ratio below

In [None]:
np.std(dataset)/np.mean(dataset)

In [None]:
np.std(dataset2)/np.mean(dataset2)

## Coefficient of Variation

In [None]:
# Coefficient of variation (CV): the standard deviation expressed as a fraction of the mean

cv = standard_deviation/np.mean(dataset)
print(cv)

# Big advantage: this indicator has no units, so it is a relative indicator.
# It is very useful when comparing the variation of measures with different units.

# e.g.
# webdev cohort: std of heights = 20 cm
# cv_webdev = 35%
# data cohort: std of ages = 4 years
# cv_data = 15%
# Which has more variation: heights in webdev or ages in data?
# Based on the CV, ages in data have less variation than heights in webdev.



# One BIG disadvantage of the CV: it changes when the zero point of the scale is shifted,
# so it is only meaningful for scales with a true zero.

# These numbers are exactly the same temperatures on two different scales.
# Celsius to Fahrenheit: C*(9/5) + 32 = F

Celsius =  [0, 10, 20, 30, 40]
Fahrenheit =  [32, 50, 68, 86, 104]

print(np.std(Celsius)/np.mean(Celsius))
print(np.std(Fahrenheit)/np.mean(Fahrenheit))
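
The two CVs differ because the conversion affects the numerator and the denominator differently: the standard deviation is only multiplied by 9/5 (a shift does not change the spread), while the mean is multiplied by 9/5 and then shifted by 32. A minimal sketch of that decomposition, using lowercase array names chosen here for illustration:

```python
import numpy as np

celsius = np.array([0, 10, 20, 30, 40])
fahrenheit = celsius * 9 / 5 + 32

# The spread is only scaled: adding 32 does not change it
std_ratio = np.std(fahrenheit) / np.std(celsius)

# The mean is scaled AND shifted
mean_c = np.mean(celsius)
mean_f = np.mean(fahrenheit)

cv_c = np.std(celsius) / mean_c
cv_f = np.std(fahrenheit) / mean_f

print("std ratio F/C:", std_ratio)
print("mean C:", mean_c, "mean F:", mean_f)
print("CV C:", cv_c, "CV F:", cv_f)
```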

### Advanced - Extra

In [None]:
#ADVANCED: Rankine is to Fahrenheit what Kelvin is to Celsius: the same degree size, but starting at absolute zero
#(K = C + 273.15). Kelvin and Rankine are therefore proportional, with no offset: R = K * (9/5),
#so the CV is the same on both scales.
Kelvin = [273.15, 283.15, 293.15, 303.15, 313.15]
Rankine = [491.67, 509.67, 527.67, 545.67, 563.67]
#print(np.divide(np.array(Rankine),np.array(Kelvin)))

print(np.std(Kelvin)/np.mean(Kelvin))
print(np.std(Rankine)/np.mean(Rankine))
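
The Kelvin and Rankine CVs match because a pure rescaling multiplies the standard deviation and the mean by the same factor, which cancels in the ratio. A minimal sketch that builds Rankine from Kelvin instead of typing the values in, with the helper `cv_of` defined here for illustration:

```python
import numpy as np

def cv_of(values):
    """Coefficient of variation: standard deviation divided by the mean."""
    return np.std(values) / np.mean(values)

kelvin = np.array([273.15, 283.15, 293.15, 303.15, 313.15])

# Pure rescaling, no offset
rankine = kelvin * 9 / 5

cv_kelvin = cv_of(kelvin)
cv_rankine = cv_of(rankine)

print(rankine)
print(cv_kelvin, cv_rankine)
```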