# Descriptive Statistics Notebook

## Index

1. [Central Location Measures](#Central-Location-Measures)
    - Mean
    - Median
    - Mode

2. [Non-Central Location Measures](#Non-Central-Location-Measures)
    - Min, max
    - Quartiles
    - Percentiles

3. [Dispersion Measures](#Dispersion-Measures)
    - IQR - Inter Quartile Range
    - Variance
    - Standard Deviation
    - Coefficient of Variation
    - Advanced - Extra


In [None]:
# Importing packages

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Central Location Measures

## Mean

In [None]:
dataset = np.random.randint(20,80,size = 500) 
dataset


In [None]:

# calculate the mean/average

mean_dataset = np.sum(dataset) / len(dataset)
mean_dataset


# option 2
mean_dataset_2 = np.sum(dataset) / dataset.size
mean_dataset_2

In [None]:
mean_dataset == mean_dataset_2
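
As a quick cross-check, the manual computation can be compared against NumPy's built-in `np.mean`. The sketch below uses a small hypothetical sample (`ages` is not part of the notebook's data) and compares the floats with `np.isclose` instead of `==`, which is usually the safer habit:

```python
import numpy as np

# Hypothetical small sample, used only for illustration
ages = np.array([23, 35, 41, 58, 62])

manual_mean = np.sum(ages) / ages.size   # sum of the values divided by their count
builtin_mean = np.mean(ages)             # NumPy's built-in equivalent

# Compare floating-point results with a tolerance rather than exact equality
means_match = np.isclose(manual_mean, builtin_mean)
print(manual_mean, builtin_mean, means_match)
```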

## Median

In [None]:
# median -> the middle value of the sorted data
# the median is barely affected by outliers:
# an outlier only counts as "one more element" and barely shifts the center

# inspect a couple of datapoints near both ends of the sorted data
print(np.sort(dataset)[5])
print(np.sort(dataset)[496])


# median
np.median(dataset)
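
To make the point about outliers concrete, here is a minimal sketch with a hypothetical five-element sample (`ages` and `ages_with_outlier` are illustrative names). Adding one extreme value moves the mean a lot, while the median barely changes:

```python
import numpy as np

ages = np.array([22, 25, 27, 30, 31])       # hypothetical sample
ages_with_outlier = np.append(ages, 400)    # same sample plus one extreme value

mean_before, median_before = np.mean(ages), np.median(ages)
mean_after, median_after = np.mean(ages_with_outlier), np.median(ages_with_outlier)

print(mean_before, median_before)   # 27.0 27.0
print(mean_after, median_after)     # the mean jumps to about 89.2, the median only moves to 28.5
```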


## Mode

In [None]:
# the mode is the most frequent value, i.e. the one that appears the most
# NumPy doesn't have a built-in mode function, so we use SciPy
# https://www.scipy.org/
from scipy import stats

stats.mode(dataset)

# stats.mode returns 2 elements -> the mode, and the count of that element
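
If SciPy is not available, the same two pieces of information (the mode and its count) can be obtained with NumPy alone by counting how often each distinct value occurs. This is only a sketch on a hypothetical array `values`; when several values tie, `np.argmax` picks the first one, which here means the smallest tied value:

```python
import numpy as np

values = np.array([3, 7, 7, 2, 3, 7, 5])   # hypothetical sample

# Distinct values (sorted) and how many times each one appears
unique_values, counts = np.unique(values, return_counts=True)

mode_value = unique_values[np.argmax(counts)]   # value with the highest count
mode_count = counts.max()                       # how often it appears

print(mode_value, mode_count)   # 7 3
```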



In [None]:
np.sort(dataset)

# Non-Central Location Measures

## Min and max

In [None]:
print("Min : ", dataset.min())  # 100% of our datapoints are equal to or above this value

print("Max : ", dataset.max())  # 0% of our datapoints are above this value (at most 79 here, since the upper bound 80 is exclusive), i.e. 100% are equal to or below it
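
Read this way, the minimum and maximum are simply the two extreme quantiles. The following sketch, on a hypothetical array `values`, checks that reading with `np.quantile` and with the share of points on each side:

```python
import numpy as np

values = np.array([1, 8, 2, 4, 5, 10])   # hypothetical sample

lowest = np.quantile(values, 0.0)    # the 0% quantile is the minimum
highest = np.quantile(values, 1.0)   # the 100% quantile is the maximum

# Share of datapoints at or above the minimum, and strictly above the maximum
share_at_or_above_min = np.mean(values >= values.min())
share_above_max = np.mean(values > values.max())

print(lowest, highest, share_at_or_above_min, share_above_max)
```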

## Quartiles and percentiles

In [None]:
# calculate the quartiles -> q1 (the lower quartile) is the value below which 25% of the data lies

q1 = np.quantile(np.sort(dataset), 0.25)
q2 = np.quantile(dataset, 0.50)
q3 = np.quantile(dataset, 0.75)


print(q1, q2, q3)

In [None]:
np.quantile(dataset, 0.25)


In [None]:
np.quantile(np.sort(dataset), 0.25)

In [None]:
[1,8, 2, 4, 5, 10]
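
The two quantile cells above return the same value because `np.quantile` does not require sorted input. A minimal sketch on the small list above (the name `values` is introduced here for illustration) shows this, and also that `np.percentile` expresses the same thing on a 0 to 100 scale:

```python
import numpy as np

values = np.array([1, 8, 2, 4, 5, 10])

q1_unsorted = np.quantile(values, 0.25)            # input order does not matter
q1_sorted = np.quantile(np.sort(values), 0.25)     # sorting first gives the same result
q1_percentile = np.percentile(values, 25)          # same quantile, expressed as a percentage

print(q1_unsorted, q1_sorted, q1_percentile)   # 2.5 2.5 2.5 (default linear interpolation)
```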

In [None]:
# plotting a histogram lets us visualize the distribution of our data
# a histogram creates "buckets" (bins) of values, places each number of the dataset in its bin, and counts them

plt.hist(x=dataset, bins= 10)


plt.vlines([q1,q2,q3], 0, 60, colors= 'r', label = [q1,q2,q3])
plt.show()



#plt.hist(dataset, bins = 20, cumulative = True)
#plt.hlines([125,250,375],20,85,colors = 'r')
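
The bucketing that the histogram performs can also be inspected without plotting. The sketch below assumes a seeded sample drawn the same way as `dataset` (the names `rng` and `sample` are illustrative) and uses `np.histogram`, which returns the count per bin and the bin edges:

```python
import numpy as np

rng = np.random.default_rng(0)             # seeded so the example is reproducible
sample = rng.integers(20, 80, size=500)    # integers from 20 to 79, like the notebook's dataset

# counts[i] is the number of values that fall between bin_edges[i] and bin_edges[i + 1]
counts, bin_edges = np.histogram(sample, bins=10)

print(counts)
print(bin_edges)
print(counts.sum())   # every datapoint lands in exactly one bin
```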


In [None]:
plt.scatter(y=[1, 2, 3, 4], x=dataset[:4])

In [None]:
# histogram with categorical values
plt.hist(x=['charlotte', 'max', 'lucas', 'tolga', 'borja', 'charlotte', 'charlotte', 'tolga'])

In [None]:
# the box plot brings together all this information

plt.boxplot(dataset)
plt.show()

In [None]:
dataset_age = pd.DataFrame(dataset, columns= ['Age'])

dataset_age

In [None]:
q1_df =np.quantile(dataset_age['Age'], 0.25)
q1_df

In [None]:
plt.boxplot(dataset_age['Age'])
plt.show()

# Dispersion Measures

## IQR - Inter Quartile Range

In [None]:

# Interquartile Range

# range


# what information does the IQR give you about how spread out the data is?
# the middle 50% of the data lies between the lower quartile and the upper quartile
# so the IQR measures how spread out that middle 50% of the data is

print(q1)
print(q3)


In [None]:
iqr = q3 - q1
iqr


In [None]:
bottom_thresh = q1 - iqr*1.5
upper_thresh = q3 + iqr * 1.5
[bottom_thresh, upper_thresh]
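
Thresholds of this form are commonly used as cut-offs for flagging potential outliers (the 1.5 × IQR rule). A minimal sketch on a hypothetical array `values`, which contains one obviously extreme point, could look like this:

```python
import numpy as np

values = np.array([21, 25, 27, 30, 31, 33, 36, 95])   # hypothetical sample with one extreme value

q1, q3 = np.quantile(values, [0.25, 0.75])
iqr = q3 - q1

lower_thresh = q1 - 1.5 * iqr
upper_thresh = q3 + 1.5 * iqr

# Anything outside the two thresholds is flagged as a potential outlier
outliers = values[(values < lower_thresh) | (values > upper_thresh)]

print(lower_thresh, upper_thresh)   # 15.625 44.625
print(outliers)                     # [95]
```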

## Variance

In [None]:
# mean
ds_mean = np.mean(dataset)

# variance

# we square the differences between the mean and each point
squared_differences = np.square( dataset - ds_mean)
squared_differences

variance = np.sum(squared_differences) / len(dataset)
variance
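
NumPy also provides `np.var`. By default it divides by `n` (`ddof=0`), which matches the manual calculation above; passing `ddof=1` divides by `n - 1` instead, the usual choice when the data is a sample from a larger population. The sketch uses a hypothetical array `values`:

```python
import numpy as np

values = np.array([2, 4, 4, 4, 5, 5, 7, 9])   # hypothetical sample, mean = 5

# Manual calculation: mean of the squared differences from the mean
manual_variance = np.sum(np.square(values - np.mean(values))) / values.size

population_variance = np.var(values)        # ddof=0 by default: divide by n
sample_variance = np.var(values, ddof=1)    # divide by n - 1

print(manual_variance, population_variance, sample_variance)
```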

## Standard Deviation

In [None]:

# is the variance a meaningful indicator? why?
# the units of the variance are the square of the units of each element,
# which makes it hard to relate to the average.
# the standard deviation has the same units as each element

standard_deviation = np.sqrt(variance)
standard_deviation

In [None]:
np.mean(dataset)

In [None]:
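# For a typical person in this dataset, we can expect the age to lie roughly within one standard deviation of the mean:
# [50 - 17 ; 50 + 17] -> [33 ; 67]

[np.mean(dataset) - standard_deviation, np.mean(dataset) + standard_deviation]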
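
How much of the data actually falls inside a "mean ± one standard deviation" interval depends on the shape of the distribution. As a rough check, the sketch below counts that share for a seeded sample drawn the same way as `dataset` (`rng` and `sample` are illustrative names). For roughly uniform data like this, the share tends to be somewhat lower than the approximately 68% expected for normally distributed data:

```python
import numpy as np

rng = np.random.default_rng(0)             # seeded so the example is reproducible
sample = rng.integers(20, 80, size=500)    # integers from 20 to 79

center = np.mean(sample)
spread = np.std(sample)

# Boolean mask of the points inside [mean - std, mean + std]
within_one_std = (sample >= center - spread) & (sample <= center + spread)
share_within = np.mean(within_one_std)     # fraction of True values

print(round(center, 1), round(spread, 1), round(share_within, 2))
```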

## Coefficient of Variation

In [None]:
# coefficient of variation: standard deviation as a percentage of the mean

# big advantage -> this indicator has no units -> it's a relative indicator
# very useful when comparing the variation of measures with different units

# e.g.
# webdev cohort std of height = 20 cm
# cv_webdev = 35%
# data cohort std of ages = 4 years
# cv_data = 15%
# what has more variation? heights in webdev or ages in data?
# based on the CV, I can say that ages in data have less variation than heights in webdev



# one BIG disadvantage of the CV -> it changes under transformations that shift the zero point

# these temperatures are exactly the same, just expressed in two different scales
# Celsius to Fahrenheit example: C*(9/5) + 32 = F

Celsius =  [0, 10, 20, 30, 40]
Fahrenheit =  [32, 50, 68, 86, 104]

print(np.std(Celsius)/np.mean(Celsius))
print(np.std(Fahrenheit)/np.mean(Fahrenheit))

In [None]:
cv = standard_deviation / np.mean(dataset) 
cv

### Advanced - Extra

In [None]:
# ADVANCED: Rankine is to Fahrenheit what Kelvin is to Celsius (the same scale shifted so that zero is absolute zero, -273.15 °C),
# which means Kelvin and Rankine differ only by a constant factor (9/5 to be exact)
Kelvin = [273.15, 283.15, 293.15, 303.15, 313.15]
Rankine = [491.67, 509.67, 527.67, 545.67, 563.67]
#print(np.divide(np.array(Rankine),np.array(Kelvin)))
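
Because Kelvin and Rankine differ only by a constant factor, the coefficient of variation should come out the same on both scales, unlike the Celsius and Fahrenheit case. A minimal sketch of that check (the helper `coefficient_of_variation` and the lowercase variable names are introduced here for illustration):

```python
import numpy as np

kelvin = np.array([273.15, 283.15, 293.15, 303.15, 313.15])
rankine = kelvin * 9 / 5      # pure scaling, no offset
celsius = kelvin - 273.15     # same temperatures with a shifted zero point


def coefficient_of_variation(x):
    """Standard deviation divided by the mean (population std, ddof=0)."""
    return np.std(x) / np.mean(x)


cv_kelvin = coefficient_of_variation(kelvin)
cv_rankine = coefficient_of_variation(rankine)
cv_celsius = coefficient_of_variation(celsius)

print(cv_kelvin, cv_rankine)   # equal up to floating-point error: scaling cancels out
print(cv_celsius)              # different: shifting the zero point changes the mean but not the std
```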
