# Descriptive Statistics

## Here I work through many concepts of Descriptive Statistics

### Here are all the libraries that we are going to need:

In [1]:
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import math as mt
from collections import defaultdict
from scipy.stats import norm
import random

### Mean

The mean is the average of a set of values: the sum of the values divided by how many values there are.

In [2]:
sample=[1,2,1,2,1]

mean_of_sample=np.mean(sample)

mean_of_sample

np.float64(1.4)

Here we used the **numpy** library, but we can also do it without any packages:

In [3]:
sample=[1,3,2,5,7,0,2,3]

mean_of_sample=sum(sample)/len(sample)

mean_of_sample

2.875

The two code samples use different lists, so their outputs differ (1.4 and 2.875), but both compute the mean in the same way; the only difference is that the first one is more compact. The formula for the mean is: sum of the numbers / length of the list of numbers. In the second code snippet the variable **mean_of_sample** is the sum of all objects in the list divided by the length of the list sample. Normally, to keep the code compact, programmers use tools like numpy.
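To see that the compact and the manual calculation really agree, we can run both on the same list. This is only a minimal sketch: it uses `statistics.mean` from the standard library as the packaged version, which is my substitution and not the numpy call from the cells above.

```python
import statistics

sample = [1, 3, 2, 5, 7, 0, 2, 3]

manual_mean = sum(sample) / len(sample)   # sum of the values / number of values
library_mean = statistics.mean(sample)    # same calculation from the standard library

print(manual_mean, library_mean)  # 2.875 2.875
```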

### Weighted Mean

The weighted mean is like the "normal" mean, except that each value counts in proportion to a weight. Imagine we have two bottles: the first one holds 2 l, the second one 1 l. Both bottles contain water and oil: 20 ml of oil in the first bottle and 5 ml in the second one. Now we want the average amount of oil in the bottles. If the bottles had the same liquid capacity, the calculation would look like this:

In [4]:
oil_in_bottles=[20, 5] # the first object is the oil in the first bottle, the second one is the oil in the second bottle

mean_of_oil=np.mean(oil_in_bottles)

mean_of_oil

np.float64(12.5)

But that's wrong, because our bottles have different liquid capacities. In this case we use the **weighted mean**. The formula is: ((n1 * w1) + (n2 * w2) + ... + (ni * wi)) / (w1 + w2 + ... + wi).

In [5]:
oil_in_bottles=[20, 5]
oil_in_bottles_weights=[2, 1] # The objects in this list are the liquid capacities of both bottles

weighted_mean= sum(s*w for s,w in zip(oil_in_bottles, oil_in_bottles_weights)) / sum(oil_in_bottles_weights)

weighted_mean

15.0

For those who don't know, the zip function pairs up the elements of two lists. In our example, list(zip(oil_in_bottles, oil_in_bottles_weights)) gives [(20, 2), (5, 1)]: it builds a tuple from each pair of matching elements. The weighted mean is used in cases like this, where some numbers are more "important" than others.
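The formula and the zip pairing can be sketched as a small reusable function. The name `weighted_mean` and the length check are my own assumptions for illustration. The last line shows that equal weights give back the plain mean from before.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(value * weight) / sum(weights)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

oil_in_bottles = [20, 5]

print(list(zip(oil_in_bottles, [2, 1])))      # [(20, 2), (5, 1)]
print(weighted_mean(oil_in_bottles, [2, 1]))  # 15.0, weights from the text
print(weighted_mean(oil_in_bottles, [1, 1]))  # 12.5, equal weights: the plain mean
```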

### Median

The median is the middle number. It's found by ordering all data points and picking out the one in the middle (or if there are two middle numbers, taking the mean of those two numbers).

In [6]:
sample=[0,1,5,10,14]


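def median(values):
    ordered=sorted(values)
    n=len(ordered)
    # for an even n, mid is the lower of the two middle indices
    mid=n // 2 - 1 if n % 2 == 0 else n // 2

    if n % 2 == 0:
        # even length: mean of the two middle values (the parentheses matter)
        return (ordered[mid] + ordered[mid+1]) / 2.0
    else:
        return ordered[mid]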

print(median(sample))

5


In this code, we pass our list (sample) to the median function as the parameter values. First we create ordered, a sorted copy of the values list, and n, the length of that list. Then we calculate the index of the middle value. Finally, the if statement returns the result depending on the length of the list: the mean of the two middle values if n is even, or the single middle value if n is odd.
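The sample above has an odd length, so it only exercises one branch. As a cross-check, here is a short sketch with `statistics.median` from the standard library (my choice of reference, not part of the notebook) on both an odd and an even list. The list `even_sample` is made up for illustration.

```python
import statistics

odd_sample = [0, 1, 5, 10, 14]
even_sample = [0, 1, 5, 10, 14, 20]  # hypothetical extra value to get an even length

print(statistics.median(odd_sample))   # 5, the single middle value
print(statistics.median(even_sample))  # 7.5, the mean of 5 and 10
```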

### Mode

The mode is the value that occurs most often in a dataset. A dataset can have more than one mode if several values share the highest count.

In [7]:
sample=[1,3,2,5,7,0,2,3]

def mode(values):
    counts=defaultdict(int)  # missing keys start at 0

    for s in values:
        counts[s] += 1

    max_count = max(counts.values())
    modes = [v for v in set(values) if counts[v] == max_count]
    return modes

print(mode(sample))

[2, 3]
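The same result can be cross-checked with the standard library. This sketch assumes Python 3.8 or newer, where `statistics.multimode` is available. It returns every value that shares the highest count, in the order they first appear, so we sort before printing.

```python
import statistics

sample = [1, 3, 2, 5, 7, 0, 2, 3]

modes = statistics.multimode(sample)  # all values that share the highest count

print(sorted(modes))  # [2, 3]
```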


### Variance

Variance is a measure of how far the data points spread out from the mean: it is the average of the squared differences from the mean.

In [8]:
data=[0,1,5,7,9,10,14]

def variance(values):
    mean=sum(values)/len(values)
    _variance=sum((v-mean)**2 for v in values)/len(values)
    return _variance

print(variance(data))

21.387755102040817


### Standard deviation

Standard deviation is a measure of how dispersed the data is in relation to the mean.

In [9]:
def std_dev(values):
    return mt.sqrt(variance(values))

print(std_dev(data))

4.624689730353899


The formula for the standard deviation is practically the same as for the variance; the only difference is that the standard deviation is the square root of the variance.
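As a sanity check on the hand-written functions, the standard library offers the same population formulas (denominator n). This is a sketch that uses `statistics.pvariance` and `statistics.pstdev` as a reference, which is my addition and not something the notebook relies on.

```python
import math
import statistics

data = [0, 1, 5, 7, 9, 10, 14]

population_variance = statistics.pvariance(data)  # divides by n
population_std_dev = statistics.pstdev(data)      # square root of the variance

print(population_variance, population_std_dev)
```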

### Sample Variance and Standard Deviation

When the data is only a sample of a larger population, we calculate the sample variance and the sample standard deviation. The formula is the same, except that the denominator is n-1 and not n.

In [10]:
data=[0,1,5,7,10,14]

def variance(values, is_sample: bool = False):
    mean=sum(values)/len(values)
    _variance=sum((v - mean) ** 2 for v in values) / (len(values) - (1 if is_sample else 0))
    return _variance

def std_dev(values, is_sample: bool = False):
    return mt.sqrt(variance(values, is_sample))

print("VARIANCE = {}".format(variance(data, is_sample=True)))
print("STD DEV = {}".format(std_dev(data, is_sample=True)))

VARIANCE = 28.56666666666667
STD DEV = 5.344779384283945
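The effect of the n-1 denominator can be sketched with the standard library, where `statistics.variance` divides by n-1 and `statistics.pvariance` divides by n. Using these two functions as a reference is my assumption. The two results differ only by the factor n / (n - 1).

```python
import statistics

data = [0, 1, 5, 7, 10, 14]
n = len(data)

population_variance = statistics.pvariance(data)  # denominator n
sample_variance = statistics.variance(data)       # denominator n - 1

# Rescaling the population variance by n / (n - 1) gives the sample variance.
print(sample_variance, population_variance * n / (n - 1))
```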


### Gauss Distribution

In probability theory and statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The function below calculates its probability density function (PDF).

In [11]:
def normal_pdf(x: float, mean: float, std_dev: float):
    # density of the normal distribution at x
    return (1.0 / (2.0 * mt.pi * std_dev ** 2) ** 0.5) * mt.exp(-1.0 * (x - mean) ** 2 / (2.0 * std_dev ** 2))
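To check the formula, the hand-written density can be compared with a library implementation. This sketch restates the function in a self-contained form and compares it with `NormalDist.pdf` from the standard library (Python 3.8+). The mean 64.43, the standard deviation 2.99 and the test point 66.0 are taken from the later cells purely as example inputs.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mean, std_dev):
    # 1 / sqrt(2 * pi * sigma^2) * exp(-(x - mu)^2 / (2 * sigma^2))
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * std_dev ** 2)
    exponent = -((x - mean) ** 2) / (2.0 * std_dev ** 2)
    return coefficient * math.exp(exponent)

mean, std_dev = 64.43, 2.99

by_hand = normal_pdf(66.0, mean, std_dev)
by_stdlib = NormalDist(mu=mean, sigma=std_dev).pdf(66.0)

print(by_hand, by_stdlib)  # the two values should agree
```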

### CDF

In [12]:
mean=64.43
std_dev=2.99

x=norm.cdf(64.43, mean, std_dev)  # cdf, not pdf: the probability of a value up to the mean

x

np.float64(0.5)

### Middle area calculation using CDF

In [13]:
x=norm.cdf(66, mean, std_dev) - norm.cdf(62, mean, std_dev)  # cdf, not pdf: area between 62 and 66

x  # roughly 0.49
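The difference between the two functions matters here: the PDF gives the height of the curve at a point, while the CDF gives the probability of landing at or below a point. So the area between two values is a difference of CDF values. A minimal sketch, assuming the standard library's `NormalDist` (Python 3.8+) as a stand-in for the scipy calls:

```python
from statistics import NormalDist

weights = NormalDist(mu=64.43, sigma=2.99)

below_mean = weights.cdf(64.43)                  # half of the area lies below the mean
middle_area = weights.cdf(66) - weights.cdf(62)  # probability of a value between 62 and 66

print(below_mean)   # 0.5
print(middle_area)  # a probability between 0 and 1
```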

### PPF

In [15]:
for i in range(0,1000):
    random_p=random.uniform(0.0,1.0)
    random_weight=norm.ppf(random_p, loc=64.43, scale=2.99)
    print(random_weight)

70.25496947936422
64.61403144308728
63.485952426504355
66.508519320678
62.11527729947581
60.85490703760752
63.83875467294527
59.10081050352763
66.95516147679048
69.82641169175285
65.21993620555963
62.74457176450808
65.96332434101909
62.04951532420052
68.5429746086633
66.55210652989919
63.46791788696449
62.645812603416516
70.07041109505305
62.07046900504851
62.668859492943014
67.47747144831303
65.338106540401
64.36590553896212
63.848886699301474
61.984734299064755
69.06059658565651
65.00114912874733
65.9037812527092
63.88369860401298
63.780894960345165
65.27870651319003
65.4616238978545
70.74588014447674
64.96642531141526
66.18100814462167
67.91438715921632
64.62158104711666
61.011703157075615
61.27731419286312
61.8998670495485
56.75687351544261
66.6209963037736
65.11956039975844
66.80486625503276
61.98954081706568
64.19832220601889
61.29413543568062
65.70126414767466
59.752860662309324
62.01092839056002
67.23580752605268
64.2437801751367
62.25607243673185
60.09342762681005
64.581324922
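The loop above works because the PPF is the inverse of the CDF: feeding it uniform random probabilities produces values that follow the normal distribution. Here is a sketch of that idea with the standard library, where `NormalDist.inv_cdf` plays the role of `norm.ppf`. The fixed seed, the sample size and the small margin that keeps p away from 0 and 1 are my own assumptions.

```python
import random
from statistics import NormalDist, fmean

weights = NormalDist(mu=64.43, sigma=2.99)

# inv_cdf undoes cdf: value -> probability -> value
round_trip = weights.inv_cdf(weights.cdf(70.0))

rng = random.Random(0)  # fixed seed so the run is repeatable
# keep p strictly inside (0, 1), where inv_cdf is defined
samples = [weights.inv_cdf(rng.uniform(1e-12, 1.0 - 1e-12)) for _ in range(10_000)]

print(round_trip)      # close to 70.0
print(fmean(samples))  # should land near the mean of 64.43
```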

### Z score

In [None]:
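def z_score(x,mean,std):
    # how many standard deviations x lies away from the mean
    return (x-mean) / std

def z_to_x(z,mean,std):
    # inverse of z_score: scale by the standard deviation, then shift by the mean
    return (z*std) + mean

mean=140000
std_dev=3000
x=150000

z=z_score(x,mean,std_dev)
back_to_x=z_to_x(z,mean,std_dev)

print("z-score: {}".format(z))
print("Back to x: {}".format(back_to_x))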