# Introduction to descriptive statistics

by Felix Fritzen (fritzen@simtech.uni-stuttgart.de)

additional material for the course _Data processing for engineers and scientists_ at the University of Stuttgart

# Location, dispersion and shape parameters

- given data samples extract the location, dispersion and shape parameters
- give basic interpretation of the results and illustrate the properties
- real world data is used (see also other Jupyter notebooks)


In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from math import ceil, floor

from submodules.stochastics import data_binning
from submodules.stochastics import sample_distribution as sample
from submodules.stochastics import distribution_plots as plot

## Example 1: Age distribution in Germany
data source: [kaggle.com](https://www.kaggle.com/lachmann12/world-population-demographics-by-age-2019)
file has been modified (see previous `jupyter` notebooks)

In [2]:
full_data = pd.read_csv('data/world_demographics.csv')[['Country','Value','Age']]
data      = full_data[ full_data.Country.eq('Germany') ]

age = data['Age'].values
idx = np.argsort(age)
population = data['Value'].values[ idx ]
age = age[ idx ]
#population is already given in bins

n_bin_set = [ 5, 10, 20, 40, 110 ]
N = population.sum()
print('  n_bins  |  mean  |  sigma  |   skew   |   kurt  |  median')
for n_bins in n_bin_set:
    pop, ages, binwidth = data_binning.rebin_data( population, age, n_bins, width=1)
    mean = np.dot( ages, pop ) / N
    var  = np.dot( (ages-mean)**2, pop ) / (N-1) 
    sigma=np.sqrt(var)
    skew = np.dot( (ages-mean)**3, pop ) / N / sigma**3
    kurt = np.dot( (ages-mean)**4, pop ) / N / sigma**4 - 3
    median=0
    m=0
    while m < N/2:
        m += pop[median]
        median += 1
    median = ages[median-1]
    print('    %3d   |  %4.1f  | %5.3f  |  %6.3f  | %6.3f  |    %3d' % \
         ( n_bins, mean, sigma, skew, kurt, median ))



  n_bins  |  mean  |  sigma  |   skew   |   kurt  |  median
      5   |  44.3  | 23.349  |   0.052  | -0.945  |     54
     10   |  44.0  | 23.492  |   0.001  | -0.938  |     49
     20   |  43.9  | 23.416  |  -0.028  | -0.930  |     46
     40   |  43.9  | 23.409  |  -0.034  | -0.928  |     44
    110   |  43.9  | 23.405  |  -0.037  | -0.927  |     45


## Example 2: Minimum, average and maximum daily temperature in October 2020

(data taken from https://www.stadtklima-stuttgart.de/index.php?luft_messdaten_download )

In [3]:
# columns contain average, maximum and minimum temperature for Oct 1 (first row) to Oct 31 (last row)
T = np.array( [[14.6,20.0,10.5],[16.1,20.6,12.3],[13.5,16.5,10.2],[14.4,19.3,9.6],
[12.8,16.1,10.9],[13.2,16.5,11.6],[12.5,16.1,10.5],[14.6,19.6,10.8],
[16.6,19.8,13.8],[11.6,15.8,6.7],[8.3,11.8,5.5],[9.1,12.6,5.2],
[9.6,13.9,5.6],[7.0,11.0,3.5],[8.9,10.2,7.5],[9.7,10.4,8.8],
[8.8,11.6,7.3],[8.9,11.5,6.1],[9.8,14.9,6.3],[11.0,18.9,5.0],
[14.9,20.8,11.3],[17.2,22.8,13.7],[15.4,17.2,14.5],[14.3,18.2,10.6],
[13.5,19.9,8.0],[10.4,13.2,7.9],[9.7,13.2,7.4],[12.3,15.1,10.0],
[12.1,14.1,10.2],[13.6,16.0,11.1],[13.8,20.5,9.5]])

N = T.shape[0]
# attention: min/max of average/maximum/minimum temperature
T_min = np.min(T, axis=0)
T_max = np.max(T, axis=0)
# range: largest max - smallest min
T_range=T_max[1]-T_min[2]

T_mean= np.mean( T, axis=0 )
T_median=np.median( T, axis=0 )
T_sigma= np.sum( (T - T_mean[None,:])**2, axis=0 )/(N-1)
T_skew =  np.sum( (T - T_mean[None,:])**3, axis=0 )/N  / T_sigma**3
T_kurt =  np.sum( (T - T_mean[None,:])**4, axis=0 )/N  / T_sigma**4 -3 
T_mean_AD= np.sum( np.abs(T - T_mean[None,:]), axis=0 ) / N
T_median_AD= np.sum( np.abs(T - T_median[None,:]), axis=0 ) / N

print('          |  unit  |  av. temperature  | max. temperature | min. temperature')
print('------------------------------------------------------------------------------')
print(' mean     | deg. C |     %12.1f  |    %12.1f  |    %12.1f' % ( tuple(T_mean)) )
print(' mean AD  | deg. C |     %12.1f  |    %12.1f  |    %12.1f' % ( tuple(T_mean_AD)) )
print(' median   | deg. C |     %12.1f  |    %12.1f  |    %12.1f' % ( tuple(T_median)) )
print(' median AD| deg. C |     %12.1f  |    %12.1f  |    %12.1f' % ( tuple(T_median_AD)) )
print(' sigma    | deg. C |     %12.1f  |    %12.1f  |    %12.1f' % ( tuple(T_sigma)) )
print(' skewness |   --   |     %12.5f  |    %12.5f  |    %12.5f' % ( tuple(T_skew)) )
print(' kurtosis |   --   |     %12.5f  |    %12.5f  |    %12.5f' % ( tuple(T_kurt)) )
print('------------------------------------------------------------------------------')


          |  unit  |  av. temperature  | max. temperature | min. temperature
------------------------------------------------------------------------------
 mean     | deg. C |             12.2  |            16.1  |             9.1
 mean AD  | deg. C |              2.3  |             2.9  |             2.4
 median   | deg. C |             12.5  |            16.1  |             9.6
 median AD| deg. C |              2.3  |             2.9  |             2.3
 sigma    | deg. C |              7.4  |            12.7  |             7.9
 skewness |   --   |         -0.00232  |        -0.00021  |        -0.00136
 kurtosis |   --   |         -2.96683  |        -2.98905  |        -2.96628
------------------------------------------------------------------------------


## Example 3: Correction factors for binned data

**Observations**
- binned data has a loss in accuracy over the original data
- consider variance/std. deviation, median and the skewness (based on corrected std. dev.)

### Meteorology data from Stuttgart in 2020, S-Mitte, Schwabenzentrum

data source: https://www.stadtklima-stuttgart.de/index.php?luft_messdaten_download

In [6]:
import numpy as np
import pandas
import matplotlib.pyplot as plt

T_min = -15
T_max = 50
step  = 5
u = np.arange(T_min, T_max+step,step=step,dtype=float)
w = u[1]-u[0]
l = u-w
c = (u+l)/2
n_bin = c.size
N_min = np.zeros(n_bin)
N_av  = np.zeros(n_bin)
N_max = np.zeros(n_bin)

for year in [2019, 2020]:
    print('---------------------------------------------------')
    print(' data for year ', year )
    print('---------------------------------------------------')
    yr = str(year)
    data = pandas.read_excel('data/weather/SZ-Tages-Werte-2000-2020.xlsx', \
                                       skiprows=6, sheet_name=yr)

    temp_av = data.iloc[:,1].values[:-4]
    temp_max = data.iloc[:,2].values[:-4]
    temp_min = data.iloc[:,3].values[:-4]

    for i in range(u.shape[0]):
        N_min[i] = np.logical_and(temp_min<=u[i], temp_min>l[i]).sum()
        N_av[i]  = np.logical_and(temp_av<=u[i], temp_av>l[i]).sum()
        N_max[i] = np.logical_and(temp_max<=u[i], temp_max>l[i]).sum()

    # compute data for the full year -- reference data using original inputs
    n = temp_av.shape[0]

    mean_T_av   = np.sum(temp_av)/n
    median_T_av = np.median(temp_av)
    sigma_T_av  = np.sqrt( np.sum((temp_av-mean_T_av)**2)/(n-1) )
    skew_T_av   = np.sum((temp_av-mean_T_av)**3)/n /sigma_T_av**3

    mean_T_min  = np.sum(temp_min)/n
    median_T_min= np.median(temp_min)
    sigma_T_min = np.sqrt( np.sum((temp_min-mean_T_av)**2)/(n-1) )
    skew_T_min  = np.sum((temp_min-mean_T_av)**3)/n /sigma_T_min**3

    mean_T_max  = np.sum(temp_max)/n
    median_T_max= np.median(temp_max)
    sigma_T_max = np.sqrt( np.sum((temp_max-mean_T_av)**2)/(n-1) )
    skew_T_max  = np.sum((temp_max-mean_T_av)**3)/n /sigma_T_max**3
    # compute date for the full year -- reference data using binned inputs
    n = temp_av.shape[0]

    mean_T_av_bin   = np.dot( c, N_av )/n
    tmp = N_av.cumsum()
    median_T_av_bin = c[ np.argmax( tmp >= 0.5*n ) ]
    idx = np.argmax( tmp >= n*0.5 )
    alpha=(0.5*n - tmp[idx-1])/N_av[idx]
    median_T_av_bin_corr = l[idx] + alpha*w
    sigma_T_av_bin  = np.sqrt( np.dot((c-mean_T_av_bin)**2, N_av)/(n-1) )
    sigma_T_av_bin_corr  = np.sqrt( sigma_T_av_bin**2 - w*w/12 )
    skew_T_av_bin  = np.dot((c-mean_T_av_bin)**3, N_av)/n / sigma_T_av_bin**3
    skew_T_av_bin_corr = skew_T_av_bin * (sigma_T_av_bin/sigma_T_av_bin_corr)**3

    mean_T_min_bin  = np.dot( c, N_min )/n
    tmp = N_min.cumsum()
    median_T_min_bin = c[ np.argmax( tmp >= 0.5*n ) ]
    idx = np.argmax( tmp >= n*0.5 )
    alpha=(0.5*n - tmp[idx-1])/N_min[idx]
    median_T_min_bin_corr = l[idx] + alpha*w

    sigma_T_min_bin = np.sqrt( np.dot((c-mean_T_min_bin)**2, N_min)/(n-1) )
    sigma_T_min_bin_corr  = np.sqrt( sigma_T_min_bin**2 - w*w/12 )
    skew_T_min_bin = np.dot((c-mean_T_min_bin)**2, N_min)/n / sigma_T_min_bin**3
    skew_T_min_bin_corr = skew_T_min_bin * (sigma_T_min_bin/sigma_T_min_bin_corr)**3

    mean_T_max_bin  = np.dot( c, N_max )/n
    tmp = N_max.cumsum()
    median_T_max_bin = c[ np.argmax( tmp >= 0.5*n ) ]
    idx = np.argmax( tmp >= n*0.5 )
    alpha=(0.5*n - tmp[idx-1])/N_max[idx]
    median_T_max_bin_corr = l[idx] + alpha*w

    sigma_T_max_bin = np.sqrt( np.dot((c-mean_T_max_bin)**2, N_max)/(n-1) )
    sigma_T_max_bin_corr = np.sqrt( sigma_T_max_bin**2 - w*w/12 )
    skew_T_max_bin = np.dot((c-mean_T_max_bin)**2, N_max)/n / sigma_T_max_bin**3
    skew_T_max_bin_corr = skew_T_max_bin * (sigma_T_max_bin/sigma_T_max_bin_corr)**3


    print(' av. temp.    |  original data  |  binned data     ')
    print('---------------------------------------------------')
    print(' mean         |    %10.3f   |  %10.3f' % ( mean_T_av, mean_T_av_bin ))
    print(' median       |    %10.3f   |  %10.3f   (corrected: %10.3f)' % ( median_T_av, median_T_av_bin, median_T_av_bin_corr ))
    print(' sigma        |    %10.3f   |  %10.3f   (corrected: %10.3f)' % ( sigma_T_av, sigma_T_av_bin, sigma_T_av_bin_corr ))
    print(' skewness     |    %10.3f   |  %10.3f   (corrected: %10.3f)\n' % ( skew_T_av, skew_T_av_bin, skew_T_av_bin_corr ))

    print(' min. temp.   |  original data  |  binned data     ')
    print('---------------------------------------------------')
    print(' mean         |    %10.3f   |  %10.3f' % ( mean_T_min, mean_T_min_bin ))
    print(' median       |    %10.3f   |  %10.3f   (corrected: %10.3f)' % ( median_T_min, median_T_min_bin, median_T_min_bin_corr ))
    print(' sigma        |    %10.3f   |  %10.3f   (corrected: %10.3f)' % ( sigma_T_min, sigma_T_min_bin, sigma_T_min_bin_corr ))
    print(' skewness     |    %10.3f   |  %10.3f   (corrected: %10.3f)\n' % ( skew_T_min, skew_T_min_bin, skew_T_min_bin_corr ))

    print(' max. temp.   |  original data  |  binned data     ')
    print('---------------------------------------------------')
    print(' mean         |    %10.3f   |  %10.3f' % ( mean_T_max, mean_T_max_bin ))
    print(' median       |    %10.3f   |  %10.3f   (corrected: %10.3f)' % ( median_T_max, median_T_max_bin, median_T_max_bin_corr ))
    print(' sigma        |    %10.3f   |  %10.3f   (corrected: %10.3f)' % ( sigma_T_max, sigma_T_max_bin, sigma_T_max_bin_corr ))
    print(' skewness     |    %10.3f   |  %10.3f   (corrected: %10.3f)' % ( skew_T_max, skew_T_max_bin, skew_T_max_bin_corr ))



---------------------------------------------------
 data for year  2019
---------------------------------------------------
 av. temp.    |  original data  |  binned data     
---------------------------------------------------
 mean         |        12.805   |      12.856
 median       |        12.200   |      12.500   (corrected:     12.627)
 sigma        |         7.211   |       7.254   (corrected:      7.109)
 skewness     |         0.178   |       0.092   (corrected:      0.097)

 min. temp.   |  original data  |  binned data     
---------------------------------------------------
 mean         |         8.552   |       8.445
 median       |         7.900   |       7.500   (corrected:      8.075)
 sigma        |         7.622   |       6.398   (corrected:      6.233)
 skewness     |        -1.243   |       0.156   (corrected:      0.169)

 max. temp.   |  original data  |  binned data     
---------------------------------------------------
 mean         |        17.708   |    

### Observations
- binning has little effect on mean if sufficiently many samples are available
- median accuracy is directly affected by the bin width
- standard deviation shows some discrepancy, especially for min./max. temperature
- Sheppard's correction for sigma is of moderate use if the data is not normally distributed