# Functional Programming: Rudimentary Statistics and Analytics

## Building a Function
| New Concepts | Description |
| --- | --- |
| _return obj_ (from function) | A function may return an object. The caller saves that object by assigning the function call to a variable, e.g., var1 = function(obj1, obj2, . . .) |

In [1]:
# def function_name(object1, object2, . . ., objectn):
    # <operations>
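
As a minimal sketch of the template above, the hypothetical function `add_two` (the name and its arguments are illustrative and not part of the original notebook) returns an object that the caller saves by assigning the call to a variable:

```python
# hypothetical example: add_two, obj1 and obj2 are illustrative names only
def add_two(obj1, obj2):
    # <operations>
    result = obj1 + obj2
    # return hands the object back to the caller
    return result

# the returned object is saved because the call is assigned to var1
var1 = add_two(2, 3)
print(var1)
```

Without the `return` line, `var1` would hold `None`, since a Python function that does not return anything explicitly returns `None`.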

### Total
$\sum_{i=0}^{n-1} x_{i}$

In [2]:
# repeating this loop for every new list is inefficient; a function avoids that
total = 0
values = [i for i in range(10)]

print("total\t","value")
for value in values:
    total += value
    print(total,"\t", value)
    
print("final total:", total)

total	 value
0 	 0
1 	 1
3 	 2
6 	 3
10 	 4
15 	 5
21 	 6
28 	 7
36 	 8
45 	 9
final total: 45


In [3]:
def total(lst):
    total_ = 0
    # in original we used the index of the list
    # . . .
    # n = len(lst)
    # for i in range(n)
    for val in lst:
        total_ += val
    return total_
total(values)

45

In [4]:
total([i for i in range(-1000, 10000, 53)])

932984

In [5]:
import random
x1 = [3,6,9,12,15,18,21,24,27,30]
x2 = [random.randint(0,100) for i in range(10)]
total(x1), total(x2)

(165, 490)
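
As a quick check, the hand-written `total()` can be compared with Python's built-in `sum()`. The sketch below restates `total()` so that it runs on its own and uses only the deterministic lists from above (`x2` is random, so it is left out):

```python
def total(lst):
    total_ = 0
    for val in lst:
        total_ += val
    return total_

values = [i for i in range(10)]
x1 = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]

# the hand-written function and the built-in should agree
print(total(values), sum(values))
print(total(x1), sum(x1))
```

In practice `sum()` is the idiomatic choice; writing `total()` by hand is mainly useful for seeing how the accumulation works.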

## Mean
Let $X_1, X_2,...,X_n$ represent $n$ observations of a random variable. For a given dataset, useful descriptive statistics of central tendency include mean, median, and mode, which we built as functions in a previous chapter.

We define the mean of a set of numbers:
$\bar{X} = \frac{\sum_{i=0}^{n-1} x_{i}} {n}$

In [6]:
def mean(lst):
    n = len(lst)
    mean_ =  total(lst) / n
    return mean_
mean(x1), mean(x2)

(16.5, 49.0)

Now let's build the rest of the summary statistics functions:
1. median
2. mode
3. variance
4. standard deviation
5. standard error
6. covariance
7. correlation
8. skewness
9. kurtosis

## Median

In [7]:
# reminder: median is the middle number in a sorted list
# if the list has an even length, it is the average of the two middle numbers
def median(lst):
    n = len(lst)
    lst = sorted(lst)
    # two cases: 1. list is odd in length
    # i % j gives the remainder upon dividing i by j
    if n % 2 != 0:
        middle_index = int((n - 1) / 2)
        median_ = lst[middle_index]
    # 2. list is even in length
    else:
        upper_middle_index = int(n / 2)
        lower_middle_index = upper_middle_index - 1
        # pass slice with two middle values to mean()
        median_ = mean(lst[lower_middle_index: upper_middle_index + 1])
        
    return median_
    
median(x1), median(x2)

(16.5, 49.0)

In [8]:
# transform x1 to be of odd length by removing the last element
median(x1[:-1])

15

In [9]:
sorted(x2)

[6, 19, 21, 38, 43, 55, 65, 66, 84, 93]
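
As a sanity check, these results can be compared with `statistics.median` from Python's standard library. The sketch below uses a condensed restatement of `median()` (integer division instead of `int()`, assumed to behave the same for these lists) and the sorted values of `x2` from the output above:

```python
import statistics

def median(lst):
    # condensed restatement of the function defined above
    lst = sorted(lst)
    n = len(lst)
    middle = n // 2
    if n % 2 != 0:
        return lst[middle]
    return (lst[middle - 1] + lst[middle]) / 2

x1 = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]
# values of x2 copied from the sorted output above
x2_sorted = [6, 19, 21, 38, 43, 55, 65, 66, 84, 93]

for lst in (x1, x1[:-1], x2_sorted):
    # each pair should match
    print(median(lst), statistics.median(lst))
```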

## Mode

In [10]:
lst = [1,1,1,2,3,4,5,5,5]
def mode(lst):
    count_dct = {}
    # create entries for each value with 0
    for key in lst:
        count_dct[key] = 0
    # add up each occurrence
    for key in lst:
        count_dct[key] += 1
    # calculate max_count up front
    max_count = max(count_dct.values())
    # now we can compare each count to the max count
    mode_ = []
    for key, count in count_dct.items():
        if count == max_count:
            mode_.append(key)

    return mode_
mode(lst)

[1, 5]
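
The two counting loops in `mode()` can also be expressed with `collections.Counter`. The sketch below is one possible alternative, not the notebook's method; `mode_with_counter` is a hypothetical name. It also compares the result with `statistics.multimode`, which is available from Python 3.8:

```python
from collections import Counter
import statistics

lst = [1, 1, 1, 2, 3, 4, 5, 5, 5]

def mode_with_counter(lst):
    # Counter builds the value -> count dictionary in one step
    counts = Counter(lst)
    max_count = max(counts.values())
    # keep every value whose count ties the maximum
    return [key for key, count in counts.items() if count == max_count]

print(mode_with_counter(lst))
print(statistics.multimode(lst))
```

Both calls return a list, so a dataset with several modes, such as `lst`, is reported in full.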

## Variance

$$ \sigma^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n}$$

When the data are a sample, that is, a subset of a population of observations, we divide by $n - 1$, the **Degrees of Freedom**, to correct the bias in the estimate.

$$DoF = n - 1$$

$$ S^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n-1}$$

In [11]:
def variance(lst, sample = True):
    list_mean = mean(lst)
    n = len(lst)
    dof = n - 1
    sum_sq_diff = 0
    
    for val in lst:
        diff = val - list_mean
        sum_sq_diff += (diff) ** 2
#        print(val, list_mean, diff, sum_sq_diff)
    if sample == False:
        variance_ = sum_sq_diff / n
    else:
        variance_ = sum_sq_diff / dof
    return variance_

variance(x1), variance(x1, sample = False)

(82.5, 74.25)

In [12]:
variance(x2), variance(x2, sample = False)

(825.7777777777778, 743.2)

## Standard Deviation

$sd = \sqrt{S^2}$

In [13]:
def SD(lst, sample = True):
    SD_ = variance(lst, sample) ** (1/2)
    return SD_
SD(x1), SD(x1, sample = False)

(9.082951062292475, 8.616843969807043)

In [14]:
SD(x2), SD(x2, sample = False)

(28.73634941633641, 27.26169473822198)

## Standard Error
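
The notebook does not state a formula here. Assuming the usual definition of the standard error of the mean, which is what `STE()` below computes, the standard deviation is divided by the square root of the number of observations. Worked through for `x1` with the sample standard deviation reported above:

$$SE = \frac{S}{\sqrt{n}}$$

$$SE_{x1} = \frac{9.083}{\sqrt{10}} \approx 2.872$$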

In [16]:
def STE(lst, sample = True):
    n = len(lst)
    se = SD(lst, sample) / n ** (1/2)
    
    return se

In [18]:
STE(x1, sample = True), SD(x1, sample = True)

(2.872281323269014, 9.082951062292475)

In [19]:
STE(x2, sample = True), SD(x2, sample = True)

(9.087231579407327, 28.73634941633641)

## Covariance
To calculate covariance, we take each pair of observations, multiply the difference between the first value and the mean of its list by the difference between the second value and the mean of its list, sum these products over all $n$ observations, and divide by $n$:

$cov_{pop}(x,y) = \frac{\sum_{i=0}^{n-1} (x_{i} - x_{mean})(y_{i} - y_{mean})} {n}$

We pass two lists to the _covariance()_ function. As with the _variance()_ and _SD()_ functions, we can also calculate the sample covariance, which divides by $n - 1$:

$cov_{sample}(x,y) = \frac{\sum_{i=0}^{n-1} (x_{i} - x_{mean})(y_{i} - y_{mean})} {n - 1}$

Covariance can only be calculated if the lists passed to the function are of equal length, so we check this condition with an if statement:

In [27]:
def covariance(lst1, lst2, sample = True):
    mean1 = mean(lst1)
    mean2 = mean(lst2)
    
    cov = 0
    n1 = len(lst1)
    n2 = len(lst2)
    if n1 == n2:
        n = n1
        # sum the product of the differences
        for i in range(n):
            cov += (lst1[i] - mean1) * (lst2[i] - mean2)
        if sample == False:
            cov = cov / n
        else:
            cov = cov / (n - 1)
            
        return cov
    else:
        print("List lengths are not equal")
        print("List1:", n1)
        print("List2:", n2)
        
covariance(x1, x2, sample = True)

167.66666666666666

In [26]:
covariance(x1[:-1], x2)

List lengths are not equal
List1: 9
List2: 10


## Correlation

$corr(x,y) = \frac{cov_{pop}(x, y)} {\sigma_x \sigma_y}$

In [29]:
def correlation(lst1, lst2):
    cov = covariance(lst1, lst2)
    SD1 = SD(lst1)
    SD2 = SD(lst2)
    corr = cov / (SD1 * SD2)
    
    return corr
correlation(x1, x2)

0.6423743042133908

In [33]:
x3 = [1 + x * -.5 for x in x1]
x3

[-0.5, -2.0, -3.5, -5.0, -6.5, -8.0, -9.5, -11.0, -12.5, -14.0]

In [34]:
correlation(x1, x3)

-1.0
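
The formula is written with population statistics, while `correlation()` calls `covariance()` and `SD()` with their default `sample = True`. The result is the same either way, because the divisor ($n$ or $n - 1$) appears in both the numerator and the denominator and cancels. The sketch below illustrates this with condensed helper functions; the `sample` argument on `correlation()` and the list `y` are hypothetical additions for the demonstration:

```python
import math

def covariance(lst1, lst2, sample=True):
    n = len(lst1)
    mean1, mean2 = sum(lst1) / n, sum(lst2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(lst1, lst2))
    return cov / (n - 1) if sample else cov / n

def SD(lst, sample=True):
    # the variance of a list is its covariance with itself
    return covariance(lst, lst, sample) ** (1 / 2)

def correlation(lst1, lst2, sample=True):
    # hypothetical sample argument, used consistently for all three statistics
    return covariance(lst1, lst2, sample) / (SD(lst1, sample) * SD(lst2, sample))

x1 = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]
x3 = [1 + x * -.5 for x in x1]
y = [2, 1, 4, 3, 7, 5, 8, 6, 10, 9]  # hypothetical data

print(correlation(x1, y, sample=True), correlation(x1, y, sample=False))
print(math.isclose(correlation(x1, x3), -1.0))
```

The cancellation only holds if the same choice is used throughout; mixing a sample covariance with population standard deviations would inflate the result by a factor of $n / (n - 1)$.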

## Skewness

$skew_{pop}(x) = \frac{\sum_{i=0}^{n-1}{(x_{i} - x_{mean})^3}} {n\sigma^3}$

$skew_{sample}(x) = \frac{n\sum_{i=0}^{n-1}{(x_{i} - x_{mean})^3}} {(n-1)(n-2)S^3}$

In [37]:
def skewness(lst, sample = True):
    mean_ = mean(lst)
    SD_ = SD(lst, sample)
    skew = 0
    n = len(lst)
    for val in lst:
        skew += (val - mean_) ** 3
    skew = skew / (n * SD_ ** 3) if not sample else\
        n * skew / ((n - 1) * (n - 2) * SD_ **3)
    return skew

skewness(x1)

0.0

In [38]:
skewness(x2)

0.04259756433457542

## Kurtosis

$kurt_{pop} = \frac{\sum_{i=0}^{n-1} (x_{i} - x_{mean})^4} {n\sigma^4}$

$kurt_{sample} = \frac{n(n+1)\sum_{i=0}^{n-1} (x_{i} - x_{mean})^4} {(n - 1)(n - 2)( n - 3)S^4} - \frac{3(n - 1)^2}{(n - 2)(n - 3)}$

In [42]:
def kurtosis(lst, sample = True):
    mean_ = mean(lst)
    kurt = 0
    SD_ = SD(lst, sample)
    n = len(lst)
    for val in lst:
        kurt += (val - mean_) ** 4
    kurt = kurt / (n * SD_ ** 4) if  sample == False else  n * (n + 1) * kurt / \
    ((n - 1) * (n - 2) * (n - 3) * (SD_ ** 4)) - (3 *(n - 1) ** 2) / ((n - 2) * (n - 3))
    
    return kurt

kurtosis(x1, sample = False)

1.7757575757575759

In [43]:
kurtosis(x2, sample = False)

1.8572531606262042

## Gather Statistics

In [52]:
import pandas as pd
def gather_statistics(df, sample = False, round_dig = 3):
    dct = {key:{} for key in df}
    for key, val in df.items():
        # drop missing values without modifying the original DataFrame
        val = val.dropna()
        dct[key]["mean"] = round(mean(val), round_dig)
        dct[key]["median"] = round(median(val), round_dig)
        # pass sample so that variance uses the same divisor as S.D.
        dct[key]["variance"] = round(variance(val, sample), round_dig)
        dct[key]["S.D."] = round(SD(val, sample), round_dig)
        dct[key]["Skewness"] = round(skewness(val, sample), round_dig)
        dct[key]["Kurtosis"] = round(kurtosis(val, sample), round_dig)
    stats_df = pd.DataFrame(dct)
    return stats_df
data = pd.DataFrame([x1,x1], index = ["List1", "List2"]).T
gather_statistics(data, sample = False, round_dig = 2)

| | List1 | List2 |
| --- | --- | --- |
| mean | 16.5 | 16.5 |
| median | 16.5 | 16.5 |
| variance | 74.25 | 74.25 |
| S.D. | 8.62 | 8.62 |
| Skewness | 0.0 | 0.0 |
| Kurtosis | 1.78 | 1.78 |


# Fraser Economic Freedom of the World

In [56]:
filename = "efotw-2022-master-index-data-for-researchers-iso.xlsx"
data = pd.read_excel(filename,
                    header = [4],
                    index_col = [3,1])
data

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 0,ISO Code 2,Countries,Economic Freedom Summary Index,Rank,Quartile,1A Government Consumption,data,1B Transfers and subsidies,data.1,...,Unnamed: 101,Unnamed: 102,Unnamed: 103,Unnamed: 104,Unnamed: 105,Unnamed: 106,Unnamed: 107,Unnamed: 108,Unnamed: 109,Unnamed: 110
ISO Code 3,Year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
ALB,2020,,AL,Albania,7.64,26.0,1.0,8.026471,12.710000,6.978202,11.590000,...,2011.00,2012.00,2013.00,2014.00,2015.00,2016.0,2017.0,2018.0,2019.00,2020.00
DZA,2020,,DZ,Algeria,5.12,157.0,4.0,3.102941,29.450000,7.817129,8.511137,...,153.00,153.00,157.00,159.00,159.00,162.0,162.0,162.0,165.00,165.00
AGO,2020,,AO,Angola,5.91,138.0,4.0,7.700000,13.820000,9.702997,1.590000,...,38.25,38.25,39.25,39.75,39.75,40.5,40.5,40.5,41.25,41.25
ARG,2020,,AR,Argentina,4.87,161.0,4.0,5.985294,19.650000,6.493188,13.370000,...,114.75,114.75,117.75,119.25,119.25,121.5,121.5,121.5,123.75,123.75
ARM,2020,,AM,Armenia,7.84,11.0,1.0,6.605882,17.540000,7.223433,10.690000,...,76.50,76.50,78.50,79.50,79.50,81.0,81.0,81.0,82.50,82.50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
VEN,1970,,VE,"Venezuela, RB",7.19,13.0,1.0,6.602003,17.553191,9.827430,1.133333,...,,,,,,,,,,
VNM,1970,,VN,Vietnam,,,,,,,,...,,,,,,,,,,
YEM,1970,,YE,"Yemen, Rep.",,,,,,,,...,,,,,,,,,,
ZMB,1970,,ZM,Zambia,5.33,54.0,3.0,3.448131,28.276353,9.105430,3.783070,...,,,,,,,,,,


In [59]:
data = pd.read_excel(filename,
                    sheet_name = "EFW Panel Data 2022 Report")
rename = {"Panel Data Summary Index": "Summary",
         "Area 1": "Size of Government",
         "Area 2": "Legal System and Property Rights",
         "Area 3": "Sound Money",
         "Area 4": "Freedom to Trade Internationally",
         "Area 5": "Regulation"}
# rename() returns a new DataFrame, so save the result
data = data.rename(columns = rename)
data

Unnamed: 0,Year,ISO_Code_2,ISO_Code_3,World Bank Region,"World Bank Current Income Classification, 1990-present (L=Low income, LM=Lower middle income, UM=Upper middle income, H=High income)",Countries,Summary,Size of Government,Legal System and Property Rights,Sound Money,Freedom to Trade Internationally,Regulation,Standard Deviation of the 5 EFW Areas
0,2020,AL,ALB,Europe & Central Asia,UM,Albania,7.640000,7.817077,5.260351,9.788269,8.222499,7.112958,1.652742
1,2020,DZ,DZA,Middle East & North Africa,LM,Algeria,5.120000,4.409943,4.131760,7.630287,3.639507,5.778953,1.613103
2,2020,AO,AGO,Sub-Saharan Africa,LM,Angola,5.910000,8.133385,3.705161,6.087996,5.373190,6.227545,1.598854
3,2020,AR,ARG,Latin America & the Caribbean,UM,Argentina,4.870000,6.483768,4.796454,4.516018,3.086907,5.490538,1.254924
4,2020,AM,ARM,Europe & Central Asia,UM,Armenia,7.840000,7.975292,6.236215,9.553009,7.692708,7.756333,1.178292
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4450,1970,VE,VEN,Latin America & the Caribbean,,"Venezuela, RB",7.242943,8.349529,5.003088,9.621851,7.895993,5.209592,2.028426
4451,1970,VN,VNM,East Asia & Pacific,,Vietnam,,,,,,,
4452,1970,YE,YEM,Middle East & North Africa,,"Yemen, Rep.",,,,,,,
4453,1970,ZM,ZMB,Sub-Saharan Africa,,Zambia,4.498763,5.374545,4.472812,5.137395,,5.307952,0.412514
