# Key points of functional programming
- Modular code
- Emphasis on functions and models to solve problems
- Avoidance of state and data mutation, instead opting to create new data to avoid side effects
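
As a small illustration of the last point, the sketch below contrasts a function that mutates its argument with one that returns new data. The names `add_item_in_place` and `add_item_pure` are hypothetical and exist only for this example.

```python
def add_item_in_place(items, item):
    # Mutates the caller's list: this is a side effect
    items.append(item)
    return items

def add_item_pure(items, item):
    # Builds and returns a new list; the input is left untouched
    return items + [item]

original = [1, 2, 3]
new_list = add_item_pure(original, 4)
print(original, new_list)      # original is unchanged here

add_item_in_place(original, 4)
print(original)                # original has now been modified
```

The second style is easier to reason about because calling the function never changes data that other code may still rely on.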

For repeated calculations, you are usually better off abstracting the work into a function. Multiplication is a familiar example: the `*` operator is shorthand for repeated addition. `1 * 90` amounts to starting at 0 and adding 1 in a loop until the counter reaches 90. That operation is itself a function, much like the ones you met in algebra.
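
To make the analogy concrete, here is a minimal sketch of multiplication written as repeated addition. The function name `multiply_by_adding` is hypothetical, and the sketch assumes `times` is a non-negative integer.

```python
def multiply_by_adding(value, times):
    # Add `value` to a running total `times` times
    result = 0
    for _ in range(times):
        result += value
    return result

print(multiply_by_adding(1, 90) == 1 * 90)
```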

All you have to do is declare the function's existence, specify its arguments, and state what it returns and how. Recall y = mx + b: you pass m, x, and b as arguments and get y back.

1. declare existence of a function
2. declare its name
3. declare what it takes in (var names will be used locally within the function's operations)
4. in the body of the function, write how it operates (calculates)
5. Return some value
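
The five steps above can be sketched with the line equation from algebra. The function name `line` is an illustrative choice, not something defined elsewhere in this notebook.

```python
# Steps 1-3: `def` declares the function, `line` is its name,
# and m, x, b are the local names for what it takes in
def line(m, x, b):
    # Step 4: the body describes how the result is calculated
    y = m * x + b
    # Step 5: return the value
    return y

print(line(2, 3, 1))  # 2 * 3 + 1 = 7
```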

## Statistical Functions
| New Concepts | Description |
| --- | --- |
| Operators e.g., !=, %, +=, \*\* | The operator != tests whether the values on either side of the operator are not equal; _a % b_ returns the remainder of $a / b$; _a += b_ sets a equal to $a + b$; _a ** b_ raises a to the b power ($a^b$). |
| Dictionary | A dictionary is a datastructure that uses keys instead of index values. Each unique key references an object linked to that key. |
| Dictionary Methods e.g., _dct.values()_ | dct.values() returns a view of the objects that are referenced by the dictionary's keys; wrap it in list() to get a list. |
| Default Function Values | A function may assume a default value for arguments that are not passed to it, e.g., _def function(val1 = 0, val2 = 2, …)_ |
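
A short sketch of each concept in the table follows. The dictionary `ages` and the function `power` are made-up examples.

```python
# Operators
print(3 != 4)      # True, because the two values differ
print(7 % 3)       # 1, the remainder of 7 / 3
a = 5
a += 2             # a is now 7
print(2 ** 3)      # 8

# Dictionary: values are looked up by key rather than by position
ages = {"Ann": 31, "Bob": 27}
print(ages["Ann"])            # 31
print(list(ages.values()))    # [31, 27]

# Default function values: exponent is 2 unless the caller overrides it
def power(base, exponent=2):
    return base ** exponent

print(power(3), power(3, 3))  # 9 27
```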

### Average Statistics

# Cumulative Multiplication Function

In [1]:
def cum_mult(numbers):
    answer = 1
    for number in numbers:
        answer *= number
    return answer

In [2]:
numbersLITERALLY = [i for i in range(1, 6)]
numbersLITERALLY

In [3]:
cum_mult(numbersLITERALLY)

If a function accepts arguments, you pass an object to it by placing the object in the parentheses that follow the function name. The first function that we build will be the total() function. We define the function algebraically as the sum of all values in a list of length n:

$\sum_{i=0}^{n-1} x_{i}$

Since list indices start with the integer 0, we write our functions as starting with _i = 0_ and processing elements up to the index _n - 1_. Since the range function in Python automatically counts to one less than the value identified, the for-loop used will take the form:

In [4]:
n = 0
total = 0
values = [i for i in range(20)]


# This sucks since you'd have to rewrite this every time
print("Total", "Value")
for value in values:
    total  += value
    print(total, "\t", value)

In [5]:
lst = [i for i in range(30)]

In [6]:
def cum_total(numbers, target_index=None):
    total = 0
    cum_totals = []
    for number in numbers:
        total += number
        cum_totals.append(total)
    if target_index is None:
        return cum_totals
    else:
        return cum_totals[target_index]

In [7]:
print(cum_total(lst,9))
print("\n",cum_total(lst))
print("\n",cum_total(i for i in range(0, 29, 3)), type((i for i in range(0, 29, 3))))

45

 [0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105, 120, 136, 153, 171, 190, 210, 231, 253, 276, 300, 325, 351, 378, 406, 435]

 [0, 3, 9, 18, 30, 45, 63, 84, 108, 135] <class 'generator'>


### Be more abstract using built-in functions. 

In [8]:
def total(elements):
    total = 0
    totals = []
    for element in elements:
        if isinstance(element, (int, float)):
            total += element
        totals.append(total)
    return totals

In [9]:
lst[5] = "This is a string"
total(lst)

[0, 1, 3, 6, 10, 10, 16, 23, 31, 40, 50, 61, 73, 86, 100, 115, 131, 148, 166, 185, 205, 226, 248, 271, 295, 320, 346, 373, 401, 430]
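
In the spirit of leaning on built-ins, the same running total can be sketched with `itertools.accumulate` from the standard library. This is an alternative, not the notebook's own implementation. The short list `values` is a made-up example, and non-numeric entries are treated as 0 so that the running total simply carries forward.

```python
from itertools import accumulate

values = [0, 1, 2, 3, 4, "This is a string", 6, 7]

# Replace anything that is not a number with 0, then accumulate
numeric_only = (v if isinstance(v, (int, float)) else 0 for v in values)
running_totals = list(accumulate(numeric_only))
print(running_totals)
```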

#### Mean


Let $X_1, X_2,...,X_n$ represent $n$ random variables. For a given dataset, useful descriptive statistics of central tendency include mean, median, and mode, which we built as functions in a previous chapter. 

We define the mean of a set of numbers:
$\bar{X} = \frac{\sum_{i=0}^{n-1} x_{i}} {n}$

The **mean** gives the expected value - often denoted $E(X)$ or $\bar{X}$ - from a series, $X$, by summing all of the observations and dividing by the number of them. The series may be a sample or may include the full population of interest, in which case we would identify the mean by the symbol $\mu_x$. 

The numerator of the formula is the same as the notation that represents the sum of a list of numbers. Thus, in mean(), we sum the values, here with Python's built-in sum(), and divide the result by the length of the list. Then, we use the function to calculate a value:

# Mean function definition

In [10]:
def mean(numbers):
    return sum(numbers) / len(numbers)

In [11]:
# Random number generator
import random
def gen_rand100(n):
    return [random.randint(0, 100) for i in range(n)]
def gen_rand10(n):
    return [random.randint(0, 10) for i in range(n)]

In [12]:
x1 = gen_rand100(10)
x2 = gen_rand10(15)
x1, x2

([98, 20, 69, 92, 38, 64, 44, 8, 1, 64],
 [1, 6, 1, 7, 3, 1, 6, 10, 4, 2, 7, 7, 1, 1, 5])

In [13]:
mean(x1), mean(x2)

(49.8, 4.133333333333334)

#### Median

The **median** is the middle value of a sorted list. It is less sensitive to outliers than the mean. For a sorted series of *odd length* whose indices run from $i=0$ to a last index $n$ (so the series holds $n + 1$ values), the median is the value at index $\frac{n}{2}$. 

For a series of *even length* indexed the same way, the median is the mean of the two values in the middle of the list. The indices of these values are: 

$$i_1 = \frac{n + 1}{2}; i_2 = \frac{n - 1}{2}$$

The median is thus defined:
$$\frac{x_\frac{n + 1}{2}+x_\frac{n-1}{2}}{2}$$

We can restate that:

$$k = x_\frac{n + 1}{2}+x_\frac{n-1}{2}$$

Thus, the median is defined as $\frac{k}{2}$.

# Median Function Definition

In [14]:
def median(numbers):
    sorted_numbers = sorted(numbers)
    length = len(sorted_numbers)
    
    # Two cases: even length or odd length
    mid = length // 2
    if length % 2 == 0:
        # Even: average the two middle values
        return (sorted_numbers[mid - 1] + sorted_numbers[mid]) / 2
    # Odd: take the single middle value
    return sorted_numbers[mid]

In [15]:
median(x1), median(x2)

(54.0, 4)

# Mode Function Definition

### Pseudocode

1. Declare dict to count numbers
2. for every number in the list:
    3. If that number is in the dict, add 1 to that key value
    4. Otherwise, it hasn't been counted before. Set that key value to 1
5. find the maximum count among the dict's values using the .values() method
6. return the list of keys corresponding to the max count

In [16]:
def mode(nums):
    # 1
    count = {}
    # 2
    for num in nums:
        # 3
        if num in count:
            count[num] += 1
        else:
        # 4
            count[num] = 1
        #5
    max_count = max(count.values())
    print(count.keys(), count.values())
    #6
    return [num for num in count if count[num] == max_count]

In [17]:
mod1 = [1,1,2,3,3,3,3,4,4,5,5,5,5]
mod2 = [1,1,2,2, 3,3,4,4]

In [18]:
mode(mod1), mode(mod2)

dict_keys([1, 2, 3, 4, 5]) dict_values([2, 1, 4, 2, 4])
dict_keys([1, 2, 3, 4]) dict_values([2, 2, 2, 2])


([3, 5], [1, 2, 3, 4])
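
As a sanity check, the hand-built functions can be compared against Python's standard `statistics` module. The sketch below reuses the `x1` and `mod1` values shown above and assumes Python 3.8 or later, where `statistics.multimode` is available.

```python
import statistics

x1 = [98, 20, 69, 92, 38, 64, 44, 8, 1, 64]
mod1 = [1, 1, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5]

print(statistics.mean(x1))         # should agree with mean(x1)
print(statistics.median(x1))       # should agree with median(x1)
print(statistics.multimode(mod1))  # every value tied for the highest count
```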

#### Variance

Average values do not provide a robust description of the data. An average does not tell us the shape of a distribution. In this section, we will build functions to calculate statistics describing distribution of variables and their relationships. The first of these is the variance of a list of numbers.

We define population variance as:

$$ \sigma^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n}$$

When we are dealing with a sample, which is a subset of a population of observations, then we divide by $n - 1$, the **Degrees of Freedom**, to unbias the calculation. 

$$DoF = n - 1$$

The degrees of freedom are the number of independent observations that go into the estimate of a parameter (sample size $n$), minus the number of parameters used as intermediate steps in the estimation of the parameter itself. Estimating the sample variance requires one intermediate estimate, $\bar{x}$, so we lose a single degree of freedom. (We will see that more than one parameter is estimated when we use Ordinary Least Squares regression.) The sample variance is therefore: 


$$ S^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n-1}$$
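
The effect of dividing by $n - 1$ can be seen in a small simulation. This is a sketch under stated assumptions: samples of size 5 are drawn from the integers 0 to 100, whose true variance is $(101^2 - 1)/12 = 850$. The seed, trial count, and helper name `sum_sq_dev` are arbitrary choices.

```python
import random

random.seed(0)  # arbitrary seed, used only for repeatability

# True variance of integers drawn uniformly from 0..100
population_variance = (101 ** 2 - 1) / 12

def sum_sq_dev(sample):
    # Sum of squared deviations from the sample mean
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample)

n, trials = 5, 5000
total_div_n = 0.0
total_div_n_minus_1 = 0.0
for _ in range(trials):
    sample = [random.randint(0, 100) for _ in range(n)]
    ss = sum_sq_dev(sample)
    total_div_n += ss / n
    total_div_n_minus_1 += ss / (n - 1)

avg_biased = total_div_n / trials
avg_unbiased = total_div_n_minus_1 / trials
print(population_variance, avg_biased, avg_unbiased)
```

On average, dividing by $n$ underestimates the population variance, while dividing by $n - 1$ lands close to it.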

Next, we build functions that calculate a population's variance and standard deviation. We will include an option for calculating sample variance and sample standard deviation.

# Variance Function Definition

### Pseudocode
1. Get len list of numbers to reduce computation
2. get the mean
3. calculate variance, dividing by n - 1 if sample is True and by n otherwise

In [19]:
def variance(nums, sample = True):
    n = len(nums)
    mean = sum(nums) / n
    
    # Divide by n - 1 for a sample, otherwise by n for a population
    var = sum((i - mean)**2 for i in nums) / ((n - 1) if sample else n)
    return var

In [20]:
variance((1, 2, 3, 4, 5, 6, 7, 8), sample = True), variance((1, 2, 3, 4, 5, 6, 7, 8), sample = False)

(6.0, 5.25)
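
These results can be cross-checked against the standard library. In the `statistics` module, `variance` and `stdev` use the sample ($n - 1$) divisor, while `pvariance` and `pstdev` use the population ($n$) divisor.

```python
import statistics

data = (1, 2, 3, 4, 5, 6, 7, 8)

sample_var = statistics.variance(data)        # divides by n - 1
population_var = statistics.pvariance(data)   # divides by n
print(sample_var, population_var)

# Standard deviations are the square roots of the variances
print(statistics.stdev(data), statistics.pstdev(data))
```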

In [21]:
int(True)

1

#### Standard Deviation

From a list’s variance, we calculate its standard deviation as the square root of the variance. Standard deviation is regularly used in data analysis, primarily because it has the same units of measurement as the mean. It undoes the squaring of individual observations' deviations from the mean that is done when calculating variance. It is denoted $s$ when working with a sample with an unknown population mean $\mu$. $s$ is an _estimator_ of $\sigma$, which is the standard deviation when $\mu$ is known: 

$s = \sqrt{S^2}$

This is true for both the population and sample standard deviations. The function and its employment are listed below:

# Standard Deviation Function

### Pseudocode
1. get length of list of numbers
2. get mean
3. calculate variance
4. raise the variance to the 1/2 power, which is the same as taking the square root

In [22]:
def stddev(nums, sample = True):
    stddev = variance(nums, sample) ** (1/2)
    return stddev

In [23]:
random_ints = gen_rand100(300)

In [24]:
stddev(random_ints, sample = False), stddev(random_ints, sample = True)

(29.80481989358246, 29.85461912612702)

# Is random really all that random?

In [25]:
def listgen(num_lists_wanted):
    randomints = {}
    for i in range(num_lists_wanted):
        randomints[i] = gen_rand100(100)
    return randomints

In [26]:
listgendict = listgen(100)

In [27]:
listgendict

{0: [47, 79, 82, 58, 44, ...],
 1: [72, 81, 10, 42, 41, ...],
 ...}

In [28]:
randomornah = {}
for i in listgendict:
    print(stddev(listgendict[i]))
    randomornah[i] = stddev(listgendict[i])
    
var(randomornah)

27.31084810551339
26.79418563491684
28.673492844772014
29.71258108524863
28.917465275846673
...

NameError: name 'var' is not defined

In [29]:
def dictvar(dct, sample=True):
    # Variance of a dictionary's values; the name dct avoids shadowing the built-in dict
    n = len(dct)
    mean = sum(dct.values()) / n
    variance = sum((i - mean)**2 for i in dct.values()) / (n - 1 if sample else n)
    return variance

In [30]:
dictvar(randomornah)

2.1656148238119575

In [31]:
r = random.random()

In [32]:
random.seed(1)

random.random()

0.13436424411240122

In [33]:
random.seed(1)
random.random()

0.13436424411240122

In [34]:
random.seed(random.random())
random.random()

0.13403453044814118

In [35]:
random.seed(random.random())
random.random()

0.17399915189709225

### Standard Error

Next, we will calculate the **standard error** of the sample mean. This describes how far a given random sample mean $\bar{x_i}$ is expected to deviate from the population mean $\mu$. It is the standard deviation of the probability distribution for the random variable $\bar{X}$, which represents the means of all possible samples of a single given sample size $n$. As $n$ increases, $\bar{X}$ can be expected to deviate less from $\mu$, so standard error decreases. Because population standard deviation $\sigma$ is rarely given, we again use an _estimator_ for standard error, denoted $s_\bar{x}$ and calculated as $s_\bar{x} = \frac{s}{\sqrt{n}}$. Population data has no standard error as $\mu$ can only take on a single value. 

As n increases, the standard error (the typical deviation of a sample mean from the population mean) should decrease, and vice versa.

#### Standard error reflects how much variation in the data can be attributed to random sampling error rather than true differences in the population.

In [36]:
def stderr(lst, sample = True):
    nums = len(lst)
    return stddev(lst,sample) / nums ** (1/2)

In [37]:
print(stderr(x1), stderr(x2))

10.575443253121827 0.7613688591494531
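
The claim that standard error shrinks as $n$ grows can be checked with a simulation. This is a rough sketch: it draws repeated samples from the integers 0 to 100, measures how much the sample means vary, and compares that spread with $\sigma / \sqrt{n}$. The helper name `std_of_sample_means`, the seed, and the trial count are arbitrary assumptions.

```python
import random

random.seed(1)  # arbitrary seed, used only for repeatability

def std_of_sample_means(n, trials=2000):
    # Draw `trials` samples of size n and return the standard
    # deviation of their means
    means = [sum(random.randint(0, 100) for _ in range(n)) / n
             for _ in range(trials)]
    grand_mean = sum(means) / trials
    return (sum((m - grand_mean) ** 2 for m in means) / (trials - 1)) ** 0.5

# Population standard deviation of integers drawn uniformly from 0..100
sigma = ((101 ** 2 - 1) / 12) ** 0.5

for n in (10, 40, 160):
    # simulated spread of sample means vs. the theoretical sigma / sqrt(n)
    print(n, round(std_of_sample_means(n), 2), round(sigma / n ** 0.5, 2))
```

Each fourfold increase in $n$ should roughly halve the spread of the sample means.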


### What's left? 

##### Covariance, correlation, skewness and kurtosis. 

Covariance measures the average relationship between two variables. Correlation normalizes the covariance statistic to a value between -1 and 1.

To calculate covariance, we sum the products of each pair's deviations from their respective means, for _i = 0_ through _n - 1_ where _n = number of observations_, and divide the sum by _n_:

$cov_{pop}(x,y) = \frac{\sum_{i=0}^{n-1} (x_{i} - x_{mean})(y_{i} - y_{mean})} {n}$

We pass two lists through the covar() function. As with the _variance()_ and _stddev()_ functions, we can take the sample covariance.

$cov_{sample}(x,y) = \frac{\sum_{i=0}^{n-1} (x_{i} - x_{mean})(y_{i} - y_{mean})} {n - 1}$

In order for covariance to be calculated, the lists passed to the function must be of equal length, so we check this condition with an if statement:

# Covariance

Covariance measures the degree to which changes in one variable are associated with changes in another variable. If the variables increase or decrease together, they have a positive covariance. If one variable tends to increase while the other decreases, they have a negative covariance. A covariance of zero indicates that there is no linear relationship between the two variables.

In [38]:
def covar(list_1, list_2, population=True):
    if len(list_1) != len(list_2):
        raise ValueError("List lengths aren't equal")
    numVars = len(list_1)
    list1_mean = sum(list_1) / numVars
    list2_mean = sum(list_2) / numVars

    # Divide by n for a population, n - 1 for a sample
    dof = 0 if population else 1

    return sum((list_1[i] - list1_mean) * (list_2[i] - list2_mean) for i in range(numVars)) / (numVars - dof)


In [39]:
X_1 = [3, 6, 9, 12, 15,18,21,24,27,30]
X_2 = [10, 56, 34, 47, 41, 54, 95, 67, 69, 98]
print(covar(X_1, X_2, population=True),covar(X_1, X_2, population=False) )

180.75 200.83333333333334
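
The sign behaviour of covariance is easy to see with tiny inputs. The sketch below uses a self-contained helper, `cov_pop` (a hypothetical name), that computes population covariance with `zip` instead of index lookups.

```python
def cov_pop(xs, ys):
    # Population covariance: mean of the products of paired deviations
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    return sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / n

xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # moves with xs
falling = [10, 8, 6, 4, 2]   # moves against xs
flat = [3, 3, 3, 3, 3]       # does not move at all

print(cov_pop(xs, rising), cov_pop(xs, falling), cov_pop(xs, flat))
# 4.0 -4.0 0.0
```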


# Correlation
Correlation is covariance normalized by the standard deviations of x and y. It indicates the magnitude and direction of the linear relationship between two variables, and it ranges from -1 to 1. It does not indicate causation.

- greater than 0 = positive linear relationship
- less than 0 = inverse linear relationship
- 0 = no linear relationship

In [40]:
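def correlation(l1, l2):
    # Use population statistics throughout so the divisors match;
    # mixing population covariance with sample standard deviations
    # scales the result by (n - 1) / n
    coVAR = covar(l1, l2, population=True)
    std_dev1 = stddev(l1, sample=False)
    std_dev2 = stddev(l2, sample=False)
    return coVAR / (std_dev1 * std_dev2)

round(correlation(X_1, X_2), 4)

0.8245

In [41]:
x3 = [x * -.5 for x in X_2]

In [42]:
round(correlation(x3, X_2), 4)

-1.0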
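
A correlation coefficient only stays within -1 and 1 when the covariance and both standard deviations use the same divisor. In that case the divisor cancels, which the following sketch exploits. The helper name `pearson_r` is hypothetical, and the data repeat the `X_2` values above.

```python
def pearson_r(xs, ys):
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    # The divisor (n or n - 1) appears in numerator and denominator,
    # so it cancels and is omitted here
    return sxy / (sxx * syy) ** 0.5

ys = [10, 56, 34, 47, 41, 54, 95, 67, 69, 98]
scaled = [-0.5 * y for y in ys]

# A list that is an exact negative multiple of another is perfectly
# inversely correlated with it
print(round(pearson_r(scaled, ys), 6))  # -1.0
```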

# Skewness
- Is the left a mirror image of the right and vice versa (imagine a distribution)
- Positive skewness indicates that the distribution has a longer right tail and that most of the observations are concentrated on the left side of the distribution
- Negative skewness indicates the distribution has a longer left tail and most of the observations are concentrated on the right side of the distribution

Not all distributions are normal, so we need statistics that reflect differences in shapes between distributions.

Skewness is a measure of asymmetry of a population of data about the mean. It is the expected value of the cubed deviation from the mean, scaled by the cube of the standard deviation.

$skew_{pop}(x) = \frac{\sum_{i=0}^{n-1}{(x_{i} - x_{mean})^3}} {n\sigma^3}$


$skew_{sample}(x) = \frac{n\sum_{i=0}^{n-1}{(x_{i} - x_{mean})^3}} {(n-1)(n-2)\sigma^3}$

Asymmetry in a distribution exists due to either long or fat tails. If a tail is long, this means that it contains values that are relatively far from the mean value of the data. If a tail is fat, there exists a greater number of observations whose values are relatively far from the mean than is predicted by a normal distribution. Skewness may sometimes be thought of as the direction in which a distribution leans. This can be due to the existence of asymmetric fat tails, long tails, or both. For example, if a distribution includes a long tail on the right side, but is normal otherwise, it is said to have a positive skew. The same can be said of a distribution with a fat right tail. Skewness can be ambiguous concerning the shape of the distribution. If a distribution has a fat right tail and a long left tail that is not fat, it is possible that its skewness will be zero, even though the shape of the distribution is asymmetric.

In [43]:
def skewness(lst, sample = True):
    lst_mean = mean(lst)
    std_dev = stddev(lst, sample)
    skew = 0
    numVars = len(lst)
    for values in lst:
        skew += (values - lst_mean) ** 3
    skew = skew / (numVars * std_dev ** 3) if not sample else\
        numVars * skew / ((numVars-1) * (numVars-2) * std_dev ** 3)
    return skew

In [44]:
round(skewness(X_1), 3), round(skewness(X_2), 3)

(0.0, 0.025)
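
The sign of skewness can be checked on small hand-made lists. The sketch below uses a self-contained population-skewness helper, `skew_pop` (a hypothetical name), and made-up data with a long right tail, a long left tail, and no tail.

```python
def skew_pop(xs):
    # Population skewness: mean cubed deviation over sigma cubed
    n = len(xs)
    m = sum(xs) / n
    sigma = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum((x - m) ** 3 for x in xs) / (n * sigma ** 3)

symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 1, 2, 2, 3, 10]         # one large value stretches the right side
left_tail = [-x for x in right_tail]     # mirror image stretches the left side

print(skew_pop(symmetric))               # 0.0 for a symmetric list
print(skew_pop(right_tail) > 0)          # True: positive skew
print(skew_pop(left_tail) < 0)           # True: negative skew
```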

# Kurtosis
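- Excess kurtosis (kurtosis minus 3) above 0 indicates a distribution with heavier tails than a normal distribution
- Excess kurtosis below 0 indicates a distribution with lighter tails than a normal distribution
- Excess kurtosis of 0 (kurtosis of 3) indicates tails like those of a normal distribution

The population formula below returns kurtosis itself, while the sample formula subtracts a small-sample analogue of 3 and so returns excess kurtosis.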

$kurt_{pop} = \frac{\sum_{i=0}^{n-1} (x_{i} - x_{mean})^4} {n\sigma^4}$

$kurt_{sample} = \frac{n(n+1)\sum_{i=0}^{n-1} (x_{i} - x_{mean})^4} {(n - 1)(n - 2)( n - 3)\sigma^4} - \frac{3(n - 1)^2}{(n - 2)(n - 3)}$


In [45]:
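def kurtosis(lst, sample=True):
    # Applies the unadjusted formula: sum of fourth-power deviations
    # over n * std_dev ** 4, without subtracting 3.
    # `sample` only selects which standard deviation is used.
    lst_mean = mean(lst)
    kurt = 0
    std_dev = stddev(lst, sample)
    numVars = len(lst)
    for values in lst:
        kurt += (values - lst_mean) ** 4
    return kurt / (numVars * std_dev **4)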

In [46]:
kurtosis(X_1)

1.4383636363636367
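
For comparison, the sample formula $kurt_{sample}$ given above can be sketched as its own function. The name `excess_kurtosis_sample` is hypothetical. The sketch assumes the standard deviation in that formula is the sample ($n - 1$) standard deviation and that the list holds more than three values.

```python
def excess_kurtosis_sample(xs):
    n = len(xs)  # the formula requires n > 3
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / (n - 1)   # sample variance
    fourth = sum((x - m) ** 4 for x in xs)
    # First term of the formula; s2 ** 2 is the fourth power of the std. dev.
    first = n * (n + 1) * fourth / ((n - 1) * (n - 2) * (n - 3) * s2 ** 2)
    # Correction term that plays the role of subtracting 3
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return first - correction

X_1 = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]
print(round(excess_kurtosis_sample(X_1), 4))  # -1.2
```

The negative value reflects that evenly spaced data have lighter tails than a normal distribution.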

# Gather Statistics


In [47]:
def gatherStats(df, sample=True):
    # Build one dictionary of statistics per column
    dct = {key:{} for key in df}
    print(dct)
    for key, values in df.items():
        # Drop missing observations from the column
        values = values.dropna(axis=0)
        dct[key]["Mean"] = round(mean(values),3)
        dct[key]["Median"] = round(median(values), 3)
        # Pass sample through so every statistic uses the same convention
        dct[key]["Variance"] = round(variance(values, sample),3)
        dct[key]["Std. Dev."] = round(stddev(values, sample), 3)
        dct[key]["Skewness"] = round(skewness(values, sample),3)
        dct[key]["Kurtosis"] = round(kurtosis(values, sample),3)
    return pd.DataFrame(dct)

In [48]:
import pandas as pd



In [49]:
df = pd.DataFrame([X_1, X_2], index = ["l1", "l2"]).T
gatherStats(df)

{'l1': {}, 'l2': {}}


| | l1 | l2 |
| --- | --- | --- |
| Mean | 16.5 | 57.1 |
| Median | 16.5 | 55.0 |
| Variance | 82.5 | 719.211 |
| Std. Dev. | 9.083 | 26.818 |
| Skewness | 0.0 | 0.025 |
| Kurtosis | 1.438 | 1.967 |


In [50]:
variance(X_2)

719.2111111111112


# Economic Freedom of the World

In [51]:
import numpy, pandas as pd, stats

In [61]:
# The .xls file is used here because .xlsx support was removed (as of xlrd 2.0)
data_df = pd.read_excel("efotw-22.xls", sheet_name = "EFW Panel Data 2022 Report")
rename = {"Panel Data Summary Index": "Summary",
         "Area 1":"Size of Government",
         "Area 2":"Legal System and Property Rights",
         "Area 3":"Sound Money",
         "Area 4":"Freedom to Trade Internationally",
         "Area 5":"Regulation"}
data_df = data_df.dropna(how="all", axis = 1).rename(columns = rename)

In [65]:
data_df

| | Year | ISO_Code_2 | ISO_Code_3 | World Bank Region | World Bank Current Income Classification, 1990-present (L=Low income, LM=Lower middle income, UM=Upper middle income, H=High income) | Countries | Summary | Size of Government | Legal System and Property Rights | Sound Money | Freedom to Trade Internationally | Regulation | Standard Deviation of the 5 EFW Areas |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2020 | AL | ALB | Europe & Central Asia | UM | Albania | 7.640000 | 7.817077 | 5.260351 | 9.788269 | 8.222499 | 7.112958 | 1.652742 |
| 1 | 2020 | DZ | DZA | Middle East & North Africa | LM | Algeria | 5.120000 | 4.409943 | 4.131760 | 7.630287 | 3.639507 | 5.778953 | 1.613103 |
| 2 | 2020 | AO | AGO | Sub-Saharan Africa | LM | Angola | 5.910000 | 8.133385 | 3.705161 | 6.087996 | 5.373190 | 6.227545 | 1.598854 |
| 3 | 2020 | AR | ARG | Latin America & the Caribbean | UM | Argentina | 4.870000 | 6.483768 | 4.796454 | 4.516018 | 3.086907 | 5.490538 | 1.254924 |
| 4 | 2020 | AM | ARM | Europe & Central Asia | UM | Armenia | 7.840000 | 7.975292 | 6.236215 | 9.553009 | 7.692708 | 7.756333 | 1.178292 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4450 | 1970 | VE | VEN | Latin America & the Caribbean | | Venezuela, RB | 7.242943 | 8.349529 | 5.003088 | 9.621851 | 7.895993 | 5.209592 | 2.028426 |
| 4451 | 1970 | VN | VNM | East Asia & Pacific | | Vietnam | | | | | | | |
| 4452 | 1970 | YE | YEM | Middle East & North Africa | | Yemen, Rep. | | | | | | | |
| 4453 | 1970 | ZM | ZMB | Sub-Saharan Africa | | Zambia | 4.498763 | 5.374545 | 4.472812 | 5.137395 | | 5.307952 | 0.412514 |
