# Chapter 4: Functional Programming: Rudimentary Statistics and Analytics

We often perform statistical operations over large datasets, and it can be difficult to understand what the resulting measures mean. Learning to program presents an opportunity to better understand how these functions work. In this chapter we will create some basic statistical functions and compare their output to the functions built into Python. By writing each function yourself, you will see what the summation signs in its definition actually do. Computing these statistics by hand would be laborious and time-consuming. Once a function is constructed, it can be employed to calculate statistics in a fraction of the time.

## Building a Function

| New Concepts | Description |
| --- | --- |
| _return obj_ (from function) | A function may return an object, which is saved when the function call is assigned to a variable, e.g., var1 = function(obj1, obj2, . . .) |

In [None]:
def function_name(object1, object2, . . ., objectn):
    <operations>
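
As a concrete instance of this template, here is a minimal sketch. The function name `add_two` and its parameter names are hypothetical, chosen only for illustration.

```python
# Hypothetical example: a function that takes two objects and returns one
def add_two(obj1, obj2):
    # the operation performed on the inputs
    result = obj1 + obj2
    # return the result so that it can be saved in a variable
    return result

# the returned object is saved because a variable is defined by the function call
var1 = add_two(2, 3)
print(var1)
```

Without the `return` statement, `var1` would hold `None`, since a Python function that does not return anything explicitly returns `None`.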

If the function allows, you pass an object to it by placing the object in the parentheses that follow the function name. The first function that we build will be the total() function. We define the function algebraically as the sum of all values in a list of length n:

$\sum_{i=0}^{n-1} x_{i}$

Since list indices start with the integer 0, we will write our functions as starting with _i = 0_ and process elements up to the index _n - 1_. The range function in Python automatically counts to one less than the value identified, so range(n) covers exactly these indices. The for-loop below visits the same elements in the same order by iterating over the list directly:

In [1]:
# Define three variables with initial values
n = 0
i = 0
total = 0 

# Create a list of integers from 0 to 9 (inclusive)
values = [i for i in range(10)]

# Print header for the output table
print('total\t', 'value')

# Loop through each value in the values list
for value in values:
    # Add the current value to the running total
    total += value
    # Print the running total and current value on the same line, separated by a tab
    print(total, '\t', value)


total	 value
0 	 0
1 	 1
3 	 2
6 	 3
10 	 4
15 	 5
21 	 6
28 	 7
36 	 8
45 	 9
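
The summation formula can also be sketched with explicit indices, which mirrors the limits _i = 0_ and _n - 1_ more literally than iterating over the values. The name `index_total` is an assumption made here to avoid overwriting `total`.

```python
# values x_0, ..., x_{n-1}
values = [i for i in range(10)]
n = len(values)

index_total = 0
# range(n) yields 0, 1, ..., n - 1, matching the limits of the summation
for i in range(n):
    index_total += values[i]

print(index_total)
```

Both loop styles produce the same sum; iterating over the values directly is usually preferred in Python because no index bookkeeping is needed.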


In [2]:
# this is a bad idea!!! don't keep copying and pasting code...
total = 0 
values = [i for i in range(0, 1000, 2)]
print('total\t', 'value')
for value in values:
    total += value
    print(total, '\t', value)

total	 value
0 	 0
2 	 2
6 	 4
12 	 6
20 	 8
30 	 10
42 	 12
56 	 14
72 	 16
90 	 18
110 	 20
132 	 22
156 	 24
182 	 26
210 	 28
240 	 30
272 	 32
306 	 34
342 	 36
380 	 38
420 	 40
462 	 42
506 	 44
552 	 46
600 	 48
650 	 50
702 	 52
756 	 54
812 	 56
870 	 58
930 	 60
992 	 62
1056 	 64
1122 	 66
1190 	 68
1260 	 70
1332 	 72
1406 	 74
1482 	 76
1560 	 78
1640 	 80
1722 	 82
1806 	 84
1892 	 86
1980 	 88
2070 	 90
2162 	 92
2256 	 94
2352 	 96
2450 	 98
2550 	 100
2652 	 102
2756 	 104
2862 	 106
2970 	 108
3080 	 110
3192 	 112
3306 	 114
3422 	 116
3540 	 118
3660 	 120
3782 	 122
3906 	 124
4032 	 126
4160 	 128
4290 	 130
4422 	 132
4556 	 134
4692 	 136
4830 	 138
4970 	 140
5112 	 142
5256 	 144
5402 	 146
5550 	 148
5700 	 150
5852 	 152
6006 	 154
6162 	 156
6320 	 158
6480 	 160
6642 	 162
6806 	 164
6972 	 166
7140 	 168
7310 	 170
7482 	 172
7656 	 174
7832 	 176
8010 	 178
8190 	 180
8372 	 182
8556 	 184
8742 	 186
8930 	 188
9120 	 190
9312 	 192
9506 	 194
9702 	 196
...

In [3]:
# Define a function called `total` that takes a list of numbers as input
def total(lst):
    # in original I used the index of the list
    # ...
    # n = len(lst)
    # for i in range(n)
    # Initialize a variable to keep track of the cumulative sum of the numbers in the list
    total_ = 0
    # Loop through each number in the list
    for val in lst:
        # Add the current number to the running total
        total_ += val
    # Return the final cumulative sum of the numbers in the list
    return total_

# Call the `total` function with the `values` list as an argument
total(values)


249500

In [4]:
# Call the `total` function with a list of integers as an argument
total([i for i in range(-1000, 10000, 53)])

# The argument is a list comprehension that creates a list of integers starting at -1000 and
# increasing by 53 on each iteration while the value stays below 10000 (the last value is 9971).
# The `total` function will add up all the numbers in this list and return the cumulative sum.

932984
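
One way to check a hand-written function is to compare it with Python's built-in `sum()`. The sketch below restates `total` so that it runs on its own; the name `test_lists` is hypothetical.

```python
# restated here so the block is self-contained
def total(lst):
    total_ = 0
    for val in lst:
        total_ += val
    return total_

# the three lists summed so far in this chapter
test_lists = [
    [i for i in range(10)],
    [i for i in range(0, 1000, 2)],
    [i for i in range(-1000, 10000, 53)],
]

for test_list in test_lists:
    # print our result, the built-in result, and whether they agree
    print(total(test_list), sum(test_list), total(test_list) == sum(test_list))
```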

In [5]:
# Import the random module
import random

# Create two lists of numbers
# x1 is a list of multiples of 3 from 3 to 30
x1 = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]

# x2 is a list of 10 random integers between 0 and 100 (inclusive)
x2 = [random.randint(0,100) for i in range(10)]

# Call the `total` function on both lists and print the results
print("Total of x1: ", total(x1))
print("Total of x2: ", total(x2))


Total of x1:  165
Total of x2:  301
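
Because `x2` is drawn at random, its total will generally differ from run to run. If repeatable output is wanted, one option is to seed the generator before drawing. The seed value 42 and the names `first_draw` and `second_draw` are arbitrary assumptions of this sketch, not something used in the run shown here.

```python
import random

# seed the generator so the same "random" list is produced on each run
random.seed(42)
first_draw = [random.randint(0, 100) for i in range(10)]

# reseeding with the same value repeats the sequence
random.seed(42)
second_draw = [random.randint(0, 100) for i in range(10)]

# the two draws are identical because the seed was the same
print(first_draw == second_draw)
```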


#### Mean


Let $X_1, X_2,...,X_n$ represent $n$ random variables. For a given dataset, useful descriptive statistics of central tendency include mean, median, and mode, which we built as functions in a previous chapter. 

We define the mean of a set of numbers:
$\bar{X} = \frac{\sum_{i=0}^{n-1} x_{i}} {n}$

In [6]:
# Define a function called `mean` that takes a list of numbers as input
def mean(lst):
    # Find the number of elements in the list
    n = len(lst)
    # Calculate the mean by dividing the sum of the elements in the list by the number of elements
    mean_ = total(lst) / n
    # Return the calculated mean
    return mean_

# Call the `mean` function on both lists `x1` and `x2` and print the results
print("Mean of x1: ", mean(x1))
print("Mean of x2: ", mean(x2))


Mean of x1:  16.5
Mean of x2:  30.1
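
As a cross-check, the hand-written `mean` can be compared with `statistics.mean` from Python's standard library. The block below restates `total` and `mean` so that it runs on its own; treat it as a sketch for comparison only.

```python
import statistics

# restated so the block is self-contained
def total(lst):
    total_ = 0
    for val in lst:
        total_ += val
    return total_

def mean(lst):
    return total(lst) / len(lst)

x1 = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]

# our function and the standard library should agree
print(mean(x1), statistics.mean(x1))
```

Note that the hand-written `mean` raises a `ZeroDivisionError` when given an empty list, whereas `statistics.mean` raises a `StatisticsError`.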


Now that we have set up total and mean functions, we are ready to calculate 
other core statistical values: 

1. median
2. mode
3. variance
4. standard deviation
5. standard error
6. covariance
7. correlation


#### Median

The **median** is the middle value of a sorted list. It is less sensitive to outliers than the mean. For a sorted series of *odd length* $n$, with indices running from $i=0$ to $n-1$, the median is the value at index $\frac{n-1}{2}$.

For a series of *even length* that is otherwise the same, the median is the mean of the two values that make up the middle of the list. The indices of these values are defined:

$$i_1 = \frac{n}{2} - 1; \quad i_2 = \frac{n}{2}$$

The median is thus defined:
$$\frac{x_{\frac{n}{2} - 1}+x_{\frac{n}{2}}}{2}$$

We can restate that:

$$k = x_{\frac{n}{2} - 1}+x_{\frac{n}{2}}$$

Thus, the median is defined as $\frac{k}{2}$.

In [7]:
# Define a function called `median` that takes a list of numbers as input
def median(lst):
    # Find the number of elements in the list
    n = len(lst)
    # Sort the list
    lst = sorted(lst)

    # Check the length of the list to determine the type of median to be calculated:
    # 1. If the list has an odd number of elements, calculate the median as the middle value
    if n % 2 != 0:
        middle_index = int((n - 1) / 2)
        median_ = lst[middle_index]

    # 2. If the list has an even number of elements, calculate the median as the average of the two middle values
    else:
        upper_middle_index = int(n / 2)
        lower_middle_index = upper_middle_index - 1
        # Pass a slice of the two middle values to the `mean` function to get the average
        median_ = mean(lst[lower_middle_index : upper_middle_index + 1])
    
    # Return the calculated median
    return median_

# Call the `median` function on both lists `x1` and `x2` and print the results
print("Median of x1: ", median(x1))
print("Median of x2: ", median(x2))

# median([1,2,3,9,9,4,5])


Median of x1:  16.5
Median of x2:  19.5


In [8]:
# transform x1 to be of odd length by removing the last element
# this is to test the first case in the median() function
median(x1[:-1])

15
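
As a cross-check on any hand-written median, Python's standard `statistics` module provides `statistics.median`. A brief sketch using the values of `x1`, covering both the even-length and odd-length cases:

```python
import statistics

x1 = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]

# even length: the average of the two middle values, 15 and 18
print(statistics.median(x1))
# odd length: the single middle value of the first nine entries, which is 15
print(statistics.median(x1[:-1]))
# unsorted input is sorted internally before the middle is located
print(statistics.median([1, 2, 3, 9, 9, 4, 5]))
```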

#### Mode

The mode of a list is defined as the number that appears most often in a series of values. 

In order to quickly and cleanly identify the mode, we are going to use a new data structure: the dictionary. A dictionary is like a list, but each element is accessed by a key rather than by its position in an ordered set of index numbers. We are going to use the values from the list passed to the function as keys. Every time a value is encountered, the dictionary will record that it has appeared an additional time by adding one to the count stored under that key.

In [9]:
lst = [1,1,1,1,1,2,3,4,5,5,5,5,5,1000,1000]
# create an empty dictionary
count_dct  = {}

# create entries for each unique value in the list with a count of 0
for key in lst:
    count_dct[key] = 0

# add up each occurrence of each value in the list
for key in lst:
    count_dct[key] += 1

# display the count of each unique value in the list
count_dct


{1: 5, 2: 1, 3: 1, 4: 1, 5: 5, 1000: 2}

In [10]:
def mode(lst):
    count_dct  = {}
    # create entries for each value with 0
    for key in lst:
        count_dct[key] = 0
    # add up each occurrence
    for key in lst:
        count_dct[key] += 1
    # calculate max_count up front
    max_count = max(count_dct.values())
    # now we can compare each count to the max count
    mode_ = []
    for key, count in count_dct.items():
        if count == max_count:
            mode_.append(key)
    return mode_

mode(lst)

[1, 5]
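
The same counting pattern is available in the standard library as `collections.Counter`, which builds the value-to-count dictionary in one step. The sketch below is an alternative, not the chapter's method; the names `mode_counter` and `sample_list` are hypothetical.

```python
from collections import Counter

def mode_counter(lst):
    # Counter maps each unique value to the number of times it appears
    counts = Counter(lst)
    # the highest count observed for any value
    max_count = max(counts.values())
    # keep every value whose count ties for the maximum
    return [key for key, count in counts.items() if count == max_count]

sample_list = [1, 1, 1, 1, 1, 2, 3, 4, 5, 5, 5, 5, 5, 1000, 1000]
print(mode_counter(sample_list))
```

Returning a list rather than a single number keeps ties visible, as in the case of 1 and 5 here.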

### Variance

Averages alone do not fully describe the data: an average does not tell us the shape of a distribution. In this section, we will build functions to calculate statistics describing the distribution of variables and their relationships. The first of these is the variance of a list of numbers.

We define population variance as:

$$ \sigma^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n}$$

When we are dealing with a sample, which is a subset of a population of observations, we instead divide by $n - 1$, the **Degrees of Freedom**, to remove the bias from the estimate. 

$$DoF = n - 1$$

The degrees of freedom is the number of independent observations that go into the estimate of a parameter (the sample size $n$), minus the number of parameters used as intermediate steps in the estimation of the parameter itself. So if we estimate $\bar{X}$ once, we use a single parameter as an intermediate step, which leaves $n - 1$ degrees of freedom. (We will see that more than one parameter is estimated when we use Ordinary Least Squares regression.) The sample variance is therefore:


$$ S^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n-1}$$

In [11]:
def variance(lst, sample=True):
    # calculate the mean of the input list
    list_mean = mean(lst)
    # find the length of the list
    n = len(lst)
    # calculate the degrees of freedom (DoF)
    DoF = n - 1
    # initialize the variable to store the sum of squared differences
    sum_sq_diff = 0

    # loop through the values in the input list
    for val in lst:
        # calculate the difference between the value and the mean
        diff = val - list_mean
        # add the squared difference to the sum of squared differences
        sum_sq_diff += (diff) ** 2

    # calculate the variance depending on whether the input is a sample or a population
    if sample == False:
        variance_ = sum_sq_diff / n
    else:
        variance_ = sum_sq_diff / DoF

    return variance_

# test the function by calculating the variance of x1 with both sample and population options
print('Sample Variance: ', variance(x1, sample= True))
print('Population Variance: ', variance(x1, sample= False))


Sample Variance:  82.5
Population Variance:  74.25
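
These two results can be checked against Python's standard `statistics` module, where `statistics.variance` divides by $n - 1$ and `statistics.pvariance` divides by $n$. A short sketch using the values of `x1`:

```python
import statistics

x1 = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]

# sample variance: sum of squared deviations divided by n - 1
print('Sample Variance: ', statistics.variance(x1))
# population variance: sum of squared deviations divided by n
print('Population Variance: ', statistics.pvariance(x1))
```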


In [12]:
variance(x2, sample= True), variance(x2, sample= False)

(821.2111111111111, 739.0899999999999)
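
To see why dividing by $n - 1$ removes the bias, one can draw many small samples from a known population and average the two versions of the variance. The sketch below is an illustration under stated assumptions: the population (the integers 0 to 99), the sample size of 5, the number of trials, and the seed are all arbitrary choices, and sampling is done with replacement. The `variance` function is restated in compact form so the block runs on its own.

```python
import random

# compact restatement of the chapter's variance function
def variance(lst, sample=True):
    n = len(lst)
    list_mean = sum(lst) / n
    sum_sq_diff = sum((val - list_mean) ** 2 for val in lst)
    return sum_sq_diff / (n - 1) if sample else sum_sq_diff / n

rng = random.Random(0)          # fixed seed so the run is repeatable
population = list(range(100))   # hypothetical population
true_variance = variance(population, sample=False)

trials = 20000
sample_size = 5
avg_sample_formula = 0
avg_population_formula = 0
for _ in range(trials):
    # draw a small sample with replacement
    draw = rng.choices(population, k=sample_size)
    # accumulate the average of each version of the variance
    avg_sample_formula += variance(draw, sample=True) / trials
    avg_population_formula += variance(draw, sample=False) / trials

print("True population variance: ", true_variance)
print("Average dividing by n - 1:", avg_sample_formula)
print("Average dividing by n:    ", avg_population_formula)
```

The average that divides by $n - 1$ should land close to the true variance, while the average that divides by $n$ should fall short by roughly a factor of $\frac{n-1}{n}$.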

#### Standard Deviation

From a list’s variance, we calculate its standard deviation as the square root of the variance. Standard deviation is regularly used in data analysis, primarily because it has the same units of measurement as the mean. Taking the square root undoes the squaring of each observation’s deviation from the mean that is performed when calculating variance. It is denoted $s$ when working with a sample with an unknown population mean $\mu$. $s$ is an _estimator_ of $\sigma$, which is the standard deviation when $\mu$ is known: 

$s = \sqrt{S^2}$

The square-root relationship holds for both the population and sample standard deviations. The function and its use are shown below:

In [17]:
def SD(lst, sample= True):
    """
    Calculates the standard deviation of a given list.

    Parameters:
    lst (list): List of values
    sample (bool): Whether to calculate population or sample standard deviation. Default is True (sample).
    
    Returns:
    float: Standard deviation of the list

    """
    # calculate variance using the variance function
    variance_ = variance(lst, sample)
    # take the square root of variance to get standard deviation
    SD_ = variance_ ** (1/2)
    return SD_

# test function
print('Sample Standard Deviation: ', SD(x1, sample= True))
print('Population Standard Deviation: ', SD(x1, sample= False))


Sample Standard Deviation:  9.082951062292475
Population Standard Deviation:  8.616843969807043


In [14]:
SD(x2, sample= True), SD(x2, sample= False)

(28.656781241289313, 27.18620973949844)
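
As with variance, the standard library offers counterparts for comparison: `statistics.stdev` for the sample standard deviation and `statistics.pstdev` for the population standard deviation. A brief sketch using the values of `x1`; the last few decimal places may differ slightly from a hand-written version because of floating-point rounding.

```python
import statistics

x1 = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]

# square root of the sample variance (divides by n - 1)
print('Sample Standard Deviation: ', statistics.stdev(x1))
# square root of the population variance (divides by n)
print('Population Standard Deviation: ', statistics.pstdev(x1))
```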