
# Assignment 9.2

> Replace all TODOs with your code. Do not change any other code.

In [2]:
# Do not edit this cell

from typing import List

## Descriptive statistics

In this assignment, we will write functions that calculate basic statistics from scratch, without using numpy.

### Task 1

Let's start simple: write a function `mean` that calculates the average of the list.

$$\mu = \frac{{\sum_{i=1}^n x_i}}{{n}}$$

In [3]:
def mean(li: List[float]) -> float:
    if len(li) == 0:  # Guard against division by zero on an empty list
        return 0.0
    return sum(li) / len(li)  # Sum of the values divided by their count


assert mean([1., 2., 3.]) == 2.
assert mean([1., 1., 2., 0.]) == 1.

### Task 2

Now let's calculate variance (dispersion). You may use the `mean` function implemented before.

$$V = \frac{{\sum_{i=1}^n (x_i - \mu)^2}}{{n}}$$

In [4]:
def variance(li: List[float]) -> float:
    if len(li) == 0:
        return 0.0

    mu = mean(li)  # Average value of the list
    squared_diffs = [(x - mu) ** 2 for x in li]  # Squared deviations from the mean
    return sum(squared_diffs) / len(li)  # Population variance (divide by n)


assert variance([1., 1., 1.]) == 0.
assert variance([1., 2., 3., 4.]) == 1.25
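
The second assertion can be checked by hand. For the list `[1., 2., 3., 4.]` the mean is 2.5, so substituting into the formula above gives:

$$V = \frac{(1 - 2.5)^2 + (2 - 2.5)^2 + (3 - 2.5)^2 + (4 - 2.5)^2}{4} = \frac{2.25 + 0.25 + 0.25 + 2.25}{4} = \frac{5}{4} = 1.25$$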

### Task 3

The standard deviation is easy once you get the variance:

$$\sigma = \sqrt{V}$$

In [5]:
import math

In [6]:
def std(li: List[float]) -> float:
    return math.sqrt(variance(li))


assert std([1., 1., 1.]) == 0.
assert std([1., 2., 3., 4.]) == 1.25**0.5


### Task 4

**Median**

The median is the middle value in a sorted dataset. If the dataset has an odd number of values, the median is the value at the center. If the dataset has an even number of values, the median is the average of the two middle values.

In [7]:
def median(li: List[float]) -> float:
    if len(li) == 0:
        return 0.0  # If the list is empty, return 0

    sorted_li = sorted(li)  # Sorted copy; the input list is left unchanged
    n = len(sorted_li)

    if n % 2 == 1:  # Odd number of elements
        return sorted_li[n // 2]  # Return the single middle element
    else:
        # Even number of elements
        mid1 = sorted_li[n // 2 - 1]  # The first middle element
        mid2 = sorted_li[n // 2]      # The second middle element
        return (mid1 + mid2) / 2  # Return the average of the two middle elements


assert median([1., 1., 1.]) == 1.
assert median([1., 4., 3., 2.]) == 2.5
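
As an independent cross-check of the four functions above, the standard library's `statistics` module offers population versions of the same quantities. The sketch below is not part of the assignment. It assumes Python 3.8 or later (for `statistics.fmean`), and the variable names are illustrative. Note that `statistics.variance` and `statistics.stdev` divide by `n - 1`, so the `p`-prefixed functions are the ones that match the formulas used here.

```python
import math
import statistics

data = [1., 4., 3., 2.]

# Population variants divide by n, matching the formulas in Tasks 1-4
ref_mean = statistics.fmean(data)
ref_var = statistics.pvariance(data)
ref_std = statistics.pstdev(data)
ref_median = statistics.median(data)

print(ref_mean, ref_var, ref_std, ref_median)

# Compare floats with a tolerance rather than == when a square root is involved
print(math.isclose(ref_std, 1.25 ** 0.5))
```

The same `math.isclose` comparison may be a safer choice than `==` when checking `std` on inputs whose variance is not exactly representable as a float.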

## Measure performance

Sometimes, apart from theoretical, algorithmic complexity, it's a good idea to compare the runtime of two algorithms empirically, i.e., run the code many times and time it.

Python's standard library includes the [timeit](https://docs.python.org/3/library/timeit.html) module, which does exactly that.
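
As a minimal sketch of how `timeit.timeit` is called (the statement and data below are illustrative placeholders, not part of the assignment): `setup` runs once, `stmt` runs `number` times, and the return value is the total elapsed time in seconds.

```python
import timeit

runs = 1_000

# `setup` is executed once; `stmt` is executed `runs` times
elapsed = timeit.timeit(
    stmt='sum(data)',
    setup='data = list(range(1_000))',
    number=runs,
)

# The result is the total for all runs, so divide to get a per-call figure
print(f'total: {elapsed:.4f} s, per call: {elapsed / runs:.2e} s')
```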

Let's compare the runtime of your implementations and numpy. Use the provided setup code:

In [12]:
import timeit

# generate data for tests
setup = '''
import random
import numpy as np

arr = np.random.rand(10_000) * 100
li = [random.random() * 100 for _ in range(10_000)]
'''
# Note: these definitions shadow the earlier mean, variance, std, and median (no empty-list guard)
def mean(arr):
    return sum(arr) / len(arr)

def variance(arr):
    m = mean(arr)
    return sum((x - m) ** 2 for x in arr) / len(arr)

def std(arr):
    return variance(arr) ** 0.5

def median(arr):
    sorted_arr = sorted(arr)
    n = len(arr)
    mid = n // 2
    return (sorted_arr[mid] + sorted_arr[~mid]) / 2

# pass your function to timeit module
funcs = {
    'mean': mean,
    'variance': variance,
    'std': std,
    'median': median,
}

# Run the performance comparison for every function in the dictionary
def run_time():
    for name in funcs:
        # Time the custom implementation on the Python list
        time_custom = timeit.timeit(f'{name}(li)', setup=setup + f'from __main__ import {name}', number=1000)
        print(f'Execution time {name} for the list: {time_custom}')

        # Time the NumPy counterpart on the array (NumPy names variance `np.var`)
        np_name = 'var' if name == 'variance' else name
        time_numpy = timeit.timeit(f'np.{np_name}(arr)', setup=setup, number=1000)
        print(f'Execution time np.{np_name} for the array: {time_numpy}')

# Run the timing comparison
run_time()

Execution time mean for the list: 0.08715978200007157
Execution time np.mean for the array: 0.015540628000053403
Execution time variance for the list: 2.717448500000046
Execution time np.var for the array: 0.042622239999673184
Execution time std for the list: 1.6022687250001582
Execution time np.std for the array: 0.03973755299966797
Execution time median for the list: 1.6759905779999826
Execution time np.median for the array: 0.1366638829999829
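
In the output above, `variance` appears slower than `std` even though `std` calls `variance`, which suggests that a single measurement is noisy. One common way to reduce that noise is `timeit.repeat`, which performs several independent measurements so that the minimum can be reported. The sketch below is an illustration only: it uses a small pure-Python `mean` and hypothetical variable names rather than the notebook's setup string.

```python
import random
import timeit

data = [random.random() * 100 for _ in range(10_000)]

def mean(values):
    return sum(values) / len(values)

# Five independent measurements of 100 calls each
runs = timeit.repeat(lambda: mean(data), repeat=5, number=100)

# The minimum is usually the least disturbed by other activity on the machine
best = min(runs)
print(f'best of {len(runs)}: {best:.4f} s per 100 calls')
```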


### Task 5

Complete Python statements to compare your functions to numpy. Use `li` for your function and `arr` for numpy functions.

In [13]:
stmt_mean_custom = 'mean(li)'
stmt_mean_np = 'np.mean(arr)'

stmt_var_custom = 'variance(li)'  # Custom variance
stmt_var_np = 'np.var(arr)'  # NumPy variance

stmt_std_custom = 'std(li)'  # Custom standard deviation
stmt_std_np = 'np.std(arr)'  # NumPy standard deviation

stmt_median_custom = 'median(li)'  # Custom median
stmt_median_np = 'np.median(arr)'  # NumPy median
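
Statement strings like these are evaluated by `timeit` in its own namespace, so every name they use has to be made visible to it. Besides importing from `__main__` in the setup string, one alternative is the `globals` parameter of `timeit.timeit` (available since Python 3.5). The sketch below assumes a small stand-in `mean` and list; in a notebook, passing `globals=globals()` would expose the session's own names instead.

```python
import timeit

def mean(values):
    return sum(values) / len(values)

li = [float(i) for i in range(1_000)]
stmt_mean_custom = 'mean(li)'

# Supply the names the statement needs explicitly instead of via a setup string
elapsed = timeit.timeit(
    stmt_mean_custom,
    globals={'mean': mean, 'li': li},
    number=100,
)
print(f'{stmt_mean_custom}: {elapsed:.4f} s per 100 calls')
```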

### Task 6

Measure the average execution time of your statements with the `timeit` module. As your submission, fill out the table with the results (rounded to 2 decimal places).

In [17]:
import timeit

# generate data for tests
setup = '''
import random
import numpy as np

arr = np.random.rand(10_000) * 100
li = [random.random() * 100 for _ in range(10_000)]
from __main__ import mean, variance, std, median
'''

# Time the mean statements from Task 5
time_mean_custom = timeit.timeit(stmt=stmt_mean_custom, setup=setup, number=10_000)
time_mean_np = timeit.timeit(stmt=stmt_mean_np, setup=setup, number=10_000)

# Time the variance statements
time_var_custom = timeit.timeit(stmt=stmt_var_custom, setup=setup, number=10_000)
time_var_np = timeit.timeit(stmt=stmt_var_np, setup=setup, number=10_000)

# Time the standard deviation statements
time_std_custom = timeit.timeit(stmt=stmt_std_custom, setup=setup, number=10_000)
time_std_np = timeit.timeit(stmt=stmt_std_np, setup=setup, number=10_000)

# Time the median statements
time_median_custom = timeit.timeit(stmt=stmt_median_custom, setup=setup, number=10_000)
time_median_np = timeit.timeit(stmt=stmt_median_np, setup=setup, number=10_000)

# Results
print("Time per 10000 executions, secs:\n")
print("Func\t\tCustom\t\tNumpy")
print(f"mean\t\t{round(time_mean_custom, 2)}\t\t{round(time_mean_np, 2)}")
print(f"var\t\t{round(time_var_custom, 2)}\t\t{round(time_var_np, 2)}")
print(f"std\t\t{round(time_std_custom, 2)}\t\t{round(time_std_np, 2)}")
print(f"median\t\t{round(time_median_custom, 2)}\t\t{round(time_median_np, 2)}")

Time per 10000 executions, secs:

Func		Custom		Numpy
mean		0.53		0.09
var		17.54		0.42
std		18.87		0.47
median		17.3		0.99


Time per 10000 executions, secs

| Func       | Custom | Numpy |
| ---------- | ------ | ----- |
| mean       | 0.53   | 0.09  |
| var        | 17.54  | 0.42  |
| std        | 18.87  | 0.47  |
| median     | 17.30  | 0.99  |
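
To make the comparison easier to read, the measured times can be turned into speedup factors (custom time divided by NumPy time). The sketch below hard-codes the values from the run printed above, so the ratios describe that single run on that machine and will differ elsewhere.

```python
# Times copied from the printed results above (seconds per 10,000 executions)
times = {
    'mean': (0.53, 0.09),
    'var': (17.54, 0.42),
    'std': (18.87, 0.47),
    'median': (17.3, 0.99),
}

# Ratio > 1 means the NumPy call was faster than the custom function
speedups = {name: custom / numpy_time for name, (custom, numpy_time) in times.items()}

for name, ratio in speedups.items():
    print(f'{name}\t{ratio:.1f}x')
```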