<a id='top'></a>

# CSCI3022 F21
# Homework 2: Visualizing and Processing Data
***

**Name**: Aidan Reese

***

This assignment is due on Canvas (as .ipynb) and Gradescope (as a .pdf) by **MIDNIGHT on Mon 13 Sep**. Your solutions to theoretical questions should be done in Markdown directly below the associated question.  Your solutions to computational questions should include any specified Python code and results as well as written commentary on your conclusions.  Remember that you are encouraged to discuss the problems with your classmates, but **you must write all code and solutions on your own**.

**NOTES**: 

- Any relevant data sets should be available on Canvas. To make life easier on the graders if they need to run your code, do not change the relative path names here. Instead, move the files around on your computer.
- If you're not familiar with typesetting math directly into Markdown then by all means, do your work on paper first and then typeset it later.  Here is a [reference guide](https://math.meta.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference) linked on Canvas on writing math in Markdown. **All** of your written commentary, justifications and mathematical work should be in Markdown.  I also recommend the [wikibook](https://en.wikibooks.org/wiki/LaTeX) for LaTeX.
- Because you can technically evaluate notebook cells in a non-linear order, it's a good idea to do **Kernel $\rightarrow$ Restart & Run All** as a check before submitting your solutions.  That way if we need to run your code you will know that it will work as expected. 
- It is **bad form** to make your reader interpret numerical output from your code.  If a question asks you to compute some value from the data you should show your code output **AND** write a summary of the results in Markdown directly below your code. 
- 45 points of this assignment are in problems.  The remaining 5 are for neatness, style, and overall exposition of both code and text.
- This probably goes without saying, but... For any question that asks you to calculate something, you **must show all work and justify your answers to receive credit**. Sparse or nonexistent work will receive sparse or nonexistent credit. 

---
**Shortcuts:**  [Problem 1](#p1) | [Problem 2](#p2) |
---

In [28]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline



[Back to top](#top)
<a id='p1'></a>

## (20 pts) Problem 1: Computation (Streaming Means)
***

Data science is often divided into two categories: questions of *what* the best value might be to represent a data problem, and questions of *how* to compute that value.  Question 1 - and prior lectures - should tell you that computing the mean is valuable!  But *how* do we compute the mean?

Let $x_1, x_2, \ldots, x_n$ be $n$ observations of a variable of interest.  Recall that the sample mean $\bar{x}_n$ and sample variance $s^2_n$ are given by 
<a id='eq1'></a>
$$
\bar{x}_n = \frac{1}{n}\sum_{k=1}^n x_k \quad \textrm{and} \quad s^2_n = \frac{1}{n-1}\sum_{k=1}^n \left( x_k - \bar{x}_n\right)^2 \qquad \tag{Equation 1}
$$

**Part A**:

How many computations - floating point operations: addition, subtraction, multiplication, division each count as 1 operation - are required to compute the mean of the data set with $n$ observations?


Summing all $n$ values takes $n-1$ additions, and dividing the sum by $n$ takes one more operation, so computing the mean requires $(n-1)+1 = n$ operations.

**Part B**:

Now suppose our data is *streaming*: we slowly add observations one at a time, instead of seeing the entire data set at once.  We are still interested in the mean, so if we stream the data set `[4,6,0,10, ...]`, we first compute the mean of the first data point `[4]`, then we recompute the mean of the first two points `[4,6]`, then we recompute the mean of three `[4,6,0]`, and so forth.

Suppose we recompute the mean from scratch after each of our $n$ observations is added, one by one, to our data set.  How many floating point operations are spent computing (and re-computing) the mean of the data set?

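Recomputing the mean of the first $k$ observations from scratch costs $k$ operations ($k-1$ additions and one division). When $n=1$, only one operation is done ($4/1$). When $n=2$, a total of 3 operations have been done: 1 for the first mean, then 2 for $(4+6)/2$. When $n=3$, a total of 6 operations have been done: 1 for $4/1$, 2 for $(4+6)/2$, and 3 for $(4+6+0)/3$.

Operations = $\sum_{k=1}^{n} k = \frac{n(n+1)}{2}$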
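To make the count concrete, here is a minimal sketch that tallies operations under the convention from Part A (a mean of $k$ values costs $k-1$ additions plus 1 division). The helper name `naive_streaming_flops` is hypothetical and introduced only for illustration.

```python
def naive_streaming_flops(n):
    """Total operations when the mean is recomputed from scratch
    after each of the first n observations arrives."""
    total = 0
    for k in range(1, n + 1):
        additions = k - 1   # summing k values
        divisions = 1       # dividing the sum by k
        total += additions + divisions
    return total

# Running totals for a few stream lengths
for n in (1, 2, 3, 14):
    print(n, naive_streaming_flops(n))
```

Under this convention the totals are 1, 3, 6 and 105, which agrees with the triangular-number pattern $n(n+1)/2$.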

We should be convinced that streaming a mean costs a lot more computer time than just computing once!

In this problem we explore a smarter method for such an _online_ computation of the mean.  

**Result**: The following relation holds between the mean of the first $n-1$ observations and the mean of all $n$ observations: 

$$
\bar{x}_n = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}
$$


A proof of this result is in the [Appendix](#Appendix) after this problem, and requires some careful manipulations of the sum $\bar{x}_n$.  Your task will be to computationally verify and utilize this result.

**Part C**: Write a function `my_sample_mean` that takes as its input a numpy array and returns the mean of that numpy array using the formulas from class ([Equation 1](#eq1)). Write another function `my_sample_var` that takes as its input a numpy array and returns the variance of that numpy array, again using the formulas from class ([Equation 1](#eq1)). You may **not** use any built-in sample mean or variance functions.

In [64]:
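def my_sample_mean(data):
    # Mean = (sum of the observations) / (number of observations)
    data = np.asarray(data, dtype=float)
    return np.sum(data) / len(data)

def my_sample_var(data):
    # Sample variance = sum of squared deviations from the mean, divided by n - 1
    data = np.asarray(data, dtype=float)
    mean = my_sample_mean(data)
    return np.sum((data - mean) ** 2) / (len(data) - 1)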

**Part D**: Use your functions from Part C to compute the sample mean and sample variance of the following array, which contains the minutes late that the BuffBus is running on Friday afternoon.

`bus = [312, 4, 10, 0, 22, 39, 81, 19, 8, 60, 80, 42,12,1]`

In [30]:
bus = [312, 4, 10, 0, 22, 39, 81, 19, 8, 60, 80, 42,12,1]

print("Mean : ", my_sample_mean(bus))
print("Variance : ", my_sample_var(bus))

Mean :  49.285714285714285
Variance :  6488.681318681319
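As a sanity check only (Part C rules out built-ins for the actual solution), the values above can be compared against NumPy's own routines. The sketch below assumes the same `bus` values; note that `np.var` needs `ddof=1` to use the $n-1$ denominator from Equation 1.

```python
import numpy as np

bus = np.array([312, 4, 10, 0, 22, 39, 81, 19, 8, 60, 80, 42, 12, 1])

# np.var defaults to the population variance (divide by n);
# ddof=1 switches to the sample variance (divide by n - 1).
print("NumPy mean     :", np.mean(bus))
print("NumPy variance :", np.var(bus, ddof=1))
```

Both values should agree with the hand-written functions up to floating point rounding.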


**Part E**: Implement a third function called `update_mean` that implements the formula discussed after part B. Note that this function will need to take as its input three things: $x_n$, $\bar{x}_{n-1}$ and $n$, and returns $\bar{x}_{n}$. A function header and return statement are provided for you. This function may be auto-graded, so please do not change the given header API - the order of inputs matters! If you change it, you might lose points.

Use this function to compute the values that you get from taking the mean of the first BuffBus's lateness, the first two BuffBuses' lateness, the first three BuffBuses' lateness, and so on up to all of the `bus` data points from **Part D**. Store your streaming bus means in a numpy array called `buffbus_bad_means`.  Report all 14 estimates in `buffbus_bad_means`.

In [31]:
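# Given API: update_mean(prev_mean, xn, n)
def update_mean(prev_mean, xn, n):
    # Streaming update: new mean = previous mean + (new value - previous mean) / n
    new_mean = prev_mean + (xn - prev_mean) / n
    return new_mean

# Seed with the first observation; the n = 1 update then returns bus[0] unchanged
prev_mean = bus[0]
buffbus_bad_means = []

for i in range(len(bus)):
    prev_mean = update_mean(prev_mean, bus[i], i + 1)
    buffbus_bad_means.append(prev_mean)

# Store as a numpy array, as requested; print as a list to show every value
buffbus_bad_means = np.array(buffbus_bad_means)
print(buffbus_bad_means.tolist())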

[312.0, 158.0, 108.66666666666666, 81.5, 69.6, 64.5, 66.85714285714286, 60.875, 55.0, 55.5, 57.72727272727273, 56.416666666666664, 53.0, 49.285714285714285]
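One way to check the streamed values is to compare them with the mean of the first $n$ observations computed directly. The sketch below is self-contained and restates `update_mean` with the given API; starting from `mean = 0.0` is an assumption made here for convenience, and it is harmless because the previous mean cancels out when $n = 1$.

```python
import numpy as np

def update_mean(prev_mean, xn, n):
    # Same recurrence as in Part E: new mean = prev_mean + (xn - prev_mean) / n
    return prev_mean + (xn - prev_mean) / n

bus = [312, 4, 10, 0, 22, 39, 81, 19, 8, 60, 80, 42, 12, 1]

streamed = []
mean = 0.0  # placeholder; it cancels out in the n = 1 update
for n, xn in enumerate(bus, start=1):
    mean = update_mean(mean, xn, n)
    streamed.append(mean)

# Reference: mean of the first n values, computed directly from cumulative sums
direct = np.cumsum(bus) / np.arange(1, len(bus) + 1)

print(np.allclose(streamed, direct))
```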


To ensure your function complies with the given API, run this small test, where we suppose we have a mean of $\bar{x}_2 = 1$ with the first $2$ data points (`prev_mean`), and we update this with the 3rd ($n=3$) data point which is $x_3=2$:

In [65]:
assert update_mean(1,2,3)==4/3, "Warning: function seems broken."

**Part F**:

How many floating point operations were spent computing the final result in your code in part E?  Is this truly better than the uninformed approach from part B?

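Each call to `update_mean` costs 3 operations (one subtraction, one division, one addition), so the streaming approach uses about $3(n-1)$ operations after the first observation (or $3n$ if the first observation is also passed through `update_mean`, as in the loop above). This grows linearly in $n$, whereas recomputing from scratch as in Part B grows quadratically. The two counts are similar for small $n$ (up to about $n=4$); beyond that the streaming update is increasingly cheaper.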
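The comparison can be tabulated with a short sketch. Both helper names are hypothetical, and the counts assume the conventions used in this problem: a from-scratch mean of $k$ values costs $k$ operations, and each streaming update costs 3 operations with the first observation stored for free.

```python
def naive_flops(n):
    # Recompute from scratch each time: 1 + 2 + ... + n operations
    return n * (n + 1) // 2

def streaming_flops(n):
    # One subtraction, one division, one addition per update,
    # assuming the first observation is stored without an update
    return 3 * (n - 1)

for n in (1, 2, 3, 4, 5, 14, 1000):
    print(f"n={n:5d}  naive={naive_flops(n):7d}  streaming={streaming_flops(n):5d}")
```

Under these assumptions the two counts tie at $n=2$ and $n=3$, and the streaming count is strictly smaller from $n=4$ onward.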

**Part G:**
A similar result to the formula preceding part C holds for variance.  In particular, we can write that:

$$
\displaystyle s^2 = \frac{\sum_{i=1}^n (x_i-\bar{x})^2}{n-1} = \frac{1}{n(n-1)} \left(n \cdot \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i \right)^2 \right)
$$

Describe in **words** and/or **pseudocode** how you might adapt the function you made in part **E** to perform running calculations of *both* variance and mean.  Be very clear as to what the input/instantiation arguments would be as well as what the output arguments would be in addition to any intermediate calculations.

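The inputs would be the running totals $\sum x_i$ and $\sum x_i^2$ over the observations seen so far, the number of observations so far, and the new value $x_n$. The function would add $x_n$ to the first total and $x_n^2$ to the second, increase the count to $n$, and then return the updated totals together with the mean $\frac{1}{n}\sum x_i$ and the variance from the formula above.

As in Part E, each update costs a fixed number of operations (more than the 3 needed for the mean alone), so the total work grows linearly rather than quadratically in $n$. Once the two sums are updated, plugging them into the expression over $n(n-1)$ gives $s^2$ directly.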
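A minimal sketch of such a running update, built on the sum-of-squares formula in Part G, might look like the following. The function name `update_stats` and its argument order are hypothetical, not part of the assignment's API.

```python
def update_stats(sum_x, sum_x2, n_prev, xn):
    """Hypothetical helper: fold one new observation into running totals.

    Inputs : running sum of x, running sum of x^2, count so far, new value.
    Outputs: updated sums and count, plus the current mean and sample variance.
    """
    n = n_prev + 1
    sum_x = sum_x + xn
    sum_x2 = sum_x2 + xn * xn
    mean = sum_x / n
    # Sample variance is undefined for a single observation
    var = (n * sum_x2 - sum_x ** 2) / (n * (n - 1)) if n > 1 else float("nan")
    return sum_x, sum_x2, n, mean, var

bus = [312, 4, 10, 0, 22, 39, 81, 19, 8, 60, 80, 42, 12, 1]
sum_x, sum_x2, n = 0.0, 0.0, 0
for xn in bus:
    sum_x, sum_x2, n, mean, var = update_stats(sum_x, sum_x2, n, xn)

print(mean, var)
```

This form can lose precision when the mean is large relative to the spread of the data, because it subtracts two large, nearly equal numbers; it is used here only because it is the formula given in Part G.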



<a id='Appendix'></a>

## Appendix 

*Goal*: Prove that 
$$
\bar{x}_n = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}
$$

Note that you can get an expression for $\bar{x}_{n-1}$ by simply replacing $n$ in Equation 1 above with $n-1$.

We'll start with $\bar{x}_n$ and massage it until we get the right-hand side of the formula:

\begin{eqnarray}
\nonumber \bar{x}_n &=& \frac{1}{n} \sum_{k=1}^n x_k \\
&=& \frac{1}{n} \sum_{k=1}^{n-1} x_k + \frac{1}{n}x_n \\
&=& \frac{n-1}{n-1}\frac{1}{n} \sum_{k=1}^{n-1} x_k + \frac{1}{n}x_n \\
&=& \frac{n-1}{n} \left(\frac{1}{n-1} \sum_{k=1}^{n-1} x_k\right) + \frac{1}{n}x_n \\
&=& \frac{n-1}{n} \bar{x}_{n-1} + \frac{1}{n}x_n \\
&=& \frac{n}{n}\bar{x}_{n-1} - \frac{1}{n}\bar{x}_{n-1} + \frac{1}{n}x_n \\
&=&  \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n} \quad \checkmark
\end{eqnarray}



<a id='p2'></a>

## (25 pts) Problem 2: Data (Grouping and Plotting)

The US Census Bureau is one of the largest data gathering organizations in the country.  They often have to analyze data involving the entire nation and describe it according to a variety of factors, including grouping by location (state, city, neighborhood), demographic factors, time, and more.  For this problem we have access to 10 years of state-wide reported unemployment data: for each of the reporting governments, we have 120 months of unemployment percentages.

Our goal is to explore this data and visualize it.

In [227]:
df=pd.read_csv('../data/employment.csv', encoding='UTF-8')
dfstates=pd.read_csv('../data/stategeocodes.csv', encoding='UTF-8')


**Part A:**  Load in the data above from both `employment.csv` and `stategeocodes.csv` and make sure you understand the data's shape and form.  For each file, check out the data frame's `.dtypes`, then print out its `.shape` and `.head()`.  Is each field of the correct data type?  Do we have the expected number of rows for tracking all 50 states?


In [228]:
print(df.dtypes)
print(df.shape)

df.head(20)


Series ID      int64
11-Jan       float64
11-Feb       float64
11-Mar       float64
11-Apr       float64
              ...   
20-Aug       float64
20-Sep       float64
20-Oct       float64
20-Nov       float64
20-Dec       float64
Length: 121, dtype: object
(51, 121)


Unnamed: 0,Series ID,11-Jan,11-Feb,11-Mar,11-Apr,11-May,11-Jun,11-Jul,11-Aug,11-Sep,...,20-Mar,20-Apr,20-May,20-Jun,20-Jul,20-Aug,20-Sep,20-Oct,20-Nov,20-Dec
0,1,10.7,10.3,9.8,9.2,9.3,10.3,10.1,9.8,9.5,...,2.8,12.7,7.6,8.0,7.9,7.3,6.8,4.3,3.8,3.5
1,2,9.0,8.9,8.6,8.0,7.6,7.8,7.0,6.7,7.0,...,6.2,12.4,12.0,11.2,10.2,6.2,6.5,5.9,6.3,6.6
2,4,10.1,9.6,9.3,9.1,8.9,10.4,10.2,9.8,9.5,...,5.2,14.2,10.7,10.8,10.9,7.1,7.0,6.4,6.4,6.7
3,5,8.9,8.7,8.0,7.5,7.8,8.3,8.3,7.9,7.7,...,4.2,9.7,8.7,7.9,7.5,6.3,5.8,4.8,4.8,4.7
4,6,12.7,12.4,12.3,11.7,11.5,12.2,12.4,12.1,11.7,...,5.1,16.0,15.5,14.1,13.6,12.3,10.5,9.3,8.3,9.1
5,8,9.5,9.4,9.1,8.5,8.9,8.7,8.5,8.5,8.3,...,5.0,12.2,11.7,11.4,7.5,7.0,6.8,6.6,6.7,6.9
6,9,10.1,9.8,9.4,8.8,9.0,9.3,9.3,9.2,8.8,...,4.3,8.4,11.6,11.4,11.5,8.3,8.2,7.7,7.5,7.7
7,10,8.6,8.4,7.9,7.4,7.2,8.0,7.9,7.8,7.5,...,5.2,13.3,13.6,13.3,8.6,7.7,7.3,5.1,5.1,5.8
8,11,10.2,10.1,10.0,9.4,9.9,11.0,10.9,10.7,10.5,...,5.5,10.6,9.0,9.2,9.5,8.9,8.8,8.2,8.4,8.8
9,12,10.8,10.3,10.1,9.8,9.9,10.4,10.4,10.3,10.0,...,5.0,13.9,14.3,11.7,11.9,8.0,7.3,5.6,5.0,4.2


In [229]:
print(dfstates.dtypes)
print(dfstates.shape)

print("We have a lot of 'Regions' and 'Divisions' in addition to the 50 states, giving us 64")

dfstates.head(20)



Region           int64
Division         int64
State (FIPS)     int64
Name            object
dtype: object
(64, 4)
We have a lot of 'Regions' and 'Divisions' in addition to the 50 states, giving us 64


Unnamed: 0,Region,Division,State (FIPS),Name
0,1,0,0,Northeast Region
1,1,1,0,New England Division
2,1,1,9,Connecticut
3,1,1,23,Maine
4,1,1,25,Massachusetts
5,1,1,33,New Hampshire
6,1,1,44,Rhode Island
7,1,1,50,Vermont
8,1,2,0,Middle Atlantic Division
9,1,2,34,New Jersey


**Part B:**  The official US census divides the US into 4 super-regions, [shown here](https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf).  Add a column for `Region` and assign all of the states to their correct region.  

Unfortunately, the data wasn't given with these regions, so we have to add them in using the second data file.  We also only have odd codes for each of the states, instead of their names!  Add both `"State"` and `"Region"` columns to the employment data frame with the actual state names and their region numbers or names. You can match IDs from `State (FIPS)` in the `stategeocodes.csv` to the `Series ID` from `employment.csv`.


In [230]:
# Attach Region, Division, and Name to each state by matching Series ID to State (FIPS).
# The inner join keeps only rows present in both files, so the region and division rows drop out.
df1 = df.merge(dfstates, how='inner', left_on='Series ID', right_on='State (FIPS)')
df1.head()

Unnamed: 0,Series ID,11-Jan,11-Feb,11-Mar,11-Apr,11-May,11-Jun,11-Jul,11-Aug,11-Sep,...,20-Jul,20-Aug,20-Sep,20-Oct,20-Nov,20-Dec,Region,Division,State (FIPS),Name
0,1,10.7,10.3,9.8,9.2,9.3,10.3,10.1,9.8,9.5,...,7.9,7.3,6.8,4.3,3.8,3.5,3,6,1,Alabama
1,2,9.0,8.9,8.6,8.0,7.6,7.8,7.0,6.7,7.0,...,10.2,6.2,6.5,5.9,6.3,6.6,4,9,2,Alaska
2,4,10.1,9.6,9.3,9.1,8.9,10.4,10.2,9.8,9.5,...,10.9,7.1,7.0,6.4,6.4,6.7,4,8,4,Arizona
3,5,8.9,8.7,8.0,7.5,7.8,8.3,8.3,7.9,7.7,...,7.5,6.3,5.8,4.8,4.8,4.7,3,7,5,Arkansas
4,6,12.7,12.4,12.3,11.7,11.5,12.2,12.4,12.1,11.7,...,13.6,12.3,10.5,9.3,8.3,9.1,4,9,6,California
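Part B asks for a `State` column, while the merge above leaves the state names under `Name`. The sketch below shows one possible cleanup on toy frames that stand in for the two CSV files (a few values copied from the outputs above, columns trimmed); the variable name `merged` and the toy data are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for employment.csv and stategeocodes.csv
df = pd.DataFrame({"Series ID": [1, 2, 4],
                   "11-Jan": [10.7, 9.0, 10.1]})
dfstates = pd.DataFrame({"Region": [3, 3, 4, 4, 4],
                         "Division": [0, 6, 0, 9, 8],
                         "State (FIPS)": [0, 1, 0, 2, 4],
                         "Name": ["South Region", "Alabama", "West Region",
                                  "Alaska", "Arizona"]})

merged = (
    df.merge(dfstates, how="inner", left_on="Series ID", right_on="State (FIPS)")
      .drop(columns=["State (FIPS)"])        # duplicate of Series ID after the join
      .rename(columns={"Name": "State"})     # match the column name Part B asks for
)
print(merged)
```

Rows with `State (FIPS)` equal to 0 (the region and division headers) have no matching `Series ID`, so the inner join leaves them out.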


**Part C:**

As a sanity check, loop over all the unique regions you've created and print out how many rows of your data frame are in that region.  You should find:

1) 9 in the Northeast

2) 12 in the Midwest

3) 17 in the South

4) 13 in the West

In [231]:
Regions = df1['Region'].to_list()

Northeast = Regions.count(1)
Midwest = Regions.count(2)
South = Regions.count(3)
West = Regions.count(4)

        

print("Northeast : ", Northeast)
print("Midwest : ", Midwest)
print("South : ", South)
print("West : ", West)

Northeast :  9
Midwest :  12
South :  17
West :  13
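Part C asks for a loop over the unique regions. A hedged alternative to counting each code by hand is `value_counts`, sketched below on a toy `Region` column built to reproduce the expected counts; the mapping from codes 1 to 4 onto region names follows the code above and is otherwise an assumption.

```python
import pandas as pd

# Region codes mapped to names, following the counts above
region_names = {1: "Northeast", 2: "Midwest", 3: "South", 4: "West"}

# Toy Region column reproducing the expected counts from Part C
df1 = pd.DataFrame({"Region": [1] * 9 + [2] * 12 + [3] * 17 + [4] * 13})

# Count rows per region code, then loop over the unique regions in code order
counts = df1["Region"].value_counts().sort_index()
for code, count in counts.items():
    print(f"{region_names[code]} : {count}")
```

The four counts sum to 51, which matches the number of rows in the employment data (50 states plus one additional reporting government).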


**Part D:** Create a histogram of the entire data frame.  Describe its general shape (skewness or symmetry) and whether or not it has any outliers.

(Check out `np.reshape` for a nice way to turn a large matrix/array into something 1-dimensional, for easier plotting!)

In [232]:


# df2 = np.reshape(df,,order='C')
# df2.head()
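A minimal sketch of the flatten-then-plot idea from the hint follows. It runs on synthetic numbers rather than the real file, so the shape of the resulting histogram says nothing about the actual unemployment data; the column names `m0` ... `m119` and the gamma-distributed values are assumptions chosen only to make the example self-contained.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the employment frame: 51 rows x 120 monthly columns
rng = np.random.default_rng(0)
months = [f"m{i}" for i in range(120)]
df = pd.DataFrame(rng.gamma(shape=4.0, scale=1.5, size=(51, 120)), columns=months)
df.insert(0, "Series ID", np.arange(1, 52))

# Drop the ID column, then flatten the remaining values to one dimension
values = np.reshape(df.drop(columns="Series ID").to_numpy(), -1)

fig, ax = plt.subplots()
ax.hist(values, bins=30, edgecolor="black")
ax.set_xlabel("Unemployment rate (%)")
ax.set_ylabel("Count")
ax.set_title("All states, all months")
```

Dropping `Series ID` first matters: otherwise the ID codes would be mixed into the histogram as if they were unemployment rates.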


**Part E:** Create a single figure with a series of box plots (4 side-by-side boxes) of the employment data grouped by each region.
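One possible approach to Part E is to pool every state and month within a region into a single array and pass the four arrays to `boxplot`. The sketch below uses synthetic data with the region sizes from Part C; the column names, random values, and the `region_names` mapping are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the merged frame: 120 monthly columns plus a Region code
rng = np.random.default_rng(1)
months = [f"m{i}" for i in range(120)]
df1 = pd.DataFrame(rng.gamma(4.0, 1.5, size=(51, 120)), columns=months)
df1["Region"] = [1] * 9 + [2] * 12 + [3] * 17 + [4] * 13

region_names = {1: "Northeast", 2: "Midwest", 3: "South", 4: "West"}

# One flat array of rates per region, pooling every state and month
grouped = {
    region_names[code]: block[months].to_numpy().ravel()
    for code, block in df1.groupby("Region")
}

fig, ax = plt.subplots()
ax.boxplot(list(grouped.values()))
ax.set_xticks(range(1, len(grouped) + 1))
ax.set_xticklabels(list(grouped.keys()))
ax.set_ylabel("Unemployment rate (%)")
```

Setting the tick labels separately, rather than through a `boxplot` keyword, is a choice made here to avoid depending on a particular matplotlib version.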

**Part F:** Create a new data frame with 12 columns that groups all of the data according to month of the year.  You can combine all the locations into a single column for each month.  

(*Hint*: Every 12th data column should be from the same month.)

Then create a single figure with a series of box plots (12 side-by-side boxes) of the employment data grouped by each month.
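The hint about every 12th column can be sketched with a strided slice over the column labels. The example below builds synthetic data with labels in the same `YY-Mon` style as the real file (`11-Jan` through `20-Dec`); the random values and the name `by_month` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

month_abbr = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# Column labels in the same "YY-Mon" style as the data: 11-Jan ... 20-Dec
cols = [f"{yy}-{m}" for yy in range(11, 21) for m in month_abbr]

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.gamma(4.0, 1.5, size=(51, 120)), columns=cols)

# Every 12th column is the same calendar month; stack those into one column each
by_month = pd.DataFrame({
    m: df[cols[i::12]].to_numpy().ravel()
    for i, m in enumerate(month_abbr)
})

fig, ax = plt.subplots(figsize=(10, 4))
ax.boxplot([by_month[m].to_numpy() for m in month_abbr])
ax.set_xticks(range(1, 13))
ax.set_xticklabels(month_abbr)
ax.set_ylabel("Unemployment rate (%)")
```

Each of the 12 columns in `by_month` holds 51 locations times 10 years, so 510 values per month.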


**Part G:** Discuss the following:

1) Does there appear to be larger differences between different *regions* or between different *months*?  Explain fully.  Speculate as to *why* one factor might matter more than the other.

2) Are there any downsides to these kinds of groupings?  Can you think of anything that might make these types of comparisons more useful?
