# Introduction to Data Science - Homework 2
*COMP 5360 / MATH 4100, University of Utah, http://datasciencecourse.net/*

Due: Friday, January 25, 11:59pm.

This homework is designed to reinforce the skills we covered in the first two weeks: working with loops, conditions, functions, and the built-in Python data structures. We'll also calculate various descriptive statistics. Make sure to go through the lecture again in case you have any trouble.

## Your Data
Fill out the following information: 

*First Name:* Xinbo  
*Last Name:* Wang

*E-mail:* xinbo.wang@utah.edu

*UID:* u0930578

## Part 1: Vector data

We will first work with a vector of yearly average temperatures from New Haven, published [here](https://vincentarelbundock.github.io/Rdatasets/datasets.html). The data is included in this repository in the file `nhtemp.csv`.

The data is stored in CSV format, a simple text file of comma-separated values.
To load the data into a (nested) Python array, we use the [csv](https://docs.python.org/3/library/csv.html) library. The following code reads the file and stores the values in a vector:

In [1]:
# import the csv library
import csv
# import the math library we'll use later
import math

# initialize the array
temperature_vector = []

# open the file and append the values of the last column to the array
with open('nhtemp.csv') as csvfile:
    filereader = csv.reader(csvfile, delimiter=',', quotechar='|')
    # remove the first item as it is the title.
    next(filereader)
    for row in filereader:
        # here we append to the array and also cast from string to float
        temperature_vector.append(float(row[2]))
        
# print the vector to see if it worked
print (temperature_vector)

[49.9, 52.3, 49.4, 51.1, 49.4, 47.9, 49.8, 50.9, 49.3, 51.9, 50.8, 49.6, 49.3, 50.6, 48.4, 50.7, 50.9, 50.6, 51.5, 52.8, 51.8, 51.1, 49.8, 50.2, 50.4, 51.6, 51.8, 50.9, 48.8, 51.7, 51.0, 50.6, 51.7, 51.5, 52.1, 51.3, 51.0, 54.0, 51.4, 52.7, 53.1, 54.6, 52.0, 52.0, 50.9, 52.6, 50.2, 52.6, 51.6, 51.9, 50.5, 50.9, 51.7, 51.4, 51.7, 50.8, 51.9, 51.8, 51.9, 53.0]


We'll next use descriptive statistics to analyze the data in `temperature_vector`.

In this problem, we'll do calculations that are also available in NumPy. For the purpose of this homework, however, **we want you to implement the solutions using standard Python functionality and the math library, and then check your results using NumPy**. 

See the [NumPy library](http://docs.scipy.org/doc/numpy-1.11.0/reference/routines.statistics.html) documentation and include the checks as a separate code cell. 
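One way to set up such a check is to compare the hand-written result with a library result using a tolerance, because floating-point results can differ in the last digit. The sketch below is only illustrative: `my_mean` is a hypothetical helper, the sample holds the first four values of the temperature vector, and the standard-library `statistics` module stands in for NumPy so that the snippet runs without extra packages.

```python
import math
import statistics


def my_mean(values):
    """Arithmetic mean computed with a plain loop (hypothetical helper)."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values)


sample = [49.9, 52.3, 49.4, 51.1]  # first four values of the temperature vector
ours = my_mean(sample)
reference = statistics.mean(sample)  # stand-in for np.mean in this sketch

# Compare with a tolerance instead of ==, since the last digit may differ
results_match = math.isclose(ours, reference, rel_tol=1e-9)
print(ours, reference, results_match)
```

The same pattern should carry over to the median and standard deviation checks by swapping in the corresponding functions.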

### Task 1.1: Calculate the Mean of a Vector

Write a function that calculates and returns the [arithmetic mean](https://en.wikipedia.org/wiki/Arithmetic_mean) of a vector that you pass into it. 

Pass the temperature vector into this function and print the result. Provide a written interpretation of your results (e.g., "The mean temperature for New Haven for the years 1912 to 1971 is XXX degrees Fahrenheit.")

In [6]:
## your code goes here
def mean(tem_vector):
    # accumulate in `total` so the built-in sum() is not shadowed
    total = 0
    for value in tem_vector:
        total += value
    return total / len(tem_vector)
print("The mean temperature for New Haven for the years 1912 to 1971 is " + str(mean(temperature_vector))+" degrees Fahrenheit.")

The mean temperature for New Haven for the years 1912 to 1971 is 51.16 degrees Fahrenheit.


In [8]:
# Check results using NumPy
import numpy as np
np.mean(temperature_vector)

51.160000000000004

**Your Interpretation:** TODO

### Task 1.2: Calculate the Median of a Vector
Write a function that calculates and returns the [median](https://en.wikipedia.org/wiki/Median) of a vector. Pass the temperature vector into this function and print the result. Make sure that your function works for vectors with an even number of elements as well as an odd number. In the case of an even number of elements, use the mean of the two middle values. Provide a written interpretation of your results.

Hint: the [`sorted()`](https://docs.python.org/3/library/functions.html#sorted) function might be helpful for this.

In [15]:
## your code goes here
def med(temp_vector):
    new_list = sorted(temp_vector)
    n = len(temp_vector)
    if n < 1:
        return None
    if n % 2 == 1:
        return new_list[n//2]
    else:
        return sum(new_list[n//2-1:n//2+1])/2.0
print("The median of the temperature is " + str(med(temperature_vector)))

The median of the temperature is 51.2


In [13]:
# Check results using NumPy
np.median(temperature_vector)

51.2

**Your Interpretation:** TODO

### Task 1.3: Calculate the Standard Deviation of a Vector

Write a function that calculates and returns the [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation) of a vector. Pass the temperature vector into this function and print the result. Provide a written interpretation of your results.

The standard deviation is the square root of the average of the squared deviations from the mean, i.e.,

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} {{(x_i - \mu)}^2} }$$

where $\mu$ is the mean of the vector. Hint: use your mean function to calculate it.
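As a small worked instance of the formula, the sketch below applies it to a toy vector whose result is easy to verify by hand. The function name `population_std` and the toy values are assumptions made for illustration; they are not part of the assignment.

```python
import math


def population_std(values):
    """Square root of the mean squared deviation from the mean."""
    mu = sum(values) / len(values)
    squared_devs = [(x - mu) ** 2 for x in values]
    return math.sqrt(sum(squared_devs) / len(values))


toy = [2, 4, 4, 4, 5, 5, 7, 9]
# mean = 40 / 8 = 5; squared deviations sum to 32; 32 / 8 = 4; sqrt(4) = 2
toy_std = population_std(toy)
print(toy_std)  # 2.0
```

Note that the division is by $N$, not $N - 1$, which matches the formula above.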

Hint: the `sqrt()` function from the [`math` library](https://docs.python.org/3/library/math.html) might be helpful for this. If you work in a separate file, you need to import the library as we did in Part 1 when reading in the data, using `import math`.

In [20]:
## your code goes here
def stanDevi(temp_vector):
    n = len(temp_vector)
    mea = mean(temp_vector)
    # accumulate in `total` so the built-in sum() is not shadowed
    total = 0
    for i in temp_vector:
        total += (i - mea)**2
    dev = math.sqrt(total / n)
    return dev
print("The temperature standard deviation is " + str(stanDevi(temperature_vector)))

The temperature standard deviation is 1.2550166001558176


In [21]:
# Check results using NumPy
np.std(temperature_vector)

1.2550166001558178

**Your Interpretation:** TODO

### Task 1.4: Histogram

Write a function that takes a vector and an integer `b` and calculates a [histogram](https://en.wikipedia.org/wiki/Histogram) with `b` bins. The function should return an array containing two arrays. The first should be the counts for each bin, the second should contain the borders of the bins.

For `b=5` your output should look like this: 

`[[3, 12, 33, 10, 2], [47.9, 49.24, 50.58, 51.92, 53.26, 54.6]]`

Here, the first array gives the size of these bins, the second defines the bands. That is, the first band from 47.9-49.24 has 3 entries, the second, from 49.24-50.58 has 12 entries, etc. 
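To make the relationship between the two arrays concrete, the following sketch pairs each count with its lower and upper border, using the expected output quoted above. The variable names are chosen for illustration only.

```python
hist = [[3, 12, 33, 10, 2], [47.9, 49.24, 50.58, 51.92, 53.26, 54.6]]
counts, borders = hist

# b bins always need b + 1 borders
assert len(borders) == len(counts) + 1

# Pair every bin with its lower border, upper border, and count
bands = list(zip(borders[:-1], borders[1:], counts))
for low, high, count in bands:
    print(f"{low:.2f}-{high:.2f}: {count} entries")

# Every value lands in exactly one bin, so the counts add up to the vector length
total_entries = sum(counts)
print(total_entries)  # 60
```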

Provide a written interpretation of your results. Comment on whether the histogram is skewed, and if so, in which direction.

In [203]:
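## your code goes here
def histo(tem_vector, b):
    low, high = min(tem_vector), max(tem_vector)
    width = (high - low) / b
    # b + 1 equally spaced borders, rounded for readable output
    borders = [round(low + i * width, 2) for i in range(b + 1)]
    counts = [0] * b
    for value in tem_vector:
        # the maximum belongs to the last bin, which is closed on the right
        index = min(int((value - low) / width), b - 1)
        counts[index] += 1
    return [counts, borders]

print("The histogram of the temperature is "+ str(histo(temperature_vector,5)))

The histogram of the temperature is [[3, 12, 33, 10, 2], [47.9, 49.24, 50.58, 51.92, 53.26, 54.6]]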


In [170]:
# Check results using NumPy
np.histogram(temperature_vector, bins = 5)

(array([ 3, 12, 33, 10,  2]),
 array([47.9 , 49.24, 50.58, 51.92, 53.26, 54.6 ]))

**Your interpretation:** TODO

## Part 2: Working with Matrices

For the second part of the homework, we are going to work with matrices. The [dataset we will use](https://www.wunderground.com/history/airport/KSLC/2015/1/1/CustomHistory.html?dayend=31&monthend=12&yearend=2015&req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo=) contains different properties of the weather in Salt Lake City for 2015 (temperature, humidity, sea level, ...). It is stored in the file [`SLC_2015.csv`](SLC_2015.csv) in this repository.

We first read the data from the file and store it in a nested Python array (`temperature_matrix`). A nested Python array is an array in which each element is itself an array. Here is a simple example: 

In [85]:
arr1 = [1,2,3]
arr2 = ['a', 'b', 'c']

nestedArr = [arr1, arr2]
nestedArr

[[1, 2, 3], ['a', 'b', 'c']]

We provide you with the import code, which writes the data into the nested list `temperature_matrix`. The list contains one list for each month, which, in turn, contains the mean temperature of every day of that month. 

In [89]:
# initialize the 12 arrays for the months
temperature_matrix = [[] for i in range(12)]

# open the file and append the values of the last column to the array
with open('SLC_2015.csv') as csvfile:
    filereader = csv.reader(csvfile, delimiter=',', quotechar='|')
    # get rid of the header
    next(filereader)
    for row in filereader:
        month = int(row[0].split('/')[0])
        mean_temp = int(row[2])
        temperature_matrix[month-1].append(mean_temp)

print(temperature_matrix)

# the mean temperature on August 23. Note the index offset:
print("Mean temp on August 23: " + str(temperature_matrix[7][22]))

[[15, 19, 26, 28, 37, 38, 38, 36, 35, 31, 39, 36, 35, 30, 31, 31, 37, 44, 40, 35, 31, 31, 31, 33, 42, 41, 44, 42, 36, 40, 39], [39, 49, 50, 50, 53, 57, 60, 53, 55, 45, 43, 47, 46, 48, 43, 40, 38, 44, 47, 44, 39, 33, 31, 35, 44, 35, 37, 36], [40, 37, 34, 33, 39, 43, 45, 45, 46, 50, 54, 50, 51, 56, 62, 63, 61, 53, 47, 53, 57, 54, 52, 47, 42, 48, 56, 62, 53, 57, 63], [46, 44, 44, 54, 60, 50, 52, 46, 49, 53, 58, 50, 57, 56, 33, 44, 50, 54, 56, 56, 60, 61, 61, 59, 51, 46, 50, 57, 65, 63], [63, 71, 68, 67, 62, 59, 58, 57, 49, 53, 59, 68, 65, 65, 53, 48, 56, 58, 55, 59, 58, 58, 55, 57, 62, 59, 61, 61, 64, 71, 76], [80, 68, 69, 68, 69, 70, 66, 73, 77, 78, 72, 74, 75, 76, 81, 77, 78, 83, 83, 78, 81, 78, 78, 83, 82, 84, 87, 88, 91, 89], [87, 87, 87, 89, 79, 79, 76, 75, 73, 72, 77, 79, 81, 77, 80, 80, 79, 74, 74, 73, 76, 77, 75, 78, 78, 84, 77, 66, 70, 76, 79], [80, 79, 69, 76, 82, 74, 76, 69, 72, 79, 83, 81, 83, 88, 83, 79, 77, 72, 74, 76, 81, 74, 76, 84, 85, 78, 77, 80, 85, 82, 75], [82, 83, 82

We will next compute the same descriptive statistics as in Part 1 using the nested array `temperature_matrix`. 

In this problem, **we again want you to implement the solutions using standard Python functionality and the math library**. We recommend you check your results using NumPy.

**Note:** Since the lists in the matrix are of varying lengths (28 to 31 days), many of the standard NumPy functions won't work directly.
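One common workaround for rows of unequal length is to flatten the nested list into a single vector before applying a vector function. The sketch below shows this on a made-up ragged list; `ragged`, `row_lengths`, and `flat` are illustrative names, not part of the provided code.

```python
from itertools import chain

# Toy rows of unequal length, standing in for months with 28 to 31 days
ragged = [[15, 19, 26], [39, 49], [40, 37, 34, 33]]

row_lengths = [len(row) for row in ragged]

# chain.from_iterable concatenates the rows in order without copying them first
flat = list(chain.from_iterable(ragged))

print(row_lengths)  # [3, 2, 4]
print(flat)         # [15, 19, 26, 39, 49, 40, 37, 34, 33]
```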

### Task 2.1: Calculate the mean of a whole matrix

Write a function that calculates the mean of a matrix. For this version, calculate the mean over all elements in the matrix as if it were one large vector. 
Pass in the matrix with the weather data and return the result. Provide a written interpretation of your results.
Can you use your function from Part 1 and get a valid result?
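When rows have different lengths, the mean over all elements generally differs from the mean of the row means, because short rows get too much weight in the latter. The toy example below (hypothetical values) illustrates the difference and shows that weighting each row mean by its row length recovers the overall mean.

```python
toy_matrix = [[1, 2, 3], [10]]

# Mean over all elements, treating the matrix as one large vector
all_values = [x for row in toy_matrix for x in row]
overall_mean = sum(all_values) / len(all_values)      # 16 / 4

# Mean of the row means: every row counts equally, regardless of its length
row_means = [sum(row) / len(row) for row in toy_matrix]
mean_of_row_means = sum(row_means) / len(row_means)   # (2 + 10) / 2

# Weighting each row mean by its length gives the overall mean again
weighted_mean = sum(len(row) * m for row, m in zip(toy_matrix, row_means)) / len(all_values)

print(overall_mean, mean_of_row_means, weighted_mean)  # 4.0 6.0 4.0
```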

In [145]:
## your code goes here
def mea(matrix):
    # treat the matrix as one large vector: sum every value and count every value
    total = 0
    count = 0
    for row in matrix:
        for value in row:
            total += value
            count += 1
    return total / count
print(round(mea(temperature_matrix), 2))

56.77


In [139]:
import numpy as np
# flatten the ragged matrix first, since the rows have different lengths
all_values = [value for row in temperature_matrix for value in row]
print(round(np.mean(all_values), 2))

56.77


**Your Interpretation:** TODO

### Task 2.2: Calculate the mean of each vector of a matrix

Write a function that calculates the mean temperature of each month and returns an array with the means for each column. Provide a written interpretation of your results. Can you use the function you implemented in Part 1 here efficiently? If so, use it.

In [146]:
## your code goes here
def eachmean(matrix):
    # one mean per month; the built-in sum() adds up the days of each row
    return [sum(row) / len(row) for row in matrix]
eachmean(temperature_matrix)

[34.54838709677419,
 44.32142857142857,
 50.096774193548384,
 52.833333333333336,
 60.483870967741936,
 77.86666666666666,
 77.87096774193549,
 78.35483870967742,
 71.43333333333334,
 61.16129032258065,
 39.96666666666667,
 31.548387096774192]

**Your Interpretation:** TODO

### Task 2.3: Calculate the median of a whole matrix

Write a function that calculates and returns the median of a matrix over all values, regardless of which row they come from. Provide a written interpretation of your results. Can you use your function from Part 1 and get a valid result?

In [155]:
## your code goes here
def median(tem_vector):
    # flatten the matrix, then reuse med() from Part 1 on all values at once
    all_values = [value for row in tem_vector for value in row]
    return med(all_values)
median(temperature_matrix)

**Your Interpretation:** TODO

### Task 2.4: Calculate the median of each vector of a matrix

Write a function that calculates the median of each sub-array (i.e., each column in the CSV file) in the matrix and returns an array of medians (one entry per column in the CSV file). To do so, use the function you implemented in Part 1. Provide a written interpretation of your results. 

In [157]:
## your code goes here
def median2(tem_vector):
    # reuse med() from Part 1 on every month
    return [med(month) for month in tem_vector]
median2(temperature_matrix)

[36, 44.0, 51, 53.5, 59, 78.0, 77, 79, 73.0, 62, 40.0, 32]

**Your Interpretation:** TODO

### Task 2.5: Calculate the standard deviation of a whole matrix

Write a function that calculates and returns the standard deviation of a matrix over all values, regardless of which column they come from. Can you use your function from Part 1 and get a valid result? Provide a written interpretation of your results. 

In [168]:
## your code goes here
def stanDeviate(temp_vector):
    # flatten the matrix so the deviation is taken over all values at once
    all_values = [value for row in temp_vector for value in row]
    n = len(all_values)
    mu = sum(all_values) / n
    total = 0
    for value in all_values:
        total += (value - mu)**2
    return math.sqrt(total / n)
stanDeviate(temperature_matrix)

**Your Interpretation:** TODO

### Task 2.6: Calculate the standard deviation of each vector of a matrix

Write a function that calculates the standard deviation of each array in the matrix and returns an array of standard deviations (one standard deviation for each column). To do so, use the function you implemented in Part 1. 
Pass in the matrix with the temperature data and return the result. Provide a written interpretation of your results - is the standard deviation consistent across the seasons? 
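A minimal sketch of the per-row pattern, assuming a vector-level standard deviation function is available: the helper `std` below is a stand-in for the Part 1 function, and `std_per_row` and the toy matrix are hypothetical names and values used only for illustration.

```python
import math


def std(values):
    """Population standard deviation of one vector (stand-in for the Part 1 function)."""
    mu = sum(values) / len(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))


def std_per_row(matrix):
    """Apply the vector function to every row; rows may differ in length."""
    return [std(row) for row in matrix]


toy_matrix = [[2, 4, 4, 4, 5, 5, 7, 9], [1, 1, 1], [0, 2]]
row_stds = std_per_row(toy_matrix)
print(row_stds)  # [2.0, 0.0, 1.0]
```

Because each row is handled on its own, the varying month lengths do not cause any problems here.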

In [None]:
## your code goes here


**Your Interpretation:** TODO

## Part 3: Poisson distribution 

In class, we looked at [Bernoulli](https://en.wikipedia.org/wiki/Bernoulli_distribution) and [binomial](https://en.wikipedia.org/wiki/Binomial_distribution) discrete random variables. Another example of a discrete random variable is a *Poisson random variable*. 

Read the [Wikipedia article on the Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution).

### Part 3.1. Descriptive statistics

Describe what a Poisson random variable is. What is the parameter, $\lambda$? What is the min, max, mean, and variance of a Poisson random variable? 

**Your description:** TODO

### Part 3.2. Example 

Give an example of an application that is described by a Poisson random variable.

**Your description:** TODO

### Part 3.3. Probability mass function

For the parameter $\lambda = 2$, plot the probability mass function. 
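The probability mass function of a Poisson random variable is $P(X = k) = \lambda^k e^{-\lambda} / k!$ for $k = 0, 1, 2, \ldots$ The sketch below evaluates it for $\lambda = 2$ and prints a rough text bar chart rather than a real plot, so that it runs without a plotting library. The function name `poisson_pmf` and the cutoff at $k = 10$ are choices made for this illustration.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with parameter lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


lam = 2
# The support is unbounded; k = 0..10 already covers almost all of the mass
pmf_values = [poisson_pmf(k, lam) for k in range(11)]

for k, p in enumerate(pmf_values):
    # One '#' per percentage point, as a crude stand-in for a bar plot
    print(f"{k:2d}  {p:.4f}  {'#' * round(p * 100)}")
```

For an actual figure, the same `pmf_values` could be passed to a bar plot in a plotting library of your choice.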

In [None]:
# your code

### Part 3.4. Poisson sampling

Write Python code that takes 1000 samples from the Poisson distribution with parameter $\lambda = 2$. Make a histogram of the samples and compute the sample mean and variance. How does the histogram compare to the probability mass function?
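As one possible starting point, the sketch below draws the samples with Knuth's multiplication method, using only the standard library. This is a swapped-in technique chosen to keep the example self-contained; a library sampler such as NumPy's would be the more usual choice. The function name `sample_poisson` and the fixed seed are assumptions made for reproducibility.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k = 0
    product = rng.random()
    # Multiply uniform draws until the product drops to the threshold or below
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


rng = random.Random(0)  # fixed seed so repeated runs give the same samples
samples = [sample_poisson(2, rng) for _ in range(1000)]

n = len(samples)
sample_mean = sum(samples) / n
sample_var = sum((x - sample_mean) ** 2 for x in samples) / n

# Histogram as a dictionary: value -> number of occurrences
counts = {}
for x in samples:
    counts[x] = counts.get(x, 0) + 1

for k in sorted(counts):
    print(k, counts[k])
print(sample_mean, sample_var)
```

Both the sample mean and the sample variance should come out near $\lambda = 2$, and the relative frequencies `counts[k] / n` can be compared with the probability mass function value for each `k`.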

In [None]:
# your code

**Your description:** TODO