# Submission

In [111]:
import pandas as pd
import numpy as np

def covariance(list_1, list_2):
    # Both samples must pair up element by element
    if len(list_1) != len(list_2):
        raise ValueError('Lists are not of the same length')

    n = len(list_1)
    sample_mean_1 = np.mean(list_1)
    sample_mean_2 = np.mean(list_2)

    # Sum the products of the paired deviations from each sample mean
    sum_ = 0
    for index in range(n):
        sum_ += (list_1[index] - sample_mean_1) * (list_2[index] - sample_mean_2)
    # Divide by n - 1 to get the sample covariance
    return sum_ / (n - 1)

'''
output should look like
     0   1   2
0 [[xx, xy, xz],
1 [ yx, yy, yz],
2 [ zx, zy, zz]]
'''
def covariance_matrix(matrix):
    list_matrix = matrix.T
    len_matrix = len(list_matrix)
    covar_matrix = []

    for row in range(len_matrix):
        covar_matrix.append([])
        for col in range(len_matrix):
            covar_matrix[row].append(covariance(list_matrix[row], list_matrix[col]))
    
    return np.array(covar_matrix)

# test function
def cov_matrix_calculation(data):
    # calculate covariance matrix of the data
    cov_matx = np.cov(data.T)
    return cov_matx

dataset = np.array([
   # x, y, z
    [1, 1, 1],
    [1, 2, 1],
    [1, 3, 2],
    [1, 4, 3]]
)

dataset_1 = np.array([
    [3, 5, 2, 7, 4],
    [1, 4, 7, 3, 6],
    [8, 5, 4, 0, 2]
])

print("tasfia's co-var function: \n", covariance_matrix(dataset_1))
print()
print("numpy's co-var function: \n", cov_matrix_calculation(dataset_1))

tasfia's co-var function: 
 [[13.          1.5        -3.5        -8.         -7.        ]
 [ 1.5         0.33333333 -1.33333333  0.16666667 -1.        ]
 [-3.5        -1.33333333  6.33333333 -4.16666667  3.        ]
 [-8.          0.16666667 -4.16666667 12.33333333  3.        ]
 [-7.         -1.          3.          3.          4.        ]]

numpy's co-var function: 
 [[13.          1.5        -3.5        -8.         -7.        ]
 [ 1.5         0.33333333 -1.33333333  0.16666667 -1.        ]
 [-3.5        -1.33333333  6.33333333 -4.16666667  3.        ]
 [-8.          0.16666667 -4.16666667 12.33333333  3.        ]
 [-7.         -1.          3.          3.          4.        ]]


In [113]:
# real world data
df = pd.read_csv('data/kaggle_pima_indians_diabetes.csv')
# print(covariance_matrix(np.array(df)))
# print(cov_matrix_calculation(df))

[[ 1.13540563e+01  1.39471307e+01  9.21453818e+00 -4.39004101e+00
  -2.85552307e+01  4.69774181e-01 -3.74259714e-02  2.15706198e+01
   3.56618047e-01]
 [ 1.39471307e+01  1.02224831e+03  9.44309556e+01  2.92391827e+01
   1.22093580e+03  5.57269867e+01  1.45487481e+00  9.90828054e+01
   7.11507904e+00]
 [ 9.21453818e+00  9.44309556e+01  3.74647271e+02  6.40293962e+01
   1.98378412e+02  4.30046951e+01  2.64637574e-01  5.45234528e+01
   6.00696708e-01]
 [-4.39004101e+00  2.92391827e+01  6.40293962e+01  2.54473245e+02
   8.02979941e+02  4.93738694e+01  9.72135546e-01 -2.13810232e+01
   5.68747284e-01]
 [-2.85552307e+01  1.22093580e+03  1.98378412e+02  8.02979941e+02
   1.32811801e+04  1.79775172e+02  7.06668051e+00 -5.71432903e+01
   7.17567090e+00]
 [ 4.69774181e-01  5.57269867e+01  4.30046951e+01  4.93738694e+01
   1.79775172e+02  6.21599840e+01  3.67404687e-01  3.36032992e+00
   1.10063763e+00]
 [-3.74259714e-02  1.45487481e+00  2.64637574e-01  9.72135546e-01
   7.06668051e+00  3.6740468

# HW 1 - Covariance Matrix

**Covariance** measures how two random variables vary together. For example, the height and weight of giraffes have a positive covariance because taller giraffes also tend to be heavier.

_In this assignment, we will calculate the **covariance matrix** for a given dataset._

This assignment will consist of three parts:

## Part 1

Watch this video on covariance matrices: https://www.youtube.com/watch?v=0GzMcUy7ZI0

**IMPORTANT NOTE:** The equation shown in the video has $n$ as the denominator, but the covariance values calculated in the video use $n-1$ as the denominator. _For our purposes, we will be using $n-1$ as the denominator._

Mathematically, we can represent the covariance as:

$Cov(X,Y)=\frac{1}{n-1} \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$

Where $n$ is the number of elements in each of the arrays $x$ and $y$, and $\bar{x}$ and $\bar{y}$ are their means.
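As a minimal sketch of this formula, the snippet below computes the covariance of two short arrays step by step. The function name `sample_covariance` and the example values are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def sample_covariance(x, y):
    """Sample covariance of two equal-length sequences (n - 1 denominator)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("x and y must have the same length")
    n = x.size
    # Deviation of each element from its own sample mean
    dx = x - x.mean()
    dy = y - y.mean()
    # Sum of the paired products, divided by n - 1
    return float(np.sum(dx * dy) / (n - 1))

# Illustrative values
y = [1, 2, 3, 4]
z = [1, 1, 2, 3]
result = sample_covariance(y, z)
print(result)  # 3.5 / 3, about 1.1667
```

Here the paired products of deviations sum to 3.5, and dividing by $n-1=3$ gives roughly 1.1667.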

## Part 2

Write a Python function that returns the covariance matrix for a given dataset. You can use the dataset from the video or any other dataset of your choosing. Below are various places you can get datasets from:

### Dataset Resources

- [Kaggle](https://www.kaggle.com/datasets)
- [Fivethirtyeight](https://github.com/fivethirtyeight/data)
- [Buzzfeed News](https://github.com/BuzzFeedNews/everything)
- [Google Cloud BigQuery Public Datasets](https://cloud.google.com/bigquery/public-data/)
- [Wikipedia](https://en.wikipedia.org/wiki/Wikipedia:Database_download)
- Can't find anything from above? Google around until you do!

## Part 3

Use [numpy's var function](https://docs.scipy.org/doc/numpy/reference/generated/numpy.var.html) to confirm that the covariance between a variable and itself is the same as the variance of the variable. Because we use $n-1$ as the denominator, pass `ddof=1` to `np.var`; its default (`ddof=0`) divides by $n$.

For example, assuming your covariance function is named `cov`, ensure that the following is true: `np.isclose(cov(X, X), np.var(X, ddof=1))`
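One detail worth checking here: `np.var` divides by $n$ by default (`ddof=0`), while this assignment's covariance divides by $n-1$, so the two only agree when `ddof=1` is passed. The sketch below illustrates the difference with made-up values; the variable names are assumptions for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])  # illustrative values

# Covariance of x with itself, using the n - 1 denominator
dev = x - x.mean()
cov_xx = np.sum(dev * dev) / (x.size - 1)

population_var = np.var(x)       # default ddof=0, divides by n
sample_var = np.var(x, ddof=1)   # divides by n - 1

print(cov_xx, population_var, sample_var)
print(np.isclose(cov_xx, sample_var))      # True
print(np.isclose(cov_xx, population_var))  # False
```

Comparing with `np.isclose` rather than `==` avoids spurious failures from floating-point rounding.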

### Hints

1. Obtain the covariance between columns $x$ and $y$, between columns $x$ and $z$, and columns $y$ and $z$
1. The covariance between columns $x$ and $y$ is the same as the covariance between columns $y$ and $x$. _We can generalize this for any two columns_
1. Show that the covariance between columns $x$ and $x$ is equal to the variance of column $x$. _We can generalize this for any other column_
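The symmetry and variance hints can be checked numerically. The following is a rough sketch that uses `np.cov` as a reference on the small example dataset; it is meant as a sanity check, not as a replacement for your own function.

```python
import numpy as np

data = np.array([
    # x, y, z
    [1, 1, 1],
    [1, 2, 1],
    [1, 3, 2],
    [1, 4, 3],
])

# rowvar=False tells np.cov that each column is a variable
cov = np.cov(data, rowvar=False)

# Cov(x, y) == Cov(y, x) for every pair, so the matrix equals its transpose
is_symmetric = np.allclose(cov, cov.T)

# The diagonal holds Cov(x, x), Cov(y, y), Cov(z, z): the sample variances
diag_matches_var = np.allclose(np.diag(cov), np.var(data, axis=0, ddof=1))

print(is_symmetric, diag_matches_var)
```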

## Testing Your Code

Verify your code against numpy's built-in `np.cov()` function as follows:

In [35]:
import numpy as np

def cov_matrix_calculation(data):
    # calculate covariance matrix of the data
    cov_matx = np.cov(data.T)
    return cov_matx

dataset = np.array([
    [1, 1, 1], 
    [1, 2, 1], 
    [1, 3, 2], 
    [1, 4, 3]])
print(cov_matrix_calculation(dataset))

[[0.         0.         0.        ]
 [0.         1.66666667 1.16666667]
 [0.         1.16666667 0.91666667]]
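To compare your own result against this reference programmatically, something like the sketch below could work. The function `covariance_matrix_sketch` is a hypothetical stand-in for your implementation (here a vectorized version that centers each column and multiplies), and `np.allclose` is used instead of `==` to tolerate floating-point rounding.

```python
import numpy as np

def covariance_matrix_sketch(data):
    """Covariance matrix of a 2-D array whose columns are variables."""
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    # Subtract each column's mean from that column
    centered = data - data.mean(axis=0)
    # Entry (i, j) is the sum of paired deviations of columns i and j
    return centered.T @ centered / (n - 1)

dataset = np.array([
    [1, 1, 1],
    [1, 2, 1],
    [1, 3, 2],
    [1, 4, 3],
])

ours = covariance_matrix_sketch(dataset)
reference = np.cov(dataset.T)  # rows of the transposed data are variables
matches = np.allclose(ours, reference)
print(matches)
```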


# Requirements

To pass this HW, you must meet the following requirements:

1. Your function should return the covariance between 6 pairs of random variables: $(X, Y)$, $(X, Z)$, $(Y, Z)$, $(X, X)$, $(Y, Y)$, and $(Z, Z)$
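1. Verify that your function's return value is correct by comparing it with `np.cov(DATA)`, where `DATA` is the dataset you passed to your covariance function (transposed, as in `np.cov(data.T)`, when each column is a variable)
1. Verify that your covariance function for each variable with itself returns the same value as `np.var` of that variable (with `ddof=1`, to match the $n-1$ denominator)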

## Stretch Challenges

These are optional challenges for those who want to further expand their skillset:

1. Your function should display the covariance in a matrix format
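
One possible way to approach the matrix display is to print the covariance values with row and column labels. The helper `format_matrix` and the labels below are hypothetical, chosen only for illustration; `np.cov` stands in for your own covariance function.

```python
import numpy as np

def format_matrix(matrix, labels):
    """Return a text table with row and column labels (hypothetical helper)."""
    header = " " * 4 + "".join(f"{name:>10}" for name in labels)
    rows = [
        f"{name:<4}" + "".join(f"{value:>10.4f}" for value in row)
        for name, row in zip(labels, matrix)
    ]
    return "\n".join([header] + rows)

dataset = np.array([
    [1, 1, 1],
    [1, 2, 1],
    [1, 3, 2],
    [1, 4, 3],
])

cov = np.cov(dataset, rowvar=False)  # each column is a variable
table = format_matrix(cov, ["x", "y", "z"])
print(table)
```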

## Turning In Your HW

Once you have finished your assignment, provide a link to your repo on GitHub and place the link in the appropriate `HW1` column in the [progress tracker](https://docs.google.com/spreadsheets/d/1bJ959aAhQbuJBA_vL1uinDgcEM6k7uROHLg_Wh5Ac2Y/edit?usp=sharing)