### Let's start off by exploring some data
- Load the file `dataset1.csv` from Canvas
- For this notebook, you should be familiar with a few statistical ideas:
    - [Mean](https://en.wikipedia.org/wiki/Mean)
    - [Standard deviation](https://en.wikipedia.org/wiki/Standard_deviation)
    - [Correlation](https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html)

You can easily look up how to compute these statistics in Python using a package like numpy.
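As a quick illustration of those lookups, here is a minimal sketch on a small made-up pair of arrays. The values and the names `a` and `b` are invented for this example and are not taken from `dataset1.csv`. It uses `np.mean`, `np.std`, and `np.corrcoef`.

```python
import numpy as np

# Made-up values for illustration only (not from dataset1.csv)
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

mean_a = np.mean(a)                # arithmetic mean
std_a = np.std(a)                  # population standard deviation (ddof=0 by default)
corr_ab = np.corrcoef(a, b)[0, 1]  # Pearson correlation: off-diagonal entry of the 2x2 matrix

print(mean_a, std_a, corr_ab)
```

Since `b` is exactly twice `a` here, the correlation should come out at (or extremely close to) 1.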

In [11]:
import pandas as pd 
import numpy as np
df1 = pd.read_csv("dataset1.csv")
df1 = np.array(df1)
print(df1[0:, 2]) 

[ 8.04  6.95  7.58  8.81  8.33  9.96  7.24  4.26 10.84  4.82  5.68]


### Dataset 1

##### Compute the mean 

Using the software of your choice (numpy?), compute the mean of the `y` values in the first dataset.

In [12]:
print("mean of y: ", np.mean(df1[0:, 2]))

mean of y:  7.500909090909093


- Now compute the mean of the `x` values in the dataset 

In [13]:
print("mean of x: ", np.mean(df1[0:, 1]))

mean of x:  9.0


##### Compute the standard deviation 

- Compute the standard deviation of `x` and `y`

In [14]:
print("std of x: ", np.std(df1[0:, 1]))

print("std of y: ", np.std(df1[0:, 2]))



std of x:  3.1622776601683795
std of y:  1.937024215108669
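Note that `np.std` divides by `n` by default, which gives the population standard deviation. If the sample standard deviation (dividing by `n - 1`) is wanted instead, `ddof=1` can be passed. The sketch below reuses the `y` values printed earlier in this notebook; the variable names are chosen here for illustration.

```python
import numpy as np

# y values of dataset 1, as printed earlier in this notebook
y_vals = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                   7.24, 4.26, 10.84, 4.82, 5.68])

pop_sd = np.std(y_vals)             # divides by n (ddof=0, the default)
sample_sd = np.std(y_vals, ddof=1)  # divides by n - 1

print("population sd:", pop_sd)
print("sample sd:    ", sample_sd)
```

The sample version is always slightly larger; with only 11 observations the difference is noticeable, so it is worth stating which one is being reported.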


##### Compute the correlation

- Compute the correlation of `x` and `y`

In [15]:
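# np.corrcoef returns the 2x2 Pearson correlation matrix; entry [0, 1] is corr(x, y).
# (np.correlate would instead return the sum of the products of x and y.)
print("correlation of x and y: ", round(float(np.corrcoef(df1[0:, 1], df1[0:, 2])[0, 1]), 3))

correlation of x and y:  0.816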
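NumPy has two similarly named functions that are easy to confuse. `np.correlate` computes a cross-correlation (for two equal-length arrays in its default mode, a single value equal to the sum of elementwise products), while `np.corrcoef` returns the matrix of Pearson correlation coefficients, which always lie between -1 and 1. A small sketch on made-up values (the arrays `u` and `v` are invented for illustration):

```python
import numpy as np

# Made-up values for illustration only
u = np.array([1.0, 2.0, 3.0, 4.0])
v = np.array([2.0, 1.0, 4.0, 3.0])

cross = np.correlate(u, v)         # default mode="valid": one value, sum(u * v)
pearson = np.corrcoef(u, v)[0, 1]  # Pearson correlation coefficient of u and v

print(cross, pearson)
```

The first number depends on the scale of the data, while the second does not, so only the Pearson coefficient is comparable across datasets.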


### Generalize your code by writing a method

In [16]:
def summary_stats(file_name):
    '''
    Read in a file called `file_name` and return a dictionary of summary stats
    
    "mean_x": The mean of the x observation
    "mean_y": The mean of the y observation
    "sd_x": The standard deviation of the x observation
    "sd_y": The standard deviation of the y observation
    "corr": The correlation between x and y
    '''
    output = {"mean_x": None,
              "mean_y": None,
              "sd_x": None,
              "sd_y": None,
              "corr": None}
    df = pd.read_csv(file_name)

    # your code here
    df = np.array(df)
    x = df[:, 1]  # second column holds the x observations
    y = df[:, 2]  # third column holds the y observations
    mean_x = np.mean(x)
    mean_y = np.mean(y)
    sd_x = np.std(x)  # population standard deviation (ddof=0)
    sd_y = np.std(y)
    # np.corrcoef returns the 2x2 Pearson correlation matrix; [0, 1] is corr(x, y).
    # (np.correlate would instead return the sum of the products of x and y.)
    # Rounded to 3 decimals for readability.
    corr = round(float(np.corrcoef(x, y)[0, 1]), 3)
    
    output = {"mean_x": mean_x,
              "mean_y": mean_y,
              "sd_x": sd_x,
              "sd_y": sd_y,
              "corr": corr}
    
    print(output)
    
    return output

In [17]:
summary_stats("dataset1.csv")

{'mean_x': 9.0, 'mean_y': 7.500909090909093, 'sd_x': 3.1622776601683795, 'sd_y': 1.937024215108669, 'corr': 0.816}


{'mean_x': 9.0,
 'mean_y': 7.500909090909093,
 'sd_x': 3.1622776601683795,
 'sd_y': 1.937024215108669,
 'corr': 0.816}

### Use your method to compute summary statistics for the four datasets

In [18]:
for i in range(1, 5):
    file_name = "dataset{}.csv".format(i)
    summary_stats(file_name)

{'mean_x': 9.0, 'mean_y': 7.500909090909093, 'sd_x': 3.1622776601683795, 'sd_y': 1.937024215108669, 'corr': 0.816}
{'mean_x': 9.0, 'mean_y': 7.50090909090909, 'sd_x': 3.1622776601683795, 'sd_y': 1.93710869148962, 'corr': 0.816}
{'mean_x': 9.0, 'mean_y': 7.5, 'sd_x': 3.1622776601683795, 'sd_y': 1.9359329439927313, 'corr': 0.816}
{'mean_x': 9.0, 'mean_y': 7.500909090909091, 'sd_x': 3.1622776601683795, 'sd_y': 1.9360806451340837, 'corr': -0.314}
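The printed dictionaries above can be hard to compare by eye. One option is to collect the statistics into a single table with one row per dataset. The sketch below is only an illustration: `summary_table` is a hypothetical helper, it assumes each frame has columns named `x` and `y`, and the two demo frames are made up rather than read from the course files.

```python
import numpy as np
import pandas as pd

def summary_table(frames):
    """Return one row of summary statistics per named DataFrame (hypothetical helper)."""
    rows = {}
    for name, frame in frames.items():
        # Assumes columns called "x" and "y" exist in every frame
        x = frame["x"].to_numpy(dtype=float)
        y = frame["y"].to_numpy(dtype=float)
        rows[name] = {
            "mean_x": np.mean(x),
            "mean_y": np.mean(y),
            "sd_x": np.std(x),
            "sd_y": np.std(y),
            "corr": np.corrcoef(x, y)[0, 1],
        }
    # orient="index" makes each dataset a row and each statistic a column
    return pd.DataFrame.from_dict(rows, orient="index")

# Made-up frames for illustration; with the real files one could pass
# {f"dataset{i}": pd.read_csv(f"dataset{i}.csv") for i in range(1, 5)} instead.
demo = {
    "a": pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8]}),
    "b": pd.DataFrame({"x": [1, 2, 3, 4], "y": [8, 6, 4, 2]}),
}
table = summary_table(demo)
print(table.round(3))
```

Reading down a column of such a table makes it easier to see which statistics agree across datasets and which do not.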


### Comment on your observations

Based on your code so far, do you think the datasets are identical? Can you spot any differences? Explain your reasoning.

Answer:

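Based on the code so far, the mean of `x` is 9.0 and the standard deviation of `x` is about 3.16 in all four datasets, and the mean of `y` is about 7.50 in each. The standard deviation of `y` is not exactly the same, though: it differs slightly between datasets (between about 1.936 and 1.937). The correlation output is also not the same for every dataset. Identical datasets would produce identical statistics, so the datasets are NOT identical, even though their summary statistics are very close.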

### Now let's make some scatter plots!

In [28]:
import altair as alt 

charts = [] 

for i in range(1, 5):

    c = alt.Chart(pd.read_csv("dataset{}.csv".format(i))).mark_circle().encode(
        x='x',
        y='y'
    ).properties(
        height=100,
        width=100
    )
    
    charts.append(c)


In [29]:
charts[0]



In [30]:
charts[1]

In [31]:
charts[2]

In [32]:
charts[3]

### What do we observe?

Your answer here:

After looking at the graphs, we can see that the datasets are completely different, and we could easily have been fooled if we had only looked at the means and standard deviations calculated earlier.
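A small made-up example shows why means and standard deviations alone cannot distinguish datasets: reordering the `y` values leaves their mean and standard deviation unchanged, yet changes how `y` relates to `x`. The arrays below are invented for illustration and are not the course data.

```python
import numpy as np

# Made-up values for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # rises with x
y_b = np.array([5.0, 4.0, 3.0, 2.0, 1.0])  # same values reversed: falls with x

# Mean and standard deviation ignore the order of the values
same_mean = np.isclose(np.mean(y_a), np.mean(y_b))
same_sd = np.isclose(np.std(y_a), np.std(y_b))

# The relationship with x is completely different
corr_a = np.corrcoef(x, y_a)[0, 1]
corr_b = np.corrcoef(x, y_b)[0, 1]

print(same_mean, same_sd, corr_a, corr_b)
```

A scatter plot of each pair would show the difference immediately, which is the same lesson the four charts above illustrate.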