## Import data and take a look at it

In [1]:
# Import gen_data function
from data_gen import gen_data

# Get the data by calling the gen_data function
data1, data2 = gen_data()

# Print 10 entries from data1 and data2
print(data1[:10])
print(data2[:10])

[49, 88, 73, 75, 10, 63, 34, 14, 14, 66]
[19, 2, 4, 43, 0, 18, 18, 14, 39, 2]


## Standardize the data:
1. Calculate its mean $\mu = \frac{\Sigma(x_i)}{n}$

    $Mean= \frac{Sum\ of\ all\ the\ values}{Total\ number\ of\ values}$


2. Calculate its standard deviation $\sigma = (\frac{\Sigma(x_i^2)}{n} - \mu^2)^{1/2}$.

    $Standard\ deviation = (\frac{Sum\ of\ squared\ values}{Total\ number\ of\ values} - mean^2)^\frac{1}{2}$

3. For each element perform the following:

    $z_i = \frac{x_i - \mu}{\sigma}$
    
    Step 1: Subtract the mean from the value.
    Step 2: Divide the result of step 1 by the standard deviation.
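
As a quick sanity check of the three steps above, the sketch below applies them to a small made-up list (`values` is a hypothetical example, not part of the exercise data) and compares the result with `statistics.pstdev` from the Python standard library. The comparison assumes the formula in step 2 is the population standard deviation (dividing by $n$), which is how it is written.

```python
import math
import statistics

# Hypothetical toy data, chosen so the numbers are easy to check by hand
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Step 1: mean
mu = sum(values) / len(values)

# Step 2: standard deviation as (mean of squares - square of mean) ** 0.5
sigma = (sum(v**2 for v in values) / len(values) - mu**2) ** 0.5

# Step 3: z-score for each element
z_scores = [(v - mu) / sigma for v in values]

print(mu, sigma)
print(z_scores)

# Compare against the standard library's population standard deviation
print(math.isclose(sigma, statistics.pstdev(values)))
```

For this toy list the mean is 5 and the standard deviation is 2, so each z-score is simply the distance from 5 measured in units of 2.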

In [2]:
### edTest(test_std) ###

# Create a list with the squared values of the elements of data1
data_sq1 = []
for num in data1:
    data_sq1.append(num**2)


# Calculate mean and standard deviation using formula provided in the markdown cell above
mean1 = sum(data1)/len(data1)
std1 = (sum(data_sq1)/len(data_sq1) - mean1**2)**0.5

# Standardize the data using a loop and display 10 elements
std_data = []

for num in data1:
    std_data.append((num-mean1)/std1)

print(std_data[:10])

[-0.19187452454783607, 1.329700598950661, 0.7444793976050852, 0.8225088911178285, -1.713449648046333, 0.354331930041368, -0.7770957258934119, -1.5573906610208463, -1.5573906610208463, 0.4713761703104831]

### Similarly standardize data2

In [3]:
# Repeat the same process above but this time for the `data2` list
data_sq2 = []
for num in data2:
    data_sq2.append(num**2)


# Calculate mean and standard deviation using formula provided in the markdown cell above
mean2 = sum(data2)/len(data2)
std2 = (sum(data_sq2)/len(data_sq2) - mean2**2)**0.5

# Standardize the data using a loop and display 10 elements
std_data_2 = []

for num in data2:
    std_data_2.append((num-mean2)/std2)

print(std_data_2[:10])

[-0.04896349410262053, -0.866624986975458, -0.7704295172257124, 1.1053821428943265, -0.9628204567252036, -0.09706122897749332, -0.09706122897749332, -0.2894521684769845, 0.9129912033948353, -0.866624986975458]

### ⏸ If you had 1000 such data sets, what would be the most efficient way of standardizing them all?

#### A. Copy-paste the code for each dataset.
#### B. Call the TA and ask him/her to do it.
#### C. Write a function to standardize the data.

In [4]:
### edTest(test_chow1) ###
# Submit an answer choice as a string below (e.g. if you choose option A, put 'A')

answer = 'C'

## Writing a Function
Manually copy-pasting code to process each dataset would be very tedious. It would also reduce code readability, which increases the chance of small errors.

This is why we will define a function to do the job for us. Every time we wish to standardize data, all we have to do is call the function.

In [5]:
### edTest(test_func) ###
# Define a function which calculates mean and std of input data, and returns standardized data
def standardize(data):
  data_sq = []
  for num in data:
    data_sq.append(num**2)
  
  mean = sum(data)/len(data)
  std = (sum(data_sq)/len(data) - mean**2)**0.5

  std_list = []
  for num in data:
    std_list.append((num-mean)/std)

  
  return std_list

In [6]:
# Call the standardize function on data1 and display 10 elements
data1_std = standardize(data1)
print(data1_std[:10])

[-0.19187452454783607, 1.329700598950661, 0.7444793976050852, 0.8225088911178285, -1.713449648046333, 0.354331930041368, -0.7770957258934119, -1.5573906610208463, -1.5573906610208463, 0.4713761703104831]

In [7]:
# Call the standardize function on data2 and display 10 elements
data2_std = standardize(data2)
print(data2_std[:10])

[-0.04896349410262053, -0.866624986975458, -0.7704295172257124, 1.1053821428943265, -0.9628204567252036, -0.09706122897749332, -0.09706122897749332, -0.2894521684769845, 0.9129912033948353, -0.866624986975458]
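
To connect this back to the earlier question about 1000 datasets, here is a minimal sketch of how a single function can be reused across many lists. The `datasets` list is a hypothetical stand-in generated with `random` (it is not produced by `gen_data`), and `standardize` is restated in a compact form so the block runs on its own.

```python
import random

def standardize(data):
    """Same logic as the standardize function above, written compactly."""
    mean = sum(data) / len(data)
    std = (sum(num**2 for num in data) / len(data) - mean**2) ** 0.5
    return [(num - mean) / std for num in data]

# Hypothetical stand-in for "many datasets": 1000 lists of 50 random integers
random.seed(0)
datasets = [[random.randint(0, 99) for _ in range(50)] for _ in range(1000)]

# One function call per dataset instead of one copy-pasted block per dataset
standardized = [standardize(d) for d in datasets]

print(len(standardized))
print(standardized[0][:3])
```

Each standardized list should have a mean close to 0 and a standard deviation close to 1, which is a convenient way to check the function.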

## De-standardization function
Often in data science, we manipulate the standardized dataset (because it is usually easier) and then convert the result back to the original scale by de-standardizing.
So let's write a function to recover the original data by de-standardizing.

## Function to de-standardize
You will need the original `mean` and `std` values in order to de-standardize. Perform the following on each element:

$x_i = z_i \cdot \sigma + \mu$
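
This formula is the standardization step solved for $x_i$. A short derivation, using the same symbols as the standardization section:

$z_i = \frac{x_i - \mu}{\sigma} \;\Rightarrow\; z_i \cdot \sigma = x_i - \mu \;\Rightarrow\; x_i = z_i \cdot \sigma + \mu$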

In [8]:
### edTest(test_de) ###
# Write a function which takes data, mean and std as input 
# and returns de-standardized data
# Make sure you use the correct mean and std for 
# data1 and data2 calculated earlier
def destandardize(mean, std, data):
  de_list=[]
  for num in data:
    de_list.append(num * std + mean)

  return de_list

In [9]:
### edTest(test_de1) ###
# Use mean and std of data1 calculated earlier and destandardize data1_std
data_de1 = destandardize(mean1, std1, data1_std)
print(data_de1[:10])

[49.0, 88.0, 73.0, 75.0, 10.0, 63.0, 34.0, 14.0, 14.0, 66.0]


In [10]:
### edTest(test_de2) ###
# Use mean and std of data2 calculated earlier and destandardize data2_std
data_de2 = destandardize(mean2, std2, data2_std)
print(data_de2[:10])

[19.0, 2.0, 4.0, 43.0, 0.0, 18.0, 18.0, 14.0, 39.0, 2.0]



### ⏸ By looking at what data is required for destandardizing, do you observe something out of place?

#### A. No, all looks good.
#### B. `mean` and `std` got over-written when copy-pasting code.
#### C. The function to de-standardize requires extra data (`mean`, `std`) that the `standardize` function does not return.
#### D. B and C.

In [11]:
### edTest(test_chow2) ###
# Submit an answer choice as a string below (e.g. if you choose option A, put 'A')

answer = 'C'
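
One possible way to address this, sketched under the assumption that we are free to change the return value: let the standardizing function also return the mean and standard deviation it computed. The name `standardize_with_stats` is hypothetical and not part of the exercise. The `sample` list is the first 10 entries of `data1` printed at the top of the notebook, so its mean and standard deviation differ from those of the full list.

```python
import math

def standardize_with_stats(data):
    """Return the standardized list together with the mean and std used."""
    mean = sum(data) / len(data)
    std = (sum(num**2 for num in data) / len(data) - mean**2) ** 0.5
    std_list = [(num - mean) / std for num in data]
    return std_list, mean, std

def destandardize(mean, std, data):
    """Invert the standardization: x = z * std + mean."""
    return [num * std + mean for num in data]

# First 10 entries of data1, as printed at the top of the notebook
sample = [49, 88, 73, 75, 10, 63, 34, 14, 14, 66]

z, mean, std = standardize_with_stats(sample)
restored = destandardize(mean, std, z)

# The round trip should recover the original values up to floating-point error
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(sample, restored)))
```

With this design the caller no longer has to recompute or keep track of `mean` and `std` separately before de-standardizing.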