## Scaling and Transforming Data

So far, we've learned how to make visualizations and compute some simple statistics to describe our data. As we continue in this course, we'll want more ways to compare datasets, some of which may use different measurement scales. To do that, we may need to scale our data or apply some other transformation, which is what we'll discuss in this video. Along the way, I'll talk about how transformations can help us deal with data that aren't normally distributed, how they affect visualizations, and how they can either help or hinder our understanding of a particular problem. So, with that said, let's get started!

### Apples to Oranges: Comparing Test Scores

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#Say we're trying to compare how students perform on college entrance exams at two different fictional high 
#schools in the US. One is in California (Sunnydale High), where a large majority tend to take the SAT, and 
#the other is in Illinois (Shermer High), where students favor the ACT. Is there any way for us to tell how
#these schools stack up against one another?

#Below, we've collected a sample of scores from Sunnydale High and Shermer High.
#The SAT scores here are on the 200-800 point scale used for a single SAT section...
SAT_scores = [690, 330, 600, 350, 540, 440, 650, 480, 570, 420, 360, 620, 580, 600,
 390, 420, 510, 640, 350, 470, 570, 430, 410, 420, 380, 420, 510, 620,
 470, 700, 520, 560, 480, 540, 450, 550, 520, 460, 410, 550, 400, 350,
 780, 590, 510, 410, 520, 340, 430, 370, 560, 560, 500, 560, 490, 550,
 430, 520, 710, 520, 460, 390, 550, 410, 480, 450, 520, 610, 380, 620,
 530, 460, 460, 660, 520, 580, 490, 560, 520, 380, 440, 610, 530, 350,
 630, 440, 450, 590, 430, 640, 500, 290, 560, 390, 320, 470, 700, 540,
 440, 550]

#...while the ACT scores range from 1 to 36.
ACT_scores = [24, 18, 32, 23, 22, 26, 18, 23, 17, 28, 15, 20, 20, 17, 19, 24, 17, 29,
 21, 31, 22, 13, 17, 17, 26, 16, 25, 30, 26, 14, 14, 22, 14, 29, 26, 27,
 25, 20, 19, 17, 31, 20, 20, 25, 19, 24, 23, 24, 24, 23, 17, 18, 21, 26,
 21, 21, 28, 22, 22, 21, 18, 10, 16, 25, 31, 23, 24, 18, 28, 18, 20, 23,
 22, 17, 16, 17, 29, 25, 18, 19, 20, 22, 29, 18, 17, 24, 15, 33, 30, 17,
 11, 25, 24, 20, 21, 21, 29, 25, 22, 18]

#Let's go ahead and store these values as a pandas DataFrame.
columns = ["SAT", "ACT"]
score_df = #YOUR CODE HERE
print(score_df.head())
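
As a rough sketch of one way to build such a table (not necessarily the intended solution to the exercise above), a dictionary mapping column names to lists can be passed to `pd.DataFrame`. The names `sat_sample`, `act_sample`, and `sample_df` are hypothetical, and the lists are shortened to the first five scores from each school.

```python
import pandas as pd

# Shortened, illustrative samples: the first five scores from each list above.
sat_sample = [690, 330, 600, 350, 540]
act_sample = [24, 18, 32, 23, 22]

# Map each column name to its list of values; the lists must have equal length.
sample_df = pd.DataFrame({"SAT": sat_sample, "ACT": act_sample})
print(sample_df.head())
```

Each row here simply pairs the i-th SAT score with the i-th ACT score; the two columns describe different students at different schools, so the pairing itself carries no meaning.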

### What is a transformation?

To make a reasonable comparison, we're going to need to transform the data in some way. Specifically, when 
I talk about a transformation, all that means is that we're going to apply some function, $f(x)$, to each input, 
and get our new outputs. So, something as simple as $x + 0$ counts as a (trivial) transformation, as does a much more 
complicated expression, such as the one below:

$$ \hat{f}(k) = \int_{-\infty}^{\infty}f(x) e^{-2\pi i kx} dx $$

Of course, adding zero isn't a particularly *useful* transformation. And as for the second one, it's certainly 
useful, but not something you'll have to worry about in this course (if you're curious, it's a Fourier transform, which has many applications relating to time series data that we won't dive into here). I'll be sure to point out the 
essential transformations you're going to run across when reading other people's analyses, and provide you
with the tools you need to get started on your own. Now, relating transformations back to our original 
question regarding test scores...
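
To make "apply some function to each input" concrete, here is a minimal sketch using a NumPy array. The function names `add_zero` and `double_plus_one` are made up for illustration; they are not part of the course material.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

def add_zero(values):
    # The trivial transformation f(x) = x + 0 leaves every input unchanged.
    return values + 0

def double_plus_one(values):
    # A simple linear transformation f(x) = 2x + 1, applied element by element.
    return 2 * values + 1

print(add_zero(x))
print(double_plus_one(x))
```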

### Scaling Techniques

Max-Min Normalization

$$ x' = \frac{x-\min(x)}{\max(x)-\min(x)} $$

Standardization (z-score)

$$ x' = \frac{x - \mu}{\sigma} $$
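
The two formulas above can be sketched in a few lines of NumPy. The dataset below is made up for illustration (it is deliberately not the quiz dataset), and the sketch assumes the population standard deviation, which is what `np.std` computes by default.

```python
import numpy as np

# Illustrative data, chosen so the mean and standard deviation are round numbers.
data = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

# Max-min normalization: subtract the minimum, divide by the range.
min_max = (data - data.min()) / (data.max() - data.min())

# Standardization: subtract the mean, divide by the standard deviation.
mu = data.mean()     # 5.0
sigma = data.std()   # 2.0 (population standard deviation, ddof=0)
z_scores = (data - mu) / sigma

print(min_max)
print(z_scores)
```

Note that pandas uses the sample standard deviation (`ddof=1`) by default in `Series.std`, so the two libraries give slightly different z-scores unless `ddof` is set explicitly.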

In [None]:
#The difficulty comes from having two different measurement scales. Scaling lets us use the same 
#measuring stick and start to draw some conclusions from the data. One way of scaling the data is 
#max-min normalization, which rescales all our values to lie between 0 and 1. However, in our particular case, it
#might be more helpful to look at standardization, or in other words, to compute z-scores. Recall that a z-score 
#just captures the difference between a data point and the mean, relative to the spread of the overall 
#distribution.

#Now let's go ahead and pause for a quick refresher. Try your hand at calculating a z-score.
#QUIZ: Hand-compute max-min/z-score (x=0 and Dataset: 1 2 2 2 3 3 4 5 5)

#Rather than computing mu and sigma from our own samples, we've looked up the national averages 
#and standard deviations for both exams. Using those in the z-score formula lets us benchmark each school 
#relative to how the rest of the country performed. 

#2017 data obtained from https://nces.ed.gov/programs/digest/current_tables.asp (Tables 226.XX)
SAT_mean = 527
SAT_sd = 107

ACT_mean = 20.7
ACT_sd = 5.5

#With that info, we can calculate the standardized scores and store them in normalized_df
SAT_norm = #YOUR CODE HERE
ACT_norm = #YOUR CODE HERE
normalized_df = #YOUR CODE HERE

#Let's go ahead and plot the normalized SAT scores from Sunnydale and see what happens.
plt.hist(normalized_df['SAT'], bins=12)

#Even though we haven't discussed histograms yet, we can still get a vague sense of what's going on here. 
#Specifically, note that it kind of resembles a normal distribution: 
#there's a hump--somewhat left of center--that tails off on both ends. 

In [None]:
#Now we'll repeat the same thing for the normalized ACT scores from Shermer High.
plt.hist(normalized_df['ACT'], bins=12)
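
The standardization behind both histograms can be sketched as follows. This is one possible approach under stated assumptions, not necessarily the intended solution: it reuses the national means and standard deviations quoted above, while `sample_df` and `sample_norm` are hypothetical names holding only the first five scores from each school.

```python
import pandas as pd

# National benchmarks quoted in the cell above.
SAT_mean, SAT_sd = 527, 107
ACT_mean, ACT_sd = 20.7, 5.5

# Shortened, illustrative samples: the first five scores from each list.
sample_df = pd.DataFrame({"SAT": [690, 330, 600, 350, 540],
                          "ACT": [24, 18, 32, 23, 22]})

# z-score against the national distribution: (score - national mean) / national sd.
sample_norm = pd.DataFrame({
    "SAT": (sample_df["SAT"] - SAT_mean) / SAT_sd,
    "ACT": (sample_df["ACT"] - ACT_mean) / ACT_sd,
})
print(sample_norm.round(2))
```

A value of 0 means a student scored exactly at the national average for that exam, and a value of 1 means one national standard deviation above it, so the two columns are now on the same scale.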

In [None]:
#Let's print out some centrality measures. 
#Print out the mean and median SAT and ACT scores for each school.



#What do you notice? Does this match the visuals?
#Admittedly, this isn't very rigorous, but it does show how a simple transformation combined 
#with some basic visual exploration is an effective way of getting some quick insights from your data. 
#There are also more rigorous statistical procedures that let us answer questions like this
#with more confidence.
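
One way to get the mean and median of every column at once is `DataFrame.agg`. This is only a sketch on a shortened, illustrative sample (`sample_df` and `summary` are hypothetical names); the same call works on standardized scores.

```python
import pandas as pd

# Shortened, illustrative samples: the first five scores from each list.
sample_df = pd.DataFrame({"SAT": [690, 330, 600, 350, 540],
                          "ACT": [24, 18, 32, 23, 22]})

# One row per statistic, one column per exam.
summary = sample_df.agg(["mean", "median"])
print(summary)
```

When the mean sits well below or above the median, the distribution is skewed, which is worth checking against the shape of the histogram.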

### Case Study: New York Stock Exchange

In [None]:
#We'll end this lecture by going through a case study. The following data are the closing values, from 1915
#to 2018, of the Dow Jones Industrial Average (DJIA), a stock market index of 30 large, publicly owned
#companies based in the US. Note that the 2008 global recession is clearly visible. Can you identify any 
#other recessions? How would we identify periods of economic downturn or stagnation in general?

#Data obtained from https://www.macrotrends.net/1319/dow-jones-100-year-historical-chart 
years = np.arange(1915, 2019, 1) 
closing_values = [99.15, 95.00, 74.38, 82.20, 107.23, 71.95, 80.80, 98.17, 95.52, 120.51, 151.08, 157.20, 200.70, 300.00, 
                  248.48, 164.58, 77.90, 59.93, 99.90, 104.04, 144.13, 179.90, 120.85, 154.76, 150.24, 131.13, 110.96, 119.40, 
                  135.89, 152.32, 192.91, 177.20, 181.16, 177.30, 200.13, 235.41, 269.23, 291.90, 280.90, 404.39, 488.40, 
                  499.47, 435.69, 583.65, 679.36, 615.89, 731.14, 652.10, 762.95, 874.13, 969.26, 785.69, 905.11, 943.75, 
                  800.36, 838.92, 890.20, 1020.02, 850.86, 616.24, 852.41, 1004.65, 831.17, 805.01, 838.74, 963.99, 875.00, 
                  1046.54, 1258.64, 1211.57, 1546.67, 1895.95, 1938.83, 2168.57, 2753.20, 2633.66, 3168.83, 3301.11, 3754.09, 
                  3834.44, 5117.12, 6448.27, 7908.30, 9181.43, 11497.12, 10787.99, 10021.57, 8341.63, 10453.92, 10783.01, 
                  10717.50, 12463.15, 13264.82, 8776.39, 10428.05, 11577.51, 12217.56, 13104.14, 16576.66, 17823.07, 17425.03, 
                  19762.60, 24719.22, 23327.46]

plt.plot(years, closing_values)

|          Crowds outside NYSE after crash       | "Bank runs"       |
|:-----------------:| :------------------------: |
|  ![Crowds outside NYSE](Crowd_NYSE.jpg)        | ![American Union Bank](American_Union_Bank.png)|

In [None]:
#For those of you who are familiar with US history, you may know that there was a severe worldwide economic 
#downturn in the 1930s, after WWI and before the start of WWII. The unemployment rate reached 25%, banks 
#began to fail, and swaths of people lined up to withdraw whatever savings they had left, as depicted
#above. The "Great Depression" doesn't seem to appear in our plot, though. Why is that?

#Why might the Great Depression not show up in our graph? Try and fix this issue.



In [None]:
#Voila! Now the Great Depression is clearly visible. Even though a change of 30 points may seem minuscule 
#nowadays, on October 29, 1929, or "Black Tuesday", this was a 12% decrease, which accounts for a significant 
#portion of that giant dropoff on the left-hand side of the graph. Note that there are a few other periods 
#where the market seems to stagnate, and while it's still difficult to precisely pinpoint every major 
#recession, we can make out many more intricacies of the data that were hidden before
#we applied the transformation.
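
The notebook leaves the fix as an exercise, so the following is an assumption about which transformation is meant: a logarithm turns equal percentage changes into equal distances, which is why a small point drop in 1929 can stand out. The sketch compares the 1928 to 1932 decline with the 2007 to 2008 decline, using values taken from the `closing_values` list above.

```python
import numpy as np

# Closing values copied from the list above.
close_1928, close_1932 = 300.00, 59.93
close_2007, close_2008 = 13264.82, 8776.39

# On the raw scale, the 2008 drop dwarfs the Depression-era drop.
abs_drop_depression = close_1928 - close_1932
abs_drop_2008 = close_2007 - close_2008

# On a log10 scale, a difference measures the ratio between the two values,
# so the much larger relative loss of 1928-1932 becomes the bigger drop.
log_drop_depression = np.log10(close_1928) - np.log10(close_1932)
log_drop_2008 = np.log10(close_2007) - np.log10(close_2008)

print(abs_drop_depression, abs_drop_2008)
print(log_drop_depression, log_drop_2008)
```

On a plot, calling `plt.yscale("log")` after `plt.plot(years, closing_values)` has the same visual effect without changing the underlying data.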

### Which transformation should I choose?

- There is no "one size fits all" process; start by exploring your data
- Normalizing data is a common and sometimes necessary transformation before applying later steps in a statistical pipeline
- Reciprocal and logarithmic transformations are other useful transformations to know
- These transformations have visual effects: the right choice might make analysis easier or emphasize different features of the data
- A very useful resource for much of the information in this lecture: http://fmwww.bc.edu/repec/bocode/t/transint.html
- In the next section, we'll look at some ways to spruce up our graphs (previewed here), such as drawing trendlines, highlighting regions, etc.
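
As a quick illustration of the reciprocal and logarithmic transformations mentioned in the list above, here is a minimal sketch on made-up values that span several orders of magnitude.

```python
import numpy as np

# Illustrative values spanning three orders of magnitude.
values = np.array([1.0, 10.0, 100.0, 1000.0])

# The logarithm compresses large values: each tenfold increase becomes +1.
log_values = np.log10(values)

# The reciprocal also compresses large values, but it reverses their order.
reciprocal_values = 1.0 / values

print(log_values)
print(reciprocal_values)
```

Both transformations require care near zero: the logarithm is undefined for zero and negative inputs, and the reciprocal is undefined at zero.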