# Toolbox One

We have various tools at our disposal to summarize variables and the relationships between them. Imagine that we have multiple toolboxes. This is the first one, and it has two levels.

## First Level

On the first level, our three tools are:

1. (sample) Mean of X (or Y)
2. (sample) Standard Deviation of X (or Y)
3. (sample) Covariance of X and Y

## Second Level

On the second level, we have two tools that combine the tools from the first level:

1. Coefficient of Variation = (Standard Deviation)/(Mean)
2. Correlation = (Covariance of X and Y)/((Standard Deviation of X)*(Standard Deviation of Y))

The tools on the second level rescale the standard deviation and covariance statistics into unit-free measures.

## Formulas for the First Three Tools

Formulas are shorthand that lets us be precise and succinct.

## Sample Mean
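
One common way to write the sample mean, assuming we observe $n$ values $x_1, \dots, x_n$ of the variable X (this notation is our own choice for illustration):

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
$$

The sample mean of Y, $\bar{y}$, is defined the same way from $y_1, \dots, y_n$.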

## Sample Standard Deviation
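
Assuming $n$ observations $x_1, \dots, x_n$ of X with sample mean $\bar{x}$ (notation chosen here for illustration), the sample standard deviation can be written as follows. The $n-1$ denominator matches what R's `sd` function uses.

$$
s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}
$$

The standard deviation is in the same units as X itself.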

## Sample Covariance
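
Assuming $n$ paired observations $(x_1, y_1), \dots, (x_n, y_n)$ with sample means $\bar{x}$ and $\bar{y}$ (notation chosen here for illustration), the sample covariance can be written as follows. As with the standard deviation, R's `cov` function uses the $n-1$ denominator.

$$
s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)
$$

The covariance is in units of X times units of Y, which is why its size is hard to interpret on its own.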

## Formulas for the Two Rescaling Tools

Formulas are shorthand that lets us be precise and succinct.

## Coefficient of Variation
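
Writing $s_x$ for the sample standard deviation of X and $\bar{x}$ for its sample mean (symbols chosen here for illustration), the coefficient of variation is:

$$
\text{CV}_x = \frac{s_x}{\bar{x}}
$$

Since $s_x$ and $\bar{x}$ are in the same units, the units cancel. The ratio is only meaningful when the mean is not zero, and it is usually applied to variables that take positive values.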

## Correlation
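
Writing $s_{xy}$ for the sample covariance of X and Y, and $s_x$ and $s_y$ for their sample standard deviations (symbols chosen here for illustration), the correlation is:

$$
r_{xy} = \frac{s_{xy}}{s_x \, s_y}
$$

The units of X and Y cancel in the division, and the result always lies between $-1$ and $+1$.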



# Data Examples

## College Education Share and Hourly Wage

Two variables:

1. Fraction of individuals with a college degree in a state
    + this is in fraction units: the minimum is 0.00 and the maximum is 1.00 (100 percent)
2. Average hourly salary in the state
    + this is in Dollar units

The two variables above are in different units. We first calculate the mean, standard deviation, and covariance. With just these, it is hard to compare the standard deviations of the two variables, because they are on different scales. It is also hard to tell from the covariance alone whether it is large or small. To make comparisons possible, we calculate the coefficients of variation and the correlation.

## Standard Deviations and Coefficient of Variation

The sample standard deviations for the two variables are $0.051$ (in fraction units) and $1.51$ (in dollar units). Can we say that hourly salary has the larger standard deviation? Not directly, because the two variables are on different scales. $1.51$ is the larger number, but that does not mean hourly salary varies more than the fraction with college education.

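Converting the standard deviations to coefficients of variation, we now have $0.16$ and $0.09$. Because of the division by the mean, both are unit-free: each is a standard deviation expressed as a fraction of its mean. Now the two are comparable, and relative to its mean, the college share varies more across states than hourly salary does.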
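
To see why dividing by the mean removes the dependence on scale, here is a minimal sketch. The wage numbers are made up for illustration (they are not the EPI data), and `coef.variation` is a helper we define ourselves, not a base R function.

```r
# Hypothetical hourly wages in dollars (made-up numbers, not the EPI data)
wage.dollars <- c(14.5, 15.8, 16.2, 17.9, 18.4)
# The same wages expressed in cents
wage.cents <- 100 * wage.dollars

# Helper (our own, not part of base R): sd as a fraction of the mean
coef.variation <- function(x) sd(x) / mean(x)

# The standard deviation changes with the units ...
sd.dollars <- sd(wage.dollars)
sd.cents <- sd(wage.cents)

# ... but the coefficient of variation does not
cv.dollars <- coef.variation(wage.dollars)
cv.cents <- coef.variation(wage.cents)

print(c(sd.dollars = sd.dollars, sd.cents = sd.cents))
print(c(cv.dollars = cv.dollars, cv.cents = cv.cents))
```

The standard deviation in cents is 100 times the one in dollars, while the two coefficients of variation are equal.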

## Covariance and Correlation

The covariance we get is positive: $0.06$. But is this actually a strong positive relationship? $0.06$ seems like a small number.

Rescaling covariance to correlation, the correlation between the two variables is $0.78$. Since the correlation of two variables is always between $-1$ and $+1$, we can now say that the two variables are strongly positively related: a higher share of individuals with college education goes with a higher hourly salary.
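
The rescaling can be checked by hand using the rounded statistics from the R output later in this section (covariance $0.0604$, standard deviations $0.0514$ and $1.51$):

$$
\frac{0.0604}{0.0514 \times 1.51} \approx \frac{0.0604}{0.0776} \approx 0.78
$$

Because the inputs are rounded, the hand calculation only approximates the value that R reports.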

## R--Mean, Standard Deviation and Covariance

We do not need to load any special packages to calculate the mean, standard deviation, and covariance. These are part of base R. The only package we load, `readr`, is used to read the data file.

We will store the results in a named list.

In [44]:
# Load in Data Tools
# For Reading/Loading Data
library(readr)
# Load in Data
df_survey <- read_csv('../data/EPIStateEduWage2017.csv')

Parsed with column specification:
cols(
  State = col_character(),
  Share.College.Edu = col_double(),
  Hourly.Salary = col_double()
)


In [45]:
# We can compute the three basic statistics
stats.level.one <- list(
              # Mean, SD and Var for the College Share variable
              Shr.Coll.Mean = mean(df_survey$Share.College.Edu), 
              Shr.Coll.Std = sd(df_survey$Share.College.Edu),
              Shr.Coll.Var = var(df_survey$Share.College.Edu),
    
              # Mean, SD and Var for the Hourly Wage Variable
              Hr.Wage.Mean = mean(df_survey$Hourly.Salary),                            
              Hr.Wage.Std = sd(df_survey$Hourly.Salary),
              Hr.Wage.Var = var(df_survey$Hourly.Salary),
              
              # Covariance between the two variables
              Shr.Wage.Cov = cov(df_survey$Hourly.Salary, df_survey$Share.College.Edu)
              )

# Let's Print the Statistics we Computed
print(stats.level.one, digits = 3)

$Shr.Coll.Mean
[1] 0.316

$Shr.Coll.Std
[1] 0.0514

$Shr.Coll.Var
[1] 0.00264

$Hr.Wage.Mean
[1] 16.3

$Hr.Wage.Std
[1] 1.51

$Hr.Wage.Var
[1] 2.28

$Shr.Wage.Cov
[1] 0.0604
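
To connect the built-in functions to the formulas, here is a minimal sketch that computes the mean, standard deviation, and covariance by hand and compares them with `mean`, `sd`, and `cov`. The vectors `x` and `y` are made-up toy data (not the EPI data), and we assume the $n-1$ denominator, which is what `sd` and `cov` use.

```r
# Hypothetical toy data (made-up numbers, not the EPI data)
x <- c(0.25, 0.30, 0.33, 0.38, 0.29)
y <- c(14.9, 16.1, 16.8, 18.2, 15.5)
n <- length(x)

# Sample means: sum divided by the number of observations
mean.x <- sum(x) / n
mean.y <- sum(y) / n

# Sample standard deviation of x: root of the average squared deviation,
# using n - 1 in the denominator
sd.x <- sqrt(sum((x - mean.x)^2) / (n - 1))

# Sample covariance: average product of deviations, also with n - 1
cov.xy <- sum((x - mean.x) * (y - mean.y)) / (n - 1)

# Compare the hand-computed values with the built-in functions
print(all.equal(mean.x, mean(x)))
print(all.equal(sd.x, sd(x)))
print(all.equal(cov.xy, cov(x, y)))
```

Each comparison should print `TRUE`, since the hand-written formulas and the built-in functions compute the same quantities.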



## R--Coefficient of Variation and Correlation

Let's first apply our formulas directly. For the correlation, we can also get the number directly from R's `cor` function and check that the two agree.

Since we already created the named list `stats.level.one`, we can grab values from that list.

In [46]:
# We can compute the two rescaled statistics
stats.level.two <- list(              
              # Coefficient of Variation
              Shr.Coll.Coef.Variation = (stats.level.one$Shr.Coll.Std)/(stats.level.one$Shr.Coll.Mean),
              Hr.Wage.Coef.Variation = (stats.level.one$Hr.Wage.Std)/(stats.level.one$Hr.Wage.Mean),
    
              # Correlation 
              Shr.Wage.Cor = cor(df_survey$Hourly.Salary, df_survey$Share.College.Edu),
              Shr.Wage.Cor.Formula = (stats.level.one$Shr.Wage.Cov
                                     /(stats.level.one$Shr.Coll.Std*stats.level.one$Hr.Wage.Std))
              )

# Let's Print the Statistics we Computed
print(stats.level.two, digits = 3)

$Shr.Coll.Coef.Variation
[1] 0.162

$Hr.Wage.Coef.Variation
[1] 0.0926

$Shr.Wage.Cor
[1] 0.779

$Shr.Wage.Cor.Formula
[1] 0.779
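
As a further illustration of why correlation is easier to interpret than covariance, the sketch below measures the same wages in dollars and in cents. The data are made up for illustration (not the EPI data). The covariance scales with the units, while the correlation stays the same.

```r
# Hypothetical toy data (made-up numbers, not the EPI data)
share <- c(0.25, 0.30, 0.33, 0.38, 0.29)
wage.dollars <- c(14.9, 16.1, 16.8, 18.2, 15.5)
# The same wages expressed in cents
wage.cents <- 100 * wage.dollars

# Covariance depends on the units of measurement
cov.dollars <- cov(share, wage.dollars)
cov.cents <- cov(share, wage.cents)

# Correlation does not
cor.dollars <- cor(share, wage.dollars)
cor.cents <- cor(share, wage.cents)

print(c(cov.dollars = cov.dollars, cov.cents = cov.cents))
print(c(cor.dollars = cor.dollars, cor.cents = cor.cents))
```

The covariance in cents is 100 times the covariance in dollars, but both versions give the same correlation.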

