In [1]:
# Import the libraries (only pandas is used in this notebook)
import pandas as pd

# Correlation coefficient

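The correlation coefficient measures the strength and direction of the linear relationship between two quantitative variables.
It is usually denoted $r$ and always lies between $-1$ and $1$.

To compute the correlation $r$ between $x$ and $y$ from $n$ paired observations, use the following formula, where $\bar x$ and $\bar y$ are the sample means and $s_x$ and $s_y$ are the sample standard deviations: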

$$r = \frac{1}{n - 1} \sum_{i=1}^n \left(\frac{x_i - \bar x}{s_x}\right) \left(\frac{y_i - \bar y}{s_y}\right)$$

Equivalently, pulling the constants $s_x$ and $s_y$ out of the sum:

$$r = \frac{1}{n - 1} \cdot \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{s_x s_y}$$
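
As a minimal sketch of the first formula, the function below standardizes each value and divides the sum of the products by $n - 1$. The function name `pearson_r` and the sample lists are illustrative assumptions, not part of the notebook; `statistics.stdev` is the sample standard deviation (divisor $n - 1$), which matches $s_x$ and $s_y$.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation from the definition: sum of z-score products over n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(z_products) / (len(xs) - 1)

# Illustrative data: a perfectly linear increasing relationship gives r = 1 (up to rounding)
example_r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```

A perfectly linear decreasing relationship, such as `pearson_r([1, 2, 3], [3, 2, 1])`, gives $-1$ in the same way.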

In [11]:
df = pd.DataFrame({'year': [1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003],
                  'Number of Casino Employees $x$ (thousands)': [20,23,29,27,30,34,35,37,40,43],
                  'Crime Rate $y$ (number of crimes per 1.000 population)': [1.32,1.67,2.17,2.70,2.75,2.87,3.65,2.86,3.61,4.25]})
df

Unnamed: 0,year,Number of Casino Employees $x$ (thousands),Crime Rate $y$ (number of crimes per 1.000 population)
0,1994,20,1.32
1,1995,23,1.67
2,1996,29,2.17
3,1997,27,2.7
4,1998,30,2.75
5,1999,34,2.87
6,2000,35,3.65
7,2001,37,2.86
8,2002,40,3.61
9,2003,43,4.25


## Assignment:

Compute the correlation coefficient $r$ for the data in `df`.

## Solution: 

First, let's rewrite the formula in a form that is easy to compute column by column:

$$r = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n (x_i - \bar x)^2 \sum_{i=1}^n (y_i - \bar y)^2}}$$
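
The step from the definition to this form is a short substitution. It is sketched here assuming the usual sample standard deviation, $s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)^2}$, and likewise for $s_y$. The product of the two standard deviations is

$$s_x s_y = \frac{1}{n - 1} \sqrt{\sum_{i=1}^n (x_i - \bar x)^2 \sum_{i=1}^n (y_i - \bar y)^2}$$

so the factor $\frac{1}{n-1}$ cancels:

$$r = \frac{1}{n - 1} \cdot \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{s_x s_y} = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n (x_i - \bar x)^2 \sum_{i=1}^n (y_i - \bar y)^2}}$$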

In [12]:
# Right, now let's add columns to this df, one for each term of the formula.
# Let's start by looking up the mean of x and y:

df.describe()

Unnamed: 0,year,Number of Casino Employees $x$ (thousands),Crime Rate $y$ (number of crimes per 1.000 population)
count,10.0,10.0,10.0
mean,1998.5,31.8,2.785
std,3.02765,7.345445,0.904559
min,1994.0,20.0,1.32
25%,1996.25,27.5,2.3025
50%,1998.5,32.0,2.805
75%,2000.75,36.5,3.425
max,2003.0,43.0,4.25


In [13]:
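# So the mean of x is 31.8 and the mean of y is 2.785.
# Now let's add a column with the deviation of x from its mean (x - 31.8).
# Note: in a normal (non-raw) string, '\b' is a backspace escape, which is why
# the header displays as 'ar_x'. The spelling is kept because later cells use it.

x_col = 'Number of Casino Employees $x$ (thousands)'
# Vectorized subtraction; .mean() avoids hard-coding 31.8
df.insert(loc=3, column='$$x-{\bar_x}$$', value=df[x_col] - df[x_col].mean())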
df

Unnamed: 0,year,Number of Casino Employees $x$ (thousands),Crime Rate $y$ (number of crimes per 1.000 population),$$x-{ar_x}$$
0,1994,20,1.32,-11.8
1,1995,23,1.67,-8.8
2,1996,29,2.17,-2.8
3,1997,27,2.7,-4.8
4,1998,30,2.75,-1.8
5,1999,34,2.87,2.2
6,2000,35,3.65,3.2
7,2001,37,2.86,5.2
8,2002,40,3.61,8.2
9,2003,43,4.25,11.2


In [14]:
# Now let's do the same for y (y - 2.785)

y_col = 'Crime Rate $y$ (number of crimes per 1.000 population)'
# Vectorized subtraction; .mean() avoids hard-coding 2.785
df.insert(loc=4, column='$$y-{\bar_y}$$', value=df[y_col] - df[y_col].mean())
df

Unnamed: 0,year,Number of Casino Employees $x$ (thousands),Crime Rate $y$ (number of crimes per 1.000 population),$$x-{ar_x}$$,$$y-{ar_y}$$
0,1994,20,1.32,-11.8,-1.465
1,1995,23,1.67,-8.8,-1.115
2,1996,29,2.17,-2.8,-0.615
3,1997,27,2.7,-4.8,-0.085
4,1998,30,2.75,-1.8,-0.035
5,1999,34,2.87,2.2,0.085
6,2000,35,3.65,3.2,0.865
7,2001,37,2.86,5.2,0.075
8,2002,40,3.61,8.2,0.825
9,2003,43,4.25,11.2,1.465


In [16]:
# Great, now we can add a column with the product of the two deviation columns.
# (It is named 'x*y' for short, but it holds (x - mean of x) * (y - mean of y).)

df['x*y'] = df['$$x-{\bar_x}$$'] * df['$$y-{\bar_y}$$']
df

Unnamed: 0,year,Number of Casino Employees $x$ (thousands),Crime Rate $y$ (number of crimes per 1.000 population),$$x-{ar_x}$$,$$y-{ar_y}$$,x*y
0,1994,20,1.32,-11.8,-1.465,17.287
1,1995,23,1.67,-8.8,-1.115,9.812
2,1996,29,2.17,-2.8,-0.615,1.722
3,1997,27,2.7,-4.8,-0.085,0.408
4,1998,30,2.75,-1.8,-0.035,0.063
5,1999,34,2.87,2.2,0.085,0.187
6,2000,35,3.65,3.2,0.865,2.768
7,2001,37,2.86,5.2,0.075,0.39
8,2002,40,3.61,8.2,0.825,6.765
9,2003,43,4.25,11.2,1.465,16.408


In [17]:
# Super, now let's add two more columns: the squared deviations of x and of y.
# (The short names '$x^2$' and '$y^2$' stand for (x - mean of x)^2 and (y - mean of y)^2.)

df['$x^2$'] = df['$$x-{\bar_x}$$']**2
df['$y^2$'] = df['$$y-{\bar_y}$$']**2
df

Unnamed: 0,year,Number of Casino Employees $x$ (thousands),Crime Rate $y$ (number of crimes per 1.000 population),$$x-{ar_x}$$,$$y-{ar_y}$$,x*y,$x^2$,$y^2$
0,1994,20,1.32,-11.8,-1.465,17.287,139.24,2.146225
1,1995,23,1.67,-8.8,-1.115,9.812,77.44,1.243225
2,1996,29,2.17,-2.8,-0.615,1.722,7.84,0.378225
3,1997,27,2.7,-4.8,-0.085,0.408,23.04,0.007225
4,1998,30,2.75,-1.8,-0.035,0.063,3.24,0.001225
5,1999,34,2.87,2.2,0.085,0.187,4.84,0.007225
6,2000,35,3.65,3.2,0.865,2.768,10.24,0.748225
7,2001,37,2.86,5.2,0.075,0.39,27.04,0.005625
8,2002,40,3.61,8.2,0.825,6.765,67.24,0.680625
9,2003,43,4.25,11.2,1.465,16.408,125.44,2.146225


In [18]:
#Cool, we are almost there, let's add a total (=sum) row!

df.loc['Total']= df.sum()
df

Unnamed: 0,year,Number of Casino Employees $x$ (thousands),Crime Rate $y$ (number of crimes per 1.000 population),$$x-{ar_x}$$,$$y-{ar_y}$$,x*y,$x^2$,$y^2$
0,1994.0,20.0,1.32,-11.8,-1.465,17.287,139.24,2.146225
1,1995.0,23.0,1.67,-8.8,-1.115,9.812,77.44,1.243225
2,1996.0,29.0,2.17,-2.8,-0.615,1.722,7.84,0.378225
3,1997.0,27.0,2.7,-4.8,-0.085,0.408,23.04,0.007225
4,1998.0,30.0,2.75,-1.8,-0.035,0.063,3.24,0.001225
5,1999.0,34.0,2.87,2.2,0.085,0.187,4.84,0.007225
6,2000.0,35.0,3.65,3.2,0.865,2.768,10.24,0.748225
7,2001.0,37.0,2.86,5.2,0.075,0.39,27.04,0.005625
8,2002.0,40.0,3.61,8.2,0.825,6.765,67.24,0.680625
9,2003.0,43.0,4.25,11.2,1.465,16.408,125.44,2.146225
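
Appending a `Total` row works, but it mixes a summary into the data rows (note that the integer columns became floats above). A possible alternative, sketched under the assumption that only the three sums from the formula are needed, leaves the data untouched and stores the sums in a separate Series. The short names `data`, `dx`, `dy` and `sums` are hypothetical, chosen here for illustration.

```python
import pandas as pd

# Same x and y values as in the notebook's df, with short illustrative column names
data = pd.DataFrame({
    "x": [20, 23, 29, 27, 30, 34, 35, 37, 40, 43],
    "y": [1.32, 1.67, 2.17, 2.70, 2.75, 2.87, 3.65, 2.86, 3.61, 4.25],
})

dx = data["x"] - data["x"].mean()  # deviations of x from its mean
dy = data["y"] - data["y"].mean()  # deviations of y from its mean

# The three sums that appear in the formula, kept apart from the data rows
sums = pd.Series({
    "sum_dx_dy": (dx * dy).sum(),
    "sum_dx_sq": (dx ** 2).sum(),
    "sum_dy_sq": (dy ** 2).sum(),
})
print(sums.round(4))
```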


Now let's plug the column totals into the formula (55.810 for the products of the deviations, 485.6 and 7.3641 for the squared deviations of $x$ and $y$):

- $r = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n (x_i - \bar x)^2 \sum_{i=1}^n (y_i - \bar y)^2}}$
- $r = \frac{55.810}{\sqrt{(485.6)(7.3641)}}$
- $r \approx 0.933$
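
As a sanity check, the hand calculation can be compared with pandas' built-in `Series.corr`, which computes the Pearson correlation by default. This sketch rebuilds the two data columns under short illustrative names (`x`, `y`) so that it runs on its own.

```python
import math
import pandas as pd

x = pd.Series([20, 23, 29, 27, 30, 34, 35, 37, 40, 43], name="x")
y = pd.Series([1.32, 1.67, 2.17, 2.70, 2.75, 2.87, 3.65, 2.86, 3.61, 4.25], name="y")

# Hand formula: sum of cross-deviations over the root of the product of squared deviations
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / math.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Built-in Pearson correlation for comparison
r_pandas = x.corr(y)

print(round(r_manual, 3), round(r_pandas, 3))
```

Both values should agree with the 0.933 obtained above, up to floating-point rounding.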

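Conclusion: with $r \approx 0.933$, there is a strong positive linear relationship between the number of casino employees and the crime rate. This does not mean that an increase in the number of casino employees **causes** an increase in the crime rate; correlation alone cannot establish causation.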