# Statistics and Data Journalism
## HODP Bootcamp Week 4
### March 6, 2019

### Goals for this week:
* Learn what a significance test and confidence interval are
* Practice basic linear regression
* Understand the difference between correlation and causation

### Parameters vs. Statistics
* In data journalism, we're often interested in estimating *parameters* of interest, which are fixed but unknown quantities
* We estimate these with *sample statistics*, which are known but random quantities
* Seems very simple but mixing up the two is a common pitfall in journalism
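
To make the distinction concrete, here is a minimal sketch on made-up data (the population, its size, and the survey size of 200 are all assumptions for illustration, not HODP data). The population mean plays the role of the parameter; each simulated survey produces a different sample mean, which is the statistic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 10,000 made-up rankings on a 1-12 scale.
population = rng.integers(1, 13, size=10_000)

# The parameter: one fixed number (which we would not know in practice).
true_mean = population.mean()

# The statistic: every survey of 200 people gives a slightly different value.
sample_means = [
    rng.choice(population, size=200, replace=False).mean() for _ in range(5)
]

print("parameter: ", round(true_mean, 3))
print("statistics:", np.round(sample_means, 3))
```

The printed statistics scatter around the parameter without ever being guaranteed to equal it, which is why an article should report "our survey estimates" rather than "the true value is".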

### Load in the data
Here is a real HODP dataset: our House rankings data. You've worked with it in a previous bootcamp, so it should look familiar. Each row is a House, and each column counts how many respondents gave that House a particular rank from 1 to 12.

In [1]:
import pandas as pd
import numpy as np
rankings = pd.read_csv("house_rankings_2018.csv")
rankings.set_index("House", inplace = True)
rankings

House,1,2,3,4,5,6,7,8,9,10,11,12
Adams,20,15,24,38,37,44,67,75,74,28,32,80
Cabot,5,13,16,17,7,20,16,31,49,118,148,94
Kirkland,19,19,35,50,71,63,72,70,56,24,24,31
Mather,17,15,19,25,27,40,44,67,112,37,55,76
Quincy,28,43,55,90,71,82,65,44,21,17,14,4
Leverett,11,22,40,73,76,81,94,66,36,18,11,6
Dunster,45,67,113,56,70,42,44,52,19,10,11,5
Currier,14,10,16,15,18,19,20,23,43,92,114,150
Eliot,37,57,60,67,57,76,49,40,38,23,16,14
Lowell,152,106,63,51,45,35,22,24,14,5,7,10


## Hypothesis Tests and P-Values
### Do people prefer Adams or Cabot?
We're going to conduct a two-sample t-test using SciPy's built-in t-test function. A one-sample t-test measures how "far" a sample statistic is from a hypothesized "true" value. For example, if I claim that 50% of Harvard students are left-handed, we might run a one-sample test to see whether survey data can refute my claim. A two-sample t-test measures how "far" two sample statistics are from each other; we're usually trying to see whether they are significantly different. Let's use Python to see whether the mean rankings for Adams and Cabot are significantly different. Note that you may need to reshape the data a bit: we want one data point per response. In other words, you want to "expand" Cabot's 5 \#1 rankings into 1, 1, 1, 1, 1, and do the same for every other rank.

In [2]:
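#one entry per response: a rank given 20 times appears 20 times
adams = []
cabot = []

#column i holds the number of people who gave the house rank i + 1
for i in range(0, 12):
    #row 0 of the table is Adams
    for val in np.repeat(i, rankings.values[0, i]):
        adams.append(val + 1)
    #row 1 of the table is Cabot
    for val in np.repeat(i, rankings.values[1, i]):
        cabot.append(val + 1)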

Now that we've cleaned up the data into the format we want, we can pretty simply conduct a t-test. We're testing the null hypothesis that the two houses have the same true average ranking.

In [3]:
from scipy import stats
stats.ttest_ind(adams, cabot, equal_var = False)

Ttest_indResult(statistic=-10.553882150189777, pvalue=8.152088510595662e-25)
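
To see where that statistic comes from, the sketch below recomputes it by hand. With `equal_var = False`, SciPy runs Welch's version of the two-sample t-test, whose statistic is the difference in sample means divided by an estimate of its standard error. The counts are copied from the Adams and Cabot rows of the table above so the block runs without the CSV; the variable names are our own.

```python
import numpy as np
from scipy import stats

ranks = np.arange(1, 13)

# Counts copied from the Adams and Cabot rows of the rankings table.
adams_counts = [20, 15, 24, 38, 37, 44, 67, 75, 74, 28, 32, 80]
cabot_counts = [5, 13, 16, 17, 7, 20, 16, 31, 49, 118, 148, 94]

# Expand counts into one data point per response.
adams = np.repeat(ranks, adams_counts)
cabot = np.repeat(ranks, cabot_counts)

# Welch's t: difference in sample means over its estimated standard error.
se = np.sqrt(adams.var(ddof=1) / len(adams) + cabot.var(ddof=1) / len(cabot))
t_by_hand = (adams.mean() - cabot.mean()) / se

t_scipy = stats.ttest_ind(adams, cabot, equal_var=False).statistic
print(t_by_hand, t_scipy)
```

Both numbers should agree with the statistic reported above. The sign depends only on the argument order: Adams is listed first and has the smaller mean rank, so the statistic is negative.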

### Interpreting the Results
Now, we have a p-value. The p-value is often misunderstood or improperly interpreted, so it's very important to know what it actually means. **A p-value is the probability of observing data at least as extreme as ours, given that the null hypothesis is correct.** So, we assume that our data are random, but the true values are fixed. A common pitfall when interpreting p-values is to think of the p-value as the probability of the null hypothesis being correct, but this interpretation is wrong. Looking at the results above with this in mind, we can see that if Adams and Cabot truly had the same average ranking, there would be an astronomically low probability of getting data this extreme (though it is still possible), so we can pretty reasonably reject the null hypothesis and say that the two houses do not have equal average rankings. Because the test statistic is negative, we can see that Adams has the lower (better) average ranking.

In general, a common rule of thumb is to reject the null hypothesis when the p-value is less than 0.05. When we fail to reject the null hypothesis, we should never say that we "accept" or "prove" the null. All we have done is fail to reject it; we can never prove the null hypothesis true.
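
One way to build intuition for the 0.05 threshold is to simulate a world where the null hypothesis is true and count how often the test rejects anyway. The sketch below is illustrative only: the normal distribution, its mean and spread, the group size of 100, and the 2,000 repetitions are all assumptions, not properties of the HODP data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_trials = 2000
p_values = np.empty(n_trials)

for k in range(n_trials):
    # Both groups come from the SAME distribution, so the null is true.
    a = rng.normal(loc=7.0, scale=3.0, size=100)
    b = rng.normal(loc=7.0, scale=3.0, size=100)
    p_values[k] = stats.ttest_ind(a, b, equal_var=False).pvalue

# Share of simulated "studies" that reject the null at the 0.05 level.
false_positive_rate = np.mean(p_values < 0.05)
print(false_positive_rate)
```

The rate should land near 0.05: even with no real difference, roughly 1 test in 20 will look "significant". That is worth remembering when an analysis runs many comparisons and reports only the ones that cleared the threshold.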

**Now, try finding the significance of the difference between two other houses, and interpret your results.**

In [4]:
#find the significance here

Interpret your results here!

### What about other parameters?

We can also do a ton of other types of comparisons. For example, we can look at proportions and not just sample means, and in a regression, we can examine the significance of the coefficients in the model. We can also do one-sample tests, where we compare our statistic to a set value. For example, if Dean Khurana tells you that he believes 75% of students support the social group sanctions, but your HODP survey suggests that only 65% of students support the sanctions, you can test whether that difference is sufficiently large to dispute Khurana's claim. 
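
As a rough sketch of the one-sample case, the block below tests the 75% claim against a survey showing 65% support. The sample size of 400 respondents is a made-up assumption (the text does not give one), and the test used is a normal-approximation z-test for a proportion rather than a t-test, since the data are yes/no answers.

```python
import numpy as np
from scipy import stats

claimed_p = 0.75        # the claim being tested (null hypothesis)
n_respondents = 400     # hypothetical survey size, assumed for illustration
n_support = 260         # 65% of 400

sample_p = n_support / n_respondents

# Standard error of the sample proportion if the claim were true.
se_null = np.sqrt(claimed_p * (1 - claimed_p) / n_respondents)

z = (sample_p - claimed_p) / se_null
p_value = 2 * stats.norm.sf(abs(z))   # two-sided

print(round(z, 3), p_value)
```

With these assumed numbers the p-value is far below 0.05, so the survey would be strong evidence against the 75% claim. With a much smaller survey the same 10-point gap might not be, which is why the sample size belongs in the article.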

**With the house ranking data from before, formulate another question that involves a test other than a two-sample test of difference in means, and try to answer it below using the data!**

In [5]:
#use the data to answer your question here!

Equally important in data journalism: add an interpretation of your data analysis here!

## Confidence Intervals
You're probably quite familiar with point estimates in statistics, where we say the mean ranking for Cabot is, say, 9.5. A confidence interval is, as the name suggests, an interval: a range of plausible values for the true parameter rather than a single number. These intervals have a certain level of *confidence*, which is the percentage of intervals calculated using this method that contain the true value. There is a trade-off between confidence and width: higher confidence is great, but it also widens the interval, which can make it less useful. For example, you can construct a 100% confidence interval for anything, but it goes from negative infinity to positive infinity and tells us nothing about the parameter of interest.

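In practice, 95% confidence intervals are quite popular, which means that, on average, 19 out of every 20 intervals built this way contain the true value of the parameter of interest. A common pitfall with confidence intervals is to say there is a 95% chance that the true value falls inside a particular interval. This interpretation is incorrect: the true value is fixed, so once an interval has been computed it either contains the true value or it doesn't. The 95% describes the method, not any single interval, so please stay away from this phrase in your articles.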
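
The "19 out of every 20" reading can be checked by simulation. This is a sketch on made-up data: the true mean of 7.5, the spread, the sample size of 200, and the 1,000 repetitions are all assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_mean = 7.5          # known here only because we invented the population
n, n_intervals = 200, 1000

covered = 0
for _ in range(n_intervals):
    sample = rng.normal(loc=true_mean, scale=3.0, size=n)
    # 95% interval centered on the sample mean, scaled by the standard error.
    low, high = stats.norm.interval(0.95, loc=sample.mean(), scale=stats.sem(sample))
    covered += int(low <= true_mean <= high)

coverage = covered / n_intervals
print(coverage)
```

The coverage should come out close to 0.95. Each individual interval either caught the true mean or missed it; the 95% is a property of the procedure across many repetitions.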

Now, let's construct a 95% confidence interval for the average ranking for Adams House. We have to do a bit more work to generate a confidence interval in Python. We need to provide SciPy's interval function with a confidence level, a center (the sample mean), and a scale (the standard error, which is the sample standard deviation divided by the square root of the sample size). By default, `np.std` divides by n rather than n - 1, so the code below divides its result by the square root of one less than the sample size instead, which gives the same number.

In [6]:
mean, sigma = np.mean(adams), np.std(adams)

conf_int = stats.norm.interval(0.95, 
                               loc = mean, 
                               scale = sigma/np.sqrt(len(adams) - 1))
conf_int

(7.325386121402705, 7.843153204439991)
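
The standard error in that cell can be written a few equivalent ways, which the sketch below compares. The Adams counts are copied from the table so the block runs on its own; `scipy.stats.sem` is SciPy's standard-error helper and uses the n - 1 denominator by default.

```python
import numpy as np
from scipy import stats

# Adams counts copied from the rankings table, expanded to one point per response.
adams = np.repeat(np.arange(1, 13), [20, 15, 24, 38, 37, 44, 67, 75, 74, 28, 32, 80])
n = len(adams)

se_cell = np.std(adams) / np.sqrt(n - 1)           # form used in the cell above
se_textbook = np.std(adams, ddof=1) / np.sqrt(n)   # sample SD over sqrt(n)
se_scipy = stats.sem(adams)                        # SciPy helper

low, high = stats.norm.interval(0.95, loc=adams.mean(), scale=se_scipy)
print(se_cell, se_textbook, se_scipy)
print(low, high)
```

All three standard errors match, and the resulting interval reproduces the one printed above.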

Now, try generating an interval for the Cabot data using a different confidence level, and interpret your results.

In [7]:
#generate the interval here

Interpret your results here!

## Causation vs. Correlation
You've all probably heard this before, but it's important to hear it again: correlation does not imply causation. It's pretty rare that we'll be able to show causation in a HODP article, so it's important to frame most of our work as a correlation or trend we noticed, rather than as a direct cause. Often, though, it will intuitively make sense that there "should" or at least "could" be a causal connection. In those cases, make sure to frame your writing as a "possible explanation" rather than as a statement of what is going on. For example, the percentage of female concentrators by department is likely strongly correlated with the percentage of female faculty members, and there is probably some causal effect here. However, it's best to cite other research on whether such a trend has a causal effect, or to cite relevant quantitative work. For example, in an article about gender balance in different departments, we could talk about existing research on the effect of faculty gender on students and potentially cite relevant Crimson articles, but we should not conclude that (for example) low female faculty presence in Mathematics *causes* low female student presence in Mathematics.
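
A small simulation shows how a strong correlation can arise with no causal link at all. Everything here is invented for illustration: a hypothetical lurking variable (temperature) drives two outcomes (ice cream sales and sunburns) that have no effect on each other.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# Hypothetical lurking variable that drives both outcomes.
temperature = rng.normal(60, 15, size=n)
ice_cream = 2.0 * temperature + rng.normal(0, 10, size=n)
sunburns = 0.5 * temperature + rng.normal(0, 5, size=n)

# Raw correlation between two variables that never influence each other.
r_raw = np.corrcoef(ice_cream, sunburns)[0, 1]

def residuals(y, x):
    """Return what is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation after accounting for temperature in both variables.
r_adjusted = np.corrcoef(
    residuals(ice_cream, temperature), residuals(sunburns, temperature)
)[0, 1]

print(round(r_raw, 3), round(r_adjusted, 3))
```

The raw correlation is strong, but it nearly vanishes once temperature is accounted for. In real reporting we rarely know every lurking variable, which is one reason to describe a correlation as a trend rather than a cause.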

For people who are particularly interested in causation, talk to Stephen or look at Stat 111 (Statistical Inference), Stat 186 (Causal Inference), and Ec 1123 (Econometrics).

Let's look at some examples of how we might be able to find correlations that are likely not causal. This will also show you how to find a correlation coefficient. If all you want is the correlation, it's very easy.

In [8]:
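from scipy.stats import pearsonr
#monthly high temps in Boston
bostontemps = [37, 39, 46, 57, 67, 77, 82, 81, 73, 62, 52, 42]
#Leverett is row 5 of the table; it has 12 rank counts, one per "month" here
levCounts = list(rankings.values[5, :])
#returns the correlation coefficient and its p-value
pearsonr(levCounts, bostontemps)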

(0.7635806435842531, 0.0038495884100404063)
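
For reference, the correlation coefficient is just the co-variation of the two lists scaled by their individual variation. The sketch below recomputes it from that definition, with the Leverett counts copied from the table so the block runs without the CSV (the variable names are our own).

```python
import numpy as np

bostontemps = np.array([37, 39, 46, 57, 67, 77, 82, 81, 73, 62, 52, 42])
# Leverett row of the rankings table.
lev_counts = np.array([11, 22, 40, 73, 76, 81, 94, 66, 36, 18, 11, 6])

# Deviations of each value from its own mean.
dx = bostontemps - bostontemps.mean()
dy = lev_counts - lev_counts.mean()

# Pearson's r: summed co-deviation over the geometric mean of the squared deviations.
r_by_hand = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
r_numpy = np.corrcoef(bostontemps, lev_counts)[0, 1]

print(r_by_hand, r_numpy)
```

The result matches the first number above. The correlation is strong and "significant", yet Boston's weather obviously has nothing to do with how many people gave Leverett each rank: both lists simply rise and then fall.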

## Basic Regression
Regression is a very useful tool for prediction. Linear regressions allow us to easily model a linear relationship between a response/dependent/Y variable and one or more predictor/independent/X variables. This is a very widely used technique, so if you plan to use regression in your project, please come talk to us for a more in-depth treatment of the subject, but here are the basics! Regressions in Python are fairly easy to do: we just need a Y list, and at least one X list of equal length! Below, we've built a regression based on the same Leverett and temperature data from above. Note that you may sometimes need to reshape data a bit.

In [25]:
from sklearn import linear_model
lm = linear_model.LinearRegression()
bostontemps = np.array(bostontemps).reshape(-1, 1)

#X, Y is the order
reg = lm.fit(bostontemps, levCounts)
[reg.intercept_, reg.coef_]

[-42.96885688068292, array([1.46800879])]
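
With a single predictor, the fitted line has a closed form, which makes the two numbers above easier to read: the slope is the covariance of X and Y divided by the variance of X, and the line passes through the point of means. The sketch below uses NumPy only (swapped in for scikit-learn so it stays self-contained) and the same data as above; the 70-degree input is an arbitrary example.

```python
import numpy as np

bostontemps = np.array([37, 39, 46, 57, 67, 77, 82, 81, 73, 62, 52, 42])
# Leverett row of the rankings table.
lev_counts = np.array([11, 22, 40, 73, 76, 81, 94, 66, 36, 18, 11, 6])

# Slope: how much Y changes, on average, per one-unit change in X.
slope = np.cov(bostontemps, lev_counts, ddof=1)[0, 1] / np.var(bostontemps, ddof=1)

# Intercept: chosen so the line passes through (mean of X, mean of Y).
intercept = lev_counts.mean() - slope * bostontemps.mean()

# A prediction is just intercept + slope * x.
predicted_at_70 = intercept + slope * 70

print(intercept, slope, predicted_at_70)
```

The intercept and slope agree with the scikit-learn output. Note the negative intercept: the line predicts a negative count at 0 degrees, a reminder that a fitted line says little outside the range of the data.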

### A more useful model
The model above is probably not very useful. On your own, try fitting a model to predict number of first-choice votes each house received, with at least two predictor variables. I've given you two possible variables below, though you're welcome to find more, or different ones. Again, note that you may need to reshape data.

*Hint: You still need to find the dependent variable, and structure it like the `levCounts` variable above.*
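
If the mechanics of a two-predictor fit are unclear, here is a sketch on made-up data that leaves the exercise itself for you. The predictors, the "true" coefficients (50, -2, 15), and the noise level are all invented, and the fit uses NumPy's least-squares solver instead of scikit-learn so the block stays self-contained.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200

x1 = rng.uniform(0, 20, size=n)     # hypothetical numeric predictor
x2 = rng.integers(0, 2, size=n)     # hypothetical 0/1 indicator
# Response built from known coefficients plus noise, so we can check the fit.
y = 50 - 2.0 * x1 + 15.0 * x2 + rng.normal(0, 3, size=n)

# Design matrix: a column of ones for the intercept, then one column per predictor.
design = np.column_stack([np.ones(n), x1, x2])
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, b1, b2 = coefs

# Predicted response for x1 = 2 and x2 = 1.
prediction = np.array([1.0, 2.0, 1.0]) @ coefs

print(intercept, b1, b2, prediction)
```

The recovered coefficients land close to the values used to generate the data. Each one is read as "holding the other predictor fixed": `b2` is the estimated difference between the two groups at the same value of `x1`.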

In [1]:
import numpy as np

#walking time from Widener Library (in minutes), from Google Maps
dist = [2, 15, 7, 8, 3, 5, 7, 16, 7, 2, 17, 6]
#was the house renovated in last 10 years? 1 if true
renovated = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1]

#one row per house, one column per predictor
X = np.column_stack([dist, renovated])

#find the response variable and format it properly
#fit the model
#print out the coefficients and intercept

### Prediction
Finally, one of the most useful things we can do with a predictive model is make predictions! Assuming you called your two-predictor model `reg`, use the command below to predict the number of first-choice votes for Adams House once it counts as renovated (a 2-minute walk from Widener, with `renovated` set to 1).

In [30]:
reg.predict(np.array([[2, 1]]))

array([[86.05683956]])