This is an exercise in **what not to do!**

# The shocking case of junk food and acne

You have been asked to investigate a link between junk food and acne.

You have been given a dataset, `junkfood.csv`, which contains three columns. The first is a numeric column holding a self-reported value from 0 to 10 for "How badly do you present with acne?", with 0 being not at all and 10 being ever-present and severe. The second describes each person's most significant source of junk food (i.e. do they most often eat chocolate, drink soft drink, or something else). The third is an estimate of how many times a month they eat junk food.

In [11]:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("junkfood.csv")
df.head()

Unnamed: 0,Acne,Food,Frequency
0,5,Burgers,21
1,6,Chocolate,26
2,5,Ice cream,17
3,1,Ice cream,1
4,0,Cake,2


Let's check for a global link

In [12]:
# Food is a text column, so restrict the correlation to the numeric columns
df.corr(numeric_only=True)

Unnamed: 0,Acne,Frequency
Acne,1.0,0.018163
Frequency,0.018163,1.0


Looks like there is a very small positive correlation (about 0.018) between frequency of eating junk food and having bad acne. Let's look at a subset to compare extremes.

In [13]:
print(df.loc[df["Acne"] < 3, "Frequency"].mean())
print(df.loc[df["Acne"] > 7, "Frequency"].mean())

14.5357833655706
14.878205128205128


Only looks like a small relationship.

Let's state the null hypothesis a bit better. 

Null hypothesis: *The frequency of eating junk food is not correlated with acne.*

To test this we can use Pearson's correlation test in scipy.

In [14]:
from scipy.stats import pearsonr
correlation, pvalue = pearsonr(df["Acne"], df["Frequency"])
print(f"Correlation is {correlation:.3f}, with p-value of {pvalue:0.4f}")

Correlation is 0.018, with p-value of 0.5662


Not a significant p-value, so we cannot reject the null hypothesis.

But wait, what if we ask whether any one food is correlated? Let's check each food item.

In [18]:
foods = df["Food"].unique()
alpha = 0.05
for f in foods:
    df2 = df.loc[df["Food"] == f]
    correlation, pvalue = pearsonr(df2["Acne"], df2["Frequency"])
    if pvalue < alpha:
        print(f"Significant correlation (pvalue {pvalue:0.4f}) between {f}: {correlation:.3f}!!")
    else:
        print(f"No significant correlation for {f}")

No significant correlation for Burgers
No significant correlation for Chocolate
No significant correlation for Ice cream
Significant correlation (pvalue 0.0326) between Cake: 0.229!!
No significant correlation for Donuts
No significant correlation for Lollies
No significant correlation for Pure Sugar
No significant correlation for Cheese Pizza
No significant correlation for Brownies
No significant correlation for Milkshakes
No significant correlation for Soft Drink


**DAMMIT CAKE!**

This is just unimaginable. How will the world take this devastating news?

****

Right, so this is an example of significance hunting. By drilling down into our data when we initially don't find results, and testing a large number of hypotheses, we are bound to eventually call something statistically significant. In this case the data was generated without any correlation at all, so the cake result is just a statistical fluctuation. But in an effort to find results we could publish an article claiming a statistically significant correlation between cake and acne, the media would run with it, and it's all junk science.
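
To put a rough number on "bound to eventually", here is a minimal sketch. It assumes the 11 per-food tests are independent and that every null hypothesis is true, which is how this data was generated. Under those assumptions each p-value is uniformly distributed on [0, 1], so the chance of at least one false positive among $m$ tests is $1 - (1 - \alpha)^m$. The function names `family_wise_error_rate` and `simulate_false_positive_rate` are illustrative and not part of the original notebook.

```python
import random


def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive among independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


def simulate_false_positive_rate(alpha: float, n_tests: int,
                                 n_trials: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo check. Assumption: under a true null a p-value is
    uniform on [0, 1], so each test is modelled as one uniform draw."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        p_values = [rng.random() for _ in range(n_tests)]
        # A trial counts as a "discovery" if any single test looks significant
        if min(p_values) < alpha:
            hits += 1
    return hits / n_trials


alpha = 0.05
n_foods = 11  # number of food categories tested above

print(f"Analytic:  {family_wise_error_rate(alpha, n_foods):.3f}")
print(f"Simulated: {simulate_false_positive_rate(alpha, n_foods):.3f}")
```

Both numbers should come out at roughly 0.43, so there is about a 43% chance of finding at least one "significant" food even when nothing is going on.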

*The moral of the story here is that you should pick your value of $\alpha$ (in our case, 0.05) based on the number of tests you run, to try and minimise false positives.*
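
One simple way to do this is the Bonferroni correction, which divides $\alpha$ by the number of tests. This is a sketch of one common choice rather than the only option; the helper name `bonferroni_alpha` is made up for illustration, and the only p-value used is the one printed for Cake above.

```python
def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the overall chance of any
    false positive at or below alpha."""
    return alpha / n_tests


alpha = 0.05
n_foods = 11          # number of per-food tests run above
cake_pvalue = 0.0326  # the p-value reported for Cake

threshold = bonferroni_alpha(alpha, n_foods)
print(f"Corrected threshold: {threshold:.4f}")
if cake_pvalue < threshold:
    print("Cake is still significant after correction")
else:
    print("Cake is no longer significant after correction")
```

With 11 tests the per-test threshold drops to about 0.0045, so the Cake result no longer counts as significant.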

*Also*, keep in mind the next time you see media reports about a correlation between two things (chocolate and heart disease, wine and longevity, etc.) that having enough data to detect a correlation with statistical significance doesn't mean the *amount* of correlation is meaningful.

With a large enough data sample, you can detect a correlation of 0.0001 with statistical significance.
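
As a rough illustration of that claim, the sketch below approximates the two-sided p-value for a fixed correlation of 0.0001 at several sample sizes. It uses the Fisher z-transformation with a normal approximation, which is an assumption made here to keep the example dependency-free; it is not the exact calculation that `pearsonr` performs, and `approx_pearson_pvalue` is a hypothetical helper name.

```python
import math


def approx_pearson_pvalue(r: float, n: int) -> float:
    """Approximate two-sided p-value for a Pearson correlation r from
    n samples, via the Fisher z-transformation (normal approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    # Two-sided tail probability of a standard normal at |z|
    return math.erfc(abs(z) / math.sqrt(2))


r = 0.0001
for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"n = {n:>13,}: p = {approx_pearson_pvalue(r, n):.4f}")
```

Of these three sample sizes, only the billion-sample case falls below 0.05, even though the correlation itself is the same negligible 0.0001 every time.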

And to end on a fun note, [here is a relevant xkcd which inspired the acne link in this example](https://xkcd.com/882/).