<a target="_blank" rel="noopener noreferrer" href="https://colab.research.google.com/github/center-for-computational-psychiatry/course_spice/blob/master/modules/module-07_t-tests.ipynb">![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)</a>

# Hypothesis Testing
This tutorial was inspired by and adapted from Shawn A. Rhoads' [PSYC 347 Course](https://shawnrhoads.github.io/gu-psyc-347/) [[CC BY-SA 4.0 License](https://creativecommons.org/licenses/by-sa/4.0/)] and Russell Poldrack's [Statistical Thinking for the 21st Century](https://statsthinking21.github.io/statsthinking21-core-site/) [[CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)].

## Learning objectives

This notebook is intended to teach you basic Python syntax for:

1. Independent samples t-tests
2. Paired samples t-tests

## Independent samples t-tests

Now that we have the basics of Python syntax, data processing, and data visualization down, we can move on to hypothesis testing. Hypothesis testing is a way to test whether a certain effect is present in a population. For example, we might want to test whether a certain drug is effective in treating a psychiatric condition. We can do this by comparing the mean of a certain measure (e.g., symptom severity) in a group of people who took the drug to the mean of the same measure in a group of people who did not take the drug (a good design would also use a [randomized controlled trial](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6235704/) to assign people to groups). If the mean of the measure is significantly lower in the group that took the drug, we have evidence that the drug is effective in treating the condition.

We will apply this framework to compare means of two groups using [independent samples t-tests](https://www.pythonfordatascience.org/independent-samples-t-test-python/).

In [1]:
# import our packages
import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sns
from scipy.stats import ttest_ind

Recall our data from the visualizations module. We explored the relationship between Depression and Loneliness. We found that there was a positive correlation between the two variables: as Depression increased, Loneliness also increased. However, we did not test whether this relationship was statistically significant. If we split participants into High and Low Depression groups, an independent samples t-test will allow us to test whether the difference in mean Loneliness between the two groups is statistically significant.

We will use the `ttest_ind()` function from the `scipy.stats` module to perform the t-test. This function takes two arrays as input and returns the t-statistic and p-value. The t-statistic is a measure of the difference between the two group means relative to the variability within the groups. The p-value is the probability of observing a difference between the two groups at least as large as the one we observed if the null hypothesis is true. The null hypothesis is that there is no difference between the two groups. If the p-value is less than 0.05 (a conventional threshold), we can reject the null hypothesis and conclude that there is a significant difference between the two groups.
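To make the description of the t-statistic concrete, here is a minimal sketch that computes it by hand and compares the result with `ttest_ind()`. The scores are synthetic and generated only for illustration (they are not the course dataset), and the variable names are hypothetical. The sketch assumes the default settings of `ttest_ind()`: equal variances in both groups and a two-sided test.

```python
import numpy as np
from scipy.stats import ttest_ind, t as t_dist

# Synthetic scores for illustration only (not the course dataset)
rng = np.random.default_rng(0)
a = rng.normal(loc=50, scale=10, size=40)
b = rng.normal(loc=45, scale=10, size=60)

# Degrees of freedom for the equal-variance t-test
n_a, n_b = len(a), len(b)
df = n_a + n_b - 2

# Pooled variance: the two sample variances, weighted by their degrees of freedom
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / df

# t-statistic: difference in means divided by the standard error of that difference
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided p-value: probability of a |t| at least this large under the null
p_manual = 2 * t_dist.sf(abs(t_manual), df)

# Compare with scipy's implementation
t_scipy, p_scipy = ttest_ind(a, b)
print(f't: manual={t_manual:.4f}, scipy={t_scipy:.4f}')
print(f'p: manual={p_manual:.4g}, scipy={p_scipy:.4g}')
```

The two sets of values should agree, which shows that the function is doing nothing more mysterious than dividing a mean difference by its standard error.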

In [10]:
our_data = pd.read_csv('https://raw.githubusercontent.com/Center-for-Computational-Psychiatry/course_spice/main/modules/resources/data/Banker_et_al_2022_QuestionnaireData_clean.csv')

# define groups
our_data['DepressionGroups'] = pd.cut(our_data['Depression'], 2, labels=['Low', 'High'])
group1 = our_data[our_data['DepressionGroups']=='High']
group2 = our_data[our_data['DepressionGroups']=='Low']

# t-test
t, p = ttest_ind(group1['Loneliness'], group2['Loneliness'])

# report the result regardless of the p-value (degrees of freedom = n1 + n2 - 2)
print(f'T({len(group1) + len(group2) - 2})={t:.2f}, p={p}')

T(1088)=17.21, p=7.0949774878784104e-59


We can see that the p-value is far below 0.05, so we can reject the null hypothesis and conclude that there is a significant difference between the two groups. We can also see that the t-statistic is positive. Because we passed the High Depression group as the first argument, a positive t-statistic means that the mean Loneliness score is higher in the High Depression group than in the Low Depression group. This is consistent with our visualization from the previous module.
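Before trusting a result like this, it can help to inspect the groups themselves. `pd.cut()` with an integer number of bins splits the observed range into equal-width intervals, so the resulting groups may differ in size and spread. The sketch below uses synthetic, right-skewed data (an assumption for illustration, not the course dataset; `demo` and the other names are hypothetical) to show the group summaries and to compare the default test with Welch's t-test (`equal_var=False`), which does not assume equal variances.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Synthetic, right-skewed scores for illustration only (not the course dataset)
rng = np.random.default_rng(1)
depression = rng.gamma(shape=2.0, scale=5.0, size=500)
loneliness = 30 + 0.8 * depression + rng.normal(0, 8, size=500)
demo = pd.DataFrame({'Depression': depression, 'Loneliness': loneliness})

# An integer bin count splits the observed range into equal-width intervals,
# so the groups are not guaranteed to contain the same number of people
demo['DepressionGroups'] = pd.cut(demo['Depression'], 2, labels=['Low', 'High'])

# Group sizes, means, and standard deviations
summary = demo.groupby('DepressionGroups', observed=True)['Loneliness'].agg(['count', 'mean', 'std'])
print(summary)

high = demo.loc[demo['DepressionGroups'] == 'High', 'Loneliness']
low = demo.loc[demo['DepressionGroups'] == 'Low', 'Loneliness']

# Default (equal-variance) t-test versus Welch's t-test
t_student, p_student = ttest_ind(high, low)
t_welch, p_welch = ttest_ind(high, low, equal_var=False)
print(f'Student: t={t_student:.2f}, p={p_student:.3g}')
print(f'Welch:   t={t_welch:.2f}, p={p_welch:.3g}')
```

If the two tests lead to the same conclusion, the equal-variance assumption is probably not driving the result. When group sizes and variances differ a lot, Welch's version is often the safer choice.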

**Easy, right? But what does this all mean?**

The p-value is the probability of observing a difference between the two groups at least as large as the one we observed, assuming the null hypothesis is true. The null hypothesis is that there is no difference between the two groups. A small p-value means the observed data would be unlikely under the null hypothesis. It is not the probability that the null hypothesis is true, and `1 - p` is not our confidence that the effect is real. To make a decision, we compare the p-value to a significance level (alpha) chosen in advance. If the p-value is less than 0.05, we reject the null hypothesis at the 0.05 level, which means that if the null hypothesis were true, we would wrongly reject it about 5% of the time. Stricter thresholds such as 0.01 or 0.001 lower that false positive rate to 1% or 0.1%.
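One way to see what a 0.05 threshold controls is to simulate many experiments in which the null hypothesis really is true. The sketch below is a hedged illustration on synthetic data: both groups are drawn from the same normal distribution, so every "significant" result is a false positive. The number of simulations and the group size are arbitrary choices.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n_sims, n_per_group, alpha = 5000, 50, 0.05

# Each row is one simulated experiment; both groups come from the same
# population, so the null hypothesis is true in every experiment
a = rng.normal(0, 1, size=(n_sims, n_per_group))
b = rng.normal(0, 1, size=(n_sims, n_per_group))

# Run one t-test per row
_, p_values = ttest_ind(a, b, axis=1)

# Proportion of experiments that reject the null even though it is true
false_positive_rate = np.mean(p_values < alpha)
print(f'Proportion of p-values below {alpha}: {false_positive_rate:.3f}')
```

The printed proportion should land close to 0.05. That is what the threshold sets: how often we would reject a true null hypothesis, not how likely it is that any particular finding is real.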

## Let's do an even deeper dive

