# COMPARING TWO SAMPLES

> by Dr Juan H Klopper

- Research Fellow
- School for Data Science and Computational Thinking
- Stellenbosch University

## INTRODUCTION

In this notebook we build on our understanding of sampling distributions by investigating a few examples. Here, we only have the data from one study and, as is most often the case, we do not have access to the whole population.

We will still build sampling distributions of test statistics under a null hypothesis to put the test statistic of our data in perspective.

## PACKAGES USED IN THIS NOTEBOOK

In [None]:
# Data table configuration for Colab
%load_ext google.colab.data_table

In [None]:
%config InlineBackend.figure_format = "retina" # For Retina type displays

In [None]:
from google.colab import drive # For connecting to our Google Drive

In [None]:
import pandas as pd # Data import and manipulation
import numpy as np # Numerical computing
from scipy import stats # Statistics module

In [None]:
# Data visualisation
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
pio.templates.default = 'plotly_white'

## DATA IMPORT

We follow the now familiar process of importing data from our Google Drive, stored in a subdirectory of the one in which this notebook resides.

In [None]:
# Log on and list files in the DATA directory of your Google Drive
drive.mount('/gdrive')
%cd '/gdrive/My Drive/DATA SCIENCE/DATA'

In [None]:
df = pd.read_csv('data.csv')  # Importing the CSV file

In [None]:
df  # Printing the DataFrame to the screen

In [None]:
df.shape  # Number of patients and variables

There are $200$ observations and $13$ variables.

## COMPARING THE DISTRIBUTION OF A NUMERICAL VARIABLE BETWEEN TWO INDEPENDENT GROUPS

Comparing the distribution of data values for a continuous numerical variable between two groups requires splitting the data along the sample space of one of the categorical variables (or a numerical variable from which bins have been created). We consider an appropriate research question to investigate the comparison of two distributions.

RESEARCH QUESTION: _Is there a difference in heart rate values (the `HR` variable) between the Active group and the Control group (the `Group` variable)?_

Note how the groups are formed by the sample space elements of a nominal categorical variable. The two sample space elements, `Active` and `Control`, are also independent of each other.

In this example we are going to make use of numpy arrays to hold the heart rate values for each group.

In [None]:
# Array for heart rates of control participants
hr_control = df[df.Group == 'Control']['HR'].to_numpy()

# Array for heart rate of active participants
hr_active = df[df.Group == 'Active']['HR'].to_numpy()

Our analysis is, as always, preceded by the use of descriptive statistics and visualisation.

### DESCRIPTIVE STATISTICS

We can group by the `Group` variable and then use the `describe` method to return summary statistics of the `HR` variable. Remember that the `groupby` function is used to generate groups according to the sample space elements of a variable.

In [None]:
df.groupby('Group')['HR'].describe()

### VISUALISATION

Since we have a continuous numerical variable, a box plot will provide a good visual summary of the data.

In [None]:
px.box(df,
       y = 'HR',
       color = 'Group',
       title='Heart rate distribution among treatment groups')

### COMPARING THE TWO SAMPLES

Our __null hypothesis__ states that there is no difference in the distribution of heart rate between the two groups. It can be denoted by $H_{0}: \text{HR}_{\text{Active}} = \text{HR}_{\text{Control}}$.

Our __alternative hypothesis__ states that the heart rate in the `Active` group is different from the heart rate in the `Control` group. It can be denoted by $H_{\alpha}: \text{HR}_{\text{Active}} \ne \text{HR}_{\text{Control}}$.

A good test statistic to compare this continuous numerical variable between the groups is the difference in means.

In [None]:
# Our test statistic
np.mean(hr_active) - np.mean(hr_control)  # Difference in group means

Remember that this could also be $-4.45$, had we subtracted in the opposite order.

In the previous notebook, we used formal statistical tests to compare these means. Here, we expand our understanding of hypothesis testing using a more intuitive approach than statistical tests. We will use a statistical test at the end to verify our results.

In our next step, we have to calculate the distribution of our test statistic under the null hypothesis. In practical terms, this amounts to a random reallocation of group status. There are $200$ observations in the dataset, with $100$ cases in each group. We randomly reassign a group to each heart rate. Doing this repeatedly and collecting the difference in means each time yields a sampling distribution of mean differences (our test statistic). This is done $10000$ times below, populating a list object, `mean_stat`, with $10000$ mean differences.

We can do this because, under the null hypothesis, group membership has no bearing on heart rate, so every reassignment of the values is equally plausible.

In [None]:
mean_stat = []

for i in range(10000):
  grouping = np.random.choice(df.HR, size=(100, 2), replace=False)
  groupI = np.mean(grouping[0:100, 0])
  groupII = np.mean(grouping[0:100, 1])
  mean_stat.append(groupI - groupII)

We look at a histogram of the sampling distribution of means (difference in means) and our original difference.

In [None]:
go.Figure(
    data=go.Histogram(
        x=mean_stat,
        name='Mean differences'
    )
).add_trace(go.Scatter(
    x=[4.45, 4.45],
    y=[0, 140],
    mode='lines',
    name='Original difference'
)).update_layout(
    title='Distribution of difference in means',
    xaxis={'title':'Difference'},
    yaxis={'title':'Frequency'}
)

We note that the difference in means of our study is a rare finding indeed in relation to the distribution of differences in means under the null hypothesis. Since Python treats a `False` value as $0$ and a `True` value as $1$, we can sum over the `True` and `False` values to count the number of values in the `mean_stat` list _above_ our data's value of $4.45$, i.e. those that are larger. Dividing by $10000$ expresses this count as a fraction.

In [None]:
np.sum(np.array(mean_stat) > 4.45) / 10000
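As a small illustration of the Boolean arithmetic described above, the sketch below uses a made-up array (`values` and `threshold` are illustrative names and numbers, not taken from the dataset). Taking the mean of the Boolean array gives the same fraction as summing and dividing by the number of values.

```python
import numpy as np

# Hypothetical values, for illustration only
values = np.array([1.2, -0.5, 3.1, 4.8, -2.0, 5.3, 0.7, 4.9])
threshold = 4.45

# Boolean array: True where a value exceeds the threshold
above = values > threshold

# True counts as 1 and False as 0, so the sum is a count
fraction_sum = np.sum(above) / len(values)

# The mean of the Boolean array is the same fraction
fraction_mean = np.mean(above)

print(above)
print(fraction_sum, fraction_mean)
```

Either form works. The mean avoids hard-coding the number of resamples in the denominator.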

Since we could also have subtracted in the opposite order, we also need to consider the fraction of sampling distribution values less than $-4.45$, as we can see from the histogram below.

In [None]:
go.Figure(
    data=go.Histogram(
        x=mean_stat,
        name='Mean differences'
    )
).add_trace(go.Scatter(
    x=[4.45, 4.45],
    y=[0, 140],
    mode='lines',
    name='Original difference'
)).add_trace(go.Scatter(
    x=[-4.45, -4.45],
    y=[0, 140],
    mode='lines',
    name='Reverse order subtraction'
)).update_layout(
    title='Distribution of difference in means',
    xaxis={'title':'Difference'},
    yaxis={'title':'Frequency'}
)

We use the same method to calculate the fraction _below_ as we did with the _above_ calculation.

In [None]:
np.sum(np.array(mean_stat) < -4.45) / 10000

We add these values to get $0.0044 + 0.0041 = 0.0085 \approx 0.009$.

This is a simulated _p_ value for our study. It is much smaller than a chosen $\alpha$ value of $0.05$ and we can reject the null hypothesis.
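Written out, the simulated _p_ value is the fraction of the reassignments whose difference in means is more extreme than the observed one, in either direction. The symbols $B$ (number of reassignments), $d_{i}$ (the difference in means for reassignment $i$), and $d_{\text{obs}}$ (the observed difference) are notation introduced here for illustration. The strict inequalities match the code cells above.

$$p_{\text{sim}} = \frac{\#\{i : d_{i} > |d_{\text{obs}}|\} + \#\{i : d_{i} < -|d_{\text{obs}}|\}}{B}$$

With $B = 10000$ and $|d_{\text{obs}}| = 4.45$, this is the sum of the two fractions calculated above.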

Let's recap. Since we don't have access to the complete population, we used a technique of reassignment on our known data under the null hypothesis that there is no difference between the groups. From this we built a sampling distribution. We looked at how often the sampling distribution values were more extreme than our finding, in either direction (with subtraction in either order).
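The recap can be condensed into a single helper. The following is a minimal sketch, not the notebook's own code: the function name `permutation_p_value`, its parameters, and the generated data are all assumptions made for illustration. It also counts values that are _at least as_ extreme (using `>=`), a slight variation on the strict comparisons used above.

```python
import numpy as np

def permutation_p_value(group_a, group_b, n_resamples=10000, seed=None):
    """Two-sided simulated p value for a difference in means (illustrative helper)."""
    rng = np.random.default_rng(seed)
    observed = np.mean(group_a) - np.mean(group_b)
    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)

    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        # Random reassignment of group status under the null hypothesis
        shuffled = rng.permutation(pooled)
        diffs[i] = np.mean(shuffled[:n_a]) - np.mean(shuffled[n_a:])

    # Fraction of reassignments at least as extreme as the observed difference
    return np.mean(np.abs(diffs) >= np.abs(observed))

# Made-up data, not the study data
data_rng = np.random.default_rng(1)
sample_a = data_rng.normal(loc=75, scale=10, size=100)
sample_b = data_rng.normal(loc=70, scale=10, size=100)

p_sim = permutation_p_value(sample_a, sample_b, n_resamples=2000, seed=2)
print(p_sim)
```

Working with absolute values handles both tails in one step, which is equivalent to adding the fraction above the positive difference to the fraction below the negative difference.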

Just to confirm, we also use Student's _t_ to calculate a _p_ value.

### COMPARING MEANS WITH STUDENT'S _t_ TEST

The _t_ test for independent groups compares the mean values of the variable in each group. The _t_ distribution uses the degrees of freedom parameter (as we saw in the previous notebook).
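For reference, the equal-variance test statistic can be written as below, where $\bar{x}_{1}$ and $\bar{x}_{2}$ are the group means, $s_{1}^{2}$ and $s_{2}^{2}$ are the sample variances, and $n_{1}$ and $n_{2}$ are the group sizes (notation introduced here for illustration).

$$t = \frac{\bar{x}_{1} - \bar{x}_{2}}{s_{p} \sqrt{\frac{1}{n_{1}} + \frac{1}{n_{2}}}}, \qquad s_{p}^{2} = \frac{(n_{1} - 1) s_{1}^{2} + (n_{2} - 1) s_{2}^{2}}{n_{1} + n_{2} - 2}$$

The degrees of freedom are $n_{1} + n_{2} - 2$, which gives $198$ for two groups of $100$ participants each.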

The `ttest_ind` function from the stats module in the scipy package can perform this test. It returns the _t_ statistic and a (two-tailed) _p_ value. Our alternative hypothesis is two-sided, so we use this _p_ value as is. For a one-tailed test in the direction of the observed difference, we would divide this _p_ value by $2$.

In [None]:
t_stat, p_val = stats.ttest_ind(hr_active,
                                hr_control)
p_val

This is very close to the _p_ value that we calculated above.
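The one-tailed case mentioned above can also be requested directly. The sketch below assumes SciPy version 1.6 or later, where `ttest_ind` accepts an `alternative` argument, and uses made-up arrays (`group_a` and `group_b` are illustrative, not the study data).

```python
import numpy as np
from scipy import stats

# Made-up data, not the study data
rng = np.random.default_rng(3)
group_a = rng.normal(loc=75, scale=10, size=100)
group_b = rng.normal(loc=70, scale=10, size=100)

# Two-sided test (the default)
t_two, p_two = stats.ttest_ind(group_a, group_b)

# One-sided test: the alternative is that the mean of group_a is greater
t_one, p_one = stats.ttest_ind(group_a, group_b, alternative='greater')

print(t_two, p_two)
print(t_one, p_one)
```

The _t_ statistic is the same in both calls. When it is positive, the one-sided _p_ value is half the two-sided one. When it is negative, the one-sided _p_ value is one minus half the two-sided one.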

We see a _t_ statistic and a _p_ value for this _t_ statistic. A visual representation is given below, where the critical _t_ statistic (representing $2.5$% (below) and $2.5$% (above) of the total area under the curve) is in orange and the _t_ statistic for our data is in blue.

In [None]:
t_stat

In [None]:
t_vals = np.linspace(-3, 3, 200)  # Generating some values for the x-axis
t_pdf_vals = stats.t.pdf(t_vals, 198)  # Calculating the PDF value for each of the x-axis values

t_dist_fig = go.Figure()

t_dist_fig.add_trace(go.Scatter(x=t_vals,
                                y=t_pdf_vals,
                                mode='lines',
                                name='t distribution'))

t_dist_fig.update_layout(title="Student's t test",
                         xaxis=dict(title='t values'),
                         yaxis=dict(title='Distribution'))

t_dist_fig.add_trace(go.Scatter(
    x=[t_stat, t_stat],
    y=[0, 0.4],
    name='t statistic',
    mode='lines',
    line=dict(color='deepskyblue')  # Line colour for a lines-only trace
))

t_dist_fig.add_trace(go.Scatter(
    x=[-t_stat, -t_stat],
    y=[0, 0.4],
    name='t statistic',
    mode='lines',
    line=dict(color='deepskyblue')
))

t_crit = stats.t.ppf(0.975, 198)  # Critical value for a two-tailed alpha of 0.05

t_dist_fig.add_trace(go.Scatter(
    x=[t_crit, t_crit],
    y=[0, 0.4],
    name='critical t statistic',
    mode='lines',
    line=dict(color='orange')
))

t_dist_fig.add_trace(go.Scatter(
    x=[-t_crit, -t_crit],
    y=[0, 0.4],
    name='critical t statistic',
    mode='lines',
    line=dict(color='orange')
))

t_dist_fig.show()

For a chosen $\alpha$ value of $0.05$ (the rejection regions lie below the lower and above the upper orange line), we reject the null hypothesis, accept the alternative hypothesis, and state that the heart rate in the active group is significantly different from the heart rate in the control group.

## COMPARING THE MEANS OF SYSTOLIC BLOOD PRESSURE BETWEEN AGE GROUPS

Here we consider the difference in systolic blood pressure (`sBP` variable) between younger and older patients. Our null hypothesis is that there is no difference in the systolic blood pressure between the groups. Our alternative hypothesis is that there is a difference. Hypothesis testing is therefore two-sided.

In this example, we review how to work with data and create two age groups by binning the data using a conditional. We will let every participant younger than $65$ be in age group I and every participant $65$ and older be in age group II.

In [None]:
# Creating a new variable using the where function
df['AgeGroup'] = np.where(df.Age < 65, 'I', 'II')
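As an aside, the same grouping can be produced by explicit binning. The sketch below is an alternative for illustration, not the method used in this notebook. It uses a small made-up dataframe (`ages` and the column names ending in `_where` and `_cut` are hypothetical) and the pandas `cut` function with left-closed bins, so that $65$ falls in group II.

```python
import numpy as np
import pandas as pd

# Made-up ages, not the study data
ages = pd.DataFrame({'Age': [34, 64, 65, 71, 50, 80]})

# Conditional approach, as used above
ages['AgeGroup_where'] = np.where(ages.Age < 65, 'I', 'II')

# Binning approach: [0, 65) is group I and [65, inf) is group II
ages['AgeGroup_cut'] = pd.cut(ages.Age,
                              bins=[0, 65, np.inf],
                              right=False,
                              labels=['I', 'II'])

print(ages)
```

The conditional is the simpler choice for two groups. Binning with `cut` extends more naturally to three or more age bands.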

### DESCRIPTIVE STATISTICS

The result is an unbalanced variable, where we have an over-representation of the younger participants.

In [None]:
df.AgeGroup.value_counts()

The `groupby` method is used to describe the systolic blood pressure in both groups.

In [None]:
df.groupby('AgeGroup').sBP.describe()

Younger participants have a lower mean blood pressure. Below, we visualise this difference.

### VISUALISATION

In [None]:
px.box(df,
       y = 'sBP',
       color = 'AgeGroup',
       title='Systolic blood pressure distribution among age groups',
       labels={'sBP':'Systolic BP'})

We need to know how different the means are.

### COMPARING THE VARIABLE BETWEEN THE TWO GROUPS

As with our previous use of the null hypothesis, we assume that the systolic blood pressure (the `sBP` variable) is independent of the age group.

We can reassign the systolic blood pressure values. Here, we accomplish the task by repeatedly shuffling the values in the `sBP` numpy array, using the `random.shuffle` numpy function. We have to be careful with pandas here. The shuffle happens in place, and the array returned by `to_numpy` can share memory with the dataframe, so the shuffle can change the original dataframe even though we extracted the column as a separate numpy array. We therefore create an independent copy of the dataframe to work with.

In [None]:
# Make an independent copy of the dataframe
df_copy = df.copy(deep=True)

In [None]:
mean_stat = []

for i in range(10000):
  sBP = df_copy.sBP.to_numpy(copy=True) # Fresh, writable copy of the values
  np.random.shuffle(sBP) # Shuffle the copy in place
  groupI = np.mean(sBP[0:152]) # Select the first 152 observations
  groupII = np.mean(sBP[152:]) # Select the remaining 48 observations
  mean_stat.append(groupI - groupII) # Calculate and store the difference in means
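The in-place behaviour of shuffling can be seen in isolation with a small sketch. It uses numpy's `default_rng` generator and a made-up array (`original`, `snapshot`, and the other names are illustrative, not part of the notebook). The `permutation` method returns a new shuffled array, whereas `shuffle` reorders the array it is given.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up values, not the study data
original = np.array([110, 120, 130, 140, 150, 160])
snapshot = original.copy()

# permutation returns a new, shuffled array and leaves its input alone
shuffled = rng.permutation(original)
unchanged_after_permutation = np.array_equal(original, snapshot)

# shuffle reorders the array it is given, so we hand it a copy
in_place = original.copy()
rng.shuffle(in_place)

print(unchanged_after_permutation)
print(shuffled)
print(in_place)
```

Using `permutation` inside the resampling loop would remove the need to guard the dataframe against accidental changes.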

Once again we view the sampling distribution of the mean difference test statistic. First, though, we store the difference in means for the sample data.

In [None]:
# Creating separate numpy arrays
younger_sBP = df.loc[df.AgeGroup == 'I'].sBP.to_numpy()
older_sBP = df.loc[df.AgeGroup == 'II'].sBP.to_numpy()

# Difference in means
mean_diff = np.mean(younger_sBP) - np.mean(older_sBP)
mean_diff

Since this is a two-tailed hypothesis, we need to reflect this difference. Below, we create a histogram of the sampling distribution of mean differences and the two mean differences from the data.

In [None]:
go.Figure(
    data=go.Histogram(
        x=mean_stat,
        name='Mean differences'
    )
).add_trace(go.Scatter(
    x=[mean_diff, mean_diff],
    y=[0, 200],
    mode='lines',
    name='Original difference'
)).add_trace(go.Scatter(
    x=[-mean_diff, -mean_diff],
    y=[0, 200],
    mode='lines',
    name='Reflected original difference'
)).update_layout(
    title='Distribution of difference in means',
    xaxis={'title':'Difference'},
    yaxis={'title':'Frequency'}
)

Below, we view the fraction of values below our (negative) mean difference and the fraction above its reflection.

In [None]:
np.sum(np.array(mean_stat) < mean_diff) / 10000

In [None]:
np.sum(np.array(mean_stat) > -mean_diff) / 10000

Combined, we have a very small fraction of values more extreme than our original difference. We can verify this again with a _t_ test.

### STUDENT'S _t_ TEST

In [None]:
stats.ttest_ind(
    younger_sBP,
    older_sBP
)

We are usually only interested in two decimal places, so in both cases we would have _p_ $< 0.01$.

For the sake of interest, we look at one more _t_ test, which we use when the variance of our continuous numerical variable differs between the two groups.

## COMPARING MEANS WITH UNEQUAL VARIANCES

Student's _t_ test assumes that the data are from populations in which the variances of the variable are equal. This can be verified using Levene's test. The Levene test null hypothesis states that the variances are indeed equal and the alternative hypothesis is that they are not.

We will consider if there is a difference in age between two randomly created groups.

In [None]:
# Creating two numpy arrays to hold the age values with different variances
np.random.seed(12)
age_I = np.random.normal(loc=100, scale=10, size=100)
age_II = np.random.normal(loc=100, scale=12.1, size=100)

We generate two arrays with the same mean and size. One is taken from a normal distribution with a standard deviation of $10$ and the other from one with a standard deviation of $12.1$. Is this a significant difference in variance?

Below we use the `levene` function from the stats module of the scipy library. The two arrays are used as arguments.

In [None]:
stats.levene(age_I, age_II)

We note that we reject the null hypothesis. We now use the _t_ test for unequal variances, termed the __Welch test__.

### _t_ TEST FOR UNEQUAL VARIANCES

This test is simple to perform and requires the addition of the `equal_var` argument to the `ttest_ind` function and setting it to `False`.

In [None]:
stats.ttest_ind(
    age_I,
    age_II,
    equal_var=False
)

Here, we fail to reject the null hypothesis.
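To see what `equal_var=False` changes, the sketch below computes the unequal-variance statistic by hand and compares it with scipy. Each group contributes its own variance to the standard error, and the degrees of freedom come from the Welch-Satterthwaite approximation. The arrays `x` and `y` are generated here with numpy's `default_rng`, so they are an assumption for illustration and differ from `age_I` and `age_II` above.

```python
import numpy as np
from scipy import stats

# Made-up data with different spreads, not the arrays used above
rng = np.random.default_rng(12)
x = rng.normal(loc=100, scale=10, size=100)
y = rng.normal(loc=100, scale=12.1, size=100)

# Sample variance of each group divided by its sample size
vx = np.var(x, ddof=1) / len(x)
vy = np.var(y, ddof=1) / len(y)

# Test statistic with a separate variance term per group
t_manual = (np.mean(x) - np.mean(y)) / np.sqrt(vx + vy)

# Welch-Satterthwaite approximation of the degrees of freedom
dof = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))

# Two-tailed p value from the t distribution
p_manual = 2 * stats.t.sf(np.abs(t_manual), dof)

t_scipy, p_scipy = stats.ttest_ind(x, y, equal_var=False)

print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

The degrees of freedom are generally not a whole number here, unlike the $n_{1} + n_{2} - 2$ used by Student's _t_ test.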

## CONCLUSION

The _t_ tests are commonly used in data science. They are termed __parametric tests__ for the comparison of two means as they are calculated from theoretical distributions based on parameters, i.e. the _t_ distribution is based on the parameter of degrees of freedom.

We have seen, though, that we can build sampling distributions from our original data under the null hypothesis that there is no difference, and from these we can judge how rare the observed difference between groups is.