# 8.4 The Correlation Coefficient

## Objectives
- Analyze and interpret bivariate data using scatter plots, correlation, and linear regression analysis to determine what the best prediction would be for a certain value.
- Analyze an application in the disciplines of business, social sciences, psychology, life sciences, health science, and education, and utilize the correct statistical processes to arrive at a solution.

## The Correlation Coefficient
Besides looking at the scatter plot and seeing that a line seems reasonable, how can you tell if the line is a good predictor? Use the correlation coefficient as another indicator (besides the scatter plot) of the strength of the linear relationship between $x$ and $y$.

The **correlation coefficient**, $r$, developed by Karl Pearson in the early 1900s, is numerical and provides a measure of strength and direction of the linear association between the independent variable $x$ and the dependent variable $y$.

We can find the correlation coefficient using the <code>cor</code> function:

```R
cor(x, y)
```

Here, <code>x</code> is a vector of the independent $x$ values, and <code>y</code> is the corresponding vector of the dependent $y$ values. The two vectors must have the same length.
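For reference, the value returned by <code>cor</code> is Pearson's $r$, which can be written as

$$ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}. $$

The following is a minimal sketch that computes $r$ from this formula and compares it with <code>cor</code>. The data are made up purely for illustration, and the names `dx`, `dy`, and `r_manual` are our own.

```R
# Illustrative data (made up for this sketch, not from the text)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Deviations of each value from its mean
dx <- x - mean(x)
dy <- y - mean(y)

# Pearson's r computed directly from the formula
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

r_manual
cor(x, y)   # should agree with r_manual up to floating-point rounding
```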

**What the VALUE of $r$ tells us:**
- The value of $r$ is always between -1 and +1: $-1 \leq r \leq 1$.
- The size of the correlation coefficient $r$ indicates the strength of the linear relationship between $x$ and $y$. Values of $r$ close to -1 or to +1 indicate a stronger linear relationship between $x$ and $y$.
- If $r = 0$, there is likely no linear correlation. It is important to view the scatter plot, however, because data that exhibit a curved or horizontal pattern may have a correlation of 0.
- If $r = 1$, there is perfect positive correlation. If $r = -1$, there is perfect negative correlation. In both these cases, all of the original data points lie on a straight line. Of course, in the real world, this will not generally happen.
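The extreme cases above can be checked with a small sketch. The data here are constructed for illustration only: two sets of points that lie exactly on a line, and one set that follows a symmetric curve.

```R
# Constructed x values, symmetric around zero
x <- -3:3

r_up    <- cor(x, 2 * x + 1)    # points exactly on a rising line
r_down  <- cor(x, -2 * x + 1)   # points exactly on a falling line
r_curve <- cor(x, x^2)          # a clear curved pattern with no linear trend

r_up
r_down
r_curve
```

Up to floating-point rounding, the three values are 1, -1, and 0. The last case is why the scatter plot still matters: $y$ is completely determined by $x$, yet the linear correlation is zero.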

**What the SIGN of $r$ tells us:**
- A positive value of $r$ means that when $x$ increases, $y$ tends to increase and when $x$ decreases, $y$ tends to decrease (**positive correlation**).
- A negative value of $r$ means that when $x$ increases, $y$ tends to decrease and when $x$ decreases, $y$ tends to increase (**negative correlation**).
- The sign of $r$ is the same as the sign of the slope, $b$, of the best-fit line.
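As a quick check of the last point, the sketch below fits a least-squares line with <code>lm</code> and compares the sign of its slope with the sign of $r$. The data are made up for illustration. The sketch also uses the standard identity $b = r \cdot s_y / s_x$, which is not derived in this section but explains why the two signs must agree (standard deviations are never negative).

```R
# Illustrative data (made up): y falls as x rises
x <- c(1, 2, 3, 4, 5, 6)
y <- c(12, 10, 9, 7, 4, 3)

r <- cor(x, y)
b <- unname(coef(lm(y ~ x))["x"])   # slope of the least-squares line

sign(r) == sign(b)                  # the signs agree

# The slope equals r scaled by the ratio of standard deviations
all.equal(b, r * sd(y) / sd(x))
```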

***


### Example 4.1
SCUBA divers have maximum dive times they cannot exceed when going to different depths. The data in the table show different depths with the maximum dive times in minutes. Find and interpret the correlation coefficient $r$ for the data.

|X (depth in feet)|Y (maximum dive time in minutes)|
|--|--|
|50|80|
|60|55|
|70|45|
|80|35|
|90|25|
|100|22|

#### Solution

```R
# Depths in feet and the corresponding maximum dive times in minutes
x <- c(50, 60, 70, 80, 90, 100)
y <- c(80, 55, 45, 35, 25, 22)

cor(x, y)
```

So the correlation coefficient for the data is $r = -0.9629$. Since $r$ is very close to -1, the dive depth has a very strong negative linear relationship to the maximum dive time.
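Because $r$ should always be read alongside the scatter plot, it can help to draw one. The sketch below is one possible way to do so with base R graphics. The axis labels and title are our own wording.

```R
x <- c(50, 60, 70, 80, 90, 100)   # depth in feet
y <- c(80, 55, 45, 35, 25, 22)    # maximum dive time in minutes

# Scatter plot of the dive data
plot(x, y,
     xlab = "Depth (feet)",
     ylab = "Maximum dive time (minutes)",
     main = "Maximum dive time vs. depth")

fit <- lm(y ~ x)   # least-squares line
abline(fit)        # draw the line over the scatter plot
```

The downward slope of the fitted line matches the negative sign of $r$.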


***

## Testing the Significance of the Correlation Coefficient
The correlation coefficient, $r$, tells us about the strength and direction of the linear relationship between $x$ and $y$. However, the reliability of the linear model also depends on how many observed data points are in the sample. We need to look at both the value of the correlation coefficient $r$ and the sample size $n$, together.
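To see why sample size matters, consider the simulation sketched below. It repeatedly draws samples in which $x$ and $y$ are generated independently, so there is no linear relationship in the population, and records the sample $r$ each time. The normal distribution, the seed, the number of repetitions, and the helper name `sample_r` are all assumptions made for this illustration.

```R
set.seed(1)   # arbitrary seed, used only so the run is reproducible

# Draw one sample of size n with x and y unrelated, and return its r
sample_r <- function(n) {
  x <- rnorm(n)
  y <- rnorm(n)   # generated independently of x
  cor(x, y)
}

r_small <- replicate(2000, sample_r(5))    # many samples of size 5
r_large <- replicate(2000, sample_r(50))   # many samples of size 50

# Share of samples with |r| > 0.5 purely by chance
mean(abs(r_small) > 0.5)
mean(abs(r_large) > 0.5)
```

With only 5 points, a sizable share of samples shows $|r| > 0.5$ by chance alone, while with 50 points this almost never happens. The same value of $r$ is therefore much weaker evidence when $n$ is small.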

We perform a hypothesis test of the **"significance of the correlation coefficient"** to decide whether the linear relationship in the sample data is strong enough to use to model the relationship in the population.

The sample data are used to compute $r$, the correlation coefficient for the sample. If we had data for the entire population, we could find the population correlation coefficient. But because we only have sample data, we cannot calculate the population correlation coefficient. The sample correlation coefficient, $r$, is our estimate of the unknown population correlation coefficient.

- The symbol for the population correlation coefficient is $ρ$, the Greek letter "rho."
- $ρ$ = population correlation coefficient (unknown)
- $r$ = sample correlation coefficient (known; calculated from sample data)

The hypothesis test lets us decide whether the value of the population correlation coefficient $ρ$ is "close to zero" or "significantly different from zero". We decide this based on the sample correlation coefficient $r$ and the sample size $n$.

**If the test concludes that the correlation coefficient is significantly different from zero, we say that the correlation coefficient is "significant."**

- Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between $x$ and $y$ because the correlation coefficient is significantly different from zero.
- What the conclusion means: There is a significant linear relationship between $x$ and $y$. We can use the regression line to model the linear relationship between $x$ and $y$ in the population.

**If the test concludes that the correlation coefficient is not significantly different from zero (it is close to zero), we say that the correlation coefficient is "not significant".**

- Conclusion: "There is insufficient evidence to conclude that there is a significant linear relationship between $x$ and $y$ because the correlation coefficient is not significantly different from zero."
- What the conclusion means: There is not a significant linear relationship between $x$ and $y$. Therefore, we CANNOT use the regression line to model a linear relationship between $x$ and $y$ in the population.

**NOTE:**
- If $r$ is significant and the scatter plot shows a linear trend, the line can be used to predict the value of $y$ for values of $x$ that are within the domain of observed $x$ values.
- If $r$ is not significant OR if the scatter plot does not show a linear trend, the line should not be used for prediction.
- Even if $r$ is significant and the scatter plot shows a linear trend, the line may NOT be appropriate or reliable for prediction OUTSIDE the domain of observed $x$ values in the data.
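The domain restriction in these notes can be enforced in code. The sketch below uses the dive data from Example 4.1 and a hypothetical helper, `predict_in_range`, which is our own name and not a built-in R function. It checks only the range condition and assumes the significance test and the scatter plot check have already been done.

```R
x <- c(50, 60, 70, 80, 90, 100)   # depth in feet
y <- c(80, 55, 45, 35, 25, 22)    # maximum dive time in minutes
fit <- lm(y ~ x)

# Hypothetical helper: predict only inside the observed range of x
predict_in_range <- function(fit, x_obs, x_new) {
  if (x_new < min(x_obs) || x_new > max(x_obs)) {
    message("x_new is outside the observed x values; no prediction made")
    return(NA_real_)
  }
  unname(predict(fit, newdata = data.frame(x = x_new)))
}

predict_in_range(fit, x, 75)    # inside 50 to 100 feet: returns a prediction
predict_in_range(fit, x, 150)   # outside the range: prints a note, returns NA
```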

## Performing the Hypothesis Test
- Null Hypothesis: $H_0: \rho = 0$
- Alternative Hypothesis: $H_a: \rho \neq 0$

**What the hypotheses mean in words:**

- Null Hypothesis $H_0$: The population correlation coefficient *is not* significantly different from zero. There *is not* a significant linear relationship (correlation) between $x$ and $y$ in the population.
- Alternative Hypothesis $H_a$: The population correlation coefficient *is* significantly different from zero. There *is* a significant linear relationship (correlation) between $x$ and $y$ in the population.

**Using a $p$-value to make a decision:**

If the $p$-value is less than the significance level $\alpha$:

- Decision: Reject the null hypothesis.
- Conclusion: "There is sufficient evidence to conclude that there is a significant linear relationship between $x$ and $y$ because the correlation coefficient is significantly different from zero."

If the $p$-value is *not* less than the significance level $\alpha$:

- Decision: Do not reject the null hypothesis.
- Conclusion: "There is insufficient evidence to conclude that there is a significant linear relationship between $x$ and $y$ because the correlation coefficient is *not* significantly different from zero."

**Calculation notes:**

- The $p$-value is calculated using a $t$-distribution with $n - 2$ degrees of freedom.
- The formula for the test statistic is
$$ t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}. $$
- The test statistic $t$ has the same sign as the correlation coefficient $r$.
- The hypothesis test is two-tailed. The $p$-value is the combined area in both tails.
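The calculation notes above can be collected into a small function. This is a sketch only: `cor_significance` is a hypothetical helper name, and the values of $r$ and $n$ in the usage lines are made up for illustration.

```R
# Hypothetical helper: t statistic and two-tailed p-value from r and n
cor_significance <- function(r, n) {
  df     <- n - 2                                  # degrees of freedom
  t_stat <- r * sqrt(df) / sqrt(1 - r^2)           # test statistic
  p      <- 2 * pt(abs(t_stat), df = df,
                   lower.tail = FALSE)             # combined area in both tails
  c(t_stat = t_stat, df = df, p_value = p)
}

cor_significance(r = 0.5, n = 20)
cor_significance(r = -0.5, n = 20)   # same p-value; t has the opposite sign
```

Using `abs(t_stat)` means the upper-tail area is doubled regardless of the sign of $r$, which matches the two-tailed test described above.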

**The fundamental steps for conducting a hypothesis test remain the same:**

1. State the null and alternative hypotheses.
2. Assuming the null hypothesis is true, determine the features of the distribution of sample statistics.
3. Find the $p$-value.
4. Make a conclusion about the null hypothesis.
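The four steps can be sketched in R as follows. The sketch reuses the dive data from Example 4.1 purely to illustrate the mechanics. The significance level $\alpha = 0.05$ is an assumption made here, not a value given in that example.

```R
x <- c(50, 60, 70, 80, 90, 100)   # depth in feet
y <- c(80, 55, 45, 35, 25, 22)    # maximum dive time in minutes
alpha <- 0.05                     # assumed significance level

# Step 1: H0: rho = 0 versus Ha: rho != 0

# Step 2: under H0, use a t-distribution with n - 2 degrees of freedom
n  <- length(x)
df <- n - 2

# Step 3: test statistic and two-tailed p-value
r       <- cor(x, y)
t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# Step 4: compare the p-value with alpha and state the decision
if (p_value < alpha) {
  decision <- "Reject the null hypothesis"
} else {
  decision <- "Do not reject the null hypothesis"
}

p_value
decision
```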

***


### Example 4.2
A random sample of 11 statistics students produced the following data, where $x$ is the third exam score out of 80, and $y$ is the final exam score out of 200.

|x (third exam score)|y (final exam score)|
|--|--|
|65|175|
|67|133|
|71|185|
|71|163|
|66|126|
|75|198|
|67|153|
|70|163|
|71|159|
|69|151|
|69|159|

Perform a hypothesis test with a 1% level of significance to determine whether or not there is a linear correlation between a student's third exam score and their final exam score.

#### Solution
##### Part 1: State the null and alternative hypotheses.
When testing the significance of the population correlation coefficient, the null and alternative hypotheses are always the same:

$$\begin{align}
H_0:&\ \rho = 0 \\
H_a:&\ \rho \neq 0
\end{align}$$

Our null hypothesis $H_0$ is that there is no linear correlation in the *population* between a student's third exam score and their final exam score. The alternative hypothesis $H_a$ is that there is some degree of linear correlation in the population between a student's third exam score and their final exam score.

##### Part 2: Assuming the null hypothesis is true, determine the features of the distribution of sample statistics.
When testing the population correlation coefficient $\rho$, we use a $t$-distribution with $n - 2$ degrees of freedom, where $n$ is the number of $(x, y)$ points in our data. Since our data has 11 $(x, y)$ points in this case,

$$ df = n - 2 = 11 - 2 = 9. $$

##### Part 3: Find the $p$-value
We will use the test statistic

$$ t = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}} $$

to find the $p$-value. To calculate this $t$-score, we first must find the sample correlation coefficient $r$:

```R
# Third exam scores (x) and final exam scores (y) for the 11 students
x <- c(65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69)
y <- c(175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159)

cor(x, y)
```

The sample correlation coefficient is $r = 0.6631$. Then we calculate that the test statistic is

$$ t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} = \frac{0.6631\sqrt{11 - 2}}{\sqrt{1 - (0.6631)^2}} = 2.6576. $$

Since the alternative hypothesis $H_a$ uses a not-equal-to symbol, we will perform a two-tailed test. That means *half* of the $p$-value is represented by $P(t \geq 2.6576)$, which we will calculate using R.

```R
# Upper-tail area: P(t >= 2.6576) with 9 degrees of freedom
pt(q = 2.6576, df = 9, lower.tail = FALSE)
```

So half the $p$-value is $P(t \geq 2.6576) = 0.0131$. This means that the whole $p$-value is

$$p\text{-value} = 2(0.0131) = 0.0262. $$

So, assuming that the null hypothesis is true (that is, there is no linear correlation in the population), there would still be a 2.62% chance that a random sample of 11 data points would have a correlation coefficient at least as extreme as $r = 0.6631$.

##### Part 4: Make a conclusion about the null hypothesis.
We're performing the test at the 1% significance level, so $\alpha = 0.01$. Thus, since

$$p\text{-value} = 0.0262 \geq 0.01 = \alpha, $$

the evidence is not sufficient to reject the null hypothesis.

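We conclude that, at the 1% significance level, there is insufficient evidence of a significant linear relationship in the population between a student's third exam score and their final exam score. In other words, it remains plausible that there is no linear correlation in the population.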
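As a cross-check, R's built-in <code>cor.test</code> function performs the same test in a single call. The sketch below applies it to the exam data. Its results may differ slightly in the last digit from the hand calculation, because the worked solution rounds $r$ before computing $t$.

```R
x <- c(65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69)
y <- c(175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159)

# Two-tailed test of H0: rho = 0 using Pearson's r
result <- cor.test(x, y, alternative = "two.sided", method = "pearson")

result$estimate    # sample correlation coefficient r
result$statistic   # t statistic
result$parameter   # degrees of freedom, n - 2
result$p.value     # two-tailed p-value

result$p.value < 0.01   # FALSE here, so we do not reject H0 at the 1% level
```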


***


## Attribution
Significant portions of this textbook were written by Barbara Illowsky and Susan Dean in their textbook *Introductory Statistics*, published by OpenStax.

Access for free at [https://openstax.org/books/introductory-statistics/pages/1-introduction](https://openstax.org/books/introductory-statistics/pages/1-introduction)