# Analysing ~~Penguins~~ Relationships

This is the second session in our series that builds the foundations for
regression. Here we will look at how to explore relationships in data,
using both quantitative measures such as correlation and a range of
visualisation methods.

This session discusses what it means to analyse relationships between
variables, what is possible with different types of variables (and how
this links to the previous session that looked at [comparing
samples](../../sessions/11-comparing-samples)), and what these methods
can and cannot tell us.

We will use the [**Palmer
Penguins**](https://github.com/rfordatascience/tidytuesday/blob/main/data/2025/2025-04-15/readme.md)
dataset throughout this session, taken from the
[TidyTuesday](https://github.com/rfordatascience/tidytuesday/) GitHub
repository (originally from the
[palmerpenguins](https://allisonhorst.github.io/palmerpenguins/) R
package[1]). We will import the data directly from GitHub, but if you
would prefer to download it instead, click
<a href="https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-04-15/penguins.csv" download>here</a>.

## Why Analyse Relationships?

In the [first session](../../sessions/11-comparing-samples) in the
regression foundations series we considered how to compare samples, and
what comparisons between groups can tell us. The problem with comparing
samples is that you can only really conclude whether or not the samples
were probably drawn from the same distribution. If they were not drawn
from the same population distribution, what does this really tell us? It
is possible to structure our analysis such that this could be quite
meaningful, but in many cases it won’t be.

This is why we need to take the next step, to analysing relationships.
How do two variables move together? Strictly speaking, what we will be
analysing today is less “relationships”, which implies causality, and
more “associations”. We are unable to make claims about causality with
the methods we are using, because we are only considering how variables
vary together, and not directly estimating how one variable causes
changes in another. Still, the methods we will discuss here get us one
step closer to being able to measure effects and infer causality.

In this session, we’ll explore the penguins dataset to see how body mass
relates to flipper length, how bill length relates to bill depth, and
how to identify relationships between continuous traits.

## From Categorical to Continuous

In our previous session, comparing samples, we were comparing the
average value of a continuous variable by groups (categorical
variables). Here we will compare two continuous variables, considering
how one variable (the outcome) changes in response to changes in the
other variable (the predictor).

### Variable Types

| Variable Type             | Example Values       | Typical Question        |
|---------------------------|----------------------|-------------------------|
| **Continuous**            | 5.2, 7.8, 102.3      | *How much?*             |
| **Discrete**              | 1, 2, 3, 10          | *How many?*             |
| **Categorical (Nominal)** | Red, Blue, Green     | *Which type?*           |
| **Categorical (Ordinal)** | Low, Medium, High    | *Which level?*          |
| **Binary (Dichotomous)**  | Yes/No, Pass/Fail    | *Yes or no?*            |
| **Time/Date**             | 2025-06-10, 12:30 PM | *When?*                 |
| **Identifier**            | ID12345, username987 | *Who or what? (unique)* |
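
As a rough illustration of how these variable types tend to show up in pandas, the sketch below builds a small made-up DataFrame with one column per row of the table. The column names and values are hypothetical and are not taken from the penguins data. It also shows that `select_dtypes(include='number')`, which we use later in the session, keeps only the continuous and discrete columns.

```python
import pandas as pd

# Hypothetical toy data: one column per variable type in the table above
toy = pd.DataFrame({
    'mass_g': [3750.0, 3800.5, 3250.0],              # continuous
    'n_eggs': [1, 2, 2],                             # discrete
    'species': ['Adelie', 'Gentoo', 'Chinstrap'],    # categorical (nominal)
    'size': pd.Categorical(
        ['Low', 'High', 'Medium'],
        categories=['Low', 'Medium', 'High'],
        ordered=True,
    ),                                               # categorical (ordinal)
    'tagged': [True, False, True],                   # binary
    'seen_on': pd.to_datetime(
        ['2025-06-10', '2025-06-11', '2025-06-12']
    ),                                               # time/date
    'penguin_id': ['ID001', 'ID002', 'ID003'],       # identifier
})

# how pandas stores each column
print(toy.dtypes)

# only the continuous and discrete columns are treated as numeric
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
print(numeric_cols)
```

The ordinal column is stored as an ordered categorical, so pandas knows that Low < Medium < High even though the values are text.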

There is only a subtle difference between the idea of comparing samples
and analysing relationships. You can frame a comparison between groups
as analysing the relationship between the groups and the continuous
variable, but you are still comparing the average value and dispersion
for each group and inferring the relationship (or association) from
this. When comparing two continuous variables, you can’t reduce either
to their average, and are instead making statements about the way they
vary together.

## Penguins with Long Characteristics

### Import & Process Data

[1] The dataset is now also part of the base R datasets package, which is
included by default in R installations (R \>= 4.5.0).

In [1]:
import pandas as pd

# load penguins data from TidyTuesday URL
url = 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-04-15/penguins.csv'
penguins_raw = pd.read_csv(url)
penguins_raw.head()

There are several missing values in this dataset. While we should
generally be a little careful when discarding missing values, we will do
so here just to simplify the process.

In [2]:
# drop missing values
df = penguins_raw.dropna()
df.shape

(333, 8)
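
Before discarding rows, it can be worth checking where the missing values are. The sketch below uses a small made-up DataFrame as a stand-in for `penguins_raw` (the values are hypothetical) to show one way of counting missing values per column and comparing row counts before and after `dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for penguins_raw, with a few values missing
toy_raw = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo'],
    'flipper_len': [181.0, np.nan, 217.0, 230.0],
    'body_mass': [3750.0, 3800.0, np.nan, 5700.0],
    'sex': ['male', None, 'female', 'male'],
})

# count missing values per column before deciding what to drop
missing_per_column = toy_raw.isna().sum()
print(missing_per_column)

# drop any row with at least one missing value and compare row counts
toy_clean = toy_raw.dropna()
print(f"Rows before: {len(toy_raw)}, rows after: {len(toy_clean)}")
```

The same two steps should work on `penguins_raw` directly, and show how many rows are lost and which columns are responsible.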

### Visualising Relationships

In [3]:
import matplotlib.pyplot as plt
import seaborn as sns

# scatter of flipper length vs. body mass
plt.figure(figsize=(8, 6))
sns.scatterplot(
    data=df,
    x='flipper_len',
    y='body_mass',
    hue='species',
    alpha=0.7
)
plt.title('The Relationship Between Flipper Length & Body Mass')
plt.xlabel('Flipper Length (mm)')
plt.ylabel('Body Mass (g)')
plt.show()

**Questions:**

-   What pattern do you see in <a href="#fig-flipper-mass-scatter-plot"
    class="quarto-xref">Figure 1</a>?
-   How does body mass change when flipper length increases, according
    to this plot?
-   Are there differences by species?

In [4]:
# scatter of bill length vs. bill depth
plt.figure(figsize=(8, 6))
sns.scatterplot(
    data=df,
    x='bill_len',
    y='bill_dep',
    hue='species',
    alpha=0.7
)
plt.title('The Relationship Between Bill Length & Bill Depth')
plt.xlabel('Bill Length (mm)')
plt.ylabel('Bill Depth (mm)')
plt.show()

**Questions:**

-   What pattern do you see in
    <a href="#fig-bill-scatter-plot" class="quarto-xref">Figure 2</a>?
-   How does bill depth change when bill length increases, according to
    this plot?
-   Are there differences by species?

### Computing Correlations

#### Pairwise Correlation

We can compute the correlation between two variables, using
`scipy.stats`.

In [5]:
from scipy.stats import pearsonr

r, p = pearsonr(df['flipper_len'], df['body_mass'])
print(f"Correlation (r) = {r:.2f}")

Correlation (r) = 0.87

A correlation of 0.87 is very strong. There is clearly a very strong
association between flipper length and body mass. However, we can’t
claim that flipper length causes body mass based on this alone.
Correlation does not imply causation[1].
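
To make it clearer what this number measures, here is a minimal sketch that computes Pearson’s *r* from its definition and checks it against `pearsonr`. The values are made up for illustration and are not taken from the penguins data.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical flipper lengths (mm) and body masses (g), for illustration only
x = np.array([181.0, 186.0, 195.0, 193.0, 210.0, 217.0, 230.0])
y = np.array([3750.0, 3800.0, 3250.0, 3450.0, 4400.0, 5000.0, 5700.0])

# deviations of each observation from its variable's mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r: how much the deviations move together,
# scaled by the overall size of the deviations in each variable
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# compare against scipy's implementation
r_scipy, p_value = pearsonr(x, y)
print(f"manual r = {r_manual:.4f}, scipy r = {r_scipy:.4f}")
```

Because of the scaling in the denominator, *r* always falls between -1 and 1, regardless of the units the variables are measured in.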

When we visualised the relationship between bill length and bill depth,
there appeared to be a grouping structure going on that complicated
things, and the overall relationship appeared pretty noisy.

[1] Correlation might not imply causation, but it is important to
realise that the presence of correlation does not mean causation is
*not* present. You just can’t conclude causation exists simply because
you observe a correlation.

In [6]:
r, p = pearsonr(df['bill_len'], df['bill_dep'])
print(f"Correlation (r) = {r:.2f}")

Correlation (r) = -0.23

As a result, the correlation score is much lower. A correlation of -0.23
tells us two things:

-   The negative correlation means that when bill length increases, bill
    depth tends to decrease.
-   The weaker correlation suggests that this decrease is a lot noisier,
    and it is much harder to estimate a penguin’s bill depth using their
    bill length.

A correlation of +/- ~0.2 doesn’t necessarily mean there is no
relationship. There are lots of ways correlation can mislead, because it
is a limited measure. Visualising the relationship between bill length
and bill depth showed us that species is highly relevant, and not
factoring this in limits what we can say about this relationship.
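
One way to see how a grouping structure can distort an overall correlation is to compute the correlation within each group as well as across the whole dataset. The sketch below uses simulated data rather than the penguins; the group names, offsets and noise level are all assumptions chosen to make the effect obvious.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Simulated data (not the penguins): within each group y rises with x,
# but groups with larger x values sit lower on y
frames = []
for k, name in enumerate(['A', 'B', 'C']):
    t = np.linspace(0, 8, 30)
    frames.append(pd.DataFrame({
        'group': name,
        'x': 10 * k + t,
        'y': 0.5 * t - 6 * k + rng.normal(0, 0.5, t.size),
    }))
sim_groups = pd.concat(frames, ignore_index=True)

# correlation ignoring the grouping
overall_r = sim_groups['x'].corr(sim_groups['y'])

# correlation computed separately within each group
within_r = (
    sim_groups.groupby('group')[['x', 'y']]
    .apply(lambda g: g['x'].corr(g['y']))
)

print(f"Overall r = {overall_r:.2f}")
print(within_r.round(2))
```

In this simulation the overall correlation is negative even though the correlation within every group is positive. The same `groupby` pattern, grouping by `species`, could be applied to the bill measurements to check whether something similar is happening there.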

#### Correlation Matrix

We may be interested in the pairwise correlation between multiple
variables. If so, computing each correlation between pairs of variables
is very cumbersome. Instead, we can compute a correlation matrix.

In [7]:
# compute correlation matrix
(
    df.select_dtypes(include='number')
    .corr()
    .round(2)
)

We can also visualise a correlation matrix, shown below in
<a href="#fig-correlation-matrix-plot" class="quarto-xref">Figure 3</a>.

In [8]:
# add correlation matrix to summarise relationships
corr_matrix = df.select_dtypes(include='number').corr()
# plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title("Correlation Matrix")
plt.show()

If you are interested in a particular outcome and want a quick look at
how every other continuous variable correlates with it, you can compute
this too.

In [9]:
# correlations of all numeric variables with body mass
(
    df.select_dtypes(include='number')
    .corr()['body_mass']
    .drop('body_mass')
    .round(2)
)

bill_len       0.59
bill_dep      -0.47
flipper_len    0.87
year           0.02
Name: body_mass, dtype: float64
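
When there are many variables, it may help to rank them by the strength of their correlation with the outcome, ignoring the sign. The sketch below shows one way to do this with `sort_values(key=abs)`. The DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical data: one outcome and three candidate variables
toy = pd.DataFrame({
    'outcome':  [1, 2, 3, 4, 5, 6, 7, 8],
    'weak':     [3, 1, 4, 1, 5, 9, 2, 6],
    'negative': [8, 6, 7, 5, 3, 4, 2, 1],
    'strong':   [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1],
})

# correlations with the outcome, ordered from strongest to weakest
# by absolute value so strong negative correlations are not pushed to the bottom
ranked = (
    toy.corr()['outcome']
    .drop('outcome')
    .sort_values(key=abs, ascending=False)
    .round(2)
)
print(ranked)
```

Sorting on the absolute value keeps the sign in the output, so you can still see the direction of each association.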

### Correlation’s Limitations

Computing correlation can be very informative, but it is limited in a
number of ways. Correlation (at least the most common method for
calculating correlation, Pearson’s *r*[1]) does not handle non-linearity
well.

[1] There are a number of other ways of calculating correlation,
including methods that account for non-linearity, but more often than
not you will encounter Pearson’s *r*. Any time correlation is mentioned
without specifying the way it was calculated, you should assume it is
using Pearson’s *r*.

In [10]:
import numpy as np

# simulate a u‑shaped relationship example
x_sim = np.linspace(-3, 3, 200)
y_sim = x_sim**2 + np.random.normal(0, 1, 200)
sim = pd.DataFrame({'x': x_sim, 'y': y_sim})

plt.figure(figsize=(6, 4))
sns.scatterplot(data=sim, x='x', y='y')
plt.title('u-shaped pattern')
plt.show()

In [11]:
r, p = pearsonr(sim['x'], sim['y'])
print(f"Correlation (r) = {r:.2f}")

Correlation (r) = -0.03

**Takeaway:** Pearson’s *r* can be close to zero even when there is a
strong non-linear relationship.
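
Rank-based measures such as Spearman’s rank correlation are one of the alternatives to Pearson’s *r*. The sketch below, using noise-free simulated data, compares the two on a curve that always increases and on a u-shape. It is only an illustration of where a rank-based method helps and where it does not.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# monotonic but curved: y always increases with x, just not in a straight line
x_mono = np.linspace(0, 5, 50)
y_mono = np.exp(x_mono)

r_mono, _ = pearsonr(x_mono, y_mono)
rho_mono, _ = spearmanr(x_mono, y_mono)
print(f"Monotonic curve: Pearson r = {r_mono:.2f}, Spearman rho = {rho_mono:.2f}")

# u-shaped on a symmetric grid from -3 to 3: y falls and then rises
x_u = np.arange(-30, 31) / 10
y_u = x_u**2

r_u, _ = pearsonr(x_u, y_u)
rho_u, _ = spearmanr(x_u, y_u)
print(f"U-shape: Pearson r = {r_u:.2f}, Spearman rho = {rho_u:.2f}")
```

Spearman’s method captures the curved but always-increasing relationship, because it only depends on the ordering of the values. It does no better than Pearson’s *r* on the u-shape, so a different correlation measure is not a substitute for plotting the data.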

### Visualising Linear Regressions

In [12]:
# add linear fit line
sns.regplot(
    data=df,
    x='flipper_len',
    y='body_mass',
    scatter=True,
    ci=95
)
plt.title('Adding a Regression Line')
plt.xlabel('Flipper Length (mm)')
plt.ylabel('Body Mass (g)')
plt.show()

In [13]:
# add linear fit line
sns.regplot(
    data=df,
    x='bill_len',
    y='bill_dep',
    scatter=True,
    ci=95
)
plt.title('Adding a Regression Line')
plt.xlabel('Bill Length (mm)')
plt.ylabel('Bill Depth (mm)')
plt.show()
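
The line in a regression plot is a least squares fit, and it can be useful to see the numbers behind it. The sketch below uses `scipy.stats.linregress` on simulated data to recover the slope and intercept of such a line. The data are a made-up stand-in for flipper length and body mass, and the "true" slope and intercept are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)

# Simulated stand-in for flipper length (mm) and body mass (g);
# the true slope (50 g per mm) and intercept are made up for illustration
flipper = rng.uniform(170, 230, 200)
mass = -5800 + 50 * flipper + rng.normal(0, 400, 200)

# ordinary least squares fit of mass on flipper length
fit = linregress(flipper, mass)
print(f"slope = {fit.slope:.1f} g per mm, intercept = {fit.intercept:.0f} g")
print(f"r = {fit.rvalue:.2f}")

# predicted body mass for a hypothetical 200 mm flipper
predicted = fit.intercept + fit.slope * 200
print(f"predicted mass at 200 mm = {predicted:.0f} g")
```

The slope is the change in the outcome associated with a one-unit increase in the predictor, which is something a correlation coefficient alone does not tell you.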

### Visualisation Limitations

Visualising relationships has many of the same issues that computing
correlation does, but fundamentally the biggest issue is that it is
pairwise, and few relationships are strictly pairwise in the real world.

<a href="#fig-bill-regression-plot" class="quarto-xref">Figure 6</a>
points to another limitation with Pearson’s *r* that also applies to
regression plots. Grouping structures!

In [14]:
sns.lmplot(data=df, x="bill_len", y="bill_dep", hue="species")

plt.title('Accounting for groups')
plt.xlabel('Bill Length (mm)')
plt.ylabel('Bill Depth (mm)')
plt.show()

<a href="#fig-grouping-structures" class="quarto-xref">Figure 7</a>
shows how pairwise examination of relationships can miss so much.

A straight line can only describe a linear pattern, but unlike a
correlation coefficient, a plot lets you *see* when a straight line is
inappropriate. A linear regression plot will not handle the non-linear
pattern above any better than a correlation coefficient does, but you
will be able to see the problem for yourself.
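
As a rough numerical counterpart to seeing the problem, the sketch below fits a straight line to u-shaped data like the simulation earlier in this session, then compares its residuals with those from a quadratic fit. The quadratic fit is included only as a comparison, and the seed and noise level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# u-shaped data, regenerated here with a fixed seed
x = np.linspace(-3, 3, 200)
y = x**2 + rng.normal(0, 1, 200)

# straight-line fit: y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
resid_linear = y - (slope * x + intercept)

# quadratic fit, used here only as a comparison
coefs = np.polyfit(x, y, deg=2)
resid_quad = y - np.polyval(coefs, x)

print(f"linear slope = {slope:.2f}")
print(
    f"residual SD: linear = {resid_linear.std():.2f}, "
    f"quadratic = {resid_quad.std():.2f}"
)
```

A slope near zero combined with large residuals suggests that the straight line is missing the pattern, not that there is no pattern.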

Visualising data helps you to get a sense of how two variables are
related, but where you are seeking to understand the relationship
between variables, or even explain what causes a certain outcome, you
have to go further.

## Summary

-   Visualising data can help intuit how variables are related.
-   Correlations can tell us a lot, but typical methods for calculating
    correlation will miss non-linearity.
-   Correlation does not imply causation (but does not *not* imply
    causation, either).
-   Both of these approaches are generally only able to handle pairwise
    relationships.