# Statistical Analysis 1: Correlation Between Overall Band Score and Text Length

This notebook investigates whether there is a correlation between essay length and overall band scores.

## Key Findings

The analysis finds a **weak but statistically significant positive linear correlation** (r=0.286, p=7.82e-170) between essay length and overall band score. Longer essays tend to receive slightly higher scores, but with r² ≈ 0.08, essay length accounts for only about 8% of the variance in band score, so it cannot be the primary determinant of scoring.

The extremely low p-value, which owes much to the large sample (9,048 essays), means a correlation this strong would be very unlikely if the true correlation were zero. Statistical significance here indicates that the association is real, not that it is large.

## Imports and Setup

---

In [1]:
import pandas as pd
from os import path
import sys

# The project root is the parent of the notebook's working directory.
project_root = path.dirname(path.abspath(""))
sys.path.append(project_root)
print(project_root)

/Users/finnferchau/dev/team-10


In [2]:
pd.options.plotting.backend = "plotly"

## Data Import

---

Load the clean training dataset containing essays, prompts, evaluations, and band scores.

In [3]:
# Build the path from its components instead of concatenating strings.
csv_file_path = path.join(project_root, "data", "clean_train.csv")
print(csv_file_path)

df = pd.read_csv(csv_file_path)
df.head()

/Users/finnferchau/dev/team-10/data/clean_train.csv


Unnamed: 0,prompt,essay,evaluation,band_score_old,task_achievement_description,task_achievement_score,coherence_and_cohesion_description,coherence_and_cohesion_score,lexical_resource_description,lexical_resource_score,grammatical_range_and_accuracy_description,grammatical_range_and_accuracy_score,overall_band_score_description,band_score
0,Interviews form the basic criteria for most la...,It is believed by some experts that the tradit...,**Task Achievement: [7]**\nThe essay effective...,7.5\n\n\n\n\n\r\r\r\r\r\r\r\r\r\r\r\r\r,Task Achievement: [7]** The essay effectively ...,7.0,Coherence and Cohesion: [7.5]** The essay is w...,7.5,Lexical Resource: [7]** The candidate demonstr...,7.0,Grammatical Range and Accuracy: [7]** The essa...,7.0,Overall Band Score: [7.5]** The essay effectiv...,7.5
1,Interviews form the basic selecting criteria f...,Nowadays numerous huge firms allocate an inter...,**Task Achievement:** 5.0\n- The candidate has...,5.0\n\n\n\n\n\r\r\r\r\r\r\r\r\r\r\r\r\r,Task Achievement:** 5.0 - The candidate has ef...,5.0,Coherence and Cohesion:** 4.5 - The essay is w...,4.5,Lexical Resource (Vocabulary):** 4.0 - The can...,4.0,Grammatical Range and Accuracy:** 4.5 - The ca...,4.5,Overall Band Score:** 5.0 - The essay meets al...,5.0
2,Interview form the basic selection criteria fo...,The interview section is the most vital part o...,## Task Achievement:\n- The candidate has effe...,5.5\n\n\n\n\n\r\r\r\r\r\r\r\r\r\r\r\r\r,Task Achievement: - The candidate has effectiv...,6.5,Coherence and Cohesion: - The essay is well-or...,6.5,Lexical Resource: - The candidate demonstrates...,6.0,Grammatical Range and Accuracy: - The candidat...,6.0,Overall Band Score: - Taking into account the ...,6.5
3,Interviews form the basic selection criteria f...,It is argued that the best method to recruit e...,## Task Achievement:\n- The candidate has adeq...,5.5\n\n\n\n\n\r\r\r\r\r\r\r\r\r\r\r\r\r,Task Achievement: - The candidate has adequate...,6.0,Coherence and Cohesion: - The essay lacks a cl...,5.5,Lexical Resource (Vocabulary): - The candidate...,5.0,Grammatical Range and Accuracy: - The essay ex...,5.0,Overall Band Score: - The essay demonstrates a...,5.5
4,Interviews form the basic selection criteria f...,Mostly when you find work in different compani...,## Task Achievement:\n- The candidate has part...,5.5\n\n\n\n\n\r\r\r\r\r\r\r\r\r\r\r\r\r,Task Achievement: - The candidate has partiall...,4.5,Coherence and Cohesion: - The essay lacks clar...,3.5,Lexical Resource (Vocabulary): - The essay dem...,3.5,Grammatical Range and Accuracy: - The essay ex...,3.5,Overall Band Score: - Considering the holistic...,4.0


## Calculate Essay Length

---

Compute the word count for each essay by splitting the text on whitespace and counting the resulting tokens. The result is stored in a new `essay_length` column.

In [4]:
df["essay_length"] = df["essay"].str.split().str.len()
df["essay_length"].sample(5)

3695    262
2047    371
5138    321
112     275
4838    394
Name: essay_length, dtype: int64
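
The behaviour of `str.split().str.len()` on irregular input is worth checking before trusting the counts. The sketch below uses a few made-up strings (not rows from the dataset) to show that runs of spaces and newlines count as a single separator and that a missing essay yields a missing length rather than zero.

```python
import pandas as pd

# Hypothetical toy essays: irregular spacing, a newline, and a missing value.
essays = pd.Series(["One two  three", "line one\nline two", None])

# str.split() without arguments splits on runs of whitespace;
# str.len() then counts the tokens in each resulting list.
word_counts = essays.str.split().str.len()
print(word_counts)
```

If the real `essay` column contained missing values, they would need to be dropped or filled before the correlation is computed.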

## Distribution Analysis

---

### Essay Length Statistics

Examine the descriptive statistics of essay lengths to understand the data distribution, including mean, median, and range values.

In [5]:
df["essay_length"].describe()

count    9048.000000
mean      293.724580
std        49.369232
min        51.000000
25%       265.000000
50%       290.000000
75%       321.000000
max       466.000000
Name: essay_length, dtype: float64

### Essay Length Histogram

Visualize the distribution of essay lengths using a histogram.

In [6]:
# Plotting the Distribution
df["essay_length"].hist(bins=100)

## Correlation Visualization

---

### Scatter Plot: Band Score vs Essay Length

Create a scatter plot to visualize the relationship between essay length and band scores.

In [7]:
df.plot.scatter(
    x="essay_length",
    y="band_score",
    labels={"essay_length": "Essay Length", "band_score": "Band Score"},
    title="Band Score vs. Essay Length",
)

### Box Plot: Essay Length Distribution by Band Score

Generate box plots for each band score category to compare essay length distributions across different scoring levels, showing means and standard deviations.

In [10]:
import plotly.graph_objects as go

# Get unique bands and sort them
sorted_bands = sorted(df["band_score"].unique())

fig = go.Figure()
fig.update_layout(width=720, height=480)

# Loop through sorted bands
for band in sorted_bands:
    band_data = df[df["band_score"] == band]["essay_length"]
    fig.add_trace(
        go.Box(
            y=band_data,
            name=f"Band {band}",
            boxmean="sd",  # Shows mean and standard deviation
        )
    )

fig.update_layout(
    title="Essay Length Distribution by Band",
    yaxis_title="Essay Length",
    xaxis_title="Band",
    # Optionally force the category order
    xaxis={
        "categoryorder": "array",
        "categoryarray": [f"Band {band}" for band in sorted_bands],
    },
)
fig.show()
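
The same per-band comparison can also be read off as numbers. This is a minimal sketch, assuming a toy frame that reuses the notebook's `band_score` and `essay_length` column names; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data reusing the notebook's column names.
toy = pd.DataFrame(
    {
        "band_score": [5.0, 5.0, 6.0, 6.0, 6.0],
        "essay_length": [240, 260, 280, 300, 320],
    }
)

# One row per band: group size, mean, median, and sample standard deviation.
summary = toy.groupby("band_score")["essay_length"].agg(
    ["count", "mean", "median", "std"]
)
print(summary)
```

Note that pandas computes the sample standard deviation (`ddof=1`) by default, and that bands with very few essays give unstable statistics.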

### OLS Trendline Analysis

Add an Ordinary Least Squares (OLS) trendline to the scatter plot to quantify the linear relationship between essay length and band scores.

In [11]:
import plotly.express as px

fig = px.scatter(
    df,
    x="essay_length",
    y="band_score",
    trendline="ols",
    labels={"essay_length": "Essay Length", "band_score": "Band Score"},
    title="Band Score vs. Essay Length",
)
fig.show()
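
For intuition about what the trendline represents, the sketch below fits a straight line by least squares with `numpy.polyfit` on made-up data. It is not the code Plotly runs (Plotly Express delegates `trendline="ols"` to statsmodels); it only illustrates the slope and intercept that such a fit produces.

```python
import numpy as np

# Hypothetical toy data, not taken from the dataset.
x = np.array([200.0, 250.0, 300.0, 350.0, 400.0])  # essay length in words
y = np.array([5.0, 5.5, 6.0, 6.0, 6.5])            # band score

# A degree-1 polynomial fit minimises the sum of squared residuals.
slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept

print(f"slope={slope:.4f} bands per word, intercept={intercept:.2f}")
```

The slope is the expected change in band score per additional word, which is often easier to interpret when multiplied out, for example per 100 words.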

## Statistical Correlation Analysis

---

Calculate the Pearson correlation coefficient and p-value to determine the strength and statistical significance of the relationship between essay length and band scores.

In [12]:
from scipy.stats import pearsonr

r, p = pearsonr(df["band_score"], df["essay_length"])
print(f"pearson r={r:.3f}, p={p:.3g}")

pearson r=0.286, p=7.82e-170


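- r: the Pearson correlation coefficient, which ranges from -1 to 1 and measures the strength and direction of the linear relationship between the two variables.
- p: the probability of observing a correlation at least as strong as the one found, in either direction, purely by chance if there were actually no linear relationship between the variables.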
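
To make the coefficient concrete, the sketch below computes Pearson's r from its definition on made-up data and then squares the notebook's reported r to get the share of variance explained. The helper `pearson_r` and the toy arrays are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

# Hypothetical toy data, not taken from the dataset.
lengths = np.array([250.0, 270.0, 290.0, 310.0, 330.0, 350.0])
scores = np.array([5.5, 6.0, 5.5, 6.5, 6.0, 7.0])


def pearson_r(x: np.ndarray, y: np.ndarray) -> float:
    """Covariance of x and y divided by the product of their spreads."""
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum()))


r_toy = pearson_r(lengths, scores)

# r squared for the value reported in this notebook (r = 0.286).
r_squared_notebook = 0.286**2

print(f"toy r={r_toy:.3f}, notebook r^2={r_squared_notebook:.3f}")
```

Squaring r = 0.286 gives roughly 0.08, which is why the finding is described as weak despite the very small p-value.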

### [`Click here to go back to the Homepage`](../Homepage.md)