<a href="https://colab.research.google.com/github/HazelvdW/context-framed-listening/blob/main/framed_listening_text_prep.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Framed Listening: **Descriptive Stats & Text Preprocessing**
> By **Hazel A. van der Walle** (PhD student, Music, Durham University), September 2025.

All datasets generated and used for this study are openly available on GitHub: https://github.com/HazelvdW/context-framed-listening.

In [None]:
# Remove any earlier copy of the repository (-f: no error if the folder is absent)
!rm -rf context-framed-listening
# Clone the GitHub repository
!git clone https://github.com/HazelvdW/context-framed-listening.git

Refresh files to see **"context-framed-listening"**.


---

## Setup

In [None]:
import os
import csv
import pandas as pd
import numpy as np

from google.colab import data_table
data_table.enable_dataframe_formatter()

Load the data file "**data_study1_MAIN**", which contains participants' thought descriptions.

In [None]:
data = pd.read_csv("/content/context-framed-listening/data_study1_MAIN.csv")

Sentiment analysis is being conducted on music-evoked thoughts (METs).

Create a separate dataset that only contains trials where METs were described (i.e. all rows where "descr_THOUGHT.text" is _not_ NA) and drop the columns only relevant to no-MET trials:

In [None]:
dataMET = data[data['descr_THOUGHT.text'].notna()].copy()

# drop no-MET columns
dataMET.drop(columns = ['response_thought_or_not.keys', 'input_NOT.text'],
             inplace=True)

# Shorten clip_name: keep a 3-letter genre prefix plus a 7-character slice
# whose start position depends on the genre
for rowIndex, row in dataMET.iterrows():
    clip_name = row['clip_name']
    if clip_name[0:3] == '80s':
        dataMET.loc[rowIndex,'clip_name'] = '80s'+clip_name[3:10]
    elif clip_name[0:3] == 'Jaz':
        dataMET.loc[rowIndex,'clip_name'] = 'Jaz'+clip_name[4:11]
    elif clip_name[0:3] == 'Met':
        dataMET.loc[rowIndex,'clip_name'] = 'Met'+clip_name[5:12]
    elif clip_name[0:3] == 'Ele':
        dataMET.loc[rowIndex,'clip_name'] = 'Ele'+clip_name[10:17]


display(dataMET)

# print all column headers for later reference
print(dataMET.columns)

# print number of trials with and without MET descriptions
non_na_count = len(dataMET)
print(f"\nNumber of trials with MET description: {non_na_count}")

na_count = data['descr_THOUGHT.text'].isna().sum()
print(f"Number of trials with no MET description: {na_count}")
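
The prefix-by-prefix slicing in the loop above can also be written as one small lookup-driven function. This is only a sketch: `SLICE_START`, `shorten_clip_name`, and the example file names are hypothetical, and it assumes the real clip names follow the same offsets the loop uses.

```python
# Start index of the 7-character slice kept for each 3-letter prefix,
# copied from the offsets used in the loop (3, 4, 5 and 10).
SLICE_START = {"80s": 3, "Jaz": 4, "Met": 5, "Ele": 10}


def shorten_clip_name(clip_name):
    """Return prefix + 7-character slice, or the name unchanged if no prefix matches."""
    prefix = clip_name[:3]
    start = SLICE_START.get(prefix)
    if start is None:
        return clip_name
    return prefix + clip_name[start:start + 7]


# Hypothetical file names, invented for illustration only.
examples = ["Jazz_clipA1.wav", "Electronic_clipB2.wav", "Other_clip.wav"]
shortened = [shorten_clip_name(name) for name in examples]
print(shortened)
```

On the real dataframe, something like `dataMET['clip_name'].map(shorten_clip_name)` would apply the same rule without row-by-row `.loc` writes.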

Combine the clip and context values into an additional column (`clip_context_PAIR`), and create a clip genre column (`clip_genre`) to use later.

In [None]:
def create_clip_context_pair(row):
    clip_name = row['clip_name']
    if 'bar' in row['context_word']:
        return 'BAR-' + clip_name
    elif 'video game' in row['context_word']:
        return 'VIDEOGAME-' + clip_name
    elif 'concert' in row['context_word']:
        return 'CONCERT-' + clip_name
    elif 'movie' in row['context_word']:
        return 'MOVIE-' + clip_name
    else:
        return 'NO_MATCH'

dataMET['clip_context_PAIR'] = dataMET.apply(create_clip_context_pair, axis=1)

# Create 'clip_genre' column
def extract_genre(clip_name):
    if '80s' in clip_name:
        return '80s'
    elif 'Jaz' in clip_name:
        return 'Jazz'
    elif 'Met' in clip_name:
        return 'Metal'
    elif 'Ele' in clip_name:
        return 'Electronic'
    else:
        return 'UNKNOWN'

dataMET['clip_genre'] = dataMET['clip_name'].apply(extract_genre)


# Reorder columns
cols = dataMET.columns.tolist()
# Move 'clip_genre' to be after 'clip_name'
cols.insert(cols.index('clip_name') + 1, cols.pop(cols.index('clip_genre')))
# Move 'clip_context_PAIR' to be after 'context_word'
cols.insert(cols.index('context_word') + 1, cols.pop(cols.index('clip_context_PAIR')))
dataMET = dataMET[cols]


# Check the dataframe by a quick re-view
display(dataMET)
print(dataMET.columns)

# Saving a .csv for the option to open and look at the full dataframe
dataMET.to_csv('/content/context-framed-listening/NLP_outputs/dataMET.csv', encoding='utf-8')
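
Both helper functions above fall back to a placeholder label (`'NO_MATCH'` or `'UNKNOWN'`) when nothing matches, so it may be worth confirming that no rows ended up there. A minimal sketch on an invented toy dataframe; `count_fallbacks` and `toy_pairs` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for dataMET; the rows are invented for illustration only.
toy_pairs = pd.DataFrame({
    "clip_context_PAIR": ["BAR-Jaz_clipA1", "NO_MATCH", "MOVIE-80s_clipC3"],
    "clip_genre": ["Jazz", "UNKNOWN", "80s"],
})


def count_fallbacks(df):
    """Count rows that fell through to the fallback labels."""
    return {
        "NO_MATCH": int((df["clip_context_PAIR"] == "NO_MATCH").sum()),
        "UNKNOWN": int((df["clip_genre"] == "UNKNOWN").sum()),
    }


fallbacks = count_fallbacks(toy_pairs)
print(fallbacks)
```

Calling `count_fallbacks(dataMET)` on the real data should ideally report zero for both labels.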

---
## Descriptive Statistics

Basic descriptive statistics on the clip-context pairings.


Create a dataframe including summary info about each clip-context stimuli pairing:

* Number of participants that reported METs while listening
* Mean MET and clip ratings

In [None]:
columns = dataMET.columns.tolist()[1:-1]

# Drop these following columns so they don't aggregate by clip-context grouping:
drop = ['clip_name', 'clip_genre', 'context_word', 'clip_context_PAIR',
        'expName', 'File_ID', 'date', 'descr_THOUGHT.text',
        'demographics.headphones', 'demographics.age',
        'demographics.gender','demographics.livingCountry',
        'demographics.birthCountry', 'demographics.nativeLanguage',
        'demographics.otherLanguage', 'demographics.otherLanguageText',
        'demographics.hearingImpariments', 'demographics.hearingImpairmentsText',
        'demographics.education','demographics.musicianIdentification',
        'demographics.feedback']

# Setting up an aggregate function collector
agg_fun = {}

# Trials without METs were dropped, so counting rows per pairing gives the number of reported METs
agg_fun['PROLIFIC_PID'] = 'count'

# Taking the mean of all columns except participant IDs and dropped columns
for col in columns:
    if col not in drop and col != 'PROLIFIC_PID':
        agg_fun[col] = 'mean'

# Group the dataframe by clip-context pairing, then run the aggregate functions created above
clipContextDescrStats = dataMET.groupby('clip_context_PAIR').agg(agg_fun)
display(clipContextDescrStats)

# Saving a .csv for the option to open and look at the full dataframe
clipContextDescrStats.to_csv('/content/context-framed-listening/NLP_outputs/clipContextDescrStats.csv', encoding='utf-8')
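
To make the aggregation step concrete, here is a small sketch of the same `groupby(...).agg(...)` pattern on invented data. The column `rating_sketch` and all values are hypothetical; only the pattern (count of participant IDs, mean of numeric columns) mirrors the cell above.

```python
import pandas as pd

# Invented toy data: two METs for one pairing, one MET for another.
toy_trials = pd.DataFrame({
    "clip_context_PAIR": ["BAR-a", "BAR-a", "MOVIE-b"],
    "PROLIFIC_PID": ["p1", "p2", "p1"],
    "rating_sketch": [2.0, 4.0, 5.0],  # hypothetical rating column
})

# One aggregation per column: count IDs, average the ratings.
agg_sketch = {"PROLIFIC_PID": "count", "rating_sketch": "mean"}
stats_sketch = toy_trials.groupby("clip_context_PAIR").agg(agg_sketch)
print(stats_sketch)
```

Note that `'count'` tallies non-missing rows, so it equals the number of participants only if each participant contributes at most one MET per pairing; `'nunique'` would count distinct IDs instead.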

Check the minimum and maximum reported MET occurrences across all clip-context stimuli pairings:

In [None]:
mostMETs_ccpair = clipContextDescrStats['PROLIFIC_PID'].idxmax()
leastMETs_ccpair = clipContextDescrStats['PROLIFIC_PID'].idxmin()
mostMETs_value = clipContextDescrStats['PROLIFIC_PID'].max()
leastMETs_value = clipContextDescrStats['PROLIFIC_PID'].min()

print(f"Clip-context pair with the most reported METs: {mostMETs_ccpair} ({mostMETs_value})")
print(f"Clip-context pair with the least reported METs: {leastMETs_ccpair} ({leastMETs_value})")

---
## Text Preprocessing

Different NLP models require different levels of text filtering to perform optimally.

**BERT** - Benefits from minimal preprocessing because:
* Understands context and handles common words well
* Has its own tokenisation and handles word variations
* Only needs custom domain-specific stop words removed

**Word2Vec & TF-IDF** - Need more aggressive preprocessing because:
* Don't understand context as well
* Common stop words add noise
* Lemmatisation helps group related terms

> We will produce two levels of text preprocessing to save out for later analyses. All text will be manually spell checked.

### Spell checking

Collect all misspellings flagged by spell checking packages to manually go through and make corrections where necessary.

In [None]:
import re
from collections import defaultdict

import nltk
nltk.download('stopwords')

from nltk import pos_tag, word_tokenize
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')

#!pip uninstall -y pyspellchecker
!pip install pyspellchecker
from spellchecker import SpellChecker
spell = SpellChecker()

In [None]:
def find_potential_misspellings_with_context(row):
    misspellings_with_context = []
    text = row['descr_THOUGHT.text']
    if isinstance(text, str):
        # Tokenise the raw text
        tokens = word_tokenize(text)
        for word in tokens:
            # Clean the word for spell checking (lowercase and remove punctuation)
            cleaned_word = re.sub(r'[^a-zA-Z0-9]', '', word.lower())
            if cleaned_word and cleaned_word not in spell:
                misspellings_with_context.append((cleaned_word, text)) # Store word and original text
    return misspellings_with_context

# Apply the function to each row and collect all misspellings with their context
all_misspellings_with_context = []
for index, row in dataMET.iterrows():
    misspellings_list = find_potential_misspellings_with_context(row)
    all_misspellings_with_context.extend(misspellings_list)

# Create a DataFrame from the collected misspellings and their context
misspellings_df = pd.DataFrame(all_misspellings_with_context, columns=['Potential Misspelling', 'Original Text'])

# Display the dataFrame showing unique misspellings
## (to show all occurrences, remove .drop_duplicates() <- otherwise, this shows their first occurrence context only.)
display(misspellings_df.drop_duplicates(subset=['Potential Misspelling']))

# Saving the full dataFrame to a CSV file
misspellings_df.to_csv('/content/context-framed-listening/NLP_outputs/misspellings_df.csv', encoding='utf-8', index=False)

# This dataFrame can be used to create your spelling dictionary.
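
A minimal sketch of the flagging logic on a toy sentence. To keep it free of downloads, a small hypothetical `known_words` set stands in for the `SpellChecker` dictionary and `str.split` stands in for `word_tokenize`; the cleaning regex is the one used above.

```python
import re

# Hypothetical vocabulary standing in for the spell checker's dictionary.
known_words = {"a", "slow", "guitar", "solo", "in", "smoky", "room"}


def flag_unknown_words(text, vocabulary):
    """Return (cleaned_word, original_text) pairs for words not in the vocabulary."""
    flagged = []
    for token in text.split():
        # Lowercase and strip everything except letters and digits
        cleaned = re.sub(r"[^a-zA-Z0-9]", "", token.lower())
        if cleaned and cleaned not in vocabulary:
            flagged.append((cleaned, text))
    return flagged


flagged_pairs = flag_unknown_words("A slow gutar solo, in a smokey room.", known_words)
flagged_words = [word for word, _ in flagged_pairs]
print(flagged_words)
```

Flagged words are candidates only: valid spelling variants or slang may be flagged too, which is why the list is reviewed by hand.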

> I saved the above file out to manually check over and edit the spellings as needed. See `misspelling_correction_ds1.csv`.

Below, apply these spelling corrections from the **manually edited csv file** to `dataMET`

In [None]:
# Load the correction mapping from the CSV file
try:
    correction_df = pd.read_csv("/content/context-framed-listening/NLP_outputs/misspelling_correction_ds1.csv")

    # Drop rows with no 'misspelling' entry; a missing 'correction' becomes an
    # empty string (i.e. the flagged word is dropped from the text)
    valid_rows = correction_df.dropna(subset=['misspelling'])
    corrections = valid_rows['correction'].fillna('').astype(str)

    # Build a {misspelling: correction} dictionary
    ## Column access by ['name'] raises KeyError if a column is missing (handled below)
    correction_mapping = dict(zip(valid_rows['misspelling'].astype(str), corrections))

    print("Correction mapping loaded successfully.")
    #print(correction_mapping) # [optional viewing]

except FileNotFoundError:
    print("Error: misspelling_correction_ds1.csv not found.")
    correction_mapping = {} # empty mapping if file not found
except KeyError:
    print("Error: CSV must contain 'misspelling' and 'correction' columns.")
    correction_mapping = {} # empty mapping if columns are missing


def apply_manual_corrections(text, mapping):
    if isinstance(text, str):

        # Clean the text (lowercase and remove punctuation) before applying corrections
        cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', '', text.lower())
        words = cleaned_text.split()

        # Apply corrections
        ## Ensure the mapped value is a string before joining
        corrected_words = [str(mapping.get(word, word)) for word in words]
        return " ".join(corrected_words)
    return str(text) # Also ensure the return value is a string if the input was not a string

# Apply the manual corrections to the descr_THOUGHT.text column
dataMET['descr_THOUGHT.text_corrected'] = dataMET['descr_THOUGHT.text'].apply(lambda x: apply_manual_corrections(x, correction_mapping))

# Display the original and corrected text to check the changes
display(dataMET[['descr_THOUGHT.text', 'descr_THOUGHT.text_corrected']])

# The preprocessing cells below use descr_THOUGHT.text_corrected as their input
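
To see what the correction step does to a single description, here is a self-contained sketch with the same cleaning regex and lookup logic. The function name `correct_text_sketch`, the mapping, and the sentences are hypothetical.

```python
import re


def correct_text_sketch(text, mapping):
    """Lowercase, strip punctuation, then replace words found in the mapping."""
    if isinstance(text, str):
        cleaned = re.sub(r"[^a-zA-Z0-9\s]", "", text.lower())
        return " ".join(str(mapping.get(word, word)) for word in cleaned.split())
    return str(text)


toy_mapping = {"gutar": "guitar"}  # hypothetical correction entry

corrected_a = correct_text_sketch("A loud Gutar solo, in a bar!", toy_mapping)
corrected_b = correct_text_sketch("I don't know", toy_mapping)
print(corrected_a)
print(corrected_b)
```

The second call shows a side effect worth keeping in mind: because punctuation is removed before the lookup, contractions lose their apostrophe ("don't" becomes "dont"), and all text downstream of this step is lowercase.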


The packages needed for text cleaning were imported above. The following steps clean these METs before they are fed into our NLP models.

### Stop Words

These are words we will remove from the text, so they won't be accounted for in analyses.

We will define custom domain-specific stop words (e.g. music style, context cues, thought types), and get NLTK common English-language stop words.

Both preprocessing levels will have custom stop words removed, but only level 2 preprocessing for Word2Vec and TF-IDF will have the NLTK stop words removed (as BERT understands "the", "is", "and", etc.).

In [None]:
# Define custom stop words (domain-specific terms to remove)
customStopWords = ['music', 'song', 'songs', 'excerpt', 'excerpts', 'piece', 'pieces', 'clip', 'clips',
                   'electronic', 'jazz', 'metal', 'rock',
                   'bar', 'concert', 'film', 'movie', 'videogame', 'video', 'game',
                   #'nineteen', '1920s', '20s', '1930s', '30s', '1940s', '40s',
                   #'50s', '1950s', 'fifties', '50', '1950', 'fifty',
                   #'60s', '1960s', 'sixties', '60', '1960', 'sixty',
                   #'70s', '1970s', 'seventies', '70', '1970', 'seventy',
                   #'80s', '1980s', 'eighties', '80', '1980', 'eighty',
                   #'90s', '1990s', 'nineties', '90', '1990', 'ninety',
                   #'00s', '2000s', 'noughties', '2000', 'y2k', '2010',
                   'think', 'thinks', 'thought', 'thinking',
                   'imagine', 'imagines', 'imagined', 'imagining',
                   'image', 'images', 'imaged', 'imaging',
                   'visualise', 'visualises', 'visualised', 'visualising',
                   'visualize', 'visualizes', 'visualized', 'visualizing',
                   'picture', 'pictures', 'pictured', 'picturing',
                   'scene', 'scenes', 'story', 'stories',
                   'memory', 'memories', 'reminder', 'reminders', 'remind', 'reminds',
                   'remember', 'remembers', 'remembered', 'remembering',
                   'reminiscent', 'reminisce', 'reminisces', 'reminisced', 'reminiscing',
                   'make', 'makes', 'made', 'making',
                   'sound', 'sounds', 'sounded', 'sounding']


# Combine custom stop words with NLTK stop words
nltk_stop_words = set(nltk.corpus.stopwords.words('english'))
print(nltk_stop_words)

all_stop_words = set(customStopWords).union(nltk_stop_words)
print("\nCustom Stop Words:", len(customStopWords))
print("Total Stop Words (custom + NLTK):", len(all_stop_words))
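
Stop-word removal in this notebook works on whole tokens, not substrings. The sketch below illustrates this with a reduced, hypothetical stop list and a plain `str.split` tokeniser (the notebook itself uses `word_tokenize`).

```python
# Reduced stop list for illustration; the full list is defined above.
custom_sketch = {"video", "game", "videogame", "bar", "music"}

sentence = "it felt like a video game boss fight in a barn"
tokens = sentence.split()

# Keep a token unless its lowercase form is exactly a stop word
kept_tokens = [tok for tok in tokens if tok.lower() not in custom_sketch]
print(kept_tokens)
```

"video" and "game" are removed as separate tokens, which is why both appear in the list alongside "videogame", while "barn" survives because it merely contains "bar".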

### Lemmatise

Lemmatising groups inflected word forms together by identifying the dictionary form (the lemma) to analyse them as a single item.

[_e.g., improve/improves/improving/improved = improve_]

Only level 2 preprocessing for Word2Vec and TF-IDF will have lemmatisation (BERT is able to handle word variations).

In [None]:
# Initialise lemmatiser and tag mapping

lemmatiser = nltk.stem.WordNetLemmatizer()
wn = nltk.corpus.wordnet

tag_map = defaultdict(lambda: wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV
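
The tag map keys on the first letter of the part-of-speech tag returned by `pos_tag` and falls back to noun for everything else. A small sketch of that lookup, using the one-letter strings that `wn.NOUN`, `wn.ADJ`, `wn.VERB` and `wn.ADV` stand for so that it runs without NLTK; `wordnet_pos` and `tag_map_sketch` are hypothetical names.

```python
from collections import defaultdict

# One-letter WordNet POS codes (the values of wn.NOUN, wn.ADJ, wn.VERB, wn.ADV)
NOUN, ADJ, VERB, ADV = "n", "a", "v", "r"

tag_map_sketch = defaultdict(lambda: NOUN)
tag_map_sketch["J"] = ADJ
tag_map_sketch["V"] = VERB
tag_map_sketch["R"] = ADV


def wordnet_pos(penn_tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS code."""
    return tag_map_sketch[penn_tag[0]]


pos_codes = {tag: wordnet_pos(tag) for tag in ["VBG", "JJ", "RB", "NNS", "IN"]}
print(pos_codes)
```

Tags outside the three mapped families, such as the preposition tag `IN`, are treated as nouns, which in practice usually leaves such words unchanged by the lemmatiser.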

### Apply text preprocessing:



In [None]:
def preprocess_level1(text, custom_stop_words):
    """
    Level 1: Minimal preprocessing for BERT
    - Only removes custom domain-specific stop words
    - Preserves original word forms (no lemmatisation)
    - Leaves casing and punctuation untouched; note that the corrected text
      passed in below has already been lowercased and stripped of punctuation
    """
    if isinstance(text, str):
        # Tokenise the text
        tokens = word_tokenize(text)
        # Only remove custom stop words, keep everything else
        filtered_tokens = []
        for word in tokens:
            # Check lowercase version against stop words, but keep original case
            if word.lower() not in custom_stop_words:
                filtered_tokens.append(word)
        return " ".join(filtered_tokens)
    return text


def preprocess_level2(text, all_stop_words, lemmatiser, tag_map):
    """
    Level 2: Aggressive preprocessing for Word2Vec and TF-IDF
    - Removes all stop words (custom + NLTK)
    - Applies lemmatisation
    - Converts to lowercase
    - Removes punctuation
    """
    if isinstance(text, str):
        # Tokenise the text
        tokens = word_tokenize(text)
        # Remove stop words and lemmatize
        lemmatised_tokens = []
        for word, tag in pos_tag(tokens):
            # Convert to lowercase and remove non-alphabetic characters
            cleaned_word = re.sub(r'[^a-zA-Z]', '', word.lower())
            if cleaned_word and cleaned_word not in all_stop_words:
                # Lemmatize using the POS tag
                lemmatised_word = lemmatiser.lemmatize(cleaned_word, tag_map[tag[0]])
                lemmatised_tokens.append(lemmatised_word)
        return " ".join(lemmatised_tokens)
    return text


# Apply Level 1 preprocessing (for BERT)
dataMET['METdescr_prepLVL1'] = dataMET['descr_THOUGHT.text_corrected'].apply(
    lambda x: preprocess_level1(x, set(customStopWords))
)

# Apply Level 2 preprocessing (for Word2Vec and TF-IDF)
dataMET['METdescr_prepLVL2'] = dataMET['descr_THOUGHT.text_corrected'].apply(
    lambda x: preprocess_level2(x, all_stop_words, lemmatiser, tag_map)
)

# Display all text processing stages
display(dataMET[['descr_THOUGHT.text',
                 'descr_THOUGHT.text_corrected',
                 'METdescr_prepLVL1',
                 'METdescr_prepLVL2']])

# Save out for analyses and inspection
dataMET.to_csv('/content/context-framed-listening/NLP_outputs/dataMET_preprocessed.csv', encoding='utf-8')

print("\nPreprocessing complete!")
print("- 'METdescr_prepLVL1': Minimal filtering for BERT (custom stop words only)")
print("- 'METdescr_prepLVL2': Full preprocessing for Word2Vec/TF-IDF (all stop words + lemmatisation)")


`dataMET` (saved out as `dataMET_preprocessed.csv`) now contains the preprocessed dataframe with the two levels of filtering: `METdescr_prepLVL1` and `METdescr_prepLVL2`. We will refer to this file for the following NLP models.
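
One thing to watch when reloading the saved file: a description made up entirely of stop words is reduced to an empty string, and pandas reads empty CSV fields back as `NaN` by default. The sketch below shows this on an invented two-row dataframe, using an in-memory buffer in place of the real file.

```python
import io

import pandas as pd

# Invented rows: the second description is emptied by Level 2 preprocessing.
toy_prepped = pd.DataFrame({
    "METdescr_prepLVL1": ["a guitar solo", "the"],
    "METdescr_prepLVL2": ["guitar solo", ""],
})

# Round trip through CSV (the index is written as the first column)
buffer = io.StringIO()
toy_prepped.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)

n_missing = int(reloaded["METdescr_prepLVL2"].isna().sum())
print(n_missing)

# Restore empty strings before passing the text to a model
refilled = reloaded.fillna("")
```

Reading with `index_col=0` also avoids the extra `Unnamed: 0` column that appears when the saved index is loaded as ordinary data.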