<a href="https://colab.research.google.com/github/HazelvdW/context-framed-listening/blob/main/framed_listening_text_prep.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Framed Listening: **Text Preprocessing**
> By **Hazel A. van der Walle** (PhD student, Music, Durham University), September 2025.

All datasets generated and used for this study are openly available on GitHub at https://github.com/HazelvdW/context-framed-listening

In [17]:
# Remove any previous copy so the clone starts fresh (-f avoids an error on the first run)
!rm -rf context-framed-listening
# Clone the GitHub repository
!git clone https://github.com/HazelvdW/context-framed-listening.git

Cloning into 'context-framed-listening'...
remote: Enumerating objects: 144, done.
remote: Counting objects: 100% (144/144), done.
remote: Compressing objects: 100% (123/123), done.
remote: Total 144 (delta 86), reused 48 (delta 20), pack-reused 0 (from 0)
Receiving objects: 100% (144/144), 1.29 MiB | 2.97 MiB/s, done.
Resolving deltas: 100% (86/86), done.


Refresh files to see **"context-framed-listening"**.


---

## Setup

In [18]:
import os
import csv
import pandas as pd
import numpy as np

from google.colab import data_table
data_table.enable_dataframe_formatter()

Load the data file "**data_study1_MAIN**", which contains participants' thought descriptions.

In [19]:
data = pd.read_csv("/content/context-framed-listening/data_study1_MAIN.csv")

Sentiment analysis is being conducted on music-evoked thoughts (METs).

Create a separate dataset that only contains trials where METs were described (i.e. all rows where "descr_THOUGHT.text" is _not_ NA) and drop the columns only relevant to no-MET trials:

In [20]:
dataMET = data[data['descr_THOUGHT.text'].notna()].copy()

# drop no-MET columns
dataMET.drop(columns = ['response_thought_or_not.keys', 'input_NOT.text'],
             inplace=True)

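# Shorten clip_name to a three-letter genre prefix plus level and number (e.g. 'Jaz_MED_07');
# the slice offsets differ because each genre prefix in the raw name has a different length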
for rowIndex, row in dataMET.iterrows():
    clip_name = row['clip_name']
    if clip_name[0:3] == '80s':
        dataMET.loc[rowIndex,'clip_name'] = '80s'+clip_name[3:10]
    elif clip_name[0:3] == 'Jaz':
        dataMET.loc[rowIndex,'clip_name'] = 'Jaz'+clip_name[4:11]
    elif clip_name[0:3] == 'Met':
        dataMET.loc[rowIndex,'clip_name'] = 'Met'+clip_name[5:12]
    elif clip_name[0:3] == 'Ele':
        dataMET.loc[rowIndex,'clip_name'] = 'Ele'+clip_name[10:17]


display(dataMET)

# print all column headers for later reference
print(dataMET.columns)

# print number of trials with and without MET descriptions
non_na_count = len(dataMET)
print(f"\nNumber of trials with MET description: {non_na_count}")

na_count = data['descr_THOUGHT.text'].isna().sum()
print(f"Number of trials with no MET description: {na_count}")



Unnamed: 0,clip_name,context_word,expName,PROLIFIC_PID,File_ID,date,descr_THOUGHT.text,rating_music_prompted.response,rating_spontaneity.response,rating_novelty.response,...,demographics.livingCountry,demographics.birthCountry,demographics.nativeLanguage,demographics.otherLanguage,demographics.otherLanguageText,demographics.hearingImpariments,demographics.hearingImpairmentsText,demographics.education,demographics.musicianIdentification,demographics.feedback
0,80s_LOW_02,bar,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,"kind of sad, melancholy. not happy or upbeat. ...",5.0,4.0,3.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
1,Jaz_MED_07,video game,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,it did not sound like a video game. if anythin...,5.0,5.0,2.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
2,80s_MED_08,video game,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,"overly upbeat. no real emotions, peppy. too mu...",5.0,4.0,4.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
3,Met_LOW_09,concert,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,"very heavy rock, not for me. somewhere that i ...",5.0,5.0,3.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
4,Met_MED_20,video game,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,"very charged, maybe you've won something or wo...",5.0,4.0,2.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2554,Met_MED_20,concert,clip_context_g4,6824fc226d9b4777f8695cf0,clip_context_g4_PROLIFIC_PID_992291.csv,2025-07-01_06h12.29.151,A rock band made up of teenage white kids play...,5.0,5.0,4.0,...,United States,United States,English,False,,False,,3,2,none
2555,80s_LOW_02,concert,clip_context_g4,6824fc226d9b4777f8695cf0,clip_context_g4_PROLIFIC_PID_992291.csv,2025-07-01_06h12.29.151,People in a ballroom in elegant dresses slow d...,5.0,5.0,5.0,...,United States,United States,English,False,,False,,3,2,none
2557,Jaz_MED_02,video game,clip_context_g4,6824fc226d9b4777f8695cf0,clip_context_g4_PROLIFIC_PID_992291.csv,2025-07-01_06h12.29.151,I imagined a jazz festival and old men on stag...,5.0,5.0,4.0,...,United States,United States,English,False,,False,,3,2,none
2558,Ele_MED_20,movie,clip_context_g4,6824fc226d9b4777f8695cf0,clip_context_g4_PROLIFIC_PID_992291.csv,2025-07-01_06h12.29.151,I imagined a documentary mostly about fun fact...,5.0,5.0,5.0,...,United States,United States,English,False,,False,,3,2,none


Index(['clip_name', 'context_word', 'expName', 'PROLIFIC_PID', 'File_ID',
       'date', 'descr_THOUGHT.text', 'rating_music_prompted.response',
       'rating_spontaneity.response', 'rating_novelty.response',
       'rating_familiarity.response', 'rating_enjoyment.response',
       'demographics.headphones', 'demographics.age', 'demographics.gender',
       'demographics.livingCountry', 'demographics.birthCountry',
       'demographics.nativeLanguage', 'demographics.otherLanguage',
       'demographics.otherLanguageText', 'demographics.hearingImpariments',
       'demographics.hearingImpairmentsText', 'demographics.education',
       'demographics.musicianIdentification', 'demographics.feedback'],
      dtype='object')

Number of trials with MET description: 1962
Number of trials with no MET description: 598
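
The slice offsets in the `clip_name` loop above depend on the exact length of each genre prefix in the raw file names. As an alternative, a regular expression can pull out the genre, level label and clip number in one pass. The sketch below assumes raw names of the form `Electronic_MED_20.wav`, which is inferred from the slice offsets rather than confirmed against the data; `CLIP_PATTERN` and `shorten_clip_name` are hypothetical names.

```python
import re

# Assumed raw format: <Genre>_<LEVEL>_<NN>, optionally followed by a suffix such as ".wav".
CLIP_PATTERN = re.compile(r'^(80s|Jazz|Metal|Electronic)_([A-Z]{3})_(\d{2})')

def shorten_clip_name(raw_name):
    """Return e.g. 'Ele_MED_20' for 'Electronic_MED_20.wav' (hypothetical helper)."""
    match = CLIP_PATTERN.match(raw_name)
    if match is None:
        return raw_name  # leave unrecognised names untouched
    genre, level, number = match.groups()
    return f"{genre[:3]}_{level}_{number}"

print(shorten_clip_name('Electronic_MED_20.wav'))
print(shorten_clip_name('80s_LOW_02.wav'))
```

If the assumption about the raw names holds, this could be applied column-wise with `dataMET['clip_name'].map(shorten_clip_name)` instead of looping over rows.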


Combine the clip and context values into an additional column (`clip_context_PAIR`), and create a clip genre column (`clip_genre`) to use later.

In [21]:
def create_clip_context_pair(row):
    clip_name = row['clip_name']
    if 'bar' in row['context_word']:
        return 'BAR-' + clip_name
    elif 'video game' in row['context_word']:
        return 'VIDEOGAME-' + clip_name
    elif 'concert' in row['context_word']:
        return 'CONCERT-' + clip_name
    elif 'movie' in row['context_word']:
        return 'MOVIE-' + clip_name
    else:
        return 'NO_MATCH'

dataMET['clip_context_PAIR'] = dataMET.apply(create_clip_context_pair, axis=1)

# Create 'clip_genre' column
def extract_genre(clip_name):
    if '80s' in clip_name:
        return '80s'
    elif 'Jaz' in clip_name:
        return 'Jazz'
    elif 'Met' in clip_name:
        return 'Metal'
    elif 'Ele' in clip_name:
        return 'Electronic'
    else:
        return 'UNKNOWN'

dataMET['clip_genre'] = dataMET['clip_name'].apply(extract_genre)


# Reorder columns
cols = dataMET.columns.tolist()
# Move 'clip_genre' to be after 'clip_name'
cols.insert(cols.index('clip_name') + 1, cols.pop(cols.index('clip_genre')))
# Move 'clip_context_PAIR' to be after 'context_word'
cols.insert(cols.index('context_word') + 1, cols.pop(cols.index('clip_context_PAIR')))
dataMET = dataMET[cols]


# Check the dataframe by a quick re-view
display(dataMET)
print(dataMET.columns)

# Saving a .csv for the option to open and look at the full dataframe
dataMET.to_csv('/content/context-framed-listening/NLP_outputs/dataMET.csv', encoding='utf-8')



Unnamed: 0,clip_name,clip_genre,context_word,clip_context_PAIR,expName,PROLIFIC_PID,File_ID,date,descr_THOUGHT.text,rating_music_prompted.response,...,demographics.livingCountry,demographics.birthCountry,demographics.nativeLanguage,demographics.otherLanguage,demographics.otherLanguageText,demographics.hearingImpariments,demographics.hearingImpairmentsText,demographics.education,demographics.musicianIdentification,demographics.feedback
0,80s_LOW_02,80s,bar,BAR-80s_LOW_02,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,"kind of sad, melancholy. not happy or upbeat. ...",5.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
1,Jaz_MED_07,Jazz,video game,VIDEOGAME-Jaz_MED_07,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,it did not sound like a video game. if anythin...,5.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
2,80s_MED_08,80s,video game,VIDEOGAME-80s_MED_08,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,"overly upbeat. no real emotions, peppy. too mu...",5.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
3,Met_LOW_09,Metal,concert,CONCERT-Met_LOW_09,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,"very heavy rock, not for me. somewhere that i ...",5.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
4,Met_MED_20,Metal,video game,VIDEOGAME-Met_MED_20,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,"very charged, maybe you've won something or wo...",5.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2554,Met_MED_20,Metal,concert,CONCERT-Met_MED_20,clip_context_g4,6824fc226d9b4777f8695cf0,clip_context_g4_PROLIFIC_PID_992291.csv,2025-07-01_06h12.29.151,A rock band made up of teenage white kids play...,5.0,...,United States,United States,English,False,,False,,3,2,none
2555,80s_LOW_02,80s,concert,CONCERT-80s_LOW_02,clip_context_g4,6824fc226d9b4777f8695cf0,clip_context_g4_PROLIFIC_PID_992291.csv,2025-07-01_06h12.29.151,People in a ballroom in elegant dresses slow d...,5.0,...,United States,United States,English,False,,False,,3,2,none
2557,Jaz_MED_02,Jazz,video game,VIDEOGAME-Jaz_MED_02,clip_context_g4,6824fc226d9b4777f8695cf0,clip_context_g4_PROLIFIC_PID_992291.csv,2025-07-01_06h12.29.151,I imagined a jazz festival and old men on stag...,5.0,...,United States,United States,English,False,,False,,3,2,none
2558,Ele_MED_20,Electronic,movie,MOVIE-Ele_MED_20,clip_context_g4,6824fc226d9b4777f8695cf0,clip_context_g4_PROLIFIC_PID_992291.csv,2025-07-01_06h12.29.151,I imagined a documentary mostly about fun fact...,5.0,...,United States,United States,English,False,,False,,3,2,none


Index(['clip_name', 'clip_genre', 'context_word', 'clip_context_PAIR',
       'expName', 'PROLIFIC_PID', 'File_ID', 'date', 'descr_THOUGHT.text',
       'rating_music_prompted.response', 'rating_spontaneity.response',
       'rating_novelty.response', 'rating_familiarity.response',
       'rating_enjoyment.response', 'demographics.headphones',
       'demographics.age', 'demographics.gender', 'demographics.livingCountry',
       'demographics.birthCountry', 'demographics.nativeLanguage',
       'demographics.otherLanguage', 'demographics.otherLanguageText',
       'demographics.hearingImpariments',
       'demographics.hearingImpairmentsText', 'demographics.education',
       'demographics.musicianIdentification', 'demographics.feedback'],
      dtype='object')
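
The pairing logic above can also be written as a lookup table, which keeps the context labels in one place. This is only a sketch of the same idea: `CONTEXT_PREFIX` and `make_pair` are hypothetical names, and the substring match mirrors the `in` checks used in `create_clip_context_pair`.

```python
# Context labels as they appear in 'context_word', mapped to the prefix used in the pair ID.
CONTEXT_PREFIX = {
    'bar': 'BAR',
    'video game': 'VIDEOGAME',
    'concert': 'CONCERT',
    'movie': 'MOVIE',
}

def make_pair(context_word, clip_name):
    """Build a pair ID such as 'BAR-80s_LOW_02' (hypothetical helper)."""
    for label, prefix in CONTEXT_PREFIX.items():
        if label in context_word:  # substring match, as in the cell above
            return f"{prefix}-{clip_name}"
    return 'NO_MATCH'

print(make_pair('video game', 'Jaz_MED_07'))  # VIDEOGAME-Jaz_MED_07
print(make_pair('bar', '80s_LOW_02'))         # BAR-80s_LOW_02
```

Whichever version is used, it is worth confirming that no row fell through to the fallback, for example with `(dataMET['clip_context_PAIR'] == 'NO_MATCH').sum()`.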


---
## Descriptive Statistics

Basic descriptive statistics on the clip-context pairings.


Create a dataframe with summary information about each clip-context stimulus pairing:

* Number of participants that reported METs while listening
* Mean MET and clip ratings

In [22]:
columns = dataMET.columns.tolist()[1:-1]

# Drop these following columns so they don't aggregate by clip-context grouping:
drop = ['clip_name', 'clip_genre', 'context_word', 'clip_context_PAIR',
        'expName', 'File_ID', 'date', 'descr_THOUGHT.text',
        'demographics.headphones', 'demographics.age',
        'demographics.gender','demographics.livingCountry',
        'demographics.birthCountry', 'demographics.nativeLanguage',
        'demographics.otherLanguage', 'demographics.otherLanguageText',
        'demographics.hearingImpariments', 'demographics.hearingImpairmentsText',
        'demographics.education','demographics.musicianIdentification',
        'demographics.feedback']

# Setting up an aggregate function collector
agg_fun = {}

# As we dropped trials without METs, counting rows per pairing gives the number of MET reports
agg_fun['PROLIFIC_PID'] = 'count'

# Taking the mean of all columns except participant IDs and dropped columns
for col in columns:
    if col not in drop and col != 'PROLIFIC_PID':
        agg_fun[col] = 'mean'

# Group the dataframe by clip-context pairing, then run the aggregate functions created above
clipContextDescrStats = dataMET.groupby('clip_context_PAIR').agg(agg_fun)
display(clipContextDescrStats)

# Saving a .csv for the option to open and look at the full dataframe
clipContextDescrStats.to_csv('/content/context-framed-listening/NLP_outputs/clipContextDescrStats.csv', encoding='utf-8')

clip_context_PAIR,PROLIFIC_PID,rating_music_prompted.response,rating_spontaneity.response,rating_novelty.response,rating_familiarity.response,rating_enjoyment.response
BAR-80s_LOW_02,34,4.235294,3.882353,3.382353,2.294118,3.705882
BAR-80s_LOW_06,29,4.448276,4.103448,3.482759,2.586207,3.793103
BAR-80s_MED_08,29,4.275862,3.862069,2.965517,2.000000,3.310345
BAR-80s_MED_13,25,4.160000,3.840000,2.360000,2.280000,3.600000
BAR-Ele_LOW_09,27,4.592593,3.962963,2.259259,2.000000,3.259259
...,...,...,...,...,...,...
VIDEOGAME-Jaz_MED_07,35,4.428571,4.285714,2.685714,2.457143,3.742857
VIDEOGAME-Met_LOW_09,34,4.558824,3.794118,2.411765,2.029412,3.235294
VIDEOGAME-Met_LOW_14,26,4.307692,3.923077,2.423077,2.230769,3.346154
VIDEOGAME-Met_MED_19,31,4.483871,4.161290,2.580645,2.645161,3.258065


Check the minimum and maximum number of reported MET occurrences across all clip-context stimulus pairings:

In [23]:
mostMETs_ccpair = clipContextDescrStats['PROLIFIC_PID'].idxmax()
leastMETs_ccpair = clipContextDescrStats['PROLIFIC_PID'].idxmin()
mostMETs_value = clipContextDescrStats['PROLIFIC_PID'].max()
leastMETs_value = clipContextDescrStats['PROLIFIC_PID'].min()

print(f"Clip-context pair with the most reported METs: {mostMETs_ccpair} ({mostMETs_value})")
print(f"Clip-context pair with the least reported METs: {leastMETs_ccpair} ({leastMETs_value})")

Clip-context pair with the most reported METs: BAR-Ele_LOW_14 (35)
Clip-context pair with the least reported METs: BAR-80s_MED_13 (25)
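
Note that `idxmax` and `idxmin` return only the first label when several pairings share the extreme count. A minimal sketch of a tie-aware check is given below; `extremes_with_ties` is a hypothetical helper and the counts are made up for illustration, not taken from the study data.

```python
def extremes_with_ties(counts):
    """Return (max_value, max_labels, min_value, min_labels) for a label -> count dict."""
    highest = max(counts.values())
    lowest = min(counts.values())
    most = sorted(label for label, n in counts.items() if n == highest)
    least = sorted(label for label, n in counts.items() if n == lowest)
    return highest, most, lowest, least

# Illustrative counts only.
toy_counts = {'BAR-A': 35, 'MOVIE-B': 35, 'CONCERT-C': 25, 'BAR-D': 30}
print(extremes_with_ties(toy_counts))
```

With the real table, the input could be built with `clipContextDescrStats['PROLIFIC_PID'].to_dict()`.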


---
## Text Preprocessing

### Spell checking

Collect all misspellings flagged by spell checking packages to manually go through and make corrections where necessary.

In [24]:
import re
from collections import defaultdict

import nltk
from nltk import word_tokenize
nltk.download('punkt_tab')

#!pip uninstall -y pyspellchecker
!pip install pyspellchecker
from spellchecker import SpellChecker
spell = SpellChecker()

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!




In [25]:
def find_potential_misspellings_with_context(row):
    misspellings_with_context = []
    text = row['descr_THOUGHT.text']
    if isinstance(text, str):
        # Tokenise the raw text
        tokens = word_tokenize(text)
        for word in tokens:
            # Clean the word for spell checking (lowercase and remove punctuation)
            cleaned_word = re.sub(r'[^a-zA-Z0-9]', '', word.lower())
            if cleaned_word and cleaned_word not in spell:
                misspellings_with_context.append((cleaned_word, text)) # Store word and original text
    return misspellings_with_context

# Apply the function to each row and collect all misspellings with their context
all_misspellings_with_context = []
for index, row in dataMET.iterrows():
    misspellings_list = find_potential_misspellings_with_context(row)
    all_misspellings_with_context.extend(misspellings_list)

# Create a DataFrame from the collected misspellings and their context
misspellings_df = pd.DataFrame(all_misspellings_with_context, columns=['Potential Misspelling', 'Original Text'])

# Display the DataFrame showing unique misspellings
## (this shows only the context of each word's first occurrence; remove .drop_duplicates() to show all occurrences)
display(misspellings_df.drop_duplicates(subset=['Potential Misspelling']))

# Saving the full dataFrame to a CSV file
misspellings_df.to_csv('/content/context-framed-listening/NLP_outputs/misspellings_df.csv', encoding='utf-8', index=False)

# This dataFrame can be used to create your spelling dictionary.

Unnamed: 0,Potential Misspelling,Original Text
0,nt,"very heavy rock, not for me. somewhere that i ..."
1,ve,"very charged, maybe you've won something or wo..."
2,rememeber,"very bland, like lift music or something. not ..."
8,doesnt,Doesnt fit with what i would imagien for a vid...
9,imagien,Doesnt fit with what i would imagien for a vid...
...,...,...
1204,1970s1980s,I imagined this being something they'd play ba...
1205,disko,I imagined people dancing and disko lights
1206,kidteen,The theme song at the beginning of a kid/teen ...
1207,rihanna,I imagined Rihanna performing this in a concert


> I saved the above file out to manually check over and edit the spellings as needed. See `misspelling_correction_ds1.csv`.

Below, apply these spelling corrections from the **manually edited csv file** to `dataMET`.

In [26]:
# Load the correction mapping from the CSV file
try:
    correction_df = pd.read_csv("/content/context-framed-listening/NLP_outputs/misspelling_correction_ds1.csv")

    # Ensure the 'correction' column is treated as strings and handle potential NaNs
    correction_df['correction'] = correction_df['correction'].astype(str).replace('nan', '') # Convert to string and replace 'nan' string with empty string

    # Create a dictionary from the df, dropping rows where 'misspelling' is NaN
    ## (bracket indexing raises KeyError if a column is missing, which is handled below)
    correction_df = correction_df.dropna(subset=['misspelling'])
    correction_mapping = dict(zip(correction_df['misspelling'], correction_df['correction']))

    print("Correction mapping loaded successfully.")
    #print(correction_mapping) # [optional viewing]

except FileNotFoundError:
    print("Error: misspelling_correction_ds1.csv not found.")
    correction_mapping = {} # empty mapping if file not found
except KeyError:
    print("Error: CSV must contain 'misspelling' and 'correction' columns.")
    correction_mapping = {} # empty mapping if columns are missing


def apply_manual_corrections(text, mapping):
    if isinstance(text, str):

        # Clean the text (lowercase and remove punctuation) before applying corrections
        cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', '', text.lower())
        words = cleaned_text.split()

        # Apply corrections
        ## Ensure the mapped value is a string before joining
        corrected_words = [str(mapping.get(word, word)) for word in words]
        return " ".join(corrected_words)
    return str(text) # Also ensure the return value is a string if the input was not a string

# Apply the manual corrections to the descr_THOUGHT.text column
dataMET['descr_THOUGHT.text_corrected'] = dataMET['descr_THOUGHT.text'].apply(lambda x: apply_manual_corrections(x, correction_mapping))

# Display the original and corrected text to check the changes
display(dataMET[['descr_THOUGHT.text', 'descr_THOUGHT.text_corrected']])

# The preprocessed_METdescr column created later uses descr_THOUGHT.text_corrected as its input

Correction mapping loaded successfully.


Unnamed: 0,descr_THOUGHT.text,descr_THOUGHT.text_corrected
0,"kind of sad, melancholy. not happy or upbeat. ...",kind of sad melancholy not happy or upbeat emo...
1,it did not sound like a video game. if anythin...,it did not sound like a video game if anything...
2,"overly upbeat. no real emotions, peppy. too mu...",overly upbeat no real emotions peppy too much
3,"very heavy rock, not for me. somewhere that i ...",very heavy rock not for me somewhere that i do...
4,"very charged, maybe you've won something or wo...",very charged maybe you have won something or w...
...,...,...
2554,A rock band made up of teenage white kids play...,a rock band made up of teenage white kids play...
2555,People in a ballroom in elegant dresses slow d...,people in a ballroom in elegant dresses slow d...
2557,I imagined a jazz festival and old men on stag...,i imagined a jazz festival and old men on stag...
2558,I imagined a documentary mostly about fun fact...,i imagined a documentary mostly about fun fact...
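
Because punctuation is stripped before the lookup, keys in the correction file have to match the stripped form of a word: "you've" is looked up as `youve`. The sketch below reproduces that order of operations on a toy mapping. The entries are illustrative and not copied from `misspelling_correction_ds1.csv`, and `apply_corrections` is a hypothetical, condensed version of the function above.

```python
import re

def apply_corrections(text, mapping):
    """Lowercase, strip punctuation, then replace each word found in mapping."""
    cleaned = re.sub(r'[^a-zA-Z0-9\s]', '', text.lower())
    return " ".join(mapping.get(word, word) for word in cleaned.split())

# Illustrative entries only.
toy_mapping = {'youve': 'you have', 'imagien': 'imagine', 'doesnt': 'does not'}
print(apply_corrections("Doesnt fit with what you've imagien", toy_mapping))
# does not fit with what you have imagine
```

The same stripping step fuses words joined by a slash, which is why `kid/teen` appears as `kidteen` in the list of flagged words; such cases need their own entry in the correction file.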



Import the additional NLTK resources needed for text cleaning before feeding these METs into our NLP models.

In [27]:
from nltk import pos_tag
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

### Stop Words

Remove common English-language stop words from the Natural Language Toolkit (NLTK) package, along with **custom stop words** (e.g. music style, context cues, thought types).

In [28]:
# Define custom stop words
customStopWords = ['music', 'song', 'songs', 'excerpt', 'excerpts', 'piece', 'pieces', 'clip', 'clips',
                   #'nineteen', '1920s', '20s', '1930s', '30s', '1940s', '40s', '50s', '1950s', 'fifties', '50', '1950', 'fifty',
                   #'60s', '1960s', 'sixties', '60', '1960', 'sixty', '70s', '1970s', 'seventies', '70', '1970', 'seventy',
                   #'80s', '1980s', 'eighties', '80', '1980', 'eighty', '90s', '1990s', 'nineties', '90', '1990', 'ninety',
                   #'00s', '2000s', 'noughties', '2000', 'y2k', '2010',
                   'electronic', 'jazz', 'metal', 'rock',
                   'bar', 'concert', 'film', 'movie', 'videogame', 'video', 'game',
                   'think', 'thinks', 'thought', 'thinking',
                   'imagine', 'imagines', 'imagined', 'imagining',
                   'image', 'images', 'imaged', 'imaging',
                   'visualise', 'visualises', 'visualised', 'visualising', 'visualize', 'visualizes', 'visualized', 'visualizing',
                   'picture', 'pictures', 'pictured', 'picturing',
                   'scene', 'scenes', 'story', 'stories',
                   'memory', 'memories', 'reminder', 'reminders', 'remind', 'reminds', 'remember', 'remembers', 'remembered', 'remembering',
                   'reminiscent', 'reminisce', 'reminisces', 'reminisced', 'reminiscing',
                   'make', 'makes', 'made', 'making', 'sound', 'sounds', 'sounded', 'sounding',
                   "'s", 'to', 'of', 'and', 'like', 'around', 'also', 'ish']

# Combine custom stop words with NLTK stop words
nltk_stop_words = set(nltk.corpus.stopwords.words('english'))
all_stop_words = set(customStopWords).union(nltk_stop_words)

print("Total Stop Words:", len(all_stop_words))

Total Stop Words: 274
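
Stop words are matched token by token, which is why a two-word context cue such as "video game" is listed as the separate entries `video` and `game`, and the set union drops any custom word already in the NLTK list. A minimal sketch with made-up lists follows: `base_stop_words` is a small stand-in for the NLTK list, and all names are hypothetical.

```python
# Small stand-ins for the real lists (illustrative only).
custom_stop_words = ['music', 'video', 'game', 'to', 'of', 'and']
base_stop_words = {'to', 'of', 'and', 'the', 'a'}

# Union removes duplicates, so 'to', 'of' and 'and' are counted once.
combined = set(custom_stop_words) | base_stop_words
print(len(combined))  # 8

tokens = 'the video game music reminded me of a battle'.split()
kept = [token for token in tokens if token not in combined]
print(kept)  # ['reminded', 'me', 'battle']
```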


### Lemmatise


Lemmatising groups together inflected word forms by identifying the dictionary form (the lemma) to analyse them as a single item.

[_e.g., improve/improves/improving/improved = improve_]

In [29]:
lemmatizer = nltk.stem.WordNetLemmatizer()
wn = nltk.corpus.wordnet

tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV
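
The `tag_map` above translates the first letter of a Penn Treebank tag (as returned by `pos_tag`) into the part-of-speech code the WordNet lemmatiser expects, falling back to noun for everything else. The sketch below shows the same lookup without NLTK, writing the WordNet constants as their single-letter values; `wordnet_pos` is a hypothetical helper.

```python
from collections import defaultdict

# WordNet POS constants are single letters: noun 'n', adjective 'a', verb 'v', adverb 'r'.
NOUN, ADJ, VERB, ADV = 'n', 'a', 'v', 'r'

tag_map = defaultdict(lambda: NOUN)  # any unlisted tag falls back to noun
tag_map['J'] = ADJ
tag_map['V'] = VERB
tag_map['R'] = ADV

def wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag such as 'VBD' to a WordNet POS code."""
    return tag_map[treebank_tag[0]]

for tag in ['JJ', 'VBD', 'RB', 'NNS', 'IN']:
    print(tag, '->', wordnet_pos(tag))
```

Passing the part of speech matters because the lemmatiser treats a word as a noun when no tag is given, so verb forms would otherwise be left largely unchanged.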

### Apply stop words and lemmatisation:

In [30]:
def preprocess_text(text, stop_words, lemmatizer, tag_map):
    if isinstance(text, str):
        # Tokenise the text
        tokens = word_tokenize(text)
        # Remove stop words and lemmatise
        lemmatised_tokens = []
        for word, tag in pos_tag(tokens):
            # Ensure word is lowercase and alphabetic only (digits and punctuation removed) for lemmatisation/stopwords
            cleaned_word = re.sub(r'[^a-zA-Z]', '', word.lower())
            if cleaned_word and cleaned_word not in stop_words:
                # Lemmatise using the POS tag
                lemmatized_word = lemmatizer.lemmatize(cleaned_word, tag_map[tag[0]])
                lemmatised_tokens.append(lemmatized_word)
        return " ".join(lemmatised_tokens)
    return text

# Apply the preprocessing function to the corrected text column
dataMET['preprocessed_METdescr'] = dataMET['descr_THOUGHT.text_corrected'].apply(lambda x: preprocess_text(x, all_stop_words, lemmatizer, tag_map))

# Display the original, corrected, and lemmatised text columns
display(dataMET[['descr_THOUGHT.text', 'descr_THOUGHT.text_corrected', 'preprocessed_METdescr']])


Unnamed: 0,descr_THOUGHT.text,descr_THOUGHT.text_corrected,preprocessed_METdescr
0,"kind of sad, melancholy. not happy or upbeat. ...",kind of sad melancholy not happy or upbeat emo...,kind sad melancholy happy upbeat emotionally c...
1,it did not sound like a video game. if anythin...,it did not sound like a video game if anything...,anything maybe end credit something something ...
2,"overly upbeat. no real emotions, peppy. too mu...",overly upbeat no real emotions peppy too much,overly upbeat real emotion peppy much
3,"very heavy rock, not for me. somewhere that i ...",very heavy rock not for me somewhere that i do...,heavy somewhere belong enjoy
4,"very charged, maybe you've won something or wo...",very charged maybe you have won something or w...,charge maybe something battle hype possibly en...
...,...,...,...
2554,A rock band made up of teenage white kids play...,a rock band made up of teenage white kids play...,band teenage white kid play garage
2555,People in a ballroom in elegant dresses slow d...,people in a ballroom in elegant dresses slow d...,people ballroom elegant dress slow dance floor
2557,I imagined a jazz festival and old men on stag...,i imagined a jazz festival and old men on stag...,festival old men stage play instrument
2558,I imagined a documentary mostly about fun fact...,i imagined a documentary mostly about fun fact...,documentary mostly fun fact historic information


### Tokenise

In [31]:
if 'preprocessed_METdescr' in dataMET.columns:
    dataMET['tokenised_MET_descr'] = dataMET['preprocessed_METdescr'].apply(word_tokenize)

    display(dataMET[['preprocessed_METdescr', 'tokenised_MET_descr']])

# Saving a .csv for the option to open and look at the full dataframe
dataMET.to_csv('/content/context-framed-listening/NLP_outputs/dataMET_preprocessed.csv', encoding='utf-8')

Unnamed: 0,preprocessed_METdescr,tokenised_MET_descr
0,kind sad melancholy happy upbeat emotionally c...,"[kind, sad, melancholy, happy, upbeat, emotion..."
1,anything maybe end credit something something ...,"[anything, maybe, end, credit, something, some..."
2,overly upbeat real emotion peppy much,"[overly, upbeat, real, emotion, peppy, much]"
3,heavy somewhere belong enjoy,"[heavy, somewhere, belong, enjoy]"
4,charge maybe something battle hype possibly en...,"[charge, maybe, something, battle, hype, possi..."
...,...,...
2554,band teenage white kid play garage,"[band, teenage, white, kid, play, garage]"
2555,people ballroom elegant dress slow dance floor,"[people, ballroom, elegant, dress, slow, dance..."
2557,festival old men stage play instrument,"[festival, old, men, stage, play, instrument]"
2558,documentary mostly fun fact historic information,"[documentary, mostly, fun, fact, historic, inf..."


`dataMET` (saved out as `dataMET_preprocessed.csv`) now contains the preprocessed dataframe. We will refer to this file for the following NLP models.
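
One point to keep in mind when reloading the CSV: a list-valued column such as `tokenised_MET_descr` is written as the string representation of each list, so it comes back as text rather than as a list. The sketch below shows the round trip with the standard library only and uses `ast.literal_eval` to restore the list; the single row is taken from the preview above, and the variable names are hypothetical.

```python
import ast
import csv
import io

# Write one row with a list-valued column, stored as the str() of the list.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['preprocessed_METdescr', 'tokenised_MET_descr'])
tokens = ['band', 'teenage', 'white', 'kid', 'play', 'garage']
writer.writerow([' '.join(tokens), str(tokens)])

# Read it back: the token column is now a plain string.
buffer.seek(0)
row = list(csv.DictReader(buffer))[0]
raw = row['tokenised_MET_descr']
restored = ast.literal_eval(raw)  # parse the string back into a list

print(type(raw).__name__, type(restored).__name__, restored == tokens)  # str list True
```

With pandas, the same conversion can be applied at load time through the `converters` argument of `pd.read_csv`, for example `converters={'tokenised_MET_descr': ast.literal_eval}`.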