<a href="https://colab.research.google.com/github/HazelvdW/context-framed-listening/blob/main/framed_listening_text_prep.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Framed Listening: **Descriptive Stats & Text Preprocessing**
> By **Hazel A. van der Walle** (PhD student, Music, Durham University), September 2025.

All datasets generated and used for this study are openly available on GitHub at https://github.com/HazelvdW/context-framed-listening.

In [1]:
!rm -r context-framed-listening
# Clone the GitHub repository
!git clone https://github.com/HazelvdW/context-framed-listening.git

rm: cannot remove 'context-framed-listening': No such file or directory
Cloning into 'context-framed-listening'...
remote: Enumerating objects: 630, done.
remote: Counting objects: 100% (206/206), done.
remote: Compressing objects: 100% (186/186), done.
remote: Total 630 (delta 78), reused 105 (delta 20), pack-reused 424 (from 2)
Receiving objects: 100% (630/630), 245.35 MiB | 21.09 MiB/s, done.
Resolving deltas: 100% (269/269), done.


Refresh files to see **"context-framed-listening"**.


---

## Setup

In [2]:
import os
import csv
import pandas as pd
import numpy as np

from google.colab import data_table
data_table.enable_dataframe_formatter()

Load in the data file "**data_study1_MAIN**" that contains participants' thought descriptions.

In [3]:
data = pd.read_csv("/content/context-framed-listening/data_study1_MAIN.csv")

Sentiment analysis is being conducted on music-evoked thoughts (METs).

Create a separate dataset that only contains trials where METs were described (i.e. all rows where "descr_THOUGHT.text" is _not_ NA) and drop the columns only relevant to no-MET trials:

### (descr. stats check before separating METs from no-METs)

In [17]:
dataMET = data[data['descr_THOUGHT.text'].notna()].copy()

# drop no-MET columns
dataMET.drop(columns = ['response_thought_or_not.keys', 'input_NOT.text'],
             inplace=True)

# Shorten clip_name to a three-letter genre prefix plus its 7-character suffix (e.g. '80s_LOW_02', 'Jaz_MED_07')
for rowIndex, row in dataMET.iterrows():
    clip_name = row['clip_name']
    if clip_name[0:3] == '80s':
        dataMET.loc[rowIndex,'clip_name'] = '80s'+clip_name[3:10]
    elif clip_name[0:3] == 'Jaz':
        dataMET.loc[rowIndex,'clip_name'] = 'Jaz'+clip_name[4:11]
    elif clip_name[0:3] == 'Met':
        dataMET.loc[rowIndex,'clip_name'] = 'Met'+clip_name[5:12]
    elif clip_name[0:3] == 'Ele':
        dataMET.loc[rowIndex,'clip_name'] = 'Ele'+clip_name[10:17]


display(dataMET)

# print all column headers for later reference
print(dataMET.columns)

# print number of trials with and without MET descriptions
non_na_count = len(dataMET)
na_count = data['descr_THOUGHT.text'].isna().sum()
total_trials = non_na_count + na_count
percentage_with_METs = (non_na_count / total_trials) * 100

print(f"\nNumber of trials with MET description: {non_na_count}")
print(f"Number of trials with no MET description: {na_count}")
print(f"Total number of trials: {total_trials}")
print(f"\n{non_na_count} out of {total_trials} trials ({percentage_with_METs:.1f}%) contained music-evoked thought descriptions.")

# Calculate number of MET descriptions per unique clip and context pair
mets_per_clip_context = dataMET.groupby(['clip_name', 'context_word']).size().reset_index(name='MET_count')
print(f"\nNumber of MET descriptions per unique clip and context pair:\n{mets_per_clip_context}")

# Calculate min, max, and mean MET counts per clip-context pair
min_met_count = mets_per_clip_context['MET_count'].min()
max_met_count = mets_per_clip_context['MET_count'].max()
mean_met_count = mets_per_clip_context['MET_count'].mean()

print(f"\nAt an excerpt level, participants reported {min_met_count} to {max_met_count} thoughts (M = {mean_met_count:.2f}).")




Unnamed: 0,clip_name,context_word,expName,PROLIFIC_PID,File_ID,date,descr_THOUGHT.text,rating_music_prompted.response,rating_spontaneity.response,rating_novelty.response,...,demographics.livingCountry,demographics.birthCountry,demographics.nativeLanguage,demographics.otherLanguage,demographics.otherLanguageText,demographics.hearingImpariments,demographics.hearingImpairmentsText,demographics.education,demographics.musicianIdentification,demographics.feedback
0,80s_LOW_02,bar,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,"kind of sad, melancholy. not happy or upbeat. ...",5.0,4.0,3.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
1,Jaz_MED_07,video game,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,it did not sound like a video game. if anythin...,5.0,5.0,2.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
2,80s_MED_08,video game,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,"overly upbeat. no real emotions, peppy. too mu...",5.0,4.0,4.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
3,Met_LOW_09,concert,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,"very heavy rock, not for me. somewhere that i ...",5.0,5.0,3.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
4,Met_MED_20,video game,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,"very charged, maybe you've won something or wo...",5.0,4.0,2.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2554,Met_MED_20,concert,clip_context_g4,6824fc226d9b4777f8695cf0,clip_context_g4_PROLIFIC_PID_992291.csv,2025-07-01_06h12.29.151,A rock band made up of teenage white kids play...,5.0,5.0,4.0,...,United States,United States,English,False,,False,,3,2,none
2555,80s_LOW_02,concert,clip_context_g4,6824fc226d9b4777f8695cf0,clip_context_g4_PROLIFIC_PID_992291.csv,2025-07-01_06h12.29.151,People in a ballroom in elegant dresses slow d...,5.0,5.0,5.0,...,United States,United States,English,False,,False,,3,2,none
2557,Jaz_MED_02,video game,clip_context_g4,6824fc226d9b4777f8695cf0,clip_context_g4_PROLIFIC_PID_992291.csv,2025-07-01_06h12.29.151,I imagined a jazz festival and old men on stag...,5.0,5.0,4.0,...,United States,United States,English,False,,False,,3,2,none
2558,Ele_MED_20,movie,clip_context_g4,6824fc226d9b4777f8695cf0,clip_context_g4_PROLIFIC_PID_992291.csv,2025-07-01_06h12.29.151,I imagined a documentary mostly about fun fact...,5.0,5.0,5.0,...,United States,United States,English,False,,False,,3,2,none


Index(['clip_name', 'context_word', 'expName', 'PROLIFIC_PID', 'File_ID',
       'date', 'descr_THOUGHT.text', 'rating_music_prompted.response',
       'rating_spontaneity.response', 'rating_novelty.response',
       'rating_familiarity.response', 'rating_enjoyment.response',
       'demographics.headphones', 'demographics.age', 'demographics.gender',
       'demographics.livingCountry', 'demographics.birthCountry',
       'demographics.nativeLanguage', 'demographics.otherLanguage',
       'demographics.otherLanguageText', 'demographics.hearingImpariments',
       'demographics.hearingImpairmentsText', 'demographics.education',
       'demographics.musicianIdentification', 'demographics.feedback'],
      dtype='object')

Number of trials with MET description: 1962
Number of trials with no MET description: 598
Total number of trials: 2560

1962 out of 2560 trials (76.6%) contained music-evoked thought descriptions.

Number of MET descriptions per unique clip and context pair:
     clip_

Combine the clip and context values into an additional column (`clip_context_PAIR`), and create a clip genre column (`clip_genre`) to use later.

In [5]:
def create_clip_context_pair(row):
    clip_name = row['clip_name']
    if 'bar' in row['context_word']:
        return 'BAR-' + clip_name
    elif 'video game' in row['context_word']:
        return 'VIDEOGAME-' + clip_name
    elif 'concert' in row['context_word']:
        return 'CONCERT-' + clip_name
    elif 'movie' in row['context_word']:
        return 'MOVIE-' + clip_name
    else:
        return 'NO_MATCH'

dataMET['clip_context_PAIR'] = dataMET.apply(create_clip_context_pair, axis=1)

# Create 'clip_genre' column
def extract_genre(clip_name):
    if '80s' in clip_name:
        return '80s'
    elif 'Jaz' in clip_name:
        return 'Jazz'
    elif 'Met' in clip_name:
        return 'Metal'
    elif 'Ele' in clip_name:
        return 'Electronic'
    else:
        return 'UNKNOWN'

dataMET['clip_genre'] = dataMET['clip_name'].apply(extract_genre)


# Reorder columns
cols = dataMET.columns.tolist()
# Move 'clip_genre' to be after 'clip_name'
cols.insert(cols.index('clip_name') + 1, cols.pop(cols.index('clip_genre')))
# Move 'clip_context_PAIR' to be after 'context_word'
cols.insert(cols.index('context_word') + 1, cols.pop(cols.index('clip_context_PAIR')))
dataMET = dataMET[cols]


# Check the dataframe by a quick re-view
display(dataMET)
print(dataMET.columns)

# Saving a .csv for the option to open and look at the full dataframe
dataMET.to_csv('/content/context-framed-listening/NLP_outputs/dataMET.csv', encoding='utf-8')



Unnamed: 0,clip_name,clip_genre,context_word,clip_context_PAIR,expName,PROLIFIC_PID,File_ID,date,descr_THOUGHT.text,rating_music_prompted.response,...,demographics.livingCountry,demographics.birthCountry,demographics.nativeLanguage,demographics.otherLanguage,demographics.otherLanguageText,demographics.hearingImpariments,demographics.hearingImpairmentsText,demographics.education,demographics.musicianIdentification,demographics.feedback
0,80s_LOW_02,80s,bar,BAR-80s_LOW_02,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,"kind of sad, melancholy. not happy or upbeat. ...",5.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
1,Jaz_MED_07,Jazz,video game,VIDEOGAME-Jaz_MED_07,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,it did not sound like a video game. if anythin...,5.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
2,80s_MED_08,80s,video game,VIDEOGAME-80s_MED_08,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,"overly upbeat. no real emotions, peppy. too mu...",5.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
3,Met_LOW_09,Metal,concert,CONCERT-Met_LOW_09,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,"very heavy rock, not for me. somewhere that i ...",5.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
4,Met_MED_20,Metal,video game,VIDEOGAME-Met_MED_20,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,"very charged, maybe you've won something or wo...",5.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2554,Met_MED_20,Metal,concert,CONCERT-Met_MED_20,clip_context_g4,6824fc226d9b4777f8695cf0,clip_context_g4_PROLIFIC_PID_992291.csv,2025-07-01_06h12.29.151,A rock band made up of teenage white kids play...,5.0,...,United States,United States,English,False,,False,,3,2,none
2555,80s_LOW_02,80s,concert,CONCERT-80s_LOW_02,clip_context_g4,6824fc226d9b4777f8695cf0,clip_context_g4_PROLIFIC_PID_992291.csv,2025-07-01_06h12.29.151,People in a ballroom in elegant dresses slow d...,5.0,...,United States,United States,English,False,,False,,3,2,none
2557,Jaz_MED_02,Jazz,video game,VIDEOGAME-Jaz_MED_02,clip_context_g4,6824fc226d9b4777f8695cf0,clip_context_g4_PROLIFIC_PID_992291.csv,2025-07-01_06h12.29.151,I imagined a jazz festival and old men on stag...,5.0,...,United States,United States,English,False,,False,,3,2,none
2558,Ele_MED_20,Electronic,movie,MOVIE-Ele_MED_20,clip_context_g4,6824fc226d9b4777f8695cf0,clip_context_g4_PROLIFIC_PID_992291.csv,2025-07-01_06h12.29.151,I imagined a documentary mostly about fun fact...,5.0,...,United States,United States,English,False,,False,,3,2,none


Index(['clip_name', 'clip_genre', 'context_word', 'clip_context_PAIR',
       'expName', 'PROLIFIC_PID', 'File_ID', 'date', 'descr_THOUGHT.text',
       'rating_music_prompted.response', 'rating_spontaneity.response',
       'rating_novelty.response', 'rating_familiarity.response',
       'rating_enjoyment.response', 'demographics.headphones',
       'demographics.age', 'demographics.gender', 'demographics.livingCountry',
       'demographics.birthCountry', 'demographics.nativeLanguage',
       'demographics.otherLanguage', 'demographics.otherLanguageText',
       'demographics.hearingImpariments',
       'demographics.hearingImpairmentsText', 'demographics.education',
       'demographics.musicianIdentification', 'demographics.feedback'],
      dtype='object')


---
## Descriptive Statistics

Basic descriptive statistics on the clip-context pairings.


Create a dataframe with summary information about each clip-context stimulus pairing:

* Number of participants that reported METs while listening
* Mean MET and clip ratings
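
As a minimal sketch of the summary described above, the toy example below groups by pairing and applies one aggregation per column through a dictionary. The dataframe `toy` and its values are invented for illustration and are not taken from the study data.

```python
import pandas as pd

# Toy data standing in for dataMET (values are invented for illustration only)
toy = pd.DataFrame({
    "clip_context_PAIR": ["BAR-80s_LOW_02", "BAR-80s_LOW_02", "MOVIE-Ele_MED_20"],
    "PROLIFIC_PID": ["p1", "p2", "p1"],
    "rating_enjoyment.response": [4.0, 2.0, 5.0],
})

# One aggregation per column: count the ID column, average the rating column
agg_fun = {"PROLIFIC_PID": "count", "rating_enjoyment.response": "mean"}
toy_stats = toy.groupby("clip_context_PAIR").agg(agg_fun)
print(toy_stats)
```

Note that `'count'` counts non-missing rows in each group, so it equals the number of participants only if each participant contributes at most one trial per pairing.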

In [6]:
columns = dataMET.columns.tolist()[1:-1]

# Drop these following columns so they don't aggregate by clip-context grouping:
drop = ['clip_name', 'clip_genre', 'context_word', 'clip_context_PAIR',
        'expName', 'File_ID', 'date', 'descr_THOUGHT.text',
        'demographics.headphones', 'demographics.age',
        'demographics.gender','demographics.livingCountry',
        'demographics.birthCountry', 'demographics.nativeLanguage',
        'demographics.otherLanguage', 'demographics.otherLanguageText',
        'demographics.hearingImpariments', 'demographics.hearingImpairmentsText',
        'demographics.education','demographics.musicianIdentification',
        'demographics.feedback']

# Setting up an aggregate function collector
agg_fun = {}

# As we dropped trials without METs, counting participant IDs per pairing gives the number of reported METs
agg_fun['PROLIFIC_PID'] = 'count'

# Taking the mean of all columns except participant IDs and dropped columns
for col in columns:
    if col not in drop and col != 'PROLIFIC_PID':
        agg_fun[col] = 'mean'

# Group the dataframe by clip-context pairing, then run the aggregate functions created above
clipContextDescrStats = dataMET.groupby('clip_context_PAIR').agg(agg_fun)
display(clipContextDescrStats)

# Saving a .csv for the option to open and look at the full dataframe
clipContextDescrStats.to_csv('/content/context-framed-listening/NLP_outputs/clipContextDescrStats.csv', encoding='utf-8')

Unnamed: 0_level_0,PROLIFIC_PID,rating_music_prompted.response,rating_spontaneity.response,rating_novelty.response,rating_familiarity.response,rating_enjoyment.response
clip_context_PAIR,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
BAR-80s_LOW_02,34,4.235294,3.882353,3.382353,2.294118,3.705882
BAR-80s_LOW_06,29,4.448276,4.103448,3.482759,2.586207,3.793103
BAR-80s_MED_08,29,4.275862,3.862069,2.965517,2.000000,3.310345
BAR-80s_MED_13,25,4.160000,3.840000,2.360000,2.280000,3.600000
BAR-Ele_LOW_09,27,4.592593,3.962963,2.259259,2.000000,3.259259
...,...,...,...,...,...,...
VIDEOGAME-Jaz_MED_07,35,4.428571,4.285714,2.685714,2.457143,3.742857
VIDEOGAME-Met_LOW_09,34,4.558824,3.794118,2.411765,2.029412,3.235294
VIDEOGAME-Met_LOW_14,26,4.307692,3.923077,2.423077,2.230769,3.346154
VIDEOGAME-Met_MED_19,31,4.483871,4.161290,2.580645,2.645161,3.258065


Check the minimum and maximum reported MET occurrences across all clip-context stimulus pairings:

In [7]:
mostMETs_ccpair = clipContextDescrStats['PROLIFIC_PID'].idxmax()
leastMETs_ccpair = clipContextDescrStats['PROLIFIC_PID'].idxmin()
mostMETs_value = clipContextDescrStats['PROLIFIC_PID'].max()
leastMETs_value = clipContextDescrStats['PROLIFIC_PID'].min()

print(f"Clip-context pair with the most reported METs: {mostMETs_ccpair} ({mostMETs_value})")
print(f"Clip-context pair with the least reported METs: {leastMETs_ccpair} ({leastMETs_value})")

Clip-context pair with the most reported METs: BAR-Ele_LOW_14 (35)
Clip-context pair with the least reported METs: BAR-80s_MED_13 (25)


---
## Text Preprocessing

Different NLP models require different levels of text filtering to perform optimally.

**BERT** - Benefits from minimal preprocessing because:
* Understands context and handles common words well
* Has its own tokenisation and handles word variations
* Only needs custom domain-specific stop words removed

**Word2Vec & TF-IDF** - Need more aggressive preprocessing because:
* Don't understand context as well
* Common stop words add noise
* Lemmatisation helps group related terms

 > We will produce two levels of text preprocessing to save out for later analyses. All text will be manually spell checked.

### Spell checking

Collect all misspellings flagged by spell checking packages to manually go through and make corrections where necessary.

In [8]:
import re
from collections import defaultdict

import nltk
nltk.download('stopwords')

from nltk import pos_tag, word_tokenize
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')

#!pip uninstall -y pyspellchecker
!pip install pyspellchecker
from spellchecker import SpellChecker
spell = SpellChecker()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


Collecting pyspellchecker
  Downloading pyspellchecker-0.8.3-py3-none-any.whl.metadata (9.5 kB)
Downloading pyspellchecker-0.8.3-py3-none-any.whl (7.2 MB)
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.8.3


In [9]:
def find_potential_misspellings_with_context(row):
    misspellings_with_context = []
    text = row['descr_THOUGHT.text']
    if isinstance(text, str):
        # Tokenise the raw text
        tokens = word_tokenize(text)
        for word in tokens:
            # Clean the word for spell checking (lowercase and remove punctuation)
            cleaned_word = re.sub(r'[^a-zA-Z0-9]', '', word.lower())
            if cleaned_word and cleaned_word not in spell:
                misspellings_with_context.append((cleaned_word, text)) # Store word and original text
    return misspellings_with_context

# Apply the function to each row and collect all misspellings with their context
all_misspellings_with_context = []
for index, row in dataMET.iterrows():
    misspellings_list = find_potential_misspellings_with_context(row)
    all_misspellings_with_context.extend(misspellings_list)

# Create a DataFrame from the collected misspellings and their context
misspellings_df = pd.DataFrame(all_misspellings_with_context, columns=['Potential Misspelling', 'Original Text'])

# Display the dataFrame showing unique misspellings
## (to show all occurrences, remove .drop_duplicates() <- otherwise, this shows their first occurrence context only.)
display(misspellings_df.drop_duplicates(subset=['Potential Misspelling']))

# Saving the full dataFrame to a CSV file
misspellings_df.to_csv('/content/context-framed-listening/NLP_outputs/misspellings_df.csv', encoding='utf-8', index=False)

# This dataFrame can be used to create your spelling dictionary.

Unnamed: 0,Potential Misspelling,Original Text
0,nt,"very heavy rock, not for me. somewhere that i ..."
1,ve,"very charged, maybe you've won something or wo..."
2,rememeber,"very bland, like lift music or something. not ..."
8,doesnt,Doesnt fit with what i would imagien for a vid...
9,imagien,Doesnt fit with what i would imagien for a vid...
...,...,...
1204,1970s1980s,I imagined this being something they'd play ba...
1205,disko,I imagined people dancing and disko lights
1206,kidteen,The theme song at the beginning of a kid/teen ...
1207,rihanna,I imagined Rihanna performing this in a concert


> I saved the above file out to manually check over and edit the spellings as needed. See `misspelling_correction_ds1.csv`.

Below, apply these spelling corrections from the **manually edited csv file** to `dataMET`
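
The word-by-word replacement can be sketched on a single sentence as follows. The mapping `toy_mapping` and the helper `apply_toy_corrections` are hypothetical stand-ins; the real mapping comes from the manually edited CSV.

```python
import re

# Hypothetical correction mapping (the real one is loaded from the manually edited CSV)
toy_mapping = {"rememeber": "remember", "imagien": "imagine", "doesnt": "does not"}

def apply_toy_corrections(text, mapping):
    # Lowercase and strip everything except letters, digits and whitespace
    cleaned = re.sub(r"[^a-zA-Z0-9\s]", "", text.lower())
    # Swap each whitespace-separated word for its correction, if one exists
    return " ".join(mapping.get(word, word) for word in cleaned.split())

corrected = apply_toy_corrections("Doesnt fit with what I would imagien!", toy_mapping)
print(corrected)  # does not fit with what i would imagine
```

Because punctuation is removed before the lookup, keys in the mapping need to be written in their punctuation-free, lowercase form.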

In [10]:
# Load the correction mapping from the CSV file
try:
    correction_df = pd.read_csv("/content/context-framed-listening/NLP_outputs/misspelling_correction_ds1.csv")

    # Ensure the 'correction' column is treated as strings and handle potential NaNs
    correction_df['correction'] = correction_df['correction'].astype(str).replace('nan', '') # Convert to string and replace 'nan' string with empty string

    # Create a dictionary from the df, dropping rows where 'misspelling' is NaN
    ## (bracket access, so that a missing column raises the KeyError handled below)
    correction_mapping = (correction_df.dropna(subset=['misspelling'])
                          .set_index('misspelling')['correction'].to_dict())

    print("Correction mapping loaded successfully.")
    #print(correction_mapping) # [optional viewing]

except FileNotFoundError:
    print("Error: misspelling_correction_ds1.csv not found.")
    correction_mapping = {} # empty mapping if file not found
except KeyError:
    print("Error: CSV must contain 'misspelling' and 'correction' columns.")
    correction_mapping = {} # empty mapping if columns are missing


def apply_manual_corrections(text, mapping):
    if isinstance(text, str):

        # Clean the text (lowercase and remove punctuation) before applying corrections
        cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', '', text.lower())
        words = cleaned_text.split()

        # Apply corrections
        ## Ensure the mapped value is a string before joining
        corrected_words = [str(mapping.get(word, word)) for word in words]
        return " ".join(corrected_words)
    return str(text) # Also ensure the return value is a string if the input was not a string

# Apply the manual corrections to the descr_THOUGHT.text column
dataMET['descr_THOUGHT.text_corrected'] = dataMET['descr_THOUGHT.text'].apply(lambda x: apply_manual_corrections(x, correction_mapping))

# Display the original and corrected text to check the changes
display(dataMET[['descr_THOUGHT.text', 'descr_THOUGHT.text_corrected']])

# The preprocessing cells below use descr_THOUGHT.text_corrected as their input

Correction mapping loaded successfully.


Unnamed: 0,descr_THOUGHT.text,descr_THOUGHT.text_corrected
0,"kind of sad, melancholy. not happy or upbeat. ...",kind of sad melancholy not happy or upbeat emo...
1,it did not sound like a video game. if anythin...,it did not sound like a video game if anything...
2,"overly upbeat. no real emotions, peppy. too mu...",overly upbeat no real emotions peppy too much
3,"very heavy rock, not for me. somewhere that i ...",very heavy rock not for me somewhere that i do...
4,"very charged, maybe you've won something or wo...",very charged maybe you have won something or w...
...,...,...
2554,A rock band made up of teenage white kids play...,a rock band made up of teenage white kids play...
2555,People in a ballroom in elegant dresses slow d...,people in a ballroom in elegant dresses slow d...
2557,I imagined a jazz festival and old men on stag...,i imagined a jazz festival and old men on stag...
2558,I imagined a documentary mostly about fun fact...,i imagined a documentary mostly about fun fact...



Importing some more packages necessary for text cleaning before feeding these METs into our NLP models.

### Stop Words

These are words we will remove from the text so that they are not accounted for in the analyses.

We will define custom domain-specific stop words (e.g. music style, context cues, thought types), and get the NLTK common English-language stop words.

Both preprocessing levels will have custom stop words removed, but only level 2 preprocessing for Word2Vec and TF-IDF will have the NLTK stop words removed (as BERT understands "the", "is", "and", etc.).
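
To make the difference between the two levels concrete, here is a small sketch using hand-written stop word sets. The sets and the helper `drop_words` are assumptions for illustration; the notebook's own lists are longer and the common-word list comes from NLTK.

```python
# Small hand-written sets for illustration; the notebook uses much longer lists
custom_stop_words = {"music", "video", "game", "sound", "like"}
common_stop_words = {"it", "did", "not", "a", "the"}  # stand-in for the NLTK list

def drop_words(text, stop_words):
    # Keep only tokens whose lowercase form is not a stop word
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

sentence = "it did not sound like a video game maybe the end credits"
level1 = drop_words(sentence, custom_stop_words)
level2 = drop_words(sentence, custom_stop_words | common_stop_words)
print(level1)  # it did not a maybe the end credits
print(level2)  # maybe end credits
```

Level 1 keeps the function words, while level 2 leaves only the content words.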

In [11]:
# Define custom stop words (domain-specific terms to remove)
stimuliStopWords = ['music', 'song', 'songs', 'excerpt', 'excerpts', 'piece', 'pieces', 'clip', 'clips',
                    'electronic', 'jazz', 'metal', 'rock',
                    'bar', 'concert', 'film', 'movie', 'videogame', 'video', 'game']

thoughtStopWords = ['think', 'thinks', 'thought', 'thinking',
                    'imagine', 'imagines', 'imagined', 'imagining',
                    'image', 'images', 'imaged', 'imaging',
                    'visualise', 'visualises', 'visualised', 'visualising',
                    'visualize', 'visualizes', 'visualized', 'visualizing',
                    'picture', 'pictures', 'pictured', 'picturing',
                    'scene', 'scenes', 'story', 'stories',
                    'memory', 'memories', 'reminder', 'reminders', 'remind', 'reminds',
                    'remember', 'remembers', 'remembered', 'remembering',
                    'reminiscent', 'reminisce', 'reminisces', 'reminisced', 'reminiscing',
                    'sound', 'sounds', 'sounded', 'sounding', 'like',
                    'make', 'makes', 'made', 'making']


# set NLTK stop words
nltk_stop_words = set(nltk.corpus.stopwords.words('english'))
print(nltk_stop_words)
print("\nNLTK Stop Words:", len(nltk_stop_words))

{'it', "it'll", 'other', 'both', 'below', 'where', "weren't", 'their', 'why', 'himself', "wouldn't", 'about', 'our', 'most', 'under', 'out', 'weren', "isn't", 'wasn', 'between', 'some', 'how', 'very', 'any', 'shouldn', "you've", 'ourselves', 'than', 'through', "you'll", 'what', 'off', 'ain', 'd', "it'd", 'should', 'ours', "shan't", 'he', 'mustn', "i've", 'hasn', 'of', 'then', 'or', 'all', "you'd", 'his', 'during', 'can', 'does', "wasn't", 'that', "mightn't", 'who', 'just', "we'd", 'each', 'them', 'has', 'm', "aren't", 've', 're', 'll', "he'd", 'i', 'before', "don't", 'her', 'only', 'my', "they'd", "doesn't", 'ma', 'no', 'this', 'and', 'with', 'y', 'wouldn', "she's", 'doesn', 'mightn', 'too', "we're", "won't", 'being', 'such', "he'll", 'we', 'on', 'further', 'when', 'were', 'which', "they're", 'don', "they've", 'those', 'once', 'herself', 'are', 'in', 'nor', "couldn't", 'a', 'more', 'same', 'aren', 'now', 'because', 'be', "i'd", 't', "didn't", 'down', 'shan', "hasn't", "should've", 'who

### Lemmatise

Lemmatising groups inflected word forms together by identifying the dictionary form (the lemma) to analyse them as a single item.

[_e.g., improve/improving/improvements/improved/improver = improve_]

Only level 2 preprocessing for Word2Vec and TF-IDF will have lemmatisation (BERT is able to handle word variations).

In [12]:
# Initialise lemmatiser and tag mapping

lemmatiser = nltk.stem.WordNetLemmatizer()
wn = nltk.corpus.wordnet

tag_map = defaultdict(lambda: wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV
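
The tag mapping can be illustrated without downloading any NLTK data. In this sketch the WordNet part-of-speech constants are written out as plain strings, and `toy_tag_map` is a hypothetical copy of the mapping used in this notebook.

```python
from collections import defaultdict

# WordNet POS constants written out as plain strings
# (these are the values of wn.NOUN, wn.ADJ, wn.VERB and wn.ADV in NLTK)
NOUN, ADJ, VERB, ADV = "n", "a", "v", "r"

# Any tag whose first letter is not J, V or R falls back to NOUN
toy_tag_map = defaultdict(lambda: NOUN)
toy_tag_map["J"] = ADJ
toy_tag_map["V"] = VERB
toy_tag_map["R"] = ADV

# Penn Treebank tags of the kind returned by a POS tagger
for tag in ["JJ", "VBD", "RB", "NNS", "IN"]:
    print(tag, "->", toy_tag_map[tag[0]])
```

Only the first letter of each Penn Treebank tag is used for the lookup, so `VBD`, `VBG` and `VBZ` all resolve to the verb category, and unlisted tags such as `IN` are lemmatised as nouns.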

### Apply text preprocessing

**LEVEL 1:** Minimal preprocessing for BERT, only removing custom stop words from the spell-checked MET descriptions.

**LEVEL 2:** Aggressive preprocessing for Word2Vec and TF-IDF, removing custom stop words, NLTK stop words, lemmatising, converting to lowercase, and removing punctuation from the spell-checked MET descriptions.



In [13]:
def to_list(arg):
    return list(arg) if isinstance(arg, set) else arg

def preprocess_level1(text, stimuli_stop_words, thought_stop_words):
    """
    Level 1: Minimal preprocessing for BERT
    - Only removes custom domain-specific stop words
    - Preserves original word forms (no lemmatisation)
    - Leaves punctuation and capitalisation untouched for BERT's tokeniser
      (the spell-corrected input used below is already lowercased and punctuation-free)
    """
    if isinstance(text, str):
        # Combine the custom stop words (defined from arguments)
        custom_stop_words = set(to_list(stimuli_stop_words) + to_list(thought_stop_words))
        # Tokenise the text
        tokens = word_tokenize(text)
        filtered_tokens = []
        for word in tokens:
            # Check lowercase version against stop words, but keep original case
            if word.lower() not in custom_stop_words:
                filtered_tokens.append(word)
        return " ".join(filtered_tokens)
    return text

def preprocess_level1A(text, stimuli_stop_words):
    """
    Same as Level 1 but doesn't remove thought-related stop words
    """
    if isinstance(text, str):
        # custom_stop_words here refers only to stimuli_stop_words
        custom_stop_words = set(to_list(stimuli_stop_words))
        # Tokenise the text
        tokens = word_tokenize(text)
        filtered_tokens = []
        for word in tokens:
            # Check lowercase version against stop words, but keep original case
            if word.lower() not in custom_stop_words:
                filtered_tokens.append(word)
        return " ".join(filtered_tokens)
    return text


def preprocess_level2(text, stimuli_stop_words, thought_stop_words, nltk_stop_words, lemmatiser, tag_map):
    """
    Level 2: Aggressive preprocessing for Word2Vec and TF-IDF
    - Removes all stop words (custom + NLTK)
    - Applies lemmatisation
    - Converts to lowercase
    - Removes punctuation
    """
    if isinstance(text, str):
        # Combine all stop words (defined from arguments)
        all_stop_words = set(to_list(stimuli_stop_words) + to_list(thought_stop_words) + to_list(nltk_stop_words))
        # Tokenise the text
        tokens = word_tokenize(text)
        # Remove stop words and lemmatize
        lemmatised_tokens = []
        for word, tag in pos_tag(tokens):
            # Convert to lowercase and remove non-alphabetic characters
            cleaned_word = re.sub(r'[^a-zA-Z]', '', word.lower())
            if cleaned_word and cleaned_word not in all_stop_words:
                # Lemmatize using the POS tag
                lemmatised_word = lemmatiser.lemmatize(cleaned_word, tag_map[tag[0]])
                lemmatised_tokens.append(lemmatised_word)
        return " ".join(lemmatised_tokens)
    return text

def preprocess_level2A(text, stimuli_stop_words, nltk_stop_words, lemmatiser, tag_map):
    """
    Same as Level 2 but doesn't remove thought-related stop words
    """
    if isinstance(text, str):
        # Combine stimuli and NLTK stop words (defined from arguments)
        all_stop_words = set(to_list(stimuli_stop_words) + to_list(nltk_stop_words))
        # Tokenise the text
        tokens = word_tokenize(text)
        # Remove stop words and lemmatize
        lemmatised_tokens = []
        for word, tag in pos_tag(tokens):
            # Convert to lowercase and remove non-alphabetic characters
            cleaned_word = re.sub(r'[^a-zA-Z]', '', word.lower())
            if cleaned_word and cleaned_word not in all_stop_words:
                # Lemmatize using the POS tag
                lemmatised_word = lemmatiser.lemmatize(cleaned_word, tag_map[tag[0]])
                lemmatised_tokens.append(lemmatised_word)
        return " ".join(lemmatised_tokens)
    return text


# Apply Level 1 preprocessing (for BERT)
dataMET['METdescr_prepLVL1'] = dataMET['descr_THOUGHT.text_corrected'].apply(
    lambda x: preprocess_level1(x, stimuliStopWords, thoughtStopWords)
)

# Apply Level 1 version A preprocessing (thought-related stop words kept)
dataMET['METdescr_prepLVL1A'] = dataMET['descr_THOUGHT.text_corrected'].apply(
    lambda x: preprocess_level1A(x, stimuliStopWords)
)

# Apply Level 2 preprocessing (for Word2Vec and TF-IDF)
dataMET['METdescr_prepLVL2'] = dataMET['descr_THOUGHT.text_corrected'].apply(
    lambda x: preprocess_level2(x, stimuliStopWords, thoughtStopWords, nltk_stop_words, lemmatiser, tag_map)
)

# Apply Level 2A preprocessing (for Word2Vec and TF-IDF; thought stop words kept in)
dataMET['METdescr_prepLVL2A'] = dataMET['descr_THOUGHT.text_corrected'].apply(
    lambda x: preprocess_level2A(x, stimuliStopWords, nltk_stop_words, lemmatiser, tag_map)
)

# Display all text processing stages
display(dataMET[['descr_THOUGHT.text',
                 'descr_THOUGHT.text_corrected',
                 'METdescr_prepLVL1',
                 'METdescr_prepLVL1A',
                 'METdescr_prepLVL2',
                 'METdescr_prepLVL2A']])

# Save out for analyses and inspection
dataMET.to_csv('/content/context-framed-listening/NLP_outputs/dataMET_preprocessed.csv', encoding='utf-8')

print("\nPreprocessing complete")
print("- 'METdescr_prepLVL1': Minimal filtering for BERT (custom stop words only)")
print("- 'METdescr_prepLVL1A': Minimal filtering for BERT (stimuli stop words only)")
print("- 'METdescr_prepLVL2': Full preprocessing for Word2Vec/TF-IDF (all stop words + lemmatisation)")
print("- 'METdescr_prepLVL2A': Full preprocessing for Word2Vec/TF-IDF (stimuli+NLTK stop words + lemmatisation)")

Unnamed: 0,descr_THOUGHT.text,descr_THOUGHT.text_corrected,METdescr_prepLVL1,METdescr_prepLVL1A,METdescr_prepLVL2,METdescr_prepLVL2A
0,"kind of sad, melancholy. not happy or upbeat. ...",kind of sad melancholy not happy or upbeat emo...,kind of sad melancholy not happy or upbeat emo...,kind of sad melancholy not happy or upbeat emo...,kind sad melancholy happy upbeat emotionally c...,kind sad melancholy happy upbeat emotionally c...
1,it did not sound like a video game. if anythin...,it did not sound like a video game if anything...,it did not a if anything maybe the end credits...,it did not sound like a if anything maybe the ...,anything maybe end credit something something ...,sound like anything maybe end credit something...
2,"overly upbeat. no real emotions, peppy. too mu...",overly upbeat no real emotions peppy too much,overly upbeat no real emotions peppy too much,overly upbeat no real emotions peppy too much,overly upbeat real emotion peppy much,overly upbeat real emotion peppy much
3,"very heavy rock, not for me. somewhere that i ...",very heavy rock not for me somewhere that i do...,very heavy not for me somewhere that i do not ...,very heavy not for me somewhere that i do not ...,heavy somewhere belong enjoy,heavy somewhere belong enjoy
4,"very charged, maybe you've won something or wo...",very charged maybe you have won something or w...,very charged maybe you have won something or w...,very charged maybe you have won something or w...,charge maybe something battle hype possibly en...,charge maybe something battle hype possibly en...
...,...,...,...,...,...,...
2554,A rock band made up of teenage white kids play...,a rock band made up of teenage white kids play...,a band up of teenage white kids playing in the...,a band made up of teenage white kids playing i...,band teenage white kid play garage,band make teenage white kid play garage
2555,People in a ballroom in elegant dresses slow d...,people in a ballroom in elegant dresses slow d...,people in a ballroom in elegant dresses slow d...,people in a ballroom in elegant dresses slow d...,people ballroom elegant dress slow dance floor,people ballroom elegant dress slow dance floor
2557,I imagined a jazz festival and old men on stag...,i imagined a jazz festival and old men on stag...,i a festival and old men on stage playing thei...,i imagined a festival and old men on stage pla...,festival old men stage play instrument,imagine festival old men stage play instrument
2558,I imagined a documentary mostly about fun fact...,i imagined a documentary mostly about fun fact...,i a documentary mostly about fun facts or hist...,i imagined a documentary mostly about fun fact...,documentary mostly fun fact historic information,imagine documentary mostly fun fact historic i...



Preprocessing complete
- 'METdescr_prepLVL1': Minimal filtering for BERT (custom stop words only)
- 'METdescr_prepLVL1A': Minimal filtering for BERT (stimuli stop words only)
- 'METdescr_prepLVL2': Full preprocessing for Word2Vec/TF-IDF (all stop words + lemmatisation)
- 'METdescr_prepLVL2A': Full preprocessing for Word2Vec/TF-IDF (stimuli+NLTK stop words + lemmatisation)


`dataMET` (saved out as "dataMET_preprocessed.csv") now contains the preprocessed data from individual participants (one row = one MET) at two levels of filtering, each in two variants: `METdescr_prepLVL1`, `METdescr_prepLVL1A`, `METdescr_prepLVL2`, and `METdescr_prepLVL2A`. We will refer to this file for the NLP analyses.
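
When this file is reloaded in a later notebook, two details of the CSV round trip are worth keeping in mind: `to_csv` writes the DataFrame index as an unnamed first column, and any description that ended up empty after stop-word removal is read back as `NaN` by default. The sketch below illustrates both on a toy frame held in an in-memory buffer; the toy data and the choice to fill missing text with `""` are assumptions for illustration, not part of the original pipeline.

```python
import io
import pandas as pd

# Toy stand-in for dataMET (assumption): the second description is empty,
# as could happen if every word in it was a stop word.
toy = pd.DataFrame(
    {"METdescr_prepLVL2": ["heavy somewhere belong enjoy", ""]},
    index=[3, 7],
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the original row index instead of creating "Unnamed: 0"
reloaded = pd.read_csv(buffer, index_col=0)

# The empty string comes back as NaN; replace it before any text processing
n_missing = reloaded["METdescr_prepLVL2"].isna().sum()
cleaned = reloaded["METdescr_prepLVL2"].fillna("")

print(f"{n_missing} description(s) read back as NaN")
print(cleaned.tolist())
```

Filling with an empty string keeps the row count intact; dropping those rows instead would be an equally reasonable choice, depending on the analysis.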

---
## Creating Combined METdocs

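Here, we combine the preprocessed MET descriptions into single aggregated documents, grouped by stimulus condition (i.e. each specific clip and context pairing). This produces 64 documents (16 clips × 4 contexts). Within each document, every description is followed by the marker token `endofasubhere`, so the boundaries between individual descriptions remain recoverable.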

Whereas the individual preprocessed MET descriptions can be analysed to measure semantic similarity between _individual_ thought descriptions, METdocs can be analysed to measure semantic similarity between _aggregated representations_ of conditions.
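
As a compact illustration of the aggregation, the sketch below builds the same kind of document table with `groupby` on a three-row toy frame. The helper name `build_metdocs` and the toy rows are hypothetical. Unlike the cell that follows, it takes the genre code directly from the first three characters of the clip name rather than checking it against the four known genres.

```python
import pandas as pd

SEP = " endofasubhere "  # marker appended after every description

# Toy stand-in for dataMET (assumption): two METs for one pair, one for another
toy = pd.DataFrame({
    "clip_context_PAIR": ["BAR-80s_LOW_02", "BAR-80s_LOW_02", "BAR-Ele_LOW_09"],
    "clip_name": ["80s_LOW_02", "80s_LOW_02", "Ele_LOW_09"],
    "context_word": ["bar", "bar", "bar"],
    "METdescr_prepLVL1": ["kind of sad", "dancing at night", "about europe"],
})

def build_metdocs(df, text_col):
    """Aggregate one preprocessed text column into one document per clip-context pair."""
    rows = []
    for pair, grp in df.groupby("clip_context_PAIR", sort=True):
        clip = grp["clip_name"].iloc[0]
        rows.append({
            "context_word": grp["context_word"].iloc[0],
            "genre_code": clip[:3],
            "clip_name": clip,
            "ClipContext_pair": pair,
            "GenreContext_pair": pair[:3] + "_" + clip[:3],
            # Each description is followed by the marker, including the last one
            text_col: "".join(str(text) + SEP for text in grp[text_col]),
        })
    return pd.DataFrame(rows)

docs = build_metdocs(toy, "METdescr_prepLVL1")
print(docs)
```

Calling the helper once per column (`METdescr_prepLVL1`, `METdescr_prepLVL1A`, `METdescr_prepLVL2`, `METdescr_prepLVL2A`) would cover all four variants without repeating the loop.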

In [14]:
# Initialise DataFrame for Level 1 preprocessed MET description
METdocsLVL1 = pd.DataFrame(index=range(0,1), columns=dataMET.columns)
rowIndex = 0

# Iterate through each unique clip_context_PAIR
for idStimPair in np.unique(dataMET['clip_context_PAIR']):
    # Create mask for current stimulus pair
    stimPairMask = dataMET['clip_context_PAIR'] == idStimPair
    filt_ClipContextData = dataMET[stimPairMask]

    # Get the first row to extract clip and context info
    if len(filt_ClipContextData) > 0:
        first_row = filt_ClipContextData.iloc[0]
        idClip = first_row['clip_name']
        idContext = first_row['context_word']

        # Concatenate all text descriptions
        descrSeries = filt_ClipContextData['METdescr_prepLVL1']

        joinedstring = ""
        for ival in range(0, len(descrSeries.values)):
            joinedstring = joinedstring + str(descrSeries.values[ival]) + " endofasubhere "

        # Assign values to dataframe
        METdocsLVL1.loc[rowIndex, 'METdescr_prepLVL1'] = joinedstring
        METdocsLVL1.loc[rowIndex, 'clip_name'] = idClip
        METdocsLVL1.loc[rowIndex, 'context_word'] = idContext
        METdocsLVL1.loc[rowIndex, 'ClipContext_pair'] = idStimPair
        METdocsLVL1.loc[rowIndex, 'GenreContext_pair'] = idStimPair[0:3] + "_" + idClip[0:3]
        # Assign genre code
        if idClip[0:3] == '80s':
            METdocsLVL1.loc[rowIndex, 'genre_code'] = '80s'
        elif idClip[0:3] == 'Jaz':
            METdocsLVL1.loc[rowIndex, 'genre_code'] = 'Jaz'
        elif idClip[0:3] == 'Met':
            METdocsLVL1.loc[rowIndex, 'genre_code'] = 'Met'
        elif idClip[0:3] == 'Ele':
            METdocsLVL1.loc[rowIndex, 'genre_code'] = 'Ele'

        rowIndex = rowIndex + 1

# Initialise DataFrame for Level 1A preprocessed MET description
METdocsLVL1A = pd.DataFrame(index=range(0,1), columns=dataMET.columns)
rowIndex = 0

# Iterate through each unique clip_context_PAIR
for idStimPair in np.unique(dataMET['clip_context_PAIR']):
    # Create mask for current stimulus pair
    stimPairMask = dataMET['clip_context_PAIR'] == idStimPair
    filt_ClipContextData = dataMET[stimPairMask]

    # Get the first row to extract clip and context info
    if len(filt_ClipContextData) > 0:
        first_row = filt_ClipContextData.iloc[0]
        idClip = first_row['clip_name']
        idContext = first_row['context_word']

        # Concatenate all text descriptions
        descrSeries = filt_ClipContextData['METdescr_prepLVL1A']

        joinedstring = ""
        for ival in range(0, len(descrSeries.values)):
            joinedstring = joinedstring + str(descrSeries.values[ival]) + " endofasubhere "

        # Assign values to dataframe
        METdocsLVL1A.loc[rowIndex, 'METdescr_prepLVL1A'] = joinedstring
        METdocsLVL1A.loc[rowIndex, 'clip_name'] = idClip
        METdocsLVL1A.loc[rowIndex, 'context_word'] = idContext
        METdocsLVL1A.loc[rowIndex, 'ClipContext_pair'] = idStimPair
        METdocsLVL1A.loc[rowIndex, 'GenreContext_pair'] = idStimPair[0:3] + "_" + idClip[0:3]
        # Assign genre code
        if idClip[0:3] == '80s':
            METdocsLVL1A.loc[rowIndex, 'genre_code'] = '80s'
        elif idClip[0:3] == 'Jaz':
            METdocsLVL1A.loc[rowIndex, 'genre_code'] = 'Jaz'
        elif idClip[0:3] == 'Met':
            METdocsLVL1A.loc[rowIndex, 'genre_code'] = 'Met'
        elif idClip[0:3] == 'Ele':
            METdocsLVL1A.loc[rowIndex, 'genre_code'] = 'Ele'

        rowIndex = rowIndex + 1


# Initialise DataFrame for Level 2 preprocessed MET description
METdocsLVL2 = pd.DataFrame(index=range(0,1), columns=dataMET.columns)
rowIndex = 0

# Iterate through each unique clip_context_PAIR
for idStimPair in np.unique(dataMET['clip_context_PAIR']):
    # Create mask for current stimulus pair
    stimPairMask = dataMET['clip_context_PAIR'] == idStimPair
    filt_ClipContextData = dataMET[stimPairMask]

    # Get the first row to extract clip and context info
    if len(filt_ClipContextData) > 0:
        first_row = filt_ClipContextData.iloc[0]
        idClip = first_row['clip_name']
        idContext = first_row['context_word']

        # Concatenate all text descriptions
        descrSeries = filt_ClipContextData['METdescr_prepLVL2']

        joinedstring = ""
        for ival in range(0, len(descrSeries.values)):
            joinedstring = joinedstring + str(descrSeries.values[ival]) + " endofasubhere "

        # Assign values to dataframe
        METdocsLVL2.loc[rowIndex, 'METdescr_prepLVL2'] = joinedstring
        METdocsLVL2.loc[rowIndex, 'clip_name'] = idClip
        METdocsLVL2.loc[rowIndex, 'context_word'] = idContext
        METdocsLVL2.loc[rowIndex, 'ClipContext_pair'] = idStimPair
        METdocsLVL2.loc[rowIndex, 'GenreContext_pair'] = idStimPair[0:3] + "_" + idClip[0:3]
        # Assign genre code
        if idClip[0:3] == '80s':
            METdocsLVL2.loc[rowIndex, 'genre_code'] = '80s'
        elif idClip[0:3] == 'Jaz':
            METdocsLVL2.loc[rowIndex, 'genre_code'] = 'Jaz'
        elif idClip[0:3] == 'Met':
            METdocsLVL2.loc[rowIndex, 'genre_code'] = 'Met'
        elif idClip[0:3] == 'Ele':
            METdocsLVL2.loc[rowIndex, 'genre_code'] = 'Ele'

        rowIndex = rowIndex + 1

# Initialise DataFrame for Level 2A preprocessed MET description
METdocsLVL2A = pd.DataFrame(index=range(0,1), columns=dataMET.columns)
rowIndex = 0

# Iterate through each unique clip_context_PAIR
for idStimPair in np.unique(dataMET['clip_context_PAIR']):
    # Create mask for current stimulus pair
    stimPairMask = dataMET['clip_context_PAIR'] == idStimPair
    filt_ClipContextData = dataMET[stimPairMask]

    # Get the first row to extract clip and context info
    if len(filt_ClipContextData) > 0:
        first_row = filt_ClipContextData.iloc[0]
        idClip = first_row['clip_name']
        idContext = first_row['context_word']

        # Concatenate all text descriptions
        descrSeries = filt_ClipContextData['METdescr_prepLVL2A']

        joinedstring = ""
        for ival in range(0, len(descrSeries.values)):
            joinedstring = joinedstring + str(descrSeries.values[ival]) + " endofasubhere "

        # Assign values to dataframe
        METdocsLVL2A.loc[rowIndex, 'METdescr_prepLVL2A'] = joinedstring
        METdocsLVL2A.loc[rowIndex, 'clip_name'] = idClip
        METdocsLVL2A.loc[rowIndex, 'context_word'] = idContext
        METdocsLVL2A.loc[rowIndex, 'ClipContext_pair'] = idStimPair
        METdocsLVL2A.loc[rowIndex, 'GenreContext_pair'] = idStimPair[0:3] + "_" + idClip[0:3]
        # Assign genre code
        if idClip[0:3] == '80s':
            METdocsLVL2A.loc[rowIndex, 'genre_code'] = '80s'
        elif idClip[0:3] == 'Jaz':
            METdocsLVL2A.loc[rowIndex, 'genre_code'] = 'Jaz'
        elif idClip[0:3] == 'Met':
            METdocsLVL2A.loc[rowIndex, 'genre_code'] = 'Met'
        elif idClip[0:3] == 'Ele':
            METdocsLVL2A.loc[rowIndex, 'genre_code'] = 'Ele'

        rowIndex = rowIndex + 1

# Filter and save
METdocsLVL1 = METdocsLVL1.filter(['context_word', 'genre_code', 'clip_name', 'ClipContext_pair', 'GenreContext_pair', 'METdescr_prepLVL1'], axis=1)
METdocsLVL1.to_csv('/content/context-framed-listening/NLP_outputs/METdocsLVL1.csv', encoding='utf-8')

METdocsLVL1A = METdocsLVL1A.filter(['context_word', 'genre_code', 'clip_name', 'ClipContext_pair', 'GenreContext_pair', 'METdescr_prepLVL1A'], axis=1)
METdocsLVL1A.to_csv('/content/context-framed-listening/NLP_outputs/METdocsLVL1A.csv', encoding='utf-8')

METdocsLVL2 = METdocsLVL2.filter(['context_word', 'genre_code', 'clip_name', 'ClipContext_pair', 'GenreContext_pair', 'METdescr_prepLVL2'], axis=1)
METdocsLVL2.to_csv('/content/context-framed-listening/NLP_outputs/METdocsLVL2.csv', encoding='utf-8')

METdocsLVL2A = METdocsLVL2A.filter(['context_word', 'genre_code', 'clip_name', 'ClipContext_pair', 'GenreContext_pair', 'METdescr_prepLVL2A'], axis=1)
METdocsLVL2A.to_csv('/content/context-framed-listening/NLP_outputs/METdocsLVL2A.csv', encoding='utf-8')


print(f"Created {len(METdocsLVL1)} documents (Clip-Context combinations), combining Level 1 preprocessed MET descriptions")
display(METdocsLVL1.head(5))

print(f"Created {len(METdocsLVL1A)} documents (Clip-Context combinations), combining Level 1A preprocessed MET descriptions")
display(METdocsLVL1A.head(5))

print(f"Created {len(METdocsLVL2)} documents (Clip-Context combinations), combining Level 2 preprocessed MET descriptions")
display(METdocsLVL2.head(5))

print(f"Created {len(METdocsLVL2A)} documents (Clip-Context combinations), combining Level 2A preprocessed MET descriptions")
display(METdocsLVL2A.head(5))

Created 64 documents (Clip-Context combinations), combining Level 1 preprocessed MET descriptions


Unnamed: 0,context_word,genre_code,clip_name,ClipContext_pair,GenreContext_pair,METdescr_prepLVL1
0,bar,80s,80s_LOW_02,BAR-80s_LOW_02,BAR_80s,kind of sad melancholy not happy or upbeat emo...
1,bar,80s,80s_LOW_06,BAR-80s_LOW_06,BAR_80s,the felt really 80s and so it was an of a 80s ...
2,bar,80s,80s_MED_08,BAR-80s_MED_08,BAR_80s,i about an old 80s that takes place in a i abo...
3,bar,80s,80s_MED_13,BAR-80s_MED_13,BAR_80s,i could only about sitting on the couch as a k...
4,bar,Ele,Ele_LOW_09,BAR-Ele_LOW_09,BAR_Ele,this me about europe or a specific country whe...


Created 64 documents (Clip-Context combinations), combining Level 1A preprocessed MET descriptions


Unnamed: 0,context_word,genre_code,clip_name,ClipContext_pair,GenreContext_pair,METdescr_prepLVL1A
0,bar,80s,80s_LOW_02,BAR-80s_LOW_02,BAR_80s,kind of sad melancholy not happy or upbeat emo...
1,bar,80s,80s_LOW_06,BAR-80s_LOW_06,BAR_80s,the felt really 80s and so it was an image of ...
2,bar,80s,80s_MED_08,BAR-80s_MED_08,BAR_80s,i think about an old 80s that takes place in a...
3,bar,80s,80s_MED_13,BAR-80s_MED_13,BAR_80s,i could only think about sitting on the couch ...
4,bar,Ele,Ele_LOW_09,BAR-Ele_LOW_09,BAR_Ele,this made me think about europe or a specific ...


Created 64 documents (Clip-Context combinations), combining Level 2 preprocessed MET descriptions


Unnamed: 0,context_word,genre_code,clip_name,ClipContext_pair,GenreContext_pair,METdescr_prepLVL2
0,bar,80s,80s_LOW_02,BAR-80s_LOW_02,BAR_80s,kind sad melancholy happy upbeat emotionally c...
1,bar,80s,80s_LOW_06,BAR-80s_LOW_06,BAR_80s,felt really style hawaiian hula style decor en...
2,bar,80s,80s_MED_08,BAR-80s_MED_08,BAR_80s,old take place couple dancing good time bring ...
3,bar,80s,80s_MED_13,BAR-80s_MED_13,BAR_80s,could sit couch kid would type would play cred...
4,bar,Ele,Ele_LOW_09,BAR-Ele_LOW_09,BAR_Ele,europe specific country electric popular outsi...


Created 64 documents (Clip-Context combinations), combining Level 2A preprocessed MET descriptions


Unnamed: 0,context_word,genre_code,clip_name,ClipContext_pair,GenreContext_pair,METdescr_prepLVL2A
0,bar,80s,80s_LOW_02,BAR-80s_LOW_02,BAR_80s,kind sad melancholy happy upbeat emotionally c...
1,bar,80s,80s_LOW_06,BAR-80s_LOW_06,BAR_80s,felt really image style hawaiian hula style de...
2,bar,80s,80s_MED_08,BAR-80s_MED_08,BAR_80s,think old take place think couple dancing good...
3,bar,80s,80s_MED_13,BAR-80s_MED_13,BAR_80s,could think sit couch kid would type would pla...
4,bar,Ele,Ele_LOW_09,BAR-Ele_LOW_09,BAR_Ele,make think europe specific country electric po...


Text preprocessing is complete!

The saved files `METdocsLVL1.csv`, `METdocsLVL1A.csv`, `METdocsLVL2.csv`, and `METdocsLVL2A.csv` will be used alongside `dataMET_preprocessed.csv` for the different levels of analysis in the subsequent NLP work.

(_Reminder:_ LVL1 & LVL2 have thought words removed, LVL1A & LVL2A have thought words _kept in_.)
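
Because each METdoc carries the `endofasubhere` marker after every description, downstream code will probably need either to split a document back into its parts or to strip the marker before computing embeddings or term frequencies. A minimal sketch of both, using a made-up document string (how the later notebooks actually handle the marker is not shown here, so treat this as an assumption):

```python
MARKER = "endofasubhere"

# Made-up aggregated document in the same shape as a METdocs text cell
doc = "kind of sad endofasubhere dancing at night endofasubhere "

# Recover the individual descriptions, dropping the empty trailing piece
parts = [piece.strip() for piece in doc.split(MARKER) if piece.strip()]

# Or produce a marker-free document for whole-document analyses
plain = " ".join(parts)

print(parts)
print(plain)
```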