<a href="https://colab.research.google.com/github/HazelvdW/context-framed-listening/blob/main/NLP_framed_listening.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP analysis for Framed Listening study.
> Authored by **Hazel A. van der Walle** (PhD student, Music, Durham University), September 2025.

All datasets generated and used for this study are openly available on GitHub at https://github.com/HazelvdW/context-framed-listening.

The cleaned raw data (processed in R) are used in this notebook, so let's clone the necessary files and directories:

In [1]:
!git clone https://github.com/HazelvdW/context-framed-listening.git

Cloning into 'context-framed-listening'...
remote: Enumerating objects: 50, done.
remote: Counting objects: 100% (50/50), done.
remote: Compressing objects: 100% (40/40), done.
remote: Total 50 (delta 26), reused 25 (delta 9), pack-reused 0 (from 0)
Receiving objects: 100% (50/50), 802.98 KiB | 9.02 MiB/s, done.
Resolving deltas: 100% (26/26), done.


You should now have a folder called **"context-framed-listening"** in this notebook.

Check this by clicking the folder icon in the left-hand side panel of this webpage (press the refresh symbol if you can't see it yet).

For this NLP analysis, we work only from the file **"data_study1_MAIN.csv"**, which contains participants' qualitative thought descriptions.





---
## Setup

Start by importing the necessary packages in the cell below:

In [2]:
import os
import csv
import pandas as pd
import numpy as np


from google.colab import data_table
data_table.enable_dataframe_formatter()

Load in the data .csv file:

In [3]:
data = pd.read_csv("/content/context-framed-listening/data_study1_MAIN.csv")

This dataset contains every participant's response to all 16 clip-context stimuli pairings.

_For the purposes of this analysis_ we are only interested in trials where music-evoked thoughts (METs) were experienced and described – this is all rows where "descr_THOUGHT.text" is _not_ NA.

Let's create a new dataset that only contains trials with METs:

In [4]:
dataMET = data[data['descr_THOUGHT.text'].notna()].copy()

Familiarise yourself with the data structure by taking a quick look through before we dig into any analyses.

In [5]:
display(dataMET)

# print out all column headers so we have a quick copy for later reference
print(dataMET.columns)



Unnamed: 0,clip_name,context_word,expName,PROLIFIC_PID,File_ID,date,response_thought_or_not.keys,descr_THOUGHT.text,rating_music_prompted.response,rating_spontaneity.response,...,demographics.livingCountry,demographics.birthCountry,demographics.nativeLanguage,demographics.otherLanguage,demographics.otherLanguageText,demographics.hearingImpariments,demographics.hearingImpairmentsText,demographics.education,demographics.musicianIdentification,demographics.feedback
0,80s_LOW_02_Breaking_Away.mp3,bar,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,y,"kind of sad, melancholy. not happy or upbeat. ...",5.0,4.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
1,Jazz_MED_07_Turiya_and_Ramakrishna.mp3,video game,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,y,it did not sound like a video game. if anythin...,5.0,5.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
2,80s_MED_08_After_Tonight.mp3,video game,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,y,"overly upbeat. no real emotions, peppy. too mu...",5.0,4.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
3,Metal_LOW_09_Darkside.mp3,concert,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,y,"very heavy rock, not for me. somewhere that i ...",5.0,5.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
4,Metal_MED_20_Welcome_to_the_Family.mp3,video game,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,y,"very charged, maybe you've won something or wo...",5.0,4.0,...,United Kingdom,United Kingdom,English,False,,False,,5,2,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2554,Metal_MED_20_Welcome_to_the_Family.mp3,concert,clip_context_g4,6824fc226d9b4777f8695cf0,clip_context_g4_PROLIFIC_PID_992291.csv,2025-07-01_06h12.29.151,y,A rock band made up of teenage white kids play...,5.0,5.0,...,United States,United States,English,False,,False,,3,2,none
2555,80s_LOW_02_Breaking_Away.mp3,concert,clip_context_g4,6824fc226d9b4777f8695cf0,clip_context_g4_PROLIFIC_PID_992291.csv,2025-07-01_06h12.29.151,y,People in a ballroom in elegant dresses slow d...,5.0,5.0,...,United States,United States,English,False,,False,,3,2,none
2557,Jazz_MED_02_I_Guess_Ill_Hang_My_Tears_Out_To_D...,video game,clip_context_g4,6824fc226d9b4777f8695cf0,clip_context_g4_PROLIFIC_PID_992291.csv,2025-07-01_06h12.29.151,y,I imagined a jazz festival and old men on stag...,5.0,5.0,...,United States,United States,English,False,,False,,3,2,none
2558,Electronic_MED_20_The_Distance.mp3,movie,clip_context_g4,6824fc226d9b4777f8695cf0,clip_context_g4_PROLIFIC_PID_992291.csv,2025-07-01_06h12.29.151,y,I imagined a documentary mostly about fun fact...,5.0,5.0,...,United States,United States,English,False,,False,,3,2,none


Index(['clip_name', 'context_word', 'expName', 'PROLIFIC_PID', 'File_ID',
       'date', 'response_thought_or_not.keys', 'descr_THOUGHT.text',
       'rating_music_prompted.response', 'rating_spontaneity.response',
       'rating_novelty.response', 'input_NOT.text',
       'rating_familiarity.response', 'rating_enjoyment.response',
       'demographics.headphones', 'demographics.age', 'demographics.gender',
       'demographics.livingCountry', 'demographics.birthCountry',
       'demographics.nativeLanguage', 'demographics.otherLanguage',
       'demographics.otherLanguageText', 'demographics.hearingImpariments',
       'demographics.hearingImpairmentsText', 'demographics.education',
       'demographics.musicianIdentification', 'demographics.feedback'],
      dtype='object')


We're going to combine the clip and context values into an additional column ("clip_context_PAIR") that we can use in functions later.

We can also drop the columns that were only relevant to distinguishing trials with thoughts from trials without them.

In [6]:
def create_clip_context_pair(row):
    if 'bar' in row['context_word']:
        return 'BAR-' + row['clip_name']
    elif 'video game' in row['context_word']:
        return 'VIDEOGAME-' + row['clip_name']
    elif 'concert' in row['context_word']:
        return 'CONCERT-' + row['clip_name']
    elif 'movie' in row['context_word']:
        return 'MOVIE-' + row['clip_name']
    else:
        return row['clip_name'] # Default to just the clip name if no match

dataMET['clip_context_PAIR'] = dataMET.apply(create_clip_context_pair, axis=1)

dataMET.drop(columns = ['response_thought_or_not.keys', 'input_NOT.text'],
             inplace=True)

# Check the dataframe by a quick re-view :)
display(dataMET)
print(dataMET.columns)

# Saving a .csv for the option to open and look at the full dataframe
dataMET.to_csv('/content/context-framed-listening/dataMET.csv', encoding='utf-8')



Unnamed: 0,clip_name,context_word,expName,PROLIFIC_PID,File_ID,date,descr_THOUGHT.text,rating_music_prompted.response,rating_spontaneity.response,rating_novelty.response,...,demographics.birthCountry,demographics.nativeLanguage,demographics.otherLanguage,demographics.otherLanguageText,demographics.hearingImpariments,demographics.hearingImpairmentsText,demographics.education,demographics.musicianIdentification,demographics.feedback,clip_context_PAIR
0,80s_LOW_02_Breaking_Away.mp3,bar,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,"kind of sad, melancholy. not happy or upbeat. ...",5.0,4.0,3.0,...,United Kingdom,English,False,,False,,5,2,,BAR-80s_LOW_02_Breaking_Away.mp3
1,Jazz_MED_07_Turiya_and_Ramakrishna.mp3,video game,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,it did not sound like a video game. if anythin...,5.0,5.0,2.0,...,United Kingdom,English,False,,False,,5,2,,VIDEOGAME-Jazz_MED_07_Turiya_and_Ramakrishna.mp3
2,80s_MED_08_After_Tonight.mp3,video game,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,"overly upbeat. no real emotions, peppy. too mu...",5.0,4.0,4.0,...,United Kingdom,English,False,,False,,5,2,,VIDEOGAME-80s_MED_08_After_Tonight.mp3
3,Metal_LOW_09_Darkside.mp3,concert,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,"very heavy rock, not for me. somewhere that i ...",5.0,5.0,3.0,...,United Kingdom,English,False,,False,,5,2,,CONCERT-Metal_LOW_09_Darkside.mp3
4,Metal_MED_20_Welcome_to_the_Family.mp3,video game,clip_context_g1,5eff5f05b92981000a2aed73,clip_context_g1_5eff5f05b92981000a2aed73_02059...,2025-07-01_10h45.13.126,"very charged, maybe you've won something or wo...",5.0,4.0,2.0,...,United Kingdom,English,False,,False,,5,2,,VIDEOGAME-Metal_MED_20_Welcome_to_the_Family.mp3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2554,Metal_MED_20_Welcome_to_the_Family.mp3,concert,clip_context_g4,6824fc226d9b4777f8695cf0,clip_context_g4_PROLIFIC_PID_992291.csv,2025-07-01_06h12.29.151,A rock band made up of teenage white kids play...,5.0,5.0,4.0,...,United States,English,False,,False,,3,2,none,CONCERT-Metal_MED_20_Welcome_to_the_Family.mp3
2555,80s_LOW_02_Breaking_Away.mp3,concert,clip_context_g4,6824fc226d9b4777f8695cf0,clip_context_g4_PROLIFIC_PID_992291.csv,2025-07-01_06h12.29.151,People in a ballroom in elegant dresses slow d...,5.0,5.0,5.0,...,United States,English,False,,False,,3,2,none,CONCERT-80s_LOW_02_Breaking_Away.mp3
2557,Jazz_MED_02_I_Guess_Ill_Hang_My_Tears_Out_To_D...,video game,clip_context_g4,6824fc226d9b4777f8695cf0,clip_context_g4_PROLIFIC_PID_992291.csv,2025-07-01_06h12.29.151,I imagined a jazz festival and old men on stag...,5.0,5.0,4.0,...,United States,English,False,,False,,3,2,none,VIDEOGAME-Jazz_MED_02_I_Guess_Ill_Hang_My_Tear...
2558,Electronic_MED_20_The_Distance.mp3,movie,clip_context_g4,6824fc226d9b4777f8695cf0,clip_context_g4_PROLIFIC_PID_992291.csv,2025-07-01_06h12.29.151,I imagined a documentary mostly about fun fact...,5.0,5.0,5.0,...,United States,English,False,,False,,3,2,none,MOVIE-Electronic_MED_20_The_Distance.mp3


Index(['clip_name', 'context_word', 'expName', 'PROLIFIC_PID', 'File_ID',
       'date', 'descr_THOUGHT.text', 'rating_music_prompted.response',
       'rating_spontaneity.response', 'rating_novelty.response',
       'rating_familiarity.response', 'rating_enjoyment.response',
       'demographics.headphones', 'demographics.age', 'demographics.gender',
       'demographics.livingCountry', 'demographics.birthCountry',
       'demographics.nativeLanguage', 'demographics.otherLanguage',
       'demographics.otherLanguageText', 'demographics.hearingImpariments',
       'demographics.hearingImpairmentsText', 'demographics.education',
       'demographics.musicianIdentification', 'demographics.feedback',
       'clip_context_PAIR'],
      dtype='object')
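As an aside, the same pairing label can be built without a row-wise `apply`. The sketch below is only an illustration on toy rows (the `toy` dataframe is made up here), and it assumes `context_word` always holds exactly one of the four cues (bar, video game, concert, movie); unlike the function above, it has no fallback for other values.

```python
import pandas as pd

# Toy rows standing in for dataMET; the real data has many more columns.
toy = pd.DataFrame({
    "clip_name": ["80s_LOW_02_Breaking_Away.mp3", "Metal_LOW_09_Darkside.mp3"],
    "context_word": ["bar", "video game"],
})

# Upper-case the context cue, remove spaces, and prefix it to the clip name.
# Assumes context_word is exactly one of: bar, video game, concert, movie.
context_label = toy["context_word"].str.upper().str.replace(" ", "", regex=False)
toy["clip_context_PAIR"] = context_label + "-" + toy["clip_name"]

print(toy["clip_context_PAIR"].tolist())
```

On a dataset of this size the speed difference is unlikely to matter, so the explicit function above is a perfectly reasonable choice if you prefer its readability.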


---
## Descriptive Statistics

Below we are going to run some basic descriptive statistics on the clip-context pairings.


We're going to create a dataframe that includes summary info about each clip-context stimuli pairing including:

* Number of participants that reported METs while listening
* Mean ratings of MET "prompting-power", spontaneity, and novelty, and clip familiarity and enjoyment

In [7]:
columns = dataMET.columns.tolist()[1:-1]

# Drop these following columns so they don't aggregate by clip-context grouping:
drop = ['clip_name', 'context_word', 'expName', 'File_ID', 'date',
        'descr_THOUGHT.text', 'demographics.headphones', 'demographics.age',
        'demographics.gender','demographics.livingCountry',
        'demographics.birthCountry', 'demographics.nativeLanguage',
        'demographics.otherLanguage', 'demographics.otherLanguageText',
        'demographics.hearingImpariments', 'demographics.hearingImpairmentsText',
        'demographics.education','demographics.musicianIdentification',
        'demographics.feedback']

# Setting up an aggregate function collector
agg_fun = {}

# As we dropped trials without METs, counting participant IDs per pairing gives the number of MET reports
agg_fun['PROLIFIC_PID'] = 'count'

# Taking the mean of all columns except participant IDs and dropped columns
for col in columns:
    if col not in drop and col != 'PROLIFIC_PID':
        agg_fun[col] = 'mean'

# Group the dataframe by clip-context pairing, then run the aggregate functions created above
clipContextDescrStats = dataMET.groupby('clip_context_PAIR').agg(agg_fun)
display(clipContextDescrStats)

# Saving a .csv for the option to open and look at the full dataframe
clipContextDescrStats.to_csv('/content/context-framed-listening/clipContextDescrStats.csv', encoding='utf-8')

Unnamed: 0_level_0,PROLIFIC_PID,rating_music_prompted.response,rating_spontaneity.response,rating_novelty.response,rating_familiarity.response,rating_enjoyment.response
clip_context_PAIR,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
BAR-80s_LOW_02_Breaking_Away.mp3,34,4.235294,3.882353,3.382353,2.294118,3.705882
BAR-80s_LOW_06_Summer_Beach.mp3,29,4.448276,4.103448,3.482759,2.586207,3.793103
BAR-80s_MED_08_After_Tonight.mp3,29,4.275862,3.862069,2.965517,2.000000,3.310345
BAR-80s_MED_13_Your_Love_Drives_Me_Crazy.mp3,25,4.160000,3.840000,2.360000,2.280000,3.600000
BAR-Electronic_LOW_09_Expansion.mp3,27,4.592593,3.962963,2.259259,2.000000,3.259259
...,...,...,...,...,...,...
VIDEOGAME-Jazz_MED_07_Turiya_and_Ramakrishna.mp3,35,4.428571,4.285714,2.685714,2.457143,3.742857
VIDEOGAME-Metal_LOW_09_Darkside.mp3,34,4.558824,3.794118,2.411765,2.029412,3.235294
VIDEOGAME-Metal_LOW_14_Viaje_Por_Existir.mp3,26,4.307692,3.923077,2.423077,2.230769,3.346154
VIDEOGAME-Metal_MED_19_Thunderhorse.mp3,31,4.483871,4.161290,2.580645,2.645161,3.258065
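In the table above, the count of MET reports sits in a column still called `PROLIFIC_PID`, which can be confusing later. One possible alternative is pandas' named aggregation, sketched below on made-up toy data; the column name `n_METs` is a hypothetical label chosen for this illustration, not something used elsewhere in this notebook.

```python
import pandas as pd

# Toy data with the same kind of columns as dataMET (values are invented).
toy = pd.DataFrame({
    "clip_context_PAIR": ["BAR-a.mp3", "BAR-a.mp3", "MOVIE-a.mp3"],
    "PROLIFIC_PID": ["p1", "p2", "p1"],
    "rating_novelty.response": [3.0, 5.0, 2.0],
    "rating_enjoyment.response": [4.0, 2.0, 5.0],
})

# Select the rating columns by prefix instead of listing columns to drop.
rating_cols = [c for c in toy.columns if c.startswith("rating_")]

# Named aggregation: output_name=(input_column, function).
# Column names containing dots are passed through an unpacked dict.
summary = toy.groupby("clip_context_PAIR").agg(
    n_METs=("PROLIFIC_PID", "count"),
    **{col: (col, "mean") for col in rating_cols},
)
print(summary)
```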


We can check here the minimum and maximum number of reported MET occurrences across the clip-context stimuli pairings.

In [8]:
mostMETs_ccpair = clipContextDescrStats['PROLIFIC_PID'].idxmax()
leastMETs_ccpair = clipContextDescrStats['PROLIFIC_PID'].idxmin()
mostMETs_value = clipContextDescrStats['PROLIFIC_PID'].max()
leastMETs_value = clipContextDescrStats['PROLIFIC_PID'].min()

print(f"Clip-context pair with the most reported METs: {mostMETs_ccpair} ({mostMETs_value})")
print(f"Clip-context pair with the least reported METs: {leastMETs_ccpair} ({leastMETs_value})")

Clip-context pair with the most reported METs: BAR-Electronic_LOW_14_Tape.mp3 (35)
Clip-context pair with the least reported METs: BAR-80s_MED_13_Your_Love_Drives_Me_Crazy.mp3 (25)
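Note that `idxmax()` and `idxmin()` return only the first matching label, so other pairings tied on the same count would not be shown. A small sketch of how ties could be listed, using invented toy counts rather than the real `clipContextDescrStats`:

```python
import pandas as pd

# Toy counts standing in for clipContextDescrStats['PROLIFIC_PID'].
met_counts = pd.Series(
    {"BAR-a.mp3": 35, "MOVIE-b.mp3": 35, "CONCERT-c.mp3": 25},
    name="PROLIFIC_PID",
)

# Keep every pairing whose count equals the maximum (or minimum).
most = met_counts[met_counts == met_counts.max()].index.tolist()
least = met_counts[met_counts == met_counts.min()].index.tolist()

print("Most METs:", most)
print("Least METs:", least)
```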




---

## Text Preprocessing

This next section goes through some text cleaning necessary to start feeding these METs into our NLP models.

Run the package-setup code cell below in preparation:

In [10]:
#!pip uninstall -y pyspellchecker
!pip install pyspellchecker
from spellchecker import SpellChecker
spell = SpellChecker()

import re
from collections import defaultdict

import nltk
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.corpus import stopwords
nltk.download('punkt_tab')


Collecting pyspellchecker
  Downloading pyspellchecker-0.8.3-py3-none-any.whl.metadata (9.5 kB)
Downloading pyspellchecker-0.8.3-py3-none-any.whl (7.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.2/7.2 MB 42.3 MB/s eta 0:00:00
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.8.3


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

### Spell checking

I collected all the misspellings flagged by the spell checker above so that I could go through them manually and decide what needs to be changed before going ahead with word removal, lemmatising, and tokenising.

In [11]:
# Initialize the spell checker
spell = SpellChecker()

def find_potential_misspellings_with_context(row):
    misspellings_with_context = []
    text = row['descr_THOUGHT.text']
    if isinstance(text, str):
        # Tokenize the raw text
        tokens = word_tokenize(text)
        for word in tokens:
            # Clean the word for spell checking (lowercase and remove punctuation)
            cleaned_word = re.sub(r'[^a-zA-Z0-9]', '', word.lower())
            if cleaned_word and cleaned_word not in spell:
                misspellings_with_context.append((cleaned_word, text)) # Store word and original text
    return misspellings_with_context

# Apply the function to each row and collect all misspellings with their context
all_misspellings_with_context = []
for index, row in dataMET.iterrows():
    misspellings_list = find_potential_misspellings_with_context(row)
    all_misspellings_with_context.extend(misspellings_list)

# Create a DataFrame from the collected misspellings and their context
misspellings_df = pd.DataFrame(all_misspellings_with_context, columns=['Potential Misspelling', 'Original Text'])

# Display the DataFrame (showing unique misspellings and their first occurrence context)
# To show all occurrences, remove .drop_duplicates()
display(misspellings_df.drop_duplicates(subset=['Potential Misspelling']))

# Saving the DataFrame to a CSV file
misspellings_df.to_csv('/content/context-framed-listening/misspellings_df.csv', encoding='utf-8', index=False)

# You can now use this DataFrame to help create your correction_mapping in the next cell.

Unnamed: 0,Potential Misspelling,Original Text
0,nt,"very heavy rock, not for me. somewhere that i ..."
1,ve,"very charged, maybe you've won something or wo..."
2,rememeber,"very bland, like lift music or something. not ..."
8,doesnt,Doesnt fit with what i would imagien for a vid...
9,imagien,Doesnt fit with what i would imagien for a vid...
...,...,...
1204,1970s1980s,I imagined this being something they'd play ba...
1205,disko,I imagined people dancing and disko lights
1206,kidteen,The theme song at the beginning of a kid/teen ...
1207,rihanna,I imagined Rihanna performing this in a concert


With the csv file created, I manually went through these misspellings, adding corrections where I deemed appropriate.

Then, I plug that edited csv file back in to apply the corrections.

_Let's implement this in the cell below:_

SyntaxError: invalid syntax (ipython-input-1169789841.py, line 10)
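The source of the cell that raised this error is not shown here, so the following is not the original implementation. It is only a minimal sketch of how corrections from a manually edited table could be applied. It assumes the edited csv keeps the "Potential Misspelling" column and adds a column called "Correction" (a hypothetical name), with blank entries meaning "leave as is". The table and texts below are invented toy data.

```python
import re
import pandas as pd

# Hypothetical stand-in for the manually edited CSV; the 'Correction' column
# name is an assumption, and rows left blank mean "keep as is".
corrections_df = pd.DataFrame({
    "Potential Misspelling": ["imagien", "disko", "rihanna"],
    "Correction": ["imagine", "disco", None],
})

# Build {misspelling: correction}, skipping rows without a correction.
correction_mapping = {
    row["Potential Misspelling"]: row["Correction"]
    for _, row in corrections_df.dropna(subset=["Correction"]).iterrows()
}

# One case-insensitive pattern matching any listed misspelling as a whole word.
pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, correction_mapping)) + r")\b",
    flags=re.IGNORECASE,
)

def apply_corrections(text):
    """Replace listed misspellings in a string; non-strings pass through."""
    if not isinstance(text, str):
        return text
    return pattern.sub(lambda m: correction_mapping[m.group(0).lower()], text)

toy_texts = pd.Series(["Doesnt fit with what i would imagien", "disko lights"])
corrected = toy_texts.apply(apply_corrections)
print(corrected.tolist())
```

One caveat with this approach: the flagged words were stored after stripping punctuation (e.g. "kidteen" from "kid/teen"), so whole-word matching against the raw text would miss such cases and they would need separate handling.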

### Stop Words

We need to define "stop words" to remove from the qualitative data.
NLTK provides a built-in list of English stop words: largely uninteresting words such as articles, conjunctions, prepositions, pronouns, and common verbs.

This improves the efficiency and accuracy of NLP analyses by ensuring the focus is on significant terms and their patterns (i.e. it's not particularly interesting to know the pattern of "the", "and", "in", "it", "is" across the dataset).


We are also going to set some **custom stop words** that don't add any unique meaning to the METs we are interested in analysing. These include words naming the genre of music heard, the context cues provided, and the way in which the MET was experienced (not everyone specifies or is able to identify _how_ they experienced their MET, so these words can be misleading in analyses).


**NB:** you can come back to this cell to add other words and re-run analyses if other insignificant words are obscuring the data or if you think too much is being taken out.

In [37]:
# Define custom stop words
customStopWords = ['music', 'song', 'songs', 'excerpt', 'excerpts', 'piece', 'pieces', 'clip', 'clips',
                   'nineteen', '1920s', '20s', '1930s', '30s', '1940s', '40s', '50s', '1950s', 'fifties', '50', '1950', 'fifty',
                   '60s', '1960s', 'sixties', '60', '1960', 'sixty', '70s', '1970s', 'seventies', '70', '1970', 'seventy',
                   '80s', '1980s', 'eighties', '80', '1980', 'eighty', '90s', '1990s', 'nineties', '90', '1990', 'ninety',
                   '00s', '2000s', 'noughties', '2000', 'y2k', '2010',
                   'ambient', 'classical', 'electronic', 'funk', 'jazz', 'metal', 'rock', 'pop',
                   'bar', 'club', 'concert', 'film', 'movie', 'videogame', 'video', 'game',
                   'think', 'thinks', 'thought', 'thinking',
                   'imagine', 'imagines', 'imagined', 'imagining',
                   'image', 'images', 'imaged', 'imaging',
                   'visualise', 'visualises', 'visualised', 'visualising', 'visualize', 'visualizes', 'visualized', 'visualizing',
                   'picture', 'pictures', 'pictured', 'picturing',
                   'scene', 'scenes', 'story', 'stories',
                   'memory', 'memories', 'reminder', 'reminders', 'remind', 'reminds', 'remember', 'remembers', 'remembered', 'remembering',
                   'reminiscent', 'reminisce', 'reminisces', 'reminisced', 'reminiscing',
                   'make', 'makes', 'made', 'making', 'sound', 'sounds', 'sounded', 'sounding',
                   "'s", "n't", 'to', 'of', 'and', 'like', 'around', 'also', 'abit', 'ish']

# Combine custom stop words with NLTK stop words
nltk_stop_words = set(nltk.corpus.stopwords.words('english'))
all_stop_words = set(customStopWords).union(nltk_stop_words)

print("Custom Stop Words:", customStopWords)
print("Total Stop Words:", len(all_stop_words))

Custom Stop Words: ['music', 'song', 'songs', 'excerpt', 'excerpts', 'piece', 'pieces', 'clip', 'clips', 'nineteen', '1920s', '20s', '1930s', '30s', '1940s', '40s', '50s', '1950s', 'fifties', '50', '1950', 'fifty', '60s', '1960s', 'sixties', '60', '1960', 'sixty', '70s', '1970s', 'seventies', '70', '1970', 'seventy', '80s', '1980s', 'eighties', '80', '1980', 'eighty', '90s', '1990s', 'nineties', '90', '1990', 'ninety', '00s', '2000s', 'noughties', '2000', 'y2k', 'ambient', 'classical', 'electronic', 'funk', 'jazz', 'metal', 'rock', 'pop', 'bar', 'club', 'concert', 'film', 'movie', 'videogame', 'video', 'game', 'think', 'thinks', 'thought', 'thinking', 'imagine', 'imagines', 'imagined', 'imagining', 'image', 'images', 'imaged', 'imaging', 'visualise', 'visualises', 'visualised', 'visualising', 'visualize', 'visualizes', 'visualized', 'visualizing', 'picture', 'pictures', 'pictured', 'picturing', 'scene', 'scenes', 'story', 'stories', 'memory', 'memories', 'reminder', 'reminders', 'remin
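One thing worth checking when editing this list: the preprocessing function later in this notebook strips punctuation from the text before tokenising, so any stop word that itself contains punctuation can never match a token. The sketch below illustrates this with a handful of entries copied from `customStopWords`; `sample_stop_words` and `strip_punctuation` are names made up for this example.

```python
import re

# A few entries copied from customStopWords above.
sample_stop_words = ["music", "80s", "'s", "n't", "videogame"]

# Same character filter the preprocessing function applies to the text.
def strip_punctuation(s):
    return re.sub(r"[^a-zA-Z0-9\s]", "", s.lower())

# Entries containing punctuation can never equal a punctuation-free token.
unreachable = [w for w in sample_stop_words if strip_punctuation(w) != w]
print(unreachable)
```

Such entries are harmless, but they would only take effect if tokenising happened before punctuation removal.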

### Lemmatising


Lemmatising involves grouping together inflected forms of a word (e.g., plurals, adjectives, verbs, adverbs) by identifying a word's lemma - the dictionary form - so they can be analysed as a single item.

[_For example, improve/improves/improving/improved = improve_]

The "tag_map" below sets up a dictionary that maps the part-of-speech tags provided by NLTK's pos_tag function to the format required by the WordNetLemmatizer. This enables the lemmatiser to correctly identify each word's part of speech and apply the appropriate lemmatisation rules to find its base form.

In [None]:
lemmatizer = nltk.stem.WordNetLemmatizer()
wn = nltk.corpus.wordnet

tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV
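To make the lookup concrete, here is a small stand-alone sketch of how the first letter of a Penn Treebank tag resolves through such a map. To avoid needing the WordNet download, it uses the single-letter values that the `wn.NOUN`, `wn.ADJ`, `wn.VERB`, and `wn.ADV` constants stand for; the list of example tags is chosen for illustration.

```python
from collections import defaultdict

# WordNet's part-of-speech constants are single letters:
# wn.NOUN == 'n', wn.ADJ == 'a', wn.VERB == 'v', wn.ADV == 'r'.
NOUN, ADJ, VERB, ADV = "n", "a", "v", "r"

# Any tag whose first letter is not J, V, or R falls back to noun.
tag_map = defaultdict(lambda: NOUN)
tag_map["J"] = ADJ
tag_map["V"] = VERB
tag_map["R"] = ADV

# Penn Treebank tags as returned by pos_tag; only the first letter is used.
for treebank_tag in ["NNS", "VBD", "JJ", "RB", "IN"]:
    print(treebank_tag, "->", tag_map[treebank_tag[0]])
```

The noun default means that words tagged as, say, prepositions ("IN") are passed to the lemmatiser as nouns, which usually leaves them unchanged.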

> Let's put these together into a function that takes a string of text as input and performs the above alongside several other standard preprocessing steps, such as removing punctuation, lower-casing, and flagging potential misspellings:

In [38]:
lemmatizer = nltk.stem.WordNetLemmatizer()
wn = nltk.corpus.wordnet

tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV


def preprocess_text_with_misspellings(text):
    """Return (cleaned text, flagged words); non-string input is returned unchanged."""

    misspellings = []
    if isinstance(text, str):
        # Clean punctuation and convert to lower case
        text = re.sub(r'[^a-zA-Z0-9\s]', '', text.lower())

        # Tokenise
        tokens = word_tokenize(text)

        lemmatized_tokens = []
        for word in tokens:
            # Identify potential misspellings
            if word not in spell:
                misspellings.append(word)

            # Lemmatisation
            tag = pos_tag([word])[0][1][0].upper()
            lemma = lemmatizer.lemmatize(word, tag_map[tag])
            lemmatized_tokens.append(lemma)

        # Remove stop words
        filtered_tokens = [word for word in lemmatized_tokens if word not in all_stop_words]

        return " ".join(filtered_tokens), misspellings
    return text, misspellings


# Apply the function and split the results into two new columns
dataMET[['preprocessed_MET_descr', 'potential_misspellings']] = dataMET['descr_THOUGHT.text'].apply(preprocess_text_with_misspellings).apply(pd.Series)

display(dataMET[['descr_THOUGHT.text', 'preprocessed_MET_descr', 'potential_misspellings']])

# Saving a .csv for the option to open and look at the full dataframe
dataMET.to_csv('/content/context-framed-listening/dataMETprepro.csv', encoding='utf-8')



Unnamed: 0,descr_THOUGHT.text,preprocessed_MET_descr,potential_misspellings
0,"kind of sad, melancholy. not happy or upbeat. ...",kind sad melancholy happy upbeat emotionally c...,[]
1,it did not sound like a video game. if anythin...,anything maybe end credit something something ...,[youd]
2,"overly upbeat. no real emotions, peppy. too mu...",overly upbeat real emotion peppy much,[]
3,"very heavy rock, not for me. somewhere that i ...",heavy somewhere dont belong enjoy,[dont]
4,"very charged, maybe you've won something or wo...",charge maybe youve something battle hyped poss...,[youve]
...,...,...,...
2554,A rock band made up of teenage white kids play...,band teenage white kid play garage,[]
2555,People in a ballroom in elegant dresses slow d...,people ballroom elegant dress slow dance floor,[]
2557,I imagined a jazz festival and old men on stag...,festival old men stage play instrument,[]
2558,I imagined a documentary mostly about fun fact...,documentary mostly fun fact historic information,[]
