<a href="https://colab.research.google.com/github/HazelvdW/context-framed-listening/blob/main/framed_listening_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Framed Listening: **BERT analyses**
> By **Hazel A. van der Walle** (PhD student, Music, Durham University), October 2025.

This notebook mirrors the structure of the [Word2Vec analysis](https://github.com/HazelvdW/context-framed-listening/blob/main/framed_listening_Word2Vec.ipynb) and [TF-IDF analysis](https://github.com/HazelvdW/context-framed-listening/blob/main/framed_listening_TFIDF.ipynb) notebooks for this study.

Here, we run the cosine similarity analyses and semantic similarity analyses using Bidirectional Encoder Representations from Transformers (BERT) embeddings.

For both of these analyses, two levels of investigation are conducted:
1. a broad categorisation, grouping METs by the genre of the clip (N=4) and context (N=4) pairing (*= 16 documents*)
2. grouping METs by specific clip (N=16) and context (N=4) pairing (*= 64 documents*)


Each analysis section opens with an overview and closes with a summary listing its output files.



## BERT analysis briefing

This code addresses questions such as "Do people describe Jazz differently in BAR vs CONCERT contexts?" and "Is genre or context more important for similarity?"

* Document-level analysis
  * Creates one embedding per document (each genre-context or clip-context combination)
  * Builds a full similarity matrix between all documents
  * Extracts specific conditions from that matrix
* BERT embeddings
  * Uses mean pooling across all token embeddings for better document representation
  * Truncates each document at BERT's 512-token limit, so any text beyond that point does not contribute to the embedding
  * Processes each document individually for clean embeddings
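
As a minimal illustration of the mean-pooling step described above, the sketch below averages a toy array over its token axis. The array is a made-up stand-in for BERT's last hidden state (1 document, 4 tokens, 3 hidden dimensions); the real `bert-base-uncased` model produces 768 hidden dimensions per token.

```python
import numpy as np

# Toy stand-in for BERT's last hidden state: (batch=1, tokens=4, hidden=3).
# These numbers are invented for illustration only.
last_hidden_state = np.array([[[1.0, 2.0, 3.0],
                               [3.0, 2.0, 1.0],
                               [0.0, 0.0, 0.0],
                               [4.0, 4.0, 4.0]]])

# Mean pooling: average over the token axis (axis=1), then drop the batch axis.
doc_embedding = last_hidden_state.mean(axis=1).squeeze()

print(doc_embedding.shape)  # (3,)
print(doc_embedding)        # [2. 2. 2.]
```

Every token contributes equally to the document vector, which is why the result is one fixed-length embedding regardless of how many tokens the document contains.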

---

All datasets generated and used for this study are openly available on GitHub: https://github.com/HazelvdW/context-framed-listening.

In [None]:
# Remove any previous copy so the clone starts clean (-f: no error if the folder is absent)
!rm -rf context-framed-listening
# Clone the GitHub repository
!git clone https://github.com/HazelvdW/context-framed-listening.git

Refresh files to see **"context-framed-listening"**.


---

## Setup

In [None]:
import pandas as pd
import numpy as np

import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

import matplotlib.pyplot as plt
import seaborn as sns

Initialise pretrained BERT model and tokeniser (_only need to do this once for this notebook_):

In [None]:
print("\nLoading BERT model and tokeniser...")
BERTtokeniser = BertTokenizer.from_pretrained('bert-base-uncased')
BERTmodel = BertModel.from_pretrained('bert-base-uncased')
print("BERT model loaded successfully!\n")

# Function to get BERT embeddings for text
def get_bert_embedding(text):
    """Get BERT embedding for a single text document."""
    # Tokenise and encode the text
    inputs = BERTtokeniser(text, return_tensors='pt', padding=True,
                           truncation=True, max_length=512)

    # Get BERT outputs
    with torch.no_grad():
        outputs = BERTmodel(**inputs)

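    # Use mean pooling of last hidden states as document embedding
    # last_hidden_state has shape (1, seq_len, hidden_size); averaging over the
    # token axis (dim=1) gives (1, hidden_size), which squeeze() flattens to (hidden_size,)
    embedding = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()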
    return embedding
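
Because each document concatenates many participants' descriptions, it may exceed the 512-token limit, in which case `truncation=True` discards the remainder. One possible workaround, which is not what this notebook does, is to embed consecutive windows and take a length-weighted average. The sketch below shows only the windowing and averaging logic. `split_into_windows`, `embed_window`, and `embed_long_document` are hypothetical names, and `embed_window` is a stub standing in for a real BERT forward pass.

```python
import numpy as np

def split_into_windows(token_ids, window_size=510):
    """Split a list of token ids into consecutive windows.

    510 leaves room for the [CLS] and [SEP] tokens within a 512-token input.
    """
    return [token_ids[i:i + window_size]
            for i in range(0, len(token_ids), window_size)]

def embed_window(window):
    """Hypothetical stub for a BERT forward pass; returns a fake 2-d embedding."""
    return np.array([float(len(window)), float(sum(window)) / len(window)])

def embed_long_document(token_ids, window_size=510):
    """Average the window embeddings, weighting each window by its length."""
    windows = split_into_windows(token_ids, window_size)
    embeddings = np.stack([embed_window(w) for w in windows])
    weights = np.array([len(w) for w in windows], dtype=float)
    return np.average(embeddings, axis=0, weights=weights)

token_ids = list(range(1200))  # pretend document of 1200 tokens
window_lengths = [len(w) for w in split_into_windows(token_ids)]
print(window_lengths)  # [510, 510, 180]

doc_embedding = embed_long_document(token_ids)
print(doc_embedding.shape)  # (2,)
```

Weighting by window length keeps a short final window from counting as much as a full one.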

Load in the data file "**dataMET_preprocessed.csv**" that contains the preprocessed text data of participants' thought descriptions, generated using the code notebook titled [framed_listening_text_prep](https://github.com/HazelvdW/context-framed-listening/blob/main/framed_listening_text_prep.ipynb).

In [None]:
dataMETpre = pd.read_csv("/content/context-framed-listening/NLP_outputs/dataMET_preprocessed.csv")

---
## Cosine Similarity Analyses

**Version 1 (Genre-Context):**

* Groups by genre and context (broader categorisation)
* Produces 16 document combinations (4 genres × 4 contexts)


**Version 2 (Clip-Context):**

* Groups by specific clip and context pairing
* Produces 64 document combinations (16 clips × 4 contexts)

**OUTPUTS:**

* Cosine similarity value matrices
* Heatmap of cosine similarity values
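
For reference, the cosine similarity between two embeddings is their dot product divided by the product of their norms. The sketch below computes a pairwise matrix by hand on toy vectors; `cosine_sim_matrix` is a hypothetical helper written for illustration, and the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine_sim_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / norms          # scale every row to unit length
    return X_unit @ X_unit.T    # dot products of unit vectors = cosines

# Three toy 2-d "embeddings" (invented values)
toy = np.array([[1.0, 0.0],
                [0.0, 2.0],
                [3.0, 3.0]])

sim = cosine_sim_matrix(toy)
print(np.round(sim, 3))
```

The matrix is symmetric with ones on the diagonal, which is why the later analyses only read values from the lower triangle.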


====================================
### Version 1: Genre-Context Cosine Matrix
====================================

Combine the preprocessed MET descriptions from `dataMETpre` into "METdocs".


In [None]:
# Initialise DataFrame for Version 1
METdocs_v1 = pd.DataFrame(index=range(0,1), columns=dataMETpre.columns)
rowIndex = 0

# Iterate through each unique context word and genre
for idContext in np.unique(dataMETpre['context_word']):
    for idGenre in np.unique(dataMETpre['clip_genre']):
        # Create masks to filter data
        contextMask = dataMETpre['context_word'] == idContext
        genreMask = dataMETpre['clip_genre'] == idGenre

        # Combined mask (element-wise AND of the two boolean Series)
        mask = contextMask & genreMask
        filt_ContextGenreData = dataMETpre[mask]

        # Concatenate all text descriptions
        descrSeries = filt_ContextGenreData['preprocessed_METdescr']

        # Join descriptions with marker
        joinedstring = ""
        for ival in range(0, len(descrSeries.values)):
            joinedstring = joinedstring + str(descrSeries.values[ival]) + " endofasubhere "

        # Assign values to dataframe
        METdocs_v1.loc[rowIndex, 'preprocessed_METdescr'] = joinedstring
        METdocs_v1.loc[rowIndex, 'idGenreContext'] = idContext[0:3] + "_" + idGenre[0:3]

        # Assign context code
        if idContext[0:3] == 'bar':
            METdocs_v1.loc[rowIndex, 'context_code'] = 'BAR'
        elif idContext[0:3] == 'con':
            METdocs_v1.loc[rowIndex, 'context_code'] = 'CON'
        elif idContext[0:3] == 'mov':
            METdocs_v1.loc[rowIndex, 'context_code'] = 'MOV'
        elif idContext[0:3] == 'vid':
            METdocs_v1.loc[rowIndex, 'context_code'] = 'VID'

        # Assign genre code
        if idGenre[0:3] == '80s':
            METdocs_v1.loc[rowIndex, 'genre_code'] = '80s'
        elif idGenre[0:3] == 'Jaz':
            METdocs_v1.loc[rowIndex, 'genre_code'] = 'Jaz'
        elif idGenre[0:3] == 'Met':
            METdocs_v1.loc[rowIndex, 'genre_code'] = 'Met'
        elif idGenre[0:3] == 'Ele':
            METdocs_v1.loc[rowIndex, 'genre_code'] = 'Ele'

        rowIndex = rowIndex + 1

# Filter and save Version 1
METdocs_v1 = METdocs_v1.filter(['context_code', 'genre_code', 'preprocessed_METdescr', 'idGenreContext'], axis=1)
METdocs_v1.to_csv('/content/context-framed-listening/NLP_outputs/BERT/METdocs_v1_GenreContext.csv', encoding='utf-8')

print(f"Version 1: Created {len(METdocs_v1)} documents (Genre-Context combinations)")
display(METdocs_v1.head(5))
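
The same grouping can be sketched more compactly with `groupby`. The example below uses a small made-up table with the same column names as `dataMETpre`; the context and genre values are assumptions for illustration, not the study data.

```python
import pandas as pd

# Toy stand-in for dataMETpre (invented rows, same column names)
toy = pd.DataFrame({
    'context_word': ['bar', 'bar', 'concert', 'concert', 'bar'],
    'clip_genre': ['Jazz', 'Jazz', 'Jazz', 'Metal', 'Metal'],
    'preprocessed_METdescr': ['smoky room', 'late night', 'grand hall',
                              'loud crowd', 'dark pub'],
})

marker = " endofasubhere "

# One row per context-genre pair, with all descriptions joined by the marker
docs = (
    toy.groupby(['context_word', 'clip_genre'])['preprocessed_METdescr']
       .agg(lambda s: marker.join(s.astype(str)) + marker)
       .reset_index()
)
docs['idGenreContext'] = docs['context_word'].str[:3] + "_" + docs['clip_genre'].str[:3]

print(docs[['idGenreContext', 'preprocessed_METdescr']])
```

`groupby` sorts its keys by default, so the rows come out in the same context-then-genre order as the nested `np.unique` loops.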


Compute cosine similarity for version 1:

In [None]:
# Load the METdocs data for Version 1
wordsin_v1 = METdocs_v1.copy()

# Compute BERT embeddings for each document
print("Computing BERT embeddings for Version 1 documents...")
bert_embeddings_v1 = []

for idx, row in wordsin_v1.iterrows():
    text = str(row['preprocessed_METdescr'])
    embedding = get_bert_embedding(text)
    bert_embeddings_v1.append(embedding)

    if (idx + 1) % 5 == 0:
        print(f"  Processed {idx + 1}/{len(wordsin_v1)} documents...")

bert_embeddings_v1 = np.array(bert_embeddings_v1)
print(f"Version 1: BERT embeddings shape: {bert_embeddings_v1.shape}")

# Calculate cosine similarity matrix
print("Computing cosine similarity matrix for Version 1...")
cosineMatrix_BERT_v1 = cosine_similarity(bert_embeddings_v1, bert_embeddings_v1)

# Create labeled DataFrame
cosineMatrix_BERT_v1_df = pd.DataFrame(
    cosineMatrix_BERT_v1,
    index=wordsin_v1['idGenreContext'],
    columns=wordsin_v1['idGenreContext']
)

# Save cosine similarity matrix
cosineMatrix_BERT_v1_df.to_csv('/content/context-framed-listening/NLP_outputs/BERT/cosineMatrix_BERT_v1_GenreContext.csv', encoding='utf-8')

print("\nVersion 1 BERT Cosine Similarity Matrix:")
display(cosineMatrix_BERT_v1_df)

====================================
### Version 2: Clip-Context Cosine Matrix
====================================

Combine the preprocessed MET descriptions from `dataMETpre` into "METdocs".

In [None]:
# Initialise DataFrame for Version 2
METdocs_v2 = pd.DataFrame(index=range(0,1), columns=dataMETpre.columns)
rowIndex = 0

# Iterate through each unique clip_context_PAIR
for idStimPair in np.unique(dataMETpre['clip_context_PAIR']):
    # Create mask for current stimulus pair
    stimPairMask = dataMETpre['clip_context_PAIR'] == idStimPair
    filt_ClipContextData = dataMETpre[stimPairMask]

    # Get the first row to extract clip and context info
    if len(filt_ClipContextData) > 0:
        first_row = filt_ClipContextData.iloc[0]
        idClip = first_row['clip_name']
        idContext = first_row['context_word']

        # Concatenate all text descriptions
        descrSeries = filt_ClipContextData['preprocessed_METdescr']

        joinedstring = ""
        for ival in range(0, len(descrSeries.values)):
            joinedstring = joinedstring + str(descrSeries.values[ival]) + " endofasubhere "

        # Assign values to dataframe
        METdocs_v2.loc[rowIndex, 'preprocessed_METdescr'] = joinedstring
        METdocs_v2.loc[rowIndex, 'clip_name'] = idClip
        METdocs_v2.loc[rowIndex, 'context_word'] = idContext
        METdocs_v2.loc[rowIndex, 'idClipContext'] = idStimPair
        # Context prefix + genre prefix, matching the Version 1 label format
        METdocs_v2.loc[rowIndex, 'idGenreContext'] = idContext[0:3] + "_" + idClip[0:3]

        # Assign genre code
        if idClip[0:3] == '80s':
            METdocs_v2.loc[rowIndex, 'genre_code'] = '80s'
        elif idClip[0:3] == 'Jaz':
            METdocs_v2.loc[rowIndex, 'genre_code'] = 'Jaz'
        elif idClip[0:3] == 'Met':
            METdocs_v2.loc[rowIndex, 'genre_code'] = 'Met'
        elif idClip[0:3] == 'Ele':
            METdocs_v2.loc[rowIndex, 'genre_code'] = 'Ele'

        rowIndex = rowIndex + 1

# Filter and save Version 2
METdocs_v2 = METdocs_v2.filter(['context_word', 'genre_code', 'clip_name', 'preprocessed_METdescr', 'idClipContext', 'idGenreContext'], axis=1)
METdocs_v2.to_csv('/content/context-framed-listening/NLP_outputs/BERT/METdocs_v2_ClipContext.csv', encoding='utf-8')

print(f"Version 2: Created {len(METdocs_v2)} documents (Clip-Context combinations)")
display(METdocs_v2.head(5))

Compute cosine similarity for Version 2:

In [None]:
# Load the METdocs data for Version 2
wordsin_v2 = METdocs_v2.copy()

# Compute BERT embeddings for each document
print("Computing BERT embeddings for Version 2 documents...")
bert_embeddings_v2 = []

for idx, row in wordsin_v2.iterrows():
    text = str(row['preprocessed_METdescr'])
    embedding = get_bert_embedding(text)
    bert_embeddings_v2.append(embedding)

    if (idx + 1) % 10 == 0:
        print(f"  Processed {idx + 1}/{len(wordsin_v2)} documents...")

bert_embeddings_v2 = np.array(bert_embeddings_v2)
print(f"Version 2: BERT embeddings shape: {bert_embeddings_v2.shape}")

# Calculate cosine similarity matrix
print("Computing cosine similarity matrix for Version 2...")
cosineMatrix_BERT_v2 = cosine_similarity(bert_embeddings_v2, bert_embeddings_v2)

# Create labeled DataFrame
cosineMatrix_BERT_v2_df = pd.DataFrame(
    cosineMatrix_BERT_v2,
    index=wordsin_v2['idClipContext'],
    columns=wordsin_v2['idClipContext']
)

# Save cosine similarity matrix
cosineMatrix_BERT_v2_df.to_csv('/content/context-framed-listening/NLP_outputs/BERT/cosineMatrix_BERT_v2_ClipContext.csv', encoding='utf-8')

print("\nVersion 2 BERT Cosine Similarity Matrix:")
display(cosineMatrix_BERT_v2_df)

====================================
#### VISUALISATIONS: Cosine Matrix
====================================

In [None]:
# Visualisation for Version 1
plt.figure(figsize=(12, 10))
sns.heatmap(cosineMatrix_BERT_v1_df, cmap='viridis', annot=False)
plt.title('Version 1: BERT Cosine Similarity Matrix (Genre-Context)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('/content/context-framed-listening/NLP_outputs/BERT/heatmap_BERT_v1_GenreContext.png', dpi=300, bbox_inches='tight')
plt.show()

# Visualisation for Version 2
plt.figure(figsize=(14, 12))
sns.heatmap(cosineMatrix_BERT_v2_df, cmap='viridis', annot=False)
plt.title('Version 2: BERT Cosine Similarity Matrix (Clip-Context)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('/content/context-framed-listening/NLP_outputs/BERT/heatmap_BERT_v2_ClipContext.png', dpi=300, bbox_inches='tight')
plt.show()

### SUMMARY

In [None]:
print("\nOutput files created:")
print("  - cosineMatrix_BERT_v1_GenreContext.csv")
print("  - heatmap_BERT_v1_GenreContext.png")
print("  - cosineMatrix_BERT_v2_ClipContext.csv")
print("  - heatmap_BERT_v2_ClipContext.png")

---
## Semantic Similarity Analyses

**OUTPUTS:**

All Representational Dissimilarity Matrix (RDM) masks and similarity measures are saved separately for each version, ready for statistical analysis in R.

=== *ANALYSIS STRUCTURE* ===


**Version 1 (Genre-Context) - 5 conditions:**

* Same context, different genre
* Different context, same genre
* Different context, different genre
* Between (dif.) contexts
* Between (dif.) genres


**Version 2 (Clip-Context) - 7 conditions:**

* Same context, different clip
* Different context, same clip
* Different context, different clip
* Between (dif.) contexts
* Between (dif.) clips
* Within (same) genre
* Between (dif.) genres
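
Each condition above corresponds to a boolean mask over the strict lower triangle of the similarity matrix, so every document pair is counted once. The following is a minimal vectorised sketch for the Version 1 design, assuming a fully balanced 4 contexts × 4 genres layout; the label values are placeholders.

```python
import numpy as np

contexts = ['BAR', 'CON', 'MOV', 'VID']
genres = ['80s', 'Ele', 'Jaz', 'Met']

# One (context, genre) label pair per document: 4 x 4 = 16 documents
ctx = np.array([c for c in contexts for g in genres])
gen = np.array([g for c in contexts for g in genres])

# Pairwise comparisons via broadcasting: entry [i, j] compares document i with j
same_context = ctx[:, None] == ctx[None, :]
same_genre = gen[:, None] == gen[None, :]

# Strict lower triangle: each unordered pair once, diagonal excluded
lower = np.tril(np.ones((len(ctx), len(ctx)), dtype=bool), k=-1)

masks = {
    'sContext_dGenre': lower & same_context & ~same_genre,
    'dContext_sGenre': lower & ~same_context & same_genre,
    'dContext_dGenre': lower & ~same_context & ~same_genre,
    'bwContext': lower & ~same_context,
    'bwGenre': lower & ~same_genre,
}

counts = {name: int(mask.sum()) for name, mask in masks.items()}
print(counts)
# {'sContext_dGenre': 24, 'dContext_sGenre': 24, 'dContext_dGenre': 72, 'bwContext': 96, 'bwGenre': 96}
```

The three combinatorial conditions partition all 120 document pairs (24 + 24 + 72), while the between-context and between-genre masks overlap with them by design.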



====================================
### Version 1: Genre-Context Semantic Similarity
====================================

Set up label columns, NumPy arrays, and stimuli condition masks:

In [None]:
inData_bert_v1 = METdocs_v1
simData_bert_v1 = cosineMatrix_BERT_v1_df

# Extract label columns
labelsCG_bert_v1 = inData_bert_v1['idGenreContext']
labelsGenre_bert_v1 = inData_bert_v1['genre_code']
labelsContext_bert_v1 = inData_bert_v1['context_code']

# Initialise arrays - 5 conditions
sContext_dGenre_bert_v1 = np.zeros(shape=(len(labelsCG_bert_v1), len(labelsCG_bert_v1)))
dContext_sGenre_bert_v1 = np.zeros(shape=(len(labelsCG_bert_v1), len(labelsCG_bert_v1)))
dContext_dGenre_bert_v1 = np.zeros(shape=(len(labelsCG_bert_v1), len(labelsCG_bert_v1)))
bwContext_bert_v1 = np.zeros(shape=(len(labelsCG_bert_v1), len(labelsCG_bert_v1)))
bwGenre_bert_v1 = np.zeros(shape=(len(labelsCG_bert_v1), len(labelsCG_bert_v1)))

# Build condition masks
for irow in range(0, len(labelsCG_bert_v1.values)):
    for icol in range(0, irow):
        same_context = labelsContext_bert_v1.values[irow] == labelsContext_bert_v1.values[icol]
        same_genre = labelsGenre_bert_v1.values[irow] == labelsGenre_bert_v1.values[icol]

        # Stimuli combinatorial conditions
        if same_context and not same_genre:
            sContext_dGenre_bert_v1[irow, icol] = 1
        elif not same_context and same_genre:
            dContext_sGenre_bert_v1[irow, icol] = 1
        elif not same_context and not same_genre:
            dContext_dGenre_bert_v1[irow, icol] = 1

        # Between context
        if not same_context:
            bwContext_bert_v1[irow, icol] = 1

        # Between genre
        if not same_genre:
            bwGenre_bert_v1[irow, icol] = 1

Extract similarity values for each condition:

In [None]:
simMeasures_bert_v1 = {'type': [], 'sim': []}

conditions_bert_v1 = {
    'sContext_dGenre': sContext_dGenre_bert_v1,
    'dContext_sGenre': dContext_sGenre_bert_v1,
    'dContext_dGenre': dContext_dGenre_bert_v1,
    'bwContext': bwContext_bert_v1,
    'bwGenre': bwGenre_bert_v1
}

for condition_name, condition_mask in conditions_bert_v1.items():
    simVals = simData_bert_v1.values[condition_mask == 1]
    for val in simVals:
        simMeasures_bert_v1['type'].append(condition_name)
        simMeasures_bert_v1['sim'].append(val)

# Create DataFrame
simMeasuresDF_bert_v1 = pd.DataFrame(data=simMeasures_bert_v1)
simMeasuresDF_bert_v1 = simMeasuresDF_bert_v1.replace([np.inf, -np.inf], np.nan)

print("\nVersion 1 BERT Similarity Measures extracted (5 conditions):")
print(simMeasuresDF_bert_v1.groupby('type').agg({'sim': ['count', 'mean', 'std']}))

# Save outputs
simMeasuresDF_bert_v1.to_csv('/content/context-framed-listening/NLP_outputs/BERT/simMeasuresDF_BERT_v1_GenreContext.csv', encoding='utf-8', index=False)
pd.DataFrame(sContext_dGenre_bert_v1).to_csv('/content/context-framed-listening/NLP_outputs/BERT/RDM_sContext_dGenre_BERT_v1.csv', encoding='utf-8')
pd.DataFrame(dContext_sGenre_bert_v1).to_csv('/content/context-framed-listening/NLP_outputs/BERT/RDM_dContext_sGenre_BERT_v1.csv', encoding='utf-8')
pd.DataFrame(dContext_dGenre_bert_v1).to_csv('/content/context-framed-listening/NLP_outputs/BERT/RDM_dContext_dGenre_BERT_v1.csv', encoding='utf-8')
pd.DataFrame(bwContext_bert_v1).to_csv('/content/context-framed-listening/NLP_outputs/BERT/RDM_bwContext_BERT_v1.csv', encoding='utf-8')
pd.DataFrame(bwGenre_bert_v1).to_csv('/content/context-framed-listening/NLP_outputs/BERT/RDM_bwGenre_BERT_v1.csv', encoding='utf-8')
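
The extraction step relies on NumPy boolean indexing: comparing a 0/1 mask with 1 gives a boolean array, and indexing the similarity matrix with it returns the selected values in row-major order. A toy sketch with invented numbers:

```python
import numpy as np

# Toy 3 x 3 similarity matrix (invented values)
sim = np.array([[1.0, 0.2, 0.3],
                [0.2, 1.0, 0.6],
                [0.3, 0.6, 1.0]])

# 0/1 condition mask marking two lower-triangle pairs
mask = np.zeros((3, 3))
mask[1, 0] = 1
mask[2, 1] = 1

# Boolean indexing flattens the selected cells into a 1-d array
vals = sim[mask == 1]
print(vals)  # [0.2 0.6]
```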


====================================
### Version 2: Clip-Context Semantic Similarity
====================================

Set up label columns, NumPy arrays, and stimuli condition masks:

In [None]:
inData_bert_v2 = METdocs_v2
simData_bert_v2 = cosineMatrix_BERT_v2_df

# Extract label columns
labelsCG_bert_v2 = inData_bert_v2['idClipContext']
labelsClip_bert_v2 = inData_bert_v2['clip_name']
labelsGenre_bert_v2 = inData_bert_v2['genre_code']
labelsContext_bert_v2 = inData_bert_v2['context_word']

# Initialise arrays
sContext_dClip_bert_v2 = np.zeros(shape=(len(labelsCG_bert_v2), len(labelsCG_bert_v2)))
dContext_sClip_bert_v2 = np.zeros(shape=(len(labelsCG_bert_v2), len(labelsCG_bert_v2)))
dContext_dClip_bert_v2 = np.zeros(shape=(len(labelsCG_bert_v2), len(labelsCG_bert_v2)))
bwContext_bert_v2 = np.zeros(shape=(len(labelsCG_bert_v2), len(labelsCG_bert_v2)))
bwClip_bert_v2 = np.zeros(shape=(len(labelsCG_bert_v2), len(labelsCG_bert_v2)))
bwGenre_bert_v2 = np.zeros(shape=(len(labelsCG_bert_v2), len(labelsCG_bert_v2)))
wiGenre_bert_v2 = np.zeros(shape=(len(labelsCG_bert_v2), len(labelsCG_bert_v2)))

# Build condition masks
for irow in range(0, len(labelsCG_bert_v2.values)):
    for icol in range(0, irow):
        same_context = labelsContext_bert_v2.values[irow] == labelsContext_bert_v2.values[icol]
        same_clip = labelsClip_bert_v2.values[irow] == labelsClip_bert_v2.values[icol]
        same_genre = labelsGenre_bert_v2.values[irow] == labelsGenre_bert_v2.values[icol]

        # Stimuli combinatorial conditions
        if same_context and not same_clip:
            sContext_dClip_bert_v2[irow, icol] = 1
        elif not same_context and same_clip:
            dContext_sClip_bert_v2[irow, icol] = 1
        elif not same_context and not same_clip:
            dContext_dClip_bert_v2[irow, icol] = 1

        # Between (different) contexts
        if not same_context:
            bwContext_bert_v2[irow, icol] = 1

        # Between (different) clips
        if not same_clip:
            bwClip_bert_v2[irow, icol] = 1

        # Within/Between genre
        if same_genre:
            wiGenre_bert_v2[irow, icol] = 1
        else:
            bwGenre_bert_v2[irow, icol] = 1
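
As a sanity check on the masks, the expected number of document pairs per condition can be worked out combinatorially. This is a sketch under two assumptions: the design is fully balanced (16 clips, 4 per genre, each paired with all 4 contexts), and the conditions follow the definitions listed in the analysis structure.

```python
from math import comb

n_clips, n_contexts, n_genres = 16, 4, 4
clips_per_genre = n_clips // n_genres        # assumes 4 clips per genre
n_docs = n_clips * n_contexts                # 64 documents
total_pairs = comb(n_docs, 2)                # 2016 unordered pairs

expected = {
    # same context: choose 2 of the 16 clips, for each of the 4 contexts
    'sContext_dClip': n_contexts * comb(n_clips, 2),
    # same clip: choose 2 of the 4 contexts, for each of the 16 clips
    'dContext_sClip': n_clips * comb(n_contexts, 2),
}
expected['dContext_dClip'] = (total_pairs - expected['sContext_dClip']
                              - expected['dContext_sClip'])
expected['bwContext'] = total_pairs - expected['sContext_dClip']
expected['bwClip'] = total_pairs - expected['dContext_sClip']
# each genre contributes 4 clips x 4 contexts = 16 documents
expected['wiGenre'] = n_genres * comb(clips_per_genre * n_contexts, 2)
expected['bwGenre'] = total_pairs - expected['wiGenre']

for name, n in expected.items():
    print(f"{name}: {n}")
```

Comparing these numbers with `int(mask.sum())` for each mask array is a quick way to confirm that every condition selects the intended pairs.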

Extract similarity values for each condition:

In [None]:
simMeasures_bert_v2 = {'type': [], 'sim': []}

conditions_bert_v2 = {
    'sContext_dClip': sContext_dClip_bert_v2,
    'dContext_sClip': dContext_sClip_bert_v2,
    'dContext_dClip': dContext_dClip_bert_v2,
    'bwContext': bwContext_bert_v2,
    'bwClip': bwClip_bert_v2,
    'bwGenre': bwGenre_bert_v2,
    'wiGenre': wiGenre_bert_v2
}

for condition_name, condition_mask in conditions_bert_v2.items():
    simVals = simData_bert_v2.values[condition_mask == 1]
    for val in simVals:
        simMeasures_bert_v2['type'].append(condition_name)
        simMeasures_bert_v2['sim'].append(val)

# Create DataFrame
simMeasuresDF_bert_v2 = pd.DataFrame(data=simMeasures_bert_v2)
simMeasuresDF_bert_v2 = simMeasuresDF_bert_v2.replace([np.inf, -np.inf], np.nan)

print("\nVersion 2 BERT Similarity Measures extracted (7 conditions):")
print(simMeasuresDF_bert_v2.groupby('type').agg({'sim': ['count', 'mean', 'std']}))

# Save outputs
simMeasuresDF_bert_v2.to_csv('/content/context-framed-listening/NLP_outputs/BERT/simMeasuresDF_BERT_v2_ClipContext.csv', encoding='utf-8', index=False)
pd.DataFrame(sContext_dClip_bert_v2).to_csv('/content/context-framed-listening/NLP_outputs/BERT/RDM_sContext_dClip_BERT_v2.csv', encoding='utf-8')
pd.DataFrame(dContext_sClip_bert_v2).to_csv('/content/context-framed-listening/NLP_outputs/BERT/RDM_dContext_sClip_BERT_v2.csv', encoding='utf-8')
pd.DataFrame(dContext_dClip_bert_v2).to_csv('/content/context-framed-listening/NLP_outputs/BERT/RDM_dContext_dClip_BERT_v2.csv', encoding='utf-8')
pd.DataFrame(bwContext_bert_v2).to_csv('/content/context-framed-listening/NLP_outputs/BERT/RDM_bwContext_BERT_v2.csv', encoding='utf-8')
pd.DataFrame(bwClip_bert_v2).to_csv('/content/context-framed-listening/NLP_outputs/BERT/RDM_bwClip_BERT_v2.csv', encoding='utf-8')
pd.DataFrame(bwGenre_bert_v2).to_csv('/content/context-framed-listening/NLP_outputs/BERT/RDM_bwGenre_BERT_v2.csv', encoding='utf-8')
pd.DataFrame(wiGenre_bert_v2).to_csv('/content/context-framed-listening/NLP_outputs/BERT/RDM_wiGenre_BERT_v2.csv', encoding='utf-8')


====================================
#### VISUALISATIONS for BERT analyses
====================================

* Box plots
* Violin plots
* Bar plots

In [None]:
# Box plots comparing conditions - Version 1
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
simMeasuresDF_bert_v1.boxplot(column='sim', by='type', ax=ax, rot=45)
ax.set_title('Version 1: BERT Similarity Distributions (Genre-Context)', fontsize=14, fontweight='bold')
ax.set_xlabel('Condition Type', fontsize=12)
ax.set_ylabel('BERT Cosine Similarity', fontsize=12)
plt.suptitle('')
plt.tight_layout()
plt.savefig('/content/context-framed-listening/NLP_outputs/BERT/similarity_boxplot_BERT_v1_GenreContext.png', dpi=300, bbox_inches='tight')
plt.show()

# Box plots comparing conditions - Version 2
fig, ax = plt.subplots(1, 1, figsize=(14, 6))
simMeasuresDF_bert_v2.boxplot(column='sim', by='type', ax=ax, rot=45)
ax.set_title('Version 2: BERT Similarity Distributions (Clip-Context)', fontsize=14, fontweight='bold')
ax.set_xlabel('Condition Type', fontsize=12)
ax.set_ylabel('BERT Cosine Similarity', fontsize=12)
plt.suptitle('')
plt.tight_layout()
plt.savefig('/content/context-framed-listening/NLP_outputs/BERT/similarity_boxplot_BERT_v2_ClipContext.png', dpi=300, bbox_inches='tight')
plt.show()

# Violin plot comparing conditions - Version 1
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
sns.violinplot(data=simMeasuresDF_bert_v1, x='type', y='sim', hue='type', ax=ax, palette='Set2', legend=False)
ax.set_title('Version 1: Similarity Distributions by Condition (Genre-Context)', fontsize=14, fontweight='bold')
ax.set_xlabel('Condition Type', fontsize=12)
ax.set_ylabel('Cosine Similarity', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig('/content/context-framed-listening/NLP_outputs/BERT/similarity_violin_BERT_v1_GenreContext.png', dpi=300, bbox_inches='tight')
plt.show()

# Violin plot comparing conditions - Version 2
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
sns.violinplot(data=simMeasuresDF_bert_v2, x='type', y='sim', hue='type', ax=ax, palette='Set2', legend=False)
ax.set_title('Version 2: Similarity Distributions by Condition (Clip-Context)', fontsize=14, fontweight='bold')
ax.set_xlabel('Condition Type', fontsize=12)
ax.set_ylabel('Cosine Similarity', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig('/content/context-framed-listening/NLP_outputs/BERT/similarity_violin_BERT_v2_GenreContext.png', dpi=300, bbox_inches='tight')
plt.show()

# Bar plots with means - Version 1
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
means_bert_v1 = simMeasuresDF_bert_v1.groupby('type')['sim'].mean().sort_values(ascending=False)
stds_bert_v1 = simMeasuresDF_bert_v1.groupby('type')['sim'].std()
means_bert_v1.plot(kind='bar', ax=ax, yerr=stds_bert_v1, capsize=4, color='steelblue', edgecolor='black', alpha=0.8)
ax.set_title('Version 1: Mean BERT Similarity by Condition (Genre-Context)', fontsize=14, fontweight='bold')
ax.set_xlabel('Condition Type', fontsize=12)
ax.set_ylabel('Mean BERT Cosine Similarity', fontsize=12)
ax.set_xticklabels(means_bert_v1.index, rotation=45, ha='right')
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.savefig('/content/context-framed-listening/NLP_outputs/BERT/similarity_means_BERT_v1_GenreContext.png', dpi=300, bbox_inches='tight')
plt.show()

# Bar plots with means - Version 2
fig, ax = plt.subplots(1, 1, figsize=(14, 6))
means_bert_v2 = simMeasuresDF_bert_v2.groupby('type')['sim'].mean().sort_values(ascending=False)
stds_bert_v2 = simMeasuresDF_bert_v2.groupby('type')['sim'].std()
means_bert_v2.plot(kind='bar', ax=ax, yerr=stds_bert_v2, capsize=4, color='coral', edgecolor='black', alpha=0.8)
ax.set_title('Version 2: Mean BERT Similarity by Condition (Clip-Context)', fontsize=14, fontweight='bold')
ax.set_xlabel('Condition Type', fontsize=12)
ax.set_ylabel('Mean BERT Cosine Similarity', fontsize=12)
ax.set_xticklabels(means_bert_v2.index, rotation=45, ha='right')
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.savefig('/content/context-framed-listening/NLP_outputs/BERT/similarity_means_BERT_v2_ClipContext.png', dpi=300, bbox_inches='tight')
plt.show()

### SUMMARY

In [None]:
print("\nVersion 1 (Genre-Context) BERT Output Files:")
print("  Data Files:")
print("    - cosineMatrix_BERT_v1_GenreContext.csv")
print("    - simMeasuresDF_BERT_v1_GenreContext.csv")
print("    - RDM masks (5 CSV files)")
print("  Visualisations:")
print("    - heatmap_BERT_v1_GenreContext.png")
print("    - similarity_boxplot_BERT_v1_GenreContext.png")
print("    - similarity_violin_BERT_v1_GenreContext.png")
print("    - similarity_means_BERT_v1_GenreContext.png")
print("\nVersion 2 (Clip-Context) BERT Output Files:")
print("  Data Files:")
print("    - cosineMatrix_BERT_v2_ClipContext.csv")
print("    - simMeasuresDF_BERT_v2_ClipContext.csv")
print("    - RDM masks (7 CSV files)")
print("  Visualisations:")
print("    - heatmap_BERT_v2_ClipContext.png")
print("    - similarity_boxplot_BERT_v2_ClipContext.png")
print("    - similarity_violin_BERT_v2_GenreContext.png")
print("    - similarity_means_BERT_v2_ClipContext.png")