### 1. Bag of Word Embeddings

In [1]:
import pickle
import pandas as pd
import os
import numpy as np
import sys
from sklearn.feature_extraction.text import CountVectorizer
from tqdm import tqdm
import gensim.downloader as api
from preprocessing import downsample_word_vectors, make_delayed

In [2]:
data_path = "../data"
print(os.listdir(data_path))

['raw_text.pkl', 'subject2', 'subject3', 'X_lagged_W2V.pkl', 'X_lagged_BoW.joblib', 'X_lagged_GloVe.pkl']


In [3]:
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
sys.path.append(project_root)

### 1.1
Provided the list of stories, generate embedding vectors via bag-of-words. Notice that the dimensions
of the resulting matrix do not match up with the response matrix. We will have to
down-sample it (see [1] for details) to match dimensions. Explain what is being done here.

In this step, we convert raw story transcripts into numerical feature vectors using a Bag-of-Words (BoW) embedding. Each story in raw_text.pkl is a DataSequence object that stores the transcript words (data), the time of each word (data_times), and the fMRI sampling times (tr_times), which are spaced 2 seconds apart.

We fit a single CountVectorizer on the words of all stories and then transform each word on its own, so every story becomes a matrix with one row per word and one column per vocabulary entry (12,551 columns here). Each row is essentially a one-hot vector: it has a 1 in the column of that word and 0 elsewhere. Tokens that the default CountVectorizer tokenizer drops, such as the empty string or single-character words like "i" and "a", become all-zero rows. Fitting on a shared vocabulary ensures that the columns mean the same thing in every story.

However, these matrices do not yet match the fMRI response matrices (Y): they have one row per word (697 for sweetaspie), whereas the fMRI data are sampled once per TR (172 sampling times for sweetaspie). The next step addresses this mismatch by downsampling the word-level features to the TR times and trimming the start and end of each story, so that stimulus features (X) and brain responses (Y) are aligned.
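
To make the word-level encoding concrete, here is a minimal sketch on a toy word list. The words are taken from the start of sweetaspie, and the variable names are illustrative rather than part of the notebook. It assumes the default CountVectorizer tokenizer, which keeps only tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy word sequence; the empty string mimics the blank first token seen in raw_text
toy_words = ["", "i", "embarked", "on", "a", "journey"]

toy_vectorizer = CountVectorizer(lowercase=True)
toy_vectorizer.fit(toy_words)

# One row per word, one column per vocabulary entry
toy_bow = toy_vectorizer.transform(toy_words).toarray()

print(sorted(toy_vectorizer.vocabulary_))  # vocabulary learned from the toy words
print(toy_bow.shape)                       # (number of words, vocabulary size)
print(toy_bow.sum(axis=1))                 # 1 for kept words, 0 for dropped tokens
```

Rows that sum to 0 correspond to tokens the vectorizer ignores, so those words contribute nothing to the downsampled features.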

In [4]:
raw_text_path = os.path.join(data_path, "raw_text.pkl")
with open(raw_text_path, 'rb') as f:
    raw_text = pickle.load(f)

# Inspect the type and structure of raw_text
print("Type of raw_text:", type(raw_text))
print("Sample of raw_text:", list(raw_text.items())[:2] if isinstance(raw_text, dict) else raw_text[:2])

Type of raw_text: <class 'dict'>
Sample of raw_text: [('sweetaspie', <ridge_utils.DataSequence.DataSequence object at 0x10c8b2c20>), ('thatthingonmyarm', <ridge_utils.DataSequence.DataSequence object at 0x14eda0e50>)]


In [5]:
def generate_word_level_bow_features(raw_text):
    """
    Generates word-level Bag-of-Words vectors for each story.
    Each row corresponds to a single word.

    Returns:
        word_vectors: dict of story_id -> [num_words, vocab_size]
        vectorizer: the fitted CountVectorizer
    """
    # Step 1: Collect all words across all stories (for fitting vectorizer)
    all_words = []
    for story_id in raw_text:
        words = raw_text[story_id].data
        all_words.extend(words)

    # Fit vectorizer on all words (note: single words, not chunks!)
    vectorizer = CountVectorizer(lowercase=True)
    vectorizer.fit(all_words)

    # Step 2: Transform word sequences
    word_vectors = {}
    for story_id, seq in raw_text.items():
        words = seq.data  # List of words
        bow_matrix = vectorizer.transform(words).toarray()  # One row per word
        word_vectors[story_id] = bow_matrix
        print(f"{story_id}: word-level shape = {bow_matrix.shape}")

    return word_vectors, vectorizer


In [6]:
word_vectors, vectorizer = generate_word_level_bow_features(raw_text)

sweetaspie: word-level shape = (697, 12551)
thatthingonmyarm: word-level shape = (2073, 12551)
tildeath: word-level shape = (2297, 12551)
indianapolis: word-level shape = (1554, 12551)
lawsthatchokecreativity: word-level shape = (2084, 12551)
golfclubbing: word-level shape = (1211, 12551)
jugglingandjesus: word-level shape = (887, 12551)
shoppinginchina: word-level shape = (1731, 12551)
cocoonoflove: word-level shape = (1984, 12551)
hangtime: word-level shape = (1927, 12551)
beneaththemushroomcloud: word-level shape = (1916, 12551)
dialogue4: word-level shape = (1692, 12551)
thepostmanalwayscalls: word-level shape = (2220, 12551)
stumblinginthedark: word-level shape = (2681, 12551)
kiksuya: word-level shape = (1699, 12551)
haveyoumethimyet: word-level shape = (2985, 12551)
theinterview: word-level shape = (1079, 12551)
againstthewind: word-level shape = (838, 12551)
tetris: word-level shape = (1350, 12551)
canplanetearthfeedtenbillionpeoplepart2: word-level shape = (2532, 12551)
altern

In [7]:
selected_stories = ["sweetaspie", "thatthingonmyarm", "tildeath", "indianapolis"]

rows = []
for story_id in selected_stories:
    ds = raw_text[story_id]
    
    row = {
        "Story ID": story_id,
        "# Words": len(ds.data),
        "# TRs": len(ds.tr_times),
        "First 5 Words": ds.data[:5],
        "First 5 data_times": np.round(ds.data_times[:5], 3).tolist(),
        "First 5 tr_times": np.round(ds.tr_times[:5], 3).tolist(),
    }
    rows.append(row)

story_df = pd.DataFrame(rows)
print(story_df)

           Story ID  # Words  # TRs                First 5 Words  \
0        sweetaspie      697    172       [, i, embarked, on, a]   
1  thatthingonmyarm     2073    449   [, people, often, ask, me]   
2          tildeath     2297    338    [um, it, was, five, days]   
3      indianapolis     1554    317  [, let's, begin, with, the]   

                    First 5 data_times                First 5 tr_times  
0     [0.006, 1.16, 2.028, 2.886, 3.2]  [-9.0, -7.0, -5.0, -3.0, -1.0]  
1  [0.006, 0.307, 0.731, 1.165, 1.734]  [-9.0, -7.0, -5.0, -3.0, -1.0]  
2  [3.794, 4.143, 4.383, 4.824, 5.136]  [-9.0, -7.0, -5.0, -3.0, -1.0]  
3  [0.006, 1.898, 2.207, 2.467, 2.602]  [-9.0, -7.0, -5.0, -3.0, -1.0]  


### 1.2
Call downsample_word_vectors from the file code/preprocessing.py to get the dimensions to match. Further, trim the first 5 seconds and last 10 seconds of the output.

In [8]:
def downsample_features(word_level_features, raw_text):
    """
    Perform Lanczos downsampling of word-level features to TR resolution.

    Args:
        word_level_features (dict): story_id -> [num_words, feature_dim]
        raw_text (dict): story_id -> DataSequence object with data_times & tr_times

    Returns:
        dict: story_id -> TR-aligned feature matrix [num_TRs, feature_dim]
    """
    story_ids = list(word_level_features.keys())

    print("Performing downsampling (Lanczos interpolation)...")
    downsampled = downsample_word_vectors(
        stories=story_ids,
        word_vectors=word_level_features,
        wordseqs=raw_text
    )

    for sid, x in downsampled.items():
        print(f"{sid}: shape after downsampling = {x.shape}")

    return downsampled
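
The internals of downsample_word_vectors live in preprocessing.py and are not shown here. As a rough illustration of what Lanczos downsampling does, the sketch below computes a Lanczos-windowed sinc weight for every (TR, word) pair and takes a weighted sum of word vectors for each TR. The function names, the window size of 3, and the cutoff choice are assumptions made for illustration, not the code in preprocessing.py.

```python
import numpy as np

def lanczos_kernel(offsets, cutoff, window=3):
    """Lanczos-windowed sinc evaluated at time offsets (in seconds)."""
    x = np.asarray(offsets, dtype=float) * cutoff
    # np.sinc(x) is sin(pi*x) / (pi*x), with the value 1 at x == 0
    weights = np.sinc(x) * np.sinc(x / window)
    # The kernel is defined to be zero outside the window
    return np.where(np.abs(x) <= window, weights, 0.0)

def lanczos_downsample(features, word_times, tr_times, window=3):
    """Resample [num_words, dim] features onto tr_times -> [num_TRs, dim]."""
    word_times = np.asarray(word_times, dtype=float)
    tr_times = np.asarray(tr_times, dtype=float)
    # Assumed cutoff: one cycle per TR spacing
    cutoff = 1.0 / np.mean(np.diff(tr_times))
    offsets = tr_times[:, None] - word_times[None, :]  # [num_TRs, num_words]
    weights = lanczos_kernel(offsets, cutoff, window)
    return weights @ features

# Toy data: 6 words with irregular times, 3 features, 3 TRs spaced 2 s apart
toy_times = np.array([0.2, 0.9, 1.7, 2.4, 3.8, 5.1])
toy_features = np.eye(6)[:, :3]
toy_trs = np.array([1.0, 3.0, 5.0])

toy_ds = lanczos_downsample(toy_features, toy_times, toy_trs)
print(toy_ds.shape)  # one row per TR, one column per feature
```

Each output row mixes the words that fall near that TR, with weights that decay with temporal distance and can be slightly negative, which is why downsampled BoW rows are no longer integer counts.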

In [9]:
def trim_features(downsampled_features, TR=1, skip_seconds=(5, 10)):
    """
    Trim the first and last few TRs from each downsampled story.

    Note: with the default TR=1, skip_seconds is effectively a row count,
    so 5 and 10 rows are dropped (172 -> 157 for sweetaspie), even though
    tr_times are spaced 2 s apart.

    Args:
        downsampled_features (dict): story_id -> [num_TRs, feature_dim]
        TR (float): duration of each TR in seconds
        skip_seconds (tuple): (start_trim_sec, end_trim_sec)

    Returns:
        dict: story_id -> trimmed matrix
    """
    trim_start = int(skip_seconds[0] / TR)
    trim_end = int(skip_seconds[1] / TR)

    print(f"\n Trimming first {skip_seconds[0]}s and last {skip_seconds[1]}s...")

    trimmed = {}
    for story_id in tqdm(downsampled_features, desc="Trimming"):
        X = downsampled_features[story_id]
        if X.shape[0] > (trim_start + trim_end):
            # Use an explicit end index: X[a:-0] would be empty when trim_end == 0
            X_trimmed = X[trim_start:X.shape[0] - trim_end]
            trimmed[story_id] = X_trimmed
            print(f"{story_id}: trimmed shape = {X_trimmed.shape}")
        else:
            print(f"Skipped {story_id}: too short after trimming ({X.shape[0]} rows)")

    return trimmed
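
As a quick sanity check of the trimming arithmetic, the sketch below applies the same kind of slice to a dummy array with 172 rows (the number of TRs in sweetaspie). It also shows, as a hypothetical alternative, how many rows a literal 5 s / 10 s trim would drop if TR were set to the 2-second spacing of tr_times. The helper name trim_rows is made up for this example; the notebook itself uses the default TR=1.

```python
import numpy as np

dummy = np.zeros((172, 3))  # stand-in for one downsampled story

def trim_rows(X, start_rows, end_rows):
    """Drop start_rows from the top and end_rows from the bottom."""
    return X[start_rows:X.shape[0] - end_rows]

# Default used in the notebook: TR=1, so 5 and 10 act as row counts
rows_default = trim_rows(dummy, int(5 / 1), int(10 / 1))
print(rows_default.shape)  # (157, 3)

# Hypothetical alternative: interpret 5 s and 10 s with a 2-second TR
rows_seconds = trim_rows(dummy, int(5 / 2), int(10 / 2))
print(rows_seconds.shape)  # (165, 3)
```

Whichever convention is used, the trimmed row count has to equal the number of rows in the response matrix for the same story.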

In [10]:
# Step 1: Downsample to TR resolution
story_to_X_ds = downsample_features(word_vectors, raw_text)

Performing downsampling (Lanczos interpolation)...
sweetaspie: shape after downsampling = (172, 12551)
thatthingonmyarm: shape after downsampling = (449, 12551)
tildeath: shape after downsampling = (338, 12551)
indianapolis: shape after downsampling = (317, 12551)
lawsthatchokecreativity: shape after downsampling = (449, 12551)
golfclubbing: shape after downsampling = (216, 12551)
jugglingandjesus: shape after downsampling = (208, 12551)
shoppinginchina: shape after downsampling = (352, 12551)
cocoonoflove: shape after downsampling = (444, 12551)
hangtime: shape after downsampling = (339, 12551)
beneaththemushroomcloud: shape after downsampling = (357, 12551)
dialogue4: shape after downsampling = (322, 12551)
thepostmanalwayscalls: shape after downsampling = (469, 12551)
stumblinginthedark: shape after downsampling = (504, 12551)
kiksuya: shape after downsampling = (347, 12551)
haveyoumethimyet: shape after downsampling = (511, 12551)
theinterview: shape after downsampling = (236, 1255

In [11]:
print(len(raw_text["sweetaspie"].tr_times))  # expect 172, matching the downsampled row count for sweetaspie

172


In [12]:
from sklearn.feature_extraction.text import CountVectorizer

all_words = []
for story_id in raw_text:
    words = raw_text[story_id].data
    all_words.extend(words)

# Fit vectorizer — match exactly as in your BoW
vectorizer = CountVectorizer(lowercase=True)
vectorizer.fit(all_words)

print("Reconstructed Vocabulary size:", len(vectorizer.vocabulary_))  #should be 12551
print("Sample vocab:", list(vectorizer.vocabulary_.keys())[:10])

Reconstructed Vocabulary size: 12551
Sample vocab: ['embarked', 'on', 'journey', 'toward', 'the', 'sea', 'of', 'matrimony', 'at', 'perilous']


In [13]:
# Step 2: Trim first 5s and last 10s
story_to_X_trimmed = trim_features(story_to_X_ds)


 Trimming first 5s and last 10s...


Trimming: 100%|██████████| 109/109 [00:00<00:00, 127775.05it/s]

sweetaspie: trimmed shape = (157, 12551)
thatthingonmyarm: trimmed shape = (434, 12551)
tildeath: trimmed shape = (323, 12551)
indianapolis: trimmed shape = (302, 12551)
lawsthatchokecreativity: trimmed shape = (434, 12551)
golfclubbing: trimmed shape = (201, 12551)
jugglingandjesus: trimmed shape = (193, 12551)
shoppinginchina: trimmed shape = (337, 12551)
cocoonoflove: trimmed shape = (429, 12551)
hangtime: trimmed shape = (324, 12551)
beneaththemushroomcloud: trimmed shape = (342, 12551)
dialogue4: trimmed shape = (307, 12551)
thepostmanalwayscalls: trimmed shape = (454, 12551)
stumblinginthedark: trimmed shape = (489, 12551)
kiksuya: trimmed shape = (332, 12551)
haveyoumethimyet: trimmed shape = (496, 12551)
theinterview: trimmed shape = (221, 12551)
againstthewind: trimmed shape = (170, 12551)
tetris: trimmed shape = (280, 12551)
canplanetearthfeedtenbillionpeoplepart2: trimmed shape = (545, 12551)
alternateithicatom: trimmed shape = (343, 12551)
goldiethegoldfish: trimmed shape =




In [14]:
raw_text["sweetaspie"].tr_times

array([ -9.,  -7.,  -5.,  -3.,  -1.,   1.,   3.,   5.,   7.,   9.,  11.,
        13.,  15.,  17.,  19.,  21.,  23.,  25.,  27.,  29.,  31.,  33.,
        35.,  37.,  39.,  41.,  43.,  45.,  47.,  49.,  51.,  53.,  55.,
        57.,  59.,  61.,  63.,  65.,  67.,  69.,  71.,  73.,  75.,  77.,
        79.,  81.,  83.,  85.,  87.,  89.,  91.,  93.,  95.,  97.,  99.,
       101., 103., 105., 107., 109., 111., 113., 115., 117., 119., 121.,
       123., 125., 127., 129., 131., 133., 135., 137., 139., 141., 143.,
       145., 147., 149., 151., 153., 155., 157., 159., 161., 163., 165.,
       167., 169., 171., 173., 175., 177., 179., 181., 183., 185., 187.,
       189., 191., 193., 195., 197., 199., 201., 203., 205., 207., 209.,
       211., 213., 215., 217., 219., 221., 223., 225., 227., 229., 231.,
       233., 235., 237., 239., 241., 243., 245., 247., 249., 251., 253.,
       255., 257., 259., 261., 263., 265., 267., 269., 271., 273., 275.,
       277., 279., 281., 283., 285., 287., 289., 29

### 1.3
Create lagged versions of the features using make_delayed from code/preprocessing.py with delays ranging from 1 to 4, inclusive. Explain what this does.

To better model the brain's response to language input, we construct lagged (delayed) features using the make_delayed() function. For each time point t, it concatenates the feature vectors from TRs t-1, t-2, t-3, and t-4 into one temporally extended input vector. The number of rows is unchanged (157 for sweetaspie), while the number of columns grows by a factor of four, from 12,551 to 50,204. This lets the regression capture the fact that the fMRI signal at time t reflects language input from several seconds earlier: the hemodynamic (BOLD) response is slow and peaks a few seconds after the stimulus, and with TRs spaced 2 seconds apart, delays of 1 to 4 TRs cover roughly 2 to 8 seconds. Lagged features are standard in fMRI encoding models for this reason.
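
The effect of the delays can be sketched in a few lines of NumPy. The function below is a hypothetical stand-in (delay_sketch), not the make_delayed implementation from preprocessing.py. It assumes that rows without enough history are filled with zeros, which is consistent with the unchanged row counts in the output.

```python
import numpy as np

def delay_sketch(X, delays=(1, 2, 3, 4)):
    """Stack copies of X shifted down by each non-negative delay (zero-padded at the top)."""
    num_trs, dim = X.shape
    shifted_copies = []
    for d in delays:
        shifted = np.zeros((num_trs, dim), dtype=X.dtype)
        if d == 0:
            shifted[:] = X                 # no delay: features stay in place
        elif d < num_trs:
            shifted[d:] = X[:num_trs - d]  # row t holds the features from t - d
        shifted_copies.append(shifted)
    return np.hstack(shifted_copies)

toy = np.arange(1, 13).reshape(6, 2)  # 6 TRs, 2 features
lagged = delay_sketch(toy)

print(lagged.shape)  # (6, 8)
print(lagged[0])     # all zeros: no history is available at the first TR
print(lagged[4])     # [7 8 5 6 3 4 1 2]: features from TRs 3, 2, 1, 0
```

The regression then learns a separate weight for every feature at every delay, which amounts to fitting a simple per-feature temporal response filter.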

In [15]:
def apply_delays(trimmed_features, delays=range(1, 5)):
    """
    Apply delay embedding to TR-aligned features.

    Args:
        trimmed_features (dict): story_id -> [num_TRs, feature_dim]
        delays (iterable): list of delays to include (e.g., [1, 2, 3, 4])

    Returns:
        dict: story_id -> [num_TRs, feature_dim * len(delays)]
    """
    delayed = {}
    for story_id, X in trimmed_features.items():
        X_delayed = make_delayed(X, delays=delays)
        delayed[story_id] = X_delayed
        print(f"{story_id}: delayed shape = {X_delayed.shape}")
    return delayed

In [16]:
story_to_X_lagged = apply_delays(story_to_X_trimmed, delays=range(1, 5))

sweetaspie: delayed shape = (157, 50204)
thatthingonmyarm: delayed shape = (434, 50204)
tildeath: delayed shape = (323, 50204)
indianapolis: delayed shape = (302, 50204)
lawsthatchokecreativity: delayed shape = (434, 50204)
golfclubbing: delayed shape = (201, 50204)
jugglingandjesus: delayed shape = (193, 50204)
shoppinginchina: delayed shape = (337, 50204)
cocoonoflove: delayed shape = (429, 50204)
hangtime: delayed shape = (324, 50204)
beneaththemushroomcloud: delayed shape = (342, 50204)
dialogue4: delayed shape = (307, 50204)
thepostmanalwayscalls: delayed shape = (454, 50204)
stumblinginthedark: delayed shape = (489, 50204)
kiksuya: delayed shape = (332, 50204)
haveyoumethimyet: delayed shape = (496, 50204)
theinterview: delayed shape = (221, 50204)
againstthewind: delayed shape = (170, 50204)
tetris: delayed shape = (280, 50204)
canplanetearthfeedtenbillionpeoplepart2: delayed shape = (545, 50204)
alternateithicatom: delayed shape = (343, 50204)
goldiethegoldfish: delayed shape =

### 1.4 
Repeat the process above for 2 other pre-trained embedding methods: Word2Vec and GloVe. You will have to find these pre-trained embedding methods online, and use them. The processed embeddings resulting from these steps will serve as the features (X matrix) in our regression.
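
Both embedding methods follow the same per-word lookup: lowercase the word, fetch its vector if the model knows it, and fall back to a zero vector otherwise. The sketch below shows this with a plain dictionary standing in for the gensim model. The toy vectors and the helper name lookup_with_fallback are made up for illustration. It also counts how many words hit the fallback, which is worth checking before relying on the features.

```python
import numpy as np

# Toy stand-in for a pre-trained model: word -> 4-dimensional vector (made-up values)
toy_model = {
    "people": np.array([0.1, 0.3, -0.2, 0.5], dtype=np.float32),
    "often": np.array([0.0, -0.1, 0.4, 0.2], dtype=np.float32),
    "ask": np.array([0.3, 0.2, 0.1, -0.4], dtype=np.float32),
}

def lookup_with_fallback(words, model, vector_size):
    """Return a [num_words, vector_size] matrix and the number of OOV words."""
    rows, num_oov = [], 0
    for word in words:
        key = word.lower()
        if key in model:
            rows.append(model[key])
        else:
            rows.append(np.zeros(vector_size, dtype=np.float32))  # OOV -> zero vector
            num_oov += 1
    return np.vstack(rows), num_oov

toy_words = ["", "People", "often", "ask", "me"]
toy_matrix, toy_oov = lookup_with_fallback(toy_words, toy_model, vector_size=4)

print(toy_matrix.shape)                              # (5, 4)
print(f"OOV words: {toy_oov} of {len(toy_words)}")   # OOV words: 2 of 5
```

A high out-of-vocabulary rate would mean many all-zero rows, so reporting it per story is a cheap diagnostic.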

#### 1.4.1 Word2Vec

In [17]:
# Step 1: Load pre-trained Word2Vec
print("Downloading Word2Vec model (Google News)...")
word2vec_model = api.load("word2vec-google-news-300")
vector_size = 300

Downloading Word2Vec model (Google News)...


In [18]:
# Step 2: Map each word to a Word2Vec vector
def generate_word2vec_features(raw_text, w2v_model, vector_size=300):
    """
    Generate word-level Word2Vec embeddings for each story.

    Args:
        raw_text (dict): story_id -> DataSequence
        w2v_model (gensim KeyedVectors): pre-trained Word2Vec model
        vector_size (int): dimensionality of Word2Vec vectors

    Returns:
        dict: story_id -> np.ndarray of shape [num_words, vector_size]
    """
    word_vectors = {}
    for story_id, seq in tqdm(raw_text.items(), desc="Generating Word2Vec"):
        words = seq.data
        vectors = []
        for word in words:
            word_lower = word.lower()
            if word_lower in w2v_model:
                vectors.append(w2v_model[word_lower])
            else:
                vectors.append(np.zeros(vector_size))  # OOV → zero vector

        stacked = np.vstack(vectors)
        word_vectors[story_id] = stacked
        print(f"{story_id}: Word2Vec word-level shape = {stacked.shape}")

    return word_vectors

word2vec_features = generate_word2vec_features(raw_text, word2vec_model)

# Step 3: Downsample Word2Vec vectors
story_to_X_ds_w2v = downsample_features(word2vec_features, raw_text)

# Step 4: Trim first 5s and last 10s
story_to_X_trimmed_w2v = trim_features(story_to_X_ds_w2v)

# Step 5: Apply delays (1 to 4 TRs)
story_to_X_lagged_w2v = apply_delays(story_to_X_trimmed_w2v)

Generating Word2Vec:  24%|██▍       | 26/109 [00:00<00:00, 251.40it/s]

sweetaspie: Word2Vec word-level shape = (697, 300)
thatthingonmyarm: Word2Vec word-level shape = (2073, 300)
tildeath: Word2Vec word-level shape = (2297, 300)
indianapolis: Word2Vec word-level shape = (1554, 300)
lawsthatchokecreativity: Word2Vec word-level shape = (2084, 300)
golfclubbing: Word2Vec word-level shape = (1211, 300)
jugglingandjesus: Word2Vec word-level shape = (887, 300)
shoppinginchina: Word2Vec word-level shape = (1731, 300)
cocoonoflove: Word2Vec word-level shape = (1984, 300)
hangtime: Word2Vec word-level shape = (1927, 300)
beneaththemushroomcloud: Word2Vec word-level shape = (1916, 300)
dialogue4: Word2Vec word-level shape = (1692, 300)
thepostmanalwayscalls: Word2Vec word-level shape = (2220, 300)
stumblinginthedark: Word2Vec word-level shape = (2681, 300)
kiksuya: Word2Vec word-level shape = (1699, 300)
haveyoumethimyet: Word2Vec word-level shape = (2985, 300)
theinterview: Word2Vec word-level shape = (1079, 300)
againstthewind: Word2Vec word-level shape = (838, 

Generating Word2Vec: 100%|██████████| 109/109 [00:00<00:00, 300.68it/s]

odetostepfather: Word2Vec word-level shape = (2675, 300)
threemonths: Word2Vec word-level shape = (2062, 300)
theclosetthatateeverything: Word2Vec word-level shape = (1928, 300)
souls: Word2Vec word-level shape = (1868, 300)
reachingoutbetweenthebars: Word2Vec word-level shape = (1490, 300)
fromboyhoodtofatherhood: Word2Vec word-level shape = (2755, 300)
naked: Word2Vec word-level shape = (3218, 300)
treasureisland: Word2Vec word-level shape = (1763, 300)
penpal: Word2Vec word-level shape = (1592, 300)
gpsformylostidentity: Word2Vec word-level shape = (1650, 300)
adventuresinsayingyes: Word2Vec word-level shape = (2309, 300)
dialogue1: Word2Vec word-level shape = (934, 300)
theadvancedbeginner: Word2Vec word-level shape = (1624, 300)
singlewomanseekingmanwich: Word2Vec word-level shape = (1486, 300)
dialogue5: Word2Vec word-level shape = (1765, 300)
undertheinfluence: Word2Vec word-level shape = (1641, 300)
leavingbaghdad: Word2Vec word-level shape = (1976, 300)
thetriangleshirtwaistco




sweetaspie: shape after downsampling = (172, 300)
thatthingonmyarm: shape after downsampling = (449, 300)
tildeath: shape after downsampling = (338, 300)
indianapolis: shape after downsampling = (317, 300)
lawsthatchokecreativity: shape after downsampling = (449, 300)
golfclubbing: shape after downsampling = (216, 300)
jugglingandjesus: shape after downsampling = (208, 300)
shoppinginchina: shape after downsampling = (352, 300)
cocoonoflove: shape after downsampling = (444, 300)
hangtime: shape after downsampling = (339, 300)
beneaththemushroomcloud: shape after downsampling = (357, 300)
dialogue4: shape after downsampling = (322, 300)
thepostmanalwayscalls: shape after downsampling = (469, 300)
stumblinginthedark: shape after downsampling = (504, 300)
kiksuya: shape after downsampling = (347, 300)
haveyoumethimyet: shape after downsampling = (511, 300)
theinterview: shape after downsampling = (236, 300)
againstthewind: shape after downsampling = (185, 300)
tetris: shape after downsamp

Trimming: 100%|██████████| 109/109 [00:00<00:00, 83839.93it/s]

sweetaspie: trimmed shape = (157, 300)
thatthingonmyarm: trimmed shape = (434, 300)
tildeath: trimmed shape = (323, 300)
indianapolis: trimmed shape = (302, 300)
lawsthatchokecreativity: trimmed shape = (434, 300)
golfclubbing: trimmed shape = (201, 300)
jugglingandjesus: trimmed shape = (193, 300)
shoppinginchina: trimmed shape = (337, 300)
cocoonoflove: trimmed shape = (429, 300)
hangtime: trimmed shape = (324, 300)
beneaththemushroomcloud: trimmed shape = (342, 300)
dialogue4: trimmed shape = (307, 300)
thepostmanalwayscalls: trimmed shape = (454, 300)
stumblinginthedark: trimmed shape = (489, 300)
kiksuya: trimmed shape = (332, 300)
haveyoumethimyet: trimmed shape = (496, 300)
theinterview: trimmed shape = (221, 300)
againstthewind: trimmed shape = (170, 300)
tetris: trimmed shape = (280, 300)
canplanetearthfeedtenbillionpeoplepart2: trimmed shape = (545, 300)
alternateithicatom: trimmed shape = (343, 300)
goldiethegoldfish: trimmed shape = (317, 300)
seedpotatoesofleningrad: trimm




#### 1.4.2 GloVe

In [19]:
print("Downloading GloVe vectors...")
glove_model = api.load("glove-wiki-gigaword-300")

Downloading GloVe vectors...


In [20]:
def generate_glove_features(raw_text, glove_model, vector_size=300):
    glove_vectors = {}
    for story_id, seq in tqdm(raw_text.items(), desc="Generating GloVe"):
        words = seq.data
        vectors = [
            glove_model[word.lower()] if word.lower() in glove_model else np.zeros(vector_size)
            for word in words
        ]
        stacked = np.vstack(vectors)  # Convert to numpy array for .shape
        glove_vectors[story_id] = stacked
        print(f"{story_id}: GloVe word-level shape = {stacked.shape}")
    return glove_vectors

In [21]:
# Embedding
glove_vectors = generate_glove_features(raw_text, glove_model)

# Downsampling
story_to_X_ds_glove = downsample_features(glove_vectors, raw_text)

# Trimming
story_to_X_trimmed_glove = trim_features(story_to_X_ds_glove)

# Delay
story_to_X_lagged_glove = apply_delays(story_to_X_trimmed_glove, delays=range(1, 5))


Generating GloVe:  39%|███▉      | 43/109 [00:00<00:00, 231.78it/s]

sweetaspie: GloVe word-level shape = (697, 300)
thatthingonmyarm: GloVe word-level shape = (2073, 300)
tildeath: GloVe word-level shape = (2297, 300)
indianapolis: GloVe word-level shape = (1554, 300)
lawsthatchokecreativity: GloVe word-level shape = (2084, 300)
golfclubbing: GloVe word-level shape = (1211, 300)
jugglingandjesus: GloVe word-level shape = (887, 300)
shoppinginchina: GloVe word-level shape = (1731, 300)
cocoonoflove: GloVe word-level shape = (1984, 300)
hangtime: GloVe word-level shape = (1927, 300)
beneaththemushroomcloud: GloVe word-level shape = (1916, 300)
dialogue4: GloVe word-level shape = (1692, 300)
thepostmanalwayscalls: GloVe word-level shape = (2220, 300)
stumblinginthedark: GloVe word-level shape = (2681, 300)
kiksuya: GloVe word-level shape = (1699, 300)
haveyoumethimyet: GloVe word-level shape = (2985, 300)
theinterview: GloVe word-level shape = (1079, 300)
againstthewind: GloVe word-level shape = (838, 300)
tetris: GloVe word-level shape = (1350, 300)
canp

Generating GloVe: 100%|██████████| 109/109 [00:00<00:00, 272.49it/s]


adollshouse: GloVe word-level shape = (1656, 300)
catfishingstrangerstofindmyself: GloVe word-level shape = (1522, 300)
dialogue2: GloVe word-level shape = (1835, 300)
theshower: GloVe word-level shape = (1383, 300)
igrewupinthewestborobaptistchurch: GloVe word-level shape = (2449, 300)
thesurprisingthingilearnedsailingsoloaroundtheworld: GloVe word-level shape = (2855, 300)
odetostepfather: GloVe word-level shape = (2675, 300)
threemonths: GloVe word-level shape = (2062, 300)
theclosetthatateeverything: GloVe word-level shape = (1928, 300)
souls: GloVe word-level shape = (1868, 300)
reachingoutbetweenthebars: GloVe word-level shape = (1490, 300)
fromboyhoodtofatherhood: GloVe word-level shape = (2755, 300)
naked: GloVe word-level shape = (3218, 300)
treasureisland: GloVe word-level shape = (1763, 300)
penpal: GloVe word-level shape = (1592, 300)
gpsformylostidentity: GloVe word-level shape = (1650, 300)
adventuresinsayingyes: GloVe word-level shape = (2309, 300)
dialogue1: GloVe word-

Trimming: 100%|██████████| 109/109 [00:00<00:00, 69259.07it/s]

sweetaspie: trimmed shape = (157, 300)
thatthingonmyarm: trimmed shape = (434, 300)
tildeath: trimmed shape = (323, 300)
indianapolis: trimmed shape = (302, 300)
lawsthatchokecreativity: trimmed shape = (434, 300)
golfclubbing: trimmed shape = (201, 300)
jugglingandjesus: trimmed shape = (193, 300)
shoppinginchina: trimmed shape = (337, 300)
cocoonoflove: trimmed shape = (429, 300)
hangtime: trimmed shape = (324, 300)
beneaththemushroomcloud: trimmed shape = (342, 300)
dialogue4: trimmed shape = (307, 300)
thepostmanalwayscalls: trimmed shape = (454, 300)
stumblinginthedark: trimmed shape = (489, 300)
kiksuya: trimmed shape = (332, 300)
haveyoumethimyet: trimmed shape = (496, 300)
theinterview: trimmed shape = (221, 300)
againstthewind: trimmed shape = (170, 300)
tetris: trimmed shape = (280, 300)
canplanetearthfeedtenbillionpeoplepart2: trimmed shape = (545, 300)
alternateithicatom: trimmed shape = (343, 300)
goldiethegoldfish: trimmed shape = (317, 300)
seedpotatoesofleningrad: trimm




#### 1.4.3
The processed embeddings resulting from these steps will serve as the features (X matrix) in our regression, so we save them to disk: the BoW features with joblib and the Word2Vec and GloVe features with pickle.

In [22]:
import joblib

output_dir = "../data"
os.makedirs(output_dir, exist_ok=True)

# save BoW (joblib with compression, since the lagged BoW matrices are large)
joblib.dump(story_to_X_lagged, os.path.join(output_dir, "X_lagged_BoW.joblib"), compress=3)

['../data/X_lagged_BoW.joblib']

To load the saved BoW features later:

import joblib
story_to_X_lagged_bow = joblib.load("../data/X_lagged_BoW.joblib")

In [23]:
# save Word2Vec
with open(os.path.join(output_dir, "X_lagged_W2V.pkl"), "wb") as f:
    pickle.dump(story_to_X_lagged_w2v, f)

In [24]:
# save GloVe
with open(os.path.join(output_dir, "X_lagged_GloVe.pkl"), "wb") as f:
    pickle.dump(story_to_X_lagged_glove, f)
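
When loading these files later, a quick round-trip check helps confirm that the dictionary structure survives serialization. The sketch below uses a temporary directory and a small dummy dictionary. Both are placeholders, not the real data files.

```python
import os
import pickle
import tempfile

import numpy as np

# Dummy stand-in for story_to_X_lagged_w2v: story_id -> [num_TRs, 4 * 300]
dummy_features = {
    "story_a": np.ones((157, 1200)),
    "story_b": np.zeros((434, 1200)),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "X_lagged_dummy.pkl")
    with open(path, "wb") as f:
        pickle.dump(dummy_features, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# Same stories and same arrays after the round trip
same_keys = set(loaded) == set(dummy_features)
same_values = all(np.array_equal(loaded[k], dummy_features[k]) for k in dummy_features)
print(same_keys, same_values)  # True True
```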

### 1.5

Describe potential benefits of using pre-trained embeddings.

Pre-trained embeddings such as Word2Vec and GloVe offer several advantages over traditional methods like bag-of-words (BoW) when used as features in modeling tasks such as brain response prediction:

1. Semantic richness
Pre-trained embeddings encode semantic similarity between words. Words with similar meanings are located close together in the embedding space, which helps models generalize better. For example, “king” and “queen” will have more similar representations in Word2Vec than in BoW, which treats them as entirely unrelated tokens.

2. Lower dimensionality
While BoW vectors are high-dimensional and sparse (12,551 dimensions here, or 50,204 after adding four delays), pre-trained embeddings are dense and compact (300 dimensions, or 1,200 after delays). Having fewer features per time point reduces the risk of overfitting and makes the regression cheaper to fit.

3. Transfer learning
Pre-trained embeddings are learned from massive external corpora (e.g., Wikipedia, Google News), capturing broader linguistic knowledge that may not exist in our training data. This can improve performance, especially when the dataset is small.

4. Better generalization for rare words
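In BoW, a rare word gets its own column that is almost always zero, and a word that was not seen when the vocabulary was fitted is mapped to an all-zero row, so little can be learned about either. Pre-trained embeddings still provide meaningful vectors for infrequent words because those vectors were learned from a much larger corpus. Words missing from the pre-trained vocabulary still need a fallback: this notebook uses a zero vector, and subword-based models such as fastText can instead compose a vector from character n-grams.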
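
The first two points can be illustrated with a small sketch. The dense vectors below are made-up toy values, not real Word2Vec or GloVe entries. They only show that one-hot (BoW-style) rows for different words are always orthogonal, whereas dense embeddings can place related words close together. The column counts use the shapes reported in the outputs above.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

vocab = ["king", "queen", "banana"]
one_hot = np.eye(len(vocab))  # BoW-style: one column per word

# Made-up dense vectors, chosen so that "king" and "queen" point in similar directions
toy_dense = {
    "king": np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.7, 0.7, 0.1]),
    "banana": np.array([-0.1, 0.0, 0.9]),
}

sim_one_hot = cosine(one_hot[0], one_hot[1])
sim_related = cosine(toy_dense["king"], toy_dense["queen"])
sim_unrelated = cosine(toy_dense["king"], toy_dense["banana"])
print(sim_one_hot, sim_related, sim_unrelated)

# Feature counts per time point after adding four delays
num_delays = 4
bow_columns = 12551 * num_delays
dense_columns = 300 * num_delays
print(bow_columns, dense_columns)  # 50204 1200
```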