### 1. Bag of Word Embeddings

In [1]:
import os
import pickle
import sys

import gensim.downloader as api
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from tqdm import tqdm

from preprocessing import downsample_word_vectors, make_delayed

In [2]:
data_path = "../data"
print(os.listdir(data_path))

['raw_text.pkl', '.DS_Store', 'subject3.zip', 'subject2.zip', 'subject2', 'subject3', 'X_lagged_BoW.pkl']


In [3]:
subject2_path = os.path.join(data_path, "subject2")
print(os.listdir(subject2_path))

subject3_path = os.path.join(data_path, "subject3")
print(os.listdir(subject3_path))

['onapproachtopluto.npy', 'penpal.npy', 'lawsthatchokecreativity.npy', 'stumblinginthedark.npy', 'leavingbaghdad.npy', 'gpsformylostidentity.npy', 'avatar.npy', 'christmas1940.npy', 'canadageeseandddp.npy', 'treasureisland.npy', 'thesecrettomarriage.npy', 'afatherscover.npy', 'googlingstrangersandkentuckybluegrass.npy', 'theclosetthatateeverything.npy', 'notontheusualtour.npy', 'sloth.npy', 'food.npy', 'kiksuya.npy', 'igrewupinthewestborobaptistchurch.npy', 'alternateithicatom.npy', 'quietfire.npy', 'whyimustspeakoutaboutclimatechange.npy', 'goingthelibertyway.npy', 'cocoonoflove.npy', 'singlewomanseekingmanwich.npy', 'haveyoumethimyet.npy', 'souls.npy', 'tildeath.npy', 'becomingindian.npy', 'indianapolis.npy', 'eyespy.npy', 'firetestforlove.npy', 'cautioneating.npy', 'againstthewind.npy', 'wheretheressmoke.npy', 'theshower.npy', 'vixenandtheussr.npy', 'thatthingonmyarm.npy', 'canplanetearthfeedtenbillionpeoplepart2.npy', 'thetriangleshirtwaistconnection.npy', 'thefreedomridersandme.np

In [4]:
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
sys.path.append(project_root)

### 1.1 
Provided the list of stories, generate embedding vectors via bag-of-words. Notice that the dimensions
of the resulting matrix do not match up with the response matrix. We will have to
down-sample it (see [1] for details) to match dimensions. Explain what is being done here.

In this step, we convert the raw story transcripts into numerical feature vectors using a Bag-of-Words (BoW) embedding. For each story, raw_text.pkl provides a DataSequence object whose .data attribute holds the transcript as a list of words, together with the timing information needed to relate words to fMRI sampling times (TRs).

We fit a single CountVectorizer on the words of all stories and then transform each story word by word, so every row of a story's feature matrix is the count vector of one word (essentially a one-hot vector over the shared vocabulary of 12,551 terms). Sharing the vocabulary across stories keeps the BoW representation consistent in dimension and meaning.

These word-level matrices do not line up with the fMRI response matrices (Y): they have one row per word (for example, 697 rows for sweetaspie), whereas Y has one row per TR. The next step resolves this by downsampling the features to TR resolution and then trimming the start and end of each story so that the stimulus features (X) and brain responses (Y) are aligned.
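As a toy illustration of a word-level bag-of-words encoding over a shared vocabulary, here is a minimal sketch in plain NumPy. It does not use CountVectorizer; the toy stories and the helper name `one_hot_rows` are assumptions made up for this example and are not part of the assignment code.

```python
import numpy as np

# Toy transcripts standing in for the `.data` word lists (hypothetical content).
toy_stories = {
    "story_a": ["the", "dog", "ran"],
    "story_b": ["the", "cat", "sat", "the"],
}

# Shared vocabulary: every distinct lowercased word across all stories, sorted.
vocab = sorted({w.lower() for words in toy_stories.values() for w in words})
index = {w: i for i, w in enumerate(vocab)}


def one_hot_rows(words, index):
    """Return a [num_words, vocab_size] matrix with one row per word."""
    rows = np.zeros((len(words), len(index)), dtype=np.int64)
    for r, w in enumerate(words):
        col = index.get(w.lower())
        if col is not None:  # words outside the vocabulary stay all-zero
            rows[r, col] = 1
    return rows


toy_bow = {sid: one_hot_rows(words, index) for sid, words in toy_stories.items()}
for sid, X in toy_bow.items():
    print(sid, X.shape)
```

Because the vocabulary is shared, both toy matrices have the same number of columns even though the stories differ in length. One difference from the real pipeline: CountVectorizer's default token pattern keeps only tokens of two or more alphanumeric characters, so single-character words such as "a" produce all-zero rows there, which this sketch does not reproduce.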

In [5]:
raw_text_path = os.path.join(data_path, "raw_text.pkl")
with open(raw_text_path, 'rb') as f:
    raw_text = pickle.load(f)

# Inspect the type and structure of raw_text
print("Type of raw_text:", type(raw_text))
print("Sample of raw_text:", list(raw_text.items())[:2] if isinstance(raw_text, dict) else raw_text[:2])

Type of raw_text: <class 'dict'>
Sample of raw_text: [('sweetaspie', <ridge_utils.DataSequence.DataSequence object at 0x10794f490>), ('thatthingonmyarm', <ridge_utils.DataSequence.DataSequence object at 0x162bc0e80>)]


In [6]:
def generate_word_level_bow_features(raw_text):
    """
    Generates word-level Bag-of-Words vectors for each story.
    Each row corresponds to a single word.

    Returns:
        word_vectors: dict of story_id -> [num_words, vocab_size]
        vectorizer: the fitted CountVectorizer
    """
    # Step 1: Collect all words across all stories (for fitting vectorizer)
    all_words = []
    for story_id in raw_text:
        words = raw_text[story_id].data
        all_words.extend(words)

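    # Fit vectorizer on all words (note: single words, not chunks!).
    # With the default token_pattern, tokens shorter than two characters
    # (e.g. "a", "i") are dropped, so those words become all-zero rows.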
    vectorizer = CountVectorizer(lowercase=True)
    vectorizer.fit(all_words)

    # Step 2: Transform word sequences
    word_vectors = {}
    for story_id, seq in raw_text.items():
        words = seq.data  # List of words
        bow_matrix = vectorizer.transform(words).toarray()  # One row per word
        word_vectors[story_id] = bow_matrix
        print(f"{story_id}: word-level shape = {bow_matrix.shape}")

    return word_vectors, vectorizer


In [7]:
word_vectors, vectorizer = generate_word_level_bow_features(raw_text)

for story_id, X in word_vectors.items():
    print(f"{story_id}: shape = {X.shape}")

sweetaspie: word-level shape = (697, 12551)
thatthingonmyarm: word-level shape = (2073, 12551)
tildeath: word-level shape = (2297, 12551)
indianapolis: word-level shape = (1554, 12551)
lawsthatchokecreativity: word-level shape = (2084, 12551)
golfclubbing: word-level shape = (1211, 12551)
jugglingandjesus: word-level shape = (887, 12551)
shoppinginchina: word-level shape = (1731, 12551)
cocoonoflove: word-level shape = (1984, 12551)
hangtime: word-level shape = (1927, 12551)
beneaththemushroomcloud: word-level shape = (1916, 12551)
dialogue4: word-level shape = (1692, 12551)
thepostmanalwayscalls: word-level shape = (2220, 12551)
stumblinginthedark: word-level shape = (2681, 12551)
kiksuya: word-level shape = (1699, 12551)
haveyoumethimyet: word-level shape = (2985, 12551)
theinterview: word-level shape = (1079, 12551)
againstthewind: word-level shape = (838, 12551)
tetris: word-level shape = (1350, 12551)
canplanetearthfeedtenbillionpeoplepart2: word-level shape = (2532, 12551)
altern

### 1.2
Call downsample_word_vectors from the file code/preprocessing.py to get the dimensions to match. Further, trim the first 5 seconds and last 10 seconds of the output.
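The downsampling cells in this section describe the operation as Lanczos interpolation. The actual implementation lives in code/preprocessing.py and is not shown in this notebook, so the following is only a sketch of the general technique under that assumption: each TR-level row is a weighted sum of word-level rows, with weights given by a windowed sinc of the time offset. The function names, the window size of 3, and the toy timings are all hypothetical.

```python
import numpy as np


def lanczos_kernel(t, cutoff, window=3):
    """Lanczos (windowed sinc) weights for time offsets t, in seconds."""
    x = np.asarray(t, dtype=float) * cutoff
    # np.sinc(x) is sin(pi * x) / (pi * x), with the value 1 at x = 0.
    weights = np.sinc(x) * np.sinc(x / window)
    # The kernel is defined to be zero outside the window.
    return np.where(np.abs(x) > window, 0.0, weights)


def lanczos_downsample_sketch(data, old_times, new_times, window=3):
    """Resample rows of `data` from old_times (words) to new_times (TRs)."""
    old_times = np.asarray(old_times, dtype=float)
    new_times = np.asarray(new_times, dtype=float)
    cutoff = 1.0 / np.mean(np.diff(new_times))          # one lobe per TR spacing
    offsets = new_times[:, None] - old_times[None, :]   # [n_new, n_old]
    return lanczos_kernel(offsets, cutoff, window) @ np.asarray(data, dtype=float)


# Four words (one-hot over a 3-word vocabulary) spoken within about 4.5 s.
word_times = [0.2, 1.5, 1.9, 4.4]
word_vecs = np.eye(3)[[0, 1, 1, 2]]
tr_times = np.arange(4) * 1.5  # 0.0, 1.5, 3.0, 4.5
X_tr = lanczos_downsample_sketch(word_vecs, word_times, tr_times)
print(X_tr.shape)  # (4, 3)
```

The output has one row per TR and keeps the feature dimension unchanged, which is the shape change seen in the outputs of this section (for example, 697 word rows become 172 TR rows for sweetaspie).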

In [8]:

def downsample_bow_features(bow_features, raw_text):
    """
    Perform Lanczos downsampling of word-level BoW features to TR resolution.

    Args:
        bow_features (dict): story_id -> word-level BoW matrix [num_words, vocab_size]
        raw_text (dict): story_id -> DataSequence object with data_times & tr_times

    Returns:
        dict: story_id -> TR-aligned BoW matrix [num_TRs, vocab_size]
    """
    story_ids = list(bow_features.keys())

    print("Performing downsampling (lanczos interpolation)...")
    downsampled = downsample_word_vectors(
        stories=story_ids,
        word_vectors=bow_features,
        wordseqs=raw_text
    )

    for sid, x in downsampled.items():
        print(f"{sid}: shape after downsampling = {x.shape}")

    return downsampled


def trim_bow_features(downsampled_features, TR=1.5, skip_seconds=(5, 10)):
    """
    Trim the first and last few TRs from each downsampled story.

    Args:
        downsampled_features (dict): story_id -> [num_TRs, vocab_size]
        TR (float): duration of each TR in seconds
        skip_seconds (tuple): (start_trim_sec, end_trim_sec)

    Returns:
        dict: story_id -> trimmed matrix
    """
    trim_start = int(skip_seconds[0] / TR)
    trim_end = int(skip_seconds[1] / TR)

    print(f"\n Trimming first {skip_seconds[0]}s and last {skip_seconds[1]}s...")
    trimmed = {}
    for story_id in tqdm(downsampled_features, desc="⏳ Trimming"):
        X = downsampled_features[story_id]
        n_trs = X.shape[0]
        if n_trs > (trim_start + trim_end):
            # Use an explicit end index: X[a:-0] would be empty if trim_end were 0
            trimmed[story_id] = X[trim_start:n_trs - trim_end]
        else:
            print(f"Skipped {story_id} (too short after trimming)")
    return trimmed
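The trimming arithmetic can be checked by hand. This short sketch repeats the integer conversion used in trim_bow_features with its default TR of 1.5 seconds and applies it to the sweetaspie row count reported in this section's outputs; the variable names are only for illustration.

```python
TR = 1.5                    # seconds per fMRI volume, the default in trim_bow_features
trim_start = int(5 / TR)    # 3 TRs (4.5 s) dropped from the start
trim_end = int(10 / TR)     # 6 TRs (9.0 s) dropped from the end

n_trs_before = 172          # sweetaspie after downsampling
n_trs_after = n_trs_before - trim_start - trim_end
print(trim_start, trim_end, n_trs_after)  # 3 6 163
```

Because `int()` truncates, slightly less than 5 and 10 seconds are removed (4.5 s and 9.0 s). Every story therefore loses 9 rows, matching the change from 172 to 163 rows for sweetaspie.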

In [9]:
# Step 1: Downsample to TR resolution
story_to_X_ds = downsample_bow_features(word_vectors, raw_text)

Performing downsampling (lanczos interpolation)...
sweetaspie: shape after downsampling = (172, 12551)
thatthingonmyarm: shape after downsampling = (449, 12551)
tildeath: shape after downsampling = (338, 12551)
indianapolis: shape after downsampling = (317, 12551)
lawsthatchokecreativity: shape after downsampling = (449, 12551)
golfclubbing: shape after downsampling = (216, 12551)
jugglingandjesus: shape after downsampling = (208, 12551)
shoppinginchina: shape after downsampling = (352, 12551)
cocoonoflove: shape after downsampling = (444, 12551)
hangtime: shape after downsampling = (339, 12551)
beneaththemushroomcloud: shape after downsampling = (357, 12551)
dialogue4: shape after downsampling = (322, 12551)
thepostmanalwayscalls: shape after downsampling = (469, 12551)
stumblinginthedark: shape after downsampling = (504, 12551)
kiksuya: shape after downsampling = (347, 12551)
haveyoumethimyet: shape after downsampling = (511, 12551)
theinterview: shape after downsampling = (236, 1255

In [10]:
# Step 2: Trim first 5s and last 10s
story_to_X_trimmed = trim_bow_features(story_to_X_ds)

for story_id, X in story_to_X_trimmed.items():
    print(f"{story_id}: shape = {X.shape}")


 Trimming first 5s and last 10s...


⏳ Trimming: 100%|██████████| 109/109 [00:00<00:00, 548176.42it/s]

sweetaspie: shape = (163, 12551)
thatthingonmyarm: shape = (440, 12551)
tildeath: shape = (329, 12551)
indianapolis: shape = (308, 12551)
lawsthatchokecreativity: shape = (440, 12551)
golfclubbing: shape = (207, 12551)
jugglingandjesus: shape = (199, 12551)
shoppinginchina: shape = (343, 12551)
cocoonoflove: shape = (435, 12551)
hangtime: shape = (330, 12551)
beneaththemushroomcloud: shape = (348, 12551)
dialogue4: shape = (313, 12551)
thepostmanalwayscalls: shape = (460, 12551)
stumblinginthedark: shape = (495, 12551)
kiksuya: shape = (338, 12551)
haveyoumethimyet: shape = (502, 12551)
theinterview: shape = (227, 12551)
againstthewind: shape = (176, 12551)
tetris: shape = (286, 12551)
canplanetearthfeedtenbillionpeoplepart2: shape = (551, 12551)
alternateithicatom: shape = (349, 12551)
goldiethegoldfish: shape = (323, 12551)
seedpotatoesofleningrad: shape = (287, 12551)
onapproachtopluto: shape = (277, 12551)
canplanetearthfeedtenbillionpeoplepart1: shape = (506, 12551)
bluehope: shap




### 1.3
Create lagged versions of the features using make_delayed from code/preprocessing.py with delays ranging from [1, 4] inclusive. Explain what this does.

To better model the brain’s response to language input, we construct lagged (delayed) features with make_delayed(). For each time point t, the function concatenates the feature vectors from time steps t-1, t-2, t-3, and t-4, so the feature dimension grows fourfold (12,551 × 4 = 50,204 columns for BoW) while the number of rows stays the same. With a TR of 1.5 seconds, these delays cover stimuli presented roughly 1.5 to 6 seconds earlier. This lets the regression model capture the fact that the measured response at time t reflects language input from several seconds before, because the hemodynamic (BOLD) signal recorded by fMRI rises and peaks only some seconds after the underlying neural activity. Lagged features are standard in fMRI encoding models for this reason.
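The lagging idea can be sketched on a tiny matrix. The function below is a hypothetical stand-in, not the make_delayed from code/preprocessing.py (whose source is not shown here); it assumes positive delays and that rows with no earlier stimulus are padded with zeros.

```python
import numpy as np


def delayed_copy_sketch(X, delays):
    """Stack time-shifted copies of X side by side, zero-padding the start.

    Assumes every delay is a positive number of rows (TRs).
    """
    n_rows, n_cols = X.shape
    shifted = []
    for d in delays:
        block = np.zeros((n_rows, n_cols), dtype=X.dtype)
        if d < n_rows:
            block[d:] = X[:n_rows - d]  # row t receives the features from row t - d
        shifted.append(block)
    return np.hstack(shifted)


X_toy = np.arange(6).reshape(3, 2)   # rows: [0, 1], [2, 3], [4, 5]
X_toy_lagged = delayed_copy_sketch(X_toy, delays=[1, 2])
print(X_toy_lagged)
# [[0 0 0 0]
#  [0 1 0 0]
#  [2 3 0 1]]
```

The number of rows is unchanged while the number of columns is multiplied by the number of delays, which is why four delays turn 12,551 BoW columns into 50,204.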

In [11]:
def apply_lagged_features(X_dict, delays=range(1, 5)):
    """
    Apply lagged feature construction to each story using make_delayed.

    Args:
        X_dict (dict): story_id -> [T, D] matrix (after downsampling & trimming)
        delays (list): which time lags to include

    Returns:
        dict: story_id -> [T, D * len(delays)] matrix
    """
    X_delayed_dict = {}
    for story_id, X in X_dict.items():
        X_delayed = make_delayed(X, delays=delays)
        X_delayed_dict[story_id] = X_delayed
        print(f"{story_id}: delayed shape = {X_delayed.shape}")
    return X_delayed_dict

In [12]:
story_to_X_lagged = apply_lagged_features(story_to_X_trimmed, delays=range(1, 5))

sweetaspie: delayed shape = (163, 50204)
thatthingonmyarm: delayed shape = (440, 50204)
tildeath: delayed shape = (329, 50204)
indianapolis: delayed shape = (308, 50204)
lawsthatchokecreativity: delayed shape = (440, 50204)
golfclubbing: delayed shape = (207, 50204)
jugglingandjesus: delayed shape = (199, 50204)
shoppinginchina: delayed shape = (343, 50204)
cocoonoflove: delayed shape = (435, 50204)
hangtime: delayed shape = (330, 50204)
beneaththemushroomcloud: delayed shape = (348, 50204)
dialogue4: delayed shape = (313, 50204)
thepostmanalwayscalls: delayed shape = (460, 50204)
stumblinginthedark: delayed shape = (495, 50204)
kiksuya: delayed shape = (338, 50204)
haveyoumethimyet: delayed shape = (502, 50204)
theinterview: delayed shape = (227, 50204)
againstthewind: delayed shape = (176, 50204)
tetris: delayed shape = (286, 50204)
canplanetearthfeedtenbillionpeoplepart2: delayed shape = (551, 50204)
alternateithicatom: delayed shape = (349, 50204)
goldiethegoldfish: delayed shape =
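A quick back-of-the-envelope calculation shows how large the lagged BoW matrices are when stored densely. The shapes come from the outputs in this notebook; the 8 bytes per value is an assumption (float64 or int64 storage).

```python
bytes_per_value = 8  # assumption: float64 (or int64) storage

# canplanetearthfeedtenbillionpeoplepart2 after lagging
bow_bytes = 551 * 50204 * bytes_per_value    # lagged BoW
dense_bytes = 551 * 1200 * bytes_per_value   # lagged 300-dimensional embedding

print(f"BoW:       {bow_bytes / 1e6:.1f} MB")    # BoW:       221.3 MB
print(f"Embedding: {dense_bytes / 1e6:.1f} MB")  # Embedding: 5.3 MB
```

Under that assumption, a single long story already takes a few hundred megabytes as lagged BoW, so the full set of 109 stories is expensive to hold in memory and to write to disk.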

### 1.4 
Repeat the process above for 2 other pre-trained embedding methods: Word2Vec and GloVe. You will have to find these pre-trained embedding methods online, and use them. The processed embeddings resulting from these steps will serve as the features (X matrix) in our regression.

#### 1.4.1 Word2Vec

In [13]:
# Step 1: Load pre-trained Word2Vec
print("Downloading Word2Vec model (Google News)...")
word2vec_model = api.load("word2vec-google-news-300")
vector_size = 300

Downloading Word2Vec model (Google News)...


In [14]:
# Step 2: Map each word to a Word2Vec vector
def generate_word2vec_features(raw_text, w2v_model, vector_size=300):
    word_vectors = {}
    for story_id, seq in tqdm(raw_text.items(), desc="🧠 Generating Word2Vec"):
        words = seq.data
        vectors = []
        for word in words:
            word_lower = word.lower()
            if word_lower in w2v_model:
                vectors.append(w2v_model[word_lower])
            else:
                vectors.append(np.zeros(vector_size))  # OOV → zero vector
        word_vectors[story_id] = np.vstack(vectors)
    return word_vectors

word2vec_features = generate_word2vec_features(raw_text, word2vec_model)

# Step 3: Downsample Word2Vec vectors
story_to_X_ds_w2v = downsample_bow_features(word2vec_features, raw_text)

# Step 4: Trim first 5s and last 10s
story_to_X_trimmed_w2v = trim_bow_features(story_to_X_ds_w2v)

# Step 5: Apply delays (1 to 4 TRs)
story_to_X_lagged_w2v = apply_lagged_features(story_to_X_trimmed_w2v)

🧠 Generating Word2Vec: 100%|██████████| 109/109 [00:00<00:00, 167.23it/s]


Performing downsampling (lanczos interpolation)...
sweetaspie: shape after downsampling = (172, 300)
thatthingonmyarm: shape after downsampling = (449, 300)
tildeath: shape after downsampling = (338, 300)
indianapolis: shape after downsampling = (317, 300)
lawsthatchokecreativity: shape after downsampling = (449, 300)
golfclubbing: shape after downsampling = (216, 300)
jugglingandjesus: shape after downsampling = (208, 300)
shoppinginchina: shape after downsampling = (352, 300)
cocoonoflove: shape after downsampling = (444, 300)
hangtime: shape after downsampling = (339, 300)
beneaththemushroomcloud: shape after downsampling = (357, 300)
dialogue4: shape after downsampling = (322, 300)
thepostmanalwayscalls: shape after downsampling = (469, 300)
stumblinginthedark: shape after downsampling = (504, 300)
kiksuya: shape after downsampling = (347, 300)
haveyoumethimyet: shape after downsampling = (511, 300)
theinterview: shape after downsampling = (236, 300)
againstthewind: shape after dow

⏳ Trimming: 100%|██████████| 109/109 [00:00<00:00, 25552.15it/s]

sweetaspie: delayed shape = (163, 1200)
thatthingonmyarm: delayed shape = (440, 1200)
tildeath: delayed shape = (329, 1200)
indianapolis: delayed shape = (308, 1200)
lawsthatchokecreativity: delayed shape = (440, 1200)
golfclubbing: delayed shape = (207, 1200)
jugglingandjesus: delayed shape = (199, 1200)
shoppinginchina: delayed shape = (343, 1200)
cocoonoflove: delayed shape = (435, 1200)
hangtime: delayed shape = (330, 1200)
beneaththemushroomcloud: delayed shape = (348, 1200)
dialogue4: delayed shape = (313, 1200)
thepostmanalwayscalls: delayed shape = (460, 1200)
stumblinginthedark: delayed shape = (495, 1200)
kiksuya: delayed shape = (338, 1200)
haveyoumethimyet: delayed shape = (502, 1200)
theinterview: delayed shape = (227, 1200)
againstthewind: delayed shape = (176, 1200)
tetris: delayed shape = (286, 1200)
canplanetearthfeedtenbillionpeoplepart2: delayed shape = (551, 1200)
alternateithicatom: delayed shape = (349, 1200)
goldiethegoldfish: delayed shape = (323, 1200)
seedpota




thefreedomridersandme: delayed shape = (340, 1200)
exorcism: delayed shape = (473, 1200)
itsabox: delayed shape = (361, 1200)
inamoment: delayed shape = (211, 1200)
afearstrippedbare: delayed shape = (433, 1200)
swimmingwithastronauts: delayed shape = (391, 1200)
ifthishaircouldtalk: delayed shape = (255, 1200)
whenmothersbullyback: delayed shape = (313, 1200)
vixenandtheussr: delayed shape = (397, 1200)
adollshouse: delayed shape = (247, 1200)
catfishingstrangerstofindmyself: delayed shape = (332, 1200)
dialogue2: delayed shape = (335, 1200)
theshower: delayed shape = (398, 1200)
igrewupinthewestborobaptistchurch: delayed shape = (445, 1200)
thesurprisingthingilearnedsailingsoloaroundtheworld: delayed shape = (486, 1200)
odetostepfather: delayed shape = (410, 1200)
threemonths: delayed shape = (359, 1200)
theclosetthatateeverything: delayed shape = (320, 1200)
souls: delayed shape = (361, 1200)
reachingoutbetweenthebars: delayed shape = (302, 1200)
fromboyhoodtofatherhood: delayed sha
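Since out-of-vocabulary (OOV) words are replaced by zero vectors, it can be useful to check how many words are affected. The helper below is a hypothetical diagnostic, not part of the assignment code; a plain dict stands in for the pre-trained model, since only `in` lookups are needed.

```python
def oov_rate(words, model):
    """Fraction of words whose lowercased form is missing from `model`."""
    if not words:
        return 0.0
    missing = sum(1 for w in words if w.lower() not in model)
    return missing / len(words)


# Toy stand-in for a pre-trained model (made-up vectors, illustration only).
toy_model = {"the": [0.1, 0.2], "dog": [0.3, 0.4]}
toy_words = ["The", "dog", "zyxxy", "dog"]
print(oov_rate(toy_words, toy_model))  # 0.25
```

Applied to a real story, this would be called as `oov_rate(raw_text[story_id].data, word2vec_model)`. A high rate would mean many rows of the feature matrix are all zeros before downsampling.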

#### 1.4.2 GloVe

In [15]:
print("Downloading GloVe vectors...")
glove_model = api.load("glove-wiki-gigaword-300")

Downloading GloVe vectors...


In [16]:
def generate_glove_features(raw_text, glove_model, vector_size=300):
    glove_vectors = {}
    for story_id, seq in tqdm(raw_text.items(), desc="Generating GloVe"):
        words = seq.data
        vectors = [
            glove_model[word.lower()] if word.lower() in glove_model else np.zeros(vector_size)
            for word in words
        ]
        glove_vectors[story_id] = np.vstack(vectors)
    return glove_vectors

In [17]:
# Embedding
glove_vectors = generate_glove_features(raw_text, glove_model)

# Downsampling
story_to_X_ds_glove = downsample_bow_features(glove_vectors, raw_text)

# Trimming
story_to_X_trimmed_glove = trim_bow_features(story_to_X_ds_glove)

# Delay
story_to_X_lagged_glove = apply_lagged_features(story_to_X_trimmed_glove, delays=range(1, 5))


Generating GloVe: 100%|██████████| 109/109 [00:00<00:00, 313.45it/s]


Performing downsampling (lanczos interpolation)...
sweetaspie: shape after downsampling = (172, 300)
thatthingonmyarm: shape after downsampling = (449, 300)
tildeath: shape after downsampling = (338, 300)
indianapolis: shape after downsampling = (317, 300)
lawsthatchokecreativity: shape after downsampling = (449, 300)
golfclubbing: shape after downsampling = (216, 300)
jugglingandjesus: shape after downsampling = (208, 300)
shoppinginchina: shape after downsampling = (352, 300)
cocoonoflove: shape after downsampling = (444, 300)
hangtime: shape after downsampling = (339, 300)
beneaththemushroomcloud: shape after downsampling = (357, 300)
dialogue4: shape after downsampling = (322, 300)
thepostmanalwayscalls: shape after downsampling = (469, 300)
stumblinginthedark: shape after downsampling = (504, 300)
kiksuya: shape after downsampling = (347, 300)
haveyoumethimyet: shape after downsampling = (511, 300)
theinterview: shape after downsampling = (236, 300)
againstthewind: shape after dow

⏳ Trimming: 100%|██████████| 109/109 [00:00<00:00, 368395.76it/s]

sweetaspie: delayed shape = (163, 1200)
thatthingonmyarm: delayed shape = (440, 1200)
tildeath: delayed shape = (329, 1200)
indianapolis: delayed shape = (308, 1200)
lawsthatchokecreativity: delayed shape = (440, 1200)
golfclubbing: delayed shape = (207, 1200)
jugglingandjesus: delayed shape = (199, 1200)
shoppinginchina: delayed shape = (343, 1200)
cocoonoflove: delayed shape = (435, 1200)
hangtime: delayed shape = (330, 1200)
beneaththemushroomcloud: delayed shape = (348, 1200)
dialogue4: delayed shape = (313, 1200)
thepostmanalwayscalls: delayed shape = (460, 1200)
stumblinginthedark: delayed shape = (495, 1200)
kiksuya: delayed shape = (338, 1200)
haveyoumethimyet: delayed shape = (502, 1200)
theinterview: delayed shape = (227, 1200)
againstthewind: delayed shape = (176, 1200)
tetris: delayed shape = (286, 1200)
canplanetearthfeedtenbillionpeoplepart2: delayed shape = (551, 1200)
alternateithicatom: delayed shape = (349, 1200)
goldiethegoldfish: delayed shape = (323, 1200)
seedpota




#### 1.4.3
The processed embeddings resulting from these steps will serve as the features (X matrix) in our regression, so we write them to disk: the BoW features with joblib (compressed) and the Word2Vec and GloVe features as pkl files.

In [18]:
import joblib

output_dir = "../data"
os.makedirs(output_dir, exist_ok=True)

# save BoW (compressed with joblib because the lagged BoW matrices are large)
joblib.dump(story_to_X_lagged, os.path.join(output_dir, "X_lagged_BoW.joblib"), compress=3)

['../data/X_lagged_BoW.joblib']

To load the BoW features later:

import joblib
story_to_X_lagged_bow = joblib.load("../data/X_lagged_BoW.joblib")

In [19]:
# save Word2Vec
with open(os.path.join(output_dir, "X_lagged_W2V.pkl"), "wb") as f:
    pickle.dump(story_to_X_lagged_w2v, f)

In [20]:
# save GloVe
with open(os.path.join(output_dir, "X_lagged_GloVe.pkl"), "wb") as f:
    pickle.dump(story_to_X_lagged_glove, f)
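To read the Word2Vec or GloVe features back, the pickle files can be loaded the same way they were written. The sketch below does a round trip on a small stand-in dictionary in a temporary directory, so it does not touch the real files; the story name and array contents are made up.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in for a lagged feature dict (hypothetical story name and values).
toy_features = {"toy_story": np.zeros((5, 1200), dtype=np.float32)}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "X_lagged_toy.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy_features, f)
    with open(path, "rb") as f:
        reloaded = pickle.load(f)

# Sanity check: same stories and shapes after reloading.
for story_id, X in reloaded.items():
    print(story_id, X.shape)
```

For the real files, replace the temporary path with `os.path.join(output_dir, "X_lagged_W2V.pkl")` or `os.path.join(output_dir, "X_lagged_GloVe.pkl")`.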

### 1.5

Describe potential benefits of using pre-trained embeddings.

Pre-trained embeddings such as Word2Vec and GloVe offer several advantages over traditional methods like bag-of-words (BoW) when used as features in modeling tasks such as brain response prediction:

1. Semantic richness
Pre-trained embeddings encode semantic similarity between words. Words with similar meanings are located close together in the embedding space, which helps models generalize better. For example, “king” and “queen” will have more similar representations in Word2Vec than in BoW, which treats them as entirely unrelated tokens.

2. Lower dimensionality
While BoW vectors are typically high-dimensional and sparse (here 12,551 dimensions per TR, or 50,204 after adding four delays), pre-trained embeddings are dense and compact (300 dimensions, or 1,200 after delays). With far fewer regression weights to estimate, the model is less prone to overfitting and cheaper to fit.

3. Transfer learning
Pre-trained embeddings are learned from massive external corpora (e.g., Wikipedia, Google News), capturing broader linguistic knowledge that may not be present in the training data. This can improve performance, especially when the dataset is small.

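4. Better generalization for rare words
In BoW, a rare word gets its own dimension that is active in only a handful of rows, so the regression has very little data from which to learn its weight. A pre-trained embedding places that word near related, more frequent words, so what the model learns about them carries over. Words missing from the pre-trained vocabulary still need a fallback; in this notebook they are mapped to zero vectors.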
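The contrast between one-hot and dense representations can be made concrete with cosine similarity. In the sketch below the dense vectors are invented for illustration and are not taken from Word2Vec or GloVe; only the one-hot result is a general property.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# One-hot BoW: distinct words are always orthogonal.
one_hot = np.eye(3)  # rows: "king", "queen", "table"
print(cosine(one_hot[0], one_hot[1]))  # 0.0

# Made-up dense vectors in which related words point in similar directions.
dense = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "table": [0.1, 0.0, 0.9],
}
king_queen = cosine(dense["king"], dense["queen"])
king_table = cosine(dense["king"], dense["table"])
print(king_queen > king_table)  # True
```

With one-hot features, a regression model cannot share anything it learns about "king" with "queen"; with dense vectors of this kind, similar words produce similar inputs and therefore similar predictions.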