### Text to CLIP cross modality ###
This notebook implements our original hypothesis. <br>
Before starting the open-ended project, we hypothesized that the meaning of textual data (both individual concept words and full sentences) and of images depicting the same concepts might be processed in the same regions of the brain. <br>
To test this hypothesis, we set out to extend the work of Pereira et al. (2018). In "Toward a universal decoder of linguistic meaning from brain activation", Pereira et al. made substantial progress toward showing that the meaning of different concepts, ranging from abstract ideas to physical objects, is processed in the same brain areas. They scanned the

#### Setup ####

##### Dependencies #####
First, let's install the dependencies we need to test our hypothesis.

In [1]:
# Install dependencies
%pip install ftfy regex tqdm scikit-learn numpy matplotlib
%pip install -U gdown
%pip install git+https://github.com/openai/CLIP.git


##### Data #####
Now we can fetch our data. <br>
For this project, we created a Drive folder containing the relevant code and data from the original Pereira et al. paper. Because the list of concepts and their images in the original paper is fixed, we pre-computed all the CLIP embeddings (the exact code is in "one_time_drive_setup") and saved them to Drive. Let's download everything so we can continue with the analysis. <br>

In [2]:
import platform
from pathlib import Path

# If the data already exists, we don't need to download it again
if not Path("data").exists():
    # Check the operating system - handling the data is done differently on Windows and Linux.
    # This lets us run the code on Colab, locally, and on any other platform we may choose.
    IS_WINDOWS = platform.system() == "Windows"

    if IS_WINDOWS:
        !python -m gdown --folder --id 1CwmFOsYFnq6t33KAzpvw0gaOTQXbcozs -O ./data/
        !powershell -NoProfile -Command "Expand-Archive -Path ./data/experiment-images.zip -DestinationPath ./data/ -Force"
        !powershell -NoProfile -Command "Remove-Item ./data/experiment-images.zip"
    else:
        !gdown --folder --id 1CwmFOsYFnq6t33KAzpvw0gaOTQXbcozs --output ./data
        # Extract into ./data/ (as the Windows branch does), not into the working directory
        !unzip -o -q ./data/experiment-images.zip -d ./data/
        !rm ./data/experiment-images.zip
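
Before moving on, it can help to confirm that the download produced the files the later cells read. The check below is a minimal sketch: the file names are the ones loaded elsewhere in this notebook, while `find_missing_files` and `EXPECTED_FILES` are hypothetical helper names introduced only for illustration.

```python
from pathlib import Path

# Paths used by later cells in this notebook, relative to the data folder.
# Adjust the list if your Drive folder is laid out differently.
EXPECTED_FILES = [
    "clip_text_embeddings.npz",
    "clip_image_embeddings.npz",
    "brain-responses-data/examples_180concepts_wordclouds.mat",
]

def find_missing_files(data_dir="data", expected=EXPECTED_FILES):
    """Return the expected files that are not present under data_dir."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).is_file()]

missing = find_missing_files()
if missing:
    print("Missing files:", missing)
else:
    print("All expected data files are present.")
```

If anything is reported missing, deleting the `data` folder and re-running the download cell is probably the simplest fix, since that cell skips the download whenever `data` already exists.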

For the embeddings, we saved the following:
* <u>Textual data</u> - we saved the embedding of the prompt "A picture of {c}", where c is the name of the concept. We chose this because CLIP responds well to prompting, and the prompted embedding gives better results than the bare concept name.
* <u>Visual data</u> - each concept is illustrated by 6 separate images, so we embedded all of them into CLIP's embedding space. Since this yields several vectors per concept rather than one, there is more than one way to combine them, but the underlying embeddings are the same in every case, so we saved the embeddings of all images.
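
One simple way to turn the six image embeddings of a concept into a single vector is mean pooling. The sketch below is only an illustration of that option, not necessarily the approach used later. It assumes the image embeddings are stored as an array of shape `(concepts, images_per_concept, dim)` and uses 512 dimensions as a placeholder; the actual layout and size of `clip_image_embeddings` may differ. `l2_normalize` and `pool_image_embeddings` are hypothetical helper names.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit Euclidean length."""
    norms = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norms, eps)

def pool_image_embeddings(image_embeddings):
    """Collapse (concepts, images_per_concept, dim) to (concepts, dim).

    Each image vector is normalized first so that no single image dominates
    the average, then the mean vector is normalized again.
    """
    unit = l2_normalize(image_embeddings)
    return l2_normalize(unit.mean(axis=1))

# Synthetic stand-in (assumed layout): 180 concepts, 6 images each, 512 dims.
rng = np.random.default_rng(0)
fake_image_embeddings = rng.normal(size=(180, 6, 512))
pooled = pool_image_embeddings(fake_image_embeddings)
print(pooled.shape)  # (180, 512)
```

Other choices, such as scoring against each of the six vectors separately and keeping the best match, start from the same saved embeddings.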

In [None]:
import numpy as np

with np.load("data/clip_text_embeddings.npz") as text_embeddings:
    clip_text_embeddings = text_embeddings["data"]

with np.load("data/clip_image_embeddings.npz") as image_embeddings:
    clip_image_embeddings = image_embeddings["data"]

#### Training ####
As stated earlier, we keep Pereira et al.'s method of learning a **linear** decoder that can generalize from the data it saw to unseen data, and thus capture meaning beyond the training data itself. In the original paper, they achieved results **much** better than chance, implying that a linear decoder is sufficient to capture the meaning of textual-fMRI data once the data is projected onto an embedding space. <br>
We changed the embedding space from GloVe to CLIP in order to see whether textual-fMRI data can capture the meaning of images as well, but we believe that adding extra complexity to the model would defeat our purpose. If our hypothesis is correct, a linear decoder will be able to capture the meaning of the images in the same embedding space without any non-linearity: a simple model should suffice, just as it did for same-modality data. <br>
For that reason, we train our decoder in the same way as in the original paper:
* For each participant's fMRI data - we only take the top 5000 relevant `voxels`.
* For the decoder - we learn a simple ridge regression (using the same code).
<br>
Another important point is that we only feed the training function **textual** fMRI data, because our hypothesis states that the brain areas responsible for interpreting textual data can also give us insight into visual data. We therefore train our model on textual data only, and withhold all visual data for evaluation.
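
For reference, ridge regression without an intercept can be written as follows. The notation is ours: $X$ is the $C \times V$ matrix of voxel activations ($C$ concepts, $V$ selected voxels), $Y$ is the $C \times D$ matrix of embeddings, $M$ is the $V \times D$ decoder, and $\alpha$ is the regularization strength, which `RidgeCV` picks from a grid of candidate values.

```latex
\hat{M} = \arg\min_{M \in \mathbb{R}^{V \times D}}
          \lVert X M - Y \rVert_F^2 + \alpha \lVert M \rVert_F^2
\quad\Longrightarrow\quad
\hat{M} = \left( X^\top X + \alpha I_V \right)^{-1} X^\top Y
```

A decoded embedding for a new activation vector $x \in \mathbb{R}^{V}$ is then simply $\hat{M}^\top x$, which is what makes the decoder linear.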

First, let's define the function that selects the top 5000 voxels: <br>
The fMRI data from the experiments consists of a long series of voxels, each corresponding to the activation in a different area of the brain. The problem is that most voxels are uninformative and mostly contribute noise, which would reduce our model's accuracy. For that reason we clean up the data and keep only the 5000 most informative voxels out of the 200,000 in the original fMRI data, using the `select_top_voxels_indices` function:

In [5]:
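from sklearn.feature_selection import f_regression
import numpy as np

def select_top_voxels_indices(fmri_data, semantic_vectors, num_voxels=5000):
    """Return the indices of the num_voxels voxels that best predict any embedding dimension."""
    # One row of univariate F-scores (one score per voxel) for each embedding dimension
    f_scores = []
    for i in range(semantic_vectors.shape[1]):
        f, _ = f_regression(fmri_data, semantic_vectors[:, i])
        f_scores.append(f)

    f_scores = np.array(f_scores)
    # A voxel's score is its best F-score over all embedding dimensions
    voxel_scores = np.max(f_scores, axis=0)
    # argsort is ascending, so the last num_voxels entries are the highest-scoring voxels
    top_voxel_indices = np.argsort(voxel_scores)[-num_voxels:]

    return top_voxel_indices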
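
To make the selection criterion concrete: for one voxel and one embedding dimension with Pearson correlation $r$ over $n$ concepts, the univariate F-statistic is $r^2 / (1 - r^2) \cdot (n - 2)$. The sketch below computes the same kind of score for all voxel/dimension pairs at once with plain NumPy and checks it on synthetic data in which only the first 20 voxels carry signal. `top_voxels_by_f_score` is a hypothetical name, and this is an illustrative re-derivation under those assumptions, not the function used in the rest of the notebook.

```python
import numpy as np

def top_voxels_by_f_score(fmri_data, semantic_vectors, num_voxels=5000):
    """Rank voxels by their best univariate F-score over all embedding dims."""
    n_samples = fmri_data.shape[0]
    # Center both matrices so that dot products become covariances
    x = fmri_data - fmri_data.mean(axis=0)
    y = semantic_vectors - semantic_vectors.mean(axis=0)
    denom = np.outer(np.linalg.norm(x, axis=0), np.linalg.norm(y, axis=0))
    with np.errstate(divide="ignore", invalid="ignore"):
        corr = (x.T @ y) / denom              # (voxels, dims) Pearson correlations
    corr = np.where(denom > 0, corr, 0.0)     # constant columns get a score of 0
    r2 = np.clip(corr ** 2, 0.0, 1.0 - 1e-12)
    f_scores = r2 / (1.0 - r2) * (n_samples - 2)
    voxel_scores = f_scores.max(axis=1)       # best score per voxel
    return np.argsort(voxel_scores)[-num_voxels:]

# Synthetic check: 2000 noise voxels, of which the first 20 follow the targets
rng = np.random.default_rng(0)
n_concepts, n_voxels, n_dims, n_informative = 180, 2000, 8, 20
fake_vectors = rng.normal(size=(n_concepts, n_dims))
fake_fmri = rng.normal(size=(n_concepts, n_voxels))
for v in range(n_informative):
    fake_fmri[:, v] = fake_vectors[:, v % n_dims] + 0.1 * rng.normal(size=n_concepts)

selected = top_voxels_by_f_score(fake_fmri, fake_vectors, num_voxels=n_informative)
print(sorted(selected.tolist()))
```

Note that either version materializes a voxels-by-dimensions score matrix, so memory use grows with both the number of voxels and the embedding size.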

Now, let's take the fMRI textual data from our data folder:

In [12]:
import scipy.io

mat = scipy.io.loadmat("data/brain-responses-data/examples_180concepts_wordclouds.mat")
fmri_text_data = mat["examples"]

and get only our top voxels:

In [15]:
top_voxel_indices = select_top_voxels_indices(fmri_text_data, clip_text_embeddings)
reduced_fmri_data = fmri_text_data[:, top_voxel_indices]

Finally, let's train our textual decoder. We'll use the function "learn_decoder", which we took directly from the original paper:

In [16]:
""" learn_decoder """
import sklearn.linear_model

def learn_decoder(data, vectors):
     """ Given data (a CxV matrix of V voxel activations per C concepts)
     and vectors (a CxD matrix of D semantic dimensions per C concepts)
     find a matrix M such that the dot product of M and a V-dimensional 
     data vector gives a D-dimensional decoded semantic vector. 

     The matrix M is learned using ridge regression:
     https://en.wikipedia.org/wiki/Tikhonov_regularization
     """
     ridge = sklearn.linear_model.RidgeCV(
         alphas=[1, 10, .01, 100, .001, 1000, .0001, 10000, .00001, 100000, .000001, 1000000],
         fit_intercept=False
     )
     ridge.fit(data, vectors)
     return ridge.coef_.T

The training itself:

In [18]:
decoder = learn_decoder(reduced_fmri_data, clip_text_embeddings)
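
Since `learn_decoder` returns `ridge.coef_.T`, the decoder is a `(voxels, dims)` matrix and decoding is a single matrix product, `fmri @ decoder`. The toy example below illustrates those shapes on synthetic data. To stay self-contained it uses a fixed-`alpha` closed-form ridge solve as a stand-in for `RidgeCV`; `ridge_closed_form`, `cosine_similarity_matrix`, and all sizes are assumptions made for illustration.

```python
import numpy as np

def ridge_closed_form(data, vectors, alpha=1.0):
    """Ridge without intercept: M = (X^T X + alpha I)^-1 X^T Y, shape (V, D)."""
    n_voxels = data.shape[1]
    gram = data.T @ data + alpha * np.eye(n_voxels)
    return np.linalg.solve(gram, data.T @ vectors)

def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Synthetic data: embeddings are a noisy linear function of the "voxels"
rng = np.random.default_rng(0)
n_concepts, n_voxels, n_dims = 60, 200, 16
true_map = rng.normal(size=(n_voxels, n_dims))
fake_fmri = rng.normal(size=(n_concepts, n_voxels))
fake_embeddings = fake_fmri @ true_map + 0.1 * rng.normal(size=(n_concepts, n_dims))

toy_decoder = ridge_closed_form(fake_fmri, fake_embeddings, alpha=1.0)
decoded = fake_fmri @ toy_decoder                 # (concepts, dims)
similarity = cosine_similarity_matrix(decoded, fake_embeddings)
nearest = similarity.argmax(axis=1)               # closest embedding per decoded vector
train_match_rate = float((nearest == np.arange(n_concepts)).mean())

print(toy_decoder.shape, decoded.shape)
print("training concepts matched to themselves:", train_match_rate)
```

With more voxels than concepts, a ridge decoder can fit its own training data almost perfectly, so a high match rate here says nothing about generalization; only held-out concepts can show that.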

#### Evaluation ####
For evaluation, we're going to have the following guiding principles:
1. "Distance" between datapoints - 


2. Strategy - quality fMRI data is extremely limited. Pereira et al. is still the only open English fMRI dataset for single-concept data similar to what we're trying to model. There are some alternatives (Allen 672, which is in Chinese, and Tuckute 2024, which covers full sentences only, six words each), and each of these datasets has no more than 16 participants, so we don't have much data to work with. That means we want to base our decoder on all of the available data, scarce as it is. We therefore evaluate with K-Fold Cross Validation: that way, every piece of data is used for training in some fold, and we can still evaluate the model on data it has not seen.
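
A K-fold evaluation loop could look roughly like the sketch below. It is an illustration under several assumptions rather than the final evaluation code: it uses a fixed-`alpha` ridge solve in place of `learn_decoder`, scores each held-out concept by the cosine-similarity rank of its true embedding among all candidates (one possible choice of "distance"), and runs on synthetic data. All function names are hypothetical. In a real run, voxel selection would also belong inside the loop, fitted on the training fold only, so that held-out concepts do not influence which voxels are kept.

```python
import numpy as np

def kfold_indices(n_items, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs that cover every item exactly once."""
    order = np.random.default_rng(seed).permutation(n_items)
    for test_idx in np.array_split(order, n_folds):
        yield np.setdiff1d(order, test_idx), test_idx

def fit_ridge(data, vectors, alpha):
    """Fixed-alpha ridge without intercept (stand-in for learn_decoder)."""
    gram = data.T @ data + alpha * np.eye(data.shape[1])
    return np.linalg.solve(gram, data.T @ vectors)

def rank_accuracy(decoded, targets, true_idx):
    """Mean normalized rank of the true target (1.0 = always ranked first)."""
    d = decoded / np.linalg.norm(decoded, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    sims = d @ t.T                                    # (n_test, n_targets)
    true_sims = sims[np.arange(len(true_idx)), true_idx][:, None]
    ranks = (sims > true_sims).sum(axis=1)            # candidates beating the truth
    return 1.0 - ranks.mean() / (targets.shape[0] - 1)

def cross_validate(data, vectors, n_folds=5, alpha=1.0):
    """Average rank accuracy over folds; the decoder never sees its test concepts."""
    scores = []
    for train_idx, test_idx in kfold_indices(data.shape[0], n_folds):
        fold_decoder = fit_ridge(data[train_idx], vectors[train_idx], alpha)
        decoded = data[test_idx] @ fold_decoder
        scores.append(rank_accuracy(decoded, vectors, test_idx))
    return float(np.mean(scores))

# Synthetic data with a real linear relation, plus a shuffled control
rng = np.random.default_rng(0)
fake_fmri = rng.normal(size=(90, 40))
fake_vectors = fake_fmri @ rng.normal(size=(40, 16)) + 0.1 * rng.normal(size=(90, 16))
score = cross_validate(fake_fmri, fake_vectors)
shuffled_score = cross_validate(fake_fmri, fake_vectors[rng.permutation(90)])
print("rank accuracy:", score, "shuffled control:", shuffled_score)
```

The shuffled control breaks the link between activations and embeddings, so its score should hover around 0.5, which is the chance level for this metric.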

### Multimodal to Cross-Modality ###
