Author: Oyetunji Abioye

**LAB1 - NeSy Recommendation System**

Movie Recommendation System
Recommend movies to users based on explicit preferences.

We will go through probabilistic uncertainty, vague knowledge, commonsense reasoning, and similarity-based inference. More precisely:

* Classical Prolog= Structured rules for explicit knowledge (e.g., “Alice likes Sci-Fi → Recommend Sci-Fi movies”)
* Probabilistic Prolog: Uncertain preferences (e.g., “Alice probably likes Sci-Fi with 80% confidence”)
* Possibilistic Prolog Vague or imprecise knowledge (e.g., “If Alice likes Sci-Fi, she might like Cyberpunk”)
* Commonsense Reasoning	Interpolation & analogy-based reasoning (e.g., “Alice likes Action and Sci-Fi → Infer she might like Cyberpunk”)
* Similarity-Based Reasoning	Vector embeddings for contextual similarity (e.g., “Inception is similar to Interstellar → Recommend Interstellar”)

Let's do concept spaces

Let's start with the following step

* Step 1: Define the Logical Rules (Prolog)
* Step 2: Introduce Probabilistic Reasoning (Probabilistic Prolog)
* Step 3: Implement Commonsense Reasoning
* Step 4: Implement Similarity-Based Reasoning Using Vector Embeddings
* Step 5:
    - Integrate real-world datasets (IMDB, MovieLens)
    - Enhance user personalization (e.g., feedback loops)
    - Reason with Real Embeddings (embed IMDB for example)
* Step 3: Handle Vague Knowledge (Possibilistic Prolog)

[Lecture Reference](https://colab.research.google.com/drive/1TmrVRMbc83WaIK27eDfAseiIVRTZ-Cnc?usp=sharing)


[Python Prolog Documentation](https://github.com/yuce/pyswip)

[Python Problog Documentation](https://dtai.cs.kuleuven.be/problog/tutorial/advanced/01_python_interface.html)


In [None]:
!apt-get install -y swi-prolog
! pip install pyswip --quiet

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
swi-prolog is already the newest version (8.4.2+dfsg-2ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 19 not upgraded.


 ## Step 1: Define the Logical Rules (Prolog)

In [2]:
from pyswip import Prolog

prolog = Prolog()

prolog.assertz("parent(john, alice)")
prolog.assertz("parent(alice, bob)")
prolog.assertz("grandparent(X, Y) :- parent(X, Z), parent(Z, Y)")

result = list(prolog.query("grandparent(X, bob)"))
print(result)

[{'X': 'john'}]


In [3]:
Prolog.assertz("father(michael,john)")
Prolog.assertz("father(michael,gina)")
list(Prolog.query("father(michael,X)")) == [{'X': 'john'}, {'X': 'gina'}]
for soln in Prolog.query("father(X,Y)"):
    print(soln["X"], "is the father of", soln["Y"])


michael is the father of john
michael is the father of gina


## Step 2: Introduce Probabilistic Reasoning (Probabilistic Prolog)

In [4]:
!pip install problog -q


In [5]:
from problog.program import PrologString
from problog import get_evaluatable

problog_code = """
% Facts with probabilities
0.9::parent(john, alice).
0.8::parent(alice, bob).
grandparent(X, Y) :- parent(X, Z), parent(Z, Y).
query(grandparent(X, bob)).
"""

get_evaluatable().create_from(problog_code).evaluate()

{grandparent(john,bob): 0.7200000000000001}

## Step 4: Implement Similarity-Based Reasoning Using Vector Embeddings

In [6]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

movie_embeddings = {
    "m1": np.array([0.9, 0.8, 0.1]),
    "m2": np.array([0.85, 0.75, 0.15]),
    "m3": np.array([0.2, 0.1, 0.9])
}

In [7]:
# define a function that use embedding to find similar alternatives for examples (cos similarity, etc)
# integrate this function with prolog

def find_similar_movies(movie_name, movie_embeddings, top_n=2):
  if movie_name not in movie_embeddings:
    return []

  movie_vector = movie_embeddings[movie_name].reshape(1, -1)
  similarities = []
  for other_movie, other_vector in movie_embeddings.items():
      if other_movie != movie_name:
          similarity = cosine_similarity(movie_vector, other_vector.reshape(1, -1))[0][0]
          similarities.append((other_movie, similarity))

  similarities.sort(key=lambda x: x[1], reverse=True)
  return similarities[:top_n]


similar_movies = find_similar_movies("m1", movie_embeddings)
print("m1 is similar to m2", similar_movies)

similar_movies = find_similar_movies("m3", movie_embeddings)
print("m3 is disimilar from m1 and m2", similar_movies)


m1 is similar to m2 [('m2', 0.9988075343833681), ('m3', 0.3123506333062124)]
m3 is disimilar from m1 and m2 [('m2', 0.3583550444099566), ('m1', 0.3123506333062124)]


In [8]:
! pip install fireducks datasets -q

# mention encoder
We now know how to develop mention encoders, so we can develop one for movies:
*	Take the list of movie names.
*	For each name, find mention sentences in a corpus.
* Based on these sentences, extract representations using a BERT-family encoder.
* Average the mention vectors.
* Done.


In [10]:
from collections import deque

# List of movie names
movie_names =  deque([
    "The Shawshank Redemption",
    "The Godfather",
    "The Dark Knight",
    "Pulp Fiction",
    "Forrest Gump",
    "The Lord of the Rings",
    "Inception",
    "The Matrix",
    "The Silence of the Lambs",
    "The Lion King",
    "Telnet"])


print(movie_names)

deque(['The Shawshank Redemption', 'The Godfather', 'The Dark Knight', 'Pulp Fiction', 'Forrest Gump', 'The Lord of the Rings', 'Inception', 'The Matrix', 'The Silence of the Lambs', 'The Lion King', 'Telnet'])


In [11]:
import re
import fireducks.pandas as pd
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("stanfordnlp/imdb", split="train")

df = pd.DataFrame(dataset)

print("DF size before filter", df.shape)
print(df.head(5))

# Create a regex pattern that matches any of the items
pattern = '|'.join(map(re.escape, movie_names))

# Use str.contains to create a boolean mask (na=False avoids errors with NaN)
mask = df['text'].str.contains(pattern, na=False)

# Filter the DataFrame using the mask
filtered_df = df[mask]

print("DF size after filter", filtered_df.shape)
print(filtered_df.head(5))



DF size before filter (25000, 2)
                                                text  label
0  I rented I AM CURIOUS-YELLOW from my video sto...      0
1  "I Am Curious: Yellow" is a risible and preten...      0
2  If only to avoid making this type of film in t...      0
3  This film was probably inspired by Godard's Ma...      0
4  Oh, brother...after hearing about this ridicul...      0
DF size after filter (213, 2)
                                                  text  label
74   I'm studying Catalan, and was delighted to fin...      0
257  Hail Bollywood and men Directors !<br /><br />...      0
460  An old intellectual talks about what he consid...      0
528  Scarecrow is set in the small American town of...      0
751  Note: I couldn't force myself to actually writ...      0


In [12]:
import torch
from transformers import AutoTokenizer, AutoModel

# Specify the BERT model
model_name = 'bert-base-uncased'

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Tokenize the text column
inputs = tokenizer(
    filtered_df['text'].tolist(),
    padding=True,
    truncation=True,
    return_tensors="pt"
)

with torch.no_grad():
    outputs = model(**inputs)

# Extract the embeddings for the [CLS] token for each sentence
# The [CLS] token is located at index 0 for each sequence
cls_embeddings = outputs.last_hidden_state[:, 0, :]

filtered_df['cls_embeddings'] = cls_embeddings.tolist()

print(filtered_df.head())

                                                  text  label  \
74   I'm studying Catalan, and was delighted to fin...      0   
257  Hail Bollywood and men Directors !<br /><br />...      0   
460  An old intellectual talks about what he consid...      0   
528  Scarecrow is set in the small American town of...      0   
751  Note: I couldn't force myself to actually writ...      0   

                                        cls_embeddings  
74   [0.12923839688301086, -0.33946046233177185, 0....  
257  [-0.39316117763519287, -0.09070698916912079, 0...  
460  [-0.02286955714225769, 0.169071763753891, 0.19...  
528  [-0.41711029410362244, -0.07180090248584747, 0...  
751  [0.0463026762008667, -0.6328936219215393, 0.39...  


In [13]:
print("Embedding shape", cls_embeddings.shape)
average_vector = torch.mean((cls_embeddings), dim=0)
print("Average mention vector shape:", average_vector.shape)

Embedding shape torch.Size([213, 768])
Average mention vector shape: torch.Size([768])


# fine-tune a mention Encoder
- you have your mentions of movies
- use a contrastive loss to move the mention vectors that are similar (same genre for example) together

In [14]:
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.nn.functional as F
import random
from torch.optim import AdamW


class ContrastiveDataset(Dataset):
    def __init__(self, df):
        self.df = df.reset_index(drop=True)

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        text1 = self.df.loc[idx, 'text']
        label1 = self.df.loc[idx, 'label']

        # Randomly choose another sample (ensuring it's not the same)
        other_idx = random.choice([i for i in range(len(self.df)) if i != idx])
        text2 = self.df.loc[other_idx, 'text']
        label2 = self.df.loc[other_idx, 'label']

        # Define the contrastive label: 1 if same rating, 0 otherwise
        sentiment_label = 1.0 if label1 == label2 else 0.0
        return text1, text2, sentiment_label


# Collate function for combining pairs into batches
def collate_fn(batch):
    text1, text2, sentiment_label = zip(*batch)
    return list(text1), list(np.array(text2)), torch.tensor(sentiment_label, dtype=torch.float32)


dataset = ContrastiveDataset(filtered_df)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_fn)

In [15]:
class ContrastiveLoss(nn.Module):
    def __init__(self, margin=1.0):
        super(ContrastiveLoss, self).__init__()
        self.margin = margin

    def forward(self, output1, output2, label):
        # Compute Euclidean distance between two batches of embeddings
        euclidean_distance = F.pairwise_distance(output1, output2)

        # Loss for similar pairs: squared distance;
        loss_similar = label * torch.pow(euclidean_distance, 2)

        # Loss for dissimilar pairs: squared difference between margin and distance
        loss_dissimilar = (1 - label) * torch.pow(torch.clamp(self.margin - euclidean_distance, min=0.0), 2)

        loss = torch.mean(loss_similar + loss_dissimilar)
        return loss

criterion = ContrastiveLoss(margin=1.0)

In [16]:
optimizer = AdamW(model.parameters(), lr=2e-5)

# -----------------------------
# Training Loop
# -----------------------------
num_epochs = 3
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(num_epochs):
    epoch_loss = 0.0
    for texts1, texts2, labels in dataloader:
        # Tokenize both sets of texts
        inputs1 = tokenizer(texts1, padding=True, truncation=True, return_tensors="pt")
        inputs2 = tokenizer(texts2, padding=True, truncation=True, return_tensors="pt")

        # Move input tensors to the device (CPU or GPU)
        inputs1 = {key: val.to(device) for key, val in inputs1.items()}
        inputs2 = {key: val.to(device) for key, val in inputs2.items()}
        labels = labels.to(device)

        # Forward pass: get the outputs and extract the CLS embedding ([CLS] token is at index 0)
        outputs1 = model(**inputs1)
        outputs2 = model(**inputs2)
        cls_embeddings1 = outputs1.last_hidden_state[:, 0, :]
        cls_embeddings2 = outputs2.last_hidden_state[:, 0, :]

        # Compute the contrastive loss using the embeddings and similarity labels
        loss = criterion(cls_embeddings1, cls_embeddings2, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    avg_loss = epoch_loss / len(dataloader)
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")

Epoch 1/3, Loss: 2.6969
Epoch 2/3, Loss: 0.2768
Epoch 3/3, Loss: 0.2419


# entity encoder
We now assume we don't have mentions for movie names. We will devlopp a entity encoder. We can train our enoder using the following prompt. "Movie has Genre which mean [Mask]", or simply "Movie has genre [CLS]". Train the encoder using a cross-entropy loss.


In [17]:
# Load the Movielens dataset
dataset = load_dataset("ashraq/movielens_ratings", split="train")

# Randomly select 200 examples
sampled_dataset = dataset.shuffle(seed=42).select(range(200))

sampled_df = pd.DataFrame(sampled_dataset)

print(sampled_df.head(5))

      imdbId  tmdbId  movie_id  user_id  rating  \
0  tt2345759  282035      7400    10785     4.0   
1  tt2278388  120467       545    41527     3.0   
2  tt4882376  433247      7738    12551     3.0   
3  tt6966692  490132     12323    28576     4.5   
4  tt2436386  227719      1546    10178     3.5   

                                title  \
0                    The Mummy (2017)   
1    Grand Budapest Hotel, The (2014)   
2  First They Killed My Father (2017)   
3                   Green Book (2018)   
4              Project Almanac (2015)   

                                     genres  \
0  Action|Adventure|Fantasy|Horror|Thriller   
1                              Comedy|Drama   
2                                     Drama   
3                              Comedy|Drama   
4                           Sci-Fi|Thriller   

                                             posters  
0  https://m.media-amazon.com/images/M/MV5BMTkwMT...  
1  https://m.media-amazon.com/images/M/MV5BMzM5Nj... 

In [18]:
class EntityDataset(Dataset):
    def __init__(self, df, tokenizer, max_length=32):
        self.df = df.reset_index(drop=True)
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # Get first genre from split by "|"
        genre_word = self.df.loc[idx, 'genres'].split("|")[0]

        # Construct our prompt
        prompt = f"Movie has Genre which mean {self.tokenizer.mask_token}"

        # Tokenize the prompt text
        encoding = self.tokenizer(prompt,
                                  max_length=self.max_length,
                                  truncation=True,
                                  return_tensors="pt")
        # Remove the batch dimension
        input_ids = encoding["input_ids"].squeeze(0)
        attention_mask = encoding["attention_mask"].squeeze(0)

        # Create labels: All tokens are ignored except the one corresponding to the mask
        # All other positions are set to -100
        labels = torch.full(input_ids.shape, -100)

        # Find the index of the mask token
        mask_positions = (input_ids == self.tokenizer.mask_token_id).nonzero(as_tuple=True)
        if len(mask_positions[0]) == 0:
            raise ValueError("No mask token found in input_ids!")
        mask_index = mask_positions[0].item()  # assume only one mask token in our prompt

        # Convert the genre word to its token id.
        # Ensure that the genre word is in the tokenizer's vocabulary as a single token
        target_token_id = self.tokenizer.convert_tokens_to_ids(genre_word)
        labels[mask_index] = target_token_id

        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels
        }


def collate_fn(batch):
    input_ids = [item["input_ids"] for item in batch]
    attention_masks = [item["attention_mask"] for item in batch]
    labels = [item["labels"] for item in batch]

    input_ids = torch.nn.utils.rnn.pad_sequence(input_ids, batch_first=True, padding_value=tokenizer.pad_token_id)
    attention_masks = torch.nn.utils.rnn.pad_sequence(attention_masks, batch_first=True, padding_value=0)
    labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=-100)

    return {
        "input_ids": input_ids,
        "attention_mask": attention_masks,
        "labels": labels
    }

In [19]:
from transformers import BertTokenizer, BertForMaskedLM, AdamW

df = pd.DataFrame(sampled_df)

# We use BertForMaskedLM which automatically computes
# cross-entropy loss over the masked position when given labels.
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)


dataset = EntityDataset(df, tokenizer, max_length=32)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_fn)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
optimizer = AdamW(model.parameters(), lr=2e-5)

num_epochs = 3

for epoch in range(num_epochs):
    model.train()  # set model to training mode
    total_loss = 0.0
    for batch in dataloader:
        # Move tensors to device
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        # Forward pass - the model will compute loss if labels are provided
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss

        # Backward pass and optimizer update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / len(dataloader)
    print(f"Epoch {epoch+1}/{num_epochs} - Loss: {avg_loss:.4f}")

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another archite

Epoch 1/3 - Loss: 0.2319
Epoch 2/3 - Loss: 0.0000
Epoch 3/3 - Loss: 0.0000


# Joint encoder on logical formulas
- consider a set of rules
- form pairs of (premises, hypothesis) pairs where there are entailment
- train a joint encoder with [CLS] p_1, p_2, ...,p_n [Sep]h[sep]
- use binary cross-enropy

to predict entailment, feed [CLS] to a linear classifier