<a href="https://colab.research.google.com/github/Pearlkakande/machinelearning/blob/main/models/models1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


MODEL TRAINING FOR GRAPH-BASED KNOWLEDGE DISTILLATION AND FEATURE LEARNING FOR BOOK RECOMMENDATIONS

Ten models will be trained, with evaluation metrics logged to Weights & Biases (wandb).

Dataset: Eitanli/goodreads, loaded with the Hugging Face datasets library.

SET UP

packages, dataset, wandb

In [2]:
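# Install the required packages (comment these lines out if they are already installed).
!pip install datasets torch wandb sentence-transformers scikit-learn
# The PyG wheel index below targets torch 2.0.0 (CPU); it should match the installed torch version.
!pip install torch-scatter torch-sparse torch-cluster torch-geometric -f https://data.pyg.org/whl/torch-2.0.0+cpu.html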
import torch
import torch.nn.functional as F
from torch_geometric.data import HeteroData
from torch_geometric.loader import DataLoader  # DataLoader lives in torch_geometric.loader since PyG 2.0
from torch_geometric.nn import GCNConv, GATConv  # and other layers as needed
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import wandb
import numpy as np
import pandas as pd



WANDB

In [None]:
# Log in to W&B once; later cells reuse the stored credentials.
wandb.login()

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: pearlkakande (pearlkakande-makerere-university) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


True

Dataset Loading

In [4]:
# Load the 'train' split of the dataset from Hugging Face and convert it to a DataFrame.
dataset = load_dataset("Eitanli/goodreads", split="train")
df = dataset.to_pandas()
print(df.columns.tolist())


README.md:   0%|          | 0.00/737 [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


goodreads_data.csv:   0%|          | 0.00/11.7M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/10000 [00:00<?, ? examples/s]

['Unnamed: 0', 'Book', 'Author', 'Description', 'Genres', 'Avg_Rating', 'Num_Ratings', 'URL']


GRAPH CONSTRUCTION

Description embedding, splitting of dataset

Books, authors, and genres are the nodes.

Edges encode the relationships between them, and rating-based popularity is stored as a node attribute (it could alternatively serve as an edge weight).

In [7]:
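# Use a pre-trained sentence transformer to embed the book descriptions.
# Encoding the whole column in one call lets the model batch the texts.
model_st = SentenceTransformer('all-MiniLM-L6-v2')
desc_emb = model_st.encode(df['Description'].fillna("").tolist(), show_progress_bar=True)
df['desc_emb'] = list(desc_emb)  # one embedding vector per row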


# Split into train/test (random 80/20 split with a fixed seed)
train_df = df.sample(frac=0.8, random_state=42)
test_df = df.drop(train_df.index)

# Build a simple heterogeneous graph (using PyG's HeteroData):
def build_hetero_graph(df):
    data = HeteroData()

    # Create book nodes: use description embeddings as features.
    book_emb = np.stack(df['desc_emb'].values)
    data['book'].x = torch.tensor(book_emb, dtype=torch.float)

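    # Map each distinct author and each distinct 'Genres' value to an integer id.
    # Note: a 'Genres' cell is used as-is, so a cell that lists several genres becomes a single node.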
    authors = list(df['Author'].unique())
    genres = list(df['Genres'].unique())
    author2id = {a: i for i, a in enumerate(authors)}
    genre2id = {g: i for i, g in enumerate(genres)}

    # Create author nodes with one-hot features.
    data['author'].num_nodes = len(authors)
    data['author'].x = F.one_hot(torch.arange(len(authors)), num_classes=len(authors)).float()

    # Create genre nodes.
    data['genre'].num_nodes = len(genres)
    data['genre'].x = F.one_hot(torch.arange(len(genres)), num_classes=len(genres)).float()

    # Build edges: book -> author and book -> genre.
    book_ids = np.arange(len(df))
    author_ids = [author2id[a] for a in df['Author']]
    genre_ids = [genre2id[g] for g in df['Genres']]

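    # Convert to a NumPy array first; building a tensor from a list of arrays is slow.
    data['book', 'written_by', 'author'].edge_index = torch.tensor(np.array([book_ids, author_ids]), dtype=torch.long)
    data['book', 'has_genre', 'genre'].edge_index = torch.tensor(np.array([book_ids, genre_ids]), dtype=torch.long)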

    # Build book-to-book similarity edges from the cosine similarity between description embeddings.
    # An edge is added for every pair with similarity > 0.9; the model below propagates over these edges.
    from sklearn.metrics.pairwise import cosine_similarity
    sim_matrix = cosine_similarity(book_emb)
    src, dst = np.where(sim_matrix > 0.9)
    # Remove self-loops.
    mask = src != dst
    data['book', 'similar_to', 'book'].edge_index = torch.tensor(np.stack([src[mask], dst[mask]]), dtype=torch.long)

    # Store rating-based popularity (ratings count and average rating) as book node attributes.
    # 'Num_Ratings' holds strings with thousands separators; convert it to int.
    # This modifies df in place, and later cells rely on the converted column.
    # astype(str) makes the conversion safe to re-run on an already converted column.
    df['Num_Ratings'] = df['Num_Ratings'].astype(str).str.replace(',', '').astype(int)
    data['book'].ratings_count = torch.tensor(df['Num_Ratings'].values.astype(float), dtype=torch.float)
    data['book'].average_rating = torch.tensor(df['Avg_Rating'].values.astype(float), dtype=torch.float)

    return data

data = build_hetero_graph(train_df)  # build graph from training data
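
The graph above creates one genre node per distinct value of the Genres column. If that column stores a stringified Python list per book (an assumption worth checking with df['Genres'].iloc[0]), each genre can be given its own node instead. The following is a minimal, dependency-free sketch; parse_genres and build_genre_edges are hypothetical helpers that are not part of this notebook, and the toy input is made up.

```python
import ast

def parse_genres(raw):
    """Parse a string such as "['Fiction', 'Classics']" into a list of genre names."""
    if not isinstance(raw, str):
        return []  # missing values (None, NaN) contribute no genres
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a Python literal: treat the whole string as a single genre.
        text = raw.strip()
        return [text] if text else []
    if isinstance(value, (list, tuple)):
        return [str(g) for g in value]
    return [str(value)]

def build_genre_edges(genre_column):
    """Return (genre2id, book_ids, genre_ids) with one edge per (book, genre) pair."""
    genre2id, book_ids, genre_ids = {}, [], []
    for book_id, raw in enumerate(genre_column):
        for genre in parse_genres(raw):
            gid = genre2id.setdefault(genre, len(genre2id))
            book_ids.append(book_id)
            genre_ids.append(gid)
    return genre2id, book_ids, genre_ids

# Toy input standing in for df['Genres'] (values are made up for illustration).
column = ["['Fiction', 'Classics']", "['Fiction']", "[]"]
genre2id, book_ids, genre_ids = build_genre_edges(column)
print(genre2id)              # {'Fiction': 0, 'Classics': 1}
print(book_ids, genre_ids)   # [0, 0, 1] [0, 1, 0]
```

The two id lists have the same layout as the book_ids and genre_ids used for the ('book', 'has_genre', 'genre') edge_index, except that a book may now appear in several edges.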

MODEL 1 LightGCN for Book Recommendation



In [10]:
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch import nn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import pandas as pd
import wandb

# Start a wandb run for this model (assumes wandb.login() was called in an earlier cell).
wandb.init(project="book-recommendation",               # W&B project name
           entity="pearlkakande-makerere-university",   # W&B username or team name
           config={"model": "LightGCN", "epochs": 30, "learning_rate": 0.01})

class LightGCN(nn.Module):
    # Note: this is a GCN-style classifier (GCNConv layers with ReLU). The LightGCN
    # propagation rule itself uses no feature transformation or non-linearity.
    def __init__(self, in_channels, hidden_channels, num_layers=2):
        super().__init__()
        self.convs = nn.ModuleList()
        self.convs.append(GCNConv(in_channels, hidden_channels))
        for _ in range(num_layers - 1):
            self.convs.append(GCNConv(hidden_channels, hidden_channels))
        # Final linear layer. Its input has hidden_channels + 1 features because
        # forward() concatenates the scalar ratings_count to each book embedding.
        self.lin = nn.Linear(hidden_channels + 1, 2)  # binary classification: warm vs cold

    def forward(self, data):
        # Use only 'book' nodes for prediction.
        x = data['book'].x  # initial features from description embeddings
        for conv in self.convs:
            x = conv(x, data['book', 'similar_to', 'book'].edge_index)
            x = F.relu(x)
        # Concatenate popularity signal (ratings_count)
        ratings = data['book'].ratings_count.unsqueeze(1)
        x_cat = torch.cat([x, ratings], dim=1)
        out = self.lin(x_cat)
        return out, x_cat  # also return latent embedding for potential distillation

# Label a book as warm (1) if its ratings count exceeds the training median, otherwise cold (0).
threshold = train_df['Num_Ratings'].median()
labels = (train_df['Num_Ratings'] > threshold).astype(int).values
labels = torch.tensor(labels, dtype=torch.long)

# Set device and create model instance
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model1 = LightGCN(in_channels=data['book'].x.size(1), hidden_channels=64, num_layers=2).to(device)
optimizer = torch.optim.Adam(model1.parameters(), lr=0.01)

# Move data to device
data['book'].x = data['book'].x.to(device)
data['book', 'similar_to', 'book'].edge_index = data['book', 'similar_to', 'book'].edge_index.to(device)
data['book'].ratings_count = data['book'].ratings_count.to(device)
labels = labels.to(device)

# Training loop for 30 epochs
num_epochs = 30
model1.train()
for epoch in range(num_epochs):
    optimizer.zero_grad()
    out, _ = model1(data)
    loss = F.cross_entropy(out, labels)
    loss.backward()
    optimizer.step()

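    # Metrics on the training data (for demo purposes), computed from the
    # predictions made before this step's parameter update; test_df is not used here.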
    preds = out.argmax(dim=1).cpu().numpy()
    truelabels = labels.cpu().numpy()
    acc = accuracy_score(truelabels, preds)
    prec = precision_score(truelabels, preds, zero_division=0)
    rec = recall_score(truelabels, preds, zero_division=0)
    f1 = f1_score(truelabels, preds, zero_division=0)

    wandb.log({
        "Model": "LightGCN",
        "epoch": epoch,
        "loss": loss.item(),
        "accuracy": acc,
        "precision": prec,
        "recall": rec,
        "f1": f1
    })

    if epoch % 10 == 0:
        print(f"Epoch {epoch}: Loss={loss.item():.4f}, Acc={acc:.4f}, Prec={prec:.4f}, Rec={rec:.4f}, F1={f1:.4f}")

# After training, create a summary table for the training set.
results = {"Model": "LightGCN", "Loss": loss.item(), "Accuracy": acc, "Precision": prec, "Recall": rec, "F1": f1}
print(pd.DataFrame([results]))

# Finish the wandb run
wandb.finish()


Epoch 0: Loss=50.7437, Acc=0.5009, Prec=0.5004, Rec=1.0000, F1=0.6671
Epoch 10: Loss=73.9310, Acc=0.6894, Prec=0.6168, Rec=1.0000, F1=0.7630
Epoch 20: Loss=72.4205, Acc=0.7999, Prec=0.7142, Rec=1.0000, F1=0.8332
      Model       Loss  Accuracy  Precision  Recall        F1
0  LightGCN  13.945383  0.923125   0.866739     1.0  0.928613
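
The epoch-0 row (accuracy and precision near 0.5, recall 1.0) is the pattern a classifier produces when it labels almost every book as warm and the two classes are roughly balanced, which a median split gives. The worked example below uses made-up labels and hand-written metric functions instead of scikit-learn so that it runs without dependencies; binary_metrics is a hypothetical helper.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

y_true = [1, 0] * 50   # 100 toy labels, half warm and half cold
y_pred = [1] * 100     # predict "warm" for every book
acc, prec, rec, f1 = binary_metrics(y_true, y_pred)
print(f"Acc={acc:.4f}, Prec={prec:.4f}, Rec={rec:.4f}, F1={f1:.4f}")
# Acc=0.5000, Prec=0.5000, Rec=1.0000, F1=0.6667
```

Recall stays at 1.0 in every printed epoch while precision rises, which suggests that the model keeps predicting warm for all warm books and gradually stops doing so for cold ones.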


Run history:
accuracy,▁▃▁▄▃▃▃▃▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇██
epoch,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███
f1,▆▆▁▇▇▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇██████
loss,▁▁█▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁
precision,▅▅▁▆▆▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇███
recall,██▁███████████████████████████

Run summary:
Model,LightGCN
accuracy,0.92312
epoch,29
f1,0.92861
loss,13.94538
precision,0.86674
recall,1
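
Two properties of this setup may help when reading these numbers. The labels are derived from Num_Ratings, and the same count is concatenated to the model input, so a single threshold on that feature already reproduces the labels. The raw counts are also unscaled, which is a plausible (unverified) reason for cross-entropy values in the tens. The sketch below illustrates both points with made-up counts; log1p scaling is a common remedy and an assumption here, not something this notebook does.

```python
import math
import statistics

# Made-up rating counts standing in for train_df['Num_Ratings'].
counts = [12, 150, 3_400, 98_000, 2_500_000, 47]

# Labels as in the notebook: warm (1) if the count exceeds the median.
threshold = statistics.median(counts)
labels = [int(c > threshold) for c in counts]

# A one-line rule on the input feature recovers every label, so training
# accuracy alone says little about what the graph layers have learned.
rule_preds = [int(c > threshold) for c in counts]
print(rule_preds == labels)  # True

# log1p compresses the range from millions to values below 15 while keeping
# the order of the counts, so the same split applies on the scaled feature.
scaled = [math.log1p(c) for c in counts]
scaled_preds = [int(s > math.log1p(threshold)) for s in scaled]
print(scaled_preds == labels)  # True
print([round(s, 2) for s in scaled])
```

A held-out check on test_df, with a label that is not a function of an input feature, would give a more informative picture of the model.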
