In [1]:
!pip install -U transformers accelerate sentence-transformers scikit-learn

Collecting transformers
  Downloading transformers-4.51.3-py3-none-any.whl.metadata (38 kB)
Collecting accelerate
  Downloading accelerate-1.6.0-py3-none-any.whl.metadata (19 kB)
Collecting sentence-transformers
  Downloading sentence_transformers-4.1.0-py3-none-any.whl.metadata (13 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cubl

## Fine Tune Model

### Beginner/Advanced

In [2]:
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from torch.utils.data import Dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import os

os.environ["WANDB_DISABLED"] = "true"

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}

class BookDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

if __name__ == "__main__":
    if torch.cuda.is_available():
        print(f"✅ GPU is available: {torch.cuda.get_device_name(0)}")
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
        print("🧹 Cleared GPU cache.")
    else:
        print("🖥️ Running on CPU")

    # Step 1: Prepare Data
    df = pd.read_csv("labeled_books.csv")
    df = df[df["level"].isin(["Beginner", "Advanced"])].dropna(subset=["description"]).reset_index(drop=True)
    label_map = {"Beginner": 0, "Advanced": 1}
    df["label"] = df["level"].map(label_map)

    train_texts, test_texts, train_labels, test_labels = train_test_split(
        df["description"].tolist(),
        df["label"].tolist(),
        test_size=0.2,
        random_state=42,
        stratify=df["label"]
    )

    # Step 2: Tokenization
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=256, return_tensors="pt")
    test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=256, return_tensors="pt")

    # Step 3: Dataset
    train_dataset = BookDataset(train_encodings, train_labels)
    test_dataset = BookDataset(test_encodings, test_labels)

    # Step 4: Model
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2, hidden_dropout_prob=0.3)

    # Step 5: Training Args
    training_args = TrainingArguments(
        output_dir="./results",
        eval_steps=500,
        save_steps=500,
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=4,
        weight_decay=0.01,
        logging_dir="./logs",
        logging_steps=10,
        do_eval=True
    )

    # Step 6: Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )

    # Train
    trainer.train()
    trainer.save_model("bert_level_classifier")
    print("Model trained and saved to 'bert_level_classifier'")


✅ GPU is available: NVIDIA A100-SXM4-40GB
🧹 Cleared GPU cache.


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
  trainer = Trainer(


Step,Training Loss
10,0.6771
20,0.6614
30,0.6596
40,0.6308
50,0.5536
60,0.6028
70,0.5813
80,0.5298
90,0.5394
100,0.4822


Model trained and saved to 'bert_level_classifier'


In [3]:
metrics = trainer.evaluate()
print(metrics)

{'eval_loss': 0.5625038146972656, 'eval_accuracy': 0.7161290322580646, 'eval_precision': 0.6428571428571429, 'eval_recall': 0.7941176470588235, 'eval_f1': 0.7105263157894737, 'eval_runtime': 3.2924, 'eval_samples_per_second': 47.078, 'eval_steps_per_second': 3.037, 'epoch': 4.0}


In [4]:
train_metrics = trainer.predict(train_dataset)
from sklearn.metrics import classification_report

y_true = train_metrics.label_ids
y_pred = train_metrics.predictions.argmax(-1)

print(classification_report(y_true, y_pred, target_names=["Beginner", "Advanced"]))

              precision    recall  f1-score   support

    Beginner       0.86      0.75      0.80       345
    Advanced       0.73      0.85      0.78       271

    accuracy                           0.80       616
   macro avg       0.80      0.80      0.79       616
weighted avg       0.80      0.80      0.80       616



### Theory/Practice

In [5]:
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from torch.utils.data import Dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import EarlyStoppingCallback
import os

os.environ["WANDB_DISABLED"] = "true"

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}

class BookDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

if __name__ == "__main__":
    if torch.cuda.is_available():
        print(f"✅ GPU is available: {torch.cuda.get_device_name(0)}")
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
    else:
        print("🖥️ Running on CPU")

    # Load and prepare data
    df = pd.read_csv("labeled_books.csv")
    df = df[df["type"].isin(["Theory", "Practice"])].dropna(subset=["description"]).reset_index(drop=True)
    label_map = {"Theory": 0, "Practice": 1}
    df["label"] = df["type"].map(label_map)

    train_texts, test_texts, train_labels, test_labels = train_test_split(
        df["description"].tolist(),
        df["label"].tolist(),
        test_size=0.2,
        random_state=42,
        stratify=df["label"]
    )

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=256, return_tensors="pt")
    test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=256, return_tensors="pt")

    train_dataset = BookDataset(train_encodings, train_labels)
    test_dataset = BookDataset(test_encodings, test_labels)

    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2, hidden_dropout_prob=0.3
    )

    training_args = TrainingArguments(
        output_dir="./results",
        eval_steps=500,
        save_steps=500,
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=4,
        weight_decay=0.01,
        logging_dir="./logs",
        logging_steps=10,
        do_eval=True
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics

    )

    trainer.train()
    trainer.save_model("bert_type_classifier")
    print("Model saved to 'bert_type_classifier'")


✅ GPU is available: NVIDIA A100-SXM4-40GB


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
  trainer = Trainer(


Step,Training Loss
10,0.6878
20,0.688
30,0.6272
40,0.6402
50,0.6172
60,0.5587
70,0.5268
80,0.4648
90,0.4751
100,0.4889


Model saved to 'bert_type_classifier'


In [6]:
metrics = trainer.evaluate()
print(metrics)

{'eval_loss': 0.4502439796924591, 'eval_accuracy': 0.7973856209150327, 'eval_precision': 0.8367346938775511, 'eval_recall': 0.640625, 'eval_f1': 0.7256637168141593, 'eval_runtime': 0.5664, 'eval_samples_per_second': 270.125, 'eval_steps_per_second': 17.655, 'epoch': 4.0}


In [7]:
train_metrics = trainer.predict(train_dataset)
from sklearn.metrics import classification_report

y_true = train_metrics.label_ids
y_pred = train_metrics.predictions.argmax(-1)

print(classification_report(y_true, y_pred, target_names=["Theory", "Practice"]))


              precision    recall  f1-score   support

      Theory       0.82      0.92      0.87       358
    Practice       0.87      0.70      0.78       254

    accuracy                           0.83       612
   macro avg       0.84      0.81      0.82       612
weighted avg       0.84      0.83      0.83       612



## Book Pairing

In [None]:
import pandas as pd
import torch
import numpy as np
from sentence_transformers import SentenceTransformer, util
from transformers import BertTokenizer, BertForSequenceClassification

# === Load sentence encoder for semantic search ===
semantic_model = SentenceTransformer("all-MiniLM-L6-v2")

# === Load books dataset and embed live ===
df = pd.read_csv("all_books.csv").dropna(subset=["description"])
df["embedding"] = semantic_model.encode(df["description"].tolist()).tolist()
book_embeddings = np.vstack(df["embedding"].to_numpy()).astype("float32")

# === User input ===
topic = input("Enter a topic (e.g., 'computer vision'): ")
style = input("Choose pairing style: (1) Beginner ➜ Advanced or (2) Theory ➜ Practice: ")

# === Setup correct classifier based on style ===
if style.strip() == "1":
    model_path = "bert_level_classifier"
    label_map = {0: "Beginner", 1: "Advanced"}
elif style.strip() == "2":
    model_path = "bert_type_classifier"
    label_map = {0: "Theory", 1: "Practice"}
else:
    raise ValueError("Invalid pairing style. Choose 1 or 2.")

tokenizer = BertTokenizer.from_pretrained(model_path)
classifier = BertForSequenceClassification.from_pretrained(model_path)
classifier.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
classifier.to(device)

# === Step: Semantic Search ===
topic_embedding = semantic_model.encode(topic, convert_to_tensor=True).to(device)
similarities = util.cos_sim(topic_embedding, torch.tensor(book_embeddings).to(device))[0].cpu().numpy()
df["similarity"] = similarities
top_books = df.sort_values(by="similarity", ascending=False).head(50).copy()

# === Step: Predict labels ===
def predict(descriptions):
    encodings = tokenizer(descriptions, truncation=True, padding=True, max_length=256, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = classifier(**encodings)
        preds = torch.argmax(outputs.logits, dim=1).cpu().numpy()
    return preds

top_books["predicted_label"] = predict(top_books["description"].tolist())
top_books["predicted_label"] = top_books["predicted_label"].map(label_map)

# === Step: Recommend 3 from each category ===
group_a = list(label_map.values())[0]
group_b = list(label_map.values())[1]

recommendations = {
    group_a: top_books[top_books["predicted_label"] == group_a].head(3),
    group_b: top_books[top_books["predicted_label"] == group_b].head(3)
}

# === Output ===
for label in [group_a, group_b]:
    print(f"\n📘 Top 3 {label} Books on '{topic}':")
    for _, row in recommendations[label].iterrows():
        print(f"- {row['title']} (Similarity: {row['similarity']:.2f})")
        print(f"  {row['description'][:200]}...")
        print(f"  🔗 More Info: {row.get('infoLink', 'N/A')}\n")


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Enter a topic (e.g., 'computer vision'): AI in healthcare
Choose pairing style: (1) Beginner ➜ Advanced or (2) Theory ➜ Practice: 2

📘 Top 3 Theory Books on 'AI in healthcare':
- Medical Applications of Artificial Intelligence (Similarity: 0.81)
  Enhanced, more reliable, and better understood than in the past, artificial intelligence (AI) systems can make providing healthcare more accurate, affordable, accessible, consistent, and efficient. Ho...
  🔗 More Info: https://play.google.com/store/books/details?id=tRDSBQAAQBAJ&source=gbs_api

- Punish the Machine! (Similarity: 0.77)
  Spare The Doctor And Save The Patient The health care industry is in deep trouble. More than 50 percent of physicians report burnout and the US health care system is topping the charts for cost ?while...
  🔗 More Info: http://books.google.com/books?id=U4DrwAEACAAJ&dq=AI+in+healthcare&hl=&as_pt=BOOKS&source=gbs_api

- Healthcare and Artificial Intelligence (Similarity: 0.77)
  This book provides an overview of t