# Generate Embeddings on Google Colab

This notebook runs embedding generation on Google Colab's free GPU.

**Steps:**
1. Check GPU availability
2. Install dependencies
3. Upload your dataset CSV file
4. Generate embeddings
5. Download results


## Step 1: Check GPU Availability


In [1]:
# Check if GPU is available
import torch

if torch.cuda.is_available():
    print(f"✅ GPU Available: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    device = "cuda"
else:
    print("❌ No GPU available. Please enable GPU in Runtime > Change runtime type > GPU")
    device = "cpu"


✅ GPU Available: Tesla T4
GPU Memory: 14.7 GB


## Step 2: Install Dependencies


In [2]:
# Install required packages
%pip install sentence-transformers pandas numpy tqdm -q


## Step 3: Upload Dataset

Upload your `standards_dataset.csv` file using the file uploader below.


In [3]:
from google.colab import files
import pandas as pd
from pathlib import Path
import os

# Create datasets directory
os.makedirs("datasets", exist_ok=True)
os.makedirs("datasets/embeddings", exist_ok=True)

print("Please upload your standards_dataset.csv file:")
uploaded = files.upload()

# Move uploaded file to datasets directory
dataset_path = None
for filename in uploaded.keys():
    if filename.endswith('.csv'):
        os.rename(filename, f"datasets/{filename}")
        print(f"✅ Uploaded: {filename}")
        dataset_path = f"datasets/{filename}"
        break

if not dataset_path:
    print("❌ No CSV file found. Please upload standards_dataset.csv")
else:
    # Verify dataset
    df = pd.read_csv(dataset_path)
    print(f"\nDataset loaded: {len(df)} sections")
    print(f"Columns: {list(df.columns)}")


Please upload your standards_dataset.csv file:


Saving standards_dataset.csv to standards_dataset.csv
✅ Uploaded: standards_dataset.csv

Dataset loaded: 2273 sections
Columns: ['section_number', 'section_path', 'content', 'content_length', 'line_number']
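
Step 4 embeds the `content` column, so a missing column or empty cells there will cause trouble later. The following is a minimal sketch of a pre-flight check using only the standard library. The function name `check_content_column` is hypothetical and not part of the notebook; it assumes a UTF-8 CSV with a header row.

```python
import csv


def check_content_column(csv_path, column="content"):
    """Return (row_count, empty_row_numbers) for one text column of a CSV.

    Row numbers are 1-based and count data rows only (the header is excluded).
    Raises ValueError if the column is not present in the header.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise ValueError(
                f"Column {column!r} not found; header is {reader.fieldnames}"
            )
        row_count = 0
        empty_rows = []
        for row_count, row in enumerate(reader, start=1):
            # A short row yields None for missing fields, so fall back to "".
            if not (row[column] or "").strip():
                empty_rows.append(row_count)
    return row_count, empty_rows
```

Calling `check_content_column(dataset_path)` after the upload cell would report how many rows were read and which of them have no text to embed.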


## Step 4: Generate Embeddings


In [4]:
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
import json
import time
import os
import torch
import gc

# Free host and GPU memory left over from any previous run
gc.collect()
torch.cuda.empty_cache()
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

if 'dataset_path' not in locals() or dataset_path is None:
    print("❌ Please upload dataset first!")
else:
    # Load dataset
    df = pd.read_csv(dataset_path)
    texts = df['content'].tolist()

    print(f"Total texts to embed: {len(texts):,}")

    # Model configuration
    # all-MiniLM-L6-v2 is a small, fast general-purpose sentence-embedding model
    model_name = "sentence-transformers/all-MiniLM-L6-v2"

    # The model is small, so a large batch fits comfortably in GPU memory
    batch_size = 128 if torch.cuda.is_available() else 32

    print(f"\nConfiguration:")
    print(f"  Model: {model_name}")
    print(f"  Device: {'cuda' if torch.cuda.is_available() else 'cpu'}")
    print(f"  Batch size: {batch_size}")

    # Load model
    print(f"\nLoading model...")
    model = SentenceTransformer(model_name, device="cuda" if torch.cuda.is_available() else "cpu")

    # Generate embeddings
    print(f"\nGenerating embeddings...")
    embeddings = []
    start_time = time.time()

    try:
        with tqdm(total=len(texts), desc="Processing", unit="text") as pbar:
            for i in range(0, len(texts), batch_size):
                batch = texts[i:i+batch_size]

                # Encode the batch. With normalize_embeddings=True every vector has
                # unit length, so a dot product between two vectors is their cosine similarity.
                batch_embeddings = model.encode(
                    batch,
                    show_progress_bar=False,
                    convert_to_numpy=True,
                    normalize_embeddings=True,
                    batch_size=len(batch)
                )

                embeddings.append(batch_embeddings)
                pbar.update(len(batch))

        # Concatenate embeddings
        all_embeddings = np.vstack(embeddings)
        total_time = time.time() - start_time

        print(f"\n{'='*60}")
        print(f"✅ Embeddings generated successfully!")
        print(f"{'='*60}")
        print(f"Total time: {total_time:.2f} seconds")
        print(f"Speed: {len(texts)/total_time:.1f} texts/sec")
        print(f"Shape: {all_embeddings.shape}")

        # Save embeddings
        os.makedirs("datasets/embeddings", exist_ok=True)
        embeddings_path = "datasets/embeddings/embeddings.npy"
        np.save(embeddings_path, all_embeddings)
        print(f"\n✅ Saved embeddings to: {embeddings_path}")

        # Save metadata
        metadata = {
            'model_name': model_name,
            'embedding_dimension': int(all_embeddings.shape[1]),
            'num_embeddings': int(all_embeddings.shape[0]),
            'processing_time': total_time
        }

        metadata_path = "datasets/embeddings/metadata.json"
        with open(metadata_path, 'w') as f:
            json.dump(metadata, f, indent=2)

        # Update dataset
        df['embedding_index'] = range(len(df))
        output_dataset = "datasets/standards_dataset_with_embeddings.csv"
        df.to_csv(output_dataset, index=False)
        print(f"✅ Saved CSV to: {output_dataset}")

    except Exception as e:
        print(f"\n❌ Error: {e}")
        print("If you still see OOM, restart the Runtime via 'Runtime' -> 'Disconnect and Delete Runtime'")

Total texts to embed: 2,273

Configuration:
  Model: sentence-transformers/all-MiniLM-L6-v2
  Device: cuda
  Batch size: 128

Loading model...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Generating embeddings...


Processing: 100%|██████████| 2273/2273 [00:06<00:00, 342.12text/s]


✅ Embeddings generated successfully!
Total time: 6.65 seconds
Speed: 341.8 texts/sec
Shape: (2273, 384)

✅ Saved embeddings to: datasets/embeddings/embeddings.npy
✅ Saved CSV to: datasets/standards_dataset_with_embeddings.csv
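
Because the embeddings are saved L2-normalized, the cosine similarity between two sections reduces to a dot product between their rows. The sketch below shows one way the saved array could be queried for nearest neighbours. The helper name `top_k_similar` is hypothetical, and the code assumes every row has unit length.

```python
import numpy as np


def top_k_similar(embeddings, query_index, k=5):
    """Return (row_index, score) pairs for the k rows most similar to one row.

    Assumes each row is L2-normalized, so the dot product equals cosine similarity.
    """
    # Matrix-vector product gives the similarity of every row to the query row.
    scores = embeddings @ embeddings[query_index]
    # Exclude the query itself, which would otherwise always rank first.
    scores[query_index] = -np.inf
    k = min(k, len(scores) - 1)
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]


# In this notebook the array would come from the file saved above, e.g.:
#   embeddings = np.load("datasets/embeddings/embeddings.npy")
#   neighbours = top_k_similar(embeddings, query_index=0, k=5)
```

The returned row indices correspond to the `embedding_index` column written to `standards_dataset_with_embeddings.csv`, so they can be joined back to the section text.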


## Step 5: Download Results


In [5]:
from google.colab import files
import zipfile
import os

# Create zip file with all results
zip_path = "embeddings_results.zip"
with zipfile.ZipFile(zip_path, 'w') as zipf:
    # Add embeddings
    if os.path.exists("datasets/embeddings/embeddings.npy"):
        zipf.write("datasets/embeddings/embeddings.npy", "embeddings.npy")

    # Add metadata
    if os.path.exists("datasets/embeddings/metadata.json"):
        zipf.write("datasets/embeddings/metadata.json", "metadata.json")

    # Add dataset with indices
    if os.path.exists("datasets/standards_dataset_with_embeddings.csv"):
        zipf.write("datasets/standards_dataset_with_embeddings.csv", "standards_dataset_with_embeddings.csv")

print("📦 Created zip file with all results")
print("\nDownloading files...")
files.download(zip_path)
print("\n✅ Download complete!")
print("\nFiles included:")
print("  - embeddings.npy (embeddings array)")
print("  - metadata.json (model info)")
print("  - standards_dataset_with_embeddings.csv (dataset with indices)")


📦 Created zip file with all results

Downloading files...



✅ Download complete!

Files included:
  - embeddings.npy (embeddings array)
  - metadata.json (model info)
  - standards_dataset_with_embeddings.csv (dataset with indices)
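
The download cell skips any result file that does not exist, so the archive can be incomplete without an error being raised. As a rough sketch, the downloaded zip could be checked locally as follows. The helper name `inspect_results_zip` is hypothetical; the expected file names are the ones written by the cell above.

```python
import json
import zipfile

EXPECTED_FILES = {
    "embeddings.npy",
    "metadata.json",
    "standards_dataset_with_embeddings.csv",
}


def inspect_results_zip(zip_path):
    """Return (missing_files, metadata) for a results archive.

    missing_files is a sorted list of expected names absent from the zip.
    metadata is the parsed metadata.json, or None if that file is absent.
    """
    with zipfile.ZipFile(zip_path) as zf:
        names = set(zf.namelist())
        missing = sorted(EXPECTED_FILES - names)
        metadata = None
        if "metadata.json" in names:
            metadata = json.loads(zf.read("metadata.json"))
    return missing, metadata
```

An empty `missing` list means all three files are present, and `metadata["num_embeddings"]` can then be compared against the row count of the CSV.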
