<a href="https://colab.research.google.com/github/okonp07/GraphRAG-Pipeline-Deployment-in-Python/blob/main/Deploying_a_Simple_GraphRAG_Pipeline_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Deploying a Lightweight GraphRAG Pipeline for Cybersecurity Question Answering using TinyLlama (Bonus Project)**

|  |  |
|:---|:---|
| **Estimated Runtime** | ~20–30 minutes (depending on hardware and optional components like FAISS) |
| **Prior Knowledge** | Python programming, basic understanding of LLMs and embeddings, introductory graph theory, cybersecurity domain familiarity |
| **Key Libraries** | `langchain`, `transformers`, `sentence-transformers`, `networkx`, `gradio`, `dotenv`, `faiss-cpu` (optional), `huggingface_hub` |
| **Model Used** | `TinyLlama/TinyLlama-1.1B-Chat-v1.0` (fine-tuned by the author on `zeroshot/cybersecurity-corpus`) |
| **Primary Use Case** | Domain-specific question answering within cybersecurity using graph-augmented document context |
| **User Interface** | Gradio web app |
| **System Requirements** | Minimum: 8GB RAM, Recommended: GPU-enabled environment for faster response |
| **Deployment Format** | Jupyter Notebook |
| **License** | Apache License 2.0 (for model); MIT/BSD/Apache for supporting libraries |
| **Author** | Okon Prince; Data Scientist / AI Researcher |
| **Specialization** | AI apps for cybersecurity, education & intelligence |


*This project implements a lightweight GraphRAG (Graph-based Retrieval-Augmented Generation) system using Python to enhance context-aware language generation. It combines semantic search with knowledge graph construction to retrieve relevant document chunks and generate informed responses via a fine-tuned TinyLlama language model. Designed for cybersecurity applications, the system is optimized for low-resource environments and offers a simple Gradio interface for interaction.*




We will start by installing the dependencies and configuring our Colab environment.

In [None]:
!pip install peft accelerate bitsandbytes transformers datasets


In [None]:
!pip uninstall -y bitsandbytes
!pip install bitsandbytes -U --prefer-binary --no-cache-dir

In [None]:
!nvidia-smi

Tue Apr  8 03:39:48 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA L4                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   42C    P8             11W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
!pip install transformers peft datasets accelerate bitsandbytes wandb --quiet

import os
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from peft import prepare_model_for_kbit_training
import wandb


In [None]:
from IPython.display import display
from datasets import load_dataset
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from transformers import AutoTokenizer
import ipywidgets as widgets
import os
import torch
import wandb

Now that we have our environment up and running, we will carry on with the rest of the project. Here is the plan.

1.	Install and import dependencies
2.	Load and prepare your dataset
3.	Tokenize the dataset
4.	Load model with 4-bit quantization
5.	Prepare model for PEFT fine-tuning
6.	Set up training configuration
7.	Train the model

Because training LLMs is resource-hungry (even models described as tiny can have over a billion parameters), we will train our model on a GPU. The code below confirms that the GPU instance is set up correctly and that the notebook is indeed running on it. For this project, we used Google Colab's L4 GPU instance.

In [None]:
print(torch.cuda.is_available())  # Should return True


True


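**STAGE 1: Install & Import Dependencies**

This stage was completed in the setup cells above, which install the packages and import the modules used in the rest of the notebook.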

**STAGE 2: Load and Prepare Dataset**

We'll load the `zeroshot/cybersecurity-corpus` dataset from Hugging Face and use its `text` field for training. As the preview below shows, the dataset provides a training split (789 examples) and a validation split (211 examples), each with a `text` column and a `label` column. The first training example is a short cybersecurity news headline, so the model is fine-tuned on brief, domain-specific text.

In [None]:

# Load the ZeroShot Cybersecurity Corpus
dataset = load_dataset("zeroshot/cybersecurity-corpus")

# Preview the dataset
print(dataset)
print(dataset["train"][0])


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 789
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 211
    })
})
{'text': 'U.S. Air Force Announces Third Bug Bounty Program - https://t.co/DVDtbF6iKI', 'label': 0}


**STAGE 3: Tokenize Dataset**

Tokenization is the process of breaking down text into smaller units called tokens—such as words, subwords, or characters—that can be processed by a language model.
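To make the `truncation`, `padding`, and `max_length` arguments in the next cell concrete, here is a toy sketch. It assumes a hypothetical whitespace tokenizer with a made-up five-word vocabulary, not TinyLlama's actual subword tokenizer, so the IDs are purely illustrative.

```python
# Toy illustration only: a whitespace "tokenizer" with a tiny made-up vocabulary.
# Real LLM tokenizers (including TinyLlama's) split text into subword units instead.
PAD_ID = 0  # hypothetical padding id
UNK_ID = 1  # hypothetical id for words missing from the vocabulary
toy_vocab = {"phishing": 2, "emails": 3, "steal": 4, "credentials": 5}


def toy_tokenize(text, max_length):
    """Map words to ids, truncate to max_length, then right-pad to max_length."""
    ids = [toy_vocab.get(word, UNK_ID) for word in text.lower().split()]
    ids = ids[:max_length]                              # like truncation=True
    attention_mask = [1] * len(ids)                     # 1 marks a real token
    pad_count = max_length - len(ids)
    ids = ids + [PAD_ID] * pad_count                    # like padding="max_length"
    attention_mask = attention_mask + [0] * pad_count   # 0 marks padding
    return {"input_ids": ids, "attention_mask": attention_mask}


encoded = toy_tokenize("Phishing emails steal user credentials", max_length=8)
print(encoded["input_ids"])       # [2, 3, 4, 1, 5, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The attention mask is what lets the model ignore the padded positions, which matters here because short headlines padded to 512 tokens consist mostly of padding.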

In [None]:

# Load the TinyLlama tokenizer
checkpoint = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

# Tokenization function
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

# Apply the tokenizer to the dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Remove the raw 'text' column and the classification 'label' column,
# neither of which is needed for causal language modeling
tokenized_dataset = tokenized_dataset.remove_columns(["text", "label"])

# Set the dataset format for PyTorch
tokenized_dataset.set_format("torch")


**STAGE 4: Load 4-bit Quantized Model**

In this cell, we create a 4-bit quantization configuration with `BitsAndBytesConfig` so that the model loads with a much smaller memory footprint. The configuration uses NF4 quantization with double quantization and performs computation in float16. We then load the causal language model (TinyLlama) from the checkpoint with this configuration; `device_map="auto"` places it on the available devices (GPU or CPU) automatically.
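As a rough back-of-the-envelope check on why 4-bit loading helps, the sketch below estimates the memory taken by the weights alone. It assumes a rounded parameter count of 1.1 billion and ignores quantization constants, layers that stay unquantized, activations, and optimizer state, so the real footprint will be somewhat higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


approx_params = 1_100_000_000  # assumption: "1.1B" parameters, rounded

fp16_gb = weight_memory_gb(approx_params, 16)  # 16 bits per weight
nf4_gb = weight_memory_gb(approx_params, 4)    # 4 bits per weight

print(f"float16 weights: ~{fp16_gb:.2f} GB")
print(f"4-bit weights:   ~{nf4_gb:.2f} GB")
print(f"reduction:       ~{fp16_gb / nf4_gb:.0f}x")
```

Under these assumptions the weights shrink by roughly a factor of four, which is what leaves room for gradients and adapter weights on a single Colab GPU.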

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

base_model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)


**STAGE 5: Prepare Model for PEFT**

The function below prepares the already-quantized model for k-bit training. It freezes the base weights, upcasts the small non-quantized layers (such as layer norms) to float32 for numerical stability, and enables gradient checkpointing to save memory. This makes it possible to train lightweight adapters on top of the quantized model in a resource-constrained environment without significantly compromising performance.

In [None]:
model = prepare_model_for_kbit_training(base_model)


**STAGE 6: Set Up Training Arguments**

Now we log in to Weights & Biases (W&B) for experiment tracking and set up the training parameters with `TrainingArguments`, including batch size, gradient accumulation, evaluation strategy, and learning rate. We also define a `DataCollatorForLanguageModeling` with `mlm=False`, which batches the data for causal language modeling rather than masked language modeling (MLM).
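The batch size and gradient accumulation settings interact, so it can help to work out the effective batch size and the number of optimizer steps before training. The sketch below uses the values from the configuration cell that follows and the 789 training examples shown earlier. The helper `optimizer_steps` is hypothetical, and it assumes a single device and that an incomplete final accumulation group is not counted as a step; under those assumptions the result agrees with the `global_step=294` reported by the training run in this notebook.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Estimate the total number of optimizer updates for a training run."""
    micro_batches = math.ceil(num_examples / (per_device_batch * num_devices))
    # Assumption: an incomplete final accumulation group is not counted as a step
    updates_per_epoch = max(micro_batches // grad_accum, 1)
    return updates_per_epoch * epochs


# per_device_train_batch_size * gradient_accumulation_steps
effective_batch = 2 * 4
total_steps = optimizer_steps(num_examples=789, per_device_batch=2, grad_accum=4, epochs=3)

print(effective_batch)  # 8
print(total_steps)      # 294
```

Gradient accumulation gives the optimizer the signal of a batch of 8 while only ever holding 2 examples in GPU memory at once.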

In [None]:
wandb.login()  # or use wandb.init(project="your_project")

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_steps=10,
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,
    report_to="wandb",
    run_name="tinyllama-cybersecurity-finetune",
    save_total_limit=2,
    push_to_hub=False,
)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)


**STAGE 6 (continued): Prepare the LoRA Configuration**

With the code below, we set up the model for parameter-efficient fine-tuning with LoRA, which adds small trainable low-rank matrices to specific parts of the model (here, the query and value projections) while the 4-bit quantized base weights stay frozen. The LoRA configuration defines parameters such as rank, scaling factor (`lora_alpha`), dropout, and task type. The model is then wrapped with LoRA, and an optional final step prints the number of trainable parameters.
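To see where the trainable-parameter count comes from, the sketch below counts the LoRA weights by hand. The model dimensions are assumptions taken from the published TinyLlama-1.1B configuration rather than from this notebook (hidden size 2048, 22 layers, and a 256-dimensional value projection due to grouped-query attention). With rank 8 on `q_proj` and `v_proj`, the total agrees with the figure printed by `model.print_trainable_parameters()` in this run.

```python
def lora_param_count(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank) per wrapped layer."""
    return rank * in_features + out_features * rank


# Assumed TinyLlama-1.1B dimensions (not stated in this notebook)
hidden_size = 2048
num_layers = 22
kv_dim = 256  # grouped-query attention: 4 key/value heads x 64 dims per head

rank = 8  # matches r=8 in the LoRA configuration
per_layer = (
    lora_param_count(hidden_size, hidden_size, rank)  # q_proj: 2048 -> 2048
    + lora_param_count(hidden_size, kv_dim, rank)     # v_proj: 2048 -> 256
)
total = per_layer * num_layers

print(f"{total:,}")  # 1,126,400
```

Doubling the rank would double this count, which is one reason a small `r` is a common starting point for low-resource fine-tuning.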

In [None]:

# Prepare the quantized model for LoRA fine-tuning
model = prepare_model_for_kbit_training(base_model)

# Define your LoRA configuration
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # depends on model architecture
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Wrap model with LoRA
model = get_peft_model(model, lora_config)

# (Optional) Print trainable params
model.print_trainable_parameters()


trainable params: 1,126,400 || all params: 1,101,174,784 || trainable%: 0.1023


**STAGE 7: Train the Model**

Now we can initialize a Trainer object, which handles the training loop for the model using the specified training arguments, datasets, tokenizer, and data collator. The trainer.train() method then begins the training process on the provided training and validation datasets.

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator
)

trainer.train()


No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


| Epoch | Training Loss | Validation Loss |
|:---|:---|:---|
| 0 | 3.2978 | 3.432296 |
| 1 | 3.2076 | 3.322762 |
| 2 | 3.1502 | 3.293762 |




TrainOutput(global_step=294, training_loss=3.2791869202438666, metrics={'train_runtime': 471.3059, 'train_samples_per_second': 5.022, 'train_steps_per_second': 0.624, 'total_flos': 7495572403519488.0, 'train_loss': 3.2791869202438666, 'epoch': 2.992405063291139})

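**STAGE 8: (Optional) Save the Model**

Uncomment the lines in the cell below and run it to save the model and tokenizer. Because the model is wrapped with LoRA, this saves the adapter weights rather than a full copy of the base model.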

In [None]:
# trainer.save_model("./final_model")
# tokenizer.save_pretrained("./final_model")


**Done!**


We now have a complete end-to-end fine-tuning pipeline for a small 1.1B-parameter model using 4-bit quantization and LoRA. It ran here on a Colab L4 GPU and logs its metrics to W&B.


**STAGE 9: Query the Model (Inference)**

The function below generates a response to a given question: it formats the question as a `Q: ... A:` prompt, tokenizes it, and passes it to the model with sampling parameters (top-p, top-k, temperature) that make the output more varied. It then decodes the model's output and returns the text after the "A:" marker as a clean, readable answer.
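The sampling parameters are easier to reason about with a small worked example. The sketch below is a simplified, pure-Python illustration of how temperature, top-k, and top-p narrow down the candidate tokens before one is drawn; it is not the `transformers` implementation, and the function name `filtered_distribution` and the example logits are made up for illustration.

```python
import math
import random


def filtered_distribution(logits, temperature=0.8, top_k=50, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering."""
    # Temperature: values below 1 sharpen the distribution, values above 1 flatten it
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most likely tokens
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_total
        if cumulative >= top_p:
            break

    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for a 5-token vocabulary
dist = filtered_distribution(toy_logits, temperature=0.8, top_k=3, top_p=0.85)
print(sorted(dist))  # [0, 1]

# Draw one token index from the filtered distribution
next_token = random.choices(list(dist), weights=list(dist.values()), k=1)[0]
print(next_token in dist)  # True
```

In this example top-k first discards the two least likely tokens, and top-p then drops the third because the top two already cover more than 85% of the remaining probability mass.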


In [None]:
def generate_response(question):
    prompt = f"Q: {question}\nA:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        # Generating output with better sampling options
        outputs = model.generate(
            **inputs,
            max_length=200,      # Cap on total length (prompt + generated tokens)
            do_sample=True,      # Enable sampling for creativity
            top_p=0.95,          # Top-p (nucleus sampling)
            top_k=50,            # Top-k for diversity
            temperature=0.8,     # Control randomness in response
            pad_token_id=tokenizer.eos_token_id  # Avoid padding token errors
        )

    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Keep only the text after the first "A:" marker (the model's answer)
    if "A:" in decoded:
        response = decoded.split("A:", 1)[1].strip()
    else:
        response = decoded.strip()

    return response


**Interactive Widget**

This code creates an interactive interface with a text input box for entering a question, a button to submit the question, and an output box to display the response. When the button is clicked, it triggers the generation of a response to the question using the generate_response function, and displays the question and answer in the output box.

In [None]:
# Widgets for input and output
question_box = widgets.Textarea(
    value='',
    placeholder='Type your question here...',
    description='Question:',
    layout=widgets.Layout(width='100%', height='100px'),
    style={'description_width': 'initial'}
)

output_box = widgets.Output()

# What happens on submit
def on_click_generate(b):
    output_box.clear_output()
    question = question_box.value.strip()
    if question:
        with output_box:
            print("Question:", question)
            print("Answer:", generate_response(question))

# Button to submit the question
ask_button = widgets.Button(
    description="Ask",
    button_style='primary',
    tooltip="Click to get the model's answer"
)
ask_button.on_click(on_click_generate)

# Display everything
display(question_box, ask_button, output_box)

