## Login to Hugging Face Hub

In [1]:
from huggingface_hub import login
login()

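* Purpose: Authenticates this session with the Hugging Face Hub. Gated models, such as the google/gemma-2-2b-it model loaded later, can only be downloaded after logging in with an account that has accepted the model's license.
* Interaction: In a notebook, login() displays a widget that prompts for a Hugging Face access token.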

## Set Up the Device for Computation

### Check GPU Availability

In [2]:
import torch
device=torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

### Device Setup

In [3]:
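# Make GPU 0 the default CUDA device; skip this on CPU-only machines.
if torch.cuda.is_available():
    torch.cuda.set_device(0)

When a GPU is available, this makes the first GPU (index 0) the default CUDA device, so large models run on the GPU instead of the CPU. The availability check avoids an error on machines without CUDA.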

## Import Required Libraries

In [4]:
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.vectorstores import Qdrant
from langchain_core.prompts import PromptTemplate
from langchain.llms import HuggingFacePipeline
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import LLMChain, StuffDocumentsChain
from transformers import AutoTokenizer, pipeline

* PyPDFDirectoryLoader: Loads and preprocesses PDF documents from a specified directory.
* Qdrant: Stores and retrieves document embeddings for similarity-based search.
* PromptTemplate: Defines structured templates for generating prompts with dynamic placeholders.
* HuggingFacePipeline: Integrates Hugging Face's LLM pipelines into LangChain workflows.
* HuggingFaceEmbeddings: Converts text into dense vector embeddings for retrieval tasks.
* RetrievalQA: Chains document retrieval and LLM-based question answering.
* RecursiveCharacterTextSplitter: Splits long texts into smaller, context-preserving chunks.
* LLMChain: Links prompts with an LLM for structured response generation.
* StuffDocumentsChain: Combines retrieved documents into a single context for LLM processing.
* AutoTokenizer: Tokenizes text to prepare it for input into a Hugging Face model.
* pipeline: Configures and executes text-generation tasks using Hugging Face models.

## Load PDF Documents

In [5]:
loader = PyPDFDirectoryLoader("./pdf_directory")
docs = loader.load()
print(len(docs))

Ignoring wrong pointing object 33 0 (offset 0)
Ignoring wrong pointing object 35 0 (offset 0)


406


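* Purpose: Reads every .pdf file in the specified directory. The loader creates one document object per PDF page.
* Output: docs is a list of document objects, and len(docs) prints how many were loaded. The 406 above is therefore a page count, not a file count. The "Ignoring wrong pointing object" lines are parser warnings about malformed PDF objects; they did not stop the load.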
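To see how page-level documents map back to files, the entries can be grouped by their source path. The sketch below is illustrative only: `PageDoc` is a hypothetical stand-in for the loader's document objects, the sample paths are made up, and it assumes each document carries a `source` entry in its metadata.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Hypothetical stand-in for a loaded document: page text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_per_source(docs):
    """Count how many page-level documents came from each source file."""
    return Counter(doc.metadata.get("source", "unknown") for doc in docs)


# Made-up sample: two pages from one PDF, one page from another.
sample_docs = [
    PageDoc("page text", {"source": "pdf_directory/a.pdf", "page": 0}),
    PageDoc("page text", {"source": "pdf_directory/a.pdf", "page": 1}),
    PageDoc("page text", {"source": "pdf_directory/b.pdf", "page": 0}),
]

counts = pages_per_source(sample_docs)
print(len(counts), "files,", sum(counts.values()), "pages")
```

Applied to the real docs list, the same function would show how the loaded pages are distributed across the PDF files.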

## Set Up the Embedding Model

In [6]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

* Purpose: Initializes the HuggingFaceEmbeddings wrapper, which converts text into dense vectors (embeddings) and runs the model on the GPU via model_kwargs.
* Model Used: sentence-transformers/all-mpnet-base-v2, a sentence-transformers model that maps each text to a 768-dimensional vector.

## Split Documents into Chunks

In [7]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 0)
all_splits = text_splitter.split_documents(docs)

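Purpose: Splits the loaded pages into chunks of at most 500 characters for more effective indexing and retrieval. With chunk_overlap = 0, adjacent chunks share no text, so a sentence that straddles a chunk boundary is cut in two.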
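The effect of chunk_size and chunk_overlap can be illustrated with a simplified fixed-window splitter. This is a sketch, not the algorithm RecursiveCharacterTextSplitter uses (that class prefers to break on separators such as paragraphs and sentences); `chunk_text` is a hypothetical helper written only for this illustration.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=0):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


text = "x" * 1200
no_overlap = chunk_text(text, chunk_size=500, chunk_overlap=0)
with_overlap = chunk_text(text, chunk_size=500, chunk_overlap=50)
print([len(c) for c in no_overlap])    # [500, 500, 200]
print([len(c) for c in with_overlap])  # [500, 500, 300]
```

A small non-zero overlap repeats the end of each chunk at the start of the next, which can help retrieval when the relevant passage crosses a boundary, at the cost of storing some text twice.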

## Create a Qdrant Vector Store

In [8]:
qdrant_collection = Qdrant.from_documents(
    all_splits,
    embeddings,
    location=":memory:",
    collection_name="responses",
)

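Purpose: Embeds each chunk and stores the vectors in a Qdrant collection named "responses", enabling similarity-based retrieval. With location=":memory:" the collection lives only in the current process and is lost when the kernel restarts.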
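Conceptually, similarity-based retrieval compares a query vector with every stored vector and keeps the closest ones. The sketch below assumes cosine similarity as the metric and uses made-up three-dimensional vectors; `TinyVectorStore` is a hypothetical toy class, not part of Qdrant or LangChain, and a real vector database uses indexes rather than a linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Hypothetical in-memory store that keeps (vector, text) pairs in a list."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        self._items.append((vector, text))

    def search(self, query_vector, k=4):
        """Return the k (score, text) pairs most similar to the query."""
        scored = [(cosine_similarity(query_vector, v), t) for v, t in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = TinyVectorStore()
store.add([1.0, 0.0, 0.0], "chunk about malware detection")
store.add([0.0, 1.0, 0.0], "chunk about cooking")
store.add([0.9, 0.1, 0.0], "chunk about threat intelligence")

top_hits = store.search([1.0, 0.0, 0.0], k=2)
for score, text in top_hits:
    print(f"{score:.3f}  {text}")
```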

## Configure a Retriever

In [9]:
qdrant_retriever = qdrant_collection.as_retriever()

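Purpose: Wraps the Qdrant collection in a retriever that embeds a query and returns the stored chunks most similar to it. With no arguments, as_retriever() uses similarity search and returns the top 4 chunks; pass search_kwargs={"k": n} to change that number.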

## Load and Configure the LLM from Hugging Face

In [10]:
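model_name = "google/gemma-2-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Note: this rebinds the name `pipeline` from the imported factory function
# to the pipeline object, so re-running this cell requires re-running the imports.
pipeline = pipeline(
    "text-generation",
    model=model_name,
    tokenizer=tokenizer,  # reuse the tokenizer loaded above
    model_kwargs={"torch_dtype": torch.bfloat16},  # 16-bit weights to save GPU memory
    device="cuda",
    max_new_tokens=512,  # default cap on generated tokens per call
)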

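* Purpose: Builds a Hugging Face text-generation pipeline around the Gemma 2 LLM, loading the weights in bfloat16 on the GPU and capping each response at 512 new tokens by default.
* Model Used: google/gemma-2-2b-it, the 2B-parameter instruction-tuned Gemma 2 model.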

## Test the LLM with Chat-Style Prompts

In [11]:
messages = [
    {"role": "user", "content": "Where is Milan?"},
]
prompt = pipeline.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipeline(
    prompt,
    max_new_tokens=256,
    add_special_tokens=True,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95
)
print(outputs[0]["generated_text"][len(prompt):])

The 'max_batch_size' argument of HybridCache is deprecated and will be removed in v4.46. Use the more precisely named 'batch_size' argument instead.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


Milan is a city located in **Northern Italy**. 

More specifically:

* **Region:** Lombardy
* **Country:** Italy
 
It's the capital of the Lombardy region and is one of the major fashion, economic, and cultural centers of Italy. 



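* Purpose: Tests the LLM with a chat-style prompt asking a geographical question.
* Steps:
    * Formats the message list with the tokenizer's chat template, appending the generation prompt so the model answers in the assistant role.
    * Generates a response with sampling enabled: temperature rescales the token probabilities (lower values are less random), top_k keeps only the 50 most likely tokens, and top_p keeps the smallest set of tokens whose cumulative probability reaches 0.95.
    * Prints only the newly generated text by slicing off the prompt, which the pipeline echoes at the start of generated_text.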
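How temperature, top-k, and top-p interact can be sketched on a toy vocabulary. The function below is a simplified illustration of the general technique, not the transformers implementation; `sample_next_token` and the logit values are hypothetical.

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.95, rng=random):
    """Sample one token from a {token: logit} mapping."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # 1. Temperature: dividing logits by T < 1 sharpens the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # 2. Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(scaled.items(), key=lambda item: item[1], reverse=True)[:top_k]
    # 3. Softmax over the surviving tokens (shifted by the max for stability).
    max_logit = ranked[0][1]
    weights = [(tok, math.exp(value - max_logit)) for tok, value in ranked]
    total = sum(weight for _, weight in weights)
    probs = [(tok, weight / total) for tok, weight in weights]
    # 4. Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in probs:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    tokens = [tok for tok, _ in kept]
    kept_probs = [prob for _, prob in kept]
    # random.choices renormalizes the remaining weights before drawing.
    return rng.choices(tokens, weights=kept_probs, k=1)[0]


toy_logits = {"Italy": 5.0, "France": 2.0, "Mars": -3.0}
greedy_like = sample_next_token(toy_logits, top_k=1)
sampled = sample_next_token(toy_logits, rng=random.Random(0))
print(greedy_like, sampled)
```

With these toy logits the top token already holds more than 95% of the probability mass after temperature scaling, so top-p leaves a single candidate and both calls return it. Flatter distributions leave several candidates for the random draw.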

## Define the Prompt Template

In [12]:
qa_prompt_tmpl_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context}\n"
    "---------------------\n"
    "Given the context information above I want you to think step by step to answer the query in a crisp manner, in case you don't know the answer say 'I don't know!'.\n"
    "Query: {question}\n"
    "Answer: "
)
prompt_template = PromptTemplate.from_template(qa_prompt_tmpl_str)

Purpose: Defines a chain-of-thought style prompt template that:
* Provides context.
* Asks the model to answer step-by-step.
* Prompts the model to explicitly state "I don't know!" if unsure.
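The placeholders in such a template are ordinary format fields, so the idea can be shown with plain Python string formatting. This sketch uses a shortened copy of the template and str.format as a stand-in for PromptTemplate; `template_variables` is a hypothetical helper, and the sample context and question are made up.

```python
from string import Formatter

# Shortened copy of the template, for illustration only.
qa_template = (
    "Context information is below.\n"
    "---------------------\n"
    "{context}\n"
    "---------------------\n"
    "Query: {question}\n"
    "Answer: "
)


def template_variables(template):
    """Return the placeholder names found in a format-style template."""
    return sorted({name for _, name, _, _ in Formatter().parse(template) if name})


variables = template_variables(qa_template)
rendered = qa_template.format(
    context="Milan is the capital of Lombardy.",
    question="Where is Milan?",
)
print(variables)  # ['context', 'question']
print(rendered)
```

Ending the template with "Answer: " means the model's continuation starts directly at the answer.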

## Create the LLM Chain and Combine Documents with the LLM Chain

In [13]:
gemma_llm = HuggingFacePipeline(
    pipeline=pipeline,
    model_kwargs={"temperature": 0.7},
)

llm_chain = LLMChain(llm=gemma_llm, prompt=prompt_template)

combine_documents_chain = StuffDocumentsChain(
    llm_chain=llm_chain,
    document_variable_name="context",
    verbose=True
)

print(llm_chain.input_keys)  
print(combine_documents_chain.input_keys)  

['context', 'question']
['input_documents', 'question']


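Purpose:
* HuggingFacePipeline wraps the Gemma pipeline so that LangChain can call it as an LLM.
* LLMChain connects that LLM to the prompt template; it expects the inputs context and question.
* StuffDocumentsChain concatenates ("stuffs") the retrieved documents into the single context variable, so its inputs are input_documents and question, as the printed input keys show.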
## Set Up the Retrieval-QA Pipeline and Query the Retrieval-QA System

In [14]:
qa = RetrievalQA(
    combine_documents_chain=combine_documents_chain,
    retriever=qdrant_retriever,
    verbose=True,
    return_source_documents=True,
)

print(qa.input_keys)

query_str = "How can we use LLM in cybersecurity?"
response = qa.invoke(query_str)

print(response["result"])


['query']


> Entering new RetrievalQA chain...


> Entering new StuffDocumentsChain chain...

> Finished chain.

> Finished chain.
Context information is below.
---------------------
Table 1: The main cybersecurity tasks and applications where LLMs have been utility.
Vulnerability
Detection
(In)secure
Code
Generation
Program
Repairing Binary IT
Operations
Threat
Intelligence
Anomaly
Detection
LLM
Assisted
Attack
Others
RQ1 ✓ ✓ ✓ ✓ ✓ - - - ✓
RQ2 ✓ ✓ ✓ - - ✓ ✓ ✓ ✓
RQ3 - - - - ✓ - ✓ ✓ -
Despite initial efforts of LLMs in cybersecurity, the field faces several challenges [17, 31]. First, many studies rely on

importance of LLMs in the field of cybersecurity, although the researches of LLMs’ application in their fields
are less.
3 RQ1: How to construct cybersecurity-oriented domain LLMs?
The cybersecurity domain is facing escalating threats, demanding intelligent and efficient solutions to counter sophisti-
cated and evolving attacks [37, 38, 39]. LLMs offer

Purpose:
* Implements a Retrieval-QA pipeline that:
    * Fetches relevant documents using the retriever.
    * Generates a response using the LLM chain.
    * Returns the source documents alongside the answer, because return_source_documents=True.
* Queries the system with a question about LLMs in cybersecurity, which runs these steps:
    * Retrieves the most similar document chunks from Qdrant.
    * Passes the stuffed context and the question to the LLM.
    * Prints response["result"]. In the output above, the result begins with the filled-in prompt, including the retrieved context.
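The retrieve-then-generate flow can be sketched end to end with toy components. Everything here is a hypothetical stand-in: `keyword_retriever` ranks chunks by shared words instead of embeddings, `stub_llm` returns a fixed string instead of calling a model, and the corpus is made up. Only the shape of the flow and the result keys (result, source_documents) mirror the chain above.

```python
def keyword_retriever(query, corpus, k=2):
    """Hypothetical retriever: rank chunks by the number of words shared with the query."""
    query_words = set(query.lower().split())

    def overlap(chunk):
        return len(query_words & set(chunk.lower().split()))

    return sorted(corpus, key=overlap, reverse=True)[:k]


def stub_llm(prompt):
    """Hypothetical stand-in for an LLM call; reports the prompt size only."""
    return f"stub answer (prompt had {len(prompt)} characters)"


def retrieval_qa(query, corpus, retriever, llm, k=2):
    """Retrieve chunks, stuff them into a prompt, and ask the LLM."""
    sources = retriever(query, corpus, k=k)
    context = "\n\n".join(sources)
    prompt = f"Context:\n{context}\n\nQuery: {query}\nAnswer: "
    return {"query": query, "result": llm(prompt), "source_documents": sources}


corpus = [
    "LLMs can assist vulnerability detection in source code.",
    "Pasta should be cooked in salted water.",
    "Threat intelligence reports can be summarized by LLMs.",
]

response = retrieval_qa(
    "How can LLMs help with vulnerability detection?",
    corpus,
    keyword_retriever,
    stub_llm,
)
print(response["result"])
for source in response["source_documents"]:
    print("-", source)
```

Returning the sources next to the answer makes it possible to check which chunks the answer was based on.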