# **ARTIFICIAL INTELLIGENCE APPLIED TO CYBERSECURITY**
## **LAB ASSIGNMENT P3 - BLOCK III**

**INSTRUCTIONS / RECOMMENDATIONS**

- Read the description of each cell carefully.
- Cells that already contain code should be run as they are. NOTE: some code cells must be parameterized as indicated in each case.
- Empty cells must be completed with the implementation the notebook requires.
- Do not add cells beyond those already in this notebook; the solution must be implemented exclusively in the empty cells.
- Submission is via Moodle. Upload the solution to this notebook under the name **NOMBRE_GRUPO.ipynb** (that is, the group's name).

- **Publication date: 08/04/2024**
- **Submission deadline: 14/04/2024**
- **Test: 15/04/2024**

In [1]:
# SETUP

!pip3 install transformers
!pip3 install einops
!pip3 install accelerate
!pip3 install unstructured-pytesseract
!pip3 install unstructured-inference
!pip3 install sentence_transformers
!pip3 install chromadb

#!pip3 install protobuf==3.20.*

!pip3 install langchain



In [2]:
!pip3 install unstructured
!pip3 install pillow_heif
#!pip3 install cmake
!pip3 install pikepdf pypdf
#!pip3 install python-poppler

Collecting pikepdf
  Using cached pikepdf-8.15.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB)
Collecting pypdf
  Using cached pypdf-4.2.0-py3-none-any.whl (290 kB)
Installing collected packages: pypdf, pikepdf
Successfully installed pikepdf-8.15.0 pypdf-4.2.0


In [3]:
# IMPORTS

from transformers import AutoTokenizer
import transformers
import torch

import langchain

langchain.__version__

'0.1.16'

## LLM

The LLM must be parameterized through the following attributes:
- model_id
- max_length
- max_new_tokens
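
How the length attributes interact can be sketched with a small helper. It assumes that the model's context window (1024 positions for GPT-2) must hold the prompt plus the generated tokens, and that `max_new_tokens` caps only the generated part. `prompt_token_budget` is a hypothetical name used for illustration, not part of transformers or LangChain.

```python
def prompt_token_budget(context_window: int, max_new_tokens: int) -> int:
    """Return how many tokens remain for the prompt (question + retrieved context)."""
    if max_new_tokens >= context_window:
        raise ValueError("max_new_tokens must be smaller than the context window")
    return context_window - max_new_tokens


# Values used in this notebook: a 1024-token window and 100 new tokens.
budget = prompt_token_budget(context_window=1024, max_new_tokens=100)
print(budget)  # 924
```

A prompt longer than this budget leaves the model less room than `max_new_tokens` asks for, which is worth keeping in mind when choosing how much retrieved context to pass.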

In [4]:
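# LLM using HuggingFace GPT2

from langchain import HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id="gpt2-large",
    task="text-generation",
    model_kwargs={
        "max_length": 1024,          # upper bound on prompt + generated tokens
        "do_sample": True,           # sample instead of greedy decoding
        "top_k": 10,                 # sample only among the 10 most likely tokens
        "num_return_sequences": 2,   # generate two candidate sequences per prompt
        # "device_map": "auto",
        "trust_remote_code": True,
        "torch_dtype": torch.bfloat16,
    },
    pipeline_kwargs={"max_new_tokens": 100},  # cap on newly generated tokens
    device=0,  # first GPU
)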


## DOCUMENTS & SPLITTER & VECTORSTORE

A RAG system must be built for a cybersecurity use case, using the LLM specified above. Each student proposes the cybersecurity use case and chooses the base PDF document (**pdf_base**) that the model will later be queried about.
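
The retrieval step of a RAG system can be illustrated with a toy sketch in plain Python. It is not how `HuggingFaceEmbeddings` or `Chroma` work internally: word counts stand in for a real embedding model, and `embed`, `cosine`, `retrieve` and `build_prompt` are hypothetical names chosen for this example. The chunk texts are glossary entries quoted from the notebook's own outputs.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context_chunks: List[str]) -> str:
    """Concatenate the retrieved context and the question into one prompt."""
    context = "\n\n".join(context_chunks)
    return f"{context}\n\nQuestion: {query}"


chunks = [
    "Zombie: a program that is installed on a system to cause it to attack other systems.",
    "Authenticate: to confirm the identity of an entity when that identity is presented.",
    "Zeroize: to remove or eliminate the key from a cryptographic equipment or fill device.",
]

question = "What is a zombie program?"
top = retrieve(question, chunks, k=1)
print(build_prompt(question, top))
```

The real pipeline follows the same shape: the question is embedded, the closest chunks are fetched from the vector store, and both are placed in the prompt sent to the LLM.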

In [5]:
# INDEXING

from langchain.document_loaders import OnlinePDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import ConversationalRetrievalChain

In [6]:
# 1. LOAD
loader = OnlinePDFLoader("https://nvlpubs.nist.gov/nistpubs/ir/2013/NIST.IR.7298r2.pdf")
document = loader.load()

# 2. SPLIT
text_splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=64)
documents = text_splitter.split_documents(document)

# 3. EMBED & STORE
embeddings = HuggingFaceEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)

# RETRIEVAL & GENERATION
qa = ConversationalRetrievalChain.from_llm(
    llm,
    vectorstore.as_retriever(),
    return_source_documents=True,
)


## QUESTION TO LLM = MY_QUERY + CONTEXT_from_VECTORSTORE -> ANSWER FROM LLM

A catalog of 10 query prompts must be created for the resulting RAG system, using the variable **my_query**. The prompts should elicit answers of adequate quality from the LLM.
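
One way to organize such a catalog is to keep the queries in a list and send them through a single function. The sketch below is only an illustration: `run_catalog` and `fake_ask` are hypothetical names, and `fake_ask` replaces the real chain so the example runs on its own.

```python
from typing import Callable, Dict, List


def run_catalog(queries: List[str], ask: Callable[[str], str]) -> Dict[str, str]:
    """Send every query to `ask` and collect the answers, keyed by query."""
    answers = {}
    for query in queries:
        answers[query] = ask(query)
    return answers


def fake_ask(query: str) -> str:
    """Stand-in for the real chain (assumption: the real one returns a string)."""
    return f"(answer to: {query})"


catalog = [
    "What is the difference between authentication and authorisation?",
    "What is a Zombie?",
]

results = run_catalog(catalog, fake_ask)
for query, answer in results.items():
    print(query, "->", answer)
```

In the notebook, `ask` could wrap the chain built in the indexing cell, for example `lambda q: qa({"question": q, "chat_history": []})["answer"]`.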

In [7]:
# QUESTION TO LLM

import warnings
warnings.filterwarnings('ignore')

chat_history = []

In [8]:
my_query = "What is the difference between authentication and authorisation?"

result = qa({"question": my_query, "chat_history": []})
print(result["answer"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

SOURCE: SP 800-53; CNSSI-4009

Authenticity –

The property of being genuine and being able to be verified and trusted; confidence in the validity of a transmission, a message, or message originator. See Authentication.

SOURCE: SP 800-53; SP 800-53A; CNSSI-4009; SP 800-39

Authority –

Person(s) or established bodies with rights and responsibilities to exert control in an administrative sphere.

SOURCE: CNSSI-4009

Authorization –

SOURCE: CNSSI-4009

Authenticate –

To confirm the identity of an entity when that identity is presented.

SOURCE: SP 800-32

To verify the identity of a user, user device, or other entity.

SOURCE: CNSSI-4009

Authentication –

Verifying the identity of a user, process, or device, often as a prerequisite to allowing access to resources in an information system.

SOURCE: SP 800-53; SP 800-53A; SP

In [9]:
my_query = "In my company there is a department called CSIRT, what do they do?"

result = qa({"question": my_query, "chat_history": []})
print(result["answer"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Computer Security Incident Response Team (CSIRT) –

A capability set up for the purpose of assisting in responding to computer security-related incidents; also called a Computer Incident Response Team (CIRT) or a CIRC (Computer Incident Response Center, Computer Incident Response Capability).

SOURCE: SP 800-61

Computer Security Object (CSO) –

SOURCE: CNSSI-4009

Computer Incident Response Team – (CIRT)

Group of individuals usually consisting of Security Analysts organized to develop, recommend, and coordinate immediate mitigation actions for containment, eradication, and recovery resulting from computer security incidents. Also called a Computer Security Incident Response Team (CSIRT) or a CIRC (Computer Incident Response Center, Computer Incident Response Capability, or Cyber Incident Response Team).

SOURCE: CNSSI-4009

In [18]:
my_query = "What is the main difference between a blacklist and a whitelist?"

result = qa({"question": my_query, "chat_history": []})
print(result["answer"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

SOURCE: SP 800-94

20

NIST IR 7298 Revision 2, Glossary of Key Information Security Terms

Blacklisting –

The process of the system invalidating a user ID based on the user’s inappropriate actions. A blacklisted user ID cannot be used to log on to the system, even with the correct authenticator. Blacklisting and lifting of a blacklisting are both security-relevant events. Blacklisting also applies to blocks placed against IP addresses to prevent inappropriate or unauthorized use of Internet resources.

2. Can also refer to a small group of people who have prior knowledge of unannounced Red Team activities. The White Team acts as observers during the Red Team activity and ensures the scope of testing does not exceed a predefined threshold. SOURCE: CNSSI-4009

Whitelist –

A list of discrete entities, such as hosts or applic

In [11]:
my_query = "What is the difference between public and private keys?"

result = qa({"question": my_query, "chat_history": []})
print(result["answer"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

SOURCE: SP 800-63

Private Key –

A cryptographic key, used with a public key cryptographic algorithm, that is uniquely associated with an entity and is not made public. In an asymmetric (public) cryptosystem, the private key is associated with a public key. Depending on the algorithm, the private key may be used, for example, to: 1) Compute the corresponding public key, 2) Compute a digital signature that may be verified by the

corresponding public key,

SOURCE: FIPS 196

Public Key –

A cryptographic key used with a public key cryptographic algorithm that is uniquely associated with an entity and that may be made public.

SOURCE: FIPS 140-2

A cryptographic key that may be widely published and is used to enable the operation of an asymmetric cryptography scheme. This key is mathematically linked with a corresponding priva

In [12]:
my_query = "What is a Zombie?"

result = qa({"question": my_query, "chat_history": []})
print(result["answer"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

SOURCE: SP 800-63

Zombie –

A program that is installed on a system to cause it to attack other systems.

SOURCE: SP 800-83

Zone Of Control –

Three-dimensional space surrounding equipment that processes classified and/or sensitive information within which TEMPEST exploitation is not considered practical or where legal authority to identify and remove a potential TEMPEST exploitation exists.

SOURCE: CNSSI-4009

216

NIST IR 7298, Glossary of Key Information Security Terms

NON-NIST REFERENCES

SOURCE: CNSSI-4009

215

NIST IR 7298 Revision 2, Glossary of Key Information Security Terms

Zeroize –

To remove or eliminate the key from a cryptographic equipment or fill device.

SOURCE: CNSSI-4009

Overwrite a memory location with data consisting entirely of bits with the value zero so that the data is destroyed and not recove

In [13]:
my_query = "Que es un Zombie?"

result = qa({"question": my_query, "chat_history": []})
print(result["answer"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

SOURCE: SP 800-63

Zombie –

A program that is installed on a system to cause it to attack other systems.

SOURCE: SP 800-83

Zone Of Control –

Three-dimensional space surrounding equipment that processes classified and/or sensitive information within which TEMPEST exploitation is not considered practical or where legal authority to identify and remove a potential TEMPEST exploitation exists.

SOURCE: CNSSI-4009

216

NIST IR 7298, Glossary of Key Information Security Terms

NON-NIST REFERENCES

SOURCE: CNSSI-4009

215

NIST IR 7298 Revision 2, Glossary of Key Information Security Terms

Zeroize –

To remove or eliminate the key from a cryptographic equipment or fill device.

SOURCE: CNSSI-4009

Overwrite a memory location with data consisting entirely of bits with the value zero so that the data is destroyed and not recove

In [14]:
my_query = "What is the main difference between a virus and a macro virus?"

result = qa({"question": my_query, "chat_history": []})
print(result["answer"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

SOURCE: CNSSI-4009

Macro Virus –

A virus that attaches itself to documents and uses the macro programming capabilities of the document’s application to execute and propagate.

SOURCE: CNSSI-4009

Magnetic Remanence –

Magnetic representation of residual information remaining on a magnetic medium after the medium has been cleared. See Clearing.

SOURCE: CNSSI-4009

Maintenance Hook –

Virus –

A computer program that can copy itself and infect a computer without permission or knowledge of the user. A virus might corrupt or delete data on a computer, use email programs to spread itself to other computers, or even erase everything on a hard disk.

SOURCE: CNSSI-4009

Vulnerability –

Weakness in an information system, system security procedures, internal controls, or implementation that could be exploited or triggered by a th

In [19]:
my_query = "How do I protect from a macro virus?"

result = qa({"question": my_query, "chat_history": chat_history})
print(result["answer"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

SOURCE: CNSSI-4009

Macro Virus –

A virus that attaches itself to documents and uses the macro programming capabilities of the document’s application to execute and propagate.

SOURCE: CNSSI-4009

Magnetic Remanence –

Magnetic representation of residual information remaining on a magnetic medium after the medium has been cleared. See Clearing.

SOURCE: CNSSI-4009

Maintenance Hook –

SOURCE: CNSSI-4009

Malware –

A program that is inserted into a system, usually covertly, with the intent of compromising the confidentiality, integrity, or availability of the victim’s data, applications, or operating system or of otherwise annoying or disrupting the victim.

SOURCE: SP 800-83

See Malicious Code. See also Malicious Applets and Malicious Logic.

SOURCE: SP 800-53; CNSSI-4009

A virus, worm, Trojan horse, or other code-based 

In [16]:
my_query = "What's the process of validation?"

result = qa({"question": my_query, "chat_history": []})
print(result["answer"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

SOURCE: SP 800-38C

Validation –

The process of demonstrating that the system under consideration meets in all respects the specification of that system.

SOURCE: FIPS 201

Independent Verification & Validation (IV&V) –

A comprehensive review, analysis, and testing (software and/or hardware) performed by an objective third party to confirm (i.e., verify) that the requirements are correctly defined, and to confirm (i.e., validate) that the system correctly implements the required functionality and security requirements.

SOURCE: CNSSI-4009

Indicator –

SOURCE: CNSSI-4009

Verification –

Confirmation, through the provision of objective evidence, that specified requirements have been fulfilled (e.g., an entity’s requirements have been correctly defined, or an entity’s attributes have been correctly presented; or a procedure

In [17]:
my_query = "What methods do exist to protect a computer?"

result = qa({"question": my_query, "chat_history": []})
print(result["answer"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

SOURCE: CNSSI-4009

Computer Security (COMPUSEC) – Measures and controls that ensure confidentiality, integrity, and

availability of information system assets including hardware, software, firmware, and information being processed, stored, and communicated.

SOURCE: CNSSI-4009

Computer Security Incident –

See Incident.

41

NIST IR 7298 Revision 2, Glossary of Key Information Security Terms

Computer Security Incident Response Team (CSIRT) –

Extent to which protective measures, techniques, and procedures must be applied to information systems and networks based on risk, threat, vulnerability, system interconnectivity considerations, and information assurance needs. Levels of protection are: 1. Basic: information systems and networks requiring implementation of standard minimum security countermeasures. 2. Medium: informa

# Conclusions

Describe the conclusions drawn from this study of LLMs and RAG systems.

As the outputs show, the model does not produce correct results: many of its answers appear to be wrong. For some questions part of the answer is correct, but the answer as a whole is not. In addition, the same question was asked in English and in Spanish, and the Spanish answer makes no sense at all.

Note that the code asks the model to generate two responses per question (`num_return_sequences`), but only one of them is shown to the user. Because generation is sampled (`do_sample`, `top_k`), the result varies from run to run, which makes the model highly inconsistent when the same question is asked several times. Moreover, some of these answers could be classed as hallucinations, since they make no sense in the context of the PDF.
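
The effect of sampling on consistency can be shown with a toy version of top-k sampling in plain Python. This is a sketch, not the transformers implementation: `greedy`, `sample_top_k` and `draw` are hypothetical helpers, and the next-token scores are made up for illustration.

```python
import random
from typing import Dict, List, Optional


def greedy(scores: Dict[str, float]) -> str:
    """Always pick the highest-scoring token (deterministic)."""
    return max(scores, key=scores.get)


def sample_top_k(scores: Dict[str, float], k: int, rng: random.Random) -> str:
    """Pick one token at random among the k highest-scoring ones, weighted by score."""
    top = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [score for _, score in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token scores, invented for this example.
scores = {"malware": 0.40, "program": 0.35, "virus": 0.20, "banana": 0.05}


def draw(seed: Optional[int], n: int = 5) -> List[str]:
    """Draw n tokens with top-k sampling (k=3) from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [sample_top_k(scores, 3, rng) for _ in range(n)]


print([greedy(scores) for _ in range(5)])  # the same token every time
print(draw(None))                          # may differ between runs
print(draw(42) == draw(42))                # True: a fixed seed is reproducible
```

Greedy decoding or a fixed seed would make repeated runs comparable, at the cost of less varied text.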

Finally, although the maximum chunk size is set to 512, chunks larger than that are created on many occasions, because `CharacterTextSplitter` only splits at its separator and leaves a longer piece whole. A value of 64 was chosen for the chunk overlap, which means the overlapping text is processed more than once, affecting both execution time and the model's results. More or less text is processed depending on the chunk size, but the overlap also makes the processed information redundant. If the answer lies in an overlapping segment, it can appear several times in the context passed to the model.
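
Both effects, oversized chunks and repeated text, can be reproduced with a simplified separator-based splitter. The function below is a rough model written for this example, not LangChain's actual implementation; `split_on_separator` is a hypothetical name, and the blank-line separator is an assumption matching the splitter's default.

```python
from typing import List


def split_on_separator(text: str, chunk_size: int, chunk_overlap: int,
                       separator: str = "\n\n") -> List[str]:
    """Split only at `separator`, then merge pieces up to `chunk_size`.

    A single piece longer than `chunk_size` is kept whole, and trailing
    pieces of one chunk are repeated at the start of the next while they
    fit within `chunk_overlap`.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks: List[str] = []
    current: List[str] = []
    for piece in pieces:
        if current and len(separator.join(current + [piece])) > chunk_size:
            chunks.append(separator.join(current))
            # Drop leading pieces until the carry-over fits the overlap
            # budget and leaves room for the next piece.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# A 700-character paragraph has no separator inside, so it cannot be cut.
long_text = "\n\n".join(["a" * 30, "b" * 700, "c" * 30])
print([len(c) for c in split_on_separator(long_text, 512, 64)])  # [30, 700, 30]

# A short middle paragraph fits the overlap budget and appears in both chunks.
overlap_text = "\n\n".join(["x" * 300, "y" * 50, "z" * 300])
chunks = split_on_separator(overlap_text, 512, 64)
print([len(c) for c in chunks])  # [352, 352]
```

A splitter that falls back to finer separators (lines, then words) would keep chunks under the limit, at the cost of cutting glossary entries in the middle.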