<a href="https://colab.research.google.com/github/HimaniGrg/Q-A_Generator/blob/main/RAG_Q%26A.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Simple Retriever-Generator Model for Q&A

In [None]:
!pip install transformers datasets sentence-transformers



In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForLanguageModeling, AutoModelForCausalLM
from datasets import Dataset
import torch
import pandas as pd
from sentence_transformers import SentenceTransformer, util
import numpy as np

In [None]:
model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

print(f"Model: {model_name}")

Model: google/flan-t5-small


In [None]:
# function to generate the answer based on the prompt
def question_answer(prompt, model):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    encoded_input = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True)
    input_ids = encoded_input["input_ids"].to(device)
    attention_mask = encoded_input["attention_mask"].to(device)

    outputs = model.generate(input_ids, attention_mask=attention_mask, max_length=256, num_return_sequences=1, no_repeat_ngram_size=2)
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return answer

In [None]:
cybersecurity_questions = [
    'What is phishing?',
    'How does ransomware spread?',
    'What role does a firewall play in network security?',
    'What is a Man-in-the-Middle (MitM) attack and how can encrypted communications prevent it?',
    'What is penetration testing and why is it important for identifying security vulnerabilities?'
]
security_compliance_questions = [
    'What is security compliance?',
    'What is the purpose of risk management in security?',
    'What are the key components of GDPR compliance?',
    'What are the main principles of the NIST cybersecurity framework?',
    'What is the purpose of SOC 1 and who needs to follow it?',
]

# Baseline: query the pre-trained model directly, with no retrieved context
print("Cybersecurity question answering before fine-tuning: \n")
for question in cybersecurity_questions:
    answer = question_answer(question, model)
    print(f"Question: {question}\nAnswer: {answer}\n")

print("------------------------------------------------")

print("Security compliance question answering before fine-tuning: \n ")
for question in security_compliance_questions:
    answer = question_answer(question, model)
    print(f"Question: {question}\nAnswer: {answer}\n")

Cybersecurity question answering before fine-tuning: 

Question: What is phishing?
Answer: phishing is a method of detecting incoming data from originating locations.

Question: How does ransomware spread?
Answer: Viruses spread through the internet.

Question: What role does a firewall play in network security?
Answer: firewall

Question: What is a Man-in-the-Middle (MitM) attack and how can encrypted communications prevent it?
Answer: a smear of resemblance

Question: What is penetration testing and why is it important for identifying security vulnerabilities?
Answer: Using a penetration test is based on the vulnerability of the user to unauthorized access to the data.

------------------------------------------------
Security compliance question answering before fine-tuning: 
 
Question: What is security compliance?
Answer: Security compliance is the ability to protect the public and the private in the event of a breach of security.

Question: What is the purpose of risk management 

In [None]:
# Create a security knowledge base and save it as a CSV file
def create_cybersecurity_csv():
    """Create a CSV knowledge base of security concepts and return it as a DataFrame."""

    cybersecurity_data = [
        ["Phishing", "A social engineering attack where attackers send fraudulent messages to trick individuals into revealing sensitive information or installing malware. Common indicators include urgent language, suspicious links, and requests for personal information."],
        ["Ransomware", "A type of malicious software that encrypts a victim's files and demands payment for the decryption key. Ransomware often spreads through phishing emails, malicious downloads, or exploiting system vulnerabilities."],
        ["Two-Factor Authentication (2FA)", "A security method that requires users to provide two different authentication factors: something they know (password) and something they have (mobile device) or something they are (biometric). This significantly increases account security."],
        ["SQL Injection", "A code injection technique that exploits vulnerabilities in database-driven websites. Attackers insert malicious SQL statements into entry fields, allowing them to access, modify, or delete data from the database."],
        ["Man-in-the-Middle Attack", "An attack where the attacker secretly intercepts and possibly alters communications between two parties who believe they're directly communicating with each other. It can be used to steal login credentials or personal information."],
        ["VPN", "Virtual Private Network creates an encrypted connection over a less secure network. VPNs provide privacy, anonymity, and security by creating a private network from a public internet connection."],
        ["Social Engineering", "Psychological manipulation techniques that trick people into making security mistakes or giving away sensitive information. Types include phishing, pretexting, baiting, and tailgating."],
        ["Firewall", "A network security device that monitors and filters incoming and outgoing network traffic based on an organization's security policies. Firewalls establish a barrier between trusted internal networks and untrusted external networks."],
        ["Encryption", "The process of encoding information so that only authorized parties can access it. Encryption uses mathematical algorithms to convert data into a coded format that appears random without the decryption key."],
        ["Malware", "Short for malicious software, it refers to any software designed to harm or exploit devices, services, or networks. Types include viruses, trojans, worms, ransomware, spyware, and adware."],
        ["Brute Force Attack", "A method of trial and error used to decode encrypted data such as passwords by systematically checking all possible combinations until the correct one is found. Protection includes complex passwords and account lockouts."],
        ["Penetration Testing", "An authorized simulated attack on a computer system to evaluate security. Penetration testers use the same tools and techniques as attackers to find and demonstrate business impacts of vulnerabilities."],
        ["Cross-Site Scripting (XSS)", "A web security vulnerability that allows attackers to inject client-side scripts into web pages viewed by other users. This can be used to bypass access controls and impersonate users."],
        ["Spyware", "Software that secretly gathers user information through their internet connection without their knowledge, usually for advertising purposes. It can track internet activity, harvest data, and monitor keystrokes."],
        ["Hashing", "The process of converting data of any size into a fixed-size string. Unlike encryption, hashing is one-way and cannot be reversed. It's commonly used to verify data integrity and store passwords securely."],
        ["Botnet", "A network of infected computers controlled remotely by attackers, often used for DDoS attacks or spam distribution. Users are typically unaware their computer is part of a botnet."],
        ["Cyber Threat Intelligence", "Evidence-based knowledge about existing or emerging threats that helps organizations make informed security decisions. It includes context, mechanisms, indicators, implications, and action-oriented advice."],
        ["CSRF Attack", "Cross-Site Request Forgery tricks users into submitting unwanted requests to websites where they're authenticated. This can force users to execute actions without their consent or knowledge."],
        ["Zero Trust Security", "A security model that requires strict identity verification for every person and device trying to access resources, regardless of whether they're inside or outside the network perimeter."],
        ["APT", "Advanced Persistent Threat is a prolonged, targeted cyber attack where an attacker establishes an undetected presence in a network to steal sensitive data. APTs are typically conducted by nation-states or state-sponsored groups."],
        ["Security Misconfigurations", "Improperly configured security settings that leave systems vulnerable. Common examples include default credentials, error messages revealing too much information, and unnecessary features enabled."],
        ["Privilege Escalation", "A type of attack that exploits bugs, design flaws, or configuration oversights to gain elevated access to resources that are normally protected. It allows attackers to gain higher-level permissions than intended."],
        ["Supply Chain Attack", "A cyber attack that targets less-secure elements in the supply chain, such as third-party vendors or software. The SolarWinds attack of 2020 is a notable example that affected thousands of organizations."],
        ["Defense in Depth", "A cybersecurity approach that uses multiple layers of security controls throughout a system. If one defense fails, others still provide protection, making it harder for attackers to reach valuable assets."],
        ["Digital Forensics", "The process of uncovering and interpreting electronic data to preserve evidence in a way that is suitable for presentation in a court of law. Used to investigate cyber crimes and security incidents."],
        ["Fileless Malware", "A type of malicious software that exists exclusively in a computer's RAM, making it difficult to detect using traditional security tools that scan for files on disk. It often leverages legitimate system tools."],
        ["SIEM", "Security Information and Event Management systems combine security information management and security event management. They provide real-time analysis of security alerts generated by applications and network hardware."]
    ]
    security_compliance_data = [
      ["GDPR", "General Data Protection Regulation, a regulation in EU law on data protection and privacy. It sets guidelines for the collection, storage, and processing of personal data within the EU and addresses the transfer of personal data outside the EU."],
      ["ISO/IEC 27001", "An international standard for information security management. It provides a framework for managing and securing information, ensuring the confidentiality, integrity, and availability of data within an organization."],
      ["HIPAA", "The Health Insurance Portability and Accountability Act, a US law that sets standards for protecting sensitive patient data in the healthcare industry. It outlines security and privacy regulations regarding the storage and transmission of healthcare data."],
      ["PCI-DSS", "The Payment Card Industry Data Security Standard is a set of security standards designed to ensure that all companies that process, store, or transmit credit card information maintain a secure environment."],
      ["NIST Cybersecurity Framework", "The National Institute of Standards and Technology Cybersecurity Framework provides a set of guidelines for improving the security of critical infrastructure. It includes five core functions: Identify, Protect, Detect, Respond, and Recover."],
      ["SOC 2", "System and Organization Controls 2 is a framework for managing and securing sensitive customer data. It focuses on the five key areas of security, availability, processing integrity, confidentiality, and privacy."],
      ["FISMA", "The Federal Information Security Management Act requires federal agencies and contractors to secure information systems. It mandates risk assessments, security planning, and implementation of security controls."],
      ["CMMC", "The Cybersecurity Maturity Model Certification is a cybersecurity framework specifically designed for Department of Defense contractors. It assesses the maturity and security of an organization’s practices to handle controlled unclassified information."],
      ["NIST 800-53", "A set of security and privacy controls developed by NIST for federal information systems in the United States. It provides a catalog of controls designed to protect the confidentiality, integrity, and availability of information systems."],
      ["SOX", "The Sarbanes-Oxley Act sets standards for all U.S. public company boards, management, and public accounting firms. It focuses on financial recordkeeping, internal controls, and corporate governance to prevent fraud."],
      ["CIS Controls", "The Center for Internet Security Critical Security Controls is a set of best practices to defend against cyber threats. It is a prioritized list of safeguards, with 20 top-level controls through version 7 and 18 in version 8, that helps organizations improve their cybersecurity posture."],
      ["SOC 1", "System and Organization Controls 1 is a framework designed for service organizations that provide financial reporting services. It focuses on ensuring that financial data handling by third-party services complies with necessary regulations."],
      ["GDPR Data Subject Rights", "Under the GDPR, individuals have the right to access, rectify, erase, and restrict the processing of their personal data. These rights also include the ability to object to processing and request data portability."],
      ["FCRA", "The Fair Credit Reporting Act regulates the collection, use, and dissemination of consumer credit information. It mandates the accuracy and privacy of credit information and requires consent for credit reports."],
      ["NIST 800-171", "NIST 800-171 provides guidelines for securing Controlled Unclassified Information (CUI) in non-federal systems and organizations. It covers 14 security control families, such as access control, incident response, and system and communications protection."]
    ]

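    # Create DataFrame and save to CSV.
    # Note: only cybersecurity_data is written here; security_compliance_data is
    # defined above but never added to the DataFrame, so the knowledge base
    # holds the 27 cybersecurity entries only.
    df = pd.DataFrame(cybersecurity_data, columns=['Concept', 'Description'])
    df.to_csv('cybersecurity_knowledge.csv', index=False)
    print(f"Created CSV with {len(df)} cybersecurity concepts")
    return df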
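
The function above defines `security_compliance_data` but builds the DataFrame from `cybersecurity_data` alone. Below is a minimal sketch of one way to merge both lists, using only the standard library so it runs on its own. The helper names `build_knowledge_rows` and `rows_to_csv` and the extra `Category` column are assumptions for illustration, not part of the original notebook.

```python
import csv
import io

def build_knowledge_rows(cybersecurity_data, security_compliance_data):
    """Merge both [concept, description] lists into one list of tagged rows."""
    rows = []
    for concept, description in cybersecurity_data:
        rows.append({"Category": "Cybersecurity", "Concept": concept, "Description": description})
    for concept, description in security_compliance_data:
        rows.append({"Category": "Security Compliance", "Concept": concept, "Description": description})
    return rows

def rows_to_csv(rows):
    """Serialize the merged rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["Category", "Concept", "Description"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Tiny stand-in lists; the real ones are defined inside create_cybersecurity_csv()
sample_cyber = [["Phishing", "A social engineering attack."]]
sample_compliance = [["GDPR", "An EU regulation on data protection and privacy."]]
rows = build_knowledge_rows(sample_cyber, sample_compliance)
print(len(rows))
```

In the notebook itself, `pd.DataFrame(rows)` would turn the merged rows into a DataFrame with the same `Concept` and `Description` columns the retriever reads. Doing so would change the entry count and every downstream output recorded here.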

In [None]:
df = create_cybersecurity_csv()

df.head()

Created CSV with 27 cybersecurity concepts


    Concept                          Description
0   Phishing                         A social engineering attack where attackers se...
1   Ransomware                       A type of malicious software that encrypts a v...
2   Two-Factor Authentication (2FA)  A security method that requires users to provi...
3   SQL Injection                    A code injection technique that exploits vulne...
4   Man-in-the-Middle Attack         An attack where the attacker secretly intercep...


In [None]:
df.tail()

    Concept              Description
22  Supply Chain Attack  A cyber attack that targets less-secure elemen...
23  Defense in Depth     A cybersecurity approach that uses multiple la...
24  Digital Forensics    The process of uncovering and interpreting ele...
25  Fileless Malware     A type of malicious software that exists exclu...
26  SIEM                 Security Information and Event Management syst...


In [None]:
# Load sentence transformer for embeddings
print("Loading embedding model...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Create embeddings for our knowledge base
print("Creating embeddings for knowledge base...")
df['text'] = df['Concept'] + ": " + df['Description']
embeddings = embedding_model.encode(df['text'].tolist())
print(f"Created {len(embeddings)} embeddings of dimension {embeddings[0].shape[0]}")

Loading embedding model...
Creating embeddings for knowledge base...
Created 27 embeddings of dimension 384


In [None]:
# Function to retrieve the most relevant concept for a query
def retriever(query, df, embeddings):
    # Encode the query; the tensor lands on the embedding model's device
    query_embedding = embedding_model.encode([query], convert_to_tensor=True)

    # Move the knowledge-base embeddings to the same device as the query
    embeddings = torch.as_tensor(embeddings).to(query_embedding.device)

    # Compute cosine similarities between the query and every entry
    similarities = util.pytorch_cos_sim(query_embedding, embeddings)[0]

    most_relevant_index = similarities.argmax().item()

    # Return the most relevant concept and description
    row = df.iloc[most_relevant_index]
    return row['Concept'], row['Description']
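
The retriever always returns the single highest-scoring entry, even when no entry is a good match. The sketch below shows the same cosine-similarity ranking in plain Python, extended to return the top `k` entries and to drop any that score below a cutoff. The function names and the `k` and `min_score` parameters are hypothetical, and the 0.5 cutoff is an arbitrary illustration rather than a tuned value.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def top_k_matches(query_vec, doc_vecs, k=3, min_score=0.0):
    """Return up to k (index, score) pairs, best first, with score >= min_score."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(i, s) for i, s in scored[:k] if s >= min_score]

# Toy 2-D vectors standing in for the 384-dimensional sentence embeddings
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matches = top_k_matches([1.0, 0.1], docs, k=2, min_score=0.5)
print(matches)
```

An empty result signals that nothing in the knowledge base is close to the query, which a caller could use to skip the context instead of passing an unrelated entry to the generator.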

In [None]:
query = "What is phishing?"
concept, description = retriever(query, df, embeddings)
print("Relevant Concept:", concept)
print("Description:", description)

Relevant Concept: Phishing
Description: A social engineering attack where attackers send fraudulent messages to trick individuals into revealing sensitive information or installing malware. Common indicators include urgent language, suspicious links, and requests for personal information.


In [None]:
# Function to generate a response from the supplied model, given the retrieved context
def generate_response(query, model, retrieved_concept, retrieved_desc):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Combine the query with the relevant information
    input_text = f"Question: {query} Answer with context: {retrieved_concept} \n Description: {retrieved_desc}"

    # Tokenize the prompt
    encoded_input = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)
    input_ids = encoded_input["input_ids"].to(device)
    attention_mask = encoded_input["attention_mask"].to(device)

    # Generate response
    outputs = model.generate(input_ids, attention_mask=attention_mask, max_length=256, num_return_sequences=1, no_repeat_ngram_size=2)

    # Decode and return the generated response
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response
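
`generate_response` places one retrieved entry into a fixed template. If several entries were retrieved, the prompt could be assembled along the lines of the sketch below. The helper `build_prompt`, its template wording, and the `max_chars` budget are assumptions for illustration. The character budget is only a crude stand-in for a token budget, since the tokenizer call above already truncates over-long inputs.

```python
def build_prompt(query, passages, max_chars=1000):
    """Assemble a prompt from a question and (concept, description) pairs.

    Passages are added in order until the context would exceed max_chars.
    """
    context_lines = []
    used = 0
    for concept, description in passages:
        line = f"{concept}: {description}"
        if used + len(line) > max_chars:
            break  # stop before the context grows past the budget
        context_lines.append(line)
        used += len(line)
    context = "\n".join(context_lines)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

prompt = build_prompt(
    "What is phishing?",
    [("Phishing", "A social engineering attack."), ("Malware", "Malicious software.")],
)
print(prompt)
```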

In [None]:
def question_answer_rag(query, model, df, embeddings):
    # Retrieve relevant concept and description
    retrieved_concept, retrieved_desc = retriever(query, df, embeddings)

    answer = generate_response(query, model, retrieved_concept, retrieved_desc)
    return answer
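
Since the generator can only work with what the retriever hands it, it can help to measure retrieval on its own. The sketch below computes the fraction of queries whose retrieved concept matches an expected label. The function `retrieval_accuracy`, the stand-in `keyword_retriever`, and the expected labels are all hypothetical and not part of the original notebook.

```python
def retrieval_accuracy(labeled_queries, retrieve):
    """Fraction of queries whose retrieved concept equals the expected one.

    labeled_queries: list of (query, expected_concept) pairs.
    retrieve: callable mapping a query string to a concept name.
    """
    if not labeled_queries:
        return 0.0
    hits = sum(1 for query, expected in labeled_queries if retrieve(query) == expected)
    return hits / len(labeled_queries)

# Stand-in retriever for illustration; in the notebook this could be
# lambda q: retriever(q, df, embeddings)[0]
def keyword_retriever(query):
    return "Phishing" if "phishing" in query.lower() else "Defense in Depth"

labeled = [
    ("What is phishing?", "Phishing"),
    ("What are the key components of GDPR compliance?", "GDPR"),
]
accuracy = retrieval_accuracy(labeled, keyword_retriever)
print(accuracy)
```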

In [None]:
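# Swap the generator to DistilGPT-2, a decoder-only (causal) language model.
# Note: this rebinds the global tokenizer used by generate_response.
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# GPT-2 tokenizers define no padding token, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token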
generator_model = AutoModelForCausalLM.from_pretrained(model_name)
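
A decoder-only model such as distilgpt2 returns the prompt tokens followed by the newly generated ones, so decoding `outputs[0]` in full repeats the prompt in the answer. The sketch below illustrates the slicing idea on plain lists of token ids. The helper name `split_continuation` is hypothetical.

```python
def split_continuation(output_ids, prompt_length):
    """Return only the tokens generated after the prompt.

    output_ids: token ids returned by a causal LM (prompt followed by new tokens).
    prompt_length: number of tokens in the prompt that was fed in.
    """
    if prompt_length < 0 or prompt_length > len(output_ids):
        raise ValueError("prompt_length must be between 0 and len(output_ids)")
    return output_ids[prompt_length:]

# Toy ids: the first three stand for the prompt, the last two for new text
continuation = split_continuation([11, 12, 13, 21, 22], prompt_length=3)
print(continuation)
```

Inside `generate_response`, this would roughly correspond to decoding `outputs[0][input_ids.shape[1]:]` instead of `outputs[0]`. Note also that for a causal model `max_length=256` counts the prompt tokens as well, whereas the `max_new_tokens` argument of `generate` bounds only the generated part.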

In [None]:
cybersecurity_questions = [
    'What is phishing and how can you protect against it?',
    'How does ransomware spread and what steps should be taken to prevent it?',
    'What role does a firewall play in network security?',
    'What is a Man-in-the-Middle (MitM) attack and how can encrypted communications prevent it?',
    'What is penetration testing and why is it important for identifying security vulnerabilities?'
]
security_compliance_questions = [
    'What is security compliance?',
    'What is the purpose of risk management in security?',
    'What are the key components of GDPR compliance?',
    'What are the main principles of the NIST cybersecurity framework?',
    'What is the purpose of SOC 1 and who needs to follow it?',
]

# testing the RAG pipeline (retriever + generator)
print("Cybersecurity question answering after implementation of RAG: \n")
for question in cybersecurity_questions:
    answer = question_answer_rag(question, generator_model, df, embeddings)
    print(f"{answer}\n")

print("------------------------------------------------")

print("Security compliance question answering after implementation of RAG: \n ")
for question in security_compliance_questions:
    answer = question_answer_rag(question, generator_model, df, embeddings)
    print(f"{answer}\n")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Cybersecurity question answering after implementation of RAG: 



Question: What is phishing and how can you protect against it? Answer with context: Phishing 
 Description: A social engineering attack where attackers send fraudulent messages to trick individuals into revealing sensitive information or installing malware. Common indicators include urgent language, suspicious links, and requests for personal information.

The attack is not a new attack, but it is a major threat to the security of the Internet. The attacks are not new, they are still being used by the government, the media, governments, corporations, government agencies, etc. It is important to remember that phishers are often used to attack the web, especially in the United States. This is why phisher attacks have been used in many countries, including the US, Canada, Australia, New Zealand, Singapore, Malaysia, Taiwan, Thailand, Vietnam, Indonesia, Pakistan, South Korea, China, India, Japan, Russia, Saudi Arabia, United Kingdom, France, Germany, Italy, Spain, Sweden, Switzerland, Tur

Question: How does ransomware spread and what steps should be taken to prevent it? Answer with context: Ransomware 
 Description: A type of malicious software that encrypts a victim's files and demands payment for the decryption key. Ransomware often spreads through phishing emails, malicious downloads, or exploiting system vulnerabilities.

The ransomware is a type that can be used to infect computers, computers and other systems. It can also be exploited to steal data from computers or other computers. The ransomware can infect the computer, computer or computer. A ransomware attack can cause the victim to lose the data and the system is compromised. This type can take place in the form of a ransomware or ransomware. In the case of ransomware, the ransomware has a number of different types of attacks. For example, a malicious attack could cause a computer to be infected with a virus that is infected by a malware. If the malware is not infected, it can then infect a system. An attacke

Question: What role does a firewall play in network security? Answer with context: Firewall 
 Description: A network security device that monitors and filters incoming and outgoing network traffic based on an organization's security policies. Firewalls establish a barrier between trusted internal networks and untrusted external networks. The firewall is a network firewall that is designed to protect the network from attacks.

The firewall protects the user from any malicious activity. It protects against any unauthorized activity or malicious activities. This firewall prevents unauthorized access to the Internet. If a user is using a VPN or other network, the firewall will not be able to access the internet. A firewall can be used to prevent unauthorized use of the web. In addition, a computer can access a web browser or a browser that can't access it. An attacker can use a proxy to connect to a server. When a malicious user accesses the server, it can then use the proxy. To prevent a 

Question: What is a Man-in-the-Middle (MitM) attack and how can encrypted communications prevent it? Answer with context: Man-in-the-Middle Attack 
 Description: An attack where the attacker secretly intercepts and possibly alters communications between two parties who believe they're directly communicating with each other. It can be used to steal login credentials or personal information.

The attacker can use a combination of encryption and encryption to intercept communications. The attacker is able to use the combination to decrypt communications from two different parties. This is the case with the MitM attack. In the example above, the attack is used by the two party to obtain a password for the password of the user. If the key is not in the same key, it is encrypted. However, if the encryption is in a different key and the message is different, then the encrypted message will be encrypted and encrypted again. When the sender is using the different encryption, he or she can decry

Question: What is penetration testing and why is it important for identifying security vulnerabilities? Answer with context: Penetration Testing 
 Description: An authorized simulated attack on a computer system to evaluate security. Penetration testers use the same tools and techniques as attackers to find and demonstrate business impacts of vulnerabilities.

The penetration test is a test of the security of a system. The penetration tests are a way to test the vulnerabilities of an operating system, and the penetration testers are the ones who are responsible for the vulnerability. This test can be used to determine the extent of vulnerability in a given system and to identify the source of attack. In this test, the attacker is able to detect the presence of any vulnerabilities in the system or to discover the existence of other vulnerabilities that may be present. If the attack is detected, it is the first time the victim has been exposed to the threat. It is important to understand

Question: What is security compliance? Answer with context: Security Misconfigurations 
 Description: Improperly configured security settings that leave systems vulnerable. Common examples include default credentials, error messages revealing too much information, and unnecessary features enabled.

The following is a list of security misconfiguration scenarios that are not covered in this article. The following are examples of the security configuration scenarios. These are the following scenarios:
1. Security mis-configure: The security manager is not aware of any security issues. This is the case when the user is using a computer. If the administrator is aware that the system is running a malicious program, the attacker can use the same configuration to execute malicious code. In this scenario, a user can execute a program that is run by the computer, or a system that has a security issue. For example, if the program is being run on a different computer or system, it is executing a s

Question: What is the purpose of risk management in security? Answer with context: Defense in Depth 
 Description: A cybersecurity approach that uses multiple layers of security controls throughout a system. If one defense fails, others still provide protection, making it harder for attackers to reach valuable assets.

The security of a network is a critical part of the security. The security is not a single security issue. It is an interconnected system that is interconnected with multiple networks. This is why the Security Policy Center (SCPC) is dedicated to providing a comprehensive and comprehensive guide to the best security practices for security professionals.



Question: What are the key components of GDPR compliance? Answer with context: Security Misconfigurations 
 Description: Improperly configured security settings that leave systems vulnerable. Common examples include default credentials, error messages revealing too much information, and unnecessary features enabled.

The GDPA compliance standard is a standard that is used to enforce the security of the system. It is designed to protect the integrity of systems and to prevent the misuse of sensitive information. The standard has been adopted by the Federal Trade Commission and the Department of Homeland Security. This standard was adopted in the wake of a series of attacks on the United States and other countries. In addition, the standard provides a set of security features that are designed for the purpose of protecting the systems. These include:
1. Security settings for all systems that have a security profile. 2. A set that includes a default password. 3. An optional set for a pass

Question: What are the main principles of the NIST cybersecurity framework? Answer with context: Defense in Depth 
 Description: A cybersecurity approach that uses multiple layers of security controls throughout a system. If one defense fails, others still provide protection, making it harder for attackers to reach valuable assets.

The NISD is a security framework that is designed to protect the integrity of a network. It is based on the principles that are used in the framework. The NISS is an open-source framework for the security of an organization. This framework is used to provide security for organizations and organizations. In addition, it is also used for security in a variety of different organizations, including the U.S. Department of Defense.

Question: What is the purpose of SOC 1 and who needs to follow it? Answer with context: Defense in Depth 
 Description: A cybersecurity approach that uses multiple layers of security controls throughout a system. If one defense fails,