<a href="https://colab.research.google.com/github/RealAI-RAI/RAG_SPM/blob/main/RAG_SPM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Install Necessary Dependencies**

In [None]:
!pip install langchain_experimental
!pip install "langchain[docarray]"
!pip install langchain_openai
!pip install transformers
!pip install python-dotenv
!pip install langchain
!pip install openai
!pip install tiktoken
!pip install faiss-gpu
!pip install rag

**Imports**

In [None]:
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from gensim.models import Word2Vec
from nltk.corpus import stopwords
from dotenv import load_dotenv
import pandas as pd
import nltk
import re
import os


### Loading OpenAI API Key from Environment Variables

In [None]:
load_dotenv()

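# os.getenv expects the name of the environment variable (as defined in .env), not the key itself
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")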
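
If the key is missing, `os.getenv` returns `None` without complaint and the problem only surfaces at the first API call. A minimal sketch of an early check, assuming the key is stored under the variable name `OPENAI_API_KEY`; the helper `require_env` is a hypothetical name introduced here for illustration and is not part of `python-dotenv` or LangChain:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; add it to your .env file "
            "or export it before running the notebook."
        )
    return value


# Example usage (assumes load_dotenv() has already been called):
# OPENAI_API_KEY = require_env("OPENAI_API_KEY")
```

Failing early like this keeps a configuration mistake from showing up later as an authentication error.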

### Initializing OpenAI Chat Model

In [None]:
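# Set the OPENAI_API_KEY environment variable if it is not already defined
# (replace the placeholder with your own key and never commit a real key)
os.environ.setdefault('OPENAI_API_KEY', 'Your OpenAI API Key')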

# Now initialize the ChatOpenAI model
model = ChatOpenAI(openai_api_key=os.environ['OPENAI_API_KEY'], model="gpt-3.5-turbo")


## Creating Chat Model Instance with OpenAI's GPT-3.5 Turbo Model

In [None]:
from langchain_openai.chat_models import ChatOpenAI

model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo")

**Checking Model Response**

In [None]:
model.invoke("What is the capital of Pakistan?")

AIMessage(content='Islamabad', response_metadata={'token_usage': {'completion_tokens': 2, 'prompt_tokens': 14, 'total_tokens': 16}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': 'fp_3bc1b5746c', 'finish_reason': 'stop', 'logprobs': None})

**Loading Data**

In [None]:
file_path = "/content/PHISHING & SCAM MESSAGES ONLY + THEIR SPECIFIC EXPLANATIONS IN OUR DESIRED FORMAT (1).xlsx" # Replace with the path to your Excel file
data = pd.read_excel(file_path)

**Displaying the First Few Rows of the Dataset**

In [None]:
data.head()

| Unnamed: 0 | Phishing/scam message | Explanations and CTA |
| --- | --- | --- |
| 0 | ATTN: Account Holder, Your UCHICAGO account ha... | Risk Level: 99% - Definitely a scam/phishing b... |
| 1 | Your UChicago account has been filed under the... | Risk Level: 99% - Definitely a scam/phishing b... |
| 2 | Account Holder,\n\nYour UCHICAGO account has b... | Risk Level: 99% - Definitely a scam/phishing b... |
| 3 | Dear Account Holder,\n\nYour UCHICAGO account ... | Risk Level: 99% - Definitely a scam/phishing b... |
| 4 | A copy of your Student record is available for... | Risk Level: 99% - Definitely a scam/phishing b... |


To use Retrieval-Augmented Generation (RAG) effectively, we need to structure the knowledge base appropriately. The knowledge base consists of passages of text along with corresponding embeddings (vectors) for each passage, so we first need to prepare the data in that format.



*   **create_passage_dict_with_vectors:**
This function creates a dictionary of passages along with their vector representations.
*   **structured_data:**
This list contains structured data suitable for RAG, where each entry consists of a passage and its vector representations.
*   **knowledge_base:**
This variable holds the structured data, which forms the knowledge base for RAG.

Once the data has been prepared this way, we can proceed with further steps, such as training the RAG model or using the knowledge base for retrieval and generation tasks.
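
The notebook stores one vector per word for each passage. For similarity search it is often convenient to reduce those to a single fixed-length vector per passage. A minimal sketch of one common option, mean pooling, is shown here. It is an assumption for illustration and not a step the notebook performs: `mean_pool` is a hypothetical helper, and the default `dim=100` mirrors the `vector_size=100` used when training Word2Vec.

```python
from typing import List, Sequence


def mean_pool(word_vectors: Sequence[Sequence[float]], dim: int = 100) -> List[float]:
    """Average a list of word vectors into a single passage vector.

    Returns a zero vector of length `dim` when the passage has no known words.
    """
    if len(word_vectors) == 0:
        return [0.0] * dim
    count = len(word_vectors)
    width = len(word_vectors[0])
    # Average each dimension across all word vectors
    return [sum(float(vec[i]) for vec in word_vectors) / count for i in range(width)]


# Toy example with 2-dimensional "word vectors"
passage_vector = mean_pool([[1.0, 2.0], [3.0, 4.0]], dim=2)
print(passage_vector)  # [2.0, 3.0]
```

The function only indexes into each vector, so it should also work on the NumPy arrays returned by `word2vec_model.wv[word]`.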

**Preprocessing Text Data and Vectorization Using Word2Vec**



In [None]:
# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


def preprocess_text(text):
    # Check if the input is not NaN
    if isinstance(text, str):
        # Convert text to lowercase
        text = text.lower()

        # Remove special characters, numbers, and punctuation
        text = re.sub(r'[^a-zA-Z\s]', '', text)

        # Tokenization
        tokens = word_tokenize(text)

        # Remove stop words
        stop_words = set(stopwords.words('english'))
        tokens = [word for word in tokens if word not in stop_words]

        # Lemmatization
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(word) for word in tokens]

        return tokens
    else:
        # Return an empty list if the input is NaN
        return []

# Preprocess messages and explanations
data['Phishing/scam message'] = data['Phishing/scam message'].apply(preprocess_text)
data['Explanations and CTA'] = data['Explanations and CTA'].apply(preprocess_text)

# Define Word2Vec model training function
def train_word2vec_model(data):
    # Train Word2Vec model
    sentences = data['Phishing/scam message'].tolist() + data['Explanations and CTA'].tolist()
    word2vec_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
    return word2vec_model

# Train Word2Vec model
word2vec_model = train_word2vec_model(data)

# Function to convert preprocessed text to vector representations
def text_to_vectors(text, word2vec_model):
    vectors = []
    for word in text:
        if word in word2vec_model.wv:
            vectors.append(word2vec_model.wv[word])
    return vectors

# Convert preprocessed text to vector representations
data['Message Vectors'] = data['Phishing/scam message'].apply(lambda text: text_to_vectors(text, word2vec_model))
data['Explanation Vectors'] = data['Explanations and CTA'].apply(lambda text: text_to_vectors(text, word2vec_model))

# Print first few rows to verify preprocessing and vectorization
data.head()


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


| Unnamed: 0 | Phishing/scam message | Explanations and CTA | Message Vectors | Explanation Vectors |
| --- | --- | --- | --- | --- |
| 0 | [attn, account, holder, uchicago, account, fil... | [risk, level, definitely, scamphishing, offici... | [[-0.03162857, 0.055801008, -0.0039279945, 0.0... | [[0.070889, 0.419793, -0.06136308, -0.2445411,... |
| 1 | [uchicago, account, filed, list, account, set,... | [risk, level, definitely, scamphishing, univer... | [[-0.17063361, 0.21021594, 0.02098644, 0.05522... | [[0.070889, 0.419793, -0.06136308, -0.2445411,... |
| 2 | [account, holder, uchicago, account, filed, li... | [risk, level, definitely, scamphishing, legiti... | [[-0.12153986, 0.22089499, -0.3603233, 0.29381... | [[0.070889, 0.419793, -0.06136308, -0.2445411,... |
| 3 | [dear, account, holder, uchicago, account, fil... | [risk, level, definitely, scamphishing, reques... | [[-0.14314194, 0.1792661, 0.12133565, 0.196950... | [[0.070889, 0.419793, -0.06136308, -0.2445411,... |
| 4 | [copy, student, record, available, look, look,... | [risk, level, definitely, scamphishing, authen... | [[-0.14440073, 0.031502437, 0.111991294, 0.223... | [[0.070889, 0.419793, -0.06136308, -0.2445411,... |






**Creating Knowledge Base for Retrieval-Augmented Generation (RAG)**

* Creating a dictionary of passages and their vector representations.
* Structuring the data for RAG.
* Building the knowledge base.

In [None]:
# Function to create dictionary of passages and their vector representations
def create_passage_dict_with_vectors(data):
    passage_dict = {}
    for index, row in data.iterrows():
        message_id = index
        message = row['Phishing/scam message']
        explanation = row['Explanations and CTA']
        message_vectors = row['Message Vectors']
        explanation_vectors = row['Explanation Vectors']

        passage_dict[message_id] = {
            'message': message,
            'explanation': explanation,
            'message_vectors': message_vectors,
            'explanation_vectors': explanation_vectors
        }
    return passage_dict

# Create passage dictionary with vector representations
passage_dict = create_passage_dict_with_vectors(data)

# Structure data for RAG
structured_data = []
for message_id, passage_data in passage_dict.items():
    structured_data.append({
        'passage': ' '.join(passage_data['message']) + ' ' + ' '.join(passage_data['explanation']),
        'passage_vectors': passage_data['message_vectors'] + passage_data['explanation_vectors']
    })

# Build the knowledge base
knowledge_base = structured_data

# Print first few entries of the knowledge base for verification
for i, entry in enumerate(knowledge_base[:5]):
    print(f"Passage ID: {i}")
    print(f"Passage: {entry['passage']}")
    print(f"Passage Vectors: {entry['passage_vectors']}")
    print()


Streaming output truncated to the last 5000 lines.
      dtype=float32), array([-3.07075202e-01,  9.62014273e-02,  1.89631172e-02,  2.48683140e-01,
        1.75502911e-01, -1.62882850e-01,  8.58369321e-02,  2.15453908e-01,
       -3.34774703e-02, -2.14586202e-02, -9.70844105e-02, -2.15633661e-02,
       -1.78537853e-02,  4.60846163e-03, -1.40093416e-01,  1.16447069e-01,
       -2.42988199e-01, -1.65209696e-01,  1.66572496e-01, -1.68786451e-01,
        2.96384143e-03, -1.53331131e-01, -7.28215054e-02, -2.90299356e-01,
        1.78462416e-01,  5.47409616e-03, -2.96226144e-01, -2.46156886e-01,
       -1.95339635e-01, -2.57917464e-01,  5.36370650e-02,  3.61166567e-01,
       -2.10002810e-01, -6.34003580e-02, -1.93170294e-01,  1.98368609e-01,
        6.62219711e-03, -4.16232079e-01, -2.45616864e-02, -4.39556576e-02,
       -9.69266072e-02,  4.60283682e-02,  1.28322706e-01,  4.04281467e-02,
       -8.69444460e-02, -2.23133549e-01, -1.33419916e-01,  2.75419384e-01,
        9.749

**Generating Responses using RAG**

1. **Retriever:**
This component retrieves relevant passages from the knowledge base given a query. It utilizes semantic similarity or other methods to find passages that are most relevant to the query.

2. **Generator:**
Once the relevant passages are retrieved, the generator component generates responses based on these passages and the input query. It may use a language model like GPT (Generative Pre-trained Transformer) to generate text.

3. **Tokenizer:** The tokenizer is responsible for tokenizing input text and preparing it for processing by the retriever and generator components.
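
How the retriever's output reaches the generator can be sketched as a small prompt-assembly step. This is a minimal illustration under stated assumptions and not the notebook's own implementation: `build_messages` is a hypothetical helper, and messages are written as plain `(role, content)` tuples. Recent LangChain chat models generally accept that form in `invoke`; if yours does not, wrap each tuple in `SystemMessage`, `HumanMessage`, or `AIMessage` instead.

```python
from typing import List, Sequence, Tuple

Message = Tuple[str, str]  # (role, content)


def build_messages(
    query: str,
    passages: Sequence[str],
    history: Sequence[Message] = (),
) -> List[Message]:
    """Assemble a chat prompt: system instructions with context, then history, then the query."""
    # Number the retrieved passages so the model can tell them apart
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    system = (
        "You are a helpful assistant. Answer using the reference passages below "
        "when they are relevant.\n\n" + context
    )
    # Earlier turns go before the new query so the model answers the latest question
    return [("system", system), *history, ("human", query)]


messages = build_messages(
    "Is this email a scam?",
    ["account suspended verify password", "confirm login credential"],
)
```

Placing the retrieved text in the system message keeps it separate from the user's own words, which is one reasonable design choice among several.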

Now that we have built our knowledge base and structured the data for Retrieval-Augmented Generation (RAG), we can use it to generate responses to queries:


*   Define a function to query the knowledge base and retrieve relevant passages.
*   Use the retrieved passages to generate responses using the RAG model.
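
The retrieval step in the first bullet can be sketched with cosine similarity. This is a minimal illustration under stated assumptions and not the notebook's method: each knowledge-base entry is assumed to carry a single fixed-length vector under a hypothetical `vector` key (for example, the average of its word vectors), and `cosine_similarity` and `retrieve_top_k` are hypothetical helper names.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vector: Sequence[float], knowledge_base: List[Dict], k: int = 3) -> List[Dict]:
    """Return the k entries whose 'vector' is most similar to the query vector."""
    ranked = sorted(
        knowledge_base,
        key=lambda entry: cosine_similarity(query_vector, entry["vector"]),
        reverse=True,
    )
    return ranked[:k]


# Toy knowledge base with 2-dimensional vectors
toy_kb = [
    {"passage": "account suspended verify password", "vector": [1.0, 0.0]},
    {"passage": "lunch menu friday", "vector": [0.0, 1.0]},
    {"passage": "confirm login credential", "vector": [0.9, 0.1]},
]
top = retrieve_top_k([1.0, 0.0], toy_kb, k=2)
print([entry["passage"] for entry in top])
# ['account suspended verify password', 'confirm login credential']
```

For a larger knowledge base, a vector index would likely replace the linear scan, but the ranking idea stays the same.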




In [None]:
# Define a function to query the knowledge base
def query_knowledge_base(query, knowledge_base):
    # Placeholder for actual retrieval logic (e.g., using semantic similarity)
    # For simplicity, we'll just return all passages for now
    return knowledge_base

# Function to generate responses using RAG
def generate_responses(query, knowledge_base, chat_model):
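    # Query the knowledge base.
    # NOTE: the retrieved passages are not yet added to the prompt below, so the
    # model currently answers from the query alone.
    relevant_passages = query_knowledge_base(query, knowledge_base)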

    # Prepare messages for the chat model
    messages = [
        SystemMessage(content="You are a helpful assistant."),
        HumanMessage(content=query)
    ]

    # Generate response using the chat model
    responses = chat_model.invoke(messages)

    return responses
# Example usage:
query = "How can I protect my university account from phishing?"
responses = generate_responses(query, knowledge_base, model)

# Print the generated response. `invoke` returns a single AIMessage, so read its
# text from `.content` rather than iterating over the message's fields.
print(f"Response: {responses.content}")


Response: To protect your university account from phishing, here are some important steps you can take:

1. Be cautious with emails: Always be wary of emails requesting sensitive information or containing suspicious links. Verify the sender's email address and avoid clicking on any links or downloading attachments from unknown sources.

2. Enable two-factor authentication (2FA): Set up 2FA for your university account if available. This adds an extra layer of security by requiring a second verification step, such as a code sent to your phone, in addition to your password.

3. Use a strong, unique password: Ensure your university account password is strong and unique. Avoid using the same password for multiple accounts and consider using a password manager to securely store and generate complex passwords.

4. Keep your software updated: Make sure your device's operating system, antivirus software, and web browsers are up to date with the latest security patches to

**Generating Responses with Context Understanding**

In [None]:
# Function to generate responses using RAG with context understanding
def generate_responses_with_context(query, conversation_history, knowledge_base, chat_model):
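    # Query the knowledge base.
    # NOTE: the retrieved passages are not yet added to the prompt below.
    relevant_passages = query_knowledge_base(query, knowledge_base)

    # Prepare messages for the chat model: system prompt first, then the earlier
    # conversation, and the new query last so that the model answers it
    messages = [SystemMessage(content="You are a helpful assistant.")]
    messages.extend(conversation_history)
    messages.append(HumanMessage(content=query))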

    # Generate response using the chat model
    responses = chat_model.invoke(messages)

    return responses

# Example usage with context understanding
from langchain_core.messages import AIMessage

query = "How can I protect my university account from phishing?"
conversation_history = [
    HumanMessage(content="I received an email asking for my login credentials. Is it safe to provide them?"),
    # The assistant's earlier reply belongs in an AIMessage, not a SystemMessage
    AIMessage(content="No, university IT departments typically do not ask for login credentials via email.")
]
responses = generate_responses_with_context(query, conversation_history, knowledge_base, model)

# Print the generated response. `invoke` returns a single AIMessage, so read its
# text from `.content` rather than iterating over the message's fields.
print(f"Response: {responses.content}")


Response: It is not safe to provide your login credentials in response to an email asking for them. This is a common phishing tactic used by cybercriminals to steal sensitive information. It's important to always verify the legitimacy of the request before providing any personal information. If you are unsure, contact your university's IT department directly to confirm if the email is legitimate.


In [None]:
def sender_input():
    # Function to get input from the sender
    return input("Sender: Thrilled you're interested! We're on the hunt for influencers to rock our latest sneaker line. As part of our collab, we offer you 40% off our collection. You'll bag a 25% commission on sales with your unique code. Just sport our kicks, tag us, and you'll be our next feature on social and our site. Sound like a plan?")

def receiver_input():
    # Function to get input from the receiver
    return input("Receiver: need More Information ")


In [None]:
def generate_responses_with_context(knowledge_base, chat_model):
    # Initialize conversation history
    conversation_history = []

    # Sender's turn
    sender_input = input("Sender: Frustrated! My phone/tablet just crashed and it's not turning on. Need to get this fixed before we can video chat. Rain check? ")
    conversation_history.append(HumanMessage(content=sender_input))

    # Receiver's turn
    receiver_input = input("Receiver: need More Information ")
    conversation_history.append(HumanMessage(content=receiver_input))

    # Check if the conversation should continue
    while True:
        # Sender's turn
        sender_input = input("Sender: Considering your expertise, Lee, we'd like your review on this confidential security policy update. Access it with your employee credentials.")
        conversation_history.append(HumanMessage(content=sender_input))
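        # `invoke` returns a single AIMessage; store it whole in the history
        # (iterating over it would append (field, value) tuples instead)
        response = chat_model.invoke([
            SystemMessage(content="You are a helpful assistant."),
            HumanMessage(content=sender_input)
        ])
        conversation_history.append(response)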

        # Receiver's turn
        receiver_input = input("Receiver: Achieve unparalleled success! See my collection of sports cars and designer attire? It's all attainable. Join my investment circle with an initial deposit, and let's drive towards a prosperous future together.")
        conversation_history.append(HumanMessage(content=receiver_input))
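        # `invoke` returns a single AIMessage; store it whole in the history
        # (iterating over it would append (field, value) tuples instead)
        response = chat_model.invoke([
            SystemMessage(content="You are a helpful assistant."),
            HumanMessage(content=receiver_input)
        ])
        conversation_history.append(response)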

        # Check if the conversation should end
        if receiver_input.strip().lower() == "end":
            break

    # Generate response for the entire conversation
    responses = chat_model.invoke([
        SystemMessage(content="You are a helpful assistant."),
        *conversation_history  # Include the entire conversation history
    ])

    return responses


