## RAG for information retrieval and anomaly detection
### LlamaIndex and LangChain

Install the following libraries before running the rest of the notebook (select the cell and press Shift+Enter to run it).

In [77]:
!pip install langchain llama_index pandas plotly scikit-learn



In [78]:
import warnings
warnings.filterwarnings('ignore')

### Imports

Import the required libraries below

In [39]:
import os
from collections import defaultdict
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Document
from langchain_core.prompts import PromptTemplate
from langchain.chains import LLMChain
from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core import Settings
Settings.chunk_size = 1024
Settings.chunk_overlap = 100

### OpenAI Key

Add your OpenAI API key for GPT-3.5. You can use GPT-4 instead to get better reports. Follow this thread to create your own key:
https://community.openai.com/t/how-to-generate-openai-api-key/401363/3
Access or create your GPT 3.5 key here after signing up: 
https://platform.openai.com/api-keys
For GPT 4: 
https://platform.openai.com/settings/profile?tab=api-keys

In [40]:
# Set API keys and configurations
os.environ["OPENAI_API_KEY"] = "your_openai_api_key"  # replace with your own key
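
A key pasted into a notebook cell is easy to leak when the notebook is shared. As a minimal sketch, assuming the key was exported in the shell before the notebook was started, it can be read from the environment instead. `require_api_key` is a hypothetical helper, not part of LangChain or LlamaIndex:

```python
import os

def require_api_key(name="OPENAI_API_KEY", env=None):
    """Return the key stored under `name`; fail early if it is missing or still a placeholder."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key or key == "your_openai_api_key":
        raise RuntimeError(f"{name} is not set; export it before starting the notebook")
    return key
```

Calling `require_api_key()` at the top of the notebook surfaces a missing key immediately rather than at the first API call.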

### Models

We use the OpenAI API for two models: text-embedding-ada-002 (embedding model) to convert text into embeddings, and gpt-3.5-turbo-instruct (LLM generator) to create reports.

In [41]:
# Initialize embeddings and LLM
embedding_model = OpenAIEmbeddings(model="text-embedding-ada-002")
llm = OpenAI(temperature=0, model_name="gpt-3.5-turbo-instruct")

### Log Data

Read the sample log file and clean the logs by stripping newlines and dropping empty lines.

In [42]:
mpath = './'
# Load log lines from text file
with open(mpath + 'demofile1.txt', 'r') as file:
    log_lines = file.readlines()
    
log_lines = [j.rstrip('\n') for j in log_lines]
log_lines = [j for j in log_lines if j]

### Vector Store

Create an index over the logs with LlamaIndex's VectorStoreIndex so that logs with similar meaning can be searched and retrieved through embedding similarity. This index lives in memory; for large datasets, a dedicated vector store such as FAISS, Chroma, Pinecone, or Qdrant is advisable.

In [43]:
# Convert log lines to LlamaIndex documents
documents = [Document(text=line.strip(), metadata={'key':i}) for i, line in enumerate(log_lines)]

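# Build the in-memory vector index over the log documents.
# Embeddings come from Settings.embed_model (LlamaIndex's default OpenAI embedding
# model unless overridden); VectorStoreIndex does not take an `embed_fn` argument.
index = VectorStoreIndex.from_documents(documents)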
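
To make "embedding similarity" concrete, here is a minimal sketch of a cosine-similarity top-k search over toy vectors. It illustrates the general technique only; the helper names and the 3-dimensional vectors are made up, and real embeddings have many more dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return (doc_index, score) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy "embeddings" standing in for embedded log lines
docs = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
hits = top_k([1.0, 0.0, 0.0], docs, k=2)
print(hits)
```

A vector store does the same job at scale, typically with approximate nearest-neighbour search rather than the exhaustive scan shown here.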

### Example of information retrieval based on a question

Let's test generating a meaningful answer to a question using the index we created with LlamaIndex.

In [44]:
query_engine = index.as_query_engine(similarity_top_k=2)
response = query_engine.query("describe any issue present")
print(response)

There is a potential security vulnerability detected and the disk space is critically low.


### Example of information retrieval based on issue category

Let's test the index by retrieving the logs most relevant to a given type of issue.

In [45]:
retriever = index.as_retriever(similarity_top_k=10)
# Try any of the issue types below to see whether it matches logs in the index
"""
Firmware corruption
Overheating hardware
Misconfigured routing protocols
VPN tunnel failures
Port forwarding errors
VLAN misconfiguration
Spanning tree loops
Port security violations
Faulty switch fabric
Insufficient PoE power
hardware
vulnerability
"""
nodes = retriever.retrieve("hardware")
nodes = [j for j in nodes if j.score>0.7]
nodes = sorted(nodes, key=lambda x:x.score, reverse=True)[:3]
for n in nodes:
    print(f'Log Line      : {n.text}')
    print(f'Matching Score: {n.score}')

Matching Score: 0.7614385990600855
Log Line      : Jun 3 10:15:09 Router1 %QOS-5-QOS_FLOW_MONITORING: 80% of bandwidth utilization reached on GigabitEthernet0/0
Matching Score: 0.7524785872109645
Log Line      : Jun 3 10:15:00 Router1 %LINK-5-CHANGED: Interface GigabitEthernet0/1, changed state to administratively down
Matching Score: 0.751182378692093


### Ground Truth for anomalies

Read the signatures of known issues. Each signature contains an issue name, an issue description, and a set of previously seen logs.

In [46]:
with open(mpath + 'signatures.txt', 'r') as file:
    signatures = file.readlines()
signatures = [j.replace('\n', '') for j in signatures]
signatures = ['####' if j=='' else j for j in signatures]
signatures = ' \n '.join(signatures)
signatures = signatures.split('####')
signatures = [j for j in signatures if j not in ('', ' \n ', '####')]
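
The cell above keeps each signature as one flat string. As a hedged alternative, the sketch below parses the same kind of text into structured records. It assumes that blocks are separated by blank lines, that the first line of a block is the issue name, and that log patterns start with `%`, as in the sample signature printed later. `parse_signatures` is a hypothetical helper, and the second sample block is invented for illustration:

```python
def parse_signatures(raw_text):
    """Split blank-line-separated blocks into dicts with name, description and patterns."""
    records = []
    for block in raw_text.split("\n\n"):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue  # skip empty blocks caused by repeated blank lines
        records.append({
            "name": lines[0],
            "description": [ln for ln in lines[1:] if not ln.startswith("%")],
            "patterns": [ln for ln in lines[1:] if ln.startswith("%")],
        })
    return records

sample = """Security Breaches and Configuration Errors
Security Breaches: These include unauthorized access and security protocol failures.
%VPN-3-HANDSHAKE_FAILURE: Indicates failures in VPN handshake processes.
%AAA-3-CONFIG_ERROR: Configuration error detected in AAA setup

Example Issue
A made-up description used only for this sketch.
%EXAMPLE-1-PLACEHOLDER: Hypothetical pattern
"""

records = parse_signatures(sample)
print([r["name"] for r in records])
```

Structured records make it possible to search with only the description or only the patterns, instead of the whole block.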

### Utility

The search_logs function retrieves the logs relevant to a query from the index, filtering by a similarity threshold and keeping only the top-matching records.

In [47]:
# Search the index and keep only sufficiently similar logs
def search_logs(query, top_k=5, thre=0.7):
    # Over-fetch (3x top_k) so that enough candidates remain after thresholding
    retriever = index.as_retriever(similarity_top_k=top_k * 3)
    nodes = retriever.retrieve(query)
    # Drop weak matches, then keep the top_k best-scoring nodes
    nodes = [j for j in nodes if j.score > thre]
    nodes = sorted(nodes, key=lambda x: x.score, reverse=True)[:top_k]
    return nodes
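
The threshold-then-top-k step can be checked without an index or API key. The sketch below applies the same filtering to plain `(text, score)` tuples; `filter_hits` and the sample scores are illustrative assumptions, not output from the notebook:

```python
def filter_hits(hits, top_k=5, threshold=0.7):
    """Keep hits scoring above the threshold, best first, at most top_k of them."""
    kept = [h for h in hits if h[1] > threshold]
    return sorted(kept, key=lambda h: h[1], reverse=True)[:top_k]

# Made-up retrieval results: (log text, similarity score)
hits = [("log a", 0.76), ("log b", 0.65), ("log c", 0.82), ("log d", 0.71)]
selected = filter_hits(hits, top_k=2, threshold=0.7)
print(selected)
```

Because the threshold is applied first, a query can return fewer than `top_k` results, or none at all, which is the desired behavior when nothing in the logs resembles the query.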

### Utility

Using the selected LLM, generate a report that describes the issue and a possible resolution for the retrieved logs.

In [48]:
# Function to generate a report using Langchain
def generate_report(query, retrieved_docs):
    prompt_template = """
    Based on the following query and logs:
    Query: {query}
    Logs: {logs}
    Generate a detailed report, describe the problem and recommend a solution. 
    please remove those logs that are not relevant to query
    """
    prompt = PromptTemplate.from_template(prompt_template)
    # gpt-3.5-turbo-instruct is a completion-style model, matching LangChain's OpenAI class
    llm = OpenAI(temperature=0, model_name="gpt-3.5-turbo-instruct")
    chain = LLMChain(llm=llm, prompt=prompt)
    # Concatenate the retrieved log lines into one string for the prompt
    logs = " ".join(doc.text for doc in retrieved_docs)
    response = chain.run({"query": query, "logs": logs})
    return response

### Information Retrieval for Unknown Error

In [49]:
# Example usage for search and reporting
query = "Error in module 3"
retrieved_logs = search_logs(query, top_k=3, thre = 0.7)
report = generate_report(query, retrieved_logs)
log_text = '\n'.join([j.text for j in retrieved_logs])
print(f"Relevant Logs: \n{log_text}\n\n{report}")

Relevant Logs: 
2024-06-01T12:00:00Z ERROR module_3 Unexpected null pointer exception occurred.
2024-06-01T12:15:00Z ERROR module_1 Disk space critically low.
2024-06-01T12:05:00Z WARN module_4 Potential security vulnerability detected.


Report:

Problem:
The query "Error in module 3" indicates that there is an error occurring in module 3. The logs provided show three different errors, but only one of them is related to module 3. The other two errors are not relevant to the query and should be removed.

Solution:
To address the issue, the logs that are not relevant to the query should be removed. This will help to focus on the specific error in module 3 and provide a more accurate analysis.

Detailed Report:
The logs provided show three different errors that occurred on 2024-06-01 at different times. The first error, "Unexpected null pointer exception occurred" is related to module 3 and occurred at 2024-06-01T12:00:00Z. This error is likely the cause of the query "Error in module 3".

### Information Retrieval for observed error

In [50]:
# Find logs related to a specific problem
problem_description = "Null pointer exception"
query = f"Find logs related to the following problem: {problem_description}"
related_logs = search_logs(query, top_k=5, thre = 0.7)
related_logs_report = generate_report(problem_description, related_logs)
print("Related Logs Report:", related_logs_report)

Related Logs Report: 
Problem:
The query "Null pointer exception" is causing a null pointer exception error in the system. This error is causing critical issues such as unexpected router reboots, memory allocation errors, and logging to host failures.

Logs:
2024-06-01T12:00:00Z ERROR module_3 Unexpected null pointer exception occurred.
2024-06-02T13:30:40Z [CRITICAL] Router rebooted unexpectedly, last log messages may indicate cause.
2024-06-02T13:20:30Z [ERROR] Memory allocation error on process 'httpd', device may be under a DoS attack.

Detailed Report:
Based on the logs provided, it can be seen that the null pointer exception error occurred on 2024-06-01T12:00:00Z in module_3. This error is causing critical issues such as unexpected router reboots and memory allocation errors. The error is also affecting the logging system, as seen in the logs on Jun 3 10:15:00, where the logging to host 192.168.1.3 stopped.

The null pointer exception error is a common error in programming and ca

In [75]:
print(signatures[3])

 
 Security Breaches and Configuration Errors 
 Security Breaches: These include unauthorized access and security protocol failures that may lead to data breaches or malicious activities. 
 Configuration Errors: These involve incorrect settings or commands that could lead to network instability, exposure to vulnerabilities, or performance issues. 
 %FIREWALL-1-INTRUSION_DETECTED: Alerts on detected intrusion attempts, critical for security incident responses. 
 %VPN-3-HANDSHAKE_FAILURE: Indicates failures in VPN handshake processes, suggesting potential security or configuration issues. 
 %CONFIG-2-INVALID_COMMAND: Signals that an invalid command was entered, potentially causing disruptions if not corrected. 
 %BGP-4-CONFIG_ERROR: Warns of configuration errors with BGP neighbors, which could lead to routing issues or network instability. 
 %SECURITY-3-LOGIN_FAILED: Login attempt failed for username {user} from {source} 
 %AAA-3-CONFIG_ERROR: Configuration error detected in AAA setup 
 

## Anomaly Detection

#### Finding anomalies by exhaustive search over logs

In [52]:
# searching exhaustively for log anomalies
# this can be optimized for performance by storing embeddings for signatures from past anomalies
anomalies = []
for sig in signatures:
    sig = sig.strip()
    nodes = search_logs(sig, top_k=3, thre = 0.7)
    anomalies.append([sig.split(' \n ')[0],  nodes])
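
The comment above suggests storing signature embeddings to avoid recomputing them. A minimal sketch of that idea, assuming an in-memory dictionary keyed by a hash of the signature text; `EmbeddingCache` and `fake_embed` are hypothetical names, and `fake_embed` only stands in for a real call such as `embedding_model.embed_query`:

```python
import hashlib

class EmbeddingCache:
    """Memoize embeddings by a hash of the text so each signature is embedded once."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.misses = 0  # number of times the underlying embed function was called

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self._embed_fn(text)
            self.misses += 1
        return self._store[key]

def fake_embed(text):
    # Stand-in for a real embedding API call; returns a tiny deterministic vector
    return [float(len(text)), float(text.count(" "))]

cache = EmbeddingCache(fake_embed)
first = cache.get("Traffic Overload")
second = cache.get("Traffic Overload")
print(cache.misses)
```

Persisting `_store` to disk between runs would extend the saving across sessions; how to pass cached vectors into the retriever is not covered here.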

In [74]:
print(anomalies[3][0])
for i in anomalies[3][1]:
    print(i.node.text)

Security Breaches and Configuration Errors
Jun 3 10:15:00 Router1 %SEC-6-IPACCESSLOGP: Security violation detected from source IP 192.168.2.100 to destination IP 192.168.1.1, attempting unauthorized access
Jun 3 10:15:00 Router1 %SEC-6-IPACCESSLOGP: Security violation detected from source IP 192.168.2.100 to destination IP 192.168.1.1, attempting unauthorized access
Jun 3 10:15:08 Router1 %IDS-6-ATTACK: Detected multiple SSH connection attempts from 192.168.2.100 within a short time period, potential brute force attack


### Disambiguation

Logs can contain noisy information, making them susceptible to being selected for multiple issue types.

In [54]:
# Some logs match multiple issue types and need disambiguation; replace this with more sophisticated logic if needed
common_nodes = defaultdict(list)
for ar in anomalies:
    for i in ar[1]:
        common_nodes[i.node.metadata['key']].append([ar[0], i.score])

In [55]:
for i in common_nodes[19]:
    print(i)

['Network Latency Issues', 0.8201085526021584]
['Packet Loss Issues', 0.8129348685243899]
['Broadcast Storm Issues', 0.7936497638372606]
['Traffic Overload', 0.8323136167370643]
['Overutilization of Resources', 0.842460503126184]
['Underutilization of Resources', 0.8267723949015237]


In [56]:
print(documents[19].text)

Jun 3 10:15:09 Router1 %QOS-5-QOS_FLOW_MONITORING: 80% of bandwidth utilization reached on GigabitEthernet0/0


### Best Match

Here, the best match is the issue type with the highest match score.
You can replace this with better logic or a model.

In [57]:
nodes_best_match = {k:max(v, key=lambda x:x[1]) for k,v in common_nodes.items()}

In [58]:
nodes_best_match[19]

['Overutilization of Resources', 0.842460503126184]
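
One possible refinement, sketched under the assumption that close scores signal an unreliable assignment: take the highest score as before, but flag the log as ambiguous when the runner-up is within a small margin. `best_match_with_margin` and the `min_margin` value are illustrative choices, not something the notebook prescribes:

```python
def best_match_with_margin(candidates, min_margin=0.02):
    """Return (label, score, is_ambiguous) for a non-empty list of [label, score] pairs."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    label, score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else float("-inf")
    return label, score, (score - runner_up) < min_margin

# Candidate issue types for log 19, copied from the output above
candidates = [
    ["Network Latency Issues", 0.8201085526021584],
    ["Packet Loss Issues", 0.8129348685243899],
    ["Broadcast Storm Issues", 0.7936497638372606],
    ["Traffic Overload", 0.8323136167370643],
    ["Overutilization of Resources", 0.842460503126184],
    ["Underutilization of Resources", 0.8267723949015237],
]
label, score, ambiguous = best_match_with_margin(candidates)
print(label, ambiguous)
```

For log 19 the top two scores differ by only about 0.01, so this rule flags the assignment as ambiguous; such logs could be routed to manual review or to an LLM-based classifier.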

### Final Anomalies

Remove each log from the issue types that are not its best match, keeping it only under the issue type with the highest score.

In [59]:
anomalies_selected = []
for ar in anomalies:
    selected_recs = []
    for i in ar[1]:
        rec_no = i.metadata['key']
        if rec_no in nodes_best_match:
            cat = nodes_best_match[rec_no][0]
            if ar[0] != cat:
                continue
        selected_recs.append(i)
    anomalies_selected.append([ar[0], selected_recs])

anomalies_selected = [i for i in anomalies_selected if i[1]]

### Visualizing Anomalies

#### tSNE

In [60]:
# Plot anomalies
from sklearn.manifold import TSNE
import pandas as pd
import plotly.io as pio
import plotly.express as px
pio.renderers.default='browser'

In [61]:
df = pd.DataFrame()

for a_s in anomalies_selected:
    for i in a_s[1]:
        dfp = pd.DataFrame([embedding_model.embed_query(i.node.text)], [a_s[0]])
        df = pd.concat([df, dfp])

In [62]:
# We want a t-SNE embedding with 2 dimensions
n_components = 2
tsne = TSNE(n_components, perplexity=df.shape[0]//2)
tsne_result = tsne.fit_transform(df.values)
tsne_result.shape
# (25, 2)
# Two dimensions for each anomaly

(25, 2)
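
scikit-learn requires the t-SNE perplexity to be smaller than the number of samples, so `df.shape[0]//2` breaks down when very few anomalies are selected. A minimal sketch of a guard, where `safe_perplexity` is a hypothetical helper and 30 is scikit-learn's default perplexity:

```python
def safe_perplexity(n_samples, preferred=30):
    """Pick a perplexity that is at least 1 and strictly below n_samples."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least two samples")
    return max(1, min(preferred, n_samples // 2, n_samples - 1))

print(safe_perplexity(25))
```

The result can be passed as `perplexity=safe_perplexity(df.shape[0])`. Setting `random_state` on `TSNE` as well makes the layout reproducible between runs.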

#### The plot opens in a new browser tab

In [63]:
# The plot opens in the browser; hover over the points to reveal details
fig = px.scatter(x=tsne_result[:, 0], 
                 y=tsne_result[:, 1], 
                 color=df.index
                 )
fig.update_traces(marker={'size': 50})
fig.update_layout(
    title="t-SNE visualization of Anomaly Detection",
    xaxis_title="First t-SNE",
    yaxis_title="Second t-SNE",
)
fig.show()

### Generating Report
#### Problem description and possible resolution

In [64]:
# Generate report for each anomaly
anomalies_reports = []
for ano in anomalies_selected:
    anomaly_report = generate_report(ano[0], ano[1])
    anomalies_reports.append([ ano[0], ano[1], anomaly_report])

In [65]:
for ano in anomalies_reports:
    print(ano[0])
    log_text = '\n'.join([j.text for j in ano[1]])
    print(f"Anomaly Report: \n{log_text}\n\n{ano[2]}")
    print('\n-----------------\n')
    #print(input())

Hardware and Configuration Errors
Anomaly Report: 
Jun 3 10:15:00 Router1 %LINK-5-CHANGED: Interface GigabitEthernet0/0, changed state to administratively down
Jun 3 10:15:00 Router1 %LINK-5-CHANGED: Interface GigabitEthernet0/1, changed state to administratively down
Jun 3 10:15:00 Router1 %LINK-5-CHANGED: Interface GigabitEthernet0/0, changed state to administratively down


Report:

Problem:
The logs show that there are multiple hardware and configuration errors on Router1. Specifically, the GigabitEthernet0/0 and GigabitEthernet0/1 interfaces have changed state to administratively down. This indicates that there is an issue with the configuration or hardware of these interfaces.

Possible Causes:
1. Misconfiguration: The interfaces may have been configured incorrectly, causing them to go into an administratively down state.
2. Hardware Failure: There may be a hardware issue with the interfaces, causing them to fail and go into an administratively down state.
3. Software Bug: There 