# Log diagnosis experiment

This notebook performs the following steps:
1) Load the data
2) Use an LLM to detect source file types.
3) Filter the data in the file to retrieve the "most relevant" log lines (this notebook uses a statistical, TF-IDF-based approach).
4) Use an LLM to diagnose the log lines selected in step 3.

In [None]:
%load_ext autoreload
%autoreload 2

## Define globals and utils

The cell below imports utility functions for reading a .tgz dataset from S3 (or the local machine), a custom tokenizer (based on the Porter stemmer), and stop words. It also sets up the logger and the LLM client.

To run this notebook with another dataset, change the `file` variable below.

In [None]:
import pandas as pd
from modules.logger import Logger
from modules.llm.azure_ai import AzureLlm
from modules.dataset import load_dataset_local
from modules.syslog.tfidf import get_stemmed_tokens, get_stop_words, get_tokens
import logging
import os
from dotenv import load_dotenv

load_dotenv("env")

file = './datasets/syslog-demo.tar'


STOPWORDS_TO_IGNORE = [
    'down'
]

logger = Logger(logging.INFO)
llm = AzureLlm(logger, os.getenv('AZURE_OPENAI_API_KEY'))

## 1 - Use LLM to infer file type

In this section, an LLM detects the file type of each data source. The accepted types are `syslog` and `configuration`. If the LLM cannot determine the category (e.g., the file is empty or does not contain enough information to classify), the type `unknown` is used.
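The fallback to `unknown` can be sketched as a small normalization step. This is only an illustration: the notebook's actual handling lives inside `FileTypeDetectionPrompt`, which is not shown here, and the names `ACCEPTED_TYPES` and `normalize_file_type` are hypothetical.

```python
# Hypothetical helper, not part of the notebook's modules.
ACCEPTED_TYPES = {"syslog", "configuration"}

def normalize_file_type(raw_answer):
    """Map a raw LLM answer to an accepted type, or to 'unknown'."""
    if not raw_answer:
        return "unknown"
    # Drop surrounding whitespace, quotes, backticks and trailing periods.
    answer = raw_answer.strip().strip("`'\".").lower()
    return answer if answer in ACCEPTED_TYPES else "unknown"
```

Anything outside the two accepted types, including an empty answer, maps to `unknown`, so downstream steps only ever see three possible values.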

For longer files (more than `x` characters), only the **first n**, **middle n**, and **last n** characters are used as input for the LLM prompt.
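As a standalone illustration of this sampling, the sketch below applies the same slicing to a synthetic string. The function name `sample_content` is hypothetical; the parameters `x` and `n` have the meaning described above.

```python
def sample_content(content: str, x: int = 500, n: int = 100) -> str:
    """Return short content unchanged, else head + middle + tail excerpts."""
    if len(content) <= x:
        return content
    mid = len(content) // 2
    head = content[:n]
    middle = content[mid - n // 2 : mid + n // 2]
    tail = content[-n:]
    return "\n...\n".join([head, middle, tail])

# 1000 characters made of the digits 0-9 repeated.
text = "".join(str(i % 10) for i in range(1000))
excerpt = sample_content(text, x=500, n=10)
print(excerpt)
```

Choosing `x` to be at least `3 * n` keeps the three excerpts from overlapping. When `n` is odd, the middle excerpt is one character shorter than `n` because of the integer division.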

In [None]:
from modules.llm.prompt import FileTypeDetectionPrompt

def augment_file_type(dataset, x=500, n=100):
    """Add a 'type' key to each dataset entry, as inferred by the LLM."""
    for data in dataset:
        content = data['content']
        file_name = data['path']
        
        # Check if the content length exceeds the threshold 'x'
        if len(content) > x:
            # Extract first 'n' characters, 'n' middle characters, and last 'n' characters
            first_part = content[:n]
            middle_part = content[len(content)//2 - n//2:len(content)//2 + n//2]
            last_part = content[-n:]
            content = first_part + "\n...\n" + middle_part + "\n...\n" + last_part

        prompt = FileTypeDetectionPrompt(logger, llm)
        prompt.run(file_name=file_name, file_content=content)
        file_type = prompt.getType()
        data['type'] = file_type
        print(f"{file_name} -> {file_type}")
    return dataset


### Load dataset and augment with file types

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer, ENGLISH_STOP_WORDS

dataset = load_dataset_local(file)
dataset = augment_file_type(dataset=dataset, x=10000, n=3000)

## 2 - Compute TF-IDF on all documents

The approach is straightforward. For each file (configuration files are included too, for the sake of comparison), every line is treated as a document and TF-IDF scores are computed, so that each row of the document-term matrix holds the token weights of one log line.

For each line (row), we average the non-zero weights to obtain a single score. We then sort the lines by this score, from highest to lowest, and keep the top n as the most relevant lines of the file.
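To make the scoring concrete, here is a minimal pure-Python sketch of the same idea. It assumes plain whitespace tokenization and no stop words (the notebook uses `get_tokens` and a stop-word list instead), and it reproduces the default weighting of scikit-learn's `TfidfTransformer` (smoothed idf, L2-normalized rows). The names `score_lines` and `top_lines` are illustrative and not part of the notebook.

```python
import math
from collections import Counter

def score_lines(lines):
    """Return the mean non-zero TF-IDF weight of each line."""
    tokenized = [line.lower().split() for line in lines]
    n_docs = len(tokenized)
    # Document frequency: number of lines in which each token appears.
    df = Counter(token for tokens in tokenized for token in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw term count times smoothed idf: ln((1 + n) / (1 + df)) + 1.
        weights = [c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t, c in counts.items()]
        norm = math.sqrt(sum(w * w for w in weights))
        # L2-normalize the row, then average its non-zero weights.
        scores.append(sum(w / norm for w in weights) / len(weights) if weights else 0.0)
    return scores

def top_lines(lines, top_n):
    """Return the top_n lines, ordered by descending score."""
    ranked = sorted(zip(lines, score_lines(lines)), key=lambda pair: pair[1], reverse=True)
    return [line for line, _ in ranked[:top_n]]

logs = ["interface eth0 link up", "interface eth0 link up", "panic"]
print(top_lines(logs, 1))
```

Because each row is L2-normalized, a line with k distinct tokens can score at most 1/sqrt(k), so this ranking tends to favor short lines made of rare tokens.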

In [None]:
# TF-IDF log sorting approach inspired by https://github.com/ExceptionalHandler/NLP/blob/master/README.md

stopwords = get_stop_words("stopwords.txt")
# Merge scikit-learn's English stop words with the custom list, but keep the
# words in STOPWORDS_TO_IGNORE (e.g. 'down') as regular tokens.
stopwords = list(set(ENGLISH_STOP_WORDS).union(stopwords).difference(STOPWORDS_TO_IGNORE))

vectorizer = CountVectorizer(tokenizer=get_tokens, stop_words=stopwords)
tf_idf_transformer = TfidfTransformer()

def getFeatures(lines, scores, top_n):
    line_scores = list(zip(lines, scores))
    sorted_lines = sorted(line_scores, key=lambda x: x[1], reverse=True)[:top_n]
    return [line for line, score in sorted_lines]

def process(path, content, top_n):
    """Return (numbered lines, top-n numbered lines, top-n raw lines) for one file."""
    lines = content.splitlines()
    numbered_lines = [f'{i:04}: {line}' for i, line in enumerate(lines, start=1)]

    try:
        doc_matrix = vectorizer.fit_transform(lines)
        tfidf_matrix = tf_idf_transformer.fit_transform(doc_matrix).toarray()
        # Score each line by the mean of its non-zero TF-IDF weights (0 if it has none).
        per_line_scores = [row.sum()/len(row.nonzero()[0]) if row.nonzero()[0].size > 0 else 0 for row in tfidf_matrix]

        return numbered_lines, getFeatures(numbered_lines, per_line_scores, top_n), getFeatures(lines, per_line_scores, top_n)
    except ValueError as e:
        # No usable vocabulary (e.g. an empty file): return empty feature lists
        # so that the caller can still unpack three values.
        print(f"Skipping file {path}: {e}")
        return numbered_lines, [], []

results = []
results_plt = {}
for data in dataset:
    numbered_lines, numbered_featured, features = process(data['path'], data['content'], 10)
    
    results.append({
        "file_path": data['path'],
        "file_type": data['type'],
        "snippet": '\n'.join(features)
    })
    
    results_plt[data['path']]={
        "file content": '\n'.join(numbered_lines[0:30]),
        "retrieved features": '\n'.join(numbered_featured)
    }

### Show log features retrieved using TF-IDF analysis

First, select a processed file for which results are available.

In [None]:
import modules.utils as utils
import ipywidgets as widgets
from IPython.display import display, clear_output

options = list(results_plt.keys())
file_select = widgets.Dropdown(options=options,
                               value=options[0],
                               description = "select file",
                               disabled = False,
                               layout={'width': '600px'})
display(file_select)

### Show the raw file content with the retrieved features

In [None]:
utils.displayDictionary(results_plt[file_select.value])

## 3 - Use an LLM to perform an initial diagnosis

In [None]:
from modules.diagnose import Diagnose

diagnose = Diagnose(logger, llm)
diagnose.setInputFeatures('snippet')
diagnose.setInputSource('file_type')
diagnose.setInputFile('file_path')
diagnose.setInputType('type')
diagnose.setOutputIssueState('issue')
diagnose.setOutputIssueDesc('description')
diagnose.setOutputResolution('resolution')

for row in results:
    snippet = row['snippet']
    file_path = row['file_path']
    file_type = row['file_type']
    # Only syslog files with at least one retrieved line are diagnosed.
    if file_type != 'syslog':
        print(f'\n\nIgnoring non-syslog file: {file_path} (type: {file_type})')
        row['issue'] = "IGNORED"
        continue
    if len(snippet) == 0:
        print(f'\n\nIgnoring empty file: {file_path} (type: {file_type})')
        row['issue'] = "IGNORED"
        continue

    # Pick the device type from the file path.
    if 'router' in file_path:
        row['type'] = diagnose.getNetworkDeviceType()
    elif 'pod' in file_path:
        row['type'] = diagnose.getPodType()
    else:
        row['type'] = diagnose.getGenericType()

diagnose.run(results, inject=True)
print('Diagnosis complete!')

utils.displayDictionary(results)