# Tutorial: Log Anomaly Detection Using LogAI

This tutorial shows how to use LogAI for log anomaly detection.

## Load Data

You can use `OpenSetDataLoader` to load a sample open log dataset. Here we use the HealthApp dataset from
[LogHub](https://zenodo.org/record/3227177#.Y1M3LezML0o) as an example.


In [2]:
import os
from logai.dataloader.openset_data_loader import OpenSetDataLoader, OpenSetDataLoaderConfig

#File Configuration
filepath = os.path.join(".", "datasets", "HealthApp_2000.log") # Point to the target HealthApp.log dataset

dataset_name = "HealthApp"
data_loader = OpenSetDataLoader(
    OpenSetDataLoaderConfig(
        dataset_name=dataset_name,
        filepath=filepath)
)

logrecord = data_loader.load_data()

logrecord.to_dataframe().head(5)


|   | logline | timestamp | Action | ID |
|---|---|---|---|---|
| 0 | onExtend:1514038530000 14 0 4 | 2017-12-23 22:15:29.615 | Step_LSC | 30002312 |
| 1 | onReceive action: android.intent.action.SCREEN_ON | 2017-12-23 22:15:29.633 | Step_StandReportReceiver | 30002312 |
| 2 | processHandleBroadcastAction action:android.in... | 2017-12-23 22:15:29.635 | Step_LSC | 30002312 |
| 3 | flush sensor data | 2017-12-23 22:15:29.635 | Step_StandStepCounter | 30002312 |
| 4 | getTodayTotalDetailSteps = 1514038440000##699... | 2017-12-23 22:15:29.635 | Step_SPUtils | 30002312 |


## Preprocess

In the preprocessing step, you can extract and replace any substrings that match a regular expression to clean the raw loglines. This
can improve information extraction from the unstructured part of the logs
and lets you generate more structured attributes using domain knowledge.

In this example, the regex below finds IP addresses and replaces each one with an `<IP>` tag.

In [3]:
from logai.preprocess.preprocessor import PreprocessorConfig, Preprocessor
from logai.utils import constants

loglines = logrecord.body[constants.LOGLINE_NAME]
attributes = logrecord.attributes

preprocessor_config = PreprocessorConfig(
    custom_replace_list=[
        [r"\d+\.\d+\.\d+\.\d+", "<IP>"],   # retrieve all IP addresses and replace with <IP> tag in the original string.
    ]
)

preprocessor = Preprocessor(preprocessor_config)

clean_logs, custom_patterns = preprocessor.clean_log(
    loglines
)
custom_patterns.head(5)

|   | `<IP>` |
|---|---|
| 0 | [] |
| 1 | [] |
| 2 | [] |
| 3 | [] |
| 4 | [] |
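
The first five rows of `custom_patterns` are empty lists because those HealthApp lines contain no IP addresses. To see what the replacement does on lines that do contain one, here is a minimal sketch using only Python's standard `re` module. It is not LogAI's implementation; the helper name `replace_ips` and the sample lines are made up for illustration.

```python
import re

# Same pattern as in the preprocessing cell above. It matches any four
# dot-separated digit groups and does not validate octet ranges.
IP_PATTERN = re.compile(r"\d+\.\d+\.\d+\.\d+")


def replace_ips(line):
    """Return the line with IPs masked, plus the list of IPs that were found."""
    found = IP_PATTERN.findall(line)
    cleaned = IP_PATTERN.sub("<IP>", line)
    return cleaned, found


# Hypothetical log lines, made up for illustration.
sample_lines = [
    "connect to 10.0.0.12 failed, retry via 192.168.1.1",
    "flush sensor data",
]

results = [replace_ips(line) for line in sample_lines]
for cleaned, found in results:
    print(cleaned, found)
```

A line without a match is returned unchanged with an empty list, which mirrors the `[]` entries in the output above.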


## Parsing

After preprocessing, we run an automatic parsing algorithm (Drain in this example) to extract log templates from the cleaned logs.


In [4]:
from logai.information_extraction.log_parser import LogParser, LogParserConfig
from logai.algorithms.parsing_algo.drain import DrainParams

# parsing
parsing_algo_params = DrainParams(
    sim_th=0.5, depth=5
)

log_parser_config = LogParserConfig(
    parsing_algorithm="drain",
    parsing_algo_params=parsing_algo_params
)

parser = LogParser(log_parser_config)
parsed_result = parser.parse(clean_logs)
# parsed_result
parsed_loglines = parsed_result['parsed_logline']

In [15]:
parsed_result

|   | logline | parsed_logline | parameter_list |
|---|---|---|---|
| 0 | onExtend:1514038530000 14 0 4 | * * 0 4 | [onExtend:1514038530000, 14] |
| 1 | onReceive action: android.intent.action.SCREEN_ON | onReceive action: * | [android.intent.action.SCREEN_ON] |
| 2 | processHandleBroadcastAction action:android.in... | processHandleBroadcastAction * | [action:android.intent.action.SCREEN_ON] |
| 3 | flush sensor data | flush sensor data | [] |
| 4 | getTodayTotalDetailSteps = 1514038440000##699... | getTodayTotalDetailSteps = * | [1514038440000##6993##548365##8661##12266##271... |
| ... | ... | ... | ... |
| 253369 | calculateCaloriesWithCache totalCalories=52108 | calculateCaloriesWithCache * | [totalCalories=52108] |
| 253370 | calculateAltitudeWithCache totalAltitude=60 | calculateAltitudeWithCache * | [totalAltitude=60] |
| 253371 | processHandleBroadcastAction action:android.in... | processHandleBroadcastAction * | [action:android.intent.action.TIME_TICK] |
| 253372 | processHandleBroadcastAction action:android.in... | processHandleBroadcastAction * | [action:android.intent.action.TIME_TICK] |
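
To make the relationship between `logline`, `parsed_logline` and `parameter_list` concrete, the sketch below recovers the parameters by aligning whitespace tokens of a logline with its template. This is only an illustration of how the three columns relate; it is not the Drain algorithm, and `extract_parameters` is a hypothetical helper, not a LogAI function.

```python
def extract_parameters(logline, template, wildcard="*"):
    """Align whitespace tokens and collect those hidden behind a wildcard.

    Returns None when the token counts differ, because this simple
    positional alignment cannot handle that case.
    """
    log_tokens = logline.split()
    template_tokens = template.split()
    if len(log_tokens) != len(template_tokens):
        return None
    return [
        token
        for token, template_token in zip(log_tokens, template_tokens)
        if template_token == wildcard
    ]


# Rows 0 and 3 from the parsed output above.
params_row0 = extract_parameters("onExtend:1514038530000 14 0 4", "* * 0 4")
params_row3 = extract_parameters("flush sensor data", "flush sensor data")
print(params_row0)
print(params_row3)
```

For row 0 this yields the two tokens that the template replaced with `*`, and for row 3 an empty list, matching the `parameter_list` column.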


## Time-series Anomaly Detection

This section shows how to run time-series anomaly detection on the parsed logs.

### Feature Extraction

After parsing the logs and obtaining the log templates, we can extract time-series features by converting
these parsed loglines into counter vectors.

In [6]:
from logai.information_extraction.feature_extractor import FeatureExtractorConfig, FeatureExtractor

config = FeatureExtractorConfig(
    group_by_time="15min",
    group_by_category=['parsed_logline', 'Action', 'ID'],
)

feature_extractor = FeatureExtractor(config)

timestamps = logrecord.timestamp['timestamp']
parsed_loglines = parsed_result['parsed_logline']
counter_vector = feature_extractor.convert_to_counter_vector(
    log_pattern=parsed_loglines,
    attributes=attributes,
    timestamps=timestamps
)

counter_vector.head(5)


|   | parsed_logline | Action | ID | timestamp | event_index | counts |
|---|---|---|---|---|---|---|
| 0 | * * 0 0 | Step_LSC | 30002312 | 2017-12-23 23:00:00 | [1347] | 1 |
| 1 | * * 0 0 | Step_LSC | 30002312 | 2017-12-24 11:15:00 | [4660] | 1 |
| 2 | * * 0 0 | Step_LSC | 30002312 | 2017-12-24 12:00:00 | [6985, 7064] | 2 |
| 3 | * * 0 0 | Step_LSC | 30002312 | 2017-12-24 12:45:00 | [7458, 7459, 7473] | 3 |
| 4 | * * 0 0 | Step_LSC | 30002312 | 2017-12-24 15:30:00 | [7999, 8000, 8003, 8007, 8008, 8009, 8010, 801... | 38 |
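
The idea behind a counter vector can be sketched without LogAI: floor each timestamp to its 15-minute bucket, group by template and attributes, and count the events in each group. The code below is a dependency-free illustration under that assumption, not LogAI's implementation. The events are hypothetical and `floor_to_15min` is a helper introduced here.

```python
from collections import defaultdict
from datetime import datetime


def floor_to_15min(ts):
    """Round a timestamp down to the start of its 15-minute bucket."""
    return ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)


# Hypothetical (timestamp, template, Action) events.
events = [
    (datetime(2017, 12, 23, 22, 15, 29), "flush sensor data", "Step_StandStepCounter"),
    (datetime(2017, 12, 23, 22, 20, 1), "flush sensor data", "Step_StandStepCounter"),
    (datetime(2017, 12, 23, 22, 31, 0), "flush sensor data", "Step_StandStepCounter"),
    (datetime(2017, 12, 23, 22, 16, 0), "onReceive action: *", "Step_StandReportReceiver"),
]

# Map (template, action, bucket start) -> positions of the contributing events.
buckets = defaultdict(list)
for index, (ts, template, action) in enumerate(events):
    buckets[(template, action, floor_to_15min(ts))].append(index)

counter_rows = [
    {
        "parsed_logline": template,
        "Action": action,
        "timestamp": bucket,
        "event_index": indices,
        "counts": len(indices),
    }
    for (template, action, bucket), indices in sorted(buckets.items())
]

for row in counter_rows:
    print(row)
```

Each resulting row plays the same role as a row in `counter_vector`: one group key, one time bucket, the indices of the contributing loglines, and their count.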


### Anomaly Detection

With the generated `counter_vector`, you can use `AnomalyDetector` to detect time-series anomalies.
Here we use an algorithm from the Merlion library called `DynamicBaseline`.

In [7]:
from logai.analysis.anomaly_detector import AnomalyDetector, AnomalyDetectionConfig
from sklearn.model_selection import train_test_split
import pandas as pd

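# Build one string key per group by joining every column except
# the counts, timestamps and event indices.
counter_vector["attribute"] = counter_vector.drop(
    [
        constants.LOG_COUNTS,
        constants.LOG_TIMESTAMPS,
        constants.EVENT_INDEX
    ],
    axis=1
).apply(
    lambda x: "-".join(x.astype(str)), axis=1
)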

attr_list = counter_vector["attribute"].unique()

anomaly_detection_config = AnomalyDetectionConfig(
    algo_name='dbl'
)

# Collect per-attribute scores in a list; DataFrame.append was removed in pandas 2.0.
scores = []
for attr in attr_list:
    temp_df = counter_vector[counter_vector["attribute"] == attr]
    if temp_df.shape[0] >= constants.MIN_TS_LENGTH:
        # shuffle=False keeps chronological order: the first 30% trains, the rest is scored.
        train, test = train_test_split(
            temp_df[[constants.LOG_TIMESTAMPS, constants.LOG_COUNTS]],
            shuffle=False,
            train_size=0.3
        )
        anomaly_detector = AnomalyDetector(anomaly_detection_config)
        anomaly_detector.fit(train)
        anom_score = anomaly_detector.predict(test)
        scores.append(anom_score)

res = pd.concat(scores) if scores else pd.DataFrame()



In [8]:
# Get anomalous datapoints
anomalies = counter_vector.iloc[res[res>0].index]
anomalies.head(5)

|   | parsed_logline | Action | ID | timestamp | event_index | counts | attribute |
|---|---|---|---|---|---|---|---|
| 42 | * * 0 0 | Step_LSC | 30002312 | 2017-12-27 12:45:00 | [91837, 91876] | 2 | * * 0 0-Step_LSC-30002312 |
| 43 | * * 0 0 | Step_LSC | 30002312 | 2017-12-27 17:30:00 | [93344] | 1 | * * 0 0-Step_LSC-30002312 |
| 44 | * * 0 0 | Step_LSC | 30002312 | 2017-12-27 20:00:00 | [97356] | 1 | * * 0 0-Step_LSC-30002312 |
| 45 | * * 0 0 | Step_LSC | 30002312 | 2017-12-27 20:30:00 | [100183] | 1 | * * 0 0-Step_LSC-30002312 |
| 46 | * * 0 0 | Step_LSC | 30002312 | 2017-12-28 08:30:00 | [103117, 103379, 103426] | 3 | * * 0 0-Step_LSC-30002312 |
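
To build intuition for the fit-then-score pattern used above, here is a much simpler stand-in: learn a mean and standard deviation from the first 30% of a count series, then flag later points that deviate strongly. This is a plain z-score threshold, not Merlion's Dynamic Baseline algorithm. The series, the threshold of 3 and the helper names are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def fit_baseline(train_counts):
    """Summarize the training window by its mean and sample standard deviation."""
    return mean(train_counts), stdev(train_counts)


def score(test_counts, baseline, threshold=3.0):
    """Return the z-score for points beyond the threshold, and 0.0 otherwise."""
    mu, sigma = baseline
    scores = []
    for value in test_counts:
        z = (value - mu) / sigma if sigma > 0 else 0.0
        scores.append(z if abs(z) > threshold else 0.0)
    return scores


# Hypothetical per-bucket counts for one attribute group, in time order.
counts = [2, 3, 2, 4, 3, 2, 3, 2, 4, 3, 3, 2, 38, 3, 2, 4, 3, 2, 3, 2]

# Chronological split: the first 30% trains, the rest is scored.
split = len(counts) * 3 // 10
train_counts, test_counts = counts[:split], counts[split:]

baseline = fit_baseline(train_counts)
anomaly_scores = score(test_counts, baseline)
flagged = [split + i for i, s in enumerate(anomaly_scores) if s != 0.0]
print(flagged)
```

As in the cell above, a non-zero score marks an anomalous bucket, and the flagged positions index back into the original series.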


## Semantic Anomaly Detection

We can also use the log templates for semantic-based anomaly detection. In this approach, we extract
semantic features from the logs in two parts: vectorizing the unstructured log templates
and encoding the structured log attributes.

### Vectorization for unstructured loglines

Here we use `word2vec` to vectorize the unstructured part of the logs. The output is a list of
numeric vectors that represent the semantic features of these log templates.

In [9]:
from logai.information_extraction.log_vectorizer import VectorizerConfig, LogVectorizer

vectorizer_config = VectorizerConfig(
    algo_name = "word2vec"
)

vectorizer = LogVectorizer(
    vectorizer_config
)

# Train vectorizer
vectorizer.fit(parsed_loglines)

# Transform the loglines into features
log_vectors = vectorizer.transform(parsed_loglines)
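
One common way to turn per-token embeddings into a single vector per template is to average the token vectors. The sketch below shows that idea with a tiny hand-written embedding table. The values are invented, a trained word2vec model would learn them from the corpus, and LogAI may combine token vectors differently, so treat this as an assumption rather than a description of `LogVectorizer`.

```python
# Hypothetical 3-dimensional embeddings; a trained word2vec model would learn these.
toy_embeddings = {
    "flush": [0.9, 0.1, 0.0],
    "sensor": [0.2, 0.8, 0.1],
    "data": [0.1, 0.7, 0.3],
}


def embed_template(template, embeddings, dim=3):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [embeddings[token] for token in template.split() if token in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(component) / len(vectors) for component in zip(*vectors)]


vector = embed_template("flush sensor data", toy_embeddings)
unknown = embed_template("onStandStepChanged *", toy_embeddings)
print(vector)
print(unknown)
```

Templates made only of unseen tokens fall back to a zero vector here, which is one simple policy among several.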

### Categorical Encoding for log attributes

We also apply categorical encoding to the log attributes to convert the strings into numerical representations.

In [10]:
from logai.information_extraction.categorical_encoder import CategoricalEncoderConfig, CategoricalEncoder

encoder_config = CategoricalEncoderConfig(name="label_encoder")

encoder = CategoricalEncoder(encoder_config)

attributes_encoded = encoder.fit_transform(attributes)
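
Label encoding itself is simple: each distinct string is assigned an integer code. The sketch below illustrates it in plain Python using `Action` values from the dataset above. It assigns codes in sorted order, which is an assumption about the ordering and not a statement about what `CategoricalEncoder` does internally; `fit_label_encoder` is a hypothetical helper.

```python
def fit_label_encoder(values):
    """Map each distinct value to an integer code, in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


# Action values taken from the first rows of the dataset shown earlier.
actions = [
    "Step_LSC",
    "Step_StandReportReceiver",
    "Step_LSC",
    "Step_StandStepCounter",
    "Step_SPUtils",
]

mapping = fit_label_encoder(actions)
encoded = [mapping[action] for action in actions]
print(mapping)
print(encoded)
```

The integer codes carry no ordering meaning; they only give each category a numeric identity that downstream models can consume.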

### Feature Extraction

Then we extract and concatenate the semantic features for both the unstructured and structured parts of the logs.


In [11]:
from logai.information_extraction.feature_extractor import FeatureExtractorConfig, FeatureExtractor

timestamps = logrecord.timestamp['timestamp']

config = FeatureExtractorConfig(
    max_feature_len=100
)

feature_extractor = FeatureExtractor(config)

_, feature_vector = feature_extractor.convert_to_feature_vector(log_vectors, attributes_encoded, timestamps)




### Anomaly Detection

With the extracted log semantic feature set, we can perform anomaly detection to find the abnormal
logs. Here we use `isolation_forest` as an example.

In [12]:
from sklearn.model_selection import train_test_split
from logai.algorithms.anomaly_detection_algo.isolation_forest import IsolationForestParams
from logai.analysis.anomaly_detector import AnomalyDetectionConfig, AnomalyDetector

# Random 70/30 split of the feature vectors
train, test = train_test_split(feature_vector, train_size=0.7, test_size=0.3)

algo_params = IsolationForestParams(
    n_estimators=10,
    max_features=100
)
config = AnomalyDetectionConfig(
    algo_name='isolation_forest',
    algo_params=algo_params
)

anomaly_detector = AnomalyDetector(config)
anomaly_detector.fit(train)
res = anomaly_detector.predict(test)
# obtain the anomalous datapoints
anomalies = res[res==1]
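
For readers who want to see an isolation forest in isolation, the sketch below calls scikit-learn's `IsolationForest` directly on synthetic 2-D points with one obvious outlier. The data and parameter values are assumptions made for illustration. Note that scikit-learn's own `predict` returns -1 for outliers and 1 for inliers; the LogAI wrapper used above may label its results differently, so check its output before reusing this convention.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: a cluster around the origin plus one far-away point.
rng = np.random.default_rng(0)
normal_points = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outlier = np.array([[8.0, 8.0]])

# Fit on the normal cluster only; random_state makes the run repeatable.
model = IsolationForest(n_estimators=100, random_state=0)
model.fit(normal_points)

# Score a few training points together with the outlier.
candidates = np.vstack([normal_points[:5], outlier])
labels = model.predict(candidates)
print(labels)
```

Points that can be separated from the rest with few random splits receive the outlier label, which is why the far-away point is flagged.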



#### Check the corresponding loglines

In [26]:
x = loglines.iloc[anomalies.index].head(5)
print(x[119363])
print(x)

 getTodayTotalDetailSteps = 1514509800000##1012##1214852##83501##91877##281665596
125187    onReceive action: android.intent.action.SCREEN...
119363     getTodayTotalDetailSteps = 1514509800000##101...
182220                              onStandStepChanged 1356
170650    setTodayTotalDetailSteps=1514644020000##8332##...
14713     setTodayTotalDetailSteps=1514112540000##9656##...
Name: logline, dtype: object


#### Check the corresponding attributes

In [14]:
attributes.iloc[anomalies.index].head(5)

|   | Action | ID |
|---|---|---|
| 125187 | Step_StandReportReceiver | 30002312 |
| 119363 | Step_SPUtils | 30002312 |
| 182220 | Step_LSC | 30002312 |
| 170650 | Step_SPUtils | 30002312 |
| 14713 | Step_SPUtils | 30002312 |
