# Word2Vec on Windows Eventlogs
This notebook is part of a project that uses Word2Vec for anomaly detection in Windows 10 event logs.<br>
It uses a dataset consisting of all events from the System event log of my own PC.<br>
The data was read using PowerShell and Get-WinEvent. You can find the PowerShell script and the Python script for parsing in my GitHub repository.<br>
This version uses the Gensim implementation of Word2Vec; <a target="_blank" href="https://colab.research.google.com/github/Grrtzm/word2vec/blob/main/windows_eventlog_anomaly_detection_with_tensorflow_word2vec.ipynb">another version uses a TensorFlow implementation of Word2Vec</a>.

## Setup

In [None]:
!pip install gensim
import gensim

In [None]:
import pandas as pd
import string
# from time import time  # To time our operations
from datetime import datetime # For DateTime -> date operations
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

# Word2Vec hyperparameters.
# Note: the Word2Vec(...) call further down passes literal values, not these variables.
# num_ns: number of negative samples per positive context.
# Values in [5, 20] are reported to work best for smaller datasets, while [2, 5] suffices for larger datasets.
num_ns = 5
window_size = 5
embedding_dim = 20  # Dimension of the dense embedding. 20 seems to be sufficient; 128 is the default value from the tutorial.
vocab_size = 20000  # Initial size of the vocabulary. We will resize it later before defining the model.
# sequence_length = 40  # Number of words in a sentence.
epochs = 1000  # Number of training epochs for Word2Vec

# df = pd.read_csv('D:\logs\AllEvents-sorted.csv', parse_dates=["TimeCreated"]) 
df = pd.read_csv('D:/logs/winevt/20211202/System-Events-custom.csv', parse_dates=["TimeCreated"]) 
df.head()

## Create Event "word" from multiple columns
This creates a new column containing the words Word2Vec will be trained on.
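
As a rough illustration of the idea, the sketch below builds one such "word" from three fields using only the standard library. The function name `make_event_word` and the sample values are hypothetical; the notebook itself does the same thing column-wise with pandas.

```python
import string

# Characters to strip. '|' is deliberately left out of this set,
# mirroring the notebook's choice to reserve it as a join separator.
PUNCT = string.punctuation.replace('|', '')
# Translation table that deletes punctuation and spaces in one pass.
TRANSTAB = str.maketrans('', '', PUNCT + ' ')

def make_event_word(event_id, level, provider):
    """Concatenate the three fields, drop punctuation and spaces, lowercase."""
    raw = f"{event_id}{level}{provider}"
    return raw.translate(TRANSTAB).lower()

# Hypothetical sample values, not taken from the dataset.
print(make_event_word(7040, "Information", "Service Control Manager"))
# -> 7040informationservicecontrolmanager
```

Collapsing each event into a single token means Word2Vec treats every distinct (ID, level, provider) combination as one vocabulary entry.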

In [None]:
import string
punct = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}~'   # `|` is not present here
transtab = str.maketrans(dict.fromkeys(punct, ''))

# Define a function to remove spaces
# The punctuation removal (transtab above) is based on:
# https://iqcode.com/code/python/pandas-series-remove-punctuation
# and https://stackoverflow.com/questions/50444346/fast-punctuation-removal-with-pandas
def remove_spaces(text):
    # A single replace removes every space; no loop is needed
    return text.replace(' ', '')

# Create a new column. Concatenate the columns, remove unwanted characters and convert to lowercase
df['Event'] = df['EventID'].map(str) + df['Level'].map(str) + df['Provider'].apply(remove_spaces).map(str)
df['Event'] = '|'.join(df['Event'].tolist()).translate(transtab).split('|') # remove all other unwanted characters
df['Event'] = df['Event'].str.lower()

# Delete the redundant columns
df = df.drop(['EventID','Level','Provider'], axis=1)

# Change order of columns by name, so we can display it orderly
df = df[['TimeCreated', 'EventRecordID', 'Event', 'Message']]

# Show a preview
df.head(8)

In [None]:
num_rows = len(df)
print(f"Number of lines/events in the dataset: {num_rows}\n")

## Data Preprocessing

Gensim's Word2Vec expects its training corpus as a list of lists: each inner list is one "sentence" (document), and its items are the tokens of that sentence.
<br>
Here every token is an `Event` word created above, and every sentence contains the events logged on one calendar day, in the order in which they appear in the dataset.
To achieve this, we need to do the following data preprocessing steps:
1. Walk through the events in dataset order and take the date from `TimeCreated`.
2. Start a new list whenever the date changes, and append each event to the list of its own day.
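
A minimal sketch of grouping events into one token list per calendar day, using only the standard library. The sample timestamps and event words are hypothetical, and `group_by_day` is an illustrative helper, not part of the notebook; it assumes the rows are already sorted by time.

```python
from datetime import datetime
from itertools import groupby

# Hypothetical (timestamp, event word) pairs, already sorted by time.
events = [
    (datetime(2021, 12, 1, 8, 0), "eventa"),
    (datetime(2021, 12, 1, 9, 30), "eventb"),
    (datetime(2021, 12, 2, 7, 15), "eventa"),
    (datetime(2021, 12, 2, 7, 16), "eventc"),
    (datetime(2021, 12, 2, 23, 59), "eventb"),
]

def group_by_day(rows):
    """Return one list of event words per run of rows sharing a calendar date."""
    return [
        [word for _, word in day_rows]
        for _, day_rows in groupby(rows, key=lambda r: r[0].date())
    ]

sentences = group_by_day(events)
print(sentences)
# -> [['eventa', 'eventb'], ['eventa', 'eventc', 'eventb']]
```

Like a loop that starts a new list whenever the date changes, `groupby` only merges consecutive rows, so unsorted input would produce several lists for the same day.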

In [None]:
# Here I create a text dataset eventlist, a list of lists. Each list eventrow is filled with all events of one day.
# See also cell [8] of lees-data-v3
eventlist = []
eventrow = []
previous_date = None
for idx, row in df.iterrows():
    date = row['TimeCreated'].date()
    # Start a new list first when the date changes, so the event lands in the list of its own day
    if date != previous_date:
        eventrow = []
        eventlist.append(eventrow)
        previous_date = date
    eventrow.append(row['Event'])

In [None]:
# print(len(eventlist))
# print()
# print(eventlist[:3])

In [None]:
## Train the Gensim word2vec model with our own custom corpus

# Skip-gram:
model = Word2Vec(eventlist, min_count=1, vector_size=20, workers=8, window=3, sg=1, epochs=1000) #, compute_loss=True) #, callbacks=[epoch_logger])

# CBOW:
# model = Word2Vec(eventlist, min_count=1, vector_size=20, workers=8, window=3, sg=0, epochs=1000, hs=1, negative=0) #, compute_loss=True) #, callbacks=[epoch_logger])

word_vectors = model.wv
word_vectors.save('vectors.kv')
reloaded_word_vectors = KeyedVectors.load('vectors.kv')

Reference: https://radimrehurek.com/gensim/models/word2vec.html and https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4
Apparently, Gensim v4.0 was released on 31-3-2021

Let's try to understand the hyperparameters of this model.
1. vector_size: The number of dimensions of the embeddings. The default is 100.
2. window: The maximum distance between a target word and the words around it. The default is 5.
3. min_count: The minimum number of occurrences a word needs to be included in training; words that occur less often are ignored. The default is 5.
4. workers: The number of worker threads used for training. The default is 3.
5. sg: The training algorithm, either CBOW (0) or skip-gram (1). The default is CBOW.
6. epochs: The number of training passes over the corpus. The default is 5.
7. hs: If 1, hierarchical softmax is used; if 0 (the default) and `negative` is non-zero, negative sampling is used.
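
To make the `window` parameter more concrete, the sketch below lists the (target, context) pairs that skip-gram would consider for one sentence. This is a simplification written for illustration: `skipgram_pairs` is a hypothetical helper, and real implementations may additionally sample a smaller effective window per target or subsample frequent words.

```python
def skipgram_pairs(sentence, window):
    """Return all (target, context) pairs within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, sentence[j]))
    return pairs

# Hypothetical one-day "sentence" of event words.
sentence = ["e1", "e2", "e3", "e4"]
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
# -> [('e1', 'e2'), ('e2', 'e1'), ('e2', 'e3'), ('e3', 'e2'), ('e3', 'e4'), ('e4', 'e3')]
```

With daily event lists, a larger window relates each event to events further away in time on the same day.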

In [None]:
# https://radimrehurek.com/gensim/models/keyedvectors.html#module-gensim.models.keyedvectors
# distances(word_or_vector, other_words=())
# Compute cosine distances from given word or vector to all words in other_words. If other_words is empty, return distance between word_or_vectors and all words in vocab.

print(f"Number of words in internal Word2Vec vocabulary: {len(model.wv)}")
print(model.wv.get_vector('system137errorntfs', norm=True))
print()
print(model.wv.similarity('system136warningntfs','system137errorntfs'))
print()
print(model.wv.distance('system136warningntfs','system137errorntfs'))
print()
print(model.wv.distances('system137errorntfs', other_words=(["system26informationapplicationpopup","system51warningdisk"])))
print()
print(model.wv.distance('system137errorntfs','system26informationapplicationpopup'))
print()
print(model.wv.distance('system137errorntfs','system7040informationservicecontrolmanager'))
print()
print(model.wv.rank_by_centrality(['system51warningdisk','system137errorntfs'], use_norm=True))
print()

# print(model.wv['system18informationbthusb']) # Error: KeyError: "Key 'system18informationbthusb' not present"
# print(model.wv.most_similar('system136warningntfs')) # Error: KeyError: "Key 'system18informationbthusb' not present"
# print()
# print(model.wv.get_vector('system136warningntfs', norm=True))
# print()
# print(model.wv.evaluate_word_pairs('system136warningntfs','system137errorntfs')) # This does not work (the first argument must be the path to a word-pairs file); FileNotFoundError: [Errno 2] No such file or directory: 'system136warningntfs'
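
The relation between the similarity and distance values printed above can be sketched in plain Python. This assumes cosine distance is defined as 1 minus cosine similarity, which is how Gensim's `distance` is documented; the vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 = same direction, -1 = opposite)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Cosine distance, here taken as 1 - cosine similarity (range 0..2)."""
    return 1.0 - cosine_similarity(u, v)

# Hypothetical 2-dimensional embeddings.
same = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
print(same, orthogonal, opposite)
```

Under that definition, the `similarity` and `distance` values for the same pair of events should add up to 1.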

In [None]:
# Cosine similarity between each event and the event that precedes it
events = df['Event'].tolist()
cos_sim = [0.0]  # the first event has no predecessor, so it gets a placeholder of 0
for previous_event, current_event in zip(events, events[1:]):
    cos_sim.append(model.wv.similarity(previous_event, current_event))

df['cos_sim'] = cos_sim

# Change order of columns by name, so we can display it orderly
df = df[['TimeCreated', 'EventRecordID', 'Event', 'cos_sim', 'Message']]
df.head()

# saving the dataframe
df.to_csv('D:/logs/winevt/20211202/System-Events-dist.csv')

In [None]:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

fig, ax = plt.subplots(1, figsize=(200, 20))
x = df['TimeCreated']
y = df['cos_sim']
line, = ax.plot(x, y)
# Major ticks every month, minor ticks every day.
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_minor_locator(mdates.DayLocator())
ax.grid(True, which='both', axis='both')
# Major tick labels are displayed in 'YYYY-Mon-dd' format (e.g. 2021-Dec-02), minor tick labels as the day of the month.
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%b-%d'))
ax.xaxis.set_minor_formatter(mdates.DateFormatter('%d'))
# Rotates and right-aligns the x labels so they don't crowd each other.
for label in ax.get_xticklabels(which='major'):
    label.set(rotation=90, horizontalalignment='right')
for label in ax.get_xticklabels(which='minor'):
    label.set(rotation=90, horizontalalignment='right')
plt.xlabel("Date")
plt.ylabel("Cosine Similarity")
plt.savefig("cos_sim.png", format="png")
plt.show()

In [None]:
# Pandas display settings
# Set it to None to display all columns in the dataframe
pd.set_option('display.max_columns', None)
pd.set_option('display.width',200)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.colheader_justify','left')
#df.style.set_properties(**{'text-align': 'left'})
#df.style.set_properties(subset=['Event', 'Message'], **{'text-align': 'left'})

#dfStyler = df.style.set_properties(**{'text-align': 'left'})
#dfStyler.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])
df.style.set_properties(**{'text-align': 'left'}).set_table_styles([ dict(selector='th', props=[('text-align', 'left')] ) ])

print("Top 10 of all 'anomalies':\n")
df.nsmallest(n=10, columns=['cos_sim'])
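
The ranking step can be sketched without pandas: score every event by its cosine similarity to the previous event, then pick the lowest scores. The embeddings and event names below are hypothetical. One thing this sketch does differently, as an assumption on my part: it skips the first event, because a placeholder score of 0 sorts below every positive similarity and could otherwise appear among the "anomalies".

```python
import heapq
import math

# Hypothetical 2-dimensional embeddings for three event words.
vectors = {
    "boot": [1.0, 0.0],
    "login": [0.9, 0.1],
    "diskerror": [-0.2, 1.0],
}

def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

sequence = ["boot", "login", "boot", "diskerror", "login"]

# Score each event against its predecessor; index 0 only gets a placeholder.
scores = [0.0]
for prev, cur in zip(sequence, sequence[1:]):
    scores.append(cos(vectors[prev], vectors[cur]))

# Indices of the two least expected transitions, ignoring the placeholder at index 0.
lowest = heapq.nsmallest(2, range(1, len(sequence)), key=scores.__getitem__)
for i in lowest:
    print(i, sequence[i - 1], "->", sequence[i], round(scores[i], 3))
```

A low score here means the event rarely follows its predecessor in the training data, which is not necessarily a fault.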

In [None]:
# Save result to a file:
# df.to_csv('D:/Machine_Learning/word2vec/file1.csv')

In [None]:
# Only display the rows containing 'critical'
dfcritical = df[df['Event'].str.contains('critical')]
print(f"Number of 'critical' events: {len(dfcritical)}\n")
print("Top 10 of 'critical' anomalies:\n")
dfcritical.nsmallest(n=10, columns=['cos_sim'])

In [None]:
# Only display the rows containing 'error'
dferror = df[df['Event'].str.contains('error')]
print(f"Number of 'error' events: {len(dferror)}\n")
print("Top 10 of 'error' anomalies:\n")
dferror.nsmallest(n=10, columns=['cos_sim'])

In [None]:
from re import search
print(f"   Index TimeCreated                      EventRecordID Event                                                        cos_sim")
for ind in dferror.index:
    if not search('errortcpip', dferror['Event'][ind]):
        if not search('system10010', dferror['Event'][ind]):
            print(f"{ind:8d} {dferror['TimeCreated'][ind]} {dferror['EventRecordID'][ind]:13d} {dferror['Event'][ind]:60s} {dferror['cos_sim'][ind]:.8f}")