# LANL Host Dataset

### Article : https://www.worldscientific.com/doi/pdf/10.1142/9781786345646_001

Notes:

"The events from the host logs included in the data set are all related to authentication and process activity on each machine"

### Getting the dataset:
1- https://csr.lanl.gov/data/2017/#citing : provide an email address to receive the download link

2- https://csr.lanl.gov/data-fence/... 10 digits.../... token... iXYXXbqw15UugRnZALCZ2Y8dvEk=... /unified-host-network-dataset-2017/wls.html : index page listing all the compressed files

3- download : for i in $(seq -w 1 90); do wget -c https://csr.lanl.gov/data-fence/...10 digits.../...token.../unified-host-network-dataset-2017/wls/wls_day-$i.bz2; done

4- decompress, as required : bzip2 -dk filename.bz2
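
As an alternative to decompressing each file to disk, the archives can be read as a stream with Python's standard `bz2` module. The following is a minimal sketch: the helper name `iter_events` and its `limit` parameter are illustrative assumptions, not part of the dataset tooling.

```python
import bz2
import json


def iter_events(path, limit=None):
    """Yield one dict per JSON line of a bz2-compressed log file.

    `limit` (hypothetical convenience parameter) stops after that many lines.
    """
    # "rt" opens the archive in text mode and decompresses on the fly
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        for n, line in enumerate(f):
            if limit is not None and n >= limit:
                break
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Calling `iter_events("wls_day-01.bz2", limit=20)` would then give a quick look at the first events without writing the decompressed file to disk.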

### EventID : 

| EventID | Category | Description |
|---|---|---|
| 4768 | Authentication | Kerberos authentication ticket was requested (TGT) |
| 4769 | Authentication | Kerberos service ticket was requested (TGS) |
| 4770 | Authentication | Kerberos service ticket was renewed |
| 4774 | Authentication | An account was mapped for logon |
| 4776 | Authentication | Domain controller attempted to validate credentials |
| 4624 | Authentication | An account successfully logged on, see Logon Types |
| 4625 | Authentication | An account failed to log on, see Logon Types |
| 4634 | Authentication | An account was logged off, see Logon Types |
| 4647 | Authentication | User initiated logoff |
| 4648 | Authentication | A logon was attempted using explicit credentials |
| 4672 | Authentication | Special privileges assigned to a new logon |
| 4800 | Authentication | The workstation was locked |
| 4801 | Authentication | The workstation was unlocked |
| 4802 | Authentication | The screensaver was invoked |
| 4803 | Authentication | The screensaver was dismissed |
| 4688 | Process | Process start |
| 4689 | Process | Process end |
| 4608 | System | Windows is starting up |
| 4609 | System | Windows is shutting down |
| 1100 | System | Event logging service has shut down (often recorded instead of EventID 4609) |


Detailed description of each EventID: https://learn.microsoft.com/en-us/windows-server/identity/ad-ds/plan/appendix-l--events-to-monitor


### Logon Types for EventIDs: 4624, 4625 and 4634

| LogonType | Description |
|---|---|
| 0 | Used only by the system account |
| 2 | Interactive |
| 3 | Network |
| 4 | Batch |
| 5 | Service |
| 7 | Unlock |
| 8 | Network Clear Text |
| 9 | New Credentials |
| 10 | Remote Interactive |
| 11 | Cached Interactive |
| 12 | Cached Remote-Interactive |

### Host Log Fields

• Time: The epoch time of the event in seconds.

• EventID: Four digit integer corresponding to the event id of the record.

• LogHost: The hostname of the computer that the event was recorded on. In the case of directed authentication events, the LogHost will correspond to the computer that the authentication event is terminating at (destination computer).

• LogonType: Integer corresponding to the type of logon, see the Logon Types section above.

• LogonTypeDescription: Description of the LogonType, see the Logon Types section above.

• UserName: The user account initiating the event. If the user ends in $, then it corresponds to a computer account for the specified computer.

• DomainName: Domain name of UserName.

• LogonID: A semi-unique (unique between current sessions and LogHost) number that identifies the logon session just initiated. Any events logged subsequently during this logon session should report the same LogonID through to the logoff event.

• SubjectUserName: For authentication mapping events, the user account specified by this field is mapping to the user account in UserName.

• SubjectDomainName: Domain name of SubjectUserName.

• SubjectLogonID: See LogonID.

• Status: Status of the authentication request. "0x0" means success, otherwise failure; failure codes for the appropriate EventID are available online.

• Source: For authentication events, this will correspond to the computer where the authentication originated (source computer); if it is a local logon event, then this will be the same as the LogHost.

• ServiceName: The account name of the computer or service the user is requesting the ticket for.

• Destination: This is the server the mapped credential is accessing. This may indicate the local computer when starting another process with new account credentials on a local computer.

• AuthenticationPackage: The type of authentication occurring including Negotiate, Kerberos, NTLM plus a few more.

• FailureReason: The reason for a failed logon.

• ProcessName: The process executable name; for authentication events, this is the process that processed the authentication event. ProcessNames may include the file type extension (e.g., .exe).

• ProcessID: A semi-unique (unique between currently running processes AND LogHost) value that identifies the process. ProcessID allows you to correlate other events logged in association with the same process through to the process end.

• ParentProcessName: The process executable that started the new process. ParentProcessNames often do not have file extensions like ProcessName but can be compared by removing file extensions from the name.

• ParentProcessID: Identifies the exact process that started the new process. Look for a preceding event 4688 with a ProcessID that matches this ParentProcessID.
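
The ProcessID description above suggests a simple way to correlate process events. The sketch below pairs each process start (4688) with its matching end (4689) on the same LogHost. The function name `pair_process_events` and the sample records are made up for illustration; only the field names come from the list above.

```python
def pair_process_events(events):
    """Pair process start (4688) and end (4689) events by (LogHost, ProcessID).

    Returns a list of (host, pid, process_name, start_time, end_time) tuples.
    Events are assumed to be dicts sorted by Time.
    """
    running = {}  # (LogHost, ProcessID) -> start event of a process still running
    pairs = []
    for ev in events:
        key = (ev.get("LogHost"), ev.get("ProcessID"))
        if ev.get("EventID") == 4688:
            running[key] = ev
        elif ev.get("EventID") == 4689 and key in running:
            # pop the start event so that a reused ProcessID starts a new pair
            start = running.pop(key)
            pairs.append((key[0], key[1], start.get("ProcessName"),
                          start["Time"], ev["Time"]))
    return pairs


# Synthetic records, for illustration only
sample = [
    {"Time": 10, "EventID": 4688, "LogHost": "Comp1", "ProcessID": "0x1a4", "ProcessName": "cmd.exe"},
    {"Time": 12, "EventID": 4624, "LogHost": "Comp1", "LogonType": 3},
    {"Time": 25, "EventID": 4689, "LogHost": "Comp1", "ProcessID": "0x1a4", "ProcessName": "cmd.exe"},
]
pairs = pair_process_events(sample)
print(pairs)
```

Keying on the pair (LogHost, ProcessID) reflects the "semi-unique" caveat: a ProcessID is only meaningful on one host and while the process is running.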

### Libraries

In [1]:
import json
import os
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

# the backend must be selected before keras is imported
os.environ["KERAS_BACKEND"] = "tensorflow"

import tensorflow as tf
import keras
from keras import ops
from keras import layers

from sklearn.preprocessing import StandardScaler

from sklearn.svm import OneClassSVM  # used in the One Class SVM section

# from sklearn.decomposition import PCA
# from sklearn.decomposition import KernelPCA

from sklearn.neighbors import KernelDensity
# from sklearn.model_selection import GridSearchCV

from sklearn.mixture import BayesianGaussianMixture  # used in the Bayesian Gaussian Mixture section



### Line counts of the first two files, wls_day-01.bz2 and wls_day-02.bz2, once decompressed: more than 50 million lines per file

In [2]:
# print the number of lines in each file

def get_lines_number(filepath):
    """Return the number of lines in a *.json file.

    Args:
        filepath (str): full path to the file
    """
    
    with open(filepath) as f:
        ctr_lignes = sum(1 for _ in f)
        
    return ctr_lignes

In [25]:
print(os.getcwd())

if os.getcwd()=='/home/benjamin.deporte/Cyber/code':
    DIRPATH='/home/benjamin.deporte/Cyber/data/LALN_processed/'
else:
    DIRPATH='/home/benjamin/Folders_Python/Cyber/data/LALN_processed/'

/home/benjamin.deporte/Cyber/code


In [26]:
dirpath = DIRPATH
liste_fichiers = [ 'wls_day-01.json', 'wls_day-02.json']

for filename in liste_fichiers:
    filepath = dirpath + filename

    # print(f'{filename} contains {get_lines_number(filepath)} lines')

### Data processing framework

In [27]:
# DATA STRUCTURE
# --------------

# there are 90 wls_day-nn.bz2 files, one per day of recording

# each wls_day-nn.bz2 file can be decompressed:
# - from the command line: bzip2 -dk filename.bz2 (keeps the original compressed file)
# - with the Python library: https://docs.python.org/3/library/bz2.html

# a decompressed file is a *.json file in JSON Lines format: each line of the file is a separate JSON object

# each JSON object is flat (a single level), with at most 21 (key, value) pairs

# ALGORITHM
# ---------

# the JSON objects are read one by one (line by line) and the data is processed sequentially.
# each field is managed by a dictionary, where:
# - the keys are the values of the field seen so far
# - the values are a list of two ints: 1/ a code for the categorical encoding, 2/ a counter

# PROCESS:
# --------

# loop over the JSON objects:
# - concatenate UserName and DomainName into UserNameDomainName, which identifies the user account. Drop UserName, DomainName
# - for each of the 20 fields:
# -- check whether it is present in the JSON object
# ---- if absent: value = 0 (code reserved for None / NaN)
# ---- if present: look it up in the field's dictionary
# ------ if it already exists: return its code, increment the counter
# ------ if it does not exist: create the key, create a new code (= max of previous codes + 1), set the counter to 1
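
The encoding scheme described above can be sketched with one plain dictionary per field, which keeps the lookup cost constant even when a field has many distinct values. This is an illustrative variant, not the notebook's implementation: the class name `StreamingEncoder` is hypothetical, and code 0 is reserved for missing values, as the `FieldRecord` class does.

```python
class StreamingEncoder:
    """Learn integer codes for the values of one field, one value at a time."""

    def __init__(self):
        # value -> [code, count]; code 0 is reserved for missing values (None)
        self.table = {None: [0, 0]}

    def encode(self, value):
        entry = self.table.get(value)
        if entry is None:
            # new value: codes are 0..n-1, so the next free code is len(table)
            entry = [len(self.table), 1]
            self.table[value] = entry
        else:
            # known value: increment its counter
            entry[1] += 1
        return entry[0]


enc = StreamingEncoder()
codes = [enc.encode(v) for v in ["Comp1", "Comp2", None, "Comp1"]]
print(codes)               # [1, 2, 0, 1]
print(enc.table["Comp1"])  # [1, 2]
```

A dictionary requires hashable values, which holds for the flat JSON objects described above (strings and numbers only).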

In [28]:
# list of fields to process

list_fields = [
    'Time', # int64
    'EventID', # int64
    'LogHost', # object
    'LogonType', # float64, corresponding to the type of logon, see Table 2.
    'LogonTypeDescription', # object Description of the LogonType, see Table 2.
    'UserName', # object - The user account initiating the event. If the user ends in $, then it corresponds to a computer account for the specified computer.
    'DomainName', # object - Domain name of UserName.
    'LogonID', # object - A semi-unique (unique between current sessions and LogHost) number that identifies the logon session just initiated. Any events logged subsequently during this logon session should report the same LogonID through to the logoff event.
    'SubjectUserName', # object. For authentication mapping events, the user account specified by this field is mapping to the user account in UserName.
    'SubjectDomainName', # object - Domain name of SubjectUserName.
    'SubjectLogonID', # object - See LogonID.
    'Status', # object - Status of the authentication request. "0x0" means success, otherwise failure; failure codes for the appropriate EventID are available online.
    'Source', # object - For authentication events, this will correspond to the computer where the authentication originated (source computer); if it is a local logon event, then this will be the same as the LogHost.
    'ServiceName', # object - The account name of the computer or service the user is requesting the ticket for.
    'Destination', # object - This is the server the mapped credential is accessing. This may indicate the local computer when starting another process with new account credentials on a local computer.
    'AuthenticationPackage', # object - The type of authentication occurring including Negotiate, Kerberos, NTLM plus a few more.
    'FailureReason', # object - The reason for a failed logon.
    'ProcessName', # object - The process executable name, for authentication events this is the process that processed the authentication event. ProcessNames may include the file type extensions (i.e., exe).
    'ProcessID', # object - A semi-unique (unique between currently running processes AND LogHost) value that identifies the process. ProcessID allows you to correlate other events logged in association with the same process through to the process end.
    'ParentProcessName', # object - The process executable that started the new process. ParentProcessNames often do not have file extensions like ProcessName but can be compared by removing file extensions from the name.
    'ParentProcessID', # object - Identifies the exact process that started the new process. Look for a preceding event 4688 with a ProcessID that matches this ParentProcessID.
    'UserNameDomainName', # object - not present in the files; concatenation of UserName and DomainName
] 

In [29]:
class FieldRecord():
    """Manage one field. Values are learned on the fly, and a categorical code is assigned to each distinct value.
    
    - list_values (list): the values learned so far for the field. Initial value: [None]
    - list_counts (list): the number of occurrences seen so far for the corresponding value. Initial value: [0]
    - nom: the field name
    - utility methods
    """
    
    def __init__(self, nom):
        self.nom = nom # field name
        self.list_values = [None] # values the field can take. None is initialised with categorical code 0
        self.list_counts = [0] # number of occurrences seen for the corresponding value
           
    def __str__(self):
        ctr = sum(self.list_counts)
        msg = f'FieldRecord object for field {self.nom} \n - knows {len(self.list_values)} distinct values \n - has seen {ctr} values in total'
        return msg
    
    # same text for repr() and str()
    __repr__ = __str__
    
    def get_field_categorical_value(self, val):
        """Return the int used to categorically encode the value passed as argument.
        NB: NaN or None are encoded as 0 by default (see the constructor).

        Args:
            val (int or str): value extracted from the JSON line
        """
        if val not in self.list_values:
            # new value: append it to the list of known values
            self.list_values.append(val) 
            self.list_counts.append(1)
            cat_val = self.list_values.index(val)
        else:
            # known value: its index in the list is its categorical code; increment the counter
            cat_val = self.list_values.index(val) 
            self.list_counts[cat_val] += 1
            
        return cat_val

In [30]:
# create the 21+1 FieldRecord objects (21 fields from the files, plus the derived UserNameDomainName)

def get_fresh_dico():
    
    dico = {
        field : FieldRecord(field) for field in list_fields
    }
    return dico

In [31]:
dico = get_fresh_dico()

In [32]:
def display_stats_dico(dico):
    """Return descriptive statistics for the full dictionary of processed fields.
    """

    output = ""
    
    for field in list_fields:
        ctr = sum(dico[field].list_counts)
        output = output + f"Field {field}: distinct values learned {len(dico[field].list_values)}, values seen: {ctr}\n"
        
    return output

In [33]:
print(display_stats_dico(dico))

Field Time: distinct values learned 1, values seen: 0
Field EventID: distinct values learned 1, values seen: 0
Field LogHost: distinct values learned 1, values seen: 0
Field LogonType: distinct values learned 1, values seen: 0
Field LogonTypeDescription: distinct values learned 1, values seen: 0
Field UserName: distinct values learned 1, values seen: 0
Field DomainName: distinct values learned 1, values seen: 0
Field LogonID: distinct values learned 1, values seen: 0
Field SubjectUserName: distinct values learned 1, values seen: 0
Field SubjectDomainName: distinct values learned 1, values seen: 0
Field SubjectLogonID: distinct values learned 1, values seen: 0
Field Status: distinct values learned 1, values seen: 0
Field Source: distinct values learned 1, values seen: 0
Field ServiceName: distinct values learned 1, values seen: 0
Field Destination: distinct values learned 1, values seen: 0
Field Au

In [34]:
# ---------------------------------
# -------------- Test -------------
# ---------------------------------

dirpath = DIRPATH
filename = 'wls_day-02.json'
filepath = dirpath + filename

N_SAMPLES = 20

# 1st DataFrame: raw values
dict_for_df_raw = {
    field : [] for field in list_fields
}

# 2nd DataFrame: encoded values
dict_for_df = {
    field : [] for field in list_fields
}

dico = get_fresh_dico()

with open(filepath, 'r') as f:
    # read at most N_SAMPLES lines, one by one, and convert each to a Python dict
    for i, line in zip(range(N_SAMPLES), f):
        obj_json = json.loads(line)
        # build UserNameDomainName by hand; left missing (None) if either part is absent
        user, domain = obj_json.get('UserName'), obj_json.get('DomainName')
        if user is not None and domain is not None:
            obj_json['UserNameDomainName'] = user + domain
        # print(obj_json)
        # get each field's value (possibly None) and translate it to its categorical code via dico
        for field in list_fields:
            # print(f'{field} = {obj_json.get(field,None)}')
            # print(dico[field])
            val = obj_json.get(field, None)
            dict_for_df_raw[field].append(val)
            dict_for_df[field].append(dico[field].get_field_categorical_value(val))
            
        msg = display_stats_dico(dico)
        
        # if i%10==0:
        #     print(msg + '\r')

FileNotFoundError: [Errno 2] No such file or directory: '/home/benjamin.deporte/Cyber/data/LALN_processed/wls_day-02.json'

In [None]:
df_raw = pd.DataFrame(dict_for_df_raw)

df_raw

In [None]:
df = pd.DataFrame(dict_for_df)

df

In [None]:
df.describe(include='all').transpose()

In [None]:
df.info()

In [None]:
# -----------------------------------------------
# -- PARKING LOT: code for an iterator ----------

# class LinesBatchIterator():
#     """iterator returning batches of size BATCH_SIZE from the json file FILEPATH
#     """
    
#     def __init__(self, filepath, batch_size=None):
#         self.filepath = filepath
#         if batch_size is None:
#             self.batch_size = 10 # default batch size, for debugging
#         else:
#             self.batch_size = batch_size
            
#     def __iter__(self):
#         return self
    
#     def __next__(self):
#         """return the next batch of size 'batch_size'
#         NB: as written, the file is reopened on every call, so each batch restarts at the first line
#         """
        
#         # 1st DataFrame: raw values - FOR DEBUGGING
#         dict_for_df_raw = {
#             field : [] for field in list_fields
#             }
        
#         dict_for_df = {
#             field : [] for field in list_fields
#             }
        
#         with open(self.filepath) as f:
#             for i in range(self.batch_size):
#                 # read the file line by line and convert each line to a Python dict
#                 # print(i)
#                 line = f.readline()
#                 # stop at end of file
#                 if line=="":
#                     raise StopIteration
#                 # otherwise, process the line
#                 obj_json = json.loads(line)
#                 # build UserNameDomainName by hand
#                 obj_json['UserNameDomainName'] = obj_json.get('UserName') + obj_json.get('DomainName')
#                 # get each field's value (possibly None) and translate it to its categorical code via dico
#                 for field in list_fields:
#                     val = obj_json.get(field, None)
#                     dict_for_df_raw[field].append(val)
#                     dict_for_df[field].append(dico[field].get_field_categorical_value(val))
#             df = pd.DataFrame(dict_for_df)
#             df_raw = pd.DataFrame(dict_for_df_raw)
            
#         return df_raw, df
    
#     def __repr__(self):
#         msg = f"batch iterator object, file = {self.filepath}, batch_size={self.batch_size}"
#         return msg
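
A generator is one simple way to get the behaviour intended by the commented-out iterator above: the file is opened once, and `itertools.islice` pulls the next `batch_size` lines at each step. This is a minimal sketch; `iter_line_batches` is a hypothetical name, and it yields lists of parsed JSON objects rather than DataFrames.

```python
import json
from itertools import islice


def iter_line_batches(filepath, batch_size=10):
    """Yield successive lists of at most `batch_size` parsed JSON lines."""
    with open(filepath) as f:
        while True:
            # islice resumes where the previous batch stopped
            lines = list(islice(f, batch_size))
            if not lines:  # end of file
                return
            yield [json.loads(line) for line in lines if line.strip()]
```

A loop such as `for batch in iter_line_batches(filepath, batch_size=2000): ...` then walks through the whole file without loading it in memory.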

In [None]:
# dirpath = '/home/benjamin/Folders_Python/Cyber/data/LALN_processed/'
# filename = 'wls_day-02.json'
# filepath = dirpath + filename

# line_iterator = LinesBatchIterator(filepath=filepath)

In [None]:
# print(line_iterator)

In [None]:
# df_raw, df = next(line_iterator)

# df_raw

In [None]:
# df_raw, df = next(line_iterator)

# df_raw

### EDA on a LARGE file

In [None]:
# choose the file

dirpath = '/home/benjamin/Folders_Python/Cyber/data/LALN_processed/'
filename = 'wls_day-01_subset_10000.json'

filepath = dirpath + filename

In [None]:
# Initialisation

# 1st DataFrame: raw values
dict_for_df_raw = {
    field : [] for field in list_fields
}

# 2nd DataFrame: encoded values
dict_for_df = {
    field : [] for field in list_fields
}

dico = get_fresh_dico()

In [None]:
%%time

print(f"Processing {filename}\n")

with open(file=filepath) as f:
    ctr_lines = 0
    for line in f:
        ctr_lines += 1
        print(f"processing line {ctr_lines}", end="\r", flush=True)
        obj_json = json.loads(line)
        # build UserNameDomainName by hand; left missing (None) if either part is absent
        user, domain = obj_json.get('UserName'), obj_json.get('DomainName')
        if user is not None and domain is not None:
            obj_json['UserNameDomainName'] = user + domain
        # print(obj_json)
        # get each field's value (possibly None) and translate it to its categorical code via dico
        for field in list_fields:
            # print(f'{field} = {obj_json.get(field,None)}')
            # print(dico[field])
            val = obj_json.get(field, None)
            dict_for_df_raw[field].append(val)
            dict_for_df[field].append(dico[field].get_field_categorical_value(val))
    print()
    
    # df_raw = pd.DataFrame(dict_for_df_raw)
    # df = pd.DataFrame(dict_for_df)

In [None]:
print(display_stats_dico(dico))

In [None]:
df = pd.DataFrame(dict_for_df)

In [None]:
df.describe(include='all').transpose()

In [None]:
df.info()

In [None]:
df

### Training a continuous VAE (inspired by F. Chollet's Keras tutorial)

In [None]:
# Sampling layer

class Sampling(layers.Layer):
    """Uses (z_mean, z_log_var) to sample z, the latent variable encoding an input."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.seed_generator = keras.random.SeedGenerator(42)

    def call(self, inputs):
        z_mean, z_log_var = inputs  # log-variance rather than standard deviation
        batch = ops.shape(z_mean)[0]
        dim = ops.shape(z_mean)[1]
        epsilon = keras.random.normal(shape=(batch, dim), seed=self.seed_generator)
        # reparameterization trick from the original article: Kingma and Welling, "Auto-Encoding Variational Bayes", ICLR 2014
        scale = 0.25 # scales the noise amplitude in the reparameterization trick, to avoid posterior collapse
        return z_mean + ops.exp(0.5 * z_log_var) * epsilon * scale

In [None]:
input_dim = len(list_fields)  # number of features
output_dim = input_dim
latent_dim = 2  # chosen for 2D display

In [None]:
# Encoder

encoder_inputs = keras.Input(shape=(input_dim,))

# basic MLP
x = layers.Dense(128, activation="relu")(encoder_inputs)
x = layers.Dense(64, activation="relu")(x)
x = layers.Dense(32, activation="relu")(x)

# output the mean and log-variance of the Gaussian
z_mean = layers.Dense(latent_dim, name="z_mean")(x)
z_log_var = layers.Dense(latent_dim, name="z_log_var")(x)

# sample from the Gaussian inferred by the MLP
z = Sampling()([z_mean, z_log_var])
encoder = keras.Model(encoder_inputs, [z_mean, z_log_var, z], name="encoder")
encoder.summary()

In [None]:
# Decoder

# input: vector from the latent space
latent_inputs = keras.Input(shape=(latent_dim,))

x = layers.Dense(32, activation="relu")(latent_inputs)
x = layers.Dense(64, activation="relu")(x)
x = layers.Dense(128, activation="relu")(x)

decoder_outputs = layers.Dense(output_dim)(x)

decoder = keras.Model(latent_inputs, decoder_outputs, name="decoder")
decoder.summary()

In [None]:
# VAE class

class VAE(keras.Model):
    def __init__(self, encoder, decoder, **kwargs):
        super().__init__(**kwargs)
        # encoder to the latent space
        self.encoder = encoder
        # decoder from the latent space
        self.decoder = decoder
        # losses
        self.total_loss_tracker = keras.metrics.Mean(name="total_loss")
        self.reconstruction_loss_tracker = keras.metrics.Mean(
            name="reconstruction_loss"
        )
        self.kl_loss_tracker = keras.metrics.Mean(name="kl_loss")

    @property
    def metrics(self):
        return [
            self.total_loss_tracker,
            self.reconstruction_loss_tracker,
            self.kl_loss_tracker,
        ]

    def train_step(self, data):
        # one forward pass, with the losses recorded for differentiation
        with tf.GradientTape() as tape:
            z_mean, z_log_var, z = self.encoder(data)  # encoder forward pass
            reconstruction = self.decoder(z)  # vector reconstructed from the sampled latent variable
            # 1st loss: reconstruction error between the data and its reconstruction (mean squared error)
            reconstruction_loss = ops.mean(keras.losses.mean_squared_error(data, reconstruction))
            # 2nd loss: KL divergence between the posterior z|x learned by the encoder and the N(0, I) target
            kl_loss = -0.5 * (1 + z_log_var - ops.square(z_mean) - ops.exp(z_log_var))
            kl_loss = ops.mean(ops.sum(kl_loss, axis=1))
            # total loss (negative ELBO)
            total_loss = reconstruction_loss + kl_loss
            
        # compute the gradients
        grads = tape.gradient(total_loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        self.total_loss_tracker.update_state(total_loss)
        self.reconstruction_loss_tracker.update_state(reconstruction_loss)
        self.kl_loss_tracker.update_state(kl_loss)
        return {
            "loss": self.total_loss_tracker.result(),
            "reconstruction_loss": self.reconstruction_loss_tracker.result(),
            "kl_loss": self.kl_loss_tracker.result(),
        }
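
For reference, the quantity minimised in `train_step` can be written out as follows. With encoder outputs $\mu_j$ (`z_mean`) and $\log \sigma_j^2$ (`z_log_var`) for each of the $d$ latent dimensions (`latent_dim`), the KL term is the closed-form divergence between a diagonal Gaussian and the standard normal prior, and the reconstruction term is the mean squared error over the $D$ input features. Both terms are then averaged over the batch. The symbols $d$, $D$, $x$ and $\hat{x}$ are notation introduced here for illustration.

```latex
\mathcal{L}_{\mathrm{KL}} = -\frac{1}{2} \sum_{j=1}^{d} \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right),
\qquad
\mathcal{L}_{\mathrm{rec}} = \frac{1}{D} \sum_{i=1}^{D} \left( x_i - \hat{x}_i \right)^2,
\qquad
\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \mathcal{L}_{\mathrm{KL}}
```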

In [None]:
# Display the projection of 'data' in the latent space

def plot_label_clusters(vae, data, titre=None): #, labels):
    """Plot the 2D projection of data in the latent space of vae.

    Args:
        vae: VAE model
        data: 2D np.array of the points to display
        titre (str, optional): figure title

    Returns:
        fig, ax
    """
    
    LIM = 3.0
    BANDWIDTH = 0.20 # bandwidth of the Gaussian kernel for the KDE
    NUM = 50 # number of grid points per axis for the contour plot
    
    fig, ax = plt.subplots(figsize=(6,6))
    
    # project data onto the latent space
    z_mean, _, _ = vae.encoder.predict(data, verbose=0)

    # fit a KDE to draw density contours
    kde = KernelDensity(kernel='gaussian',bandwidth=BANDWIDTH).fit(z_mean)
    
    # evaluate the density on a NUM x NUM grid and draw the contours
    levels = [ 0.10, 0.25, 0.50, 0.75, 0.90 ]
    x = np.linspace( -LIM, +LIM, num=NUM)
    y = np.linspace( -LIM, +LIM, num=NUM)
    X, Y = np.meshgrid(x, y) # X and Y have shape NUM x NUM
    grid = np.column_stack([X.ravel(), Y.ravel()]) # one (x, y) point per row
    Z = np.exp(kde.score_samples(grid)).reshape(X.shape)  # score_samples returns log-densities; convert back to densities
    cs = ax.contour(X,Y,Z,levels)
    ax.clabel(cs, inline=True, fontsize=10)  # label each contour with its density level
    
    # scatter plot of the points projected in the latent space
    ax.scatter(z_mean[:, 0], z_mean[:, 1], marker='.') 
    ax.set_xlabel("z[0]")
    ax.set_xlim(left=-LIM, right=+LIM)
    ax.set_ylabel("z[1]")
    ax.set_ylim(bottom=-LIM, top=+LIM)
    
    if titre is None:
        titre = 'VAE latent space'
    ax.set_title(titre)
    ax.grid(True)
        
    return fig, ax

In [None]:
# instantiate the model, ready for training

vae = VAE(encoder, decoder)

vae.compile(optimizer=keras.optimizers.Adam())

In [None]:
# Parameters for sequential training

# ---- batch -----
BATCH_SIZE = 2000 # number of lines read per training round

# ---- choose the file
dirpath = '/home/benjamin/Folders_Python/Cyber/data/LALN_processed/'
filename = 'wls_day-01_subset_100000.json'
filepath = dirpath + filename

# ---- model outputs
list_outputs = ['kl_loss', 'loss', 'reconstruction_loss']

# --- to make a video
video_rep = '/home/benjamin/Folders_Python/Cyber/data/jpg_for_videos/'

In [None]:
%%time

# Training loop

# --- announce what is being processed
print(f"Processing {filename}\n")

# --- loop
with open(file=filepath) as f:
    
    ctr_lines_total = 0
    ctr_batch = 1
    line = f.readline()
    
    while line != "":
        print(f"Processing batch {ctr_batch} of {BATCH_SIZE} lines")
        ctr_lines_in_batch = 0
        
        # --- initialisation for the batch 
        # 1st DataFrame: raw values (DEBUG ONLY)
        # dict_for_df_raw = { field : [] for field in list_fields }
        
        # 2nd DataFrame: encoded values
        dict_for_df = { field : [] for field in list_fields }
        
        # -- read one batch and build a DataFrame
        while (line != "") and (ctr_lines_in_batch<BATCH_SIZE):
            ctr_lines_in_batch += 1
            
            # - report progress
            print(f"processing line {ctr_lines_in_batch} of batch {ctr_batch}", end="\r", flush=True)
            obj_json = json.loads(line)
            
            # - build UserNameDomainName by hand; left missing (None) if either part is absent
            user, domain = obj_json.get('UserName'), obj_json.get('DomainName')
            if user is not None and domain is not None:
                obj_json['UserNameDomainName'] = user + domain
            
            # - get each field's value (possibly None) and translate it to its categorical code via dico
            for field in list_fields:
                val = obj_json.get(field, None)
                # dict_for_df_raw[field].append(val)
                dict_for_df[field].append(dico[field].get_field_categorical_value(val))
                
            # - next line
            line = f.readline()
            
        ctr_lines_total += ctr_lines_in_batch
        df = pd.DataFrame(dict_for_df)
        
        # -- train the VAE on the new batch
        points = df.to_numpy()
        print(f'\ntraining VAE on batch {ctr_batch} --------------')
        
        callback = keras.callbacks.EarlyStopping(    # NB: the callback must be instantiated for each training run
            monitor="loss",
            patience=10,
            restore_best_weights=True,
        )
        
        # scaling for VAE convergence. But a priori it differs from batch to batch! Hmm.
        s = StandardScaler()
        points_red = s.fit_transform(points)
        
        history = vae.fit(points_red, epochs=1000, batch_size=32, callbacks=[callback], verbose=0)
        for o in list_outputs:
            val = history.history.get(o)[-1]
            print(f'{o} : {val:.2f}')
        
        # -- display the projection of the batch in the latent space 
        titre = f'VAE latent space - batch {ctr_batch}'
        fig, ax = plot_label_clusters(vae, points_red, titre)
        savefig = video_rep + filename + f'_batch_{ctr_batch}.png'
        plt.savefig(savefig)
        plt.show()
        
        # -- next batch (the inner loop has already read its first line)
        ctr_batch += 1
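
One possible way to address the per-batch scaling concern noted in the loop above is to update a single `StandardScaler` incrementally with `partial_fit`, so that each batch is transformed with statistics accumulated over all the data seen so far. This is a sketch on synthetic data, not a tested replacement for the loop: the random batches and their shape (2000 rows, 22 features) are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
scaler = StandardScaler()  # one scaler shared by all batches

for batch_id in range(3):
    # stand-in for one batch of encoded rows (2000 rows, 22 features)
    points = rng.integers(0, 50, size=(2000, 22)).astype(float)
    scaler.partial_fit(points)             # update the running mean / variance
    points_red = scaler.transform(points)  # scale with the accumulated statistics
    print(batch_id, scaler.n_samples_seen_, points_red.shape)
```

The first batches are still scaled with less accurate statistics than later ones; a preliminary pass over the file calling only `partial_fit` would avoid that, at the cost of reading the data twice.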
        

In [None]:
# to make a video:

# ffmpeg -r 1  -f image2 -s 640x640 -i wls_day-01_subset_nnnn.json_batch_%d.png -vcodec libx264 -crf 15 -pix_fmt yuv420p video.mp4

# -r : frames per second

# Variational Bayesian Gaussian Mixture

In [None]:
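N = len(df)  # number of data points (N is not defined elsewhere in the notebook; assumed here to be the size of df)
N_RESPONSIBILITIES_MAX = min(10,int(N / 100))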
print(f'Max = {N_RESPONSIBILITIES_MAX} components')

In [None]:
dpbgm = BayesianGaussianMixture(
    n_components = N_RESPONSIBILITIES_MAX, # max number of components; the effective number is inferred from the data
    weight_concentration_prior_type = 'dirichlet_process',   # weight concentration prior is a Dirichlet process: (approximately) infinite number of components
    random_state = 42,
    # reg_covar = 1e-6,
    verbose = 3,
    max_iter = 1000,
)

In [None]:
dpbgm.fit(df.to_numpy())

In [None]:
comp_number = list(range(N_RESPONSIBILITIES_MAX))
weights_sorted = sorted(dpbgm.weights_, reverse=True)  # mixture weights, largest first

fig, ax = plt.subplots(figsize=(20,6))
ax.bar(comp_number, weights_sorted)
ax.set_xlabel('component rank (sorted by decreasing weight)')
ax.set_ylabel('component weight')
ax.set_yscale("log")
ax.set_title('Bayesian GMM component weights')
plt.show()

In [None]:
# get responsibilities per point
probas = dpbgm.predict_proba(df.to_numpy())

# get predicted label, and associated responsibility, per data point
labels = np.argmax(probas, axis=1)
certainty = np.max(probas, axis=1)

In [None]:
# 3D display of Bayesian GMM - projected points on PCA manifold
# if CHOICE=='lineaire':
#     pca = PCA(n_components=3)
# else:
#     pca = KernelPCA(n_components=3, kernel='rbf', gamma=5.0)
    
# X_red = pca.fit_transform(df)

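# NB: pca3d is not defined in this notebook; it is assumed to be a 3-component PCA fitted beforehand (see the commented code above)
X_red = pca3d.transform(df)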

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(projection='3d')

# invoke color map
cmap = plt.cm.viridis

# create normalizing object to map labels values into color map
vmin = np.min(certainty)
vmax = np.max(certainty)
norm = matplotlib.colors.Normalize(vmin=vmin, vmax=vmax, clip=False)

# draw the 3D scatter plot
p = ax.scatter(X_red[:,0], X_red[:,1], X_red[:,2], c=cmap(norm(certainty)), marker='o') # 
ax.set_title(f'Bayesian GMM responsibility of the assigned Gaussian, \nmixture of at most {N_RESPONSIBILITIES_MAX} Gaussians,\n 3D PCA projection - {N} points')
ax.set_xlabel('vp_1')
ax.set_ylabel('vp_2')
ax.set_zlabel('vp_3')

# display the color bar; the points were given explicit RGBA colors, so the color bar is built from the colormap and norm
sm = plt.cm.ScalarMappable(cmap=cmap, norm=norm)
fig.colorbar(sm, ax=ax)

plt.show()

In [None]:
# 3D display of Bayesian GMM - projected points on PCA manifold
# if CHOICE=='lineaire':
#     pca = PCA(n_components=3)
# else:
#     pca = KernelPCA(n_components=3, kernel='rbf', gamma=5.0)
    
# X_red = pca.fit_transform(df)

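# NB: pca3d is not defined in this notebook; it is assumed to be a 3-component PCA fitted beforehand (see the commented code above)
X_red = pca3d.transform(df)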

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(projection='3d')

# invoke color map
cmap = plt.cm.viridis

# draw the 3D scatter plot; colors are mapped from the integer labels through the color map
p = ax.scatter(X_red[:,0], X_red[:,1], X_red[:,2], c=labels, cmap=cmap, marker='o') # 
ax.set_title(f'Bayesian GMM label of the assigned Gaussian, \nmixture of at most {N_RESPONSIBILITIES_MAX} Gaussians,\n 3D PCA projection - {N} points')
ax.set_xlabel('vp_1')
ax.set_ylabel('vp_2')
ax.set_zlabel('vp_3')

# display the color bar
fig.colorbar(p, ax=ax)

plt.show()

# One Class SVM

In [None]:
latent_points = vae.encoder.predict(s.transform(points))[0]  # retrieve the z_mean values in the 2D latent space

In [None]:
latent_points

In [None]:
clf = OneClassSVM(kernel='rbf', gamma='scale').fit(latent_points)  # classifier in the 2D latent space

In [None]:
from sklearn.inspection import DecisionBoundaryDisplay

# fig, ax = display_dataset(points, labels, mus, covs, sigma_max)

fig, ax = plt.subplots(figsize=(6,6))

DecisionBoundaryDisplay.from_estimator(
    estimator=clf,
    X=latent_points,
    response_method="predict",
    plot_method="contour",
    levels=[0,1,2,3,4,5],
    ax=ax
)

ax.scatter(latent_points[:,0], latent_points[:,1], marker='.')

ax.set_title('OC SVM')
ax.grid(True)

plt.show()