# NN matrix trainer
This notebook attempts to compute the situational relevance with a neural network. The network consists of a single Dense input layer that maps to a single output node.

First, a number of libraries are imported.

In [1]:
from datetime import datetime
from dateutil.parser import parse
import json

import numpy as np
import pandas as pd

In [2]:
# remove the future warnings
import warnings
warnings.filterwarnings("ignore", message=r"Passing", category=FutureWarning)

### Importing the classes and functions used for the comparisons
First, the Observation class is defined, because the functions that compare two observations operate on instances of this class.

In [3]:
class Observation:
    """An object for storing an observation.
    """
    def __init__(self, serie: str, prd_begin: datetime, prd_end: datetime, pattern: str, sector: str, indexx: str, perc: float, absp: float, obsrv: str, rlvnc: int, m_data: dict, oid: int = None):
        """The init method.

        Args:
            serie (str): The name of the main component
            prd_begin (datetime): The beginning of the period of the observation
            prd_end (datetime): The end of the period of the observation
            pattern (str): The name of the pattern
            sector (str): The sector corresponding to the main component
            indexx (str): The indexx of the component
            perc (float): The percentage change of the observation
            absp (float): The absolute change of the observation
            obsrv (str): The observation string
            rlvnc (int): The original relevance score
            m_data (dict): Extra meta data
            oid (int, optional): The id of the observation in the database. Defaults to None.
        """
        self.observ_id = oid
        self.serie = serie
        self.period_begin = prd_begin
        self.period_end = prd_end

        # extra data about the period
        self.year = self.period_end.year
        self.month_number = self.period_end.month
        self.week_number = self.period_end.isocalendar()[1]
        self.day_number = self.period_end.day

        # base information
        self.pattern = pattern
        self.sector = sector
        self.indexx = indexx
        self.observation = obsrv
        self.perc_change = perc
        self.abs_change = absp

        # base relevance and situational relevance
        self.relevance1 = rlvnc
        self.relevance2 = rlvnc

        self.meta_data = m_data

    def __str__(self):
        return self.observation


Below are a number of functions used to determine the relationship between two observations.

In [4]:
def check_pattern(observ1, observ2):
    """Checks if the two given observations have the same/overlapping, similar or no shared patterns.

    Args:
        observ1 (NLGengine.observation.Observation): The first observation to be compared
        observ2 (NLGengine.observation.Observation): The second observation to be compared

    Returns:
        int: The corresponding index for the weights dictionary
    """
    pattern_set = {observ1.pattern, observ2.pattern}

    if len(pattern_set) == 1:
        # the two patterns are the same
        indexx = 0
    elif ("combi-daling" in pattern_set) and ("individu-daling" in pattern_set):
        # the two patterns are similar but not the same
        indexx = 1
    elif ("combi-stijging" in pattern_set) and ("individu-stijging" in pattern_set):
        # the two patterns are similar but not the same
        indexx = 1
    else:
        # patterns neither the same nor similar
        indexx = 2

    return indexx


def check_period(observ1, observ2):
    """Checks if the two given observations have identical, overlapping, next or no shared periods.

    Args:
        observ1 (NLGengine.observation.Observation): The first observation to be compared
        observ2 (NLGengine.observation.Observation): The second observation to be compared

    Returns:
        int: The corresponding index for the weights dictionary
    """
    if (observ1.period_begin == observ2.period_begin) and (observ1.period_end == observ2.period_end):
        # the two observations are periodically identical
        indexx = 0
    elif has_overlap(observ1.period_begin, observ1.period_end, observ2.period_begin, observ2.period_end):
        # the two observations are overlapping
        indexx = 1
    elif (np.busday_count(observ1.period_end.date(), observ2.period_begin.date()) == 1) or (np.busday_count(observ2.period_end.date(), observ1.period_begin.date()) == 1):
        # the two observations are after each other (next)
        # so the start of observation 2 is 1 day after the end of observation 1 (minus the weekends) or vice versa.
        indexx = 2
    else:
        # the two observations are different
        indexx = 3

    return indexx


def check_component(observ1, observ2):
    """Checks if the two given observations have the same/overlapping, similar (same sector) or no shared components.

    Args:
        observ1 (NLGengine.observation.Observation): The first observation to be compared
        observ2 (NLGengine.observation.Observation): The second observation to be compared

    Returns:
        int: The corresponding index for the weights dictionary
    """
    if observ1.meta_data.get("components") and observ2.meta_data.get("components"):
        # observations are both combi patterns and hold multiple components
        if any(i in observ1.meta_data.get("components") for i in observ2.meta_data.get("components")):
            # both observations have one or more overlapping component(s)
            indexx = 0
        elif any(i in observ1.meta_data.get("sectors") for i in observ2.meta_data.get("sectors")):
            # both observations have one or more similar components(s)
            indexx = 1
        else:
            # both observations don't have overlapping or similar component(s)
            indexx = 2

    elif observ1.meta_data.get("components"):
        # observation 1 has multiple components
        if observ2.serie in observ1.meta_data.get("components"):
            # both observations have one or more overlapping component(s)
            indexx = 0
        elif observ2.sector in observ1.meta_data.get("sectors"):
            # both observations have one or more similar component(s)
            indexx = 1
        else:
            # both observations don't have overlapping or similar component(s)
            indexx = 2

    elif observ2.meta_data.get("components"):
        # observation 2 has multiple components
        if observ1.serie in observ2.meta_data.get("components"):
            # both observations have one or more overlapping component(s)
            indexx = 0
        elif observ1.sector in observ2.meta_data.get("sectors"):
            # both observations have one or more similar component(s)
            indexx = 1
        else:
            # both observations don't have overlapping or similar component(s)
            indexx = 2

    else:
        # neither of the observations has multiple components
        if observ1.serie == observ2.serie:
            # both observations have the same component
            indexx = 0
        elif observ1.sector == observ2.sector:
            # both observations have similar components
            indexx = 1
        else:
            # the observations share neither a component nor a sector
            indexx = 2

    return indexx


def has_overlap(A_start: datetime, A_end: datetime, B_start: datetime, B_end: datetime):
    """Checks if two periods have an overlap.
    https://stackoverflow.com/questions/3721249/python-date-interval-intersection

    Args:
        A_start (datetime): The period_begin datetime of the first observation
        A_end (datetime): The period_end datetime of the first observation
        B_start (datetime): The period_begin datetime of the second observation
        B_end (datetime): The period_end datetime of the second observation

    Returns:
        bool: Returns True if the two periods overlap
    """
    assert A_start <= A_end, "the start datetime is later than the end datetime"
    assert B_start <= B_end, "the start datetime is later than the end datetime"

    latest_start = max(A_start, B_start)
    earliest_end = min(A_end, B_end)
    return latest_start <= earliest_end
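
As a quick illustration of the period checks above, the following sketch reimplements the closed-interval overlap test and the business-day adjacency test on a few hand-picked dates. The dates and the helper name `is_next` are illustrative assumptions and not part of the notebook.

```python
from datetime import datetime

import numpy as np


def has_overlap(a_start, a_end, b_start, b_end):
    # Closed intervals overlap when the latest start is not after the earliest end.
    return max(a_start, b_start) <= min(a_end, b_end)


def is_next(first_end, second_begin):
    # Hypothetical helper: True when exactly one business day lies in
    # [first_end, second_begin), i.e. the second period starts on the
    # business day right after the first one ends.
    return np.busday_count(first_end.date(), second_begin.date()) == 1


friday = datetime(2021, 1, 8)
monday = datetime(2021, 1, 11)
tuesday = datetime(2021, 1, 12)

# Periods that share an end/start day count as overlapping.
touching = has_overlap(datetime(2021, 1, 4), friday, friday, monday)
# Friday -> Monday skips the weekend, so Monday counts as "next".
weekend_gap = is_next(friday, monday)
# Friday -> Tuesday spans two business days, so it is not "next".
two_day_gap = is_next(friday, tuesday)

print(touching, weekend_gap, two_day_gap)
```

Because the intervals are closed and `check_period` tests for overlap first, the "next" branch is only reached when the two periods do not touch.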

### Loading the data
First, the data is loaded from the 'test_cases.json' file and all observation dicts are converted into instances of the Observation class.

In [5]:
with open(r'test_cases.json') as f:
    data = json.load(f)

test_cases = data.get("test_cases")
observations = data.get("observations")

# convert all the observation dictionaries to objects
test_observations = {}
for key in observations:
    info = observations.get(key)

    # format the json observation into an Observation instance
    test_observations[key] = Observation(info.get("serie"),
                                    parse(info.get("period_begin")),
                                    parse(info.get("period_end")),
                                    info.get("pattern"),
                                    info.get("sector"),
                                    info.get("indexx"),
                                    info.get("perc_change"),
                                    info.get("abs_change"),
                                    info.get("observation"),
                                    info.get("relevance"),
                                    info.get("meta_data"),
                                    oid=int(key))
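
The loading code implies a particular layout for 'test_cases.json'. The sketch below shows one layout that would satisfy it. Only the key names and the pattern name come from the notebook; every value is a made-up placeholder, and the ISO date format is an assumption.

```python
import json

# Hypothetical example of the layout that the loading code expects.
# Only the key names are taken from the notebook; every value is made up.
example = {
    "observations": {
        "1": {
            "serie": "AAA",
            "period_begin": "2021-01-04",
            "period_end": "2021-01-08",
            "pattern": "individu-stijging",
            "sector": "Tech",
            "indexx": "XYZ",
            "perc_change": 1.5,
            "abs_change": 0.75,
            "observation": "AAA rose by 1.5%.",
            "relevance": 5,
            "meta_data": {},
        },
    },
    "test_cases": [
        {"prev_observ": 1, "new_observ": 1, "score": -2.0},
    ],
}

# Round-trip through JSON to confirm the structure is serializable.
data = json.loads(json.dumps(example))

for case in data["test_cases"]:
    # Observation ids are stored as string keys, hence the str() lookup.
    prev = data["observations"][str(case["prev_observ"])]
    print(prev["serie"], case["score"])
```

The observation keys have to be integer-like strings, since the notebook converts them with `int(key)` and later looks them up with `str(...)`.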

Next, every possible option is one-hot encoded, so that each class has its own unique value.

In [6]:
formatted_obs = []

for case in test_cases:
    
    # 10 one-hot slots: 3 pattern, 4 period and 3 component classes
    new_case = [0] * 10
    new_case.append(case.get("score"))

    # getting the observations
    obs1 = test_observations.get(str(case.get("prev_observ")))
    obs2 = test_observations.get(str(case.get("new_observ")))
    
    # finding the similarities between the observations
    pattern_index = check_pattern(obs1, obs2)
    period_index = check_period(obs1, obs2) + 3
    comp_index = check_component(obs1, obs2) + 7
    
    # applying the one-hot encoding
    new_case[pattern_index] = 1
    new_case[period_index] = 1
    new_case[comp_index] = 1
    
    formatted_obs.append(new_case)
    

# turning it into a dataframe
columns = ['zh', 'zv', 'zo', 'pi', 'pov', 'pop', 'pa', 'sh', 'sv', 'sa', 'score']
df_observations = pd.DataFrame(formatted_obs, columns=columns)
df_observations.head()

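|   | zh | zv | zo | pi | pov | pop | pa | sh | sv | sa | score |
|---|----|----|----|----|-----|-----|----|----|----|----|-------|
| 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | -2.0 |
| 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0.5 |
| 2 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1.5 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | -1.0 |
| 4 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | -2.0 |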
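
The slot layout used in the loop can be sketched in isolation. The function name `encode` is a hypothetical helper; the offsets 3 and 7 are the ones used in the cell above. The example reproduces the feature part of the first row of the output.

```python
def encode(pattern_index, period_index, comp_index):
    """Builds the 10-slot one-hot vector from the three class indices."""
    vector = [0] * 10
    vector[pattern_index] = 1       # slots 0-2: pattern classes
    vector[period_index + 3] = 1    # slots 3-6: period classes
    vector[comp_index + 7] = 1      # slots 7-9: component classes
    return vector


# Similar pattern (1), identical period (0), same component (0)
row = encode(1, 0, 0)
print(row)  # [0, 1, 0, 1, 0, 0, 0, 1, 0, 0]
```

Every encoded case therefore contains exactly three ones, one per block.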


Next, X and y are defined, after which a split produces a training set and a test set.

In [7]:
# defining X and y
X = df_observations.drop(columns=['score'])
y = df_observations['score']

# the scores lie between -2 and 2, so dividing by 2 normalizes them to the range [-1, 1]
y = y / 2

# splitting the datasets into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
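
Because the targets are halved before training, anything the model predicts is on the [-1, 1] scale and has to be doubled to get back to the original score range. A minimal sketch of that round trip follows; the helper names are hypothetical, and the sample scores are the ones visible in the dataframe preview.

```python
def normalize_score(score):
    # Map a score from the original [-2, 2] range to [-1, 1].
    return score / 2


def denormalize_score(prediction):
    # Map a model output from [-1, 1] back to the original [-2, 2] range.
    return prediction * 2


original = [-2.0, 0.5, 1.5, -1.0]
scaled = [normalize_score(s) for s in original]
restored = [denormalize_score(s) for s in scaled]

print(scaled)    # [-1.0, 0.25, 0.75, -0.5]
print(restored)  # [-2.0, 0.5, 1.5, -1.0]
```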

### Setting up a neural network
Since

In [8]:
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras import backend as K

# neural network
model = Sequential()
model.add(Dense(4, input_dim=10, activation=keras.activations.tanh))
model.add(Dense(1, activation=keras.activations.tanh))

Using TensorFlow backend.
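
To make explicit what this architecture computes, here is a plain NumPy sketch of the same forward pass: 10 inputs, a tanh layer of 4 units, and a single tanh output. The weights are random stand-ins for the trained ones, so the printed prediction is meaningless; only the shapes and the parameter count carry over.

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly initialised weights stand in for the trained ones (assumption).
W1, b1 = rng.normal(size=(10, 4)), np.zeros(4)   # Dense(4, input_dim=10)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)    # Dense(1)


def forward(x):
    hidden = np.tanh(x @ W1 + b1)      # hidden layer with tanh activation
    return np.tanh(hidden @ W2 + b2)   # output squashed by tanh


# One encoded case: similar pattern, identical period, same component.
x = np.array([[0, 1, 0, 1, 0, 0, 0, 1, 0, 0]], dtype=float)
y_hat = forward(x)

n_params = W1.size + b1.size + W2.size + b2.size
print(y_hat.shape, n_params)  # (1, 1) 49
```

The tanh output keeps predictions between -1 and 1, which matches the range of the normalized scores.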


In [9]:
# the target is a continuous score, so track the mean absolute error instead of accuracy
model.compile(loss='mse', optimizer='adam', metrics=['mae'])

In [10]:
epochs = 300
batch_size = 32

# training the network
history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size)


Epoch 1/300
...
Epoch 77/300
[output truncated]

## Evaluation
Next, the mean squared error is computed to test the predictions.

In [11]:
def score(X: list, y: list):
    """Returns the mean squared error between the predictions and the expected values.

    Args:
        X (list): A list with all the predicted values
        y (list): A list with all the expected values

    Returns:
        float: The mean squared error
    """
    assert len(X) == len(y), "The two lists do not have the same length"

    mean = np.mean([(a - b) ** 2 for a, b in zip(X, y)])
    return mean

In [13]:
y_pred = model.predict(X_test)

print(f"Mean squared error on the test set: {score(y_pred.flatten(), y_test)}")

Mean squared error on the test set: 0.20758732807558541
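
An MSE value is hard to judge on its own. One common way to put it in context is to compare it against a trivial baseline that always predicts the mean of the training targets. The sketch below shows the idea on hypothetical normalized targets; the arrays are illustrative and are not the notebook's data.

```python
import numpy as np


def mse(predicted, expected):
    # Vectorised mean squared error.
    predicted = np.asarray(predicted, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.mean((predicted - expected) ** 2))


# Hypothetical normalised targets, for illustration only.
y_train_demo = np.array([-1.0, 0.25, 0.75, -0.5])
y_test_demo = np.array([-1.0, 0.5])

# Baseline: always predict the mean of the training targets.
baseline_pred = np.full_like(y_test_demo, y_train_demo.mean())
baseline_mse = mse(baseline_pred, y_test_demo)

print(baseline_mse)  # 0.578125
```

Applied to the notebook, the same comparison would use `y_train.mean()` as the constant prediction and `y_test` as the expected values; the network only adds value if its MSE is clearly below that baseline.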


### Saving the model

In [16]:
save_to_NLGlib = False

In [14]:
# serialize model to JSON
model_json = model.to_json()
with open("deter_model.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model.save_weights("deter_model.h5")
print("Saved model to notebook repository")

Saved model to notebook repository


In [15]:
if save_to_NLGlib:
    # serialize model to JSON
    model_json = model.to_json()
    with open("../NLGengine/content_determination/deter_model.json", "w") as json_file:
        json_file.write(model_json)
    # serialize weights to HDF5
    model.save_weights("../NLGengine/content_determination/deter_model.h5")
    print("Saved model to NLGengine file")

Saved model to NLGengine file


## Sources
- https://towardsdatascience.com/building-our-first-neural-network-in-keras-bdc8abbc17f5
- https://machinelearningmastery.com/save-load-keras-deep-learning-models/
- https://keras.io/api/layers/activations/#tanh-function