#**Discourse-Level Sentiment Analyzer for Cybercrime (The Scammer)**

###- This project presents a study on developing a Discourse-Level Sentiment Analyzer (DLSA) designed to detect cyber scams that use pressure and coercive language.
###- It extends the research of Kraus and Feuerriegel (2019) by applying and evaluating their discourse-aware sentiment analysis approach in the context of scams.
###- The research addresses a limitation of conventional sentiment analysis tools, which neglect the semantic coherence of textual data.
###- By incorporating Rhetorical Structure Theory (RST), the study aims to identify and analyze the discourse-level sentiment of scammers, particularly their use of pressure and coercive language in digital communications.


## **Reference**
### Kraus, M., & Feuerriegel, S. (2019). Sentiment analysis based on rhetorical structure theory: Learning deep neural networks from discourse trees. Expert Systems with Applications, 118, 65-79. https://doi.org/10.1016/j.eswa.2018.10.002

##**Data Collection**
### Two types of data are required in this project: scam texts that use pressure and coercion, and texts that are not pressure and coercion scams (for example, ordinary spam).
### **Scam Data:** Sample scam texts were collected from the internet, specifically by transcribing text from Google Images results into Excel files. Because few examples of scams that use pressure and coercion could be found, the project used an LLM to generate additional data.
### **Non-Scam Data:** Data similar to that encountered in pressure and coercion scams were collected from Kaggle, as part of the [Spam/Ham Detection using LSTM](https://www.kaggle.com/code/gadaadhaarigeek/spam-ham-detection-using-lstm/input) notebook. After acquisition, the dataset was filtered to isolate the spam-related content.
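
The filtering step might look like the following minimal sketch. The column names `label` and `text`, the `spam` label value and the sample rows are assumptions made for illustration; the actual Kaggle file may use different names.

```python
import csv
import io

# Hypothetical sample in spam/ham CSV form; the real file's columns may differ.
SAMPLE_CSV = """label,text
ham,See you at lunch tomorrow
spam,URGENT! Your account will be closed unless you reply now
ham,Thanks for the update
spam,You have won a prize. Claim within 24 hours
"""


def filter_spam(csv_text, label_column="label", text_column="text", spam_value="spam"):
    """Return the text of every row whose label equals spam_value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row[text_column]
        for row in reader
        if row[label_column].strip().lower() == spam_value
    ]


spam_texts = filter_spam(SAMPLE_CSV)
print(len(spam_texts), "spam rows kept")
```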



##**Pressure and Coercion Lexicon**
### A lexicon of pressure and coercion cues is needed to complement SentiWordNet 3.0 when assigning polarity scores to sentences and discourse units. The development steps are as follows:

###1.   Identifying words/cues related to pressure and coercion present in the dataset
###2.   Developing a new lexicon, called ScamWordNet
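
One way to picture how a custom lexicon can complement a general-purpose one is a two-level lookup: consult the scam lexicon first and fall back to the general lexicon otherwise. The sketch below uses two small hypothetical dictionaries; the entries and scores are invented for illustration and are not taken from ScamWordNet or SentiWordNet.

```python
# Hypothetical lexicon entries: word -> polarity in [-1, 1]. Values are illustrative only.
SCAM_LEXICON = {"immediately": -0.6, "suspend": -0.7, "arrest": -0.8}
GENERAL_LEXICON = {"good": 0.6, "bad": -0.6, "arrest": -0.3}


def polarity(word, scam_lexicon=SCAM_LEXICON, general_lexicon=GENERAL_LEXICON):
    """Look the word up in the scam lexicon first, then in the general lexicon."""
    word = word.lower()
    if word in scam_lexicon:
        return scam_lexicon[word]
    return general_lexicon.get(word, 0.0)


def sentence_polarity(sentence):
    """Average the polarity of the words that carry a non-zero score."""
    scores = [polarity(w) for w in sentence.split()]
    scored = [s for s in scores if s != 0.0]
    return sum(scored) / len(scored) if scored else 0.0


print(sentence_polarity("pay immediately or face arrest"))
```

In this sketch the scam-specific score for "arrest" takes precedence over the general one, which is the effect a domain lexicon is meant to have.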



##**Text Preprocessing**
### The raw dataset underwent text preprocessing steps such as:
*   Tokenization
*   Sentence Split
*   Part of Speech Tagging
*   Lemmatization
*   Named Entity Recognition

###This was done using [CoreNLP](https://stanfordnlp.github.io/CoreNLP/index.html), an open-source library of natural language analysis tools.



##**Discourse Parsing**
###The preprocessed dataset was converted into discourse units using a pre-built [RST parser](https://github.com/jiyfeng/DPLP/tree/master) developed by Ji et al. (2014).

## **Initialization**
### Load data into environment and import required libraries.

In [None]:
import zipfile


with zipfile.ZipFile('dev.zip', 'r') as zip_ref:
    zip_ref.extractall('/content/')

with zipfile.ZipFile('test.zip', 'r') as zip_ref:
    zip_ref.extractall('/content/')

with zipfile.ZipFile('train.zip', 'r') as zip_ref:
    zip_ref.extractall('/content/')

with zipfile.ZipFile('vocab.zip', 'r') as zip_ref:
    zip_ref.extractall('/content/')


In [None]:
!pip install dgl
!pip install networkx
!pip install numpy
!pip install torch
!pip install nltk
!pip install gensim
!pip install matplotlib

Collecting dgl
  Downloading dgl-1.1.3-cp310-cp310-manylinux1_x86_64.whl (6.5 MB)
Installing collected packages: dgl
Successfully installed dgl-1.1.3


In [None]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('sentiwordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package sentiwordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/sentiwordnet.zip.


True

##**Discourse LSTM Algorithm**

###- Discourse-LSTM is a model proposed by Kraus and Feuerriegel (2019).
###- Discourse-LSTM takes into account the relations between discourse units and their hierarchy types, which are nucleus and satellite.
###- The nucleus is the main part of the discourse, while the satellite elaborates on the nucleus.
###- The paper states that the authors implemented this model, but their implementation is not publicly accessible, so the implementation here is based on the concept presented in the paper.
###- The main difference between Discourse-LSTM and Tree-LSTM is that Discourse-LSTM replaces the global LSTM that serves all nodes with one that depends on the relation type.
###- This implementation takes into account 15 relation types and 2 hierarchy types.
###- The update gate, forget gate and bias were each modified to have 15 linear layers, but only the layer corresponding to the relation type is chosen for each discourse unit.
###- Furthermore, the hidden state and memory state of each discourse unit are weighted according to their hierarchy type.
###- The rest of the implementation is mostly similar to Tree-LSTM:

*   The model propagates top-down or bottom-up from node to node, where each node represents a discourse unit.
*   At each node, the model computes the forget gate while taking into account the hidden state of its child.
*   The node then computes its input, output and update gates, followed by its hidden and memory states, before passing them to its parent.
*   After passing through every node, the model outputs the logits, which are the result of applying the final linear layer to the hidden states.
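
Read off the cell code in this notebook, the update for a node $j$ with children $C(j)$ can be summarised as below. This is a sketch of the implementation here, not a restatement of the paper's equations. The symbols $r_k$ and $t_k$ denote the relation type and hierarchy type (nucleus or satellite) of unit $k$, and the bias terms inside the affine maps $W^{(h)}$ and $W^{(c)}$ are omitted for brevity.

```latex
\begin{align*}
\tilde{h}_j &= \sum_{k \in C(j)} W^{(h)}_{t_k} h_k \\
f_{jk} &= \sigma\!\left( U^{(f)}_{r_k} W^{(h)}_{t_k} h_k + b^{(f)}_{r_k} \right) \\
a_j &=
  \begin{cases}
    W^{(iou)} x_j & \text{if } C(j) = \emptyset \\
    U^{(iou)}_{r_j} \tilde{h}_j & \text{otherwise}
  \end{cases} \\
(i_j,\; o_j,\; u_j) &= \left( \sigma,\; \sigma,\; \tanh \right) \text{ applied to the three chunks of } a_j + b^{(iou)}_{r_j} \\
c_j &= i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot W^{(c)}_{t_k} c_k \\
h_j &= o_j \odot \tanh(c_j)
\end{align*}
```

The relation type selects which $U^{(iou)}$, $b^{(iou)}$ and $U^{(f)}$ are used, while the hierarchy type selects which $W^{(h)}$ and $W^{(c)}$ weight the child's states.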


  

In [None]:

import itertools
import time
import dgl
import networkx as nx
import numpy as np
import torch as th
import torch.nn as nn
import torch.nn.functional as F


class ChildSumDiscourseLSTMCell(nn.Module):

    def __init__(self, x_size, h_size, num_relations):
        super(ChildSumDiscourseLSTMCell, self).__init__()

        self.W_iou = nn.Linear(x_size, 3 * h_size, bias=False)

        # One layer per relation type (15 by default). ModuleList/ParameterList are used
        # instead of plain Python lists so that the parameters are registered with the
        # module, moved by .to(device) and updated by the optimizer.
        self.U_iou = nn.ModuleList([nn.Linear(h_size, 3 * h_size, bias=False) for _ in range(num_relations)])
        self.b_iou = nn.ParameterList([nn.Parameter(th.zeros(1, 3 * h_size)) for _ in range(num_relations)])
        self.U_f = nn.ModuleList([nn.Linear(h_size, h_size) for _ in range(num_relations)])

        # Weight matrices for child hidden and memory states
        self.W_h = nn.ModuleList([nn.Linear(h_size, h_size) for _ in range(2)])  # 2 for nucleus and satellite
        self.W_c = nn.ModuleList([nn.Linear(h_size, h_size) for _ in range(2)])  # 2 for nucleus and satellite


    def message_func(self, edges):
      return {"h": edges.src["h"], "c": edges.src["c"],
              "hierarchy_type": edges.src["hierarchy_type"],
              "relation_type": edges.src["relation_type"]}


    def reduce_func(self, nodes):

        hierarchy_type = nodes.mailbox["hierarchy_type"]
        relation_type = nodes.mailbox["relation_type"]
        # Use the `th` alias, which is the name torch is imported under in this cell
        weighted_h = th.cat([self.W_h[hierarchy_type[i]](nodes.mailbox["h"][i]).unsqueeze(0) for i in range(hierarchy_type.size(0))], dim=0)
        weighted_c = th.cat([self.W_c[hierarchy_type[i]](nodes.mailbox["c"][i]).unsqueeze(0) for i in range(hierarchy_type.size(0))], dim=0)
        h_tild = th.sum(weighted_h, 1)
        f = th.sigmoid(th.cat([self.U_f[relation_type[i]](weighted_h[i]).unsqueeze(0) for i in range(relation_type.size(0))], dim=0))
        c = th.sum(f * weighted_c, 1)
        return {"iou": self.U_iou[nodes.data['relation_type']](h_tild), "c": c}


    def apply_node_func(self, nodes):

        relation_type = nodes.data["relation_type"]
        iou = nodes.data["iou"] + self.b_iou[relation_type]
        i, o, u = th.chunk(iou.squeeze(), chunks=3, dim=0)
        i, o, u = th.sigmoid(i), th.sigmoid(o), th.tanh(u)
        c = i * u + nodes.data["c"]
        h = o * th.tanh(c)
        return {"h": h, "c": c}


class DiscourseLSTM(nn.Module):
    def __init__(
        self,
        num_vocabs,
        x_size,
        h_size,
        num_classes,
        dropout,
        num_relations,
        cell_type="childsum",
        pretrained_emb=None,
    ):
        super(DiscourseLSTM, self).__init__()
        self.x_size = x_size
        self.embedding = nn.Embedding(num_vocabs, x_size)

        if pretrained_emb is not None:
            print("Using glove")
            self.embedding.weight.data.copy_(pretrained_emb)
            self.embedding.weight.requires_grad = True

        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(h_size, num_classes)
        self.cell_type = cell_type
        self.cell = ChildSumDiscourseLSTMCell(x_size, h_size, num_relations)


    def forward(self, batch, g, h, c):
        """Compute tree-lstm prediction given a batch.
        Parameters
        ----------
        batch : DLSTMBatch
            The data batch.
        g : dgl.DGLGraph
            Tree for computation.
        h : Tensor
            Initial hidden state.
        c : Tensor
            Initial cell state.
        Returns
        -------
        logits : Tensor
            The prediction of each node.
        """
        # feed embedding

        embeds = self.embedding((batch.wordid * batch.mask).long())

        g.ndata["iou"] = self.cell.W_iou(
            self.dropout(embeds)
        ) * batch.mask.float().unsqueeze(-1)

        g.ndata["h"] = h
        g.ndata["c"] = c

        g.ndata["relation_type"] = batch.relation_type
        g.ndata["hierarchy_type"] = batch.hierarchy_type

        # propagate
        dgl.prop_nodes_topo(
            g,
            self.cell.message_func,
            self.cell.reduce_func,
            apply_node_func=self.cell.apply_node_func,
        )

        # compute logits
        h_dropped = self.dropout(g.ndata.pop("h"))
        logits = self.linear(h_dropped)
        return logits

DGL backend not selected or invalid.  Assuming PyTorch for now.


Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable.  Valid options are: pytorch, mxnet, tensorflow (all lowercase)


##**Functions to combine .merge and .brackets**

###.merge and .brackets are files that resulted in the process of parsing by utilizing RST parser by (Ji et al., 2014). .merge files have text information while .brackets files have type, span and relation information. These information were combined to be further used to construct graphs
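
As a small illustration of the combination step, the sketch below parses one bracket entry and attaches the words that fall inside its span. The sample line follows the pattern expected by the parsing code in this notebook, `((start, end), 'type', 'relation')`; the sample words and the helper names `parse_bracket_line` and `attach_text` are hypothetical.

```python
import re

# Same shape as one `.brackets` entry: ((start, end), 'type', 'relation')
BRACKET_LINE = re.compile(r"\(\((\d+), (\d+)\), '(\w+)', '(\w+)'\)")


def parse_bracket_line(line):
    """Return span, node type and relation for one bracket entry, or None if malformed."""
    match = BRACKET_LINE.match(line.strip())
    if match is None:
        return None
    start, end, node_type, relation = match.groups()
    return {"span": (int(start), int(end)), "type": node_type, "relation": relation}


def attach_text(entry, unit_words):
    """Add the words of every discourse unit inside the entry's span.

    unit_words maps a discourse unit index to the list of words in that unit.
    """
    start, end = entry["span"]
    words = [w for idx in range(start, end + 1) for w in unit_words.get(idx, [])]
    return {**entry, "text": " ".join(words)}


# Hypothetical sample data
sample_line = "((1, 2), 'Nucleus', 'span')"
sample_words = {1: ["Pay", "now"], 2: ["or", "lose", "access"]}
combined = attach_text(parse_bracket_line(sample_line), sample_words)
print(combined)
```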

In [None]:
import os  # needed by combine() for path handling
import re

class Node:
    def __init__(self, node_type, span, relation, text=None):
        self.node_type = node_type
        self.span = span
        self.relation = relation
        self.text = text
        self.children = []

    def add_child(self, child_node):
        self.children.append(child_node)

    def is_within_span(self, span):
        if self.span is None:  # Root node
            return True
        return self.span[0] <= span[0] and self.span[1] >= span[1]

    def __repr__(self):
      if self.text is not None:
          return f"{self.node_type}({self.span}, {self.relation}, '{self.text}')"
      else:
          return f"{self.node_type}({self.span}, {self.relation}, {self.text})"

def parse_brackets(data):
    pattern = re.compile(r"\(\((\d+), (\d+)\), '(\w+)', '(\w+)'\)")
    parsed_data = []
    for line in data.split('\n'):
        match = pattern.match(line)
        if match:
            parsed_data.append({
                'start': int(match.group(1)),
                'end': int(match.group(2)),
                'type': match.group(3),
                'relation': match.group(4)
            })
    return parsed_data

def parse_merge(data):
    parsed_data = []
    for line in data.split('\n'):
        parts = line.split('\t')
        if len(parts) > 5:
            parsed_data.append({
                'index': int(parts[0]),
                'word_index': int(parts[-1]),
                'word': parts[2],
                'lemma': parts[3],
                'pos': parts[4],
                'text': re.findall(r'_!\((.*?)\)_!', parts[-2])
            })
    return parsed_data

def find_text_for_span(span_start, span_end, merge_data):
    text_segments = []
    for item in merge_data:
        if span_start <= item['word_index'] <= span_end:
            text_segments.append(item['word'])
    return ' '.join(text_segments)

def create_combined_structure(brackets_data, merge_data):
    structure = []
    for item in brackets_data:
        text_content = find_text_for_span(item['start'], item['end'], merge_data)
        structure.append({
            'type': item['type'],
            'span': (item['start'], item['end']),
            'relation': item['relation'],
            'text': text_content
        })
    return structure

def create_tree_structure(elements):
    root = Node("Root", None, None)
    stack = [root]

    for element in elements:
        new_node = Node(element['type'], element['span'], element['relation'], element['text'])

        while not stack[-1].is_within_span(new_node.span):
            stack.pop()

        stack[-1].add_child(new_node)
        stack.append(new_node)

    return root

def tree_to_string(node, indent=""):
    output = indent + str(node) + "\n"
    for child in node.children:
        output += tree_to_string(child, indent + "  ")
    return output

def combine(data_folder,total_data):

  data_list = []
  labels = []

  count = 0
  for i in range(total_data):
    if(os.path.exists(os.path.join(data_folder, f'fyp_data_{i + 1}.txt.brackets'))):

      # Reading the contents of the files
      file_path_brackets = os.path.join(data_folder, f'fyp_data_{i + 1}.txt.brackets')
      file_path_merge = os.path.join(data_folder, f'fyp_data_{i + 1}.txt.merge')

      with open(file_path_brackets, 'r') as file:
          data_brackets = file.read()

      with open(file_path_merge, 'r') as file:
          data_merge = file.read()

      # Process the data
      parsed_brackets = parse_brackets(data_brackets)
      parsed_merge = parse_merge(data_merge)

      combined_structure = create_combined_structure(parsed_brackets, parsed_merge)
      tree_structure = create_tree_structure(combined_structure)

      data_list.append(tree_to_string(tree_structure))

      if i in range(242):
        labels.append(1) # first 241 files are on pressure and coercion scams
      else:
        labels.append(0) #after 241 files are on non pressure and coercion scams. e.g. spams discourse

  return data_list, labels
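
The stack-based nesting used by `create_tree_structure` can be seen in isolation with the sketch below. It is a simplified stand-in that uses plain dictionaries instead of `Node` objects, and the span list is a made-up example; it assumes, as the code above does, that parents are listed before their children.

```python
def nest_spans(spans):
    """Build a nested tree from (start, end) spans listed parent-before-child.

    Each tree node is a dict {"span": ..., "children": [...]}; the root has span None.
    """
    root = {"span": None, "children": []}
    stack = [root]
    for span in spans:
        node = {"span": span, "children": []}
        # Pop until the node on top of the stack contains the new span.
        # The root (span None) contains everything, so the loop always stops.
        while stack[-1]["span"] is not None and not (
            stack[-1]["span"][0] <= span[0] and span[1] <= stack[-1]["span"][1]
        ):
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


# Hypothetical spans for a text with three elementary discourse units
tree = nest_spans([(1, 3), (1, 2), (1, 1), (2, 2), (3, 3)])
top = tree["children"][0]
print([child["span"] for child in top["children"]])
```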


##**Functions to build graph (elementary discourse unit), calculate sentiment scores & embeddings**

###This graph represent a single elementary discourse unit where it contains a collection of sentences. Each node in this graph have information such as:

*   Node type
*   Relation type
*   Text that were converted into embeddings
*   Labels (resulted from the calculation of sentiment scores)



In [None]:
import gensim.downloader as api
glove_model = api.load("glove-wiki-gigaword-300")



In [None]:
import torch

# Precompute all embeddings as a PyTorch tensor
all_embeddings = torch.tensor(glove_model.vectors, dtype=torch.float32)

In [None]:

import argparse
import collections
import torch
import torch as th
from torch.utils.data import DataLoader
import torch.nn.functional as F
import torch.nn.init as INIT
import time
import torch.optim as optim
import sys
import dgl
from nltk.tree import Tree
import networkx as nx
import matplotlib.pyplot as plt
import re
import ast
from nltk.tokenize import word_tokenize
import numpy as np
import nltk
import pandas as pd
from nltk.corpus import sentiwordnet as swn
from nltk.stem import WordNetLemmatizer

class DiscourseNode:
    def __init__(self, node_type, relation, text):
        self.node_type = node_type
        self.relation = relation
        self.text = text
        self.children = []

    def add_child(self, child):
        self.children.append(child)

    def get_text(self):
        return self.text

    def __str__(self):
        return f'node({self.node_type}, {self.relation}, {self.text}, {self.children})'

def calculate_sentiment_score(tokens, word, pos, df):

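    # Escape the word so that regex metacharacters in a token are matched literally
    if not df[df['SynsetTerms'].str.contains(rf'\b{re.escape(word)}\b', case=False, regex=True)].empty:
        df_synset_that_match = df[(df['SynsetTerms'] == word) & (df['POS'] == pos.lower())]
        count = 0
        for token in tokens:
            try:
              if df_synset_that_match['Gloss'].str.contains(rf'\b{re.escape(token)}\b', case=False, regex=True).item():
                count+=1
            except Exception:
              # .item() raises unless exactly one lexicon row matched
              pass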

        if count/len(tokens)> 0.1:
            return -1 *df_synset_that_match['NegScore'].item()
        else:
          return 0
    else:
        synsets = list(swn.senti_synsets(word, pos))
        if synsets:
            synset = synsets[0]
            return synset.pos_score() - synset.neg_score()
        else:
            return 0.0

def calculate_sentence_sentiment(sentence, df):

    tokens = word_tokenize(sentence)
    total_sentiment_score = 0.0
    total_word_to_div = 0
    lemmatizer = WordNetLemmatizer()
    # Map the first letter of a Penn Treebank tag to the matching WordNet POS
    penn_to_wordnet = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

    for token in tokens:

        token_sentiment_score = 0.0

        # Get the part-of-speech tag using NLTK's pos_tag function
        pos = nltk.pos_tag([token])[0][1]
        wordnet_pos = penn_to_wordnet.get(pos[:1])

        if wordnet_pos is not None:
            token = lemmatizer.lemmatize(token, pos=wordnet_pos)
            token_sentiment_score = calculate_sentiment_score(tokens, token, wordnet_pos, df)

        # Count every token that contributes a non-zero score, whatever its POS
        if token_sentiment_score != 0.0:
            total_word_to_div += 1
        total_sentiment_score += float(token_sentiment_score)

    # Avoid dividing by zero when no token carried a score
    if total_word_to_div == 0:
        return 0
    return total_sentiment_score / total_word_to_div

def parse_discourse_structure(data):

    node_types = []
    relations = []
    texts = []
    nodes = []
    labels = []

    #create dictionary to map name to integer
    node_type_dict = {"Nucleus": 0, "Satellite": 1}
    relations_dict = {"textualorganization": 0,
                        "contrast": 1,
                        "list": 2,
                        "reason": 3,
                        "condition": 4,
                        "summary": 5,
                        "span": 6,
                        "same_unit": 7,
                        "elaboration": 8,
                        "purpose": 9,
                        "definition": 10,
                        "circumstance": 11,
                        "means": 12,
                        "attribution": 13,
                        "temporal": 14}


    file_path = '/content/ScamWordNet_1.0.3.xlsx'
    df = pd.read_excel(file_path)
    df['SynsetTerms'] = df['SynsetTerms'].str.replace(r'#[0-9]+', '', regex=True)

    root = None
    current_node = None
    # Matches node lines such as: Nucleus((1, 1), span, 'some text')
    generalized_pattern = r"(\w+)\(\((\d+),\s*(\d+)\),\s*(\w+),\s*'(.+?)'\)"

    for line in data.split('\n'):
        line = line.strip()

        # Blank lines, the Root node and malformed lines do not match and are skipped
        match = re.match(generalized_pattern, line)
        if not match:
            continue

        node_type, start, end, relation, text = match.groups()

        # Keep only elementary discourse units, whose span starts and ends at the same index
        if int(start) != int(end):
            continue

        # Unknown node or relation types are encoded as -1
        node_types.append(node_type_dict.get(node_type, -1))
        relations.append(relations_dict.get(relation, -1))

        texts.append(get_sentence_embedding(text))
        # Label 1 marks a unit with a negative sentiment score, 0 otherwise
        labels.append(1 if calculate_sentence_sentiment(text, df) < 0 else 0)

        nodes.append(DiscourseNode(node_type, relation, text))

    node_mapping = {}
    edges = []
    for i in range(len(nodes)):

        node_mapping[nodes[i]] = i

    for i in range (len(nodes) - 1):

        edges.append((node_mapping[nodes[i]],node_mapping[nodes[i+1]]))


    src, dst = zip(*edges)
    g = dgl.graph((src,dst))

    return g, node_types, relations, texts, labels

def preprocess_text(text):

    return word_tokenize(text.lower())

def get_sentence_embedding(sentence):

    """
    Creates a sentence embedding by averaging the word embeddings.
    Unknown words are ignored.
    """
    words = preprocess_text(sentence)
    indices = [glove_model.key_to_index[word] for word in words if word in glove_model]
    word_embeddings_tensor = all_embeddings[indices]

    if len(word_embeddings_tensor) == 0:
        return torch.zeros(glove_model.vector_size)  # Return zero vector if no words are found

    sentence_embedding = torch.mean(word_embeddings_tensor, dtype=torch.float32)
    return sentence_embedding
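
The averaging described in the docstring above is a component-wise mean over the word vectors of a sentence. The sketch below shows the idea with plain Python lists and made-up 3-dimensional vectors in place of 300-dimensional GloVe vectors. In PyTorch this corresponds to reducing along the token axis, for example `torch.mean(t, dim=0)`; a mean taken without a `dim` argument collapses the whole tensor to a single scalar.

```python
def mean_pool(word_vectors, vector_size):
    """Average a list of equal-length word vectors component by component."""
    if not word_vectors:
        # No known words: fall back to a zero vector of the expected size
        return [0.0] * vector_size
    return [sum(components) / len(word_vectors) for components in zip(*word_vectors)]


# Hypothetical toy vectors; "zzz" plays the role of an out-of-vocabulary word
toy_vectors = {"pay": [1.0, 0.0, 2.0], "now": [3.0, 2.0, 0.0]}
tokens = ["pay", "now", "zzz"]
known = [toy_vectors[t] for t in tokens if t in toy_vectors]
sentence_vector = mean_pool(known, vector_size=3)
print(sentence_vector)
```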



##**DLSTMDataset()**

###DLSTMDataset() is a class that prepares the dataset by constructing a discourse tree from the individual discourse units. It also looks up each word of the dataset vocabulary in the pre-trained GloVe embeddings, which are used to initialize the model's embedding layer for training.

In [None]:
from collections import OrderedDict
import os
from gensim import downloader
import warnings

__all__ = ["DLSTM", "DLSTMDataset"]

class DLSTMDataset():

    def __init__(
        self,
        name="dlstm",
        mode="train",
        glove_embed=True,
        vocab_file=None,
        raw_dir=None,
        force_reload=False,
        verbose=False,
        transform=None,
        total_data=261,

    ):
        assert mode in ["train", "dev", "test", "tiny"]
        self._glove_embed = glove_embed if mode == "train" else False
        self.mode = mode
        self._vocab_file = vocab_file
        self.name = name
        self.raw_dir = raw_dir
        self.force_reload = force_reload
        self.verbose = verbose
        self.transform = transform
        self.raw_path = '/content/vocab'
        self._trees = []
        self._transform = None
        self._pretrained_emb = torch.tensor([[]])
        self._vocab = []
        self._data_folder = "/content/{}".format(self.mode)
        self._total_data = total_data

    def process(self):
      from nltk.corpus.reader import BracketParseCorpusReader

      # load vocab file
      self._vocab = OrderedDict()
      vocab_file = (
            self._vocab_file
            if self._vocab_file is not None
            else os.path.join(self.raw_path, "vocab.txt")
      )
      with open(vocab_file, encoding="utf-8") as vf:
        for line in vf.readlines():
            line = line.strip()
            self._vocab[line] = len(self._vocab)

      # initialize the embedding matrix with GloVe (only when requested, i.e. in train mode)
      if self._glove_embed:
          glove_model = downloader.load("glove-wiki-gigaword-300")

          pretrained_emb = []
          fail_cnt = 0
          for word in self._vocab.keys():
              if word in glove_model:
                  pretrained_emb.append(glove_model[word])
              else:
                  # If the word is not in the GloVe model, use a random vector
                  pretrained_emb.append(np.random.uniform(-0.05, 0.05, 300))
                  fail_cnt += 1

          self._pretrained_emb = torch.tensor(np.stack(pretrained_emb, 0))
          print(
              "Miss word in GloVe {0:.4f}".format(
                  1.0 * fail_cnt / len(self._pretrained_emb)
              )
          )

      # build trees
      graphs = self._build_tree(self._data_folder, self._total_data)
      for graph in graphs:
          self._trees.append(graph)

    def _build_tree(self,data_folder,total_data):

      i = 0
      data_list, labels = combine(data_folder,total_data)

      graphs = []

      for data in data_list:

        dgl_graph, node_types, relations, texts, label = parse_discourse_structure(data)

        with warnings.catch_warnings():
          warnings.simplefilter("ignore")
          texts_np = np.array(texts, dtype=np.float32)
          dgl_graph.ndata['x'] = torch.tensor(texts_np, dtype=torch.float32)
          dgl_graph.ndata['y'] = torch.tensor(label, dtype=torch.int64)
          dgl_graph.ndata['hierarchy_type'] = torch.tensor(node_types, dtype=torch.int64)
          dgl_graph.ndata['relation_type'] = torch.tensor(relations, dtype=torch.int64)

          # Initialize 'mask' with ones
          mask = torch.ones((len(texts)), dtype=torch.int64)

          # mask the data if the relation type is not inside what was defined. the masked data will be ignored by the DLSTM algorithm
          condition = dgl_graph.ndata['relation_type'] == -1
          mask[condition] = 0
          dgl_graph.ndata['mask'] = mask
          i+=1
          graphs.append(dgl_graph)

      return graphs

    @property
    def vocab(self):
        r"""Vocabulary

        Returns
        -------
        OrderedDict
        """
        return self._vocab

    @property
    def pretrained_emb(self):
        r"""Pre-trained word embedding, if given."""
        return self._pretrained_emb

    def __getitem__(self, idx):
        r"""Get graph by index

        Parameters
        ----------
        idx : int

        Returns
        -------
        :class:`dgl.DGLGraph`

            graph structure with node features, labels and masks.

            - ``ndata['x']``: text embedding of the node
            - ``ndata['y']``: label of the node
            - ``ndata['mask']``: 0 if the node's relation type is unknown (-1), otherwise 1
        """
        if self._transform is None:
            return self._trees[idx]
        else:
            return self._transform(self._trees[idx])

    def __len__(self):
        r"""Number of graphs in the dataset."""
        return len(self._trees)

    @property
    def vocab_size(self):
        r"""Vocabulary size."""
        return len(self._vocab)

    @property
    def num_classes(self):
        r"""Number of classes for each node."""
        return 2 #like anomaly detection: to identify data points that do not belong to the 'normal class'. normal data
                # can be considered as 'normal' class and the goal is to detect outliers or anomalies that do not fit this class.


DLSTM = DLSTMDataset


##**Train/Dev/Test setup**

###Preparing for Training, Validation and Test processes by defining the parameters, pre-processing dataset, loading, and batching.

In [None]:
import argparse


parser = argparse.ArgumentParser()
parser.add_argument("--gpu", type=int, default=-1)
parser.add_argument("--seed", type=int, default=41)
parser.add_argument("--batch-size", type=int, default=1)
parser.add_argument("--child-sum", action="store_true")
parser.add_argument("--x-size", type=int, default=300)
parser.add_argument("--h-size", type=int, default=150)
parser.add_argument("--epochs", type=int, default=100)
parser.add_argument("--log-every", type=int, default=0)
parser.add_argument("--lr", type=float, default=0.05)
parser.add_argument("--weight-decay", type=float, default=1e-4)
parser.add_argument("--dropout", type=float, default=0.5)
parser.add_argument("--num-relations", type=int, default=15)


args, unknown = parser.parse_known_args()




In [None]:
DLSTMBatch = collections.namedtuple(
    "DLSTMBatch", ["graph", "mask", "wordid", "label", "hierarchy_type", "relation_type"]
)

def batcher(device):
    def batcher_dev(batch):
        batch_trees = dgl.batch(batch)
        return DLSTMBatch(
            graph=batch_trees,
            mask=batch_trees.ndata["mask"].to(device),
            wordid=batch_trees.ndata["x"].to(device),
            label=batch_trees.ndata["y"].to(device),
            hierarchy_type=batch_trees.ndata["hierarchy_type"].to(device),
            relation_type=batch_trees.ndata["relation_type"].to(device),
        )

    return batcher_dev



np.random.seed(args.seed)
th.manual_seed(args.seed)
th.cuda.manual_seed(args.seed)



cuda = args.gpu >= 0
device = th.device("cuda:{}".format(args.gpu)) if cuda else th.device("cpu")
if cuda:
    th.cuda.set_device(args.gpu)


trainset = DLSTMDataset()
trainset.process()


train_loader = DataLoader(
    dataset=trainset,
    batch_size= args.batch_size,
    collate_fn=batcher(device),
    shuffle=True,
    num_workers=0,
)


devset = DLSTMDataset(mode='dev')
devset.process()
dev_loader = DataLoader(
    dataset=devset,
    batch_size=1,
    collate_fn=batcher(device),
    shuffle=False,
    num_workers=0,
)

testset = DLSTMDataset(mode='test')
testset.process()
test_loader = DataLoader(
    dataset=testset,
    batch_size=1,
    collate_fn=batcher(device),
    shuffle=False,
    num_workers=0,
)





Miss word in GloVe 0.0493


##**Train/Dev/Test Process**

###The model was trained, validated and tested using the following components:
*   Loss Function: Binary Cross Entropy
*   Optimizer: Adam
*   Performance Metrics: Precision, recall and F1 score
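
For reference, the three metrics can be computed directly from counts of true positives, false positives and false negatives. The sketch below is a plain-Python illustration with made-up labels and predictions; the notebook itself uses the scikit-learn functions imported in the next cell.

```python
def precision_recall_f1(labels, predictions, positive=1):
    """Compute precision, recall and F1 for the positive class."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(labels, predictions) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(labels, predictions) if y == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels and predictions: 2 true positives, 1 false positive, 1 false negative
example_labels = [1, 1, 0, 0, 1]
example_predictions = [1, 0, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(example_labels, example_predictions)
print(f"precision {precision:.4f} | recall {recall:.4f} | f1 {f1:.4f}")
```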



In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, precision_recall_curve

model = DiscourseLSTM(
    trainset.vocab_size,
    args.x_size,
    args.h_size,
    trainset.num_classes,
    args.dropout,
    args.num_relations,
    cell_type="childsum" ,
    pretrained_emb=trainset.pretrained_emb,
).to(device)
print(model)
params_ex_emb = [
    x
    for x in list(model.parameters())
    if x.requires_grad and x.size(0) != trainset.vocab_size
]

params_emb = list(model.embedding.parameters())


for p in params_ex_emb:
    if p.dim() > 1:
        INIT.xavier_uniform_(p)

optimizer = optim.Adam(
    [
        {
            "params": params_ex_emb,
            "lr": args.lr,
            "weight_decay": args.weight_decay,
        },
        {"params": params_emb, "lr": 0.1 * args.lr},
    ]
)


best_epoch = -1
best_precision_dev = -1
for epoch in range(args.epochs):
    t_epoch = time.time()
    model.train()
    pred_list_train = []
    label_list_train = []

    for step, batch in enumerate(train_loader):
        g = batch.graph.to(device)
        n = g.num_nodes()
        h = th.zeros((n, args.h_size)).to(device)
        c = th.zeros((n, args.h_size)).to(device)
        logits = model(batch, g, h, c)

        pred = th.sigmoid(logits)
        pred_class_1 = pred[: , 1]
        loss = F.binary_cross_entropy(pred_class_1, batch.label.float())

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Predicted class per sample: index of the larger of the two sigmoid scores
        pred_index = th.argmax(pred, dim=1).tolist()
        pred_list_train.extend(pred_index)
        label_list_train.extend(batch.label.tolist())

    print(
        "Epoch {:05d} training time {:.4f}s".format(
            epoch, time.time() - t_epoch
        )
    )


    # pred_list_train already holds class indices (0/1), so the 0.5 threshold below leaves them unchanged
    threshold = 0.5
    predicted_labels = [1 if score >= threshold else 0 for score in pred_list_train]

    precision_train = precision_score(label_list_train, predicted_labels)

    recall_train = recall_score(label_list_train, predicted_labels)

    f1_train = f1_score(label_list_train, predicted_labels)

    print(
        "epoch: {} Train set: precision {:.4f} | recall {:.4f} | f1 {:.4f}".format(
            epoch, precision_train, recall_train, f1_train
        )
    )

    # eval on dev set
    pred_list_dev = []
    label_list_dev = []
    model.eval()
    for step, batch in enumerate(dev_loader):

        g = batch.graph.to(device)
        n = g.num_nodes()
        with th.no_grad():
            h = th.zeros((n, args.h_size)).to(device)
            c = th.zeros((n, args.h_size)).to(device)
            logits = model(batch, g, h, c)

        pred = th.sigmoid(logits)
        pred_class_1 = pred[: , 1]

        # Predicted class per sample: index of the larger of the two sigmoid scores
        pred_index = th.argmax(pred, dim=1).tolist()
        pred_list_dev.extend(pred_index)
        label_list_dev.extend(batch.label.tolist())


    # pred_list_dev already holds class indices (0/1), so the 0.5 threshold below leaves them unchanged
    threshold = 0.5
    predicted_labels = [1 if score >= threshold else 0 for score in pred_list_dev]

    precision_dev = precision_score(label_list_dev, predicted_labels)

    recall_dev = recall_score(label_list_dev, predicted_labels)

    f1_dev = f1_score(label_list_dev, predicted_labels)

    print(
        "Dev set: precision {:.4f} | recall {:.4f} | f1 {:.4f}".format(
             precision_dev, recall_dev, f1_dev
        )
    )


    if precision_dev > best_precision_dev:

      best_precision_dev = precision_dev
      best_epoch = epoch
      th.save(model.state_dict(), "best_{}.pkl".format(args.seed))
      print('saved best.pkl...........................................')
      print('best epoch {}'.format(best_epoch))

    # lr decay
    for param_group in optimizer.param_groups:
        param_group["lr"] = max(1e-5, param_group["lr"] * 0.99)  # 10


# test
model.load_state_dict(th.load("best_{}.pkl".format(args.seed)))


pred_list_test = []
label_list_test = []

model.eval()

for step, batch in enumerate(test_loader):
    g = batch.graph.to(device)
    n = g.num_nodes()
    with th.no_grad():
        h = th.zeros((n, args.h_size)).to(device)
        c = th.zeros((n, args.h_size)).to(device)
        logits = model(batch, g, h, c)

    pred = th.sigmoid(logits)
    pred_class_1 = pred[: , 1]

    # Predicted class per sample: index of the larger of the two sigmoid scores
    pred_index = th.argmax(pred, dim=1).tolist()
    pred_list_test.extend(pred_index)
    label_list_test.extend(batch.label.tolist())


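# pred_list_test holds hard class indices (0/1), not probabilities, so this curve is
# degenerate; its outputs are not used below.
precision, recall, thresholds = precision_recall_curve(label_list_test , pred_list_test )

# pred_list_test already holds class indices (0/1), so the 0.5 threshold below leaves them unchanged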
threshold = 0.5
predicted_labels = [1 if score >= threshold else 0 for score in pred_list_test]

precision_test = precision_score(label_list_test, predicted_labels)

recall_test = recall_score(label_list_test, predicted_labels)

f1_test = f1_score(label_list_test, predicted_labels)

print(
    "------------------------------------------------------------------------------------"
)

print(
        "Test set: precision {:.4f} | recall {:.4f} | f1 {:.4f}".format(
            precision_test, recall_test, f1_test
        )
    )

Using glove
DiscourseLSTM(
  (embedding): Embedding(1015, 300)
  (dropout): Dropout(p=0.5, inplace=False)
  (linear): Linear(in_features=150, out_features=2, bias=True)
  (cell): ChildSumDiscourseLSTMCell(
    (W_iou): Linear(in_features=300, out_features=450, bias=False)
  )
)




Epoch 00000 training time 1.9360s
epoch: 0 Train set: precision 0.7941 | recall 0.9643 | f1 0.8710




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189
saved best.pkl...........................................
best epoch 0




Epoch 00001 training time 1.6362s
epoch: 1 Train set: precision 0.7971 | recall 0.9821 | f1 0.8800




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00002 training time 1.8142s
epoch: 2 Train set: precision 0.8000 | recall 1.0000 | f1 0.8889




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00003 training time 1.5789s
epoch: 3 Train set: precision 0.7971 | recall 0.9821 | f1 0.8800




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00004 training time 2.3986s
epoch: 4 Train set: precision 0.7971 | recall 0.9821 | f1 0.8800




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00005 training time 2.0724s
epoch: 5 Train set: precision 0.7937 | recall 0.8929 | f1 0.8403




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00006 training time 1.7510s
epoch: 6 Train set: precision 0.8060 | recall 0.9643 | f1 0.8780




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00007 training time 1.6593s
epoch: 7 Train set: precision 0.7969 | recall 0.9107 | f1 0.8500




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00008 training time 1.7824s
epoch: 8 Train set: precision 0.7941 | recall 0.9643 | f1 0.8710




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00009 training time 1.6901s
epoch: 9 Train set: precision 0.8116 | recall 1.0000 | f1 0.8960




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00010 training time 2.5316s
epoch: 10 Train set: precision 0.8060 | recall 0.9643 | f1 0.8780




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00011 training time 1.9574s
epoch: 11 Train set: precision 0.8088 | recall 0.9821 | f1 0.8871




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00012 training time 1.7210s
epoch: 12 Train set: precision 0.8125 | recall 0.9286 | f1 0.8667




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00013 training time 1.7877s
epoch: 13 Train set: precision 0.8209 | recall 0.9821 | f1 0.8943




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00014 training time 1.6923s
epoch: 14 Train set: precision 0.8182 | recall 0.9643 | f1 0.8852




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00015 training time 1.7572s
epoch: 15 Train set: precision 0.8209 | recall 0.9821 | f1 0.8943




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00016 training time 2.5908s
epoch: 16 Train set: precision 0.8060 | recall 0.9643 | f1 0.8780




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00017 training time 1.8077s
epoch: 17 Train set: precision 0.8060 | recall 0.9643 | f1 0.8780




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00018 training time 1.7085s
epoch: 18 Train set: precision 0.7971 | recall 0.9821 | f1 0.8800




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00019 training time 1.5811s
epoch: 19 Train set: precision 0.7941 | recall 0.9643 | f1 0.8710




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00020 training time 1.8767s
epoch: 20 Train set: precision 0.7910 | recall 0.9464 | f1 0.8618




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00021 training time 1.8585s
epoch: 21 Train set: precision 0.8333 | recall 0.8929 | f1 0.8621




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00022 training time 2.6776s
epoch: 22 Train set: precision 0.8030 | recall 0.9464 | f1 0.8689




Dev set: precision 0.8000 | recall 0.7059 | f1 0.7500




Epoch 00023 training time 1.8648s
epoch: 23 Train set: precision 0.7969 | recall 0.9107 | f1 0.8500




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00024 training time 1.6059s
epoch: 24 Train set: precision 0.7910 | recall 0.9464 | f1 0.8618




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00025 training time 1.8718s
epoch: 25 Train set: precision 0.8154 | recall 0.9464 | f1 0.8760




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00026 training time 1.7214s
epoch: 26 Train set: precision 0.7879 | recall 0.9286 | f1 0.8525




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00027 training time 1.7355s
epoch: 27 Train set: precision 0.7910 | recall 0.9464 | f1 0.8618




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00028 training time 2.5724s
epoch: 28 Train set: precision 0.7879 | recall 0.9286 | f1 0.8525




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00029 training time 1.9249s
epoch: 29 Train set: precision 0.7971 | recall 0.9821 | f1 0.8800




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00030 training time 1.9505s
epoch: 30 Train set: precision 0.7879 | recall 0.9286 | f1 0.8525




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00031 training time 1.7368s
epoch: 31 Train set: precision 0.8060 | recall 0.9643 | f1 0.8780




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00032 training time 1.8902s
epoch: 32 Train set: precision 0.7971 | recall 0.9821 | f1 0.8800




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00033 training time 2.0710s
epoch: 33 Train set: precision 0.8030 | recall 0.9464 | f1 0.8689




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00034 training time 2.4444s
epoch: 34 Train set: precision 0.8125 | recall 0.9286 | f1 0.8667




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00035 training time 1.8459s
epoch: 35 Train set: precision 0.7941 | recall 0.9643 | f1 0.8710




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00036 training time 1.5835s
epoch: 36 Train set: precision 0.8060 | recall 0.9643 | f1 0.8780




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00037 training time 1.7749s
epoch: 37 Train set: precision 0.7971 | recall 0.9821 | f1 0.8800




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00038 training time 1.7407s
epoch: 38 Train set: precision 0.8060 | recall 0.9643 | f1 0.8780




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00039 training time 2.3989s
epoch: 39 Train set: precision 0.8000 | recall 1.0000 | f1 0.8889




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00040 training time 2.1845s
epoch: 40 Train set: precision 0.8235 | recall 1.0000 | f1 0.9032




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00041 training time 1.7606s
epoch: 41 Train set: precision 0.8116 | recall 1.0000 | f1 0.8960




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00042 training time 1.7574s
epoch: 42 Train set: precision 0.8000 | recall 1.0000 | f1 0.8889




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00043 training time 1.6816s
epoch: 43 Train set: precision 0.8226 | recall 0.9107 | f1 0.8644




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00044 training time 1.7827s
epoch: 44 Train set: precision 0.8000 | recall 1.0000 | f1 0.8889




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00045 training time 2.5805s
epoch: 45 Train set: precision 0.8060 | recall 0.9643 | f1 0.8780




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00046 training time 1.9280s
epoch: 46 Train set: precision 0.8060 | recall 0.9643 | f1 0.8780




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00047 training time 1.9716s
epoch: 47 Train set: precision 0.8030 | recall 0.9464 | f1 0.8689




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00048 training time 2.0492s
epoch: 48 Train set: precision 0.7971 | recall 0.9821 | f1 0.8800




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00049 training time 1.8089s
epoch: 49 Train set: precision 0.8281 | recall 0.9464 | f1 0.8833




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00050 training time 2.3647s
epoch: 50 Train set: precision 0.8235 | recall 1.0000 | f1 0.9032




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00051 training time 2.7076s
epoch: 51 Train set: precision 0.8154 | recall 0.9464 | f1 0.8760




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00052 training time 1.9745s
epoch: 52 Train set: precision 0.8182 | recall 0.9643 | f1 0.8852




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00053 training time 1.9337s
epoch: 53 Train set: precision 0.8088 | recall 0.9821 | f1 0.8871




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00054 training time 1.7829s
epoch: 54 Train set: precision 0.8060 | recall 0.9643 | f1 0.8780




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00055 training time 1.6924s
epoch: 55 Train set: precision 0.8154 | recall 0.9464 | f1 0.8760




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00056 training time 2.6096s
epoch: 56 Train set: precision 0.8182 | recall 0.9643 | f1 0.8852




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00057 training time 1.8260s
epoch: 57 Train set: precision 0.8000 | recall 0.9286 | f1 0.8595




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00058 training time 1.7577s
epoch: 58 Train set: precision 0.8088 | recall 0.9821 | f1 0.8871




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00059 training time 1.7274s
epoch: 59 Train set: precision 0.8060 | recall 0.9643 | f1 0.8780




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00060 training time 1.8447s
epoch: 60 Train set: precision 0.8088 | recall 0.9821 | f1 0.8871




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00061 training time 1.7823s
epoch: 61 Train set: precision 0.8209 | recall 0.9821 | f1 0.8943




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00062 training time 2.5559s
epoch: 62 Train set: precision 0.7941 | recall 0.9643 | f1 0.8710




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00063 training time 1.8965s
epoch: 63 Train set: precision 0.8000 | recall 1.0000 | f1 0.8889




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00064 training time 1.8483s
epoch: 64 Train set: precision 0.7941 | recall 0.9643 | f1 0.8710




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00065 training time 1.8604s
epoch: 65 Train set: precision 0.8088 | recall 0.9821 | f1 0.8871




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00066 training time 1.9636s
epoch: 66 Train set: precision 0.8116 | recall 1.0000 | f1 0.8960




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00067 training time 2.2551s
epoch: 67 Train set: precision 0.7971 | recall 0.9821 | f1 0.8800




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00068 training time 2.2776s
epoch: 68 Train set: precision 0.7971 | recall 0.9821 | f1 0.8800




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00069 training time 1.8467s
epoch: 69 Train set: precision 0.8182 | recall 0.9643 | f1 0.8852




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00070 training time 1.6419s
epoch: 70 Train set: precision 0.8358 | recall 1.0000 | f1 0.9106




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00071 training time 1.8153s
epoch: 71 Train set: precision 0.8000 | recall 1.0000 | f1 0.8889




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00072 training time 1.8352s
epoch: 72 Train set: precision 0.8030 | recall 0.9464 | f1 0.8689




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00073 training time 2.4284s
epoch: 73 Train set: precision 0.7941 | recall 0.9643 | f1 0.8710




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00074 training time 1.9980s
epoch: 74 Train set: precision 0.8333 | recall 0.9821 | f1 0.9016




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00075 training time 1.6711s
epoch: 75 Train set: precision 0.8116 | recall 1.0000 | f1 0.8960




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00076 training time 1.8008s
epoch: 76 Train set: precision 0.8000 | recall 1.0000 | f1 0.8889




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00077 training time 1.7303s
epoch: 77 Train set: precision 0.8209 | recall 0.9821 | f1 0.8943




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00078 training time 1.7257s
epoch: 78 Train set: precision 0.8182 | recall 0.9643 | f1 0.8852




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00079 training time 2.4129s
epoch: 79 Train set: precision 0.8088 | recall 0.9821 | f1 0.8871




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00080 training time 2.0176s
epoch: 80 Train set: precision 0.8358 | recall 1.0000 | f1 0.9106




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00081 training time 1.6275s
epoch: 81 Train set: precision 0.7941 | recall 0.9643 | f1 0.8710




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00082 training time 1.8425s
epoch: 82 Train set: precision 0.7910 | recall 0.9464 | f1 0.8618




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00083 training time 1.6816s
epoch: 83 Train set: precision 0.8000 | recall 1.0000 | f1 0.8889




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00084 training time 1.7681s
epoch: 84 Train set: precision 0.8088 | recall 0.9821 | f1 0.8871




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00085 training time 2.4223s
epoch: 85 Train set: precision 0.8060 | recall 0.9643 | f1 0.8780




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00086 training time 2.0705s
epoch: 86 Train set: precision 0.8116 | recall 1.0000 | f1 0.8960




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00087 training time 1.8096s
epoch: 87 Train set: precision 0.8116 | recall 1.0000 | f1 0.8960




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00088 training time 1.6259s
epoch: 88 Train set: precision 0.8088 | recall 0.9821 | f1 0.8871




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00089 training time 1.5441s
epoch: 89 Train set: precision 0.8116 | recall 1.0000 | f1 0.8960




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00090 training time 1.5816s
epoch: 90 Train set: precision 0.8088 | recall 0.9821 | f1 0.8871




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00091 training time 2.3438s
epoch: 91 Train set: precision 0.8000 | recall 1.0000 | f1 0.8889




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00092 training time 2.0891s
epoch: 92 Train set: precision 0.8226 | recall 0.9107 | f1 0.8644




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00093 training time 1.7623s
epoch: 93 Train set: precision 0.8209 | recall 0.9821 | f1 0.8943




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00094 training time 1.7916s
epoch: 94 Train set: precision 0.8485 | recall 1.0000 | f1 0.9180




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00095 training time 1.8454s
epoch: 95 Train set: precision 0.8209 | recall 0.9821 | f1 0.8943




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00096 training time 1.8210s
epoch: 96 Train set: precision 0.7971 | recall 0.9821 | f1 0.8800




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00097 training time 2.4453s
epoch: 97 Train set: precision 0.8235 | recall 1.0000 | f1 0.9032




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00098 training time 1.9899s
epoch: 98 Train set: precision 0.8154 | recall 0.9464 | f1 0.8760




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189




Epoch 00099 training time 1.5382s
epoch: 99 Train set: precision 0.8116 | recall 1.0000 | f1 0.8960




Dev set: precision 0.8500 | recall 1.0000 | f1 0.9189
------------------------------------------------------------------------------------
Test set: precision 0.8000 | recall 1.0000 | f1 0.8889
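
The precision, recall and F1 figures printed above come from scikit-learn. As a cross-check, the sketch below recomputes the three metrics from label lists in plain Python and confirms that each logged F1 is the harmonic mean of its logged precision and recall. The helper names and the four-sample toy lists are assumptions made for illustration and are not part of the notebook.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f1(labels, preds):
    """Compute binary precision, recall and F1 with class 1 as the positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)

# Hypothetical toy data: two true positives, one false positive, one false negative
toy_labels = [1, 1, 1, 0]
toy_preds = [1, 1, 0, 1]
print(precision_recall_f1(toy_labels, toy_preds))

# (precision, recall) pairs copied from the log above
logged = {"dev": (0.85, 1.0), "dev_epoch_22": (0.80, 0.7059), "test": (0.80, 1.0)}
for name, (p, r) in logged.items():
    print(name, "{:.4f}".format(f1_from(p, r)))
```

A recall of 1.0 means no positive example was missed, so the remaining error in these runs comes entirely from false positives.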




##**Developing a User Interface**
### A simple UI was created with the Python Streamlit library to test the model's functionality. The UI provides a text area where users can type or paste a potential scam message they have received, and a button that starts the analysis.
### The backend process includes:

*   Parsing the user input into a discourse tree
*   Using the trained model to predict whether or not the text is a scam that uses pressure and coercion.
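
The first backend step can be illustrated in isolation. The sketch below applies the same regular expression as the notebook's `parse_brackets` to a few bracket lines and nests them by span containment using a stack. The sample lines are hypothetical, and the sketch assumes each enclosing span is listed before the spans it contains; real parser output may be ordered differently.

```python
import re

# Same pattern as parse_brackets: ((start, end), 'Type', 'relation')
BRACKET_RE = re.compile(r"\(\((\d+), (\d+)\), '(\w+)', '(\w+)'\)")

# Hypothetical bracket lines for a message with three discourse units
SAMPLE = """((1, 3), 'Nucleus', 'span')
((1, 1), 'Nucleus', 'span')
((2, 3), 'Satellite', 'elaboration')
((2, 2), 'Nucleus', 'span')
((3, 3), 'Satellite', 'condition')"""

def parse_lines(text):
    """Turn matching lines into node dicts; non-matching lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = BRACKET_RE.match(line.strip())
        if m:
            rows.append({
                "span": (int(m.group(1)), int(m.group(2))),
                "type": m.group(3),
                "relation": m.group(4),
                "children": [],
            })
    return rows

def encloses(outer, inner):
    """A span of None marks the root, which encloses everything."""
    return outer is None or (outer[0] <= inner[0] and inner[1] <= outer[1])

def nest(rows):
    """Attach each node to the closest preceding node whose span encloses it."""
    root = {"span": None, "type": "Root", "relation": None, "children": []}
    stack = [root]
    for row in rows:
        while not encloses(stack[-1]["span"], row["span"]):
            stack.pop()
        stack[-1]["children"].append(row)
        stack.append(row)
    return root

def show(node, indent=""):
    """Print the tree with two spaces of indentation per level."""
    print(f"{indent}{node['type']} {node['span']} {node['relation']}")
    for child in node["children"]:
        show(child, indent + "  ")

show(nest(parse_lines(SAMPLE)))
```

The stack always holds the chain of currently open spans, so each new node needs only a few pops to find its parent.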



In [None]:
import re
import os

class Node:
    def __init__(self, node_type, span, relation, text=None):
        self.node_type = node_type
        self.span = span
        self.relation = relation
        self.text = text
        self.children = []

    def add_child(self, child_node):
        self.children.append(child_node)

    def is_within_span(self, span):
        if self.span is None:  # Root node
            return True
        return self.span[0] <= span[0] and self.span[1] >= span[1]

    def __repr__(self):
      if self.text is not None:
          return f"{self.node_type}({self.span}, {self.relation}, '{self.text}')"
      else:
          return f"{self.node_type}({self.span}, {self.relation}, {self.text})"

def parse_brackets(data):
    pattern = re.compile(r"\(\((\d+), (\d+)\), '(\w+)', '(\w+)'\)")
    parsed_data = []
    for line in data.split('\n'):
        match = pattern.match(line)
        if match:
            parsed_data.append({
                'start': int(match.group(1)),
                'end': int(match.group(2)),
                'type': match.group(3),
                'relation': match.group(4)
            })
    return parsed_data

def parse_merge(data):
    parsed_data = []
    for line in data.split('\n'):
        parts = line.split('\t')
        if len(parts) > 5:
            parsed_data.append({
                'index': int(parts[0]),
                'word_index': int(parts[-1]),
                'word': parts[2],
                'lemma': parts[3],
                'pos': parts[4],
                'text': re.findall(r'_!\((.*?)\)_!', parts[-2])
            })
    return parsed_data

def find_text_for_span(span_start, span_end, merge_data):
    text_segments = []
    for item in merge_data:
        if span_start <= item['word_index'] <= span_end:
            text_segments.append(item['word'])
    return ' '.join(text_segments)

def create_combined_structure(brackets_data, merge_data):
    structure = []
    for item in brackets_data:
        text_content = find_text_for_span(item['start'], item['end'], merge_data)
        structure.append({
            'type': item['type'],
            'span': (item['start'], item['end']),
            'relation': item['relation'],
            'text': text_content
        })
    return structure

def create_tree_structure(elements):
    root = Node("Root", None, None)
    stack = [root]

    for element in elements:
        new_node = Node(element['type'], element['span'], element['relation'], element['text'])

        while not stack[-1].is_within_span(new_node.span):
            stack.pop()

        stack[-1].add_child(new_node)
        stack.append(new_node)

    return root

def tree_to_string(node, indent=""):
    output = indent + str(node) + "\n"
    for child in node.children:
        output += tree_to_string(child, indent + "  ")
    return output

class File_size_0(Exception):
    def __init__(self):
        super().__init__()

def combine():

    # Reading the contents of the files
    file_path_brackets = 'C:/Users/user/Desktop/FYPCode/DPLP/Data2/input.txt.brackets'
    file_path_merge = 'C:/Users/user/Desktop/FYPCode/DPLP/Data2/input.txt.merge'

    # An empty .brackets file means the discourse parser produced no tree
    if os.path.getsize(file_path_brackets) == 0:
        raise File_size_0()

    with open(file_path_brackets, 'r') as file:
        data_brackets = file.read()

    with open(file_path_merge, 'r') as file:
        data_merge = file.read()

    # Process the data
    parsed_brackets = parse_brackets(data_brackets)
    parsed_merge = parse_merge(data_merge)

    combined_structure = create_combined_structure(parsed_brackets, parsed_merge)
    tree_structure = create_tree_structure(combined_structure)

    return tree_to_string(tree_structure)



In [None]:
import collections
import re
from collections import OrderedDict

import dgl
import gensim.downloader as api
import networkx as nx
import numpy as np
import torch
import torch as th
import torch.nn as nn
from gensim.models.keyedvectors import KeyedVectors
from matplotlib import pyplot as plt
from nltk.tokenize import word_tokenize


def build_tree(data):


    dgl_graph, node_types, relations, texts = parse_discourse_structure(data)

    dgl_graph.ndata['x'] = torch.tensor(texts, dtype=torch.float32)
    dgl_graph.ndata['hierarchy_type'] = torch.tensor(node_types, dtype=torch.int64)
    dgl_graph.ndata['relation_type'] = torch.tensor(relations, dtype=torch.int64)


    # Initialize 'mask' with ones
    mask = torch.ones((len(texts)), dtype=torch.int64)

    condition = dgl_graph.ndata['relation_type'] == -1
    mask[condition] = 0
    dgl_graph.ndata['mask'] = mask

    return dgl_graph


def predict_discourse(data):


    vocab = OrderedDict()
    vocab_file =  "C:/Users/user/Desktop/FYPCode/DPLP/vocab/vocab.txt"

    with open(vocab_file, encoding="utf-8") as vf:
        for line in vf.readlines():
            line = line.strip()
            vocab[line] = len(vocab)

    device = th.device("cpu")


    model = DiscourseLSTM(
        len(vocab),
        300, #args.x_size,
        150, #args.h_size,
        2, #trainset.num_classes,
        0.5, #args.dropout,
        15, #args.num_relations,
        cell_type="childsum" ,
    ).to(device)

    g = build_tree(data)

    DLSTMBatch = collections.namedtuple(
        "DLSTMBatch", ["graph", "mask", "wordid", "hierarchy_type", "relation_type"]
    )

    batch = DLSTMBatch(
                graph=g,
                mask=g.ndata["mask"].to(device),
                wordid=g.ndata["x"].to(device),
                hierarchy_type=g.ndata["hierarchy_type"].to(device),
                relation_type=g.ndata["relation_type"].to(device),
            )


    n = g.num_nodes()
    #Load the trained DLSTM model
    model.load_state_dict(th.load("C:/Users/user/Desktop/FYPCode/Discourse LSTM/best_{}.pkl".format(41)))
    model.eval()
    with th.no_grad():
            h = th.zeros((n, 150)).to(device)
            c = th.zeros((n, 150)).to(device)
            logits = model(batch, g, h, c)

    pred_values= th.sigmoid(logits)

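    # Split the per-row winning scores by predicted class
    class_0 = []
    class_1 = []
    for i in pred_values:
        pred_val, index = torch.max(i, dim=0)
        if index == 0:
            class_0.append(pred_val.item())
        else:
            class_1.append(pred_val.item())

    # Scam (1) if class 1 wins the vote, or if its most confident score beats class 0's
    if len(class_1) > len(class_0):
        return 1
    if class_1 and class_0 and max(class_1) > max(class_0):
        return 1
    # Everything else, including exact ties that previously fell through and returned None,
    # is reported as non-scam (0); the UI already treated None that way
    return 0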

In [None]:
import subprocess
import shutil
import os
import glob
from combine import combine, File_size_0
from predict_discourse import predict_discourse
import re

def clear_folder():

    folder_path = 'C:/Users/user/Desktop/FYPCode/DPLP/Data2'

    # Creating a pattern to match all files
    file_pattern = os.path.join(folder_path, '*')

    # Using glob.glob to get a list of all files
    for file in glob.glob(file_pattern):
        if os.path.isfile(file):  # Checking if it is a file
            os.remove(file)



def save_text(text):
    # Specify the folder path and the file name
    folder_path = "C:/Users/user/Desktop/FYPCode/DPLP/Data"  # Replace with the actual folder path
    file_name = "input.txt"

    # Combine the folder path and file name to create the full file path
    full_file_path = folder_path + "/" + file_name

    # Open the file for writing
    with open(full_file_path, 'w') as file:
        file.write(text)


def convert_file_to_xml():
    # Define the command as a list of strings
    command = ["python", "C:/Users/user/Desktop/FYPCode/DPLP/corenlp.py"]
    # Use subprocess.run() to execute the command
    try:
        result = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, check=True)

    except subprocess.CalledProcessError as e:
        pass

def move_xml_to_folder():
    # Fall back to the second location if the XML is not found in the first
    try:
        shutil.move('C:/Users/user/Desktop/FYPCode/Discourse LSTM/input.txt.xml', 'C:/Users/user/Desktop/FYPCode/DPLP/Data2/input.txt.xml')
    except OSError:
        shutil.move('C:/Users/user/Desktop/FYPCode/DPLP/input.txt.xml', 'C:/Users/user/Desktop/FYPCode/DPLP/Data2/input.txt.xml')


def discourse_parsing():
    # Define the command as a list of strings
    command_1 = ["py", "-2.7", "C:/Users/user/Desktop/FYPCode/DPLP/convert.py", "C:/Users/user/Desktop/FYPCode/DPLP/Data2"]
    command_2 = ["py", "-2.7", "C:/Users/user/Desktop/FYPCode/DPLP/segmenter.py", "C:/Users/user/Desktop/FYPCode/DPLP/Data2"]
    command_3 = ["py", "-2.7", "C:/Users/user/Desktop/FYPCode/DPLP/rstparser.py", "C:/Users/user/Desktop/FYPCode/DPLP/Data2"]

    script_directory = "C:/Users/user/Desktop/FYPCode/DPLP/"
    os.chdir(script_directory)

    # Use subprocess.run() to execute the command
    try:
        result = subprocess.run(command_1, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, check=True)
    except subprocess.CalledProcessError as e:
        pass


    try:
        result = subprocess.run(command_2, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, check=True)
    except subprocess.CalledProcessError as e:
        pass

    try:
        result = subprocess.run(command_3, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, check=True)
    except subprocess.CalledProcessError as e:
        pass


def UI_process(text):

    filtered_text = re.sub(r'[^a-zA-Z0-9.?!,$\s]', '', text)
    save_text(filtered_text)
    convert_file_to_xml()
    move_xml_to_folder()
    discourse_parsing()
    # combine() raises File_size_0 when the parser produced no output; let it propagate to the UI
    data = combine()
    pred = predict_discourse(data)
    return pred




In [None]:
import streamlit as st
from UI_process import UI_process, clear_folder
from combine import File_size_0


# Title for the Streamlit app
st.title('Pressure and Coercion Scam Language Detection WebApp')

st.write("")
st.write("")



st.markdown("""
    Welcome to the Pressure and Coercion Scam Language Detection WebApp!
    This application is intended for community use. It aims to help people, especially vulnerable individuals,
    who are unsure whether a coercive or pressure-inducing message they received is a scam.
    The result may help clear up your doubts before you take the case further to the authorities.
    Simply insert the text you received and want to analyze, then click 'Check Text'.
""")

st.write("")
st.write("")
# Initialize session state variables
if 'user_data' not in st.session_state:
    st.session_state.user_data = None
if 'counter' not in st.session_state:
    st.session_state.counter = 0
if 'text_area_content' not in st.session_state:
    st.session_state['text_area_content'] = ""
if 'button_clicked' not in st.session_state:
    st.session_state['button_clicked'] = False

# Text input for user
user_input = st.text_area("Insert text to check for scam language:", value=st.session_state['text_area_content'])

# Button to process data

if st.button('Check Text'):

    if not st.session_state['button_clicked']:

        st.session_state['button_clicked'] = True

        if (st.session_state.counter>=1):
            clear_folder()
        st.session_state.user_data = user_input
        st.session_state.counter+=1

        try:
            pred = UI_process(user_input)

            if pred == 1:
                # If scam language is detected
                st.error('Alert: Potential scam message detected. The text uses pressure or coercive language.')

            else:
                # If no scam language is detected
                st.success('No pressure or coercive scam language was detected in this message.')
            st.session_state['button_clicked'] = False

        except File_size_0:
            st.error('Alert: The text couldn\'t be processed because of the sentences it contains. Please insert multiple coherent sentences.')
            st.session_state['button_clicked'] = False

##**Outcome**
###- Perfect recall and consistent performance across all sets raise a concern about overfitting.
###- Several areas need improvement:
*   Data Collection: More data is needed to build a more accurate model.
*   ScamWordNet Lexicon: Polarity scores need to be assigned more accurately.
*   Discourse-LSTM: The newly developed algorithm needs further refinement and validation.
*   User Interface: The model needs an appropriate platform to integrate with, as the current user interface is not very practical.

###- Because current research on DLSA is very limited and the model is underdeveloped, it is not yet practical for detecting and preventing cyber scams.
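
One way to probe the overfitting concern noted in this section is to compare recall with precision on each data split. Perfect recall paired with low precision would suggest that the model labels almost everything as a scam rather than having learned the task. The following is a minimal sketch, not the project's evaluation code: the function names and the toy labels are hypothetical, and it assumes that label 1 means scam, matching the `pred == 1` check in the web app.

```python
from typing import Dict, List, Tuple


def recall(y_true: List[int], y_pred: List[int]) -> float:
    """Share of actual scams (label 1) that were predicted as scams."""
    actual_pos = sum(1 for t in y_true if t == 1)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_pos / actual_pos if actual_pos else 0.0


def precision(y_true: List[int], y_pred: List[int]) -> float:
    """Share of predicted scams (label 1) that were actual scams."""
    pred_pos = sum(1 for p in y_pred if p == 1)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_pos / pred_pos if pred_pos else 0.0


def split_report(
    splits: Dict[str, Tuple[List[int], List[int]]]
) -> Dict[str, Dict[str, float]]:
    """Compute recall and precision for each named split."""
    return {
        name: {"recall": recall(t, p), "precision": precision(t, p)}
        for name, (t, p) in splits.items()
    }


# Hypothetical toy labels, for illustration only (not project results).
toy_splits = {
    "train": ([1, 1, 0, 0], [1, 1, 1, 0]),
    "test": ([1, 0, 0, 1], [1, 1, 1, 1]),
}

for name, scores in split_report(toy_splits).items():
    print(name, scores)
```

If recall stays at 1.0 on every split while precision drops on held-out data, the perfect recall is more likely an artifact of over-predicting the scam class than evidence of a reliable detector.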

