<a href="https://colab.research.google.com/github/ThuyHaLE/Problem3_Natural-Language-Processing/blob/main/NLU_Sentence_Classifier(Dependency_parsing_Trankit).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Natural Language Understanding (NLU)**

- **Core functionalities**:
  - ***Extracting Meaning***: understanding the intent, sentiment, and overall message conveyed in a sentence or passage.
  - ***Analyzing Context***: taking the surrounding context into account. For instance, "the bank is closed" can refer to a financial institution or a riverbank, depending on the context.
  - ***Disambiguation***: resolving cases where a word or phrase can have multiple meanings.

- **How NLU works**:
  - ***Breaking it down***: the input text is broken down into smaller components such as sentences, phrases, and words.
  - ***Understanding Relationships***: the relationships between these components are analyzed. This involves part-of-speech tagging (identifying nouns, verbs, etc.) and recognizing grammatical structures.
  - ***Deriving Meaning***: based on that analysis, the meaning of the text is inferred. This may involve considering the context in which the language is used.
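
The "breaking it down" step can be sketched with plain regular expressions. This is only a naive illustration using the standard library, not what Trankit does internally; the helper names `split_sentences` and `split_words` are made up for this example.

```python
import re

def split_sentences(text):
    # Naive rule (an assumption): a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def split_words(sentence):
    # Runs of letters, digits, or apostrophes form a word;
    # every other non-space character becomes its own token.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", sentence)

text = "I am working now, but we will eat later. Nature does not hurry."
for sentence in split_sentences(text):
    print(split_words(sentence))
```

A rule this simple breaks on abbreviations such as "e.g." and on text without spaces after punctuation, which is one reason to rely on a trained tokenizer in practice.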

- **Common NLU tasks**:
  - ***Text Classification***: automatically categorizing a text into predefined categories. Examples: sentiment analysis (positive, negative, neutral), spam filtering (spam, not spam), or topic labeling (sports, politics, entertainment).
  - ***Named Entity Recognition (NER)***: identifies and classifies named entities within a text, such as people, organizations, locations, dates, and monetary values, typically to support information extraction or question answering.
  - ***Text Summarization***: generates a concise summary of a long text while preserving its key points and meaning, which helps readers quickly grasp the main idea of an article or document.
  - ***Machine Translation***: relies on NLU to understand the source text (e.g., English) so that it can be translated accurately into the target language (e.g., Spanish) while preserving meaning and intent.
  - ***Question Answering***: answers questions, ranging from retrieving facts from a knowledge base to open-ended, more challenging questions that require reasoning and inference.
  - ***Part-of-Speech Tagging (POS)***: assigns grammatical labels (e.g., noun, verb, adjective) to each word in a sentence. POS tagging is a fundamental step in NLU, providing valuable information about the sentence structure that can be used for various downstream tasks.
  - ***Sentiment Analysis***: goes beyond simple classification (positive, negative) and aims to understand a text's emotional tone or opinion. This can involve analyzing the sentiment of product reviews, social media posts, or customer feedback data.
  - ***Dialogue Management***: used in chatbots and virtual assistants to understand the user's intent within a conversation, track the conversation flow, and generate appropriate responses.
  - ***Textual Entailment***: determines whether the meaning of one sentence (the hypothesis) is entailed by the meaning of another sentence (the text). It requires the NLU system to understand the logical relationships between sentences.
  - ***Natural Language Inference***: reasons about the relationship between two sentences (similar to textual entailment). The system determines whether the second sentence (the hypothesis) can be inferred from the first sentence (the premise).

Parsing: the task of creating a parse tree from a given sentence. <br>
Sentence Classifier: classifies sentences as simple, compound, complex, or passive using dependency parsing (Trankit).

In [None]:
#Install some packages
!pip install --quiet "deplacy" "trankit" "transformers"

In [None]:
#Import some libraries
import deplacy #Visualization
import trankit #Dependency parsing

In [None]:
#Initializing a pipeline for english
nlp_parser = trankit.Pipeline(lang="english", gpu=False)

Loading pretrained XLM-Roberta, this may take a while...


Loading tokenizer for english
Loading tagger for english
Loading lemmatizer for english
Loading NER tagger for english
Active language: english


In [None]:
#Simple Sentence
simple_sentences = ['Learning English is important nowadays.',
                    'Learning English and computer using are important nowadays.',
                    'I play some video games and learn English on my computer.',
                    'My sister and I play some video games and learn English on our computer.']
#Parsing for each sentence in list
for sentence in simple_sentences:
    deplacy.serve(nlp_parser(sentence), port=None)

In [None]:
#Compound Sentence using coordinating conjunctions (FANBOYS): F: for, A: and, N: nor, B: but, O: or, Y: yet, S: so
compound_sentences = ['Playing video game is fun, but it can be dangerous too.',
                      'Nature does not hurry, yet everything is accomplished.',
                      'I am working now, but we will eat later. ',
                      'Playing video game is fun, but it can be dangerous too, we must be careful.']
#Parsing for each sentence in list
for sentence in compound_sentences:
    deplacy.serve(nlp_parser(sentence), port=None)

In [None]:
#Complex Sentence that combines independent clauses with subordinate clauses using a subordinating conjunction
complex_sentences = ['Because I am working now, we will eat later. ',
                    'He always takes time to cover carefully his daughter even though he is extremely busy.',
                    'You should think about money saving from now if you want to study abroad.',
                    'Even though he is busy, he always takes time to cover carefully his daughter.']
#Parsing for each sentence in list
for sentence in complex_sentences:
    deplacy.serve(nlp_parser(sentence), port=None)

In [None]:
#Passive Voice Sentence
passive_sentences = ['The house was being painted when I arrived.',
                     'Over 20 models have been produced in the past two years.']
#Parsing for each sentence in list
for sentence in passive_sentences:
    deplacy.serve(nlp_parser(sentence), port=None)

    # nlp_parser(text) returns a dictionary:
    #   {'text': input text,
    #    'sentences': list of parsing results, one per sentence,
    #    'lang': language}

    # Parsing result for each sentence: a dictionary
    #   {'id': index of the sentence in the text, starting at 1,
    #    'text': sentence text,
    #    'tokens': list of parsing results, one per word,
    #    'dspan': (start, end) span of the sentence in the text}

    # Parsing result for each word: a dictionary
    #   {'id': index of the word in the sentence, starting at 1,
    #    'text': word,
    #    'upos': universal POS tag,
    #    'xpos': treebank-specific POS tag,
    #    'feats': morphological features separated by |,
    #    'head': index of the syntactic parent, 0 for ROOT,
    #    'deprel': syntactic relationship between HEAD and this word,
    #    'dspan': (start, end) span of the word in the text,
    #    'span': (start, end) span of the word in the sentence,
    #    'lemma': the word's lemma,
    #    'ner': named-entity tag}
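
The nested structure described above can be walked with two plain loops. The sketch below uses a small hand-written dictionary that only mimics that shape (it is not real Trankit output, and most token fields are omitted); `token_rows` is a hypothetical helper name.

```python
# Hand-written sample with the same nesting as the parser result (NOT real Trankit output).
parsed = {
    'text': 'We will eat later',
    'lang': 'english',
    'sentences': [
        {
            'id': 1,
            'text': 'We will eat later',
            'tokens': [
                {'id': 1, 'text': 'We', 'upos': 'PRON', 'head': 3, 'deprel': 'nsubj'},
                {'id': 2, 'text': 'will', 'upos': 'AUX', 'head': 3, 'deprel': 'aux'},
                {'id': 3, 'text': 'eat', 'upos': 'VERB', 'head': 0, 'deprel': 'root'},
                {'id': 4, 'text': 'later', 'upos': 'ADV', 'head': 3, 'deprel': 'advmod'},
            ],
        }
    ],
}

def token_rows(parsed):
    # Flatten the sentences -> tokens nesting into one row per word.
    rows = []
    for sentence in parsed['sentences']:
        for token in sentence['tokens']:
            rows.append((sentence['id'], token['id'], token['text'],
                         token['upos'], token['deprel'], token['head']))
    return rows

for row in token_rows(parsed):
    print(row)
```

The root word is the one whose `head` is 0; every other `head` value is the 1-based `id` of another token in the same sentence.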

Sentence Classifier

In [None]:
#Parsing for each word in text
def nlp_sentence_parser(text):
    sentences, nlp_sentences = [], []
    for nlp_sentence in nlp_parser(text)['sentences']: #using parsing result
        sentence = nlp_sentence['text'] #get text
        sentences.append(sentence)
        nlp_sentence = nlp_sentence['tokens'] #get list parsing result for each word in text
        # Pull out some necessary information in parsing result for each word
        pos_dep = [(nlp['text'], nlp['xpos'], nlp['upos'], nlp['deprel'], nlp['head']) for nlp in nlp_sentence]
        nlp_sentences.append(pos_dep)
    return sentences, nlp_sentences

nlp_sentence_parser("Because I am working now, we will eat later")

(['Because I am working now, we will eat later'],
 [[('Because', 'IN', 'SCONJ', 'mark', 4),
   ('I', 'PRP', 'PRON', 'nsubj', 4),
   ('am', 'VBP', 'AUX', 'aux', 4),
   ('working', 'VBG', 'VERB', 'advcl', 9),
   ('now', 'RB', 'ADV', 'advmod', 4),
   (',', ',', 'PUNCT', 'punct', 9),
   ('we', 'PRP', 'PRON', 'nsubj', 9),
   ('will', 'MD', 'AUX', 'aux', 9),
   ('eat', 'VB', 'VERB', 'root', 0),
   ('later', 'RBR', 'ADV', 'advmod', 9)]])
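
To see how the `head` indices in this output encode a parse tree, the tuples can be regrouped by parent and printed with indentation. This is a minimal sketch using only the standard library; the tuples are copied from the output above, while `children_by_head` and `render_tree` are hypothetical helper names.

```python
from collections import defaultdict

# (text, xpos, upos, deprel, head) tuples copied from the output above.
tokens = [
    ('Because', 'IN', 'SCONJ', 'mark', 4),
    ('I', 'PRP', 'PRON', 'nsubj', 4),
    ('am', 'VBP', 'AUX', 'aux', 4),
    ('working', 'VBG', 'VERB', 'advcl', 9),
    ('now', 'RB', 'ADV', 'advmod', 4),
    (',', ',', 'PUNCT', 'punct', 9),
    ('we', 'PRP', 'PRON', 'nsubj', 9),
    ('will', 'MD', 'AUX', 'aux', 9),
    ('eat', 'VB', 'VERB', 'root', 0),
    ('later', 'RBR', 'ADV', 'advmod', 9),
]

def children_by_head(tokens):
    # Map each head index to the 1-based positions of its dependents.
    children = defaultdict(list)
    for position, (_, _, _, _, head) in enumerate(tokens, start=1):
        children[head].append(position)
    return children

def render_tree(tokens, children, head=0, depth=0):
    # Depth-first walk starting from the virtual ROOT (head index 0).
    lines = []
    for position in children.get(head, []):
        text, _, _, deprel, _ = tokens[position - 1]
        lines.append('  ' * depth + f'{text} ({deprel})')
        lines.extend(render_tree(tokens, children, position, depth + 1))
    return lines

children = children_by_head(tokens)
print('\n'.join(render_tree(tokens, children)))
```

The subordinate clause "Because I am working now" appears as a subtree under `working (advcl)`, which itself hangs off the main verb `eat (root)`.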

In [None]:
#Sentence Classifier
def sentence_classifier(question_respond: str):
  #Import Counter
  from collections import Counter

  # Dependency relations that mark a subject, including passive forms
  subject_tags = ['csubj', 'csubj:pass', 'nsubj', 'nsubj:pass', 'xsubj']
  # nsubj (nominal subject) is the syntactic subject and the proto-agent of a clause.
  # csubj (clausal subject) is a clausal syntactic subject of a clause, i.e., the subject is itself a clause.
  # xsubj (controlling subject) is an older label that has been renamed to nsubj; it is kept here for completeness.

  #Classify each sentence into 4 classes: simple sentence, compound sentence, complex sentence, passive sentence
  simple_sentences, compound_sentences, complex_sentences, passive_sentences = [], [], [], []

  #Parsing for question_respond (paragraph) => return sentences and parsing result for each word in sentence
  sentences, nlp_sentences = nlp_sentence_parser(question_respond)

  #Loop over each (sentence, per-word parsing result) pair
  for sentence, nlp_sentence in zip(sentences, nlp_sentences):
    adv_clause_counter, wh_clause_counter, mark_counter, subject_tag_counter = 0, 0, 0, 0

    #Loop over the parsing result of each word; idx is the 0-based position of the word
    for idx, word_nlp in enumerate(nlp_sentence):
      #Unpack the tuple built by nlp_sentence_parser: 'text': word, 'xpos': treebank-specific POS tag, 'upos': universal POS tag,
      #'deprel': syntactic relationship between HEAD and this word, 'head': 1-based index of syntactic parent, 0 for ROOT
      text, xpos, upos, deprel, head = word_nlp

      #Check whether this word introduces a wh-clause or sits inside an adverbial clause
      #The word must be an adverbial modifier (deprel = advmod) whose universal POS tag is ADV
      if deprel == 'advmod' and upos == 'ADV':
        #A treebank-specific POS tag of WRB (wh-adverb, e.g. when, where, why, how) => wh-clause
        if xpos == 'WRB':
          wh_clause_counter += 1
        else:
          #Analyze the sub-sentence between this word and its syntactic parent
          #Note: idx is 0-based and head is 1-based, so the first slice runs from this word through its head,
          #while the second slice covers only the words strictly between the head and this word
          sub_nlp = nlp_sentence[idx:head] if idx < head else nlp_sentence[head:idx]
          #Frequency of each deprel in the sub-sentence
          sub_nlp_counter = Counter([token[3] for token in sub_nlp])
          #Number of subjects in the sub-sentence, taken as evidence of an adverbial clause
          sub_subject_tag = sum([sub_nlp_counter[subj] for subj in subject_tags])
          adv_clause_counter += sub_subject_tag

      #check if syntactic relationship between HEAD and this word = mark
      if deprel == 'mark':
        mark_counter += 1

      #check if syntactic relationship between HEAD and this word is in subject_tags
      if deprel in subject_tags:
        subject_tag_counter += 1

    #Classifying
    #At least one mark, adverbial clause, or wh-clause => complex sentence
    if mark_counter >= 1 or adv_clause_counter >= 1 or wh_clause_counter >= 1:
      complex_sentences.append(sentence)

    #No mark (already guaranteed by the branch above) and exactly one subject => simple sentence
    elif mark_counter == 0 and subject_tag_counter == 1:
      simple_sentences.append(sentence)

    else: #Everything else => compound sentence
      compound_sentences.append(sentence)

    #A sentence containing both nsubj:pass and aux:pass is also labeled passive, in addition to the class assigned above
    dep_sentence = [word_nlp[3] for word_nlp in nlp_sentence] #deprel of each word in the sentence
    if 'nsubj:pass' in dep_sentence and 'aux:pass' in dep_sentence:
      passive_sentences.append(sentence)

  return {'simple-sentences': simple_sentences,
          'compound-sentences': compound_sentences,
          'complex-sentences': complex_sentences,
          'passive-sentences':passive_sentences}

sentence_classifier(" ".join(simple_sentences + compound_sentences + complex_sentences + passive_sentences))

{'simple-sentences': ['Learning English is important nowadays.',
  'Learning English and computer using are important nowadays.',
  'I play some video games and learn English on my computer.',
  'My sister and I play some video games and learn English on our computer.',
  'Over 20 models have been produced in the past two years.'],
 'compound-sentences': ['Playing video game is fun, but it can be dangerous too.',
  'Nature does not hurry, yet everything is accomplished.',
  'I am working now, but we will eat later.',
  'Playing video game is fun, but it can be dangerous too, we must be careful.'],
 'complex-sentences': ['Because I am working now, we will eat later.',
  'He always takes time to cover carefully his daughter even though he is extremely busy.',
  'You should think about money saving from now if you want to study abroad.',
  'Even though he is busy, he always takes time to cover carefully his daughter.',
  'The house was being painted when I arrived.'],
 'passive-sentences'