<a href="https://colab.research.google.com/github/LeoMaggio/Deep-NLP/blob/main/practices/P4/Practice_4_NER_and_Intent_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Moreno La Quatra

**Practice 4:** Named Entity Recognition & Intent Detection

## Named Entity Recognition

Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

![https://miro.medium.com/max/875/0*mlwDqNm7DFc_4maP.jpeg](https://miro.medium.com/max/875/0*mlwDqNm7DFc_4maP.jpeg)   

The text domain (political, medical, food, ...) is **crucial** when recognizing entities: the relevant entity types and the way they are mentioned change from one domain to another.

In this practice you will:
- Experiment with pre-trained models to extract entities from text

### **Question 1: data preparation**

The data collection is available [here](https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/NER/wikigold.conll.txt). 
This dataset was presented in [1][2] and consists of a set of manually annotated Wikipedia texts. The data is already in [CoNLL](https://simpletransformers.ai/docs/ner-data-formats/#text-file-in-conll-format) format. Please read the format description carefully before parsing the data.

You need to extract the clean sentences (without annotations) and, for each sentence, the text associated with each entity:
- `sentences`: list of sentences
- `annotations`: list of lists of entities (both the entity string and its class). E.g., `[[('010', 'MISC'), ('Japanese', 'MISC'), ('The Mad Capsule Markets', 'ORG')], [('Osc-Dis', 'MISC'), ('Introduction 010', 'MISC'), ('Come', 'MISC')], ...]`. You can remove the `I-` prefix because in this data collection the prefixes do not carry any useful information.

---


[1] Balasuriya, Dominic, et al. "Named entity recognition in wikipedia."
    Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources. Association for Computational Linguistics, 2009.

[2] Nothman, Joel, et al. "Learning multilingual named entity recognition
    from Wikipedia." Artificial Intelligence 194 (2013): 151-175 

In [2]:
%%capture
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/NER/wikigold.conll.txt

In [79]:
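import re

def split_text_label(filename):
    """Parse a CoNLL file into labeled tokens, clean sentences, and entity lists."""
    split_labeled_text, split_text, split_labels = [], [], []
    sentence, annotations, text, prev_label = [], [], "", "O"
    with open(filename, encoding="utf-8") as f:
        for line in f:
            if not line.strip() or line.startswith('-DOCSTART'):
                # Sentence boundary: store the current sentence, if any, and reset.
                if sentence:
                    split_labeled_text.append(sentence)
                    split_text.append(text)
                    split_labels.append(annotations)
                    sentence, annotations, text = [], [], ""
                prev_label = "O"
                continue
            splits = line.split()
            token, tag = splits[0], splits[-1]
            sentence.append([token, tag])
            # Remove the "I-" prefix explicitly (str.lstrip strips characters, not a prefix).
            label = tag[2:] if tag.startswith("I-") else tag
            if label != "O":
                if label == prev_label:
                    # Same class as the previous token: extend the current entity.
                    annotations[-1] = (f"{annotations[-1][0]} {token}", label)
                else:
                    annotations.append((token, label))
            prev_label = label
            text = f"{text} {token}"
            # Re-attach punctuation to the preceding word.
            text = re.sub(r'\s([?.!,)"](?:\s|$))', r'\1', text.strip())
    # Store the last sentence if the file does not end with a blank line.
    if sentence:
        split_labeled_text.append(sentence)
        split_text.append(text)
        split_labels.append(annotations)
    return split_labeled_text, split_text, split_labels

sentences_with_labels, sentences, ground_truths = split_text_label("wikigold.conll.txt")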

### **Question 2: inference with spacy for entity recognition**

spaCy pipelines come with a built-in NER component. Instantiate a [spaCy model](https://spacy.io/usage/models) for the English language and, for each sentence in the data collection, get the named entities extracted by the model.

Since the provided data collection contains only a subset of the spaCy labels, map every class that is not available in the data collection to the `MISC` class.
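
The mapping could be sketched as follows. The table `SPACY_TO_WIKIGOLD` is an assumption, not part of the assignment: it renames spaCy's `PERSON` to `PER` and treats `GPE` (countries, cities) as `LOC`. Every other spaCy label falls back to `MISC`. Adjust the table if you prefer a different correspondence.

```python
# Assumed correspondence between spaCy labels and the dataset labels.
SPACY_TO_WIKIGOLD = {
    "PERSON": "PER",
    "ORG": "ORG",
    "LOC": "LOC",
    "GPE": "LOC",  # assumption: geopolitical entities count as locations
}

def map_label(spacy_label):
    """Return the dataset label for a spaCy label, defaulting to MISC."""
    return SPACY_TO_WIKIGOLD.get(spacy_label, "MISC")

def map_entities(entities):
    """Apply map_label to a list of (text, label) tuples."""
    return [(text, map_label(label)) for text, label in entities]

print(map_entities([("Turin", "GPE"), ("2009", "DATE"), ("Marie Curie", "PERSON")]))
# [('Turin', 'LOC'), ('2009', 'MISC'), ('Marie Curie', 'PER')]
```

A function like `map_entities` can be applied to the list of `(text, label)` tuples extracted from each spaCy `Doc` before evaluation.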

In [42]:
%%capture
!pip install -U spacy 
!python -m spacy download en_core_web_sm

In [43]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [76]:
def get_ents(doc):
  sentence = []
  if doc.ents:
    for ent in doc.ents:
      sentence.append((ent.text, ent.label_))
  return sentence

In [80]:
predictions_spacy = []
for s in sentences:
  doc = nlp(s)
  # Keep one (possibly empty) list per sentence so that predictions
  # stay aligned with `sentences` and `ground_truths`.
  predictions_spacy.append(get_ents(doc))

### **Question 3: compute metrics for evaluating NER**

Use [eval4ner](https://github.com/cyk1337/eval4ner) to evaluate the spacy model for NER on the parsed dataset.

**Note**: please use `pip install git+https://github.com/MorenoLaQuatra/eval4ner` to install a fixed version of the library. Before passing the arguments to the evaluation function, create a deep copy of each variable (e.g., with `copy.deepcopy`).

The issue has already been reported to the original author.
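
To see why a deep copy matters for nested lists, note that an expression such as `ground_truths * 1` only creates a new outer list: the inner per-sentence lists are still shared with the original. The sketch below illustrates the difference; `mutate` is a hypothetical stand-in for any function that edits its input in place, not the library's actual code.

```python
import copy

def mutate(annotations):
    """Hypothetical stand-in for a function that edits its input in place."""
    for sentence_entities in annotations:
        sentence_entities.clear()

original = [[("Turin", "LOC")], [("PoliTO", "ORG")]]

shallow = original * 1          # new outer list, same inner lists
deep = copy.deepcopy(original)  # fully independent copy

mutate(deep)
unchanged_after_deep = original == [[("Turin", "LOC")], [("PoliTO", "ORG")]]

mutate(shallow)
emptied_after_shallow = original == [[], []]

print(unchanged_after_deep, emptied_after_shallow)  # True True
```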

In [61]:
%%capture
!pip install git+https://github.com/MorenoLaQuatra/eval4ner

In [82]:
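import copy
import eval4ner.muc as muc

# Pass deep copies so that the evaluation cannot alter the original variables.
muc.evaluate_all(copy.deepcopy(predictions_spacy), copy.deepcopy(ground_truths), copy.deepcopy(sentences), verbose=True)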

AssertionError: ignored

### **Question 4: inference with transformers pipeline**

Transformer-based models can be fine-tuned for token-level classification. The most relevant task in this class is NER. Use [transformers pipelines](https://huggingface.co/transformers/master/main_classes/pipelines.html#transformers.TokenClassificationPipeline) to recognize entities in the previous data collection. 

Evaluate the model using the same procedure as in Q3.

**Note:** the output format of the pipeline differs from the spaCy one. Make sure to process the data correctly before running the evaluation.

**Note 2:** the `ignore_labels` parameter could be useful to parse entities correctly.

**Note 3:** the `##` symbol marks a token that is a continuation of the previous one (e.g., `PoliTO` may be split into `Poli` + `##TO`).
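
One possible way to turn token-level predictions into `(text, label)` entities is sketched below. It assumes each prediction is a dict with the keys `"word"`, `"entity"` (e.g. `B-ORG` / `I-ORG`) and `"index"` (token position), as returned by the token-classification pipeline when no aggregation is applied, and that `O` tokens have already been dropped. The function name `merge_word_pieces` is hypothetical. Recent versions of transformers also accept an `aggregation_strategy` argument in the pipeline, which may make manual merging unnecessary.

```python
def merge_word_pieces(tokens):
    """Group token-level predictions into (text, label) entities."""
    entities = []
    prev_index = None
    for tok in tokens:
        word, tag = tok["word"], tok["entity"]
        label = tag.split("-")[-1]  # drop the B-/I- prefix
        # True when this token directly follows the previously kept token.
        adjacent = prev_index is not None and tok["index"] == prev_index + 1
        if word.startswith("##"):
            piece = word[2:]
            if adjacent:
                # Continuation of the previous word: glue without a space.
                text, ent_label = entities[-1]
                entities[-1] = (text + piece, ent_label)
            else:
                entities.append((piece, label))
        elif adjacent and not tag.startswith("B-") and label == entities[-1][1]:
            # Next word of the same entity: append with a space.
            text, ent_label = entities[-1]
            entities[-1] = (f"{text} {word}", ent_label)
        else:
            entities.append((word, label))
        prev_index = tok["index"]
    return entities

example = [
    {"word": "Poli", "entity": "B-ORG", "index": 1},
    {"word": "##TO", "entity": "I-ORG", "index": 2},
    {"word": "Moreno", "entity": "B-PER", "index": 5},
    {"word": "La", "entity": "I-PER", "index": 6},
]
print(merge_word_pieces(example))
# [('PoliTO', 'ORG'), ('Moreno La', 'PER')]
```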

In [None]:
%%capture
! pip install datasets transformers

In [None]:
# Your code here

## Intent Detection

In data mining, intent mining (or intention mining) is the problem of determining a user's intention from logs of their interaction with a computer system, such as a search engine. Intent detection is the identification and categorization of what a user intended or wanted to find when typing or speaking to a conversational agent (or a search engine).

![https://d33wubrfki0l68.cloudfront.net/32e2326762c75a0357ab1ae1976a60d4bbce724b/f4ac0/static/a5878ba6b0e4e77163dc07d07ecf2291/2b6c7/intent-classification-normal.png](https://d33wubrfki0l68.cloudfront.net/32e2326762c75a0357ab1ae1976a60d4bbce724b/f4ac0/static/a5878ba6b0e4e77163dc07d07ecf2291/2b6c7/intent-classification-normal.png)

Data source (ATIS dataset): https://github.com/yvchen/JointSLU ; https://www.kaggle.com/siddhadev/atis-dataset-clean/home

Use the provided train/dev/test split: train on the training set, use the dev set for model selection, and report results on the test set.

In [1]:
%%capture
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/IntentDetection/atis.train.csv
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/IntentDetection/atis.dev.csv
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/IntentDetection/atis.test.csv

### **Question 5: two-step classification model**

Train a classification model to identify the intent from the sentence text. The model should use a pretrained BERT model to generate features for each sentence (no fine-tuning); a separate classifier is then trained on these features.

![https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P4/IntentDetection/no_finetuning.png?raw=true](https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P4/IntentDetection/no_finetuning.png?raw=true)


Assess the performance of the generated model by using the **classification accuracy**.
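
Classification accuracy is the fraction of examples whose predicted intent matches the gold intent. A minimal pure-Python sketch is shown below; the function name `accuracy` and the example labels are illustrative, and `sklearn.metrics.accuracy_score` computes the same quantity.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the gold label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

gold = ["atis_flight", "atis_airfare", "atis_flight", "atis_abbreviation"]
pred = ["atis_flight", "atis_flight", "atis_flight", "atis_abbreviation"]
print(accuracy(gold, pred))  # 0.75
```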

In [2]:
%%capture
!pip install transformers

In [3]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

In [7]:
df_train = pd.read_csv('https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/IntentDetection/atis.train.csv')
df_dev = pd.read_csv('https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/IntentDetection/atis.dev.csv')
df_test = pd.read_csv('https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P4/IntentDetection/atis.test.csv')

In [23]:
df_train

| | id | tokens | slots | intent |
|---|---|---|---|---|
| 0 | train-00001 | BOS what is the cost of a round trip flight fr... | O O O O O O O B-round_trip I-round_trip O O B-... | atis_airfare |
| 1 | train-00002 | BOS now i need a flight leaving fort worth and... | O O O O O O O B-fromloc.city_name I-fromloc.ci... | atis_flight |
| 2 | train-00003 | BOS i need to fly from kansas city to chicago ... | O O O O O O B-fromloc.city_name I-fromloc.city... | atis_flight |
| 3 | train-00004 | BOS what is the meaning of meal code s EOS | O O O O O O B-meal_code I-meal_code I-meal_code O | atis_abbreviation |
| 4 | train-00005 | BOS show me all flights from denver to pittsbu... | O O O O O O B-fromloc.city_name O B-toloc.city... | atis_flight |
| ... | ... | ... | ... | ... |
| 4269 | train-04270 | BOS show me flights from baltimore to philadel... | O O O O O B-fromloc.city_name O B-toloc.city_n... | atis_flight |
| 4270 | train-04271 | BOS i need information on flights from indiana... | O O O O O O O B-fromloc.city_name O B-toloc.ci... | atis_flight |
| 4271 | train-04272 | BOS what flights are there from phoenix to mil... | O O O O O O B-fromloc.city_name O B-toloc.city... | atis_flight |
| 4272 | train-04273 | BOS please show me the flights from atlanta to... | O O O O O O O B-fromloc.city_name O B-toloc.ci... | atis_flight |


In [10]:
%%capture
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [17]:
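# Tokenize each sentence; add_special_tokens=True adds the [CLS] and [SEP] tokens.
tokenized = df_train.tokens.apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

In [18]:
# Pad every sequence with 0 (the [PAD] token id) up to the length of the longest one.
max_len = max(len(i) for i in tokenized.values)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

In [19]:
# Attention mask: 1 for real tokens, 0 for padding.
attention_mask = np.where(padded != 0, 1, 0)

In [20]:
input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

# Forward pass without gradient tracking: the model is only used as a feature extractor.
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

In [22]:
# Use the hidden state of the first token ([CLS]) as the feature vector of each sentence.
train_features = last_hidden_states[0][:,0,:].numpy()

In [None]:
# Intent labels used as classification targets.
train_labels = df_train.intent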

### **Question 6: fine-tuned classification model**

Train a new BERT model for the task of [sequence classification](https://huggingface.co/transformers/model_doc/bert.html#bertforsequenceclassification), this time fine-tuning BERT together with the classification head.
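
A sequence-classification head expects integer class ids rather than intent strings such as `atis_flight`. One possible way to build the mapping is sketched below in plain Python; the helper name `build_label_maps` and the sample intents are hypothetical. The number of distinct labels is the value you would pass as `num_labels` when loading the model.

```python
def build_label_maps(labels):
    """Build label -> id and id -> label dictionaries from a list of labels."""
    unique = sorted(set(labels))
    label2id = {label: idx for idx, label in enumerate(unique)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label

train_intents = ["atis_flight", "atis_airfare", "atis_flight", "atis_abbreviation"]
label2id, id2label = build_label_maps(train_intents)
encoded = [label2id[label] for label in train_intents]

print(label2id)  # {'atis_abbreviation': 0, 'atis_airfare': 1, 'atis_flight': 2}
print(encoded)   # [2, 1, 2, 0]
```

Labels that occur only in the dev or test split would raise a `KeyError` with this mapping, so you may want to build it from the union of all splits.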

![https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P4/IntentDetection/finetuning.png?raw=true](https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P4/IntentDetection/finetuning.png?raw=true)

Assess the performance of the generated model by using the **classification accuracy**.

Which model has better performance?

In [None]:
# Your code here