# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Giuseppe Gallipoli

**Credits:** Moreno La Quatra

**Practice 1:** Text processing and topic modeling

# **Text processing**
---
The text processing phase is a preliminary stage in which raw text is prepared for subsequent analysis.

Text processing usually entails several steps, which may include:
- **Language Identification**: identifying the language of a given text.
- **Tokenization**: splitting a given text into sentences or words.
- **Dependency tree parsing:** analyzing the dependencies between the words composing the text.
- **Stemming/Lemmatization:** obtaining the root form of each word in the text.
- **Stopword removal**: removing words that are so commonly used that they carry very little useful information.
- **Part of Speech Tagging:** given a word, retrieving its part of speech (e.g., proper noun, common noun, or verb).



### Language Identification

| Text                                                                                                                                | Language Code |
|-------------------------------------------------------------------------------------------------------------------------------------|---------------|
| The "Deep Natural Language Processing" course is offered during the first semester of the second year at Politecnico di Torino      | `EN`            |
| Il corso "Deep Natural Language Processing" viene impartito al Politecnico di Torino durante il primo semestre del secondo anno.    | `IT`            |
| Le cours "Deep Natural Language Processing" est enseigné au Politecnico di Torino pendant le premier semestre de la deuxième année. | `FR`            |

**Language Identification** is a crucial preliminary step because each language has its own characteristics. Knowing the main language of a given text benefits all subsequent steps in the text processing pipeline.

The data collection used in this first part of the practice is provided [here](https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P1/langid_dataset.csv) - [source: Kaggle](https://www.kaggle.com/martinkk5575/language-detection)


# Exercise 1

Benchmark different language-detection algorithms by computing the accuracy of each approach:
- [fastlangid](https://pypi.org/project/fastlangid/) (built on FastText)
- [LangID](https://github.com/saffsd/langid.py)
- [langdetect](https://pypi.org/project/langdetect/)

**Hint:** language code conversion: [iso639-lang](https://pypi.org/project/iso639-lang/)

For each method report:
- Accuracy
- Average time per example
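The two metrics above can be collected with a small harness like the following. This is only a sketch: `benchmark`, `toy_detect`, and the four sample sentences are hypothetical names and data introduced for illustration, and `toy_detect` stands in for a real detector (fastlangid, LangID, or langdetect) wrapped as a function from text to language code. It also assumes gold labels and predictions already use the same language codes.

```python
import time
from typing import Callable, Iterable, Tuple


def benchmark(detect: Callable[[str], str],
              examples: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (accuracy, average seconds per example) for a detector."""
    correct = 0
    total = 0
    start = time.perf_counter()
    for text, gold in examples:
        # Compare codes case-insensitively ("EN" vs "en")
        if detect(text).lower() == gold.lower():
            correct += 1
        total += 1
    elapsed = time.perf_counter() - start
    if total == 0:
        raise ValueError("no examples provided")
    return correct / total, elapsed / total


def toy_detect(text: str) -> str:
    """Hypothetical keyword-based detector, used only to exercise the harness."""
    words = set(text.lower().split())
    if words & {"il", "viene", "durante"}:
        return "it"
    if words & {"le", "est", "pendant"}:
        return "fr"
    return "en"


# Made-up (text, gold label) pairs; replace with rows from the dataset
samples = [
    ("The course is offered during the first semester", "en"),
    ("Il corso viene impartito durante il primo semestre", "it"),
    ("Le cours est enseigné pendant le premier semestre", "fr"),
    ("Der Kurs wird im ersten Semester angeboten", "de"),
]

accuracy, avg_time = benchmark(toy_detect, samples)
print(f"accuracy={accuracy:.2f}, avg time per example={avg_time:.6f}s")
```

Timing the whole loop and dividing by the number of examples keeps the measurement overhead low. If the dataset stores language names rather than codes, convert them before calling the harness.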

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/langid_dataset.csv

In [None]:
# your code here

# Exercise 2

For English-written text, apply word-level tokenization. What is the average number of words per sentence?

Implement word-level tokenization using both [nltk](https://www.nltk.org/) and [spaCy](https://spacy.io/), and report the results for each.

For spaCy use the `en_core_web_sm` model.
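The averaging step itself does not depend on the tokenizer. The sketch below uses a plain regular expression as a stand-in tokenizer (an assumption made to keep the example dependency-free, and not equivalent to nltk or spaCy). `simple_word_tokenize` and the three sentences are hypothetical.

```python
import re
from statistics import mean
from typing import List


def simple_word_tokenize(sentence: str) -> List[str]:
    """Stand-in tokenizer: runs of word characters, punctuation dropped."""
    return re.findall(r"\w+", sentence)


# Made-up sentences; in the exercise these come from the English rows of the dataset
sentences = [
    "Tokenization splits text into words.",
    "It is a preliminary step.",
    "Short one.",
]

avg_words = mean(len(simple_word_tokenize(s)) for s in sentences)
print(avg_words)
```

Swapping `simple_word_tokenize` for a library tokenizer leaves the rest unchanged. Library tokenizers usually emit punctuation marks as separate tokens, so their averages can be higher than those of this stand-in.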

In [None]:
# your code here

# Exercise 3

Dependency parsing analyzes the grammatical structure of a sentence. Its main goal is to identify which words are related and the type of relationship between them.

The output of this step is a dependency tree similar to the one reported in the figure below.

![dependency tree](http://www.rangakrish.com/wp-content/uploads/2018/04/Deptree-example2.png)

Use spaCy to parse the dependency tree of a **randomly selected** sentence. You can use either an English sentence or one in your native language (if supported by [spaCy](https://spacy.io/usage/models/)). Use [displaCy](https://explosion.ai/demos/displacy) to visualize the result in the notebook.

In [None]:
# your code here

# Exercise 4
For the same sentence selected in the previous exercise, apply all the following steps:
1. Lemmatization: convert each word to its lemma (dictionary form).
2. Stopword removal: remove language-specific stopwords.
3. Part of Speech Tagging: for each word in the sentence, display its part of speech.

For each step, print the resulting list.

In [None]:
# your code here

# **Occurrence-based text representation - TF-IDF**

---

TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It allows us to create an occurrence-based vector representation of each document.
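To make the measure concrete, the sketch below computes one common textbook variant: term frequency as a relative count, multiplied by `log(N / df)`. The three toy documents and the `tfidf` helper are hypothetical. Library implementations such as scikit-learn's apply smoothing and normalization by default, so their numbers will differ from these.

```python
import math
from collections import Counter

# Made-up toy corpus, whitespace-tokenized for simplicity
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
df = Counter(term for doc in tokenized for term in set(doc))


def tfidf(doc_tokens):
    """Map each term of one document to tf * idf (unsmoothed variant)."""
    counts = Counter(doc_tokens)
    return {
        term: (count / len(doc_tokens)) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }


vectors = [tfidf(doc) for doc in tokenized]
print(vectors[0])
```

In the first document, `cat` appears in only one document and therefore scores higher than `sat`, which appears in two.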

# Exercise 5
Use TF-IDF to vectorize each sentence in the original data collection. You can choose your preferred implementation of TF-IDF vectorization; one is available in the [scikit-learn library](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

In [None]:
# your code here

# Exercise 6

Build a supervised multi-class language detector that uses the TF-IDF vectors as features. Use 80% of the data to train the language detector and the remaining 20% to assess its accuracy.
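The 80/20 split can be sketched without any library, as below. `split_indices` is a hypothetical helper. In practice, scikit-learn's `train_test_split` offers the same functionality, including stratification by label.

```python
import random


def split_indices(n_examples, test_fraction=0.2, seed=42):
    """Shuffle example indices reproducibly and split them into train/test."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    n_test = round(n_examples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

Shuffling before splitting matters when the dataset is sorted by language; otherwise some languages could be missing from the training set entirely.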

In [None]:
# your code here

# **Topic Modeling**
---
Occurrence-based representations are high-dimensional: what is the dimension of the TF-IDF vector representation generated in Exercise 5?

Topic modeling focuses on capturing latent topics in large document corpora.

The data collection used in this second part of the practice is provided [here](https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv) - [source: Zenodo](https://zenodo.org/record/4282522#.YVdCXcbOOpd)


# Exercise 7

Latent Semantic Indexing (LSI) models underlying concepts by using SVD (Singular Value Decomposition).
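As a sketch of the underlying idea, let $X$ denote the term-document matrix of the corpus, with $m$ terms and $n$ documents (these symbols are introduced here for illustration). LSI keeps only the $k$ largest singular values of the SVD of $X$:

$$
X \approx X_k = U_k \Sigma_k V_k^{\top}
$$

Here $U_k$ is an $m \times k$ matrix relating terms to the $k$ latent topics, $\Sigma_k$ is the $k \times k$ diagonal matrix of the largest singular values, and $V_k$ is an $n \times k$ matrix relating documents to the same topics.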

Use [gensim](https://radimrehurek.com/gensim/) library to:
1. Create a corpus composed of the headlines contained in the data collection.
2. Generate a [dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html) to create a word -> id mapping (required by LSI module).
3. Using the dictionary, preprocess the corpus to obtain the representation required for LSI model training ([documentation here](https://radimrehurek.com/gensim/models/lsimodel.html)).
4. Inspect the top-5 topics generated by the LSI model for the analysed corpus.
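Steps 2 and 3 revolve around two data structures: a word-to-id mapping and a bag-of-words corpus made of `(token_id, count)` pairs. The pure-Python sketch below mimics them to show their shape. It does not use gensim's API, the ids gensim assigns may differ, and the tokenized headlines are made up.

```python
from collections import Counter

# Made-up, already tokenized headlines
headlines = [
    ["covid", "vaccine", "trial"],
    ["vaccine", "rollout", "begins"],
    ["covid", "cases", "rise", "covid"],
]

# Word -> id mapping: each new token gets the next free integer id
token2id = {}
for doc in headlines:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


bow_corpus = [to_bow(doc) for doc in headlines]
print(token2id)
print(bow_corpus)
```

Each document becomes a sparse vector: only the tokens it contains are listed, which keeps the representation compact even for large vocabularies.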

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv

In [None]:
# your code here

# Exercise 8

If no stopword removal is applied, the top-scored words contributing to each topic are common English words (e.g., *to, for, in, of, on*). Moreover, leaving punctuation in place can hinder topic identification. Repeat the procedure of Exercise 7, adding a preliminary preprocessing step to:
1. **remove stopwords**
2. **strip punctuation**
3. **lowercase all words**
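The three steps above can be sketched in plain Python as follows. The stopword set is a tiny hypothetical list used only for illustration (real lists, such as those shipped with nltk or gensim, are much longer), and the example headline is made up.

```python
import string

# Hypothetical, deliberately tiny stopword list
STOPWORDS = {"to", "for", "in", "of", "on", "the", "a", "is"}

# Translation table that deletes ASCII punctuation characters
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(_PUNCT_TABLE)
    return [token for token in text.split() if token not in STOPWORDS]


tokens = preprocess("Masks are required in the shops, says Minister!")
print(tokens)
```

Note that `string.punctuation` covers ASCII characters only, so typographic quotes or dashes in the headlines would need separate handling.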

In [None]:
# your code here

# Exercise 9

Leveraging the same corpus used for LSI model generation, apply LDA modeling, setting the number of topics to 5. Display the words that contribute most to those topics according to the LDA model.

In [None]:
# your code here

# Exercise 10

Using the [pyLDAvis]() library, build an interactive visualization of the trained LDA model.

In [None]:
# your code here

# Exercise 11
**Credits:** Giuseppe Gallipoli

#### Introduction
[Large Language Models](https://en.wikipedia.org/wiki/Large_language_model) (LLMs) are a type of deep learning model capable of language generation. These models are built on deep learning architectures, primarily using neural networks, and are trained on massive amounts of text data. LLMs generally leverage the *Transformer* architecture (a groundbreaking deep architecture that you will study in more detail later in the course), which allows them to process language in context, capturing complex relationships between words and concepts.

Large Language Models have demonstrated excellent capabilities across a wide variety of tasks, making them versatile models which can be applied in diverse scenarios and use cases.

Given their relevance, although you have not yet covered this topic in the course, we will provide you, starting from this first laboratory practice, with practical applications showing how LLMs can be used to solve a diverse range of tasks.\
Don't worry about the theoretical or more technical aspects: they will be covered in more detail in due time.\
For now, the most important thing to know is that users interact with LLMs by means of a **prompt**, which is a piece of text containing the instruction or question the user wants to give or ask the model.

#### Topic Modeling using Large Language Models

In this practice, we will use a Large Language Model to address a topic modeling-related task. Specifically, rather than modeling topic distributions as done with techniques like LSI or LDA, we will ask the LLM to extract the most relevant topic(s) from sentences (or from an entire corpus) according to different approaches.

For this task, we will use the [Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) 7B model, i.e., `HuggingFaceH4/zephyr-7b-beta`.

**1<sup>st</sup> approach**: Ask the model to identify the topic(s) contained in a given sentence <u>without providing</u> a predefined list of topics to choose from.\
*Example of prompt*:\
Which are the most relevant topics of the following sentence?

\
<u>Suggestion</u>: To increase speed, switch to a GPU runtime. You can do this by clicking on Runtime → Change runtime type → Hardware accelerator → Select T4 GPU.\
If you encounter an `OutOfMemoryError`, try restarting the session by clicking on Runtime → Restart session.
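One possible way to combine the instruction with each sentence is sketched below. The layout of the full prompt (instruction, then `Sentence:`, then `Topics:`) is an assumption, not a format required by the model, and `build_prompt` and the sample sentences are hypothetical. Sampling a few sentences keeps the run time manageable.

```python
import random

PROMPT = "Which are the most relevant topics of the following sentence?"


def build_prompt(instruction, sentence):
    """Join the instruction and one sentence into a single prompt string."""
    return f"{instruction}\nSentence: {sentence}\nTopics:"


# Made-up sentences; in the exercise these come from the headlines in the dataset
sentences = [
    "New vaccine trial reports promising results",
    "Local garden show attracts record crowds",
    "Hospitals expand treatment capacity",
]

# Fixed seed so the same sentences are drawn on every run
sampled = random.Random(0).sample(sentences, k=2)
prompts = [build_prompt(PROMPT, s) for s in sampled]
for p in prompts:
    print(p, end="\n\n")
```

Ending the prompt with an explicit cue such as `Topics:` is a design choice that often nudges a generative model to answer directly; other layouts are worth trying.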

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv

In [None]:
# fill in the following code

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import pandas as pd
import random

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

df_tmodeling = ...

PROMPT = "Write your prompt here"

# It may take some time to download the model
model = AutoModelForCausalLM.from_pretrained('HuggingFaceH4/zephyr-7b-beta', torch_dtype=torch.float16, device_map=device)
tokenizer = AutoTokenizer.from_pretrained('HuggingFaceH4/zephyr-7b-beta')

for ...
  full_prompt = ...
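  # `inputs` avoids shadowing Python's built-in `input`
  inputs = tokenizer(full_prompt, return_tensors='pt').to(device)
  output = model.generate(**inputs, max_new_tokens=32)
  # The decoded text contains the prompt followed by the newly generated tokens
  output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]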
  ...
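With a causal language model, the decoded output normally starts with the prompt itself, followed by the generated continuation. A small helper like the hypothetical `strip_prompt` below can isolate the model's answer. It is a sketch that assumes the decoded text reproduces the prompt character for character, and it falls back to the full text otherwise. The model output shown is made up.

```python
def strip_prompt(decoded: str, prompt: str) -> str:
    """Return only the generated continuation when `decoded` starts with `prompt`."""
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    # Fallback: the prompt was not reproduced verbatim, so keep everything
    return decoded.strip()


full_prompt = (
    "Which are the most relevant topics of the following sentence?\n"
    "Sentence: (example sentence)\n"
    "Topics:"
)
decoded = full_prompt + " health, vaccines"  # made-up model output
answer = strip_prompt(decoded, full_prompt)
print(answer)
```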

**2<sup>nd</sup> approach**: Ask the model to identify the topic(s) contained in a given sentence <u>providing</u> a predefined list of topics to choose from.

*Example of prompt*:\
Which are the most relevant topics of the following sentence?\
Choose among: medicine, COVID, Artificial Intelligence, treatment, English literature, vaccine, gardening

In [None]:
# fill in the following code

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import pandas as pd
import random

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

df_tmodeling = ...

PROMPT = "Write your prompt here"

# It may take some time to download the model. If you have already downloaded and loaded it, you can skip this part
model = AutoModelForCausalLM.from_pretrained('HuggingFaceH4/zephyr-7b-beta', torch_dtype=torch.float16, device_map=device)
tokenizer = AutoTokenizer.from_pretrained('HuggingFaceH4/zephyr-7b-beta')

for ...
  full_prompt = ...
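  # `inputs` avoids shadowing Python's built-in `input`
  inputs = tokenizer(full_prompt, return_tensors='pt').to(device)
  output = model.generate(**inputs, max_new_tokens=32)
  # The decoded text contains the prompt followed by the newly generated tokens
  output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]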
  ...

**3<sup>rd</sup> approach**: Ask the model to identify the topic(s) contained in a given sentence <u>providing</u> a predefined list of topics to choose from along with the corresponding definitions.

*Example of prompt*:\
Which are the most relevant topics of the following sentence?\
Choose among:
- medicine: treatment for illness or injury, or the study of this
- COVID: an infectious disease caused by a coronavirus
- Artificial Intelligence: computer systems that have some of the qualities that the human brain has, such as the ability to learn from data
- treatment: the use of drugs to cure a person of an illness or injury
- English literature: artistic works written in the English language, especially those with a high and lasting artistic value
- vaccine: a substance that is put into the body of a person or animal to protect them from a disease
- gardening: the job or activity of working in a garden

Definitions are taken from the [Cambridge Dictionary](https://dictionary.cambridge.org/) and have been slightly adapted.
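A prompt with candidate topics, with or without definitions, can be assembled programmatically. The sketch below is one possible layout: `build_topic_prompt` is a hypothetical helper, and the final `Sentence:` line is an assumption about how the sentence is attached to the instruction.

```python
from typing import Mapping, Optional, Sequence


def build_topic_prompt(sentence: str,
                       topics: Optional[Sequence[str]] = None,
                       definitions: Optional[Mapping[str, str]] = None) -> str:
    """Build a topic-identification prompt.

    With `definitions`, each topic is listed on its own line with its definition;
    with only `topics`, the candidates are listed on one comma-separated line.
    """
    lines = ["Which are the most relevant topics of the following sentence?"]
    if definitions:
        lines.append("Choose among:")
        lines.extend(f"- {topic}: {text}" for topic, text in definitions.items())
    elif topics:
        lines.append("Choose among: " + ", ".join(topics))
    lines.append(f"Sentence: {sentence}")
    return "\n".join(lines)


with_list = build_topic_prompt("(example sentence)", topics=["medicine", "COVID"])
with_defs = build_topic_prompt(
    "(example sentence)",
    definitions={"COVID": "an infectious disease caused by a coronavirus"},
)
print(with_list)
print(with_defs)
```

Keeping the three prompt variants behind one function makes it easier to run the same sentences through each approach and compare the outputs side by side.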

In [None]:
# fill in the following code

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import pandas as pd
import random

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

df_tmodeling = ...

PROMPT = "Write your prompt here"

# It may take some time to download the model. If you have already downloaded and loaded it, you can skip this part
model = AutoModelForCausalLM.from_pretrained('HuggingFaceH4/zephyr-7b-beta', torch_dtype=torch.float16, device_map=device)
tokenizer = AutoTokenizer.from_pretrained('HuggingFaceH4/zephyr-7b-beta')

for ...
  full_prompt = ...
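  # `inputs` avoids shadowing Python's built-in `input`
  inputs = tokenizer(full_prompt, return_tensors='pt').to(device)
  output = model.generate(**inputs, max_new_tokens=32)
  # The decoded text contains the prompt followed by the newly generated tokens
  output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]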
  ...

After manually inspecting some of the outputs for each approach, consider the following questions about the results:
- Did you find an approach which worked best overall?
- Did you encounter any cases where the LLM failed?
- What happens if all the topics provided are irrelevant to the sentence?
- Does the presence of definitions improve the model's performance?
- What challenges or limitations did you observe?