#**Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Moreno La Quatra

**Practice 1:** Text processing and topic modeling

# **Text processing**
---
The text processing phase is a preliminary stage where the text to be manipulated is processed to be ready for subsequent analysis.

Text processing usually entails several steps that could possibly include:
- **Language Identification**: identifying the language of a given text.
- **Tokenization**: splitting a given text in several sentences/words. 
- **Dependency tree parsing:** analyzing the depencies between words composing the text.
- **Stemming/Lemmatization:** obtain the root form for each word in text.
- **Stopword removal**: removing words that are si commonly used that they carry very little useful information.
- **Part of Speech Tagging:** given a word, retrieve its part of speech (proper noun, common noun or verb).



### Language Identification

| Text                                                                                                                                | Language Code |
|-------------------------------------------------------------------------------------------------------------------------------------|---------------|
| The "Deep Natural Language Processing" course is offered during the first semester of the second year at Politecnico di Torino      | `EN`            |
| Il corso "Deep Natural Language Processing" viene impartito al Politecnico di Torino durante il primo semestre del secondo anno.    | `IT`            |
| Le cours "Deep Natural Language Processing" est enseigné au Politecnico di Torino pendant le premier semestre de la deuxième année. | `FR`            |

**Language Identification** is a crucial prelimiary step because each language has its own characteristics. The knowledge of the main language associated to a given text could be beneficial for all subsequent steps in text processing pipeline.

The data collection used in this first part of the practice is provided [here](https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P1/langid_dataset.csv) - [source: Kaggle](https://www.kaggle.com/martinkk5575/language-detection)


# Exercise 1:

Benchmark different language-detection algorithm by computing the accuracy of each approach:
- [FastText](https://pypi.org/project/fastlangid/)
- [LangID](https://github.com/saffsd/langid.py)
- [langdetect](https://pypi.org/project/langdetect/)

**Hint:** language code conversion: [iso639-lang](https://pypi.org/project/iso639-lang/)

For each method report:
- Accuracy
- Average time per example

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/langid_dataset.csv

In [2]:
%%capture
! pip install fastlangid
! pip install langid
! pip install langdetect
! pip install iso639-lang
! pip install pandas --upgrade
! pip install numpy --upgrade

In [3]:
# dataset reading 
import pandas as pd
from iso639 import Lang
df_langid = pd.read_csv('langid_dataset.csv')
print (df_langid)
# convert to language codes
df_langid['language'] = df_langid['language'].apply(lambda x: Lang(x).pt1)
print (df_langid)

                                                    Text  language
0      klement gottwaldi surnukeha palsameeriti ning ...  Estonian
1      sebes joseph pereira thomas  på eng the jesuit...   Swedish
2      ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...      Thai
3      விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...     Tamil
4      de spons behoort tot het geslacht haliclona en...     Dutch
...                                                  ...       ...
21995  hors du terrain les années  et  sont des année...    French
21996  ใน พศ  หลักจากที่เสด็จประพาสแหลมมลายู ชวา อินเ...      Thai
21997  con motivo de la celebración del septuagésimoq...   Spanish
21998  年月，當時還只有歲的她在美國出道，以mai-k名義推出首張英文《baby i like》，由...   Chinese
21999   aprilie sonda spațială messenger a nasa și-a ...  Romanian

[22000 rows x 2 columns]
                                                    Text language
0      klement gottwaldi surnukeha palsameeriti ning ...       et
1      sebes joseph pereira thomas  på

In [None]:
! pip install sklearn # to compute accuracy
! pip install tqdm # advancement

In [5]:
import langid
from tqdm import tqdm
import time
from sklearn.metrics import accuracy_score
y_true = []
y_pred = []

start = time.time()
for index, row in tqdm(df_langid.iterrows()):
    sentence = row["Text"]
    real_lang_code = row["language"]
    y_true.append(real_lang_code)
    y_pred.append(langid.classify(sentence)[0])
end = time.time()

print ("\nlangid avg ms per example:", (end-start)*1000/len(df_langid.index))
print ("\nlangid accuracy:", accuracy_score(y_true, y_pred))

22000it [01:26, 253.10it/s]



langid avg ms per example: 3.9518305605108086

langid accuracy: 0.9542727272727273


In [6]:
from langdetect import detect

y_true = []
y_pred = []

start = time.time()
for index, row in tqdm(df_langid.iterrows()):
    sentence = row["Text"]
    real_lang_code = row["language"]
    y_true.append(real_lang_code)
    try:
        y_pred.append(detect(sentence))
    except Exception as e:
        y_pred.append("")
        print (e, "\n")
end = time.time()

print ("\nlangdetect avg ms per example:", (end-start)*1000/len(df_langid.index))
print ("\nlangdetect accuracy:", accuracy_score(y_true, y_pred))

7398it [01:06, 141.76it/s]

No features in text. 



18181it [02:20, 133.56it/s]

No features in text. 



19384it [02:29, 154.55it/s]

No features in text. 



22000it [02:45, 133.03it/s]


langdetect avg ms per example: 7.517641067504883

langdetect accuracy: 0.8430454545454545





In [7]:
from fastlangid.langid import LID
ft_langid = LID()

y_true = []
y_pred = []

start = time.time()
for index, row in tqdm(df_langid.iterrows()):
    sentence = row["Text"]
    real_lang_code = row["language"]
    y_true.append(real_lang_code)
    try:
        y_pred.append(ft_langid.predict(sentence))
    except Exception as e:
        y_pred.append("")
        print (e, "\n")
end = time.time()

print ("\nft_langid avg ms per example:", (end-start)*1000/len(df_langid.index))
print ("\nft_langid accuracy:", accuracy_score(y_true, y_pred))

22000it [00:06, 3659.85it/s]


ft_langid avg ms per example: 0.2734569419514049

ft_langid accuracy: 0.9223636363636364





# Exercise 2

For English-written text, apply word-level tokenization. What is the average number of words per sentence?

Implement word-tokenization using both [nltk](https://www.nltk.org/) and [spacy](https://spacy.io/). Report the results for both of them.

For spaCy use the `en_core_web_sm` model.

In [8]:
%%capture
! pip install --upgrade spacy
! pip install nltk
! python -m spacy download en_core_web_sm

In [9]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

len_total = 0
count_total = 0
for index, row in tqdm(df_langid.iterrows()):
    if row["language"] == "en":
        len_total += len(word_tokenize(row["Text"]))
        count_total += 1
print ("\nAvg. len in #words - NLTK:", len_total/count_total)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
22000it [00:01, 13074.25it/s]


Avg. len in #words - NLTK: 68.752





In [10]:
import spacy
nlp = spacy.load("en_core_web_sm")

len_total = 0
count_total = 0
for index, row in tqdm(df_langid.iterrows()):
    if row["language"] == "en":
        doc = nlp(row["Text"])
        len_total += len(doc)
        count_total += 1
print ("\nAvg. len in #words - spaCy:", len_total/count_total)

22000it [00:29, 753.38it/s] 


Avg. len in #words - spaCy: 72.334





# Exercise 3

Dependency Parsing aims at analyzing the grammatical structure of sentences. The main goal is to find out related words as well as the type of the relationship between them.

The output of this step is a dependency tree similar to the one reported in the figure below.

![dependency tree](http://www.rangakrish.com/wp-content/uploads/2018/04/Deptree-example2.png)

Use spacy to parse the dependency tree of a **randomly selected** sentence. You can both use English sentences or your native language (if supported in [spaCy](https://spacy.io/usage/models/)). Use [displaCy](https://explosion.ai/demos/displacy) to visualize the result in the notebook.

In [11]:
%%capture
! pip install --upgrade spacy
! python -m spacy download ro_core_news_sm
! python -m spacy download en_core_web_sm
! python -m spacy download pt_core_news_sm
! python -m spacy download it_core_news_sm

In [12]:
import random
import spacy
sentences_pool = []
lang_code = "en"
if lang_code =="en":
    nlp = spacy.load(lang_code + '_core_web_sm')
else:
    nlp = spacy.load(lang_code + '_core_news_sm')

for index, row in df_langid.iterrows():
    if row["language"] == lang_code:
        sentences_pool.append(row["Text"])

s = random.choice(sentences_pool)
doc = nlp(s)
spacy.displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

# Exercise 4
For the same sentence selected in the previous step apply all the following steps:
1. Lemmatization: convert each word to its root form.
2. Stopword removal: remove language-specific stopwords.
3. Part of Speech Tagging: for each word in the sentence display its part-of-speech.

For each step, print the resulting list on the console.

In [13]:
#Lemmatization
doc = nlp(s)
lemmas = []
for w in doc:
    lemmas.append(w.lemma_)
print(lemmas)

['he', 'become', 'professor', 'for', 'old', 'testament', 'hebrew', 'and', 'greek', 'at', 'the', 'university', 'of', 'buenos', 'air', 'the', 'hebrew', 'university', 'of', 'jerusalem', 'and', 'then', 'until', 'his', 'retirement', 'at', 'the', 'waldensian', 'theological', 'seminary', 'and', 'at', 'the', 'sapienza', 'among', 'other', 'thing', 'he', 'be', 'a', 'fellow', 'at', 'princeton', 'theological', 'seminary', 'at', 'st', 'johns', 'college', 'cambridge', 'and', 'at', 'the', 'hebrew', 'university', 'he', 'lecture', 'widely', 'and', 'publish', 'various', 'article', 'and', 'book', 'he', 'be', 'a', 'member', 'of', 'the', 'editorial', 'board', 'of', 'henoch', 'vetus', 'testamentum', 'and', 'of', 'zeitschrift', 'für', 'die', 'alttestamentliche', 'wissenschaft']


In [14]:
#Stopword removal
clean_sentence = [] 
for w in doc:
    if not w.is_stop:
        clean_sentence.append(w.text)
print(" ".join(clean_sentence))

professor old testament hebrew greek university buenos aires hebrew university jerusalem retirement waldensian theological seminary sapienza things fellow princeton theological seminary st johns college cambridge hebrew university lectured widely published articles books member editorial board henoch vetus testamentum zeitschrift für die alttestamentliche wissenschaft


In [15]:
#PoS Tagging
for w in doc:
    print (w.text, w.pos_)

he PRON
became VERB
professor NOUN
for ADP
old ADJ
testament NOUN
hebrew VERB
and CCONJ
greek VERB
at ADP
the DET
university PROPN
of ADP
buenos PROPN
aires VERB
the DET
hebrew ADJ
university NOUN
of ADP
jerusalem PROPN
and CCONJ
then ADV
until ADP
his PRON
retirement NOUN
at ADP
the DET
waldensian ADJ
theological ADJ
seminary NOUN
and CCONJ
at ADP
the DET
sapienza NOUN
among ADP
other ADJ
things NOUN
he PRON
was AUX
a DET
fellow NOUN
at ADP
princeton PROPN
theological PROPN
seminary PROPN
at ADP
st PROPN
johns PROPN
college PROPN
cambridge PROPN
and CCONJ
at ADP
the DET
hebrew ADJ
university NOUN
he PRON
lectured VERB
widely ADV
and CCONJ
published VERB
various ADJ
articles NOUN
and CCONJ
books NOUN
he PRON
was AUX
a DET
member NOUN
of ADP
the DET
editorial ADJ
board NOUN
of ADP
henoch PROPN
vetus PROPN
testamentum PROPN
and CCONJ
of ADP
zeitschrift NOUN
für NOUN
die VERB
alttestamentliche NOUN
wissenschaft NOUN


# **Occurrence-based text representation - TF-IDF**

---

TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It allows to create occurrence-based vector representation for each document.

# Exercise 5
Use TF-IDF to vectorize each sentence in the original data collection. You can choose your preferred implementation for TF-IDF vectorization. It is also available on [SciKit-Learn library](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = df_langid["Text"].tolist()
language_labels = df_langid["language"].tolist()

In [17]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Exercise 6

Build a supervised multi-class language detector using as features the vector obtained by TF-IDF representation. Use 80% of the data to train the language detector and 20% of the data for assessing its accuracy.

In [18]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, language_labels , test_size=0.20)
mlp = MLPClassifier(verbose=True)
mlp = mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)
accuracy_score(y_test, y_pred)

Iteration 1, loss = 2.33928756
Iteration 2, loss = 0.65888238
Iteration 3, loss = 0.19871749
Iteration 4, loss = 0.07867437
Iteration 5, loss = 0.03961102
Iteration 6, loss = 0.02473234
Iteration 7, loss = 0.01776265
Iteration 8, loss = 0.01399070
Iteration 9, loss = 0.01170330
Iteration 10, loss = 0.01021843
Iteration 11, loss = 0.00918507
Iteration 12, loss = 0.00843415
Iteration 13, loss = 0.00786660
Iteration 14, loss = 0.00742332
Iteration 15, loss = 0.00706980
Iteration 16, loss = 0.00678065
Iteration 17, loss = 0.00653812
Iteration 18, loss = 0.00633167
Iteration 19, loss = 0.00615260
Iteration 20, loss = 0.00599456
Iteration 21, loss = 0.00585331
Iteration 22, loss = 0.00572534
Iteration 23, loss = 0.00560820
Iteration 24, loss = 0.00549824
Iteration 25, loss = 0.00539611
Iteration 26, loss = 0.00529902
Iteration 27, loss = 0.00520692
Iteration 28, loss = 0.00511859
Iteration 29, loss = 0.00503320
Iteration 30, loss = 0.00495065
Iteration 31, loss = 0.00487050
Iteration 32, los

0.97

In [19]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, language_labels , test_size=0.20)
dt = DecisionTreeClassifier()
dt = dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
accuracy_score(y_test, y_pred)

0.9177272727272727

# **Topic Modelling**
---
Occurrence-based representations are high-dimensional, what is the dimension of the generated TF-IDF vector representation?
Topic modelling focuses on caturing latent topics in large document corpora.

The data collection used in this second part of the practice is provided [here](https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv) - [source: Zenodo](https://zenodo.org/record/4282522#.YVdCXcbOOpd)


# Exercise 7

Latent Semantic Indexing (LSI) models underlying concepts by using SVD (Singular Value Decomposition).

Use [gensim](https://radimrehurek.com/gensim/) library to:
1. Create a corpus composed of the headlines contained in the data collection.
2. Generate a [dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html) to create a word -> id mapping (required by LSI module).
3. Using the dictionary, preprocess the corpus to obtain the representation required for LSI model training ([documentation here](https://radimrehurek.com/gensim/models/lsimodel.html)).
4. Inspect the top-5 topics generated by the LSI model for the analysed corpus.

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv

In [22]:
import pandas as pd
df_tmodelling = pd.read_csv('CovidFake_filtered.csv')

In [23]:
# generating the corpus 
corpus = df_tmodelling["headlines"].tolist()
corpus = [s.split() for s in corpus]
print (corpus[:10])

[['A', 'post', 'claims', 'compulsory', 'vacination', 'violates', 'the', 'principles', 'of', 'bioethics,', 'that', 'coronavirus', "doesn't", 'exist,', 'that', 'the', 'PCR', 'test', 'returns', 'many', 'false', 'positives,', 'and', 'that', 'influenza', 'vaccine', 'is', 'related', 'to', 'COVID-19.'], ['A', 'photo', 'claims', 'that', 'this', 'person', 'is', 'a', 'doctor', 'who', 'died', 'after', 'attending', 'to', 'too', 'many', 'COVID-19', 'patinents', 'in', 'Hospital', 'Muñiz', 'in', 'Buenos', 'Aires.'], ['Post', 'about', 'a', 'video', 'claims', 'that', 'it', 'is', 'a', 'protest', 'against', 'confination', 'in', 'the', 'town', 'of', 'Aranda', 'de', 'Duero', '(Burgos)'], ['All', 'deaths', 'by', 'respiratory', 'failure', 'and', 'pneumonia', 'are', 'being', 'registered', 'as', 'COVID-19,', 'according', 'to', 'the', 'Civil', 'Registry', 'website.'], ['The', 'dean', 'of', 'the', 'College', 'of', 'Biologists', 'of', 'Euskadi', 'states', 'that', 'there', 'are', 'a', 'lot', 'of', 'PCR', 'false', 

In [24]:
# constructing dictionary and processing the corpus
from gensim.corpora.dictionary import Dictionary
tm_dict = Dictionary(corpus)
processed_corpus = [tm_dict.doc2bow(text) for text in corpus]

In [25]:
# training LSI model
from gensim.models import LsiModel
model = LsiModel(processed_corpus, id2word=tm_dict)
model.print_topics(5)

[(0,
  '0.636*"the" + 0.389*"of" + 0.314*"in" + 0.280*"a" + 0.254*"to" + 0.179*"and" + 0.155*"that" + 0.121*"is" + 0.107*"coronavirus" + 0.106*"A"'),
 (1,
  '-0.629*"the" + 0.598*"a" + 0.386*"in" + 0.105*"A" + 0.097*"and" + 0.078*"has" + 0.077*"COVID-19" + 0.071*"video" + 0.070*"on" + 0.064*"been"'),
 (2,
  '0.709*"to" + -0.633*"of" + 0.137*"and" + 0.123*"is" + 0.071*"for" + 0.069*"be" + 0.054*"that" + 0.051*"are" + 0.049*"from" + 0.048*"due"'),
 (3,
  '-0.734*"in" + 0.428*"of" + 0.363*"to" + 0.283*"a" + -0.172*"the" + -0.097*"coronavirus" + 0.046*"from" + -0.040*"was" + 0.037*"COVID-19." + 0.034*"on"'),
 (4,
  '0.479*"to" + 0.425*"of" + -0.407*"a" + 0.391*"in" + -0.259*"that" + -0.240*"the" + -0.213*"is" + -0.187*"and" + -0.115*"for" + -0.075*"coronavirus"')]

# Exercise 8 (Optional)

The top-scored words contributing to each topic (if no stopword removal is applied) are english common words (e.g., *to, for, in, of, on*..). Moreover, missing punctuation removal could be critical for topic identification. Repeat the same procedure of Ex. 7 by adding preliminary preprocessing step to:
1. **remove stopwords**
2. **strip punctuation**
3. **lowercase all words**

In [26]:
from nltk.corpus import stopwords
import nltk
import string

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
rs_corpus = [[w.lower() for w in s if w.lower() not in stop_words] for s in corpus]
rs_corpus = [[w.translate(str.maketrans('', '', string.punctuation)) for w in s] for s in rs_corpus]
print(rs_corpus[:10])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


[['post', 'claims', 'compulsory', 'vacination', 'violates', 'principles', 'bioethics', 'coronavirus', 'exist', 'pcr', 'test', 'returns', 'many', 'false', 'positives', 'influenza', 'vaccine', 'related', 'covid19'], ['photo', 'claims', 'person', 'doctor', 'died', 'attending', 'many', 'covid19', 'patinents', 'hospital', 'muñiz', 'buenos', 'aires'], ['post', 'video', 'claims', 'protest', 'confination', 'town', 'aranda', 'de', 'duero', 'burgos'], ['deaths', 'respiratory', 'failure', 'pneumonia', 'registered', 'covid19', 'according', 'civil', 'registry', 'website'], ['dean', 'college', 'biologists', 'euskadi', 'states', 'lot', 'pcr', 'false', 'positives', 'asymptomatic', 'spread', 'coronavirus'], ['households', 'covid19', 'patients', 'porto', 'alegre', 'campo', 'grande', 'santo', 'antônio', 'da', 'platina', 'must', 'put', 'red', 'ribbon', 'garbage', 'bags', 'garbagemen', 'instructed', 'handle', 'safer', 'way'], ['chain', 'lists', 'recommendations', 'prevent', 'treat', 'coronavirus'], ['60000

In [27]:
rs_tm_dict = Dictionary(corpus)
rs_processed_corpus = [rs_tm_dict.doc2bow(text) for text in rs_corpus]
rs_model = LsiModel(rs_processed_corpus, id2word=rs_tm_dict)
rs_model.print_topics(5)

[(0,
  '-0.761*"coronavirus" + -0.405*"covid19" + -0.171*"video" + -0.140*"people" + -0.118*"shows" + -0.108*"novel" + -0.101*"claim" + -0.101*"new" + -0.090*"shared" + -0.082*"posts"'),
 (1,
  '0.823*"covid19" + -0.528*"coronavirus" + 0.067*"video" + 0.050*"shows" + -0.045*"novel" + 0.040*"hospital" + -0.038*"new" + 0.037*"claims" + 0.034*"patients" + 0.034*"lockdown"'),
 (2,
  '0.510*"video" + 0.359*"shows" + -0.280*"covid19" + 0.274*"claim" + 0.242*"posts" + 0.239*"times" + -0.236*"coronavirus" + 0.217*"shared" + 0.177*"thousands" + 0.173*"multiple"'),
 (3,
  '-0.538*"video" + 0.329*"posts" + 0.326*"claim" + -0.278*"people" + 0.277*"shared" + 0.254*"times" + 0.215*"multiple" + -0.200*"shows" + 0.180*"novel" + 0.151*"thousands"'),
 (4,
  '-0.872*"people" + 0.326*"video" + 0.135*"shows" + 0.097*"coronavirus" + 0.087*"covid19" + -0.070*"government" + -0.069*"lockdown" + -0.068*"virus" + 0.067*"patients" + -0.065*"died"')]

# Exercise 9 (Optional)

Leveraging the same corpus used for LSI model generation, apply LDA modelling setting the number of topics to 5. Display the words most contributing to the those topics according to the LDA model.

In [None]:
from gensim.models.ldamodel import LdaModel
lda = LdaModel(rs_processed_corpus, id2word=rs_tm_dict, num_topics=3)
lda.print_topics(5)

# Exercise 10 (Optional)

Using [pyLDAvis]() library build an interactive visualization for the trained LDA model.

In [None]:
!pip install pyLDAvis

In [34]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()

lda_display = gensimvis.prepare(lda, rs_processed_corpus, rs_tm_dict, sort_topics=False)
pyLDAvis.display(lda_display)