# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Moreno La Quatra

**Practice 1:** Text processing and topic modeling

# **Text processing**
---
The text processing phase is a preliminary stage in which raw text is prepared for subsequent analysis.

Text processing usually involves several steps, which may include:
- **Language Identification**: identifying the language of a given text.
- **Tokenization**: splitting a given text into sentences or words.
- **Dependency tree parsing:** analyzing the dependencies between the words composing the text.
- **Stemming/Lemmatization:** obtaining the root form of each word in the text.
- **Stopword removal**: removing words that are so commonly used that they carry very little useful information.
- **Part of Speech Tagging:** given a word, retrieving its part of speech (e.g., proper noun, common noun, or verb).
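
As a rough illustration of two of the steps listed above, the sketch below tokenizes a sentence with a regular expression and then filters out stopwords. The `tokenize` and `remove_stopwords` helpers and the tiny `STOPWORDS` set are hypothetical stand-ins written for this example; a real pipeline would typically rely on a library such as nltk or spaCy.

```python
import re

# Hypothetical, deliberately tiny stopword list used only for this illustration.
STOPWORDS = {"the", "is", "a", "of", "to", "and", "in"}

def tokenize(text):
    """Split text into lowercase word tokens (runs of letters, digits, or underscores)."""
    return re.findall(r"\w+", text.lower())

def remove_stopwords(tokens):
    """Drop every token that appears in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = tokenize("The course is offered in the first semester.")
filtered = remove_stopwords(tokens)
print(tokens)
print(filtered)
```

Note that the regular expression silently discards punctuation, which is one of many design choices on which real tokenizers differ.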



### Language Identification

| Text                                                                                                                                | Language Code |
|-------------------------------------------------------------------------------------------------------------------------------------|---------------|
| The "Deep Natural Language Processing" course is offered during the first semester of the second year at Politecnico di Torino      | `EN`            |
| Il corso "Deep Natural Language Processing" viene impartito al Politecnico di Torino durante il primo semestre del secondo anno.    | `IT`            |
| Le cours "Deep Natural Language Processing" est enseigné au Politecnico di Torino pendant le premier semestre de la deuxième année. | `FR`            |

**Language Identification** is a crucial preliminary step because each language has its own characteristics. Knowing the main language of a given text can benefit all subsequent steps in the text processing pipeline.

The data collection used in this first part of the practice is provided [here](https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P1/langid_dataset.csv) - [source: Kaggle](https://www.kaggle.com/martinkk5575/language-detection)

# Exercise 1:

Benchmark different language-detection algorithms by computing the accuracy of each approach:
- [FastText](https://pypi.org/project/fastlangid/)
- [LangID](https://github.com/saffsd/langid.py)
- [langdetect](https://pypi.org/project/langdetect/)

**Hint:** language code conversion: [iso639-lang](https://pypi.org/project/iso639-lang/)

For each method report:
- Accuracy
- Average time per example
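
One possible way to collect both metrics with the same code for every library is a small harness like the one below. It is only a sketch: `benchmark` and `toy_detector` are hypothetical names introduced here, and the toy detector merely stands in for a real library call that maps a text to a language name.

```python
import time

def benchmark(detector, texts, labels):
    """Return (accuracy, average seconds per example) for a detector callable."""
    start = time.perf_counter()
    predictions = [detector(text) for text in texts]
    elapsed = time.perf_counter() - start
    correct = sum(pred == gold for pred, gold in zip(predictions, labels))
    return correct / len(labels), elapsed / len(texts)

# Hypothetical toy detector: answers "Italian" when the token "il" occurs, "English" otherwise.
def toy_detector(text):
    return "Italian" if "il" in text.lower().split() else "English"

texts = [
    "Il corso viene impartito a Torino",
    "The course is offered in Turin",
    "Il gatto dorme",
    "Le cours est enseigné à Turin",
]
labels = ["Italian", "English", "Italian", "French"]

accuracy, avg_time = benchmark(toy_detector, texts, labels)
print(f"accuracy={accuracy:.2f}, average time per example={avg_time:.6f}s")
```

Timing the whole loop and dividing by the number of examples keeps the measurement overhead small compared with timing each call individually.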

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/langid_dataset.csv

--2021-10-06 15:29:53--  https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/langid_dataset.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12990065 (12M) [text/plain]
Saving to: ‘langid_dataset.csv.3’


2021-10-06 15:29:53 (139 MB/s) - ‘langid_dataset.csv.3’ saved [12990065/12990065]



In [None]:
import pandas as pd
df = pd.read_csv("langid_dataset.csv")

y = df['language']
df.head()

Unnamed: 0,Text,language
0,klement gottwaldi surnukeha palsameeriti ning ...,Estonian
1,sebes joseph pereira thomas på eng the jesuit...,Swedish
2,ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...,Thai
3,விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...,Tamil
4,de spons behoort tot het geslacht haliclona en...,Dutch


# FastText


In [None]:
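%%time
from fastlangid.langid import LID

# Named fasttext_lid so that it is not overwritten by the `langid` module imported later
fasttext_lid = LID()

results_fasttext = fasttext_lid.predict(df['Text'])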


CPU times: user 3.48 s, sys: 32.1 ms, total: 3.52 s
Wall time: 3.74 s


# LangId

In [None]:
%%time
import langid

results_langid = []
for i in range(len(df)):
    det = langid.classify(df['Text'][i])
    results_langid.append(det[0])



CPU times: user 1min, sys: 1min 5s, total: 2min 5s
Wall time: 1min 11s


# LangDetect

In [None]:
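from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

# langdetect is non-deterministic by default; fixing the seed makes runs reproducible
DetectorFactory.seed = 0

def safe_detect(text):
    """Return the detected language code, or None when langdetect cannot decide."""
    try:
        return detect(text)
    except LangDetectException:
        return None

In [None]:
%%time

# Keep one prediction per row so the results stay aligned with the labels in y
results_langdetect = [safe_detect(text) for text in df['Text']]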



# Results


In [None]:
from iso639 import Lang
res1 = []
for element in results_fasttext:
    if element.startswith("zh"):
        element = "zh"
    lg = Lang(element)
    res1.append(lg.name)


In [None]:
from iso639 import Lang
res2 = []
for element in results_langid:
    if element.startswith("zh"):
        element = "zh"
    lg = Lang(element)
    res2.append(lg.name)
res2

['Estonian',
 'English',
 'Thai',
 'Tamil',
 'Dutch',
 'Japanese',
 'Turkish',
 'Latin',
 'Urdu',
 'Japanese',
 'Indonesian',
 'Portuguese',
 'French',
 'Chinese',
 'Korean',
 'Thai',
 'Estonian',
 'Portuguese',
 'English',
 'Hindi',
 'Tamil',
 'Spanish',
 'French',
 'French',
 'Estonian',
 'Korean',
 'French',
 'Pushto',
 'Dutch',
 'Persian',
 'French',
 'Romanian',
 'Russian',
 'Japanese',
 'Indonesian',
 'Latin',
 'Latin',
 'English',
 'French',
 'Portuguese',
 'English',
 'Urdu',
 'English',
 'Indonesian',
 'Indonesian',
 'Japanese',
 'Arabic',
 'Pushto',
 'Indonesian',
 'Swedish',
 'Dutch',
 'Russian',
 'Dutch',
 'Arabic',
 'Arabic',
 'Turkish',
 'Urdu',
 'French',
 'Portuguese',
 'Portuguese',
 'Indonesian',
 'Indonesian',
 'Tamil',
 'Swedish',
 'Persian',
 'Russian',
 'English',
 'Korean',
 'Arabic',
 'Thai',
 'Portuguese',
 'Arabic',
 'Tamil',
 'Pushto',
 'Urdu',
 'Pushto',
 'English',
 'Russian',
 'Persian',
 'Japanese',
 'Portuguese',
 'Hindi',
 'Persian',
 'English',
 'Swedi

In [None]:
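from sklearn.metrics import accuracy_score

# accuracy_score expects (y_true, y_pred)
print("FastText")
print(accuracy_score(y, res1))
print("---------")
print("LangId")
print(accuracy_score(y, res2))
# results_langdetect holds language codes: convert them to language names (as done for res1 and res2) before scoring
# print(accuracy_score(y, results_langdetect))
print("---------")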


FastText
0.9677272727272728
---------
LangId
0.9542727272727273
---------


0        Estonian
1         Swedish
2            Thai
3           Tamil
4           Dutch
           ...   
21995      French
21996        Thai
21997     Spanish
21998     Chinese
21999    Romanian
Name: language, Length: 22000, dtype: object

# Exercise 2

For English-written text, apply word-level tokenization. What is the average number of words per sentence?

Implement word-tokenization using both [nltk](https://www.nltk.org/) and [spacy](https://spacy.io/). Report the results for both of them.

For spaCy use the `en_core_web_sm` model.
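
Different tokenizers split text differently, so the average token count depends on the tool used. The sketch below illustrates this with two hypothetical stdlib tokenizers (`whitespace_tokenize` and `regex_tokenize`, which are not the nltk or spaCy algorithms) and a helper, `average_tokens`, that computes the mean number of tokens per text.

```python
import re

def whitespace_tokenize(text):
    """Split on whitespace only; punctuation stays attached to words."""
    return text.split()

def regex_tokenize(text):
    """Emit runs of word characters and each punctuation mark as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def average_tokens(texts, tokenizer):
    """Mean number of tokens per text for the given tokenizer."""
    return sum(len(tokenizer(t)) for t in texts) / len(texts)

texts = ["Hello, world!", "Tokenizers don't always agree."]
avg_ws = average_tokens(texts, whitespace_tokenize)
avg_re = average_tokens(texts, regex_tokenize)
print("whitespace:", avg_ws)
print("regex:", avg_re)
```

The gap between the two averages comes entirely from how punctuation and contractions are handled, which is also the kind of difference to expect between nltk and spaCy.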

In [None]:
df_eng = df[df.language == "English"]
df_eng.head()

Unnamed: 0,Text,language
37,in johnson was awarded an american institute ...,English
40,bussy-saint-georges has built its identity on ...,English
76,minnesotas state parks are spread across the s...,English
90,nordahl road is a station served by north coun...,English
97,a talk by takis fotopoulos about the internati...,English


# Spacy

In [None]:
import spacy
from spacy.lang.en import English

# English() builds a blank English pipeline that contains only the tokenizer
nlp = English()
token_list = []
for text in df_eng['Text']:
    
    my_doc = nlp(text)

    # Create list of word tokens
    
    for token in my_doc:
        token_list.append(token.text)


# NLTK 

In [None]:
import nltk
from nltk import word_tokenize

tk = []
for text in df_eng['Text']:
    # extend() appends in place instead of rebuilding the list at every iteration
    tk.extend(word_tokenize(text))

len(tk)



68738

In [None]:
t1 = len(token_list)
t2 = len(tk)
d = len(df_eng)

In [None]:
# t1 counts the spaCy tokens (token_list), t2 counts the NLTK tokens (tk)
print("SpaCy: ", t1 / d)
print("NLTK: ", t2 / d)


SpaCy:  72.334
NLTK:  68.738


# Exercise 3

Use spaCy to parse the dependency tree of a **randomly selected** sentence. You can use either an English sentence or one in your native language (if supported in [spaCy](https://spacy.io/usage/models/)). Use [displaCy](https://explosion.ai/demos/displacy) to visualize the result in the notebook.

In [None]:
import spacy
from spacy import displacy


nlp = spacy.load("en_core_web_sm")

doc = nlp("Hi, my name is Gianluca and my surname is LM")
displacy.render(doc, style="dep",jupyter=True)



# Exercise 4
For the same sentence selected in the previous step apply all the following steps:
1. Lemmatization: convert each word to its root form.
2. Stopword removal: remove language-specific stopwords.
3. Part of Speech Tagging: for each word in the sentence display its part-of-speech.

For each step, print the resulting list on the console.

In [None]:
# Lemmatization
nlp = spacy.load("en_core_web_sm")

doc = nlp("Hi, my name is Gianluca and my surname is LM")
lemmas = [token.lemma_ for token in doc]

# Stopword removal (spaCy's English stopword list is lowercase)
all_stopwords = nlp.Defaults.stop_words
tokens_without_sw = [lemma for lemma in lemmas if lemma.lower() not in all_stopwords]

# Part-of-speech tagging
for word in doc:
    print(word, ": ", word.pos_)

Hi :  INTJ
, :  PUNCT
my :  PRON
name :  NOUN
is :  AUX
Gianluca :  PROPN
and :  CCONJ
my :  PRON
surname :  NOUN
is :  VERB
LM :  PROPN


# **Occurrence-based text representation - TF-IDF**

---
TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It makes it possible to build an occurrence-based vector representation of each document.
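
To make the definition concrete, here is a minimal sketch of the textbook formulation, where the weight of a term in a document is its relative frequency in that document multiplied by `log(N / df)`, with `N` the number of documents and `df` the number of documents containing the term. The `tf_idf` helper and the toy documents are assumptions made for this example; library implementations such as scikit-learn apply smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    ["covid", "vaccine", "claim"],
    ["covid", "video", "claim"],
    ["covid", "lockdown"],
]
weights = tf_idf(docs)
for doc_weights in weights:
    print(doc_weights)
```

A term that occurs in every document (here `covid`) gets weight zero, while terms that appear in a single document receive the largest weights.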

# Exercise 5
Use TF-IDF to vectorize each sentence in the original data collection. You can choose your preferred implementation of TF-IDF vectorization; one is available in the [scikit-learn library](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['Text'])

print(X)

  (0, 43365)	0.15245962403688543
  (0, 122097)	0.15245962403688543
  (0, 76696)	0.13890427169403846
  (0, 45787)	0.15245962403688543
  (0, 113244)	0.13890427169403846
  (0, 43364)	0.15245962403688543
  (0, 75247)	0.22902894148770514
  (0, 153)	0.2155525828166711
  (0, 55264)	0.2678531884721598
  (0, 63450)	0.24339387890154524
  (0, 122096)	0.15245962403688543
  (0, 59244)	0.15245962403688543
  (0, 122428)	0.11632821567894924
  (0, 67654)	0.15245962403688543
  (0, 106284)	0.0828549222249433
  (0, 117123)	0.1339265942360799
  (0, 136)	0.08509082541842236
  (0, 112023)	0.15245962403688543
  (0, 60954)	0.14646128506875586
  (0, 49445)	0.15245962403688543
  (0, 45293)	0.12265170453053793
  (0, 80288)	0.15245962403688543
  (0, 79323)	0.15245962403688543
  (0, 53103)	0.13228208686853668
  (0, 47020)	0.14646128506875586
  :	:
  (21999, 102253)	0.18987120980426156
  (21999, 101536)	0.19555363016241606
  (21999, 69301)	0.3911072603248321
  (21999, 95538)	0.18546358214789696
  (21999, 69551)	0.18

# Exercise 6

Build a supervised multi-class language detector using as features the vector obtained by TF-IDF representation. Use 80% of the data to train the language detector and 20% of the data for assessing its accuracy.
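
The 80/20 protocol can be sketched in plain Python as follows. The `split_80_20` helper is a hypothetical name used only for illustration; in scikit-learn the same split corresponds to calling `train_test_split` with `test_size=0.2`.

```python
import random

def split_80_20(items, labels, seed=0):
    """Shuffle the indices once, then use the first 80% for training and the rest for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    cut = int(0.8 * len(indices))
    train_idx, test_idx = indices[:cut], indices[cut:]
    train = [(items[i], labels[i]) for i in train_idx]
    test = [(items[i], labels[i]) for i in test_idx]
    return train, test

items = [f"text {i}" for i in range(10)]
labels = ["English" if i % 2 == 0 else "Italian" for i in range(10)]

train, test = split_80_20(items, labels)
print(len(train), len(test))
```

Shuffling before cutting matters because datasets are often stored sorted by class, and fixing the seed keeps the split reproducible.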

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

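# test_size=0.2 keeps 80% of the data for training and 20% for testing, as the exercise requires
# (the accuracy shown below was obtained with test_size=0.8, i.e. 20% train / 80% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)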
clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)


In [None]:
accuracy_score(y_pred,y_test) 

0.9286931818181818

# **Topic Modelling**

Occurrence-based representations are high-dimensional. What is the dimension of the generated TF-IDF vector representation?
Topic modelling focuses on capturing latent topics in large document corpora.

The data collection used in this second part of the practice is provided [here](https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv) - [source: Zenodo](https://zenodo.org/record/4282522#.YVdCXcbOOpd)


# Exercise 7

Latent Semantic Indexing (LSI) models underlying concepts by using SVD (Singular Value Decomposition).
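
As a reminder of the underlying algebra, LSI is commonly presented as a rank-$k$ truncated SVD of the term-document matrix. The notation below ($m$ terms, $n$ documents, $k$ topics) is an assumption chosen for this note, not something fixed by the exercise:

```latex
X \approx X_k = U_k \, \Sigma_k \, V_k^{\top},
\qquad
X \in \mathbb{R}^{m \times n},\;
U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}
```

Under this reading, each column of $U_k$ describes one topic as a weighted combination of terms, and the rows of $V_k \Sigma_k$ give the coordinates of the documents in the $k$-dimensional topic space.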

Use [gensim](https://radimrehurek.com/gensim/) library to:
1. Create a corpus composed of the headlines contained in the data collection.
2. Generate a [dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html) to create a word -> id mapping (required by LSI module).
3. Using the dictionary, preprocess the corpus to obtain the representation required for LSI model training ([documentation here](https://radimrehurek.com/gensim/models/lsimodel.html)).
4. Inspect the top-5 topics generated by the LSI model for the analysed corpus.
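
Steps 2 and 3 can be pictured with the pure-Python sketch below. It only mimics the idea behind a word -> id dictionary and a bag-of-words conversion; `build_dictionary` and `doc2bow` are hypothetical helpers written for this example, not gensim's implementation.

```python
from collections import Counter

def build_dictionary(tokenized_docs):
    """Assign a consecutive integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc2bow(doc, token2id):
    """Convert a tokenized document into a sorted list of (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [s.split() for s in ["a video claims a cure", "a post claims a protest"]]
token2id = build_dictionary(docs)
bow = [doc2bow(d, token2id) for d in docs]
print(token2id)
print(bow)
```

Tokens missing from the dictionary are skipped during conversion, so the dictionary should be built from the same preprocessed corpus that is later converted.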

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv

--2021-10-07 09:46:01--  https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1088708 (1.0M) [text/plain]
Saving to: ‘CovidFake_filtered.csv’


2021-10-07 09:46:01 (104 MB/s) - ‘CovidFake_filtered.csv’ saved [1088708/1088708]



In [None]:
import pandas as pd
df_c = pd.read_csv("CovidFake_filtered.csv")
df_c.head()

Unnamed: 0.1,Unnamed: 0,headlines,outcome
0,0,A post claims compulsory vacination violates t...,0
1,1,A photo claims that this person is a doctor wh...,0
2,2,Post about a video claims that it is a protest...,0
3,3,All deaths by respiratory failure and pneumoni...,0
4,4,The dean of the College of Biologists of Euska...,0


In [None]:
from gensim.corpora.dictionary import Dictionary
from gensim.models import LsiModel

corpus = df_c['headlines'].tolist()
corpus = [s.split() for s in corpus]

c_dict = Dictionary(corpus) 
processed_corpus = [c_dict.doc2bow(text) for text in corpus]



model = LsiModel(processed_corpus, id2word=c_dict)
model.print_topics(5)

[(0,
  '0.636*"the" + 0.389*"of" + 0.314*"in" + 0.280*"a" + 0.254*"to" + 0.179*"and" + 0.155*"that" + 0.121*"is" + 0.107*"coronavirus" + 0.106*"A"'),
 (1,
  '0.629*"the" + -0.598*"a" + -0.386*"in" + -0.105*"A" + -0.097*"and" + -0.078*"has" + -0.077*"COVID-19" + -0.071*"video" + -0.070*"on" + -0.064*"been"'),
 (2,
  '0.709*"to" + -0.633*"of" + 0.137*"and" + 0.123*"is" + 0.071*"for" + 0.069*"be" + 0.054*"that" + 0.051*"are" + 0.049*"from" + 0.048*"due"'),
 (3,
  '-0.734*"in" + 0.428*"of" + 0.363*"to" + 0.283*"a" + -0.172*"the" + -0.097*"coronavirus" + 0.046*"from" + -0.040*"was" + 0.037*"COVID-19." + 0.034*"on"'),
 (4,
  '0.479*"to" + 0.425*"of" + -0.407*"a" + 0.391*"in" + -0.259*"that" + -0.240*"the" + -0.213*"is" + -0.187*"and" + -0.115*"for" + -0.075*"coronavirus"')]

# Exercise 8 (Optional)

If no stopword removal is applied, the top-scored words contributing to each topic are common English words (e.g., *to, for, in, of, on*). Repeat the procedure of Ex. 7, adding a preliminary preprocessing step to **remove stopwords**.

In [None]:
import string

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Lowercase every token and drop English stopwords
rs_corpus = [[w.lower() for w in s if w.lower() not in stop_words] for s in corpus]
# Strip punctuation characters from the remaining tokens
rs_corpus = [[w.translate(str.maketrans('', '', string.punctuation)) for w in s] for s in rs_corpus]
print(rs_corpus[:10])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[['post', 'claims', 'compulsory', 'vacination', 'violates', 'principles', 'bioethics', 'coronavirus', 'exist', 'pcr', 'test', 'returns', 'many', 'false', 'positives', 'influenza', 'vaccine', 'related', 'covid19'], ['photo', 'claims', 'person', 'doctor', 'died', 'attending', 'many', 'covid19', 'patinents', 'hospital', 'muñiz', 'buenos', 'aires'], ['post', 'video', 'claims', 'protest', 'confination', 'town', 'aranda', 'de', 'duero', 'burgos'], ['deaths', 'respiratory', 'failure', 'pneumonia', 'registered', 'covid19', 'according', 'civil', 'registry', 'website'], ['dean', 'college', 'biologists', 'euskadi', 'states', 'lot', 'pcr', 'false', 'positives', 'asymptomatic', 'spread', 'coronavirus'], ['households', 'covid19', 'patients', 'porto', 'alegre', 'campo', 'grande', 'santo', 'antônio', 'da', 'platina', 'must', 'put', 'red', 'ribbon', 'garbage', 'bags', 'garbagemen', 'instructed

In [None]:
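# Build the dictionary from the stopword-filtered corpus so that its ids cover exactly the filtered tokens
# (the topics printed below came from a dictionary built on the unfiltered corpus, so weights may differ)
rs_tm_dict = Dictionary(rs_corpus)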
rs_processed_corpus = [rs_tm_dict.doc2bow(text) for text in rs_corpus]
rs_model = LsiModel(rs_processed_corpus, id2word=rs_tm_dict)
rs_model.print_topics(5)

[(0,
  '0.743*"coronavirus" + 0.406*"covid19" + 0.172*"video" + 0.139*"people" + 0.120*"shows" + 0.120*"facebook" + 0.109*"novel" + 0.107*"claim" + 0.098*"new" + 0.096*"shared"'),
 (1,
  '0.812*"covid19" + -0.544*"coronavirus" + 0.065*"video" + 0.050*"shows" + -0.045*"novel" + -0.040*"new" + 0.039*"hospital" + 0.037*"claims" + 0.037*"facebook" + 0.034*"lockdown"'),
 (2,
  '-0.358*"video" + -0.326*"facebook" + -0.297*"claim" + 0.294*"covid19" + -0.279*"shows" + -0.278*"posts" + -0.267*"times" + 0.250*"coronavirus" + -0.250*"shared" + -0.196*"multiple"'),
 (3,
  '0.653*"video" + 0.300*"shows" + 0.253*"people" + -0.240*"facebook" + -0.237*"posts" + -0.214*"claim" + -0.202*"shared" + -0.175*"times" + -0.152*"multiple" + 0.143*"lockdown"'),
 (4,
  '0.881*"people" + -0.303*"video" + -0.130*"shows" + -0.097*"coronavirus" + -0.089*"covid19" + 0.073*"lockdown" + 0.072*"government" + 0.068*"virus" + 0.065*"died" + -0.065*"patients"')]

# Exercise 9 (Optional)

Using the same corpus used for LSI model generation, apply LDA modelling with the number of topics set to 5. Display the words that contribute most to those topics according to the LDA model.

In [None]:
from gensim.models.ldamodel import LdaModel
# The exercise asks for 5 topics (the output shown below was produced with num_topics=3)
lda = LdaModel(rs_processed_corpus, id2word=rs_tm_dict, num_topics=5)
lda.print_topics(5)

[(0,
  '0.035*"coronavirus" + 0.027*"covid19" + 0.007*"video" + 0.005*"president" + 0.005*"china" + 0.005*"outbreak" + 0.005*"patients" + 0.004*"government" + 0.004*"people" + 0.004*"new"'),
 (1,
  '0.032*"coronavirus" + 0.018*"video" + 0.015*"shows" + 0.014*"covid19" + 0.011*"china" + 0.010*"people" + 0.008*"facebook" + 0.007*"claim" + 0.006*"novel" + 0.006*"shared"'),
 (2,
  '0.061*"coronavirus" + 0.023*"covid19" + 0.013*"people" + 0.011*"new" + 0.008*"water" + 0.007*"virus" + 0.007*"wuhan" + 0.006*"novel" + 0.006*"cure" + 0.006*"vaccine"')]

# Exercise 10 (Optional)

Using the [pyLDAvis]() library, build an interactive visualization of the trained LDA model.

In [None]:
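# The top-level package must be imported explicitly: enable_notebook() and display() live there
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

pyLDAvis.enable_notebook()

# Build the visualization data from the trained LDA model, its corpus, and its dictionary
lda_display = gensimvis.prepare(lda, rs_processed_corpus, rs_tm_dict, sort_topics=False)
pyLDAvis.display(lda_display)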