[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QbAR_-IM3I_bp9TVkP-l9t_K8cOGi8OK)

# Natural Language Processing Lab

This notebook is a practice project for exploring Natural Language Processing (NLP) tools on a tweet corpus: the dataset from the [TASS 2020](http://www.sepln.org/workshops/tass/) competition (IberLEF - SEPLN). Tweet-based datasets such as this one contain noisy text, which we preprocess using regular expressions and NLP tools like [spaCy](https://spacy.io/api). The goal is to prepare the data for subsequent tasks, particularly sentiment analysis using various machine learning approaches.

In [2]:
import csv
import random
from spacy import displacy
from Logic import Preprocessing, LinguisticAnalyzer

PATH_TO_TRAIN = 'data/train.csv'

# Part 1 - Loading and Preprocessing the Corpus

We will work with a dataset of tweets, which often have particular characteristics such as user mentions, hashtags, and URLs. In this task, we use the [csv](https://docs.python.org/3/library/csv.html) library to load the dataset.
The dataset is loaded as a list of rows, each one corresponding to a line of the .csv file; the header row is skipped. After loading, we display a random tweet.

In [3]:
with open(PATH_TO_TRAIN, newline='', encoding="utf-8") as corpus_csv:
    reader = csv.reader(corpus_csv)
    next(reader)  # skip header
    train_set = [row for row in reader]

random_tweet = random.choice(train_set)
print("Sample tweet information:")
print(f"Tweet ID: {random_tweet[0]}")
print(f"Tweet text: {random_tweet[1]}")
print(f"Tweet category: {random_tweet[2]}")

Sample tweet information:
Tweet ID: 180233463700525056
Tweet text: Preparo artículo sobre futuro de la lengua. Sabías q las telenovelas hispanas son uno de los grandes agentes difusión español en el mundo?
Tweet category: P


Given the tweets' nature, we want to clean and normalize the text by applying transformations such as:

* Replacing URLs with a generic placeholder (e.g., `(URL)` or `URL`).
* Replacing user mentions with a generic placeholder (`USUARIO`).
* Replacing certain abbreviations with their original forms.
* Normalizing recurring laughter tokens (e.g., from `jajajajJaajaj` to `jajaja`).
* Lowercasing.
* Removing diacritics.
* Replacing insults with a placeholder.
* Replacing hashtags with another placeholder.
* Reducing and normalizing repeated letters.
* Removing excessive spaces.

We will define helper functions for most of these preprocessing steps and then call them in a main function, e.g., `process_tweet(text)`.
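
As a rough illustration of how a few of these steps could be chained, here is a minimal sketch using only the standard library. It is not the `Preprocessing` class imported from `Logic`, whose source is not shown here: the names `strip_diacritics` and `process_tweet_sketch`, the regular expressions, the abbreviation table, and the order of the steps are all assumptions, and insult replacement is left out.

```python
import re
import unicodedata

# Hypothetical abbreviation table; the notebook's real list is not shown.
ABBREVIATIONS = {"q": "que", "d": "de", "xq": "porque"}
ABBREVIATION_RE = re.compile(r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b")


def strip_diacritics(text: str) -> str:
    """Drop accents but keep the tilde of 'ñ', as in the sample output."""
    kept = []
    for ch in unicodedata.normalize("NFD", text):
        is_enye_tilde = ch == "\u0303" and kept and kept[-1] in "nN"
        if unicodedata.combining(ch) and not is_enye_tilde:
            continue
        kept.append(ch)
    return unicodedata.normalize("NFC", "".join(kept))


def process_tweet_sketch(text: str) -> str:
    text = strip_diacritics(text.lower())
    text = re.sub(r"https?://\S+", "URL", text)                 # URLs
    text = re.sub(r"@\w+", "USUARIO", text)                     # user mentions
    text = re.sub(r"#\w+", "HASHTAG", text)                     # hashtags
    text = re.sub(r"\b[ja]*(?:ja){2,}[ja]*\b", "jajaja", text)  # laughter
    text = re.sub(r"([a-zñ])\1{2,}", r"\1", text)               # 3+ repeated letters
    text = ABBREVIATION_RE.sub(lambda m: ABBREVIATIONS[m.group(1)], text)
    return re.sub(r"\s+", " ", text).strip()                    # extra spaces


print(process_tweet_sketch("Sabías q @Ana dijo jajajajJaajaj?? Oyeee   http://t.co/abc #Hola"))
# sabias que USUARIO dijo jajaja?? oye URL HASHTAG
```

Lowercasing before inserting the placeholders keeps `URL`, `USUARIO`, and `HASHTAG` in upper case, so the later letter-level rules cannot touch them. That ordering is a design choice of this sketch, not necessarily the one used by `process_tweet`.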


In [4]:
# Preprocess the tweet
preprocessor = Preprocessing()
preprocessed_tweet = preprocessor.process_tweet(random_tweet[1])

print("Preprocessed tweet:")
print(preprocessed_tweet)
print("Original tweet:")
print(random_tweet[1])

Preprocessed tweet:
preparo articulo sobre futuro de la lengua. sabias que las telenovelas hispanas son uno de los grandes agentes difusion español en el mundo?
Original tweet:
Preparo artículo sobre futuro de la lengua. Sabías q las telenovelas hispanas son uno de los grandes agentes difusión español en el mundo?


# Part 2 - Linguistic Analysis Tools

We will use the [spaCy](https://spacy.io/) library (particularly its [linguistic features](https://spacy.io/usage/linguistic-features)) to retrieve linguistic information about the words and sentences in a text. Several models are available for Spanish texts, such as `es_dep_news_trf`. Below, we illustrate retrieving part-of-speech tags and dependency information.
For each word, we will show:

* lemma
* POS-tag

For each sentence:

* root
* major noun and prepositional phrases (constituents of the sentence)
* syntactic function of each identified constituent
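
The per-word and per-sentence items above map directly onto token attributes. The sketch below is a guess at what helpers like `word_pos_lemma` and `get_roots` might do; the real `LinguisticAnalyzer` methods are not shown in this notebook. To keep it runnable without downloading a model, it uses a small stand-in class `Tok` (hypothetical) exposing the same attribute names as a spaCy token (`text`, `pos_`, `lemma_`, `dep_`). With spaCy installed, the same functions should work on `nlp(text)`, where `nlp = spacy.load("es_dep_news_trf")`.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Stand-in for a spaCy token, so the sketch runs without a model."""
    text: str
    pos_: str
    lemma_: str
    dep_: str


def word_pos_lemma(doc):
    """Return (text, POS tag, lemma) for every token."""
    return [(t.text, t.pos_, t.lemma_) for t in doc]


def get_roots(doc):
    """Return the text of every token labelled ROOT (one per sentence)."""
    return [t.text for t in doc if t.dep_ == "ROOT"]


# Hand-written annotations for "Vimos un árbol frondoso." (illustrative only).
doc = [
    Tok("Vimos", "VERB", "ver", "ROOT"),
    Tok("un", "DET", "uno", "det"),
    Tok("árbol", "NOUN", "árbol", "obj"),
    Tok("frondoso", "ADJ", "frondoso", "amod"),
    Tok(".", "PUNCT", ".", "punct"),
]
print(word_pos_lemma(doc))
print(get_roots(doc))  # ['Vimos']
```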


In [5]:
linguistic_analyzer = LinguisticAnalyzer()
example1 = linguistic_analyzer.analyze_text("El director del liceo fue sumariado con separación del cargo por “insubordinación”, lo que provocó que el conflicto se agudizara.")
for token in example1:
    print(token.text, token.pos_, token.dep_, token.head)

displacy.render(example1, style="dep", jupyter=True)


El DET det director
director NOUN nsubj sumariado
del ADP case liceo
liceo NOUN nmod director
fue AUX aux sumariado
sumariado VERB ROOT sumariado
con ADP case separación
separación NOUN obl sumariado
del ADP case cargo
cargo NOUN nmod separación
por ADP case insubordinación
“ SYM nmod insubordinación
insubordinación NOUN obl sumariado
” PUNCT punct insubordinación
, PUNCT punct provocó
lo PRON det provocó
que PRON nsubj provocó
provocó VERB advcl sumariado
que SCONJ mark agudizara
el DET det conflicto
conflicto NOUN nsubj agudizara
se PRON expl:pv agudizara
agudizara VERB ccomp provocó
. PUNCT punct sumariado


## More Examples of Sentence Parsing

Below, we provide a few Spanish sentences and demonstrate the detection of their roots, lemmas, POS tags, and some major phrase extraction using our custom function.


In [7]:
sent1 = "Un árbol frondoso nos protege del sol."     
sent2 = "Vimos un árbol frondoso."                    
sent3 = "Le cortaron una rama a un árbol frondoso."
sent4 = "Soñé con un árbol frondoso."
sent5 = "Nos sentamos bajo un árbol frondoso."
sent6 = "El hijo de Luis vive en Madrid"
sent7 = "No se acostumbra a su nueva vida."
sent8 = "Es muy propenso a las infecciones respiratorias en primavera"

sentences = [sent1, sent2, sent3, sent4, sent5, sent6, sent7, sent8]

for i, sent in enumerate(sentences, 1):
    print(f"Example {i}:")
    print(f"Sentence: {sent}")
    lemapos = linguistic_analyzer.word_pos_lemma(sent)
    print(f"Lemmas and POS-TAGS:{str(list(lemapos))}")
    roots = linguistic_analyzer.get_roots(sent)
    print(f"Roots:{roots}")
    major_phrases = linguistic_analyzer.get_major_phrases(sent)
    print(f"Major Phrases: {major_phrases}")
    print("\n")

Example 1:
Sentence: Un árbol frondoso nos protege del sol.
Lemmas and POS-TAGS:[('Un', 'DET', 'uno'), ('árbol', 'NOUN', 'árbol'), ('frondoso', 'ADJ', 'frondoso'), ('nos', 'PRON', 'yo'), ('protege', 'VERB', 'proteger'), ('del', 'ADP', 'del'), ('sol', 'NOUN', 'sol'), ('.', 'PUNCT', '.')]
Roots:['protege']
Major Phrases: [['Un árbol frondoso ', 'SN: Sujeto nsubj'], ['nos', 'SN: Objeto Directo obj']]


Example 2:
Sentence: Vimos un árbol frondoso.
Lemmas and POS-TAGS:[('Vimos', 'VERB', 'ver'), ('un', 'DET', 'uno'), ('árbol', 'NOUN', 'árbol'), ('frondoso', 'ADJ', 'frondoso'), ('.', 'PUNCT', '.')]
Roots:['Vimos']
Major Phrases: [['árbol', 'SN: Objeto Directo obj']]


Example 3:
Sentence: Le cortaron una rama a un árbol frondoso.
Lemmas and POS-TAGS:[('Le', 'PRON', 'él'), ('cortaron', 'VERB', 'cortar'), ('una', 'DET', 'uno'), ('rama', 'NOUN', 'rama'), ('a', 'ADP', 'a'), ('un', 'DET', 'uno'), ('árbol', 'NOUN', 'árbol'), ('frondoso', 'ADJ', 'frondoso'), ('.', 'PUNCT', '.')]
Roots:['cortaron']


# Part 3 - Analyzing Sample Tweets

In this section, we look at specific tweets to identify potential issues in the linguistic processing, having previously preprocessed the texts.


Analyze the following tweets and comment on the problems found in the outputs of the linguistic processing from Part 2, after preprocessing the texts as in Part 1.

* *Un gran acierto d @jorgefdezpp el nombramiento d @Ignacos como nuevo Director Gral. de la Policía: un puesto central en democracia*

* *Les personas no somos eternas Lo más importante es el legado que dejamos, la enseñanza para toda una familia y una generación^*

* *@FanClubMASes: Éstas Navidades regala solidaridad:#positivegeneration (cont) http://t.co/BksU1DZv*

* *@DavidSummersHG: @Edurnity Oyeee...! A Madrid si quiero ir eh? Un besazo guapa! Vamossss ya contaba contigo si o si!!! Un besote enorme!!*



In [8]:
def analyze_tweet(tweet):
    preprocessed_tweet = preprocessor.process_tweet(tweet)
    pos = linguistic_analyzer.pos_tags(preprocessed_tweet)
    roots = linguistic_analyzer.get_roots(preprocessed_tweet)
    major_phrases = linguistic_analyzer.get_major_phrases(preprocessed_tweet)
    lemmas_and_pos = linguistic_analyzer.word_pos_lemma(preprocessed_tweet)
    return pos, roots, major_phrases, lemmas_and_pos

def print_analysis(tweet, pos, roots, major_phrases, lemmas_and_pos):
    print(f"Tweet: {tweet}")
    print(f"Processed Tweet: {preprocessor.process_tweet(tweet)}")
    print(f"POS tags: {pos}")
    print(f"Roots: {roots}")
    print(f"Major phrases: {major_phrases}")
    print(f"Lemmas and POS tags: {lemmas_and_pos}")

## Tweet 1

In [10]:
tweet1 = "Un gran acierto d @jorgefdezpp el nombramiento d @Ignacos como nuevo Director Gral. de la Policía: un puesto central en democracia"
print_analysis(tweet1, *analyze_tweet(tweet1))
displacy.render(linguistic_analyzer.analyze_text(preprocessor.process_tweet(tweet1)), style="dep", jupyter=True)

Tweet: Un gran acierto d @jorgefdezpp el nombramiento d @Ignacos como nuevo Director Gral. de la Policía: un puesto central en democracia
Processed Tweet: un gran acierto de usuario el nombramiento de usuario como nuevo director general de la policia un puesto central en democracia
POS tags: ['DET', 'ADJ', 'NOUN', 'ADP', 'NOUN', 'DET', 'NOUN', 'ADP', 'NOUN', 'SCONJ', 'ADJ', 'NOUN', 'ADJ', 'ADP', 'DET', 'PROPN', 'DET', 'NOUN', 'ADJ', 'ADP', 'NOUN']
Roots: ['acierto']
Major phrases: []
Lemmas and POS tags: [('un', 'DET', 'uno'), ('gran', 'ADJ', 'gran'), ('acierto', 'NOUN', 'acierto'), ('de', 'ADP', 'de'), ('usuario', 'NOUN', 'usuario'), ('el', 'DET', 'el'), ('nombramiento', 'NOUN', 'nombramiento'), ('de', 'ADP', 'de'), ('usuario', 'NOUN', 'usuario'), ('como', 'SCONJ', 'como'), ('nuevo', 'ADJ', 'nuevo'), ('director', 'NOUN', 'director'), ('general', 'ADJ', 'general'), ('de', 'ADP', 'de'), ('la', 'DET', 'el'), ('policia', 'PROPN', 'policia'), ('un', 'DET', 'uno'), ('puesto', 'NOUN', 'puest

In [11]:
displacy.render(linguistic_analyzer.analyze_text(tweet1), style="ent", jupyter=True)

### **Tweet 1 Explanation**

When analyzing the first *tweet*, we observe that it lacks a verb. Therefore, our function `get_major_phrases(text)` returns an empty list, as there are no nominal or prepositional phrases fulfilling a syntactic function relative to a verb.

On the other hand, we see that the tweet is composed of the following constituents:  
- "Un gran acierto d @jorgefdezpp,"  
- "el nombramiento d @Ignacos como nuevo Director Gral. de la Policía,"  
- "un puesto central en democracia."

This causes the syntactic analyzer to interpret the constituent "un puesto central en democracia" as being related to "el nombramiento...," since "nombramiento" has an appositional dependency with "puesto." In other words, it appears to describe the noun. However, this is not accurate, as it is understood that the central position in democracy is actually the role of the *Director General de la Policía.*

Regarding the first issue, since the tweet lacks a verb, the function for identifying nominal phrases from part 2 is not useful, as the syntactic analyzer recognizes "acierto" as the root of the sentence, which, as mentioned earlier, is not a verb.
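
To make the reasoning above concrete, here is one way a phrase extractor of this kind could behave. It is a sketch under assumptions, not the notebook's `get_major_phrases`: the function name `major_phrases`, the label mapping `DEP_TO_FUNCTION`, and the input format (a list of `(text, pos, dep, head_index)` tuples with hand-written parses) are all hypothetical. It only collects dependents of a verbal root, which is why a tweet rooted at a noun such as "acierto" yields nothing.

```python
# Assumed mapping from dependency labels to the function names seen in the outputs.
DEP_TO_FUNCTION = {"nsubj": "Sujeto", "obj": "Objeto Directo", "obl": "Adjunto"}


def major_phrases(tokens):
    """tokens: list of (text, pos, dep, head_index); a root is its own head."""
    children = {i: [] for i in range(len(tokens))}
    for i, (_, _, _, head) in enumerate(tokens):
        if head != i:
            children[head].append(i)

    def subtree(i):
        indices = [i]
        for child in children[i]:
            indices.extend(subtree(child))
        return sorted(indices)

    phrases = []
    for i, (_, _, dep, head) in enumerate(tokens):
        _, head_pos, head_dep, _ = tokens[head]
        # Only dependents of a verbal root are considered.
        if dep not in DEP_TO_FUNCTION or head_dep != "ROOT" or head_pos not in ("VERB", "AUX"):
            continue
        span = " ".join(tokens[j][0] for j in subtree(i))
        # A dependent introduced by a preposition ("case") counts as a prepositional phrase.
        kind = "SP" if any(tokens[j][2] == "case" for j in children[i]) else "SN"
        phrases.append([span, f"{kind}: {DEP_TO_FUNCTION[dep]} {dep}"])
    return phrases


# Hand-written parses (illustrative): a verbal sentence and a verbless fragment.
verbal = [("Un", "DET", "det", 1), ("árbol", "NOUN", "nsubj", 4), ("frondoso", "ADJ", "amod", 1),
          ("nos", "PRON", "obj", 4), ("protege", "VERB", "ROOT", 4), ("del", "ADP", "case", 6),
          ("sol", "NOUN", "obl", 4), (".", "PUNCT", "punct", 4)]
nominal = [("un", "DET", "det", 2), ("gran", "ADJ", "amod", 2), ("acierto", "NOUN", "ROOT", 2),
           ("de", "ADP", "case", 4), ("usuario", "NOUN", "nmod", 2)]

print(major_phrases(verbal))
print(major_phrases(nominal))  # []
```

The output of this sketch need not match the notebook's output for the same sentence, since the actual implementation may select and join tokens differently.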


## Tweet 2

In [29]:
tweet2 = "Les personas no somos eternas. Lo más importante es el legado que dejamos, la enseñanza para toda una familia y una generación^"
print_analysis(tweet2, *analyze_tweet(tweet2))
#displacy.render(linguistic_analyzer.analyze_text(preprocessor.process_tweet(tweet2)), style="dep", jupyter=True)

# Analyze the raw (non-preprocessed) tweet to compare it with the preprocessed version
part2_analysis = linguistic_analyzer.analyze_text(tweet2)
#displacy.render(part2_analysis, style="dep", jupyter=True)

# render as a copyable tree
displacy.render(part2_analysis, style="dep", options={"compact": True, "bg": "#09a3d5", "color": "white", "font": "Source Sans Pro"})

Tweet: Les personas no somos eternas. Lo más importante es el legado que dejamos, la enseñanza para toda una familia y una generación^
Processed Tweet: les personas no somos eternas. lo mas importante es el legado que dejamos, la enseñanza para toda una familia y una generacion^
POS tags: ['DET', 'NOUN', 'ADV', 'AUX', 'ADJ', 'PUNCT', 'PRON', 'ADV', 'ADJ', 'AUX', 'DET', 'NOUN', 'PRON', 'VERB', 'PUNCT', 'DET', 'NOUN', 'ADP', 'DET', 'DET', 'NOUN', 'CCONJ', 'DET', 'NOUN']
Roots: ['eternas', 'legado']
Major phrases: []
Lemmas and POS tags: [('les', 'DET', 'les'), ('personas', 'NOUN', 'persona'), ('no', 'ADV', 'no'), ('somos', 'AUX', 'ser'), ('eternas', 'ADJ', 'eterno'), ('.', 'PUNCT', '.'), ('lo', 'PRON', 'él'), ('mas', 'ADV', 'mas'), ('importante', 'ADJ', 'importante'), ('es', 'AUX', 'ser'), ('el', 'DET', 'el'), ('legado', 'NOUN', 'legado'), ('que', 'PRON', 'que'), ('dejamos', 'VERB', 'dejar'), (',', 'PUNCT', ','), ('la', 'DET', 'el'), ('enseñanza', 'NOUN', 'enseñanza'), ('para', 'ADP', 'p

### **Tweet 2 Explanation**

The conversion to lowercase does not seem to affect the spaCy analysis, as the same syntactic diagram was obtained regardless of the text's representation. The analysis correctly separates the input into two sentences: *"Les personas no somos eternas."* and *"Lo más importante es el legado que dejamos, la enseñanza para toda una familia y una generación^."*

The model tags "Les" as a determiner (DET) and keeps "les" as its lemma, so this nonstandard, gender-neutral form still receives the grammatical role of an article. This suggests that the model copes reasonably well with inclusive or gender-neutral expressions, at least in this example.

Additionally, the word "generación^" is accurately tagged as a noun (NOUN), with its lemma correctly identified as "generacion". The special character (`^`) does not interfere with the analysis, as the algorithm treats it as an external marker without impacting the syntactic or semantic role of the word.


## Tweet 3

In [14]:
tweet3 = "@FanClubMASes: Éstas Navidades regala solidaridad:#positivegeneration (cont) http://t.co/BksU1DZv"
print_analysis(tweet3, *analyze_tweet(tweet3))
displacy.render(linguistic_analyzer.analyze_text(preprocessor.process_tweet(tweet3)), style="dep", jupyter=True)

Tweet: @FanClubMASes: Éstas Navidades regala solidaridad:#positivegeneration (cont) http://t.co/BksU1DZv
Processed Tweet: usuario estas navidades regala solidaridad HASHTAG cont URL
POS tags: ['NOUN', 'DET', 'NOUN', 'VERB', 'NOUN', 'PROPN', 'PROPN', 'PROPN']
Roots: ['regala']
Major phrases: [['usuario ', 'SN: Adjunto obl'], ['estas navidades ', 'SN: Adjunto obl'], ['solidaridad', 'SN: Objeto Directo obj'], ['HASHTAG', 'SN: Objeto Directo obj']]
Lemmas and POS tags: [('usuario', 'NOUN', 'usuario'), ('estas', 'DET', 'este'), ('navidades', 'NOUN', 'navidad'), ('regala', 'VERB', 'regalar'), ('solidaridad', 'NOUN', 'solidaridad'), ('HASHTAG', 'PROPN', 'HASHTAG'), ('cont', 'PROPN', 'cont'), ('URL', 'PROPN', 'URL')]


In [15]:
displacy.render(linguistic_analyzer.analyze_text(tweet3), style="ent", jupyter=True)



### **Tweet 3 Explanation**

The sentence "Estas navidades regala solidaridad" features an imperative verb in the second person singular. It is interpreted as an announcement in which the subject of the sentence is omitted. 

There is an ambiguity in the sentence "estas navidades regala solidaridad," which can arise, for example, if we do not remove the ":" during the preprocessing of tweets. In such cases, the analyzer interprets "estas navidades" as the subject of the sentence and "solidaridad" as the direct object. We chose to remove this punctuation mark because, in both this tweet and tweet 4, it incorrectly marked "usuario:" as the root of the sentence instead of "regala," as indicated in the output.

- Building on the previous point, removing the ":" does not resolve the issue of incorrectly identifying the subject of the sentence. Instead, it identifies 'HASHTAG cont URL' as the subject, whereas the subject is actually omitted (implicit subject).

- We observe that "cont" indicates the tweet references another tweet or quoted text. By replacing the URL with the string "URL", we are potentially losing information in our preprocessing that could be relevant for further analysis.
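
One possible way to limit the information loss mentioned above is to store the original strings alongside the cleaned text instead of discarding them. The following is a minimal sketch of that idea, not part of the notebook's `Preprocessing` class; the function name `replace_keeping_originals` and the patterns are assumptions.

```python
import re

# Assumed patterns; the placeholder names follow the ones used in this notebook.
PATTERNS = {
    "URL": re.compile(r"https?://\S+"),
    "USUARIO": re.compile(r"@\w+"),
    "HASHTAG": re.compile(r"#\w+"),
}


def replace_keeping_originals(text):
    """Replace URLs, mentions and hashtags with placeholders, returning what was removed."""
    removed = {}
    for name, pattern in PATTERNS.items():
        removed[name] = pattern.findall(text)
        text = pattern.sub(name, text)
    return text, removed


clean, removed = replace_keeping_originals(
    "@FanClubMASes: Éstas Navidades regala solidaridad:#positivegeneration (cont) http://t.co/BksU1DZv"
)
print(clean)    # USUARIO: Éstas Navidades regala solidaridad:HASHTAG (cont) URL
print(removed)
```

The `removed` dictionary can then be kept as extra features (for example, the hashtag text) without affecting the input passed to the parser.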

## Tweet 4

In [18]:
tweet4 = "@DavidSummersHG: @Edurnity Oyeee...! A Madrid si quiero ir eh? Un besazo guapa! Vamossss ya contaba contigo si o si!!! Un besote enorme!!"
print_analysis(tweet4, *analyze_tweet(tweet4))
displacy.render(linguistic_analyzer.analyze_text(preprocessor.process_tweet(tweet4)), style="dep", jupyter=True)

Tweet: @DavidSummersHG: @Edurnity Oyeee...! A Madrid si quiero ir eh? Un besazo guapa! Vamossss ya contaba contigo si o si!!! Un besote enorme!!
Processed Tweet: usuario usuario oye.! a madrid si quiero ir eh? un besazo guapa! vamos ya contaba contigo si o si! un besote enorme!
POS tags: ['NOUN', 'NOUN', 'VERB', 'PUNCT', 'PUNCT', 'ADP', 'PROPN', 'SCONJ', 'VERB', 'VERB', 'INTJ', 'PUNCT', 'DET', 'NOUN', 'ADJ', 'PUNCT', 'INTJ', 'ADV', 'VERB', 'PRON', 'ADV', 'CCONJ', 'INTJ', 'PUNCT', 'DET', 'NOUN', 'ADJ', 'PUNCT']
Roots: ['oye', 'quiero', 'besazo', 'contaba']
Major phrases: [['usuario usuario ', 'SN: Sujeto nsubj'], ['a madrid ', 'SP: Adjunto obl'], ['contigo', 'SN: Objeto Directo obj'], ['si', 'SN: Objeto Directo obj'], ['besote', 'SN: Objeto Directo obj']]
Lemmas and POS tags: [('usuario', 'NOUN', 'usuario'), ('usuario', 'NOUN', 'usuario'), ('oye', 'VERB', 'oír'), ('.', 'PUNCT', '.'), ('!', 'PUNCT', '!'), ('a', 'ADP', 'a'), ('madrid', 'PROPN', 'madrid'), ('si', 'SCONJ', 'si'), ('quiero',

In [19]:
displacy.render(linguistic_analyzer.analyze_text(tweet4), style="ent", jupyter=True)



### **Tweet 4 Explanation**

**Identified Issues:**

This tweet is divided by the syntactic analyzer into four sentences with the roots "oye," "quiero," "besazo," and "contaba":
- In the first sentence, "oye" is interpreted as a verb, with "usuario" assigned as its subject. We consider the correct POS tag for "oye" here to be an interjection (INTJ); however, it is tagged as a verb (VERB). This issue could be resolved by removing the ellipsis during text preprocessing, which would yield "oye!" instead of "oye.!". Replacing two or more consecutive dots with a single one is generally the more appropriate rule, but the sequence "...!" is neither common nor grammatically correct.

- In the second sentence, we have a main verb with an omitted subject. The phrase "a Madrid" is correctly interpreted as an adjunct, and "eh" is identified as an interjection.

- In the third sentence, the lemma for "besazo" is left as "besazo" when it should be "beso." The same applies to "besote," as both are augmentatives of the word "beso."

- The fourth constituent is a sentence where a dependency exists that should not be present between the noun phrase "Un besote enorme" and the verb "contaba." For a better analysis, we interpret that this phrase should be an independent constituent from the sentence "vamos ya contaba contigo sí o sí."

- Using generic usernames may result in losing gender-related information. Semantically, the segment "Un besazo guapa!" is directed toward the user "@Edurnity" and not "@DavidSummersHG." However, after preprocessing, it becomes impossible to determine which "usuario" it is addressed to.

- The presence of multiple "dep" dependencies in the dependency tree highlights the informal structure of the tweet, making it more challenging to analyze. This is because the "dep" dependency indicates that it was impossible to determine a more precise relationship between the constituents of a sentence or statement.
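
The ellipsis handling suggested in the first point could be sketched as follows. This is only an illustration of the proposed rule, with a hypothetical function name `normalize_punctuation`; it is not the rule currently applied by `process_tweet`, which produces "oye.!".

```python
import re


def normalize_punctuation(text):
    text = re.sub(r"\.{2,}(?=[!?])", "", text)  # drop an ellipsis right before ! or ?
    text = re.sub(r"\.{2,}", ".", text)         # collapse any other run of dots
    text = re.sub(r"([!?])\1+", r"\1", text)    # collapse repeated ! or ?
    return text


print(normalize_punctuation("Oyeee...! Vamossss si o si!!! ya..."))
# Oyeee! Vamossss si o si! ya.
```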

# General Conclusions

* We observe that tools such as spaCy are accurate on fairly formal text but may struggle with informal tweets that contain abbreviations, repeated letters, user mentions, and similar noise.
* Replacing user names and URLs with placeholders is often convenient, since it avoids cluttering the text with information that may be irrelevant for certain classification tasks.
* For the upcoming sentiment analysis tasks, further preprocessing steps (such as removing stopwords) could also be applied. spaCy can still be valuable, but we need to keep in mind that tweets are often harder to parse than formal text.
