## NLTK

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
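# One-time downloads of the resources used below
# (resource names may differ between NLTK versions):
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker')
# nltk.download('words')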
from pprint import pprint

In [None]:
text = "How do you join NATO and how close is Ukraine to becoming a member? by Orson Wells. Paid 1$ 19/12/1994"

In [None]:
tokens = word_tokenize(text)
tags = pos_tag(tokens)

In [None]:
pprint(tags)

In [None]:
ne_tree = nltk.ne_chunk(tags)
print(ne_tree)

In [None]:
# Entity labels produced by ne_chunk include:
# GPE: geopolitical entity (country, city, state)
# ORGANIZATION: organization
# PERSON: person

In [None]:
for tagged_word in ne_tree:
    if hasattr(tagged_word, 'label'):
        print(tagged_word.label())
        print(tagged_word.leaves())


## spaCy

In [None]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

In [None]:
text = "How do you join NATO and how close is Ukraine to becoming a member? by Orson Wells Paid with $10.0 in 20/12/2022"

In [None]:
train = nlp(text)

In [None]:
train.ents

In [None]:
for entity in train.ents:
    # label_ is the readable string label; label is only its integer ID
    print(entity.text, entity.label_)

In [None]:
# Entity visualizer documentation: https://spacy.io/usage/visualizers#ent
displacy.render(train, jupyter=True, style="ent")

# ACTIVITY
- Can we apply spaCy to NER in other languages (Catalan, Spanish, ...)?

In [None]:
text = "La compilació és estructurada de tal manera que abans de cada rondalla s'especifica qui la va contar al recopilador, si la hi contaren diverses persones de diversos pobles i, àdhuc, si els personatges eren d'aqueix poble."
train = nlp(text)
train.ents
for entity in train.ents:
    print(entity.text, entity.label_)


In [None]:
displacy.render(train, jupyter=True, style="ent")

As we just saw, the English pipeline cannot handle other languages. spaCy needs a language-specific model for each language, and that model must be downloaded first.
The installation guide for the available models is on the [spaCy site](https://spacy.io/usage/models).
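Since each language needs its own model, a small lookup can keep the model names in one place. The following is only a sketch: `MODELS`, `model_name_for` and `load_pipeline` are hypothetical helper names, not part of spaCy, and the table lists only the three models used in this notebook.

```python
# Model names as used in this notebook; other languages are listed
# on the spaCy models page.
MODELS = {
    "en": "en_core_web_sm",
    "ca": "ca_core_news_sm",
    "es": "es_core_news_sm",
}


def model_name_for(lang):
    """Return the small spaCy model name for a language code (hypothetical helper)."""
    try:
        return MODELS[lang]
    except KeyError:
        raise ValueError(f"No model configured for language {lang!r}") from None


def load_pipeline(lang):
    """Load the pipeline for `lang`, with a hint if the model is not installed."""
    import spacy  # imported here so the lookup works without spaCy installed

    name = model_name_for(lang)
    try:
        return spacy.load(name)
    except OSError:
        # spaCy raises OSError when the model package cannot be found
        raise RuntimeError(f"Run: python -m spacy download {name}") from None
```

With the models installed, `nlp = load_pipeline("es")` would stand in for the explicit `spacy.load("es_core_news_sm")` call.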

In [None]:
# Install the Catalan language package
!python -m spacy download ca_core_news_sm

In [None]:
# Load the new language package
nlp = spacy.load("ca_core_news_sm")

In [None]:
train = nlp(text)
train.ents
for entity in train.ents:
    print(entity.text, entity.label_)

In [None]:
displacy.render(train, jupyter=True, style="ent")

We couldn't get the Catalan package to work. Let's try the Spanish package.

In [None]:
# Install the Spanish language package (uncomment on first run)
#!python -m spacy download es_core_news_sm

nlp = spacy.load("es_core_news_sm")

In [None]:
text = "La libertad, Sancho, es uno de los más preciosos dones que a Los Hombres dieron los cielos; con ella no pueden igualarse los tesoros que encierra la tierra ni el mar encubre; por la libertad, así como por la honra se puede y debe aventurar la vida, y, por el contrario, el cautiverio es el mayor mal que puede venir a los hombres."
train = nlp(text)
train.ents
for entity in train.ents:
    print(entity.text, entity.label_)

In [None]:
displacy.render(train, jupyter=True, style="ent")

### Activity 6.1: Load the Activity 6 text and try to extract all dates

Attempt with spaCy

In [None]:
nlp = en_core_web_sm.load()
# Read the whole file; the context manager closes it afterwards
with open("personX.txt", "r") as f:
    text = f.read()
train = nlp(text)
train.ents

In [None]:
for entity in train.ents:
    if entity.label_ == "DATE":
        print(entity.text, entity.label_)

Attempt with NLTK

In [None]:
tokens = word_tokenize(text)
tags = pos_tag(tokens)

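# The Penn Treebank tag "CD" marks cardinal numbers, so this prints
# every number in the text, not only dates
for tag in tags:
    if tag[1] == "CD":
        print(tag)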

### Conclusion:

Using a high-level tool such as an NLP pipeline for a very specific task, like extracting dates or any text that follows a fixed pattern, is inaccurate and slow. For such tasks a regular expression is the better choice.
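As a rough illustration of the regular-expression approach, the sketch below assumes dates are written as day/month/year with a four-digit year, like the 19/12/1994 in the first sample sentence. The actual date format in `personX.txt` is not known here, so the pattern would need adjusting, and `DATE_PATTERN` and `find_dates` are illustrative names only.

```python
import re
from datetime import datetime

# Assumed format: numeric day/month/year dates such as 19/12/1994
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def find_dates(text):
    """Return every dd/mm/yyyy substring of `text` that is a real calendar date."""
    dates = []
    for match in DATE_PATTERN.finditer(text):
        try:
            # strptime rejects impossible dates that still fit the pattern
            datetime.strptime(match.group(0), "%d/%m/%Y")
        except ValueError:
            continue  # e.g. 31/02/2020
        dates.append(match.group(0))
    return dates


sample = "by Orson Wells. Paid 1$ 19/12/1994"
print(find_dates(sample))
```

The regular expression only finds candidates with the right shape. The `strptime` check is a second, optional step that filters out strings such as 31/02/2020.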