## 10. Natural Language Processing (NLP)

In this notebook we look at several natural language processing techniques using spaCy and NLTK, two of the most widely used NLP libraries.


Task:

* Compare the results with the Gensim library for tokenization and embeddings.
* Tokenize the Kaggle Fake News dataset (https://www.kaggle.com/datasets/mrisdal/fake-news) and build a classification model to detect fake news.

## I. spaCy

**Installing and loading the model**

In [None]:
# install the large English model (en_core_web_lg)
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 2.4 MB/s eta 0:00:00
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
# load libraries
import pandas as pd
import numpy as np
import spacy
from spacy import displacy  # for visualization

**Loading the spaCy model**

In [None]:
nlp = spacy.load("en_core_web_lg")

### 1. Exploring the functionality

In [None]:
# reference text used in the examples below
raw_text = 'spaCy provides a variety of linguistic annotations to give you insights into a text’s grammatical structure.'

# run the text through the NLP pipeline
doc = nlp(raw_text)

**Tokenization**

In [None]:
for token in doc:
    print(token.text)

spaCy
provides
a
variety
of
linguistic
annotations
to
give
you
insights
into
a
text
’s
grammatical
structure
.
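
For comparison, a plain whitespace split of the same sentence leaves `text’s` and `structure.` as single chunks, whereas the tokenizer output above separates the clitic `’s` and the final period. The sketch below uses only the standard library; `spacy_tokens` is copied by hand from the output above rather than produced by running spaCy.

```python
raw_text = 'spaCy provides a variety of linguistic annotations to give you insights into a text’s grammatical structure.'

# naive tokenization: split on whitespace only
naive_tokens = raw_text.split()

# tokens copied by hand from the spaCy output above
spacy_tokens = ['spaCy', 'provides', 'a', 'variety', 'of', 'linguistic',
                'annotations', 'to', 'give', 'you', 'insights', 'into', 'a',
                'text', '’s', 'grammatical', 'structure', '.']

# number of tokens produced by each approach
print(len(naive_tokens), len(spacy_tokens))

# chunks that whitespace splitting leaves unsplit
unsplit = [t for t in naive_tokens if t not in spacy_tokens]
print(unsplit)
```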


**Linguistic annotation**

In [None]:
for token in doc:
    print(token.text, token.pos_, token.dep_)

spaCy VERB nsubj
provides VERB ROOT
a DET det
variety NOUN dobj
of ADP prep
linguistic ADJ amod
annotations NOUN pobj
to PART aux
give VERB relcl
you PRON dative
insights NOUN dobj
into ADP prep
a DET det
text NOUN poss
’s PART case
grammatical ADJ amod
structure NOUN pobj
. PUNCT punct


**Part-of-Speech Tagging**

In [None]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

spaCy spacy VERB VB nsubj xxxXx True False
provides provide VERB VBZ ROOT xxxx True False
a a DET DT det x True True
variety variety NOUN NN dobj xxxx True False
of of ADP IN prep xx True True
linguistic linguistic ADJ JJ amod xxxx True False
annotations annotation NOUN NNS pobj xxxx True False
to to PART TO aux xx True True
give give VERB VB relcl xxxx True True
you you PRON PRP dative xxx True True
insights insight NOUN NNS dobj xxxx True False
into into ADP IN prep xxxx True True
a a DET DT det x True True
text text NOUN NN poss xxxx True False
’s ’s PART POS case ’x False True
grammatical grammatical ADJ JJ amod xxxx True False
structure structure NOUN NN pobj xxxx True False
. . PUNCT . punct . False False
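
The `shape_` column can be read as a character-class summary of each token. The function below is a simplified sketch of that idea, not spaCy's own code: `word_shape` is a hypothetical helper that maps lowercase letters to `x`, uppercase letters to `X` and digits to `d`, and keeps any other character as-is. Capping runs of the same class at four characters is an assumption chosen to match the output above (for example `provides` becomes `xxxx`).

```python
def word_shape(text: str) -> str:
    """Approximate a token's orthographic shape (illustrative helper)."""
    shape = []
    last = ""   # character class of the previous character
    run = 0     # length of the current run of identical classes
    for char in text:
        if char.isalpha():
            cls = "X" if char.isupper() else "x"
        elif char.isdigit():
            cls = "d"
        else:
            cls = char  # punctuation and symbols are kept unchanged
        run = run + 1 if cls == last else 1
        last = cls
        if run <= 4:    # assumed cap: at most four repeats per run
            shape.append(cls)
    return "".join(shape)


for word in ["spaCy", "provides", "’s", "."]:
    print(word, word_shape(word))
```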


**Named entity recognition (NER)**

In [None]:
doc = nlp('Apple is looking at buying U.K. startup for $1 billion')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
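
The two numbers printed for each entity are character offsets into the original string, so slicing the text with them recovers the entity. The sketch below shows this without spaCy; the `entities` list is copied by hand from the output above, and the bracketed annotation format is just an illustrative choice.

```python
text = 'Apple is looking at buying U.K. startup for $1 billion'

# (start_char, end_char, label) triples copied from the output above
entities = [(0, 5, 'ORG'), (27, 31, 'GPE'), (44, 54, 'MONEY')]

# slicing with the offsets gives back each entity's text
for start, end, label in entities:
    print(repr(text[start:end]), label)

# wrap each entity as [text]/LABEL, working right to left so that
# the offsets of earlier entities stay valid while the string grows
annotated = text
for start, end, label in sorted(entities, reverse=True):
    annotated = (annotated[:start] + '[' + annotated[start:end] + ']/'
                 + label + annotated[end:])
print(annotated)
```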


In [None]:
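# displacy.serve starts a blocking web server; inside a notebook,
# displacy.render(doc, style="ent", jupyter=True) draws the result inline instead
displacy.serve(doc, style="ent")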


Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


**Dependency visualization**

In [None]:
displacy.serve(doc, style="dep")


Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


**Word similarity**

In [None]:
for token in doc:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

Apple True 49.544395 False
is True 110.41255 False
looking True 48.28714 False
at True 118.82375 False
buying True 45.90773 False
U.K. True 34.055897 False
startup True 39.72299 False
for True 69.12914 False
$ True 190.25487 False
1 True 118.7086 False
billion True 67.87469 False
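
In the output above, `has_vector` says whether the model holds a vector for the token, `is_oov` whether the token is out of vocabulary, and `vector_norm` is the L2 (Euclidean) length of that vector. As a minimal sketch of the last quantity, the code below computes the norm of a made-up three-dimensional vector; `toy_vector` is illustrative data, not a vector from `en_core_web_lg`.

```python
import math


def l2_norm(vector):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vector))


toy_vector = [3.0, 4.0, 12.0]   # made-up values for illustration
norm = l2_norm(toy_vector)
print(norm)

# dividing by the norm rescales the vector to unit length
unit_vector = [x / norm for x in toy_vector]
print(l2_norm(unit_vector))
```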


**Sentence similarity**

In [None]:
# sentences to compare
doc1 = nlp('I like to eat potato chips')
doc2 = nlp('Fast food is really delicious.')

# similarity between the two sentences
print(doc1, "<->", doc2, doc1.similarity(doc2))

# similarity between two words (potato vs. food)
print(doc1[4], '<->', doc2[1], doc1[4].similarity(doc2[1]))

I like to eat potato chips <-> Fast food is really delicious. 0.4332894543463131
potato <-> food 0.4062895178794861
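
By default, spaCy's `similarity` is the cosine similarity between two vectors, and a document's vector is the average of its token vectors. The sketch below reproduces that computation by hand. The three-dimensional `toy_vectors` are made-up values, so the numbers will not match the model output above; `cosine_similarity` and `mean_vector` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Component-wise average of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# made-up word vectors for illustration only
toy_vectors = {
    'potato': [1.0, 0.0, 1.0],
    'chips':  [1.0, 1.0, 0.0],
    'food':   [1.0, 0.0, 0.0],
}

# word-to-word similarity
word_sim = cosine_similarity(toy_vectors['potato'], toy_vectors['food'])

# "document" similarity: average the word vectors first, then compare
doc_vec = mean_vector([toy_vectors['potato'], toy_vectors['chips']])
doc_sim = cosine_similarity(doc_vec, toy_vectors['food'])

print(round(word_sim, 4), round(doc_sim, 4))
```

Because cosine similarity ignores vector length, two texts can score as similar even when their norms differ widely.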


## II. NLTK

In [None]:
!pip install nltk



In [None]:
# import libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk import ne_chunk
from nltk.sentiment import SentimentIntensityAnalyzer

In [None]:
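# download the resources used below: tokenizer, POS tagger, NE chunker, word list and VADER lexicon
# (recent NLTK releases may ask for differently named resources, e.g. 'punkt_tab'; the error message names the one to download)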
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('vader_lexicon')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

### 1. Exploring the functionality

In [None]:
# reference text used in the examples below
raw_text = 'NLTK is a powerful library for natural language processing with multiple functionalities'

**Tokenization**

In [None]:
tokens = word_tokenize(raw_text)
for token in tokens:
  print(token)

NLTK
is
a
powerful
library
for
natural
language
processing
with
multiple
functionalities


**Part-of-Speech Tagging**

In [None]:
tokens = word_tokenize(raw_text)
pos_tags = pos_tag(tokens)
for pos in pos_tags:
  print(pos)

('NLTK', 'NNP')
('is', 'VBZ')
('a', 'DT')
('powerful', 'JJ')
('library', 'NN')
('for', 'IN')
('natural', 'JJ')
('language', 'NN')
('processing', 'NN')
('with', 'IN')
('multiple', 'JJ')
('functionalities', 'NNS')
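
The tags returned by `pos_tag` follow the Penn Treebank tag set. As a reading aid, the sketch below pairs each tag from the output above with a short description; `PENN_TAGS` is a hand-written lookup covering only the tags that occur here, and `tagged` is copied from the output rather than computed.

```python
# descriptions of the Penn Treebank tags that appear in the output above
PENN_TAGS = {
    'NNP': 'proper noun, singular',
    'VBZ': 'verb, 3rd person singular present',
    'DT': 'determiner',
    'JJ': 'adjective',
    'NN': 'noun, singular or mass',
    'IN': 'preposition or subordinating conjunction',
    'NNS': 'noun, plural',
}

# (token, tag) pairs copied from the output above
tagged = [
    ('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('powerful', 'JJ'),
    ('library', 'NN'), ('for', 'IN'), ('natural', 'JJ'), ('language', 'NN'),
    ('processing', 'NN'), ('with', 'IN'), ('multiple', 'JJ'),
    ('functionalities', 'NNS'),
]

# print each token with its tag and a human-readable description
for token, tag in tagged:
    description = PENN_TAGS.get(tag, 'unknown tag')
    print(f'{token:<16} {tag:<4} {description}')
```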


**Named entity recognition (NER)**

In [None]:
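tokens = word_tokenize(raw_text)
# ne_chunk expects POS-tagged tokens and returns a tree in which
# named entities are subtrees and all other tokens are (word, tag) tuples
ner_tags = ne_chunk(pos_tag(tokens))
for ner in ner_tags:
    print(ner)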

(ORGANIZATION NLTK/NNP)
('is', 'VBZ')
('a', 'DT')
('powerful', 'JJ')
('library', 'NN')
('for', 'IN')
('natural', 'JJ')
('language', 'NN')
('processing', 'NN')
('with', 'IN')
('multiple', 'JJ')
('functionalities', 'NNS')


**Sentiment analysis**

In [None]:
# VADER-based analyzer: returns negative, neutral, positive and compound scores
sid = SentimentIntensityAnalyzer()
sentiment_scores = sid.polarity_scores(raw_text)
print(sentiment_scores)

{'neg': 0.0, 'neu': 0.629, 'pos': 0.371, 'compound': 0.6486}
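
The `compound` value is a single score normalized to the range [-1, 1]. VADER obtains it by squashing the summed word valences `x` with `x / sqrt(x² + α)`, where `α = 15`. The sketch below mirrors that formula. The total valence of 3.3 is an illustrative value chosen because it reproduces the compound score above, not a number read from the lexicon. The ±0.05 cut-offs in the hypothetical `label` helper are the thresholds commonly suggested for VADER and are treated here as an assumption.

```python
import math


def normalize(score, alpha=15.0):
    """Squash an unbounded valence sum into the range [-1, 1]."""
    norm = score / math.sqrt(score * score + alpha)
    return max(-1.0, min(1.0, norm))


def label(compound, threshold=0.05):
    """Map a compound score to a coarse label (assumed thresholds)."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# 3.3 is an illustrative valence sum, not taken from the VADER lexicon
compound = round(normalize(3.3), 4)
print(compound, label(compound))
```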
