# Document vectorization
Created by **Hernández Jiménez Erick Yael**. Group: _5BV1_, **Escuela Superior de Cómputo**.

For the **Natural Language Technologies** course (Tecnologías de Lenguaje Natural), semester 2025-1 of the **Artificial Intelligence Engineering** degree.

> Last modified: October 13, 2024

### Brief description
This practice normalizes and vectorizes 3 documents in English for analysis in the corresponding report. Normalization uses:
- _stemming_
- lemmatization
- _POS tagging_

The following counting methods are used:
- term count.
- term frequency.
- inverse document frequency.
- _one-hot encoding_.
- term probability.
- combined frequencies (TF-IDF).

For the rationale and the results obtained, see the practice report.
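
As a rough illustration of the counting methods listed above, the sketch below computes them on three tiny, made-up token lists. The toy documents, the function names and the exact formulas (relative term frequency, `log(N / df)` for the inverse document frequency) are assumptions made for this example; the report may define these quantities differently.

```python
import math
from collections import Counter

# Toy, made-up token lists standing in for the normalized documents (assumption).
toy_docs = {
    "doc_1": ["pancreat", "cancer", "jaundic", "cancer"],
    "doc_2": ["pancreat", "tube", "tube"],
    "doc_3": ["cancer", "renal", "failur"],
}

# Shared vocabulary, sorted so every vector uses the same term order.
vocabulary = sorted({term for tokens in toy_docs.values() for term in tokens})

def term_count(tokens):
    """Raw number of occurrences of each vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def one_hot(tokens):
    """1 if the term occurs in the document, 0 otherwise."""
    present = set(tokens)
    return [int(term in present) for term in vocabulary]

def term_frequency(tokens):
    """Count divided by document length (also an estimate of term probability)."""
    return [count / len(tokens) for count in term_count(tokens)]

def inverse_document_frequency(docs):
    """log(N / df), where df is the number of documents containing the term."""
    n_docs = len(docs)
    idf = []
    for term in vocabulary:
        df = sum(1 for tokens in docs.values() if term in tokens)
        idf.append(math.log(n_docs / df))
    return idf

def tf_idf(tokens, idf):
    """Element-wise product of term frequency and inverse document frequency."""
    return [tf * weight for tf, weight in zip(term_frequency(tokens), idf)]

idf = inverse_document_frequency(toy_docs)
for name, tokens in toy_docs.items():
    print(name, term_count(tokens), one_hot(tokens))
    print(name, [round(value, 3) for value in tf_idf(tokens, idf)])
```

Because the vocabulary is built from the documents themselves, every term has a document frequency of at least 1, so the division inside the logarithm is always defined.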
### Documents to analyze
|ID|Title|Variable|
|-:|-:|-:|
|1|Pancreatic cancer with metastasis. Jaundice with transaminitis, evaluate for obstruction process.|`doc_1`|
|2|Pancreatitis. Breast cancer. No output from enteric tube. Assess tube.|`doc_2`|
|3|Metastasis pancreatic cancer. Acute renal failure, evaluate for hydronephrosis or obstructive uropathy.|`doc_3`|

# Libraries used

In [1]:
import pdfplumber                           # Reads text from PDF files
import nltk                                 # Natural language processing utilities
from nltk.corpus import stopwords           # Stop-word lists
from nltk.stem import PorterStemmer         # Stemming
from nltk.stem import WordNetLemmatizer     # Lemmatization
from nltk.tokenize import word_tokenize     # Tokenizer
from string import punctuation              # Punctuation characters
from sklearn.preprocessing import MultiLabelBinarizer   # One-hot encoding
nltk.download('punkt_tab')                      # Tokenizer model
nltk.download('averaged_perceptron_tagger_eng') # POS-tagging model
nltk.download('stopwords')                  # Stop-word lists
nltk.download('wordnet')                    # WordNet data used by the lemmatizer

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\mafes\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\mafes\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mafes\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mafes\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# Reading the documents


In [2]:
# Path to PDF document 1
ruta_doc_1 = './documentos/documento-1.pdf'

# Open the PDF
with pdfplumber.open(ruta_doc_1) as pdf:
    # Extract the text of every page
    doc_1 = ""                        # Start with an empty string
    for page in pdf.pages:            # For each page of the PDF...
        doc_1 += page.extract_text()  # ...append the text extracted from that page

    # Print the resulting text
    print(doc_1)

Medicine
®
Clinical Case Report
OPEN
Treatment of pancreatic head cancer
with obstructive jaundice by endoscopy
ultrasonography-guided gastrojejunostomy
A case report and literature review
∗
Zhaohua Shen, MD, Li Tian, MD, Xiaoyan Wang, MD
Abstract
Rationale:Ultrasonography-guidedgastrojejunostomy(EUS-GJ)mightbeasafe,innovativeandminimallyinvasiveinterventional
treatmentforpatientswithgastricoutletobstruction(GOO)asanalternativetothesurgicalapproach.Todate,fewcaseshavebeen
reportedin the literature.
Patientconcerns:Acaseofpancreaticheadcarcinomawithobstructivejaundiceoccurredina78-year-oldmanwithaprior
history ofpancreatic headcancer. Biliary stentplacement wasconducted1yearearlier.
Diagnoses: The patient was diagnosed with pancreatic cancer, pulmonary infection, pyloric obstruction, and biliary stent
implantation.
Interventions:EUS-GJwasperformed.Thewireandadouble-ballooncatheterreachedthepositionofstenosis,thenadouble
mushroom headbracket wasreleased underEUS. The positionwas confirme

In [3]:
# Path to PDF document 2
ruta_doc_2 = './documentos/documento-2.pdf'

# Open the PDF
with pdfplumber.open(ruta_doc_2) as pdf:
    # Extract the text of every page
    doc_2 = ""                          # Start with an empty string
    for page in pdf.pages:              # For each page of the PDF...
        doc_2 += page.extract_text()    # ...append the text extracted from that page

    # Print the resulting text
    print(doc_2)

10
Review Article
Page 1 of 10
Drain and nasogastric tube use following
pancreatoduodenectomy: a narrative review
Thomas B. Russell^, Peter L. Labib^, Somaiah Aroori^
Department of HPB Surgery, University Hospitals Plymouth NHS Trust, Derriford Road, Plymouth, UK
Contributions: (I) Conception and design: All authors; (II) Administrative support: TB Russell; (III) Provision of study materials or patients: None;
(IV) Collection and assembly of data: All authors; (V) Data analysis and interpretation: All authors; (VI) Manuscript writing: All authors; (VII) Final
approval of manuscript: All authors.
Correspondence to: Somaiah Aroori, MS, MD, FRCS. Department of HPB Surgery, University Hospitals Plymouth NHS Trust, Derriford Road,
Plymouth, PL6 8DH, UK. Email: s.aroori@nhs.net.
Background and Objective: Patients with cancer affecting the head of the pancreas have a dismal
prognosis. Around one fifth present early enough to be considered candidates for surgical resection.
Pancreatoduodenecto

In [4]:
# Path to PDF document 3
ruta_doc_3 = './documentos/documento-3.pdf'

# Open the PDF
with pdfplumber.open(ruta_doc_3) as pdf:
    # Extract the text of every page
    doc_3 = ""                          # Start with an empty string
    for page in pdf.pages:              # For each page of the PDF...
        doc_3 += page.extract_text()    # ...append the text extracted from that page

    # Print the resulting text
    print(doc_3)

W J N World Journal of
Nephrology
Submit a Manuscript: https://www.f6publishing.com World J Nephrol 2022 November 25; 11(6): 146-163
DOI: 10.5527/wjn.v11.i6.146 ISSN 2220-6124 (online)
REVIEW
Acute kidney injury due to bilateral malignant ureteral obstruction: Is
there an optimal mode of drainage?
Rabea Ahmed Gadelkareem, Ahmed Mahmoud Abdelraouf, Ahmed Mohammed El-Taher, Abdelfattah
Ibrahim Ahmed
Specialty type: Urology and Rabea Ahmed Gadelkareem, Ahmed Mahmoud Abdelraouf, Ahmed Mohammed El-Taher,
Abdelfattah Ibrahim Ahmed, Department of Urology, Assiut Urology and Nephrology Hospital,
nephrology
Faculty of Medicine, Assiut University, Assiut 71515, Assiut, Egypt
Provenance and peer review:
Corresponding author: Rabea Ahmed Gadelkareem, MD, Assistant Professor, Department of
Invited article; Externally peer
Urology, Assiut Urology and Nephrology Hospital, Faculty of Medicine, Assiut University,
reviewed.
Elgamaa Street, Assiut 71515, Assiut, Egypt. dr.rabeagad@yahoo.com
Peer-review m
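
The three reading cells repeat the same loop. A minimal sketch of a shared helper follows; `join_page_texts` is a hypothetical name, and the guard assumes that `extract_text()` may return `None` or an empty string for a page without extractable text, depending on the pdfplumber version. The helper accepts any iterable of page texts, so it could be fed `(page.extract_text() for page in pdf.pages)`.

```python
def join_page_texts(page_texts, separator="\n"):
    """Concatenate per-page texts, skipping pages that produced no text."""
    return separator.join(text for text in page_texts if text)

# Stand-in for the values returned by page.extract_text() (made-up data).
sample_pages = ["Clinical Case Report", None, "", "Abstract"]
print(join_page_texts(sample_pages))
```

Unlike the cells in this section, the sketch puts a newline between pages, so the last word of one page is not glued to the first word of the next.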

# Normalization

## Lowercasing

In [5]:
# Convert everything to lowercase
doc_1 = doc_1.lower()
doc_2 = doc_2.lower()
doc_3 = doc_3.lower()

# Print the results
print(doc_1)

medicine
®
clinical case report
open
treatment of pancreatic head cancer
with obstructive jaundice by endoscopy
ultrasonography-guided gastrojejunostomy
a case report and literature review
∗
zhaohua shen, md, li tian, md, xiaoyan wang, md
abstract
rationale:ultrasonography-guidedgastrojejunostomy(eus-gj)mightbeasafe,innovativeandminimallyinvasiveinterventional
treatmentforpatientswithgastricoutletobstruction(goo)asanalternativetothesurgicalapproach.todate,fewcaseshavebeen
reportedin the literature.
patientconcerns:acaseofpancreaticheadcarcinomawithobstructivejaundiceoccurredina78-year-oldmanwithaprior
history ofpancreatic headcancer. biliary stentplacement wasconducted1yearearlier.
diagnoses: the patient was diagnosed with pancreatic cancer, pulmonary infection, pyloric obstruction, and biliary stent
implantation.
interventions:eus-gjwasperformed.thewireandadouble-ballooncatheterreachedthepositionofstenosis,thenadouble
mushroom headbracket wasreleased undereus. the positionwas confirme

In [6]:
print(doc_2)

10
review article
page 1 of 10
drain and nasogastric tube use following
pancreatoduodenectomy: a narrative review
thomas b. russell^, peter l. labib^, somaiah aroori^
department of hpb surgery, university hospitals plymouth nhs trust, derriford road, plymouth, uk
contributions: (i) conception and design: all authors; (ii) administrative support: tb russell; (iii) provision of study materials or patients: none;
(iv) collection and assembly of data: all authors; (v) data analysis and interpretation: all authors; (vi) manuscript writing: all authors; (vii) final
approval of manuscript: all authors.
correspondence to: somaiah aroori, ms, md, frcs. department of hpb surgery, university hospitals plymouth nhs trust, derriford road,
plymouth, pl6 8dh, uk. email: s.aroori@nhs.net.
background and objective: patients with cancer affecting the head of the pancreas have a dismal
prognosis. around one fifth present early enough to be considered candidates for surgical resection.
pancreatoduodenecto

In [7]:
print(doc_3)

w j n world journal of
nephrology
submit a manuscript: https://www.f6publishing.com world j nephrol 2022 november 25; 11(6): 146-163
doi: 10.5527/wjn.v11.i6.146 issn 2220-6124 (online)
review
acute kidney injury due to bilateral malignant ureteral obstruction: is
there an optimal mode of drainage?
rabea ahmed gadelkareem, ahmed mahmoud abdelraouf, ahmed mohammed el-taher, abdelfattah
ibrahim ahmed
specialty type: urology and rabea ahmed gadelkareem, ahmed mahmoud abdelraouf, ahmed mohammed el-taher,
abdelfattah ibrahim ahmed, department of urology, assiut urology and nephrology hospital,
nephrology
faculty of medicine, assiut university, assiut 71515, assiut, egypt
provenance and peer review:
corresponding author: rabea ahmed gadelkareem, md, assistant professor, department of
invited article; externally peer
urology, assiut urology and nephrology hospital, faculty of medicine, assiut university,
reviewed.
elgamaa street, assiut 71515, assiut, egypt. dr.rabeagad@yahoo.com
peer-review m

## POS tagging

In [8]:
# Tokenize the documents; note that each document changes here
# from a string to a list of tokens
doc_1 = word_tokenize(doc_1, "english")
doc_2 = word_tokenize(doc_2, "english")
doc_3 = word_tokenize(doc_3, "english")

In [9]:
'''
Apply POS tagging; note that each document changes here from a list of
tokens to a list of (token, tag) tuples.
Meaning of some of the tags:
- DT: Determiner (e.g. "the")
- JJ: Adjective (e.g. "quick")
- NN: Noun, singular (e.g. "fox")
- VBZ: Verb, third person singular present (e.g. "jumps")
- IN: Preposition or subordinating conjunction (e.g. "over")
'''
doc_1 = nltk.pos_tag(doc_1)
doc_2 = nltk.pos_tag(doc_2)
doc_3 = nltk.pos_tag(doc_3)
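
The cell above explains five tags, but the tagged outputs in this section contain several more. The sketch below is a small lookup for the Penn Treebank tags visible in those outputs, plus a tag histogram. `PENN_TAGS`, `describe_tag` and `tag_histogram` are hypothetical names, and the sample pairs are copied from the printed results.

```python
from collections import Counter

# Penn Treebank tags that appear in the outputs of this notebook.
PENN_TAGS = {
    "CC": "coordinating conjunction",
    "CD": "cardinal number",
    "DT": "determiner",
    "EX": "existential 'there'",
    "FW": "foreign word",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "RB": "adverb",
    "TO": "'to'",
    "VBD": "verb, past tense",
    "VBG": "verb, gerund or present participle",
    "VBN": "verb, past participle",
    "VBP": "verb, non-3rd person singular present",
    "VBZ": "verb, 3rd person singular present",
}

def describe_tag(tag):
    """Return a short description, or a placeholder for tags not listed here."""
    return PENN_TAGS.get(tag, "unlisted tag")

def tag_histogram(tagged_tokens):
    """Count how often each tag occurs in a list of (token, tag) pairs."""
    return Counter(tag for _, tag in tagged_tokens)

sample = [("medicine", "NN"), ("clinical", "JJ"), ("case", "NN"), ("hospitals", "NNS")]
for tag, count in tag_histogram(sample).most_common():
    print(f"{tag:4} {count}  {describe_tag(tag)}")
```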

In [10]:
# Print the results
# Each element is a (token, tag) tuple
print(doc_1)

[('medicine', 'NN'), ('®', 'NNP'), ('clinical', 'JJ'), ('case', 'NN'), ('report', 'NN'), ('open', 'JJ'), ('treatment', 'NN'), ('of', 'IN'), ('pancreatic', 'JJ'), ('head', 'NN'), ('cancer', 'NN'), ('with', 'IN'), ('obstructive', 'JJ'), ('jaundice', 'NN'), ('by', 'IN'), ('endoscopy', 'JJ'), ('ultrasonography-guided', 'JJ'), ('gastrojejunostomy', 'NN'), ('a', 'DT'), ('case', 'NN'), ('report', 'NN'), ('and', 'CC'), ('literature', 'NN'), ('review', 'NN'), ('∗', 'NNP'), ('zhaohua', 'NNP'), ('shen', 'NN'), (',', ','), ('md', 'NN'), (',', ','), ('li', 'JJ'), ('tian', 'NN'), (',', ','), ('md', 'NN'), (',', ','), ('xiaoyan', 'NNP'), ('wang', 'NN'), (',', ','), ('md', 'FW'), ('abstract', 'JJ'), ('rationale', 'NN'), (':', ':'), ('ultrasonography-guidedgastrojejunostomy', 'JJ'), ('(', '('), ('eus-gj', 'JJ'), (')', ')'), ('mightbeasafe', 'NN'), (',', ','), ('innovativeandminimallyinvasiveinterventional', 'JJ'), ('treatmentforpatientswithgastricoutletobstruction', 'NN'), ('(', '('), ('goo', 'NN'), ('

In [11]:
print(doc_2)

[('10', 'CD'), ('review', 'NN'), ('article', 'NN'), ('page', 'NN'), ('1', 'CD'), ('of', 'IN'), ('10', 'CD'), ('drain', 'NN'), ('and', 'CC'), ('nasogastric', 'JJ'), ('tube', 'NN'), ('use', 'NN'), ('following', 'VBG'), ('pancreatoduodenectomy', 'NN'), (':', ':'), ('a', 'DT'), ('narrative', 'JJ'), ('review', 'NN'), ('thomas', 'NN'), ('b.', 'NN'), ('russell^', 'NN'), (',', ','), ('peter', 'NN'), ('l.', 'NN'), ('labib^', 'NN'), (',', ','), ('somaiah', 'JJ'), ('aroori^', 'NN'), ('department', 'NN'), ('of', 'IN'), ('hpb', 'NN'), ('surgery', 'NN'), (',', ','), ('university', 'NN'), ('hospitals', 'NNS'), ('plymouth', 'VBP'), ('nhs', 'JJ'), ('trust', 'NN'), (',', ','), ('derriford', 'NN'), ('road', 'NN'), (',', ','), ('plymouth', 'NN'), (',', ','), ('uk', 'JJ'), ('contributions', 'NNS'), (':', ':'), ('(', '('), ('i', 'NN'), (')', ')'), ('conception', 'NN'), ('and', 'CC'), ('design', 'NN'), (':', ':'), ('all', 'DT'), ('authors', 'NNS'), (';', ':'), ('(', '('), ('ii', 'NN'), (')', ')'), ('administ

In [12]:
print(doc_3)

[('w', 'NN'), ('j', 'NN'), ('n', 'JJ'), ('world', 'NN'), ('journal', 'NN'), ('of', 'IN'), ('nephrology', 'JJ'), ('submit', 'NN'), ('a', 'DT'), ('manuscript', 'NN'), (':', ':'), ('https', 'NN'), (':', ':'), ('//www.f6publishing.com', 'JJ'), ('world', 'NN'), ('j', 'NN'), ('nephrol', 'NN'), ('2022', 'CD'), ('november', 'IN'), ('25', 'CD'), (';', ':'), ('11', 'CD'), ('(', '('), ('6', 'CD'), (')', ')'), (':', ':'), ('146-163', 'JJ'), ('doi', 'NN'), (':', ':'), ('10.5527/wjn.v11.i6.146', 'CD'), ('issn', 'JJ'), ('2220-6124', 'JJ'), ('(', '('), ('online', 'JJ'), (')', ')'), ('review', 'VBP'), ('acute', 'JJ'), ('kidney', 'NN'), ('injury', 'NN'), ('due', 'JJ'), ('to', 'TO'), ('bilateral', 'JJ'), ('malignant', 'JJ'), ('ureteral', 'JJ'), ('obstruction', 'NN'), (':', ':'), ('is', 'VBZ'), ('there', 'EX'), ('an', 'DT'), ('optimal', 'JJ'), ('mode', 'NN'), ('of', 'IN'), ('drainage', 'NN'), ('?', '.'), ('rabea', 'NN'), ('ahmed', 'VBD'), ('gadelkareem', 'NN'), (',', ','), ('ahmed', 'VBD'), ('mahmoud', 'J

## Removing _stop words_ and punctuation

In [13]:
stop_words = set(stopwords.words('english'))    # Load the English stop-word list as a set

In [14]:
# Remove stop words and punctuation
doc_1 = [[word, tag]                                        # Build a list of [word, tag] pairs...
    for word, tag                                           # for each word and tag...
    in doc_1                                                # in the document...
    if word not in stop_words and word not in punctuation]  # keeping only words that are neither stop words nor punctuation

In [15]:
# Remove stop words and punctuation
doc_2 = [[word, tag]                                        # Build a list of [word, tag] pairs...
    for word, tag                                           # for each word and tag...
    in doc_2                                                # in the document...
    if word not in stop_words and word not in punctuation]  # keeping only words that are neither stop words nor punctuation

In [16]:
# Remove stop words and punctuation
doc_3 = [[word, tag]                                        # Build a list of [word, tag] pairs...
    for word, tag                                           # for each word and tag...
    in doc_3                                                # in the document...
    if word not in stop_words and word not in punctuation]  # keeping only words that are neither stop words nor punctuation
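
One detail of the filter used in these cells: `word not in punctuation` is a substring test against the string `string.punctuation`, so a multi-character token such as the two-backtick quote mark that a Treebank-style tokenizer can emit is not matched and survives. The sketch below is one possible alternative that checks every character of the token. `is_punctuation`, `filter_tokens` and the three-word stop list are hypothetical and only serve the example.

```python
from string import punctuation

PUNCTUATION_CHARS = set(punctuation)

def is_punctuation(token):
    """True when the token is non-empty and made only of punctuation characters."""
    return bool(token) and all(ch in PUNCTUATION_CHARS for ch in token)

def filter_tokens(tagged_tokens, stop_words):
    """Keep [word, tag] pairs whose word is neither a stop word nor punctuation."""
    return [[word, tag]
            for word, tag in tagged_tokens
            if word not in stop_words and not is_punctuation(word)]

# Made-up sample; the tiny stop-word set is an assumption for the demo.
sample = [("the", "DT"), ("patient", "NN"), ("``", "``"), (",", ","), ("eus-gj", "JJ")]
demo_stop_words = {"the", "of", "with"}

filtered = filter_tokens(sample, demo_stop_words)
print(filtered)

# The substring test misses the two-backtick token:
print("``" in punctuation)
```

Hyphenated words such as `eus-gj` are kept, because they contain at least one character that is not punctuation.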

In [17]:
# Print the results
print(f"Documento 1 filtrado: {doc_1}")

Documento 1 filtrado: [['medicine', 'NN'], ['®', 'NNP'], ['clinical', 'JJ'], ['case', 'NN'], ['report', 'NN'], ['open', 'JJ'], ['treatment', 'NN'], ['pancreatic', 'JJ'], ['head', 'NN'], ['cancer', 'NN'], ['obstructive', 'JJ'], ['jaundice', 'NN'], ['endoscopy', 'JJ'], ['ultrasonography-guided', 'JJ'], ['gastrojejunostomy', 'NN'], ['case', 'NN'], ['report', 'NN'], ['literature', 'NN'], ['review', 'NN'], ['∗', 'NNP'], ['zhaohua', 'NNP'], ['shen', 'NN'], ['md', 'NN'], ['li', 'JJ'], ['tian', 'NN'], ['md', 'NN'], ['xiaoyan', 'NNP'], ['wang', 'NN'], ['md', 'FW'], ['abstract', 'JJ'], ['rationale', 'NN'], ['ultrasonography-guidedgastrojejunostomy', 'JJ'], ['eus-gj', 'JJ'], ['mightbeasafe', 'NN'], ['innovativeandminimallyinvasiveinterventional', 'JJ'], ['treatmentforpatientswithgastricoutletobstruction', 'NN'], ['goo', 'NN'], ['asanalternativetothesurgicalapproach.todate', 'NN'], ['fewcaseshavebeen', 'JJ'], ['reportedin', 'VBP'], ['literature', 'NN'], ['patientconcerns', 'NNS'], ['acaseofpancrea

In [18]:
print(f"Documento 2 filtrado: {doc_2}")

Documento 2 filtrado: [['10', 'CD'], ['review', 'NN'], ['article', 'NN'], ['page', 'NN'], ['1', 'CD'], ['10', 'CD'], ['drain', 'NN'], ['nasogastric', 'JJ'], ['tube', 'NN'], ['use', 'NN'], ['following', 'VBG'], ['pancreatoduodenectomy', 'NN'], ['narrative', 'JJ'], ['review', 'NN'], ['thomas', 'NN'], ['b.', 'NN'], ['russell^', 'NN'], ['peter', 'NN'], ['l.', 'NN'], ['labib^', 'NN'], ['somaiah', 'JJ'], ['aroori^', 'NN'], ['department', 'NN'], ['hpb', 'NN'], ['surgery', 'NN'], ['university', 'NN'], ['hospitals', 'NNS'], ['plymouth', 'VBP'], ['nhs', 'JJ'], ['trust', 'NN'], ['derriford', 'NN'], ['road', 'NN'], ['plymouth', 'NN'], ['uk', 'JJ'], ['contributions', 'NNS'], ['conception', 'NN'], ['design', 'NN'], ['authors', 'NNS'], ['ii', 'NN'], ['administrative', 'JJ'], ['support', 'NN'], ['tb', 'NN'], ['russell', 'NN'], ['iii', 'NN'], ['provision', 'NN'], ['study', 'NN'], ['materials', 'NNS'], ['patients', 'NNS'], ['none', 'NN'], ['iv', 'NN'], ['collection', 'NN'], ['assembly', 'NN'], ['data', 

In [19]:
print(f"Documento 3 filtrado: {doc_3}")

Documento 3 filtrado: [['w', 'NN'], ['j', 'NN'], ['n', 'JJ'], ['world', 'NN'], ['journal', 'NN'], ['nephrology', 'JJ'], ['submit', 'NN'], ['manuscript', 'NN'], ['https', 'NN'], ['//www.f6publishing.com', 'JJ'], ['world', 'NN'], ['j', 'NN'], ['nephrol', 'NN'], ['2022', 'CD'], ['november', 'IN'], ['25', 'CD'], ['11', 'CD'], ['6', 'CD'], ['146-163', 'JJ'], ['doi', 'NN'], ['10.5527/wjn.v11.i6.146', 'CD'], ['issn', 'JJ'], ['2220-6124', 'JJ'], ['online', 'JJ'], ['review', 'VBP'], ['acute', 'JJ'], ['kidney', 'NN'], ['injury', 'NN'], ['due', 'JJ'], ['bilateral', 'JJ'], ['malignant', 'JJ'], ['ureteral', 'JJ'], ['obstruction', 'NN'], ['optimal', 'JJ'], ['mode', 'NN'], ['drainage', 'NN'], ['rabea', 'NN'], ['ahmed', 'VBD'], ['gadelkareem', 'NN'], ['ahmed', 'VBD'], ['mahmoud', 'JJ'], ['abdelraouf', 'NN'], ['ahmed', 'VBD'], ['mohammed', 'VBN'], ['el-taher', 'RB'], ['abdelfattah', 'VBZ'], ['ibrahim', 'NN'], ['ahmed', 'VBD'], ['specialty', 'NN'], ['type', 'NN'], ['urology', 'NN'], ['rabea', 'NN'], ['a

## Lemmatization

In [20]:
lemmatizer = WordNetLemmatizer()    # Create the lemmatizer

In [22]:
# Apply lemmatization
doc_1 = [[lemmatizer.lemmatize(word), tag]  # Build a list with the lemma of each word...
         for word, tag                      # for each word and tag...
         in doc_1]                          # in the document

In [23]:
doc_2 = [[lemmatizer.lemmatize(word), tag]  # Build a list with the lemma of each word...
         for word, tag                      # for each word and tag...
         in doc_2]                          # in the document

In [24]:
doc_3 = [[lemmatizer.lemmatize(word), tag]  # Build a list with the lemma of each word...
         for word, tag                      # for each word and tag...
         in doc_3]                          # in the document
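
Called with a single argument, `WordNetLemmatizer.lemmatize` treats every word as a noun (`pos='n'`), which is why a verb form such as `following` (tagged `VBG`) comes out unchanged in the lemmatized output. A common way to use the tags already computed is to map each Penn Treebank tag to a WordNet part-of-speech letter and pass it as `pos`. The sketch below shows only that mapping; `penn_to_wordnet` is a hypothetical helper and is not part of the original pipeline.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('a', 'v', 'r' or 'n')."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback for every other tag

# Pairs copied from the tagged output of this notebook.
sample = [("following", "VBG"), ("hospitals", "NNS"), ("clinical", "JJ")]
for word, tag in sample:
    print(word, tag, "->", penn_to_wordnet(tag))

# With the objects from this notebook the call would look like:
#   lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
```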

In [25]:
# Print the results
print(f"Documento 1 lematizado: {doc_1}")

Documento 1 lematizado: [['medicine', 'NN'], ['®', 'NNP'], ['clinical', 'JJ'], ['case', 'NN'], ['report', 'NN'], ['open', 'JJ'], ['treatment', 'NN'], ['pancreatic', 'JJ'], ['head', 'NN'], ['cancer', 'NN'], ['obstructive', 'JJ'], ['jaundice', 'NN'], ['endoscopy', 'JJ'], ['ultrasonography-guided', 'JJ'], ['gastrojejunostomy', 'NN'], ['case', 'NN'], ['report', 'NN'], ['literature', 'NN'], ['review', 'NN'], ['∗', 'NNP'], ['zhaohua', 'NNP'], ['shen', 'NN'], ['md', 'NN'], ['li', 'JJ'], ['tian', 'NN'], ['md', 'NN'], ['xiaoyan', 'NNP'], ['wang', 'NN'], ['md', 'FW'], ['abstract', 'JJ'], ['rationale', 'NN'], ['ultrasonography-guidedgastrojejunostomy', 'JJ'], ['eus-gj', 'JJ'], ['mightbeasafe', 'NN'], ['innovativeandminimallyinvasiveinterventional', 'JJ'], ['treatmentforpatientswithgastricoutletobstruction', 'NN'], ['goo', 'NN'], ['asanalternativetothesurgicalapproach.todate', 'NN'], ['fewcaseshavebeen', 'JJ'], ['reportedin', 'VBP'], ['literature', 'NN'], ['patientconcerns', 'NNS'], ['acaseofpancr

In [26]:
print(f"Documento 2 lematizado: {doc_2}")

Documento 2 lematizado: [['10', 'CD'], ['review', 'NN'], ['article', 'NN'], ['page', 'NN'], ['1', 'CD'], ['10', 'CD'], ['drain', 'NN'], ['nasogastric', 'JJ'], ['tube', 'NN'], ['use', 'NN'], ['following', 'VBG'], ['pancreatoduodenectomy', 'NN'], ['narrative', 'JJ'], ['review', 'NN'], ['thomas', 'NN'], ['b.', 'NN'], ['russell^', 'NN'], ['peter', 'NN'], ['l.', 'NN'], ['labib^', 'NN'], ['somaiah', 'JJ'], ['aroori^', 'NN'], ['department', 'NN'], ['hpb', 'NN'], ['surgery', 'NN'], ['university', 'NN'], ['hospital', 'NNS'], ['plymouth', 'VBP'], ['nh', 'JJ'], ['trust', 'NN'], ['derriford', 'NN'], ['road', 'NN'], ['plymouth', 'NN'], ['uk', 'JJ'], ['contribution', 'NNS'], ['conception', 'NN'], ['design', 'NN'], ['author', 'NNS'], ['ii', 'NN'], ['administrative', 'JJ'], ['support', 'NN'], ['tb', 'NN'], ['russell', 'NN'], ['iii', 'NN'], ['provision', 'NN'], ['study', 'NN'], ['material', 'NNS'], ['patient', 'NNS'], ['none', 'NN'], ['iv', 'NN'], ['collection', 'NN'], ['assembly', 'NN'], ['data', 'NNS

In [27]:
print(f"Documento 3 lematizado: {doc_3}")

Documento 3 lematizado: [['w', 'NN'], ['j', 'NN'], ['n', 'JJ'], ['world', 'NN'], ['journal', 'NN'], ['nephrology', 'JJ'], ['submit', 'NN'], ['manuscript', 'NN'], ['http', 'NN'], ['//www.f6publishing.com', 'JJ'], ['world', 'NN'], ['j', 'NN'], ['nephrol', 'NN'], ['2022', 'CD'], ['november', 'IN'], ['25', 'CD'], ['11', 'CD'], ['6', 'CD'], ['146-163', 'JJ'], ['doi', 'NN'], ['10.5527/wjn.v11.i6.146', 'CD'], ['issn', 'JJ'], ['2220-6124', 'JJ'], ['online', 'JJ'], ['review', 'VBP'], ['acute', 'JJ'], ['kidney', 'NN'], ['injury', 'NN'], ['due', 'JJ'], ['bilateral', 'JJ'], ['malignant', 'JJ'], ['ureteral', 'JJ'], ['obstruction', 'NN'], ['optimal', 'JJ'], ['mode', 'NN'], ['drainage', 'NN'], ['rabea', 'NN'], ['ahmed', 'VBD'], ['gadelkareem', 'NN'], ['ahmed', 'VBD'], ['mahmoud', 'JJ'], ['abdelraouf', 'NN'], ['ahmed', 'VBD'], ['mohammed', 'VBN'], ['el-taher', 'RB'], ['abdelfattah', 'VBZ'], ['ibrahim', 'NN'], ['ahmed', 'VBD'], ['specialty', 'NN'], ['type', 'NN'], ['urology', 'NN'], ['rabea', 'NN'], ['

## Stemming

In [29]:
stemmer = PorterStemmer()   # Create the stemmer

In [30]:
# Apply stemming
doc_1 = [[stemmer.stem(word), tag]  # Build a list with the stem of each word...
         for word, tag              # for each word and tag...
         in doc_1]                  # in the document

In [31]:
doc_2 = [[stemmer.stem(word), tag]  # Build a list with the stem of each word...
         for word, tag              # for each word and tag...
         in doc_2]                  # in the document

In [32]:
doc_3 = [[stemmer.stem(word), tag]  # Build a list with the stem of each word...
         for word, tag              # for each word and tag...
         in doc_3]                  # in the document
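
The lemmatization and stemming cells share one pattern: transform the word, keep the tag. A minimal sketch of a helper that captures it follows. `map_words` is a hypothetical name, and `strip_plural_s` is a toy stand-in for a real stemmer so that the example runs without NLTK.

```python
def map_words(tagged_tokens, transform):
    """Apply `transform` to each word and keep its tag unchanged."""
    return [[transform(word), tag] for word, tag in tagged_tokens]

def strip_plural_s(word):
    """Toy stand-in for a stemmer (assumption): drops one trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 1 else word

sample = [["hospitals", "NNS"], ["trust", "NN"], ["authors", "NNS"]]
print(map_words(sample, strip_plural_s))
```

With the objects defined in this notebook, the same call would read `map_words(doc_1, stemmer.stem)`.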

In [33]:
# Print the results
print(f"Documento 1 con stemming: {doc_1}")

Documento 1 con stemming: [['medicin', 'NN'], ['®', 'NNP'], ['clinic', 'JJ'], ['case', 'NN'], ['report', 'NN'], ['open', 'JJ'], ['treatment', 'NN'], ['pancreat', 'JJ'], ['head', 'NN'], ['cancer', 'NN'], ['obstruct', 'JJ'], ['jaundic', 'NN'], ['endoscopi', 'JJ'], ['ultrasonography-guid', 'JJ'], ['gastrojejunostomi', 'NN'], ['case', 'NN'], ['report', 'NN'], ['literatur', 'NN'], ['review', 'NN'], ['∗', 'NNP'], ['zhaohua', 'NNP'], ['shen', 'NN'], ['md', 'NN'], ['li', 'JJ'], ['tian', 'NN'], ['md', 'NN'], ['xiaoyan', 'NNP'], ['wang', 'NN'], ['md', 'FW'], ['abstract', 'JJ'], ['rational', 'NN'], ['ultrasonography-guidedgastrojejunostomi', 'JJ'], ['eus-gj', 'JJ'], ['mightbeasaf', 'NN'], ['innovativeandminimallyinvasiveintervent', 'JJ'], ['treatmentforpatientswithgastricoutletobstruct', 'NN'], ['goo', 'NN'], ['asanalternativetothesurgicalapproach.tod', 'NN'], ['fewcaseshavebeen', 'JJ'], ['reportedin', 'VBP'], ['literatur', 'NN'], ['patientconcern', 'NNS'], ['acaseofpancreaticheadcarcinomawithobs

In [34]:
print(f"Documento 2 con stemming: {doc_2}")

Documento 2 con stemming: [['10', 'CD'], ['review', 'NN'], ['articl', 'NN'], ['page', 'NN'], ['1', 'CD'], ['10', 'CD'], ['drain', 'NN'], ['nasogastr', 'JJ'], ['tube', 'NN'], ['use', 'NN'], ['follow', 'VBG'], ['pancreatoduodenectomi', 'NN'], ['narr', 'JJ'], ['review', 'NN'], ['thoma', 'NN'], ['b.', 'NN'], ['russell^', 'NN'], ['peter', 'NN'], ['l.', 'NN'], ['labib^', 'NN'], ['somaiah', 'JJ'], ['aroori^', 'NN'], ['depart', 'NN'], ['hpb', 'NN'], ['surgeri', 'NN'], ['univers', 'NN'], ['hospit', 'NNS'], ['plymouth', 'VBP'], ['nh', 'JJ'], ['trust', 'NN'], ['derriford', 'NN'], ['road', 'NN'], ['plymouth', 'NN'], ['uk', 'JJ'], ['contribut', 'NNS'], ['concept', 'NN'], ['design', 'NN'], ['author', 'NNS'], ['ii', 'NN'], ['administr', 'JJ'], ['support', 'NN'], ['tb', 'NN'], ['russel', 'NN'], ['iii', 'NN'], ['provis', 'NN'], ['studi', 'NN'], ['materi', 'NNS'], ['patient', 'NNS'], ['none', 'NN'], ['iv', 'NN'], ['collect', 'NN'], ['assembl', 'NN'], ['data', 'NNS'], ['author', 'NNS'], ['v', 'NN'], ['da

In [35]:
print(f"Documento 3 con stemming: {doc_3}")

Documento 3 con stemming: [['w', 'NN'], ['j', 'NN'], ['n', 'JJ'], ['world', 'NN'], ['journal', 'NN'], ['nephrolog', 'JJ'], ['submit', 'NN'], ['manuscript', 'NN'], ['http', 'NN'], ['//www.f6publishing.com', 'JJ'], ['world', 'NN'], ['j', 'NN'], ['nephrol', 'NN'], ['2022', 'CD'], ['novemb', 'IN'], ['25', 'CD'], ['11', 'CD'], ['6', 'CD'], ['146-163', 'JJ'], ['doi', 'NN'], ['10.5527/wjn.v11.i6.146', 'CD'], ['issn', 'JJ'], ['2220-6124', 'JJ'], ['onlin', 'JJ'], ['review', 'VBP'], ['acut', 'JJ'], ['kidney', 'NN'], ['injuri', 'NN'], ['due', 'JJ'], ['bilater', 'JJ'], ['malign', 'JJ'], ['ureter', 'JJ'], ['obstruct', 'NN'], ['optim', 'JJ'], ['mode', 'NN'], ['drainag', 'NN'], ['rabea', 'NN'], ['ahm', 'VBD'], ['gadelkareem', 'NN'], ['ahm', 'VBD'], ['mahmoud', 'JJ'], ['abdelraouf', 'NN'], ['ahm', 'VBD'], ['moham', 'VBN'], ['el-tah', 'RB'], ['abdelfattah', 'VBZ'], ['ibrahim', 'NN'], ['ahm', 'VBD'], ['specialti', 'NN'], ['type', 'NN'], ['urolog', 'NN'], ['rabea', 'NN'], ['ahm', 'VBD'], ['gadelkareem', 