# Practical Exam – NLP (Units 1 to 4)

**Objective:** Run a simple NLP pipeline and answer exam-style questions based on the results.

**Topics covered:** basic tokenization, TF‑IDF, cosine similarity, and topic modeling (LDA).

> Run the cells in order. At the end, answer the questions in a report (or in Markdown cells).

In [1]:
# ✅ Install the required libraries
!pip -q install pandas scikit-learn nltk

In [2]:
# ✅ Imports
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np
import nltk
nltk.download("punkt")
print("Libraries ready.")

Libraries ready.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 1) Example dataset (small Spanish corpus)

In [3]:
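# Synthetic corpus (Spanish sentences; English glosses in the comments)
docs = [
    "La inteligencia artificial avanza rápidamente en salud.",   # AI is advancing rapidly in health
    "Los hospitales usan NLP para analizar historias clínicas.", # Hospitals use NLP to analyze medical records
    "El fútbol es un deporte muy popular en Latinoamérica.",     # Soccer is a very popular sport in Latin America
    "Los bancos utilizan modelos de lenguaje para contratos.",   # Banks use language models for contracts
    "La selección ganó un partido importante en Quito.",         # The national team won an important match in Quito
    "La IA mejora los procesos financieros en bancos.",          # AI improves financial processes in banks
    "Los pacientes reciben diagnósticos con apoyo de NLP.",      # Patients receive diagnoses with NLP support
    "El equipo de Guayaquil obtuvo la victoria en la final."     # The Guayaquil team won the final
]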
df = pd.DataFrame({"document": docs})
df

| | document |
|---|---|
| 0 | La inteligencia artificial avanza rápidamente ... |
| 1 | Los hospitales usan NLP para analizar historia... |
| 2 | El fútbol es un deporte muy popular en Latinoa... |
| 3 | Los bancos utilizan modelos de lenguaje para c... |
| 4 | La selección ganó un partido importante en Quito. |
| 5 | La IA mejora los procesos financieros en bancos. |
| 6 | Los pacientes reciben diagnósticos con apoyo d... |
| 7 | El equipo de Guayaquil obtuvo la victoria en l... |
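
Basic tokenization is listed among the topics but never shown on its own, so here is a minimal sketch of what happens to each sentence before any weighting. It approximates the default behavior of scikit-learn's vectorizers (lowercasing, then keeping runs of two or more word characters). The `STOPWORDS` set and the `tokenize` helper are illustrative assumptions: the set is a hand-picked subset that only covers the function words of this corpus, whereas the notebook uses NLTK's full Spanish list.

```python
import re

# Assumption: a tiny stop list covering only the function words in this corpus.
# The notebook itself relies on NLTK's full Spanish stopword list.
STOPWORDS = {"la", "los", "el", "en", "para", "es", "un", "muy", "de", "con"}

def tokenize(text):
    """Lowercase, keep runs of two or more word characters, drop stopwords."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(tokenize("El fútbol es un deporte muy popular en Latinoamérica."))
# ['fútbol', 'deporte', 'popular', 'latinoamérica']
```

Note that accented characters are kept as they are and punctuation is discarded by the pattern.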


## 2) TF‑IDF vectorization

In [4]:
# TfidfVectorizer and nltk were already imported in cell 2
from nltk.corpus import stopwords
nltk.download("stopwords")

# ✅ Spanish stopwords from NLTK
spanish_stopwords = stopwords.words("spanish")

# Unigram TF-IDF with the Spanish stopwords removed
tfidf = TfidfVectorizer(stop_words=spanish_stopwords, ngram_range=(1, 1))
X_tfidf = tfidf.fit_transform(df["document"])

# Show the TF-IDF matrix as a DataFrame
pd.DataFrame(X_tfidf.toarray(), columns=tfidf.get_feature_names_out()).round(3)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


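| | analizar | apoyo | artificial | avanza | bancos | clínicas | contratos | deporte | diagnósticos | equipo | ... | popular | procesos | quito | reciben | rápidamente | salud | selección | usan | utilizan | victoria |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.447 | 0.447 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.447 | 0.447 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.419 | 0.0 | 0.0 | 0.0 | 0.0 | 0.419 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.419 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | ... | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.386 | 0.0 | 0.461 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.461 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.447 | 0.0 | 0.0 | 0.0 | 0.447 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.386 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.461 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.461 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.461 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.461 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.447 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.447 |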
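
To see where values such as 0.386 and 0.461 in row 3 come from, the sketch below recomputes that row by hand. It assumes scikit-learn's default settings for `TfidfVectorizer` (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The document frequencies are counted from the corpus: "bancos" appears in documents 3 and 5, and the other terms of document 3 appear only there. The helper name `smoothed_idf` is an illustrative choice.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 8
# Terms of document 3 after stopword removal, with their document frequencies.
# Each term occurs once in the document, so the term frequency is 1.
doc_freq = {"bancos": 2, "utilizan": 1, "modelos": 1, "lenguaje": 1, "contratos": 1}

raw = {term: 1 * smoothed_idf(n_docs, df) for term, df in doc_freq.items()}
norm = math.sqrt(sum(w * w for w in raw.values()))  # L2 norm of the row
weights = {term: round(w / norm, 3) for term, w in raw.items()}

print(weights["bancos"])     # 0.386
print(weights["contratos"])  # 0.461
```

The shared term "bancos" receives a lower weight than the terms that occur in a single document, which is the effect of the IDF factor.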


## 3) Cosine similarity

In [5]:
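# Pairwise cosine similarity between all TF-IDF document vectors
sim_matrix = cosine_similarity(X_tfidf)
# Entry (i, j) is the similarity between documents i and j
pd.DataFrame(sim_matrix.round(2))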

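| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.14 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.15 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.15 | 0.0 | 1.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.14 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |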
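
The nonzero off-diagonal entries can be checked by hand. The sketch below implements cosine similarity on sparse `{term: weight}` dictionaries and applies it to documents 3, 5, and 2, using the rounded TF-IDF weights. The `cosine` helper is illustrative, and the weights for terms hidden behind the `...` columns of the TF-IDF table are assumed to equal the values shown for the other single-document terms in the same row.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

# Rounded TF-IDF weights; documents 3 and 5 share only the term "bancos".
doc3 = {"bancos": 0.386, "utilizan": 0.461, "modelos": 0.461,
        "lenguaje": 0.461, "contratos": 0.461}
doc5 = {"bancos": 0.386, "ia": 0.461, "mejora": 0.461,
        "procesos": 0.461, "financieros": 0.461}
doc2 = {"fútbol": 0.5, "deporte": 0.5, "popular": 0.5, "latinoamérica": 0.5}

print(round(cosine(doc3, doc5), 2))  # 0.15
print(cosine(doc3, doc2))            # 0.0
```

Because the TF-IDF rows are already L2-normalized, the similarity reduces to the product of the weights of the shared terms, here roughly 0.386 × 0.386.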


## 4) Topic modeling with LDA

In [6]:
# CountVectorizer, nltk, and stopwords were already imported in earlier cells
nltk.download("stopwords")

# ✅ Spanish stopwords from NLTK (same list as in section 2)
spanish_stopwords = stopwords.words("spanish")

# Bag-of-words vectorization for LDA (LDA expects raw counts, not TF-IDF weights)
cv = CountVectorizer(stop_words=spanish_stopwords)
X_bow = cv.fit_transform(df["document"])

# LDA with 2 topics; random_state fixes the initialization for reproducibility
lda = LatentDirichletAllocation(n_components=2, random_state=42, learning_method="batch")
doc_topic = lda.fit_transform(X_bow)

# Helper function to print the top words of each topic
def top_words(model, feature_names, n_top=8):
    """Print the n_top highest-weighted words of each topic in the model."""
    for idx, comp in enumerate(model.components_):
        top_idx = comp.argsort()[-n_top:][::-1]  # indices sorted by descending weight
        words = [feature_names[i] for i in top_idx]
        print(f"Topic {idx}: {', '.join(words)}")

top_words(lda, cv.get_feature_names_out(), n_top=10)

# Topic distribution per document
pd.DataFrame(doc_topic).round(3)


Topic 0: nlp, bancos, analizar, usan, clínicas, historias, hospitales, ganó, quito, importante
Topic 1: bancos, artificial, salud, rápidamente, inteligencia, avanza, obtuvo, equipo, final, guayaquil


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


| | 0 | 1 |
|---|---|---|
| 0 | 0.087 | 0.913 |
| 1 | 0.922 | 0.078 |
| 2 | 0.896 | 0.104 |
| 3 | 0.093 | 0.907 |
| 4 | 0.913 | 0.087 |
| 5 | 0.907 | 0.093 |
| 6 | 0.093 | 0.907 |
| 7 | 0.087 | 0.913 |
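
One way to read the table above is to assign each document to its most probable topic. The following sketch does this in plain Python with the probabilities copied from the table; the variable names `dominant_topics` and `by_topic` are illustrative and not part of the notebook.

```python
# Document-topic probabilities copied from the table above (one row per document).
doc_topic = [
    (0.087, 0.913),
    (0.922, 0.078),
    (0.896, 0.104),
    (0.093, 0.907),
    (0.913, 0.087),
    (0.907, 0.093),
    (0.093, 0.907),
    (0.087, 0.913),
]

# Index of the largest probability in each row.
dominant_topics = [max(range(len(p)), key=p.__getitem__) for p in doc_topic]
print(dominant_topics)  # [1, 0, 0, 1, 0, 0, 1, 1]

# Group document ids by their dominant topic.
by_topic = {}
for doc_id, topic in enumerate(dominant_topics):
    by_topic.setdefault(topic, []).append(doc_id)
print(by_topic)  # {1: [0, 3, 6, 7], 0: [1, 2, 4, 5]}
```

On the NumPy array returned by `fit_transform`, `doc_topic.argmax(axis=1)` gives the same assignment. Comparing the groups with the corpus shows whether the two topics line up with recognizable themes, which is not guaranteed on a corpus this small.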


---

# Exam questions

1. **TF‑IDF**  
   a) What do the rows and columns of the matrix represent?  
   b) What practical difference do you observe compared with Bag‑of‑Words?  

2. **Cosine similarity**  
   a) Identify the two most similar pairs of documents and explain why.  
   b) Choose a pair with low similarity: what makes them different?  

3. **LDA – Topics**  
   a) Name the 2 topics and list 5–8 keywords for each.  
   b) Which document has the highest probability of belonging to each topic?  

4. **Design and improvement**  
   a) Propose two preprocessing improvements.  
   b) What would happen if you increased the number of topics to 3?  

