
# 🌐 Introduction to Natural Language Processing (NLP)

In this class we cover the basic concepts of **Natural Language Processing (NLP)**, an area of artificial intelligence focused on enabling computers to understand and process human language.

**Topics covered:**
- Tokenization  
- Stopwords  
- Lemmatization  
- Stemming  
- Vector representation (Bag of Words and TF-IDF)


## 📘 Tokenization
**Tokenization** is the process of splitting a text into smaller units called *tokens*.  
Tokens can be words, sentences, or even characters; word tokenizers usually emit punctuation marks as separate tokens.

**Example:**
> Text: "El NLP es fascinante." → Tokens: ["El", "NLP", "es", "fascinante", "."]


In [None]:
import nltk
from nltk.tokenize import word_tokenize

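# Download the required tokenizer resources
nltk.download('punkt')
nltk.download('punkt_tab')

texto = "El NLP es una rama fascinante de la inteligencia artificial."
# Lowercase the text and tokenize it with the Spanish Punkt model (the default is English)
tokens = word_tokenize(texto.lower(), language='spanish')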
print(tokens)



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


['el', 'nlp', 'es', 'una', 'rama', 'fascinante', 'de', 'la', 'inteligencia', 'artificial', '.']
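
To make the idea concrete, here is a minimal sketch of a word-level tokenizer built on a regular expression, plus a character-level split of a single word. It is only an approximation for illustration and not the algorithm that `word_tokenize` uses; the name `simple_tokenize` is made up for this example.

```python
import re

def simple_tokenize(text):
    """Split text into lowercase word tokens and single punctuation marks."""
    # \w+ matches runs of letters/digits; [^\w\s] matches one punctuation character
    return re.findall(r"\w+|[^\w\s]", text.lower())

texto = "El NLP es una rama fascinante de la inteligencia artificial."
word_tokens = simple_tokenize(texto)
char_tokens = list("nlp")  # character-level tokens for a single word

print(word_tokens)
print(char_tokens)
```

For this sentence the regex produces the same tokens as the NLTK cell above, but it will diverge on harder cases such as abbreviations, URLs, or contractions.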


## 🚫 Stopwords
**Stopwords** are very common words (such as articles, prepositions, and pronouns) that carry little semantic content on their own and are often removed before further processing.


In [None]:
from nltk.corpus import stopwords

nltk.download('stopwords')

stop_words = set(stopwords.words('spanish'))
tokens_filtrados = [palabra for palabra in tokens if palabra not in stop_words]
print(tokens_filtrados)


['nlp', 'rama', 'fascinante', 'inteligencia', 'artificial', '.']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
from nltk.corpus import stopwords
stopwords_sp = set(stopwords.words('spanish'))

# Check whether "sí" and "no" are part of NLTK's Spanish stopword list
print("sí" in stopwords_sp)
print("no" in stopwords_sp)


True
True


In [None]:
stopwords_custom = set(stopwords.words('spanish'))
stopwords_custom.remove("no")  # keep "no" in the text: it carries negation



In [None]:


print("sí" in stopwords_custom)
print("no" in stopwords_custom)

True
False
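
Why keep "no"? A minimal sketch, using a tiny hand-picked stopword set (an assumption for this example, not NLTK's full Spanish list), shows how removing it erases the negation from a sentence:

```python
# Illustrative subset of Spanish stopwords (an assumption, not NLTK's full list)
stopwords_demo = {"no", "me", "la", "el", "de"}
stopwords_keep_no = stopwords_demo - {"no"}  # same idea as .remove("no") above

tokens_demo = ["no", "me", "gusta", "la", "película"]

sin_negacion = [t for t in tokens_demo if t not in stopwords_demo]
con_negacion = [t for t in tokens_demo if t not in stopwords_keep_no]

print(sin_negacion)  # the negation is lost
print(con_negacion)  # the negation is kept
```

For tasks such as sentiment analysis, "me gusta" and "no me gusta" should not collapse into the same tokens, which is the motivation for customizing the list.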


In [None]:
stopwords_custom = set(stopwords.words('spanish'))
stopwords_custom.update(["sí", "vale", "ok", "aja"])  # add words that carry no meaning for your task ("sí" is already in NLTK's list)

## 🌱 Lemmatization
**Lemmatization** reduces a word to its base or dictionary form (its *lemma*), taking context and morphology into account.

**Example:**  
> corriendo → correr


In [None]:
# Install spaCy's Spanish language model
!python -m spacy download es_core_news_sm


Collecting es-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-3.8.0/es_core_news_sm-3.8.0-py3-none-any.whl (12.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.9/12.9 MB 97.1 MB/s eta 0:00:00
Installing collected packages: es-core-news-sm
Successfully installed es-core-news-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
import spacy
nlp = spacy.load('es_core_news_sm')  # Spanish model

texto = "Los gatos estaban corriendo rápidamente por el jardín."
doc = nlp(texto)

lemmas = [token.lemma_ for token in doc]
print(lemmas)


['el', 'gato', 'estar', 'correr', 'rápidamente', 'por', 'el', 'jardín', '.']
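
The simplest form of lemmatization is a dictionary lookup. The sketch below assumes a tiny hand-written table (`LEMMA_TABLE` is hypothetical and covers only this sentence); spaCy's pipeline is considerably more elaborate, so this only illustrates the idea of mapping inflected forms to a lemma.

```python
import re

# Hypothetical hand-written lemma table; real lemmatizers rely on much larger
# dictionaries plus part-of-speech information.
LEMMA_TABLE = {
    "los": "el",
    "gatos": "gato",
    "estaban": "estar",
    "corriendo": "correr",
}

def lookup_lemmatize(text):
    """Lowercase, tokenize with a simple regex, and map each token to its lemma."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Tokens missing from the table are returned unchanged
    return [LEMMA_TABLE.get(tok, tok) for tok in tokens]

lemmas_demo = lookup_lemmatize("Los gatos estaban corriendo rápidamente por el jardín.")
print(lemmas_demo)
```

A pure lookup cannot resolve ambiguous forms (for example, a word that is both a noun and a verb), which is why context and morphology matter.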


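## 🔪 Stemming
**Stemming** chops words down to a stem by applying heuristic suffix-stripping rules, without consulting a dictionary or the word's context.  
It is more aggressive than lemmatization, and the resulting stem need not be a real word.

**Example:**  
> jugando, jugador, jugó → jug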


In [None]:
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('spanish')
stems = [stemmer.stem(palabra) for palabra in tokens_filtrados]
print(stems)


['nlp', 'ram', 'fascin', 'inteligent', 'artificial', '.']
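
To see what "suffix stripping" means in practice, here is a deliberately naive sketch. The suffix list and the function `naive_stem` are assumptions made for this example; the Snowball algorithm uses a far more careful rule set and may return different stems for the same words.

```python
# Hypothetical suffix list chosen for this example only
SUFFIXES = ("iendo", "ando", "ador", "ó")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for palabra in ["jugando", "jugador", "jugó", "corriendo"]:
    print(palabra, "->", naive_stem(palabra))
```

Note that the stemmer never checks whether the result is a valid word, which is the main difference from lemmatization.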


## 🧮 Vector Representation of Text
Once the text is clean, it must be represented numerically before it can be used in Machine Learning models.  
Two of the most common techniques are:

1. **Bag of Words (BoW)**: counts how many times each word appears in each document.  
2. **TF-IDF (Term Frequency - Inverse Document Frequency)**: weights each word by its frequency within a document, scaled down by how many documents in the corpus contain it.
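
Bag of Words is simple enough to build by hand. The following sketch uses only the standard library; the names `corpus_demo`, `tokenize`, and `bow_matrix` are made up for this example, and the two-character minimum in the regex mirrors scikit-learn's default token pattern.

```python
import re
from collections import Counter

corpus_demo = [
    "El NLP es una rama de la inteligencia artificial.",
    "La tokenización y la lematización son procesos importantes en NLP.",
    "TF-IDF permite representar textos de forma numérica.",
]

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: every distinct token in the corpus, sorted alphabetically
vocabulary = sorted({tok for doc in corpus_demo for tok in tokenize(doc)})

# One row per document, one column per vocabulary term
bow_matrix = []
for doc in corpus_demo:
    counts = Counter(tokenize(doc))
    bow_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each row is a document and each column a vocabulary term, so word order is discarded and only counts remain.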


In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "El NLP es una rama de la inteligencia artificial.",
    "La tokenización y la lematización son procesos importantes en NLP.",
    "TF-IDF permite representar textos de forma numérica."
]

# Bag of Words
vectorizer_bow = CountVectorizer()
X_bow = vectorizer_bow.fit_transform(corpus)
print("Bag of Words:")
print(vectorizer_bow.get_feature_names_out())
print(X_bow.toarray())

# TF-IDF
vectorizer_tfidf = TfidfVectorizer()
X_tfidf = vectorizer_tfidf.fit_transform(corpus)
print("\nTF-IDF:")
print(vectorizer_tfidf.get_feature_names_out())
print(X_tfidf.toarray())


Bag of Words:
['artificial' 'de' 'el' 'en' 'es' 'forma' 'idf' 'importantes'
 'inteligencia' 'la' 'lematización' 'nlp' 'numérica' 'permite' 'procesos'
 'rama' 'representar' 'son' 'textos' 'tf' 'tokenización' 'una']
[[1 1 1 0 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 1]
 [0 0 0 1 0 0 0 1 0 2 1 1 0 0 1 0 0 1 0 0 1 0]
 [0 1 0 0 0 1 1 0 0 0 0 0 1 1 0 0 1 0 1 1 0 0]]

TF-IDF:
['artificial' 'de' 'el' 'en' 'es' 'forma' 'idf' 'importantes'
 'inteligencia' 'la' 'lematización' 'nlp' 'numérica' 'permite' 'procesos'
 'rama' 'representar' 'son' 'textos' 'tf' 'tokenización' 'una']
[[0.35955412 0.27345018 0.35955412 0.         0.35955412 0.
  0.         0.         0.35955412 0.27345018 0.         0.27345018
  0.         0.         0.         0.35955412 0.         0.
  0.         0.         0.         0.35955412]
 [0.         0.         0.         0.33535157 0.         0.
  0.         0.33535157 0.         0.51008702 0.33535157 0.25504351
  0.         0.         0.33535157 0.         0.         0.33535157
  0.
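
The TF-IDF weights printed above can be reproduced by hand. This sketch assumes the default settings of `TfidfVectorizer`: raw term counts, the smoothed inverse document frequency `idf = ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names (`tokenize`, `tfidf_row`) are made up for this example.

```python
import math
import re
from collections import Counter

corpus_demo = [
    "El NLP es una rama de la inteligencia artificial.",
    "La tokenización y la lematización son procesos importantes en NLP.",
    "TF-IDF permite representar textos de forma numérica.",
]

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

docs = [Counter(tokenize(doc)) for doc in corpus_demo]
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = Counter(term for counts in docs for term in counts)

def tfidf_row(counts):
    """Raw count times smoothed IDF, then L2-normalized."""
    weights = {
        term: tf * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

row0 = tfidf_row(docs[0])
row1 = tfidf_row(docs[1])
print(round(row0["artificial"], 8))  # term that appears in one document
print(round(row0["nlp"], 8))         # term that appears in two documents
print(round(row1["la"], 8))          # term repeated within the same document
```

Terms that occur in fewer documents ("artificial") receive a higher weight than terms shared across documents ("nlp"), while repetition inside one document ("la" in the second sentence) raises the weight again.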

## ✅ Conclusions

In this class we covered the fundamental steps of text preprocessing:
- Tokenization, to split text into units.  
- Stopword removal.  
- Lemmatization and stemming, to normalize words.  
- Vector representation of text for Machine Learning models.

These concepts underpin many NLP applications, such as chatbots, sentiment analysis, and search engines.


In [1]:
import pandas as pd
import numpy as np

url = "https://raw.githubusercontent.com/dD2405/Twitter_Sentiment_Analysis/master/train.csv"
data = pd.read_csv(url)

# Show the first few rows
print(data.head())

corpus = data['tweet'].astype(str).tolist()
print("Tweets in the corpus:", len(corpus))

   id  label                                              tweet
0   1      0   @user when a father is dysfunctional and is s...
1   2      0  @user @user thanks for #lyft credit i can't us...
2   3      0                                bihday your majesty
3   4      0  #model   i love u take with u all the time in ...
4   5      0             factsguide: society now    #motivation
Tweets in the corpus: 31962


In [2]:
import nltk
from nltk.tokenize import word_tokenize

# Download the tokenizer resources
nltk.download('punkt')
nltk.download('punkt_tab')

texto = corpus[0]
tokens = word_tokenize(texto.lower())
print(tokens)



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


['@', 'user', 'when', 'a', 'father', 'is', 'dysfunctional', 'and', 'is', 'so', 'selfish', 'he', 'drags', 'his', 'kids', 'into', 'his', 'dysfunction', '.', '#', 'run']


In [3]:
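# Stopwords
# NOTE: these tweets are in English, so stopwords.words('english') is the matching list;
# with the Spanish list used here, only 'a' and 'he' are removed from this tweet.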
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('spanish'))
tokens_filtrados = [palabra for palabra in tokens if palabra not in stop_words]
print(tokens_filtrados)


['@', 'user', 'when', 'father', 'is', 'dysfunctional', 'and', 'is', 'so', 'selfish', 'drags', 'his', 'kids', 'into', 'his', 'dysfunction', '.', '#', 'run']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
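
The stopword list has to match the language of the corpus. A minimal sketch of the contrast on the first tweet, using two small hand-picked subsets (assumptions for this example, not NLTK's full lists):

```python
# Tokens of the first tweet, as produced by the tokenization cell
tokens_demo = ['@', 'user', 'when', 'a', 'father', 'is', 'dysfunctional', 'and',
               'is', 'so', 'selfish', 'he', 'drags', 'his', 'kids', 'into',
               'his', 'dysfunction', '.', '#', 'run']

# Small illustrative subsets (assumptions for this sketch, not NLTK's full lists)
spanish_subset = {"a", "he", "de", "la", "el", "y"}
english_subset = {"a", "he", "his", "is", "and", "so", "when", "into"}

filtered_es = [t for t in tokens_demo if t not in spanish_subset]
filtered_en = [t for t in tokens_demo if t not in english_subset]

print(len(tokens_demo), len(filtered_es), len(filtered_en))
print(filtered_en)
```

With the English subset the function words disappear and mostly content words remain, whereas the Spanish subset only matches tokens that happen to be spelled like Spanish words.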


In [4]:
# Custom stopwords added to the base list
stopwords_custom = set(stopwords.words('spanish'))
stopwords_custom.update(["sí", "vale", "ok", "aja"])


In [6]:
!python -m spacy download es_core_news_sm

Collecting es-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-3.8.0/es_core_news_sm-3.8.0-py3-none-any.whl (12.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.9/12.9 MB 76.9 MB/s eta 0:00:00
Installing collected packages: es-core-news-sm
Successfully installed es-core-news-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [7]:
# Lemmatization
# NOTE: es_core_news_sm is a Spanish model; on these English tweets it produces
# artifacts such as 'he' -> 'haber'. An English pipeline (e.g. en_core_web_sm) fits this corpus.
import spacy
nlp = spacy.load('es_core_news_sm')  # the model was installed in the previous cell

doc = nlp(corpus[0])
lemmas = [token.lemma_ for token in doc]
print(lemmas)


[' ', '@user', 'when', 'a', 'father', 'is', 'dysfunctional', 'and', 'is', 'so', 'selfish', 'haber', 'drags', 'his ', 'kids', 'into', 'his', 'dysfunction', '.', '  ', '#', 'run']


In [8]:
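# Stemming
# NOTE: the Spanish stemmer is mismatched with English text (e.g. 'father' -> 'fath');
# SnowballStemmer('english') is the matching choice for this corpus.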
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('spanish')
stems = [stemmer.stem(palabra) for palabra in tokens_filtrados]
print(stems)


['@', 'user', 'when', 'fath', 'is', 'dysfunctional', 'and', 'is', 'so', 'selfish', 'drags', 'his', 'kids', 'into', 'his', 'dysfunction', '.', '#', 'run']


In [9]:
# Bag of Words
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_bow = CountVectorizer()
X_bow = vectorizer_bow.fit_transform(corpus)
print("Bag of Words (docs x terms):", X_bow.shape)
print(vectorizer_bow.get_feature_names_out()[:20])  # first 20 terms
# print(X_bow.toarray())


Bag of Words (docs x terms): (31962, 41392)
['00' '000' '000001' '001' '0099' '00am' '00h30' '00pm' '01' '0115' '0161'
 '01926889917' '02' '0266808099' '03' '030916' '03111880779' '033' '0345'
 '039']


In [10]:
# TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer_tfidf = TfidfVectorizer()
X_tfidf = vectorizer_tfidf.fit_transform(corpus)
print("TF-IDF (docs x terms):", X_tfidf.shape)
print(vectorizer_tfidf.get_feature_names_out()[:20])
# Densify only the first rows: the full 31962 x 41392 matrix would need roughly 10 GB as float64
print(X_tfidf[:3].toarray())


TF-IDF (docs x terms): (31962, 41392)
['00' '000' '000001' '001' '0099' '00am' '00h30' '00pm' '01' '0115' '0161'
 '01926889917' '02' '0266808099' '03' '030916' '03111880779' '033' '0345'
 '039']
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
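
Converting the whole TF-IDF matrix to a dense array is costly at this size. A quick back-of-the-envelope calculation, assuming 8-byte `float64` values and the shape reported above:

```python
n_docs, n_terms = 31962, 41392   # shape reported by the vectorizer
bytes_per_value = 8              # float64

# A dense array stores every cell, including all the zeros
dense_bytes = n_docs * n_terms * bytes_per_value
print(f"Dense matrix: {dense_bytes / 1e9:.1f} GB")
```

The vectorizers return a sparse matrix precisely because almost all cells are zero: a tweet contains only a handful of the 41,392 terms, so only the nonzero entries need to be stored.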
