# Natural Language Processing (NLP)

How do we process data that comes as comments, reviews, or other formats made of words?

In [85]:
import pandas as pd
import nltk
from nltk.corpus import movie_reviews

## Scenario

To improve its catalog, Netflix has hired us to decide whether a movie is good or bad based on a review found online.

## Exploration

We will analyze a dataset of movie reviews.

In [90]:
data = pd.read_csv(r"C:\Users\felip\Desktop\Cursos\Datos\movie_review.csv")

In [92]:
data

Unnamed: 0,Review,Category
0,"plot : two teen couples go to a church party ,...",neg
1,the happy bastard's quick movie review \ndamn ...,neg
2,it is movies like these that make a jaded movi...,neg
3,""" quest for camelot "" is warner bros . ' firs...",neg
4,synopsis : a mentally unstable man undergoing ...,neg
...,...,...
1995,wow ! what a movie . \nit's everything a movie...,pos
1996,"richard gere can be a commanding actor , but h...",pos
1997,"glory--starring matthew broderick , denzel was...",pos
1998,steven spielberg's second epic film on world w...,pos


What can we observe in the data?

*Hint 1: Look at reviews 1, 2, and 1999.*

*Hint 2: Look at the `Category` column and describe what it contains.*

In [95]:
data.Review[1]

'the happy bastard\'s quick movie review \ndamn that y2k bug . \nit\'s got a head start in this movie starring jamie lee curtis and another baldwin brother ( william this time ) in a story regarding a crew of a tugboat that comes across a deserted russian tech ship that has a strangeness to it when they kick the power back on . \nlittle do they know the power within . . . \ngoing for the gore and bringing on a few action sequences here and there , virus still feels very empty , like a movie going for all flash and no substance . \nwe don\'t know why the crew was really out in the middle of nowhere , we don\'t know the origin of what took over the ship ( just that a big pink flashy thing hit the mir ) , and , of course , we don\'t know why donald sutherland is stumbling around drunkenly throughout . \nhere , it\'s just " hey , let\'s chase these people around with some robots " . \nthe acting is below average , even from the likes of curtis . \nyou\'re more likely to get a kick out of h

In [97]:
data.Review[1][0:31]

"the happy bastard's quick movie"

In [99]:
data.Category.value_counts()

Category
neg    1000
pos    1000
Name: count, dtype: int64

**How do you think we could tell whether a movie is good or bad from the `Review` column?**

## Cleaning text data

Consider the following sentence:

**"I like to run in the park at 2 in the afternoon"**

Which words carry useful information?

Every language has words that make a sentence grammatical but do not necessarily add information.

**"like run park afternoon"**

This may sound like something said in a more primitive era, yet these 4 words are enough to understand the message.

### RegEx

**Regular expressions** (RegEx) are patterns that let us run customized searches over text.

https://regexr.com/
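
As a minimal sketch of how such patterns are used from Python, the standard-library `re` module can find or replace every match of a pattern. The sample string below is made up for illustration and is not taken from the dataset.

```python
import re

# Made-up sample text, written in the style of the reviews in this dataset
sample = "wow ! what a movie . it's 2 hours long , released in 1999 ."

# \d+ matches one or more consecutive digits
numbers = re.findall(r"\d+", sample)
print(numbers)  # ['2', '1999']

# [^\w\s] matches any character that is neither a word character nor whitespace
no_punctuation = re.sub(r"[^\w\s]", "", sample)
print(no_punctuation)
```

Removing the apostrophe turns `it's` into `its`, which is worth keeping in mind when punctuation is stripped before other cleaning steps.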

### Stopwords

Stopwords are the words in a given language that carry no essential information and can be removed from the text.
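
A minimal sketch of the idea, assuming a tiny hand-written stopword list (hypothetical, and far smaller than the NLTK list loaded in the cells that follow):

```python
# Tiny hand-written stopword list, for illustration only (not NLTK's list)
example_stopwords = {"the", "was", "as", "a", "of", "and", "it", "is"}

sentence = "the acting was good and the plot is a mess"

# Keep only the words that are not in the stopword set
kept_words = [word for word in sentence.split() if word not in example_stopwords]
print(" ".join(kept_words))  # acting good plot mess
```

Using a `set` keeps each membership check fast, which matters once the list has a few hundred entries and the corpus has thousands of reviews.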

In [106]:
from nltk.corpus import stopwords
import re
import string

In [162]:
# Download the stopword list
nltk.download('stopwords')

# Define the stopwords and the punctuation characters
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\felip\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [168]:
# Explore the English stopwords and the punctuation characters
#stop_words
#punctuation

In [174]:
def limpiar_texto_regex(texto, stop_words, punctuation):
    # Build a regex character class that matches any punctuation character
    pattern_puntuacion = r'[{}]'.format(re.escape(''.join(punctuation)))

    # Remove punctuation. Note: this runs before the stopword check, so
    # contractions such as "he's" become "hes" and are not filtered out
    texto = re.sub(pattern_puntuacion, '', texto)

    # Convert the text to lowercase
    texto = texto.lower()

    # Split the text into words
    palabras = texto.split()

    # Keep only the words that are not in the stopword set
    palabras_filtradas = [palabra for palabra in palabras if palabra not in stop_words]

    # Join the remaining words back into a single string
    texto_limpio = ' '.join(palabras_filtradas)

    return texto_limpio

In [178]:
# Apply the cleaning function to the DataFrame
data['Cleaned_Review'] = data['Review'].apply(lambda x: limpiar_texto_regex(x, stop_words, punctuation))

In [182]:
data[['Cleaned_Review', "Review"]]

Unnamed: 0,Cleaned_Review,Review
0,plot two teen couples go church party drink dr...,"plot : two teen couples go to a church party ,..."
1,happy bastards quick movie review damn y2k bug...,the happy bastard's quick movie review \ndamn ...
2,movies like make jaded movie viewer thankful i...,it is movies like these that make a jaded movi...
3,quest camelot warner bros first featurelength ...,""" quest for camelot "" is warner bros . ' firs..."
4,synopsis mentally unstable man undergoing psyc...,synopsis : a mentally unstable man undergoing ...
...,...,...
1995,wow movie everything movie funny dramatic inte...,wow ! what a movie . \nit's everything a movie...
1996,richard gere commanding actor hes always great...,"richard gere can be a commanding actor , but h..."
1997,glorystarring matthew broderick denzel washing...,"glory--starring matthew broderick , denzel was..."
1998,steven spielbergs second epic film world war i...,steven spielberg's second epic film on world w...


We now show two ways to encode the category labels.
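
Before the scikit-learn versions, here is a minimal plain-Python sketch of what the two encodings produce. The four-item `categories` list is a made-up stand-in for the `Category` column.

```python
# Made-up stand-in for the Category column
categories = ["neg", "pos", "pos", "neg"]

# Label encoding: each class becomes one integer (classes sorted alphabetically)
classes = sorted(set(categories))
label_encoded = [classes.index(category) for category in categories]
print(label_encoded)  # [0, 1, 1, 0]

# One-hot encoding: one column per class, with a single 1 in each row
one_hot_encoded = [[int(category == cls) for cls in classes] for category in categories]
print(one_hot_encoded)  # [[1, 0], [0, 1], [0, 1], [1, 0]]
```

For a two-class target like this one, the integer labels are usually enough. One-hot encoding tends to matter more for input features with several unordered categories.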

In [188]:
from sklearn.preprocessing import LabelEncoder

# Encode the category as an integer (label encoding)
label_encoder = LabelEncoder()
data['Category_Label'] = label_encoder.fit_transform(data['Category'])

# Show the first rows of the DataFrame with the new column
data[['Review', 'Category', 'Category_Label']].head()

Unnamed: 0,Review,Category,Category_Label
0,"plot : two teen couples go to a church party ,...",neg,0
1,the happy bastard's quick movie review \ndamn ...,neg,0
2,it is movies like these that make a jaded movi...,neg,0
3,""" quest for camelot "" is warner bros . ' firs...",neg,0
4,synopsis : a mentally unstable man undergoing ...,neg,0


In [192]:
from sklearn.preprocessing import OneHotEncoder

# Encode the category with one-hot encoding
onehot_encoder = OneHotEncoder(sparse_output=False)

# Transform the category column (the input must be 2D)
category_encoded = onehot_encoder.fit_transform(data[['Category']])

# Build a DataFrame with the one-hot columns
category_df = pd.DataFrame(category_encoded, columns=onehot_encoder.categories_[0])

# Concatenate the original DataFrame with the new category DataFrame
df_onehot = pd.concat([data, category_df], axis=1)

# Show the first rows of the resulting DataFrame
df_onehot.head()

Unnamed: 0,Review,Category,Cleaned_Sentence,Cleaned_Review,Category_Label,neg,pos
0,"plot : two teen couples go to a church party ,...",neg,"[p, l, , , w, , e, e, n, , c, u, p, l, e, ...",plot two teen couples go church party drink dr...,0,1.0,0.0
1,the happy bastard's quick movie review \ndamn ...,neg,"[h, e, , h, p, p, , b, r, , q, u, c, k, , ...",happy bastards quick movie review damn y2k bug...,0,1.0,0.0
2,it is movies like these that make a jaded movi...,neg,"[ , , v, e, , l, k, e, , h, e, e, , h, , ...",movies like make jaded movie viewer thankful i...,0,1.0,0.0
3,""" quest for camelot "" is warner bros . ' firs...",neg,"[ , , q, u, e, , f, r, , c, e, l, , , , ...",quest camelot warner bros first featurelength ...,0,1.0,0.0
4,synopsis : a mentally unstable man undergoing ...,neg,"[n, p, , , , e, n, l, l, , u, n, b, l, e, ...",synopsis mentally unstable man undergoing psyc...,0,1.0,0.0


### Tokenization

The tokenization problem

How many letters does "strawberry" have?
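
A minimal sketch, using only the standard library rather than NLTK, of why tokenization involves choices: two reasonable tokenizers split the same sentence differently. It also shows that a character-level question like the one above is easy on the raw string, while a token-based representation no longer exposes individual letters. The sentence and the regex are illustrative assumptions.

```python
import re

sentence = "it's everything a movie should be, isn't it?"

# Splitting on whitespace leaves punctuation glued to the words
whitespace_tokens = sentence.split()
print(whitespace_tokens)

# A simple regex tokenizer: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(regex_tokens)

# Counting characters is trivial while we still have the raw string
word = "strawberry"
print(len(word), word.count("r"))  # 10 3
```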

In [198]:
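from nltk.tokenize import word_tokenize

# word_tokenize needs NLTK's Punkt tokenizer data; if it is missing, run
# nltk.download('punkt') (recent NLTK releases use 'punkt_tab' instead)

# Tokenize each cleaned review into words
data['Tokens'] = data['Cleaned_Review'].apply(lambda x: word_tokenize(x.lower()))

# Show the first rows of the DataFrame with the tokens
data[['Cleaned_Review', 'Tokens']].head()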


Unnamed: 0,Cleaned_Review,Tokens
0,plot two teen couples go church party drink dr...,"[plot, two, teen, couples, go, church, party, ..."
1,happy bastards quick movie review damn y2k bug...,"[happy, bastards, quick, movie, review, damn, ..."
2,movies like make jaded movie viewer thankful i...,"[movies, like, make, jaded, movie, viewer, tha..."
3,quest camelot warner bros first featurelength ...,"[quest, camelot, warner, bros, first, featurel..."
4,synopsis mentally unstable man undergoing psyc...,"[synopsis, mentally, unstable, man, undergoing..."


### General cleanup

A regex that removes numbers

In [200]:
# Remove every run of digits using a regular expression
def remove_numbers(text):
    return re.sub(r'\d+', '', text)

# Apply the number-removal function to the reviews
data['No_Numbers'] = data['Cleaned_Review'].apply(remove_numbers)

# Show the first rows of the DataFrame without numbers
data[['Cleaned_Review', 'No_Numbers']].head()

Unnamed: 0,Cleaned_Review,No_Numbers
0,plot two teen couples go church party drink dr...,plot two teen couples go church party drink dr...
1,happy bastards quick movie review damn y2k bug...,happy bastards quick movie review damn yk bug ...
2,movies like make jaded movie viewer thankful i...,movies like make jaded movie viewer thankful i...
3,quest camelot warner bros first featurelength ...,quest camelot warner bros first featurelength ...
4,synopsis mentally unstable man undergoing psyc...,synopsis mentally unstable man undergoing psyc...


In [215]:
data['Cleaned_Review'] = data['No_Numbers']

### Vectorizers

Vectorization is the process of converting language into numbers that a computer can work with for different purposes.

How is text vectorized?

In [219]:
from sklearn.feature_extraction.text import CountVectorizer

# Create a word-count vectorizer (Bag of Words) limited to 1000 features
vectorizer = CountVectorizer(max_features=1000)

# Fit and transform the reviews into a BoW matrix
bow_matrix = vectorizer.fit_transform(data['Cleaned_Review'])

# Convert the BoW matrix into a DataFrame
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Show the first rows of the Bag of Words DataFrame
bow_df.head()


Unnamed: 0,ability,able,absolutely,across,act,acting,action,actor,actors,actress,...,wrote,year,years,yes,yet,york,youll,young,youre,youve
0,0,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,1,0,2,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,1,1,0,0
3,0,1,0,0,0,0,0,0,1,0,...,0,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,2,0,0,...,0,0,0,0,1,0,1,0,1,0


What does `CountVectorizer` do?

1. It tokenizes the text (splits it into words).
2. It counts how many times each word appears in each document.
3. It builds a matrix in which each row is a document and each column is a word; each value is the number of times that word appears in that document.
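
The three steps above can be sketched by hand with the standard library. This is an illustration on a made-up two-document corpus, not scikit-learn's actual implementation.

```python
from collections import Counter

# Made-up two-document corpus
documents = ["good movie good plot", "bad movie"]

# 1. Tokenize each document (whitespace split, for simplicity)
tokenized = [document.split() for document in documents]

# 2. Count how often each word appears in each document
counts = [Counter(tokens) for tokens in tokenized]

# 3. Build the matrix: one row per document, one column per vocabulary word
vocabulary = sorted({word for tokens in tokenized for word in tokens})
bow_matrix_sketch = [[count[word] for word in vocabulary] for count in counts]

print(vocabulary)         # ['bad', 'good', 'movie', 'plot']
print(bow_matrix_sketch)  # [[0, 2, 1, 1], [1, 0, 1, 0]]
```

With its default settings, `CountVectorizer` additionally lowercases the text and ignores single-character tokens. `max_features` keeps only the most frequent terms across the corpus.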

In [255]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=3000)

# Fit and transform the reviews into a TF-IDF matrix.
# The raw 'Review' column is used here, so numbers and stopwords stay in the
# vocabulary; pass data['Cleaned_Review'] instead to vectorize the cleaned text
tfidf_matrix = tfidf_vectorizer.fit_transform(data['Review'])

# Convert the TF-IDF matrix into a DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Show the first rows of the TF-IDF DataFrame
tfidf_df.head()


Unnamed: 0,000,10,100,12,13,15,17,1993,1994,1995,...,yes,yet,york,you,young,younger,your,yourself,youth,zero
0,0.0,0.33974,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.039548,0.0,0.0,0.0458,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.055398,0.0,0.0,0.048117,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.033186,0.030106,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.036631,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.024544,0.0,0.026379,0.0,0.0,0.0,0.0,0.0,0.0


TF-IDF vectorizer (Term Frequency-Inverse Document Frequency)

This vectorization gives more weight to the words that are most informative for the analysis, while common words with little predictive value get less influence.

How does it work?

1. TF (term frequency):

   The frequency of a word in a specific document. It can be computed as the number of times the word appears in the document divided by the total number of words in that document. Example: if the word "document" appears 3 times in a 100-word document, its TF is 3/100 = 0.03.

2. IDF (inverse document frequency):

   Measures how rare a word is across the corpus, based on how many documents contain it; the goal is to reduce the importance of words that are common everywhere. It is computed as the logarithm of the total number of documents divided by the number of documents that contain the term. Words that appear in many documents have a low IDF, which reduces their weight.

3. TF-IDF (term frequency weighted by inverse document frequency):

   The product of TF and IDF. It assigns more weight to words that are important in a document but rare in the corpus as a whole.
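
The definitions above can be checked on a small example. The sketch below implements the textbook formulas exactly as described, on a made-up three-document corpus; the function names are hypothetical helpers, not part of any library.

```python
import math

# Toy corpus, already tokenized (illustrative only)
corpus = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie"],
    ["boring", "movie", "bad", "acting"],
]

def term_frequency(term, document):
    # Occurrences of the term divided by the document length
    return document.count(term) / len(document)

def inverse_document_frequency(term, corpus):
    # log(total documents / documents containing the term);
    # assumes the term appears in at least one document
    documents_with_term = sum(1 for document in corpus if term in document)
    return math.log(len(corpus) / documents_with_term)

def tf_idf(term, document, corpus):
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)

print(tf_idf("movie", corpus[0], corpus))  # 0.0, since "movie" is in every document
print(tf_idf("good", corpus[0], corpus))   # 0.5 * ln(3), about 0.549
```

The numbers produced by `TfidfVectorizer` will not match this sketch exactly. By default it uses raw counts for TF, a smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and L2-normalizes each row.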

In [257]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Split the data into training and test sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(tfidf_df, data['Category'], test_size=0.2, random_state=42)

# Train a Naive Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
#print("Classification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.805
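
In the cell above, the vectorizer is fitted on all 2000 reviews before the split, so the IDF weights also see the test reviews. One common alternative, sketched here on a made-up toy corpus, is to split the raw text first and wrap the vectorizer and the classifier in a scikit-learn pipeline, so the vectorizer is fitted on the training portion only. The toy texts and variable names are assumptions for illustration, and the accuracy reported above was not produced this way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus standing in for data['Review'] and data['Category']
texts = [
    "great movie with a wonderful plot", "boring film and terrible acting",
    "a wonderful and funny film", "terrible plot and a boring movie",
    "funny great acting", "a bad and boring film",
    "wonderful acting in a great film", "bad movie with terrible acting",
]
labels = ["pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]

# Split the raw text first, keeping the class balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training texts only,
# then reuses that fitted vocabulary and IDF when predicting
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(X_train, y_train)

predictions = pipeline.predict(X_test)
print(list(predictions))
```

On the real dataset, the same pattern would take `data['Review']` and `data['Category']` in place of the toy lists.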
