# Sentiment analysis on YouTube (English)

## Information provided by the website

**Original**  
Discover the YouTube Comments Dataset, a fully cleaned and preprocessed collection of YouTube video comments. This dataset is perfect for sentiment analysis, natural language processing, and text-based machine learning projects.   
With all irrelevant data already removed and cleaning steps thoroughly performed, it provides clean, structured information, allowing you to focus solely on insights and analysis.   
Dive into the world of social media trends and user behavior with this ready-to-use dataset!  
https://www.kaggle.com/datasets/atifaliak/youtube-comments-dataset   


# 0- Imports

In [49]:
import kagglehub
import pandas as pd

# 1- Data analysis

In [50]:
## 1.1- Loading the dataset



# Download latest version
path = kagglehub.dataset_download("atifaliak/youtube-comments-dataset")

print("Path to dataset files:", path)

Path to dataset files: C:\Users\Jorge\.cache\kagglehub\datasets\atifaliak\youtube-comments-dataset\versions\1


In [51]:
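# Load the .csv
import os
dataset = pd.read_csv(os.path.join(path, "YoutubeCommentsDataSet.csv"))
# read_csv raises on failure instead of returning None, so check for an empty frame
if not dataset.empty:
    print("Dataset loaded successfully")
else:
    print("Something went wrong, dataset is empty")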

Dataset loaded successfully


## 1.2 - Exploring the data

In [52]:
print("Dimensions of the dataset: ", dataset.shape)
print("Columns of the dataset: ",dataset.columns)
print("\n")
print("Start of the dataset: ", dataset.head(5))
print("\n")
print("End of the dataset: ",dataset.tail(5))

Dimensions of the dataset:  (18408, 2)
Columns of the dataset:  Index(['Comment', 'Sentiment'], dtype='object')


Start of the dataset:                                               Comment Sentiment
0  lets not forget that apple pay in 2014 require...   neutral
1  here in nz 50 of retailers don’t even have con...  negative
2  i will forever acknowledge this channel with t...  positive
3  whenever i go to a place that doesn’t take app...  negative
4  apple pay is so convenient secure and easy to ...  positive


End of the dataset:                                                   Comment Sentiment
18403  i really like the point about engineering tool...  positive
18404  i’ve just started exploring this field and thi...  positive
18405  excelente video con una pregunta filosófica pr...   neutral
18406  hey daniel just discovered your channel a coup...  positive
18407  this is great focus is key a playful approach ...  positive


From here on we take the sentiment labels at face value and assume they were assigned correctly.

In [53]:
# Number of unique values per column
print("Unique values per column: ", dataset.nunique())
# Unique labels in the Sentiment column
print(dataset.Sentiment.unique())

Unique values per column:  Comment      17871
Sentiment        3
dtype: int64
['neutral' 'negative' 'positive']
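
The counts above report 18408 rows but only 17871 unique comments, so some comment texts repeat (537 rows, assuming no missing values, which `nunique` does not count). A minimal sketch of how duplicates and class balance could be checked follows. The `toy` frame is a hypothetical stand-in with the same column names, not data from the real file.

```python
import pandas as pd

# Hypothetical rows that mimic the dataset's two columns
toy = pd.DataFrame({
    "Comment": ["great video", "great video", "bad audio", "ok i guess", "great video"],
    "Sentiment": ["positive", "positive", "negative", "neutral", "positive"],
})

# Rows whose Comment text already appeared earlier in the frame
n_duplicates = int(toy["Comment"].duplicated().sum())

# Number of rows per sentiment label
class_counts = toy["Sentiment"].value_counts()

print("Duplicated comments:", n_duplicates)
print(class_counts)
```

Running the same two calls on `dataset` would show how many rows repeat and whether one label dominates, which matters when reading accuracy later.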


# 2- Text preprocessing

In [54]:
import re
import unicodedata
import tqdm
import contractions
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize



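def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    # Drop embedded frames and scripts before extracting the visible text
    for tag in soup(['iframe', 'script']):
        tag.extract()
    stripped_text = soup.get_text()
    # Collapse runs of line breaks into a single newline
    stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)
    return stripped_text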

def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

def pre_process_corpus(docs):
    norm_docs = []
    for doc in tqdm.tqdm(docs):
        doc = strip_html_tags(doc)
        doc = doc.translate(doc.maketrans("\n\t\r", "   "))
        doc = doc.lower()
        doc = remove_accented_chars(doc)
        doc = contractions.fix(doc)
        # Remove special characters; flags must be passed by keyword,
        # because the fourth positional argument of re.sub is count
        doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
        # Collapse repeated spaces
        doc = re.sub(' +', ' ', doc)
        doc = doc.strip()
        norm_docs.append(doc)
    return norm_docs


def categoric_to_number(data):
    sentiment_mapping = {
        'positive': 2,
        'negative': 0,
        'neutral': 1
    }

    # Map each label to its integer code with a list comprehension
    return [sentiment_mapping[label] for label in data]
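
One detail of `re.sub` is worth keeping in mind for the normalization step: its fourth positional parameter is `count`, so flags are safest passed by keyword. The sketch below is a stdlib-only illustration, not the notebook's pipeline. `normalize_comment` is a hypothetical helper that omits the HTML stripping and contraction expansion.

```python
import re
import unicodedata

def normalize_comment(text):
    """Hypothetical helper: lowercase, strip accents, drop punctuation, squeeze spaces."""
    text = text.lower()
    # NFKD splits accented letters into base letter + combining mark;
    # encoding to ASCII with errors="ignore" then drops the marks
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^a-z0-9\s]", "", text, flags=re.A)
    return re.sub(r"\s+", " ", text).strip()

cleaned = normalize_comment("  Excelente VIDEO, con una pregunta filosófica!! ")
print(cleaned)

# re.I has the integer value 2, so passing it in the fourth position
# behaves like count=2 and only the first two matches are replaced
limited = re.sub(r"\d", "", "a1b2c3", count=int(re.I))
complete = re.sub(r"\d", "", "a1b2c3", flags=re.I)
print(limited, complete)
```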




In [55]:
# Work on a copy of the dataset
dataset_copy = dataset.copy()
print("Before processing: ",dataset_copy.head(5))


Before processing:                                               Comment Sentiment
0  lets not forget that apple pay in 2014 require...   neutral
1  here in nz 50 of retailers don’t even have con...  negative
2  i will forever acknowledge this channel with t...  positive
3  whenever i go to a place that doesn’t take app...  negative
4  apple pay is so convenient secure and easy to ...  positive


In [56]:
# Shuffle and split the data into train, validation and test sets
from sklearn.model_selection import train_test_split
y = dataset_copy['Sentiment']
X = dataset_copy.drop(columns=['Sentiment'])
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)


## Summary of the final split

| Set            | Share of total | Purpose                                    |
|----------------|----------------|--------------------------------------------|
| **Train**      | 80%            | Train the model                            |
| **Validation** | 10%            | Tune hyperparameters and avoid overfitting |
| **Test**       | 10%            | Evaluate the model on unseen data          |
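
The row counts behind these percentages can be checked by hand. The sketch below assumes that, for a fractional `test_size`, scikit-learn rounds the held-out count up and gives the remainder to the other side. `split_sizes` is a hypothetical helper, not a library function.

```python
import math

def split_sizes(n_samples, test_size):
    """Hypothetical helper: (kept, held_out) counts for a fractional test_size."""
    held_out = math.ceil(test_size * n_samples)
    return n_samples - held_out, held_out

# First split: 80% train, 20% temporary pool
n_train, n_temp = split_sizes(18408, 0.2)
# Second split: the pool is halved into validation and test
n_val, n_test = split_sizes(n_temp, 0.5)

print(n_train, n_val, n_test)
```

Under that assumption the result is 14726, 1841 and 1841 rows, which agrees with the totals the progress bars report when each split is preprocessed.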


In [57]:
# Apply the transformations to each split
X_train_processed = pre_process_corpus(X_train['Comment'].astype(str).values.tolist())
X_validation_processed = pre_process_corpus(X_val['Comment'].astype(str).values.tolist())
X_test_processed = pre_process_corpus(X_test['Comment'].astype(str).values.tolist())




100%|██████████| 14726/14726 [00:01<00:00, 9637.44it/s]
100%|██████████| 1841/1841 [00:00<00:00, 9897.93it/s]
100%|██████████| 1841/1841 [00:00<00:00, 10005.47it/s]


In [58]:
from tensorflow.keras.preprocessing.text import Tokenizer
import pickle

# Check the type of X_train_processed
print(type(X_train_processed))

# X_train_processed is already a list of strings, so it can be passed
# to the tokenizer directly without .astype(str).values.tolist()

t = Tokenizer(oov_token='<UNK>')
t.fit_on_texts(X_train_processed)  # Fit the tokenizer on the processed train texts
t.word_index['<PAD>'] = 0

# Convert the comments into integer sequences
X_train_token = t.texts_to_sequences(X_train_processed)
X_validation_token = t.texts_to_sequences(X_validation_processed)
X_test_token = t.texts_to_sequences(X_test_processed)

# Convert the y labels to integers
y_train_token = categoric_to_number(y_train)
y_validation_token = categoric_to_number(y_val)
y_test_token = categoric_to_number(y_test)




# Save the tokenizer to a file
with open('tokenizer.pkl', 'wb') as f:
    pickle.dump(t, f)




<class 'list'>
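
To make the tokenization step concrete, here is a simplified pure-Python stand-in for what the tokenizer does with `oov_token='<UNK>'`: words are ranked by frequency, index 1 is reserved for unknown words, and index 0 is left free for padding. `build_word_index` and `texts_to_sequences` below are hypothetical helpers. The real Keras `Tokenizer` also lowercases and filters punctuation by default, which this sketch leaves out.

```python
from collections import Counter

def build_word_index(texts, oov_token="<UNK>"):
    """Hypothetical helper: frequency-ranked word index, index 1 reserved for OOV."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    # most_common() orders by descending frequency; ties keep first-seen order
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def texts_to_sequences(texts, word_index, oov_token="<UNK>"):
    """Replace each word by its index, falling back to the OOV index."""
    oov_id = word_index[oov_token]
    return [[word_index.get(word, oov_id) for word in text.split()] for text in texts]

train_texts = ["apple pay is great", "apple pay is so convenient"]
word_index = build_word_index(train_texts)
sequences = texts_to_sequences(["apple pay is terrible"], word_index)

print(word_index)
print(sequences)
```

Because the index is built from the training texts only, a word such as "terrible" that appears only in validation or test comments maps to the `<UNK>` index.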


# 3- Visualizing the dataset

In [59]:
import matplotlib.pyplot as plt
%matplotlib inline

print("Vocabulary size={}".format(len(t.word_index)))
print("Number of comments={}".format(t.document_count))



Vocabulary size=32683
Number of comments=14726


# 4- Sequence normalization and label encoding

In [60]:
from tensorflow.keras.preprocessing import sequence
X_train_padded = sequence.pad_sequences(X_train_token, maxlen=15,padding="post",truncating="post")
X_test_padded = sequence.pad_sequences(X_test_token, maxlen=15,padding="post",truncating="post")
X_val_padded = sequence.pad_sequences(X_validation_token, maxlen=15,padding="post",truncating="post")
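
As an illustration of `padding="post"` with `truncating="post"`, the sketch below applies the same rule to plain Python lists: keep the first `maxlen` tokens and fill the remainder on the right with zeros. `pad_post` is a hypothetical helper, not part of Keras.

```python
def pad_post(sequence, maxlen, pad_value=0):
    """Hypothetical helper: keep the first maxlen tokens, then right-pad with pad_value."""
    truncated = sequence[:maxlen]
    return truncated + [pad_value] * (maxlen - len(truncated))

short = pad_post([5, 8, 2], maxlen=5)
long = pad_post([5, 8, 2, 9, 4, 7, 1], maxlen=5)

print(short)  # [5, 8, 2, 0, 0]
print(long)   # [5, 8, 2, 9, 4]
```

With `maxlen=15`, any comment longer than 15 tokens loses everything after the fifteenth word, so the choice of length is a trade-off between input size and how much of each comment the model sees.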

# 5- Training

In [61]:
# Hyperparameters of the model
EMBEDDING_DIM = 300
MAX_SEQUENCE_LENGTH = 15
VOCAB_SIZE = len(t.word_index)

In [70]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical

# Convert the labels to one-hot format
y_train_token = to_categorical(y_train_token, num_classes=3)
y_validation_token = to_categorical(y_validation_token, num_classes=3)
y_test_token = to_categorical(y_test_token, num_classes=3)


# Build the model
model = Sequential()

# Embedding layer
model.add(Embedding(VOCAB_SIZE, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))

# LSTM layer
model.add(LSTM(128))

# Dense layer
model.add(Dense(64, activation='relu'))

# Output layer with softmax activation for multiclass classification
model.add(Dense(3, activation='softmax'))  # 3 output classes

# Compile the model with categorical_crossentropy
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(learning_rate=5e-4),
              metrics=['accuracy'])

# Model summary
model.summary()
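
The summary output is not shown here, so as a rough cross-check the trainable parameter counts can be worked out by hand. This is a sketch under stated assumptions: standard layer defaults with biases enabled, and a vocabulary size of 32683 as printed in section 3. The helper names are hypothetical.

```python
VOCAB_SIZE = 32683      # vocabulary size printed earlier in the notebook
EMBEDDING_DIM = 300
LSTM_UNITS = 128

def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

counts = {
    "embedding": embedding_params(VOCAB_SIZE, EMBEDDING_DIM),
    "lstm": lstm_params(EMBEDDING_DIM, LSTM_UNITS),
    "dense_64": dense_params(LSTM_UNITS, 64),
    "dense_3": dense_params(64, 3),
}
total = sum(counts.values())

for name, value in counts.items():
    print(f"{name}: {value:,}")
print(f"total: {total:,}")
```

Under these assumptions the embedding table accounts for roughly 9.8 of about 10 million parameters, which is large relative to 14726 training comments.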

In [66]:
import numpy as np
print(X_train_padded.shape)


# Convert the lists to NumPy arrays
y_train_token = np.array(y_train_token)
y_validation_token = np.array(y_validation_token)
y_test_token = np.array(y_test_token)



(14726, 15)


In [None]:

history = model.fit(
    X_train_padded, y_train_token,  # Training data
    validation_data=(X_val_padded, y_validation_token),  # Validation data (the test split stays held out)
    epochs=15,  # Number of epochs
    batch_size=32,  # Batch size
    verbose=2  # Print one line per epoch
)


Epoch 1/30


ValueError: Attr 'Toutput_types' of 'OptionalFromValue' Op passed list of length 0 less than minimum 1.

# Saving the model

In [None]:
model.save('analisis-sentimiento-comentarios.h5')



# Loading the model from the saved file

In [None]:
from tensorflow.keras.models import load_model
# Load the saved model
model = load_model('analisis-sentimiento-comentarios.h5')
# Show the model summary
model.summary()



In [None]:
# Model test
import pickle
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np


# Load the tokenizer before predicting
with open('tokenizer.pkl', 'rb') as f:
    t = pickle.load(f)

def predecir_sentimiento(comentario):
    # The raw comment is tokenized as-is; training text also went through pre_process_corpus
    comentario_tokenizado = t.texts_to_sequences([comentario])
    comentario_padded = pad_sequences(comentario_tokenizado, maxlen=15, padding="post", truncating="post")
    prediccion = model.predict(comentario_padded)
    # Index of the most probable class for the single input comment
    clase_predicha = int(np.argmax(prediccion))

    etiquetas = {0: "Negative", 1: "Neutral", 2: "Positive"}
    return etiquetas[clase_predicha]


# List of positive comments
positive_comments = [
    "I absolutely love this product! It works perfectly and exceeded my expectations.",
    "Great customer service! They responded quickly and solved my issue in no time.",
    "The quality is outstanding, and it feels like a premium product.",
    "Fast shipping and exactly as described. Will buy again!",
    "This is the best purchase I've made this year. Totally worth the price!"
]

# List of negative comments
negative_comments = [
    "Terrible experience! The product broke within a week.",
    "Customer service was unhelpful and rude. I wouldn't recommend this company.",
    "Not what I expected. Poor quality and completely different from the pictures.",
    "Shipping took forever, and when it arrived, it was damaged.",
    "A waste of money. I regret buying this."
]

# List of neutral comments
neutral_comments = [
    "The product is okay, nothing special.",
    "It does what it says, but I expected better quality for the price.",
    "Not bad, but not great either. Just an average product.",
    "Arrived on time, but I haven't tried it yet.",
    "It’s fine, but I don’t think I’d buy it again."
]

with open("resultados.txt", "w", encoding="utf-8") as f:
    # Test with the sample comments; each header ends with a newline
    # so the first result does not run into it
    f.write("POSITIVE TEST-----------------\n")
    for comentario in positive_comments:
        f.write(f"Comment: {comentario}\nValue: {predecir_sentimiento(comentario)}\n")

    f.write("NEGATIVE TEST-----------------\n")
    for comentario in negative_comments:
        f.write(f"Comment: {comentario}\nValue: {predecir_sentimiento(comentario)}\n")

    f.write("NEUTRAL TEST-----------------\n")
    for comentario in neutral_comments:
        f.write(f"Comment: {comentario}\nValue: {predecir_sentimiento(comentario)}\n")
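
The prediction step reduces a three-value softmax output to a single label by taking the position of the largest probability, using the same integer codes as `categoric_to_number` (0 negative, 1 neutral, 2 positive). A minimal sketch with a made-up probability vector is shown below. `decode_prediction` is a hypothetical helper that also returns the winning probability, which can help spot low-confidence predictions.

```python
LABELS = {0: "Negative", 1: "Neutral", 2: "Positive"}

def decode_prediction(probabilities):
    """Hypothetical helper: return (label, probability) for the most likely class."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return LABELS[best], probabilities[best]

# Made-up softmax output for one comment, in the order negative, neutral, positive
label, confidence = decode_prediction([0.10, 0.25, 0.65])
print(label, confidence)
```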


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 127ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 26ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 25ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 25ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 25ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 25ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 25ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 25ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 25ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 26ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 25ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 25ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 25ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2