# Twitter Sentiment Analysis

### Overview
This is an entity-level sentiment analysis dataset of Twitter messages. Given a message and an entity, the task is to judge the sentiment of the message toward that entity. The dataset defines three classes: Positive, Negative, and Neutral. Messages that are not relevant to the entity (i.e., Irrelevant) are meant to be treated as Neutral. The raw files still carry Irrelevant as a separate label, and the model in this notebook is trained on all four labels.

Data:

passionate-nlp. (2021). Twitter Sentiment Analysis. Kaggle.com. https://doi.org/10/2510249/62d9f449c80ece3e48827afd505f39de

#### **Author: Juan Mario Moreno**


In [1]:
import pandas as pd

# Load the data
train_data = pd.read_csv("twitter_training.csv")
validation_data = pd.read_csv("twitter_validation.csv")

# Show the first rows of the training set
print(train_data.head())

# Check general information (info() prints directly and returns None)
train_data.info()
validation_data.info()

   2401  Borderlands  Positive  \
0  2401  Borderlands  Positive   
1  2401  Borderlands  Positive   
2  2401  Borderlands  Positive   
3  2401  Borderlands  Positive   
4  2401  Borderlands  Positive   

  im getting on borderlands and i will murder you all ,  
0  I am coming to the borders and I will kill you...     
1  im getting on borderlands and i will kill you ...     
2  im coming on borderlands and i will murder you...     
3  im getting on borderlands 2 and i will murder ...     
4  im getting into borderlands and i can murder y...     
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74681 entries, 0 to 74680
Data columns (total 4 columns):
 #   Column                                                 Non-Null Count  Dtype 
---  ------                                                 --------------  ----- 
 0   2401                                                   74681 non-null  int64 
 1   Borderlands                                            74681 non-null  object
 2   Po

In [2]:
train_data

Unnamed: 0,2401,Borderlands,Positive,"im getting on borderlands and i will murder you all ,"
0,2401,Borderlands,Positive,I am coming to the borders and I will kill you...
1,2401,Borderlands,Positive,im getting on borderlands and i will kill you ...
2,2401,Borderlands,Positive,im coming on borderlands and i will murder you...
3,2401,Borderlands,Positive,im getting on borderlands 2 and i will murder ...
4,2401,Borderlands,Positive,im getting into borderlands and i can murder y...
...,...,...,...,...
74676,9200,Nvidia,Positive,Just realized that the Windows partition of my...
74677,9200,Nvidia,Positive,Just realized that my Mac window partition is ...
74678,9200,Nvidia,Positive,Just realized the windows partition of my Mac ...
74679,9200,Nvidia,Positive,Just realized between the windows partition of...


## train_data and validation_data have the following columns:

- ID: A tweet identifier (not needed for the analysis). It is not unique per row: several variants of the same tweet share one ID, as with 2401 above.

- Tema (topic): The entity the tweet is about (for example, "Borderlands", "Facebook").

- Sentimiento (sentiment): The sentiment label ("Positive", "Negative", "Neutral", or "Irrelevant").

- Texto (text): The text of the tweet.

In [3]:
validation_data

Unnamed: 0,3364,Facebook,Irrelevant,"I mentioned on Facebook that I was struggling for motivation to go for a run the other day, which has been translated by Tom’s great auntie as ‘Hayley can’t get out of bed’ and told to his grandma, who now thinks I’m a lazy, terrible person 🤣"
0,352,Amazon,Neutral,BBC News - Amazon boss Jeff Bezos rejects clai...
1,8312,Microsoft,Negative,@Microsoft Why do I pay for WORD when it funct...
2,4371,CS-GO,Negative,"CSGO matchmaking is so full of closet hacking,..."
3,4433,Google,Neutral,Now the President is slapping Americans in the...
4,6273,FIFA,Negative,Hi @EAHelp I’ve had Madeleine McCann in my cel...
...,...,...,...,...
994,4891,GrandTheftAuto(GTA),Irrelevant,⭐️ Toronto is the arts and culture capital of ...
995,4359,CS-GO,Irrelevant,tHIS IS ACTUALLY A GOOD MOVE TOT BRING MORE VI...
996,2652,Borderlands,Positive,Today sucked so it’s time to drink wine n play...
997,8069,Microsoft,Positive,Bought a fraction of Microsoft today. Small wins.


In [5]:
print(train_data.columns)
print(validation_data.columns)

Index(['2401', 'Borderlands', 'Positive',
       'im getting on borderlands and i will murder you all ,'],
      dtype='object')
Index(['3364', 'Facebook', 'Irrelevant',
       'I mentioned on Facebook that I was struggling for motivation to go for a run the other day, which has been translated by Tom’s great auntie as ‘Hayley can’t get out of bed’ and told to his grandma, who now thinks I’m a lazy, terrible person 🤣'],
      dtype='object')


In [6]:
# Load the data with custom column names (the files have no header row)
train_data = pd.read_csv("twitter_training.csv", header=None, names=["ID", "Tema", "Sentimiento", "Texto"])
validation_data = pd.read_csv("twitter_validation.csv", header=None, names=["ID", "Tema", "Sentimiento", "Texto"])

# Check the columns
print(train_data.columns)
print(validation_data.columns)

Index(['ID', 'Tema', 'Sentimiento', 'Texto'], dtype='object')
Index(['ID', 'Tema', 'Sentimiento', 'Texto'], dtype='object')
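
The reload matters because `pd.read_csv` treats the first row as the header unless told otherwise, which is why the earlier column index showed a tweet where a column name should be. A minimal sketch of the difference, using two made-up rows held in memory (the rows are illustrative, not taken from the dataset files):

```python
import io
import pandas as pd

# Two illustrative rows in the same four-column layout, with no header row
raw = "1,TopicA,Positive,first tweet\n2,TopicB,Negative,second tweet\n"

# Default behaviour: the first data row is consumed as column names
with_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every row as data; names= supplies the column labels
with_names = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["ID", "Tema", "Sentimiento", "Texto"],
)

print(len(with_default), list(with_default.columns))
print(len(with_names), list(with_names.columns))
```

With the default call, one row per file is lost to the header, so passing `header=None` also recovers the first tweet of each file.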


# Text cleaning and preprocessing

In [8]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [10]:
print(train_data['Texto'].isnull().sum())
print(validation_data['Texto'].isnull().sum())

686
0


In [11]:
train_data['Texto'] = train_data['Texto'].fillna('')
validation_data['Texto'] = validation_data['Texto'].fillna('')
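
Filling with an empty string keeps every row, but those rows then reach the model with no text at all. One alternative, sketched here on a made-up frame, is to drop rows whose text is missing or blank. Whether dropping is the better choice for this dataset is an assumption to verify, not something the notebook tests.

```python
import pandas as pd

# Illustrative frame using the same column names as the notebook
df = pd.DataFrame({
    "Sentimiento": ["Positive", "Negative", "Neutral"],
    "Texto": ["great game", None, "   "],
})

# Treat missing values as empty strings, then flag rows with visible text
text = df["Texto"].fillna("").astype(str)
has_text = text.str.strip() != ""

# Keep only the rows that still contain text
df_clean = df[has_text].reset_index(drop=True)
print(len(df), "->", len(df_clean))
```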

In [12]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

# Built once here rather than on every call to clean_text.
# NOTE: the tweets are in English, so "english" is the natural choice for both
# settings below. Changing them alters the features, so the scores reported
# further down would need to be regenerated.
stop_words = set(stopwords.words("spanish"))
stemmer = SnowballStemmer("spanish")

# Function to clean text
def clean_text(text):
    # Remove URLs, mentions, and hashtags
    text = re.sub(r"http\S+|@\S+|#\S+", "", text)
    # Remove punctuation and other special characters
    text = re.sub(r"[^\w\s]", "", text)
    # Convert to lowercase
    text = text.lower()
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # Stem
    tokens = [stemmer.stem(word) for word in tokens]
    return " ".join(tokens)

# Apply the function to the tweets
train_data['cleaned_text'] = train_data['Texto'].apply(clean_text)
validation_data['cleaned_text'] = validation_data['Texto'].apply(clean_text)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
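
Because the tweets are in English, an English configuration of the same cleaning steps may be worth comparing. The sketch below is a hypothetical variant, not part of the notebook: `clean_text_en` is an illustrative name, scikit-learn's `ENGLISH_STOP_WORDS` stands in for NLTK's English stopword list, and whitespace splitting stands in for `word_tokenize`, so its output will differ somewhat from an NLTK-only version.

```python
import re
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# English stemmer; the Snowball algorithm itself needs no NLTK download
stemmer_en = SnowballStemmer("english")

def clean_text_en(text):
    """Hypothetical English variant of clean_text (not part of the notebook)."""
    # Remove URLs, mentions, and hashtags
    text = re.sub(r"http\S+|@\S+|#\S+", "", text)
    # Remove punctuation and lowercase
    text = re.sub(r"[^\w\s]", "", text).lower()
    # Whitespace tokenization stands in for word_tokenize here
    tokens = [t for t in text.split() if t not in ENGLISH_STOP_WORDS]
    return " ".join(stemmer_en.stem(t) for t in tokens)

result = clean_text_en("@user I am loving the new update! https://example.com #gaming")
print(result)
```

Swapping this in would change the TF-IDF vocabulary, so any scores would have to be recomputed rather than compared directly with the ones reported in this notebook.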


In [17]:
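# Vectorize the text with TF-IDF, keeping the 5,000 most frequent terms
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_data['cleaned_text'])
X_val = vectorizer.transform(validation_data['cleaned_text'])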

In [14]:

# Get the labels
y_train = train_data['Sentimiento']
y_val = validation_data['Sentimiento']
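
The overview treats Irrelevant as Neutral, while the labels taken here keep all four values. If a three-class setup is wanted, one possible approach is to remap the label column before training. The sketch uses a small made-up Series in place of the real `Sentimiento` column:

```python
import pandas as pd

# Illustrative labels; in the notebook these would come from the 'Sentimiento' column
labels = pd.Series(["Positive", "Irrelevant", "Negative", "Neutral", "Irrelevant"])

# Map Irrelevant onto Neutral and leave the other labels untouched
merged = labels.replace({"Irrelevant": "Neutral"})

print(merged.value_counts().to_dict())
```

Applying the same mapping to both `y_train` and `y_val` before fitting would give a three-class model. The scores reported in this notebook are for the four-label setup.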

In [18]:
# Train the model
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

model = MultinomialNB()
model.fit(X_train, y_train)

In [19]:
# Predict on the validation set
y_pred = model.predict(X_val)

In [21]:
from sklearn.metrics import accuracy_score, classification_report

# Evaluate the model
print("Accuracy:", accuracy_score(y_val, y_pred))
print("Classification Report:\n", classification_report(y_val, y_pred))

Accuracy: 0.701
Classification Report:
               precision    recall  f1-score   support

  Irrelevant       0.83      0.48      0.61       172
    Negative       0.62      0.83      0.71       266
     Neutral       0.81      0.58      0.68       285
    Positive       0.68      0.84      0.75       277

    accuracy                           0.70      1000
   macro avg       0.74      0.68      0.69      1000
weighted avg       0.73      0.70      0.69      1000
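
The two summary rows differ only in how the per-class scores are combined. As a worked check for the precision column, the sketch below recomputes both averages from the rounded per-class values in the report, so small differences from the printed figures are expected:

```python
# Per-class precision and support copied from the report above
precision = {"Irrelevant": 0.83, "Negative": 0.62, "Neutral": 0.81, "Positive": 0.68}
support = {"Irrelevant": 172, "Negative": 266, "Neutral": 285, "Positive": 277}

total = sum(support.values())

# Macro average: every class counts equally
macro = sum(precision.values()) / len(precision)

# Weighted average: each class counts in proportion to its support
weighted = sum(precision[c] * support[c] for c in precision) / total

print(total, round(macro, 3), round(weighted, 3))
```

The recomputed values, about 0.735 and 0.727, agree with the reported 0.74 and 0.73 to within rounding. The macro figure is the one to watch when a small class such as Irrelevant performs differently from the rest.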



# Testing the model

In [23]:
new_tweets = [
    "I love the new iPhone! It's amazing!",
    "This game is so boring, I regret buying it.",
    "The weather today is neither good nor bad.",
    "Just saw a new movie, it was okay."
]

In [24]:
# Apply the cleaning function to the new tweets
new_tweets_cleaned = [clean_text(tweet) for tweet in new_tweets]

In [25]:
# Vectorize the new tweets
X_new = vectorizer.transform(new_tweets_cleaned)

In [26]:
# Predict the labels
predictions = model.predict(X_new)

# Show the results
for tweet, sentiment in zip(new_tweets, predictions):
    print(f"Tweet: {tweet} \nSentiment: {sentiment}\n")

Tweet: I love the new iPhone! It's amazing! 
Sentiment: Positive

Tweet: This game is so boring, I regret buying it. 
Sentiment: Negative

Tweet: The weather today is neither good nor bad. 
Sentiment: Negative

Tweet: Just saw a new movie, it was okay. 
Sentiment: Positive
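
Two of these predictions look doubtful: the neutral-sounding weather tweet comes out Negative, and "it was okay" comes out Positive. Looking at class probabilities rather than only the top label can show how confident the model is in such cases. The sketch below uses a tiny made-up training set (`toy_vectorizer` and `toy_model` are illustrative names) so that it runs on its own:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set, only to make the sketch runnable
texts = ["love this game", "great phone", "hate this bug", "terrible update"]
labels = ["Positive", "Positive", "Negative", "Negative"]

toy_vectorizer = TfidfVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(texts), labels)

# predict_proba returns one probability per class, ordered as in classes_
probs = toy_model.predict_proba(toy_vectorizer.transform(["it was okay"]))[0]
for label, p in zip(toy_model.classes_, probs):
    print(f"{label}: {p:.2f}")
```

In this toy case none of the words in the input appear in the vocabulary, so the probabilities fall back to the class priors. With the notebook's own objects, the equivalent call would be `model.predict_proba(X_new)`.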



# Saving the model

In [27]:
import joblib

# Save the model
joblib.dump(model, "sentiment_model.pkl")

# Save the vectorizer
joblib.dump(vectorizer, "tfidf_vectorizer.pkl")

['tfidf_vectorizer.pkl']
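
To reuse the saved files later, both have to be loaded together, since the model only understands features produced by the vectorizer it was trained with. A minimal round-trip sketch, with a tiny made-up model standing in for the trained one and a temporary directory in place of the working directory:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up model standing in for the trained one
texts = ["love this game", "hate this bug"]
labels = ["Positive", "Negative"]
vec = TfidfVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "sentiment_model.pkl")
    vec_path = os.path.join(tmp, "tfidf_vectorizer.pkl")
    joblib.dump(clf, model_path)
    joblib.dump(vec, vec_path)

    # Load both artifacts; the vectorizer must be the one fitted with the model
    loaded_model = joblib.load(model_path)
    loaded_vec = joblib.load(vec_path)

prediction = loaded_model.predict(loaded_vec.transform(["love this"]))
print(prediction[0])
```

With the real artifacts, new text should also go through `clean_text` before `transform`, matching what was done at training time.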