<a href="https://colab.research.google.com/github/Israel-Mege/sentiment_analysis/blob/main/01_Sentiment_Analysis_BCXP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 🎯 Business Problem

The goal of this project is to analyze the sentiment of tweets, identifying whether each message is positive or negative.

# Cleaning

Bootcamp Xperience. This exercise is based on the Sentiment140 dataset, a collection of 1,600,000 preprocessed tweets. The aim is to generate a set of features that will be used to train a classification model.

The dataset is a collection of 1.6 million tweets that **have been labeled as positive or negative**.


Working question: how did they get so many labeled tweets? Documentation: https://www.kaggle.com/datasets/kazanova/sentiment140

Working hypothesis: we assume that any tweet containing a positive emoticon, such as :), was labeled positive, and any tweet containing a negative emoticon, such as :(, was labeled negative.

**That is worth discussing later, but for now let's simply clean things up. In this notebook we will drop the columns we don't need and standardize the sentiment column.**
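
As a rough sketch of those two steps, the snippet below drops the metadata columns and maps the original polarity codes to a 0/1 label. The toy rows, the choice of columns to drop, and the `sentiment` column name are assumptions made for illustration; the notebook itself does not fix them.

```python
import pandas as pd

# Toy rows in the Sentiment140 layout (values are illustrative only)
toy = pd.DataFrame({
    "target": [0, 4, 4],
    "ids": [1, 2, 3],
    "date": ["Mon Apr 06 22:19:45 PDT 2009"] * 3,
    "flag": ["NO_QUERY"] * 3,
    "user": ["user_a", "user_b", "user_c"],
    "text": ["feel sick", "great day :)", "love this"],
})

# Assumption: only the label and the text are needed for modelling
toy = toy.drop(columns=["ids", "date", "flag", "user"])

# Standardize the label: 0 stays negative, 4 becomes 1 (positive)
toy["sentiment"] = toy["target"].map({0: 0, 4: 1})
print(toy)
```

Any value not listed in the mapping (for example a neutral 2) would become `NaN`, which makes unexpected labels easy to spot.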



# Prep work

In [24]:
# Install kagglehub if it is not already installed (must run before the import)
!pip install kagglehub[pandas-datasets]

import pandas as pd
import re
import string
import nltk
from nltk.corpus import stopwords
import kagglehub
from kagglehub import KaggleDatasetAdapter

# Download the NLTK stopwords
nltk.download('stopwords')



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [12]:
# Load the dataset with explicit parsing parameters.
# Note: the CSV has no header row, so pandas uses the first tweet as column
# names here; that is why the df.info() output that follows shows 1,599,999 rows.
try:
    df = kagglehub.load_dataset(
        KaggleDatasetAdapter.PANDAS,
        "kazanova/sentiment140",
        "training.1600000.processed.noemoticon.csv",
        pandas_kwargs={
            'encoding': 'latin-1',
            'on_bad_lines': 'skip'
        }
    )
    print("Dataset loaded successfully!")
    print("\nFirst 5 rows:", df.head())
    print("\nDataset shape:", df.shape)

except Exception as e:
    print("Error loading the dataset:", str(e))

Dataset loaded successfully!

First 5 rows:    0  1467810369  Mon Apr 06 22:19:45 PDT 2009  NO_QUERY _TheSpecialOne_  \
0  0  1467810672  Mon Apr 06 22:19:49 PDT 2009  NO_QUERY   scotthamilton   
1  0  1467810917  Mon Apr 06 22:19:53 PDT 2009  NO_QUERY        mattycus   
2  0  1467811184  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY         ElleCTF   
3  0  1467811193  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY          Karoli   
4  0  1467811372  Mon Apr 06 22:20:00 PDT 2009  NO_QUERY        joy_wolf   

  @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D  
0  is upset that he can't update his Facebook by ...                                                                   
1  @Kenichan I dived many times for the ball. Man...                                                                   
2    my whole body feels itchy and like its on fire                                                                    
3  @nationwide
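
The column headers in the preview above are really the fields of the first tweet: the CSV has no header row, so pandas promoted its first record to column names. One possible way around this, sketched here on an in-memory two-row sample rather than the real file, is to pass `header=None` and explicit `names` to `pd.read_csv`. The same two keys could presumably be added to `pandas_kwargs`, though that has not been run here.

```python
import io
import pandas as pd

# Two records in the same headerless, quoted layout as the Sentiment140 CSV
raw = (
    '"0","1467810369","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","_TheSpecialOne_","what a bummer"\n'
    '"0","1467810672","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","scotthamilton","is upset"\n'
)

columns = ["target", "ids", "date", "flag", "user", "text"]

# header=None keeps the first record as data; names supplies the column labels
sample = pd.read_csv(io.StringIO(raw), header=None, names=columns)
print(sample.shape)
print(sample[["user", "text"]])
```

With this approach the later `df.columns = [...]` assignment would no longer be needed, because the names are set at load time.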

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599999 entries, 0 to 1599998
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   target  1599999 non-null  int64 
 1   ids     1599999 non-null  int64 
 2   date    1599999 non-null  object
 3   flag    1599999 non-null  object
 4   user    1599999 non-null  object
 5   text    1599999 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


The tweets have been annotated (0 = negative, 2 = neutral, 4 = positive) and can be used to detect sentiment.
The dataset contains the following 6 fields:

1. **target:** the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)

2. **ids:** the id of the tweet (2087)

3. **date:** the date of the tweet (Sat May 16 23:58:44 UTC 2009)

4. **flag:** the query (lyx). If there is no query, then this value is NO_QUERY.

5. **user:** the user that tweeted (robotickilldozr)

6. **text:** the text of the tweet

In [17]:
df.columns = ['target', 'ids', 'date', 'flag', 'user', 'text']
df

Unnamed: 0,target,ids,date,flag,user,text
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew
...,...,...,...,...,...,...
1599994,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599995,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599996,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599997,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...



### Data Processing

The preprocessing stage performs the following key operations:

- 🧹 **Removal of URLs, mentions, and emojis**: These are stripped from the text because they can add noise to the analysis.
- ❌ **Removal of punctuation and special characters**: To normalize the text before applying the transformations.
- 🔡 **Lowercasing**: So that capitalized words are not treated as different words.
- 🛠️ **Additional feature generation**: New variables are derived from the text, such as tweet length, stopword count, and uppercase density, among others.


In [19]:
# counting missing values in dataset
df.isnull().sum()


Unnamed: 0,0
target,0
ids,0
date,0
flag,0
user,0
text,0


In [20]:
# Create a 10-row random sample
df_nuevo = df.sample(n=10, random_state=42).reset_index(drop=True)
df_nuevo

Unnamed: 0,target,ids,date,flag,user,text
0,0,2200003313,Tue Jun 16 18:18:13 PDT 2009,NO_QUERY,DEWGetMeTho77,@Nkluvr4eva My poor little dumpling In Holmde...
1,0,1467998601,Mon Apr 06 23:11:18 PDT 2009,NO_QUERY,Young_J,I'm off too bed. I gotta wake up hella early t...
2,0,2300049112,Tue Jun 23 13:40:12 PDT 2009,NO_QUERY,dougnawoschik,I havent been able to listen to it yet My spe...
3,0,1993474319,Mon Jun 01 10:26:09 PDT 2009,NO_QUERY,thireven,now remembers why solving a relatively big equ...
4,0,2256551006,Sat Jun 20 12:56:51 PDT 2009,NO_QUERY,taracollins086,"Ate too much, feel sick"
5,0,2052381070,Sat Jun 06 00:32:23 PDT 2009,NO_QUERY,Portablemonkey,Tried to purchase a parked domain through GoDa...
6,4,1983449090,Sun May 31 13:10:36 PDT 2009,NO_QUERY,jessig06,on lunch....dj should come eat with me
7,0,2245480599,Fri Jun 19 16:11:34 PDT 2009,NO_QUERY,Aligrl,Just got back from VA Tech Equine Medical Cent...
8,0,1770706008,Mon May 11 22:01:35 PDT 2009,NO_QUERY,leyyyy,can't log in to my other twitter account. supe...
9,4,2050057894,Fri Jun 05 17:59:34 PDT 2009,NO_QUERY,AmiAhuja,@TamaraSchilling Adventure - That's what we al...
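
The random sample above contains eight negative tweets and only two positive ones. If a balanced sample is preferred, one option is to sample within each class. The sketch below uses a small synthetic frame, and `per_class` is a hypothetical parameter name; `groupby(...).sample` requires pandas 1.1 or later.

```python
import pandas as pd

# Synthetic stand-in for the full dataset: 6 negative and 6 positive rows
toy = pd.DataFrame({
    "target": [0] * 6 + [4] * 6,
    "text": [f"tweet {i}" for i in range(12)],
})

per_class = 2  # hypothetical: rows to draw from each sentiment class

# Draw the same number of rows from every target value
balanced = (
    toy.groupby("target")
       .sample(n=per_class, random_state=42)
       .reset_index(drop=True)
)
print(balanced["target"].value_counts())
```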


In [22]:
# Save the DataFrame as a CSV file in Colab
df_nuevo.to_csv('/content/nuevo_dataset.csv', index=False)

# To read the dataset back later in the same session:
df_cargado = pd.read_csv('/content/nuevo_dataset.csv')

## Data cleaning

In [25]:
# Function to clean the text
def clean_text(text):
    # Convert to lowercase
    text = str(text).lower()

    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)

    # Remove mentions (@user)
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove numbers
    text = re.sub(r'\d+', '', text)

    # Collapse repeated whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Apply the cleaning to the text
df['clean_text'] = df['text'].apply(clean_text)
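
One side effect of the URL pattern above is worth knowing: the bare `www\S+` alternative also matches inside ordinary words, so "Awww," is reduced to "a" (visible in row 0 of the later outputs). A possible stricter variant, shown only as a sketch with the hypothetical name `strip_urls`, requires either a scheme or `www.` at a word boundary.

```python
import re

# Hypothetical stricter pattern: a scheme, or "www." at a word boundary
URL_PATTERN = re.compile(r'https?://\S+|\bwww\.\S+')

def strip_urls(text):
    """Remove URLs without touching words that merely contain 'www'."""
    return URL_PATTERN.sub('', text)

# Pattern used in clean_text: eats the "www," inside "awww,"
original = re.sub(r'http\S+|www\S+|https\S+', '', "awww, what a bummer")

# Stricter pattern: keeps "awww," and still removes the real URL
stricter = strip_urls("awww, what a bummer www.example.com")

print(repr(original))
print(repr(stricter))
```

Swapping this pattern into `clean_text` would change the cleaned text and possibly the derived features, so the recorded outputs in this notebook would no longer match exactly.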

In [30]:
# Apply the cleaning to the sample's text column
df_cargado['clean_text'] = df_cargado['text'].apply(clean_text)

In [35]:
print("DataFrame info:")
df_cargado.info()

print("\nExamples of cleaned text:")
print(df_cargado[['text', 'clean_text']].head())

print("\nNull counts per column:")
df_cargado.isna().sum()

DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   target      10 non-null     int64 
 1   ids         10 non-null     int64 
 2   date        10 non-null     object
 3   flag        10 non-null     object
 4   user        10 non-null     object
 5   text        10 non-null     object
 6   clean_text  10 non-null     object
dtypes: int64(2), object(5)
memory usage: 692.0+ bytes

Examples of cleaned text:
                                                text  \
0  @Nkluvr4eva My poor little dumpling  In Holmde...   
1  I'm off too bed. I gotta wake up hella early t...   
2  I havent been able to listen to it yet  My spe...   
3  now remembers why solving a relatively big equ...   
4                           Ate too much, feel sick    

                                          clean_text  
0  my poor little dumpling in holmde

Unnamed: 0,0
target,0
ids,0
date,0
flag,0
user,0
text,0
clean_text,0


In [None]:
# Alternative text-cleaning code; NOT yet adapted to or tested on this dataset
# https://www.kaggle.com/code/youssefhatem1/twitter-sentiment-analysis-using-cnn-architectures
import numpy as np
from nltk.stem import WordNetLemmatizer

# The lemmatizer needs the WordNet corpus
nltk.download('wordnet')

# Define English stopwords
en_stop = set(stopwords.words('english'))

# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Define function to clean and preprocess text
def process_text(document):
    document = str(document)  # Guard against non-string input
    document = re.sub(r'\s+', ' ', document)  # Collapse extra white space
    document = re.sub(r'\W', ' ', document)   # Replace special characters with spaces
    document = re.sub(r'[^a-zA-Z\s]', '', document) # Remove any character that isn't alphabetical
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)     # Remove single characters surrounded by spaces
    document = document.lower()    # Convert to lowercase
    # Word tokenization
    tokens = document.split()
    lemma_txt = [lemmatizer.lemmatize(word) for word in tokens]
    lemma_no_stop_txt = [word for word in lemma_txt if word not in en_stop]
    # Drop short words (3 characters or fewer)
    tokens = [word for word in lemma_no_stop_txt if len(word) > 3 and not word.isdigit()]
    # Keep the first occurrence of each word, preserving order
    indices = np.unique(tokens, return_index=True)[1]
    clean_txt = np.array(tokens)[np.sort(indices)].tolist()
    return ' '.join(clean_txt)  # Return the cleaned text as a string

In [None]:
# Generate additional features
df['tweet_length'] = df['text'].str.len()
df['word_count'] = df['clean_text'].str.split().str.len()
df['uppercase_ratio'] = df['text'].apply(lambda x: sum(1 for c in str(x) if c.isupper()) / len(str(x)) if len(str(x)) > 0 else 0)

# Count stopwords
stop_words = set(stopwords.words('english'))
df['stopwords_count'] = df['clean_text'].apply(lambda x: len([word for word in str(x).split() if word in stop_words]))

# Show the result
print(df[['text', 'clean_text', 'tweet_length', 'word_count', 'uppercase_ratio', 'stopwords_count']].head())

# Save the processed dataset
df.to_csv('cleaned_dataset.csv', index=False)

                                                text  \
0  @switchfoot http://twitpic.com/2y1zl - Awww, t...   
1  is upset that he can't update his Facebook by ...   
2  @Kenichan I dived many times for the ball. Man...   
3    my whole body feels itchy and like its on fire    
4  @nationwideclass no, it's not behaving at all....   

                                          clean_text  tweet_length  \
0  a thats a bummer you shoulda got david carr of...           115   
1  is upset that he cant update his facebook by t...           111   
2  i dived many times for the ball managed to sav...            89   
3     my whole body feels itchy and like its on fire            47   
4  no its not behaving at all im mad why am i her...           111   

   word_count  uppercase_ratio  stopwords_count  
0          16         0.060870                8  
1          21         0.027027                9  
2          16         0.044944                7  
3          10         0.000000            
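
The `apply`-based length and uppercase features above can also be written with pandas string methods, which tends to be faster on 1.6 million rows. This is a sketch on toy data, not a drop-in replacement: the `[A-Z]` pattern counts only ASCII capitals, whereas `str.isupper` also accepts accented ones.

```python
import pandas as pd

toy = pd.DataFrame({"text": ["Hello World", "all lower", ""]})

# Character length of each raw tweet
toy["tweet_length"] = toy["text"].str.len()

# Share of ASCII uppercase characters; an empty string gives 0/0 = NaN, set to 0
toy["uppercase_ratio"] = (
    toy["text"].str.count(r"[A-Z]") / toy["tweet_length"]
).fillna(0.0)

print(toy)
```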

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 11 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   polarity         1600000 non-null  int64  
 1   id               1600000 non-null  int64  
 2   date             1600000 non-null  object 
 3   query            1600000 non-null  object 
 4   user             1600000 non-null  object 
 5   text             1600000 non-null  object 
 6   clean_text       1600000 non-null  object 
 7   tweet_length     1600000 non-null  int64  
 8   word_count       1600000 non-null  int64  
 9   uppercase_ratio  1600000 non-null  float64
 10  stopwords_count  1600000 non-null  int64  
dtypes: float64(1), int64(5), object(5)
memory usage: 134.3+ MB


In [None]:
# Review the results
print(df[['text', 'clean_text', 'tweet_length', 'word_count', 'uppercase_ratio', 'stopwords_count']].head(3))

                                                text  \
0  @switchfoot http://twitpic.com/2y1zl - Awww, t...   
1  is upset that he can't update his Facebook by ...   
2  @Kenichan I dived many times for the ball. Man...   

                                          clean_text  tweet_length  \
0  a thats a bummer you shoulda got david carr of...           115   
1  is upset that he cant update his facebook by t...           111   
2  i dived many times for the ball managed to sav...            89   

   word_count  uppercase_ratio  stopwords_count  
0          16         0.060870                8  
1          21         0.027027                9  
2          16         0.044944                7  


In [None]:
df

Unnamed: 0,polarity,id,date,query,user,text,clean_text,tweet_length,word_count,uppercase_ratio,stopwords_count
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",a thats a bummer you shoulda got david carr of...,115,16,0.060870,8
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...,is upset that he cant update his facebook by t...,111,21,0.027027,9
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...,i dived many times for the ball managed to sav...,89,16,0.044944,7
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,my whole body feels itchy and like its on fire,47,10,0.000000,4
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all....",no its not behaving at all im mad why am i her...,111,20,0.009009,15
...,...,...,...,...,...,...,...,...,...,...,...
1599995,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...,just woke up having no school is the best feel...,56,11,0.035714,6
1599996,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...,thewdbcom very cool to hear old walt interview...,78,9,0.076923,2
1599997,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...,are you ready for your mojo makeover ask me fo...,57,11,0.087719,6
1599998,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...,happy th birthday to my boo of alll time tupac...,65,12,0.076923,3


In [None]:
df.to_csv('cleaned_dataset.csv', index=False)

In [None]:
from google.colab import files
files.download('cleaned_dataset.csv')
