<a href="https://colab.research.google.com/github/GeorgeU2030/analysis-rnn-lstm/blob/main/code-implementation/RNN_LSTM_TI2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment analysis - RNN and LSTM

Members

- Luis Botero
- Juan Medina
- George Trujillo

1. Gather the dataset for sentiment analysis from the UCI Machine Learning Repository, specifically
the Sentiment Labelled Sentences Data Set.

In [3]:
# Download the dataset archive from the UCI repository
!wget https://archive.ics.uci.edu/static/public/331/sentiment+labelled+sentences.zip

# Unzip the obtained file
!unzip sentiment+labelled+sentences.zip

--2023-11-21 18:15:53--  https://archive.ics.uci.edu/static/public/331/sentiment+labelled+sentences.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘sentiment+labelled+sentences.zip’

sentiment+labelled+     [ <=>                ]  82.21K   488KB/s    in 0.2s    

2023-11-21 18:15:54 (488 KB/s) - ‘sentiment+labelled+sentences.zip’ saved [84188]

Archive:  sentiment+labelled+sentences.zip
   creating: sentiment labelled sentences/
  inflating: sentiment labelled sentences/.DS_Store  
   creating: __MACOSX/
   creating: __MACOSX/sentiment labelled sentences/
  inflating: __MACOSX/sentiment labelled sentences/._.DS_Store  
  inflating: sentiment labelled sentences/amazon_cells_labelled.txt  
  inflating: sentiment labelled sentences/imdb_labelled.txt  
  inflating: __MACOSX/sentiment labelled sente

2. Preprocess the text data, including tokenization, lowercasing, and removing stopwords. Prepare the
data for supervised learning (use NLTK).
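
The three preprocessing steps named here can be sketched without any downloads. The snippet below is a minimal illustration, not the notebook's implementation: it uses a regular expression instead of an NLTK tokenizer, and `SAMPLE_STOPWORDS` is a tiny hypothetical stand-in for NLTK's English stopword list.

```python
import re

# Hypothetical, deliberately tiny stopword set; the notebook uses NLTK's English list.
SAMPLE_STOPWORDS = {"the", "is", "a", "and", "this", "it", "very"}

def preprocess(sentence, stop_words=SAMPLE_STOPWORDS):
    """Lowercase, tokenize on runs of letters/digits, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return [tok for tok in tokens if tok not in stop_words]

print(preprocess("Good case, Excellent value."))
print(preprocess("The mic is great!"))
```

Because the regular expression splits on punctuation, words such as "case," and "value." survive as `case` and `value`, whereas a whitespace split followed by an `isalnum()` filter would discard them.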

In [4]:
import pandas as pd
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# Load the labelled sentences from each source (Amazon, IMDb, Yelp)
# amazon
df_amazon = pd.read_csv('sentiment labelled sentences/amazon_cells_labelled.txt', sep='\t', header=None)
df_amazon.columns = ['sentence', 'label']

# imdb
df_imdb = pd.read_csv('sentiment labelled sentences/imdb_labelled.txt', sep='\t', header=None)
df_imdb.columns = ['sentence', 'label']

# yelp
df_yelp = pd.read_csv('sentiment labelled sentences/yelp_labelled.txt', sep='\t', header=None)
df_yelp.columns = ['sentence', 'label']

# Concatenate the dataframes and reset the index to avoid duplicate index values
df = pd.concat([df_amazon, df_imdb, df_yelp], ignore_index=True)

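# Split each sentence on whitespace, keep only purely alphanumeric tokens
# (so words with attached punctuation, e.g. "value.", are dropped),
# lowercase them, and remove English stopwords
stop_words = set(stopwords.words('english'))

def preprocess(sentence):
    words = (word.lower() for word in sentence.split() if word.isalnum())
    return ' '.join(word for word in words if word not in stop_words)

df['sentence'] = df['sentence'].apply(preprocess)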

print(df['sentence'])


0                                   way plug us unless go
1                                          good excellent
2                                                   great
3                   tied charger conversations lasting 45
4                                                     mic
                              ...                        
2743                            think food flavor texture
2744                                   appetite instantly
2745                           overall impressed would go
2746           whole experience think go ninja sushi next
2747    wasted enough life poured salt wound drawing t...
Name: sentence, Length: 2748, dtype: object


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
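
The printed length (2748) is below the 3000 sentences the dataset is described as containing, which is consistent with quote characters in the raw text causing several lines to be parsed as one field. The sketch below reproduces that effect with Python's standard `csv` module on made-up sample text (not taken from the dataset). pandas' `read_csv` accepts the same `csv.QUOTE_*` constants through its `quoting` parameter, so `quoting=csv.QUOTE_NONE` is a plausible remedy, though that is an assumption that has not been verified against these files.

```python
import csv
import io

# Made-up sample in the same "sentence<TAB>label" layout; the first sentence opens
# with a double quote that is only closed two lines later.
raw = '"Great phone\t1\nBad battery\t0\nLoved it" overall\t1\n'

def count_rows(text, quoting):
    """Count the records a tab-separated reader yields under a given quoting mode."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting)
    return sum(1 for _ in reader)

rows_default = count_rows(raw, csv.QUOTE_MINIMAL)  # default: a quoted field may span lines
rows_literal = count_rows(raw, csv.QUOTE_NONE)     # quotes are treated as ordinary characters

print(rows_default, rows_literal)
```

Comparing the row count of each loaded dataframe with the line count of its source file is a quick way to check whether this is what happened here.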


In [5]:
# Prepare the data for supervised learning
# Split the data into training (70%) and test (30%) sets

X_train, X_test, y_train, y_test = train_test_split(df['sentence'], df['label'], test_size=0.3, random_state=123)

# Convert the text into numerical features (bag-of-words token counts)
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Encode the class labels as integers

label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)
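
To make the vectorization step concrete, the following pure-Python sketch approximates what a bag-of-words encoding produces: a vocabulary learned from the training sentences only, and one count vector per sentence. It mirrors `CountVectorizer`'s default behaviour only loosely (lowercasing, tokens of at least two word characters, alphabetically ordered vocabulary); the helper names `fit_vocabulary` and `transform` are hypothetical, and the output is a dense list rather than a sparse matrix.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters

def tokenize(sentence):
    return TOKEN_PATTERN.findall(sentence.lower())

def fit_vocabulary(sentences):
    """Map each distinct training token to a column index (alphabetical order)."""
    tokens = sorted({tok for s in sentences for tok in tokenize(s)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(sentences, vocabulary):
    """Return one count vector per sentence; tokens unseen in training are ignored."""
    rows = []
    for s in sentences:
        counts = Counter(tok for tok in tokenize(s) if tok in vocabulary)
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

# Toy sentences, not taken from the dataset
train = ["good excellent", "good good sound"]
test = ["excellent mic sound"]

vocab = fit_vocabulary(train)
print(vocab)
print(transform(train, vocab))
print(transform(test, vocab))
```

Fitting on the training split and only transforming the test split keeps test-only words such as `mic` out of the vocabulary, so no information from the test set leaks into the features.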


3. Implement a DummyClassifier. Train the model on the preprocessed data and evaluate its
performance in terms of accuracy, precision, recall, and F1-score. Use these results as the baseline.
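
As a rough illustration of what a majority-class baseline yields, the sketch below computes the four metrics by hand for a predictor that always outputs the most frequent training label. The labels are made-up toy values, `majority_baseline_metrics` is a hypothetical helper, and precision is set to 0 when nothing is predicted positive, which is one common convention.

```python
from collections import Counter

def majority_baseline_metrics(y_train, y_test, positive=1):
    """Metrics for a predictor that always outputs the most frequent training label."""
    majority = Counter(y_train).most_common(1)[0][0]
    y_pred = [majority] * len(y_test)

    tp = sum(1 for t, p in zip(y_test, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_test, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_test, y_pred) if t == positive and p != positive)
    correct = sum(1 for t, p in zip(y_test, y_pred) if t == p)

    accuracy = correct / len(y_test)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels: 1 = positive, 0 = negative
metrics = majority_baseline_metrics(y_train=[1, 1, 1, 0, 0], y_test=[1, 0, 1, 0])
print(metrics)
```

If the two classes are roughly balanced, a baseline of this kind sits near 50% accuracy, which gives a floor that any trained model should clear.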

In [6]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Create the dummy classifier, which always predicts the most frequent training class
dummy_classifier = DummyClassifier(strategy="most_frequent")

# Train the model on the preprocessed data
dummy_classifier.fit(X_train_vectorized, y_train_encoded)