# **Sentiment Analysis Model - TI2**
The goal of this project is to build a sentiment analysis model using supervised learning with vanilla Recurrent Neural Networks and LSTM.


## **Preprocess:**

Necessary imports:

In [1]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import pandas as pd
# Necessary resources
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to C:\Users\Cristian
[nltk_data]     Perafan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Cristian
[nltk_data]     Perafan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

English stopwords: 

In [2]:
stop_words = set(stopwords.words('english'))

Read each dataset into a pandas DataFrame in which every row is one sentence, with two columns: `sentence` (the text itself) and `tag` (the label: 0 for negative, 1 for positive).

In [3]:
amazon_df = pd.read_csv('./sentiment labelled sentences/amazon_cells_labelled.txt', sep='\t',names=['sentence', 'tag'])

imdb_df = pd.read_csv('./sentiment labelled sentences/imdb_labelled.txt', sep='\t',names=['sentence', 'tag'])

yelp_df = pd.read_csv('./sentiment labelled sentences/yelp_labelled.txt', sep='\t',names=['sentence', 'tag'])

Tokenize the text and remove stop words using NLTK:

- *word.isalnum()* keeps only tokens made up entirely of letters and digits, so punctuation marks and any token containing a special character (such as an apostrophe or hyphen) are excluded.
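
To see what the two filter conditions do, here is a minimal sketch on a hand-written token list. The token list and the tiny `demo_stop_words` set are illustrative assumptions standing in for the output of `word_tokenize` and for NLTK's English stop-word list, so the snippet runs without downloading any NLTK resources.

```python
# Hypothetical tokens, roughly what a tokenizer might produce for
# "The phone doesn't work well on Wi-Fi, but 4G is great!"
tokens = ["The", "phone", "does", "n't", "work", "well", "on",
          "Wi-Fi", ",", "but", "4G", "is", "great", "!"]

# Tiny stand-in for NLTK's English stop-word list (assumption for this demo)
demo_stop_words = {"the", "does", "on", "but", "is"}

# Same filter as in the notebook: alphanumeric tokens that are not stop words
filtered = [
    word.lower()
    for word in tokens
    if word.isalnum() and word.lower() not in demo_stop_words
]
print(filtered)
```

Here `n't` and `Wi-Fi` are dropped along with the punctuation, because `str.isalnum()` is false as soon as one character is not a letter or digit.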

In [4]:
def clean_tokens(sentence):
    # Lowercase each token; keep alphanumeric tokens that are not stop words
    return [word.lower() for word in word_tokenize(sentence) if word.isalnum() and word.lower() not in stop_words]

# Tokenize and delete stop words from the Amazon, IMDB and Yelp sentences
for df in (amazon_df, imdb_df, yelp_df):
    df['tokens'] = df['sentence'].apply(clean_tokens)


combined_sentiments_df = pd.concat([amazon_df, imdb_df, yelp_df], ignore_index=True)


Split data into training and test sets:

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(combined_sentiments_df['tokens'], combined_sentiments_df['tag'], test_size=0.3, random_state=42)


## **DummyClassifier**

Necessary imports:

In [8]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score


Build the DummyClassifier model:

In [9]:
clf_dummy = DummyClassifier(random_state=43,strategy='prior')
clf_dummy.fit(X_train, Y_train)

y_pred = clf_dummy.predict(X_test)
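
With `strategy='prior'`, `predict` ignores the input features and always returns the most frequent class seen in the training labels. The sketch below mimics that behaviour in plain Python; `fit_prior` and `predict_prior` are hypothetical helper names used for illustration, not scikit-learn functions, and the labels are made up.

```python
from collections import Counter

def fit_prior(labels):
    """Return the most frequent label and the empirical class priors."""
    counts = Counter(labels)
    total = len(labels)
    priors = {label: n / total for label, n in counts.items()}
    majority = max(counts, key=counts.get)
    return majority, priors

def predict_prior(majority, n_samples):
    """Predict the majority label for every sample, whatever its features."""
    return [majority] * n_samples

toy_train_labels = [1, 0, 1, 1, 0]  # made-up labels, not the project's data
majority, priors = fit_prior(toy_train_labels)
toy_predictions = predict_prior(majority, n_samples=3)
print(majority, priors, toy_predictions)
```

Because every prediction is the same class, such a baseline is useful mainly as a floor that the RNN and LSTM models should beat.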

**Model performance**

*Accuracy*: the fraction of predictions the model got right.

In [10]:
accuracy_score(Y_test, y_pred)

0.4727272727272727

*Precision*: the fraction of positive predictions that are correct. With `average=None`, the score is reported separately for each class (class 0 first, then class 1).

In [130]:
precision_score(Y_test, y_pred, average=None)


array([0.        , 0.47272727])

*Recall*: the fraction of the truly positive instances that the classifier recognizes.

In [131]:
recall_score(Y_test, y_pred, average=None)

array([0., 1.])

*F1-score*: the harmonic mean of precision and recall.

In [135]:
f1_score(Y_test, y_pred, average=None)

array([0.        , 0.64197531])
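
The scores above are what one would expect from a classifier that always predicts class 1. As a worked check, the sketch below computes the four metrics from confusion-matrix counts. The counts (26 positives and 29 negatives) are an assumption, chosen only because the ratio 26/55 matches the reported accuracy; the actual test-set counts are not shown in the notebook.

```python
from fractions import Fraction

# Assumed counts for class 1 when every sample is predicted positive
tp, fp, fn, tn = 26, 29, 0, 0

accuracy = Fraction(tp + tn, tp + fp + fn + tn)
precision = Fraction(tp, tp + fp)   # correct positives / predicted positives
recall = Fraction(tp, tp + fn)      # correct positives / actual positives
f1 = 2 * precision * recall / (precision + recall)

print(float(accuracy), float(precision), float(recall), float(f1))
```

For class 0 the classifier makes no predictions at all, so its precision is 0/0; scikit-learn reports 0 in that case and emits a warning.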

## **RNN Model**

Necessary imports:

In [11]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers




Building the RNN model and adding layers:

In [12]:
model = keras.Sequential()
model.add(layers.SimpleRNN(64, input_shape=(None, 28)))
model.add(layers.BatchNormalization())
model.add(layers.Dense(10))
# summary() prints the table itself and returns None, so no print() is needed
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn (SimpleRNN)      (None, 64)                5952      
                                                                 
 batch_normalization (Batch  (None, 64)                256       
 Normalization)                                                  
                                                                 
 dense (Dense)               (None, 10)                650       
                                                                 
Total params: 6858 (26.79 KB)
Trainable params: 6730 (26.29 KB)
Non-trainable params: 128 (512.00 Byte)
_________________________________________________________________
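
The parameter counts in the summary can be reproduced by hand. The sketch below applies the usual per-layer formulas, using the sizes from the model above (28 input features, 64 recurrent units, 10 output units).

```python
input_dim, units, outputs = 28, 64, 10

# SimpleRNN: input weights + recurrent weights + bias
rnn_params = units * (input_dim + units + 1)

# BatchNormalization: gamma and beta are trainable;
# moving mean and moving variance are not
bn_trainable = 2 * units
bn_non_trainable = 2 * units

# Dense: weight matrix + bias
dense_params = units * outputs + outputs

total = rnn_params + bn_trainable + bn_non_trainable + dense_params
trainable = total - bn_non_trainable
print(rnn_params, bn_trainable + bn_non_trainable, dense_params, total, trainable)
```

The 128 non-trainable parameters are the batch-normalization moving statistics, which are updated during training but not by gradient descent.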


Compiling the model:

In [13]:
model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer="sgd",
    metrics=["accuracy"],
)
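
`from_logits=True` tells the loss that the final `Dense` layer outputs raw scores rather than probabilities, so the softmax is applied inside the loss. A minimal plain-Python sketch of that computation for a single sample follows; `sparse_ce_from_logits` is a hypothetical helper written for illustration, not a Keras function.

```python
import math

def sparse_ce_from_logits(logits, label):
    """Cross-entropy of one sample: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Ten equal logits give a uniform distribution, so the loss is log(10)
uniform_loss = sparse_ce_from_logits([0.0] * 10, label=1)
# A large logit on the correct class gives a loss close to 0
confident_loss = sparse_ce_from_logits([0.0, 8.0] + [0.0] * 8, label=1)
print(uniform_loss, confident_loss)
```

"Sparse" here means the labels are integer class indices (such as 0 or 1) rather than one-hot vectors.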





Training the model: