# Sentiment analysis with deep learning

##### by Rishabh Chauhan

## Introduction:

Sentiment analysis, also known as opinion mining, is a natural language processing (NLP) task that involves determining the sentiment expressed in a piece of text. This project focuses on sentiment analysis using deep learning techniques to classify text data into different sentiment categories, such as positive, negative, or neutral.

## Objective:

The primary goal of this notebook is to build a sentiment analysis model that can automatically classify the sentiment of textual data. The model uses a recurrent deep learning architecture, a stacked bidirectional LSTM (a type of Recurrent Neural Network, RNN), to capture complex patterns and relationships within the text.

## Dataset:

The sentiment analysis model is trained and evaluated on the Sentiment140 dataset, which contains 1.6 million tweets, each labeled with its sentiment. In the raw data the `target` column encodes the sentiment as 0 (negative), 2 (neutral), or 4 (positive).

Link to the dataset: https://www.kaggle.com/datasets/kazanova/sentiment140

Citation: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.

The libraries needed for this project are imported below.

In [1]:
import math
import re

import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
import tensorflow as tf
import keras
from keras.callbacks import EarlyStopping, LearningRateScheduler, ModelCheckpoint
from keras.layers import Bidirectional, Dense, Embedding, LSTM, SpatialDropout1D
from keras.models import Sequential, load_model
from keras.optimizers import RMSprop
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

Let's begin by loading the dataset. The CSV file has no header row, so we pass the column names to `read_csv` as a list.

In [None]:
cols=['target', 'ids', 'date', 'flag', 'user', 'sentences']
df=pd.read_csv("twitter.csv", encoding='ISO-8859-1',names=cols )
df.head()

For this notebook we will focus on only two sentiments, 'Negative' and 'Positive'. So let's keep the text and label columns and drop any rows labeled neutral (`target` equal to 2).

In [None]:
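data = df[['sentences', 'target']]
# .copy() gives an independent frame, so later column assignments do not hit a slice of df
data = data[data['target'] != 2].copy()
data.head()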

The code below sets up functions for common text preprocessing tasks: removing stop words, normalizing user mentions and URLs, and lemmatizing words.

These preprocessing steps are crucial for cleaning and transforming raw text data before using it for NLP tasks like sentiment analysis or text classification.

In [None]:
nltk.download('stopwords')
nltk.download('wordnet')  # WordNetLemmatizer needs the WordNet corpus
print(stopwords.words('english'))
stop_words = set(stopwords.words('english'))

def remove_stop_words(sentence):
    # Drop English stop words (case-sensitive match against NLTK's lowercase list)
    words = sentence.split()
    filtered_words = [word for word in words if word not in stop_words]
    return ' '.join(filtered_words)

def preprocess(text):
    # Replace user mentions with '@user' and links with 'http'
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # Lemmatize each whitespace-separated token (treated as a noun by default)
    return " ".join(lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text))

In [None]:
data['sentences']=data['sentences'].apply(preprocess)
data['sentences']=data['sentences'].apply(remove_stop_words)
data['sentences']=data['sentences'].apply(lemmatize_text)
data.head()
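
To make the effect of these steps concrete, here is a minimal, dependency-free sketch of the mention/URL normalization and stop-word filtering applied to one made-up tweet. The tiny `SAMPLE_STOP_WORDS` set, the helper names, and the example tweet are assumptions for illustration only; the notebook itself uses NLTK's full English stop-word list.

```python
# Illustrative stand-in for NLTK's English stop-word list (assumption).
SAMPLE_STOP_WORDS = {"i", "the", "is", "a", "to", "and", "my"}

def normalize_tokens(text):
    """Replace user mentions with '@user' and links with 'http'."""
    out = []
    for tok in text.split(" "):
        if tok.startswith("@") and len(tok) > 1:
            tok = "@user"
        elif tok.startswith("http"):
            tok = "http"
        out.append(tok)
    return " ".join(out)

def drop_stop_words(text, stop_words=SAMPLE_STOP_WORDS):
    """Remove stop words; the comparison is case-sensitive, as in the notebook."""
    return " ".join(w for w in text.split() if w not in stop_words)

tweet = "@alice the new phone is great http://example.com/review"
cleaned = drop_stop_words(normalize_tokens(tweet))
print(cleaned)  # @user new phone great http
```

Because the match is case-sensitive, a capitalized stop word such as "The" would survive this filter; lowercasing the text first is one possible way to avoid that.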

The 'Positive' sentiment is encoded as 4 in the dataset, so we remap it to 1 to obtain the usual 0/1 label format.

In [None]:
# Vectorized remap of the positive label (4 -> 1); far faster than a row-by-row loop
data['target'] = data['target'].replace(4, 1)

The code below sets up a tokenizer, fits it on the text data, and converts the text into sequences of integers. This sequence data is often used as input to train machine learning models, such as neural networks, for tasks like sentiment analysis or text classification. The choice of num_words limits the vocabulary size to the most frequent words, which can help manage computational resources and improve model efficiency.

In [None]:
max_word=5000
max_sequence_length = 200
tokenizer = Tokenizer(num_words=max_word, split=' ') 
tokenizer.fit_on_texts(data['sentences'].values)
X = tokenizer.texts_to_sequences(data['sentences'].values)
X= pad_sequences(X, maxlen=max_sequence_length)

In [None]:
print(X[0])
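
The following is a rough pure-Python sketch of what the tokenizer and padding steps do conceptually, assuming Keras defaults (lowercasing, indices assigned by descending word frequency starting at 1, and zero-padding on the left). The function names `fit_vocab`, `to_sequence`, and `pad_pre` are hypothetical, and punctuation filtering is left out for brevity; this is not the Keras implementation.

```python
from collections import Counter

def fit_vocab(texts):
    """Rank words by frequency; index 1 is the most frequent, 0 is reserved for padding."""
    counts = Counter(w for t in texts for w in t.lower().split())
    # sorted() is stable, so equally frequent words keep first-seen order.
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {w: i + 1 for i, w in enumerate(ranked)}

def to_sequence(text, vocab, num_words):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    return [vocab[w] for w in text.lower().split()
            if w in vocab and vocab[w] < num_words]

def pad_pre(seq, maxlen):
    """Keep the last maxlen items and left-pad with zeros."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

texts = ["good movie", "bad movie", "good good day"]
vocab = fit_vocab(texts)  # {'good': 1, 'movie': 2, 'bad': 3, 'day': 4}
padded = [pad_pre(to_sequence(t, vocab, num_words=4), maxlen=5) for t in texts]
print(padded)  # [[0, 0, 0, 1, 2], [0, 0, 0, 3, 2], [0, 0, 0, 1, 1]]
```

Note how 'day' (index 4) disappears when `num_words=4`: only indices below `num_words` are kept, which is why rare words are dropped from the sequences.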

With preprocessing complete, the next step is to prepare the data for the deep learning model: the labels are one-hot encoded and the data is split into 70% training and 30% test sets.

In [None]:
y = keras.utils.to_categorical(data['target'], 2, dtype="float32")
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 42)
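
As a small illustration of the label encoding, here is a dependency-free sketch of one-hot encoding and its inverse. The helpers `one_hot` and `argmax` are hypothetical names used only in this example; the notebook relies on `keras.utils.to_categorical` and NumPy's `argmax` instead.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows, e.g. 1 -> [0.0, 1.0]."""
    return [[1.0 if c == label else 0.0 for c in range(num_classes)]
            for label in labels]

def argmax(row):
    """Index of the largest value in a row (first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)

labels = [0, 1, 1]
encoded = one_hot(labels, num_classes=2)
print(encoded)                       # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print([argmax(r) for r in encoded])  # [0, 1, 1]
```

The same argmax step is what later converts the model's two softmax outputs back into a single predicted class.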

Let's quickly check the shapes of the training and test arrays after the split.

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

The code below defines a sequential neural network model for a binary classification task using Keras. It consists of an Embedding layer followed by two Bidirectional LSTM layers with dropout. The final layer is a Dense layer with two softmax outputs, one per class. The model is compiled using the Adam optimizer and categorical crossentropy loss, with accuracy as the evaluation metric.

In [None]:
# Model
model = Sequential()
model.add(Embedding(max_word, 100, input_length=max_sequence_length))
model.add(Bidirectional(LSTM(128, dropout=0.5, return_sequences=True)))
model.add(Bidirectional(LSTM(64, dropout=0.5)))
model.add(Dense(2, activation='softmax'))
# Compile Model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
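
To get a feel for the size of this network, the sketch below estimates the number of trainable parameters per layer. It assumes standard LSTM cells with four gates and bias terms, and that each Bidirectional wrapper doubles both the parameter count and the output width; the helper functions are hypothetical, and the totals are worth comparing against `model.summary()`.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

emb = embedding_params(5000, 100)   # 500000
bi1 = 2 * lstm_params(100, 128)     # 234496 (forward + backward)
bi2 = 2 * lstm_params(2 * 128, 64)  # 164352 (input is the 256-wide output of bi1)
out = dense_params(2 * 64, 2)       # 258
total = emb + bi1 + bi2 + out
print(total)  # 899106
```

More than half of the parameters sit in the Embedding layer, so `max_word` and the embedding dimension are the main levers on model size under these assumptions.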

This code sets up two Keras callbacks for training. ModelCheckpoint saves the best model according to validation loss, and EarlyStopping halts training if the validation loss does not improve for a given number of epochs. The model is then trained for up to 5 epochs. Note that the test split (`X_test`, `y_test`) is also passed as the validation data, so checkpointing and early stopping are decided on the same data used for the final evaluation, which can make the reported test metrics slightly optimistic.

Note: The specific parameters of the callbacks, such as the file name for saving the best model ("best_model.hdf5"), monitoring metric ('val_loss'), and patience for early stopping (3), can be adjusted based on your preferences and the characteristics of your training process.

In [None]:
# Callbacks
checkpoint = ModelCheckpoint("best_model.hdf5", monitor='val_loss', verbose=1, save_best_only=True, mode='auto', save_weights_only=False)
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
# Training
history = model.fit(X_train, y_train, epochs=5, validation_data=(X_test, y_test), callbacks=[checkpoint, early_stopping])
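
The patience logic of early stopping can be sketched in a few lines of plain Python. This is a simplified model of the behaviour, assuming a monitored loss where lower is better and no minimum improvement threshold; `early_stopping_epoch` is a hypothetical helper and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (epoch at which training stops, best epoch), both 1-indexed."""
    best = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember it and reset the patience counter
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran through every epoch without exhausting patience
    return len(val_losses), best_epoch

print(early_stopping_epoch([0.50, 0.60, 0.70, 0.80, 0.40], patience=3))  # (4, 1)
print(early_stopping_epoch([0.50, 0.45, 0.44, 0.46, 0.47], patience=3))  # (5, 3)
```

With only 5 epochs and a patience of 3, training can be cut short only when the best validation loss occurs in the first epoch; otherwise the run simply ends at epoch 5.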

After training is complete, we evaluate the model on the test set and generate a classification report to analyse its performance.

In [None]:
y_pred = model.predict(X_test)
# Convert one-hot labels and softmax outputs to class indices before scoring
report = classification_report(y_test.argmax(axis=1), y_pred.argmax(axis=1))
print("\nClassification Report:")
print(report)
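
For readers unfamiliar with the columns of the classification report, the sketch below computes precision, recall, and F1 for one class by hand. The function `binary_report` and the label lists are made up for illustration and are not results from this model.

```python
def binary_report(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up labels: 3 true positives, 2 false positives, 1 false negative
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 1, 0, 0]
precision, recall, f1 = binary_report(y_true_demo, y_pred_demo)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.6 0.75 0.667
```

The report printed by scikit-learn lists these three values for each class, along with the number of true examples per class (the support).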

In [None]:
model.save("SentModel.h5")