# Sentiment analysis with deep learning

##### by Rishabh Chauhan

## Introduction:

Sentiment analysis, also known as opinion mining, is a natural language processing (NLP) task that involves determining the sentiment expressed in a piece of text. This project focuses on sentiment analysis using deep learning techniques to classify text data into different sentiment categories, such as positive, negative, or neutral.

## Objective:

The primary goal of this notebook is to build a sentiment analysis model that can automatically classify the sentiment of textual data. The model uses a recurrent deep learning architecture, a stack of bidirectional LSTM layers on top of a word embedding, to capture patterns and relationships within the text.

## Dataset:

The sentiment analysis model is trained and evaluated on a dataset of tweets, each annotated with a sentiment label: negative (-1), neutral (0), or positive (1).

Link to the dataset: https://www.kaggle.com/datasets/saurabhshahane/twitter-sentiment-dataset

Acknowledgement: HUSSEIN, SHERIF (2021), “Twitter Sentiments Dataset”, Mendeley Data, V1, doi: 10.17632/z9zw7nt5h2.1

Below are all the libraries that you will need for this project.

In [1]:
import math
import re

import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
import keras
from keras.callbacks import EarlyStopping, LearningRateScheduler, ModelCheckpoint
from keras.layers import Bidirectional, Dense, Embedding, LSTM, SpatialDropout1D
from keras.models import Sequential, load_model
from keras.optimizers import RMSprop
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

Let's begin by loading the dataset. The CSV file already has a header row, so pandas picks up the column names (clean_text and category) on its own.

In [37]:
data=pd.read_csv("Twitter_Data.csv")
data.head()

Unnamed: 0,clean_text,category
0,when modi promised “minimum government maximum...,-1.0
1,talk all the nonsense and continue all the dra...,0.0
2,what did just say vote for modi welcome bjp t...,1.0
3,asking his supporters prefix chowkidar their n...,1.0
4,answer who among these the most powerful world...,1.0


In [38]:
data.rename(columns={"clean_text": "sentences", "category":"target"}, inplace=True)

The step above is not necessary; it's just for my convenience.

For this task we will work with 3 sentiments: 'Negative' (-1), 'Neutral' (0) and 'Positive' (1). Let's check how many examples each sentiment has.

In [40]:
data['target'].value_counts()

target
 1.0    72250
 0.0    55213
-1.0    35510
Name: count, dtype: int64
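
The counts above are easier to compare as percentages. Here is a minimal sketch in plain Python, with the counts copied by hand from the output above; the variable names are only illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {1.0: 72250, 0.0: 55213, -1.0: 35510}

total = sum(counts.values())

# Share of each class as a percentage of all labeled rows.
shares = {label: 100 * n / total for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label:>4}: {pct:.1f}%")
```

Positive tweets are roughly twice as frequent as negative ones, so the per-class precision and recall in the classification report are worth reading alongside the overall accuracy.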

In summary, the code below sets up functions for common text preprocessing tasks, such as removing stopwords, handling user mentions and URLs, and lemmatizing words.

These preprocessing steps are crucial for cleaning and transforming raw text data before using it for NLP tasks like sentiment analysis or text classification.

In [43]:
nltk.download('stopwords')
nltk.download('wordnet')  # required by WordNetLemmatizer below
print(stopwords.words('english'))
stop_words = set(stopwords.words('english'))

def remove_stop_words(sentence):
    # Drop every word that appears in the English stop-word list
    if isinstance(sentence, str):
        words = sentence.split()
        filtered_words = [word for word in words if word not in stop_words]
        return ' '.join(filtered_words)
    else:
        return ""

def preprocess(text):
    if isinstance(text, str):
        new_text = []
        for t in text.split(" "):
            t = '@user' if t.startswith('@') and len(t) > 1 else t
            t = 'http' if t.startswith('http') else t
            new_text.append(t)
        return " ".join(new_text)
    else:
        return ""

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()
def lemmatize_text(text):
    if isinstance(text, str):
        st = ""
        for w in w_tokenizer.tokenize(text):
            st = st + lemmatizer.lemmatize(w) + " "
        return st.strip()
    else:
        return ""


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:997)>
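
To see what the first two steps do to a single tweet, here is a small self-contained sketch. It mirrors the mention/URL replacement and the stop-word filter from the cell above, but it uses a tiny hand-picked stop-word set and a made-up sample sentence (both are assumptions for illustration), and it leaves out lemmatization because that needs the WordNet data.

```python
# A tiny illustrative stop-word set; the notebook uses NLTK's full English list.
STOP_WORDS = {"the", "is", "a", "to", "and", "this"}

def normalize_tokens(text):
    """Replace user mentions with '@user' and links with 'http'."""
    out = []
    for tok in text.split(" "):
        if tok.startswith("@") and len(tok) > 1:
            tok = "@user"
        elif tok.startswith("http"):
            tok = "http"
        out.append(tok)
    return " ".join(out)

def drop_stop_words(text):
    """Remove every token that appears in STOP_WORDS."""
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

# Made-up sample tweet, not taken from the dataset.
sample = "@someone this is the link to read https://example.com/news"
step1 = normalize_tokens(sample)
step2 = drop_stop_words(step1)
print(step1)
print(step2)
```

After both steps only the content-bearing tokens and the two placeholders remain.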


In [44]:
data['sentences']=data['sentences'].apply(preprocess)
data['sentences']=data['sentences'].apply(remove_stop_words)
data['sentences']=data['sentences'].apply(lemmatize_text)
data.head()

Unnamed: 0,sentences,target
0,modi promised “minimum government maximum gove...,-1.0
1,talk nonsense continue drama vote modi,0.0
2,say vote modi welcome bjp told rahul main camp...,1.0
3,asking supporter prefix chowkidar name modi gr...,1.0
4,answer among powerful world leader today trump...,1.0


We also calculate the average length of the sentences in the dataset, measured in characters (this will be of help later!).

In [57]:
def Cavg(text):
    # Average number of characters per sentence
    lengths = [len(i) for i in text]
    return np.average(lengths)

avg_len = Cavg(data['sentences'])
print(avg_len)

96.27799116456006
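
Note that `len()` on a string counts characters, while the padding length used later counts tokens. The sketch below compares the two measures on two short sentences; the first is taken from the preprocessed output above and the second is made up for illustration.

```python
sentences = [
    "talk nonsense continue drama vote modi",  # from the preprocessed data above
    "say vote modi welcome bjp",               # made-up example
]

# Length in characters (what len() on a string measures).
char_lengths = [len(s) for s in sentences]

# Length in whitespace-separated tokens (what the padded sequences hold).
token_lengths = [len(s.split()) for s in sentences]

avg_chars = sum(char_lengths) / len(char_lengths)
avg_tokens = sum(token_lengths) / len(token_lengths)
print(avg_chars, avg_tokens)
```

Since a sentence never has more tokens than characters, an average of about 96 characters means a sequence length of 100 tokens leaves plenty of room for a typical sentence.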


The code below sets up a tokenizer, fits it on the text data, and converts the text into sequences of integers. This sequence data is often used as input to train machine learning models, such as neural networks, for tasks like sentiment analysis or text classification. The choice of num_words limits the vocabulary size to the most frequent words, which can help manage computational resources and improve model efficiency.

In [58]:
max_word = 5000  # the maximum number of words to keep, based on word frequency
max_sequence_length = 100  # tokens per sequence; the average sentence is about 96 characters, so this cap is generous
tokenizer = Tokenizer(num_words=max_word, split=' ')
tokenizer.fit_on_texts(data['sentences'].values)
X = tokenizer.texts_to_sequences(data['sentences'].values)
X = pad_sequences(X, maxlen=max_sequence_length)

In [46]:
print(X[0])

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    1  247   27 1579 1016 1251 1092   50  105   41   14   25  970
  105 
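
The leading zeros in the output come from padding. As a rough illustration of what the tokenizer and `pad_sequences` do, here is a simplified pure-Python sketch: it ranks words by frequency, keeps only indices below `num_words`, and pads on the left. The helper names are hypothetical and the real Keras classes handle more cases (punctuation filtering, out-of-vocabulary tokens, and so on).

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(w for t in texts for w in t.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {w: i + 1 for i, (w, _) in enumerate(ranked)}

def texts_to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping words whose index is not below num_words."""
    return [
        [word_index[w] for w in t.lower().split()
         if w in word_index and word_index[w] < num_words]
        for t in texts
    ]

def pad_left(seq, maxlen):
    """Left-pad with zeros, or keep only the last maxlen items if too long."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

texts = ["vote modi", "modi welcome bjp vote modi"]
word_index = build_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=4)
padded = [pad_left(s, maxlen=3) for s in sequences]
print(word_index)
print(padded)
```

Index 0 is reserved for padding, which is why the word indices start at 1.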

After completing the preprocessing of the dataset, the next step is to prepare the data for deep learning modelling.

In [59]:
y = keras.utils.to_categorical(data['target'], 3, dtype="float32")
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 42)
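
One detail worth spelling out: the labels are -1, 0 and 1, but one-hot columns are indexed 0, 1 and 2. As far as I can tell, `to_categorical` fills each row by integer indexing, so a label of -1 ends up in the last column. The sketch below reproduces that behaviour with plain Python lists and shows an explicit mapping as an alternative; `one_hot` and `LABEL_TO_INDEX` are hypothetical names, not part of Keras.

```python
def one_hot(label, num_classes):
    """Build a one-hot row by indexing; a negative label counts from the end."""
    row = [0.0] * num_classes
    row[label] = 1.0
    return row

# Explicit mapping that gives the same columns without relying on wrap-around.
LABEL_TO_INDEX = {0: 0, 1: 1, -1: 2}

for label in (-1, 0, 1):
    implicit = one_hot(label, 3)
    explicit = one_hot(LABEL_TO_INDEX[label], 3)
    print(label, implicit, explicit)
```

This is why column 0 stands for neutral, column 1 for positive and column 2 for negative in everything that follows.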



Let's quickly take a look at the shape of the data after splitting it for training and testing.

In [60]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(114086, 100)
(114086, 3)
(48894, 100)
(48894, 3)


The code given below defines a sequential neural network model for a three-class classification task using Keras. It consists of an Embedding layer followed by two Bidirectional LSTM layers with dropout. The final layer is a Dense layer with 3 units and softmax activation, one unit per sentiment. The model is compiled using the Adam optimizer and categorical crossentropy loss, with accuracy as the evaluation metric.

In [61]:
# Model
model = Sequential()
model.add(Embedding(max_word, 100, input_length=max_sequence_length))
model.add(Bidirectional(LSTM(128, dropout=0.5, return_sequences=True)))
model.add(Bidirectional(LSTM(64, dropout=0.5)))
model.add(Dense(3, activation='softmax'))
# Compile Model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
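
To get a feel for the size of this model, the number of trainable parameters can be worked out by hand. The sketch below uses the standard formulas for Embedding, LSTM and Dense layers (with biases) and the layer sizes from the cell above; the helper names are hypothetical, and `model.summary()` is the authoritative source.

```python
def embedding_params(vocab_size, dim):
    # One vector of size `dim` per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def bilstm_params(input_dim, units):
    # A bidirectional wrapper holds one LSTM per direction.
    return 2 * lstm_params(input_dim, units)

def dense_params(input_dim, units):
    return input_dim * units + units

emb = embedding_params(5000, 100)   # Embedding(max_word, 100)
bi1 = bilstm_params(100, 128)       # input is the 100-dim embedding
bi2 = bilstm_params(2 * 128, 64)    # input is 128 units from each direction
out = dense_params(2 * 64, 3)       # input is 64 units from each direction
total = emb + bi1 + bi2 + out
print(emb, bi1, bi2, out, total)
```

More than half of the parameters sit in the embedding layer, so `max_word` and the embedding size are the main levers on model size.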

The code below sets up two Keras callbacks: ModelCheckpoint saves the model whenever the validation loss improves, and EarlyStopping halts training if the validation loss has not improved for 3 consecutive epochs and restores the best weights. The model is then trained for up to 5 epochs; more epochs mean more passes over the training data and a longer training time. Note that the test split is also used as the validation data here.

In [63]:
# Callbacks
checkpoint = ModelCheckpoint("best_model.hdf5", monitor='val_loss', verbose=1, save_best_only=True, mode='auto', save_weights_only=False)  # saves once per epoch by default
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
# Training
history = model.fit(X_train, y_train, epochs=5, validation_data=(X_test, y_test), callbacks=[checkpoint, early_stopping])

Epoch 1/5
Epoch 1: val_loss improved from inf to 0.35320, saving model to best_model.hdf5
Epoch 2/5
Epoch 2: val_loss improved from 0.35320 to 0.34191, saving model to best_model.hdf5
Epoch 3/5
Epoch 3: val_loss did not improve from 0.34191
Epoch 4/5
Epoch 4: val_loss did not improve from 0.34191
Epoch 5/5
Epoch 5: val_loss did not improve from 0.34191
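
The early-stopping rule is easy to trace by hand. Below is a minimal sketch of the patience logic, assuming the simple "stop after `patience` epochs without improvement" behaviour; it is not the Keras implementation. The first two validation losses come from the log above, while the last three are made-up values that merely fail to improve, since the log does not show them.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (epoch at which training stops, epoch with the best loss).

    Epochs are 1-based. The first value is None if the stopping
    condition is never met.
    """
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Epochs 1-2 from the training log; epochs 3-5 are hypothetical values.
losses = [0.35320, 0.34191, 0.35, 0.36, 0.37]
stopped, best_epoch = early_stopping_epoch(losses, patience=3)
print(stopped, best_epoch)
```

With the best loss at epoch 2 and a patience of 3, the stopping condition is met at epoch 5, which here coincides with the epoch limit.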


After training is complete, we test the model on the test set and generate a classification report to analyse its performance.

In [64]:
y_pred=model.predict(X_test)
report = classification_report(y_test.argmax(axis=1), np.around(y_pred, decimals=0).argmax(axis=1))
print("\nClassification Report:")
print(report)


Classification Report:
              precision    recall  f1-score   support

           0       0.85      0.97      0.91     16551
           1       0.93      0.88      0.91     21626
           2       0.88      0.79      0.83     10717

    accuracy                           0.89     48894
   macro avg       0.89      0.88      0.88     48894
weighted avg       0.89      0.89      0.89     48894
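
A note on how the predicted class is picked: the evaluation cell rounds the probabilities before taking the argmax. When no class reaches 0.5, every value rounds to 0 and the argmax falls back to index 0, which may push some uncertain predictions into class 0. The sketch below shows the difference on one made-up probability row, using plain Python instead of NumPy; the helper names are hypothetical.

```python
def argmax(values):
    """Index of the largest value; the first index wins on ties."""
    return max(range(len(values)), key=lambda i: values[i])

def argmax_after_rounding(probs):
    """Round each probability to 0 or 1 first, then take the argmax."""
    return argmax([round(p) for p in probs])

uncertain = [0.20, 0.45, 0.35]   # made-up softmax output, no class above 0.5
confident = [0.10, 0.80, 0.10]   # made-up softmax output with a clear winner

print(argmax(uncertain), argmax_after_rounding(uncertain))
print(argmax(confident), argmax_after_rounding(confident))
```

Taking the argmax directly, for example `y_pred.argmax(axis=1)`, avoids this fallback; the report above was produced with the rounding step in place.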



Finally, let's save the model using "model.save()" and the tokenizer using pickle so that both can be used later in applications.

In [None]:
import pickle
# saving deep learning model
model.save("SentModel.h5")
# saving tokenizer
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)


After saving the model and tokenizer, we can load them again using the following code.

In [None]:
with open('tokenizer.pickle', 'rb') as handle:
    loaded_tokenizer = pickle.load(handle)
loaded_model = keras.models.load_model("SentModel.h5")  # path used in model.save() above

In [76]:
max_sequence_length = 100
# Order follows the one-hot columns: 0 -> Neutral, 1 -> Positive, 2 (label -1) -> Negative
sentiment = ['Neutral', 'Positive', 'Negative']

In [80]:
test_txt=["I hate everything in my life","I love what i saw last night","hi how are you?"]
seq = tokenizer.texts_to_sequences(test_txt)
ans = pad_sequences(seq, maxlen=max_sequence_length)
preds = model.predict(ans)
for i in preds:
    print(sentiment[np.around(i, decimals=0).argmax()])

Negative
Positive
Neutral


If you would like to see one of my applications of sentiment analysis, check out the following repository on my profile:

https://github.com/Rc17git/YT_Comment_Analysis_Webbapp