## Twitter Sentiment Analysis

This dataset contains 1,600,000 tweets extracted using the Twitter API. The tweets are annotated (0 = negative, 4 = positive) and can be used to detect sentiment.

### Import the basic libraries

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import random

I downloaded this Twitter sentiment dataset from Kaggle. It is a commonly used dataset for learning sentiment classification.

In [2]:
DATASET_COLUMNS = ['target', 'ids', 'date', 'flag', 'user', 'text']
df=pd.read_csv(r'D:\2.10.2022\Download\Deep learning Project\Sentiments Project\training.1600000.processed.noemoticon.csv',encoding='ISO-8859-1',names=DATASET_COLUMNS)

### Exploratory Data Analysis

In [3]:
df.head()

|   | target | ids | date | flag | user | text |
|---|--------|-----|------|------|------|------|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   target  1600000 non-null  int64 
 1   ids     1600000 non-null  int64 
 2   date    1600000 non-null  object
 3   flag    1600000 non-null  object
 4   user    1600000 non-null  object
 5   text    1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


Drop Unnecessary Columns

In [5]:
df = df.drop(['ids', 'date', 'flag', 'user'], axis=1)

In [6]:
#df=df.iloc[:1400000]

Shape of data

In [7]:
df.shape

(1600000, 2)

Checking for null values

In [8]:
df.isnull().sum()

target    0
text      0
dtype: int64

In [9]:
df['target'] = df['target'].replace(4, 1)  # map the positive label 4 to 1 so the classes are 0 and 1

Count the tweets per target value

In [10]:
df.target.value_counts()

0    800000
1    800000
Name: target, dtype: int64

Check the unique target values

In [11]:
df['target'].unique()

array([0, 1], dtype=int64)

In [12]:
df.head()

|   | target | text |
|---|--------|------|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | is upset that he can't update his Facebook by ... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... |


In [16]:
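# Import everything from tensorflow.keras so all Keras objects come from the same package
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical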

In [17]:
tweets = np.array(df['text'])
labels = np.array(df['target'])

In [18]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(tweets, labels, test_size=0.2, random_state=42)

In [19]:
y_train = to_categorical(y_train)
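
The call above turns each integer label into a one-hot row, which is the target shape a two-unit softmax layer trained with `categorical_crossentropy` expects. The sketch below shows the idea in plain Python; `one_hot` is a hypothetical helper written for illustration, not the Keras implementation.

```python
def one_hot(labels, num_classes):
    """Return one one-hot row per integer label (illustrative stand-in)."""
    return [[1.0 if k == label else 0.0 for k in range(num_classes)]
            for label in labels]

# Label 0 (negative) becomes [1, 0]; label 1 (positive) becomes [0, 1].
encoded = one_hot([0, 1, 1], num_classes=2)
print(encoded)  # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```

Note that `y_test` is left as integer labels, which is the form `confusion_matrix` is given later.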

In [20]:
max_num_words  = 10000  # vocabulary size: how many of the most frequent words in the corpus to keep
seq_len        = 50     # how many tokens to keep from each tweet
embedding_size = 100    # vector length (embedding size) for each word

In [22]:
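tokenizer = Tokenizer(num_words=max_num_words)
# Note: the vocabulary is built from all tweets, including the held-out test split;
# fitting on x_train alone would keep the test data fully unseen.
tokenizer.fit_on_texts(tweets)
x_train = tokenizer.texts_to_sequences(x_train)   # words -> integer indices
x_train = pad_sequences(x_train, maxlen=seq_len)  # pad or truncate every tweet to seq_len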

In [23]:
x_train.shape

(1280000, 50)

In [24]:
x_train[3]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,  231,
        118,   76,  250,    4, 1457,   97, 1092,   72,  115,  418,  180,
        111,  145,   37,  250,   45,   40])
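
The leading zeros in the row above come from padding: by default `pad_sequences` pads (and truncates) at the front, so the actual word indices sit at the end of each row. The following is a simplified pure-Python sketch of the tokenize-then-pad idea. The helpers `fit_vocab` and `to_padded` are hypothetical names for illustration, and the sketch only lowercases and splits on whitespace, whereas the Keras `Tokenizer` also filters punctuation.

```python
from collections import Counter

def fit_vocab(texts, num_words):
    """Rank words by frequency (1 = most frequent) and keep ranks below num_words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate(ranked, start=1) if i < num_words}

def to_padded(texts, vocab, maxlen):
    """Map words to indices, drop unknown words, then left-pad with zeros."""
    rows = []
    for text in texts:
        ids = [vocab[w] for w in text.lower().split() if w in vocab]
        ids = ids[-maxlen:]                          # keep the last maxlen tokens
        rows.append([0] * (maxlen - len(ids)) + ids)
    return rows

toy_texts = ["good day", "bad day", "such a good good day"]
toy_vocab = fit_vocab(toy_texts, num_words=4)
toy_rows = to_padded(toy_texts, toy_vocab, maxlen=4)
print(toy_vocab)  # {'good': 1, 'day': 2, 'bad': 3}
print(toy_rows)   # [[0, 0, 1, 2], [0, 0, 3, 2], [0, 1, 1, 2]]
```

Index 0 is reserved for padding, which is why real words start at index 1.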

In [25]:
len(x_train[7])

50

In [26]:
from tensorflow.keras.optimizers import Adam

model = Sequential()  # initialize the network
model.add(Embedding(input_dim=max_num_words,
                    input_length=seq_len,
                    output_dim=embedding_size))
model.add(LSTM(5))  # a plain RNN layer (e.g. SimpleRNN) could be used instead, but LSTM is recommended
model.add(Dense(2, activation='softmax'))  # final layer: one probability per class
adam = Adam(learning_rate=0.001)  # `lr` is a deprecated alias of `learning_rate`
model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])
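
As a rough check on the size of this network, the trainable parameters can be counted by hand. The sketch below assumes the standard layer formulas with biases enabled (the Keras defaults); running `model.summary()` in the notebook would be the way to confirm the figures.

```python
max_num_words, embedding_size = 10000, 100
lstm_units, num_classes = 5, 2

# Embedding: one vector of length embedding_size per vocabulary index.
embedding_params = max_num_words * embedding_size

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (embedding_size + lstm_units) + lstm_units)

# Dense: weight matrix plus one bias per output class.
dense_params = lstm_units * num_classes + num_classes

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 1000000 2120 12 1002132
```

Almost all of the parameters sit in the embedding table; the LSTM with 5 units is very small by comparison.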



In [27]:
model.fit(x_train, y_train, epochs=5, batch_size = 32, validation_split=.2)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x218ca6a5040>
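
To make the data budget of this run concrete, the arithmetic below works out how many tweets end up in each split. It uses the 80/20 train/test split and `validation_split=.2` from the cells above; integer percentages are used here only to keep the arithmetic exact. Keras takes the validation samples from the end of the training arrays, before shuffling.

```python
import math

n_total = 1_600_000
test_pct, val_pct, batch_size = 20, 20, 32

n_test = n_total * test_pct // 100    # held out by train_test_split
n_train = n_total - n_test            # rows in x_train
n_val = n_train * val_pct // 100      # held out by validation_split
n_fit = n_train - n_val               # rows actually used for weight updates
steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_test, n_train, n_val, n_fit, steps_per_epoch)
# 320000 1280000 256000 1024000 32000
```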

In [28]:
x_test = tokenizer.texts_to_sequences(x_test)
x_test = pad_sequences(x_test, maxlen=seq_len)

In [29]:
from sklearn.metrics import confusion_matrix, accuracy_score

In [30]:
pred_prob = model.predict(x_test)



In [31]:
pred_classes = pred_prob.argmax(axis = 1)
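
Each row of `pred_prob` holds two softmax probabilities, and `argmax(axis=1)` picks the column with the larger one, giving back a 0/1 label. A minimal pure-Python sketch of the same step, with made-up probabilities and a hypothetical helper `argmax_rows`:

```python
def argmax_rows(prob_rows):
    """Return the index of the largest value in each row (first index on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in prob_rows]

toy_probs = [[0.9, 0.1], [0.3, 0.7], [0.5, 0.5]]  # illustrative values only
toy_classes = argmax_rows(toy_probs)
print(toy_classes)  # [0, 1, 0]
```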

In [32]:
tab1 = confusion_matrix(y_test, pred_classes)
tab1

array([[132221,  27273],
       [ 31026, 129480]], dtype=int64)

In [33]:
tab1.diagonal().sum()*100/tab1.sum()

81.7815625
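
Accuracy is only one way to read the confusion matrix. The sketch below derives precision, recall, and F1 for the positive class from the same four counts, assuming scikit-learn's layout (rows are true labels, columns are predicted labels, in the order 0 then 1).

```python
# Counts copied from the confusion matrix above.
tn, fp = 132221, 27273   # true negatives, false positives
fn, tp = 31026, 129480   # false negatives, true positives

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)          # of tweets predicted positive, how many were positive
recall = tp / (tp + fn)             # of positive tweets, how many were found
f1 = 2 * tp / (2 * tp + fp + fn)    # harmonic mean of precision and recall

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

Because the two classes are balanced in this dataset, accuracy is a reasonable headline number here, and the other metrics mainly show that errors are spread fairly evenly between the classes.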