<a href="https://colab.research.google.com/github/KishoreKumar1308/Twitter-Sentiment/blob/main/TwitterSentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Twitter Sentiment Analysis

In [1]:
from google.colab import drive
drive.mount('/content/drive') # Mount Google Drive

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# Import required libraries 
import numpy as np
import pandas as pd
import re
import nltk

In [3]:
nltk.download(['punkt','stopwords','wordnet'])
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [4]:
path = "/content/drive/MyDrive/Twitter-Sentiment/training.1600000.processed.noemoticon.csv" 
# Either create a folder named "Twitter-Sentiment" in your Drive and upload the input file to it,
# or change the path to wherever you uploaded the input file.

data = pd.read_csv(path,encoding = "latin-1",header = None) 
# Reading the file with the default UTF-8 encoding raised a UnicodeDecodeError for a few values,
# so we read it as "latin-1" instead (see Stack Overflow: https://stackoverflow.com/questions/5552555/unicodedecodeerror-invalid-continuation-byte)

In [5]:
data.head()

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


## Data cleaning and Preprocessing

We only need the Target class (column 0) and the tweet column (column 5).

In [6]:
data.drop([1,2,3,4],axis = 1,inplace = True) # Dropping columns 1,2,3,4

In [7]:
data.columns = ["Target","Tweet"] # Renaming the column names

In [8]:
data.head()

| | Target | Tweet |
|---|---|---|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | is upset that he can't update his Facebook by ... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... |


In [9]:
def extract_text(tweet):
  t1 = re.sub(r'@\S+',' ',tweet) # Removing any user mentions
  t1 = re.sub(r'http\S+',' ',t1) # Removing any URLs
  return re.sub(r'[^0-9a-zA-Z]+',' ',t1).strip().lower() # Replacing every run of non-alphanumeric characters with one space, then trimming and lowercasing
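
To see what each substitution does, the sketch below repeats `extract_text` and runs it on three made-up strings (the inputs are illustrative assumptions, not rows from the dataset). It also shows two side effects worth knowing about: apostrophes split contractions into separate tokens, and anything after an `@` is treated as a mention, including the domain of an e-mail address.

```python
import re

def extract_text(tweet):
    t1 = re.sub(r'@\S+', ' ', tweet)    # drop user mentions
    t1 = re.sub(r'http\S+', ' ', t1)    # drop URLs (also matches https)
    # collapse every run of non-alphanumeric characters into one space
    return re.sub(r'[^0-9a-zA-Z]+', ' ', t1).strip().lower()

# Hypothetical inputs, chosen only to exercise each rule
examples = [
    "@some_user http://example.com/abc - Awww, that's a bummer!",
    "is upset that he can't update his Facebook",
    "mail me at bob@example.com",
]
cleaned = [extract_text(t) for t in examples]
for raw, out in zip(examples, cleaned):
    print(repr(raw), "->", repr(out))
```

The contraction fragments (`s`, `t`) are single-letter tokens, so whether they survive depends on the stop-word list applied in the next step.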

In [10]:
data["CleanedTweet"] = data["Tweet"].apply(extract_text) # Applying the function to the "Tweet" column and storing the result in a new "CleanedTweet" column

In [11]:
data.head()

| | Target | Tweet | CleanedTweet |
|---|---|---|---|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... | awww that s a bummer you shoulda got david car... |
| 1 | 0 | is upset that he can't update his Facebook by ... | is upset that he can t update his facebook by ... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... | i dived many times for the ball managed to sav... |
| 3 | 0 | my whole body feels itchy and like its on fire | my whole body feels itchy and like its on fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... | no it s not behaving at all i m mad why am i h... |


In [12]:
lemmatizer = WordNetLemmatizer() # Lemmatizer to reduce a word to its dictionary form
stemmer = PorterStemmer() # Stemmer for normalizing similar words
stop_list = set(stopwords.words('english')) # A set keeps the membership test below fast
def preprocess_text(text):
  tokens = word_tokenize(text) # Tokenizing the given text
  nw_removed = [word for word in tokens if word not in stop_list] # Dropping stop words
  ls = []
  for w in nw_removed: # For each remaining token
    ls.append(lemmatizer.lemmatize(stemmer.stem(w))) # Stem the token first, then lemmatize the stem
    # and append the result to the list

  return ' '.join(ls) # Returning the preprocessed words as a single string

In [13]:
data["PreProcessedText"] = data["CleanedTweet"].apply(preprocess_text) # Applying the preprocessing on cleaned tweets

In [14]:
data.head()

| | Target | Tweet | CleanedTweet | PreProcessedText |
|---|---|---|---|---|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... | awww that s a bummer you shoulda got david car... | awww bummer shoulda got david carr third day |
| 1 | 0 | is upset that he can't update his Facebook by ... | is upset that he can t update his facebook by ... | upset updat facebook text might cri result sch... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... | i dived many times for the ball managed to sav... | dive mani time ball manag save 50 rest go bound |
| 3 | 0 | my whole body feels itchy and like its on fire | my whole body feels itchy and like its on fire | whole bodi feel itchi like fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... | no it s not behaving at all i m mad why am i h... | behav mad see |


In [15]:
data["Target"] = data["Target"].map({0:0,4:1}) # Mapping targets from (0,4) --> (0,1)

In [16]:
shuffled = data.sample(frac = 1) # Shuffling the data
shuffled.head()

| | Target | Tweet | CleanedTweet | PreProcessedText |
|---|---|---|---|---|
| 413201 | 0 | aw god, I hate having noone online | aw god i hate having noone online | aw god hate noon onlin |
| 1231905 | 1 | its the case of the mondays.. although most of... | its the case of the mondays although most of i... | case monday although |
| 835816 | 1 | http://twitpic.com/3lhh3 - Me and the lovely R... | me and the lovely ruthie henshall x | love ruthi henshal x |
| 1205879 | 1 | Night tweeters! As our hottie Danny Jones woul... | night tweeters as our hottie danny jones would... | night tweeter hotti danni jone would say tweet... |
| 1528262 | 1 | ok, now I am going to sleep hehehe lol | ok now i am going to sleep hehehe lol | ok go sleep heheh lol |


In [17]:
glove_path = "/content/drive/MyDrive/Twitter-Sentiment/glove.twitter.27B.50d.txt" # Path to the GloVe Twitter word embeddings

In [18]:
samples = shuffled["PreProcessedText"] # X
labels = pd.get_dummies(shuffled["Target"]) # y
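
`pd.get_dummies` turns the 0/1 target into two indicator columns, one per class, which is the layout a two-unit softmax output with `categorical_crossentropy` expects. As a rough, pandas-free illustration of that encoding (the helper `one_hot` and the sample targets are assumptions made up for this sketch):

```python
def one_hot(label, num_classes=2):
    """Return a list with a 1 at position `label` and 0 elsewhere."""
    row = [0] * num_classes
    row[label] = 1
    return row

raw_targets = [0, 4, 4, 0]                          # illustrative values in the file's original coding
mapped = [{0: 0, 4: 1}[t] for t in raw_targets]     # same mapping as the notebook: 0 -> 0, 4 -> 1
encoded = [one_hot(t) for t in mapped]
print(encoded)
```

Column 0 then marks a negative tweet and column 1 a positive one, which matches the order of the `sentiment` list used at prediction time.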

In [19]:
validation_split = 0.2 # Fraction of the shuffled data held out for validation
num_validation_samples = int(validation_split * len(samples))
train_samples = samples[:-num_validation_samples]
val_samples = samples[-num_validation_samples:]
train_labels = labels[:-num_validation_samples]
val_labels = labels[-num_validation_samples:]
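
The split above takes the last 20% of the shuffled rows as validation data by slicing from the end. The sketch below shows the same slicing on a plain list; `tail_split` is a hypothetical helper, and the guard for an empty validation set is an addition, since `items[:-0]` would otherwise return an empty training set.

```python
def tail_split(items, validation_split):
    """Split `items` into (train, validation), taking validation from the tail."""
    num_val = int(validation_split * len(items))
    if num_val == 0:
        # items[:-0] is items[:0], i.e. empty, so handle this case explicitly
        return list(items), []
    return items[:-num_val], items[-num_val:]

train, val = tail_split(list(range(10)), 0.2)
print(train, val)

# If the file holds 1,600,000 rows, as its name suggests, the same rule gives:
num_val_full = int(0.2 * 1_600_000)
print(num_val_full, 1_600_000 - num_val_full)
```

Because the rows were shuffled beforehand, taking the tail is equivalent to a random hold-out; without the shuffle it would not be, since the raw file shown earlier starts with a run of class-0 rows.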

In [20]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization # For vectorizing the input text
vectorizer = TextVectorization(max_tokens = 20000, output_sequence_length = 200) # Keep at most 20000 vocabulary entries; pad or truncate every sequence to length 200
text_ds = tf.data.Dataset.from_tensor_slices(train_samples).batch(128) # Wrapping the training text in a batched tf.data.Dataset
vectorizer.adapt(text_ds) # Building the vocabulary from the training text
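
Conceptually, the vectorizer builds a frequency-ranked vocabulary and then maps each text to a fixed-length list of integer ids. The following is a simplified, pure-Python sketch of that idea, not Keras's implementation: `build_vocab` and `vectorize` are hypothetical helpers, the tiny corpus is made up, and the standardization step is skipped because the tweets are already cleaned.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    """Index 0 is padding, index 1 is the out-of-vocabulary token, then words by frequency."""
    counts = Counter(tok for text in texts for tok in text.split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocab, output_sequence_length):
    """Map tokens to ids, truncate to the target length, then right-pad with 0."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in text.split()][:output_sequence_length]
    return ids + [0] * (output_sequence_length - len(ids))

corpus = ["good day", "good night", "bad day"]     # illustrative corpus
vocab = build_vocab(corpus, max_tokens=5)
vector = vectorize("good bad day", vocab, output_sequence_length=5)
print(vocab)
print(vector)
```

Here `bad` falls outside the five-entry vocabulary, so it maps to id 1, and the two trailing zeros are padding.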

In [21]:
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc)))) # Word index has each word and its corresponding index

In [22]:
embeddings_index = {}
with open(glove_path, encoding="utf-8") as f: # Opening the GloVe Twitter word embeddings
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs
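
Each line of a GloVe text file holds a token followed by its space-separated vector components, which is why `split(maxsplit=1)` is enough to separate the two. A small sketch of the same parsing on an in-memory stand-in for the file (the two lines and their 3-dimensional values are invented for illustration; plain `float` lists are used here instead of NumPy arrays):

```python
import io

# Stand-in for the embeddings file; real lines carry 50 values each
fake_glove = io.StringIO("happy 0.1 -0.2 0.3\nsad -0.4 0.5 -0.6\n")

embeddings_index = {}
for line in fake_glove:
    word, coefs = line.split(maxsplit=1)   # first field is the token, the rest is the vector
    embeddings_index[word] = [float(value) for value in coefs.split()]

print(len(embeddings_index), embeddings_index["happy"])
```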

In [23]:
num_tokens = len(voc) + 2 # +2 because index 0 is reserved and index 1 is for unknown tokens
embedding_dim = 50 # We have used 50d word embeddings so embedding dim is 50

embedding_matrix = np.zeros((num_tokens, embedding_dim)) # Creating Embedding matrix 
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
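
Rows for words that are missing from the GloVe index stay all-zero, so it can be useful to count how many vocabulary entries actually receive a pretrained vector. The sketch below adds such a count to the same loop, using nested lists and a toy vocabulary (all names and values are illustrative assumptions). Stemmed tokens such as `onlin` may not appear in an embedding vocabulary built from unstemmed text, which is the kind of miss this count would surface.

```python
embedding_dim = 3                                      # toy dimension; the notebook uses 50
word_index = {"": 0, "[UNK]": 1, "happy": 2, "onlin": 3}
embeddings_index = {"happy": [0.1, -0.2, 0.3]}         # pretend pretrained vectors

num_tokens = len(word_index) + 2
embedding_matrix = [[0.0] * embedding_dim for _ in range(num_tokens)]

hits, misses = 0, 0
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = list(vector)   # copy the pretrained vector into row i
        hits += 1
    else:
        misses += 1                          # row i stays all zeros

print(f"Converted {hits} words ({misses} misses)")
```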

In [24]:
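from tensorflow.keras.layers import Embedding
from tensorflow import keras # Using the Keras bundled with TensorFlow, consistent with the other imports

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
) # Embedding layer initialized with the GloVe matrix and kept frozen during training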

In [25]:
from tensorflow.keras import layers
# 1D Convolutional Neural Network
int_sequences_input = keras.Input(shape=(None,), dtype="int64")
embedded_sequences = embedding_layer(int_sequences_input)
x = layers.Conv1D(128, 5, activation="relu")(embedded_sequences)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
preds = layers.Dense(2, activation="softmax")(x)
model = keras.Model(int_sequences_input, preds)
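
The model summary reports `None` for the time dimension because the input shape is `(None,)`, but with sequences padded to 200 tokens the lengths can be worked out by hand. The sketch below does that arithmetic, assuming the Keras defaults used above (`padding="valid"` and stride 1 for `Conv1D`, stride equal to the pool size for `MaxPooling1D`); the two helper functions are hypothetical.

```python
def conv1d_valid_len(n, kernel_size):
    """Output length of a stride-1 'valid' convolution."""
    return n - kernel_size + 1

def maxpool1d_len(n, pool_size):
    """Output length of 'valid' max pooling with stride equal to the pool size."""
    return (n - pool_size) // pool_size + 1

n = 200                                   # output_sequence_length of the vectorizer
trace = [n]
for kind, size in [("conv", 5), ("pool", 5), ("conv", 5), ("pool", 5), ("conv", 5)]:
    n = conv1d_valid_len(n, size) if kind == "conv" else maxpool1d_len(n, size)
    trace.append(n)

print(trace)  # [200, 196, 39, 35, 7, 3]
```

So the last convolution emits 3 time steps of 128 features, and `GlobalMaxPooling1D` reduces them to a single 128-dimensional vector for the dense layers.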

In [26]:
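from tensorflow.keras.callbacks import ReduceLROnPlateau # To reduce the learning rate when the validation loss plateaus
from tensorflow.keras.callbacks import EarlyStopping # To stop training once the validation loss stops improving
from tensorflow.keras.callbacks import ModelCheckpoint # To save the best model
callback1 = ReduceLROnPlateau(factor=0.2, min_lr = 0.001,monitor = 'val_loss',verbose = 1)
callback2 = EarlyStopping(monitor = 'val_loss',mode = 'min',verbose = 1,patience = 5)
# With metrics=["acc"], Keras logs the validation accuracy under the name 'val_acc'.
# The training log further down was recorded with the checkpoint on its default monitor (val_loss) and mode = 'max',
# which is why it reports val_loss improving from -inf only once.
callback3 = ModelCheckpoint('best_model.h5',monitor = 'val_acc',mode = 'max',verbose = 1,save_best_only = True)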
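
One detail of the `ReduceLROnPlateau` settings is worth checking by hand. When the callback fires, the new rate is the old rate times `factor`, clipped from below at `min_lr`. The sketch below is a simplified restatement of that rule rather than the Keras source, and it assumes the optimizer starts at 0.001, Adam's default learning rate in Keras; `reduce_lr` is a hypothetical helper.

```python
def reduce_lr(current_lr, factor, min_lr):
    """Simplified plateau update: scale the rate down, but never below min_lr."""
    if current_lr > min_lr:
        return max(current_lr * factor, min_lr)
    return current_lr

# Settings from the notebook: the floor equals the starting rate, so nothing changes
same = reduce_lr(0.001, factor=0.2, min_lr=0.001)

# A lower floor (an illustrative value, not from the notebook) lets the rate drop
lower = reduce_lr(0.001, factor=0.2, min_lr=1e-5)

print(same, lower)
```

Under that assumption, `min_lr = 0.001` leaves the callback no room to act; a smaller floor would be needed for it to have any effect.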

In [27]:
x_train = vectorizer(np.array([[s] for s in train_samples])).numpy() # Wrapping each sample in its own row, vectorizing the array and converting the resulting Tensor back to a NumPy array
x_val = vectorizer(np.array([[s] for s in val_samples])).numpy()

y_train = np.array(train_labels)
y_val = np.array(val_labels)

In [28]:
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["acc"])

In [29]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 50)          1000100   
                                                                 
 conv1d (Conv1D)             (None, None, 128)         32128     
                                                                 
 max_pooling1d (MaxPooling1D  (None, None, 128)        0         
 )                                                               
                                                                 
 conv1d_1 (Conv1D)           (None, None, 128)         82048     
                                                                 
 max_pooling1d_1 (MaxPooling  (None, None, 128)        0         
 1D)                                                         
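
The parameter counts in the summary can be reproduced by hand, which is a quick way to confirm the layer shapes. The sketch below assumes the vocabulary reached the 20000-entry cap, so that `num_tokens` is 20002; the helper functions are hypothetical.

```python
def embedding_params(num_tokens, embedding_dim):
    """One vector of size embedding_dim per token id."""
    return num_tokens * embedding_dim

def conv1d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

counts = {
    "embedding": embedding_params(20000 + 2, 50),
    "conv1d": conv1d_params(5, 50, 128),      # input channels = embedding dimension
    "conv1d_1": conv1d_params(5, 128, 128),   # input channels = filters of the previous conv
}
print(counts)
```

These match the 1000100, 32128 and 82048 shown above. The embedding weights are frozen, so they count as non-trainable parameters.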

In [30]:
model.fit(x_train, y_train, batch_size=128, epochs=20, validation_data=(x_val, y_val),callbacks = [callback1,callback2,callback3])

Epoch 1/20
Epoch 00001: val_loss improved from -inf to 0.49809, saving model to best_model.h5
Epoch 2/20
Epoch 00002: val_loss did not improve from 0.49809
Epoch 3/20
Epoch 00003: val_loss did not improve from 0.49809
Epoch 4/20
Epoch 00004: val_loss did not improve from 0.49809
Epoch 5/20
Epoch 00005: val_loss did not improve from 0.49809
Epoch 6/20
Epoch 00006: val_loss did not improve from 0.49809
Epoch 7/20
Epoch 00007: val_loss did not improve from 0.49809
Epoch 8/20
Epoch 00008: val_loss did not improve from 0.49809
Epoch 9/20
Epoch 00009: val_loss did not improve from 0.49809
Epoch 10/20
Epoch 00010: val_loss did not improve from 0.49809
Epoch 11/20
Epoch 00011: val_loss did not improve from 0.49809
Epoch 12/20
Epoch 00012: val_loss did not improve from 0.49809
Epoch 13/20
Epoch 00013: val_loss did not improve from 0.49809
Epoch 00013: early stopping
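
The log shows the checkpoint tracking `val_loss` with an initial best value of `-inf`, which is what a "larger is better" comparison starts from. Under that rule a falling loss only counts as an improvement on the first epoch. The sketch below simulates the comparison; `checkpoint_epochs` is a hypothetical helper, and apart from the first value (taken from the log) the losses are invented, decreasing numbers used purely for illustration.

```python
def checkpoint_epochs(values, mode):
    """Return the 1-based epochs at which a save_best_only checkpoint would save."""
    best = float("-inf") if mode == "max" else float("inf")
    saved = []
    for epoch, value in enumerate(values, start=1):
        improved = value > best if mode == "max" else value < best
        if improved:
            best = value
            saved.append(epoch)
    return saved

# First value from the log above; the remaining ones are hypothetical
val_losses = [0.49809, 0.4950, 0.4920, 0.4900]

saved_max = checkpoint_epochs(val_losses, mode="max")
saved_min = checkpoint_epochs(val_losses, mode="min")
print(saved_max, saved_min)
```

With `mode="max"` only epoch 1 is saved, so `best_model.h5` holds the first-epoch weights; this is consistent with the reloaded model's evaluation loss of about 0.49809 reported further down.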


<keras.callbacks.History at 0x7fd7e09d6250>

In [31]:
import h5py
model.save("model.h5") # Saving the model

In [32]:
import pickle
# Saving the TextVectorization layer's config and weights
with open("vectorizer.pkl", "wb") as f:
    pickle.dump({'config': vectorizer.get_config(),'weights': vectorizer.get_weights()}, f)

In [33]:
model.evaluate(x_val,y_val)



[0.48911532759666443, 0.7640406489372253]

In [35]:
saved_model = tf.keras.models.load_model('best_model.h5')
saved_model.evaluate(x_val,y_val)



[0.49809205532073975, 0.7552781105041504]

In [41]:
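string_input = keras.Input(shape=(1,), dtype="string") # Raw-text input, one tweet per row (assumed definition; it is not shown in the cells above)
x = vectorizer(string_input)
preds = model(x)
final_model = keras.Model(string_input, preds) # Pipelining the vectorizing and predicting steps

tweet = "I love this!" # Hypothetical example tweet; replace with the text to classify
cleaned_text = preprocess_text(extract_text(tweet)) # Same cleaning and preprocessing as the training data

probabilities = final_model.predict([[cleaned_text]])
sentiment = ["Negative","Positive"]
print("*"*10)
print(f"The given Tweet has {sentiment[np.argmax(probabilities[0])]} sentiment.")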



1

In [43]:
tf.keras.backend.clear_session()