Tirth8038/Tweets-Sentiment-Classification

Tweets-Sentiment-Classification

The main aim of the project is to analyze Twitter data describing the COVID situation and to build a text classification model that sorts tweets into 5 categories: Extremely Negative (0), Negative (1), Neutral (2), Positive (3) and Extremely Positive (4). The provided dataset contains tweets with dimension (37041, 2) and, separately, numerical labels for the above categories with dimension (37041, 2).

The provided tweets need to be cleaned because they contain irrelevant elements such as mentions (@), URLs (HTTP links), HTML tags and punctuation marks. Using regex functions, I removed those elements, along with stopwords, from the tweets. To normalize the terms, I applied the Porter Stemmer and used the WordNet Lemmatizer to convert each term to its base form. Then, to convert the words into vectors of equal length, I tokenized the tweets, converted them to sequences, and post-padded the sequences with zeros, using the length of the longest sequence as the maximum length. After preprocessing, the tweet dataset has dimension (37041, 286).

For model selection, I built 3 different models: one baseline model, Multinomial Naive Bayes, and 2 more advanced Recurrent Neural Network models. The first is a GRU architecture with a single Embedding layer and 1 Bidirectional layer, followed by Global Average Pooling 1D and 2 Dense layers. The second is an LSTM architecture with a single Embedding layer followed by 2 Bidirectional layers and 2 Dense layers. I also tried Dropout with a 40% dropout rate during training of the RNN models, as well as Early Stopping, to prevent overfitting, and found that Early Stopping gave better results than Dropout.

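The cleaning, tokenizing and post-padding steps described above can be sketched in plain Python. This is only an illustration, not the code used in the project: the function names `clean_tweet`, `build_vocab` and `to_padded_sequences` are hypothetical, lowercasing is an assumption, and stopword removal, stemming and lemmatization are left out to keep the sketch dependency-free.

```python
import re


def clean_tweet(text):
    """Basic cleaning: drop links, mentions, HTML tags and punctuation."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs / HTTP links
    text = re.sub(r"@\w+", " ", text)                    # mentions
    text = re.sub(r"<[^>]+>", " ", text)                 # HTML tags
    text = re.sub(r"[^\w\s]", " ", text)                 # punctuation marks
    # Collapse whitespace; lowercasing is an assumption, not stated in the README.
    return re.sub(r"\s+", " ", text).strip().lower()


def build_vocab(texts):
    """Map each word to an integer id starting at 1 (0 is reserved for padding)."""
    vocab = {}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def to_padded_sequences(texts, vocab):
    """Convert texts to id sequences and post-pad with zeros to the longest one."""
    seqs = [[vocab[w] for w in t.split() if w in vocab] for t in texts]
    max_len = max((len(s) for s in seqs), default=0)
    return [s + [0] * (max_len - len(s)) for s in seqs]


if __name__ == "__main__":
    raw = ["@user Check https://t.co/abc <b>Stay</b> safe!!!", "Stay home."]
    cleaned = [clean_tweet(t) for t in raw]
    vocab = build_vocab(cleaned)
    print(cleaned)
    print(to_padded_sequences(cleaned, vocab))
```

Every row of the result has the same length, which is how a corpus of 37041 tweets ends up as a (37041, 286) matrix when the longest tweet has 286 tokens.
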
For evaluation, I split the dataset into training, testing and validation sets in an (80, 10, 10) ratio and calculated the macro F1 and AUC scores on the test data. Using the confusion matrix, I calculated the accuracy by dividing the sum of the diagonal elements by the sum of all elements. I also plotted training vs. validation loss and accuracy graphs to visualize the performance of the models. Interestingly, when I skipped preprocessing techniques such as stopword removal, the Porter Stemmer and the WordNet Lemmatizer, and used just a basic text cleaning function, the RNN model with the LSTM architecture improved from 73.87% to 77.1% accuracy and had an AUC score of 0.95.
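
The accuracy calculation described above (sum of the diagonal divided by the sum of all elements) and the macro F1 score can both be read off a confusion matrix. The sketch below is a minimal illustration, assuming rows are true classes and columns are predicted classes; the function names and the small 2-class matrix are hypothetical and are not results from this project.

```python
def accuracy_from_confusion(cm):
    """Accuracy = sum of diagonal elements / sum of all elements."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def macro_f1_from_confusion(cm):
    """Unweighted mean of per-class F1 (rows = true class, cols = predicted)."""
    n = len(cm)
    f1_scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, actually another class
        fn = sum(cm[k]) - tp                       # actually k, predicted another class
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n


if __name__ == "__main__":
    # Made-up 2-class matrix for illustration only.
    cm = [[2, 1],
          [0, 3]]
    print(accuracy_from_confusion(cm))
    print(macro_f1_from_confusion(cm))
```

The same functions work unchanged on the 5x5 matrix produced by a 5-class sentiment model.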
