Tweets-Sentiment-Classification

The main aim of this project is to analyze Twitter data describing the COVID situation and to build a text classification model that sorts tweets into 5 categories: Extremely Negative (0), Negative (1), Neutral (2), Positive (3) and Extremely Positive (4). The provided dataset contains tweets with dimension (37041, 2) and, separately, numerical labels for the above categories with dimension (37041, 2).

The raw tweets need cleaning because they contain irrelevant elements such as mentions (@), HTTP links and other URLs, HTML tags and punctuation marks. Using regular expressions, I removed those elements and stopwords from the tweets. To normalize the terms, I applied the Porter Stemmer and used the WordNet Lemmatizer to convert each term to its base form. To convert the words into vectors of equal length, I tokenized the tweets, converted them to integer sequences and post-padded each sequence with zeros, using the length of the longest sequence as the maximum length. After preprocessing, the tweet dataset has dimension (37041, 286).

For model selection, I built 3 models: one baseline model, Multinomial Naive Bayes, and 2 more advanced Recurrent Neural Network models. The first is a GRU architecture with a single Embedding layer and 1 Bidirectional layer, followed by Global Average Pooling 1D and 2 Dense layers. The second is an LSTM architecture with a single Embedding layer followed by 2 Bidirectional layers and 2 Dense layers. To prevent overfitting, I also tried Dropout with a 40% dropout rate during training of the RNN models as well as Early Stopping, and found that Early Stopping gave better results than Dropout.
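The cleaning and padding steps described above can be sketched in plain Python. This is only an illustration, not the notebook's actual code: the function names `clean_tweet`, `build_vocab` and `to_padded_sequences` are hypothetical, the regular expressions and the lowercasing step are assumptions, and the hand-rolled vocabulary stands in for whatever tokenizer the notebooks use. Stopword removal, stemming and lemmatization are left out.

```python
import re


def clean_tweet(text: str) -> str:
    """Basic cleaning: drop links, HTML tags, mentions and punctuation."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # HTTP links / URLs
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"@\w+", " ", text)                   # mentions
    text = re.sub(r"[^\w\s]", " ", text)                # punctuation
    # Collapse whitespace; lowercasing is an assumption, not stated in the README.
    return re.sub(r"\s+", " ", text).strip().lower()


def build_vocab(texts):
    """Map each word to a positive integer id; 0 is reserved for padding."""
    vocab = {}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def to_padded_sequences(texts, vocab):
    """Convert texts to id sequences and post-pad with zeros to the longest one."""
    seqs = [[vocab[word] for word in text.split()] for text in texts]
    max_len = max(len(seq) for seq in seqs)
    return [seq + [0] * (max_len - len(seq)) for seq in seqs]


# Made-up example tweets, not taken from the dataset.
tweets = ["@user Stay home! <b>COVID</b> https://t.co/abc", "Stay safe"]
cleaned = [clean_tweet(t) for t in tweets]
vocab = build_vocab(cleaned)
padded = to_padded_sequences(cleaned, vocab)
print(cleaned)
print(padded)
```

Every row of `padded` has the same length, which is what lets the tweets be stacked into a single matrix such as the (37041, 286) one mentioned above.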
For evaluation, I split the dataset into training, testing and validation sets in an (80, 10, 10) ratio and calculated the macro F1 and AUC scores on the test data. From the confusion matrix, I calculated the accuracy by dividing the sum of the diagonal elements by the sum of all elements. I also plotted training vs. validation loss and accuracy graphs to visualize the performance of the models. Interestingly, skipping preprocessing steps such as stopword removal, the Porter Stemmer and the WordNet Lemmatizer, and using only the basic text cleaning function with the LSTM model, increased the accuracy from 73.87% to 77.1%, with an AUC score of 0.95.
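The accuracy calculation from the confusion matrix, together with macro F1, can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the function names are hypothetical, rows are taken to be true labels and columns predicted labels, and the 3x3 matrix is made-up toy data, not a result from this project.

```python
def accuracy_from_confusion(cm):
    """Accuracy = sum of diagonal entries / sum of all entries."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def macro_f1(cm):
    """Unweighted mean of per-class F1 (rows: true labels, columns: predictions)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true k, predicted class differs
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


# Toy confusion matrix for 3 classes (illustrative values only).
toy_cm = [
    [5, 1, 0],
    [2, 6, 2],
    [0, 1, 3],
]
print(accuracy_from_confusion(toy_cm))
print(macro_f1(toy_cm))
```

Macro F1 weights every class equally, so it can differ noticeably from accuracy when the five sentiment classes are unevenly represented.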

Tirth8038/Tweets-Sentiment-Classification
