<a href="https://colab.research.google.com/github/AhanR/NLP-Tweets-basic/blob/main/sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split as tts

import nltk
from nltk.corpus import stopwords
from nltk.classify import SklearnClassifier
import matplotlib.pyplot as plt
%matplotlib inline

In [5]:
data = pd.read_csv("./Sentiment.csv")
data = data[['text', 'sentiment']]

In [6]:
data.head()

Unnamed: 0,text,sentiment
0,RT @NancyLeeGrahn: How did everyone feel about...,Neutral
1,RT @ScottWalker: Didn't catch the full #GOPdeb...,Positive
2,RT @TJMShow: No mention of Tamir Rice and the ...,Neutral
3,RT @RobGeorge: That Carly Fiorina is trending ...,Positive
4,RT @DanScavino: #GOPDebate w/ @realDonaldTrump...,Positive


Splitting the data into two parts, one with neutral comments and one without

In [7]:
data_cleaned = data[data.sentiment != 'Neutral']

In [8]:
data_cleaned.head()

Unnamed: 0,text,sentiment
1,RT @ScottWalker: Didn't catch the full #GOPdeb...,Positive
3,RT @RobGeorge: That Carly Fiorina is trending ...,Positive
4,RT @DanScavino: #GOPDebate w/ @realDonaldTrump...,Positive
5,"RT @GregAbbott_TX: @TedCruz: ""On my first day ...",Positive
6,RT @warriorwoman91: I liked her and was happy ...,Negative


In [11]:
train_clean, test_clean = tts(data_cleaned, test_size = 0.1)
train, test = tts(data, test_size = 0.1)

In [24]:
test_clean.head()

Unnamed: 0,text,sentiment
8986,RT @DCordrey1: Kelly is less credible now @sas...,Negative
7340,"Guys, thanks for not unfollowing me in masses ...",Negative
5671,No clear winners in the #GOPDebate last night....,Negative
703,"RT @JillBidenVeep: I hate the phrase ""I just c...",Negative
7216,"Other than question to Walker, nothing on race...",Negative


Cleaning the tweets, remove the #s and the @s and the stop words

In [13]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [15]:
stopwords_set = set(stopwords.words("english"))
def cleanTweets(data):
  final_data = []
  for i, row in data.iterrows():
    word_filtered = [word.lower() for word in row.text.split(" ") if len(word)>=3]
    words_cleaned = [word for word in word_filtered if word not in stopwords_set and "http" not in word and '@' not in word and "#" not in word and word != "RT"]
    final_data.append((words_cleaned, row.sentiment))
    return final_data
tweets_clean = cleanTweets(train_clean)
tweets = cleanTweets(train)

In [16]:
all_words_in_tweets_clean = set([word for (doc,x) in tweets_clean for word in doc])
all_words_in_tweets = set([word for (doc,x) in tweets for word in doc])

In [17]:
def feature_extractor(line):
  line_words = set(line)
  features = {}
  for w in all_words_in_tweets_clean:
    features[w] = (w in line_words)
  return features

In [20]:
training_set_clean = nltk.classify.apply_features(feature_extractor, tweets_clean)
nb_classifier_clean = nltk.NaiveBayesClassifier.train(training_set_clean)

In [34]:
training_set = nltk.classify.apply_features(feature_extractor, tweets)
nb_classifier = nltk.NaiveBayesClassifier.train(training_set)

Writing a simple accuracy checker

In [36]:
def tester(test_data,model):
  res = []
  for i, row in test_data.iterrows():
    res.append(nb_classifier_clean.classify(feature_extractor(row.text.lower().split(" "))))
  correct,wrong = 0,0
  for (pred, aim) in zip(res, test_data['sentiment']):
    if pred==aim:
      correct+=1
    else:
      wrong+=1
  print("Accuracy : ",(correct/(correct+wrong)))
  print(f'Correct/Wrong :: {correct}/{wrong}')

In [37]:
tester(test_clean,nb_classifier_clean)

Accuracy :  0.7726001863932899
Correct/Wrong :: 829/244


In [38]:
tester(test, nb_classifier)

Accuracy :  0.6066282420749279
Correct/Wrong :: 842/546


As we can see, having neutral statemtnts confuses the model a bit more. Hence we decide to drop the neutral sentiment for naive bayes classifier