# Sentiment Analysis using Naive Bayes

In this assignment, we label tweets with a sentiment class (positive, neutral, or negative) using a Naive Bayes classifier. Naive Bayes is a very simple approach to this problem, but it can give surprisingly good accuracy.
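
To make the idea concrete, here is a minimal from-scratch sketch of a multinomial Naive Bayes text classifier with additive smoothing. It is not the scikit-learn implementation used later in this notebook; the helper names `train_nb` and `predict_nb`, the toy documents, and the `pos`/`neg` labels are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Count documents per class and words per class (hypothetical helper)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocab, alpha

def predict_nb(model, doc):
    """Return the class with the highest log prior + sum of log word likelihoods."""
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)            # log prior P(class)
        total_words = sum(word_counts[label].values())
        for word in doc.split():
            if word not in vocab:                        # ignore words never seen in training
                continue
            # additive smoothing: unseen (word, class) pairs get a small non-zero probability
            likelihood = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

# Toy data, invented for this sketch
docs = ["good great win", "great good", "bad awful lose", "awful bad bad"]
labels = ["pos", "pos", "neg", "neg"]
model = train_nb(docs, labels)

print(predict_nb(model, "good win"))
print(predict_nb(model, "awful lose"))
```

The `alpha` argument here plays the same role as the `clf__alpha` values tuned in the grid search below: it controls how much probability mass is reserved for word and class pairs that never occurred together in the training data.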

**Fill in the Blanks**

## Importing required libraries

In [88]:
import pandas as pd
import re
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

## Reading dataset

In [89]:
data=pd.read_csv('tweets.csv')
data.drop(data.columns[0],axis=1,inplace=True)
data.head()

|   | tweets | labels |
|---|--------|--------|
| 0 | Obama has called the GOP budget social Darwini... | 1 |
| 1 | In his teen years, Obama has been known to use... | 0 |
| 2 | IPA Congratulates President Barack Obama for L... | 0 |
| 3 | RT @Professor_Why: #WhatsRomneyHiding - his co... | 0 |
| 4 | RT @wardollarshome: Obama has approved more ta... | 1 |


## Text processing for the tweets

In [90]:
import nltk 
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Mutum\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Mutum\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [91]:
from nltk.tokenize import word_tokenize
from string import punctuation
from nltk.corpus import stopwords

# English stop words, punctuation, and the placeholder tokens inserted by processTweet
stop_words = set(stopwords.words('english') + list(punctuation) + ['AT_USER', 'URL'])

def processTweet(tweet):
    # tweet is the raw text to preprocess
    # convert the tweet to lower case
    tweet = str(tweet).lower()

    #--Fill--
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)  # replace URLs with the placeholder URL
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)  # replace usernames with the placeholder AT_USER
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)  # remove the # in #hashtag
    print(tweet)  # debug output: show the normalized tweet
    # use word_tokenize imported above to tokenize the tweet
    #--Fill--
    tokens = word_tokenize(tweet)
    # drop stop words, punctuation, and the URL / AT_USER placeholders
    return [word for word in tokens if word not in stop_words]
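
The normalization step can be tried in isolation on a single tweet. The sketch below reuses the three regular expressions from `processTweet`; the helper name `normalize` and the example tweet are made up, and a plain whitespace split stands in for `word_tokenize` so that the snippet runs without any NLTK data.

```python
import re

def normalize(tweet):
    """Lower-case a tweet and apply the same three substitutions as processTweet."""
    tweet = str(tweet).lower()
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)  # URL placeholder
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)                        # username placeholder
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)                          # strip the leading #
    return tweet

example = "RT @SomeUser: Check http://example.com #Election2012"  # invented example
normalized = normalize(example)
print(normalized)  # rt AT_USER check URL election2012

# Whitespace split as a stand-in for word_tokenize (assumption for this sketch)
tokens = [t for t in normalized.split() if t not in {'AT_USER', 'URL'}]
print(tokens)      # ['rt', 'check', 'election2012']
```

Note that `@[^\s]+` matches up to the next whitespace character, so the colon after the username is swallowed together with it. The placeholders are upper case on purpose: they are inserted after lower-casing, so they can be matched exactly in the stop-word set.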

## Process all tweets

In [92]:
processed=[]

for tweet in data['tweets']:
    
    # process all tweets using processTweet function above - store in variable 'cleaned' 
    cleaned=processTweet(tweet)
    processed.append(' '.join(cleaned))

obama has called the gop budget social darwinism. nice try, but they believe in social creationism.
in his teen years, obama has been known to use marijuana and cocaine.
ipa congratulates president barack obama for leadership regarding jobs act: washington, apr 05, 2012 (business w... URL
rt AT_USER whatsromneyhiding - his connection to supporters of critical race theory.... oh wait, that was obama, not romney...
rt AT_USER obama has approved more targeted assassinations than any modern us prez; read & rt: URL
video shows federal officials joking about cost of lavish conference URL obama crime p2 news tcot teaparty
one chicago kid who says "obama is my man" tells jesse watters that the gun violence in chicago is like "world war 17"
rt AT_USER american kid "you're from the uk? ohhh cool, so do you have tea with the queen?". british kid: "do you like, go to mcdonalds with obama?
a valid explanation for why obama won't let women on the golf course.   whatsromneyhiding
president obama &lt;

rt AT_USER american kid "you're from the uk? ohhh cool, so do you have tea with the queen?". british kid: "do you like, go to mcdonalds with obama?
during the dnc convention later this summer in charlotte, n.c. there will be an extreme heat wave. all that hot air from speakers & obama.
obama reveals how to get national unemployment rate to 6% URL
AT_USER you're right! blame obama, he could do something if he wanted to. like open up alaska for drilling maybe green light a refinery
rt AT_USER women at augusta? obama says absolutely. romney hedges. comment on fb or tweet so we can share on edshow!
whatsromneyhiding the person who refuses to let obama be clear
rt AT_USER obama campaign launches whatsromneyhiding campaign on romney's tax returns featuring a AT_USER tweet.
rt AT_USER obama's college records? whatsromneyhiding
rt AT_USER american kid "you're from the uk? ohhh cool, so do you have tea with the queen?". british kid: "do you like, go to mcdonalds with obama?
AT_USER looks like b

In [93]:
data['processed'] = processed

## Create pipeline and define parameters for GridSearch

In [94]:
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

tuned_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': [1, 1e-1, 1e-2]
}
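
Each key in `tuned_parameters` uses the `step__parameter` convention, so `vect__ngram_range` sets `ngram_range` on the pipeline step named `vect`. A quick way to see how much work the search will do is to count the combinations; the sketch below does this with the standard library and assumes the 10-fold cross-validation used later in this notebook.

```python
from itertools import product

# Same grid as tuned_parameters above
tuned_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': [1, 1e-1, 1e-2],
}

# Every combination of one value per parameter
combinations = list(product(*tuned_parameters.values()))
n_candidates = len(combinations)    # 3 * 2 * 2 * 3
n_folds = 10                        # 10-fold CV, as used for GridSearchCV in this notebook
n_fits = n_candidates * n_folds     # pipeline fits during the search, before the final refit

print(n_candidates, n_fits)         # 36 360
```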

## Split data into test and train

In [95]:
# split data into train and test sets, holding out 20% of the rows for testing
X = data.processed
y = data.labels

#--Fill--

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
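
The split above is purely random, so the share of each class in the test set can drift from run to run. One possible variation, not what this notebook does, is a stratified split. The sketch below uses synthetic labels with the same class counts that `data.labels.value_counts()` reports further down; `X_demo`, `y_demo`, and the seed are placeholders.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as this dataset (947 / 352 / 81)
y_demo = [0] * 947 + [1] * 352 + [2] * 81
X_demo = list(range(len(y_demo)))  # stand-in for the processed tweets

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    stratify=y_demo,   # keep class proportions the same in both parts
    random_state=42,   # arbitrary seed, chosen only for repeatability
)

test_counts = Counter(y_te)
print(len(y_te), sorted(test_counts.items()))
```

With stratification, the 276 test rows always contain 16 tweets of class 2, rather than a number that depends on the random draw.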

## Perform classification (using GridSearch)

In [96]:
# perform GridSearchCV with 10-fold CV using the pipeline and tuned_parameters defined above

clf = GridSearchCV(estimator=text_clf, param_grid=tuned_parameters, cv=10)  # scoring defaults to the estimator's score method (accuracy here)
clf.fit(x_train, y_train)

GridSearchCV(cv=10, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        pre
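
After fitting, a `GridSearchCV` object exposes the winning parameter combination and its cross-validated score. The sketch below shows those attributes on a tiny made-up corpus with a reduced grid and 2-fold CV; the documents, labels, and variable names are invented, and the numbers it prints say nothing about the tweet dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny invented corpus, only to show the attributes of a fitted search
docs = [
    "great win for the team", "what a great day", "love this great news", "good win today",
    "awful loss for the team", "what a bad day", "hate this awful news", "bad loss today",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', MultinomialNB())])
grid = {'vect__ngram_range': [(1, 1), (1, 2)], 'clf__alpha': [1, 1e-1]}

search = GridSearchCV(estimator=pipe, param_grid=grid, cv=2)
search.fit(docs, labels)

print(search.best_params_)            # best parameter combination found
print(round(search.best_score_, 3))   # mean cross-validated score of that combination
best_model = search.best_estimator_   # pipeline refit on all of docs with best_params_
print(best_model.predict(["great news today"]))
```

Because `refit=True` is the default, calling `predict` on the search object itself uses `best_estimator_`, which is why `clf.predict(x_test)` works directly in the cells that follow.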

## Classification report 

In [97]:
clf.predict(x_test)

array([0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 2, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0,
       0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0], dtype=int64)

In [99]:
# print classification report after predicting on test set with best model obtained in GridSearch
#--Fill--
y_pred = clf.predict(x_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.89      0.92      0.90       196
           1       0.74      0.74      0.74        65
           2       0.56      0.33      0.42        15

    accuracy                           0.84       276
   macro avg       0.73      0.66      0.69       276
weighted avg       0.84      0.84      0.84       276
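
The two average rows differ only in how the per-class values are combined: the macro average treats every class equally, while the weighted average weights each class by its support. The sketch below recomputes both for precision and f1-score from the rounded values printed in the report, so results can differ from the report in the last digit.

```python
# Per-class rows copied from the report above: (precision, f1-score, support)
rows = {
    0: (0.89, 0.90, 196),
    1: (0.74, 0.74, 65),
    2: (0.56, 0.42, 15),
}
total = sum(support for _, _, support in rows.values())

# Macro average: plain mean over classes
macro_precision = sum(p for p, _, _ in rows.values()) / len(rows)
macro_f1 = sum(f for _, f, _ in rows.values()) / len(rows)

# Weighted average: mean over classes, weighted by support
weighted_precision = sum(p * s for p, _, s in rows.values()) / total
weighted_f1 = sum(f * s for _, f, s in rows.values()) / total

print(f"macro precision    {macro_precision:.2f}")
print(f"weighted precision {weighted_precision:.2f}")
print(f"macro f1           {macro_f1:.2f}")
print(f"weighted f1        {weighted_f1:.2f}")
```

The weighted figures are dominated by class 0, which accounts for 196 of the 276 test tweets. The macro figures are lower because the weak scores for class 2 count as much as the strong scores for class 0.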



## Important:

In [100]:
counts = data.labels.value_counts()
print(counts)

0    947
1    352
2     81
Name: labels, dtype: int64
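
To put these counts in perspective, the short sketch below converts them to shares and computes the accuracy of a trivial baseline that always predicts the most frequent class. The counts are copied from the output above; the variable names are illustrative.

```python
counts = {0: 947, 1: 352, 2: 81}   # copied from the value_counts() output above
total = sum(counts.values())

shares = {label: n / total for label, n in counts.items()}
for label, n in counts.items():
    print(f"class {label}: {n} tweets ({shares[label]:.1%})")

# Accuracy of always predicting the largest class, on data with this distribution
majority_baseline = max(shares.values())
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

A classifier has to beat roughly 69% accuracy before it is doing better than ignoring the text entirely, which is worth keeping in mind when reading the 0.84 accuracy in the report.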


The counts above show that the class distribution is highly imbalanced: class 0 has 947 tweets, while class 2 has only 81. The classifier therefore sees few examples of the minority classes, which is consistent with the low recall for class 2 (0.33) in the classification report. As an exercise, try using [SMOTE](https://imbalanced-learn.readthedocs.io/en/stable/api.html) to oversample the minority classes, then retrain the Naive Bayes model and compare its performance.
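
As a warm-up before trying SMOTE, the sketch below shows plain random oversampling using only the standard library. This is a simpler technique swapped in for illustration, not SMOTE: it duplicates existing minority-class samples, whereas SMOTE synthesizes new points by interpolating between neighbours in feature space and so works on vectorized features rather than raw text. The function name `random_oversample` and the toy data are made up.

```python
import random
from collections import Counter

def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen samples until every class is as large as the biggest one."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())

    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # draw extra copies (with replacement) to reach the target size
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels

# Toy imbalanced data, invented for this sketch
texts = [f"tweet {i}" for i in range(10)]
labels = [0] * 6 + [1] * 3 + [2] * 1

new_texts, new_labels = random_oversample(texts, labels)
print(Counter(labels))       # original class sizes
print(Counter(new_labels))   # every class now has 6 samples
```

Whichever oversampling method is used, apply it to the training split only. Oversampling before the train/test split would put copies of the same tweet on both sides and inflate the test scores.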