# Sentiment Analysis using Naive Bayes

In this assignment, we label tweets with a sentiment (positive, neutral, or negative) using a Naive Bayes classifier. Naive Bayes is a very simple approach to this problem, but it sometimes gives surprisingly good accuracy.
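To make the idea concrete, here is a minimal from-scratch sketch of multinomial Naive Bayes with additive (Laplace) smoothing. It is not the scikit-learn implementation used later in this notebook; the function names `train_nb` and `predict_nb`, the toy documents, and the `"pos"`/`"neg"` labels are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit per-class log priors and smoothed log word likelihoods."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    model = {}
    for label, n_docs_in_class in class_counts.items():
        total_words = sum(word_counts[label].values())
        denom = total_words + alpha * len(vocab)  # additive smoothing
        log_prior = math.log(n_docs_in_class / len(docs))
        log_likelihood = {
            w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict_nb(model, tokens):
    """Return the class with the highest log posterior (unknown words are skipped)."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        scores[label] = log_prior + sum(
            log_likelihood[w] for w in tokens if w in log_likelihood
        )
    return max(scores, key=scores.get)

# Hypothetical toy corpus, already tokenized
docs = [
    ["great", "flight"],
    ["love", "great", "service"],
    ["terrible", "delay"],
    ["hate", "terrible", "service"],
]
labels = ["pos", "pos", "neg", "neg"]

model = train_nb(docs, labels)
prediction = predict_nb(model, ["great", "service"])
print(prediction)  # pos
```

The "naive" part is the sum over per-word log likelihoods: each word is treated as independent of the others given the class.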

**Fill in the Blanks**

## Importing required libraries

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
%cd /content/drive/My Drive/Colab Notebooks/END/END1
!ls

/content/drive/My Drive/Colab Notebooks/END/END1
'POS Tagging based on Heuristics.ipynb'        Tweets_Airline.csv
'Sentiment Analysis using Naive Bayes.ipynb'   tweets.csv
'Sentiment Analysis using SVM.ipynb'


In [None]:
import pandas as pd
import re
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV,StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

## Reading dataset

In [None]:
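data = pd.read_csv('tweets.csv')
data.drop(data.columns[0], axis=1, inplace=True)  # drop the first column
data.head()  # no visible effect here: only a cell's last expression is displayed
# data.count is printed without parentheses, so the output below is the bound
# method's repr (which embeds the DataFrame); data.count() would give per-column counts
print(data.count)
# print(data['tweets'].empty)
data.dropna(axis=0, inplace=True)  # drop rows with missing values
print(data.count)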

<bound method DataFrame.count of                                                  tweets  labels
0     Obama has called the GOP budget social Darwini...       1
1     In his teen years, Obama has been known to use...       0
2     IPA Congratulates President Barack Obama for L...       0
3     RT @Professor_Why: #WhatsRomneyHiding - his co...       0
4     RT @wardollarshome: Obama has approved more ta...       1
...                                                 ...     ...
1375  @liberalminds Its trending idiot.. Did you loo...       0
1376  RT @AstoldByBass: #KimKardashiansNextBoyfriend...       0
1377  RT @GatorNation41: gas was $1.92 when Obama to...       1
1378  @xShwag haha i know im just so smart, i mean y...       1
1379  #OBAMA:  DICTATOR IN TRAINING.  If he passes t...       0

[1380 rows x 2 columns]>
<bound method DataFrame.count of                                                  tweets  labels
0     Obama has called the GOP budget social Darwini...       1
1     In his

## Text processing for the tweets

In [None]:
import nltk 
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from nltk.tokenize import word_tokenize
from string import punctuation 
from nltk.corpus import stopwords 

# Use a separate name so the imported nltk `stopwords` module is not shadowed
stop_words = set(stopwords.words('english') + list(punctuation) + ['AT_USER', 'URL'])

def processTweet(tweet):
    # tweet is the text we will pass for preprocessing
    # convert passed tweet to lower case
    # --Fill--
    tweet = tweet.lower()
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)  # replace URLs with the placeholder 'URL'
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)  # replace usernames with the placeholder 'AT_USER'
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)  # remove the # in #hashtag

    # use word_tokenize imported above to tokenize the tweet
    # --Fill--
    tweet = word_tokenize(tweet)
    # drop stop words, punctuation, and the two placeholders
    return [word for word in tweet if word not in stop_words]
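The three substitutions in `processTweet` can be tried in isolation. The sketch below uses only the standard library and leaves out the NLTK tokenization and stop-word filtering; the helper name `normalize_tweet` and the sample tweet are made up for illustration.

```python
import re

def normalize_tweet(text):
    """Apply the same lowercase + regex substitutions, without tokenizing."""
    text = text.lower()
    text = re.sub(r'(www\.\S+)|(https?://\S+)', 'URL', text)  # URLs -> placeholder
    text = re.sub(r'@\S+', 'AT_USER', text)                   # usernames -> placeholder
    text = re.sub(r'#(\S+)', r'\1', text)                     # strip the leading '#'
    return text

# Hypothetical sample tweet
sample = "RT @someone: Loving the #NewPolicy! Details at https://example.com/x"
normalized = normalize_tweet(sample)
print(normalized)  # rt AT_USER loving the newpolicy! details at URL
```

Two details are worth noticing. Lowercasing happens before the substitutions, so the uppercase placeholders `AT_USER` and `URL` survive and can later be filtered out as stop words. The username pattern also consumes trailing punctuation such as the colon in `@someone:`, because `\S+` matches any run of non-whitespace characters.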

## Process all tweets

In [None]:
processed = []
for i, tweet in enumerate(data['tweets'], start=1):
    # process all tweets using processTweet function above - store in variable 'cleaned'
    cleaned = processTweet(tweet)
    if i < 10:
        print(cleaned)  # preview the first few tokenized tweets
    processed.append(' '.join(cleaned))

['obama', 'called', 'gop', 'budget', 'social', 'darwinism', 'nice', 'try', 'believe', 'social', 'creationism']
['teen', 'years', 'obama', 'known', 'use', 'marijuana', 'cocaine']
['ipa', 'congratulates', 'president', 'barack', 'obama', 'leadership', 'regarding', 'jobs', 'act', 'washington', 'apr', '05', '2012', 'business', 'w', '...']
['rt', 'whatsromneyhiding', 'connection', 'supporters', 'critical', 'race', 'theory', '...', 'oh', 'wait', 'obama', 'romney', '...']
['rt', 'obama', 'approved', 'targeted', 'assassinations', 'modern', 'us', 'prez', 'read', 'rt']
['video', 'shows', 'federal', 'officials', 'joking', 'cost', 'lavish', 'conference', 'obama', 'crime', 'p2', 'news', 'tcot', 'teaparty']
['one', 'chicago', 'kid', 'says', '``', 'obama', 'man', "''", 'tells', 'jesse', 'watters', 'gun', 'violence', 'chicago', 'like', '``', 'world', 'war', '17', "''"]
['rt', 'american', 'kid', '``', "'re", 'uk', 'ohhh', 'cool', 'tea', 'queen', "''", 'british', 'kid', '``', 'like', 'go', 'mcdonalds', '

In [None]:
data['processed'] = processed
data

| | tweets | labels | processed |
|---|---|---|---|
| 0 | Obama has called the GOP budget social Darwini... | 1 | obama called gop budget social darwinism nice ... |
| 1 | In his teen years, Obama has been known to use... | 0 | teen years obama known use marijuana cocaine |
| 2 | IPA Congratulates President Barack Obama for L... | 0 | ipa congratulates president barack obama leade... |
| 3 | RT @Professor_Why: #WhatsRomneyHiding - his co... | 0 | rt whatsromneyhiding connection supporters cri... |
| 4 | RT @wardollarshome: Obama has approved more ta... | 1 | rt obama approved targeted assassinations mode... |
| ... | ... | ... | ... |
| 1375 | @liberalminds Its trending idiot.. Did you loo... | 0 | trending idiot.. look tweets lol making fun ob... |
| 1376 | RT @AstoldByBass: #KimKardashiansNextBoyfriend... | 0 | rt kimkardashiansnextboyfriend barack obama |
| 1377 | RT @GatorNation41: gas was $1.92 when Obama to... | 1 | rt gas 1.92 obama took office ... guess promis... |
| 1378 | @xShwag haha i know im just so smart, i mean y... | 1 | haha know im smart mean got ta listen obama cu... |


## Create pipeline and define parameters for GridSearch

In [None]:
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

tuned_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': [1, 1e-1, 1e-2]
}
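It helps to know how much work this grid implies before running the search. The sketch below enumerates the parameter combinations with the standard library only, assuming the 10-fold cross-validation used later in this notebook; the variable names `combinations`, `n_candidates`, and `n_fits` are illustrative.

```python
from itertools import product

tuned_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': [1, 1e-1, 1e-2],
}

# Every combination of one value per parameter is one candidate model
names = list(tuned_parameters)
combinations = [
    dict(zip(names, values)) for values in product(*tuned_parameters.values())
]

n_folds = 10  # assumption: matches the StratifiedKFold used later
n_candidates = len(combinations)
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # 36 360
```

The `step__parameter` naming (double underscore) is how a scikit-learn `Pipeline` routes each value to the right step, for example `clf__alpha` to the `alpha` argument of the `clf` step. With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training set.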

## Split data into test and train

In [None]:
# split data into train and test sets, holding out 20% for testing
X = data.processed
y = data.labels

# --Fill--
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(len(x_train), len(x_test))

1100 275


## Perform classification (using GridSearch)

In [None]:
# perform GridSearchCV with 10-fold CV using the pipeline and tuned_parameters defined above
# clf = --Fill--
kfolds = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
clf = GridSearchCV(estimator=text_clf, param_grid=tuned_parameters, cv=kfolds)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)  # predicts with the best estimator, refit on all of x_train

## Classification report 

In [None]:
# print classification report after predicting on test set with best model obtained in GridSearch
# --Fill--
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.83      0.92      0.87       190
           1       0.71      0.67      0.69        67
           2       1.00      0.17      0.29        18

    accuracy                           0.81       275
   macro avg       0.85      0.58      0.62       275
weighted avg       0.81      0.81      0.79       275
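The two average rows in the report can be reproduced from the per-class rows. The sketch below recomputes them from the rounded values printed above, so results can differ from the report by about 0.01 (macro recall comes out near 0.59 rather than 0.58). The dictionary layout and helper names are illustrative.

```python
# Per-class values copied from the classification report above (already rounded)
report = {
    0: {"precision": 0.83, "recall": 0.92, "f1": 0.87, "support": 190},
    1: {"precision": 0.71, "recall": 0.67, "f1": 0.69, "support": 67},
    2: {"precision": 1.00, "recall": 0.17, "f1": 0.29, "support": 18},
}

def macro_avg(metric):
    """Unweighted mean: every class counts equally, however rare it is."""
    return sum(row[metric] for row in report.values()) / len(report)

def weighted_avg(metric):
    """Mean weighted by support: large classes dominate the result."""
    total = sum(row["support"] for row in report.values())
    return sum(row[metric] * row["support"] for row in report.values()) / total

for metric in ("precision", "recall", "f1"):
    print(metric, round(macro_avg(metric), 2), round(weighted_avg(metric), 2))
```

The gap between the macro and weighted recall shows how much the weighted figures are carried by class 0. A recall of 0.17 on 18 test tweets is consistent with roughly 3 of them being identified correctly.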



## Important:

In [None]:
counts = data.labels.value_counts()
print(counts)

0    942
1    352
2     81
Name: labels, dtype: int64


The counts above show a highly imbalanced class distribution: class 0 has 942 tweets, while class 2 has only 81. With so few examples of the minority class, the classifier rarely predicts it, which is consistent with the recall of 0.17 for class 2 in the classification report. For your own learning, try using [SMOTE](https://imbalanced-learn.readthedocs.io/en/stable/api.html) to oversample the minority classes in the training set, then evaluate Naive Bayes again and compare the results.
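As a starting point, the sketch below shows plain random oversampling using only the standard library. This is a simpler stand-in and not SMOTE: it duplicates existing minority examples, whereas SMOTE synthesizes new points by interpolating between numeric feature vectors and would therefore have to be applied after vectorization. The function name `random_oversample` and the placeholder texts are made up; only the class counts come from the output above.

```python
import random
from collections import Counter

def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class examples until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Sample with replacement to fill the gap up to the target size
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels

# Placeholder data with the same class counts as the notebook's dataset
labels = [0] * 942 + [1] * 352 + [2] * 81
texts = [f"tweet {i}" for i in range(len(labels))]

balanced_texts, balanced_labels = random_oversample(texts, labels)
balanced_counts = Counter(balanced_labels)
print(balanced_counts)
```

In a real experiment, resampling should be applied to the training split only, so that the test set keeps the original class distribution and contains no duplicated training tweets.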