# **Sentiment Analysis of Twitter Data using Data Mining Techniques**

This notebook trains a model to perform sentiment analysis on Twitter data, then saves the trained model and the fitted vectorizer as pickle files.

Sentiment analysis is performed here using TF-IDF for vectorization and Multinomial Naive Bayes for classification.
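
As a rough illustration of how the two pieces fit together, the sketch below fits a TF-IDF vectorizer and a Multinomial Naive Bayes classifier on a tiny made-up corpus. The texts, labels, and `toy_*` names are assumptions for illustration only and are not taken from the dataset used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus (not from the Sentiment140 data).
toy_texts = [
    "i love this phone",
    "what a great day",
    "i hate this phone",
    "what a terrible day",
]
toy_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# Step 1: turn each text into a sparse vector of TF-IDF weights.
toy_vectorizer = TfidfVectorizer()
toy_features = toy_vectorizer.fit_transform(toy_texts)

# Step 2: fit Multinomial Naive Bayes on those weighted term vectors.
toy_model = MultinomialNB()
toy_model.fit(toy_features, toy_labels)

# New text must go through the same fitted vectorizer before prediction.
toy_probs = toy_model.predict_proba(toy_vectorizer.transform(["love this great day"]))
print(toy_probs)  # one row: [P(negative), P(positive)]
```

The vectorizer is fitted once and then reused with `transform` for new text, which is why both the model and the vectorizer need to be saved later in the notebook.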

## Import Packages

In [0]:
!pip install emoji

Collecting emoji
  Downloading https://files.pythonhosted.org/packages/40/8d/521be7f0091fe0f2ae690cc044faf43e3445e0ff33c574eae752dd7e39fa/emoji-0.5.4.tar.gz (43kB)
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-0.5.4-cp36-none-any.whl size=42176 sha256=04981daabc3a116f2cfbb0b78c4e0b1806ab15d5e4532399d8212f280fa1d303
  Stored in directory: /root/.cache/pip/wheels/2a/a9/0a/4f8e8cce8074232aba240caca3fade315bb49fac68808d1a9c
Successfully built emoji
Installing collected packages: emoji
Successfully installed emoji-0.5.4


In [0]:
import numpy as np
import pandas as pd
import re
import pickle
import emoji
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score 
from sklearn.feature_extraction.text import TfidfVectorizer

## Import Data and Mount Google Drive Folder

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [0]:
tweet_dataset = pd.read_csv("/content/gdrive/My Drive/dmw_project/training.1600000.processed.noemoticon.csv",
                names=['sentiment', 'id', 'date', 'query', 'user', 'text'],
                encoding='latin-1')
tweet_dataset.head()

| | sentiment | id | date | query | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


## Preprocess Data

In [0]:
# Number of examples for each class
tweet_dataset.sentiment.value_counts()

4    800000
0    800000
Name: sentiment, dtype: int64

In [0]:
# Drop unnecessary columns
tweet_dataset = tweet_dataset.drop(columns=['id', 'date', 'query', 'user'])
tweet_dataset.head()

| | sentiment | text |
|---|---|---|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | is upset that he can't update his Facebook by ... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... |


In [0]:
# Change the polarity value 4 to 1, so the classes are now 0 (negative) and 1 (positive).
tweet_dataset.sentiment = tweet_dataset.sentiment.replace({0: 0, 4: 1})
tweet_dataset.sentiment.value_counts()

1    800000
0    800000
Name: sentiment, dtype: int64

In [0]:
# Save only the required columns in a new CSV
tweet_dataset.to_csv("/content/gdrive/My Drive/dmw_project/sentiment140-subset.csv", index=False)

In [0]:
def preprocess_tweet(tweet):
	"""Gets and returns processed tweets

	Parameters
	----------
	tweet : str
			String containing text of the tweet
	
	Returns
	-------
	str
			Processed and cleaned text of the tweet
	"""
	# Convert the tweet to lower case (str.lower() returns a new string, so reassign)
	tweet = tweet.lower()
	
	# Replace all URLs with a space
	tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', ' ', tweet)
	
	# Replace all @username mentions with a space
	tweet = re.sub(r'@[^\s]+', ' ', tweet)
	
	# Convert "#topic" to just "topic"
	tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
	
	# Remove colons and the mis-encoded ellipsis sequence
	tweet = re.sub(r':', '', tweet)
	tweet = re.sub(r'‚Ä¶', '', tweet)
	
	# Replace runs of non-ASCII characters with a space
	tweet = re.sub(r'[^\x00-\x7F]+', ' ', tweet)

	# Replace runs of punctuation (and any trailing spaces) with a single space
	tweet = re.sub(r"""
              [,.;@#?!&$"']+
              \ *  
              """,
              " ", 
              tweet, flags=re.VERBOSE)
 
	# Replace emojis with their text names (emojis are non-ASCII, so the step
	# above has already removed them by the time this line runs)
	tweet = emoji.demojize(tweet)

	# Collapse multiple whitespace characters into a single space
	tweet = re.sub(r'\s+', ' ', tweet)
	
	return tweet
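
To make the effect of the cleaning steps easier to see, here is a minimal sketch that applies only the URL, mention, and hashtag substitutions to a made-up tweet. The helper name `strip_twitter_markup`, the sample text, and the final `.strip()` call are assumptions for illustration; the function above does not trim leading or trailing spaces.

```python
import re

def strip_twitter_markup(text):
    """Hypothetical condensed helper: URL, mention and hashtag steps only."""
    text = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', ' ', text)  # URLs -> space
    text = re.sub(r'@[^\s]+', ' ', text)                             # @user -> space
    text = re.sub(r'#([^\s]+)', r'\1', text)                         # "#topic" -> "topic"
    # Collapse whitespace; .strip() is an addition not present in preprocess_tweet
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample tweet for illustration.
sample = "@switchfoot http://twitpic.com/2y1zl loving the #weather today"
print(strip_twitter_markup(sample))  # prints "loving the weather today"
```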

In [0]:
#apply the preprocess_tweet function for all the tweets in the dataset
tweet_dataset['text'] = tweet_dataset['text'].apply(preprocess_tweet)

# Independent variable / features
data = np.array(tweet_dataset.text)

# Dependent variable / label
label = np.array(tweet_dataset.sentiment)

## Train Model

In [0]:
# Convert text tokens into vectors
tfv = TfidfVectorizer(sublinear_tf=True, stop_words = "english")
features = tfv.fit_transform(data)

# Train the model based on the vectors
model = MultinomialNB()
model.fit(features, label)

# Predict the probability of the positive class for the training data
probability_to_be_positive = model.predict_proba(features)[:,1]

# ROC AUC score, computed on the same data the model was trained on
print("auc (train data):", roc_auc_score(label, probability_to_be_positive))

# Print the first 5 scores as a sanity check
print("top 5 scores: ", probability_to_be_positive[:5])

auc (train data): 0.8760825103968751
top 5 scores:  [0.23989194 0.16944144 0.34234499 0.09660771 0.35207738]
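
The AUC above is measured on the training data, so it is likely to be optimistic. One common way to get a fairer estimate is to hold out part of the data, fit the vectorizer and model on the rest, and score only the held-out part. The sketch below shows that idea; the helper `holdout_auc`, its parameters, and the toy corpus are assumptions for illustration and are not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def holdout_auc(texts, labels, test_size=0.2, seed=0):
    """Hypothetical helper: fit on a training split, score AUC on the held-out split."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels)
    # Fit the vectorizer on the training split only, so no test text leaks in.
    vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")
    model = MultinomialNB()
    model.fit(vectorizer.fit_transform(x_train), y_train)
    test_probs = model.predict_proba(vectorizer.transform(x_test))[:, 1]
    return roc_auc_score(y_test, test_probs)

# Made-up toy corpus, only to show that the helper runs end to end.
toy_texts = [
    "love this sunny morning", "great coffee today", "happy with the new phone",
    "wonderful concert last night", "love the great weather",
    "hate this rainy morning", "terrible coffee today", "sad about the broken phone",
    "awful traffic last night", "hate the terrible weather",
]
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

toy_auc = holdout_auc(toy_texts, toy_labels, test_size=0.4)
print("auc (held-out toy data):", toy_auc)
```

With the notebook's own arrays, the equivalent call would be `holdout_auc(data, label)`; the resulting score is not reported here because it was not computed in the original run.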


## Play with the trained model

In [0]:
#@title Interactive Input: Enter the tweet

tweet = 'You are ugly' #@param {type:"string"}
tweets = [tweet]  # renamed from `input` to avoid shadowing the built-in input()
test_data = np.asarray(tweets)
features1 = tfv.transform(test_data)
probability_to_be_positive1 = model.predict_proba(features1)
# Columns are [P(negative), P(positive)]; below the 0.70 cutoff the tweet is reported as neutral
if probability_to_be_positive1[0][1] > 0.70:
  print("Positive Tweet")
elif probability_to_be_positive1[0][0] > 0.70:
  print("Negative Tweet")
else:
  print("Neutral Tweet")
# print(probability_to_be_positive1)

Negative Tweet
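
The model is trained on two classes only, so "Neutral" in the cell above is simply the band where neither class probability exceeds the cutoff. Because the two probabilities sum to 1, with a cutoff of 0.70 that band is a positive probability between 0.30 and 0.70. The sketch below isolates that rule; the function name `label_from_probabilities` and its parameters are assumptions for illustration.

```python
def label_from_probabilities(p_negative, p_positive, threshold=0.70):
    """Hypothetical helper mirroring the three-way rule used in the cell above."""
    if p_positive > threshold:
        return "Positive Tweet"
    if p_negative > threshold:
        return "Negative Tweet"
    # Neither class is confident enough: report as neutral.
    return "Neutral Tweet"

print(label_from_probabilities(0.9, 0.1))  # prints "Negative Tweet"
print(label_from_probabilities(0.4, 0.6))  # prints "Neutral Tweet"
print(label_from_probabilities(0.2, 0.8))  # prints "Positive Tweet"
```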


## Save the trained model and vectorizer

The pickle package is used to save the trained model and the fitted vectorizer.

In [0]:
filename = 'finalized_model.sav'
# Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as model_file:
    pickle.dump(model, model_file)

In [0]:
filename = 'finalized_tfv.sav'
# Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as tfv_file:
    pickle.dump(tfv, tfv_file)
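
For completeness, the following sketch shows how the two saved files could be loaded back and used together. It fits small stand-in objects on a made-up corpus and writes them to a temporary directory, so every `toy_*` name and the sample texts are assumptions for illustration; only the file names match the ones used above.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy stand-ins for the notebook's fitted `tfv` and `model`.
toy_texts = ["great day", "lovely morning", "awful day", "horrible morning"]
toy_labels = [1, 1, 0, 0]
toy_tfv = TfidfVectorizer()
toy_model = MultinomialNB().fit(toy_tfv.fit_transform(toy_texts), toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "finalized_model.sav")
    tfv_path = os.path.join(tmp_dir, "finalized_tfv.sav")

    # Save both objects, closing each file handle when done.
    with open(model_path, "wb") as model_file:
        pickle.dump(toy_model, model_file)
    with open(tfv_path, "wb") as tfv_file:
        pickle.dump(toy_tfv, tfv_file)

    # Load them back, as a separate script might do later.
    with open(model_path, "rb") as model_file:
        loaded_model = pickle.load(model_file)
    with open(tfv_path, "rb") as tfv_file:
        loaded_tfv = pickle.load(tfv_file)

# The restored pair should give the same probabilities as the originals.
new_tweets = ["what a great morning"]
original_probs = toy_model.predict_proba(toy_tfv.transform(new_tweets))
restored_probs = loaded_model.predict_proba(loaded_tfv.transform(new_tweets))
print(abs(original_probs - restored_probs).max())
```

Two practical cautions apply: pickle files should only be loaded from sources you trust, and scikit-learn objects are best unpickled with the same library version that created them.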

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [0]:
!cp -i /content/data/training.1600000.processed.noemoticon.csv /content/gdrive/My\ Drive/dmw_project

In [0]:
!cp -i /content/finalized_model.sav /content/gdrive/My\ Drive/dmw_project
!cp -i /content/finalized_tfv.sav /content/gdrive/My\ Drive/dmw_project

## **References**
https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/

https://medium.com/greyatom/lets-learn-about-auc-roc-curve-4a94b4d88152

https://www.kaggle.com/kazanova/sentiment140

Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.