# Twitter Sentiment Analysis

### Dataset source : Sentiment140

The data is a CSV file with emoticons removed. Each row has 6 fields:  
0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)  
1 - the id of the tweet (2087)  
2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)  
3 - the query (lyx). If there is no query, then this value is NO_QUERY.  
4 - the user that tweeted (robotickilldozr)  
5 - the text of the tweet (Lyx is cool) 

## Disclaimer :  
### In the training set, we only use tweets scored Positive or Negative. So we make the assumption that every tweet is either positive or negative, never neutral.
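As a minimal sketch of that assumption, the snippet below parses a few rows in the six-field layout described above and keeps only polarities 0 and 4. The sample rows are invented for illustration: the first reuses the example values from the field list with an assumed polarity of 4, and the other two (including the user names) are made up. The field names are the ones the notebook assigns to the columns during preprocessing.

```python
import csv
import io

# Field names, in file order (the same names the notebook gives the columns)
FIELDS = ["sentiment_score", "id", "date", "query", "author", "tweet"]

# Illustrative rows only: the polarities and the last two tweets are made up
SAMPLE = io.StringIO(
    '"4","2087","Sat May 16 23:58:44 UTC 2009","lyx","robotickilldozr","Lyx is cool"\n'
    '"0","2088","Sat May 16 23:59:01 UTC 2009","NO_QUERY","some_user","my phone died again"\n'
    '"2","2089","Sat May 16 23:59:30 UTC 2009","NO_QUERY","other_user","going to the store"\n'
)

rows = [dict(zip(FIELDS, record)) for record in csv.reader(SAMPLE)]

# Keep only negative (0) and positive (4) tweets, dropping neutral (2)
binary_rows = [row for row in rows if int(row["sentiment_score"]) in (0, 4)]

for row in binary_rows:
    label = "positive" if int(row["sentiment_score"]) == 4 else "negative"
    print(label, "-", row["tweet"])
```

With real data, the same check (which polarity values actually occur) is a quick way to confirm that the training file contains no neutral rows before treating the task as binary.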

#### Imports

In [1]:
# General imports

import pandas as pd
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Text analysis imports

# Natural Language Toolkit (NLTK). nltk.word_tokenize needs the "punkt" tokenizer models.
# Uncommenting nltk.download() below opens an installer window: go to the "Models" tab, select "punkt" in the "Identifier" column, then click "Download".
import nltk  
#nltk.download()  
from nltk.stem import PorterStemmer # For word normalization (stemming)
from sklearn.feature_extraction.text import CountVectorizer # For word frequencies
from sklearn.feature_extraction.text import TfidfTransformer # For word weighting

#### Read CSV

In [2]:
df = pd.read_csv("trainingandtestdata/training.1600000.processed.noemoticon.csv", header=None, encoding = "ISO-8859-1")

df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


#### Preprocessing

a/ Conversion into lowercase, punctuation handling and normalization of the tweets

In [3]:
cols = ['sentiment_score', 'id', 'date', 'query', 'author', 'tweet']

# Rename the columns
df.columns = cols

# Convert all tweets to lowercase
df['tweet'] = df.tweet.map(lambda x: x.lower())  

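# Remove punctuation (regex=True is needed: since pandas 2.0, str.replace treats the pattern as a literal string by default)
df['tweet'] = df.tweet.str.replace(r'[^\w\s]', '', regex=True)  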

# Tokenizer
df['tweet'] = df['tweet'].apply(nltk.word_tokenize) 

# Word stemming
# Reduce each word to its stem so that variations of a word (tense, plural) map to the same token
stemmer = PorterStemmer()
df['tweet'] = df['tweet'].apply(lambda x: [stemmer.stem(y) for y in x]) 

df.head()

| | sentiment_score | id | date | query | author | tweet |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | [switchfoot, httptwitpiccom2y1zl, awww, that, ... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | [is, upset, that, he, cant, updat, hi, faceboo... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | [kenichan, i, dive, mani, time, for, the, ball... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | [my, whole, bodi, feel, itchi, and, like, it, ... |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | [nationwideclass, no, it, not, behav, at, all,... |
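The first steps of this cell can be tried on single strings without loading the dataset. The sketch below is a simplified stand-in, not the notebook's exact pipeline: `clean_and_split` is a hypothetical helper, `str.split` replaces `nltk.word_tokenize`, and stemming is left out so that the snippet runs with the standard library only.

```python
import re

def clean_and_split(tweet):
    """Lowercase a tweet, drop punctuation, and split it on whitespace."""
    lowered = tweet.lower()
    no_punct = re.sub(r"[^\w\s]", "", lowered)  # same pattern as the notebook
    return no_punct.split()

print(clean_and_split("@switchfoot http://twitpic.com/2y1zl - Awww"))
# ['switchfoot', 'httptwitpiccom2y1zl', 'awww']
print(clean_and_split("is upset that he can't update his Facebook"))
# ['is', 'upset', 'that', 'he', 'cant', 'update', 'his', 'facebook']
```

Note how removing punctuation collapses a URL into a single opaque token and strips the `@` from mentions, which is also visible in the output table above.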


b/ Word frequency calculation in the tweets

In [4]:
# Convert the list of words into space-separated strings
df['tweet'] = df['tweet'].apply(lambda x: ' '.join(x))

# Transform the data into word occurrence counts
count_vect = CountVectorizer()  
counts = count_vect.fit_transform(df['tweet'])

# Apply term frequency-inverse document frequency weighting, better known as tf-idf
# The tf-idf value of a word grows with the number of times it appears in a tweet and is offset by the number of tweets that contain it
transformer = TfidfTransformer().fit(counts)
counts = transformer.transform(counts)  
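To make the weighting concrete, here is a small standard-library sketch of the computation, assuming the formula documented for `TfidfTransformer` with its default settings (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). `tfidf_rows` is a hypothetical helper and the three documents are invented; tokenization is a plain whitespace split rather than `CountVectorizer`'s token pattern.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return the sorted vocabulary and one L2-normalized tf-idf row per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed idf, as documented for TfidfTransformer(smooth_idf=True)
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm if norm else 0.0 for v in raw])
    return vocab, rows

vocab, rows = tfidf_rows(["lyx is cool", "lyx is slow", "cool cool day"])
weights = dict(zip(vocab, rows[1]))  # weights for "lyx is slow"
print({w: round(v, 3) for w, v in weights.items() if v > 0})
```

In the second document every word occurs once, yet `slow` ends up with a larger weight than `lyx` or `is` because it appears in only one of the three documents.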

#### Train the model

In [5]:
X_train, X_test, y_train, y_test = train_test_split(counts, df['sentiment_score'], test_size=0.2, random_state=42)  

model = MultinomialNB().fit(X_train, y_train) 
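Because the vectorizer and the tf-idf transformer are fitted on the full dataset before the split, the test tweets contribute to the vocabulary and to the idf statistics. One possible variation, which is not what the notebook does, is to wrap the three steps in a scikit-learn `Pipeline` and fit it on the training portion only. The sketch below uses invented toy sentences in place of the real tweets.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration (0 = negative, 4 = positive)
train_texts = ["love this song", "great day", "hate this rain", "bad day"]
train_labels = [4, 4, 0, 0]
test_texts = ["love great song", "hate bad rain"]

pipeline = Pipeline([
    ("counts", CountVectorizer()),   # word counts, vocabulary learned from training text only
    ("tfidf", TfidfTransformer()),   # idf statistics learned from training text only
    ("nb", MultinomialNB()),
])

pipeline.fit(train_texts, train_labels)
predicted = pipeline.predict(test_texts)
print(predicted)
```

With the real data, the split would be applied to the raw `df['tweet']` strings first, and the pipeline would then be fitted on the training strings.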

In [6]:
predicted = model.predict(X_test)
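For intuition about what the classifier estimates, the following is a simplified from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, the same smoothing value as `MultinomialNB`'s default `alpha=1.0`. It is an illustration under assumptions, not scikit-learn's implementation: `train_nb` and `predict_nb` are hypothetical names, the toy documents are invented, and it works on raw word counts rather than tf-idf weights.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    model = {}
    for label, n_docs in class_counts.items():
        # Smoothed denominator: total words in the class plus alpha per vocabulary word
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_likelihood": {
                w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab
            },
        }
    return model

def predict_nb(model, tokens):
    """Return the label with the highest log posterior score."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for w in tokens:
            if w in params["log_likelihood"]:  # words unseen in training are ignored
                score += params["log_likelihood"][w]
        scores[label] = score
    return max(scores, key=scores.get)

# Toy data, invented for illustration (0 = negative, 4 = positive)
docs = [["love", "this", "song"], ["great", "day"], ["hate", "this", "rain"], ["bad", "day"]]
labels = [4, 4, 0, 0]
model = train_nb(docs, labels)

print(predict_nb(model, ["love", "great"]))  # 4
print(predict_nb(model, ["hate", "bad"]))    # 0
```

Smoothing is what keeps a word that never occurred in one class from driving that class's probability to zero.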

#### Calculate the accuracy of the model

In [7]:
accuracy_score(y_test, predicted, normalize=True)
#print(np.mean(predicted == y_test)) 

0.766734375

In [8]:
# Print the confusion matrix (rows: true labels, columns: predicted labels, in the order 0 then 4)
print(confusion_matrix(y_test, predicted)) 

[[131350  28144]
 [ 46501 114005]]
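The accuracy reported earlier can be recovered from this matrix, along with per-class figures that accuracy alone hides. The sketch below relies on scikit-learn's convention that rows are true labels and columns are predictions, with labels in sorted order (0 = negative, then 4 = positive); the variable names are illustrative.

```python
# Confusion matrix printed by the notebook: rows are true labels,
# columns are predictions, in label order 0 (negative), 4 (positive)
matrix = [[131350, 28144],
          [46501, 114005]]

(true_neg, false_pos), (false_neg, true_pos) = matrix
total = true_neg + false_pos + false_neg + true_pos

accuracy = (true_neg + true_pos) / total
precision_pos = true_pos / (true_pos + false_pos)
recall_pos = true_pos / (true_pos + false_neg)
recall_neg = true_neg / (true_neg + false_pos)

print(f"accuracy       = {accuracy:.6f}")  # agrees with the 0.766734375 reported above
print(f"precision (4)  = {precision_pos:.4f}")
print(f"recall (4)     = {recall_pos:.4f}")
print(f"recall (0)     = {recall_neg:.4f}")
```

The matrix shows more positive tweets misclassified as negative (46501) than negative tweets misclassified as positive (28144), so recall is lower for the positive class than for the negative one.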


#### Author : Thibaut BREMAND  
- thibaut.bremand [at] gmail.com
- https://github.com/ThibautBremand

#### Sources :  
- https://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf : The original research paper that inspired this work
- http://help.sentiment140.com  : The original training dataset  
- https://stackabuse.com/the-naive-bayes-algorithm-in-python-with-scikit-learn/ : Explanation of the methodology. 
- https://www.datacamp.com/community/tutorials/naive-bayes-scikit-learn : More details about the Naive Bayes classification