# Sentiment analysis 

The objective of the second problem is to perform sentiment analysis on tweets that users posted about various mobile devices.
Based on the text of a tweet, we will classify whether the user's sentiment toward a particular mobile device is positive or negative.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
%matplotlib inline
import re

## Question 1

### Read the data
- read tweets.csv
- use latin1 encoding if loading raises an encoding error

In [2]:
tweets_df=pd.read_csv('tweets.csv',encoding='latin1')
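
The encoding hint above can also be written as an explicit fallback. The sketch below is one possible approach and is not part of the original notebook: read_csv_with_fallback is a hypothetical helper that tries UTF-8 first and retries with latin1 only when decoding fails. The temporary file is a stand-in for tweets.csv.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin1")):
    """Hypothetical helper: try each encoding in order until one decodes."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error


# Demo on a throwaway file containing a byte (0xE9) that is invalid UTF-8.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as fh:
    fh.write("tweet_text\ncaf\u00e9 wifi is down\n".encode("latin1"))
    demo_path = fh.name

demo_df = read_csv_with_fallback(demo_path)
os.remove(demo_path)
print(demo_df)
```

Because latin1 can decode any byte sequence, it works as a last resort, but it may silently mis-render text that was written in another encoding.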

### Drop null values
- drop all the rows with null values

In [3]:
tweets_df.shape

(9093, 3)

In [4]:
tweets_df.sample(5)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
8219,Apple just brought is water in the iPad line #...,,No emotion toward brand or product
7126,Looking for an iPhone app that manages multipl...,,No emotion toward brand or product
8712,Just because google patented something i.e. (A...,,Negative emotion
6839,RT @mention VERY IMPORTANT: Make sure you are ...,,No emotion toward brand or product
3056,"In iPad Design Headaches: Take Two Tablets, Ca...",iPad,Negative emotion


In [5]:
tweets_new=tweets_df.dropna()

In [6]:
tweets_new.shape

(3291, 3)
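
Dropping nulls shrinks the data from 9093 to 3291 rows, so it can be worth checking which column holds the missing values before calling dropna(). The following is a minimal sketch on a made-up three-row frame; the values and the shortened column name label are illustrative and are not taken from tweets.csv.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for tweets_df; values are invented for illustration.
toy_df = pd.DataFrame({
    "tweet_text": ["love my ipad", "battery died again", np.nan],
    "emotion_in_tweet_is_directed_at": ["iPad", np.nan, np.nan],
    "label": ["Positive emotion", "Negative emotion",
              "No emotion toward brand or product"],
})

null_counts = toy_df.isna().sum()   # missing values per column
toy_clean = toy_df.dropna()         # drops any row with at least one null

print(null_counts)
print(toy_df.shape, "->", toy_clean.shape)
```

In the sampled rows shown earlier, the empty field is emotion_in_tweet_is_directed_at, which suggests that most of the dropped rows are tweets without a target brand or product.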

### Print the dataframe
- print initial 5 rows of the data
- use df.head()

In [7]:
tweets_new.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


## Question 2

### Preprocess data
- convert all text to lowercase - use .lower()
- select only numbers, alphabets, and #+_ from text - use re.sub()
- strip all the text - use .strip()
    - this is for removing extra spaces

print dataframe

In [8]:
tweets_new=tweets_new.applymap(lambda x:x.lower())

In [9]:
# raw string avoids invalid-escape warnings; # + _ need no escaping inside a character class
tweets_new=tweets_new.applymap(lambda x:re.sub(r'[^0-9a-zA-Z#+_\s]','',x))

In [10]:
tweets_new=tweets_new.applymap(lambda x:x.strip())

In [11]:
tweets_new.head()

## Printing the dataframe after applying all the preprocessing steps required by the question

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,wesley83 i have a 3g iphone after 3 hrs tweeti...,iphone,negative emotion
1,jessedee know about fludapp awesome ipadiphon...,ipad or iphone app,positive emotion
2,swonderlin can not wait for #ipad 2 also they ...,ipad,positive emotion
3,sxsw i hope this years festival isnt as crashy...,ipad or iphone app,negative emotion
4,sxtxstate great stuff on fri #sxsw marissa may...,google,positive emotion
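
The three preprocessing steps can also be collected into a single function. This is a sketch rather than the notebook's own code: clean_text is a hypothetical name, and it uses an equivalent whitelist (only lowercase letters are listed because the text is lowercased first).

```python
import re

# Same character whitelist as the cell above: digits, letters, # + _ and whitespace.
_DISALLOWED = re.compile(r"[^0-9a-z#+_\s]")


def clean_text(text):
    """Lowercase, drop disallowed characters, and trim surrounding whitespace."""
    return _DISALLOWED.sub("", text.lower()).strip()


print(clean_text(".@wesley83 I have a 3G iPhone."))
print(clean_text("  Awesome iPad/iPhone app!  "))
```

A function like this could be applied to the text column alone, for example with tweets_new["tweet_text"].apply(clean_text). Note that deleting characters such as "/" merges neighbouring words, which is why "iPad/iPhone" becomes "ipadiphone" in the output above.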


## Question 3

### Preprocess data
- in column "is_there_an_emotion_directed_at_a_brand_or_product"
    - select only those rows where value equal to "positive emotion" or "negative emotion"
- find the value counts of "positive emotion" and "negative emotion"

In [12]:
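# keep only rows labelled positive or negative; .copy() avoids a SettingWithCopyWarning when the Label column is added later
tweets_new = tweets_new[(tweets_new["is_there_an_emotion_directed_at_a_brand_or_product"] == "positive emotion") | (tweets_new["is_there_an_emotion_directed_at_a_brand_or_product"] == "negative emotion")].copy()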

In [14]:
tweets_new["is_there_an_emotion_directed_at_a_brand_or_product"].value_counts()

## Kept only the rows labelled positive or negative emotion and displayed the count of each

positive emotion    2672
negative emotion     519
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
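
The same row selection can be expressed with isin(), which stays readable when more than two values are allowed. The sketch below uses an invented four-row frame; the constant LABEL_COL and the toy values are assumptions made for illustration.

```python
import pandas as pd

LABEL_COL = "is_there_an_emotion_directed_at_a_brand_or_product"

toy = pd.DataFrame({
    "tweet_text": ["great app", "crashy app", "no opinion", "cant tell"],
    LABEL_COL: ["positive emotion", "negative emotion",
                "no emotion toward brand or product", "i cant tell"],
})

keep = ["positive emotion", "negative emotion"]
filtered = toy[toy[LABEL_COL].isin(keep)].copy()  # copy() so later column assignments are safe

counts = filtered[LABEL_COL].value_counts()
shares = filtered[LABEL_COL].value_counts(normalize=True)  # class proportions
print(counts)
print(shares)
```

Passing normalize=True reports proportions instead of raw counts, which makes a class imbalance easier to spot.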

## Question 4

### Encode labels
- in column "is_there_an_emotion_directed_at_a_brand_or_product"
    - change "positive emotion" to 1
    - change "negative emotion" to 0
- use map function to replace values

In [15]:
#sms['label_num'] = sms.label.map({'ham':0, 'spam':1})

In [16]:
tweets_new['Label'] = tweets_new["is_there_an_emotion_directed_at_a_brand_or_product"].map({'positive emotion': '1', 'negative emotion': '0'})

In [17]:
tweets_new.head()

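## Assigned the label '0' to negative emotion and '1' to positive emotion; the mapping uses strings, so the labels and the later predictions are strings rather than integers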

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,Label
0,wesley83 i have a 3g iphone after 3 hrs tweeti...,iphone,negative emotion,0
1,jessedee know about fludapp awesome ipadiphon...,ipad or iphone app,positive emotion,1
2,swonderlin can not wait for #ipad 2 also they ...,ipad,positive emotion,1
3,sxsw i hope this years festival isnt as crashy...,ipad or iphone app,negative emotion,0
4,sxtxstate great stuff on fri #sxsw marissa may...,google,positive emotion,1
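
If integer targets are preferred, as the instructions for this question suggest, the mapping might look like the sketch below. It is an alternative to the cell above, not the notebook's code, and it runs on a three-element toy series.

```python
import pandas as pd

labels = pd.Series(["positive emotion", "negative emotion", "positive emotion"])

# Integer targets instead of the string targets used in the cell above.
label_map = {"positive emotion": 1, "negative emotion": 0}
encoded = labels.map(label_map)

# map() yields NaN for any value missing from the dict, so check for gaps.
assert encoded.notna().all(), "found a label that is not in label_map"
encoded = encoded.astype(int)
print(encoded.tolist())
```

With integer labels, the prediction arrays further down would contain 1 and 0 instead of '1' and '0'.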


## Question 5

### Get feature and label
- get column "tweet_text" as feature
- get column "is_there_an_emotion_directed_at_a_brand_or_product" as label (its encoded form, the "Label" column, is used below)

In [18]:
X=tweets_new.tweet_text
y=tweets_new.Label

In [19]:
print(X.shape)
print(y.shape)

(3191,)
(3191,)


### Create train and test data
- use train_test_split to get train and test set
- set a random_state
- test_size: 0.25

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=7)

## splitting the data into train and test set

In [22]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(2393,)
(798,)
(2393,)
(798,)
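
Since positive tweets outnumber negative ones by roughly five to one, a stratified split is one way to keep the class ratio the same in both subsets. The stratify argument is an addition that the original notebook does not use; the sketch below demonstrates it on synthetic labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 80 positive and 20 negative labels.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 80 + [0] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=7, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("positive share:", y_tr.mean(), y_te.mean())
```

Applied to the notebook's data, this would correspond to passing stratify=y to the existing train_test_split call; the resulting scores would then differ from the ones reported here.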


## Question 6

### Vectorize data
- create document-term matrix
- use CountVectorizer()
    - ngram_range: (1, 2)
    - stop_words: 'english'
    - min_df: 2   
- do fit_transform on X_train
- do transform on X_test

In [23]:
vect=CountVectorizer(ngram_range=(1, 2),stop_words='english',min_df=2)

In [24]:
# learn the training data vocabulary and build the document-term matrix in one step
X_train_dtm = vect.fit_transform(X_train)

In [25]:
X_train_dtm

<2393x5383 sparse matrix of type '<class 'numpy.int64'>'
	with 37278 stored elements in Compressed Sparse Row format>

In [26]:
X_test_dtm=vect.transform(X_test)

In [27]:
X_test_dtm

<798x5383 sparse matrix of type '<class 'numpy.int64'>'
	with 10661 stored elements in Compressed Sparse Row format>
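
To see what these settings do, it can help to run the vectorizer on a corpus small enough to inspect by hand. The documents below are invented for this sketch.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["great ipad app", "great ipad design", "bad battery life"]
test_docs = ["ipad xyzzy"]  # "xyzzy" never appears in the training docs

toy_vect = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=2)
train_dtm = toy_vect.fit_transform(train_docs)  # learn vocabulary + count
test_dtm = toy_vect.transform(test_docs)        # reuse the learned vocabulary

print(sorted(toy_vect.vocabulary_))
print(train_dtm.shape, test_dtm.shape)
print(test_dtm.toarray())
```

Terms that occur in only one training document, such as "battery", are dropped by min_df=2. Words that appear only in the test data are ignored by transform(), which is why the train and test matrices above share the same number of columns (5383 in the notebook).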

## Question 7

### Select classifier logistic regression
- use logistic regression for predicting sentiment of the given tweet
- initialize classifier

In [28]:
logreg=LogisticRegression()

### Fit the classifier
- fit logistic regression classifier

In [29]:
logreg.fit(X_train_dtm,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
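
Vectorizing and fitting can also be chained into a single estimator, so that raw text goes in and predictions come out. The following sketch uses make_pipeline on a tiny invented corpus; min_df is omitted because the corpus is so small, and max_iter=1000 is an assumed value rather than a tuned one.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented corpus; '1' = positive, '0' = negative (strings, as in the notebook).
texts = ["love this ipad", "great iphone app", "awesome google maps",
         "battery is terrible", "app keeps crashing", "hate this design"]
targets = ["1", "1", "1", "0", "0", "0"]

# max_iter is raised from the default 100 as a precaution against
# convergence warnings; that value is an assumption, not a tuned setting.
text_clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), stop_words="english"),
    LogisticRegression(max_iter=1000),
)
text_clf.fit(texts, targets)

preds = text_clf.predict(["love this app", "terrible battery"])
print(preds)
```

A pipeline of this kind guarantees that the vocabulary is learned from the training text only, because the vectorizer is fitted inside fit() and merely applied inside predict().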

## Question 8

### Select classifier naive bayes
- use naive bayes for predicting sentiment of the given tweet
- initialize classifier
- use MultinomialNB

In [30]:
nb = MultinomialNB()

### Fit the classifier
- fit naive bayes classifier

In [31]:
nb.fit(X_train_dtm,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
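
The alpha=1.0 shown in the output above is add-one (Laplace) smoothing. The sketch below fits MultinomialNB on a hand-made count matrix with three hypothetical terms, to show that a term never seen in a class still receives a non-zero probability.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Rows are documents, columns are counts of three hypothetical terms.
counts = np.array([[2, 0, 1],
                   [1, 0, 0],
                   [0, 3, 0],
                   [0, 1, 1]])
targets = np.array(["1", "1", "0", "0"])

toy_nb = MultinomialNB(alpha=1.0)  # alpha=1.0 is add-one (Laplace) smoothing
toy_nb.fit(counts, targets)

# Term 1 never occurs in class '1', yet smoothing keeps its probability above zero.
term_probs = np.exp(toy_nb.feature_log_prob_)
print(toy_nb.classes_)
print(term_probs)
print(toy_nb.predict_proba(np.array([[0, 1, 0]])))
```

For class '1' the term counts are 3, 0 and 1, so the smoothed probabilities are (3+1)/7, (0+1)/7 and (1+1)/7, where 7 is the class total of 4 plus one pseudo-count for each of the 3 terms.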

## Question 9

### Make predictions on logistic regression
- use your trained logistic regression model to make predictions on X_test

In [33]:
y_predict_test=logreg.predict(X_test_dtm)

## making predictions on X_test

In [34]:
y_predict_test

array(['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '0',
       '1', '1', '0', '1', '1', '1', '1', '1', '1', '1', '1', '0', '1',
       '1', '1', '1', '1', '0', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '1', '0', '1', '1', '0', '1', '1', '1', '1', '1', '1',
       '1', '0', '1', '1', '1', '1', '1', '1', '1', '0', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '1', '1', '1', '0', '0', '1', '1', '1', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '1', '0', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '0', '1', '1', '1', '0', '1', '1', '1', '0', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1

### Make predictions on naive bayes
- use your trained naive bayes model to make predictions on X_test
- use a different variable name to store predictions so that they are kept separately

In [35]:
y_predict_testnb=nb.predict(X_test_dtm)

## using NB model to make predictions on X_test. Storing the results in a separate variable as per requirement 

In [36]:
y_predict_testnb

array(['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '0',
       '1', '1', '0', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '1', '1', '0', '1', '1', '1', '1', '1', '1', '1', '1',
       '0', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '0', '1', '1', '1', '1', '1', '1', '1', '0', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '1', '1', '1', '0', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '1', '0', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '0', '1', '1', '1', '1', '1', '1', '1', '0', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '0', '1', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1
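
Keeping the two prediction arrays in separate variables makes it possible to compare the models directly. The sketch below uses short invented arrays in place of y_predict_test and y_predict_testnb.

```python
import numpy as np

# Short invented prediction vectors standing in for y_predict_test / y_predict_testnb.
pred_logreg = np.array(["1", "1", "0", "1", "0", "1"])
pred_nb = np.array(["1", "1", "1", "1", "0", "1"])

disagree = pred_logreg != pred_nb            # element-wise comparison
print("disagreements:", int(disagree.sum()))
print("positions:", np.flatnonzero(disagree))

# Predicted class balance for one model
labels, logreg_counts = np.unique(pred_logreg, return_counts=True)
balance = dict(zip(labels.tolist(), logreg_counts.tolist()))
print(balance)
```

Inspecting the tweets at the disagreeing positions is often a quick way to find examples that one model handles better than the other.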

## Question 10

### Calculate accuracy of logistic regression
- check accuracy of logistic regression classifier
- use sklearn.metrics.accuracy_score

In [37]:
metrics.accuracy_score(y_test, y_predict_test)

## checking accuracy on test data 

0.87468671679198

In [38]:
y_predict_train=logreg.predict(X_train_dtm)

In [39]:
metrics.accuracy_score(y_train, y_predict_train)

## checking accuracy on train data 

0.9715837860426243
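
With 2672 positive and 519 negative tweets, accuracy alone does not show how well the minority class is handled. A confusion matrix and per-class report are one way to look closer; the sketch below uses invented labels rather than the notebook's predictions.

```python
from sklearn import metrics

# Invented labels for illustration; '1' = positive, '0' = negative.
y_true = ["1", "1", "1", "0", "0"]
y_pred = ["1", "1", "0", "1", "0"]

cm = metrics.confusion_matrix(y_true, y_pred, labels=["0", "1"])
print(cm)  # rows = true class, columns = predicted class

acc = metrics.accuracy_score(y_true, y_pred)
print(acc)
print(metrics.classification_report(y_true, y_pred, labels=["0", "1"]))
```

In the notebook, the equivalent calls would take y_test and y_predict_test. The gap between the train accuracy (about 0.97) and the test accuracy (about 0.87) also suggests some overfitting, which per-class metrics can help to diagnose.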

### Calculate accuracy of naive bayes
- check accuracy of naive bayes classifier
- use sklearn.metrics.accuracy_score

In [40]:
metrics.accuracy_score(y_test, y_predict_testnb)

## checking accuracy on test data 

0.868421052631579

In [42]:
y_predict_trainnb=nb.predict(X_train_dtm)

In [43]:
metrics.accuracy_score(y_train, y_predict_trainnb)

## checking accuracy on train data 

0.9348098620977852
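
To put these accuracies in context, they can be compared with a baseline that always predicts the majority class. The sketch below uses DummyClassifier, which the original notebook does not use, on synthetic labels that reproduce the class counts of the filtered tweets.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn import metrics

# Synthetic labels with the same class counts as the filtered tweets.
y_all = np.array(["1"] * 2672 + ["0"] * 519)
X_placeholder = np.zeros((len(y_all), 1))  # features are ignored by this baseline

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_all)

baseline_acc = metrics.accuracy_score(y_all, baseline.predict(X_placeholder))
print(round(baseline_acc, 4))
```

On the full filtered set the majority-class share is 2672 / 3191, or roughly 0.837, so the test accuracies of about 0.875 (logistic regression) and 0.868 (naive bayes) are only a few points above this kind of baseline. The exact baseline on the test split depends on its class mix.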