# Sentiment analysis 

The objective of this problem is to perform sentiment analysis on tweets that users have posted about various mobile devices.
Based on the text of a tweet, we classify whether the sentiment the user directs at a particular mobile device is positive or negative.

## Question 1

### Read the data
- read tweets.csv
- use encoding='latin' if loading raises an encoding error

In [1]:
import pandas as pd
df= pd.read_csv('tweets.csv', encoding='latin')
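
The encoding advice above can also be written as an explicit fallback. This is a minimal sketch, not part of the original notebook: the helper name `read_csv_with_fallback` and the order of encodings are assumptions made for illustration.

```python
import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Try each encoding in turn and return the first DataFrame that decodes."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            # Remember the failure and try the next encoding in the list
            last_error = err
    raise last_error
```

Calling `read_csv_with_fallback('tweets.csv')` would try UTF-8 first and only fall back to Latin-1 when decoding fails.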

In [2]:
df.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [3]:
df.shape

(9093, 3)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


In [5]:
df.isnull().sum()

tweet_text                                               1
emotion_in_tweet_is_directed_at                       5802
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64

### Drop null values
- drop all the rows with null values

In [6]:
df['tweet_text'].isnull().sum()

1

In [7]:
# Drop only rows with a missing tweet; the mostly-null target column is not used as a feature
df.dropna(axis=0, subset=['tweet_text'], inplace=True)

In [8]:
df.isnull().sum()

tweet_text                                               0
emotion_in_tweet_is_directed_at                       5801
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64
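
The counts above show that only the row with a missing tweet was removed, while the nulls in the other column remain. A small sketch on made-up data (the toy rows are assumptions, not taken from tweets.csv) shows how `dropna()` and `dropna(subset=...)` differ:

```python
import pandas as pd

# Toy frame: row 1 lacks the tweet text, row 2 lacks the target product
toy = pd.DataFrame({
    "tweet_text": ["great phone", None, "battery died"],
    "emotion_in_tweet_is_directed_at": ["iPhone", "iPad", None],
})

dropped_any = toy.dropna()                         # removes rows 1 and 2
dropped_text = toy.dropna(subset=["tweet_text"])   # removes only row 1

print(len(dropped_any), len(dropped_text))
```

Dropping on every column would discard most of this dataset, since the target-product column is null for the majority of rows.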

### Print the dataframe
- print initial 5 rows of the data
- use df.head()

In [9]:
df.head(5)

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [10]:
df.shape

(9092, 3)

In [11]:
type(df)

pandas.core.frame.DataFrame

## Question 2

### Preprocess data
- convert all text to lowercase - use .lower()
- select only numbers, alphabets, and #+_ from text - use re.sub()
- strip all the text - use .strip()
    - this is for removing extra spaces

In [12]:
import nltk
import re
import unicodedata
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [13]:
df['tweet_text'] = df['tweet_text'].apply(lambda s: s.lower())

In [14]:
type(df)

pandas.core.frame.DataFrame

In [15]:
type(df['tweet_text'] )

pandas.core.series.Series

In [16]:
# Keep digits, lowercase letters, spaces, + and _; note that this pattern also removes '#'
df['tweet_text'] = df['tweet_text'].apply(lambda s: re.sub(r'[^0-9a-z +_]', '', s))


In [17]:
df['tweet_text'] = df['tweet_text'].apply(lambda s: s.strip())
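
The three preprocessing steps can also be bundled into one function. This is a sketch under assumptions: the name `clean_tweet` is hypothetical, and the pattern here keeps `#` as the bullet list asks, whereas the pattern applied in the cell above removes it.

```python
import re

# Characters to keep: digits, lowercase letters, space, and # + _
_DISALLOWED = re.compile(r"[^0-9a-z #+_]")


def clean_tweet(text):
    """Lowercase, drop disallowed characters, then trim surrounding whitespace."""
    return _DISALLOWED.sub("", text.lower()).strip()


print(clean_tweet("  Love my #iPad 2!! "))
```

It could be applied with `df['tweet_text'].apply(clean_tweet)` in place of the three separate cells.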

Print the dataframe

In [18]:
df

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | wesley83 i have a 3g iphone after 3 hrs tweeti... | iPhone | Negative emotion |
| 1 | jessedee know about fludapp awesome ipadiphon... | iPad or iPhone App | Positive emotion |
| 2 | swonderlin can not wait for ipad 2 also they s... | iPad | Positive emotion |
| 3 | sxsw i hope this years festival isnt as crashy... | iPad or iPhone App | Negative emotion |
| 4 | sxtxstate great stuff on fri sxsw marissa maye... | Google | Positive emotion |
| ... | ... | ... | ... |
| 9088 | ipad everywhere sxsw link | iPad | Positive emotion |
| 9089 | wave buzz rt mention we interrupt your regular... | NaN | No emotion toward brand or product |
| 9090 | googles zeiger a physician never reported pote... | NaN | No emotion toward brand or product |
| 9091 | some verizon iphone customers complained their... | NaN | No emotion toward brand or product |


## Question 3

### Preprocess data
- in column "is_there_an_emotion_directed_at_a_brand_or_product"
    - select only those rows where the value equals "Positive emotion" or "Negative emotion"
- find the value counts of "Positive emotion" and "Negative emotion"

In [19]:
df.is_there_an_emotion_directed_at_a_brand_or_product.unique()

array(['Negative emotion', 'Positive emotion',
       'No emotion toward brand or product', "I can't tell"], dtype=object)

In [20]:
df.is_there_an_emotion_directed_at_a_brand_or_product.value_counts()

No emotion toward brand or product    5388
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

In [21]:
df.emotion_in_tweet_is_directed_at.value_counts()

iPad                               946
Apple                              661
iPad or iPhone App                 470
Google                             430
iPhone                             297
Other Google product or service    293
Android App                         81
Android                             78
Other Apple product or service      35
Name: emotion_in_tweet_is_directed_at, dtype: int64

In [22]:
df = df.loc[(df['is_there_an_emotion_directed_at_a_brand_or_product'] == 'Positive emotion') | (df['is_there_an_emotion_directed_at_a_brand_or_product'] == 'Negative emotion')]

In [23]:
df.is_there_an_emotion_directed_at_a_brand_or_product.value_counts()

Positive emotion    2978
Negative emotion     570
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
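
The same row selection can be expressed with `isin`, which scales better when more than two labels are kept. A minimal sketch on toy labels (the toy frame and the name `LABEL_COL` are assumptions for illustration):

```python
import pandas as pd

LABEL_COL = "is_there_an_emotion_directed_at_a_brand_or_product"

toy = pd.DataFrame({LABEL_COL: [
    "Positive emotion",
    "I can't tell",
    "Negative emotion",
    "No emotion toward brand or product",
]})

keep = ["Positive emotion", "Negative emotion"]
# .copy() makes the result an independent frame, so later column assignments are unambiguous
filtered = toy[toy[LABEL_COL].isin(keep)].copy()

print(filtered)
```

The filtered frame keeps the original row labels, which is why the index in the notebook output has gaps after filtering.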

In [24]:
df

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | wesley83 i have a 3g iphone after 3 hrs tweeti... | iPhone | Negative emotion |
| 1 | jessedee know about fludapp awesome ipadiphon... | iPad or iPhone App | Positive emotion |
| 2 | swonderlin can not wait for ipad 2 also they s... | iPad | Positive emotion |
| 3 | sxsw i hope this years festival isnt as crashy... | iPad or iPhone App | Negative emotion |
| 4 | sxtxstate great stuff on fri sxsw marissa maye... | Google | Positive emotion |
| ... | ... | ... | ... |
| 9077 | mention your pr guy just convinced me to switc... | iPhone | Positive emotion |
| 9079 | quotpapyrussort of like the ipadquot nice lol... | iPad | Positive emotion |
| 9080 | diller says google tv quotmight be run over by... | Other Google product or service | Negative emotion |
| 9085 | ive always used camera+ for my iphone bc it ha... | iPad or iPhone App | Positive emotion |


## Question 4

### Encode labels
- in column "is_there_an_emotion_directed_at_a_brand_or_product"
    - change "Positive emotion" to 1
    - change "Negative emotion" to 0
- use map function to replace values

In [25]:
# Work on an explicit copy so the assignment below does not raise SettingWithCopyWarning
df = df.copy()
label_map = {'Positive emotion': 1, 'Negative emotion': 0}
df['is_there_an_emotion_directed_at_a_brand_or_product'] = df['is_there_an_emotion_directed_at_a_brand_or_product'].map(label_map)


In [26]:
df.is_there_an_emotion_directed_at_a_brand_or_product.value_counts()

1    2978
0     570
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
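
Whichever encoding approach is used, it is worth confirming which integer each label receives. The sketch below, on three made-up labels, compares an explicit `map` with scikit-learn's `LabelEncoder`, which numbers classes in sorted order:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

labels = pd.Series(["Positive emotion", "Negative emotion", "Positive emotion"])

# Explicit mapping: the code for each label is stated directly
mapped = labels.map({"Positive emotion": 1, "Negative emotion": 0})

# LabelEncoder: codes follow the sorted order of the class names
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(list(encoder.classes_))
print(mapped.tolist(), encoded.tolist())
```

Here the two agree only because "Negative emotion" sorts before "Positive emotion"; an explicit mapping does not depend on that coincidence.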

In [27]:
df.head(3)

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | wesley83 i have a 3g iphone after 3 hrs tweeti... | iPhone | 0 |
| 1 | jessedee know about fludapp awesome ipadiphon... | iPad or iPhone App | 1 |
| 2 | swonderlin can not wait for ipad 2 also they s... | iPad | 1 |


## Question 5

### Get feature and label
- get column "tweet_text" as feature
- get column "is_there_an_emotion_directed_at_a_brand_or_product" as label

In [28]:
X = df['tweet_text']

In [29]:
y = df['is_there_an_emotion_directed_at_a_brand_or_product']

### Create train and test data
- use train_test_split to get train and test set
- set a random_state
- test_size: 0.25

In [30]:
from sklearn.model_selection import train_test_split

In [31]:
test_size = 0.25  # 75:25 split between training and test sets
seed = 7  # random seed, fixed so the split is repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

In [32]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(2661,)
(2661,)
(887,)
(887,)
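
Because positive tweets heavily outnumber negative ones in this dataset, one optional variation is a stratified split, which keeps the class ratio the same in both sets. The notebook does not do this; the sketch below uses synthetic labels with an 80:20 ratio purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 80 positive and 20 negative labels
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 80 + [0] * 20)

# stratify=y_toy asks for the same class proportions in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=7, stratify=y_toy
)

print((y_tr == 0).sum(), (y_te == 0).sum())
```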


In [33]:
type(y_test)

pandas.core.series.Series

## Question 6

### Vectorize data
- create document-term matrix
- use CountVectorizer()
    - ngram_range: (1, 2)
    - stop_words: 'english'
    - min_df: 2   
- do fit_transform on X_train
- do transform on X_test

In [34]:
# import and instantiate CountVectorizer with unigrams and bigrams, English stop words, and min_df=2
from sklearn.feature_extraction.text import CountVectorizer
cvect = CountVectorizer(ngram_range=(1,2), stop_words='english', min_df=2)

In [35]:
X_train = cvect.fit_transform(X_train)

In [36]:
X_test= cvect.transform(X_test)

In [37]:
X_train

<2661x6100 sparse matrix of type '<class 'numpy.int64'>'
	with 42471 stored elements in Compressed Sparse Row format>

In [38]:
len(cvect.vocabulary_)

6100
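
To see what the three vectorizer settings do, it helps to run them on a corpus small enough to check by hand. The three sentences below are invented for this sketch; the settings match the ones used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["i love my ipad", "love my iphone", "my ipad crashed"]

# Stop words ("i", "my") are removed first, then unigrams and bigrams are built,
# and finally any term appearing in fewer than 2 documents is discarded.
toy_vect = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=2)
toy_dtm = toy_vect.fit_transform(toy_corpus)

print(sorted(toy_vect.vocabulary_))
print(toy_dtm.toarray())
```

Only "ipad" and "love" occur in two documents, so every bigram and the remaining unigrams are pruned by `min_df=2`.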

In [None]:
print(cvect.vocabulary_)

In [40]:
# Call the method (note the parentheses) and show only the first few terms.
# get_feature_names_out() is the accessor in scikit-learn >= 1.0; older releases use get_feature_names().
print(cvect.get_feature_names_out()[:10])


In [41]:
X_train.shape

(2661, 6100)

## Question 7

### Select classifier logistic regression
- use logistic regression for predicting sentiment of the given tweet
- initialize classifier

In [42]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

### Fit the classifier
- fit logistic regression classifier

In [43]:
model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
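
An alternative way to organise the vectorize-then-fit steps is a scikit-learn pipeline, which keeps the raw text and the fitted vocabulary together in one object. This is a sketch on invented sentences, not the notebook's method; `min_df` is omitted because the toy corpus is tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training examples: 1 = positive, 0 = negative
toy_texts = [
    "love my ipad", "love this app", "love the new iphone",
    "iphone battery died", "app keeps crashing", "battery died again",
]
toy_labels = [1, 1, 1, 0, 0, 0]

pipe = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), stop_words="english"),
    LogisticRegression(),
)
pipe.fit(toy_texts, toy_labels)

# The pipeline accepts raw strings and vectorizes them internally
preds = pipe.predict(["love it", "battery died"])
print(preds)
```

With a pipeline, `X_train` and `X_test` can stay as text, so they do not need to be overwritten with their vectorized versions.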

## Question 8

### Select classifier naive bayes
- use naive bayes for predicting sentiment of the given tweet
- initialize classifier
- use MultinomialNB

In [44]:
from sklearn.naive_bayes import MultinomialNB  # multinomial variant of Naive Bayes

# create the model
NB_model = MultinomialNB()

### Fit the classifier
- fit naive bayes classifier

In [45]:
NB_model.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

## Question 9

### Make predictions on logistic regression
- use your trained logistic regression model to make predictions on X_test

In [46]:
y_logic_predict = model.predict(X_test)

### Make predictions on naive bayes
- use your trained naive bayes model to make predictions on X_test
- use a different variable name to store predictions so that they are kept separately

In [47]:
NB_predict = NB_model.predict(X_test)
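
Both classifiers can also report class probabilities instead of hard labels, which is useful for seeing how confident a prediction is. A minimal sketch with Naive Bayes on invented sentences (none of this data comes from the notebook):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

toy_texts = ["love my ipad", "love this app", "battery died", "app keeps crashing"]
toy_labels = [1, 1, 0, 0]

vec = CountVectorizer()
nb = MultinomialNB().fit(vec.fit_transform(toy_texts), toy_labels)

# One row per input; columns follow the order of nb.classes_
proba = nb.predict_proba(vec.transform(["love my iphone"]))
print(nb.classes_)
print(proba)
```

`predict` simply returns the class whose column holds the larger probability.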

## Question 10

### Calculate accuracy of logistic regression
- check accuracy of logistic regression classifier
- use sklearn.metrics.accuracy_score

In [48]:
from sklearn import metrics
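# accuracy_score on the stored predictions gives the same value as model.score(X_test, y_test)
model_score = metrics.accuracy_score(y_test, y_logic_predict)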

In [49]:
print(model_score)
print(metrics.confusion_matrix(y_test, y_logic_predict))
print(metrics.classification_report(y_test, y_logic_predict))

0.8647125140924464
[[ 44 101]
 [ 19 723]]
              precision    recall  f1-score   support

           0       0.70      0.30      0.42       145
           1       0.88      0.97      0.92       742

    accuracy                           0.86       887
   macro avg       0.79      0.64      0.67       887
weighted avg       0.85      0.86      0.84       887



### Calculate accuracy of naive bayes
- check accuracy of naive bayes classifier
- use sklearn.metrics.accuracy_score

In [50]:
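# accuracy_score on the stored predictions gives the same value as NB_model.score(X_test, y_test)
model_score = metrics.accuracy_score(y_test, NB_predict)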

In [51]:
print(model_score)
print(metrics.confusion_matrix(y_test, NB_predict))
print(metrics.classification_report(y_test, NB_predict))

0.8635851183765502
[[ 46  99]
 [ 22 720]]
              precision    recall  f1-score   support

           0       0.68      0.32      0.43       145
           1       0.88      0.97      0.92       742

    accuracy                           0.86       887
   macro avg       0.78      0.64      0.68       887
weighted avg       0.85      0.86      0.84       887
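
The figures in the two reports can be reproduced by hand from the confusion matrices printed above. The helper name `summarize` is an assumption for this sketch; the matrices are copied from the outputs, with rows as true labels and columns as predicted labels.

```python
import numpy as np


def summarize(cm):
    """Return (accuracy, precision, recall) for class 0 from a 2x2 confusion matrix."""
    cm = np.asarray(cm)
    accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
    precision_0 = cm[0, 0] / cm[:, 0].sum()   # true 0s among everything predicted 0
    recall_0 = cm[0, 0] / cm[0].sum()         # true 0s found among all actual 0s
    return accuracy, precision_0, recall_0


lr_stats = summarize([[44, 101], [19, 723]])  # logistic regression
nb_stats = summarize([[46, 99], [22, 720]])   # naive bayes

print([round(v, 2) for v in lr_stats])
print([round(v, 2) for v in nb_stats])
```

Both models reach about 0.86 accuracy, yet each recovers under a third of the negative tweets, which is why accuracy alone is a weak summary on a dataset this imbalanced.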



# Predicting my tweet

In [73]:
data = {'tweet_text': ['i really dont dont dont dont dont dont dont dont dont dont dont dont like it', 'i love it', 'i like it']}
data = pd.DataFrame(data, columns=['tweet_text'])

In [74]:
data

| | tweet_text |
|---|---|
| 0 | i really dont dont dont dont dont dont dont do... |
| 1 | i love it |
| 2 | i like it |


In [75]:
data = data['tweet_text']

In [76]:
data.shape

(3,)

In [77]:
data = cvect.transform(data)

In [78]:
y_logic_predict = model.predict(data)

In [79]:
y_logic_predict

array([0, 1, 1])

In [80]:
data.shape

(3, 6100)
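
When scoring new text, keep in mind that a fitted `CountVectorizer` silently ignores any word it did not see during `fit`. The sketch below uses a tiny invented vocabulary to show the effect; it says nothing about which words are in the notebook's 6100-term vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(["love my ipad", "battery died"])

# None of these words were seen during fit, so the row has no non-zero entries
row = vec.transform(["dont like it"])
print(row.shape, row.nnz)
```

A tweet made up entirely of unseen words reaches the model as an all-zero row, so the prediction then rests on the model's intercept or class prior alone.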