# Case Study 04
Machine learning to predict public sentiment from text data.
Examine Twitter text data to predict whether a given tweet expresses positive or negative sentiment towards a particular brand. The dataset contains tweets about Apple and Google products, with user sentiment labeled as ‘positive’, ‘negative’, ‘neutral’ or ‘no_idea’. Build a SimpleRNN- or LSTM-based classifier to assign each tweet to one of the four classes. The ‘emotion_in_tweet_is_directed_at’ column can be ignored.

## Importing Libraries and Dataset

In [2]:
import numpy as np
import pandas as pd

In [4]:
data = pd.read_csv('/content/judge-1377884607_tweet_product_company.csv', encoding='ISO-8859-1')

In [8]:
data.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [5]:
# Drop the 'emotion_in_tweet_is_directed_at' column
df = data.drop(['emotion_in_tweet_is_directed_at'], axis=1)

In [6]:
df.head()

Unnamed: 0,tweet_text,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Positive emotion


## Pre-processing

In [7]:
# Copy the two relevant columns so that later assignments do not modify a view of df
messages = df[['tweet_text', 'is_there_an_emotion_directed_at_a_brand_or_product']].copy()
messages.columns = ["Text", "Label"]

In [8]:
pd.set_option('display.max_colwidth', None)
messages.head()

Unnamed: 0,Text,Label
0,".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead! I need to upgrade. Plugin stations at #SXSW.",Negative emotion
1,"@jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW",Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. They should sale them down at #SXSW.,Positive emotion
3,@sxsw I hope this year's festival isn't as crashy as this year's iPhone app. #sxsw,Negative emotion
4,"@sxtxstate great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) &amp; Matt Mullenweg (Wordpress)",Positive emotion


In [9]:
messages['Label'].value_counts()

No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: Label, dtype: int64
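The counts above are strongly imbalanced: the two smallest classes together make up well under a tenth of the data. One common way to compensate during training is to weight classes inversely to their frequency. The following is a minimal sketch, assuming the "balanced" heuristic `total / (n_classes * count)`; neither this heuristic nor the name `class_weights` appears in the original notebook.

```python
# Label counts copied from the value_counts() output above
counts = {
    "No emotion toward brand or product": 5389,
    "Positive emotion": 2978,
    "Negative emotion": 570,
    "I can't tell": 156,
}

total = sum(counts.values())   # number of labeled tweets
n_classes = len(counts)

# "Balanced" heuristic (an assumption, not used in the notebook):
# weight = total / (n_classes * count), so rarer classes get larger weights
class_weights = {label: total / (n_classes * n) for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label:<36} share={n / total:6.1%}  weight={class_weights[label]:.2f}")
```

Keras accepts weights of this kind through the `class_weight` argument of `model.fit`, keyed by the integer class ids rather than by label names.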

In [10]:
# Encode the four sentiment classes as integers 0-3
messages['Label'] = messages['Label'].map({"I can't tell": 0, 'Negative emotion': 1, 'No emotion toward brand or product': 2, 'Positive emotion': 3})

In [11]:
messages.head()

Unnamed: 0,Text,Label
0,".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead! I need to upgrade. Plugin stations at #SXSW.",1
1,"@jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW",3
2,@swonderlin Can not wait for #iPad 2 also. They should sale them down at #SXSW.,3
3,@sxsw I hope this year's festival isn't as crashy as this year's iPhone app. #sxsw,1
4,"@sxtxstate great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) &amp; Matt Mullenweg (Wordpress)",3


In [12]:
messages.shape

(9093, 2)

In [13]:
messages.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Text    9092 non-null   object
 1   Label   9093 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 142.2+ KB


In [14]:
messages = messages.dropna()

In [15]:
messages.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9092 entries, 0 to 9092
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Text    9092 non-null   object
 1   Label   9092 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 213.1+ KB


In [16]:
from keras.preprocessing import text
# Build the vocabulary from all tweets, then convert each tweet to a list of word ids
tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(list(messages['Text']))
tokenized_texts = tokenizer.texts_to_sequences(messages['Text'])

In [17]:
tokenized_texts[0]

[5869,
 23,
 51,
 11,
 607,
 18,
 257,
 111,
 2582,
 634,
 6,
 1351,
 25,
 32,
 86,
 893,
 23,
 104,
 5,
 1112,
 2583,
 3955,
 6,
 1]
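To make the numbers above less opaque, here is a simplified, pure-Python sketch of the idea behind the tokenizer: lowercase the text, replace punctuation with spaces, and give each word an id by frequency rank, with the most frequent word getting id 1. The helper names (`simple_split`, `build_word_index`, `to_sequence`) are hypothetical, and the filter string is assumed to match the Keras default; this is an illustration, not the Keras implementation.

```python
from collections import Counter

# Punctuation to strip (assumed to match the Keras Tokenizer default filters)
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_split(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets id 1."""
    counts = Counter()
    for t in texts:
        counts.update(simple_split(t))
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index):
    """Map words to ids, silently dropping words not in the vocabulary."""
    return [word_index[w] for w in simple_split(text) if w in word_index]

demo_texts = ["I love my iPad", "my iPad is dead!", "love it"]
word_index = build_word_index(demo_texts)
print(word_index)
print(to_sequence("my iPad is dead!", word_index))
```

Id 0 is never assigned to a word, which leaves it free to serve as the padding value in the next step.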

In [18]:
from keras.utils import pad_sequences
# Left-pad with zeros (or truncate) so that every sequence has exactly 100 ids
X = pad_sequences(tokenized_texts, maxlen=100)

In [19]:
X[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0, 5869,
         23,   51,   11,  607,   18,  257,  111, 2582,  634,    6, 1351,
         25,   32,   86,  893,   23,  104,    5, 1112, 2583, 3955,    6,
          1], dtype=int32)
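The zeros appear on the left because `pad_sequences` pads, and truncates, at the front of each sequence by default. A minimal pure-Python sketch of that behavior for a single sequence follows; the helper `pad_pre` is hypothetical and only mirrors the default `'pre'` settings.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    seq = list(seq)[-maxlen:]                     # 'pre' truncation drops items from the front
    return [value] * (maxlen - len(seq)) + seq    # 'pre' padding adds fill values in front

print(pad_pre([5869, 23, 51], maxlen=5))       # [0, 0, 5869, 23, 51]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))   # [3, 4, 5, 6]
```

With pre-padding, the actual words sit at the end of the sequence, so they are the last inputs the recurrent layer sees before it emits its final state.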

In [20]:
tokenizer.word_index

{'sxsw': 1,
 'mention': 2,
 'the': 3,
 'link': 4,
 'to': 5,
 'at': 6,
 'rt': 7,
 'for': 8,
 'ipad': 9,
 'google': 10,
 'a': 11,
 'apple': 12,
 'in': 13,
 'of': 14,
 'is': 15,
 'quot': 16,
 'and': 17,
 'iphone': 18,
 'store': 19,
 'on': 20,
 'up': 21,
 '2': 22,
 'i': 23,
 'new': 24,
 'austin': 25,
 'you': 26,
 'an': 27,
 'with': 28,
 'amp': 29,
 'my': 30,
 'app': 31,
 'it': 32,
 'social': 33,
 'launch': 34,
 'circles': 35,
 'this': 36,
 'android': 37,
 'pop': 38,
 'today': 39,
 'be': 40,
 'just': 41,
 'from': 42,
 'not': 43,
 'out': 44,
 'by': 45,
 'are': 46,
 'your': 47,
 'that': 48,
 'network': 49,
 'ipad2': 50,
 'have': 51,
 'via': 52,
 'will': 53,
 'line': 54,
 'about': 55,
 'free': 56,
 'get': 57,
 'now': 58,
 'if': 59,
 'called': 60,
 'me': 61,
 'party': 62,
 'mobile': 63,
 'so': 64,
 'sxswi': 65,
 'but': 66,
 'all': 67,
 'or': 68,
 'major': 69,
 'like': 70,
 'has': 71,
 'no': 72,
 "it's": 73,
 'one': 74,
 'what': 75,
 'time': 76,
 'temporary': 77,
 'w': 78,
 'can': 79,
 'opening'

In [21]:
len(tokenizer.word_index)

10147

## Creating Model

In [22]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(X, messages['Label'].values, test_size=0.2)

In [23]:
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding,SimpleRNN,Dropout

In [24]:
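model = Sequential()

# Map each of the 100 token ids to a 128-dimensional vector (+1 accounts for the padding id 0)
model.add(Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=128, input_length=100))
model.add(SimpleRNN(10))
model.add(Dropout(0.5))
model.add(Dense(50, activation='relu'))
model.add(Dropout(0.5))
# One output unit per class. 'softmax' is the more common choice for single-label
# classification; the outputs recorded in this notebook were produced with 'sigmoid'.
model.add(Dense(4, activation='sigmoid'))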

In [25]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [26]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 128)          1298944   
                                                                 
 simple_rnn (SimpleRNN)      (None, 10)                1390      
                                                                 
 dropout (Dropout)           (None, 10)                0         
                                                                 
 dense (Dense)               (None, 50)                550       
                                                                 
 dropout_1 (Dropout)         (None, 50)                0         
                                                                 
 dense_1 (Dense)             (None, 4)                 204       
                                                                 
Total params: 1,301,088
Trainable params: 1,301,088
Non-
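The parameter counts in the summary can be checked by hand. The sketch below reproduces them from the layer sizes used above, assuming the standard formulas: an embedding has one vector per token id, a SimpleRNN has input weights, recurrent weights and a bias per unit, and a dense layer has weights plus biases.

```python
vocab_size = 10147 + 1   # len(tokenizer.word_index) + 1 for the padding id 0
embed_dim = 128
rnn_units = 10

embedding = vocab_size * embed_dim                      # one 128-d vector per token id
simple_rnn = rnn_units * (embed_dim + rnn_units + 1)    # input + recurrent weights + bias
dense = 10 * 50 + 50                                    # weights + biases
dense_1 = 50 * 4 + 4                                    # weights + biases

total = embedding + simple_rnn + dense + dense_1
print(embedding, simple_rnn, dense, dense_1, total)
# 1298944 1390 550 204 1301088
```

Almost all of the parameters sit in the embedding layer, which is worth keeping in mind given that the dataset has only about nine thousand tweets.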

In [27]:
# Fit the RNN model, holding out 10% of the training data for validation
model.fit(X_train, y_train, epochs=20, validation_split=0.1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f1770032160>

In [28]:
y_pred = model.predict(X_test)



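We obtained an accuracy of 59.62%. For context, the majority class ('No emotion toward brand or product') accounts for 5389 of the 9093 labeled tweets, about 59.3%, so this is close to what always predicting that class would achieve.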
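The notebook does not show how the accuracy figure was derived from `y_pred`. One plausible way, sketched here under that assumption, is to take the highest-scoring class in each row and compare it with the true labels. The arrays `y_pred_demo` and `y_test_demo` are small hypothetical stand-ins for `model.predict(X_test)` and `y_test`.

```python
import numpy as np

# Hypothetical stand-ins for model.predict(X_test) and y_test
y_pred_demo = np.array([
    [0.1, 0.2, 0.9, 0.3],
    [0.0, 0.1, 0.4, 0.8],
    [0.1, 0.7, 0.6, 0.2],
    [0.0, 0.1, 0.8, 0.5],
])
y_test_demo = np.array([2, 3, 2, 2])

pred_labels = np.argmax(y_pred_demo, axis=1)      # highest-scoring class per row
accuracy = np.mean(pred_labels == y_test_demo)    # fraction of correct predictions
print(pred_labels, accuracy)
```

Calling `model.evaluate(X_test, y_test)` would report the same metric directly, since the model was compiled with `metrics=['accuracy']`.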

## Testing Model

"I can't tell":0 
'Negative emotion':1 
'No emotion toward brand or product':2 
'Positive emotion':3

In [66]:
test1 = "product is outstanding!!"
test1 = tokenizer.texts_to_sequences([test1])
test2 = pad_sequences(test1, maxlen=100)

In [67]:
out = model.predict(test2)



In [68]:
out

array([[0.02288836, 0.04785236, 0.63990605, 0.9534689 ]], dtype=float32)

The fourth value (index 3) is the largest, so the model correctly predicts 'Positive emotion'.

In [48]:
test1 = " talking about the future of search engines"
test1 = tokenizer.texts_to_sequences([test1])
test2 = pad_sequences(test1, maxlen=100)

In [49]:
out = model.predict(test2)



In [50]:
out

array([[0.03952718, 0.04802795, 0.8903283 , 0.80672336]], dtype=float32)

The value at index 2 is the largest, so the model correctly predicts 'No emotion toward brand or product'.

In [63]:
test1 = "This 3G iPad sucks"
test1 = tokenizer.texts_to_sequences([test1])
test2 = pad_sequences(test1, maxlen=100)

In [64]:
out = model.predict(test2)



In [65]:
out

array([[0.00207606, 0.01461084, 0.591082  , 0.99065393]], dtype=float32)

The value at index 3 is the largest, so the model predicts 'Positive emotion'. This prediction is wrong; the correct label is 1 ('Negative emotion').
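Reading the largest value by eye can be automated. The sketch below decodes the three recorded outputs into label names using the encoding listed at the top of this section. Because the output layer uses a sigmoid, the four scores are independent and need not sum to 1, so only their ranking is used. The names `LABELS`, `recorded` and `decode` are hypothetical helpers, not part of the original notebook.

```python
# Index i holds the name of class i, following the encoding used in this notebook
LABELS = ["I can't tell", "Negative emotion",
          "No emotion toward brand or product", "Positive emotion"]

# Scores copied from the three `out` arrays above
recorded = {
    "product is outstanding!!": [0.02288836, 0.04785236, 0.63990605, 0.9534689],
    "talking about the future of search engines": [0.03952718, 0.04802795, 0.8903283, 0.80672336],
    "This 3G iPad sucks": [0.00207606, 0.01461084, 0.591082, 0.99065393],
}

def decode(scores):
    """Return the label whose score is largest."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return LABELS[best]

for text, scores in recorded.items():
    print(f"{text!r} -> {decode(scores)} (scores sum to {sum(scores):.2f})")
```

In all three cases the 'Negative emotion' score stays below 0.05, which is consistent with how rare that class is in the training data.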

Submitted By Ajuma Mohammed KKEM ML/AL August Batch