# Emoji Sentiment Analysis with Tweets_Chinese
        
## step4-Sentiment analysis

#### 4.1 Constructing train and test dataset
- data cleaning
- preparing new columns for 4.3 and 4.4
- split the dataset

#### 4.2 Classification without emojis
- LSTM
- DNN

#### 4.3 Classification with replacing the emojis with their descriptive names
- LSTM
- DNN

#### 4.4 Classification with replacing the emojis with the most similar 5 text tokens in Word2Vec model
- LSTM
- DNN

#### 4.5 Classification with emojis word embedding vectors in Word2Vec model
- LSTM
- DNN



![](https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcTigQWzoYCNiDyrz1BN4WTf2X2k9OZ_yvW-FsmcIMsdS9fppNmh)
***Note, the dataset of labelled tweets with emojis is cloned from [DanaOshri, github](https://github.com/DanaOshri/Twitter-Sentiment-Analysis-Emoji-Embedding-and-LSTM)***

### 4.1 read the dataset with sentiment labels

In [1]:
# pip install emoji

In [88]:
import pandas as pd
import os
os.getcwd()
df = pd.read_csv('./sentiment/raw.csv')
df

Unnamed: 0.1,Unnamed: 0,tweets,labels
0,0,lmfaoo 😭 😭 😭 😭 😭,0
1,1,i hate this feeling 😢,0
2,2,ca n't believe i just went out in this cold to...,0
3,3,i need a new trap house so if you really fuck ...,0
4,4,so very sorry for your loss 💔,0
...,...,...,...
13195,13195,i love waking up skinny ahaha wish it lasted a...,1
13196,13196,magnificent pair of tits 😍 my cock is hard 🍆 😀,1
13197,13197,soon mamsh 😘 god will give you the best among ...,1
13198,13198,i trust u 😎,1


In [388]:
df.iloc[:-20,1:3]

Unnamed: 0,tweets,labels
0,😭 😭 😭 😭 😭,0
1,i hate this feeling 😢,0
2,ca believe i just went out in this cold to buy...,0
3,i need a new trap house so if you really wit m...,0
4,so very sorry for your loss 💔,0
...,...,...
13175,pretty sure catch these 😂,1
13176,sum kiss sound nice oh 😔,1
13177,you are looking so beautiful 😘,1
13178,enjoying a night with 😈,1


In [376]:
# df.query('labels == 1')
#0:6600
#1:6600

In [89]:
## data cleaning
#i. lemmatization

from textblob import TextBlob

def lemm(text):
    textTB = TextBlob(text)
    words = textTB.words
    words_lemmatized = words.lemmatize()
    return ' '.join(words_lemmatized)

example = df['tweets'][4]+' tweets'+' looks cleaned'
print('Example of original Tweet:')
print(example)
print('---------------')
print('Lemmatizating ...')
example = lemm(example)
print(example)
%time df['tweets']=df['tweets'].apply(lambda x : lemm(x))

Example of original Tweet:
so very sorry for your loss 💔 tweets looks cleaned
---------------
Lemmatizating ...
so very sorry for your loss 💔 tweet look cleaned
CPU times: user 2.65 s, sys: 78.6 ms, total: 2.73 s
Wall time: 2.75 s


In [90]:
#ii. keep only english characters and emojis

import emoji
import nltk 
nltk.download('words')
words = set(nltk.corpus.words.words())

def keepengemoji(text):
    ls = []
    for w in text.split(' '):
        if w in words:
            ls.append(w)
        elif w in emoji.UNICODE_EMOJI['en']:
            w = ' '+w+' '
            ls.append(w)
        else:
            continue
    return ' '.join(ls)

example = df['tweets'][4]+'我 们 لغة عربية 123 😅😅'
print('Example of original Tweet:')
print(example)
print('---------------')
print('Extracting English words and emojis ...')
example = keepengemoji(example)
print(example)
df['tweets']=df['tweets'].apply(lambda x : keepengemoji(x))

[nltk_data] Downloading package words to /Users/leahtan/nltk_data...
[nltk_data]   Package words is already up-to-date!


Example of original Tweet:
so very sorry for your loss 💔我 们 لغة عربية 123 😅😅
---------------
Extracting English words and emojis ...
so very sorry for your loss


In [91]:
import emoji

def remove_emojis(text):
  return ''.join(c for c in text if c not in emoji.UNICODE_EMOJI['en'])

example = df['tweets'][0]
print('Example of original Tweet:')
print(example)
print('---------------')
print('Extracting pure text ...')
example = remove_emojis(example)
print(example)

Example of original Tweet:
 😭   😭   😭   😭   😭 
---------------
Extracting pure text ...
              


In [92]:
puretext = [remove_emojis(t).strip() for t in df['tweets']]
puretext[:10]

['',
 'i hate this feeling',
 'ca believe i just went out in this cold to buy food what in the poor ca i just be rich and have people working for me',
 'i need a new trap house so if you really wit me baby put your name on this lease',
 'so very sorry for your loss',
 'random',
 'so you wan na be bad at sex and get away with it',
 'brake failure',
 'he is so annoying',
 'back this because damn he fire']

In [93]:
df['text']=puretext

In [95]:
# new column, replace emoji with its name, for 4.3
import demoji
wtemo = [demoji.replace_with_desc(df.tweets[i]).replace(':','') for i in range(len(df))]
df['wtemo'] = wtemo


In [97]:
# new column, replace emoji with its most similar text tokens, for 4.4
emosimi_df = pd.read_csv("en_most_similar_names.csv")  # joined str of 5 most similar text tokens of emojis
emosimi_df

Unnamed: 0.1,Unnamed: 0,0
0,©,mail prince ladder cutie momo
1,‼,bossy treatment dreamy urgent announcement
2,⁉,predraft ser miserable spree ake
3,↗,status silver elite diamond rookie
4,↘,status silver elite noodle diamond
...,...,...
329,🩸,rifle assault bin catwalk platinum
330,🩹,scratch tae knee fell attached
331,🫠,shaking titty kissing heel hop
332,🫣,titty prefer typo struggling confirm


In [109]:
import emoji
ls = []
for i in df['tweets']:
    wls = []
    words = i.split()
    for word in words:
        if word not in emoji.UNICODE_EMOJI['en']:
            wls.append(word)
        elif word in emosimi_df['Unnamed: 0'].to_list():
            wls.append(emosimi_df.loc[emosimi_df['Unnamed: 0'] == word, '0'].iloc[0])
        else:
            wls.append(demoji.replace_with_desc(word).replace(':',''))
    simi = ' '.join(wls)
    ls.append(simi)

In [111]:
df['simiemo'] = ls

In [112]:
df.head()

Unnamed: 0.1,Unnamed: 0,tweets,labels,text,wtemo,simiemo
0,0,😭 😭 😭 😭 😭,0,,loudly crying face loudly crying face lou...,dry cringing outfit instantly had dry cringing...
1,1,i hate this feeling 😢,0,i hate this feeling,i hate this feeling crying face,i hate this feeling condolence fletcher rip de...
2,2,ca believe i just went out in this cold to buy...,0,ca believe i just went out in this cold to buy...,ca believe i just went out in this cold to buy...,ca believe i just went out in this cold to buy...
3,3,i need a new trap house so if you really wit m...,0,i need a new trap house so if you really wit m...,i need a new trap house so if you really wit m...,i need a new trap house so if you really wit m...
4,4,so very sorry for your loss 💔,0,so very sorry for your loss,so very sorry for your loss broken heart,so very sorry for your loss condolence heartbr...


In [378]:
df['simiemo'].iloc[1]

'i hate this feeling condolence fletcher rip devastating heartbreaking'

In [113]:
# split dataset
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)

y_train = train['labels']
y_test = test['labels']

print(len(y_train),len(y_test))

10560 2640


### 4.2 Classification without emojis
using text as x
- LSTM: 0.57
- DNN: 0.56

In [114]:
# classification with pure text
x_train42 = train['text'].to_list()
x_test42 = test['text'].to_list()

print(len(x_train42),len(x_test42))

10560 2640


In [115]:
# encode the words
import tensorflow as tf
import numpy as np
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

print('Loading data...')
def get_sequences(tokenizer, tweets):
  sequences = tokenizer.texts_to_sequences(tweets)
  padded = pad_sequences(sequences, truncating ='post', maxlen = maxlen)
  return padded


# tokenizer
tokenizer = Tokenizer(num_words=10000, oov_token='<UNK>')
tokenizer.fit_on_texts(x_train42)
print(tokenizer.texts_to_sequences([x_train42[0]]))
x_train42_seq = get_sequences(tokenizer, x_train42)

tokenizer.fit_on_texts(x_test42)
print(tokenizer.texts_to_sequences([x_test42[0]]))
x_test42_seq = get_sequences(tokenizer, x_test42)

print('x_train shape:', x_train42_seq.shape)
print('x_train shape:', x_test42_seq.shape)

get_sequences(tokenizer, x_train42[:2])

# set parameters
max_features = 60000 # cut texts after this number of words (among top max_features most common words)
embedding_dims = 300
maxlen = 40 #based on 2, most tweets are shorter than 40 words


Loading data...
x_train shape: (10560, 40)
x_train shape: (2640, 40)
[[3, 1323, 18, 23, 629, 3, 217, 1841]]
[[277, 796, 1039, 26, 280, 6, 308]]


In [354]:
# LSTM model
print('Build LSTM model...')

model = tf.keras.models.Sequential([
        tf.keras.layers.Embedding(10000, 300, input_length=maxlen),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(20, return_sequences=True)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(20)),
        tf.keras.layers.Dense(6, activation= 'softmax')
])

model.compile(
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer = 'adam',
    metrics = ['accuracy']
)


# run lstm model
# the model, with training set, validation set
h = model.fit(
    x_train42_seq, y_train,
    validation_data=( x_test42_seq, y_test,),
    epochs=10,
    callbacks=[
               tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=2)
    ]
)

# testing model
score, accuracy = model.evaluate(x_test42_seq, y_test)
print('Test accuracy: {}, Test loss: {}'.format(accuracy, score))

Build LSTM model...
Epoch 1/10
Epoch 2/10
Epoch 3/10
Test accuracy: 0.5746212005615234, Test loss: 1.2135310173034668


In [None]:
# DNN model

In [366]:
# let's try a more complicated DNN since the dimension is high


print('Build DNN model...')
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(max_features, embedding_dims, input_length=maxlen))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(512, activation='sigmoid'))
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.Dense(512, activation='sigmoid'))
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

# config model
model.summary()
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# training model
model.fit(x_train42_seq, y_train, batch_size=64, epochs=20, validation_data=(x_test42_seq, y_test))


Build DNN model...
Model: "sequential_85"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_33 (Embedding)    (None, 40, 300)           18000000  
                                                                 
 flatten_33 (Flatten)        (None, 12000)             0         
                                                                 
 dense_200 (Dense)           (None, 512)               6144512   
                                                                 
 dropout_108 (Dropout)       (None, 512)               0         
                                                                 
 dense_201 (Dense)           (None, 512)               262656    
                                                                 
 dropout_109 (Dropout)       (None, 512)               0         
                                                                 
 dense_202 (Dense)           (None

<keras.callbacks.History at 0x7ff14aa0ba90>

In [356]:
# testing DNN model
score, accuracy = model.evaluate(x_test42_seq, y_test)
print('Test accuracy: {}, Test loss: {}'.format(accuracy, score))

Test accuracy: 0.5587121248245239, Test loss: 2.3906760215759277


### 4.3 Classification with replacing the emojis with their descriptive names

using the wtemo column as x
- LSTM: 0.77
- DNN: 0.78

In [120]:
# classification with emojis' descriptive names
x_train43 = train['wtemo'].to_list()
x_test43 = test['wtemo'].to_list()

print(len(x_train43),len(x_test43))

10560 2640


In [123]:
# encode the words
import tensorflow as tf
import numpy as np
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


# tokenizer
tokenizer = Tokenizer(num_words=10000, oov_token='<UNK>')
tokenizer.fit_on_texts(x_train43)
print(tokenizer.texts_to_sequences([x_train43[0]]))
x_train43_seq = get_sequences(tokenizer, x_train43)

tokenizer.fit_on_texts(x_test43)
print(tokenizer.texts_to_sequences([x_test43[0]]))
x_test43_seq = get_sequences(tokenizer, x_test43)

print('Loading data...')
print('x_train shape:', x_train43_seq.shape)
print('x_train shape:', x_test43_seq.shape)
get_sequences(tokenizer, x_train43[:2])

[[11, 1452, 21, 35, 741, 11, 280, 1973, 2, 3, 16, 7, 15]]
[[315, 251, 1170, 3, 351, 5, 2, 14, 388]]
Loading data...
x_train shape: (10560, 40)
x_train shape: (2640, 40)


array([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,   10, 1680,   21,   35,  857,   10,
         297, 1523,    2,    3,   16,    7,   15],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0, 1391,   34,  423,   14,   18,
         698,  164,    4,   86,    8,    5,    2]], dtype=int32)

In [350]:
# LSTM model
print('Build LSTM model...')

model = tf.keras.models.Sequential([
        tf.keras.layers.Embedding(10000, 300, input_length=maxlen),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(20, return_sequences=True)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(20)),
        tf.keras.layers.Dense(6, activation= 'softmax')
])

model.compile(
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer = 'adam',
    metrics = ['accuracy']
)


# run lstm model
# the model, with training set, validation set
h = model.fit(
    x_train43_seq, y_train,
    validation_data=( x_test43_seq, y_test,),
    epochs=10,
    callbacks=[
               tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=2)
    ]
)

# testing model
score, accuracy = model.evaluate(x_test43_seq, y_test)
print('Test accuracy: {}, Test loss: {}'.format(accuracy, score))

Build LSTM model...
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Test accuracy: 0.773106038570404, Test loss: 0.9008961319923401


In [365]:
# let's try a more complicated DNN since the dimension is high


print('Build DNN model...')
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(max_features, embedding_dims, input_length=maxlen))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(512, activation='sigmoid'))
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.Dense(512, activation='sigmoid'))
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

# config model
model.summary()
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# training model
model.fit(x_train43_seq, y_train, batch_size=64, epochs=20, validation_data=(x_test43_seq, y_test))


Build DNN model...
Model: "sequential_84"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_32 (Embedding)    (None, 40, 300)           18000000  
                                                                 
 flatten_32 (Flatten)        (None, 12000)             0         
                                                                 
 dense_197 (Dense)           (None, 512)               6144512   
                                                                 
 dropout_106 (Dropout)       (None, 512)               0         
                                                                 
 dense_198 (Dense)           (None, 512)               262656    
                                                                 
 dropout_107 (Dropout)       (None, 512)               0         
                                                                 
 dense_199 (Dense)           (None

<keras.callbacks.History at 0x7ff1b03e6f70>

In [352]:
# testing DNN model
score, accuracy = model.evaluate(x_test43_seq, y_test)
print('Test accuracy: {}, Test loss: {}'.format(accuracy, score))

Test accuracy: 0.7753787636756897, Test loss: 1.2321490049362183


### 4.4 Classification with replacing the emojis with the most similar 5 text tokens

using the simiemo column as x

if the emoji is not shown in our trained Word2Vec model, than replacing it with its descriptive name as 4.3

- LSTM: 0.76
- DNN: 0.75

In [128]:
# classification with emojis' descriptive names
x_train44 = train['simiemo'].to_list()
x_test44 = test['simiemo'].to_list()

print(len(x_train44),len(x_test44))

10560 2640


In [130]:
# encode the words
import tensorflow as tf
import numpy as np
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


# tokenizer
tokenizer = Tokenizer(num_words=10000, oov_token='<UNK>')
tokenizer.fit_on_texts(x_train44)
print(tokenizer.texts_to_sequences([x_train44[0]]))
x_train44_seq = get_sequences(tokenizer, x_train44)

tokenizer.fit_on_texts(x_test44)
print(tokenizer.texts_to_sequences([x_test44[0]]))
x_test44_seq = get_sequences(tokenizer, x_test44)

print('Loading data...')
print('x_train shape:', x_train44_seq.shape)
print('x_train shape:', x_test44_seq.shape)
get_sequences(tokenizer, x_train44[:2])

[[8, 1771, 39, 59, 1018, 8, 444, 2251, 13, 12, 16, 15, 14]]
[[535, 477, 1494, 50, 539, 25, 47, 74, 77, 46, 11, 576]]
Loading data...
x_train shape: (10560, 40)
x_train shape: (2640, 40)


array([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    8, 1978,   40,   59, 1159,    8,
         471, 1825,   14,   12,   16,   15,   13],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0, 1700,   58,  608,   11,   18,  974,  237,
           2,  119,    5,    7,    4,    6,    3]], dtype=int32)

In [348]:
# LSTM model
print('Build LSTM model...')

model = tf.keras.models.Sequential([
        tf.keras.layers.Embedding(10000, 300, input_length=maxlen),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(20, return_sequences=True)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(20)),
        tf.keras.layers.Dense(6, activation= 'softmax')
])

model.compile(
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer = 'adam',
    metrics = ['accuracy']
)


# run lstm model
# the model, with training set, validation set
h = model.fit(
    x_train44_seq, y_train,
    validation_data=( x_test44_seq, y_test,),
    epochs=10,
    callbacks=[
               tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=2)
    ]
)

# testing model
score, accuracy = model.evaluate(x_test44_seq, y_test)
print('Test accuracy: {}, Test loss: {}'.format(accuracy, score))

Build LSTM model...
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Test accuracy: 0.7617424130439758, Test loss: 1.3963693380355835


In [364]:
# DNN model

print('Build DNN model...')
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(max_features, embedding_dims, input_length=maxlen))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(512, activation='sigmoid'))
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.Dense(512, activation='sigmoid'))
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

# config model
model.summary()
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# training model
model.fit(x_train44_seq, y_train, batch_size=64, epochs=20, validation_data=(x_test44_seq, y_test))

# testing DNN model
score, accuracy = model.evaluate(x_test44_seq, y_test)
print('Test accuracy: {}, Test loss: {}'.format(accuracy, score))

Build DNN model...
Model: "sequential_83"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_31 (Embedding)    (None, 40, 300)           18000000  
                                                                 
 flatten_31 (Flatten)        (None, 12000)             0         
                                                                 
 dense_194 (Dense)           (None, 512)               6144512   
                                                                 
 dropout_104 (Dropout)       (None, 512)               0         
                                                                 
 dense_195 (Dense)           (None, 512)               262656    
                                                                 
 dropout_105 (Dropout)       (None, 512)               0         
                                                                 
 dense_196 (Dense)           (None

### 4.5 Classification with emojis word embedding vectors
using original tweets & pre-trained model

* the result is not very satisfying because many tokens are droped due to key missing in pre-trained model

- LSTM: 0.74
- DNN: 0.74

In [138]:
#import the trianed Word2Vec model
# Load back with memory-mapping = read-only, shared across processes.
wv = KeyedVectors.load("en_word2vec.wordvectors", mmap='r')
#example
vector = wv['sun']  # Get numpy vector of a word
print(wv['sun']+wv['hi'])

[-2.23617658e-01  3.03572536e-01 -2.34024569e-01 -2.47136518e-01
  3.42488512e-02  2.26997703e-01 -2.13153273e-01  2.68146336e-01
  1.67842329e-01  9.75845903e-02  1.41029686e-01  2.98320055e-01
  2.76181161e-01 -5.50599098e-01 -3.01040024e-01 -1.99736565e-01
 -3.91080305e-02 -1.45365596e-01  3.59019399e-01  2.67562568e-01
  8.66341814e-02  3.83522123e-01  3.28554213e-01 -3.49370912e-02
 -2.03482211e-02  1.65105611e-01 -2.13322595e-01 -4.57124531e-01
  5.29949784e-01 -2.87918448e-01 -4.31380808e-01 -3.17272484e-01
 -2.64893621e-02  1.50817081e-01  1.24318779e-01 -6.29126579e-02
  1.95232570e-01  2.90508717e-02 -1.99170351e-01 -4.06680286e-01
  1.04108453e-03 -9.04468596e-02 -3.02806824e-01 -9.40951332e-02
 -1.70607924e-01  1.75087050e-01 -3.72158021e-01 -9.81016532e-02
  4.30744514e-02 -4.16884720e-02 -1.62985206e-01 -7.64575601e-03
  5.25135756e-01  4.74683762e-01  5.04842401e-01 -1.79820448e-01
  3.28110069e-01 -2.36159451e-02  3.41797620e-02 -3.61506194e-02
 -8.63424167e-02 -2.08626

In [279]:
# sum all vectors of one tweet

#train set
vecls = list()
for tweet in train.tweets:
    vec = []
    for word in tweet.split():
        if vec == []:
            try:
                vec = wv[word].tolist()
            except:
                continue
        else:
            try:
                vec += wv[word].tolist()
            except:
                continue
    vecls.append(vec)
train['word2vec'] = vecls


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train['word2vec'] = vecls


In [276]:
#test set
vecls2 = list()
for tweet in test.tweets:
    vec = []
    for word in tweet.split():
        if vec == []:
            try:
                vec = wv[word]
            except:
                continue
        else:
            try:
                vec += wv[word]
            except:
                continue
    vecls2.append(vec)
test['word2vec'] = vecls2

  if vec == []:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test['word2vec'] = vecls2


In [280]:
train

Unnamed: 0.1,Unnamed: 0,tweets,labels,text,wtemo,simiemo,word2vec
10056,10056,the emotion in your match the tweet perfectly 😂,1,the emotion in your match the tweet perfectly,the emotion in your match the tweet perfectly ...,the emotion in your match the tweet perfectly ...,"[0.16144225001335144, 0.32414403557777405, -0...."
4215,4215,stray are coming to my country guy i cry 😭,0,stray are coming to my country guy i cry,stray are coming to my country guy i cry loud...,stray are coming to my country guy i cry dry c...,"[-0.10359860211610794, 0.11853410303592682, 0...."
10394,10394,they are both legit giving happiness that we f...,1,they are both legit giving happiness that we find,they are both legit giving happiness that we f...,they are both legit giving happiness that we f...,"[0.2070552110671997, -0.1897415965795517, -0.2..."
6926,6926,water too warm hand are a pleasant surprise i ...,1,water too warm hand are a pleasant surprise i ...,water too warm hand are a pleasant surprise i ...,water too warm hand are a pleasant surprise i ...,"[0.02034851536154747, 0.18007014691829681, -0...."
10158,10158,with actor mother 😊,1,with actor mother,with actor mother smiling face with smiling e...,with actor mother wishing wonderful appreciate...,"[0.021692397072911263, -0.013324310071766376, ..."
...,...,...,...,...,...,...,...
5836,5836,slut sex here to promote and slut for detail ❤,0,slut sex here to promote and slut for detail,slut sex here to promote and slut for detail ...,slut sex here to promote and slut for detail ️...,"[-0.10501617938280106, 0.13486815989017487, -0..."
7086,7086,ya i feel glad and grateful to have a great 😊,1,ya i feel glad and grateful to have a great,ya i feel glad and grateful to have a great s...,ya i feel glad and grateful to have a great wi...,"[-0.3802816867828369, 0.14718611538410187, -0...."
4134,4134,i clapped but i show you lot the ugliness 😭,0,i clapped but i show you lot the ugliness,i clapped but i show you lot the ugliness lou...,i clapped but i show you lot the ugliness dry ...,"[-0.004160878248512745, -0.09457463771104813, ..."
9053,9053,the only meat ill be getting today thanks to ...,1,the only meat ill be getting today thanks to ...,the only meat ill be getting today thanks to ...,the only meat ill be getting today thanks to k...,"[0.16144225001335144, 0.32414403557777405, -0...."


In [None]:
######

In [158]:
x_train45 = train['tweets'].to_list()
x_test45 = test['tweets'].to_list()

print(len(x_train45),len(x_test45))

10560 2640


In [315]:
#example
word_index['😂']

8

In [318]:
# drop the words if not in word_index
wls = []
for i in train['tweets']:
    ls = []
    words = i.split(' ')
    for w in words:
        try:
            word_index[w]
            ls.append(w)
        except:
            continue
    wls.append(' '.join(ls))

In [322]:
wls2 = []
for i in test['tweets']:
    ls = []
    words = i.split(' ')
    for w in words:
        try:
            word_index[w]
            ls.append(w)
        except:
            continue
    wls2.append(' '.join(ls))

In [323]:
x_train45 = wls
x_test45 = wls2

In [324]:
# encode the words
import tensorflow as tf
import numpy as np
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

print('Loading data...')
def get_sequences(tokenizer, tweets):
  sequences = tokenizer.texts_to_sequences(tweets)
  padded = pad_sequences(sequences, truncating ='post', maxlen = maxlen)
  return padded

# tokenizer
tokenizer = Tokenizer(num_words=10000, oov_token='<UNK>')
tokenizer.fit_on_texts(x_train45)
word_index = tokenizer.word_index #get word_index
print(tokenizer.texts_to_sequences([x_train45[0]]))
x_train45_seq = get_sequences(tokenizer, x_train45)

tokenizer.fit_on_texts(x_test45)
print(tokenizer.texts_to_sequences([x_test45[0]]))
x_test45_seq = get_sequences(tokenizer, x_test45)

print('x_train shape:', x_train45_seq.shape)
print('x_train shape:', x_test45_seq.shape)


Loading data...
[[4, 1471, 20, 29, 725, 4, 262, 2008, 8]]
[[332, 907, 1171, 32, 335, 42, 7, 366]]
x_train shape: (10560, 40)
x_train shape: (2640, 40)


In [325]:
padding_type='post'
truncation_type='post'

# set parameters
max_features = 60000 # cut texts after this number of words (among top max_features most common words)
embedding_dims = 300
maxlen = 40 #based on 2, most tweets are shorter than 40 words

embedding_matrix = np.zeros((len(word_index) + 1, embedding_dims))
for word, i in word_index.items():
    try:
        embedding_vector = wv[word]
    except:
        embedding_vector = None
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

In [326]:
embedding_matrix.shape

(5813, 300)

In [327]:
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional

embedding_layer = Embedding(input_dim=len(word_index)+1,
                            output_dim= embedding_dims,
                            weights=[embedding_matrix],
                            input_length=40,
                            trainable=False)

In [347]:
# LSTM model
print('Build LSTM model...')


from tensorflow.keras.callbacks import EarlyStopping, TensorBoard
from tensorflow.keras.models import Sequential
model = Sequential([
    embedding_layer,
    Bidirectional(LSTM(20, return_sequences=True)), 
    Bidirectional(LSTM(20)),
    Dense(128, activation='relu'),
   Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

# testing model
log_folder = 'logs'
callbacks = [
            EarlyStopping(patience = 10),
            TensorBoard(log_dir=log_folder)
            ]
num_epochs = 10
model.fit(x_train45_seq, y_train, epochs=num_epochs, validation_data=(x_test45_seq, y_test),
          callbacks=[
               tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=2)
    ])


# testing model
score, accuracy = model.evaluate(x_test45_seq, y_test)
print('Test accuracy: {}, Test loss: {}'.format(accuracy, score))


Build LSTM model...
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Test accuracy: 0.7416666746139526, Test loss: 0.8434050679206848


In [363]:
# DNN model

print('Build DNN model...')
model = tf.keras.Sequential()
model.add(embedding_layer)
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(512, activation='sigmoid'))
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.Dense(512, activation='sigmoid'))
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

# config model
model.summary()
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# training model
model.fit(x_train45_seq, y_train, batch_size=64, epochs=20, validation_data=(x_test45_seq, y_test),
         callbacks=[
               tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=2)
    ])

# testing DNN model
score, accuracy = model.evaluate(x_test45_seq, y_test)
print('Test accuracy: {}, Test loss: {}'.format(accuracy, score))

Build DNN model...
Model: "sequential_82"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_21 (Embedding)    (None, 40, 300)           1743900   
                                                                 
 flatten_30 (Flatten)        (None, 12000)             0         
                                                                 
 dense_191 (Dense)           (None, 512)               6144512   
                                                                 
 dropout_102 (Dropout)       (None, 512)               0         
                                                                 
 dense_192 (Dense)           (None, 512)               262656    
                                                                 
 dropout_103 (Dropout)       (None, 512)               0         
                                                                 
 dense_193 (Dense)           (None