# Tweet emoji Analysis <a class="anchor" id="tea"></a>

<a href="https://www.linkedin.com/in/ouassim-adnane/">Ouassim Adnane</a> 08 June 2020

# Overview 

In this notebook, I've used a tweets dataset that contains tweet text labeled with 13 emotions (neutral, worry, happiness, sadness, love, surprise, fun, relief, hate, empty, enthusiasm, boredom and anger). The goal is to predict the percentage of each emotion in a given text.

To achieve that goal, I first preprocessed the text data with the following techniques:

<li>correct misspelled text</li>
<li>replace English contractions with their meaning (isn't => is not)</li>
<li>remove some punctuation, URLs, user mentions and extra spaces</li>
<li>replace emojis with their meaning</li><br>
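
The cleaning steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual preprocessing code: `clean_tweet`, `CONTRACTIONS` and `EMOJI_MEANINGS` are hypothetical names, the lookup tables are deliberately tiny, and the spelling-correction step is left out.

```python
import re

# Hypothetical, deliberately tiny lookup tables (assumptions for illustration);
# a real pipeline would use fuller dictionaries or a dedicated library.
CONTRACTIONS = {"isn't": "is not", "can't": "cannot", "i'm": "i am"}
EMOJI_MEANINGS = {"😍": "love", "😢": "sad"}

def clean_tweet(text):
    """Apply the listed cleaning steps (except spell correction) to one tweet."""
    text = text.lower()
    # Remove URLs and user mentions
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    # Replace emojis with a word describing them
    for symbol, meaning in EMOJI_MEANINGS.items():
        text = text.replace(symbol, f" {meaning} ")
    # Expand contractions before punctuation is stripped (the apostrophe matters)
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Drop remaining punctuation, then collapse extra spaces
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("@user It isn't over 😍  see https://t.co/abc !!"))
```

Contractions are expanded before punctuation is removed, because stripping the apostrophe first would turn "isn't" into "isn t" and the lookup would no longer match.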

For the modeling part I've used LSTMs and the RoBERTa base model:
<li>First, a basic LSTM</li>
<li>LSTM model with GloVe word embeddings</li>
<li>RoBERTa base model</li>
<br>
In the final part, I've made a donut chart that shows the level of each emotion in a particular text.

# Load Necessary Packages <a class="anchor" id="tu"></a>


In [None]:
!pip install tweet-preprocessor 2>/dev/null 1>/dev/null
!pip install emoji KaggleDatasets transformers


In [None]:
import preprocessor as p
import numpy as np 
import pandas as pd 
import emoji
import keras
from sklearn.model_selection import train_test_split
import tensorflow as tf
from keras.models import Sequential
from keras.layers.recurrent import LSTM, GRU,SimpleRNN
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from keras.preprocessing import sequence, text
from keras.callbacks import EarlyStopping
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
import plotly.graph_objects as go
import plotly.express as px
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
#from kaggle_datasets import KaggleDatasets
import transformers
from transformers import TFAutoModel, AutoTokenizer
from tqdm.notebook import tqdm
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, processors
from tqdm import tqdm

# Data preparation / Load Datasets  <a class="anchor" id="dp"></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Load the training dataset from Google Drive
df_emotion_train = pd.read_csv("/content/drive/MyDrive/CE888 Data Science for Decision Making/Tweet Evaluation Project/Preprocessed Datasets/df_emotion_train.csv")
df_emotion_train.head(1)

Unnamed: 0,Tweet text,labels,hashtag,cleaned_text,tokenized_lemmatized_text
0,“Worry is a down payment on a problem you may ...,2,"['motivation', 'leadership', 'worry']",worry payment problem have joyce meyer,"['worry', 'payment', 'problem', 'have', 'joyce..."


In [None]:
# Load the validation dataset from Google Drive
df_emotion_val = pd.read_csv("/content/drive/MyDrive/CE888 Data Science for Decision Making/Tweet Evaluation Project/Preprocessed Datasets/df_emotion_val.csv")
df_emotion_val.head(1)

Unnamed: 0,Tweet text,labels,hashtag,cleaned_text,tokenized_lemmatized_text
0,"@user @user Oh, hidden revenge and anger...I r...",0,[],oh hidden revenge anger i rememberthe time she...,"['oh', 'hidden', 'revenge', 'anger', 'i', 'rem..."


In [None]:
# Load the test dataset from Google Drive
df_emotion_test = pd.read_csv("/content/drive/MyDrive/CE888 Data Science for Decision Making/Tweet Evaluation Project/Preprocessed Datasets/df_emotion_test.csv")
df_emotion_test.head(1)

Unnamed: 0,Tweet text,labels,hashtag,cleaned_text,tokenized_lemmatized_text
0,#Deppression is real. Partners w/ #depressed p...,3,"['Deppression', 'depressed', 'anxiety']",real partners w people truly dont understand d...,"['real', 'partner', 'w', 'people', 'truly', 'd..."


In [None]:
# Helper that prints the contents of a .txt file
def text_file_reader(data):
  with open(data) as f:
    contents = f.read()
    print(contents)

In [None]:
#Types of labels for emotion dataset
text_file_reader('/content/drive/MyDrive/CE888 Data Science for Decision Making/Tweet Evaluation Project/emotion/mapping.txt')

0	anger
1	joy
2	optimism
3	sadness


# Modeling  <a class="anchor" id="m"></a>


### Encoding the data and train test split <a class="anchor" id="m-ed"></a>


In [None]:
sent_to_id  = {"anger":0, "joy":1,"optimism":2,"sadness":3}

In [None]:
def encoder(y_label):
  label_encoder = LabelEncoder()
  integer_encoded = label_encoder.fit_transform(y_label)

  onehot_encoder = OneHotEncoder(sparse=False)
  integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
  Y = onehot_encoder.fit_transform(integer_encoded)
  return Y

In [None]:
X_train = df_emotion_train['tokenized_lemmatized_text']
y_train = encoder(df_emotion_train['labels'])

X_test = df_emotion_test['tokenized_lemmatized_text']
y_test = encoder(df_emotion_test['labels'])

X_val = df_emotion_val['tokenized_lemmatized_text']
y_val = encoder(df_emotion_val['labels'])
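
The `encoder` above fits a fresh `LabelEncoder` and `OneHotEncoder` on each split, so the three label matrices only share the same column layout if every split contains all four labels. Since the labels are already the integers 0 to 3, one possible alternative is to one-hot encode against a fixed class count. The sketch below assumes NumPy only; `one_hot` and `NUM_CLASSES` are illustrative names, not part of the notebook.

```python
import numpy as np

NUM_CLASSES = 4  # anger, joy, optimism, sadness (taken from mapping.txt)

def one_hot(labels, num_classes=NUM_CLASSES):
    """One-hot encode integer labels against a fixed number of classes."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype="float32")[labels]

print(one_hot([2, 0, 3]))
```

With a fixed class count, a split that happens to miss a label still produces a matrix with four columns in the same order.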

In [None]:
len(X_train)

3257

### LSTM <a class="anchor" id="m-l"></a>

In [None]:
# using keras tokenizer here
token = text.Tokenizer(num_words=None)
max_len = 160
Epoch = 5
token.fit_on_texts(list(X_train) + list(X_test))
X_train_pad = sequence.pad_sequences(token.texts_to_sequences(X_train), maxlen=max_len)
X_val_pad = sequence.pad_sequences(token.texts_to_sequences(X_val), maxlen=max_len)
X_test_pad = sequence.pad_sequences(token.texts_to_sequences(X_test), maxlen=max_len)
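
To make the padding step concrete, here is a small pure-Python sketch of what `pad_sequences` does to a single sequence with its default settings (`padding='pre'`, `truncating='pre'`, `value=0`). `pad_pre` is a hypothetical helper written for illustration; it is not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic the default pre-padding and pre-truncation for one sequence."""
    seq = list(seq)[-maxlen:]                    # keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq   # left-pad with the fill value

print(pad_pre([5, 8, 2], maxlen=5))           # [0, 0, 5, 8, 2]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```

Every tweet therefore becomes a vector of exactly `max_len` integers, with zeros on the left for short tweets.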

In [None]:
w_idx = token.word_index

In [None]:
embed_dim = 160
lstm_out = 250

model = Sequential()
model.add(Embedding(len(w_idx) +1 , embed_dim,input_length = X_test_pad.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(keras.layers.core.Dense(4, activation='softmax'))  # one output unit per emotion class
# optimizer candidates: adam, rmsprop
model.compile(loss = "categorical_crossentropy", optimizer='adam',metrics = ['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 160, 160)          1880480   
_________________________________________________________________
spatial_dropout1d (SpatialDr (None, 160, 160)          0         
_________________________________________________________________
lstm (LSTM)                  (None, 250)               411000    
_________________________________________________________________
dense (Dense)                (None, 4)                 1004      
Total params: 2,292,484
Trainable params: 2,292,484
Non-trainable params: 0
_________________________________________________________________
None
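
The parameter counts in the summary can be reproduced by hand. In the sketch below, the vocabulary size of 11,753 is not printed anywhere in the notebook; it is inferred from the embedding row (1,880,480 / 160), so treat it as an assumption.

```python
vocab_size = 11753   # len(w_idx) + 1, inferred from 1,880,480 / 160
embed_dim = 160
lstm_out = 250
num_classes = 4

# Embedding: one embed_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embed_dim
# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embed_dim + lstm_out) * lstm_out + lstm_out)
# Dense: weight matrix plus one bias per class
dense_params = lstm_out * num_classes + num_classes

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
```

Most of the weights sit in the embedding layer, which is worth keeping in mind given that the training set has only 3,257 tweets.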


In [None]:
batch_size = 32

In [None]:
Epoch = 15
model.fit(X_train_pad, y_train, epochs = Epoch, batch_size=batch_size,validation_data=(X_val_pad, y_val))

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x7fe515202050>

In [None]:
# Keep a second reference to the trained LSTM model. This is an alias, not a copy;
# the name lstm_model avoids shadowing the imported LSTM layer class.
lstm_model = model

# Evaluate performance on the test dataset; evaluate() does not update the weights
test_loss, test_acc = lstm_model.evaluate(X_test_pad, y_test, batch_size=batch_size)

#### Test LSTM on texts <a class="anchor" id="m-lr"></a>


In [None]:
def get_sentiment(model,text):
    twt = token.texts_to_sequences([text])
    twt = sequence.pad_sequences(twt, maxlen=max_len, dtype='int32')
    sentiment = model.predict(twt,batch_size=1,verbose = 2)
    sent = np.round(np.dot(sentiment,100).tolist(),0)[0]
    result = pd.DataFrame([sent_to_id.keys(),sent]).T
    result.columns = ["sentiment","percentage"]
    result=result[result.percentage !=0]
    return result
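
The conversion inside `get_sentiment` (scale the softmax output to percentages, round, drop the zero rows) can be illustrated without a trained model. The probabilities below are made up for the example, and `to_percentages` is a hypothetical helper, not a function from the notebook.

```python
labels = ["anger", "joy", "optimism", "sadness"]
probs = [0.004, 0.90, 0.09, 0.006]   # made-up softmax output for one text

def to_percentages(probs, labels):
    """Scale probabilities to whole percentages and drop emotions that round to 0."""
    rows = [(label, round(p * 100)) for label, p in zip(labels, probs)]
    return [(label, pct) for label, pct in rows if pct != 0]

print(to_percentages(probs, labels))
```

Because each value is rounded separately, the percentages that remain do not always sum to exactly 100.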

In [None]:
def plot_result(df):
    # Fixed colour per emotion so charts are comparable across texts
    colors={'anger':'rgb(213,0,0)','joy':'rgb(0,0,0)',
                    'optimism':'rgb(0,142,248)','sadness':'rgb(245,178,123)'}
    # Keep only the colours for the emotions present in df
    col_2={}
    for i in df.sentiment.to_list():
        col_2[i]=colors[i]
    fig = px.pie(df, values='percentage', names='sentiment',color='sentiment',color_discrete_map=col_2,hole=0.3)
    fig.show()

In [None]:
result =get_sentiment(model,"Had an absolutely brilliant day loved seeing an old friend and reminiscing")
plot_result(result)

## ALBERT Base

In [None]:
def regular_encode(texts, tokenizer, maxlen=512):
    # Tokenize a list of texts, padding or truncating each one to exactly maxlen ids
    enc_di = tokenizer.batch_encode_plus(
        texts, 
        return_token_type_ids=False,
        padding='max_length',
        truncation=True,
        max_length=maxlen
    )
    
    return np.array(enc_di['input_ids'])

def build_model(transformer, max_len=160):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    sequence_output = transformer(input_word_ids)[0]
    cls_token = sequence_output[:, 0, :]
    out = Dense(4, activation='softmax')(cls_token)  # one output unit per emotion class
    
    model = Model(inputs=input_word_ids, outputs=out)
    model.compile(Adam(learning_rate=1e-5), loss='categorical_crossentropy', metrics=['accuracy'])
    
    return model
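
The slice `sequence_output[:, 0, :]` in `build_model` keeps only the hidden state of the first token of each sequence, which is the classification token. The shape change can be checked with a plain NumPy array standing in for the transformer output; the hidden size of 768 and the batch size of 2 are assumptions made for this illustration.

```python
import numpy as np

batch, seq_len, hidden = 2, 160, 768   # illustrative sizes, not read from the model
sequence_output = np.zeros((batch, seq_len, hidden), dtype="float32")

# Take position 0 along the sequence axis for every example in the batch
cls_token = sequence_output[:, 0, :]
print(sequence_output.shape, "->", cls_token.shape)
```

The dense softmax layer then maps each of these per-example vectors to four class probabilities.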

In [None]:
AUTO = tf.data.experimental.AUTOTUNE
MODEL = 'albert-base-v2'
tokenizer = AutoTokenizer.from_pretrained(MODEL)
X_train_t = regular_encode(X_train.to_list(), tokenizer, maxlen=max_len)
X_test_t = regular_encode(X_test.to_list(), tokenizer, maxlen=max_len)
X_val_t = regular_encode(X_val.to_list(), tokenizer, maxlen=max_len)


In [None]:
train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((X_train_t, y_train))
    .repeat()
    .shuffle(1995)
    .batch(batch_size)
    .prefetch(AUTO)
)

#valid_dataset = (
#    tf.data.Dataset
#    .from_tensor_slices((X_test_t, y_test))
#    .batch(batch_size)
#    .cache()
#    .prefetch(AUTO)
#)

valid_dataset = (
    tf.data.Dataset
    .from_tensor_slices((X_val_t, y_val))
    .batch(batch_size)
    .cache()
    .prefetch(AUTO)
)

In [None]:
transformer_layer = TFAutoModel.from_pretrained(MODEL)
albert = build_model(transformer_layer, max_len=max_len)
albert.summary()

Some layers from the model checkpoint at albert-base-v2 were not used when initializing TFAlbertModel: ['predictions']
- This IS expected if you are initializing TFAlbertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFAlbertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFAlbertModel were initialized from the model checkpoint at albert-base-v2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFAlbertModel for predictions without further training.


Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_word_ids (InputLayer)  [(None, 160)]             0         
_________________________________________________________________
tf_albert_model (TFAlbertMod TFBaseModelOutputWithPool 11683584  
_________

In [None]:
n_steps = X_train.shape[0] // batch_size
Epoch = 15
albert.fit(train_dataset,steps_per_epoch=n_steps,validation_data=valid_dataset,epochs=Epoch)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x7fe5169b7490>
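
Because `train_dataset` calls `.repeat()`, it never runs out of batches, so `fit` needs `steps_per_epoch` to know where one epoch ends. The arithmetic behind `n_steps` is shown below, using the training set size of 3257 reported earlier in the notebook.

```python
n_samples = 3257   # len(X_train) reported earlier
batch_size = 32

n_steps = n_samples // batch_size              # full batches per epoch
leftover = n_samples - n_steps * batch_size    # samples beyond the last full batch
print(n_steps, leftover)
```

With integer division, each epoch covers 101 batches, and the remaining 25 samples' worth of data rolls over into the next epoch rather than being dropped for good.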

In [None]:
# Keep a second reference to the trained ALBERT model (an alias, not a copy)
albert2 = albert

# Batch the test dataset the same way as the validation dataset
test_dataset = (
    tf.data.Dataset
    .from_tensor_slices((X_test_t, y_test))
    .batch(batch_size)
    .cache()
    .prefetch(AUTO)
)

# Evaluate performance on the test dataset; evaluate() does not update the weights
test_loss, test_acc = albert2.evaluate(test_dataset)

### Test ALBERT on Texts

In [None]:
def get_sentiment2(model,text):
    x_test1 = regular_encode([text], tokenizer, maxlen=max_len)
    test1 = (tf.data.Dataset.from_tensor_slices(x_test1).batch(1))
    #test1
    sentiment = model.predict(test1,verbose = 0)
    sent = np.round(np.dot(sentiment,100).tolist(),0)[0]
    result = pd.DataFrame([sent_to_id.keys(),sent]).T
    result.columns = ["sentiment","percentage"]
    result=result[result.percentage !=0]
    return result

In [None]:
result =get_sentiment2(albert,"The pain my heart feels is just too much for it to bear. Nothing eases this pain. I can’t hold myself back. I really miss you")
plot_result(result)



