**NumPy**
1.   NumPy is the fundamental package for scientific computing with Python.
2.   It provides a high-performance multidimensional array object and tools for working with these arrays.

**Pandas**


1.   Pandas is one of the most widely used Python libraries for data analysis.
2.   It lets us manipulate tabular data in much the same way as an Excel sheet.


In [1]:
import numpy as np
import pandas as pd

from pathlib import Path

import warnings
warnings.filterwarnings("ignore")

In [6]:
root = Path("/home/joydipb/Documents/CMT309_Comp_DS/emoji_prediction")
# each line of the file is one tweet; 'delimiter' is a separator string that is
# not expected to occur in the data, so every line is read into a single column
train_data = pd.read_csv(root/"train/us_train.text",
                         sep='delimiter', header=None)
train_data.columns = ["tweets"]
train_labels = pd.read_csv(root/"train/us_train.labels",
                           sep='delimiter', header=None)
train_tweets = train_data['tweets'].astype(str).values.tolist()

In [7]:
train_data["labels"] = train_labels
train_data

Unnamed: 0,tweets,labels
0,Sunday afternoon walking through Venice in the...,12
1,Time for some BBQ and whiskey libations. Chomp...,19
2,Love love love all these people ️ ️ ️ #friends...,0
3,"️ ️ ️ ️ @ Toys""R""Us",0
4,Man these are the funniest kids ever!! That fa...,2
...,...,...
499995,"Angel Baby #ella #greenville @ Greenville, Nor...",1
499996,@user fight who bob??,2
499997,oh. My. Goodness. @ Chili's Grill &amp; Bar,1
499998,Missing my baby already :( C u in a month @ Ba...,9


In [8]:
test_data = pd.read_csv(root/"test/us_test.text", sep='delimiter', header=None)
test_data.columns = ["tweets"]
test_labels = pd.read_csv(root/"test/us_test.labels",
                          sep='delimiter', header=None)
test_tweets = test_data['tweets'].astype(str).values.tolist()

In [9]:
test_data["labels"] = test_labels
test_data

Unnamed: 0,tweets,labels
0,en Pelham Parkway,2
1,The calm before...... | w/ sofarsounds @user |...,10
2,Just witnessed the great solar eclipse @ Tampa...,6
3,This little lady is 26 weeks pregnant today! E...,1
4,"Great road trip views! @ Shartlesville, Pennsy...",16
...,...,...
49995,@user @user @user #la #westhollywood #dtboy #l...,5
49996,Climbing subway stairs. That was nothing. #sta...,19
49997,Pops with Ms Drina at The Swanees Anniversary ...,6
49998,"We love ️ Soren! July 26, 2017 was her first d...",0


In [10]:
dev = pd.read_csv(root/"dev/us_dev.text", sep='delimiter', header=None)
dev.columns = ["tweets"]
dev_labels = pd.read_csv(root/"dev/us_dev.labels", sep='delimiter', header=None)
dev_tweets = dev['tweets'].astype(str).values.tolist()

In [11]:
dev["labels"] = dev_labels
dev

Unnamed: 0,tweets,labels
0,A little throwback with my favourite person @ ...,0
1,glam on @user yesterday for #kcon makeup using...,7
2,Democracy Plaza in the wake of a stunning outc...,11
3,Then &amp; Now. VILO @ Walt Disney Magic Kingdom,0
4,Who never... @ A Galaxy Far Far Away,2
...,...,...
49995,My #O2otd Love this chain so much and our new ...,1
49996,Met Santa and Olaf @ the North Pole today @ No...,0
49997,New York by Night Strideby #HERElocationNYC......,11
49998,Kisses for the birthday girl! @ Helzberg Diamonds,0


In [12]:
# print the shapes of all files
train_data.shape, test_data.shape #, mappings.shape

((500000, 2), (50000, 2))

In [13]:
train_length = train_data.shape[0]
test_length = test_data.shape[0]
train_length, test_length

(500000, 50000)

**NLTK is a library for Natural Language Processing (NLP) to create features from text**
When using words as features, we need to handle:
1.   Context -> eg: Not good
2.   Identify root words -> eg: help, helper, helping
3.   Words with similar meaning -> eg: good and nice

**Stopwords are very common words (such as "the", "is" and "at"). They add very little information to our model, so they can be removed.**

In [14]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/joydipb/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [15]:
stop_words = stopwords.words("english")
stop_words[:5]

['i', 'me', 'my', 'myself', 'we']

We follow these steps to preprocess the data before using it:

1.   Tokenize each tweet into a list of words
2.   Remove words starting with **@**, because they generally refer to Twitter handles and thus provide little or no information
3.   Remove **stopwords**
4.   Remove the **#** character to get the actual word used as a hashtag
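The four steps can be sketched on a single tweet as follows. This is a minimal illustration, not the notebook's own function: `SAMPLE_STOP_WORDS`, `clean_tweet` and the sample tweet are hypothetical, and the tiny stopword set stands in for the NLTK list so the sketch runs without any download.

```python
# Hypothetical mini stopword list, used here so the sketch runs without NLTK.
SAMPLE_STOP_WORDS = {"the", "a", "is", "at", "with", "my"}


def clean_tweet(tweet, stop_words=SAMPLE_STOP_WORDS):
    """Apply the four preprocessing steps to one tweet and return the kept words."""
    kept = []
    for word in tweet.split():        # step 1: split the tweet into words
        if word.startswith("@"):      # step 2: drop @handles
            continue
        if word in stop_words:        # step 3: drop stopwords
            continue
        if word.startswith("#"):      # step 4: strip the leading '#'
            word = word[1:]
        kept.append(word)
    return kept


sample = "Brunch with @user at the beach #sunday #friends"
cleaned = clean_tweet(sample)
print(cleaned)
```

Note that the stopword test here is case-sensitive, so a capitalised stopword such as "The" would be kept.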

In [16]:
# tokenize the tweets and rejoin the kept words into one string per tweet
def tokenize(tweets):
    # a set gives fast membership tests instead of scanning a list
    stop_words = set(stopwords.words("english"))
    tokenized_tweets = []
    for tweet in tweets:
        # split the tweet on single spaces
        words = tweet.split(" ")
        tokenized_string = ""
        for word in words:
            # drop @handles and stopwords; startswith() is also safe for the
            # empty strings that split(" ") yields on repeated spaces
            if not word.startswith("@") and word not in stop_words:
                # for a hashtag, drop the leading '#' and keep the word itself
                if word.startswith("#"):
                    word = word[1:]
                tokenized_string += word + " "
        tokenized_tweets.append(tokenized_string)
    return tokenized_tweets

**Keras** - an open-source neural network library written in Python that runs on top of TensorFlow, which expresses its computations as operations on tensors.

**Tokenizer** - vectorizes text by turning each text into a sequence of integers, one integer per word

**filters** - a string in which each character is one that will be filtered out of the text

**lower** - boolean that controls whether the text is converted to lower case

**tokenizer.texts_to_sequences(tweets)** - transforms each tweet in tweets into a sequence of integers


In [17]:
from keras.models import Sequential
# Bidirectional is exported from keras.layers directly
from keras.layers import Dense, LSTM, Bidirectional, Embedding
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

In [18]:
# translate tweets to a sequence of numbers
def encod_tweets(tweets):
    tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', split=" ", lower=True)
    tokenizer.fit_on_texts(tweets)
    return tokenizer, tokenizer.texts_to_sequences(tweets)

**Example**  
Uncomment and run the following code cell to see the example of the output
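As a stand-in for that example, here is a pure-Python sketch of the idea behind fitting a tokenizer and converting texts to integer sequences. It is not the Keras implementation: `fit_word_index`, `texts_to_ids` and the toy tweets are hypothetical names, and punctuation filtering is left out. It assumes, as Keras does, that more frequent words receive smaller indices and that index 0 is left unused.

```python
from collections import Counter


def fit_word_index(texts):
    """Map each word to an integer, giving the most frequent words the smallest indices."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order.
    # Indices start at 1 so that 0 stays free for padding.
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_ids(texts, word_index):
    """Replace every known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


toy_tweets = ["love the beach", "love new york", "the beach"]
toy_index = fit_word_index(toy_tweets)
toy_sequences = texts_to_ids(toy_tweets, toy_index)
print(toy_index)
print(toy_sequences)
```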

**pad_sequences** - transforms a list of sequences (lists of integers) into a 2D NumPy array of shape (num_samples, maxlen)

maxlen defaults to the length of the longest sequence, but it can also be passed as an argument

Sequences shorter than maxlen are padded with a fill value at the front or the end (pre or post padding).
Sequences longer than maxlen are truncated.
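The padding and truncation behaviour can be illustrated with plain NumPy. This is a sketch of the idea only, with a hypothetical helper `pad_post`; it pads at the end and, when a sequence is too long, keeps its last `maxlen` items, which is my understanding of what `pad_sequences` does when only `padding='post'` is passed.

```python
import numpy as np


def pad_post(sequences, maxlen, value=0):
    """Pad each sequence at the end up to maxlen; keep the last maxlen items if longer."""
    padded = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        trimmed = list(seq)[-maxlen:]          # truncate from the front
        padded[row, :len(trimmed)] = trimmed   # the remaining slots keep the fill value
    return padded


toy_padded = pad_post([[1, 2, 3], [4, 5, 6, 7, 8], [9]], maxlen=4)
print(toy_padded)
```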

**bit_vec** -> one-hot vector: all 0s except for a single 1 at the index of the label

In [19]:
# pad the encoded tweets and one-hot encode the labels
def format_data(encoded_tweets, max_length, labels):
    # pad at the end; longer sequences are truncated to max_length
    x = pad_sequences(encoded_tweets, maxlen=max_length, padding='post')
    y = []
    for emoji in labels:
        bit_vec = np.zeros(20)   # one slot per emoji class
        bit_vec[emoji] = 1
        y.append(bit_vec)
    y = np.asarray(y)
    return x, y
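The label loop in `format_data` builds one one-hot vector at a time. A vectorized alternative, sketched here with hypothetical toy labels, indexes the rows of an identity matrix instead. It assumes integer labels in the range 0 to 19.

```python
import numpy as np

num_classes = 20
toy_labels = np.array([2, 0, 19])

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing with the label array yields one row per label.
toy_one_hot = np.eye(num_classes)[toy_labels]
print(toy_one_hot.shape)
```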

In [32]:
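# create weight matrix from pre-trained embeddings
def create_weight_matrix(vocab, raw_embeddings, embedding_dim=300):
    # row 0 is reserved for padding, so the matrix needs len(vocab) + 1 rows
    vocab_size = len(vocab) + 1
    weight_matrix = np.zeros((vocab_size, embedding_dim))
    # raw_embeddings stays a dict (word -> vector); wrapping it in np.array
    # would break both the membership test and the lookup below
    for word, idx in vocab.items():
        vector = raw_embeddings.get(word)
        # words without a vector of the expected size keep an all-zero row
        if vector is not None and len(vector) == embedding_dim:
            weight_matrix[idx] = vector
    return weight_matrix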
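To make the shape of the result concrete, here is a toy version of the same idea with a three-word vocabulary and 3-dimensional vectors. All names and values (`toy_vocab`, `toy_embeddings`, `toy_matrix`) are hypothetical; the real notebook uses the tokenizer's vocabulary and GloVe vectors.

```python
import numpy as np

# word -> index, as produced by a tokenizer; index 0 is reserved for padding
toy_vocab = {"love": 1, "beach": 2, "zzzunknown": 3}
# word -> pre-trained vector; "zzzunknown" has no pre-trained vector
toy_embeddings = {
    "love": np.array([0.1, 0.2, 0.3]),
    "beach": np.array([0.4, 0.5, 0.6]),
}

toy_dim = 3
toy_matrix = np.zeros((len(toy_vocab) + 1, toy_dim))
for word, idx in toy_vocab.items():
    vector = toy_embeddings.get(word)
    if vector is not None:
        toy_matrix[idx] = vector   # row idx holds the vector of the word with that index

print(toy_matrix)
```

Row 0 (padding) and row 3 (a word without a pre-trained vector) stay all-zero.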

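**Embeddings** -> dense vectors that represent words; they are used mainly for text processing. Each word is first mapped to an integer index, and the embedding layer then maps each index to a vector.

**Example**:

Hope to see you soon. -> [0, 1, 2, 3, 4] (integer index of each word)

Nice to see you again. -> [5, 1, 2, 3, 6]

**Vocab size** = number of unique words in the vocabulary = largest index + 1 = 6 + 1 = 7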
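An embedding layer is, in essence, a lookup table with one row per index. The sketch below uses a random table in place of trained weights; the 4-dimensional size and the names `embedding_table` and `sentence_ids` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 7, 4

# One row per word index; a real layer would learn or load these values.
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# "Nice to see you again." from the example above
sentence_ids = np.array([5, 1, 2, 3, 6])

# Indexing the table with the id array returns one vector per word.
sentence_vectors = embedding_table[sentence_ids]
print(sentence_vectors.shape)
```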

**Sequential** -> a model built as a linear stack of layers

**LSTM** -> Long Short-Term Memory, a recurrent layer that reads the sequence one step at a time

**Bidirectional** -> wrapper that runs the LSTM over the sequence both forwards and backwards and combines the two outputs

**Dense** -> densely (fully) connected neural network layer

**Activation function** -> decides how strongly a neuron is activated; it is applied to the weighted sum of the neuron's inputs plus a bias.

It adds non-linearity to the output of the neuron.

**A neural network without an activation function is just a linear model, like Linear Regression**. By using activation functions, we can make our model learn complex, non-linear functions.
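The claim that stacked layers without an activation collapse into a single linear model can be checked numerically. In this sketch the layer sizes and the random weights are arbitrary, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(5, 3))                        # 5 samples, 3 features

W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)    # first layer
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)    # second layer

# Two layers with no activation in between ...
two_layers = (x_demo @ W1 + b1) @ W2 + b2

# ... equal one layer whose weights and bias are the composition of both.
W_single = W1 @ W2
b_single = b1 @ W2 + b2
one_layer = x_demo @ W_single + b_single

print(np.allclose(two_layers, one_layer))
```

Placing a non-linear function between the two layers breaks this equivalence, which is what lets deeper networks represent more than a linear map.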

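**Softmax** -> generalization of the **Sigmoid function** to multiple classes: it exponentiates each output and divides by the sum of the exponentials, so every value lies between 0 and 1 and the values sum to 1

**Optimizer** -> updates the trainable variables on which the cost depends in order to **minimize the cost**

**Entropy** -> -sum(p log p) -> average amount of information drawn from one sample. The categorical cross-entropy loss used below, -sum(y log p), compares the true one-hot label y with the predicted probabilities p.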
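Softmax and categorical cross-entropy can be written in a few lines of NumPy. This is a sketch for intuition rather than the Keras implementation; the function names, the three-class logits and the small `eps` guard against log(0) are assumptions made for the example.

```python
import numpy as np


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 along the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)   # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    """-sum(y * log p) per sample, with y one-hot and p the predicted probabilities."""
    return -np.sum(y_true * np.log(y_prob + eps), axis=-1)


logits = np.array([[2.0, 1.0, 0.1]])
probs = softmax(logits)
target = np.array([[1.0, 0.0, 0.0]])     # the true class is class 0
loss = categorical_cross_entropy(target, probs)
print(probs, loss)
```

Because the target is one-hot, the loss reduces to minus the log of the probability assigned to the true class.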

In [21]:
# final model
def final_model(weight_matrix, vocab_size, max_length, x, y, epochs = 5):
    # embedding layer initialised with the pre-trained weights and fine-tuned during training
    embedding_layer = Embedding(vocab_size, 300, weights=[weight_matrix],
                                input_length=max_length, trainable=True, mask_zero=True)
    model = Sequential()
    model.add(embedding_layer)
    model.add(Bidirectional(LSTM(128, dropout=0.2, return_sequences=True)))
    model.add(Bidirectional(LSTM(128, dropout=0.2)))
    # one output per emoji class
    model.add(Dense(20, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(x, y, epochs = epochs, validation_split = 0.25)
    # note: x_test and y_test are module-level globals created outside this function
    score, acc = model.evaluate(x_test, y_test)
    return model, score, acc

Tokenizing the train and test tweets and then encoding them

In [22]:
import math

In [23]:
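tokenized_tweets = tokenize(train_data['tweets'])
tokenized_tweets += tokenize(test_data['tweets'])
# despite its name, max_length is the average number of space-separated pieces
# per tweet (rounded up); longer tweets are truncated when padding is applied
max_length = math.ceil(sum([len(s.split(" ")) for s in tokenized_tweets])/len(tokenized_tweets))
tokenizer, encoded_tweets = encod_tweets(tokenized_tweets)
max_length, len(tokenized_tweets)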

(10, 550000)

Apply padding to the encoded data using pad_sequences for both train and test tweets

In [24]:
x, y = format_data(encoded_tweets[:train_length], max_length, train_data['labels'])
len(x), len(y)

(500000, 500000)

In [25]:
x_test, y_test = format_data(encoded_tweets[train_length:], max_length, test_data['labels'])
len(x_test), len(y_test)

(50000, 50000)

Building vocabulary using **word_index** 

In [26]:
vocab = tokenizer.word_index
vocab, len(vocab)

({'️': 1,
  'i': 2,
  'love': 3,
  'the': 4,
  '…': 5,
  'new': 6,
  'amp': 7,
  'happy': 8,
  'day': 9,
  'my': 10,
  'beach': 11,
  'night': 12,
  'one': 13,
  'time': 14,
  'today': 15,
  "i'm": 16,
  'york': 17,
  'good': 18,
  'park': 19,
  'best': 20,
  'this': 21,
  'you': 22,
  'like': 23,
  'christmas': 24,
  'last': 25,
  'birthday': 26,
  'university': 27,
  'city': 28,
  'we': 29,
  'california': 30,
  'beautiful': 31,
  'get': 32,
  'got': 33,
  'a': 34,
  'little': 35,
  'great': 36,
  'thanks': 37,
  'see': 38,
  'back': 39,
  'family': 40,
  'so': 41,
  'much': 42,
  'life': 43,
  'center': 44,
  'thank': 45,
  'fun': 46,
  'favorite': 47,
  'it': 48,
  'first': 49,
  "it's": 50,
  'always': 51,
  'home': 52,
  'me': 53,
  'school': 54,
  'tonight': 55,
  'go': 56,
  'when': 57,
  'friends': 58,
  'morning': 59,
  'nyc': 60,
  'florida': 61,
  'high': 62,
  'weekend': 63,
  'girl': 64,
  'amazing': 65,
  'state': 66,
  'year': 67,
  "can't": 68,
  'texas': 69,
  'us': 7

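**KeyedVectors** -> gensim class for word vector storage and lookup

It can be used to load a pre-trained embedding (weight) matrix. It is imported below, but the GloVe vectors in this notebook are parsed directly from the text file.

**binary** -> flag that specifies whether the vector file is in binary or text format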

In [27]:
from gensim.models.keyedvectors import KeyedVectors

In [53]:
! wget https://nlp.stanford.edu/data/glove.6B.zip

--2022-05-07 11:53:44--  https://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2022-05-07 11:53:45--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2022-05-07 11:56:25 (5.14 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]



In [28]:
! unzip glove*.zip

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


In [29]:
# load the GloVe vectors into a dictionary (word -> vector);
# the 300-dimensional file matches the Embedding layer size used in final_model

embeddings_index = {}
#f = open('/kaggle/input/glove840b300dtxt/glove.840B.300d.txt','r',encoding='utf-8')
with open('glove.6B.300d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        values = line.rstrip().split(' ')
        word = values[0]
        coefs = np.asarray([float(val) for val in values[1:]])
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


Create the weight matrix using our vocab and embeddings_index

In [33]:
weight_matrix = create_weight_matrix(vocab, embeddings_index)
len(weight_matrix)

364687

Run the final model on train data

In [None]:
model, score, acc = final_model(weight_matrix, len(vocab)+1, max_length, x, y, epochs = 5)
model, score, acc

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 52500 samples, validate on 17500 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


(<keras.engine.sequential.Sequential at 0x7fceae9c8e10>,
 8.686213728018256,
 0.06367235630750656)

In [None]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 10, 300)           29238900  
_________________________________________________________________
bidirectional_1 (Bidirection (None, 10, 256)           439296    
_________________________________________________________________
bidirectional_2 (Bidirection (None, 256)               394240    
_________________________________________________________________
dense_1 (Dense)              (None, 20)                5140      
Total params: 30,077,576
Trainable params: 30,077,576
Non-trainable params: 0
_________________________________________________________________


Use the .predict() function to predict the y values for x_test

Each prediction is a NumPy array of length 20 (the number of classes) holding one probability per class

The predicted class is the index of the maximum value
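Taking the index of the maximum can be done for all rows at once with `np.argmax` along axis 1. The probabilities below are made-up values for three classes, used only to show the call.

```python
import numpy as np

# Hypothetical predictions: 2 samples, 3 classes, each row sums to 1
toy_probs = np.array([
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
])

# axis=1 picks the most probable class within each row
toy_classes = np.argmax(toy_probs, axis=1)
print(toy_classes)
```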

In [None]:
y_pred = model.predict(x_test)
y_pred

array([[4.5205639e-03, 6.1835712e-03, 2.5095116e-02, ..., 2.8466381e-02,
        6.5138914e-02, 2.8007482e-03],
       [7.2174267e-09, 3.0331739e-07, 1.6326581e-07, ..., 3.0650888e-08,
        2.7001197e-06, 5.2905165e-08],
       [9.7029078e-07, 2.2035099e-05, 1.5461341e-05, ..., 1.0900231e-05,
        4.9335358e-07, 2.6841351e-06],
       ...,
       [1.4020579e-02, 1.3225450e-01, 1.0889382e-01, ..., 4.6023492e-02,
        5.4387182e-02, 2.5488444e-02],
       [9.3367770e-02, 1.0195305e-03, 2.7876866e-05, ..., 5.4133898e-03,
        1.2723766e-03, 5.9620561e-03],
       [3.2763390e-03, 1.5374083e-03, 1.1093880e-03, ..., 8.5164723e-04,
        1.5113539e-04, 2.7398905e-03]], dtype=float32)

In [None]:
# print the predicted class of the first 10 tweets
for pred in y_pred[:10]:
    print(np.argmax(pred))

9
13
11
9
3
2
14
9
15
8


Print the classification report, which gives for each class a:

**precision** -> what % of the tweets predicted as a are actually a

**recall** -> what % of the tweets that are actually a are predicted as a

**f1-score** -> harmonic mean of precision and recall

**support** -> number of actual (true) samples of each class
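The four quantities can be computed by hand for a single class, which makes the definitions concrete. This is a sketch with a hypothetical helper `per_class_metrics` and toy labels; it is not how scikit-learn implements the report.

```python
import numpy as np


def per_class_metrics(y_true, y_pred, cls):
    """Return (precision, recall, f1, support) for one class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == cls) & (y_true == cls)))   # predicted cls, actually cls
    fp = int(np.sum((y_pred == cls) & (y_true != cls)))   # predicted cls, actually other
    fn = int(np.sum((y_pred != cls) & (y_true == cls)))   # actually cls, predicted other
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn                                      # number of true samples of cls
    return precision, recall, f1, support


toy_true = [0, 0, 1, 1, 1, 2]
toy_pred = [0, 1, 1, 1, 0, 2]
metrics_cls1 = per_class_metrics(toy_true, toy_pred, cls=1)
print(metrics_cls1)
```

For class 1 there are two true positives, one false positive and one false negative, so precision and recall are both 2/3 and the support is 3.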

In [None]:
import math
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
y_pred = np.argmax(y_pred, axis=1)
# the label column was named "labels" when test_data was built
y_true = np.array(test_data['labels'])
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

           0       0.07      0.01      0.01     10760
           1       0.04      0.01      0.01      5280
           2       0.04      0.05      0.05      5241
           3       0.02      0.04      0.02      2886
           4       0.02      0.01      0.02      2518
           5       0.02      0.01      0.02      2317
           6       0.02      0.02      0.02      2049
           7       0.02      0.03      0.02      1894
           8       0.02      0.02      0.02      1796
           9       0.02      0.14      0.04      1671
          10       0.01      0.00      0.01      1544
          11       0.68      0.73      0.70      1528
          12       0.54      0.70      0.61      1462
          13       0.01      0.02      0.02      1346
          14       0.01      0.01      0.01      1377
          15       0.01      0.02      0.01      1250
          16       0.02      0.02      0.02      1306
          17       0.01    

In [None]:
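# mappings is assumed to be a DataFrame with 'number' and 'emoticons' columns;
# the cell that loads it is not shown in this notebook
emoji_pred = [mappings[mappings['number'] == pred]['emoticons'] for pred in y_pred]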
emoji_pred

[9    ❤
 Name: emoticons, dtype: object,
 13    ✨
 Name: emoticons, dtype: object,
 11    🇺🇸
 Name: emoticons, dtype: object,
 9    ❤
 Name: emoticons, dtype: object,
 3    😂
 Name: emoticons, dtype: object,
 2    😍
 Name: emoticons, dtype: object,
 14    💙
 Name: emoticons, dtype: object,
 9    ❤
 Name: emoticons, dtype: object,
 15    💕
 Name: emoticons, dtype: object,
 8    😘
 Name: emoticons, dtype: object,
 8    😘
 Name: emoticons, dtype: object,
 10    😁
 Name: emoticons, dtype: object,
 2    😍
 Name: emoticons, dtype: object,
 1    📸
 Name: emoticons, dtype: object,
 9    ❤
 Name: emoticons, dtype: object,
 9    ❤
 Name: emoticons, dtype: object,
 3    😂
 Name: emoticons, dtype: object,
 2    😍
 Name: emoticons, dtype: object,
 14    💙
 Name: emoticons, dtype: object,
 13    ✨
 Name: emoticons, dtype: object,
 16    😎
 Name: emoticons, dtype: object,
 2    😍
 Name: emoticons, dtype: object,
 7    🔥
 Name: emoticons, dtype: object,
 16    😎
 Name: emoticons, dtype: object,
 4    

In [None]:
for i in range(100, 150):
    test_tweet = test_data['tweets'][i]
    pred_label = y_pred[i]
    pred_emoji = emoji_pred[i]
    print('tweet: ', test_tweet)
    print('pred emoji: ', pred_label, pred_emoji)
    print('-'*50)

tweet:  New York ain't ready... #BuffaloState #HWS @ New York Strip Clubs
pred emoji:  4 4    😉
Name: emoticons, dtype: object
--------------------------------------------------
tweet:  Just livin man @ South Beach Miami
pred emoji:  12 12    ☀
Name: emoticons, dtype: object
--------------------------------------------------
tweet:  It's the most wonderful time of the year -#rockefellercenter #christmas #christmastree…
pred emoji:  5 5    🎄
Name: emoticons, dtype: object
--------------------------------------------------
tweet:  Oh I think that I found myself a cheerleader @ University of Iowa Rugby…
pred emoji:  2 2    😍
Name: emoticons, dtype: object
--------------------------------------------------
tweet:  Gotta take care of the feet. #pedicure #atlanta #atl #friends #pampered #feet #feelinggood…
pred emoji:  2 2    😍
Name: emoticons, dtype: object
--------------------------------------------------
tweet:  I think I need a hat like this #cutefilter #snapchat @ Georgetown,…
pred emo