Followed the tutorial from this link: https://heartbeat.fritz.ai/using-a-keras-embedding-layer-to-handle-text-data-2c88dc019600

How to set up tensorflow and Keras (Mac): https://www.youtube.com/watch?v=mcIKDJYeyFY

How to set up tensorflow and Keras (Windows): https://www.youtube.com/watch?v=59duINoc8GM

Twitter GloVe embeddings: https://www.kaggle.com/jdpaletto/glove-global-vectors-for-word-representation
> I have gone ahead and downloaded the 200d embeddings and used them in this notebook.

The cleaned tweets are located in the SMM4H channel on Slack.

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from numpy import array
from keras.preprocessing.text import Tokenizer, one_hot
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


### 1) Load in our cleaned Twitter data.

In [22]:
df = pd.read_csv('/Users/graceganzel/Desktop/School/LHS712/SMM4H/task1_training_cleaned.tsv', delimiter = '\t', header = None)
df.head()

Unnamed: 0,0,1
0,0,<user> doctor christian </user> scared to star...
1,0,"<user> intuitive gal 1 </user> ok , if you sto..."
2,0,novartis announces secukinumab ( ain 457 ) dem...
3,1,""" u wailed all night ; now y'r disembodied sob..."
4,0,<user> ira paps </user> you are so fucking sel...


### 2) Split the data into train and test

In [23]:
docs = df[1]
labels = array(df[0])

In [24]:
X_train, X_test, y_train, y_test = train_test_split(docs, labels, test_size = 0.33, random_state=42)

### 3) Prepare our data

> We will need to:
> - Tokenize,
> - Encode,
> - And pad our data

#### What is encoding and padding?

> Machine learning models only accept numerical data. Therefore, we need to **encode** our data, i.e. convert each tweet's words into a sequence of integers.

> In Keras, we usually want to pass arrays of integers of the same length. Thus, we **pad** our sequences, i.e. add zeros to the beginning or end of each sequence of integers so that they all have the same length.
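
To make encoding and padding concrete, here is a minimal sketch in plain Python. The two toy sentences, the `word_index` dictionary, and the helper names `encode` and `pad_pre` are made up for illustration; the notebook itself relies on the Keras utilities for these steps.

```python
# Toy corpus standing in for the tweets (hypothetical examples).
docs = ["give 90mg a week", "ok if you stopped taking the lamictal"]

# Build a word -> integer index; 0 is reserved for padding.
word_index = {}
for doc in docs:
    for word in doc.split():
        if word not in word_index:
            word_index[word] = len(word_index) + 1


def encode(doc):
    """Replace each known word with its integer id."""
    return [word_index[w] for w in doc.split() if w in word_index]


def pad_pre(seq, maxlen):
    """Left-pad with zeros (or keep only the last maxlen ids) to a fixed length."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


encoded = [encode(d) for d in docs]
padded = [pad_pre(s, maxlen=8) for s in encoded]
print(encoded)
print(padded)
```

Every padded sequence has the same length, which is what lets a whole batch of tweets be stacked into a single array.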

Here is what one of our tweets looks like:

In [25]:
print(X_train[1])

<user> intuitive gal 1 </user> ok , if you stopped taking the lamictal , give 90mg a week .


In [26]:
# obtain the vocabulary size to be used with one_hot encoding
t = Tokenizer()

tweets_lst = list(X_train)

t.fit_on_texts(tweets_lst)
vocab_size = len(t.word_index) + 1  # +1 because index 0 is reserved for padding

In [27]:
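# encoding with one_hot
# NOTE: one_hot hashes words into [1, vocab_size), so these ids do not match
# t.word_index (used for the embedding matrix in step 5);
# t.texts_to_sequences would keep the two aligned.
X_train = [one_hot(tweet, vocab_size, filters='!"#$%&()*+,-./:;=?@[\]^_`{|}~', split=' ') for tweet in X_train]
X_test = [one_hot(tweet, vocab_size, filters='!"#$%&()*+,-./:;=?@[\]^_`{|}~', split=' ') for tweet in X_test]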

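This is what a tweet looks like after encoding. Note that `X_train` is now a plain Python list, so `X_train[1]` is the second training tweet by position, which is not necessarily the tweet with index label 1 printed above: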

In [28]:
print(X_train[1])

[14162, 21838, 17273, 525, 9951, 18556, 20931, 7217, 9244, 10190, 20931, 13210, 17931, 21648, 4753, 19608, 20931, 21224, 7211, 14959, 3924, 19469, 17279, 13154, 8744]


In [29]:
# padding
X_train = pad_sequences(X_train, maxlen=300, padding='pre')
X_test = pad_sequences(X_test, maxlen=300, padding='pre')

And this is what the tweet looks like after padding:

In [30]:
print(X_train[1])

[    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0   

### 4) Load in the pre-trained GloVe embeddings

This step takes a bit of time.

In [31]:
embeddings_index = dict()

# each line holds a word followed by its 200 coefficients
with open('/Users/graceganzel/Desktop/School/LHS712/SMM4H/Pokee/graces_code/glove.twitter.27B.200d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

### 5) Create an embedding matrix

In [33]:
embedding_matrix = np.zeros((vocab_size, 200)) # use 200 because our vectors are of dimension 200

for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word) # get the vector from our pretrained glove embeddings
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector # insert that into our matrix
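
Words that are missing from GloVe keep an all-zero row in the matrix, so it can be worth checking how much of the vocabulary actually receives a pretrained vector. The sketch below walks through the same parse-and-fill logic on a tiny scale in plain Python; the three 4-dimensional vectors, the `word_index` dictionary, and the `coverage` variable are invented for illustration and are not taken from the real embeddings.

```python
import io

# Three made-up 4-dimensional vectors in GloVe's text layout:
# one word per line, followed by its space-separated coefficients.
toy_glove = io.StringIO(
    "ok 0.1 0.2 0.3 0.4\n"
    "week 0.5 0.6 0.7 0.8\n"
    "the 0.9 1.0 1.1 1.2\n"
)

embeddings_index = {}
for line in toy_glove:
    values = line.split()
    embeddings_index[values[0]] = [float(v) for v in values[1:]]

# Hypothetical tokenizer vocabulary; ids start at 1 because 0 is padding.
word_index = {"ok": 1, "lamictal": 2, "week": 3}
dim = 4

# Row i holds the vector for the word with id i; unknown words stay all-zero.
embedding_matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
found = 0
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector
        found += 1

coverage = found / len(word_index)
print(f"{found}/{len(word_index)} words found ({coverage:.0%})")
```

A low coverage would suggest that many tokens (drug names, misspellings, tags) are effectively invisible to a frozen embedding layer.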

### 6) Build and compile our model

> I created a model with three types of layers:
> - **Embedding layer:** will take our padded sequences and turn them into vectors using the weights from our GloVe embeddings
> - **Flatten layer:** the output of our embedding layer must be flattened before being passed to the dense layer
> - **Dense layer:** a classic neural network layer

In [34]:
model = tf.keras.models.Sequential() # initialize the neural network
model.add(tf.keras.layers.Embedding(vocab_size, 200, weights=[embedding_matrix], input_length=300 , trainable=False))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(10, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

After making our model we must compile it:

In [35]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [36]:
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 300, 200)          4376600   
_________________________________________________________________
flatten_1 (Flatten)          (None, 60000)             0         
_________________________________________________________________
dense_2 (Dense)              (None, 10)                600010    
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 11        
Total params: 4,976,621
Trainable params: 600,021
Non-trainable params: 4,376,600
_________________________________________________________________
None
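
The parameter counts in the summary can be reproduced by hand, which is a quick way to confirm that the layers are wired as intended. The sketch below is plain arithmetic; `vocab_size` is not printed anywhere in the notebook, so it is inferred here from the embedding layer's parameter count (an assumption).

```python
# Shapes taken from the model definition above.
embedding_dim = 200   # GloVe vector size
sequence_len = 300    # padded tweet length (input_length)
hidden_units = 10     # first Dense layer

# vocab_size is not printed in the notebook; it is inferred here from the
# embedding layer's parameter count in the summary (an assumption).
embedding_params = 4_376_600
vocab_size = embedding_params // embedding_dim

flatten_size = sequence_len * embedding_dim                # values per tweet after Flatten
dense_params = flatten_size * hidden_units + hidden_units  # weights + biases
output_params = hidden_units * 1 + 1                       # weights + bias

trainable = dense_params + output_params
total = embedding_params + trainable
print(vocab_size, flatten_size, dense_params, output_params, trainable, total)
```

Only the two Dense layers are trainable because the embedding layer was created with `trainable=False`.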


### 7) Fit our model to our train data and print out some results

In [37]:
model.fit(X_train, y_train, epochs=20, verbose=1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x1a337a7588>

#### Those accuracy results are looking pretty amazing, but don't get too excited! Let's see how well our model is **really** doing...
> Let's use our model to make predictions on our test set. Because the final layer is a single sigmoid unit, the predict function returns one probability per tweet: the model's estimate that the tweet belongs to class '1'. Any value above 0.5 is treated as a '1'. This threshold is another parameter that can be tuned if we feel the model is labeling too many or too few tweets as '1'.
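
As a small illustration of how the cut-off changes the labels, here is a sketch in plain Python. The helper name `to_labels` and the probabilities are made up for the example and are not the model's output.

```python
def to_labels(probabilities, threshold=0.5):
    """Map sigmoid outputs to 0/1 labels; values above the threshold become 1."""
    return [1 if p > threshold else 0 for p in probabilities]


# Made-up probabilities for illustration (not the model's output).
probs = [0.02, 0.35, 0.51, 0.90]

default_labels = to_labels(probs)                 # cut-off of 0.5
lenient_labels = to_labels(probs, threshold=0.3)  # lower cut-off flags more tweets as 1
print(default_labels)
print(lenient_labels)
```

Lowering the threshold generally trades precision for recall, so it is best chosen on held-out data rather than on the test set.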

In [62]:
predictions = model.predict(X_test)
predictions[0:10]

array([[9.6203405e-03],
       [1.7136306e-02],
       [6.7504042e-09],
       [4.8116489e-10],
       [2.7730584e-07],
       [9.0009644e-06],
       [3.1182468e-13],
       [2.0540197e-12],
       [6.0767303e-03],
       [1.9831003e-12]], dtype=float32)

In [131]:
# going to make my own confusion matrix
act_pred = pd.DataFrame()
act_pred['actual'] = y_test

# threshold the predicted probabilities at 0.5 to get 0/1 labels
act_pred['predicted'] = (predictions.ravel() > 0.5).astype(int)

act_pred.head(10)

Unnamed: 0,actual,predicted
0,0,0
1,1,0
2,0,0
3,0,0
4,0,0
5,0,0
6,1,0
7,0,0
8,0,0
9,0,0


In [124]:
true_pos = len(act_pred[(act_pred.actual == 1) & (act_pred.predicted == 1)])
false_pos = len(act_pred[(act_pred.actual == 0) & (act_pred.predicted == 1)])
false_neg = len(act_pred[(act_pred.actual == 1) & (act_pred.predicted == 0)])
true_neg = len(act_pred[(act_pred.actual == 0) & (act_pred.predicted == 0)])

In [145]:
print("{:>40}".format("Confusion Matrix"))
print("")
print("{:^15}{:^18}{:^15}".format("", "Actual AHE", "Actual Non-AHE"))
print("{:<17}{:^15}{:^15}".format("Predicted AHE", true_pos, false_pos))
print("{:<15}{:^15}{:^15}".format("Predicted Non-AHE", false_neg, true_neg))

                        Confusion Matrix

                   Actual AHE    Actual Non-AHE 
Predicted AHE          52             186      
Predicted Non-AHE      446           4679      


#### Precision vs. Recall, and what is an F1 score?
> **Precision:** Out of all the tweets that our model predicted as containing AHEs, how many **actually** contain AHEs?

> **Recall:** Out of all of the tweets that **actually** contain AHEs, how many did our model capture?

> **F1 score:** The harmonic mean of precision and recall. In some cases, we may only want to focus on maximizing either precision or recall. However, if we want to create a model that has an optimal balance of precision **and** recall, we want to maximize our F1 score.

In [126]:
precision = true_pos/(true_pos + false_pos) * 100
recall = true_pos/(true_pos + false_neg) * 100
f1_score = 2 * ((precision * recall)/(precision + recall))

In [127]:
print("Precision: {:.4f}".format(precision))
print("Recall: {:.4f}".format(recall))
print("F1 Score: {:.4f}".format(f1_score))

Precision: 21.8487
Recall: 10.4418
F1 Score: 14.1304


**Not so great....**
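
The gap between a high accuracy and a low F1 score comes from class imbalance. The sketch below redoes the arithmetic from the confusion matrix counts above and adds two numbers the notebook does not print: the overall accuracy and the accuracy of a trivial classifier that labels every tweet as Non-AHE. The variable names `accuracy` and `baseline_accuracy` are introduced here for illustration.

```python
# Counts from the confusion matrix printed above.
true_pos, false_pos = 52, 186
false_neg, true_neg = 446, 4679
total = true_pos + false_pos + false_neg + true_neg

accuracy = (true_pos + true_neg) / total
# Accuracy of a trivial classifier that labels every tweet as Non-AHE.
baseline_accuracy = (false_pos + true_neg) / total

precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:          {accuracy:.4f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"precision:         {precision:.4f}")
print(f"recall:            {recall:.4f}")
print(f"F1:                {f1:.4f}")
```

On these counts the model is roughly 88% accurate, while always predicting Non-AHE would be roughly 91% accurate, which is why accuracy alone says little here.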

### 8) Going to attempt LSTM

> The plan calls for up to five types of layers (the Flatten layer ends up commented out in the code below):
> - Embedding
> - Flatten
> - Dense
> - LSTM
> - Dropout

In [129]:
len(X_train)

10888

In [141]:
model = tf.keras.models.Sequential()

model.add(tf.keras.layers.Embedding(vocab_size, 200, weights=[embedding_matrix], input_length=300 , trainable=False))
#model.add(tf.keras.layers.Flatten())
#model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.add(tf.keras.layers.LSTM(units = 50, return_sequences = True)) # input shape is inferred from the Embedding layer
model.add(tf.keras.layers.Dropout(0.2))

model.add(tf.keras.layers.LSTM(units = 50, return_sequences = True))
model.add(tf.keras.layers.Dropout(0.2))

model.add(tf.keras.layers.LSTM(units = 50, return_sequences = True))
model.add(tf.keras.layers.Dropout(0.2))

model.add(tf.keras.layers.LSTM(units = 50))
model.add(tf.keras.layers.Dropout(0.2))

model.add(tf.keras.layers.Dense(units = 1, activation = 'sigmoid')) # sigmoid so binary_crossentropy receives a probability

model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics=['accuracy'])

model.fit(X_train, y_train, epochs = 100, batch_size = 32)

Epoch 1/100
 2464/10888 [=====>........................] - ETA: 7:59 - loss: 1.3477 - acc: 0.9164

KeyboardInterrupt: 
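
Binary cross-entropy assumes that the model's output is a probability in (0, 1), which is what a sigmoid on the last layer provides; a final `Dense` layer with no activation emits unbounded scores instead. The sketch below shows the relationship in plain Python. The function names, the clipping constant `eps`, and the raw score are assumptions chosen for illustration, not Keras internals.

```python
import math


def sigmoid(z):
    """Squash a raw score (logit) into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true, p, eps=1e-7):
    """Loss for one example; p is clipped away from 0 and 1 to keep log finite."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


raw_score = -3.0  # hypothetical output of a linear Dense(1) layer
p = sigmoid(raw_score)
loss_with_sigmoid = binary_crossentropy(0, p)
print(p, loss_with_sigmoid)

# Without the sigmoid, the raw score falls outside (0, 1) and only the
# clipping keeps the formula defined; the resulting loss is not meaningful.
loss_without_sigmoid = binary_crossentropy(0, raw_score)
print(loss_without_sigmoid)
```

When the loss is fed raw scores, its value no longer reflects how confident or how wrong the model is, so training curves become hard to interpret.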