# Natural Language Processing - Classification
**Emmanuel Dufourq** (edufourq@gmail.com - [www.emmanueldufourq.com](http://www.emmanueldufourq.com) )

July 2018

*Made for the Theoretical Foundations of Data Science 2018 (African Institute for Mathematical Sciences)*

Adapted from https://cloud.google.com/blog/big-data/2017/10/intro-to-text-classification-with-keras-automatically-tagging-stack-overflow-posts

### Objective:

Construct a model that can classify text data. The original version of this notebook tagged questions from Stack Overflow; this version tags tweets as sarcastic or not.


### Modified by Jahn Marrero for HPAI Final Project (April 2019)
### An attempt at simple sarcasm detection, using NLP to tag a text as sarcastic or not

## Imports

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import keras
from sklearn.preprocessing import LabelBinarizer, LabelEncoder
from keras.preprocessing import text, sequence
from keras import utils
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
import numpy as np
import re
from nltk.corpus import stopwords
from bs4 import BeautifulSoup

Using TensorFlow backend.


## Load the data

In [2]:
cols = ['tweets', 'sarcastic']
df = pd.read_csv("tweet_sarcasm_data_scrambled.csv", names = cols, dtype={"tweets": str, "sarcastic": int})
df.tweets = df.tweets.astype(str)
df.sarcastic = df.sarcastic.astype(int)

df.sarcastic.value_counts()

0    4989
1    4988
Name: sarcastic, dtype: int64

## **Pre-Process the Data**

In [39]:
from nltk.tokenize import WordPunctTokenizer
tok = WordPunctTokenizer()
pat1 = r'@[A-Za-z0-9]+'
pat2 = r'https?://[A-Za-z0-9./]+'
combined_pat = r'|'.join((pat1, pat2))
def tweet_cleaner(text):
    soup = BeautifulSoup(text, 'lxml')
    souped = soup.get_text()
    stripped = re.sub(combined_pat, '', souped)
    try:
        # Python 2 leftover: byte strings get decoded here; a Python 3 str has no .decode()
        clean = stripped.decode("utf-8-sig").replace(u"\ufffd", "?")
    except (AttributeError, UnicodeError):
        clean = stripped
    letters_only = re.sub("[^a-zA-Z]", " ", clean)
    lower_case = letters_only.lower()
    # The letters_only substitution above leaves unnecessary whitespace,
    # so tokenize and re-join the words to collapse it
    words = tok.tokenize(lower_case)
    return (" ".join(words)).strip()
testing = df.tweets[:10]
test_result = []
for key in testing:
    test_result.append(tweet_cleaner(key))
test_result

['this tweet has been brought to you by',
 'crakk yup forreal this time',
 'u c strictlysoccer di maria plays beautiful u d',
 'road guy iiiiiit s noooooot sarcasm',
 'another rainy race day',
 'ayyyyy messiiiiiiiii',
 'one minute up one minute passed out',
 'can t watch your streaming app with a directv sub',
 'omg',
 'where s billy hahaha']
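
To see what the two regular expressions in `tweet_cleaner` contribute, here is a minimal sketch that uses only the standard library. It skips the BeautifulSoup step and replaces `WordPunctTokenizer` with `str.split()`, which should behave the same once only letters and spaces remain. The function name `simple_clean` and the sample tweet are made up for illustration.

```python
import re

# Same patterns as in tweet_cleaner above
MENTION = r'@[A-Za-z0-9]+'
URL = r'https?://[A-Za-z0-9./]+'
COMBINED = '|'.join((MENTION, URL))

def simple_clean(tweet):
    # 1. Drop @handles and links
    no_handles_or_links = re.sub(COMBINED, '', tweet)
    # 2. Replace everything that is not a letter with a space
    letters_only = re.sub('[^a-zA-Z]', ' ', no_handles_or_links)
    # 3. Lower-case and collapse the repeated whitespace
    return ' '.join(letters_only.lower().split())

# Hypothetical tweet, not taken from the dataset
cleaned = simple_clean("@someone Check https://t.co/abc123 so #excited!!")
print(cleaned)  # check so excited
```

Note that the hashtag symbol is removed but the hashtag word itself survives, which is why tokens such as `sarcasm` still appear in the cleaned tweets.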

In [5]:
nums = [0,9977]
print("Cleaning and parsing the tweets...\n")
clean_tweet_texts = []
for i in range(nums[0],nums[1]):
    if( (i+1)%100 == 0 ):
        print("Tweets %d of %d has been processed" % ( i+1, nums[1] ))
    clean_tweet_texts.append(tweet_cleaner(df['tweets'][i]))

Cleaning and parsing the tweets...

Tweets 100 of 9977 has been processed
Tweets 200 of 9977 has been processed
Tweets 300 of 9977 has been processed
Tweets 400 of 9977 has been processed
Tweets 500 of 9977 has been processed
Tweets 600 of 9977 has been processed
Tweets 700 of 9977 has been processed


Tweets 800 of 9977 has been processed
Tweets 900 of 9977 has been processed
Tweets 1000 of 9977 has been processed
Tweets 1100 of 9977 has been processed
Tweets 1200 of 9977 has been processed
Tweets 1300 of 9977 has been processed
Tweets 1400 of 9977 has been processed
Tweets 1500 of 9977 has been processed


Tweets 1600 of 9977 has been processed
Tweets 1700 of 9977 has been processed
Tweets 1800 of 9977 has been processed
Tweets 1900 of 9977 has been processed
Tweets 2000 of 9977 has been processed
Tweets 2100 of 9977 has been processed
Tweets 2200 of 9977 has been processed


Tweets 2300 of 9977 has been processed
Tweets 2400 of 9977 has been processed
Tweets 2500 of 9977 has been processed
Tweets 2600 of 9977 has been processed
Tweets 2700 of 9977 has been processed
Tweets 2800 of 9977 has been processed
Tweets 2900 of 9977 has been processed


Tweets 3000 of 9977 has been processed
Tweets 3100 of 9977 has been processed
Tweets 3200 of 9977 has been processed
Tweets 3300 of 9977 has been processed
Tweets 3400 of 9977 has been processed
Tweets 3500 of 9977 has been processed
Tweets 3600 of 9977 has been processed


Tweets 3700 of 9977 has been processed
Tweets 3800 of 9977 has been processed
Tweets 3900 of 9977 has been processed
Tweets 4000 of 9977 has been processed
Tweets 4100 of 9977 has been processed
Tweets 4200 of 9977 has been processed
Tweets 4300 of 9977 has been processed


Tweets 4400 of 9977 has been processed
Tweets 4500 of 9977 has been processed
Tweets 4600 of 9977 has been processed
Tweets 4700 of 9977 has been processed
Tweets 4800 of 9977 has been processed
Tweets 4900 of 9977 has been processed
Tweets 5000 of 9977 has been processed
Tweets 5100 of 9977 has been processed
Tweets 5200 of 9977 has been processed
Tweets 5300 of 9977 has been processed
Tweets 5400 of 9977 has been processed
Tweets 5500 of 9977 has been processed
Tweets 5600 of 9977 has been processed
Tweets 5700 of 9977 has been processed
Tweets 5800 of 9977 has been processed


Tweets 5900 of 9977 has been processed
Tweets 6000 of 9977 has been processed
Tweets 6100 of 9977 has been processed
Tweets 6200 of 9977 has been processed
Tweets 6300 of 9977 has been processed
Tweets 6400 of 9977 has been processed
Tweets 6500 of 9977 has been processed
Tweets 6600 of 9977 has been processed


Tweets 6700 of 9977 has been processed
Tweets 6800 of 9977 has been processed
Tweets 6900 of 9977 has been processed
Tweets 7000 of 9977 has been processed
Tweets 7100 of 9977 has been processed
Tweets 7200 of 9977 has been processed
Tweets 7300 of 9977 has been processed
Tweets 7400 of 9977 has been processed


Tweets 7500 of 9977 has been processed
Tweets 7600 of 9977 has been processed
Tweets 7700 of 9977 has been processed
Tweets 7800 of 9977 has been processed
Tweets 7900 of 9977 has been processed
Tweets 8000 of 9977 has been processed
Tweets 8100 of 9977 has been processed


Tweets 8200 of 9977 has been processed
Tweets 8300 of 9977 has been processed
Tweets 8400 of 9977 has been processed
Tweets 8500 of 9977 has been processed
Tweets 8600 of 9977 has been processed
Tweets 8700 of 9977 has been processed
Tweets 8800 of 9977 has been processed


Tweets 8900 of 9977 has been processed
Tweets 9000 of 9977 has been processed
Tweets 9100 of 9977 has been processed
Tweets 9200 of 9977 has been processed
Tweets 9300 of 9977 has been processed
Tweets 9400 of 9977 has been processed
Tweets 9500 of 9977 has been processed
Tweets 9600 of 9977 has been processed
Tweets 9700 of 9977 has been processed
Tweets 9800 of 9977 has been processed
Tweets 9900 of 9977 has been processed


## Look at some of the data

In [6]:
clean_df = pd.DataFrame(clean_tweet_texts,columns=['tweets'])
clean_df['sarcastic'] = df.sarcastic
clean_df = clean_df[1:]
clean_df.head()

|   | tweets | sarcastic |
|---|---|---|
| 1 | crakk yup forreal this time | 0 |
| 2 | u c strictlysoccer di maria plays beautiful u d | 0 |
| 3 | road guy iiiiiit s noooooot sarcasm | 1 |
| 4 | another rainy race day | 1 |
| 5 | ayyyyy messiiiiiiiii | 0 |


## Print out the unique tags

In [11]:
clean_df.to_csv('clean_tweet.csv',encoding='utf-8')
csv = 'clean_tweet.csv'
my_df = pd.read_csv(csv,index_col=0, dtype={"tweets": str})
my_df['sarcastic'].unique()

array([0, 1])

## Check dtypes


In [12]:
my_df['tweets'] = my_df['tweets'].astype(str)
my_df['sarcastic'] = my_df['sarcastic'].astype(int)
my_df.dtypes

tweets       object
sarcastic     int64
dtype: object

## Determine the number of classes

In [0]:
num_classes = len(my_df['sarcastic'].unique())

In [14]:
num_classes

2

## Check how many instances for each class

In [15]:
my_df['sarcastic'].value_counts()

0    4989
1    4987
Name: sarcastic, dtype: int64

## Convert the data into X and Y

In [0]:
X = my_df['tweets'].values

In [0]:
Y = my_df['sarcastic']

## Split the data into training and testing

In [0]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=42)

In [19]:
 X_train[0]

'i m so excited for this act'

## Tokenize

The Keras `Tokenizer` counts the unique words in a corpus and assigns each word a unique integer index. We can cap the number of words it uses, which keeps the most frequent ones. In our case we limit the vocabulary to 1000 words. The documentation is here: https://keras.io/preprocessing/text/#tokenizer

In [0]:
max_words = 1000
tokenize = text.Tokenizer(num_words=max_words, char_level=False)
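
As a rough illustration of what "allocating an index to the most frequent words" means, here is a sketch in plain Python. It is not Keras's implementation: the helper `build_word_index` and the toy tweets are invented for this example, and the real `Tokenizer` differs in details (for instance, it also filters punctuation, and as far as we know it applies the `num_words` cap when transforming texts rather than when building `word_index`).

```python
from collections import Counter

def build_word_index(texts, num_words=None):
    """Toy stand-in for a fitted tokenizer: rank words by frequency, starting at index 1."""
    counts = Counter(word for t in texts for word in t.lower().split())
    # most_common() returns words ordered from most to least frequent
    ranked = [word for word, _ in counts.most_common()]
    if num_words is not None:
        ranked = ranked[:num_words]
    return {word: rank for rank, word in enumerate(ranked, start=1)}

# Hypothetical mini-corpus: "i" appears 4 times, "love" 3 times, "rain" twice
toy_tweets = ["i love rain", "i love mondays", "i love it", "i guess rain again"]
word_index = build_word_index(toy_tweets, num_words=3)
print(word_index)  # {'i': 1, 'love': 2, 'rain': 3}
```

The most frequent word gets index 1, which matches the pattern in the real `word_index` shown later in this notebook.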

Now we can convert each tweet in our dataset into a vector of length *max_words* made up of 0's and 1's, with a 1 at the index of every vocabulary word that appears in the tweet. For example, if the vocabulary is [what, I, you, where, cat], then the sentence "where is the cat" is converted into [0, 0, 0, 1, 1], which indicates that the words "where" and "cat" are present. The tokenizer first has to build its vocabulary from some data, so we fit it on the training data:

In [0]:
tokenize.fit_on_texts(X_train)
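
The binary encoding described above can be sketched in a few lines of plain Python. This is only an illustration of the idea using the toy vocabulary from the text; `to_binary_vector` is a hypothetical helper, not part of Keras. One difference to keep in mind: Keras reserves index 0, so in the real matrices the column for a word is its `word_index` value rather than a zero-based position.

```python
def to_binary_vector(sentence, vocabulary):
    # Collect the distinct words of the sentence, ignoring case
    words = set(sentence.lower().split())
    # One slot per vocabulary word: 1 if it occurs in the sentence, else 0
    return [1 if word.lower() in words else 0 for word in vocabulary]

vocabulary = ["what", "I", "you", "where", "cat"]
vector = to_binary_vector("where is the cat", vocabulary)
print(vector)  # [0, 0, 0, 1, 1]
```

Words outside the vocabulary ("is", "the") simply leave no trace in the vector, and word order is lost.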

We can take a look at the words and the indices in the vocabulary here:

In [22]:
tokenize.word_index

{'sarcasm': 1,
 'd': 2,
 'u': 3,
 'i': 4,
 'ud': 5,
 'the': 6,
 'a': 7,
 't': 8,
 'to': 9,
 'ude': 10,
 's': 11,
 'you': 12,
 'my': 13,
 'is': 14,
 'co': 15,
 'http': 16,
 'in': 17,
 'and': 18,
 'c': 19,
 'it': 20,
 'for': 21,
 'that': 22,
 'of': 23,
 'messi': 24,
 'me': 25,
 'udc': 26,
 'so': 27,
 'on': 28,
 'f': 29,
 'this': 30,
 'm': 31,
 'e': 32,
 'just': 33,
 'love': 34,
 'uc': 35,
 'at': 36,
 'with': 37,
 'like': 38,
 'are': 39,
 'was': 40,
 'be': 41,
 'have': 42,
 'day': 43,
 'all': 44,
 'not': 45,
 'can': 46,
 'ub': 47,
 'what': 48,
 'he': 49,
 'how': 50,
 'out': 51,
 'n': 52,
 'when': 53,
 'great': 54,
 'no': 55,
 'your': 56,
 'we': 57,
 'up': 58,
 'good': 59,
 'but': 60,
 'if': 61,
 'get': 62,
 'they': 63,
 'do': 64,
 'don': 65,
 'ufe': 66,
 'there': 67,
 'who': 68,
 'lol': 69,
 're': 70,
 'about': 71,
 'an': 72,
 'oh': 73,
 'really': 74,
 'people': 75,
 'one': 76,
 'b': 77,
 'know': 78,
 'from': 79,
 'as': 80,
 'see': 81,
 'time': 82,
 'best': 83,
 'by': 84,
 'dad': 85,
 'fa

Then, we go ahead and convert the training and testing features into their corresponding vectors. The size of these vectors is based on the size of the vocabulary, in our case 1000.

In [0]:
X_train_token = tokenize.texts_to_matrix(X_train)
X_test_token = tokenize.texts_to_matrix(X_test)

In [24]:
X_train_token[0]

array([0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 1., 1., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.

Check the length of one encoded vector:

In [25]:
len(X_train_token[0])

1000

Now we need to convert the labels (targets) into their corresponding one-hot encoded values. One way to do this is to convert each label into a number, and then convert the number into a one-hot encoded vector.

## Encode the targets

In [0]:
# Use sklearn utility to convert label strings to numbered index
encoder = LabelEncoder()
encoder.fit(Y_train)
Y_train_encoded = encoder.transform(Y_train)
Y_test_encoded = encoder.transform(Y_test)

In [27]:
Y_train_encoded[0]

1

Now convert into one-hot encoded vectors

In [0]:
Y_train_hot = utils.to_categorical(Y_train_encoded, num_classes)
Y_test_hot = utils.to_categorical(Y_test_encoded, num_classes)

In [29]:
Y_train_hot[0]

array([0., 1.], dtype=float32)
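
The two-step conversion (label to class index, class index to one-hot vector) can be written out by hand as a sketch. The helper `one_hot` below is hypothetical and only mirrors what `LabelEncoder` followed by `to_categorical` produce for a small list of labels.

```python
def one_hot(labels):
    # Step 1: map each distinct label to a class index (sorted order)
    classes = sorted(set(labels))
    index_of = {label: i for i, label in enumerate(classes)}
    # Step 2: build one vector per label with a single 1.0 at the class index
    vectors = []
    for label in labels:
        row = [0.0] * len(classes)
        row[index_of[label]] = 1.0
        vectors.append(row)
    return classes, vectors

classes, vectors = one_hot([1, 0, 0, 1])
print(classes)     # [0, 1]
print(vectors[0])  # [0.0, 1.0]
```

With only two classes, a sarcastic tweet (label 1) becomes `[0.0, 1.0]` and a non-sarcastic one becomes `[1.0, 0.0]`.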

Check the shapes.

There are 7482 training samples and 2494 testing samples.

Each feature sample is a vector of length 1000 and each target is of length 2 (since there are 2 unique classes and the values have been one-hot encoded).

In [30]:
print('x_train shape:', X_train_token.shape)
print('x_test shape:', X_test_token.shape)
print('y_train shape:', Y_train_hot.shape)
print('y_test shape:', Y_test_hot.shape)

x_train shape: (7482, 1000)
x_test shape: (2494, 1000)
y_train shape: (7482, 2)
y_test shape: (2494, 2)
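
The row counts printed above follow from `test_size=0.25` applied to the 9976 cleaned tweets (4989 + 4987). A quick check, assuming that `train_test_split` rounds the test portion up and gives the remainder to the training set:

```python
import math

n_samples = 4989 + 4987   # class counts reported earlier in the notebook
test_size = 0.25

# Assumption: the test set size is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 7482 2494
```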


## Hyper-parameters

In [0]:
batch_size = 16
epochs = 70

## Build the model

In [32]:
model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])



In [33]:
history = model.fit(X_train_token, Y_train_hot,batch_size=batch_size,
                    epochs=epochs,verbose=1,
                    validation_split=0.1)

Train on 6733 samples, validate on 749 samples
Epoch 1/70
Epoch 2/70
Epoch 3/70
Epoch 4/70
Epoch 5/70
Epoch 6/70
Epoch 7/70
Epoch 8/70
Epoch 9/70
Epoch 10/70
Epoch 11/70
Epoch 12/70
Epoch 13/70
Epoch 14/70
Epoch 15/70
Epoch 16/70
Epoch 17/70
Epoch 18/70
Epoch 19/70
Epoch 20/70
Epoch 21/70
Epoch 22/70
Epoch 23/70
Epoch 24/70
Epoch 25/70
Epoch 26/70
Epoch 27/70
Epoch 28/70
Epoch 29/70
Epoch 30/70
Epoch 31/70
Epoch 32/70
Epoch 33/70
Epoch 34/70
Epoch 35/70
Epoch 36/70
Epoch 37/70
Epoch 38/70
Epoch 39/70
Epoch 40/70
Epoch 41/70
Epoch 42/70
Epoch 43/70
Epoch 44/70
Epoch 45/70
Epoch 46/70
Epoch 47/70
Epoch 48/70
Epoch 49/70
Epoch 50/70
Epoch 51/70
Epoch 52/70
Epoch 53/70
Epoch 54/70
Epoch 55/70
Epoch 56/70
Epoch 57/70
Epoch 58/70
Epoch 59/70
Epoch 60/70
Epoch 61/70
Epoch 62/70
Epoch 63/70
Epoch 64/70
Epoch 65/70
Epoch 66/70
Epoch 67/70
Epoch 68/70
Epoch 69/70
Epoch 70/70
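
The "Train on 6733 samples, validate on 749 samples" line in the output above comes from `validation_split=0.1`. The arithmetic below assumes that Keras truncates the split index to an integer and holds out the remaining samples for validation; treat that as our reading of the behaviour rather than a guarantee.

```python
n_train_total = 7482       # rows in X_train_token
validation_split = 0.1

# Assumption: split index is truncated, the tail of the data is held out
split_at = int(n_train_total * (1 - validation_split))
n_fit = split_at
n_val = n_train_total - split_at
print(n_fit, n_val)  # 6733 749
```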


## Check accuracy

In [34]:
# Evaluate the accuracy of our trained model
score = model.evaluate(X_test_token, Y_test_hot,
                       batch_size=batch_size, verbose=1)
print('Test accuracy:', score[1])

Test accuracy: 0.9362469927826784
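
For a model with a softmax output, the accuracy reported above is the fraction of test samples whose highest-probability class matches the true class. A minimal sketch of that computation, with made-up probabilities and targets (the helpers `argmax` and `accuracy` are hypothetical, not Keras functions):

```python
def argmax(values):
    # Index of the largest entry
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(probabilities, one_hot_targets):
    # Count samples where the predicted class equals the true class
    correct = sum(argmax(p) == argmax(t)
                  for p, t in zip(probabilities, one_hot_targets))
    return correct / len(one_hot_targets)

# Toy values for illustration only, not model output
toy_probs = [[0.2, 0.8], [0.9, 0.1], [0.4, 0.6], [0.7, 0.3]]
toy_targets = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
toy_accuracy = accuracy(toy_probs, toy_targets)
print(toy_accuracy)  # 0.75
```

Three of the four toy predictions match their targets, hence 0.75.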


In [35]:
Y_test.values

array([1, 1, 0, ..., 1, 1, 0])

## Predict

In [38]:
text_labels = encoder.classes_ 

for i in range(10):
    prediction = model.predict(np.array([X_test_token[i]]))
    predicted_label = text_labels[np.argmax(prediction)]
    print('Text: ',X_test[i])
    print('Actual label: ' + str(Y_test.values[i]))
    print("Predicted label: " + str(predicted_label) + "\n")

Text:  it looks like chelsea clinton made
Actual label: 1
Predicted label: 1

Text:  sarcasm
Actual label: 1
Predicted label: 1

Text:  yes melanie
Actual label: 0
Predicted label: 1

Text:  can t have a clean first did you forget who wheeler pitches for sarcasm
Actual label: 1
Predicted label: 1

Text:  my grandma is talking about hooking me up with her alcohol this is why i love her ud c udf b ud c udf b ud c udf b ud c udf b
Actual label: 0
Predicted label: 0

Text:  k http t co gntrzn sk
Actual label: 0
Predicted label: 0

Text:  abby and i get drinks and apps to celebrate father s day while my dad hangs out with his friends
Actual label: 0
Predicted label: 0

Text:  gotta love running in this humid weather with all of the mosquitoes ud d ude sarcasm thatsucked
Actual label: 1
Predicted label: 1

Text:  oh yeah because thinking something is funny is so goddam rude amirite guys sarcasm
Actual label: 1
Predicted label: 1

Text:  riverwalk http t co nrkguqgwh
Actual label: 0
Predicted

## Now Try It Yourself! Run the code stub below to input your own sentences


In [37]:

def predict_for(txt):
  # Vectorize the sentence with the tokenizer that was fitted on the training data
  text_labels = encoder.classes_
  to_test_token = tokenize.texts_to_matrix([str(txt)])
  prediction = model.predict(to_test_token)
  predicted_label = text_labels[np.argmax(prediction)]
  print("Predicted label: " + str(predicted_label) + "\n")

def prompt_model_trial():
  # Keep prompting until the user types 'no thanks' (in any letter case)
  while True:
    print("To stop sentence input, type 'no thanks'\n")
    get = input("Please input a sentence to see what the model predicts: ")
    if get.strip().lower() == "no thanks":
      break
    predict_for(get)

prompt_model_trial()

To stop sentence input, type 'no thanks'

Please input a sentence to see what the model predicts: I am so ready for this test NotReally
Predicted label: 0

To stop sentence input, type 'no thanks'

Please input a sentence to see what the model predicts: wow. I love going to the dentist early in the morning. great.
Predicted label: 1

To stop sentence input, type 'no thanks'

Please input a sentence to see what the model predicts: I really wish I have a test these holidays. Great!
Predicted label: 0

To stop sentence input, type 'no thanks'

Please input a sentence to see what the model predicts: I love my dog
Predicted label: 0

To stop sentence input, type 'no thanks'

Please input a sentence to see what the model predicts: Im so happy the whole world crumbled on me. yikes.
Predicted label: 0

To stop sentence input, type 'no thanks'



KeyboardInterrupt: ignored