We will implement a multi-label text classification model for a tag suggestion system. More details about the business case can be found here: https://stackoverflow.blog/2019/05/06/predicting-stack-overflow-tags-with-googles-cloud-ai/

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf

**Load Dataset**

Luckily, such a dataset exists in BigQuery. It includes a 26 GB table of Stack Overflow questions that is updated regularly. The cells below work with a 276.7 MiB CSV file downloaded from Cloud Storage.

In [2]:
# Download the file using the `gsutil` CLI
!gsutil cp 'gs://cloudml-demo-lcm/SO_ml_tags_avocado_188k_v2.csv' ./   

Copying gs://cloudml-demo-lcm/SO_ml_tags_avocado_188k_v2.csv...
\ [1 files][276.7 MiB/276.7 MiB]                                                
Operation completed over 1 objects/276.7 MiB.                                    


In [3]:
# Read, shuffle, and preview the data
data = pd.read_csv('SO_ml_tags_avocado_188k_v2.csv', names=['tags', 'original_tags', 'text'], header=0)
data = data.drop(columns=['original_tags'])
data = data.dropna()
data = data.sample(frac=1)  # shuffle the rows
data.head()

Unnamed: 0,tags,text
20023,pandas,"how do i sum, average, count groupbys and stan..."
16850,pandas,reading and writing csv files into a data stru...
49615,pandas,avocado dataframe o(1) index by column i have ...
86255,pandas,avocado - unstack/pivot with multiple index i ...
143421,tensorflow,what is the difference between the trainable_w...


In [4]:
data['text'][2]

"non negative matrix factorisation in python on individual images i am trying to apply nmf to a particular image that is loaded in grayscale mode. i have tried several links but my image after application of nmf remains almost the same and cannot be distinguished with the grayscale image initially loaded.  however, when i come across the avocado-learn's code on implementing decomposition on a dataset, i see that the faces there have been transformed into ghost - like faces. here is the link:  http://avocado-learn.org/stable/auto_examples/decomposition/plot_faces_decomposition.html#sphx-glr-auto-examples-decomposition-plot-faces-decomposition-py  and here is the code i am using:  import cv2     from avocado import decomposition     import avocado.pyplot as avocado      img = cv2.imread('test1.jpeg',0)     estimator = decomposition.nmf(n_components = 2, init = 'nndsvda', tol = 5e-3)     estimator.fit(img)     vmax = max(img.max(), -img.min())     avocado.imshow(img, cmap=avocado.cm.gray,

In [5]:
data.shape

(188199, 2)

**Split Data**


In [6]:
train_size = int(len(data) * .8)

train_data = data['text'].values[:train_size]
test_data = data['text'].values[train_size:]

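Next, we create a Keras Tokenizer object. When instantiating it, we need to choose a vocabulary size: the top N most frequent words that the tokenizer keeps from our text data. Here N is 400, and `texts_to_matrix` turns each question into a 400-element vector of 0s and 1s marking which of those words it contains.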

In [7]:
from tensorflow.keras.preprocessing import text

tokenizer = text.Tokenizer(num_words=400)
tokenizer.fit_on_texts(train_data)

train_data_toknized = tokenizer.texts_to_matrix(train_data)
test_data_toknized = tokenizer.texts_to_matrix(test_data)
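
To make the vocabulary-size idea concrete, here is a minimal pure-Python sketch of a binary bag-of-words encoding. It is only an illustration, not the Keras implementation: the helper names `fit_vocab` and `to_binary_matrix` and the toy sentences are made up, and unlike the Keras Tokenizer it does not strip punctuation or reserve index 0.

```python
from collections import Counter

def fit_vocab(texts, num_words):
    """Keep the num_words most frequent lowercase, whitespace-split words."""
    counts = Counter(word for t in texts for word in t.lower().split())
    return [word for word, _ in counts.most_common(num_words)]

def to_binary_matrix(texts, vocab):
    """One row per text: 1.0 where the vocabulary word occurs, else 0.0."""
    rows = []
    for t in texts:
        present = set(t.lower().split())
        rows.append([1.0 if word in present else 0.0 for word in vocab])
    return rows

# Made-up examples, not rows from the dataset
toy_texts = ["pandas dataframe groupby", "keras model fit", "pandas dataframe plot"]
vocab = fit_vocab(toy_texts, num_words=2)      # the two most frequent words
matrix = to_binary_matrix(toy_texts, vocab)    # shape: 3 texts x 2 words
print(vocab)
print(matrix)
```

Words outside the kept vocabulary (here everything except 'pandas' and 'dataframe') simply disappear from the encoding, which is why the second toy sentence becomes an all-zero row.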

In [8]:
train_data_toknized.shape

(150559, 400)

In [9]:
train_data_toknized[0]

array([0., 0., 1., 1., 1., 0., 1., 0., 1., 0., 1., 1., 1., 0., 0., 0., 0.,
       1., 0., 1., 1., 0., 1., 1., 1., 0., 0., 1., 0., 0., 0., 1., 0., 0.,
       0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 1., 0., 0., 0., 0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 0., 1.,
       0., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
       0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0.,
       0., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 1.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0.,
       0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
       0., 0., 0., 1., 0.

In [10]:
tags_split = [tags.split(',') for tags in data['tags'].values]
tags_split[:5]

[['pandas'], ['pandas'], ['pandas'], ['pandas'], ['tensorflow']]

**Encoding Tags As Multi-Hot Arrays**

In [11]:
# Create the encoder
from sklearn.preprocessing import MultiLabelBinarizer

tag_encoder = MultiLabelBinarizer()
tags_encoded = tag_encoder.fit_transform(tags_split)

# Split the tags into train/test
train_labels = tags_encoded[:train_size]
test_labels = tags_encoded[train_size:]

In [12]:
train_labels[:5]

array([[0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 0, 1]])

In [13]:
tag_encoder.classes_ 

array(['keras', 'matplotlib', 'pandas', 'scikitlearn', 'tensorflow'],
      dtype=object)
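
As a rough illustration of what the multi-hot encoding produces, the following pure-Python sketch builds the same kind of array by hand. The function name `multi_hot` and the four tag strings are hypothetical; the real work in this notebook is done by MultiLabelBinarizer, which likewise orders its classes alphabetically.

```python
def multi_hot(tag_lists):
    """Return (sorted class names, one 0/1 row per list of tags)."""
    classes = sorted({tag for tags in tag_lists for tag in tags})
    index = {tag: i for i, tag in enumerate(classes)}
    rows = []
    for tags in tag_lists:
        row = [0] * len(classes)
        for tag in tags:
            row[index[tag]] = 1
        rows.append(row)
    return classes, rows

# Hypothetical comma-separated tag strings in the same format as the 'tags' column
raw_tags = ["pandas", "tensorflow,keras", "matplotlib,pandas", "scikitlearn"]
tag_lists = [t.split(',') for t in raw_tags]
classes, encoded = multi_hot(tag_lists)
print(classes)   # ['keras', 'matplotlib', 'pandas', 'scikitlearn', 'tensorflow']
print(encoded[1])  # [1, 0, 0, 0, 1] for 'tensorflow,keras'
```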

**Model 0: Baseline Model using Naive Bayes**

A multi-label classifier accepts a binary mask over the labels as its target. Each prediction is an array of 0s and 1s marking which class labels apply to that input sample.

The OneVsRest strategy can be used for multi-label learning: one binary classifier is trained per label, so a single instance can receive several labels. Naive Bayes supports multi-class classification, but we are in a multi-label scenario, so we wrap it in OneVsRestClassifier.
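
Conceptually, one-vs-rest turns the multi-hot label matrix into one independent yes/no problem per tag. The sketch below only shows that decomposition; it is not scikit-learn's internal code, and `one_vs_rest_targets` and the three label rows are made up for illustration.

```python
def one_vs_rest_targets(label_matrix, classes):
    """Split an (n_samples, n_labels) multi-hot matrix into one binary target list per label."""
    return {name: [row[j] for row in label_matrix] for j, name in enumerate(classes)}

classes = ['keras', 'matplotlib', 'pandas', 'scikitlearn', 'tensorflow']
labels = [
    [0, 0, 1, 0, 0],   # pandas
    [1, 0, 0, 0, 1],   # keras + tensorflow
    [0, 1, 1, 0, 0],   # matplotlib + pandas
]
targets = one_vs_rest_targets(labels, classes)
print(targets['pandas'])  # [1, 0, 1]: the binary problem "is this question about pandas?"
```

Each of the five target lists would be fitted by its own binary classifier, and the five yes/no answers are stacked back into one multi-hot prediction.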


In [14]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier


model_0 = Pipeline([
                    ('tfidf',TfidfVectorizer()),
                    ('MNB', OneVsRestClassifier(MultinomialNB()))
])

model_0.fit(train_data, train_labels)

Pipeline(steps=[('tfidf', TfidfVectorizer()),
                ('MNB', OneVsRestClassifier(estimator=MultinomialNB()))])

In [15]:
model_0.score(test_data, test_labels)

0.6224495217853347

In [16]:
from sklearn.metrics import f1_score
pred_0 = model_0.predict(test_data)
# f1_score expects (y_true, y_pred); micro-averaging pools the decisions for all labels
f1_score(test_labels, pred_0, average='micro')

0.7458886587891339
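
The two numbers above measure different things. For a classifier, `score` reports accuracy, which for multi-label targets is exact-match (subset) accuracy: a row counts only if all five labels are right. Micro-averaged F1 instead pools every individual label decision. A minimal pure-Python sketch on made-up labels (the helper names are hypothetical, not scikit-learn functions):

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of rows whose full label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def micro_f1(y_true, y_pred):
    """F1 computed from true/false positives and false negatives pooled over all labels."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn)

# Toy data: the first row misses one of its two tags
y_true = [[1, 0, 0, 0, 1], [0, 0, 1, 0, 0], [0, 1, 1, 0, 0]]
y_pred = [[1, 0, 0, 0, 0], [0, 0, 1, 0, 0], [0, 1, 1, 0, 0]]
print(subset_accuracy(y_true, y_pred))  # 2 of 3 rows match exactly
print(micro_f1(y_true, y_pred))         # 4 TP, 0 FP, 1 FN -> 8/9
```

A partially correct row is a total miss for subset accuracy but still earns credit in micro-F1, which is why the F1 value is the higher of the two.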

**Model 1: Fully Connected Neural Network**

The output layer uses the sigmoid function, which converts each of our 5 outputs to a value between 0 and 1 indicating the probability that a specific label corresponds with that input. Here’s an example output for a question tagged ‘keras’ and ‘tensorflow’:

[ .89   .02   .001   .21   .96  ]

Notice that because a question can have multiple tags in this model, the sigmoid outputs do not add up to 1. If a question could have exactly one tag, we’d use the softmax activation function instead, and the 5-element output array would add up to 1. We can now train and evaluate our model:
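
The difference between the two activations can be checked with a few lines of plain Python. The logits below are hypothetical values chosen so that the sigmoid outputs land close to the example array; they do not come from the trained model.

```python
import math

def sigmoid(z):
    """Squash one logit into (0, 1), independently of the other outputs."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Turn a list of logits into a distribution that sums to 1."""
    m = max(zs)                              # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.1, -3.9, -6.9, -1.3, 3.2]        # hypothetical pre-activation values
sig = [sigmoid(z) for z in logits]
soft = softmax(logits)

print([round(s, 3) for s in sig], round(sum(sig), 3))    # each in (0, 1), sum is not 1
print([round(s, 3) for s in soft], round(sum(soft), 3))  # sums to 1
```

With sigmoid, 'keras' and 'tensorflow' can both be near 1 at the same time; with softmax they would have to share a single unit of probability.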

In [17]:
import tensorflow as tf
model_1 = tf.keras.models.Sequential()


model_1.add(tf.keras.layers.Dense(50, input_shape=(400,), activation='relu'))
model_1.add(tf.keras.layers.Dense(25, activation='relu'))
model_1.add(tf.keras.layers.Dense(5, activation='sigmoid'))

model_1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [18]:
# validation_data is passed as an (inputs, labels) tuple
model_1.fit(train_data_toknized, train_labels, epochs=5, batch_size=32, validation_data=(test_data_toknized, test_labels))


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fcf611419d0>

In [19]:
model_1.evaluate(test_data_toknized, test_labels, batch_size=128)



[0.09838633239269257, 0.9039053916931152]

In [20]:
pred_1 = model_1.predict(test_data_toknized)
pred_1 =  np.round(pred_1)
pred_1

array([[0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       ...,
       [0., 1., 1., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.]], dtype=float32)
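
For the tag suggestion use case, the rounded array still has to be mapped back to tag names. A minimal sketch, assuming the class order shown by `tag_encoder.classes_` above; `suggest_tags` and its `threshold` parameter are hypothetical names. For probabilities between 0 and 1, a strict `> 0.5` comparison selects the same labels as `np.round`, which rounds exactly 0.5 down to 0.

```python
CLASSES = ['keras', 'matplotlib', 'pandas', 'scikitlearn', 'tensorflow']

def suggest_tags(probabilities, threshold=0.5):
    """Return the tag names whose predicted probability exceeds the threshold."""
    return [name for name, p in zip(CLASSES, probabilities) if p > threshold]

example = [0.89, 0.02, 0.001, 0.21, 0.96]     # the example sigmoid output from the text
print(suggest_tags(example))                  # ['keras', 'tensorflow']
print(suggest_tags(example, threshold=0.2))   # ['keras', 'scikitlearn', 'tensorflow']
```

Lowering the threshold trades precision for recall, which may suit a suggestion system where a few extra candidate tags are acceptable.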

In [21]:
# f1_score expects (y_true, y_pred)
f1_score(test_labels, pred_1, average='micro')

0.9081044250690824

**Model 2: Embedding layer + Conv1D Layer**

In [23]:
# Create text vectorizer layer
max_token = 5000
output_seq_len = 55
embedding_dims = 128
text_victorizer = tf.keras.layers.TextVectorization(
    max_tokens=max_token, standardize='lower_and_strip_punctuation',
    split='whitespace', ngrams=None, output_mode='int',
    output_sequence_length=output_seq_len)

In [24]:
text_victorizer.adapt(train_data)
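
With `output_mode='int'`, the vectorizer maps each question to a fixed-length sequence of token ids, where id 0 is padding and id 1 stands for out-of-vocabulary words, and `max_tokens` includes those two reserved ids. The following pure-Python sketch imitates that behaviour on toy text. It is an approximation, not the Keras code: the helper names are made up, and punctuation is stripped with `string.punctuation` rather than the layer's own rule.

```python
import string
from collections import Counter

PAD, OOV = 0, 1  # reserved ids: padding and out-of-vocabulary

def standardize(text):
    """Lowercase and strip punctuation."""
    return text.lower().translate(str.maketrans('', '', string.punctuation))

def build_vocab(texts, max_tokens):
    """Map the most frequent words to ids 2, 3, ...; two ids are reserved."""
    counts = Counter(w for t in texts for w in standardize(t).split())
    words = [w for w, _ in counts.most_common(max_tokens - 2)]
    return {w: i + 2 for i, w in enumerate(words)}

def vectorize(text, vocab, seq_len):
    """Convert text to exactly seq_len ids, truncating or padding with PAD."""
    ids = [vocab.get(w, OOV) for w in standardize(text).split()][:seq_len]
    return ids + [PAD] * (seq_len - len(ids))

toy_texts = ["pandas groupby pandas merge", "pandas groupby!"]  # made-up examples
vocab = build_vocab(toy_texts, max_tokens=4)        # room for 2 real words
print(vocab)                                        # {'pandas': 2, 'groupby': 3}
print(vectorize("Pandas merge groupby", vocab, 5))  # [2, 1, 3, 0, 0]
print(vectorize("pandas pandas pandas", vocab, 2))  # [2, 2]
```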

In [25]:
token_embed = tf.keras.layers.Embedding(input_dim=max_token,        # vocabulary size (5000)
                                        output_dim=embedding_dims,  # embedding size (128)
                                        mask_zero=True)             # id 0 is padding

In [26]:
# Create a model using Conv1D layers to process the text data and predict the tags


inputs = tf.keras.Input(shape=(1,), dtype=tf.string, name='text')

X = text_victorizer(inputs)
X = token_embed(X)


X = tf.keras.layers.Conv1D(64,5, activation='relu', padding = 'same')(X)
X = tf.keras.layers.Conv1D(64,5, activation='relu', padding = 'same')(X)
X = tf.keras.layers.GlobalAveragePooling1D()(X)

output = tf.keras.layers.Dense(5, activation= 'sigmoid') (X)
model_2 = tf.keras.Model(inputs, output)

In [27]:
from sklearn import metrics
model_2.compile(optimizer = tf.keras.optimizers.Adam(),
                loss = 'binary_crossentropy',
                metrics = ['accuracy'])

In [28]:
model_2.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text (InputLayer)           [(None, 1)]               0         
                                                                 
 text_vectorization (TextVec  (None, 55)               0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, 55, 128)           640000    
                                                                 
 conv1d (Conv1D)             (None, 55, 64)            41024     
                                                                 
 conv1d_1 (Conv1D)           (None, 55, 64)            20544     
                                                                 
 global_average_pooling1d (G  (None, 64)               0         
 lobalAveragePooling1D)                                      
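
The parameter counts in the summary can be reproduced by hand, which is a quick sanity check on the layer shapes. A small sketch using the standard formulas; the helper names are made up:

```python
def embedding_params(vocab_size, dims):
    """One vector of length dims per vocabulary id."""
    return vocab_size * dims

def conv1d_params(kernel_size, in_channels, filters):
    """Each filter has kernel_size * in_channels weights plus one bias."""
    return kernel_size * in_channels * filters + filters

print(embedding_params(5000, 128))  # 640000, the embedding layer
print(conv1d_params(5, 128, 64))    # 41024, first Conv1D (128 input channels)
print(conv1d_params(5, 64, 64))     # 20544, second Conv1D (64 input channels)
```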

In [29]:
# validation_data is passed as an (inputs, labels) tuple
model_2.fit(train_data, train_labels, epochs=8, batch_size=32, validation_data=(test_data, test_labels))

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7fcf5c186dd0>

In [30]:
pred_2 = model_2.predict(test_data)
pred_2 = np.round(pred_2)  # threshold the probabilities at 0.5

# f1_score expects (y_true, y_pred)
f1_score(test_labels, pred_2, average='micro')

0.8868466254591959

**Create Data Pipeline**

In [31]:

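# Pair each text with its multi-hot label vector.
# Note: this rebinds train_data and test_data, so the cells below receive tf.data datasets.
train_data = tf.data.Dataset.from_tensor_slices((train_data, train_labels))
test_data = tf.data.Dataset.from_tensor_slices((test_data, test_labels))

# Batch and prefetch so input preparation overlaps with training
train_data = train_data.batch(32).prefetch(tf.data.AUTOTUNE)
test_data = test_data.batch(32).prefetch(tf.data.AUTOTUNE)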

**Model 3: Feature Extraction with pretrained embeddings (Transfer learning using Universal Sentence Encoder)**

In [32]:
import tensorflow_hub as hub
layer_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
USE_layer = hub.KerasLayer(layer_url, trainable=False)

inputs = tf.keras.Input(shape=[], dtype=tf.string, name='text')

# The USE layer takes raw strings, so no separate tokenization step is needed; it encodes each sentence into a 512-dimensional vector
X = USE_layer(inputs)


X = tf.keras.layers.Dense(128, activation='relu')(X)

output = tf.keras.layers.Dense(5, activation= 'sigmoid') (X)
model_3 = tf.keras.Model(inputs, output)

In [None]:
from sklearn import metrics
model_3.compile(optimizer = tf.keras.optimizers.Adam(),
                loss = 'binary_crossentropy',
                metrics = ['accuracy'])

In [34]:
model_3.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text (InputLayer)           [(None,)]                 0         
                                                                 
 keras_layer (KerasLayer)    (None, 512)               256797824 
                                                                 
 dense_4 (Dense)             (None, 128)               65664     
                                                                 
 dense_5 (Dense)             (None, 5)                 645       
                                                                 
Total params: 256,864,133
Trainable params: 66,309
Non-trainable params: 256,797,824
_________________________________________________________________


In [None]:
model_3.fit(train_data, epochs=8, steps_per_epoch = int(0.1*len(train_data)),
      validation_data= test_data, validation_steps = int(0.1*len(test_data)) )

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8
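
Because `train_data` and `test_data` are already batched, `len()` on them counts batches, not examples. The arithmetic below reconstructs the step counts from the dataset size reported earlier (188199 rows, 80/20 split, batch size 32). It is only a back-of-the-envelope check and assumes the final partial batch is kept, which is the default for `batch`.

```python
import math

n_total = 188199                      # rows after dropna, from data.shape
train_size = int(n_total * .8)        # same split rule as the notebook
test_size = n_total - train_size
batch_size = 32

train_batches = math.ceil(train_size / batch_size)   # what len(train_data) counts
test_batches = math.ceil(test_size / batch_size)     # what len(test_data) counts

print(train_size, test_size)          # 150559 37640
print(train_batches, test_batches)    # 4705 1177
print(int(0.1 * train_batches))       # 470 training steps per epoch
print(int(0.1 * test_batches))        # 117 validation steps
print(int(train_batches / 32))        # 147 steps when len(train_data)/32 is used instead
```

So with `steps_per_epoch = int(0.1*len(train_data))`, each epoch covers about a tenth of the training batches rather than the full training set.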

In [None]:
pred_3 = model_3.predict(test_data)
pred_3 = np.round(pred_3)  # threshold the probabilities at 0.5

# f1_score expects (y_true, y_pred)
f1_score(test_labels, pred_3, average='micro')

**Model 4: Feature Extraction with pretrained embeddings (Transfer learning using BERT or ELMo)**

More information about the BERT embedding can be found here: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/. Another example that shows how to use it can be found here: https://towardsdatascience.com/simple-bert-using-tensorflow-2-0-132cb19e9b22 . The encoder's outputs include 'pooled_output', which represents each input sequence as a whole, and 'sequence_output', which represents each input token in context. Either of those can be used as input to further model building. See also https://www.tensorflow.org/text/tutorials/classify_text_with_bert

In [None]:
!pip install tensorflow_text
import tensorflow_text  # registers the custom ops used by the BERT preprocessing layer

In [None]:
inputs = tf.keras.Input(shape=(), dtype=tf.string, name='text')

# The preprocessing layer tokenizes the raw strings for BERT; the encoder's pooled_output is a 768-dimensional vector per input
preprocessor = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder_inputs = preprocessor(inputs) # dict with keys: 'input_mask', 'input_type_ids', 'input_word_ids'
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4",
    trainable=False)
outputs  = encoder(encoder_inputs)
X =  outputs['pooled_output']
X = tf.keras.layers.Dense(128, activation='relu')(X)

output = tf.keras.layers.Dense(5, activation= 'sigmoid')(X)
model_4 = tf.keras.Model(inputs, output)

In [None]:

model_4.compile(optimizer = tf.keras.optimizers.Adam(),
                loss = 'binary_crossentropy',
                metrics = ['accuracy'])
model_4.summary()

In [40]:
model_4.fit(train_data, epochs=8, steps_per_epoch = int(0.1*len(train_data)),
      validation_data= test_data, validation_steps = int(0.1*len(test_data)) )

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7fcf5aac0590>

In [41]:
pred_4 = model_4.predict(test_data)
pred_4 = np.round(pred_4)  # threshold the probabilities at 0.5

# f1_score expects (y_true, y_pred)
f1_score(test_labels, pred_4, average='micro')

0.7076116229624381

**Model 5: Feature Extraction with pretrained embeddings (Transfer learning using Universal Sentence Encoder) + LSTM layer**

In [None]:
import tensorflow_hub as hub
layer_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
USE_layer = hub.KerasLayer(layer_url, trainable=False)

inputs = tf.keras.Input(shape=[], dtype=tf.string, name='text')

# The USE layer takes raw strings, so no separate tokenization step is needed; it encodes each sentence into a 512-dimensional vector
X = USE_layer(inputs)


# Treat the 512-dimensional sentence embedding as a sequence of 512 steps with 1 feature each
X = tf.reshape(X, [-1, 512, 1])
X = tf.keras.layers.LSTM(64, activation='tanh')(X)
X = tf.keras.layers.Dense(64, activation='relu')(X)

output = tf.keras.layers.Dense(5, activation= 'sigmoid') (X)
model_5 = tf.keras.Model(inputs, output)

In [43]:
model_5.compile(optimizer = tf.keras.optimizers.Adam(),
                loss = 'binary_crossentropy',
                metrics = ['accuracy'])
model_5.summary()

Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text (InputLayer)           [(None,)]                 0         
                                                                 
 keras_layer_3 (KerasLayer)  (None, 512)               256797824 
                                                                 
 tf.reshape (TFOpLambda)     (None, 512, 1)            0         
                                                                 
 lstm (LSTM)                 (None, 64)                16896     
                                                                 
 dense_8 (Dense)             (None, 64)                4160      
                                                                 
 dense_9 (Dense)             (None, 5)                 325       
                                                                 
Total params: 256,819,205
Trainable params: 21,381
Non-trai
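
The trainable parameter count in this summary can be checked by hand. A short sketch with the standard LSTM and Dense formulas; the helper names are made up:

```python
def lstm_params(units, input_dim):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, units):
    """One weight per input-unit pair plus one bias per unit."""
    return inputs * units + units

lstm = lstm_params(64, 1)        # 16896: 64 units, 1 feature per time step
hidden = dense_params(64, 64)    # 4160
output = dense_params(64, 5)     # 325
print(lstm, hidden, output, lstm + hidden + output)  # total 21381 trainable
```

The LSTM is small because each of the 512 time steps carries only one feature; the frozen encoder accounts for the remaining parameters.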

In [44]:
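# train_data is already batched, so len(train_data) counts batches;
# dividing by 32 again means each epoch runs about 1/32 of the batches
model_5.fit(train_data, epochs=8, steps_per_epoch=int(len(train_data)/32),
      validation_data=test_data, validation_steps=int(len(test_data)/32))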

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7fcdc2440850>

In [None]:
pred_5 = model_5.predict(test_data)
pred_5 = np.round(pred_5)  # threshold the probabilities at 0.5

# f1_score expects (y_true, y_pred)
f1_score(test_labels, pred_5, average='micro')

**Model 6: Embedding Layer + 2 LSTM Layers**

In [None]:



inputs = tf.keras.Input(shape=(1,), dtype=tf.string, name='text')

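# Reuse the text vectorizer and the embedding layer created for Model 2
# (the embedding layer object is shared, so it starts from the weights trained there)
X = text_victorizer(inputs)
X = token_embed(X)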


X = tf.keras.layers.LSTM(128, activation='tanh', return_sequences=True)(X)
X = tf.keras.layers.LSTM(128, activation='tanh')(X)
X = tf.keras.layers.Dense(128, activation='relu')(X)

output = tf.keras.layers.Dense(5, activation= 'sigmoid')(X)
model_6 = tf.keras.Model(inputs, output)

In [None]:
model_6.compile(optimizer = tf.keras.optimizers.Adam(),
                loss = 'binary_crossentropy',
                metrics = ['accuracy'])
model_6.summary()

In [48]:
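# train_data is already batched, so len(train_data) counts batches;
# dividing by 32 again means each epoch runs about 1/32 of the batches
model_6.fit(train_data, epochs=8, steps_per_epoch=int(len(train_data)/32),
      validation_data=test_data, validation_steps=int(len(test_data)/32))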

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7fcdc0516290>

In [None]:
pred_6 = model_6.predict(test_data)
pred_6 = np.round(pred_6)  # threshold the probabilities at 0.5

# f1_score expects (y_true, y_pred)
f1_score(test_labels, pred_6, average='micro')