In this code demo we will see how to use a TF-IDF vector representation together with a multilayer perceptron to do text classification.

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
import pandas as pd
import numpy as np

In [3]:
data=pd.read_csv("/content/gdrive/MyDrive/RNN-LSTM/stack-overflow-data.csv")

In [4]:
data.head()

Unnamed: 0,post,tags
0,what is causing this behavior in our c# datet...,c#
1,have dynamic html load as if it was in an ifra...,asp.net
2,how to convert a float value in to min:sec i ...,objective-c
3,.net framework 4 redistributable just wonderi...,.net
4,trying to calculate and print the mean and its...,python


Let's try to understand what this data set is about. It is a collection of questions that were asked on Stack Overflow, together with the tag of each question. For example, in the rows above, the first question is tagged c#, the fourth is about .net, and the fifth is about Python. The task is to take the question text as input and predict the tag of that question. To do this, I will separate the question and the tag into two objects, X and y. Next, I will split my data into training and test components.

In [5]:
X=data['post']
y=data['tags']

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.10,random_state=200)

In [8]:
X_train.head()

21805    is there an library to support nsdictionary tr...
25783    replacing fileinputstream with getresourceasst...
34843    how to email the current screen of ios app in ...
16332    how to create the nscalender and gregorian cal...
32383    is it possible to check what the client is sen...
Name: post, dtype: object
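
To see what the split does, here is a small sketch on made-up data. The names posts and tags are hypothetical stand-ins for the two columns, and stratify=tags is an optional extra that the demo itself does not use; it keeps the tag proportions the same in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the 'post' and 'tags' columns.
posts = pd.Series([f"question {i}" for i in range(20)])
tags = pd.Series(["python", "java"] * 10)

# test_size=0.10 holds out 10% of the rows; random_state fixes the shuffle.
# stratify=tags is NOT used in the demo; it is shown here as an option.
p_train, p_test, t_train, t_test = train_test_split(
    posts, tags, test_size=0.10, random_state=200, stratify=tags
)

print(len(p_train), len(p_test))   # 18 2
print(p_train.index[:5].tolist())  # shuffled original row labels, as in X_train.head()
```

The shuffled index explains why X_train.head() above shows row labels such as 21805 rather than 0, 1, 2.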

I'll create a TF-IDF representation of all the questions in the train set as well as in the test set. For that, I will use the text module within scikit-learn's feature_extraction module.

In [9]:
import sklearn.feature_extraction.text as text

This text module contains a class called TfidfVectorizer.

In [10]:
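tfidf=text.TfidfVectorizer(stop_words='english',max_features=1000,ngram_range=(1,1))

The input parameter of TfidfVectorizer is not the data itself. It only says what kind of input to expect ('content', 'filename' or 'file') and defaults to 'content', so I leave it out here. The list of questions is passed later, when the vectorizer is fitted.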

I want to filter out all the common English stop words, which is what stop_words='english' does. When you create a TF-IDF matrix, the number of columns can become very large, so here we put an artificial restriction (max_features=1000) that our TF-IDF matrix has only 1000 columns. I am also using only unigrams (ngram_range=(1,1)). With these settings, we create an object of the TfidfVectorizer class.

Now I will pass my questions, in the form of a list, to this object. The object has a fit_transform() method, which creates a TF-IDF representation of all the questions in my train set.

In [11]:
X_train=tfidf.fit_transform(X_train.tolist())
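
The effect of stop_words and max_features is easier to see on a tiny corpus. The sketch below uses three made-up sentences (docs is hypothetical) and a cap of 3 features instead of 1000, purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; the demo uses the Stack Overflow posts instead.
docs = [
    "how to read a file in python",
    "how to read a file in java",
    "python list of lists",
]

# No restrictions: every token of two or more characters becomes a column.
full = TfidfVectorizer().fit(docs)

# Same kind of settings as the demo, with a much smaller cap on the vocabulary.
small = TfidfVectorizer(stop_words="english", max_features=3, ngram_range=(1, 1)).fit(docs)

print(sorted(full.vocabulary_))   # includes stop words such as 'how', 'to', 'in'
print(sorted(small.vocabulary_))  # at most 3 terms, none of them stop words
```

With max_features set, the vectorizer keeps the terms with the highest counts across the corpus, so rare words are the first to be dropped.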

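Similarly, I will create a TF-IDF representation of all the questions in my test set. Here I call transform() rather than fit_transform(), so the vocabulary and weights learned from the train set are reused.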

In [12]:
X_test=tfidf.transform(X_test.tolist())

In [13]:
X_train.shape

(36000, 1000)

In [14]:
X_test.shape

(4000, 1000)
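
Both matrices have the same 1000 columns because the vectorizer was fitted once, on the train set, and only applied to the test set. A minimal sketch of that behaviour, with made-up sentences (train_docs and test_docs are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["python list sort", "java string sort"]  # hypothetical
test_docs = ["python string kotlin"]                    # 'kotlin' never appears in training

vec = TfidfVectorizer()
train_m = vec.fit_transform(train_docs)  # learns the vocabulary and the IDF weights
test_m = vec.transform(test_docs)        # reuses them; nothing is refitted

print(train_m.shape, test_m.shape)       # same number of columns in both
print(sorted(vec.vocabulary_))           # 'kotlin' is not in the vocabulary
print(test_m.toarray().round(2))         # only 'python' and 'string' are non-zero
```

Words that were not seen during fitting are simply ignored at transform time, which is why the test matrix never gains extra columns.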

In [15]:
y.head()

0             c#
1        asp.net
2    objective-c
3           .net
4         python
Name: tags, dtype: object

Now let's take a look at our y vector. You can see that it contains the tags. To use a neural network, we need to convert these tags into an integer representation and then one-hot encode them. To this end, we will create an object of the LabelEncoder class, fit it on all the tags, and obtain a transformed version of the tags in the train set as well as the test set.

In [16]:
from sklearn.preprocessing import LabelEncoder

In [17]:
enc=LabelEncoder()

In [18]:
enc.fit(data['tags'])

LabelEncoder()

In [19]:
y_train=enc.transform(y_train)

In [20]:
y_test=enc.transform(y_test)

In [21]:
y_train

array([15, 11, 15, ...,  5,  3, 18])

Now you can see that all the tags have been converted into numbers. The mapping can be read from the enc.classes_ attribute: the integer code of each tag is its position in this array.

In [22]:
enc.classes_

array(['.net', 'android', 'angularjs', 'asp.net', 'c', 'c#', 'c++', 'css',
       'html', 'ios', 'iphone', 'java', 'javascript', 'jquery', 'mysql',
       'objective-c', 'php', 'python', 'ruby-on-rails', 'sql'],
      dtype=object)
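
The round trip between tags and integer codes can be checked on a handful of labels. This is a small sketch with a made-up list (tags_demo and enc_demo are hypothetical names, separate from the enc object above).

```python
from sklearn.preprocessing import LabelEncoder

tags_demo = ["python", "c#", "java", "python", ".net"]  # hypothetical small sample

enc_demo = LabelEncoder()
codes = enc_demo.fit_transform(tags_demo)

print(enc_demo.classes_)                  # unique tags in sorted order
print(codes)                              # [3 1 2 3 0]: position of each tag in classes_
print(enc_demo.inverse_transform(codes))  # back to the original strings
```

Because the classes are sorted, '.net' gets code 0 here just as it does in the full data set.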

Now we will create a simple multilayer perceptron architecture.

In [24]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout
from tensorflow.keras.utils import to_categorical

Now I have to create a one-hot encoded version of the y vector in my training data, for which I will use the to_categorical() function.

In [25]:
y_train=to_categorical(y_train)

In [26]:
y_train

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.]], dtype=float32)
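
What one-hot encoding produces can be reproduced with plain NumPy. The sketch below is a stand-in for to_categorical, not its implementation; labels and num_classes are illustrative values.

```python
import numpy as np

labels = np.array([15, 11, 5, 18])  # integer codes like those in y_train
num_classes = 20                    # number of distinct tags in this data set

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype="float32")[labels]

print(one_hot.shape)           # (4, 20)
print(one_hot.argmax(axis=1))  # recovers [15 11  5 18]
```

When to_categorical is called without num_classes, it infers the width from the largest label it sees, so passing num_classes=20 explicitly is a reasonable safeguard if some tag might be missing from a split.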

In [27]:
model=Sequential()

In [28]:
model.add(Dense(512,input_shape=(1000,),activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(20,activation='softmax'))

My model has two dense layers with a dropout layer between them. I will now compile my model and fit it on my data.

In [29]:
model.compile(loss="categorical_crossentropy",optimizer="adam",metrics=['accuracy'])
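
To make the architecture and the loss concrete, here is a NumPy sketch of one forward pass through the same layer sizes, followed by the categorical cross-entropy. It is an illustration only: the weights are random rather than trained, the inputs are fake, and the names W1, b1, W2, b2 and forward are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly initialised weights stand in for trained ones (illustration only).
W1, b1 = rng.normal(scale=0.05, size=(1000, 512)), np.zeros(512)
W2, b2 = rng.normal(scale=0.05, size=(512, 20)), np.zeros(20)

def forward(x):
    """Dense(512, relu) -> Dense(20, softmax); dropout is inactive at prediction time."""
    hidden = np.maximum(x @ W1 + b1, 0.0)                     # ReLU
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable softmax
    return exp / exp.sum(axis=1, keepdims=True)

x = rng.random((3, 1000))        # three fake TF-IDF rows
y_true = np.eye(20)[[17, 5, 0]]  # one-hot targets

probs = forward(x)
# Categorical cross-entropy: mean of -log(probability assigned to the true class).
loss = -np.mean(np.sum(y_true * np.log(probs), axis=1))

print(probs.shape)                                   # (3, 20)
print(probs.sum(axis=1))                             # each row sums to 1
print(loss)
print(W1.size + b1.size + W2.size + b2.size)         # 522772 trainable parameters
```

The parameter count matches the layer sizes used above: 1000*512 + 512 for the first dense layer and 512*20 + 20 for the second.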

In [38]:
model.fit(X_train.toarray(),y_train,epochs=2,batch_size=32,verbose=1,validation_split=0.10)
# X_train is a sparse matrix, so it is converted to a dense array with toarray()

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f1bf76bc150>

After fitting my model for only two epochs, I can see that the out-of-sample (validation) accuracy is around 80%.
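
This accuracy is measured on the 10% of the training rows held out by validation_split. The separate test set can be scored in a similar way: y_test is still integer-encoded, so the predicted class (the argmax of each probability row) can be compared with it directly. The sketch below uses small made-up arrays; in the notebook, probs would presumably come from model.predict(X_test.toarray()) and y_true would be y_test.

```python
import numpy as np

# Hypothetical stand-ins for the model's predicted probabilities
# and the integer-encoded true labels.
probs = np.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
])
y_true = np.array([1, 0, 1, 1])

y_pred = probs.argmax(axis=1)         # most probable class per row
accuracy = (y_pred == y_true).mean()  # fraction of correct predictions

print(y_pred)    # [1 0 2 1]
print(accuracy)  # 0.75
```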

Now, to test my model, I have taken a Stack Overflow question. Let's create a string out of this question. This string is stored in the variable q.

In [35]:
q='''

I am new to luigi, came across it while designing a pipeline for our ML efforts and though it wasn't fitted to my particular use case it had so many extra features I decided to fit it to my use case, basically what I was looking for was a way to be able to persist a custom built pipeline and thus have its result repeatable, after reading most of the online tutorials I tried to implement my serialization using the existing luigi.cfg configuration and command line mechanisms and it might have sufficed for the tasks' parameters but it provided no way of serializing the DAG connectivity of my pipeline, so I decided to have a WrapperTask which received a json config file which would create all the task instances and connect all the plumbing. I hereby enclose a small test program for your scrutiny:

import random
import luigi
import time
import os


class TaskNode(luigi.Task):
    i = luigi.IntParameter()  # node ID

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.required = []

    def set_required(self, required=None):
        self.required = required  # set the dependencies
        return self

    def requires(self):
        return self.required

    def output(self):
        return luigi.LocalTarget('{0}{1}.txt'.format(self.__class__.__name__, self.i))

    def run(self):
        with self.output().open('w') as outfile:
            outfile.write('inside {0}{1}\n'.format(self.__class__.__name__, self.i))
        self.process()

    def process(self):
        raise NotImplementedError(self.__class__.__name__ + " must implement this method")


class FastNode(TaskNode):

    def process(self):
        time.sleep(1)


class SlowNode(TaskNode):

    def process(self):
        time.sleep(2)


# This WrapperTask builds all the nodes 
class All(luigi.WrapperTask):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        num_nodes = 513

        classes = TaskNode.__subclasses__()
        self.nodes = []
        for i in reversed(range(num_nodes)):
            cls = random.choice(classes)

            dependencies = random.sample(self.nodes, (num_nodes - i) // 35)

            obj = cls(i=i)
            if dependencies:
                obj.set_required(required=dependencies)
            else:
                obj.set_required(required=None)

            # delete existing output causing a build all
            if obj.output().exists():
                obj.output().remove()  

            self.nodes.append(obj)

    def requires(self):
        return self.nodes


if __name__ == '__main__':
    luigi.run()
So, basically, as states in the question title, this focuses on the dynamic dependencies and generates a 513 node dependency DAG with p=1/35 connectivity probability, it also makes the All WrapperTask class require all nodes (I have a version which only connects it to heads of connected DAG components but I didn't want to over complicate).

Is there a more standard way of implementing this? Especially note the not so pretty complication with the TaskNode init and set_required methods, I only did it this way because receiving parameters in the init method clashes somehow with the way luigi registers parameters. I also tried several other ways but this was basically the most decent one (that worked)

If there isn't a standard way I'd still love to hear any insights you have on the way I plan to go before I implement the entire piping framework.'''

Now I will create a TF-IDF representation of this text and pass it as an input to my model to get a prediction.

In [39]:
q_v=tfidf.transform([q])

In [44]:
q_v

<1x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 78 stored elements in Compressed Sparse Row format>

In [46]:
model.predict(q_v.toarray())

array([[2.0773132e-05, 9.6104111e-07, 2.2326586e-07, 6.2382577e-07,
        2.8022289e-06, 2.0339468e-05, 3.3024339e-06, 1.3054367e-07,
        3.6693990e-07, 2.9894814e-04, 1.3144929e-05, 1.6850501e-05,
        2.1006726e-06, 5.6127288e-08, 2.9671153e-07, 2.3677091e-04,
        3.3554090e-06, 9.9933904e-01, 3.7818296e-05, 2.1792146e-06]],
      dtype=float32)

In [47]:
np.argmax(model.predict(q_v.toarray()))

17

My model predicts this to be class 17. To understand what class 17 stands for, we can use the inverse_transform() method of our encoder object.

In [54]:
enc.inverse_transform([17])

array(['python'], dtype=object)

So my model predicts that this question is about Python, and if you look at the code and its imports, this is indeed a Python question. My model has successfully identified a Python question from its text.
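
A single argmax hides how the remaining probability is spread over the other tags. As a small sketch, the helper below (top_tags is a hypothetical name) ranks the tags using the probabilities and class names printed earlier in this demo.

```python
import numpy as np

# Class names as printed by enc.classes_ above.
classes = np.array(['.net', 'android', 'angularjs', 'asp.net', 'c', 'c#', 'c++', 'css',
                    'html', 'ios', 'iphone', 'java', 'javascript', 'jquery', 'mysql',
                    'objective-c', 'php', 'python', 'ruby-on-rails', 'sql'])

# Probabilities as printed by model.predict(q_v.toarray()) above.
probs = np.array([2.0773132e-05, 9.6104111e-07, 2.2326586e-07, 6.2382577e-07,
                  2.8022289e-06, 2.0339468e-05, 3.3024339e-06, 1.3054367e-07,
                  3.6693990e-07, 2.9894814e-04, 1.3144929e-05, 1.6850501e-05,
                  2.1006726e-06, 5.6127288e-08, 2.9671153e-07, 2.3677091e-04,
                  3.3554090e-06, 9.9933904e-01, 3.7818296e-05, 2.1792146e-06])

def top_tags(probs, classes, k=3):
    """Return the k most probable (tag, probability) pairs, best first."""
    order = np.argsort(probs)[::-1][:k]
    return [(str(classes[i]), float(probs[i])) for i in order]

top = top_tags(probs, classes)
print([tag for tag, _ in top])  # ['python', 'ios', 'objective-c']
```

Here the model puts about 99.9% of the probability on python, with ios and objective-c far behind.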