TEXT CLASSIFICATION USING SVM

INTRODUCTION - 
In this exercies I was tasked with creating code that can clean a raw dataset and extract techinical skills.  I determined that a one-class classification algorithm would be suitable for this binary classification task. With a severely skewed example class distribution, this technique allows me to fit on the input examples from the majority class in the training dataset, then evaluate on a  test dataset with both classes.

I am using a one-class SVM with a non-linear kernel. One-class SVM is perfect for this problem since it's an unsupervised algorithm that learns a decision function for novelty detection: classifying new data as similar or different to the training set. I am utilizing the example technical skills as my training dataset. This dataset has contains 979 instances of a negative case (class -1), which is a normal case, and 0 instances of a postive case (class 1), which is an outlier. 

In [333]:
import numpy as np
import pandas as pd
import gensim
import re
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn import svm
from collections import Iterable


Loading the Data

In [228]:
train_df = pd.read_csv('Example_Technical_Skills.csv')
print(train_df.head())

                       Technology Skills
0                    SAP Fiori Developer
1  Oracle Instance Management & Strategy
2           Boomi Master Data Management
3  Digital Manufacturing on Cloud ( DMC)
4                                 DevOps


In [231]:
test_df = pd.read_csv('Raw_Skills_Dataset.csv')
print(test_df.head())
X = test_df['RAW DATA']
print(X.head())

          RAW DATA
0         What ifs
1        seniority
2      familiarity
3  functionalities
4          Lambdas
0           What ifs
1          seniority
2        familiarity
3    functionalities
4            Lambdas
Name: RAW DATA, dtype: object


In this section I am going to train word embeddings using the data and in order to train those word vectors I'll tokenize the data.

In [271]:
x = train_df['Technology Skills']
x_train_tokenized = [[w for w in sentence.split(" ")] for sentence in x]
x_train_tokenized[0]

['SAP', 'Fiori', 'Developer']

In [342]:
#clean the test data before tokenization
def clean_text(text):
    cleaned = re.sub('[^A-Za-z0-9]+', '', text)
    return cleaned.strip()

x_cleaned = [clean_text(t) for t in X]

In [233]:
x_test_tokenzied = [[w for w in sentence.split(" ")] for sentence in X_cleaned]
print(x_test_tokenzied[0])


['What', 'ifs']


Create word embeddings model

In [234]:
model = gensim.models.Word2Vec(x_train_tokenized,
                 size=100, 
                 window=5, 
                 min_count=1, 
                 workers=2, 
                 sg=1
                )


I will create a class called Sequencer that will convert texts into word embedding sequences.
The constructor function for the class takes 4 parameters: all_words, max_words, seq_length, embedding_matrix
All Words = The entire dataset will be giben in a list format which contains all tokens
Max Words = This parameter will be used in finding most used word
Sequence Length = In machine learning our dataset's number of variable has to be specified. But in real life each sentence might has a different length. In order to prevent this problem I determiend a length of 3 and will adapt each sentence to that length.
Embedding Matrix = Alist of all words and their corresponding embeddings

In [343]:
class Sequencer():
    
    def __init__(self,
                 all_words,
                 max_words,
                 seq_len,
                 embedding_matrix
                ):
        
        self.seq_len = seq_len
        self.embed_matrix = embedding_matrix
        """
        temp_vocab = Vocab which has all the unique words
        self.vocab = The last vocab which has only most used N words.
    
        """
        temp_vocab = list(set(all_words))
        self.vocab = []
        self.word_counts = {}
        """
        Now I'll create a hash map (dict) which includes words and their occurencies
        """
        for word in temp_vocab:
            count = len([0 for w in all_words if w == word])
            self.word_counts[word] = count
            counts = list(self.word_counts.values())
            indexes = list(range(len(counts)))
        
        # Now I'll sort counts and while sorting them also will sort indexes.
        # I'll use those indexes to find most used N word.
        cnt = 0
        while cnt + 1 != len(counts):
            cnt = 0
            for i in range(len(counts)-1):
                if counts[i] < counts[i+1]:
                    counts[i+1],counts[i] = counts[i],counts[i+1]
                    indexes[i],indexes[i+1] = indexes[i+1],indexes[i]
                else:
                    cnt += 1
        
        for ind in indexes[:max_words]:
            self.vocab.append(temp_vocab[ind])
                    
    def textToVector(self,text):
        # First I need to split the text into its tokens and learn the length
        # If length is shorter than the max len I'll add some spaces (100D vectors which has only zero values)
        # If it's longer than the max len I'll trim from the end.
        tokens = text.split()
        len_v = len(tokens)-1 if len(tokens) < self.seq_len else self.seq_len-1
        vec = []
        for tok in tokens[:len_v]:
            try:
                vec.append(self.embed_matrix[tok])
            except Exception as E:
                pass
        
        last_pieces = self.seq_len - len(vec)
        for i in range(last_pieces):
            vec.append(np.zeros(100,))
        
        return np.asarray(vec).flatten()
                
                

In [236]:
max_words = len(model.wv.vocab.items())
print(max_words)

1368


In [239]:

sequencer = Sequencer(all_words = [token for seq in x_train_tokenized for token in seq],
              max_words = max_words,
              seq_len = 3,
              embedding_matrix = model.wv
             )

In [240]:
test_vec = sequencer.textToVector("Natural Language Processing")
print(test_vec)
print(test_vec.shape)

[-6.77003118e-04 -1.66989898e-03  2.44807033e-03 -1.60491443e-03
 -2.95602274e-03 -1.14822015e-03  1.45337312e-03  3.70449293e-03
  4.82298387e-03  3.43484618e-03 -5.66118630e-04 -2.81287055e-03
  3.25641548e-03 -3.14594107e-03 -2.00625625e-03 -4.20035329e-03
 -2.91205989e-03 -1.53663137e-03 -9.77550400e-04  2.19090306e-03
 -3.65276216e-03 -7.33121560e-05  3.48160742e-04 -4.73049656e-03
 -7.70643936e-04 -2.83700996e-03  2.50594970e-03  1.52498495e-03
  1.77432562e-03  4.62842034e-03 -4.55416134e-03  1.38183811e-03
  2.07295598e-04 -2.17316533e-03  1.36961858e-03 -7.02824327e-04
 -3.83880269e-03 -1.48214749e-03 -4.08658572e-03  2.35800678e-03
 -3.31553956e-03  2.48296955e-03  1.70158106e-03 -4.86065540e-03
 -1.57628674e-03  3.37599218e-03  4.66610305e-03  3.46359308e-03
 -5.66189177e-04 -6.85683335e-04  3.20699095e-04  1.35756889e-03
  3.87308840e-03  3.01182014e-03 -3.46286013e-03 -4.95317951e-03
  1.71005714e-03 -1.87465281e-03  2.77475570e-03  4.25478024e-03
 -2.31088814e-03  7.82835

Everything looks fine, but as you see each vector for a sentence has 300 elements and it'll consume a lot of time to train a Support Vector Machine Classifier on this.

In order to prevent a long run time I am going to reduce the the dimensionality of the dataset utilizing Principal Component Analysis. With PCA, I can find a balance between creating a reduced set of variables and capturing a large percentager of the variation.

In [241]:
# But before creating a PCA model using scikit-learn let's create
# vectors for our each vector
x_train_vecs = np.asarray([sequencer.textToVector(" ".join(seq)) for seq in x_train_tokenized])
print(x_train_vecs.shape)
x_test_vecs = np.asarray([sequencer.textToVector(" ".join(seq)) for seq in x_test_tokenzied])
print(x_test_vecs.shape)

(979, 300)
(34116, 300)


In [242]:
pca_model = PCA(n_components=150)
pca_model.fit(x_train_vecs)
print("Sum of variance ratios: ",sum(pca_model.explained_variance_ratio_))

Sum of variance ratios:  0.9595982205442048


Utilizing 150 principal components I can capture roughly 96% of the data's variance while cutting the the dimensionality of the dataset in half.
This result is satisfactory, so I am going to use the transform function and reduce the dimensionionality.

In [243]:
x_train_comps = pca_model.transform(x_train_vecs)
print(x_train_comps.shape)
x_test_comps = pca_model.transform(x_test_vecs)
print(x_test_comps.shape)

(979, 150)
(34116, 150)


After conducting PCA, the model is ready to be created. I am using a one-class SVM with a non-linear kernel.


In [347]:
clf = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.0000087)
clf.fit(x_train_comps)
print(clf.fit_status_)


0


A fit status of zero indicates that the model is correctly fitted onto the training data, so now it is time to genereate predictions on the test data.

In [348]:
predictions = clf.predict(x_test_comps)
print(predictions)

[ 1  1  1 ... -1  1  1]


In [349]:
x_normal_array = np.where(predictions == -1)
x_outlier_array = np.where(predictions == 1)
print(x_normal_array)
print(x_outlier_array)


(array([    5,     6,    35, ..., 34103, 34110, 34113], dtype=int64),)
(array([    0,     1,     2, ..., 34112, 34114, 34115], dtype=int64),)


After seperating the predictions from my model I have to find the indeces for every prediction.

In [322]:
x_normal_inds = np.array(x_normal_array).tolist()
x_outlier_inds = np.array(x_outlier_array).tolist()
x_normal_inds = [num for sublist in x_normal_inds for num in sublist]
print(x_normal_inds[:10])
x_outlier_inds = [num for sublist in x_outlier_inds for num in sublist]
print(x_outlier_inds[:10])

[5, 6, 35, 42, 64, 69, 73, 74, 79, 94]
[0, 1, 2, 3, 4, 7, 8, 9, 10, 11]


Now that I have my indices ready I can access the original text and create new lists that have the technical skills and jargon separated.

In [323]:
technical_skills = []
jargon = []
for i in x_normal_inds:
    technical_skills.append(X[i])
for i in x_outlier_inds:
    jargon.append(X[i])

Finally, I convert the list into csv files for my submission.

In [324]:
tc = pd.DataFrame(technical_skills)
j = pd.DataFrame(jargon)
tc.to_csv('Cleaned_Skills.csv')
j.to_csv('Jargon.csv')
