# Text Classification

This notebook implements multi-class text classification on the Amazon automotive reviews dataset. You choose one combination of data transformation technique and training algorithm.

A rating (1-5) is predicted for each review in the dataset.

The best-scoring combinations are listed below. Choose any one of them for data transformation and model training:

* Creation of word embeddings using gensim's word2vec, followed by training with the Random Forest algorithm.
* Creation of word embeddings using word2vec combined with Smooth Inverse Frequency (SIF) weighting, followed by training with the Random Forest algorithm.
* Vectorisation using the term frequency-inverse document frequency (Tfidf) technique, followed by training with the Random Forest algorithm.
* Vectorisation using the Tfidf technique, followed by training with the Linear Support Vector Classification (Linear SVC) algorithm.

| Data Transformation  | Training Algorithm |
| ------------- | ------------- |
| Word2vec | Random Forest  |
| Word2vec + SIF  | Random Forest  |
| TfIdf Vectorization  | Random Forest  |
| TfIdf Vectorization  | Linear SVC  |

### Clone git repo

In [1]:
BRANCH_NAME="master" #Provide git branch name "master" or "dev"
! git clone -b $BRANCH_NAME https://github.com/CiscoAI/cisco-kubeflow-starter-pack.git

Cloning into 'cisco-kubeflow-starter-pack'...
remote: Enumerating objects: 359, done.
remote: Counting objects: 100% (359/359), done.
remote: Compressing objects: 100% (244/244), done.
remote: Total 4927 (delta 116), reused 292 (delta 72), pack-reused 4568
Receiving objects: 100% (4927/4927), 23.60 MiB | 40.97 MiB/s, done.
Resolving deltas: 100% (1845/1845), done.


### Install required packages

In [2]:
!pip install pandas nltk gensim sklearn scikit-learn==0.20.3 imbalanced-learn==0.4.3

Defaulting to user installation because normal site-packages is not writeable
Collecting pandas
  Downloading pandas-1.1.1-cp36-cp36m-manylinux1_x86_64.whl (10.5 MB)
Collecting nltk
  Downloading nltk-3.5.zip (1.4 MB)
Collecting gensim
  Downloading gensim-3.8.3-cp36-cp36m-manylinux1_x86_64.whl (24.2 MB)
Collecting sklearn
  Downloading sklearn-0.0.tar.gz (1.1 kB)
Collecting scikit-learn==0.20.3
  Downloading scikit_learn-0.20.3-cp36-cp36m-manylinux1_x86_64.whl (5.4 MB)
Collecting imbalanced-learn==0.4.3
  Downloading imbalanced_learn-0.4.3-py3-none-any.whl (166 kB)

### Restart Jupyter notebook kernel

In [None]:
from IPython.display import display_html
display_html("<script>Jupyter.notebook.kernel.restart()</script>",raw=True)

### Import required libraries

In [1]:
#General
import pandas as pd
import numpy as np
import pickle
import yaml
from joblib import dump
import re
import nltk as nl
import gensim
import os
import requests
from collections import Counter

#sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

#nltk
from nltk.corpus import stopwords
nl.download('punkt')
nl.download('stopwords')

#Over-sampling
from imblearn.over_sampling import SMOTENC

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### Convert dataset from JSON format to CSV format

In [2]:
path = "cisco-kubeflow-starter-pack/apps/retail/customer-reviews/onprem"
json_data = pd.read_json(os.path.join(path, "data/amazon_automotive_reviews.json"), lines=True)
json_data.to_csv('amazon_automotive_reviews.csv', index=False)

### Read data from CSV file

In [3]:
raw_data = pd.read_csv('amazon_automotive_reviews.csv')

In [4]:
raw_data.head()

|   | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | A3F73SC1LY51OO | B00002243X | Alan Montgomery | [4, 4] | I needed a set of jumper cables for my new car... | 5 | Work Well - Should Have Bought Longer Ones | 1313539200 | 08 17, 2011 |
| 1 | A20S66SKYXULG2 | B00002243X | alphonse | [1, 1] | These long cables work fine for my truck, but ... | 4 | Okay long cables | 1315094400 | 09 4, 2011 |
| 2 | A2I8LFSN2IS5EO | B00002243X | Chris | [0, 0] | Can't comment much on these since they have no... | 5 | Looks and feels heavy Duty | 1374710400 | 07 25, 2013 |
| 3 | A3GT2EWQSO45ZG | B00002243X | DeusEx | [19, 19] | I absolutley love Amazon!!! For the price of ... | 5 | Excellent choice for Jumper Cables!!! | 1292889600 | 12 21, 2010 |
| 4 | A3ESWJPAVRPWB4 | B00002243X | E. Hernandez | [0, 0] | I purchased the 12' feet long cable set and th... | 5 | Excellent, High Quality Starter Cables | 1341360000 | 07 4, 2012 |


### Clean up initial dataset

In [5]:
raw_data['overallRating'] = raw_data['overall']
raw_data = raw_data.drop(['reviewerID','asin','reviewerName','helpful','overall','summary','unixReviewTime','reviewTime'], axis=1)

### Clean the review text column by removing punctuation and digits

In [6]:
raw_data['p_review'] = raw_data['reviewText'].apply(lambda x: re.sub(r'[^a-zA-Z\s]','', str(x)))
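
To make the effect of this cleaning step concrete, here is a minimal sketch that applies the same regular expression to a single string. The helper name `clean_review` and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import re

def clean_review(text):
    """Drop every character that is not an ASCII letter or whitespace."""
    # Same pattern as the notebook cell; str() mirrors its handling of non-string values
    return re.sub(r'[^a-zA-Z\s]', '', str(text))

# Hypothetical sample sentence, used only for illustration
sample = "I purchased the 12' cable set, and it works great!!!"
print(repr(clean_review(sample)))

# A missing review (NaN) is converted by str() and survives as the word "nan"
print(repr(clean_review(float('nan'))))
```

Two side effects are worth keeping in mind: removing a standalone number leaves a double space behind, and missing reviews become the literal token `nan` because of the `str(x)` conversion.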

### Choose data transformation method

In [7]:
# Data transformation can be done using either word2vec or Smooth inverse frequency (sif) technique or Tf-Idf vectorization (tfidf).
# Choose from ==> ['word2vec', 'sif', 'tfidf']

data_transform = ''

### Choose model training algorithm

In [8]:
# Model training can be done using either the random forest (rf) or the Linear Support Vector Classification (lsvc) algorithm.
# Choose from ==> ['rf','lsvc']

train_algorithm = ''

### Validate data transformation method & training algorithm options

In [9]:
if data_transform not in ['word2vec', 'sif', 'tfidf']:
    raise ValueError("Set a valid method to perform data transformation (word2vec/sif/tfidf)")

In [10]:
if train_algorithm not in ['rf','lsvc']:
    raise ValueError("Set a valid algorithm to train your model (rf/lsvc)")

In [11]:
import warnings

# The combination is allowed, so emit a warning instead of raising an exception
if data_transform in ['word2vec','sif'] and train_algorithm == 'lsvc':
    warnings.warn("The combination selected may not be the best scoring one!")

### Apply data transformation on data as per selected choice

In [12]:
if data_transform in ['word2vec', 'sif']:

    p_review = raw_data['p_review'].to_list()

    # Tokenize each cleaned review into a list of words
    tokens = [nl.word_tokenize(review) for review in p_review]

    # Build the stop-word set once; reloading the list for every token is very slow
    stop_words = set(stopwords.words('english'))
    tokens = [[word for word in doc if word not in stop_words] for doc in tokens]

    # `size` is the gensim 3.x name for the embedding dimension
    wv_model = gensim.models.Word2Vec(tokens, size=300, min_count=1, workers=4)

    # Continue training on the same corpus for 50 more epochs
    wv_model.train(tokens, total_examples=len(tokens), epochs=50)

    print("Word2vec model generated & trained on tokens from review text")

In [13]:
if data_transform == 'word2vec':

    print("Preparing training data using word2vec..")
    embedding_size = 300
    wv_train = []
    for doc in tokens:
        if doc:
            # Average the word vectors of the review
            wv_train.append(np.mean([wv_model.wv[token] for token in doc], axis=0))
        else:
            # No tokens left after cleaning: fall back to a zero vector
            wv_train.append(np.zeros(embedding_size))
    print("Completed")

elif data_transform == 'sif':

    print("Preparing training data using Smooth inverse frequency(SIF)..")
    vlookup = wv_model.wv.vocab
    # Normalization constant Z: total number of word occurrences in the corpus
    Z = sum(vlookup[k].count for k in vlookup)

    a = 0.001
    embedding_size = 300
    wv_sif_train = []
    for doc in tokens:
        vs = np.zeros(embedding_size)
        for word in doc:
            # SIF weight a / (a + p(word)) down-weights frequent words
            a_value = a / (a + (vlookup[word].count / Z))
            vs = np.add(vs, np.multiply(a_value, wv_model.wv[word]))
        # Guard against division by zero for reviews with no tokens
        wv_sif_train.append(np.divide(vs, max(len(doc), 1)))
    print("Completed")

elif data_transform == 'tfidf':
    print("Preparing training data using TfIdf vectorization..")
    tfidf = TfidfVectorizer(ngram_range=(1,2), sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', stop_words='english')
    features = tfidf.fit_transform(raw_data.p_review).toarray()
    print(features.shape)
    print("Completed")

Preparing training data using TfIdf vectorization..
(20473, 23968)
Completed
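
The SIF branch weights each word vector by `a / (a + p(word))`, where `p(word)` is the word's share of all word occurrences in the corpus. The sketch below computes only those weights, in plain Python, for a tiny corpus. The function name `sif_weights` and the toy documents are assumptions made for illustration. Note that the notebook cell applies just this weighting step and not the common-component removal that the original SIF method also describes.

```python
from collections import Counter

def sif_weights(tokenized_docs, a=0.001):
    """Return {word: a / (a + p(word))} with p(word) the corpus relative frequency."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    total = sum(counts.values())  # normalization constant Z in the notebook
    return {word: a / (a + count / total) for word, count in counts.items()}

# Hypothetical toy corpus, for illustration only
docs = [
    ["cables", "work", "fine"],
    ["cables", "feel", "heavy"],
    ["cables", "cables", "arrived"],
]

weights = sif_weights(docs)
# Frequent words receive smaller weights than rare ones
for word, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{word:8s} {weight:.5f}")
```

With `a = 0.001`, a word that makes up a large share of the corpus contributes far less to the sentence vector than a word that appears once.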


### Depict class imbalance issue in dataset using value count for each rating

Rating 5 has far more records than any other rating, so the dataset is highly skewed (imbalanced).

In [14]:
raw_data.overallRating.value_counts()

5    13928
4     3967
3     1430
2      606
1      542
Name: overallRating, dtype: int64
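
The size of the skew can be quantified directly from the counts printed above. The sketch below copies those counts and derives each rating's share and the ratio between the largest and smallest class. The variable names are illustrative.

```python
# Counts copied from the value_counts() output above
rating_counts = {5: 13928, 4: 3967, 3: 1430, 2: 606, 1: 542}

total = sum(rating_counts.values())
for rating, count in sorted(rating_counts.items(), reverse=True):
    print(f"rating {rating}: {count:6d} reviews ({count / total:.1%})")

# Ratio between the most and least frequent rating
imbalance_ratio = max(rating_counts.values()) / min(rating_counts.values())
print(f"majority/minority ratio: {imbalance_ratio:.1f}")
```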

### Initialize target variable to a local variable

In [15]:
y = raw_data.overallRating

### Resample dataset to reduce class imbalance using SMOTENC

The rating categories other than 5 (the majority class) are oversampled so that the classes are closer to balanced, which reduces the bias of the predictions toward the majority class.

In [16]:
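# SMOTENC oversamples each listed class up to the given count;
# categorical_features=[1] marks feature column 1 as categorical
sm = SMOTENC(sampling_strategy={1: 6000, 2: 6200, 3 : 6800, 4: 11000}, random_state=42, categorical_features=[1])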
if data_transform == 'word2vec':
    X_resampled, y_resampled = sm.fit_resample(np.asarray(wv_train), y)
    
elif data_transform == 'sif':
    X_resampled, y_resampled = sm.fit_resample(np.asarray(wv_sif_train), y)
    
elif data_transform == 'tfidf':
    X_resampled, y_resampled = sm.fit_resample(features, y)

print('Resampled dataset samples per class {}'.format(Counter(y_resampled)))

Resampled dataset samples per class Counter({5: 13928, 4: 11000, 3: 6800, 2: 6200, 1: 6000})
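
As a rough intuition for what oversampling adds, SMOTE-style methods create synthetic rows by interpolating between a minority-class point and one of its nearest neighbours. The sketch below shows only that interpolation step for numeric features, and works out how many synthetic rows each class needs to reach the targets above. It is not the imbalanced-learn implementation. The helper `smote_like_sample` and the example points are hypothetical.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the segment between `point` and `neighbor`."""
    gap = rng.random()  # random position along the segment, in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(42)
minority_point = [0.2, 0.8, 0.1]     # hypothetical feature vector
nearest_neighbor = [0.4, 0.6, 0.3]   # hypothetical neighbour from the same class
synthetic = smote_like_sample(minority_point, nearest_neighbor, rng)
print(synthetic)

# Synthetic rows needed per class: target count minus original count
original = {1: 542, 2: 606, 3: 1430, 4: 3967}
targets = {1: 6000, 2: 6200, 3: 6800, 4: 11000}
synthetic_needed = {label: targets[label] - original[label] for label in targets}
print(synthetic_needed)
```

Rating 5 keeps its original 13928 rows, so the resampled dataset is much closer to balanced but not perfectly uniform.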


### Split train & test data

In [17]:
# The split is identical for every transformation, so a single call covers all cases
x_train, x_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.3, shuffle=True, random_state=7)


### Train model

In [18]:
if train_algorithm == 'rf':
    model = RandomForestClassifier(n_estimators=40, random_state=0)
    model.fit(x_train,y_train)
    
elif train_algorithm == 'lsvc':
    model = LinearSVC()
    model.fit(x_train, y_train)  
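
The notebook does not show a scoring step, so the following is a hedged sketch of how the held-out split could be scored once predictions are available. The functions `accuracy` and `per_class_recall` and the label lists are illustrative assumptions. In the notebook the inputs would be `y_test` and `model.predict(x_test)`.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def per_class_recall(y_true, y_pred):
    """For each true label, the fraction of its rows predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

# Hypothetical labels for illustration only
y_true = [5, 5, 4, 3, 1, 2, 5, 4]
y_pred = [5, 4, 4, 3, 1, 5, 5, 4]

print("accuracy:", accuracy(y_true, y_pred))
print("recall per rating:", per_class_recall(y_true, y_pred))
```

Per-class recall is useful here because overall accuracy can hide weak performance on the rarer ratings. Since the split is taken after oversampling, the test rows include synthetic samples, which is worth remembering when reading any score.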

### Save model

In [19]:
file_path = '/home/jovyan'
file_rel_path = 'model/'
file_abs_path = os.path.join(file_path, file_rel_path)
file_name = 'model.joblib'

# Create the target directory if needed, then serialize the trained model
os.makedirs(file_abs_path, exist_ok=True)
dump(model, os.path.join(file_abs_path, file_name))

['/home/jovyan/model/model.joblib']

### Define inference service name & model storage URI

In [20]:
svc_name = 'text-classify'

!kubectl get pods $HOSTNAME -o yaml -n anonymous > podspec
with open("podspec") as f:
    content = yaml.safe_load(f)
    for elm in content['spec']['volumes']:
        if 'workspace-' in elm['name']:
            pvc = elm['name']
os.remove('podspec')
pvc
    
storageURI = "pvc://" + pvc + '/' + file_rel_path
print(storageURI)

pvc://workspace-test/model/
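
The lookup above depends on the pod having a volume whose name contains `workspace-`. If no such volume exists, `pvc` is never assigned. A minimal sketch of the same lookup as a function with an explicit error is shown below. The function name `find_workspace_pvc` and the trimmed pod spec are hypothetical, and this version returns the first match instead of the last.

```python
def find_workspace_pvc(pod_spec, marker="workspace-"):
    """Return the name of the first volume whose name contains `marker`."""
    for volume in pod_spec["spec"]["volumes"]:
        if marker in volume["name"]:
            return volume["name"]
    raise LookupError(f"No volume name containing {marker!r} found in pod spec")

# Hypothetical, heavily trimmed pod spec for illustration
pod_spec = {"spec": {"volumes": [{"name": "dshm"}, {"name": "workspace-test"}]}}

file_rel_path = "model/"
storage_uri = "pvc://" + find_workspace_pvc(pod_spec) + "/" + file_rel_path
print(storage_uri)
```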


### Define configuration for inference service creation

In [21]:
wsvol_blerssi_kf = f"""apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: {svc_name}
  namespace: anonymous
spec:
  default:
    predictor:
      sklearn:
        storageUri: {storageURI}
"""
    
kfserving = yaml.safe_load(wsvol_blerssi_kf)
with open('blerssi-kfserving.yaml', 'w') as file:
    yaml_kfserving = yaml.dump(kfserving,file)

! cat blerssi-kfserving.yaml

apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: text-classify
  namespace: anonymous
spec:
  default:
    predictor:
      sklearn:
        storageUri: pvc://workspace-test/model/


### Apply the configuration .yaml file

In [24]:
!kubectl apply -f blerssi-kfserving.yaml

inferenceservice.serving.kubeflow.org/text-classify created


### Check whether inferenceservice is created

In [26]:
!kubectl get inferenceservice -n anonymous

NAME            URL                                                                  READY   DEFAULT TRAFFIC   CANARY TRAFFIC   AGE
text-classify   http://text-classify.anonymous.example.com/v1/models/text-classify   True    100                                2m41s


### Note:

Wait until the inference service reports READY as True before sending prediction requests.

### Request predictions from the inference service after setting the ingress IP and port

In [28]:
predictions = []

host_name = svc_name + '.anonymous.example.com'

headers = { 
    'host': host_name
}

for i in range(15):

    formData = {
        'instances': x_test[i:i+1].tolist()
    }
    url = 'http://<<INGRESS IP>>:<<INGRESS PORT>>/v1/models/' + svc_name + ':predict'
    res = requests.post(url, json=formData, headers=headers)
    results = res.json()
    prediction = results['predictions']
    predictions.append(prediction[0])
    
print("Predictions")
print(predictions)

Predictions
[4, 3, 2, 1, 5, 5, 1, 1, 4, 5, 2, 2, 5, 3, 1]
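
The request in the cell above has three parts: the URL on the ingress gateway, a `Host` header that routes to the inference service, and a JSON body with an `instances` list. The sketch below assembles those parts without sending anything. The function name `build_predict_request`, the address `203.0.113.10` (a documentation-only IP), and port `31380` are placeholders and assumptions, not values from the notebook.

```python
def build_predict_request(ingress_ip, ingress_port, svc_name, namespace, domain, rows):
    """Assemble URL, headers and JSON payload for a v1 ':predict' call."""
    url = f"http://{ingress_ip}:{ingress_port}/v1/models/{svc_name}:predict"
    headers = {"Host": f"{svc_name}.{namespace}.{domain}"}
    payload = {"instances": rows}  # one inner list per input row
    return url, headers, payload

# Placeholder ingress address and port; replace with the values of your cluster
url, headers, payload = build_predict_request(
    "203.0.113.10", 31380, "text-classify", "anonymous", "example.com",
    rows=[[0.0, 0.1, 0.2]],  # hypothetical feature row
)
print(url)
print(headers)
print(payload)
```

The three values can then be passed to `requests.post(url, json=payload, headers=headers)` as in the cell above. Because `instances` is a list of rows, several rows can probably be sent in one request instead of one request per row, although the notebook does not do this.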


## Clean up after predicting

### Delete inference service

In [29]:
!kubectl delete -f blerssi-kfserving.yaml

inferenceservice.serving.kubeflow.org "text-classify" deleted


### Delete model folder

In [30]:
!rm -rf $file_abs_path