# 1 Text Classification
## 1.1 Initialization

In [1]:
# Import some useful packages to do the task
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn import decomposition, ensemble
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

import pandas as pd
import numpy as np
import xgboost, textblob, string
from keras.preprocessing import text, sequence
from keras import layers, models, optimizers

Using TensorFlow backend.


## 1.2 Acquire data

In [2]:
# Read the text data file as the training data set
TrainDF = pd.read_csv('textdata.csv', encoding ='UTF-8')

# After reading the file have a look at it
TrainDF.head(10)

Unnamed: 0,Class,ReviewTitle,ReviewText
0,2,Great CD,My lovely Pat has one of the GREAT voices of h...
1,2,One of the best game music soundtracks - for a...,Despite the fact that I have only played a sma...
2,1,Batteries died within a year ...,I bought this charger in Jul 2003 and it worke...
3,2,"works fine, but Maha Energy is better",Check out Maha Energy's website. Their Powerex...
4,2,Great for the non-audiophile,Reviewed quite a bit of the combo players and ...
5,1,DVD Player crapped out after one year,I also began having the incorrect disc problem...
6,1,Incorrect Disc,"I love the style of this, but after a couple y..."
7,1,DVD menu select problems,I cannot scroll through a DVD menu that is set...
8,2,Unique Weird Orientalia from the 1930's,"Exotic tales of the Orient from the 1930's. ""D..."
9,1,"Not an ""ultimate guide""","Firstly,I enjoyed the format and tone of the b..."


As you can see, the review title and the review text both contain useful information. To simplify the work, we can combine them into a new attribute, "Review".

In [3]:
# Combine the "ReviewTitle" with "ReviewText"
TrainDF["Review"] = TrainDF["ReviewTitle"] + " : " + TrainDF["ReviewText"]

# Then drop the original two columns
TrainDF = TrainDF.drop(["ReviewTitle", "ReviewText"], axis=1)

# Check the result
TrainDF.head(10)

Unnamed: 0,Class,Review
0,2,Great CD : My lovely Pat has one of the GREAT ...
1,2,One of the best game music soundtracks - for a...
2,1,Batteries died within a year ... : I bought th...
3,2,"works fine, but Maha Energy is better : Check ..."
4,2,Great for the non-audiophile : Reviewed quite ...
5,1,DVD Player crapped out after one year : I also...
6,1,"Incorrect Disc : I love the style of this, but..."
7,1,DVD menu select problems : I cannot scroll thr...
8,2,Unique Weird Orientalia from the 1930's : Exot...
9,1,"Not an ""ultimate guide"" : Firstly,I enjoyed th..."


Then we need to split the current data set into two parts: a training set and a testing set.

In [4]:
# Split the data set with model_selection.train_test_split
trainX, testX, trainY, testY = model_selection.train_test_split(TrainDF["Review"], TrainDF["Class"])

# Encode the "Class" column as the target variable
# (fit the encoder on the training labels only, then reuse it for the test labels)
encoder = preprocessing.LabelEncoder()
trainY = encoder.fit_transform(trainY)
testY = encoder.transform(testY)
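
The split-and-encode step can be sketched on a tiny sample as follows. The review strings and the `test_size`, `random_state` and `stratify` settings are assumptions made for illustration; they are not values used in this notebook. The sketch shows the encoder learning the label mapping from the training labels and then reusing that same mapping for the test labels.

```python
from sklearn import model_selection, preprocessing

# Hypothetical mini data set: 8 reviews with classes 1 and 2, as in the data above
reviews = ["Great CD", "Batteries died", "Works fine", "Incorrect disc",
           "Great sound", "Menu problems", "Unique tales", "Not a guide"]
classes = [2, 1, 2, 1, 2, 1, 2, 1]

# stratify keeps the class ratio the same in both parts;
# random_state makes the split reproducible
trainX, testX, trainY, testY = model_selection.train_test_split(
    reviews, classes, test_size=0.25, random_state=0, stratify=classes)

# Learn the mapping {1 -> 0, 2 -> 1} from the training labels only
encoder = preprocessing.LabelEncoder()
trainY_enc = encoder.fit_transform(trainY)

# Reuse the learned mapping for the test labels
testY_enc = encoder.transform(testY)

print(list(encoder.classes_))
print(list(testY_enc))
```

Calling `transform` rather than `fit_transform` on the test labels guarantees that both parts share one mapping, even if a class happens to be missing from one of them.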

## 1.3 Text Preprocessing (Demo)
### 1.3.1 Noise Removal
In this part, we remove some noise to make the task easier. There are mainly two kinds of noise:
1. Noisy words: words that are not relevant to the context of the data.
2. Special patterns: characters other than letters and whitespace, such as punctuation and digits.

#### Function to remove the noisy word

In [5]:
# Define a function to remove the noisy words

# Prepare a list of noisy words
noiseList = ["this", "so", "is", "very", "really", "he", "she", "i", "it", "the", "are", "them"]

# Function definition
def remove_noisy_word(text):
    
    words = text.split()
    
    # Remove the noisy words
    usefulwords = [word for word in words if word.lower() not in noiseList]

    # Combine the useful words together in order to make a new clean text
    cleantext = " ".join(usefulwords)
    
    # Return the clean text
    return cleantext

#### Function to remove the special pattern

In [6]:
# Define a function to remove all the special patterns

# Function definition
def remove_pattern(text):
    
    # Remove the special characters:
    cleantext = "".join([char for char in text if char in string.ascii_letters or char in string.whitespace])
    
    # Return the clean text
    return cleantext

#### A Small Demo

In [7]:
# A sentence that we obtained from the social network

sentence = "#Data Mining : This course is really, really, really interesting. The teachers are very, very, very nice! I love them so so so so so so much!"

remove_noisy_word(remove_pattern(sentence))

'Data Mining course interesting teachers nice love much'

### 1.3.2 Lexicon Normalization
Another type of textual noise comes from the multiple representations that a single word can take (for example, "multiply" and "multiplying").

There are mainly two ways to solve this problem:
1. Stemming: a rudimentary rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.
2. Lemmatization: an organized, step-by-step procedure for obtaining the root form of a word. It makes use of vocabulary (dictionary importance of words) and morphological analysis (word structure and grammar relations).

#### A Small Demo Using NLTK

In [8]:
# Create a lemmatizer and a Porter stemmer
lem = WordNetLemmatizer()

stem = PorterStemmer()

word = "multiplying"

print(lem.lemmatize(word,"v"))
print(stem.stem(word))

multiply
multipli


### 1.3.3 Object Standardization

Text data often contains words or phrases which are not present in any standard lexical dictionaries. These pieces are not recognized by search engines and models. So, we also need to deal with them:

#### Function to do the Standardization

In [9]:
# Define a function to do the standardization for all the acronyms etc.

# Prepare a dictionary of all the acronyms
lookupDict = {'xswl':'It is really funny', "awsm" : "awesome", "luv" :"love"}

# Function Definition
def lookupWords(text):
    
    words = text.split() 
    
    # A new list to store the changed words
    newwords = [] 
    
    # Use a loop to check every word
    for word in words:
        
        # Change the word if it is an acronym
        if word.lower() in lookupDict:
            word = lookupDict[word.lower()]
            
        # Append the (possibly replaced) word to the list
        newwords.append(word) 
        
    # Combine all the words together
    stdtext = " ".join(newwords) 
        
    # Return the standardization text 
    return stdtext

#### A Small Demo

In [10]:
# A sentence that we obtained from the social network

sentence = "xswl! The moive is awsm. I luv it!"

print(lookupWords(remove_pattern(sentence)))
print(remove_noisy_word(lookupWords(remove_pattern(sentence))))

It is really funny The moive is awesome I love it
funny moive awesome love


### 1.3.4 Other Techniques
Some other techniques for text preprocessing:
1. Spelling Correction
2. Grammar Check
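
As a rough illustration of spelling correction, the sketch below replaces each unknown word with its closest match from a small vocabulary. It uses the standard library's `difflib` as a stand-in technique, not a dedicated spelling corrector. The vocabulary, the helper name `correct_spelling` and the `cutoff` value are all assumptions made for this example.

```python
import difflib

# Hypothetical vocabulary of known, correctly spelled words
vocabulary = ["movie", "music", "awesome", "love", "funny"]

def correct_spelling(text, vocab=vocabulary, cutoff=0.7):
    corrected = []
    for word in text.split():
        # Keep words that are already in the vocabulary
        if word.lower() in vocab:
            corrected.append(word)
            continue
        # Otherwise take the most similar vocabulary word, if it is similar enough
        matches = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)

print(correct_spelling("funny moive awesome love"))
# prints: funny movie awesome love
```

The sketch expects text that has already been stripped of punctuation, for example by `remove_pattern` above. Words with no sufficiently close match are left unchanged.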

## 1.4 Feature Engineering

### 1.4.1 Use Count Vector as Feature Vector

A count vector is a matrix representation of the data set: each row is a text, each column is a word, and the entry a(ij) is the number of times word j occurs in text i.

In [11]:
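# Create a count vectorizer object
# (note: it is fitted on the full data set here, so words from the test split are part of the vocabulary)
countVect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
countVect.fit(TrainDF['Review'])

# Use the count vectorizer object to transform the text data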
xtraincount = countVect.transform(trainX)
xtestcount = countVect.transform(testX)
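
To make the a(ij) definition concrete, here is a minimal sketch on two made-up texts (the texts and the variable names are assumptions for illustration). Rows are texts and columns are words in alphabetical order: bad, good, movie.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical texts
docs = ["good movie good", "bad movie"]

vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
counts = vect.fit_transform(docs)

# vocabulary_ maps every word to its column index
print(vect.vocabulary_)

# Dense view of the sparse count matrix:
# row 0 -> "good" appears twice and "movie" once
# row 1 -> "bad" and "movie" appear once each
print(counts.toarray())
```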

### 1.4.2 Use TF-IDF Vector as Feature Vector
We can use statistical features to represent our data set. Term Frequency–Inverse Document Frequency (TF–IDF) is a good measure for this: it weights a term by how often it occurs in a text and discounts terms that occur in many texts.

There are mainly two kinds of TF-IDF vectors:
1. Word-Level TF-IDF
2. N-gram TF-IDF

#### Word-Level TF-IDF

In [12]:
# Get the Word-Level TF-IDF vectors
tfidfVect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=1000)
tfidfVect.fit(TrainDF['Review'])

xtrainTFIDF = tfidfVect.transform(trainX)
xtestTFIDF = tfidfVect.transform(testX)
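
The weights produced by the vectorizer can be reproduced by hand on a toy example. The sketch below assumes scikit-learn's default settings (smoothed IDF, then L2 normalization of each row). The two texts are made up, and `max_features` is left out because the toy vocabulary is tiny.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Two hypothetical texts
docs = ["good movie", "bad movie"]

vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}')
tfidf = vect.fit_transform(docs).toarray()

# Manual computation with the default smoothed IDF:
#   idf(t) = ln((1 + n) / (1 + df(t))) + 1
tf = np.array([[0, 1, 1],    # columns: bad, good, movie
               [1, 0, 1]], dtype=float)
n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)                     # number of texts containing each word
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the term frequencies, then scale every row to unit length
manual = tf * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

print(np.round(manual, 3))
print(np.allclose(tfidf, manual))
```

The word "movie" occurs in both texts, so it receives a lower weight than "good" or "bad", which each occur in only one.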

#### N-gram TF-IDF

In [13]:
# Get the n-gram TF-IDF vectors (n = 2,3)
tfidfVectngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=1000)
tfidfVectngram.fit(TrainDF['Review'])

xtrainTFIDFngram = tfidfVectngram.transform(trainX)
xtestTFIDFngram = tfidfVectngram.transform(testX)
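
To see what `ngram_range=(2,3)` actually extracts, the following sketch fits the same kind of vectorizer on a single made-up sentence and lists the resulting features. The sentence is an assumption chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One hypothetical text
docs = ["not good at all"]

vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2, 3))
vect.fit(docs)

# Every feature is a sequence of 2 or 3 consecutive words
ngrams = sorted(vect.vocabulary_)
print(ngrams)
# prints: ['at all', 'good at', 'good at all', 'not good', 'not good at']
```

Single words are not included, so a phrase such as "not good" becomes one feature instead of two unrelated words.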

### 1.4.3 Use Text Vector as Feature Vector (Word Embedding)
### 1.4.4 Use Topic Modeling as Feature Vector
Topic modeling is a process of automatically identifying the topics present in a text corpus. It derives the hidden patterns among the words in the corpus in an unsupervised manner. Topics are defined as “a repeating pattern of co-occurring terms in a corpus”. A good topic model yields, for example, “health”, “doctor”, “patient”, “hospital” for a Healthcare topic and “farm”, “crops”, “wheat” for a Farming topic.
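
One possible way to obtain such features is Latent Dirichlet Allocation from scikit-learn's `decomposition` module. The sketch below is a minimal example under stated assumptions: the four texts, the number of topics and the variable names are made up for illustration. On a corpus this small the learned topics are not expected to be meaningful.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical mini corpus
docs = ["doctor patient hospital health",
        "health doctor hospital",
        "farm crops wheat",
        "wheat farm crops farm"]

# LDA works on word counts
topicVect = CountVectorizer()
counts = topicVect.fit_transform(docs)

# Assume 2 topics; random_state makes the run reproducible
lda = LatentDirichletAllocation(n_components=2, random_state=0)
docTopics = lda.fit_transform(counts)

# docTopics[i] is the topic mixture of text i and can serve as its feature vector
print(docTopics.shape)

# Show the 3 highest-weighted words of every topic
terms = sorted(topicVect.vocabulary_, key=topicVect.vocabulary_.get)
for topicWeights in lda.components_:
    print([terms[j] for j in topicWeights.argsort()[::-1][:3]])
```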
### 1.4.5 Use Other Vectors as Feature Vector
Depending on the situation, other hand-crafted features can also serve as feature vectors: for example, the total number of words, the number of words of each part of speech, or the average word length. Whether they improve the model varies from case to case.
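
A minimal sketch of such hand-crafted features is shown below. The helper name `text_stats` and the two sample reviews are assumptions for illustration. Part-of-speech counts are left out because they need a tagger.

```python
import pandas as pd

def text_stats(text):
    # Simple surface statistics of one text
    words = text.split()
    return {
        "char_count": len(text),
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
    }

# Two hypothetical reviews
reviews = pd.Series(["Great CD", "Batteries died within a year"])

# One row of features per review
features = pd.DataFrame([text_stats(r) for r in reviews])
print(features)
```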
## 1.5 Modeling
### 1.5.1 Preparation

In [14]:
# Define a function that trains a model and evaluates it on the test set

def trainModel(classifier, feature_vector_train, label, feature_vector_test, is_neural_net=False):
    
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)

    # predict the labels on the test dataset
    predictions = classifier.predict(feature_vector_test)

    # A neural network returns class probabilities, so take the most probable class
    if is_neural_net:
        predictions = predictions.argmax(axis=-1)
    
    # Return the accuracy score (testY is the global test label array defined above)
    return metrics.accuracy_score(testY, predictions)

### 1.5.2 Naive Bayes
A Naive Bayes classifier assumes that, given the class, the presence of a particular feature is unrelated to the presence of any other feature.

In [15]:
# Naive Bayes with count vectors as feature vector
accuracy = trainModel(naive_bayes.MultinomialNB(), xtraincount, trainY, xtestcount)
print ("Naive Bayes, Count Vectors: {0}".format(accuracy))

# Naive Bayes with WordLevel TF-IDF vectors as feature vector
accuracy = trainModel(naive_bayes.MultinomialNB(), xtrainTFIDF, trainY, xtestTFIDF)
print ("Naive Bayes, WordLevel TF-IDF: {0}".format(accuracy))

# Naive Bayes with N-Gram TF-IDF vectors as feature vector
accuracy = trainModel(naive_bayes.MultinomialNB(), xtrainTFIDFngram, trainY, xtestTFIDFngram)
print ("Naive Bayes, N-Gram Vectors: {0}".format(accuracy))

Naive Bayes, Count Vectors: 0.812
Naive Bayes, WordLevel TF-IDF: 0.796
Naive Bayes, N-Gram Vectors: 0.742
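
The independence assumption means that the score of a class is the log prior plus a sum of per-word log likelihoods. The sketch below recomputes these by hand and compares them with `MultinomialNB`. The count matrix, the labels and the new text are made up, and Laplace smoothing with `alpha=1.0` is assumed.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: 4 texts, 3 words
X = np.array([[2, 1, 0],
              [1, 2, 0],
              [0, 1, 2],
              [0, 0, 3]])
y = np.array([0, 0, 1, 1])

clf = MultinomialNB(alpha=1.0).fit(X, y)

# Manual estimate of the same quantities
alpha = 1.0
classes = np.unique(y)
log_prior = np.log(np.array([(y == c).mean() for c in classes]))
term_counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
log_likelihood = np.log(term_counts / term_counts.sum(axis=1, keepdims=True))

# Score of each class for a new text: log P(c) + sum_j x_j * log P(word_j | c)
x_new = np.array([1, 0, 2])
scores = log_prior + log_likelihood @ x_new
prediction = classes[scores.argmax()]

print(prediction, clf.predict([x_new])[0])
```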


#### Possible Reason for the Poor Performance of N-gram TF-IDF

In [16]:
# Raise max_features from 1000 to 12500 to keep more n-grams
tfidfVectngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=12500)
tfidfVectngram.fit(TrainDF['Review'])

xtrainTFIDFngram = tfidfVectngram.transform(trainX)
xtestTFIDFngram = tfidfVectngram.transform(testX)

# Naive Bayes with count vectors as feature vector
accuracy = trainModel(naive_bayes.MultinomialNB(), xtraincount, trainY, xtestcount)
print ("Naive Bayes, Count Vectors: {0}".format(accuracy))

# Naive Bayes with WordLevel TF-IDF vectors as feature vector
accuracy = trainModel(naive_bayes.MultinomialNB(), xtrainTFIDF, trainY, xtestTFIDF)
print ("Naive Bayes, WordLevel TF-IDF: {0}".format(accuracy))

# Naive Bayes with N-Gram TF-IDF vectors as feature vector
accuracy = trainModel(naive_bayes.MultinomialNB(), xtrainTFIDFngram, trainY, xtestTFIDFngram)
print ("Naive Bayes, N-Gram Vectors: {0}".format(accuracy))

Naive Bayes, Count Vectors: 0.812
Naive Bayes, WordLevel TF-IDF: 0.796
Naive Bayes, N-Gram Vectors: 0.796


### 1.5.3 Logistic Regression
Logistic Regression is a linear classifier that uses the logistic (sigmoid) function to estimate class probabilities in order to do the classification task.

In [17]:
# Linear Classifier on Count Vectors
accuracy = trainModel(linear_model.LogisticRegression(), xtraincount, trainY, xtestcount)
print ("Logistic Regression, Count Vectors:  {0}".format(accuracy))

# Linear Classifier on WordLevel TF-IDF vectors as feature vector
accuracy = trainModel(linear_model.LogisticRegression(), xtrainTFIDF, trainY, xtestTFIDF)
print ("Logistic Regression, WordLevel TF-IDF:  {0}".format(accuracy))

# Linear Classifier on N-Gram TF-IDF vectors as feature vector
accuracy = trainModel(linear_model.LogisticRegression(), xtrainTFIDFngram, trainY, xtestTFIDFngram)
print ("Logistic Regression, N-Gram Vectors:  {0}".format(accuracy))

Logistic Regression, Count Vectors:  0.814
Logistic Regression, WordLevel TF-IDF:  0.81
Logistic Regression, N-Gram Vectors:  0.784
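
The prediction step of logistic regression can be sketched in a few lines. The weights, the bias and the two feature vectors are made-up values for illustration; in practice they are learned by `LogisticRegression.fit`.

```python
import numpy as np

def sigmoid(z):
    # Map any real number to a value between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned parameters for 2 features
w = np.array([1.5, -2.0])
b = 0.25

# Two hypothetical feature vectors
x = np.array([[1.0, 0.0],
              [0.0, 1.0]])

# Linear score -> probability of class 1 -> predicted class
prob = sigmoid(x @ w + b)
labels = (prob >= 0.5).astype(int)

print(np.round(prob, 3))
print(labels)
```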


### 1.5.4 SVM

In [18]:
# SVM with Count Vectors
accuracy = trainModel(svm.SVC(), xtraincount, trainY, xtestcount)
print ("SVM , Count Vectors:  {0}".format(accuracy))

# SVM with WordLevel TF-IDF vectors as feature vector
accuracy = trainModel(svm.SVC(), xtrainTFIDF, trainY, xtestTFIDF)
print ("SVM , WordLevel TF-IDF:  {0}".format(accuracy))

# SVM with N-Gram TF-IDF vectors as feature vector
accuracy = trainModel(svm.SVC(), xtrainTFIDFngram, trainY, xtestTFIDFngram)
print ("SVM , N-Gram Vectors:  {0}".format(accuracy))

SVM , Count Vectors:  0.52
SVM , WordLevel TF-IDF:  0.52
SVM , N-Gram Vectors:  0.52


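All three runs score 0.52, which suggests that the classifier assigns nearly every review to the same class. Note that svm.SVC() uses an RBF kernel by default, not a linear one, so this result alone does not show that the data is not linearly separable; logistic regression, which is a linear model, reaches about 0.81 on the same count vectors. With its default settings, SVM is therefore not a good model for this text classification task, although a linear kernel or tuned parameters may behave differently.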
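
One way to probe the linear-separability question directly is to train an SVM with a linear decision boundary. The sketch below uses scikit-learn's `LinearSVC` on made-up toy reviews. The texts, the labels and the variable names are assumptions, and the sketch says nothing about how `LinearSVC` would score on the real data set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy reviews with made-up labels (1 and 0)
docs = ["good great", "great nice", "bad awful", "awful poor"]
labels = [1, 1, 0, 0]

vect = CountVectorizer()
X = vect.fit_transform(docs)

# LinearSVC fits a linear decision boundary (no RBF kernel involved)
linearSVM = LinearSVC()
linearSVM.fit(X, labels)

# Predictions on the training texts and on two unseen word combinations
trainPred = linearSVM.predict(X)
newPred = linearSVM.predict(vect.transform(["good nice", "bad poor"]))

print(list(trainPred))
print(list(newPred))
```

On the real data, an instance of `LinearSVC` can be passed to `trainModel` in the same way as the other classifiers.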
### 1.5.5 Random Forest (Bagging Model)
Random forest is a bagging ensemble of decision tree (CART) models. Each tree is trained on a different bootstrap sample of the data and considers a different random subset of the variables when splitting, and the trees vote on the final prediction.

In [19]:
# Random Forest with Count Vectors
accuracy = trainModel(ensemble.RandomForestClassifier(), xtraincount, trainY, xtestcount)
print ("Random Forest , Count Vectors:  {0}".format(accuracy))

# Random Forest with WordLevel TF-IDF vectors as feature vector
accuracy = trainModel(ensemble.RandomForestClassifier(), xtrainTFIDF, trainY, xtestTFIDF)
print ("Random Forest , WordLevel TF-IDF:  {0}".format(accuracy))

# Random Forest with N-Gram TF-IDF vectors as feature vector
accuracy = trainModel(ensemble.RandomForestClassifier(), xtrainTFIDFngram, trainY, xtestTFIDFngram)
print ("Random Forest , N-Gram Vectors:  {0}".format(accuracy))

Random Forest , Count Vectors:  0.732
Random Forest , WordLevel TF-IDF:  0.734
Random Forest , N-Gram Vectors:  0.66
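
The bagging idea can be illustrated with a hand-rolled ensemble of decision trees. This is a simplified sketch, not how `RandomForestClassifier` is implemented internally. The synthetic data, the number of trees and the variable names are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data: 60 samples, 4 features, 2 classes
rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

trees = []
for seed in range(15):
    # Bootstrap sample: draw len(X) row indices with replacement
    idx = rng.randint(0, len(X), size=len(X))
    # max_features="sqrt" makes every split look at a random subset of the features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Majority vote over all trees
votes = np.array([t.predict(X) for t in trees])      # shape: (n_trees, n_samples)
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)

print("Training accuracy:", (forest_pred == y).mean())
```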




### 1.5.6 Xgboost (Boosting Model)
Extreme Gradient Boosting (xgboost) is an implementation of the gradient boosting framework that is designed for efficiency. It offers both a linear model solver and tree learning algorithms, and much of its speed comes from its capacity to do parallel computation on a single machine.

Its authors report that it runs more than ten times faster than earlier popular gradient boosting implementations on a single machine. It supports various objective functions, including regression, classification and ranking.


In [20]:
# Xgboost with Count Vectors
accuracy = trainModel(xgboost.XGBClassifier(), xtraincount.tocsc(), trainY, xtestcount.tocsc())
print ("Xgboost , Count Vectors:  {0}".format(accuracy))

# Xgboost with WordLevel TF-IDF vectors as feature vector
accuracy = trainModel(xgboost.XGBClassifier(), xtrainTFIDF.tocsc(), trainY, xtestTFIDF.tocsc())
print ("Xgboost , WordLevel TF-IDF:  {0}".format(accuracy))

# Xgboost with N-Gram TF-IDF vectors as feature vector
accuracy = trainModel(xgboost.XGBClassifier(), xtrainTFIDFngram.tocsc(), trainY, xtestTFIDFngram.tocsc())
print ("Xgboost , N-Gram Vectors:  {0}".format(accuracy))

Xgboost , Count Vectors:  0.75
Xgboost , WordLevel TF-IDF:  0.754
Xgboost , N-Gram Vectors:  0.678


### 1.5.7 KNN

In [21]:
# Define k
k = 9

# KNN with Count Vectors
accuracy = trainModel(KNeighborsClassifier(k), xtraincount, trainY, xtestcount)
print ("KNN , Count Vectors:  {0}".format(accuracy))

# KNN with WordLevel TF-IDF vectors as feature vector
accuracy = trainModel(KNeighborsClassifier(k), xtrainTFIDF, trainY, xtestTFIDF)
print ("KNN , WordLevel TF-IDF:  {0}".format(accuracy))

# KNN with N-Gram TF-IDF vectors as feature vector
accuracy = trainModel(KNeighborsClassifier(k), xtrainTFIDFngram, trainY, xtestTFIDFngram)
print ("KNN , N-Gram Vectors:  {0}".format(accuracy))

KNN , Count Vectors:  0.63
KNN , WordLevel TF-IDF:  0.702
KNN , N-Gram Vectors:  0.744
