<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 8.5: Text Classification
INSTRUCTIONS:
- Run the cells
- Observe and understand the results
- Answer the questions

## Import libraries

In [1]:
## Import Libraries
import numpy as np
import pandas as pd

import string
import spacy

from collections import Counter

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# import warnings
# warnings.filterwarnings('ignore')

## Load data

Sample:

    __label__2 Stuning even for the non-gamer: This sound ...
    __label__2 The best soundtrack ever to anything.: I'm ...
    __label__2 Amazing!: This soundtrack is my favorite m ...
    __label__2 Excellent Soundtrack: I truly like this so ...
    __label__2 Remember, Pull Your Jaw Off The Floor Afte ...
    __label__2 an absolute masterpiece: I am quite sure a ...
    __label__1 Buyer beware: This is a self-published boo ...
    . . .
    
There are only two **labels**:
- `__label__1`
- `__label__2`

In [37]:
## Loading the data

trainDF = pd.read_fwf(
    filepath_or_buffer = 'data/corpus.txt',
    colspecs = [(9, 10),   # label: get only the numbers 1 or 2
                (11, 9000) # text: make it wide enough to reach the end of the line
               ], 
    header = 0,
    names = ['label', 'text'],
    lineterminator = '\n'
)

# convert label from [1, 2] to [0, 1]
trainDF['label'] = trainDF['label'] - 1
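
Each line of the corpus follows the pattern `__label__N text`. As an alternative to fixed-width parsing, a line can be split at the first space. This is a minimal sketch, not the lab's method: `parse_line` is a hypothetical helper and the two sample lines are made up for illustration.

```python
def parse_line(line):
    """Split '__label__N text' into (label, text), mapping labels 1/2 to 0/1."""
    tag, text = line.rstrip("\n").split(" ", 1)    # split at the first space only
    label = int(tag.replace("__label__", "")) - 1  # '__label__2' -> 1
    return label, text

# Made-up lines in the same format as the corpus
sample = [
    "__label__2 Great sound: loved every track\n",
    "__label__1 Not for me: stopped after one chapter\n",
]
parsed = [parse_line(line) for line in sample]
print(parsed)
```

Splitting on the first space only keeps any later spaces and colons inside the review text.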

## Inspect the data

In [38]:
# ANSWER
trainDF.sample(5)

| | label | text |
|---|---|---|
| 9427 | 1 | camco wrench: This is a great tool. Due to art... |
| 174 | 1 | California Exotics Waterproof Delights Blue Ba... |
| 2255 | 0 | Why Can't I Rate at Zero Stars ?: Come On, Peo... |
| 7371 | 0 | Expectations: Not being a passionate Science F... |
| 3773 | 1 | Very good heart wrenching story: I read this b... |


In [39]:
trainDF['text'].iloc[6]

'Glorious story: I loved Whisper of the wicked saints. The story was amazing and I was pleasantly surprised at the changes in the book. I am not normaly someone who is into romance novels, but the world was raving about this book and so I bought it. I loved it !! This is a brilliant story because it is so true. This book was so wonderful that I have told all of my friends to read it. It is not a typical romance, it is so much more. Not reading this book is a crime, becuase you are missing out on a heart warming story.'

## Split the data into train and test

In [40]:
## ANSWER
## split the dataset
X = trainDF['text']
y = trainDF['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)
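
If the two labels are not evenly balanced, a stratified split keeps the class proportions the same in both parts. The sketch below uses made-up toy data (`X_toy`, `y_toy` are hypothetical names) and the `stratify` argument of `train_test_split`; it is an optional variation, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 documents, half negative (0) and half positive (1)
X_toy = np.array(["doc %d" % i for i in range(100)])
y_toy = np.array([0] * 50 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=12, stratify=y_toy
)
print(len(X_tr), len(X_te))   # 80 20
print(np.bincount(y_te))      # [10 10]
```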

## Feature Engineering

### Count Vectors as features

In [41]:
# create a count vectorizer object
count_vect = CountVectorizer(token_pattern = r'\w{1,}')

# Learn a vocabulary dictionary of all tokens in the raw documents
count_vect.fit(trainDF['text'])

# Transform documents to document-term matrix.
X_train_count = count_vect.transform(X_train)
X_test_count = count_vect.transform(X_test)
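
The cell above learns the vocabulary from every row of `trainDF['text']`, which includes the rows later used for testing. A common alternative is to fit the vectorizer on the training texts only, so the test set stays unseen. The sketch below shows that pattern on made-up documents; `train_docs`, `test_docs` and `vect` are illustrative names, not part of the lab.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["a good book", "a bad book"]   # made-up training texts
test_docs = ["good good movie"]              # 'movie' never appears in training

vect = CountVectorizer(token_pattern=r'\w{1,}')
train_counts = vect.fit_transform(train_docs)  # learn the vocabulary from training texts only
test_counts = vect.transform(test_docs)        # reuse that vocabulary; unseen words are dropped

print(sorted(vect.vocabulary_))   # ['a', 'bad', 'book', 'good']
print(test_counts.toarray())      # [[0 0 0 2]]
```

Each column of the document-term matrix is one vocabulary word, and each cell is how often that word occurs in the document.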

### TF-IDF Vectors as features
- Word level
- N-Gram level
- Character level
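
To make the weighting concrete, here is a hand-rolled sketch of TF-IDF on two made-up, already tokenised documents. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`; the helper `tfidf` is hypothetical and only meant to illustrate the formula.

```python
import math
from collections import Counter

docs = [["good", "book"], ["bad", "book"]]   # made-up, already tokenised
n_docs = len(docs)

# Document frequency: number of documents containing each term
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    counts = Counter(doc)
    # term count times smoothed idf: ln((1 + n) / (1 + df)) + 1
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
           for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))   # L2 normalisation
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(docs[0])
print(weights)
```

The word `book` appears in both documents, so it receives a lower weight than `good`, which appears in only one.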

In [42]:
%%time
# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer = 'word',
                             token_pattern = r'\w{1,}',
                             max_features = 5000)
print(tfidf_vect)

tfidf_vect.fit(trainDF['text'])
X_train_tfidf = tfidf_vect.transform(X_train)
X_test_tfidf  = tfidf_vect.transform(X_test)

TfidfVectorizer(max_features=5000, token_pattern='\\w{1,}')
CPU times: total: 1.06 s
Wall time: 1.16 s


In [43]:
%%time
# ngram level tf-idf
tfidf_vect_ngram = TfidfVectorizer(analyzer = 'word',
                                   token_pattern = r'\w{1,}',
                                   ngram_range = (2, 3),
                                   max_features = 5000)
print(tfidf_vect_ngram)

tfidf_vect_ngram.fit(trainDF['text'])
X_train_tfidf_ngram = tfidf_vect_ngram.transform(X_train)
X_test_tfidf_ngram  = tfidf_vect_ngram.transform(X_test)

TfidfVectorizer(max_features=5000, ngram_range=(2, 3), token_pattern='\\w{1,}')
CPU times: total: 5.25 s
Wall time: 5.44 s


In [44]:
%%time
# character level tf-idf (token_pattern has no effect when analyzer = 'char')
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer = 'char',
                                         token_pattern = r'\w{1,}',
                                         ngram_range = (2, 3),
                                         max_features = 5000)
print(tfidf_vect_ngram_chars)

tfidf_vect_ngram_chars.fit(trainDF['text'])
X_train_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(X_train)
X_test_tfidf_ngram_chars  = tfidf_vect_ngram_chars.transform(X_test)

TfidfVectorizer(analyzer='char', max_features=5000, ngram_range=(2, 3),
                token_pattern='\\w{1,}')
CPU times: total: 7.86 s
Wall time: 8.03 s
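
With `analyzer = 'char'` and `ngram_range = (2, 3)`, the features are overlapping runs of two and three characters, spaces included. The function below is a simplified illustration of that idea, not scikit-learn's implementation; `char_ngrams` is a hypothetical name.

```python
def char_ngrams(text, n_min=2, n_max=3):
    """Return all character n-grams of length n_min..n_max, shortest first."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams

print(char_ngrams("cat"))        # ['ca', 'at', 'cat']
print(char_ngrams("so good"))
```

Character n-grams can pick up misspellings such as `becuase`, because most of their fragments still overlap with the correctly spelled word.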


### Text / NLP based features

Create some additional features:

- `Char_Count`: number of characters in the text
- `Word_Count`: number of words in the text
- `Word_Density`: average number of characters per word
- `Punct_Count`: number of punctuation characters in the text
- `Title_Word_Count`: number of words in the title (the part before the colon)
- `Upper_Word_Count`: number of words in the text that start with an uppercase letter

In [45]:
# ANSWER
import re
def create_features(text):
    # character count
    Char_Count = len(text)

    # word count
    Word_Count = len(text.strip().split(' '))

    # word density
    text_no_punct = re.sub(r'[/"\.,!\?\']', '', text)
    Word_Density = np.mean([len(word) for word in text_no_punct.strip().split(' ')])

    # punctuation count
    text_only_punct = re.sub(r'[^/"\.,!?\']', '', text)
    Punct_Count = len(text_only_punct)

    # title word count
    title = re.findall(r'^.*:', text)
    Title_Word_Count = len(str(title).strip().split(' '))

    # uppercase word count
    upper = re.findall(r'\b[A-Z]\w*\b', text)
    Upper_Word_Count = len(upper)

    # return all as a tuple
    return (Char_Count, Word_Count, Word_Density, Punct_Count, Title_Word_Count, Upper_Word_Count)



In [46]:
%%time
# Apply to the data and add to features
features_list = []
for i in trainDF.index:
    features = create_features(trainDF['text'].iloc[i])
    features_list.append(features)

# Add to dataframe
cols=['Char_Count', 'Word_Count', 'Word_Density', 'Punct_Count', 'Title_Word_Count', 'Upper_Word_Count']
trainDF[cols] = features_list

CPU times: total: 1.3 s
Wall time: 1.32 s
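
One detail of the title feature is worth knowing: the pattern `^.*:` is greedy, so it runs to the last colon in the review rather than the first, and `str()` of an empty match list (`'[]'`) still counts as one word. The sketch below shows one possible alternative that cuts at the first colon; `title_word_count` is a hypothetical helper and the example strings are made up.

```python
def title_word_count(text):
    """Count words before the first colon; 0 when the text has no colon."""
    title, sep, _ = text.partition(":")
    return len(title.split()) if sep else 0

print(title_word_count("Glorious story: I loved it. Note: buy it"))  # 2
print(title_word_count("no title here"))                             # 0
```

Swapping this in would change the `Title_Word_Count` values shown in this notebook, so treat it as an optional refinement.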


In [47]:
## load spaCy
nlp = spacy.load('en_core_web_sm')

Part-of-speech tags in **spaCy**

    POS   DESCRIPTION               EXAMPLES
    ----- ------------------------- ---------------------------------------------
    ADJ   adjective                 big, old, green, incomprehensible, first
    ADP   adposition                in, to, during
    ADV   adverb                    very, tomorrow, down, where, there
    AUX   auxiliary                 is, has (done), will (do), should (do)
    CONJ  conjunction               and, or, but
    CCONJ coordinating conjunction  and, or, but
    DET   determiner                a, an, the
    INTJ  interjection              psst, ouch, bravo, hello
    NOUN  noun                      girl, cat, tree, air, beauty
    NUM   numeral                   1, 2017, one, seventy-seven, IV, MMXIV
    PART  particle                  's, not,
    PRON  pronoun                   I, you, he, she, myself, themselves, somebody
    PROPN proper noun               Mary, John, London, NATO, HBO
    PUNCT punctuation               ., (, ), ?
    SCONJ subordinating conjunction if, while, that
    SYM   symbol                    $, %, §, ©, +, −, ×, ÷, =, :), 😝
    VERB  verb                      run, runs, running, eat, ate, eating
    X     other                     sfpksdpsxmsa
    SPACE space
    
Find the number of adjectives, adverbs, nouns, numerals, pronouns, proper nouns and verbs in each text.

    Hint:
    1. Convert text to spacy document
    2. Use pos_
    3. Use Counter 
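
The third hint can be sketched as follows: count every tag once with `Counter`, then read off the tags of interest. To keep the example self-contained it works on a hand-written list of tag strings; in the notebook the tags would come from `token.pos_` for each token in `nlp(text)`. The names `WANTED` and `pos_counts` are hypothetical.

```python
from collections import Counter

WANTED = ["ADJ", "ADV", "NOUN", "NUM", "PRON", "PROPN", "VERB"]

def pos_counts(tags):
    """Count the wanted POS tags in an iterable of tag strings."""
    counts = Counter(tags)                        # one pass over all tags
    return tuple(counts[tag] for tag in WANTED)   # missing tags count as 0

# Hand-written tags for illustration only
example_tags = ["PRON", "VERB", "DET", "ADJ", "NOUN", "PUNCT"]
print(pos_counts(example_tags))   # (1, 0, 1, 0, 1, 0, 1)
```

A single `Counter` pass avoids scanning the document once per tag.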

In [48]:
# Initialise columns for the POS feature counts
trainDF['adj_count'] = 0
trainDF['adv_count'] = 0
trainDF['noun_count'] = 0
trainDF['num_count'] = 0
trainDF['pron_count'] = 0
trainDF['propn_count'] = 0
trainDF['verb_count'] = 0

In [49]:
# ANSWER
def POS_Values(text):
    # Tokenize
    doc = nlp(text)

    # Get POS counts
    # adjective
    adj_count = len([word for word in doc if word.pos_ == 'ADJ'])
    # adverb
    adv_count = len([word for word in doc if word.pos_ == 'ADV'])
    # noun
    noun_count = len([word for word in doc if word.pos_ == 'NOUN'])
    # numerals
    num_count = len([word for word in doc if word.pos_ == 'NUM'])
    # pronoun
    pron_count = len([word for word in doc if word.pos_ == 'PRON'])
    # proper noun
    propn_count = len([word for word in doc if word.pos_ == 'PROPN'])
    # verb
    verb_count = len([word for word in doc if word.pos_ == 'VERB'])

    # return all as tuple
    return (adj_count, adv_count, noun_count, num_count, pron_count, propn_count, verb_count)

In [50]:
%%time
# Apply to the data and add to features
spacy_feat_list = []
for text in trainDF['text']:
    features = POS_Values(text)
    spacy_feat_list.append(features)

# Add to dataframe
cols=['adj_count',
    'adv_count', 'noun_count', 'num_count',
    'pron_count', 'propn_count', 'verb_count']
trainDF[cols] = spacy_feat_list

CPU times: total: 2min 22s
Wall time: 2min 35s


In [52]:
cols = [
    'Char_Count', 'Word_Count', 'Word_Density', 
    'Punct_Count', 'Title_Word_Count', 'Upper_Word_Count', 
    'adj_count', 'adv_count', 'noun_count', 'num_count',
    'pron_count', 'propn_count', 'verb_count']

trainDF[cols].sample(5)

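| | Char_Count | Word_Count | Word_Density | Punct_Count | Title_Word_Count | Upper_Word_Count | adj_count | adv_count | noun_count | num_count | pron_count | propn_count | verb_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6441 | 666.0 | 109.0 | 4.972477 | 16.0 | 3.0 | 8.0 | 14 | 8 | 23 | 1 | 12 | 1 | 12 |
| 8390 | 145.0 | 25.0 | 4.4 | 11.0 | 1.0 | 7.0 | 4 | 1 | 6 | 0 | 0 | 1 | 1 |
| 7003 | 999.0 | 184.0 | 4.342391 | 17.0 | 1.0 | 23.0 | 15 | 12 | 28 | 6 | 21 | 6 | 25 |
| 1946 | 496.0 | 89.0 | 4.494382 | 8.0 | 16.0 | 11.0 | 9 | 3 | 18 | 0 | 8 | 5 | 9 |
| 7188 | 600.0 | 111.0 | 4.225225 | 21.0 | 5.0 | 12.0 | 8 | 4 | 23 | 3 | 10 | 5 | 12 |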


### Topic Models as features

In [53]:
%%time
# train a LDA Model
lda_model = LatentDirichletAllocation(n_components = 20, learning_method = 'online', max_iter = 20)

X_topics = lda_model.fit_transform(X_train_count)
topic_word = lda_model.components_ 
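# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
vocab = count_vect.get_feature_names_out()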

CPU times: total: 1min 2s
Wall time: 1min 6s
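
`fit_transform` returns one row per document and one column per topic, so `X_topics` can itself serve as a feature matrix. The toy sketch below, on made-up documents, shows the shape of that output and that each row is a distribution over topics. All names (`toy_docs`, `toy_vect`, `toy_lda`, `doc_topics`) are illustrative.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [                       # made-up documents
    "great album great songs",
    "love this album and the songs",
    "battery died camera broke",
    "camera battery will not charge",
]
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = toy_lda.fit_transform(toy_counts)   # shape: (n_documents, n_topics)

print(doc_topics.shape)            # (4, 2)
print(doc_topics.sum(axis=1))      # each row sums to 1
```

To use topic features with a classifier, the test documents would need the same treatment via `transform` on their count vectors.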




In [54]:
# view the topic models
n_top_words = 10
topic_summaries = []
print('Group Top Words')
print('-----', '-'*80)
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    top_words = ' '.join(topic_words)
    topic_summaries.append(top_words)
    print('  %3d %s' % (i, top_words))

Group Top Words
----- --------------------------------------------------------------------------------
    0 the book of and a is in read to this
    1 game card games oh graphics play box memory software computer
    2 mars sign 1985 enlightenment relation accepted regain ipad nichols cheer
    3 1984 orwell 3d george david winston brother effective hockey version
    4 ugly rabbit coyote bar perform york disney beads string slot
    5 la de en y con el que catherine del es
    6 jawbone metal eargels hook unusual inspiring textbook tears hd packaged
    7 ear fit rubber jabra economic headset squeem commercial minor wasnt
    8 cents flea fleas crossed hendrix organization scam edition noni bela
    9 of cd is music the album s songs and song
   10 hands phone vampire hostel cars awsome bug occasional calling isnt
   11 camera battery product works use the charger charge computer fire
   12 broke boot hot heat following heater socks describe brando mountain
   13 batteries 24 charged

## Modelling

In [55]:
## helper function

def train_model(classifier, feature_vector_train, label, feature_vector_valid):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)

    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)

    # score against the global y_test (accuracy_score expects y_true first)
    return accuracy_score(y_test, predictions)
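
The helper above reads `y_test` from the notebook's global scope. A variant that takes the validation labels as an argument is easier to reuse and test. The sketch below is one way to write it, checked on tiny made-up count features; `train_and_score` and the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

def train_and_score(classifier, X_fit, y_fit, X_valid, y_valid):
    """Fit on (X_fit, y_fit) and return accuracy on (X_valid, y_valid)."""
    classifier.fit(X_fit, y_fit)
    predictions = classifier.predict(X_valid)
    return accuracy_score(y_valid, predictions)   # (y_true, y_pred)

# Made-up counts: column 0 stands for "positive" words, column 1 for "negative"
X_fit = np.array([[2, 0], [3, 0], [0, 2], [0, 3]])
y_fit = np.array([1, 1, 0, 0])
X_valid = np.array([[1, 0], [0, 1]])
y_valid = np.array([1, 0])

score = train_and_score(MultinomialNB(), X_fit, y_fit, X_valid, y_valid)
print(score)   # 1.0
```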

In [56]:
# Keep the results in a dataframe
results = pd.DataFrame(columns = ['Count Vectors',
                                  'WordLevel TF-IDF',
                                  'N-Gram Vectors',
                                  'CharLevel Vectors'])

### Naive Bayes Classifier

In [57]:
%%time
# Naive Bayes on Count Vectors
accuracy1 = train_model(MultinomialNB(), X_train_count, y_train, X_test_count)
print('NB, Count Vectors    : %.4f\n' % accuracy1)

NB, Count Vectors    : 0.8465

CPU times: total: 0 ns
Wall time: 7.98 ms


In [58]:
%%time
# Naive Bayes on Word Level TF IDF Vectors
accuracy2 = train_model(MultinomialNB(), X_train_tfidf, y_train, X_test_tfidf)
print('NB, WordLevel TF-IDF : %.4f\n' % accuracy2)

NB, WordLevel TF-IDF : 0.8550

CPU times: total: 31.2 ms
Wall time: 6.98 ms


In [59]:
%%time
# Naive Bayes on Ngram Level TF IDF Vectors
accuracy3 = train_model(MultinomialNB(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('NB, N-Gram Vectors   : %.4f\n' % accuracy3)

NB, N-Gram Vectors   : 0.8370

CPU times: total: 0 ns
Wall time: 4.99 ms


In [60]:
%%time
# Naive Bayes on Character Level TF IDF Vectors
accuracy4 = train_model(MultinomialNB(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('NB, CharLevel Vectors: %.4f\n' % accuracy4)

NB, CharLevel Vectors: 0.8060

CPU times: total: 15.6 ms
Wall time: 25.9 ms


In [61]:
results.loc['Naïve Bayes'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Linear Classifier

In [62]:
%%time
# Linear Classifier on Count Vectors
accuracy1 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 350), X_train_count, y_train, X_test_count)
print('LR, Count Vectors    : %.4f\n' % accuracy1)

LR, Count Vectors    : 0.8730

CPU times: total: 1.66 s
Wall time: 1.91 s


In [63]:
%%time
# Linear Classifier on Word Level TF IDF Vectors
accuracy2 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf, y_train, X_test_tfidf)
print('LR, WordLevel TF-IDF : %.4f\n' % accuracy2)

LR, WordLevel TF-IDF : 0.8780

CPU times: total: 78.1 ms
Wall time: 77.8 ms


In [64]:
%%time
# Linear Classifier on Ngram Level TF IDF Vectors
accuracy3 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('LR, N-Gram Vectors   : %.4f\n' % accuracy3)

LR, N-Gram Vectors   : 0.8435

CPU times: total: 46.9 ms
Wall time: 46.9 ms


In [65]:
%%time
# Linear Classifier on Character Level TF IDF Vectors
accuracy4 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('LR, CharLevel Vectors: %.4f\n' % accuracy4)

LR, CharLevel Vectors: 0.8445

CPU times: total: 281 ms
Wall time: 300 ms


In [66]:
results.loc['Logistic Regression'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Support Vector Machine

In [67]:
%%time
# Support Vector Machine on Count Vectors
accuracy1 = train_model(LinearSVC(), X_train_count, y_train, X_test_count)
print('SVM, Count Vectors    : %.4f\n' % accuracy1)

SVM, Count Vectors    : 0.8465

CPU times: total: 547 ms
Wall time: 563 ms


In [68]:
%%time
# Support Vector Machine on Word Level TF IDF Vectors
accuracy2 = train_model(LinearSVC(), X_train_tfidf, y_train, X_test_tfidf)
print('SVM, WordLevel TF-IDF : %.4f\n' % accuracy2)

SVM, WordLevel TF-IDF : 0.8710

CPU times: total: 93.8 ms
Wall time: 68.8 ms


In [69]:
%%time
# Support Vector Machine on Ngram Level TF IDF Vectors
accuracy3 = train_model(LinearSVC(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('SVM, N-Gram Vectors   : %.4f\n' % accuracy3)

SVM, N-Gram Vectors   : 0.8330

CPU times: total: 46.9 ms
Wall time: 53.9 ms


In [70]:
%%time
# Support Vector Machine on Character Level TF IDF Vectors
accuracy4 = train_model(LinearSVC(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('SVM, CharLevel Vectors: %.4f\n' % accuracy4)

SVM, CharLevel Vectors: 0.8545

CPU times: total: 359 ms
Wall time: 376 ms


In [71]:
results.loc['Support Vector Machine'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Bagging Models

In [72]:
%%time
# Bagging (Random Forest) on Count Vectors
accuracy1 = train_model(RandomForestClassifier(n_estimators = 100), X_train_count, y_train, X_test_count)
print('RF, Count Vectors    : %.4f\n' % accuracy1)

RF, Count Vectors    : 0.8300

CPU times: total: 12.4 s
Wall time: 12.7 s


In [73]:
%%time
# Bagging (Random Forest) on Word Level TF IDF Vectors
accuracy2 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf, y_train, X_test_tfidf)
print('RF, WordLevel TF-IDF : %.4f\n' % accuracy2)

RF, WordLevel TF-IDF : 0.8385

CPU times: total: 5.56 s
Wall time: 5.91 s


In [74]:
%%time
# Bagging (Random Forest) on Ngram Level TF IDF Vectors
accuracy3 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('RF, N-Gram Vectors   : %.4f\n' % accuracy3)

RF, N-Gram Vectors   : 0.7945

CPU times: total: 5.11 s
Wall time: 5.56 s


In [75]:
%%time
# Bagging (Random Forest) on Character Level TF IDF Vectors
accuracy4 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('RF, CharLevel Vectors: %.4f\n' % accuracy4)

RF, CharLevel Vectors: 0.7960

CPU times: total: 19.1 s
Wall time: 21.6 s


In [76]:
results.loc['Random Forest'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Boosting Models

In [77]:
%%time
# Gradient Boosting on Count Vectors
accuracy1 = train_model(GradientBoostingClassifier(), X_train_count, y_train, X_test_count)
print('GB, Count Vectors    : %.4f\n' % accuracy1)

GB, Count Vectors    : 0.7970

CPU times: total: 23.9 s
Wall time: 27.1 s


In [78]:
%%time
# Gradient Boosting on Word Level TF IDF Vectors
accuracy2 = train_model(GradientBoostingClassifier(), X_train_tfidf, y_train, X_test_tfidf)
print('GB, WordLevel TF-IDF : %.4f\n' % accuracy2)

GB, WordLevel TF-IDF : 0.8075

CPU times: total: 9.77 s
Wall time: 10.6 s


In [79]:
%%time
# Gradient Boosting on Ngram Level TF IDF Vectors
accuracy3 = train_model(GradientBoostingClassifier(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('GB, N-Gram Vectors   : %.4f\n' % accuracy3)

GB, N-Gram Vectors   : 0.7455

CPU times: total: 5.73 s
Wall time: 5.94 s


In [80]:
%%time
# Gradient Boosting on Character Level TF IDF Vectors
accuracy4 = train_model(GradientBoostingClassifier(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('GB, CharLevel Vectors: %.4f\n' % accuracy4)

GB, CharLevel Vectors: 0.8140

CPU times: total: 1min 30s
Wall time: 1min 32s


In [81]:
results.loc['Gradient Boosting'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

In [82]:
results

| | Count Vectors | WordLevel TF-IDF | N-Gram Vectors | CharLevel Vectors |
|---|---|---|---|---|
| Naïve Bayes | 0.8465 | 0.855 | 0.837 | 0.806 |
| Logistic Regression | 0.873 | 0.878 | 0.8435 | 0.8445 |
| Support Vector Machine | 0.8465 | 0.871 | 0.833 | 0.8545 |
| Random Forest | 0.83 | 0.8385 | 0.7945 | 0.796 |
| Gradient Boosting | 0.797 | 0.8075 | 0.7455 | 0.814 |
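
To read the comparison programmatically, `idxmax` can pick the best feature set for each model and the best model overall. The sketch below rebuilds the table from the accuracies printed above; `results_copy` is a hypothetical name used so the notebook's `results` stays untouched.

```python
import pandas as pd

results_copy = pd.DataFrame(
    {
        "Count Vectors":     [0.8465, 0.8730, 0.8465, 0.8300, 0.7970],
        "WordLevel TF-IDF":  [0.8550, 0.8780, 0.8710, 0.8385, 0.8075],
        "N-Gram Vectors":    [0.8370, 0.8435, 0.8330, 0.7945, 0.7455],
        "CharLevel Vectors": [0.8060, 0.8445, 0.8545, 0.7960, 0.8140],
    },
    index=["Naïve Bayes", "Logistic Regression", "Support Vector Machine",
           "Random Forest", "Gradient Boosting"],
)

best_features = results_copy.idxmax(axis=1)       # best feature set per model
best_model = results_copy.max(axis=1).idxmax()    # model with the highest accuracy
print(best_features)
print(best_model, best_features[best_model])      # Logistic Regression WordLevel TF-IDF
```

On this run, word-level TF-IDF is the best feature set for every model except Gradient Boosting, which scores highest on character-level vectors.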




---

© 2022 Institute of Data
