<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 8.5: Text Classification
INSTRUCTIONS:
- Run the cells
- Observe and understand the results
- Answer the questions

## Import libraries

In [1]:
## Import Libraries
import numpy as np
import pandas as pd

import string
import spacy

from collections import Counter

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# import warnings
# warnings.filterwarnings('ignore')

## Load data

Sample:

    __label__2 Stuning even for the non-gamer: This sound ...
    __label__2 The best soundtrack ever to anything.: I'm ...
    __label__2 Amazing!: This soundtrack is my favorite m ...
    __label__2 Excellent Soundtrack: I truly like this so ...
    __label__2 Remember, Pull Your Jaw Off The Floor Afte ...
    __label__2 an absolute masterpiece: I am quite sure a ...
    __label__1 Buyer beware: This is a self-published boo ...
    . . .
    
There are only two **labels**:
- `__label__1`
- `__label__2`

In [2]:
## Loading the data

# Using Pandas read_fwf to read a fixed-width file
trainDF = pd.read_fwf(
    filepath_or_buffer='/Users/stephanienduaguba/Documents/DATA/corpus.txt',  # Path to the dataset
    colspecs=[(9, 10),   # Label: half-open interval [9, 10), i.e. the single character at index 9 (the digit 1 or 2)
              (11, 9000)   # Text: characters from index 11 up to (not including) 9000, wide enough to reach the end of the line
              ],
    header=0,  # Treat the first line as a header row and replace it with `names`; the file has no real header, so the first review is dropped (header=None would keep it)
    names=['label', 'text'],  # Column names for the DataFrame
    lineterminator='\n'  # Line terminator to identify the end of each line
)

# Convert label from [1, 2] to [0, 1]
trainDF['label'] = trainDF['label'] - 1
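
The fixed-width slices above rely on every label prefix having exactly the same width. As a hedged alternative, the sketch below splits each line at the first space instead. The helper `parse_line` and the two in-memory lines are made up for illustration and are not part of the lab or the dataset.

```python
import io

PREFIX = "__label__"

def parse_line(line):
    """Split '__label__N some text' into (N - 1, 'some text')."""
    prefix, _, text = line.rstrip("\n").partition(" ")
    # The label digit follows the literal '__label__' prefix
    label = int(prefix[len(PREFIX):]) - 1  # map {1, 2} to {0, 1}
    return label, text

# Two made-up lines in the same shape as the sample (not real reviews)
sample = io.StringIO(
    "__label__2 Great sound: loved every track\n"
    "__label__1 Poor print: pages fell out\n"
)

rows = [parse_line(line) for line in sample]
print(rows)
```

A list of `(label, text)` tuples like `rows` can be passed to `pd.DataFrame(rows, columns=['label', 'text'])` to get the same two columns as above.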

## Inspect the data

In [3]:
# ANSWER
trainDF

|      | label | text |
|------|-------|------|
| 0    | 1     | The best soundtrack ever to anything.: I'm rea... |
| 1    | 1     | Amazing!: This soundtrack is my favorite music... |
| 2    | 1     | Excellent Soundtrack: I truly like this soundt... |
| 3    | 1     | Remember, Pull Your Jaw Off The Floor After He... |
| 4    | 1     | an absolute masterpiece: I am quite sure any o... |
| ...  | ...   | ... |
| 9994 | 1     | A revelation of life in small town America in ... |
| 9995 | 1     | Great biography of a very interesting journali... |
| 9996 | 0     | Interesting Subject; Poor Presentation: You'd ... |
| 9997 | 0     | Don't buy: The box looked used and it is obvio... |


## Split the data into train and test

In [4]:
## ANSWER
## split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    trainDF['text'],
    trainDF['label'],
    test_size = 0.2,
    random_state = 42
)

## Feature Engineering

### Count Vectors as features

In [5]:
# Create a CountVectorizer object
# token_pattern=r'\w{1,}': Use a regular expression to consider words with one or more characters as tokens
count_vect = CountVectorizer(token_pattern=r'\w{1,}')

# Learn the vocabulary dictionary of all tokens in the raw documents
# NOTE: this fits on the full 'text' column of trainDF, so the vocabulary also contains words that occur only in the test split;
# fitting on X_train alone would avoid that leakage
count_vect.fit(trainDF['text'])

# Transform the training and testing documents into document-term matrices
# The resulting matrices, X_train_count and X_test_count, represent the frequency of each word in the documents
X_train_count = count_vect.transform(X_train)
X_test_count = count_vect.transform(X_test)
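
To make the document-term matrix concrete, here is a minimal from-scratch sketch using only the standard library. It is not scikit-learn's implementation; the documents are made up, and `tokenize` and `to_counts` are hypothetical helper names. It assumes lowercasing before tokenising and an alphabetically sorted vocabulary, and it builds the vocabulary from the training documents only.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\w{1,}")  # same token pattern as the cell above

def tokenize(text):
    # Lowercase first, then pull out runs of word characters
    return TOKEN.findall(text.lower())

train_docs = ["Great sound, great price", "Poor sound"]  # made-up examples
test_docs = ["Great value"]

# Vocabulary is learned from the training documents only
vocab = sorted({tok for doc in train_docs for tok in tokenize(doc)})

def to_counts(doc):
    counts = Counter(tokenize(doc))
    # Words outside the vocabulary (here 'value') are simply ignored
    return [counts[word] for word in vocab]

train_matrix = [to_counts(d) for d in train_docs]
test_matrix = [to_counts(d) for d in test_docs]

print(vocab)
print(train_matrix)
print(test_matrix)
```

Each row is one document and each column one vocabulary word, which is the same layout as `X_train_count` and `X_test_count`, only dense instead of sparse.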

### TF-IDF Vectors as features
- Word level
- N-Gram level
- Character level

In [6]:
%%time  
# This is a Jupyter notebook cell magic command to measure the execution time of the cell.

# Configure the TF-IDF vectorizer
# - 'analyzer': Specify whether the feature should be made of word or character n-grams ('word' in this case).
# - 'token_pattern': Regular expression denoting what constitutes a token (in this case, at least one alphanumeric character).
# - 'max_features': Maximum number of features to be extracted (5000 in this case).
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)

# Print the TF-IDF vectorizer configuration
print(tfidf_vect)

# Fit the TF-IDF vectorizer, then transform both the training and testing data
# - 'fit(trainDF['text'])': Learn the vocabulary and idf from the full corpus (train and test rows together; fitting on X_train alone would avoid leakage).
# - 'transform(X_train)': Transform the training data into a document-term matrix.
# - 'transform(X_test)': Transform the testing data into a document-term matrix using the same vocabulary.
tfidf_vect.fit(trainDF['text'])
X_train_tfidf = tfidf_vect.transform(X_train)
X_test_tfidf = tfidf_vect.transform(X_test)

TfidfVectorizer(max_features=5000, token_pattern='\\w{1,}')
CPU times: user 670 ms, sys: 5.8 ms, total: 676 ms
Wall time: 678 ms
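
As a worked illustration of what a TF-IDF weight is, the sketch below computes one by hand. It assumes the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalisation of each row, which is how scikit-learn documents its default settings. The pre-tokenised documents are made up, and `tfidf_row` is a hypothetical helper name.

```python
import math
from collections import Counter

# Made-up, pre-tokenised documents
docs = [["good", "sound"], ["bad", "sound"], ["good", "good", "price"]]
n_docs = len(docs)
vocab = sorted({t for d in docs for t in d})

# Document frequency: in how many documents each term appears
df = Counter(t for d in docs for t in set(d))

# Smoothed inverse document frequency: rarer terms get larger weights
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    tf = Counter(doc)
    raw = [tf[t] * idf[t] for t in vocab]
    # L2-normalise so that every document vector has unit length
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm else raw

matrix = [tfidf_row(d) for d in docs]
for row in matrix:
    print([round(v, 3) for v in row])
```

Because `price` appears in one document while `good` and `sound` appear in two, `price` receives the larger idf, which is the behaviour that separates TF-IDF from raw counts.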


In [7]:
%%time
# ngram level tf-idf
tfidf_vect_ngram = TfidfVectorizer(analyzer = 'word',
                                   token_pattern = r'\w{1,}',
                                   ngram_range = (2, 3),
                                   max_features = 5000)
print(tfidf_vect_ngram)

tfidf_vect_ngram.fit(trainDF['text'])
X_train_tfidf_ngram = tfidf_vect_ngram.transform(X_train)
X_test_tfidf_ngram  = tfidf_vect_ngram.transform(X_test)

TfidfVectorizer(max_features=5000, ngram_range=(2, 3), token_pattern='\\w{1,}')
CPU times: user 2.99 s, sys: 93.4 ms, total: 3.08 s
Wall time: 3.08 s
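
The `ngram_range = (2, 3)` setting means that features are sequences of two or three consecutive tokens rather than single words. A minimal sketch of that extraction follows; `word_ngrams` is a hypothetical helper, the input phrase is made up, and the code is an illustration rather than scikit-learn's own routine.

```python
import re

def word_ngrams(text, ngram_range=(2, 3)):
    tokens = re.findall(r"\w{1,}", text.lower())
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        # Slide a window of n tokens across the token list
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

grams = word_ngrams("Not worth the money")
print(grams)
```

The bigram `not worth` carries a polarity that the single token `worth` does not, which is the usual motivation for n-gram features in sentiment tasks.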


In [8]:
%%time
# characters level tf-idf
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer = 'char',
                                         #token_pattern = r'\w{1,}',
                                         ngram_range = (2, 3),
                                         max_features = 5000)
print(tfidf_vect_ngram_chars)

tfidf_vect_ngram_chars.fit(trainDF['text'])
X_train_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(X_train)
X_test_tfidf_ngram_chars  = tfidf_vect_ngram_chars.transform(X_test)

TfidfVectorizer(analyzer='char', max_features=5000, ngram_range=(2, 3))
CPU times: user 5.2 s, sys: 72.5 ms, total: 5.27 s
Wall time: 5.27 s
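
Character n-grams slide the window over characters instead of tokens, which makes the features tolerant of misspellings such as the "Stuning" in the sample data. The sketch below is a simplified illustration with a hypothetical helper `char_ngrams`; it is not scikit-learn's own routine.

```python
import re

def char_ngrams(text, ngram_range=(2, 3)):
    # Lowercase and collapse whitespace runs to a single space
    text = re.sub(r"\s+", " ", text.lower())
    lo, hi = ngram_range
    return [text[i:i + n]
            for n in range(lo, hi + 1)
            for i in range(len(text) - n + 1)]

misspelt = set(char_ngrams("Stuning"))
correct = set(char_ngrams("Stunning"))
shared = misspelt & correct

print(sorted(shared))
print(len(shared), "of", len(misspelt), "n-grams of the misspelling also occur in the correct word")
```

At word level the two spellings are unrelated tokens, whereas at character level they share most of their features.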


### Text / NLP based features

Create some other features.

Char Count = Number of Characters in Text

Word Count = Number of Words in Text

Word Density = Average Number of Characters per Word

Punctuation Count = Number of Punctuation Characters in Text

Title Word Count = Number of Title-Cased Words in Text

Uppercase Word Count = Number of Uppercase Words in Text

In [9]:
%%time
# ANSWER

# Calculate the character count for each text in the 'text' column and create a new column 'char_count'
trainDF['char_count'] = trainDF['text'].apply(len)

# Calculate the word count for each text in the 'text' column and create a new column 'word_count'
trainDF['word_count'] = trainDF['text'].apply(lambda x: len(x.split()))

# Calculate the word density for each text in the 'text' column and create a new column 'word_density'
# Word density is the average number of characters per word; the +1 in the denominator guards against division by zero
trainDF['word_density'] = trainDF['char_count'] / (trainDF['word_count'] + 1)

# Calculate the punctuation count for each text in the 'text' column and create a new column 'punctuation_count'
# Punctuation count is the number of punctuation characters in the text
trainDF['punctuation_count'] = trainDF['text'].apply(lambda x: len(''.join(_ for _ in x if _ in string.punctuation)))

# Calculate the title word count for each text in the 'text' column and create a new column 'title_word_count'
# Title word count is the number of title-cased words (first letter uppercase, remaining letters lowercase)
trainDF['title_word_count'] = trainDF['text'].apply(lambda x: len([w for w in x.split() if w.istitle()]))

# Calculate the uppercase word count for each text in the 'text' column and create a new column 'uppercase_word_count'
# Uppercase word count is the number of words that are entirely in uppercase
trainDF['uppercase_word_count'] = trainDF['text'].apply(lambda x: len([w for w in x.split() if w.isupper()]))

CPU times: user 345 ms, sys: 2.09 ms, total: 348 ms
Wall time: 347 ms
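
The two capitalisation features depend on how `str.istitle()` and `str.isupper()` treat edge cases, so it can help to check them on a few tokens. The word list below is made up for illustration.

```python
# Made-up tokens chosen to show the edge cases
words = ["Great", "GREAT", "I", "Don't", "iPod", "great!!"]

for w in words:
    print(f"{w!r:10} istitle={w.istitle()!s:5} isupper={w.isupper()}")

title_count = sum(w.istitle() for w in words)
upper_count = sum(w.isupper() for w in words)
print(title_count, upper_count)
```

Two things stand out: a single capital letter such as `I` counts as both title-cased and uppercase, and a word with an apostrophe such as `Don't` is not title-cased, because the lowercase `t` follows a non-letter.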


In [10]:
# Check sample
trainDF.sample(5)

|      | label | text | char_count | word_count | word_density | punctuation_count | title_word_count | uppercase_word_count |
|------|-------|------|------------|------------|--------------|-------------------|------------------|----------------------|
| 3941 | 1 | Fun Jigsaw!: I was troubled over what to get m... | 308 | 61 | 4.967742 | 8 | 10 | 3 |
| 5296 | 1 | great for the price: Bought this for my husban... | 722 | 147 | 4.878378 | 15 | 11 | 2 |
| 6015 | 0 | Long winded.: This video was dull and did not ... | 763 | 142 | 5.335664 | 16 | 10 | 5 |
| 2866 | 1 | Great !!: So far the tablet has been really fu... | 509 | 96 | 5.247423 | 16 | 14 | 5 |
| 936  | 0 | Clean it up!: The Story is appropriate and som... | 623 | 116 | 5.324786 | 24 | 16 | 2 |


In [11]:
## load spaCy
nlp = spacy.load('en_core_web_sm')

Part-of-speech tags in **spaCy**

    POS   DESCRIPTION               EXAMPLES
    ----- ------------------------- ---------------------------------------------
    ADJ   adjective                 big, old, green, incomprehensible, first
    ADP   adposition                in, to, during
    ADV   adverb                    very, tomorrow, down, where, there
    AUX   auxiliary                 is, has (done), will (do), should (do)
    CONJ  conjunction               and, or, but
    CCONJ coordinating conjunction  and, or, but
    DET   determiner                a, an, the
    INTJ  interjection              psst, ouch, bravo, hello
    NOUN  noun                      girl, cat, tree, air, beauty
    NUM   numeral                   1, 2017, one, seventy-seven, IV, MMXIV
    PART  particle                  's, not,
    PRON  pronoun                   I, you, he, she, myself, themselves, somebody
    PROPN proper noun               Mary, John, London, NATO, HBO
    PUNCT punctuation               ., (, ), ?
    SCONJ subordinating conjunction if, while, that
    SYM   symbol                    $, %, §, ©, +, −, ×, ÷, =, :), 😝
    VERB  verb                      run, runs, running, eat, ate, eating
    X     other                     sfpksdpsxmsa
    SPACE space
    
Find the number of adjectives, adverbs, nouns, numerals, pronouns, proper nouns and verbs in each text.

    Hint:
    1. Convert text to spacy document
    2. Use pos_
    3. Use Counter

In [12]:
# Initialise some columns for feature's counts
trainDF['adj_count'] = 0
trainDF['adv_count'] = 0
trainDF['noun_count'] = 0
trainDF['num_count'] = 0
trainDF['pron_count'] = 0
trainDF['propn_count'] = 0
trainDF['verb_count'] = 0

In [13]:
# ANSWER
# for each text
for i in range(trainDF.shape[0]):
    # convert into spaCy document
    doc = nlp(trainDF.iloc[i]['text'])
    # count the tokens by part-of-speech tag (missing tags count as 0)
    c = Counter([t.pos_ for t in doc])

    trainDF.at[i, 'adj_count'] = c['ADJ']
    trainDF.at[i, 'adv_count'] = c['ADV']
    trainDF.at[i, 'noun_count'] = c['NOUN']
    trainDF.at[i, 'num_count'] = c['NUM']
    trainDF.at[i, 'pron_count'] = c['PRON']
    trainDF.at[i, 'propn_count'] = c['PROPN']
    trainDF.at[i, 'verb_count'] = c['VERB']
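
The loop above works without any `if` checks because a `Counter` returns 0 for a tag it has never seen. The sketch below shows that step in isolation. The `(token, tag)` pairs are hand-written stand-ins, not real spaCy output, and `wanted` is a hypothetical name for the tag list.

```python
from collections import Counter

# Hand-written (token, tag) pairs standing in for the tokens of a spaCy document
tagged = [("I", "PRON"), ("really", "ADV"), ("love", "VERB"),
          ("this", "DET"), ("soundtrack", "NOUN"), (".", "PUNCT")]

wanted = ["ADJ", "ADV", "NOUN", "NUM", "PRON", "PROPN", "VERB"]

counts = Counter(tag for _, tag in tagged)

# Tags that never occur (ADJ, NUM, PROPN here) come back as 0, not as a KeyError
features = {f"{tag.lower()}_count": counts[tag] for tag in wanted}
print(features)
```

The dictionary keys match the column names initialised in the previous cell, so one such dictionary corresponds to one row of the feature columns.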

In [14]:
cols = [
    'char_count', 'word_count', 'word_density',
    'punctuation_count', 'title_word_count',
    'uppercase_word_count', 'adj_count',
    'adv_count', 'noun_count', 'num_count',
    'pron_count', 'propn_count', 'verb_count']

trainDF[cols].sample(5)

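|      | char_count | word_count | word_density | punctuation_count | title_word_count | uppercase_word_count | adj_count | adv_count | noun_count | num_count | pron_count | propn_count | verb_count |
|------|------------|------------|--------------|-------------------|------------------|----------------------|-----------|-----------|------------|-----------|------------|-------------|------------|
| 9448 | 209 | 37 | 5.5 | 5 | 4 | 1 | 3 | 5 | 8 | 0 | 4 | 1 | 3 |
| 9690 | 779 | 124 | 6.232 | 33 | 19 | 1 | 14 | 7 | 19 | 1 | 8 | 21 | 9 |
| 6438 | 704 | 125 | 5.587302 | 21 | 13 | 3 | 15 | 11 | 26 | 1 | 11 | 6 | 10 |
| 3859 | 446 | 70 | 6.28169 | 30 | 11 | 1 | 13 | 4 | 9 | 0 | 7 | 6 | 10 |
| 2146 | 347 | 68 | 5.028986 | 10 | 5 | 2 | 2 | 5 | 11 | 0 | 13 | 0 | 12 |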


### Topic Models as features

In [15]:
%%time
# Create an instance of LatentDirichletAllocation with specified parameters
# n_components: Number of topics to be identified
# learning_method: 'online' indicates online variational Bayes method
# max_iter: Maximum number of iterations
lda_model = LatentDirichletAllocation(n_components=20, learning_method='online', max_iter=20)

# Fit the LDA model to the transformed training data
# This step identifies topics in the corpus and their distributions
X_topics = lda_model.fit_transform(X_train_count) # X_train_count is the transformed training data

# Get the topic-word weights from the trained LDA model (unnormalised; each row becomes a word distribution once divided by its sum)
topic_word = lda_model.components_

# Get the vocabulary (feature names) from the CountVectorizer
vocab = count_vect.get_feature_names_out()

CPU times: user 28.5 s, sys: 82 ms, total: 28.6 s
Wall time: 28.7 s


In [16]:
# View the topic models
n_top_words = 10  # Number of top words to display for each topic
topic_summaries = []  # List to store summaries for each topic

# Print header for the output
print('Group Top Words')
print('-----', '-'*80)

# Iterate through each topic and its word distribution
for i, topic_dist in enumerate(topic_word):
    # Get the indices of the words in decreasing order of their importance in the topic
    topic_words_indices = np.argsort(topic_dist)[::-1][:n_top_words]
    
    # Use the vocabulary to map indices to actual words
    topic_words = np.array(vocab)[topic_words_indices]
    
    # Join the top words into a string for better display
    top_words = ' '.join(topic_words)
    
    # Append the top words to the list of topic summaries
    topic_summaries.append(top_words)
    
    # Print the topic number and its top words
    print('  %3d %s' % (i, top_words))

Group Top Words
----- --------------------------------------------------------------------------------
    0 sleeping office van catholic heater damme quiet foreign rhythm flyboys
    1 exam mistakes grammar everest chemistry papers explaining mike anderson textbooks
    2 ear u fit jawbone fits stargate noise episodes hated rod
    3 printer la de hp bible titanic scanner yoga wanting cat
    4 remembered un et funk les il le pour est manon
    5 cute techniques handy con west pepper eating occasionally advise gluten
    6 creepy analysis violence hopkins cinema cap dummy richard starring fats
    7 tape charger apple jazz beats g4 original powerbook per ship
    8 vocals mad rock acts sounding newspaper folk gillian punk screenplay
    9 hollywood diane lane gay drivel pushed tuscany tuscan feeding irritating
   10 ray fiction science 451 bradbury blu fahrenheit manson his fi
   11 musiq cross hip jimmy bowie projects index arthur mars hop
   12 i the it to and a my for this not
    . . .
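
The ranking step in the cell above can be illustrated on a single topic with plain Python. The vocabulary and weights below are made up, and the sketch assumes that one row of `components_` holds non-negative, unnormalised weights that turn into a probability distribution over words when divided by their sum.

```python
# Made-up vocabulary and made-up weights for one topic
vocab = ["printer", "scanner", "yoga", "the"]
topic_row = [6.0, 3.0, 0.5, 0.5]

# Normalise the weights so that they sum to 1
total = sum(topic_row)
probs = [w / total for w in topic_row]

# Indices of the words, from largest to smallest weight
order = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)

n_top_words = 2
top = [(vocab[j], probs[j]) for j in order[:n_top_words]]
print(top)
```

Normalising does not change the ranking, so the top words are the same either way; it only makes the weights comparable across topics.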

## Modelling

In [17]:
## Helper function

# Define a function to train a classifier and evaluate its performance
def train_model(classifier, feature_vector_train, label, feature_vector_valid):
    """
    Train a classifier on the training dataset and evaluate its performance on the validation dataset.

    Parameters:
    - classifier: The machine learning classifier to be trained.
    - feature_vector_train: The feature vectors of the training dataset.
    - label: The labels of the training dataset.
    - feature_vector_valid: The feature vectors of the validation dataset.

    Returns:
    - The accuracy score of the classifier on the validation dataset.
    """

    # Fit (train) the classifier on the training dataset
    classifier.fit(feature_vector_train, label)

    # Predict the labels on the validation dataset
    predictions = classifier.predict(feature_vector_valid)

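    # Evaluate the accuracy of the predictions on the validation dataset
    # NOTE: y_test is read from the notebook's global scope, not passed in as a parameter;
    # accuracy_score expects the arguments in the order (y_true, y_pred)
    accuracy = accuracy_score(y_test, predictions)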

    # Return the accuracy score
    return accuracy
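
One possible variation, sketched here under stated assumptions, is to pass the validation labels in explicitly so that the helper does not depend on a global variable. The names `train_and_score`, `accuracy` and `MajorityClassifier` are hypothetical; the toy classifier only mimics the `fit`/`predict` interface so that the example runs without scikit-learn.

```python
def accuracy(y_true, y_pred):
    # Fraction of positions where the prediction matches the true label
    y_true, y_pred = list(y_true), list(y_pred)
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def train_and_score(classifier, X_train, y_train, X_valid, y_valid):
    """Fit on the training split, then score against explicitly passed labels."""
    classifier.fit(X_train, y_train)
    return accuracy(y_valid, classifier.predict(X_valid))

class MajorityClassifier:
    """Toy stand-in: always predicts the most common training label."""

    def fit(self, X, y):
        y = list(y)
        self.majority_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]

# Made-up data: three training rows, two validation rows
score = train_and_score(MajorityClassifier(),
                        [[0], [1], [2]], [1, 1, 0],
                        [[3], [4]], [1, 0])
print(score)
```

A majority-class baseline like this is also a useful reference point when reading the accuracy figures in the sections below.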

In [18]:
# Keep the results in a dataframe
results = pd.DataFrame(columns = ['Count Vectors',
                                  'WordLevel TF-IDF',
                                  'N-Gram Vectors',
                                  'CharLevel Vectors'])

### Naive Bayes Classifier

In [19]:
%%time
# Naive Bayes on Count Vectors
accuracy1 = train_model(MultinomialNB(), X_train_count, y_train, X_test_count)
print('NB, Count Vectors    : %.4f\n' % accuracy1)

NB, Count Vectors    : 0.8540

CPU times: user 5.57 ms, sys: 2.18 ms, total: 7.75 ms
Wall time: 8.56 ms


In [20]:
%%time
# Naive Bayes on Word Level TF IDF Vectors
accuracy2 = train_model(MultinomialNB(), X_train_tfidf, y_train, X_test_tfidf)
print('NB, WordLevel TF-IDF : %.4f\n' % accuracy2)

NB, WordLevel TF-IDF : 0.8600

CPU times: user 7.08 ms, sys: 3.8 ms, total: 10.9 ms
Wall time: 9.81 ms


In [21]:
%%time
# Naive Bayes on Ngram Level TF IDF Vectors
accuracy3 = train_model(MultinomialNB(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('NB, N-Gram Vectors   : %.4f\n' % accuracy3)

NB, N-Gram Vectors   : 0.8400

CPU times: user 3.85 ms, sys: 1.6 ms, total: 5.45 ms
Wall time: 4.63 ms


In [22]:
%%time
# Naive Bayes on Character Level TF IDF Vectors
accuracy4 = train_model(MultinomialNB(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('NB, CharLevel Vectors: %.4f\n' % accuracy4)

NB, CharLevel Vectors: 0.8180

CPU times: user 19.5 ms, sys: 14.7 ms, total: 34.2 ms
Wall time: 34.9 ms


In [23]:
results.loc['Naïve Bayes'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}
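
For intuition about what a multinomial Naive Bayes classifier computes on count features, here is a from-scratch sketch with Laplace (add-one) smoothing. It is a teaching illustration, not scikit-learn's implementation; `fit_nb`, `predict_nb` and the tiny pre-tokenised corpus are all made up for this example.

```python
import math
from collections import Counter, defaultdict

def fit_nb(docs, labels, alpha=1.0):
    vocab = sorted({t for d in docs for t in d})
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)
    model = {}
    for y in class_docs:
        # Smoothed denominator: every vocabulary word gets alpha extra counts
        total = sum(word_counts[y].values()) + alpha * len(vocab)
        model[y] = {
            "log_prior": math.log(class_docs[y] / len(docs)),
            "log_lik": {t: math.log((word_counts[y][t] + alpha) / total)
                        for t in vocab},
        }
    return model

def predict_nb(model, doc):
    def score(y):
        params = model[y]
        # Sum of log-probabilities; words never seen in training are skipped
        return params["log_prior"] + sum(
            params["log_lik"][t] for t in doc if t in params["log_lik"])
    return max(model, key=score)

# Made-up training data: 1 = positive review, 0 = negative review
train = [["great", "sound"], ["great", "value"],
         ["poor", "sound"], ["poor", "poor", "quality"]]
y = [1, 1, 0, 0]

model = fit_nb(train, y)
pred_a = predict_nb(model, ["great", "quality"])
pred_b = predict_nb(model, ["poor", "value"])
print(pred_a, pred_b)
```

Smoothing is what lets the model score `["great", "quality"]` at all: without it, `quality` would have zero probability under the positive class and the product would collapse to zero.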

### Linear Classifier

In [24]:
%%time
# Linear Classifier on Count Vectors
accuracy1 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 350), X_train_count, y_train, X_test_count)
print('LR, Count Vectors    : %.4f\n' % accuracy1)

LR, Count Vectors    : 0.8520

CPU times: user 5.75 s, sys: 1.63 s, total: 7.38 s
Wall time: 1.21 s


In [25]:
%%time
# Linear Classifier on Word Level TF IDF Vectors
accuracy2 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf, y_train, X_test_tfidf)
print('LR, WordLevel TF-IDF : %.4f\n' % accuracy2)

LR, WordLevel TF-IDF : 0.8730

CPU times: user 202 ms, sys: 49.1 ms, total: 251 ms
Wall time: 37.9 ms


In [26]:
%%time
# Linear Classifier on Ngram Level TF IDF Vectors
accuracy3 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('LR, N-Gram Vectors   : %.4f\n' % accuracy3)

LR, N-Gram Vectors   : 0.8360

CPU times: user 141 ms, sys: 39.5 ms, total: 181 ms
Wall time: 26.9 ms


In [27]:
%%time
# Linear Classifier on Character Level TF IDF Vectors
accuracy4 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('LR, CharLevel Vectors: %.4f\n' % accuracy4)

LR, CharLevel Vectors: 0.8485

CPU times: user 808 ms, sys: 159 ms, total: 967 ms
Wall time: 183 ms


In [28]:
results.loc['Logistic Regression'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Support Vector Machine

In [29]:
%%time
# Support Vector Machine on Count Vectors
accuracy1 = train_model(LinearSVC(), X_train_count, y_train, X_test_count)
print('SVM, Count Vectors    : %.4f\n' % accuracy1)

SVM, Count Vectors    : 0.8345

CPU times: user 279 ms, sys: 4.05 ms, total: 283 ms
Wall time: 293 ms


In [30]:
%%time
# Support Vector Machine on Word Level TF IDF Vectors
accuracy2 = train_model(LinearSVC(), X_train_tfidf, y_train, X_test_tfidf)
print('SVM, WordLevel TF-IDF : %.4f\n' % accuracy2)

SVM, WordLevel TF-IDF : 0.8610

CPU times: user 32.8 ms, sys: 1.54 ms, total: 34.3 ms
Wall time: 34.1 ms


In [31]:
%%time
# Support Vector Machine on Ngram Level TF IDF Vectors
accuracy3 = train_model(LinearSVC(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('SVM, N-Gram Vectors   : %.4f\n' % accuracy3)

SVM, N-Gram Vectors   : 0.8210

CPU times: user 24.8 ms, sys: 939 µs, total: 25.7 ms
Wall time: 25.1 ms


In [32]:
%%time
# Support Vector Machine on Character Level TF IDF Vectors
accuracy4 = train_model(LinearSVC(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('SVM, CharLevel Vectors: %.4f\n' % accuracy4)

SVM, CharLevel Vectors: 0.8570

CPU times: user 255 ms, sys: 14.6 ms, total: 269 ms
Wall time: 282 ms


In [33]:
results.loc['Support Vector Machine'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Bagging Models

In [34]:
%%time
# Bagging (Random Forest) on Count Vectors
accuracy1 = train_model(RandomForestClassifier(n_estimators = 100), X_train_count, y_train, X_test_count)
print('RF, Count Vectors    : %.4f\n' % accuracy1)

RF, Count Vectors    : 0.8370

CPU times: user 7.55 s, sys: 28.7 ms, total: 7.57 s
Wall time: 7.61 s


In [35]:
%%time
# Bagging (Random Forest) on Word Level TF IDF Vectors
accuracy2 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf, y_train, X_test_tfidf)
print('RF, WordLevel TF-IDF : %.4f\n' % accuracy2)

RF, WordLevel TF-IDF : 0.8270

CPU times: user 3.78 s, sys: 14.9 ms, total: 3.8 s
Wall time: 3.8 s


In [36]:
%%time
# Bagging (Random Forest) on Ngram Level TF IDF Vectors
accuracy3 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('RF, N-Gram Vectors   : %.4f\n' % accuracy3)

RF, N-Gram Vectors   : 0.7875

CPU times: user 3.82 s, sys: 9.34 ms, total: 3.83 s
Wall time: 3.83 s


In [37]:
%%time
# Bagging (Random Forest) on Character Level TF IDF Vectors
accuracy4 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('RF, CharLevel Vectors: %.4f\n' % accuracy4)

RF, CharLevel Vectors: 0.7840

CPU times: user 13.5 s, sys: 55 ms, total: 13.5 s
Wall time: 13.5 s


In [38]:
results.loc['Random Forest'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Boosting Models

In [39]:
%%time
# Gradient Boosting on Count Vectors
accuracy1 = train_model(GradientBoostingClassifier(), X_train_count, y_train, X_test_count)
print('GB, Count Vectors    : %.4f\n' % accuracy1)

GB, Count Vectors    : 0.7990

CPU times: user 11.8 s, sys: 13.9 ms, total: 11.8 s
Wall time: 11.8 s


In [40]:
%%time
# Gradient Boosting on Word Level TF IDF Vectors
accuracy2 = train_model(GradientBoostingClassifier(), X_train_tfidf, y_train, X_test_tfidf)
print('GB, WordLevel TF-IDF : %.4f\n' % accuracy2)

GB, WordLevel TF-IDF : 0.7950

CPU times: user 8.93 s, sys: 17.3 ms, total: 8.94 s
Wall time: 8.96 s


In [41]:
%%time
# Gradient Boosting on Ngram Level TF IDF Vectors
accuracy3 = train_model(GradientBoostingClassifier(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('GB, N-Gram Vectors   : %.4f\n' % accuracy3)

GB, N-Gram Vectors   : 0.7345

CPU times: user 5.21 s, sys: 8.71 ms, total: 5.22 s
Wall time: 5.21 s


In [42]:
%%time
# Gradient Boosting on Character Level TF IDF Vectors
accuracy4 = train_model(GradientBoostingClassifier(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('GB, CharLevel Vectors: %.4f\n' % accuracy4)

GB, CharLevel Vectors: 0.8020

CPU times: user 1min 24s, sys: 170 ms, total: 1min 24s
Wall time: 1min 24s


In [43]:
results.loc['Gradient Boosting'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

In [44]:
results

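|                        | Count Vectors | WordLevel TF-IDF | N-Gram Vectors | CharLevel Vectors |
|------------------------|---------------|------------------|----------------|-------------------|
| Naïve Bayes            | 0.854         | 0.86             | 0.84           | 0.818             |
| Logistic Regression    | 0.852         | 0.873            | 0.836          | 0.8485            |
| Support Vector Machine | 0.8345        | 0.861            | 0.821          | 0.857             |
| Random Forest          | 0.837         | 0.827            | 0.7875         | 0.784             |
| Gradient Boosting      | 0.799         | 0.795            | 0.7345         | 0.802             |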
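
To read the comparison programmatically, the sketch below copies the accuracies from the table above into a plain dictionary and picks out the best combination overall and the best feature set per model. The name `results_table` is chosen here to avoid clashing with the notebook's `results` DataFrame.

```python
# Accuracies copied from the results table above
results_table = {
    "Naïve Bayes": {"Count Vectors": 0.854, "WordLevel TF-IDF": 0.86,
                    "N-Gram Vectors": 0.84, "CharLevel Vectors": 0.818},
    "Logistic Regression": {"Count Vectors": 0.852, "WordLevel TF-IDF": 0.873,
                            "N-Gram Vectors": 0.836, "CharLevel Vectors": 0.8485},
    "Support Vector Machine": {"Count Vectors": 0.8345, "WordLevel TF-IDF": 0.861,
                               "N-Gram Vectors": 0.821, "CharLevel Vectors": 0.857},
    "Random Forest": {"Count Vectors": 0.837, "WordLevel TF-IDF": 0.827,
                      "N-Gram Vectors": 0.7875, "CharLevel Vectors": 0.784},
    "Gradient Boosting": {"Count Vectors": 0.799, "WordLevel TF-IDF": 0.795,
                          "N-Gram Vectors": 0.7345, "CharLevel Vectors": 0.802},
}

# Best (model, feature set, accuracy) triple over the whole table
best = max(
    ((model, feat, acc)
     for model, row in results_table.items()
     for feat, acc in row.items()),
    key=lambda item: item[2],
)
print("Best overall:", best)

# Best feature set for each model
best_per_model = {model: max(row, key=row.get) for model, row in results_table.items()}
for model, feat in best_per_model.items():
    print(f"{model}: {feat} ({results_table[model][feat]})")
```

In this run, word-level TF-IDF is the strongest feature set for the three linear or probabilistic models, while the two tree ensembles peak on other feature sets, and word n-gram vectors are never the best choice.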




---

© 2023 Institute of Data