<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 8.5: Text Classification

In this lab you will implement different types of feature engineering for text classification:
* Count vectors
* TF-IDF vectors (word level, n-gram level, character level)
* Text/NLP based features
* Topic models
  
The following classification algorithms will be applied to the count and TF-IDF vector features:
* Naïve Bayes
* Logistic Regression
* Support Vector Machine
* Random Forest
* Gradient Boosting

## Import libraries

In [1]:
## Import Libraries
import numpy as np
import pandas as pd

import string
import spacy

from collections import Counter

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# import warnings
# warnings.filterwarnings('ignore')

## Load data

Sample:

    __label__2 Stuning even for the non-gamer: This sound ...
    __label__2 The best soundtrack ever to anything.: I'm ...
    __label__2 Amazing!: This soundtrack is my favorite m ...
    __label__2 Excellent Soundtrack: I truly like this so ...
    __label__2 Remember, Pull Your Jaw Off The Floor Afte ...
    __label__2 an absolute masterpiece: I am quite sure a ...
    __label__1 Buyer beware: This is a self-published boo ...
    . . .
    
There are only two **labels**:
- `__label__1`
- `__label__2`
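
As a rough illustration of the line format, the sketch below splits a raw line into its label and text and maps the labels 1/2 to 0/1. The regular expression and the `parse_line` helper are assumptions made for this example; the lab itself reads the file with fixed-width columns.

```python
import re

# Pattern for one corpus line: "__label__<digit> <text>" (format assumed from the sample above).
LINE_RE = re.compile(r'^__label__(\d)\s+(.*)$')

def parse_line(line):
    """Split one raw line into (label, text), mapping labels 1/2 to 0/1."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f'unexpected line format: {line!r}')
    return int(match.group(1)) - 1, match.group(2)

sample = [
    '__label__2 Amazing!: This soundtrack is my favorite m ...',
    '__label__1 Buyer beware: This is a self-published boo ...',
]
parsed = [parse_line(line) for line in sample]
print(parsed)
```
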

In [2]:
## Loading the data

df_corpus = pd.read_fwf(
    filepath_or_buffer = '/Users/tresornoel/Desktop/IOD/DATA/corpus.txt',
    colspecs = [(9, 10),   # label: take only the digit 1 or 2
                (11, 9000) # text: wide enough to reach the end of the line
               ],
    header = 0,  # note: the first line is treated as a header and dropped; header = None would keep it
    names = ['label', 'text'],
    lineterminator = '\n'
)

# convert label from [1, 2] to [0, 1]
df_corpus['label'] = df_corpus['label'] - 1

## Inspect the data

In [3]:
# ANSWER
df_corpus.head()

| | label | text |
|---|---|---|
| 0 | 1 | The best soundtrack ever to anything.: I'm rea... |
| 1 | 1 | Amazing!: This soundtrack is my favorite music... |
| 2 | 1 | Excellent Soundtrack: I truly like this soundt... |
| 3 | 1 | Remember, Pull Your Jaw Off The Floor After He... |
| 4 | 1 | an absolute masterpiece: I am quite sure any o... |


In [4]:
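# dataset overview
# note: `info` is printed without being called, so the output below is the
# bound method's repr (a preview of the frame); df_corpus.info() gives the column summary
print(df_corpus.info)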

# dataset shape
print(df_corpus.shape)

# dataset types
print(df_corpus.dtypes)

<bound method DataFrame.info of       label                                               text
0         1  The best soundtrack ever to anything.: I'm rea...
1         1  Amazing!: This soundtrack is my favorite music...
2         1  Excellent Soundtrack: I truly like this soundt...
3         1  Remember, Pull Your Jaw Off The Floor After He...
4         1  an absolute masterpiece: I am quite sure any o...
...     ...                                                ...
9994      1  A revelation of life in small town America in ...
9995      1  Great biography of a very interesting journali...
9996      0  Interesting Subject; Poor Presentation: You'd ...
9997      0  Don't buy: The box looked used and it is obvio...
9998      1  Beautiful Pen and Fast Delivery.: The pen was ...

[9999 rows x 2 columns]>
(9999, 2)
label     int64
text     object
dtype: object


In [5]:
# checking missing values
print(df_corpus.isnull().sum())

# counting the labels values
print(df_corpus['label'].value_counts())

label    0
text     0
dtype: int64
label
0    5097
1    4902
Name: count, dtype: int64


In [6]:
# checking the text length to see if any row has short or long text (noise)
df_corpus['text_length'] = df_corpus['text'].apply(len)
print(df_corpus['text_length'].describe())

count    9999.000000
mean      438.703870
std       239.255565
min       101.000000
25%       238.000000
50%       391.000000
75%       605.000000
max      1015.000000
Name: text_length, dtype: float64


## Split the data into train and test

In [7]:
## ANSWER
X = df_corpus['text']  
y = df_corpus['label']
## split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
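
The two classes here are close to balanced (5097 vs 4902), so a plain random split is reasonable. For less balanced data, `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both splits. The toy arrays below are illustrative and not taken from the corpus.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a 70/30 class imbalance (illustrative data, not from the corpus).
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, both splits keep the 70/30 ratio.
print(y_tr.mean(), y_te.mean())
```
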

## Feature Engineering

### Count Vectors as features

In [8]:
# create a count vectorizer object
count_vect = CountVectorizer(token_pattern = r'\w{1,}')

# Learn a vocabulary dictionary of all tokens in the raw documents
count_vect.fit(X_train)

# Transform documents to document-term matrix.
X_train_count = count_vect.transform(X_train)
X_test_count = count_vect.transform(X_test)

In [9]:
X_train.shape

(7999,)

In [10]:
X_train_count.shape

(7999, 28212)

In [11]:
X_train_count[0]

<1x28212 sparse matrix of type '<class 'numpy.int64'>'
	with 105 stored elements in Compressed Sparse Row format>
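
To see what the custom `token_pattern` changes, the sketch below compares it with the vectorizer's default pattern on two toy sentences. The default keeps only tokens of two or more word characters, whereas `r'\w{1,}'` also keeps single-character tokens such as "i" and "a". The documents are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['I like a good book', 'a book I like']  # toy documents

default_vect = CountVectorizer()                        # default pattern: tokens of 2+ characters
custom_vect = CountVectorizer(token_pattern=r'\w{1,}')  # pattern used in this lab

default_vocab = sorted(default_vect.fit(docs).vocabulary_)
custom_vocab = sorted(custom_vect.fit(docs).vocabulary_)

print(default_vocab)  # single-character tokens are dropped
print(custom_vocab)   # single-character tokens are kept
```
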

### TF-IDF Vectors as features
- Word level
- N-Gram level
- Character level

In [12]:
%%time
# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer = 'word',
                             token_pattern = r'\w{1,}',
                             max_features = 5000)
print(tfidf_vect)

tfidf_vect.fit(X_train)
X_train_tfidf = tfidf_vect.transform(X_train)
X_test_tfidf  = tfidf_vect.transform(X_test)

TfidfVectorizer(max_features=5000, token_pattern='\\w{1,}')
CPU times: user 429 ms, sys: 20.3 ms, total: 449 ms
Wall time: 454 ms


In [13]:
X_train_tfidf[0]

<1x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 94 stored elements in Compressed Sparse Row format>

In [14]:
X_train_tfidf.shape

(7999, 5000)

In [10]:
%%time
# ngram level tf-idf
tfidf_vect_ngram = TfidfVectorizer(analyzer = 'word',
                                   token_pattern = r'\w{1,}',
                                   ngram_range = (2, 3),
                                   max_features = 5000)
print(tfidf_vect_ngram)

tfidf_vect_ngram.fit(X_train)
X_train_tfidf_ngram = tfidf_vect_ngram.transform(X_train)
X_test_tfidf_ngram  = tfidf_vect_ngram.transform(X_test)

TfidfVectorizer(max_features=5000, ngram_range=(2, 3), token_pattern='\\w{1,}')
CPU times: user 1.83 s, sys: 59.2 ms, total: 1.89 s
Wall time: 1.89 s


In [11]:
%%time
# characters level tf-idf
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer = 'char',
                                         ngram_range = (2, 3),
                                         max_features = 5000)
print(tfidf_vect_ngram_chars)

tfidf_vect_ngram_chars.fit(X_train)
X_train_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(X_train)
X_test_tfidf_ngram_chars  = tfidf_vect_ngram_chars.transform(X_test)

TfidfVectorizer(analyzer='char', max_features=5000, ngram_range=(2, 3))
CPU times: user 2.93 s, sys: 63.6 ms, total: 3 s
Wall time: 2.99 s


### Text / NLP based features

Create some additional features:

- `char_count`: number of characters in the text
- `word_count`: number of words in the text
- `word_density`: average number of characters per word
- `punctuation_count`: number of punctuation characters in the text
- `title_word_count`: number of title-case words in the text
- `uppercase_word_count`: number of all-uppercase words in the text


In [12]:
%%time
# ANSWER
#  Character count
df_corpus['char_count'] = df_corpus['text'].apply(len)

# Word count
df_corpus['word_count'] = df_corpus['text'].apply(lambda x: len(x.split()))

# Word density (average number of characters per word)
df_corpus['word_density'] = df_corpus['char_count'] / (df_corpus['word_count'] + 1)  # Adding 1 to avoid division by zero

# Punctuation count (counting punctuations like . , ! ? etc.)

df_corpus['punctuation_count'] = df_corpus['text'].apply(lambda x: len([char for char in x if char in string.punctuation]))

# Title word count (counting words with the first letter capitalized)
df_corpus['title_word_count'] = df_corpus['text'].apply(lambda x: len([word for word in x.split() if word.istitle()]))

# Uppercase word count (counting words that are in all uppercase)
df_corpus['uppercase_word_count'] = df_corpus['text'].apply(lambda x: len([word for word in x.split() if word.isupper()]))

# Display the DataFrame to inspect new features
print(df_corpus.head())

   label                                               text  text_length  \
0      1  The best soundtrack ever to anything.: I'm rea...          509   
1      1  Amazing!: This soundtrack is my favorite music...          760   
2      1  Excellent Soundtrack: I truly like this soundt...          743   
3      1  Remember, Pull Your Jaw Off The Floor After He...          481   
4      1  an absolute masterpiece: I am quite sure any o...          825   

   char_count  word_count  word_density  punctuation_count  title_word_count  \
0         509          97      5.193878                 14                 7   
1         760         129      5.846154                 40                24   
2         743         118      6.243697                 33                52   
3         481          87      5.465909                 22                30   
4         825         142      5.769231                 35                14   

   uppercase_word_count  
0                     3  
1         
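
To make the counting rules concrete, here is a minimal sketch that computes the same kinds of features for a single string. The `text_features` helper and the example sentence are illustrative assumptions. Note that a one-letter word such as "I" satisfies both `istitle()` and `isupper()`, so it is counted in both columns.

```python
import string

def text_features(text):
    """Compute the hand-crafted features for a single string."""
    words = text.split()
    return {
        'char_count': len(text),
        'word_count': len(words),
        'punctuation_count': sum(ch in string.punctuation for ch in text),
        'title_word_count': sum(w.istitle() for w in words),
        'uppercase_word_count': sum(w.isupper() for w in words),
    }

features = text_features('I LOVE this Soundtrack!')
print(features)
```
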

In [13]:
## load spaCy
nlp = spacy.load('en_core_web_sm')

Part-of-speech tags in **spaCy**

    POS   DESCRIPTION               EXAMPLES
    ----- ------------------------- ---------------------------------------------
    ADJ   adjective                 big, old, green, incomprehensible, first
    ADP   adposition                in, to, during
    ADV   adverb                    very, tomorrow, down, where, there
    AUX   auxiliary                 is, has (done), will (do), should (do)
    CONJ  conjunction               and, or, but
    CCONJ coordinating conjunction  and, or, but
    DET   determiner                a, an, the
    INTJ  interjection              psst, ouch, bravo, hello
    NOUN  noun                      girl, cat, tree, air, beauty
    NUM   numeral                   1, 2017, one, seventy-seven, IV, MMXIV
    PART  particle                  's, not,
    PRON  pronoun                   I, you, he, she, myself, themselves, somebody
    PROPN proper noun               Mary, John, London, NATO, HBO
    PUNCT punctuation               ., (, ), ?
    SCONJ subordinating conjunction if, while, that
    SYM   symbol                    $, %, §, ©, +, −, ×, ÷, =, :), 😝
    VERB  verb                      run, runs, running, eat, ate, eating
    X     other                     sfpksdpsxmsa
    SPACE space
    
Count the adjectives, adverbs, nouns, numerals, pronouns, proper nouns and verbs in each text.

    Hint:
    1. Convert text to spacy document
    2. Use pos_
    3. Use Counter
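
The counting step in the hint can be sketched without loading a language model. In the example below the (token, tag) pairs are written by hand and stand in for `token.text` and `token.pos_`; in a real run they would come from a spaCy document.

```python
from collections import Counter

# Hand-written (token, tag) pairs standing in for (token.text, token.pos_);
# these tags are illustrative, not actual spaCy output.
tagged = [('This', 'PRON'), ('is', 'AUX'), ('a', 'DET'), ('truly', 'ADV'),
          ('great', 'ADJ'), ('soundtrack', 'NOUN'), ('.', 'PUNCT')]

pos_counts = Counter(tag for _, tag in tagged)

# A Counter returns 0 for missing keys, so tags that never occur need no special handling.
print(pos_counts['ADJ'], pos_counts['VERB'])
```
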

In [14]:
# Initialise the columns for the part-of-speech counts
df_corpus['adj_count'] = 0
df_corpus['adv_count'] = 0
df_corpus['noun_count'] = 0
df_corpus['num_count'] = 0
df_corpus['pron_count'] = 0
df_corpus['propn_count'] = 0
df_corpus['verb_count'] = 0

In [15]:
# ANSWER

# Function to count specific parts of speech
def count_pos(text):
    # Convert text to a spaCy document
    doc = nlp(text)
    
    # Initialize a Counter to count occurrences of POS tags
    pos_counts = Counter([token.pos_ for token in doc])
    
    # Return counts for specific parts of speech, defaulting to 0 if not found
    return {
        'adj_count': pos_counts.get('ADJ', 0),
        'adv_count': pos_counts.get('ADV', 0),
        'noun_count': pos_counts.get('NOUN', 0),
        'num_count': pos_counts.get('NUM', 0),
        'pron_count': pos_counts.get('PRON', 0),
        'propn_count': pos_counts.get('PROPN', 0),
        'verb_count': pos_counts.get('VERB', 0)
    }

# Apply the function to each row in the DataFrame
df_corpus[['adj_count', 'adv_count', 'noun_count', 'num_count', 'pron_count', 'propn_count', 'verb_count']] = \
    df_corpus['text'].apply(lambda text: pd.Series(count_pos(text)))

# Inspect the updated DataFrame with new features
print(df_corpus.head())

   label                                               text  text_length  \
0      1  The best soundtrack ever to anything.: I'm rea...          509   
1      1  Amazing!: This soundtrack is my favorite music...          760   
2      1  Excellent Soundtrack: I truly like this soundt...          743   
3      1  Remember, Pull Your Jaw Off The Floor After He...          481   
4      1  an absolute masterpiece: I am quite sure any o...          825   

   char_count  word_count  word_density  punctuation_count  title_word_count  \
0         509          97      5.193878                 14                 7   
1         760         129      5.846154                 40                24   
2         743         118      6.243697                 33                52   
3         481          87      5.465909                 22                30   
4         825         142      5.769231                 35                14   

   uppercase_word_count  adj_count  adv_count  noun_count  num

In [16]:
cols = [
    'char_count', 'word_count', 'word_density',
    'punctuation_count', 'title_word_count',
    'uppercase_word_count', 'adj_count',
    'adv_count', 'noun_count', 'num_count',
    'pron_count', 'propn_count', 'verb_count']

df_corpus[cols].sample(5)

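| | char_count | word_count | word_density | punctuation_count | title_word_count | uppercase_word_count | adj_count | adv_count | noun_count | num_count | pron_count | propn_count | verb_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7676 | 370 | 65 | 5.606061 | 22 | 7 | 4 | 9 | 10 | 9 | 1 | 10 | 1 | 7 |
| 5705 | 518 | 96 | 5.340206 | 13 | 14 | 4 | 11 | 9 | 11 | 0 | 17 | 7 | 12 |
| 5964 | 439 | 74 | 5.853333 | 21 | 15 | 3 | 5 | 6 | 11 | 0 | 5 | 17 | 10 |
| 8844 | 182 | 37 | 4.789474 | 10 | 7 | 0 | 3 | 2 | 8 | 2 | 5 | 3 | 3 |
| 7656 | 109 | 20 | 5.190476 | 3 | 3 | 1 | 3 | 1 | 4 | 0 | 3 | 0 | 1 |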


### Topic Models as features

In [15]:
%%time
# train an LDA model
lda_model = LatentDirichletAllocation(n_components = 20, learning_method = 'online', max_iter = 20)

X_topics = lda_model.fit_transform(X_train_count)
topic_word = lda_model.components_
vocab = count_vect.get_feature_names_out()

CPU times: user 17.5 s, sys: 1.6 s, total: 19.1 s
Wall time: 19.2 s
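
To use topics as classifier features, the test split needs topic vectors from the same fitted model. The sketch below shows one way this could look on toy documents: fit on the training counts, then call `transform` on the test counts. The documents, the two-topic setting and `random_state=0` are assumptions for the example only.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ['great soundtrack great music', 'boring book poor plot',
              'music and soundtrack', 'book with a poor ending']  # toy data
test_docs = ['great music', 'poor book']

vect = CountVectorizer(token_pattern=r'\w{1,}')
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Fit on training counts only, then reuse the fitted model for the test split.
train_topics = lda.fit_transform(vect.fit_transform(train_docs))
test_topics = lda.transform(vect.transform(test_docs))

print(train_topics.shape, test_topics.shape)
print(test_topics.sum(axis=1))  # each row is a distribution over topics
```
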


In [18]:
# view the topic models
n_top_words = 10
topic_summaries = []
print('Group Top Words')
print('-----', '-'*80)
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    top_words = ' '.join(topic_words)
    topic_summaries.append(top_words)
    print('  %3d %s' % (i, top_words))

Group Top Words
----- --------------------------------------------------------------------------------
    0 generation challenging frankenstein monster dean dharma companion owner bums coast
    1 ear fit cave jawbone fits noise bear ayla clan provided
    2 rhythm league path explaining mike engrossing typically rome bebel shine
    3 edition print ray printer blu memory avoid copy hp xp
    4 practice theory spanish marriage tests hospital math ignorant communism hazlitt
    5 casting circuit celiac steer geforce tene playable necklace 1983 gillain
    6 the and a i to of it this is in
    7 season skin digital mad show eye episodes photo larger simpletech
    8 development rate c japanese romantic focus touching killer shakespeare twists
    9 la de y en el que harry los et un
   10 his he manson fascinating fi dialogue sci story stargate places
   11 i it the my to for and a not with
   12 marquez eight creating laughs garcia richard insights professor tons gems
   13 blocks tusca
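
The slice `[:-(n_top_words+1):-1]` used to pick the top words is compact but easy to misread. The toy example below, with made-up weights, shows that it returns the indices of the largest values in descending order.

```python
import numpy as np

vocab_toy = np.array(['the', 'book', 'music', 'plot', 'great'])
weights = np.array([0.1, 0.7, 0.3, 0.9, 0.5])  # toy topic-word weights
n_top = 3

# argsort is ascending; the slice [:-(n_top + 1):-1] walks backwards from the end
# and stops after n_top items, giving the indices of the largest weights first.
top_idx = np.argsort(weights)[:-(n_top + 1):-1]
top_words = [str(w) for w in vocab_toy[top_idx]]
print(top_words)
```
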

## Modelling

Run the following cells to train a number of models on the count vector and TF-IDF vector feature sets generated above.

In [19]:
## helper function

def train_model(classifier, feature_vector_train, label, feature_vector_valid):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)

    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)

    # note: scores against the global y_test; accuracy_score expects (y_true, y_pred)
    return accuracy_score(y_test, predictions)
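
The helper above reads `y_test` from the enclosing scope. A variant that takes the validation labels as an explicit argument might look like the sketch below; the name `train_and_score` and the toy arrays are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

def train_and_score(classifier, X_train, y_train, X_valid, y_valid):
    """Fit on the training split and return accuracy on the validation split."""
    classifier.fit(X_train, y_train)
    predictions = classifier.predict(X_valid)
    return accuracy_score(y_valid, predictions)

# Toy count features: column 0 dominates class 0, column 1 dominates class 1.
X_tr = np.array([[3, 0], [2, 1], [0, 3], [1, 2]])
y_tr = np.array([0, 0, 1, 1])
X_va = np.array([[4, 0], [0, 4]])
y_va = np.array([0, 1])

score = train_and_score(MultinomialNB(), X_tr, y_tr, X_va, y_va)
print(score)
```

Passing the labels explicitly keeps the function usable with any split, for example a separate validation set.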

In [20]:
# Keep the results in a dataframe
results = pd.DataFrame(columns = ['Count Vectors',
                                  'WordLevel TF-IDF',
                                  'N-Gram Vectors',
                                  'CharLevel Vectors'])

### Naive Bayes Classifier

In [21]:
%%time
# Naive Bayes on Count Vectors
accuracy1 = train_model(MultinomialNB(), X_train_count, y_train, X_test_count)
print('NB, Count Vectors    : %.4f\n' % accuracy1)

NB, Count Vectors    : 0.8520

CPU times: user 11.9 ms, sys: 4.97 ms, total: 16.9 ms
Wall time: 15.9 ms


In [22]:
%%time
# Naive Bayes on Word Level TF IDF Vectors
accuracy2 = train_model(MultinomialNB(), X_train_tfidf, y_train, X_test_tfidf)
print('NB, WordLevel TF-IDF : %.4f\n' % accuracy2)

NB, WordLevel TF-IDF : 0.8550

CPU times: user 9.71 ms, sys: 4.81 ms, total: 14.5 ms
Wall time: 11.9 ms


In [23]:
%%time
# Naive Bayes on Ngram Level TF IDF Vectors
accuracy3 = train_model(MultinomialNB(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('NB, N-Gram Vectors   : %.4f\n' % accuracy3)

NB, N-Gram Vectors   : 0.8360

CPU times: user 8.75 ms, sys: 4.11 ms, total: 12.9 ms
Wall time: 11.4 ms


In [24]:
%%time
# Naive Bayes on Character Level TF IDF Vectors
accuracy4 = train_model(MultinomialNB(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('NB, CharLevel Vectors: %.4f\n' % accuracy4)

NB, CharLevel Vectors: 0.8195

CPU times: user 22.9 ms, sys: 18.2 ms, total: 41.2 ms
Wall time: 39.7 ms


In [25]:
results.loc['Naïve Bayes'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Linear Classifier

In [26]:
%%time
# Linear Classifier on Count Vectors
accuracy1 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 350), X_train_count, y_train, X_test_count)
print('LR, Count Vectors    : %.4f\n' % accuracy1)

LR, Count Vectors    : 0.8525

CPU times: user 1.55 s, sys: 787 ms, total: 2.33 s
Wall time: 579 ms


In [27]:
%%time
# Linear Classifier on Word Level TF IDF Vectors
accuracy2 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf, y_train, X_test_tfidf)
print('LR, WordLevel TF-IDF : %.4f\n' % accuracy2)

LR, WordLevel TF-IDF : 0.8715

CPU times: user 32 ms, sys: 3.05 ms, total: 35 ms
Wall time: 33.5 ms


In [28]:
%%time
# Linear Classifier on Ngram Level TF IDF Vectors
accuracy3 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('LR, N-Gram Vectors   : %.4f\n' % accuracy3)

LR, N-Gram Vectors   : 0.8295

CPU times: user 29.8 ms, sys: 2.96 ms, total: 32.7 ms
Wall time: 31.8 ms


In [29]:
%%time
# Linear Classifier on Character Level TF IDF Vectors
accuracy4 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('LR, CharLevel Vectors: %.4f\n' % accuracy4)

LR, CharLevel Vectors: 0.8490

CPU times: user 123 ms, sys: 3.76 ms, total: 126 ms
Wall time: 127 ms


In [30]:
results.loc['Logistic Regression'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}
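
Vectorizing and classifying can also be chained into a single estimator, which keeps the fitted vocabulary and the model together. The sketch below is one possible way to do this with scikit-learn's `make_pipeline`; the training texts and labels are made up and far smaller than the lab corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ['great soundtrack', 'amazing music', 'excellent album',
               'poor book', 'boring plot', 'bad writing']  # toy data
train_labels = [1, 1, 1, 0, 0, 0]

pipe = make_pipeline(
    TfidfVectorizer(token_pattern=r'\w{1,}'),
    LogisticRegression(solver='lbfgs', max_iter=100),
)
pipe.fit(train_texts, train_labels)

# The pipeline vectorizes new text with the vocabulary learned during fit.
preds = pipe.predict(['great music', 'boring book'])
print(preds)
```
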

### Support Vector Machine

In [31]:
%%time
# Support Vector Machine on Count Vectors
accuracy1 = train_model(LinearSVC(), X_train_count, y_train, X_test_count)
print('SVM, Count Vectors    : %.4f\n' % accuracy1)



SVM, Count Vectors    : 0.8345

CPU times: user 299 ms, sys: 8.18 ms, total: 307 ms
Wall time: 309 ms


In [32]:
%%time
# Support Vector Machine on Word Level TF IDF Vectors
accuracy2 = train_model(LinearSVC(), X_train_tfidf, y_train, X_test_tfidf)
print('SVM, WordLevel TF-IDF : %.4f\n' % accuracy2)

SVM, WordLevel TF-IDF : 0.8605

CPU times: user 56.8 ms, sys: 3.79 ms, total: 60.5 ms
Wall time: 58 ms




In [33]:
%%time
# Support Vector Machine on Ngram Level TF IDF Vectors
accuracy3 = train_model(LinearSVC(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('SVM, N-Gram Vectors   : %.4f\n' % accuracy3)

SVM, N-Gram Vectors   : 0.8120

CPU times: user 34.4 ms, sys: 2.43 ms, total: 36.8 ms
Wall time: 36.1 ms




In [34]:
%%time
# Support Vector Machine on Character Level TF IDF Vectors
accuracy4 = train_model(LinearSVC(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('SVM, CharLevel Vectors: %.4f\n' % accuracy4)



SVM, CharLevel Vectors: 0.8590

CPU times: user 264 ms, sys: 14.2 ms, total: 278 ms
Wall time: 275 ms


In [35]:
results.loc['Support Vector Machine'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Bagging Models

In [36]:
%%time
# Bagging (Random Forest) on Count Vectors
accuracy1 = train_model(RandomForestClassifier(n_estimators = 100), X_train_count, y_train, X_test_count)
print('RF, Count Vectors    : %.4f\n' % accuracy1)

RF, Count Vectors    : 0.8275

CPU times: user 3.31 s, sys: 22.5 ms, total: 3.34 s
Wall time: 3.34 s


In [37]:
%%time
# Bagging (Random Forest) on Word Level TF IDF Vectors
accuracy2 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf, y_train, X_test_tfidf)
print('RF, WordLevel TF-IDF : %.4f\n' % accuracy2)

RF, WordLevel TF-IDF : 0.8255

CPU times: user 2.11 s, sys: 6.46 ms, total: 2.12 s
Wall time: 2.12 s


In [38]:
%%time
# Bagging (Random Forest) on Ngram Level TF IDF Vectors
accuracy3 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('RF, N-Gram Vectors   : %.4f\n' % accuracy3)

RF, N-Gram Vectors   : 0.7835

CPU times: user 2.36 s, sys: 9.65 ms, total: 2.37 s
Wall time: 2.37 s


In [39]:
%%time
# Bagging (Random Forest) on Character Level TF IDF Vectors
accuracy4 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('RF, CharLevel Vectors: %.4f\n' % accuracy4)

RF, CharLevel Vectors: 0.7845

CPU times: user 7.37 s, sys: 38.9 ms, total: 7.41 s
Wall time: 7.41 s


In [40]:
results.loc['Random Forest'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Boosting Models

In [41]:
%%time
# Gradient Boosting on Count Vectors
accuracy1 = train_model(GradientBoostingClassifier(), X_train_count, y_train, X_test_count)
print('GB, Count Vectors    : %.4f\n' % accuracy1)

GB, Count Vectors    : 0.7990

CPU times: user 3 s, sys: 14.3 ms, total: 3.01 s
Wall time: 3.01 s


In [42]:
%%time
# Gradient Boosting on Word Level TF IDF Vectors
accuracy2 = train_model(GradientBoostingClassifier(), X_train_tfidf, y_train, X_test_tfidf)
print('GB, WordLevel TF-IDF : %.4f\n' % accuracy2)

GB, WordLevel TF-IDF : 0.7915

CPU times: user 8.96 s, sys: 37.6 ms, total: 9 s
Wall time: 9 s


In [43]:
%%time
# Gradient Boosting on Ngram Level TF IDF Vectors
accuracy3 = train_model(GradientBoostingClassifier(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('GB, N-Gram Vectors   : %.4f\n' % accuracy3)

GB, N-Gram Vectors   : 0.7380

CPU times: user 5.22 s, sys: 20.2 ms, total: 5.24 s
Wall time: 5.24 s


In [44]:
%%time
# Gradient Boosting on Character Level TF IDF Vectors
accuracy4 = train_model(GradientBoostingClassifier(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('GB, CharLevel Vectors: %.4f\n' % accuracy4)

GB, CharLevel Vectors: 0.8025

CPU times: user 1min 25s, sys: 260 ms, total: 1min 25s
Wall time: 1min 25s


In [46]:
results.loc['Gradient Boosting'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

In [47]:
results

| | Count Vectors | WordLevel TF-IDF | N-Gram Vectors | CharLevel Vectors |
|---|---|---|---|---|
| Naïve Bayes | 0.8520 | 0.8550 | 0.8360 | 0.8195 |
| Logistic Regression | 0.8525 | 0.8715 | 0.8295 | 0.8490 |
| Support Vector Machine | 0.8345 | 0.8605 | 0.8120 | 0.8590 |
| Random Forest | 0.8275 | 0.8255 | 0.7835 | 0.7845 |
| Gradient Boosting | 0.7990 | 0.7915 | 0.7380 | 0.8025 |


Which combination of features and model performed the best?

Logistic Regression with word-level TF-IDF performed best, with a test accuracy of 0.8715. The Support Vector Machine came next, with 0.8605 on word-level TF-IDF and 0.8590 on character-level vectors.
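
The best combination can also be read off programmatically. The sketch below rebuilds the results table from the accuracies reported above and uses `stack()` with `idxmax()` to find the top cell, plus the best feature set per model. The name `results_tbl` is chosen here to avoid overwriting the lab's `results` frame.

```python
import pandas as pd

# Accuracies copied from the results table above.
results_tbl = pd.DataFrame(
    {
        'Count Vectors':     [0.8520, 0.8525, 0.8345, 0.8275, 0.7990],
        'WordLevel TF-IDF':  [0.8550, 0.8715, 0.8605, 0.8255, 0.7915],
        'N-Gram Vectors':    [0.8360, 0.8295, 0.8120, 0.7835, 0.7380],
        'CharLevel Vectors': [0.8195, 0.8490, 0.8590, 0.7845, 0.8025],
    },
    index=['Naïve Bayes', 'Logistic Regression', 'Support Vector Machine',
           'Random Forest', 'Gradient Boosting'],
)

# stack() turns the table into a Series indexed by (model, feature set).
best_model, best_features = results_tbl.stack().idxmax()
best_per_model = results_tbl.idxmax(axis=1)

print(best_model, best_features, results_tbl.loc[best_model, best_features])
print(best_per_model)
```
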



---

© 2024 Institute of Data



