<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 8.5: Text Classification
INSTRUCTIONS:
- Run the cells
- Observe and understand the results
- Answer the questions

## Import libraries

In [1]:
## Import Libraries
import numpy as np
import pandas as pd

import string
import spacy

from collections import Counter

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# import warnings
# warnings.filterwarnings('ignore')

## Load data

Sample:

    __label__2 Stuning even for the non-gamer: This sound ...
    __label__2 The best soundtrack ever to anything.: I'm ...
    __label__2 Amazing!: This soundtrack is my favorite m ...
    __label__2 Excellent Soundtrack: I truly like this so ...
    __label__2 Remember, Pull Your Jaw Off The Floor Afte ...
    __label__2 an absolute masterpiece: I am quite sure a ...
    __label__1 Buyer beware: This is a self-published boo ...
    . . .
    
There are only two **labels**:
- `__label__1`
- `__label__2`
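
Each line of the corpus starts with a `__label__N` prefix, followed by a space and the review text. As a minimal sketch, assuming that layout holds for every line, a line can be split into label and text without relying on fixed column positions. `parse_line` is a hypothetical helper and is not part of the lab.

```python
def parse_line(line):
    """Split '__label__N some text' into (N - 1, 'some text')."""
    prefix, _, text = line.rstrip("\n").partition(" ")
    if not prefix.startswith("__label__"):
        raise ValueError(f"unexpected label prefix: {prefix!r}")
    # Map labels 1 and 2 to 0 and 1, as the notebook does after loading
    label = int(prefix[len("__label__"):]) - 1
    return label, text


sample = "__label__2 Amazing!: This soundtrack is my favorite"
label, text = parse_line(sample)
print(label, text)
```

Splitting on the first space keeps the full review text regardless of its length, whereas a fixed-width reader needs an upper bound on the line length.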

In [2]:
## Loading the data

trainDF = pd.read_fwf(
    filepath_or_buffer = '/Users/ayano/Desktop/Data Science & AI/csv/corpus.txt',
    colspecs = [(9, 10),   # label: get only the numbers 1 or 2
                (11, 9000) # text: wide enough to reach the end of the line
               ],
    header = 0,  # note: treats the first line of the file as a header row, so the first review is not loaded
    names = ['label', 'text'],
    lineterminator = '\n'
)

# convert label from [1, 2] to [0, 1]
trainDF['label'] = trainDF['label'] - 1

## Inspect the data

In [3]:
# ANSWER

print(trainDF.info())
print(trainDF.sample(10))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   9999 non-null   int64 
 1   text    9999 non-null   object
dtypes: int64(1), object(1)
memory usage: 156.4+ KB
None
      label                                               text
9110      0  Falls far short of expectations.: The author i...
6825      0  It was a very good description of life during ...
8579      1  THE USUAL SUSPECTS: MY KIDS AND I LOVED THE MO...
2782      1  helps in understanding children: For a long ti...
6387      1  whoa: Some people find have bad taste in music...
3114      0  If bogus got stars, this would have 10: As an ...
6765      1  Great Christmas classics by a great vocal grou...
2787      1  Did the Egyptians Colonize North America?: I w...
6089      0  The blu-ray image is TERRIBLE!: I wish I had r...
3868      0  Family film: I did not really care for this fi...


In [4]:
trainDF.shape

(9999, 2)

## Split the data into train and test

In [5]:
## ANSWER
## split the dataset

X_train, X_test, y_train, y_test = train_test_split(
    trainDF['text'],
    trainDF['label'],
    test_size = 0.2,
    random_state = 42
)

## Feature Engineering

### Count Vectors as features

In [6]:
# create a count vectorizer object
count_vect = CountVectorizer(token_pattern = r'\w{1,}')

# Learn a vocabulary dictionary of all tokens in the raw documents
count_vect.fit(trainDF['text'])

# Transform documents to document-term matrix.
X_train_count = count_vect.transform(X_train)
X_test_count = count_vect.transform(X_test)
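
To make the fit/transform split concrete, here is a rough pure-Python sketch of what a count vectorizer does: `fit` learns a sorted vocabulary and `transform` counts each vocabulary term per document. The toy documents and the helper names `tokenize` and `transform` are assumptions for illustration. Unlike the cell above, which fits on all of `trainDF['text']`, this sketch learns the vocabulary from the training documents only, which is a common way to keep test-set vocabulary out of the features.

```python
from collections import Counter
import re

train_docs = ["great sound great price", "poor sound"]
test_docs = ["great unknown word"]


def tokenize(text):
    # Same idea as token_pattern r'\w{1,}': runs of one or more word characters
    return re.findall(r"\w{1,}", text.lower())


# "fit": learn the vocabulary from the training documents only
vocabulary = sorted({tok for doc in train_docs for tok in tokenize(doc)})


def transform(docs):
    # "transform": one row per document, one column per vocabulary term
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[term] for term in vocabulary])
    return rows


print(vocabulary)             # ['great', 'poor', 'price', 'sound']
print(transform(train_docs))  # [[2, 0, 1, 1], [0, 1, 0, 1]]
print(transform(test_docs))   # [[1, 0, 0, 0]]
```

Tokens that were not seen during fitting, such as "unknown" here, simply get no column.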

### TF-IDF Vectors as features
- Word level
- N-Gram level
- Character level
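
The three levels differ only in what counts as a "term". The sketch below approximates the terms each level would produce for a short string; it is a simplification of the tokenisation scikit-learn performs (ordering and whitespace handling may differ), and `word_ngrams` and `char_ngrams` are hypothetical helpers.

```python
import re


def word_ngrams(text, n_min, n_max):
    """Word n-grams for every n in [n_min, n_max]."""
    tokens = re.findall(r"\w{1,}", text.lower())
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n_min, n_max):
    """Character n-grams for every n in [n_min, n_max]."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


text = "not good at all"
print(word_ngrams(text, 1, 1))    # word level
print(word_ngrams(text, 2, 3))    # n-gram level, like ngram_range = (2, 3)
print(char_ngrams("good", 2, 3))  # character level
```

Word n-grams such as "not good" keep some local word order that single words lose, while character n-grams are more tolerant of misspellings.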

In [7]:
%%time
# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer = 'word',
                             token_pattern = r'\w{1,}',
                             max_features = 5000)
print(tfidf_vect)

tfidf_vect.fit(trainDF['text'])
X_train_tfidf = tfidf_vect.transform(X_train)
X_test_tfidf  = tfidf_vect.transform(X_test)

TfidfVectorizer(max_features=5000, token_pattern='\\w{1,}')
CPU times: user 1.66 s, sys: 55.8 ms, total: 1.72 s
Wall time: 1.85 s
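
With its default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), `TfidfVectorizer` multiplies each term count by idf(t) = ln((1 + n) / (1 + df(t))) + 1 and then scales every row to unit length. The sketch below reproduces that arithmetic by hand on a hypothetical toy count matrix; the numbers are made up for illustration.

```python
import math

# Hypothetical toy counts: rows are documents, columns are terms
terms = ["sound", "great", "poor"]
counts = [[1, 2, 0],
          [1, 0, 1]]

n_docs = len(counts)
# Document frequency: number of documents that contain each term
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]
# Smoothed inverse document frequency
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]


def tfidf_row(row):
    # Weight the raw counts, then L2-normalise the row
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm else weighted


tfidf = [tfidf_row(row) for row in counts]
for term, w in zip(terms, idf):
    print(f"{term}: idf = {w:.4f}")
```

A term that appears in every document ("sound") gets the minimum idf of 1, so rarer terms carry relatively more weight.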


In [8]:
%%time
# ngram level tf-idf
tfidf_vect_ngram = TfidfVectorizer(analyzer = 'word',
                                   token_pattern = r'\w{1,}',
                                   ngram_range = (2, 3),
                                   max_features = 5000)
print(tfidf_vect_ngram)

tfidf_vect_ngram.fit(trainDF['text'])
X_train_tfidf_ngram = tfidf_vect_ngram.transform(X_train)
X_test_tfidf_ngram  = tfidf_vect_ngram.transform(X_test)

TfidfVectorizer(max_features=5000, ngram_range=(2, 3), token_pattern='\\w{1,}')
CPU times: user 7.29 s, sys: 294 ms, total: 7.58 s
Wall time: 7.68 s


In [9]:
%%time
# characters level tf-idf
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer = 'char',
                                         token_pattern = r'\w{1,}',  # ignored when analyzer = 'char'
                                         ngram_range = (2, 3),
                                         max_features = 5000)
print(tfidf_vect_ngram_chars)

tfidf_vect_ngram_chars.fit(trainDF['text'])
X_train_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(X_train)
X_test_tfidf_ngram_chars  = tfidf_vect_ngram_chars.transform(X_test)

TfidfVectorizer(analyzer='char', max_features=5000, ngram_range=(2, 3),
                token_pattern='\\w{1,}')




CPU times: user 10.4 s, sys: 403 ms, total: 10.8 s
Wall time: 11.6 s


### Text / NLP based features

Create some additional features from the raw text:

- Char Count = number of characters in the text
- Word Count = number of words in the text
- Word Density = average number of characters per word
- Punctuation Count = number of punctuation characters in the text
- Title Word Count = number of words in the title
- Uppercase Word Count = number of uppercase words in the text
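
The answer cell that follows does not compute a title word count. "Title word count" can be read two ways: the number of words in the review title (the part before the first colon), or the number of title-cased words in the whole text. The sketch below assumes the second reading; `title_word_count` is a hypothetical helper, not the lab's reference solution.

```python
def title_word_count(text):
    """Count title-cased words, e.g. 'Great' but not 'GREAT' or 'great'."""
    return sum(1 for word in text.split() if word.istitle())


print(title_word_count("Great Christmas classics by a great vocal group"))  # 2
print(title_word_count("THE USUAL SUSPECTS"))                               # 0
```

It could be applied in the same way as the other features, for example `trainDF['text'].apply(title_word_count)`.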

In [10]:
%%time
# ANSWER


# Feature extraction functions
def char_count(text):
    return len(text)

def word_count(text):
    words = text.split()
    return len(words)

def word_density(text):
    words = text.split()
    total_chars = sum(len(word) for word in words)  # renamed so it does not shadow char_count()
    return total_chars / (len(words) + 1)  # +1 in the denominator avoids division by zero

def punctuation_count(text):
    return sum(1 for char in text if char in string.punctuation)

def uppercase_word_count(text):
    words = text.split()
    uppercase_words = [word for word in words if word.isupper()]
    return len(uppercase_words)

# Apply the feature extraction functions to the DataFrame
trainDF['Char_Count'] = trainDF['text'].apply(char_count)
trainDF['Word_Count'] = trainDF['text'].apply(word_count)
trainDF['Word_Density'] = trainDF['text'].apply(word_density)
trainDF['Punctuation_Count'] = trainDF['text'].apply(punctuation_count)
trainDF['Uppercase_Word_Count'] = trainDF['text'].apply(uppercase_word_count)

# Display the DataFrame with new features
print(trainDF.head())


   label                                               text  Char_Count  \
0      1  The best soundtrack ever to anything.: I'm rea...         509   
1      1  Amazing!: This soundtrack is my favorite music...         760   
2      1  Excellent Soundtrack: I truly like this soundt...         743   
3      1  Remember, Pull Your Jaw Off The Floor After He...         481   
4      1  an absolute masterpiece: I am quite sure any o...         825   

   Word_Count  Word_Density  Punctuation_Count  Uppercase_Word_Count  
0          97      4.214286                 14                     3  
1         129      4.861538                 40                     4  
2         118      5.260504                 33                     4  
3          87      4.488636                 22                     0  
4         142      4.783217                 35                     3  
CPU times: user 597 ms, sys: 13.4 ms, total: 611 ms
Wall time: 618 ms


In [11]:
## load spaCy
nlp = spacy.load('en_core_web_sm')

Part-of-speech tags in **spaCy**

    POS   DESCRIPTION               EXAMPLES
    ----- ------------------------- ---------------------------------------------
    ADJ   adjective                 big, old, green, incomprehensible, first
    ADP   adposition                in, to, during
    ADV   adverb                    very, tomorrow, down, where, there
    AUX   auxiliary                 is, has (done), will (do), should (do)
    CONJ  conjunction               and, or, but
    CCONJ coordinating conjunction  and, or, but
    DET   determiner                a, an, the
    INTJ  interjection              psst, ouch, bravo, hello
    NOUN  noun                      girl, cat, tree, air, beauty
    NUM   numeral                   1, 2017, one, seventy-seven, IV, MMXIV
    PART  particle                  's, not,
    PRON  pronoun                   I, you, he, she, myself, themselves, somebody
    PROPN proper noun               Mary, John, London, NATO, HBO
    PUNCT punctuation               ., (, ), ?
    SCONJ subordinating conjunction if, while, that
    SYM   symbol                    $, %, §, ©, +, −, ×, ÷, =, :), 😝
    VERB  verb                      run, runs, running, eat, ate, eating
    X     other                     sfpksdpsxmsa
    SPACE space
    
Count the adjectives, adverbs, nouns, numerals, pronouns, proper nouns and verbs in each text.

    Hint:
    1. Convert text to spacy document
    2. Use pos_
    3. Use Counter
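
The third hint is worth a closer look: a `Counter` returns 0 for keys it has never seen, so a text without any numerals still yields a valid `num_count`. In the sketch below a hand-written list of (token, tag) pairs stands in for the output of a tagger; the tags are hypothetical and were not produced by spaCy.

```python
from collections import Counter

# Hypothetical (token, POS) pairs standing in for a tagged document
tagged = [("It", "PRON"), ("is", "AUX"), ("a", "DET"), ("truly", "ADV"),
          ("amazing", "ADJ"), ("soundtrack", "NOUN"), ("!", "PUNCT")]

pos_counts = Counter(pos for _, pos in tagged)

# Missing tags count as 0, so no KeyError handling is needed
wanted = ["ADJ", "ADV", "NOUN", "NUM", "PRON", "PROPN", "VERB"]
features = {f"{pos.lower()}_count": pos_counts[pos] for pos in wanted}
print(features)
```

With a real spaCy document, the generator would iterate over `token.pos_` for each token instead of the hand-written pairs.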

In [12]:
# Initialise some columns for feature's counts
trainDF['adj_count'] = 0
trainDF['adv_count'] = 0
trainDF['noun_count'] = 0
trainDF['num_count'] = 0
trainDF['pron_count'] = 0
trainDF['propn_count'] = 0
trainDF['verb_count'] = 0

In [13]:
# ANSWER

# Function to count parts of speech
def count_pos(text):
    doc = nlp(text)
    pos_counts = Counter(token.pos_ for token in doc)
    return pos_counts

# Function to update the counts in the DataFrame
def update_counts(row):
    pos_counts = count_pos(row['text'])
    row['adj_count'] = pos_counts['ADJ']
    row['adv_count'] = pos_counts['ADV']
    row['noun_count'] = pos_counts['NOUN']
    row['num_count'] = pos_counts['NUM']
    row['pron_count'] = pos_counts['PRON']
    row['propn_count'] = pos_counts['PROPN']
    row['verb_count'] = pos_counts['VERB']
    return row

# Apply the update_counts function to the DataFrame
trainDF = trainDF.apply(update_counts, axis=1)

# Display the DataFrame with updated counts
print(trainDF[['text', 'adj_count', 'adv_count', 'noun_count', 'num_count', 'pron_count', 'propn_count', 'verb_count']])

                                                   text  adj_count  adv_count  \
0     The best soundtrack ever to anything.: I'm rea...          7          4   
1     Amazing!: This soundtrack is my favorite music...         11          8   
2     Excellent Soundtrack: I truly like this soundt...          6          4   
3     Remember, Pull Your Jaw Off The Floor After He...          6          1   
4     an absolute masterpiece: I am quite sure any o...         18         15   
...                                                 ...        ...        ...   
9994  A revelation of life in small town America in ...         15          5   
9995  Great biography of a very interesting journali...         15          4   
9996  Interesting Subject; Poor Presentation: You'd ...         14          3   
9997  Don't buy: The box looked used and it is obvio...          1          1   
9998  Beautiful Pen and Fast Delivery.: The pen was ...          9          4   

      noun_count  num_count

In [15]:
# Column names must match those created in the cells above;
# title_word_count was never computed, so it is left out
cols = [
    'Char_Count', 'Word_Count', 'Word_Density',
    'Punctuation_Count', 'Uppercase_Word_Count',
    'adj_count', 'adv_count', 'noun_count', 'num_count',
    'pron_count', 'propn_count', 'verb_count']

trainDF[cols].sample(5)

### Topic Models as features

In [16]:
%%time
# train a LDA Model
lda_model = LatentDirichletAllocation(n_components = 20, learning_method = 'online', max_iter = 20)

X_topics = lda_model.fit_transform(X_train_count)
topic_word = lda_model.components_
vocab = count_vect.get_feature_names_out()

CPU times: user 1min 15s, sys: 1.95 s, total: 1min 17s
Wall time: 1min 22s


In [17]:
# view the topic models
n_top_words = 10
topic_summaries = []
print('Group Top Words')
print('-----', '-'*80)
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    top_words = ' '.join(topic_words)
    topic_summaries.append(top_words)
    print('  %3d %s' % (i, top_words))

Group Top Words
----- --------------------------------------------------------------------------------
    0 the i and a to it this of is in
    1 fit edition ear print ray printer blu size jawbone use
    2 remembered newspaper heaven sales bare titan julie guilt anthology plates
    3 dialogue film places items puzzle continues travel seat gore humanity
    4 shame hollywood her diane lane lacking dissapointed rod higgins recordings
    5 et anderson est manon des korean varies wagner celine fifty
    6 effects special hatebreed celiac realy suit bernie scent 55 pitchshifter
    7 boots these pair them wear boot comfortable shoes hi feet
    8 service adapter la customer de laptop power y en henry
    9 cap arthur xbox cutting circuit desperate grabs astrology knox aluminum
   10 of is the his and s album song in he
   11 marie soldier oriented sorbo schure curie stalingrad joanna tanks bela
   12 latest keeps pushing san popping camp glory jobs vintage sticks
   13 steer dancers poe
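
The slice `[:-(n_top_words+1):-1]` used to pick the top words is compact but easy to misread. The sketch below walks through it on a hypothetical five-word vocabulary with made-up topic weights.

```python
import numpy as np

vocab = np.array(["album", "boots", "film", "laptop", "song"])
# Hypothetical weights of one topic over the five vocabulary terms
topic_dist = np.array([0.30, 0.05, 0.10, 0.15, 0.40])

n_top_words = 3
order = np.argsort(topic_dist)  # indices from smallest to largest weight
# Walk backwards from the last element and stop after n_top_words items
top = vocab[order][:-(n_top_words + 1):-1]
print(top.tolist())  # ['song', 'album', 'laptop']

# An equivalent form: reverse the order first, then take the first n_top_words
same = vocab[order[::-1][:n_top_words]]
print(same.tolist())
```

Both forms return the words with the largest weights first; the second is arguably easier to read.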

## Modelling

In [18]:
## helper function

def train_model(classifier, feature_vector_train, label, feature_vector_valid):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)

    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)

    # accuracy_score expects (y_true, y_pred); y_test is the global created by the split above
    return accuracy_score(y_test, predictions)
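
The helper above reads `y_test` from the global scope. A variant that takes the validation labels as an argument is easier to reuse with other splits. The sketch below is one possible shape, not the lab's reference code: `evaluate` and `MajorityClassifier` are hypothetical names, accuracy is computed by hand to keep the example dependency-free, and the toy data is made up. The majority-class baseline also gives a floor against which the accuracies in this lab can be judged.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical baseline: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def evaluate(classifier, X_train, y_train, X_valid, y_valid):
    """Fit on the training data and return accuracy on the validation data."""
    classifier.fit(X_train, y_train)
    predictions = classifier.predict(X_valid)
    correct = sum(1 for truth, pred in zip(y_valid, predictions) if truth == pred)
    return correct / len(y_valid)


# Toy data, for illustration only
toy_X_train, toy_y_train = [[0], [1], [2], [3]], [1, 1, 1, 0]
toy_X_valid, toy_y_valid = [[4], [5]], [1, 0]
baseline = evaluate(MajorityClassifier(), toy_X_train, toy_y_train, toy_X_valid, toy_y_valid)
print(baseline)  # 0.5
```

Any object with `fit` and `predict` methods, including the scikit-learn classifiers used below, could be passed to `evaluate` in the same way.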

In [19]:
# Keep the results in a dataframe
results = pd.DataFrame(columns = ['Count Vectors',
                                  'WordLevel TF-IDF',
                                  'N-Gram Vectors',
                                  'CharLevel Vectors'])

### Naive Bayes Classifier

In [20]:
%%time
# Naive Bayes on Count Vectors
accuracy1 = train_model(MultinomialNB(), X_train_count, y_train, X_test_count)
print('NB, Count Vectors    : %.4f\n' % accuracy1)

NB, Count Vectors    : 0.8540

CPU times: user 16 ms, sys: 42.8 ms, total: 58.8 ms
Wall time: 92.1 ms


In [21]:
%%time
# Naive Bayes on Word Level TF IDF Vectors
accuracy2 = train_model(MultinomialNB(), X_train_tfidf, y_train, X_test_tfidf)
print('NB, WordLevel TF-IDF : %.4f\n' % accuracy2)

NB, WordLevel TF-IDF : 0.8600

CPU times: user 15.3 ms, sys: 24.6 ms, total: 39.9 ms
Wall time: 63 ms


In [22]:
%%time
# Naive Bayes on Ngram Level TF IDF Vectors
accuracy3 = train_model(MultinomialNB(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('NB, N-Gram Vectors   : %.4f\n' % accuracy3)

NB, N-Gram Vectors   : 0.8400

CPU times: user 11.4 ms, sys: 17.7 ms, total: 29.1 ms
Wall time: 45.2 ms


In [23]:
%%time
# Naive Bayes on Character Level TF IDF Vectors
accuracy4 = train_model(MultinomialNB(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('NB, CharLevel Vectors: %.4f\n' % accuracy4)

NB, CharLevel Vectors: 0.8180

CPU times: user 53.6 ms, sys: 126 ms, total: 179 ms
Wall time: 280 ms


In [24]:
results.loc['Naïve Bayes'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Linear Classifier

In [25]:
%%time
# Linear Classifier on Count Vectors
accuracy1 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 350), X_train_count, y_train, X_test_count)
print('LR, Count Vectors    : %.4f\n' % accuracy1)

LR, Count Vectors    : 0.8520

CPU times: user 6.13 s, sys: 2.07 s, total: 8.2 s
Wall time: 2.4 s


In [26]:
%%time
# Linear Classifier on Word Level TF IDF Vectors
accuracy2 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf, y_train, X_test_tfidf)
print('LR, WordLevel TF-IDF : %.4f\n' % accuracy2)

LR, WordLevel TF-IDF : 0.8730

CPU times: user 310 ms, sys: 117 ms, total: 427 ms
Wall time: 120 ms


In [27]:
%%time
# Linear Classifier on Ngram Level TF IDF Vectors
accuracy3 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('LR, N-Gram Vectors   : %.4f\n' % accuracy3)

LR, N-Gram Vectors   : 0.8360

CPU times: user 52.4 ms, sys: 3.11 ms, total: 55.5 ms
Wall time: 54.4 ms


In [28]:
%%time
# Linear Classifier on Character Level TF IDF Vectors
accuracy4 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('LR, CharLevel Vectors: %.4f\n' % accuracy4)

LR, CharLevel Vectors: 0.8485

CPU times: user 332 ms, sys: 5.35 ms, total: 338 ms
Wall time: 336 ms


In [30]:
results.loc['Logistic Regression'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

print(results)

                     Count Vectors  WordLevel TF-IDF  N-Gram Vectors  \
Naïve Bayes                  0.854             0.860           0.840   
Logistic Regression          0.852             0.873           0.836   

                     CharLevel Vectors  
Naïve Bayes                     0.8180  
Logistic Regression             0.8485  


### Support Vector Machine

In [31]:
%%time
# Support Vector Machine on Count Vectors
accuracy1 = train_model(LinearSVC(), X_train_count, y_train, X_test_count)
print('SVM, Count Vectors    : %.4f\n' % accuracy1)



SVM, Count Vectors    : 0.8345

CPU times: user 612 ms, sys: 17.9 ms, total: 630 ms
Wall time: 656 ms


In [32]:
%%time
# Support Vector Machine on Word Level TF IDF Vectors
accuracy2 = train_model(LinearSVC(), X_train_tfidf, y_train, X_test_tfidf)
print('SVM, WordLevel TF-IDF : %.4f\n' % accuracy2)

SVM, WordLevel TF-IDF : 0.8610

CPU times: user 97.4 ms, sys: 6.85 ms, total: 104 ms
Wall time: 117 ms




In [33]:
%%time
# Support Vector Machine on Ngram Level TF IDF Vectors
accuracy3 = train_model(LinearSVC(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('SVM, N-Gram Vectors   : %.4f\n' % accuracy3)

SVM, N-Gram Vectors   : 0.8210

CPU times: user 79.3 ms, sys: 5.11 ms, total: 84.4 ms
Wall time: 91.1 ms




In [34]:
%%time
# Support Vector Machine on Character Level TF IDF Vectors
accuracy4 = train_model(LinearSVC(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('SVM, CharLevel Vectors: %.4f\n' % accuracy4)



SVM, CharLevel Vectors: 0.8570

CPU times: user 413 ms, sys: 26.3 ms, total: 440 ms
Wall time: 477 ms


In [35]:
results.loc['Support Vector Machine'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

print(results)

                        Count Vectors  WordLevel TF-IDF  N-Gram Vectors  \
Naïve Bayes                    0.8540             0.860           0.840   
Logistic Regression            0.8520             0.873           0.836   
Support Vector Machine         0.8345             0.861           0.821   

                        CharLevel Vectors  
Naïve Bayes                        0.8180  
Logistic Regression                0.8485  
Support Vector Machine             0.8570  


### Bagging Models

In [36]:
%%time
# Bagging (Random Forest) on Count Vectors
accuracy1 = train_model(RandomForestClassifier(n_estimators = 100), X_train_count, y_train, X_test_count)
print('RF, Count Vectors    : %.4f\n' % accuracy1)

RF, Count Vectors    : 0.8300

CPU times: user 16.6 s, sys: 235 ms, total: 16.8 s
Wall time: 17.4 s


In [37]:
%%time
# Bagging (Random Forest) on Word Level TF IDF Vectors
accuracy2 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf, y_train, X_test_tfidf)
print('RF, WordLevel TF-IDF : %.4f\n' % accuracy2)

RF, WordLevel TF-IDF : 0.8265

CPU times: user 9.37 s, sys: 134 ms, total: 9.51 s
Wall time: 9.75 s


In [38]:
%%time
# Bagging (Random Forest) on Ngram Level TF IDF Vectors
accuracy3 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('RF, N-Gram Vectors   : %.4f\n' % accuracy3)

RF, N-Gram Vectors   : 0.7885

CPU times: user 9.64 s, sys: 131 ms, total: 9.77 s
Wall time: 9.94 s


In [39]:
%%time
# Bagging (Random Forest) on Character Level TF IDF Vectors
accuracy4 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('RF, CharLevel Vectors: %.4f\n' % accuracy4)

RF, CharLevel Vectors: 0.7725

CPU times: user 38.7 s, sys: 720 ms, total: 39.4 s
Wall time: 41.3 s


In [40]:
results.loc['Random Forest'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

print(results)

### Boosting Models

In [41]:
%%time
# Gradient Boosting on Count Vectors
accuracy1 = train_model(GradientBoostingClassifier(), X_train_count, y_train, X_test_count)
print('GB, Count Vectors    : %.4f\n' % accuracy1)

GB, Count Vectors    : 0.7990

CPU times: user 20.6 s, sys: 312 ms, total: 20.9 s
Wall time: 21.1 s


In [42]:
%%time
# Gradient Boosting on Word Level TF IDF Vectors
accuracy2 = train_model(GradientBoostingClassifier(), X_train_tfidf, y_train, X_test_tfidf)
print('GB, WordLevel TF-IDF : %.4f\n' % accuracy2)

GB, WordLevel TF-IDF : 0.7950

CPU times: user 20.6 s, sys: 134 ms, total: 20.7 s
Wall time: 20.8 s


In [43]:
%%time
# Gradient Boosting on Ngram Level TF IDF Vectors
accuracy3 = train_model(GradientBoostingClassifier(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('GB, N-Gram Vectors   : %.4f\n' % accuracy3)

GB, N-Gram Vectors   : 0.7365

CPU times: user 18.1 s, sys: 401 ms, total: 18.5 s
Wall time: 20.2 s


In [44]:
%%time
# Gradient Boosting on Character Level TF IDF Vectors
accuracy4 = train_model(GradientBoostingClassifier(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('GB, CharLevel Vectors: %.4f\n' % accuracy4)

GB, CharLevel Vectors: 0.8025

CPU times: user 4min 25s, sys: 5.52 s, total: 4min 31s
Wall time: 4min 51s


In [46]:
results.loc['Gradient Boosting'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

print(results)

                        Count Vectors  WordLevel TF-IDF  N-Gram Vectors  \
Naïve Bayes                    0.8540            0.8600          0.8400   
Logistic Regression            0.8520            0.8730          0.8360   
Support Vector Machine         0.8345            0.8610          0.8210   
Random Forest                  0.8300            0.8265          0.7885   
Gradient Boosting              0.7990            0.7950          0.7365   

                        CharLevel Vectors  
Naïve Bayes                        0.8180  
Logistic Regression                0.8485  
Support Vector Machine             0.8570  
Random Forest                      0.7725  
Gradient Boosting                  0.8025  


In [47]:
results

| Model | Count Vectors | WordLevel TF-IDF | N-Gram Vectors | CharLevel Vectors |
|---|---|---|---|---|
| Naïve Bayes | 0.8540 | 0.8600 | 0.8400 | 0.8180 |
| Logistic Regression | 0.8520 | 0.8730 | 0.8360 | 0.8485 |
| Support Vector Machine | 0.8345 | 0.8610 | 0.8210 | 0.8570 |
| Random Forest | 0.8300 | 0.8265 | 0.7885 | 0.7725 |
| Gradient Boosting | 0.7990 | 0.7950 | 0.7365 | 0.8025 |
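
To read the comparison at a glance, the sketch below looks up the best model and feature combination overall and the best feature set per model. The accuracies are copied from the results printed above into a plain dictionary named `accuracies` (a name chosen for this example), so the snippet runs without the notebook state.

```python
# Accuracies copied from the results printed above
features = ["Count Vectors", "WordLevel TF-IDF", "N-Gram Vectors", "CharLevel Vectors"]
rows = {
    "Naïve Bayes":            [0.8540, 0.8600, 0.8400, 0.8180],
    "Logistic Regression":    [0.8520, 0.8730, 0.8360, 0.8485],
    "Support Vector Machine": [0.8345, 0.8610, 0.8210, 0.8570],
    "Random Forest":          [0.8300, 0.8265, 0.7885, 0.7725],
    "Gradient Boosting":      [0.7990, 0.7950, 0.7365, 0.8025],
}
accuracies = {model: dict(zip(features, values)) for model, values in rows.items()}

# Best (model, feature set) pair overall
best_model, best_feature, best_acc = max(
    ((m, f, acc) for m, row in accuracies.items() for f, acc in row.items()),
    key=lambda item: item[2],
)
print(best_model, "|", best_feature, "|", best_acc)

# Best feature set for each model
best_per_model = {m: max(row, key=row.get) for m, row in accuracies.items()}
for model, feature in best_per_model.items():
    print(f"{model:<24}{feature:<20}{accuracies[model][feature]:.4f}")
```

If the `results` DataFrame is still in memory, `results.stack().idxmax()` should return the same overall pair.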




---

© 2023 Institute of Data



