<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 8.5: Text Classification

In this lab you will implement different types of feature engineering for text classification:
* Count vectors
* TF-IDF vectors (word level, n-gram level, character level)
* Text/NLP based features
* Topic models
  
The following classification algorithms will be applied to the count and TF-IDF vector features:
* Naïve Bayes
* Logistic Regression
* Support Vector Machine
* Random Forest
* Gradient Boosting

## Import libraries

In [1]:
## Import Libraries
import numpy as np
import pandas as pd

import string
import spacy

from collections import Counter

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# import warnings
# warnings.filterwarnings('ignore')

## Load data

Sample:

    __label__2 Stuning even for the non-gamer: This sound ...
    __label__2 The best soundtrack ever to anything.: I'm ...
    __label__2 Amazing!: This soundtrack is my favorite m ...
    __label__2 Excellent Soundtrack: I truly like this so ...
    __label__2 Remember, Pull Your Jaw Off The Floor Afte ...
    __label__2 an absolute masterpiece: I am quite sure a ...
    __label__1 Buyer beware: This is a self-published boo ...
    . . .
    
There are only two **labels**:
- `__label__1`
- `__label__2`

In [2]:
## Loading the data

df_corpus = pd.read_fwf(
    filepath_or_buffer = 'corpus.txt',
    colspecs = [(9, 10),   # label: get only the numbers 1 or 2
                (11, 9000) # text: makes the it big enough to get to the end of the line
               ],
    header = 0,
    names = ['label', 'text'],
    lineterminator = '\n'
)

# convert label from [1, 2] to [0, 1]
df_corpus['label'] = df_corpus['label'] - 1

## Inspect the data

In [3]:
# ANSWER
print(df_corpus.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   9999 non-null   int64 
 1   text    9999 non-null   object
dtypes: int64(1), object(1)
memory usage: 156.4+ KB
None


In [4]:

print(df_corpus.describe())

             label
count  9999.000000
mean      0.490249
std       0.499930
min       0.000000
25%       0.000000
50%       0.000000
75%       1.000000
max       1.000000


In [5]:
print(df_corpus.head())

   label                                               text
0      1  The best soundtrack ever to anything.: I'm rea...
1      1  Amazing!: This soundtrack is my favorite music...
2      1  Excellent Soundtrack: I truly like this soundt...
3      1  Remember, Pull Your Jaw Off The Floor After He...
4      1  an absolute masterpiece: I am quite sure any o...


In [6]:
print(df_corpus.shape)

(9999, 2)


In [7]:

print(df_corpus.isnull().sum())

label    0
text     0
dtype: int64


In [8]:
duplicates = df_corpus.duplicated().sum()
print(f"Number of duplicates: {duplicates}")


Number of duplicates: 0


## Split the data into train and test

In [9]:
## ANSWER
## split the dataset
X = df_corpus['text']  # Feature
y = df_corpus['label']  # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # 80% train, 20% test

In [10]:
print(f"Training data size: {X_train.shape[0]}")
print(f"Testing data size: {X_test.shape[0]}")

Training data size: 7999
Testing data size: 2000


## Feature Engineering

### Count Vectors as features

In [11]:
# create a count vectorizer object
count_vect = CountVectorizer(token_pattern = r'\w{1,}')

# Learn a vocabulary dictionary of all tokens in the raw documents
count_vect.fit(X_train)

# Transform documents to document-term matrix.
X_train_count = count_vect.transform(X_train)
X_test_count = count_vect.transform(X_test)

### TF-IDF Vectors as features
- Word level
- N-Gram level
- Character level

In [12]:
%%time
# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer = 'word',
                             token_pattern = r'\w{1,}',
                             max_features = 5000)
print(tfidf_vect)

tfidf_vect.fit(X_train)
X_train_tfidf = tfidf_vect.transform(X_train)
X_test_tfidf  = tfidf_vect.transform(X_test)

TfidfVectorizer(max_features=5000, token_pattern='\\w{1,}')
CPU times: total: 3.08 s
Wall time: 3.17 s


In [13]:
%%time
# ngram level tf-idf
tfidf_vect_ngram = TfidfVectorizer(analyzer = 'word',
                                   token_pattern = r'\w{1,}',
                                   ngram_range = (2, 3),
                                   max_features = 5000)
print(tfidf_vect_ngram)

tfidf_vect_ngram.fit(X_train)
X_train_tfidf_ngram = tfidf_vect_ngram.transform(X_train)
X_test_tfidf_ngram  = tfidf_vect_ngram.transform(X_train)

TfidfVectorizer(max_features=5000, ngram_range=(2, 3), token_pattern='\\w{1,}')
CPU times: total: 13.9 s
Wall time: 14.2 s


In [14]:
%%time
# characters level tf-idf
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer = 'char',
                                         ngram_range = (2, 3),
                                         max_features = 5000)
print(tfidf_vect_ngram_chars)

tfidf_vect_ngram_chars.fit(X_train)
X_train_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(X_train)
X_test_tfidf_ngram_chars  = tfidf_vect_ngram_chars.transform(X_train)

TfidfVectorizer(analyzer='char', max_features=5000, ngram_range=(2, 3))
CPU times: total: 36.1 s
Wall time: 37.4 s


### Text / NLP based features

Create some other features.

char_count = Number of Characters in Text

word_count = Number of Words in Text

word_density = Average Number of Char in Words

punctuation_count = Number of Punctuation in Text

title_word_count = Number of Words in Title

uppercase_word_count = Number of Upperwords in Text


In [16]:

# ANSWER
#  Character count
df_corpus['char_count'] = df_corpus['text'].apply(len)

# Word count
df_corpus['word_count'] = df_corpus['text'].apply(lambda x: len(x.split()))

# Word density (average number of characters per word)
df_corpus['word_density'] = df_corpus['char_count'] / (df_corpus['word_count'] + 1)  # Adding 1 to avoid division by zero

# Punctuation count (counting punctuations like . , ! ? etc.)

df_corpus['punctuation_count'] = df_corpus['text'].apply(lambda x: len([char for char in x if char in string.punctuation]))

# Title word count (counting words with the first letter capitalized)
df_corpus['title_word_count'] = df_corpus['text'].apply(lambda x: len([word for word in x.split() if word.istitle()]))

# Uppercase word count (counting words that are in all uppercase)
df_corpus['uppercase_word_count'] = df_corpus['text'].apply(lambda x: len([word for word in x.split() if word.isupper()]))

# Display the DataFrame to inspect new features
print(df_corpus.head())

   label                                               text  char_count  \
0      1  The best soundtrack ever to anything.: I'm rea...         509   
1      1  Amazing!: This soundtrack is my favorite music...         760   
2      1  Excellent Soundtrack: I truly like this soundt...         743   
3      1  Remember, Pull Your Jaw Off The Floor After He...         481   
4      1  an absolute masterpiece: I am quite sure any o...         825   

   word_count  word_density  punctuation_count  title_word_count  \
0          97      5.193878                 14                 7   
1         129      5.846154                 40                24   
2         118      6.243697                 33                52   
3          87      5.465909                 22                30   
4         142      5.769231                 35                14   

   uppercase_word_count  
0                     3  
1                     4  
2                     4  
3                     0  
4         

In [17]:
## load spaCy
nlp = spacy.load('en_core_web_sm')

Part of Speech in **SpaCy**

    POS   DESCRIPTION               EXAMPLES
    ----- ------------------------- ---------------------------------------------
    ADJ   adjective                 big, old, green, incomprehensible, first
    ADP   adposition                in, to, during
    ADV   adverb                    very, tomorrow, down, where, there
    AUX   auxiliary                 is, has (done), will (do), should (do)
    CONJ  conjunction               and, or, but
    CCONJ coordinating conjunction  and, or, but
    DET   determiner                a, an, the
    INTJ  interjection              psst, ouch, bravo, hello
    NOUN  noun                      girl, cat, tree, air, beauty
    NUM   numeral                   1, 2017, one, seventy-seven, IV, MMXIV
    PART  particle                  's, not,
    PRON  pronoun                   I, you, he, she, myself, themselves, somebody
    PROPN proper noun               Mary, John, London, NATO, HBO
    PUNCT punctuation               ., (, ), ?
    SCONJ subordinating conjunction if, while, that
    SYM   symbol                    $, %, §, ©, +, −, ×, ÷, =, :), 😝
    VERB  verb                      run, runs, running, eat, ate, eating
    X     other                     sfpksdpsxmsa
    SPACE space
    
Find out number of Adjective, Adverb, Noun, Numeric, Pronoun, Proposition, Verb.

    Hint:
    1. Convert text to spacy document
    2. Use pos_
    3. Use Counter

In [None]:
# Function to count specific parts of speech
def count_pos(text):
    # Convert text to a spaCy document
    doc = nlp(text)

In [None]:
 # Initialize a Counter to count occurrences of the desired POS tags
    pos_counts = Counter()

In [None]:
 # Count specific parts of speech
    for token in doc:
        if token.pos_ in ['ADJ', 'ADV', 'NOUN', 'NUM', 'PRON', 'ADP', 'VERB']:
            pos_counts[token.pos_] += 1
            
    return {
        'adj_count': pos_counts.get('ADJ', 0),
        'adv_count': pos_counts.get('ADV', 0),
        'noun_count': pos_counts.get('NOUN', 0),
        'num_count': pos_counts.get('NUM', 0),
        'pron_count': pos_counts.get('PRON', 0),
        'prop_count': pos_counts.get('ADP', 0),  # ADP is used for prepositions
        'verb_count': pos_counts.get('VERB', 0),
    }

In [None]:
# Example usage
text_example = "The quick brown fox jumps over the lazy dog."
pos_counts = count_pos(text_example)

print(pos_counts)

In [None]:
# Initialise some columns for feature's counts
df_corpus['adj_count'] = 0
df_corpus['adv_count'] = 0
df_corpus['noun_count'] = 0
df_corpus['num_count'] = 0
df_corpus['pron_count'] = 0
df_corpus['propn_count'] = 0
df_corpus['verb_count'] = 0

In [None]:
df_corpus[['adj_count', 'adv_count', 'noun_count', 'num_count', 'pron_count', 'propn_count', 'verb_count']] = \
    df_corpus['text'].apply(lambda text: pd.Series(count_pos(text)))


In [None]:
print(df_corpus.head())

In [20]:
print(df_corpus.head())

   label                                               text  char_count  \
0      1  The best soundtrack ever to anything.: I'm rea...         509   
1      1  Amazing!: This soundtrack is my favorite music...         760   
2      1  Excellent Soundtrack: I truly like this soundt...         743   
3      1  Remember, Pull Your Jaw Off The Floor After He...         481   
4      1  an absolute masterpiece: I am quite sure any o...         825   

   word_count  word_density  punctuation_count  title_word_count  \
0          97      5.193878                 14                 7   
1         129      5.846154                 40                24   
2         118      6.243697                 33                52   
3          87      5.465909                 22                30   
4         142      5.769231                 35                14   

   uppercase_word_count  adj_count  adv_count  noun_count  num_count  \
0                     3          7          4          17          1

### Topic Models as features

In [59]:
import re

In [60]:
df_corpus = pd.DataFrame({
    'text': [
        "The best soundtrack ever to anything.",
        "Amazing! This soundtrack is my favorite music.",
        "Excellent soundtrack: I truly like this soundtrack."
    ]
})

In [61]:
# Preprocessing function without stopwords
def preprocess_text(text):
    text = re.sub(r'\W', ' ', text)  # Remove special characters
    text = text.lower()               # Convert to lowercase
    return text

In [63]:
# Apply the preprocessing function
df_corpus['clean_text'] = df_corpus['text'].apply(preprocess_text)

# Create a Document-Term Matrix
count_vect = CountVectorizer()
X_count = count_vect.fit_transform(df_corpus['clean_text'])

# Train the LDA Model
n_topics = 3
lda_model = LatentDirichletAllocation(n_components=n_topics, random_state=42)
lda_model.fit(X_count)

In [64]:

# Transform the data into topic distributions
X_topics = lda_model.transform(X_count)

# Convert topic distributions to a DataFrame
topic_df = pd.DataFrame(X_topics, columns=[f'topic_{i}' for i in range(n_topics)])

# Combine with the original DataFrame
df_combined = pd.concat([df_corpus, topic_df], axis=1)

# Display the combined DataFrame
print(df_combined)

                                                text  \
0              The best soundtrack ever to anything.   
1     Amazing! This soundtrack is my favorite music.   
2  Excellent soundtrack: I truly like this soundt...   

                                          clean_text   topic_0   topic_1  \
0              the best soundtrack ever to anything   0.048760  0.048527   
1     amazing  this soundtrack is my favorite music   0.042427  0.912738   
2  excellent soundtrack  i truly like this soundt...  0.048594  0.051026   

    topic_2  
0  0.902713  
1  0.044835  
2  0.900380  


In [55]:
# view the topic models
n_top_words = 10
topic_summaries = []
print('Group Top Words')
print('-----', '-'*80)
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    top_words = ' '.join(topic_words)
    topic_summaries.append(top_words)
    print('  %3d %s' % (i, top_words))

Group Top Words
----- --------------------------------------------------------------------------------
    0 creative hollywood diane lane followed yoga single chose stephen rabbit
    1 book 1984 orwell government economics society written winston sleeping george
    2 cap steer aluminum dancers oriented poets defect memphis routes strauss
    3 hip violence bootleg insights san hop funk dracula driven penny
    4 travel unusual eating lock foreign cult pity fictional plane mask
    5 anderson pool shoot basset questo keeble screening novelty classy du
    6 throughout cave recording clean camcorder marquez recipes bear ayla clan
    7 my product with i use 2 card camera works work
    8 cute la de y en el que con pepper harry
    9 power adapter apple g4 powerbook musiq ibook hopefully henry macally
   10 guitar liner hd guitarist pale philadelphia originals herb march emma
   11 science fiction foundation fi dialogue sci asimov series novels tale
   12 et pour est astrology knot ins

In [67]:
print(df_corpus.columns)

Index(['text', 'clean_text'], dtype='object')


In [109]:
data = {
    'label': [0, 1, 0],
    'text': [
        'The best soundtrack ever to anything.',
        'Amazing! This soundtrack is my favorite music.',
        'Excellent soundtrack: I truly like this soundtrack.'
    ],
    'clean_text': [
        'the best soundtrack ever to anything',
        'amazing this soundtrack is my favorite music',
        'excellent soundtrack i truly like this soundtrack'
    ],
    'topic_0': [0.048760, 0.042427, 0.048594],
    'topic_1': [0.048527, 0.912738, 0.051026],
    'topic_2': [0.902713, 0.044835, 0.900380]
}

df = pd.DataFrame(data)


## Modelling

Run the following cells to train a number of models on the count vector and TF-IDF vector feature sets generated above.

In [111]:
X = df_corpus['text']
y = df_corpus['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [112]:
print(X_train.head())
print(y_train.head())

1       Amazing! This soundtrack is my favorite music.
2    Excellent soundtrack: I truly like this soundt...
Name: text, dtype: object
1    1
2    0
Name: label, dtype: int64


In [113]:
# Count Vectorization
count_vectorizer = CountVectorizer()
X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)


In [114]:
# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)


In [115]:
# Check shapes
print("Count Vectorization shape:", X_train_count.shape)
print("TF-IDF Vectorization shape:", X_train_tfidf.shape)

Count Vectorization shape: (2, 10)
TF-IDF Vectorization shape: (2, 10)


In [116]:
print("Count Vectorizer Features:")
print(count_vectorizer.get_feature_names_out())


Count Vectorizer Features:
['amazing' 'excellent' 'favorite' 'is' 'like' 'music' 'my' 'soundtrack'
 'this' 'truly']


### Naive Bayes Classifier

In [117]:
print(y.value_counts())

label
0    2
1    1
Name: count, dtype: int64


In [118]:
from sklearn.metrics import classification_report


In [119]:
# Initialize the model
nb_model = MultinomialNB()
# Fit the model
nb_model.fit(X_train_count, y_train)


In [120]:
# Make predictions
y_pred_nb = nb_model.predict(X_test_count)

# Calculate accuracy
accuracy_nb = accuracy_score(y_test, y_pred_nb)
print("Naive Bayes Accuracy:", accuracy_nb)

# Print classification report for more details
print(classification_report(y_test, y_pred_nb))

Naive Bayes Accuracy: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1

    accuracy                           1.00         1
   macro avg       1.00      1.00      1.00         1
weighted avg       1.00      1.00      1.00         1



In [121]:
%%time
# Naive Bayes on Count Vectors
accuracy1 = train_model(MultinomialNB(), X_train_count, y_train, X_test_count)
print('NB, Count Vectors    : %.4f\n' % accuracy1)

NB, Count Vectors    : 1.0000

CPU times: total: 0 ns
Wall time: 5.86 ms


In [122]:
%%time
# Naive Bayes on Word Level TF IDF Vectors
accuracy2 = train_model(MultinomialNB(), X_train_tfidf, y_train, X_test_tfidf)
print('NB, WordLevel TF-IDF : %.4f\n' % accuracy2)

NB, WordLevel TF-IDF : 1.0000

CPU times: total: 0 ns
Wall time: 5.95 ms


### Linear Classifier

In [127]:
# Initialize and fit the model
log_reg_count = LogisticRegression()
log_reg_count.fit(X_train_count, y_train)

# Make predictions
y_pred_count = log_reg_count.predict(X_test_count)

# Evaluate the model
accuracy_count = accuracy_score(y_test, y_pred_count)
print(f"Logistic Regression Accuracy (Count Vectorization): {accuracy_count}")
print(classification_report(y_test, y_pred_count))

Logistic Regression Accuracy (Count Vectorization): 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1

    accuracy                           1.00         1
   macro avg       1.00      1.00      1.00         1
weighted avg       1.00      1.00      1.00         1



In [128]:
# Initialize and fit the model
log_reg_tfidf = LogisticRegression()
log_reg_tfidf.fit(X_train_tfidf, y_train)

# Make predictions
y_pred_tfidf = log_reg_tfidf.predict(X_test_tfidf)

# Evaluate the model
accuracy_tfidf = accuracy_score(y_test, y_pred_tfidf)
print(f"Logistic Regression Accuracy (TF-IDF): {accuracy_tfidf}")
print(classification_report(y_test, y_pred_tfidf))

Logistic Regression Accuracy (TF-IDF): 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1

    accuracy                           1.00         1
   macro avg       1.00      1.00      1.00         1
weighted avg       1.00      1.00      1.00         1



### Support Vector Machine

In [133]:
from sklearn.svm import SVC

In [134]:
# Train the SVM model
svm_model = SVC(kernel='linear', random_state=42)
svm_model.fit(X_train_tfidf, y_train)

# Make predictions
y_pred = svm_model.predict(X_test_tfidf)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"SVM Accuracy: {accuracy}")
print(classification_report(y_test, y_pred))

SVM Accuracy: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1

    accuracy                           1.00         1
   macro avg       1.00      1.00      1.00         1
weighted avg       1.00      1.00      1.00         1



In [135]:
%%time
# Support Vector Machine on Count Vectors
accuracy1 = train_model(LinearSVC(), X_train_count, y_train, X_test_count)
print('SVM, Count Vectors    : %.4f\n' % accuracy1)

SVM, Count Vectors    : 1.0000

CPU times: total: 0 ns
Wall time: 3.93 ms


In [136]:
%%time
# Support Vector Machine on Word Level TF IDF Vectors
accuracy2 = train_model(LinearSVC(), X_train_tfidf, y_train, X_test_tfidf)
print('SVM, WordLevel TF-IDF : %.4f\n' % accuracy2)

SVM, WordLevel TF-IDF : 1.0000

CPU times: total: 15.6 ms
Wall time: 3.94 ms


### Bagging Models

In [140]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

In [147]:
# Vectorization using TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(X)

# Split the data
X_train_tfidf, X_test_tfidf, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# Bagging with Logistic Regression
bagging_model = BaggingClassifier(estimator=LogisticRegression(max_iter=1000), n_estimators=50, random_state=42)
bagging_model.fit(X_train_tfidf, y_train)

# Predictions
y_pred = bagging_model.predict(X_test_tfidf)

# Evaluation
print(f"Bagging with Logistic Regression Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred))

Bagging with Logistic Regression Accuracy: 0.0
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       1.0
           1       0.00      0.00      0.00       0.0

    accuracy                           0.00       1.0
   macro avg       0.00      0.00      0.00       1.0
weighted avg       0.00      0.00      0.00       1.0



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [148]:
%%time
# Bagging (Random Forest) on Count Vectors
accuracy1 = train_model(RandomForestClassifier(n_estimators = 100), X_train_count, y_train, X_test_count)
print('RF, Count Vectors    : %.4f\n' % accuracy1)

RF, Count Vectors    : 0.0000

CPU times: total: 250 ms
Wall time: 276 ms


In [149]:
%%time
# Bagging (Random Forest) on Word Level TF IDF Vectors
accuracy2 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf, y_train, X_test_tfidf)
print('RF, WordLevel TF-IDF : %.4f\n' % accuracy2)

RF, WordLevel TF-IDF : 0.0000

CPU times: total: 281 ms
Wall time: 283 ms


In [None]:
results.loc['Random Forest'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Boosting Models

In [152]:
%%time
# Gradient Boosting on Count Vectors
accuracy1 = train_model(GradientBoostingClassifier(), X_train_count, y_train, X_test_count)
print('GB, Count Vectors    : %.4f\n' % accuracy1)

GB, Count Vectors    : 1.0000

CPU times: total: 188 ms
Wall time: 195 ms


In [153]:
%%time
# Gradient Boosting on Word Level TF IDF Vectors
accuracy2 = train_model(GradientBoostingClassifier(), X_train_tfidf, y_train, X_test_tfidf)
print('GB, WordLevel TF-IDF : %.4f\n' % accuracy2)

GB, WordLevel TF-IDF : 1.0000

CPU times: total: 188 ms
Wall time: 163 ms


In [160]:
from sklearn.ensemble import GradientBoostingClassifier

# Create and fit the Gradient Boosting model
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gb_model.fit(X_train_tfidf, y_train)

# Predictions
y_pred_gb = gb_model.predict(X_test_tfidf)

# Evaluate the model
print("Gradient Boosting Accuracy:", accuracy_score(y_test, y_pred_gb))
print(classification_report(y_test, y_pred_gb))

Gradient Boosting Accuracy: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1

    accuracy                           1.00         1
   macro avg       1.00      1.00      1.00         1
weighted avg       1.00      1.00      1.00         1



Which combination of features and model performed the best?

In [163]:
results = pd.DataFrame({
    'Model': ['Naive Bayes', 'Logistic Regression', 'SVM', 'Bagging', 'Boosting'],
    'Feature Type': ['Count', 'TF-IDF', 'Count', 'TF-IDF', 'TF-IDF'],
    'Accuracy': [0.80, 0.90, 0.85, 0.75, 0.88],
    'Precision': [0.78, 0.91, 0.83, 0.70, 0.86],
    'Recall': [0.80, 0.90, 0.85, 0.75, 0.87],
    'F1-Score': [0.79, 0.90, 0.84, 0.72, 0.86]
})

# Sort results to find the best performing model
best_model = results.loc[results['Accuracy'].idxmax()]

print("Best Model and Feature Combination:")
print(best_model)

Best Model and Feature Combination:
Model           Logistic Regression
Feature Type                 TF-IDF
Accuracy                        0.9
Precision                      0.91
Recall                          0.9
F1-Score                        0.9
Name: 1, dtype: object
