## Instruction

> 1. Rename Assignment-01-###.ipynb where ### is your student ID.
> 2. The deadline for Assignment-01 is 23:59 on 03-31-2024.
>
> 3. In this assignment, you will
>    1) explore Wikipedia text data
>    2) build language models
>    3) build NB and LR classifiers

## Task0 - Download datasets
> Download the preprocessed data, enwiki-train.json and enwiki-test.json, from the Assignment-01 folder. In each data file, every line contains one Wikipedia page with three attributes: title, label, and text. There are 1000 records in the train file and 100 records in the test file, covering ten categories.

## Task1 - Data exploring and preprocessing

> 1) Print out how many documents are in each class  (for both train and test dataset)

In [3]:
# Your code goes to here
import json
from collections import defaultdict

def load_data(file_name):
    """Load data from JSON file."""
    with open(file_name, "r") as f:
        data = [json.loads(line) for line in f.readlines()]
    return data

def get_class_count(data, dataset_type):
    """Get the number of documents for each class."""
    class_cnts = defaultdict(int)
    for record in data:
        class_cnts[record['label']] += 1
    print(f"Document counts per class in {dataset_type}:")
    for label, count in class_cnts.items():
        print(f"{label}: {count}")


train_data = load_data('enwiki-train.json')
test_data = load_data('enwiki-test.json')

get_class_count(train_data, "enwiki-train.json")
print("-----------------------")
get_class_count(test_data, "enwiki-test.json")

Document counts per class in enwiki-train.json:
Film: 100
Book: 100
Politician: 100
Writer: 100
Food: 100
Actor: 70
Animal: 80
Software: 130
Artist: 100
Disease: 120
-----------------------
Document counts per class in enwiki-test.json:
Film: 10
Book: 10
Politician: 10
Writer: 10
Food: 10
Actor: 10
Animal: 10
Software: 10
Artist: 10
Disease: 10


> 2) Print out the average number of sentences in each class.
>    You may need to use sentence tokenization of NLTK.
>    (for both train and test dataset)


In [4]:
# Your code goes to here
from nltk.tokenize import sent_tokenize

def get_avg_sen(data, dataset_type):
    """Get the average number of sentences per class."""
    class_sen_cnts = defaultdict(list)
    for record in data:
        sentences = sent_tokenize(record['text'])  # split text into sentences
        class_sen_cnts[record['label']].append(len(sentences))
    print(f"Average number of sentences per class in {dataset_type}:")
    for label, counts in class_sen_cnts.items():
        avg_sen = sum(counts) / len(counts) if len(counts) > 0 else 0
        print(f"{label}: {avg_sen}")

# train_data = load_data('enwiki-train.json')
# test_data = load_data('enwiki-test.json')

get_avg_sen(train_data, "enwiki-train.json")
print("-----------------------")
get_avg_sen(test_data, "enwiki-test.json")

Average number of sentences per class in enwiki-train.json:
Film: 438.56
Book: 400.36
Politician: 706.2
Writer: 420.32
Food: 175.24
Actor: 76.7
Animal: 70.375
Software: 260.95384615384614
Artist: 306.47
Disease: 404.9
-----------------------
Average number of sentences per class in enwiki-test.json:
Film: 364.7
Book: 295.9
Politician: 597.6
Writer: 294.9
Food: 107.6
Actor: 30.7
Animal: 46.8
Software: 160.1
Artist: 234.0
Disease: 311.7


> 3) Print out the average number of tokens in each class
>    (for both train and test dataset)

In [5]:
# Your code goes to here
from nltk.tokenize import word_tokenize

def get_avg_tokens(data, dataset_type):
    """Get the average number of tokens per class."""
    class_token_cnts = defaultdict(list)
    for record in data:
        tokens = word_tokenize(record['text'])
        class_token_cnts[record['label']].append(len(tokens))
    print(f"Average number of tokens per class in {dataset_type}:")
    for label, counts in class_token_cnts.items():
        avg_tokens = sum(counts) / len(counts) if len(counts) > 0 else 0
        print(f"{label}: {avg_tokens}")

# train_data = load_data('enwiki-train.json')
# test_data = load_data('enwiki-test.json')

get_avg_tokens(train_data, "enwiki-train.json")
print("-----------------------")
get_avg_tokens(test_data, "enwiki-test.json")

Average number of tokens per class in enwiki-train.json:
Film: 11895.28
Book: 10540.51
Politician: 18644.3
Writer: 11849.91
Food: 3904.15
Actor: 1868.8428571428572
Animal: 1521.925
Software: 6302.3
Artist: 8212.91
Disease: 9322.958333333334
-----------------------
Average number of tokens per class in enwiki-test.json:
Film: 9292.9
Book: 7711.1
Politician: 15204.3
Writer: 8499.4
Food: 2445.5
Actor: 677.5
Animal: 885.6
Software: 3972.8
Artist: 5706.4
Disease: 6988.8


> 4) For each sentence in the document, remove punctuations and other special characters so that each sentence only contains English words and numbers. To make your life easier, you can make all words as lower cases. For each class, print out the first article's name and the processed first 40 words. (for both train and test dataset)

In [6]:
# Your code goes to here
import re

def clean_sen(sentence):
    """Clean a sentence by removing special characters and converting to lower case."""
    sentence = re.sub(r'[^a-zA-Z0-9\s]', '', sentence)  # remove special characters
    sentence = sentence.lower()  # convert to lower case
    return sentence

def clean_data(data):
    """Clean texts in the dataset."""
    for record in data:
        sentences = sent_tokenize(record['text'])
        cleaned_sen = [clean_sen(sen) for sen in sentences]
        record['text'] = cleaned_sen
    return data

def article_print(data, dataset_type):
    """Join sentence list into text and print the first article's title and the first 40 words for each class."""
    first_articles = {}  # key is class name, value is (title, first 40 words)
    for record in data:
        label = record['label']
        record['text'] = ' '.join(record['text'])
        if label not in first_articles:
            tokens = word_tokenize(record['text'])
            first_articles[label] = (record['title'], tokens[:40])

    print(f"## Information for each class in {dataset_type}:")
    for label, (title, tokens) in first_articles.items():
        print(f"Class name: {label}")
        print(f"The first article's title: {title}")
        print("Processed first 40 words:", ' '.join(tokens), "\n")

train_data = load_data('enwiki-train.json')
test_data = load_data('enwiki-test.json')
cleaned_train_data = clean_data(train_data)
cleaned_test_data = clean_data(test_data)


article_print(cleaned_train_data, "enwiki-train.json")
print("-----------------------\n")
article_print(cleaned_test_data, "enwiki-test.json")

## Information for each class in enwiki-train.json:
Class name: Film
The first article's title: Citizen_Kane
Processed first 40 words: citizen kane is a 1941 american drama film produced by directed by and starring orson welles he also cowrote the screenplay with herman j mankiewicz the picture was welles first feature film citizen kane is considered by many critics and 

Class name: Book
The first article's title: The_Spirit_of_the_Age
Processed first 40 words: the spirit of the age full title the spirit of the age or contemporary portraits is a collection of character sketches by the early 19th century english essayist literary critic and social commentator william hazlitt portraying 25 men mostly british 

Class name: Politician
The first article's title: Charles_de_Gaulle
Processed first 40 words: charles andr joseph marie de gaulle 22 november 18909 november 1970 was a french army officer and statesman who led free france against nazi germany in world war ii and chaired the provis

## Task2 - Build language models

> 1) Based on the training dataset, build unigram, bigram, and trigram language models using Add-one smoothing technique. It is encouraged to implement models by yourself. If you use public code, please cite it.


The following code borrows ideas from the demo code provided in class, and is modified to fit the requirements of this assignment, with necessary simplifications for the issue of unknown ngrams.

In [28]:
# Your code goes to here
import nltk
from nltk import FreqDist
import math
import re
from nltk import sent_tokenize
import json
import random


def load_data(file_name):
    """Load data from JSON file."""
    with open(file_name, "r") as f:
        data = [json.loads(line) for line in f.readlines()]
    return data

def preprocess_text(data, n, freq_threshold=1):
    """Preprocess text data from the cleaned dataset."""
    tokenized_sentences = []
    # Add SOS and EOS tokens for n-grams
    sos = "<s> " * (n-1) if n > 1 else "<s> "
    for record in data:
        sentences = sent_tokenize(record['text'])
        for sen in sentences:
            sen = re.sub(r'[^a-zA-Z0-9\s]', '', sen).lower()
            tokenized_sentences.append(f'{sos}{sen} </s>'.split())
    tokens = [token for sentence in tokenized_sentences for token in sentence]
    # Replace tokens with low frequency with <UNK>
    vocab = FreqDist(tokens)
    tokens = [token if vocab[token] > freq_threshold else "<UNK>" for token in tokens]
    
    return tokenized_sentences, tokens


class NgramModel:
    def __init__(self, tokenized_sentences, train_tokens, n, laplace=1):
        """Initialize the N-gram model."""
        self.n = n
        self.toked_sen = tokenized_sentences
        self.tokens = train_tokens
        self.laplace = laplace
        self.vocab = FreqDist(self.tokens)  # contains <UNK>, <s>, </s>
        self.model = self.create_model()
        self.special_words = ['<s>', '</s>', '<UNK>']
    
    def create_model(self):
        """Create the N-gram model with Add-one smoothing."""
        model = {}
        vocab_size = len(self.vocab)

        if self.n == 1:  # unigram model
            total_tokens = len(self.tokens)
            for token, count in self.vocab.items():
                model[(token,)] = (count + 1) / \
                    (total_tokens + vocab_size)  # Add-one smoothing
        else:
            n_gram_list = [
                ngram for sent in self.toked_sen for ngram in nltk.ngrams(sent, self.n)]
            n_gram_freq = FreqDist(n_gram_list)
            n_1_gram_list = [
                n_1_gram for sent in self.toked_sen for n_1_gram in nltk.ngrams(sent, self.n-1)]
            n_minus1_gram_freq = FreqDist(n_1_gram_list)

            for n_gram, freq in n_gram_freq.items():
                # The (n-1)-gram is the n-gram minus its last token
                n_minus_1_gram = n_gram[:-1]
                n_minus_1_gram_count = n_minus1_gram_freq[n_minus_1_gram]
                # Calculate the probability with Add-one smoothing
                prob = (freq + self.laplace) / (n_minus_1_gram_count + self.laplace * vocab_size)
                model[n_gram] = prob
            
        return model

    def perplexity_calc(self, te_toked_sen, te_tokens):
        """Calculate the perplexity of the N-gram model against a given testset."""
        # Replace unknown tokens with <UNK>
        te_toked_sen = [[token if token in self.vocab else "<UNK>" for token in sent]
                        for sent in te_toked_sen]
        test_ngrams = [
            ngram for sent in te_toked_sen for ngram in nltk.ngrams(sent, self.n)]
        # exclude <s> for each sentence
        N = len(te_tokens) - len(te_toked_sen) * (self.n-1)
        logprob = 0
        vocab_size = len(self.vocab)
        if self.n > 1:
            n_1_gram_list = [
                n_1_gram for sent in self.toked_sen for n_1_gram in nltk.ngrams(sent, self.n-1)]
            n_minus_1_gram_freq = FreqDist(n_1_gram_list)
        for ngram in test_ngrams:
            if ngram in self.model:
                logprob += math.log(self.model[ngram])
            else:
                n_minus1_gram = ngram[:-1]
                if n_minus1_gram in n_minus_1_gram_freq:
                    n_minus_1_gram_count = n_minus_1_gram_freq[n_minus1_gram]
                    logprob += math.log(self.laplace /
                                        (n_minus_1_gram_count + vocab_size * self.laplace))
                else:
                    logprob += math.log(self.laplace / vocab_size)

        return math.exp(-logprob / N)
    
    
    def _best_candidate(self, prev, i=0, without=None):
        """Choose the next token given the previous (n-1)-token context.

        `i` is accepted for compatibility with the caller but is currently unused.
        """
        blacklist = ["<UNK>"] + (without or [])  # Tokens to avoid
        # Generate candidates based on the provided context, filtering out unwanted tokens.
        candidates = [
            (ngram[-1], prob) for ngram, prob in self.model.items()
            if ngram[:-1] == prev and ngram[-1] not in blacklist
        ]
        
        # Sort candidates based on their probability, in descending order.
        candidates.sort(key=lambda candidate: candidate[1], reverse=True)
        # If no candidates found, return EOS with probability 1.
        if not candidates:
            return ("</s>", 1)
        # Pick the most probable candidate; for the first word of a sentence
        # (and for every step of the unigram model), pick a random candidate to ensure variety.
        return candidates[0 if prev != () and prev[-1] != "<s>" else random.randint(0, len(candidates)-1)]

    def generate_sentences(self, num, min_len=20, max_len=40):
        """Generate sentences using the N-gram model."""
        for i in range(num):
            sent = ["<s>"] * max(1, self.n - 1)  # Start each sentence with SOS tokens
            log_sent_prob = 0  # Initialize sentence probability
            
            # Generate tokens until an EOS token is produced or max length is reached
            while sent[-1] != "</s>" and len(sent) < max_len:
                prev = tuple(sent[-(self.n-1):]) if self.n != 1 else ()  # Current context
                blacklist = sent + (["</s>"] if len(sent) < min_len else [])  # Avoid repeating tokens; block </s> until min_len is reached
                next_token, next_prob = self._best_candidate(prev, i, without=blacklist)
                
                sent.append(next_token)  # Add next token to sentence
                log_sent_prob += -math.log(next_prob)  # Update sentence probability

            # Append </s> if not already done and sentence is too long
            if sent[-1] != "</s>":
                sent.append("</s>")

            # Yield the generated sentence and its negative log-probability
            yield ' '.join(sent), log_sent_prob

> 2) Report the perplexity of these 3 trained models on the testing dataset and explain your findings. 

In [29]:
# Your code goes to here
# Preprocess the text data
train_data = load_data('enwiki-train.json')
test_data = load_data('enwiki-test.json')
laplace = 1

for n in [1, 2, 3]:
    train_toked_sen, train_tokens = preprocess_text(
        train_data, n, freq_threshold=5)
    test_toked_sen, test_tokens = preprocess_text(
        test_data, n, freq_threshold=0)
    print(f"Preprocessing for {n}-gram model has been completed.")
    if n == 1:
        unigram_model = NgramModel(train_toked_sen, train_tokens, n, laplace)
        lm = unigram_model
    elif n == 2:
        bigram_model = NgramModel(train_toked_sen, train_tokens, n, laplace)
        lm = bigram_model
    else:
        trigram_model = NgramModel(train_toked_sen, train_tokens, n, laplace)
        lm = trigram_model
    print(f"{n}-gram model has been created.")
    print(f"Vocabulary size: {len(lm.vocab)}")
    print(f"Perplexity of {n}-gram model: {lm.perplexity_calc(test_toked_sen, test_tokens):.3f}\n")

Preprocessing for 1-gram model has been completed.
1-gram model has been created.
Vocabulary size: 43971
Perplexity of 1-gram model: 1079.569

Preprocessing for 2-gram model has been completed.
2-gram model has been created.
Vocabulary size: 43971
Perplexity of 2-gram model: 3282.316

Preprocessing for 3-gram model has been completed.
3-gram model has been created.
Vocabulary size: 43971
Perplexity of 3-gram model: 16081.709



From the results above, perplexity increases with $n$: 1079.569 for the unigram model, 3282.316 for the bigram model, and 16081.709 for the trigram model. This runs against the usual expectation that a longer context lowers perplexity.

This is most likely a consequence of Add-one smoothing. Every context gives one extra count to each of the $|V| = 43971$ vocabulary entries, so the denominator $c(h) + |V|$ is dominated by $|V|$ whenever the count $c(h)$ of the context $h$ is small. As $n$ grows, more test n-grams are unseen and more contexts are rare, so a larger share of the test set receives probabilities close to $1/|V|$ and perplexity rises. When the $(n-1)$-gram context itself never occurs in the training data, the Add-one estimate reduces to exactly $1/|V|$, which is the value the code assigns. This uniform fallback ignores all lower-order information and is probably not a good way to handle the case. A more sophisticated technique such as Kneser-Ney smoothing may be needed.
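The effect of the vocabulary size on Add-one estimates can be sketched on a toy corpus. This is only an illustration: the corpus and the helper `add_one_bigram_prob` are hypothetical, and only the vocabulary size 43971 is taken from the output above.

```python
from collections import Counter

def add_one_bigram_prob(bigram_counts, context_counts, vocab_size, context, word):
    """P(word | context) with Add-one smoothing: (c(context, word) + 1) / (c(context) + |V|)."""
    return (bigram_counts[(context, word)] + 1) / (context_counts[context] + vocab_size)

# Toy corpus (hypothetical, not the Wikipedia data)
tokens = "the cat sat on the mat".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])

# Seen bigram ("the", "cat") with the toy vocabulary of 5 word types
seen_small_vocab = add_one_bigram_prob(bigram_counts, context_counts, 5, "the", "cat")
# Same counts, but with the vocabulary size reported for the trained models
seen_large_vocab = add_one_bigram_prob(bigram_counts, context_counts, 43971, "the", "cat")
# Context never seen in training: the estimate collapses to 1 / |V|
unseen_context = add_one_bigram_prob(bigram_counts, context_counts, 43971, "dog", "cat")

print(f"seen bigram, |V| = 5:     {seen_small_vocab:.6f}")
print(f"seen bigram, |V| = 43971: {seen_large_vocab:.8f}")
print(f"unseen context:           {unseen_context:.8f}")
```

With a large vocabulary, a bigram seen once after a context seen twice gets almost the same probability as a completely unseen one, which is one way to see why sparse higher-order models score poorly under Add-one smoothing.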

> 3) Use each built model to generate five sentences and explain these generated patterns.


In [31]:
# Your code goes to here
num_to_gen = 5
print("\n## Generated sentences using the unigram model:")
for i, (sentence, prob) in enumerate(unigram_model.generate_sentences(num_to_gen), 1):
    print(f"Sentence {i}: {sentence} (negative log probability: {prob:.5f})")

print("\n## Generated sentences using the bigram model:")
for i, (sentence, prob) in enumerate(bigram_model.generate_sentences(num_to_gen), 1):
    print(f"Sentence {i}: {sentence} (negative log probability: {prob:.5f})")

print("\n## Generated sentences using the trigram model:")
for i, (sentence, prob) in enumerate(trigram_model.generate_sentences(num_to_gen), 1):
    print(f"Sentence {i}: {sentence} (negative log probability: {prob:.5f})")


## Generated sentences using the unigram model:
Sentence 1: <s> peculiarities 5000 separated jann albumin 163 weston huntingdon devolved flee opponent etiology corporal sully negative accuser relentless maheshwari 205 cel occasion narrow 89th 1960 whirlpool upnp kato everett standoff messages paume deposition textures spinoza caesareans tangerine valuable unei jai </s> (negative log probability: 486.47565)
Sentence 2: <s> fundulea deste ltpoemgt profanity closeup braga neurosurgeon nuova uncharted energy jane cutlets iatrogenic shipbuilding kitty incredulous teenagers unprepared deserts osteoarthritis giacchino beadseller midoctober barred commissars tunku stalinism pels unquestionably gains croquetas frere grownups gym proclaimed confluence treaties 20082009 blossomed </s> (negative log probability: 500.58020)
Sentence 3: <s> smiles pivotal veto tweed temperament cruise materialize leroy solidify mom assessing evade baxter countrywide derogatory counterattacked reimann atonic endless

The sentences generated by the unigram model are mostly nonsensical, as the model does not take into account the context of the words. 

The bigram model generates sentences that are more coherent, but lack variety, since the model only takes into account the previous word.

The trigram model generates the most coherent sentences, as it takes into account the context of the previous two words, and is able to generate sentences that are more similar to the training data.
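The lack of variety is tied to how the next token is chosen: `_best_candidate` takes the single most probable continuation for every word after the first. A minimal sketch of the difference between that greedy choice and sampling in proportion to probability is given below. The candidate distribution and the helper names `greedy_next` and `sampled_next` are hypothetical and are not part of the model code above.

```python
import random

def greedy_next(candidates):
    """Return the single most probable token (argmax decoding)."""
    return max(candidates, key=lambda pair: pair[1])[0]

def sampled_next(candidates, rng):
    """Draw one token with probability proportional to its weight."""
    tokens = [token for token, _ in candidates]
    weights = [prob for _, prob in candidates]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution for one context
candidates = [("film", 0.5), ("book", 0.3), ("actor", 0.2)]
rng = random.Random(0)  # fixed seed so the run is repeatable

greedy_picks = {greedy_next(candidates) for _ in range(20)}
sampled_picks = {sampled_next(candidates, rng) for _ in range(200)}

print("greedy picks: ", sorted(greedy_picks))
print("sampled picks:", sorted(sampled_picks))
```

Greedy decoding always returns the same token for the same context, while sampling visits the lower-probability continuations as well, at the cost of less predictable output.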

## Task3 - Build NB/LR classifiers

> 1) Build a Naive Bayes classifier (with Laplace smoothing) and test your model on test dataset

In [1]:
# Your code goes to here
import json
import numpy as np
import re
from sklearn.metrics import classification_report


def load_data(file_name):
    """Load data from a JSON file."""
    with open(file_name, "r") as f:
        data = [json.loads(line) for line in f.readlines()]
    return data

def preprocess_data(data):
    """Clean and extract texts and labels in the dataset."""
    texts = [re.sub(r'[^a-zA-Z0-9\s]', '', record['text']).lower() for record in data]
    labels = [record['label'] for record in data]
    return texts, labels

def create_vocabulary(texts):
    """Create a set of all unique words in the data."""
    vocabulary = set()
    for text in texts:
        for word in text.split():
            vocabulary.add(word)
    return vocabulary

def text_to_bow(texts, vocabulary):
    """Convert a list of texts to a bag-of-words representation."""
    bow = np.zeros((len(texts), len(vocabulary)))  # row: text, column: word
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    for i, text in enumerate(texts):
        for word in text.split():
            if word in word_to_index:  # for test set, ignore unknown words
                bow[i, word_to_index[word]] += 1
    return bow

def train_naive_bayes(bow, labels, laplace=1):
    """Train a Naive Bayes classifier with Laplace smoothing."""
    n_classes = len(set(labels))  # number of unique classes
    n_words = bow.shape[1]  # number of words in the vocabulary
    class_word_counts = np.zeros((n_classes, n_words))  # word counts per class
    class_counts = np.zeros(n_classes)  # text counts per class
    for i, label in enumerate(labels):
        index = label_to_index[label]  # convert label to index
        class_word_counts[index] += bow[i]
        class_counts[index] += 1
    class_log_prior = np.log(class_counts) - np.log(np.sum(class_counts))
    class_word_log_prob = np.log(class_word_counts + laplace) - np.log(
        np.sum(class_word_counts + laplace, axis=1, keepdims=True))  # Laplace smoothing
    return class_log_prior, class_word_log_prob

def test_naive_bayes(test_bow, class_log_prior, class_word_log_prob):
    """Predict the class for each example in bow."""
    log_probs = test_bow @ class_word_log_prob.T + class_log_prior  # matrix, size (n_test, n_classes)
    return np.argmax(log_probs, axis=1)  # predicted class for each text, size (n_test,)


# load and preprocess the data
train_data = load_data('enwiki-train.json')
test_data = load_data('enwiki-test.json')
train_texts, train_labels = preprocess_data(train_data)
test_texts, test_labels = preprocess_data(test_data)
unique_labels = list(set(train_labels))  # list of unique labels
label_to_index = {label: i for i, label in enumerate(unique_labels)}

# create vocabulary
vocabulary = create_vocabulary(train_texts)
print("Vocabulary size:", len(vocabulary))
# convert texts to bag-of-words representation
train_bow = text_to_bow(train_texts, vocabulary)
test_bow = text_to_bow(test_texts, vocabulary)

# Train the Naive Bayes classifier
class_log_prior, class_word_log_prob = train_naive_bayes(
    train_bow, np.array(train_labels))

print("Naive Bayes classifier trained.")
print("Class log prior shape:", class_log_prior.shape)
print("Class word log probability shape:", class_word_log_prob.shape)

# Make predictions on the test set
predictions = test_naive_bayes(test_bow, class_log_prior, class_word_log_prob)

# Convert test labels to indices using the same mapping as in training
test_label_to_index = [label_to_index[label] for label in test_labels]

# Evaluate the classifier
# accuracy = np.mean(predictions == np.array(test_label_to_index))
# print(f"Accuracy: {accuracy:.4f}")
print("\nClassification report:")
# unique_labels is ordered by class index, so the report rows match the predicted indices
print(classification_report(test_label_to_index, predictions, target_names=unique_labels, zero_division=1))

Vocabulary size: 178616
Naive Bayes classifier trained.
Class log prior shape: (10,)
Class word log probability shape: (10, 178616)

Classification report:
              precision    recall  f1-score   support

        Book       1.00      0.80      0.89        10
     Disease       0.71      1.00      0.83        10
      Artist       1.00      1.00      1.00        10
      Animal       1.00      0.70      0.82        10
        Film       0.50      0.90      0.64        10
      Writer       0.80      0.80      0.80        10
        Food       1.00      0.90      0.95        10
  Politician       0.71      1.00      0.83        10
       Actor       1.00      0.00      0.00        10
    Software       1.00      1.00      1.00        10

    accuracy                           0.81       100
   macro avg       0.87      0.81      0.78       100
weighted avg       0.87      0.81      0.78       100



From the test result above, the Naive Bayes classifier achieves an accuracy of 0.81 on the test dataset, i.e. it correctly predicts the class of 81 of the 100 test documents. This is acceptable overall, but the per-class results in the report are uneven: the recall for Actor is 0.00, so none of the ten Actor articles is classified correctly, and the precision for Film is only 0.50.

> 2) Build a LR classifier. This question seems to be challenging. We did not directly provide features for samples. But just use your own method to build useful features. You may need to split the training dataset into train and validation so that some involved parameters can be tuned. 

In the following code, we use TF-IDF as the feature representation for the LR classifier and build the classifier with the scikit-learn library.

In [4]:
# Your code goes to here
import json
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import re
from sklearn.metrics import f1_score, classification_report

# Load data from JSON files
def load_data(file_name):
    """Load data from a JSON file."""
    with open(file_name, "r") as f:
        data = [json.loads(line) for line in f.readlines()]
    return data

# Split data into features and labels
def preprocess_data(data):
    """Clean and extract texts and labels in the dataset."""
    texts = [re.sub(r'[^a-zA-Z0-9\s]', '', record['text']).lower() for record in data]
    labels = [record['label'] for record in data]
    return texts, labels

# Load and prepare training and test data
train_data = load_data('enwiki-train.json')
test_data = load_data('enwiki-test.json')
X_train_full, y_train_full = preprocess_data(train_data)
X_test, y_test = preprocess_data(test_data)

# Convert labels to numeric values
label_encoder = LabelEncoder()
y_train_full_numeric = label_encoder.fit_transform(y_train_full)
y_test_numeric = label_encoder.transform(y_test)

# Split the full training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full_numeric, test_size=0.2, random_state=42)

# Vectorize text data using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_val_tfidf = vectorizer.transform(X_val)
X_test_tfidf = vectorizer.transform(X_test)

C_values = [0.01, 0.1, 1, 10]  # hyperparameter values to test
f1_scores = []
for C in C_values:
    # Train Logistic Regression classifier
    lr_model = LogisticRegression(C=C, max_iter=1000, random_state=42)
    lr_model.fit(X_train_tfidf, y_train)

    # Evaluate the model on the validation set
    y_val_pred = lr_model.predict(X_val_tfidf)

    # Calculate the F1 score
    f1 = f1_score(y_val, y_val_pred, average='micro')
    f1_scores.append(f1)
    print(f'C = {C}: Validation F1-Score = {f1}')

# Train the final Logistic Regression classifier with the selected C = 1
lr_model = LogisticRegression(C=1, max_iter=1000, random_state=42)
lr_model.fit(X_train_tfidf, y_train)

# Evaluate the model on the test set
y_test_pred = lr_model.predict(X_test_tfidf)

print("\nClassification report:")
print(classification_report(y_test_numeric, y_test_pred, target_names=label_encoder.classes_))

C = 0.01: Validation F1-Score = 0.13
C = 0.1: Validation F1-Score = 0.65
C = 1: Validation F1-Score = 0.95
C = 10: Validation F1-Score = 0.955

Classification report:
              precision    recall  f1-score   support

       Actor       1.00      1.00      1.00        10
      Animal       1.00      0.90      0.95        10
      Artist       0.91      1.00      0.95        10
        Book       0.77      1.00      0.87        10
     Disease       1.00      1.00      1.00        10
        Film       1.00      0.90      0.95        10
        Food       1.00      1.00      1.00        10
  Politician       0.91      1.00      0.95        10
    Software       1.00      1.00      1.00        10
      Writer       1.00      0.70      0.82        10

    accuracy                           0.95       100
   macro avg       0.96      0.95      0.95       100
weighted avg       0.96      0.95      0.95       100



To choose the regularization parameter, we run a simple grid search over C using a single hold-out validation set rather than k-fold cross-validation. We split the training dataset 80/20 into training and validation sets, train the LR classifier on the training part with each value of C, and score it on the validation part. Raising C from 1 to 10 improves the validation F1-score only from 0.95 to 0.955, so we keep the default value of 1.0.
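The choice of C from the validation scores can also be made explicit in code instead of being hard-coded. The sketch below is one possible rule, not the procedure used in the cell above: the helper `pick_c` and its `tolerance` parameter are hypothetical, and the scores are the validation F1 values printed above.

```python
def pick_c(c_values, scores, tolerance=0.01):
    """Return the smallest C whose score is within `tolerance` of the best score.

    Preferring the smallest such C favors the more strongly regularized model
    when the validation scores are practically tied.
    """
    best = max(scores)
    for c, score in sorted(zip(c_values, scores)):
        if best - score <= tolerance:
            return c

c_values = [0.01, 0.1, 1, 10]
val_f1 = [0.13, 0.65, 0.95, 0.955]  # validation F1 scores reported above

chosen_c = pick_c(c_values, val_f1)
strict_c = pick_c(c_values, val_f1, tolerance=0.0)

print("C chosen with tolerance 0.01:", chosen_c)
print("C chosen with tolerance 0:   ", strict_c)
```

With a tolerance of 0.01 the rule selects C = 1, which matches the value used for the final model; with no tolerance it would select C = 10 on the strength of a 0.005 difference.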

The result shows that the LR classifier achieves an accuracy of 0.95 on the test dataset, which is higher than the Naive Bayes classifier.

> 3) Report Micro-F1 score and Macro-F1 score for these classifiers on testing dataset explain our results.

In [5]:
# Your code goes to here
from sklearn.metrics import f1_score

micro_f1_nb = f1_score(test_label_to_index, predictions, average='micro')
macro_f1_nb = f1_score(test_label_to_index, predictions, average='macro')

micro_f1_lr = f1_score(y_test_numeric, y_test_pred, average='micro')
macro_f1_lr = f1_score(y_test_numeric, y_test_pred, average='macro')

print(
    f"Naive Bayes - Micro-F1: {micro_f1_nb:.4f}, Macro-F1: {macro_f1_nb:.4f}")

print(
    f"Logistic Regression - Micro-F1: {micro_f1_lr:.4f}, Macro-F1: {macro_f1_lr:.4f}")

Naive Bayes - Micro-F1: 0.8100, Macro-F1: 0.7769
Logistic Regression - Micro-F1: 0.9500, Macro-F1: 0.9493


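From the results above, the LR classifier outperforms the NB classifier on both Micro-F1 (0.9500 vs. 0.8100) and Macro-F1 (0.9493 vs. 0.7769). Micro-F1 pools the decisions over all documents; in this single-label setting it equals accuracy, so on an imbalanced dataset it is dominated by the large classes. Macro-F1 is the unweighted mean of the per-class F1 scores, so it reflects how balanced the performance is across classes and is pulled down by poor performance on any single class. The test set here has ten documents per class, so the gap between the two NB scores comes from uneven per-class F1, in particular the 0.00 F1-score for Actor.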
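The difference between the two averages can be checked by hand on a small example. The following is a plain-Python sketch with hypothetical toy labels; the helper `f1_scores` is written for illustration and assumes that a class with no true positives gets an F1 of 0.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_label, pred_label in zip(y_true, y_pred):
        if true_label == pred_label:
            tp[true_label] += 1
        else:
            fp[pred_label] += 1  # wrongly predicted class gains a false positive
            fn[true_label] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro

# Hypothetical toy labels: class "c" is never predicted
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "a", "b", "b", "a", "b"]

micro, macro = f1_scores(y_true, y_pred)
print(f"Micro-F1: {micro:.4f}, Macro-F1: {macro:.4f}")
```

Here Micro-F1 equals the accuracy of 4/6, while the zero F1 of class "c" pulls Macro-F1 lower, the same pattern as the Naive Bayes scores above.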

One explanation is that LR, a discriminative linear model, is better suited to this dataset, while NB works well with small datasets and can achieve good results even in the presence of irrelevant features. LR's performance may also benefit from the TF-IDF representation, which down-weights words that appear in most documents and so emphasizes the features that distinguish classes; this is reflected in the F1 scores. In addition, NB assumes that words are conditionally independent given the class, whereas LR learns its weights jointly and can therefore account for correlated features. On this task, LR is the stronger of the two models.