**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part I: Bag-of-Words Model

Please see the description of the assignment in the README file (section 1) <br>
**Guide notebook**: [guides/bow_guide.ipynb](guides/bow_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bow_guide` notebook.

<br>

***

In [8]:
# imports for the project

import pandas as pd


### 1. Load the data

We can load this data directly from the Hugging Face Hub (see [Hugging Face Datasets](https://huggingface.co/docs/datasets/)) into a Pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.
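One way to avoid the repeated download is to keep a local copy after the first run. The sketch below is only an illustration: the `load_cached` helper and the cache path in the usage comment are hypothetical names, and pickle is used simply because it needs no extra dependency (a local Parquet file would work just as well if a Parquet engine is installed).

```python
from pathlib import Path
from typing import Callable

import pandas as pd


def load_cached(cache_path, download: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Return the cached DataFrame if it exists, otherwise download and cache it.

    `load_cached` is a hypothetical helper, not part of the assignment code.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: read the local copy instead of downloading again
        return pd.read_pickle(cache_path)
    df = download()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_pickle(cache_path)
    return df


# Usage in this notebook (the cache path is an assumption):
# train = load_cached(
#     "cache/ag_news_train.pkl",
#     lambda: pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"]),
# )
```

Passing the download step as a function means it only runs on a cache miss.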

You are welcome to increase the `frac` parameter of `preprocess` (defined below) to keep a larger sample of the data.

In [9]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}

train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

print(train.shape, test.shape)

(120000, 2) (7600, 2)


In [10]:

label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(
    df: pd.DataFrame,
    frac: float = 1e-2,
    label_map: dict[int, str] = label_map,
    seed: int = 42,
) -> pd.DataFrame:
    """Preprocess the dataset.

    Operations:
    - Map the label to the corresponding category
    - Filter out the labels not in the label_map
    - Sample a fraction of the dataset (stratified by label)
    """
    return (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]
        .groupby('label')[["text", "label"]]
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)
    )

train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
del train
del test

train_df.shape, test_df.shape

((1200, 2), (760, 2))
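
To make the "stratified by label" step concrete, here is a minimal sketch on a made-up toy frame. The `toy` data and the `stratified_sample` helper are illustrative only and are not part of the assignment code. It uses pandas' `DataFrameGroupBy.sample`, which draws the same fraction from every label group, so the class proportions of the original data are preserved.

```python
import pandas as pd


def stratified_sample(df: pd.DataFrame, frac: float, seed: int = 42) -> pd.DataFrame:
    """Draw the same fraction of rows from every label group (hypothetical helper)."""
    return (
        df.groupby("label")
        .sample(frac=frac, random_state=seed)
        .reset_index(drop=True)
    )


# Toy data (made up): 100 'Sports' rows and 200 'World' rows
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(300)],
    "label": ["Sports"] * 100 + ["World"] * 200,
})

sample = stratified_sample(toy, frac=0.1)
label_counts = sample["label"].value_counts().to_dict()
print(label_counts)
```

With `frac=0.1`, each group shrinks by the same factor: 10 of the 100 'Sports' rows and 20 of the 200 'World' rows are kept.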

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words features: keep at most 8490 terms and drop English stop words
vectorizer = CountVectorizer(max_features=8490, stop_words='english')

# Learn the vocabulary on the training data only, then reuse it for the test data
X_train = vectorizer.fit_transform(train_df['text'])
X_test = vectorizer.transform(test_df['text'])

print("BoW shape for training data:", X_train.shape)
print("BoW shape for test data:", X_test.shape)


BoW shape for training data: (1200, 8490)
BoW shape for test data: (760, 8490)


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Extract target labels
y_train = train_df['label']
y_test = test_df['label']

# Initialize Logistic Regression classifier
# (C is the inverse regularization strength: smaller C means stronger regularization)
clf = LogisticRegression(C=0.1, max_iter=100)
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.8460526315789474
Classification Report:
               precision    recall  f1-score   support

    Business       0.80      0.80      0.80       190
    Sci/Tech       0.83      0.80      0.81       190
      Sports       0.89      0.91      0.90       190
       World       0.87      0.88      0.87       190

    accuracy                           0.85       760
   macro avg       0.85      0.85      0.85       760
weighted avg       0.85      0.85      0.85       760
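
For reading the report: precision is the share of predictions for a class that are correct, recall is the share of true members of a class that were found, f1 is their harmonic mean, and support is the number of true samples per class. The sketch below recomputes these by hand on a made-up toy example; `per_class_metrics` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def per_class_metrics(y_true, y_pred, label):
    """Compute precision, recall, f1 and support for one class by counting."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}


# Toy labels (made up for illustration)
toy_true = ["Sports", "Sports", "World", "World", "World"]
toy_pred = ["Sports", "World", "World", "World", "Sports"]

sports = per_class_metrics(toy_true, toy_pred, "Sports")
world = per_class_metrics(toy_true, toy_pred, "World")
print(sports)
print(world)
```

In the toy example, one of the two 'Sports' predictions is correct and one of the two true 'Sports' samples is found, so precision and recall are both 0.5.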



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF features with the same vocabulary cap and stop-word list as the BoW model
tfidf_vectorizer = TfidfVectorizer(max_features=8490, stop_words='english')
X_train_tfidf = tfidf_vectorizer.fit_transform(train_df['text'])
X_test_tfidf = tfidf_vectorizer.transform(test_df['text'])

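# Train Logistic Regression on the TF-IDF features
# Note: C and max_iter differ from the BoW model (C=0.1, max_iter=100),
# so the two runs are not a like-for-like comparison of the features alone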
clf_tfidf = LogisticRegression(C=0.5, max_iter=500)
clf_tfidf.fit(X_train_tfidf, y_train)

# Predict and evaluate
y_pred_tfidf = clf_tfidf.predict(X_test_tfidf)
print("TF-IDF Accuracy:", accuracy_score(y_test, y_pred_tfidf))
print("TF-IDF Classification Report:\n", classification_report(y_test, y_pred_tfidf))


TF-IDF Accuracy: 0.8657894736842106
TF-IDF Classification Report:
               precision    recall  f1-score   support

    Business       0.80      0.84      0.82       190
    Sci/Tech       0.87      0.80      0.83       190
      Sports       0.91      0.94      0.92       190
       World       0.88      0.88      0.88       190

    accuracy                           0.87       760
   macro avg       0.87      0.87      0.87       760
weighted avg       0.87      0.87      0.87       760
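
For the reflection question on which hyperparameters matter, one possible approach is a small grid search with cross-validation on the training data, which leaves the test set untouched. The sketch below is an assumption about how this could be set up, not the method used above: the `build_search` helper, the grid values, and the toy texts are all made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def build_search(cv: int = 5) -> GridSearchCV:
    """Grid search over vocabulary size and regularization (hypothetical helper)."""
    pipeline = Pipeline([
        ("vec", TfidfVectorizer(stop_words="english")),
        ("clf", LogisticRegression(max_iter=500)),
    ])
    # Illustrative grid values, not tuned recommendations
    param_grid = {
        "vec__max_features": [500, None],
        "clf__C": [0.1, 0.5, 1.0],
    }
    return GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")


# Toy data (made up) so the sketch runs on its own
toy_texts = [
    "The team won the football match",
    "Striker scores twice in cup final",
    "Coach praises players after the game",
    "Tennis champion wins the tournament",
    "Stocks fall as markets react to rates",
    "Company profits rise after merger",
    "Bank shares drop on earnings report",
    "Oil prices climb as investors buy",
]
toy_labels = ["Sports"] * 4 + ["Business"] * 4

search = build_search(cv=2)
search.fit(toy_texts, toy_labels)
print(search.best_params_)
```

In this notebook the same search could be fitted with `search.fit(train_df['text'], train_df['label'])`, and `search.cv_results_` then shows how much the score moves with each hyperparameter.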

