**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part I: Bag-of-Words Model

Please see the description of the assignment in the README file (section 1) <br>
**Guide notebook**: [guides/bow_guide.ipynb](guides/bow_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bow_guide` notebook

<br>

***

In [5]:
# imports for the project

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

#for other classifiers

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

### 1. Load the data

We can load this data directly from [Hugging Face Datasets](https://huggingface.co/docs/datasets/) (the Hugging Face Hub) into a Pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter to load more data.

In [2]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}

train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

print(train.shape, test.shape)

(120000, 2) (7600, 2)


In [3]:

label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac: float = 1e-2, label_map: dict[int, str] = label_map, seed: int = 42) -> pd.DataFrame:
    """ Preprocess the dataset 

    Operations:
    - Map the label to the corresponding category
    - Filter out the labels not in the label_map
    - Sample a fraction of the dataset (stratified by label)

    Args:
    - df (pd.DataFrame): The dataset to preprocess
    - frac (float): The fraction of the dataset to sample in each category
    - label_map (dict): A mapping of the original label to the new label
    - seed (int): The random seed for reproducibility

    Returns:
    - pd.DataFrame: The preprocessed dataset
    """

    return  (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]
        .groupby('label')[["text", "label"]]
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)

    )

train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
del train
del test

train_df.shape, test_df.shape

((1200, 2), (760, 2))
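
The stratified sampling that `frac` controls can be checked on a small toy frame. The sketch below is not the notebook's code: it assumes a made-up DataFrame `toy` and a hypothetical helper `stratified_sample`, and it uses `DataFrameGroupBy.sample` as a shorter way to express the same idea (draw the same fraction of rows within each label).

```python
import pandas as pd

# Toy stand-in for the AG News data (hypothetical): 10 articles per label.
toy = pd.DataFrame({
    "text": [f"article {i}" for i in range(40)],
    "label": ["World"] * 10 + ["Sports"] * 10 + ["Business"] * 10 + ["Sci/Tech"] * 10,
})


def stratified_sample(df: pd.DataFrame, frac: float, seed: int = 42) -> pd.DataFrame:
    """Sample the same fraction of rows from every label group."""
    return (
        df.groupby("label")
        .sample(frac=frac, random_state=seed)
        .reset_index(drop=True)
    )


sample = stratified_sample(toy, frac=0.5)

# Each label keeps half of its 10 rows, so the classes stay balanced.
counts = sample["label"].value_counts().to_dict()
print(counts)
```

Because the fraction is applied per label, increasing `frac` grows every class by the same proportion, which is why the `support` column in the classification reports is identical for all four classes.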

In [13]:
# Create the CountVectorizer (it is fitted on the training text below)
vectorizer = CountVectorizer(
    lowercase=True,
    max_features=2000,
    stop_words='english'
)

# Fit on the training text, then transform both train and test data
X_train = vectorizer.fit_transform(train_df['text'])
X_test = vectorizer.transform(test_df['text'])

# Get labels
y_train = train_df['label']
y_test = test_df['label']

# Train and evaluate both logistic regression and naive Bayes classifiers
classifiers = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Naive Bayes': MultinomialNB()
}

for name, clf in classifiers.items():
    print(f"\nTraining {name}...")
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    
    print(f"\nClassification Report for {name}:")
    print(classification_report(y_test, y_pred))


Training Logistic Regression...

Classification Report for Logistic Regression:
              precision    recall  f1-score   support

    Business       0.75      0.76      0.76       190
    Sci/Tech       0.77      0.75      0.76       190
      Sports       0.89      0.90      0.90       190
       World       0.83      0.83      0.83       190

    accuracy                           0.81       760
   macro avg       0.81      0.81      0.81       760
weighted avg       0.81      0.81      0.81       760


Training Naive Bayes...

Classification Report for Naive Bayes:
              precision    recall  f1-score   support

    Business       0.79      0.84      0.81       190
    Sci/Tech       0.85      0.79      0.82       190
      Sports       0.91      0.92      0.91       190
       World       0.87      0.87      0.87       190

    accuracy                           0.85       760
   macro avg       0.85      0.85      0.85       760
weighted avg       0.85      0.85      
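
To make the columns of the classification report concrete, here is a minimal sketch of how precision, recall, and F1 are computed for a single class. The labels below are made up for illustration (they are not the notebook's predictions), and `per_class_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # missed instances of label

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Made-up example: one Business article is mistaken for Sci/Tech and vice versa.
y_true = ["Business", "Business", "Sci/Tech", "Sci/Tech", "Sports", "World"]
y_pred = ["Business", "Sci/Tech", "Sci/Tech", "Business", "Sports", "World"]

precision, recall, f1 = per_class_metrics(y_true, y_pred, "Business")
print(precision, recall, f1)
```

In this toy case the confusion between Business and Sci/Tech lowers both precision and recall for Business, which is the same pattern the reports above show for those two classes.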

### Reflection / analysis

The initial run performed well on Sports and World but was weaker on Business and Sci/Tech. I then tried other models (random forest, linear SVM, a neural network, naive Bayes, and others). Naive Bayes (NB) performed best and was chosen for further tuning, along with logistic regression (LR). I then tried NB and LR with more features (2,000 instead of 1,000), with stop words removed, and with bigrams added. The largest improvement came from adding more features, keeping stop words, and not using bigrams.

These results suggest that more features can improve classifier performance, perhaps because a larger vocabulary captures more of the content of the articles. Removing stop words did not improve NB, but it did slightly improve LR. Adding bigrams was not helpful; perhaps single words capture enough meaning in this context. Business seems to be the hardest class to get right. This might be because it overlaps with many other subjects (business articles likely cover things like sports betting profits or revenue from new technologies), which makes it less distinct than the other classes.
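
A comparison like the one described above could be organised as a small grid over the vectorizer settings. The sketch below is one possible way to do it, not the code that produced the results in this notebook: the toy texts and labels are made up, `compare_vectorizer_settings` is a hypothetical helper, and plain accuracy is used instead of a full classification report to keep the output short.

```python
from itertools import product

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in data (hypothetical); in the notebook, use train_df / test_df instead.
train_texts = [
    "stocks fall as oil prices rise",
    "company reports record quarterly profit",
    "team wins the championship final",
    "striker scores twice in cup match",
    "new chip makes phones faster",
    "software update fixes security bug",
    "leaders meet for peace talks",
    "election results spark protests in the capital",
]
train_labels = ["Business", "Business", "Sports", "Sports",
                "Sci/Tech", "Sci/Tech", "World", "World"]
test_texts = ["profit rise for oil company", "team scores in final match"]
test_labels = ["Business", "Sports"]


def compare_vectorizer_settings(train_texts, train_labels, test_texts, test_labels):
    """Return {(max_features, stop_words, ngram_range): accuracy} for each setting."""
    grid = product([1000, 2000], [None, "english"], [(1, 1), (1, 2)])
    scores = {}
    for max_features, stop_words, ngram_range in grid:
        vectorizer = CountVectorizer(
            lowercase=True,
            max_features=max_features,  # cap is not reached on this tiny toy corpus
            stop_words=stop_words,
            ngram_range=ngram_range,
        )
        X_train = vectorizer.fit_transform(train_texts)
        X_test = vectorizer.transform(test_texts)
        clf = MultinomialNB().fit(X_train, train_labels)
        scores[(max_features, stop_words, ngram_range)] = accuracy_score(
            test_labels, clf.predict(X_test)
        )
    return scores


scores = compare_vectorizer_settings(train_texts, train_labels, test_texts, test_labels)
for setting, score in scores.items():
    print(setting, round(score, 2))
```

Note that with `ngram_range=(1, 2)` unigrams and bigrams share the same `max_features` budget, since `CountVectorizer` keeps only the most frequent terms overall. That may be one reason bigrams did not help at a fixed vocabulary size, though this is a guess rather than something tested here.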


Analysis of performance: 