**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part I: Bag-of-Words Model

Please see the description of the assignment in the README file (section 1) <br>
**Guide notebook**: [guides/bow_guide.ipynb](guides/bow_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bow_guide` notebook

<br>

***

In [11]:
# imports for the project

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

### 1. Load the data

We can load this data directly from the Hugging Face Hub (see [Hugging Face Datasets](https://huggingface.co/docs/datasets/)) into a Pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter to load more data.
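
Since re-running the cell re-downloads the data, one option is to cache each split on disk after the first read. The sketch below is only an illustration: `load_cached`, the `reader` argument, and the `cache/train.pkl` path are hypothetical names that are not part of the assignment, and pickle is used for the cache purely to avoid extra dependencies.

```python
from pathlib import Path
from typing import Callable

import pandas as pd


def load_cached(
    url: str,
    cache_path: Path,
    reader: Callable[[str], pd.DataFrame] = pd.read_parquet,
) -> pd.DataFrame:
    """Return the cached DataFrame if it exists; otherwise read `url` and cache it."""
    if cache_path.exists():
        # Cache hit: no download needed
        return pd.read_pickle(cache_path)
    df = reader(url)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_pickle(cache_path)
    return df


# Hypothetical usage in this notebook (not executed here, since it needs network access):
# train = load_cached("hf://datasets/fancyzhx/ag_news/" + splits["train"], Path("cache/train.pkl"))
```

With this in place, only the first run of the cell would hit the network; later runs read the local file.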

In [12]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}

train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

print(train.shape, test.shape)

(120000, 2) (7600, 2)


In [13]:

label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac: float = 1e-2, label_map: dict[int, str] = label_map, seed: int = 42) -> pd.DataFrame:
    """ Preprocess the dataset 

    Operations:
    - Map the label to the corresponding category
    - Filter out the labels not in the label_map
    - Sample a fraction of the dataset (stratified by label)

    Args:
    - df (pd.DataFrame): The dataset to preprocess
    - frac (float): The fraction of the dataset to sample in each category
    - label_map (dict): A mapping of the original label to the new label
    - seed (int): The random seed for reproducibility

    Returns:
    - pd.DataFrame: The preprocessed dataset
    """

    return (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]
        .groupby('label')[["text", "label"]]
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)
    )

train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
del train
del test

train_df.shape, test_df.shape

((1200, 2), (760, 2))
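
To see what the stratified sampling step in `preprocess` does, here is a minimal sketch on a made-up toy DataFrame (the `toy` data is an assumption for illustration only). It uses pandas' built-in `DataFrameGroupBy.sample`, which should give the same kind of per-label sampling as the `groupby(...).apply(...sample...)` chain above: every label keeps the same fraction of its rows.

```python
import pandas as pd

# Toy data (hypothetical): 20 documents per label
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(40)],
    "label": ["World"] * 20 + ["Sports"] * 20,
})

# Sample 50% of the rows within each label group
sampled = (
    toy.groupby("label")
    .sample(frac=0.5, random_state=42)
    .reset_index(drop=True)
)

counts = sampled["label"].value_counts().to_dict()
print(counts)
```

Because the fraction is applied per group, the class balance of the sample matches the class balance of the full dataset.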

### Split the data

In [14]:
X_train, X_test, y_train, y_test = train_test_split(
    train_df["text"], train_df["label"], test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(960,) (240,) (960,) (240,)
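
Note that `X_test` here is carved out of `train_df`, so it acts as a validation set, while `test_df` is still available for a final check. The split above is random; with only 240 held-out rows, the class proportions can drift slightly. A possible variation, sketched on made-up toy data, passes `stratify=` to `train_test_split` so each label keeps its share in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 25 documents for each of the four labels
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(100)],
    "label": ["World", "Sports", "Business", "Sci/Tech"] * 25,
})

X_tr, X_val, y_tr, y_val = train_test_split(
    toy["text"],
    toy["label"],
    test_size=0.2,
    random_state=42,
    stratify=toy["label"],  # keep label proportions identical in both parts
)

val_counts = y_val.value_counts().to_dict()
print(val_counts)
```

In the notebook, the equivalent change would be adding `stratify=train_df["label"]` to the existing call.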


In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV


# Initialize the CountVectorizer (Bag of Words model)
vectorizer = CountVectorizer(stop_words='english', max_features=10000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Logistic Regression Model
model_lr = LogisticRegression(max_iter=200)
model_lr.fit(X_train_vec, y_train)
y_pred_lr = model_lr.predict(X_test_vec)

# Naive Bayes Model
model_nb = MultinomialNB()
model_nb.fit(X_train_vec, y_train)
y_pred_nb = model_nb.predict(X_test_vec)

# Hyperparameter Tuning for Logistic Regression
param_grid_lr = {
    'C': [0.1, 1, 10],
    'solver': ['liblinear', 'lbfgs'],
    # Note: lbfgs does not support the l1 penalty, so those combinations fail during the search
    'penalty': ['l1', 'l2'],
}
grid_search_lr = GridSearchCV(LogisticRegression(max_iter=200), param_grid_lr, cv=3, verbose=2)
grid_search_lr.fit(X_train_vec, y_train)

# Hyperparameter Tuning for Naive Bayes
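param_grid_nb = {
    'alpha': [0.01, 0.1, 1, 2, 3, 10],
    'fit_prior': [True, False],
    # Note: force_alpha only matters when alpha is below 1e-10, so it has no effect for these alpha values
    'force_alpha': [True, False],
}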
grid_search_nb = GridSearchCV(MultinomialNB(), param_grid_nb, cv=3, verbose=2)
grid_search_nb.fit(X_train_vec, y_train)

# Evaluation 
print("\n--- Model Evaluation ---")

# Logistic Regression - Original
print("\nOriginal Logistic Regression Model Performance:")
print(classification_report(y_test, y_pred_lr))

# Logistic Regression - Tuned
print('Best Parameters for Logistic Regression:', grid_search_lr.best_params_)
y_pred_tuned_lr = grid_search_lr.best_estimator_.predict(X_test_vec)
print("\nTuned Logistic Regression Model Performance:")
print(classification_report(y_test, y_pred_tuned_lr))

# Naive Bayes - Original
print("\nOriginal Naive Bayes Model Performance:")
print(classification_report(y_test, y_pred_nb))

# Naive Bayes - Tuned
print('Best Parameters for Naive Bayes:', grid_search_nb.best_params_)
y_pred_tuned_nb = grid_search_nb.best_estimator_.predict(X_test_vec)
print("\nTuned Naive Bayes Model Performance:")
print(classification_report(y_test, y_pred_tuned_nb))




Fitting 3 folds for each of 12 candidates, totalling 36 fits
[CV] END ................C=0.1, penalty=l1, solver=liblinear; total time=   0.0s
[CV] END ................C=0.1, penalty=l1, solver=liblinear; total time=   0.0s
[CV] END ................C=0.1, penalty=l1, solver=liblinear; total time=   0.0s
[CV] END ....................C=0.1, penalty=l1, solver=lbfgs; total time=   0.0s
[CV] END ....................C=0.1, penalty=l1, solver=lbfgs; total time=   0.0s
[CV] END ....................C=0.1, penalty=l1, solver=lbfgs; total time=   0.0s
[CV] END ................C=0.1, penalty=l2, solver=liblinear; total time=   0.0s
[CV] END ................C=0.1, penalty=l2, solver=liblinear; total time=   0.0s
[CV] END ................C=0.1, penalty=l2, solver=liblinear; total time=   0.0s
[CV] END ....................C=0.1, penalty=l2, solver=lbfgs; total time=   0.0s
[CV] END ....................C=0.1, penalty=l2, solver=lbfgs; total time=   0.0s
[CV] END ....................C=0.1, penalty=l2, 

9 fits failed out of a total of 36.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
9 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/marcuskrarup/anaconda3/envs/aiml25-ma2/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/marcuskrarup/anaconda3/envs/aiml25-ma2/lib/python3.11/site-packages/sklearn/base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/marcuskrarup/anaconda3/envs/aiml25-ma2/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py", line 1193, in fit
    solver = _check_solver

[CV] END .........alpha=2, fit_prior=False, force_alpha=True; total time=   0.0s
[CV] END .........alpha=2, fit_prior=False, force_alpha=True; total time=   0.0s
[CV] END .........alpha=2, fit_prior=False, force_alpha=True; total time=   0.0s
[CV] END ........alpha=2, fit_prior=False, force_alpha=False; total time=   0.0s
[CV] END ........alpha=2, fit_prior=False, force_alpha=False; total time=   0.0s
[CV] END ........alpha=2, fit_prior=False, force_alpha=False; total time=   0.0s
[CV] END ..........alpha=3, fit_prior=True, force_alpha=True; total time=   0.0s
[CV] END ..........alpha=3, fit_prior=True, force_alpha=True; total time=   0.0s
[CV] END ..........alpha=3, fit_prior=True, force_alpha=True; total time=   0.0s
[CV] END .........alpha=3, fit_prior=True, force_alpha=False; total time=   0.0s
[CV] END .........alpha=3, fit_prior=True, force_alpha=False; total time=   0.0s
[CV] END .........alpha=3, fit_prior=True, force_alpha=False; total time=   0.0s
[CV] END .........alpha=3, f
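
The 9 failed fits reported in the output come from the `penalty='l1'` with `solver='lbfgs'` combinations (3 values of `C` times 3 folds), since `lbfgs` does not support the L1 penalty. One way to avoid them is to pass `param_grid` as a list of dictionaries, so that only valid solver/penalty pairs are searched. The sketch below uses synthetic binary data from `make_classification` as a stand-in for the vectorized text; it is not the notebook's actual run.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): 120 samples, 10 features, 2 classes
X, y = make_classification(n_samples=120, n_features=10, n_informative=5, random_state=0)

# Each dictionary is searched separately, so invalid pairs never get built
param_grid = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": [0.1, 1, 10]},
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": [0.1, 1, 10]},
]

# error_score="raise" makes any remaining invalid combination fail loudly
search = GridSearchCV(
    LogisticRegression(max_iter=200),
    param_grid,
    cv=3,
    error_score="raise",
)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
print(n_candidates, search.best_params_)
```

This searches 9 candidates (6 for `liblinear`, 3 for `lbfgs`) instead of 12, and none of the scores are set to `nan`.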

### The Bag-of-Words model was tested with Logistic Regression and Naive Bayes classifiers.

The Logistic Regression model performed well, with only a very minor improvement after hyperparameter tuning.
The Naive Bayes model also performed reasonably well. Three hyperparameters were tuned (`alpha`, `fit_prior`, and `force_alpha`), but none of them improved the model's predictive power.
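
To judge which hyperparameters matter, one option is to load `cv_results_` into a DataFrame and compare the mean cross-validation score per parameter value. The sketch below runs on random count data (an assumption, standing in for the bag-of-words matrix), so its scores are meaningless; only the inspection pattern is the point.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Random non-negative counts as a stand-in for a bag-of-words matrix (assumption)
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(90, 20))
y = np.repeat([0, 1, 2], 30)

search = GridSearchCV(
    MultinomialNB(),
    {"alpha": [0.01, 0.1, 1, 10], "fit_prior": [True, False]},
    cv=3,
)
search.fit(X, y)

# One row per candidate; parameter values live in the "param_<name>" columns
results = pd.DataFrame(search.cv_results_)
by_alpha = results.groupby("param_alpha")["mean_test_score"].agg(["mean", "min", "max"])
print(by_alpha)
```

If the scores barely change across the values of a parameter, that parameter is probably not important for this dataset. The same grouping can be applied to `grid_search_lr.cv_results_` and `grid_search_nb.cv_results_` in the notebook.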

Further improvements could include experimenting with TF-IDF, trying SVM, or even using neural networks.
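
As a starting point for the TF-IDF and SVM ideas, here is a minimal sketch that chains `TfidfVectorizer` and `LinearSVC` in a scikit-learn pipeline. The six example sentences and their labels are made up for illustration and are not taken from AG News.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus (assumption), three documents per class
texts = [
    "stocks rally as markets open higher",
    "central bank raises interest rates",
    "investors watch quarterly earnings",
    "team wins championship final match",
    "striker scores twice in league game",
    "coach praises players after victory",
]
labels = ["Business"] * 3 + ["Sports"] * 3

# The pipeline fits the vectorizer and the classifier together
clf = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
clf.fit(texts, labels)

pred = clf.predict(["bank shares and interest rates"])
print(pred)
```

In the notebook, the pipeline could be fitted on `X_train` and `y_train` directly. Passing a pipeline to `GridSearchCV` also means the vectorizer is refitted inside each fold, rather than once on the full training set before the search.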