**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part I: Bag-of-Words Model

Please see the description of the assignment in the README file (section 1) <br>
**Guide notebook**: [guides/bow_guide.ipynb](guides/bow_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bow_guide` notebook.

<br>

***

In [2]:
# imports for the project

import pandas as pd

### 1. Load the data

We can load this data directly from the Hugging Face Hub into a Pandas DataFrame (see the [Hugging Face Datasets](https://huggingface.co/docs/datasets/) documentation). Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter to load more data.
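Because the loading cell downloads the data again on every run, one option is to cache each split on disk after the first download. The sketch below is only an illustration: `load_cached`, the `data_cache` directory, and the pickle format are assumptions rather than part of the assignment, and the `reader` argument exists so the idea can be tried without network access.

```python
from pathlib import Path

import pandas as pd


def load_cached(url: str, cache_dir: str = "data_cache", reader=pd.read_parquet) -> pd.DataFrame:
    """Return the DataFrame at `url`, downloading it only on the first call.

    Hypothetical helper: later calls read a local pickle instead of the network.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    local_file = cache / (Path(url).stem + ".pkl")

    if local_file.exists():
        return pd.read_pickle(local_file)

    df = reader(url)          # the only step that touches the network
    df.to_pickle(local_file)  # keep a local copy for subsequent runs
    return df
```

With this in place, the training split could be loaded as `load_cached("hf://datasets/fancyzhx/ag_news/" + splits["train"])`.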

In [3]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}

train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

print(train.shape, test.shape)

(120000, 2) (7600, 2)


In [4]:

label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac : float = 1e-2, label_map : dict[int, str] = label_map, seed : int = 42) -> pd.DataFrame:
    """ Preprocess the dataset 

    Operations:
    - Map the label to the corresponding category
    - Filter out the labels not in the label_map
    - Sample a fraction of the dataset (stratified by label)

    Args:
    - df (pd.DataFrame): The dataset to preprocess
    - frac (float): The fraction of the dataset to sample in each category
    - label_map (dict): A mapping of the original label to the new label
    - seed (int): The random seed for reproducibility

    Returns:
    - pd.DataFrame: The preprocessed dataset
    """

    return  (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]
        .groupby('label')[["text", "label"]]
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)

    )

train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
del train
del test

train_df.shape, test_df.shape

((1200, 2), (760, 2))
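To see what the stratified sampling in `preprocess` does, here is a minimal sketch on a toy DataFrame. It uses `DataFrameGroupBy.sample`, a more compact alternative to the `groupby(...).apply(...)` call above; the toy data and `frac=0.5` are made up for illustration.

```python
import pandas as pd

# Toy data: 4 rows per category (made up for illustration).
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(8)],
    "label": ["World"] * 4 + ["Sports"] * 4,
})

# Draw the same fraction from every category so the class balance is preserved.
sample = toy.groupby("label").sample(frac=0.5, random_state=42).reset_index(drop=True)

# Each category contributes half of its rows.
print(sample["label"].value_counts().to_dict())
```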

### 2. BoW using Logistic Regression in Scikit-Learn

In [5]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [6]:
(
    
    X_train,
    X_val,
    y_train,
    y_val

) = train_test_split(train_df["text"], train_df["label"], test_size=0.2, random_state=42)

print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

(960,) (240,) (960,) (240,)
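A purely random split does not guarantee equal class counts in the two parts. Passing `stratify` to `train_test_split` keeps the class proportions identical; the sketch below shows this on made-up data and is a suggestion, not one of the original steps.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up data: 10 documents for each of the four categories.
labels = ["World", "Sports", "Business", "Sci/Tech"]
texts = [f"document {i}" for i in range(40)]
y = [labels[i % 4] for i in range(40)]

X_tr, X_va, y_tr, y_va = train_test_split(
    texts, y, test_size=0.2, random_state=42, stratify=y
)

# Each category contributes 2 of the 8 validation documents.
print(Counter(y_va))
```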


In [7]:
# Build the bag-of-words vocabulary from the training texts and encode them as token counts
cv = CountVectorizer()
X_train_vectorized = cv.fit_transform(X_train)

In [8]:
X_train_vectorized.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])
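The matrix above has one row per document and one column per vocabulary word, with each cell holding a word count; it prints as zeros because most words do not occur in most documents. A minimal pure-Python sketch of the same idea, on a made-up two-document corpus, is shown below. The tokenization mirrors `CountVectorizer`'s defaults (lowercasing, tokens of at least two word characters); the helper name `bag_of_words` is hypothetical.

```python
import re
from collections import Counter


def bag_of_words(docs: list[str]) -> tuple[list[str], list[list[int]]]:
    """Return (sorted vocabulary, count matrix) for a list of documents."""
    token_pattern = re.compile(r"\b\w\w+\b")  # tokens of 2+ word characters
    tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

    # One column per distinct token, in alphabetical order.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})

    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


vocab, counts = bag_of_words(["The cat sat", "The dog sat on the cat"])
print(vocab)   # ['cat', 'dog', 'on', 'sat', 'the']
print(counts)  # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```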

In [9]:
lr_clf = LogisticRegression()

lr_clf.fit(X_train_vectorized, y_train)

In [10]:
X_val_vectorized = cv.transform(X_val)

y_pred = lr_clf.predict(X_val_vectorized)

In [11]:

# classification_report sorts string labels alphabetically and applies
# target_names by position, so the names are passed in sorted order.
print("Performance on the training set:")
print(classification_report(y_train, lr_clf.predict(X_train_vectorized), target_names=sorted(label_map.values())))

print("Performance on the validation set:")
print(classification_report(y_val, y_pred, target_names=sorted(label_map.values())))

Performance on the training set:
              precision    recall  f1-score   support

    Business       1.00      1.00      1.00       238
    Sci/Tech       1.00      1.00      1.00       240
      Sports       1.00      1.00      1.00       240
       World       1.00      1.00      1.00       242

    accuracy                           1.00       960
   macro avg       1.00      1.00      1.00       960
weighted avg       1.00      1.00      1.00       960

Performance on the validation set:
              precision    recall  f1-score   support

    Business       0.76      0.68      0.72        62
    Sci/Tech       0.69      0.60      0.64        60
      Sports       0.79      0.87      0.83        60
       World       0.78      0.90      0.83        58

    accuracy                           0.76       240
   macro avg       0.75      0.76      0.75       240
weighted avg       0.75      0.76      0.75       240
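One detail worth checking when reading these reports: with string labels, `classification_report` orders the rows alphabetically, and `target_names` only renames the rows by position. The sketch below, on made-up predictions, shows the row order scikit-learn uses; `output_dict=True` is used just to make the order easy to inspect.

```python
from sklearn.metrics import classification_report

# Made-up labels and predictions for illustration.
y_true = ["World", "Sports", "Business", "Sci/Tech", "World", "Sports"]
y_pred = ["World", "Sports", "Business", "World", "World", "Business"]

report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)

# The first four keys are the classes, in sorted (alphabetical) order.
row_order = list(report)[:4]
print(row_order)  # ['Business', 'Sci/Tech', 'Sports', 'World']

# Recall for "Sports": 1 of the 2 true Sports documents was predicted correctly.
print(report["Sports"]["recall"])  # 0.5
```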



In [12]:
test_df_vectorized = cv.transform(test_df["text"])

print("Performance on the test set:")
print(classification_report(test_df["label"], lr_clf.predict(test_df_vectorized), target_names=sorted(label_map.values())))

Performance on the test set:
              precision    recall  f1-score   support

    Business       0.74      0.72      0.73       190
    Sci/Tech       0.75      0.72      0.73       190
      Sports       0.83      0.88      0.86       190
       World       0.79      0.79      0.79       190

    accuracy                           0.78       760
   macro avg       0.78      0.78      0.78       760
weighted avg       0.78      0.78      0.78       760



In [13]:
# Hyperparameter tuning for Logistic Regression
logreg_param_grid = {
    'C': [0.1, 1, 10],
    'solver': ['lbfgs', 'liblinear'],
    'max_iter': [1000, 5000]
}

logreg_clf = GridSearchCV(LogisticRegression(), logreg_param_grid, cv=3, scoring='accuracy', verbose=1)
logreg_clf.fit(X_train_vectorized, y_train)

# Best Logistic Regression model
best_logreg = logreg_clf.best_estimator_
y_val_pred_logreg = best_logreg.predict(X_val_vectorized)
print("Best Logistic Regression parameters:", logreg_clf.best_params_)
print("Performance on the validation set:")
print(classification_report(y_val, y_val_pred_logreg, target_names=sorted(label_map.values())))

Fitting 3 folds for each of 12 candidates, totalling 36 fits
Best Logistic Regression parameters: {'C': 10, 'max_iter': 1000, 'solver': 'liblinear'}
Performance on the validation set:
              precision    recall  f1-score   support

    Business       0.76      0.71      0.73        62
    Sci/Tech       0.71      0.58      0.64        60
      Sports       0.79      0.87      0.83        60
       World       0.75      0.86      0.80        58

    accuracy                           0.75       240
   macro avg       0.75      0.76      0.75       240
weighted avg       0.75      0.75      0.75       240



In [14]:
y_test_pred_logreg = best_logreg.predict(test_df_vectorized)

In [15]:
print("Performance of Logistic Regression on the test set:")
print(classification_report(test_df["label"], y_test_pred_logreg, target_names=sorted(label_map.values())))

Performance of Logistic Regression on the test set:
              precision    recall  f1-score   support

    Business       0.73      0.74      0.73       190
    Sci/Tech       0.76      0.73      0.74       190
      Sports       0.85      0.89      0.87       190
       World       0.82      0.81      0.81       190

    accuracy                           0.79       760
   macro avg       0.79      0.79      0.79       760
weighted avg       0.79      0.79      0.79       760



### 3. BoW using SVM in Scikit-Learn

In [16]:
from sklearn.svm import LinearSVC

In [17]:
cv_svm = CountVectorizer()
X_train_vectorized_svm = cv_svm.fit_transform(X_train)
X_val_vectorized_svm = cv_svm.transform(X_val)

In [18]:
svm_clf = LinearSVC()
svm_clf.fit(X_train_vectorized_svm, y_train)

In [19]:
y_val_pred_svm = svm_clf.predict(X_val_vectorized_svm)

In [20]:
print("Performance on the validation set:")
print(classification_report(y_val, y_val_pred_svm, target_names=sorted(label_map.values())))

Performance on the validation set:
              precision    recall  f1-score   support

    Business       0.77      0.69      0.73        62
    Sci/Tech       0.70      0.58      0.64        60
      Sports       0.78      0.88      0.83        60
       World       0.73      0.83      0.77        58

    accuracy                           0.75       240
   macro avg       0.74      0.75      0.74       240
weighted avg       0.74      0.75      0.74       240



In [21]:
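# Use the vectorizer fitted for the SVM (same vocabulary as `cv`, since both were fitted on X_train)
test_df_vectorized = cv_svm.transform(test_df["text"])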
y_test_pred_svm = svm_clf.predict(test_df_vectorized)

In [22]:
print("Performance on the test set:")
print(classification_report(test_df["label"], y_test_pred_svm, target_names=sorted(label_map.values())))

Performance on the test set:
              precision    recall  f1-score   support

    Business       0.69      0.73      0.71       190
    Sci/Tech       0.73      0.69      0.71       190
      Sports       0.85      0.87      0.86       190
       World       0.82      0.81      0.81       190

    accuracy                           0.77       760
   macro avg       0.77      0.77      0.77       760
weighted avg       0.77      0.77      0.77       760



In [23]:
# Hyperparameter tuning for SVM
svm_param_grid = {
    'C': [0.1, 1, 10],
    'max_iter': [1000, 5000]
}

svm_clf = GridSearchCV(LinearSVC(), svm_param_grid, cv=3, scoring='accuracy', verbose=1)
svm_clf.fit(X_train_vectorized_svm, y_train)

# Best SVM model
best_svm = svm_clf.best_estimator_
y_val_pred_svm = best_svm.predict(X_val_vectorized_svm)
print("Best SVM parameters:", svm_clf.best_params_)
print("Performance on the validation set:")
print(classification_report(y_val, y_val_pred_svm, target_names=sorted(label_map.values())))

Fitting 3 folds for each of 6 candidates, totalling 18 fits
Best SVM parameters: {'C': 0.1, 'max_iter': 1000}
Performance on the validation set:
              precision    recall  f1-score   support

    Business       0.77      0.69      0.73        62
    Sci/Tech       0.71      0.60      0.65        60
      Sports       0.79      0.88      0.83        60
       World       0.74      0.84      0.79        58

    accuracy                           0.75       240
   macro avg       0.75      0.76      0.75       240
weighted avg       0.75      0.75      0.75       240



In [24]:
y_test_pred_svm = best_svm.predict(test_df_vectorized)

In [25]:
print("Performance on the test set:")
print(classification_report(test_df["label"], y_test_pred_svm, target_names=sorted(label_map.values())))

Performance on the test set:
              precision    recall  f1-score   support

    Business       0.72      0.74      0.73       190
    Sci/Tech       0.75      0.72      0.73       190
      Sports       0.86      0.89      0.87       190
       World       0.82      0.81      0.81       190

    accuracy                           0.79       760
   macro avg       0.79      0.79      0.79       760
weighted avg       0.79      0.79      0.79       760



### 4. Reflection on Model Performance and Hyperparameter Choices

**BoW Feature Representation:**
   - The `CountVectorizer` representation works reasonably well here (77-79% test accuracy from only 960 training documents), but it ignores word order, semantics, and context.
   - Using n-grams or TF-IDF weighting might improve the representation further; neither was tested here.
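The n-gram and TF-IDF ideas could be tried with a `Pipeline`, which also lets `GridSearchCV` refit the vectorizer inside each fold instead of reusing a vocabulary fitted on all training texts. This is a sketch on a tiny made-up corpus, not something that was run for the results in this notebook; the step names `tfidf` and `clf` and the grid values are arbitrary choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: 6 documents per class.
texts = [
    "team wins the cup final", "striker scores late goal", "coach praises the team",
    "player signs new contract", "club wins league title", "goal decides the match",
    "stocks fall on rate fears", "company reports record profit", "oil prices rise again",
    "bank cuts interest rate", "shares jump after earnings", "market closes at new high",
]
labels = ["Sports"] * 6 + ["Business"] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams only vs. unigrams + bigrams
    "clf__C": [0.1, 1, 10],
}

# Each fold fits its own vocabulary and IDF weights on that fold's training part.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["late goal wins the match"]))
```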

---

Both **Logistic Regression** and **SVM** classifiers were evaluated with hyperparameter tuning using **GridSearchCV**.

##### **Logistic Regression**  
- **Best parameters found:**  
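  - `C = 10` (weaker regularization, allowing larger weights and a closer fit to the training data).  
  - `solver = liblinear` (fits one-vs-rest binary classifiers, whereas `lbfgs` fits a single multinomial model; the default L2 penalty was used with both).  
  - `max_iter = 1000` (the larger value of 5000 brought no improvement in cross-validation).  

- **Performance Observations:**  
  - **Tuning changed the results only marginally**: test accuracy rose from 78% to 79% relative to the default model (`C = 1`, `lbfgs`), while validation accuracy slipped from 76% to 75%. Differences of this size are probably within sampling noise for 240 validation and 760 test examples.  
  - `liblinear` was selected over `lbfgs` by the grid search; the gap may owe more to the one-vs-rest formulation than to feature sparsity.  
  - The tuned model reached **79%** test accuracy, with the best-scoring class at an F1-score of 0.87.  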

---

##### **Support Vector Machine (SVM - LinearSVC)**  
- **Best parameters found:**  
  - `C = 0.1` (stronger regularization than the default `C = 1`).  
  - `max_iter = 1000` (the larger value of 5000 brought no improvement in cross-validation).  

- **Performance Observations:**  
  - **Stronger regularization (`C = 0.1`) helped slightly on the test set**: accuracy rose from 77% with the default `LinearSVC` to 79%, while validation accuracy stayed at 75%.  
  - Performed comparably to Logistic Regression, achieving **79% test accuracy**.  
  - Per-class F1-scores on the test set were within 0.01 of the tuned Logistic Regression for every category, so neither model shows a clear advantage on any class.  

---

##### **Regularization and Model Generalization**  
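The regularization strength (`C`, the inverse of the penalty weight) was the main hyperparameter tuned:

- **Low `C` (e.g., 0.1 in SVM):**  
  - Penalizes large weights more strongly, giving a simpler model that is less prone to overfitting.  
  - Was preferred by cross-validation for `LinearSVC`.  

- **High `C` (e.g., 10 in Logistic Regression):**  
  - Penalizes large weights less, letting the model fit the training data more closely (the decision boundaries remain linear).  
  - Was preferred by cross-validation for Logistic Regression, but carries a higher risk of overfitting.  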
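
For reference, the role of `C` can be read off the objectives these estimators minimize. In the binary case with an L2 penalty and labels $y_i \in \{-1, +1\}$, they can be written as follows (up to constant scaling; intercept regularization and the multiclass extension are omitted, so treat this as a simplified sketch):

$$
\text{Logistic Regression:}\quad \min_{w,\,b}\; \tfrac{1}{2}\lVert w\rVert_2^2 + C \sum_{i=1}^{n} \log\!\bigl(1 + e^{-y_i (w^\top x_i + b)}\bigr)
$$

$$
\text{LinearSVC (squared hinge, the default loss):}\quad \min_{w,\,b}\; \tfrac{1}{2}\lVert w\rVert_2^2 + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i (w^\top x_i + b)\bigr)^2
$$

`C` multiplies the data-fit term, so a small `C` lets the penalty $\tfrac{1}{2}\lVert w\rVert_2^2$ dominate (stronger regularization), while a large `C` prioritizes fitting the training examples.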

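Regularization strength controls the **bias-variance trade-off**. The untuned Logistic Regression reached 100% training accuracy against 76% on validation, so the models overfit the 960 training documents to some degree, and the choice of `C` moved test accuracy by only one or two points.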
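
One way to examine this trade-off more directly is to sweep `C` and compare training and validation accuracy side by side. The sketch below does so on synthetic data from `make_classification`, purely as an illustration; the dataset sizes and `C` values are arbitrary assumptions, and the numbers it prints say nothing about AG News.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a high-dimensional problem with few informative features.
X, y = make_classification(
    n_samples=300, n_features=200, n_informative=10, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

results = []
for C in [0.01, 0.1, 1, 10]:
    model = LogisticRegression(C=C, max_iter=5000).fit(X_tr, y_tr)
    results.append((C, model.score(X_tr, y_tr), model.score(X_va, y_va)))

# A widening gap between the two accuracies as C grows points to overfitting.
for C, train_acc, val_acc in results:
    print(f"C={C:<5} train={train_acc:.2f} val={val_acc:.2f}")
```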