**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part I: Bag-of-Words Model

Please see the description of the assignment in the README file (section 1) <br>
**Guide notebook**: [guides/bow_guide.ipynb](guides/bow_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: Are there any hyperparameters that are particularly important?

<br>

***

### Reflections
Implemented grid search (`GridSearchCV`) to tune the model, which increased accuracy by 1-2 percentage points, from 76-77% to 78%.

Switching to TF-IDF features increased accuracy by a further 2 percentage points, to 80%.
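
To see which hyperparameters matter most, one option is to tabulate the mean cross-validated score per setting from `GridSearchCV.cv_results_`. The sketch below uses synthetic data from `make_classification` and a smaller grid than the notebook; both are assumptions made so that it runs on its own.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: the notebook uses TF-IDF features of AG News)
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Smaller, hypothetical grid for illustration
param_grid = {"C": [0.01, 1, 100], "solver": ["liblinear", "lbfgs"]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=3, scoring="accuracy"
)
search.fit(X, y)

# cv_results_ holds one row per candidate; pivot it so each hyperparameter gets an axis
results = pd.DataFrame(search.cv_results_)
summary = results.pivot_table(
    index="param_C", columns="param_solver", values="mean_test_score"
)
print(summary)
```

If the scores vary strongly along one axis of the table and barely along the other, the first hyperparameter is the one that matters for this data.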


In [50]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report


### 1. Load the data

We can load this data directly from [Hugging Face Datasets](https://huggingface.co/docs/datasets/) (the Hugging Face Hub) into a pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter (passed to `preprocess` below) to use a larger sample of the data.

In [40]:
# Load data
splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])
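
Because re-running the loading cell downloads the data again, a small on-disk cache can help. The sketch below is one possible approach, not part of the assignment: `load_cached` and `fake_download` are hypothetical names, and `pickle` is used only to keep the example free of extra dependencies.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_cached(cache_path: Path, loader: Callable[[], Any]) -> Any:
    """Return the cached object if it exists; otherwise call loader() and cache the result."""
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    obj = loader()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(obj, f)
    return obj


# Demo with a stand-in loader that records how often it is called
calls = []


def fake_download():
    calls.append(1)
    return {"rows": 3}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train.pkl"
    first = load_cached(path, fake_download)   # runs the loader and writes the cache
    second = load_cached(path, fake_download)  # reads from disk, loader not called

print(len(calls))
```

In the notebook, the loader could be a lambda wrapping the `pd.read_parquet` call, so the download only happens when no cache file exists.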


In [41]:
# Define label mapping
label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

# Function to preprocess the dataset
def preprocess(df: pd.DataFrame, frac: float = 0.01, label_map: dict = label_map, seed: int = 42):
    return (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]
        .groupby('label')[['text', 'label']]
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)
    )

# Preprocess data
train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# Clear up memory
del train
del test
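
The per-label sampling in `preprocess` draws the same fraction from every class, so the class balance of the full dataset carries over to the sample. A minimal sketch of the same idea on made-up data, using `DataFrameGroupBy.sample` (available in pandas 1.1 and later) as an alternative spelling rather than the notebook's own code:

```python
import pandas as pd

# Toy frame with 4 rows per label (hypothetical data for illustration)
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(12)],
    "label": ["World", "Sports", "Business"] * 4,
})

# Draw half of each label group, reproducibly
sampled = (
    toy.groupby("label")
    .sample(frac=0.5, random_state=42)
    .reset_index(drop=True)
)
counts = sampled["label"].value_counts()
print(counts)
```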

# Split data

In [42]:
# Split train data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    train_df["text"], train_df["label"], test_size=0.2, random_state=42
)
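
A plain random split does not guarantee equal class sizes in the validation set (the validation report further down shows supports between 58 and 62). Passing `stratify` to `train_test_split` keeps the class proportions in both parts. The sketch below uses made-up documents; whether stratifying changes the scores here is not something this notebook tests.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# 40 toy documents, 10 per class (hypothetical data)
texts = [f"doc {i}" for i in range(40)]
labels = ["World", "Sports", "Business", "Sci/Tech"] * 10

X_tr, X_va, y_tr, y_va = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# With stratification, every class contributes equally to the validation set
val_counts = Counter(y_va)
print(val_counts)
```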

# Build the BoW model

In [43]:
# Fit TF-IDF on the training set only
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_df=0.95, min_df=5)
X_train_vectorized = tfidf.fit_transform(X_train)  # Fit only on train set

# Transform validation and test sets using the same vectorizer
X_val_vectorized = tfidf.transform(X_val)  # Transform only
test_vectorized = tfidf.transform(test_df["text"]) 
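
To make the TF-IDF weighting concrete, here is a hand-rolled sketch on three made-up sentences. It uses the smoothed idf that scikit-learn documents as its default, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation. It is a simplification: it splits on whitespace only and ignores the bigrams and the `min_df`/`max_df` pruning configured above.

```python
import math
from collections import Counter

# Made-up corpus for illustration
docs = [
    "stocks fall as oil prices rise",
    "oil prices rise again",
    "team wins the cup final",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents each term appears in
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_vector(tokens):
    """Return an L2-normalised {term: weight} mapping for one document."""
    tf = Counter(tokens)
    weights = {
        t: count * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
        for t, count in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


vec = tfidf_vector(tokenized[0])
print(vec)
```

Terms that occur in fewer documents ("stocks") end up with a larger weight than terms shared across documents ("oil"), which is the effect TF-IDF adds over raw counts.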

In [44]:
X_train_vectorized.todense()

matrix([[0.        , 0.        , 0.15501908, ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        ...,
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ]])
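
Almost every entry of a TF-IDF matrix is zero, so calling `.todense()` on a large one can use a lot of memory. A lighter way to inspect it is to look at the shape, the number of stored non-zeros, and single rows. The sketch below uses a small made-up SciPy sparse matrix as a stand-in for `X_train_vectorized`.

```python
import numpy as np
from scipy import sparse

# Small stand-in for a TF-IDF matrix (values are made up)
X = sparse.csr_matrix(np.array([
    [0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.2, 0.0, 0.0, 0.9],
]))

n_rows, n_cols = X.shape
density = X.nnz / (n_rows * n_cols)  # fraction of entries that are non-zero
first_row = X[0].toarray().ravel()   # densify a single row only
print(X.shape, X.nnz, density)
```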

# Create a classifier

In [45]:
# Step 2: Define GridSearchCV parameters for Logistic Regression
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],  # Regularization strength
    'penalty': ['l1', 'l2'],  # Regularization type
    'solver': ['liblinear', 'saga']  # Compatible solvers
}

# Initialize logistic regression
lr = LogisticRegression(max_iter=1000)

# Perform Grid Search with Cross-Validation
grid_search = GridSearchCV(lr, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=2)
grid_search.fit(X_train_vectorized, y_train)

# Get the best model from GridSearch
best_lr_clf = grid_search.best_estimator_

# Print best parameters
print("Best Parameters:", grid_search.best_params_)

Fitting 5 folds for each of 20 candidates, totalling 100 fits
[CV] END ...............C=0.01, penalty=l1, solver=liblinear; total time=   0.0s
[CV] END ...............C=0.01, penalty=l1, solver=liblinear; total time=   0.0s
[CV] END ...............C=0.01, penalty=l1, solver=liblinear; total time=   0.0s
[CV] END ...............C=0.01, penalty=l1, solver=liblinear; total time=   0.0s
[CV] END ...............C=0.01, penalty=l1, solver=liblinear; total time=   0.0s
[CV] END ....................C=0.01, penalty=l1, solver=saga; total time=   0.0s
[CV] END ....................C=0.01, penalty=l1, solver=saga; total time=   0.0s
[CV] END ....................C=0.01, penalty=l1, solver=saga; total time=   0.0s
[CV] END ....................C=0.01, penalty=l1, solver=saga; total time=   0.0s
[CV] END ....................C=0.01, penalty=l1, solver=saga; total time=   0.0s
[CV] END ...............C=0.01, penalty=l2, solver=liblinear; total time=   0.0s
[CV] END ...............C=0.01, penalty=l2, sol



[CV] END ......................C=10, penalty=l1, solver=saga; total time=   3.9s
[CV] END ......................C=10, penalty=l1, solver=saga; total time=   3.1s
[CV] END .................C=10, penalty=l2, solver=liblinear; total time=   0.0s
[CV] END ......................C=10, penalty=l1, solver=saga; total time=   3.6s
[CV] END .................C=10, penalty=l2, solver=liblinear; total time=   0.0s




[CV] END .....................C=100, penalty=l1, solver=saga; total time=   7.4s




[CV] END .....................C=100, penalty=l1, solver=saga; total time=   7.6s
[CV] END .....................C=100, penalty=l1, solver=saga; total time=   7.7s
[CV] END .....................C=100, penalty=l1, solver=saga; total time=   7.9s




[CV] END .....................C=100, penalty=l1, solver=saga; total time=   7.8s
[CV] END ......................C=10, penalty=l1, solver=saga; total time=   3.2s
Best Parameters: {'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}
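
In the grid search above, the vectorizer was fit on all of `X_train` before cross-validation, so the held-out part of each fold had already contributed to the vocabulary and idf weights. One way to avoid that, and to tune vectorizer settings at the same time, is to wrap both steps in a `Pipeline`. The sketch below runs on a tiny made-up corpus with a reduced, hypothetical grid; it is an illustration, not the configuration used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus so the sketch runs on its own
texts = [
    "team wins cup final", "striker scores late goal", "coach praises team",
    "club signs new striker", "fans cheer cup win", "goal decides final match",
    "stocks fall on oil fears", "bank raises interest rates", "oil prices climb again",
    "market rallies as rates hold", "shares drop after profit warning",
    "bank profit beats forecast",
]
labels = ["Sports"] * 6 + ["Business"] * 6

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="liblinear", max_iter=1000)),
])

# Step names prefix the parameter names: <step>__<parameter>
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1, 10],
}

# Each CV fold refits the vectorizer on its own training part only
search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_)
```

The fitted `search` object accepts raw text in `predict`, since the best pipeline applies its own vectorizer before classifying.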


# Get predictions

In [47]:
# X_val_vectorized was already created above with tfidf.transform (not fit_transform)
y_pred = best_lr_clf.predict(X_val_vectorized)

# Evaluate BoW model

In [48]:

# Step 3: Evaluate on Validation Set
y_pred = best_lr_clf.predict(X_val_vectorized)
print("\nPerformance on Validation Set:")
# y_val already holds class names, and the report lists string labels in sorted order,
# so passing target_names=label_map.values() would attach the names to the wrong rows.
print(classification_report(y_val, y_pred))


Performance on Validation Set:
              precision    recall  f1-score   support

    Business       0.83      0.61      0.70        62
    Sci/Tech       0.73      0.67      0.70        60
      Sports       0.74      0.88      0.80        60
       World       0.75      0.86      0.80        58

    accuracy                           0.75       240
   macro avg       0.76      0.76      0.75       240
weighted avg       0.76      0.75      0.75       240
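
With string labels, `classification_report` orders its rows by the sorted unique labels, and `target_names` is applied to those rows by position. The sketch below, on made-up labels, shows the default row order and how the `labels` argument pins a chosen order instead.

```python
from sklearn.metrics import classification_report

# Made-up labels for illustration
y_true = ["World", "Sports", "Business", "Sci/Tech", "World", "Sports"]
y_pred = ["World", "Sports", "Business", "Sci/Tech", "Sports", "Sports"]

# Default: rows follow the alphabetically sorted labels
default_rows = list(classification_report(y_true, y_pred, output_dict=True))[:4]

# Explicit row order via `labels`
order = ["World", "Sports", "Business", "Sci/Tech"]
ordered_rows = list(
    classification_report(y_true, y_pred, labels=order, output_dict=True)
)[:4]

print(default_rows)
print(ordered_rows)
```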



In [49]:
# Step 4: Evaluate on Test Set
y_test_pred = best_lr_clf.predict(test_vectorized)
print("\nPerformance on Test Set:")
# The labels are already class names; the report lists them in sorted order,
# so no target_names are passed (they would be matched to the wrong rows).
print(classification_report(test_df["label"], y_test_pred))


Performance on Test Set:
              precision    recall  f1-score   support

    Business       0.77      0.73      0.75       190
    Sci/Tech       0.77      0.75      0.76       190
      Sports       0.84      0.89      0.86       190
       World       0.81      0.83      0.82       190

    accuracy                           0.80       760
   macro avg       0.80      0.80      0.80       760
weighted avg       0.80      0.80      0.80       760
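
As a sanity check on how to read the report: each F1 score is the harmonic mean of that row's precision and recall, and the macro average is the unweighted mean over the classes. The sketch below recomputes both from the rounded test-set figures above, so small rounding differences are expected.

```python
# (precision, recall, f1) per class, copied from the test-set report above
rows = [
    (0.77, 0.73, 0.75),
    (0.77, 0.75, 0.76),
    (0.84, 0.89, 0.86),
    (0.81, 0.83, 0.82),
]


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = [f1(p, r) for p, r, _ in rows]
macro_f1 = sum(f for _, _, f in rows) / len(rows)  # unweighted mean over classes

print([round(v, 2) for v in recomputed], round(macro_f1, 2))
```

Because all four classes have the same support (190), the macro and weighted averages coincide on the test set.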

