**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part I: Bag-of-Words Model

Please see the description of the assignment in the README file (section 1) <br>
**Guide notebook**: [guides/bow_guide.ipynb](guides/bow_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bow_guide` notebook

<br>

***

In [1]:
# imports for the project

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# huggingface_hub provides the hf:// filesystem used by pd.read_parquet below
!pip install huggingface_hub



### 1. Load the data

We can load this data directly from [Hugging Face Datasets](https://huggingface.co/docs/datasets/), hosted on the Hugging Face Hub, into a pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter to load more data.
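
To illustrate what `frac` controls, here is a minimal sketch on toy data. It assumes pandas 1.1 or newer, where `DataFrameGroupBy.sample` is available. The helper name `stratified_sample` and the toy frame are hypothetical and are not part of the guide notebook.

```python
import pandas as pd

# Toy stand-in for the AG News frame (hypothetical data): 4 labels x 100 rows.
toy = pd.DataFrame({
    "text": [f"document {i}" for i in range(400)],
    "label": [i % 4 for i in range(400)],
})

def stratified_sample(df: pd.DataFrame, frac: float, seed: int = 42) -> pd.DataFrame:
    """Draw the same fraction of rows from every label group."""
    return (
        df.groupby("label")
        .sample(frac=frac, random_state=seed)
        .reset_index(drop=True)
    )

small = stratified_sample(toy, frac=0.01)   # 1% of each label group
larger = stratified_sample(toy, frac=0.10)  # 10% of each label group
print(small["label"].value_counts().to_dict())
print(larger["label"].value_counts().to_dict())
```

Because the fraction is applied per label, the class balance of the original data is preserved at any sample size.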

In [13]:
splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}

train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])



print(train.shape, test.shape)
# Define the label map
label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac: float = 1e-2, label_map: dict[int, str] = label_map, seed: int = 42) -> pd.DataFrame:
    """ Preprocess the dataset """
    return (
        df
        .assign(label=lambda x: x['label'].map(label_map))  # Map integer labels to names
        [lambda df: df['label'].isin(label_map.values())]  # Keep only rows with a known label
        .groupby('label')[["text", "label"]]  # Group by label
        .apply(lambda x: x.sample(frac=frac, random_state=seed))  # Stratified sampling: same fraction per label
        .reset_index(drop=True)
    )

# Preprocess data
train_df = preprocess(train, frac=0.01)  # Sample 1% for training
test_df = preprocess(test, frac=0.1)  # Sample 10% for testing

# Free memory
del train
del test

print(f"Train shape: {train_df.shape}")
print(f"Test shape: {test_df.shape}")

# Select the text (X) and the labels (y)
X_train = train_df['text']
y_train = train_df['label']
X_test = test_df['text']
y_test = test_df['label']

# Build the bag-of-words representation
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

# Train the classifier
model = LogisticRegression(max_iter=1000)
model.fit(X_train_bow, y_train)

# Evaluate the model on the test set
y_pred = model.predict(X_test_bow)
print(classification_report(y_test, y_pred))




(120000, 2) (7600, 2)
Train shape: (1200, 2)
Test shape: (760, 2)
              precision    recall  f1-score   support

    Business       0.76      0.74      0.75       190
    Sci/Tech       0.76      0.78      0.77       190
      Sports       0.87      0.88      0.87       190
       World       0.83      0.82      0.82       190

    accuracy                           0.81       760
   macro avg       0.81      0.81      0.81       760
weighted avg       0.81      0.81      0.81       760



In [14]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10],
    'penalty': ['l2'],
}

grid_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid_search.fit(X_train_bow, y_train)

print("Best hyperparameters:", grid_search.best_params_)

Best hyperparameters: {'C': 1, 'penalty': 'l2'}
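
In the cell above, the vectorizer was fitted on the whole training sample before cross-validation, and only `C` was varied. One possible alternative, sketched below, wraps the vectorizer and the classifier in a scikit-learn `Pipeline`. That way each fold refits the vocabulary, and vectorizer settings can be searched alongside `C`. The toy corpus, the step names `bow` and `clf`, and the grid values are illustrative assumptions, not results from this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus standing in for X_train / y_train (hypothetical data).
texts = [
    "the team won the match", "a late goal decided the game",
    "the striker scored twice", "the coach praised the players",
    "stocks fell on weak earnings", "the bank raised interest rates",
    "shares rose after the merger", "investors sold bonds and stocks",
]
labels = ["Sports"] * 4 + ["Business"] * 4

# Vectorizer and classifier are fitted together inside every CV fold.
pipe = Pipeline([
    ("bow", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "bow__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

In the notebook itself, the same pattern would be fitted on the raw `X_train` text rather than on `X_train_bow`.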


In [None]:
# The model reached an overall accuracy of 81% on the test sample, as shown in the classification report.
# Sports was the best-identified category (F1 0.87), followed by World (F1 0.82).
# Business (F1 0.75) and Sci/Tech (F1 0.77) scored lowest, both with a precision of 0.76, which suggests that the model may need more training examples or better feature engineering to identify these categories more reliably.

# The hyperparameters considered for this model were:
# C: the grid search selected C = 1 (the scikit-learn default) from [0.001, 0.01, 0.1, 1, 10], i.e. the value with the best cross-validated trade-off between overfitting and underfitting in this search.
# penalty: only 'l2' was included in the grid, so it was not compared against alternatives; L2 regularization shrinks the coefficients, which limits overfitting on the sparse bag-of-words features.
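
The classification report gives per-class scores but does not show which categories are mistaken for which. A confusion matrix is one way to inspect that. The sketch below uses made-up toy labels, not this notebook's predictions; in the notebook the same call would take `y_test` and `y_pred`.

```python
from sklearn.metrics import confusion_matrix

class_names = ["Business", "Sci/Tech", "Sports", "World"]

# Toy true/predicted labels (hypothetical, for illustration only).
toy_true = ["Business", "Business", "Sci/Tech", "Sci/Tech", "Sports", "World"]
toy_pred = ["Business", "Sci/Tech", "Sci/Tech", "Business", "Sports", "World"]

# Rows are true classes, columns are predicted classes, both in class_names order.
cm = confusion_matrix(toy_true, toy_pred, labels=class_names)
print(cm)
```

Off-diagonal cells count the misclassifications, so a large value in the Business row under the Sci/Tech column would mean Business articles are often predicted as Sci/Tech.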
