# 📌 Machine Learning Techniques - Assignment 2

## 📝 Assignment Overview
This notebook is part of **Assignment 2** for the course *Machine Learning Techniques*.  
The objective is to build a **multi-label text classification model** for Swedish text headings.  

We are provided with preprocessed text data, and our task is to:
1. **Select and train multiple classifiers** (e.g., Logistic Regression, SVC, Random Forest, Naïve Bayes).
2. **Optimize hyperparameters** using **GridSearchCV** to improve model performance.
3. **Evaluate the models** using accuracy, precision, recall, and F1-score.
4. **Compare models** to determine the best-performing approach.

The final goal is to achieve the highest possible accuracy on the test dataset.

---

## 📊 Notebook Structure

🔹 **1. Data Preprocessing**

- Load the dataset
- Perform text cleaning (lowercasing, removing stopwords, stemming)
- Convert text to numerical features using **TF-IDF vectorization**

🔹 **2. Model Selection & Hyperparameter Optimization**

- Define multiple classifiers
- Tune hyperparameters using **GridSearchCV**

🔹 **3. Training & Evaluation of Optimized Models**

- Train models using the best hyperparameters
- Evaluate performance using **classification reports**

🔹 **4. Model Comparison & Final Results**

- Compare model performance
- Identify the best-performing model

---

## 🚀 Expected Outcome
At the end of this notebook, we will have:
✅ A trained text classification model with optimized hyperparameters  
✅ Performance evaluation metrics (accuracy, precision, recall, F1-score)  
✅ Insights into which model performed best for this dataset  

By following this structured approach, we ensure a **systematic and well-documented** machine learning workflow.

---

In [37]:
# Importing regular expressions for text manipulation and cleaning
import re
# Importing system-specific parameters and functions (e.g., to suppress warnings)
import sys
# Importing module to control warning messages
import warnings
# Importing the Natural Language Toolkit for text preprocessing
import nltk
# Importing pandas for handling and manipulating structured datasets
import pandas as pd
# Importing numpy for numerical computations
import numpy as np
# Importing stopword lists for removing common, uninformative words
from nltk.corpus import stopwords
# Importing SnowballStemmer for reducing words to their root forms (stemming)
from nltk.stem.snowball import SnowballStemmer
# Importing train_test_split for splitting the dataset into training and testing subsets
from sklearn.model_selection import train_test_split
# Importing TfidfVectorizer for converting text data into numerical vectors using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

In [38]:
# Specifying the custom path for nltk data to ensure that required resources can be loaded correctly
nltk.data.path.append('C:\\Users\\JonGo/nltk_data')

# Downloading the 'punkt_tab' resource, which is used for tokenization purposes
# Note: This ensures that nltk can properly split text into sentences or words during preprocessing
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\JonGo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [39]:
# Suppressing all warning messages to ensure a clean and readable output during execution
# This is useful in cases where warnings are not critical to the functioning of the code
if not sys.warnoptions:
    warnings.simplefilter("ignore")


# 📊 Data Preprocessing

In this section, we clean and prepare the raw text data to ensure it is in an optimal format for machine learning.  
The preprocessing steps include:
- **Text Normalization**: Converting text to lowercase and removing punctuation, digits, and HTML tags.
- **Stopword Removal**: Eliminating common words that do not add meaningful information.
- **Stemming**: Reducing words to their root form to improve generalization.
- **TF-IDF Vectorization**: Converting text data into numerical features for model training.
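
The order of the normalization steps matters: HTML tags can only be matched while `<` and `>` are still in the text. The following is a minimal sketch of that idea using only the standard library. The function names and the sample heading are hypothetical, and the final whitespace collapse is an addition for readability, not part of the notebook's pipeline.

```python
import re

def normalize(text):
    """Lowercase, then strip HTML tags, punctuation, and digits, in that order."""
    text = text.lower()
    text = re.sub(r"<.*?>", "", text)    # tags first, while '<' and '>' still exist
    text = re.sub(r"[^\w\s]", "", text)  # then punctuation
    text = re.sub(r"\d+", "", text)      # then digits
    return " ".join(text.split())        # collapse leftover whitespace

def normalize_wrong_order(text):
    """Same steps, but punctuation is removed before the tag pattern runs."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # deletes '<', '>' and '/'
    text = re.sub(r"<.*?>", "", text)    # can no longer match anything
    text = re.sub(r"\d+", "", text)
    return " ".join(text.split())

sample = "<b>Val 2022</b>: Ny regering!"  # hypothetical heading
print(normalize(sample))              # val ny regering
print(normalize_wrong_order(sample))  # bval b ny regering
```

In the second variant the tag name `b` leaks into the text as an extra token, which is the kind of noise the cleaning step is meant to remove.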

This preprocessing pipeline is intended to reduce noise in the text and to shrink the vocabulary the models have to learn from; whether it improves accuracy has to be checked on the test data.
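
To make the TF-IDF step more concrete, here is a small pure-Python sketch of the weighting. It assumes the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization, which is what `TfidfVectorizer` uses with its default settings, and it only handles unigrams. The function name `tfidf` and the toy headings are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, scaled to unit L2 norm."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

# Hypothetical, already cleaned headings
docs = ["ny regering", "ny budget", "ny match idag"]
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(weight, 3) for term, weight in vec.items()})
```

The word `ny` appears in every toy heading, so it receives a lower weight than `regering`, which appears in only one.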

In [40]:
# Load the data from a CSV file located at the specified path
# This dataset is the basis for the text classification task
data_path = "C:\\workspace\\ML\\ML-grupp6\\Inlamning2\\Book1.csv"
data_raw = pd.read_csv(data_path)

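# Randomly shuffle the data to ensure no unintended patterns influence the model's training
# A fixed random_state keeps the shuffle, and therefore the later train/test split, reproducible
data_raw = data_raw.sample(frac=1, random_state=42)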

# Identify the columns representing the target categories for classification
# Here, categories are assumed to start from the 3rd column onwards
categories = list(data_raw.columns.values)
categories = categories[2:]

# Perform basic cleaning on the "Heading" column:
# - Convert text to lowercase
# - Remove any HTML tags (before punctuation removal, while '<' and '>' are still present)
# - Remove punctuation using regular expressions
# - Remove numeric digits
data_raw['Heading'] = (
    data_raw['Heading']
    .str.lower()
    .str.replace(r'<.*?>', '', regex=True)    # Remove HTML tags
    .str.replace(r'[^\w\s]', '', regex=True)  # Remove punctuation
    .str.replace(r'\d+', '', regex=True)      # Remove digits
)

# Download and use the Swedish stopwords for further text cleaning
nltk.download('stopwords')
stop_words = set(stopwords.words('swedish'))

# Function to remove stopwords from a sentence
# This step eliminates common words (e.g., "och", "att") that are unlikely to contribute to classification
def removeStopWords(sentence):
    return " ".join(
        [word for word in nltk.word_tokenize(sentence) 
         if word not in stop_words]
    )

# Apply the stopword removal function to the "Heading" column
data_raw['Heading'] = data_raw['Heading'].apply(removeStopWords)

# Initialize the Snowball stemmer for Swedish
# This reduces words to their root form (e.g., "springer" -> "spring")
stemmer = SnowballStemmer("swedish")

# Function to apply stemming to each word in a sentence
def stemming(sentence):
    return " ".join(stemmer.stem(word) for word in sentence.split())

# Apply stemming to the "Heading" column
data_raw['Heading'] = data_raw['Heading'].apply(stemming)

# Split the dataset into training and testing subsets
# - 70% of the data is used for training
# - 30% is used for testing
train, test = train_test_split(data_raw, random_state=42, test_size=0.30, shuffle=True)

# Extract the text (features) for training and testing
train_text = train['Heading']
test_text = test['Heading']

# Use TF-IDF vectorization to convert text data into numerical feature vectors
# - `strip_accents='unicode'` strips diacritics; note that this also maps the Swedish letters å, ä, ö to a, a, o
# - `analyzer='word'` specifies word-based tokenization
# - `ngram_range=(1,3)` considers unigrams, bigrams, and trigrams
# - `norm='l2'` scales each document vector to unit Euclidean length
vectorizer = TfidfVectorizer(strip_accents='unicode', 
                             analyzer='word', 
                             ngram_range=(1,3), 
                             norm='l2')
vectorizer.fit(train_text)

# Transform the training and testing text data into numerical feature matrices
x_train = vectorizer.transform(train_text)
y_train = train.drop(labels=['Id', 'Heading'], axis=1)  # Drop non-target columns for the labels

x_test = vectorizer.transform(test_text)
y_test = test.drop(labels=['Id', 'Heading'], axis=1)  # Drop non-target columns for the labels


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\JonGo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [41]:
# Importing LogisticRegression for implementing a linear classification model
from sklearn.linear_model import LogisticRegression

# Importing MultinomialNB, a Naive Bayes classifier suitable for discrete data (e.g., text)
from sklearn.naive_bayes import MultinomialNB

# Importing RandomForestClassifier, an ensemble learning method based on decision trees
from sklearn.ensemble import RandomForestClassifier

# Importing OneVsRestClassifier for handling multi-label classification problems
# This wraps around base classifiers and trains one model per label
from sklearn.multiclass import OneVsRestClassifier

# Importing MultiOutputClassifier, which fits one estimator per target column
# Note: it is not used below, and RandomForestClassifier also accepts multi-label targets natively
from sklearn.multioutput import MultiOutputClassifier

# Importing metrics for evaluating classification models:
# - accuracy_score: for multi-label targets this is subset accuracy (every label of a sample must match)
# - classification_report: provides a detailed evaluation with precision, recall, and F1-score
from sklearn.metrics import accuracy_score, classification_report

# Importing GridSearchCV for hyperparameter tuning to find the best model configuration
from sklearn.model_selection import GridSearchCV

# Importing SVC (Support Vector Classifier), a versatile classifier capable of handling both linear and non-linear problems
from sklearn.svm import SVC


# 🏆 Model Selection & Hyperparameter Optimization

To achieve the best classification performance, we experiment with different models:  
- **Logistic Regression**
- **Support Vector Classifier (SVC)**
- **Random Forest**
- **Multinomial Naïve Bayes**

Each model is tuned using **GridSearchCV**, which cross-validates every combination of the hyperparameter values listed in its grid.  
This identifies the best configuration among the candidates in the grid; values outside the grid are not explored.
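
As a rough guide to the cost of the search, the sketch below counts how many candidate configurations each grid expands to and how many cross-validation fits that implies with 5 folds. The grids are copied from this notebook; the helper `grid_candidates` and the constant `CV_FOLDS` are hypothetical names used only for illustration.

```python
from itertools import product

# Grids copied from this notebook
param_grids = {
    "Logistic Regression": {
        "estimator__C": [0.1, 1, 10],
        "estimator__penalty": ["l1", "l2"],
    },
    "SVC": {
        "estimator__C": [0.1, 1, 10],
        "estimator__kernel": ["linear", "rbf"],
        "estimator__gamma": [0.001, 0.01, 0.1],
    },
    "Random Forest": {
        "estimator__n_estimators": [50, 100, 200],
        "estimator__max_depth": [None, 10, 20],
        "estimator__min_samples_split": [2, 5, 10],
    },
    "MultinomialNB": {
        "estimator__alpha": [0.1, 0.5, 1.0, 2.0],
        "estimator__fit_prior": [True, False],
    },
}
CV_FOLDS = 5

def grid_candidates(grid):
    """Expand a parameter grid into the list of all value combinations."""
    keys = sorted(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[key] for key in keys))]

for name, grid in param_grids.items():
    n_candidates = len(grid_candidates(grid))
    print(f"{name}: {n_candidates} candidates, "
          f"{n_candidates * CV_FOLDS} cross-validation fits")
```

With these grids the counts are 6, 18, 27, and 8 candidates, so the Random Forest search alone needs 135 fits before the final refit on the full training set. Since `gamma` has no effect on a linear kernel, 9 of the 18 SVC candidates cover only 3 distinct models.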

In [42]:
# Define hyperparameter grids for each model
# These grids specify the range of values to test during hyperparameter tuning with GridSearchCV
param_grids = {
    "Logistic Regression": {
        'estimator__C': [0.1, 1, 10],  # Regularization strength (higher values = less regularization)
        'estimator__penalty': ['l1', 'l2']  # Type of regularization (L1: Lasso, L2: Ridge)
    },
    "SVC": {
        'estimator__C': [0.1, 1, 10],  # Regularization parameter (controls margin width)
        'estimator__kernel': ['linear', 'rbf'],  # Type of kernel (linear or radial basis function)
        'estimator__gamma': [0.001, 0.01, 0.1]  # Kernel coefficient for the RBF kernel (ignored when kernel='linear')
    },
    "Random Forest": {
        'estimator__n_estimators': [50, 100, 200],  # Number of trees in the forest
        'estimator__max_depth': [None, 10, 20],  # Maximum depth of each tree (None = no limit)
        'estimator__min_samples_split': [2, 5, 10]  # Minimum samples required to split a node
    },
    "MultinomialNB": {
        'estimator__alpha': [0.1, 0.5, 1.0, 2.0],  # Additive smoothing parameter (1.0 = Laplace smoothing)
        'estimator__fit_prior': [True, False]  # Whether to use class prior probabilities
    }
}

# Define models for classification
# Each model is wrapped in OneVsRestClassifier to handle multi-label classification
models = {
    "Logistic Regression": OneVsRestClassifier(LogisticRegression(solver='liblinear')),
    "SVC": OneVsRestClassifier(SVC()),
    "Random Forest": OneVsRestClassifier(RandomForestClassifier()),
    "MultinomialNB": OneVsRestClassifier(MultinomialNB())
}

# 🚀 Training & Evaluation of Optimized Models

Once the best hyperparameters are determined, we train each model on the preprocessed dataset.  
For evaluation, we compute:
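- **Accuracy**: For multi-label targets, `accuracy_score` reports subset accuracy, i.e. the share of samples whose complete set of labels is predicted exactly.
- **Precision & Recall**: Per category, the share of predicted labels that are correct (precision) and the share of true labels that are found (recall).
- **F1-score**: The harmonic mean of precision and recall.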

By comparing these metrics across models, we can select the best-performing classifier.
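
Because a sample only counts as correct when all of its labels match, accuracy can be much lower than the label-level scores. The following pure-Python sketch illustrates this on a hypothetical toy label matrix (rows are samples, columns are categories); the function names are illustrative and are not scikit-learn APIs.

```python
def subset_accuracy(y_true, y_pred):
    """Share of samples whose full label row is predicted exactly."""
    matches = sum(1 for t_row, p_row in zip(y_true, y_pred) if t_row == p_row)
    return matches / len(y_true)

def micro_scores(y_true, y_pred):
    """Precision, recall, and F1 computed over all individual label decisions."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical toy data: 4 samples, 3 categories
y_true = [[1, 1, 0], [0, 1, 0], [1, 0, 1], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

accuracy = subset_accuracy(y_true, y_pred)
precision, recall, f1 = micro_scores(y_true, y_pred)
print(f"subset accuracy: {accuracy:.2f}")
print(f"micro precision: {precision:.2f}, recall: {recall:.2f}, F1: {f1:.2f}")
```

Here two of the four samples each miss a single label, so subset accuracy is 0.50 even though every predicted label is correct and the micro-averaged F1 is 0.80.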

In [43]:

# Iterate through each model and its corresponding hyperparameter grid
for model_name, model in models.items():
    print(f"Optimizing {model_name}...")
    
    # Initialize GridSearchCV for hyperparameter tuning
    # - model: the classifier to optimize
    # - param_grids[model_name]: the hyperparameter grid for the specific model
    # - cv=5: 5-fold cross-validation
    # - scoring='accuracy': use accuracy as the evaluation metric (subset accuracy for multi-label targets)
    # - n_jobs=-1: use all available processors for parallel processing
    grid = GridSearchCV(model, param_grids[model_name], cv=5, scoring='accuracy', n_jobs=-1)
    
    # Fit the model on the training data
    grid.fit(x_train, y_train)
    
    # Print the best hyperparameters and the corresponding cross-validation score
    print(f"Best params for {model_name}: {grid.best_params_}")
    print(f"Best score for {model_name}: {grid.best_score_}\n")
    
    # Evaluate the best model on the test data
    best_clf = grid.best_estimator_  # Retrieve the best estimator found by GridSearchCV
    y_pred = best_clf.predict(x_test)  # Make predictions on the test set
    
    # Print the test accuracy and a detailed classification report
    print(f"Test accuracy for {model_name}: {accuracy_score(y_test, y_pred)}")
    print(classification_report(y_test, y_pred, target_names=categories))
    print("-" * 80)


Optimizing Logistic Regression...
Best params for Logistic Regression: {'estimator__C': 10, 'estimator__penalty': 'l1'}
Best score for Logistic Regression: 0.2565208796639486

Test accuracy for Logistic Regression: 0.3442622950819672
                    precision    recall  f1-score   support

           Politik       0.66      0.60      0.63       136
        Utbildning       0.37      0.35      0.36        20
          Religion       0.50      0.20      0.29         5
             Miljo       0.50      0.25      0.33        36
           Ekonomi       0.68      0.56      0.62       158
     LivsstilFritt       0.65      0.38      0.48        81
SamhalleKonflikter       0.68      0.56      0.61       197
             Halsa       0.67      0.37      0.47        82
            Idrott       0.59      0.37      0.45        52
   VetenskapTeknik       0.21      0.16      0.18        19

         micro avg       0.64      0.49      0.55       786
         macro avg       0.55      0.38     

# 📊 Model Comparison & Final Results

The final step involves comparing the accuracy of each model to determine the best-performing classifier.  
The results include:
- **Best hyperparameters for each model**
- **Classification reports with precision, recall, and F1-score**
- **A summary of which model achieved the highest accuracy**

This analysis helps us make an informed decision on the most suitable model for our classification task.
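
When reading the classification reports, it helps to know how the summary rows are derived from the per-category rows. The sketch below recomputes the macro and weighted averages from the per-category figures of the complete Logistic Regression report printed in this notebook. Since those figures are already rounded to two decimals, the results are only expected to agree with the report up to rounding; the helper names are hypothetical.

```python
# (category, precision, recall, f1, support), copied from the printed LogReg report
rows = [
    ("Politik", 0.66, 0.60, 0.63, 136),
    ("Utbildning", 0.41, 0.35, 0.38, 20),
    ("Religion", 0.50, 0.20, 0.29, 5),
    ("Miljo", 0.50, 0.25, 0.33, 36),
    ("Ekonomi", 0.67, 0.55, 0.61, 158),
    ("LivsstilFritt", 0.64, 0.37, 0.47, 81),
    ("SamhalleKonflikter", 0.68, 0.56, 0.62, 197),
    ("Halsa", 0.66, 0.35, 0.46, 82),
    ("Idrott", 0.58, 0.35, 0.43, 52),
    ("VetenskapTeknik", 0.21, 0.16, 0.18, 19),
]
METRICS = {"precision": 1, "recall": 2, "f1": 3}

def macro_average(rows, column):
    """Unweighted mean: every category counts equally."""
    return sum(row[column] for row in rows) / len(rows)

def weighted_average(rows, column):
    """Mean weighted by support: frequent categories count more."""
    total_support = sum(row[4] for row in rows)
    return sum(row[column] * row[4] for row in rows) / total_support

macro = {name: macro_average(rows, col) for name, col in METRICS.items()}
weighted = {name: weighted_average(rows, col) for name, col in METRICS.items()}

for name in METRICS:
    print(f"{name}: macro {macro[name]:.2f}, weighted {weighted[name]:.2f}")
```

The macro average is pulled down by small categories such as Religion and VetenskapTeknik, while the weighted average is dominated by the large ones, which is why the two rows differ in the report.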

In [44]:
# Define a dictionary of optimized classifiers using GridSearchCV
# Each model is paired with its corresponding hyperparameter grid for tuning
optimized_classifiers = {
    "LogReg": GridSearchCV(
        OneVsRestClassifier(LogisticRegression(solver='liblinear')),  # Logistic Regression for multi-label classification
        param_grids["Logistic Regression"],  # Hyperparameter grid for Logistic Regression
        cv=5,  # Perform 5-fold cross-validation
        scoring="accuracy",  # Use accuracy as the evaluation metric
        n_jobs=-1  # Utilize all available processors for parallel processing
    ),
    "SVC": GridSearchCV(
        OneVsRestClassifier(SVC()),  # Support Vector Classifier for multi-label classification
        param_grids["SVC"],  # Hyperparameter grid for SVC
        cv=5,
        scoring="accuracy",
        n_jobs=-1
    ),
    "NaiveBayes": GridSearchCV(
        OneVsRestClassifier(MultinomialNB()),  # Naive Bayes classifier for multi-label classification
        param_grids["MultinomialNB"],  # Hyperparameter grid for Naive Bayes
        cv=5,
        scoring="accuracy",
        n_jobs=-1
    ),
    "RandomForest": GridSearchCV(
        OneVsRestClassifier(RandomForestClassifier()),  # Random Forest for multi-label classification
        param_grids["Random Forest"],  # Hyperparameter grid for Random Forest
        cv=5,
        scoring="accuracy",
        n_jobs=-1
    )
}

# Train and evaluate the optimized models
for name, grid in optimized_classifiers.items():
    print(f"Training and optimizing {name}...")
    
    # Fit the model using GridSearchCV on the training data
    grid.fit(x_train, y_train)
    
    # Retrieve the best model found during hyperparameter tuning
    best_model = grid.best_estimator_
    
    # Use the best model to make predictions on the test data
    y_pred = best_model.predict(x_test)
    
    # Evaluate the model's performance on the test data
    accuracy = accuracy_score(y_test, y_pred)  # Calculate accuracy
    print(f"{name} Accuracy: {accuracy}")  # Print accuracy score
    print(f"Best parameters for {name}: {grid.best_params_}")  # Print the best hyperparameters
    print(classification_report(y_test, y_pred, target_names=categories))  # Print detailed classification metrics
    print("-" * 60)  # Separator for better readability in the output


Training and optimizing LogReg...
LogReg Accuracy: 0.3360655737704918
Best parameters for LogReg: {'estimator__C': 10, 'estimator__penalty': 'l1'}
                    precision    recall  f1-score   support

           Politik       0.66      0.60      0.63       136
        Utbildning       0.41      0.35      0.38        20
          Religion       0.50      0.20      0.29         5
             Miljo       0.50      0.25      0.33        36
           Ekonomi       0.67      0.55      0.61       158
     LivsstilFritt       0.64      0.37      0.47        81
SamhalleKonflikter       0.68      0.56      0.62       197
             Halsa       0.66      0.35      0.46        82
            Idrott       0.58      0.35      0.43        52
   VetenskapTeknik       0.21      0.16      0.18        19

         micro avg       0.64      0.48      0.55       786
         macro avg       0.55      0.37      0.44       786
      weighted avg       0.63      0.48      0.54       786
       samp