# 03: Model Finalization and Serialization (Adapted for SVM)

This notebook adapts Matthew Vella's serialization pipeline to finalize and serialize the Maltese sentiment analysis model using a Support Vector Machine (SVM). The SVM model, along with its chosen vectorizer, is trained on the combined preprocessed Maltese sentiment data.

To create a self-contained and deployable model that can handle raw text input, custom preprocessing components (defined in `preprocessor.py`) are integrated into an end-to-end pipeline. This complete pipeline, which includes the text preprocessor and the trained SVM sentiment model, is then serialized using `joblib` for later use.

Essential Python libraries for data handling, machine learning pipeline construction, and model serialization are imported. `DATA_DIR` is set to the data location.

In [1]:
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC # Changed from MultinomialNB to SVC
from imblearn.pipeline import Pipeline as ImbPipeline
import pandas as pd
import joblib
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline as SklearnPipeline

import preprocessor # preprocessor.py

DATA_DIR = Path('./data')

All available preprocessed data (`jerbarnes_dataset_selective_lowercased_lemmatized.csv` and `crowdsourced_dataset_selective_lowercased_lemmatized.csv`) is loaded and combined. This full preprocessed dataset (`X_combined`, `y_combined`) will be used to train the core sentiment model components (vectorizer and classifier).

In [None]:
dataset_path = DATA_DIR / 'jerbarnes_dataset_selective_lowercased_lemmatized.csv'
df = pd.read_csv(dataset_path, header=None, names=['label', 'text', 'processed_text'])

# Split data into features and target
X = df['processed_text']
y = df['label']

# Load the crowd-sourced dataset
new_dataset_path = DATA_DIR / 'crowdsourced_dataset_selective_lowercased_lemmatized.csv'
new_df = pd.read_csv(new_dataset_path, header=None, names=['label', 'text', 'processed_text'])

# Split data into features and target
new_X = new_df['processed_text']
new_y = new_df['label']

# Add crowd-sourced data to training set
X_combined = pd.concat([X, new_X], ignore_index=True)
y_combined = pd.concat([y, new_y], ignore_index=True)
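
The load-and-combine step can be checked on a small in-memory example. The sketch below is illustrative only: the two CSV snippets and the `*_toy` names are hypothetical stand-ins for the real files. It shows that `ignore_index=True` yields a fresh 0..n-1 index, and that an empty `processed_text` field is read by `pandas` as `NaN`, which `TfidfVectorizer` would reject. Whether the real datasets contain such rows is not known here.

```python
import io

import pandas as pd

# Hypothetical headerless CSV snippets in the same three-column layout
csv_a = io.StringIO("1,Raw A,raw a\n0,Raw B,\n")  # second row has an empty processed_text
csv_b = io.StringIO("1,Raw C,raw c\n")
cols = ["label", "text", "processed_text"]

df_a = pd.read_csv(csv_a, header=None, names=cols)
df_b = pd.read_csv(csv_b, header=None, names=cols)

# Combine features and targets, resetting the index to 0..n-1
X_toy = pd.concat([df_a["processed_text"], df_b["processed_text"]], ignore_index=True)
y_toy = pd.concat([df_a["label"], df_b["label"]], ignore_index=True)

# Empty fields arrive as NaN; count them and replace with empty strings
n_missing = int(X_toy.isna().sum())
X_clean = X_toy.fillna("")

print(n_missing, X_clean.tolist(), y_toy.tolist())
```

If the count of missing values is non-zero on the real data, filling with empty strings (or dropping those rows together with their labels) before fitting is one possible remedy.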

The core sentiment model pipeline (`final_model`) is defined. It combines a `TfidfVectorizer` (`lowercase=False`, `min_df=2`, `ngram_range=(1, 1)`, `norm='l2'`, `use_idf=True`) with an `SVC` (`probability=True`, `random_state=42`, `C=10`, `kernel='rbf'`, `gamma='scale'`), using the hyperparameters determined for Model C in Notebook 02. This `final_model` is trained on the entire combined *preprocessed* dataset (`X_combined`, `y_combined`).

In [3]:
final_model = ImbPipeline([
    ('vectorizer', TfidfVectorizer(lowercase=False, min_df=2, ngram_range=(1, 1), norm='l2', use_idf=True)), # Best TfidfVectorizer parameters
    ('classifier', SVC(probability=True, random_state=42, C=10, kernel='rbf', gamma='scale')), # Best SVC parameters
])

# Train the model
final_model.fit(X_combined, y_combined)
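
As a rough illustration of what the fitted pipeline exposes, the sketch below trains the same vectorizer and classifier settings on a handful of toy English strings, which stand in for the preprocessed Maltese text. The data and the 0/1 labels are hypothetical, and scikit-learn's own `Pipeline` is used in place of the `imblearn` one to keep the example light. With `probability=True`, `SVC` fits its probability estimates through an internal cross-validation, so on very small datasets `predict_proba` may disagree with `predict`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy strings standing in for preprocessed text (hypothetical data)
toy_texts = [
    "good service", "very good food", "good food good service",
    "nice food", "very nice service", "nice and good",
    "bad service", "very bad food", "bad food bad service",
    "awful food", "very awful service", "awful and bad",
]
toy_labels = [1] * 6 + [0] * 6  # hypothetical labels: 1 = positive, 0 = negative

toy_model = Pipeline([
    ("vectorizer", TfidfVectorizer(lowercase=False, min_df=2, ngram_range=(1, 1))),
    ("classifier", SVC(probability=True, random_state=42, C=10, kernel="rbf", gamma="scale")),
])
toy_model.fit(toy_texts, toy_labels)

# Columns of predict_proba follow the order of classes_
classes = toy_model.classes_
proba = toy_model.predict_proba(["good food", "awful service"])
print(classes, proba.shape)
```

Each row of `proba` sums to 1, and its columns are ordered as in `classes`.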

The `MalteseTokenizer` and the scikit-learn compatible `MalteseTextPreprocessor` (adapted from a colleague's original implementation) are defined below. These classes ensure that raw text input is preprocessed consistently with how the training data was prepared for the SVM model. The `MalteseTextPreprocessor` will be part of the final deployable pipeline.

In [None]:
# MalteseTokenizer without Maltese language filtering
class MalteseTokenizer:
    def __init__(self, case_folding_type=2, lemmatize=True):
        if case_folding_type not in {0, 1, 2}:
            raise ValueError("Invalid case folding method. Choose from 0 (no change), 1 (lowercase everything except fully-uppercase words), 2 (full lowercasing)")
        self.case_folding_type = case_folding_type
        self.lemmatize = lemmatize

        self.cleaner = preprocessor.TextCleaner(input_dir=None, output_dir=None)
        self.anonymizer = preprocessor.TextAnonymizer(input_dir=None, output_dir=None, names_dir='./names')
        
    def __call__(self, text):
        # Apply the same initial cleaning steps used in the Facebook Scraper dataset
        text = self.cleaner.clean_text(text)
        text = self.anonymizer.anonymize_text(text)

        # Apply preprocessing steps
        text = preprocessor.emoji_to_text(text)
        tokens = preprocessor.tokenise(text)
        tokens = preprocessor.clean_tokens(tokens)
        if self.case_folding_type == 1:
            tokens = preprocessor.selective_lowercase(tokens)
        elif self.case_folding_type == 2:
            tokens = preprocessor.lowercase(tokens)
        tokens = [preprocessor.normalize_word(token) for token in tokens]
        if self.lemmatize:
            tokens = [preprocessor.get_lemma(token) for token in tokens]
        return tokens


class MalteseTextPreprocessor(BaseEstimator, TransformerMixin):
    """
    Scikit-learn compatible transformer for applying the MalteseTokenizer.
    """
    def __init__(self, case_folding_type=2, lemmatize=True):
        self.case_folding_type = case_folding_type
        self.lemmatize = lemmatize
        # Instantiate the tokenizer when the transformer is created
        self.tokenizer_ = MalteseTokenizer(case_folding_type=self.case_folding_type, 
                                           lemmatize=self.lemmatize)

    def fit(self, X, y=None):
        return self # No fitting needed for this preprocessor

    def transform(self, X, y=None):
        processed_X = []
        for raw_text in X: # X is an iterable of raw text strings
            tokens = self.tokenizer_(raw_text) 
            processed_X.append(' '.join(tokens) if tokens else "")
        return pd.Series(processed_X) # Output a Series for the next pipeline step
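
The stateless-transformer pattern used above can be exercised in isolation. In the minimal sketch below, `WhitespacePreprocessor` is a hypothetical stand-in that only lowercases and splits on whitespace; it does not reproduce the Maltese cleaning, anonymization, or lemmatization steps. It shows that `fit` returns `self`, that `transform` maps raw strings to space-joined tokens, and that constructor arguments are recoverable through `get_params`, which is what `clone` relies on.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone


class WhitespacePreprocessor(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: lowercases (optionally) and splits on whitespace."""

    def __init__(self, lowercase=True):
        self.lowercase = lowercase

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def transform(self, X, y=None):
        out = []
        for raw_text in X:
            text = raw_text.lower() if self.lowercase else raw_text
            out.append(" ".join(text.split()))  # empty input yields ""
        return pd.Series(out)


prep = WhitespacePreprocessor(lowercase=True)
result = prep.fit_transform(["Hello   World", ""])
params = clone(prep).get_params()
print(result.tolist(), params)
```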

The final `deployable_pipeline` is constructed by combining the `MalteseTextPreprocessor` (for handling raw text input) with the trained SVM sentiment model. This pipeline is designed for end-to-end prediction from raw text. The entire pipeline is then serialized to a `.joblib` file for deployment.

In [5]:
model_filename = "svm_maltese_sentiment_analyzer.joblib" # Changed filename

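deployable_pipeline = SklearnPipeline([
    # case_folding_type and lemmatize must match the preprocessing used for the SVM training data.
    # NOTE: the training files are named '*_selective_lowercased_*', which corresponds to
    # case_folding_type=1; verify that case_folding_type=2 is intended here.
    ('custom_maltese_preprocessor', preprocessor.MalteseTextPreprocessor(case_folding_type=2, lemmatize=True)),
    ('sentiment_model', final_model)
])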

print("Deployable wrapper pipeline created.")
print("This pipeline takes RAW text and uses the pre-fitted `final_model` for sentiment.")

try:
    joblib.dump(deployable_pipeline, model_filename)
    print(f"\nDeployable model pipeline saved successfully to: {model_filename}")
except Exception as e:
    print(f"Error saving the deployable model: {e}")

Error loading name files: [Errno 2] No such file or directory: './names/names.txt'
Deployable wrapper pipeline created.
This pipeline takes RAW text and uses the pre-fitted `final_model` for sentiment.

Deployable model pipeline saved successfully to: svm_maltese_sentiment_analyzer.joblib
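
A save-and-reload round trip can be sketched on a toy pipeline. The example below is illustrative only: the data, labels, and the temporary file name are hypothetical, and the custom preprocessor is left out so that the block runs on its own. For the real pipeline, note that pickle-based formats such as `joblib` store custom classes by reference, so the loading environment needs `preprocessor.py` importable, ideally with the same library versions that were used for saving.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy strings and labels (hypothetical data)
texts = ["good food", "good service", "nice food", "bad food", "bad service", "awful food"]
labels = [1, 1, 1, 0, 0, 0]

toy = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", SVC(kernel="rbf", C=10, gamma="scale")),
])
toy.fit(texts, labels)

# Dump to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_pipeline.joblib"
    joblib.dump(toy, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions
samples = ["good food", "bad service"]
same = list(toy.predict(samples)) == list(restored.predict(samples))
print(same)
```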
