# Project 2.1

Write a short summary of what the assignment is.

Install dependencies for this notebook to run:

In [1]:
%pip install -q numpy
%pip install -q pandas
%pip install -q scikit-learn
%pip install -q nltk

Standard Python library imports:

In [2]:
import os
import re
import string
import typing
import warnings; warnings.filterwarnings("ignore",
	category = UserWarning,
)

The backend for calculations and data manipulation; mostly `pandas` is used.

In [3]:
import numpy
import pandas

Specific modules of `sklearn`:
- `feature_extraction` for transformation of texts into features
- `linear_model` for the models used for this assignment
- `pipeline` to assemble the various steps in a single process
- `metrics` for making evaluations on known splits (either training or validation)
- `model_selection` for handling tuning of hyper-parameters of models

In [4]:
import sklearn
import sklearn.feature_extraction
import sklearn.linear_model
import sklearn.pipeline
import sklearn.metrics
import sklearn.model_selection

Load the Natural Language Toolkit (NLTK):
- `stem` has tools for preprocessing text
- `corpus` has language specific content like stop words

In [5]:
import nltk
import nltk.stem
import nltk.corpus

In [6]:
nltk.download('wordnet'  )  # for lemmatization
nltk.download('stopwords')  # for stop word removal

[nltk_data] Downloading package wordnet to /home/nikos/nltk_data...
[nltk_data] Downloading package stopwords to /home/nikos/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Data

This assignment works on an English Twitter dataset with attributes:

- `ID`: a unique identifier for each text
- `Text`: contains the content of the tweet
- `Label`: represents sentiment:
  - `1` indicates a _positive_ sentiment
  - `0` indicates a _negative_ sentiment

This notebook assumes that the following files exist in the same path as this notebook:

- `train_dataset.csv` (used for training models)
- `val_dataset.csv` (used for tuning models and later training them as well)
- `test_dataset.csv` (used for generating predictions for final evaluation)

Upon execution, this notebook also generates a `submission.csv` file with the predictions of our trained model.

Let's start with a simple function to load each dataset split into a dataframe indexed by the `ID` column:

In [7]:
def load_data(
	split: typing.Literal[
		"train",
		"val",
		"test",
	],
	root: str = "",
	index: str = "ID",
):
	return pandas.read_csv(os.path.join(f"{root}", f"{split}_dataset.csv"),
		index_col = index,  # put IDs on index
		encoding = "utf-8",
	)


## Text normalization

Inspecting a sample shows that the tweets require a great amount of normalization:

- lowercasing
- removal of punctuation
- removal of accents
- removal of stop words
- removal of sensitive information like:
  - mentions
  - hashtags
  - emails
- lemmatization
- stemming


The `sklearn` term-frequency-inverse-document-frequency vectorizer supports lowercasing and stop word removal out of the box, but the latter is replaced here by the NLTK stop word list for English. It also supports a custom preprocessor and tokenizer, which are defined below as callables.

### Preprocessing

This step consists of removing sensitive info and punctuation:

In [8]:
class Preprocessor:

	def __call__(self, text: str) -> str:
		text = re.sub(r"\S+@\S+", "", text)  # emails (first, before the mention pattern eats "@domain")
		text = re.sub(r"@\w+"   , "", text)  # mentions
		text = re.sub(r"#\w+"   , "", text)  # hashtags

		return text.translate(str.maketrans("", "", string.punctuation))  # punctuation
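The effect of these substitutions can be checked on a made-up tweet. The sketch below is a standalone function (the name `strip_sensitive` and the sample text are illustrative, not part of the assignment) that applies the same patterns. Note that it applies the email pattern first: if the mention pattern ran first, an address like `bob@example.com` would lose its `@example` part and leave `bob.com` behind.

```python
import re
import string

def strip_sensitive(text: str) -> str:
    # Emails first, otherwise the mention pattern would remove "@example"
    # from the address and leave debris behind.
    text = re.sub(r"\S+@\S+", "", text)  # emails
    text = re.sub(r"@\w+", "", text)     # mentions
    text = re.sub(r"#\w+", "", text)     # hashtags
    # Drop every ASCII punctuation character that is left.
    return text.translate(str.maketrans("", "", string.punctuation))

sample = "@bob mail me at bob@example.com, loving this! #happy"
cleaned = strip_sensitive(sample)
print(cleaned.split())
```

Whitespace left behind by the removals is harmless, since the tokenizer splits on it anyway.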


### Tokenizing

This step consists of splitting texts into (meaningful) tokens, then lemmatizing and stemming each token:

In [9]:
class Tokenizer:

	def __init__(self):
		self.lemmatizer = nltk.WordNetLemmatizer()
		self.stemmer = nltk.stem.PorterStemmer()
		self.tokenizer = nltk.tokenize.TweetTokenizer(
			preserve_case = False,
			reduce_len = True,
			match_phone_numbers = True,
			strip_handles = True,
		)

	def __call__(self, text: str):
		return [self.stemmer.stem(self.lemmatizer.lemmatize(token))
			for token in self.tokenizer.tokenize(text) if token and not token.isdigit()]


## Pipeline

The pipeline for this assignment consists of two steps:

1. Vectorization of texts (using _term-frequency-inverse-document-frequency_)
2. Modelling of the task (_binary sentiment_ estimation of tweets)
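To make step 1 concrete, here is a minimal pure-Python sketch of TF-IDF weighting on a made-up three-document corpus. It assumes the smoothed IDF formula and L2 normalization that are the `TfidfVectorizer` defaults (`smooth_idf = True`, `norm = "l2"`); the helper names `smoothed_idf` and `tfidf` are illustrative only.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (made up for illustration).
docs = [
    ["good", "day"],
    ["good", "movie"],
    ["good", "bad", "day"],
]

def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))
idf = {term: smoothed_idf(len(docs), df) for term, df in doc_freq.items()}

def tfidf(doc: list[str]) -> dict[str, float]:
    # Raw term counts times IDF, then scaled to unit Euclidean length.
    weights = {term: count * idf[term] for term, count in Counter(doc).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vector = tfidf(docs[1])  # the "good movie" document
print(vector)
```

A term that appears in every document (`good`) gets the minimum IDF of 1, so the rarer `movie` dominates the vector of the second document.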

### Vectorizer

Let's specialize a vectorizer to expose two parameters only:

- `max_features`: The vocabulary a vectorizer produces from a corpus is usually extremely large, which invites the curse of dimensionality. `sklearn` keeps the top `max_features` terms ordered by term frequency across the corpus, so trimming means discarding the rarest terms, which is usually an acceptable loss.
- `ngram_range`: By default a vectorizer uses single-word terms (unigrams) to describe texts. Using overlapping two-word terms (bigrams) can capture the text content better, but it also leads to much larger vocabularies, making `max_features` all the more necessary. The library supports combined settings with this range, for example using both unigrams and bigrams. As bigrams are generally rarer, they tend to get clipped by `max_features`, so it makes little sense to go to trigrams (or more).
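The interaction between the two parameters can be illustrated without `sklearn`. The following is a simplified stand-in (the `ngrams` helper and the toy corpus are hypothetical) that builds unigram and bigram terms and keeps only the most frequent ones, which is roughly what a frequency-ordered `max_features` cut does.

```python
from collections import Counter

def ngrams(tokens: list[str], ngram_range: tuple[int, int]) -> list[str]:
    # All contiguous n-grams for every n in the inclusive range.
    low, high = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(low, high + 1)
            for i in range(len(tokens) - n + 1)]

# Toy corpus of already-tokenized documents (made up for illustration).
corpus = [["good", "day"], ["good", "movie"], ["bad", "day"]]

# Count every unigram and bigram across the corpus.
counts = Counter(term for doc in corpus for term in ngrams(doc, (1, 2)))

# Keep only the 2 most frequent terms, mimicking max_features = 2.
vocabulary = [term for term, _ in counts.most_common(2)]
print(len(counts), vocabulary)
```

In this toy corpus every bigram occurs once, so the cut removes all of them and only unigrams survive, which is the clipping effect described above.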

In [10]:
class Vectorizer(sklearn.feature_extraction.text.TfidfVectorizer):

	def __init__(self, *,
		max_features: int | None = None,  # regularize by limiting dimensionality
		ngram_range: tuple[
			int,
			int,
		] = (
			1,
			1,
		),
	):
		super().__init__(
		#	input = "content",
			encoding = "utf-8",
		#	decode_error = "strict",
			strip_accents = "unicode",
			lowercase = True,
			preprocessor = Preprocessor(),
			tokenizer = Tokenizer(),
		#	analyzer = "word",
			stop_words = nltk.corpus.stopwords.words("english"),  # messages seem to be in english
		#	token_pattern = "(?u)\\b\\w\\w+\\b",
			ngram_range = ngram_range,  # NOTE: maybe tunable
		#	max_df = 1.0,
		#	min_df = 1,
			max_features = max_features, # NOTE: tunable
		#	vocabulary = None,
		#	binary = False,
		#	dtype = numpy.float64,
		#	norm = "l2",
		#	use_idf = True,
		#	smooth_idf = True,
		#	sublinear_tf = False,
		)


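No seed is needed here because the vectorizer is deterministic; the seed is fixed at `42` in the model below. Lowercasing and accent removal are passed explicitly, but note that `sklearn` skips its built-in lowercasing and accent stripping when a custom `preprocessor` is supplied, so lowercasing is effectively done by the tweet tokenizer (`preserve_case = False`). This is also the place where the stop words are passed, along with the preprocessor and tokenizer defined above. This wraps the vectorizer to be used for experiments in this assignment.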

### Model

Let's also specialize the model to be used for this assignment. A logistic regressor is requested and is well suited to binary classification.

In [11]:
class Model(sklearn.linear_model.LogisticRegression):

	def __init__(self, *,
		C: float = 1.0,
	):
		super().__init__(
		#	penalty = "l2",
		#	dual = False,
		#	tol = 0.0001,
			C = C, # NOTE: tunable
			fit_intercept = True,  # use bias
		#	intercept_scaling = 1,
		#	class_weight = None,
			random_state = 42,  # seed
		#	solver = "lbfgs",
		#	max_iter = 100,
		#	verbose = 0,
		#	warm_start = False,
			n_jobs = -1,  # parallelization
		#	l1_ratio = None,
		)


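Ridge ($L_2$) regularization is expected to be applied; it is left at the library default and not exposed, so only its strength `C` is exposed for tuning later (`C` is the inverse strength: smaller values mean stronger regularization). The seed is fixed at `42` for reproducibility.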
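As a reminder of what `C` controls, the $L_2$-penalized logistic regression objective is commonly written as follows. This is the textbook form with labels recoded to $y_i \in \{-1, +1\}$, weights $w$, bias $b$ and feature vectors $x_i$; it is given here for orientation and is not taken from the assignment.

$$
\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert_2^2 + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i\,(x_i^\top w + b)\right)\right)
$$

Since `C` multiplies the data term, a larger `C` lets the model fit the training data more closely, while a smaller `C` gives the penalty more weight.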

### Classifier

This is the largest and most important utility that puts everything in one pipeline. Semantically it contains:

- a `vectorizer` to do all the text normalization mentioned above
- a `model` to do all the machine learning on the features the `vectorizer` got

In [12]:
class Classifier:

	def __init__(self, *,
		vectorizer: Vectorizer,
		model: Model,
	):
		self.pipeline = sklearn.pipeline.Pipeline(
			[
				("vectorizer", vectorizer),
				("model", model),
			]
		)


	def fit(self, train_data: pandas.DataFrame):
		X = train_data.Text
		y = train_data.Label

	#	Fit model:
		self.pipeline.fit(X, y)

		return self

	def predict(self, test_data: pandas.DataFrame,
		save: str | None = None,
	) -> pandas.Series:
		X = test_data.Text
		y = pandas.Series(self.pipeline.predict(X),
			index = test_data.index,  # align with test data index
			name = "Label",  # recover the "Label" column name
		)

	#	Optionally save predictions for submission:
		if save is not None:
			y.to_csv(save)

		return y

	def evaluate(self, val_data: pandas.DataFrame):
		y_true = val_data.Label
		y_pred = self.predict(val_data)

		return {
			"accuracy"  : sklearn.metrics. accuracy_score(y_true, y_pred),
			"precision" : sklearn.metrics.precision_score(y_true, y_pred),
			"recall"    : sklearn.metrics.   recall_score(y_true, y_pred),
			"f1"        : sklearn.metrics.       f1_score(y_true, y_pred),
		}

	def tune(self,
		train_data: pandas.DataFrame,
		val_data: pandas.DataFrame,
	**param_grid: list):
	#	Concatenate the train and val splits, since `GridSearchCV` takes a single dataset plus a split definition:
		data = pandas.concat(
			[
				train_data,
				val_data,
			]
		)

		X = data.Text
		y = data.Label

	#	Initialize a tuner with given parameter grid and splits on the stock model:
		tuner = sklearn.model_selection.GridSearchCV(self.pipeline, param_grid,
			scoring = "accuracy",
			n_jobs = -1,  # parallelization
			refit = False,  # we will do our own refitting (on the whole data) thank you very much
			cv = sklearn.model_selection.PredefinedSplit([-1] * len(train_data) + [0] * len(val_data)),  # use our splits
		#	verbose = 0,
		#	pre_dispatch = "2*n_jobs",
		#	error_score = numpy.nan,
		#	return_train_score = False,
		)

	#	Fit and tune model:
		tuner.fit(X, y)  # type: ignore

	#	Refit model with the best parameters on the whole data (why throw away the val split now that tuning is finished?):
		self.pipeline.set_params(**tuner.best_params_)  # set the best parameters for the model
		self.pipeline.fit(X, y)  # refit it with the best parameters on the whole dataset available for training

		return self


#### `.fit(train_data: DataFrame)`

This method uses the dataset as is to train the pipeline. The vectorizer normalizes the texts while it is being fit, then the extracted features are used to fit the model in the pipeline.

#### `.predict(test_data: DataFrame, save: str | None = None)`

This method uses the dataset as is to make predictions. The `sklearn` pipeline does not refit the vectorizer at this (inference) stage; it only transforms the text with the previously fit vectorizer. The model then makes predictions on the extracted features. The output is also `pandas`-friendly: a series that keeps the `ID` index, so each prediction stays matched to its text.

This method is used either in operation or final testing of the pipeline.

#### `.evaluate(val_data: DataFrame)`

This method evaluates predictions on a known (validation) data split. It compares the model predictions on the given split with the known labels in it to assess the performance of the pipeline. This was used mostly to peek at how the various model settings affected performance.

Evaluation on a validation split is also used when tuning the model hyper-parameters, though this custom method is not what the tuner calls. It was used only for convenience and review during development of the pipeline.
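For readers who want to see what the four reported numbers mean, here is a minimal pure-Python sketch that derives them from confusion counts. The function name `binary_metrics` and the toy labels are made up; the real method delegates to `sklearn.metrics`.

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

scores = binary_metrics(
    y_true=[1, 0, 1, 1, 0],
    y_pred=[1, 0, 0, 1, 1],
)
print(scores)
```

Here 3 of the 5 predictions are correct, and 2 of the 3 predicted positives (as well as 2 of the 3 actual positives) are true positives.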

#### `.tune(train_data: DataFrame, val_data: DataFrame, **param_grid: list)`

This method wraps the creation of a grid-search tuner that tries various pipeline settings (as exposed) and runs it on the given data. To make `sklearn` use predefined splits instead of its default $k$-fold cross-validation, the two splits are concatenated and passed along with a `PredefinedSplit` that marks training rows with `-1` (never used for validation) and validation rows with `0` (the single validation fold).

At the end of the process, the tuner may be able to expose the best model found during tuning. Here, however, the pipeline is set to the best parameters found and refit explicitly on all the data (both training and validation), since the validation data no longer need to be held out once tuning is done.
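The fold labels passed to `PredefinedSplit` can be inspected on a tiny example. The sketch below assumes hypothetical split sizes of 3 training rows and 2 validation rows and prints the single train/validation split that results.

```python
from sklearn.model_selection import PredefinedSplit

n_train, n_val = 3, 2  # hypothetical split sizes, for illustration only

# -1 keeps a row out of every validation fold; 0 assigns it to validation fold 0.
test_fold = [-1] * n_train + [0] * n_val
splitter = PredefinedSplit(test_fold)

# Each split is a pair of index arrays: (training rows, validation rows).
splits = [
    (train_idx.tolist(), val_idx.tolist())
    for train_idx, val_idx in splitter.split()
]
print(splitter.get_n_splits(), splits)
```

The row order matters: the labels only line up with the data if the training rows come first in the concatenated dataframe, as they do in `tune`.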

## Experiment

Below the pipeline is executed on the data given.

First let's load all the dataset splits:

In [13]:
train_data = load_data("train")
val_data   = load_data("val"  )
test_data  = load_data("test" )

Now let's build the classification pipeline and fit it on the training data with tuning. The settings explored are visible in the `tune` call below: 3 values of `max_features` × 3 values of `ngram_range` × 2 values of `C`, for a total of 18 trials.

In [None]:
classifier = Classifier(
	vectorizer = Vectorizer(),
	model = Model(),
)
classifier.tune(train_data, val_data,
	vectorizer__max_features = [
		128,
		256,
		512,
	],
	vectorizer__ngram_range = [
		(1, 1),
		(1, 2),
		(2, 2),
	],
	model__C = list(numpy.linspace(.5, 1., 2)),
)
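The number of trials can be double-checked by enumerating the grid by hand. This is only a sketch of the Cartesian product that a grid search walks through; the `C` values are written out literally as the two points that `numpy.linspace(.5, 1., 2)` evaluates to.

```python
from itertools import product

param_grid = {
    "vectorizer__max_features": [128, 256, 512],
    "vectorizer__ngram_range": [(1, 1), (1, 2), (2, 2)],
    "model__C": [0.5, 1.0],  # the two endpoints produced by linspace
}

# One dict per combination, in the same key order as the grid.
trials = [
    dict(zip(param_grid, values))
    for values in product(*param_grid.values())
]
print(len(trials))
```

With a single predefined validation fold, each trial corresponds to exactly one fit during the search.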



Finally let's generate the predictions for submission to the competition:

In [None]:
predictions = classifier.predict(test_data,
	save = "submission.csv",
)