# Task 2: Text classification


The task is to implement, train and evaluate a multi-label text classifier that assigns document-level labels to each document in a corpus. I will use methods a) and b):

a) Developing a “traditional” classification method with SVM;


b) Developing a “traditional” deep learning method with  bi-directional LSTM

Importing necessary packages:

In [41]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report
from sklearn.multiclass import OneVsRestClassifier
import warnings
import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

warnings.filterwarnings(action = 'ignore')
 

Preprocessing method to be applied on the dataframe (the same from Task 1):

In [42]:
# Creating a set of English stopwords and additional custom stopwords
stop_words = set(stopwords.words('english') + ['reuter', '\x03'])

# Initializing a lemmatizer for text processing
lemmatizer = WordNetLemmatizer()

# Uncomment the following line to enable stemming using Porter Stemmer
# stemmer = PorterStemmer()

def preprocessor(text: str):
    """
    Preprocesses the input text by performing the following steps:
    1. Converting text to lowercase.
    2. Removing punctuation using a translation table.
    3. Replacing digits with the placeholder 'num'.
    4. Filtering out stopwords.
    5. Lemmatizing each word.

    Args:
    - text (str): Input text to be preprocessed.

    Returns:
    - str: Preprocessed text.
    """
    # Converting text to lowercase
    text = text.lower()

    # Removing punctuation using translation table
    table = str.maketrans('', '', string.punctuation)
    text = text.translate(table)

    # Replacing digits with 'num'
    text = re.sub(r'\d+', 'num', text)

    # Filtering out stopwords
    text = [word for word in text.split() if word not in stop_words]

    # Lemmatizing each word
    text = [lemmatizer.lemmatize(word) for word in text]
    
    # Uncomment the following line to enable stemming using Porter Stemmer
    # text = [stemmer.stem(word) for word in text]

    return " ".join(text)


Loading, analysing and pre-processing data:

In [43]:
# Setting the path to the dataset
dataset_path = "./data/"

# Reading the CSV file into a Pandas DataFrame
df = pd.read_csv(dataset_path + "Training-dataset.csv")

# Creating separate DataFrames for each genre based on binary labels
comedy_df = df.loc[df["comedy"] == 1]
cult_df = df.loc[df["cult"] == 1]
flashback_df = df.loc[df["flashback"] == 1]
historical_df = df.loc[df["historical"] == 1]
murder_df = df.loc[df["murder"] == 1]
revenge_df = df.loc[df["revenge"] == 1]
romantic_df = df.loc[df["romantic"] == 1]
scifi_df = df.loc[df["scifi"] == 1]
violence_df = df.loc[df["violence"] == 1]

# Creating a list of DataFrames for each genre
sep_label_df = [comedy_df, cult_df, flashback_df,
                historical_df,
                murder_df,
                revenge_df,
                romantic_df,
                scifi_df,
                violence_df
                ]

# Displaying the number of plots for each genre
col_val = 3
for i in sep_label_df:
    print(f"Number of '{i.columns[col_val]}' plots: {i.shape[0]}")
    col_val += 1

# Combining the title and plot synopsis into a single 'text' column
df['text'] = df['title'] + ' ' + df['plot_synopsis']

# Selecting relevant columns for training data
training_data = df[['text', 'comedy', 'cult', 'flashback', 'historical', 'murder', 'revenge', 'romantic', 'scifi', 'violence']]

# Applying a preprocessing function to the 'text' column
training_data['preprocessed_text'] = training_data['text'].apply(preprocessor)

# Displaying the first few rows of the preprocessed training data
training_data.head()


Number of 'comedy' plots: 1262
Number of 'cult' plots: 1801
Number of 'flashback' plots: 1994
Number of 'historical' plots: 186
Number of 'murder' plots: 4019
Number of 'revenge' plots: 1657
Number of 'romantic' plots: 2006
Number of 'scifi' plots: 204
Number of 'violence' plots: 3064


Unnamed: 0,text,comedy,cult,flashback,historical,murder,revenge,romantic,scifi,violence,preprocessed_text
0,Si wang ta After a recent amount of challenges...,0,0,0,0,1,1,0,0,1,si wang ta recent amount challenge billy lo br...
1,Shattered Vengeance In the crime-ridden city o...,0,0,0,0,1,1,1,0,1,shattered vengeance crimeridden city tremont r...
2,L'esorciccio Lankester Merrin is a veteran Cat...,0,1,0,0,0,0,0,0,0,lesorciccio lankester merrin veteran catholic ...
3,"Serendipity Through Seasons ""Serendipity Throu...",0,0,0,0,0,0,1,0,0,serendipity season serendipity season heartwar...
4,The Liability Young and naive 19-year-old slac...,0,0,1,0,0,0,0,0,0,liability young naive numyearold slacker adam ...


At this point I have realized how unbalanced my data is because for example out of all the documents only 186 are labelled as 'historical' which would not be the best amount to learn 'historical' label. For this reason, I have decided to split my data by 80/20 for train and test not by random subsampling but making sure that I get 80% of all the label specific documents. Code below:

In [44]:
# Function to select a specified percentage of rows for training
def training_rows(data, perc=0.8):
    """
    Selects a specified percentage of rows for training from the given DataFrame.

    Parameters:
    - data: DataFrame, the input data for training
    - perc: float, the percentage of rows to be used for training (default is 0.8)

    Returns:
    - DataFrame, subset of the input data for training
    """
    return data.head(int(len(data) * perc))

# Function to select rows for testing based on the training set
def testing_rows(data, train):
    """
    Selects the rows for testing that are not included in the training set.

    Parameters:
    - data: DataFrame, the complete dataset
    - train: DataFrame, the training set

    Returns:
    - DataFrame, subset of the input data for testing
    """
    return data.iloc[len(train):]


In [45]:
# Initializing empty sets to store unique indices for training and testing data
train_id_set = []
test_id_set = []

# Iterating over genre-specific DataFrames to split them into training and testing sets
for i in sep_label_df:
    # Selecting rows for training and testing for each genre
    i_train = training_rows(i)
    i_test = testing_rows(i, i_train)
    
    # Extending the sets with unique indices for both training and testing
    train_id_set.extend(i_train.index.unique())
    test_id_set.extend(i_test.index.unique())

# Converting the lists to sets to ensure uniqueness of indices
train_id_set = set(train_id_set)
test_id_set = set(test_id_set)


Now I can divide my available data for training and testing:

In [46]:
# Extracting preprocessed text data for training and testing based on the generated indices
X_train = training_data.loc[train_id_set, "preprocessed_text"]
X_test = training_data.loc[test_id_set, "preprocessed_text"]

# Extracting binary label data for training and testing based on the generated indices
y_train = training_data.loc[train_id_set, ['comedy', 'cult', 'flashback', 'historical', 'murder', 'revenge', 'romantic', 'scifi', 'violence']]
y_test = training_data.loc[test_id_set, ['comedy', 'cult', 'flashback', 'historical', 'murder', 'revenge', 'romantic', 'scifi', 'violence']]


## Method A: SVM “traditional” classification method 

For completing this task I will use TfidfVectorizer class from sklearn (used in Task 1) alongside with LinearSVC class and OneVsRestClassifier from sklearn.

In scikit-learn, LinearSVC stands for Linear Support Vector Classification, which is a linear classification model. It is based on the liblinear library and is similar to the SVC (Support Vector Classification) with a linear kernel, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in terms of loss functions and penalties.

Some important parameters of the LinearSVC class in scikit-learn:

* **C** (float, optional, default=1.0):
This is the regularization parameter, also known as the penalty parameter. It controls the trade-off between achieving a low training error and a low testing error. A smaller C encourages a larger-margin separating hyperplane, but at the cost of training accuracy. A larger C may result in a smaller-margin hyperplane but better training accuracy.

* **loss** (string, optional, default='squared_hinge'):
Specifies the loss function. It can take values like 'hinge' or 'squared_hinge'. 'Hinge' is the standard SVM loss (soft-margin) while 'squared_hinge' is the square of the hinge loss.

The OneVsRestClassifier in scikit-learn is a meta-estimator that fits one binary classifier per class. It is commonly used for multilabel classification where each class is considered as a separate binary classification task.

In order to fine-tune the parameters for LinearSVC and Tfidfvectorizer, I have used sklearn Pipeline and GridSearchCV.

A Pipeline in scikit-learn is a way to streamline a lot of the routine processes, making it easier to keep track of the various steps involved in a machine learning workflow. It allows you to assemble several steps that can be cross-validated together while setting different parameters.

The primary purpose of a Pipeline is to assemble several steps that can be cross-validated together, while setting different parameters.

GridSearchCV is a method for systematically working through multiple combinations of hyperparameters to find the best parameters for a model. It performs an exhaustive search over a specified parameter grid, evaluating the model performance for each combination of parameters using cross-validation.


In [47]:
# Importing necessary modules for building the pipeline and performing grid search
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Creating a pipeline with TF-IDF vectorization and a One-vs-Rest classifier using LinearSVC
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', OneVsRestClassifier(LinearSVC()))
])

# Define the parameter grid to search during grid search
param_grid = {
    'classifier__estimator__C': [0.1, 0.5, 0.8, 1],
    'classifier__estimator__dual': ["auto"],
    'tfidf__max_df': [0.3, 0.5, 0.8, 1],
    'tfidf__min_df': [1, 3, 5, 10],
    'tfidf__norm': ['l1', 'l2'],
}

# Perform grid search with cross-validation (cv=10) using F1 score for multilabel classification
grid_search = GridSearchCV(pipeline, param_grid, cv=10, scoring='f1_samples', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Print the best parameters and their corresponding cross-validated F1 score
print("Best Parameters: ", grid_search.best_params_)
print("Best Cross-validated F1 Score: {:.2f}".format(grid_search.best_score_))

# Retrieve the best model from the grid search
best_pipeline = grid_search.best_estimator_

# Evaluate the model on the test set and print the accuracy
test_accuracy = best_pipeline.score(X_test, y_test)
print("Test Set Accuracy: {:.2f}".format(test_accuracy))


Best Parameters:  {'classifier__estimator__C': 1, 'classifier__estimator__dual': 'auto', 'tfidf__max_df': 0.8, 'tfidf__min_df': 1, 'tfidf__norm': 'l2'}
Best Cross-validated F1 Score: 0.43
Test Set Accuracy: 0.23


NOTE: Before using LinearSVC class I have used the normal SVC class and carried out cross validation on that including the kernel parameter for 'linear', 'poly' and 'rbf'. I have found out that the best parameter for kernel always is 'linear' and for this reason I directly started using LinearSVC which is much more robust and efficient than SVC class. Additionally, the best parameters are found using the f1 score as it is what is going to be tested later by the validation dataset.

Afterwards I am creating my instances using the best parameters GridSearchCV found but now I will not split the Training dataset but will use all of it for training.

In [48]:
def generate_results_out_A(input_data, data_path, val_file=True):
    """
    Generates classification results using Method A for a given input data and saves the results to a CSV file.

    Parameters:
    - input_data: str, the name of the input CSV file containing plot and title data
    - data_path: str, the path to the data directory
    - val_file: bool, indicating whether the input file is a validation file (default is True)

    Returns:
    - None, saves the results to a CSV file based on the specified output name
    """

    # Reading the input data
    validation_file = pd.read_csv(data_path + input_data)

    # Combining title and plot synopsis into a single 'text' column
    validation_file['text'] = validation_file['title'] + ' ' + validation_file['plot_synopsis']

    # Applying preprocessing to the 'text' column
    validation_file['preprocessed_text'] = validation_file['text'].apply(preprocessor)

    # Selecting text data for training and testing
    X_validation_train = training_data["preprocessed_text"]
    X_validation_test = validation_file['preprocessed_text']

    # Selecting binary label data for training
    y_validation_train = training_data[['comedy', 'cult', 'flashback', 'historical', 'murder', 'revenge', 'romantic', 'scifi', 'violence']]

    # Creating a TF-IDF vectorizer with best parameters from grid search
    tfidf_vectorizer = TfidfVectorizer(max_df=grid_search.best_params_['tfidf__max_df'],
                                       min_df=grid_search.best_params_['tfidf__min_df'],
                                       norm=grid_search.best_params_['tfidf__norm'])

    # Transforming text data to TF-IDF vectors for both training and testing
    X_validation_train_tfidf = tfidf_vectorizer.fit_transform(X_validation_train)
    X_validation_test_tfidf = tfidf_vectorizer.transform(X_validation_test)

    # Creating a linear support vector classifier (SVM) with best parameter from grid search
    svm_classifier = OneVsRestClassifier(LinearSVC(C=grid_search.best_params_['classifier__estimator__C']))
    svm_classifier.fit(X_validation_train_tfidf, y_validation_train)

    # Predicting genre labels for the validation data
    y_validation_pred = svm_classifier.predict(X_validation_test_tfidf)

    # Creating a DataFrame with the predicted labels
    df_nd_array = pd.DataFrame(y_validation_pred, columns=['comedy', 'cult', 'flashback', 'historical', 'murder', 'revenge', 'romantic', 'scifi', 'violence'])

    # Concatenating the ID column from the validation file with the predicted labels
    result_df = pd.concat([validation_file['ID'], df_nd_array], axis=1)

    # Determining the output file name based on whether the input file is a validation file
    if val_file:
        output_name = '10756505-Task2-method-a-validation.csv'
    else:
        output_name = '10756505-Task2-method-a.csv'

    # Saving the results to a CSV file without headers and index
    result_df.to_csv(data_path + output_name, header=False, index=False)


### Generating results for validation data

In [49]:
generate_results_out_A("Task-2-validation-dataset.csv", dataset_path)

### Generating results for test data

In [50]:
generate_results_out_A("Task-2-test-dataset1.csv", dataset_path, False)

## Method B:  Bi-LSTM “traditional” deep learning method

### Tensorflow.keras

For this method I am going to utilize tensorflow.keras LSequential model with Bidirectional layer and LSTM layer. I am going to use the same data which was already preprocessed for Method A.

In [51]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

**Sequential** is a linear stack of layers in TensorFlow's Keras API. It allows to create models layer-by-layer in a step-by-step fashion. Each layer has weights that correspond to the layer that follows it. You can add layers using the add() method.

**Bidirectional** layer in Keras allows to process the input data in both forward and backward directions.
This is often used with recurrent layers like LSTM to capture patterns that depend on the order of the input sequence in both directions.
It takes another layer as an argument, and the combined output is the concatenation of the forward and backward layer outputs.

**Long Short-Term Memory (LSTM)** is a type of recurrent neural network (RNN) layer in Keras.
It is designed to overcome the vanishing gradient problem in standard RNNs, making it more effective in learning and remembering long-term dependencies in sequential data.
The LSTM layer has memory cells that can store and retrieve information over long sequences.

I will now train the model using the training dataset and validate it on the validation dataset.

In [52]:
X_train = training_data["preprocessed_text"].values
y_train = training_data[['comedy', 'cult', 'flashback', 'historical', 'murder', 'revenge', 'romantic', 'scifi', 'violence']].values


These classes have a lot of hyperparmeters to tune. Due to lack of time and computational resources I was only label to produce the below values. If I had more computational power, I would tune the following main parameters: max_words, max_len, embeddin_dim, epochs and batch_size. The current values are the best results from manual experiments.

In [53]:
import numpy as np

def average_word_count(string_array):
    """
    Calculates the average word count for an array of strings.

    Parameters:
    - string_array: numpy array or list, containing strings for which the word count will be calculated

    Returns:
    - float, the average word count across all strings in the array
    """

    # Function to calculate word count for a single string
    def word_count(string):
        return len(string.split())

    # Vectorize the word_count function to apply it to each element of the array
    vectorized_word_count = np.vectorize(word_count)

    # Apply the vectorized function to the array to obtain an array of word counts
    word_counts = vectorized_word_count(string_array)

    # Calculate the average word count from the array of word counts
    average_word_count = np.mean(word_counts)

    return average_word_count


I have created the above method to find the average length of a movie_plot because I have found out that the max_len parameter is the best when it is about the average size of a document in a corpus.

In [54]:
def generate_results_out_B(input_data, data_path, val_file=True):
    """
    Generates classification results using Method B for a given input data and saves the results to a CSV file.

    Parameters:
    - input_data: str, the name of the input CSV file containing plot and title data
    - data_path: str, the path to the data directory
    - val_file: bool, indicating whether the input file is a validation file (default is True)

    Returns:
    - None, saves the results to a CSV file based on the specified output name
    """

    # Reading the input data
    validation_file = pd.read_csv(data_path + input_data)

    # Combining title and plot synopsis into a single 'text' column
    validation_file['text'] = validation_file['title'] + ' ' + validation_file['plot_synopsis']

    # Applying preprocessing to the 'text' column
    validation_file['preprocessed_text'] = validation_file['text'].apply(preprocessor)

    # Determining the output file name based on whether the input file is a validation file
    if val_file:
        output_name = '10756505-Task2-method-b-validation.csv'
    else:
        output_name = '10756505-Task2-method-b.csv'

    # Extracting preprocessed text data for testing
    X_test = validation_file["preprocessed_text"].values

    # Setting parameters for tokenization and padding
    max_words = 12000  
    max_len = round(average_word_count(X_train))

    # Tokenizing and padding the text data
    tokenizer = Tokenizer(num_words=max_words)
    tokenizer.fit_on_texts(X_train)

    X_train_seq = tokenizer.texts_to_sequences(X_train)
    X_test_seq = tokenizer.texts_to_sequences(X_test)

    X_train_pad = pad_sequences(X_train_seq, maxlen=max_len)
    X_test_pad = pad_sequences(X_test_seq, maxlen=max_len)

    # Defining the LSTM model architecture
    embedding_dim = 100 
    model = Sequential()
    model.add(Embedding(input_dim=max_words, output_dim=embedding_dim, input_length=max_len))
    model.add(Bidirectional(LSTM(64)))
    model.add(Dense(9, activation='sigmoid'))

    # Compiling the model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    # Training the model
    epochs = 5  
    batch_size = 16  
    model.fit(X_train_pad, y_train, epochs=epochs, batch_size=batch_size, validation_split=0.1)

    # Predicting genre labels for the validation data
    y_pred = model.predict(X_test_pad)

    # Converting predicted probabilities to binary labels
    y_pred_binary = (y_pred > 0.5).astype(int)

    # Creating a DataFrame with the predicted labels
    df_nd_array = pd.DataFrame(y_pred_binary, columns=['comedy', 'cult', 'flashback', 'historical', 'murder', 'revenge', 'romantic', 'scifi', 'violence'])

    # Concatenating the ID column from the validation file with the predicted labels
    result_df = pd.concat([validation_file['ID'], df_nd_array], axis=1)

    # Saving the results to a CSV file without headers and index
    result_df.to_csv(data_path + output_name, header=False, index=False)


### Generating results for validation data

In [55]:
generate_results_out_B("Task-2-validation-dataset.csv", dataset_path)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### Generating results for test data

In [56]:
generate_results_out_B("Task-2-test-dataset1.csv", dataset_path, False)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
