# Task 1: Distributional semantics


The task is to use different vector representations to estimate the cosine similarity between two terms. I will use methods a) and b):

a) a sparse representation BoW with tf*idf;


b) a dense static representation word2vec;
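
For reference, cosine similarity is the dot product of two vectors divided by the product of their Euclidean norms. The snippet below is a minimal pure-Python sketch of that definition; the function name `cosine` and the sample vectors are illustrative only and are not part of the submitted code, which relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel vectors point in the same direction, so the score is (numerically) 1
print(cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))
# Orthogonal vectors share no direction, so the score is 0
print(cosine([1.0, 0.0], [0.0, 1.0]))
```

The score depends only on the direction of the vectors, not their length, which is why it is a common choice for comparing term representations of very different frequencies.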



Importing necessary packages:

In [332]:
import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings(action = 'ignore')

Loading and preprocessing the training dataset:

In [333]:
# Setting the path to the dataset
dataset_path = "./data/"

# Reading the CSV file into a Pandas DataFrame
df = pd.read_csv(dataset_path + "Training-dataset.csv")

# Merging the 'title' and 'plot_synopsis' columns to create a unified 'text' column
df['text'] = df['title'] + ' ' + df['plot_synopsis']

# Selecting the 'text' column and the genre label columns for the training data.
# .copy() creates an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning
training_data = df[['text', 'comedy', 'cult', 'flashback', 'historical', 'murder', 'revenge', 'romantic', 'scifi', 'violence']].copy()


Preprocessing method to be applied on the dataframe:

In [334]:
# Creating a set of English stopwords and additional custom stopwords
stop_words = set(stopwords.words('english') + ['reuter', '\x03'])

# Initializing a lemmatizer for text processing
lemmatizer = WordNetLemmatizer()

# Uncomment the following line to enable stemming using Porter Stemmer
# stemmer = PorterStemmer()

def preprocessor(text: str):
    """
    Preprocesses the input text by performing the following steps:
    1. Converting text to lowercase.
    2. Removing punctuation using a translation table.
    3. Replacing digits with the placeholder 'num'.
    4. Filtering out stopwords.
    5. Lemmatizing each word.

    Args:
    - text (str): Input text to be preprocessed.

    Returns:
    - str: Preprocessed text.
    """
    # Converting text to lowercase
    text = text.lower()

    # Removing punctuation using translation table
    table = str.maketrans('', '', string.punctuation)
    text = text.translate(table)

    # Replacing digits with 'num'
    text = re.sub(r'\d+', 'num', text)

    # Filtering out stopwords
    text = [word for word in text.split() if word not in stop_words]

    # Lemmatizing each word
    text = [lemmatizer.lemmatize(word) for word in text]
    
    # Uncomment the following line to enable stemming using Porter Stemmer
    # text = [stemmer.stem(word) for word in text]

    return " ".join(text)
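
To make the first three steps concrete, here is a small sketch that uses only the standard library, so it leaves out the stopword filtering and lemmatization that depend on NLTK. The helper name `normalize` is hypothetical; the input phrase is taken from one of the synopses in the dataset.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and replace digit runs with 'num'."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return re.sub(r'\d+', 'num', text)

print(normalize("Young and naive 19-year-old slacker"))
```

Because hyphens are removed before the digits are replaced, "19-year-old" collapses into the single token `numyearold`, which is the form that appears in the preprocessed output.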


In [335]:
# Applying the preprocessor function to the 'text' column and creating a new 'preprocessed_text' column
training_data['preprocessed_text'] = training_data['text'].apply(preprocessor)

# Displaying the first few rows of the updated training_data DataFrame
training_data.head()

| | text | comedy | cult | flashback | historical | murder | revenge | romantic | scifi | violence | preprocessed_text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Si wang ta After a recent amount of challenges... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | si wang ta recent amount challenge billy lo br... |
| 1 | Shattered Vengeance In the crime-ridden city o... | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | shattered vengeance crimeridden city tremont r... |
| 2 | L'esorciccio Lankester Merrin is a veteran Cat... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | lesorciccio lankester merrin veteran catholic ... |
| 3 | Serendipity Through Seasons "Serendipity Throu... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | serendipity season serendipity season heartwar... |
| 4 | The Liability Young and naive 19-year-old slac... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | liability young naive numyearold slacker adam ... |


## Method A: Sparse representation BoW with tf*idf

### TfidfVectorizer

The TfidfVectorizer class in scikit-learn (sklearn) is a part of the feature extraction module and is used to convert a collection of raw documents to a matrix of TF-IDF features. It has multiple parameters which include:





* **max_df (default=1.0)**: When building the vocabulary, terms that have a document frequency strictly higher than the given threshold are ignored. A float in the range [0.0, 1.0] is interpreted as a proportion of documents; an integer is interpreted as an absolute document count.

* **min_df (default=1)**: Similar to max_df, but terms with a document frequency strictly lower than the given threshold are ignored.

* **max_features (default=None)**: If not None, limits the number of features (terms) in the vocabulary to the top max_features ordered by term frequency across the corpus.

* **norm (default='l2')**: This parameter specifies the normalization applied to each output row, i.e. to each document vector. Options include 'l1', 'l2', or None.

* **ngram_range (default=(1, 1))**: This parameter specifies the lower and upper bounds of the n-gram sizes to include in the feature matrix; for example, (1, 2) extracts unigrams and bigrams.
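
The effect of these parameters on the vocabulary is easiest to see on a tiny corpus. The following is a sketch with four made-up documents and illustrative parameter values (not the values tuned for this task); only `TfidfVectorizer` and its `vocabulary_` attribute from scikit-learn are used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, one string per document
docs = [
    "space ship crew",
    "space ship battle",
    "quiet love story",
    "space love",
]

# Keep terms that occur in at least 2 documents and in at most 50% of documents,
# considering both unigrams and bigrams
toy_vectorizer = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
toy_vectorizer.fit(docs)

vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
```

Here "space" occurs in three of the four documents and is dropped by max_df, terms that occur in a single document are dropped by min_df, and the bigram "space ship" survives because it occurs in exactly two documents.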

I obtained the parameter values through manual tuning: I tried multiple combinations on the validation dataset and recorded the accuracy metric reported by the evaluation script.

In [336]:
# Initializing a TF-IDF vectorizer with specified parameters
# Using ngram_range to implement bi-gram functionality as required
vectorizer = TfidfVectorizer(max_df=0.5, min_df=3, ngram_range=(1, 2), norm='l1')

# Transforming the preprocessed text data using the TF-IDF vectorizer
tfidf = vectorizer.fit_transform(training_data['preprocessed_text'])
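
As a rough illustration of what each entry of the resulting matrix contains, the sketch below computes TF-IDF weights by hand for one document, assuming scikit-learn's documented default smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by the 'l1' row normalization chosen above. The helper names, counts, and document frequencies are hypothetical.

```python
import math

def smooth_idf(n_docs: int, doc_freq: int) -> float:
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

def l1_normalize(weights: dict) -> dict:
    """Scale the weights of one document so their absolute values sum to 1."""
    total = sum(abs(w) for w in weights.values())
    return {term: w / total for term, w in weights.items()}

n_docs = 4                               # hypothetical corpus size
term_counts = {"ship": 2, "love": 1}     # raw counts in one document
doc_freqs = {"ship": 2, "love": 2}       # number of documents containing each term

raw = {t: c * smooth_idf(n_docs, doc_freqs[t]) for t, c in term_counts.items()}
row = l1_normalize(raw)
print(row)
```

With 'l1' normalization the weights in each document row sum to 1, so a row can be read as a distribution of weight over the terms of that document.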

The following is the main method, which calculates the cosine similarity of two terms. First, I look up each term in the fitted vocabulary and take its column of the TF-IDF matrix as its vector, with one entry per document. If a term is not in the vocabulary, its vector is set to all ones instead of all zeros, because this approach gave better results. The cosine similarity of the two vectors is then calculated with the cosine_similarity function from scikit-learn.

In [337]:
def sim_A(term1, term2):
    """
    Calculates the cosine similarity between the TF-IDF vectors of two terms in the dataset.

    Args:
    - term1 (str): First term for similarity comparison.
    - term2 (str): Second term for similarity comparison.

    Returns:
    - float: Cosine similarity between the TF-IDF vectors of the two terms.
    """

    # Retrieve the index of each term in the TF-IDF matrix vocabulary
    term1_index = vectorizer.vocabulary_.get(term1, -1)
    term2_index = vectorizer.vocabulary_.get(term2, -1)

    # If a term is not present in the vocabulary, create a dummy TF-IDF vector with all ones
    if term1_index == -1:
        tfidf_vector_for_term_1 = np.ones((1, training_data.shape[0]))
    else:
        tfidf_vector_for_term_1 = tfidf[:, term1_index]

    if term2_index == -1:
        tfidf_vector_for_term_2 = np.ones((1, training_data.shape[0]))
    else:
        tfidf_vector_for_term_2 = tfidf[:, term2_index]

    # Calculate cosine similarity between the TF-IDF vectors of the two terms
    similarity = cosine_similarity(tfidf_vector_for_term_1.reshape(1, -1), tfidf_vector_for_term_2.reshape(1, -1))[0][0]

    return similarity
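
The idea behind `sim_A` can be shown on a small dense matrix. The sketch below is a simplified stand-in, not the function above: the matrix values, the toy vocabulary, and the helper names `term_vector` and `cosine` are all made up for illustration, and plain NumPy arrays replace the sparse TF-IDF matrix.

```python
import numpy as np

# Hypothetical document-term matrix: rows are documents, columns are terms
doc_term = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 1.0, 0.0],
    [0.3, 0.0, 0.7],
])
toy_vocabulary = {"ship": 0, "space": 1, "love": 2}

def term_vector(term):
    """Return the term's column, or a vector of ones for an out-of-vocabulary term."""
    idx = toy_vocabulary.get(term)
    if idx is None:
        return np.ones(doc_term.shape[0])
    return doc_term[:, idx]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(term_vector("ship"), term_vector("love")))     # both in vocabulary
print(cosine(term_vector("ship"), term_vector("unknown")))  # one term out of vocabulary
```

Two terms are similar under this representation when they tend to carry weight in the same documents. With the ones fallback, a pair of two out-of-vocabulary terms always scores 1, and a pair with one out-of-vocabulary term scores higher the more evenly the known term is spread across documents.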


### Results generation for given input data

The following method is used to compute the results for a given input file and write them to an output file in the required format.

In [338]:
def generate_results_output(input_data, data_path, method, val_file=True):
    """
    Generates output CSV files containing similarity scores based on the specified method.

    Args:
    - input_data (str): Name of the input CSV file containing term pairs.
    - data_path (str): Path to the directory containing the input files.
    - method (str): Method to be used ('A' or 'B') for similarity calculation.
    - val_file (bool): Indicates whether the input file is a validation file.

    Returns:
    - None: Outputs a CSV file with similarity scores based on the chosen method.
    """

    # Determine column names and output file name based on method and validation status
    if val_file and method == 'A':
        col_names = ['index', 'term1', 'term2', 'score']
        output_name = '10756505-Task1-method-a-validation.csv'
    elif val_file and method == 'B':
        col_names = ['index', 'term1', 'term2', 'score']
        output_name = '10756505-Task1-method-b-validation.csv'
    elif not val_file and method == 'A':
        col_names = ['index', 'term1', 'term2']
        output_name = '10756505-Task1-method-a.csv'
    elif not val_file and method == 'B':
        col_names = ['index', 'term1', 'term2']
        output_name = '10756505-Task1-method-b.csv'
    else:
        print("Invalid method arg")
        return

    # Read the input CSV file with specified column names
    validation_file = pd.read_csv(data_path + input_data, names=col_names)
    all_similarity_vals = []

    # Iterate through rows and calculate similarity scores based on the chosen method
    for index, row in validation_file.iterrows():
        if method == 'A':
            similarity_val = sim_A(row['term1'], row['term2'])
        elif method == 'B':
            similarity_val = sim_B(row['term1'], row['term2'])
        else:
            print("Invalid method arg")
            return

        all_similarity_vals.append(similarity_val)

    # Add the calculated similarity scores to the DataFrame
    validation_file['prediction_score'] = all_similarity_vals

    # Create a DataFrame with 'index' and 'prediction_score' columns
    prediction_df = validation_file[['index', 'prediction_score']]

    # Save the prediction DataFrame to a CSV file
    prediction_df.to_csv(data_path + output_name, header=False, index=False)
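
The file handling in this function can be illustrated without touching the disk. The sketch below mirrors the read and write steps using in-memory buffers; the term pairs and the prediction scores are invented placeholders, not values from the task files.

```python
import io
import pandas as pd

# Hypothetical headerless validation rows: index, term1, term2, gold score
raw = io.StringIO("0,car,automobile,0.9\n1,king,cabbage,0.1\n")
pairs = pd.read_csv(raw, names=['index', 'term1', 'term2', 'score'])

# Placeholder predictions standing in for the output of sim_A or sim_B
pairs['prediction_score'] = [0.8, 0.2]

# Write only the index and the predicted score, without header or row index
out = io.StringIO()
pairs[['index', 'prediction_score']].to_csv(out, header=False, index=False)
print(out.getvalue())
```

Passing `names=` to `read_csv` tells pandas that the file has no header row, and `header=False, index=False` keeps the output to two plain columns per line.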


### Generating results for validation data

In [339]:
generate_results_output("Task-1-validation-dataset.csv", dataset_path, 'A')

### Generating results for test data

In [340]:
generate_results_output("Task-1-test-dataset1.csv", dataset_path, 'A', False)

## Method B: Dense static representation with Word2vec

### Gensim Model

For this method I use the Word2Vec class from the gensim library, trained on the same preprocessed data that was used for Method A.

In [341]:
import gensim
from gensim.models import Word2Vec

gensim.models.Word2Vec is a popular implementation of the Word2Vec algorithm in the Gensim library, which is widely used for word embedding. Word embedding is a technique that represents words as vectors in a continuous vector space, capturing semantic relationships between words. Some key parameters of the Word2Vec model:

* **sentences**: This parameter takes a list of sentences as input, where each sentence is represented as a list of words. The model learns vector representations for words based on the context in which they appear in these sentences.
* **vector_size**: The size of the word vectors. This parameter sets the dimensionality of the word vectors. Common values are in the range of 100 to 300.
* **window**: The maximum distance between the current and predicted word within a sentence. It defines the context window for learning word representations. Words outside this window are not considered as context words.
* **min_count**: Ignores all words with a total frequency lower than this. This helps to filter out rare words that may not have sufficient context for meaningful vector representations.
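
To give an intuition for `window` and `min_count`, the sketch below enumerates the (center, context) word pairs that a fixed window produces and filters rare words by total frequency. It is a simplified pure-Python illustration with hypothetical helper names; gensim's actual training differs in details (for example, it can vary the effective window size per position), so this is not a reimplementation.

```python
from collections import Counter

def context_pairs(sentence, window):
    """List (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

def filter_rare(sentences, min_count):
    """Drop words whose total frequency over all sentences is below min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [[w for w in sentence if counts[w] >= min_count] for sentence in sentences]

print(context_pairs(["young", "naive", "slacker"], window=1))
print(filter_rare([["space", "ship"], ["space", "love"]], min_count=2))
```

A larger window pairs each word with more distant neighbours, which tends to capture topical rather than purely syntactic similarity, while a higher min_count removes words that appear too rarely to be learned reliably.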

The parameter values below were manually tuned based on the accuracy reported by the evaluation script on the validation dataset.

In [358]:
# Initializing a Word2Vec model using gensim
model = gensim.models.Word2Vec(
    sentences=[t.split() for t in training_data['preprocessed_text'].to_list()],
    vector_size=100,  # Size of the word vectors
    window=10,  # Maximum distance between the current and predicted word within a sentence
    min_count=3  # Ignores all words with a total frequency lower than this
)


I use logic similar to Method A to compute the similarity, substituting a vector of ones for out-of-vocabulary (OOV) terms. Note that here, if either term is OOV, both vectors are replaced with ones.

In [359]:
def sim_B(term1, term2):
    """
    Calculates the cosine similarity between the Word2Vec vectors of two terms in the dataset.

    Args:
    - term1 (str): First term for similarity comparison.
    - term2 (str): Second term for similarity comparison.

    Returns:
    - float: Cosine similarity between the Word2Vec vectors of the two terms.
    """

    try:
        # Get vectors for the two terms from the Word2Vec model
        vector1 = model.wv[term1].reshape(1, -1)
        vector2 = model.wv[term2].reshape(1, -1)

    except KeyError:
        # If either term is out of vocabulary, both vectors are replaced with ones,
        # so any pair containing an OOV term gets a similarity of 1.0 (up to floating-point rounding)
        vector1 = np.ones((1, model.vector_size))
        vector2 = np.ones((1, model.vector_size))

    # Calculate cosine similarity using sklearn's cosine_similarity function
    similarity_score = cosine_similarity(vector1, vector2)[0, 0]

    return similarity_score
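
In `sim_B`, a `KeyError` from either lookup replaces both vectors with ones. For comparison, the sketch below shows a variant that falls back per term, as Method A does. This is an alternative for illustration only, not the function used to produce the submitted results; a plain dictionary of made-up vectors stands in for the trained model, and all names are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for the trained word vectors
toy_vectors = {
    "ship": np.array([1.0, 0.0, 2.0]),
    "boat": np.array([2.0, 0.0, 4.0]),
}

def vector_or_ones(term, vectors, size):
    """Return the stored vector, or a vector of ones if the term is unknown."""
    vec = vectors.get(term)
    return vec if vec is not None else np.ones(size)

def sim_per_term_fallback(term1, term2, vectors, size=3):
    v1 = vector_or_ones(term1, vectors, size)
    v2 = vector_or_ones(term2, vectors, size)
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(sim_per_term_fallback("ship", "boat", toy_vectors))     # both known
print(sim_per_term_fallback("ship", "unknown", toy_vectors))  # one unknown
```

With a per-term fallback, a pair with one unknown term still depends on the known term's vector, whereas the approach in `sim_B` assigns the same score to every such pair. Which behaviour scores better would have to be checked on the validation data.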


### Generating results for validation data

In [360]:
generate_results_output("Task-1-validation-dataset.csv", dataset_path, 'B')

### Generating results for test data

In [345]:
generate_results_output("Task-1-test-dataset1.csv", dataset_path, 'B', False)

Given more time and computational resources, it would have been possible to further tune the parameters of TfidfVectorizer and gensim's Word2Vec, which might lead to higher accuracy scores.