## Task 1: Distributional semantics
### Estimate the similarity between two terms
### a) using a sparse representation BoW with tf*idf


Importing the necessary packages:

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings(action = 'ignore')

Loading training dataset:

In [4]:
df = pd.read_csv("Training-dataset.csv")
df.head()

| | ID | title | plot_synopsis | comedy | cult | flashback | historical | murder | revenge | romantic | scifi | violence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8f5203de-b2f8-4c0c-b0c1-835ba92422e9 | Si wang ta | After a recent amount of challenges, Billy Lo ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 1 | 6416fe15-6f8a-41d4-8a78-3e8f120781c7 | Shattered Vengeance | In the crime-ridden city of Tremont, renowned ... | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 |
| 2 | 4979fe9a-0518-41cc-b85f-f364c91053ca | L'esorciccio | Lankester Merrin is a veteran Catholic priest ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | b672850b-a1d9-44ed-9cff-025ee8b61e6f | Serendipity Through Seasons | "Serendipity Through Seasons" is a heartwarmin... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | b4d8e8cc-a53e-48f8-be6a-6432b928a56d | The Liability | Young and naive 19-year-old slacker, Adam (Jac... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |


Creating a new dataframe where plot_synopsis and title are merged:

In [5]:
df['text'] = df['title'] + ' ' + df['plot_synopsis']
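# .copy() makes training_data an independent dataframe, so adding columns to it later does not write to a view of df
training_data = df[['text', 'comedy', 'cult', 'flashback', 'historical', 'murder', 'revenge', 'romantic', 'scifi', 'violence']].copy()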
training_data.head() 

| | text | comedy | cult | flashback | historical | murder | revenge | romantic | scifi | violence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Si wang ta After a recent amount of challenges... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 1 | Shattered Vengeance In the crime-ridden city o... | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 |
| 2 | L'esorciccio Lankester Merrin is a veteran Cat... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Serendipity Through Seasons "Serendipity Throu... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | The Liability Young and naive 19-year-old slac... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |


### Data pre-processing

The text is pre-processed by lowercasing it, removing punctuation and stopwords, replacing digit sequences with the token `num`, and lemmatizing. Stemming was not applied because, for this task, it lowered the accuracy of the model in earlier attempts.

In [17]:
import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

stop_words = set(stopwords.words('english') + ['reuter', '\x03'])
lemmatizer = WordNetLemmatizer()
# stemmer = PorterStemmer()

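def preprocessor(text: str) -> str:
    # Lowercase so that tokens match the lowercase stopword list
    text = text.lower()

    # Strip all ASCII punctuation; this also joins hyphenated words ("crime-ridden" -> "crimeridden")
    table = str.maketrans('', '', string.punctuation)
    text = text.translate(table)

    # Replace every run of digits with the placeholder token "num"
    text = re.sub(r'\d+', 'num', text)

    # Tokenize on whitespace and drop stopwords
    text = [word for word in text.split() if word not in stop_words]

    # Reduce each token to its WordNet lemma (lemmatize() treats tokens as nouns by default)
    text = [lemmatizer.lemmatize(word) for word in text]

    # Stemming is left disabled (see the note above)
    # text = [stemmer.stem(word) for word in text]

    return " ".join(text)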


Applying the pre-processing to the available data and storing the result in a new dataframe column named `preprocessed_text`. This column is used later for the model.

In [18]:
training_data['preprocessed_text'] = training_data['text'].apply(preprocessor)
training_data.head()

| | text | comedy | cult | flashback | historical | murder | revenge | romantic | scifi | violence | preprocessed_text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Si wang ta After a recent amount of challenges... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | si wang ta recent amount challenge billy lo br... |
| 1 | Shattered Vengeance In the crime-ridden city o... | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | shattered vengeance crimeridden city tremont r... |
| 2 | L'esorciccio Lankester Merrin is a veteran Cat... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | lesorciccio lankester merrin veteran catholic ... |
| 3 | Serendipity Through Seasons "Serendipity Throu... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | serendipity season serendipity season heartwar... |
| 4 | The Liability Young and naive 19-year-old slac... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | liability young naive numyearold slacker adam ... |
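
Tokens such as `crimeridden` and `numyearold` in the output above follow from the order of the steps: punctuation (including hyphens) is stripped before digits are replaced. The sketch below reproduces that behaviour using only the standard library. It is an illustration, not the notebook's pipeline: `TOY_STOPWORDS` and `toy_preprocess` are hypothetical names, the stopword list is a tiny stand-in for NLTK's, and lemmatization is omitted.

```python
import re
import string

# Hypothetical stand-in for NLTK's English stopword list (assumption)
TOY_STOPWORDS = {"a", "and", "the", "is", "in", "of"}

def toy_preprocess(text: str) -> str:
    """Lowercase, strip punctuation, replace digit runs, drop stopwords."""
    text = text.lower()
    # Removing punctuation first glues hyphenated words together
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Every run of digits becomes the placeholder token "num"
    text = re.sub(r"\d+", "num", text)
    return " ".join(w for w in text.split() if w not in TOY_STOPWORDS)

print(toy_preprocess("Young and naive 19-year-old slacker, Adam"))
# young naive numyearold slacker adam
print(toy_preprocess("In the crime-ridden city of Tremont"))
# crimeridden city tremont
```

If separate tokens such as `crime` and `ridden` were preferred, one option would be to replace punctuation with spaces instead of deleting it; the notebook does not do this.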


### TF-IDF vectorizer

I use `TfidfVectorizer` from scikit-learn. It converts a collection of text documents into a matrix of TF-IDF features, in which each term's frequency in a document is weighted by how rare the term is across the entire corpus. Setting `max_df = 0.5` is a common way to filter out terms that appear too frequently across documents and therefore contribute little to the discriminative power of the model: terms with a document frequency higher than 50% (0.5) of the documents in the corpus are excluded from the vocabulary.

In [8]:
vectorizer = TfidfVectorizer(max_df = 0.5)
tfidf = vectorizer.fit_transform(training_data['preprocessed_text'])
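
To make the effect of `max_df` concrete, here is a minimal sketch on a made-up four-document corpus (the documents and the name `toy_docs` are assumptions for illustration, not part of the dataset). A term that occurs in three of the four documents exceeds the 50% threshold and is dropped, while a term that occurs in exactly half of them is kept, because the cutoff only removes terms strictly above the threshold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "movie" is in 3 of 4 documents, "hero" in 2 of 4
toy_docs = [
    "movie hero fights villain",
    "movie hero saves city",
    "movie romance blooms",
    "detective solves murder",
]

toy_vectorizer = TfidfVectorizer(max_df=0.5)
toy_tfidf = toy_vectorizer.fit_transform(toy_docs)

# "movie" (document frequency 0.75) is filtered out; "hero" (0.5) survives
print(sorted(toy_vectorizer.vocabulary_))
# One row per document, one column per remaining term
print(toy_tfidf.shape)
```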

The following is the main method, which calculates the cosine similarity of two terms. First, I look up each term in the vectorizer's vocabulary and take its column of the already fitted TF-IDF matrix (its weights across all documents) as the term's vector. If a term is not in the vocabulary, its vector is set to zeros. The cosine similarity of the two vectors is then calculated with the `cosine_similarity` function from scikit-learn.

In [9]:
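def sim(term1, term2):
    # Column index of each term in the TF-IDF matrix; -1 marks an out-of-vocabulary term
    term1_index = vectorizer.vocabulary_.get(term1, -1)
    term2_index = vectorizer.vocabulary_.get(term2, -1)

    # A term's vector is its column of TF-IDF weights (one entry per document);
    # an out-of-vocabulary term gets an all-zero vector
    if term1_index == -1:
        tfidf_vector_for_term_1 = np.zeros((1, training_data.shape[0]))
    else:
        tfidf_vector_for_term_1 = tfidf[:, term1_index]

    if term2_index == -1:
        tfidf_vector_for_term_2 = np.zeros((1, training_data.shape[0]))
    else:
        tfidf_vector_for_term_2 = tfidf[:, term2_index]

    # Reshape both vectors to 1 x n_documents rows and take their cosine similarity;
    # an all-zero vector gives a similarity of 0
    similarity = cosine_similarity(tfidf_vector_for_term_1.reshape(1, -1), tfidf_vector_for_term_2.reshape(1, -1))[0][0]

    return similarity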
    

Example of the implementation:

In [10]:
term1 = "carry"
term2 = "region"

similar = sim(term1, term2)

print(similar)

0.038294354412983636
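
The idea behind `sim` can be checked by hand on a small dense matrix. The sketch below uses a made-up 3-document, 3-term TF-IDF matrix (`toy_vocab`, `toy_matrix`, `term_vector`, and `toy_sim` are hypothetical names, and the weights are invented) and computes the cosine similarity of two term columns directly with NumPy. It assumes, as `sim` does, that an out-of-vocabulary term maps to a zero vector and therefore scores 0.

```python
import numpy as np

# Hypothetical toy TF-IDF matrix: rows are documents, columns are terms
toy_vocab = {"hero": 0, "villain": 1, "romance": 2}
toy_matrix = np.array([
    [0.5, 0.5, 0.0],
    [0.7, 0.0, 0.0],
    [0.0, 0.0, 1.0],
])

def term_vector(term):
    """Return the term's column of weights, or zeros if out of vocabulary."""
    idx = toy_vocab.get(term)
    if idx is None:
        return np.zeros(toy_matrix.shape[0])
    return toy_matrix[:, idx]

def toy_sim(term1, term2):
    """Cosine similarity of two term columns; 0.0 if either has zero norm."""
    u, v = term_vector(term1), term_vector(term2)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

print(toy_sim("hero", "villain"))   # share document 0, so the score is positive
print(toy_sim("hero", "romance"))   # never co-occur, so the score is 0.0
print(toy_sim("hero", "unknown"))   # out-of-vocabulary term, so the score is 0.0
```

Two terms score high only when they carry weight in the same documents, which is why pairs that rarely co-occur in the corpus receive values close to zero.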


### Creating the output file for validation

In [11]:
validation_file = pd.read_csv("Task-1-validation-dataset.csv", names=['index', 'term1', 'term2', 'score'])
validation_file.head()

| | index | term1 | term2 | score |
|---|---|---|---|---|
| 0 | 1 | absorb | learn | 5.48 |
| 1 | 2 | absorb | withdraw | 2.97 |
| 2 | 3 | achieve | accomplish | 8.57 |
| 3 | 4 | achieve | try | 4.42 |
| 4 | 6 | acquire | get | 8.82 |


In [12]:
all_similarity_vals = []
for index, row in validation_file.iterrows():
    similarity_val = sim(row['term1'], row['term2'])
    all_similarity_vals.append(similarity_val)
    
validation_file['prediction_score'] = all_similarity_vals
validation_file.head()

| | index | term1 | term2 | score | prediction_score |
|---|---|---|---|---|---|
| 0 | 1 | absorb | learn | 5.48 | 0.020255 |
| 1 | 2 | absorb | withdraw | 2.97 | 0.0 |
| 2 | 3 | achieve | accomplish | 8.57 | 0.001639 |
| 3 | 4 | achieve | try | 4.42 | 0.03704 |
| 4 | 6 | acquire | get | 8.82 | 0.0 |


In [13]:
prediction_df = validation_file[['index', 'prediction_score']]
prediction_df.head()

| | index | prediction_score |
|---|---|---|
| 0 | 1 | 0.020255 |
| 1 | 2 | 0.0 |
| 2 | 3 | 0.001639 |
| 3 | 4 | 0.03704 |
| 4 | 6 | 0.0 |


In [14]:
prediction_df.to_csv('prediction_file_Task1_a.csv', header = False, index = False)