# Semantic textual similarity
## Final Project IHLT - UPC 2022/2023
### Authors : Rob - Edison Bejarano

1. Data

2. What we are doing?
#### Techniques for preprocessing text for similarity comparison

- Stemming: is a process that involves reducing words to their base form, or stem, in order to normalize the text and remove variations in word endings. For example, the words "running," "runs," and "ran" would all be reduced to the stem "run" by a stemming algorithm.


- Lemmatization: is a process that involves reducing words to their base form, or lemma, in order to normalize the text and remove variations in word endings. Unlike stemming, lemmatization takes into account the context of the word in order to determine its lemma, resulting in more accurate and meaningful reductions. For example, the words "running," "runs," and "ran" would all be reduced to the lemma "run" by a lemmatization algorithm.

- Tf-idf weighting: Is a method for assigning a weight to each word in a document based on its relative importance. The weight is calculated by multiplying the term frequency (tf) of the word by the inverse document frequency (idf) of the word across all documents in a corpus. This weighting scheme gives higher weight to words that are more frequent within a document but less frequent across the corpus, making them more important for characterizing the document.

- NES : Function used the Natural Language Toolkit (nltk) to identify named entities in a given sentence. The sentence parameter is the sentence in which named entities should be identified, and the binary parameter determines whether named entities should be grouped together or returned as individual tokens. The function returns a set of the named entities and individual words found in the sentence.


These techniques can be used in combination with each other or with stopwords removal to preprocess text and improve the accuracy of similarity comparison. For example, you could use stemming or lemmatization to normalize the words in the phrases, and then use tf-idf weighting to assign importance to each word based on its frequency within the phrases and across a larger corpus. This would allow you to compare the similarity of the phrases in a more meaningful and accurate way


3. Results

## Install packages

In [313]:
%pip install -q spacy nltk numpy pandas scikit-learn pyjarowinkler lazypredict
!python3 -m spacy download en_core_web_sm

/usr/bin/fish: /home/rob/miniconda3/lib/libtinfo.so.6: no version information available (required by /usr/bin/fish)
/usr/bin/fish: /home/rob/miniconda3/lib/libstdc++.so.6: version `CXXABI_1.3.13' not found (required by /usr/bin/fish)
/usr/bin/fish: /home/rob/miniconda3/lib/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /usr/bin/fish)
/usr/bin/fish: /home/rob/miniconda3/lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /usr/bin/fish)
Note: you may need to restart the kernel to use updated packages.
/usr/bin/fish: /home/rob/miniconda3/lib/libtinfo.so.6: no version information available (required by /usr/bin/fish)
/usr/bin/fish: /home/rob/miniconda3/lib/libstdc++.so.6: version `CXXABI_1.3.13' not found (required by /usr/bin/fish)
/usr/bin/fish: /home/rob/miniconda3/lib/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /usr/bin/fish)
/usr/bin/fish: /home/rob/miniconda3/lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /usr/

# Libraries

In [263]:
import os
import re
import nltk
import spacy
import string
import numpy as np
import pandas as pd

from tqdm import tqdm
from itertools import chain
from functools import partial
from argparse import Namespace
from pyjarowinkler import distance
from collections.abc import Iterable
from nltk.wsd import lesk
from nltk.stem import WordNetLemmatizer
from nltk.metrics import jaccard_distance
from nltk.corpus import stopwords, wordnet
from nltk import pos_tag, ne_chunk, Tree
from nltk.metrics.distance import jaccard_distance
from scipy.stats import pearsonr

from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import make_scorer
from typing import List
from lazypredict.Supervised import REGRESSORS, LazyRegressor

nltk.download('wordnet')
nltk.download('maxent_ne_chunker')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')
nltk.download('gutenberg')
nltk.download('conll2000')
nltk.download('brown')
nltk.download('words')

[nltk_data] Downloading package wordnet to /home/rob/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/rob/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package punkt to /home/rob/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/rob/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/rob/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package omw-1.4 to /home/rob/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package gutenberg to /home/rob/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package conll2000 to /home/rob

True

## Download data

In [None]:
#!curl -o /content/drive/MyDrive/Colab_Notebooks/2.IHLT/final_project/trial.tgz https://gebakx.github.io/ihlt/sts/resources/trial.tgz
#!curl -o /content/drive/MyDrive/Colab_Notebooks/2.IHLT/final_project/train.tgz https://gebakx.github.io/ihlt/sts/resources/train.tgz
#!curl -o /content/drive/MyDrive/Colab_Notebooks/2.IHLT/final_project/test-gold.tgz https://gebakx.github.io/ihlt/sts/resources/test-gold.tgz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2003  100  2003    0     0  47690      0 --:--:-- --:--:-- --:--:-- 47690
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  122k  100  122k    0     0   505k      0 --:--:-- --:--:-- --:--:--  503k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  115k  100  115k    0     0   345k      0 --:--:-- --:--:-- --:--:--  345k


# Bring data

In [15]:
!tar zxvf ../final_project/train.tgz
!tar zxvf ../final_project/trial.tgz
!tar zxvf ../final_project/test-gold.tgz

!rm ../final_project/train.tgz
!rm ../final_project/test-gold.tgz 
!rm ../final_project/trial.tgz

train/
train/00-readme.txt
train/STS.output.MSRpar.txt
train/STS.input.SMTeuroparl.txt
train/STS.input.MSRpar.txt
train/STS.gs.MSRpar.txt
train/STS.input.MSRvid.txt
train/STS.gs.MSRvid.txt
train/correlation.pl
train/STS.gs.SMTeuroparl.txt
trial/
trial/STS.input.txt
trial/00-readme.txt
trial/STS.gs.txt
trial/STS.ouput.txt
test-gold/
test-gold/STS.input.MSRpar.txt
test-gold/STS.gs.MSRpar.txt
test-gold/STS.input.MSRvid.txt
test-gold/STS.gs.MSRvid.txt
test-gold/STS.input.SMTeuroparl.txt
test-gold/STS.gs.SMTeuroparl.txt
test-gold/STS.input.surprise.SMTnews.txt
test-gold/STS.gs.surprise.SMTnews.txt
test-gold/STS.input.surprise.OnWN.txt
test-gold/STS.gs.surprise.OnWN.txt
test-gold/STS.gs.ALL.txt
test-gold/00-readme.txt


# Usesful functions

In [406]:
# ------------------------------ #
# Jaccard similarity Function
# ------------------------------ #
def jaccard_similarity(s1: List[str], s2: List[str]):
    return 1 - jaccard_distance(set(s1), set(s2))

# ------------------------------ #
# Jaccard Similarity List
# ------------------------------ #
def jaccard_similarity_list(s1: List[List[str]], s2: List[List[str]]):
    sims = []
    for l1, l2 in zip(s1, s2):
        sim = jaccard_similarity(l1, l2)
        sims.append(sim)
    return np.array(sims)


def dice_similarity(s1, s2):
    assert isinstance(s1, Iterable), f"s1 must be an iterable, not {type(s1)}"
    assert isinstance(s2, Iterable), f"s2 must be an iterable, not {type(s2)}"
    s1 = set(s1)
    s2 = set(s2)
    intersection = s1.intersection(s2)
    return 2 * len(intersection) / (len(s1) + len(s2))

def dice_similarity_list(s1: List[List[str]], s2: List[List[str]]):
    sims = []
    for l1, l2 in zip(s1, s2):
        sim = dice_similarity(l1, l2)
        sims.append(sim)
    return np.array(sims)

# ------------------------------ #
# Jarowinkler Similarity
# ------------------------------ #   
def calculateJarowinklerSimilarity(dataframe, column1, column2):

    aux = []
    for row in dataframe.itertuples():
            
        # Longest one selected
        if len(row[column1]) >= len(row[column2]):
            sentence1 = row[column1]
            sentence2 = row[column2]
        else:
            sentence1 = row[column2]
            sentence2 = row[column1]

        similarities_array = []
        for word1 in sentence1:
            max = 0

        for word2 in sentence2:
            similarity = distance.get_jaro_distance(str(word1), str(word2), winkler=True, scaling=0.1)
            
            if max < similarity:
                max = similarity
            
        similarities_array.append(max)

        aux.append(np.array(similarities_array).mean())

    return aux

# ------------------------------ #
#       Overlap Similarity
# ------------------------------ # 
def overlap_distance(sentence1, sentence2):
  # Zip the characters from the two strings together
  pairs = zip(sentence1, sentence2)

  # Initialize a counter for the overlap distance
  overlap = 0

  # Iterate over the pairs of characters
  for a, b in pairs:
    # If the characters are the same, increment the overlap counter
    if a == b:
      overlap += 1

  # Return the overlap distance
  return overlap

# ------------------------------ #
#    Overlap Similarity list
# ------------------------------ # 

def overlap_similarity_list(s1: List[List[str]], s2: List[List[str]]):
    sims = []
    for l1, l2 in zip(s1, s2):
        sim = overlap_distance(l1, l2)
        sims.append(sim)
    return np.array(sims)



In [385]:
tag_dict = {
        "NN": "n",
        "NNS": "n",
        "NNP": "n",
        "NNPS": "n",
        "VB": "v",
        "VBD": "v",
        "VBG": "v",
        "VBN": "v",
        "VBP": "v",
        "VBZ": "v",
        "RB": "r",
        "RBR": "r",
        "RBS": "r",
        "JJ": "a",
        "JJR": "a",
        "JJS": "a",
  }

# ------------------------------ #
#         Get Wordnet POS
# ------------------------------ #
def get_wordnet_pos(word):
  """Map POS tag to first character lemmatize() accepts"""
  tag = nltk.pos_tag([word])[0][1][0].upper()
  
        
  return tag_dict.get(tag, wordnet.NOUN)


#Auxiliar spacy
nlp = spacy.load('en_core_web_sm')
special_pattern = re.compile(r"[^ \nA-Za-z0-9À-ÖØ-öø-ÿЀ-ӿ/]+")

# ------------------------------ #
#   Function to tokenize spacy
# ------------------------------ #
def spacy_tokenize(sentence):
  return [ word.text.lower() for word in nlp.tokenizer(sentence) ]

def tokenize_column_spacy(column):
  tokenize = [spacy_tokenize(sentence) for sentence in column]
  
  return tokenize
  
# ------------------------------ #
#   Function to lemmatize spacy
# ------------------------------ #
def spacy_lemmatize(sentence: str):
  return [ word.lemma_.lower() for word in nlp.tokenizer(sentence) ]
  
# ------------------------------ #
#   Function to tokenize
# ------------------------------ #
def tokenize_column(column):
    #put in lowercase
    tokenizator = [nltk.word_tokenize(sentence) for sentence in column]
    #Lowercase the tokens
    return [ [ word.lower() for word in sentence ] for sentence in tokenizator ]


#--------------------------------------------#
#  Function to NES
#--------------------------------------------#
def NES(sentence: str, binary: bool):
    x = nltk.pos_tag(nltk.word_tokenize(sentence))
    res = nltk.ne_chunk(x, binary=binary)
    necs_and_words = set()
    for chunk in res:
        if hasattr(chunk, 'label'):
            # Add NE
            token = ' '.join(term[0] for term in chunk)
            necs_and_words.add(token)
        else:
            token = chunk[0]
            if token.isalnum():
                necs_and_words.add(token.lower())
    return necs_and_words

 #--------------------------------------------#
 # Function to get entities from a column
 # -------------------------------------------# 
def get_entities_new(column):
    entities = []
    for sentence in column:
        entities.append(NES(sentence, False))
    return entities

# ------------------------------ #
# Lemmatization text process
# ------------------------------ #
lemmatizer = WordNetLemmatizer()
# ------------------------------ #
#   Function to lemmatize
# ------------------------------ #
def lemmatize(tokenized_text: List[List[str]]):
  
  lemmas = []

  for sentence in tqdm(tokenized_text):
    sentence_lemmas = []
    for word in sentence:
      sentence_lemmas.append(lemmatizer.lemmatize(word.lower(), get_wordnet_pos(word.lower())))
    lemmas.append(sentence_lemmas)

  return lemmas

# ------------------------------ #
#   Stopwords initialization
# ------------------------------ #
stopwords_list = set(nltk.corpus.stopwords.words("english"))
stopwords_list = stopwords_list.union(set(string.punctuation))
stopwords_list = stopwords_list.union(set(['.', ',', ';', '."']))

# ------------------------------ #
#   Function to remove stopwords
# ------------------------------ #
def remove_stopwords(column: List[List[str]]):
  #Lowercase the tokens
  return [ [ word.lower() for word in sentence if word not in stopwords_list ]  for sentence in column ]


# ------------------------------ #
#   Function to synonimize
# ------------------------------ #
def synonimize_column(column):
  #put in lowercase
  tokenized = [nltk.word_tokenize(sentence) for sentence in column]
  #Lowercase the tokens
  tokenized = [ [ word.lower() for word in sentence ] for sentence in tokenized ]
  #Synonimize
  synonimized = [ [ word for word in sentence if word not in stopwords_list ] for sentence in tokenized ]

  return synonimized


# ------------------------------ #
#   Function to synset
# ------------------------------ #
def get_synset_column(tokenized_text: List[List[str]]):
  synset = []
  for sentence in tokenized_text:
    pos = nltk.pos_tag(sentence)
    lemmas = []
    for pair in pos:
      if pair[1][0] in tag_dict.keys():
        lemma = wnl.lemmatize(p[0].lower(), pos=p[1][0].lower())
        lemmas.append(lemma)
      else:
        lemma = pair[0]
        lemmas.append(lemma)
    synset.append(lemmas)
  
  return synset


# ------------------------------ #
#  Function to NE(Name entities)
# ------------------------------ #
def apply_ne(tokenized_text: List[str]):
    # tokenize the sentence and find the POS tag for each token
    sentences_ne = list(ne_chunk(pos_tag(tokenized_text), binary=True))
    result = []
    for el in sentences_ne:
        if isinstance(el, Tree):
            leaves = el.leaves()
            result.append(" ".join(word[0] for word in leaves))
        else:
            result.append(el[0])
    return result

# used apply_ne function to get NE from a column
def get_name_entities(column: List[List[str]]):
  ne = []
  for sentence in column:
    ne.append(apply_ne(sentence))
  return ne



# ------------------------------ #
#  Function to get ngrams
# ------------------------------ #the
def get_ngrams_column(column: List[List[str]], n: int):
  ngrams = []
  for sentence in column:
    ngrams.append(apply_ngram(sentence, n))
  return ngrams


def apply_ngram(sentence: List[str], n: int):
    if len(sentence) < n:
        return [tuple(sentence)]
    return list(nltk.ngrams(sentence, n))


# ------------------------------ #
#     Function to get lesk 
# ------------------------------ #
def get_lesk_column(column):
  lesk_text = []

  for sentence in column:
    synset = [lesk(sentence, word) for word in sentence]
    synset = {word for word in synset if word is not None}
    lesk_text.append(synset)

  return lesk_text


  
def apply_jaccard_lesk(sentence1: str, sentence2: str):

  # Apply lesk to sentence 1
  synset1 = [ lesk(sentence1, word) for word in sentence1 ]
  synset1 = { word for word in synset1 if word is not None }

  # Apply lesk to sentence 1
  synset2 = [ lesk(sentence2, word) for word in sentence2 ]
  synset2 = { word for word in synset2 if word is not None }

  # Calculate distance
  distance = jaccard_distance(synset1, synset2)

  return distance


def lemma_spacy(sentences):
  sentences = [special_chars_out(s) for s in sentences]
  token_lemmatize = [spacy_lemmatize(phrase) for phrase in sentences]
  return token_lemmatize


def special_chars_out(sentence: str):
  
  sentence = sentence.replace("'ve", " have")
  sentence = sentence.replace("n't", " not")
  sentence = sentence.replace("'ll", " will")  
  sentence = sentence.replace("'m", " am")  
  sentence = sentence.replace("'re", " are")
  
  sentence = re.sub(special_pattern, " ", sentence)  

  return sentence

In [266]:
first = "My Bonnie White lies over the ocean, in Picadilli Circus at 3:00pm."
second = "My Bonnie lied over the sea! Over the sea..."


p1 = nltk.word_tokenize(first)
p2 = nltk.word_tokenize(second)

print(p1)
print(p2)

test_frase_benet = jaccard_similarity(p1,p2)
print(test_frase_benet)

['My', 'Bonnie', 'White', 'lies', 'over', 'the', 'ocean', ',', 'in', 'Picadilli', 'Circus', 'at', '3:00pm', '.']
['My', 'Bonnie', 'lied', 'over', 'the', 'sea', '!', 'Over', 'the', 'sea', '...']
0.21052631578947367


In [267]:
# aply the function apply_ne to a phrase
phrase = "I am a student of the University of Granada and that is in that city, that is in Spain, The artificial beach named angelica is going to be super cool."
#tokenize the phrase
ne = apply_ne(phrase)
print(ne)

['I', 'am', 'a', 'student', 'of', 'the', 'University', 'of', 'Granada', 'and', 'that', 'is', 'in', 'that', 'city', ',', 'that', 'is', 'in', 'Spain', ',', 'The', 'artificial', 'beach', 'named', 'angelica', 'is', 'going', 'to', 'be', 'super', 'cool', '.']


In [268]:
# Functions of preprocessing
def read_data(text_datas: List[str], gs_datas: List[str]):
  all_df_text = []
  for text_data, gs_data in zip(text_datas, gs_datas):
    df_text = pd.read_csv(text_data, sep=r'\t', engine='python', header=None)
    df_text.columns = ["text1", "text2"]
    df_text['gs'] = pd.read_csv(gs_data, sep='\t', header=None)
    all_df_text.append(df_text.dropna())
  return pd.concat(all_df_text)

def get_dataset(path: str) -> pd.DataFrame:
  files = sorted(os.listdir(path))
  input_files = [ os.path.join(path, file) for file in files if 'input' in file ]
  gs_files = [ os.path.join(path, file) for file in files if 'gs' in file ]
  df = read_data(input_files, gs_files)
  return df

# Pre-processing

### Data information
- trial : includes the definition of the scores, a sample of 5 sentence pairs and the input and output formats. It is not needed, but it is useful for prototyping.

- train : training data from paraphrasing data sets, input and output formats.

- test : test data from paraphrasing data sets.

In [306]:
train_path = '../final_project/train'
trial_path = '../final_project/trial'
test_path  = '../final_project/test-gold'

# **Similarities**

In [378]:

train_dataset_pruebas = get_dataset(train_path)
test_dataset_pruebas = get_dataset(test_path)
df = train_dataset_pruebas


In [379]:
# Tokenization features
tokenized_text1 = tokenize_column(df['text1'])
tokenized_text2 = tokenize_column(df['text2'])

# Lemmatization features
lemmatize_text1 = lemmatize(tokenized_text1)
lemmatize_text2 = lemmatize(tokenized_text1)


#Use stopwords function to remove stopwords
stopwords_text1 = remove_stopwords(tokenized_text1)
stopwords_text2 = remove_stopwords(tokenized_text2)



# Synonyms features
synonyms_text1 = []
synonyms_text2 = []
# Use sysnstesizer to get synonyms
for i in tqdm(range(len(tokenized_text1))):
    synonyms_text1.append([syn for w in tokenized_text1[i] for syn in wordnet.synsets(w)])
    synonyms_text2.append([syn for w in tokenized_text2[i] for syn in wordnet.synsets(w)])


# Synonyms features another way
synonimized_text1_new = synonimize_column(df['text1'])
synonimized_text2_new = synonimize_column(df['text2'])


# NES features
NES_column_text1 = get_entities_new(df['text1'])
NES_column_text2 = get_entities_new(df['text2'])

# Name entities features
name_entities_text1 = get_name_entities(tokenized_text1)
name_entities_text2 = get_name_entities(tokenized_text2)

ngrams_column_2_text1 = get_ngrams_column(tokenized_text1, 2)
ngrams_column_2_text2 = get_ngrams_column(tokenized_text2, 2)

ngrams_column_3_text1 = get_ngrams_column(tokenized_text1, 3)
ngrams_column_3_text2 = get_ngrams_column(tokenized_text2, 3)

# Lesk features
lesk_text1 = get_lesk_column(tokenized_text1)
lesk_text2 = get_lesk_column(tokenized_text2)

# Spacy features
spacy_tok_text1 = tokenize_column_spacy(df['text1'])
spacy_tok_text2 = tokenize_column_spacy(df['text2'])

100%|██████████| 2234/2234 [00:02<00:00, 763.01it/s] 
100%|██████████| 2234/2234 [00:02<00:00, 765.09it/s] 
100%|██████████| 2234/2234 [00:00<00:00, 3198.12it/s]


In [None]:

# Lemmatization features
lemmatize_text1 = lemmatize(df['text1'])
lemmatize_text2 = lemmatize(df['text2'])
print(lemmatize_text1[0])
print(lemmatize_text2[0])

#Use stopwords function to remove stopwords
stopwords_text1 = remove_stopwords(df['text1'])
stopwords_text2 = remove_stopwords(df['text2'])



# Synonyms features
synonyms_text1 = []
synonyms_text2 = []
# Use sysnstesizer to get synonyms
for i in tqdm(range(len(tokenized_text1))):
    synonyms_text1.append([syn for w in tokenized_text1[i] for syn in wordnet.synsets(w)])
    synonyms_text2.append([syn for w in tokenized_text2[i] for syn in wordnet.synsets(w)])


# Synonyms features another way
synonimized_text1_new = synonimize_column(df['text1'])
synonimized_text2_new = synonimize_column(df['text2'])


# NES features
NES_column_text1 = get_entities_new(df['text1'])
NES_column_text2 = get_entities_new(df['text2'])

# Name entities features
name_entities_text1 = get_name_entities(df['text1'])
name_entities_text2 = get_name_entities(df['text2'])

ngrams_column_2_text1 = get_ngrams_column(tokenized_text1, 2)
ngrams_column_2_text2 = get_ngrams_column(tokenized_text2, 2)

ngrams_column_3_text1 = get_ngrams_column(tokenized_text1, 3)
ngrams_column_3_text2 = get_ngrams_column(tokenized_text2, 3)

# Lesk features
lesk_text1 = get_lesk_column(tokenized_text1)
lesk_text2 = get_lesk_column(tokenized_text2)



# Synset features
#synset_text1 = synset_column(df['text1'])
#synset_text2 = synset_column(df['text2'])

100%|██████████| 2234/2234 [00:03<00:00, 700.94it/s] 
100%|██████████| 2234/2234 [00:02<00:00, 745.92it/s] 


['but', 'other', 'source', 'close', 'to', 'the', 'sale', 'said', 'vivendi', 'wa', 'keeping', 'the', 'door', 'open', 'to', 'further', 'bid', 'and', 'hoped', 'to', 'see', 'bidder', 'interested', 'in', 'individual', 'asset', 'team', 'up', '.']
['but', 'other', 'source', 'close', 'to', 'the', 'sale', 'said', 'vivendi', 'wa', 'keeping', 'the', 'door', 'open', 'for', 'further', 'bid', 'in', 'the', 'next', 'day', 'or', 'two', '.']


100%|██████████| 2234/2234 [00:00<00:00, 3132.87it/s]


In [380]:
lemma_synonyms_text1 = []
lemma_synonyms_text2 = []
# Use sysnstesizer to get synonyms
for i in tqdm(range(len(tokenized_text1))):
    lemma_synonyms_text1.append([syn for w in lemmatize_text1[i] for syn in wordnet.synsets(w)])
    lemma_synonyms_text2.append([syn for w in lemmatize_text2[i] for syn in wordnet.synsets(w)])


100%|██████████| 2234/2234 [00:00<00:00, 2715.35it/s]


In [410]:
d = overlap_distance(tokenized_text1[0], tokenized_text2[0])
#o = overlap_similarity_list(tokenized_text1, tokenized_text2)

print(d)
#print(o[0])

16


In [272]:
# Jaccard similarity features
jaccard_similarity_tokenized = jaccard_similarity_list(tokenized_text1, tokenized_text2)
jaccard_similarity_synonyms_new = jaccard_similarity_list(synonimized_text1_new, synonimized_text2_new)
jaccard_similarity_NES = jaccard_similarity_list(NES_column_text1, NES_column_text2)
jaccard_similarity_lemmatize = jaccard_similarity_list(lemmatize_text1, lemmatize_text2)
jaccard_similarity_stopwords = jaccard_similarity_list(stopwords_text1, stopwords_text2)
jaccard_similarity_synonyms = jaccard_similarity_list(synonyms_text1, synonyms_text2)
jaccard_similarity_name_entities = jaccard_similarity_list(name_entities_text1, name_entities_text2)
jaccard_similarity_ngrams_2 = jaccard_similarity_list(ngrams_column_2_text1, ngrams_column_2_text2)
jaccard_similarity_ngrams_3 = jaccard_similarity_list(ngrams_column_3_text1, ngrams_column_3_text2)
jaccard_similarity_lesk = jaccard_similarity_list(lesk_text1, lesk_text2)

print("Jaccard similarity tokenized: ", jaccard_similarity_tokenized[:3])
print("Jaccard similarity lemmatize: ", jaccard_similarity_lemmatize[:3])
print("Jaccard similarity stopwords: ", jaccard_similarity_stopwords[:3])
print("Jaccard similarity synonyms: ", jaccard_similarity_synonyms[:3])
print("Jaccard similarity synonyms new: ", jaccard_similarity_synonyms_new[:3])
print("Jaccard similarity name entities: ", jaccard_similarity_name_entities[:3])
print("Jaccard similarity ngrams 2: ", jaccard_similarity_ngrams_2[:3])
print("Jaccard similarity ngrams 3: ", jaccard_similarity_ngrams_3[:3])
print("Jaccard similarity lesk: ", jaccard_similarity_lesk[:3])

Jaccard similarity tokenized:  [0.5483871  0.42105263 0.34782609]
Jaccard similarity lemmatize:  [1. 1. 1.]
Jaccard similarity stopwords:  [0.47368421 0.46153846 0.33333333]
Jaccard similarity synonyms:  [0.67680608 0.30172414 0.38562092]
Jaccard similarity synonyms new:  [0.47368421 0.46153846 0.33333333]
Jaccard similarity name entities:  [0.5483871  0.42105263 0.33333333]
Jaccard similarity ngrams 2:  [0.37837838 0.13043478 0.20689655]
Jaccard similarity ngrams 3:  [0.32432432 0.04347826 0.1       ]
Jaccard similarity lesk:  [0.35714286 0.3125     0.22222222]


In [390]:
def get_features(df: pd.DataFrame):

    #--------------------------------------------#
    # 0. NLTK Words features
    #--------------------------------------------#
    #print("NLTK Words features")
    
    #nltk_words_text1 = []
    #nltk_words_text2 = []

    #--------------------------------------------#
    # 1. Tokenize features
    #--------------------------------------------#    
    tokenized_text1 = tokenize_column(df['text1'])
    tokenized_text2 = tokenize_column(df['text2'])

    #--------------------------------------------#
    # 2. Lemmatize features
    #--------------------------------------------#
    lemmatize_text1 = lemmatize(tokenized_text1)
    lemmatize_text2 = lemmatize(tokenized_text2)


    #--------------------------------------------#
    # 3. Stopwords features
    #--------------------------------------------#   
    stopwords_text1 = remove_stopwords(tokenized_text1)
    stopwords_text2 = remove_stopwords(tokenized_text2)

    #--------------------------------------------#
    # 4. Synonims features
    #--------------------------------------------#
    synonyms_text1 = []
    synonyms_text2 = []
    # Use sysnstesizer to get synonyms
    for i in tqdm(range(len(tokenized_text1))):
        synonyms_text1.append([syn for w in tokenized_text1[i] for syn in wordnet.synsets(w)])
        synonyms_text2.append([syn for w in tokenized_text2[i] for syn in wordnet.synsets(w)])

    
    #--------------------------------------------#
    # 5. NES features
    #--------------------------------------------#
    NES_column_text1 = get_entities_new(df['text1'])
    NES_column_text2 = get_entities_new(df['text2'])

    
    #--------------------------------------------#
    # 6. Name entities features
    #--------------------------------------------#
    name_entities_text1 = get_name_entities(tokenized_text1)
    name_entities_text2 = get_name_entities(tokenized_text2)

    #--------------------------------------------#
    # 7. Ngrams features
    #--------------------------------------------#

    ngrams_column_2_text1 = get_ngrams_column(tokenized_text1, 2)
    ngrams_column_2_text2 = get_ngrams_column(tokenized_text2, 2)

    ngrams_column_3_text1 = get_ngrams_column(tokenized_text1, 3)
    ngrams_column_3_text2 = get_ngrams_column(tokenized_text2, 3)

    ngrams_column_4_text1 = get_ngrams_column(tokenized_text1, 4)
    ngrams_column_4_text2 = get_ngrams_column(tokenized_text2, 4)

    ngrams_column_5_text1 = get_ngrams_column(tokenized_text1, 5)
    ngrams_column_5_text2 = get_ngrams_column(tokenized_text2, 5)

    ngrams_column_6_text1 = get_ngrams_column(tokenized_text1, 6)
    ngrams_column_6_text2 = get_ngrams_column(tokenized_text2, 6)

    ngrams_column_7_text1 = get_ngrams_column(tokenized_text1, 7)
    ngrams_column_7_text2 = get_ngrams_column(tokenized_text2, 7)

    ngrams_column_8_text1 = get_ngrams_column(tokenized_text1, 8)
    ngrams_column_8_text2 = get_ngrams_column(tokenized_text2, 8)

    ngrams_column_9_text1 = get_ngrams_column(tokenized_text1, 9)
    ngrams_column_9_text2 = get_ngrams_column(tokenized_text2, 9)

    #--------------------------------------------#
    # 8. Lesk features
    #--------------------------------------------#
    # Lesk features
    lesk_text1 = get_lesk_column(tokenized_text1)
    lesk_text2 = get_lesk_column(tokenized_text2)

    # --------------------------------------------#
    # 9. Spacy words features
    # --------------------------------------------#
    print("Spacy words features")
    spacy_words_text1 = tokenize_column_spacy(df['text1'])
    spacy_words_text2 = tokenize_column_spacy(df['text2'])

    # --------------------------------------------#
    # 10. Spacy lemmatize features
    # --------------------------------------------#
    print("Spacy lemmatize features")
    spacy_lemmatize_text1 = lemma_spacy(df['text1'])
    spacy_lemmatize_text2 = lemma_spacy(df['text2'])

    #--------------------------------------------#
    # 11.Lemma synonyms features
    #--------------------------------------------#
    lemma_synonyms_text1 = []
    lemma_synonyms_text2 = []
    # Use sysnstesizer to get synonyms
    for i in tqdm(range(len(tokenized_text1))):
        lemma_synonyms_text1.append([syn for w in lemmatize_text1[i] for syn in wordnet.synsets(w)])
        lemma_synonyms_text2.append([syn for w in lemmatize_text2[i] for syn in wordnet.synsets(w)])

    #print("Word synonyms features"

    #--------------------------------------------#
    # 12. Synset features
    #--------------------------------------------#
    print("Synset features")
    synset_text1 = get_synset_column(tokenized_text1)
    synset_text2 = get_synset_column(tokenized_text2)


    features = [
        # Jaccard similarity
        jaccard_similarity_list(tokenized_text1, tokenized_text2),
        jaccard_similarity_list(lemmatize_text1, lemmatize_text2),
        jaccard_similarity_list(stopwords_text1, stopwords_text2),
        jaccard_similarity_list(synonyms_text1, synonyms_text2),
        jaccard_similarity_list(NES_column_text1, NES_column_text2),
        jaccard_similarity_list(name_entities_text1, name_entities_text2),
        jaccard_similarity_list(ngrams_column_2_text1, ngrams_column_2_text2),
        jaccard_similarity_list(ngrams_column_3_text1, ngrams_column_3_text2),
        jaccard_similarity_list(ngrams_column_4_text1, ngrams_column_4_text2),
        jaccard_similarity_list(ngrams_column_5_text1, ngrams_column_5_text2),
        jaccard_similarity_list(ngrams_column_6_text1, ngrams_column_6_text2),
        jaccard_similarity_list(ngrams_column_7_text1, ngrams_column_7_text2),
        jaccard_similarity_list(ngrams_column_8_text1, ngrams_column_8_text2),
        jaccard_similarity_list(ngrams_column_9_text1, ngrams_column_9_text2),
        jaccard_similarity_list(lesk_text1, lesk_text2),
        jaccard_similarity_list(spacy_words_text1, spacy_words_text2),
        jaccard_similarity_list(spacy_lemmatize_text1, spacy_lemmatize_text2),


        # jaccard_similarity_list(nltk_words_text1, nltk_words_text2),
        # jaccard_similarity_list(spacy_words_text1, spacy_words_text2),
        # jaccard_similar
        jaccard_similarity_list(lemma_synonyms_text1,lemma_synonyms_text2),
        jaccard_similarity_list(synset_text1, synset_text2),
        #jaccard_similarity_list(synset_text1, synset_text2),
        
        
        # Dice similarity
        dice_similarity_list(tokenized_text1, tokenized_text2),
        dice_similarity_list(lemmatize_text1, lemmatize_text2),
        dice_similarity_list(stopwords_text1, stopwords_text2),
        dice_similarity_list(synonyms_text1, synonyms_text2),
        dice_similarity_list(NES_column_text1, NES_column_text2),
        dice_similarity_list(name_entities_text1, name_entities_text2),
        dice_similarity_list(ngrams_column_2_text1, ngrams_column_2_text2),
        dice_similarity_list(ngrams_column_3_text1, ngrams_column_3_text2),
        dice_similarity_list(ngrams_column_4_text1, ngrams_column_4_text2),
        dice_similarity_list(ngrams_column_5_text1, ngrams_column_5_text2),
        dice_similarity_list(ngrams_column_6_text1, ngrams_column_6_text2),
        dice_similarity_list(ngrams_column_7_text1, ngrams_column_7_text2),
        dice_similarity_list(ngrams_column_8_text1, ngrams_column_8_text2),
        dice_similarity_list(ngrams_column_9_text1, ngrams_column_9_text2),
        dice_similarity_list(lesk_text1, lesk_text2),
        dice_similarity_list(spacy_words_text1, spacy_words_text2),
        dice_similarity_list(spacy_lemmatize_text1, spacy_lemmatize_text2),

        #jaccard_similarity_list(nltk_words_text1, nltk_words_text2),
        #jaccard_similarity_list(spacy_words_text1, spacy_words_text2),
        #jaccard_similarity_
        dice_similarity_list(lemma_synonyms_text1,lemma_synonyms_text2),
        dice_similarity_list(synset_text1, synset_text2),
        #jaccard_similarity_list(synset_text1, synset_text2),

    ]
    return np.array(features)

# **Training**

## Get training dataset

In [391]:
train_dataset = get_dataset(train_path)
print(train_dataset.shape)
train_dataset.head()

(2234, 3)


Unnamed: 0,text1,text2,gs
0,But other sources close to the sale said Viven...,But other sources close to the sale said Viven...,4.0
1,Micron has declared its first quarterly profit...,Micron's numbers also marked the first quarter...,3.75
2,The fines are part of failed Republican effort...,"Perry said he backs the Senate's efforts, incl...",2.8
3,"The American Anglican Council, which represent...","The American Anglican Council, which represent...",3.4
4,The tech-loaded Nasdaq composite rose 20.96 po...,The technology-laced Nasdaq Composite Index <....,2.4


In [392]:
y_train = train_dataset['gs'].values
y_train.shape

(2234,)

## Get features of the training dataset

In [393]:
X_train_features: np.ndarray = get_features(train_dataset)
X_train_features.shape

100%|██████████| 2234/2234 [00:02<00:00, 796.32it/s] 
100%|██████████| 2234/2234 [00:02<00:00, 832.77it/s] 
100%|██████████| 2234/2234 [00:00<00:00, 3349.81it/s]


Spacy words features
Spacy lemmatize features


100%|██████████| 2234/2234 [00:00<00:00, 2983.61it/s]


Synset features


(38, 2234)

In [394]:
X_train_features.shape

(38, 2234)

# **Testing**

## Get the test dataset

In [395]:
test_dataset = get_dataset(test_path)
print(test_dataset.shape)
test_dataset.head()

(3108, 3)


Unnamed: 0,text1,text2,gs
0,The problem likely will mean corrective change...,He said the problem needs to be corrected befo...,4.4
1,The technology-laced Nasdaq Composite Index .I...,The broad Standard & Poor's 500 Index .SPX inc...,0.8
2,"""It's a huge black eye,"" said publisher Arthur...","""It's a huge black eye,"" Arthur Sulzberger, th...",3.6
3,SEC Chairman William Donaldson said there is a...,"""I think there's a building confidence that th...",3.4
4,Vivendi shares closed 1.9 percent at 15.80 eur...,"In New York, Vivendi shares were 1.4 percent d...",1.4


## Get features of the test dataset

In [396]:
X_test_features: np.ndarray = get_features(test_dataset)
X_test_features.shape

100%|██████████| 3108/3108 [00:02<00:00, 1213.19it/s]
100%|██████████| 3108/3108 [00:02<00:00, 1239.73it/s]
100%|██████████| 3108/3108 [00:00<00:00, 4544.69it/s]


Spacy words features
Spacy lemmatize features


100%|██████████| 3108/3108 [00:00<00:00, 5050.11it/s]


Synset features


(38, 3108)

In [397]:
y_test = test_dataset['gs'].values
y_test.shape

(3108,)

## Normalize all features

In [398]:
# Normalize the data
scaler = StandardScaler()
scaler.fit(X_train_features.T)
X_train_features_norm = scaler.transform(X_train_features.T)
X_test_features_norm = scaler.transform(X_test_features.T)

## Train the model

In [399]:
# Print all shapes
print("X_train_features shape: ", X_train_features_norm.shape)
print("y_train shape: ", y_train.shape)
print("X_test_features shape: ", X_test_features_norm.shape)
print("y_test shape: ", y_test.shape)

X_train_features shape:  (2234, 38)
y_train shape:  (2234,)
X_test_features shape:  (3108, 38)
y_test shape:  (3108,)


### Train a simple regression model

In [400]:
# Train
reg = LinearRegression()
reg.fit(X_train_features_norm, y_train)

In [401]:
# Evaluate
y_pred_train = reg.predict(X_train_features_norm)
y_pred_test = reg.predict(X_test_features_norm)

print("Train pearson: ", pearsonr(y_train, y_pred_train)[0])
print("Test pearson: ", pearsonr(y_test, y_pred_test)[0])

Train pearson:  0.7316572243439259
Test pearson:  0.5870344728661504


### Train multiple regression models

In [402]:
# Select all of the models that we are going to use
REGRESSORS = [ c for c in REGRESSORS if c[0] != 'QuantileRegressor' ]
print("Number of regressors:", len(REGRESSORS))

Number of regressors: 41


In [403]:
# Build pearson score function
def pearsonr_scorer(y_true, y_pred):
    assert len(y_true) == len(y_pred)
    score = pearsonr(y_true, y_pred)[0]
    return score

pearson_scorer = make_scorer(pearsonr_scorer)
pearson_scorer.__name__ = 'pearson_scorer'

In [404]:
# Fit all models
reg = LazyRegressor(predictions=True, regressors=REGRESSORS, custom_metric=pearsonr_scorer)
regresion_models, regresion_predictions = reg.fit(X_train_features_norm, X_test_features_norm, y_train, y_test)

'tuple' object has no attribute '__name__'
Invalid Regressor(s)


100%|██████████| 41/41 [00:21<00:00,  1.88it/s]


In [405]:
regresion_models.sort_values(by='pearsonr_scorer', ascending=False)

Unnamed: 0_level_0,Adjusted R-Squared,R-Squared,RMSE,Time Taken,pearsonr_scorer
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
RandomForestRegressor,0.35,0.35,1.1,2.53,0.67
ExtraTreesRegressor,0.33,0.34,1.11,1.02,0.66
GradientBoostingRegressor,0.33,0.34,1.11,1.05,0.65
LGBMRegressor,0.31,0.32,1.13,0.22,0.65
HistGradientBoostingRegressor,0.3,0.31,1.14,5.53,0.65
BaggingRegressor,0.31,0.32,1.13,0.3,0.64
AdaBoostRegressor,0.35,0.36,1.1,0.17,0.63
NuSVR,0.32,0.33,1.12,0.35,0.63
XGBRegressor,0.24,0.25,1.19,0.35,0.63
SVR,0.29,0.3,1.14,0.5,0.63


In [423]:
# Train MLP model
mlp = MLPRegressor(hidden_layer_sizes=(200, 100, 50), learning_rate='adaptive', max_iter=1000, verbose=True)
mlp.fit(X_train_features_norm, y_train)
y_pred_test = mlp.predict(X_test_features_norm)
print("Test pearson: ", pearsonr(y_test, y_pred_test)[0])

Iteration 1, loss = 3.03948815
Iteration 2, loss = 1.52205760
Iteration 3, loss = 1.01258261
Iteration 4, loss = 0.77674076
Iteration 5, loss = 0.67618250
Iteration 6, loss = 0.59800196
Iteration 7, loss = 0.55090673
Iteration 8, loss = 0.52676833
Iteration 9, loss = 0.50642458
Iteration 10, loss = 0.51563406
Iteration 11, loss = 0.49087451
Iteration 12, loss = 0.46937004
Iteration 13, loss = 0.44942276
Iteration 14, loss = 0.43873619
Iteration 15, loss = 0.42683126
Iteration 16, loss = 0.44512938
Iteration 17, loss = 0.41409897
Iteration 18, loss = 0.40046986
Iteration 19, loss = 0.40419851
Iteration 20, loss = 0.40088978
Iteration 21, loss = 0.39059042
Iteration 22, loss = 0.38274293
Iteration 23, loss = 0.38530324
Iteration 24, loss = 0.37764409
Iteration 25, loss = 0.37416559
Iteration 26, loss = 0.37676852
Iteration 27, loss = 0.38983758
Iteration 28, loss = 0.39971261
Iteration 29, loss = 0.36995542
Iteration 30, loss = 0.37056348
Iteration 31, loss = 0.37068264
Iteration 32, los