<a href="https://colab.research.google.com/github/EjbejaranosAI/IHLT/blob/main/final_siames.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semantic textual similarity
## Final Project IHLT - UPC 2022/2023
### Authors : Roberto Ariosa - Edison Bejarano

1. Data

2. What we are doing?
#### Techniques for preprocessing text for similarity comparison

- Stemming: is a process that involves reducing words to their base form, or stem, in order to normalize the text and remove variations in word endings. For example, the words "running," "runs," and "ran" would all be reduced to the stem "run" by a stemming algorithm.


- Lemmatization: is a process that involves reducing words to their base form, or lemma, in order to normalize the text and remove variations in word endings. Unlike stemming, lemmatization takes into account the context of the word in order to determine its lemma, resulting in more accurate and meaningful reductions. For example, the words "running," "runs," and "ran" would all be reduced to the lemma "run" by a lemmatization algorithm.

- Tf-idf weighting: Is a method for assigning a weight to each word in a document based on its relative importance. The weight is calculated by multiplying the term frequency (tf) of the word by the inverse document frequency (idf) of the word across all documents in a corpus. This weighting scheme gives higher weight to words that are more frequent within a document but less frequent across the corpus, making them more important for characterizing the document.

- NES : Function used the Natural Language Toolkit (nltk) to identify named entities in a given sentence. The sentence parameter is the sentence in which named entities should be identified, and the binary parameter determines whether named entities should be grouped together or returned as individual tokens. The function returns a set of the named entities and individual words found in the sentence.


These techniques can be used in combination with each other or with stopwords removal to preprocess text and improve the accuracy of similarity comparison. For example, you could use stemming or lemmatization to normalize the words in the phrases, and then use tf-idf weighting to assign importance to each word based on its frequency within the phrases and across a larger corpus. This would allow you to compare the similarity of the phrases in a more meaningful and accurate way


3. Results

## Install packages

In [1]:
!pip install -q spacy nltk numpy pandas scikit-learn pyjarowinkler lazypredict ipykernel
!python3 -m spacy download en_core_web_sm

[?25l[K     |▏                               | 10 kB 22.3 MB/s eta 0:00:01[K     |▍                               | 20 kB 6.4 MB/s eta 0:00:01[K     |▋                               | 30 kB 8.2 MB/s eta 0:00:01[K     |▉                               | 40 kB 3.9 MB/s eta 0:00:01[K     |█                               | 51 kB 3.8 MB/s eta 0:00:01[K     |█▎                              | 61 kB 4.5 MB/s eta 0:00:01[K     |█▌                              | 71 kB 4.9 MB/s eta 0:00:01[K     |█▊                              | 81 kB 5.5 MB/s eta 0:00:01[K     |█▉                              | 92 kB 4.0 MB/s eta 0:00:01[K     |██                              | 102 kB 4.1 MB/s eta 0:00:01[K     |██▎                             | 112 kB 4.1 MB/s eta 0:00:01[K     |██▌                             | 122 kB 4.1 MB/s eta 0:00:01[K     |██▊                             | 133 kB 4.1 MB/s eta 0:00:01[K     |███                             | 143 kB 4.1 MB/s eta 0:00:01[K    

# Libraries

In [2]:
!pip install lazypredict pyjarowinkler

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
import os
import re
import nltk
import spacy
import string
import numpy as np
import pandas as pd

from tqdm import tqdm
from itertools import chain
from functools import partial
from argparse import Namespace
from pyjarowinkler import distance
from collections.abc import Iterable
from nltk.wsd import lesk
from nltk.stem import WordNetLemmatizer
from nltk.metrics import jaccard_distance
from nltk.corpus import stopwords, wordnet
from nltk import pos_tag, ne_chunk, Tree
from nltk.metrics.distance import jaccard_distance
from scipy.stats import pearsonr

from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import make_scorer
from typing import List
from lazypredict.Supervised import REGRESSORS, LazyRegressor

nltk.download('wordnet')
nltk.download('maxent_ne_chunker')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')
nltk.download('gutenberg')
nltk.download('conll2000')
nltk.download('brown')
nltk.download('words')
nltk.download('wordnet_ic')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading packa

True

## Download data

In [None]:
#!curl -o /content/drive/MyDrive/Colab_Notebooks/2.IHLT/final_project/trial.tgz https://gebakx.github.io/ihlt/sts/resources/trial.tgz
#!curl -o /content/drive/MyDrive/Colab_Notebooks/2.IHLT/final_project/train.tgz https://gebakx.github.io/ihlt/sts/resources/train.tgz
#!curl -o /content/drive/MyDrive/Colab_Notebooks/2.IHLT/final_project/test-gold.tgz https://gebakx.github.io/ihlt/sts/resources/test-gold.tgz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2003  100  2003    0     0  47690      0 --:--:-- --:--:-- --:--:-- 47690
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  122k  100  122k    0     0   505k      0 --:--:-- --:--:-- --:--:--  503k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  115k  100  115k    0     0   345k      0 --:--:-- --:--:-- --:--:--  345k


# Bring data

In [None]:
!cp /content/drive/MyDrive/Colab_Notebooks_personal/2_IHLT/final_project/train.tgz .
!cp /content/drive/MyDrive/Colab_Notebooks_personal/2_IHLT/final_project/trial.tgz .
!cp /content/drive/MyDrive/Colab_Notebooks_personal/2_IHLT/final_project/test-gold.tgz .

!tar zxvf /content/train.tgz
!tar zxvf /content/trial.tgz 
!tar zxvf /content/test-gold.tgz

train/
train/00-readme.txt
train/STS.output.MSRpar.txt
train/STS.input.SMTeuroparl.txt
train/STS.input.MSRpar.txt
train/STS.gs.MSRpar.txt
train/STS.input.MSRvid.txt
train/STS.gs.MSRvid.txt
train/correlation.pl
train/STS.gs.SMTeuroparl.txt
trial/
trial/STS.input.txt
trial/00-readme.txt
trial/STS.gs.txt
trial/STS.ouput.txt
test-gold/
test-gold/STS.input.MSRpar.txt
test-gold/STS.gs.MSRpar.txt
test-gold/STS.input.MSRvid.txt
test-gold/STS.gs.MSRvid.txt
test-gold/STS.input.SMTeuroparl.txt
test-gold/STS.gs.SMTeuroparl.txt
test-gold/STS.input.surprise.SMTnews.txt
test-gold/STS.gs.surprise.SMTnews.txt
test-gold/STS.input.surprise.OnWN.txt
test-gold/STS.gs.surprise.OnWN.txt
test-gold/STS.gs.ALL.txt
test-gold/00-readme.txt


# Usesful functions

In [None]:
# ------------------------------ #
# Jaccard similarity Function
# ------------------------------ #
def jaccard_similarity(s1: List[str], s2: List[str]):
    return 1 - jaccard_distance(set(s1), set(s2))

# ------------------------------ #
# Jaccard Similarity List
# ------------------------------ #
def jaccard_similarity_list(s1: List[List[str]], s2: List[List[str]]):
    sims = []
    for l1, l2 in zip(s1, s2):
        sim = jaccard_similarity(l1, l2)
        sims.append(sim)
    return np.array(sims)


def dice_similarity(s1, s2):
    assert isinstance(s1, Iterable), f"s1 must be an iterable, not {type(s1)}"
    assert isinstance(s2, Iterable), f"s2 must be an iterable, not {type(s2)}"
    s1 = set(s1)
    s2 = set(s2)
    intersection = s1.intersection(s2)
    return 2 * len(intersection) / (len(s1) + len(s2))

def dice_similarity_list(s1: List[List[str]], s2: List[List[str]]):
    sims = []
    for l1, l2 in zip(s1, s2):
        sim = dice_similarity(l1, l2)
        sims.append(sim)
    return np.array(sims)

# ------------------------------ #
# Jarowinkler Similarity
# ------------------------------ #   
def calculateJarowinklerSimilarity(dataframe, column1, column2):

    aux = []
    for row in dataframe.itertuples():
            
        # Longest one selected
        if len(row[column1]) >= len(row[column2]):
            sentence1 = row[column1]
            sentence2 = row[column2]
        else:
            sentence1 = row[column2]
            sentence2 = row[column1]

        similarities_array = []
        for word1 in sentence1:
            max = 0

        for word2 in sentence2:
            similarity = distance.get_jaro_distance(str(word1), str(word2), winkler=True, scaling=0.1)
            
            if max < similarity:
                max = similarity
            
        similarities_array.append(max)

        aux.append(np.array(similarities_array).mean())

    return aux

# ------------------------------ #
#       Overlap Similarity
# ------------------------------ # 
def overlap_distance(sentence1, sentence2):
  # Zip the characters from the two strings together
  pairs = zip(sentence1, sentence2)

  # Initialize a counter for the overlap distance
  overlap = 0

  # Iterate over the pairs of characters
  for a, b in pairs:
    # If the characters are the same, increment the overlap counter
    if a == b:
      overlap += 1

  # Return the overlap distance
  return overlap

# ------------------------------ #
#    Overlap Similarity list
# ------------------------------ # 

def overlap_similarity_list(s1: List[List[str]], s2: List[List[str]]):
    sims = []
    for l1, l2 in zip(s1, s2):
        sim = overlap_distance(l1, l2)
        sims.append(sim)
    return np.array(sims)



In [None]:
tag_dict = {
        "NN": "n",
        "NNS": "n",
        "NNP": "n",
        "NNPS": "n",
        "VB": "v",
        "VBD": "v",
        "VBG": "v",
        "VBN": "v",
        "VBP": "v",
        "VBZ": "v",
        "RB": "r",
        "RBR": "r",
        "RBS": "r",
        "JJ": "a",
        "JJR": "a",
        "JJS": "a",
  }

# ------------------------------ #
#         Get Wordnet POS
# ------------------------------ #
def get_wordnet_pos(word):
  """Map POS tag to first character lemmatize() accepts"""
  tag = nltk.pos_tag([word])[0][1][0].upper()
  
        
  return tag_dict.get(tag, wordnet.NOUN)


#Auxiliar spacy
nlp = spacy.load('en_core_web_sm')
special_pattern = re.compile(r"[^ \nA-Za-z0-9À-ÖØ-öø-ÿЀ-ӿ/]+")

# ------------------------------ #
#   Function to tokenize spacy
# ------------------------------ #
def spacy_tokenize(sentence):
  return [ word.text.lower() for word in nlp.tokenizer(sentence) ]

def tokenize_column_spacy(column):
  tokenize = [spacy_tokenize(sentence) for sentence in column]
  
  return tokenize
  
# ------------------------------ #
#   Function to lemmatize spacy
# ------------------------------ #
def spacy_lemmatize(sentence: str):
  return [ word.lemma_.lower() for word in nlp.tokenizer(sentence) ]
  
# ------------------------------ #
#   Function to tokenize
# ------------------------------ #
def tokenize_column(column):
    #put in lowercase
    tokenizator = [nltk.word_tokenize(sentence) for sentence in column]
    #Lowercase the tokens
    return [ [ word.lower() for word in sentence ] for sentence in tokenizator ]


#--------------------------------------------#
#  Function to NES
#--------------------------------------------#
def NES(sentence: str, binary: bool):
    x = nltk.pos_tag(nltk.word_tokenize(sentence))
    res = nltk.ne_chunk(x, binary=binary)
    necs_and_words = set()
    for chunk in res:
        if hasattr(chunk, 'label'):
            # Add NE
            token = ' '.join(term[0] for term in chunk)
            necs_and_words.add(token)
        else:
            token = chunk[0]
            if token.isalnum():
                necs_and_words.add(token.lower())
    return necs_and_words

 #--------------------------------------------#
 # Function to get entities from a column
 # -------------------------------------------# 
def get_entities_new(column):
    entities = []
    for sentence in column:
        entities.append(NES(sentence, False))
    return entities

# ------------------------------ #
# Lemmatization text process
# ------------------------------ #
lemmatizer = WordNetLemmatizer()
# ------------------------------ #
#   Function to lemmatize
# ------------------------------ #
def lemmatize(tokenized_text: List[List[str]]):
  
  lemmas = []

  for sentence in tqdm(tokenized_text):
    sentence_lemmas = []
    for word in sentence:
      sentence_lemmas.append(lemmatizer.lemmatize(word.lower(), get_wordnet_pos(word.lower())))
    lemmas.append(sentence_lemmas)

  return lemmas

# ------------------------------ #
#   Stopwords initialization
# ------------------------------ #
stopwords_list = set(nltk.corpus.stopwords.words("english"))
stopwords_list = stopwords_list.union(set(string.punctuation))
stopwords_list = stopwords_list.union(set(['.', ',', ';', '."']))

# ------------------------------ #
#   Function to remove stopwords
# ------------------------------ #
def remove_stopwords(column: List[List[str]]):
  #Lowercase the tokens
  return [ [ word.lower() for word in sentence if word not in stopwords_list ]  for sentence in column ]


# ------------------------------ #
#   Function to synonimize
# ------------------------------ #
def synonimize_column(column):
  #put in lowercase
  tokenized = [nltk.word_tokenize(sentence) for sentence in column]
  #Lowercase the tokens
  tokenized = [ [ word.lower() for word in sentence ] for sentence in tokenized ]
  #Synonimize
  synonimized = [ [ word for word in sentence if word not in stopwords_list ] for sentence in tokenized ]

  return synonimized


# ------------------------------ #
#   Function to synset
# ------------------------------ #
def get_synset_column(tokenized_text: List[List[str]]):
  synset = []
  for sentence in tokenized_text:
    pos = nltk.pos_tag(sentence)
    lemmas = []
    for pair in pos:
      if pair[1][0] in tag_dict.keys():
        lemma = wnl.lemmatize(p[0].lower(), pos=p[1][0].lower())
        lemmas.append(lemma)
      else:
        lemma = pair[0]
        lemmas.append(lemma)
    synset.append(lemmas)
  
  return synset


# ------------------------------ #
#  Function to NE(Name entities)
# ------------------------------ #
def apply_ne(tokenized_text: List[str]):
    # tokenize the sentence and find the POS tag for each token
    sentences_ne = list(ne_chunk(pos_tag(tokenized_text), binary=True))
    result = []
    for el in sentences_ne:
        if isinstance(el, Tree):
            leaves = el.leaves()
            result.append(" ".join(word[0] for word in leaves))
        else:
            result.append(el[0])
    return result

# used apply_ne function to get NE from a column
def get_name_entities(column: List[List[str]]):
  ne = []
  for sentence in column:
    ne.append(apply_ne(sentence))
  return ne



# ------------------------------ #
#  Function to get ngrams
# ------------------------------ #the
def get_ngrams_column(column: List[List[str]], n: int):
  ngrams = []
  for sentence in column:
    ngrams.append(apply_ngram(sentence, n))
  return ngrams


def apply_ngram(sentence: List[str], n: int):
    if len(sentence) < n:
        return [tuple(sentence)]
    return list(nltk.ngrams(sentence, n))


# ------------------------------ #
#     Function to get lesk 
# ------------------------------ #
def get_lesk_column(column):
  lesk_text = []

  for sentence in column:
    synset = [lesk(sentence, word) for word in sentence]
    synset = {word for word in synset if word is not None}
    lesk_text.append(synset)

  return lesk_text


# -------------------------------------- #
#     Different similarities for synsets
# -------------------------------------- #
def get_synset(tokenized_text: str, synsets):
  key_list = []
  sentence_tagged = nltk.pos_tag(tokenized_text)
  for pair in sentence_tagged:
    wordnet_tag = get_wordnet_pos(pair[1])
    if wordnet_tag is not None:
      pair = (pair[0], wordnet_tag)
      synset = wordnet.synsets(pair[0], pair[1])
      if synset:
          synsets[pair[0]] = (synset[0], synset[0].pos())
          key_list.append(pair[0])
  return synsets, key_list


def get_synset_similarity(column1, column2, distance: str):
  
  all_similarities = []
  brown_ic = nltk.corpus.wordnet_ic.ic('ic-brown.dat')

  for sentence1, sentence2 in tqdm(zip(column1, column2), total=max(len(column1), len(column2))):
    synsets, keys1 = get_synset(sentence1, {})
    synsets, keys2 = get_synset(sentence2, synsets)
    
    similarities = []
    for word1 in keys1:
      for word2 in keys2:
        if synsets[word1][1] != synsets[word2][1]:
          continue
        similarity = None
        if distance == 'path':
          similarity = synsets[word1][0].path_similarity(synsets[word2][0])
        elif distance == 'lch':
          similarity = synsets[word1][0].lch_similarity(synsets[word2][0])
        elif distance == 'wup':
          similarity = synsets[word1][0].wup_similarity(synsets[word2][0])
        elif distance == 'lin':
          try:
            similarity = synsets[word1][0].lin_similarity(synsets[word2][0], brown_ic)
          except:
            similarity = 0
        similarities.append(similarity)
    if len(similarities) > 0:
      all_similarities.append(np.mean(similarities))
    else:
      all_similarities.append(0)
  return all_similarities
  
def apply_jaccard_lesk(sentence1: str, sentence2: str):

  # Apply lesk to sentence 1
  synset1 = [ lesk(sentence1, word) for word in sentence1 ]
  synset1 = { word for word in synset1 if word is not None }

  # Apply lesk to sentence 1
  synset2 = [ lesk(sentence2, word) for word in sentence2 ]
  synset2 = { word for word in synset2 if word is not None }

  # Calculate distance
  distance = jaccard_distance(synset1, synset2)

  return distance


def lemma_spacy(sentences):
  sentences = [special_chars_out(s) for s in sentences]
  token_lemmatize = [spacy_lemmatize(phrase) for phrase in sentences]
  return token_lemmatize


def special_chars_out(sentence: str):
  
  sentence = sentence.replace("'ve", " have")
  sentence = sentence.replace("n't", " not")
  sentence = sentence.replace("'ll", " will")  
  sentence = sentence.replace("'m", " am")  
  sentence = sentence.replace("'re", " are")
  
  sentence = re.sub(special_pattern, " ", sentence)  

  return sentence

In [None]:
# Functions of preprocessing
def read_data(text_datas: List[str], gs_datas: List[str]):
  all_df_text = []
  for text_data, gs_data in zip(text_datas, gs_datas):
    df_text = pd.read_csv(text_data, sep=r'\t', engine='python', header=None)
    df_text.columns = ["text1", "text2"]
    df_text['gs'] = pd.read_csv(gs_data, sep='\t', header=None)
    all_df_text.append(df_text.dropna())
  return pd.concat(all_df_text)

def get_dataset(path: str) -> pd.DataFrame:
  files = sorted(os.listdir(path))
  input_files = [ os.path.join(path, file) for file in files if 'input' in file ]
  gs_files = [ os.path.join(path, file) for file in files if 'gs' in file ]
  df = read_data(input_files, gs_files)
  return df

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Pre-processing

### Data information
- trial : includes the definition of the scores, a sample of 5 sentence pairs and the input and output formats. It is not needed, but it is useful for prototyping.

- train : training data from paraphrasing data sets, input and output formats.

- test : test data from paraphrasing data sets.

In [None]:
#train_path = '../final_project/train'
#trial_path = '../final_project/trial'
#test_path  = '../final_project/test-gold'

train_path = '/content/train'
trial_path = '/content/trial'
test_path  = '/content/test-gold'

# **Similarities**

In [None]:
def get_features(df: pd.DataFrame):

    #--------------------------------------------#
    # 0. NLTK Words features
    #--------------------------------------------#
    #print("NLTK Words features")
    
    #nltk_words_text1 = []
    #nltk_words_text2 = []

    #--------------------------------------------#
    # 1. Tokenize features
    #--------------------------------------------#    
    tokenized_text1 = tokenize_column(df['text1'])
    tokenized_text2 = tokenize_column(df['text2'])

    #--------------------------------------------#
    # 2. Lemmatize features
    #--------------------------------------------#
    lemmatize_text1 = lemmatize(tokenized_text1)
    lemmatize_text2 = lemmatize(tokenized_text2)


    #--------------------------------------------#
    # 3. Stopwords features
    #--------------------------------------------#   
    stopwords_text1 = remove_stopwords(lemmatize_text1)
    stopwords_text2 = remove_stopwords(lemmatize_text2)

    #--------------------------------------------#
    # 4. Synonims features
    #--------------------------------------------#
    synonyms_text1 = []
    synonyms_text2 = []
    # Use sysnstesizer to get synonyms
    for i in tqdm(range(len(tokenized_text1))):
        synonyms_text1.append([syn for w in tokenized_text1[i] for syn in wordnet.synsets(w)])
        synonyms_text2.append([syn for w in tokenized_text2[i] for syn in wordnet.synsets(w)])

    
    #--------------------------------------------#
    # 5. NES features
    #--------------------------------------------#
    NES_column_text1 = get_entities_new(df['text1'])
    NES_column_text2 = get_entities_new(df['text2'])

    
    #--------------------------------------------#
    # 6. Name entities features
    #--------------------------------------------#
    name_entities_text1 = get_name_entities(lemmatize_text1)
    name_entities_text2 = get_name_entities(lemmatize_text2)

    #--------------------------------------------#
    # 7. Ngrams features
    #--------------------------------------------#

    ngrams_column_2_text1 = get_ngrams_column(lemmatize_text1, 2)
    ngrams_column_2_text2 = get_ngrams_column(lemmatize_text2, 2)

    ngrams_column_3_text1 = get_ngrams_column(lemmatize_text1, 3)
    ngrams_column_3_text2 = get_ngrams_column(lemmatize_text2, 3)

    ngrams_column_4_text1 = get_ngrams_column(lemmatize_text1, 4)
    ngrams_column_4_text2 = get_ngrams_column(lemmatize_text2, 4)

    ngrams_column_5_text1 = get_ngrams_column(lemmatize_text1, 5)
    ngrams_column_5_text2 = get_ngrams_column(lemmatize_text2, 5)

    ngrams_column_6_text1 = get_ngrams_column(lemmatize_text1, 6)
    ngrams_column_6_text2 = get_ngrams_column(lemmatize_text2, 6)

    ngrams_column_7_text1 = get_ngrams_column(lemmatize_text1, 7)
    ngrams_column_7_text2 = get_ngrams_column(lemmatize_text2, 7)

    ngrams_column_8_text1 = get_ngrams_column(lemmatize_text1, 8)
    ngrams_column_8_text2 = get_ngrams_column(lemmatize_text2, 8)

    ngrams_column_9_text1 = get_ngrams_column(lemmatize_text1, 9)
    ngrams_column_9_text2 = get_ngrams_column(lemmatize_text2, 9)

    #--------------------------------------------#
    # 8. Lesk features
    #--------------------------------------------#
    # Lesk features
    lesk_text1 = get_lesk_column(tokenized_text1)
    lesk_text2 = get_lesk_column(tokenized_text2)

    # --------------------------------------------#
    # 9. Spacy words features
    # --------------------------------------------#
    print("Spacy words features")
    spacy_words_text1 = tokenize_column_spacy(df['text1'])
    spacy_words_text2 = tokenize_column_spacy(df['text2'])

    # --------------------------------------------#
    # 10. Spacy lemmatize features
    # --------------------------------------------#
    print("Spacy lemmatize features")
    spacy_lemmatize_text1 = lemma_spacy(df['text1'])
    spacy_lemmatize_text2 = lemma_spacy(df['text2'])

    #--------------------------------------------#
    # 11.Lemma synonyms features
    #--------------------------------------------#
    lemma_synonyms_text1 = []
    lemma_synonyms_text2 = []
    # Use sysnstesizer to get synonyms
    for i in tqdm(range(len(tokenized_text1))):
        lemma_synonyms_text1.append([syn for w in lemmatize_text1[i] for syn in wordnet.synsets(w)])
        lemma_synonyms_text2.append([syn for w in lemmatize_text2[i] for syn in wordnet.synsets(w)])

    #print("Word synonyms features"

    #--------------------------------------------#
    # 12. Synset features
    #--------------------------------------------#
    print("Synset features")
    synset_text1 = get_synset_column(tokenized_text1)
    synset_text2 = get_synset_column(tokenized_text2)

    #--------------------------------------------#
    # 13. Synset similarities
    #--------------------------------------------#
    print("Synset similarities")
    average_path = get_synset_similarity(tokenized_text1, tokenized_text2, "path")
    average_lch = get_synset_similarity(tokenized_text1, tokenized_text2, "lch")
    average_wup = get_synset_similarity(tokenized_text1, tokenized_text2, "wup")
    average_lin = get_synset_similarity(tokenized_text1, tokenized_text2, "lin")


    features = [
        # Jaccard similarity
        jaccard_similarity_list(tokenized_text1, tokenized_text2),
        jaccard_similarity_list(lemmatize_text1, lemmatize_text2),
        jaccard_similarity_list(stopwords_text1, stopwords_text2),
        jaccard_similarity_list(synonyms_text1, synonyms_text2),
        jaccard_similarity_list(NES_column_text1, NES_column_text2),
        jaccard_similarity_list(name_entities_text1, name_entities_text2),
        jaccard_similarity_list(ngrams_column_2_text1, ngrams_column_2_text2),
        jaccard_similarity_list(ngrams_column_3_text1, ngrams_column_3_text2),
        jaccard_similarity_list(ngrams_column_4_text1, ngrams_column_4_text2),
        jaccard_similarity_list(ngrams_column_5_text1, ngrams_column_5_text2),
        jaccard_similarity_list(ngrams_column_6_text1, ngrams_column_6_text2),
        jaccard_similarity_list(ngrams_column_7_text1, ngrams_column_7_text2),
        jaccard_similarity_list(ngrams_column_8_text1, ngrams_column_8_text2),
        jaccard_similarity_list(ngrams_column_9_text1, ngrams_column_9_text2),
        jaccard_similarity_list(lesk_text1, lesk_text2),
        jaccard_similarity_list(spacy_words_text1, spacy_words_text2),
        jaccard_similarity_list(spacy_lemmatize_text1, spacy_lemmatize_text2),


        # jaccard_similarity_list(nltk_words_text1, nltk_words_text2),
        # jaccard_similarity_list(spacy_words_text1, spacy_words_text2),
        # jaccard_similar
        jaccard_similarity_list(lemma_synonyms_text1,lemma_synonyms_text2),
        jaccard_similarity_list(synset_text1, synset_text2),
        #jaccard_similarity_list(synset_text1, synset_text2),
        
        
        # Dice similarity
        dice_similarity_list(tokenized_text1, tokenized_text2),
        dice_similarity_list(lemmatize_text1, lemmatize_text2),
        dice_similarity_list(stopwords_text1, stopwords_text2),
        dice_similarity_list(synonyms_text1, synonyms_text2),
        dice_similarity_list(NES_column_text1, NES_column_text2),
        dice_similarity_list(name_entities_text1, name_entities_text2),
        dice_similarity_list(ngrams_column_2_text1, ngrams_column_2_text2),
        dice_similarity_list(ngrams_column_3_text1, ngrams_column_3_text2),
        dice_similarity_list(ngrams_column_4_text1, ngrams_column_4_text2),
        dice_similarity_list(ngrams_column_5_text1, ngrams_column_5_text2),
        dice_similarity_list(ngrams_column_6_text1, ngrams_column_6_text2),
        dice_similarity_list(ngrams_column_7_text1, ngrams_column_7_text2),
        dice_similarity_list(ngrams_column_8_text1, ngrams_column_8_text2),
        dice_similarity_list(ngrams_column_9_text1, ngrams_column_9_text2),
        dice_similarity_list(lesk_text1, lesk_text2),
        dice_similarity_list(spacy_words_text1, spacy_words_text2),
        dice_similarity_list(spacy_lemmatize_text1, spacy_lemmatize_text2),

        #jaccard_similarity_list(nltk_words_text1, nltk_words_text2),
        #jaccard_similarity_list(spacy_words_text1, spacy_words_text2),
        #jaccard_similarity_
        dice_similarity_list(lemma_synonyms_text1,lemma_synonyms_text2),
        dice_similarity_list(synset_text1, synset_text2),
        #jaccard_similarity_list(synset_text1, synset_text2),

        # Synset similarities
        average_path,
        average_lch,
        average_wup,
        average_lin,
    ]
    return np.array(features)

# **Training**

## Get training dataset

In [None]:
train_dataset = get_dataset(train_path)
print(train_dataset.shape)
train_dataset.head()

(2234, 3)


Unnamed: 0,text1,text2,gs
0,But other sources close to the sale said Viven...,But other sources close to the sale said Viven...,4.0
1,Micron has declared its first quarterly profit...,Micron's numbers also marked the first quarter...,3.75
2,The fines are part of failed Republican effort...,"Perry said he backs the Senate's efforts, incl...",2.8
3,"The American Anglican Council, which represent...","The American Anglican Council, which represent...",3.4
4,The tech-loaded Nasdaq composite rose 20.96 po...,The technology-laced Nasdaq Composite Index <....,2.4


In [None]:
y_train = train_dataset['gs'].values
y_train.shape

(2234,)

## Get features of the training dataset

In [None]:
X_train_features: np.ndarray = get_features(train_dataset)
X_train_features.shape

100%|██████████| 2234/2234 [00:05<00:00, 445.46it/s]
100%|██████████| 2234/2234 [00:04<00:00, 454.15it/s]
100%|██████████| 2234/2234 [00:01<00:00, 1584.83it/s]


Spacy words features
Spacy lemmatize features


100%|██████████| 2234/2234 [00:01<00:00, 1543.19it/s]


Synset features
Synset similarities


100%|██████████| 2234/2234 [01:14<00:00, 29.97it/s]
100%|██████████| 2234/2234 [01:05<00:00, 33.88it/s]
100%|██████████| 2234/2234 [01:26<00:00, 25.83it/s]
100%|██████████| 2234/2234 [00:21<00:00, 104.90it/s]


(42, 2234)

In [None]:
X_train_features.shape

(42, 2234)

# **Testing**

## Get the test dataset

In [None]:
test_dataset = get_dataset(test_path)
print(test_dataset.shape)
test_dataset.head()

(2817, 3)


Unnamed: 0,text1,text2,gs
0,The problem likely will mean corrective change...,He said the problem needs to be corrected befo...,4.4
1,The technology-laced Nasdaq Composite Index .I...,The broad Standard & Poor's 500 Index .SPX inc...,0.8
2,"""It's a huge black eye,"" said publisher Arthur...","""It's a huge black eye,"" Arthur Sulzberger, th...",3.6
3,SEC Chairman William Donaldson said there is a...,"""I think there's a building confidence that th...",3.4
4,Vivendi shares closed 1.9 percent at 15.80 eur...,"In New York, Vivendi shares were 1.4 percent d...",1.4


## Get features of the test dataset

In [None]:
X_test_features: np.ndarray = get_features(test_dataset)
X_test_features.shape

100%|██████████| 2817/2817 [00:04<00:00, 672.11it/s]
100%|██████████| 2817/2817 [00:04<00:00, 664.35it/s]
100%|██████████| 2817/2817 [00:01<00:00, 2099.82it/s]


Spacy words features
Spacy lemmatize features


100%|██████████| 2817/2817 [00:01<00:00, 2389.59it/s]


Synset features
Synset similarities


100%|██████████| 2817/2817 [00:44<00:00, 63.59it/s]
100%|██████████| 2817/2817 [00:36<00:00, 77.03it/s]
100%|██████████| 2817/2817 [00:53<00:00, 53.09it/s]
100%|██████████| 2817/2817 [00:17<00:00, 159.79it/s]


(42, 2817)

In [None]:
y_test = test_dataset['gs'].values
y_test.shape

(2817,)

## Normalize all features

In [None]:
# Normalize the data
scaler = StandardScaler()
scaler.fit(X_train_features.T)
X_train_features_norm = scaler.transform(X_train_features.T)
X_test_features_norm = scaler.transform(X_test_features.T)

## Select the best features

In [None]:
best_features = [0, 1, 2, 3, 4, 38, 39, 40, 41]
X_train_features_norm = X_train_features_norm[:, best_features]
X_test_features_norm = X_test_features_norm[:, best_features]

## Train the model

In [None]:
# Print all shapes
print("X_train_features shape: ", X_train_features_norm.shape)
print("y_train shape: ", y_train.shape)
print("X_test_features shape: ", X_test_features_norm.shape)
print("y_test shape: ", y_test.shape)

X_train_features shape:  (2234, 9)
y_train shape:  (2234,)
X_test_features shape:  (2817, 9)
y_test shape:  (2817,)


### Train a simple regression model

In [None]:
# Train
reg = LinearRegression()
reg.fit(X_train_features_norm, y_train)

LinearRegression()

In [None]:
# Evaluate
y_pred_train = reg.predict(X_train_features_norm)
y_pred_test = reg.predict(X_test_features_norm)

print("Train pearson: ", pearsonr(y_train, y_pred_train)[0])
print("Test pearson: ", pearsonr(y_test, y_pred_test)[0])

Train pearson:  0.6893760521234433
Test pearson:  -0.021679359805346278


### Train multiple regression models

In [None]:
# Select all of the models that we are going to use
REGRESSORS = [ c for c in REGRESSORS if c[0] != 'QuantileRegressor' ]
print("Number of regressors:", len(REGRESSORS))

Number of regressors: 41


In [None]:
# Build pearson score function
def pearsonr_scorer(y_true, y_pred):
    assert len(y_true) == len(y_pred)
    score = pearsonr(y_true, y_pred)[0]
    return score

pearson_scorer = make_scorer(pearsonr_scorer)
pearson_scorer.__name__ = 'pearson_scorer'

In [None]:
# Fit all models
reg = LazyRegressor(predictions=True, regressors=REGRESSORS, custom_metric=pearsonr_scorer)
regresion_models, regresion_predictions = reg.fit(X_train_features_norm, X_test_features_norm, y_train, y_test)

'tuple' object has no attribute '__name__'
Invalid Regressor(s)


 90%|█████████ | 37/41 [00:16<00:01,  2.16it/s]



100%|██████████| 41/41 [00:17<00:00,  2.40it/s]


In [None]:
regresion_models.sort_values(by='pearsonr_scorer', ascending=False)

Unnamed: 0_level_0,Adjusted R-Squared,R-Squared,RMSE,Time Taken,pearsonr_scorer
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
PassiveAggressiveRegressor,-5.12,-5.1,2.89,0.08,0.02
HuberRegressor,-0.95,-0.94,1.63,0.05,-0.02
LinearRegression,-0.87,-0.86,1.6,0.03,-0.02
TransformedTargetRegressor,-0.87,-0.86,1.6,0.02,-0.02
Lars,-0.87,-0.86,1.6,0.03,-0.02
LarsCV,-0.87,-0.86,1.6,0.04,-0.02
LassoLarsCV,-0.87,-0.86,1.6,0.07,-0.02
Ridge,-0.87,-0.86,1.6,0.02,-0.02
KernelRidge,-9.82,-9.79,3.85,0.35,-0.02
LassoLarsIC,-0.86,-0.86,1.6,0.03,-0.02


In [None]:
# Train MLP model
mlp = MLPRegressor(hidden_layer_sizes=(200, 50), learning_rate='adaptive', early_stopping=True, max_iter=1000, verbose=True)
mlp.fit(X_train_features_norm, y_train)

y_pred_train = mlp.predict(X_train_features_norm)
print("Train pearson: ", pearsonr(y_train, y_pred_train)[0])

y_pred_test = mlp.predict(X_test_features_norm)
print("Test pearson: ", pearsonr(y_test, y_pred_test)[0])


Iteration 1, loss = 4.71979734
Validation score: -1.505735
Iteration 2, loss = 2.14367023
Validation score: -0.839765
Iteration 3, loss = 1.49076401
Validation score: -0.297101
Iteration 4, loss = 1.14684008
Validation score: -0.015613
Iteration 5, loss = 0.95295384
Validation score: 0.115487
Iteration 6, loss = 0.82427153
Validation score: 0.226772
Iteration 7, loss = 0.70840598
Validation score: 0.304941
Iteration 8, loss = 0.63150698
Validation score: 0.403209
Iteration 9, loss = 0.57486295
Validation score: 0.457073
Iteration 10, loss = 0.54702394
Validation score: 0.492627
Iteration 11, loss = 0.52554875
Validation score: 0.522288
Iteration 12, loss = 0.50982342
Validation score: 0.536281
Iteration 13, loss = 0.50270128
Validation score: 0.546131
Iteration 14, loss = 0.49512578
Validation score: 0.555069
Iteration 15, loss = 0.48434199
Validation score: 0.558638
Iteration 16, loss = 0.47785536
Validation score: 0.564118
Iteration 17, loss = 0.47487023
Validation score: 0.567721
It

# Siames network

In [None]:
import spacy

# Load the `en_core_web_md` model, which includes pre-trained word embeddings.
nlp = spacy.load('en_core_web_md')

# Define the sub-network architecture.
def create_subnetwork():
  model = Sequential()
  model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_len))
  model.add(Flatten())
  model.add(Dense(128, activation='relu'))
  return model

# Define the inputs and outputs of the Siamese network.
input_a = Input(shape=(max_len,))
input_b = Input(shape=(max_len,))

subnetwork = create_subnetwork()
output_a = subnetwork(input_a)
output_b = subnetwork(input_b)

similarity = dot([output_a, output_b], axes=-1, normalize=True)
siamese_model = Model([input_a, input_b], similarity)

# Compile the model and specify the loss function and optimizer to use.
siamese_model.compile(loss='binary_crossentropy', optimizer='adam')

# To generate word embeddings using `spaCy`, you can iterate over the tokens in your text data and use the `.vector` attribute of each token.
sentences_a = []
sentences_b = []
labels = []

for a, b, label in zip(sentences_a, sentences_b, labels):
  # Tokenize the sentences and create a list of embeddings for each one.
  embeddings_a = [token.vector for token in nlp(a)]
  embeddings_b = [token.vector for token in nlp(b)]
  
  # Pad the sequences to the same length.
  padded_a = pad_sequences([embeddings_a], maxlen=max_len, padding='post')
  padded_b = pad_sequences([embeddings_b], maxlen=max_len, padding='post')
  
  # Add the padded sequences and labels to the list.
  sentences_a.append(padded_a)
  sentences_b.append(padded_b)
  labels.append(label)

# Train the model on the prepared dataset.
siamese_model.fit([sentences_a, sentences_b], labels, batch_size=32, epochs=10)

# To evaluate the model, you can pass it a pair of sentences and use the output to measure their similarity.
similarity = siamese_model.predict([sentence_a, sentence_b])


A Siamese network is a type of neural network architecture that is used for learning similarity between two input objects. It consists of two or more identical sub-networks, which share the same weights and architecture. The sub-networks are trained to process the input objects and generate feature vectors, which are then compared to measure the similarity between the input objects.

One common application of Siamese networks is in natural language processing tasks, where they can be used to measure the similarity between sentences. To do this, the input to each sub-network would be a sentence, and the output would be a feature vector representing the sentence. The feature vectors can then be compared using a distance measure, such as cosine similarity, to determine the similarity between the sentences.

There are a number of different approaches to training a Siamese network for sentence similarity. One approach is to use a dataset of pairs of sentences, where each pair is labeled as either similar or not similar. The network can be trained to classify the pairs into these two categories using a binary cross-entropy loss function. Alternatively, the network can be trained to directly predict the similarity between the pairs using a regression loss function, such as mean squared error.

It's also possible to use a Siamese network in conjunction with a pre-trained language model, such as BERT, to improve the quality of the feature vectors and increase the accuracy of the similarity measurement. This can be done by fine-tuning the language model on a dataset of sentence pairs, and then using the trained language model as one of the sub-networks in the Siamese network.