# Semantic textual similarity
## Final Project IHLT - UPC 2022/2023
### Authors : Rob - Edison Bejarano

1. Data

2. What we are doing?
#### Techniques for preprocessing text for similarity comparison

- Stemming: is a process that involves reducing words to their base form, or stem, in order to normalize the text and remove variations in word endings. For example, the words "running," "runs," and "ran" would all be reduced to the stem "run" by a stemming algorithm.


- Lemmatization: is a process that involves reducing words to their base form, or lemma, in order to normalize the text and remove variations in word endings. Unlike stemming, lemmatization takes into account the context of the word in order to determine its lemma, resulting in more accurate and meaningful reductions. For example, the words "running," "runs," and "ran" would all be reduced to the lemma "run" by a lemmatization algorithm.

- Tf-idf weighting: Is a method for assigning a weight to each word in a document based on its relative importance. The weight is calculated by multiplying the term frequency (tf) of the word by the inverse document frequency (idf) of the word across all documents in a corpus. This weighting scheme gives higher weight to words that are more frequent within a document but less frequent across the corpus, making them more important for characterizing the document.

- NES : Function used the Natural Language Toolkit (nltk) to identify named entities in a given sentence. The sentence parameter is the sentence in which named entities should be identified, and the binary parameter determines whether named entities should be grouped together or returned as individual tokens. The function returns a set of the named entities and individual words found in the sentence.


These techniques can be used in combination with each other or with stopwords removal to preprocess text and improve the accuracy of similarity comparison. For example, you could use stemming or lemmatization to normalize the words in the phrases, and then use tf-idf weighting to assign importance to each word based on its frequency within the phrases and across a larger corpus. This would allow you to compare the similarity of the phrases in a more meaningful and accurate way


3. Results

## Install packages

In [None]:
%pip install -q spacy nltk numpy pandas scikit-learn pyjarowinkler lazypredict

[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.2.2[0m[39;49m -> [0m[32;49m22.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


# Libraries

In [74]:
import os
import re
import nltk
import spacy
import string
import numpy as np
import pandas as pd

from tqdm import tqdm
from itertools import chain
from functools import partial
from argparse import Namespace
from pyjarowinkler import distance
from collections.abc import Iterable
from nltk.wsd import lesk
from nltk.stem import WordNetLemmatizer
from nltk.metrics import jaccard_distance
from nltk.corpus import stopwords, wordnet
from nltk import pos_tag, ne_chunk, Tree

from scipy.stats import pearsonr

from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import make_scorer
from typing import List
from lazypredict.Supervised import REGRESSORS, LazyRegressor

nltk.download('wordnet')
nltk.download('maxent_ne_chunker')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')
nltk.download('gutenberg')
nltk.download('conll2000')
nltk.download('brown')
nltk.download('words')

[nltk_data] Downloading package wordnet to /home/rob/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/rob/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package punkt to /home/rob/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/rob/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/rob/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package omw-1.4 to /home/rob/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package gutenberg to /home/rob/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package conll2000 to /home/rob

True

## Download data

In [None]:
#!curl -o /content/drive/MyDrive/Colab_Notebooks/2.IHLT/final_project/trial.tgz https://gebakx.github.io/ihlt/sts/resources/trial.tgz
#!curl -o /content/drive/MyDrive/Colab_Notebooks/2.IHLT/final_project/train.tgz https://gebakx.github.io/ihlt/sts/resources/train.tgz
#!curl -o /content/drive/MyDrive/Colab_Notebooks/2.IHLT/final_project/test-gold.tgz https://gebakx.github.io/ihlt/sts/resources/test-gold.tgz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2003  100  2003    0     0  47690      0 --:--:-- --:--:-- --:--:-- 47690
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  122k  100  122k    0     0   505k      0 --:--:-- --:--:-- --:--:--  503k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  115k  100  115k    0     0   345k      0 --:--:-- --:--:-- --:--:--  345k


# Bring data

In [15]:
!tar zxvf ../final_project/train.tgz
!tar zxvf ../final_project/trial.tgz
!tar zxvf ../final_project/test-gold.tgz

!rm ../final_project/train.tgz
!rm ../final_project/test-gold.tgz 
!rm ../final_project/trial.tgz

train/
train/00-readme.txt
train/STS.output.MSRpar.txt
train/STS.input.SMTeuroparl.txt
train/STS.input.MSRpar.txt
train/STS.gs.MSRpar.txt
train/STS.input.MSRvid.txt
train/STS.gs.MSRvid.txt
train/correlation.pl
train/STS.gs.SMTeuroparl.txt
trial/
trial/STS.input.txt
trial/00-readme.txt
trial/STS.gs.txt
trial/STS.ouput.txt
test-gold/
test-gold/STS.input.MSRpar.txt
test-gold/STS.gs.MSRpar.txt
test-gold/STS.input.MSRvid.txt
test-gold/STS.gs.MSRvid.txt
test-gold/STS.input.SMTeuroparl.txt
test-gold/STS.gs.SMTeuroparl.txt
test-gold/STS.input.surprise.SMTnews.txt
test-gold/STS.gs.surprise.SMTnews.txt
test-gold/STS.input.surprise.OnWN.txt
test-gold/STS.gs.surprise.OnWN.txt
test-gold/STS.gs.ALL.txt
test-gold/00-readme.txt


# Usesful functions

In [92]:
# ------------------------------ #
# Jaccard similarity Function
# ------------------------------ #
def jaccard_similarity(s1: List[str], s2: List[str]):
    s1 = set(s1)
    s2 = set(s2)
    intersection = len(s1.intersection(s2))
    union = len(s1) + len(s2) - intersection
    return float(intersection) / float(union)

# ------------------------------ #
# Jaccard Similarity List
# ------------------------------ #
def jaccard_similarity_list(s1: List[List[str]], s2: List[List[str]]):
    sims = []
    for l1, l2 in zip(s1, s2):
        sim = jaccard_similarity(l1, l2)
        sims.append(sim)
    return np.array(sims)


def dice_similarity(s1, s2):
    assert isinstance(s1, Iterable), f"s1 must be an iterable, not {type(s1)}"
    assert isinstance(s2, Iterable), f"s2 must be an iterable, not {type(s2)}"
    s1 = set(s1)
    s2 = set(s2)
    intersection = s1.intersection(s2)
    return 2 * len(intersection) / (len(s1) + len(s2))

def dice_similarity_list(s1: List[List[str]], s2: List[List[str]]):
    sims = []
    for l1, l2 in zip(s1, s2):
        sim = dice_similarity(l1, l2)
        sims.append(sim)
    return np.array(sims)

# ------------------------------ #
# Jarowinkler Similarity
# ------------------------------ #   
def calculateJarowinklerSimilarity(dataframe, column1, column2):

    aux = []
    for row in dataframe.itertuples():
            
        # Longest one selected
        if len(row[column1]) >= len(row[column2]):
            sentence1 = row[column1]
            sentence2 = row[column2]
        else:
            sentence1 = row[column2]
            sentence2 = row[column1]

        similarities_array = []
        for word1 in sentence1:
            max = 0

        for word2 in sentence2:
            similarity = distance.get_jaro_distance(str(word1), str(word2), winkler=True, scaling=0.1)
            
            if max < similarity:
                max = similarity
            
        similarities_array.append(max)

        aux.append(np.array(similarities_array).mean())

    return aux

In [45]:
# ------------------------------ #
#         Get Wordnet POS
# ------------------------------ #
def get_wordnet_pos(word):
  """Map POS tag to first character lemmatize() accepts"""
  tag = nltk.pos_tag([word])[0][1][0].upper()
  tag_dict = {
        "NN": "n",
        "NNS": "n",
        "NNP": "n",
        "NNPS": "n",
        "VB": "v",
        "VBD": "v",
        "VBG": "v",
        "VBN": "v",
        "VBP": "v",
        "VBZ": "v",
        "RB": "r",
        "RBR": "r",
        "RBS": "r",
        "JJ": "a",
        "JJR": "a",
        "JJS": "a",
  }
        
  return tag_dict.get(tag, wordnet.NOUN)


# ------------------------------ #
#   Function to tokenize
# ------------------------------ #
def tokenize_column(column):
    #put in lowercase
    tokenized = [nltk.word_tokenize(sentence) for sentence in column]
    #Lowercase the tokens
    tokenized = [ [ word.lower() for word in sentence ] for sentence in tokenized ]
    return tokenized


#--------------------------------------------#
#  Function to NES
#--------------------------------------------#
def NES(sentence: str, binary: bool):
    x = nltk.pos_tag(nltk.word_tokenize(sentence))
    res = nltk.ne_chunk(x, binary=binary)
    necs_and_words = set()
    for chunk in res:
        if hasattr(chunk, 'label'):
            # Add NE
            token = ' '.join(term[0] for term in chunk)
            necs_and_words.add(token)
        else:
            token = chunk[0]
            if token.isalnum():
                necs_and_words.add(token.lower())
    return necs_and_words

 #--------------------------------------------#
 # Function to get entities from a column
 # -------------------------------------------# 
def get_entities_new(column):
    entities = []
    for sentence in column:
        entities.append(NES(sentence, False))
    return entities



# ------------------------------ #
# Lemmatization text process
# ------------------------------ #
lemmatizer = WordNetLemmatizer()
# ------------------------------ #
#   Function to lemmatize
# ------------------------------ #
def lemmatize(column):
  
  lemmas = []

  for sentence in tqdm(column):
    sentence_lemmas = []
    for word in nltk.word_tokenize(sentence):
      sentence_lemmas.append(lemmatizer.lemmatize(word.lower(), get_wordnet_pos(word.lower())))
    lemmas.append(sentence_lemmas)

  return lemmas


# ------------------------------ #
#   Stopwords initialization
# ------------------------------ #
stopwords_list = nltk.corpus.stopwords.words("english")
stopwords_list[:10]
stopwords_list += string.punctuation
stopwords_list += ['.', ',', ';', '."']

# ------------------------------ #
#   Function to remove stopwords
# ------------------------------ #
def remove_stopwords(column):
  tokenized = [nltk.word_tokenize(sentence) for sentence in column]
  #Lowercase the tokens
  tokenized = [ [ word.lower() for word in sentence ] for sentence in tokenized ]
  return [ [ word for word in sentence if word not in stopwords_list ] for sentence in tokenized ]


# ------------------------------ #
#   Function to synonimize
# ------------------------------ #
def synonimize_column(column):
  #put in lowercase
  tokenized = [nltk.word_tokenize(sentence) for sentence in column]
  #Lowercase the tokens
  tokenized = [ [ word.lower() for word in sentence ] for sentence in tokenized ]
  #Synonimize
  synonimized = [ [ word for word in sentence if word not in stopwords ] for sentence in tokenized ]

  return synonimized


# ------------------------------ #
#   Function to synset
# ------------------------------ #
def synset_column(column):

  #Lowercase the tokens
  tokenized = [ [ word.lower() for word in sentence ] for sentence in tokenized ]

  #add synonimized
  synonimized = [ [ word for word in sentence if word not in stopwords ] for sentence in tokenized ]
  #Synset
  synset = [ [ wordnet.synsets(word)[0] for word in tqdm(sentence)] for sentence in tqdm(synonimized)]

  return synset


# ------------------------------ #
#  Function to NE(Name entities)
# ------------------------------ #
def apply_ne(sentence: str):
    # tokenize the sentence and find the POS tag for each token
    sentence = nltk.word_tokenize(sentence)
    
    sentences_ne = list(ne_chunk(pos_tag(sentence), binary=True))
    result = []
    for el in sentences_ne:
        if isinstance(el, Tree):
            leaves = el.leaves()
            result.append(" ".join(word[0] for word in leaves))
        else:
            result.append(el[0])
    return result

# used apply_ne function to get NE from a column
def get_name_entities(column):
  ne = []
  for sentence in column:
    ne.append(apply_ne(sentence))
  return ne



# ------------------------------ #
#  Function to get ngrams
# ------------------------------ #
def get_ngrams_column(column, n):
  ngrams = []
  for sentence in column:
    ngrams.append(apply_ngram(sentence, n))
  return ngrams


def apply_ngram(sentence: list, n: int):
    if len(sentence) < n:
        return [tuple(sentence)]
    return list(nltk.ngrams(sentence, n))


# ------------------------------ #
#     Function to get lesk 
# ------------------------------ #
def get_lesk_column(column):
  lesk_text = []

  for sentence in column:
    synset = [lesk(sentence, word) for word in sentence]
    synset = {word for word in synset if word is not None}
    lesk_text.append(synset)

  return lesk_text


  
def apply_jaccard_lesk(sentence1: str, sentence2: str):

  # Apply lesk to sentence 1
  synset1 = [ lesk(sentence1, word) for word in sentence1 ]
  synset1 = { word for word in synset1 if word is not None }

  # Apply lesk to sentence 1
  synset2 = [ lesk(sentence2, word) for word in sentence2 ]
  synset2 = { word for word in synset2 if word is not None }

  # Calculate distance
  distance = jaccard_distance(synset1, synset2)

  return distance

In [25]:
# aply the function apply_ne to a phrase
phrase = "I am a student of the University of Granada and that is in that city, that is in Spain, The artificial beach named angelica is going to be super cool."
#tokenize the phrase
ne = apply_ne(nltk.word_tokenize(phrase))
print(ne)

['I', 'am', 'a', 'student', 'of', 'the', 'University', 'of', 'Granada', 'and', 'that', 'is', 'in', 'that', 'city', ',', 'that', 'is', 'in', 'Spain', ',', 'The', 'artificial', 'beach', 'named', 'angelica', 'is', 'going', 'to', 'be', 'super', 'cool', '.']


In [5]:
# Functions of preprocessing
def read_data(text_datas: List[str], gs_datas: List[str]):
  all_df_text = []
  for text_data, gs_data in zip(text_datas, gs_datas):
    df_text = pd.read_csv(text_data, sep=r'\t', engine='python', header=None)
    df_text.columns = ["text1", "text2"]
    df_text['gs'] = pd.read_csv(gs_data, sep='\t', header=None)
    all_df_text.append(df_text.dropna())
  return pd.concat(all_df_text)

def get_dataset(path: str) -> pd.DataFrame:
  files = sorted(os.listdir(path))
  input_files = [ os.path.join(path, file) for file in files if 'input' in file ]
  gs_files = [ os.path.join(path, file) for file in files if 'gs' in file ]
  df = read_data(input_files, gs_files)
  return df

# Pre-processing

### Data information
- trial : includes the definition of the scores, a sample of 5 sentence pairs and the input and output formats. It is not needed, but it is useful for prototyping.

- train : training data from paraphrasing data sets, input and output formats.

- test : test data from paraphrasing data sets.

In [47]:
train_path = '../final_project/train'
trial_path = '../final_project/trial'
test_path  = '../final_project/test-gold'

# **Similarities**

In [7]:

train_dataset_pruebas = get_dataset(train_path)
test_dataset_pruebas = get_dataset(test_path)
df = train_dataset_pruebas


In [8]:
# Tokenization features
tokenized_text1 = tokenize_column(df['text1'])
tokenized_text2 = tokenize_column(df['text2'])

# Lemmatization features
lemmatize_text1 = lemmatize(df['text1'])
lemmatize_text2 = lemmatize(df['text2'])


#Use stopwords function to remove stopwords
stopwords_text1 = remove_stopwords(df['text1'])
stopwords_text2 = remove_stopwords(df['text2'])



# Synonyms features
synonyms_text1 = []
synonyms_text2 = []
# Use sysnstesizer to get synonyms
for i in tqdm(range(len(tokenized_text1))):
    synonyms_text1.append([syn for w in tokenized_text1[i] for syn in wordnet.synsets(w)])
    synonyms_text2.append([syn for w in tokenized_text2[i] for syn in wordnet.synsets(w)])


# Synonyms features another way
synonimized_text1_new = synonimize_column(df['text1'])
synonimized_text2_new = synonimize_column(df['text2'])


# NES features
NES_column_text1 = get_entities_new(df['text1'])
NES_column_text2 = get_entities_new(df['text2'])

# Name entities features
name_entities_text1 = get_name_entities(df['text1'])
name_entities_text2 = get_name_entities(df['text2'])

ngrams_column_2_text1 = get_ngrams_column(tokenized_text1, 2)
ngrams_column_2_text2 = get_ngrams_column(tokenized_text2, 2)

ngrams_column_3_text1 = get_ngrams_column(tokenized_text1, 3)
ngrams_column_3_text2 = get_ngrams_column(tokenized_text2, 3)

# Lesk features
lesk_text1 = get_lesk_column(tokenized_text1)
lesk_text2 = get_lesk_column(tokenized_text2)



# Synset features
#synset_text1 = synset_column(df['text1'])
#synset_text2 = synset_column(df['text2'])

100%|██████████| 2234/2234 [00:03<00:00, 570.17it/s] 
100%|██████████| 2234/2234 [00:02<00:00, 774.27it/s] 
100%|██████████| 2234/2234 [00:01<00:00, 1356.67it/s]


In [93]:
dice_similarity_list(tokenized_text1, tokenized_text2)

array([0.70833333, 0.59259259, 0.51612903, ..., 1.        , 0.72340426,
       0.61538462])

In [120]:
# Jaccard similarity features
jaccard_similarity_tokenized = jaccard_similarity_list(tokenized_text1, tokenized_text2)
jaccard_similarity_synonyms_new = jaccard_similarity_list(synonimized_text1_new, synonimized_text2_new)
jaccard_similarity_NES = jaccard_similarity_list(NES_column_text1, NES_column_text2)
jaccard_similarity_lemmatize = jaccard_similarity_list(lemmatize_text1, lemmatize_text2)
jaccard_similarity_stopwords = jaccard_similarity_list(stopwords_text1, stopwords_text2)
jaccard_similarity_synonyms = jaccard_similarity_list(synonyms_text1, synonyms_text2)
jaccard_similarity_name_entities = jaccard_similarity_list(name_entities_text1, name_entities_text2)
jaccard_similarity_ngrams_2 = jaccard_similarity_list(ngrams_column_2_text1, ngrams_column_2_text2)
jaccard_similarity_ngrams_3 = jaccard_similarity_list(ngrams_column_3_text1, ngrams_column_3_text2)
jaccard_similarity_lesk = jaccard_similarity_list(lesk_text1, lesk_text2)

print("Jaccard similarity tokenized: ", jaccard_similarity_tokenized[:3])
print("Jaccard similarity lemmatize: ", jaccard_similarity_lemmatize[:3])
print("Jaccard similarity stopwords: ", jaccard_similarity_stopwords[:3])
print("Jaccard similarity synonyms: ", jaccard_similarity_synonyms[:3])
print("Jaccard similarity synonyms new: ", jaccard_similarity_synonyms_new[:3])
print("Jaccard similarity name entities: ", jaccard_similarity_name_entities[:3])
print("Jaccard similarity ngrams 2: ", jaccard_similarity_ngrams_2[:3])
print("Jaccard similarity ngrams 3: ", jaccard_similarity_ngrams_3[:3])
print("Jaccard similarity lesk: ", jaccard_similarity_lesk[:3])

Jaccard similarity tokenized:  [0.5483871  0.42105263 0.34782609]
Jaccard similarity lemmatize:  [0.5483871  0.42105263 0.34782609]
Jaccard similarity stopwords:  [0.47368421 0.46153846 0.33333333]
Jaccard similarity synonyms:  [0.67680608 0.30172414 0.38562092]
Jaccard similarity synonyms new:  [0.47368421 0.46153846 0.33333333]
Jaccard similarity name entities:  [0.5483871  0.42105263 0.33333333]
Jaccard similarity ngrams 2:  [0.37837838 0.13043478 0.20689655]
Jaccard similarity ngrams 3:  [0.32432432 0.04347826 0.1       ]
Jaccard similarity lesk:  [0.35714286 0.3125     0.22222222]


In [97]:
def get_features(df: pd.DataFrame):

    #--------------------------------------------#
    # 0. NLTK Words features
    #--------------------------------------------#
    #print("NLTK Words features")
    
    #nltk_words_text1 = []
    #nltk_words_text2 = []

    #--------------------------------------------#
    # 1. Tokenize features
    #--------------------------------------------#    
    tokenized_text1 = tokenize_column(df['text1'])
    tokenized_text2 = tokenize_column(df['text2'])

    #--------------------------------------------#
    # 2. Lemmatize features
    #--------------------------------------------#
    lemmatize_text1 = lemmatize(df['text1'])
    lemmatize_text2 = lemmatize(df['text2'])


    #--------------------------------------------#
    # 3. Stopwords features
    #--------------------------------------------#   
    stopwords_text1 = remove_stopwords(df['text1'])
    stopwords_text2 = remove_stopwords(df['text2'])



    #--------------------------------------------#
    # 4. Synonims features
    #--------------------------------------------#
    synonyms_text1 = []
    synonyms_text2 = []
    # Use sysnstesizer to get synonyms
    for i in tqdm(range(len(tokenized_text1))):
        synonyms_text1.append([syn for w in tokenized_text1[i] for syn in wordnet.synsets(w)])
        synonyms_text2.append([syn for w in tokenized_text2[i] for syn in wordnet.synsets(w)])

    
    #--------------------------------------------#
    # 5. NES features
    #--------------------------------------------#
    NES_column_text1 = get_entities_new(df['text1'])
    NES_column_text2 = get_entities_new(df['text2'])

    
    #--------------------------------------------#
    # 6. Name entities features
    #--------------------------------------------#
    name_entities_text1 = get_name_entities(df['text1'])
    name_entities_text2 = get_name_entities(df['text2'])

    #--------------------------------------------#
    # 7. Biagrams features
    #--------------------------------------------#
    ngrams_column_2_text1 = get_ngrams_column(tokenized_text1, 2)
    ngrams_column_2_text2 = get_ngrams_column(tokenized_text2, 2)

    #--------------------------------------------#
    # 8. Triagrams features
    #--------------------------------------------#
    ngrams_column_3_text1 = get_ngrams_column(tokenized_text1, 3)
    ngrams_column_3_text2 = get_ngrams_column(tokenized_text2, 3)

    #--------------------------------------------#
    # 9. Lesk features
    #--------------------------------------------#
    # Lesk features
    lesk_text1 = get_lesk_column(tokenized_text1)
    lesk_text2 = get_lesk_column(tokenized_text2)



    

    #--------------------------------------------#
    # 5. Synset features
    #--------------------------------------------#
    #print("Synset features")
    #synset_text1 = [wordnet.synsets(phrase)[0] for phrase in tqdm(lemmatize_text1)]
    #synset_text2 = [wordnet.synsets(phrase)[0] for phrase in tqdm(lemmatize_text2)]


    #--------------------------------------------#
    # 6. Spacy words features
    #--------------------------------------------#
    #print("Spacy words features")
    #spacy_words_text1 = []
    #spacy_words_text2 = []


    #--------------------------------------------#
    # 7. Ngrams features
    #--------------------------------------------#
    #print("Ngrams features")
    #ngrams_text1 = []

    #--------------------------------------------#
    # 8.Word synonyms features
    #--------------------------------------------#
    #print("Word synonyms features")


    features = [
        # Jaccard similarity
        jaccard_similarity_list(tokenized_text1, tokenized_text2),
        jaccard_similarity_list(lemmatize_text1, lemmatize_text2),
        jaccard_similarity_list(stopwords_text1, stopwords_text2),
        jaccard_similarity_list(synonyms_text1, synonyms_text2),
        jaccard_similarity_list(NES_column_text1, NES_column_text2),
        jaccard_similarity_list(name_entities_text1, name_entities_text2),
        jaccard_similarity_list(ngrams_column_2_text1, ngrams_column_2_text2),
        jaccard_similarity_list(ngrams_column_3_text1, ngrams_column_3_text2),
        jaccard_similarity_list(lesk_text1, lesk_text2),
        #jaccard_similarity_list(nltk_words_text1, nltk_words_text2),
        #jaccard_similarity_list(spacy_words_text1, spacy_words_text2),
        #jaccard_similarity_list(ngrams_text1, ngrams_text2),
        #jaccard_similarity_list(synset_text1, synset_text2),
        
        
        # Dice similarity
        dice_similarity_list(tokenized_text1, tokenized_text2),
        dice_similarity_list(lemmatize_text1, lemmatize_text2),
        dice_similarity_list(stopwords_text1, stopwords_text2),
        dice_similarity_list(synonyms_text1, synonyms_text2),
        dice_similarity_list(NES_column_text1, NES_column_text2),
        dice_similarity_list(name_entities_text1, name_entities_text2),
        dice_similarity_list(ngrams_column_2_text1, ngrams_column_2_text2),
        dice_similarity_list(ngrams_column_3_text1, ngrams_column_3_text2),
        dice_similarity_list(lesk_text1, lesk_text2),
        #jaccard_similarity_list(nltk_words_text1, nltk_words_text2),
        #jaccard_similarity_list(spacy_words_text1, spacy_words_text2),
        #jaccard_similarity_list(ngrams_text1, ngrams_text2),
        #jaccard_similarity_list(synset_text1, synset_text2),

    ]
    return np.array(features)

# **Training**

## Get training dataset

In [99]:
train_dataset = get_dataset(train_path)
print(train_dataset.shape)
train_dataset.head()

(2234, 3)


Unnamed: 0,text1,text2,gs
0,But other sources close to the sale said Viven...,But other sources close to the sale said Viven...,4.0
1,Micron has declared its first quarterly profit...,Micron's numbers also marked the first quarter...,3.75
2,The fines are part of failed Republican effort...,"Perry said he backs the Senate's efforts, incl...",2.8
3,"The American Anglican Council, which represent...","The American Anglican Council, which represent...",3.4
4,The tech-loaded Nasdaq composite rose 20.96 po...,The technology-laced Nasdaq Composite Index <....,2.4


In [101]:
y_train = train_dataset['gs'].values
y_train.shape

(2234,)

## Get features of the training dataset

In [102]:
X_train_features: np.ndarray = get_features(train_dataset)
X_train_features.shape

100%|██████████| 2234/2234 [00:03<00:00, 743.90it/s] 
100%|██████████| 2234/2234 [00:02<00:00, 756.30it/s] 
100%|██████████| 2234/2234 [00:00<00:00, 3202.38it/s]


(18, 2234)

In [103]:
X_train_features.shape

(18, 2234)

# **Testing**

## Get the test dataset

In [104]:
test_dataset = get_dataset(test_path)
print(test_dataset.shape)
test_dataset.head()

(2817, 3)


Unnamed: 0,text1,text2,gs
0,The problem likely will mean corrective change...,He said the problem needs to be corrected befo...,4.4
1,The technology-laced Nasdaq Composite Index .I...,The broad Standard & Poor's 500 Index .SPX inc...,0.8
2,"""It's a huge black eye,"" said publisher Arthur...","""It's a huge black eye,"" Arthur Sulzberger, th...",3.6
3,SEC Chairman William Donaldson said there is a...,"""I think there's a building confidence that th...",3.4
4,Vivendi shares closed 1.9 percent at 15.80 eur...,"In New York, Vivendi shares were 1.4 percent d...",1.4


## Get features of the test dataset

In [106]:
X_test_features: np.ndarray = get_features(test_dataset)
X_test_features.shape

100%|██████████| 2817/2817 [00:02<00:00, 1105.02it/s]
100%|██████████| 2817/2817 [00:02<00:00, 1075.14it/s]
100%|██████████| 2817/2817 [00:00<00:00, 5078.05it/s]


(18, 2817)

In [107]:
y_test = test_dataset['gs'].values
y_test.shape

(2817,)

## Normalize all features

In [108]:
# Normalize the data
scaler = StandardScaler()
scaler.fit(X_train_features.T)
X_train_features_norm = scaler.transform(X_train_features.T)
X_test_features_norm = scaler.transform(X_test_features.T)

## Train the model

In [109]:
# Print all shapes
print("X_train_features shape: ", X_train_features_norm.shape)
print("y_train shape: ", y_train.shape)
print("X_test_features shape: ", X_test_features_norm.shape)
print("y_test shape: ", y_test.shape)

X_train_features shape:  (2234, 18)
y_train shape:  (2234,)
X_test_features shape:  (2817, 18)
y_test shape:  (2817,)


### Train a simple regression model

In [119]:
# Train
reg = LinearRegression()
reg.fit(X_train_features.T, y_train)

In [117]:
# Evaluate
y_pred_train = reg.predict(X_train_features)
y_pred_test = reg.predict(X_test_features)

print("Train pearson: ", pearsonr(y_train, y_pred_train)[0])
print("Test pearson: ", pearsonr(y_test, y_pred_test)[0])

Train pearson:  0.7110311945994099
Test pearson:  0.02565926912069541


### Train multiple regression models

In [112]:
# Select all of the models that we are going to use
REGRESSORS = [ c for c in REGRESSORS if c[0] != 'QuantileRegressor' ]
print("Number of regressors:", len(REGRESSORS))

Number of regressors: 41


In [113]:
# Build pearson score function
def pearsonr_scorer(y_true, y_pred):
    assert len(y_true) == len(y_pred)
    score = pearsonr(y_true, y_pred)[0]
    return score

pearson_scorer = make_scorer(pearsonr_scorer)
pearson_scorer.__name__ = 'pearson_scorer'

In [114]:
# Fit all models
reg = LazyRegressor(predictions=True, regressors=REGRESSORS, custom_metric=pearsonr_scorer)
regresion_models, regresion_predictions = reg.fit(X_train_features_norm, X_test_features_norm, y_train, y_test)

'tuple' object has no attribute '__name__'
Invalid Regressor(s)


100%|██████████| 41/41 [00:13<00:00,  3.01it/s]


In [115]:
regresion_models

Unnamed: 0_level_0,Adjusted R-Squared,R-Squared,RMSE,Time Taken,pearsonr_scorer
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LassoLars,-0.02,-0.02,1.18,0.03,
DummyRegressor,-0.02,-0.02,1.18,0.01,
Lasso,-0.02,-0.02,1.18,0.03,
ElasticNet,-0.16,-0.15,1.26,0.03,-0.06
TweedieRegressor,-0.55,-0.54,1.45,0.37,-0.07
PoissonRegressor,-0.66,-0.65,1.5,0.87,-0.04
AdaBoostRegressor,-0.7,-0.69,1.52,0.09,-0.06
NuSVR,-0.89,-0.88,1.61,0.31,-0.05
SGDRegressor,-0.9,-0.89,1.61,0.03,0.01
ElasticNetCV,-0.91,-0.9,1.62,0.21,0.02
