# Semantic Textual Similarity Assignment
## SemEval 2012: The First Joint Conference on Lexical and Computational Semantics

**Students:**
- Mario Rosas
- Alam Lopez

**Lab Professor:** Salvador Medina Herrera



## Introduction 

This assignment applies the concepts covered in the Introduction to Human Language Technology class by replicating Task 6 of SemEval 2012 under conditions similar to those given to the participants.

The main advantage is that we have full access to the results paper, which ranks the proposed approaches by three metrics and describes the techniques used by each participant, as well as to the corresponding paper for each implementation.

However, the main constraint is that only techniques available at the time of the challenge may be used. This means that the following approach only considers methods published before June 2012.



The main task is to compute the Semantic Textual Similarity (STS) between pairs of sentences, on a scale from 0 to 5, where:

- **(5) Completely equivalent**, as they mean the same thing
- **(4) Mostly equivalent**, *but some unimportant details differ.*
- **(3) Roughly equivalent**, *but some important information differs/missing.*
- **(2) Not equivalent**, *but share some details.*
- **(1) Not equivalent**, *but are on the same topic.*
- **(0) On different topics.**

The assignment asks us to implement the following approaches:

- a) Explore some lexical dimensions.
- b) Explore the syntactic dimension alone.
- c) Explore the combination of both previous.
- d) Add new components of your choice.

Finally, we compute the Pearson correlation between the predicted similarity and the gold standard (human annotations, collected with a methodology designed to improve their quality). The baseline reaches a Pearson correlation of 0.31, and the maximum obtained in the contest was 0.75.
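
As a reference for how this evaluation works, the following is a minimal sketch of the Pearson correlation in plain Python. The function name `pearson` and the toy values are illustrative assumptions, not part of the assignment code or data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (un-normalised; the 1/n factors cancel out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Toy values for illustration only (not taken from the dataset).
gold = [5.0, 3.2, 0.0, 4.4, 1.0]        # human scores on the 0-5 scale
predicted = [0.9, 0.6, 0.1, 0.7, 0.3]   # e.g. a similarity in [0, 1]
r = pearson(predicted, gold)
```

Because the Pearson correlation is invariant to positive linear rescaling, a similarity score in [0, 1] can be compared directly against the 0-5 gold standard without being mapped to that range first.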

# General Steps

We decided to merge the first three approaches (a, b, and c) into the same workflow.


1. First, we implemented all the functions needed to load the train and test data and to preprocess the sentences. They are stored in https://github.com/MarioROT/IHLT-MAI

2. Then we defined the features computed between the two sentences of each pair in order to describe their similarity or difference.

We fed the features into a Support Vector Regression / ML classifier, specifically a ...  in order to train our system with the features defined and predict the similarity of the test sentences.
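
The training step can be sketched as follows. This is only an illustration under stated assumptions: the exact model and its settings are not specified above, so the RBF kernel, `C`, `epsilon`, and the toy feature values are all hypothetical choices, and scikit-learn is assumed to be installed.

```python
import numpy as np
from sklearn.svm import SVR

# Toy feature matrix: one row per sentence pair, one column per similarity
# feature (hypothetical values, e.g. Jaccard and Dice scores).
X_train = np.array([
    [0.90, 0.95],
    [0.60, 0.75],
    [0.10, 0.18],
    [0.75, 0.86],
    [0.30, 0.46],
])
y_train = np.array([5.0, 3.2, 0.0, 4.4, 1.0])  # gold scores on the 0-5 scale

# Assumed hyperparameters; they would normally be tuned on held-out data.
model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X_train, y_train)

X_test = np.array([[0.80, 0.89], [0.20, 0.33], [0.50, 0.67]])
# Clip predictions so they stay inside the valid STS range.
preds = np.clip(model.predict(X_test), 0.0, 5.0)
```

Clipping is optional for the Pearson evaluation but keeps the predictions interpretable on the annotation scale.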


### For Colaboratory

In [13]:
%%shell
git clone https://github.com/mariorot/IHLT-MAI.git
cd 'IHLT-MAI'
mv 'complementary_material' /content/
mv scripts /content/
pip install svgling
pip install python-crfsuite

UsageError: Cell magic `%%shell` not found.


### CODE

In [2]:
%load_ext autoreload
%autoreload 2

In [None]:
from scripts.compute_metrics import ComputeMetrics
from scripts.text_preprocessing import TextPreprocessing
from scripts.utils import Dataset, ShowResults
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option("display.precision", 4)


# Loading Data

In [4]:
train = Dataset('complementary_material/train/')
test = Dataset('complementary_material/test-gold/')
tp = TextPreprocessing()
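
The experiments compare raw tokens against "cleaned" sentences produced by `tp.clean_data`, whose implementation lives in the repository and is not shown here. A minimal sketch of what such cleaning commonly involves, assuming lowercasing, punctuation removal, and stopword filtering (the function `clean_sentence` and the short stopword list are hypothetical):

```python
import string

# Tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"}

def clean_sentence(sentence):
    """Lowercase, strip ASCII punctuation, and drop stopwords; returns tokens."""
    table = str.maketrans("", "", string.punctuation)
    no_punct = sentence.lower().translate(table)
    return [tok for tok in no_punct.split() if tok not in STOPWORDS]

tokens = clean_sentence("A man is playing the guitar.")
```

Removing function words matters most for short sentences, where a few shared stopwords can dominate any overlap-based score.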

## Experiments on training data

In [5]:
insT = {}
for dataname in ['SMTeuroparl', 'MSRvid', 'MSRpar', 'all']:
    dt = train[dataname]

    # ----- Tokenization -----
    # NLTK
    dt[2] = tp.tokenize_data(list(dt[0]),'nltk')
    dt[3] = tp.tokenize_data(list(dt[1]),'nltk')
    # spaCy
    dt[4] = tp.tokenize_data(list(dt[0]),'spacy')
    dt[5] = tp.tokenize_data(list(dt[1]),'spacy')
    # Data cleaning
    dt[6]=tp.clean_data(list(dt[0]))
    dt[7]=tp.clean_data(list(dt[1]))


    # ----- Lemmatization -----
    # -- With Tokens
    # NLTK
    dt[8]=tp.lemmatize_data(list(dt[0]),'nltk',False)
    dt[9]=tp.lemmatize_data(list(dt[1]),'nltk',False)
    # spaCy
    dt[10]=tp.lemmatize_data(list(dt[0]),'spacy')
    dt[11]=tp.lemmatize_data(list(dt[1]),'spacy')

    # -- With Cleaned data
    # NLTK
    dt[12]=tp.lemmatize_data(list(dt[6]),'nltk')
    dt[13]=tp.lemmatize_data(list(dt[7]),'nltk')
    # spaCy
    dt[14]=tp.lemmatize_data(list(dt[6]),'spacy')
    dt[15]=tp.lemmatize_data(list(dt[7]),'spacy')


    # ----- Word Sense Disambiguation -----
    # --- With Tokens
    # NLTK
    dt[16]= tp.wsd_lesk_data(list(dt[2]),'nltk', keep_failures=True, synset_word=True)
    dt[17]= tp.wsd_lesk_data(list(dt[3]),'nltk', keep_failures=True, synset_word=True)
    # Spacy
    dt[18]= tp.wsd_lesk_data(list(dt[4]),'nltk', keep_failures=True, synset_word=True)
    dt[19]= tp.wsd_lesk_data(list(dt[5]),'nltk', keep_failures=True, synset_word=True)
    # Cleaned data
    dt[20]= tp.wsd_lesk_data(list(dt[6]),'nltk', keep_failures=True, synset_word=True)
    dt[21]= tp.wsd_lesk_data(list(dt[7]),'nltk', keep_failures=True, synset_word=True)

    # --- With Lemmas
    # NLTK Lemmas
    dt[22]= tp.wsd_lesk_data(list(dt[8]),'nltk', keep_failures=True, synset_word=True)
    dt[23]= tp.wsd_lesk_data(list(dt[9]),'nltk', keep_failures=True, synset_word=True)
    # Spacy Lemmas
    dt[24]= tp.wsd_lesk_data(list(dt[10]),'nltk', keep_failures=True, synset_word=True)
    dt[25]= tp.wsd_lesk_data(list(dt[11]),'nltk', keep_failures=True, synset_word=True)
    # Cleaned NLTK Lemmas
    dt[26]= tp.wsd_lesk_data(list(dt[12]),'nltk', keep_failures=True, synset_word=True)
    dt[27]= tp.wsd_lesk_data(list(dt[13]),'nltk', keep_failures=True, synset_word=True)
    # Cleaned SpaCy Lemmas
    dt[28]= tp.wsd_lesk_data(list(dt[14]),'nltk', keep_failures=True, synset_word=True)
    dt[29]= tp.wsd_lesk_data(list(dt[15]),'nltk', keep_failures=True, synset_word=True)


    # ----- Named Entities -----
    # --- With Tokens
    # NLTK
    dt[30]= tp.named_entities_data(list(dt[0]), 'nltk', False)
    dt[31]= tp.named_entities_data(list(dt[1]), 'nltk', False)
    # SpaCy
    dt[32]= tp.named_entities_data(list(dt[0]), 'spacy')
    dt[33]= tp.named_entities_data(list(dt[1]), 'spacy')
    # Cleaned data
    # NLTK
    dt[34]= tp.named_entities_data(list(dt[6]), 'nltk')
    dt[35]= tp.named_entities_data(list(dt[7]), 'nltk')
    # SpaCy
    dt[36]= tp.named_entities_data(list(dt[6]), 'spacy')
    dt[37]= tp.named_entities_data(list(dt[7]), 'spacy')

    # --- With Lemmas
    # NLTK
    dt[38]= tp.named_entities_data(list(dt[8]), 'nltk')
    dt[39]= tp.named_entities_data(list(dt[9]), 'nltk')

    dt[40]= tp.named_entities_data(list(dt[8]), 'spacy')
    dt[41]= tp.named_entities_data(list(dt[9]), 'spacy')
    # SpaCy
    dt[42]= tp.named_entities_data(list(dt[10]), 'nltk')
    dt[43]= tp.named_entities_data(list(dt[11]), 'nltk')

    dt[44]= tp.named_entities_data(list(dt[10]), 'spacy')
    dt[45]= tp.named_entities_data(list(dt[11]), 'spacy')
    # Cleaned data
    # NLTK
    dt[46]= tp.named_entities_data(list(dt[12]), 'nltk')
    dt[47]= tp.named_entities_data(list(dt[13]), 'nltk')

    dt[48]= tp.named_entities_data(list(dt[12]), 'spacy')
    dt[49]= tp.named_entities_data(list(dt[13]), 'spacy')
    # SpaCy
    dt[50]= tp.named_entities_data(list(dt[14]), 'nltk')
    dt[51]= tp.named_entities_data(list(dt[15]), 'nltk')

    dt[52]= tp.named_entities_data(list(dt[14]), 'spacy')
    dt[53]= tp.named_entities_data(list(dt[15]), 'spacy')

    # -- Metrics computation
    pairs = {'nltk_token':[2,3], 'spacy_token':[4,5], 'clean_token':[6,7], # Tokens
            'nltk_lemma':[8,9], 'spacy_lemma':[10,11], 'clean_nltk_lemma':[12,13], 'clean_spacy_lemma':[14,15], # Lemmas
            'nltk_token_wsd':[16,17], 'spacy_token_wsd':[18,19], 'clean_token_wsd':[20,21], # WSD
            'nltk_lemma_wsd':[22,23], 'spacy_lemma_wsd':[24,25], 'clean_nltk_lemma_wsd':[26,27], 'clean_spacy_lemma_wsd':[28,29], #WSD
            'nltk_ne':[30,31],  'spacy_ne':[32,33],  'clean_nltk_ne':[34,35],  'clean_spacy_ne':[36,37], #Named Entities
            'nltk_lemmas_-_nltk_ne':[38,39],  'nltk_lemmas_-_spacy_ne':[40,41],  'spacy_lemmas_-_nltk_ne':[42,43],  'spacy_lemmas_-_spacy_ne':[44,45], #Named Entities
            'clean_nltk_lemmas_-_nltk_ne':[46,47],  'clean_nltk_lemmas_-_spacy_ne':[48,49],  'clean_spacy_lemmas_-_nltk_ne':[50,51],  'clean_spacy_lemmas_-_spacy_ne':[52,53] #Named Entities
            }

    metrics = ['jaccard', 'cosine', 'overlap', 'dice']
    mets_results = {k:{} for k in metrics}
    mets_results['gs'] = dt['gs']

    for name, values in pairs.items():
        met_results = ComputeMetrics(dt[values].to_numpy(), metrics).do()
        for metric in metrics:
            mets_results[metric][name] = met_results[metric]

    sr = ShowResults(mets_results, {'Tokenization':'token', 'Lemmatization':'lemma', 'Word Sense Disambiguation':'wsd', 'Named Entities':'ne'}, False)
    insT[dataname] =  sr
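
The four metrics named in the cell above (`jaccard`, `cosine`, `overlap`, `dice`) are standard set-overlap measures. The internals of `ComputeMetrics` are not shown here, so the following is only a sketch of the usual set-based definitions; the function `set_similarities` is hypothetical, and cosine is taken in its binary-vector form.

```python
import math

def set_similarities(tokens_a, tokens_b):
    """Set-based similarity scores between two token lists."""
    a, b = set(tokens_a), set(tokens_b)
    if not a or not b:
        # Convention chosen for this sketch: empty input scores 0 everywhere.
        return {"jaccard": 0.0, "cosine": 0.0, "overlap": 0.0, "dice": 0.0}
    inter = len(a & b)
    return {
        "jaccard": inter / len(a | b),                 # |A∩B| / |A∪B|
        "cosine": inter / math.sqrt(len(a) * len(b)),  # |A∩B| / sqrt(|A||B|)
        "overlap": inter / min(len(a), len(b)),        # |A∩B| / min(|A|,|B|)
        "dice": 2 * inter / (len(a) + len(b)),         # 2|A∩B| / (|A|+|B|)
    }

scores = set_similarities(
    ["man", "play", "guitar"],
    ["man", "play", "violin", "loudly"],
)
```

Here the two sets share 2 of 5 distinct tokens, so Jaccard is 0.4, Dice is 4/7, overlap is 2/3, and cosine is 2/sqrt(12).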

In [17]:
insT['SMTeuroparl'].heatmap()

| Category | jaccard | cosine | overlap | dice |
|---|---|---|---|---|
| NLTK TOKEN | 0.505 | 0.562 | 0.518 | 0.566 |
| SPACY TOKEN | 0.444 | 0.483 | 0.485 | 0.471 |
| CLEAN TOKEN | 0.428 | 0.467 | 0.427 | 0.471 |
| NLTK LEMMA | 0.509 | 0.565 | 0.519 | 0.568 |
| SPACY LEMMA | 0.517 | 0.576 | 0.537 | 0.578 |
| CLEAN NLTK LEMMA | 0.442 | 0.483 | 0.445 | 0.487 |
| CLEAN SPACY LEMMA | 0.44 | 0.48 | 0.441 | 0.484 |
| NLTK TOKEN WSD | 0.463 | 0.517 | 0.481 | 0.52 |
| SPACY TOKEN WSD | 0.412 | 0.449 | 0.451 | 0.439 |
| CLEAN TOKEN WSD | 0.424 | 0.465 | 0.432 | 0.468 |


In [18]:
insT['MSRvid'].heatmap()

| Category | jaccard | cosine | overlap | dice |
|---|---|---|---|---|
| NLTK TOKEN | 0.356 | 0.348 | 0.386 | 0.343 |
| SPACY TOKEN | 0.355 | 0.348 | 0.386 | 0.343 |
| CLEAN TOKEN | 0.665 | 0.684 | 0.68 | 0.683 |
| NLTK LEMMA | 0.484 | 0.474 | 0.512 | 0.468 |
| SPACY LEMMA | 0.534 | 0.526 | 0.568 | 0.52 |
| CLEAN NLTK LEMMA | 0.738 | 0.752 | 0.751 | 0.751 |
| CLEAN SPACY LEMMA | 0.74 | 0.754 | 0.752 | 0.753 |
| NLTK TOKEN WSD | 0.369 | 0.371 | 0.406 | 0.367 |
| SPACY TOKEN WSD | 0.368 | 0.371 | 0.406 | 0.367 |
| CLEAN TOKEN WSD | 0.688 | 0.71 | 0.709 | 0.709 |


In [19]:
insT['MSRpar'].heatmap()

| Category | jaccard | cosine | overlap | dice |
|---|---|---|---|---|
| NLTK TOKEN | 0.519 | 0.529 | 0.481 | 0.532 |
| SPACY TOKEN | 0.411 | 0.402 | 0.429 | 0.374 |
| CLEAN TOKEN | 0.452 | 0.451 | 0.378 | 0.456 |
| NLTK LEMMA | 0.53 | 0.54 | 0.492 | 0.542 |
| SPACY LEMMA | 0.553 | 0.562 | 0.517 | 0.564 |
| CLEAN NLTK LEMMA | 0.451 | 0.452 | 0.386 | 0.457 |
| CLEAN SPACY LEMMA | 0.464 | 0.464 | 0.399 | 0.469 |
| NLTK TOKEN WSD | 0.472 | 0.475 | 0.439 | 0.477 |
| SPACY TOKEN WSD | 0.383 | 0.377 | 0.398 | 0.353 |
| CLEAN TOKEN WSD | 0.419 | 0.415 | 0.36 | 0.42 |


In [20]:
insT['all'].heatmap()

| Category | jaccard | cosine | overlap | dice |
|---|---|---|---|---|
| NLTK TOKEN | 0.378 | 0.387 | 0.378 | 0.387 |
| SPACY TOKEN | 0.316 | 0.317 | 0.351 | 0.303 |
| CLEAN TOKEN | 0.571 | 0.609 | 0.585 | 0.61 |
| NLTK LEMMA | 0.422 | 0.431 | 0.419 | 0.431 |
| SPACY LEMMA | 0.447 | 0.456 | 0.445 | 0.456 |
| CLEAN NLTK LEMMA | 0.603 | 0.644 | 0.624 | 0.645 |
| CLEAN SPACY LEMMA | 0.6 | 0.641 | 0.619 | 0.642 |
| NLTK TOKEN WSD | 0.432 | 0.458 | 0.459 | 0.458 |
| SPACY TOKEN WSD | 0.38 | 0.401 | 0.432 | 0.388 |
| CLEAN TOKEN WSD | 0.566 | 0.608 | 0.591 | 0.609 |


In [27]:
mets_results['dice']['clean_nltk_lemmas_-_nltk_ne'][...,np.newaxis].shape

(2234,)

In [None]:
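from sklearn.neural_network import MLPRegressor

# After the loop above, mets_results holds the scores for the 'all' training set.
# Single feature: Dice score for 'clean_nltk_lemmas_-_nltk_ne';
# np.newaxis adds the feature axis that scikit-learn expects.
X = mets_results['dice']['clean_nltk_lemmas_-_nltk_ne'][...,np.newaxis]
y = mets_results['gs']  # gold-standard similarity scores
clf = MLPRegressor(solver='lbfgs', alpha=1e-5,
                    hidden_layer_sizes=(5, 1), random_state=1)
clf.fit(X, y)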

## Final model on test data

In [11]:
ins = {}
for dataname in ['SMTeuroparl', 'MSRvid', 'MSRpar', 'all']:
    dt = test[dataname]

    # ----- Tokenization -----
    # NLTK
    dt[2] = tp.tokenize_data(list(dt[0]),'nltk')
    dt[3] = tp.tokenize_data(list(dt[1]),'nltk')
    # spaCy
    dt[4] = tp.tokenize_data(list(dt[0]),'spacy')
    dt[5] = tp.tokenize_data(list(dt[1]),'spacy')
    # Data cleaning
    dt[6]=tp.clean_data(list(dt[0]))
    dt[7]=tp.clean_data(list(dt[1]))


    # ----- Lemmatization -----
    # -- With Tokens
    # NLTK
    dt[8]=tp.lemmatize_data(list(dt[0]),'nltk',False)
    dt[9]=tp.lemmatize_data(list(dt[1]),'nltk',False)
    # spaCy
    dt[10]=tp.lemmatize_data(list(dt[0]),'spacy')
    dt[11]=tp.lemmatize_data(list(dt[1]),'spacy')

    # -- With Cleaned data
    # NLTK
    dt[12]=tp.lemmatize_data(list(dt[6]),'nltk')
    dt[13]=tp.lemmatize_data(list(dt[7]),'nltk')
    # spaCy
    dt[14]=tp.lemmatize_data(list(dt[6]),'spacy')
    dt[15]=tp.lemmatize_data(list(dt[7]),'spacy')


    # ----- Word Sense Disambiguation -----
    # --- With Tokens
    # NLTK
    dt[16]= tp.wsd_lesk_data(list(dt[2]),'nltk', keep_failures=True, synset_word=True)
    dt[17]= tp.wsd_lesk_data(list(dt[3]),'nltk', keep_failures=True, synset_word=True)
    # Spacy
    dt[18]= tp.wsd_lesk_data(list(dt[4]),'nltk', keep_failures=True, synset_word=True)
    dt[19]= tp.wsd_lesk_data(list(dt[5]),'nltk', keep_failures=True, synset_word=True)
    # Cleaned data
    dt[20]= tp.wsd_lesk_data(list(dt[6]),'nltk', keep_failures=True, synset_word=True)
    dt[21]= tp.wsd_lesk_data(list(dt[7]),'nltk', keep_failures=True, synset_word=True)

    # --- With Lemmas
    # NLTK Lemmas
    dt[22]= tp.wsd_lesk_data(list(dt[8]),'nltk', keep_failures=True, synset_word=True)
    dt[23]= tp.wsd_lesk_data(list(dt[9]),'nltk', keep_failures=True, synset_word=True)
    # Spacy Lemmas
    dt[24]= tp.wsd_lesk_data(list(dt[10]),'nltk', keep_failures=True, synset_word=True)
    dt[25]= tp.wsd_lesk_data(list(dt[11]),'nltk', keep_failures=True, synset_word=True)
    # Cleaned NLTK Lemmas
    dt[26]= tp.wsd_lesk_data(list(dt[12]),'nltk', keep_failures=True, synset_word=True)
    dt[27]= tp.wsd_lesk_data(list(dt[13]),'nltk', keep_failures=True, synset_word=True)
    # Cleaned SpaCy Lemmas
    dt[28]= tp.wsd_lesk_data(list(dt[14]),'nltk', keep_failures=True, synset_word=True)
    dt[29]= tp.wsd_lesk_data(list(dt[15]),'nltk', keep_failures=True, synset_word=True)


    # ----- Named Entities -----
    # --- With Tokens
    # NLTK
    dt[30]= tp.named_entities_data(list(dt[0]), 'nltk', False)
    dt[31]= tp.named_entities_data(list(dt[1]), 'nltk', False)
    # SpaCy
    dt[32]= tp.named_entities_data(list(dt[0]), 'spacy')
    dt[33]= tp.named_entities_data(list(dt[1]), 'spacy')
    # Cleaned data
    # NLTK
    dt[34]= tp.named_entities_data(list(dt[6]), 'nltk')
    dt[35]= tp.named_entities_data(list(dt[7]), 'nltk')
    # SpaCy
    dt[36]= tp.named_entities_data(list(dt[6]), 'spacy')
    dt[37]= tp.named_entities_data(list(dt[7]), 'spacy')

    # --- With Lemmas
    # NLTK
    dt[38]= tp.named_entities_data(list(dt[8]), 'nltk')
    dt[39]= tp.named_entities_data(list(dt[9]), 'nltk')

    dt[40]= tp.named_entities_data(list(dt[8]), 'spacy')
    dt[41]= tp.named_entities_data(list(dt[9]), 'spacy')
    # SpaCy
    dt[42]= tp.named_entities_data(list(dt[10]), 'nltk')
    dt[43]= tp.named_entities_data(list(dt[11]), 'nltk')

    dt[44]= tp.named_entities_data(list(dt[10]), 'spacy')
    dt[45]= tp.named_entities_data(list(dt[11]), 'spacy')
    # Cleaned data
    # NLTK
    dt[46]= tp.named_entities_data(list(dt[12]), 'nltk')
    dt[47]= tp.named_entities_data(list(dt[13]), 'nltk')

    dt[48]= tp.named_entities_data(list(dt[12]), 'spacy')
    dt[49]= tp.named_entities_data(list(dt[13]), 'spacy')
    # SpaCy
    dt[50]= tp.named_entities_data(list(dt[14]), 'nltk')
    dt[51]= tp.named_entities_data(list(dt[15]), 'nltk')

    dt[52]= tp.named_entities_data(list(dt[14]), 'spacy')
    dt[53]= tp.named_entities_data(list(dt[15]), 'spacy')

    # -- Metrics computation
    pairs = {'nltk_token':[2,3], 'spacy_token':[4,5], 'clean_token':[6,7], # Tokens
            'nltk_lemma':[8,9], 'spacy_lemma':[10,11], 'clean_nltk_lemma':[12,13], 'clean_spacy_lemma':[14,15], # Lemmas
            'nltk_token_wsd':[16,17], 'spacy_token_wsd':[18,19], 'clean_token_wsd':[20,21], # WSD
            'nltk_lemma_wsd':[22,23], 'spacy_lemma_wsd':[24,25], 'clean_nltk_lemma_wsd':[26,27], 'clean_spacy_lemma_wsd':[28,29], #WSD
            'nltk_ne':[30,31],  'spacy_ne':[32,33],  'clean_nltk_ne':[34,35],  'clean_spacy_ne':[36,37], #Named Entities
            'nltk_lemmas_-_nltk_ne':[38,39],  'nltk_lemmas_-_spacy_ne':[40,41],  'spacy_lemmas_-_nltk_ne':[42,43],  'spacy_lemmas_-_spacy_ne':[44,45], #Named Entities
            'clean_nltk_lemmas_-_nltk_ne':[46,47],  'clean_nltk_lemmas_-_spacy_ne':[48,49],  'clean_spacy_lemmas_-_nltk_ne':[50,51],  'clean_spacy_lemmas_-_spacy_ne':[52,53] #Named Entities
            }

    metrics = ['jaccard', 'cosine', 'overlap', 'dice']
    mets_results = {k:{} for k in metrics}
    mets_results['gs'] = dt['gs']

    for name, values in pairs.items():
        met_results = ComputeMetrics(dt[values].to_numpy(), metrics).do()
        for metric in metrics:
            mets_results[metric][name] = met_results[metric]

    sr = ShowResults(mets_results, {'Tokenization':'token', 'Lemmatization':'lemma', 'Word Sense Disambiguation':'wsd', 'Named Entities':'ne'}, False)
    ins[dataname] =  sr

In [12]:
ins['SMTeuroparl'].heatmap()

| Category | jaccard | cosine | overlap | dice |
|---|---|---|---|---|
| NLTK TOKEN | 0.45 | 0.461 | 0.444 | 0.462 |
| SPACY TOKEN | 0.461 | 0.472 | 0.458 | 0.473 |
| CLEAN TOKEN | 0.468 | 0.473 | 0.443 | 0.475 |
| NLTK LEMMA | 0.449 | 0.462 | 0.44 | 0.464 |
| SPACY LEMMA | 0.477 | 0.492 | 0.47 | 0.494 |
| CLEAN NLTK LEMMA | 0.481 | 0.496 | 0.465 | 0.498 |
| CLEAN SPACY LEMMA | 0.491 | 0.505 | 0.476 | 0.507 |
| NLTK TOKEN WSD | 0.42 | 0.417 | 0.398 | 0.419 |
| SPACY TOKEN WSD | 0.429 | 0.425 | 0.408 | 0.426 |
| CLEAN TOKEN WSD | 0.48 | 0.479 | 0.449 | 0.481 |


In [13]:
ins['MSRvid'].heatmap()

| Category | jaccard | cosine | overlap | dice |
|---|---|---|---|---|
| NLTK TOKEN | 0.358 | 0.361 | 0.391 | 0.356 |
| SPACY TOKEN | 0.359 | 0.363 | 0.395 | 0.358 |
| CLEAN TOKEN | 0.686 | 0.709 | 0.703 | 0.707 |
| NLTK LEMMA | 0.48 | 0.484 | 0.516 | 0.478 |
| SPACY LEMMA | 0.557 | 0.56 | 0.591 | 0.554 |
| CLEAN NLTK LEMMA | 0.741 | 0.759 | 0.753 | 0.757 |
| CLEAN SPACY LEMMA | 0.742 | 0.759 | 0.754 | 0.757 |
| NLTK TOKEN WSD | 0.353 | 0.366 | 0.402 | 0.363 |
| SPACY TOKEN WSD | 0.354 | 0.367 | 0.403 | 0.364 |
| CLEAN TOKEN WSD | 0.684 | 0.704 | 0.699 | 0.704 |


In [14]:
ins['MSRpar'].heatmap()

| Category | jaccard | cosine | overlap | dice |
|---|---|---|---|---|
| NLTK TOKEN | 0.514 | 0.52 | 0.485 | 0.521 |
| SPACY TOKEN | 0.415 | 0.404 | 0.413 | 0.38 |
| CLEAN TOKEN | 0.411 | 0.408 | 0.362 | 0.409 |
| NLTK LEMMA | 0.529 | 0.534 | 0.497 | 0.536 |
| SPACY LEMMA | 0.542 | 0.549 | 0.51 | 0.551 |
| CLEAN NLTK LEMMA | 0.424 | 0.42 | 0.371 | 0.421 |
| CLEAN SPACY LEMMA | 0.425 | 0.42 | 0.372 | 0.421 |
| NLTK TOKEN WSD | 0.461 | 0.462 | 0.436 | 0.464 |
| SPACY TOKEN WSD | 0.388 | 0.384 | 0.389 | 0.362 |
| CLEAN TOKEN WSD | 0.413 | 0.407 | 0.368 | 0.408 |


In [15]:
ins['all'].heatmap()

| Category | jaccard | cosine | overlap | dice |
|---|---|---|---|---|
| NLTK TOKEN | 0.365 | 0.362 | 0.356 | 0.363 |
| SPACY TOKEN | 0.35 | 0.341 | 0.344 | 0.334 |
| CLEAN TOKEN | 0.585 | 0.615 | 0.595 | 0.615 |
| NLTK LEMMA | 0.406 | 0.407 | 0.4 | 0.407 |
| SPACY LEMMA | 0.476 | 0.487 | 0.478 | 0.487 |
| CLEAN NLTK LEMMA | 0.615 | 0.649 | 0.629 | 0.65 |
| CLEAN SPACY LEMMA | 0.618 | 0.65 | 0.631 | 0.65 |
| NLTK TOKEN WSD | 0.416 | 0.429 | 0.433 | 0.429 |
| SPACY TOKEN WSD | 0.405 | 0.416 | 0.423 | 0.409 |
| CLEAN TOKEN WSD | 0.606 | 0.638 | 0.621 | 0.638 |
