# Lab 3: Morphology

Authors: Edison Jair Bejarano Sepulveda - Roberto Ariosa


## Statement:

1. Read all pairs of sentences of the SMTeuroparl files of test set within the evaluation framework of the project.

2. Compute their similarities by considering lemmas and Jaccard distance.

3. Compare the results with those in session 2 (document structure) in which words were considered.

4. Compare the results with the gold standard by giving the Pearson correlation between them.

5. Questions (justify the answers):

  * Which is better: words or lemmas?

  * Do you think it could perform better for any pair of texts?

In [5]:
!pip install -q pingouin

In [6]:
import nltk
import os
import pandas as pd
import pingouin as pg
from scipy.stats import pearsonr
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')


from google.colab import drive
drive.mount('/content/drive')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Before starting this lab, it was necessary to understand the following concepts:
   - Lemmatization is the process of converting a word to its base form. In other words, lemmatization takes the context into account and then converts the word to its "meaningful base form" (Prabhakaran, Selva). This contrasts with word or sentence tokenization, where we only split the text and remove stop words and punctuation.
   - The advantage of lemmatization is that it takes into account the definition and context of the word, including its position in the sentence.
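To make the idea concrete, the sketch below maps inflected forms to a shared lemma. The `TOY_LEMMAS` table and the `toy_lemmatize` helper are hypothetical stand-ins for a real lemmatizer such as the WordNet one used later in this lab; they only illustrate the effect and are not how WordNet works internally.

```python
# Hypothetical lookup table standing in for a real lemmatizer (illustration only).
TOY_LEMMAS = {
    "proposes": "propose",
    "proposing": "propose",
    "is": "be",
    "changes": "change",
}

def toy_lemmatize(tokens):
    """Lowercase each token and replace it with its lemma when the toy table knows it."""
    return [TOY_LEMMAS.get(token.lower(), token.lower()) for token in tokens]

s1 = "Amendment No 7 proposes certain changes".split()
s2 = "Amendment No 7 is proposing certain changes".split()

# Both sentences now share the lemma "propose" instead of two different surface forms.
print(toy_lemmatize(s1))
print(toy_lemmatize(s2))
```

After this normalization, "proposes" and "proposing" count as the same element when the two sentences are compared as sets.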
  
## Pipeline to solve this lab:
1. Read the corpus from the SMTeuroparl file
2. Create a dataset with both texts
3. Convert the corpus to lowercase
4. Tokenize both texts
5. Compute two columns with the lemmas of both texts
6. Compute the Jaccard similarity for the columns with and without lemmatization
7. Calculate Pearson's correlation between the gold standard and the similarity with and without lemmatization.

We experimented with lemmatizing only verbs and with lemmatizing all parts of speech, to find the option that gives the best similarity metric.

# 1. Read all pairs of sentences of the SMTeuroparl files of test set within the evaluation framework of the project.

In [7]:
# ------------------------------ #
# Path test gold directory
# ------------------------------ #
path = '/content/drive/MyDrive/Colab_Notebooks/2.IHLT/lab3/test-gold'

In [8]:
# ------------------------------ #
# Read dataset and return a list 
# with the files 
# ------------------------------ #
files = os.listdir(path)
files = pd.DataFrame(files)
files = path+"/"+files
files.head(8)

Unnamed: 0,0
0,/content/drive/MyDrive/Colab_Notebooks/2.IHLT/...
1,/content/drive/MyDrive/Colab_Notebooks/2.IHLT/...
2,/content/drive/MyDrive/Colab_Notebooks/2.IHLT/...
3,/content/drive/MyDrive/Colab_Notebooks/2.IHLT/...
4,/content/drive/MyDrive/Colab_Notebooks/2.IHLT/...
5,/content/drive/MyDrive/Colab_Notebooks/2.IHLT/...
6,/content/drive/MyDrive/Colab_Notebooks/2.IHLT/...
7,/content/drive/MyDrive/Colab_Notebooks/2.IHLT/...


In [9]:
# ------------------------------ #
# Load the sentence pairs and
# the gold standard scores
# ------------------------------ #
da = files[0][3]
df = pd.read_csv(da, sep='\t', header=None)
df.columns = ['Text1', 'Text2']
gold_file = files[0][4]
df["gs"] = pd.read_csv(gold_file, sep='\t', header=None)
df.head()

Unnamed: 0,Text1,Text2,gs
0,The leaders have now been given a new chance a...,The leaders benefit aujourd' hui of a new luck...,4.5
1,Amendment No 7 proposes certain changes in the...,Amendment No 7 is proposing certain changes in...,5.0
2,Let me remind you that our allies include ferv...,I would like to remind you that among our alli...,4.25
3,The vote will take place today at 5.30 p.m.,The vote will take place at 5.30pm,4.5
4,"The fishermen are inactive, tired and disappoi...","The fishermen are inactive, tired and disappoi...",5.0
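Because `os.listdir` returns entries in arbitrary order, positional indices such as `files[0][3]` can point at different files on another machine. One possible way to make the selection more robust is to pick files by name, as sketched below. The helper `pick_file` is hypothetical, and the file names in the example are assumed for illustration only, since the notebook does not show the actual names.

```python
def pick_file(paths, *keywords):
    """Return the single path whose file name contains every keyword."""
    matches = [p for p in paths if all(k in p.rsplit("/", 1)[-1] for k in keywords)]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match for {keywords}, got {matches}")
    return matches[0]

# Assumed file names, for illustration only.
listing = [
    "test-gold/STS.gs.SMTeuroparl.txt",
    "test-gold/STS.input.SMTeuroparl.txt",
    "test-gold/STS.input.MSRpar.txt",
]

print(pick_file(listing, "input", "SMTeuroparl"))  # sentence pairs
print(pick_file(listing, "gs", "SMTeuroparl"))     # gold standard scores
```

Raising an error when zero or several files match makes a wrong directory or a renamed file visible immediately instead of silently loading the wrong data.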


# 2. Compute their similarities by considering lemmas and Jaccard distance.

In [10]:
# ------------------------------ #
# Similarity Function
# ------------------------------ #
def jaccard_similarity(s1, s2):
  intersection = len(s1.intersection(s2))
  union = len(s1) + len(s2) - intersection
  return float(intersection) / float(union)
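The statement asks for the Jaccard distance while the function above returns the Jaccard similarity; the two are complementary (distance = 1 - similarity). The sketch below works through a small example of the formula |A ∩ B| / |A ∪ B|. The name `safe_jaccard` is hypothetical, and returning 0.0 for two empty sets is an assumption made here to avoid a `ZeroDivisionError`, not something the notebook defines.

```python
def safe_jaccard(s1, s2):
    """Jaccard similarity |A & B| / |A | B|; returns 0.0 for two empty sets (an assumption)."""
    union = len(s1 | s2)
    if union == 0:
        return 0.0
    return len(s1 & s2) / union

a = {"the", "vote", "will", "take", "place", "today"}
b = {"the", "vote", "will", "take", "place"}

similarity = safe_jaccard(a, b)   # 5 shared tokens out of 6 distinct ones
distance = 1 - similarity         # the complementary Jaccard distance

print(similarity, distance)
```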

In [11]:
# ------------------------------ #
# Lemmatization text process
# ------------------------------ #

lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

def lemmatize(column):
  return  [ list(lemmatizer.lemmatize(word.lower(), get_wordnet_pos(word.lower())) for word in nltk.word_tokenize(sentence)) for sentence in column ]
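Note that `get_wordnet_pos` tags each word in isolation, so the tagger sees no sentence context; this may be why a form such as `proposes` can stay unchanged in the lemmatized output. One possible alternative, sketched here under assumptions, is to tag the whole tokenized sentence once and then map each Penn Treebank tag to a WordNet POS letter. The helper names `penn_to_wordnet` and `lemmatize_tagged` are hypothetical, `lemmatize_fn` is expected to behave like `lemmatizer.lemmatize(word, pos)`, and a stub replaces it so the sketch runs without NLTK data.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the POS letter WordNet uses; default to noun."""
    return {"J": "a", "N": "n", "V": "v", "R": "r"}.get(tag[:1].upper(), "n")

def lemmatize_tagged(tagged_tokens, lemmatize_fn):
    """Lemmatize (word, tag) pairs that were tagged in sentence context."""
    return [lemmatize_fn(word.lower(), penn_to_wordnet(tag)) for word, tag in tagged_tokens]

# Stub lemmatizer so the sketch runs without NLTK data (illustration only).
def stub_lemmatize(word, pos):
    return {("proposes", "v"): "propose", ("changes", "n"): "change"}.get((word, pos), word)

# Hand-written tags for illustration; a real run would obtain them from a tagger.
tagged = [("Amendment", "NNP"), ("proposes", "VBZ"), ("changes", "NNS")]
print(lemmatize_tagged(tagged, stub_lemmatize))
```

With NLTK, `tagged` would come from `nltk.pos_tag(nltk.word_tokenize(sentence))` and `lemmatize_fn` would be `lemmatizer.lemmatize`. Whether this improves the correlation on this dataset has not been measured here.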

In [12]:
#--------------------------------------------#
# Tokenize and lemmatize both text columns
#--------------------------------------------#
df['text1_lemmatized'] = lemmatize(df['Text1'])
df['text2_lemmatized'] = lemmatize(df['Text2'])


In [13]:
#--------------------------------------------#
# Compute the Jaccard similarity between
# Text1 and Text2 without lemmatization.
# Note: set() over a raw string yields its
# characters, so this compares character sets.
#--------------------------------------------#
df['similarity'] = [jaccard_similarity(set(row['Text1']), set(row['Text2'])) for i,row in df.iterrows()]

#--------------------------------------------#
# Compute the Jaccard similarity between
# Text1 and Text2 after lemmatization
#--------------------------------------------#
df['lemma_similarity'] = [jaccard_similarity(set(row['text1_lemmatized']), set(row['text2_lemmatized'])) for i,row in df.iterrows()]
df.head(8)

Unnamed: 0,Text1,Text2,gs,text1_lemmatized,text2_lemmatized,similarity,lemma_similarity
0,The leaders have now been given a new chance a...,The leaders benefit aujourd' hui of a new luck...,4.5,"[the, leader, have, now, be, give, a, new, cha...","[the, leader, benefit, aujourd, ', hui, of, a,...",0.678571,0.346154
1,Amendment No 7 proposes certain changes in the...,Amendment No 7 is proposing certain changes in...,5.0,"[amendment, no, 7, proposes, certain, change, ...","[amendment, no, 7, be, propose, certain, chang...",1.0,0.785714
2,Let me remind you that our allies include ferv...,I would like to remind you that among our alli...,4.25,"[let, me, remind, you, that, our, ally, includ...","[i, would, like, to, remind, you, that, among,...",0.666667,0.391304
3,The vote will take place today at 5.30 p.m.,The vote will take place at 5.30pm,4.5,"[the, vote, will, take, place, today, at, 5.30...","[the, vote, will, take, place, at, 5.30pm]",0.904762,0.545455
4,"The fishermen are inactive, tired and disappoi...","The fishermen are inactive, tired and disappoi...",5.0,"[the, fisherman, be, inactive, ,, tire, and, d...","[the, fisherman, be, inactive, ,, tire, and, d...",1.0,1.0
5,Neither was there a qualified majority within ...,There was not a majority voting in Parliament ...,5.0,"[neither, be, there, a, qualify, majority, wit...","[there, be, not, a, majority, voting, in, parl...",0.65625,0.4
6,It increases the power of the big countries at...,It has the effect of augmenting the potency of...,4.667,"[it, increase, the, power, of, the, big, count...","[it, have, the, effect, of, augment, the, pote...",0.791667,0.333333
7,"The fishermen are inactive, tired and disappoi...","The fishers are inactive, tired and disappointed.",5.0,"[the, fisherman, be, inactive, ,, tire, and, d...","[the, fisher, be, inactive, ,, tire, and, disa...",0.947368,0.8
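The `similarity` column is built from `set(row['Text1'])`, and `set()` over a Python string yields its characters rather than its words. The sketch below shows the difference on row 3 of the table. It uses `str.split` as a simplifying assumption in place of `nltk.word_tokenize`, and `jaccard` mirrors the notebook's `jaccard_similarity`.

```python
def jaccard(s1, s2):
    """Jaccard similarity between two sets."""
    intersection = len(s1 & s2)
    return intersection / (len(s1) + len(s2) - intersection)

t1 = "The vote will take place today at 5.30 p.m."
t2 = "The vote will take place at 5.30pm"

# Sets of characters: this is what set() does on raw strings.
char_level = jaccard(set(t1), set(t2))

# Sets of lowercase whitespace-separated words (simplified tokenization).
word_level = jaccard(set(t1.lower().split()), set(t2.lower().split()))

print(char_level, word_level)
```

The character-level value is 19/21, which matches the 0.904762 shown for this row, while the word-level value under this simplified tokenization is 6/10. A word-based baseline comparable to session 2 would therefore need to tokenize the text before building the sets.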


# 3. Compare the results with those in session 2 (document structure) in which words were considered.

In [14]:
y = df.iloc[:,-2:]
y

Unnamed: 0,similarity,lemma_similarity
0,0.678571,0.346154
1,1.000000,0.785714
2,0.666667,0.391304
3,0.904762,0.545455
4,1.000000,1.000000
...,...,...
454,0.730769,0.550000
455,0.625000,0.461538
456,0.750000,0.473684
457,0.680000,0.318182


# 4. Compare the results with the gold standard by giving the Pearson correlation between them.

In [15]:
#--------------------------------------------#
# Pearson Correlation without lemmas
#--------------------------------------------#
pearson_corr = pearsonr(df['similarity'], df['gs'])
print(f'The pearson correlation between these text is : {pearson_corr[0]}')
print(pg.corr(x=df['gs'], y=df['similarity']))


The pearson correlation between these text is : 0.3971297709735514
           n        r         CI95%         p-val       BF10  power
pearson  459  0.39713  [0.32, 0.47]  8.622654e-19  5.313e+15    1.0


In [16]:
#--------------------------------------------#
# Pearson Correlation with lemmas
#--------------------------------------------#
pearson_corr = pearsonr(df['lemma_similarity'], df['gs'])
print(f'The pearson correlation between these text is : {pearson_corr[0]}')
print(pg.corr(x=df['gs'], y=df['lemma_similarity']))

The pearson correlation between these text is : 0.46654833197283224
           n         r         CI95%         p-val      BF10  power
pearson  459  0.466548  [0.39, 0.54]  3.463051e-26  1.05e+23    1.0
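For reference, Pearson's r is the covariance of the two series divided by the product of their standard deviations. The pure-Python sketch below computes it directly. The function name `pearson_r` is hypothetical, and the sample values are only the first five rows of the table above, so the printed number is not comparable to the correlation over all 459 pairs.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First five rows of the table above (gs and lemma_similarity).
gold = [4.5, 5.0, 4.25, 4.5, 5.0]
sims = [0.346154, 0.785714, 0.391304, 0.545455, 1.0]

print(pearson_r(gold, sims))
```

Because r is invariant to linear rescaling, it does not matter that the gold standard ranges from 0 to 5 while the Jaccard similarity ranges from 0 to 1.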


# 5. Questions:

  * Which is better: words or lemmas?
     - Based on the Pearson correlations computed with and without lemmatization, lemmas perform better: the correlation with the gold standard was 0.47 using lemmas, compared with 0.40 without lemmatization. This increase suggests that lemmas offer a better basis for comparing texts. However, the way lemmatization is implemented can influence the result positively or negatively: in our tests, lemmatizing only verbs decreased the Pearson correlation, whereas the best result was obtained by lemmatizing all parts of speech.

  * Do you think it could perform better for any pair of texts?
    - As observed, the lemmatization could be improved by using another library that takes more grammatical and morphological factors into account, such as spaCy (https://spacy.io/api/lemmatizer).

    - Other options for experimenting with lemmatization are:
     
      * TextBlob
      * CLiPS Pattern
      * Stanford CoreNLP
      * Gensim Lemmatizer
      * TreeTagger
   