**Based on a word network, write a program that takes a sentence and exchanges each word to another one with a similar meaning, if one exists. You are free to exclude stop words and names from this transformation.
Please include a code snippet and at least three example inputs with their respective outputs in your response.
Discuss how legible (in the sense of easy to understand) you find the transformed texts in comparison to the originals.**

In [3]:
import nltk # this we already have
nltk.download('wordnet') # this is new, download once per environment
nltk.download('omw-1.4') # same here
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [12]:
sent_1 = 'The roof of my house needs repair.'
sent_2 = 'The pretty birds eat vegetables from my garden.'
sent_3 = 'This lakeside cottage is amazing.'

In [5]:
from nltk.corpus import stopwords
nltk.download('stopwords')

# English stop words
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [13]:
def change_word(sentence):

  sent = []

  # Tokenize the sentence
  tokens = word_tokenize(sentence)

  for token in tokens:

    # Stop words are kept unchanged
    if token in stop_words:
      sent.append(token)

    else:
      synset = wn.synsets(token.lower())

      # No synset for this word: keep it unchanged
      if len(synset) == 0:
        sent.append(token)

      # Otherwise take a lemma from one of its synsets
      else:
        lm = synset[0].lemma_names()[0]

        # Prefer a lemma that differs from the original word
        if lm != token:
          sent.append(lm)
        else:
          # Fall back to the last lemma of the last synset
          sent.append(synset[-1].lemma_names()[-1])

  return sent

In [14]:
# Sentence 1
print('Transform: ' + str(' '.join(change_word(sent_1))))
print('Sentence: ' + str(sent_1))

Transform: The roof of my domiciliate need revivify .
Sentence: The roof of my house needs repair.


In [15]:
synset = wn.synsets('roof')
for batch in synset:
  print(batch.lemma_names())

['roof']
['roof']
['roof']
['ceiling', 'roof', 'cap']
['roof']


In [16]:
synset = wn.synsets('needs')
for batch in synset:
  print(batch.lemma_names())

['need', 'demand']
['need', 'want']
['motivation', 'motive', 'need']
['indigence', 'need', 'penury', 'pauperism', 'pauperization']
['necessitate', 'ask', 'postulate', 'need', 'require', 'take', 'involve', 'call_for', 'demand']
['want', 'need', 'require']
['need']
['inevitably', 'necessarily', 'of_necessity', 'needs']


The word roof was not changed: the first lemma of its first synset is roof itself, and the fallback (the last lemma of the last synset) is roof as well. The word needs was replaced by need. Overall, the transformed sentence stays close to the original and still makes sense.
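One way to avoid leaving roof unchanged would be to scan every synset for the first lemma that differs from the input word. The sketch below is only an illustration: `pick_synonym` is a hypothetical helper, and it assumes the lemma lists are passed in the order returned by `wn.synsets`. It takes the first alternative it finds, so it does nothing to select the right sense.

```python
def pick_synonym(word, lemma_lists):
    """Return the first lemma that differs from `word`, or `word` itself.

    `lemma_lists` is assumed to hold one list of lemma names per synset,
    in the order returned by wn.synsets(word).
    """
    target = word.lower()
    for lemmas in lemma_lists:
        for lemma in lemmas:
            if lemma.lower() != target:
                # Multi-word lemmas use underscores in WordNet
                return lemma.replace('_', ' ')
    return word  # no alternative exists, keep the original


# Lemma lists copied from the wn.synsets('roof') output above
roof_lemmas = [['roof'], ['roof'], ['roof'], ['ceiling', 'roof', 'cap'], ['roof']]
print(pick_synonym('roof', roof_lemmas))  # ceiling
```

A word whose synsets only ever list the word itself is returned as-is.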

In [17]:
# Sentence 2
print('Transform: ' + str(' '.join(change_word(sent_2))))
print('Sentence: ' + str(sent_2))

Transform: The passably bird rust vegetable from my garden .
Sentence: The pretty birds eat vegetables from my garden.


In [18]:
synset = wn.synsets('garden')
for batch in synset:
  print(batch.lemma_names())

['garden']
['garden']
['garden']
['garden']


The word garden has no lemma name other than itself, so it stays unchanged. Overall, the sentence no longer really makes sense after the transformation: pretty became passably and eat became rust, so the wrong senses were picked.

In [7]:
# Sentence 3
print('Transform: ' + str(' '.join(change_word(sent_3))))
print('Sentence: ' + str(sent_3))

Transform: This lakeshore bungalow is amaze .
Sentence: This lakeside cottage is amazing.


The transformed sentence remains very similar and understandable.



---



In [20]:
# car.n.01 is called a synset, or “synonym set,” a collection of synonymous words (or “lemmas”)
synset = wn.synsets('car')
synset

[Synset('car.n.01'),
 Synset('car.n.02'),
 Synset('car.n.03'),
 Synset('car.n.04'),
 Synset('cable_car.n.01')]

In [21]:
# collection of synonymous words (or “lemmas”):
for batch in synset:
  print(batch.lemma_names())

['car', 'auto', 'automobile', 'machine', 'motorcar']
['car', 'railcar', 'railway_car', 'railroad_car']
['car', 'gondola']
['car', 'elevator_car']
['cable_car', 'car']


In [22]:
# Synsets also come with a prose definition and some example sentences
for batch in synset:
  print(batch.definition())

a motor vehicle with four wheels; usually propelled by an internal combustion engine
a wheeled vehicle adapted to the rails of railroad
the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant
where passengers ride up and down
a conveyance for passengers or freight on a cable railway


In [23]:
for batch in synset:
  print(batch.examples())

['he needs a car to get to work']
['three cars had jumped the rails']
[]
['the car was on the top floor']
['they took a cable car to the top of the mountain']


In [31]:
motorcar = wn.synset('car.n.01')
types = motorcar.hyponyms()
types

[Synset('ambulance.n.01'),
 Synset('beach_wagon.n.01'),
 Synset('bus.n.04'),
 Synset('cab.n.03'),
 Synset('compact.n.03'),
 Synset('convertible.n.01'),
 Synset('coupe.n.01'),
 Synset('cruiser.n.01'),
 Synset('electric.n.01'),
 Synset('gas_guzzler.n.01'),
 Synset('hardtop.n.01'),
 Synset('hatchback.n.01'),
 Synset('horseless_carriage.n.01'),
 Synset('hot_rod.n.01'),
 Synset('jeep.n.01'),
 Synset('limousine.n.01'),
 Synset('loaner.n.02'),
 Synset('minicar.n.01'),
 Synset('minivan.n.01'),
 Synset('model_t.n.01'),
 Synset('pace_car.n.01'),
 Synset('racer.n.02'),
 Synset('roadster.n.01'),
 Synset('sedan.n.01'),
 Synset('sport_utility.n.01'),
 Synset('sports_car.n.01'),
 Synset('stanley_steamer.n.01'),
 Synset('stock_car.n.01'),
 Synset('subcompact.n.01'),
 Synset('touring_car.n.01'),
 Synset('used-car.n.01')]

**Now, based on the response to the previous question: design a program that assigns a numerical similarity score to two input sentences in terms of how similar they are with respect to where the words are within the conceptual hierarchies in WordNet.
Instead of using just unigram-level similarity, try to incorporate n-gram aspects of assigning a higher similarity to texts that contain sequences of words that are all similar to one another, lowering the similarity whenever this breaks.
For example a small dog that is hungry is very similar to a petite canine who runs in the first four words but then differs at the end.
Again, please provide a code snippet and examples (you can reuse the inputs and the outputs of the previous question as examples in this question).**

https://medium.com/@adriensieg/text-similarities-da019229c894

https://newscatcherapi.com/blog/ultimate-guide-to-text-similarity-with-python

In [25]:
# Sentence 1
print('Transform: ' + str(' '.join(change_word(sent_1))))
print('Sentence: ' + str(sent_1))

Transform: The roof of my domiciliate need revivify .
Sentence: The roof of my house needs repair.


In [26]:
sentences_1 = ['The roof of my domiciliate need revivify.', 'The roof of my house needs repair.']

In [27]:
# Sentence 2
print('Transform: ' + str(' '.join(change_word(sent_2))))
print('Sentence: ' + str(sent_2))

Transform: The passably bird rust vegetable from my garden .
Sentence: The pretty birds eat vegetables from my garden.


In [28]:
sentences_2 = ['The passably bird are rust my vegetable from my garden.', 'The pretty birds are eating my vegetables from my garden.']

In [29]:
# Sentence 3
print('Transform: ' + str(' '.join(change_word(sent_3))))
print('Sentence: ' + str(sent_3))

Transform: This lakeshore bungalow is amaze .
Sentence: This lakeside cottage is amazing.


In [30]:
sentences_3 = ['This lakeshore bungalow is amaze.', 'This lakeside cottage is amazing.']

Jaccard similarity:

In [32]:
def jaccard_similarity(x,y):
  """ Return the Jaccard similarity between two iterables.
  A string is treated as a set of characters, not of words. """
  intersection_cardinality = len(set(x) & set(y))
  union_cardinality = len(set(x) | set(y))
  return intersection_cardinality/float(union_cardinality)

In [33]:
jaccard_similarity(sentences_1[0], sentences_1[1])

0.6666666666666666

In [34]:
jaccard_similarity(sentences_2[0], sentences_2[1])

0.9545454545454546

In [35]:
jaccard_similarity(sentences_3[0], sentences_3[1])

0.6818181818181818
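Note that the three calls above pass whole strings, so `set()` collects characters rather than words, and the scores measure shared letters. A word-level variant might look like the sketch below; `word_jaccard` is a hypothetical name, and splitting on `\w+` is a simplifying assumption in place of a real tokenizer.

```python
import re

def word_jaccard(a, b):
    """Jaccard similarity over lowercased word sets rather than characters."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)

print(word_jaccard('This lakeshore bungalow is amaze.',
                   'This lakeside cottage is amazing.'))  # 0.25
```

Only "this" and "is" are shared out of eight distinct words, which shows that a purely lexical measure cannot see that lakeshore and lakeside are synonyms.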

Cosine Similarity:

In [36]:
!python -m spacy download en_core_web_md

2022-10-10 15:11:55.242597: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-md==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.4.0/en_core_web_md-3.4.0-py3-none-any.whl (42.8 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.4.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [37]:
import spacy
nlp = spacy.load('en_core_web_md')

In [38]:
embeddings_1 = [nlp(sentence).vector for sentence in sentences_1]
embeddings_2 = [nlp(sentence).vector for sentence in sentences_2]
embeddings_3 = [nlp(sentence).vector for sentence in sentences_3]

In [39]:
from math import sqrt

def squared_sum(x):
  """ Return the Euclidean norm of x (square root of the sum of squares), rounded to 3 decimals """

  return round(sqrt(sum([a*a for a in x])),3)

In [40]:
def cos_similarity(x,y):
  """ return cosine similarity between two lists """
 
  numerator = sum(a*b for a,b in zip(x,y))
  denominator = squared_sum(x)*squared_sum(y)
  return round(numerator/float(denominator),3)

In [41]:
cos_similarity(embeddings_1[0], embeddings_1[1])

0.956

In [42]:
cos_similarity(embeddings_2[0], embeddings_2[1])

0.973

In [43]:
cos_similarity(embeddings_3[0], embeddings_3[1])

0.959
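The helper above rounds each norm to three decimals before the division, which can shift the last digit of the result. A version that rounds nothing until display could be sketched as follows; `cosine` is a hypothetical name, and returning 0.0 for a zero vector is a convention chosen here.

```python
from math import sqrt

def cosine(x, y):
    """Cosine similarity with no rounding of intermediate values."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_x * norm_y)

print(round(cosine([1, 0], [1, 1]), 4))  # 0.7071
```

It accepts the same inputs as `cos_similarity`, for example the spaCy vectors stored in `embeddings_1`.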

I ran the test on the 3 sentences from question 1, using two measures: the Jaccard index and cosine similarity. Jaccard is rarely used with text data because it does not work on text embeddings: it only measures lexical overlap between the documents, and as called here on raw strings it compares sets of characters rather than words. Cosine similarity compares two vectors through the cosine of the angle between them, i.e. it checks whether the two vectors point in roughly the same direction.

The Jaccard index does not give very good results for sentences 1 and 3. Cosine does seem to be the better measure, with very similar results for every sentence (0.96 to 0.97). I am surprised, however, that the sentence that makes the least sense has the highest score (sentence 2, 0.973). One likely reason is that spaCy's sentence vector is the average of the token vectors, so the many words the two versions share dominate the score.
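Neither measure above uses word order, while the question asks for a score that rewards sequences of similar words and drops when the sequence breaks. A minimal sketch of that idea follows. It assumes that tokens are aligned position by position and that a word-level similarity function is supplied by the caller; `run_similarity`, `toy_sim` and the `SIMILAR` table are hypothetical names, and the table stands in for a WordNet-based measure such as one built on `path_similarity`.

```python
def run_similarity(tokens_a, tokens_b, word_sim, threshold=0.5):
    """Score in [0, 1] that rewards unbroken runs of similar words.

    Tokens are compared position by position; each maximal run of
    consecutive similar pairs contributes run_length ** 2.
    """
    longest = max(len(tokens_a), len(tokens_b))
    if longest == 0:
        return 1.0
    total, run = 0, 0
    for a, b in zip(tokens_a, tokens_b):
        if word_sim(a, b) >= threshold:
            run += 1
        else:
            total += run ** 2
            run = 0
    total += run ** 2  # close the final run
    return total / longest ** 2


# Toy word similarity (hypothetical stand-in for a WordNet-based measure)
SIMILAR = {frozenset(p) for p in [('small', 'petite'), ('dog', 'canine'), ('that', 'who')]}

def toy_sim(a, b):
    return 1.0 if a == b or frozenset((a, b)) in SIMILAR else 0.0

s1 = 'a small dog that is hungry'.split()
s2 = 'a petite canine who runs'.split()
print(round(run_similarity(s1, s2, toy_sim), 3))  # 0.444
```

Squaring the run length means four similar words in a row (16/36) score higher than two separate pairs (8/36). The positional alignment is the main limitation: a single inserted word shifts every later pair.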



---



In [28]:
# similarity between two words:
right = wn.synset('right_whale.n.01')
minke = wn.synset('minke_whale.n.01')
right

Synset('right_whale.n.01')

In [29]:
right.path_similarity(minke)

0.25
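As far as I understand, `path_similarity` is 1 / (1 + d), where d is the length of the shortest path between the two synsets in the is-a hierarchy, so 0.25 corresponds to a path of three links. The sketch below reproduces that arithmetic on a small hand-written graph. The `HYPERNYM` links are a simplification assumed for illustration and are not read from WordNet, and `toy_path_similarity` is a hypothetical name.

```python
from collections import deque

# Simplified is-a links, assumed for illustration (not read from WordNet)
HYPERNYM = {
    'right_whale': 'baleen_whale',
    'rorqual': 'baleen_whale',
    'minke_whale': 'rorqual',
    'baleen_whale': 'whale',
}

def toy_path_similarity(a, b, hypernym=HYPERNYM):
    """Return 1 / (1 + shortest path length) in the undirected is-a graph."""
    neighbours = {}
    for child, parent in hypernym.items():
        neighbours.setdefault(child, set()).add(parent)
        neighbours.setdefault(parent, set()).add(child)
    queue, seen = deque([(a, 0)]), {a}
    while queue:  # breadth-first search finds the shortest path first
        node, dist = queue.popleft()
        if node == b:
            return 1 / (1 + dist)
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no connecting path

print(toy_path_similarity('right_whale', 'minke_whale'))  # 0.25
```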

In [30]:
words = [ 'lakeshore', 'lakeside', 'bungalow', 'cottage', 'amaze', 'amazing']
wss = [ wn.synsets(w) for w in words ]
for (w, ss) in zip(words, wss):
  for s in ss:
    print(f'{w} is {s.min_depth()} down from {s.root_hypernyms()}')

lakeshore is 5 down from [Synset('entity.n.01')]
lakeside is 5 down from [Synset('entity.n.01')]
bungalow is 8 down from [Synset('entity.n.01')]
cottage is 8 down from [Synset('entity.n.01')]
amaze is 2 down from [Synset('affect.v.05')]
amaze is 2 down from [Synset('be.v.01')]
amazing is 2 down from [Synset('affect.v.05')]
amazing is 2 down from [Synset('be.v.01')]
amazing is 0 down from [Synset('amazing.s.01')]
amazing is 0 down from [Synset('amazing.s.02')]


In [31]:
d = dict(zip(words, wss))
similarities = dict()
for w1 in words:
  ss1 = d[w1]
  for w2 in words:
    if w1 == w2:
      continue
    ss2 = d[w2]
    for (s1, s2) in zip(ss1, ss2): 
      r1 = s1.root_hypernyms()[0]
      r2 = s2.root_hypernyms()[0]
      if r1 == r2:
        sim = s1.path_similarity(s2) # a value in [0, 1], 1 meaning "the same"
        key = (w1, w2, r1)
        value = max(sim, similarities.get(key, 0)) # highest 
        similarities[key] = value

for (w1, w2, root), value in similarities.items():
  print(f'Similarity of {w1} with {w2} is {value} in the hierarchy of {root}')

# extreme cases
print('Highest similarity', max(similarities, key = similarities.get))
print('Lowest similarity', min(similarities, key = similarities.get))

Similarity of lakeshore with lakeside is 1.0 in the hierarchy of Synset('entity.n.01')
Similarity of lakeshore with bungalow is 0.1 in the hierarchy of Synset('entity.n.01')
Similarity of lakeshore with cottage is 0.1 in the hierarchy of Synset('entity.n.01')
Similarity of lakeside with lakeshore is 1.0 in the hierarchy of Synset('entity.n.01')
Similarity of lakeside with bungalow is 0.1 in the hierarchy of Synset('entity.n.01')
Similarity of lakeside with cottage is 0.1 in the hierarchy of Synset('entity.n.01')
Similarity of bungalow with lakeshore is 0.1 in the hierarchy of Synset('entity.n.01')
Similarity of bungalow with lakeside is 0.1 in the hierarchy of Synset('entity.n.01')
Similarity of bungalow with cottage is 1.0 in the hierarchy of Synset('entity.n.01')
Similarity of cottage with lakeshore is 0.1 in the hierarchy of Synset('entity.n.01')
Similarity of cottage with lakeside is 0.1 in the hierarchy of Synset('entity.n.01')
Similarity of cottage with bungalow is 1.0 in the hie

**Using the Open Multilingual Wordnet at http://compling.hss.ntu.edu.sg/omw/, write a program that takes as input a sentence along with information about what language this sentence is written in and what language to translate it to, and then, using WordNet to map concepts, write a very rough automated translator.
Provide a code snippet, input-output examples, and a discussion on the aspects of language translations that are hard or impossible to capture just using a WordNet as a knowledge base.**

In [49]:
def translate(sentence, lang_from, lang_to):

  sent = []

  # Tokenize the sentence
  tokens = word_tokenize(sentence)

  # NLTK stop-word list matching the source language code
  # (only a few codes are mapped here; extend as needed)
  stopword_lang = {'eng': 'english', 'fra': 'french', 'ita': 'italian', 'spa': 'spanish'}
  if lang_from in stopword_lang:
    stop_words = set(stopwords.words(stopword_lang[lang_from]))
  else:
    stop_words = set()

  for token in tokens:

    # Stop words are skipped, not translated
    if token in stop_words:
      continue

    synset = wn.synsets(token.lower(), lang=lang_from)

    # Use the first synset that has a lemma in the target language;
    # words without any such synset are dropped
    for ss in synset:
      lm = ss.lemma_names(lang=lang_to)
      if len(lm) > 0:
        sent.append(lm[-1])
        break

  return sent

In [50]:
sentence = 'Le chien joue avec sa balle.'

In [46]:
translate(sentence, 'fra', 'ita')

['cane', 'temerarietà', 'palla_veloce']

My program skips stop words. Since my sentence is in French, I use the French stop-word list; every other source language needs its own stop-word list.

Sentence: Le chien joue avec sa balle.
Words translated into Italian: chien = cane, joue = temerarietà, balle = palla_veloce.

Stop words are not defined in WordNet, so it is hard to translate a complete sentence. The word chien is mapped correctly from one language to the other, but that is not the case for joue (rendered as recklessness) and balle (rendered as fastball), which change the meaning of the sentence and make it harder for a human to understand. Open problem:
-	How to determine automatically which synset to use.

https://aclanthology.org/P10-4014.pdf
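One classic heuristic for choosing a synset is to prefer the sense whose definition shares the most words with the surrounding sentence (a simplified Lesk approach; NLTK ships a comparable function, `nltk.wsd.lesk`). The sketch below only illustrates the principle: `simplified_lesk` is a hypothetical helper, and the glosses are invented English definitions, not taken from WordNet. It also assumes that the context and the glosses are in the same language.

```python
def simplified_lesk(context_tokens, sense_glosses, stop_words=frozenset()):
    """Pick the sense whose gloss shares the most words with the context.

    `sense_glosses` maps a sense label to its gloss text; ties keep the
    first sense in dictionary order.
    """
    context = {t.lower() for t in context_tokens} - stop_words
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & (set(gloss.lower().split()) - stop_words))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


# Hypothetical English glosses for two senses of "balle" (not from WordNet)
glosses = {
    'fastball': 'a baseball pitch thrown at maximum velocity',
    'ball': 'round object that a dog or a player can play with in games',
}
context = ['dog', 'play', 'ball']
print(simplified_lesk(context, glosses))  # ball
```

In the translator, the context would be the other content words of the sentence, and the candidate glosses would come from `definition()` on each synset returned for the word.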

In [54]:
synset = wn.synsets('avec',lang='fra')
for batch in synset:
  print(batch.lemma_names())
synset

[]