# **Practical 11**

**A) Aim: Multiword Expressions in NLP**

**Multi-word expressions** (MWEs) are word combinations whose linguistic properties cannot be predicted from the properties of the individual words or from the way they are combined. MWEs occur frequently and are usually highly domain-dependent, so treating them properly is essential for the success of NLP systems.

Generic NLP systems usually perform less well on texts from specific domains. One reason is clear: each domain has its own vocabulary, and it uses common words with a highly specific meaning or in a domain-specific manner.
 

For this reason, state-of-the-art NLP systems usually work best if they are adapted to a specific domain. It is therefore highly desirable to have technology that allows one to adapt an NLP system to a specific domain for MWEs, e.g., on the basis of a text corpus. 
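
One common way to find domain-specific MWE candidates in a text corpus is to rank adjacent word pairs by pointwise mutual information (PMI). The sketch below is only an illustration of that idea, not the method used in this practical: the function name `mwe_candidates`, the `min_count` cut-off, and the four-sentence corpus are all assumptions made up for the example.

```python
import math
from collections import Counter

def mwe_candidates(sentences, min_count=2):
    """Rank adjacent word pairs by PMI (hypothetical helper, for illustration only)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scored = []
    for (w1, w2), count in bigrams.items():
        # PMI overrates very rare pairs, so pairs seen fewer than min_count times are skipped
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scored.append(((w1, w2), math.log2(p_pair / (p_w1 * p_w2))))
    # Highest PMI first: these pairs co-occur more often than chance predicts
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Tiny made-up "medical" corpus (an assumption, not real data)
corpus = [
    "the patient had a heart attack last night",
    "a heart attack needs urgent care",
    "the doctor saw the patient last night",
    "the heart attack patient saw the doctor",
]
candidates = mwe_candidates(corpus)
for pair, pmi in candidates:
    print(pair, round(pmi, 2))
```

Pairs that score high here could then be passed to a tokenizer as MWEs. On a real corpus the threshold and the scoring measure would need tuning.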

In [None]:
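import nltk
from nltk.tokenize import MWETokenizer
from nltk import sent_tokenize, word_tokenize
nltk.download('punkt')  # tokenizer models; newer NLTK releases may also need 'punkt_tab'
# The backslash is escaped so that Python does not read '\k' as an (invalid) escape sequence
s = '''Good cake cost Rs.1500\\kg in Mumbai. Please buy me one of them.\n\nThanks.'''
# Token sequences registered here are merged into one token joined by the separator
mwe = MWETokenizer([('New', 'York'), ('Hong', 'Kong')], separator='_')
for sent in sent_tokenize(s):
  # Neither MWE occurs in this text, so the word tokens pass through unchanged
  print(mwe.tokenize(word_tokenize(sent)))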

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
['Good', 'cake', 'cost', 'Rs.1500\\kg', 'in', 'Mumbai', '.']
['Please', 'buy', 'me', 'one', 'of', 'them', '.']
['Thanks', '.']
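
The sample text above contains neither "New York" nor "Hong Kong", so the output does not show any merging. A minimal sketch of the merge itself follows, assuming a made-up sentence and an extra MWE (`ice cream`) added only for illustration. The tokens are split on whitespace so that no tokenizer model has to be downloaded.

```python
from nltk.tokenize import MWETokenizer

# Register two MWEs up front and add a third one afterwards
tokenizer = MWETokenizer([('New', 'York'), ('Hong', 'Kong')], separator='_')
tokenizer.add_mwe(('ice', 'cream'))

# Made-up example sentence, split on whitespace
tokens = 'I flew from New York to Hong Kong for ice cream'.split()
merged = tokenizer.tokenize(tokens)
print(merged)
# ['I', 'flew', 'from', 'New_York', 'to', 'Hong_Kong', 'for', 'ice_cream']
```

Matching is case-sensitive and works on exact token sequences, so `('New', 'York')` would not match the tokens `new york`.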


**B) Aim: Normalized Web Distance and Word Similarity**

In [None]:
pip install textdistance

Collecting textdistance
  Downloading textdistance-4.2.2-py3-none-any.whl (28 kB)
Installing collected packages: textdistance
Successfully installed textdistance-4.2.2


In [None]:
import numpy as np
import re
import textdistance 
# pip install textdistance
# the `metric` argument below needs scikit-learn>=1.2
# (older releases call it `affinity`, which was removed in 1.4)
# pip install scikit-learn
from sklearn.cluster import AgglomerativeClustering
texts = [
 'Reliance supermarket', 'Reliance hypermarket', 'Reliance', 'Reliance', 'Reliance downtown', 'Relianc market',
 'Mumbai', 'Mumbai Hyper', 'Mumbai dxb', 'mumbai airport',
 'k.m trading', 'KM Trading', 'KM trade', 'K.M. Trading', 'KM.Trading']
def normalize(text):
  """ Keep only lower-cased text and numbers"""
  return re.sub('[^a-z0-9]+', ' ', text.lower())
def group_texts(texts, threshold=0.4):
  """ Replace each text with the representative of its cluster"""
  normalized_texts = np.array([normalize(text) for text in texts])
  # Pairwise distance matrix: 1 - Jaro-Winkler similarity
  distances = 1 - np.array([[textdistance.jaro_winkler(one, another) for one in normalized_texts] for another in normalized_texts ])
  # Complete linkage with a distance threshold; the number of clusters is not fixed in advance
  clustering = AgglomerativeClustering(distance_threshold=threshold, metric="precomputed", linkage="complete", n_clusters=None).fit(distances)
  centers = dict()
  for cluster_id in set(clustering.labels_):
    index = clustering.labels_ == cluster_id
    # Representative = the member with the smallest total distance to the rest of its cluster
    centrality = distances[:, index][index].sum(axis=1)
    centers[cluster_id] = normalized_texts[index][centrality.argmin()]
  return [centers[i] for i in clustering.labels_]
print(group_texts(texts))

['reliance', 'reliance', 'reliance', 'reliance', 'reliance', 'reliance', 'mumbai', 'mumbai', 'mumbai', 'mumbai', 'km trading', 'km trading', 'km trading', 'km trading', 'km trading']
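
The cell above measures string similarity with Jaro-Winkler rather than the normalized web distance (NWD) named in the aim. For comparison, here is a minimal sketch of the NWD formula itself, which works on occurrence counts: f(x) and f(y) are the number of documents containing each term, f(x, y) the number containing both, and N the collection size. The function name and the counts below are made-up assumptions; no real search engine is queried.

```python
import math

def normalized_web_distance(fx, fy, fxy, n):
    """NWD from document counts:
    (max(log fx, log fy) - log fxy) / (log n - min(log fx, log fy))."""
    if fxy == 0:
        # Terms that never co-occur are treated as infinitely distant
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))

# Hypothetical counts: 1000 and 100 documents per term, 10 containing both,
# in a collection of one million documents
print(normalized_web_distance(1000, 100, 10, 1_000_000))
```

A value near 0 means the two terms almost always appear together. Larger values mean they co-occur less often than their individual frequencies would suggest.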


**C) Aim: Word Sense Disambiguation**

**Word sense disambiguation** (WSD), in natural language processing (NLP), is the ability to determine which meaning of a word is activated by its use in a particular context. Lexical ambiguity, whether syntactic or semantic, is one of the very first problems that any NLP system faces.

Part-of-speech (POS) taggers can resolve a word's syntactic ambiguity with a high level of accuracy. Resolving semantic ambiguity, by contrast, is the task called word sense disambiguation, and it is harder than resolving syntactic ambiguity.
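
To make the role of context concrete, here is a minimal sketch of a simplified Lesk-style approach: pick the sense whose dictionary gloss shares the most words with the sentence. The two-sense inventory, the glosses, the sense labels, and the stopword list are all hypothetical and written only for this example; a real system would take them from a resource such as WordNet.

```python
import re

# Hypothetical mini sense inventory (not taken from WordNet)
SENSES = {
    "bank": {
        "bank.financial": "an institution that accepts deposits and lends money",
        "bank.river": "sloping land beside a river or other body of water",
    }
}
STOPWORDS = {"a", "an", "the", "of", "and", "or", "that", "to", "in", "on", "i", "we"}

def content_words(text):
    """Lower-case the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def simplified_lesk(word, sentence, senses=SENSES):
    """Return the sense id whose gloss overlaps most with the sentence."""
    context = content_words(sentence)
    best_sense, best_overlap = None, -1
    for sense_id, gloss in senses[word].items():
        overlap = len(context & content_words(gloss))
        # Strict comparison: on a tie the earlier sense in the inventory wins
        if overlap > best_overlap:
            best_sense, best_overlap = sense_id, overlap
    return best_sense

print(simplified_lesk("bank", "I went to the bank to deposit money"))
print(simplified_lesk("bank", "We sat on the bank of the river and watched the water"))
```

The overlap counts exact word matches only ("deposit" does not match "deposits"), which is one reason practical systems add stemming or richer context models.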





In [None]:
#Word Sense Disambiguation (first-sense baseline)
import nltk
from nltk.corpus import wordnet as wn
nltk.download('wordnet')
def get_first_sense(word, pos=None):
  """Return the first WordNet synset for word, optionally restricted to a part of speech"""
  # This ignores context: it always returns the first sense that WordNet lists
  if pos:
    synsets = wn.synsets(word,pos)
  else:
    synsets = wn.synsets(word)
  return synsets[0]
# name() and definition() are methods, so they must be called to get their values
best_synset = get_first_sense('bank')
print ('%s: %s' % (best_synset.name(), best_synset.definition()))
best_synset = get_first_sense('set','n')
print ('%s: %s' % (best_synset.name(), best_synset.definition()))
best_synset = get_first_sense('set','v')
print ('%s: %s' % (best_synset.name(), best_synset.definition()))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
bank.n.01: sloping land (especially the slope beside a body of water)
set.n.01: a group of things of the same kind that belong together and are so used
put.v.01: put into a certain place or abstract location
