## WordNet

21 June 2023

**Topics:**  
#nlp #wordnet #spellings #nltk #regex #wordlength #autocorrect

In [1]:
# imports

from nltk.corpus import wordnet

In [24]:
syns = wordnet.synsets('spring')
print('Syns for spring:', syns)

Syns for spring: [Synset('spring.n.01'), Synset('spring.n.02'), Synset('spring.n.03'), Synset('spring.n.04'), Synset('give.n.01'), Synset('leap.n.01'), Synset('jump.v.01'), Synset('form.v.03'), Synset('bounce.v.01'), Synset('spring.v.04'), Synset('spring.v.05')]


In [30]:
# name of synset

print(syns[0].name())

spring.n.01


In [26]:
# just the word

print(syns[0].lemmas()[0].name())

spring


In [31]:
# definition of the word

print('Definition of syn `**',syns[0].lemmas()[0].name(),'**` is :',syns[0].definition())

Definition of syn `** spring **` is : the season of growth


In [40]:
# examples of the word

print('Examples of syn "', syns[0].lemmas()[0].name(), '" are :', syns[0].examples())

Examples of syn " spring " are : ['the emerging buds were a sure sign of spring', 'he will hold office until the spring of next year']


### Similarity between words using WordNet

In [39]:
w1 = wordnet.synset('ship.n.01') # look up one specific sense by its full name: word.pos.sense_number
w2 = wordnet.synset('boat.n.01')

print(w1.wup_similarity(w2)*100,'%')

90.9090909090909 %
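
Wu-Palmer similarity scores two senses by how deep their closest shared ancestor sits in the hypernym hierarchy, relative to the depth of the senses themselves: `2 * depth(lcs) / (depth(a) + depth(b))`. The sketch below applies that formula to a tiny hand-made taxonomy. The `PARENT` table and the helper names are hypothetical; this is not WordNet's real hierarchy or NLTK's implementation, so the scores will differ from what `wup_similarity` returns.

```python
# Toy hypernym tree (child -> parent); hypothetical data, not WordNet's real hierarchy.
PARENT = {
    "vehicle": "entity",
    "vessel": "vehicle",
    "ship": "vessel",
    "boat": "vessel",
    "car": "vehicle",
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def wup(a, b):
    """Wu-Palmer score: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first node that is also an ancestor of a
    # is the deepest shared ancestor (least common subsumer).
    lcs = next(n for n in path_to_root(b) if n in ancestors_of_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wup("ship", "boat"))  # 0.75 (shared ancestor "vessel" at depth 3, both words at depth 4)
print(wup("ship", "car"))   # lower, because the shared ancestor "vehicle" is shallower
```

Siblings that share a deep, specific ancestor score close to 1, while words that only meet near the root score low.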


## re.compile()

*Context:* The re.compile() function is used to compile a regular expression pattern into a pattern object, which can then be used for various operations such as searching, matching, and replacing strings based on the specified pattern.

In [51]:
import re

str1 = "Emma's luck 4 numbers are 251 761 231 451"

# pattern as string
str_pattern = r"\d{3}" # get digits with length 3

# re.compile() to get a re.Pattern object
reg_pattern = re.compile(str_pattern)

print(type(reg_pattern))

<class 're.Pattern'>


In [52]:
# apply the regex pattern object

result = reg_pattern.findall(str1)
print('Results from reg_pattern(str1):', result)

Results from reg_pattern(str1): ['251', '761', '231', '451']


In [53]:
# Another string
str2 = "Kelly's 33 luck numbers are 111 212 415 12124"

result = reg_pattern.findall(str2)
print('Results from reg_pattern(str2):', result)

Results from reg_pattern(str2): ['111', '212', '415', '121']
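
Note that `\d{3}` also matches the first three digits of a longer run, which is why `12124` contributes `121` in the output above. If only standalone three-digit numbers are wanted, one option is to add `\b` word-boundary anchors. A minimal sketch; the name `exact_pattern` is just for illustration:

```python
import re

str2 = "Kelly's 33 luck numbers are 111 212 415 12124"

# \b asserts a word boundary, so the three digits cannot be part of a longer number
exact_pattern = re.compile(r"\b\d{3}\b")

print(exact_pattern.findall(str2))  # ['111', '212', '415']
```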


## Fixing Word Lengthening 

_We will use regex for this_

Tip: No English word has the same letter repeated more than twice in a row, e.g. wood, speed, letter.

In [55]:
# import re

# a function to fix the lengthening of a string:
# any character repeated 3 or more times in a row is reduced to exactly two
def fix_lengthening(string):
    ptr = re.compile(r"(.)\1{2,}")
    return ptr.sub(r"\1\1", string)

In [65]:
# 💀🎩 - taking example of 'Brook' a character's name from one-piece

print(fix_lengthening('brooook'))

brook


In [62]:
# checking with two repeated letters instead of one

print(fix_lengthening('brooookkkkkkk')) # each run of repeated letters is cut down to two

brookk


### 📝 Spell Correction

We will use the *autocorrect* library.  

**!pip install autocorrect**

In [68]:
from autocorrect import Speller

spell = Speller(lang = 'en') # initiate a speller object
print(type(spell))

<class 'autocorrect.Speller'>


**Spell check on different words**

In [69]:
print(spell('mussage'))

message


In [70]:
print(spell('survice'))

service


In [71]:
print(spell('hte'))

the


In [81]:
print(spell('caaaaaar')) # does not give a proper result when letters are repeated in the string;
# the word-lengthening fix is required here first

aaaaaa
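
One way to handle elongated words like this is to collapse the repeated letters first and only then hand the word to a spell checker. The sketch below puts `fix_lengthening` in front of any corrector passed in as a callable. `normalize` and `toy_speller` are hypothetical helpers written for this illustration, and `toy_speller` is only a tiny lookup-table stand-in so the block runs without extra packages. In the notebook, the `spell` object created earlier could be passed instead, although what it returns for `'caar'` is not verified here.

```python
import re

def fix_lengthening(string):
    # collapse any run of 3+ identical characters down to two
    return re.compile(r"(.)\1{2,}").sub(r"\1\1", string)

def normalize(word, corrector):
    """Collapse elongated letters, then apply the given spell corrector."""
    return corrector(fix_lengthening(word))

def toy_speller(word):
    # hypothetical stand-in for a real speller: a lookup table of known fixes
    known_fixes = {"caar": "car", "soo": "so"}
    return known_fixes.get(word, word)

print(fix_lengthening("caaaaaar"))         # caar
print(normalize("caaaaaar", toy_speller))  # car
```

Keeping two letters rather than one preserves legitimate doubles such as `brook`, and leaves the remaining single-letter fix to the spell checker.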


### Spell Checker (pattern.en)

**!pip install pattern**

In [85]:
from pattern.en import suggest

In [86]:
print(suggest('mussage'))

[('message', 0.6216216216216216), ('massage', 0.3783783783783784)]


In [87]:
print(suggest('survice'))

[('service', 0.9253112033195021), ('survive', 0.07468879668049792)]


In [88]:
print(suggest('hte'))

[('the', 0.8653201565642368), ('he', 0.13408515883485067), ('ate', 0.00022706139307570876), ('hate', 0.0002162489457863893), ('hue', 0.00012974936747183358), ('te', 1.0812447289319465e-05), ('htm', 1.0812447289319465e-05)]


In [90]:
print(suggest('caaaar'))

[('caesar', 0.6666666666666666), ('bazaar', 0.3333333333333333)]


**Side note:**  

The suggest function in pattern.en utilizes a statistical approach to generate spelling suggestions based on known words and their frequencies in a given corpus. It considers various factors, such as edit distance (the number of character changes required to convert one word into another) and word frequency, to determine possible correct spellings for a misspelled word.  
_Source: Internet_  
_Purpose: To learn the underlying concepts and understand why some words get higher weights/scores than others_
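
The idea described in the side note can be sketched in a few lines: generate every string one edit away from the misspelling, keep the ones found in a vocabulary, and weight them by relative frequency. This is not pattern.en's actual code. The `WORD_COUNTS` table is made-up toy data and the function names are hypothetical, so the scores are illustrative only.

```python
import string

# Made-up word counts for illustration; a real checker derives these from a large corpus.
WORD_COUNTS = {"the": 800, "he": 150, "hate": 30, "ate": 20}

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def suggest_sketch(word):
    """Return (candidate, relative weight) pairs, highest weight first."""
    if word in WORD_COUNTS:
        return [(word, 1.0)]
    # keep only the one-edit neighbours that are known words
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    total = sum(WORD_COUNTS[w] for w in candidates)
    scored = [(w, WORD_COUNTS[w] / total) for w in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(suggest_sketch("hte"))
# [('the', 0.8), ('he', 0.15), ('hate', 0.03), ('ate', 0.02)]
```

All four candidates are a single edit away from `hte`, so the ranking here comes entirely from the word counts: the more frequent a known word is, the larger its share of the total weight.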