# **Natural Language Processing**

<img align='center' width='600' src="https://images.gr-assets.com/misc/1535611813-1535611813_goodreads_misc.gif">

In [2]:
import spacy
import nltk

## **Text data**

- **Lexicons**
  
  Can be defined as the vocabulary of a person, language or branch of knowledge. A lexicon can be thought of as a dictionary of terms, which are called lexemes.

- **Phonemes**
  
  The basic units of speech sound, that is, the individual sounds produced when speaking.

- **Grapheme**
  
  Groups of one or more letters that represent these individual sounds (phonemes). The word *spoon* consists of five letters that represent four phonemes.

- **Morphemes**
  
  The smallest meaningful unit in a language. The word *unstoppable* is composed of three morphemes:
  - *un* --> a bound morpheme signifying not
  - *stop* --> the root morpheme (a free morpheme)
  - *able* --> a bound morpheme (suffix) signifying can be done

<br>

- **Tokenization**
  
  Can be thought of as segmentation: splitting text into smaller units such as words (word tokenization) or sentences (sentence tokenization).
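To make the idea of segmentation concrete, here is a minimal sketch using only the standard library. The sample text and the regular expression are assumptions chosen for illustration; real tokenizers (such as the spaCy and NLTK ones used later) handle many more edge cases.

```python
import re

text = "Hello there. How are you? I am fine!"

# Naive word tokenization: split on whitespace (punctuation stays attached)
word_tokens = text.split()

# Naive sentence tokenization: split after ., ! or ? followed by whitespace
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

print(word_tokens)
print(sentence_tokens)
```

Note how the whitespace split keeps `there.` and `you?` as single tokens, which is one reason dedicated tokenizers are preferred.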

## **Spacy**

```
pip install spacy
python -m spacy download en_core_web_sm
```

```python
nlp = spacy.blank('en')  # an empty English pipeline (tokenizer only); components or a pretrained model can be added later
```
First we must download a pretrained model (such as `en_core_web_sm`); then we can load it like this:

``` python
import en_core_web_sm

nlp = spacy.load("en_core_web_sm")
# or we can use another method ---------------------------
nlp = en_core_web_sm.load()
```

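A blank pipeline contains only the tokenizer: it splits text into tokens, but it cannot split sentences, tag parts of speech or recognize entities until components are added.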


In [3]:
# spacy.blank creates a tokenizer-only pipeline, so no model download is needed here
# (en_core_web_sm must be downloaded first: python -m spacy download en_core_web_sm)
nlp = spacy.blank("en")

sentence = 'Hello My dear friend I will be late to night.'
doc = nlp(sentence)

for token in doc[:4]:
    print(token)
    
print('-'*30)
# Doc objects support indexing and slicing
print(doc[-2])

Hello
My
dear
friend
------------------------------
night


In [11]:
sentence = '''I'm realy happy since my dad say "Your are the best son!", then gave me 100$.'''
doc = nlp(sentence)

print([token.text for token in doc])

['I', "'m", 'a', 'realy', 'happy', 'since', 'my', 'dad', 'say', '"', 'Your', 'are', 'the', 'best', 'son', '!', '"', ',', 'then', 'gave', 'me', '100$.']


In [4]:
print(type(nlp), type(doc), type(token))

<class 'spacy.lang.en.English'> <class 'spacy.tokens.doc.Doc'> <class 'spacy.tokens.token.Token'>


### Token attributes

Token attributes are specified using internal IDs in many places.

For more information, see the [spaCy attributes documentation](https://spacy.io/api/attributes).

In [5]:
sentence = '''I'm realy happy since my dad say "Your are the best son!", then gave me $100.'''
doc = nlp(sentence)
print(f"{'token':^10}|{' Attributes':^30}")
print('-'*40)
for token in doc:
    print(f"{token.text:^10}|", end=' ')
    attribut = ''
    if token.is_alpha:
        attribut += ' - is alpha'
    if token.is_digit:
        attribut += ' - is digit'
    if token.is_currency:
        attribut += ' - is curency'
    if token.is_quote:
        attribut += ' - is quote'
    if token.is_punct:
        attribut += ' - is punctuation'
        
    print(f"{attribut:^30}")

  token   |          Attributes          
----------------------------------------
    I     |           - is alpha          
    'm    |                               
  realy   |           - is alpha          
  happy   |           - is alpha          
  since   |           - is alpha          
    my    |           - is alpha          
   dad    |           - is alpha          
   say    |           - is alpha          
    "     |   - is quote - is punctuation 
   Your   |           - is alpha          
   are    |           - is alpha          
   the    |           - is alpha          
   best   |           - is alpha          
   son    |           - is alpha          
    !     |        - is punctuation       
    "     |   - is quote - is punctuation 
    ,     |        - is punctuation       
   then   |           - is alpha          
   gave   |           - is alpha          
    me    |           - is alpha          
    $     |          - is curency         
   100    |   

In [4]:
with open('students.txt', 'r') as f:
    lines = f.readlines()
lines

['My high school, 8th grade students information\n',
 '\n',
 'Name        Birth day       Email\n',
 '----        --------        ------\n',
 'Virat       5 June, 1882    virat@gmail.com\n',
 'ali         10 August 1884  ali_new@yahoo.com\n',
 'joe         3 February 1883 joe.1883@gmail.com']

In [8]:
print(''.join(lines))

My high school, 8th grade students information

Name        Birth day       Email
----        --------        ------
Virat       5 June, 1882    virat@gmail.com
ali         10 August 1884  ali_new@yahoo.com
joe         3 February 1883 joe.1883@gmail.com


In [9]:
sentences = ' '.join(lines)

In [8]:
doc = nlp(sentences)

for token in doc:
    if token.like_email:
        print(token)

virat@gmail.com
ali_new@yahoo.com
joe.1883@gmail.com


### REGEX
Now do the same task using regular expressions.
See this [list of regular-expression snippets](https://github.com/CoreyMSchafer/code_snippets/blob/master/Python-Regular-Expressions/snippets.txt) for reference, or watch the [video on YouTube](https://www.youtube.com/watch?v=K8L6KVGG-7o).

<br>
<br>

| Special character  |  Action  |
|--------------------|----------|
|     .              | Any Character Except New Line |
|    \d              | Digit (0-9) |
|    \D              | Not a Digit (0-9) |
|    \w              | Word Character (a-z, A-Z, 0-9, _) |
|    \W              | Not a Word Character |
|    \s              | Whitespace (space, tab, newline) |
|    \S              | Not Whitespace (space, tab, newline) |
|**----------------**|**----------------------------**|
|    \b              | Word Boundary |
|    \B              | Not a Word Boundary |
|    ^               | Beginning of a String |
|    $               | End of a String |
|**----------------**|**----------------------------**|
|    [ ]             | Matches Characters in brackets |
|    [^ ]            | Matches Characters NOT in brackets |
|    \|              | Either Or |
|    ( )             | Group |
|  **Quantifiers:**  |    ---    |
|    *               | 0 or More |
|    +               | 1 or More |
|    ?               | 0 or One |
|   {3}              | Exact Number |
|   {3,4}           | Range of Numbers (Minimum, Maximum) |
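
A few entries from the table can be tried directly with Python's built-in `re` module. This is a small sketch; the sample string is made up for illustration.

```python
import re

text = "Order 66 shipped on 2024-01-15 to cat, cot and cut lovers."

digits = re.findall(r"\d+", text)                  # \d with + : runs of digits
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)     # {n} : exact number of repetitions
words = re.findall(r"\bc[aou]t\b", text)           # [ ] inside word boundaries \b
either = re.findall(r"cat|cut", text)              # | : either or
year_month = re.search(r"(\d{4})-(\d{2})", text).groups()  # ( ) : groups
optional = re.findall(r"colou?r", "color colour")  # ? : 0 or one

print(digits)
print(dates)
print(words)
print(either)
print(year_month)
print(optional)
```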

In [11]:
import re
# "_" must be part of the local-part class, otherwise "ali_new" is cut down to "new"
pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+')

for match in pattern.finditer(sentences):
    print(match.group())

virat@gmail.com
ali_new@yahoo.com
joe.1883@gmail.com


In [12]:
doc = nlp("let's go play game")
[token.text for token in doc]

['let', "'s", 'go', 'play', 'game']

<img align='center' width='600' src="https://spacy.io/images/pipeline-design.svg">

In [13]:
nlp.add_pipe('sentencizer')

<spacy.pipeline.sentencizer.Sentencizer at 0x293d28f50>

In [16]:
doc = nlp("let's go play game. Do you like video games? Yes, of course")
for sent in doc.sents:
    print(sent)

let's go play game.
Do you like video games?
Yes, of course


### Pipelines
Now check the `nlp` object's pipeline components before and after loading a pretrained model.

In [13]:
nlp.pipe_names

['sentencizer']

In [17]:
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [18]:
doc = nlp("let's go play game. Do you like video games?")

print(f"{'token':-^10} {'POS':-^10} {'lemma':-^10}")
for token in doc:
    print(f"{token.text:^10} {token.pos_:^10} {token.lemma_:^10}")

--token--- ---POS---- --lemma---
   let        VERB       let    
    's        PRON        us    
    go        VERB        go    
   play       VERB       play   
   game       NOUN       game   
    .        PUNCT        .     
    Do        AUX         do    
   you        PRON       you    
   like       VERB       like   
  video       NOUN      video   
  games       NOUN       game   
    ?        PUNCT        ?     


**Stopwords**

Stopwords are words such as *a, an, the, in, at, ...* that occur frequently in text corpora and do not carry much information.

In [19]:
from spacy.lang.en.stop_words import STOP_WORDS

#Check for stop words
print(list(STOP_WORDS)[:15])

['per', 'fifty', 'everyone', 'whether', 'quite', 'it', 'did', 'once', 'each', 'former', 'something', 'nowhere', 'therein', 'were', 'whom']


In [23]:
items = ['regarding', 'am', 'her', 'if', 'hello']
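# Only the 15 stop words printed above are checked here, so stop words outside that sample are kept
stops = list(STOP_WORDS)[:15]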

for w in items:
    if w not in stops:
        print(w)
# [w for w in items if w not in stops]

regarding
am
her
if
hello


In [30]:
print("Tokens with Stopword:\n", ' | '.join([token.text for token in doc]), "\n")
print("Tokens without Stopword:\n", ' | '.join([token.text for token in doc if not token.is_stop]))

Tokens with Stopword:
 let | 's | go | play | game | . | Do | you | like | video | games | ? 

Tokens without Stopword:
 let | play | game | . | like | video | games | ?


### "not" as a stop word
The word "not" is in the `STOP_WORDS` list, so stop-word removal deletes it. To preserve the negation in a sentence, it is important to take "not" out of the stop-word list first.

In [31]:
from spacy.lang.en.stop_words import STOP_WORDS

pos = "This movie is good"
neg = "This movie is not good"

pos_doc = nlp(pos.lower())
neg_doc = nlp(neg.lower())

stops = list(STOP_WORDS)
stops.remove('not')

pos_tokens = [token.text for token in pos_doc if not token.is_stop]
pos_tokens2 = [token.text for token in pos_doc if token.text not in stops]

neg_tokens = [token.text for token in neg_doc if not token.is_stop]
neg_tokens2 = [token.text for token in neg_doc if token.text not in stops]


print(f"{'Sentence':^30}| {'Stops removal':^20}| {'Custom Stops':^20}", end='\n'+'-'*75+'\n')

print(f"{pos:^30}| {' '.join(pos_tokens):^20}| {' '.join(pos_tokens2):^20}")
print(f"{neg:^30}| {' '.join(neg_tokens):^20}| {' '.join(neg_tokens2):^20}")

           Sentence           |    Stops removal    |     Custom Stops    
---------------------------------------------------------------------------
      This movie is good      |      movie good     |      movie good     
    This movie is not good    |      movie good     |    movie not good   


In [16]:
doc = nlp("Apple worth $1 trillion")

print(f"{'entity':^20} {'ent.label_':^20} {'explain':^30}")
for ent in doc.ents:
    print(f"{ent.text:^20} {ent.label_:^20} {spacy.explain(ent.label_):^10}")

       entity             ent.label_                 explain            
       Apple                 ORG          Companies, agencies, institutions, etc.
    $1 trillion             MONEY         Monetary values, including unit


## **NLTK**

The first step of preprocessing is to tokenize the sentences. NLTK has various types of tokenizers, which are listed [here](https://www.nltk.org/api/nltk.tokenize.html).

In [33]:
from nltk.tokenize import (
    word_tokenize,
    wordpunct_tokenize,
    sent_tokenize,
    RegexpTokenizer,
    TreebankWordTokenizer,
    SExprTokenizer,
    TweetTokenizer
    )

sentence = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA."

### `nltk.tokenize.word_tokenize`

Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the words and punctuation in a string

In [34]:
print(word_tokenize(sentence))

['A', 'Rolex', 'watch', 'costs', 'in', 'the', 'range', 'of', '$', '3000.0', '-', '$', '8000.0', 'in', 'USA', '.']


### `nltk.tokenize.wordpunct_tokenize`

NLTK also provides a simpler, regular-expression based tokenizer, which splits text on whitespace and punctuation.

**Warning**: if `sentence` is a `bytes` object rather than a `str`, decode it first: `sentence.decode("utf8")`

In [35]:
print(wordpunct_tokenize(sentence))

['A', 'Rolex', 'watch', 'costs', 'in', 'the', 'range', 'of', '$', '3000', '.', '0', '-', '$', '8000', '.', '0', 'in', 'USA', '.']
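The same split can be reproduced with the standard `re` module. The sketch below assumes that `wordpunct_tokenize` is based on the pattern `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation); treat the pattern as an assumption and check the NLTK source if exact behavior matters.

```python
import re

sentence = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA."

# Either a run of word characters, or a run of characters that are
# neither word characters nor whitespace (i.e. punctuation)
tokens = re.findall(r"\w+|[^\w\s]+", sentence)
print(tokens)
```

Because `.` is not a word character, `3000.0` is split into `3000`, `.` and `0`, which is why this tokenizer is a poor fit for numeric text.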


### `nltk.tokenize.sent_tokenize`

  We can also operate at the level of sentences, using the sentence tokenizer directly as follows

In [38]:
sentence2 = sentence + " I think It's really expensive."
print(sentence2, end ='\n\n')

print("Tokens: ")
for token in sent_tokenize(sentence2):
    print(token)

A Rolex watch costs in the range of $3000.0 - $8000.0 in USA. I think It's really expensive.

Tokens: 
A Rolex watch costs in the range of $3000.0 - $8000.0 in USA.
I think It's really expensive.


### `nltk.tokenize.regexp`

A RegexpTokenizer splits a string into substrings using a regular expression. For example, the following tokenizer forms tokens out of: <br>
- alphabetic sequences: `\w+` <br>
- money expressions: `\$[0-9\.]+`, and <br>
- any other non-whitespace sequences: `\S+`

In [39]:
reg_tokenizer = RegexpTokenizer(r'\w+|\$[0-9\.]+|\S+')
print(reg_tokenizer.tokenize(sentence))

['A', 'Rolex', 'watch', 'costs', 'in', 'the', 'range', 'of', '$3000.0', '-', '$8000.0', 'in', 'USA', '.']


### `nltk.tokenize.treebank`

The Treebank tokenizer uses regular expressions to tokenize text as in the Penn Treebank. This implementation is a port of the tokenizer sed script written by Robert McIntyre and available [here](http://www.cis.upenn.edu/~treebank/tokenizer.sed).

In [40]:
tree_tokenizer = TreebankWordTokenizer()
print(tree_tokenizer.tokenize(sentence))

['A', 'Rolex', 'watch', 'costs', 'in', 'the', 'range', 'of', '$', '3000.0', '-', '$', '8000.0', 'in', 'USA', '.']


In [23]:
# from nltk.tokenize.stanford import StanfordTokenizer
# stan_tokenizer = StanfordTokenizer(options={"americanize": True})
# print(stan_tokenizer.tokenize(sentence))

### `nltk.tokenize.sexpr`
SExprTokenizer is used to find *parenthesized expressions* in a string. In particular, it divides a string into a sequence of substrings that are either parenthesized expressions (including any nested parenthesized expressions), or other whitespace-separated tokens.

In [41]:
sentence3 = "A Rolex watch (which is a luxury brand) costs about $5000 (sometimes even higher)."
# used to find parenthesized expressions in a string.
se_tokenizer = SExprTokenizer()

print("SExpr output: \n",se_tokenizer.tokenize(sentence3), end="\n\n")
print("TreeBank output:\n", tree_tokenizer.tokenize(sentence3))

SExpr output: 
 ['A', 'Rolex', 'watch', '(which is a luxury brand)', 'costs', 'about', '$5000', '(sometimes even higher)', '.']

TreeBank output:
 ['A', 'Rolex', 'watch', '(', 'which', 'is', 'a', 'luxury', 'brand', ')', 'costs', 'about', '$', '5000', '(', 'sometimes', 'even', 'higher', ')', '.']


### `nltk.tokenize.TweetTokenizer`

In [42]:
sentence4 = "@Mhddaraaa I just bought a newwwww car! :D #happiness #audi <3"
t_tokenizer = TweetTokenizer()
print(t_tokenizer.tokenize(sentence4))

# to strip @handles and shorten runs of repeated letters
t_tokenizer = TweetTokenizer(reduce_len=True, strip_handles=True)
print(t_tokenizer.tokenize(sentence4))

['@Mhddaraaa', 'I', 'just', 'bought', 'a', 'newwwww', 'car', '!', ':D', '#happiness', '#audi', '<3']
['I', 'just', 'bought', 'a', 'newww', 'car', '!', ':D', '#happiness', '#audi', '<3']


### `nltk.tokenize.punkt`
This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.

In [46]:
from nltk.tokenize.punkt import PunktSentenceTokenizer
from nltk.corpus import state_union

train_text = state_union.raw("2005-GWBush.txt")
test_text = state_union.raw("2006-GWBush.txt")

tokenizer = PunktSentenceTokenizer(train_text)
tokenized = tokenizer.tokenize(test_text)

tagged = []
for w in tokenized:
    words = nltk.word_tokenize(w)
    tagged.append(nltk.pos_tag(words))

In [47]:
# Let's check the second sentence (index 1)
print(f"{'tokens':^20}| {'POS':^10}")
print('-'*30)
for item in tagged[1]:
    print(f"{item[0]:^20}| {item[1]:^10}")

       tokens       |    POS    
------------------------------
        Mr.         |    NNP    
      Speaker       |    NNP    
         ,          |     ,     
        Vice        |    NNP    
     President      |    NNP    
       Cheney       |    NNP    
         ,          |     ,     
      members       |    NNS    
         of         |     IN    
      Congress      |    NNP    
         ,          |     ,     
      members       |    NNS    
         of         |     IN    
        the         |     DT    
      Supreme       |    NNP    
       Court        |    NNP    
        and         |     CC    
     diplomatic     |     JJ    
       corps        |     NN    
         ,          |     ,     
   distinguished    |     JJ    
       guests       |    NNS    
         ,          |     ,     
        and         |     CC    
       fellow       |     JJ    
      citizens      |    NNS    
         :          |     :     
       Today        |     VB    
        our 

### **Word Normalization**

Most of the time we don't want to keep every individual word form we have ever encountered in our vocabulary. There are several reasons to normalize words:

- **Unify different spellings of the same word or phrase**

  for example: UN and U.N.

- **Bring words to their root**

  am, is, are --> be

- **Remove inflections**

  car, cars, car's --> car

- **Drop words which do not convey much meaning**

  the articles *a*, *an* and *the*, or Wh-words such as *when*, *where*, ... . Removing them is a technique called **Stopword removal**. However, whether to apply any of these steps depends heavily on the use case.

**Note**: **Case Folding** is the strategy of converting all the letters in a corpus to lowercase.
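Case folding can be sketched with plain string methods. The word list is made up for illustration; `str.casefold()` is a more aggressive variant of `str.lower()` that also handles characters such as the German `ß`.

```python
words = ["Apple", "APPLE", "apple", "Straße", "STRASSE"]

lowered = [w.lower() for w in words]     # simple lowercasing
folded = [w.casefold() for w in words]   # Unicode-aware case folding

print(sorted(set(lowered)))
print(sorted(set(folded)))
```

With `lower()` the two German spellings stay distinct, while `casefold()` maps both to `strasse`, so the vocabulary shrinks from three entries to two.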

#### 1. Stemming
A crude attempt to remove the inflectional forms of a word and bring it to a base form called the **stem**. The chopped-off pieces are referred to as **affixes**.


- **Over-stemming**

  Occurs when words are stemmed to the same root although they should have been stemmed to different roots.

- **Under-stemming**

  In contrast, occurs when words that should have been stemmed to the same root are not.
  
[NLTK stem package](https://www.nltk.org/api/nltk.stem.html)
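Both failure modes are easy to reproduce with a deliberately naive suffix stripper. This is a toy sketch, not the Porter algorithm; the suffix list and the minimum stem length of 3 are arbitrary assumptions.

```python
# Hypothetical suffix list, checked in order
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Over-stemming: unrelated words collapse to the same stem
print(naive_stem("news"), naive_stem("new"))

# Under-stemming: related words end up with different stems
print(naive_stem("running"), naive_stem("run"))
```

Here `news` and `new` both become `new` (over-stemming), while `running` becomes `runn` and no longer matches `run` (under-stemming).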

##### `nltk.stem.porter.PorterStemmer`

- Supports the *English* language
- Works only with *strings*

In [26]:
from nltk.stem.porter import PorterStemmer

words = ['cars', 'caresses', 'flies', 'dies', 'mules', 'died', 'stating',
         'sized', 'men', 'women', 'oxen', 'geese', 'children', 'teeth', 'feet', 'mice']
p_stemmer = PorterStemmer()
stems = [p_stemmer.stem(w) for w in words]

for i, (w, l) in enumerate(zip(['words'] + words, ['stems'] + stems)):
    print(f"{w:^10}|{l:^10}")
    if i == 0:
        print('-' * 20)

  words   |  stems   
--------------------
   cars   |   car    
 caresses |  caress  
  flies   |   fli    
   dies   |   die    
  mules   |   mule   
   died   |   die    
 stating  |  state   
  sized   |   size   
   men    |   men    
  women   |  women   
   oxen   |   oxen   
  geese   |   gees   
 children | children 
  teeth   |  teeth   
   feet   |   feet   
   mice   |   mice   


##### `nltk.stem.snowball.SnowballStemmer`

- SnowballStemmer is an improvement on PorterStemmer and supports several languages
- Works with both *strings* and *Unicode* input

In [27]:
from nltk.stem.snowball import SnowballStemmer

print('Here is the supported languages:\n',SnowballStemmer.languages, end='\n\n')

s_stemmer = SnowballStemmer(language='english')
stems = [s_stemmer.stem(w) for w in words]

for i, (w, l) in enumerate(zip(['words'] + words, ['stems'] + stems)):
    print(f"{w:^10}|{l:^10}")
    if i == 0:
        print('-' * 20)

Here is the supported languages:
 ('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')

  words   |  stems   
--------------------
   cars   |   car    
 caresses |  caress  
  flies   |   fli    
   dies   |   die    
  mules   |   mule   
   died   |   die    
 stating  |  state   
  sized   |   size   
   men    |   men    
  women   |  women   
   oxen   |   oxen   
  geese   |   gees   
 children | children 
  teeth   |  teeth   
   feet   |   feet   
   mice   |   mice   


#### 2. Lemmatization
Unlike stemming, wherein a few characters are removed from words using crude methods, **lemmatization** is a process wherein the context is used to convert a word to its meaningful base form. Lemmatization algorithms try to identify the lemma of a word by taking into account the neighborhood context of the word, its **Part-Of-Speech (POS)** tag, the meaning of the word, ... .

<br>

##### `nltk.stem.wordnet.WordNetLemmatizer`
**WordNet** is a lexical database of English.

**Note**: the WordNet lemmatizer works well only if the **POS** tag is also provided; without it, every word is treated as a noun.
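To see why the POS tag matters, consider a toy lookup-table lemmatizer. The table and the function name `toy_lemmatize` are hypothetical and exist only for this illustration; WordNet uses a far larger database and morphological rules.

```python
# Hypothetical (word, pos) -> lemma table, for illustration only
LEMMA_TABLE = {
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
    ("leaves", "v"): "leave",
    ("leaves", "n"): "leaf",
    ("are", "v"): "be",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos); fall back to the word itself."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

for word in ("saw", "leaves", "are"):
    as_noun = toy_lemmatize(word)
    as_verb = toy_lemmatize(word, "v")
    print(f"{word:^8}| noun: {as_noun:^6}| verb: {as_verb:^6}")
```

The same surface form maps to different lemmas depending on the tag, and a verb such as `are` is left unchanged when it is looked up as a noun.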

In [28]:
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(w) for w in words]

for i, (w, l) in enumerate(zip(['words'] + words, ['lemmas'] + lemmas)):
    print(f"{w:^10}|{l:^10}")
    if i == 0:
        print('-' * 20)

  words   |  lemmas  
--------------------
   cars   |   car    
 caresses |  caress  
  flies   |   fly    
   dies   |    dy    
  mules   |   mule   
   died   |   died   
 stating  | stating  
  sized   |  sized   
   men    |   men    
  women   |  woman   
   oxen   |    ox    
  geese   |  goose   
 children |  child   
  teeth   |  teeth   
   feet   |   foot   
   mice   |  mouse   


In [29]:
sentence = "We are putting in efforts to enhance our understanding of Lemmatization"

nltk.pos_tag(sentence.split())

[('We', 'PRP'),
 ('are', 'VBP'),
 ('putting', 'VBG'),
 ('in', 'IN'),
 ('efforts', 'NNS'),
 ('to', 'TO'),
 ('enhance', 'VB'),
 ('our', 'PRP$'),
 ('understanding', 'NN'),
 ('of', 'IN'),
 ('Lemmatization', 'NN')]

In [48]:
from nltk.corpus import wordnet

def get_pos(token):
    """
    Map a Penn Treebank POS tag to the WordNet POS constant that lemmatize() accepts.
    If there is no match for the word's POS, NOUN is returned.
    """
    
    tags_dict = {
        'J': wordnet.ADJ,
        'N': wordnet.NOUN,
        'V': wordnet.VERB,
        'R': wordnet.ADV,
    }

    tag = nltk.pos_tag([token])[0][1][0].upper()

    return tags_dict.get(tag, wordnet.NOUN)

In [31]:
sentence = "We are putting in efforts to enhance our understanding of Lemmatization"
# Case Folding
sentence = sentence.lower()

pos_tags = nltk.pos_tag(sentence.split())

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(w) for w in sentence.split()]
lemmas_with_pos = [lemmatizer.lemmatize(w, get_pos(w)) for w in sentence.split()]

items = zip(['words'] + sentence.split(), ['lemmas'] + lemmas, ['lemmas with pos'] + lemmas_with_pos)
for i, (w, l, l_p) in enumerate(items):
    if l != l_p:
        l_p = '-> ' + l_p + ' <-'
    print(f"{w:^15}|{l:^15}|{l_p:^15}")
    if i == 0:
        print('-' * 50)

     words     |    lemmas     |-> lemmas with pos <-
--------------------------------------------------
      we       |      we       |      we       
      are      |      are      |   -> be <-    
    putting    |    putting    |   -> put <-   
      in       |      in       |      in       
    efforts    |    effort     |    effort     
      to       |      to       |      to       
    enhance    |    enhance    |    enhance    
      our      |      our      |      our      
 understanding | understanding |-> understand <-
      of       |      of       |      of       
 lemmatization | lemmatization | lemmatization 


#### 3. Stopword removal
Stopwords are words such as *a, an, the, in, at, ...* that occur frequently in text corpora and do not carry much information.

In [32]:
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
print("First 20 stopword:\n\n",list(stop)[:20])

First 20 stopword:

 ['so', 'shan', 'no', 'yours', 'those', 'couldn', "shan't", 'not', 'the', 'should', 'before', 'have', 'won', 'about', 'and', 'will', 'during', 'haven', 'to', 'some']


In [33]:
# To classify questions, Wh-words are important, so we remove them from the stopwords
stop = set(stopwords.words('english'))
wh_words = ['who', 'what', 'where', 'when', 'why', 'how', 'which', 'whom']

print(f"len Stopwords before wh_word removal: {len(stop)}")
for w in wh_words:
    stop.remove(w)

print(f"len Stopwords after wh_word removal: {len(stop)}")

len Stopwords before wh_word removal: 179
len Stopwords after wh_word removal: 171


In [34]:
sentence = "We are putting in efforts to enhance our understanding of Lemmatization"
# Case Folding
sentence = sentence.lower()

pos_tags = nltk.pos_tag(sentence.split())

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(w) for w in sentence.split()]

lemmas_without_stop = [lemmatizer.lemmatize(w) if w not in stop else "---" for w in sentence.split()]

items = zip(['words'] + sentence.split(), ['lemmas'] + lemmas, ['no stopword'] + lemmas_without_stop)
for i, (w, l, s) in enumerate(items):
    print(f"{w:^15}#{l:^15}#{s:^20}")
    if i == 0:
        print('#' * 50)

     words     #    lemmas     #    no stopword     
##################################################
      we       #      we       #        ---         
      are      #      are      #        ---         
    putting    #    putting    #      putting       
      in       #      in       #        ---         
    efforts    #    effort     #       effort       
      to       #      to       #        ---         
    enhance    #    enhance    #      enhance       
      our      #      our      #        ---         
 understanding # understanding #   understanding    
      of       #      of       #        ---         
 lemmatization # lemmatization #   lemmatization    


### **N-grams**
Everything we have covered so far used tokens of size 1, that is, single words. Sentences generally contain names, phrases or other compound terms such as *living room, coffee mug, look up, ...*. These phrases convey a specific meaning when two or more words are used together. When the words are used individually, they carry a different meaning altogether and the inherent meaning of the compound term is somewhat lost.

N-grams of size 1, 2 and 3 are called **unigrams, bigrams and trigrams**. The naming system can be extended to larger n-grams, but most **NLP** tasks use only trigrams or lower.

In [49]:
from nltk.util import ngrams

sentence = "Natural language processing is developing right now and there will be a lot of new job."

tokens = sentence.lower().split()
bigrams = list(ngrams(tokens, 2))
[' '.join(token) for token in bigrams]

['natural language',
 'language processing',
 'processing is',
 'is developing',
 'developing right',
 'right now',
 'now and',
 'and there',
 'there will',
 'will be',
 'be a',
 'a lot',
 'lot of',
 'of new',
 'new job.']
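N-grams are simple enough to build without NLTK. The helper below is a minimal sketch (the name `make_ngrams` and the sample sentence are made up for illustration): it zips the token list against shifted copies of itself.

```python
def make_ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens` as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "natural language processing is fun".split()

unigrams = make_ngrams(tokens, 1)
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A sentence with `k` tokens yields `k - n + 1` n-grams, so the five tokens here give four bigrams and three trigrams.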

In [36]:
from nltk.corpus import wordnet

### **Wordnet**
WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.

`wordnet.synset`: a set of synonyms that share a common meaning.

In the NLTK docs, a synset name is composed of three parts, `lemma`, `pos` and `number`:

- `<lemma>` is the word’s morphological stem
- `<pos>` is one of the module attributes ADJ, ADJ_SAT, ADV, NOUN or VERB
    - n:    NOUN
    - v:    VERB
    - a:    ADJECTIVE
    - s:    ADJECTIVE SATELLITE
    - r:    ADVERB
- `<number>` is the sense number, used to disambiguate word meanings, counting from 0
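The three parts can be pulled out of a synset name with plain string handling. This is a small sketch; `parse_synset_name` and `POS_NAMES` are hypothetical helpers, not part of NLTK.

```python
# Letter codes from the list above mapped to readable names
POS_NAMES = {
    "n": "NOUN",
    "v": "VERB",
    "a": "ADJECTIVE",
    "s": "ADJECTIVE SATELLITE",
    "r": "ADVERB",
}

def parse_synset_name(name):
    """Split 'lemma.pos.number' into (lemma, pos, number)."""
    # rsplit from the right so a lemma containing dots stays intact
    lemma, pos, number = name.rsplit(".", 2)
    return lemma, pos, int(number)

for name in ("plan.n.01", "course_of_study.n.01", "program.v.02"):
    lemma, pos, number = parse_synset_name(name)
    print(f"{lemma:^20}| {POS_NAMES[pos]:^6}| {number}")
```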

In [37]:
syns = wordnet.synsets('program')
[syn.name() for syn in syns]

['plan.n.01',
 'program.n.02',
 'broadcast.n.02',
 'platform.n.02',
 'program.n.05',
 'course_of_study.n.01',
 'program.n.07',
 'program.n.08',
 'program.v.01',
 'program.v.02']

In [45]:
print(f"{'synonyms for program':^20}| {'POS':^5}| {'Definition':^50}")
print('-'*100)
for syn in syns:
    lemma, pos, _ = syn.name().split('.')
    print(f"{lemma:^20}| {pos:^5}| {syn.definition():^20}")

synonyms for program|  POS |                     Definition                    
----------------------------------------------------------------------------------------------------
        plan        |   n  | a series of steps to be carried out or goals to be accomplished
      program       |   n  | a system of projects or services intended to meet a public need
     broadcast      |   n  | a radio or television show
      platform      |   n  | a document stating the aims and principles of a political party
      program       |   n  | an announcement of the events that will occur as part of a theatrical or sporting event
  course_of_study   |   n  | an integrated course of academic studies
      program       |   n  | (computer science) a sequence of instructions that a computer can interpret and execute
      program       |   n  | a performance (or series of performances) at a public presentation
      program       |   v  | arrange a program of or for
      program       |   v  

In [39]:
# Print the lemmas of each synset in a custom, readable format
print(f"{'synonyms of program':^20}| {'lemmas':^70}")
print("#"*121)
for syn in syns:
    print(f"{syn.name().split('.')[0]:^20}|", end=' ')
    for lemma in syn.lemmas():
        print(f"{lemma.name():^19},", end=' ')
        
    print("\n", "-" * 120)

synonyms of program |                                 lemmas                                
#########################################################################################################################
        plan        |        plan        ,       program      ,      programme     , 
 ------------------------------------------------------------------------------------------------------------------------
      program       |       program      ,      programme     , 
 ------------------------------------------------------------------------------------------------------------------------
     broadcast      |      broadcast     ,       program      ,      programme     , 
 ------------------------------------------------------------------------------------------------------------------------
      platform      |      platform      , political_platform ,  political_program ,       program      , 
 --------------------------------------------------------------------------

In [40]:
# What if we want some usage examples for a specific sense?
for example in syns[0].examples():
    print(example)

they drew up a six-step plan
they discussed plans for a new bond issue


In [50]:
# Find the antonyms of each synonym (synset) of a word
def syn_ant(word, print_=True):
    
    # Look up the synsets of the word
    syns = wordnet.synsets(word)
    synonyms = []
    antonyms = {}
    
    # Collect the synonym names without their POS and sense number
    for syn in syns:
        synonyms.append(syn.name().split('.')[0])
        ants = []
        # Iterate over the lemmas of each synset and collect their antonyms
        for lemma in syn.lemmas():
            # Check whether this lemma has any antonyms
            if ant := lemma.antonyms():
                ants.append(ant[0].name())
                
        # If this name is already in the dictionary, extend its list of antonyms
        lemma = syn.name().split('.')[0]
        if lemma in antonyms.keys() and ants:
            antonyms[lemma] += ants
        # Otherwise add the name to the dictionary
        elif lemma not in antonyms.keys():
            antonyms[lemma] = ants

    synonyms = list(set(synonyms))
    if print_:
        num = len(max(antonyms.values(), key=len)) * 15
        print(f"{'Synonyms of '+word:^20}| {'Antonyms':^{int(num)}}")
        num += 25
        print('-' * num)
        
        for k, v in antonyms.items():
            print(f"{k:^20}|", end='')
            if values := [value for value in v]:
                for v in values:
                    print(f"{v:^15}", end='')
            else:
                print(f"{'--':^15}", end='')
            print()
            
    return synonyms, antonyms

In [42]:
_, antonyms = syn_ant('good')

  Synonyms of good  |                                          Antonyms                                         
-------------------------------------------------------------------------------------------------------------------
        good        |     evil         evilness          bad          badness          bad           evil      
     commodity      |      --       
        full        |      --       
     estimable      |      --       
     beneficial     |      --       
       adept        |      --       
        dear        |      --       
     dependable     |      --       
     effective      |      --       
        well        |      ill      
     thoroughly     |      --       


In [51]:
# Check the similarity between two words
def similarity(w1, w2):
    # Build the synset name of the first noun sense of each word and look up the synsets
    w1_s = wordnet.synset(w1 + '.n.01')
    w2_s = wordnet.synset(w2 + '.n.01')

    # Use wup_similarity (Wu-Palmer) to score the similarity between the two synsets
    score = w1_s.wup_similarity(w2_s)
    print(f"{w1} and {w2} similarity is: {score}")

In [52]:
similarity('ship', 'boat')
similarity('ship', 'car')
similarity('ship', 'goat')

ship and boat similarity is: 0.9090909090909091
ship and car similarity is: 0.6956521739130435
ship and goat similarity is: 0.2962962962962963
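The Wu-Palmer score is based on how deep the two concepts and their closest common ancestor sit in the hierarchy: `2 * depth(lcs) / (depth(a) + depth(b))`. The sketch below applies that formula to a tiny hand-made taxonomy. The taxonomy is hypothetical and much shallower than WordNet, so the numbers differ from the NLTK output above; only the ordering is comparable.

```python
# Hypothetical toy taxonomy (child -> parent); NOT WordNet's real hierarchy
PARENT = {
    "vehicle": "entity",
    "animal": "entity",
    "vessel": "vehicle",
    "car": "vehicle",
    "ship": "vessel",
    "boat": "vessel",
    "goat": "animal",
}

def path_to_root(node):
    """Return the list of nodes from `node` up to the root."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def wup(a, b):
    """Wu-Palmer similarity on the toy taxonomy."""
    ancestors_b = set(path_to_root(b))
    # Lowest common subsumer: first ancestor of `a` that is also an ancestor of `b`
    lcs = next(n for n in path_to_root(a) if n in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))

for other in ("boat", "car", "goat"):
    print(f"ship and {other} similarity is: {wup('ship', other):.3f}")
```

As in the WordNet result, `ship` is closest to `boat`, then `car`, then `goat`, because the shared ancestor gets shallower each time.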
