# NLTK Basics - Chunk Extraction and Document Similarity

## Chunk Extraction

** Chunk extraction, or partial parsing, is the process of extracting short, meaningful phrases from a sentence whose tokens have been tagged with their parts of speech. **



In [None]:
import nltk
nltk.download()

** First we split the input text into tokens. Then we assign a part-of-speech tag to each token and store the result as a list of (token, tag) tuples. **

In [None]:
text = "a quick brown fox jumps over the lazy dog"
tagged = nltk.pos_tag(nltk.word_tokenize(text))

In [4]:
print(tagged)

[('a', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]


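** We define the grammar that the chunk parser will use. The rule below says that a noun phrase (NP) is an optional determiner (DT), followed by any number of adjectives (JJ), followed by a noun (NN). **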

In [5]:
grammar = "NP: {<DT>?<JJ>*<NN>}"
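
** To make the pattern concrete, here is a rough stdlib-only sketch of what it matches. It is an illustration, not NLTK's own implementation: it assumes the tag sequence can be written as one string and scanned with Python's re module. The tagged list is copied from the output above, and the names tag_string, np_pattern and chunks are made up for this example. **

```python
import re

# Tagged tokens copied from the pos_tag output shown earlier.
tagged = [('a', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
          ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
          ('dog', 'NN')]

# Write the tag sequence as one string: "<DT><JJ><NN><NN><VBZ>..."
tag_string = "".join(f"<{tag}>" for _, tag in tagged)

# Optional determiner, any number of adjectives, then exactly one noun.
np_pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

chunks = []
for match in np_pattern.finditer(tag_string):
    # Convert character offsets back to token positions by counting tags.
    start = tag_string[:match.start()].count("<")
    length = match.group().count("<")
    chunks.append(" ".join(word for word, _ in tagged[start:start + length]))

print(chunks)
```

** Because the pattern ends after a single NN, "brown" closes the first chunk and "fox" forms a chunk of its own. **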

** We create a parser, "chunkParse", from the grammar defined above. **

In [6]:
chunkParse = nltk.RegexpParser(grammar)

** Parsing the tagged tokens produces a chunk tree. Iterating over tree.subtrees() yields the full tree first and then each NP subtree, so we print them in turn. Calling tree.draw() afterwards opens a window with a graphical rendering of the same tree. **

In [7]:
tree = chunkParse.parse(tagged)

In [8]:
for subtree in tree.subtrees():
    print(subtree)

(S
  (NP a/DT quick/JJ brown/NN)
  (NP fox/NN)
  jumps/VBZ
  over/IN
  (NP the/DT lazy/JJ dog/NN))
(NP a/DT quick/JJ brown/NN)
(NP fox/NN)
(NP the/DT lazy/JJ dog/NN)
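
** If only the noun phrases are needed, the subtrees can be filtered by label. The following is a minimal sketch: it hard-codes the tagged output from above so that no tagger data has to be downloaded, and the names chunk_parser, np_tree and phrases are illustrative. **

```python
import nltk

# Tagged tokens copied from the pos_tag output shown earlier.
tagged = [('a', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
          ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
          ('dog', 'NN')]

chunk_parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
np_tree = chunk_parser.parse(tagged)

# Keep only subtrees labelled NP; the root (labelled S) is skipped.
phrases = []
for subtree in np_tree.subtrees(filter=lambda t: t.label() == "NP"):
    # leaves() returns the (word, tag) pairs under this subtree.
    phrases.append(" ".join(word for word, tag in subtree.leaves()))

print(phrases)
```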


In [9]:
tree.draw()

## Document Similarity

** A simple demo of comparing the contents of two text documents and evaluating the similarity of one document to the other. **

** The code below calculates the similarity between two documents. If the similarity is greater than 40%, it prints "Hired". Otherwise it prints "Not Hired". **

** I use the spaCy module for this. I import the module and load the language model that will be used. **

In [12]:
import spacy
nlp = spacy.load('en_core_web_sm')

** Next I store the file paths of the two documents, which I will use to read their contents. **

In [24]:
d1="C:\Users\HP\Desktop\Intelligent Cloud Hub.txt"
d2="C:\Users\HP\Desktop\Shakespeare.txt"

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape (<ipython-input-24-dae1d5f0b5c9>, line 1)

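** The error occurs because "\U" in a normal string literal starts a Unicode escape sequence. It can be resolved by doubling each backslash, so that every '\' is treated as a literal character and not as the start of an escape sequence. **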

In [25]:
d1="C:\\Users\\HP\\Desktop\\Intelligent Cloud Hub.txt"
d2="C:\\Users\\HP\\Desktop\\Shakespeare.txt"
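
** Doubling the backslashes is one option. As a hedged sketch of two alternatives, a raw string literal or pathlib with forward slashes should produce the same path text. The variable names escaped, raw, desktop and joined are made up for this example; PureWindowsPath only manipulates the path text and does not touch the file system. **

```python
from pathlib import PureWindowsPath

# Option 1: escape every backslash, as in the cell above.
escaped = "C:\\Users\\HP\\Desktop\\Shakespeare.txt"

# Option 2: a raw string literal leaves backslashes uninterpreted.
raw = r"C:\Users\HP\Desktop\Shakespeare.txt"

# Option 3: build the path with pathlib using forward slashes.
desktop = PureWindowsPath("C:/Users/HP/Desktop")
joined = str(desktop / "Shakespeare.txt")

# All three spellings describe the same path text.
print(escaped == raw == joined)
```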

** The next step is to read the contents of these files and pass them through the spaCy pipeline to obtain one Doc object per document. **

In [26]:
# Open both files in a with block so they are closed after reading
with open(d1) as f1, open(d2) as f2:
    content1 = f1.read()
    content2 = f2.read()

doc1 = nlp(content1)
doc2 = nlp(content2)

print(doc1)
print(doc2)

In an attempt to build an AI-ready workforce, Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills. Envisioned as a three-year collaborative program, Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services. As part of the program, the Redmond giant which wants to expand its reach and is planning to build a strong developer ecosystem in India with the program will set up the core AI infrastructure and IoT Hub for the selected campuses. The company will provide AI development tools and Azure AI services such as Microsoft Cognitive Services, Bot Services and Azure Machine Learning.According to Manish Prakash, Country General Manager-PS, Health and Education, Microsoft India, said, "With AI being the defining technology of our time, it is transforming lives and industry a

** The final and most important step is to evaluate the similarity of the two documents that were loaded above. **

** We compute the similarity, check whether it is greater than a threshold (0.4, i.e. 40%, here), and print the corresponding output. **

In [22]:
similarity = doc1.similarity(doc2)
if similarity > 0.4:
    print('Hired')
else:
    print('Not Hired')
print(similarity)

Hired
0.8894884006180047


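** The similarity comes out at approximately 0.889 even though the contents of the two documents are unrelated. One likely reason is that the small model en_core_web_sm does not ship with word vectors, so its similarity scores are a poor measure of how related two texts are. A model that includes vectors, such as en_core_web_md, may give a more meaningful result. **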
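
** As a rough cross-check, the same threshold test can be run with a plain word-count cosine similarity. This is a different technique from what spaCy computes: it is a stdlib-only sketch that compares raw term frequencies. The two sample strings are short stand-ins, not the actual file contents, and the names term_counts, cosine_similarity and THRESHOLD are made up for this example. **

```python
import math
import re
from collections import Counter

def term_counts(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-frequency vectors."""
    counts_a, counts_b = term_counts(text_a), term_counts(text_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)

THRESHOLD = 0.4  # same 40% cut-off as in the cell above

# Short stand-in texts (assumption: not the real documents).
sample_a = "Microsoft announced a cloud program for students"
sample_b = "Shall I compare thee to a summer's day"

score = cosine_similarity(sample_a, sample_b)
print('Hired' if score > THRESHOLD else 'Not Hired')
print(score)
```

** With word counts, texts that share almost no vocabulary score close to 0, which gives a simple baseline to compare against the model-based score. **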