# Demonstrating the recognition pipeline

### The Pipeline for Syntax tree generation

The following image shows the steps taken to convert a natural language sentence into a document (a spaCy `Doc` object), which is later used to generate Python code.

![Text](https://spacy.io/assets/img/pipeline.svg)

### Import all necessary dependencies 

In [1]:
import nltk
import spacy
from spacy import displacy
from IPython.display import HTML, display
print('Imports successful')

Imports successful


### Load spaCy model for recognition

In [4]:
nlp = spacy.load('en')

### A utility to display tables for better visualisation of output

In [5]:
def displayAsTable(data):
    # Render a list of rows as an HTML table: one <tr> per row, one <td> per cell.
    rows = (
        '<td>{}</td>'.format('</td><td>'.join(str(cell) for cell in row))
        for row in data
    )
    display(HTML('<table><tr>{}</tr></table>'.format('</tr><tr>'.join(rows))))

In [6]:
input_text = input("Input a sentence or a group of sentences: ")
doc = nlp(input_text)

Input a sentence or a group of sentences: Create a list of 10 integers


## Step 1 : Tokenisation and Part-of-Speech Tagging

Tokenisation breaks down the sentence into a list of tokens.

**The algorithm can be summarized as follows:**

1. Iterate over space-separated substrings
2. Check whether we have an explicitly defined rule for this substring. If we do, use it.
3. Otherwise, try to consume a prefix.
4. If we consumed a prefix, go back to the beginning of the loop, so that special-cases always get priority.
5. If we didn't consume a prefix, try to consume a suffix.
6. If we can't consume a prefix or suffix, look for "infixes" — stuff like hyphens etc.
7. Once we can't consume any more of the string, handle it as a single token.
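
The steps above can be sketched in plain Python. This is only an illustrative toy, not spaCy's actual tokenizer: the rule sets `SPECIAL_CASES`, `PREFIX_RE`, `SUFFIX_RE` and `INFIX_RE` are hypothetical and far smaller than the ones a real language model ships with.

```python
import re

# Hypothetical rule sets, chosen only for illustration.
SPECIAL_CASES = {"don't": ["do", "n't"], "U.K.": ["U.K."]}
PREFIX_RE = re.compile(r"""^[(\["']""")       # opening brackets and quotes
SUFFIX_RE = re.compile(r"""[)\]"'.,!?]$""")   # closing brackets, quotes, punctuation
INFIX_RE = re.compile(r"([-~])")              # hyphen-like characters inside a word


def tokenize(text):
    tokens = []
    for substring in text.split():            # step 1: space-separated substrings
        suffixes = []                         # suffixes are emitted after the core
        while substring:
            if substring in SPECIAL_CASES:    # step 2: explicit rules win
                tokens.extend(SPECIAL_CASES[substring])
                break
            prefix = PREFIX_RE.search(substring)
            if prefix:                        # steps 3-4: strip a prefix, restart
                tokens.append(prefix.group())
                substring = substring[prefix.end():]
                continue
            suffix = SUFFIX_RE.search(substring)
            if suffix:                        # step 5: strip a suffix, restart
                suffixes.insert(0, suffix.group())
                substring = substring[:suffix.start()]
                continue
            # steps 6-7: split on infixes; without any infix this is one token
            tokens.extend(p for p in INFIX_RE.split(substring) if p)
            break
        tokens.extend(suffixes)
    return tokens


print(tokenize('Create a "well-known" list (don\'t stop).'))
```

Because the special-case lookup sits at the top of the loop, it is re-checked after every prefix or suffix is stripped, which is how `(don't` still ends up as `(`, `do`, `n't`.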

In [7]:
token_list = []
for token in doc:
    if not token.is_stop:
        token_list.append([token.text, token.pos_, spacy.explain(token.pos_), spacy.explain(token.dep_)])
displayAsTable(token_list)

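| Token | POS tag | POS description | Dependency description |
|---|---|---|---|
| Create | VERB | verb | |
| list | NOUN | noun | direct object |
| 10 | NUM | numeral | |
| integers | NOUN | noun | object of preposition |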


## Step 2 : Word similarity 


In [8]:
words = nlp(u'create develop initialise start')

similarity_table = []

for token1 in words:
    for token2 in words:
        similarity_table.append([token1.text, token2.text, token1.similarity(token2)])

displayAsTable(similarity_table[:3])
displayAsTable(similarity_table[-2:-1])

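| Token 1 | Token 2 | Similarity |
|---|---|---|
| create | create | 1.0 |
| create | develop | 0.5023777 |
| create | initialise | 0.2719381 |


| Token 1 | Token 2 | Similarity |
|---|---|---|
| start | initialise | 0.47232231 |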
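
By default, spaCy's `similarity` is a cosine similarity between the vectors of the two objects. The sketch below shows that measure on its own. The three-dimensional vectors in `toy_vectors` are made up for illustration and are not taken from any spaCy model, so the scores will not match the table above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, NOT real word embeddings.
toy_vectors = {
    'create': [1.0, 0.0, 1.0],
    'develop': [1.0, 1.0, 0.0],
}

print(cosine_similarity(toy_vectors['create'], toy_vectors['create']))
print(cosine_similarity(toy_vectors['create'], toy_vectors['develop']))
```

A word compared with itself always scores 1.0 (up to floating-point rounding), which matches the `create`/`create` row in the output.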


In [9]:
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text, sep='|')

a list|list|dobj|Create
10 integers|integers|pobj|of


## Preorder traversal

The dependency tree formed for the sentence is shown below.

In [10]:
options = {'compact': False, 'bg': 'rgb(66, 133, 244)',
           'color': 'white'}

spacy.displacy.render(doc, style='dep', jupyter=True, options=options)

### Functions for preorder traversal

In [11]:
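sents = list(doc.sents)
tokens = []  # filled by preorder_util in visiting order

def get_sent_root(pos=0):
    # Root token of the dependency tree of sentence number `pos`.
    return sents[pos].root

def preorder_util(root):
    # Visit the node first, then its left children, then its right children.
    if root is not None:
        print(root.text, end=', ')
        tokens.append(root.text)
        for left in root.lefts:
            preorder_util(left)
        for right in root.rights:
            preorder_util(right)

def preorder_traverse():
    # Only the first sentence of the input is traversed.
    root = get_sent_root()
    preorder_util(root)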

In [12]:
preorder_traverse()

Create, list, a, of, integers, 10, 
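
To see why the traversal yields this order, the same visiting rule can be run on a hand-built tree that needs no spaCy model. The `Node` class below is a hypothetical stand-in for a spaCy token, keeping only `text`, `lefts` and `rights`. The tree shape is an assumption, reconstructed from the noun-chunk heads and the traversal output above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a token: its text plus left and right children."""
    text: str
    lefts: List['Node'] = field(default_factory=list)
    rights: List['Node'] = field(default_factory=list)


def preorder(node):
    """Return texts in preorder: the node, then left children, then right children."""
    order = [node.text]
    for child in node.lefts + node.rights:
        order.extend(preorder(child))
    return order


# Assumed dependency tree for "Create a list of 10 integers".
tree = Node('Create', rights=[
    Node('list', lefts=[Node('a')], rights=[
        Node('of', rights=[
            Node('integers', lefts=[Node('10')]),
        ]),
    ]),
])

print(', '.join(preorder(tree)))
```

Returning a list instead of appending to a global keeps the function reusable across sentences. The notebook's `preorder_util` fills the shared `tokens` list instead.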

### Remove stopwords

In [13]:
import nltk
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [14]:
tokens

['Create', 'list', 'a', 'of', 'integers', '10']

In [15]:
# Compare in lower case: the NLTK stop word list contains only lower-case entries.
filtered_sentence = [w for w in tokens if w.lower() not in stop_words]
filtered_sentence

['Create', 'list', 'integers', '10']