# Leveraging Linguistics


We are going to pick a simple use case and see how we can solve it. Then we repeat the process on a slightly different text corpus, and so on.

This helps us build intuition on how to use linguistics in NLP. As mentioned, I am going to use spaCy here, but you are free to use NLTK or an equivalent. Their APIs and style differ, but the underlying theme remains the same.

**Approach**:
This section is dedicated to introducing you to the ideas and tools from several decades of linguistics. The traditional way to introduce them is to take one idea at a time, talk about it at length, and then put everything together in one piece, like magic.

Here, I am going to do it the other way around. We will solve 2 problems and look at the tools in the process. Instead of talking to you about a Number 8 spanner, I am giving you a car engine and the toolbox, and introducing each tool as I use it myself.

**Key Idea**: The Natural Language Pipeline

Most NLP tasks are solved in a sequential pipeline, with results from one component feeding to the next. 

There is a wide variety of data structures used to store the pipeline results and intermediate steps. Here, for simplicity, I am going to use only the data structures already in spaCy and native Python ones like lists and dictionaries.
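To make the pipeline idea concrete, here is a minimal sketch in plain Python. The component names (`tokenize`, `tag_capitalized`, `run_pipeline`) are hypothetical and are not spaCy's API. The only point is that each step consumes the output of the previous one and stores its result in an ordinary dictionary.

```python
# A toy pipeline: each component takes the running state and returns it enriched.
# All names here are illustrative assumptions, not spaCy functions.

def tokenize(state):
    # Naive whitespace tokenizer; real tokenizers also split off punctuation.
    state["tokens"] = state["text"].split()
    return state

def tag_capitalized(state):
    # Crude stand-in for a tagger: keep tokens that start with a capital letter.
    state["capitalized"] = [tok for tok in state["tokens"] if tok[:1].isupper()]
    return state

def run_pipeline(text, components):
    state = {"text": text}
    for component in components:
        state = component(state)  # the output of one step feeds the next
    return state

result = run_pipeline("Tom plays Bansoori and Guitar", [tokenize, tag_capitalized])
print(result["tokens"])
print(result["capitalized"])
```

The second component only works because the first one has already run, which is why the order of components in a pipeline matters.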


**Challenges**:

Here, we will tackle the following real-life inspired challenges: 
- Redacting names from any document e.g. for GDPR compliance
- Making quizzes from any text e.g. from a Wikipedia article

## Getting Started

You can install spaCy via conda or pip. Since I am in a conda environment, I use the conda install from below: 

In [1]:
# !conda install -y spacy 
# !pip install spacy

Let's download the English language model provided by spaCy. We are going to use 'en_core_web_lg'; the 'lg' at the end stands for large. This means that this is the most comprehensive and best performing model that spaCy releases for general purpose use.

You only need to do this once.

In [2]:
# !python -m spacy download en_core_web_lg

If there is an error above, you can use the smaller model as well. 

Try:
- Windows Shell: ```python -m spacy download en``` as **Administrator**
- Linux Terminal: ```sudo python -m spacy download en```

In [3]:
import spacy
from spacy import displacy # for visualization
nlp = spacy.load('en_core_web_lg')

In [4]:
spacy.__version__

'2.0.11'

**Introducing textacy**:

Textacy is a very underappreciated set of tools around spaCy. Its tagline is exactly what it does: NLP, before and after spaCy. It implements tools which use spaCy under the hood, ranging from data streaming utilities for production use to higher level text clustering functions.

You can install textacy via either pip or conda. On conda, it is available on the 'conda-forge' channel rather than the default channel. We specify this by adding the '-c' flag followed by the channel name.

In [5]:
# !conda install -c conda-forge textacy 
# !pip install textacy

In [6]:
import textacy

## Redacting Names with Named Entity Recognition

**Challenge: Replace all human names with [REDACTED] in free text**

Consider that you are a new engineer at European Bank Co. In preparation for GDPR, the bank is scrubbing the names of customers from all its old records, especially internal communications like emails and memos. They ask you to do this.

The first way is to look up the names of your customers and match each of them against all your emails. This can be painfully slow and error-prone. For instance, you might refer to a customer named John D'Souza simply as DSouza in an email. An exact match for D'Souza would never find that mention, so it would never be scrubbed.
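The brittleness of exact matching is easy to reproduce. The sketch below uses a made-up customer list and email, and a hypothetical helper `redact_exact`, purely to show the failure mode described above.

```python
# Hypothetical customer record and email, used only to show why exact lookup is brittle.
customers = ["John D'Souza", "D'Souza"]
email = "Please call DSouza about the overdue loan."

def redact_exact(text, names):
    # Replace each known name only where it appears verbatim.
    for name in names:
        text = text.replace(name, "[REDACTED]")
    return text

scrubbed = redact_exact(email, customers)
print(scrubbed)  # the surname survives because it is spelled without the apostrophe
```

Adding every spelling variant to the lookup list is possible, but the list grows quickly and still misses variants nobody anticipated.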


Here, we will use an automatic NLP technique to assist us. We will parse all our emails with spaCy and simply replace the person names with the token [REDACTED]. This would be at least 5-10x faster than matching millions of substrings against millions of substrings.

We will use a small excerpt from a Harry Potter book, talking about flu as an example. 

In [7]:
text = "Madam Pomfrey, the nurse, was kept busy by a sudden spate of colds among the staff and students. Her Pepperup potion worked instantly, though it left the drinker smoking at the ears for several hours afterward. Ginny Weasley, who had been looking pale, was bullied into taking some by Percy."

In [8]:
# Parse the text with spaCy. This runs the entire NLP pipeline.
doc = nlp(text)

'doc' now contains a parsed version of the text. We can use it to do anything we want!
For example, this will print out all the named entities that were detected:

In [9]:
for entity in doc.ents:
    print(f"{entity.text} ({entity.label_})")

Pomfrey (PERSON)
Pepperup (ORG)
several hours (TIME)
Ginny Weasley (PERSON)
Percy (PERSON)


The spaCy object 'doc' has an attribute 'ents' which stores all detected entities. In order to find them, spaCy has done a few things behind the scenes for us, e.g.
- Sentence Segmentation i.e. break the long text into smaller sentences
- Tokenization i.e. break each sentence into individual words or tokens
- Stop Word flagging i.e. mark words like _a, an, the, of_ (spaCy flags them via `token.is_stop` and keeps them in the document, which is why they still appear in our text)
- NER i.e. using statistical techniques, find out which entities are there in the text and label them with an entity type

In [10]:
doc.ents

(Pomfrey, Pepperup, several hours, Ginny Weasley, Percy)

The 'doc' object has a specific attribute called 'ents', short for entities, which we can use to look up all entities in our text. Additionally, each entity has a label.

Tip: In spaCy, all information is stored by numeric hashing. So, `entity.label` will be a numeric entry like 378, while `entity.label_` will be human readable e.g. PERSON. 

In [11]:
entity.label, entity.label_

(378, 'PERSON')
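The idea of keeping an integer internally and a readable string for humans can be sketched with a two-way lookup table. This is not spaCy's implementation: `ToyStringStore` is a hypothetical class, and it hands out sequential integers rather than hashes, so the numbers will not match the ones spaCy prints.

```python
class ToyStringStore:
    """Hypothetical two-way lookup between strings and integer IDs."""

    def __init__(self):
        self._id_by_string = {}
        self._string_by_id = []

    def add(self, string):
        # Reuse the existing ID if we have seen this string before.
        if string not in self._id_by_string:
            self._id_by_string[string] = len(self._string_by_id)
            self._string_by_id.append(string)
        return self._id_by_string[string]

    def lookup(self, key):
        # Translate an integer ID back into its readable string.
        return self._string_by_id[key]

store = ToyStringStore()
person_id = store.add("PERSON")
org_id = store.add("ORG")
print(person_id, store.lookup(person_id))
```

Comparing two integers is cheaper than comparing two strings, which is one reason a library might prefer to work with IDs internally and only translate back for display.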

In spaCy, human-readable labels can also be explained using the simple `spacy.explain(label)` syntax:

In [12]:
spacy.explain('GPE')

'Countries, cities, states'

Using spaCy's NER, let's write a simple function to replace each PERSON name with [REDACTED]: 

In [13]:
def redact_names(text):
    doc = nlp(text)
    redacted_sentence = []
    for token in doc:
        if token.ent_type_ == "PERSON":
            redacted_sentence.append("[REDACTED]")
        else:
            redacted_sentence.append(token.string)
    return "".join(redacted_sentence)

The function takes in text as a string and parses it into a doc object using the nlp object which we loaded earlier. Then it traverses each token in the document (remember tokenization?). Each token is added to a list. If the token has the entity type of a person, [REDACTED] is added in its place instead.

At the end, we reconstruct the sentence by converting this list back to a string.

As an exercise, try doing the above without rebuilding the text token by token, e.g. by working on the original string using the character offsets of each entity.

In [14]:
redact_names(text)

'Madam [REDACTED], the nurse, was kept busy by a sudden spate of colds among the staff and students. Her Pepperup potion worked instantly, though it left the drinker smoking at the ears for several hours afterward. [REDACTED][REDACTED], who had been looking pale, was bullied into taking some by [REDACTED].'

This output is still a leaky faucet if you are trying to make GDPR-compliant edits. By using two [REDACTED] blocks instead of one, we are disclosing the number of words in a name. This can be seriously harmful if we were to use this in some other context, e.g. redacting location or organisation names too.

Let's fix this:

In [15]:
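def redact_names(text):
    doc = nlp(text)
    redacted_sentence = []
    # Merge each multi-token entity into a single token.
    # Span.merge() matches the spaCy 2.0 API used here; newer releases use doc.retokenize() instead.
    for ent in doc.ents:
        ent.merge()
    for token in doc:
        if token.ent_type_ == "PERSON":
            # Keep the whitespace that followed the name so adjacent words don't run together.
            redacted_sentence.append("[REDACTED]" + token.whitespace_)
        else:
            redacted_sentence.append(token.string)
    return "".join(redacted_sentence)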

We do this by merging each entity into a single token, as a separate step after the pipeline has run and before we iterate over the tokens.

In [16]:
redact_names(text)

'Madam [REDACTED], the nurse, was kept busy by a sudden spate of colds among the staff and students. Her Pepperup potion worked instantly, though it left the drinker smoking at the ears for several hours afterward. [REDACTED], who had been looking pale, was bullied into taking some by [REDACTED].'
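An alternative that leaves the parsed document untouched is to collapse runs of adjacent PERSON tokens while rebuilding the string. The sketch below is a pure-Python illustration: the `(text, entity_type)` pairs are written by hand to stand in for parsed tokens, and `redact_runs` is a hypothetical helper, not part of spaCy.

```python
from itertools import groupby

# Hand-written (text_with_trailing_space, entity_type) pairs standing in for parsed tokens.
tokens = [
    ("Ginny ", "PERSON"), ("Weasley", "PERSON"), (", ", ""),
    ("who ", ""), ("was ", ""), ("bullied ", ""), ("by ", ""),
    ("Percy", "PERSON"), (".", ""),
]

def redact_runs(tokens, label="PERSON"):
    pieces = []
    # groupby clusters adjacent tokens that share the same entity type.
    for ent_type, run in groupby(tokens, key=lambda pair: pair[1]):
        run = list(run)
        if ent_type == label:
            # One placeholder per run; keep the whitespace that followed the last token.
            last_text = run[-1][0]
            trailing = last_text[len(last_text.rstrip()):]
            pieces.append("[REDACTED]" + trailing)
        else:
            pieces.extend(text for text, _ in run)
    return "".join(pieces)

print(redact_runs(tokens))
```

One limitation of this sketch: two different people named back to back, with nothing between them, would collapse into a single placeholder, because it only looks at the entity type and not at entity boundaries.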

### Entity Types 

spaCy supports the following entity types in the large language model which we loaded in the nlp object:

|Type	|Description
|---|---
|PERSON	|People, including fictional.
|NORP	|Nationalities or religious or political groups.
|FAC	|Buildings, airports, highways, bridges, etc.
|ORG	|Companies, agencies, institutions, etc.
|GPE	|Countries, cities, states.
|LOC	|Non-GPE locations, mountain ranges, bodies of water.
|PRODUCT	|Objects, vehicles, foods, etc. (Not services.)
|EVENT	|Named hurricanes, battles, wars, sports events, etc.
|WORK_OF_ART	|Titles of books, songs, etc.
|LAW	|Named documents made into laws.
|LANGUAGE	|Any named language.
|DATE	|Absolute or relative dates or periods.
|TIME	|Times smaller than a day.
|PERCENT	|Percentage, including "%".
|MONEY	|Monetary values, including unit.
|QUANTITY	|Measurements, as of weight or distance.
|ORDINAL	|"first", "second", etc.
|CARDINAL	|Numerals that do not fall under another type.

Let's look at some examples of the above in real-world sentences. We will also use `spacy.explain()` on all entities to build a quick mental model of how these things work.

In [17]:
def explain_text_entities(text):
    doc = nlp(text)
    for ent in doc.ents:
        print(f'{ent}, Label: {ent.label_}, {spacy.explain(ent.label_)}')

In [18]:
explain_text_entities('Tesla has gained 20% market share in the months since')

Tesla, Label: ORG, Companies, agencies, institutions, etc.
20%, Label: PERCENT, Percentage, including "%"
the months, Label: DATE, Absolute or relative dates or periods


In [19]:
explain_text_entities('Taj Mahal built by Mughal Emperor Shah Jahan stands tall on the banks of Yamuna in modern day Agra, India')

Taj Mahal, Label: PERSON, People, including fictional
Mughal, Label: NORP, Nationalities or religious or political groups
Shah Jahan, Label: PERSON, People, including fictional
Yamuna, Label: LOC, Non-GPE locations, mountain ranges, bodies of water
Agra, Label: GPE, Countries, cities, states
India, Label: GPE, Countries, cities, states


Interesting, the model got "Taj Mahal" wrong. Taj Mahal is obviously a world famous monument. But the model has made a believable mistake, because "Taj Mahal" was also the stage name of a Blues musician. 

In most production use cases, we "fine-tune" the in-built spaCy models for specific languages using our own annotations. This would teach the model that Taj Mahal for us is almost always a monument and not a Blues musician. 

In [20]:
explain_text_entities('Ashoka was a great Indian king')

Ashoka, Label: PERSON, People, including fictional
Indian, Label: NORP, Nationalities or religious or political groups


In [21]:
explain_text_entities('The Ashoka University sponsors the Young India Fellowship')

Ashoka University, Label: ORG, Companies, agencies, institutions, etc.
the Young India Fellowship, Label: ORG, Companies, agencies, institutions, etc.


Here, our pipeline is able to leverage the word 'University' to infer that Ashoka is the name of an organisation and not King Ashoka from Indian history.

It has also figured out that 'Young India Fellowship' is one logical entity and has not tagged 'India' as a location.

Seeing a few examples like these helps me form a mental model of the limits of what we can and cannot do.

## Automatic Question Generation

The Challenge: Can you automatically convert a sentence to a question? 

For instance, 'Martin Luther King Jr. was a civil rights activist and skilled orator.' to 'Who was Martin Luther King Jr.?' 

Notice that when we convert a sentence to a question, the answer might not be in the original sentence anymore. To me, the answer to that question might be something different and that's fine. We are not aiming for answers here.

### Part-of-Speech Tagging

Sometimes, we want to quickly pull out keywords, or keyphrases from a larger body of text. This helps us mentally paint a picture of what this text is about. This is particularly helpful in analysis of texts like long emails or essays. 

As a quick hack, we can pull out all relevant "nouns". This is because most keywords are in fact nouns of some form. 

In [22]:
example_text = 'Bansoori is an Indian classical instrument. Tom plays Bansoori and Guitar.'

In [23]:
doc = nlp(example_text)

We need noun chunks. Noun chunks are _noun phrases_ - not a single word, but a short phrase which describes the noun. For example, "the blue skies" or "the world’s largest conglomerate". 

To get the noun chunks in a document, simply iterate over `doc.noun_chunks`: 

In [24]:
for idx, sentence in enumerate(doc.sents):
    for noun in sentence.noun_chunks:
        print(f'sentence{idx+1}', noun)

sentence1 Bansoori
sentence1 an Indian classical instrument
sentence2 Tom
sentence2 Bansoori
sentence2 Guitar


Our example text has two sentences, and we can pull out noun phrase chunks from each sentence. Because we pull out noun phrases instead of single words, we are able to pull out 'an Indian classical instrument' as one noun. This is quite useful, as we will see in a moment.

Next, let's take a quick look at all part-of-speech tags in our example text. We will use the verbs and adjectives to write some simple question generating logic.

In [25]:
for token in doc:
    print(token, token.pos_, token.tag_)

Bansoori PROPN NNP
is VERB VBZ
an DET DT
Indian ADJ JJ
classical ADJ JJ
instrument NOUN NN
. PUNCT .
Tom PROPN NNP
plays VERB VBZ
Bansoori PROPN NNP
and CCONJ CC
Guitar PROPN NNP
. PUNCT .


Notice that here 'instrument' is tagged as a NOUN while 'Indian' and 'classical' are tagged as adjectives. This makes sense. Additionally, Bansoori and Guitar are tagged as PROPN, or proper nouns.

**Nouns vs Proper Nouns**
Nouns name people, places, and things. Common nouns name general items like waiter, jeans, country. Proper nouns name specific things like Roger, Levi's, India.

### Creating a Ruleset

Quite often when using linguistics, you will be writing custom rules. Here is one data structure suggestion to help you store these rules: a list of dictionaries. Each dictionary in turn can have elements ranging from simple strings to lists of strings. Avoid nesting a list of dictionaries inside a dictionary:

In [26]:
ruleset = [
    {
        'id': 1, 
        'req_tags': ['NNP', 'VBZ', 'NN'],
    }, 
    {
        'id': 2, 
        'req_tags': ['NNP', 'VBZ'],
    }
    ]

Here, I have written two rules. Each rule is simply a collection of part-of-speech tags stored under the 'req_tags' key. Each rule comprises all the tags that I will look for in a particular sentence.

Depending on 'id', I will use a hard coded question template to generate my questions. In practice, you can and should move the question template to your ruleset.  

In [27]:
print(ruleset)

[{'id': 1, 'req_tags': ['NNP', 'VBZ', 'NN']}, {'id': 2, 'req_tags': ['NNP', 'VBZ']}]
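Moving the question template into the ruleset could look like the sketch below. The `template` key, the placeholder names and the helper `first_matching_rule` are all assumptions made for illustration, and the tags and words are copied by hand from the second example sentence instead of coming from a live parse.

```python
# Hypothetical extension of the ruleset: each rule carries its own question template.
ruleset_with_templates = [
    {"id": 1, "req_tags": ["NNP", "VBZ", "NN"], "template": "What {VBZ} {NNP}?"},
    {"id": 2, "req_tags": ["NNP", "VBZ"], "template": "What does {NNP} {VBZ_lemma}?"},
]

def first_matching_rule(pos_tags, rules):
    # Return the first rule whose required tags all occur in the sentence.
    for rule in rules:
        if all(tag in pos_tags for tag in rule["req_tags"]):
            return rule
    return None

# Tags written by hand for 'Tom plays Bansoori and Guitar.'
pos_tags = ["NNP", "VBZ", "NNP", "CC", "NNP", "."]
rule = first_matching_rule(pos_tags, ruleset_with_templates)
# str.format ignores keyword arguments the template does not use.
question = rule["template"].format(NNP="Tom", VBZ="plays", VBZ_lemma="play")
print(rule["id"], question)
```

With the template stored next to its tags, adding a new kind of question means adding one dictionary and leaving the matching code alone.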


Next, I need a function to pull out all tokens which match a particular tag. We do this by simply iterating over the entire list of tokens and matching each token against the target tag.

In [28]:
def get_pos_tag(doc, tag):
    return [tok for tok in doc if tok.tag_ == tag]

Tip: This is a linear scan, so every lookup is O(n). As an exercise, can you think of a way to reduce a lookup to O(1)?

Hint: You can pre-compute some results and store them at the cost of more memory.
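One possible answer to the exercise, sketched in plain Python: scan the tokens once, group them by tag in a dictionary, and answer every later query with a dictionary access. The `(word, tag)` tuples are copied by hand from the tagger output and stand in for spaCy tokens; `build_tag_index` is a hypothetical helper.

```python
from collections import defaultdict

# (word, tag) pairs written by hand; plain tuples stand in for spaCy tokens.
tagged = [("Tom", "NNP"), ("plays", "VBZ"), ("Bansoori", "NNP"),
          ("and", "CC"), ("Guitar", "NNP"), (".", ".")]

def build_tag_index(tagged_tokens):
    # One O(n) pass up front, paid once per document.
    index = defaultdict(list)
    for word, tag in tagged_tokens:
        index[tag].append(word)
    return index

tag_index = build_tag_index(tagged)
# Each lookup is now a single dictionary access on average.
print(tag_index["NNP"])
print(tag_index["VBZ"])
```

The trade-off is the one the hint describes: the index costs extra memory, and it only pays off when the same document is queried for several tags.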

Next, I am going to write a function to use the ruleset above and use a question template. 

Here is the broad outline which I will follow for each sentence: 

- For each rule id, check if all the required tags ('req_tags') are present in the sentence
- Find the first rule id which matches
- Find the words which match the required part-of-speech tags
- Fill in the corresponding question template and return the question string

In [29]:
def sent_to_ques(sent: str) -> str:
    """
    Return a question string corresponding to a sentence string using a set of pre-written rules.
    Returns None if no rule matches.
    """
    doc = nlp(sent)
    pos_tags = [token.tag_ for token in doc]
    for rule in ruleset:
        # Skip rules whose required tags are not all present in the sentence.
        if not all(key in pos_tags for key in rule['req_tags']):
            continue
        print(f"Rule id {rule['id']} matched for sentence: {sent}")
        # Keep only the first token matching each tag.
        NNP = get_pos_tag(doc, "NNP")
        NNP = str(NNP[0])
        VBZ = get_pos_tag(doc, "VBZ")
        if rule['id'] == 1:
            # 'NNP', 'VBZ' and 'NN' in sentence: "X is a Y" -> "What is X?"
            return f'What {VBZ[0].text} {NNP}?'
        if rule['id'] == 2:
            # 'NNP' and 'VBZ' in sentence: "X plays Y" -> "What does X play?"
            return f'What does {NNP} {VBZ[0].lemma_}?'

Within each rule id match, I do something more: I am dropping all but the first match for each part-of-speech tag that I receive. For instance, when I query for "NNP", I later pick the first element with NNP[0], convert it to string and drop all other matches. 

While this is a perfectly good approach for simple sentences, it breaks down when you have conditional statements or complex reasoning. Let's run the above function on each sentence in our example text and see what questions we get:

In [30]:
for sent in doc.sents:
    print(f"The generated question is: {sent_to_ques(str(sent))}")

Rule id 1 matched for sentence: Bansoori is an Indian classical instrument.
The generated question is: What is Bansoori?
Rule id 2 matched for sentence: Tom plays Bansoori and Guitar.
The generated question is: What does Tom play?


This is quite good. Obviously, I cheated a bit by writing exactly two rules for the two example sentences I already knew. In practice, you will need a much larger set, maybe 10-15 rules and corresponding templates, just to have decent coverage of "What?" questions.

Another few rules might be needed to cover "When", "Who" and "Where" types of questions. For instance, "Who plays Bansoori?" is also a valid question from the second sentence above.

This means a PoS tagging + rule-driven engine can have large coverage and reasonable precision with respect to the questions, but it will still be a little tedious to maintain, debug and generalize.

We need a better set of tools, one which relies less on the "state" of tokens and more on the relationships between them. This will allow us to change the relationship to form a question instead. This is where Dependency Parsing comes in.

# Question Generation using Dependency Parsing

What is a dependency parser? 

> A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between "head" words and words which modify those heads.
> from [Stanford NNDEP Project](https://nlp.stanford.edu/software/nndep.html)

A dependency parser helps us understand the various ways in which parts of the sentence interact or depend on each other, for instance how a noun is modified by adjectives.

In [31]:
for token in doc:
    print(token, token.dep_)

Bansoori nsubj
is ROOT
an det
Indian amod
classical amod
instrument attr
. punct
Tom nsubj
plays ROOT
Bansoori dobj
and cc
Guitar conj
. punct


Some of these terms are simple enough to guess, e.g. 'ROOT' is where the dependency tree might begin, 'nsubj' is the noun or nominal subject, and 'cc' is probably a conjunction. But this is still incomplete. Luckily for us, spaCy includes the nifty `explain()` function to help us interpret these.

In [32]:
for token in doc:
    print(token, token.dep_, spacy.explain(token.dep_))

Bansoori nsubj nominal subject
is ROOT None
an det determiner
Indian amod adjectival modifier
classical amod adjectival modifier
instrument attr attribute
. punct punctuation
Tom nsubj nominal subject
plays ROOT None
Bansoori dobj direct object
and cc coordinating conjunction
Guitar conj conjunct
. punct punctuation


This gives us a good starting point to Google away and pick up some linguistics-specific terms. E.g. a 'conjunct' is an element joined to another by a coordinating conjunction, the way 'Guitar' is joined to 'Bansoori' by 'and'. An 'attribute' is simply a way to highlight something which is a property of the nominal subject.

Nominal subjects are usually nouns or pronouns, which in turn are actors (via verbs) or have properties (via attributes).
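It can help to picture the parse as a small table of (token, label, head) rows. In the sketch below the labels are taken from the output above, while the head column is filled in by hand and is an assumption about what the parser produced. `children_of` is a hypothetical helper, and tokens are identified by their text, which only works because every word in this sentence is unique.

```python
# (token, dependency label, head token) rows for the first example sentence.
# The head column is written by hand and is an assumption, not parser output.
parsed = [
    ("Bansoori", "nsubj", "is"),
    ("is", "ROOT", "is"),
    ("an", "det", "instrument"),
    ("Indian", "amod", "instrument"),
    ("classical", "amod", "instrument"),
    ("instrument", "attr", "is"),
]

def children_of(head, rows, dep=None):
    # Tokens attached to `head`, optionally filtered by dependency label.
    return [tok for tok, label, h in rows
            if h == head and tok != head and (dep is None or label == dep)]

root = next(tok for tok, label, _ in parsed if label == "ROOT")
print(root, children_of(root, parsed, dep="nsubj"), children_of(root, parsed, dep="attr"))
print(children_of("instrument", parsed, dep="amod"))
```

Reading the table this way, the subject and the attribute both hang off the root verb, and the adjectives hang off the noun they modify.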

## Visualizing the Relationship

spaCy has an inbuilt tool called displacy for displaying simple, but clean and powerful visualizations. It offers two primary modes: Named Entity Recognition and Dependency Parsing. Here we will use the 'dep' or dependency mode. 

In [33]:
displacy.render(doc, style='dep', jupyter=True)

Let's take the first sentence for a quick study: we see that "instrument" is "amod", or adjectivally modified, by "Indian" and "classical". We pulled this phrase out earlier as a noun chunk.

This means that when we pulled noun phrase chunks out of this sentence, spaCy must have finished dependency parsing already under the hood.

Also notice the direction of the arrows: the NOUN (instrument) is modified by the ADJ tokens, and is itself the 'attr' of the ROOT VERB (is).

In [82]:
tricky_doc = nlp('This is ship-shipping ship, shipping shipping ships')

In [96]:
displacy.render(tricky_doc, style='dep', jupyter=True)

This logical tree structure of simple sentences is what we will exploit to simplify our question generation. In order to do this, we need two important pieces: 
- the main verb aka ROOT
- the subjects on which this ROOT verb is acting

Let's write some functions to extract these dependency entities in the spaCy token format i.e. without converting them to strings. 

Alternatively, we can import them from textacy itself :)

In [85]:
from textacy.spacier import utils as spacy_utils

TIP: You can see the docstring AND function implementation using the '??' syntax in Jupyter notebook like this:

In [49]:
??spacy_utils.get_main_verbs_of_sent

In [50]:
# Signature: spacy_utils.get_main_verbs_of_sent(sent)
# Source:   
# def get_main_verbs_of_sent(sent):
#     """Return the main (non-auxiliary) verbs in a sentence."""
#     return [tok for tok in sent
#             if tok.pos == VERB and tok.dep_ not in constants.AUX_DEPS]
# File:      d:\miniconda3\envs\nlp\lib\site-packages\textacy\spacier\utils.py
# Type:      function

When we ask someone a question, it is often about a piece of information, e.g. What is the capital of India?, or about some action, e.g. What did you do on Sunday?

Answering 'what' means we need to find out what the verbs are acting on. This means finding the subjects of the verb. Let's take a more concrete but simple example to explore this:

In [86]:
toy_sentence = 'Shivangi is an engineer'
doc = nlp(toy_sentence)

What are the entities in this sentence? 

In [89]:
displacy.render(doc, style='ent', jupyter=True)

Let's find out the main verb in this sentence: 

In [87]:
verbs = spacy_utils.get_main_verbs_of_sent(doc)
print(verbs)

[is]


And what are nominal subjects of this verb?   

In [88]:
for verb in verbs:
    print(verb, spacy_utils.get_subjects_of_verb(verb))

is [Shivangi]


You will notice that this has a reasonable overlap with the noun phrases which we pulled from our part-of-speech tagging but can be different as well. 

In [90]:
[(token, token.tag_) for token in doc]

[(Shivangi, 'NNP'), (is, 'VBZ'), (an, 'DT'), (engineer, 'NN')]

Tip: As an exercise, extend this approach to add at least Who, Where and When questions.
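One possible starting point for this exercise is to choose the question word from the entity label of the thing being asked about. The mapping below is an assumption made for illustration; the labels themselves come from the entity types table earlier in this chapter.

```python
# Hypothetical mapping from entity label to question word.
QUESTION_WORD_BY_LABEL = {
    "PERSON": "Who",
    "GPE": "Where",
    "LOC": "Where",
    "DATE": "When",
    "TIME": "When",
}

def question_word(entity_label):
    # Fall back to 'What' for labels without a more specific question word.
    return QUESTION_WORD_BY_LABEL.get(entity_label, "What")

for label in ["PERSON", "GPE", "DATE", "ORG"]:
    print(label, question_word(label))
```

This only picks the opening word. Rearranging the rest of the sentence into a well-formed question is the harder part of the exercise.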

# Level Up: Question and Answer
So far, we have been trying to generate questions. But if you were trying to make an automated quiz for students, you would also need to mine the right answer. 

The answer in this case will simply be the objects of the verb. What is an object of a verb?

> In the sentence, "Give the book to me," "book" is the direct object of the verb "give," and "me" is the indirect object. - from the Cambridge English Dictionary

Loosely, the object is the piece on which our verb acts. This is almost always the answer to our "what". Let's write a function to find the objects of any verb. Or wait, we can pull it from `textacy.spacier.utils`.

In [91]:
spacy_utils.get_objects_of_verb(verb)

[engineer]

In [92]:
for verb in verbs:
    print(verb, spacy_utils.get_objects_of_verb(verb))

is [engineer]


In [93]:
displacy.render(doc, style='dep', jupyter=True)

Let's look at the output of our functions for the example text. The first is the sentence itself, then the root verb, then the lemma form of that verb, followed by the subjects of the verb and then the objects.

In [95]:
doc = nlp(example_text)
for sentence in doc.sents:
    print(sentence, sentence.root, sentence.root.lemma_, spacy_utils.get_subjects_of_verb(sentence.root), spacy_utils.get_objects_of_verb(sentence.root))

Bansoori is an Indian classical instrument. is be [Bansoori] [instrument]
Tom plays Bansoori and Guitar. plays play [Tom] [Bansoori, Guitar]


Let's arrange the pieces above into a neat function which we can then re-use:

In [56]:
def para_to_ques(eg_text):
    doc = nlp(eg_text)
    results = []
    for sentence in doc.sents:
        root = sentence.root
        ask_about = spacy_utils.get_subjects_of_verb(root)
        answers = spacy_utils.get_objects_of_verb(root)
        if len(ask_about) > 0 and len(answers) > 0:
            if root.lemma_ == "be":
                question = f'What {root} {ask_about[0]}?'
            else:
                question = f'What does {ask_about[0]} {root.lemma_}?'
            results.append({'question':question, 'answers':answers})
    return results

In [57]:
para_to_ques(example_text)

[{'question': 'What is Bansoori?', 'answers': [instrument]},
 {'question': 'What does Tom play?', 'answers': [Bansoori, Guitar]}]

This seems right to me. Let's run this on a larger sample of sentences. This sample has varying degrees of complexity and sentence structure.

In [42]:
large_example_text = """
Puliyogare is a South Indian dish made of rice and tamarind. 
Priya writes poems. Shivangi bakes cakes. Sachin sings in the orchestra.

Osmosis is the movement of a solvent across a semipermeable membrane toward a higher concentration of solute. In biological systems, the solvent is typically water, but osmosis can occur in other liquids, supercritical liquids, and even gases.
When a cell is submerged in water, the water molecules pass through the cell membrane from an area of low solute concentration to high solute concentration. For example, if the cell is submerged in saltwater, water molecules move out of the cell. If a cell is submerged in freshwater, water molecules move into the cell.

Raja-Yoga is divided into eight steps. The first is Yama. Yama is nonviolence, truthfulness, continence, and non-receiving of any gifts.
After Yama, Raja-Yoga has Niyama. cleanliness, contentment, austerity, study, and self - surrender to God.
The steps are Yama and Niyama. 
"""


In [43]:
para_to_ques(large_example_text)

[{'question': 'What is Puliyogare?', 'answers': [dish]},
 {'question': 'What does Priya write?', 'answers': [poems]},
 {'question': 'What does Shivangi bake?', 'answers': [cakes]},
 {'question': 'What is Osmosis?', 'answers': [movement]},
 {'question': 'What is solvent?', 'answers': [water]},
 {'question': 'What is first?', 'answers': [Yama]},
 {'question': 'What is Yama?',
  'answers': [nonviolence, truthfulness, continence, of]},
 {'question': 'What does Yoga have?', 'answers': [Niyama]},
 {'question': 'What are steps?', 'answers': [Yama, Niyama]}]

# Facts Extraction using Semi Structured Sentence Parsing
Introducing textacy,

Boss mode with co reference resolution