# Leveraging Linguistics


We are going to pick a simple use case and see how we can solve it. Then we repeat the exercise on a slightly different text corpus, and so on.

This helps us build intuition for how to use linguistics in NLP. As mentioned, I am going to use spaCy here, but you are free to use NLTK or anything else available in your favourite programming language.

## Grammar Crash Course and spaCy

In [1]:
# !python -m spacy download en_core_web_lg

In [2]:
import spacy
from spacy import displacy # for visualization
nlp = spacy.load('en_core_web_lg')

In [86]:
import textacy

If there is an error above, try:
- Windows shell: ```python -m spacy download en``` as **Administrator**
- Linux terminal: ```sudo python -m spacy download en```

The first half covers the NLP pipeline.

## Redacting Names with Named Entity Recognition

In [13]:
text = "Madam Pomfrey, the nurse, was kept busy by a sudden spate of colds among the staff and students. Her Pepperup potion worked instantly, though it left the drinker smoking at the ears for several hours afterward. Ginny Weasley, who had been looking pale, was bullied into taking some by Percy."

In [14]:
# Parse the text with spaCy. This runs the entire pipeline.
doc = nlp(text)

In [15]:
# 'doc' now contains a parsed version of text. We can use it to do anything we want!
# For example, this will print out all the named entities that were detected:
for entity in doc.ents:
    print(f"{entity.text} ({entity.label_})")

Pomfrey (PERSON)
Pepperup (ORG)
several hours (TIME)
Ginny Weasley (PERSON)
Percy (PERSON)


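Each line shows the text of a detected entity followed by its label: `PERSON` for people (including fictional ones), `ORG` for companies, agencies and institutions, and `TIME` for times shorter than a day. The model is not perfect: here it labels the potion name "Pepperup" as an organization.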

In [50]:
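def redact_names(text):
    """Replace every token tagged as part of a PERSON entity with [REDACTED]."""
    doc = nlp(text)
    redacted_sentence = []
    for token in doc:
        if token.ent_type_ == "PERSON":
            redacted_sentence.append("[REDACTED]")
        else:
            # text_with_ws is the token text plus its trailing whitespace
            # (it replaces the deprecated token.string)
            redacted_sentence.append(token.text_with_ws)
    return "".join(redacted_sentence)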

In [51]:
redact_names(text)

'Madam [REDACTED], the nurse, was kept busy by a sudden spate of colds among the staff and students. Her Pepperup potion worked instantly, though it left the drinker smoking at the ears for several hours afterward. [REDACTED][REDACTED], who had been looking pale, was bullied into taking some by [REDACTED].'

In [52]:
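def redact_names(text):
    """Redact PERSON entities, merging multi-token names into one token first."""
    doc = nlp(text)
    redacted_sentence = []
    # Merge each entity span into a single token so that "Ginny Weasley"
    # is redacted once (doc.retokenize() replaces the deprecated Span.merge()).
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent, attrs={"ENT_TYPE": ent.label})
    for token in doc:
        if token.ent_type_ == "PERSON":
            redacted_sentence.append("[REDACTED]")
        else:
            # text_with_ws replaces the deprecated token.string
            redacted_sentence.append(token.text_with_ws)
    return "".join(redacted_sentence)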

In [53]:
redact_names(text)

'Madam [REDACTED], the nurse, was kept busy by a sudden spate of colds among the staff and students. Her Pepperup potion worked instantly, though it left the drinker smoking at the ears for several hours afterward. [REDACTED], who had been looking pale, was bullied into taking some by [REDACTED].'
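Both versions above drop the whitespace that followed a redacted token, which is why the first output shows `[REDACTED][REDACTED]`. One possible alternative, sketched here in plain Python, is to redact by character offsets instead of tokens. The helper name `redact_spans` and the sample sentence are made up for illustration; with spaCy, the `(start, end, label)` tuples would presumably come from `ent.start_char`, `ent.end_char` and `ent.label_` on each entity in `doc.ents`.

```python
def redact_spans(text, spans, target_label="PERSON", mask="[REDACTED]"):
    """Replace every span labelled target_label with mask.

    spans: iterable of non-overlapping (start_char, end_char, label) tuples.
    Text outside the matching spans, whitespace included, is left untouched.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if label != target_label:
            continue
        pieces.append(text[cursor:start])  # everything before the entity
        pieces.append(mask)
        cursor = end                       # resume right after the entity
    pieces.append(text[cursor:])           # remainder after the last entity
    return "".join(pieces)


# Hypothetical sample; offsets are computed by hand here instead of by spaCy.
sample = "Ginny Weasley was bullied into taking some by Percy today."
names = ["Ginny Weasley", "Percy"]
sample_spans = [(sample.index(n), sample.index(n) + len(n), "PERSON") for n in names]

print(redact_spans(sample, sample_spans))
```

This prints `[REDACTED] was bullied into taking some by [REDACTED] today.`, with the space after each name preserved. Working on offsets also makes the merge step unnecessary, since a multi-word name is already a single span.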

## Entity Types 

In [64]:
def explain_text_entities(text):
    doc = nlp(text)
    for ent in doc.ents:
        print(ent, ent.label_, spacy.explain(ent.label_))

In [65]:
explain_text_entities('Tesla has gained 20% market share in the months since')

Tesla ORG Companies, agencies, institutions, etc.
20% PERCENT Percentage, including "%"
the months DATE Absolute or relative dates or periods


In [67]:
explain_text_entities('Taj Mahal built by Mughal Emperor Shah Jahan stands tall on the banks of Yamuna in modern day Agra, India')

Taj Mahal PERSON People, including fictional
Mughal NORP Nationalities or religious or political groups
Shah Jahan PERSON People, including fictional
Yamuna LOC Non-GPE locations, mountain ranges, bodies of water
Agra GPE Countries, cities, states
India GPE Countries, cities, states


In [73]:
explain_text_entities('Ashoka was a great Indian king')

Ashoka PERSON People, including fictional
Indian NORP Nationalities or religious or political groups


In [72]:
explain_text_entities('The Ashoka University sponsors the Young India Fellowship')

Ashoka University ORG Companies, agencies, institutions, etc.
the Young India Fellowship ORG Companies, agencies, institutions, etc.


## Question Generation with PoS Tagging and Dependency Parsing

Sometimes we want to quickly pull out keywords or keyphrases from a larger body of text. This helps us mentally paint a picture of what the text is about. It is particularly helpful when analyzing email-length texts.

We refer to these as noun chunks. Noun chunks are _noun phrases_: not a single word, but a short phrase that describes the noun. For example, "the blue skies" or "the world’s largest conglomerate".

To get the noun chunks in a document, simply iterate over `Doc.noun_chunks`:

In [5]:
example_sentence = 'James B. Comey, the former F.B.I. director fired by President Trump, said in an ABC News interview that Mr. Trump was “morally unfit to be president,” portraying him as a danger to the nation.'

In [6]:
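# 'en' was usually a shortcut link to the small English model;
# recent spaCy versions require the full package name.
nlp = spacy.load('en_core_web_sm')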
doc = nlp(example_sentence)

In [7]:
for chunk in doc.noun_chunks:
    print(f'{chunk.text:<30},{chunk.root.text:<15},{chunk.root.dep_:<7},{spacy.explain(chunk.root.dep_):25},{chunk.root.head.text:<15}')

James B. Comey                ,Comey          ,nsubj  ,nominal subject          ,said           
the former F.B.I. director    ,director       ,appos  ,appositional modifier    ,Comey          
President Trump               ,Trump          ,pobj   ,object of preposition    ,by             
an ABC News interview         ,interview      ,pobj   ,object of preposition    ,in             
Mr. Trump                     ,Trump          ,nsubj  ,nominal subject          ,was            
president                     ,president      ,attr   ,attribute                ,be             
him                           ,him            ,dobj   ,direct object            ,portraying     
a danger                      ,danger         ,pobj   ,object of preposition    ,as             
the nation                    ,nation         ,pobj   ,object of preposition    ,to             
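One way to put this table to use is to group the noun chunks by the dependency label of their root, for example to pull out the subjects and the verbs they attach to. The sketch below works on rows copied from the output above so that it runs without a model; `chunks_by_dependency` is a hypothetical helper, and in a live session the tuples could be built from `chunk.text`, `chunk.root.text`, `chunk.root.dep_` and `chunk.root.head.text`.

```python
from collections import defaultdict

# Rows copied from the printed output:
# (chunk text, root text, root dependency label, head text)
rows = [
    ("James B. Comey", "Comey", "nsubj", "said"),
    ("the former F.B.I. director", "director", "appos", "Comey"),
    ("President Trump", "Trump", "pobj", "by"),
    ("an ABC News interview", "interview", "pobj", "in"),
    ("Mr. Trump", "Trump", "nsubj", "was"),
    ("president", "president", "attr", "be"),
    ("him", "him", "dobj", "portraying"),
    ("a danger", "danger", "pobj", "as"),
    ("the nation", "nation", "pobj", "to"),
]


def chunks_by_dependency(rows):
    """Map each dependency label to a list of (chunk text, head text) pairs."""
    grouped = defaultdict(list)
    for chunk_text, _root_text, dep, head_text in rows:
        grouped[dep].append((chunk_text, head_text))
    return dict(grouped)


grouped = chunks_by_dependency(rows)

# Nominal subjects paired with the word they depend on
for subject, head in grouped["nsubj"]:
    print(f"{subject} -> {head}")
```

This prints `James B. Comey -> said` and `Mr. Trump -> was`, which are the natural starting points for "who said ...?" style questions.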


## Facts Extraction using Semi Structured Sentence Parsing
Introducing textacy,

Boss mode with coreference resolution