# Lecture NLP

spaCy installation instructions:
https://spacy.io/usage

In [28]:
# !pip install transformers

In [48]:
import spacy
nlp = spacy.load("en_core_web_sm")

### Sentence Detection: Sentence detection is the process of locating where sentences start and end in a given text.


In spaCy, the .sents property is used to extract sentences from the Doc object. It yields Span objects representing individual sentences, so you wrap it in list() when you need a list. You can also slice a Span to produce a section of a sentence. Here’s how you would extract the total number of sentences and the sentences themselves for a given input. In the example below, spaCy correctly identifies both sentences:

In [49]:
about_text = (
        "Gus Proto is a Python developer currently"
         " working for a London-based Fintech"
         " company. He is interested in learning"
         " Natural Language Processing."
     )
about_doc = nlp(about_text)
sentences = list(about_doc.sents)
print(len(sentences))

for sentence in sentences:
     print(f"{sentence[:5]}...")



2
Gus Proto is a Python...
He is interested in learning...


You can also customize sentence detection by using custom delimiters. Here’s an example where an ellipsis (...) is used as a delimiter, in addition to the full stop, or period (.).
For this example, you use the @Language.component("set_custom_boundaries") decorator to register a function that takes a Doc object as an argument. The job of this function is to identify tokens in the Doc that begin a sentence and set their .is_sent_start attribute to True. Once done, the function must return the Doc object.

Then, you add the custom boundary function to the Language object by using the .add_pipe() method. Because the component runs before the parser, parsing text with this modified Language object will treat the token after an ellipsis as the start of a new sentence.

In [50]:
ellipsis_text = (
    "Gus, can you, ... never mind, I forgot"
   " what I was saying. So, do you think"
   " we should ..."
 )

from spacy.language import Language
@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
     """Add support to use `...` as a delimiter for sentence detection"""
     for token in doc[:-1]:
         if token.text == "...":
             doc[token.i + 1].is_sent_start = True
     return doc


custom_nlp = spacy.load("en_core_web_sm")
custom_nlp.add_pipe("set_custom_boundaries", before="parser")
custom_ellipsis_doc = custom_nlp(ellipsis_text)
custom_ellipsis_sentences = list(custom_ellipsis_doc.sents)
for sentence in custom_ellipsis_sentences:
 print(sentence)


Gus, can you, ...
never mind, I forgot what I was saying.
So, do you think we should ...
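
The boundary rule above can also be sketched in plain Python, without spaCy. This is only an illustration under stated assumptions: the token list is hand-written (roughly what a tokenizer would produce), and `split_on_delimiters` is a hypothetical helper, not part of spaCy. It starts a new sentence after every delimiter token, which is the same idea as setting .is_sent_start on the following token.

```python
def split_on_delimiters(tokens, delimiters=(".", "...")):
    """Group tokens into sentences, starting a new one after each delimiter."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in delimiters:
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that have no closing delimiter
        sentences.append(current)
    return sentences


# Hand-written tokens for the ellipsis example (an assumption, not spaCy output).
tokens = [
    "Gus", ",", "can", "you", ",", "...",
    "never", "mind", ",", "I", "forgot", "what", "I", "was", "saying", ".",
    "So", ",", "do", "you", "think", "we", "should", "...",
]

sentences = split_on_delimiters(tokens)
for sentence in sentences:
    print(" ".join(sentence))
```

Unlike this sketch, spaCy's parser also uses the grammatical structure of the text to decide on boundaries, so the custom component only adds extra sentence starts.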


### Tokens in spaCy: The process of tokenization breaks a text down into its basic units—or tokens—which are represented in spaCy as Token objects.

In [52]:
import spacy
nlp = spacy.load("en_core_web_sm")
about_text = (
     "Gus Proto is a Python developer currently"
     " working for a London-based Fintech"
     " company. He is interested in learning"
     " Natural Language Processing."
 )
about_doc = nlp(about_text)

for token in about_doc:
    print(token, token.idx)  # .idx is the character offset of the token's first character in the original text.
                             # "Proto" starts at offset 4 ("Gus" plus one space). Useful for mapping tokens back to the text.

Gus 0
Proto 4
is 10
a 13
Python 15
developer 22
currently 32
working 42
for 50
a 54
London 56
- 62
based 63
Fintech 69
company 77
. 84
He 86
is 89
interested 92
in 103
learning 106
Natural 115
Language 123
Processing 132
. 142
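
To make the meaning of the offsets concrete, here is a minimal plain-Python sketch that computes the same kind of start positions for whitespace-separated words. It is an illustration only: it splits on whitespace with a regular expression, so it does not separate punctuation the way spaCy's tokenizer does.

```python
import re

text = "Gus Proto is a Python developer"

# Each match is a run of non-space characters; m.start() is its character offset.
offsets = [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]

for word, start in offsets:
    # Slicing the original text at the offset recovers the word.
    assert text[start:start + len(word)] == word
    print(word, start)
```

For this short text the offsets agree with the first values printed above (Gus 0, Proto 4, is 10, a 13, Python 15, developer 22).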


###  Stop Words: Stop words are typically defined as the most common words in a language. 
In the English language, some examples of stop words are the, are, but, and they. 
Most sentences need to contain stop words in order to be full sentences that make grammatical sense.

In NLP, stop words are often removed because they carry little meaning on their own and they heavily distort any word frequency analysis. spaCy stores a list of stop words for the English language:


In [53]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS  # explicit import; does not rely on spacy.lang.en being loaded already
spacy_stopwords = STOP_WORDS
print(len(spacy_stopwords))

for stop_word in list(spacy_stopwords)[:10]:
 print(stop_word)

326
'd
whose
n't
whereafter
‘re
are
those
yet
anywhere
made


In [54]:
custom_about_text = (
 "Gus Proto is a Python developer currently"
 " working for a London-based Fintech"
 " company. He is interested in learning"
 " Natural Language Processing."
)
nlp = spacy.load("en_core_web_sm")
about_doc = nlp(custom_about_text)
print([token for token in about_doc if not token.is_stop])

[Gus, Proto, Python, developer, currently, working, London, -, based, Fintech, company, ., interested, learning, Natural, Language, Processing, .]
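
The effect of stop words on frequency counts can be sketched without spaCy. The snippet below is a rough illustration: `STOP` is a small hand-picked set containing only the words that were filtered out in the cell above, and the regular expression is a crude stand-in for real tokenization.

```python
import re
from collections import Counter

# Small hand-picked stop list for illustration; spaCy's STOP_WORDS is much larger.
STOP = {"is", "a", "for", "he", "in"}

text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing."
)

# Crude tokenization: lowercase the text and keep runs of letters.
words = re.findall(r"[a-z]+", text.lower())

raw_counts = Counter(words)
filtered_counts = Counter(w for w in words if w not in STOP)

print(raw_counts.most_common(2))       # dominated by stop words
print(filtered_counts.most_common(2))  # only content words remain
```

Before filtering, the most frequent words are "is" and "a"; after filtering, every remaining word in this short text occurs once.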


### Lemmatization
Lemmatization is the process of reducing the inflected forms of a word to a common base form while still ensuring that the reduced form belongs to the language. This reduced form, or root word, is called a lemma.

For example, organizes, organized and organizing are all forms of organize. Here, organize is the lemma. The inflection of a word allows you to express different grammatical categories, like tense (organized vs organize), number (trains vs train), and so on. Lemmatization is necessary because it helps you reduce the inflected forms of a word so that they can be analyzed as a single item. It can also help you normalize the text.

spaCy puts a lemma_ attribute on the Token class. This attribute has the lemmatized form of the token:

In [55]:
import spacy
nlp = spacy.load("en_core_web_sm")
conference_help_text = (
 "Gus is helping organize a developer"
 " conference on Applications of Natural Language"
 " Processing. He keeps organizing local Python meetups"
 " and several internal talks at his workplace."
)
conference_help_doc = nlp(conference_help_text)
for token in conference_help_doc:
 if str(token) != str(token.lemma_):
     print(f"{str(token):>20} : {str(token.lemma_)}")

                  is : be
                  He : he
               keeps : keep
          organizing : organize
             meetups : meetup
               talks : talk
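
The reason lemmas help with analysis can be shown with a toy lookup table. This is a sketch under a strong assumption: `LEMMAS` is a hand-written dictionary copied from the output above, and `toy_lemma` is a hypothetical helper. It is not how spaCy computes lemmas.

```python
from collections import Counter

# Hand-written lookup copied from the output above (illustration only).
LEMMAS = {
    "is": "be",
    "keeps": "keep",
    "organizing": "organize",
    "meetups": "meetup",
    "talks": "talk",
}


def toy_lemma(word):
    """Return the lemma from the lookup table, or the word itself if unknown."""
    return LEMMAS.get(word, word)


words = ["organize", "organizing", "talks", "talk", "keeps"]

# Counting lemmas groups the inflected forms into a single item.
counts = Counter(toy_lemma(w) for w in words)
print(counts)
```

Here "organize" and "organizing" are counted together, as are "talks" and "talk".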


### Word Frequency
You can now convert a given text into tokens and perform statistical analysis on it. 

In [56]:
import spacy
from collections import Counter
nlp = spacy.load("en_core_web_sm")
complete_text = (
 "Gus Proto is a Python developer currently"
 " working for a London-based Fintech company. He is"
 " interested in learning Natural Language Processing."
 " There is a developer conference happening on 21 July"
 ' 2019 in London. It is titled "Applications of Natural'
 ' Language Processing". There is a helpline number'
 " available at +44-1234567891. Gus is helping organize it."
 " He keeps organizing local Python meetups and several"
 " internal talks at his workplace. Gus is also presenting"
 ' a talk. The talk will introduce the reader about "Use'
 ' cases of Natural Language Processing in Fintech".'
 " Apart from his work, he is very passionate about music."
 " Gus is learning to play the Piano. He has enrolled"
 " himself in the weekend batch of Great Piano Academy."
 " Great Piano Academy is situated in Mayfair or the City"
 " of London and has world-class piano instructors."
)
complete_doc = nlp(complete_text)

words = [
 token.text
 for token in complete_doc
 if not token.is_stop and not token.is_punct
]

print(Counter(words).most_common(5))

[('Gus', 4), ('London', 3), ('Natural', 3), ('Language', 3), ('Processing', 3)]


### Part-of-Speech Tagging
Part of speech, or POS, is a grammatical role that explains how a particular word is used in a sentence. English is traditionally described as having eight parts of speech:

Noun,
Pronoun,
Adjective,
Verb,
Adverb,
Preposition,
Conjunction,
Interjection.

Part-of-speech tagging is the process of assigning a POS tag to each token depending on its usage in the sentence. POS tags are useful for assigning a syntactic category like noun or verb to each word.
In spaCy, POS tags are available as attributes on the Token object, with .tag_ holding the fine-grained tag and .pos_ the coarse-grained one:



In [60]:
import spacy
nlp = spacy.load("en_core_web_sm")
about_text = (
 "Gus Proto is a Python developer currently"
 " working for a London-based Fintech"
 " company. He is interested in learning"
 " Natural Language Processing."
)
about_doc = nlp(about_text)
for token in about_doc:
 # The f prefix marks an f-string, which lets you embed variables in a string
 print(
     f"""
TOKEN: {str(token)}
=====
TAG: {str(token.tag_):10} POS: {token.pos_}
EXPLANATION: {spacy.explain(token.tag_)}"""
 )



TOKEN: Gus
=====
TAG: NNP        POS: PROPN
EXPLANATION: noun, proper singular

TOKEN: Proto
=====
TAG: NNP        POS: PROPN
EXPLANATION: noun, proper singular

TOKEN: is
=====
TAG: VBZ        POS: AUX
EXPLANATION: verb, 3rd person singular present

TOKEN: a
=====
TAG: DT         POS: DET
EXPLANATION: determiner

TOKEN: Python
=====
TAG: NNP        POS: PROPN
EXPLANATION: noun, proper singular

TOKEN: developer
=====
TAG: NN         POS: NOUN
EXPLANATION: noun, singular or mass

TOKEN: currently
=====
TAG: RB         POS: ADV
EXPLANATION: adverb

TOKEN: working
=====
TAG: VBG        POS: VERB
EXPLANATION: verb, gerund or present participle

TOKEN: for
=====
TAG: IN         POS: ADP
EXPLANATION: conjunction, subordinating or preposition

TOKEN: a
=====
TAG: DT         POS: DET
EXPLANATION: determiner

TOKEN: London
=====
TAG: NNP        POS: PROPN
EXPLANATION: noun, proper singular

TOKEN: -
=====
TAG: HYPH       POS: PUNCT
EXPLANATION: punctuation mark, hyphen

TOKEN: based
=====
TAG

In [61]:
nouns = []
adjectives = []
for token in about_doc:
 if token.pos_ == "NOUN":
     nouns.append(token)
 if token.pos_ == "ADJ":
     adjectives.append(token)


print(nouns)

print(adjectives)

[developer, company]
[interested]
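
The same filtering idea generalizes to grouping all tokens by their tag. The sketch below runs without spaCy: the `tagged` pairs are copied by hand from the first few entries of the tagging output above, and in a real notebook you would build them with (token.text, token.pos_).

```python
from collections import defaultdict

# (text, coarse POS) pairs copied from the tagging output above.
tagged = [
    ("Gus", "PROPN"), ("Proto", "PROPN"), ("is", "AUX"), ("a", "DET"),
    ("Python", "PROPN"), ("developer", "NOUN"), ("currently", "ADV"),
    ("working", "VERB"), ("for", "ADP"),
]

# Collect the words for every POS tag in one pass.
by_pos = defaultdict(list)
for text, pos in tagged:
    by_pos[pos].append(text)

for pos, words in by_pos.items():
    print(f"{pos:6} {words}")
```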


### Preprocessing Functions:
The functions in the next cell clean up the text before analysis. They:
Lowercase the text,
Lemmatize each token,
Remove punctuation symbols,
Remove stop words.

In [62]:
import spacy
nlp = spacy.load("en_core_web_sm")
complete_text = (
 "Gus Proto is a Python developer currently"
 " working for a London-based Fintech company. He is"
 " interested in learning Natural Language Processing."
 " There is a developer conference happening on 21 July"
 ' 2019 in London. It is titled "Applications of Natural'
 ' Language Processing". There is a helpline number'
 " available at +44-1234567891. Gus is helping organize it."
 " He keeps organizing local Python meetups and several"
 " internal talks at his workplace. Gus is also presenting"
 ' a talk. The talk will introduce the reader about "Use'
 ' cases of Natural Language Processing in Fintech".'
 " Apart from his work, he is very passionate about music."
 " Gus is learning to play the Piano. He has enrolled"
 " himself in the weekend batch of Great Piano Academy."
 " Great Piano Academy is situated in Mayfair or the City"
 " of London and has world-class piano instructors."
)
complete_doc = nlp(complete_text)
def is_token_allowed(token):
 return bool(
     token
     and str(token).strip()
     and not token.is_stop
     and not token.is_punct
 )

def preprocess_token(token):
 return token.lemma_.strip().lower()

complete_filtered_tokens = [
 preprocess_token(token)
 for token in complete_doc
 if is_token_allowed(token)
]

complete_filtered_tokens

['gus',
 'proto',
 'python',
 'developer',
 'currently',
 'work',
 'london',
 'base',
 'fintech',
 'company',
 'interested',
 'learn',
 'natural',
 'language',
 'processing',
 'developer',
 'conference',
 'happen',
 '21',
 'july',
 '2019',
 'london',
 'title',
 'application',
 'natural',
 'language',
 'processing',
 'helpline',
 'number',
 'available',
 '+44',
 '1234567891',
 'gus',
 'helping',
 'organize',
 'keep',
 'organize',
 'local',
 'python',
 'meetup',
 'internal',
 'talk',
 'workplace',
 'gus',
 'present',
 'talk',
 'talk',
 'introduce',
 'reader',
 'use',
 'case',
 'natural',
 'language',
 'processing',
 'fintech',
 'apart',
 'work',
 'passionate',
 'music',
 'gus',
 'learn',
 'play',
 'piano',
 'enrol',
 'weekend',
 'batch',
 'great',
 'piano',
 'academy',
 'great',
 'piano',
 'academy',
 'situate',
 'mayfair',
 'city',
 'london',
 'world',
 'class',
 'piano',
 'instructor']

### Rule-Based Matching Using spaCy
Rule-based matching is one of the steps in extracting information from unstructured text. It’s used to identify and extract tokens and phrases according to patterns (such as lowercase) and grammatical features (such as part of speech). For example, with rule-based matching, you can extract a first name and a last name, which are usually tagged as proper nouns:



In [65]:
import spacy
nlp = spacy.load("en_core_web_sm")
about_text = (
 "Gus Proto is a Python developer currently"
 " working for a London-based Fintech"
 " company. He is interested in learning"
 " Natural Language Processing."
)
about_doc = nlp(about_text)

from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

def extract_full_name(nlp_doc):
 pattern = [{"POS": "PROPN"}, {"POS": "PROPN"}]
 matcher.add("FULL_NAME", [pattern])
 matches = matcher(nlp_doc)
 for _, start, end in matches:
     span = nlp_doc[start:end]
     yield span.text


next(extract_full_name(about_doc))


'Gus Proto'
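
What the pattern does can be sketched in plain Python as a sliding window over POS tags. This is a simplified illustration, not spaCy's Matcher: `match_pos_sequence` is a hypothetical helper, and the `tagged` pairs are copied by hand from the tagging output earlier in the notebook.

```python
def match_pos_sequence(tagged, pattern):
    """Yield the text of every run of tokens whose POS tags equal `pattern`."""
    n = len(pattern)
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        if [pos for _, pos in window] == list(pattern):
            yield " ".join(text for text, _ in window)


# (text, coarse POS) pairs copied from the earlier tagging output.
tagged = [
    ("Gus", "PROPN"), ("Proto", "PROPN"), ("is", "AUX"), ("a", "DET"),
    ("Python", "PROPN"), ("developer", "NOUN"),
]

# Two consecutive proper nouns, as in the FULL_NAME pattern above.
names = list(match_pos_sequence(tagged, ["PROPN", "PROPN"]))
print(names)
```

"Python" is a proper noun too, but it is not followed by another one, so only "Gus Proto" matches.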

### Named-Entity Recognition
Named-entity recognition (NER) is the process of locating named entities in unstructured text and then classifying them into predefined categories, such as person names, organizations, locations, monetary values, percentages, and time expressions.

You can use NER to learn more about the meaning of your text. For example, you could use it to populate tags for a set of documents in order to improve the keyword search. You could also use it to categorize customer support tickets into relevant categories.

spaCy has the property .ents on Doc objects. You can use it to extract named entities:

In [71]:
import spacy
nlp = spacy.load("en_core_web_sm")

piano_class_text = (
 "Great Piano Academy is situated"
 " in Mayfair or the City of London and has"
 " world-class piano instructors."
)
piano_class_doc = nlp(piano_class_text)

for ent in piano_class_doc.ents:
 print(
     f"""
{ent.text = }
{ent.start_char = }
{ent.end_char = }
{ent.label_ = }
spacy.explain('{ent.label_}') = {spacy.explain(ent.label_)}"""
)


ent.text = 'Great Piano Academy'
ent.start_char = 0
ent.end_char = 19
ent.label_ = 'ORG'
spacy.explain('ORG') = Companies, agencies, institutions, etc.

ent.text = 'Mayfair'
ent.start_char = 35
ent.end_char = 42
ent.label_ = 'GPE'
spacy.explain('GPE') = Countries, cities, states

ent.text = 'the City of London'
ent.start_char = 46
ent.end_char = 64
ent.label_ = 'GPE'
spacy.explain('GPE') = Countries, cities, states
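
The start_char and end_char values are ordinary string offsets, so slicing the original text with them recovers each entity. The sketch below runs without spaCy and uses the offsets and labels copied from the output above.

```python
piano_class_text = (
    "Great Piano Academy is situated"
    " in Mayfair or the City of London and has"
    " world-class piano instructors."
)

# (start_char, end_char, label) copied from the NER output above.
entities = [(0, 19, "ORG"), (35, 42, "GPE"), (46, 64, "GPE")]

# end_char is exclusive, which matches Python's slice convention.
spans = [(piano_class_text[start:end], label) for start, end, label in entities]

for text, label in spans:
    print(f"{label}: {text}")
```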


In [72]:
survey_text = (
 "Out of 5 people surveyed, James Robert,"
 " Julie Fuller and Benjamin Brooks like"
 " apples. Kelly Cox and Matthew Evans"
 " like oranges."
)


def replace_person_names(token):
 if token.ent_iob != 0 and token.ent_type_ == "PERSON":
     return "[REDACTED] "
 return token.text_with_ws


def redact_names(nlp_doc):
 with nlp_doc.retokenize() as retokenizer:
     for ent in nlp_doc.ents:
         retokenizer.merge(ent)
 tokens = map(replace_person_names, nlp_doc)
 return "".join(tokens)

survey_doc = nlp(survey_text)
print(redact_names(survey_doc))

Out of 5 people surveyed, [REDACTED] , [REDACTED] and [REDACTED] like apples. [REDACTED] and [REDACTED] like oranges.
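
The output above contains a space before the first comma because the placeholder always ends with a space. One possible alternative, sketched here in plain Python, is to replace character spans and leave the surrounding text untouched. The assumptions are labeled in the code: `redact_spans` is a hypothetical helper, and the `names` list stands in for entities that NER has already labeled as PERSON.

```python
def redact_spans(text, spans, placeholder="[REDACTED]"):
    """Replace each (start, end) character span with a placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text


survey_text = (
    "Out of 5 people surveyed, James Robert,"
    " Julie Fuller and Benjamin Brooks like"
    " apples."
)

# Assumed to be already detected as PERSON entities (not computed here).
names = ["James Robert", "Julie Fuller", "Benjamin Brooks"]
spans = [(survey_text.index(n), survey_text.index(n) + len(n)) for n in names]

redacted = redact_spans(survey_text, spans)
print(redacted)
```

With spaCy, the spans could come from ent.start_char and ent.end_char for entities whose label_ is "PERSON".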


In [73]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("He left Syria and flew to Turkey, then went through Slovakia, Germanny and Paris and across to the UK.")

countries = set()
for ent in doc.ents:
    if ent.label_ == 'GPE':  # GPE stands for Geo-Political Entity
        countries.add(ent.text)

print(countries)  # GPE also covers cities such as Paris; set order is arbitrary, and the misspelled "Germanny" is not picked up here

{'Paris', 'Syria', 'UK', 'Slovakia', 'Turkey'}


### Transformers

Transformers are a neural network architecture built on attention rather than recurrence, and they are particularly suitable for natural language processing. Transformers have been the state-of-the-art approach in natural language processing since their introduction in 2017. With pretrained transformers you can build chatbot and question answering applications with little code.

You might need to run the following commands in a terminal to install transformers and tensorflow.

- pip install transformers

- pip install tensorflow


In [79]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
first = classifier('We are very happy to visit London.')
second = classifier('In London it was raining a lot')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [80]:
print(first)
print(second)

[{'label': 'POSITIVE', 'score': 0.9998718500137329}]
[{'label': 'NEGATIVE', 'score': 0.986086905002594}]
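
The score in these results is a probability. For a two-class classifier like this one, such probabilities are typically obtained by applying softmax to the model's raw outputs (logits). Below is a minimal sketch of that step in plain Python; the logit values are made up for illustration and do not come from the model above.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["NEGATIVE", "POSITIVE"]
logits = [-4.0, 5.0]  # made-up values, one per label

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
prediction = {"label": labels[best], "score": probs[best]}
print(prediction)
```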


### Q/A using transformers

In [81]:
# Example Q&A with transformers
from transformers import pipeline
question_answerer = pipeline('question-answering')
question_answerer({'question': 'What is the name of the company?',
                   'context': 'We created Biox Systems Ltd company back in the yearof 2000.'
})

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.7448022365570068,
 'start': 11,
 'end': 27,
 'answer': 'Biox Systems Ltd'}
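
The start and end values in the result are character offsets into the context, which shows that this kind of model extracts its answer from the given text rather than generating new text. A small check in plain Python, using the context string and the result copied from the cell above:

```python
context = "We created Biox Systems Ltd company back in the yearof 2000."

# Result copied from the output above.
result = {
    "score": 0.7448022365570068,
    "start": 11,
    "end": 27,
    "answer": "Biox Systems Ltd",
}

# Slicing the context with the offsets reproduces the answer.
span = context[result["start"]:result["end"]]
print(span)
```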

In [82]:
# Example open question answering with transformers
from transformers import pipeline
context = '''
We created Biox Systems Ltd company back in the year of 2000.
'''
Question = input('Ask a question:')
question_answerer = pipeline('question-answering')
result = question_answerer(question=Question, context=context)
print("Answer:", result['answer'])
print("Score:", result['score'])

Ask a question:What is 2+2


No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Answer: Biox Systems Ltd company
Score: 0.49045586585998535
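
The question "What is 2+2" has no answer in the context, yet the model still returns a span, only with a lower score. One simple way to handle this is to reject answers below a threshold. This is a sketch under assumptions: `answer_or_none` is a hypothetical helper and the 0.6 cutoff is an arbitrary value chosen for illustration, so it would need tuning on real data.

```python
def answer_or_none(result, min_score=0.6):
    """Return the answer only if the model's score reaches the threshold."""
    if result["score"] >= min_score:
        return result["answer"]
    return None


# Results copied from the two question-answering outputs above.
confident = {"score": 0.7448022365570068, "answer": "Biox Systems Ltd"}
doubtful = {"score": 0.49045586585998535, "answer": "Biox Systems Ltd company"}

print(answer_or_none(confident))
print(answer_or_none(doubtful))
```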


The next example generates text with Generative Pre-trained Transformer 2 (GPT-2), a large transformer-based language model developed by OpenAI and later followed by GPT-3. The largest GPT-2 model has 1.5 billion parameters and was trained on a dataset of 8 million web pages. The gpt2 checkpoint used in the example is the smallest variant of the family. GPT-2 is trained to predict the next word, given the previous words within the text.

In [83]:
# Example Text Generation with transformers (GPT-2 Model)
# pip install transformers
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(20)
generator("I feel amazing about", max_length=20, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I feel amazing about it."\n\nAnd she was already working on that.'},
 {'generated_text': 'I feel amazing about this."\n\nHe said he was shocked that the school would take the case'},
 {'generated_text': 'I feel amazing about that. We started it by calling a party. At home to work on it'},
 {'generated_text': 'I feel amazing about where the team should take the offseason. It\'s going to be great," Williams'},
 {'generated_text': 'I feel amazing about this project. The idea of this album being included in the album is just so'}]
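
The idea of predicting the next word from the previous words can be illustrated with a toy bigram model. This is a deliberately tiny sketch and not how GPT-2 works internally: it looks only at the single previous word and counts what followed it in a made-up corpus, whereas GPT-2 uses a transformer network over the whole preceding text. The function names and the corpus are assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_bigrams(words):
    """Count, for every word, which words follow it and how often."""
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def predict_next(following, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    candidates = following.get(word)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Made-up training text for illustration.
corpus = "gus is learning to play the piano and gus is helping organize it".split()
model = train_bigrams(corpus)

print(predict_next(model, "gus"))
print(predict_next(model, "it"))
```

Generating text then amounts to repeatedly predicting a next word and appending it to the input, which is what the generator call above does with a far larger model.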