
# This Colab notebook is about Natural Language Processing (NLP)

> We will use the leading NLP library (spaCy) to take on some of the most important tasks in working with text.


> This notebook will help you build an understanding of NLP and spaCy.
By the end of this notebook, you will be able to use spaCy for:


*   Basic text processing and pattern matching
*   Building machine learning models with text
*   Representing text with word embeddings that numerically capture the meaning of words and documents 

# NLP with spaCy
#### spaCy is the leading library for NLP, and it has quickly become one of the most popular Python frameworks. Most people find it intuitive, and it has excellent documentation.

> spaCy relies on models that are language-specific and come in different sizes. You can load a spaCy model with 
```
spacy.load
```
For example, here's how you would load the English language model:


In [0]:
#pip install -U spaCy
#python -m spacy download en


# Notice that the installation doesn’t automatically download the English model. We need to do that ourselves.

In [0]:
import spacy
nlp = spacy.load('en')

In [3]:
doc = nlp('Hello     World!')
for token in doc:
    print('"' + token.text + '"')

"Hello"
"    "
"World"
"!"


Notice the index-preserving tokenization in action in the example above.
Rather than keeping only the words, spaCy keeps the spaces too, as the whitespace token printed after "Hello" shows.
This is helpful when you need to replace words in the original text or add annotations.
With NLTK's basic word_tokenize, you get back only a list of strings, with no record of where each token sits in the original raw text. spaCy preserves this “link” between the word and its place in the raw text. Here’s how to get the exact index of each word and space.

In [6]:
doc = nlp('Hello     World!')
for token in doc:
    print('"' + token.text + '"', token.idx)

"Hello" 0
"    " 6
"World" 10
"!" 15

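The offsets make it possible to edit the raw text in place. The sketch below is plain Python rather than a spaCy call: it reuses the (text, offset) pairs printed above, and `replace_token` is a hypothetical helper introduced only for illustration.

```python
text = "Hello     World!"

# (token text, character offset) pairs copied from the output above
tokens = [("Hello", 0), ("    ", 6), ("World", 10), ("!", 15)]


def replace_token(raw, token_text, offset, new_text):
    """Swap one token for new_text, using its character offset in raw."""
    end = offset + len(token_text)
    # Sanity check: the offset really points at the token in the raw text
    assert raw[offset:end] == token_text
    return raw[:offset] + new_text + raw[end:]


# Every offset lines up with the raw string, whitespace token included
for token_text, offset in tokens:
    assert text[offset:offset + len(token_text)] == token_text

replaced = replace_token(text, "World", 10, "spaCy")
print(replaced)
```

Because the whitespace is untouched, the five spaces between the two words survive the replacement.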

In [21]:
doc = nlp("These are apples. These are oranges.")
 
for sent in doc.sents:
    print(sent)

These are apples.
These are oranges.


In [18]:
doc = nlp("Next week I'll   be in Kashmir.")
print("Text \t\tindex \t\t lemma \t\t punctuation \t space \t\t shape \t\t pos \t\t tag")
print("______________________________________________________________________________________________________________________________")
for token in doc:
    print("{0}\t\t{1}\t\t{2}\t\t{3}\t\t{4}\t\t{5}\t\t{6}\t\t{7}".format(
        token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.shape_,
        token.pos_,
        token.tag_
    ))

Text 		index 		 lemma 		 punctuation 	 space 		 shape 		 pos 		 tag
______________________________________________________________________________________________________________________________
Next		0		next		False		False		Xxxx		ADJ		JJ
week		5		week		False		False		xxxx		NOUN		NN
I		10		-PRON-		False		False		X		PRON		PRP
'll		11		will		False		False		'xx		VERB		MD
  		15		  		False		True		  		SPACE		_SP
be		17		be		False		False		xx		AUX		VB
in		20		in		False		False		xx		ADP		IN
Kashmir		23		Kashmir		False		False		Xxxxx		PROPN		NNP
.		30		.		True		False		.		PUNCT		.
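
The shape column can be read as a character-class summary of each token: uppercase letters become X, lowercase letters become x, digits become d, and other characters are kept. The sketch below is a plain-Python approximation written to reproduce the values in the table above. `word_shape` and its `max_run` parameter are hypothetical names, and the cap of four repeated characters is an assumption inferred from "Kashmir" appearing as "Xxxxx".

```python
def word_shape(text, max_run=4):
    """Approximate a token's shape: X/x/d for upper/lower/digit, others kept."""
    shape = []
    last, run = "", 0
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch
        # Count how long the current run of identical shape characters is
        run = run + 1 if mapped == last else 1
        last = mapped
        # Assumption: runs longer than max_run are truncated
        if run <= max_run:
            shape.append(mapped)
    return "".join(shape)


for word in ["Next", "week", "'ll", "Kashmir"]:
    print(word, word_shape(word))
```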


In [0]:
# Now we will try a longer example

docs = nlp("Tea makes your mind fresh, healthy and calming, haven't you expereinced so?")

# Why spaCy

**It's really FAST**

Written in Cython, it was specifically designed to be as fast as possible

**It's really ACCURATE**

spaCy's dependency parser is one of the best-performing in the world; see "It Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool".

**Batteries included**

*   Index preserving tokenization (demonstrated earlier in this notebook)
*   Models for Part Of Speech tagging, Named Entity Recognition and Dependency Parsing
*   Supports 8 languages out of the box
*   Easy and beautiful visualizations
*   Pretrained word vectors

**Extensible**

It plays nicely with all the other already existing tools that you know and love: Scikit-Learn, TensorFlow, gensim

**DeepLearning Ready**

It also has its own deep learning framework that’s especially designed for NLP tasks: Thinc


**Now we will go further ahead to explore what we can do with the "docs" object we just created above**


# Tokenizing
> A token is a unit of text in the document, such as an individual word or punctuation mark. In simpler terms, tokenizing means splitting your text into minimal meaningful units, and it is a mandatory step before any other kind of processing. spaCy splits contractions like "don't" into two tokens, "do" and "n't". You can see the tokens by iterating through the document.

In [0]:
for token in docs:
    print(token)

Tea
makes
your
mind
fresh
,
healthy
and
calming
,
have
n't
you
expereinced
so
?
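
To make the contraction rule concrete, here is a toy regular-expression tokenizer in plain Python. It is only a sketch: spaCy's real tokenizer relies on language-specific rules and exceptions, whereas the pattern below (`TOKEN_RE`, a hypothetical name) handles just words, single punctuation marks, and the "n't" suffix. For this one sentence it happens to produce the same tokens as the output above.

```python
import re

# Toy rule set (an assumption for illustration, far simpler than spaCy's tokenizer):
#   1. a word stem that is directly followed by "n't"
#   2. the contraction suffix "n't" itself
#   3. any other run of word characters
#   4. any single punctuation character
TOKEN_RE = re.compile(r"\w+(?=n't)|n't|\w+|[^\w\s]")


def toy_tokenize(text):
    """Split text into word, contraction, and punctuation tokens."""
    return TOKEN_RE.findall(text)


sentence = "Tea makes your mind fresh, healthy and calming, haven't you expereinced so?"
tokens = toy_tokenize(sentence)
print(tokens)
```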


# Text preprocessing

There are a few types of preprocessing that can improve how we model with words. The first is "lemmatizing." The "lemma" of a word is its base form. 
For example, "walk" is the lemma of the word "walking". So, when you lemmatize the word walking, you would convert it to walk.

It's also common to remove stopwords. Stopwords are words that occur frequently in the language and don't contain much information. English stopwords include "the", "is", "and", "but", "not".

With a spaCy token, token.lemma_ returns the lemma, while token.is_stop returns a boolean True if the token is a stopword (and False otherwise).

In [0]:
print("Token \t\t\t Lemma \t\t\t Stopword")
print("----------------------------------------")
for token in docs:
    print(f"{str(token)}\t\t\t{token.lemma_}\t\t\t{token.is_stop}")

Token 			 Lemma 			 Stopword
----------------------------------------
Tea			tea			False
makes			make			False
your			-PRON-			True
mind			mind			False
fresh			fresh			False
,			,			False
healthy			healthy			False
and			and			True
calming			calming			False
,			,			False
have			have			True
n't			not			True
you			-PRON-			True
expereinced			expereince			False
so			so			True
?			?			False


**Why are lemmas and identifying stopwords important?**

Language data has a lot of noise mixed in with informative content. In the sentence above, the important words are tea, healthy, calming and experienced.

Removing **stop words** might help the predictive model focus on relevant words. 
**Lemmatizing** similarly helps by combining multiple forms of the same word into one base form ("calming", "calms", "calmed" would all change to "calm").

However, lemmatizing and dropping stopwords might result in your models performing worse. So you should treat this preprocessing as part of your hyperparameter optimization process.
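
One way to treat these steps as hyperparameters is to put each behind a switch and compare the results downstream. The sketch below is plain Python with no spaCy call: the rows are copied from the table above, and `preprocess` with its `lemmatize` and `drop_stopwords` flags is a hypothetical helper, not part of spaCy.

```python
# (text, lemma, is_stop) rows copied from the table above
rows = [
    ("Tea", "tea", False), ("makes", "make", False), ("your", "-PRON-", True),
    ("mind", "mind", False), ("fresh", "fresh", False), (",", ",", False),
    ("healthy", "healthy", False), ("and", "and", True),
    ("calming", "calming", False), (",", ",", False), ("have", "have", True),
    ("n't", "not", True), ("you", "-PRON-", True),
    ("expereinced", "expereince", False), ("so", "so", True), ("?", "?", False),
]


def preprocess(rows, lemmatize=True, drop_stopwords=True):
    """Return cleaned tokens; each flag toggles one preprocessing step."""
    out = []
    for text, lemma, is_stop in rows:
        if drop_stopwords and is_stop:
            continue
        if not any(ch.isalnum() for ch in text):
            continue  # skip tokens that are pure punctuation
        out.append(lemma if lemmatize else text.lower())
    return out


both = preprocess(rows)
stopwords_only = preprocess(rows, lemmatize=False)
print(both)
print(stopwords_only)
```

Each combination of flags gives a different feature set, so the choice can be evaluated alongside the model's other settings.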

# Pattern Matching
Another common NLP task is matching tokens or phrases within chunks of text or whole documents. You can do pattern matching with regular expressions, but spaCy's matching capabilities tend to be easier to use.

To match individual tokens, you create a Matcher. When you want to match a list of terms, it's easier and more efficient to use PhraseMatcher. For example, if you want to find where different smartphone models show up in some text, you can create patterns for the model names of interest. First you create the PhraseMatcher itself:

In [0]:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

The matcher is created using the vocabulary of your model. Here we're using the small English model you loaded earlier. Setting attr='LOWER' will match the phrases on lowercased text. This provides case insensitive matching.

Next you create a list of terms to match in the text. The phrase matcher needs the patterns as document objects. The easiest way to get these is with a list comprehension using the nlp model.

In [0]:
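terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
# Convert each term into a Doc so the matcher can compare token sequences
patterns = [nlp(text) for text in terms]
# spaCy 2.x signature; in spaCy 3.x, pass the list instead: matcher.add("TerminologyList", patterns)
matcher.add("TerminologyList", None, *patterns)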

Then we create a document from the text to search and use the phrase matcher to find where the terms occur in the text.

In [0]:
text_doc = nlp("Glowing review overall, and some really interesting side-by-side "
               "photography tests pitting the iPhone 11 Pro against the "
               "Galaxy Note 10 Plus and last year’s iPhone XS and Google Pixel 3.") 
matches = matcher(text_doc)
print(matches)


[(3766102292120407359, 17, 19), (3766102292120407359, 22, 24), (3766102292120407359, 30, 32), (3766102292120407359, 33, 35)]


Each match is a tuple of the match ID and the start and end token positions of the phrase, where the end position is exclusive.

In [0]:
match_id, start, end = matches[0]
print(nlp.vocab.strings[match_id], text_doc[start:end])

TerminologyList iPhone 11
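
To show what the start and end positions mean, here is a plain-Python sketch of case-insensitive phrase matching over a token list. It is not how spaCy implements PhraseMatcher: `find_phrases` is a hypothetical helper, the example sentence is made up, and the tokens come from a simple whitespace split, so the indices differ from spaCy's.

```python
def find_phrases(tokens, phrases):
    """Return (phrase, start, end) for each match; end is exclusive."""
    lowered = [t.lower() for t in tokens]
    matches = []
    for phrase in phrases:
        words = phrase.lower().split()
        n = len(words)
        # Slide a window of n tokens across the text and compare lowercased
        for start in range(len(lowered) - n + 1):
            if lowered[start:start + n] == words:
                matches.append((phrase, start, start + n))
    return sorted(matches, key=lambda m: m[1])


terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
tokens = "I compared the iphone 11 with the GALAXY NOTE yesterday".split()

found = find_phrases(tokens, terms)
for phrase, start, end in found:
    # Slicing with start:end recovers the matched tokens, as with text_doc[start:end]
    print(phrase, start, end, tokens[start:end])
```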


To understand this better, we will follow up with an example.

**Basic text processing with spaCy**

Suppose you are a consultant for DelFalco's Italian Restaurant. The owner asked you to identify whether there are any foods on their menu that diners find disappointing.


# Part Of Speech Tagging

In [23]:

# Now let's take another look, this time at the fine-grained tags.
doc = nlp("Next week I'll be in Kashmir.")
print([(token.text, token.tag_) for token in doc])

[('Next', 'JJ'), ('week', 'NN'), ('I', 'PRP'), ("'ll", 'MD'), ('be', 'VB'), ('in', 'IN'), ('Kashmir', 'NNP'), ('.', '.')]


# Named Entity Recognition
Doing NER with spaCy is easy, and the pretrained model also performs well:


In [30]:
doc = nlp("Next week i'll be in Kashmir")
for ent in doc.ents:
    print(ent.text, ent.label_)

Next week DATE
Kashmir LOC


**IOB-style tagging of the sentence looks like this:**

In [32]:
from nltk.chunk import conlltags2tree
 
doc = nlp("Next week I'll be in Kashmir.")
iob_tagged = [
    (
        token.text, 
        token.tag_, 
        "{0}-{1}".format(token.ent_iob_, token.ent_type_) if token.ent_iob_ != 'O' else token.ent_iob_
    ) for token in doc
]
 
print(iob_tagged)

[('Next', 'JJ', 'B-DATE'), ('week', 'NN', 'I-DATE'), ('I', 'PRP', 'O'), ("'ll", 'MD', 'O'), ('be', 'VB', 'O'), ('in', 'IN', 'O'), ('Kashmir', 'NNP', 'B-LOC'), ('.', '.', 'O')]


In [33]:
# The same results as above, in nltk.Tree format
print(conlltags2tree(iob_tagged))

(S
  (DATE Next/JJ week/NN)
  I/PRP
  'll/MD
  be/VB
  in/IN
  (LOC Kashmir/NNP)
  ./.)
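
Reading IOB tags back into entity spans is a simple scan: a B- tag opens an entity, matching I- tags extend it, and anything else closes it. The sketch below is plain Python using the tagged list printed above. `iob_to_spans` is a hypothetical helper, and dropping a stray I- tag that has no matching B- tag is a simplifying assumption.

```python
# (text, POS tag, IOB tag) triples copied from the output above
iob_tagged = [
    ('Next', 'JJ', 'B-DATE'), ('week', 'NN', 'I-DATE'), ('I', 'PRP', 'O'),
    ("'ll", 'MD', 'O'), ('be', 'VB', 'O'), ('in', 'IN', 'O'),
    ('Kashmir', 'NNP', 'B-LOC'), ('.', '.', 'O'),
]


def iob_to_spans(tagged):
    """Group IOB-tagged tokens into (entity text, label) pairs."""
    spans = []
    words, label = [], None
    for word, _pos, tag in tagged:
        if tag.startswith("B-"):
            # A new entity begins; close any entity still open
            if words:
                spans.append((" ".join(words), label))
            words, label = [word], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            words.append(word)
        else:
            # "O" (or a stray I- tag) ends the current entity
            if words:
                spans.append((" ".join(words), label))
            words, label = [], None
    if words:
        spans.append((" ".join(words), label))
    return spans


spans = iob_to_spans(iob_tagged)
print(spans)
```

The result lines up with the entities printed from doc.ents earlier.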


spaCy's NER supports a healthy variety of entity types. If you want to read more about them, see the spaCy NER documentation.

In [0]:
from spacy import displacy

text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.serve(doc, style="ent")


Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...



In [0]:
doc = nlp("I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ")
for ent in doc.ents:
    print(ent.text, ent.label_)
 
# 2 CARDINAL
# 9 a.m. TIME
# 30% PERCENT
# just 2 days DATE
# WSJ ORG