# spaCy Overview
 
What is spaCy (v2)?

spaCy is an open-source software library for advanced Natural Language Processing, written in Python and Cython.

The library is published under the MIT license. As of v2, it offers statistical neural network models for English, German, Spanish, Portuguese, French, Italian, Dutch and multi-language NER, as well as tokenization for various other languages.

spaCy v2.0 features new neural models for tagging, parsing and entity recognition. The models were designed and implemented from scratch specifically for spaCy, to balance speed, size and accuracy.



Features:

• Non-destructive tokenization          
• Named entity recognition           
• Support for 49+ languages             
• 16 statistical models for 9 languages             
• Pre-trained word vectors             
• Easy deep learning integration          
• Part-of-speech tagging         
• Labelled dependency parsing         
• Syntax-driven sentence segmentation           
• Built-in visualizers for syntax and NER  
• Convenient string-to-hash mapping                
• Export to numpy data arrays                
• Efficient binary serialization              
• Easy model packaging and deployment              
• State-of-the-art speed           
• Robust, rigorously evaluated accuracy  

# Install the en model and link en_core_web_md as en

python -m spacy download en

In [None]:
import spacy
spacy.load("en")

python -m spacy link en_core_web_md en

In [None]:
import en_core_web_sm
nlp = en_core_web_sm.load()

In [None]:
import spacy
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")

# Tokenize words

In [None]:
#https://nlpforhackers.io/complete-guide-to-spacy/
import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp('Hello   World!')
for token in doc:
    print('"' + token.text + '"')

In [None]:
import spacy
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm") # load the spaCy language model.
doc = nlp('Hello    World!')
for token in doc:
    print('"' + token.text + '"')

Output:

"Hello"  
"   "  
"World"  
"!"  

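spaCy's tokenization is non-destructive, meaning the original text can be rebuilt from the tokens. The sketch below illustrates that idea in plain Python. `toy_tokenize` is a hypothetical regex-based helper written for this example, not spaCy's tokenizer, and it assumes the text has no leading whitespace.

```python
import re

def toy_tokenize(text):
    """Split text into (token, trailing_whitespace) pairs.

    Hypothetical helper for illustration only; spaCy's real tokenizer
    applies language-specific rules and is far more sophisticated.
    """
    # \w+ matches a word, [^\w\s] matches a single punctuation mark,
    # and \s* captures whatever whitespace follows the token.
    return re.findall(r"(\w+|[^\w\s])(\s*)", text)

text = "Hello   World!"
pairs = toy_tokenize(text)
print(pairs)

# Joining every token with its trailing whitespace restores the input.
rebuilt = "".join(token + space for token, space in pairs)
print(rebuilt == text)
```

In spaCy itself, the comparable information is exposed through `token.text` and `token.whitespace_`.
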
# Tokenize sentences

In [None]:
#https://nlpforhackers.io/complete-guide-to-spacy/
import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp("These are apples. These are oranges.")
for sent in doc.sents:
    print(sent)

Output:

These are apples.             
These are oranges.             

# Word Tokenization

In [None]:
#https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/
# Word tokenization
from spacy.lang.en import English
# English() creates a blank pipeline with only the tokenizer (no tagger, parser or NER)
nlp = English()
text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey.
You've got this!"""
# "nlp" Object is used to create documents with Linguistic annotations.
my_doc = nlp(text)
# Create list of word tokens
token_list = []
for token in my_doc:
    token_list.append(token.text)
print(token_list)  # will split words

# Stop words

In [None]:
#https://www.dataquest.io/blog/tutorial-text-classification-in-
#Stop words
#importing stop words from English Language.
from spacy.lang.en.stop_words import STOP_WORDS
spacy_stopwords = STOP_WORDS
#Printing the total number of stop words:
print('Number of stop words: %d' % len(spacy_stopwords))
#Printing first ten stop words:
print('First ten stop words: %s' % list(spacy_stopwords)[:10])

# Remove stop words

In [None]:
#https://www.dataquest.io/blog/tutorial-text-classification-in-
from spacy.lang.en import English
# English() creates a blank pipeline: tokenizer plus language data such as stop words
nlp = English()
text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey.
You've got this!"""

#Implementation of stop words:
filtered_sent = []
# "nlp" Object is used to create documents with linguistic features
doc = nlp(text)
# filtering stop words
for word in doc:
    if not word.is_stop:
        filtered_sent.append(word)
print("Filtered Sentence:", filtered_sent)

output:
    
Filtered Sentence: [learning, data, science, , discouraged, !,
Challenges, setbacks, failures, ,, journey,
., got, !]

# Part of Speech (POS) Tagging

The nine parts of speech are:
1. noun            
2. verb          
3. adjective           
4. adverb           
5. pronoun           
6. preposition            
7. conjunction           
8. interjection           
9. article or (more recently) determiner         

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)

# Part Of Speech Tagging

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Next week I'll be in Madrid.")
print([(token.text, token.tag_) for token in doc])

Output:

[('Next', 'JJ'), ('week', 'NN'), ('I', 'PRP'), ("'ll", 'MD'), ('be', 'VB'), ('in', 'IN'), ('Madrid', 'NNP'), ('.', '.')]

# POS

A word's part of speech defines its function within a sentence.

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")
docs = nlp(u"All is well that ends well.")
for word in docs:
    print(word.text, word.pos_)

# Detecting Nouns

spaCy also detects noun phrases automatically.

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
for chunk in doc.noun_chunks:
    print(chunk.text)

# Lemmatization

Lemmatization reduces inflected forms of a word (for example, past-tense or participle forms) to its base form, the lemma.
It may be surprising, but spaCy has no function for stemming; it relies on lemmatization only.


We can find the roots of all the words using spaCy lemmatization as follows:

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
for word in doc:
    # lemma_ is the readable string; lemma (no underscore) is its integer hash
    print(word.text + '===>', word.lemma_)

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")
# Implementing Lemmatization
lem = nlp("run runs running runner")
# finding Lemma for each word
for word in lem:
    print(word.text, word.lemma_)
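
To make the contrast between stemming and lemmatization concrete, here is a toy sketch in plain Python. `toy_stem`, `toy_lemmatize` and the `TOY_LEMMAS` table are hypothetical names invented for this illustration; they do not reflect how spaCy (or any real stemmer) works internally.

```python
# Tiny hand-written lookup table, assumed for illustration only.
TOY_LEMMAS = {"ran": "run", "running": "run", "runs": "run", "mice": "mouse"}

def toy_stem(word):
    """Crude stemmer: chop a known suffix off the end of the word."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Crude lemmatizer: look the word up in a dictionary of base forms."""
    return TOY_LEMMAS.get(word, word)

for w in ["runs", "running", "ran", "mice"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemmer can produce non-words such as "runn" and cannot handle irregular forms like "ran", whereas a dictionary-based lemmatizer returns a real base form when it knows the word.
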

# Named Entity Recognition (NER)

What is Named Entity Recognition (NER)?    
Named entity recognition (NER) is a sub-task of information extraction (IE) that locates and categorizes specified entities in a body or bodies of text.


In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Next week I'll be in Madrid.")
for ent in doc.ents:
    print(ent.text, ent.label_)

Output:
    
Next week DATE    
Madrid GPE


The dataset consists of the following tags:              
• geo = Geographical Entity  
• org = Organization  
• per = Person  
• gpe = Geopolitical Entity  
• tim = Time indicator  
• art = Artifact  
• eve = Event  
• nat = Natural Phenomenon  

# Entity Detection

Entity detection, also called entity recognition, is a more advanced form of language processing that identifies important elements such as places, people, organizations, and languages within an input string of text.


In [None]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
nytimes = nlp(u"""New York city on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an
At least 285 people have contracted measles in the city since September, mostly in Brooklyn's Williamsburg neighborhood. The ord
The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, to receive inoculations, inc""")
entities = [(i, i.label_, i.label) for i in nytimes.ents]
displacy.render(nytimes, style="ent", jupyter=True)

# Word Vector Representation

A word vector is a numeric representation of a word that communicates its relationship to other words.

In [None]:
import spacy
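nlp = spacy.load("en_core_web_sm")
mango = nlp(u'mango')
# Note: the small (sm) models ship without pretrained word vectors;
# load en_core_web_md or en_core_web_lg for real word vectors.
print(mango.vector.shape)
print(mango.vector)  # output like: [ 1.0466383 -1.5323697 -0.72177905 -2.4700649 -0.2715162 ...]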

# Computing Similarity

The `similarity` method returns a floating-point score such as 0.8252482050769769; higher values mean the two texts are more similar.

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")
target = nlp("Cats are beautiful animals.")
doc1 = nlp("Dogs are awesome.")
doc2 = nlp("Some gorgeous creatures are felines.")
doc3 = nlp("Dolphins are swimming mammals.")
print(target.similarity(doc1))
print(target.similarity(doc2))
print(target.similarity(doc3))
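
By default, spaCy's `similarity` estimate is the cosine similarity of the two objects' vectors. The plain-Python sketch below shows that calculation on small made-up vectors. The three-dimensional values for `cat`, `dog` and `car` are invented for illustration and are not taken from any spaCy model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors; real word vectors have hundreds of dimensions.
cat = [0.9, 0.1, 0.3]
dog = [0.8, 0.2, 0.35]
car = [0.1, 0.9, 0.0]

print(cosine_similarity(cat, dog))  # close to 1: similar direction
print(cosine_similarity(cat, car))  # much lower: different direction
```
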

# Get word frequency

In [None]:
import spacy
from collections import Counter
nlp = spacy.load("en_core_web_sm")
text = """Most of the outlay will be at home. No surprise there, either.
While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers.Samsung is good """
doc = nlp(text)
#remove stopwords and punctuations
words = [token.text for token in doc if not token.is_stop and not token.is_punct]
word_freq = Counter(words)
common_words = word_freq.most_common(5)
print(common_words)

output:

[('\n', 2), ('Samsung', 2), ('outlay', 1), ('home', 1), ('surprise', 1)]
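
The newline token appears in the counts because line breaks are neither stop words nor punctuation. A minimal sketch of filtering out whitespace-only tokens before counting is shown below. The `tokens` list is a hypothetical stand-in for tokenizer output; with spaCy, the corresponding check would be the `token.is_space` attribute.

```python
from collections import Counter

# Hypothetical token texts standing in for the output of a tokenizer.
tokens = ["Samsung", "\n", "outlay", "home", "\n", "Samsung", "factories"]

# Drop tokens made up only of whitespace before counting.
words = [tok for tok in tokens if not tok.isspace()]
word_freq = Counter(words)
print(word_freq.most_common(2))
```
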

# Visualizing the dependency parse

Dependency parsing is a language processing technique that analyzes how a sentence is constructed to determine how its individual words relate to each other, which helps in working out the meaning of the sentence.


From the Stanford typed dependencies manual:

• nsubj: nominal subject. A nominal subject is a noun phrase which is the syntactic subject of a clause.  
• det: determiner  

In [None]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"this is a sentence.")
displacy.serve (doc, style="dep")

# Dependency Parsing


In [None]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
docp = nlp("In pursuit of a wall, President Trump ran into one.")
# For each noun chunk: its text, its root word, the root's dependency label and the root's head
for chunk in docp.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)
displacy.render(docp, style="dep", jupyter=True)

output:

pursuit pursuit pobj In  
a wall wall pobj of  
President Trump Trump nsubj ran  
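
Each output row links a word to its head through a dependency label, so a parse can be treated as a list of (child, label, head) triples. The sketch below shows one way to query such triples in plain Python. The `arcs` list is hand-copied for illustration rather than produced by running spaCy, and `words_with_label` is a hypothetical helper.

```python
# (child, dependency label, head) triples, written by hand for illustration.
arcs = [
    ("pursuit", "pobj", "In"),
    ("wall", "pobj", "of"),
    ("Trump", "nsubj", "ran"),
]

def words_with_label(arcs, label):
    """Return (child, head) pairs whose dependency label matches."""
    return [(child, head) for child, dep, head in arcs if dep == label]

# The nominal subject (nsubj) tells us who performs the action.
subjects = words_with_label(arcs, "nsubj")
print(subjects)

# Objects of prepositions (pobj) and the prepositions that govern them.
print(words_with_label(arcs, "pobj"))
```
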

# Rule-based matching

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")
from spacy.matcher import Matcher
# Matcher is initialized with the shared vocab
matcher = Matcher(nlp.vocab)
# Each dict represents one token and its attributes
pattern = [{"LOWER": "new"}, {"LOWER": "york"}]
# Add with ID, optional callback and pattern(s) (spaCy v2 signature)
matcher.add('CITIES', None, pattern)
# Match by calling the matcher on a Doc object
doc = nlp("I live in New York")
matches = matcher(doc)
# Matches are (match_id, start, end) tuples
for match_id, start, end in matches:
    # Get the matched span by slicing the Doc
    span = doc[start:end]
    print(span.text)

Output:  New York
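
To show what a token pattern such as `[{"LOWER": "new"}, {"LOWER": "york"}]` expresses, here is a rough plain-Python sketch that slides a window over lowercased tokens. `match_lower` is a hypothetical helper written for this example; it only mimics the `LOWER` attribute and is not how spaCy's `Matcher` is implemented.

```python
def match_lower(tokens, pattern):
    """Return (start, end) index pairs where the lowercased tokens
    equal the pattern. Hypothetical helper, not spaCy's Matcher."""
    lowered = [tok.lower() for tok in tokens]
    size = len(pattern)
    return [
        (start, start + size)
        for start in range(len(lowered) - size + 1)
        if lowered[start:start + size] == pattern
    ]

tokens = ["I", "live", "in", "New", "York"]
matches = match_lower(tokens, ["new", "york"])
for start, end in matches:
    # Slice the token list the same way a Doc is sliced into a span.
    print(" ".join(tokens[start:end]))
```

As with the real matcher, each match is reported as token offsets, and the matched text is recovered by slicing.
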

# Machine Learning with text using spaCy

In [None]:
# ENGLISH_STOP_WORDS is importable from sklearn.feature_extraction.text;
# the separate stop_words module was removed in newer scikit-learn releases
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
import string
punctuations = string.punctuation
from spacy.lang.en import English
parser = English()
#Custom transformer using spaCy
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]
    def fit(self, X, y=None, **fit_params):
        return self
    def get_params(self, deep=True):
        return {}

# Basic utility function to clean the text
def clean_text(text):
    return text.strip().lower()

#Create spacy tokenizer that parses a sentence and generates tokens
#these can also be replaced by word vectors
def spacy_tokenizer(sentence):
    tokens = parser(sentence)
    # Use the lowercased lemma, except for pronouns, which spaCy v2 lemmatizes to "-PRON-"
    tokens = [tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_ for tok in tokens]
    tokens = [tok for tok in tokens if (tok not in stopwords and tok not in punctuations)]
    return tokens

#create vectorizer object to generate feature vectors, we will use custom spacy's tokenizer
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, ngram_range=(1, 1))
classifier = LinearSVC()

# Create the pipeline to clean, tokenize, vectorize, and classify
pipe = Pipeline([("cleaner", predictors ()),('vectorizer', vectorizer), ('classifier', classifier)])

# Load sample data
train = [('I love this sandwich.','pos'),('this is an amazing place!', 'pos'),('I feel very good about these beers.','pos'),
         ('this is my best work.', 'pos'),('I am tired of this stuff.', 'neg'),("I can't deal with this","neg"),
         ('he is my sworn enemy!','neg'),('my boss is horrible.','neg')]
test = [('the beer was good.','pos'),('I do not enjoy my job','neg'),("I ain't feelin dandy today.",'neg'),
        ("I feel amazing!",'pos'),('Gary is a good friend of mine.','pos'),("I can't believe I'm doing this",'neg')]

# Create model and measure accuracy
pipe.fit([x[0] for x in train], [x[1] for x in train])
pred_data = pipe.predict([x[0] for x in test])
for (sample, pred) in zip(test, pred_data):
    print(sample, pred)
print("Accuracy:", accuracy_score([x[1] for x in test], pred_data))
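
The `CountVectorizer` step in the pipeline turns each tokenized text into a vector of token counts over a shared vocabulary. The plain-Python sketch below illustrates that bag-of-words idea. `bag_of_words` is a hypothetical helper written for this example, and the sample token lists are made up; scikit-learn's actual implementation returns a sparse matrix and offers many more options.

```python
from collections import Counter

def bag_of_words(docs):
    """Toy count vectorizer: docs is a list of token lists."""
    # The vocabulary is every distinct token, in sorted order.
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # One count per vocabulary entry; missing tokens count as 0.
        vectors.append([counts[tok] for tok in vocab])
    return vocab, vectors

docs = [["i", "love", "this", "sandwich"], ["this", "is", "this"]]
vocab, vectors = bag_of_words(docs)
print(vocab)
print(vectors)
```

A classifier such as `LinearSVC` is then trained on these count vectors rather than on the raw text.
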


# References:

https://medium.com/@manivannan_data/spacy-named-entity-recognizer-4a1eeee1d749

https://nlpforhackers.io/complete-guide-to-spacy/

https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/

https://spacy.io/usage/linguistic-features

https://stackabuse.com/python-for-nlp-tokenization-stemming-and-lemmatization-with-spacy-library/

https://towardsdatascience.com/custom-named-entity-recognition-using-spacy-7140ebbb3718

https://blog.ekbana.com/nlp-for-beninners-using-spacy-6161cf48a229

https://spacy.io/usage/visualizers

https://www.datacamp.com/community/blog/spacy-cheatsheet

https://www.analyticsvidhya.com/blog/2017/04/natural-language-processing-made-easy-using-spacy-%E2%80%8Bin-python/
