# spaCy Tool

spaCy is an advanced open-source library for Natural Language Processing and one of the fastest available, often compared with NLTK. It exposes its methods and functionality through a Python API. spaCy is installed as a Python package, and trained models and data must be downloaded separately before the library can be explored. Loading a model creates a processing pipeline. The library provides various models that include vocabularies, syntax, word vectors, and entities. spaCy supports languages such as Polish, Portuguese, English, Spanish, Italian, Greek, German, and Danish.


# Import and Load Model for English Language

In [242]:
import spacy  # import the spaCy package
from spacy import displacy  # visualizer used in the dependency and entity sections
data = spacy.load("en_core_web_sm")  # load() returns the English pipeline en_core_web_sm

In [243]:
text = """I think most physicists would agree that Hawking’s greatest contribution is the prediction that black holes emit radiation,” says Sean Carroll, a theoretical physicist at the California Institute of Technology. “While we still don’t have experimental confirmation that Hawking’s prediction is true, nearly every expert believes he was right."""
doc = data(text)  # text is defined above; calling the pipeline object data returns a Doc object holding all annotations

# Preprocessing Step: Tokenization and Stop Words

In spaCy, each text is segmented into tokens, and the resulting Doc can also be read sentence by sentence. Tokenization identifies the small units in a text. This step is important because it divides the text into meaningful words.

In [244]:
Eng_token = [token.text for token in doc]  # divide the text into tokens
print("Tokenized words : " ,Eng_token[:9])

Tokenized words :  ['I', 'think', 'most', 'physicists', 'would', 'agree', 'that', 'Hawking', '’s']


Why remove stop words: a large share of the words in a text, for example four out of every five, can be stop words that never tell what the text is about. If stop words are kept, it is harder to derive meaningful insights from the text. Therefore, stop words should be removed.

In [245]:
Without_stopwords_token = [token for token in doc if not token.is_stop]  # keep only tokens that are not stop words
print(Without_stopwords_token[:10])

[think, physicists, agree, Hawking, greatest, contribution, prediction, black, holes, emit]
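
Conceptually, a stop-word filter is a membership test against a fixed word list. The sketch below reproduces that idea in plain Python on the nine tokens printed earlier, so it runs without a model. The `STOP_WORDS` set is a tiny hand-picked assumption for illustration, not spaCy's actual list, and `remove_stop_words` is a hypothetical helper rather than a spaCy function.

```python
# Tokens copied from the tokenization output above.
tokens = ["I", "think", "most", "physicists", "would", "agree", "that", "Hawking", "’s"]

# Assumption: a tiny hand-picked stop list for illustration only.
# spaCy ships a much larger, language-specific list.
STOP_WORDS = {"i", "most", "would", "that", "’s"}


def remove_stop_words(words, stop_words):
    """Return the words whose lowercased form is not in stop_words."""
    return [w for w in words if w.lower() not in stop_words]


content_tokens = remove_stop_words(tokens, STOP_WORDS)

# Fraction of the original tokens that were dropped as stop words.
stop_ratio = 1 - len(content_tokens) / len(tokens)

print(content_tokens)
print(f"Share of stop words: {stop_ratio:.0%}")
```

With this list, four of the nine tokens survive, which matches the first four tokens in the spaCy output above. More than half of this short span is filtered out, which is why the step shrinks the text so noticeably.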


# Part-of-Speech tags

Part-of-speech (POS) tagging is the process of assigning a part of speech to each token.
Parts of speech include verbs, nouns, adjectives, and so on. The input to POS tagging is a sequence of tokens, and the output is the same tokens with their tags.

In [246]:
# token.tag_: fine-grained part of speech; token.pos_: coarse-grained part of speech
for token in doc[:2]:
    print(token, token.tag_, token.pos_, spacy.explain(token.tag_))

I PRP PRON pronoun, personal
think VBP VERB verb, non-3rd person singular present


# Dependency Parsing

Dependency parsing is the process of analyzing a sentence and representing its syntactic structure.
It shows the dependencies between dependents and their headwords. The head of the sentence has no dependency and is called the root of the sentence. [2]
Dependencies can be drawn as a graph in which words are nodes and grammatical relations are edges. Dependency parsing is also used in Named Entity Recognition. [2]

In [247]:
# dependency parsing: token.head is the headword, token.dep_ is the relation (nsubjpass = passive subject whose headword is the verb)
text1 = data("Stephen Hawking is regarded as one of the most brilliant theoretical physicists since Einstein")
for token in text1[:3]:
    print(token.text, token.tag_, token.head.text, token.dep_)
# displacy renders a visualization of the dependency parse
displacy.render(text1, style='dep', options={'distance': 90})

Stephen NNP Hawking compound
Hawking NNP regarded nsubjpass
is VBZ regarded auxpass


# Named Entity Recognition

Named Entity Recognition (NER) is a technique for locating named entities in text and classifying them into predefined categories
such as organizations, percentages, and locations [1]. NER can be used to learn about the meaning of a text [1].
The ents property of the doc object is used to extract named entities.

In [248]:
# ent.label_: the label assigned to the entity
# ent.text: Unicode text of the entity
# spacy.explain: gives a detailed description of the entity label
for ent in doc.ents:
    print(ent.text, ent.label_, spacy.explain(ent.label_))

Hawking NORP Nationalities or religious or political groups
Sean Carroll PERSON People, including fictional
the California Institute of Technology ORG Companies, agencies, institutions, etc.


In [249]:
displacy.render(doc, style='ent')  # style='ent' selects the named-entity view; displacy renders a visualization of the entities

# Conclusion:

In NLP applications, Named Entity Recognition is used to detect and classify entities.
spaCy is quite fast at dependency parsing and has fewer modules than NLTK. Although spaCy offers models for many languages, each loaded pipeline handles a single language.
spaCy also has integrated support for word vectors.

# References

[1] Schmitt, X., J. Robert, M. Papadakis, Y. Letraon, and S. Kubler. 2020. “A Replicable Comparison Study of NER Software: StanfordNLP, NLTK, OpenNLP, SpaCy, Gate.” 2019 6th International Conference on Social Networks Analysis, Management and Security, SNAMS 2019, 338–43. Accessed June 22. doi:10.1109/SNAMS.2019.8931850.

[2] T. Vakare, K. Verma and V. Jain, "Sentence Semantic Similarity Using Dependency Parsing," 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kanpur, India, 2019, pp. 1-4, doi: 10.1109/ICCCNT45670.2019.8944671.

[3] Colic, Nico, and Fabio Rinaldi. 2019. “Improving SpaCy Dependency Annotation and PoS Tagging Web Service Using Independent NER Services.” Genomics & Informatics 17 (2): e21. https://doi.org/10.5808/gi.2019.17.2.e21.
