# spaCy Library

spaCy is an open-source library for natural language processing (NLP), designed to be fast and efficient.  
It is written in Python and Cython.  
### Q) How is spaCy different from NLTK?  
- spaCy is faster and more scalable, and is aimed at industrial and production use, whereas NLTK is generally slower because of its modularity and is used mainly for academic and research purposes.  
- spaCy has built-in pipeline components that perform text preprocessing steps such as POS tagging and NER.  
- Tokenization in spaCy is rule-based and handled by one tokenizer per language, whereas NLTK offers several separate tokenization methods to choose from.  
- spaCy supports word vectors, whereas NLTK has only limited support for them.

### Q) What are the main components of spaCy objects?  
#### Doc
 - a container that holds the processed text together with its tokens and annotations

#### Token
 - represents a single token: a word, punctuation mark, or symbol

#### Span
  - represents a slice of a Doc
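
As a minimal sketch of how these three objects relate, the snippet below builds a `Doc` with a tokenizer-only pipeline, picks out a `Token`, and slices a `Span`. It assumes spaCy v3 is installed; the sample sentence and variable names are illustrative and not part of the original notes.

```python
import spacy
from spacy.tokens import Doc, Span, Token

# Tokenizer-only English pipeline; no trained model download is needed
nlp = spacy.blank("en")

# Doc: the container produced by calling the pipeline on text
doc = nlp("spaCy makes text processing simple.")

# Token: a single element of the Doc, accessed by index
token = doc[0]

# Span: a slice of the Doc, here the tokens at positions 1 and 2
span = doc[1:3]

print(type(doc).__name__, len(doc))
print(type(token).__name__, token.text)
print(type(span).__name__, span.text)
```

Indexing a `Doc` with an integer gives a `Token`, while slicing it gives a `Span`, so both are views onto the same underlying document.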

In [54]:
## 🔹 Customizing Tokenization Rules in spaCy

# You can customize tokenization in spaCy by modifying the Tokenizer settings.

### 1️⃣ Customizing Tokenizer Prefixes, Suffixes, and Infixes
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")
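infixes = (r"-", r"\.")  # Custom infix rules (raw strings keep the regex escape intact)
infix_re = compile_infix_regex(infixes)

# Note: a Tokenizer built this way has no prefix/suffix rules or default exceptions,
# so trailing punctuation (e.g. the "!" in "spaCy!") is no longer split off.
nlp.tokenizer = Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)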
doc = nlp("custom-tokenization example")
print([token.text for token in doc])

### 2️⃣ Adding Special Cases to the Tokenizer
from spacy.symbols import ORTH

special_case = [{ORTH: "spaCy"}]
nlp.tokenizer.add_special_case("spaCy", special_case)
doc = nlp("I love spaCy!")
print([token.text for token in doc])


['custom', '-', 'tokenization', 'example']
['I', 'love', 'spaCy!']
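
The second output above keeps `spaCy!` as one token because the replacement `Tokenizer` was created without prefix or suffix rules. One way to avoid that, sketched here under the assumption of spaCy v3, is to extend the default infix patterns and reassign only `infix_finditer` on the existing tokenizer. The underscore pattern and the sample text are illustrative choices.

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")  # tokenizer-only pipeline with the default English rules

# Keep the default infix patterns and add one that splits on "_" between letters
infixes = list(nlp.Defaults.infixes) + [r"(?<=[a-zA-Z])_(?=[a-zA-Z])"]
infix_re = compile_infix_regex(infixes)

# Only the infix matcher is swapped; prefix, suffix, and special-case rules stay in place
nlp.tokenizer.infix_finditer = infix_re.finditer

doc = nlp("I love snake_case and spaCy!")
tokens = [token.text for token in doc]
print(tokens)
```

Because the default suffix rules are still active in this variant, the trailing `!` should come out as its own token.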


# Experiment 1

Question: Create a basic NLP program to find words, phrases, names, and concepts, using `spacy.blank` to create the English `nlp` object. Process the text and instantiate a Doc object in the variable `doc`. Select the first token of the Doc and print its text.  


In [27]:
#import the spacy library
import spacy

In [35]:
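# load the small English pipeline; spacy.blank("en") would provide only a tokenizer,
# while the POS tags, sentences, and entities below need the trained components
nlp = spacy.load("en_core_web_sm")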

In [36]:
text = "I love... visiting New-York in the summer. The best time of the year."

In [37]:
# send the text to NLP object and the doc object holds the processed text
doc = nlp(text)

In [38]:
# index the Doc to access a token; doc[0] returns the first token
first_token = doc[0]
print("the first token is -", first_token.text)

the first token is - I
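
The question asks for `spacy.blank`, which creates a pipeline containing only the tokenizer. A minimal sketch of that variant, assuming spaCy v3, is shown here. It is enough for selecting the first token, but attributes such as `pos_` stay empty because no tagger runs. The names `nlp_blank` and `doc_blank` are used only to avoid overwriting the notebook's `nlp` and `doc`.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components
nlp_blank = spacy.blank("en")

doc_blank = nlp_blank("I love... visiting New-York in the summer. The best time of the year.")

# Select the first token of the Doc and print its text
first = doc_blank[0]
print("first token:", first.text)

# No components are loaded, so no part-of-speech tag has been assigned
print("pipeline components:", nlp_blank.pipe_names)
print("POS of first token:", repr(first.pos_))
```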


In [39]:
#to print all the tokens in the given text
print([token.text for token in doc])

['I', 'love', '...', 'visiting', 'New', '-', 'York', 'in', 'the', 'summer', '.', 'The', 'best', 'time', 'of', 'the', 'year', '.']


In [40]:
# extract words while ignoring punctuation
words = [token.text for token in doc if not token.is_punct]
print("Words:", words)

Words: ['I', 'love', 'visiting', 'New', 'York', 'in', 'the', 'summer', 'The', 'best', 'time', 'of', 'the', 'year']


In [41]:
#extract sentences from the doc object
sentences = [sent.text for sent in doc.sents]
print("Sentences:", sentences)

Sentences: ['I love... visiting New-York in the summer.', 'The best time of the year.']
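
With `en_core_web_sm`, the sentence boundaries behind `doc.sents` typically come from the trained dependency parser. When only sentence splitting is needed, spaCy's rule-based `sentencizer` component can be added to a blank pipeline instead. This is a sketch assuming spaCy v3, where `add_pipe` takes the component name as a string; the sample text and variable names are illustrative.

```python
import spacy

nlp_rules = spacy.blank("en")
# Rule-based splitter that marks boundaries after sentence-final punctuation such as ".", "!" and "?"
nlp_rules.add_pipe("sentencizer")

doc_rules = nlp_rules("spaCy is fast. It is written in Python and Cython.")
rule_sentences = [sent.text for sent in doc_rules.sents]
print(rule_sentences)
```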


In [47]:
# extract the coarse-grained part-of-speech tag of each token
for token in doc:
    print(f"TOKEN:{token.text},POS:{token.pos_}")

TOKEN:I,POS:PRON
TOKEN:love,POS:VERB
TOKEN:...,POS:PUNCT
TOKEN:visiting,POS:VERB
TOKEN:New,POS:PROPN
TOKEN:-,POS:PUNCT
TOKEN:York,POS:PROPN
TOKEN:in,POS:ADP
TOKEN:the,POS:DET
TOKEN:summer,POS:NOUN
TOKEN:.,POS:PUNCT
TOKEN:The,POS:DET
TOKEN:best,POS:ADJ
TOKEN:time,POS:NOUN
TOKEN:of,POS:ADP
TOKEN:the,POS:DET
TOKEN:year,POS:NOUN
TOKEN:.,POS:PUNCT


In [51]:
# extract named entities from the text
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

Entity: New-York, Label: GPE
Entity: the summer, Label: DATE
Entity: the year, Label: DATE
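
The tag and label abbreviations printed above (for example `PROPN`, `ADP`, `GPE`) can be looked up with `spacy.explain`, which returns a short description, or `None` for a name it does not know. The list of names below is simply copied from the outputs in this notebook.

```python
import spacy

# Labels copied from the POS and entity outputs above
labels = ["PRON", "PROPN", "ADP", "DET", "GPE", "DATE"]

# spacy.explain needs no loaded pipeline; it reads from spaCy's built-in glossary
descriptions = {label: spacy.explain(label) for label in labels}
for label, description in descriptions.items():
    print(f"{label}: {description}")
```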


In [48]:
#Extract the stop words from the text
stop_words = [token.text for token in doc if token.is_stop]
print("Stop words are: ",stop_words)

Stop words are:  ['I', 'in', 'the', 'The', 'of', 'the']


In [50]:
#generate a text without the stop words
non_stop_words = [token.text for token in doc if not token.is_stop]
filtered=' '.join(non_stop_words)
print(f"original text: {text}")
print(f"Filtered text: {filtered}")


original text: I love... visiting New-York in the summer. The best time of the year.
Filtered text: love ... visiting New - York summer . best time year .
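
The filtered text above still contains punctuation tokens such as `...` and `-`. A small variation drops both stop words and punctuation. It is sketched here with a blank pipeline, since `is_stop` and `is_punct` are lexical attributes that need no trained model. The variable names are illustrative.

```python
import spacy

nlp_lex = spacy.blank("en")
doc_lex = nlp_lex("I love... visiting New-York in the summer. The best time of the year.")

# Keep tokens that are neither stop words nor punctuation
content_words = [t.text for t in doc_lex if not t.is_stop and not t.is_punct]
cleaned = " ".join(content_words)
print(f"Cleaned text: {cleaned}")
```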
