**Importing spacy and loading en_core_web_sm**

`spacy.load` returns a pipeline object (conventionally named `nlp`) for a trained English model. This notebook uses the small model, en_core_web_sm. The larger en_core_web_lg model can be loaded the same way and additionally ships word vectors.

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')
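Which model is installed varies from machine to machine, so it can help to fall back gracefully. The following is a minimal sketch, not part of the original notebook: `load_english` is a hypothetical helper name, and the fallback to a blank English pipeline (tokenizer only, no tagger or parser) is an assumption about what is acceptable when no trained model is installed.

```python
import spacy
from spacy.util import is_package


def load_english(preferred=("en_core_web_lg", "en_core_web_sm")):
    """Return the first installed model from `preferred`, else a blank English pipeline."""
    for name in preferred:
        # is_package() reports whether the model is installed as a Python package
        if is_package(name):
            return spacy.load(name)
    # No trained model found: a blank pipeline still tokenizes, but has no POS tags or parser
    return spacy.blank("en")


nlp = load_english()
print(nlp.lang, nlp.pipe_names)
```

With a blank pipeline, `nlp.pipe_names` is empty, which is a quick way to tell which branch was taken.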

**Simple Tokenization**

We have already seen tokenization with nltk. Let's see how to do the same with spaCy.

In [3]:
s = nlp('GFG is looking for data science interns')
for token in s:
    print(token.text)

s = nlp('The cost of Iphone in U.K is 699$')
for token in s:
    print(token.text)

GFG
is
looking
for
data
science
interns
The
cost
of
Iphone
in
U.K
is
699
$


The tokenizer keeps U.K as a single token, while it splits 699$ into the two tokens 699 and $.

In [5]:
# token.pos_ holds the coarse-grained part-of-speech tag of each token
for token in s:
    print(token.text, token.pos_)
# The tag identifies the grammatical role of each word, which is useful for text analysis and other NLP tasks.

The DET
cost NOUN
of ADP
Iphone PROPN
in ADP
U.K PROPN
is AUX
699 NUM
$ SYM




*   U.K is a proper noun (PROPN)
*   cost is a noun (NOUN)
*   699 is a number (NUM)
*   $ is a symbol (SYM)
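The tag abbreviations in the output above can be looked up with `spacy.explain`, which returns a short description for a tag name and needs no trained model. A small sketch (the variable name `descriptions` is only for illustration):

```python
import spacy

# Tags that appear in the output above
tags = ["DET", "NOUN", "ADP", "PROPN", "AUX", "NUM", "SYM"]

# spacy.explain() maps a tag to a human-readable description (or None if unknown)
descriptions = {tag: spacy.explain(tag) for tag in tags}

for tag, meaning in descriptions.items():
    print(f"{tag:6} {meaning}")
```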








---



**Sentence Tokenization**

In [6]:
s=nlp(u"This is the first sentence. I gave given fullstop please check. Let's study now")
for sentence in s.sents:
    print(sentence)

This is the first sentence.
I gave given fullstop please check.
Let's study now
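In the cell above, the sentence boundaries come from the trained pipeline. When only punctuation-based splitting is needed, spaCy also offers a rule-based `sentencizer` component. The sketch below assumes spaCy v3 (where components are added by name) and uses a blank English pipeline, so no model download is required.

```python
import spacy

# Blank pipeline: tokenizer only, plus the rule-based sentence splitter
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")

doc = nlp_rules("This is the first sentence. I gave given fullstop please check. Let's study now")
sentences = [sentence.text for sentence in doc.sents]

for sentence in sentences:
    print(sentence)
```

The sentencizer splits on sentence-final punctuation only, so it is faster than a trained pipeline but can be less accurate on text with abbreviations or missing punctuation.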




---



# Stop Words using spacy

In [14]:
import spacy

nlp = spacy.load('en_core_web_sm')

print(nlp.Defaults.stop_words)

print()

print(len(nlp.Defaults.stop_words))

#  This is the full set of stop words that spaCy provides by default.

#  In the following cells we check, add, and remove stop words, then filter them out of a corpus.

#  The set contains 326 stop words in total.

{'besides', 'rather', 'this', 'then', 'hundred', 'am', 'became', 'eleven', 'least', 'toward', 'hence', 'hereafter', 'an', 'has', '’ll', 'indeed', 'full', 'when', 'bottom', 'she', 'would', 'always', 'nine', 'well', 'whereupon', 'he', 'most', 'due', 'namely', 'somewhere', 'along', 'two', 'still', 'keep', 'thereafter', 'per', 'something', 'towards', 'others', 'by', 'become', 'therein', 'do', 'herself', 'you', 'all', 'now', 'both', 'nor', 'twenty', 'too', 'mostly', 'yours', 'amongst', 'various', 'make', 'and', 'at', 'will', 'much', 'thence', 'whether', 'everywhere', 'your', 'fifty', 'four', '’re', 'than', 'yourself', 'ten', 'to', 'until', 'latter', 'sometime', 'anyhow', 'it', 'move', 'could', 'anyone', 'therefore', 'the', 'even', 'they', "'s", '’d', 'twelve', 'while', 'third', 'thus', 'beforehand', 'afterwards', 'over', 'whereas', 'only', 'moreover', 'give', 'none', 'off', 'these', 'from', 'because', 'of', 'many', 'behind', '‘d', '’m', 'is', 'did', 'everything', 'though', 'n‘t', 'becomes',

In [15]:
# Check whether a particular word is a stop word
print(nlp.vocab['is'].is_stop)

print(nlp.vocab['GFG'].is_stop)

print(nlp.vocab['hello'].is_stop)

print(nlp.vocab['was'].is_stop)

True
False
False
True


In [16]:
# To add a stop word of our own, set is_stop on its vocabulary entry.
print(f"Before: {nlp.vocab['i.e'].is_stop}")
nlp.vocab['i.e'].is_stop = True
print(f"After: {nlp.vocab['i.e'].is_stop}")

Before: False
After: True
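To see the effect of a custom stop word on actual text, the flag can be checked on the tokens of a processed document. This is a sketch under assumptions: "GFG" is a hypothetical domain-specific word chosen for illustration, and a blank English pipeline is used so the example runs without a trained model. The flag is set on one exact string, so the lowercase form is a separate vocabulary entry.

```python
import spacy

nlp_custom = spacy.blank("en")  # tokenizer and default stop-word list only

# Mark the exact string "GFG" as a stop word (hypothetical choice for this example)
nlp_custom.vocab["GFG"].is_stop = True

doc = nlp_custom("GFG is looking for data science interns")
kept = [token.text for token in doc if not token.is_stop]
print(kept)

# The lowercase spelling was never flagged, so it is still not a stop word
print(nlp_custom.vocab["gfg"].is_stop)
```

If a word should be ignored regardless of casing, each spelling that occurs in the corpus would need to be flagged.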


In [18]:
# Let’s see how to remove a stop word now
print(f"Before: {nlp.vocab['done'].is_stop}")
nlp.Defaults.stop_words.discard('done')  # discard() does not raise if the word is already absent
nlp.vocab['done'].is_stop = False

print(f"After: {nlp.vocab['done'].is_stop}")

Before: True
After: False


In [26]:
# Now let's see how we can remove stop words from our corpus
txt='''Data science is the study of data. Like biological sciences is a study of biology, physical sciences, it's the study of physical reactions. Data is real, data has real properties, and we need to study them if we're going to work on them. Data Science involves data and some signs. It is a process, not an event. It is the process of using data to understand too many different things, to understand the world. Let Suppose when you have a model or proposed explanation of a problem, and you try to validate that proposed explanation or model with your data. It is the skill of unfolding the insights and trends that are hiding (or abstract) behind data. It's when you translate data into a story. So use storytelling to generate insight. And with these insights, you can make strategic choices for a company or an institution. We can also define data science as a field that is about processes and systems to extract data of various forms and from various resources whether the data is unstructured or structured.
The definition and the name came up in the 1980s and 1990s when some professors, IT Professionals, scientists were looking into the statistics curriculum, and they thought it would be better to call it data science and then later on data analytics derived.
'''
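txt = txt.replace('\n', ' ')   # replace newlines with a space so adjacent sentences do not run together
txt = txt.replace('  ', ' ')   # collapse double spaces into one
txt = txt.strip()
txt = nlp(txt)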

Finding the stop words in the corpus:

In [27]:
stop_words = set()
for token in txt:
  if token.is_stop:
    stop_words.add(token.text)
print(stop_words)
print(len(stop_words))

{'the', 'also', 'of', 'they', "'s", 'many', 'are', 'can', 'call', 'And', 'then', 'behind', 'whether', 'some', 'We', 'we', 'were', 'your', "'re", 'into', 'is', 'or', 'you', 'not', 'an', 'It', 'has', 'about', 'them', 'name', 'to', 'when', 'have', 'as', 'in', 'would', 'up', 'IT', 'that', 'for', 'be', 'too', 'it', 'these', 'if', 'various', 'on', 'The', 'using', 'So', 'make', 'from', 'a', 'and', 'with'}
55
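The set above counts 'We' and 'we', or 'The' and 'the', as different entries because it stores `token.text` as written. One possible refinement, sketched here on a short made-up sentence rather than the corpus above, is to lowercase each stop word and count how often it occurs. A blank English pipeline is assumed so that no model download is needed.

```python
from collections import Counter

import spacy

nlp_count = spacy.blank("en")
doc = nlp_count("We said we would, and we did.")

# token.lower_ is the lowercased token text, so "We" and "we" are counted together
stop_counts = Counter(token.lower_ for token in doc if token.is_stop)

print(stop_counts.most_common())
```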


In [29]:
# printing corpus without the stopwords
' '.join([token.text for token in txt if not token.is_stop])

'Data science study data . Like biological sciences study biology , physical sciences , study physical reactions . Data real , data real properties , need study going work . Data Science involves data signs . process , event . process data understand different things , understand world . Let Suppose model proposed explanation problem , try validate proposed explanation model data . skill unfolding insights trends hiding ( abstract ) data . translate data story . use storytelling generate insight . insights , strategic choices company institution . define data science field processes systems extract data forms resources data unstructured structured . definition came 1980s 1990s professors , Professionals , scientists looking statistics curriculum , thought better data science later data analytics derived .'
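The joined output above still contains punctuation tokens such as `.` and `,`, each surrounded by spaces. If those are not wanted either, `token.is_punct` can be added to the filter. A minimal sketch on the first sentence of the corpus, assuming a blank English pipeline:

```python
import spacy

nlp_filter = spacy.blank("en")
doc = nlp_filter("Data science is the study of data.")

# Keep a token only if it is neither a stop word nor punctuation
content_words = [token.text for token in doc if not token.is_stop and not token.is_punct]

print(' '.join(content_words))
```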



---



# **Synonyms and Antonyms**

**Synonym**

A synonym is a word or phrase with the same or nearly the same meaning as another word or phrase.

**Antonym**

An antonym is a term or phrase that has the opposite meaning.

**Wordnet**

WordNet is a lexical database of English, similar to a dictionary, that groups words into sets of synonyms (synsets) and is widely used in natural language processing.

In [33]:
import nltk
from nltk.corpus import wordnet
#  the WordNet corpus has to be downloaded separately through nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [35]:
# Using the definition() method to get the definition of a word.
syn = wordnet.synsets('Book')
print(syn[0].definition())
# synsets() returns one synset per sense of "book";
# index [0] selects only the first of them

a written work or composition that has been published (printed on pages bound together)


**Getting the synonyms**

In [39]:
synonyms = []
for s in wordnet.synsets('Happy'):
  for lemma in s.lemmas():
    synonyms.append(lemma.name())

print(synonyms)

['happy', 'felicitous', 'happy', 'glad', 'happy', 'happy', 'well-chosen']
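The word 'happy' appears several times in this list because it is a lemma in more than one synset. A small sketch of removing the repeats while keeping the order of first appearance, using the output above as input (`unique_synonyms` is a name introduced here for illustration):

```python
# Lemma names as printed above, including repeats
synonyms = ['happy', 'felicitous', 'happy', 'glad', 'happy', 'happy', 'well-chosen']

# dict keys are unique and keep insertion order, so this drops later duplicates
unique_synonyms = list(dict.fromkeys(synonyms))

print(unique_synonyms)
# ['happy', 'felicitous', 'glad', 'well-chosen']
```

A plain `set(synonyms)` would also remove duplicates, but it does not preserve the order in which the synsets were returned.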


**Getting the antonyms**

In [44]:
ant = []
for a in wordnet.synsets('Healthy'):
  for lemma in a.lemmas():
    if lemma.antonyms():
      ant.append(lemma.antonyms()[0].name())

print(ant)

['unhealthy']
