# SPACY

In [1]:
!pip install spacy

Collecting numpy>=1.19.0 (from spacy)
  Using cached numpy-2.0.2-cp311-cp311-win_amd64.whl.metadata (59 kB)
Using cached numpy-2.0.2-cp311-cp311-win_amd64.whl (15.9 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.24.3
    Uninstalling numpy-1.24.3:
      Successfully uninstalled numpy-1.24.3
Successfully installed numpy-2.0.2


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gensim 4.3.0 requires FuzzyTM>=0.4.0, which is not installed.
astropy 5.3.4 requires numpy<2,>=1.21, but you have numpy 2.0.2 which is incompatible.
contourpy 1.2.0 requires numpy<2.0,>=1.20, but you have numpy 2.0.2 which is incompatible.
matplotlib 3.8.0 requires numpy<2,>=1.21, but you have numpy 2.0.2 which is incompatible.
numba 0.59.0 requires numpy<1.27,>=1.22, but you have numpy 2.0.2 which is incompatible.
pywavelets 1.5.0 requires numpy<2.0,>=1.22.4, but you have numpy 2.0.2 which is incompatible.
scipy 1.10.1 requires numpy<1.27.0,>=1.19.5, but you have numpy 2.0.2 which is incompatible.
streamlit 1.30.0 requires numpy<2,>=1.19.3, but you have numpy 2.0.2 which is incompatible.


#### For basic NLP tasks, spaCy is generally faster and more efficient than NLTK, at the cost of the user not being able to choose between algorithmic implementations.

In [2]:
import spacy 

### Loading the English core models
1. The <b>en_core_web_lg</b> model is larger and generally more accurate. It ships with a more extensive vocabulary and word vectors, which helps with tasks such as named entity recognition, part-of-speech tagging, and dependency parsing.
2. The <b>en_core_web_sm</b> model is smaller, faster, and less resource-intensive than the large model. It still provides tokenization, part-of-speech tagging, and named entity recognition, but with lower accuracy than the larger models.

In [3]:
!python -m spacy download en_core_web_lg
!python -m spacy download en_core_web_sm

Collecting en-core-web-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)

In [4]:
nlp=spacy.load('en_core_web_sm')
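If the model download did not finish, the `spacy.load` call raises an `OSError`. The sketch below shows one possible way to guard against that. The helper name `load_pipeline` is hypothetical (not part of spaCy), and falling back to a blank pipeline is an assumption made for illustration: a blank pipeline only tokenizes and has no tagger, parser, or named entity recognizer.

```python
import spacy


def load_pipeline(name: str = "en_core_web_sm"):
    """Load a trained pipeline, or fall back to a blank English one (hypothetical helper)."""
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the model package is not installed.
        # The blank pipeline tokenizes only: no tagger, parser, or NER.
        return spacy.blank("en")


nlp = load_pipeline()
print(nlp.lang, nlp.pipe_names)
```

An empty `pipe_names` list is a quick sign that the fallback was used and that attributes such as `pos_` and `dep_` will not be filled in.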

In [5]:
s=nlp(u'The U.K. is known for its rich history, diverse culture, and iconic landmarks like Big Ben, attracting millions of tourists every year from around the world.')

### Simple Tokenization

In [6]:
for token in s:
    print(token)
print(type(token))

The
U.K.
is
known
for
its
rich
history
,
diverse
culture
,
and
iconic
landmarks
like
Big
Ben
,
attracting
millions
of
tourists
every
year
from
around
the
world
.
<class 'spacy.tokens.token.Token'>


In [7]:
## If we want the token text as a plain string
for token in s:
    print(token.text)
print(type(token.text))

The
U.K.
is
known
for
its
rich
history
,
diverse
culture
,
and
iconic
landmarks
like
Big
Ben
,
attracting
millions
of
tourists
every
year
from
around
the
world
.
<class 'str'>


##### <b>NOTE:</b> spaCy is smart enough to keep U.K. together as a single token.
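Tokenization in spaCy is rule-based, so this behaviour can be checked without a trained model. The following is a minimal sketch that assumes a blank English pipeline (`spacy.blank("en")`) uses the same tokenizer rules as the downloaded models; the variable name `nlp_blank` and the shortened sentence are only for illustration.

```python
import spacy

# A blank pipeline contains only the rule-based English tokenizer.
nlp_blank = spacy.blank("en")
doc = nlp_blank("The U.K. is known for its rich history.")

tokens = [token.text for token in doc]
print(tokens)
```

The abbreviation stays in one piece, while the sentence-final period is split off as its own token.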

### Some more functions

##### POS (Part of Speech): token.pos gives the part of speech of each token. It prints an integer ID; we will see below how these numbers map to part-of-speech labels.

In [8]:
for token in s:
    print(token.text,token.pos)

The 90
U.K. 96
is 87
known 100
for 85
its 95
rich 84
history 92
, 97
diverse 84
culture 92
, 97
and 89
iconic 84
landmarks 92
like 85
Big 96
Ben 96
, 97
attracting 100
millions 92
of 85
tourists 92
every 90
year 92
from 85
around 85
the 90
world 92
. 97


##### To get the readable part-of-speech label, use token.pos_ (with a trailing underscore).

In [9]:
for token in s:
    print(token.text,token.pos_)

The DET
U.K. PROPN
is AUX
known VERB
for ADP
its PRON
rich ADJ
history NOUN
, PUNCT
diverse ADJ
culture NOUN
, PUNCT
and CCONJ
iconic ADJ
landmarks NOUN
like ADP
Big PROPN
Ben PROPN
, PUNCT
attracting VERB
millions NOUN
of ADP
tourists NOUN
every DET
year NOUN
from ADP
around ADP
the DET
world NOUN
. PUNCT
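The integer codes printed by `token.pos` and the labels printed by `token.pos_` are two views of the same tag. As a rough sketch, the mapping can be inspected through `spacy.parts_of_speech`, which exposes an `IDS` dictionary (label to integer) and a `NAMES` dictionary (integer to label). The codes below are copied from the earlier output and are assumed to be specific to the spaCy version used in this notebook.

```python
from spacy.parts_of_speech import IDS, NAMES

# Integer codes copied from the token.pos output (assumed version-specific).
codes = [90, 96, 87, 100]
decoded = {code: NAMES[code] for code in codes}
print(decoded)

# The mapping also works in the other direction: label -> integer.
print(IDS["NOUN"])
```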


##### <b>dep_:</b> dep_ gives each token's dependency label. Dependency parsing analyzes the grammatical structure of a sentence by establishing relationships between words; these relationships show how the words depend on each other within the sentence.

In [10]:
s=nlp(u"He isn't going to       play today")

In [11]:
for token in s:
    print(token.text,token.pos_,token.dep_)

He PRON nsubj
is AUX aux
n't PART neg
going VERB ROOT
to PART aux
       SPACE dep
play VERB xcomp
today NOUN npadvmod


##### <b>Inference</b>
'is' and 'n't' are split into separate tokens and categorised differently: 'n't' is treated as a negation (neg).
Because of the extra spaces between 'to' and 'play', the whitespace is also emitted as a token (tagged SPACE).
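The short dependency labels are easier to read with their descriptions next to them. A small sketch using `spacy.explain`, which looks labels up in spaCy's built-in glossary and needs no trained model; the label list is copied from the output above.

```python
import spacy

# Dependency labels copied from the parser output.
labels = ["nsubj", "aux", "neg", "ROOT", "xcomp", "npadvmod"]

# spacy.explain returns a short description, or None for unknown labels.
descriptions = {label: spacy.explain(label) for label in labels}
for label, description in descriptions.items():
    print(f"{label:10} {description}")
```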

#### Print sentences in the essay.

In [12]:
s=nlp(u"This is the first sentence. I gave given fullstop please check. Let's study now")

In [13]:
for sentences in s.sents:
    print(sentences)
    

This is the first sentence.
I gave given fullstop please check.
Let's study now
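In the cell above, `s.sents` relies on the trained pipeline to find sentence boundaries. When only punctuation-based splitting is needed, a lighter alternative is spaCy's rule-based `sentencizer` component. The sketch below is a different technique from the one used above, assumes the spaCy v3 `add_pipe` API, and uses the illustrative name `nlp_rules`.

```python
import spacy

# Blank English pipeline plus the rule-based sentence splitter (spaCy v3 API).
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")

doc = nlp_rules(
    "This is the first sentence. I gave given fullstop please check. Let's study now"
)
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

The sentencizer splits only on sentence-final punctuation, so it can give different results from the trained pipeline on text with unusual punctuation.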


<b>------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
</b>

### StopWords

A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.

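<b>spaCy has a built-in list of English stop words, shared by both English models. The exact count depends on the spaCy version (326 in the version used here, which is why the count below reads 327 after one word is added).</b>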

In [14]:
nlp=spacy.load('en_core_web_sm')
nlp_1=spacy.load('en_core_web_lg')

In [15]:
print(nlp.Defaults.stop_words)
print(len(nlp.Defaults.stop_words))
print(nlp_1.Defaults.stop_words)
print(len(nlp_1.Defaults.stop_words))

{'upon', 'sixty', 'two', 'being', 'on', '‘ve', 'am', 'doing', 'whether', 'amount', 'beforehand', "'m", 'else', 'becomes', 'what', 'some', 'where', 'and', 'six', 'must', 'do', 'anyone', 'themselves', 'was', 'out', 'seeming', 'whence', 'above', 'hereafter', 'i', 'could', 'whoever', 'towards', 'for', 'hundred', 'thereafter', 'almost', 'herself', 'around', '‘d', 'first', 'front', 'them', "'d", 'whereupon', 'across', 'such', 'whom', 'get', 'back', 'it', 'least', 'without', 'most', 'never', 'bottom', 'about', 'will', 'ever', 'behind', 'namely', 'becoming', 'his', 'seem', 'while', 'no', 'besides', 'one', 'beyond', 'why', 'few', 'well', 'once', 'three', 'a', 'anyhow', 'an', 'through', "n't", 'against', 'except', 'should', '‘ll', 'more', 'yours', 'myself', 'anything', 'another', 'latterly', "'re", 'wherever', 'call', 'until', 'many', 'much', 're', 'itself', 'as', 'every', 'nor', 'who', '‘m', 'whenever', 'make', 'your', 'hence', '‘re', 'you', 'several', 'n‘t', '’ve', 'go', 'together', 'off', 'fo

In [16]:
# function to check if word is stop word or not 
print(nlp.vocab['the'].is_stop)
print(nlp.vocab['prime'].is_stop)

True
False
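The same list can also be inspected without loading a model, because it lives in spaCy's English language data as a plain Python set. A minimal sketch (the count is printed rather than hard-coded, since it may differ between spaCy versions):

```python
from spacy.lang.en.stop_words import STOP_WORDS

# STOP_WORDS is an ordinary set of lowercase strings.
print(len(STOP_WORDS))
print("the" in STOP_WORDS, "prime" in STOP_WORDS)

# Peek at a few entries in alphabetical order.
print(sorted(STOP_WORDS)[:10])
```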


##### Adding and removing words from the default stop words


In [17]:
nlp.Defaults.stop_words.add('i.e')
nlp.vocab['i.e'].is_stop = True
len(nlp.Defaults.stop_words)

327

In [18]:
nlp.Defaults.stop_words.remove('done')
nlp.vocab['done'].is_stop = False
nlp.vocab['done'].is_stop

False
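Both cells above update two places: the default stop-word set and the `is_stop` flag on the vocabulary entry. One possible way to keep the two in sync is a small helper. The name `set_stop_word`, the pipeline `nlp_demo`, and the example word are hypothetical and only serve this sketch.

```python
import spacy


def set_stop_word(nlp, word: str, is_stop: bool) -> None:
    """Update the default stop-word set and the vocabulary flag together."""
    if is_stop:
        nlp.Defaults.stop_words.add(word)
    else:
        # discard never raises KeyError, unlike remove
        nlp.Defaults.stop_words.discard(word)
    nlp.vocab[word].is_stop = is_stop


nlp_demo = spacy.blank("en")

set_stop_word(nlp_demo, "btw", True)
added = nlp_demo("btw")[0].is_stop

set_stop_word(nlp_demo, "btw", False)
removed = nlp_demo("btw")[0].is_stop

print(added, removed)
```

Using `discard` instead of `remove` means the helper does not fail when the word was never a stop word in the first place.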

### Removing the stopwords in the text.

In [19]:
s='''
Data science is the study of data. Like biological sciences is a study of biology, physical sciences, it’s the study of physical reactions. Data is real, data has real properties, and we need to study them if we’re going to work on them. Data Science involves data and some signs. 

It is a process, not an event. It is the process of using data to understand too many different things, to understand the world. Let Suppose when you have a model or proposed explanation of a problem, and you try to validate that proposed explanation or model with your data. 

It is the skill of unfolding the insights and trends that are hiding (or abstract) behind data. It’s when you translate data into a story. So use storytelling to generate insight. And with these insights, you can make strategic choices for a company or an institution. 

We can also define data science as a field that is about processes and systems to extract data of various forms and from various resources whether the data is unstructured or structured. 
The definition and the name came up in the 1980s and 1990s when some professors, IT Professionals, scientists were looking into the statistics curriculum, and they thought it would be better to call it data science and then later on data analytics derived. 
'''

In [20]:
# s=s.replace('\n',' ')
# s=s.strip() # strips the leading and trailing whitespace
s
len(s)

1279

In [21]:
s=nlp(s)

In [22]:
stop_words=[]
for token in s:
    if token.is_stop:
        stop_words.append(token)
print(stop_words)
print(len(stop_words))

[is, the, of, is, a, of, it, ’s, the, of, is, has, and, we, to, them, if, we, ’re, to, on, them, and, some, It, is, a, not, an, It, is, the, of, using, to, too, many, to, the, when, you, have, a, or, of, a, and, you, to, that, or, with, your, It, is, the, of, the, and, that, are, or, behind, It, ’s, when, you, into, a, So, to, And, with, these, you, can, make, for, a, or, an, We, can, also, as, a, that, is, about, and, to, of, various, and, from, various, whether, the, is, or, The, and, the, name, up, in, the, and, when, some, IT, were, into, the, and, they, it, would, be, to, call, it, and, then, on]
125


In [23]:
# without stopwords 
tokens=[token for token in s if not token.is_stop]

In [24]:
print(tokens)
print(len(tokens))

[
, Data, science, study, data, ., Like, biological, sciences, study, biology, ,, physical, sciences, ,, study, physical, reactions, ., Data, real, ,, data, real, properties, ,, need, study, going, work, ., Data, Science, involves, data, signs, ., 

, process, ,, event, ., process, data, understand, different, things, ,, understand, world, ., Let, Suppose, model, proposed, explanation, problem, ,, try, validate, proposed, explanation, model, data, ., 

, skill, unfolding, insights, trends, hiding, (, abstract, ), data, ., translate, data, story, ., use, storytelling, generate, insight, ., insights, ,, strategic, choices, company, institution, ., 

, define, data, science, field, processes, systems, extract, data, forms, resources, data, unstructured, structured, ., 
, definition, came, 1980s, 1990s, professors, ,, Professionals, ,, scientists, looking, statistics, curriculum, ,, thought, better, data, science, later, data, analytics, derived, ., 
]
131
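The filtered list above still contains punctuation and newline tokens, because only stop words were removed. If only content words are wanted, the token flags `is_punct` and `is_space` can be combined with `is_stop`. A minimal sketch on a shortened excerpt of the text, assuming a blank English pipeline (these flags are lexical and do not need a trained model):

```python
import spacy

nlp_blank = spacy.blank("en")
doc = nlp_blank("Data science is the study of data.\n\nIt is a process, not an event.")

# Keep tokens that are neither stop words, punctuation, nor whitespace.
content_words = [
    token.text
    for token in doc
    if not (token.is_stop or token.is_punct or token.is_space)
]
print(content_words)
```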


### Synonyms and Antonyms

In [25]:
!pip install numpy==1.24.3
!pip install scipy==1.10.1


Collecting numpy==1.24.3
  Using cached numpy-1.24.3-cp311-cp311-win_amd64.whl.metadata (5.6 kB)
Using cached numpy-1.24.3-cp311-cp311-win_amd64.whl (14.8 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.0.2
    Uninstalling numpy-2.0.2:
      Successfully uninstalled numpy-2.0.2
Successfully installed numpy-1.24.3


  You can safely remove it manually.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gensim 4.3.0 requires FuzzyTM>=0.4.0, which is not installed.
blis 1.0.1 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.24.3 which is incompatible.
tensorflow-intel 2.18.0 requires numpy<2.1.0,>=1.26.0, but you have numpy 1.24.3 which is incompatible.
thinc 8.3.2 requires numpy<2.1.0,>=2.0.0; python_version >= "3.9", but you have numpy 1.24.3 which is incompatible.




In [26]:
import numpy as np
import scipy
import nltk


print("NumPy version:", np.__version__)
print("SciPy version:", scipy.__version__)
print("NLTK version:",nltk.__version__)



ImportError: 

IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!

Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.

We have compiled some common reasons and troubleshooting tips at:

    https://numpy.org/devdocs/user/troubleshooting-importerror.html

Please note and check the following:

  * The Python version is: Python3.11 from "C:\Users\deeps\anaconda3\python.exe"
  * The NumPy version is: "2.0.2"

and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.

Original error was: DLL load failed while importing _multiarray_umath: The specified module could not be found.


In [None]:
# wordnet.synsets() from nltk.corpus returns the synsets (synonym sets) of a given word.
# Requires the WordNet corpus: run nltk.download('wordnet') once if it is missing.
from nltk.corpus import wordnet
syn=wordnet.synsets('Book')
print(syn[0].definition())

##### A lemma is the base form of a word, often the dictionary entry form.
For example:
The lemma of "running" is "run."
The lemma of "happier" is "happy."

In [None]:
## Synonyms
synonyms=[]
for synset in wordnet.synsets("Happy"):
    for lemma in synset.lemmas():
        synonyms.append(lemma.name())
print(synonyms)

In [None]:
## Antonyms

## lemma.antonyms(): Returns a list of antonyms (lemmas) for the given lemma if available.
## lemma.antonyms()[0].name(): Retrieves the name (string representation) of the first antonym lemma.

antonyms=[]
for synset in wordnet.synsets("Happy"):
    for lemma in synset.lemmas():
        if lemma.antonyms():  # not every lemma has an antonym
            antonyms.append(lemma.antonyms()[0].name())
print(antonyms)