<a href="https://colab.research.google.com/github/Satwikram/NLP-Implementations/blob/main/Spacy/Spacy%20Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Author: Satwik Ram K

### Introduction

In [2]:
import spacy

In [4]:
# Load the small English model: https://spacy.io/models
nlp = spacy.load("en_core_web_sm")
nlp

<spacy.lang.en.English at 0x7fbf5317ca90>

In [5]:
my_text = """The economic situation of the country is on edge , as the stock 
market crashed causing loss of millions. Citizens who had their main investment 
in the share-market are facing a great loss. Many companies might lay off 
thousands of people to reduce labor cost"""

In [6]:
my_doc = nlp(my_text)
type(my_doc)

spacy.tokens.doc.Doc

In [7]:
my_doc

The economic situation of the country is on edge , as the stock 
market crashed causing loss of millions. Citizens who had their main investment 
in the share-market are facing a great loss. Many companies might lay off 
thousands of people to reduce labor cost

### Tokenization with spaCy

Tokenization is the process of splitting a text into smaller units, called tokens, according to predefined rules. For example, a sentence is tokenized into words (and, optionally, punctuation), and a paragraph can be tokenized into sentences, depending on the context.
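To illustrate both granularities, here is a minimal sketch that splits a short sample into sentences and into word-level tokens. It assumes spaCy v3 (where `add_pipe` accepts a component name) and uses a blank English pipeline with the rule-based `sentencizer`, so it does not depend on the `en_core_web_sm` model loaded above. The names `nlp_blank`, `sample`, `sentences` and `words` are illustrative only.

```python
import spacy

# Assumption: spaCy v3. A blank pipeline contains only the tokenizer,
# so no trained model has to be downloaded for this sketch.
nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("sentencizer")  # rule-based sentence boundary detection

sample = "The stock market crashed. Many companies might lay off thousands of people."
sample_doc = nlp_blank(sample)

# Paragraph -> sentences
sentences = [sent.text for sent in sample_doc.sents]
# Text -> word and punctuation tokens
words = [token.text for token in sample_doc]

print(sentences)
print(words)
```

A trained pipeline such as `en_core_web_sm` also exposes `doc.sents`, but it derives sentence boundaries from its parser rather than from punctuation rules, so the splits can differ on harder text.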

In [11]:
for token in my_doc:
  print(token.text)

The
economic
situation
of
the
country
is
on
edge
,
as
the
stock


market
crashed
causing
loss
of
millions
.
Citizens
who
had
their
main
investment


in
the
share
-
market
are
facing
a
great
loss
.
Many
companies
might
lay
off


thousands
of
people
to
reduce
labor
cost


### Text-Preprocessing with spaCy

The token list above contains ‘noise’. Words such as ‘the’, ‘was’, and ‘it’ are very common and are referred to as ‘stop words’.

Besides, there is punctuation such as commas, brackets, and full stops, as well as some extra whitespace. The process of removing this noise from the doc is called text cleaning or preprocessing.
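As a quick way to see what counts as a stop word, the sketch below checks a few words against spaCy's built-in English stop word set. It assumes the `STOP_WORDS` set exported by `spacy.lang.en.stop_words`; the word list being checked is just an illustration.

```python
from spacy.lang.en.stop_words import STOP_WORDS

# STOP_WORDS is a plain Python set of lowercase strings
print(len(STOP_WORDS))

# Check a few common words and one content word
for word in ("the", "was", "it", "market"):
    print(word, word in STOP_WORDS)
```

The same set backs the `token.is_stop` attribute, so a word listed here is flagged as a stop word regardless of which English pipeline produced the token.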

In [13]:
for token in my_doc:
  print(token.text, token.is_punct, token.is_stop)

The False True
economic False False
situation False False
of False True
the False True
country False False
is False True
on False True
edge False False
, True False
as False True
the False True
stock False False

 False False
market False False
crashed False False
causing False False
loss False False
of False True
millions False False
. True False
Citizens False False
who False True
had False True
their False True
main False False
investment False False

 False False
in False True
the False True
share False False
- True False
market False False
are False True
facing False False
a False True
great False False
loss False False
. True False
Many False True
companies False False
might False True
lay False False
off False True

 False False
thousands False False
of False True
people False False
to False True
reduce False False
labor False False
cost False False


Removing stop words and punctuation

In [15]:
cleaned = []

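In [17]:
# Keep only the tokens that are neither stop words nor punctuation
my_doc_cleaned = [token for token in my_doc if not token.is_stop and not token.is_punct]

# Collect the text of each remaining token
for token in my_doc_cleaned:
  cleaned.append(token.text)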

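In [22]:
# Join the cleaned tokens with single spaces and keep the result,
# so that later cells work on the cleaned text rather than on " "
cleaned_text = " ".join(cleaned)

In [23]:
cleaned_text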

'economic situation country edge stock \n market crashed causing loss millions Citizens main investment \n share market facing great loss companies lay \n thousands people reduce labor cost'
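The joined string above still contains `\n` entries because spaCy emits newlines as separate whitespace tokens, and these are neither stop words nor punctuation. A minimal sketch of one way to drop them as well, assuming the `Token.is_space` attribute and a blank English pipeline (used here only so that no model download is needed; `nlp_blank`, `sample` and `kept` are illustrative names):

```python
import spacy

# Assumption: a blank pipeline is enough here, because is_stop, is_punct
# and is_space are lexical attributes that need no trained components.
nlp_blank = spacy.blank("en")

sample = "The stock \nmarket crashed , causing a great loss."
sample_doc = nlp_blank(sample)

# Drop stop words, punctuation, and whitespace-only tokens in one pass
kept = [
    token.text
    for token in sample_doc
    if not (token.is_stop or token.is_punct or token.is_space)
]

print(" ".join(kept))
```

Applying the same extra `is_space` condition to `my_doc` should give a cleaned string without the embedded newlines.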

In [30]:
nlp = spacy.load('en_core_web_sm')

In [31]:
data = nlp(cleaned_text)

In [32]:
# Print each named entity found in the cleaned text, along with its label
for ent in data.ents:
  print(ent.text, ent.label_)