
# Natural Language Processing with Deep Learning

## Import spaCy

Read more at [spacy.io](https://spacy.io)

In [6]:
import spacy

In [12]:
# Download the small English pipeline (only needed once per environment):
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 56.1 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.

In [13]:
# Load the small ("sm") English pipeline, which was trained on web text
nlp = spacy.load('en_core_web_sm')

## Word Tokenize
Tokenization splits the text into tokens, i.e. individual words, numbers, and punctuation marks.

In [14]:
text = "Vienna is the national capital, largest city, and one of nine states of Austria. Vienna is Austria's most populous city, with about 1.9 million inhabitants"

doc = nlp(text)
words = [token.text for token in doc]
print(words)

['Vienna', 'is', 'the', 'national', 'capital', ',', 'largest', 'city', ',', 'and', 'one', 'of', 'nine', 'states', 'of', 'Austria', '.', 'Vienna', 'is', 'Austria', "'s", 'most', 'populous', 'city', ',', 'with', 'about', '1.9', 'million', 'inhabitants']
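
To see what the tokenizer adds over a plain string split, the following minimal sketch splits the same text on whitespace only. It uses just the Python standard library; the names `naive_tokens` and `glued` are illustrative and not part of the notebook.

```python
# Same sample text as in the cell above.
text = ("Vienna is the national capital, largest city, and one of nine states of Austria. "
        "Vienna is Austria's most populous city, with about 1.9 million inhabitants")

# Naive baseline: split on whitespace only, without any linguistic rules.
naive_tokens = text.split()

# Whitespace splitting leaves punctuation and the possessive 's attached to words.
glued = [tok for tok in naive_tokens if tok[-1] in ",." or "'" in tok]

print(len(naive_tokens), "whitespace tokens")
print(glued)
```

The whitespace split yields fewer tokens than the spaCy output above because commas, periods, and the possessive `'s` stay attached to the neighbouring word, whereas spaCy emits them as separate tokens.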


## Sentence tokenize
Sentence tokenization splits a text that contains more than one sentence into a list of sentences.

In [15]:
text = "Vienna is the national capital, largest city, and one of nine states of Austria. Vienna is Austria's most populous city, with about 1.9 million inhabitants"
doc = nlp(text)

# doc.sents is a generator of sentence spans; materialize it as a list
list(doc.sents)

[Vienna is the national capital, largest city, and one of nine states of Austria.,
 Vienna is Austria's most populous city, with about 1.9 million inhabitants]
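
Sentence splitting is harder than it looks. As a rough illustration, the sketch below applies the naive rule "every period ends a sentence" to the same text, using only the standard library (`naive_sentences` is an illustrative name):

```python
# Same sample text as in the cell above.
text = ("Vienna is the national capital, largest city, and one of nine states of Austria. "
        "Vienna is Austria's most populous city, with about 1.9 million inhabitants")

# Naive rule: treat every period as a sentence boundary.
naive_sentences = [part.strip() for part in text.split(".") if part.strip()]

for sentence in naive_sentences:
    print(repr(sentence))
```

The naive rule produces three pieces because it also cuts the decimal number `1.9` in half, while the spaCy result above returns two sentences and keeps the number intact.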

## Stopword removal
Stop words such as *is*, *the*, and *a* carry little information on their own, so they are often removed before further processing. The solution below uses the English stop word list from NLTK.

TASK 1.6

Implement the removal of stop words and punctuation in the code below.

In [28]:
text = "Vienna is the national capital, largest city, and one of nine states of Austria. Vienna is Austria's most populous city, with about 1.9 million inhabitants"
doc = nlp(text)

### IMPLEMENT YOUR SOLUTION HERE ###
# remove stop words and punctuation
from string import punctuation

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# Build the stop word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))

words = [
    token.text.lower() for token in doc
    if token.text.lower() not in stop_words and token.text not in punctuation
]


print(words)

['vienna', 'national', 'capital', 'largest', 'city', 'one', 'nine', 'states', 'austria', 'vienna', 'austria', "'s", 'populous', 'city', '1.9', 'million', 'inhabitants']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## Get word frequency
Count how often each word occurs using `Counter` from Python's `collections` module.

In [29]:
from collections import Counter

text = "Vienna is the national capital, largest city, and one of nine states of Austria. Vienna is Austria's most populous city, with about 1.9 million inhabitants"
doc = nlp(text)

# remove stop words and punctuation using spaCy's token attributes
words = [token.text for token in doc if not token.is_stop and not token.is_punct]

word_freq = Counter(words)
common_words = word_freq.most_common()

print(common_words)

[('Vienna', 2), ('city', 2), ('Austria', 2), ('national', 1), ('capital', 1), ('largest', 1), ('states', 1), ('populous', 1), ('1.9', 1), ('million', 1), ('inhabitants', 1)]
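
The NLTK-based filter from Task 1.6 and the spaCy-based filter used here do not keep exactly the same tokens. The sketch below compares the two printed outputs, copied verbatim from the cells above, with plain set operations. The variable names are illustrative, and the likely explanation (the two libraries ship different stop word lists) is an assumption drawn from the outputs.

```python
# Output of the NLTK-based cell (Task 1.6), copied from above.
nltk_filtered = ['vienna', 'national', 'capital', 'largest', 'city', 'one', 'nine',
                 'states', 'austria', 'vienna', 'austria', "'s", 'populous', 'city',
                 '1.9', 'million', 'inhabitants']

# Output of the spaCy-based frequency cell, copied from above.
spacy_counts = [('Vienna', 2), ('city', 2), ('Austria', 2), ('national', 1),
                ('capital', 1), ('largest', 1), ('states', 1), ('populous', 1),
                ('1.9', 1), ('million', 1), ('inhabitants', 1)]

# Lowercase the spaCy tokens so both results are comparable.
spacy_filtered = {word.lower() for word, _ in spacy_counts}

only_nltk = sorted(set(nltk_filtered) - spacy_filtered)
only_spacy = sorted(spacy_filtered - set(nltk_filtered))

print("kept only by the NLTK filter: ", only_nltk)
print("kept only by the spaCy filter:", only_spacy)
```

In this run the NLTK-based filter keeps `one`, `nine`, and `'s`, which the spaCy-based filter drops, so word counts depend on which stop word list is applied.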


## Part of Speech tags
A part-of-speech (POS) tag marks the grammatical category of each token, e.g. noun, adjective, or verb.

In [30]:
text = "Vienna is the national capital, largest city, and one of nine states of Austria. Vienna is Austria's most populous city, with about 1.9 million inhabitants."

doc = nlp(text)

for token in doc:
    print(token.text, token.pos_)

Vienna PROPN
is AUX
the DET
national ADJ
capital NOUN
, PUNCT
largest ADJ
city NOUN
, PUNCT
and CCONJ
one NUM
of ADP
nine NUM
states NOUN
of ADP
Austria PROPN
. PUNCT
Vienna PROPN
is AUX
Austria PROPN
's PART
most ADV
populous ADJ
city NOUN
, PUNCT
with ADP
about ADP
1.9 NUM
million NUM
inhabitants NOUN
. PUNCT
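
POS tags are typically used to filter or summarize tokens. As a small sketch, the code below takes the (token, tag) pairs printed above, copied by hand into a list, counts the tags, and keeps only nouns and proper nouns. It relies on the standard library only; `tagged`, `tag_counts`, and `content_words` are illustrative names.

```python
from collections import Counter

# (token, POS tag) pairs copied from the output above.
tagged = [
    ("Vienna", "PROPN"), ("is", "AUX"), ("the", "DET"), ("national", "ADJ"),
    ("capital", "NOUN"), (",", "PUNCT"), ("largest", "ADJ"), ("city", "NOUN"),
    (",", "PUNCT"), ("and", "CCONJ"), ("one", "NUM"), ("of", "ADP"),
    ("nine", "NUM"), ("states", "NOUN"), ("of", "ADP"), ("Austria", "PROPN"),
    (".", "PUNCT"), ("Vienna", "PROPN"), ("is", "AUX"), ("Austria", "PROPN"),
    ("'s", "PART"), ("most", "ADV"), ("populous", "ADJ"), ("city", "NOUN"),
    (",", "PUNCT"), ("with", "ADP"), ("about", "ADP"), ("1.9", "NUM"),
    ("million", "NUM"), ("inhabitants", "NOUN"), (".", "PUNCT"),
]

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in tagged)

# Keep only nouns and proper nouns, which often carry the topic of a text.
content_words = [tok for tok, tag in tagged if tag in {"NOUN", "PROPN"}]

print(tag_counts.most_common())
print(content_words)
```

With a live `doc`, the same filter could be written directly over `token.pos_` instead of a hand-copied list.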


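## Visualization with spaCy

displaCy, spaCy's built-in visualizer, renders the dependency parse of `doc` as an arc diagram directly in the notebook.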


In [31]:
from spacy import displacy
displacy.render(doc, style="dep", jupyter=True)

## NER (Named Entity Recognition)

| Label    | Description                                          |
|----------|------------------------------------------------------|
| ORG      | Companies, agencies, institutions.                   |
| GPE      | Geopolitical entity, i.e. countries, cities, states. |
| CARDINAL | Numerals                                             |

In [32]:
text = "Vienna is the national capital, largest city, and one of nine states of Austria. Vienna is Austria's most populous city, with about 1.9 million inhabitants"
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)

Vienna GPE
one CARDINAL
nine CARDINAL
Austria GPE
Vienna GPE
Austria GPE
about 1.9 million CARDINAL
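
Entities are often easier to work with when grouped by label. The following sketch does this for the (text, label) pairs printed above, copied by hand into a list and deduplicated along the way. It uses only the standard library; `entities` and `by_label` are illustrative names.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the output above.
entities = [
    ("Vienna", "GPE"), ("one", "CARDINAL"), ("nine", "CARDINAL"),
    ("Austria", "GPE"), ("Vienna", "GPE"), ("Austria", "GPE"),
    ("about 1.9 million", "CARDINAL"),
]

# Collect the distinct entity texts per label, preserving first-seen order.
by_label = defaultdict(list)
for ent_text, label in entities:
    if ent_text not in by_label[label]:
        by_label[label].append(ent_text)

for label, texts in by_label.items():
    print(label, texts)
```

With a live `doc`, the loop could iterate over `doc.ents` and read `ent.text` and `ent.label_` instead of the hand-copied pairs.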


In [33]:
nlp.get_pipe('ner').labels

('CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART')

The labels and their meanings:
* **CARDINAL**: Numerals that do not fall under another type.
* **DATE**: Absolute or relative dates or periods.
* **EVENT**: Named hurricanes, battles, wars, sports events, etc.
* **FAC**: Buildings, airports, highways, bridges, etc.
* **GPE**: Countries, cities, states.
* **LANGUAGE**: Any named language.
* **LAW**: Named documents made into laws.
* **LOC**: Non-GPE locations, mountain ranges, bodies of water.
* **MONEY**: Monetary values, including unit.
* **NORP**: Nationalities or religious or political groups.
* **ORDINAL**: "First", "second", etc.
* **ORG**: Companies, agencies, institutions, etc.
* **PERCENT**: Percentage, including "%".
* **PERSON**: People, including fictional.
* **PRODUCT**: Objects, vehicles, foods, etc. (Not services.)
* **QUANTITY**: Measurements, as of weight or distance.
* **TIME**: Times smaller than a day.
* **WORK_OF_ART**: Titles of books, songs, etc.



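## Word Vector Representation

`Doc.vector` returns a numeric representation of the text. The small pipeline `en_core_web_sm` does not ship with static word vectors, so the 96-dimensional vector below is derived from the pipeline's context-sensitive tensors.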

In [34]:
city = nlp('Vienna')
print(city.vector.shape)
print(city.vector)

(96,)
[-1.0961218  -0.22902218 -0.4975809   0.9610567   0.7078363  -0.06464957
  0.73181033  0.7445235  -1.2915039  -0.09090728  1.4697149   0.09945044
 -1.3506333  -0.13108441 -0.7924895   0.05922782 -0.6520083  -0.35899088
  1.2024351  -0.52779263 -1.159215    0.53939533 -0.6297082   0.14621311
  0.5931368   0.03357325  0.790095    1.5684465  -0.12552348  0.29643065
  0.02728534  0.15686297  0.8964345   1.0861708  -1.2775282  -1.2620009
  0.40376115  1.0572989   0.89938     1.5239228  -1.276994    0.15016714
 -0.30887002 -0.2136845  -0.39376312 -0.93562853 -1.3808439   1.8952878
  0.61209774 -0.47402984  0.4551257  -0.812488    0.03708351 -0.24509734
 -0.5069572  -0.9935806   1.3590736  -0.6163687   0.69572055  0.5491389
 -0.5353222  -0.9912694   0.37881336 -0.41703197  1.7358744  -0.02423835
  0.11495821 -0.94645905  0.63233984 -0.79578835  0.19647892  0.08197635
  1.4766746   0.03564269  0.7181915   0.0255273  -0.4215235  -0.5941889
 -0.82184887 -1.1261017  -0.02957836 -0.55367756 