<a href="https://colab.research.google.com/github/Showcas/NLP/blob/main/01_4_spaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing with Deep Learning

## Import spaCy

Read more at [spacy.io](https://spacy.io)

In [None]:
import spacy

In [None]:
# download model:
# !python -m spacy download en_core_web_sm

In [None]:
# Load the small English pipeline, which was trained on written web text
nlp = spacy.load('en_core_web_sm')

## Word Tokenize
Split the text into tokens, i.e. break the sentences into individual words and punctuation marks.

In [None]:
text = "Vienna is the national capital, largest city, and one of nine states of Austria. Vienna is Austria's most populous city, with about 1.9 million inhabitants"

doc = nlp(text)
words = [token.text for token in doc]
print(words)
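
Each token carries more than its text. As a minimal sketch, the code below uses a blank English pipeline (`spacy.blank("en")`, which contains only the tokenizer and needs no model download) to show a few lexical attributes. The names `blank_nlp`, `sample`, and `rows` are chosen here for illustration.

```python
import spacy

# Assumption: a blank pipeline is enough here, because only the tokenizer
# and lexical attributes are used (no tagger, parser, or NER).
blank_nlp = spacy.blank("en")
sample = blank_nlp("Vienna is Austria's most populous city, with about 1.9 million inhabitants")

# Collect (text, character offset, is_punct, like_num) for every token
rows = [(token.text, token.idx, token.is_punct, token.like_num) for token in sample]

for row in rows:
    print(row)
```

Note that the tokenizer splits the possessive `'s` off `Austria's` and keeps `1.9` as a single token.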

## Sentence tokenize
If the text contains more than one sentence, split it into a list of sentences.

In [None]:
text = "Vienna is the national capital, largest city, and one of nine states of Austria. Vienna is Austria's most populous city, with about 1.9 million inhabitants"
doc = nlp(text)

# doc.sents is a generator that yields one Span per detected sentence
list(doc.sents)
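
The `sents` attribute used above relies on the sentence boundaries set by the loaded model. When only sentence splitting is needed, spaCy also ships a rule-based `sentencizer` component. The sketch below assumes spaCy v3 and adds it to a blank pipeline; the names `blank_nlp` and `sentences` are illustrative.

```python
import spacy

# Assumption: spaCy v3, where built-in components are added by string name.
blank_nlp = spacy.blank("en")
blank_nlp.add_pipe("sentencizer")  # rule-based splitting on sentence-final punctuation

text = (
    "Vienna is the national capital, largest city, and one of nine states of Austria. "
    "Vienna is Austria's most populous city, with about 1.9 million inhabitants"
)
doc = blank_nlp(text)

# Convert each sentence Span to a plain string
sentences = [sent.text for sent in doc.sents]

print(len(sentences))
for sentence in sentences:
    print(sentence)
```

Because `1.9` is a single token, its decimal point does not trigger a sentence break.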

## Stopword removal
Remove stop words such as *is*, *the*, and *a* using spaCy's built-in stop word list. These words occur very frequently but carry little meaning on their own.

TASK 1.6

Implement the removal of stopwords and punctuation in the code below.

In [None]:
text = "Vienna is the national capital, largest city, and one of nine states of Austria. Vienna is Austria's most populous city, with about 1.9 million inhabitants"
doc = nlp(text)

### IMPLEMENT YOUR SOLUTION HERE ###
# remove stopwords and punctuation
words = [

]


print(words)

## Get word frequency
Count how often each word occurs using `Counter` from Python's `collections` module.

In [None]:
from collections import Counter

text = "Vienna is the national capital, largest city, and one of nine states of Austria. Vienna is Austria's most populous city, with about 1.9 million inhabitants"
doc = nlp(text)

# remove stopwords and punctuation
words = [token.text for token in doc if not token.is_stop and not token.is_punct]

word_freq = Counter(words)
# without an argument, most_common() returns all (word, count) pairs, most frequent first
common_words = word_freq.most_common()

print(common_words)
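
`most_common` also accepts a number `n` to return only the top `n` entries, and counts are case-sensitive unless the words are normalized first. The following is a minimal sketch without spaCy. The list `words` is made up for illustration and is not the output of the cell above.

```python
from collections import Counter

# Hypothetical word list, written by hand for this example
words = ["Vienna", "city", "vienna", "Austria", "city", "Vienna"]

# Case-sensitive: "Vienna" and "vienna" are counted separately
case_sensitive = Counter(words)

# Case-insensitive: lowercase every word before counting
case_insensitive = Counter(word.lower() for word in words)

print(case_sensitive.most_common(2))
print(case_insensitive.most_common(1))  # [('vienna', 3)]
```

With spaCy tokens, `token.lower_` or `token.lemma_` can serve as the normalized form in place of `token.text`.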

## Part of Speech tags
Part-of-speech (POS) tags give the grammatical category of each word, e.g. whether it is a noun, an adjective, or a verb.

In [None]:
text = "Vienna is the national capital, largest city, and one of nine states of Austria. Vienna is Austria's most populous city, with about 1.9 million inhabitants."

doc = nlp(text)

for token in doc:
    print(token.text, token.pos_)
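
The coarse-grained tags in `token.pos_` follow the Universal POS tag set. `spacy.explain` looks up a short description of a tag in spaCy's glossary and needs no loaded model. A small sketch follows; the tag list and the name `descriptions` are chosen here for illustration.

```python
import spacy

# A few Universal POS tags that typically appear for sentences like the one above
tags = ["PROPN", "AUX", "DET", "ADJ", "NOUN", "PUNCT"]

# spacy.explain returns a human-readable description, or None for unknown terms
descriptions = {tag: spacy.explain(tag) for tag in tags}

for tag, description in descriptions.items():
    print(f"{tag:6} {description}")
```

The same function also works for the fine-grained tags in `token.tag_` and for entity labels such as `GPE`.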

## Visualization with spaCy


In [None]:
from spacy import displacy
displacy.render(doc, style="dep", jupyter=True)

## NER (Named Entity Recognition)

| Label    | Description                                          |
|----------|------------------------------------------------------|
| ORG      | Companies, agencies, institutions.                   |
| GPE      | Geopolitical entity, i.e. countries, cities, states. |
| CARDINAL | Numerals                                             |

In [None]:
text = "Vienna is the national capital, largest city, and one of nine states of Austria. Vienna is Austria's most populous city, with about 1.9 million inhabitants"
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)

In [None]:
nlp.get_pipe('ner').labels

The labels and their meaning:
* **CARDINAL**: Numerals that do not fall under another type.
* **DATE**: Absolute or relative dates or periods.
* **EVENT**: Named hurricanes, battles, wars, sports events, etc.
* **FAC**: Buildings, airports, highways, bridges, etc.
* **GPE**: Countries, cities, states.
* **LANGUAGE**: Any named language.
* **LAW**: Named documents made into laws.
* **LOC**: Non-GPE locations, mountain ranges, bodies of water.
* **MONEY**: Monetary values, including unit.
* **NORP**: Nationalities or religious or political groups.
* **ORDINAL**: "First", "second", etc.
* **ORG**: Companies, agencies, institutions, etc.
* **PERCENT**: Percentage, including "%".
* **PERSON**: People, including fictional.
* **PRODUCT**: Objects, vehicles, foods, etc. (Not services.)
* **QUANTITY**: Measurements, as of weight or distance.
* **TIME**: Times smaller than a day.
* **WORK_OF_ART**: Titles of books, songs, etc.



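## Word Vector Representation
Every `Doc` exposes a `.vector` attribute. Note that the small model `en_core_web_sm` does not ship with static word vectors, so the vector printed here is derived from the pipeline's context-sensitive tensors. The medium and large models (`en_core_web_md`, `en_core_web_lg`) include pretrained word vectors.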

In [None]:
city = nlp('Vienna')
print(city.vector.shape)
print(city.vector)