
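**Import spaCy**

> Load the default English model, *en_core_web_sm*, through its "en" shortcut. The shortcut works in spaCy 2.x; spaCy 3.x requires the full package name, `spacy.load("en_core_web_sm")`.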

In [None]:
import spacy
nlp = spacy.load("en")

**Word Tokenize**

> Split the text into tokens, i.e. break the sentences into individual words and punctuation marks.

In [None]:
from collections import Counter

text = """Most of the outlay will be at home. No surprise there, either. 
While Samsung has expanded overseas, South Korea 
is still host to most of its factories and research engineers. """

doc = nlp(text)
words = [token.text for token in doc]
print(words)

['Most', 'of', 'the', 'outlay', 'will', 'be', 'at', 'home', '.', 'No', 'surprise', 'there', ',', 'either', '.', '\n', 'While', 'Samsung', 'has', 'expanded', 'overseas', ',', 'South', 'Korea', '\n', 'is', 'still', 'host', 'to', 'most', 'of', 'its', 'factories', 'and', 'research', 'engineers', '.']
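
> The output above contains `'\n'` entries because spaCy keeps whitespace as separate tokens. A minimal sketch of dropping them, using plain Python on an excerpt of the printed list; inside a spaCy pipeline the `token.is_space` flag should serve the same purpose.

```python
# Excerpt of the token list printed above.
words = ['either', '.', '\n', 'While', 'Samsung', 'has', 'expanded',
         'overseas', ',', 'South', 'Korea', '\n', 'is', 'still', 'host']

# str.isspace() is True for strings that consist only of whitespace.
clean_words = [w for w in words if not w.isspace()]
print(clean_words)
```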


**Sentence Tokenize**

> Split the text into sentences when it contains more than one, i.e. break the text into a list of sentences.

In [None]:
text = """Natural Language Toolkit, or more commonly NLTK, 
is a suite of libraries and programs for symbolic and statistical 
natural language processing (NLP) for English written in the Python 
programming language. It was developed by Steven Bird and Edward Loper 
in the Department of Computer and Information Science at the University of Pennsylvania."""

doc = nlp(text)
list(doc.sents)

[Natural Language Toolkit, or more commonly NLTK, 
 is a suite of libraries and programs for symbolic and statistical 
 natural language processing (NLP) for English written in the Python 
 programming language., It was developed by Steven Bird and Edward Loper 
 in the Department of Computer and Information Science at the University of Pennsylvania.]
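
> For comparison, a naive sentence splitter can be sketched with a regular expression. This is not how spaCy segments sentences; it is a simplified stand-in that breaks after every `.`, `!`, or `?` followed by whitespace, so abbreviations such as "Dr." would be split incorrectly. The sample text is a shortened version of the one above.

```python
import re

text = ("NLTK is a suite of libraries for natural language processing. "
        "It was developed by Steven Bird and Edward Loper.")

# Split at whitespace that follows sentence-ending punctuation.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())
for sentence in sentences:
    print(sentence)
```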

**Stop words removal**

> Remove stop words such as *is*, *the*, and *a* using spaCy's built-in stop-word list (the `token.is_stop` flag), since they carry little information on their own.

In [None]:
text = """Most of the outlay will be at home. No surprise there, either. While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. """

doc = nlp(text)
# Remove stop words and punctuation.
words = [token.text for token in doc if not token.is_stop and not token.is_punct]

print(words)

['outlay', 'home', 'surprise', 'Samsung', 'expanded', 'overseas', 'South', 'Korea', 'host', 'factories', 'research', 'engineers']
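
> The same filter can be sketched without spaCy. The stop-word set below is a tiny hypothetical stand-in, much smaller than spaCy's built-in list, and `string.punctuation` stands in for `token.is_punct`.

```python
import string

# Hypothetical stop-word set for illustration only; spaCy ships its own list.
STOP_WORDS = {"while", "has", "is", "still", "to", "most", "of", "its", "and"}

tokens = ['While', 'Samsung', 'has', 'expanded', 'overseas', ',', 'South',
          'Korea', 'is', 'still', 'host', 'to', 'most', 'of', 'its',
          'factories', 'and', 'research', 'engineers', '.']


def is_punct(token):
    """Return True if every character of the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)


# Keep a token only if it is neither a stop word nor a punctuation mark.
content_words = [t for t in tokens
                 if t.lower() not in STOP_WORDS and not is_punct(t)]
print(content_words)
```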


**Lemmatization**

> Lemmatize the text to reduce each word to its base form (lemma), e.g. *factories* becomes *factory* and *has* becomes *have*.

In [None]:
nlp = spacy.load("en")
text = """While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. """
doc = nlp(text)
for token in doc:
  print(token, token.lemma_)

While while
Samsung Samsung
has have
expanded expand
overseas overseas
, ,
South South
Korea Korea
is be
still still
host host
to to
most most
of of
its -PRON-
factories factory
and and
research research
engineers engineer
. .
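
> To see which tokens the lemmatizer actually changed, the pairs can be compared case-insensitively. A small sketch over pairs copied from the output above; `-PRON-` is the placeholder lemma that spaCy 2.x assigns to pronouns.

```python
# (token, lemma) pairs copied from the output above.
pairs = [('While', 'while'), ('Samsung', 'Samsung'), ('has', 'have'),
         ('expanded', 'expand'), ('is', 'be'), ('its', '-PRON-'),
         ('factories', 'factory'), ('engineers', 'engineer')]

# Keep only the pairs whose lemma differs from the token, ignoring case.
changed = [(tok, lemma) for tok, lemma in pairs if tok.lower() != lemma.lower()]
print(changed)
```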


**Get word frequency**

> Count word occurrences with `collections.Counter`. Word frequency gives a rough measure of how important a word is in the document by showing how many times it is used.

In [None]:
from collections import Counter

nlp = spacy.load("en")
text = """Most of the outlay will be at home. No surprise there, either. While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. """

doc = nlp(text)
# Remove stop words and punctuation.
words = [token.text for token in doc if not token.is_stop and not token.is_punct]
word_freq = Counter(words)
common_words = word_freq.most_common(6)
print(common_words)

[('outlay', 1), ('home', 1), ('surprise', 1), ('Samsung', 1), ('expanded', 1), ('overseas', 1)]
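
> Every count above is 1 because the sample text is short and its only repeated words are stop words. A sketch of counting lowercased alphabetic tokens instead, so that `Most` and `most` fall into the same bucket; the token list is an excerpt of the word-tokenization output, and skipping stop-word removal here is a deliberate simplification.

```python
from collections import Counter

tokens = ['Most', 'of', 'the', 'outlay', 'will', 'be', 'at', 'home', '.',
          'South', 'Korea', 'is', 'still', 'host', 'to', 'most', 'of', 'its',
          'factories', 'and', 'research', 'engineers', '.']

# Lowercase alphabetic tokens so that case variants are counted together.
word_freq = Counter(t.lower() for t in tokens if t.isalpha())
print(word_freq.most_common(2))
```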


**POS tags**

> Part-of-speech (POS) tagging labels each word with its grammatical category, e.g. noun, adjective, or adverb.

In [None]:
nlp = spacy.load("en")
text = """Natural Language Toolkit, or more commonly NLTK."""
doc = nlp(text)
for token in doc:
  print(token, token.pos_)

Natural PROPN
Language PROPN
Toolkit PROPN
, PUNCT
or CCONJ
more ADJ
commonly ADV
NLTK NUM
. PUNCT
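
> Once tokens are tagged, the tags can be aggregated or used as a filter. A sketch over the (token, tag) pairs printed above, using only the standard library.

```python
from collections import Counter

# (token, POS tag) pairs copied from the output above.
tagged = [('Natural', 'PROPN'), ('Language', 'PROPN'), ('Toolkit', 'PROPN'),
          (',', 'PUNCT'), ('or', 'CCONJ'), ('more', 'ADJ'),
          ('commonly', 'ADV'), ('NLTK', 'NUM'), ('.', 'PUNCT')]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)
# Select the tokens tagged as proper nouns.
proper_nouns = [tok for tok, tag in tagged if tag == 'PROPN']

print(tag_counts)
print(proper_nouns)
```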


**NER-(Named Entity Recognition)**

> Named Entity Recognition (NER) identifies named entities in the text, such as organizations, people, and places, and assigns a label to each.

In [None]:
nlp = spacy.load("en")
text = """Most of the outlay will be at home. No surprise there, either. While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers. 
Meanwhile, Apple based in the USA can differ any much, according to Paul."""

doc = nlp(text)
labels = {ent.label_ for ent in doc.ents}
for label in labels:
  # Collect the unique entity texts that carry this label.
  entities = list({ent.text for ent in doc.ents if ent.label_ == label})
  print(label, entities)

GPE ['USA', 'South Korea']
PERSON ['Paul']
ORG ['Samsung', 'Apple']
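
> The grouping can also be done in a single pass with `collections.defaultdict`, instead of rescanning the entities once per label. A sketch over (entity, label) pairs taken from the output above, with surrounding whitespace stripped.

```python
from collections import defaultdict

# (entity text, label) pairs taken from the output above.
ents = [('Samsung', 'ORG'), ('South Korea', 'GPE'), ('Apple', 'ORG'),
        ('USA', 'GPE'), ('Paul', 'PERSON')]

# Map each label to the set of unique entity texts that carry it.
by_label = defaultdict(set)
for text, label in ents:
    by_label[label].add(text.strip())

for label, names in by_label.items():
    print(label, sorted(names))
```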


In [None]:
# Look up what an entity label means.
spacy.explain("GPE")

'Countries, cities, states'

In [None]:
from spacy import displacy

