# Natural Language Processing: NLTK vs spaCy

NLTK is essentially a string processing library: most functions take a string (or a list of strings) as input and return 
plain Python strings, lists, or tuples. spaCy takes an object-oriented approach: processing a text returns a Doc object 
whose tokens, sentences, and entities are themselves objects with attributes. This makes the results easy to explore.
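
As a rough illustration of this difference, the sketch below tokenizes one sentence with each library. It assumes NLTK's 
regex-based wordpunct_tokenize and a blank spaCy English pipeline (spacy.blank("en")). Both are picked here only because 
neither needs a model download; they are not the calls used in the cells further down.

```python
import spacy
from nltk.tokenize import wordpunct_tokenize

text = "NLTK is a powerful library for NLP."

# NLTK: the result is a plain Python list of strings
nltk_tokens = wordpunct_tokenize(text)
print(type(nltk_tokens).__name__, nltk_tokens[:3])

# spaCy: the result is a Doc object made up of Token objects
nlp = spacy.blank("en")  # tokenizer only, no trained components (assumption for this sketch)
doc = nlp(text)
print(type(doc).__name__, type(doc[0]).__name__)

# Each Token exposes attributes that can be inspected directly
for token in doc:
    print(token.i, token.text, token.is_alpha, token.is_punct)
```

With a trained pipeline such as en_core_web_sm, the same Token objects also carry part-of-speech tags and dependency labels.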

The two libraries trade off time and memory differently. NLTK typically returns results much more slowly than spaCy, 
while spaCy uses considerably more memory (spaCy is a memory hog!). spaCy's speed is usually attributed to the fact 
that it was written in Cython from the ground up.

Many sources on the Internet state that spaCy only supports English, but those articles were written 
a few years ago. Since then, spaCy has grown to support over 50 languages. Both spaCy and NLTK support English, German, 
French, Spanish, Portuguese, Italian, Dutch, and Greek.
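
One way to check the spaCy side of this claim is to build a blank pipeline for each language code. The sketch below assumes 
the two-letter codes listed in the code and uses spacy.blank, which only sets up language-specific tokenization rules. 
Trained pipelines (for tagging, parsing, or NER) are separate downloads, and their availability differs per language.

```python
import spacy

# Language codes assumed for the eight languages named above
codes = ["en", "de", "fr", "es", "pt", "it", "nl", "el"]

pipelines = {}
for code in codes:
    # spacy.blank creates a tokenizer-only pipeline; nothing is downloaded
    pipelines[code] = spacy.blank(code)

for code, nlp in pipelines.items():
    print(code, type(nlp).__name__)
```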

# Tokenization - NLTK allows you to split a text into individual words and sentences.

In [8]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')  # Download the Punkt tokenizer models (newer NLTK releases may ask for 'punkt_tab' instead)

text = "NLTK is a powerful library for NLP. It makes natural language processing simple."

# Tokenize into words
words = word_tokenize(text)
print("Tokenized words:", words)

# Tokenize into sentences
sentences = sent_tokenize(text)
print("Tokenized sentences:", sentences)

Tokenized words: ['NLTK', 'is', 'a', 'powerful', 'library', 'for', 'NLP', '.', 'It', 'makes', 'natural', 'language', 'processing', 'simple', '.']
Tokenized sentences: ['NLTK is a powerful library for NLP.', 'It makes natural language processing simple.']


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\SONY\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
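
For comparison, the same two steps can be sketched in spaCy. This is a minimal version that assumes a blank English 
pipeline plus spaCy's rule-based sentencizer component (spaCy v3 API), so no trained model is needed. A trained pipeline 
such as en_core_web_sm would instead derive sentence boundaries from its parser.

```python
import spacy

text = "NLTK is a powerful library for NLP. It makes natural language processing simple."

# Blank pipeline: tokenizer only
nlp = spacy.blank("en")
# Rule-based sentence splitter that breaks on sentence-final punctuation
nlp.add_pipe("sentencizer")

doc = nlp(text)

# Tokenize into words
words = [token.text for token in doc]
print("Tokenized words:", words)

# Tokenize into sentences
sentences = [sent.text for sent in doc.sents]
print("Tokenized sentences:", sentences)
```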


# Part-of-speech tagging - NLTK allows you to determine the grammatical parts of speech for each word in a sentence.

In this example, the pos_tag function from NLTK is used to assign a part-of-speech tag to each word in the input text. 
The result is then printed, showing each word along with its corresponding part-of-speech tag.

In [9]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

nltk.download('punkt')  # Download the Punkt tokenizer models
nltk.download('averaged_perceptron_tagger')  # Download the POS tagger model (resource names may differ in newer NLTK releases)

text = "NLTK is a powerful library for natural language processing."

# Tokenize the text
words = word_tokenize(text)

# Perform Part-of-Speech tagging
pos_tags = pos_tag(words)

# Display the Part-of-Speech tagged result
print("Part-of-Speech tagging result:")
for word, tag in pos_tags:  # use 'tag' so the imported pos_tag function is not overwritten
    print(f"{word}: {tag}")

Part-of-Speech tagging result:
NLTK: NNP
is: VBZ
a: DT
powerful: JJ
library: NN
for: IN
natural: JJ
language: NN
processing: NN
.: .


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\SONY\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\SONY\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
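
The tags in this output follow the Penn Treebank tag set. As a reading aid, the sketch below attaches a short description 
to each tag and filters out the nouns. The tagged pairs are copied from the output above, and the TAG_GLOSS dictionary is a 
hand-written helper covering only the tags that appear here; it is not part of NLTK.

```python
# Tagged pairs copied from the output above
pos_tags = [
    ("NLTK", "NNP"), ("is", "VBZ"), ("a", "DT"), ("powerful", "JJ"),
    ("library", "NN"), ("for", "IN"), ("natural", "JJ"),
    ("language", "NN"), ("processing", "NN"), (".", "."),
]

# Hand-written glosses (hypothetical helper) for the tags seen in this output only
TAG_GLOSS = {
    "NNP": "proper noun, singular",
    "VBZ": "verb, 3rd person singular present",
    "DT": "determiner",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "IN": "preposition or subordinating conjunction",
    ".": "sentence-final punctuation",
}

for word, tag in pos_tags:
    print(f"{word:<12}{tag:<5}{TAG_GLOSS.get(tag, 'unknown tag')}")

# All noun tags in the Penn Treebank set start with "NN"
nouns = [word for word, tag in pos_tags if tag.startswith("NN")]
print("Nouns:", nouns)
```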


# Named Entity Recognition (NER) - spaCy excels in identifying entities like names, locations, and organizations within a text.

In [10]:
import spacy

text = "Apple Inc. was founded by Steve Jobs. It is headquartered in Cupertino, California."

# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

# Process the text
doc = nlp(text)

# Extract named entities
entities = [(ent.text, ent.label_) for ent in doc.ents]
print("Named Entities:", entities)

Named Entities: [('Apple Inc.', 'ORG'), ('Steve Jobs', 'PERSON'), ('Cupertino', 'GPE'), ('California', 'GPE')]
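
Labels such as ORG and GPE are easier to read with a description next to them. The sketch below groups the entities from 
the output above by label and looks up each label with spacy.explain, which returns a short description for labels it 
knows. The entity list is copied from the output rather than recomputed, so no model needs to be loaded.

```python
import spacy

# (text, label) pairs copied from the output above
entities = [
    ("Apple Inc.", "ORG"),
    ("Steve Jobs", "PERSON"),
    ("Cupertino", "GPE"),
    ("California", "GPE"),
]

# Group entity texts by their label
by_label = {}
for text, label in entities:
    by_label.setdefault(label, []).append(text)

# spacy.explain gives a human-readable description of a label
for label, texts in by_label.items():
    print(f"{label} ({spacy.explain(label)}): {texts}")
```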


# Dependency parsing - spaCy helps in understanding the grammatical structure of sentences, highlighting relationships between words.

In this example, the spaCy model is loaded, and a sample sentence is processed. The dep_ attribute of each token gives 
the label of its dependency relation to its head token (parent), which is available as token.head. The output lists one 
arc per token (token, label, head); taken together, these arcs define the dependency parse tree.

In [11]:
import spacy

# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

# Example sentence
sentence = "SpaCy helps in understanding the grammatical structure of sentences."

# Process the sentence
doc = nlp(sentence)

# Display the dependency parse tree
print("Dependency Parse Tree:")
for token in doc:
    print(f"{token.text} --{token.dep_}--> {token.head.text}")

Dependency Parse Tree:
SpaCy --nsubj--> helps
helps --ROOT--> helps
in --prep--> helps
understanding --pcomp--> in
the --det--> structure
grammatical --amod--> structure
structure --dobj--> understanding
of --prep--> structure
sentences --pobj--> of
. --punct--> helps
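
The arcs above are printed as a flat list. To see the tree shape, they can be regrouped by head and printed with 
indentation. The sketch below works on the (token, label, head) triples copied from the output and keys tokens by their 
text, which is an assumption that only holds because no word repeats in this sentence. With a live Doc, token.children 
gives the same information.

```python
from collections import defaultdict

# (token, dependency label, head) triples copied from the output above
arcs = [
    ("SpaCy", "nsubj", "helps"), ("helps", "ROOT", "helps"),
    ("in", "prep", "helps"), ("understanding", "pcomp", "in"),
    ("the", "det", "structure"), ("grammatical", "amod", "structure"),
    ("structure", "dobj", "understanding"), ("of", "prep", "structure"),
    ("sentences", "pobj", "of"), (".", "punct", "helps"),
]

# Map each head to its dependents; the ROOT token is its own head
children = defaultdict(list)
root = None
for token, dep, head in arcs:
    if dep == "ROOT":
        root = token
    else:
        children[head].append((token, dep))

def render(word, dep="ROOT", depth=0):
    """Return the subtree under 'word' as indented lines."""
    lines = ["  " * depth + f"{word} ({dep})"]
    for child, child_dep in children[word]:
        lines.extend(render(child, child_dep, depth + 1))
    return lines

tree_lines = render(root)
print("\n".join(tree_lines))
```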
