#### Difference between NLTK and spaCy in Python
- NLTK (Natural Language Toolkit) and spaCy are both popular Python libraries for natural language processing (NLP), but they differ in design philosophy and typical use cases.

- NLTK focuses on providing a wide range of tools and resources for linguistic research and education. It covers tokenization, stemming, tagging, parsing, classification, and more. It is highly customizable and lets users experiment with different algorithms and techniques for the same task.

- spaCy, on the other hand, is designed for industrial-strength NLP and emphasizes performance and efficiency. It ships pre-trained pipelines for many languages and is optimized for speed and ease of use. It is often preferred for production applications where fast, accurate results are required.

- In summary, NLTK is a versatile toolkit for NLP research and education, while spaCy is a library geared toward building real-world NLP applications.

In [7]:
%pip install nltk


Note: you may need to restart the kernel to use updated packages.


In [16]:
corpus = """Hello, welcome to Subham Chakraborty's fantasy world.
Please do remember it is an interesting and amazing world altogether.
"""
print(corpus)

Hello, welcome to Subham Chakraborty's fantasy world.
Please do remember it is an interesting and amazing world altogether.



In [19]:
## Tokenization

## Paragraph -> sentences
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')      # Punkt sentence tokenizer models
nltk.download('punkt_tab')  # Punkt data in tabular format, needed by newer NLTK releases


[nltk_data] Downloading package punkt to
[nltk_data]     /home/subhamchakraborty/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/subhamchakraborty/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [21]:
documents = sent_tokenize(corpus) 
print(documents)

["Hello, welcome to Subham Chakraborty's fantasy world.", 'Please do remember it is an interesting and amazing world altogether.']


In [22]:
type(documents)

list

In [23]:
for sentence in documents:
    print(sentence)

Hello, welcome to Subham Chakraborty's fantasy world.
Please do remember it is an interesting and amazing world altogether.
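
For contrast, the sketch below splits sentences with a plain regular expression instead of the Punkt model that `sent_tokenize` relies on. The helper name `naive_sent_split` and the sample text are illustrative assumptions, not part of NLTK. It works on the corpus above, but it breaks on abbreviations, which is the kind of case a trained sentence tokenizer is meant to handle.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (illustrative only)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Hypothetical sample containing abbreviations
sample = "Dr. Smith arrived at 5 p.m. yesterday. He left early."

for piece in naive_sent_split(sample):
    print(piece)
# The rule treats "Dr." and "p.m." as sentence ends, yielding four pieces
# instead of the two real sentences.
```

Running `sent_tokenize` on the same sample is a quick way to compare the two approaches.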


In [26]:
# Tokenization techniques

# paragraph --> words 
# sentence --> words


In [28]:
from nltk.tokenize import word_tokenize
word_tokenize(corpus)  # returns a flat list of tokens (words and punctuation marks)

['Hello',
 ',',
 'welcome',
 'to',
 'Subham',
 'Chakraborty',
 "'s",
 'fantasy',
 'world',
 '.',
 'Please',
 'do',
 'remember',
 'it',
 'is',
 'an',
 'interesting',
 'and',
 'amazing',
 'world',
 'altogether',
 '.']

In [29]:
for sentence in documents:
    print(word_tokenize(sentence))

['Hello', ',', 'welcome', 'to', 'Subham', 'Chakraborty', "'s", 'fantasy', 'world', '.']
['Please', 'do', 'remember', 'it', 'is', 'an', 'interesting', 'and', 'amazing', 'world', 'altogether', '.']


In [30]:
from nltk.tokenize import wordpunct_tokenize


In [33]:
wordpunct_tokenize(corpus)  # returns a flat list of tokens; punctuation is always split off, so "'s" becomes "'" and "s"

['Hello',
 ',',
 'welcome',
 'to',
 'Subham',
 'Chakraborty',
 "'",
 's',
 'fantasy',
 'world',
 '.',
 'Please',
 'do',
 'remember',
 'it',
 'is',
 'an',
 'interesting',
 'and',
 'amazing',
 'world',
 'altogether',
 '.']
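
The difference from `word_tokenize` (which keeps `'s` together) comes from how `wordpunct_tokenize` works: it is a purely regex-based tokenizer that splits text into runs of alphanumeric characters and runs of punctuation. A minimal sketch of that behaviour using only the standard library is shown below; the function name `regex_wordpunct` is an illustrative assumption, and the pattern is the one NLTK documents for its `WordPunctTokenizer`.

```python
import re

# Alphanumeric runs, or runs of characters that are neither word characters nor whitespace
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def regex_wordpunct(text):
    """Approximate wordpunct_tokenize with a single regular expression."""
    return WORDPUNCT_PATTERN.findall(text)

tokens = regex_wordpunct("Subham Chakraborty's fantasy world.")
print(tokens)
# ['Subham', 'Chakraborty', "'", 's', 'fantasy', 'world', '.']
```

Because the apostrophe is punctuation and `s` is alphanumeric, they land in separate tokens, which matches the output above.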

In [None]:
from nltk.tokenize import TreebankWordTokenizer  # Penn Treebank-style word tokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(corpus)  # no sentence splitting: only the final period is separated

['Hello',
 ',',
 'welcome',
 'to',
 'Subham',
 'Chakraborty',
 "'s",
 'fantasy',
 'world.',
 'Please',
 'do',
 'remember',
 'it',
 'is',
 'an',
 'interesting',
 'and',
 'amazing',
 'world',
 'altogether',
 '.']