
In [15]:
corpus = """
I want to get deeply into NLP. But I'm not sure! where to start, and I'm also unfamiliar with great resources.
"""

In [16]:
print(corpus)


I want to get deeply into NLP. But I'm not sure! where to start, and I'm also unfamiliar with great resources.



# **Using NLTK**

In [17]:
!pip install nltk



In [18]:
import nltk
nltk.download('punkt')
# Note: newer NLTK releases may additionally require nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [19]:
## Tokenization
## Paragraph ---> Sentences

from nltk.tokenize import sent_tokenize

## Full stops and exclamation marks are treated as sentence boundaries.

In [20]:
documents = sent_tokenize(corpus)

In [21]:
type(documents)

list

In [22]:
for sentence in documents:
  print(sentence)


I want to get deeply into NLP.
But I'm not sure!
where to start, and I'm also unfamiliar with great resources.
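
To see why a trained sentence splitter is useful, here is a minimal sketch of the rule described above (full stops and exclamation marks end a sentence) written as a plain regular expression. `naive_sent_split` is a hypothetical helper for illustration only; it is not part of NLTK and is not how `sent_tokenize` works internally.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (illustration only)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

corpus = """
I want to get deeply into NLP. But I'm not sure! where to start, and I'm also unfamiliar with great resources.
"""

# On this corpus the naive rule yields the same three sentences as sent_tokenize.
for sentence in naive_sent_split(corpus):
    print(sentence)

# The rule breaks on abbreviations: "Dr." is wrongly treated as a sentence end.
print(naive_sent_split("Dr. Smith arrived. He sat down."))
# ['Dr.', 'Smith arrived.', 'He sat down.']
```

Punkt, the model behind `sent_tokenize`, is designed to cope with abbreviations like this, which is one reason to prefer it over a hand-written pattern.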


In [23]:
## Tokenization
## Paragraph ---> Words
## Sentence ---> Words
from nltk.tokenize import word_tokenize
from nltk.tokenize import wordpunct_tokenize

In [24]:
word_tokenize(corpus)

['I',
 'want',
 'to',
 'get',
 'deeply',
 'into',
 'NLP',
 '.',
 'But',
 'I',
 "'m",
 'not',
 'sure',
 '!',
 'where',
 'to',
 'start',
 ',',
 'and',
 'I',
 "'m",
 'also',
 'unfamiliar',
 'with',
 'great',
 'resources',
 '.']

In [25]:
len(word_tokenize(corpus))

27

In [26]:
wordpunct_tokenize(corpus)

['I',
 'want',
 'to',
 'get',
 'deeply',
 'into',
 'NLP',
 '.',
 'But',
 'I',
 "'",
 'm',
 'not',
 'sure',
 '!',
 'where',
 'to',
 'start',
 ',',
 'and',
 'I',
 "'",
 'm',
 'also',
 'unfamiliar',
 'with',
 'great',
 'resources',
 '.']

In [27]:
len(wordpunct_tokenize(corpus))

29
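
The two extra tokens come from the contractions: `word_tokenize` keeps `'m` as one token, while `wordpunct_tokenize` splits it into `'` and `m`. NLTK documents `wordpunct_tokenize` as a regular-expression tokenizer based on the pattern `\w+|[^\w\s]+`; the sketch below assumes that pattern and reproduces the count with only the standard library.

```python
import re

corpus = """
I want to get deeply into NLP. But I'm not sure! where to start, and I'm also unfamiliar with great resources.
"""

# Runs of word characters, or runs of non-space punctuation (assumed pattern).
WORDPUNCT_PATTERN = r"\w+|[^\w\s]+"

tokens = re.findall(WORDPUNCT_PATTERN, corpus)
print(len(tokens))  # 29

# The apostrophe and the 'm' of each "I'm" are separate tokens.
print([t for t in tokens if t in ("'", "m")])  # ["'", 'm', "'", 'm']
```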

In [28]:
for sentence in documents:
  print(word_tokenize(sentence))

['I', 'want', 'to', 'get', 'deeply', 'into', 'NLP', '.']
['But', 'I', "'m", 'not', 'sure', '!']
['where', 'to', 'start', ',', 'and', 'I', "'m", 'also', 'unfamiliar', 'with', 'great', 'resources', '.']


In [29]:
for sentence in documents:
  print(wordpunct_tokenize(sentence))

['I', 'want', 'to', 'get', 'deeply', 'into', 'NLP', '.']
['But', 'I', "'", 'm', 'not', 'sure', '!']
['where', 'to', 'start', ',', 'and', 'I', "'", 'm', 'also', 'unfamiliar', 'with', 'great', 'resources', '.']


In [30]:
from nltk.tokenize import TreebankWordTokenizer

## The Treebank tokenizer assumes the text has already been split into sentences.
## A full stop inside the text stays attached to the preceding word (e.g. 'NLP.').
## EXCEPTION: only the final full stop of the input string is split off as a separate token.

In [31]:
tokenizer = TreebankWordTokenizer()

In [32]:
tokenizer.tokenize(corpus)

['I',
 'want',
 'to',
 'get',
 'deeply',
 'into',
 'NLP.',
 'But',
 'I',
 "'m",
 'not',
 'sure',
 '!',
 'where',
 'to',
 'start',
 ',',
 'and',
 'I',
 "'m",
 'also',
 'unfamiliar',
 'with',
 'great',
 'resources',
 '.']
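
Because only the last full stop of the input is split off, the usual pattern is to split the text into sentences first and tokenize each sentence separately. The sketch below models just that one rule in plain Python. `split_final_period` is a hypothetical helper for illustration, not NLTK's implementation, and it deliberately ignores contractions and other punctuation.

```python
import re

def split_final_period(text):
    """Whitespace-tokenize, detaching a full stop only at the very end of the input."""
    text = re.sub(r"\.\s*$", " .", text.strip())
    return text.split()

paragraph = "I want to get deeply into NLP. But I'm not sure."
sentences = ["I want to get deeply into NLP.", "But I'm not sure."]

# Whole paragraph at once: the inner full stop stays glued to 'NLP.'
print(split_final_period(paragraph))

# One sentence at a time: every sentence-final full stop becomes its own token.
for sentence in sentences:
    print(split_final_period(sentence))
```

With NLTK, this corresponds to calling `tokenizer.tokenize(sentence)` for each item in `documents` rather than once on `corpus`.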

# **Using spaCy**

In [33]:
!pip install spacy
!python -m spacy download en_core_web_sm



In [34]:
import spacy

# Load the English language model
nlp = spacy.load("en_core_web_sm")

In [35]:
# Tokenize the paragraph into sentences
sentences = list(nlp(corpus).sents)
sentences

[
 I want to get deeply into NLP.,
 But I'm not sure!,
 where to start, and I'm also unfamiliar with great resources.]

In [37]:
# Tokenize the paragraph into words
words_paragraph = [token.text for token in nlp(corpus)]
words_paragraph

['\n',
 'I',
 'want',
 'to',
 'get',
 'deeply',
 'into',
 'NLP',
 '.',
 'But',
 'I',
 "'m",
 'not',
 'sure',
 '!',
 'where',
 'to',
 'start',
 ',',
 'and',
 'I',
 "'m",
 'also',
 'unfamiliar',
 'with',
 'great',
 'resources',
 '.',
 '\n']

In [38]:
print(len(words_paragraph))

29
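
This count of 29 is not directly comparable with the 27 from `word_tokenize`: spaCy keeps the leading and trailing `'\n'` as whitespace tokens. Below is a minimal sketch of filtering them out. It uses the token texts printed above as plain strings, so it runs without a model; in spaCy itself the same filter can be written with the `is_space` attribute on each token.

```python
# Token texts copied from the spaCy output above.
words_paragraph = [
    '\n', 'I', 'want', 'to', 'get', 'deeply', 'into', 'NLP', '.',
    'But', 'I', "'m", 'not', 'sure', '!',
    'where', 'to', 'start', ',', 'and', 'I', "'m", 'also',
    'unfamiliar', 'with', 'great', 'resources', '.', '\n',
]

# Drop tokens that consist only of whitespace.
words_no_space = [text for text in words_paragraph if not text.isspace()]
print(len(words_no_space))  # 27

# With a spaCy Doc, the equivalent filter would be:
# [token.text for token in nlp(corpus) if not token.is_space]
```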


In [39]:
# Tokenize each sentence into words
words_sentences = [[token.text for token in sentence] for sentence in sentences]
words_sentences

[['\n', 'I', 'want', 'to', 'get', 'deeply', 'into', 'NLP', '.'],
 ['But', 'I', "'m", 'not', 'sure', '!'],
 ['where',
  'to',
  'start',
  ',',
  'and',
  'I',
  "'m",
  'also',
  'unfamiliar',
  'with',
  'great',
  'resources',
  '.',
  '\n']]

In [40]:
print(len(words_sentences))

3


In [41]:
# Extract noun chunks
noun_chunks = list(nlp(corpus).noun_chunks)
noun_chunks

[I, NLP, I, I, great resources]

<table border="1" cellspacing="0" cellpadding="5">
  <tr>
    <th>Aspect</th>
    <th>NLTK</th>
    <th>spaCy</th>
  </tr>
  <tr>
    <td>Design Philosophy</td>
    <td>Comprehensive, educational</td>
    <td>Efficiency, production use</td>
  </tr>
  <tr>
    <td>Ease of Use</td>
    <td>Flexible, beginner-friendly</td>
    <td>User-friendly, intuitive</td>
  </tr>
  <tr>
    <td>Performance</td>
    <td>May be slower</td>
    <td>Optimized for speed</td>
  </tr>
  <tr>
    <td>Features</td>
    <td>Wide range of tools</td>
    <td>Core NLP tasks</td>
  </tr>
  <tr>
    <td>Language Support</td>
    <td>Supports many languages</td>
    <td>Trained pipelines for English and many other languages</td>
  </tr>
  <tr>
    <td>Community and Development</td>
    <td>Large and active community</td>
    <td>Growing community, actively maintained</td>
  </tr>
</table>


# **When to use NLTK:**

### For educational purposes, such as teaching and learning NLP concepts.
### For research in natural language processing, where flexibility and experimentation are important.
### When you need a wide range of tools and resources for NLP tasks, including algorithms for tokenization, stemming, lemmatization, part-of-speech tagging, parsing, and more.

# **When to use spaCy:**

### For building real-world NLP applications that require efficiency and performance.
### When you need pre-trained models for common NLP tasks, such as part-of-speech tagging, named entity recognition, and dependency parsing.
### When you want a streamlined API and easy integration into production systems, without sacrificing performance.