## Tokenization

### 1. Natural Language Toolkit (nltk)
##### NLTK is a leading platform for building Python programs that work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries.

### 2. spaCy
##### spaCy is a modern, open-source natural language processing (NLP) library for Python, designed to provide industrial-strength NLP capabilities. It features:
    1. High performance: spaCy is optimized for speed and efficiency, making it suitable for large-scale NLP applications.
    
    2. State-of-the-art models: spaCy comes with pre-trained models for tokenization, part-of-speech tagging, dependency parsing, named entity recognition (NER), and more.
    
    3. Pythonic API: spaCy’s API is designed to be easy to use and integrate with other Python libraries and frameworks.

In [13]:
corpus = '''Hello welcome, My name is Subrat Mishra, I am an aspiring Data Sceintist,having a post-grad (MCA) degree. I am very much interested in developing application, which will help people.
'''

In [14]:
corpus

'Hello welcome, My name is Subrat Mishra, I am an aspiring Data Sceintist,having a post-grad (MCA) degree. I am very much interested in developing application, which will help people.\n'

In [15]:
print(corpus)

Hello welcome, My name is Subrat Mishra, I am an aspiring Data Sceintist,having a post-grad (MCA) degree. I am very much interested in developing application, which will help people.



In [19]:
## sentence tokenization
## paragraph --> sentences
from nltk.tokenize import sent_tokenize

documents = sent_tokenize(corpus)

In [22]:
for i in documents:
    print(i)
len(documents)

Hello welcome, My name is Subrat Mishra, I am an aspiring Data Sceintist,having a post-grad (MCA) degree.
I am very much interested in developing application, which will help people.


2
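To see what sentence tokenization involves, here is a minimal sketch that splits on sentence-ending punctuation with the standard `re` module. The helper name `naive_sent_split` is hypothetical and this is not how NLTK works internally: `sent_tokenize` relies on the pre-trained Punkt model, which is designed to cope with abbreviations that a plain rule like this one gets wrong.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace.

    Hypothetical helper for illustration only; it has no knowledge of
    abbreviations, decimals, or quotations.
    """
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

corpus = ("Hello welcome, My name is Subrat Mishra, I am an aspiring Data "
          "Sceintist,having a post-grad (MCA) degree. I am very much "
          "interested in developing application, which will help people.\n")

sentences = naive_sent_split(corpus)
print(len(sentences))  # 2, the same count as sent_tokenize above

# The naive rule breaks on abbreviations such as "Dr."
print(naive_sent_split("Dr. Smith arrived. He was late."))
# ['Dr.', 'Smith arrived.', 'He was late.']
```

On this corpus the naive rule happens to agree with `sent_tokenize`, but the second call shows why a trained sentence tokenizer is usually the safer choice.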

In [23]:
## word tokenization
## paragraph --> words
## sentence --> words

from nltk.tokenize import word_tokenize
vocab = word_tokenize(corpus)
vocab

['Hello',
 'welcome',
 ',',
 'My',
 'name',
 'is',
 'Subrat',
 'Mishra',
 ',',
 'I',
 'am',
 'an',
 'aspiring',
 'Data',
 'Sceintist',
 ',',
 'having',
 'a',
 'post-grad',
 '(',
 'MCA',
 ')',
 'degree',
 '.',
 'I',
 'am',
 'very',
 'much',
 'interested',
 'in',
 'developing',
 'application',
 ',',
 'which',
 'will',
 'help',
 'people',
 '.']
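Note that `vocab` above is the full token list, including punctuation and repeated words such as `'I'` and `'am'`. If a vocabulary of unique words is what you need, one possible approach is sketched below with `collections.Counter`. The `tokens` list is a shortened copy of the output above, and lowercasing and dropping punctuation-only tokens are assumptions of this sketch, not something the notebook prescribes.

```python
from collections import Counter

# A shortened copy of the word_tokenize output shown above
tokens = ['I', 'am', 'an', 'aspiring', 'Data', 'Sceintist', ',', 'having',
          'a', 'post-grad', '(', 'MCA', ')', 'degree', '.',
          'I', 'am', 'very', 'much', 'interested']

# Keep tokens that contain at least one letter or digit, and lowercase them
counts = Counter(t.lower() for t in tokens if any(ch.isalnum() for ch in t))

# The vocabulary is the set of distinct words, sorted for readability
vocabulary = sorted(counts)

print(len(tokens), len(vocabulary))  # 20 14
print(counts['i'], counts['am'])     # 2 2
```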

In [24]:
from nltk.tokenize import wordpunct_tokenize
# splits on every punctuation character, so 'post-grad' becomes 'post', '-', 'grad'
wordpunct = wordpunct_tokenize(corpus)
wordpunct

['Hello',
 'welcome',
 ',',
 'My',
 'name',
 'is',
 'Subrat',
 'Mishra',
 ',',
 'I',
 'am',
 'an',
 'aspiring',
 'Data',
 'Sceintist',
 ',',
 'having',
 'a',
 'post',
 '-',
 'grad',
 '(',
 'MCA',
 ')',
 'degree',
 '.',
 'I',
 'am',
 'very',
 'much',
 'interested',
 'in',
 'developing',
 'application',
 ',',
 'which',
 'will',
 'help',
 'people',
 '.']
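The only difference from `word_tokenize` on this corpus is that `'post-grad'` is split into `'post'`, `'-'`, `'grad'`. The NLTK documentation describes `WordPunctTokenizer` as a regular-expression tokenizer using the pattern `\w+|[^\w\s]+`, that is, runs of word characters or runs of punctuation. The sketch below applies that pattern directly with `re`; the function name `regex_wordpunct` is hypothetical and the code is an approximation for illustration, not NLTK's own implementation.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (i.e. punctuation)
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def regex_wordpunct(text):
    """Hypothetical stand-in that mimics wordpunct_tokenize with plain re."""
    return WORDPUNCT_PATTERN.findall(text)

print(regex_wordpunct("a post-grad (MCA) degree."))
# ['a', 'post', '-', 'grad', '(', 'MCA', ')', 'degree', '.']
```

Because every punctuation run becomes its own token, hyphenated words and contractions are always broken apart, which may or may not suit the downstream task.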

In [29]:
from nltk.tokenize import TreebankWordTokenizer
# the Treebank tokenizer expects text that is already split into sentences
tokenizer = TreebankWordTokenizer()

In [30]:
tokenizer.tokenize(corpus)

['Hello',
 'welcome',
 ',',
 'My',
 'name',
 'is',
 'Subrat',
 'Mishra',
 ',',
 'I',
 'am',
 'an',
 'aspiring',
 'Data',
 'Sceintist',
 ',',
 'having',
 'a',
 'post-grad',
 '(',
 'MCA',
 ')',
 'degree.',
 'I',
 'am',
 'very',
 'much',
 'interested',
 'in',
 'developing',
 'application',
 ',',
 'which',
 'will',
 'help',
 'people',
 '.']
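Compared with the `word_tokenize` output, the Treebank result keeps `'degree.'` as a single token. The NLTK documentation notes that this tokenizer assumes its input has already been segmented into sentences, so only the period at the very end of the string is split off. A small sketch with the standard `difflib` module can locate such differences automatically. The helper `token_diff` is hypothetical, and the two lists are shortened copies of the outputs shown above.

```python
from difflib import SequenceMatcher

def token_diff(a, b):
    """Return (tokens_from_a, tokens_from_b) pairs where the lists differ.

    Hypothetical helper for comparing the output of two tokenizers.
    """
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [(a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

# Shortened copies of the word_tokenize and Treebank outputs above
word_tokens = ['a', 'post-grad', '(', 'MCA', ')', 'degree', '.', 'I', 'am']
treebank_tokens = ['a', 'post-grad', '(', 'MCA', ')', 'degree.', 'I', 'am']

print(token_diff(word_tokens, treebank_tokens))
# [(['degree', '.'], ['degree.'])]
```

Passing one sentence at a time to `tokenizer.tokenize`, for example each item of `documents`, should avoid the attached period.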