<center><u><H1>Sentence splitter and Tokenization</H1></u></center>

## Sentence splitter:

In [1]:
string = 'Every one of us is, in the cosmic perspective, precious. If a human disagrees with you, let him live. In a hundred billion galaxies, you will not find another.'

In [2]:
# Method-1: Using nltk sent_tokenize

from nltk.tokenize import sent_tokenize

In [4]:
_sent = sent_tokenize(string)

In [4]:
print(_sent)

['Every one of us is, in the cosmic perspective, precious.', 'If a human disagrees with you, let him live.', 'In a hundred billion galaxies, you will not find another.']


In [5]:
# Method-2: Using nltk punkt tokenizer

import nltk.tokenize.punkt
# This tokenizer divides a text into a list of sentences
# by using an unsupervised algorithm to build a model for abbreviations,
# collocations, and words that start sentences.
# An instance created without training text, as below, has not learned any abbreviations.

In [6]:
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

In [7]:
tokenizer.tokenize(string)

['Every one of us is, in the cosmic perspective, precious.',
 'If a human disagrees with you, let him live.',
 'In a hundred billion galaxies, you will not find another.']
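
Modeling abbreviations matters because a splitter that looks only at punctuation breaks on them. The following is a minimal sketch of such a naive splitter, using only the standard library `re` module. The name `naive_sent_split` and the sample text are hypothetical, introduced here for illustration, and are not part of NLTK.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (illustrative only)."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

sample = 'Dr. Sagan wrote Cosmos. It was published in 1980.'
naive_sentences = naive_sent_split(sample)
print(naive_sentences)
# The abbreviation 'Dr.' is wrongly treated as a sentence end:
# ['Dr.', 'Sagan wrote Cosmos.', 'It was published in 1980.']
```

A Punkt-style tokenizer aims to avoid this kind of split by learning which tokens are abbreviations; how well it does so on a given text depends on the model in use.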

## Tokenization:

### Word Tokenizing:

In [8]:
a = 'Hi NLTK students ! level s10'

In [9]:
# Method-1: simplest tokenizer: uses white space as delimiter.
print(a.split())

['Hi', 'NLTK', 'students', '!', 'level', 's10']


In [10]:
# Method-2: Using nltk word_tokenize
from nltk.tokenize import word_tokenize

In [11]:
word_tokenize(a)

['Hi', 'NLTK', 'students', '!', 'level', 's10']

In [12]:
# Method-3: Using TreebankWordTokenizer
from nltk.tokenize import TreebankWordTokenizer

In [13]:
tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize(a))

['Hi', 'NLTK', 'students', '!', 'level', 's10']
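
All three methods agree on this sample only because the punctuation is already separated by spaces. Below is a minimal sketch, using the standard library only, of how whitespace splitting differs from a pattern-based split once punctuation is attached to a word. The string `attached` and the pattern are assumptions chosen for illustration; this is not what `word_tokenize` does internally.

```python
import re

# Same sample as above, but with '!' attached to the preceding word
attached = 'Hi NLTK students! level s10'

# Whitespace splitting keeps the punctuation glued to the word
whitespace_tokens = attached.split()

# A simple pattern: runs of word characters, or single punctuation marks
pattern_tokens = re.findall(r'\w+|[^\w\s]', attached)

print(whitespace_tokens)  # ['Hi', 'NLTK', 'students!', 'level', 's10']
print(pattern_tokens)     # ['Hi', 'NLTK', 'students', '!', 'level', 's10']
```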


### Removing Noise

In [14]:
#Example of removing numbers:
import re

def remove_numbers(text):
    return re.sub(r'\d+', "", text)

In [15]:
txt = 'This a sample sentence in English, \n with   whitespaces and many numbers 123456!'

In [16]:
print('Removed numbers:', remove_numbers(txt))

Removed numbers: This a sample sentence in English, 
 with   whitespaces and many numbers !
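
The sample text also contains a newline and runs of spaces, which `remove_numbers` leaves untouched. Below is a minimal sketch of collapsing them with `re.sub`. The helper name `remove_extra_whitespace` is hypothetical and is not an NLTK function.

```python
import re

def remove_extra_whitespace(text):
    """Collapse any run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r'\s+', ' ', text).strip()

txt = 'This a sample sentence in English, \n with   whitespaces and many numbers 123456!'

# Remove the digits first, then normalize the whitespace
cleaned = remove_extra_whitespace(re.sub(r'\d+', '', txt))
print(cleaned)
# This a sample sentence in English, with whitespaces and many numbers !
```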


In [17]:
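# Example of removing punctuation from text
import string  # note: rebinds the name `string`, which held the sample text above

def remove_punctuation(text):
    words = word_tokenize(text)
    # Drop tokens made up entirely of punctuation characters
    pun_removed = [w for w in words if not all(ch in string.punctuation for ch in w)]
    return " ".join(pun_removed)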

In [20]:
b = 'This is a great course of NLP using Python and NLTK!!! for this year 2017, isnt.?'
print(remove_punctuation(b))

This is a great course of NLP using Python and NLTK for this year 2017 isnt
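
Punctuation can also be removed at the character level, without tokenizing first. Below is a minimal sketch using `str.translate` from the standard library. The helper name `strip_punctuation_chars` is hypothetical, and this is an alternative technique rather than the approach used in the cell above.

```python
import string

def strip_punctuation_chars(text):
    """Delete every ASCII punctuation character from the text."""
    return text.translate(str.maketrans('', '', string.punctuation))

b = 'This is a great course of NLP using Python and NLTK!!! for this year 2017, isnt.?'
stripped = strip_punctuation_chars(b)
print(stripped)
# This is a great course of NLP using Python and NLTK for this year 2017 isnt
```

Because it works on characters, this version also removes punctuation inside a token, such as the apostrophe in `don't`, which may or may not be desired.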


In [25]:
from nltk.tokenize import regexp_tokenize, wordpunct_tokenize, blankline_tokenize

In [26]:
# + : one or more times | \w : letter, digit, or underscore
regexp_tokenize(b, pattern=r'\w+')

['This',
 'is',
 'a',
 'great',
 'course',
 'of',
 'NLP',
 'using',
 'Python',
 'and',
 'NLTK',
 'for',
 'this',
 'year',
 '2017',
 'isnt']

In [27]:
regexp_tokenize(b, pattern=r'\d+')

['2017']

In [28]:
c = 'The capital is raising up to $100000'

In [29]:
regexp_tokenize(c, pattern=r'\w+|\$')

['The', 'capital', 'is', 'raising', 'up', 'to', '$', '100000']

In [30]:
# Without + after [\d], only a single digit is matched after the $
regexp_tokenize(c, pattern=r'\w+|\$[\d]')

['The', 'capital', 'is', 'raising', 'up', 'to', '$1', '00000']

In [31]:
# \$[\d\.]+ : a $ followed by one or more digits or dots | \S+ : any other run of non-whitespace
regexp_tokenize(c, pattern=r'\w+|\$[\d\.]+|\S+')

['The', 'capital', 'is', 'raising', 'up', 'to', '$100000']
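
The effect of each pattern can be checked without NLTK. For patterns without capture groups, `regexp_tokenize` in its default (non-gap) mode returns the matches of the pattern, much like `re.findall`. Below is a minimal sketch under that assumption; NLTK compiles the pattern with its own default flags, so results could differ on other inputs.

```python
import re

c = 'The capital is raising up to $100000'

# '$' is matched on its own, so the amount becomes a separate token
dollar_alone = re.findall(r'\w+|\$', c)

# '$' plus exactly one digit; the remaining digits fall back to \w+
one_digit = re.findall(r'\w+|\$[\d]', c)

# '$' plus one or more digits or dots keeps the amount in one token
full_amount = re.findall(r'\w+|\$[\d.]+|\S+', c)

print(dollar_alone[-2:])  # ['$', '100000']
print(one_digit[-2:])     # ['$1', '00000']
print(full_amount[-1:])   # ['$100000']
```

Alternatives are tried left to right at each position, so `\w+` wins wherever a word character starts a token and the `\$` branches only apply at the dollar sign.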

In [32]:
wordpunct_tokenize(b)

['This',
 'is',
 'a',
 'great',
 'course',
 'of',
 'NLP',
 'using',
 'Python',
 'and',
 'NLTK',
 '!!!',
 'for',
 'this',
 'year',
 '2017',
 ',',
 'isnt',
 '.?']
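
The grouping of `'!!!'` and `'.?'` into single tokens follows from the pattern that NLTK documents for this tokenizer, `\w+|[^\w\s]+`: runs of word characters, or runs of characters that are neither word characters nor whitespace. Below is a minimal sketch reproducing the output above with `re.findall`, assuming that pattern.

```python
import re

b = 'This is a great course of NLP using Python and NLTK!!! for this year 2017, isnt.?'

# Runs of word characters, or runs of punctuation (non-word, non-space) characters
wordpunct_like = re.findall(r'\w+|[^\w\s]+', b)

print(wordpunct_like[10:12])  # ['NLTK', '!!!']
print(wordpunct_like[-2:])    # ['isnt', '.?']
```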

In [33]:
blankline_tokenize(b)

['This is a great course of NLP using Python and NLTK!!! for this year 2017, isnt.?']
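
`blankline_tokenize` returns the whole string as a single token here because `b` contains no blank lines. Below is a minimal sketch of the same idea on a multi-paragraph string, using `re.split`. The sample `paragraphs_text` and the helper name `split_on_blank_lines` are hypothetical, and the pattern is an assumption modeled on NLTK's blank-line tokenizer.

```python
import re

def split_on_blank_lines(text):
    """Split text wherever one or more blank lines occur, dropping empty pieces."""
    return [part for part in re.split(r'\s*\n\s*\n\s*', text) if part]

paragraphs_text = 'First paragraph.\n\nSecond paragraph,\nstill second.\n\n\nThird.'
paragraphs = split_on_blank_lines(paragraphs_text)
print(paragraphs)
# ['First paragraph.', 'Second paragraph,\nstill second.', 'Third.']
```

A single newline inside a paragraph is kept, since the pattern requires at least two newlines separated only by whitespace.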

## References:

https://docs.python.org/2/library/tokenize.html

http://www.nltk.org/_modules/nltk/tokenize.html

http://www.nltk.org/_modules/nltk/tokenize/punkt.html

http://www.nltk.org/_modules/nltk/tokenize/treebank.html

http://www.nltk.org/_modules/nltk/tokenize/regexp.html