### Tasks in Natural Language Processing:
* Tokenization - breaking down text into words and sentences
* Stopword removal - filtering common words
* N-Grams - identifying commonly occurring groups of words
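* Word sense disambiguation - identifying which meaning of a word is intended, based on the context in which it occurs
* Part-of-speech tagging - labeling each word as a noun, verb, adjective, and so on
* Stemming - reducing words to a common root by stripping their endings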

### 1. Tokenization

In [12]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
text = "Mary had a little lamb. Her fleece was white as snow"
sents = sent_tokenize(text)
print(sents)

['Mary had a little lamb.', 'Her fleece was white as snow']


In [13]:
words = [word_tokenize(sent) for sent in sents]
print(words)

[['Mary', 'had', 'a', 'little', 'lamb', '.'], ['Her', 'fleece', 'was', 'white', 'as', 'snow']]
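To make the idea behind the two tokenizers concrete, here is a rough standard-library sketch. It is not how NLTK implements `sent_tokenize` or `word_tokenize` (NLTK handles abbreviations, contractions, and many other cases); the helper names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical and exist only for this illustration.

```python
import re

def simple_sent_tokenize(text):
    # Split wherever '.', '!' or '?' is followed by whitespace.
    # Assumption: no abbreviations such as "Dr." appear in the text.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def simple_word_tokenize(sentence):
    # A token is either a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Mary had a little lamb. Her fleece was white as snow"
sents = simple_sent_tokenize(text)
words = [simple_word_tokenize(s) for s in sents]
print(sents)
print(words)
```

On this short example the sketch gives the same sentences and tokens as the NLTK calls above; on real text the two will diverge, so prefer the NLTK tokenizers.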


### 2. Removing Stopwords

In [14]:
from nltk.corpus import stopwords
from string import punctuation
customStopWords = set(stopwords.words('english') + list(punctuation))

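# The membership test is case-sensitive and NLTK's stopword list is lowercase,
# so the capitalized 'Her' is kept in the result.
wordsWOStopWords = [word for word in word_tokenize(text) if word not in customStopWords]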
print(wordsWOStopWords)

['Mary', 'little', 'lamb', 'Her', 'fleece', 'white', 'snow']
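Note that 'Her' survives the filter because the comparison is case-sensitive and the stopword list is lowercase. If that is not what you want, one option is to lowercase each token before the lookup. The sketch below assumes a small hand-picked set, `small_stopwords`, as a stand-in for `stopwords.words('english')` so that it runs without downloading any NLTK data.

```python
from string import punctuation

# Hypothetical, hand-picked subset standing in for stopwords.words('english').
small_stopwords = {"a", "as", "had", "her", "was"}
custom_stop_words = small_stopwords | set(punctuation)

tokens = ['Mary', 'had', 'a', 'little', 'lamb', '.',
          'Her', 'fleece', 'was', 'white', 'as', 'snow']

# Lowercase only for the comparison; the original casing is kept in the result.
filtered = [tok for tok in tokens if tok.lower() not in custom_stop_words]
print(filtered)
```

With the full NLTK list, the same `word.lower() not in customStopWords` test can be dropped into the list comprehension above.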


### 3. Identifying bigrams

In [17]:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(wordsWOStopWords)  # construct bigrams from the list of words
sorted(finder.ngram_fd.items())

[(('Her', 'fleece'), 1),
 (('Mary', 'little'), 1),
 (('fleece', 'white'), 1),
 (('lamb', 'Her'), 1),
 (('little', 'lamb'), 1),
 (('white', 'snow'), 1)]
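For intuition, the counts above can be reproduced without NLTK by pairing each word with its successor. This is only a minimal sketch of what a bigram frequency table is, not a replacement for `BigramCollocationFinder`, which also supports association measures and filtering. The word list is copied from the stopword-removal output.

```python
from collections import Counter

words_wo_stop_words = ['Mary', 'little', 'lamb', 'Her', 'fleece', 'white', 'snow']

# zip pairs each word with the one that follows it, yielding adjacent bigrams.
bigrams = list(zip(words_wo_stop_words, words_wo_stop_words[1:]))
bigram_counts = Counter(bigrams)
print(sorted(bigram_counts.items()))
```

Each of the six bigrams occurs once, matching the output of `finder.ngram_fd` for this text.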