In [1]:
import nltk

### Tokenizing Text

In [2]:
text = "Mary had a little lamb. Her fleece was white as snow"
from nltk.tokenize import word_tokenize, sent_tokenize
sents = sent_tokenize(text)
print(sents)

['Mary had a little lamb.', 'Her fleece was white as snow']


In [3]:
words = [word_tokenize(sent) for sent in sents]
print(words)

[['Mary', 'had', 'a', 'little', 'lamb', '.'], ['Her', 'fleece', 'was', 'white', 'as', 'snow']]


### Removing Stopwords

In [4]:
from nltk.corpus import stopwords
from string import punctuation
customStopWords = set(stopwords.words('english')+list(punctuation))

In [5]:
# The membership test is case-sensitive and NLTK's stopword list is lowercase, so 'Her' is kept.
wordsNOStopwords = [word for word in word_tokenize(text) if word not in customStopWords]
print(wordsNOStopwords)

['Mary', 'little', 'lamb', 'Her', 'fleece', 'white', 'snow']


### Identifying Bigrams

In [6]:
# Collocations are words that frequently occur together.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
bigram_measures = BigramAssocMeasures() # association measures (e.g. PMI) available for scoring bigrams
finder = BigramCollocationFinder.from_words(wordsNOStopwords) # constructs bigrams from a list of words
sorted(finder.ngram_fd.items()) # distinct bigrams and their frequencies

[(('Her', 'fleece'), 1),
 (('Mary', 'little'), 1),
 (('fleece', 'white'), 1),
 (('lamb', 'Her'), 1),
 (('little', 'lamb'), 1),
 (('white', 'snow'), 1)]
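
The frequency table above can be reproduced without NLTK, which may help show what "constructing bigrams" means. This is only a minimal sketch of the idea (pair each token with its successor and count the pairs), not a description of how `BigramCollocationFinder` is implemented; the token list is copied from the stopword-removal cell.

```python
from collections import Counter

# Token list copied from the stopword-removal output above.
words = ['Mary', 'little', 'lamb', 'Her', 'fleece', 'white', 'snow']

# Pair each token with the one that follows it, then count each pair.
bigram_counts = Counter(zip(words, words[1:]))

print(sorted(bigram_counts.items()))
```

Every pair occurs once here because the text is tiny; on a larger corpus the counts start to separate common collocations from one-off pairs.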

### Stemming

In [7]:
text2 = "Mary closed on closing night when she was in the mood to close."
# "closed", "closing" and "close" are different morphological forms of the same word, "close".

from nltk.stem.lancaster import LancasterStemmer
st = LancasterStemmer()
stemmedWords = [st.stem(word) for word in word_tokenize(text2)] # reduce each token to its stem
print(stemmedWords)

['mary', 'clos', 'on', 'clos', 'night', 'when', 'she', 'was', 'in', 'the', 'mood', 'to', 'clos', '.']


### Parts of Speech Tagging

In [8]:
nltk.pos_tag(word_tokenize(text2))
# NNP -> proper noun, singular
# VBD -> verb, past tense
# PRP -> personal pronoun

[('Mary', 'NNP'),
 ('closed', 'VBD'),
 ('on', 'IN'),
 ('closing', 'NN'),
 ('night', 'NN'),
 ('when', 'WRB'),
 ('she', 'PRP'),
 ('was', 'VBD'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('mood', 'NN'),
 ('to', 'TO'),
 ('close', 'VB'),
 ('.', '.')]

### Word Sense Disambiguation
- Identifying the meaning of a word based on the context in which it occurs.

#### In NLTK, word meanings can be looked up in a resource called WordNet.
- WordNet is a lexical database, similar to a thesaurus, that stores words and the relationships between them.
- Synset: an entry in WordNet. Each synset represents one single sense (definition) of a word.

In [9]:
from nltk.corpus import wordnet as wn
for ss in wn.synsets('bass'):
    print(ss, ss.definition())

Synset('bass.n.01') the lowest part of the musical range
Synset('bass.n.02') the lowest part in polyphonic music
Synset('bass.n.03') an adult male singer with the lowest voice
Synset('sea_bass.n.01') the lean flesh of a saltwater fish of the family Serranidae
Synset('freshwater_bass.n.01') any of various North American freshwater fish with lean flesh (especially of the genus Micropterus)
Synset('bass.n.06') the lowest adult male singing voice
Synset('bass.n.07') the member with the lowest range of a family of musical instruments
Synset('bass.n.08') nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes
Synset('bass.s.01') having or denoting a low vocal or instrumental range


In [10]:
# Lesk: an algorithm for word sense disambiguation
from nltk.wsd import lesk
sense1 = lesk(word_tokenize("Sing in a lower tone, along with the bass"), 'bass')
print(sense1, sense1.definition())

Synset('bass.n.07') the member with the lowest range of a family of musical instruments


In [11]:
sense2 = lesk(word_tokenize("This sea bass was really hard to catch"), 'bass')
print(sense2, sense2.definition())

Synset('sea_bass.n.01') the lean flesh of a saltwater fish of the family Serranidae
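
To give a feel for how a Lesk-style method picks a sense, here is a simplified sketch: choose the sense whose gloss shares the most words with the context. It is not NLTK's `lesk` implementation. The `GLOSSES` dictionary is a hand-copied subset of the `wn.synsets('bass')` output above, and `simplified_lesk` and the two example sentences are hypothetical names and inputs chosen for illustration.

```python
# Glosses copied from the wn.synsets('bass') output above (a subset).
GLOSSES = {
    'bass.n.03': 'an adult male singer with the lowest voice',
    'bass.n.07': 'the member with the lowest range of a family of musical instruments',
    'bass.n.08': 'nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes',
}

def simplified_lesk(context_sentence, glosses):
    """Return the sense key whose gloss shares the most words with the context."""
    context = set(context_sentence.lower().split())

    def overlap(sense):
        return len(context & set(glosses[sense].lower().split()))

    return max(glosses, key=overlap)

print(simplified_lesk("he plays bass in a family of string instruments", GLOSSES))
print(simplified_lesk("we cooked edible freshwater bass for dinner", GLOSSES))
```

Function words such as "of" and "a" count toward the overlap in this sketch, so removing stopwords first would likely make the choice more robust.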


## Spam Detection
- Rule-based approach
    - Rules are written by hand.
    - Example: flag emails that contain specific keywords.
- Machine learning approach
    - Emails are still passed through a set of rules and filtered accordingly.
    - The difference from the rule-based approach is that the rules are derived from historical data rather than written by hand.
    - The historical data could be emails tagged as spam or not spam, from which the algorithm infers the rules.
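
The contrast between the two approaches can be sketched in a few lines. Everything here is illustrative: the `HAND_PICKED` keywords, the `history` messages and the `learn_keywords` heuristic are assumptions made up for this example, and a real spam filter would use a proper learning algorithm rather than this simple counting rule.

```python
from collections import Counter

# Rule-based: a person chooses the keywords.
HAND_PICKED = {"free", "winner", "prize"}

def is_spam(message, keywords):
    """Flag a message if any of its words is in the keyword set."""
    return any(word in keywords for word in message.lower().split())

# Hypothetical historical data: (message, tagged_as_spam).
history = [
    ("claim your free prize now", True),
    ("cheap loans approved today", True),
    ("cheap meds free shipping", True),
    ("lunch at noon today", False),
    ("project report attached", False),
]

def learn_keywords(labelled, min_count=2):
    """Derive keywords from data: words seen in at least `min_count`
    spam messages and in no non-spam message."""
    spam, ham = Counter(), Counter()
    for message, tagged_as_spam in labelled:
        (spam if tagged_as_spam else ham).update(set(message.lower().split()))
    return {w for w, n in spam.items() if n >= min_count and ham[w] == 0}

learned = learn_keywords(history)
print(sorted(learned))
print(is_spam("cheap watches", HAND_PICKED))  # hand-written rules miss "cheap"
print(is_spam("cheap watches", learned))      # rules derived from the data catch it
```

The filtering step is identical in both cases; only the origin of the rules changes.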

## Typical ML Workflow
1. Pick your problem
 - Identify which type of problem we need to solve.
    - Classification*  -> spam/not | sentiment analysis
    - Clustering
    - Recommendations
    - Regression*
2. Represent Data
 - Represent data using numeric attributes
    - Machine learning algorithms work with numbers, so text must be represented by meaningful numeric attributes.
    - The numeric attributes used to represent an instance are called "features".
        - Term Frequency (TF): how often each word occurs in the text.
        - Term Frequency - Inverse Document Frequency (TF-IDF): term frequency, weighted down for words that occur in many documents.
3. Apply a standard algorithm
 - Find patterns in the historical data.
 - Rules are meant to quantify relationships between variables.
 - A model is a representation of the patterns identified by the algorithm, for example:
    - a mathematical equation
    - a set of rules (if-then-else statements)
    
 - Classification Algorithms
    - Naive Bayes
    - Support Vector Machines
    
 - Clustering Algorithms     
    - K-Means
    - Hierarchical Clustering
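
The TF and TF-IDF features mentioned in step 2 can be computed by hand, as sketched below. Several TF-IDF variants exist; this one assumes TF is the count divided by document length and IDF is `log(N / df)`. The three-document corpus and the function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical mini-corpus; each document is already tokenized and lower-cased.
docs = [
    ["mary", "had", "a", "little", "lamb"],
    ["her", "fleece", "was", "white", "as", "snow"],
    ["mary", "closed", "on", "closing", "night"],
]

def term_frequency(doc):
    """Relative frequency of each term within one document."""
    counts = Counter(doc)
    return {term: n / len(doc) for term, n in counts.items()}

def inverse_document_frequency(corpus):
    """log(N / df): terms found in fewer documents get a larger weight."""
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tf_idf(corpus):
    """One {term: weight} dictionary per document."""
    idf = inverse_document_frequency(corpus)
    return [
        {term: tf * idf[term] for term, tf in term_frequency(doc).items()}
        for doc in corpus
    ]

weights = tf_idf(docs)
print(weights[0])
```

In the first document, "mary" ends up with a lower weight than "lamb" because "mary" also appears in the third document, which is the effect TF-IDF is meant to capture.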

## Classification 
* A category of machine learning algorithms that assign data to categories based on certain attributes.
* Problem instance: the entity or piece of text to be classified, e.g. an email or a tweet.
 - A label is assigned to the problem instance to categorize it.
* Classifiers: algorithms that perform classification.
    - Prerequisite: a set of instances for which the correct category membership is known.
    - Training data: essential for a classifier to infer the rules or patterns that drive the classification.
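
A classifier learning from labelled training data can be sketched with a small multinomial Naive Bayes written from scratch. This is a teaching sketch under stated assumptions: whitespace tokenization, add-one (Laplace) smoothing, and a made-up four-message training set. The `train` and `classify` names are hypothetical, not an NLTK API.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data: instances with known category membership.
training_data = [
    ("win a free prize now", "spam"),
    ("free cash offer", "spam"),
    ("meeting moved to monday", "ham"),
    ("see you at lunch", "ham"),
]

def train(data):
    """Count documents per label and word occurrences per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in data:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify(text, model):
    """Return the label with the highest log-probability score."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(training_data)
print(classify("free prize offer", model))
print(classify("lunch on monday", model))
```

Here the "model" is just the table of counts, and the label assigned to a new problem instance depends entirely on the patterns found in the training data.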

## Clustering
* Divides data into groups (clusters) based on common attributes.
* The groups are not known beforehand.
* Useful for exploring a body of text.
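
A minimal K-Means sketch shows how groups can emerge without labels. It assumes each item has already been turned into a numeric feature vector (for text, that could be TF-IDF weights); the 2-D points below are hypothetical, and the first `k` points are used as starting centroids purely to keep the run reproducible, whereas real implementations usually choose them at random.

```python
def k_means(points, k, iterations=10):
    """Plain K-Means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points."""
    centroids = [points[i] for i in range(k)]  # deterministic start (assumption)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Squared Euclidean distance to every centroid.
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical feature vectors: two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
print(clusters)
```

Only the number of clusters `k` is supplied; which points belong together is discovered from the data.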