Lecture Notes

### Session 1

### 1.1 Tokenization

In [3]:
from nltk.tokenize import sent_tokenize, word_tokenize
TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat so many cookies."
print(sent_tokenize(TEXT))
print(word_tokenize(TEXT))


['Hello Mr. Smith, how are you doing today?', 'The weather is great, and Python is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat so many cookies."]
['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'so', 'many', 'cookies', '.']


### 1.2 Part-of-speech (POS) tagging

In [4]:
import nltk
text = word_tokenize("And now for something completely different")
print(nltk.pos_tag(text))


[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]


### 1.3 Unigrams vs. Bigrams

In [5]:
from nltk.util import ngrams
from collections import Counter
text = "This camera produces awesome pictures."
token = nltk.word_tokenize(text)
bigrams = ngrams(token,2)
trigrams = ngrams(token,3)
print(Counter(bigrams))

Counter({('camera', 'produces'): 1, ('This', 'camera'): 1, ('awesome', 'pictures'): 1, ('produces', 'awesome'): 1, ('pictures', '.'): 1})
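
The cell above counts only bigrams. As a rough sketch of how unigrams and bigrams relate, the n-grams can also be built without NLTK by zipping staggered copies of the token list. The helper name `simple_ngrams` and the hand-written token list are assumptions made for illustration; they are not part of NLTK.

```python
from collections import Counter

def simple_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token list."""
    # zip() over n copies of the list, each shifted by one more position
    return list(zip(*(tokens[i:] for i in range(n))))

# Tokens written out by hand so the sketch runs without NLTK data
tokens = ["This", "camera", "produces", "awesome", "pictures", "."]

unigrams = simple_ngrams(tokens, 1)  # single tokens, wrapped in 1-tuples
bigrams = simple_ngrams(tokens, 2)   # adjacent pairs

print(Counter(unigrams))
print(Counter(bigrams))
```

A sequence of N tokens yields N unigrams and N - 1 bigrams, so the six tokens here give five bigrams, matching the five entries in the counter above.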


### 1.4 Stemming & Lemmatization

Stemming:  “The process for reducing inflected (or sometimes derived) words to their stem…”

Lemmatization: “The process of grouping together the different inflected forms of a word so they can be analyzed as a single item.”


In [12]:
from nltk.stem import PorterStemmer
token = nltk.word_tokenize("The striped bats are hanging on their feet for best")
stemmer = PorterStemmer()  # create the stemmer once and reuse it
print([stemmer.stem(t) for t in token])

['the', u'stripe', u'bat', 'are', u'hang', 'on', 'their', 'feet', 'for', 'best']


In [9]:
wnl = nltk.WordNetLemmatizer()
print([wnl.lemmatize(t) for t in token])

['The', 'striped', u'bat', 'are', 'hanging', 'on', 'their', u'foot', 'for', 'best']


### Session 2

### 2.1 TF-IDF

Term frequency (tf) is the frequency of the $j$-th term in the $i$-th document ($f_{ij}$).

Document frequency ($F_{j}$) is the fraction of documents that contain the $j$-th term. Inverse document frequency (idf) is the log of its reciprocal, log(1/ $F_{j}$), so rarer terms receive a larger weight.

Term frequency-inverse document frequency (tf-idf) = $f_{ij}$ * log(1/ $F_{j}$)

The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. 
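
To make the formula concrete, here is a minimal sketch that computes $f_{ij}$ * log(1/ $F_{j}$) by hand on a small corpus mirroring the four sentences used in this section. The function names (`tf`, `doc_fraction`, `tf_idf`) and the hand-tokenized documents are assumptions for illustration. The numbers will not match scikit-learn's `TfidfVectorizer`, which by default uses a smoothed idf and L2-normalizes each document vector.

```python
import math

# Hand-tokenized, lowercased toy corpus (Python 3: "/" is true division)
docs = [
    ["this", "is", "the", "first", "document"],
    ["this", "document", "is", "the", "second", "document"],
    ["and", "this", "is", "the", "third", "one"],
    ["is", "this", "the", "first", "document"],
]

def tf(term, doc):
    # f_ij: raw count of the term in one document
    return doc.count(term)

def doc_fraction(term, docs):
    # F_j: fraction of documents that contain the term
    return sum(term in d for d in docs) / len(docs)

def tf_idf(term, doc, docs):
    # f_ij * log(1 / F_j), exactly as in the formula above
    return tf(term, doc) * math.log(1 / doc_fraction(term, docs))

print(tf_idf("second", docs[1], docs))    # term found in 1 of 4 documents
print(tf_idf("document", docs[1], docs))  # term found in 3 of 4 documents
print(tf_idf("this", docs[1], docs))      # term found in every document
```

A term that occurs in every document has $F_{j}$ = 1, so its weight is log(1) = 0 no matter how often it appears. This is the "offset" described above.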


In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pandas as pd
corpus = ['This is the first document.', 
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

In [36]:
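# Average TF-IDF weight of each term across all documents
weights = np.asarray(X.mean(axis=0)).ravel()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (1.0+)
weights_df = pd.DataFrame({'Term': vectorizer.get_feature_names_out(), 'TF-IDF': weights})
weights_df = weights_df[['Term', 'TF-IDF']]
weights_df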

| | Term | TF-IDF |
|---|---|---|
| 0 | and | 0.127962 |
| 1 | document | 0.406802 |
| 2 | first | 0.290143 |
| 3 | is | 0.329091 |
| 4 | one | 0.127962 |
| 5 | second | 0.134662 |
| 6 | the | 0.329091 |
| 7 | third | 0.127962 |
| 8 | this | 0.329091 |


Zipf's Law: the frequency of a word is roughly inversely proportional to its rank when words are sorted by decreasing frequency.
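
One way to inspect this relationship is to rank words by frequency and look at the product rank × frequency, which Zipf's law predicts to be roughly constant. The sketch below only shows the mechanics on a made-up sentence (an assumption for illustration). A text this short is far too small to exhibit the law, so a real corpus would be needed to see the pattern.

```python
from collections import Counter

# Made-up toy text; substitute a large corpus to actually observe the law
text = ("the cat sat on the mat and the dog sat on the log "
        "and the cat saw the dog")

counts = Counter(text.split())
ranked = counts.most_common()  # (word, frequency) pairs, most frequent first

# Print rank, word, frequency, and rank * frequency for each word
for rank, (word, freq) in enumerate(ranked, start=1):
    print(rank, word, freq, rank * freq)
```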

### Session 3

### 3.1 Naive Bayes in Text Classification

Naïve Bayes makes two simplifying assumptions:

1. The probability of each word occurring in the document is independent of the occurrences of the other words (given the class).

2. The position of words in the document does not matter.
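
The second assumption is what a bag-of-words representation encodes. As a small illustration (the helper `bag_of_words` and the two example sentences are made up for this sketch), two sentences with the same words in a different order become indistinguishable once they are reduced to word counts:

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase, split on whitespace, and keep only the word counts
    return Counter(text.lower().split())

a = bag_of_words("the movie was great not boring")
b = bag_of_words("the movie was boring not great")

# Same words, different order: the two bags are identical
print(a == b)  # True
```

The two sentences carry opposite sentiment, yet a classifier built on these features sees them as the same input. That is the price of the simplification.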


Bayes' rule applied to documents and classes:

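$P(c \mid d) = \frac{P(d \mid c) \, P(c)}{P(d)}$

where $P(c)$ is the prior probability of class $c$, in other words, how often this class occurs overall.

$P(d \mid c)$ is the likelihood of document $d$ given class $c$. Under the two assumptions above, it factorizes into a product over the words of the document, $\prod_i P(w_i \mid c)$, where $P(w_i \mid c)$ is the probability of word $w_i$ occurring in class $c$.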

One problem arises when a word, say 'fantastic', does not appear in any training document of the class positive. Then P(fantastic | positive) = 0, and because the word probabilities are multiplied together, the whole probability becomes zero. To fix this, we use a smoothing algorithm (Laplace smoothing): we add one to every word count in the numerator and add the vocabulary size to the denominator, so that the probabilities still sum to one.
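
The effect of smoothing can be sketched on a tiny made-up training set. Everything below (the two one-line "documents", the name `word_prob`, the `alpha` parameter) is an assumption for illustration. It implements add-one smoothing in its usual form, where one is added to every word count and the denominator therefore grows by the vocabulary size.

```python
from collections import Counter

# Made-up training data: one tokenized document per class
train = {
    "positive": "great movie great acting".split(),
    "negative": "boring movie bad acting".split(),
}

vocab = {w for words in train.values() for w in words}      # 5 distinct words
counts = {c: Counter(words) for c, words in train.items()}  # per-class word counts

def word_prob(word, cls, alpha=1):
    # P(word | cls) = (count + alpha) / (total words in class + alpha * |V|)
    total = sum(counts[cls].values())
    return (counts[cls][word] + alpha) / (total + alpha * len(vocab))

# "boring" never occurs in the positive class
print(word_prob("boring", "positive", alpha=0))  # unsmoothed: 0.0
print(word_prob("boring", "positive"))           # smoothed: 1/9
```

With `alpha=1` the unseen word gets a small non-zero probability, and the probabilities over the whole vocabulary still sum to one for each class.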

Here's an example adapted from: https://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/

In [39]:
# This step creates a bag of words: each word present in the document maps to True.
def word_feats(words):
    return {word: True for word in words}

In [38]:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

# File ids of the negative and positive reviews
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

# One (features, label) pair per review
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# Use 3/4 of each class for training; integer division keeps the slice indices ints
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)
print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
classifier.show_most_informative_features()

train on 1500 instances, test on 500 instances
accuracy: 0.728
Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
              astounding = True              pos : neg    =     10.3 : 1.0
             fascination = True              pos : neg    =     10.3 : 1.0
                 idiotic = True              neg : pos    =      9.8 : 1.0
