# Text Mining

Hi everyone, <br />

This session is about text mining. 

It will walk you through the following sections:

1. Text pre-processing
2. Term Frequency analysis (TF)
3. Inverse Document Frequency (IDF)
4. Term Frequency - Inverse Document Frequency (TF-IDF)
5. Text classification
6. Sentiment analysis

**Before starting, let us import some basic text mining tools**

In [None]:
#!/usr/bin/env python
# -*- coding: utf-8 -*- 
# Basic imports
import pickle
from pprint import pprint
import collections
import numpy as np
import matplotlib.pyplot as plt
import operator

# Natural Language Tool Kit (NLTK) imports
import nltk
from nltk.data  import load
from nltk.tokenize.treebank import TreebankWordTokenizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

# Machine Learning Library (sklearn) imports
from sklearn import metrics
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

In [None]:
# Instantiate objects from NLTK
sentence_splitter = load('tokenizers/punkt/english.pickle')
tokenizer = TreebankWordTokenizer()
stemmer = PorterStemmer()

For this text mining session, we will use **reviews from Amazon**. These are product reviews covering 4 product types:
* Books
* DVD
* Electronics
* Kitchen Appliances

For each review, we know the sentiment associated with it. Here are a few examples:

**Book** 

"*What a waste of time. This was like sitting through a very boring business course.*"

"*An excellent, well explained art book, with beautiful and easy to follow illustrations. The book is a treasure chest of ideas suitable for the primary classroom. This book provides plenty of opportunities to explore the various strands of the visual arts field. A great resource for any teacher, parent or doting aunt*"

**DVD** 

"*The sound on this DVD is absolutely horrible.  The dialogue is at a much lower volume than the music and sound effects, making it impossible to view without constantly tinkering.  I also have the VHS, on which the sound is perfect, so I can still watch this wonderful movie.  But I would sure like to get my money back for the DVD*"

**Electronics** 

"*Terrible Design did not fit my car's electical outlet, it does not work with my Jeep Liberty.  The jack construction is defective does not even power on*"

**Kitchen appliances** 

"*Great blender! I use it daily to make smoothies and it never fails. Powerful motor purees frozen fruits great!! Simple--only two speeds. Easy clean-up *"
 

## 1. Text pre-processing

### 1.1  Split into sentences

In [None]:
review = """What a waste of time. This was like sitting through a very boring business course."""
for sentence in sentence_splitter.tokenize(review):
    pprint(sentence)

### 1.2 Split sentence into tokens

In [None]:
sentence = """This was like sitting through a very boring business course."""
for token in tokenizer.tokenize(sentence):
    pprint(token)

### 1.3 Convert tokens to lower case

In [None]:
tokenized_sentence = ['This','was','like','sitting','through','a','very','boring','business','course','.']
for token in tokenized_sentence:
    token = token.lower()
    pprint(token)

### 1.4 Remove punctuation

In [None]:
punctuation = set([",", ".", ";", "/", ":", "-", "--" ,"!", "?", "(", ")","'",'"',"''", "``"])
tokenized_sentence = ['this','was','like','sitting','through','a','very','boring','business','course','.']
for token in tokenized_sentence:
    if token not in punctuation:
        pprint(token)

### 1.5 Remove stop words

In [None]:
stopwords_set = set(stopwords.words('english'))
pprint(stopwords_set)

In [None]:
tokenized_sentence = ['this','was','like','sitting','through','a','very','boring','business','course']
for token in tokenized_sentence:
    if token not in stopwords_set:
        pprint(token)

### 1.6 Stemming

In [None]:
tokenized_sentence = ['like','sitting','boring','business','course']
for token in tokenized_sentence:
    stem = stemmer.stem(token)
    pprint(stem)

### 1.7 All together

In [None]:
# Preprocess a given text
def preprocess_text(review):
    tokens = []
    # 1. Split into sentences
    for sentence in sentence_splitter.tokenize(review):
        # 2. Split into tokens
        for token in tokenizer.tokenize(sentence):
            token = token.lower()
            # 3. Filter on stoplist and punctuation
            if token not in stopwords_set and token not in punctuation:
                # 4. Stemming (takes root)
                stem = stemmer.stem(token)
                tokens.append(stem)
    return tokens

pprint( preprocess_text("""An excellent, well explained art book, with beautiful and easy to follow illustrations. 
The book is a treasure chest of ideas suitable for the primary classroom. 
This book provides plenty of opportunities to explore the various strands of the visual arts field. 
A great resource for any teacher, parent or doting aunt"""))

## 2. Term Frequency analysis (TF)

A central question in text mining and natural language processing is how to quantify what a document is about.

One measure of how important a word may be is its **term frequency** (TF): how often the word occurs in a document.
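
As a minimal sketch of this definition, the snippet below counts term frequencies on a made-up sentence. The tokens and the variable names (`toy_tokens`, `tf_counts`, `tf_relative`) are assumptions introduced for illustration; they are not part of the review data set. It shows both the raw count and the count divided by the document length, which is one common way to make documents of different sizes comparable.

```python
from collections import Counter

# Hypothetical toy document (not taken from the Amazon reviews)
toy_tokens = "the book is a great book about a great idea".split()

# Raw term frequency: number of occurrences of each token
tf_counts = Counter(toy_tokens)

# Relative term frequency: occurrences divided by the document length
tf_relative = {term: count / len(toy_tokens) for term, count in tf_counts.items()}

print(tf_counts.most_common(3))
print(tf_relative["book"])
```

The rest of this section uses the raw counts, computed the same way with `collections.Counter`.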

### 2.1 Without pre-processing

#### Book reviews

In [None]:
# Preview the first three positive book reviews
with open('data/sorted_data_acl/books/positive_text.review', 'r', encoding='utf-8') as myfile:
    pos_books = myfile.readlines()

print('------')
for review_line in pos_books[:3]:
    pprint(review_line)
    print('------')

In [None]:
# Merge book reviews together (files are read as UTF-8 text)
with open('data/sorted_data_acl/books/positive_text.review', 'r', encoding='utf-8') as myfile:
    pos_book = myfile.read()
with open('data/sorted_data_acl/books/negative_text.review', 'r', encoding='utf-8') as myfile:
    neg_book = myfile.read()
book_reviews = pos_book + neg_book

In [None]:
# Split into words, without further pre-processing
tokens = tokenizer.tokenize(book_reviews)

# count frequency of words
counter = collections.Counter(tokens)
pprint(counter.most_common(15))

#### DVD reviews

In [None]:
# Merge dvd reviews together
with open('data/sorted_data_acl/dvd/positive_text.review', 'r', encoding='utf-8') as myfile:
    pos_dvd = myfile.read()
with open('data/sorted_data_acl/dvd/negative_text.review', 'r', encoding='utf-8') as myfile:
    neg_dvd = myfile.read()
dvd_reviews = pos_dvd + neg_dvd
    
# Split into words, without further pre-processing
tokens = tokenizer.tokenize(dvd_reviews)

# count frequency of words
counter=collections.Counter(tokens)
pprint(counter.most_common(15))

As we can see, the top of the list is dominated by common words that tell us little about our reviews.

Pre-processing allows us to remove some of the highly frequent common words.

### 2.2 With pre-processing

#### Book reviews

In [None]:
# pre-processing of reviews
book_reviews_prepro = preprocess_text(book_reviews.replace("'",' '))

# count frequency of words
counter=collections.Counter(book_reviews_prepro)
pprint(counter.most_common(15))

#### DVD reviews

In [None]:
# pre-processing of reviews
dvd_reviews_prepro = preprocess_text(dvd_reviews.replace("'",' '))

# count frequency of words
counter=collections.Counter(dvd_reviews_prepro)
pprint(counter.most_common(15))

#### Electronics  reviews

In [None]:
# Merge Electronics reviews together
with open('data/sorted_data_acl/electronics/positive_text.review', 'r', encoding='utf-8') as myfile:
    pos_elec = myfile.read()
with open('data/sorted_data_acl/electronics/negative_text.review', 'r', encoding='utf-8') as myfile:
    neg_elec = myfile.read()
elec_reviews = pos_elec + neg_elec

# pre-processing of reviews
elec_reviews_prepro = preprocess_text(elec_reviews.replace("'",' '))

# count frequency of word
counter=collections.Counter(elec_reviews_prepro)
pprint(counter.most_common(15))

#### Kitchen appliance reviews

In [None]:
# Merge Kitchen appliance reviews together
with open('data/sorted_data_acl/kitchen_&_housewares/positive_text.review', 'r', encoding='utf-8') as myfile:
    pos_kitch = myfile.read()
with open('data/sorted_data_acl/kitchen_&_housewares/negative_text.review', 'r', encoding='utf-8') as myfile:
    neg_kitch = myfile.read()
kitch_reviews = pos_kitch + neg_kitch

# pre-processing of reviews
kitch_reviews_prepro = preprocess_text(kitch_reviews.replace("'",' '))

# count frequency of word
counter=collections.Counter(kitch_reviews_prepro)
pprint(counter.most_common(15))

The most frequent words now characterise our reviews better.

However, there are still many **common words** that are not very useful (*one, get, make, even, also, well*) and that are present in all reviews, whatever the category or the sentiment.

We would like to extract truly **distinctive keywords** to better characterise our reviews within each category.
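
One way to spot such shared words is to intersect the top-N lists of the categories. The sketch below does this on hand-made token lists; `toy_tokens_by_category` and `top_n` are hypothetical stand-ins for the `*_reviews_prepro` lists and for the cut-off of 15 used above.

```python
from collections import Counter

# Hypothetical stemmed tokens per category (stand-ins for the *_reviews_prepro lists)
toy_tokens_by_category = {
    "books": ["book", "read", "one", "get", "stori", "one", "get"],
    "dvd": ["movi", "film", "one", "get", "watch", "one", "get"],
    "kitchen": ["pan", "cook", "one", "get", "use", "one", "get"],
}

top_n = 3  # size of the "most frequent" list kept for each category

# Set of the top_n most frequent words of each category
top_words = {
    category: {word for word, _ in Counter(tokens).most_common(top_n)}
    for category, tokens in toy_tokens_by_category.items()
}

# Words that appear in the top list of every category
shared_words = set.intersection(*top_words.values())
print(sorted(shared_words))
```

Words that end up in `shared_words` are frequent everywhere and therefore say little about any single category.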

In [None]:
# Distribution of word frequencies (log-log scale)
word_frequencies = list(counter.values())
plt.hist(word_frequencies, 500)
plt.xscale('log')
plt.yscale('log')
plt.xlabel('Number of occurrences of a word')
plt.ylabel('Number of words')
plt.show()

## 3. Inverse Document Frequency (IDF)

One way to correct these frequencies is to look at a term’s inverse document frequency (idf), which decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents.
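
To make the idea concrete, here is a small sketch that computes IDF by hand on a toy corpus. The corpus and the function name `inverse_document_frequency` are assumptions made for illustration. The formula used is the smoothed variant ln((1 + N) / (1 + df)) + 1, where N is the number of documents and df the number of documents containing the term; to the best of my knowledge this matches the default behaviour of scikit-learn's `TfidfVectorizer` (`smooth_idf=True`), but other definitions of IDF exist.

```python
import math

# Hypothetical toy corpus: each document is a list of tokens
toy_corpus = [
    ["great", "book", "read"],
    ["great", "movie", "plot"],
    ["great", "blender", "motor"],
]

def inverse_document_frequency(term, corpus):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(corpus)                                # N: number of documents
    doc_freq = sum(1 for doc in corpus if term in doc)  # df: documents containing the term
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

idf_great = inverse_document_frequency("great", toy_corpus)      # appears in every document
idf_blender = inverse_document_frequency("blender", toy_corpus)  # appears in one document
print(idf_great, idf_blender)
```

The word that occurs in every document receives the lowest weight, while the word that occurs in a single document receives a higher one.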

Let us take an English corpus containing thousands of sentences, each treated as one document, to evaluate the inverse document frequency of English words.

#### Load the Brown English corpus

In [None]:
# Download the Brown corpus (only needed once)
nltk.download('brown')

In [None]:
from nltk.corpus import brown
brown_corpus = brown.sents()

# Preview the first three sentences (each one is a list of tokens)
print("------")
print(brown_corpus[0])
print("------")
print(brown_corpus[1])
print("------")
print(brown_corpus[2])
print("------")

#### Pre-process the sentences

In [None]:
preprocessed_sentences = []
for sentence in brown_corpus:
    clean_sentence = preprocess_text(' '.join(sentence))
    preprocessed_sentences.append(" ".join(clean_sentence))

#### Compute the Inverse Document Frequency of English words

In [None]:
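vectorizer = TfidfVectorizer(stop_words='english')
# fit() returns the fitted vectorizer itself, so tf_idf and vectorizer are the same object
tf_idf = vectorizer.fit(preprocessed_sentences)
idf = vectorizer.idf_
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
idf = dict(zip(vectorizer.get_feature_names_out(), idf))
# Sort the (word, idf) pairs by increasing IDF value
sorted_idf = sorted(idf.items(), key=operator.itemgetter(1))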

#### Most common words across English documents (low IDF value)

In [None]:
pprint(sorted_idf[0:20])

#### Rarest words across English documents (high IDF value)

In [None]:
sorted_idf.reverse()
pprint(sorted_idf[0:20])

## 4. Term Frequency - Inverse Document Frequency (TF-IDF)

Now that we have a measure that quantifies how important a word is within a document (TF) and another one that quantifies how common a word is across the language (IDF), we can combine them into a single measure:

**TF-IDF = TF * IDF**

This measure (TF-IDF) attempts to find the words that are important within a document (i.e. frequent in it) but not too common across documents (i.e. in the English language in general).
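
The product can be sketched on toy data as follows. Everything in this snippet (`reference_docs`, `document`, `smoothed_idf`) is hypothetical and only meant to illustrate the formula; the smoothed IDF variant ln((1 + N) / (1 + df)) + 1 is an assumption chosen because, as far as I know, it is the default in scikit-learn.

```python
import math
from collections import Counter

# Hypothetical reference corpus standing in for "the language in general"
reference_docs = [
    ["great", "book", "read"],
    ["great", "movie", "plot"],
    ["great", "blender", "motor"],
]
# Hypothetical document we want to describe
document = ["great", "movie", "great", "plot", "movie", "movie"]

def smoothed_idf(term, corpus):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log((1 + len(corpus)) / (1 + doc_freq)) + 1

# TF: raw counts within the document
tf = Counter(document)

# TF-IDF = TF * IDF for every term of the document
tf_idf_scores = {term: count * smoothed_idf(term, reference_docs) for term, count in tf.items()}

# Terms ranked from most to least distinctive
ranked_terms = sorted(tf_idf_scores, key=tf_idf_scores.get, reverse=True)
print(ranked_terms)
```

Note that `TfidfVectorizer`, used in the cells of this section, additionally rescales each document vector to unit length by default (`norm='l2'`), so its values are not the raw products computed here.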

#### TF-IDF of Book reviews

We can now compute the TF-IDF values for each word within a particular category.

Each document can then be represented as a **vector** of TF-IDF values.

In [None]:
# Book review document 
book_doc = " ".join(book_reviews_prepro)
# Compute TF-IDF
result = vectorizer.transform([book_doc])
feature_names = tf_idf.get_feature_names_out()
tfidf = []
for col in result.nonzero()[1]:
    tfidf.append((feature_names[col],result[0, col]))
sorted_tfidf = sorted(tfidf, key=lambda x: x[1])
sorted_tfidf.reverse()
pprint(sorted_tfidf[0:15])

#### TF-IDF of DVD reviews

In [None]:
# DVD review document 
dvd_doc = " ".join(dvd_reviews_prepro)

# Compute TF-IDF
result = vectorizer.transform([dvd_doc])
feature_names = tf_idf.get_feature_names_out()
tfidf = []
for col in result.nonzero()[1]:
    tfidf.append((feature_names[col],result[0, col]))
sorted_tfidf = sorted(tfidf, key=lambda x: x[1])
sorted_tfidf.reverse()
pprint(sorted_tfidf[0:15])

#### TF-IDF of Electronics reviews

In [None]:
# Electronics review document 
elec_doc = " ".join(elec_reviews_prepro)

# Compute TF-IDF
result = vectorizer.transform([elec_doc])
feature_names = tf_idf.get_feature_names_out()
tfidf = []
for col in result.nonzero()[1]:
    tfidf.append((feature_names[col],result[0, col]))
sorted_tfidf = sorted(tfidf, key=lambda x: x[1])
sorted_tfidf.reverse()
pprint(sorted_tfidf[0:15])

#### TF-IDF of Kitchen appliance reviews

In [None]:
# kitchen appliance review document 
kitch_doc = " ".join(kitch_reviews_prepro)

# Compute TF-IDF
result = vectorizer.transform([kitch_doc])
feature_names = tf_idf.get_feature_names_out()
tfidf = []
for col in result.nonzero()[1]:
    tfidf.append((feature_names[col],result[0, col]))
sorted_tfidf = sorted(tfidf, key=lambda x: x[1])
sorted_tfidf.reverse()
pprint(sorted_tfidf[0:15])

TF-IDF allows us to represent each text document in a mathematical way that quantifies its content: each document becomes a vector of real-valued TF-IDF weights.

A TF-IDF representation can help with several problems:
* Document comparison for plagiarism
* Text classification
* Sentiment analysis

## 5. Text classification

**Objective:** Given a new review or comment as input, we would like to automatically detect the category it belongs to.

#### Input review

In [None]:
# Input review
input_review= """Yes. Wild things is what I recommend for our jaded eyes. 
Aren't we sick of all the crowd pleasing PG-13 shows which are neither sexy or 
action packed - most of all with hardly a plot? Wild Things is sex sex sex but 
with witty capital H humor and a twisted story"""

#### Preprocessing of the input

In [None]:
# Preprocessing of input
input_prepro = " ".join(preprocess_text(input_review))
pprint(input_prepro)

#### Computation of tf-idf representation of the text input

In [None]:
result = vectorizer.transform([input_prepro])
feature_names = tf_idf.get_feature_names_out()
tfidf = []
for col in result.nonzero()[1]:
    tfidf.append((feature_names[col],result[0, col]))
sorted_tfidf = sorted(tfidf, key=lambda x: x[1])
sorted_tfidf.reverse()
pprint(sorted_tfidf[0:100])

#### Computation of similarity between input and our categories

We now have a mathematical representation (a vector) for both our input text and the categories (see section 4).

In mathematics, the similarity between two vectors can be measured by the cosine of the angle between them. This measure is known as the **cosine similarity**.

Two vectors pointing in similar directions have a small angle between them, and thus a cosine value close to 1. On the other hand, two very different vectors have a large angle between them and thus a cosine value close to 0. Since TF-IDF weights are never negative, the cosine similarity between two TF-IDF vectors always lies between 0 and 1.
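
The measure itself is short to write down: the dot product of the two vectors divided by the product of their lengths. The sketch below implements it in plain Python on made-up vectors; the function name `cosine` and the three example vectors are assumptions for illustration, and the notebook itself relies on `cosine_similarity` from scikit-learn.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF-like vectors over a three-word vocabulary
review = [2.0, 1.0, 0.0]
similar_category = [4.0, 2.0, 0.0]    # same direction, different length
unrelated_category = [0.0, 0.0, 3.0]  # no word in common with the review

sim_similar = cosine(review, similar_category)
sim_unrelated = cosine(review, unrelated_category)
print(sim_similar, sim_unrelated)
```

The first value is (up to rounding) 1 because the two vectors point in the same direction, even though their lengths differ; the second is 0 because the vectors share no non-zero component.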

Let us compare our input with each category:

In [None]:
# TF-IDF vectors for each category
book_vector = vectorizer.transform([book_doc]) # Books
dvd_vector = vectorizer.transform([dvd_doc]) # DVD
elec_vector = vectorizer.transform([elec_doc]) # Electronics
kitch_vector = vectorizer.transform([kitch_doc]) # Kitchen Appliances

# TF-IDF vector of our input
input_vector = vectorizer.transform([input_prepro])

# Cosine similarities
print('similarity with Books: \t\t %f' % cosine_similarity(input_vector, book_vector)[0][0])
print('similarity with DVD: \t\t %f' % cosine_similarity(input_vector, dvd_vector)[0][0])
print('similarity with Elec: \t\t %f' % cosine_similarity(input_vector, elec_vector)[0][0])
print('similarity with Kitchen app: \t %f' % cosine_similarity(input_vector, kitch_vector)[0][0])


Let us try another one:

In [None]:
# Input review
input_review= """This is a real "in your face" drama that has been all but forgotten about.  Hopefully the rumors of the remake are true.  
A couple of key things without rehashing the plot....great dialogue, especially from Hal Holbrook and the other judges.  When they were looking to fill a vacancy in their ranks and a name is brought up, they disdainfully tore up the potential nominee...."he's a lightweight....I'm sure he's nice to his cocker-spanial, but that's just not good enough".  Great stuff.  And when Holbrook finally explains it all to Michael Douglas..."you are depressingly familiar".  I love that line.

Yes, there are some weak plot points in spots, but overall this movie presents complex issues without clear answers.  You have to ask yourself...what would you do?  The Doctor who's little boy was killed says it all....."You don't escape so easily".  That's what makes this so rivoting....no black and white. 

Go buy it....its time to get your fingernails dirty."""

# Preprocessing of input
input_prepro = " ".join(preprocess_text(input_review))

# TF-IDF vector of our input
input_vector = vectorizer.transform([input_prepro])

# Cosine similarities
print('similarity with Books: \t\t %f' % cosine_similarity(input_vector, book_vector)[0][0])
print('similarity with DVD: \t\t %f' % cosine_similarity(input_vector, dvd_vector)[0][0])
print('similarity with Elec: \t\t %f' % cosine_similarity(input_vector, elec_vector)[0][0])
print('similarity with Kitchen app: \t %f' % cosine_similarity(input_vector, kitch_vector)[0][0])
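
Reading the four similarities by eye can be replaced by picking the category with the highest value. The sketch below shows this decision rule on hand-made sparse vectors stored as dictionaries; `cosine_sparse`, `most_similar_category`, and the weights are all hypothetical and are not computed from the Amazon reviews.

```python
import math

def cosine_sparse(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

def most_similar_category(input_vec, category_vecs):
    """Return the name of the category with the highest cosine similarity to the input."""
    return max(category_vecs, key=lambda name: cosine_sparse(input_vec, category_vecs[name]))

# Hypothetical, hand-made weights over stemmed words
toy_categories = {
    "Books": {"book": 0.8, "read": 0.5, "stori": 0.3},
    "DVD": {"movi": 0.8, "film": 0.5, "stori": 0.3},
}
toy_input = {"movi": 0.6, "plot": 0.4, "stori": 0.2}

predicted = most_similar_category(toy_input, toy_categories)
print(predicted)
```

With the real data, the same rule would amount to taking the largest of the four values printed above.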

## 6. Sentiment analysis

**Objective:** Given a new review or comment as input, we would like to assess the sentiment associated with this review.

#### Computation of tf-idf representation of positive reviews

In [None]:
# Merge positive reviews together
with open('data/sorted_data_acl/books/positive_text.review', 'r', encoding='utf-8') as myfile:
    pos_book = myfile.read()
with open('data/sorted_data_acl/dvd/positive_text.review', 'r', encoding='utf-8') as myfile:
    pos_dvd = myfile.read()
with open('data/sorted_data_acl/electronics/positive_text.review', 'r', encoding='utf-8') as myfile:
    pos_elec = myfile.read()
with open('data/sorted_data_acl/kitchen_&_housewares/positive_text.review', 'r', encoding='utf-8') as myfile:
    pos_kitch = myfile.read()
pos_reviews = pos_book + pos_dvd + pos_elec + pos_kitch

# pre-processing of positive reviews
pos_reviews_prepro = preprocess_text(pos_reviews.replace("'",' '))
pos_doc = " ".join(pos_reviews_prepro)

# TF-IDF vector of positive reviews (all categories together)
pos_vector = vectorizer.transform([pos_doc])

#### Computation of tf-idf representation of negative reviews

In [None]:
# Merge negative reviews together
with open('data/sorted_data_acl/books/negative_text.review', 'r', encoding='utf-8') as myfile:
    neg_book = myfile.read()
with open('data/sorted_data_acl/dvd/negative_text.review', 'r', encoding='utf-8') as myfile:
    neg_dvd = myfile.read()
with open('data/sorted_data_acl/electronics/negative_text.review', 'r', encoding='utf-8') as myfile:
    neg_elec = myfile.read()
with open('data/sorted_data_acl/kitchen_&_housewares/negative_text.review', 'r', encoding='utf-8') as myfile:
    neg_kitch = myfile.read()
neg_reviews = neg_book + neg_dvd + neg_elec + neg_kitch

# pre-processing of negative reviews
neg_reviews_prepro = preprocess_text(neg_reviews.replace("'",' '))
neg_doc = " ".join(neg_reviews_prepro)

# TF-IDF vector of negative reviews (all categories together)
neg_vector = vectorizer.transform([neg_doc])

#### Input review

In [None]:
# Input review
input_review= """Interference from other electronics is a severe problem - 
I had to return this item for a refund.  If you can locate it several feet from any other 
electronics, it might work for you, but who wants a phone that you cannot place on your desktop, near a computer"""

#### Preprocessing of the input

In [None]:
# Preprocessing of input
input_prepro = " ".join(preprocess_text(input_review))
pprint(input_prepro)

#### Computation of tf-idf representation of the input review


In [None]:
# TF-IDF vector of our input
input_vector = vectorizer.transform([input_prepro])

#### Computation of similarity between input and positive/negative reviews


In [None]:
# Cosine similarities
print('similarity with Positive: \t\t %f' % cosine_similarity(input_vector, pos_vector)[0][0])
print('similarity with Negative: \t\t %f' % cosine_similarity(input_vector, neg_vector)[0][0])