#  REI Option Project Preprocessing Data
## Sat Jan  7 19:05:04 CET 2017

## Document classification

Text categorization is one of the most active research areas in NLP. It has a variety of real-world applications such as sentiment analysis, opinion mining, email filtering, etc. Given the current data overflow, especially of textual type, the needs for efficient automated text classification solutions have become more pressing than ever.

The most common pipeline for text classification is the vector space representation followed by TF- IDF term weighting. With this approach, each document is viewed as a point in a sparse space where each dimension (or feature) is a unique term in the corpus. The set of unique terms is called the vocabulary. If we assume that there are m documents $\{d_1,\dots,d_m\}$ in the collection and n unique terms $T = \{t_1,\dots,t_n\}$ in the corpus ($T$ being the vocabulary), each document can be represented as a vector of n term weights (i.e., the weights are the coordinates of the document in vocabulary space). A classifier is then trained on the available documents (i.e., on a document-term matrix of dimension $m*n$) and subsequently used for classifying new ones. In this project, we consider the classifier as a black-box, and focus only on improving categorization performance by coming up with better term weights. We will compare two ways of computing these weights.

In [788]:
# !/usr/bin/env python
# -*- coding: utf-8 -*-
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from utils import *
import glob
import os
import nltk
from nltk.stem.porter import *
import string
import unicodedata
import pandas as pd

## 1 Load data

Initially, we should load the data that are contained in the *aclImdb* directory. We will work on a standard, publicly available dataset: aclImdb. This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. 

In [789]:
# change this address to your own.
# dataset_path = '/Users/zacharie/Documents/AIC/REI(Option)/aclImdb/'

In [790]:
def load_aclImdb_data(path_train, path_test):
    sentences_train = []
    train_y = []
    currdir = os.getcwd()
    os.chdir('%s/pos/' % path_train)
    for ff in glob.glob("*.txt"):
       	with open(ff, 'r') as f:
	    sentences_train.append(f.readline().strip())
            train_y.append(1)
    os.chdir('%s/neg/' % path_train)
    for ff in glob.glob("*.txt"):
       	with open(ff, 'r') as f:
            sentences_train.append(f.readline().strip())
            train_y.append(0)
    os.chdir(currdir)
    sentences_test = []
    test_y = []
    currdir = os.getcwd()
    os.chdir('%s/pos/' % path_test)
    for ff in glob.glob("*.txt"):
        with open(ff, 'r') as f:
            sentences_test.append(f.readline().strip())
            test_y.append(1)
    os.chdir('%s/neg/' % path_test)
    for ff in glob.glob("*.txt"):
        with open(ff, 'r') as f:
            sentences_test.append(f.readline().strip())
            test_y.append(0)
    os.chdir(currdir)
    return sentences_train, sentences_test, train_y, test_y


In [791]:
def load_mrd_data():
    from sklearn.model_selection import train_test_split
    import io
    sentences_pos = []
    ff = "../rt-polaritydata/rt_polarity.pos.utf8.txt"
    with io.open(ff, 'r', encoding='utf8') as f:
        for line in f:
            sentences_pos.append(line)
    sentences_neg = []
    ff = "../rt-polaritydata/rt_polarity.neg.utf8.txt"
    with io.open(ff, 'r', encoding='utf8') as f:
        for line in f:
            sentences_neg.append(line)
    terms_by_doc_train, terms_by_doc_test, terms_by_label_train, terms_by_label_test = train_test_split(\
        sentences_pos+sentences_neg, [1]*len(sentences_pos)+[0]*len(sentences_neg), test_size=0.4, random_state=58)
    return terms_by_doc_train, terms_by_doc_test, terms_by_label_train, terms_by_label_test

In [792]:
# terms_by_doc_train, terms_by_doc_test, terms_by_label_train, terms_by_label_test = load_mrd_data()

In [793]:
# sentences_train, sentences_test, train_y, test_y =  load_aclImdb_data(os.path.join(dataset_path, 'train'), os.path.join(dataset_path, 'test'))

## 2 Data preprocessing

Before applying any learning algorithm to the data, it is necessary to apply some preprocessing tasks as shown below:

1)  Remove punctuation marks (e.g, . ~ , ~ ? ~ :~ ( ~) ~ [ ~]) and transform all characters to lowercase. This can be done using Python's *NLTK* library (http://www.nltk.org/).

2)  Remove stop words. These are words that are filtered out before processing any natural language data. This set of words does not offer information about the content of the document and typically corresponds to a set of commonly used words in any language. For example, in the context of a search engine, suppose that your search query is "how to categorize documents". If the search engine tries to find web pages that contain the terms "how", "to", "categorize", "documents", it will find many more pages that contain the terms "how" and "to" than pages that contain information about document categorization. This is happening because the terms "how" and "to" are  commonly used in the English language.  In the code, nltk.corpus.stopwords.words() can return one list of stop words. We used them to clear data.

3)  Stemming all words (see Wikipedia's annotation for *Stemming*: http://en.wikipedia.org/wiki/Stemming, i.e., the process of reducing the words to their word stem or root. For example, a stemming algorithm reduces the words "fishing", "fished", and "fisher" to the root word, "fish". We used *Porter's* stemmer, contained also in the NLTK library.

4)  Store our preprocessed data to *./dumps/stemmered\_train.csv* and *./dumps/stemmered\_test.csv*

In [794]:
def stemmering_sentences(sentences_train, sentences_test, train_y, test_y):
    # Remove punctuation, stopword and then stemmering
    punctuation=set(string.punctuation)
    stemmer=PorterStemmer()
    for i in range(len(sentences_train)):
        tmp = sentences_train[i]
        tmp = unicode(tmp, errors='ignore')
        doc=[stemmer.stem(word) for word in nltk.word_tokenize(tmp) if (word not in punctuation) and (word not in nltk.corpus.stopwords.words('english'))]
        sentences_train[i]=doc
    df_train = {'features':sentences_train, 'labels':train_y}
    df_train = pd.DataFrame(df_train)
    df_train.to_csv('dumps/stemmered_train_mrd.csv', encoding='utf8', index=False)
    print "writing stemmered sentences_train in ./dumps/stemmered_train.csv file is done."

    for i in range(len(sentences_test)):
        tmp = sentences_test[i]
        tmp = unicode(tmp, errors='ignore')
        doc=[stemmer.stem(word) for word in nltk.word_tokenize(tmp) if (word not in punctuation) and (word not in nltk.corpus.stopwords.words('english'))]
        sentences_test[i]=doc
    df_test = {'features': sentences_test, 'labels':test_y}
    df_test = pd.DataFrame(df_test)
    df_test.to_csv('dumps/stemmered_test_mrd.csv', encoding='utf8', index=False)
    print "writing stemmered sentences_test in dumps/stemmered_test.csv is done."


In [795]:
# stemmering_sentences(sentences_train, sentences_test, train_y, test_y)

In [796]:
def stemmering_sentences_mrd(sentences_train, sentences_test, train_y, test_y):
    new_train_X = []
    new_train_y = []
    # Remove punctuation, stopword and then stemmering
    punctuation=set(string.punctuation)
    stemmer=PorterStemmer()
    for i in range(len(sentences_train)):
        tmp = sentences_train[i]
        # tmp = unicode(tmp, errors='ignore')
        doc=[stemmer.stem(word) for word in nltk.word_tokenize(tmp) if (word not in punctuation) and (word not in nltk.corpus.stopwords.words('english'))]
        if len(doc)>=5:
            new_train_X.append(doc)
            new_train_y.append(train_y[i])
    df_train = {'features':new_train_X, 'labels':new_train_y}
    df_train = pd.DataFrame(df_train)
    df_train.to_csv('dumps/stemmered_train_mrd.csv', encoding='utf8', index=False)
    print "writing stemmered sentences_train in ./dumps/stemmered_train_mrd.csv file is done."

    new_test_X = []
    new_test_y =[]
    for i in range(len(sentences_test)):
        tmp = sentences_test[i]
        # tmp = unicode(tmp, errors='ignore')
        doc=[stemmer.stem(word) for word in nltk.word_tokenize(tmp) if (word not in punctuation) and (word not in nltk.corpus.stopwords.words('english'))]
        if len(doc)>=5:
            new_test_X.append(doc)
            new_test_y.append(test_y[i])
    df_test = {'features': new_test_X, 'labels':new_test_y}
    df_test = pd.DataFrame(df_test)
    df_test.to_csv('dumps/stemmered_test_mrd.csv', encoding='utf8', index=False)
    print "writing stemmered sentences_test in dumps/stemmered_test_mrd.csv is done."


In [797]:
# stemmering_sentences_mrd(terms_by_doc_train, terms_by_doc_test, terms_by_label_train, terms_by_label_test)


## 3 Reload clean data to pandas DataFrame. 

In [798]:
# reload necessary data set to pandas DataFrame.
train_df = pd.read_csv('dumps/stemmered_train_mrd.csv', encoding='utf8')
test_df = pd.read_csv('dumps/stemmered_test_mrd.csv', encoding='utf8')

In [799]:
print "Storing terms from training documents as list of lists"
terms_by_doc_train = [document.rstrip(']"').lstrip('"[').split(", ") for document in train_df.ix[:,0]]
terms_by_label_train = train_df.ix[:, 1]
n_terms_per_doc = [len(terms) for terms in terms_by_doc_train]
print "min, max and average number of terms per document:", min(n_terms_per_doc), max(n_terms_per_doc), sum(n_terms_per_doc)/len(n_terms_per_doc)
# print terms_by_doc_train[0][8]

Storing terms from training documents as list of lists
min, max and average number of terms per document: 5 39 11


In [800]:
print "Storing terms from test documents as list of lists"
terms_by_doc_test = [document.rstrip(']"').lstrip('"[').split(", ") for document in test_df.ix[:,0]]
terms_by_label_test = test_df.ix[:, 1]
n_terms_per_doc = [len(terms) for terms in terms_by_doc_test]
print "min, max and average number of terms per document:", min(n_terms_per_doc), max(n_terms_per_doc), sum(n_terms_per_doc)/len(n_terms_per_doc)
# print terms_by_doc_test[0][0]

Storing terms from test documents as list of lists
min, max and average number of terms per document: 5 32 11


In [801]:
# from sklearn.model_selection import train_test_split

# terms_by_doc_train, _, terms_by_label_train, _ = train_test_split(terms_by_doc_train, terms_by_label_train, test_size=0.7, random_state=42)

# terms_by_doc_test, _, terms_by_label_test, _ = train_test_split(terms_by_doc_test, terms_by_label_test, test_size=0.97, random_state=42)

In [802]:
# Store all terms in list
all_terms = [terms for sublist in terms_by_doc_train for terms in sublist]
# Compute average number of terms
avg_len = sum(n_terms_per_doc)/len(n_terms_per_doc)
print "the average number of terms:", avg_len
# Find unique terms
all_unique_terms = list(set(all_terms))
print "the number of unique terms:", len(all_unique_terms)

the average number of terms: 11
the number of unique terms: 11403


## 4 TF-IDF and TW-IDF Representations
The text data (i.e., all the possible documents-reviews) should be transformed to a format that will be used in the learning (i.e., classification) task. As we described above, the data will be represented by the Document-Term matrix, where the rows correspond to the different documents of the collection (i.e., reviews, comments or abstracts) and the columns to the features, which in our case are the different terms (i.e., words). Here, we are interested to find relevant weighting criteria for the Document-Term matrix, i.e., assign a relevance score $w_{ij}$ to each term $t_j$ for each document $d_i$. In our project, we are going to consider to different score functions, the TF-IDF and the TW-IDF.

### 4.1 TF-IDF

A given document $d$ loads on each dimension of the feature space (each term $t$) according to the following formula, known as the pivoted normalization weighting:
$$tf-idf(t,d)= \frac{1+\ln⁡(1+\ln(tf(t,d))}{1-b+b \times \frac{|d|}{avgdl}} \times idf(t,D)$$

where

1. tf(t,d) is the number of times term t appear in document $d$, 
2. $|d|$ is the length of document d, 
3. avgdl is the average document length across the corpus, 
4. $b=0.08$ is a constant predefined parameter, and
5. $idf(t,D)=log((m+1)/df(t))$, with $df(t)$ the number of documents in which the term $t$ appears and $m$ is the number of all documents in collection.

We can conclude intuitively that frequent words in a document are representative of that document as long as they are not also very frequent at the corpus level. Note that for all the terms that do not appear in d, the weights are null. Since a given document contains only a small fraction of the vocabulary, most of its coordinates in the vector space are null, leading to a very sparse representation, which is a well-known limitation of the vector space model that motivated (among other things) word embeddings(i.e., word2vect).

### 4.2 TF-IWF

We propose to leverage the graph-of-words representation of a document to derive a new scoring
function:
$$tw-idf(t,d)= \frac{tw(t,d)}{1-b+b \times \frac{|d|}{avgdl}} \times idf(t,D)$$
where, $tw(t,d)$ is some graph-of-words based score for term $t$ (for the graph-of-words corresponding to document $d$), and $b=0.08$ (the remainder of the equation is the same as for TF-IDF).

Various node centrality criteria are intuitively good candidates for tw(t,d):
1. Normalized degree centrality:
$$degree(node)=  \frac{|neighbors(node)|}{|\text{vertices in graph}|-1}$$
Note that in its weighted version, degree centrality sums up the weights of the edges incident to the node instead of simply counting the number of incident edges. Keep in mind that both igraph implementations are not normalized.
2. Closeness centrality:
$$closeness(node)=  \frac{|\text{vertices in graph}|-1}{\sum_{\text{node}_i \in graph} dist(node,\text{node}_i)}$$
Closeness centrality is defined as the inverse of the average shortest path distance from the considered node to the other nodes in the graph. As opposed to degree centrality, closeness is a global metric, in that it aggregates information from the entire graph. Note that the igraph implementation has an argument for normalization. If set to True, the function computes exactly the quantity above.


Create a TF-IDF and TW-IDF representation for each document in the training set. For TW- IDF, build a graph-of-words for each document (window size of 3) and consider both the weighted and unweighted versions of degree and closeness.


In [803]:
# Create IDF dictionary

import math
# Number of all documents in collection
n_doc = len(terms_by_doc_train)
# Store DF values in dictionary
# Number of documents in which each term appears is the value of this DF dictionary.
df = dict(zip(all_unique_terms, [0]*len(all_unique_terms)))
for document in terms_by_doc_train:
    unique_words = list(set(document))
    for word in unique_words:
        df[word] += 1
# Store IDF values in dictionary idf(d, t) = log [ (1 + n) / 1 + df(d, t) ] + 1.
idf = dict()
for element in df.keys():
    # idf[element] = math.log((float(n_doc)+1)/(1 +df[element])) + 1
    idf[element] = math.log10((float(n_doc)+1)/df[element])

### 4.2.1 Graph-of-words 

Graph representations of textual documents have been proposed for more than a decade. Unlike earlier approaches assuming term independence, such as the bag- of-words, graphs-of-words offer an information-rich way of encoding text, by capturing for instance term dependence and term order. More precisely, a graph-of-words is a graph whose vertices represent unique terms in the document and whose edges capture some meaningful syntactic (grammar), semantic (synonymy), or statistical similarity between terms.

In this project, we assume that two vertices are linked by an edge if and only if the two words they represent co-occur in text within a sliding window of predetermined fixed size. This is a statistical approach, as it links all co-occurring terms without considering their meaning or function. The underlying assumption (similar to the Markov assumption in time series) is that dependence only exists between words found close to each other. Edges may be assigned integer weights based on the number of co-occurrences, as shown in Figure 1. Similarly, edges can be directed to encode term order, forward edges matching the natural flow of the text.

![Caption.](figures/fig1.png)
**Figure 1.** Graph-of-words with main core. Node color indicates the highest core a vertex belongs to (i.e., its core number), from 2-core (white) to 6-core (black).

The graph was built on document #1938 from the Hulth (2003) data set with **3** words in same window:

A **method** for **solution** of **systems** of **linear** **algebraic** **equations** with **m-dimensional** **lambda** **matrices** A **system** of **linear** **algebraic** **equations** with **m-dimensional** **lambda** **matrices** is considered. The **proposed** **method** of searching for the **solution** of this **system** lies in reducing it to a **numerical** **system** of a **special** **kind**.

Note: an interactive web application illustrating graphs-of-words can be found in  https://safetyapp.shinyapps.io/GoWvis/, and is very useful to develop a good intuition for the concept.

### 4.2.2 K-core

The k-core of a graph corresponds to the **maximal connected subgraph** whose vertices are at least of **degree k** within the subgraph.

The core number of a vertex is the highest order of a core that contains this vertex.

The core of maximum order is called the main core.

It corresponds to a fast and good (although not perfect) approximation of the density, or most cohesive connected component of the graph. The set of all the k-cores of a graph (from the 0-core to the main core) forms what is called the k-core decomposition of a graph. The k-core decomposition of a weighted graph can be computed in linearithmic time and linear space using a min-oriented binary heap to retrieve the vertex of lowest degree at each iteration (n in total). In the unweighted case, the algorithm is linear in time.

**Usefule note**: Use the *core_dec* function (found in *library.py*). This function returns the main core of the graph (subgraph, igraph obect). For instance, 
core_dec(graph, weighted = True)["main_core"]
returns the subgraph corresponds to the main core of the graph. The names of the vertices of this subgraph can be obtained as a list via executing the following command:
core_dec(graph, weighted = True)["main_core"].vs["name"]


Create a graph-of-words for each training document by using the *terms_to_graph()* function provided in the *library.py* file. This function takes as input a list of ordered terms and the size of sliding window and returns a graph where its edges are weighted based on term co-occurence within a fixed-size sliding window. Because graph-of-words was consistently reported to give superior results, we used a sliding window of size 4. 
that means 4 in the *terms_to_graph()* function. Loop over all training documents (list of lists of terms) and store the graphs in a list called *all_graphs*.

In [804]:
from library import *
print "Creating a graph-of-words for each training document. \n"
# 4 means that there are merely 3 words in one sliding window. 
window = 4
all_graphs = []
for terms in terms_by_doc_train:
    all_graphs.append(terms_to_graph(terms, window))

Creating a graph-of-words for each training document. 



In [805]:
# assert checks (should return True)
# print  len(terms_by_doc_train)==len(all_graphs)
# print  len(set(terms_by_doc_train[0]))==len(all_graphs[0].vs)


### 4.2.2 K-core

The k-core of a graph corresponds to the **maximal connected subgraph** whose vertices are at least of **degree k** within the subgraph.

The core number of a vertex is the highest order of a core that contains this vertex.

The core of maximum order is called the main core.

It corresponds to a fast and good (although not perfect) approximation of the density, or most cohesive connected component of the graph. The set of all the k-cores of a graph (from the 0-core to the main core) forms what is called the k-core decomposition of a graph. The k-core decomposition of a weighted graph can be computed in linearithmic time and linear space using a min-oriented binary heap to retrieve the vertex of lowest degree at each iteration (n in total). In the unweighted case, the algorithm is linear in time.

**Usefule note**: Use the *core_dec* function (found in *library.py*). This function returns the main core of the graph (subgraph, igraph obect). For instance, 
core_dec(graph, weighted = True)["main_core"]
returns the subgraph corresponds to the main core of the graph. The names of the vertices of this subgraph can be obtained as a list via executing the following command:
core_dec(graph, weighted = True)["main_core"].vs["name"]
Loop over "all_graphs" and store the main cores in the "mcs_weighted" and "mcs_unweighted" lists. In this case, you will end up with two lists of lists.

In [806]:
print "computing vector representations of each training document"
# reminder of formula
b = 0.08
features_degree = []
features_w_degree = []
features_closeness = []
features_w_closeness = []
features_tfidf = []
len_all = len(all_unique_terms)

counter = 0

idf_keys = idf.keys()

# print len(all_graphs)
# print len(terms_by_doc_train)
for i in xrange(len(all_graphs)):
    graph = all_graphs[i]
    terms_in_doc = terms_by_doc_train[i]
    doc_len = len(terms_in_doc)
    # Returns zip (node name, degree, weighted degree, closeness, weighted closeness)
    my_metrics = compute_node_centrality(graph)

    feature_row_degree = [0]*len_all
    feature_row_w_degree = [0]*len_all
    feature_row_closeness = [0]*len_all
    feature_row_w_closeness = [0]*len_all
    feature_row_tfidf = [0]*len_all

    for term in list(set(terms_in_doc)):
        # term here is unique word in specific document terms_by_doc_train[i]
        index = all_unique_terms.index(term)
        idf_term = idf[term]
        denominator = (1-b+(b*(float(doc_len)/avg_len)))
        # find current node's degree, w_degree, closeness and w_closeness
        metrics_term = [tuple[1:5] for tuple in my_metrics if tuple[0]==term][0]

        # store TW-IDF values
        feature_row_degree[index] = (float(metrics_term[0])/denominator) * idf_term
        feature_row_w_degree[index] = (float(metrics_term[1])/denominator) * idf_term
        feature_row_closeness[index] = (float(metrics_term[2])/denominator) * idf_term
        feature_row_w_closeness[index] = (float(metrics_term[3])/denominator) * idf_term

        # number of occurences of word in document, this can be also calculated when we count IDF.
        tf = terms_in_doc.count(term)

        # store TF-IDF value
        feature_row_tfidf[index] = ((1+math.log1p(1+math.log1p(tf)))/(1-b+(b*(float(doc_len)/avg_len)))) * idf_term
        # feature_row_tfidf[index] = ((1+math.log1p(math.log1p(tf)))/(1-b+(b*(float(doc_len)/avg_len)))) * idf_term
    features_degree.append(feature_row_degree)
    features_w_degree.append(feature_row_w_degree)
    features_closeness.append(feature_row_closeness)
    features_w_closeness.append(feature_row_w_closeness)
    features_tfidf.append(feature_row_tfidf)

    counter += 1
    if counter % 400 == 0:
        print counter, "documents have been processed"

computing vector representations of each training document
400 documents have been processed
800 documents have been processed
1200 documents have been processed
1600 documents have been processed
2000 documents have been processed
2400 documents have been processed
2800 documents have been processed
3200 documents have been processed
3600 documents have been processed
4000 documents have been processed
4400 documents have been processed
4800 documents have been processed
5200 documents have been processed
5600 documents have been processed


In [807]:
import numpy as np 
# Convert list of lists into array
# Documents as rows, unique words as columns (i.e., document-term matrix)
training_set_tfidf = np.array(features_tfidf)
training_set_degree = np.array(features_degree)
training_set_w_degree = np.array(features_w_degree)
training_set_closeness = np.array(features_closeness)
training_set_w_closeness = np.array(features_w_closeness)

In [808]:
from sklearn import svm
# Initialize basic SVM
classifier_tfidf = svm.LinearSVC()
classifier_degree = svm.LinearSVC()
classifier_w_degree = svm.LinearSVC()
classifier_closeness = svm.LinearSVC()
classifier_w_closeness = svm.LinearSVC()

In [809]:
classifier_tfidf.fit(training_set_tfidf, terms_by_label_train)
classifier_degree.fit(training_set_degree, terms_by_label_train)
classifier_w_degree.fit(training_set_w_degree, terms_by_label_train)
classifier_closeness.fit(training_set_closeness, terms_by_label_train)
classifier_w_closeness.fit(training_set_w_closeness, terms_by_label_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [810]:
print "Creating a graph-of-words for each testing document \n"
window = 4
all_graphs_test = []
for terms in terms_by_doc_test:
    all_graphs_test.append(terms_to_graph(terms,window))
# sanity checks (should return True)
# print len(terms_by_doc_test)==len(all_graphs_test)
# print len(set(terms_by_doc_test[0]))==len(all_graphs_test[0].vs)

Creating a graph-of-words for each testing document 



In [811]:
print "computing vector representations of each testing document"
# each testing document is represented in the training space only
features_degree_test = []
features_w_degree_test = []
features_closeness_test = []
features_w_closeness_test = []
features_tfidf_test = []
counter = 0

for i in xrange(len(all_graphs_test)):
    graph = all_graphs_test[i]
    # retain only the terms originally present in the training test
    terms_in_doc = [term for term in terms_by_doc_test[i] if term in all_unique_terms]
    doc_len = len(terms_in_doc)
    
    # returns node (1) name, (2) degree, (3) weighted degree, (4) closeness, (5) weighted closeness
    my_metrics = compute_node_centrality(graph)
    
    feature_row_degree_test = [0]*len_all
    feature_row_w_degree_test = [0]*len_all
    feature_row_closeness_test = [0]*len_all
    feature_row_w_closeness_test = [0]*len_all
    feature_row_tfidf_test = [0]*len_all
    
    for term in list(set(terms_in_doc)):
        index = all_unique_terms.index(term)
        # if this term in test data set has never been in training set, we would return 0 as IDF value.
        # idf_term = idf.get(term, default=0)
        idf_term = idf[term]
        
        denominator = (1-b+(b*(float(doc_len)/avg_len)))
        metrics_term = [tuple[1:5] for tuple in my_metrics if tuple[0]==term][0]
        
        # store TW-IDF values
        feature_row_degree_test[index] = (float(metrics_term[0])/denominator) * idf_term
        feature_row_w_degree_test[index] = (float(metrics_term[1])/denominator) * idf_term
        feature_row_closeness_test[index] = (float(metrics_term[2])/denominator) * idf_term
        feature_row_w_closeness_test[index] = (float(metrics_term[3])/denominator) * idf_term
        
        # number of occurences of word in document
        tf = terms_in_doc.count(term)
        
        # store TF-IDF value
        feature_row_tfidf_test[index] = ((1+math.log1p(1+math.log1p(tf)))/(1-b+(b*(float(doc_len)/avg_len)))) * idf_term
        # feature_row_tfidf_test[index] = ((1+math.log1p(math.log1p(tf)))/(1-b+(b*(float(doc_len)/avg_len)))) * idf_term
    
    features_degree_test.append(feature_row_degree_test)
    features_w_degree_test.append(feature_row_w_degree_test)
    features_closeness_test.append(feature_row_closeness_test)
    features_w_closeness_test.append(feature_row_w_closeness_test)
    features_tfidf_test.append(feature_row_tfidf_test)

    counter += 1
    if counter % 400 == 0:
        print counter, "documents have been processed"


computing vector representations of each testing document
400 documents have been processed
800 documents have been processed
1200 documents have been processed
1600 documents have been processed
2000 documents have been processed
2400 documents have been processed
2800 documents have been processed
3200 documents have been processed
3600 documents have been processed


In [812]:
# Convert list of lists into array
# Documents as rows, unique words as columns (i.e., document-term matrix)

testing_set_degree = np.array(features_degree_test)
testing_set_w_degree = np.array(features_w_degree_test)
testing_set_closeness = np.array(features_closeness_test)
testing_set_w_closeness = np.array(features_w_closeness_test)
testing_set_tfidf = np.array(features_tfidf_test)


In [813]:
# Predictions
predictions_tfidf = classifier_tfidf.predict(testing_set_tfidf)
predictions_degree = classifier_degree.predict(testing_set_degree)
predictions_w_degree = classifier_w_degree.predict(testing_set_w_degree)
predictions_closeness = classifier_closeness.predict(testing_set_closeness)
predictions_w_closeness = classifier_w_closeness.predict(testing_set_w_closeness)


In [814]:
from sklearn import metrics

print "Accuracy TF-IDF:", metrics.accuracy_score(terms_by_label_test,predictions_tfidf)

print "Accuracy TW-IDW degree:", metrics.accuracy_score(terms_by_label_test,predictions_degree)

print "Accuracy TW-IDW weighted degree:", metrics.accuracy_score(terms_by_label_test,predictions_w_degree)

print "Accuracy TW-IDW closeness:", metrics.accuracy_score(terms_by_label_test,predictions_closeness)

print "Accuracy TW-IDW weighted closeness:", metrics.accuracy_score(terms_by_label_test,predictions_w_closeness)

Accuracy TF-IDF: 0.687838884586
Accuracy TW-IDW degree: 0.69584301575
Accuracy TW-IDW weighted degree: 0.692744642396
Accuracy TW-IDW closeness: 0.698424993545
Accuracy TW-IDW weighted closeness: 0.70100697134


In [815]:
hasher = HashingVectorizer(n_features=2**18, stop_words='english', non_negative=True, norm=None, binary=False)
vectorizer = Pipeline([('hasher', hasher), ('tf_idf', TfidfTransformer(smooth_idf=False))])
    # vectorizer = Pipeline([('hasher', hasher)])
X_train = vectorizer.fit_transform([' '.join(term) for term in terms_by_doc_train])
X_test = vectorizer.fit_transform([' '.join(term) for term in terms_by_doc_test])


In [816]:
from sklearn import metrics
classifier_original_tfidf = svm.LinearSVC()
classifier_original_tfidf.fit(X_train, terms_by_label_train)
predictions_original_tfidf = classifier_original_tfidf.predict(X_test)
print "Accuracy TF-IDF:", metrics.accuracy_score(terms_by_label_test, predictions_original_tfidf)

Accuracy TF-IDF: 0.718822618125
