# NLP and Information Retrieval with Julia
______________________________________________________________

![pipes](images/text_classification_workflow.png)


## Setup

The data source is usually a document database such as MongoDB. I started a client against a local Mongo service and loaded the data from a JSON document that mimics the kind of document one might receive in a request body. I won't cover how to do that in this tutorial, so let's assume the documents are already loaded in a local Mongo service.



### Loading Data from Mongo

For my pipeline, I want the data in a Julia table. I want to pipe the data from my database into flat text files (.txt) located in a directory called data. To do this I call a Python script, aptly named load_data.py. I wrote it in Python because I already had pymongo installed, and the main goal here is to perform common NLP tasks with Julia. The number of files written will be 0 if the files already exist locally.

In [1]:
run(`python src/load_data.py`)

Number of files written:  0


Process(`python src/load_data.py`, ProcessExited(0))

## Text Processing Pipeline

The goal is to build a basic text processing pipeline involving tokenization, stop word removal and stemming. Ultimately what we want is a sparse representation of the data where each row is a document and each column is a unique term, such as a unigram, bigram or trigram. The values are generated by a vectorization method that assigns each term in a document a weight that grows with its frequency in that document but shrinks with the number of documents in which it occurs.

### Load data using TextAnalysis and Glob

In [2]:
using TextAnalysis
using Glob

Create an array of filenames

In [3]:
fnames = glob("data/*.txt");

Read an example file

In [4]:
# example file
readlines(fnames[1])

42-element Array{String,1}:
 ""                                                                                                                              
 ""                                                                                                                              
 ""                                                                                                                              
 ""                                                                                                                              
 ""                                                                                                                              
 ""                                                                                                                              
 ""                                                                                                                              
 ""                                                           

We can also map the FileDocument constructor over the filenames in fnames. This produces an array of FileDocuments.

In [5]:
fds = map(FileDocument, fnames);

We can read a file using the text function.

In [6]:
text(fds[1])

"\n\n\n\n\n\n\n\n\n\n\nHey, the man on the phone said. Are you still coming tonight?        \n \n\nIt took a moment for me to realize that he was calling from Distilled to confirm my dinner reservation.        \nYes, I replied. Cool, he said, and sounded as if he meant it.        \nDistilled opened in June on the corner of Franklin Street and West Broadway in TriBeCa, the former home of Drew Nieporents Layla and Centrico. The belly dancers and the frozen-margarita machine are gone, but a certain effervescence remains. So does Mr. Nieporent, hovering in the background as guru to Distilleds owners, the first-time restaurateur Nick Iovacchini and Shane Lyons, the 25-year-old chef.        \nThe space is blandly handsome, with dark woods and charcoal banquettes, breathlessly high ceilings and quasi-medieval wheel chandeliers like crowns of fire. One side is devoted to the bar, where the drinks, by Benjamin Wood, are lady-killers, elegant with a knife twist. Occasionally 1980s mope rock shim

Another way to do this would be to use the core Julia functions to load text. We can push strings into an iterable data structure.

In [7]:
slist = String[]
for fname in fnames
    s = open(fname) do file
        # read the contents of a file all at once
        read(file, String)
    end
    push!(slist, s)
end

Metadata for a document can be accessed as a property of the FileDocument instance.

In [8]:
a = fds[1];
a.metadata

TextAnalysis.DocumentMetadata(Languages.English(), "data/5233240838f0d8062fddf624.txt", "Unknown Author", "Unknown Time")

### Tokenization and Stop Words

Next we will remove stop words and tokenize the document. Tokens are the individual words of a text, obtained in the simplest case by splitting on whitespace. Stop words are high-frequency words that we want to filter out. These words often carry little lexical meaning and don't help distinguish one text from another. Below I've created my_prepare, which takes care of the preparation tasks: stripping punctuation, articles, pronouns, numbers and non-letters, removing stop words, and stemming the document, which removes morphological affixes from the words and leaves only the stem.
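
To make the two ideas concrete, here is a minimal sketch of whitespace tokenization and stop word filtering in plain Julia, without TextAnalysis. The names `toy_stopwords`, `sentence`, `toy_tokens` and `kept` are made up for illustration, and the tiny stop word list is an assumption, far smaller than the one returned by stopwords(Languages.English()). Stemming is not covered by this sketch.

```julia
# Toy stop word list (an assumption for illustration only).
toy_stopwords = Set(["the", "is", "on", "a", "of"])

sentence = "The cat is on the mat"

# Lowercase first so that "The" matches the lowercase stop word "the",
# then split on whitespace to get the tokens.
toy_tokens = String.(split(lowercase(sentence)))

# Keep only the tokens that are not in the stop word set.
kept = filter(t -> !(t in toy_stopwords), toy_tokens)
# kept == ["cat", "mat"]
```

Lowercasing before filtering matters here because the set lookup is case-sensitive.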

In [9]:
using WordTokenizers
using Languages

set_tokenizer(WordTokenizers.nltk_word_tokenize)

STOPWORDS = stopwords(Languages.English());

In [10]:
"""
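    my_prepare(text)

Return the prepared version of `text` as a string: stripped of punctuation,
articles, pronouns, numbers and non-letters, with stop words removed,
stemmed and lowercased.
"""
function my_prepare(text)
    sd = StringDocument(text)
    # strip punctuation, articles, pronouns, numbers and non-letter characters
    prepare!(sd, strip_punctuation
        | strip_articles
        | strip_pronouns
        | strip_numbers
        | strip_non_letters)
    remove_words!(sd, STOPWORDS)
    # reduce each word to its stem
    stem!(sd)
    # Lowercasing runs last, so sentence-initial words such as "The" or "It"
    # still carry a capital letter when the stop words are removed, and they
    # show up in the output below.
    remove_case!(sd)
    return sd.text
end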

my_prepare

In [11]:
my_prepare(text(fds[1]))

"hey phone are come tonight it moment realiz call distil confirm dinner reserv yes repli cool sound meant distil june corner franklin street west broadway tribeca former home drew niepor layla centrico the belli dancer frozen margarita machin gone effervesc remain so mr niepor hover background guru distil owner time restaurateur nick iovacchini shane lyon chef the space bland handsom dark wood charcoal banquett breathless ceil quasi mediev wheel chandeli crown fire one devot bar drink benjamin wood ladi killer eleg knife twist occasion mope rock shimmer speaker servic confound friend coddl when stood outsid read post menu hurri step hand copi wouldnt crane neck on arriv departur host leapt door the mission statement preced meal we modern american public hous waiter inton unnecessari slight coy mr lyonss ambit yes wing menu jack gochujang korean ferment soybean chile past borrow larder momofuku noodl bar mr lyon there occasion technic flourish watermelon cube cryovac intensifi flavor mu

### Bag of Words and TFIDF

Using the TextAnalysis package we will create a DirectoryCorpus to use when constructing counts over the whole corpus. A text corpus is a large body of text.

In [12]:
crps = DirectoryCorpus("data");

We can use the in-place standardize! function to make sure all the documents in our corpus are of the StringDocument type.

In [13]:
standardize!(crps, StringDocument)

I use the preparation steps from the my_prepare function above, but applied to the entire corpus. These work in place.

In [14]:
remove_case!(crps)
prepare!(crps, strip_punctuation
    | strip_articles 
    | strip_pronouns
    | strip_numbers 
    | strip_non_letters)
remove_words!(crps, STOPWORDS)
stem!(crps)

Our lexicon keeps track of the words in the corpus and the count associated with each word. It takes the form of a dictionary. First we have to update the lexicon.

In [15]:
update_lexicon!(crps)

In [16]:
lexicon(crps)

Dict{String,Int64} with 22718 entries:
  "nuhu"        => 1
  "ironwe"      => 1
  "wintri"      => 2
  "flatb"       => 1
  "economix"    => 2
  "curv"        => 21
  "skylight"    => 4
  "unoffici"    => 5
  "touchpad"    => 1
  "bidder"      => 4
  "whiz"        => 3
  "beckett"     => 5
  "brandt"      => 5
  "apiec"       => 4
  "il"          => 3
  "msnbc"       => 3
  "archiv"      => 25
  "overdos"     => 2
  "ankl"        => 26
  "oedip"       => 1
  "adventur"    => 25
  "acton"       => 1
  "wpp"         => 8
  "recurr"      => 2
  "underground" => 19
  ⋮             => ⋮

If we wish to have a reverse lookup for each word with documents it appears in, we need to create an inverse index. Fortunately, TextAnalysis makes this easy for us.

In [17]:
update_inverse_index!(crps)

In [18]:
inverse_index(crps)

Dict{String,Array{Int64,1}} with 22718 entries:
  "nuhu"        => [603]
  "ironwe"      => [630]
  "wintri"      => [159, 583]
  "flatb"       => [744]
  "economix"    => [611, 809]
  "curv"        => [23, 47, 51, 272, 284, 422, 493, 522, 559, 575, 584, 599, 61…
  "skylight"    => [360, 376, 530, 634]
  "unoffici"    => [24, 123, 281, 719, 999]
  "touchpad"    => [757]
  "bidder"      => [104, 235, 881]
  "whiz"        => [51, 321, 857]
  "beckett"     => [397, 601, 701, 750, 992]
  "brandt"      => [91, 493, 706, 800, 916]
  "apiec"       => [328, 773, 816, 869]
  "il"          => [234, 833, 939]
  "msnbc"       => [89, 107, 746]
  "archiv"      => [203, 243, 245, 368, 549, 560, 599, 676, 716, 826, 833, 878,…
  "overdos"     => [639, 758]
  "ankl"        => [147, 158, 168, 241, 266, 375, 393, 408, 447, 462, 572, 662,…
  "oedip"       => [2]
  "adventur"    => [2, 41, 174, 198, 211, 233, 238, 341, 386, 405  …  551, 632,…
  "acton"       => [65]
  "wpp"         => [87]
  "recurr"      

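For intuition, an inverse index like the one above can be sketched by hand in a few lines. This is only an illustration on made-up data, not how TextAnalysis implements it; `toy_docs` and `toy_inverse` are hypothetical names.

```julia
# Three tiny, already tokenized documents (made-up data).
toy_docs = [["curv", "ankl"], ["ankl", "archiv"], ["curv", "ankl", "curv"]]

# Map each word to the ids of the documents it appears in.
toy_inverse = Dict{String,Vector{Int}}()
for (doc_id, words) in enumerate(toy_docs)
    for word in unique(words)      # list each document once per word
        # get! returns the existing id list, or inserts and returns an empty one
        push!(get!(toy_inverse, word, Int[]), doc_id)
    end
end
```

Here `toy_inverse["curv"]` holds documents 1 and 3, even though "curv" occurs twice in document 3.
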
In [19]:
m = DocumentTermMatrix(crps);

The DocumentTermMatrix is a struct with properties containing the components necessary to create a term frequency inverse document frequency (tfidf) matrix for the corpus. This will be applied in later procedures involving information retrieval or sentiment analysis.

The document term matrix is stored in a data structure called SparseMatrixCSC. Sparse matrices differ from dense matrices in that only the non-zero values are stored. In Julia, zero values can be stored in a sparse matrix, but only if they are inserted explicitly. Sparse matrices are common in machine learning, for example in count data and in encodings that map values to n-dimensional arrays, because algorithms designed to exploit sparsity can save considerable memory and computation on them.
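
A small sketch with the SparseArrays standard library shows both points: only the non-zero cells are stored, and explicit zeros survive only when they are passed in by hand. The matrices `S` and `Z` and their values are made up for illustration.

```julia
using SparseArrays

# A 3x4 matrix (3 documents, 4 terms) with only three non-zero cells.
rows = [1, 2, 3]
cols = [1, 3, 2]
vals = [2.0, 1.0, 4.0]
S = sparse(rows, cols, vals, 3, 4)

stored = nnz(S)        # number of stored entries, not the number of cells
dense = Matrix(S)      # the equivalent dense matrix with all 12 cells

# Zeros passed to `sparse` explicitly are kept as stored entries
# until dropzeros! removes them.
Z = sparse([1, 2], [1, 2], [0.0, 3.0])
stored_with_zero = nnz(Z)
dropzeros!(Z)
stored_after_drop = nnz(Z)
```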

The inverse index also gives us the number of documents each word appears in: it is the length of that word's list of document ids. These counts are known as document frequencies.
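
As a rough sketch, document frequencies can be read off an inverse index by taking the length of each list. The dictionary below is a hand-written stand-in with made-up ids, not the output of inverse_index(crps), and the names `toy_inverse_index` and `doc_freq` are hypothetical.

```julia
# Hand-written stand-in for an inverse index (made-up document ids).
toy_inverse_index = Dict(
    "curv"  => [23, 47, 51],
    "oedip" => [2],
    "whiz"  => [51, 321, 857],
)

# Document frequency = number of documents listed for each word.
doc_freq = Dict(word => length(doc_ids) for (word, doc_ids) in toy_inverse_index)
```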

To obtain the tfidf matrix we will simply use the function below applied to the document term matrix.

In [20]:
tfidf = tf_idf(m);

### Steps for computing tfidf

What if we want to do all of the above manually?

1. Create the bag of words (bow), the set of unique words over the corpus. A set is a good data type for this since it doesn't allow duplicates. At the end you'll want to convert it to an array so that we can deal with our words in a consistent order.

In [22]:
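cleaned_docs = []
bow = Set{String}();
for doc in fds
    # prepare the raw text, then split it into tokens
    cleaned = tokenize(my_prepare(text(doc)))
    union!(bow, Set(cleaned))
    push!(cleaned_docs, cleaned)
end
# drop documents that have no tokens left after preparation
filter!(!isempty, cleaned_docs);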

2. Create a reverse lookup for the vocabulary. This is a dictionary whose keys are the words and whose values are the indices of the words (the word ids). Looking a word up this way is much faster than searching an array for it each time.

In [23]:
# fix the order of the vocabulary so word ids map back to words
vocab = collect(bow)
indexer = Dict{String,Int64}()
for (i, word) in enumerate(vocab)
    indexer[word] = i
end
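
To illustrate the difference on a tiny scale, here is a sketch that finds a word id both ways. `toy_vocab` and `toy_indexer` are made-up names and data; with a vocabulary of a few words the speed difference is negligible, but the array search has to walk the array while the dictionary lookup does not.

```julia
toy_vocab = ["ankl", "archiv", "curv", "whiz"]
toy_indexer = Dict(word => i for (i, word) in enumerate(toy_vocab))

# Linear scan: checks entries one by one until the word is found.
id_scan = findfirst(==("curv"), toy_vocab)

# Hash lookup: goes straight to the stored id.
id_dict = toy_indexer["curv"]

# Words outside the vocabulary can be handled with a default value.
id_missing = get(toy_indexer, "nuhu", 0)
```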

3. Create a word count matrix. This is an array where each row corresponds to a document and each column to a word. Each value is the number of times that word appears in that document.

In [62]:
num_docs = length(fds)
num_words = length(indexer);

In [63]:
counts = zeros((num_docs, num_words));

In [64]:
for (idx, doc) in enumerate(cleaned_docs)
    C = Dict{String,Int64}()
    for word in doc
        C[word] = get(C, word, 0) + 1
    end
    for (word, count) in C
        counts[idx, indexer[word]] = count
    end
end
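
With a vocabulary of this size the dense counts array is mostly zeros. As an alternative sketch, the same counts can be collected directly into a sparse matrix with the SparseArrays standard library. The data and the names `toy_docs`, `toy_indexer`, `row_ids`, `col_ids` and `sparse_counts` are made up for illustration; this is not what the cells above do.

```julia
using SparseArrays

# Tiny tokenized corpus and word-id lookup (made-up data).
toy_docs = [["curv", "ankl", "curv"], ["whiz"], ["ankl", "whiz", "whiz"]]
toy_indexer = Dict("ankl" => 1, "curv" => 2, "whiz" => 3)

# One (row, column) pair per token occurrence.
row_ids = Int[]
col_ids = Int[]
for (doc_id, words) in enumerate(toy_docs)
    for word in words
        push!(row_ids, doc_id)
        push!(col_ids, toy_indexer[word])
    end
end

# `sparse` sums the values of duplicate (row, column) pairs by default,
# which turns the per-occurrence ones into counts.
sparse_counts = sparse(row_ids, col_ids, ones(length(row_ids)),
                       length(toy_docs), length(toy_indexer))
```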

4. Create the document frequencies. For each word, get a count of the number of documents the word appears in. This is different from the total number of times the word appears.

In [65]:
# count the documents in which each word occurs at least once
df = sum(counts .> 0, dims=1);

5. Normalize the word count matrix to get the term frequencies. This means dividing each document's counts by the l2 (Euclidean) norm of that document's count vector, so that each document vector has a length of 1.

In [61]:
# l2 norm of each document's count vector; zero norms are set to 1 to avoid dividing by 0
tf_norm = sqrt.(sum(counts .^ 2, dims=2));
tf_norm[tf_norm .== 0] .= 1;
tf = counts ./ tf_norm;

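6. Multiply the term frequency matrix by the log of the inverse document frequencies to get the tf-idf matrix. We add one to both the numerator and the denominator inside the log to avoid dividing by 0, and add one to the result so that words appearing in every document are not weighted down to zero.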

In [86]:
idf = log.((num_docs + 1) ./ (1 .+ df)) .+ 1;
tfidf_m = tf .* idf;
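
Written out, the weighting in the cell above is the following, where $N$ is the number of documents (num_docs), $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(d, t)$ is the l2-normalized count from step 5. The symbols are assumed notation for this note, not names from TextAnalysis.

$$
\mathrm{idf}(t) = \ln\frac{N + 1}{1 + \mathrm{df}(t)} + 1,
\qquad
\mathrm{tfidf}(d, t) = \mathrm{tf}(d, t) \cdot \mathrm{idf}(t)
$$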

7. Normalize the tf-idf matrix as well by dividing by the l2 norm.

In [87]:
tfidf_norm = sqrt.(sum(tfidf_m .^ 2, dims=2));
tfidf_norm[tfidf_norm .== 0] .= 1;
tfidf_m ./= tfidf_norm;
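
As a quick check, steps 4 to 7 can be wrapped into one function and run on a tiny count matrix. This is a sketch under the same formulas as above; `toy_tfidf`, `toy_counts` and `toy_result` are hypothetical names, and the counts are made up.

```julia
# Steps 4-7 on a dense count matrix (rows = documents, columns = terms).
function toy_tfidf(counts::AbstractMatrix{<:Real})
    num_docs = size(counts, 1)
    df = sum(counts .> 0, dims=1)                  # documents containing each term
    tf_norm = sqrt.(sum(counts .^ 2, dims=2))      # l2 norm of each row
    tf_norm[tf_norm .== 0] .= 1                    # empty documents stay all-zero
    tf = counts ./ tf_norm
    idf = log.((num_docs + 1) ./ (1 .+ df)) .+ 1   # smoothed inverse document frequency
    weighted = tf .* idf
    row_norm = sqrt.(sum(weighted .^ 2, dims=2))
    row_norm[row_norm .== 0] .= 1
    return weighted ./ row_norm
end

# Four documents over three terms; the last document is empty.
toy_counts = [1 2 0; 0 0 1; 1 0 2; 0 0 0]
toy_result = toy_tfidf(toy_counts)
```

Every non-empty row of `toy_result` has Euclidean length 1, and the empty document stays a row of zeros.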