# NLP and Information Retrieval with Julia
______________________________________________________________

![pipes](images/text_classification_workflow.png)


## Setup

The data source is usually a document database such as MongoDB. I've started a client with a local Mongo service and loaded the data from a JSON document that mimics the real-life documents one might receive in a request body. I won't go into how to do this in this tutorial, so let's assume the documents are already loaded in a local Mongo service.



### Loading Data from Mongo

For my pipeline, I want the data in a Julia table. As a first step, I pipe the data from my database into flat text files (.txt) in a directory called data. To do this I call a Python script, aptly named load_data.py. I wrote it in Python because I already had pymongo installed, and the main goal here is to perform common NLP tasks with Julia. The script reports the number of files written, which is 0 if the files already exist locally.

In [6]:
run(`python src/load_data.py`)

Number of files written:  0


Process(`python src/load_data.py`, ProcessExited(0))
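
The skip-if-present behaviour described above (0 files written on a second run) can be sketched in Julia as well. This is not the contents of load_data.py, which is not shown here; `write_documents`, the `id` and `text` fields, and the sample records are all hypothetical stand-ins for whatever the Mongo query returns.

```julia
# Hypothetical Julia sketch of an idempotent "dump documents to text files" step.
# `docs` stands in for records fetched from Mongo; the field names are assumptions.
function write_documents(docs, dir::AbstractString)
    mkpath(dir)                      # create the output directory if needed
    written = 0
    for doc in docs
        path = joinpath(dir, string(doc.id, ".txt"))
        isfile(path) && continue     # skip documents that are already on disk
        write(path, doc.text)
        written += 1
    end
    return written
end

# Made-up sample records, written to a temporary directory.
docs = [(id = "doc1", text = "first document"),
        (id = "doc2", text = "second document")]

dir = mktempdir()
println("Number of files written:  ", write_documents(docs, dir))
println("Number of files written:  ", write_documents(docs, dir))
```

Running the function twice against the same directory shows why a repeat run reports no new files: every target path already exists, so each document is skipped.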

## Text Processing Pipeline

The goal is to build a basic text processing pipeline involving tokenization, stopword removal and stemming. Ultimately, we want a sparse representation of the data in which each row is a document and each column is a unique term, such as a unigram, bigram or trigram. The values will come from a vectorization method that assigns each term in a document a weight proportional to its frequency in that document, but inversely proportional to the number of documents in which it occurs (the idea behind TF-IDF weighting).
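
To make the target concrete, here is a minimal sketch of such a pipeline using only Base and the SparseArrays standard library. It is an illustration under assumptions, not the final implementation: the stopword list is a toy one, the weighting is one common TF-IDF variant (relative term frequency times log(N / document frequency)), and stemming is left out because it needs an external stemmer. The names `tokenize`, `ngrams` and `tfidf` are made up for this sketch.

```julia
using SparseArrays

# Toy stopword list; a real pipeline would use a fuller, language-specific one.
const STOPWORDS = Set(["the", "a", "an", "is", "of", "and", "on"])

# Lowercase, split on anything that is not a letter or digit, drop stopwords.
function tokenize(text::AbstractString)
    tokens = String.(split(lowercase(text), r"[^\p{L}\p{N}]+"; keepempty = false))
    return [t for t in tokens if !(t in STOPWORDS)]
end

# All contiguous runs of n tokens, each joined with a single space.
ngrams(tokens, n) = [join(tokens[i:i+n-1], " ") for i in 1:length(tokens)-n+1]

# Build a sparse documents-by-terms matrix of TF-IDF weights.
# `docs` is a vector of term vectors, one per document.
function tfidf(docs)
    vocab = sort(unique(vcat(docs...)))
    col = Dict(term => j for (j, term) in enumerate(vocab))
    ndocs = length(docs)

    # Document frequency: number of documents containing each term.
    df = zeros(Int, length(vocab))
    for doc in docs, term in unique(doc)
        df[col[term]] += 1
    end

    # One entry per term occurrence; `sparse` sums duplicate (row, col) pairs,
    # which turns the per-occurrence contributions into tf * idf.
    rows, cols, vals = Int[], Int[], Float64[]
    for (i, doc) in enumerate(docs), term in doc
        j = col[term]
        push!(rows, i)
        push!(cols, j)
        push!(vals, (1 / length(doc)) * log(ndocs / df[j]))
    end
    return sparse(rows, cols, vals, ndocs, length(vocab)), vocab
end

# Made-up example documents; features are unigrams plus bigrams.
raw = ["The cat sat on the mat", "The dog sat on the log", "Cats and dogs"]
docs = [vcat(toks, ngrams(toks, 2)) for toks in tokenize.(raw)]
X, vocab = tfidf(docs)
println(size(X), " matrix over ", length(vocab), " terms")
```

A term that appears in every document gets weight 0 under this variant, since log(N / N) = 0; smoothed variants avoid that, and which one to use is a design choice rather than something fixed by the description above.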

### Load data using JuliaDB

In [8]:
using JuliaDB


In [9]:
fnames = glob("data/*.txt")

999-element Array{String,1}:
 "data/5233240838f0d8062fddf624.txt"
 "data/5233249838f0d8062fddf6a0.txt"
 "data/5234544138f0d81989737272.txt"
 "data/523454a638f0d819897372fb.txt"
 "data/523454f338f0d8198973735f.txt"
 "data/5236a0c838f0d81989738879.txt"
 "data/5236aa6138f0d81989738886.txt"
 "data/5236be3538f0d819897388aa.txt"
 "data/5236d12238f0d819897388d2.txt"
 "data/5236d6ee38f0d819897388dd.txt"
 "data/5236e1ca38f0d819897388ee.txt"
 "data/5236f14538f0d8198973890d.txt"
 "data/5236f58b38f0d81989738917.txt"
 ⋮                                  
 "data/523dc3d238f0d8198973af64.txt"
 "data/523dc44138f0d8198973af66.txt"
 "data/523dc4d038f0d8198973af67.txt"
 "data/523dc51638f0d8198973af68.txt"
 "data/523dc56f38f0d8198973af69.txt"
 "data/523dc57138f0d8198973af6a.txt"
 "data/523dc67438f0d8198973af6c.txt"
 "data/523dc69e38f0d8198973af6d.txt"
 "data/523dc82738f0d8198973af6f.txt"
 "data/523dcab938f0d8198973af70.txt"
 "data/523dccc738f0d8198973af72.txt"
 "data/523dd65b38f0d8198973af84.txt"
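
If no `glob` function is in scope, the same listing can be approximated with Base functions alone. The helper name `list_txt_files` is made up for this sketch, and it only matches on the file extension rather than supporting full glob patterns.

```julia
# List the .txt files directly inside `dir`, as paths prefixed with `dir`.
function list_txt_files(dir::AbstractString)
    names = filter(name -> endswith(name, ".txt"), readdir(dir))
    return joinpath.(dir, sort(names))
end

# Fall back to an empty list when the data directory has not been created yet.
fnames = isdir("data") ? list_txt_files("data") : String[]
println(length(fnames), " text files found")
```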

In [10]:
# example file
readlines(fnames[1])

42-element Array{String,1}:
 ""
 ""
 ""
 ""
 ""
 ""
 ""
 ""

### TODO: Figure out how to load the data into a table, and modify the Python load_data script to store labels in a file called train_labels.csv or similar
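
One possible starting point for the table-loading part of this TODO is to read each file into a row of a simple column table, that is, a NamedTuple of equal-length vectors. The function name `load_documents` and the column names `id` and `text` are assumptions made for this sketch, and labels are left out because their file format is not settled yet. A NamedTuple of columns like this should be accepted by JuliaDB's `table` constructor, though that step is untested here.

```julia
# Read each file into one row; the id is the file name without its extension.
function load_documents(paths)
    ids = [splitext(basename(p))[1] for p in paths]
    texts = [String(strip(read(p, String))) for p in paths]
    return (id = ids, text = texts)
end

# Demo on two throwaway files in a temporary directory.
demo_dir = mktempdir()
write(joinpath(demo_dir, "doc1.txt"), "\n\nfirst document\n")
write(joinpath(demo_dir, "doc2.txt"), "second document\n")

docs_table = load_documents(joinpath.(demo_dir, ["doc1.txt", "doc2.txt"]))
println(docs_table.id)
```

Stripping leading and trailing whitespace is a guess motivated by the blank lines at the top of the example file above; whether interior blank lines should also be collapsed depends on the tokenizer used later.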