# Understanding The DatasetJson Class

#### Basic Functionality

The class reads a JSON file object by object, up to a fixed number of objects. This allows one to read a subset of the dataset without loading the entire dataset into memory and then sampling from it. A clear limitation of this approach is that the objects are taken in file order, so one is not randomly sampling from the entire dataset. That said, this class is not meant for production.
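The limited read described above can be sketched as follows. This is a minimal illustration, not the class's actual implementation: the helper name `read_json_objects` is hypothetical, and it assumes the file holds one JSON object per line.

```python
import json

def read_json_objects(lines, limit):
    """Parse at most `limit` JSON objects from an iterable of lines.

    Hypothetical helper; assumes one JSON object per line.
    """
    records = []
    for line in lines:
        if len(records) >= limit:
            break  # stop early; the remaining lines are never parsed
        line = line.strip()
        if not line:
            continue  # skip blank lines without counting them
        records.append(json.loads(line))
    return records

# Any iterable of lines works, including an open file handle:
#     with open("yelp_academic_dataset_review.json") as handle:
#         records = read_json_objects(handle, 100)
sample = [
    '{"text": "Great food", "stars": 5}',
    '{"text": "Slow service", "stars": 2}',
    '{"text": "Fine", "stars": 3}',
]
first_two = read_json_objects(sample, limit=2)
print(len(first_two))  # 2
```

Because iteration stops at the limit, only the start of the file is ever touched, which is also why the sample is not random.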

#### Creating a Basic Object

yelp_data = DatasetJson("yelp_academic_dataset_review.json", "text", "stars", 100, maxlen=10, max_features=5000)

The constructor takes 6 parameters:

- filename which includes the file location, name and extension
- text_field which is the field the text is located in. For the Yelp Reviews data set this is 'text'
- response_field which is the field that contains the response. For the Yelp Reviews data set this is 'stars'. 
- limit which is the number of objects or observations to extract from the dataset
- maxlen which is the length of sequences to be extracted for sequence tasks. The default is 10.
- max_features which is the maximum number of unique features to keep when preparing the data for training. The default is 5000.

#### Preparing the Data for an MLP

A Keras MLP takes a document term matrix (DTM) as input. This is a matrix with documents in the rows and terms (words) in the columns. In other words, documents are observations and words are features. The response is a matrix in which each document is represented by a vector whose length equals the total number of classes or prediction categories. The vector has a 1 in the position corresponding to the document's class or category and zeros elsewhere.
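As a rough illustration of what a DTM looks like, here is a small sketch. The function `build_dtm` is hypothetical. It assumes lowercase whitespace tokenization and raw term counts, which may differ from what the class actually does.

```python
import numpy as np

def build_dtm(docs):
    """Return (dtm, vocab) for a list of strings.

    Hypothetical helper: tokenizes on whitespace and stores raw counts.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = {}
    for tokens in tokenized:
        for token in tokens:
            # Assign the next free column to each unseen term.
            vocab.setdefault(token, len(vocab))
    dtm = np.zeros((len(docs), len(vocab)))
    for row, tokens in enumerate(tokenized):
        for token in tokens:
            dtm[row, vocab[token]] += 1
    return dtm, vocab

docs = ["good food good service", "bad service"]
dtm, vocab = build_dtm(docs)
print(dtm.shape)  # (2, 4): 2 documents, 4 distinct terms
```

Each row is one document and each column is one term, so the shape is (number of documents, vocabulary size).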

In [1]:
from preprocessor import DocuemntTermMatrix, indicator_to_matrix
yelp_data = DocuemntTermMatrix("yelp_academic_dataset_review.json", "text", "stars", 100)

In [2]:
# [0] documents; [1] terms
yelp_data.X_docs.shape

(100, 3120)

The class DocuemntTermMatrix extends DatasetJson by storing a DTM in X_docs, a vocabulary in docs_vocab and the response in Y_docs. The response is a list of class labels, so it needs to be converted into a matrix before it can be used to train an MLP. The function indicator_to_matrix performs this conversion. The class does not convert the response automatically, which allows for greater flexibility.
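To make the conversion concrete, the sketch below one-hot encodes a list of labels. It is not the source of indicator_to_matrix: the function `to_indicator` is a hypothetical stand-in, and the labels are made-up star ratings.

```python
import numpy as np

def to_indicator(labels, label_index):
    """One-hot encode labels using a label -> column mapping (hypothetical helper)."""
    matrix = np.zeros((len(labels), len(label_index)))
    for row, label in enumerate(labels):
        # Put a 1 in the column assigned to this label.
        matrix[row, label_index[label]] = 1.0
    return matrix

label_index = {1: 0, 2: 1, 3: 2, 4: 3, 5: 4}
indicator = to_indicator([5, 2, 4], label_index)
print(indicator.shape)  # (3, 5): 3 documents, 5 classes
```

Every row contains exactly one 1, in the column that the label index assigns to that document's class.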

In [3]:
# indicator matrix for response
indicator_to_matrix(yelp_data.Y_docs,yelp_data.docs_label_index)

array([[ 0.,  0.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  

#### Preparing the Data for an LSTM

A Keras LSTM takes a 3D matrix as its input. This is a document term matrix with sequence information, so you could think of it as a Document Sequence Term Matrix (DSTM). This terminology is non-standard. The sequence dimension retains the position at which each word appears; the rest is the same. Because sequences are taken into account, a natural issue arises: the documents are not all the same length. For this reason the sequences need to be 'padded'. Short documents have padding added to them, and long documents are truncated. When a document is padded, it is typically padded with zero vectors.
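The padding and truncation described above can be sketched as follows. The function `to_tensor` is hypothetical, and padding at the end of each document (rather than the start) is an assumption made for this illustration.

```python
import numpy as np

def to_tensor(index_seqs, maxlen, vocab_size):
    """One-hot encode index sequences into shape (docs, maxlen, vocab_size).

    Hypothetical helper: truncates long documents and leaves trailing
    zero vectors for short ones (end padding is an assumption).
    """
    tensor = np.zeros((len(index_seqs), maxlen, vocab_size))
    for doc, seq in enumerate(index_seqs):
        for step, word_index in enumerate(seq[:maxlen]):
            tensor[doc, step, word_index] = 1.0
    return tensor

# Two documents given as word indices: one too long, one too short.
seqs = [[0, 2, 1], [3]]
tensor = to_tensor(seqs, maxlen=2, vocab_size=4)
print(tensor.shape)  # (2, 2, 4)
```

The first document loses its third word, while the second keeps an all-zero vector in its unused position.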

In [4]:
from preprocessor import Tensor_Sequence
yelp_data = Tensor_Sequence("yelp_academic_dataset_review.json", "text", "stars", 100, 25)

The class Tensor_Sequence extends DatasetJson by storing a DSTM in X_doc_seq, a vocabulary in docs_vocab and the response in Y_doc_seq. The name change is partly to prevent the user from using a DTM when they expect a DSTM. The class is called Tensor Sequence because tensors are the n-dimensional extension of a vector (1D object) and a matrix (2D object); in this case the DSTM is a 3D object. The same steps as above are required to convert the response to an indicator matrix.

In [5]:
yelp_data.X_doc_seq.shape

(100, 25, 3120)

#### Preparing the Data for a CNN

A Keras CNN can take a 1-dimensional sequence as input. Unlike the LSTM input, which extends the DTM, this representation does away with it. It takes the word indices relative to the vocabulary as the data sequence. These sequences are padded for the same reasons tensor sequences are padded. Compared with visual input, it effectively treats word indices as pixel intensities. One would expect this approach to underperform the others, since the word index carries no information yet is modelled as though it does.
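A minimal sketch of fixing an index sequence to a set length is shown below. The helper `pad_indices` is hypothetical, and padding with 0 at the end of the sequence is an assumption for illustration.

```python
def pad_indices(seq, maxlen, pad_value=0):
    """Truncate or right-pad a list of word indices to exactly `maxlen` entries.

    Hypothetical helper; the pad value and pad position are assumptions.
    """
    trimmed = list(seq[:maxlen])
    return trimmed + [pad_value] * (maxlen - len(trimmed))

short_doc = pad_indices([74, 2850], maxlen=4)
long_doc = pad_indices([74, 2850, 2964, 1162, 1037], maxlen=4)
print(short_doc)  # [74, 2850, 0, 0]
print(long_doc)   # [74, 2850, 2964, 1162]
```

Every document ends up with exactly `maxlen` entries, which lets the sequences be stacked into a single 2D array.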

In [6]:
from preprocessor import Index_Sequence
yelp_data = Index_Sequence("yelp_academic_dataset_review.json", "text", "stars", 100, 25)

In [7]:
yelp_data.X_doc_seq[1]

array([   74.,     0.,  2850.,  2964.,  1162.,  1037.,   832.,   603.,
        1958.,     0.,  2850.,  1542.,   834.,  1134.,  2921.,   650.,
        1134.,  2921.,  2700.,  1617.,   156.,    68.,  2699.,  1124.,
          39.])

#### Final Comments

The original data is stored in DatasetJson.text and the original response is stored in DatasetJson.response. This provides the flexibility to refer back to the original text or response when needed.

There are set objects for both the text and the response. The first is the vocabulary, docs_vocab, a dictionary that links words to arbitrary indices. The second is the label index, docs_label_index, a dictionary that links the classes or categories contained in the response to arbitrary indices.

In [8]:
yelp_data.text[1]

u"Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff.  It seems that his staff simply never answers the phone.  It usually takes 2 hours of repeated calling to get an answer.  Who has time for that or wants to deal with it?  I have run into this problem with many other doctors and I just don't get it.  You have office workers, you have patients with medical needs, why isn't anyone answering the phone?  It's incomprehensible and not work the aggravation.  It's with regret that I feel that I have to give Dr. Goldberg 2 stars."

In [9]:
yelp_data.response[1]

2

In [10]:
yelp_data.docs_vocab

{u'': 0,
 u'ever.': 1,
 u'polite.': 2,
 u'better,': 3,
 u'better.': 4,
 u'happen.\n\nthen': 1305,
 u'four': 6,
 u'school.....traditional': 7,
 u'0830': 8,
 u'consists': 9,
 u'woody': 10,
 u'oldest': 11,
 u'yinzer': 12,
 u'up.': 13,
 u'onus': 14,
 u'nah...why': 15,
 u'remodeled': 16,
 u'presents': 17,
 u'voted': 19,
 u'under': 20,
 u're-carpeted': 21,
 u'$x.xx"\n"2': 22,
 u'worth': 23,
 u"primanti's,": 24,
 u'dispense': 26,
 u'attentive,': 27,
 u'thin.': 28,
 u'attentive.': 29,
 u'sucks.': 30,
 u'every': 31,
 u'today.': 32,
 u'chapel': 33,
 u'school': 34,
 u'basics': 35,
 u"it.\n\nhere's": 36,
 u'formica': 37,
 u'wednesday': 38,
 u'stars.': 39,
 u'woods': 40,
 u'front,': 41,
 u'mondays.': 42,
 u'enjoy': 43,
 u'her.': 44,
 u'salad.': 1074,
 u'ate,': 46,
 u'back!!!': 47,
 u'estimates': 48,
 u'direct': 49,
 u'tires': 50,
 u'galore.': 51,
 u'crave': 52,
 u'street': 53,
 u'sink"': 54,
 u'air': 55,
 u'machines': 56,
 u'monster': 2513,
 u'even': 58,
 u'change.': 59,
 u'ruben': 61,
 u'beaten': 

In [11]:
yelp_data.docs_label_index

{1: 0, 2: 1, 3: 2, 4: 3, 5: 4}
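
Lookup dictionaries like the two shown above could be built along the following lines. The helper `build_index` is hypothetical. It sorts the distinct items so the result is reproducible, whereas the class's own indices are described as arbitrary.

```python
def build_index(items):
    """Map each distinct item to an integer index (hypothetical helper).

    Sorting is a choice made here for reproducibility only.
    """
    return {item: position for position, item in enumerate(sorted(set(items)))}

label_index = build_index([5, 2, 4, 4, 1, 3, 5])
word_index = build_index("good food good service".split())
print(label_index)  # {1: 0, 2: 1, 3: 2, 4: 3, 5: 4}
print(word_index)   # {'food': 0, 'good': 1, 'service': 2}
```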