# In the beginning there was the data

In NLP, everything begins with text, and it is important to have a good understanding of the data before starting to work on the code. When an algorithm returns output you did not expect, chances are there is something you did not know about the data that is violating some of your assumptions (for instance, you are working with a collection of movie reviews and some of them turn out to be about books).

This parallels the often-cited advice for programmers: do not start writing code before you understand the problem you are trying to solve. The same applies to the data: do not start writing the code until you know the data.

So, let's first get our hands on some data. During this course, we will be working with a few different datasets.

## Raw text
First, we will need a large amount of arbitrary raw text to feed a few relatively stupid processes and a few smarter ones.

This corpus should be big. Ideally, as big as possible. However, there are practical (mostly memory) limitations and, for this course, we will be working with data sizes that can be crunched on a standard laptop. That means **no Big Data** as such: nothing that fits on a laptop is big data :) We will use just some toy, pocket-size data to prove the point.

We will need a Python class to easily stream text into our algorithms. When possible, we will use a generator (using the __yield__ keyword instead of returning a __list__), so that records are processed one by one and we never have to hold all of them in memory at once. Here's the streamer:

In [10]:
from lib.Tools import (
    decode as d,           # Auxiliary method to decode UTF-8 (when reading from a file)
    encode as e            # Auxiliary method to encode UTF-8 (when writing to a file or stdout)
)

class TextStreamer:
    
    def __init__(self, source, parser=None):
        self.source = source
        if parser:
            self.parser = parser(self.source)
        else:
            self.parser = None

    def __iter__(self):
        if not self.parser:
            with open(self.source, 'rb') as rd:
                for line in rd:
                    if not line.strip():
                        continue
                    yield d(line)
        else:
            for parsed in self.parser:
                yield d(parsed)

Let's make sure it works. We create a __TextStreamer__ instance, passing it a path as the constructor argument, and see if we get anything out of it:

In [11]:
path = '/Users/jordi/Laboratorio/corpora/raw/umbc/webbase_all/delorme.com_shu.pages_89.txt'
strr = TextStreamer(path)
i = 0
for record in strr:
    i += 1
    print i, record
    if i == 50:
        break

1 A  bus  looms  out  of  the grey and blizzardy  conditions  and  we  get  on gratefully.  Then a very lengthy episode of bus and local railway gets us to the residence of Q-Funk's current squeeze, whom we met briefly at the party.

2 We  go back into the city centre,  this time on a search for pizza.  This is duly  found,  in a neat bar-cafe not too far from Rax's,  the domain of 'Eat until you explode'.  Q-Funk is a regular there,  and the bar person seems to know him well.  What is given to us foodwise, is plenty to be going on with.

3 The  next  vital  and necessary step,  is to secure some  inportant  alcohol supplies for the after-party,  and it is to the 'Alko' booze shoppe, that we adjourn  to  next.  We spend an enjoyable fifteen or so minutes mulling  the many  choices  over.  Felice  opts for a bottle of fine old red wine,  I  am caught  in  a dilemma over several different varieties of  flavoured  vodka, choosing a big bottle for public enjoyment,  and a smaller bottle fo

That looks great :) We now have text we can use for our experiments. In a few minutes we will start doing some much more interesting stuff with it.
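
The counter-and-`break` loop above is one way to peek at the first records of a stream. A minimal sketch of an alternative follows, assuming Python 3 (the notebook's own cells use Python 2 `print` statements). `stream_lines` is a hypothetical stand-in for `TextStreamer`, and the sample is an in-memory string rather than the real corpus file. `itertools.islice` takes the first *n* items from a generator without consuming the rest:

```python
import io
from itertools import islice


def stream_lines(handle):
    """Yield non-empty lines one at a time (hypothetical stand-in for TextStreamer)."""
    for line in handle:
        if not line.strip():
            continue  # skip blank lines, as TextStreamer does
        yield line.rstrip('\n')


# In-memory sample standing in for a corpus file (illustrative data only).
sample = io.StringIO('first record\n\nsecond record\nthird record\n')

# islice stops after two records; the third line is never read.
first_two = list(islice(stream_lines(sample), 2))
for i, record in enumerate(first_two, start=1):
    print(i, record)
```

Because the generator is lazy, the cost of this peek does not depend on the size of the file.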

Before that, though, let's take a look at another type of dataset we will be using. For some of the experiments, we will need labeled datasets such as the ones we described for semi-supervised tasks. In those cases, the streamer should allow us to iterate not only over the text records but also over the labels assigned to those records, __in parallel__, so that each record always comes with its own label.
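
Iterating over records and their labels together can be sketched as follows. This is only an illustration, not the class used later in the course: `LabeledStreamer` and its toy data are hypothetical, and the code assumes Python 3.

```python
class LabeledStreamer:
    """Yield (label, text) pairs from two aligned sequences (hypothetical helper)."""

    def __init__(self, texts, labels):
        if len(texts) != len(labels):
            raise ValueError('texts and labels must have the same length')
        self.texts = texts
        self.labels = labels

    def __iter__(self):
        # zip pairs the i-th text with the i-th label, one pair at a time.
        for text, label in zip(self.texts, self.labels):
            yield label, text


# Toy data, for illustration only.
texts = ['great film', 'dull plot', 'lovely soundtrack']
labels = [1, 0, 1]

pairs = list(LabeledStreamer(texts, labels))
for label, text in pairs:
    print(label, text)
```

Checking the lengths up front catches misaligned inputs early; `zip` on its own would silently stop at the shorter sequence.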

Luckily, there is a very well-known NLP library for Python called NLTK (Natural Language Toolkit, you can check it out [here](http://www.nltk.org)) that can help us with that, and with many other things :) NLTK comes with a built-in class for working with corpora, as well as a few datasets of its own.

One of them is the [Reuters corpus](http://about.reuters.com/researchandstandards/corpus/), a dataset where we have a collection of documents tagged with the topics they are about. Let's take a quick look:

In [12]:
import nltk

from nltk.corpus import reuters

for i in reuters.fileids()[:5]:
    print reuters.categories(i), ' '.join(reuters.words(i)[:15]) + '...', '\n'

[u'trade'] ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between... 

[u'grain'] CHINA DAILY SAYS VERMIN EAT 7 - 12 PCT GRAIN STOCKS A survey of 19... 

[u'crude', u'nat-gas'] JAPAN TO REVISE LONG - TERM ENERGY DEMAND DOWNWARDS The Ministry of International Trade and... 

[u'corn', u'grain', u'rice', u'rubber', u'sugar', u'tin', u'trade'] THAI TRADE DEFICIT WIDENS IN FIRST QUARTER Thailand ' s trade deficit widened to 4... 

[u'palm-oil', u'veg-oil'] INDONESIA SEES CPO PRICE RISING SHARPLY Indonesia expects crude palm oil ( CPO ) prices... 



In [13]:
print len(reuters.categories()), len(reuters.fileids())

90 10788


That is what most annotated datasets look like at some level: a collection of documents, each paired with one or more labels. For most supervised, semi-supervised and weakly-supervised tasks, that is the format of the input we will be receiving.
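
As a minimal sketch of that format, each annotated record can be treated as a pair of labels and text. The records below are made up for illustration (they are not taken from the Reuters corpus), and the code assumes Python 3. Counting the labels with a `Counter` gives a quick view of how the categories are distributed, which is one of the first things worth checking in a labeled dataset.

```python
from collections import Counter

# Hypothetical multi-label records: (list of labels, text).
records = [
    (['trade'], 'exporters fear damage from trade rift'),
    (['grain'], 'survey of grain stocks'),
    (['grain', 'trade'], 'trade deficit widens as grain exports fall'),
]

label_counts = Counter()
for record_labels, _text in records:
    # A document may carry several labels; each one is counted.
    label_counts.update(record_labels)

print(label_counts.most_common())
```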

However, in NLP (just as in all forms of science and engineering) it is always a good idea to run the same experiment on different datasets, to make sure the results we are getting are not simply due to the nature of one particular dataset (this is sometimes called an _artifact of the data_, or also __bias__). For purposes of replication and validation, we will be using a second corpus for classification, the [__20 Newsgroups__ corpus](http://qwone.com/~jason/20Newsgroups/).

NLTK does not come with built-in support for this particular corpus so we will improvise a class to handle it:

In [14]:
import os

from collections import (
    Counter
)

class SimpleCorpusReader:
    def __init__(self, root):
        self.root = root
        self.documents = []
        self.paths = []
        self.tags = []
        self.tagdist = Counter()
        self.__load()
    
    def __load(self):
        # Each sub-folder of the root is a category; each file inside it is a document.
        for category in os.listdir(self.root):
            # os.path.join works whether or not root ends with a path separator
            category_folder = os.path.join(self.root, category)
            if not os.path.isdir(category_folder):
                continue
            for document_name in os.listdir(category_folder):
                document_path = os.path.join(category_folder, document_name)
                text = self.__read(document_path)
                self.paths.append(document_path)
                self.tags.append(category)
                self.documents.append(text)
                self.tagdist[category] += 1
    
    def __read(self, document_path):
        # Read the whole file as raw bytes (no decoding is applied here)
        with open(document_path, 'rb') as rd:
            return rd.read()

        
rdr = SimpleCorpusReader('data/20_newsgroups/')
print 'The first document is:', '\n'
print [rdr.documents[0][1000:1250] + '...'], '\n\n'
print 'The category of the first document is:', rdr.tags[0], '\n'
print 'The number of records for documents, paths, and categories is: %d, %d, and %d' % (len(rdr.documents), len(rdr.paths), len(rdr.tags))

The first document is: 

[' 1992\nVersion: 1.0\n\n                              Atheist Resources\n\n                      Addresses of Atheist Organizations\n\n                                     USA\n\nFREEDOM FROM RELIGION FOUNDATION\n\nDarwin fish bumper stickers and assorted other ...'] 


The category of the first document is: alt.atheism 

The number of records for documents, paths, and categories is: 19997, 19997, and 19997


Luckily, the category matches and all the numbers add up :)
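
`SimpleCorpusReader` loads every document into memory, which is fine for a corpus of this size. For larger collections, the generator idea from the raw-text streamer carries over. Here is a minimal sketch, assuming Python 3 and the same one-folder-per-category layout. `iter_corpus` is a hypothetical helper, and the demo builds a tiny throwaway corpus in a temporary directory instead of touching the real data:

```python
import os
import shutil
import tempfile


def iter_corpus(root):
    """Lazily yield (category, path, raw bytes) for each file under root/<category>/."""
    # sorted() makes the order deterministic; os.listdir returns entries in arbitrary order.
    for category in sorted(os.listdir(root)):
        category_folder = os.path.join(root, category)
        if not os.path.isdir(category_folder):
            continue
        for name in sorted(os.listdir(category_folder)):
            path = os.path.join(category_folder, name)
            with open(path, 'rb') as rd:
                yield category, path, rd.read()


# Build a tiny throwaway corpus (hypothetical categories and contents).
root = tempfile.mkdtemp()
for category, name, body in [('sci.space', '1', b'orbit'), ('rec.autos', '1', b'engine')]:
    os.makedirs(os.path.join(root, category), exist_ok=True)
    with open(os.path.join(root, category, name), 'wb') as wr:
        wr.write(body)

found = [(category, text) for category, _path, text in iter_corpus(root)]
print(found)

shutil.rmtree(root)  # clean up the temporary corpus
```

Sorting the directory listings also makes "the first document" well defined, which a bare `os.listdir` does not guarantee.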

Last, for our sentiment analysis experiments we will be using another dataset, which is stored in a tab-separated .csv file (so, a .tsv file really :) To read it, we will just use a wrapper over the csv module from Python's standard library. Our wrapper is in lib.Tools and accepts an arbitrary delimiter as a keyword argument, so we can specify the tab as the field separator:

In [9]:
from lib.Tools import from_csv

sentiment_data = list(from_csv('data/labeledTrainData.tsv', delimiter='\t'))

for _id, tag, text in sentiment_data[1:11]:
    print '%s | %s | %s...' % (_id, tag, text[:80])


5814_8 | 1 | With all this stuff going down at the moment with MJ i've started listening to h...
2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hines is a very entertaining film th...
7759_3 | 0 | The film starts with a manager (Nicholas Bell) giving welcome investors (Robert ...
3630_4 | 0 | It must be assumed that those who praised this film (\the greatest filmed opera ...
9495_8 | 1 | Superbly trashy and wondrously unpretentious 80's exploitation, hooray! The pre-...
8196_8 | 1 | I dont know why people think this is such a bad movie. Its got a pretty good plo...
7166_2 | 0 | This movie could have been very good, but comes up way short. Cheesy special eff...
10633_1 | 0 | I watched this video at a friend's house. I'm glad I did not waste money buying ...
319_1 | 0 | A friend of mine bought this film for £1, and even then it was grossly overpric...
8713_10 | 1 | <br /><br />This movie is full of references. Like \Mad Max II\", \"The wild one...


These look like the records we were looking for :)
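
The implementation of `from_csv` is not shown here. As a rough sketch of what such a wrapper over the standard `csv` module might look like (an assumption, not the actual code in `lib.Tools`), consider the following. `read_delimited` is a hypothetical name, the records are made up, and the code assumes Python 3:

```python
import csv
import io


def read_delimited(handle, delimiter=','):
    """Yield each row of a delimited text stream as a list of fields."""
    for row in csv.reader(handle, delimiter=delimiter):
        if row:  # skip completely empty lines
            yield row


# In-memory stand-in for a .tsv file with a header row (made-up records).
tsv = io.StringIO('id\tsentiment\treview\n1_8\t1\tgreat film\n2_3\t0\tdull plot\n')

rows = list(read_delimited(tsv, delimiter='\t'))
header, data = rows[0], rows[1:]
for _id, tag, text in data:
    print('%s | %s | %s' % (_id, tag, text))
```

Slicing off `rows[0]` plays the same role as starting at index 1 in `sentiment_data[1:11]`: it skips the header row.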

Before we start getting our hands dirty, it is worth noting that the .tsv and .csv formats are very widely used for exporting and importing data, so it is usually a good idea to design input and output workflows around them.


## Summary

* It is important to know your data (we will go into more detail in the following sections).
* Python class to stream raw text into our algorithm using a generator.
* NLTK Corpus Reader class and methods for accessing datasets.
* Custom CorpusReader class to parse records belonging to a new dataset.
* Standard import from a .csv file.
* Raw text vs. annotated data.