# Welcome to the Natural Language Modelling

NLP, a.k.a. Natural Language Programming or Processing or Modelling..., is defined as the systematic way of to make any text understandable for machines. Moreover, NLP deals with the enriched world of languages in order to allow machines comprehend and communicate like any other human.

Asides from rhetoric, NLP aims at doing this:

![corpus](images/1_purpose.png)

## Data Exploration

We are going to work with [wikipedia dumps files](https://dumps.wikimedia.org/), which are a collection of articles dumped every day and in different languages. To be practical, we're going to use English articles. To save you time, effort and headaches, a bunch of dumped files have been downloaded for you. The [wikiextractor tool](http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) was used to extract dumps packages into a set of flat text files.

### Basic imports for this module

In [1]:
import numpy as np
import string
import os
import operator
from datetime import datetime
import pickle
import random

### Defining preprocessing functions

In [2]:
def my_tokenizer(text):
    text = remove_punctuation(text)
    text = text.lower()  # downcase
    return text.split()

def remove_punctuation(text):
    # Replace punctuation with tokens so we can use them in our model
    text = text.replace('.', '')
    text = text.replace(',', '')
    text = text.replace('"', '')
    text = text.replace(';', '')
    text = text.replace('!', '')
    text = text.replace('?', '')
    text = text.replace('(', '')
    text = text.replace(')', '')
    text = text.replace('-', ' ')
    text = text.replace('?', '')
    text = text.replace(':', '')
    text = text.replace("'", '')
    
    return text.translate(string.punctuation)

### Method for getting wikipedia data into the workspace memory, not yours but your computer's one

Do you happen to know what really reading text files in NLP means?

In [3]:
def get_wikipedia_data(n_files):
    prefix = './wikifiles/'

    # checking existance of folder with text files downloaded from wikipedia
    if not os.path.exists(prefix):
        print("Are you sure you've downloaded, converted, and placed the Wikipedia data into the proper folder (wikifiles)?")
        print("Quitting...")
        exit()

    # getting list of files from folder and subfolders
    input_files = []
    for folder in os.listdir(prefix):
        for f in os.listdir(prefix + "/" + folder):
            if f.startswith('wiki'):
                input_files.append("/" + folder + "/" + f)

    if len(input_files) == 0:
        print("Looks like you don't have any data files, or they're in the wrong location.")
        print("Please download the data from https://dumps.wikimedia.org/")
        print("Quitting...")
        exit()

    # list of sentences
    sentences = []
    
    # initializing dictionaries of words
    word2idx = {'START': 0, 'END': 1}
    idx2word = ['START', 'END']
    current_idx = 2
    word_idx_count = {0: float('inf'), 1: float('inf')}

    # shuflling files
    random.shuffle(input_files)
    
    if n_files is not None:
        input_files = input_files[:n_files]
        
    # for each file, reads its sentences and cleanses them
    for f in input_files:
        #print("reading:", f)
        for line in open(prefix + "/" + f):
            line = line.strip()
            # don't count headers, structured data, lists, etc...
            if line and line[0] not in ('[', '*', '-', '|', '=', '{', '}', '<', '/', '\\'):
                # sentence into a list of words
                sentence_lines = line.split('. ')
                
                for sentence in sentence_lines:
                    # cleasing sentence to get a list of words
                    tokens = my_tokenizer(sentence)
                    # each word is indexed and counted
                    for t in tokens:
                        if t not in word2idx:
                            word2idx[t] = current_idx
                            idx2word.append(t)
                            current_idx += 1    
                        idx = word2idx[t]
                        # counting
                        word_idx_count[idx] = word_idx_count.get(idx, 0) + 1
                    # replacing sentence with indexes of words
                    sentence_by_idx = [word2idx[t] for t in tokens]
                    sentences.append(sentence_by_idx)
    
    with open('1.1_read_wikis.pickle', 'wb') as f:
        # Pickle the 'data' dictionary using the highest protocol available.
        pickle.dump((word_idx_count, idx2word, sentences), f, pickle.HIGHEST_PROTOCOL)
        
    return word_idx_count, idx2word, sentences

><img src="images/task.png" width="5%" height="5%">

>**Exercise:** Below, execute get_wikipedia_data specifying a number of files up to 200 (otherwise we wouln't go home today). Notice that the function returns three elements, check them all a little bit and see in what all of those wiki files have been turned into.

In [4]:
word2idx, idx2word, sentences = get_wikipedia_data(n_files=10000)

In [5]:
len(idx2word)

3106493

In [6]:
[idx2word[word] for word in sentences[1]]

['the',
 'constitution',
 'of',
 'mongolia',
 'provides',
 'for',
 'freedom',
 'of',
 'religion',
 'and',
 'the',
 'mongolian',
 'government',
 'generally',
 'respects',
 'this',
 'right',
 'in',
 'practice',
 'however',
 'the',
 'law',
 'somewhat',
 'limits',
 'proselytism',
 'and',
 'some',
 'religious',
 'groups',
 'have',
 'faced',
 'bureaucratic',
 'harassment',
 'or',
 'been',
 'denied',
 'registration']

## Subsampling... as always

Well, if you guessed it, we've juts created a vocabulary! ...and by vocabulary I don't mean just numbers, vowels and consonants.

In some cases, corpora comprise vast amount of information. Imposible to be processed in just one go. Particulary, sampling is the best way to extract a piece of information that should represent quite well the whole dataset. It means that in somehow a text representation technique is needed in order to make a selection and then an extraction of a representative small dataset to deal with.

To do so, the simplest and easiest technique of human kind is applied: counting!

...but to be fancier lets just called as bag of words, then n-grams.

Vocabulary turns into features for a Machine Learning model. Let's say that here we have up 2000 terms in 1000 files (documents), which makes it a matrix of 2000x1000 (tiny big).

But... lets imagine what if a machine tries to learn about the whole set of words in a language...

__The Second Edition of the 20-volume Oxford English Dictionary contains full entries for 171,476 words in current use, and 47,156 obsolete words. To this may be added around 9,500 derivative words included as subentries.__

In a bigram model that means a matrix of at least 171476x171476, and that's a lot. An average human cannot learn so much words neither.

In [7]:
def subsampling(word2idx, idx2word, sentences, n_vocab):
    # restrict vocab size
    sorted_word_idx_count = sorted(word2idx.items(), key=operator.itemgetter(1), reverse=True)
    word2idx_small = {}
    new_idx = 0
    idx_new_idx_map = {}
    
    for idx, count in sorted_word_idx_count[:n_vocab]:
        word = idx2word[idx]
        word2idx_small[word] = new_idx
        idx_new_idx_map[idx] = new_idx
        new_idx += 1
        
    # let 'unknown' be the last token
    word2idx_small['UNKNOWN'] = new_idx
    idx2word_small = {v:k for k,v in word2idx_small.items()}
    unknown = new_idx
    
    # map old idx to new idx
    sentences_small = []
    for sentence in sentences:
        if len(sentence) > 0:
            new_sentence = [idx_new_idx_map[idx] if idx in idx_new_idx_map else unknown for idx in sentence]
            sentences_small.append(new_sentence)
              
    with open('1.2_subsampled_wikis.pickle', 'wb') as f:
        # Pickle the 'data' dictionary using the highest protocol available.
        pickle.dump((word2idx_small, idx2word_small, sentences_small), f, pickle.HIGHEST_PROTOCOL)

    return word2idx_small, idx2word_small, sentences_small

In [8]:
word2idx_sampled, idx2word_sampled, sentences_sampled  = subsampling(word2idx, idx2word, sentences, n_vocab=100000)

In [9]:
[idx2word_sampled[word] for word in sentences_sampled[1]]

['the',
 'constitution',
 'of',
 'mongolia',
 'provides',
 'for',
 'freedom',
 'of',
 'religion',
 'and',
 'the',
 'UNKNOWN',
 'government',
 'generally',
 'UNKNOWN',
 'this',
 'right',
 'in',
 'practice',
 'however',
 'the',
 'law',
 'somewhat',
 'limits',
 'UNKNOWN',
 'and',
 'some',
 'religious',
 'groups',
 'have',
 'faced',
 'UNKNOWN',
 'UNKNOWN',
 'or',
 'been',
 'denied',
 'registration']

><img src="images/task.png" width="5%" height="5%">

> **Exercise (OPTIONAL):** Create another sampling method. Instead of receiving a parameter for a maximum number of vocabulary, it should receive a minimum percentage of words out of the complete vocabulary to be contained in the sample.