# INTRODUCTION TO DEEP LEARNING FOR TEXT.

This notebook is based on the notes taken from the following books:
- "Deep Learning with Python", 2nd Ed., F. Chollet
- "Deep Learning with TensorFlow and Keras", 3rd Ed.

NLP uses machine learning and large datasets to give computers the ability to return something useful from text.
Typical tasks include text classification, content filtering, sentiment analysis (e.g. good vs. bad), language modeling,
translation, summarization, and synthesizing images from text.

## 1) First Step: Text Preparation

Models can only process numbers, so text must first be translated into numeric form (text vectorization).

- Text Vectorization Steps:
    - Standardize (lowercase, punctuation removal)
    - Split text into units ("tokenize")
        - Word-level tokenization
        - N-gram tokenization
        - Character-level tokenization
    - Convert tokens to numbers by indexing them in a vocabulary (these integer indices can later feed an embedding layer)
        - [UNK] (out-of-vocabulary token): index 1
        - Mask token: index 0
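
The three tokenization levels listed above can be compared on one sentence. This is a minimal sketch in plain Python; the helper names (`standardize`, `word_tokens`, `ngram_tokens`, `char_tokens`) and the sample sentence are illustrative assumptions, not part of the books or of Keras.

```python
import string

def standardize(text):
    # Lowercase and strip ASCII punctuation.
    text = text.lower()
    return "".join(ch for ch in text if ch not in string.punctuation)

def word_tokens(text):
    # Word-level: split the standardized text on whitespace.
    return standardize(text).split()

def ngram_tokens(text, n=2):
    # N-gram level: sliding window of n consecutive words.
    words = word_tokens(text)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_tokens(text):
    # Character level: every character (spaces included) is a token.
    return list(standardize(text))

sample = "The cat sat on the mat."
print(word_tokens(sample))        # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(ngram_tokens(sample, n=2))  # ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']
print(char_tokens(sample)[:7])    # ['t', 'h', 'e', ' ', 'c', 'a', 't']
```

Word-level tokens keep the vocabulary small, N-grams keep a little local word order, and character tokens avoid out-of-vocabulary words at the cost of much longer sequences.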

In [None]:
# A Text Vectorizer Class in pure Python

import string

class Vectorizer():
    
    def make_vocabulary(self, dataset):
        self.vocabulary = {'': 0, '[UNK]': 1}
        for text in dataset:
            text = self.standarize(text)
            tokens = self.tokenize(text)
            for token in tokens:
                if token not in self.vocabulary:
                    self.vocabulary[token] = len(self.vocabulary)
        # Reverse mapping (index -> token), used by decode()
        self.inverse_vocabulary = dict((v, k) for k, v in self.vocabulary.items())
                    
        
    def standarize(self, text):
        text = text.lower()
        returned_text = "".join(char for char in text if char not in string.punctuation)
        return returned_text
    
    def tokenize(self, text):
        # Split into words; callers pass text that is already standardized,
        # so it is not standardized a second time here.
        return text.split()
    
    def encode(self, text):
        text = self.standarize(text)
        tokens = self.tokenize(text)
        return [self.vocabulary.get(token,1) for token in tokens]
    
    def decode(self, int_sequence):
        return " ".join(self.inverse_vocabulary.get(i,"[UNK]") for i in int_sequence)

In [None]:
text4test = ['En un lugar de la Mancha de cuyo nombre no quiero acordarme', 
             'no ha mucho tiempo vivia un ingenioso hidalgo...']

textVect = Vectorizer()
textVect.make_vocabulary(text4test)



In [None]:
std = textVect.standarize(text4test[0])
std

In [None]:
encod = textVect.encode(text4test[0])
encod

In [None]:
# In practice, TensorFlow performs all of these tasks with the TextVectorization preprocessing layer
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import TextVectorization

In [None]:
text_vect = TextVectorization(output_mode = 'int')
text_vect.adapt(text4test)


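Word representations:

- Order matters: text as a sequence (RNN)
- Order is discarded: text as an unordered set of words (bag-of-words)
- Order-agnostic architecture that receives word positions as extra input (Transformer)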
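
To make the difference between a sequence and a bag-of-words concrete, here is a small sketch in plain Python. The two example sentences are assumptions chosen for illustration; they contain the same words in a different order.

```python
from collections import Counter

a = "the dog bit the man".split()
b = "the man bit the dog".split()

# Sequence view: token order is preserved, so the two sentences differ.
sequences_equal = (a == b)

# Bag-of-words view: only token counts are kept, so they become identical.
bags_equal = (Counter(a) == Counter(b))

print(sequences_equal)  # False
print(bags_equal)       # True
```

A bag-of-words model cannot tell these two sentences apart, which is the price paid for its simplicity.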
 

In [None]:
# Run this code only once
# Prepare the folder structure that keras.utils.text_dataset_from_directory will use later

import os, pathlib, shutil, random

base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)

In [None]:

batch_size = 32

# Create batched datasets with text_dataset_from_directory

train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)


### Processing the dataset as a "bag-of-words" (tokens): order does not matter

- Individual words (unigrams)
- Groups of N consecutive words (N-grams)

In this approach, words are encoded as vectors of zeros and ones (one-hot encoding), and a whole text becomes a single such vector with a 1 for every token it contains. The problem is that this is unaffordable on large corpora, because the vector length grows with the vocabulary size. That is why word embeddings are preferred for large corpora: they compress the word representations into a low-dimensional latent space.
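
The encoding described above can be sketched in plain Python. The helpers `build_vocabulary` and `multi_hot`, the choice of index 0 for `[UNK]`, and the sizes in the final comparison (10,000 tokens, 256 dimensions, 600 tokens per text) are assumptions for illustration only, not values taken from the books.

```python
def build_vocabulary(texts):
    # Index 0 is reserved for out-of-vocabulary tokens in this sketch.
    vocab = {"[UNK]": 0}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def multi_hot(text, vocab):
    # One slot per vocabulary entry; 1 if the token occurs in the text.
    vector = [0] * len(vocab)
    for token in text.lower().split():
        vector[vocab.get(token, 0)] = 1
    return vector

texts = ["a good movie", "a bad movie"]
vocab = build_vocabulary(texts)
encoded = multi_hot("a good good film", vocab)
print(vocab)    # {'[UNK]': 0, 'a': 1, 'good': 2, 'movie': 3, 'bad': 4}
print(encoded)  # [1, 1, 1, 0, 0]

# Size comparison for a token-by-token representation (illustrative numbers):
vocab_size, embedding_dim, seq_len = 10000, 256, 600
one_hot_values = seq_len * vocab_size      # one one-hot vector per token
embedded_values = seq_len * embedding_dim  # one dense vector per token
print(one_hot_values, embedded_values)     # 6000000 153600
```

Note that the repeated word "good" and the word order both disappear in the multi-hot vector, and that the unknown word "film" falls into the `[UNK]` slot.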

In [None]:
# The Keras TextVectorization layer can be used for multiple text preprocessing tasks

text_vect_bow = TextVectorization(max_tokens = 10000, output_mode='multi_hot')
text_only_train_ds = train_ds.map(lambda x, y: x)
text_vect_bow.adapt(text_only_train_ds)

In [None]:
binary_1gram_train_ds = train_ds.map(
    lambda x, y: (text_vect_bow(x), y),
    num_parallel_calls=4)
binary_1gram_val_ds = val_ds.map(
    lambda x, y: (text_vect_bow(x), y),
    num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map(
    lambda x, y: (text_vect_bow(x), y),
    num_parallel_calls=4)

In [None]:
# Check Output 
for inputs, targets in binary_1gram_train_ds:
    print("inputs shape:", inputs.shape)
    print("inputs type:", inputs.dtype)
    print("targets shape:", targets.shape)
    print("targets type:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

In [None]:
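# Create a reusable model
from tensorflow.keras import layers

def build_text_model(max_tokens=10000, hidden_dims=16):
    inputs = keras.Input(shape=(max_tokens,))
    # A simple dense layer
    x = layers.Dense(hidden_dims, activation="relu")(inputs)
    # NOTE: the original cell is cut off after "layers.Dense"; the activation above
    # and the lines below are an assumed completion for binary classification.
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
    return model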

Wikipedia defines word embedding as the collective name for a set of language modeling and feature
learning techniques in natural language processing (NLP) where words or phrases from a vocabulary
are mapped to vectors of real numbers.

Today, word embedding is a foundational technique for all kinds of NLP tasks, such as text classification,
document clustering, part-of-speech tagging, named entity recognition, sentiment analysis, and many more.
Word embeddings result in dense, low-dimensional vectors and, along with LSA and topic models, can be
thought of as a vector of latent features for the word.

Word embeddings are based on the distributional hypothesis, which states that words that occur in
similar contexts tend to have similar meanings. Hence the class of word embedding-based encodings
is also known as distributed representations.
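
The distributional hypothesis can be illustrated with simple co-occurrence counts. The sketch below is not a learned embedding: it only counts neighbouring words in a tiny made-up corpus and compares the resulting count vectors with cosine similarity. The corpus, the window size, and the helper names are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "stocks fell on monday",
]

def cooccurrence_vectors(sentences, window=2):
    # For every word, count the words that appear within `window` positions.
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = cooccurrence_vectors(corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_stocks = cosine(vectors["cat"], vectors["stocks"])
print(round(sim_cat_dog, 3))  # 0.917
print(sim_cat_stocks)         # 0.0
```

"cat" and "dog" share most of their contexts and end up with similar vectors, while "stocks" shares none. Learned word embeddings capture the same kind of signal in far fewer dimensions.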
