<a href="https://colab.research.google.com/github/Jetsukda/Deep-Learning-with-Python/blob/main/6.%20Deep%20learning%20for%20text%20and%20sequences.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This chapter covers**
- Preprocessing text data into useful representations
- Working with recurrent neural networks
- Using 1D convnets for sequence processing

# 6.1 Working with text data

**Text is one of the most widespread forms of sequence data.**

- Sequence of characters.
- Sequence of words.
- It's most common to work at the level of words.

**The deep-learning sequence-processing models introduced in the following sections can use text to produce a basic form of natural language understanding, suitable for tasks such as:**
- Document classification
- Sentiment analysis
- Author identification
- Question answering (QA)


**Natural language processing (NLP)**
- Deep learning for natural language processing is pattern recognition applied to words, sentences, and paragraphs, in much the same way that computer vision is pattern recognition applied to pixels.
- Deep learning models don't take raw text as input: they only work with numeric tensors.
- ***Vectorizing*** is the process of transforming text into numeric tensors.
- This can be done in multiple ways:
    - Segment text into words, and transform each word into a vector.
    - Segment text into characters, and transform each character into a vector.
    - Extract **n-grams** of **words** or **characters**, and transform each n-gram into a vector.
    - ***N-grams*** are overlapping groups of multiple consecutive words or characters.
- **Words, characters, or n-grams** are called ***tokens***.
- Breaking text into such tokens is called ***tokenization***.
- All text-vectorization processes consist of applying some tokenization scheme and then associating numeric vectors with the generated tokens.
- These vectors, packed into **sequence tensors**, are fed into deep neural networks.
- There are multiple ways to associate a vector with a token.
    - one-hot encoding
    - token embedding (word embedding)
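
The tokenization schemes listed above can be sketched in plain Python. This is a minimal illustration, not how any particular library tokenizes: the helper names `word_tokens`, `char_tokens`, and `ngrams` are made up for this note, and the regular expression is one simple assumption about what counts as a word.

```python
import re

def word_tokens(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def char_tokens(text):
    """Treat every character (including spaces) as a token."""
    return list(text)

def ngrams(tokens, n):
    """Return overlapping groups of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The cat sat on the mat."
words = word_tokens(sentence)
bigrams = ngrams(words, 2)                      # word-level 2-grams
char_trigrams = ngrams(char_tokens("cat"), 3)   # character-level 3-grams

print(words)          # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(bigrams[:2])    # [('the', 'cat'), ('cat', 'sat')]
print(char_trigrams)  # [('c', 'a', 't')]
```

Each token (word, character, or n-gram) would then be mapped to a vector by one of the two schemes listed above.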

<p align="center">
        <img src="https://drive.google.com/uc?export=view&id=1iSf4bn8xyBRpFFIk1Nr1f_PHDzq9-WcC" width="700">
        </p>

## 6.1.1 One-hot Encoding

- It consists of associating a unique index with every word and then turning this integer index `i` into a binary vector of size `N` (the size of the vocabulary).
- The vector is all zeros except for the `i`th entry, which is 1.
- One-hot encoding can be done at the word level or at the character level. The cells below implement word-level one-hot encoding with Keras.
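
To make the definition concrete, here is a hand-written toy version using NumPy. It is a sketch under a few assumptions: words are indexed in order of first appearance, index 0 is left unused (the Keras word index shown further down also starts at 1), and `token_index` and `one_hot` are names invented for this note.

```python
import numpy as np

samples = ["The cat sat on the mat.", "The dog ate my homework."]

# Build a word index by hand; index 0 is deliberately left unused.
token_index = {}
for sample in samples:
    for word in sample.lower().replace(".", "").split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

vocab_size = len(token_index) + 1  # N, counting the unused index 0

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vector = np.zeros(vocab_size)
    vector[token_index[word]] = 1.0
    return vector

print(token_index["cat"])  # 2
print(one_hot("cat"))      # [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
```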

In [2]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [3]:
# Using Keras for word-level one-hot encoding
samples = ["The cat sat on the mat.", "The dog ate my homework."]

# Creates a tokenizer
# Configured to only take into account the 1,000 most common words
tokenizer = Tokenizer(num_words=1000)

In [15]:
# Builds the word index
tokenizer.fit_on_texts(samples)
print("{index:word}->",tokenizer.index_word)
print("{word:index}->",tokenizer.word_index)

{index:word}-> {1: 'the', 2: 'cat', 3: 'sat', 4: 'on', 5: 'mat', 6: 'dog', 7: 'ate', 8: 'my', 9: 'homework'}
{word:index}-> {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, 'dog': 6, 'ate': 7, 'my': 8, 'homework': 9}


In [18]:
# Turns strings into lists of integer indices
sequences = tokenizer.texts_to_sequences(samples)
for sentence, sequence in zip(samples, sequences):
    print("Sentence:", sentence)
    print("Sequence:", sequence)

Sentence: The cat sat on the mat.
Sequence: [1, 2, 3, 4, 1, 5]
Sentence: The dog ate my homework.
Sequence: [1, 6, 7, 8, 9]


In [22]:
# one-hot encoding
one_hot_results = tokenizer.texts_to_matrix(samples, mode="binary")
one_hot_results

array([[0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])
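
The matrix above has one row per sentence, not one row per word: each row marks which word indices occur anywhere in that sentence. The NumPy sketch below rebuilds such a matrix from the integer sequences printed earlier. It is a hand-written approximation for illustration, not the Keras implementation.

```python
import numpy as np

# Integer sequences copied from the output above.
sequences = [[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]
num_words = 1000  # same cap that was passed to the tokenizer

# One row per sample, one column per word index. Column 0 is never set
# because the word index starts at 1.
matrix = np.zeros((len(sequences), num_words))
for row, sequence in enumerate(sequences):
    matrix[row, sequence] = 1.0  # a repeated index just sets the same cell again

print(matrix.shape)        # (2, 1000)
print(matrix[0, :6])       # [0. 1. 1. 1. 1. 1.]
print(matrix[1, :6])       # [0. 1. 0. 0. 0. 0.]
print(matrix.sum(axis=1))  # [5. 5.]
```

The first sentence contains "the" twice, yet its row still sums to 5, since binary mode records presence rather than counts.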

## 6.1.2 Using word embeddings

- Another popular and powerful way to associate a vector with a word is to use **dense word vectors**, also called **word embeddings**.
- **One-hot encoding**: **binary**, **sparse** (mostly made of zeros), and **very high-dimensional** (same dimensionality as the number of words in the vocabulary).
- **Word embeddings**: **low-dimensional floating-point vectors** (that is, dense vectors, as opposed to sparse vectors).
- **Word embeddings** are learned from data.
    - It's common to see word embeddings that are 256-dimensional, 512-dimensional, or 1,024-dimensional when dealing with very large vocabularies.
- **One-hot encoding** words generally leads to vectors that are 20,000-dimensional or greater (capturing a vocabulary of 20,000 tokens).
- **Word embeddings** pack more information into far fewer dimensions.
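
A small NumPy sketch of the size difference, assuming a 20,000-token vocabulary and 256-dimensional embeddings as in the figures above. The embedding table here is random and purely hypothetical; the point is that looking up a row of the table gives the same result as multiplying a one-hot vector by it, without ever building the sparse vector.

```python
import numpy as np

vocab_size = 20_000   # vocabulary size from the notes above
embedding_dim = 256   # one of the typical embedding sizes mentioned above

rng = np.random.default_rng(0)
# Hypothetical embedding table: one dense row per token, randomly initialized.
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim)).astype("float32")

token_id = 42  # arbitrary token index chosen for illustration

# One-hot route: a sparse 20,000-dimensional vector times the table.
one_hot = np.zeros(vocab_size, dtype="float32")
one_hot[token_id] = 1.0
via_one_hot = one_hot @ embedding_matrix

# Embedding route: read row `token_id` of the table directly.
via_lookup = embedding_matrix[token_id]

print(one_hot.shape, via_lookup.shape)       # (20000,) (256,)
print(np.allclose(via_one_hot, via_lookup))  # True
```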

<p align="center">
        <img src="https://drive.google.com/uc?export=view&id=1edEiY05tRr2Z7vQcl4WVCKCPr4yRsekE" width="700" >
        </p>

- There are two ways to obtain word embeddings:
    - Learn word embeddings jointly with the main task you care about
        - Document classification
        - Sentiment prediction
        - In this setup, you start with random word vectors and then learn word vectors in the same way you learn the weights of a neural network.
    - Load into your model word embeddings that were precomputed using a different machine-learning task than the one you're trying to solve.
        - These are called **pretrained word embeddings**.

**Learning word embeddings with the embedding layer**

# 6.2 Understanding recurrent neural networks

# 6.3 Advanced use of recurrent neural networks

# 6.4 Sequence processing with convnets