### Text Data Structure

#### Word Representation

**◆ Word vectors (word representations)**
- The most basic problem of natural language processing is how to make a computer recognize natural language.
- A computer stores natural language as binary codes (Unicode, ASCII, ...).
    Ex) 언 : 1100010110111000, 어 : 1100010110110100 (Unicode code points U+C5B8 and U+C5B4, the two syllables of 언어, the Korean word for "language")
- This representation carries no information about the characteristics of the words at all.
- Word vectors that do capture such characteristics can be used for tasks such as classification and clustering.

**◆ One-hot encoding**

Each word is expressed as a single vector that has a 1 at exactly one position and 0 everywhere else. Ex) {Thomas Jefferson made Jefferson building}

<center>

||Thomas|Jefferson|made|building|
|:---:|:---:|:---:|:---:|:---:|
|Thomas|1|0|0|0|
|Jefferson|0|1|0|0|
|made|0|0|1|0|
|Jefferson|0|1|0|0|
|building|0|0|0|1|

</center>

- To find the nth word of the sentence, look up the column that holds the 1 in the nth row. (The sentence is 100% recoverable.)
- Each row is a binary row vector in which exactly one entry is 1 and the rest are 0.
- The columns act as a dictionary of words (terms).
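
The table above can be reproduced with a few lines of NumPy. This is a minimal sketch, not part of the original notes; the names `tokens`, `vocab`, `word_to_col`, and `one_hot` are illustrative assumptions.

```python
import numpy as np

tokens = "Thomas Jefferson made Jefferson building".split()

# Vocabulary in order of first appearance; each word gets one column.
vocab = list(dict.fromkeys(tokens))
word_to_col = {word: col for col, word in enumerate(vocab)}

# One row per token in the sentence, one column per vocabulary word.
one_hot = np.zeros((len(tokens), len(vocab)), dtype=int)
for row, word in enumerate(tokens):
    one_hot[row, word_to_col[word]] = 1

print(vocab)
print(one_hot)

# The sentence is recoverable: read off the column holding the 1 in each row.
restored = [vocab[col] for col in one_hot.argmax(axis=1)]
print(restored == tokens)
```

Because every row contains exactly one 1, `argmax` along each row is enough to restore the original word order.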

**◆ Disadvantages of one-hot encoding and alternatives**

- The disadvantage of one-hot encoding is that it becomes inefficient, because the size of the vector grows with the number of words in the vocabulary.
- To overcome this and other disadvantages, two kinds of alternatives have been proposed.

**◆ Two alternatives**
1. Frequency-information based

    a. Word frequency vector (bag of words)

    b. Word-document matrix methods (TF-IDF, etc.)

    c. Co-occurrence matrix: word-word matrix, document-document matrix

2. Meaning (topic / characteristic) information based

    a. Topic vector (semantic vector)

    b. word2vec / GloVe (the methods differ, but the solutions are the same)

    c. BERT, GPT, ...
    
**◆ Word frequency vector: bag of words**
- A method that tries to capture the meaning of a sentence from its collection of words alone (word frequencies), ignoring the order and grammar of the words in the sentence.
- Unlike a sequence of one-hot vectors, it keeps only the number of appearances, so the original document is difficult to reconstruct.
- The most basic form is the binary bag-of-words vector, which records only whether each word appears.
- Binary bag-of-words vectors are useful for document search indexing, which records which words are used in which documents.
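
As a small illustration of the points above, the following sketch builds a frequency vector and a binary vector for one sentence using only the standard library. The vocabulary order and the variable names (`bow`, `binary_bow`) are assumptions made for this example.

```python
from collections import Counter

doc = "the fox chases the rabbit".split()
vocab = ["the", "fox", "rabbit", "chases", "caught", "cabbage", "ate"]

counts = Counter(doc)

# Frequency vector: how many times each vocabulary word appears.
bow = [counts[word] for word in vocab]

# Binary vector: only whether each vocabulary word appears at all.
binary_bow = [int(counts[word] > 0) for word in vocab]

print(bow)
print(binary_bow)

# Word order is lost: a reordered sentence produces the same counts.
same_bag = Counter("the rabbit chases the fox".split()) == counts
print(same_bag)
```

The last check shows why the document cannot be reconstructed: two sentences with opposite meanings map to the same vector.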

#### Integer Encoding & Padding

**◆ Integer encoding**
- It is a basic step among the techniques for converting text to numbers in natural language processing.
- A preprocessing task that maps each word to a unique integer.
- If the text contains 5,000 distinct words, each of them is mapped to a unique integer from 1 to 5,000. In other words, each word is given an index, usually after sorting by word frequency.
- One way to assign the integers is to build a vocabulary sorted in descending order of frequency and then assign integers in ascending order, so that the most frequent word receives the smallest integer.

**◆ Padding**
- Sentences (or documents) can have different lengths, but when all of them have the same length a machine can treat them as a single matrix and process them together as a batch.
- Padding means artificially equalizing the lengths of several sentences so that they can be processed in parallel.
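
The two steps can be sketched in plain Python before turning to library tools. The toy sentences and the choice of 0 as the padding value are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical, already tokenized sentences.
sentences = [["barber", "person"], ["barber", "huge", "secret"], ["secret"]]

# Integer encoding: sort words by frequency and number them from 1.
counts = Counter(word for sentence in sentences for word in sentence)
word_to_index = {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}
encoded = [[word_to_index[word] for word in sentence] for sentence in sentences]

# Padding: append 0 until every sentence has the length of the longest one.
max_len = max(len(sentence) for sentence in encoded)
padded = [sentence + [0] * (max_len - len(sentence)) for sentence in encoded]

print(word_to_index)
print(padded)
```

Index 0 is left unused by the encoding so that it can serve as the padding value without colliding with a real word.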

#### Word-Document Matrix

**◆ Term frequency (TF)**
- The number of times a word appears in a document.
- If a particular word appears frequently in a particular document, the word is considered closely related to that document.

```
// Example

Doc1 : the fox chases the rabbit
Doc2 : the rabbit ate the cabbage
Doc3 : the fox caught the rabbit
```

Rows are words and columns are documents (TDM: Term-Document Matrix).

<center>

|          | Doc1 | Doc2 | Doc3 |
|:--------:|:----:|:----:|:----:|
|   the    |  2   |  2   |  2   |
|   fox    |  1   |  0   |  1   |
|  rabbit  |  1   |  1   |  1   |
|  chases  |  1   |  0   |  0   |
|  caught  |  0   |  0   |  1   |
| cabbage  |  0   |  1   |  0   |
|   ate    |  0   |  1   |  0   |

</center>
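
The term-document matrix above can be built directly from the three example documents. This is a minimal standard-library sketch; the dictionary layout (`tdm` mapping each term to its row) is an assumption chosen for readability.

```python
from collections import Counter

docs = {
    "Doc1": "the fox chases the rabbit",
    "Doc2": "the rabbit ate the cabbage",
    "Doc3": "the fox caught the rabbit",
}
terms = ["the", "fox", "rabbit", "chases", "caught", "cabbage", "ate"]

# Count the words of each document once.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# Rows are terms, columns are documents (Doc1, Doc2, Doc3).
tdm = {term: [counts[name][term] for name in docs] for term in terms}

for term, row in tdm.items():
    print(f"{term:>8}", row)
```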

#### Word-Document Matrix: TF-IDF

**◆ Term frequency-inverse document frequency (TF-IDF)**
- **Zipf's Law** : The frequency of use of any word is inversely proportional to the rank of that word. (Ex: the 1st-ranked word is 3 times as frequent as the 3rd-ranked word)
- Give a low weight to words that appear frequently across documents but do not help to understand the meaning of a document -> IDF


**IDF**

A weight that measures the importance of a word. 

$$\mathrm{IDF} = \log(\frac{\mathrm{N}}{\mathrm{DF}})$$

N is the total number of documents and DF is the document frequency (the number of documents in which the word appears). The base of the logarithm is a matter of convention; the tables in this section use base 2.

- The smaller the DF, the higher the importance of the word.
- The higher the IDF, the higher the importance of the word.
- Words with high TF-IDF values discriminate well between documents (they are the important words in information retrieval).
- Calculating TF-IDF: $\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$, where $t$ is the word and $d$ is the document.

<center>

|          | Doc1 | Doc2 | Doc3 | DF | N/DF | IDF=$\log_2(\mathrm{N}/\mathrm{DF})$ |
|:--------:|:----:|:----:|:----:|:--:|:----:|:------------------------------------:| 
|   the    |  2   |  2   |  2   | 3  | 3/3  |            $\log_2(3/3)$             |
|   fox    |  1   |  0   |  1   | 2  | 3/2  |            $\log_2(3/2)$             |
|  rabbit  |  1   |  1   |  1   | 3  | 3/3  |            $\log_2(3/3)$             |
|  chases  |  1   |  0   |  0   | 1  | 3/1  |            $\log_2(3/1)$             |
|  caught  |  0   |  0   |  1   | 1  | 3/1  |            $\log_2(3/1)$             |
| cabbage  |  0   |  1   |  0   | 1  | 3/1  |            $\log_2(3/1)$             |
|   ate    |  0   |  1   |  0   | 1  | 3/1  |            $\log_2(3/1)$             |

</center>
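
The DF and IDF columns of the table can be checked numerically. The sketch below hard-codes the term-document counts from the example and uses base-2 logarithms as in the table; the variable names are illustrative.

```python
import math

# Term-document counts for Doc1, Doc2, Doc3 (copied from the table).
tdm = {
    "the": [2, 2, 2],
    "fox": [1, 0, 1],
    "rabbit": [1, 1, 1],
    "chases": [1, 0, 0],
    "caught": [0, 0, 1],
    "cabbage": [0, 1, 0],
    "ate": [0, 1, 0],
}
n_docs = 3

# DF: number of documents in which the term appears at least once.
df = {term: sum(1 for tf in row if tf > 0) for term, row in tdm.items()}

# IDF = log2(N / DF)
idf = {term: math.log2(n_docs / df[term]) for term in tdm}

for term in tdm:
    print(f"{term:>8}  DF={df[term]}  IDF={idf[term]:.5f}")
```

Words that occur in every document ("the", "rabbit") get an IDF of 0, which is why they vanish from the TF-IDF table later in this section.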

**◆ TF standardization and normalization**
- The longer the document, the higher the frequency of occurrence of each word and the higher the chance that the document is retrieved.
- Likewise, the longer the document, the higher the chance that it appears similar to other documents.
- Standardization and normalization of TF are needed to compensate for these weak points.

**◆ Standardization** 

$$z = \frac{\mathrm{TF}-\mu(\mathrm{TF})}{\sigma(\mathrm{TF})}$$

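Example) Doc1

$$\mu(\mathrm{TF}) = \frac{5}{7} \quad \left(\frac{\text{total term occurrences in Doc1}}{\text{number of distinct terms}}\right)$$

$$\sigma(\mathrm{TF}) = \sqrt{\frac{(2 - \mu)^2 + 3(1 - \mu)^2 + 3(0-\mu)^2}{6}}$$

Here $\sigma$ is the sample standard deviation, so the denominator is $7 - 1 = 6$.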

<center>

|          |   Doc1    |   Doc2    |    Doc3    |
|:--------:|:---------:|:---------:|:----------:|
|   the    |  1.70084  |  1.70084  |  1.70084   |
|   fox    |  0.37796  | -0.944911 |  0.37796   |
|  rabbit  |  0.37796  |  0.37796  |  0.37796   |
|  chases  |  0.37796  | -0.944911 | -0.944911  |
|  caught  | -0.944911 | -0.944911 |  0.37796   |
| cabbage  | -0.944911 |  0.37796  | -0.944911  |
|   ate    | -0.944911 |  0.37796  | -0.944911  |

</center>
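
The standardized values in the table can be reproduced with NumPy, assuming the sample standard deviation (`ddof=1`, that is, division by 6) used in the Doc1 example. The array layout (rows are terms, columns are documents) mirrors the tables above.

```python
import numpy as np

# Rows: the, fox, rabbit, chases, caught, cabbage, ate; columns: Doc1, Doc2, Doc3.
tdm = np.array([
    [2, 2, 2],
    [1, 0, 1],
    [1, 1, 1],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 1, 0],
], dtype=float)

mu = tdm.mean(axis=0)            # per-document mean over the 7 terms (5/7 each)
sigma = tdm.std(axis=0, ddof=1)  # sample standard deviation (divides by 7 - 1 = 6)
z = (tdm - mu) / sigma

print(np.round(z, 5))
```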

**◆ Normalization**

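Apply log scaling to TF and divide by the total word count of the document:

$$\frac{1 + \log_2(\mathrm{TF})}{n_i} \quad (\mathrm{TF} > 0), \qquad 0 \quad (\mathrm{TF} = 0)$$

$n_i$ : total number of word occurrences in document $i$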

<center>

|          |   Doc1    |   Doc2    | Doc3 |
|:--------:|:---------:|:---------:|:----:|
|   the    |    0.4    |    0.4    | 0.4  |
|   fox    |    0.2    |     0     | 0.2  |
|  rabbit  |    0.2    |    0.2    | 0.2  |
|  chases  |    0.2    |     0     |  0   |
|  caught  |     0     |     0     | 0.2  |
| cabbage  |     0     |    0.2    |  0   |
|   ate    |     0     |    0.2    |  0   |

</center>

where $0.4 = \frac{1 + \log_2{2}}{5}$ and $0.2 = \frac{1 + \log_2{1}}{5}$
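
A short NumPy sketch of this normalization, assuming a value of 0 wherever TF is 0 (as in the table) and base-2 logarithms. The names `n_i` and `norm_tf` are illustrative.

```python
import numpy as np

# Rows: the, fox, rabbit, chases, caught, cabbage, ate; columns: Doc1, Doc2, Doc3.
tdm = np.array([
    [2, 2, 2],
    [1, 0, 1],
    [1, 1, 1],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 1, 0],
], dtype=float)

n_i = tdm.sum(axis=0)  # total word count of each document (5 for all three)

# 1 + log2(TF) where TF > 0, and 0 elsewhere.
log_tf = np.zeros_like(tdm)
mask = tdm > 0
log_tf[mask] = 1 + np.log2(tdm[mask])

norm_tf = log_tf / n_i
print(norm_tf)
```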

**◆ Normalized TF-IDF: Normalized TF times IDF**

<center>

| Normalized TF-IDF |                Doc1                |             Doc2             |  Doc3   |
|:-----------------:|:----------------------------------:|:----------------------------:|:-------:|
|        the        |    0 = $0.4 \times \log_2(3/3)$    | 0 = $0.4 \times \log_2(3/3)$ |    0    |
|        fox        | 0.11699 = $0.2 \times \log_2(3/2)$ |              0               | 0.11699 |
|      rabbit       |    0 = $0.2 \times \log_2(3/3)$    |              0               |    0    |
|      chases       | 0.31699 = $0.2 \times \log_2(3/1)$ |              0               |    0    |
|      caught       |                 0                  |              0               | 0.31699 |
|      cabbage      |                 0                  |           0.31699            |    0    |
|        ate        |                 0                  |           0.31699            |    0    |

</center>
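
Putting the pieces together, the normalized TF-IDF table can be computed in one pass. This is a sketch under the same assumptions as the tables (base-2 logarithms, 0 where TF is 0); `safe_tf` is a helper introduced here only to avoid taking the logarithm of 0.

```python
import numpy as np

terms = ["the", "fox", "rabbit", "chases", "caught", "cabbage", "ate"]
tdm = np.array([
    [2, 2, 2],
    [1, 0, 1],
    [1, 1, 1],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 1, 0],
], dtype=float)

# IDF = log2(N / DF), one value per term.
n_docs = tdm.shape[1]
df = (tdm > 0).sum(axis=1)
idf = np.log2(n_docs / df)

# Normalized TF = (1 + log2(TF)) / n_i where TF > 0, else 0.
safe_tf = np.where(tdm > 0, tdm, 1.0)
norm_tf = np.where(tdm > 0, (1 + np.log2(safe_tf)) / tdm.sum(axis=0), 0.0)

# Multiply each term's row by that term's IDF.
tfidf = norm_tf * idf[:, None]

for term, row in zip(terms, np.round(tfidf, 5)):
    print(f"{term:>8}", row)
```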

**◆ Disadvantages of TF-IDF**
- Two pieces of text that have similar meanings (topics) but use different words are not close to each other in the TF-IDF vector space. Likewise, words with the same meaning but different spellings produce TF-IDF vectors that do not lie close together.

**The TF-IDF method is therefore difficult to use for finding documents that are similar in meaning (topic).**

#### Co-Occurrence Matrix

**◆ Co-occurrence matrix**

A method of directly counting the number of times words appear together in a particular context. The co-occurrence counts are arranged in a matrix, and the numeric rows of that matrix are used as word vectors.

```
Ex) 
Myeong-seok and Jun-seon went to America
Myeong-seok and Sang-ho went to the library
Myeong-seok and Jun-seon like cold noodles
```

Co-occurrence matrix (word-word matrix) → a square, symmetric matrix

<center>

![wordvector](./images/wordvector.png)

</center>

→ Can be used for social network analysis (e.g., calculating the centrality of each word: degree, closeness, betweenness, and eigenvector centrality)

- TDM : Term-Document Matrix (rows are terms, columns are documents)

- DTM : Document-Term Matrix (rows are documents, columns are terms)
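
A word-word co-occurrence matrix for the three example sentences can be sketched as follows. Two assumptions are made here that the original notes do not state: the context is the whole sentence, and every token (including "and" and "to") is kept, so the counts may differ from the figure.

```python
from itertools import combinations

import numpy as np

sentences = [
    "Myeong-seok and Jun-seon went to America",
    "Myeong-seok and Sang-ho went to the library",
    "Myeong-seok and Jun-seon like cold noodles",
]
tokenized = [sentence.split() for sentence in sentences]

vocab = sorted({word for sentence in tokenized for word in sentence})
index = {word: i for i, word in enumerate(vocab)}

# Count each unordered pair of distinct words once per sentence, in both directions.
cooc = np.zeros((len(vocab), len(vocab)), dtype=int)
for sentence in tokenized:
    for a, b in combinations(set(sentence), 2):
        cooc[index[a], index[b]] += 1
        cooc[index[b], index[a]] += 1

print(vocab)
print(cooc)
```

Each row of `cooc` can then serve as the vector of the corresponding word, and the matrix as a whole can be read as the adjacency matrix of a word network.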

#### Word Embedding

**◆ Word embedding**
- A one-hot vector is a sparse representation with many 0's and only a single 1.
- In a dense representation, by contrast, the size of the vector is not the size of the word set but a value chosen by the user (smaller than the size of the word set), and the components are real values rather than only 0 and 1.
- The method of expressing words as dense vectors is called word embedding, and the resulting vector is called an embedding vector.
- Examples of word embedding methods include LSA, word2vec, FastText, and GloVe.
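
The relation between the sparse and dense representations can be illustrated with an embedding matrix: multiplying a one-hot vector by the matrix simply selects one row. The matrix below is filled with random numbers as a stand-in for trained values, and `embedding_dim = 2` is an arbitrary choice for the example.

```python
import numpy as np

vocab = ["Thomas", "Jefferson", "made", "building"]
embedding_dim = 2  # chosen by the user, smaller than len(vocab)

# Random stand-in for an embedding matrix that would normally be learned.
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(vocab), embedding_dim))

# Sparse representation of "made": a single 1 among zeros.
one_hot = np.zeros(len(vocab))
one_hot[vocab.index("made")] = 1

# Dense representation: the one-hot vector picks out one row of the matrix.
dense = one_hot @ embedding
print(dense)
print(np.allclose(dense, embedding[vocab.index("made")]))
```

In practice the matrix multiplication is replaced by a direct row lookup, which is what embedding layers in deep learning libraries do.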

**◆ Distributed representation**
- **Local representation** is a method of expressing a word by looking only at the word itself and mapping a specific value.
- A distributed representation, on the other hand, is based on the distributional hypothesis: the assumption that words appearing in similar contexts have similar meanings. Turning this similarity between words into vectors is what word embedding does.
- Distributed representation methods refer to the neighboring words in order to represent a word.
- For example, since the words cute and lovely often appear near the word puppy, the word puppy is defined in terms of cute and lovely.
- Word2vec is one example of a technique that learns a distributed representation.

#### Topic Vector

**◆ Topic vector (semantic vector)**
- A vector obtained by reducing the dimensionality of TF-IDF vectors; its components are topic scores computed from the weighted frequencies in the TF-IDF vectors.
- Words of the same topic are grouped together using the correlations between normalized term frequencies.
- Used for semantic search, which retrieves documents based on their meaning → generally known to be more accurate than keyword-based search.
- Makes it possible to find the set of keywords that best summarizes the meaning of a given document.
- There are (1) word topic vectors representing the meaning of words and (2) document topic vectors representing the meaning of documents.

▪ Word topic vectors : compute three topic scores, {pet}, {animal}, {city}, and represent each word as a topic vector

<center>

![topicvector.png](./images/TopicVector.png)

</center>

▪ Document topic vector
To obtain a vector that captures the topic (meaning) of an entire document, the document topic vector is computed as the sum of the topic vectors of its words.

▪ Word inference using topic vectors
Words can be mapped to a vector space of lower dimension than the word frequency vector, and arithmetic on the resulting word vectors is meaningful.

Useful for word analogy tasks.

Ex) male : king = female : ?

    ➢ Topic : { male / female , adult / child , royal family / commoner }  
    ➢ Words : King = { 1.0, 0.9, 0.9 }, Prince = {0.9, 0.1, 0.8}, Queen = { 0.1, 0.9, 0.8 }, Princess = {0.1, 0.1, 0.8}
    male = { 1.0, 0.0, 0.0 }
    female = { 0.1, 0.0, 0.0 }
    king - male + female = { 0.1, 0.9, 0.9 } → close to queen
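
The analogy above can be verified numerically with the vectors given in the example. Using Euclidean distance to pick the nearest word is an assumption of this sketch; cosine similarity is another common choice.

```python
import numpy as np

# Topic scores: {male/female, adult/child, royal family/commoner}
words = {
    "King": np.array([1.0, 0.9, 0.9]),
    "Prince": np.array([0.9, 0.1, 0.8]),
    "Queen": np.array([0.1, 0.9, 0.8]),
    "Princess": np.array([0.1, 0.1, 0.8]),
}
male = np.array([1.0, 0.0, 0.0])
female = np.array([0.1, 0.0, 0.0])

target = words["King"] - male + female

# Pick the word whose vector is closest to the target.
nearest = min(words, key=lambda word: np.linalg.norm(words[word] - target))

print(target)
print(nearest)
```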

#### Topic Vector Expansion : Word2Vec and BERT

**◆ Word2vec (Tomas Mikolov, Microsoft intern, 2012)**
- With topic vectors, words are treated as related simply because they occur in the same sentence, whereas word2vec derives the meaning of a word from the words that appear near it.
- The context is the n-gram formed by the words before and after the target word.
- The size of the context is specified with window=k.
- Two training schemes are available: CBOW (Continuous Bag of Words) and Skip-Gram.
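
To make the role of the window concrete, the sketch below generates the (target, context) training pairs that a Skip-Gram model would see. It only illustrates the windowing step and does not train a model; `skipgram_pairs` is a hypothetical helper written for this example, not a library function.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every word within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


tokens = "the fox chases the rabbit".split()
pairs = skipgram_pairs(tokens, window=1)

for pair in pairs:
    print(pair)
```

Skip-Gram predicts each context word from the target word, while CBOW goes the other way and predicts the target word from all of its context words at once.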

**◆ BERT: Bidirectional Encoder Representations from Transformers (Google, 2018): encoder-only model**
- The model is trained using only the encoder part of the Transformer. Pre-training is performed with two language-learning objectives: masked language modeling and next sentence prediction.

**◆ Difference between Word2vec and BERT**
- Word2vec is a static embedding technique, so the multiple meanings of one word are all collapsed into a single vector.
- The representation is fixed: a word has the same vector wherever it appears in the text (regardless of order), so homonyms cannot be distinguished.
- As a solution, contextual embedding creates a dynamic vector for each occurrence based on the surrounding sentence. Techniques such as BERT and ELMo were developed to solve this problem.

#### Integer Encoding

**1. Dictionary**

In [1]:
```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# Download the tokenizer model and the stopword list (skipped if already present).
# Newer NLTK releases may also require nltk.download('punkt_tab').
nltk.download('punkt')
nltk.download('stopwords')

raw_text = "A barber is a person. a barber is good person. a barber is huge person. he Knew A Secret! The Secret He Kept is huge secret. Huge secret. His barber kept his word. a barber kept his word. His barber kept his secret. But keeping and keeping such a huge secret to himself was driving the barber crazy. the barber went up a huge mountain."
```

```
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/junghunlee/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/junghunlee/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
```


Tokenize the text into sentences

In [2]:
```python
sentences = sent_tokenize(raw_text)
print(sentences)
```

```
['A barber is a person.', 'a barber is good person.', 'a barber is huge person.', 'he Knew A Secret!', 'The Secret He Kept is huge secret.', 'Huge secret.', 'His barber kept his word.', 'a barber kept his word.', 'His barber kept his secret.', 'But keeping and keeping such a huge secret to himself was driving the barber crazy.', 'the barber went up a huge mountain.']
```


In [3]:
```python
vocab = {}                   # word -> frequency
preprocessed_sentences = []  # one list of cleaned words per sentence
stop_words = set(stopwords.words('english'))
```

In [4]:
```python
for sentence in sentences:
    # Tokenize the sentence into words
    tokenized_sentence = word_tokenize(sentence)
    result = []

    for word in tokenized_sentence:
        word = word.lower()  # lowercase so that 'Secret' and 'secret' are counted together
        # Drop stopwords and words of length 2 or less
        if word not in stop_words and len(word) > 2:
            result.append(word)
            if word not in vocab:
                vocab[word] = 0
            vocab[word] += 1

    preprocessed_sentences.append(result)

print(preprocessed_sentences)
print('word set :', vocab)
print('Frequency of the word "barber":', vocab["barber"])
```

```
[['barber', 'person'], ['barber', 'good', 'person'], ['barber', 'huge', 'person'], ['knew', 'secret'], ['secret', 'kept', 'huge', 'secret'], ['huge', 'secret'], ['barber', 'kept', 'word'], ['barber', 'kept', 'word'], ['barber', 'kept', 'secret'], ['keeping', 'keeping', 'huge', 'secret', 'driving', 'barber', 'crazy'], ['barber', 'went', 'huge', 'mountain']]
word set : {'barber': 8, 'person': 3, 'good': 1, 'huge': 5, 'knew': 1, 'secret': 6, 'kept': 4, 'word': 2, 'keeping': 2, 'driving': 1, 'crazy': 1, 'went': 1, 'mountain': 1}
Frequency of the word "barber": 8
```


Build a dictionary based on frequencies

In [5]:
```python
# Sort (word, frequency) pairs in descending order of frequency
vocab_sorted = sorted(vocab.items(), key=lambda x: x[1], reverse=True)
print(vocab_sorted)
```

```
[('barber', 8), ('secret', 6), ('huge', 5), ('kept', 4), ('person', 3), ('word', 2), ('keeping', 2), ('good', 1), ('knew', 1), ('driving', 1), ('crazy', 1), ('went', 1), ('mountain', 1)]
```


In [6]:
```python
word_to_index = {}
i = 0
for (word, frequency) in vocab_sorted:
    if frequency > 1:  # Exclude low frequency words
        i = i + 1
        word_to_index[word] = i  # Index words based on frequency
print(word_to_index)
```

```
{'barber': 1, 'secret': 2, 'huge': 3, 'kept': 4, 'person': 5, 'word': 6, 'keeping': 7}
```


Remove words with an index greater than 5 (remove low frequency words)

In [7]:
```python
vocab_size = 5
# Collect the words whose index is greater than vocab_size
words_frequency = [word for word, index in word_to_index.items() if index >= vocab_size + 1]
```

Delete the index information for those words

In [8]:
```python
for w in words_frequency:
    del word_to_index[w]
print(word_to_index)
```

```
{'barber': 1, 'secret': 2, 'huge': 3, 'kept': 4, 'person': 5}
```


Words with an index greater than 5 (low frequency) are collectively referred to as OOV (out-of-vocabulary)

In [9]:
```python
# Reserve the next free integer for all out-of-vocabulary words
word_to_index['OOV'] = len(word_to_index) + 1
print(word_to_index)
```

```
{'barber': 1, 'secret': 2, 'huge': 3, 'kept': 4, 'person': 5, 'OOV': 6}
```


In [10]:
```python
encoded_sentences = []

for sentence in preprocessed_sentences:
    encoded_sentence = []

    for word in sentence:
        try:
            # If the word is in the word set, use the integer for that word
            encoded_sentence.append(word_to_index[word])
        except KeyError:
            # If the word is not in the word set, use the integer for OOV
            encoded_sentence.append(word_to_index['OOV'])

    encoded_sentences.append(encoded_sentence)

print(encoded_sentences)
```

```
[[1, 5], [1, 6, 5], [1, 3, 5], [6, 2], [2, 4, 3, 2], [3, 2], [1, 4, 6], [1, 4, 6], [1, 4, 2], [6, 6, 3, 2, 6, 1, 6], [1, 6, 3, 6]]
```


**2. Counting**

In [11]:
```python
from collections import Counter
print(preprocessed_sentences)
```

```
[['barber', 'person'], ['barber', 'good', 'person'], ['barber', 'huge', 'person'], ['knew', 'secret'], ['secret', 'kept', 'huge', 'secret'], ['huge', 'secret'], ['barber', 'kept', 'word'], ['barber', 'kept', 'word'], ['barber', 'kept', 'secret'], ['keeping', 'keeping', 'huge', 'secret', 'driving', 'barber', 'crazy'], ['barber', 'went', 'huge', 'mountain']]
```


In [12]:
```python
# Flatten the list of sentences into one list of words.
# np.hstack(preprocessed_sentences) would also work.
all_words_list = sum(preprocessed_sentences, [])
print(all_words_list)
```

```
['barber', 'person', 'barber', 'good', 'person', 'barber', 'huge', 'person', 'knew', 'secret', 'secret', 'kept', 'huge', 'secret', 'huge', 'secret', 'barber', 'kept', 'word', 'barber', 'kept', 'word', 'barber', 'kept', 'secret', 'keeping', 'keeping', 'huge', 'secret', 'driving', 'barber', 'crazy', 'barber', 'went', 'huge', 'mountain']
```


In [13]:
```python
# Count word frequencies with Python's Counter class
vocab = Counter(all_words_list)
print(vocab)
```

```
Counter({'barber': 8, 'secret': 6, 'huge': 5, 'kept': 4, 'person': 3, 'word': 2, 'keeping': 2, 'good': 1, 'knew': 1, 'driving': 1, 'crazy': 1, 'went': 1, 'mountain': 1})
```


In [14]:
```python
print(vocab['barber'])  # print the frequency of the word 'barber'
```

```
8
```


In [15]:
```python
vocab_size = 5
vocab = vocab.most_common(vocab_size)  # Store only the top 5 most frequent words
word_to_index = {}
i = 0
for (word, frequency) in vocab:
    i = i + 1
    word_to_index[word] = i
print(word_to_index)
```

```
{'barber': 1, 'secret': 2, 'huge': 3, 'kept': 4, 'person': 5}
```


**3. NLTK's FreqDist**

In [16]:
```python
from nltk import FreqDist
import numpy as np
```

Flatten the sentences into a single sequence of words with np.hstack

In [17]:
```python
vocab = FreqDist(np.hstack(preprocessed_sentences))
print(vocab["barber"])  # print the frequency of the word 'barber'
```

```
8
```


In [18]:
```python
vocab_size = 5
vocab = vocab.most_common(vocab_size)  # Store only the top 5 most frequent words
print(vocab)
```

```
[('barber', 8), ('secret', 6), ('huge', 5), ('kept', 4), ('person', 3)]
```


In [19]:
```python
# enumerate starts at 0, so add 1 to start the indices at 1
word_to_index = {word[0]: index + 1 for index, word in enumerate(vocab)}
print(word_to_index)
```

```
{'barber': 1, 'secret': 2, 'huge': 3, 'kept': 4, 'person': 5}
```


**4. Understanding enumerate**

In [20]:
```python
test_input = ['a', 'b', 'c', 'd', 'e']
for index, value in enumerate(test_input):
    print(f"value : {value}, index : {index}")
```

```
value : a, index : 0
value : b, index : 1
value : c, index : 2
value : d, index : 3
value : e, index : 4
```


**5. Keras Text Preprocessing**

In [22]:
```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Note: the eighth sentence contains 'kept ' with a trailing space, which the
# tokenizer counts as a separate token from 'kept' (visible in the outputs below).
preprocessed_sentences = [['barber', 'person'], ['barber', 'good', 'person'], ['barber', 'huge', 'person'], ['knew', 'secret'], ['secret', 'kept', 'huge', 'secret'], ['huge', 'secret'], ['barber', 'kept', 'word'], ['barber', 'kept ', 'word'], ['barber', 'kept', 'secret'], ['keeping', 'keeping', 'huge', 'secret', 'driving', 'barber', 'crazy'], ['barber', 'went', 'huge', 'mountain']]

tokenizer = Tokenizer()
```

Pass the corpus to `fit_on_texts` to build the word set based on word frequency

In [23]:
```python
tokenizer.fit_on_texts(preprocessed_sentences)
print(tokenizer.word_index)   # word -> integer index, most frequent first
print(tokenizer.word_counts)  # word -> frequency
print(tokenizer.texts_to_sequences(preprocessed_sentences))
```

```
{'barber': 1, 'secret': 2, 'huge': 3, 'person': 4, 'kept': 5, 'word': 6, 'keeping': 7, 'good': 8, 'knew': 9, 'kept ': 10, 'driving': 11, 'crazy': 12, 'went': 13, 'mountain': 14}
OrderedDict([('barber', 8), ('person', 3), ('good', 1), ('huge', 5), ('knew', 1), ('secret', 6), ('kept', 3), ('word', 2), ('kept ', 1), ('keeping', 2), ('driving', 1), ('crazy', 1), ('went', 1), ('mountain', 1)])
[[1, 4], [1, 8, 4], [1, 3, 4], [9, 2], [2, 5, 3, 2], [3, 2], [1, 5, 6], [1, 10, 6], [1, 5, 2], [7, 7, 3, 2, 11, 1, 12], [1, 13, 3, 14]]
```


In [24]:
```python
vocab_size = 5
# Use only the top 5 words; num_words also counts index 0, hence the + 1
tokenizer = Tokenizer(num_words=vocab_size + 1)
tokenizer.fit_on_texts(preprocessed_sentences)
print(tokenizer.word_index)
print(tokenizer.word_counts)
print(tokenizer.texts_to_sequences(preprocessed_sentences))
```

```
{'barber': 1, 'secret': 2, 'huge': 3, 'person': 4, 'kept': 5, 'word': 6, 'keeping': 7, 'good': 8, 'knew': 9, 'kept ': 10, 'driving': 11, 'crazy': 12, 'went': 13, 'mountain': 14}
OrderedDict([('barber', 8), ('person', 3), ('good', 1), ('huge', 5), ('knew', 1), ('secret', 6), ('kept', 3), ('word', 2), ('kept ', 1), ('keeping', 2), ('driving', 1), ('crazy', 1), ('went', 1), ('mountain', 1)])
[[1, 4], [1, 4], [1, 3, 4], [2], [2, 5, 3, 2], [3, 2], [1, 5], [1], [1, 5, 2], [3, 2, 1], [1, 3]]
```


In [25]:
```python
vocab_size = 5
# Collect the words whose index is greater than vocab_size
words_frequency = [word for word, index in tokenizer.word_index.items() if index >= vocab_size + 1]
```

In [26]:
```python
# Delete the words whose index is greater than 5
for word in words_frequency:
    del tokenizer.word_index[word]   # delete index information for that word
    del tokenizer.word_counts[word]  # delete count information for that word
print(tokenizer.word_index)
print(tokenizer.word_counts)
print(tokenizer.texts_to_sequences(preprocessed_sentences))
```

```
{'barber': 1, 'secret': 2, 'huge': 3, 'person': 4, 'kept': 5}
OrderedDict([('barber', 8), ('person', 3), ('huge', 5), ('secret', 6), ('kept', 3)])
[[1, 4], [1, 4], [1, 3, 4], [2], [2, 5, 3, 2], [3, 2], [1, 5], [1], [1, 5, 2], [3, 2, 1], [1, 3]]
```


In [27]:
```python
# The size of the word set is vocab_size + 2, taking into account index 0 and OOV
vocab_size = 5
tokenizer = Tokenizer(num_words=vocab_size + 2, oov_token='OOV')
tokenizer.fit_on_texts(preprocessed_sentences)

print('Token OOV value : {}'.format(tokenizer.word_index['OOV']))
print(tokenizer.texts_to_sequences(preprocessed_sentences))
```

```
Token OOV value : 1
[[2, 5], [2, 1, 5], [2, 4, 5], [1, 3], [3, 6, 4, 3], [4, 3], [2, 6, 1], [2, 1, 1], [2, 6, 3], [1, 1, 4, 3, 1, 2, 1], [2, 1, 4, 1]]
```


#### Padding

**1. Padding using Numpy**

In [28]:
```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer

preprocessed_sentences = [['barber', 'person'], ['barber', 'good', 'person'], ['barber', 'huge', 'person'], ['knew', 'secret'], ['secret', 'kept', 'huge', 'secret'], ['huge', 'secret'], ['barber', 'kept', 'word'], ['barber', 'kept', 'word'], ['barber', 'kept', 'secret'], ['keeping', 'keeping', 'huge', 'secret', 'driving', 'barber', 'crazy'], ['barber', 'went', 'huge', 'mountain']]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(preprocessed_sentences)
encoded = tokenizer.texts_to_sequences(preprocessed_sentences)
print(encoded)
```

```
[[1, 5], [1, 8, 5], [1, 3, 5], [9, 2], [2, 4, 3, 2], [3, 2], [1, 4, 6], [1, 4, 6], [1, 4, 2], [7, 7, 3, 2, 10, 1, 11], [1, 12, 3, 13]]
```


In [29]:
```python
# Length of the longest encoded sentence
max_len = max(len(item) for item in encoded)
print('Maximum length :', max_len)
```

```
Maximum length : 7
```


In [30]:
```python
# Append 0 to each sentence until it reaches max_len (modifies `encoded` in place)
for sentence in encoded:
    while len(sentence) < max_len:
        sentence.append(0)
padded_np = np.array(encoded)
padded_np
```

```
array([[ 1,  5,  0,  0,  0,  0,  0],
       [ 1,  8,  5,  0,  0,  0,  0],
       [ 1,  3,  5,  0,  0,  0,  0],
       [ 9,  2,  0,  0,  0,  0,  0],
       [ 2,  4,  3,  2,  0,  0,  0],
       [ 3,  2,  0,  0,  0,  0,  0],
       [ 1,  4,  6,  0,  0,  0,  0],
       [ 1,  4,  6,  0,  0,  0,  0],
       [ 1,  4,  2,  0,  0,  0,  0],
       [ 7,  7,  3,  2, 10,  1, 11],
       [ 1, 12,  3, 13,  0,  0,  0]])
```

**2. Padding using Keras tools** 

In [31]:
```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Re-encode, because the previous cell padded `encoded` in place
encoded = tokenizer.texts_to_sequences(preprocessed_sentences)
print(encoded)
```

```
[[1, 5], [1, 8, 5], [1, 3, 5], [9, 2], [2, 4, 3, 2], [3, 2], [1, 4, 6], [1, 4, 6], [1, 4, 2], [7, 7, 3, 2, 10, 1, 11], [1, 12, 3, 13]]
```


In [32]:
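```python
# Default: zeros are added at the front ('pre')
padded = pad_sequences(encoded)
# padding='post' adds the zeros at the end instead
padded = pad_sequences(encoded, padding='post')
(padded == padded_np).all()  # evaluates to True: identical to the NumPy result
# maxlen=5: longer sentences are cut, by default from the front
padded = pad_sequences(encoded, padding='post', maxlen=5)
# truncating='post' cuts from the end instead
padded = pad_sequences(encoded, padding='post', truncating='post', maxlen=5)
last_value = len(tokenizer.word_index) + 1  # use a number one greater than the size of the word set
print(last_value)
```

```
14
```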


In [33]:
```python
# Pad with last_value instead of 0
padded = pad_sequences(encoded, padding='post', value=last_value)
```

In [35]:
```python
print(padded)
```

```
[[ 1  5 14 14 14 14 14]
 [ 1  8  5 14 14 14 14]
 [ 1  3  5 14 14 14 14]
 [ 9  2 14 14 14 14 14]
 [ 2  4  3  2 14 14 14]
 [ 3  2 14 14 14 14 14]
 [ 1  4  6 14 14 14 14]
 [ 1  4  6 14 14 14 14]
 [ 1  4  2 14 14 14 14]
 [ 7  7  3  2 10  1 11]
 [ 1 12  3 13 14 14 14]]
```
