# Padding
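- In natural language processing, sentences (documents) can differ in length.
- Language models, however, process fixed-length data most efficiently.
    - -> All sentences therefore need to be brought to the same length; this operation is called padding.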

**Benefits of padding**
1. Consistent input format
2. Optimized parallel computation
3. Flexible data handling
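
As a minimal illustration of the first two points, the sketch below uses plain Python to check whether a batch can be stacked into a single 2-D array before and after zero-padding. The helper `is_rectangular` is a hypothetical name introduced here for illustration; it is not part of the notebook.

```python
# Hypothetical helper for illustration only; not part of the notebook.
def is_rectangular(batch):
    """Return True if every row in the batch has the same length."""
    return len({len(row) for row in batch}) <= 1

ragged = [[2, 6], [2, 9, 6], [3, 5, 4, 3]]

# Zero-pad on the right up to the length of the longest row.
max_len = max(len(row) for row in ragged)
padded = [row + [0] * (max_len - len(row)) for row in ragged]

print(is_rectangular(ragged))  # False
print(is_rectangular(padded))  # True
print(padded)                  # [[2, 6, 0, 0], [2, 9, 6, 0], [3, 5, 4, 3]]
```

Only the rectangular version has a single well-defined shape (here 3 x 4), which is what batched tensor operations rely on.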

In [None]:
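# Deep learning models (especially RNNs, LSTMs, and Transformers) need every input sequence in a batch to have the same length,
# so shorter sentences must be padded out to a fixed length, typically that of the longest sentence.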

In [4]:
preprocessed_sentences = [['barber', 'person'], ['barber', 'good', 'person'], ['barber', 'huge', 'person'],
                          ['knew', 'secret'], ['secret', 'kept', 'huge', 'secret'], ['huge', 'secret'],
                          ['barber', 'kept', 'word'], ['barber', 'kept', 'word'], ['barber', 'kept', 'secret'],
                          ['keeping', 'keeping', 'huge', 'secret', 'driving', 'barber', 'crazy'],
                          ['barber', 'went', 'huge', 'mountain']]

### Manual implementation

In [None]:
import torch
from collections import Counter

class TokenizerForPadding:
    def __init__(self, num_words=None, oov_token='<OOV>'):  # num_words: vocabulary size, counting the padding index 0 and the OOV token
        self.num_words = num_words
        self.oov_token = oov_token
        self.word_index = {}
        self.index_word = {}
        self.word_counts = Counter()    # word frequencies over the whole corpus; Counter is a dict subclass that counts the elements of an iterable such as a list

    def fit_on_texts(self, texts):
        # Count word frequencies
        for sentence in texts:
            self.word_counts.update(word for word in sentence if word)

        # Build the vocabulary by frequency (limited to num_words)
        # most_common(n) returns the n most frequent items as a list of (element, count) pairs; with n=None it returns all of them
        # If num_words is set, subtract 2 (one slot for the padding index 0, one for OOV) so that only the top (num_words - 2) words are kept
        vocab = [self.oov_token] + [word for word, _ in self.word_counts.most_common(self.num_words - 2 if self.num_words else None)]
        self.word_index = {word: i+1 for i, word in enumerate(vocab)}
        self.index_word = {i: word for word, i in self.word_index.items()}  # i is already the 1-based index, so no further offset

    def texts_to_sequences(self, texts):
        return [[self.word_index.get(word, self.word_index[self.oov_token]) for word in sentence] for sentence in texts]
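
The vocabulary-building step can be looked at in isolation with a toy corpus. This is a sketch under stated assumptions: `toy_sentences` and `encode` are hypothetical names introduced here, and index 0 is assumed to be reserved for padding, as in the class above.

```python
from collections import Counter

# Toy corpus for illustration only.
toy_sentences = [['barber', 'person'], ['barber', 'good', 'person'], ['barber', 'secret']]

counts = Counter()
for sentence in toy_sentences:
    counts.update(sentence)

# Index range 0..3: 0 = padding, 1 = OOV, 2..3 = the two most frequent words.
num_words = 4
vocab = ['<OOV>'] + [word for word, _ in counts.most_common(num_words - 2)]
word_index = {word: i + 1 for i, word in enumerate(vocab)}

def encode(sentence):
    # Words outside the vocabulary fall back to the OOV index.
    return [word_index.get(word, word_index['<OOV>']) for word in sentence]

print(word_index)                            # {'<OOV>': 1, 'barber': 2, 'person': 3}
print(encode(['barber', 'good', 'person']))  # [2, 1, 3]
```

Because `good` did not make it into the top two words, it is encoded as 1, the OOV index, rather than raising a `KeyError`.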

In [6]:
def pad_sequences(sequences, maxlen=None, padding='pre', truncating='pre', value=0):    # 'pre': add padding at the front / value=0: zero padding (fill the empty positions with 0)
    if maxlen is None:  # target length applied to every sequence
        maxlen = max(len(seq) for seq in sequences)     # default to the length of the longest sentence

    padded_sequences = []
    for seq in sequences:
        if len(seq) > maxlen:
            if truncating == 'pre':
                seq = seq[-maxlen:]
            else:   # post
                seq = seq[:maxlen]
        else:
            pad_length = maxlen - len(seq)
            if padding == 'pre':
                seq = [value] * pad_length + seq
            else:   # post 
                seq = seq + [value] * pad_length
        padded_sequences.append(seq)
    
    return torch.tensor(padded_sequences)
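
The `'pre'` and `'post'` modes come down to which end of the list is sliced or extended. A small pure-Python sketch, reusing one long and one short sequence from the example data (the variable names are introduced here for illustration):

```python
long_seq = [8, 8, 4, 3, 11, 2, 12]
short_seq = [2, 6]
maxlen = 5

# truncating='pre' drops tokens from the front and keeps the last maxlen tokens.
pre_truncated = long_seq[-maxlen:]
# truncating='post' drops tokens from the end and keeps the first maxlen tokens.
post_truncated = long_seq[:maxlen]

pad_length = maxlen - len(short_seq)
# padding='pre' puts the fill values before the tokens.
pre_padded = [0] * pad_length + short_seq
# padding='post' puts the fill values after the tokens.
post_padded = short_seq + [0] * pad_length

print(pre_truncated)   # [4, 3, 11, 2, 12]
print(post_truncated)  # [8, 8, 4, 3, 11]
print(pre_padded)      # [0, 0, 0, 2, 6]
print(post_padded)     # [2, 6, 0, 0, 0]
```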



In [7]:
tokenizer = TokenizerForPadding(num_words=15)
tokenizer.fit_on_texts(preprocessed_sentences)
sequneces = tokenizer.texts_to_sequences(preprocessed_sentences)
sequneces

[[2, 6],
 [2, 9, 6],
 [2, 4, 6],
 [10, 3],
 [3, 5, 4, 3],
 [4, 3],
 [2, 5, 7],
 [2, 5, 7],
 [2, 5, 3],
 [8, 8, 4, 3, 11, 2, 12],
 [2, 13, 4, 14]]

In [8]:
padded = pad_sequences(sequneces)   # padding brings every sequence to the same length / the default padding='pre' pads at the front
padded

tensor([[ 0,  0,  0,  0,  0,  2,  6],
        [ 0,  0,  0,  0,  2,  9,  6],
        [ 0,  0,  0,  0,  2,  4,  6],
        [ 0,  0,  0,  0,  0, 10,  3],
        [ 0,  0,  0,  3,  5,  4,  3],
        [ 0,  0,  0,  0,  0,  4,  3],
        [ 0,  0,  0,  0,  2,  5,  7],
        [ 0,  0,  0,  0,  2,  5,  7],
        [ 0,  0,  0,  0,  2,  5,  3],
        [ 8,  8,  4,  3, 11,  2, 12],
        [ 0,  0,  0,  2, 13,  4, 14]])

In [9]:
padded = pad_sequences(sequneces, padding='post', maxlen=5, truncating='post')   # pad at the end
padded

tensor([[ 2,  6,  0,  0,  0],
        [ 2,  9,  6,  0,  0],
        [ 2,  4,  6,  0,  0],
        [10,  3,  0,  0,  0],
        [ 3,  5,  4,  3,  0],
        [ 4,  3,  0,  0,  0],
        [ 2,  5,  7,  0,  0],
        [ 2,  5,  7,  0,  0],
        [ 2,  5,  3,  0,  0],
        [ 8,  8,  4,  3, 11],
        [ 2, 13,  4, 14,  0]])

### Using the Keras Tokenizer

In [10]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(preprocessed_sentences)
sequneces = tokenizer.texts_to_sequences(preprocessed_sentences)
sequneces

[[1, 5],
 [1, 8, 5],
 [1, 3, 5],
 [9, 2],
 [2, 4, 3, 2],
 [3, 2],
 [1, 4, 6],
 [1, 4, 6],
 [1, 4, 2],
 [7, 7, 3, 2, 10, 1, 11],
 [1, 12, 3, 13]]

In [11]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded = pad_sequences(sequneces, padding='post', maxlen=3, truncating='post')
padded

array([[ 1,  5,  0],
       [ 1,  8,  5],
       [ 1,  3,  5],
       [ 9,  2,  0],
       [ 2,  4,  3],
       [ 3,  2,  0],
       [ 1,  4,  6],
       [ 1,  4,  6],
       [ 1,  4,  2],
       [ 7,  7,  3],
       [ 1, 12,  3]])

---

##### Padding the Little Prince sample data (exercise) **(due Monday)**

1. Text preprocessing (tokenization / stopword removal / cleaning / normalization)
2. Integer encoding with Tokenizer (tensorflow.keras)
3. Padding with pad_sequences (tensorflow.keras)

In [12]:
raw_text = """The Little Prince, written by Antoine de Saint-Exupéry, is a poetic tale about a young prince who travels from his home planet to Earth. The story begins with a pilot stranded in the Sahara Desert after his plane crashes. While trying to fix his plane, he meets a mysterious young boy, the Little Prince.

The Little Prince comes from a small asteroid called B-612, where he lives alone with a rose that he loves deeply. He recounts his journey to the pilot, describing his visits to several other planets. Each planet is inhabited by a different character, such as a king, a vain man, a drunkard, a businessman, a geographer, and a fox. Through these encounters, the Prince learns valuable lessons about love, responsibility, and the nature of adult behavior.

On Earth, the Little Prince meets various creatures, including a fox, who teaches him about relationships and the importance of taming, which means building ties with others. The fox's famous line, "You become responsible, forever, for what you have tamed," resonates with the Prince's feelings for his rose.

Ultimately, the Little Prince realizes that the essence of life is often invisible and can only be seen with the heart. After sharing his wisdom with the pilot, he prepares to return to his asteroid and his beloved rose. The story concludes with the pilot reflecting on the lessons learned from the Little Prince and the enduring impact of their friendship.

The narrative is a beautifully simple yet profound exploration of love, loss, and the importance of seeing beyond the surface of things."""
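
One possible starting point for step 1 of the exercise, sketched with the standard library only. The stopword set is a tiny hand-picked assumption (a real solution would likely use a fuller list), and `preprocess` is a hypothetical helper name. Its output is a list of token lists, the same shape as `preprocessed_sentences`, so it could then be passed to the Keras `Tokenizer` and `pad_sequences` for steps 2 and 3.

```python
import re

# Two sentences taken from raw_text, repeated here so the sketch runs on its own.
sample = ("The story begins with a pilot stranded in the Sahara Desert after his plane crashes. "
          "While trying to fix his plane, he meets a mysterious young boy, the Little Prince.")

# Assumption: a tiny hand-picked stopword set, for illustration only.
STOPWORDS = {'the', 'a', 'in', 'with', 'his', 'he', 'to', 'after', 'while'}

def preprocess(text):
    """Split into sentences, lowercase, keep alphabetic tokens, drop stopwords."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    result = []
    for sentence in sentences:
        tokens = re.findall(r'[a-z]+', sentence.lower())
        tokens = [t for t in tokens if t not in STOPWORDS]
        if tokens:
            result.append(tokens)
    return result

tokens_per_sentence = preprocess(sample)
print(tokens_per_sentence[0])
# ['story', 'begins', 'pilot', 'stranded', 'sahara', 'desert', 'plane', 'crashes']
```

The `[a-z]+` pattern is a deliberate simplification: it splits accented or hyphenated words such as "Saint-Exupéry" and "B-612" into fragments, so the full text may need a more careful tokenizer.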