# Text Preprocessing
- **Tokenizing**
- Cleaning
- Normalization


In [None]:
!pip install tensorflow==2.5.0

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}


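### Tokenizer
> tokenizing: the unit of a token varies with the situation, but a token is usually defined as a meaningful unit.
>
generating the dictionary of word encodings and creating vectors out of the sentences \
=> tokenizes the words, maps each one to a number, and builds a dictionary

- `num_words`: the maximum number of words to keep, based on word frequency. Only the most common `num_words-1` words will be kept. `word_index` still lists every word; the limit is applied when texts are converted to sequences.
- `fit_on_texts`: builds the vocabulary from the given texts and assigns each word an integer index
- `word_index`: returns a dictionary mapping each word to its index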
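
The bookkeeping behind `fit_on_texts` and `word_index` can be sketched in plain Python: count the words, order them by frequency, and number them from 1. This is a simplified stand-in, not the Keras implementation. The helper name `build_word_index` is hypothetical, and the sketch assumes that lowercasing and replacing ASCII punctuation with spaces is close enough to the tokenizer's default filtering for these sentences.

```python
from collections import Counter
import string

# Assumption: map every ASCII punctuation character to a space. This only
# approximates the Tokenizer's default filters; it does not reproduce them.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def build_word_index(texts):
    """Hypothetical helper: return {word: index}, most frequent word first."""
    counts = Counter()
    for text in texts:
        words = text.lower().translate(_PUNCT_TO_SPACE).split()
        counts.update(words)
    # most_common() sorts by count, descending; ties keep first-seen order.
    # Numbering starts at 1, which leaves 0 free for padding.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]
word_index = build_word_index(sentences)
print(word_index)
```

For these three sentences the sketch produces the same dictionary as the cell output above: `'I,'` and `'i'` collapse into one entry, and `'dog!'` is counted as `'dog'`.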

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

padded = pad_sequences(sequences, maxlen=5)
print("\nWord Index = " , word_index)
print("\nSequences = " , sequences)
print("\nPadded Sequences:")
print(padded)


Word Index =  {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

Sequences =  [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

Padded Sequences:
[[ 0  5  3  2  4]
 [ 0  5  3  2  7]
 [ 0  6  3  2  4]
 [ 9  2  4 10 11]]


- `oov_token`: replaces out-of-vocabulary words during `texts_to_sequences` calls, i.e., it stands in for words in the test data that were never encoded
- `texts_to_sequences`: transforms each text into a sequence of integers, using the indices in `word_index`
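
The effect of `oov_token` can be sketched without TensorFlow. The helper `texts_to_ids` below is hypothetical and only mimics the lookup step: it assumes whitespace-separated, punctuation-free input and takes the word index from the cell output above as given. It is an illustration of the idea, not the library's code.

```python
def texts_to_ids(texts, word_index, oov_token=None):
    """Hypothetical helper: map each text to a list of word indices."""
    oov_id = word_index.get(oov_token) if oov_token is not None else None
    sequences = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            if word in word_index:
                ids.append(word_index[word])
            elif oov_id is not None:
                ids.append(oov_id)  # unseen word -> index of the OOV token
            # with no OOV token, the unseen word is simply skipped
        sequences.append(ids)
    return sequences


# Word index copied from the cell output above.
word_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
unseen = ['i really love my dog', 'my dog loves my manatee']

with_oov = texts_to_ids(unseen, word_index, oov_token='<OOV>')
without_oov = texts_to_ids(unseen, word_index)
print(with_oov)     # [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
print(without_oov)  # [[5, 3, 2, 4], [2, 4, 2]]
```

Without an OOV token the sequences get shorter and lose the positions of the unknown words; with one, the sentence length is preserved.
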
### Sequence
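- `sequence`: a sentence of text converted into a sequence of integers
- `pad_sequences`: pads sequences with 0 so that they all have the same length; sequences longer than `maxlen` are truncated, from the front by default
> 'Padding' means evening things out, i.e., making the sequences uniform. Every list that comes out of `pad_sequences` above has the same length.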
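
The default behaviour seen in the output above (zeros added at the front, long sequences cut from the front) can be sketched with plain lists. `pad_pre` is a hypothetical helper written for this note, not the Keras function, and it returns nested lists rather than a NumPy array.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Hypothetical helper: left-pad and left-truncate lists of ints."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        kept = seq[max(len(seq) - maxlen, 0):]   # keep at most the last maxlen items
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


# Sequences copied from the cell output above.
sequences = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
padded = pad_pre(sequences, maxlen=5)
for row in padded:
    print(row)
```

With `maxlen=5` the last sentence loses its first two tokens (`8` and `6`), which matches the final row of the padded output above.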


In [None]:
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)

padded = pad_sequences(test_seq, maxlen=10)
print("\nPadded Test Sequence: ")
print(padded)


Test Sequence =  [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]

Padded Test Sequence: 
[[0 0 0 0 0 5 1 3 2 4]
 [0 0 0 0 0 2 4 1 2 1]]


In [1]:
!gdown --id 1xRU3xY5-tkiPGvlz5xBJ18_pHWSRzI4v

Downloading...
From: https://drive.google.com/uc?id=1xRU3xY5-tkiPGvlz5xBJ18_pHWSRzI4v
To: /content/sarcasm.json
  0% 0.00/5.64M [00:00<?, ?B/s]100% 5.64M/5.64M [00:00<00:00, 81.0MB/s]


In [2]:
import json

with open("./sarcasm.json", 'r') as f:
    datastore = json.load(f)


sentences = [] 
labels = []
urls = []
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])

### `.json` (JavaScript Object Notation)
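- A notation based on the expression JavaScript uses to create objects
- A lightweight data-interchange format
- Takes the form { string key : value }; a value can be a string, number, boolean, null, array, or another object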

```
{
  "firstName": "Kwon",
  "lastName": "YoungJae",
  "email": "kyoje11@gmail.com"
}
```
Source: https://nesoy.github.io/articles/2017-02/JSON
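
As a small illustration of how such text becomes Python objects, the sketch below parses a made-up record with the same three keys that the loading cell above reads. The headline and link are placeholders, and the integer `is_sarcastic` value is an assumption about how the labels are stored.

```python
import json

# A made-up record shaped like the items read from sarcasm.json above;
# the headline and link are placeholders, not real dataset entries.
raw = '''
[
  {
    "article_link": "https://example.com/some-article",
    "headline": "an example headline",
    "is_sarcastic": 0
  }
]
'''

datastore = json.loads(raw)   # JSON array -> Python list of dicts
item = datastore[0]
print(type(item['headline']).__name__)      # str
print(type(item['is_sarcastic']).__name__)  # int
```

`json.load(f)` does the same conversion, reading from an open file instead of a string.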



In [5]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
print(len(word_index))
#print(word_index)
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')
print(padded[0])
print(padded.shape)

29657
[  308 15115   679  3337  2298    48   382  2576 15116     6  2577  8434
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0]
(26709, 40)


- `padding`: String, 'pre' or 'post' (optional, defaults to 'pre'): pad either before or after each sequence, i.e., it decides whether the zeros go at the front or at the back when padding
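
The difference between the two settings is easiest to see side by side. The sketch below uses a hypothetical `pad` helper on plain lists. It covers only the `padding` choice, pads to the longest sequence, and is not the Keras function.

```python
def pad(sequences, padding='pre', value=0):
    """Hypothetical helper: pad lists of ints to the longest length."""
    if padding not in ('pre', 'post'):
        raise ValueError("padding must be 'pre' or 'post'")
    maxlen = max(len(seq) for seq in sequences)
    out = []
    for seq in sequences:
        fill = [value] * (maxlen - len(seq))
        out.append(fill + seq if padding == 'pre' else seq + fill)
    return out


sequences = [[5, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
pre = pad(sequences)
post = pad(sequences, padding='post')
print(pre)   # [[0, 0, 0, 5, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
print(post)  # [[5, 3, 2, 4, 0, 0, 0], [8, 6, 9, 2, 4, 10, 11]]
```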

# Week 1 Quiz 
Total: 8 points
1. What is the name of the object used to tokenize sentences?
- **Tokenizer**
- WordTokenizer
- TextTokenizer
- CharacterTokenizer

2. What is the name of the method used to tokenize a list of sentences?
- fit_to_text(sentences)
- tokenize_on_text(sentences)
- **fit_on_texts(sentences)**
- tokenize(sentences)

3. Once you have the corpus tokenized, what’s the method used to encode a list of sentences to use those tokens?
- texts_to_tokens(sentences)
- text_to_sequences(sentences)
- **texts_to_sequences(sentences)**
- text_to_tokens(sentences)

4. When initializing the tokenizer, how do you specify a token to use for unknown words?
- unknown_token=`<Token>`
- **oov_token=`<Token>`**
- unknown_word=`<Token>`
- out_of_vocab=`<Token>`

5. If you don’t use a token for out of vocabulary words, what happens at encoding?
- The word isn’t encoded, and is replaced by a zero in the sequence
- The word isn’t encoded, and the sequencing ends
- The word is replaced by the most common token
- **The word isn’t encoded, and is skipped in the sequence**

6. If you have a number of sequences of different lengths, how do you ensure that they are understood when fed into a neural network?
- **Make sure that they are all the same length using the pad_sequences method of the tokenizer**
- Use the pad_sequences object from the tensorflow.keras.preprocessing.sequence namespace
- Process them on the input layer of the Neural Network using the pad_sequences property
- Specify the input layer of the Neural Network to expect different sizes with dynamic_length

7. If you have a number of sequences of different length, and call pad_sequences on them, what’s the default result?
- **They’ll get padded to the length of the longest sequence by adding zeros to the beginning of shorter ones**
- They’ll get cropped to the length of the shortest sequence
- Nothing, they’ll remain unchanged
- They’ll get padded to the length of the longest sequence by adding zeros to the end of shorter ones

8. When padding sequences, if you want the padding to be at the end of the sequence, how do you do it?
- Call the padding method of the pad_sequences object, passing it ‘after’
- **Pass padding=’post’ to pad_sequences when initializing it**
- Pass padding=’after’ to pad_sequences when initializing it
- Call the padding method of the pad_sequences object, passing it ‘post’


