<a href="https://colab.research.google.com/github/HARASEON/Deep_Learning_tensorflow/blob/main/Day4_sarcasm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Auto ML
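* Concept: AutoML automates the steps of a machine learning workflow, including data preprocessing, feature engineering, model selection, hyperparameter tuning, and model deployment.
* Point 1: Automation -> addresses the shortage of AI specialists
* Point 2: Accessibility -> makes machine learning more accessible to non-experts
* Point 3: Efficiency -> delivers machine learning faster and more efficiently than a manual workflow
* Point 4: Customization -> provides machine learning suited to a specific purpose
* Point 5: Reproducibility
* Point 6: Optimization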
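
As a rough illustration of the hyperparameter-tuning part of AutoML, the sketch below runs a random search over two settings. The `validation_loss` function is a made-up stand-in for training and evaluating a real model, and the search ranges are assumptions chosen only for this example.

```python
import random


def validation_loss(learning_rate, hidden_units):
    """Hypothetical stand-in for 'train a model and return its val_loss'.

    The minimum sits at learning_rate=0.01, hidden_units=32 by construction.
    """
    return (learning_rate - 0.01) ** 2 * 1e4 + (hidden_units - 32) ** 2 / 1e3


def random_search(n_trials, seed=0):
    """Sample random configurations and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(n_trials):
        config = {
            "learning_rate": rng.uniform(0.001, 0.1),
            "hidden_units": rng.choice([8, 16, 32, 64, 128]),
        }
        loss = validation_loss(**config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


best_config, best_loss = random_search(n_trials=50)
print(best_config, best_loss)
```

A real AutoML tool would replace `validation_loss` with actual training runs and usually uses a smarter search strategy than uniform sampling.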

# RNN
* Use: processing sequential data.
* Characteristics: feedback connections, which allow the network to store information about previous inputs and use it when processing later inputs.
* Advantage: An RNN can maintain a sort of "memory" of the inputs it has seen and use that information to make decisions about future inputs. This makes RNNs particularly useful for processing sequential data, such as time series or natural language.
* Problem: During training, the gradients can quickly become very small or very large, known as the **vanishing gradient** and **exploding gradient** problems. This can make it difficult for the network to learn long-term dependencies.
* Solution: Various types of RNNs have been developed, such as LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit), which use more sophisticated architectures to help preserve the gradients over longer sequences. This addresses the long-term dependency problem of plain RNNs.
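
To see why gradients shrink or blow up over long sequences, consider a deliberately simplified recurrence with a single scalar weight and no activation function (an assumption made to keep the arithmetic visible). The gradient of the last state with respect to the first is then the weight multiplied by itself once per time step.

```python
def gradient_through_time(w, steps):
    """Return d h_T / d h_0 for the linear recurrence h_t = w * h_{t-1} + x_t.

    Each step multiplies the gradient by w, so the result equals w ** steps.
    """
    grad = 1.0
    for _ in range(steps):
        grad *= w
    return grad


# A weight below 1 makes the gradient vanish, a weight above 1 makes it explode.
for w in (0.5, 1.0, 1.5):
    print(w, gradient_through_time(w, steps=50))
```

Gated cells such as LSTM and GRU add an additive path for the state, which is one reason they tend to keep this product closer to 1 over many steps.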


# Sarcasm 
* Goal: val_loss < 0.3626
* Dataset size: 26,709 headlines
* I tried adding more Dense layers, but learned that fewer layers actually trained better!
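
The goal is stated in terms of `val_loss`, which in this notebook is the binary cross-entropy averaged over the validation set. Below is a minimal sketch of that computation; the labels, the predicted probabilities, and the clipping constant `eps` are made up for illustration.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]
uncertain = [0.6, 0.4, 0.6, 0.4]

print(binary_crossentropy(y_true, confident))
print(binary_crossentropy(y_true, uncertain))
```

Read this way, a loss of 0.3626 roughly means the model assigns the correct class a (geometric) average probability of about exp(-0.3626), or close to 0.70.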
 

In [None]:
import json
import tensorflow as tf
import numpy as np
import urllib.request  # the request submodule must be imported explicitly

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding, Flatten, Bidirectional, LSTM, Dense, Input, Dropout, BatchNormalization
from tensorflow.keras import Sequential, Model
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

# Data loading
url = 'https://storage.googleapis.com/download.tensorflow.org/data/sarcasm.json'
urllib.request.urlretrieve(url, 'sarcasm.json')
with open('sarcasm.json') as f:
  data = json.load(f)

# Data preprocessing
sentences = []
labels = []
for d in data:
  sentences.append(d['headline'])
  labels.append(d['is_sarcastic'])

training_size = 20000
train_sentences = sentences[:training_size]
train_labels = labels[:training_size]

test_sentences = sentences[training_size:]
test_labels = labels[training_size:]

# tokenizer
vocab_size = 1000  # 25637
oov_tok = "<oov>"  # oov = out of vocabulary

# max_length: maximum number of tokens kept per headline
max_length = 120
# where to truncate sequences longer than max_length ('post' = at the end)
trunc_type='post'
# where to pad shorter sequences ('post' = at the end; use 'pre' to pad at the front)
padding_type='post'

embedding_dim = 16   # dimension of the learned word embedding vectors

# create the word tokenizer
# oov_token replaces words that fall outside the vocabulary
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)

# fit the tokenizer on the training sentences
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index

# convert the training sentences to token sequences
train_sequences = tokenizer.texts_to_sequences(train_sentences)

# apply the same tokenizer to the test sentences
test_sequences = tokenizer.texts_to_sequences(test_sentences)

# pad or truncate every sequence to the same length
x_train = pad_sequences(train_sequences, truncating=trunc_type, padding=padding_type, maxlen=max_length)
x_test = pad_sequences(test_sequences, truncating=trunc_type, padding=padding_type, maxlen=max_length)

# convert the labels to NumPy arrays
y_train = np.array(train_labels)
y_test = np.array(test_labels)


# model building
model = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=max_length),
    Bidirectional(LSTM(64, return_sequences=True)),
    Bidirectional(LSTM(64, return_sequences=True)),
    Bidirectional(LSTM(64)),
    Flatten(),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
    ])

# model compiling
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# model monitoring
early_stopping = EarlyStopping(monitor='val_loss', patience=10)
checkpoint_path = 'my_checkpoint.ckpt'
checkpoint = tf.keras.callbacks.ModelCheckpoint(checkpoint_path, 
    save_weights_only=True, 
    save_best_only=True, 
    monitor='val_loss',
    verbose=1)

# model training & validation
history = model.fit(x_train, y_train, 
    validation_data=(x_test, y_test),
    callbacks=[checkpoint, early_stopping],
    epochs=100)

model.load_weights(checkpoint_path)
model.save("sarcasm.h5")


Epoch 1/100
Epoch 1: val_loss improved from inf to 0.39407, saving model to my_checkpoint.ckpt
Epoch 2/100
Epoch 2: val_loss improved from 0.39407 to 0.37324, saving model to my_checkpoint.ckpt
Epoch 3/100
Epoch 3: val_loss did not improve from 0.37324
Epoch 4/100
Epoch 4: val_loss did not improve from 0.37324
Epoch 5/100
Epoch 5: val_loss did not improve from 0.37324
Epoch 6/100
Epoch 6: val_loss did not improve from 0.37324
Epoch 7/100
Epoch 7: val_loss did not improve from 0.37324
Epoch 8/100
Epoch 8: val_loss did not improve from 0.37324
Epoch 9/100
Epoch 9: val_loss did not improve from 0.37324
Epoch 10/100
Epoch 10: val_loss did not improve from 0.37324
Epoch 11/100
Epoch 11: val_loss did not improve from 0.37324
Epoch 12/100
Epoch 12: val_loss did not improve from 0.37324
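
The run above stops at epoch 12 even though `epochs=100`, which is what `patience=10` predicts when the best `val_loss` occurs at epoch 2. The sketch below is a simplified re-implementation of that patience rule, not the Keras source; the loss values after epoch 2 are hypothetical placeholders, since the log only says they did not improve.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which patience-based early stopping halts.

    Training stops once `patience` consecutive epochs fail to improve on the
    best validation loss seen so far. Returns None if that never happens.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Epochs 1-2 come from the log above; the later values are hypothetical
# placeholders that only need to be worse than 0.37324.
val_losses = [0.39407, 0.37324] + [0.40] * 98
print(stopping_epoch(val_losses, patience=10))  # prints 12
```

Because the checkpoint callback uses `save_best_only=True`, the weights reloaded afterwards are those from epoch 2, not from the final epoch.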


# GPT
* Prompt: code with sarcasm dataset from 'https://storage.googleapis.com/download.tensorflow.org/data/sarcasm.json' and do the RNN and modify it to get val_loss under 0.3626
* Result: success, with a val_loss of 0.3283
* Response:
1. Loading and preprocessing the dataset: The code loads the sarcasm dataset in JSON format from a URL using the tf.keras.utils.get_file function. Then, it extracts the sentences and labels from the dataset and splits them into training and validation sets. The sentences are tokenized using the Tokenizer class and padded to a fixed length using the pad_sequences function.

2. Defining the RNN model: The code defines a sequential model using the Keras API with the following layers: an embedding layer, a dropout layer, a bidirectional LSTM layer, another dropout layer, a dense layer with ReLU activation, another dropout layer, and finally a dense layer with sigmoid activation. The embedding layer learns the meaning of each word in the input sequence and the LSTM layer captures the context of the sequence.

3. Compiling the model: The model is compiled with binary cross-entropy loss and Adam optimizer. The accuracy metric is also included to monitor the performance of the model during training.

4. Training the model: The model is trained for 30 epochs on the training set with a batch size of 32 using the fit function. The validation set is also provided to monitor the performance of the model on unseen data during training. A checkpoint callback is defined to save the best weights of the model based on the validation loss.

5. Saving and evaluating the model: After training, the best weights are reloaded from the checkpoint file and the model is saved to a .h5 file named "sarcasm_gpt.h5". The model is then evaluated on the validation set using the evaluate function, and the validation loss and accuracy are printed to the console.
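
Step 1 can be pictured with a small pure-Python sketch of tokenizing and padding. It is a simplified imitation of what `Tokenizer` and `pad_sequences` do (it ignores punctuation filtering, for example), and the helper names and sample sentences are hypothetical.

```python
from collections import Counter


def build_word_index(sentences, oov_token="<OOV>"):
    """Map words to integer ids: OOV gets 1, then words by descending frequency."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index


def texts_to_sequences(sentences, word_index, num_words, oov_token="<OOV>"):
    """Replace each word by its id; unknown or too-rare words become OOV."""
    oov_id = word_index[oov_token]
    sequences = []
    for s in sentences:
        ids = [word_index.get(w, oov_id) for w in s.lower().split()]
        sequences.append([i if i < num_words else oov_id for i in ids])
    return sequences


def pad_post(sequences, maxlen):
    """Truncate from the end and pad with zeros at the end ('post')."""
    return [seq[:maxlen] + [0] * (maxlen - len(seq)) for seq in sequences]


train = ["the cat sat", "the dog sat down"]
word_index = build_word_index(train)
padded = pad_post(texts_to_sequences(["the bird sat"], word_index, num_words=100), maxlen=5)
print(word_index)
print(padded)  # "bird" was never seen in training, so it maps to the OOV id 1
```

Fitting the word index on the training sentences only, as the notebook does, means that words appearing only in the validation set are mapped to the OOV token.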

In [None]:
import json
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the sarcasm dataset
url = "https://storage.googleapis.com/download.tensorflow.org/data/sarcasm.json"
response = tf.keras.utils.get_file('sarcasm.json', url)
with open(response, 'r') as f:
    datastore = json.load(f)

# Extract the sentences and labels from the dataset
sentences = []
labels = []
for item in datastore:
    sentences.append(item['headline'])
    labels.append(int(item['is_sarcastic']))

# Split the dataset into training and validation sets
training_size = int(len(sentences) * 0.8)
training_sentences = sentences[0:training_size]
validation_sentences = sentences[training_size:]
training_labels = labels[0:training_size]
validation_labels = labels[training_size:]

# Tokenize the sentences and pad them to a fixed length
tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index
training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=100, padding='post', truncating='post')
validation_sequences = tokenizer.texts_to_sequences(validation_sentences)
validation_padded = pad_sequences(validation_sequences, maxlen=100, padding='post', truncating='post')

# Define the RNN model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 16, input_length=100),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Define the checkpoint callback
checkpoint_filepath = 'my_checkpoint.h5'
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_loss',
    mode='min',
    save_best_only=True)

# Train the model
# This call raised an error: training_labels is a Python list, not a NumPy array
history = model.fit(x=training_padded, y=training_labels, epochs=30, validation_data=(validation_padded, validation_labels),
                    callbacks=[checkpoint_callback])

# Load the best weights from the checkpoint file
model.load_weights(checkpoint_filepath)

# Save the best weights
model.save("sarcasm_gpt.h5")

# Evaluate the model on the validation set
loss, accuracy = model.evaluate(validation_padded, validation_labels)
print("Validation loss:", loss)
print("Validation accuracy:", accuracy)


ValueError: ignored
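
The error above is most likely caused by passing the labels as plain Python lists while the inputs are NumPy arrays. A minimal sketch of the conversion that the next cell applies, using made-up labels:

```python
import numpy as np

# Hypothetical labels standing in for training_labels / validation_labels.
training_labels = [0, 1, 1, 0]

# Convert the list to an array so inputs and labels have the same container type.
labels_array = np.array(training_labels)
print(type(training_labels).__name__, "->", type(labels_array).__name__)
print(labels_array.shape, labels_array.dtype)
```

The same conversion has to be applied to the validation labels, both in `validation_data` and in the later `evaluate` call.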

In [None]:
import json
import tensorflow as tf
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the sarcasm dataset
url = "https://storage.googleapis.com/download.tensorflow.org/data/sarcasm.json"
response = tf.keras.utils.get_file('sarcasm.json', url)
with open(response, 'r') as f:
    datastore = json.load(f)

# Extract the sentences and labels from the dataset
sentences = []
labels = []
for item in datastore:
    sentences.append(item['headline'])
    labels.append(int(item['is_sarcastic']))

# Split the dataset into training and validation sets
training_size = int(len(sentences) * 0.8)
training_sentences = sentences[0:training_size]
validation_sentences = sentences[training_size:]
training_labels = labels[0:training_size]
validation_labels = labels[training_size:]

# Tokenize the sentences and pad them to a fixed length
tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index
training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=100, padding='post', truncating='post')
validation_sequences = tokenizer.texts_to_sequences(validation_sentences)
validation_padded = pad_sequences(validation_sequences, maxlen=100, padding='post', truncating='post')

# Define the RNN model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 16, input_length=100),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Define the checkpoint callback
checkpoint_filepath = 'my_checkpoint.h5'
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_loss',
    mode='min',
    save_best_only=True)

# Train the model
history = model.fit(x=training_padded, y=np.array(training_labels), epochs=30, validation_data=(validation_padded, np.array(validation_labels)),
                    callbacks=[checkpoint_callback])

# Load the best weights from the checkpoint file
model.load_weights(checkpoint_filepath)

# Save the model with the best weights restored
model.save("sarcasm_gpt.h5")

# Evaluate the model on the validation set
loss, accuracy = model.evaluate(validation_padded, np.array(validation_labels))
print("Validation loss:", loss)
print("Validation accuracy:", accuracy)


Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
Validation loss: 0.3283102512359619
Validation accuracy: 0.8517408967018127
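
One way to put the two architectures side by side is to count trainable parameters. The sketch below uses the standard formulas for Embedding, LSTM, and Dense layers with biases, assuming default Keras layer settings; Dropout and Flatten add no parameters.

```python
def embedding_params(vocab_size, dim):
    """One vector of length `dim` per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units


# GPT-suggested model: Embedding(10000, 16) -> BiLSTM(32) -> Dense(8) -> Dense(1)
gpt_model = (
    embedding_params(10000, 16)
    + 2 * lstm_params(16, 32)       # bidirectional = two LSTMs
    + dense_params(64, 8)           # 64 = 2 * 32 concatenated outputs
    + dense_params(8, 1)
)

# First model: Embedding(1000, 16) -> 3 x BiLSTM(64) -> Dense(32) -> Dense(1)
first_model = (
    embedding_params(1000, 16)
    + 2 * lstm_params(16, 64)
    + 2 * lstm_params(128, 64) * 2  # second and third BiLSTM take 128 inputs
    + dense_params(128, 32)
    + dense_params(32, 1)
)

print(gpt_model, first_model)  # prints 173073 259265
```

Most of the GPT-suggested model's parameters sit in the larger embedding table, while its recurrent part is much smaller than the three stacked BiLSTM layers of the first model.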
