## Recurrent Neural Network (RNN)

## In this notebook, you will find out:
### 1. what our data looks like
### 2. how a Simple RNN model performs on different test data
## Run the cells in order, but do not train the model yourself: a trained model is already provided. Skip the cell that trains the model and load the saved model directly.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import keras
from keras import layers
from sklearn.feature_extraction.text import CountVectorizer

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense
from tensorflow.keras.models import load_model
from sklearn.metrics import accuracy_score

import shap

In [2]:
FILENAME = 'fake reviews dataset.csv'

df = pd.read_csv(FILENAME)
# 1 = original (human-written) review ('OR'), 0 = computer-generated review ('CG')
df['labels'] = [1 if label=='OR' else 0 for label in df['label']]
df

Unnamed: 0,category,rating,label,text_,labels
0,Home_and_Kitchen_5,5.0,CG,"Love this! Well made, sturdy, and very comfor...",0
1,Home_and_Kitchen_5,5.0,CG,"love it, a great upgrade from the original. I...",0
2,Home_and_Kitchen_5,5.0,CG,This pillow saved my back. I love the look and...,0
3,Home_and_Kitchen_5,1.0,CG,"Missing information on how to use it, but it i...",0
4,Home_and_Kitchen_5,5.0,CG,Very nice set. Good quality. We have had the s...,0
...,...,...,...,...,...
40427,Clothing_Shoes_and_Jewelry_5,4.0,OR,I had read some reviews saying that this bra r...,1
40428,Clothing_Shoes_and_Jewelry_5,5.0,CG,I wasn't sure exactly what it would be. It is ...,0
40429,Clothing_Shoes_and_Jewelry_5,2.0,OR,"You can wear the hood by itself, wear it with ...",1
40430,Clothing_Shoes_and_Jewelry_5,1.0,CG,I liked nothing about this dress. The only rea...,0


In [3]:
corpus = df.text_

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
X_data = X.toarray()
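
Calling `toarray()` turns the sparse count matrix into a dense one, which is large for a vocabulary of this size. A rough back-of-the-envelope estimate follows. It assumes about 40,432 reviews (inferred from the index range shown above), the 41,099 vocabulary columns reported in the batch shape later in the notebook, and 8-byte integer counts.

```python
def dense_matrix_bytes(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> int:
    """Memory needed to hold an n_rows x n_cols dense array."""
    return n_rows * n_cols * bytes_per_value

# Assumed sizes: ~40,432 reviews and 41,099 vocabulary terms.
n_bytes = dense_matrix_bytes(40_432, 41_099)
print(f"{n_bytes / 1e9:.1f} GB")  # about 13.3 GB
```

Keeping `X` sparse, or capping the vocabulary with the `max_features` argument of `CountVectorizer`, would reduce this considerably.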

In [4]:
TRAIN_SIZE = 0.8
TRAIN_IDX = int(TRAIN_SIZE * X_data.shape[0])

X_train = X_data[:TRAIN_IDX]
X_test = X_data[TRAIN_IDX:]

y_train = df.labels[:TRAIN_IDX]
y_test = df.labels[TRAIN_IDX:]
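
The slices above take the first 80% of rows for training without shuffling. Since the table appears to be grouped by category, the held-out rows may come from different categories than the training rows. A minimal sketch of a seeded, shuffled index split follows; `shuffled_split` is a hypothetical helper, not part of the notebook.

```python
import random

def shuffled_split(n_rows: int, train_size: float = 0.8, seed: int = 0):
    """Return (train_idx, test_idx): disjoint, shuffled row positions."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    cut = int(train_size * n_rows)
    return indices[:cut], indices[cut:]

train_idx, test_idx = shuffled_split(10, train_size=0.8, seed=0)
print(len(train_idx), len(test_idx))  # 8 2
```

The index lists could then select rows, for example `X_data[train_idx]` and `df.labels.iloc[train_idx]`.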

In [5]:
def data_generator(X: list, y: list, num_sequences_per_batch: int):
    '''
    Returns a data generator for the RNN model.
    https://wiki.python.org/moin/Generators
    https://realpython.com/introduction-to-python-generators/
    
    Yields batches of feature vectors and the binary labels that go with them,
    as a tuple of NumPy arrays. The generator loops over the data forever and
    skips a final batch that would be smaller than num_sequences_per_batch.
    '''
    num_samples = len(X)
    
    while True:
        
        for offset in range(0, num_samples, num_sequences_per_batch):
            
            if offset+num_sequences_per_batch <= num_samples:
                
                # Get the batch data
                batch_sequences = X[offset:offset+num_sequences_per_batch]
                batch_labels = y[offset:offset+num_sequences_per_batch]    
                    
                yield np.array(batch_sequences), np.array(batch_labels)
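
The generator loops forever and only yields full batches, so any rows left over after the last full batch are never seen during an epoch. A list-based stand-in (hypothetical name `full_batches`, using plain lists instead of NumPy arrays) makes that behaviour easy to see:

```python
from itertools import islice

def full_batches(X, y, batch_size):
    """Endlessly yield (X_batch, y_batch) pairs, skipping a short final batch."""
    n = len(X)
    while True:
        for start in range(0, n - batch_size + 1, batch_size):
            yield X[start:start + batch_size], y[start:start + batch_size]

X_toy = [[i] for i in range(7)]   # 7 samples, one feature each
y_toy = [i % 2 for i in range(7)]

# Take three batches: two full ones, then the generator wraps around.
batches = list(islice(full_batches(X_toy, y_toy, batch_size=3), 3))
for xb, yb in batches:
    print(xb, yb)
```

With 7 samples and a batch size of 3, the seventh sample never appears, and the third batch repeats the first.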


In [6]:
num_sequences_per_batch = 128 # this is the batch size
train_generator = data_generator(list(X_train), y_train, num_sequences_per_batch)

sample = next(train_generator) # this is how you get data out of generators

print(sample[0].shape)
print(sample[1].shape)

(128, 41099)
(128,)


In [7]:
def train_model(data_generator, X, y, save_path, num_sequences_per_batch=128, num_epochs=1):
    
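    n_features = len(X[0])  # vocabulary size of the bag-of-words vectors

    model = Sequential()

    # SimpleRNN expects (timesteps, features) input, so each bag-of-words
    # vector is reshaped into a sequence with a single time step.
    model.add(tf.keras.layers.Reshape((1, n_features), input_shape=(n_features,)))
    model.add(SimpleRNN(128))
    model.add(Dense(1, activation='sigmoid'))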

    loss_fn = 'binary_crossentropy'
    model.compile(loss=loss_fn, optimizer='adam', metrics=['accuracy'])
    
    train_generator = data_generator(X, y, num_sequences_per_batch)
    
    history = model.fit(
        x=train_generator,
        steps_per_epoch=len(X) // num_sequences_per_batch,
        epochs=num_epochs
    )                            

    model.save(save_path)
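
A simple RNN updates its hidden state as h_t = tanh(W x_t + U h_{t-1} + b). When each review is represented by a single bag-of-words vector, there is only one time step, so with a zero initial state the layer behaves like a dense tanh layer. A small pure-Python sketch of the recurrence follows, with made-up weights chosen purely for illustration:

```python
import math

def rnn_step(x, h_prev, W, U, b):
    """One SimpleRNN step: h = tanh(W x + U h_prev + b)."""
    h = []
    for j in range(len(b)):
        total = b[j]
        total += sum(W[j][i] * x[i] for i in range(len(x)))
        total += sum(U[j][k] * h_prev[k] for k in range(len(h_prev)))
        h.append(math.tanh(total))
    return h

def run_rnn(sequence, W, U, b):
    """Feed a sequence of input vectors through the cell; return the last state."""
    h = [0.0] * len(b)  # zero initial state
    for x in sequence:
        h = rnn_step(x, h, W, U, b)
    return h

# Illustrative weights: 2 input features, 2 hidden units.
W = [[0.5, -0.5], [0.25, 0.75]]
U = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]

one_step = run_rnn([[1.0, 0.0]], W, U, b)               # single time step
two_steps = run_rnn([[1.0, 0.0], [0.0, 1.0]], W, U, b)  # state carries over
print(one_step)
print(two_steps)
```

With one time step the recurrent weights `U` have no effect; they only matter once the input is an actual sequence.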

In [None]:
train_model(data_generator, list(X_train), y_train, 'SimpleRNN', num_epochs=3)

Epoch 1/3
  1/252 [..............................] - ETA: 1632:49:04 - loss: 0.6959 - accuracy: 0.4297
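
The 252 steps in the progress bar follow from integer division of the training-set size by the batch size. A quick check, assuming 32,345 training rows (the index of the first held-out row shown later in the notebook):

```python
n_train = 32_345   # assumed from the first held-out row index
batch_size = 128

steps_per_epoch = n_train // batch_size
rows_used = steps_per_epoch * batch_size
rows_skipped = n_train - rows_used

print(steps_per_epoch, rows_used, rows_skipped)  # 252 32256 89
```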

In [8]:
model = load_model('SimpleRNN')

## Test our model on the held-out portion of the original data
## This takes 15-20 minutes on my computer

In [9]:
y_pred = model.predict(X_test)
y_pred = np.where(y_pred >= 0.5, 1, 0)



In [10]:
accuracy_score(y_test, y_pred)

0.499814517126252
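
An accuracy close to 0.5 on a two-class problem is worth checking against the distribution of predicted labels: a model that always outputs the same class scores exactly the share of that class in the test set. A small sketch with made-up probabilities (not the notebook's actual predictions):

```python
from collections import Counter

def label_counts(probabilities, threshold=0.5):
    """Threshold sigmoid outputs and count how often each label is predicted."""
    labels = [1 if p >= threshold else 0 for p in probabilities]
    return Counter(labels)

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

toy_probs = [0.10, 0.20, 0.05, 0.30]  # illustrative values only
toy_true = [1, 0, 0, 1]

counts = label_counts(toy_probs)
print(counts)                        # Counter({0: 4})
print(accuracy(toy_true, [0] * 4))   # 0.5
```

In the notebook, `np.unique(y_pred, return_counts=True)` would give the same kind of summary.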

In [11]:
df_explain = pd.DataFrame({'text': list(df.text_[TRAIN_IDX:]), 'true_label': y_test, 'pred_label': y_pred.flatten()})
df_explain

Unnamed: 0,text,true_label,pred_label
32345,">>>...""The Lean Startup:..."" is a ""MustRead"" f...",1,0
32346,"My Brief History is Stephen King's first book,...",0,0
32347,I love the Harry Potter series. The characters...,0,0
32348,"Well, I struggled a bit to write this review a...",1,0
32349,"THE SOLOIST is a fine novel, interweaving thre...",1,0
...,...,...,...
40427,I had read some reviews saying that this bra r...,1,0
40428,I wasn't sure exactly what it would be. It is ...,0,0
40429,"You can wear the hood by itself, wear it with ...",1,0
40430,I liked nothing about this dress. The only rea...,0,0


In [12]:
test_df = pd.read_csv('chatgpt4_generated_amazon_reviews.csv')
test_df['labels'] = 0  # every review in this file is generated, so the label is always 0
test_df

Unnamed: 0,Review,labels
0,This product truly exceeded my expectations in...,0
1,I was pleasantly surprised with the efficiency...,0
2,"Although the material feels a bit cheap, the o...",0
3,"Unfortunately, the product broke after just a ...",0
4,Absolutely love this! It's become an essential...,0
...,...,...
95,"Review 96: The product is average, meets the b...",0
96,"Review 97: Amazing product, it has exceeded al...",0
97,Review 98: Quite disappointed with the purchas...,0
98,"Review 99: The product is average, meets the b...",0


In [13]:
corpus_test = test_df.Review
X = vectorizer.transform(corpus_test)
X_test = X.toarray()

In [14]:
pred_labels = model.predict(X_test)
pred_labels = np.where(pred_labels >= 0.5, 1, 0)
accuracy_score(test_df.labels, pred_labels)



1.0

In [15]:
pred_labels

array([[0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
    

## Wait... what about GPT-2?

In [17]:
# First word of each held-out review, used later as a prompt for GPT-2
starting_token = [sentence.split()[0] for sentence in df.text_[TRAIN_IDX:]]
print(starting_token[:5])

['>>>..."The', 'My', 'I', 'Well,', 'THE']


In [18]:
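import os
import requests

# Read the Hugging Face API token from the environment instead of hard-coding it.
# HF_API_TOKEN is an assumed variable name; set it before running this cell.
API_TOKEN = os.environ['HF_API_TOKEN']

API_URL = "https://api-inference.huggingface.co/models/guangyil/gpt2-amazon"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

def query(payload):
    # Send the prompt to the hosted model and return the decoded JSON response
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()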

## The cell below may occasionally crash (possibly because the hosted model is still loading). If it does, just re-run it.
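
If the failure is transient (for instance, the response not yet containing generated text), wrapping the call in a small retry loop avoids re-running the cell by hand. This is a generic sketch: `call_with_retry` and `flaky` are hypothetical names, and the real cause of the crash is an assumption.

```python
import time

def call_with_retry(fn, attempts=3, delay_seconds=0.0):
    """Call fn() until it succeeds or attempts run out; re-raise the last error."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except (KeyError, IndexError, ValueError) as error:
            last_error = error
            time.sleep(delay_seconds * (attempt + 1))  # simple linear backoff
    raise last_error

# Stand-in for an API call that fails twice before returning text.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise KeyError("generated_text")
    return "ok"

result = call_with_retry(flaky)
print(result, calls["count"])  # ok 3
```

Applied to this notebook, the wrapped call might look like `call_with_retry(lambda: query({"inputs": start})[0]['generated_text'], delay_seconds=5)`.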

In [21]:
gpt2_text = []

num_query = 100

for i in range(num_query):
    start = np.random.choice(starting_token)
    outputs = query({"inputs": start})
    sentences = outputs[0]['generated_text'].split('\n')
    gpt2_text.extend(sentences)
    
print(gpt2_text[:10])

['I don t care if you have one you have to invest in it. ', 'i wanted something that would stand up to heavier lifting that i could easily hold and work. ', 'i bought the second set because i was going to buy the third', 'The is a very durable, very sturdy plastic tub. ', 'i ve had this item for a few years and it has stayed in great shape. ', 'that is all i would say about these products. ', 'it s one of the best', 'Very about one year ago. ', 'i think it could have easily broken or damaged the metal parts. ', 'i am still amazed at the quality of this product. ']


In [22]:
X_test = vectorizer.transform(gpt2_text)
X_test = X_test.toarray()
print(X_test)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [23]:
pred_labels = model.predict(X_test)
pred_labels = np.where(pred_labels >= 0.5, 1, 0)

# All GPT-2 sentences are generated, so the true label is 0 for every row
y_test = np.zeros((X_test.shape[0], 1), dtype=int)

accuracy_score(y_test, pred_labels)



1.0

## That's weird: both are 100%? Maybe the model tends to predict everything as computer-generated! Let's write a review ourselves and see the result.

In [24]:
my_review = ['I brought this cup last week and I really like the volume of it.']

X_test = vectorizer.transform(my_review)
X_test = X_test.toarray()
print(X_test)

[[0 0 0 ... 0 0 0]]
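
`vectorizer.transform` maps new text onto the vocabulary learned from the training corpus, so words never seen during fitting are silently dropped. A simplified stand-in for that behaviour follows. It lowercases the text and keeps tokens of two or more word characters, which mirrors the `CountVectorizer` defaults; the tiny vocabulary is made up.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of at least two word characters

def bag_of_words(text, vocabulary):
    """Count vocabulary words in text; out-of-vocabulary tokens are ignored."""
    tokens = TOKEN.findall(text.lower())
    counts = Counter(t for t in tokens if t in vocabulary)
    return [counts[word] for word in vocabulary]

vocab = ["cup", "like", "really", "this"]  # illustrative vocabulary
review = "I brought this cup last week and I really like the volume of it."

print(bag_of_words(review, vocab))  # [1, 1, 1, 1]
```

Words such as "brought" or "volume" contribute nothing here because they are not in the toy vocabulary, and the single-character token "I" is dropped by the token pattern.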


In [25]:
pred_labels = model.predict(X_test)
pred_labels = np.where(pred_labels >= 0.5, 1, 0)

# The review was written by a human, so the true label is 1
y_test = np.ones((X_test.shape[0], 1), dtype=int)

accuracy_score(y_test, pred_labels)



0.0

## It turns out our RNN model tends to classify any unseen review as generated, regardless of whether it was written by a human.
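
Accuracy on a test set that contains only one class cannot distinguish a good classifier from one that always predicts that class. Per-class recall makes this visible. A sketch with a constant predictor on made-up labels (0 = generated, 1 = human-written, as in the notebook):

```python
def recall_per_class(y_true, y_pred, classes=(0, 1)):
    """Fraction of each true class predicted correctly (None if the class is absent)."""
    result = {}
    for c in classes:
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        result[c] = hits / total if total else None
    return result

def always_generated(n):
    """Predictions of a model that ignores its input and always outputs 0."""
    return [0] * n

mixed_labels = [0, 1, 0, 1, 1, 0]  # illustrative labels only
print(recall_per_class(mixed_labels, always_generated(6)))  # {0: 1.0, 1: 0.0}
```

A constant predictor reaches perfect recall on generated reviews and zero recall on human-written ones, which matches the pattern of 100% accuracy on the two generated test sets and 0% on the hand-written review.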