## Building Review Sentiment Classifiers with GNNs

> For this intern project, I implemented GNNs and sequential models, such as Bi-LSTM, on the Amazon reviews dataset. Following the steps below will give you a sense of how the project progresses from data cleaning to model training. This tutorial only covers the steps for training GCN or GAT models. Change the root path and config settings to suit your environment.

> Note: The scripts for the LSTM model have not been turned into production code. If you are interested in following up on this work, check the Jupyter notebooks under the *src/models* folder directly.

### Step 1: Download the example dataset

I extracted several subsets of [Amazon review data (2018)](https://nijianmo.github.io/amazon/index.html) as benchmark datasets. Under the **"Small" subsets for experimentation** section, you can find reviews split by category, each with a download link. This tutorial uses *Magazine Subscriptions* as the example. First, let's download the required dataset with the wget command.

In [3]:
!wget -P dataset/raw_datasets/ http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Magazine_Subscriptions_5.json.gz 

--2020-10-17 04:19:32--  http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Magazine_Subscriptions_5.json.gz
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 401655 (392K) [application/octet-stream]
Saving to: 'dataset/raw_datasets/Magazine_Subscriptions_5.json.gz'


2020-10-17 04:19:33 (1.02 MB/s) - 'dataset/raw_datasets/Magazine_Subscriptions_5.json.gz' saved [401655/401655]



The downloaded dataset is gzip-compressed. Let's decompress it first.

In [6]:
!gzip -d dataset/raw_datasets/Magazine_Subscriptions_5.json.gz
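
Once decompressed, the file can be inspected before cleaning. The sketch below assumes the file holds one JSON object per line, with fields such as `overall` and `reviewText` as described on the dataset page. The helper name `read_reviews` is hypothetical and is not part of this project's scripts.

```python
import json
from pathlib import Path


def read_reviews(path):
    """Yield one dict per non-empty line of a JSON-lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Path used in this tutorial after running gzip -d.
path = "dataset/raw_datasets/Magazine_Subscriptions_5.json"
if Path(path).exists():
    reviews = list(read_reviews(path))
    print(len(reviews), "reviews loaded")
    if reviews:
        # Peek at the rating and the start of the text of the first review.
        print(reviews[0].get("overall"), reviews[0].get("reviewText", "")[:80])
```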

### Step 2: Clean the reviews

Run the following script to transform the messy reviews into clean tokens. It also produces the edge indices and the unique tokens, which represent the nodes in each graph.

In [17]:
!PYTHONPATH="/scratch/kll482/cathay" python src/preprocessing/feature_engineering/feature_engineering.py

Multiprocessing CPU Count: 10
=== preprocessing file 0 ===
transform json to dataframe...
start reading the json files...
clean tokens and remove rows with no tokens...
cleaning the reviews...
100%|██████████████████████████████████████| 2375/2375 [00:05<00:00, 423.89it/s]
remove empty tokens...
extract unique tokens...
getting the edge index for each graph...
100%|█████████████████████████████████████████████| 3/3 [00:00<00:00,  3.29it/s]
finish!
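
To make the outputs of this step concrete, here is a minimal sketch of how a single review could be turned into graph inputs. It is not the project's `feature_engineering.py`: the cleaning rule (lowercase, keep alphabetic runs), the function names, and the reading of "neighbor" as a window of `k` following tokens are all assumptions made for illustration.

```python
import re


def clean_tokens(text):
    """Lowercase the text and keep alphabetic runs only (assumed cleaning rule)."""
    return re.findall(r"[a-z]+", text.lower())


def build_graph(tokens, k=1):
    """Return (unique_tokens, edge_index) linking tokens at most k positions apart.

    edge_index uses the PyTorch Geometric layout: two parallel lists of
    source and target node ids, with each undirected edge stored both ways.
    """
    unique_tokens = list(dict.fromkeys(tokens))  # first-seen order, no duplicates
    node_id = {tok: i for i, tok in enumerate(unique_tokens)}
    edges = set()
    for i, tok in enumerate(tokens):
        for j in range(i + 1, min(i + k + 1, len(tokens))):
            a, b = node_id[tok], node_id[tokens[j]]
            if a != b:  # skip self-loops caused by repeated words
                edges.update({(a, b), (b, a)})
    ordered = sorted(edges)
    edge_index = [[a for a, _ in ordered], [b for _, b in ordered]]
    return unique_tokens, edge_index


tokens = clean_tokens("Great magazine, great price!")
nodes, edge_index = build_graph(tokens, k=1)
print(nodes)       # ['great', 'magazine', 'price']
print(edge_index)  # [[0, 0, 1, 2], [1, 2, 0, 0]]
```

Because nodes are unique tokens, a repeated word such as "great" maps to a single node, so the graph is smaller than the token sequence. Raising `k` adds edges between tokens that are further apart.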


### Step 3: Build PyTorch Geometric dataloaders

After cleaning the reviews, I build PyTorch Geometric dataloaders for the training, validation, and test sets before training the model. The dataloaders are stored under "dataset/dataloaders/".

In [31]:
!PYTHONPATH="/scratch/kll482/cathay" python src/preprocessing/build_loaders.py

Importing packages...
1. Config Setting
cuda on:  True
2. Read all preprocessed json files
read the datasets...
3. Take a look at the token and unique token length distribution among all reviews
max: 515
min: 1
median: 12.0
mean: 29.536637018212623
Quantile (25%): 3.0
Quantile (75%): 31.0
Quantile (90%): 78.0
max: 317
min: 1
median: 12.0
mean: 23.781872088098265
Quantile (25%): 3.0
Quantile (75%): 28.0
Quantile (90%): 64.0
5. Reclassify overall score from 1~5 to [0,1], representing negative and positive
6. Make sure there is no NA value in the new target column
7. Split the dataframe into train, validation, and test sets
8. Create a vocabulary from unique tokens
9. Conduct undersampling on the based on the target variable distribution
Length of training set: 338
Length of test set: 237
10. Build Pytorch dataset for all neighbors
11. Build and save Pytorch dataloaders
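
Steps 5 and 9 in the log above can be sketched as follows. The script's actual rules are not shown in this tutorial, so the threshold (ratings of 4 and above count as positive) and the random undersampling of the majority class are assumptions, and `binarize` and `undersample` are hypothetical names. The label key `y` and the seed `123` mirror the `target` and `set_seed` values printed in Step 4.

```python
import random
from collections import Counter


def binarize(overall, threshold=4.0):
    """Map a 1-5 star rating to 1 (positive) or 0 (negative); threshold is assumed."""
    return 1 if overall >= threshold else 0


def undersample(rows, label_key="y", seed=123):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)  # avoid blocks of a single class
    return balanced


ratings = [5.0, 5.0, 4.0, 5.0, 1.0, 2.0, 4.0, 5.0]
rows = [{"overall": r, "y": binarize(r)} for r in ratings]
balanced = undersample(rows)
print(Counter(row["y"] for row in balanced))
```

Undersampling would normally be applied to the training split only. That matches the test set here, which keeps its original imbalance of 23 negative versus 214 positive reviews.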


### Step 4: Train the model

It will take a few seconds to set up the config/arguments. Please be patient!

In [33]:
!PYTHONPATH="/scratch/kll482/cathay" python training.py

1. Config Setting
cuda on:  True
Which model I am training? gcn 

size of train loader 22

=== The nodes within the same graph are connecting with only 1 neighbors. Let's start training ===
=== Settings ===
data_path: dataset/processed_datasets/
data_loader_path: dataset/dataloaders/
model_name: gcn
use_cuda: True
set_seed: 123
num_features: 768
n_classes: 1
target: y
neighbor: 1,2
nodes: uniqueTokens
tokens: reviewTokens
embedd_method: random
test_batch_size: 1024
batch_size: 16
epochs: 15
log_every: 10
lr: 0.001
drop_out: 0.0
lr_decay: 0.7
lr_min: 1e-05
n_bad_loss: 4.0
result_path: result/graph/
log_path: logs/graph/
result: result/
device: cuda
num_words: 160993
edge_index: edgeIndex1
best_model: result/graph/checkpoint/edgeIndex1_gcn_2020_10_17_04_43.pth
log_file: logs/graph/edgeIndex1_gcn_2020_10_17_04_43.txt
config_saved_path: result/graph/config_saved/edgeIndex1_gcn_2020_10_17_04_43.pkl
iteration: 0
n_total: 0
train_loss: 0
init_bad_loss: 0
stop: False
best_val_loss: inf
init: 2

13it [00:00, 21.54it/s]
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 16.79it/s]
   % Time: 0:00:34.174358 | Iteration:   300 | Batch:   14/22 | Train loss: 0.1824 | Val loss: 0.5574
22it [00:02,  7.52it/s]
=> EPOCH 15
0it [00:00, ?it/s]
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 10.47it/s]
   % Time: 0:00:36.654272 | Iteration:   310 | Batch:    2/22 | Train loss: 0.1887 | Val loss: 0.5621
11it [00:00, 14.94it/s]
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 16.92it/s]
   % Time: 0:00:37.113123 | Iteration:   320 | Batch:   12/22 | Train loss: 0.1801 | Val loss: 0.5616
19it [00:00, 18.47it/s]
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 17.24it/s]
   % Time: 0:00:37.573082 | Iteration:   330 | Batch:   22/22 | Train loss: 0.2068 | Val loss: 0.5633
22it [00:01, 19.05it/s]
size of train loader 22

=== The nodes within the same graph are connecting with only 2 neighbors. Let's 

15it [00:00, 18.55it/s]
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 16.55it/s]
   % Time: 0:00:17.936141 | Iteration:   280 | Batch:   16/22 | Train loss: 0.0237 | Val loss: 0.6978
22it [00:01, 20.55it/s]
=> EPOCH 14
3it [00:00, 25.16it/s]
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 16.29it/s]
   % Time: 0:00:18.393812 | Iteration:   290 | Batch:    4/22 | Train loss: 0.0334 | Val loss: 0.6987
13it [00:00, 21.49it/s]
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 16.54it/s]
   % Time: 0:00:18.856289 | Iteration:   300 | Batch:   14/22 | Train loss: 0.0207 | Val loss: 0.6979
22it [00:00, 22.04it/s]
=> EPOCH 15
0it [00:00, ?it/s]
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 16.45it/s]
   % Time: 0:00:19.312228 | Iteration:   310 | Batch:    2/22 | Train loss: 0.0305 | Val loss: 0.7024
=> Adjust learning rate to: 0.00011764899999999998
11it [00:00, 19.53it/s]
100%|██████████████
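
The settings `lr`, `lr_decay`, `lr_min`, and `n_bad_loss` suggest a decay-on-plateau schedule, and the logged rate of about 0.000117649 equals `0.001 * 0.7**6`, that is, six decays from the initial rate. The sketch below is one plausible reading, not the code in `training.py`. It assumes the rate is multiplied by `lr_decay` after `n_bad_loss` consecutive validation checks without improvement and never drops below `lr_min`. The class name `PlateauDecay` is hypothetical.

```python
class PlateauDecay:
    """Multiply lr by `decay` after `patience` consecutive non-improving checks."""

    def __init__(self, lr=0.001, decay=0.7, lr_min=1e-5, patience=4):
        self.lr = lr
        self.decay = decay
        self.lr_min = lr_min
        self.patience = patience
        self.best = float("inf")  # best validation loss seen so far
        self.bad = 0              # consecutive checks without improvement

    def step(self, val_loss):
        """Record one validation loss and return the (possibly reduced) lr."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad = 0
        else:
            self.bad += 1
            if self.bad >= self.patience:
                self.lr = max(self.lr * self.decay, self.lr_min)
                self.bad = 0
        return self.lr


schedule = PlateauDecay()
schedule.step(0.55)          # first check sets the best loss
for _ in range(24):          # 24 non-improving checks trigger 6 decays
    lr = schedule.step(0.70)
print(lr)                    # close to 0.001 * 0.7 ** 6
```

In the second run above, the validation loss stays near 0.70 while the training loss falls to about 0.02, which is the kind of plateau that would keep triggering such a decay.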

### Step 5: Test the model and compute the prediction results

In [34]:
!PYTHONPATH="/scratch/kll482/cathay" python testing.py

1. Prepare for initial settings
2. Start testing
start testing...
100%|█████████████████████████████████████████████| 1/1 [00:04<00:00,  4.14s/it]

=== Confusion Matrix ===

    0    1
0  11   12
1  80  134

Finished testing
3. Write the classification report...

=== Classification Report ===

              precision    recall  f1-score   support

         0.0       0.12      0.48      0.19        23
         1.0       0.92      0.63      0.74       214

    accuracy                           0.61       237
   macro avg       0.52      0.55      0.47       237
weighted avg       0.84      0.61      0.69       237
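
The classification report can be cross-checked against the confusion matrix by hand. The sketch below assumes rows are true labels and columns are predicted labels, which is consistent with the support column (23 and 214). The function name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(matrix):
    """Return {label: (precision, recall, f1)} for a square confusion matrix.

    matrix[i][j] counts samples whose true label is i and predicted label is j.
    """
    metrics = {}
    for label in range(len(matrix)):
        tp = matrix[label][label]
        predicted = sum(row[label] for row in matrix)  # column total
        actual = sum(matrix[label])                    # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


matrix = [[11, 12], [80, 134]]  # values from the confusion matrix above
for label, (p, r, f1) in per_class_metrics(matrix).items():
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
total = sum(map(sum, matrix))
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / total
print(f"accuracy={accuracy:.2f}")
```

The low precision for class 0 comes from the first column: of the 91 reviews predicted negative, 80 are actually positive.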


4. Plot the learning curve and save it...
