# Preparing Data

In this project, we use Yelp online reviews as the dataset and focus on building a filter that separates real reviews from fake ones by analyzing the review text. For more information about the dataset, [find this].

In this first notebook, we begin processing the dataset by normalizing the review text and preparing it with pretrained word vectors. Later notebooks train different learning algorithms on the refined dataset.

Let's import some packages and data files.

[find this]: http://odds.cs.stonybrook.edu/yelpnyc-dataset/

In [1]:
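import numpy as np
import torch
import torchtext  # torchtext.data is referenced by its full path, since `data` below names the metadata array
import spacy
import nltk

# Review text, tab-separated: <user_id> <restaurant_id> <date> <review>
# (the np.str and np.float aliases were removed from NumPy; use the builtins instead)
content = np.loadtxt("../data/reviewContent", dtype=str, delimiter="\t")
# Metadata, tab-separated: <user_id> <restaurant_id> <rating> <label> <date>
data = np.loadtxt("../data/metadata",
                  dtype={'names': ('user_id', 'prod_id', 'rating', 'label', 'date'),
                         'formats': (np.int_, np.int_, float, np.int_, '|S11')},
                  delimiter="\t")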

## Data Regroup & Split
In the raw data files, each sample contains the following fields:
* metadata
```
<user_id> <restaurant_id> <rating> <label> <date>
```
* reviewContent
```
<user_id> <restaurant_id> <date> <review>
```


For this project, we merge the two files into a single dataset with one row per review (the saved files also keep `user_id` and `prod_id` as the first two columns):

| Label | Rating | Review |
| :-----------: |:-------------:| :-----|
| 1 for real, -1 for fake    | 5.0 | The food at ... seated. |
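
As a minimal sketch of this regrouping, the snippet below joins one metadata line with its matching reviewContent line. The helper `merge_sample` and both sample lines are made up for illustration, and the sketch assumes that row *i* in both files describes the same review, which is also what the column-wise stacking in this notebook relies on.

```python
def merge_sample(meta_line, review_line):
    """Combine one metadata row and one reviewContent row into (label, rating, review)."""
    user_id, prod_id, rating, label, date = meta_line.rstrip("\n").split("\t")
    # Split at most 3 times so any tab inside the review text stays in the review.
    r_user, r_prod, r_date, review = review_line.rstrip("\n").split("\t", 3)
    # Assumption: both files list the reviews in the same order.
    if (user_id, prod_id) != (r_user, r_prod):
        raise ValueError("metadata and reviewContent rows are misaligned")
    return int(label), float(rating), review

# Hypothetical rows, not taken from the dataset.
meta = "923\t0\t3.0\t-1\t2014-12-08"
text = "923\t0\t2014-12-08\tThe food was great."
sample = merge_sample(meta, text)
print(sample)
```

The ID check is an extra safeguard that the notebook itself does not perform; it is cheap and catches files that have drifted out of alignment.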

The cell below regroups the columns, shuffles the rows, and writes 60/20/20 train/dev/test splits:

In [2]:
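# Review text as a column vector, one row per sample
sc = content[:, 3].reshape(content.shape[0], 1)
# Stacking int and float columns promotes every metadata value to float (e.g. label -1 becomes -1.0)
dt = np.array([data['user_id'], data['prod_id'], data['label'], data['rating']])
# Columns: user_id, prod_id, label, rating, review
rst = np.hstack([dt.T, sc])
np.random.shuffle(rst)  # shuffles rows in place; unseeded, so the split differs between runs

# 60% train, 20% dev, remainder test
train_size = round(rst.shape[0] * 0.6)
cv_size = round(rst.shape[0] * 0.2)
tst_size = rst.shape[0] - train_size - cv_size

np.savetxt('../data/input/train', rst[:train_size], fmt='%s', delimiter='\t')
np.savetxt('../data/input/dev', rst[train_size:(train_size+cv_size)], fmt='%s', delimiter='\t')
np.savetxt('../data/input/test', rst[(train_size+cv_size):], fmt='%s', delimiter='\t')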
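
The split arithmetic can be checked in isolation. The sketch below uses a hypothetical helper `split_sizes` and a stand-in list of 1003 row indices (not the real dataset size) to show that rounding the train and dev sizes and giving the remainder to the test set always covers every row exactly once.

```python
def split_sizes(n, train_frac=0.6, dev_frac=0.2):
    """Return (train, dev, test) sizes; the test set takes whatever rows remain."""
    train = round(n * train_frac)
    dev = round(n * dev_frac)
    return train, dev, n - train - dev

rows = list(range(1003))  # stand-in for the shuffled samples
train_n, dev_n, test_n = split_sizes(len(rows))

# Slice the same way the notebook does: consecutive, non-overlapping ranges.
train = rows[:train_n]
dev = rows[train_n:train_n + dev_n]
test = rows[train_n + dev_n:]
print(train_n, dev_n, test_n)
```

Computing the test size as a remainder, rather than rounding a third fraction, avoids losing or duplicating a row when the fractions do not divide the row count evenly.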

## Text Normalization



In [10]:
ps = nltk.stem.porter.PorterStemmer()

def tokenizer(sentence):
    """Split a review into tokens and reduce each token to its Porter stem."""
    return [ps.stem(tok) for tok in nltk.word_tokenize(sentence)]

# Field, LabelField, TabularDataset and Iterator belong to the legacy torchtext API
TEXT = torchtext.data.Field(sequential=True, tokenize=tokenizer, lower=True)
LABEL = torchtext.data.LabelField(sequential=False, dtype=torch.float)
# File columns: user_id, prod_id, label, rating, review; (None, None) skips a column
fields = [(None, None), (None, None), ('label', LABEL), (None, None), ('text', TEXT)]


# Generate dataset for torchtext
train_data, val_data, test_data = torchtext.data.TabularDataset.splits(
        path='../data/input', train='train', validation='dev', test='test',
        format='tsv', fields=fields)
print(vars(train_data.examples[0]))

# Build the vocabulary from the training set only and attach pretrained GloVe vectors
TEXT.build_vocab(train_data, vectors="glove.6B.100d")
LABEL.build_vocab(train_data)

# Create batches and iterators over the datasets
BATCH_SIZE = [64, 64, 64]
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iter, val_iter, test_iter = torchtext.data.Iterator.splits(
        (train_data, val_data, test_data),
        batch_sizes=BATCH_SIZE, device=device)

{'label': '-1.0', 'text': ['thi', 'littl', 'place', 'in', 'soho', 'is', 'wonder', '.', 'i', 'had', 'a', 'lamb', 'sandwich', 'and', 'a', 'glass', 'of', 'wine', '.', 'the', 'price', 'shock', 'me', 'for', 'how', 'small', 'the', 'serv', 'wa', ',', 'but', 'then', 'again', ',', 'thi', 'is', 'soho', '.', 'the', 'staff', 'can', 'be', 'a', 'littl', 'snotti', 'and', 'rude', ',', 'but', 'the', 'food', 'is', 'great', ',', 'just', 'do', "n't", 'expect', 'world-class', 'servic', '.']}
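
To make the vocabulary and batching steps more concrete, here is a simplified, pure-Python sketch of the idea: count tokens, map each one to an integer index, replace unseen tokens with an unknown marker, and pad the sequences in a batch to equal length. The functions `build_vocab`, `numericalize`, and `pad_batch` are hypothetical stand-ins written for this illustration, not torchtext's implementation, and the token lists are invented.

```python
from collections import Counter

def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    """Map tokens to indices: specials first, then by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Ties in frequency are broken alphabetically so the result is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Replace each token by its index; unseen tokens map to <unk>."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

def pad_batch(id_lists, pad_id):
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(ids) for ids in id_lists)
    return [ids + [pad_id] * (width - len(ids)) for ids in id_lists]

# Invented training tokens; the vocabulary is built from training data only.
reviews = [["the", "food", "is", "great"], ["the", "staff", "is", "rude"]]
itos, stoi = build_vocab(reviews)

ids = numericalize(["the", "wine", "is", "great"], stoi)  # "wine" is unseen
batch = pad_batch([ids, numericalize(["the", "food"], stoi)], stoi["<pad>"])
print(ids)
print(batch)
```

Building the vocabulary from the training split alone, as the notebook does, keeps tokens that occur only in the dev or test reviews out of the model's index.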
