## Downloading the Dataset

Download `yelp_review_full_csv.tar.gz` from https://drive.google.com/drive/u/0/folders/0Bz8a_Dbh9Qhbfll6bVpmNUtUcFdjYmF2SEpmZUZUcVNiMUw1TWN6RDV3a0JHT3kxLVhVR2M and extract it.

Make sure that `train.csv` and `test.csv` are present in this directory before running the cells below.
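
If you prefer to do the extraction from Python, the step above could be sketched roughly as follows. This is a minimal sketch, not part of the original notebook: the helper name `extract_csvs` is hypothetical, and it assumes the archive may keep the CSV files inside a top-level folder, so members are matched by base name only.

```python
import os
import shutil
import tarfile


def extract_csvs(archive_path, out_dir=".", wanted=("train.csv", "test.csv")):
    """Copy the wanted CSV members of a .tar.gz archive into out_dir.

    Hypothetical helper: members are matched by base name, so any folder
    layout inside the archive is ignored. Returns the paths written.
    """
    written = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            name = os.path.basename(member.name)
            if member.isfile() and name in wanted:
                target = os.path.join(out_dir, name)
                # Stream the member to disk rather than calling extractall(),
                # which would recreate the archive's directory structure.
                with tar.extractfile(member) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                written.append(target)
    return written
```

Calling `extract_csvs("yelp_review_full_csv.tar.gz")` from this directory should leave `train.csv` and `test.csv` next to the notebook, provided the archive actually contains files with those names.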

In [1]:
import os

assert os.path.exists('train.csv') and os.path.exists('test.csv')

In [2]:
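# `!export` only sets the variable in a throwaway subshell, so it has no lasting effect;
# `%env` sets it for the kernel process and any child processes it starts.
%env PYTHONIOENCODING=utf8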

In [3]:
import codecs
import csv
import random

import pandas as pd
import spacy

random.seed(1357)  # fixed seed so the shuffle below is reproducible


def read_input_file(input_file):
    """Read a (label, text) CSV, shuffle its rows, and return labels and texts as two lists."""
    # Use a context manager so the file handle is closed after reading.
    with codecs.open(input_file, "r", encoding="utf-8") as f:
        lines = list(csv.reader(f))
    random.shuffle(lines)
    new_labels = []
    new_lines = []
    # Binary-sentiment variant (disabled): drop 3-star reviews and map
    # ratings below 3 to "0" and ratings above 3 to "1".
#     for label, line in lines:
#         if int(label) < 3:
#             new_labels.append("0")
#             new_lines.append(line)
#         elif int(label) > 3:
#             new_labels.append("1")
#             new_lines.append(line)
    # Keep all five star ratings as labels.
    for label, line in lines:
        new_labels.append(label)
        new_lines.append(line)

    print(new_labels[:10], new_lines[:10])
    print(len(new_labels), len(new_lines))
    return new_labels, new_lines


In [4]:
labels_train, content_train = read_input_file("train.csv")
assert len(labels_train) == len(content_train)
print(len(labels_train))

# Hold out the first 7,000 shuffled training rows as a development set.
labels_dev, content_dev = labels_train[:7000], content_train[:7000]
keys_dev = ["dev"] * len(labels_dev)

# The remaining rows form the training set.
labels_train, content_train = labels_train[7000:], content_train[7000:]
keys_train = ["train"] * len(labels_train)

['3', '5', '3', '4', '3', '3', '5', '4', '2', '2'] ['Still looking for that elusive Mexican food joint, so I stopped by here and gave it a try. The salsa was very good, nice flavor with just a right amount of heat...in other words you could taste the salsa.  I ordered a chicken burro which was good, not great.  the chicken didnt seem to be grilled but must have been, no flavor just blah. Prices OK, service very good, tea and water pitcher on table which is great for those refills. Will give it another try.', 'Amazing! I Purple rice makes me healthier.\\nI had Las Vegas roll, Crunch California, Oh My God, and Crazy roll. Foods are soooo good! I will revisit there again!', "The food here was good, but wasn't anything to rave about.\\n\\nWe started with the fried calamari.  It was served with banana peppers, which I love.  The calamari itself was tender and not rubbery.\\n\\nWe both went for the chicken parmesan.  My husband loved it, I thought it was just average.  I wasn't in love with 
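
The hold-out step in the cell above can also be written as a small reusable function. This is only a sketch: `holdout_split` is a hypothetical name, and it assumes the rows have already been shuffled (as `read_input_file` does), so taking the first `dev_size` rows amounts to a random sample.

```python
def holdout_split(labels, texts, dev_size):
    """Split two parallel lists into (dev, train) parts.

    The first dev_size rows become the dev set and the rest stay in train.
    Assumes the rows were shuffled beforehand.
    """
    if len(labels) != len(texts):
        raise ValueError("labels and texts must have the same length")
    if not 0 <= dev_size <= len(labels):
        raise ValueError("dev_size must be between 0 and the number of rows")
    dev = (labels[:dev_size], texts[:dev_size])
    train = (labels[dev_size:], texts[dev_size:])
    return dev, train
```

With the notebook's variables this would be called as `(labels_dev, content_dev), (labels_train, content_train) = holdout_split(labels_train, content_train, 7000)`.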

In [5]:
labels_test, content_test = read_input_file("test.csv")
keys_test = ["test"] * len(labels_test)
assert len(labels_test) == len(content_test)
print(len(labels_test))

['5', '3', '5', '5', '1', '2', '2', '2', '2', '1'] ["Avec un ami, nous y avons pass\\u00e9 une journ\\u00e9e de r\\u00eave. Le terrain est parfaitement entretenu et il fut tr\\u00e8s agr\\u00e9able de s'y promener, au travers des diff\\u00e9rents jardins.", 'The food is gonna be what the food is gonna be, you know that.\\nSo I judge a FFchain on a few things.\\nIs it clean.\\nAre they friendly.\\nDo they fahqew in the drive thru?\\n\\nThis Taco Hell scores consistently mediocre on all counts, except that they are quite friendly.\\n\\nIt is refreshing to pull into the drive thru, and even if there were a line of cars ahead of you, to hear the mechanical voice in the box say, \\n\\"Hi, How are you?\\"\\n...and then pause for an answer!\\n\\nThey do it all the time, so must be made to say it, but it always seems sincere.  It catches you off guard at first, but now I have grown to really dig the tone it conveys.  I don\'t think that the couple of seconds it may delay the order is a big dea

In [6]:
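# Tokenizer-only pipeline. 'en' is a spaCy 2.x shortcut link; spaCy 3 removed
# shortcut links, so there the full package name (e.g. 'en_core_web_sm') is required.
nlp = spacy.load('en', disable=['parser', 'tagger', 'ner'])

def tokenize(text):
    """Strip bracket markers, then return the lowercased spaCy tokens joined by spaces."""
    #text = " ".join(text)
    text = text.replace("-LRB-", '')
    text = text.replace("-RRB-", " ")
    text = text.strip()
    tokens = " ".join([t.text.lower() for t in nlp(text)])
    return tokens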

# Map star ratings 1-5 to zero-based class ids 0-4 and tokenize every review.
labels_train = [int(i) - 1 for i in labels_train]
content_train = [tokenize(i) for i in content_train]

labels_test = [int(i) - 1 for i in labels_test]
content_test = [tokenize(i) for i in content_test]

labels_dev = [int(i) - 1 for i in labels_dev]
content_dev = [tokenize(i) for i in content_dev]

# assert(len(labels) == len(content))
# print(labels[:3])
# print(content[:3])
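
Tokenizing several hundred thousand reviews one call at a time can be slow. A possible alternative, sketched below under stated assumptions, streams the texts through spaCy's `nlp.pipe`. The names `clean` and `tokenize_batch` are hypothetical, and the `nlp` argument is assumed to be any spaCy pipeline object, for example the one loaded above or a tokenizer-only `spacy.blank("en")`; the output is expected, but not verified here, to match `tokenize()`.

```python
def clean(text):
    """Same pre-cleaning as tokenize(): drop bracket markers, trim whitespace."""
    return text.replace("-LRB-", "").replace("-RRB-", " ").strip()


def tokenize_batch(texts, nlp, batch_size=1000):
    """Tokenize many texts by streaming them through nlp.pipe.

    nlp is assumed to be a spaCy Language object (hypothetical usage:
    spacy.blank("en")). Returns one lowercased, space-joined string per text.
    """
    cleaned = (clean(t) for t in texts)
    return [
        " ".join(tok.text.lower() for tok in doc)
        for doc in nlp.pipe(cleaned, batch_size=batch_size)
    ]
```

In the notebook this would replace each list comprehension, for example `content_train = tokenize_batch(content_train, nlp)`.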


In [7]:
print(set(labels_train))
print(set(labels_test))
print(set(labels_dev))

{0, 1, 2, 3, 4}
{0, 1, 2, 3, 4}
{0, 1, 2, 3, 4}
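
Printing the label sets confirms that all five classes appear in each split, but not how balanced they are. A small sketch for checking the class distribution is given below; `label_distribution` is a hypothetical helper, not something the notebook defines.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: (count, fraction)} for a list of class ids, sorted by label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (counts[label], counts[label] / total) for label in sorted(counts)}
```

For example, `label_distribution(labels_dev)` would show how many of the held-out reviews fall into each of the classes 0 to 4.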


In [9]:
labels = labels_train + labels_dev + labels_test
content = content_train + content_dev + content_test
keys = keys_train + keys_dev + keys_test

content[0]

'first of all i have experience of cocktails from all over the world , most of all from ny at   different mixologists bars , like milk and honey for ex.\\n\\nwe order two cocktails , the bartender or if he like to try to be a mixologist , he made the cocktails in front of us , and two of the ingridients was finsished after half use , for ex the ginger , but he did still make it as nothing has happend.\\nhe seamed pretty nervous , perhaps he was on drugs.\\n\\ncocktails tasted only strong spirits , my grirlfriend could not drink it , i told him that she do nt want it , he asked what she wanted instead ? she wanted red wine instead(cause he could not make cocktails ) , then he went away for 5 minutes , we starred at him and then he just asked like nothing happend \\"hey what do you like to have ? , --ehh ? we just said red wine -ok that will be xx pounds\\"\\n\\nwe just went away of this stupid guy and place.\\n\\nthey also charge you 50 pence for pay by card ? ? come on , what s the dea

In [11]:
df = pd.DataFrame({'text' : content, 'label' : labels, 'exp_split' : keys})
df.to_csv('yelp_dataset.csv', index=False)
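
After writing the file, it can be worth confirming that every row carries the expected split tag. The following is a minimal sketch using only the standard library; `count_splits` is a hypothetical name, and the column name `exp_split` is taken from the DataFrame built above.

```python
import csv
from collections import Counter


def count_splits(csv_path, split_column="exp_split"):
    """Count the rows of a CSV file per value of split_column."""
    # newline="" lets the csv module handle quoted fields with embedded newlines.
    with open(csv_path, newline="", encoding="utf-8") as f:
        return Counter(row[split_column] for row in csv.DictReader(f))
```

Running `count_splits("yelp_dataset.csv")` should report 7,000 rows for `dev`, with the remaining rows divided between `train` and `test`.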

In [12]:
%run "../preprocess_data_BC.py" --data_file yelp_dataset.csv --output_file ./vec_yelp.p --word_vectors_type fasttext.simple.300d --min_df 20

Vocabulary size :  38636
Found 23415 words in model out of 38636
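
The preprocessing script itself is not shown here, so the following is only an illustration of what a `--min_df 20` threshold usually means: a token is kept if it occurs in at least 20 documents. The sketch assumes the common document-frequency convention (as in scikit-learn's `CountVectorizer`); `build_vocab` is a hypothetical function and may differ from what `preprocess_data_BC.py` actually does.

```python
from collections import Counter


def build_vocab(tokenized_texts, min_df=20):
    """Return the sorted tokens that occur in at least min_df documents.

    tokenized_texts holds space-joined token strings, like the 'text' column
    written above. Assumption: min_df counts documents, not total occurrences.
    """
    doc_freq = Counter()
    for text in tokenized_texts:
        # set() ensures a token is counted once per document.
        doc_freq.update(set(text.split()))
    return sorted(word for word, df in doc_freq.items() if df >= min_df)
```

Under that assumption, the reported vocabulary size would be the number of distinct tokens passing the threshold, and the second output line would count how many of them have a pretrained vector.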


In [2]:
import pandas as pd


In [4]:
df = pd.read_csv('yelp_dataset.csv')
df

| | text | label | exp_split |
|---|---|---|---|
| 0 | first of all i have experience of cocktails fr... | 0 | train |
| 1 | they have the best mixture of asian fusion foo... | 4 | train |
| 2 | my husband michael loves it here .... not on... | 3 | train |
| 3 | so the hotel is nice and new . the rooms are n... | 1 | train |
| 4 | this is the second time i 've been to this pla... | 4 | train |
| ... | ... | ... | ... |
| 699995 | i could give two stars but one star is more pr... | 0 | test |
| 699996 | 2.5-stars is more than fair ! but i am roundin... | 1 | test |
| 699997 | we came here for for a pool party on vegas on ... | 1 | test |
| 699998 | love the vibe , be ready for a heart attack.\n... | 3 | test |
