# Reading Data
### Dataset: Amazon Review 2018
<br>
Source: https://nijianmo.github.io/amazon/index.html <br><br>
Description: <br>
This dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, <br>
this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category <br>
information, price, brand, and image features), and links (also viewed/also bought graphs). <br>
<br>

### Data:
The data used in this project are the reviews for the Electronics category. They have been reduced to the 5-core,<br>
so that each remaining user and each remaining item has at least 5 reviews.<br>
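
As an illustration of what a 5-core reduction means, the sketch below filters a ratings table repeatedly until every remaining user and item has at least `k` rows. It is a minimal sketch, not the dataset authors' actual procedure. The function name `k_core` and the toy data are hypothetical, and the `user` and `item` column names are borrowed from `Electronics.csv`.

```python
import pandas as pd

def k_core(ratings: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """Repeatedly drop rows whose user or item has fewer than k reviews."""
    while True:
        # number of reviews per user / per item, aligned with the rows
        user_counts = ratings.groupby("user")["item"].transform("size")
        item_counts = ratings.groupby("item")["user"].transform("size")
        keep = (user_counts >= k) & (item_counts >= k)
        if keep.all():
            return ratings
        # removing rows can push other users/items below k, so loop again
        ratings = ratings[keep]

# toy data (made up): u3 has a single review and is removed for k=2
toy = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2", "u3"],
    "item": ["a", "b", "a", "b", "a"],
})
core = k_core(toy, k=2)
```

The loop is needed because a single pass is not enough in general: dropping a sparse user can leave an item with too few reviews, and vice versa.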





## Files:

### Electronics_5.json.gz
Includes reviews and ratings. The columns are as follows:
* reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
* asin - ID of the product, e.g. 0000013714
* reviewerName - name of the reviewer
* vote - helpful votes of the review
* style - a dictionary of the product metadata, e.g., "Format" is "Hardcover"
* reviewText - text of the review
* overall - rating of the product
* summary - summary of the review
* unixReviewTime - time of the review (unix time)
* reviewTime - time of the review (raw)
* image - images that users post after they have received the product
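
The file is distributed as gzipped JSON lines, one review per line. The sketch below reads such a file directly without decompressing it first; the notebook itself works on an already decompressed copy. The helper name `read_reviews` and the two sample records are hypothetical and only serve the demonstration.

```python
import gzip
import json
import os
import tempfile

def read_reviews(path, limit=None):
    """Yield one review dict per line from a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            yield json.loads(line)

# made-up sample records, written to a temporary .json.gz file for the demo
sample = [
    {"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", "overall": 5.0},
    {"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", "overall": 4.0, "vote": "2"},
]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.json.gz")
    with gzip.open(sample_path, "wt", encoding="utf-8") as f:
        for record in sample:
            f.write(json.dumps(record) + "\n")
    first = list(read_reviews(sample_path, limit=1))
    everything = list(read_reviews(sample_path))
```

Because `read_reviews` is a generator, only one line is held in memory at a time, which matters for a file with several million reviews.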

### Electronics.csv

This file includes ratings only. Columns are:
 * item 
 * user 
 * rating 
 * timestamp
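
A possible way to load the ratings-only file with pandas is sketched below. It assumes the CSV has no header row, so the column names are passed explicitly, and it assumes `timestamp` is unix time in seconds. Both assumptions should be checked against the real file. The single row used here is made up.

```python
import io
import pandas as pd

# made-up row standing in for the real file (assumption: no header line)
toy_csv = io.StringIO("0000013714,A2SUAM1J3GNN3B,5.0,937612800\n")

ratings = pd.read_csv(
    toy_csv,
    names=["item", "user", "rating", "timestamp"],
    dtype={"item": str, "user": str},  # keep leading zeros in product IDs
)
# assumption: timestamp holds seconds since the unix epoch
ratings["date"] = pd.to_datetime(ratings["timestamp"], unit="s")
```

Reading `item` as a string avoids turning an ID such as `0000013714` into the integer `13714`.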

## Imports

In [2]:
# basic data science packages
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
sns.set()

In [3]:
# json loading
import json

## Constants

In [4]:
reviews_path = '../data/raw/Electronics_5.json'
clean_path = '../data/clean/electronics.csv'

In [16]:
columns = ["reviewerID", "asin", "reviewerName","vote","style","reviewText",
           "overall","summary","unixReviewTime","reviewTime", "image"]

## Reading Data

In [6]:
empty_row = {col:None for col in columns}

In [7]:
# test: parse the first line of the file
new_dict= {}
with open(reviews_path) as f:
    new_dict = json.loads(f.readline())

In [8]:
new_dict

{'overall': 5.0,
 'vote': '67',
 'verified': True,
 'reviewTime': '09 18, 1999',
 'reviewerID': 'AAP7PPBU72QFM',
 'asin': '0151004714',
 'style': {'Format:': ' Hardcover'},
 'reviewerName': 'D. C. Carrad',
 'reviewText': 'This is the best novel I have read in 2 or 3 years.  It is everything that fiction should be -- beautifully written, engaging, well-plotted and structured.  It has several layers of meanings -- historical, family,  philosophical and more -- and blends them all skillfully and interestingly.  It makes the American grad student/writers\' workshop "my parents were  mean to me and then my professors were mean to me" trivia look  childish and silly by comparison, as they are.\nAnyone who says this is an  adolescent girl\'s coming of age story is trivializing it.  Ignore them.  Read this book if you love literature.\nI was particularly impressed with  this young author\'s grasp of the meaning and texture of the lost world of  French Algeria in the 1950\'s and \'60\'s...parti

In [9]:
z = {**empty_row, **new_dict}
z

{'reviewerID': 'AAP7PPBU72QFM',
 'asin': '0151004714',
 'reviewerName': 'D. C. Carrad',
 'vote': '67',
 'style': {'Format:': ' Hardcover'},
 'reviewText': 'This is the best novel I have read in 2 or 3 years.  It is everything that fiction should be -- beautifully written, engaging, well-plotted and structured.  It has several layers of meanings -- historical, family,  philosophical and more -- and blends them all skillfully and interestingly.  It makes the American grad student/writers\' workshop "my parents were  mean to me and then my professors were mean to me" trivia look  childish and silly by comparison, as they are.\nAnyone who says this is an  adolescent girl\'s coming of age story is trivializing it.  Ignore them.  Read this book if you love literature.\nI was particularly impressed with  this young author\'s grasp of the meaning and texture of the lost world of  French Algeria in the 1950\'s and \'60\'s...particularly poignant when read in  1999 from another ruined and abando
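
The merge in the cell above fills in missing columns, but it also keeps any key that is not listed in `columns`, such as `verified`. If every row should have exactly the expected keys, a small helper along these lines could be used. The name `normalize` is hypothetical and the record is made up.

```python
def normalize(record, columns):
    """Return a dict with exactly the keys in `columns`; missing values become None."""
    return {col: record.get(col) for col in columns}

# made-up record: 'vote' is missing and 'verified' is not an expected column
expected = ["reviewerID", "asin", "vote"]
record = {"reviewerID": "AAP7PPBU72QFM", "asin": "0151004714", "verified": True}

merged = {**dict.fromkeys(expected), **record}  # same idea as the cell above
clean = normalize(record, expected)
```

Here `merged` still contains `verified`, while `clean` contains only the three expected keys, with `vote` set to `None`.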

In [15]:
# print the first six raw lines of the file
with open(reviews_path) as f:
    for i, line in enumerate(f):
        if i > 5:
            break
        print(i, line)

0 {"overall": 5.0, "vote": "67", "verified": true, "reviewTime": "09 18, 1999", "reviewerID": "AAP7PPBU72QFM", "asin": "0151004714", "style": {"Format:": " Hardcover"}, "reviewerName": "D. C. Carrad", "reviewText": "This is the best novel I have read in 2 or 3 years.  It is everything that fiction should be -- beautifully written, engaging, well-plotted and structured.  It has several layers of meanings -- historical, family,  philosophical and more -- and blends them all skillfully and interestingly.  It makes the American grad student/writers' workshop \"my parents were  mean to me and then my professors were mean to me\" trivia look  childish and silly by comparison, as they are.\nAnyone who says this is an  adolescent girl's coming of age story is trivializing it.  Ignore them.  Read this book if you love literature.\nI was particularly impressed with  this young author's grasp of the meaning and texture of the lost world of  French Algeria in the 1950's and '60's...particularly po

In [12]:
def simplecount(filename):
    lines = 0
    with open(filename) as f:
        for line in f:
            lines += 1
    return lines

In [13]:
%%time
count = simplecount(reviews_path)
print(count)

6739590
Wall time: 20.8 s
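
Counting lines does not require decoding the text. The sketch below reads the file in binary blocks and counts newline bytes, which may be faster than iterating over decoded lines, although no timing was measured here. The function name `count_lines` and the block size are arbitrary choices for illustration.

```python
import os
import tempfile

def count_lines(path, block_size=1 << 20):
    """Count newline bytes by reading the file in binary blocks."""
    count = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            count += block.count(b"\n")
    # note: a final line without a trailing newline is not counted
    return count

# small demo on a temporary three-line file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w") as f:
        f.write("a\nb\nc\n")
    demo_count = count_lines(demo_path)
```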


In [4]:
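# Read the file line by line and write it to CSV in chunks of 500,000 rows.
# 13 full chunks cover 6,500,000 of the 6,739,590 lines counted above;
# the remaining lines are written as a final, smaller chunk.
# mode='a' appends, so delete any existing file at clean_path before re-running.
with open(reviews_path) as f:
    for i in range(13):
        reviews_list = []
        for j in range(500_000):
            row = json.loads(f.readline())
            reviews_list.append(row)
        # columns=columns keeps the same columns in the same order for every chunk;
        # keys not listed in `columns` (e.g. 'verified') are dropped
        pd.DataFrame(reviews_list, columns=columns).to_csv(clean_path, mode='a', index=False, header=(i==0))
        print(i, end=', ')
    # remaining lines after the 13 full chunks
    reviews_list = []
    line = f.readline()
    while line:
        row = json.loads(line)
        reviews_list.append(row)
        line = f.readline()
    pd.DataFrame(reviews_list, columns=columns).to_csv(clean_path, mode='a', index=False, header=False)
    print(13)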

0
1
2
3
4
5
6
7
8
9
10
11
12
13
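
The conversion above relies on a chunk count that was worked out by hand from the line count. A variant that does not need to know the file length in advance is sketched below using `itertools.islice`. The function name `json_lines_to_csv` is hypothetical, and the demo runs on a small made-up file rather than on the real data.

```python
import json
import os
import tempfile
from itertools import islice

import pandas as pd

def json_lines_to_csv(src, dst, columns, chunk_size=500_000):
    """Convert a JSON-lines file to CSV chunk by chunk; return the row count."""
    total = 0
    first = True
    with open(src) as f:
        while True:
            lines = list(islice(f, chunk_size))  # at most chunk_size lines
            if not lines:
                break
            rows = [json.loads(line) for line in lines]
            # fixed column list: same order in every chunk, NaN where a key is missing
            pd.DataFrame(rows, columns=columns).to_csv(
                dst, mode="w" if first else "a", index=False, header=first
            )
            total += len(rows)
            first = False
    return total

# demo with five made-up records and a chunk size of 2 (three chunks)
cols = ["asin", "overall", "vote"]
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "toy.json")
    dst = os.path.join(tmp, "toy.csv")
    with open(src, "w") as f:
        for n in range(5):
            rec = {"asin": f"B{n}", "overall": 5.0}
            if n % 2:
                rec["vote"] = str(n)  # only some records have votes
            f.write(json.dumps(rec) + "\n")
    written = json_lines_to_csv(src, dst, cols, chunk_size=2)
    back = pd.read_csv(dst)
```

The first chunk is written with `mode="w"`, so re-running the conversion overwrites an earlier output file instead of appending to it.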
