# Reading Data
### Dataset: Amazon Review 2018
<br>
Source: https://nijianmo.github.io/amazon/index.html <br><br>
Description: <br>
This dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, <br>
this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category <br>
information, price, brand, and image features), and links (also viewed/also bought graphs). <br>
<br>

### Data:
The data used in this project are the reviews for the Electronics category. They have been reduced to the 5-core subset,<br>
in which every remaining user and item has at least 5 reviews.<br>
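
The 5-core reduction described above can be illustrated with a small sketch. The dataset is already distributed in 5-core form, so this is not a step of the project; the function name `k_core` and the toy data are assumptions made for illustration only.

```python
import pandas as pd


def k_core(df, k=5, user_col="reviewerID", item_col="asin"):
    """Iteratively drop users and items that have fewer than k reviews."""
    while True:
        # number of reviews written by each row's user / received by each row's item
        user_counts = df[user_col].map(df[user_col].value_counts())
        item_counts = df[item_col].map(df[item_col].value_counts())
        keep = (user_counts >= k) & (item_counts >= k)
        if keep.all():
            return df
        # removing rows can push other users or items below k, so repeat
        df = df[keep]


# toy data (made up): u3 has a single review and is removed for k=2
toy = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u2", "u2", "u3"],
    "asin": ["a", "b", "a", "b", "a"],
})
core = k_core(toy, k=2)
```

The loop is needed because a single filtering pass is not enough in general: dropping a user's reviews lowers the counts of the items that user reviewed.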





## Files:

### Electronics_5.json.gz
Includes reviews and ratings. The columns are as follows:
* reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
* asin - ID of the product, e.g. 0000013714
* reviewerName - name of the reviewer
* vote - helpful votes of the review
* style - a dictionary of the product metadata, e.g., "Format" is "Hardcover"
* reviewText - text of the review
* overall - rating of the product
* summary - summary of the review
* unixReviewTime - time of the review (unix time)
* reviewTime - time of the review (raw)
* image - images that users post after they have received the product
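
As a rough illustration of the fields listed above, the sketch below parses a single review record, assuming the file stores one JSON object per line. The record is made up for this example (only the two IDs are taken from the list above), and the `columns` list here is a shortened, hypothetical selection.

```python
import json

# A made-up record in the layout described above (values are illustrative only).
line = (
    '{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", '
    '"overall": 5.0, "vote": "1,024", "style": {"Format": "Hardcover"}, '
    '"reviewText": "Works as described.", "unixReviewTime": 1514764800}'
)

record = json.loads(line)

# Not every review carries every field, so missing keys are mapped to None.
columns = ["reviewerID", "asin", "vote", "style", "overall", "image"]
row = {col: record.get(col) for col in columns}
```

Note that `style` is parsed into a nested dictionary and that `vote` is a string that may contain a thousands separator.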


## Imports

In [1]:
# basic data science packages
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
sns.set()

In [2]:
# json loading
import json

In [3]:
# import custom function for reading data
from scripts.utils_data import read_in_chunks

## Constants

In [12]:
reviews_path = '../data/raw/Electronics_5.json'
new_path = '../data/processed/electronics.csv'
new_path_simple = '../data/processed/electronics_simple.csv'

In [5]:
columns = ["reviewerID", "asin", "reviewerName","vote","style","reviewText","verified",
           "overall","summary","unixReviewTime","reviewTime", "image"]

new_columns = ["overall","vote","reviewText", "reviewTime"]

In [9]:
DTYPES = {"reviewerID": object,
          "asin": object,
          "reviewerName": object,
          "vote": object,
          "style": object,
          "reviewText": object,
          "verified":bool,
          "overall": np.float64,
          "summary": object,
          "unixReviewTime": np.int64,
          "reviewTime": object,
          "image": object}

NROWS = 10_000

## Converting JSON to CSV

In [7]:
read_in_chunks(reviews_path, new_path, 500_000, columns)

499999 lines processed
999999 lines processed
1499999 lines processed
1999999 lines processed
2499999 lines processed
2999999 lines processed
3499999 lines processed
3999999 lines processed
4499999 lines processed
4999999 lines processed
5499999 lines processed
5999999 lines processed
6499999 lines processed
Saved content of ../data/raw/Electronics_5.json to ../data/processed/electronics.csv succesfully
Processed 6739589 lines in 0:05:10.618888
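
The source of `read_in_chunks` is not shown in this notebook. The sketch below is only a guess at how such a helper could work, assuming the input holds one JSON object per line. The name `read_in_chunks_sketch`, its return value, and the sample data are hypothetical and are not taken from `scripts.utils_data`.

```python
import json
import os
import tempfile

import pandas as pd


def read_in_chunks_sketch(src, dst, chunk_size, columns):
    """Convert a JSON-lines file to CSV, buffering at most chunk_size records."""
    buffer, total, first = [], 0, True

    def flush():
        nonlocal first
        # write the header only once, then append the following chunks
        pd.DataFrame(buffer, columns=columns).to_csv(
            dst, mode="w" if first else "a", header=first, index=False
        )
        first = False
        buffer.clear()

    with open(src, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # skip blank lines
            buffer.append(json.loads(line))
            total += 1
            if len(buffer) == chunk_size:
                flush()
    if buffer or first:
        flush()  # write the last partial chunk (or just the header)
    return total


# small demonstration on made-up data
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample.json")
    dst = os.path.join(tmp, "sample.csv")
    with open(src, "w", encoding="utf-8") as fh:
        fh.write('{"asin": "A", "overall": 5.0}\n')
        fh.write('{"asin": "B", "overall": 3.0}\n')
        fh.write('{"asin": "C"}\n')
    n_rows = read_in_chunks_sketch(src, dst, 2, ["asin", "overall"])
    result = pd.read_csv(dst)
```

Processing in chunks keeps memory use bounded by `chunk_size`, which matters for a file with several million reviews.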


## Checking the CSV File

In [15]:
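# load the converted CSV with explicit dtypes; parse_dates=[9] refers to column index 9 (unixReviewTime)
df = pd.read_csv(new_path, parse_dates=[9], dtype=DTYPES, low_memory=True)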

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6739590 entries, 0 to 6739589
Data columns (total 12 columns):
 #   Column          Dtype  
---  ------          -----  
 0   reviewerID      object 
 1   asin            object 
 2   reviewerName    object 
 3   vote            object 
 4   style           object 
 5   reviewText      object 
 6   verified        bool   
 7   overall         float64
 8   summary         object 
 9   unixReviewTime  object 
 10  reviewTime      object 
 11  image           object 
dtypes: bool(1), float64(1), object(10)
memory usage: 572.0+ MB
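
In the `df.info()` output above, `unixReviewTime` is listed as `object` rather than as a datetime. One possible way to obtain a datetime column is sketched below, assuming the values are Unix timestamps in seconds. The `sample` frame and the `review_dt` column name are made up for illustration.

```python
import pandas as pd

# made-up sample: two valid timestamps stored as strings and one bad value
sample = pd.DataFrame({"unixReviewTime": ["1514764800", "1514851200", "not a number"]})

# coerce to numbers first; values that cannot be parsed become NaN
seconds = pd.to_numeric(sample["unixReviewTime"], errors="coerce")

# interpret the numbers as seconds since the Unix epoch; NaN becomes NaT
sample["review_dt"] = pd.to_datetime(seconds, unit="s")
```

Coercing with `errors="coerce"` keeps unparseable entries as missing values instead of raising an error.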


## Create Simplified Version with Review, Rating and Time

In [17]:
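# remove thousands separators from vote counts; missing votes become 0
df['vote'] = df.vote.str.replace(',','').fillna(0).astype('int64')
# ratings are whole numbers, so store them as small integers; missing ratings become 0
df['overall'] = df.overall.fillna(0).astype('int16')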
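
The effect of the `vote` cleaning step can be checked on a tiny example. The values below are made up. `regex=False` is passed explicitly here so the comma is treated as a literal character regardless of the pandas version.

```python
import numpy as np
import pandas as pd

# made-up vote values: one with a thousands separator, one plain, one missing
vote = pd.Series(["1,024", "7", np.nan], dtype=object)

# same chain as above: drop commas, fill missing values with 0, cast to integers
cleaned = vote.str.replace(",", "", regex=False).fillna(0).astype("int64")
```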

In [18]:
df[new_columns].to_csv(new_path_simple, index=False)