# Reading Data
### Dataset: Amazon Review 2018
<br>
Source: https://nijianmo.github.io/amazon/index.html <br><br>
Description: <br>
This dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, <br>
this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category <br>
information, price, brand, and image features), and links (also viewed/also bought graphs). <br>
<br>

### Data:
The data used in this project are the reviews for the Electronics category. They have been reduced to the 5-core subset,<br>
in which every remaining user and item has at least 5 reviews.<br>
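
The 5-core reduction can be illustrated with a short pandas sketch. This is not the dataset authors' code: the function name `k_core` and the toy data are assumptions made for illustration. The idea is to drop users and items with fewer than `k` reviews repeatedly, because removing one row can push another user or item below the threshold.

```python
import pandas as pd


def k_core(df: pd.DataFrame, k: int = 5,
           user_col: str = "reviewerID", item_col: str = "asin") -> pd.DataFrame:
    """Keep only rows whose user and item each appear at least k times."""
    while True:
        # Number of reviews per user / per item, aligned to each row.
        user_counts = df[user_col].map(df[user_col].value_counts())
        item_counts = df[item_col].map(df[item_col].value_counts())
        keep = (user_counts >= k) & (item_counts >= k)
        if keep.all():
            return df
        df = df[keep]


# Toy data (hypothetical): two users and two items fully connected,
# plus one stray review by a third user.
toy = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u2", "u2", "u3"],
    "asin":       ["a1", "a2", "a1", "a2", "a1"],
})
core = k_core(toy, k=2)
print(core)
```

With `k=2`, the single review by `u3` is removed and the four remaining rows form the 2-core.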





## Files:

### Electronics_5.json.gz
Includes reviews and ratings. The columns are as follows:
* reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
* asin - ID of the product, e.g. 0000013714
* reviewerName - name of the reviewer
* vote - helpful votes of the review
* style - a dictionary of the product metadata, e.g., "Format" is "Hardcover"
* reviewText - text of the review
* verified - whether the review comes from a verified purchase (boolean)
* overall - rating of the product
* summary - summary of the review
* unixReviewTime - time of the review (unix time)
* reviewTime - time of the review (raw)
* image - images that users post after they have received the product
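
As a rough illustration of these fields, the sketch below parses a single review record with the standard `json` module. It assumes the file stores one JSON object per line. The record itself is hypothetical: the two IDs are the examples from the list above, and every other value is made up.

```python
import json

# Hypothetical record for illustration; the IDs are the examples listed above,
# all other values are made up.
line = ('{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", '
        '"overall": 5.0, "vote": "1,234", "reviewTime": "01 1, 2018", '
        '"unixReviewTime": 1514764800, "reviewText": "Works as expected."}')

record = json.loads(line)

# Fields such as vote, style, or image may be absent from a record,
# so .get() with a default is safer than direct indexing.
votes = int(record.get("vote", "0").replace(",", ""))
has_image = "image" in record
print(record["asin"], record["overall"], votes, has_image)
```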


## Imports

In [1]:
# basic data science packages
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns 
sns.set()

In [2]:
# json loading
import json

In [8]:
# import custom function for reading data
from scripts.utils_data import read_in_chunks, DTYPES_SIMPLE, calc_weights

## Constants

In [4]:
reviews_path = '../data/raw/Electronics_5.json'
new_path = '../data/processed/electronics.csv'
new_path_simple = '../data/processed/electronics_simple.csv'
new_path_sample_100K = '../data/processed/electronics_simple_100K.csv'

In [5]:
columns = ["reviewerID", "asin", "reviewerName","vote","style","reviewText","verified",
           "overall","summary","unixReviewTime","reviewTime", "image"]

new_columns = ["overall","vote","reviewMonth","reviewText", "reviewYear"]

In [6]:
DTYPES = {"reviewerID": object,
          "asin": object,
          "reviewerName": object,
          "vote": object,
          "style": object,
          "reviewText": object,
          "verified":bool,
          "overall": np.float64,
          "summary": object,
          "unixReviewTime": np.int64,
          "reviewTime":object,
          "image": object}

NROWS = 100_000

## Converting JSON to CSV

In [7]:
read_in_chunks(reviews_path, new_path, 500_000, columns)

499999 lines processed
999999 lines processed
1499999 lines processed
1999999 lines processed
2499999 lines processed
2999999 lines processed
3499999 lines processed
3999999 lines processed
4499999 lines processed
4999999 lines processed
5499999 lines processed
5999999 lines processed
6499999 lines processed
Saved content of ../data/raw/Electronics_5.json to ../data/processed/electronics.csv succesfully
Processed 6739589 lines in 0:05:57.739182
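
`read_in_chunks` is defined in `scripts/utils_data.py` and is not shown in this notebook. The sketch below is only a guess at the general approach, assuming one JSON object per input line: it streams the file, buffers `chunk_size` records, and writes each full buffer to the CSV. The name `read_in_chunks_sketch` is hypothetical, and the real function may differ (for example, in its progress messages and timing output).

```python
import csv
import json


def read_in_chunks_sketch(json_path, csv_path, chunk_size, columns):
    """Stream a JSON-lines file to CSV, flushing every chunk_size records."""
    rows, total = [], 0
    with open(json_path, encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        # Missing keys become empty cells; keys not in `columns` are ignored.
        writer = csv.DictWriter(dst, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        for line in src:
            line = line.strip()
            if not line:
                continue
            rows.append(json.loads(line))
            if len(rows) == chunk_size:
                writer.writerows(rows)
                total += len(rows)
                rows.clear()
                print(f"{total} lines processed")
        writer.writerows(rows)  # write the final, partially filled chunk
        total += len(rows)
    return total
```

Buffering keeps memory use bounded by `chunk_size` records rather than the full file, which matters for a file with several million reviews.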


## Checking csv file

In [8]:
# parse reviewTime (column 10) as datetime while loading
df = pd.read_csv(new_path, parse_dates=["reviewTime"], dtype=DTYPES, low_memory=True)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6739590 entries, 0 to 6739589
Data columns (total 12 columns):
 #   Column          Dtype         
---  ------          -----         
 0   reviewerID      object        
 1   asin            object        
 2   reviewerName    object        
 3   vote            object        
 4   style           object        
 5   reviewText      object        
 6   verified        bool          
 7   overall         float64       
 8   summary         object        
 9   unixReviewTime  int64         
 10  reviewTime      datetime64[ns]
 11  image           object        
dtypes: bool(1), datetime64[ns](1), float64(1), int64(1), object(8)
memory usage: 572.0+ MB


## Create Simplified Version with Review, Rating and Time

In [10]:
df['vote'] = df.vote.str.replace(',','').fillna(0).astype('int64')
df['overall'] = df.overall.fillna(0).astype('int16')
df['reviewMonth'] = df.reviewTime.dt.month
df['reviewYear'] = df.reviewTime.dt.year
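
The effect of these transformations is easier to see on a few rows. The sketch below applies the same steps to a small, hypothetical frame: `vote` strings may contain thousands separators or be missing, and the month and year are taken from the parsed `reviewTime`.

```python
import pandas as pd

# Toy data (hypothetical): vote is read as object, so it holds strings or missing values.
toy = pd.DataFrame({
    "vote": pd.Series(["1,234", None, "7"], dtype=object),
    "reviewTime": pd.to_datetime(["1999-09-18", "2013-10-23", "2008-09-02"]),
})

# Strip thousands separators, treat missing votes as 0, then convert to integers.
toy["vote"] = toy["vote"].str.replace(",", "").fillna(0).astype("int64")
toy["reviewMonth"] = toy["reviewTime"].dt.month
toy["reviewYear"] = toy["reviewTime"].dt.year
print(toy)
```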

In [11]:
df[new_columns].to_csv(new_path_simple, index=False)

In [12]:
df[new_columns].head()

|   | overall | vote | reviewMonth | reviewText | reviewYear |
|---|---------|------|-------------|------------|------------|
| 0 | 5 | 67 | 9 | This is the best novel I have read in 2 or 3 y... | 1999 |
| 1 | 3 | 5 | 10 | Pages and pages of introspection, in the style... | 2013 |
| 2 | 5 | 4 | 9 | This is the kind of novel to read when you hav... | 2008 |
| 3 | 5 | 13 | 9 | What gorgeous language! What an incredible wri... | 2000 |
| 4 | 3 | 8 | 2 | I was taken in by reviews that compared this b... | 2000 |


## Create 100K Sample with balanced rating

In [9]:
df = pd.read_csv(new_path_simple,  dtype=DTYPES_SIMPLE, low_memory=True)

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6739590 entries, 0 to 6739589
Data columns (total 5 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   overall      int16 
 1   vote         int64 
 2   reviewMonth  int16 
 3   reviewText   object
 4   reviewYear   int16 
dtypes: int16(3), int64(1), object(1)
memory usage: 141.4+ MB
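
`DTYPES_SIMPLE` is imported from `scripts.utils_data` and its definition is not shown. Judging by the `df.info()` output above, it presumably resembles the mapping below. The name `DTYPES_SIMPLE_SKETCH` and the one-row CSV are assumptions used only to show the effect of passing such a mapping to `pd.read_csv`.

```python
import io

import numpy as np
import pandas as pd

# Assumed reconstruction of DTYPES_SIMPLE from the df.info() output above.
DTYPES_SIMPLE_SKETCH = {
    "overall": np.int16,
    "vote": np.int64,
    "reviewMonth": np.int16,
    "reviewText": object,
    "reviewYear": np.int16,
}

# One made-up row in the layout of the simplified CSV.
csv_text = "overall,vote,reviewMonth,reviewText,reviewYear\n5,67,9,Great,1999\n"
demo = pd.read_csv(io.StringIO(csv_text), dtype=DTYPES_SIMPLE_SKETCH)
print(demo.dtypes)
```

Declaring narrow integer types up front avoids the default `int64` for the small-valued columns, which is consistent with the lower memory usage reported for the simplified file.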


In [11]:
# adding column with weights
df['weights']=df.overall.map(calc_weights(df.overall))

In [12]:
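# drawing a rating-weighted sample (approximately balanced across ratings);
# drop_duplicates may leave slightly fewer than NROWS rows
df_sample = df.sample(NROWS, replace=False,
                      weights='weights',
                      random_state=42 ).copy().drop(labels=['weights'], axis=1).drop_duplicates()
# release the full dataframe from memory
df = None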
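
`calc_weights` also lives in `scripts.utils_data` and is not shown. One plausible implementation, assuming it returns a mapping from each rating to the inverse of its frequency, is sketched below. The name `calc_weights_sketch` and the toy data are hypothetical. With such weights every rating class carries the same total weight, so `DataFrame.sample` draws rare ratings far more often than their share of the data.

```python
import pandas as pd


def calc_weights_sketch(ratings: pd.Series) -> dict:
    """Map each rating to 1 / (number of rows with that rating)."""
    counts = ratings.value_counts()
    return (1.0 / counts).to_dict()


# Toy data (hypothetical): heavily skewed toward 5-star ratings.
toy = pd.DataFrame({"overall": [5] * 900 + [1] * 100})
toy["weights"] = toy["overall"].map(calc_weights_sketch(toy["overall"]))

# Weighted sampling without replacement, as in the cell above.
sample = toy.sample(100, replace=False, weights="weights", random_state=42)
print(sample["overall"].value_counts())
```

Weighted sampling only balances the classes approximately. If exactly equal class sizes were required, a per-group draw such as `df.groupby("overall").sample(n=...)` would be an alternative.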

In [13]:
# saving the rating-balanced sample to csv
df_sample.to_csv(new_path_sample_100K, index=False)