
In [0]:
import os
import json
import gzip
import pandas as pd
from urllib.request import urlopen

### The Amazon reviews come from [this link to the data](https://nijianmo.github.io/amazon/index.html#subsets); this is the updated (2018) version.  
- The current data includes reviews from May 1996 to Oct 2018.
- The total number of reviews is 233.1 million (the 2014 version had 142.8 million). For the purpose of this master's thesis, a small sample of the data is enough.   
- I use the 5-core subset (14.3 GB), in which all users and items have at least 5 reviews (75.26 million reviews).
- From it I take one category, **Sports_and_Outdoors**, which contains **2,839,940** reviews. 
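
To illustrate what the 5-core condition means, here is a minimal sketch of an iterative k-core filter on a toy table. The function name `k_core` and the toy data are hypothetical, and this is not the code the dataset authors used; the downloaded file is already filtered.

```python
import pandas as pd

def k_core(df, k, user_col="reviewerID", item_col="asin"):
    """Keep only rows whose user and item both have at least k reviews.

    Dropping rows can push other users or items below k,
    so the filter is repeated until nothing changes.
    """
    while True:
        # per-row review counts of the row's user and of the row's item
        user_counts = df[user_col].map(df[user_col].value_counts())
        item_counts = df[item_col].map(df[item_col].value_counts())
        keep = (user_counts >= k) & (item_counts >= k)
        if keep.all():
            return df
        df = df[keep]

# toy data: user u3 has only one review and is filtered out for k=2
toy = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u2", "u2", "u3"],
    "asin":       ["a",  "b",  "a",  "b",  "a"],
})
core = k_core(toy, k=2)
print(core)
```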

### Citation:
Justifying recommendations using distantly-labeled reviews and fine-grained aspects  
Jianmo Ni, Jiacheng Li, Julian McAuley  
Empirical Methods in Natural Language Processing (EMNLP), 2019

In [3]:
# download the data
!wget http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Sports_and_Outdoors_5.json.gz

--2020-05-12 12:18:34--  http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Sports_and_Outdoors_5.json.gz
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 414308704 (395M) [application/octet-stream]
Saving to: ‘Sports_and_Outdoors_5.json.gz’


2020-05-12 12:19:28 (7.22 MB/s) - ‘Sports_and_Outdoors_5.json.gz’ saved [414308704/414308704]



In [4]:
### load the review data (one JSON object per line)

data = []
with gzip.open('Sports_and_Outdoors_5.json.gz') as f:
    for line in f:
        data.append(json.loads(line.strip()))
    
# total length of the list; this number equals the total number of reviews
print(len(data))

# first row of the list
print(data[0])

2839940
{'overall': 5.0, 'verified': True, 'reviewTime': '06 3, 2015', 'reviewerID': 'A180LQZBUWVOLF', 'asin': '0000032034', 'reviewerName': 'Michelle A', 'reviewText': 'What a spectacular tutu! Very slimming.', 'summary': 'Five Stars', 'unixReviewTime': 1433289600}
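
Since only `overall` and `reviewText` are needed later, a lighter alternative is to keep just those two fields while reading. The following is a minimal sketch under that assumption; the helper name `load_reviews` and the tiny demo file are hypothetical and only stand in for the real download.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

def load_reviews(path, fields=("overall", "reviewText")):
    """Read a gzipped JSON-lines file, keeping only the requested fields."""
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # .get() returns None for fields a review lacks (e.g. no reviewText)
            rows.append({field: record.get(field) for field in fields})
    return pd.DataFrame(rows, columns=list(fields))

# tiny stand-in file so the sketch runs without the 395 MB download
demo_path = os.path.join(tempfile.mkdtemp(), "demo.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write(json.dumps({"overall": 5.0, "reviewText": "Great", "asin": "x"}) + "\n")
    f.write(json.dumps({"overall": 1.0, "asin": "y"}) + "\n")

demo_df = load_reviews(demo_path)
print(demo_df)
```

For the real file, the call would be `load_reviews('Sports_and_Outdoors_5.json.gz')`.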


In [5]:
# eleventh row of the list (index 10); this review also carries a 'style' field
print(data[10])

{'overall': 5.0, 'verified': True, 'reviewTime': '08 2, 2016', 'reviewerID': 'A36QT6N7N0GF3O', 'asin': '0899332757', 'style': {'Format:': ' Paperback'}, 'reviewerName': 'Love is all I have', 'reviewText': 'Delorme has always made the best book maps in the USA.  Three thumbs up!', 'summary': 'Five Stars', 'unixReviewTime': 1470096000}


In [6]:
# convert the list of dicts into a pandas DataFrame

df = pd.DataFrame(data)

print(len(df))

2839940


In [21]:
df.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,style,vote,image
0,5.0,True,"06 3, 2015",A180LQZBUWVOLF,32034,Michelle A,What a spectacular tutu! Very slimming.,Five Stars,1433289600,,,
1,1.0,True,"04 1, 2015",ATMFGKU5SVEYY,32034,Crystal R,What the heck? Is this a tutu for nuns? I know...,Is this a tutu for nuns?!,1427846400,,,
2,5.0,True,"01 13, 2015",A1QE70QBJ8U6ZG,32034,darla Landreth,Exactly what we were looking for!,Five Stars,1421107200,,,
3,5.0,True,"12 23, 2014",A22CP6Z73MZTYU,32034,L. Huynh,I used this skirt for a Halloween costume and ...,I liked that the elastic waist didn't dig in (...,1419292800,,,
4,4.0,True,"12 15, 2014",A22L28G8NRNLLN,32034,McKenna,This is thick enough that you can't see throug...,This is thick enough that you can't see throug...,1418601600,,,


In [11]:
### I need only the overall and reviewText columns for my thesis. 

df_save = df[["overall","reviewText"]] 
df_save.head()

Unnamed: 0,overall,reviewText
0,5.0,What a spectacular tutu! Very slimming.
1,1.0,What the heck? Is this a tutu for nuns? I know...
2,5.0,Exactly what we were looking for!
3,5.0,I used this skirt for a Halloween costume and ...
4,4.0,This is thick enough that you can't see throug...


In [0]:
# save the raw data
df_save.to_csv("raw_data.csv", header=True, index =False)

In [17]:
# the file was saved with index=False, so there is no index column to read back
df_new = pd.read_csv("raw_data.csv")


In [18]:
df_new.head()

Unnamed: 0,overall,reviewText
0,5.0,What a spectacular tutu! Very slimming.
1,1.0,What the heck? Is this a tutu for nuns? I know...
2,5.0,Exactly what we were looking for!
3,5.0,I used this skirt for a Halloween costume and ...
4,4.0,This is thick enough that you can't see throug...


In [20]:
# number of reviews for each rating
df_new.groupby("overall").count()

overall,reviewText
1.0,111129
2.0,101629
3.0,210179
4.0,495399
5.0,1920488


For each rating I randomly sample the same number of reviews so that the classes are balanced.   
I take **50,000** reviews per class. 
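
As a compact illustration of balanced sampling, the sketch below draws the same number of rows from every rating group of a toy DataFrame. It relies on `DataFrameGroupBy.sample`, which needs pandas 1.1 or newer (an assumption about the installed version); the toy data and the sample size of 2 are made up for the example.

```python
import pandas as pd

# toy data with an unbalanced rating distribution
toy = pd.DataFrame({
    "overall": [1.0] * 3 + [2.0] * 4 + [5.0] * 6,
    "reviewText": [f"review {i}" for i in range(13)],
})

# draw the same number of rows from every rating group
# (DataFrameGroupBy.sample requires pandas >= 1.1)
balanced = toy.groupby("overall").sample(n=2, random_state=42)

counts = balanced["overall"].value_counts()
print(counts)
```

On the real data the same call would use `n=50000`, after rows with a missing `reviewText` have been dropped.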

In [24]:
# check for missing values
df_new.isnull().sum()

overall          0
reviewText    1116
dtype: int64

In [0]:
# drop rows with missing values, if there are any
df_new = df_new.dropna()

In [28]:
df_new[df_new['overall']==1.0].sample(n= 50000, random_state=42)

Unnamed: 0,overall,reviewText
256105,1.0,"Junk, spend a few more bucks if your in the ma..."
375776,1.0,The first time I extended this to grab a ball ...
1424104,1.0,Piece of crap! Don't buy! It hardly works for ...
1054248,1.0,This pack is only MOLLE compatible with other ...
2624287,1.0,Mag needs refilled after each run
...,...,...
667009,1.0,The first time I pulled on the velcro strap to...
1908669,1.0,returned those immediately - very cheaply made...
2782375,1.0,Returned
2512600,1.0,Screwdriver broke apart at first screw from ki...


In [0]:
# create a custom function that draws a small sample with the same number of reviews per rating
def make_sample(data, number=50000):
  # sample `number` reviews for each star rating, then stack the five samples
  samples = [
    data[data['overall']==rating].sample(n=number, random_state=42)
    for rating in [1.0, 2.0, 3.0, 4.0, 5.0]
  ]
  return pd.concat(samples)
  

In [0]:
# draw the balanced sample: 50,000 reviews per rating
total_sample=make_sample(df_new, number=50000)


In [35]:
total_sample.groupby('overall').count()

overall,reviewText
1.0,50000
2.0,50000
3.0,50000
4.0,50000
5.0,50000


In [0]:
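# write the balanced sample to a CSV file for later use (note: the file name has no .csv extension)
total_sample.to_csv('full_classes_reviews', header=True, index =False)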

In [37]:
total_sample.head()

Unnamed: 0,overall,reviewText
256105,1.0,"Junk, spend a few more bucks if your in the ma..."
375776,1.0,The first time I extended this to grab a ball ...
1424104,1.0,Piece of crap! Don't buy! It hardly works for ...
1054248,1.0,This pack is only MOLLE compatible with other ...
2624287,1.0,Mag needs refilled after each run


### Sentiment analysis: either positive or negative. 
- Ratings 4 and 5 are positive  
- Ratings 1 and 2 are negative  
- Rating 3 is neutral and is ignored in the polarity detection test  

Next, I prepare the data for the polarity task.
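
The labeling rule above can be sketched on a toy DataFrame as follows. The column name `label` and the toy reviews are hypothetical; the notebook itself overwrites the `overall` column instead of adding a new one.

```python
import pandas as pd

toy = pd.DataFrame({
    "overall": [1.0, 2.0, 3.0, 4.0, 5.0],
    "reviewText": ["bad", "poor", "okay", "good", "great"],
})

# drop neutral reviews; .copy() yields an independent DataFrame,
# so the assignment below does not trigger SettingWithCopyWarning
no_neutral = toy[toy["overall"] != 3.0].copy()

# ratings 4 and 5 become 1 (positive), ratings 1 and 2 become 0 (negative)
no_neutral["label"] = (no_neutral["overall"] >= 4.0).astype(int)
print(no_neutral)
```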

In [0]:
# remove the reviews with overall equal to 3; .copy() gives an independent DataFrame
df_polarity = total_sample[total_sample["overall"] != 3.0].copy()

In [54]:
# map the negative ratings (1 and 2 stars) to 0 and the positive ratings (4 and 5 stars) to 1
df_polarity['overall'] = df_polarity['overall'].map({1.0:0,2.0:0,4.0:1,5.0:1})


In [57]:
df_polarity.groupby('overall').count()

overall,reviewText
0,100000
1,100000


In [0]:
df_polarity.to_csv("polarity_reviews.csv",header=True, index =False)

In [0]:
# reload the saved file to check it
df_polarity_1 = pd.read_csv("polarity_reviews.csv")
df_polarity_1.head()

Unnamed: 0,overall,reviewText
0,0,"Junk, spend a few more bucks if your in the ma..."
1,0,The first time I extended this to grab a ball ...
2,0,Piece of crap! Don't buy! It hardly works for ...
3,0,This pack is only MOLLE compatible with other ...
4,0,Mag needs refilled after each run
