# Data Collection

In this notebook I will be collecting the data needed for the Capstone Project.

The dataset to be used for the development of the model is [Amazon Customer Reviews Dataset](https://registry.opendata.aws/amazon-reviews/). We will be considering just “US reviews datasets” for our ML model.

The dataset consists of files grouped by product category, each stored as a gzip-compressed TSV file (`.tsv.gz`) in a publicly accessible Amazon S3 bucket. The list of download URLs and the column descriptions can be found [here](https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt).

The data is formatted as tab-separated ('\t') text, without quote or escape characters. The first line of each file is a header, and each subsequent line corresponds to one record.
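
As a minimal sketch of what this format implies for parsing, the standard-library `csv` module can be told to split on tabs and to treat quote characters as ordinary text. The two-record sample below is made up for illustration (it is not taken from the dataset) and shows only three of the columns.

```python
import csv
import io

# Made-up sample in the same shape as the review files: a header line,
# then one tab-separated record per line. The first review body starts
# with an unbalanced double quote on purpose.
sample = (
    "marketplace\tstar_rating\treview_body\n"
    "US\t5\t\"Wow, love it\n"
    "US\t2\tCards are not as big as pictured.\n"
)

# QUOTE_NONE makes the reader treat '"' as a plain character, which matches
# a file that has no quote or escape characters.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

for row in rows:
    print(row["star_rating"], row["review_body"])
```

The `read_csv` function in pandas accepts the same `quoting=csv.QUOTE_NONE` argument, so the idea should carry over to the DataFrame-based cells in this notebook.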


## Let us have a peek at sample data 

The URLs for the sample data are stored in the `SampleData.txt` file. Load the URLs into a list:

In [3]:
with open('SampleData.txt') as f:
    content = f.readlines()
    
urls_sample_data = [x.strip() for x in content]

In [4]:
urls_sample_data

['https://s3.amazonaws.com/amazon-reviews-pds/tsv/sample_us.tsv',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/sample_fr.tsv']

## Load the data into a pandas DataFrame

In [5]:
import pandas as pd

#sample_us.tsv
sample_us_info=pd.read_csv(urls_sample_data[0],delimiter='\t',encoding='utf-8')

In [6]:
sample_us_info.head(5)

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,18778586,RDIJS7QYB6XNR,B00EDBY7X8,122952789,Monopoly Junior Board Game,Toys,5,0,0,N,Y,Five Stars,Excellent!!!,2015-08-31
1,US,24769659,R36ED1U38IELG8,B00D7JFOPC,952062646,56 Pieces of Wooden Train Track Compatible wit...,Toys,5,0,0,N,Y,Good quality track at excellent price,Great quality wooden track (better than some o...,2015-08-31
2,US,44331596,R1UE3RPRGCOLD,B002LHA74O,818126353,Super Jumbo Playing Cards by S&S Worldwide,Toys,2,1,1,N,Y,Two Stars,Cards are not as big as pictured.,2015-08-31
3,US,23310293,R298788GS6I901,B00ARPLCGY,261944918,Barbie Doll and Fashions Barbie Gift Set,Toys,5,0,0,N,Y,my daughter loved it and i liked the price and...,my daughter loved it and i liked the price and...,2015-08-31
4,US,38745832,RNX4EXOBBPN5,B00UZOPOFW,717410439,Emazing Lights eLite Flow Glow Sticks - Spinni...,Toys,1,1,1,N,Y,DONT BUY THESE!,Do not buy these! They break very fast I spun ...,2015-08-31


In [7]:
type(sample_us_info)

pandas.core.frame.DataFrame

In [8]:
sample_fr_info=pd.read_csv(urls_sample_data[1],delimiter='\t',encoding='utf-8')
sample_fr_info.head(5)

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,FR,14952,R32VYUWDIB5LKE,0552774294,362925721,The God Delusion,Books,5,0,0,N,Y,a propos de ce livre,je conseille fortement ce bouquin à ceux qui s...,2013-02-13
1,FR,14952,R3CCMP4EV6HAVL,B004GJXQ20,268067011,"A Game of Thrones (A Song of Ice and Fire, Boo...",Digital_Ebook_Purchase,5,0,0,N,Y,wow,"ce magnifique est livre , les personnages sont...",2014-08-03
2,FR,17564,R14NAE6UGTVTA2,B00GIGGS6A,256731097,Huion H610 PRO,PC,3,1,3,N,Y,Ca fait le job,Je dirais qu'il a un défaut :<br />On ne peut ...,2015-07-07
3,FR,18940,R2E7QEWSC6EWFA,B00CW7KK9K,977480037,Withings Pulse - Suivi d'activité + Analyse du...,Sports,4,0,1,N,Y,Fidele a description,Je l'ai depuis quelques jours et j'en suis trè...,2014-06-16
4,FR,20315,R26E6I47GQRYKR,B002L6SKIK,827187473,Prometheus,Video DVD,2,3,5,N,N,décevant,"je m'attendait à un bon film, car j'aime beauc...",2013-06-10


In [9]:
sample_us_info[['product_category','star_rating','review_body']][:5]

Unnamed: 0,product_category,star_rating,review_body
0,Toys,5,Excellent!!!
1,Toys,5,Great quality wooden track (better than some o...
2,Toys,2,Cards are not as big as pictured.
3,Toys,5,my daughter loved it and i liked the price and...
4,Toys,1,Do not buy these! They break very fast I spun ...


In [10]:
sample_fr_info[['product_category','star_rating','review_body']][:5]

Unnamed: 0,product_category,star_rating,review_body
0,Books,5,je conseille fortement ce bouquin à ceux qui s...
1,Digital_Ebook_Purchase,5,"ce magnifique est livre , les personnages sont..."
2,PC,3,Je dirais qu'il a un défaut :<br />On ne peut ...
3,Sports,4,Je l'ai depuis quelques jours et j'en suis trè...
4,Video DVD,2,"je m'attendait à un bon film, car j'aime beauc..."


## The sample data is in TSV format, but the rest of the files are gzip-compressed

Let us try working with the Python `gzip` module for the data extraction. Before that, we need to extract the category name from the URL, which we will use as the name of the saved file.
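
As a rough sketch of that approach, the standard-library `gzip` module can open a `.tsv.gz` file in text mode and stream it line by line without unpacking it to disk first. The file written below is a tiny stand-in created in a temporary directory; its name and contents are assumptions for illustration only.

```python
import csv
import gzip
import os
import tempfile

# Create a tiny gzip-compressed TSV file to stand in for a real download.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo_reviews.tsv.gz")  # hypothetical name
with gzip.open(demo_path, "wt", encoding="utf-8", newline="") as f:
    f.write("product_category\tstar_rating\treview_body\n")
    f.write("Toys\t5\tExcellent!!!\n")
    f.write("Toys\t2\tCards are not as big as pictured.\n")

# Read it back: mode "rt" decompresses on the fly and yields text lines.
with gzip.open(demo_path, "rt", encoding="utf-8", newline="") as f:
    reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    records = list(reader)

print(header)
print(len(records), "records")
```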

In [12]:
with open('DownloadLinks.txt') as f:
    content = f.readlines()
url_links = [x.strip() for x in content]
url_links

['https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Wireless_v1_00.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Watches_v1_00.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Video_Games_v1_00.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Video_DVD_v1_00.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Video_v1_00.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Toys_v1_00.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Tools_v1_00.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Sports_v1_00.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Software_v1_00.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Shoes_v1_00.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Pet_Products_v1_00.tsv.g

## Let's try downloading a file from the above URLs and play with the data in it

Use a regex to get the file name from the URL.

In [22]:
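import re
import os
import gzip
# `import urllib` alone does not guarantee that the request submodule is loaded
import urllib.request

def check_download(url_value):
    # Capture everything between "us_" and ".tsv", e.g. "Watches_v1_00"
    file_name = re.search(r'us_(.+)\.tsv', url_value).group(1)
    file_name = "Reviews Data/" + file_name + ".tsv.gz"
    # Create the target directory on first use
    os.makedirs("Reviews Data", exist_ok=True)
    if not os.path.exists(file_name):
        print('Attempting to download:', file_name)
        urllib.request.urlretrieve(url_value, file_name)
        print("Download Complete!")
    else:
        print("File already exits!")
    return file_name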
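
One caveat with checking `os.path.exists` is that an interrupted download can leave a partial file behind, which a later run would then treat as complete. A possible variation, sketched here under that assumption, streams the response into a temporary `.part` file and renames it only once the copy finishes. The function name `download_atomically` and the `.part` suffix are illustrative choices, not part of the original notebook.

```python
import os
import shutil
import urllib.request


def download_atomically(url, dest_path):
    """Download url to dest_path, skipping the work if dest_path exists.

    Returns True when a download happened and False when it was skipped.
    """
    if os.path.exists(dest_path):
        return False

    # Make sure the target directory exists before writing into it.
    dest_dir = os.path.dirname(dest_path)
    if dest_dir:
        os.makedirs(dest_dir, exist_ok=True)

    part_path = dest_path + ".part"
    # Stream in chunks so large archives are never held fully in memory.
    with urllib.request.urlopen(url) as response, open(part_path, "wb") as out:
        shutil.copyfileobj(response, out)

    # The rename happens only after the full body has been written.
    os.replace(part_path, dest_path)
    return True
```

A call such as `download_atomically(url_links[1], "Reviews Data/Watches_v1_00.tsv.gz")` would then either fetch the file or report that it was already present.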

In [14]:
current_file = check_download(url_links[1])

Attempting to download: Reviews Data/Watches.tsv.gz
Download Complete!


In [15]:
%%capture
# on_bad_lines='skip' needs pandas >= 1.3; older versions use error_bad_lines=False instead
testw = pd.read_csv(current_file,delimiter='\t',encoding='utf-8',on_bad_lines='skip',header=0)

In [16]:
testw.head(10)

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,3653882,R3O9SGZBVQBV76,B00FALQ1ZC,937001370,"Invicta Women's 15150 ""Angel"" 18k Yellow Gold ...",Watches,5,0,0,N,Y,Five Stars,Absolutely love this watch! Get compliments al...,2015-08-31
1,US,14661224,RKH8BNC3L5DLF,B00D3RGO20,484010722,Kenneth Cole New York Women's KC4944 Automatic...,Watches,5,0,0,N,Y,I love thiswatch it keeps time wonderfully,I love this watch it keeps time wonderfully.,2015-08-31
2,US,27324930,R2HLE8WKZSU3NL,B00DKYC7TK,361166390,Ritche 22mm Black Stainless Steel Bracelet Wat...,Watches,2,1,1,N,Y,Two Stars,Scratches,2015-08-31
3,US,7211452,R31U3UH5AZ42LL,B000EQS1JW,958035625,Citizen Men's BM8180-03E Eco-Drive Stainless S...,Watches,5,0,0,N,Y,Five Stars,"It works well on me. However, I found cheaper ...",2015-08-31
4,US,12733322,R2SV659OUJ945Y,B00A6GFD7S,765328221,Orient ER27009B Men's Symphony Automatic Stain...,Watches,4,0,0,N,Y,"Beautiful face, but cheap sounding links",Beautiful watch face. The band looks nice all...,2015-08-31
5,US,6576411,RA51CP8TR5A2L,B00EYSOSE8,230493695,Casio Men's GW-9400BJ-1JF G-Shock Master of G ...,Watches,5,0,0,N,Y,No complaints,"i love this watch for my purpose, about the pe...",2015-08-31
6,US,11811565,RB2Q7DLDN6TH6,B00WM0QA3M,549298279,Fossil Women's ES3851 Urban Traveler Multifunc...,Watches,5,1,1,N,Y,Five Stars,"for my wife and she loved it, looks great and ...",2015-08-31
7,US,49401598,R2RHFJV0UYBK3Y,B00A4EYBR0,844009113,INFANTRY Mens Night Vision Analog Quartz Wrist...,Watches,1,1,5,N,N,I was about to buy this thinking it was a ...,I was about to buy this thinking it was a Swis...,2015-08-31
8,US,45925069,R2Z6JOQ94LFHEP,B00MAMPGGE,263720892,G-Shock Men's Grey Sport Watch,Watches,5,1,2,N,Y,Perfect watch!,Watch is perfect. Rugged with the metal &#34;B...,2015-08-31
9,US,44751341,RX27XIIWY5JPB,B004LBPB7Q,124278407,Heiden Quad Watch Winder in Black Leather,Watches,4,0,0,N,Y,Great quality and build,Great quality and build.<br />The motors are r...,2015-08-31


##### We are able to download the data and load it into a pandas DataFrame. So let's download the rest of the data and complete the process of data collection.

In [17]:
for each_url in url_links:
    current_file = check_download(each_url)

Attempting to download: Reviews Data/Wireless.tsv.gz
Download Complete!
File already exits!
Attempting to download: Reviews Data/Video_Games.tsv.gz
Download Complete!
Attempting to download: Reviews Data/Video_DVD.tsv.gz
Download Complete!
Attempting to download: Reviews Data/Video.tsv.gz
Download Complete!
Attempting to download: Reviews Data/Toys.tsv.gz
Download Complete!
Attempting to download: Reviews Data/Tools.tsv.gz
Download Complete!
Attempting to download: Reviews Data/Sports.tsv.gz
Download Complete!
Attempting to download: Reviews Data/Software.tsv.gz
Download Complete!
Attempting to download: Reviews Data/Shoes.tsv.gz
Download Complete!
Attempting to download: Reviews Data/Pet_Products.tsv.gz
Download Complete!
Attempting to download: Reviews Data/Personal_Care_Appliances.tsv.gz
Download Complete!
Attempting to download: Reviews Data/PC.tsv.gz
Download Complete!
Attempting to download: Reviews Data/Outdoors.tsv.gz
Download Complete!
Attempting to download: Reviews Data/Offi

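We can see that we get the output "File already exits!" even when a file is being downloaded for the first time. This appears to happen when the saved file name is built from the category alone: categories that are split across several versioned files (such as `Digital_Ebook_Purchase` and `Books`) then map to the same name. Let us modify the `check_download` method so that the version suffix is kept in the file name, and try again. The version of `check_download` shown earlier (cell In [22]) keeps this suffix.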
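
A likely explanation for the "file already exists" message on a first-time download is a file-name collision, and that can be reproduced without downloading anything. The sketch below compares two patterns on the three `Books` URLs from the list: one that drops the version suffix (an assumed reconstruction of the earlier behaviour, since that pattern is not shown in the notebook) and one that keeps it.

```python
import re

urls = [
    "https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Books_v1_02.tsv.gz",
    "https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Books_v1_01.tsv.gz",
    "https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Books_v1_00.tsv.gz",
]

# Assumed earlier behaviour: capture only the category, dropping "_v1_NN".
category_only = re.compile(r"us_(.+)_v\d+_\d+\.tsv")
# Behaviour that keeps the version, so every URL maps to its own file name.
with_version = re.compile(r"us_(.+)\.tsv")

names_without_version = {category_only.search(u).group(1) for u in urls}
names_with_version = {with_version.search(u).group(1) for u in urls}

print(sorted(names_without_version))  # one name shared by all three URLs
print(sorted(names_with_version))     # three distinct names
```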

In [25]:
url_links[35:42]

['https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Music_Purchase_v1_00.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Ebook_Purchase_v1_00.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Camera_v1_00.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Books_v1_02.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Books_v1_01.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Books_v1_00.tsv.gz']

In [26]:

test = check_download(url_links[37])

Attempting to download: Reviews Data/Digital_Ebook_Purchase_v1_00.tsv.gz
Download Complete!


In [27]:
test = check_download(url_links[40])

Attempting to download: Reviews Data/Books_v1_01.tsv.gz
Download Complete!


In [28]:
test = check_download(url_links[41])

Attempting to download: Reviews Data/Books_v1_00.tsv.gz
Download Complete!


# With this we complete the download of all the files of reviews_data.
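
As an optional follow-up, a quick integrity pass can confirm that each archive decompresses cleanly to the end. This is a minimal sketch and not part of the original notebook; the helper name `is_valid_gzip` is hypothetical, and the demo files are created in a temporary directory rather than taken from the dataset.

```python
import gzip
import os
import tempfile
import zlib


def is_valid_gzip(path, chunk_size=1024 * 1024):
    """Return True if the file at path decompresses fully without errors."""
    try:
        with gzip.open(path, "rb") as f:
            # Read and discard the data in chunks; errors surface on read.
            while f.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        return False


# Demo: one complete archive and one truncated copy of it.
tmp_dir = tempfile.mkdtemp()
good_path = os.path.join(tmp_dir, "good.tsv.gz")
bad_path = os.path.join(tmp_dir, "truncated.tsv.gz")

with gzip.open(good_path, "wb") as f:
    f.write(b"marketplace\tstar_rating\n" + b"US\t5\n" * 1000)

with open(good_path, "rb") as src, open(bad_path, "wb") as dst:
    data = src.read()
    dst.write(data[: len(data) // 2])  # cut the archive in half

print(is_valid_gzip(good_path))
print(is_valid_gzip(bad_path))
```

Running the same check over every file in the `Reviews Data` folder would flag any archive that was cut short during download.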