# Exploration of Amazon consumer products reviews
## by Kyle McMillan

## Preliminary Wrangling

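This notebook explores US Amazon customer reviews loaded from `amazon_reviews_multilingual_US_v1_00.tsv`. After cleaning, the table holds about 6.9 million reviews in 15 columns covering product metadata, star ratings, helpfulness votes, and the review text.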

In [22]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import re
import random
random.seed(42)

%matplotlib inline

### Data load and wrangle
This first section loads the TSV file into a DataFrame and looks at the result.  
The data is then assessed for quality and tidiness.

In [23]:
# Load the data from a TSV file. Malformed lines are skipped, because fixing them by hand
# would take too much time for so few lines.
# Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; newer versions use on_bad_lines="skip".
reviews = pd.read_csv(r"amazon_reviews_multilingual_US_v1_00.tsv", sep="\t", error_bad_lines=False, header=0)

b'Skipping line 3231472: expected 15 fields, saw 22\n'
b'Skipping line 3509762: expected 15 fields, saw 22\n'
b'Skipping line 4018793: expected 15 fields, saw 22\n'
b'Skipping line 4280173: expected 15 fields, saw 22\nSkipping line 4290596: expected 15 fields, saw 22\n'
b'Skipping line 4331421: expected 15 fields, saw 22\nSkipping line 4340267: expected 15 fields, saw 22\nSkipping line 4341665: expected 15 fields, saw 22\nSkipping line 4386155: expected 15 fields, saw 22\nSkipping line 4388098: expected 15 fields, saw 22\n'
b'Skipping line 4408027: expected 15 fields, saw 22\nSkipping line 4442615: expected 15 fields, saw 22\n'
b'Skipping line 4519623: expected 15 fields, saw 22\n'
b'Skipping line 4525797: expected 15 fields, saw 22\nSkipping line 4543519: expected 15 fields, saw 22\n'
b'Skipping line 4587726: expected 15 fields, saw 22\nSkipping line 4589301: expected 15 fields, saw 22\nSkipping line 4634393: expected 15 fields, saw 22\n'
b'Skipping line 4666168: expected 15 fields, s
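
The skipped lines above report 22 fields where 15 were expected. One plausible cause, not confirmed against this file, is a stray double quote in the review text: under the default quoting rules, `read_csv` treats a field that starts with `"` as quoted and swallows tabs until the closing quote. The sketch below reproduces that effect on a tiny made-up TSV (the rows and the four-column layout are illustrative assumptions) and shows that `quoting=csv.QUOTE_NONE` keeps quote characters literal.

```python
import csv
import io

import pandas as pd

# Made-up two-row TSV: the first headline opens with a quote and the body closes it.
raw = (
    "review_id\treview_headline\treview_body\treview_date\n"
    'R_A\t"Not always beneficial -\tespecially here"\t2001-03-08\n'
    "R_B\tGood\tWorth ordering again\t2001-03-09\n"
)

# Default quoting: the quoted span swallows the tab, so the row comes up one field short.
default_df = pd.read_csv(io.StringIO(raw), sep="\t")

# QUOTE_NONE: quote characters are ordinary text and every tab is a delimiter.
literal_df = pd.read_csv(io.StringIO(raw), sep="\t", quoting=csv.QUOTE_NONE)

# Count rows with a missing date under each setting.
print(default_df["review_date"].isna().sum())
print(literal_df["review_date"].isna().sum())
```

If the real file behaves the same way, reloading it with `quoting=csv.QUOTE_NONE` might avoid most of the skipped and shifted rows, at the cost of leaving literal quote characters in the text fields.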

In [24]:
#View the unique items in the product category.
reviews.product_category.unique()

array(['Books', 'Music', 'Video', 'Video DVD', 'Toys', 'Tools',
       'Office Products', 'Video Games', 'Software',
       'Digital_Music_Purchase', 'Home Entertainment', 'Electronics',
       'Digital_Ebook_Purchase', 'Digital_Video_Download', 'Kitchen',
       'Camera', 'Outdoors', 'Musical Instruments', 'Sports', 'Watches',
       'PC', 'Home', 'Wireless', 'Beauty', 'Baby', 'Home Improvement',
       'Apparel', 'Shoes', 'Lawn and Garden', 'Mobile_Electronics',
       'Health & Personal Care', 'Grocery', 'Luggage',
       'Personal_Care_Appliances', 'Automotive', 'Mobile_Apps',
       'Furniture', '2012-12-22', 'Pet Products'], dtype=object)
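
The value `'2012-12-22'` stands out in the list above because it has the shape of a review date, not a category. On a longer list, the same check could be done programmatically. This is a minimal sketch on a toy Series standing in for `reviews.product_category`; the pattern assumes dates are always written as `YYYY-MM-DD`.

```python
import pandas as pd

# Toy stand-in for reviews.product_category; the values are illustrative.
categories = pd.Series(["Books", "Music", "2012-12-22", "Pet Products"])

# Flag values shaped like YYYY-MM-DD, which is what review_date should hold.
looks_like_date = categories.str.match(r"^\d{4}-\d{2}-\d{2}$")
suspect = categories[looks_like_date]
print(suspect.tolist())
```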

In [25]:
# View the row whose product category is a date.
reviews.query("product_category == '2012-12-22'")

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
1852794,US,49422747,R2T9JZNQ996WRC,1568652240,569473707,Emma (Large Print)\tBooks\t5\t0\t0\tN\tY\tFoll...,2012-12-22,,,,,,,,


In [26]:
#View the data string in the product title
list(reviews[reviews.review_id=="R2T9JZNQ996WRC"].product_title)

['Emma (Large Print)\tBooks\t5\t0\t0\tN\tY\tFollows Austen\'s book closley\tI have the movie with Ms Paltrow starring, so I bought the book to see how it matched. Very well done.\t2012-12-22\nUS\t7209000\tR1MZOZGBLVKYVP\tB003E8P9G0\t446279348\tThe Kane Chronicles, Book One: The Red Pyramid\tDigital_Ebook_Purchase\t5\t0\t0\tN\tN\tso....\tmy brother read this book and he was instantly addicted to it so i started reading it on my kindle and i goten adddicted to it too. i would say that it is a must read.Questions or Comments my kindle email is  haili@kindle.com\t2012-12-22\nUS\t49422747\tR31UI3EECPWNVA\tB002EWD0I6\t647475881\tLark Rise to Candleford: Season 1\tVideo DVD\t4\t0\t0\tN\tY\tGood\tI thought this would be more of theprevious characters of Cranford, but not to be.  This was good enough for me to order the 2nd season.\t2012-12-22\nUS\t7209000\tR2UXMYNK8AEEVR\tB005CRQ4GU\t632021160\tThe Third Wheel (Diary of a Wimpy Kid, Book 7)\tDigital_Ebook_Purchase\t5\t1\t2\tN\tN\tso....\tI LOV
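
The title string above holds several complete reviews separated by `\n`, each with tab-separated fields. If recovering them were worth the effort, a sketch along these lines could split such a blob back into records. `split_embedded_rows` is a hypothetical helper, not part of this notebook, and the shortened sample string is made up to mirror the layout with 4 fields per row instead of 15.

```python
def split_embedded_rows(blob, n_fields=15):
    """Split a mis-parsed field into the tail of its own row plus embedded rows.

    Returns (head_fields, complete_rows): the tab-separated remainder of the
    first row, and every later line that has exactly n_fields fields.
    """
    lines = blob.split("\n")
    head_fields = lines[0].split("\t")
    rows = [line.split("\t") for line in lines[1:]]
    # Lines with the wrong field count (for example a truncated last line) are left out.
    complete_rows = [row for row in rows if len(row) == n_fields]
    return head_fields, complete_rows


# Shortened, made-up sample: one row tail, one complete row, one truncated row.
sample = "Emma\tBooks\t5\t2012-12-22\nUS\tR_A\t5\t2012-12-22\nUS\tR_B\t4"
head, rows = split_embedded_rows(sample, n_fields=4)
print(head)
print(rows)
```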

In [5]:
# Drop this row: it appears to have been parsed incorrectly because of the extra "\t" characters
# in the title column, and its title string contains many other reviews.
reviews.drop(reviews.loc[reviews.product_category.isin(['2012-12-22'])].index, inplace=True)
reviews.product_category.unique()

array(['Books', 'Music', 'Video', 'Video DVD', 'Toys', 'Tools',
       'Office Products', 'Video Games', 'Software',
       'Digital_Music_Purchase', 'Home Entertainment', 'Electronics',
       'Digital_Ebook_Purchase', 'Digital_Video_Download', 'Kitchen',
       'Camera', 'Outdoors', 'Musical Instruments', 'Sports', 'Watches',
       'PC', 'Home', 'Wireless', 'Beauty', 'Baby', 'Home Improvement',
       'Apparel', 'Shoes', 'Lawn and Garden', 'Mobile_Electronics',
       'Health & Personal Care', 'Grocery', 'Luggage',
       'Personal_Care_Appliances', 'Automotive', 'Mobile_Apps',
       'Furniture', 'Pet Products'], dtype=object)

In [16]:
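# info() prints its summary and returns None, which is why "None" appears at the end of the output.
# Note: null_counts was renamed to show_counts in pandas 1.2 and removed in 2.0.
print(reviews.info(null_counts=True))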

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6900562 entries, 0 to 6900885
Data columns (total 15 columns):
marketplace          6900562 non-null object
customer_id          6900562 non-null int64
review_id            6900562 non-null object
product_id           6900562 non-null object
product_parent       6900562 non-null int64
product_title        6900562 non-null object
product_category     6900562 non-null object
star_rating          6900562 non-null float64
helpful_votes        6900562 non-null float64
total_votes          6900562 non-null float64
vine                 6900562 non-null object
verified_purchase    6900562 non-null object
review_headline      6900488 non-null object
review_body          6900487 non-null object
review_date          6900562 non-null object
dtypes: float64(3), int64(2), object(10)
memory usage: 842.4+ MB
None


In [7]:
# Investigate why a few reviews have NaN for the review date.
dateless = reviews.loc[reviews['review_date'].isna()]
dateless.head()

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
142642,US,52767614,RJBEAQJ92LS61,B00003CXDG,765637830,Mission: Impossible 2 (Widescreen Edition),Video DVD,2.0,0.0,0.0,N,N,Technology is not always beneficial -\tespecia...,2001-03-08,
186564,US,50374272,R2LXX6V7B1PICZ,B00005OB0A,321992753,Fever,Music,5.0,0.0,0.0,N,N,The music you were playing really blew my mind...,2001-10-25,
217791,US,42627253,R1CRCYSML85MB4,B00005JKZQ,167276143,Showtime,Video DVD,5.0,5.0,6.0,N,N,It's............SHOWTIME!!!!!\tShowtime is the...,2002-03-16,
235044,US,52442862,R1YZIXBH6AVF4E,B00000JMQC,79142420,Return to Oz [VHS],Video,5.0,6.0,7.0,N,N,The Wizard of Oz it is not--should not be!\tFo...,2002-06-10,
239621,US,47692344,R12KVFMZYPXNF6,B00003CWL6,304141589,American Beauty (1999),Video DVD,5.0,1.0,3.0,N,N,"It's just STUFF!!!\t\\American Beauty\\"" is in...",2002-06-30,


In [8]:
# It seems the TSV file was misread: these review headlines contain a tab character, the same "\t"
# that read_csv uses as the delimiter, so the fields after the headline are shifted.

# Print the headline part (the text before the first tab) of each dateless review to see if anything is unusual.
for headline in dateless.review_headline:
    print(headline.split("\t")[0])

Technology is not always beneficial -
The music you were playing really blew my mind...
It's............SHOWTIME!!!!!
The Wizard of Oz it is not--should not be!
It's just STUFF!!!
Gooble gobble, we accept, you, one of us! one of us!
Ohana means. . . another Disney HIT.
The darker side of Star Trek�.
Unfortunately, no one can be told what the Matrix is. . .
I make more money than...
When you look long into an abyss,
Live' is the drug that I need to score. Daily.
My teacher said I'm a loser, I give a f**k if you feel me..
Anyone who says they'll die for their country is an idiot..
One Of The Best Films Of The 90's
what will really haunt you later is not...
I Couldn't Wait for HERO . . . .
 an elegant crime done by an elegant person
Is there anybody out there?"
Britney Salutes Las Vegas"
Pac's most famous work to date........
 A mosquito; my libido..............
Admire me, admire my home, admire my son...HE'S MY CLONE...
If you're looking for my professional opinion...he's nuts!
i left hi

In [9]:
# Investigate the body of the dateless reviews to see if there is anything unusual.
# In these rows the body text sits after the first tab of review_headline, so that part is printed.
# As the full list is very long, a random sample of 10 reviews is used.
DL_random_list = random.sample(list(dateless.review_id), 10)
for i in DL_random_list:
    print(re.split("\t", list(dateless[dateless.review_id==i].review_headline)[0])[1:])

["[[ASIN:0316067601 Lone Survivor: The Eyewitness Account of Operation Redwing and the Lost Heroes of SEAL Team 10]]<br /><br />Excellent narration. Inspiring, patriotic, and courageous. Also, one lucky SEAL. As an American, I am honored, and humbled by his 'call to service\\\\.<br />Bravo."]
['.. I\'m gonna follow my heart.\\\\ <BR>My favorite line on this album off of the highest point on this album, \\\\"Get Em High,\\\\" featuring epic rappers Talib Kweli and Common. I loved it because it influences me and inspires me, but after learning that this was the genius that produced five star tracks such as Poppa Was A Playa by Nas and Izzo by Jay-Z, I wanted to pick up this solely for the beats. I wasn\'t as interested in hearing him as a rapper, because I hadn\'t even heard Through the Wire or All Falls Down at the time. But, I wasn\'t disappointed at all, this guy is one of the best \\\\"new artist\\\\" rappers I\'ve heard in a long time. This guy knows how to write intellectual flows 

In [10]:
# Many of the dateless reviews have other reviews mixed into their body text.
# Given the very large total number of reviews, untangling the 300 or so affected rows
# is not worth the effort, so these rows are dropped.
reviews.drop(dateless.index, inplace=True)
# Confirm that no rows with a missing review_date remain.
reviews.loc[reviews['review_date'].isna()]

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
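
Dropping the rows is the simplest option. For rows where the headline field contains exactly one tab, a repair may also be possible: the text before the tab is the headline, the text after it is the body, and the value sitting in `review_body` is the date. The sketch below illustrates that idea on a made-up frame; `repair_shifted_rows` is a hypothetical helper, and rows with more than one tab (where other reviews are mixed in) are deliberately left out.

```python
import pandas as pd


def repair_shifted_rows(df):
    """Return a repaired copy of the rows whose headline holds exactly one tab."""
    tab_count = df["review_headline"].str.count("\t")
    simple = df.loc[tab_count == 1].copy()
    parts = simple["review_headline"].str.split("\t", n=1, expand=True)
    # Move the date out of review_body before overwriting that column.
    simple["review_date"] = simple["review_body"]
    simple["review_headline"] = parts[0]
    simple["review_body"] = parts[1]
    return simple


# Made-up rows: the first has one tab, the second has other fields mixed in.
broken = pd.DataFrame({
    "review_id": ["R_A", "R_B"],
    "review_headline": ["Great film\tLoved every minute", "Good\tFine\t2012-12-22\nUS\tmore"],
    "review_body": ["2001-03-08", "2002-06-30"],
    "review_date": [None, None],
})
repaired = repair_shifted_rows(broken)
print(repaired)
```

The helper assumes at least one row qualifies; with no matching rows the split produces no columns to assign from.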


In [11]:
# Investigate the reviews that have no headline to see if there is anything unusual.
reviews.loc[reviews['review_headline'].isna()]

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
144995,US,51263804,RRR6EK045G5YL,630438551X,212477251,Romeo & Juliet [VHS],Video,5.0,4.0,8.0,N,N,,I consider myself to be a pretty much diehard ...,2001-03-21
274671,US,34490235,R25MC0QXPV5CRZ,6305949980,664817538,The Nightmare Before Christmas (Special Edition),Video DVD,4.0,1.0,6.0,N,N,,"Three words, \\""BEST MOVIE EVER.\\"" If you hav...",2002-12-11
373447,US,23532323,R1MNPI81JJ62NI,B0000DD7NL,868764779,The Diary of Alicia Keys,Music,5.0,0.0,2.0,N,N,,I love this cd! My favorite two songs are:<br...,2004-01-12
450489,US,43058957,RIHK7J9KBB62V,B00005V3Z4,681790048,Donnie Darko (Widescreen Edition),Video DVD,4.0,3.0,8.0,N,N,,It actually is pretty hard to classify this mo...,2004-11-18
515646,US,34215909,R1HAKDWL5X86B9,B00014NE62,59241661,Maurice - The Merchant Ivory Collection,Video DVD,4.0,2.0,3.0,N,N,,"Goes at quite a steady pace, however, this is ...",2005-08-12
522262,US,36882963,R2VEG37XSHHQMO,B0002Y4TTC,773850099,Blind Guardian - Imaginations Through the Look...,Video DVD,3.0,1.0,8.0,N,N,,This is a beautifully packaged 2-disc DVD edit...,2005-09-04
546262,US,24610674,R3NODAYWJSH67Y,B000BM6AVA,828472430,Hypnotize,Music,5.0,2.0,4.0,N,N,,"It's a warm, peaceful afternoon in the upstand...",2005-11-28
550148,US,16196376,R1DUI7MJ3FPONG,B0001YRVN4,745272701,Star Wars Trilogy (A New Hope / The Empire Str...,Video DVD,5.0,4.0,10.0,N,N,,What's all this I hear about people complainin...,2005-12-12
567259,US,22947656,R1ABIC2ULD2TCB,B000AP2ZDK,577609284,Donuts,Music,5.0,112.0,124.0,N,N,,"This isn't Dilla's best work, far from it real...",2006-02-16
568389,US,42224700,R3QMAWGAVYTR9F,B00001U0E1,475263130,Shakespeare in Love (Miramax Collector's Series),Video DVD,4.0,0.0,1.0,N,N,,"I am a Shakespeare buff, so I didn't find this...",2006-02-20


In [12]:
# Investigate the reviews that have no body to see if there is anything unusual.
reviews.loc[reviews['review_body'].isna()]

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
4362503,US,44409249,R28HX7V41NGPGD,B00IFMHZ58,731881208,Veronica Mars,Digital_Video_Download,4.0,0.0,0.0,N,Y,Four Stars,,2014-07-14
4364915,US,34483539,R1KRVATE3RM685,0545615402,160159764,Hogwarts Library (Harry Potter),Books,5.0,0.0,1.0,N,Y,Five Stars,,2014-07-14
4492736,US,14326822,R3QYW87GDRRNAF,B004S82OAE,235122690,CCNA Cisco Certified Network Associate Study G...,Digital_Ebook_Purchase,5.0,0.0,1.0,N,Y,Five Stars,,2014-08-03
4646674,US,20596625,R12WMFWQ96MK9E,B00HPYMVD8,108613172,The Target (Will Robie),Digital_Ebook_Purchase,4.0,0.0,0.0,N,Y,Four Stars,,2014-08-29
4867197,US,11236181,R1ZS0RW3SLED36,B0096YJDNQ,454110113,Distant Suns (max) - Unleash your inner astron...,Mobile_Apps,5.0,0.0,0.0,N,Y,Five Stars,,2014-10-07
4967822,US,3152188,R1RGMJWY7KCMFF,B003MYYJD0,120446899,Invicta Men's 6981 Pro Diver Analog Swiss Chro...,Watches,5.0,94.0,102.0,N,Y,Five Stars,,2014-10-24
4990204,US,8149238,R221BRF51GRMNT,B004GJDQT8,36653526,Amazon Underground,Mobile_Apps,4.0,5.0,6.0,N,Y,Four Stars,,2014-10-28
5155825,US,33444040,R2XT5Y4GSMFGBJ,B000LXQVA4,535123469,Fisher-Price Rainforest Jumperoo,Baby,5.0,0.0,0.0,N,N,Five Stars,,2014-11-25
5166126,US,44256154,RK1UNP8GPEEKV,B002IIY6X4,795348432,Homesick,Music,5.0,4.0,4.0,N,Y,Five Stars,,2014-11-27
5190776,US,32311114,R1Q33KCFX6EZKX,B001XVD21Y,818712447,The Departed,Digital_Video_Download,5.0,8.0,9.0,N,Y,Five Stars,,2014-11-30


Looking at the two tables of reviews without a headline or without a body, these rows do not appear to be parsing errors; the reviewers simply left those fields empty.  
I will therefore keep these rows as they are.  
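
For scale, the `info()` summary reports 6,900,488 non-null headlines and 6,900,487 non-null bodies out of 6,900,562 rows, which works out to 74 and 75 missing values. The same per-column count can be read directly with `isna().sum()`. This is a minimal sketch on a made-up frame, not the real data.

```python
import pandas as pd

# Toy frame with a few empty headline and body fields; the values are illustrative.
toy = pd.DataFrame({
    "review_headline": ["Five Stars", None, "Four Stars"],
    "review_body": [None, "I love this cd!", None],
    "review_date": ["2014-07-14", "2004-01-12", "2014-08-29"],
})

# Number of missing values in each column.
missing_counts = toy.isna().sum()
print(missing_counts)
```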

A small number of rows were dropped to save time: the dataset contains around 7 million reviews, and the malformed ones would have to be repaired manually. In total approximately 1,000 reviews were dropped, leaving a final table of slightly more than 6.9 million rows. The index still runs to 6,900,885 while 6,900,562 rows remain, so about 324 of these were removed after loading and the rest were skipped by the parser.

In [17]:
#High level overview of the data shape and composition.
print(reviews.shape)
print(reviews.info(null_counts=True))  # pandas >= 1.2 spells this show_counts=True
print(reviews.head(10))

(6900562, 15)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6900562 entries, 0 to 6900885
Data columns (total 15 columns):
marketplace          6900562 non-null object
customer_id          6900562 non-null int64
review_id            6900562 non-null object
product_id           6900562 non-null object
product_parent       6900562 non-null int64
product_title        6900562 non-null object
product_category     6900562 non-null object
star_rating          6900562 non-null float64
helpful_votes        6900562 non-null float64
total_votes          6900562 non-null float64
vine                 6900562 non-null object
verified_purchase    6900562 non-null object
review_headline      6900488 non-null object
review_body          6900487 non-null object
review_date          6900562 non-null object
dtypes: float64(3), int64(2), object(10)
memory usage: 842.4+ MB
None
  marketplace  customer_id       review_id  product_id  product_parent  \
0          US     53096384   R63J84G1LOX6R  156389011

In [14]:
# Save a sample of the table to upload to GitHub, as the original file is far too large to be uploaded.
reviews.sample(n=10000, random_state=42).to_csv(r'amazon_reviews_sample.csv', index=False)

In [15]:
# Statistics of the review ratings and vote counts
print(reviews[["star_rating","helpful_votes","total_votes"]].describe())

        star_rating  helpful_votes   total_votes
count  6.900562e+06   6.900562e+06  6.900562e+06
mean   4.306589e+00   2.044490e+00  3.251608e+00
std    1.146197e+00   3.184562e+01  3.634051e+01
min    1.000000e+00   0.000000e+00  0.000000e+00
25%    4.000000e+00   0.000000e+00  0.000000e+00
50%    5.000000e+00   0.000000e+00  0.000000e+00
75%    5.000000e+00   1.000000e+00  2.000000e+00
max    5.000000e+00   2.755000e+04  2.872700e+04
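
The scientific notation above is hard to scan. As a presentation sketch, the same kind of summary can be rendered with fixed decimals through `to_string(float_format=...)`. The toy values below are made up, apart from reusing the two vote maxima for scale.

```python
import pandas as pd

# Toy stand-in for the three numeric review columns.
toy = pd.DataFrame({
    "star_rating": [5.0, 4.0, 1.0, 5.0],
    "helpful_votes": [0.0, 1.0, 27550.0, 0.0],
    "total_votes": [0.0, 2.0, 28727.0, 0.0],
})

summary = toy.describe()
# Render every float with thousands separators and two decimals.
formatted = summary.to_string(float_format="{:,.2f}".format)
print(formatted)
```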


### What is the structure of your dataset?

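After cleaning, the dataset contains 6,900,562 reviews and 15 columns. Two columns are integer identifiers (`customer_id`, `product_parent`), three are numeric and stored as floats (`star_rating`, `helpful_votes`, `total_votes`), and the remaining ten are stored as objects (`marketplace`, `review_id`, `product_id`, `product_title`, `product_category`, `vine`, `verified_purchase`, `review_headline`, `review_body`, `review_date`). Star ratings range from 1 to 5, and the reviews span 38 product categories. Only `review_headline` and `review_body` contain missing values.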

### What is/are the main feature(s) of interest in your dataset?

> Your answer here!

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> Your answer here!

## Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.

> Make sure that, after every plot or related series of plots, you
include a Markdown cell with comments about what you observed, and what
you plan on investigating next.

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

> Create plots of three or more variables to investigate your data even
further. Make sure that your investigations are justified, and follow from
your work in the previous sections.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!

> At the end of your report, make sure that you export the notebook as an
html file from the `File > Download as... > HTML` menu. Make sure you keep
track of where the exported file goes, so you can put it in the same folder
as this notebook for project submission. Also, make sure you remove all of
the quote-formatted guide notes like this one before you finish your report!