In [1]:
import pandas as pd
import numpy as np

One major open question for this dataset is how to read it at all: the file is over 10 million lines long. I want to explore reading it in chunks, analyzing each chunk before reading the next, and then building a visualization that uses all of the data (or at least most of it).
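One possible way to do the chunked read is the chunksize argument of pd.read_csv, which returns an iterator of DataFrames instead of one large frame. The sketch below is only an illustration: the in-memory sample rows and the count_star_ratings helper are made up for the example, and the real file would be passed in place of the StringIO object.

```python
import io
from collections import Counter

import pandas as pd

# Tiny in-memory stand-in for the real TSV (hypothetical sample rows).
sample_tsv = (
    "review_id\tstar_rating\n"
    "R1\t5\n"
    "R2\t4\n"
    "R3\t5\n"
    "R4\t1\n"
    "R5\t5\n"
)


def count_star_ratings(source, chunksize):
    """Accumulate star_rating counts one chunk at a time."""
    totals = Counter()
    # With chunksize set, read_csv yields DataFrames of at most chunksize rows.
    for chunk in pd.read_csv(source, sep="\t", chunksize=chunksize):
        totals.update(chunk["star_rating"].value_counts().to_dict())
    return totals


rating_counts = count_star_ratings(io.StringIO(sample_tsv), chunksize=2)
print(rating_counts)
```

Only the running totals stay in memory, so the same pattern should scale to the full file as long as the per-chunk summary is small.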

In [2]:
# Line below isn't really feasible to run since the file is over 10 million lines long 
#books00_df = pd.read_csv('amazon_reviews_us_Books_v1_00.tsv', sep='\t', error_bad_lines=False)

# Broke the data into a smaller set to work with for now
books00_df = pd.read_csv('books_00_100k.tsv', sep='\t', error_bad_lines=False)

b'Skipping line 3524: expected 15 fields, saw 22\nSkipping line 5282: expected 15 fields, saw 22\nSkipping line 20478: expected 15 fields, saw 22\nSkipping line 25895: expected 15 fields, saw 22\nSkipping line 27016: expected 15 fields, saw 22\nSkipping line 59798: expected 15 fields, saw 22\n'
b'Skipping line 69198: expected 15 fields, saw 22\nSkipping line 71953: expected 15 fields, saw 22\nSkipping line 78720: expected 15 fields, saw 22\nSkipping line 81300: expected 15 fields, saw 22\nSkipping line 87683: expected 15 fields, saw 22\nSkipping line 90516: expected 15 fields, saw 22\nSkipping line 91147: expected 15 fields, saw 22\n'


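error_bad_lines = False skips any line that has more fields than expected (here 22 instead of 15) instead of raising an error, and prints a message for each skipped line. This is the first step in the cleaning process. Note that pandas 1.3 deprecated this argument in favor of on_bad_lines (for example on_bad_lines='warn'), and pandas 2.0 removed it.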
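Before dropping lines, it may be worth looking at which ones have the wrong number of fields. The helper below is a rough sketch using a plain split on the tab separator. find_bad_lines and the sample text are hypothetical, and because the split ignores quoting, its counts may not match what pandas reports.

```python
import io


def find_bad_lines(lines, expected_fields, sep="\t"):
    """Return (line_number, field_count) for lines with an unexpected field count."""
    bad = []
    for line_number, line in enumerate(lines, start=1):
        field_count = len(line.rstrip("\n").split(sep))
        if field_count != expected_fields:
            bad.append((line_number, field_count))
    return bad


# Hypothetical sample: the third line has 5 fields instead of 3.
sample = io.StringIO("a\tb\tc\n1\t2\t3\n4\t5\t6\t7\t8\n9\t10\t11\n")
bad_lines = find_bad_lines(sample, expected_fields=3)
print(bad_lines)
```

For the real file, an open file handle could be passed as lines with expected_fields=15.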

In [3]:
books00_df.head()

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,25933450,RJOVP071AVAJO,0439873800,84656342,There Was an Old Lady Who Swallowed a Shell!,Books,5,0.0,0.0,N,Y,Five Stars,I love it and so does my students!,2015-08-31
1,US,1801372,R1ORGBETCDW3AI,1623953553,729938122,I Saw a Friend,Books,5,0.0,0.0,N,Y,"Please buy ""I Saw a Friend""! Your children wil...",My wife and I ordered 2 books and gave them as...,2015-08-31
2,US,5782091,R7TNRFQAOUTX5,142151981X,678139048,"Black Lagoon, Vol. 6",Books,5,0.0,0.0,N,Y,Shipped fast.,Great book just like all the others in the ser...,2015-08-31
3,US,32715830,R2GANXKDIFZ6OI,014241543X,712432151,If I Stay,Books,5,0.0,0.0,N,N,Five Stars,So beautiful,2015-08-31
4,US,14005703,R2NYB6C3R8LVN6,1604600527,800572372,Stars 'N Strips Forever,Books,5,2.0,2.0,N,Y,Five Stars,Enjoyed the author's story and his quilts are ...,2015-08-31


In [49]:
books00_df['product_category'].unique()

array(['Books',
       "Andy Weir has stated that, when he writes, he doesn't have a point or moral to make.  The only reason he writes is to entertain.  &#34;I never want to make the reader do anything but go 'cool' and that's it.&#34;  (Aug. 17, 2015 interview at L.A. Google office)  I suspect his stated intent explains the difference in mentality between those who love and those who hate the book.  I have a background in physics and work in aerospace/defense as an analyst, so qualify as a nerd who appreciates the technical details.  But the thought I had over and over while trying to slog through the book is why try to stretch what seems to be a &#34;how-to manual&#34; into novel form at all??  There is no larger emotional arc.  No thought-provoking big ideas.  No real backstory for the main character.  I really enjoy my nerdy friends and coworkers, yet find Mark Watney boring, obvious, and annoying.  The sardonic humor is unrelenting.  And the writing is regularly cliche and corny.

Some questions to answer for this set of data
1. What is the date range?
2. What's the breakdown of star ratings? How many are 5's etc?
3. How many different books are there?
4. How often is the review body empty?
5. What to do about a 0 star rating? Is it real?

In [4]:
books00_df['string date'] = (books00_df['review_date']).apply(str)

In [5]:
# This shows that there are some dates that are missing so we need to clean that data 

print(books00_df['string date'].max())
print(books00_df['string date'].min())
print('Total number of nans in the review date:', books00_df['review_date'].isna().sum())

nan
2015-08-22
Total number of nans in the review date: 6
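The max comes back as nan because the dates are compared as strings, and 'n' sorts after any digit. An alternative, sketched below on made-up sample values, is to parse the column with pd.to_datetime and errors='coerce', so that missing or unparseable entries become NaT and are ignored by min and max.

```python
import pandas as pd

# Hypothetical sample values, including a missing and a malformed date.
raw_dates = pd.Series(["2015-08-31", None, "2015-08-22", "not a date"])

# errors="coerce" turns anything that cannot be parsed into NaT.
parsed = pd.to_datetime(raw_dates, errors="coerce")

print("min date:", parsed.min())
print("max date:", parsed.max())
print("unparseable or missing:", parsed.isna().sum())
```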


In [6]:
# Looks for nans in other columns 
print('Total number of nans in the product title:', books00_df['product_title'].isna().sum())
print('Total number of nans in the star rating:', books00_df['star_rating'].isna().sum())
print('Total number of nans in the review body:', books00_df['review_body'].isna().sum())
print('Total number of nans in the helpful votes:', books00_df['helpful_votes'].isna().sum())
print('Total number of nans in the total votes:', books00_df['total_votes'].isna().sum())
print('Total number of nans in the verified purchase:', books00_df['verified_purchase'].isna().sum())

Total number of nans in the product title: 0
Total number of nans in the star rating: 0
Total number of nans in the review body: 9
Total number of nans in the helpful votes: 2
Total number of nans in the total votes: 2
Total number of nans in the verified purchase: 2


In [7]:
df = books00_df[['product_title','review_date','review_body', 'star_rating', 'verified_purchase', 
                 'helpful_votes','total_votes']]

In [8]:
df = df.dropna()

In [9]:
print(books00_df.shape)
print(df.shape)

(97484, 16)
(97471, 7)


In [10]:
print('min date:', df['review_date'].min())
print('max date:', df['review_date'].max())

min date: 2015-08-22
max date: 2015-08-31


In [11]:
df['star_rating'].value_counts()

5    70883
4    13541
3     5880
1     4144
2     3023
Name: star_rating, dtype: int64

In [12]:
df['product_title'].nunique()

71224

In [13]:
df['len_review_body'] = df['review_body'].apply(len)

In [14]:
df.head()

Unnamed: 0,product_title,review_date,review_body,star_rating,verified_purchase,helpful_votes,total_votes,len_review_body
0,There Was an Old Lady Who Swallowed a Shell!,2015-08-31,I love it and so does my students!,5,Y,0.0,0.0,34
1,I Saw a Friend,2015-08-31,My wife and I ordered 2 books and gave them as...,5,Y,0.0,0.0,364
2,"Black Lagoon, Vol. 6",2015-08-31,Great book just like all the others in the ser...,5,Y,0.0,0.0,50
3,If I Stay,2015-08-31,So beautiful,5,N,0.0,0.0,12
4,Stars 'N Strips Forever,2015-08-31,Enjoyed the author's story and his quilts are ...,5,Y,2.0,2.0,86


In [15]:
print('min review body len:', df['len_review_body'].min())

min review body len: 1


In [16]:
print('sum of min review body len:', sum(df['len_review_body'] == 1))

sum of min review body len: 32


In [17]:
print('sum of <= 5 review body len:', sum(df['len_review_body'] <= 5))

sum of <= 5 review body len: 2135


In [18]:
print('sum of <= 3 review body len:', sum(df['len_review_body'] <= 3))

sum of <= 3 review body len: 411


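I'm thinking we may want to drop rows whose review body is too short, since those reviews carry too little text to be interesting or useful. The cost is small: 411 of the 97,471 rows (about 0.4%) have a body of 3 characters or fewer, and 2,135 (about 2.2%) have 5 or fewer.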
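A minimal sketch of that filter, assuming a cutoff of 4 characters (MIN_REVIEW_LEN is a placeholder to be tuned, and the sample rows are made up):

```python
import pandas as pd

# Hypothetical sample rows standing in for df.
sample = pd.DataFrame({
    "review_body": ["OK", "meh", "A+", "Loved every chapter of this book."],
    "star_rating": [3, 1, 5, 5],
})

MIN_REVIEW_LEN = 4  # assumed cutoff, not a value chosen in this notebook

# Keep only rows whose review body has at least MIN_REVIEW_LEN characters.
lengths = sample["review_body"].str.len()
kept = sample[lengths >= MIN_REVIEW_LEN]

dropped_share = 1 - len(kept) / len(sample)
print(kept)
print(f"dropped {dropped_share:.1%} of rows")
```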

In [19]:
df_short3 = df[df['len_review_body'] <= 3]
df_short3.shape

(411, 8)

In [21]:
df_short3.head(20)

Unnamed: 0,product_title,review_date,review_body,star_rating,verified_purchase,helpful_votes,total_votes,len_review_body
234,Crime Scene Investigation (2nd Edition),2015-08-31,Ok!,1,Y,0.0,0.0,3
473,Shady Characters: The Secret Life of Punctuati...,2015-08-31,OK,3,Y,0.0,4.0,2
601,Debt-Free U: How I Paid for an Outstanding Col...,2015-08-31,meh,1,Y,0.0,4.0,3
761,"Crohn's Disease: My Life, My Journey",2015-08-31,Eh,2,Y,0.0,0.0,2
818,Project Estimating and Cost Management (Projec...,2015-08-31,meh,3,Y,0.0,0.0,3
934,Mathematical Proofs: A Transition to Advanced ...,2015-08-31,N\a,5,Y,0.0,3.0,3
1437,"The American Heritage Desk Dictionary, Fifth E...",2015-08-31,OK,3,Y,0.0,1.0,2
1732,The Prince of Tides,2015-08-31,:),5,Y,0.0,1.0,2
2316,Little Grey Rabbit's House: A model house to m...,2015-08-31,A+,5,Y,0.0,0.0,2
2433,Microbiology: The Human Experience,2015-08-31,A+,5,Y,0.0,3.0,2


In [22]:
df_short1 = df[df['len_review_body'] <= 1]
df_short1.shape

(32, 8)

In [23]:
df_short1.head(20)

Unnamed: 0,product_title,review_date,review_body,star_rating,verified_purchase,helpful_votes,total_votes,len_review_body
12705,Grey: Fifty Shades of Grey as Told by Christia...,2015-08-30,😊,5,Y,0.0,0.0,1
15750,All Quiet on the Western Front,2015-08-30,K,5,Y,0.0,0.0,1
18513,"NFPA 70®: National Electrical Code® (NEC®), 20...",2015-08-30,a,5,Y,0.0,0.0,1
18865,100 Edible Mushrooms,2015-08-30,1,4,Y,0.0,1.0,1
19707,Mushrooms of West Virginia and the Central App...,2015-08-30,1,4,Y,0.0,0.0,1
23252,School Nursing: A Comprehensive Text,2015-08-29,👍,5,Y,0.0,2.0,1
25787,CNA Exam Prep (Volume 2): Nurse Assistant Prac...,2015-08-29,👍,5,Y,0.0,0.0,1
28440,Descendants: Junior Novel,2015-08-29,😀,5,Y,0.0,0.0,1
28523,Jesus acted up: A gay and lesbian manifesto,2015-08-29,👍,4,Y,0.0,1.0,1
28928,Crazy Circles Lesson Plan & Record Book,2015-08-29,😀,5,Y,0.0,1.0,1


Clearly some reviews use an emoji as the entire review body. How should we handle this?

Possibilities:
1. Leave them in, if the NLP step can handle them as they are (in that case this might not need to be dealt with until next term).
2. Convert emojis to words (for example with the Python emoji package), so that a thumbs up emoji becomes 'thumbs up' or 'cool'.
3. Remove every row that has an emoji in the review body (but how many are there?).
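Options 2 and 3 can be roughed out with the standard library alone. The sketch below is an approximation: the code point ranges are an assumption that covers common pictographs rather than the full Unicode emoji list, and it uses Unicode character names from unicodedata instead of the emoji package, so the wording differs (for example 'thumbs up sign').

```python
import unicodedata

# Rough code point ranges covering most pictographic emoji (an assumption,
# not an exhaustive list taken from the Unicode emoji data files).
EMOJI_RANGES = (
    (0x1F300, 0x1FAFF),  # pictographs, emoticons, transport, supplemental symbols
    (0x2600, 0x27BF),    # miscellaneous symbols and dingbats
)


def is_emoji(ch):
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in EMOJI_RANGES)


def contains_emoji(text):
    """True if any character falls in the assumed emoji ranges (option 3)."""
    return any(is_emoji(ch) for ch in text)


def emoji_to_words(text):
    """Replace each emoji with its lowercase Unicode name (option 2)."""
    parts = []
    for ch in text:
        if is_emoji(ch):
            parts.append(unicodedata.name(ch, "emoji").lower())
        else:
            parts.append(ch)
    return "".join(parts)


reviews = ["\U0001F44D", "K", "Great book \U0001F600"]
print([contains_emoji(r) for r in reviews])
print([emoji_to_words(r) for r in reviews])
```

Applied to the review_body column, contains_emoji would give the row count needed to judge option 3.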

In [29]:
emoji = df_short1.loc[28440, 'review_body']

In [30]:
# The value is a plain Python str (not an array), so check its type rather than .shape
type(emoji)

str

In [42]:
print(emoji[0])

😀


In [38]:
print(str(emoji))

😀


In [46]:
emoji_text = emoji.encode('utf-8')

In [47]:
print(emoji_text)

b'\xf0\x9f\x98\x80'
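The four bytes above are the UTF-8 encoding of a single code point, which is why the review body length was reported as 1. A small check, using the escape for the same grinning face character:

```python
face = "\U0001F600"  # the grinning face emoji from the row above

encoded = face.encode("utf-8")

print(len(face))      # number of code points in the string
print(len(encoded))   # number of bytes in the UTF-8 encoding
print(hex(ord(face))) # the code point itself
print(encoded.decode("utf-8") == face)  # encoding round-trips
```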
