# Mobile Electronics - Data Generation, Pre-processing, and EDA

### Loading the data
- I loaded the data from the TSV link with pandas, passing `error_bad_lines=False` to skip any malformed lines
- Only 2 lines were skipped out of almost 105,000 rows, which is great

In [1]:
import pandas as pd

url = 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Mobile_Electronics_v1_00.tsv.gz'

df = pd.read_csv(url, sep='\t', error_bad_lines=False)

b'Skipping line 35246: expected 15 fields, saw 22\n'
b'Skipping line 87073: expected 15 fields, saw 22\n'
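
Note that `error_bad_lines` was deprecated in pandas 1.3 and removed in pandas 2.0 in favor of `on_bad_lines`. A minimal sketch of the newer call, using a small in-memory TSV that is made up for illustration rather than the real review file:

```python
import io
import pandas as pd

# Hypothetical TSV: the second data row has one field too many.
raw = "a\tb\tc\n1\t2\t3\n4\t5\t6\t7\n8\t9\t10\n"

# on_bad_lines='skip' drops malformed rows silently; 'warn' also reports them.
sample = pd.read_csv(io.StringIO(raw), sep='\t', on_bad_lines='skip')
print(sample.shape)
```

The same keyword should work as a drop-in replacement in the `read_csv` call above when running on a recent pandas version.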


### Taking a look at the data structure, checking for nulls

In [2]:
df.shape

(104852, 15)

In [3]:
df.head()

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,20422322,R8MEA6IGAHO0B,B00MC4CED8,217304173,BlackVue DR600GW-PMP,Mobile_Electronics,5.0,0.0,0.0,N,Y,Very Happy!,"As advertised. Everything works perfectly, I'm...",2015-08-31
1,US,40835037,R31LOQ8JGLPRLK,B00OQMFG1Q,137313254,GENSSI GSM / GPS Two Way Smart Phone Car Alarm...,Mobile_Electronics,5.0,0.0,1.0,N,Y,five star,it's great,2015-08-31
2,US,51469641,R2Y0MM9YE6OP3P,B00QERR5CY,82850235,iXCC Multi pack Lightning cable,Mobile_Electronics,5.0,0.0,0.0,N,Y,great cables,These work great and fit my life proof case fo...,2015-08-31
3,US,4332923,RRB9C05HDOD4O,B00QUFTPV4,221169481,abcGoodefg® FBI Covert Acoustic Tube Earpiece ...,Mobile_Electronics,4.0,0.0,0.0,N,Y,Work very well but couldn't get used to not he...,Work very well but couldn't get used to not he...,2015-08-31
4,US,44855305,R26I2RI1GFV8QG,B0067XVNTG,563475445,Generic Car Dashboard Video Camera Vehicle Vid...,Mobile_Electronics,2.0,0.0,0.0,N,Y,Cameras has battery issues,"Be careful with these products, I have bought ...",2015-08-31


In [4]:
df.isnull().sum()

marketplace          0
customer_id          0
review_id            0
product_id           0
product_parent       0
product_title        0
product_category     0
star_rating          2
helpful_votes        2
total_votes          2
vine                 2
verified_purchase    2
review_headline      4
review_body          3
review_date          2
dtype: int64

### Removing nulls
- Only a few rows contain null values, so I'm just going to drop them for simplicity's sake

In [5]:
before_length = len(df)
df = df.dropna()
print ("Rows Lost: ", before_length - len(df))
df.isnull().sum()

Rows Lost:  5


marketplace          0
customer_id          0
review_id            0
product_id           0
product_parent       0
product_title        0
product_category     0
star_rating          0
helpful_votes        0
total_votes          0
vine                 0
verified_purchase    0
review_headline      0
review_body          0
review_date          0
dtype: int64

Let's see how many unique products we are working with, and calculate the number of reviews and the average star rating for each.

In [6]:
print ("Unique Products: ", len(df['product_id'].unique()))
products = (df[['product_id', 'product_title', 'review_id', 'star_rating']]
 .groupby(['product_id', 'product_title'])
 .agg({'review_id': 'count', 'star_rating': 'mean'})
 .reset_index()
 .rename(columns={'review_id':'num_reviews'})
 .sort_values('num_reviews', ascending=False))

products

Unique Products:  25785


Unnamed: 0,product_id,product_title,num_reviews,star_rating
21904,B00J46XO9U,"iXCC Lightning Cable 3ft, iPhone charger, for ...",1078,4.376623
7648,B004911E9M,Wall AC Charger USB Sync Data Cable for iPhone...,730,2.427397
4059,B002D4IHYM,New Trent Easypak 7000mAh Portable Triple USB ...,690,4.530435
19199,B00E5PI594,Apple USB Lightning Cable 9ft -NEW 2014 for iO...,615,3.598374
11775,B005S1CYO6,Kindle Fire anti-glare Screen Protector,599,2.722871
...,...,...,...,...
3310,B001MC0XH0,Creative Zen X-fi Leather Carrying Case for Ze...,1,4.000000
3305,B001M5V2WC,Flip Leather Carrying Case with Belt Clip for ...,1,5.000000
3304,B001M5RQ0E,Flip Leather Carrying Case with Belt Clip for ...,1,5.000000
3301,B001M5LOKC,Premium Silicone Skin for Sony Walkman E436 / ...,1,4.000000


Before doing any further analysis on the products, we may have to set a minimum number of reviews per product, since products with very few reviews would not give us a large enough sample to gain any insights. Let's look at the distribution of the number of reviews per product:

In [7]:
products['num_reviews'].quantile([0, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1.0])

0.0       1.0
0.1       1.0
0.2       1.0
0.3       1.0
0.4       1.0
0.5       1.0
0.6       2.0
0.7       2.0
0.8       3.0
0.9       6.0
1.0    1078.0
Name: num_reviews, dtype: float64

Looks like 90% of the products have 6 or fewer reviews. I'm going to choose an arbitrary cutoff of 30 reviews, which gives us 459 products to work with.

In [8]:
top_products = products[products['num_reviews'] >= 30]
top_products

Unnamed: 0,product_id,product_title,num_reviews,star_rating
21904,B00J46XO9U,"iXCC Lightning Cable 3ft, iPhone charger, for ...",1078,4.376623
7648,B004911E9M,Wall AC Charger USB Sync Data Cable for iPhone...,730,2.427397
4059,B002D4IHYM,New Trent Easypak 7000mAh Portable Triple USB ...,690,4.530435
19199,B00E5PI594,Apple USB Lightning Cable 9ft -NEW 2014 for iO...,615,3.598374
11775,B005S1CYO6,Kindle Fire anti-glare Screen Protector,599,2.722871
...,...,...,...,...
23647,B00N2MEGB2,G1W-C Capacitor Model Dashboard Dash Cam - Hea...,30,3.366667
14215,B0085M17CA,Brightline Bags - Flex System - B10 Classic,30,4.500000
2899,B001EZ402O,11 Assorted Colors Silicone Pouch Compatible w...,30,4.700000
13732,B007VE8Z3C,Dancing Cat Speaker BROWN,30,3.466667
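
Since the cutoff of 30 is arbitrary, it can help to see how many products survive at a few different thresholds before committing to one. A minimal sketch using only the standard library; the product IDs and the helper name `products_at_cutoff` are made up for illustration:

```python
from collections import Counter

def products_at_cutoff(reviews_per_product, cutoff):
    """Count products with at least `cutoff` reviews."""
    return sum(1 for n in reviews_per_product.values() if n >= cutoff)

# Hypothetical data: one product ID per review
review_product_ids = ['A'] * 40 + ['B'] * 30 + ['C'] * 5 + ['D']
reviews_per_product = Counter(review_product_ids)

for cutoff in (1, 5, 30):
    print(cutoff, products_at_cutoff(reviews_per_product, cutoff))
```

On the real data, the same comparison could be run against the `num_reviews` column of `products`.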


Let's look at the distribution of the products' average star ratings to see if they are skewed to one end or the other. The mean and median of the per-product average ratings are 3.836 and 3.948, respectively. Given that and the quantiles below, few products average under 3 stars: the 10th percentile is already 2.95.

In [9]:
print ("Mean star rating: ", top_products['star_rating'].mean())
top_products['star_rating'].quantile([0, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1])

Mean star rating:  3.8360986652735183


0.0    1.625000
0.1    2.949405
0.2    3.295090
0.3    3.535667
0.4    3.771389
0.5    3.948276
0.6    4.118171
0.7    4.262657
0.8    4.421520
0.9    4.568339
1.0    4.913793
Name: star_rating, dtype: float64

### Top 10 products by number of reviews

In [10]:
pd.set_option("display.max_colwidth", 400)
top_products.head(10)

Unnamed: 0,product_id,product_title,num_reviews,star_rating
21904,B00J46XO9U,"iXCC Lightning Cable 3ft, iPhone charger, for iPhone X, 8, 8 Plus, 7, 7 Plus, 6s, 6s Plus, 6, 6 Plus, SE 5s 5c 5, iPad Air 2 Pro, iPad mini 2 3 4, iPad 4th Gen [Apple MFi Certified](Black and White)",1078,4.376623
7648,B004911E9M,"Wall AC Charger USB Sync Data Cable for iPhone 4, 3GS, and iPod",730,2.427397
4059,B002D4IHYM,"New Trent Easypak 7000mAh Portable Triple USB Port External Battery Charger/Power Pack for Smartphones, Tablets and more (w/built-in USB cable)",690,4.530435
19199,B00E5PI594,"Apple USB Lightning Cable 9ft -NEW 2014 for iOS 7 - iSmooth's 2nd Generation Apple Lightning Compatible (iOS 7 compatible) Cable Designed to Sync and Charge iPhone 5, iPhone 5S, iPhone 5C, iPad Mini, iPad Mini 2, iPad 5, iPad 4th Generation, iPod Nano (7th Generation), iPod Touch (5th Generation 16GB, 32GB, 64GB) - Premium Quality - The ONLY Cable with a 10-Year Guarantee!",615,3.598374
11775,B005S1CYO6,Kindle Fire anti-glare Screen Protector,599,2.722871
14953,B008R68DFS,Patazon Black Extension Dock Extender 30pin Adapter for iPod iPhone 4 4S,577,3.459272
12244,B0067XVNTG,"Generic Car Dashboard Video Camera Vehicle Video Accident Recorder (2.0"" 1080P)",502,2.697211
2431,B00166G81M,2-Port USB Car Charger Adapter,481,3.299376
6539,B003TPQBJW,splash Masque Clear Screen Protector for iPhone 4 4G 4S AT&T and Verizon (3-Pack + 2 Bonus Back Films),464,3.564655
10215,B0052RMI2Y,USB Power Wall Charger + Syn Data Cable for Apple iPod Touch iPhone 4 4S 3G 3GS,464,2.797414


- 6 of the 10 products are accessories for Apple devices
- I expected the most-reviewed products to be highly rated, but that's not the case: 4 of the 10 average below 3 stars

### Top 10 products by star rating

In [11]:
top_products.sort_values('star_rating', ascending=False).head(10)

Unnamed: 0,product_id,product_title,num_reviews,star_rating
1780,B000UGLIXC,"Shoe Pouch for Nike+ iPod Sport Kit, and Nike+ SportBand (2-Pack), Also Fits Adidas miCoach Speed Cell and Garmin Foot Pod",58,4.913793
2468,B00172WYWM,"Tuneband for iPod nano 4th Generation, Armband Compatible with Nike+iPod System",127,4.874016
13399,B007MJD5I6,"eForCity Leather Case for Barnes and Noble Nook 2 / Simple Touch, Purple",68,4.838235
5500,B00361ERPE,Ram Mount Handlebar Mount for Garmin Nuvi 500 and 550,39,4.820513
17242,B00B2HTMMW,"Eco-Fused Case Bundle for Apple iPod Touch 4 including 4 Polka Dot Covers / 2 Stylus Pens / 2 Screen Protectors / Microfiber Cleaning Cloth (Green, Blue, Purple, Yellow)",90,4.811111
19213,B00E657DAA,"Bose SoundLink Mini Bluetooth Speaker, Upto 30 ft Wireless Range, Silver - Bundle - with Bose SoundLink Mini Bluetooth Speaker Travel Bag",93,4.806452
16055,B00A02QLVA,Pioneer GM-D8601 Class D Mono Amplifier with Wired Bass Boost Remote,36,4.805556
23548,B00MWGREM2,0 Gauge Amp Kit Amplifier Install Wiring 1/0 Ga Power Installation Cables 4000W,36,4.805556
10191,B0052NZYXI,AYL (TM) Portable Mini Speaker System for PC / Phone / Tablet / Apple iPod Touch / iPhone 4 / iPad / MP3 Player (Black),263,4.787072
8912,B004MKNJCU,Manfrotto MM294A4 294 Aluminum 4 Section Monopod,42,4.785714


- 8 of the 10 top-rated products have under 100 reviews

One thing I'm curious to see is whether there is any relationship between the length of a review and its star rating, i.e. do people who hate a product leave longer reviews, or do people who love a product leave longer reviews? For now, I'm going to estimate review length with a simple word count to get a rough idea. I'll only be looking at products that have 30 or more reviews.

In [12]:
product_universe = top_products['product_id'].tolist()
review_universe = df[df['product_id'].isin(product_universe)][['product_id', 'star_rating', 'review_body']]
review_universe['review_length'] = review_universe['review_body'].str.count(' ') + 1
# Select review_length explicitly so the string columns are not averaged
review_universe.groupby('star_rating')[['review_length']].mean()

star_rating,review_length
1.0,62.70526
2.0,75.975524
3.0,77.058979
4.0,80.600704
5.0,58.821352


It looks like the longest reviews come from the middle ratings of 2 to 4, perhaps because reviewers explain why they took stars off or why they gave more than one star. The shortest reviews come from the worst rating, 1, and the best rating, 5.
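
Counting spaces and adding one is only a rough proxy for word count: runs of spaces and the `<br />` tags embedded in the review text both inflate it. A small sketch comparing that estimate with a slightly stricter count; both helper names and the sample string are made up for illustration:

```python
import re

def rough_word_count(text):
    # Same estimate as above: number of spaces plus one
    return text.count(' ') + 1

def token_word_count(text):
    # Replace HTML tags such as <br /> with a space, then count whitespace-separated tokens
    no_tags = re.sub(r'<[^>]+>', ' ', text)
    return len(no_tags.split())

sample = "Works great.<br /><br />Would  buy again"
print(rough_word_count(sample), token_word_count(sample))
```

For a rough comparison across ratings the simpler estimate is probably fine, since the inflation affects every rating bucket.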


After looking at some reviews manually, I found that some of them include emojis. We'd like to capture the meaning behind each emoji, since it could be useful when eventually feeding the text into a machine learning algorithm. For that, I'm going to use the `demoji` Python package, which can be installed with `pip install demoji`. After importing it, we'll download the current emoji codes from the repository, since emojis are constantly being added and changed.

In [13]:
import demoji

demoji.download_codes()

Downloading emoji data ...
... OK (Got response in 0.17 seconds)
Writing emoji data to /Users/kmf229/.demoji/codes.json ...
... OK


Let's run a test to see what we get

In [14]:
emoji_test = ['👍🏽', '👍', '😁', 'no emoji here']
for i in emoji_test:
    print (type(i), demoji.findall(i))

<class 'str'> {'👍🏽': 'thumbs up: medium skin tone'}
<class 'str'> {'👍': 'thumbs up'}
<class 'str'> {'😁': 'beaming face with smiling eyes'}
<class 'str'> {}


Perfect, let's create a quick function that processes some text and turns any emojis into text.

In [64]:
import re

def process_emojis(text):
    # Detect any emojis in the text string
    found_emojis = demoji.findall(text)
    if found_emojis:
        # Loop through each emoji found
        for key, value in found_emojis.items():
            # Keep only the base description (drop modifiers such as ": medium skin tone")
            # and pad with spaces to separate emojis that sit right next to each other.
            # The second replace strips a leftover medium skin tone modifier;
            # other skin tones are not handled here.
            text = (text.replace(key, ' ' + value.split(':')[0] + ' ')
                    .replace('🏽', ''))
        # Collapse runs of spaces into a single space, then return the new text string
        return re.sub(' +', ' ', text)
    return text

review_test = 'I give this product a 👍🏽😁 with an extra 👍 and👍😁'
process_emojis(review_test)

'I give this product a thumbs up beaming face with smiling eyes with an extra thumbs up and thumbs up beaming face with smiling eyes '

Taking a random sample of reviews that contain emojis, the function appears to be working. The registered trademark symbol (®) comes up a lot, though. I wouldn't really think of it as an emoji, so we may want to omit it.
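
One way to omit such symbols is to filter the dictionary of matches before doing any replacement. A minimal sketch that works on a plain dict shaped like the output of `demoji.findall`; the helper name and the set of symbols to ignore are assumptions, not part of `demoji`:

```python
# Assumed list of symbols that are reported as emojis but carry no sentiment
NON_SENTIMENT_SYMBOLS = {'®', '©', '™'}

def drop_symbol_marks(found):
    """Remove trademark-style symbols from an {emoji: description} dict."""
    return {k: v for k, v in found.items() if k not in NON_SENTIMENT_SYMBOLS}

# Dict shaped like the output of demoji.findall
found = {'®': 'registered', '😊': 'smiling face with smiling eyes'}
print(drop_symbol_marks(found))
```

Inside `process_emojis`, the filtered dict could be used in place of `found_emojis`, which would leave the ® character in the text untouched.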

In [74]:
import random

def has_emoji(text):
    return demoji.findall(text) != {}

samples = random.sample([i for i in review_universe['review_body'].tolist() if has_emoji(i)], 10)
for i in samples:
    print ("Emoji's Found: ", demoji.findall(i))
    print ("Original Text: ", i)
    print ('\n')
    print ("Processed Text: ", process_emojis(i))
    print ('-----------------------------------')


Emoji's Found:  {'😢': 'crying face'}
Original Text:  This case is a pretty good case if u want to show off your iPod color but when I put mine on it scratched my iPod on the upper left corner and bottom right corner...<br />Kinda regret even putting it on :/😢


Processed Text:  This case is a pretty good case if u want to show off your iPod color but when I put mine on it scratched my iPod on the upper left corner and bottom right corner...<br />Kinda regret even putting it on :/ crying face 
-----------------------------------
Emoji's Found:  {'®': 'registered'}
Original Text:  I have gotten a couple of Sentey speakers previously and loved them. This one is no exception. Small enough to be portable. Will play music up to about 6 hours of continuous use. Music is great quality. and I LOVE the blue color! Really stands out from the ordinary sea of black!<br /><br />Specs: Built-in Mic for Hands free<br />up 6 Hours of Battery - Rechargeable<br />AUX Line in makes it compatible with many

Processed Text:  I absolutely love this battery. I've recently been having problems with my smartphone battery dying by mid-day (long story) while I'm out and about with work. I'm not near a electrical outlet most of the time so I decided to give this battery a shot. I was able to use my phone continuously all day long with it plugged into the battery pack and didn't even drain half of the juice from it. And all of that for less than $50 was a nice bonus.<br /><br />Now the unit isn't perfect. My only gripe with it is that the LED lights displaying it's charge level only illuminate when the battery itself is being charged (while plugged into the wall, for example) not while it's charging a device. It would be nice to know how much longer the battery can charge your device. But I was still willing to rate it 5 stars because of how portable it is (fits in a dress shirt pocket). I've very happy with my purchase. I was considering purchasing a similar Energizer unit for about the same pric

Original Text:  I've been in the market for an e-book reader for a long time now. I didn't want to get just anything and have been researching the iPad, the Kindle, Nook, and the Sony Reader. I chose the Reader because I liked the exclusive with Google Books, and it appeared to be the least-tethered and most-open of all the e-book readers. A week into it, I've had a complete change of heart, and am now contemplating the Apple iPad as its successor.<br /><br />I've spent the last 4+ weeks looking at every single e-reader on the market. My wife and I love to read and when Apple came out with the iPad, she had to have it, and that pretty much ended the debate right there as far as her choice. I was a little more scrutinizing as far as what I was willing to spend, and what I wanted. I know that sounds like the typical male-versus-female shopping taste but I didn't want a $500 ebook reader, and didn't need a tablet computer--just something I could read books on, reliably, and was easy to us





Processed Text:  I've been in the market for an e-book reader for a long time now. I didn't want to get just anything and have been researching the iPad, the Kindle, Nook, and the Sony Reader. I chose the Reader because I liked the exclusive with Google Books, and it appeared to be the least-tethered and most-open of all the e-book readers. A week into it, I've had a complete change of heart, and am now contemplating the Apple iPad as its successor.<br /><br />I've spent the last 4+ weeks looking at every single e-reader on the market. My wife and I love to read and when Apple came out with the iPad, she had to have it, and that pretty much ended the debate right there as far as her choice. I was a little more scrutinizing as far as what I was willing to spend, and what I wanted. I know that sounds like the typical male-versus-female shopping taste but I didn't want a $500 ebook reader, and didn't need a tablet computer--just something I could read books on, reliably, and was easy to u

-----------------------------------
Emoji's Found:  {'😊': 'smiling face with smiling eyes'}
Original Text:  😊<br />Received as said and in a timely manner


Processed Text:   smiling face with smiling eyes <br />Received as said and in a timely manner
-----------------------------------
Emoji's Found:  {'®': 'registered'}
Original Text:  Bought this unit for my wife to carry in her purse, she was using my EC TECHNOLOGY® New Dual USB 13000mAh External Battery Pack Backup Charger Portable Power Bank, (which I reluctantly gave up) but for a purse it was just a little to heavy. Glad to have my unit back I carry sa  back pack and don't motice it. Both unit offer great battery life for their size and battery life ratings.


Processed Text:  Bought this unit for my wife to carry in her purse, she was using my EC TECHNOLOGY registered New Dual USB 13000mAh External Battery Pack Backup Charger Portable Power Bank, (which I reluctantly gave up) but for a purse it was just a little to heavy. Glad

Let's now count the words used in the reviews for each star rating. I'll organize the counts by rating and remove stop words using NLTK.

In [91]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kmf229/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [95]:
from collections import Counter
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# One Counter per star rating, keyed by the rating as a string
counter = {str(rating): Counter() for rating in range(1, 6)}
for row in review_universe.values.tolist():
    rating, body = row[1], row[2]  # star_rating, review_body
    tokens = [w for w in process_emojis(body.lower()).split(' ') if w not in stop_words]
    counter[str(int(rating))].update(tokens)

### Most common words for a one star rating

In [96]:
counter['1'].most_common(10)

[('', 9035),
 ('one', 1749),
 ('would', 1713),
 ('get', 1480),
 ('work', 1275),
 ('product', 1229),
 ('/><br', 1108),
 ('even', 1015),
 ('bought', 949),
 ('screen', 915)]

### Most common words for a five star rating

In [97]:
counter['5'].most_common(10)

[('', 28008),
 ('great', 6816),
 ('one', 4570),
 ('works', 4322),
 ('/><br', 3827),
 ('good', 3823),
 ('use', 3772),
 ('would', 3626),
 ('like', 3546),
 ('product', 3014)]

### Takeaways from most common words
- clearly need to do more pre-processing to remove empty tokens and HTML line-break remnants like '/><br'
- interesting how "work" is one of the most common words in one star ratings, perhaps because reviewers say "it doesn't work", while "works" is a top word for five star reviews, perhaps because reviewers say "works great"
- needs more pre-processing, such as stemming or lemmatization
- ngrams would probably be helpful here to get more context
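
The first point could be handled with a small cleaning step before counting. A rough sketch, assuming that stripping HTML tags and punctuation is acceptable for this analysis; the helper name `clean_tokens` and the sample text are made up, and the character filter below also drops accented letters and emojis, so `process_emojis` would need to run first:

```python
import re
from collections import Counter

TAG_RE = re.compile(r'<[^>]+>')            # HTML tags such as <br />
NON_WORD_RE = re.compile(r"[^a-z0-9'\s]")  # anything but letters, digits, apostrophes, whitespace

def clean_tokens(text, stop_words=frozenset()):
    """Lowercase, strip tags and punctuation, split on whitespace, drop stop words."""
    text = TAG_RE.sub(' ', text.lower())
    text = NON_WORD_RE.sub(' ', text)
    # str.split() with no argument never yields empty tokens
    return [t for t in text.split() if t not in stop_words]

sample = "Works great.<br /><br />I would buy it again!"
print(Counter(clean_tokens(sample, stop_words={'i', 'it'})).most_common(3))
```

Splitting with `split()` instead of `split(' ')` is what removes the empty-string tokens seen in the counts above.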

Let's try it one more time using 2-word ngrams (bigrams) to see what we get:

In [103]:
from nltk.util import ngrams

stop_words = set(stopwords.words('english'))

# Reset the per-rating counters, this time counting bigrams instead of single words
counter = {str(rating): Counter() for rating in range(1, 6)}
for row in review_universe.values.tolist():
    rating, body = row[1], row[2]  # star_rating, review_body
    tokens = [w for w in process_emojis(body.lower()).split(' ') if w not in stop_words]
    counter[str(int(rating))].update(ngrams(tokens, 2))


### Most common 2 word ngrams for one star ratings

In [104]:
counter['1'].most_common(10)

[(('', ''), 1229),
 (('it.', ''), 241),
 (('stopped', 'working'), 222),
 (('waste', 'money.'), 202),
 (('screen', 'protector'), 182),
 (('get', 'pay'), 181),
 (('', 'would'), 173),
 (('/><br', '/>i'), 171),
 (('waste', 'money'), 153),
 (('would', 'recommend'), 152)]

### Most common 2 word ngrams for five star ratings

In [105]:
counter['5'].most_common(10)

[(('', ''), 3976),
 (('it.', ''), 671),
 (('works', 'great'), 658),
 (('highly', 'recommend'), 615),
 (('/><br', '/>i'), 606),
 (('would', 'recommend'), 567),
 (('sound', 'quality'), 539),
 (('', 'great'), 489),
 (('/><br', '/>the'), 477),
 (('', 'would'), 415)]

### Takeaways from most common ngrams
- need to remove punctuation (for example, "waste money." and "waste money" are counted separately)
- one star ngrams make sense: phrases like "stopped working" and "waste money" are negative feedback. Interesting how "would recommend" made its way in there; it may have been "would not recommend" or "wouldn't recommend" before stop word removal, since NLTK's English stop word list includes negations such as "not". Might have to look at how to handle negations and contractions
- five star ngrams make sense as well: "works great" and "highly recommend" are clearly positive feedback
- "screen protector" shows up in the one star list, so maybe a lot of screen protectors are of poor quality?
- sound quality seems to be an important attribute for a five star review
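
To illustrate the negation point, here is a small sketch showing how removing "not" as a stop word turns a negative phrase into the bigram "would recommend", and how keeping negation words preserves the meaning. The stop list, the negation set, and the helper names are made up for illustration and are much smaller than NLTK's list:

```python
# Hypothetical mini stop list; like NLTK's English list, it contains "not"
STOP_WORDS = {'i', 'this', 'to', 'anyone', 'not'}
NEGATIONS = {'not', 'no', 'never'}

def bigrams(tokens):
    """Return adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))

def tokens_without_stop_words(text, keep_negations=False):
    # Optionally exempt negation words from stop word removal
    stops = STOP_WORDS - NEGATIONS if keep_negations else STOP_WORDS
    return [t for t in text.lower().split() if t not in stops]

review = "I would not recommend this to anyone"
print(bigrams(tokens_without_stop_words(review)))
print(bigrams(tokens_without_stop_words(review, keep_negations=True)))
```

Contractions such as "wouldn't" would need a separate expansion step (for example, mapping them to "would not") before this filtering applies.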