## Before We Start...

Basic concepts of Spark:
- RDD (Resilient Distributed Dataset): the fundamental data structure for distributing data among cluster nodes. RDDs are immutable.
- Transformation: an operation on an RDD that returns another RDD, such as map, filter, and reduceByKey. Transformations are lazy: they are only computed when an action needs their result.
- Action: an operation on an RDD that returns a non-RDD value, such as collect, count, and reduce.
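
The split between transformations and actions matters because Spark evaluates transformations lazily. The sketch below is a plain-Python analogy, not Spark code: built-in `map` and `filter` stand in for transformations, and `list(...)` stands in for an action. The names `double`, `calls`, and `data` are made up for illustration.

```python
# Plain-Python analogy (not Spark code): lazy iterators behave like
# transformations, and consuming them behaves like an action.
calls = []  # records every time the mapping function really runs


def double(x):
    calls.append(x)
    return x * 2


data = [1, 2, 3, 4]

# "Transformations": building the pipeline does not run double() yet.
doubled = map(double, data)
large = filter(lambda v: v > 4, doubled)
calls_before_action = len(calls)

# "Action": materializing the result pulls every element through the pipeline.
result = list(large)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, result)
```

Nothing is computed until the result is requested, which is also why operations such as `cache()` only take effect once an action runs.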

We will mainly use the Spark DataFrame API instead of the RDD API to simplify development.
- Spark DataFrames are very similar to tables in a relational database: they have a schema, and most operations on them resemble relational queries. You can think of a Spark DataFrame as a wrapper on top of RDDs.

## Loading Data

In [0]:
# Reading data from Delta Lake

amazon_review_raw = spark.sql("SELECT * FROM default.reviews_train").sample(0.25)

In [0]:
display(amazon_review_raw)

reviewID,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,label
4,3.0,True,"05 7, 2015",A3B6GKQQ1JJ167,0560467893,Harry Slaughter,"Pretty flimsy, but does the job. If your corner is not square and flat, the shelves will probably not end up level. These shelves should probably be going for half the price.",Meh,1430956800,1
7,5.0,True,"01 24, 2016",A324TTUBKTN73A,0001713353,Tekla Borner,My students (3 & 4 year olds) loved this book! Definitely recommend it to other teachers.,Five Stars,1453593600,0
9,1.0,True,"10 30, 2013",A7JVZFSXVY9RL,0681795107,Nickleen,"I like my coffee hot; borderline scorching but drinkable. About 10 years ago I purchased several Avantro coffee mugs that are very similar to these. Those have been able to keep my coffee at my preferred drinking temperature for well over 4 hours even on the coldest of days. I work outside so hot coffee is crucial in these nasty New England winters. Well, day one with this new mug and I am quite disappointed. This mug barely kept my coffee warm for an hour. That is quite unacceptable seeing as how this company is the one that bought out Avantro.",Not keeping coffee hot for long enough,1383091200,0
11,2.0,True,"02 20, 2015",A2204E1TH211HT,0700026657,Grandma KR,"found the game a bit too complicated, not what I expected after having played 1602, 1503, and 1701",Two Stars,1424390400,0
20,1.0,False,"08 2, 2014",A1KXJ1ELZIU05C,0700026657,Creation27,"I'm an avid gamer, but Anno 2070 is an INSULT to gaming. It is so buggy and half-finished that the first campaign doesn't even work properly and the DRM is INCREDIBLY frustrating to deal with. Once you manage to work your way past the massive amounts of bugs and get through the DRM, HOURS later you finally figure out that the game has no real tutorial, so you stuck just clicking around randomly. Sad, sad, sad, example of a game that could have been great but FTW.",Avoid This Game - Filled with Bugs,1406937600,0
21,5.0,True,"04 6, 2012",AO84V2TFRX9OB,0681795107,Meddybempster,"These mugs do what they are designed to do, i.e., keep liquids hot or cold for hours, don't leak, easy to clean and attractive to look at. But, do not put them in the dishwasher as they lose their magic!",Keeps your drinks hot or cold for hours,1333670400,0
25,5.0,True,"12 27, 2012",A2M46WTE5TR5WN,0001713353,SLL,"This was one of my favorites when I was a small child, so I bought a copy for my nephew when he was just starting to read.",A Favorite,1356566400,0
28,5.0,False,"08 15, 2011",ATHTCOG6BB6WK,0001713353,L. Williams,"So, you think you have problems? Things could be worse and this clever book can prove it. The king starts out with a problem. The mice are eating his cheese. The more he tries to fix the problem, the worse it gets. The king finally arranges to bring back the mice when he comes to the realization that his original dilemma wasn't so intolerable after all. The solution requires cooperation from the king and the mice. It involves the cheese.",Maybe It's Not As Bad As You Think,1313366400,0
30,4.0,True,"02 15, 2012",A345HVYNQJUGQM,0768205921,JenJen,"This is a great learning tool for teaching children how to tell time. The numbers are printed clearly and easy to read, including the seconds. The hands are easy to move. There is a piece on the back that can be used to prop up the clock, and the piece can pop off easily. It is easy to put back on though. I also ordered the ""Telling Time with the Judy Clock"" workbook. The little booklet that comes with the Judy clock has a lot of the same pages as the workbook, but it's a lot smaller. The ""Telling Time"" workbook gives children a lot more practice with learning to tell time.",Great learning tool,1329264000,1
35,5.0,True,"08 14, 2011",AN3YYDZAS3O1Y,0700099867,Bob,"Loved playing Dirt 2 and I thought the graphics were good. Purchased Dirt 3 as an addition to the other...and the graphics are absolutely ""Gorgeous"" If you liked Dirt or Dirt 2...you are going to love Dirt 3. The game was easier to configure with my Logitech wireless rumblepad...and with my EVGA GTX 580, and all detail set to full on graphics at 1920 x 1080 I get over 100 fps. The game looks good, plays well and is a blast!",A step up from Dirt 2 and that is terrific!,1313280000,1


## Cleaning Data

In [0]:
# Drop duplicates

print("Before duplication removal: ", amazon_review_raw.count())
amazon_review_distinct = amazon_review_raw.dropDuplicates(['reviewerID', 'asin'])
print("After duplication removal: ", amazon_review_distinct.count())

Before duplication removal:  786160
After duplication removal:  770578


In [0]:
# Convert Unix timestamp to readable date

from pyspark.sql.functions import from_unixtime, to_date
from pyspark.sql.types import *

amazon_review_with_date = amazon_review_distinct.withColumn("reviewTime", to_date(from_unixtime(amazon_review_distinct.unixReviewTime))) \
                                                .drop("unixReviewTime")
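
As a sanity check on the conversion, the same arithmetic can be reproduced with the Python standard library. This is a minimal sketch that interprets the timestamp in UTC; Spark's `from_unixtime` uses the session time zone, so results can differ by a day on a cluster configured with another zone. The helper name `unix_to_date` is hypothetical.

```python
from datetime import datetime, timezone


def unix_to_date(ts):
    """Convert Unix seconds to a calendar date, interpreted in UTC (assumption)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).date()


# Timestamps taken from the first two rows of the raw data shown earlier,
# whose original reviewTime values were "05 7, 2015" and "01 24, 2016".
print(unix_to_date(1430956800))
print(unix_to_date(1453593600))
```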

In [0]:
display(amazon_review_with_date)

reviewID,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,label
1202645,5.0,True,2015-08-12,A0001528BGUBOEVR6T5U,B000GWG0T2,igozingo,super nice...,Five Stars,0
2740967,5.0,True,2014-09-12,A0001528BGUBOEVR6T5U,B002MF3CZG,igozingo,Very real,Five Stars,0
1988039,5.0,True,2015-10-11,A00231762M7PUF3V3ZBI,B0013LJA0Q,Cassie,"Bought 2 of these for a white trash themed party to cover the designated restroom doors. A bit short for a standard door, material is like a plastic tablecloth. Taped onto the doors easily, just made a hole to allow the doorknob to go through. The price is right for what it is!",A good value for a fun prop!,0
2344349,5.0,True,2017-10-23,A0034986DWR7WEDQN0GV,B001ASRFCC,oliver chong,superb,Five Stars,0
1109918,5.0,True,2014-02-10,A0090495K0FTJUG4CPSA,0030565073,Serenity,"I'd been wanting to read this for a while now, glad I did. Now I keep wanting to shout, it's a warning, not a Damn blueprint.",classic for a reason,0
2594294,5.0,True,2014-10-07,A0103047AS0C8QKUI0X2,B001V9VDP0,Becky Tompkins,Keeps my coffee beans fresh - holds plenty.,Five Stars,0
233055,4.0,True,2016-12-07,A0122375SQ8Z42DUL03J,B00005NCWV,Retired Lady,"I really like the sturdiness of the pan, cups and lid. The eggs I poached came out perfectly. My only complaint or caution is that when I went to retrieve a cooked egg in the little cup, I scalded my fingers since the steam holes are right next to the little finger holds you use to get the individual cups out of the pan. I learned to use a pot holder or tong to retrieve the cup and save my fingers.",I really like the sturdiness of the pan,0
2349148,5.0,True,2015-02-23,A0129009SMXEYR0W65IU,B001AZH6H4,Teresa Clark,"I previously had ordered one & it worked prefect so I ordered 2 more, one for me and one for my daughter Great item",one for me and one for my daughter Great,0
2812499,5.0,True,2013-08-25,A0139874ED7NYUB55TSR,B00317DIEY,Iryna,This piece is very helpful and serves exactly as expected. This is a must-have in every household! I used it only once as it just arrived recently but it seems to be a great quality,Amazing!!!,0
466759,5.0,True,2014-05-12,A0196552RI15HI7JB9PW,B0000CDVD7,Matt,"Very heavy duty. It is very solid, it doesn't feel as though it would break easily. Presses garlic good, leaves some unpressed around the edges, but no more than other presses I've used Cleans easy which is a huge + Would recommend it I used to take 10 mins mincing up garlic really fine, but now I press it in 30s with the same (if not better) results",Good product,0


For comparison, with a pandas DataFrame you would use .apply() to apply a function to a column. See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html

For example: amz_review['Date'] = amz_review['Time'].apply(to_date), where to_date stands for any Python function that converts a single value.

In [0]:
# Tokenization

from pyspark.ml.feature import RegexTokenizer

regexTokenizer = RegexTokenizer(inputCol="reviewText", outputCol="reviewWord", pattern="\\W")

amazon_review_tokenized = regexTokenizer.transform(amazon_review_with_date.fillna("", subset=["reviewText"]))
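
To see what this tokenizer does to a single review, here is a rough pure-Python equivalent. It assumes the `RegexTokenizer` defaults (the pattern marks the gaps between tokens, text is lowercased, and tokens shorter than one character are dropped). The function name `regex_tokenize` is hypothetical, and `re.ASCII` is used so that `\W` treats only ASCII letters, digits, and underscore as word characters.

```python
import re


def regex_tokenize(text, pattern=r"\W", min_token_length=1):
    """Lowercase the text, split on the pattern, and drop too-short tokens."""
    pieces = re.split(pattern, text.lower(), flags=re.ASCII)
    return [p for p in pieces if len(p) >= min_token_length]


print(regex_tokenize("heavy duty, great thermos."))
print(regex_tokenize("I've read it"))
```

Because every non-word character is a separator, contractions such as "I've" are split into two tokens, which explains fragments like "ve" in the cleaned output.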

In [0]:
# Remove stop words

from pyspark.ml.feature import StopWordsRemover

remover = StopWordsRemover(inputCol="reviewWord", outputCol="reviewWordFiltered")
amazon_review_stop_word_removed = remover.transform(amazon_review_tokenized)
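
Stop-word removal itself is a simple membership filter. The sketch below uses a tiny hand-picked stop list, which is an assumption for illustration only; `StopWordsRemover` ships with a much longer default English list. The names `STOP_WORDS` and `remove_stop_words` are hypothetical.

```python
# Illustrative subset only; the real default English list is much longer.
STOP_WORDS = {"these", "were", "i", "have", "and", "each", "an", "just", "the"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep tokens that are not stop words (case-insensitive comparison)."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["these", "were", "perfect", "i", "have", "4", "grandchildren",
          "and", "i", "give", "each", "one", "an", "ornament", "each",
          "year", "these", "just", "fit", "the", "bill"]
print(remove_stop_words(tokens))
```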

In [0]:
# Stemming

from nltk.stem.porter import PorterStemmer
from pyspark.sql.functions import udf

def stemming(col):
    p_stemmer = PorterStemmer()
    return [p_stemmer.stem(w) for w in col]

stemming_udf = udf(stemming, ArrayType(StringType()))
amazon_review_stemmed = amazon_review_stop_word_removed.withColumn("reviewWordCleaned", stemming_udf(amazon_review_stop_word_removed.reviewWordFiltered))

In [0]:
# Drop temporary columns and cache the result (note that cache is also a lazy operation)

amazon_review_cleaned = amazon_review_stemmed.drop("reviewWord").drop("reviewWordFiltered").cache()

display(amazon_review_cleaned)

reviewID,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,label,reviewWordCleaned
984583,5.0,True,2018-05-07,A101RVZV3RBT8O,B000BRU78C,James Rizer,"heavy duty, great thermos.",great thermos.,0,"List(heavi, duti, great, thermo)"
1726599,5.0,True,2016-01-06,A103CUDJVYLAHR,B000UBO0IW,Amazon Customer,These were perfect. I have 4 grandchildren and I give each one an ornament each year. These just fit the bill.,These were perfect. I have 4 grandchildren and I give each ...,0,"List(perfect, 4, grandchildren, give, one, ornament, year, fit, bill)"
2084214,5.0,True,2017-12-29,A104L4I6JM259S,B0017XHSC2,Cory Stamm,Works great,Five Stars,0,"List(work, great)"
68080,5.0,True,2017-02-06,A1095ORFAE5XEA,B00004RFM0,Cyrus Douglas,The j. A. Henckels knives are very sharp. Just what I wanted,The j. A. Henckels knives are very sharp ...,0,"List(j, henckel, knive, sharp, want)"
2330799,5.0,True,2014-12-28,A10AGLJYHNUKZA,B001A60SXW,Miss G,"Somehow there is a draft that is coming through my walls. How does that happen? I purchased these curtains for the insulation. They are a very close match to the color of my walls. Imagine a wall with 2 standard windows with the normal space between them. I hung the curtains on a rod stretched between the 2 windows. It actually looks nice (Not tacky like my daughter said it would.) More important, it cut the draft.",No More Wall Draft,0,"List(somehow, draft, come, wall, happen, purchas, curtain, insul, close, match, color, wall, imagin, wall, 2, standard, window, normal, space, hung, curtain, rod, stretch, 2, window, actual, look, nice, tacki, like, daughter, said, import, cut, draft)"
2671361,3.0,True,2018-04-05,A10BUZFXQ3BXIC,B002ACEGDS,cgant,"This dresser had wonderful reviews compared to many others Ive read. Was feeling very optimistic about it. When it arrived it was very well packaged, but there were apparently a couple dings before being put in the packaging. It was generally a very easy dresser to assemble, took about 1 1/2-2hrs. The biggest disappointment though is the drawer size. The dresser itself is pretty deep, but the drawers are not. Will be trying to sell this now to get something else :( No sense in it taking up so much space in our smaller rooms for the storage capacity of a smaller dresser.","Beautiful dresser,but disappointing",0,"List(dresser, wonder, review, compar, mani, other, ive, read, feel, optimist, arriv, well, packag, appar, coupl, ding, put, packag, gener, easi, dresser, assembl, took, 1, 1, 2, 2hr, biggest, disappoint, though, drawer, size, dresser, pretti, deep, drawer, tri, sell, get, someth, els, sens, take, much, space, smaller, room, storag, capac, smaller, dresser)"
2773118,4.0,True,2016-08-13,A10D6PSBCT0UQK,B002T49E4I,Robert R.,great smell and works well but best with a little insect spray,Four Stars,0,"List(great, smell, work, well, best, littl, insect, spray)"
1672667,2.0,False,2009-09-18,A10DCS8UQTNN7D,0060876115,Sara B.,"I won't go into how much I disliked the format of this story and how it took place simultaniously with the The Lost Duke of Wyndham, suffice it to say it make reading this book slow and repetitive and not as interesting as it could have been. I actually liked the characters of Thomas and Amelia better than Jack and Grace and felt they deserved a better, more in depth story line all their own, away from all the goings on in the first book. Ultimately it was better than some regency historicals that I've read just b/c it was a Julia Quinn novel, but I felt like I basically read the same novel twice with a bit of a different ending, and it made the ending feel very rushed also. I hate to be disappointed in a favorite author, but I definitely was this time. I have higher hopes for What Happens in London and hope that this was just an whim on the authors part and won't be repeated in future novels.",The only Julia Quinn novel to ever disappoint me!,0,"List(won, go, much, dislik, format, stori, took, place, simultani, lost, duke, wyndham, suffic, say, make, read, book, slow, repetit, interest, actual, like, charact, thoma, amelia, better, jack, grace, felt, deserv, better, depth, stori, line, away, go, first, book, ultim, better, regenc, histor, ve, read, b, c, julia, quinn, novel, felt, like, basic, read, novel, twice, bit, differ, end, made, end, feel, rush, also, hate, disappoint, favorit, author, definit, time, higher, hope, happen, london, hope, whim, author, part, won, repeat, futur, novel)"
1556950,3.0,True,2016-06-14,A10ELFUF3X8JYJ,B000PDHK82,Ronald V Clark,The 2nd one I ordered worked great.,Three Stars,0,"List(2nd, one, order, work, great)"
51429,3.0,False,2000-12-21,A10FBJXMQPI0LL,B00004YKHW,M. Thakery,"Okay, the whacks first then the good stuff. First whack: The intro-opening movie thing just really needs to go away. It was aweful!! It had very little to do with the story and even less to do with making me want to play the game! From crappy leaves to odd duck looking things to that terrible screeching pseudo-song as the movie goes on (and on and on and on. . .), the intro just needs to be trashed and redone completely. Second whack: The camera angles are really lousy. Sure, you control the camera, but that's just it; you spend most of your time trying to manipulate the camera to look at the monsters and generally end up getting whacked, yourself. There are areas in the game that are so difficult to navigate woth the odd angles that you invariably end up quitting and coming back to it. Half the time the camera's not even looking in the direction you're running. VERY annoying! Third whack: The mapping is nice and all, but it is ENTIRELY too easy to fall off a ledge. Sure, it's possible to fall off a cliff in real life, but in real life I don't control my actions with a 3/4 inch button. . . Sneeze and you're history. Last whack: This is actually a two parter- 1) the words on the screen are only acurate about 60% of the time with the words being spoken. Did we get 2 different translations?!? If so, let's pick 1 and stick to it. . . 2) The attributes given to various armaments in the menu rarely matches what you actually get. It will say you're getting one thing, then you get something totally different. Again, annoying. Praise: The gameplay is quick- you change weapons with almost every encounter. There are lots of things to buy from the creepy elephant-man shop owner and most of it is upgradeable. The bosses are not ridiculously powerful and the monsters are actually manageable. The game actually allows you to win on ocassion. The graphics are pretty good and the colors are lively. 
The male character looks realistic and the female looks almost so, except for the fact that her wrists are like a foot long. All in all, I would tell you to rent it first. Otherwise I would tell you to buy Legend of Lagaia. . . Except LOL can't recognize the PS2 mem cards. . . Bummer.",Great game. . . If it were for or PS1,1,"List(okay, whack, first, good, stuff, first, whack, intro, open, movi, thing, realli, need, go, away, awe, littl, stori, even, less, make, want, play, game, crappi, leav, odd, duck, look, thing, terribl, screech, pseudo, song, movi, goe, intro, need, trash, redon, complet, second, whack, camera, angl, realli, lousi, sure, control, camera, spend, time, tri, manipul, camera, look, monster, gener, end, get, whack, area, game, difficult, navig, woth, odd, angl, invari, end, quit, come, back, half, time, camera, even, look, direct, re, run, annoy, third, whack, map, nice, entir, easi, fall, ledg, sure, possibl, fall, cliff, real, life, real, life, control, action, 3, 4, inch, button, sneez, re, histori, last, whack, actual, two, parter, 1, word, screen, acur, 60, time, word, spoken, get, 2, differ, translat, let, pick, 1, stick, 2, attribut, given, variou, armament, menu, rare, match, actual, get, say, re, get, one, thing, get, someth, total, differ, annoy, prais, gameplay, quick, chang, weapon, almost, everi, encount, lot, thing, buy, creepi, eleph, man, shop, owner, upgrad, boss, ridicul, power, monster, actual, manag, game, actual, allow, win, ocass, graphic, pretti, good, color, live, male, charact, look, realist, femal, look, almost, except, fact, wrist, like, foot, long, tell, rent, first, otherwis, tell, buy, legend, lagaia, except, lol, recogn, ps2, mem, card, bummer)"


## Exploratory Analysis

In [0]:
# Let's use Spark SQL for some simple exploratory analysis. First, we need to create a temporary view based on the DataFrame.

amazon_review_cleaned.createOrReplaceTempView("amazon_book_reviews")

In [0]:
# Distribution of the star ratings of book reviews

star_rating = spark.sql('''
  SELECT 
    overall AS star_rating, 
    COUNT(*) AS count 
  FROM
    amazon_book_reviews
  GROUP BY
    overall
  ORDER BY
    overall
''')

display(star_rating)

star_rating,count
1.0,39263
2.0,34276
3.0,64606
4.0,135849
5.0,496584


In [0]:
# Number of reviews over time

review_over_time = spark.sql('''
  SELECT 
    reviewTime AS date, 
    COUNT(*) AS count 
  FROM
    amazon_book_reviews
  WHERE
    reviewTime >= '2015-01-01'
  GROUP BY
    reviewTime
  ORDER BY
    reviewTime
''')

display(review_over_time)

date,count
2015-01-01,748
2015-01-02,578
2015-01-03,828
2015-01-04,708
2015-01-05,725
2015-01-06,514
2015-01-07,727
2015-01-08,501
2015-01-09,580
2015-01-10,568


## Review Score Prediction

For comparison, without Spark we commonly use scikit-learn in Python for machine learning (read more: https://scikit-learn.org/stable/user_guide.html) and NLTK for natural language processing (read more: https://www.nltk.org/).

In [0]:
# Extract verified 5-star and 1-star reviews for prediction

# Each comparison gets its own parentheses because & and | bind more tightly than == in Python
prediction_df = amazon_review_cleaned.where( ((amazon_review_cleaned.overall == 1) | (amazon_review_cleaned.overall == 5)) \
                                             & (amazon_review_cleaned.verified == True) )

# This is equivalent to the following Spark SQL command:

prediction_df = spark.sql("SELECT * FROM amazon_book_reviews WHERE (overall = 1 OR overall = 5) AND verified = TRUE")

display(prediction_df)

reviewID,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,label,reviewWordCleaned
984583,5.0,True,2018-05-07,A101RVZV3RBT8O,B000BRU78C,James Rizer,"heavy duty, great thermos.",great thermos.,0,"List(heavi, duti, great, thermo)"
1726599,5.0,True,2016-01-06,A103CUDJVYLAHR,B000UBO0IW,Amazon Customer,These were perfect. I have 4 grandchildren and I give each one an ornament each year. These just fit the bill.,These were perfect. I have 4 grandchildren and I give each ...,0,"List(perfect, 4, grandchildren, give, one, ornament, year, fit, bill)"
2084214,5.0,True,2017-12-29,A104L4I6JM259S,B0017XHSC2,Cory Stamm,Works great,Five Stars,0,"List(work, great)"
68080,5.0,True,2017-02-06,A1095ORFAE5XEA,B00004RFM0,Cyrus Douglas,The j. A. Henckels knives are very sharp. Just what I wanted,The j. A. Henckels knives are very sharp ...,0,"List(j, henckel, knive, sharp, want)"
2330799,5.0,True,2014-12-28,A10AGLJYHNUKZA,B001A60SXW,Miss G,"Somehow there is a draft that is coming through my walls. How does that happen? I purchased these curtains for the insulation. They are a very close match to the color of my walls. Imagine a wall with 2 standard windows with the normal space between them. I hung the curtains on a rod stretched between the 2 windows. It actually looks nice (Not tacky like my daughter said it would.) More important, it cut the draft.",No More Wall Draft,0,"List(somehow, draft, come, wall, happen, purchas, curtain, insul, close, match, color, wall, imagin, wall, 2, standard, window, normal, space, hung, curtain, rod, stretch, 2, window, actual, look, nice, tacki, like, daughter, said, import, cut, draft)"
1022610,5.0,True,2016-11-30,A10I77K5M5ZO4H,0025853503,Kelsy Bellah,Pat Conroy's Preface to this novel is just precious and inspiring. A note every mother should be proud to read.,Classic,0,"List(pat, conroy, prefac, novel, preciou, inspir, note, everi, mother, proud, read)"
227649,5.0,True,2010-12-27,A10K4042PZXP0X,B000FQBF1M,Kenny Parker,"Similar in scope to Call of Duty. Some of the scenarios require problem solving skills to defeat the foes. For the most part, this is a rock em sock em kick ass game that will get your heart beat up and challenge your trigger finger. I like it enough to pre-order KILLZONE 3.",Killzone GREAT,0,"List(similar, scope, call, duti, scenario, requir, problem, solv, skill, defeat, foe, part, rock, em, sock, em, kick, ass, game, get, heart, beat, challeng, trigger, finger, like, enough, pre, order, killzon, 3)"
2849493,5.0,True,2015-04-13,A10KA8JQ0MK4IN,B003AM896W,Gisela Spencer,This is a good quality container. I haven't teted the bean freshness as I just got it but looks easy to use.,Looking forward to fresh beans,0,"List(good, qualiti, contain, haven, tete, bean, fresh, got, look, easi, use)"
540581,5.0,True,2017-06-21,A10KZZ7BQ80UW8,B0000CFTPI,rl,"plastic is cheaper if you are on a tight budget, but even though it can chip or break, the pyrex is so much more sanitary and functional (freezer, microwave, etc.), it is my preferred method for temporary storage of food items, by far. this size is good for a hefty meal, and the lid is airtight, although it (the lid) cannot withstand extreme heat.",this size is good for a hefty meal,0,"List(plastic, cheaper, tight, budget, even, though, chip, break, pyrex, much, sanitari, function, freezer, microwav, etc, prefer, method, temporari, storag, food, item, far, size, good, hefti, meal, lid, airtight, although, lid, withstand, extrem, heat)"
1306268,1.0,True,2014-12-04,A10MTJCHE99UBA,B000IMYID0,Indiana Cheryl,Handle broke off the first time I washed it in my sink. I was issued a refund.,Microwave Corn Popper,0,"List(handl, broke, first, time, wash, sink, issu, refund)"
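
A note on combining column conditions with `&` and `|`: in Python these operators bind more tightly than `==`, so each comparison should be wrapped in its own parentheses. The sketch below uses the standard `ast` module to show how Python groups an unparenthesized expression. The names `rating_ok` and `verified` are placeholders that are only parsed, never evaluated.

```python
import ast


def parse_shape(expr):
    """Return a textual dump of the parse tree, ignoring source positions."""
    return ast.dump(ast.parse(expr, mode="eval"))


unparenthesized = parse_shape("rating_ok & verified == True")
grouped_left = parse_shape("(rating_ok & verified) == True")
grouped_right = parse_shape("rating_ok & (verified == True)")

# Python reads the unparenthesized form as (rating_ok & verified) == True.
same_as_left = unparenthesized == grouped_left
same_as_right = unparenthesized == grouped_right
print(same_as_left, same_as_right)
```

For a plain boolean column the two groupings select the same rows, but for comparisons such as `overall == 5` the wrong grouping changes the meaning or raises an error, so explicit parentheses are the safer habit.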


In [0]:
# Take a stratified sample

print("Number of rows before sampling: ", prediction_df.count())
prediction_df_sampled = prediction_df.sampleBy("overall", fractions = {1:0.001, 5:0.001}, seed = 16).cache()
print("Number of rows after sampling: ", prediction_df_sampled.count())

Number of rows before sampling:  449388
Number of rows after sampling:  478
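
The sampled count is only approximately 0.1% of the input because `sampleBy` keeps each row independently with the probability given for its stratum, and strata missing from `fractions` are treated as fraction zero. A minimal pure-Python sketch of that idea follows; the function `sample_by` and the toy rows are hypothetical, and Python's random generator will not reproduce Spark's exact selection for the same seed.

```python
import random


def sample_by(rows, key, fractions, seed):
    """Keep each row with the probability assigned to its stratum."""
    rng = random.Random(seed)
    sampled = []
    for row in rows:
        # Strata that are not listed are never sampled.
        fraction = fractions.get(row[key], 0.0)
        if rng.random() < fraction:
            sampled.append(row)
    return sampled


rows = [{"overall": rating} for rating in (1.0, 5.0, 3.0) for _ in range(1000)]
sample = sample_by(rows, "overall", {1: 0.1, 5: 0.1}, seed=16)
print(len(sample))
```

The result size varies around the expected value from seed to seed, just as 478 rows were returned above where about 449 would be expected on average.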


### TF-IDF with Hashing Trick + Random Forest

In [0]:
# Copy prediction data

prediction_tfidf_hash = prediction_df_sampled.select('*')

In [0]:
# Extract bigrams and merge them with the unigrams (note: array_union also drops duplicate tokens)

from pyspark.ml.feature import NGram
from pyspark.sql.functions import array_union

ngram = NGram(n = 2, inputCol="reviewWordCleaned", outputCol="reviewBigrams")
prediction_tfidf_hash = ngram.transform(prediction_tfidf_hash)

prediction_tfidf_hash = prediction_tfidf_hash.withColumn("reviewNgrams", \
                                                         array_union(prediction_tfidf_hash.reviewWordCleaned, \
                                                                     prediction_tfidf_hash.reviewBigrams))
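
The two steps above can be sketched in plain Python to make the resulting column easier to read. `ngrams` joins adjacent tokens with a space, as `NGram` does, and `ordered_union` mimics the duplicate-removing behaviour of `array_union`. Both function names are hypothetical, and keeping elements in first-occurrence order is an assumption based on the displayed output.

```python
def ngrams(tokens, n=2):
    """Return space-joined n-grams; fewer than n tokens yields an empty list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ordered_union(first, second):
    """Concatenate two lists and drop duplicates, keeping first occurrences."""
    return list(dict.fromkeys(first + second))


tokens = ["loos", "end", "mani", "loos", "end"]
bigrams = ngrams(tokens)
print(bigrams)
print(ordered_union(tokens, bigrams))
```

Because duplicates are removed, a token that occurs several times in a review is counted once in the merged column, which affects the term frequencies computed from it.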

In [0]:
# Compute TF-IDF values for unigrams and bigrams

from pyspark.ml.feature import HashingTF, IDF

hashtf = HashingTF(numFeatures=2**12, inputCol="reviewNgrams", outputCol='TF')
tf = hashtf.transform(prediction_tfidf_hash)
idf = IDF(minDocFreq=3, inputCol="TF", outputCol="TF-IDF")
idfModel = idf.fit(tf)
prediction_tfidf_hash = idfModel.transform(tf)
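
To make the numbers in the TF and TF-IDF columns easier to interpret, here is a small pure-Python sketch of the same idea. It assumes Spark's documented IDF formula, log((m + 1) / (df + 1)) for m documents and document frequency df, with buckets seen in fewer than `minDocFreq` documents weighted 0. `zlib.crc32` is only a stand-in hash (Spark uses a different hash function), so bucket indices will not match Spark's. All function names are hypothetical.

```python
import math
import zlib
from collections import Counter


def hashing_tf(terms, num_features):
    """Hash each term into one of num_features buckets and count the hits."""
    buckets = Counter(zlib.crc32(t.encode("utf-8")) % num_features for t in terms)
    return dict(sorted(buckets.items()))


def idf_weight(num_docs, doc_freq, min_doc_freq=0):
    """Smoothed inverse document frequency; 0.0 below the minimum frequency."""
    if doc_freq < min_doc_freq:
        return 0.0
    return math.log((num_docs + 1) / (doc_freq + 1))


def tf_idf(docs, num_features, min_doc_freq=0):
    """Return one {bucket: weight} mapping per tokenized document."""
    tfs = [hashing_tf(doc, num_features) for doc in docs]
    doc_freq = Counter(bucket for tf in tfs for bucket in tf)
    n = len(docs)
    return [
        {b: count * idf_weight(n, doc_freq[b], min_doc_freq) for b, count in tf.items()}
        for tf in tfs
    ]


docs = [["good", "mug"], ["good", "book"], ["good", "game"]]
weights = tf_idf(docs, num_features=2**12)
print(weights)
print(idf_weight(478, 3, min_doc_freq=3))
```

With 478 sampled reviews, a bucket that appears in exactly 3 of them gets a weight of about 4.785, which is consistent with the largest single-count values shown in the TF-IDF column, and the 0.0 entries are consistent with buckets that fall below `minDocFreq=3`.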

In [0]:
display(prediction_tfidf_hash)

reviewID,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,label,reviewWordCleaned,reviewBigrams,reviewNgrams,TF,TF-IDF
1981107,1.0,True,2010-05-10,A2TMEJ22IET200,0061656100,truth is logical,"Wow, after reading all the raving reviews on this one, I figured I couldn't possibly go wrong. Was I ever sorely mistaken. This was the first book by Elizabeth Peters I've read. The main character wasn't even slightly likable. She was a loose, sarcastic and mean person. The only reason the reader might root for her would be just to get the mystery solved. The other characters were shallow and insincere. There were so many loose ends that the whole story was unsatisfying. The concluding paragraphs even admit that there are many loose ends ""that we will never know."" How lame is that? If you enjoy a mystery that actually gives you clues so that you can at least begin to piece the mystery together, then this book is not for you. Nothing in the book allows the reader to piece it together (the author doesn't even piece it together). On the other hand, if you enjoy just following the arbitrary actions of a shallow woman who toys with her affections, then knock yourself out. The greatest mystery is why I bothered to finish the book. 
(Guess I kept hoping there'd be some great finale.)",Not even enjoyable.,1,"List(wow, read, rave, review, one, figur, couldn, possibl, go, wrong, ever, sore, mistaken, first, book, elizabeth, peter, ve, read, main, charact, wasn, even, slightli, likabl, loos, sarcast, mean, person, reason, reader, might, root, get, mysteri, solv, charact, shallow, insincer, mani, loos, end, whole, stori, unsatisfi, conclud, paragraph, even, admit, mani, loos, end, never, know, lame, enjoy, mysteri, actual, give, clue, least, begin, piec, mysteri, togeth, book, noth, book, allow, reader, piec, togeth, author, doesn, even, piec, togeth, hand, enjoy, follow, arbitrari, action, shallow, woman, toy, affect, knock, greatest, mysteri, bother, finish, book, guess, kept, hope, d, great, final)","List(wow read, read rave, rave review, review one, one figur, figur couldn, couldn possibl, possibl go, go wrong, wrong ever, ever sore, sore mistaken, mistaken first, first book, book elizabeth, elizabeth peter, peter ve, ve read, read main, main charact, charact wasn, wasn even, even slightli, slightli likabl, likabl loos, loos sarcast, sarcast mean, mean person, person reason, reason reader, reader might, might root, root get, get mysteri, mysteri solv, solv charact, charact shallow, shallow insincer, insincer mani, mani loos, loos end, end whole, whole stori, stori unsatisfi, unsatisfi conclud, conclud paragraph, paragraph even, even admit, admit mani, mani loos, loos end, end never, never know, know lame, lame enjoy, enjoy mysteri, mysteri actual, actual give, give clue, clue least, least begin, begin piec, piec mysteri, mysteri togeth, togeth book, book noth, noth book, book allow, allow reader, reader piec, piec togeth, togeth author, author doesn, doesn even, even piec, piec togeth, togeth hand, hand enjoy, enjoy follow, follow arbitrari, arbitrari action, action shallow, shallow woman, woman toy, toy affect, affect knock, knock greatest, greatest mysteri, mysteri bother, bother finish, 
finish book, book guess, guess kept, kept hope, hope d, d great, great final)","List(wow, read, rave, review, one, figur, couldn, possibl, go, wrong, ever, sore, mistaken, first, book, elizabeth, peter, ve, main, charact, wasn, even, slightli, likabl, loos, sarcast, mean, person, reason, reader, might, root, get, mysteri, solv, shallow, insincer, mani, end, whole, stori, unsatisfi, conclud, paragraph, admit, never, know, lame, enjoy, actual, give, clue, least, begin, piec, togeth, noth, allow, author, doesn, hand, follow, arbitrari, action, woman, toy, affect, knock, greatest, bother, finish, guess, kept, hope, d, great, final, wow read, read rave, rave review, review one, one figur, figur couldn, couldn possibl, possibl go, go wrong, wrong ever, ever sore, sore mistaken, mistaken first, first book, book elizabeth, elizabeth peter, peter ve, ve read, read main, main charact, charact wasn, wasn even, even slightli, slightli likabl, likabl loos, loos sarcast, sarcast mean, mean person, person reason, reason reader, reader might, might root, root get, get mysteri, mysteri solv, solv charact, charact shallow, shallow insincer, insincer mani, mani loos, loos end, end whole, whole stori, stori unsatisfi, unsatisfi conclud, conclud paragraph, paragraph even, even admit, admit mani, end never, never know, know lame, lame enjoy, enjoy mysteri, mysteri actual, actual give, give clue, clue least, least begin, begin piec, piec mysteri, mysteri togeth, togeth book, book noth, noth book, book allow, allow reader, reader piec, piec togeth, togeth author, author doesn, doesn even, even piec, togeth hand, hand enjoy, enjoy follow, follow arbitrari, arbitrari action, action shallow, shallow woman, woman toy, toy affect, affect knock, knock greatest, greatest mysteri, mysteri bother, bother finish, finish book, book guess, guess kept, kept hope, hope d, d great, great final)","Map(vectorType -> sparse, length -> 4096, indices -> List(12, 14, 18, 35, 71, 110, 119, 132, 137, 183, 208, 
213, 289, 299, 301, 322, 398, 419, 425, 453, 459, 497, 503, 504, 517, 535, 565, 589, 637, 697, 700, 736, 743, 755, 775, 837, 861, 870, 871, 910, 914, 927, 936, 981, 1069, 1070, 1117, 1125, 1129, 1184, 1189, 1219, 1252, 1269, 1287, 1300, 1341, 1343, 1361, 1394, 1411, 1450, 1468, 1472, 1492, 1516, 1526, 1554, 1571, 1589, 1602, 1667, 1708, 1728, 1757, 1760, 1773, 1781, 1791, 1824, 1844, 1868, 1929, 1974, 2047, 2097, 2104, 2206, 2258, 2275, 2448, 2462, 2516, 2526, 2577, 2578, 2620, 2698, 2701, 2721, 2728, 2737, 2741, 2797, 2806, 2820, 2822, 2846, 2856, 2866, 2870, 2880, 2883, 2899, 2922, 2934, 2941, 2977, 2991, 2992, 3065, 3072, 3081, 3094, 3098, 3109, 3122, 3128, 3155, 3161, 3165, 3207, 3226, 3253, 3260, 3317, 3361, 3381, 3393, 3435, 3448, 3466, 3511, 3514, 3515, 3516, 3540, 3547, 3568, 3576, 3584, 3639, 3673, 3689, 3703, 3722, 3736, 3738, 3764, 3816, 3822, 3824, 3861, 3991, 4010, 4055, 4074), values -> List(1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> sparse, length -> 4096, indices -> List(12, 14, 18, 35, 71, 110, 119, 132, 137, 183, 208, 213, 289, 299, 301, 322, 398, 419, 425, 453, 459, 497, 503, 504, 517, 535, 565, 589, 637, 697, 700, 736, 743, 755, 775, 837, 861, 870, 
871, 910, 914, 927, 936, 981, 1069, 1070, 1117, 1125, 1129, 1184, 1189, 1219, 1252, 1269, 1287, 1300, 1341, 1343, 1361, 1394, 1411, 1450, 1468, 1472, 1492, 1516, 1526, 1554, 1571, 1589, 1602, 1667, 1708, 1728, 1757, 1760, 1773, 1781, 1791, 1824, 1844, 1868, 1929, 1974, 2047, 2097, 2104, 2206, 2258, 2275, 2448, 2462, 2516, 2526, 2577, 2578, 2620, 2698, 2701, 2721, 2728, 2737, 2741, 2797, 2806, 2820, 2822, 2846, 2856, 2866, 2870, 2880, 2883, 2899, 2922, 2934, 2941, 2977, 2991, 2992, 3065, 3072, 3081, 3094, 3098, 3109, 3122, 3128, 3155, 3161, 3165, 3207, 3226, 3253, 3260, 3317, 3361, 3381, 3393, 3435, 3448, 3466, 3511, 3514, 3515, 3516, 3540, 3547, 3568, 3576, 3584, 3639, 3673, 3689, 3703, 3722, 3736, 3738, 3764, 3816, 3822, 3824, 3861, 3991, 4010, 4055, 4074), values -> List(0.0, 4.37994112818286, 6.562657679029501, 4.562262684976814, 4.562262684976814, 0.0, 0.0, 3.175968323856924, 4.37994112818286, 4.225790448355602, 4.37994112818286, 3.7738053246125447, 4.37994112818286, 0.0, 4.37994112818286, 1.8676355042067454, 4.37994112818286, 2.8044047674244412, 4.09225905573108, 4.785406236291025, 3.2813288395147504, 3.7738053246125447, 4.785406236291025, 2.482821143296979, 4.37994112818286, 2.434030979127547, 3.7738053246125447, 4.37994112818286, 3.5326432677956565, 1.8149917707213237, 4.225790448355602, 4.09225905573108, 4.785406236291025, 3.974476020074696, 3.338487253354699, 4.225790448355602, 4.225790448355602, 4.562262684976814, 3.7738053246125447, 3.2272616182444747, 4.37994112818286, 3.338487253354699, 4.09225905573108, 3.399111875171134, 4.562262684976814, 4.785406236291025, 4.225790448355602, 0.0, 4.785406236291025, 3.338487253354699, 4.785406236291025, 2.5607826847666906, 4.09225905573108, 2.839496087235711, 3.2813288395147504, 3.974476020074696, 4.09225905573108, 1.6391011042576593, 3.974476020074696, 4.37994112818286, 4.37994112818286, 3.974476020074696, 3.8691155044168695, 3.6067512399493786, 4.225790448355602, 3.2272616182444747, 0.0, 4.09225905573108, 
4.225790448355602, 3.974476020074696, 3.974476020074696, 3.0806581440525993, 0.0, 4.225790448355602, 4.785406236291025, 4.785406236291025, 2.220456878829488, 4.225790448355602, 4.09225905573108, 0.0, 3.974476020074696, 3.974476020074696, 4.225790448355602, 9.124525369953629, 4.785406236291025, 3.7738053246125447, 0.0, 3.4636503963087053, 4.562262684976814, 4.225790448355602, 4.225790448355602, 4.785406236291025, 3.7738053246125447, 2.839496087235711, 4.225790448355602, 0.0, 0.0, 0.0, 2.9136040593894332, 4.562262684976814, 4.37994112818286, 3.974476020074696, 3.974476020074696, 4.09225905573108, 3.974476020074696, 4.562262684976814, 4.562262684976814, 4.37994112818286, 2.128649329576365, 2.3430592009218203, 3.0806581440525993, 4.785406236291025, 3.5326432677956565, 3.5326432677956565, 4.225790448355602, 2.9936467670629696, 3.8691155044168695, 4.562262684976814, 3.1271781596874924, 4.225790448355602, 8.451580896711205, 4.562262684976814, 2.8758637314065862, 4.785406236291025, 2.9136040593894332, 4.562262684976814, 4.562262684976814, 4.785406236291025, 0.0, 4.225790448355602, 0.0, 0.0, 0.0, 4.562262684976814, 4.37994112818286, 4.562262684976814, 0.0, 8.451580896711205, 4.562262684976814, 4.09225905573108, 2.839496087235711, 0.0, 4.37994112818286, 3.686793947622915, 4.562262684976814, 0.0, 0.0, 0.0, 0.0, 3.4636503963087053, 4.562262684976814, 3.4636503963087053, 4.09225905573108, 3.6067512399493786, 0.0, 0.0, 4.37994112818286, 4.785406236291025, 4.785406236291025, 4.785406236291025, 1.2888986748245443, 4.785406236291025, 2.737713392925769, 0.0, 0.0, 4.225790448355602, 4.562262684976814))"
2330875,5.0,True,2014-11-18,A17NH151B5PZQL,B001A60SXW,Phyllis O'Neill,Great quality; true to color; no complaints!,Great quality,0,"List(great, qualiti, true, color, complaint)","List(great qualiti, qualiti true, true color, color complaint)","List(great, qualiti, true, color, complaint, great qualiti, qualiti true, true color, color complaint)","Map(vectorType -> sparse, length -> 4096, indices -> List(136, 176, 340, 849, 1311, 1539, 1967, 2207, 3822), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> sparse, length -> 4096, indices -> List(136, 176, 340, 849, 1311, 1539, 1967, 2207, 3822), values -> List(3.4636503963087053, 4.225790448355602, 4.562262684976814, 3.6067512399493786, 4.785406236291025, 2.839496087235711, 0.0, 4.785406236291025, 1.2888986748245443))"
1798368,5.0,True,2013-09-21,A2MXHM7USSIVFB,B000WEIJ7K,mathgal,This fan is very small. It is very portable. My husband likes a fan blowing when he sleeps. He will take this with him when traveling.,mini fan,0,"List(fan, small, portabl, husband, like, fan, blow, sleep, take, travel)","List(fan small, small portabl, portabl husband, husband like, like fan, fan blow, blow sleep, sleep take, take travel)","List(fan, small, portabl, husband, like, blow, sleep, take, travel, fan small, small portabl, portabl husband, husband like, like fan, fan blow, blow sleep, sleep take, take travel)","Map(vectorType -> sparse, length -> 4096, indices -> List(590, 761, 1081, 1106, 1128, 1165, 1579, 1728, 2079, 2365, 2391, 2741, 2790, 3059, 3458, 3505, 4014), values -> List(1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> sparse, length -> 4096, indices -> List(590, 761, 1081, 1106, 1128, 1165, 1579, 1728, 2079, 2365, 2391, 2741, 2790, 3059, 3458, 3505, 4014), values -> List(3.0362063814817657, 3.2272616182444747, 8.451580896711205, 4.37994112818286, 3.7738053246125447, 4.225790448355602, 4.785406236291025, 4.225790448355602, 4.562262684976814, 0.0, 2.9936467670629696, 3.974476020074696, 4.225790448355602, 4.785406236291025, 1.9521928922348084, 3.974476020074696, 3.8691155044168695))"
30409,5.0,True,2017-01-05,A3QUUV9DAU7OD1,0002219417,Dan1,Excellent book. The story is based on well researched historic facts. It is well written and very captivating. I strongly recommend it.,Excellent book. The story is based on well researched ...,0,"List(excel, book, stori, base, well, research, histor, fact, well, written, captiv, strongli, recommend)","List(excel book, book stori, stori base, base well, well research, research histor, histor fact, fact well, well written, written captiv, captiv strongli, strongli recommend)","List(excel, book, stori, base, well, research, histor, fact, written, captiv, strongli, recommend, excel book, book stori, stori base, base well, well research, research histor, histor fact, fact well, well written, written captiv, captiv strongli, strongli recommend)","Map(vectorType -> sparse, length -> 4096, indices -> List(216, 360, 401, 430, 504, 697, 711, 1031, 1062, 1782, 1896, 2605, 2669, 2692, 2757, 2769, 2909, 3097, 3169, 3430, 3535, 3663, 3753, 3974), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> sparse, length -> 4096, indices -> List(216, 360, 401, 430, 504, 697, 711, 1031, 1062, 1782, 1896, 2605, 2669, 2692, 2757, 2769, 2909, 3097, 3169, 3430, 3535, 3663, 3753, 3974), values -> List(2.9136040593894332, 4.225790448355602, 4.785406236291025, 4.785406236291025, 2.482821143296979, 1.8149917707213237, 0.0, 4.785406236291025, 4.225790448355602, 4.785406236291025, 0.0, 1.9670079780199492, 4.09225905573108, 2.3650381076405953, 4.09225905573108, 3.974476020074696, 3.974476020074696, 4.37994112818286, 4.37994112818286, 4.225790448355602, 4.09225905573108, 3.4636503963087053, 3.686793947622915, 3.686793947622915))"
1461521,5.0,True,2014-02-15,AS0WVV16FAXK5,B000N25IDO,Kari,I AM SO SATISFIED--they are such a fine grade of stainless steel and will look lovely with my seashell dishes.,SO PRETTY,0,"List(satisfi, fine, grade, stainless, steel, look, love, seashel, dish)","List(satisfi fine, fine grade, grade stainless, stainless steel, steel look, look love, love seashel, seashel dish)","List(satisfi, fine, grade, stainless, steel, look, love, seashel, dish, satisfi fine, fine grade, grade stainless, stainless steel, steel look, look love, love seashel, seashel dish)","Map(vectorType -> sparse, length -> 4096, indices -> List(97, 191, 229, 277, 456, 602, 1198, 1201, 1583, 1616, 1692, 2151, 2160, 2214, 2340, 2579, 3421), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> sparse, length -> 4096, indices -> List(97, 191, 229, 277, 456, 602, 1198, 1201, 1583, 1616, 1692, 2151, 2160, 2214, 2340, 2579, 3421), values -> List(3.399111875171134, 3.7738053246125447, 4.37994112818286, 4.37994112818286, 3.4636503963087053, 4.562262684976814, 2.9936467670629696, 0.0, 3.4636503963087053, 4.785406236291025, 4.562262684976814, 4.225790448355602, 1.6830642276787753, 3.6067512399493786, 3.7738053246125447, 2.434030979127547, 3.974476020074696))"
2590273,5.0,True,2016-01-06,A2CFJ4DVIK5HJ2,B001V3O4XY,BAC,"This is a very nice, sturdy log rack cover that is working perfectly for keeping the elements out of our wood-burning stove logs. Good product at a great price. Very pleased with purchase.","This is a very nice, sturdy log rack cover that is working perfectly ...",0,"List(nice, sturdi, log, rack, cover, work, perfectli, keep, element, wood, burn, stove, log, good, product, great, price, pleas, purchas)","List(nice sturdi, sturdi log, log rack, rack cover, cover work, work perfectli, perfectli keep, keep element, element wood, wood burn, burn stove, stove log, log good, good product, product great, great price, price pleas, pleas purchas)","List(nice, sturdi, log, rack, cover, work, perfectli, keep, element, wood, burn, stove, good, product, great, price, pleas, purchas, nice sturdi, sturdi log, log rack, rack cover, cover work, work perfectli, perfectli keep, keep element, element wood, wood burn, burn stove, stove log, log good, good product, product great, great price, price pleas, pleas purchas)","Map(vectorType -> sparse, length -> 4096, indices -> List(58, 93, 113, 122, 535, 560, 608, 655, 682, 801, 927, 947, 1016, 1047, 1575, 1608, 1866, 1906, 1922, 1938, 1957, 2395, 2498, 2607, 2705, 2723, 2840, 2933, 3412, 3559, 3727, 3752, 3784, 3822, 4055), values -> List(1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> sparse, length -> 4096, indices -> List(58, 93, 113, 122, 535, 560, 608, 655, 682, 801, 927, 947, 1016, 1047, 1575, 1608, 1866, 1906, 1922, 1938, 1957, 2395, 2498, 2607, 2705, 2723, 2840, 2933, 3412, 3559, 3727, 3752, 3784, 3822, 4055), values -> List(4.562262684976814, 4.562262684976814, 0.0, 2.705964694611189, 4.868061958255094, 3.399111875171134, 4.37994112818286, 4.225790448355602, 2.675193035944435, 4.37994112818286, 3.338487253354699, 
4.562262684976814, 4.562262684976814, 0.0, 1.9820458553844897, 3.5326432677956565, 2.482821143296979, 3.974476020074696, 3.1271781596874924, 3.8691155044168695, 4.562262684976814, 0.0, 0.0, 2.9528247725427144, 4.225790448355602, 4.785406236291025, 1.7173533011574076, 3.338487253354699, 4.562262684976814, 4.37994112818286, 1.937594092813656, 3.6067512399493786, 4.562262684976814, 1.2888986748245443, 4.225790448355602))"
2502336,5.0,True,2015-07-31,A2FS5AD2V6S52E,B001L1BTDY,Dennis Novak,"That what I call bag, nice and big !!!!! Dennis PA.",nice and big,0,"List(call, bag, nice, big, denni, pa)","List(call bag, bag nice, nice big, big denni, denni pa)","List(call, bag, nice, big, denni, pa, call bag, bag nice, nice big, big denni, denni pa)","Map(vectorType -> sparse, length -> 4096, indices -> List(324, 856, 1471, 1547, 1642, 1709, 1866, 2189, 3210, 3222, 3986), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> sparse, length -> 4096, indices -> List(324, 856, 1471, 1547, 1642, 1709, 1866, 2189, 3210, 3222, 3986), values -> List(4.37994112818286, 3.338487253354699, 3.5326432677956565, 0.0, 0.0, 4.785406236291025, 2.482821143296979, 4.37994112818286, 3.686793947622915, 4.37994112818286, 4.37994112818286))"
18491,5.0,True,2017-09-24,A3A76MFPUD07RC,B00002NC6F,thinkering,Guess what? couldn't find that size at Sears where i bought the vacuum cleaner. at Amazon i got a package with two belts and it was less than the one non-matching belt from Sears that i bought first. now i have one to spare. it was a cinch putting the belt in and it's working just fine.,It fits!,0,"List(guess, couldn, find, size, sear, bought, vacuum, cleaner, amazon, got, packag, two, belt, less, one, non, match, belt, sear, bought, first, one, spare, cinch, put, belt, work, fine)","List(guess couldn, couldn find, find size, size sear, sear bought, bought vacuum, vacuum cleaner, cleaner amazon, amazon got, got packag, packag two, two belt, belt less, less one, one non, non match, match belt, belt sear, sear bought, bought first, first one, one spare, spare cinch, cinch put, put belt, belt work, work fine)","List(guess, couldn, find, size, sear, bought, vacuum, cleaner, amazon, got, packag, two, belt, less, one, non, match, first, spare, cinch, put, work, fine, guess couldn, couldn find, find size, size sear, sear bought, bought vacuum, vacuum cleaner, cleaner amazon, amazon got, got packag, packag two, two belt, belt less, less one, one non, non match, match belt, belt sear, bought first, first one, one spare, spare cinch, cinch put, put belt, belt work, work fine)","Map(vectorType -> sparse, length -> 4096, indices -> List(31, 191, 259, 368, 371, 419, 556, 618, 631, 680, 1300, 1343, 1575, 1580, 1583, 1586, 1664, 1700, 1766, 1949, 2114, 2218, 2320, 2507, 2531, 2575, 2760, 2780, 2801, 2805, 2872, 2875, 2966, 3145, 3209, 3227, 3302, 3435, 3448, 3524, 3555, 3597, 3622, 3633, 3686, 3722, 3752, 4011, 4023), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> sparse, length -> 4096, 
indices -> List(31, 191, 259, 368, 371, 419, 556, 618, 631, 680, 1300, 1343, 1575, 1580, 1583, 1586, 1664, 1700, 1766, 1949, 2114, 2218, 2320, 2507, 2531, 2575, 2760, 2780, 2801, 2805, 2872, 2875, 2966, 3145, 3209, 3227, 3302, 3435, 3448, 3524, 3555, 3597, 3622, 3633, 3686, 3722, 3752, 4011, 4023), values -> List(3.974476020074696, 3.7738053246125447, 4.225790448355602, 4.09225905573108, 4.562262684976814, 2.8044047674244412, 4.225790448355602, 2.9936467670629696, 4.562262684976814, 3.2272616182444747, 3.974476020074696, 1.6391011042576593, 1.9820458553844897, 4.562262684976814, 3.4636503963087053, 3.7738053246125447, 4.225790448355602, 4.785406236291025, 3.0806581440525993, 3.2813288395147504, 4.785406236291025, 4.785406236291025, 4.225790448355602, 0.0, 3.7738053246125447, 4.09225905573108, 4.09225905573108, 4.562262684976814, 3.974476020074696, 0.0, 3.1271781596874924, 4.785406236291025, 4.785406236291025, 4.785406236291025, 3.974476020074696, 3.6067512399493786, 4.562262684976814, 4.09225905573108, 2.839496087235711, 0.0, 4.37994112818286, 4.785406236291025, 2.410500481717353, 3.686793947622915, 4.225790448355602, 0.0, 3.6067512399493786, 4.37994112818286, 3.399111875171134))"
801755,5.0,True,2008-07-02,A2TVH2OBNXYXHV,B00063QBDQ,Evan Jacobs,"This is an amazing cutting board. It's HUGE and solid. It's my new favorite thing in the kitchen. If you have the space, I highly recommend getting one.",Amazing,1,"List(amaz, cut, board, huge, solid, new, favorit, thing, kitchen, space, highli, recommend, get, one)","List(amaz cut, cut board, board huge, huge solid, solid new, new favorit, favorit thing, thing kitchen, kitchen space, space highli, highli recommend, recommend get, get one)","List(amaz, cut, board, huge, solid, new, favorit, thing, kitchen, space, highli, recommend, get, one, amaz cut, cut board, board huge, huge solid, solid new, new favorit, favorit thing, thing kitchen, kitchen space, space highli, highli recommend, recommend get, get one)","Map(vectorType -> sparse, length -> 4096, indices -> List(7, 105, 134, 135, 196, 280, 433, 861, 1343, 1796, 1943, 2102, 2478, 2538, 2692, 2757, 2866, 2873, 3175, 3222, 3230, 3339, 3344, 3702, 3817, 3835), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> sparse, length -> 4096, indices -> List(7, 105, 134, 135, 196, 280, 433, 861, 1343, 1796, 1943, 2102, 2478, 2538, 2692, 2757, 2866, 2873, 3175, 3222, 3230, 3339, 3344, 3702, 3817, 3835), values -> List(4.37994112818286, 3.338487253354699, 3.8691155044168695, 0.0, 3.6067512399493786, 3.175968323856924, 2.9936467670629696, 4.225790448355602, 1.6391011042576593, 3.338487253354699, 4.225790448355602, 4.562262684976814, 2.9936467670629696, 3.2813288395147504, 2.3650381076405953, 4.09225905573108, 2.3430592009218203, 0.0, 4.562262684976814, 4.37994112818286, 7.37358789524583, 4.09225905573108, 4.09225905573108, 3.5326432677956565, 2.839496087235711, 0.0))"
1763354,1.0,True,2015-11-30,A1DMT20M71IQSM,0061043575,Dawn,Gave it to the library.,One Star,0,"List(gave, librari)",List(gave librari),"List(gave, librari, gave librari)","Map(vectorType -> sparse, length -> 4096, indices -> List(30, 1053, 3507), values -> List(1.0, 1.0, 1.0))","Map(vectorType -> sparse, length -> 4096, indices -> List(30, 1053, 3507), values -> List(3.974476020074696, 4.562262684976814, 0.0))"


In [0]:
# Random Forest

from pyspark.ml import Pipeline
from pyspark.ml.feature import IndexToString, StringIndexer
from pyspark.ml.classification import RandomForestClassifier

# Index the star rating ("overall") as the numeric label column
labelIndexer = StringIndexer(inputCol="overall", outputCol="indexedScore").fit(prediction_tfidf_hash)
rf = RandomForestClassifier(labelCol="indexedScore", featuresCol="TF-IDF", numTrees=40)
# Map the predicted index back to the original rating value
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)

pipeline = Pipeline(stages=[labelIndexer, rf, labelConverter])

# 70/30 train/test split (no seed is set, so the split differs between runs)
(trainingData, testData) = prediction_tfidf_hash.randomSplit([0.7, 0.3])

rf_model = pipeline.fit(trainingData)
predictions = rf_model.transform(testData)
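
The label round trip done by `StringIndexer` and `IndexToString` can be illustrated in plain Python. This is a minimal sketch, assuming the default behaviour of ordering labels by descending frequency; the helper `fit_label_index` and the toy `overall` values are hypothetical and are not part of the notebook.

```python
from collections import Counter


def fit_label_index(values):
    """Order distinct labels by descending frequency (ties broken alphabetically)."""
    counts = Counter(str(v) for v in values)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, _ in ordered]


# Hypothetical ratings: 5.0 is the most frequent value, so it gets index 0.
overall = [5.0, 5.0, 1.0, 5.0, 1.0, 5.0]

labels = fit_label_index(overall)                         # plays the role of labelIndexer.labels
to_index = {label: float(i) for i, label in enumerate(labels)}

indexed = [to_index[str(v)] for v in overall]             # like the indexedScore column
restored = [labels[int(i)] for i in indexed]              # like IndexToString on an index

print(labels)
print(indexed)
print(restored)
```

Under that assumption, the most common rating maps to index 0.0, which is consistent with the `overall` and `indexedScore` columns in the prediction output.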


In [0]:
display(predictions.select("overall", "indexedScore", "rawPrediction", "probability", "prediction", "predictedLabel"))

overall,indexedScore,rawPrediction,probability,prediction,predictedLabel
1.0,1.0,"Map(vectorType -> dense, length -> 2, values -> List(33.56263223286718, 6.437367767132829))","Map(vectorType -> dense, length -> 2, values -> List(0.8390658058216793, 0.1609341941783207))",0.0,5.0
5.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(38.21759843717743, 1.7824015628225707))","Map(vectorType -> dense, length -> 2, values -> List(0.9554399609294357, 0.044560039070564265))",0.0,5.0
5.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(38.21759843717743, 1.7824015628225707))","Map(vectorType -> dense, length -> 2, values -> List(0.9554399609294357, 0.044560039070564265))",0.0,5.0
5.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(36.61545191490311, 3.3845480850968883))","Map(vectorType -> dense, length -> 2, values -> List(0.9153862978725777, 0.08461370212742221))",0.0,5.0
5.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(36.944508319845646, 3.055491680154349))","Map(vectorType -> dense, length -> 2, values -> List(0.9236127079961413, 0.07638729200385873))",0.0,5.0
5.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(38.21759843717743, 1.7824015628225707))","Map(vectorType -> dense, length -> 2, values -> List(0.9554399609294357, 0.044560039070564265))",0.0,5.0
5.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(38.21759843717743, 1.7824015628225707))","Map(vectorType -> dense, length -> 2, values -> List(0.9554399609294357, 0.044560039070564265))",0.0,5.0
5.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(38.21759843717743, 1.7824015628225707))","Map(vectorType -> dense, length -> 2, values -> List(0.9554399609294357, 0.044560039070564265))",0.0,5.0
5.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(38.21759843717743, 1.7824015628225707))","Map(vectorType -> dense, length -> 2, values -> List(0.9554399609294357, 0.044560039070564265))",0.0,5.0
5.0,0.0,"Map(vectorType -> dense, length -> 2, values -> List(38.25225946542667, 1.7477405345733374))","Map(vectorType -> dense, length -> 2, values -> List(0.9563064866356665, 0.04369351336433343))",0.0,5.0
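
The relationship between the `rawPrediction`, `probability`, `prediction`, and `predictedLabel` columns can be checked by hand. The sketch below assumes that the random forest sums per-tree class votes into `rawPrediction` and normalizes them to obtain `probability`. The numbers are copied from the first output row; the `labels` list is inferred from the output rows rather than read from `labelIndexer.labels`.

```python
# Values copied from the first row of the output above.
raw_prediction = [33.56263223286718, 6.437367767132829]

# Assumption: label order inferred from the output (5.0 -> index 0, 1.0 -> index 1).
labels = ["5.0", "1.0"]


def normalize(raw):
    """Scale raw class scores so that they sum to 1."""
    total = sum(raw)
    return [score / total for score in raw]


probability = normalize(raw_prediction)

# The predicted index is the class with the highest probability.
prediction = max(range(len(probability)), key=probability.__getitem__)
predicted_label = labels[prediction]

print(sum(raw_prediction))   # close to numTrees = 40
print(probability)
print(prediction, predicted_label)
```

The raw scores add up to roughly the number of trees, and dividing by that total gives the values in the `probability` column. The first row is therefore a misclassification: the true rating is 1.0, but class 0 (rating 5.0) receives about 84% of the vote.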


In [0]:
# Calculate AUC for train/test split

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol="indexedScore", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print("AUC = %g" % auc)

AUC = 0.783126
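
To build intuition for the `areaUnderROC` metric, it helps to view AUC as the fraction of (positive, negative) pairs in which the positive example gets the higher score. The sketch below is a plain-Python illustration of that idea on hypothetical toy data. It is not how `BinaryClassificationEvaluator` is implemented, which builds the ROC curve from the `rawPrediction` column.

```python
def area_under_roc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical toy data: label 1 stands for indexedScore 1.0,
# and each score is the model's score for class 1.
toy_labels = [1, 0, 0, 1, 0]
toy_scores = [0.16, 0.04, 0.08, 0.30, 0.20]

print("AUC = %g" % area_under_roc(toy_labels, toy_scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is the scale on which the value above should be read.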


In [0]:
# Performance evaluation with 5-fold cross validation

from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

evaluator = BinaryClassificationEvaluator(labelCol="indexedScore", metricName="areaUnderROC")
# An empty grid yields a single parameter map, so cross validation only estimates performance (no tuning)
paramGrid = ParamGridBuilder().build()
cv = CrossValidator(estimator=pipeline, evaluator=evaluator, estimatorParamMaps=paramGrid, numFolds=5)
cvModel = cv.fit(prediction_tfidf_hash)

print("Average AUC = %g" % cvModel.avgMetrics[0])

Average AUC = 0.491531
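
The idea behind 5-fold cross validation can be sketched without Spark: split the rows into folds, hold each fold out once, and average the metric. The helpers `k_fold_indices`, `cross_validate`, and `dummy_fit_and_score` below are hypothetical stand-ins. `CrossValidator` assigns rows to folds randomly, so its folds are only approximately equal in size.

```python
import random


def k_fold_indices(n_rows, num_folds, seed=42):
    """Shuffle row indices and deal them into num_folds roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::num_folds] for i in range(num_folds)]


def cross_validate(n_rows, num_folds, fit_and_score):
    """Hold out each fold once and return the average metric."""
    metrics = []
    for held_out in k_fold_indices(n_rows, num_folds):
        held_out_set = set(held_out)
        train = [i for i in range(n_rows) if i not in held_out_set]
        metrics.append(fit_and_score(train, held_out))
    return sum(metrics) / len(metrics)


# Hypothetical stand-in for "fit the pipeline on train, evaluate AUC on held-out":
# it simply returns the share of rows used for training.
def dummy_fit_and_score(train_idx, test_idx):
    return len(train_idx) / (len(train_idx) + len(test_idx))


print("Average metric = %g" % cross_validate(100, 5, dummy_fit_and_score))
```

Because the parameter grid in the cell above contains a single parameter map, `cvModel.avgMetrics` holds one value: the metric averaged over the five folds.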


### Doc2Vec + Random Forest

In [0]:
# Copy prediction data

prediction_doc2vec = prediction_df_sampled.select('*')

In [0]:
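# Calculate document vectors ("Doc2Vec" here means averaged Word2Vec: Spark's Word2Vec model averages the word vectors of each review)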

from pyspark.ml.feature import Word2Vec

word2Vec = Word2Vec(inputCol="reviewWordCleaned", outputCol="doc2vec")
w2v_model = word2Vec.fit(prediction_doc2vec)

prediction_doc2vec = w2v_model.transform(prediction_doc2vec)
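
The document vectors produced here are averages of word vectors, which can be sketched in plain Python. The toy 2-dimensional `word_vectors` and the helper `average_vectors` are hypothetical; the real model learns 100-dimensional vectors from `reviewWordCleaned`. Treating out-of-vocabulary words as zero vectors is an assumption of this sketch.

```python
def average_vectors(words, word_vectors, size):
    """Average the vectors of the given words into one document vector."""
    if not words:
        return [0.0] * size
    total = [0.0] * size
    for word in words:
        # Assumption: words without a learned vector contribute nothing to the sum.
        vec = word_vectors.get(word)
        if vec is None:
            continue
        for i, value in enumerate(vec):
            total[i] += value
    return [value / len(words) for value in total]


# Hypothetical 2-dimensional word vectors for a few stemmed tokens.
word_vectors = {
    "great": [1.0, 0.0],
    "qualiti": [0.0, 1.0],
    "color": [1.0, 1.0],
}

doc = ["great", "qualiti", "color", "great"]
print(average_vectors(doc, word_vectors, size=2))
```

Averaging discards word order, so two reviews with the same bag of words get the same vector. That is one reason the per-review vectors in the output below look similar in sign pattern and differ mainly in magnitude.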

In [0]:
display(prediction_doc2vec)

reviewID,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,label,reviewWordCleaned,doc2vec
1981107,1.0,True,2010-05-10,A2TMEJ22IET200,0061656100,truth is logical,"Wow, after reading all the raving reviews on this one, I figured I couldn't possibly go wrong. Was I ever sorely mistaken. This was the first book by Elizabeth Peters I've read. The main character wasn't even slightly likable. She was a loose, sarcastic and mean person. The only reason the reader might root for her would be just to get the mystery solved. The other characters were shallow and insincere. There were so many loose ends that the whole story was unsatisfying. The concluding paragraphs even admit that there are many loose ends ""that we will never know."" How lame is that? If you enjoy a mystery that actually gives you clues so that you can at least begin to piece the mystery together, then this book is not for you. Nothing in the book allows the reader to piece it together (the author doesn't even piece it together). On the other hand, if you enjoy just following the arbitrary actions of a shallow woman who toys with her affections, then knock yourself out. The greatest mystery is why I bothered to finish the book. 
(Guess I kept hoping there'd be some great finale.)",Not even enjoyable.,1,"List(wow, read, rave, review, one, figur, couldn, possibl, go, wrong, ever, sore, mistaken, first, book, elizabeth, peter, ve, read, main, charact, wasn, even, slightli, likabl, loos, sarcast, mean, person, reason, reader, might, root, get, mysteri, solv, charact, shallow, insincer, mani, loos, end, whole, stori, unsatisfi, conclud, paragraph, even, admit, mani, loos, end, never, know, lame, enjoy, mysteri, actual, give, clue, least, begin, piec, mysteri, togeth, book, noth, book, allow, reader, piec, togeth, author, doesn, even, piec, togeth, hand, enjoy, follow, arbitrari, action, shallow, woman, toy, affect, knock, greatest, mysteri, bother, finish, book, guess, kept, hope, d, great, final)","Map(vectorType -> dense, length -> 100, values -> List(-0.004688795161200687, 0.01183068166886057, 0.011641626414362987, -0.014178378124987439, 3.651971667434792E-4, 0.01043624402919537, -0.002648466687212337, 0.011437084357140167, -0.010252108643002503, -0.010576695449678796, 0.006783989714862474, -0.007940840777912538, 0.002149063861591513, -0.0067340118961399225, 0.002376375209102977, -0.008385658365132628, -0.015239988056923812, -0.016427046994735694, 0.002377903965011961, -1.9210219345820533E-4, 0.0036844007281244405, -0.010543926984870007, -6.770662475158745E-4, -0.005163045463446831, 0.005306382121860373, 0.0017524700058263022, -0.007154896554218757, -0.0011933446598681144, -0.009554780506509906, 2.8877555198459026E-4, -0.01176074346140198, 0.004740216064252605, -0.0055250332326086575, 0.002053321352257032, -0.003579194172242732, 0.013346121152886664, -0.004576148375885427, 0.008773214705240893, 0.01004636330572812, -0.011910343606842263, 3.2143836996306627E-4, -0.011566822647535223, 1.3272936900778274E-4, -0.01584239531075582, 0.011358153244851118, 0.0014308289380040855, -0.007238188117020288, 4.343941854669389E-5, -0.004376419250287913, -0.004352399013515999, 0.00869433792802144, 
0.007960275047202118, 0.004645183833009487, -2.919943881908916E-4, -7.030844459148618E-4, -0.0020629570092435697, -0.010789652366894392, -0.006105925894950573, 0.0018293077431438603, -0.0022900022595422342, 0.006853842954282002, -0.0022683619240617225, 0.010058533218546713, 8.969037107260878E-4, -0.009030778609615351, 0.012207700535940119, 0.006980669177172952, -0.003988481610799588, -0.001323521018446404, -0.0040401957224702875, 0.001380632760591819, 0.01230161621446759, -0.0013651180746893834, 0.0011547191906898587, -0.0033304156629815314, 0.0011624071618770153, 0.002211298000029459, -0.0052688421039457185, -0.0015124635469781385, 0.004961100196684425, 0.0013774160301070469, 0.0015924506486935197, -0.002475647247462932, -0.005584518165013999, -0.018404739894619097, 0.02456307214713294, 0.0037516866957920847, 0.01487894106789359, 0.013491534211017115, -0.006180952893503542, -0.003675617388988921, 0.0019710826689296646, 0.0010630710116392762, -0.0159171831138356, 0.006782065238922416, 0.016794642157928676, 0.0021692390377842344, -0.003204257298755098, 0.008652545528353324, -0.009880109434728794))"
2330875,5.0,True,2014-11-18,A17NH151B5PZQL,B001A60SXW,Phyllis O'Neill,Great quality; true to color; no complaints!,Great quality,0,"List(great, qualiti, true, color, complaint)","Map(vectorType -> dense, length -> 100, values -> List(-0.0039030041079968214, 0.009753770008683205, 0.00882127033546567, -0.008212233800441027, -4.711554036475718E-4, 0.006573691393714398, -9.8326996085234E-4, 0.006709872465580702, -0.006335344118997455, -0.005472446640487761, 0.005148970428854227, -0.004153336957097054, 0.002344810962677002, -0.004162915598135442, 0.0015461464878171684, -0.0055698962882161146, -0.01034230620134622, -0.013199233822524548, 0.0026110057253390553, 0.001919164345599711, 0.0042089109774678946, -0.0088010068051517, 0.0015423231059685351, -0.0035635625477880243, 0.0010177076095715166, 0.0018332132138311865, -0.004926103499019519, -0.0029900423716753724, -0.007519173575565219, -5.267340689897537E-4, -0.0074165400001220405, 0.0012057213578373195, -0.00204610520740971, 0.0022555399802513423, -0.0025852602906525136, 0.00783300120383501, -0.0015887927263975145, 0.0067100759595632555, 0.005261230003088713, -0.006158623285591603, 6.928731920197607E-4, -0.007040850399062038, 1.097293570637703E-4, -0.009536758717149497, 0.006497940886765719, 0.0037974686594679954, -0.0036569125950336456, 0.001271548983640969, -0.004438129533082247, -0.0023952370742335916, 0.005945260287262499, 0.003687827941030264, 0.0010446784086525442, 0.0023724220227450132, 0.001849589858284162, -0.002373987005557865, -0.006045668106526137, -0.005471569392830134, 0.0028322743863100188, -0.003358825668692589, 0.003260094439610839, -7.974491454660894E-5, 0.007755045988596976, 2.328670700080693E-4, -0.00514799146912992, 0.008394619193859398, 0.003542025201022625, -0.002531102253124118, -0.0025962020736187696, -0.005122767947614193, -6.743747158907354E-4, 0.00606055126991123, 4.384774947538972E-4, -4.2891921475529676E-5, -0.0022896177135407927, 9.689813945442439E-4, 1.5473368111997845E-4, 
-0.004670837335288525, -0.0011822701431810857, 0.003396775998407975, 0.0018694101832807065, -1.5263527166098356E-4, -0.0013313945615664126, -0.003904251661151648, -0.013000541459769012, 0.01655400702729821, 0.003292361996136606, 0.011519848369061947, 0.010416828538291158, -0.003613044880330563, -0.0013006877154111863, 0.002637900330591947, -4.831018566619605E-4, -0.00994245456531644, 0.005494269356131554, 0.01322996448725462, -0.0010702814906835556, -0.004226215835660696, 0.0070852959295734765, -0.0062609828077256685))"
1798368,5.0,True,2013-09-21,A2MXHM7USSIVFB,B000WEIJ7K,mathgal,This fan is very small. It is very portable. My husband likes a fan blowing when he sleeps. He will take this with him when traveling.,mini fan,0,"List(fan, small, portabl, husband, like, fan, blow, sleep, take, travel)","Map(vectorType -> dense, length -> 100, values -> List(-0.0017544373171404005, 0.007629790017381311, 0.006575066328514368, -0.009488816210068763, -8.07618844555691E-4, 0.006329449824988842, -7.247765548527241E-4, 0.00698975456180051, -0.007842714816797524, -0.007105752592906356, 0.003834894485771656, -0.004467748588649556, 0.002042536111548543, -0.004052977368701249, 0.0023565788404084744, -0.004515347257256508, -0.009348824305925518, -0.00960337738506496, 0.0030644650803878905, -9.935752052115277E-4, 0.0014328358112834394, -0.005624922970309854, -0.001378026549355127, -0.003304260119330138, 0.0028358655283227566, 0.0012588848941959441, -0.004534128098748625, -0.0010376453516073526, -0.004547891998663545, 0.0018702404042414856, -0.007511137891560793, 0.002638837299309671, -0.003097488544881344, 0.001331952167674899, -0.00222760031465441, 0.008890543947927654, -0.002228937018662691, 0.0031797941424883906, 0.006159290438517928, -0.007698986539617181, 7.275572046637536E-4, -0.00697012497112155, -5.760855739936233E-4, -0.011611710814759136, 0.008362066384870559, 0.0015533972473349422, -0.0038027297821827235, 0.002114848588826135, -0.0034053237177431583, -0.0010574617132078858, 0.005378555716015399, 0.0062276775715872645, 0.0013578106998465957, -0.001223273715004325, -0.0014141577412374318, -0.001282646178151481, -0.006709820381365717, -0.0026341669203247876, 0.0010787002131110058, -0.0020849559747148304, 0.00513321491307579, -6.731833913363517E-4, 0.006213229289278388, 3.273530513979495E-4, -0.0069144384004175665, 0.00871394081041217, 0.004638057644478977, -0.0025375437398906797, -0.0013518815394490957, -0.003337418881710619, 0.0024284934508614245, 0.0076835399726405745, 
3.4267635783180598E-6, -7.519087521359325E-5, -0.0021578382467851045, 0.001625098625663668, 0.001250567357055843, -0.0016893635620363057, -0.0032613137213047595, 0.002711925841867924, 6.71797525137663E-4, 2.812095859553665E-4, -8.799405564786867E-4, -8.961406769230963E-4, -0.012388417753390968, 0.013795071025379003, 0.002656426257453859, 0.010564728989265859, 0.009610881097614765, -0.004657780670095236, -0.0022913432738278063, 0.003027397964615375, 3.3891932107508186E-4, -0.009817149094305934, 0.004154309025034309, 0.011061744275502862, 6.922625092556701E-4, -0.0012687066104263068, 0.005281999870203436, -0.0068083485006354754))"
30409,5.0,True,2017-01-05,A3QUUV9DAU7OD1,0002219417,Dan1,Excellent book. The story is based on well researched historic facts. It is well written and very captivating. I strongly recommend it.,Excellent book. The story is based on well researched ...,0,"List(excel, book, stori, base, well, research, histor, fact, well, written, captiv, strongli, recommend)","Map(vectorType -> dense, length -> 100, values -> List(-0.009058388545571899, 0.020690340333833147, 0.019605667294504553, -0.022916408649717387, 0.0010026721507669068, 0.01755113840604631, -0.004361468761299665, 0.019441864748771947, -0.01874787833255071, -0.01600842429504085, 0.010445180563972548, -0.012635015837776547, 0.0039585621472304836, -0.010915103444578843, 0.005962262934190221, -0.014813378901005939, -0.025701801185137953, -0.028824552141416535, 0.004040676527298414, -2.670231669281538E-5, 0.005819529406905461, -0.017706385025611292, -4.3609454139816365E-4, -0.008571871216050709, 0.00905032184583923, 0.002056455375098337, -0.01014136284804688, -0.0032129136111157448, -0.016586739784823015, 0.0015353506180242852, -0.019206088098983925, 0.00813186727464199, -0.0100555163449966, 0.0016449284918892842, -0.006617825017131579, 0.02387835472249068, -0.0071486110637824125, 0.015066012238653807, 0.015022405432178998, -0.019706170934324082, 7.121332724077197E-4, -0.018910601609744705, 0.0015613813772618484, -0.027051464654505253, 0.01925281077050246, 0.003481388718892749, -0.011979850116544045, 2.404897867773588E-5, -0.007472795040275042, -0.008722254478086073, 0.014499496399926452, 0.013014370360626625, 0.007525235598083013, -1.8441816791892052E-5, -0.0011078017548872875, -0.0026985137642791066, -0.017339529478564285, -0.01058056653262331, 0.00403670977595119, -0.005604706552381126, 0.011972202059741205, -0.004346869617270736, 0.016204853757069662, 0.001237762720288279, -0.014337700666286625, 0.021144367575358886, 0.010480015610273069, -0.007352914231327864, -0.0024214644665614916, -0.007413093770782535, 
-4.1094405101970414E-4, 0.01953057625080244, -0.003244844981684135, 0.0011549166833552031, -0.005551466557125633, 0.0027410522295842664, 0.0029492140344630643, -0.007873328725019326, -0.0027206833474338055, 0.007466502407064232, 0.002222140758441618, 0.002922874306722616, -0.0033198441401159824, -0.00894621615477193, -0.030559699767484117, 0.04175328194665221, 0.006504001758562831, 0.024093436578718517, 0.02046252702935957, -0.012170057134846082, -0.005611152625463617, 0.0046708426850203145, 0.0029166728759614322, -0.027364472822787672, 0.012372549748621309, 0.02844202224738323, 0.004474385856435849, -0.005669431443003795, 0.015031458236850226, -0.015386145723123964))"
1461521,5.0,True,2014-02-15,AS0WVV16FAXK5,B000N25IDO,Kari,I AM SO SATISFIED--they are such a fine grade of stainless steel and will look lovely with my seashell dishes.,SO PRETTY,0,"List(satisfi, fine, grade, stainless, steel, look, love, seashel, dish)","Map(vectorType -> dense, length -> 100, values -> List(-0.004116812701492259, 0.008322561030379599, 0.010021779712082611, -0.01138417766843405, -9.427118412632908E-4, 0.006948080896917316, -0.0014735631282544797, 0.009996210936353438, -0.008514128909963701, -0.008286014199256897, 0.0034379240565208923, -0.007512176512844033, 9.88972599669877E-4, -0.003658734067964057, 0.002098080314074953, -0.006675373116094205, -0.011968847053746382, -0.012549544658718837, 0.0011668428907998733, 1.334912716023003E-4, 0.0033933054138388895, -0.010474547148785656, 0.0014600032267885075, -0.0037584741755078235, 0.003758765110332105, 0.001246756925765011, -0.00593391687531645, -0.0014319549890286806, -0.006844442583517067, 0.0014767852391944162, -0.00850614074928065, 0.0029537024868962663, -0.00275340442183531, 0.003399655171152618, -0.001984392137577136, 0.011123833945021033, -0.004805550897597439, 0.009916737300550772, 0.0061441235140793846, -0.009591150058743853, 9.248034524110456E-4, -0.0077219609585073255, 8.473037742078304E-4, -0.013396906562977366, 0.009434748162877642, 0.0014024366456497873, -0.004080946841794583, 9.106817241344187E-4, -0.0032407038702836465, -0.0023584850908567505, 0.006837962686808572, 0.0062680695167121785, 0.0033288614358752966, 8.122645603078934E-4, 2.5939354964066297E-4, -0.003339449849186672, -0.008413049832193388, -0.0036837077673731577, 0.00412517407676205, -0.004317660389157632, 0.007310316070086426, -7.229041625072972E-4, 0.007723183448736866, -0.001471322586035563, -0.006491158700858553, 0.008819847186613413, 0.003962855734344985, -0.002538494403577513, -0.0028146956141831144, -0.004309429095074948, -3.851575246598157E-4, 0.010385635050220622, -0.0027985521204148727, -4.5085662148065036E-5, 
-0.0014321527883617415, -1.1867677999867333E-4, 0.002111092017407322, -0.004288060034418272, -0.0013393774311730845, 0.0038508566794916987, 8.461593081139856E-4, 0.0011847920379497937, -0.0022578384054617747, -0.0037156922044232488, -0.01362217388426264, 0.01979008684348729, 0.001654111857836445, 0.012464611558243632, 0.012549981449006332, -0.004183169997607668, -0.00207110027420438, 0.002774805917094151, -0.00139057085228463, -0.012967661695761813, 0.007726943146230445, 0.014793476969417598, 5.774061105007099E-4, -0.002164243972705056, 0.007293370681711368, -0.009110922200812234))"
2590273,5.0,True,2016-01-06,A2CFJ4DVIK5HJ2,B001V3O4XY,BAC,"This is a very nice, sturdy log rack cover that is working perfectly for keeping the elements out of our wood-burning stove logs. Good product at a great price. Very pleased with purchase.","This is a very nice, sturdy log rack cover that is working perfectly ...",0,"List(nice, sturdi, log, rack, cover, work, perfectli, keep, element, wood, burn, stove, log, good, product, great, price, pleas, purchas)","Map(vectorType -> dense, length -> 100, values -> List(-0.0052582100248209345, 0.015185823296441843, 0.014696010069823578, -0.01708564116913629, -9.499364509553599E-4, 0.01238949245185052, -0.0034403539918314075, 0.013803222663945664, -0.013840612232390987, -0.010715088733520946, 0.008001337232264248, -0.009604961599076265, 0.0028627123101614416, -0.007725331233814359, 0.004106263948702498, -0.011228687166677492, -0.019076155253538958, -0.021106017508396975, 0.0038164094163987195, 5.070941789247291E-4, 0.0048456031914898435, -0.012832145954139137, 3.252748712456148E-4, -0.007527427037099474, 0.005504776993276257, 0.001517616976726506, -0.007795990407957058, -0.0037658014533869725, -0.012029529083520174, -1.4989692595248158E-4, -0.013955972353486638, 0.005603235389571637, -0.005422347513223557, 0.0030207139390863867, -0.005429366444188513, 0.017764366501452106, -0.00463003029213532, 0.011977378658852294, 0.010935747493548613, -0.014267333072463148, -1.0027961597140682E-4, -0.013597417794364063, -9.007857856647062E-4, -0.020115737117042665, 0.01380772783273929, 0.003385874758302969, -0.009060789107982265, 2.005556720848146E-4, -0.0053542928408684305, -0.005193065913198025, 0.009115525362032808, 0.010153515111213844, 0.00426518162222285, -4.937033518217504E-4, 4.2487451221524293E-4, -0.002512092074060714, -0.012982559476145787, -0.007918315633249125, 0.0033668070988680577, -0.003028704901225865, 0.009086357830950107, -0.0011649489874559404, 0.012103855903995663, -4.431534898206959E-4, -0.010026513758164487, 
0.013332236104791886, 0.006029893968891548, -0.004780390691992483, -0.002842001760942175, -0.005400063034980312, 3.7089895187435965E-4, 0.014788488122193437, -8.220176318182462E-4, 8.124809552866376E-4, -0.0029526843971229696, 0.002766025430326791, 0.003255660158557523, -0.005527960338727816, -0.0013506268059197617, 0.005091032598437251, 0.001002421362117227, 9.248909068686004E-4, -0.0031734519993494212, -0.006490429226112993, -0.021650227718055248, 0.029492872497557023, 0.005607756026285259, 0.018584528986952807, 0.016758566550714404, -0.008441375470475146, -0.003439312093082423, 0.0025730937777552754, -2.6698838720269695E-4, -0.020615835281971254, 0.008614833503471394, 0.020018803421407938, 0.0037868063817241863, -0.004762305415505053, 0.012027584432967399, -0.011236106051671269))"
2502336,5.0,True,2015-07-31,A2FS5AD2V6S52E,B001L1BTDY,Dennis Novak,"That what I call bag, nice and big !!!!! Dennis PA.",nice and big,0,"List(call, bag, nice, big, denni, pa)","Map(vectorType -> dense, length -> 100, values -> List(-3.498127528776725E-4, 0.0027160790050402284, 0.0023851288327326374, -0.00373235740698874, -8.991240174509585E-4, 0.002191860364594807, -0.0015016435839546223, 0.003121919309099515, -0.004075213568285108, -6.988509655153999E-4, 0.0023187011247500777, -5.92647857653598E-4, 0.00143461250506031, -0.00117327148715655, 1.8154806457459927E-4, -0.004102120098347465, -0.003579149992826084, -0.003454017297675212, 0.0011353784163172045, -0.0015971853863447905, 5.594406878420461E-4, -0.0038986181025393307, 0.002926421584561467, -0.0014806703741972644, 0.0011505575093906373, -6.707197365661461E-4, 4.2928417678922415E-4, 4.1553297584565974E-4, -0.004223531286697835, -4.654253910606106E-4, -0.0029454473018025356, 0.0018543951834241548, -0.0019091439996069917, 0.0019464726598622897, -0.0022294962933907905, 0.0027244382848342257, 0.001189556161989458, 0.0044717964095373946, 6.857726257294416E-4, -0.002920146061417957, 2.5056618081483367E-4, -0.003220349259208888, 5.566916370298713E-4, -0.004509679352243741, 0.003376623518609752, 2.2123327168325582E-4, -0.002752794011030346, 8.47843747275571E-4, -0.0021764231302465, -0.0019388473592698574, -6.43237338711818E-6, 0.0019227598483363786, 0.0010141653086369236, -0.0011319619176598885, 0.001337354149048527, -0.0012530776439234614, -0.002888342210402091, -3.909242805093527E-4, 0.001421007385943085, -8.348891666779915E-4, 0.0025840082865518825, 5.044629991364975E-5, 4.2497195924321807E-4, -4.0323660747768975E-4, -0.001825234154239297, 0.0033683846704661846, 0.003237941612799962, -0.0014143884667040159, -0.0012885421650328983, -2.7257995679974556E-4, -2.3959570777757713E-4, 0.00410291668958962, -5.15273413232838E-4, -1.2067813562074055E-4, -3.6828084072719014E-4, 9.742074374419947E-4, 0.0014688369507590928, 
-0.0014521273357483246, -0.0010709815590720002, 6.981049276267488E-4, -5.833456816617399E-4, 0.0019073988078162074, -0.0026350323072013753, -0.0025587867324550944, -0.003889940058191617, 0.006106061938529213, 0.0025267568416893482, 0.003877013533686598, 0.003548853875448306, -0.0026025797706097364, -0.0011283469502814114, 0.002128831848191718, -0.0013189545425120741, -0.004159506953631838, 0.00262910903741916, 0.00572364404797554, 0.0012519511898669102, -0.0022655811723476895, 0.0012131716745595136, -0.0021147091950600343))"
18491,5.0,True,2017-09-24,A3A76MFPUD07RC,B00002NC6F,thinkering,Guess what? couldn't find that size at Sears where i bought the vacuum cleaner. at Amazon i got a package with two belts and it was less than the one non-matching belt from Sears that i bought first. now i have one to spare. it was a cinch putting the belt in and it's working just fine.,It fits!,0,"List(guess, couldn, find, size, sear, bought, vacuum, cleaner, amazon, got, packag, two, belt, less, one, non, match, belt, sear, bought, first, one, spare, cinch, put, belt, work, fine)","Map(vectorType -> dense, length -> 100, values -> List(-0.00439759784785565, 0.014194089906855618, 0.012359696999608006, -0.01584870897931978, 3.08526077008407E-4, 0.012093805658098842, -0.004122280247559371, 0.01263949485395902, -0.012686180453913818, -0.011492905369128233, 0.006090988423758452, -0.01013991262880154, 0.0028344763081154917, -0.00920637569340345, 0.003808416380447202, -0.010533506707620939, -0.017680449053711658, -0.02140369275418509, 0.003684026750436585, 4.35288628588231E-4, 0.005097242845554969, -0.01121767093094864, -3.833077605252453E-4, -0.006204750010510907, 0.005133735419284286, 7.1197551109695E-4, -0.006892373598280496, -0.0025504369612982763, -0.010334267725868682, 4.197946441958525E-4, -0.014954323244247851, 0.007158099091611803, -0.005326925473387486, 0.0031358153064502403, -0.004715744771861604, 0.015812292237699564, -0.004883795461085225, 0.010554258208555567, 0.011514407337277328, -0.014516541193838097, 0.0015467824102545688, -0.013012005668965035, 6.882600116244118E-4, -0.018625663073700185, 0.01308187626169196, 0.002767338099407165, -0.007136170091273795, 0.0011876489833022267, -0.005377860602623383, -0.005217186283386711, 0.008526537426015628, 0.011938664214020327, 0.002597733260440041, 6.454567880739757E-4, 3.24959032046276E-4, -0.0024101606229253642, -0.012510340569341288, -0.006301677490617813, 0.002989436196263081, -0.004142100367711724, 0.007726480452609913, -0.002037721679828662, 
0.010787783415123287, -9.491417004028335E-4, -0.010048306688466775, 0.012027924497877911, 0.0066215086927903545, -0.0046894551155024335, -0.0016568612154514994, -0.005050377369376032, 0.0016380743826240566, 0.014483981258568486, 7.957630857293095E-5, 9.030214340392766E-4, -0.004923949377760956, 0.00232064225045698, 0.0029915243981771967, -0.004979811109868543, -0.001194570741582928, 0.004349932105729489, 0.0024272366255380412, 0.001246613764253977, -0.0035539790281161132, -0.005485362917949844, -0.021126286983157373, 0.02871548437646457, 0.005935655515973589, 0.0181270590650716, 0.015630914372325475, -0.007447603482952607, -0.0035615694484606914, 0.0026206238981103525, 0.0015364960989765156, -0.01886340423620173, 0.008492590659963233, 0.018910387838591954, 0.003961185893526168, -0.0033345648532433964, 0.010674197675793298, -0.010394952675726796))"
801755,5.0,True,2008-07-02,A2TVH2OBNXYXHV,B00063QBDQ,Evan Jacobs,"This is an amazing cutting board. It's HUGE and solid. It's my new favorite thing in the kitchen. If you have the space, I highly recommend getting one.",Amazing,1,"List(amaz, cut, board, huge, solid, new, favorit, thing, kitchen, space, highli, recommend, get, one)","Map(vectorType -> dense, length -> 100, values -> List(-0.0069226056428825745, 0.022018499855351235, 0.01804617655164163, -0.02280123927630484, 0.0016814443619555925, 0.016870490449946374, -0.004967306025459298, 0.016608377918601036, -0.017001252554889237, -0.013495002133173069, 0.007585905404994264, -0.012747038754501512, 0.004386274318676442, -0.009478888108528085, 0.005448283353221736, -0.014708508928639015, -0.02599918938774083, -0.03001790279189923, 0.005745633505284786, 9.100726633083208E-4, 0.006722157588228583, -0.01519942019201283, 2.9192756794925245E-4, -0.008801026007859036, 0.009540237893816084, 9.520040829167036E-4, -0.01044278493749776, -0.0024452746479905075, -0.015907984393249665, 0.001930912399464952, -0.02141244287070419, 0.006406187300204432, -0.008214951825461217, 0.0049991344234773085, -0.0052250846139421415, 0.022869368948574574, -0.006693484186793544, 0.01527461177569681, 0.014373130372924996, -0.02017032785806805, 0.0014205458324535616, -0.01935178942845336, 4.001674034433173E-4, -0.02617579464068902, 0.016786760900036564, 0.005098421060081039, -0.011437435967049428, 0.00196165987290442, -0.008729179223467196, -0.007274309713726065, 0.012553072750701435, 0.014635811293763772, 0.0037517147471329993, -2.9198963810423654E-4, 6.188615765755197E-4, -0.0022949018166400492, -0.017723990005574054, -0.008119996919828865, 0.0055781032507573915, -0.005323908017349562, 0.010619813501502253, -0.00409552310260811, 0.016419008823244697, 8.329286356456578E-5, -0.013753728408898625, 0.01936575105147702, 0.009778002052501376, -0.007447403656052691, -0.0018148622371622228, -0.008490128259706709, 0.0011769099707765105, 
0.020327335107140243, 4.1988536083538614E-4, -7.692216022405773E-4, -0.007398146395904145, 0.004338691312503734, 0.0024970196328857647, -0.00772839185083285, -0.0029167404614521986, 0.005194361895389322, 0.0025989853020291775, 0.002388872526353225, -0.005072329135145992, -0.006671546525987131, -0.030858472771277384, 0.038515858668168736, 0.007579056799711127, 0.024145252254259373, 0.02156302793550172, -0.01146083277749962, -0.004259303287004254, 0.002040284702421299, 0.003216470163481842, -0.026576507919734076, 0.01071495802274772, 0.02741913553992552, 0.006554849462450615, -0.0061748401162081525, 0.014918988728563167, -0.01571295699769897))"
1763354,1.0,True,2015-11-30,A1DMT20M71IQSM,0061043575,Dawn,Gave it to the library.,One Star,0,"List(gave, librari)","Map(vectorType -> dense, length -> 100, values -> List(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0))"


In [0]:
# Random Forest

from pyspark.ml import Pipeline
from pyspark.ml.feature import IndexToString, StringIndexer
from pyspark.ml.classification import RandomForestClassifier

labelIndexer = StringIndexer(inputCol="overall", outputCol="indexedScore").fit(prediction_doc2vec)
rf = RandomForestClassifier(labelCol="indexedScore", featuresCol="doc2vec", numTrees=40)
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)

pipeline = Pipeline(stages=[labelIndexer, rf, labelConverter])

(trainingData, testData) = prediction_doc2vec.randomSplit([0.7, 0.3])

rf_model = pipeline.fit(trainingData)
predictions = rf_model.transform(testData)

from pyspark.ml.evaluation import BinaryClassificationEvaluator

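# Note: BinaryClassificationEvaluator assumes a two-class label. indexedScore is indexed from "overall",
# so this AUC is only meaningful if "overall" holds exactly two classes at this point.
evaluator = BinaryClassificationEvaluator(labelCol="indexedScore", metricName="areaUnderROC")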
auc = evaluator.evaluate(predictions)
print("AUC = %g" % auc)

AUC = 0.663221
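
As a rough intuition for the areaUnderROC value above: AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below is plain Python, not Spark code; `pairwise_auc` and the toy labels and scores are hypothetical and exist only for illustration.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Hypothetical helper for illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up): two positives, two negatives.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]

print(pairwise_auc(labels, scores))  # 0.75: three of the four pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives a scale for reading the numbers in this notebook.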


In [0]:
# Performance evaluation with 5-fold cross validation

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

paramGrid = ParamGridBuilder().build()
cv = CrossValidator(estimator=pipeline, evaluator=evaluator, estimatorParamMaps=paramGrid, numFolds=5)
cvModel = cv.fit(prediction_doc2vec)

print("Average AUC = %g" % cvModel.avgMetrics[0])

Average AUC = 0.740104
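
The cross-validation step above can be pictured as follows: the rows are split into k folds, each fold is held out once while the model is fit on the remaining folds, and the k scores are averaged. With an empty parameter grid there is only one candidate model, so `avgMetrics[0]` is that average. This is a plain-Python sketch under simplifying assumptions; `k_fold_indices` and `toy_metric` are hypothetical helpers, and Spark assigns rows to folds randomly, so its fold sizes are only approximately equal.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every row is held out exactly once."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test


def toy_metric(train_idx, test_idx):
    # Hypothetical placeholder for "fit on train, evaluate on test".
    return len(train_idx) / (len(train_idx) + len(test_idx))


scores = [toy_metric(train, test) for train, test in k_fold_indices(n_rows=100, k=5)]
average = sum(scores) / len(scores)  # analogous to cvModel.avgMetrics[0]
print("per-fold scores:", scores)
print("average:", average)
```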


## Model Interpretation

In [0]:
# Extract bigrams and combine them with the unigrams

interpret_tfidf = prediction_df_sampled.select('*')

from pyspark.ml.feature import NGram
from pyspark.sql.functions import array_union

ngram = NGram(n = 2, inputCol="reviewWordCleaned", outputCol="reviewBigrams")
interpret_tfidf = ngram.transform(interpret_tfidf)

# array_union drops duplicate elements, so each distinct ngram appears at most once per review.
interpret_tfidf = interpret_tfidf.withColumn(
    "reviewNgrams",
    array_union(interpret_tfidf.reviewWordCleaned, interpret_tfidf.reviewBigrams))

In [0]:
# Compute TF-IDF with CountVectorizer (no hashing); cap the vocabulary at the 2^12 (4096) most frequent ngrams

from pyspark.ml.feature import CountVectorizer, IDF

tf = CountVectorizer(inputCol="reviewNgrams", outputCol='TF', minDF=2.0, vocabSize=2**12)
tf_model = tf.fit(interpret_tfidf)
tf_transformed = tf_model.transform(interpret_tfidf)
idf = IDF(minDocFreq=3, inputCol="TF", outputCol="TF-IDF")
idfModel = idf.fit(tf_transformed)
interpret_tfidf = idfModel.transform(tf_transformed)

In [0]:
# Fit a Random Forest on all the data, using the TF-IDF features built without hashing

from pyspark.ml import Pipeline
from pyspark.ml.feature import IndexToString, StringIndexer
from pyspark.ml.classification import RandomForestClassifier

labelIndexer = StringIndexer(inputCol="overall", outputCol="indexedScore").fit(interpret_tfidf)
rf = RandomForestClassifier(labelCol="indexedScore", featuresCol="TF-IDF", numTrees=40)
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)

pipeline = Pipeline(stages=[labelIndexer, rf, labelConverter])

rf_model = pipeline.fit(interpret_tfidf)

In [0]:
# Getting feature importance from the Random Forest model

# stages[-2] is the fitted Random Forest (pipeline order: labelIndexer, rf, labelConverter)
feature_importance = rf_model.stages[-2].featureImportances
print(feature_importance)

(1561,[0,1,3,4,6,7,8,10,11,12,15,19,22,26,28,30,32,33,34,35,36,39,47,48,50,58,59,61,62,63,67,72,74,76,77,78,80,81,83,85,93,98,99,101,104,105,106,107,121,132,136,139,141,143,145,152,156,160,162,171,174,177,183,189,193,195,198,199,203,212,218,219,222,230,232,235,243,245,249,250,258,270,276,281,282,283,300,301,306,312,318,325,327,331,341,343,345,348,352,357,363,365,371,376,378,379,393,394,402,404,406,410,413,416,418,419,420,426,442,444,447,458,463,464,468,471,476,477,478,479,480,486,499,504,507,510,513,518,523,547,554,556,560,562,576,578,586,591,598,604,606,609,623,625,627,630,631,635,637,640,643,646,651,657,661,670,679,686,689,690,691,696,699,707,713,720,727,729,730,738,741,746,766,798,799,803,805,824,835,837,838,840,847],[0.0054539613749843945,0.005181292143884781,0.0012461925025749633,0.0023203791406947575,0.0153756697928116,0.0020945244436419456,0.004705249313787477,0.006030517379756843,0.0037701439985555014,0.0030304903405075387,0.003072861203386884,0.0037265903867781495,0.0066125844

In [0]:
# Get the indices of the 20 most important features and their importance scores

import numpy as np
import pandas as pd

importance_array = feature_importance.toArray()
# argsort is ascending, so flip it to get the largest importances first
top20_indice = np.flip(np.argsort(importance_array))[:20].tolist()
top20_importance = [float(importance_array[index]) for index in top20_indice]

top20_df = spark.createDataFrame(pd.DataFrame(list(zip(top20_indice, top20_importance)), columns =['index', 'importance']))

display(top20_df)



index,importance
331,0.0308339670441592
670,0.028372773440812
803,0.0244587341930473
606,0.0244543718886642
847,0.0232003080516108
152,0.0210039022385398
345,0.0207911391671017
643,0.0177066770218534
199,0.0170031906363308
738,0.0154734986262583


In [0]:
# Create a map between each ngram and its index

from pyspark.sql.functions import explode, udf, col
from pyspark.sql.types import *

make_list_udf = udf(lambda col: [col], ArrayType(StringType()))
remove_list_udf = udf(lambda col: col[0], StringType())

def get_index(vector):
    # The TF vector of a single-ngram document has at most one non-zero entry.
    if len(vector.indices) == 0:
        return -1   # -1 marks an ngram that is not in the CountVectorizer vocabulary
    else:
        return int(vector.indices[0])
get_index_udf = udf(get_index, IntegerType())

ngram_index = interpret_tfidf.select(explode(interpret_tfidf.reviewNgrams).alias("reviewNgrams")).distinct() \
                             .withColumn("reviewNgrams", make_list_udf("reviewNgrams"))
ngram_index = tf_model.transform(ngram_index)
ngram_index = ngram_index.withColumn("reviewNgrams", remove_list_udf("reviewNgrams")) \
                         .withColumn("index", get_index_udf("TF")) \
                         .select("reviewNgrams", "index")

In [0]:
display(ngram_index.where(ngram_index.index > -1))

reviewNgrams,index
hope,152
stori,19
even,48
first book,1124
piec,98
hand,72
noth,499
wasn,450
allow,568
togeth,252


In [0]:
# Find the ngrams that map to the top 20 most important features

# Note: with HashingTF, hash collisions can map several ngrams to the same index. Those ngrams would share
# a single importance score, and there would be no way to separate their individual contributions.
# To avoid collisions (exactly one index per ngram), CountVectorizer was used instead of HashingTF for encoding.

import pyspark.sql.functions as f

top20_ngram = top20_df.join(ngram_index, on="index", how="left_outer")
display(top20_ngram.groupby("importance").agg(f.collect_list(top20_ngram.reviewNgrams).alias("reviewNgrams")).orderBy("importance", ascending=False))

importance,reviewNgrams
0.0308339670441592,List(return)
0.028372773440812,List(kept)
0.0244587341930473,List(front)
0.0244543718886642,List(junk)
0.0232003080516108,List(slide)
0.0210039022385398,List(hope)
0.0207911391671017,List(broke)
0.0177066770218534,List(com)
0.0170031906363308,List(cheap)
0.0154734986262583,List(crap)
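
The collision issue described in the comments above can be illustrated with a small plain-Python sketch. It is not Spark code: `zlib.crc32` is only a stand-in hash (it is not the hash function Spark uses), and the bucket count of 4 is chosen so that five terms are guaranteed to collide somewhere. The terms are taken from the output above.

```python
import zlib
from collections import defaultdict

# Stemmed tokens taken from the importance table above.
terms = ["return", "kept", "junk", "broke", "cheap"]

# Hashing-style encoder: index = hash(term) mod num_buckets.
# Assumption: crc32 is an illustrative stand-in, not Spark's hash.
num_buckets = 4
buckets = defaultdict(list)
for term in terms:
    bucket = zlib.crc32(term.encode("utf-8")) % num_buckets
    buckets[bucket].append(term)

# Vocabulary-style encoder: every distinct term gets its own index.
vocab_index = {term: i for i, term in enumerate(terms)}

# Buckets holding more than one term would share one importance score.
shared = {b: ts for b, ts in buckets.items() if len(ts) > 1}
print("buckets shared by several terms:", shared)
print("vocabulary indices:", vocab_index)
```

With more terms than buckets, at least one bucket must hold several terms, so an importance score attached to that bucket could not be attributed to any single term. The vocabulary mapping stays one-to-one, which is what makes the table above readable.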
