## Data Mining - MSDS 7331 - Thurs 6:30, Summer 2016
Team 3: Sal Melendez, Rahn Lieberman, Thomas Rogers

GitHub page: https://github.com/RahnL/DataScience-SMU/tree/master/DataMining

### Overview

Our team chose to use collaborative filtering to build a custom recommendation system (Option C), using Amazon Instant Video review data (http://jmcauley.ucsd.edu/data/amazon/).

#### Business Understanding

In digital commerce, engagement drives revenue and personalization drives “stickiness”. Amazon understood these concepts before most and, through its recommender system, used data science to personalize at a scale not previously possible. The distinction between mass marketing and personalized marketing is best measured by effectiveness. Mass marketing feels intrusive and often consists of tactics that are “pushed” to potential consumers. The flip side of the push is a “pull”: a similarly motivated message that, through personalization, feels almost helpful.

An effective and scalable recommendation system is a competitive differentiator in many different markets. Consider that at its core, Amazon is a fairly simple marketplace with buyers and sellers. What makes Amazon powerful, beyond the breadth of offerings available, is that its recommendation system brings insight to both sides of the buying and selling equation.

Understanding the importance of recommender systems in digital commerce requires a bit of context. It may be that an algorithm exists to perfectly recommend a product to a person at the right place at the right time, but without elements like an effective user interface or without consideration for the user experience, that recommendation never makes it to the site visitor. 

#### Data Understanding

The data for this effort is taken from Amazon product reviews and metadata, a collection of 142.8 million reviews spanning May 1996 to July 2014. Our analysis focuses specifically on the Amazon Instant Video reviews, which contain the following variables:
- asin – a unique product identifier, defined by Amazon as “A 10-character alphanumeric unique identifier assigned by Amazon.com and its partners for product identification within the Amazon organization.”
- helpful – a two-element list of numeric values, [helpful votes, total votes], recording how many of the people who voted on the review found it helpful. For example, [12, 12] means all 12 voters found the review helpful.
- overall – the star rating, a numeric value from 1 to 5 in whole increments.
- reviewText – free-form text in which the user writes a narrative review.
- reviewTime – the date the review was made, formatted as MM D, YYYY (for example, "05 3, 2014").
- reviewerID – a unique alphanumeric identifier for the person providing the review (13 or 14 characters in the sample below).
- reviewerName – a user-defined alphanumeric name that appears as the reviewer in the web application.
- summary – a short alphanumeric headline for the review, written by the reviewer.
- unixReviewTime – the time of the review as a ten-digit Unix timestamp (seconds since January 1, 1970 UTC).
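
To make the field descriptions above concrete, here is a minimal sketch that decodes one review record with the Python standard library only. The helper name `parse_review` and the derived keys (`helpful_ratio`, `review_date`, `review_utc`) are hypothetical and are not part of the dataset or of GraphLab. The sketch assumes `helpful` holds [helpful votes, total votes].

```python
import json
from datetime import datetime, timedelta

# One record in the same shape as a line of the JSON-lines file.
# The values come from the first row of the sample output in this notebook.
line = ('{"asin": "B000H00VBQ", "helpful": [0, 0], "overall": 2.0, '
        '"reviewTime": "05 3, 2014", "reviewerID": "A11N155CW1UV02", '
        '"unixReviewTime": 1399075200}')


def parse_review(raw):
    """Decode one JSON line and derive a few convenience fields (hypothetical helper)."""
    rec = json.loads(raw)
    helpful_votes, total_votes = rec["helpful"]
    # Assumption: helpful is [helpful votes, total votes]; guard against zero votes.
    rec["helpful_ratio"] = (float(helpful_votes) / total_votes) if total_votes else None
    # reviewTime looks like "MM D, YYYY"; strptime's %d accepts a non-padded day.
    rec["review_date"] = datetime.strptime(rec["reviewTime"], "%m %d, %Y").date()
    # unixReviewTime is seconds since 1970-01-01 UTC.
    rec["review_utc"] = datetime(1970, 1, 1) + timedelta(seconds=rec["unixReviewTime"])
    return rec


review = parse_review(line)
```

For this record, `reviewTime` and `unixReviewTime` decode to the same calendar day, which suggests the two fields carry the same information at day resolution.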

In [3]:
import graphlab as gl
# set canvas to show sframes and sgraphs in ipython notebook
gl.canvas.set_target('ipynb')
import matplotlib.pyplot as plt
%matplotlib inline

In [6]:
# Load the data.  It's already been downloaded from the site listed above
amazonfile = 'C:/Users/trogers/Documents/GitHub/DataScience-SMU/DataMining/data/Amazon_Instant_Video_5.json'
# amazonfile = './data/Amazon_Instant_Video_5.json'
sf = gl.SFrame.read_json(amazonfile,orient='lines')
sf.head()

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[dict]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


asin,helpful,overall,reviewText,reviewTime,reviewerID
B000H00VBQ,"[0, 0]",2.0,I had big expectations because I love English ...,"05 3, 2014",A11N155CW1UV02
B000H00VBQ,"[0, 0]",5.0,I highly recommend this series. It is a must for ...,"09 3, 2012",A3BC8O2KCL29V2
B000H00VBQ,"[0, 1]",1.0,This one is a real snoozer. Don't believe ...,"10 16, 2013",A60D5HQFOTSOM
B000H00VBQ,"[0, 0]",4.0,Mysteries are interesting. The ten ...,"10 30, 2013",A1RJPIGRSNX4PW
B000H00VBQ,"[1, 1]",5.0,"This show always is excellent, as far as ...","02 11, 2009",A16XRPF40679KG
B000H00VBQ,"[12, 12]",5.0,I discovered this series quite by accident. Ha ...,"10 11, 2011",A1POFVVXUZR3IQ
B000H0X79O,"[0, 0]",3.0,"It beats watching a blank screen. However, I just ...","10 15, 2013",A1PG2VV4W1WRPL
B000H0X79O,"[0, 0]",3.0,"There are many episodes in this series, so I ...","12 29, 2013",ATASGS8HZHGIB
B000H0X79O,"[0, 0]",5.0,This is the best of the best comedy Stand-up. ...,"02 26, 2014",A3RXD7Z44T9DHW
B000H0X79O,"[0, 0]",3.0,Not bad. Didn't know any of the comedians but ...,"04 2, 2014",AUX8EUBNTHIIU

reviewerName,summary,unixReviewTime
AdrianaM,A little bit boring for me ...,1399075200
Carol T,Excellent Grown Up TV,1346630400
"Daniel Cooper ""dancoopermedia"" ...",Way too boring for me,1381881600
"J. Kaplan ""JJ""",Robson Green is mesmerizing ...,1383091200
Michael Dobey,Robson green and great writing ...,1234310400
Z Hayes,I purchased the series via streaming and loved ...,1318291200
"Jimmy C. Saunders ""Papa Smurf"" ...",It takes up your time.,1381795200
JohnnyC,A reasonable way to kill a few minutes ...,1388275200
Kansas,kansas001,1393372800
Louis V. Borsellino,Entertaining Comedy,1396396800


In [18]:
# SFrame has no describe() method (calling it raises AttributeError);
# summarize a single column with SArray.sketch_summary() instead.
sf['overall'].sketch_summary()

In [7]:
(train_set, test_set) = sf.random_split(0.8, seed=1)

Create an item-similarity collaborative filter from the training set.

In [8]:
item_sim_model = gl.item_similarity_recommender.create(train_set, 'reviewerID', 'asin')
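
The model above is created without a rating column, so it can only use which reviewers reviewed which items. As a rough illustration of that idea, here is a minimal sketch of item-to-item Jaccard similarity in plain Python. Two points are assumptions: that Jaccard is the similarity GraphLab Create applies when no target is given, and that the library's internal computation resembles this at all. The function name and toy IDs are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Toy (reviewerID, asin) pairs; the IDs are invented for illustration.
interactions = [
    ("u1", "A"), ("u1", "B"),
    ("u2", "A"), ("u2", "B"), ("u2", "C"),
    ("u3", "B"), ("u3", "C"),
]


def jaccard_item_similarity(pairs):
    """Return {(item_i, item_j): Jaccard similarity of their reviewer sets}."""
    users_by_item = defaultdict(set)
    for user, item in pairs:
        users_by_item[item].add(user)
    sims = {}
    for i, j in combinations(sorted(users_by_item), 2):
        shared = len(users_by_item[i] & users_by_item[j])   # reviewers of both items
        total = len(users_by_item[i] | users_by_item[j])    # reviewers of either item
        sims[(i, j)] = float(shared) / total
    return sims


sims = jaccard_item_similarity(interactions)
```

Items reviewed by largely the same people score close to 1, and items with no reviewers in common score 0. A recommender can then suggest the items most similar to those a user has already reviewed.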

Create a popularity model from the training set.

In [9]:
popularity_model = gl.popularity_recommender.create(train_set, 'reviewerID', 'asin')
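
A popularity model ignores personal taste and ranks items by how often they appear. The sketch below shows that idea in plain Python, under the assumption that popularity means the number of reviews per item when no rating column is supplied. `popularity_recommend` and the toy data are hypothetical and do not reproduce GraphLab's implementation.

```python
from collections import Counter

# Toy (reviewerID, asin) pairs; the IDs are invented for illustration.
interactions = [
    ("u1", "A"), ("u1", "B"),
    ("u2", "A"), ("u2", "B"), ("u2", "C"),
    ("u3", "B"), ("u3", "D"),
]


def popularity_recommend(pairs, user, k=2):
    """Recommend the k most-reviewed items that `user` has not reviewed yet."""
    counts = Counter(item for _, item in pairs)
    seen = set(item for u, item in pairs if u == user)
    # Sort by descending review count, then by item ID so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked if item not in seen][:k]


recs_u1 = popularity_recommend(interactions, "u1")
recs_u3 = popularity_recommend(interactions, "u3")
```

Apart from the items each user has already reviewed, every user receives the same list. That makes this kind of model a natural point of comparison for a personalized recommender.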

Compare the two models on the test set, using a 10% sample of users.

In [10]:
result = gl.recommender.util.compare_models(test_set, [popularity_model, item_sim_model],
                                            user_sample=.1, skip_set=train_set)

compare_models: using 388 users to estimate model performance
PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+-----------------+-----------------+
| cutoff |  mean_precision |   mean_recall   |
+--------+-----------------+-----------------+
|   1    | 0.0309278350515 | 0.0182560137457 |
|   2    |  0.020618556701 | 0.0227663230241 |
|   3    | 0.0214776632302 | 0.0397336769759 |
|   4    | 0.0251288659794 | 0.0659364261168 |
|   5    | 0.0237113402062 | 0.0762457044674 |
|   6    | 0.0231958762887 | 0.0856958762887 |
|   7    |  0.020618556701 | 0.0895618556701 |
|   8    | 0.0196520618557 | 0.0960051546392 |
|   9    | 0.0194730813288 |  0.111726804124 |
|   10   | 0.0185567010309 |  0.118170103093 |
+--------+-----------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+-----------------+-----------------+
| cutoff |  mean_precision |   mean_recall   |
+----
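
The tables above report precision and recall at each cutoff, averaged over the sampled users. As a reading aid, here is a minimal sketch of how both metrics can be computed for a single user at cutoff k. The function name and toy data are hypothetical, and the library may differ in details such as how it treats users with no held-out items.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations for one user.

    recommended: ranked list of item IDs, best first
    relevant:    set of item IDs the user actually interacted with in the test set
    """
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = float(hits) / k                       # share of the k slots that were hits
    recall = float(hits) / len(relevant) if relevant else 0.0  # share of relevant items found
    return precision, recall


recommended = ["A", "B", "C", "D", "E"]
relevant = {"B", "E", "Z"}
results = {k: precision_recall_at_k(recommended, relevant, k) for k in (1, 3, 5)}
```

Averaging these per-user values gives the `mean_precision` and `mean_recall` columns. For example, with 388 sampled users, a mean precision of 0.0309 at cutoff 1 corresponds to 12 users whose top recommendation was a hit (12 / 388).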

In [11]:

K = 10
users = gl.SArray(sf['reviewerID'].unique().head(100))

recs = item_sim_model.recommend(users=users, k=K)
recs.head()

reviewerID,asin,score,rank
A2PO6BB2VDMF1T,B00944CJRK,0.0293227036794,1
A2PO6BB2VDMF1T,B004ZISVRW,0.0286416808764,2
A2PO6BB2VDMF1T,B007G5GMMC,0.0263884663582,3
A2PO6BB2VDMF1T,B007PYEWZ8,0.0257215599219,4
A2PO6BB2VDMF1T,B00D844FUQ,0.0241798857848,5
A2PO6BB2VDMF1T,B00DNUF7KW,0.023430476586,6
A2PO6BB2VDMF1T,B003XU02QG,0.0232643187046,7
A2PO6BB2VDMF1T,B00C2UG118,0.022988508145,8
A2PO6BB2VDMF1T,B00DTOYIIE,0.0228959520658,9
A2PO6BB2VDMF1T,B00JFQ96F0,0.0217086871465,10


In [12]:
from IPython.display import display
from IPython.display import Image

gl.canvas.set_target('ipynb')


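# Item-similarity model that uses the star rating ('overall') as the target,
# with cosine similarity, storing the 5 most similar items for each item.
item_item = gl.recommender.item_similarity_recommender.create(
    train_set,
    user_id="reviewerID",
    item_id="asin",
    target="overall",
    only_top_k=5,
    similarity_type="cosine")

# evaluate() returns a dict holding precision/recall and RMSE results
rmse_results = item_item.evaluate(test_set)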


Precision and recall summary statistics by cutoff
+--------+-----------------+-----------------+
| cutoff |  mean_precision |   mean_recall   |
+--------+-----------------+-----------------+
|   1    |  0.137371134021 | 0.0873629363191 |
|   2    |  0.115850515464 |  0.140872907162 |
|   3    | 0.0954467353952 |  0.170680333412 |
|   4    | 0.0818298969072 |  0.194639464487 |
|   5    | 0.0713917525773 |  0.210015844265 |
|   6    | 0.0637027491409 |  0.223516703371 |
|   7    | 0.0578055964654 |  0.236031747033 |
|   8    | 0.0525128865979 |  0.244496810034 |
|   9    | 0.0484822451317 |  0.253082205757 |
|   10   | 0.0451288659794 |  0.261750590637 |
+--------+-----------------+-----------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 4.2492408971181685)

Per User RMSE (best)
+----------------+-------+------------------+
|   reviewerID   | count |       rmse       |
+----------------+-------+------------------+
| A1HSNJH5EFKR30 |   1   | 0.00615993142128 |
+----------------+---
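
The RMSE figures above compare predicted scores with the actual star ratings in the test set. The sketch below shows the calculation itself in plain Python. The function is a generic illustration, not the library's code.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / float(len(squared_errors)))


# Predictions that sit near zero while the true ratings are 3 and 4 stars
# give an error of the same magnitude as the ratings themselves.
error_far = rmse([0.0, 0.0], [3.0, 4.0])
error_exact = rmse([1.0, 2.0], [1.0, 2.0])
```

An overall RMSE of about 4.25 is large for star ratings. One possible explanation, not verified here, is that the item-similarity model's scores are not on the same scale as the ratings, in which case the RMSE would mostly reflect the size of the ratings themselves.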

In [13]:
print rmse_results.viewkeys()
print rmse_results['rmse_by_item']

dict_keys(['rmse_by_user', 'precision_recall_overall', 'rmse_by_item', 'precision_recall_by_user', 'rmse_overall'])
+------------+-------+---------------+
|    asin    | count |      rmse     |
+------------+-------+---------------+
| B00H2E66H8 |   2   |      4.0      |
| B00F2C1RPS |   10  | 4.62601340249 |
| B002JJC6E8 |   1   |      2.0      |
| B00CB6SU5I |   3   | 3.02650053314 |
| B007ZU23GC |   1   |      4.0      |
| B001QDKK3W |   15  | 4.45024402385 |
| B0082JW05O |   1   |      2.0      |
| B0015TFUSC |   4   | 4.28364438539 |
| B007O2TG6Q |   4   | 4.18330013267 |
| B00EIQJH8I |   2   |      5.0      |
+------------+-------+---------------+
[1470 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [35]:

rmse_results['rmse_by_user']

reviewerID,count,rmse
A34D06JL7LC6MU,1,4.0
A2VQ08XUBNDU88,2,4.52769256907
A1A6ADR8JG12KQ,4,4.38662273144
AKL6E6FRY6CHB,1,5.0
A2WOH395IHGS0T,2,4.8094992798
A2K38LTTVICB2I,1,5.0
A2LNXVZ63VT91L,2,4.52769256907
A1X8IR1ZE3I77A,1,5.0
AZQJGDWARL3RR,1,1.0
ALI6HO0QLZZ8C,3,4.50293363449


References:
* https://github.com/turi-code/tutorials/blob/master/notebooks/recsys_rank_10K_song.ipynb
* https://github.com/turi-code/tutorials/blob/master/notebooks/five_line_recommender.ipynb
* J. McAuley, C. Targett, J. Shi, A. van den Hengel. Image-based recommendations on styles and substitutes. SIGIR, 2015.
* J. McAuley, R. Pandey, J. Leskovec. Inferring networks of substitutable and complementary products. Knowledge Discovery and Data Mining, 2015.
* Dataset: http://jmcauley.ucsd.edu/data/amazon/