## PySpark recommendation engine

Have noted with Vidyut that currently we have to build method-pool instead of rules. (Notes are in separated email.)

* Here we illustrate the PySpark env with CF(ALS) and BPR using our own data

In [1]:
# env
from pyspark import SparkConf, SparkContext 

# methods
from pyspark.mllib.recommendation import ALS, Rating
from bpr_spark.bpr import bprMF

# data processing
import numpy as np
import pandas as pd
# result processing
import math
import heapq

## 0. Data Preparation

* For this demo, the data is prepared, hence in the future when real data comes, this module serves as the data cleansing and data wrangling accordingly.

## 1. Learning

### Load  data
This fuction translates the data into RDD rating format.

In [11]:
# _File_name = '/Users/ito/venv/pyspark-rec/Amazon_videogame_data'
_File_name = '/Users/ito/venv/pyspark-rec/CG-Tops/Tops_user-item_data'

In [13]:
def get_rating(str):  
    arr = str.split('\t') 
    user_id = int(arr[0])  
    movie_id = int(arr[1])  
    user_rating = float(arr[2])  
    return Rating(user_id, movie_id, user_rating)
try:
    sc.stop()
except:
    pass

Set Spark Context (cannot be reset once set well).

In [14]:
conf = SparkConf().setMaster('local').setAppName('RecoEng').set("spark.executor.memory", "8g")  
sc = SparkContext(conf=conf)
data = sc.textFile(_File_name)  
data.top(3)

[u'999\t7071\t1\t736433', u'999\t7070\t1\t736433', u'999\t6951\t1\t736485']

* For mllib.recommendation, there is already data structure running on it called 'Rating'

In [15]:
ratings = data.map(get_rating) 
ratings.top(3)

[Rating(user=9204, product=43518, rating=1.0),
 Rating(user=9204, product=43392, rating=2.0),
 Rating(user=9204, product=43378, rating=1.0)]

In [16]:
split = [0.8, 0.2]
ratings_cf_train, ratings_cf_test = ratings.randomSplit(split, seed = 225)
ratings2 = data.map(lambda line: line.split("\t")).map(lambda x: map(int, x[:2]))
ratings_bpr_train, ratings_bpr_test = ratings2.randomSplit(split, seed = 25)

### I. Collaborative Filtering 
* using Alternating Least Squares (ALS) optimization
* call java function "trainALSModel"

In [17]:
def CollaborativeFiltering(ratings, rank = 5, iterations = 20):
    model = ALS.train(ratings, rank, iterations)
    return model

In [61]:
rank = 10
iterations = 20

In [62]:
%%time
CFmodel = CollaborativeFiltering(ratings_cf_train, rank, iterations)

CPU times: user 13.7 ms, sys: 5.1 ms, total: 18.8 ms
Wall time: 29.9 s


In [63]:
# %%time
# rank = 10  
# iterations = 5    
# ALSmodel = ALS.train(ratings, rank, iterations)

The scalability and efficiency of pySpark is going well, and even it is not CPU-wise multiprocessed. 

So far so good, means my next step is towards accuracy and realistic level.

In [64]:
%%time
CFmodel_ = CollaborativeFiltering(ratings, rank, iterations)

CPU times: user 11.5 ms, sys: 4.49 ms, total: 15.9 ms
Wall time: 31.4 s


In [65]:
# %%time
# top_howmany = 5
# rec_items_cf_ = []
# for i in range(1, 4):
#     rec_items_cf_.extend(CFmodel_.recommendProducts(i, top_howmany)) 
# rec_items_cf_    

CPU times: user 4.72 ms, sys: 2.1 ms, total: 6.82 ms
Wall time: 118 ms


### II. Bayesian Personalized Ranking (for PySpark)
* optimized using Stochastic Gradient Descent (SGD, cannot be as simply paralled as ALS)  
* have remained only user-item information yet (the basic BPR or BPR-1)
* return 2 matrices (user matrix with (#user,k), item matrix with (k,#item))

In [24]:
# conf = SparkConf().setMaster("local").setAppName("BPR").set("spark.executor.memory", "8g")
# sc = SparkContext(conf=conf)

In [25]:
# data = sc.textFile("/Users/ito/venv/pyspark-rec/CG-Tops/Tops_user-item_data")
# ratings = data.map(lambda line: line.split("\t")).map(lambda x: map(int, x[:2]))

In [26]:
ratings_bpr_train.top(2)

[[9204, 43518], [9204, 43392]]

In [27]:
ratings_bpr_train.count()

1596054

In [28]:
%%time
userMat, prodMat = bprMF(ratings_bpr_train, rank = 5, num_iter = 20, num_neg_samples = 10) 

* Building another version that can run faster (MR for map reduce) (to be continued..)

In [31]:
# from bpr_spark.distbpr import bpr_MF_MR
# userMat2, prodMat2 = bpr_MF_MR(ratings, 10, 10)
# userMat2, prodMat2 = bpr_MF_MR(ratings, 10, 10, nb_partitions = 8)

* Can try to see the result of a single user

In [225]:
userid = 10
rec_items_bpr = np.inner(userMat[userid].T, prodMat)
rec_items_bpr

array([ 1.67108379,  1.76053857,  1.78451347, ...,  0.71302791,
        1.19058677,  0.88390921])

#### Simple Result showcasing

* Simple result from Collaborative Filtering 

In [70]:
# print ('recommend items for userid %d:' % userid)
# [i for i in rec_items]

* Simple result from Bayesian Personalized Ranking

In [37]:
# print ('recommend items for userid %d:' % userid)
# for idx, i in enumerate(res[0]):
#     print 'user=%d, ' % userid + 'product=%d, ' % i  + 'rating=%f' % top_list[idx]

In [38]:
# Manually do the similar showcasing like "ALSmodel.recommendProducts(userid, top_howmany)" using heap sorting

# top_howmany = 5
# res = []
# top_list = heapq.nlargest(top_howmany,rec_items_bpr)
# res.append([i for i in range(len(rec_items_bpr)) if rec_items_bpr[i] in top_list])

# userid = 10
# rec_items = CFmodel.recommendProducts(userid, top_howmany)  

# print top_list
# pd.DataFrame(rec_items)

In [39]:
# userMat.shape
# num_users = ratings2.map(lambda x: x[0]).max()

In [40]:
%%time

rec_items_cf_ = []
rec_items_bpr_ = []
for i in range(1, 4):
    rec_items_cf_.extend(CFmodel_.recommendProducts(i, top_howmany)) 
    
    rec_items_bpr = np.inner(userMat[i].T, prodMat)
    top_list = heapq.nlargest(top_howmany,rec_items_bpr)
    
    idx = 0
    for j in range(len(rec_items_bpr)):
        if rec_items_bpr[j] in top_list:
            rec_items_bpr_.append((i, j, top_list[idx])) 
            idx += 1


CPU times: user 103 ms, sys: 3.91 ms, total: 106 ms
Wall time: 260 ms


### III. FM (in another jupyter sheet)

* FMonSpark_demo_a9a.ipynb

## 2. Predicting

### predicting on test_set for each method considered

In [71]:
top_howmany = 10

* Gathering recommending result

In [72]:
# ## testing
# %%time
# # CFmodel_.recommendProducts(1, top_howmany)
# userid = 10
# rec_items_bpr = np.inner(userMat[userid].T, prodMat)
# res = []
# top_list = heapq.nlargest(top_howmany,rec_items_bpr)
# res.append([i for i in range(len(rec_items_bpr)) if rec_items_bpr[i] in top_list])

* Blackboxing the recommendProducts from BPR's perspective (as a service)

In [73]:
def bpr_recommendProducts(userID, top_howmany = 5):
    rec_items_bpr = np.inner(userMat[userID].T, prodMat)
    res_4_userID = []
    top_list = heapq.nlargest(top_howmany,rec_items_bpr)

    idx = 0
    for j in range(len(rec_items_bpr)):
        if rec_items_bpr[j] in top_list:
            res_4_userID.append((userID, j, top_list[idx])) 
            idx += 1
    return res_4_userID

In [44]:
rec_items_cf = []
rec_items_bpr = []
for i in range(1, userMat.shape[0]):
    rec_items_cf.extend(CFmodel_.recommendProducts(i, top_howmany)) 
    rec_items_bpr.extend(bpr_recommendProducts(i, top_howmany)) 

In [45]:
userMat.shape[0]

9205

### voting scheme based on the results gathered

* Result regularizing

In [46]:
cf_res_df = pd.DataFrame(rec_items_cf)
cf_devider = np.mean(cf_res_df.groupby('user').rating.max())
cf_res_df['rating'] = cf_res_df.rating/cf_devider

In [47]:
bpr_res_df = pd.DataFrame(rec_items_bpr)
bpr_res_df.columns = cf_res_df.columns # convenient way for column name
bpr_devider = np.mean(bpr_res_df.groupby('user').rating.max())
bpr_res_df['rating'] = bpr_res_df.rating/bpr_devider

* Generating overall results

In [48]:
overall_df = pd.concat([cf_res_df,bpr_res_df])
overall_df_show = overall_df.groupby(['user','product']).rating.sum()

* Selecting the most possible recommendations (from mid of the top values -- avoid overfitting)

In [49]:
def taking_out_reco_pairs(input_pairs, num = 7, init_pos = 7):
    _pairs_items = input_pairs.sort_values(ascending=False).index
    if len(_pairs_items)<init_pos+num:
        return _pairs_items[-num:]
    return _pairs_items[init_pos:init_pos+num]

In [50]:
overall_result = []
for i in range(1,userMat.shape[0]):
    overall_result.append([i, list(taking_out_reco_pairs(overall_df_show[i]))])

## 3. Evaluation (Just showcase here)

* Hit Ratio evaluation 

In [54]:
def getHit(input_tuple):
    user = input_tuple[0]
    item = input_tuple[1]
    
    for reco_item in overall_result[user-1][1]:
        if reco_item == item:
            return 7
    return 0

* NDCG evaluation

In [55]:
def getNDCG(input_tuple):
    user = input_tuple[0]
    item = input_tuple[1]
    
    for idx, reco_item in enumerate(overall_result[user-1][1]):
        if reco_item == item:
            return 2*math.log(2)/math.log(idx+2)
    return 0

* Calculation of the result

In [209]:
ratings_test_len = ratings_bpr_test.count()
print 'Hit Ratio (%100): ' + str(ratings_bpr_test.map(getHit).sum()*1.0/ratings_test_len)
print 'NDCG: ' + str(ratings_bpr_test.map(getNDCG).sum()/ratings_test_len)

Hit Ratio (%100): 0.0796852546353
NDCG: 0.0124813550497


## 4. Further Demonstration

### Customer Informations

In [274]:
def strip_brace(input):
#     input = input.strip('{')
#     input = input.strip('}')
    return input[1:-1]
#     return input
cstmfile = '/Users/ito/venv/CG-RecoEng/data/og_data/Customer_Profile.csv'
customer = pd.read_csv(cstmfile, sep=',', engine='c')
customer.CustomerID = customer.CustomerID.map(strip_brace)

* Select useful information to display

In [275]:
customer_sorted = customer.sort_values(by='CustomerID')
customer_sorted = customer_sorted.reset_index()
customer_sub = customer_sorted[['CustomerID','DateofBirth','Nationality','Gender','MaritalStatus','NoofChildren']]
customer_sub.head()

Unnamed: 0,CustomerID,DateofBirth,Nationality,Gender,MaritalStatus,NoofChildren
0,0003B311-AEFF-494D-9AAD-9951FA5AC599,1981-06-26 00:00:00,Thai,F,S,0
1,0005861F-87C4-4CEB-80D4-61CFB8F34E47,1978-09-08 00:00:00,Unknown,,,0
2,000684B2-425D-4B3E-9F97-388B88C2A2AD,1981-01-01 00:00:00,Thai,F,S,0
3,000728AC-3ED5-479A-8DF9-3A94AA40665F,1978-10-16 00:00:00,Thai,F,S,0
4,0007C4F6-4BBE-4706-AD88-D25B79FC41E3,1984-07-29 00:00:00,Thai,F,M,0


* Age converter

In [276]:
from datetime import date

days_in_year = 365.2425 # for leap year
today = pd.to_datetime(date.today())

def age_converter(input):
    try:
        return int((today - pd.to_datetime(input)).days / days_in_year)
    except:
        return 'Unknown'
    
# (today - pd.to_datetime(customer_sub.DateofBirth[0])).days
# int((today - pd.to_datetime('2018-01-22')).days / days_in_year)

* Information Processing

In [277]:
customer_sub = customer_sub.fillna('Unknown')
customer_sub['Age'] = customer_sub.DateofBirth.map(age_converter)
customer_sub.drop(columns='DateofBirth',inplace=True)
customer_sub['Gender'] = customer_sub.Gender.replace(['F','M'],['Female','Male'])
customer_sub['MaritalStatus'] = customer_sub.MaritalStatus.replace(['S','M'],['Single','Married'])
customer_sub.head()

Unnamed: 0,CustomerID,Nationality,Gender,MaritalStatus,NoofChildren,Age
0,0003B311-AEFF-494D-9AAD-9951FA5AC599,Thai,Female,Single,0,36
1,0005861F-87C4-4CEB-80D4-61CFB8F34E47,Unknown,Unknown,Unknown,0,39
2,000684B2-425D-4B3E-9F97-388B88C2A2AD,Thai,Female,Single,0,37
3,000728AC-3ED5-479A-8DF9-3A94AA40665F,Thai,Female,Single,0,39
4,0007C4F6-4BBE-4706-AD88-D25B79FC41E3,Thai,Female,Married,0,33


* We can hence even analysis whether a customer is possibly divorced as they are not indicating "Marital Status" while having babies.

In [278]:
divorce_idx = customer_sub[( (customer_sub.MaritalStatus == 'Single') | (customer_sub.MaritalStatus == 'Unknown') )&(customer_sub.NoofChildren>0)].index
customer_sub['PossDivorced'] = pd.Series(['Yes' if i in divorce_idx else 'N. A.' for i in range(customer_sub.shape[0])])

In [279]:
customer_sub[customer_sub.PossDivorced == 'Yes'].head(10)

Unnamed: 0,CustomerID,Nationality,Gender,MaritalStatus,NoofChildren,Age,PossDivorced
39,00BCA414-03FB-4477-B1BF-881A4E410F4F,Thai,Female,Unknown,1,45,Yes
90,01FA26A3-217B-40D8-871A-9277B606A72B,Thai,Female,Unknown,3,40,Yes
121,02A13115-AA4A-463F-A71A-12B5787D636D,Thai,Female,Unknown,2,52,Yes
152,037BC223-D01C-49C2-B4C1-C208937BA6A1,Thai,Female,Unknown,3,65,Yes
153,037D991B-0DB1-43D3-B95C-D3A3307F54DF,Thai,Female,Unknown,1,37,Yes
169,03E44B2E-D860-487F-AF50-37E44519739D,Thai,Female,Unknown,3,59,Yes
197,0475AAC0-C8A7-410B-8930-FCA49C8E5213,Thai,Female,Unknown,2,43,Yes
213,04F2FED0-8C16-4BE7-919A-52F4C3F977A1,Thai,Female,Single,1,37,Yes
222,05239B2F-5076-4DE3-AA49-C7630256ACE7,Thai,Female,Single,1,31,Yes
228,0543EFAF-833F-4CBB-A738-A3DDD7AD8380,Thai,Female,Unknown,2,34,Yes


### Product Info

In [280]:
productfile = '/Users/ito/venv/CG-RecoEng/data/og_data/ProductMaster_Tops.csv'
product = pd.read_csv(productfile, sep=',', engine='c')
product.head()

Unnamed: 0,BUID,SKUID,PRODUCT_ENG_DESC,DEPT_ID,DEPT_ENG_DESC,SUBDEPT_ID,SUBDEPT_ENG_DESC,CLASS_ID,CLASS_ENG_DESC,CAT_ID,CAT_ENG_DESC,SUBCAT_ID,SUBCAT_ENG_DESC,BRAND_CODE,BRAND_ENG_NAME,SUPPLIER_CODE,SUPPLIER_ENG_NAME
0,150,8851984131769,Autoquip Takara Springe(C2,3,GM/Housewares,5,GM/Non FMCG,1,DIY/Non FMCG,2,Garden Accessories/Non FMCG,1,Gardening Tools,10703,AUTO QUIP,9811566,"AUTO QUIP LTD.,PART"
1,150,8851984131837,Autoquip Takara Roll Jasmin(C2,3,GM/Housewares,5,GM/Non FMCG,1,DIY/Non FMCG,2,Garden Accessories/Non FMCG,1,Gardening Tools,10703,AUTO QUIP,9811566,"AUTO QUIP LTD.,PART"
2,150,8851984131882,Autoquip DGT Scissors 2504(C2,3,GM/Housewares,5,GM/Non FMCG,1,DIY/Non FMCG,2,Garden Accessories/Non FMCG,1,Gardening Tools,10703,AUTO QUIP,9811566,"AUTO QUIP LTD.,PART"
3,150,8858653363421,Triamorn VCD Mixed 69Bht(C,3,GM/Housewares,5,GM/Non FMCG,5,Toys/Gifts/Non FMCG,3,Entertainment,1,Movie,9813,TRIAMORN,9810917,"TRI AMORN PLUS CO.,LTD."
4,150,8850633566242,Seagull Marathon Fry Pan 24cm,3,GM/Housewares,1,GM,2,Household,5,Cooking/Kitchen,8,Pans,2116,SEAGULL,9801221,THAI STAINLESS STEEL


In [281]:
product_sorted = product.sort_values(by='SKUID')
product_sorted = product_sorted.reset_index()
product_sub = product_sorted[['SKUID','PRODUCT_ENG_DESC']]
product_sub.head()

Unnamed: 0,SKUID,PRODUCT_ENG_DESC
0,11501,M&S Mint Candy 50g
1,11679,M&S Butter Choc Cookie225g
2,11693,M&S Chocolate Chunk Cookie225g
3,11723,M&S Pistachio AlmondCookie225g
4,11747,M&S Chocolate Chunk Cookie225g


### Result matching back

#### Generating dictionary for both user and item

In [282]:
user_matching_df = pd.read_csv('/Users/ito/venv/pyspark-rec/CG-Tops/user_matching_list', header=None, sep='\t')
item_matching_df = pd.read_csv('/Users/ito/venv/pyspark-rec/CG-Tops/item_matching_list', header=None, sep='\t')

In [284]:
user_matching_dict = user_matching_df.set_index([0]).to_dict()[1]
item_matching_dict = item_matching_df.set_index([0]).to_dict()[1]

#### Match back

In [None]:
customer_sub['CustomerID'] = customer_sub.CustomerID.map(user_matching_dict)
customer_sub = customer_sub[customer_sub.CustomerID > 0]
customer_sub['CustomerID'] = customer_sub.CustomerID.astype(int)

In [290]:
customer_sub.head()

Unnamed: 0,CustomerID,Nationality,Gender,MaritalStatus,NoofChildren,Age,PossDivorced
0,1,Thai,Female,Single,0,36,N. A.
1,2,Unknown,Unknown,Unknown,0,39,N. A.
2,3,Thai,Female,Single,0,37,N. A.
3,4,Thai,Female,Single,0,39,N. A.
5,5,Thai,Female,Single,0,30,N. A.


In [293]:
product_sub['SKUID'] = product_sub.SKUID.map(item_matching_dict)
product_dict = product_sub.set_index('SKUID').to_dict()['PRODUCT_ENG_DESC']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


* Get back to the real information

In [308]:
Reco_df = pd.DataFrame(overall_result)
Reco_df.columns = ['CustomerID','ItemIDs']
Reco_df.head()

Unnamed: 0,CustomerID,ItemIDs
0,1,"[11419, 25769, 27536, 37100, 24504, 10884, 11052]"
1,2,"[1226, 26894, 11890, 6945, 44039, 6951, 11052]"
2,3,"[30657, 108, 25403, 5555, 11052, 1226, 6945]"
3,4,"[6945, 6951, 10154, 11389, 11419, 37100, 39780]"
4,5,"[6951, 11389, 11419, 25769, 27536, 37100, 39898]"


* Merging (first stage)

In [332]:
Reco_fs_df = pd.merge(customer_sub, Reco_df, on='CustomerID')

In [335]:
Reco_fs_df.head()

Unnamed: 0,CustomerID,Nationality,Gender,MaritalStatus,NoofChildren,Age,PossDivorced,ItemIDs
0,1,Thai,Female,Single,0,36,N. A.,"[11419, 25769, 27536, 37100, 24504, 10884, 11052]"
1,2,Unknown,Unknown,Unknown,0,39,N. A.,"[1226, 26894, 11890, 6945, 44039, 6951, 11052]"
2,3,Thai,Female,Single,0,37,N. A.,"[30657, 108, 25403, 5555, 11052, 1226, 6945]"
3,4,Thai,Female,Single,0,39,N. A.,"[6945, 6951, 10154, 11389, 11419, 37100, 39780]"
4,5,Thai,Female,Single,0,30,N. A.,"[6951, 11389, 11419, 25769, 27536, 37100, 39898]"


* Merging (second stage)

In [336]:
def ItemID_to_Name(input):
    return [product_dict[each] for each in input]

In [337]:
Reco_fs_df['ItemNames'] = Reco_fs_df.ItemIDs.map(ItemID_to_Name)
Reco_fs_df.drop(columns='ItemIDs',inplace=True)
Reco_fs_df.drop(columns='CustomerID',inplace=True)

In [348]:
Reco_fs_df.head()

Unnamed: 0,Nationality,Gender,MaritalStatus,NoofChildren,Age,PossDivorced,ItemNames
0,Thai,Female,Single,0,36,N. A.,"[Salad Bar(B, Meiji Past Milk Plain 2L, Nurse ..."
1,Unknown,Unknown,Unknown,0,39,N. A.,"[Green Bag, Roza Tuna Chunk in Oil 185g, Ban S..."
2,Thai,Female,Single,0,37,N. A.,"[Gumgig Pean 20 Tablets 1Box, My Choice Red Ch..."
3,Thai,Female,Single,0,39,N. A.,"[Kinder Joy Girls Chocolate 20g, Kinder Joy Bo..."
4,Thai,Female,Single,0,30,N. A.,"[Kinder Joy Boys Chocolate 20g, Australian Car..."


#### Recommendation Display (Note that this is the part which connects with UI/UX for displaying)

* Hence if we would like to understand the recommendations, we could do so. Such as,

In [365]:
_nationality = 'Thai'
_gender = 'Female'
_marry = 'Single'
_agemin = 35
_agemax = 40
_numkidsmin = 2
_divorce = 'Yes'

In [367]:
result_df = Reco_fs_df[(Reco_fs_df.Nationality == _nationality)
                       &(Reco_fs_df.Gender == _gender)
                       &(Reco_fs_df.MaritalStatus == _marry)
                       &(Reco_fs_df.Age>_agemin)
                       &(Reco_fs_df.Age<_agemax)&(Reco_fs_df.NoofChildren>=_numkidsmin)
                       &(Reco_fs_df.PossDivorced == _divorce)]
result_df

Unnamed: 0,Nationality,Gender,MaritalStatus,NoofChildren,Age,PossDivorced,ItemNames
675,Thai,Female,Single,4,37,Yes,"[Salty Chinese Dessert 50g, Gold Sardines Sauc..."
1090,Thai,Female,Single,2,37,Yes,"[Kinder Joy Girls Chocolate 20g, Kinder Joy Bo..."
2033,Thai,Female,Single,3,38,Yes,"[Nipponkai Sushi 1Bht(C, Kinder Joy Boys Choco..."
2036,Thai,Female,Single,2,39,Yes,"[Chinese Dessert 50g, Korean Strawberry 250g, ..."
2231,Thai,Female,Single,2,37,Yes,"[NYB2017 Basket No.40@590, Khum Palm Oil 1L, B..."
3509,Thai,Female,Single,2,37,Yes,"[Bee Palm Oil 1L Refil, Korean Strawberry 250g..."
3531,Thai,Female,Single,2,38,Yes,"[Nipponkai Sushi 5Bht(C, Kinder Joy Boys Choco..."
4845,Thai,Female,Single,3,36,Yes,"[Kinder Joy Boys Chocolate 20g, Hygienic Pork ..."
6107,Thai,Female,Single,2,36,Yes,"[Bee Palm Oil 1L Refil, Kinder Joy Boys Chocol..."
8356,Thai,Female,Single,2,37,Yes,"[SMS Cigarette American, Green Bag, Fiji Miner..."


* This also implies, different people have very different taste
* Such segmentation cannot solve "cold start" issue (new customer comes, what to recommend)
* Hence, we should do customer analysis to understand who are indeed in same group 
* Also, we can save to file for further analysis which our customers may be interested in

In [369]:
result_df.to_csv('recommendation_result.csv',index=False)

In [None]:
sc.stop()

### Appendix (Amazon Review Dataset)

In [2]:
# * Processing Amazon Data
# %time
# import pandas as pd
# import gzip

# def parse(path):
#     g = gzip.open(path, 'rb')
#     for l in g:
#         yield eval(l)

# def getDF(path):
#     i = 0
#     df = {}
#     for d in parse(path):
#         df[i] = d
#         i += 1
#     return pd.DataFrame.from_dict(df, orient='index')

# def getDF2(path):
#     df = {}
#     for idx, d in enumerate(parse(path)):
#         df[idx] = d
#     return pd.DataFrame.from_dict(df, orient='index')

In [3]:
# %time
# reviews_df = getDF('/Users/ito/Downloads/reviews_Video_Games.json.gz')

# %time
# meta_df = getDF('/Users/ito/Downloads/meta_Video_Games.json.gz')

In [4]:
# %time
# meta_df2 = getDF2('/Users/ito/Downloads/meta_Video_Games.json.gz')

In [5]:
# whether = pd.to_datetime(reviews_df.reviewTime)
# whether.head()

In [6]:
# %time
# print reviews_df.shape
# reviews_df.head(5)

In [7]:
# sub_data = reviews_df[['reviewerID','asin','overall','unixReviewTime']]
# sub_data = sub_data.sort_values(by=['reviewerID','asin'])

In [8]:
# sub_data['CustomerCat'] = pd.Categorical(list(sub_data.reviewerID)).codes
# sub_data['ItemCat'] = pd.Categorical(list(sub_data.asin)).codes

In [9]:
# * Match the categorical ID with the real ID
# idx_match = zip(sub_data['CustomerCat'],  sub_data.reviewerID)
# itm_match = zip(sub_data['ItemCat'], sub_data.asin)
# idx_dict = dict(idx_match)
# itm_dict = dict(itm_match)
# print idx_match[:5]
# print itm_match[:5]

In [10]:
# sub_data = sub_data[['CustomerCat','ItemCat','overall','unixReviewTime']]
# _File_name = '/Users/ito/venv/pyspark-rec/Amazon_videogame_data'
# # sub_data.to_csv(_File_name, index=False, header=False)
# sub_data.to_csv(_File_name, index=False,header=False, sep='\t')