Data Date Range:
    <br />Start time: 2016-05-05 09:40:26
    <br />End time: 2018-01-17 01:59:42

#### This notebook:
- Generates a table of all trailers played, with their algorithm assignment
- Generates a table of recommended trailers that were clicked, with their algorithm assignment
- Generates an aggregate table with count_total_played, count_rec_played, count_refresh, and algorithm assignment

#### We will be using the 'log_trailer_actions_sessionIds' file. The 'actions' in this file are explained below. ####
##### If a user opens the trailer from the movie detail page: ####
* TrailerModalLaunched: logged when the user opens the trailer from the movie detail page
* PlayedFromLaunch: logged right after 'TrailerModalLaunched'

##### If a user clicks in the trailer interface to go to a different recommended trailer: ####
* PlayedFromNext
* PlayedFromPrevious
* PlayedFromRecommendation
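
The action names above fall into two groups, and the rest of the notebook filters on them. As a minimal sketch, the grouping can be written as a small labelling function. The toy log, the `play_source` helper, and its labels are illustrative assumptions, not part of the original data or code.

```python
import pandas as pd

# Action names as listed in this notebook.
LAUNCH_ACTIONS = {"PlayedFromLaunch"}
REC_ACTIONS = {"PlayedFromNext", "PlayedFromPrevious", "PlayedFromRecommendation"}

# Toy log with made-up rows; only the 'action' values mirror the real file.
toy_log = pd.DataFrame({
    "loginId": ["a", "a", "a", "b"],
    "action": ["TrailerModalLaunched", "PlayedFromLaunch",
               "PlayedFromRecommendation", "PlayedFromNext"],
})

def play_source(action):
    """Hypothetical helper: label an action as launch play, recommendation play, or other."""
    if action in LAUNCH_ACTIONS:
        return "launch"
    if action in REC_ACTIONS:
        return "recommendation"
    return "other"

toy_log["play_source"] = toy_log["action"].map(play_source)
print(toy_log)
```

'TrailerModalLaunched' is labelled "other" here because it records the modal opening rather than a play.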


#### This notebook generates the following tables:
- rec_played_alg: table of clicked recommendations with their algorithm assignment
- total_played_alg: table of all trailers played with their algorithm assignment
- total_rec_refresh_aggregate: an aggregate table with count_total_played, count_rec_played, count_refresh, and algorithm assignment


### Import ###

In [1]:
import pandas as pd
import numpy as np

#### Run the notebook '0 - Generate log_trailer_actions Dataframe' before running the cells below

In [2]:
action_data = pd.read_csv('../Clean_Data/log_trailer_actions_sessionIds.csv')

### Load the dataset with algorithm assignment in each session in the experiment

In [8]:
# Get the algorithm assignment
expt_trailer = pd.read_csv('../Clean_Data/expt_trailer.csv', sep=',')
alg_assignment = expt_trailer[['userId','loginId','recommenderId']]
alg_assignment.sort_values(by='recommenderId').shape
alg_assignment.head()

Unnamed: 0,userId,loginId,recommenderId
0,1892,04nVLGw,PredictedRating
1,1892,05UYXbK,FilmReleaseDate
2,1892,0Cjojur,FilmReleaseDate
3,1892,0fCo1lB,FilmReleaseDate
4,1892,0GvsUHM,FilmReleaseDate


### Rec_played (recommended trailers that are clicked on) and Algorithm

If an action is among: "PlayedFromNext", "PlayedFromRecommendation", "PlayedFromPrevious", it means a recommendation is played, so we keep this row in the table.

In [9]:
# Trailers recommended by an algorithm and played, excluding 'PlayedFromYoutubePlayer'
rec_played = action_data[action_data['action'].isin(["PlayedFromNext", "PlayedFromRecommendation", "PlayedFromPrevious"])]

# Join the table of trailers recommended and played with the table of algorithm assignment
rec_played_alg = pd.merge(rec_played, alg_assignment, on=['loginId','userId'], how='left')
rec_played_alg.head()

Unnamed: 0.1,Unnamed: 0,userId,loginId,movieId,action,tstamp,position,sessionIds,recommenderId
0,34,304211,00bSwKf,159193,PlayedFromRecommendation,2017-04-29 07:07:09,4,3,FilmReleaseDate
1,66,304211,00bSwKf,132800,PlayedFromRecommendation,2017-04-29 07:15:57,6,6,FilmReleaseDate
2,70,304211,00bSwKf,161966,PlayedFromRecommendation,2017-04-29 07:18:04,2,6,FilmReleaseDate
3,83,304211,00bSwKf,5617,PlayedFromRecommendation,2017-04-29 07:18:16,0,6,FilmReleaseDate
4,87,304211,00bSwKf,113829,PlayedFromRecommendation,2017-04-29 07:18:34,7,6,FilmReleaseDate


In [10]:
# Drop rows without recommender assignment
rec_played_alg2 = rec_played_alg.dropna(subset=['recommenderId'])


In [11]:
rec_played_alg2.count()

Unnamed: 0       9137
userId           9137
loginId          9137
movieId          9137
action           9137
tstamp           9137
position         9137
sessionIds       9137
recommenderId    9137
dtype: int64
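
A left merge followed by `dropna` on `recommenderId` keeps only plays whose session has an algorithm assignment. To see how many rows are dropped, the `indicator` option of `pd.merge` can help. The sketch below uses made-up toy frames (`plays`, `assignment`) rather than the real data.

```python
import pandas as pd

# Toy data (assumed for illustration): session 'b' has no algorithm assignment.
plays = pd.DataFrame({
    "userId": [1, 1, 2],
    "loginId": ["a", "a", "b"],
    "movieId": [10, 11, 12],
})
assignment = pd.DataFrame({
    "userId": [1],
    "loginId": ["a"],
    "recommenderId": ["TagSimilarity"],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge.
merged = pd.merge(plays, assignment, on=["loginId", "userId"],
                  how="left", indicator=True)
n_unmatched = int((merged["_merge"] == "left_only").sum())

# Same filtering step as in the notebook, then remove the helper column.
kept = merged.dropna(subset=["recommenderId"]).drop(columns="_merge")
print(f"unmatched plays: {n_unmatched}, kept plays: {len(kept)}")
```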

In [12]:
rec_played_alg2.to_csv("../Clean_Data/rec_played_alg.csv", index=False)

In [13]:
# Find out the count of recommended trailers played in a session:
rec_played_alg_count = rec_played_alg2.groupby(['userId','loginId','recommenderId'])['movieId'].count().reset_index(name="count_rec_played")
rec_played_alg_count.head()

Unnamed: 0,userId,loginId,recommenderId,count_rec_played
0,43115,TYJlPxZ,TagSimilarity,2
1,43115,bl2JVsA,ShuffledTopPicks,2
2,43115,hwYZBXR,TagSimilarity,3
3,43115,nXM5hHL,ShuffledTopPicks,2
4,47315,QCQcblN,ShuffledTopPicks,1


#### Get content information of these recommendations

In [14]:
rec_seed_info = pd.read_csv("../Clean_Data/rec_seed_info.csv")

FileNotFoundError: File b'../Clean_Data/rec_seed_info.csv' does not exist

In [39]:
rec_seed_info.head()

Unnamed: 0,Algorithm,SeedMovie,loginId,movieId,sessionIds,userId,avgRating,popularityLastYear,avgRating_seedmovie,popularityLastYear_seedmovie,age_month,age_seedmovie_month
0,PredictedRating,118985,008C57f,318,1,276159,4.43,7047.0,3.4,159.0,267.5,20.0
1,PredictedRating,118985,008C57f,93721,1,276159,3.95,229.0,3.4,159.0,63.87,20.0
2,PredictedRating,118985,008C57f,148626,1,276159,4.01,1744.0,3.4,159.0,9.17,20.0
3,PredictedRating,118985,008C57f,2966,1,276159,3.93,94.0,3.4,159.0,205.9,20.0
4,PredictedRating,118985,008C57f,81845,1,276159,3.93,2032.0,3.4,159.0,70.57,20.0


In [40]:
rec_played_info = pd.merge(rec_played_alg2, rec_seed_info, on=['loginId','movieId'],how='left')
# Drop rows with any missing value (e.g. plays with no matching seed info)
rec_played_info.dropna(inplace=True)


In [41]:
rec_played_info.drop_duplicates(subset=['loginId','movieId','SeedMovie']).count()

Unnamed: 0                      16367
userId_x                        16367
loginId                         16367
movieId                         16367
action                          16367
tstamp                          16367
position                        16367
sessionIds_x                    16367
recommenderId                   16367
Algorithm                       16367
SeedMovie                       16367
sessionIds_y                    16367
userId_y                        16367
avgRating                       16367
popularityLastYear              16367
avgRating_seedmovie             16367
popularityLastYear_seedmovie    16367
age_month                       16367
age_seedmovie_month             16367
dtype: int64
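
The row count here (16,367) is larger than the 9,137 rows reported for `rec_played_alg2` earlier. One likely reason is that `rec_seed_info` holds several rows per (`loginId`, `movieId`) pair, one per seed movie, so the merge repeats each play once per match. The sketch below reproduces that effect on made-up toy frames and shows how the `validate` option of `pd.merge` can flag it. The data is assumed for illustration only.

```python
import pandas as pd

# Toy data (assumed): movie 10 was recommended from two different seed movies.
played = pd.DataFrame({"loginId": ["a", "a"], "movieId": [10, 11]})
seed_info = pd.DataFrame({
    "loginId": ["a", "a", "a"],
    "movieId": [10, 10, 11],
    "SeedMovie": [100, 101, 100],
})

# Two plays become three rows because the key (a, 10) appears twice on the right.
joined = pd.merge(played, seed_info, on=["loginId", "movieId"], how="left")
print(len(played), "->", len(joined))

# Count how many right-hand rows share each merge key.
rows_per_key = seed_info.groupby(["loginId", "movieId"]).size()
print(rows_per_key)

# validate="one_to_one" raises if either side has duplicate keys.
raised = False
try:
    pd.merge(played, seed_info, on=["loginId", "movieId"],
             how="left", validate="one_to_one")
except pd.errors.MergeError:
    raised = True
print("duplicate keys detected:", raised)
```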

### Total_played (All trailers that are clicked on) and Algorithm

If an action is among: "PlayedFromLaunch","PlayedFromNext", "PlayedFromRecommendation", "PlayedFromPrevious", it means a trailer is played, so we keep this row in the table.

In [13]:
# All trailers played (launch plays plus recommendation plays)
total_played = action_data[action_data['action'].isin(["PlayedFromLaunch","PlayedFromNext", "PlayedFromRecommendation", "PlayedFromPrevious"])].copy()
total_played.count()

Unnamed: 0    186878
userId        186878
loginId       186878
movieId       186878
action        186878
tstamp        186878
position      186878
sessionIds    186878
dtype: int64

In [14]:
total_played.drop("Unnamed: 0",axis=1, inplace=True)

In [15]:
# Merge total_played with algorithm assignment
total_played_alg = pd.merge(total_played, alg_assignment, on=['loginId','userId'], how='left')

# Drop na values
total_played_alg2 = total_played_alg.dropna(subset=['recommenderId']).copy()

In [17]:
total_played_alg2.count()

userId           166959
loginId          166959
movieId          166959
action           166959
tstamp           166959
position         166959
sessionIds       166959
recommenderId    166959
dtype: int64

In [46]:
total_played_alg2.to_csv("../Clean_Data/total_played_alg.csv", index=False)

### Aggregate table: count_total_played, count_rec_played, refreshes, with algorithm assignment, by loginId

In [47]:
# Find out the count of total trailers played in a session:
total_played_alg_count = total_played_alg2.groupby(['userId','loginId','recommenderId'])['movieId'].count().reset_index(name="count_total_played")

In [48]:
# Join the two tables above:
total_rec_count = pd.merge(total_played_alg_count, rec_played_alg_count, on=['loginId','userId'], how='left')


In [49]:
rec_refresh = action_data[action_data['action'] == 'RecommendationsRefreshed']
refresh_alg = pd.merge(rec_refresh, alg_assignment, on=['loginId','userId'], how='left')

refresh_alg_count = refresh_alg.groupby(['userId','loginId','recommenderId'])['movieId'].count().reset_index(name="count_refresh")
refresh_alg_count.head()

Unnamed: 0,userId,loginId,recommenderId,count_refresh
0,108928,cMSrxrL,FilmReleaseDate,1
1,125536,9Mh05HC,PredictedRating,1
2,131928,RAUtp1C,FilmReleaseDate,1
3,163262,jUXOHcr,ShuffledTopPicks,1
4,164934,VqCRXL9,TagSimilarity,1


In [50]:
# Join the table of #refresh with the previous aggregate table
total_rec_refresh_count = pd.merge(total_rec_count, refresh_alg_count, on=['loginId','userId'], how='left')

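# Fill missing counts with 0 (sessions with no refreshes or no recommendation plays)
total_rec_refresh_count['count_refresh'] = total_rec_refresh_count['count_refresh'].fillna(0)
total_rec_refresh_count['count_rec_played'] = total_rec_refresh_count['count_rec_played'].fillna(0)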


In [51]:
# Rename the column: recommenderId_x as Algorithm
total_rec_refresh_count=total_rec_refresh_count.rename(columns = {'recommenderId_x':'Algorithm'})

# Drop the columns recommenderId_y and recommenderId
total_rec_refresh_count.drop('recommenderId_y', axis = 1, inplace = True)
total_rec_refresh_count.drop('recommenderId', axis = 1, inplace = True)

In [52]:
total_rec_refresh_count.head()

Unnamed: 0,userId,loginId,Algorithm,count_total_played,count_rec_played,count_refresh
0,1892,Nm0bajY,TagSimilarity,1,0.0,0.0
1,12337,OnHpLST,PredictedRating,1,0.0,0.0
2,16783,2bqCVrC,TagSimilarity,19,0.0,0.0
3,22005,M1nww5f,ShuffledTopPicks,1,0.0,0.0
4,26229,t7hZ2Ty,PredictedRating,1,0.0,0.0
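
The counts in `count_rec_played` and `count_refresh` display as `0.0` because a left merge inserts NaN for unmatched sessions, which turns the column into floats. If integer counts are preferred, a cast after `fillna` restores them. This is a sketch on made-up toy frames, not a change to the notebook's output.

```python
import pandas as pd

# Toy data (assumed): session 'b' has no recommendation plays.
totals = pd.DataFrame({"loginId": ["a", "b"], "count_total_played": [3, 1]})
recs = pd.DataFrame({"loginId": ["a"], "count_rec_played": [2]})

merged = pd.merge(totals, recs, on="loginId", how="left")
print(merged["count_rec_played"].dtype)  # float, because of the NaN for 'b'

# Replace missing counts with 0, then cast back to integers.
merged["count_rec_played"] = merged["count_rec_played"].fillna(0).astype(int)
print(merged)
```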


We now have an aggregate table with the number of trailers played, recommended trailers played, and recommendation refreshes for each login session.
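
For comparison, the three per-session counts can also be computed in a single pass by flagging each log row and summing the flags. This is an alternative sketch, not the method used in this notebook: it runs on a made-up toy log, leaves out the algorithm-assignment join, and the `is_*` column names are assumptions.

```python
import pandas as pd

REC_ACTIONS = ["PlayedFromNext", "PlayedFromRecommendation", "PlayedFromPrevious"]
PLAY_ACTIONS = ["PlayedFromLaunch"] + REC_ACTIONS

# Toy log (assumed rows); the action names are the ones used in this notebook.
log = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2],
    "loginId": ["a", "a", "a", "b", "b"],
    "action": ["PlayedFromLaunch", "PlayedFromRecommendation",
               "RecommendationsRefreshed", "PlayedFromLaunch",
               "TrailerModalLaunched"],
})

# One boolean flag per count of interest.
flags = log.assign(
    is_play=log["action"].isin(PLAY_ACTIONS),
    is_rec=log["action"].isin(REC_ACTIONS),
    is_refresh=log["action"].eq("RecommendationsRefreshed"),
)

# Summing booleans per session yields the counts directly, with no NaN to fill.
agg = (
    flags.groupby(["userId", "loginId"])[["is_play", "is_rec", "is_refresh"]]
    .sum()
    .rename(columns={"is_play": "count_total_played",
                     "is_rec": "count_rec_played",
                     "is_refresh": "count_refresh"})
    .reset_index()
)
print(agg)
```

Unlike the merge-based table, this version would also keep sessions with zero plays, so a filter on `count_total_played > 0` would be needed to match it.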

In [53]:
# Write result to a csv file
total_rec_refresh_count.to_csv("../Clean_Data/total_rec_refresh_aggregate.csv")
