# Movie Recommender

The `recommendation` function takes a user ID and prints the top 5 movie recommendations for that user. It is defined in the "Create Recommendation Engine" section.

## Methodology and Assumptions
Our recommendation engine finds the user most similar to the target user and recommends that user's 5 highest-rated movies.  
Similarity between two users is the number of movies that both of them liked.  
A user liked a movie if their rating for it is greater than the average rating for that movie.
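
As a rough illustration of the three rules above, here is a plain-Python sketch (no Spark) on a made-up ratings table. The names `toy_ratings`, `liked_sets`, and `most_similar_user` are hypothetical and do not appear in the notebook; the sketch expresses similarity as the size of the intersection of two users' liked sets, which is an equivalent way of counting shared likes.

```python
from collections import defaultdict

# Hypothetical toy data: (userid, movieid, rating)
toy_ratings = [
    (1, 10, 5.0), (1, 20, 4.0), (1, 30, 1.0),
    (2, 10, 4.5), (2, 20, 4.5), (2, 30, 3.0),
    (3, 10, 2.0), (3, 20, 2.0), (3, 30, 5.0),
]

def liked_sets(ratings):
    """Map each user to the set of movies they rated above that movie's average."""
    totals = defaultdict(lambda: [0.0, 0])  # movieid -> [sum of ratings, count]
    for _, movie, rating in ratings:
        totals[movie][0] += rating
        totals[movie][1] += 1
    averages = {m: s / n for m, (s, n) in totals.items()}
    liked = defaultdict(set)
    for user, movie, rating in ratings:
        if rating > averages[movie]:  # strictly greater, as in the rule above
            liked[user].add(movie)
    return liked

def most_similar_user(target, liked):
    """Return the other user who shares the most liked movies with target."""
    others = [u for u in liked if u != target]
    return max(others, key=lambda u: len(liked[target] & liked[u]))

liked = liked_sets(toy_ratings)
print(most_similar_user(1, liked))
```

In this toy table users 1 and 2 both like movies 10 and 20, so user 2 is the closest match for user 1.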

## Import Statements and SparkContext

In [1]:
# for removing headers
from itertools import islice
# set up sparkcontext
from pyspark import SparkConf, SparkContext
sc = SparkContext() 

## Import and Clean Data

In [2]:
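def replace_commas_in_quotes_in_csv(line):
    # a quoted field may contain commas; swap them for "^" so a plain split(',') still works
    if "\"" in line:
        l = line.split("\"")
        # l[1] is the text inside the first pair of quotes
        if "," in l[1]:
            l[1] = l[1].replace(",","^")
        # rejoin the pieces without the quote characters
        line = "".join(l)
    return line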

# Example
# replace_commas_in_quotes_in_csv('11,"American President, The (1995)",Comedy|Drama|Romance')
# replace_commas_in_quotes_in_csv('11,American President, The (1995),Comedy|Drama|Romance')

# outputs
# 11,American President^ The (1995),Comedy|Drama|Romance
# 11,American President, The (1995),Comedy|Drama|Romance
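
As a possible alternative to the placeholder approach above, Python's standard `csv` module already understands double-quoted fields. The sketch below parses a single line that way; `parse_csv_line` is a hypothetical helper, not part of the notebook.

```python
import csv

def parse_csv_line(line):
    """Split one CSV line into fields, honoring double-quoted fields."""
    # csv.reader expects an iterable of lines, so wrap the single line in a list
    return tuple(next(csv.reader([line])))

quoted = '11,"American President, The (1995)",Comedy|Drama|Romance'
fields = parse_csv_line(quoted)
print(fields)
```

With this approach the title keeps its comma, so no "^" placeholder has to be restored later. Whether it could simply be mapped over the movies RDD in place of the two existing steps is an assumption that has not been checked against the notebook's data.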

In [3]:
#create movies RDD
moviesRDD = sc.textFile("Data/movies.csv")
moviesRDD = moviesRDD.map(replace_commas_in_quotes_in_csv)
moviesRDD = moviesRDD.map(lambda x: tuple(x.split(',')))
moviesRDD = moviesRDD.mapPartitionsWithIndex(lambda idx, it: islice(it, 1, None) if idx == 0 else it)
#convert datatypes
moviesRDD = moviesRDD.map(lambda x: (int(x[0]), x[1], x[2]))
moviesRDD.take(5)

[(1, 'Toy Story (1995)', 'Adventure|Animation|Children|Comedy|Fantasy'),
 (2, 'Jumanji (1995)', 'Adventure|Children|Fantasy'),
 (3, 'Grumpier Old Men (1995)', 'Comedy|Romance'),
 (4, 'Waiting to Exhale (1995)', 'Comedy|Drama|Romance'),
 (5, 'Father of the Bride Part II (1995)', 'Comedy')]

In [4]:
# create ratings RDD
ratingsRDD = sc.textFile("Data/ratings.csv")
ratingsRDD = ratingsRDD.map(lambda x: tuple(x.split(',')))
ratingsRDD = ratingsRDD.mapPartitionsWithIndex(lambda idx, it: islice(it, 1, None) if idx == 0 else it)
# convert datatypes in RDD
ratingsRDD = ratingsRDD.map(lambda x: (int(x[0]), int(x[1]), float(x[2]), int(x[3])))
ratingsRDD.take(5)

[(1, 31, 2.5, 1260759144),
 (1, 1029, 3.0, 1260759179),
 (1, 1061, 3.0, 1260759182),
 (1, 1129, 2.0, 1260759185),
 (1, 1172, 4.0, 1260759205)]

### Add Average Movie Rating To ratingsRDD

In [5]:
# create key-value pairs of movieid and user rating
averagemovierating = ratingsRDD.map(lambda x: (x[1], x[2]))

In [6]:
# create a tuple of (movieid, (sumofratings, numberofratings))
averagemovierating = averagemovierating.mapValues(lambda x: (x,1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))

In [7]:
# reduce the previous RDD to (movieid, averagemovierating)
averagemovierating = averagemovierating.map(lambda x: (x[0], x[1][0]/x[1][1]))

In [8]:
# convert to key-value pairs of (movieid, rating tuple)
ratingsbymovieid = ratingsRDD.map(lambda x: (x[1], x))

In [9]:
# joined with the average movie rating on the movie id key
ratingsbymovieid = ratingsbymovieid.join(averagemovierating)

In [10]:
# fixed join formatting issues
ratingsbymovieid = ratingsbymovieid.map(lambda x: (x[0], (x[1][0][0], x[1][0][1], x[1][0][2], x[1][0][3], x[1][1])))
ratingsbymovieid.take(5)

[(1172, (1, 1172, 4.0, 1260759205, 4.260869565217392)),
 (1172, (23, 1172, 5.0, 1148670101, 4.260869565217392)),
 (1172, (38, 1172, 4.5, 1389867840, 4.260869565217392)),
 (1172, (56, 1172, 2.0, 1470350810, 4.260869565217392)),
 (1172, (94, 1172, 3.5, 1291781459, 4.260869565217392))]
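
The (sum, count) aggregation used in this section can be mirrored in plain Python. The sketch below is not Spark code: it uses `functools.reduce` with the same combiner as the `reduceByKey` lambda, on made-up pairs. The names `pairs`, `combine`, and `average_by_key` are hypothetical.

```python
from functools import reduce
from itertools import groupby

# Hypothetical (movieid, rating) pairs
pairs = [(1172, 4.0), (1172, 5.0), (31, 2.5), (1172, 3.0), (31, 3.5)]

def combine(x, y):
    """Same combiner as the reduceByKey lambda: add the sums, add the counts."""
    return (x[0] + y[0], x[1] + y[1])

def average_by_key(pairs):
    """Return {key: mean of its values} using a (sum, count) accumulator."""
    averages = {}
    # groupby needs its input sorted by key
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        total, count = reduce(combine, ((rating, 1) for _, rating in group))
        averages[key] = total / count
    return averages

averages = average_by_key(pairs)
print(averages)
```

Carrying (sum, count) instead of a running average matters because `reduceByKey` may combine values in any grouping. Sums and counts can be merged in any order and still give the same result; partial averages cannot.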

### Add UserLikedMovie Boolean to ratingsRDD

In [11]:
# add "liked" flag: True if the user's rating is greater than the average movie rating
ratingsbymovieid = ratingsbymovieid.map(lambda x: (x[0], (x[1][0], x[1][1], x[1][2], x[1][3], x[1][4], x[1][2]>x[1][4])))

In [12]:
# update ratingsRDD to reflect new data
ratingsRDD = ratingsbymovieid.map(lambda x: (x[1][0], x[1][1], x[1][2], x[1][3], x[1][4], x[1][5]))
ratingsbymovieid.take(5)

[(1172, (1, 1172, 4.0, 1260759205, 4.260869565217392, False)),
 (1172, (23, 1172, 5.0, 1148670101, 4.260869565217392, True)),
 (1172, (38, 1172, 4.5, 1389867840, 4.260869565217392, True)),
 (1172, (56, 1172, 2.0, 1470350810, 4.260869565217392, False)),
 (1172, (94, 1172, 3.5, 1291781459, 4.260869565217392, False))]

### Find Users that liked each movie by movieid

In [13]:
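# keep only the ratings where the user liked the movie
likedmovies = ratingsRDD.filter(lambda x: x[5])

In [14]:
# create key-value pairs of (movieid, userid)
userslikedmovies = likedmovies.map(lambda x: (x[1], x[0]))

In [15]:
# group into (movieid, (userid, userid, ...))
moviesusersliked = userslikedmovies.groupByKey().map(lambda x: (x[0], tuple(x[1])))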

In [16]:
moviesusersliked.take(1)

[(1172,
  (23,
   38,
   133,
   148,
   229,
   280,
   320,
   321,
   330,
   358,
   373,
   387,
   391,
   430,
   441,
   481,
   497,
   510,
   521,
   537,
   539,
   545,
   547,
   585,
   587))]

### Find all targetuser-user pairs that like each movie

In [17]:
def similar_users(themoviesusersliked):
    # themoviesusersliked is (movieid, (userid, userid, ...))
    users = themoviesusersliked[1]
    l = []
    for i in range(len(users)):
        # start at i + 1 so each unordered pair is emitted exactly once
        for j in range(i + 1, len(users)):
            l.append((users[i], users[j]))
    return tuple(l)

In [18]:
#combinations of users that liked the same movie
useraffinity = moviesusersliked.map(similar_users)

In [19]:
useraffinity.take(1)

[((23, 38),
  (23, 133),
  (23, 148),
  (23, 229),
  (23, 280),
  (23, 320),
  (23, 321),
  (23, 330),
  (23, 358),
  (23, 373),
  (23, 387),
  (23, 391),
  (23, 430),
  (23, 441),
  (23, 481),
  (23, 497),
  (23, 510),
  (23, 521),
  (23, 537),
  (23, 539),
  (23, 545),
  (23, 547),
  (23, 585),
  (23, 587),
  (38, 133),
  (38, 148),
  (38, 229),
  (38, 280),
  (38, 320),
  (38, 321),
  (38, 330),
  (38, 358),
  (38, 373),
  (38, 387),
  (38, 391),
  (38, 430),
  (38, 441),
  (38, 481),
  (38, 497),
  (38, 510),
  (38, 521),
  (38, 537),
  (38, 539),
  (38, 545),
  (38, 547),
  (38, 585),
  (38, 587),
  (133, 148),
  (133, 229),
  (133, 280),
  (133, 320),
  (133, 321),
  (133, 330),
  (133, 358),
  (133, 373),
  (133, 387),
  (133, 391),
  (133, 430),
  (133, 441),
  (133, 481),
  (133, 497),
  (133, 510),
  (133, 521),
  (133, 537),
  (133, 539),
  (133, 545),
  (133, 547),
  (133, 585),
  (133, 587),
  (148, 229),
  (148, 280),
  (148, 320),
  (148, 321),
  (148, 330),
  (148, 358)

In [20]:
#flatten
useraffinity = useraffinity.flatMap(lambda xs: [(x[0], x[1]) for x in xs])

In [21]:
useraffinity.take(5)

[(23, 38), (23, 133), (23, 148), (23, 229), (23, 280)]
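
The pair generation in this section matches what `itertools.combinations` produces. A minimal sketch, where `user_pairs` is a hypothetical name and the input is a shortened version of the user tuple shown earlier:

```python
from itertools import combinations

def user_pairs(users):
    """All unordered pairs of users, each emitted once, in input order."""
    return list(combinations(users, 2))

pairs = user_pairs((23, 38, 133))
print(pairs)

# n users yield n * (n - 1) / 2 pairs
print(len(user_pairs(range(25))))
```

The pair count grows quadratically: the 25 users who liked movie 1172 already produce 300 pairs, so widely liked movies dominate the cost of this step.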

### Find count of targetuser-user pairs

In [22]:
#ensure that the tuples are in (smallernumber, largernumber) form and count the pair
useraffinity = useraffinity.map(lambda x: ((x[1], x[0]), 1) if x[0]>x[1] else (x,1))

In [23]:
#find the count of each tuple pair
useraffinity = useraffinity.reduceByKey(lambda x,y: x+y)

In [24]:
useraffinity.take(5)

[((23, 387), 45),
 ((23, 391), 15),
 ((23, 539), 1),
 ((23, 547), 131),
 ((23, 587), 78)]
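
The normalize-then-count step above can be sketched in plain Python with `collections.Counter`. The data and the names `raw_pairs` and `count_pairs` are made up for illustration.

```python
from collections import Counter

# Hypothetical pairs gathered from several movies, in either order
raw_pairs = [(23, 38), (38, 23), (23, 133), (38, 23), (133, 23)]

def count_pairs(pairs):
    """Count each unordered pair under a canonical (smaller, larger) key."""
    return Counter((min(a, b), max(a, b)) for a, b in pairs)

pair_counts = count_pairs(raw_pairs)
print(pair_counts)
```

Without the canonical ordering, (23, 38) and (38, 23) would be counted as two different keys and the affinity between those users would be split across them.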

### Create Recommendation Engine

Accepts a user ID and prints the top 5 movie recommendations for that user.

In [25]:
def recommendation(userid):
    # keep only the affinity pairs that include the target user
    specificaffinity = useraffinity.filter(lambda x: userid in x[0])
    # reduce() fails on an empty RDD, so stop early if no pair includes the user
    if specificaffinity.isEmpty():
        print("No similar user found for user " + str(userid))
        return

    # pick the pair with the highest affinity and take the other user in it
    ((user1, user2), _) = specificaffinity.reduce(lambda x, y: x if x[1] >= y[1] else y)
    affinityuser = user2 if userid == user1 else user1
    affinityuserratings = ratingsRDD.filter(lambda x: x[0] == affinityuser)

    # find the top 5 movies the affinity user rated, ranked by (rating, average movie rating)
    top5 = affinityuserratings.top(5, lambda x: (x[2], x[4]))
    top5ids = {x[1] for x in top5}
    # look up the titles (returned in moviesRDD order) and restore the commas replaced by "^"
    movienames = moviesRDD.filter(lambda x: x[0] in top5ids).map(lambda x: x[1].replace("^", ",")).collect()
    recommendation_greeting = "We recommend that user " + str(userid) + " watches:\n"
    for i, name in enumerate(movienames, start=1):
        recommendation_greeting += str(i) + ") " + name + "\n"
    print(recommendation_greeting)

In [26]:
recommendation(1)

We recommend that user 1 watches:
1) Usual Suspects, The (1995)
2) Maltese Falcon, The (1941)
3) Paths of Glory (1957)
4) Roger & Me (1989)
5) Mister Roberts (1955)
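
The ranking step inside `recommendation` orders the similar user's ratings by (rating, average movie rating). The plain-Python sketch below shows that ordering with `heapq.nlargest` on made-up rows; `rows`, `titles`, and `top_titles` are hypothetical, and the titles are placeholders.

```python
import heapq

# Hypothetical rows: (userid, movieid, rating, timestamp, avg_rating, liked)
rows = [
    (7, 50, 5.0, 0, 4.4, True),
    (7, 60, 4.0, 0, 3.9, True),
    (7, 70, 5.0, 0, 4.1, True),
    (7, 80, 3.0, 0, 3.5, False),
]
titles = {50: "Movie A", 60: "Movie B", 70: "Movie C", 80: "Movie D"}

def top_titles(rows, titles, n=5):
    """Titles of the n best rows, ranked by rating, then by average rating."""
    best = heapq.nlargest(n, rows, key=lambda r: (r[2], r[4]))
    # looking titles up from the ranked list keeps the ranking order
    return [titles[r[1]] for r in best]

ranked = top_titles(rows, titles, n=3)
print(ranked)
```

One design point: looking titles up from the ranked list preserves the ranking, whereas filtering `moviesRDD` by membership lists the titles in the order they appear in `moviesRDD`. The numbered output above is therefore not necessarily a ranking.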



### Developer Notes

#### Compare target user to one other user or an aggregate of many users
1) find the most similar user or aggregate user
2) identify that user's highly rated movies

#### Metrics to judge similarity
- based on aggregate rating of a movie
- based on what the rating is
- based on what genres they watched the most of
- based on how frequently they rate movies
- something with tags on the movie? Sentiment analysis?
- scrape data from IMDb for a critic's review
- timestamps?

#### Misc
- User liked a movie = T/F if their rating > aggregate movie rating