### Movie Recommendations with Association Rules

Download the small dataset from https://grouplens.org/datasets/movielens/ (or use the included zip file).


In [1]:
###1. Load data 
import pandas as pd
ratings_df = pd.read_csv("ml-latest-small/ratings.csv",delimiter=",")
movies_df = pd.read_csv("ml-latest-small/movies.csv",delimiter=",")
#ratings_df["timestamp"] = pd.to_datetime(ratings_df["timestamp"], unit="s")
print(ratings_df.head())

   userId  movieId  rating   timestamp
0       1       31     2.5  1260759144
1       1     1029     3.0  1260759179
2       1     1061     3.0  1260759182
3       1     1129     2.0  1260759185
4       1     1172     4.0  1260759205


In [2]:
movie_ratings_df = pd.merge(ratings_df, movies_df, how='left',
                            left_on='movieId', right_on='movieId')

In [3]:
movie_ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,31,2.5,1260759144,Dangerous Minds (1995),Drama
1,1,1029,3.0,1260759179,Dumbo (1941),Animation|Children|Drama|Musical
2,1,1061,3.0,1260759182,Sleepers (1996),Thriller
3,1,1129,2.0,1260759185,Escape from New York (1981),Action|Adventure|Sci-Fi|Thriller
4,1,1172,4.0,1260759205,Cinema Paradiso (Nuovo cinema Paradiso) (1989),Drama


Create a pivot table using the pandas function ```pivot```

In [4]:
rat=ratings_df.pivot(index='userId', columns='movieId', values='rating')
rat.fillna(0,inplace=True) #replace NaN with 0
rat.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
userId
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The table above shows the first 5 users' ratings across the 9066 movies (one column per movieId). A value of 0 means the user did not rate that movie.

In [5]:
# Pick 5 movieIds (the first five movies rated by user 1)
sel=[31,1029,1061,1129,1172] 

# Get user 1's ratings of the 5 movies.
# Use DataFrame.loc to select cells by index label (userId) and column labels (movieId)
rat.loc[1, sel] 

movieId
31      2.5
1029    3.0
1061    3.0
1129    2.0
1172    4.0
Name: 1, dtype: float64

In [6]:
# Discretize the ratings into two groups: <3 Bad and >=3 Good
# Unrated movies (filled with 0 above) also fall into the Bad group
rat[rat<3]=0  #Bad rating
rat[rat>=3]=1 #Good rating

In [7]:
# Show the ratings of the same 5 movies again, now as 0/1
rat.loc[1, sel] 

movieId
31      0.0
1029    1.0
1061    1.0
1129    0.0
1172    1.0
Name: 1, dtype: float64
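
The pivot and discretization steps can be sketched on a tiny table. This is a minimal illustration, assuming made-up ratings (the `toy` frame is hypothetical, not MovieLens data). It uses a boolean comparison instead of two in-place assignments, and it shows that unrated movies end up in the same "not liked" group as badly rated ones.

```python
import pandas as pd

# Hypothetical toy ratings, not taken from MovieLens
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 2.5, 3.0, 5.0, 1.0],
})

# Wide user x movie matrix; movies a user never rated become NaN
wide = toy.pivot(index="userId", columns="movieId", values="rating")

# NaN >= 3 evaluates to False, so unrated and badly rated movies
# both end up as False ("not liked")
liked = wide >= 3

print(liked)
```

A boolean matrix like `liked` carries the same information as the 0/1 matrix built above.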

In [8]:
# count how many movies each user rated >= 3 (row-wise sum of the 0/1 matrix)
rat.sum(axis=1) 

userId
1        8.0
2       70.0
3       47.0
4      194.0
5       96.0
       ...  
667     64.0
668     18.0
669     30.0
670     26.0
671    106.0
Length: 671, dtype: float64

### <font color=red>Questions Part 1 (1 mark)</font>
1. Create `frequent_itemsets` with apriori using a minimum support of 0.2 and use_colnames=True
2. Create the association rules from the frequent itemsets using the confidence metric with a minimum threshold of 0.5
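
Both thresholds in the questions above rest on two simple ratios. The sketch below computes them by hand on a few made-up transactions (the `transactions` list and the helper names are hypothetical and only meant to illustrate the definitions, not how mlxtend implements them).

```python
# Hypothetical toy data: each set holds the movies one user liked
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"A", "B"}))       # 3 of the 5 transactions contain A and B
print(confidence({"A"}, {"B"}))  # share of the A-transactions that also contain B
```

An itemset is "frequent" when its support reaches the minimum support, and a rule is kept when its confidence reaches the minimum threshold.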

In [10]:
pip install mlxtend

Collecting mlxtend
  Downloading mlxtend-0.18.0-py2.py3-none-any.whl (1.3 MB)
Installing collected packages: mlxtend
Successfully installed mlxtend-0.18.0
Note: you may need to restart the kernel to use updated packages.


In [9]:
import mlxtend
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder

In [10]:
# use apriori to generate frequent itemsets with minsupp = 0.2
# use association_rules to generate association rules with minconf = 0.5
frequent_itemsets=apriori(rat, min_support=0.2, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets

Unnamed: 0,support,itemsets,length
0,0.332340,(1),1
1,0.277198,(32),1
2,0.277198,(47),1
3,0.293592,(50),1
4,0.309985,(110),1
...,...,...,...
130,0.226528,"(296, 356, 318)",3
131,0.235469,"(296, 593, 318)",3
132,0.202683,"(296, 480, 356)",3
133,0.241431,"(296, 593, 356)",3


In [11]:
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(1),(260),0.332340,0.406855,0.207154,0.623318,1.532039,0.071939,1.574658
1,(260),(1),0.406855,0.332340,0.207154,0.509158,1.532039,0.071939,1.360233
2,(1),(296),0.332340,0.469449,0.202683,0.609865,1.299110,0.046666,1.359919
3,(1),(356),0.332340,0.476900,0.211624,0.636771,1.335230,0.053132,1.440139
4,(296),(47),0.469449,0.277198,0.238450,0.507937,1.832395,0.108320,1.468920
...,...,...,...,...,...,...,...,...,...
156,(356),"(296, 593)",0.476900,0.320417,0.241431,0.506250,1.579971,0.088624,1.376370
157,"(593, 356)",(318),0.299553,0.454545,0.219076,0.731343,1.608955,0.082916,2.030303
158,"(593, 318)",(356),0.292101,0.476900,0.219076,0.750000,1.572656,0.079773,2.092399
159,"(356, 318)",(593),0.295082,0.429210,0.219076,0.742424,1.729745,0.092424,2.216008


### <font color=red> Question Part 2 (3 marks) </font>
- (a) If a person likes The Matrix and Star Wars, what other movies would you recommend to him/her? Why?
- (b) If a person likes Pulp Fiction and Jurassic Park, what other movies would you recommend to him/her? Why?
- (c) Comment on lines 147 and 148 (shown below): why is there a difference in the values for conviction and confidence?

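Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
147,"(296, 480)",(356),0.244411,0.476900,0.202683,0.829268,1.738872,0.086123,3.063871
148,"(296, 356)",(480),0.311475,0.366617,0.202683,0.650718,1.774925,0.088490,1.813384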

Reference: http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/#metrics

In [None]:
# use the 'movie_ratings_df' dataframe
# get a list of 'Star Wars' movies IDs (look for strings that contain 'Star Wars' from the title column)
# get a list of 'Matrix' movies IDs (look for strings that contain 'Matrix' from the title column)
# concatenate the two lists
# create a list containing the MovieIDs for Star Wars and Matrix

In [None]:
# use the 'rat' dataframe (which records whether a user likes a movie or not)
# use apriori to generate frequent itemsets from the 'rat' dataframe and using a minsupp of 0.6
# use association_rules to generate rules using the metric "lift" and a min_threshold of 1.2
# filter for rules that contain the MovieIDs identified from the previous steps in their "antecedents"
# The consequents will be the movies recommended to the person who likes the Matrix and Star Wars
# Use the dataframe.loc function to return the movie names of the consequents
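
The filtering step described in the comments above can be sketched with plain frozensets. This is a simplified illustration, not the notebook's method: the three rules are copied from the rule tables in this notebook, while the `recommend` helper and the list-of-dicts layout are assumptions made for the example.

```python
# Three rules copied from the notebook output; antecedents and consequents
# are frozensets of movieIds, as in the mlxtend rules table
rules = [
    {"antecedents": frozenset({2571, 1196}), "consequents": frozenset({260})},
    {"antecedents": frozenset({296, 480}), "consequents": frozenset({356})},
    {"antecedents": frozenset({1}), "consequents": frozenset({260})},
]

def recommend(liked, rules):
    """Return consequent movieIds of rules whose antecedents are all liked."""
    liked = frozenset(liked)
    recommended = set()
    for rule in rules:
        # A rule applies when every antecedent movie is in the liked set
        if rule["antecedents"] <= liked:
            # Skip movies the person already likes
            recommended |= rule["consequents"] - liked
    return recommended

print(recommend({2571, 1196}, rules))
print(recommend({296, 480, 1}, rules))
```

Using the subset test (`<=`) avoids any dependence on the order of elements inside an antecedent.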

In [12]:
# Question A
df_starwars = movie_ratings_df[movie_ratings_df['title'].str.contains('Star Wars')].copy()
df_starwars

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
114,3,1210,3.0,1298921795,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Sci-Fi
154,4,260,5.0,949779042,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi
202,4,1196,5.0,949779173,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Sci-Fi
208,4,1210,5.0,949778714,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Sci-Fi
308,4,2628,5.0,949810582,Star Wars: Episode I - The Phantom Menace (1999),Action|Adventure|Sci-Fi
...,...,...,...,...,...,...
99822,669,260,5.0,1015829081,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi
99829,669,1210,3.0,1015829115,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Sci-Fi
99882,670,2628,4.0,938782234,Star Wars: Episode I - The Phantom Menace (1999),Action|Adventure|Sci-Fi
99893,671,260,5.0,1064891246,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi


In [13]:
df_matrix = movie_ratings_df[movie_ratings_df['title'].str.contains('Matrix')].copy()
df_matrix

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
475,6,2571,1.0,1109258202,"Matrix, The (1999)",Action|Sci-Fi|Thriller
638,8,2571,5.0,1154464738,"Matrix, The (1999)",Action|Sci-Fi|Thriller
740,9,2571,5.0,938628450,"Matrix, The (1999)",Action|Sci-Fi|Thriller
778,10,2571,5.0,942766515,"Matrix, The (1999)",Action|Sci-Fi|Thriller
911,13,2571,3.0,1331380888,"Matrix, The (1999)",Action|Sci-Fi|Thriller
...,...,...,...,...,...,...
98986,664,6934,3.5,1393891307,"Matrix Revolutions, The (2003)",Action|Adventure|Sci-Fi|Thriller|IMAX
99499,665,2571,5.0,1010197436,"Matrix, The (1999)",Action|Sci-Fi|Thriller
99881,670,2571,4.0,938782234,"Matrix, The (1999)",Action|Sci-Fi|Thriller
99945,671,2571,4.5,1064891076,"Matrix, The (1999)",Action|Sci-Fi|Thriller


In [14]:
df_matrix_starwars = pd.concat([df_matrix, df_starwars], ignore_index = True)
df_matrix_starwars

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,6,2571,1.0,1109258202,"Matrix, The (1999)",Action|Sci-Fi|Thriller
1,8,2571,5.0,1154464738,"Matrix, The (1999)",Action|Sci-Fi|Thriller
2,9,2571,5.0,938628450,"Matrix, The (1999)",Action|Sci-Fi|Thriller
3,10,2571,5.0,942766515,"Matrix, The (1999)",Action|Sci-Fi|Thriller
4,13,2571,3.0,1331380888,"Matrix, The (1999)",Action|Sci-Fi|Thriller
...,...,...,...,...,...,...
1447,669,260,5.0,1015829081,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi
1448,669,1210,3.0,1015829115,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Sci-Fi
1449,670,2628,4.0,938782234,Star Wars: Episode I - The Phantom Menace (1999),Action|Adventure|Sci-Fi
1450,671,260,5.0,1064891246,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi


In [15]:
moviesIDList = df_matrix_starwars['movieId'].to_list()
moviesIDList

[2571,
 2571,
 2571,
 2571,
 2571,
 2571,
 6365,
 6934,
 ...]

In [16]:
set_df_matrix_starwars = set(moviesIDList)
set_df_matrix_starwars

{260, 1196, 1210, 2571, 2628, 5378, 6365, 6934, 33493, 61160, 79006, 122886}

In [17]:
list_df_starwars = df_starwars['movieId'].to_list()
set_df_starwars = set(list_df_starwars)
set_df_starwars

{260, 1196, 1210, 2628, 5378, 33493, 61160, 79006, 122886}

In [18]:
list_df_matrix = df_matrix['movieId'].to_list()
set_df_matrix = set(list_df_matrix)
set_df_matrix

{2571, 6365, 6934}

In [19]:
frequent_itemsets=apriori(rat, min_support=0.2, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets

Unnamed: 0,support,itemsets,length
0,0.332340,(1),1
1,0.277198,(32),1
2,0.277198,(47),1
3,0.293592,(50),1
4,0.309985,(110),1
...,...,...,...
130,0.226528,"(296, 356, 318)",3
131,0.235469,"(296, 593, 318)",3
132,0.202683,"(296, 480, 356)",3
133,0.241431,"(296, 593, 356)",3


In [20]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(1),(260),0.332340,0.406855,0.207154,0.623318,1.532039,0.071939,1.574658
1,(260),(1),0.406855,0.332340,0.207154,0.509158,1.532039,0.071939,1.360233
2,(296),(1),0.469449,0.332340,0.202683,0.431746,1.299110,0.046666,1.174933
3,(1),(296),0.332340,0.469449,0.202683,0.609865,1.299110,0.046666,1.359919
4,(1),(356),0.332340,0.476900,0.211624,0.636771,1.335230,0.053132,1.440139
...,...,...,...,...,...,...,...,...,...
197,"(593, 318)",(356),0.292101,0.476900,0.219076,0.750000,1.572656,0.079773,2.092399
198,"(356, 318)",(593),0.295082,0.429210,0.219076,0.742424,1.729745,0.092424,2.216008
199,(593),"(356, 318)",0.429210,0.295082,0.219076,0.510417,1.729745,0.092424,1.439833
200,(356),"(593, 318)",0.476900,0.292101,0.219076,0.459375,1.572656,0.079773,1.309408


In [21]:
combinations = []

for i in set_df_matrix:
    for j in set_df_starwars:
        combinations.append([i, j])
        
print(combinations)

[[2571, 5378], [2571, 260], [2571, 2628], [2571, 122886], [2571, 61160], [2571, 1196], [2571, 33493], [2571, 1210], [2571, 79006], [6365, 5378], [6365, 260], [6365, 2628], [6365, 122886], [6365, 61160], [6365, 1196], [6365, 33493], [6365, 1210], [6365, 79006], [6934, 5378], [6934, 260], [6934, 2628], [6934, 122886], [6934, 61160], [6934, 1196], [6934, 33493], [6934, 1210], [6934, 79006]]
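
The nested loop above builds a Cartesian product of the two ID sets. A possible alternative, sketched here under the assumption that only the membership of each pair matters, uses `itertools.product` and stores every pair as a frozenset so that later comparisons do not depend on element order. The Star Wars set is shortened for the example.

```python
from itertools import product

set_df_matrix = {2571, 6365, 6934}
set_df_starwars = {260, 1196, 1210}  # shortened for the example

# One frozenset per (Matrix movie, Star Wars movie) pair
pairs = {frozenset(p) for p in product(set_df_matrix, set_df_starwars)}

print(len(pairs))
print(frozenset({1196, 2571}) in pairs)
```

Because the antecedents in the rules table are also frozensets, they can be tested against `pairs` directly.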


In [22]:
# Compare as sets so the element order inside each antecedent does not matter
pairs = {frozenset(c) for c in combinations}
for antecedent in rules['antecedents']:
    if antecedent in pairs:
        print(list(antecedent))

[2571, 1196]
[2571, 260]


In [23]:
rules[(rules['antecedents'] == {2571, 1196}) | (rules['antecedents'] == {2571, 260})]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
166,"(2571, 1196)",(260),0.217586,0.406855,0.202683,0.931507,2.289528,0.114157,8.659911
168,"(2571, 260)",(1196),0.23994,0.33234,0.202683,0.84472,2.541737,0.122941,4.299732


In [24]:
# Map each movieId to its title once, then look up the consequent of each rule
titles = movies_df.set_index('movieId')['title']
title1 = titles.loc[next(iter(rules.loc[166, 'consequents']))]
title2 = titles.loc[next(iter(rules.loc[168, 'consequents']))]

print(title1 + '\n' + title2)

Star Wars: Episode IV - A New Hope (1977)
Star Wars: Episode V - The Empire Strikes Back (1980)


For a person who likes Star Wars and The Matrix, I would recommend Star Wars: Episode IV - A New Hope (1977) and Star Wars: Episode V - The Empire Strikes Back (1980). Which of the two applies depends on which Star Wars film the person already likes. Both rules have a high confidence (0.93 and 0.84): about 93% of the users who liked The Matrix and Episode V also liked Episode IV, and about 84% of the users who liked The Matrix and Episode IV also liked Episode V. Their conviction values (8.66 and 4.30) are well above 1 and their lift values (2.29 and 2.54) are above 1, which indicates that the consequent depends strongly on the antecedent rather than co-occurring with it by chance.

In [25]:
# Question B
df_pulpfiction = movie_ratings_df[movie_ratings_df['title'].str.contains('Pulp Fiction')].copy()
df_pulpfiction

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
49,2,296,4.0,835355395,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
100,3,296,4.5,1298862418,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
156,4,296,5.0,949895708,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
590,8,296,4.0,1154465380,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
794,11,296,5.0,1391658423,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
...,...,...,...,...,...,...
99295,665,296,4.0,995233236,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
99709,666,296,4.0,838920789,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
99761,667,296,5.0,847271221,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
99801,668,296,5.0,993613478,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller


In [26]:
df_jurassicpark = movie_ratings_df[movie_ratings_df['title'].str.contains('Jurassic Park')].copy()
df_jurassicpark

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
72,2,480,4.0,835355643,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller
171,4,480,5.0,949810582,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller
243,4,1544,3.0,949811230,"Lost World: Jurassic Park, The (1997)",Action|Adventure|Sci-Fi|Thriller
382,5,1544,3.5,1163374562,"Lost World: Jurassic Park, The (1997)",Action|Adventure|Sci-Fi|Thriller
522,7,480,4.0,851869161,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller
...,...,...,...,...,...,...
98771,664,480,3.5,1343747011,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller
99674,665,4638,3.0,999571913,Jurassic Park III (2001),Action|Adventure|Sci-Fi|Thriller
99718,666,480,4.0,838920959,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller
99783,667,480,4.0,847271278,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller


In [27]:
df_pulp_jurassic = pd.concat([df_pulpfiction, df_jurassicpark], ignore_index = True)
df_pulp_jurassic

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,2,296,4.0,835355395,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
1,3,296,4.5,1298862418,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
2,4,296,5.0,949895708,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
3,8,296,4.0,1154465380,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
4,11,296,5.0,1391658423,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
...,...,...,...,...,...,...
691,664,480,3.5,1343747011,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller
692,665,4638,3.0,999571913,Jurassic Park III (2001),Action|Adventure|Sci-Fi|Thriller
693,666,480,4.0,838920959,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller
694,667,480,4.0,847271278,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller


In [28]:
moviesIDList2 = df_pulp_jurassic['movieId'].to_list()
moviesIDList2

[296,
 296,
 296,
 ...]

In [29]:
set_df_pulp_jurassic = set(moviesIDList2)
set_df_pulp_jurassic

{296, 480, 1544, 4638}

In [30]:
list_df_pulpfiction = df_pulpfiction['movieId'].to_list()
set_df_pulpfiction = set(list_df_pulpfiction)
set_df_pulpfiction

{296}

In [31]:
list_df_jurassicpark = df_jurassicpark['movieId'].to_list()
set_df_jurassicpark = set(list_df_jurassicpark)
set_df_jurassicpark

{480, 1544, 4638}

In [32]:
frequent_itemsets=apriori(rat, min_support=0.2, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets

Unnamed: 0,support,itemsets,length
0,0.332340,(1),1
1,0.277198,(32),1
2,0.277198,(47),1
3,0.293592,(50),1
4,0.309985,(110),1
...,...,...,...
130,0.226528,"(296, 356, 318)",3
131,0.235469,"(296, 593, 318)",3
132,0.202683,"(296, 480, 356)",3
133,0.241431,"(296, 593, 356)",3


In [33]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(1),(260),0.332340,0.406855,0.207154,0.623318,1.532039,0.071939,1.574658
1,(260),(1),0.406855,0.332340,0.207154,0.509158,1.532039,0.071939,1.360233
2,(296),(1),0.469449,0.332340,0.202683,0.431746,1.299110,0.046666,1.174933
3,(1),(296),0.332340,0.469449,0.202683,0.609865,1.299110,0.046666,1.359919
4,(1),(356),0.332340,0.476900,0.211624,0.636771,1.335230,0.053132,1.440139
...,...,...,...,...,...,...,...,...,...
197,"(593, 318)",(356),0.292101,0.476900,0.219076,0.750000,1.572656,0.079773,2.092399
198,"(356, 318)",(593),0.295082,0.429210,0.219076,0.742424,1.729745,0.092424,2.216008
199,(593),"(356, 318)",0.429210,0.295082,0.219076,0.510417,1.729745,0.092424,1.439833
200,(356),"(593, 318)",0.476900,0.292101,0.219076,0.459375,1.572656,0.079773,1.309408


In [34]:
combinations = []

for i in set_df_pulpfiction:
    for j in set_df_jurassicpark:
        combinations.append([i, j])
        
print(combinations)

[[296, 480], [296, 1544], [296, 4638]]


In [35]:
# Compare as sets so the element order inside each antecedent does not matter
pairs = {frozenset(c) for c in combinations}
for antecedent in rules['antecedents']:
    if antecedent in pairs:
        print(list(antecedent))

[296, 480]


In [36]:
rules[(rules['antecedents'] == {296, 480})]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
184,"(296, 480)",(356),0.244411,0.4769,0.202683,0.829268,1.738872,0.086123,3.063871


In [37]:
# Map each movieId to its title once, then look up the consequent of rule 184
titles = movies_df.set_index('movieId')['title']
title = titles.loc[next(iter(rules.loc[184, 'consequents']))]

print(title)

Forrest Gump (1994)


For a person who likes Pulp Fiction (1994) and Jurassic Park (1993), I would recommend Forrest Gump (1994). The rule has a confidence of 0.83, meaning that about 83% of the users who liked both Pulp Fiction (1994) and Jurassic Park (1993) also liked Forrest Gump (1994). Its conviction of 3.06 is well above 1 and its lift of 1.74 is above 1, so liking Forrest Gump (1994) is highly dependent on the antecedent rather than occurring independently of it.

##### Question C
Confidence measures how often the items in Y appear in the transactions that contain X: it is the number of transactions that contain both X and Y divided by the number of transactions that contain X. Conviction is defined as (1 - support(Y)) / (1 - confidence(X -> Y)). A conviction of 1 means that X and Y are unrelated, and a high conviction means that the consequent is highly dependent on the antecedent. If a perfect confidence score of 1 exists, the denominator becomes 0 (1 minus 1), and the conviction score is defined as "inf", which means the rule never fails.

Lines 147 and 148 are built from the same itemset (296, 480, 356), so they share the same support (0.202683). They differ because swapping the antecedent and the consequent changes both the antecedent support and the consequent support. For line 147, the confidence (0.829) and the conviction (3.06) are both high, which means that a person who likes Pulp Fiction (1994) and Jurassic Park (1993) will most likely like Forrest Gump (1994), because the consequent is highly dependent on the antecedent. For line 148, the antecedent (296, 356) is more common (0.311 versus 0.244), so the same joint support gives a lower confidence (0.651), and the conviction (1.81) is closer to 1. A person who likes Pulp Fiction (1994) and Forrest Gump (1994) is therefore less certain to like Jurassic Park (1993), because the consequent is less dependent on the antecedent.
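
The values on lines 147 and 148 can be re-derived from the three support columns alone. The sketch below follows the metric definitions given in the mlxtend reference linked above. The `rule_metrics` helper is a hypothetical name, and because the inputs are the rounded values printed in the table, the results only match to about three decimal places.

```python
def rule_metrics(antecedent_support, consequent_support, joint_support):
    """Recompute rule metrics from supports (definitions per the mlxtend guide)."""
    confidence = joint_support / antecedent_support
    lift = confidence / consequent_support
    leverage = joint_support - antecedent_support * consequent_support
    # Conviction is infinite when the rule never fails (confidence == 1)
    if confidence < 1:
        conviction = (1 - consequent_support) / (1 - confidence)
    else:
        conviction = float("inf")
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }

# Inputs copied from lines 147 and 148 of the rules table
row_147 = rule_metrics(0.244411, 0.476900, 0.202683)  # (296, 480) -> (356)
row_148 = rule_metrics(0.311475, 0.366617, 0.202683)  # (296, 356) -> (480)

print(row_147)
print(row_148)
```

Both rows use the same joint support. Line 147 has the smaller antecedent support, which raises its confidence, and that higher confidence shrinks the denominator of conviction.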