## Final Exam WQD 7005 Data Mining 
## Question 4

### Name: Nurullainy binti Mat Rashid                   
### ID :  17036591

### Topic: Rule Mining of the Internet Movie Database (IMDb)

The objectives are to find frequent itemsets and to mine Association Rules using data from IMDb. The data was collected from the following website: https://www.imdb.com/search/title/?year=2017


To achieve this task, I will cover the following steps:

    1) Importing required libraries
    2) Creating a list from dataset (Question 1)
    3) Convert list to dataframe with boolean values
    4) Find frequently occurring itemsets using Apriori Algorithm
    5) Find frequently occurring itemsets using F-P Growth
    6) Mine the Association Rules

### 1) Importing required libraries

In [1]:
import pandas as pd
import numpy as np

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import fpgrowth
from mlxtend.frequent_patterns import association_rules

import csv

In [2]:
# Load dataset

df = pd.read_csv('movies_imdb_preprocessed.csv')
df.head()

Unnamed: 0,movie_name,year_released,runtime_in_min,genre,revenues,imdb_rating,user_votes,director,actor
0,Gladiator,2000,155,"Action, Adventure, Drama",187705427,8.5,1295546,Ridley Scott,"Russell Crowe, Joaquin Phoenix, Connie Nielsen..."
1,Memento,2000,113,"Mystery, Thriller",25544867,8.4,1088700,Christopher Nolan,"Guy Pearce, Carrie-Anne Moss, Joe Pantoliano, ..."
2,Snatch,2000,104,"Comedy, Crime",30328156,8.3,760646,Guy Ritchie,"Jason Statham, Brad Pitt, Benicio Del Toro, De..."
3,Requiem for a Dream,2000,102,Drama,3635482,8.3,742193,Darren Aronofsky,"Ellen Burstyn, Jared Leto, Jennifer Connelly, ..."
4,X-Men,2000,104,"Action, Adventure, Sci-Fi",157299717,7.4,558716,Bryan Singer,"Patrick Stewart, Hugh Jackman, Ian McKellen, F..."


In [3]:
df.isnull().sum()

movie_name         0
year_released      0
runtime_in_min     0
genre              0
revenues           0
imdb_rating        0
user_votes         0
director          78
actor             78
dtype: int64

In [4]:
df.dropna(how='any', inplace=True)

In [5]:
df.shape

(816, 9)

### 2) Creating a list from dataset (Question 1)



In [6]:
# Create new subset of dataset

revenue_genre = df[['revenues', 'genre']]

revenue_genre.head(10)

Unnamed: 0,revenues,genre
0,187705427,"Action, Adventure, Drama"
1,25544867,"Mystery, Thriller"
2,30328156,"Comedy, Crime"
3,3635482,Drama
4,157299717,"Action, Adventure, Sci-Fi"
5,233632142,"Adventure, Drama, Romance"
6,15070285,"Comedy, Crime, Drama"
7,95011339,"Drama, Mystery, Sci-Fi"
8,215409889,"Action, Adventure, Thriller"
9,166244045,"Comedy, Romance"


In [7]:
# Change format of revenues data

# Work on an explicit copy so the assignments below do not trigger SettingWithCopyWarning
revenue_genre = revenue_genre.copy()

revenue_genre['revenues'] = revenue_genre['revenues'].div(1000000)  # Change to Million notation
revenue_genre.info()

revenue_genre['revenues'] = revenue_genre['revenues'].round(0).astype(int)
revenue_genre.columns = ['revenues in mil', 'genre']  # Rename the columns

revenue_genre.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 816 entries, 0 to 815
Data columns (total 2 columns):
revenues    816 non-null float64
genre       816 non-null object
dtypes: float64(1), object(1)
memory usage: 19.1+ KB


Unnamed: 0,revenues in mil,genre
0,188,"Action, Adventure, Drama"
1,26,"Mystery, Thriller"
2,30,"Comedy, Crime"
3,4,Drama
4,157,"Action, Adventure, Sci-Fi"


The output above shows the revenues in millions in integer format. There are 298 unique revenue values among the 816 rows.

Next, consolidate the items into one transaction per revenue value (in this case, revenues in millions).
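The consolidation step can be illustrated without pandas. The following is a minimal sketch, assuming a handful of hypothetical (revenue, genre) pairs that are not taken from the dataset; `group_into_transactions` is an illustrative helper name, not part of any library.

```python
from collections import defaultdict

# Hypothetical (revenue in millions, genre string) pairs, for illustration only.
rows = [
    (4, "Drama"),
    (26, "Mystery, Thriller"),
    (4, "Crime, Drama, Musical"),
    (26, "Comedy, Drama"),
    (30, "Comedy, Crime"),
]

def group_into_transactions(pairs):
    """Collect all genre strings that share the same revenue value."""
    grouped = defaultdict(list)
    for revenue, genre in pairs:
        grouped[revenue].append(genre)
    # Sort by revenue so the result is ordered like a groupby on that column.
    return dict(sorted(grouped.items()))

transactions = group_into_transactions(rows)
for revenue, genres in transactions.items():
    print(revenue, genres)
```

Each value in `transactions` plays the role of one basket: every genre string that shares a revenue value ends up in the same list.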

In [8]:
revenue_genre['revenues in mil'].value_counts()

47     11
18     10
0       9
52      9
57      9
76      9
80      9
125     8
89      8
54      7
45      7
6       7
42      7
27      7
26      7
25      7
155     7
36      7
101     7
51      7
103     7
65      7
34      7
66      6
39      6
70      6
40      6
43      6
72      6
60      6
       ..
104     1
211     1
296     1
68      1
69      1
291     1
282     1
152     1
277     1
274     1
261     1
255     1
254     1
252     1
251     1
245     1
244     1
243     1
239     1
238     1
237     1
233     1
227     1
224     1
223     1
99      1
220     1
219     1
213     1
937     1
Name: revenues in mil, Length: 298, dtype: int64

In [9]:
# Group genre by revenue generated (298 unique revenue number) 

basket = revenue_genre.groupby(['revenues in mil'])['genre'].apply(list)

print("\n", basket[:10])


 revenues in mil
0    [Action, Crime, Thriller, Crime, Drama, Myster...
1    [Drama, Mystery, Sci-Fi, Action, Drama, Sci-Fi...
2    [Crime, Drama, Crime, Drama, Comedy, Romance, ...
3                                     [Drama, Romance]
4    [Drama, Crime, Drama, Musical, Comedy, Drama, ...
5    [Drama, Thriller, Animation, Adventure, Family...
6    [Comedy, Drama, Biography, Drama, History, Dra...
7     [Drama, Mystery, Thriller, Comedy, Drama, Drama]
8                 [Crime, Drama, Comedy, Crime, Drama]
9                             [Comedy, Drama, Romance]
Name: genre, dtype: object


In [10]:
# List all the genre in list format (for model preparation)

basket_list = list(basket)

print("\n", basket_list[:10])


 [['Action, Crime, Thriller', 'Crime, Drama, Mystery', 'Action, Crime, Drama', 'Crime, Drama, Sport', 'Adventure, Comedy, Sci-Fi', 'Drama', 'Drama, Fantasy, Romance', 'Comedy, Horror', 'Action, Adventure, Drama'], ['Drama, Mystery, Sci-Fi', 'Action, Drama, Sci-Fi', 'Action, Drama, Mystery', 'Drama, Thriller', 'Drama, Thriller', 'Drama'], ['Crime, Drama', 'Crime, Drama', 'Comedy, Romance, Sport', 'Drama, Horror, Romance', 'Biography, Crime, Drama'], ['Drama, Romance'], ['Drama', 'Crime, Drama, Musical', 'Comedy, Drama, Romance', 'Action, Comedy, Crime', 'Sci-Fi, Thriller'], ['Drama, Thriller', 'Animation, Adventure, Family', 'Drama, Mystery, Sci-Fi', 'Action, Drama, Sci-Fi', 'Animation, Drama, Fantasy'], ['Comedy, Drama', 'Biography, Drama, History', 'Drama, Romance', 'Action, Crime, Thriller', 'Drama, Mystery, Romance', 'Action, Adventure, Comedy', 'Comedy, Drama'], ['Drama, Mystery, Thriller', 'Comedy, Drama', 'Drama'], ['Crime, Drama', 'Comedy, Crime, Drama'], ['Comedy, Drama, Roman

### 3) Convert list to dataframe with Boolean values

In [11]:
# Convert list to dataframe with Boolean values

te = TransactionEncoder()
te_basket = te.fit(basket_list).transform(basket_list)

df2 = pd.DataFrame(te_basket, columns=te.columns_)
df2.head(10)

Unnamed: 0,"Action, Adventure","Action, Adventure, Biography","Action, Adventure, Comedy","Action, Adventure, Crime","Action, Adventure, Drama","Action, Adventure, Family","Action, Adventure, Fantasy","Action, Adventure, History","Action, Adventure, Horror","Action, Adventure, Mystery",...,Horror,"Horror, Mystery","Horror, Mystery, Thriller","Horror, Sci-Fi","Horror, Sci-Fi, Thriller","Horror, Thriller","Mystery, Sci-Fi, Thriller","Mystery, Thriller","Romance, Sci-Fi, Thriller","Sci-Fi, Thriller"
0,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
5,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
6,False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


The above table shows which genre combinations occur at each revenue value. Each row is one transaction (one revenue value in millions) and each column is one genre string. `True` indicates that at least one movie with that genre string falls under that revenue value, whereas `False` indicates that none does.
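Conceptually, the encoding step turns each basket into one boolean row over the sorted set of all items. The sketch below shows that idea in plain Python; `one_hot_encode` and the toy baskets are hypothetical and only approximate what `TransactionEncoder` produces.

```python
def one_hot_encode(transactions):
    """Return (columns, rows): sorted unique items and one boolean row per transaction."""
    columns = sorted({item for transaction in transactions for item in transaction})
    rows = [[column in transaction for column in columns] for transaction in transactions]
    return columns, rows

# Hypothetical baskets; each item is a whole genre string, as in basket_list.
toy_baskets = [
    ["Drama", "Comedy, Romance"],
    ["Drama"],
    ["Comedy, Romance", "Sci-Fi, Thriller"],
]

columns, rows = one_hot_encode(toy_baskets)
print(columns)
for row in rows:
    print(row)
```

Note that a genre string such as "Comedy, Romance" is treated as a single item here, which mirrors how the baskets in this notebook are built.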

### 4) Find frequently occurring itemsets using Apriori Algorithm

`Apriori` is an algorithm for frequent itemset mining and Association Rule learning over relational databases. The algorithm identifies the frequent individual items in the database and extends them to larger and larger itemsets as long as those itemsets appear sufficiently often in the database.

The frequent itemsets determined by `Apriori` can be used to generate `Association Rules` which highlight general trends in the database. This application is widely used in market basket analysis.
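The level-wise idea described above can be sketched in a few lines of plain Python. This is a simplified illustration under the assumption of a small in-memory list of transactions; `apriori_sketch` is a hypothetical name and this is not how `mlxtend` implements `apriori` internally.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for all itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current for sub in combinations(c, k - 1))}
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Hypothetical toy transactions.
toy = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"], ["A", "B", "C"]]
result = apriori_sketch(toy, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what gives Apriori its name: an itemset can only be frequent if all of its subsets are frequent, so most candidates are discarded before their support is counted.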

1)	Pros: Easy to code up

2)	Cons: May be slow on large datasets

3)	Works with: Numeric values, nominal values


#### General approach to Apriori algorithm:

    1)	Preparation: Any data type will work because we are storing sets.
    2)	Train: Use the Apriori algorithm to find frequent itemsets.
    3)	Test: Doesn’t apply.
    4)	Application: This will be used to find frequent itemsets and association rules between items.

`Support` and `Confidence` measure how interesting a rule is. The minimum support and minimum confidence parameters exclude rules whose `Support` or `Confidence` falls below the respective threshold. After experimenting with several minimum support values, I found that 0.01 works best for this dataset.

In [12]:
# Frequently occurring itemsets using Apriori Algorithm

frequent_itemsets_apriori = apriori(df2, min_support=0.01, 
                                    use_colnames=True).sort_values(by='support', ascending=False)
frequent_itemsets_apriori.head(10)

Unnamed: 0,support,itemsets
7,0.204698,"(Action, Adventure, Sci-Fi)"
31,0.11745,"(Animation, Adventure, Comedy)"
5,0.097315,"(Action, Adventure, Fantasy)"
39,0.083893,(Comedy)
45,0.080537,"(Comedy, Drama, Romance)"
3,0.07047,"(Action, Adventure, Drama)"
1,0.07047,"(Action, Adventure, Comedy)"
51,0.057047,"(Crime, Drama, Thriller)"
10,0.057047,"(Action, Comedy, Crime)"
48,0.057047,"(Comedy, Romance)"


The `Apriori` model finds 113 itemsets in this dataset.

The 1st row shows that (Action, Adventure, Sci-Fi) has a support of 0.204698, which means it occurs in 61 of the 298 transactions (one transaction per unique revenue value). Let's view the frequency of each itemset in a dataframe.

In [13]:
# Frequently occurring itemsets using Apriori Algorithm
# Adding new column frequency (number of occurrences) of each itemset

n_transactions = len(df2)  # 298 transactions, one per unique revenue value; support is a fraction of these
frequent_itemsets_apriori['frequency'] = frequent_itemsets_apriori['support'].mul(n_transactions)
frequent_itemsets_apriori['frequency'] = frequent_itemsets_apriori['frequency'].round(0).astype(int)
frequent_itemsets_apriori = frequent_itemsets_apriori[['itemsets', 'frequency', 'support']]

frequent_itemsets_apriori

Unnamed: 0,itemsets,frequency,support
7,"(Action, Adventure, Sci-Fi)",61,0.204698
31,"(Animation, Adventure, Comedy)",35,0.117450
5,"(Action, Adventure, Fantasy)",29,0.097315
39,(Comedy),25,0.083893
45,"(Comedy, Drama, Romance)",24,0.080537
3,"(Action, Adventure, Drama)",21,0.070470
1,"(Action, Adventure, Comedy)",21,0.070470
51,"(Crime, Drama, Thriller)",17,0.057047
10,"(Action, Comedy, Crime)",17,0.057047
48,"(Comedy, Romance)",17,0.057047


### 5) Find frequently occurring itemsets using F-P Growth (Frequent Pattern Growth) 

The `FP-Growth Algorithm` is an alternative way to find frequent itemsets. It uses a divide-and-conquer strategy whose core idea is pattern-fragment growth over a compact structure called the frequent-pattern tree (FP-tree). The FP-tree is an extended prefix tree that retains the itemset association information needed to find frequent patterns.

This method is generally reported to be more efficient and scalable than other algorithms such as the Apriori Algorithm for mining the complete set of frequent patterns.

1) Pros: Usually faster than Apriori.

2) Cons: For very large databases, the FP-tree may not fit in main memory. A common workaround is to partition the database into a set of smaller databases and then construct an FP-tree from each of them.

3) Works with: Nominal values

#### General approach to FP-growth algorithm

    1) Preparation: Discrete data is needed because we’re storing sets. For continuous data, it will need to be 
    quantized into discrete values.
    2) Train: Build an FP-tree and mine the tree.
    3) Test: Doesn’t apply.
    4) Application: This can be used to identify commonly occurring items that can be used to make decisions, 
    suggest items, make forecasts, and so on.
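The "build an FP-tree" step can be sketched as two passes over the transactions: one to count items, one to insert each transaction into a prefix tree. The code below is a minimal illustration that only builds the tree and does not mine it; `FPNode`, `build_fp_tree`, `show` and the toy transactions are hypothetical names and data, not part of `mlxtend`.

```python
from collections import Counter

class FPNode:
    """One node of the prefix tree: an item, its count, and its children."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Pass 1: count in how many transactions each item occurs.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_count}

    root = FPNode(None)
    # Pass 2: insert each transaction with its frequent items ordered by
    # descending count (ties broken alphabetically), sharing common prefixes.
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent

def show(node, depth=0):
    """Print the tree with indentation proportional to depth."""
    for child in sorted(node.children.values(), key=lambda n: n.item):
        print("  " * depth + f"{child.item}: {child.count}")
        show(child, depth + 1)

# Hypothetical toy transactions.
toy = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"], ["A", "B", "C"]]
root, frequent = build_fp_tree(toy, min_count=2)
show(root)
```

Because transactions that share a prefix share nodes, the tree is usually much smaller than the raw data, which is what lets the mining step avoid repeated scans of the database.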

After experimenting with several minimum support values, I found that 0.01 works best for this dataset.

In [14]:
# Frequently occurring itemsets using F-P Growth (Frequent Pattern Growth)

frequent_itemsets_fpgrowth = fpgrowth(df2, min_support=0.01, 
                                     use_colnames=True).sort_values(by='support', ascending=False)
frequent_itemsets_fpgrowth.head(10)

Unnamed: 0,support,itemsets
39,0.204698,"(Action, Adventure, Sci-Fi)"
37,0.11745,"(Animation, Adventure, Comedy)"
45,0.097315,"(Action, Adventure, Fantasy)"
33,0.083893,(Comedy)
15,0.080537,"(Comedy, Drama, Romance)"
0,0.07047,"(Action, Adventure, Drama)"
18,0.07047,"(Action, Adventure, Comedy)"
1,0.057047,"(Action, Crime, Thriller)"
32,0.057047,"(Crime, Drama, Thriller)"
16,0.057047,"(Action, Comedy, Crime)"


The `F-P Growth Algorithm` finds 113 itemsets in this dataset.

The 1st row shows that (Action, Adventure, Sci-Fi) has a support of 0.204698, which means it occurs in 61 of the 298 transactions (one transaction per unique revenue value). Let's view the frequency of each itemset in a dataframe.

In [15]:
# Frequently occurring itemsets using F-P Growth (Frequent Pattern Growth)
# Adding new column number of occurrences (frequency) of each itemset

n_transactions = len(df2)  # 298 transactions, one per unique revenue value; support is a fraction of these
frequent_itemsets_fpgrowth['frequency'] = frequent_itemsets_fpgrowth['support'].mul(n_transactions)
frequent_itemsets_fpgrowth['frequency'] = frequent_itemsets_fpgrowth['frequency'].round(0).astype(int)
frequent_itemsets_fpgrowth = frequent_itemsets_fpgrowth[['itemsets', 'frequency', 'support']]

frequent_itemsets_fpgrowth

Unnamed: 0,itemsets,frequency,support
39,"(Action, Adventure, Sci-Fi)",61,0.204698
37,"(Animation, Adventure, Comedy)",35,0.117450
45,"(Action, Adventure, Fantasy)",29,0.097315
33,(Comedy),25,0.083893
15,"(Comedy, Drama, Romance)",24,0.080537
0,"(Action, Adventure, Drama)",21,0.070470
18,"(Action, Adventure, Comedy)",21,0.070470
1,"(Action, Crime, Thriller)",17,0.057047
32,"(Crime, Drama, Thriller)",17,0.057047
16,"(Action, Comedy, Crime)",17,0.057047


### 6) Mine the Association Rules

`Association Rules` analysis is a technique to uncover how items are associated with each other. There are 3 common ways to measure association:

1) Measure 1: `Support`. This says how popular an itemset is, measured by the proportion of transactions in which the itemset appears. If sales of items beyond a certain proportion tend to have a significant impact on profits, that proportion can be used as the `support` threshold. Itemsets with `support` values above this threshold are then treated as significant itemsets.

2) Measure 2: `Confidence`. This says how likely item B is purchased when item A is purchased, expressed as {A -> B}. It is measured by the proportion of transactions with item A in which item B also appears. One drawback of the `confidence` measure is that it might misrepresent the importance of an association, because it only accounts for how popular A is, but not B. If B is also very popular in general, there will be a higher chance that a transaction containing A also contains B, thus inflating the confidence measure. To account for the base popularity of both items, we use a third measure called `Lift`.

3) Measure 3: `Lift`. This says how likely item B is purchased when item A is purchased, while controlling for how popular item B is. It is the ratio of the probability of having B in the basket given that A is present to the probability of having B in the basket without any knowledge about the presence of A.

    a) Lift {A -> B} = 1 means no association between the items.
    b) Lift {A -> B} > 1 means that item B is likely to be bought if item A is bought.
    c) Lift {A -> B} < 1 means that item B is unlikely to be bought if item A is bought.
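The three measures above can be computed directly from a list of transactions. The following is a small sketch on hypothetical toy baskets (not the IMDb data); the function names are illustrative only.

```python
def support(itemset, transactions):
    """Proportion of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(a, b, transactions):
    """Proportion of transactions containing A that also contain B."""
    return support(set(a) | set(b), transactions) / support(a, transactions)

def lift(a, b, transactions):
    """Confidence of {A -> B} divided by the base popularity of B."""
    return confidence(a, b, transactions) / support(b, transactions)

# Hypothetical toy baskets with single-label items.
toy = [
    ["Drama", "Comedy"],
    ["Drama", "Comedy"],
    ["Drama"],
    ["Comedy"],
    ["Action"],
]

s = support(["Drama", "Comedy"], toy)
c = confidence(["Drama"], ["Comedy"], toy)
l = lift(["Drama"], ["Comedy"], toy)
print(f"support={s:.3f} confidence={c:.3f} lift={l:.3f}")
```

In this toy example the lift is slightly above 1, so "Comedy" appears with "Drama" a little more often than its overall popularity alone would suggest.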

#### 6a) Mine the Association Rules using Apriori Algorithm

In [41]:
# Generate the Association Rules using Apriori Algorithm with their corresponding support, confidence and lift. 

rules_apriori = association_rules(frequent_itemsets_apriori, metric="lift", min_threshold=1)  # min_threshold is the minimum value of the chosen metric (here lift), so only rules with lift >= 1 are kept
rules_apriori = rules_apriori.sort_values(by='lift', ascending=False)

In [42]:
# View top 5 rules 

rules_apriori.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
71,"(Comedy, Drama)","(Comedy, Action, Adventure, Comedy)",0.053691,0.016779,0.010067,0.1875,11.175,0.009166,1.210119
70,"(Comedy, Action, Adventure, Comedy)","(Comedy, Drama)",0.016779,0.053691,0.010067,0.6,11.175,0.009166,2.365772
68,"(Comedy, Drama, Comedy)","(Action, Adventure, Comedy)",0.013423,0.07047,0.010067,0.75,10.642857,0.009121,3.718121
73,"(Action, Adventure, Comedy)","(Comedy, Drama, Comedy)",0.07047,0.013423,0.010067,0.142857,10.642857,0.009121,1.151007
47,"(Comedy, Crime, Drama)","(Comedy, Drama, Romance)",0.020134,0.080537,0.010067,0.5,6.208333,0.008446,1.838926


Each rule consists of an `Antecedent` and a `Consequent`, both of which are sets of genre items. Each item is a full genre string from the dataset, so an antecedent displayed as (Comedy, Action, Adventure, Comedy) appears to contain the two items "Comedy" and "Action, Adventure, Comedy". Note that the implication here is co-occurrence, not causality.

`Antecedent support` is the proportion of transactions that contain the antecedent A.

`Consequent support` is the proportion of transactions that contain the consequent B.

`Leverage` is the difference between the observed frequency of A and B appearing together and the frequency that would be expected if A and B were independent. A leverage value of 0 indicates independence.

A high `conviction` value means that the consequent is highly dependent on the antecedent.
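As a worked check, leverage and conviction can be recomputed from the three support values of a rule. The sketch below uses the standard definitions, leverage = support(A and B) - support(A) * support(B) and conviction = (1 - support(B)) / (1 - confidence). The counts 5, 16 and 3 out of 298 transactions are an assumption inferred from the rounded supports of the rule (Comedy, Action, Adventure, Comedy) -> (Comedy, Drama) in the table above.

```python
def leverage(support_ab, support_a, support_b):
    """Observed joint support minus the joint support expected under independence."""
    return support_ab - support_a * support_b

def conviction(support_ab, support_a, support_b):
    """(1 - support(B)) / (1 - confidence); infinite when confidence is exactly 1."""
    conf = support_ab / support_a
    if conf == 1:
        return float("inf")
    return (1 - support_b) / (1 - conf)

# Assumed counts, inferred from the rounded supports shown in the rules table.
n = 298              # number of transactions (unique revenue values)
support_a = 5 / n    # antecedent support, about 0.016779
support_b = 16 / n   # consequent support, about 0.053691
support_ab = 3 / n   # rule support, about 0.010067

lev = leverage(support_ab, support_a, support_b)
conv = conviction(support_ab, support_a, support_b)
print(round(lev, 6), round(conv, 6))
```

Under these assumed counts the results agree with the `leverage` and `conviction` columns reported for that rule to the precision shown.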

The maximum value of `Lift` is `11.175` and the maximum value of `Confidence` is `0.75` among the `Association Rules using Apriori Algorithm`. I want to view the rules with large `Lift` and `Confidence`, namely `Lift` greater than 1 and `Confidence` of at least 0.5. Such rules indicate that genre B is likely to be chosen if genre A is chosen.

In [25]:
# Filter the dataframe for Lift > 1 and high confidence >= 0.5

rules_apriori[(rules_apriori['lift'] > 1) & 
              (rules_apriori['confidence'] >= 0.5)].sort_values(by='lift', ascending=0)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
70,"(Comedy, Action, Adventure, Comedy)","(Comedy, Drama)",0.016779,0.053691,0.010067,0.6,11.175,0.009166,2.365772
68,"(Comedy, Drama, Comedy)","(Action, Adventure, Comedy)",0.013423,0.07047,0.010067,0.75,10.642857,0.009121,3.718121
47,"(Comedy, Crime, Drama)","(Comedy, Drama, Romance)",0.020134,0.080537,0.010067,0.5,6.208333,0.008446,1.838926
69,"(Comedy, Drama, Action, Adventure, Comedy)",(Comedy),0.020134,0.083893,0.010067,0.5,5.96,0.008378,1.832215
45,"(Comedy, Crime)",(Comedy),0.020134,0.083893,0.010067,0.5,5.96,0.008378,1.832215
59,"(Drama, Sport)","(Action, Adventure, Sci-Fi)",0.013423,0.204698,0.010067,0.75,3.663934,0.007319,3.181208


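Now I want to view the rules with a small `Lift`, that is `Lift` less than 1 (the `Confidence` condition of at least 0 keeps every rule). A `Lift` below 1 means that genre B is unlikely to be chosen if genre A is chosen. Note that `min_threshold=1` on `lift` discards such rules, so they only appear when the rule set is generated with a lower `min_threshold`.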

In [27]:
# Filter the dataframe for Lift < 1 (confidence >= 0 keeps every rule)

rules_apriori[(rules_apriori['lift'] < 1) & 
              (rules_apriori['confidence'] >= 0)].sort_values(by='lift', ascending=1)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
79,(Comedy),"(Action, Adventure, Sci-Fi)",0.083893,0.204698,0.010067,0.12,0.58623,-0.007106,0.903752
78,"(Action, Adventure, Sci-Fi)",(Comedy),0.204698,0.083893,0.010067,0.04918,0.58623,-0.007106,0.963492
16,"(Action, Adventure, Sci-Fi)","(Action, Adventure, Fantasy)",0.204698,0.097315,0.013423,0.065574,0.673827,-0.006497,0.966031
17,"(Action, Adventure, Fantasy)","(Action, Adventure, Sci-Fi)",0.097315,0.204698,0.013423,0.137931,0.673827,-0.006497,0.92255
68,"(Action, Adventure, Sci-Fi)","(Action, Adventure, Drama)",0.204698,0.07047,0.010067,0.04918,0.697892,-0.004358,0.977609
69,"(Action, Adventure, Drama)","(Action, Adventure, Sci-Fi)",0.07047,0.204698,0.010067,0.142857,0.697892,-0.004358,0.927852
64,"(Action, Adventure, Sci-Fi)","(Comedy, Romance)",0.204698,0.057047,0.010067,0.04918,0.862102,-0.00161,0.991726
74,"(Action, Adventure, Sci-Fi)","(Action, Crime, Thriller)",0.204698,0.057047,0.010067,0.04918,0.862102,-0.00161,0.991726
75,"(Action, Crime, Thriller)","(Action, Adventure, Sci-Fi)",0.057047,0.204698,0.010067,0.176471,0.862102,-0.00161,0.965724
65,"(Comedy, Romance)","(Action, Adventure, Sci-Fi)",0.057047,0.204698,0.010067,0.176471,0.862102,-0.00161,0.965724


In [28]:
# Filter the dataframe for Lift == 1 (exact floating-point comparison; confidence >= 0 keeps every rule)

rules_apriori[(rules_apriori['lift'] == 1) & 
              (rules_apriori['confidence'] >= 0)].sort_values(by='lift', ascending=1)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction


#### Findings 1:

1) There are 74 rules with a `Lift` value greater than 1 when the minimum `lift` threshold (`min_threshold`) is set to 1.

2) A `Lift` value greater than 1 indicates a positive association between `Antecedents` and `Consequents`. The greater the `Lift`, the greater the chance that the genre in `Consequents` is chosen. Here, if a viewer has already watched movies with the genres (Comedy, Action, Adventure, Comedy), the viewer will likely watch a (Comedy, Drama) movie.

3) Rules involving the Comedy genre have high `Lift` values.

4) `Lift` is the measure that can help movie producers decide what kind of movie genre to produce next, based on the revenue generated by individual movies.

5) These 74 rules also have a wide range of `Confidence` values, between 0.04 and 0.75.

6) 14 rules show that the genre in the antecedent does not increase the chance of a viewer watching the genre in the consequent. E.g., a viewer who likes to watch (Action, Adventure, Sci-Fi) movies most likely does not watch (Comedy) movies.

7) There is no Association Rule with a `Lift` value equal to 1.

#### 6b) Mine the Association Rules using F-P Growth

In [39]:
rules_fpgrowth = association_rules(frequent_itemsets_fpgrowth, metric="lift", min_threshold=1)   # min_threshold is the minimum value of the chosen metric (here lift), so only rules with lift >= 1 are kept
rules_fpgrowth = rules_fpgrowth.sort_values(by='lift', ascending=False)

In [40]:
rules_fpgrowth.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
32,"(Comedy, Action, Adventure, Comedy)","(Comedy, Drama)",0.016779,0.053691,0.010067,0.6,11.175,0.009166,2.365772
33,"(Comedy, Drama)","(Comedy, Action, Adventure, Comedy)",0.053691,0.016779,0.010067,0.1875,11.175,0.009166,1.210119
30,"(Comedy, Drama, Comedy)","(Action, Adventure, Comedy)",0.013423,0.07047,0.010067,0.75,10.642857,0.009121,3.718121
35,"(Action, Adventure, Comedy)","(Comedy, Drama, Comedy)",0.07047,0.013423,0.010067,0.142857,10.642857,0.009121,1.151007
57,"(Comedy, Crime, Drama)","(Comedy, Drama, Romance)",0.020134,0.080537,0.010067,0.5,6.208333,0.008446,1.838926


The maximum value of `Lift` is `11.175` and the maximum value of `Confidence` is `0.75` among the `Association Rules using F-P Growth Algorithm`, the same as for the `Apriori Algorithm`. I want to view the rules with large `Lift` and `Confidence`, namely `Lift` greater than 1 and `Confidence` of at least 0.5.

In [34]:
# Filter the dataframe for Lift > 1 and high confidence >= 0.5

rules_fpgrowth[(rules_fpgrowth['lift'] > 1) & 
               (rules_fpgrowth['confidence'] >= 0.5)].sort_values(by='lift', ascending=0)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
32,"(Comedy, Action, Adventure, Comedy)","(Comedy, Drama)",0.016779,0.053691,0.010067,0.6,11.175,0.009166,2.365772
30,"(Comedy, Drama, Comedy)","(Action, Adventure, Comedy)",0.013423,0.07047,0.010067,0.75,10.642857,0.009121,3.718121
57,"(Comedy, Crime, Drama)","(Comedy, Drama, Romance)",0.020134,0.080537,0.010067,0.5,6.208333,0.008446,1.838926
49,"(Comedy, Crime)",(Comedy),0.020134,0.083893,0.010067,0.5,5.96,0.008378,1.832215
31,"(Comedy, Drama, Action, Adventure, Comedy)",(Comedy),0.020134,0.083893,0.010067,0.5,5.96,0.008378,1.832215
43,"(Drama, Sport)","(Action, Adventure, Sci-Fi)",0.013423,0.204698,0.010067,0.75,3.663934,0.007319,3.181208


Next are the rules with a `Lift` value less than 1 (the `Confidence` condition of at least 0 keeps every rule):

In [36]:
# Filter the dataframe for Lift < 1 (confidence >= 0 keeps every rule)

rules_fpgrowth[(rules_fpgrowth['lift'] < 1) &
              (rules_fpgrowth['confidence'] >= 0)].sort_values(by='lift', ascending=1)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
63,(Comedy),"(Action, Adventure, Sci-Fi)",0.083893,0.204698,0.010067,0.12,0.58623,-0.007106,0.903752
62,"(Action, Adventure, Sci-Fi)",(Comedy),0.204698,0.083893,0.010067,0.04918,0.58623,-0.007106,0.963492
16,"(Action, Adventure, Sci-Fi)","(Action, Adventure, Fantasy)",0.204698,0.097315,0.013423,0.065574,0.673827,-0.006497,0.966031
17,"(Action, Adventure, Fantasy)","(Action, Adventure, Sci-Fi)",0.097315,0.204698,0.013423,0.137931,0.673827,-0.006497,0.92255
72,"(Action, Adventure, Sci-Fi)","(Action, Adventure, Drama)",0.204698,0.07047,0.010067,0.04918,0.697892,-0.004358,0.977609
73,"(Action, Adventure, Drama)","(Action, Adventure, Sci-Fi)",0.07047,0.204698,0.010067,0.142857,0.697892,-0.004358,0.927852
46,"(Action, Adventure, Sci-Fi)","(Comedy, Romance)",0.204698,0.057047,0.010067,0.04918,0.862102,-0.00161,0.991726
78,"(Action, Adventure, Sci-Fi)","(Action, Crime, Thriller)",0.204698,0.057047,0.010067,0.04918,0.862102,-0.00161,0.991726
47,"(Comedy, Romance)","(Action, Adventure, Sci-Fi)",0.057047,0.204698,0.010067,0.176471,0.862102,-0.00161,0.965724
79,"(Action, Crime, Thriller)","(Action, Adventure, Sci-Fi)",0.057047,0.204698,0.010067,0.176471,0.862102,-0.00161,0.965724


In [37]:
# Filter the dataframe for Lift == 1 (exact floating-point comparison; confidence >= 0 keeps every rule)

rules_fpgrowth[(rules_fpgrowth['lift'] == 1) & 
              (rules_fpgrowth['confidence'] >= 0)].sort_values(by='lift', ascending=1)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction


#### Findings 2:

1) The output of the `F-P Growth algorithm` is the same as that of the `Apriori algorithm` for this case study. 74 rules have a high `Lift` value (greater than 1), which means the antecedent increases the chance that the movie genre in `Consequents` occurs, regardless of the `Confidence` value.

2) These 74 rules with high `Lift` values also have a wide range of `Confidence` values, between 0.04 and 0.75.

3) The `F-P Growth algorithm` is generally faster at processing the data than the `Apriori algorithm` because it scans the database only twice to build the FP-tree, whereas Apriori scans the database multiple times to generate the itemsets.

4) There are 14 rules with a `Lift` value less than 1.

5) There is no Association Rule with a `Lift` value equal to 1.

In [43]:
rules_apriori.to_csv('Apriori_Revenues_Genre.csv',mode = 'w', index=False, header=True)

In [44]:
rules_fpgrowth.to_csv('FPGrowth_Revenues_Genre.csv',mode = 'w', index=False, header=True)