### Building a Movie Recommendation System with Association Rules
![image.png](attachment:8bf77266-511e-4e87-bcf0-a2b5340a7f20.png)
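- Use Apriori, an association-rule mining algorithm, to build a movie recommendation system:
  - Load the data
  - Preprocess the data
  - Generate frequent itemsets and association rules
- Generate a list of recommended movies from the association rules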

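### The Apriori Algorithm
- **Case study:**
  Beer and diapers: while analyzing its sales records, Walmart found that beer and diapers were often bought together. It moved the two products onto adjacent shelves, and beer sales did go up. A possible explanation: fathers buying diapers for the baby pick up some beer for themselves along the way.
- **Overview:**
  Apriori is one of the most influential algorithms for mining the frequent itemsets of Boolean association rules. Its name comes from the fact that it uses prior knowledge about frequent itemsets: every subset of a frequent itemset must itself be frequent.
  Below, a supermarket order example is used to explain the key concepts of association analysis: Support, Confidence, and Lift.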
 
  ![image.png](attachment:e0663e25-d0d7-4ee0-a2db-857a92930475.png)
  
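  - Support: the probability that an event occurs. In this example, it is the fraction of all orders that contain a given combination of items.
  
  Example: Support('Bread') = 4/5 = 0.8, Support('Milk') = 4/5 = 0.8,
     Support('Bread+Milk') = 3/5 = 0.6  
     
  - Confidence: a conditional probability, namely the probability that item B is bought given that item A is bought.
  
  Example: Confidence('Bread' -> 'Milk') = Support('Bread+Milk') / Support('Bread') = 0.6/0.8 = 0.75  
  
  - Lift: how much the presence of item A raises the probability that item B appears. Lift(A->B) = Confidence(A->B) / Support(B)
  
  Example: Lift('Bread' -> 'Milk') = 0.75/0.8 = 0.9375 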

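- **Lift falls into one of three cases:**
  - Lift(A->B) > 1: A raises the probability that B appears.
  - Lift(A->B) = 1: A neither raises nor lowers the probability that B appears.
  - Lift(A->B) < 1: A lowers the probability that B appears.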
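
The three metrics can be checked by hand with a few lines of Python. This is a minimal sketch under assumptions: the five transactions are a hypothetical toy basket chosen to reproduce the Bread/Milk figures quoted above (the original order table is only available as an image), and the helper names `support`, `confidence` and `lift` are illustrative, not part of any library.

```python
# Hypothetical toy transactions (an assumption, chosen to match the figures above).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support(A and B together) / Support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence(A -> B) / Support(B)."""
    return confidence(antecedent, consequent, transactions) / support(
        consequent, transactions
    )


print("Support(Bread)          =", support({"Bread"}, transactions))
print("Support(Bread+Milk)     =", support({"Bread", "Milk"}, transactions))
print("Confidence(Bread->Milk) =", round(confidence({"Bread"}, {"Milk"}, transactions), 4))
print("Lift(Bread->Milk)       =", round(lift({"Bread"}, {"Milk"}, transactions), 4))
```

With these assumed transactions the lift of Bread -> Milk comes out below 1, which corresponds to the third case: buying bread slightly lowers the probability of buying milk.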
  
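- **Principle:**
  Mining association rules with this algorithm comes down to finding the frequent itemsets:
   - Frequent itemset: an itemset whose support is greater than or equal to the minimum support (Min Support) threshold.
   - Infrequent itemset: an itemset whose support is below the minimum support.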

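- **Procedure:**
  1. Start with K = 1 and compute the support of every K-itemset.
  2. Discard the itemsets whose support is below the minimum support.
  3. If no itemsets remain, the frequent (K-1)-itemsets are the final result. Otherwise, set K = K + 1 and repeat steps 1-3 for the new K.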
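
The level-wise search described above can be sketched in plain Python. This is an illustrative sketch, not the implementation used later in the notebook: the function name `apriori_frequent_itemsets` and the toy transactions are assumptions, and the function collects the frequent itemsets of every level (not only the last one), since their supports are what rule generation needs.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Return {frozenset: support} for every frequent itemset (hypothetical helper)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # K = 1: every distinct item is a candidate 1-itemset.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Compute the support of each K-itemset and drop those below the threshold.
        level = {c: support(c) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_support}
        if not level:
            break  # nothing survives, so the previous level was the last one
        frequent.update(level)

        # Build (K+1)-candidates by joining frequent K-itemsets, keeping only
        # those whose K-item subsets are all frequent (the Apriori property).
        k += 1
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in level for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return frequent


# Hypothetical toy transactions (an assumption, not data from the notebook).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
frequent = apriori_frequent_itemsets(transactions, min_support=0.6)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The subset check is what makes the search tractable: a candidate is only counted against the data if all of its smaller subsets already passed the threshold.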

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import mlxtend
import numpy as np

#### Preparing the Movie Data

In [13]:
movie_data_file = './movie_dataset/movies_metadata.csv'
ratings_file = './movie_dataset/ratings_small.csv'

In [3]:
movie_data_df = pd.read_csv(movie_data_file)



In [5]:
movie_data_df.head(5)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [6]:
movie_data_df.describe()

Unnamed: 0,revenue,runtime,vote_average,vote_count
count,45460.0,45203.0,45460.0,45460.0
mean,11209350.0,94.128199,5.618207,109.897338
std,64332250.0,38.40781,1.924216,491.310374
min,0.0,0.0,0.0,0.0
25%,0.0,85.0,5.0,3.0
50%,0.0,95.0,6.0,10.0
75%,0.0,107.0,6.8,34.0
max,2787965000.0,1256.0,10.0,14075.0


In [7]:
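# Note: written without parentheses, so the cell shows the bound-method repr below;
# call movie_data_df.info() to print the per-column dtype and non-null summary.
movie_data_df.info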

<bound method DataFrame.info of        adult                              belongs_to_collection    budget  \
0      False  {'id': 10194, 'name': 'Toy Story Collection', ...  30000000   
1      False                                                NaN  65000000   
2      False  {'id': 119050, 'name': 'Grumpy Old Men Collect...         0   
3      False                                                NaN  16000000   
4      False  {'id': 96871, 'name': 'Father of the Bride Col...         0   
...      ...                                                ...       ...   
45461  False                                                NaN         0   
45462  False                                                NaN         0   
45463  False                                                NaN         0   
45464  False                                                NaN         0   
45465  False                                                NaN         0   

                                           

In [11]:
movie_data_df.count()

adult                    45466
belongs_to_collection     4494
budget                   45466
genres                   45466
homepage                  7782
id                       45466
imdb_id                  45449
original_language        45455
original_title           45466
overview                 44512
popularity               45461
poster_path              45080
production_companies     45463
production_countries     45463
release_date             45379
revenue                  45460
runtime                  45203
spoken_languages         45460
status                   45379
tagline                  20412
title                    45460
video                    45460
vote_average             45460
vote_count               45460
dtype: int64

In [12]:
movie_data_df.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')