## Background

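`MovieLens` is a dataset widely used in research on recommender-system algorithms. It contains data accumulated over many years by the [movie recommendation site](https://movielens.org/). We will use it to practice data processing and analysis.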

Download the MovieLens `ml-latest-small` dataset from [grouplens](http://files.grouplens.org/datasets/movielens/ml-latest-small.zip).
The dataset contains 4 CSV files:
- `tags.csv`, tags that users applied to movies:
    - userId
    - movieId
    - tag
    - timestamp
- `ratings.csv`, ratings that users gave to movies:
    - userId
    - movieId
    - rating
    - timestamp
- `movies.csv`, movie information:
    - movieId
    - title
    - genres
- `links.csv`, `id`s that link to other resources:
    - movieId
    - imdbId
    - tmdbId

## Requirements

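Aggregate three of the dataset's CSV files into a single CSV of movie information, with one row per movie. Each row should include the movie title, lead actors, average rating, and all tags.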

## Submission

A link to the source files in a GitHub project, plus screenshots of the results.

In [1]:
import pandas as pd

#### Load the data

In [28]:
links = pd.read_csv('E:\\Desktop\\thoughtwork\\file\\links.csv')
movies = pd.read_csv('E:\\Desktop\\thoughtwork\\file\\movies.csv')
ratings = pd.read_csv('E:\\Desktop\\thoughtwork\\file\\ratings.csv')
tags = pd.read_csv('E:\\Desktop\\thoughtwork\\file\\tags.csv')
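
The four `read_csv` calls above repeat the same absolute Windows path. As a minimal sketch, assuming all four files sit in one directory, the loading can be wrapped in a small helper. `load_movielens` and `FILE_NAMES` are hypothetical names introduced here for illustration; they are not part of the original notebook or of pandas.

```python
from pathlib import Path

import pandas as pd

# Hypothetical constant: the four file stems listed in the Background section.
FILE_NAMES = ("links", "movies", "ratings", "tags")


def load_movielens(data_dir):
    """Read the four CSV files found in data_dir into a dict of DataFrames."""
    data_dir = Path(data_dir)
    return {name: pd.read_csv(data_dir / f"{name}.csv") for name in FILE_NAMES}
```

With this helper, `frames = load_movielens(r'E:\Desktop\thoughtwork\file')` would give `frames['movies']`, `frames['ratings']`, and so on, and only one path needs changing when the data moves.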

In [4]:
links.tail()

| | movieId | imdbId | tmdbId |
|---|---|---|---|
| 9737 | 193581 | 5476944 | 432131.0 |
| 9738 | 193583 | 5914996 | 445030.0 |
| 9739 | 193585 | 6397426 | 479308.0 |
| 9740 | 193587 | 8391976 | 483455.0 |
| 9741 | 193609 | 101726 | 37891.0 |


In [5]:
movies.tail()

| | movieId | title | genres |
|---|---|---|---|
| 9737 | 193581 | Black Butler: Book of the Atlantic (2017) | Action\|Animation\|Comedy\|Fantasy |
| 9738 | 193583 | No Game No Life: Zero (2017) | Animation\|Comedy\|Fantasy |
| 9739 | 193585 | Flint (2017) | Drama |
| 9740 | 193587 | Bungo Stray Dogs: Dead Apple (2018) | Action\|Animation |
| 9741 | 193609 | Andrew Dice Clay: Dice Rules (1991) | Comedy |


In [6]:
ratings.tail()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 100831 | 610 | 166534 | 4.0 | 1493848402 |
| 100832 | 610 | 168248 | 5.0 | 1493850091 |
| 100833 | 610 | 168250 | 5.0 | 1494273047 |
| 100834 | 610 | 168252 | 5.0 | 1493846352 |
| 100835 | 610 | 170875 | 3.0 | 1493846415 |


In [7]:
tags.tail()

| | userId | movieId | tag | timestamp |
|---|---|---|---|---|
| 3678 | 606 | 7382 | for katie | 1171234019 |
| 3679 | 606 | 7936 | austere | 1173392334 |
| 3680 | 610 | 3265 | gun fu | 1493843984 |
| 3681 | 610 | 3265 | heroic bloodshed | 1493843978 |
| 3682 | 610 | 168248 | Heroic Bloodshed | 1493844270 |


#### 1. Compute the average rating of every movie

In [9]:
# Mean rating per movie; the result is indexed by movieId
a = ratings.groupby('movieId').agg({'rating':'mean'})

In [10]:
a

| movieId | rating |
|---|---|
| 1 | 3.920930 |
| 2 | 3.431818 |
| 3 | 3.259615 |
| 4 | 2.357143 |
| 5 | 3.071429 |
| ... | ... |
| 193581 | 4.000000 |
| 193583 | 3.500000 |
| 193585 | 3.500000 |
| 193587 | 3.500000 |
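
A mean on its own does not show how many ratings it is based on: an average from a single rating looks the same as one from hundreds. The sketch below is one possible extension, not part of the original steps. It uses pandas named aggregation on a made-up toy table (`toy_ratings` and its values are assumptions for illustration) to keep the rating count next to the mean.

```python
import pandas as pd

# Toy table with the same column names as ratings.csv; the values are made up.
toy_ratings = pd.DataFrame({
    "userId": [1, 2, 3, 1],
    "movieId": [10, 10, 10, 20],
    "rating": [4.0, 5.0, 3.0, 2.5],
})

# Named aggregation: one row per movieId, holding the mean rating
# and the number of ratings that the mean is based on.
summary = toy_ratings.groupby("movieId").agg(
    rating=("rating", "mean"),
    n_ratings=("rating", "count"),
)
print(summary)
```

Whether the extra `n_ratings` column belongs in the final CSV depends on the requirements; it is shown here only as an option.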


#### 2. Extract the tags from the tags table

In [49]:
# Keep only (movieId, tag) pairs and drop pairs repeated by different users
b = tags[['movieId','tag']].drop_duplicates()

In [53]:
k = b.sort_values(by=['movieId'],ascending=True)

In [61]:
# Join all tags of each movie into one string separated by '、'
res = pd.DataFrame(k.groupby(['movieId']).apply(lambda x:'、'.join(x['tag'])),columns=['tag'])

In [62]:
res

| movieId | tag |
|---|---|
| 1 | pixar、fun |
| 2 | magic board game、Robin Williams、fantasy、game |
| 3 | moldy、old |
| 5 | pregnancy、remake |
| 7 | remake |
| ... | ... |
| 183611 | Comedy、funny、Rachel McAdams |
| 184471 | video game adaptation、Alicia Vikander、adventure |
| 187593 | Josh Brolin、Ryan Reynolds、sarcasm |
| 187595 | Emilia Clarke、star wars |
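
The three cells above (deduplicate, sort, group and join) can also be written as one chain. The following is a sketch of an equivalent formulation on a made-up toy table (`toy_tags` is an assumption for illustration). `groupby` sorts its keys by default, so the explicit `sort_values` step is not needed for the result to come out ordered by `movieId`.

```python
import pandas as pd

# Toy table with the same column names as tags.csv; the values are made up.
toy_tags = pd.DataFrame({
    "userId": [1, 2, 3, 4],
    "movieId": [2, 1, 1, 1],
    "tag": ["fantasy", "pixar", "fun", "pixar"],
})

tag_table = (
    toy_tags[["movieId", "tag"]]
    .drop_duplicates()          # the second "pixar" for movie 1 is dropped
    .groupby("movieId")["tag"]  # groups come out sorted by movieId
    .agg("、".join)             # join each movie's tags into one string
    .reset_index()              # movieId becomes a regular column again
)
print(tag_table)
```

Note that `drop_duplicates` compares strings exactly, so tags that differ only in letter case would still count as distinct.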


#### 3. Extract the movie information

In [63]:
c = movies.loc[:,['movieId','title']]

In [64]:
c

| | movieId | title |
|---|---|---|
| 0 | 1 | Toy Story (1995) |
| 1 | 2 | Jumanji (1995) |
| 2 | 3 | Grumpier Old Men (1995) |
| 3 | 4 | Waiting to Exhale (1995) |
| 4 | 5 | Father of the Bride Part II (1995) |
| ... | ... | ... |
| 9737 | 193581 | Black Butler: Book of the Atlantic (2017) |
| 9738 | 193583 | No Game No Life: Zero (2017) |
| 9739 | 193585 | Flint (2017) |
| 9740 | 193587 | Bungo Stray Dogs: Dead Apple (2018) |


#### 4. Left joins

In [65]:
# Left join keeps every movie, including those without ratings
d = c.merge(a,how="left",on="movieId")

In [66]:
d

| | movieId | title | rating |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 3.920930 |
| 1 | 2 | Jumanji (1995) | 3.431818 |
| 2 | 3 | Grumpier Old Men (1995) | 3.259615 |
| 3 | 4 | Waiting to Exhale (1995) | 2.357143 |
| 4 | 5 | Father of the Bride Part II (1995) | 3.071429 |
| ... | ... | ... | ... |
| 9737 | 193581 | Black Butler: Book of the Atlantic (2017) | 4.000000 |
| 9738 | 193583 | No Game No Life: Zero (2017) | 3.500000 |
| 9739 | 193585 | Flint (2017) | 3.500000 |
| 9740 | 193587 | Bungo Stray Dogs: Dead Apple (2018) | 3.500000 |


In [67]:
# Second left join adds the tag string; movies without tags get a missing value
e = d.merge(res,how="left",on="movieId")

In [68]:
e

| | movieId | title | rating | tag |
|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 3.920930 | pixar、fun |
| 1 | 2 | Jumanji (1995) | 3.431818 | magic board game、Robin Williams、fantasy、game |
| 2 | 3 | Grumpier Old Men (1995) | 3.259615 | moldy、old |
| 3 | 4 | Waiting to Exhale (1995) | 2.357143 | |
| 4 | 5 | Father of the Bride Part II (1995) | 3.071429 | pregnancy、remake |
| ... | ... | ... | ... | ... |
| 9737 | 193581 | Black Butler: Book of the Atlantic (2017) | 4.000000 | |
| 9738 | 193583 | No Game No Life: Zero (2017) | 3.500000 | |
| 9739 | 193585 | Flint (2017) | 3.500000 | |
| 9740 | 193587 | Bungo Stray Dogs: Dead Apple (2018) | 3.500000 | |
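
The two left joins rely on each right-hand table having at most one row per `movieId`; otherwise a movie would appear on several rows. As a sketch, `merge` accepts a `validate` argument that raises an error when this assumption is broken. The toy frames below are made up for illustration and keep `movieId` as a regular column, whereas `a` and `res` in the notebook carry it as the index. Filling missing tags with an empty string is an optional choice, not something the original steps do.

```python
import pandas as pd

# Toy stand-ins for the three tables being joined; the values are made up.
toy_movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
toy_mean = pd.DataFrame({"movieId": [1, 2], "rating": [4.0, 3.5]})
toy_tag = pd.DataFrame({"movieId": [1], "tag": ["pixar、fun"]})

merged = (
    toy_movies
    # validate="one_to_one" raises MergeError if movieId repeats on either side
    .merge(toy_mean, how="left", on="movieId", validate="one_to_one")
    .merge(toy_tag, how="left", on="movieId", validate="one_to_one")
)

# Optional: show movies without tags as an empty string instead of NaN.
merged["tag"] = merged["tag"].fillna("")
print(merged)
```

Movie 3 has neither ratings nor tags in the toy data, yet it still appears in `merged`, which is the point of using a left join from the movies table.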


#### 5. Write to CSV

In [70]:
# Note: without index=False, the row index is written as an extra unnamed column
e.to_csv('total.csv')

#### 6. Preview the aggregated CSV

In [71]:
total = pd.read_csv('total.csv')

In [73]:
total.head(10)

| | Unnamed: 0 | movieId | title | rating | tag |
|---|---|---|---|---|---|
| 0 | 0 | 1 | Toy Story (1995) | 3.92093 | pixar、fun |
| 1 | 1 | 2 | Jumanji (1995) | 3.431818 | magic board game、Robin Williams、fantasy、game |
| 2 | 2 | 3 | Grumpier Old Men (1995) | 3.259615 | moldy、old |
| 3 | 3 | 4 | Waiting to Exhale (1995) | 2.357143 | |
| 4 | 4 | 5 | Father of the Bride Part II (1995) | 3.071429 | pregnancy、remake |
| 5 | 5 | 6 | Heat (1995) | 3.946078 | |
| 6 | 6 | 7 | Sabrina (1995) | 3.185185 | remake |
| 7 | 7 | 8 | Tom and Huck (1995) | 2.875 | |
| 8 | 8 | 9 | Sudden Death (1995) | 3.125 | |
| 9 | 9 | 10 | GoldenEye (1995) | 3.496212 | |
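
The `Unnamed: 0` column in the preview is the DataFrame's row index, which `to_csv` writes by default and `read_csv` then reads back as an ordinary column. The sketch below reproduces the effect on a two-row toy frame and shows that passing `index=False` avoids it. An in-memory `io.StringIO` buffer stands in for `total.csv` so the example does not touch the file system.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
})

# Default behaviour: the row index is written as a first, unnamed column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the real columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'movieId', 'title']
print(cols_without_index)  # ['movieId', 'title']
```

Applied to the notebook, this would mean writing `e.to_csv('total.csv', index=False)` in step 5, after which the preview would show only `movieId`, `title`, `rating`, and `tag`.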
