### Preface
##### Task requirements
![image-20220330184012189](https://s2.loli.net/2022/03/30/Iums7FZYVqPlnRi.png)

### 1. Loading the data

In [2]:
import os
from pyspark import SparkContext
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName("MovieReviewAnalysis").getOrCreate()

#### Data File Structure:
##### Movies
    MovieID::Title::Genres
##### Ratings
    UserID::MovieID::Rating::Timestamp
##### Tags
    UserID::MovieID::Tag::Timestamp
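
To make the layout concrete, here is a minimal plain-Python sketch that splits one record on the `::` separator. The helper `parse_record` and the sample line are illustrative assumptions, not part of the notebook; the Spark reader in the next cell does the equivalent work through `option("sep", "::")`.

```python
# Field names follow the Ratings layout documented above.
RATINGS_FIELDS = ("UserID", "MovieID", "Rating", "Timestamp")


def parse_record(line, fields, sep="::"):
    """Split one '::'-delimited line and label the parts with field names.

    Hypothetical helper for illustration only; all values stay strings.
    """
    parts = line.rstrip("\n").split(sep)
    if len(parts) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(parts)}: {line!r}")
    return dict(zip(fields, parts))


# Illustrative sample line, not read from the dataset files.
record = parse_record("1::122::5::838985046", RATINGS_FIELDS)
print(record)
```

Every value comes back as a string here, which is why the Spark schemas in this notebook declare the column types explicitly.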

In [29]:
root_path = "D:\\大三下\\Big_Data_Application_Spark\\datasets\\ml-10M100K"

movies_path = os.path.join(root_path, "movies.dat")
ratings_path = os.path.join(root_path, "ratings.dat")
tags_path = os.path.join(root_path, "tags.dat")

movies = spark.read\
    .format("csv")\
    .option("sep", "::")\
    .schema("_c0 Int, _c1 STRING, _c2 STRING")\
    .load(movies_path)\
    .toDF("MovieID", "Title", "Genres")


# Ratings columns: UserID Int, MovieID Int, Rating Float, Timestamp STRING
ratings = spark.read\
    .format("csv")\
    .option("sep", "::")\
    .schema("_c0 Int, _c1 Int, _c2 Float, _c3 STRING")\
    .load(ratings_path)\
    .toDF("UserID", "MovieID", "Rating", "Timestamp")

tags = spark.read\
    .format("csv")\
    .option("sep", "::")\
    .schema("_c0 Int, _c1 Int, _c2 STRING, _c3 STRING")\
    .load(tags_path)\
    .toDF("UserID", "MovieID", "Tag", "Timestamp")

In [30]:
movies.show(3)
ratings.show(3)
tags.show(3)

+-------+--------------------+--------------------+
|MovieID|               Title|              Genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
+-------+--------------------+--------------------+
only showing top 3 rows

+------+-------+------+---------+
|UserID|MovieID|Rating|Timestamp|
+------+-------+------+---------+
|     1|    122|   5.0|838985046|
|     1|    185|   5.0|838983525|
|     1|    231|   5.0|838983392|
+------+-------+------+---------+
only showing top 3 rows

+------+-------+----------+----------+
|UserID|MovieID|       Tag| Timestamp|
+------+-------+----------+----------+
|    15|   4973|excellent!|1215184630|
|    20|   1747|  politics|1188263867|
|    20|   1747|    satire|1188263867|
+------+-------+----------+----------+
only showing top 3 rows



In [31]:
movies.printSchema()
ratings.printSchema()
tags.printSchema()

root
 |-- MovieID: integer (nullable = true)
 |-- Title: string (nullable = true)
 |-- Genres: string (nullable = true)

root
 |-- UserID: integer (nullable = true)
 |-- MovieID: integer (nullable = true)
 |-- Rating: float (nullable = true)
 |-- Timestamp: string (nullable = true)

root
 |-- UserID: integer (nullable = true)
 |-- MovieID: integer (nullable = true)
 |-- Tag: string (nullable = true)
 |-- Timestamp: string (nullable = true)



### 2. Data cleaning
#### 3.1 Removing duplicate rows

In [33]:
ratings = ratings.dropDuplicates()

#### 3.2 Filling missing values

In [34]:
from pyspark.sql.functions import col

# Count the missing (null) values in each column
columns=ratings.columns
missing_cnt=[ratings.select(col(x)).where(col(x).isNull()).count() for x in columns]
print(columns, missing_cnt)

['UserID', 'MovieID', 'Rating', 'Timestamp'] [0, 0, 0, 0]
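
The list comprehension above runs one `count()` per column, so the data is scanned once for every column. As a rough sketch of the single-pass alternative, the plain-Python function below tallies missing values for all columns in one loop. `count_missing` and the toy rows are assumptions made for illustration; the commented PySpark lines show how the same idea is usually written and have not been run against this dataset.

```python
def count_missing(rows, columns):
    """Count None values per column in a single pass over the rows."""
    counts = {c: 0 for c in columns}
    for row in rows:
        for c in columns:
            if row.get(c) is None:
                counts[c] += 1
    return counts


columns = ["UserID", "MovieID", "Rating", "Timestamp"]

# Toy rows with two deliberately missing values (not taken from the dataset).
toy_rows = [
    {"UserID": 1, "MovieID": 122, "Rating": 5.0, "Timestamp": "838985046"},
    {"UserID": 1, "MovieID": 185, "Rating": None, "Timestamp": "838983525"},
    {"UserID": 2, "MovieID": 231, "Rating": 3.0, "Timestamp": None},
]

missing = count_missing(toy_rows, columns)
print(missing)

# Possible PySpark equivalent (untested sketch):
# from pyspark.sql import functions as F
# ratings.select(
#     [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in ratings.columns]
# ).show()
```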


In [35]:
# ratings.select("Rating").groupBy().mean().show()

In [36]:
# # Fill missing ratings with the mean, rounded to the nearest integer
# ratings = ratings.fillna({"Rating": 4})
# ratings.show()

In [37]:
# Check again
# ratings.select(col("Rating")).where(col("Rating").isNull()).count()

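### 3. Solving the tasks
#### 2.1 Average rating across users
* Group the ratings by user and take each user's mean rating as that user's score, then average those per-user scores over all users.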

In [38]:
ratings.groupBy("UserID").avg("Rating")\
    .groupBy().avg("avg(Rating)").show()

+------------------+
|  avg(avg(Rating))|
+------------------+
|3.6136413483301495|
+------------------+
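
This figure is a mean of per-user means, so every user carries the same weight no matter how many ratings they gave. It generally differs from the mean over all rating rows. The toy data below is an assumption chosen to make the gap obvious; it is not drawn from the dataset.

```python
from collections import defaultdict
from statistics import mean

# (user_id, rating) pairs: user 1 rates often and high, user 2 rates once and low.
toy_ratings = [(1, 5.0), (1, 5.0), (1, 5.0), (1, 5.0), (2, 1.0)]

by_user = defaultdict(list)
for user_id, rating in toy_ratings:
    by_user[user_id].append(rating)

# Step 1: one mean per user. Step 2: mean of those means.
mean_of_user_means = mean([mean(r) for r in by_user.values()])

# For comparison: the mean over every rating row.
global_mean = mean([r for _, r in toy_ratings])

print(mean_of_user_means, global_mean)
```

Which of the two is the right answer depends on how the task defines "average user rating"; the notebook uses the per-user version.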



#### 2.2 Average rating across movies
* Same approach as above, except that the ratings are grouped by movie first.

In [39]:
ratings.groupBy("MovieID").avg("Rating")\
    .groupBy().avg("avg(Rating)").show()

+-----------------+
| avg(avg(Rating))|
+-----------------+
|3.191955422921221|
+-----------------+



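#### 2.3 Number of movies rated above the average
* Store the per-movie averages computed above, then run the query against that DataFrame.
* Count the movies that meet the condition; `filter` followed by `count` does the job an accumulator would.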

In [41]:
pre_movie_avg_ratings = ratings.groupBy("MovieID").avg("Rating")

In [42]:
# The threshold is the overall movie average computed in section 2.2
pre_movie_avg_ratings.filter("avg(Rating) > 3.191955422921221").count()

5900
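
The threshold in the cell above is pasted in by hand, which goes stale if the data changes. A minimal sketch of deriving it from the per-movie averages instead, using toy data assumed for illustration; the commented PySpark lines are an untested suggestion built on the notebook's `pre_movie_avg_ratings`.

```python
from statistics import mean

# movie_id -> list of ratings (toy values, not from the dataset)
toy_movie_ratings = {10: [4.0, 5.0], 20: [2.0], 30: [3.0, 3.0]}

# Per-movie averages, then the average of those averages as the threshold.
movie_avg = {m: mean(r) for m, r in toy_movie_ratings.items()}
threshold = mean(list(movie_avg.values()))

# Movies whose own average is strictly above the threshold.
above = [m for m, avg in movie_avg.items() if avg > threshold]
print(threshold, above, len(above))

# Possible PySpark equivalent (untested sketch):
# from pyspark.sql import functions as F
# threshold = pre_movie_avg_ratings.agg(F.avg("avg(Rating)")).first()[0]
# pre_movie_avg_ratings.filter(F.col("avg(Rating)") > threshold).count()
```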

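#### 2.4 Among high-rated movies (average > 3), find the user with the most ratings and that user's average rating
* Find the user who rated high-rated movies most often, then compute the mean of that user's ratings on those movies.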

In [61]:
df1 = pre_movie_avg_ratings.filter("avg(Rating) > 3")\
    .join(ratings, on="MovieID", how="inner")
df1.show()

+-------+------------------+------+------+----------+
|MovieID|       avg(Rating)|UserID|Rating| Timestamp|
+-------+------------------+------+------+----------+
|   5300|3.7041884816753927|     7|   4.0|1049764610|
|   2366|3.6127175743964064|    13|   4.0|1035217983|
|   1088|3.1912112010796223|    14|   4.0|1133571288|
|   3175|3.6245300142616363|    34|   2.0| 981824576|
|   3175|3.6245300142616363|    36|   4.0|1049772120|
|   1580| 3.563920531231442|    43|   5.0| 912611414|
|   1580| 3.563920531231442|    45|   4.0| 974295797|
|   1580| 3.563920531231442|    47|   4.5|1162150102|
|   2366|3.6127175743964064|    56|   5.0|1162159027|
|   1238| 4.003408495018354|    65|   3.0| 970834629|
|   1959|3.6309438040345823|    65|   4.0| 950887497|
|   6620|3.8627082213863515|    78|   4.0|1083963707|
|   1645| 3.450640298265521|    96|   3.0| 959877703|
|   2366|3.6127175743964064|    96|   4.0| 959875578|
|   1580| 3.563920531231442|   105|   3.0| 959879140|
|    471| 3.659111243662392|

In [54]:
df1.groupBy("UserID").count().orderBy("count", ascending=False).show(1)

+------+-----+
|UserID|count|
+------+-----+
| 59269| 6401|
+------+-----+
only showing top 1 row



In [60]:
# Average only the Rating column for the top user found above
df1.filter("UserID = 59269").groupBy("UserID").avg("Rating").show()

+------+------------------+
|UserID|       avg(Rating)|
+------+------------------+
| 59269|3.3654897672238713|
+------+------------------+
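
The three cells above can be read as one pipeline: keep the ratings on high-rated movies, find the user with the most of them, then average that user's ratings. The sketch below replays that logic on toy data so the user ID does not need to be copied by hand. All values are assumptions for illustration, and the commented PySpark lines are an untested suggestion that reuses the notebook's `df1`.

```python
from collections import Counter
from statistics import mean

# Toy inputs: per-movie averages and (user_id, movie_id, rating) rows.
toy_movie_avg = {10: 4.5, 20: 1.0, 30: 3.5}
toy_ratings = [(1, 10, 5.0), (1, 30, 4.0), (1, 20, 1.0), (2, 10, 4.0), (3, 30, 3.0)]

# Step 1: movies whose average is above 3.
high_rated = {m for m, avg in toy_movie_avg.items() if avg > 3}

# Step 2: keep only ratings on those movies (the inner join in the notebook).
kept = [(u, r) for u, m, r in toy_ratings if m in high_rated]

# Step 3: the user with the most kept ratings.
top_user, n_ratings = Counter(u for u, _ in kept).most_common(1)[0]

# Step 4: that user's mean rating over the kept rows.
top_user_mean = mean([r for u, r in kept if u == top_user])
print(top_user, n_ratings, top_user_mean)

# Possible PySpark equivalent (untested sketch):
# from pyspark.sql import functions as F
# top = df1.groupBy("UserID").count().orderBy(F.desc("count")).first()
# df1.filter(F.col("UserID") == top["UserID"]).agg(F.avg("Rating")).show()
```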



#### 2.5 Average, minimum, and maximum rating for each user

In [122]:
ratings.groupBy("UserID").avg("Rating").show()

+------+------------------+
|UserID|       avg(Rating)|
+------+------------------+
|   148| 4.178571428571429|
|   463|           3.56875|
|   471| 3.909090909090909|
|   496|3.5789473684210527|
|   833|3.8840579710144927|
|  1088| 3.348684210526316|
|  1238|3.3728813559322033|
|  1342| 4.130434782608695|
|  1580|3.3506493506493507|
|  1591| 4.392857142857143|
|  1645|3.5609756097560976|
|  1829| 3.966386554621849|
|  1959|              3.44|
|  2122|2.7398373983739837|
|  2142| 3.064935064935065|
|  2366|               4.2|
|  2659| 3.890909090909091|
|  2866|3.8181818181818183|
|  3175| 4.090909090909091|
|  3749| 3.870967741935484|
+------+------------------+
only showing top 20 rows



In [123]:
ratings.groupBy("UserID").min("Rating").show()

+------+-----------+
|UserID|min(Rating)|
+------+-----------+
|   148|          1|
|   463|          1|
|   471|          2|
|   496|          1|
|   833|          1|
|  1088|          1|
|  1238|          1|
|  1342|          3|
|  1580|          1|
|  1591|          2|
|  1645|          1|
|  1829|          2|
|  1959|          1|
|  2122|          1|
|  2142|          1|
|  2366|          3|
|  2659|          1|
|  2866|          1|
|  3175|          1|
|  3749|          2|
+------+-----------+
only showing top 20 rows



In [124]:
ratings.groupBy("UserID").max("Rating").show()

+------+-----------+
|UserID|max(Rating)|
+------+-----------+
|   148|          5|
|   463|          5|
|   471|          5|
|   496|          5|
|   833|          5|
|  1088|          5|
|  1238|          5|
|  1342|          5|
|  1580|          5|
|  1591|          5|
|  1645|          5|
|  1829|          5|
|  1959|          5|
|  2122|          5|
|  2142|          5|
|  2366|          5|
|  2659|          5|
|  2866|          5|
|  3175|          5|
|  3749|          5|
+------+-----------+
only showing top 20 rows
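
The three queries above each scan and group the ratings separately. A single grouped pass can produce all three statistics at once; the sketch below shows the idea on toy data assumed for illustration. The commented PySpark lines are an untested suggestion for doing the same with one `agg` call.

```python
# (user_id, rating) pairs, toy values only.
toy_ratings = [(1, 5.0), (1, 3.0), (2, 2.0), (2, 4.0), (2, 3.0)]

# user_id -> [running sum, count, min, max], updated in one pass.
stats = {}
for user_id, rating in toy_ratings:
    if user_id not in stats:
        stats[user_id] = [rating, 1, rating, rating]
    else:
        s = stats[user_id]
        s[0] += rating
        s[1] += 1
        s[2] = min(s[2], rating)
        s[3] = max(s[3], rating)

summary = {
    u: {"avg": s[0] / s[1], "min": s[2], "max": s[3]} for u, s in stats.items()
}
print(summary)

# Possible PySpark equivalent (untested sketch):
# from pyspark.sql import functions as F
# ratings.groupBy("UserID").agg(
#     F.avg("Rating").alias("avg_rating"),
#     F.min("Rating").alias("min_rating"),
#     F.max("Rating").alias("max_rating"),
# ).show()
```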



#### 2.6 Top 10 by average rating among movies rated more than 100 times

In [132]:
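# Note: this lists the movies with the most ratings (ordered by count),
# not the ranking by average rating that the task asks for.
ratings.groupBy("MovieID")\
    .count()\
    .filter("count > 100")\
    .orderBy("count", ascending=False)\
    .show(10)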

+-------+-----+
|MovieID|count|
+-------+-----+
|    296|34864|
|    356|34457|
|    593|33668|
|    480|32631|
|    318|31126|
|    110|29154|
|    457|28951|
|    589|28948|
|    260|28566|
|    150|27035|
+-------+-----+
only showing top 10 rows
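
The output above is ordered by how often each movie was rated. The task asks for a ranking by average rating, restricted to movies with more than 100 ratings, which needs both the count and the average per movie. Here is a minimal sketch of that logic on toy data; `top_by_average`, its parameters, and the small threshold are assumptions for illustration. The commented PySpark lines are an untested suggestion, and no results for the real dataset are claimed.

```python
from statistics import mean


def top_by_average(ratings_by_movie, min_count, k):
    """Rank movies with more than `min_count` ratings by mean rating, best first."""
    eligible = {
        m: mean(r) for m, r in ratings_by_movie.items() if len(r) > min_count
    }
    return sorted(eligible.items(), key=lambda kv: kv[1], reverse=True)[:k]


# movie_id -> ratings (toy values). Movie 20 has a perfect score from one rating.
toy = {10: [5.0, 4.0, 5.0], 20: [5.0], 30: [3.0, 4.0]}

# min_count=1 stands in for the task's threshold of 100.
ranking = top_by_average(toy, min_count=1, k=10)
print(ranking)

# Possible PySpark equivalent (untested sketch):
# from pyspark.sql import functions as F
# ratings.groupBy("MovieID").agg(
#     F.count("*").alias("n"),
#     F.avg("Rating").alias("avg_rating"),
# ).filter("n > 100").orderBy(F.desc("avg_rating")).show(10)
```

The count filter is what keeps a movie with a single perfect rating, like movie 20 in the toy data, from topping the list.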

