## Implementing a Movie Recommendation Algorithm with MLlib
+ Dataset: MovieLens (25 million movie ratings); this example uses only about 70,000 of them (```ratings_short.csv```)
+ Recommendation algorithm: Alternating Least Squares (ALS)
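
As rough orientation, ALS is usually described as factorizing the sparse rating matrix into one latent vector per user and one per movie by minimizing a regularized squared error. The formula below is the common textbook form, not a transcription of Spark's code. The symbols $\mathbf{x}_u$, $\mathbf{y}_i$, $\lambda$ and $\Omega$ are notation introduced here for illustration, and Spark's implementation additionally scales the regularization term by the number of ratings each user or movie has.

```latex
\min_{X,\,Y} \sum_{(u,i) \in \Omega} \left( r_{ui} - \mathbf{x}_u^{\top} \mathbf{y}_i \right)^2
  + \lambda \left( \sum_{u} \lVert \mathbf{x}_u \rVert^2 + \sum_{i} \lVert \mathbf{y}_i \rVert^2 \right)
```

Here $\Omega$ is the set of observed (user, movie) pairs and $r_{ui}$ is the observed rating. With $Y$ held fixed the problem is an ordinary regularized least-squares problem in $X$, and vice versa, which is where the "alternating" in the name comes from.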

---

### 1. Loading the Movie Rating Data & Creating a DataFrame

In [4]:
# [+] Set up the SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('movie_recommendation').getOrCreate()

#### Loading the MovieLens data
+ File name: ```ratings_short.csv```
+ Schema setting: ```inferSchema=True```
+ Use the header row: ```header=True```

In [5]:
# [+] Load the MovieLens data
path = './data/'
file = 'ratings_short.csv'

ratings_df = spark.read.csv(path + file, inferSchema = True, header = True)

In [6]:
# [+] Print the DataFrame
ratings_df.show()

+------+-------+------+----------+
|userId|movieId|rating| timestamp|
+------+-------+------+----------+
|     1|    296|   5.0|1147880044|
|     1|    306|   3.5|1147868817|
|     1|    307|   5.0|1147868828|
|     1|    665|   5.0|1147878820|
|     1|    899|   3.5|1147868510|
|     1|   1088|   4.0|1147868495|
|     1|   1175|   3.5|1147868826|
|     1|   1217|   3.5|1147878326|
|     1|   1237|   5.0|1147868839|
|     1|   1250|   4.0|1147868414|
|     1|   1260|   3.5|1147877857|
|     1|   1653|   4.0|1147868097|
|     1|   2011|   2.5|1147868079|
|     1|   2012|   2.5|1147868068|
|     1|   2068|   2.5|1147869044|
|     1|   2161|   3.5|1147868609|
|     1|   2351|   4.5|1147877957|
|     1|   2573|   4.0|1147878923|
|     1|   2632|   5.0|1147878248|
|     1|   2692|   5.0|1147869100|
+------+-------+------+----------+
only showing top 20 rows



In [7]:
# [+] Print the DataFrame schema
ratings_df.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)



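```rating``` is inferred as ```double```, a floating-point type, because the values include half-star ratings such as 3.5.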

In [8]:
# Select every column except timestamp
# select() is a built-in DataFrame method that keeps
# only the columns named in the list
ratings_df = ratings_df.select(['userId', 'movieId', 'rating'])

In [9]:
# describe(): compute summary statistics (the result is itself a DataFrame)
ratings_df.select('rating').describe().show()

+-------+------------------+
|summary|            rating|
+-------+------------------+
|  count|             71921|
|   mean|3.5821387355570695|
| stddev| 1.042406032579843|
|    min|               0.5|
|    max|               5.0|
+-------+------------------+



### 2. Preparing the Training Data and Training the Recommendation Model

+ Train data: used to fit the model
+ Test data: used to evaluate its performance

In [10]:
# randomSplit(): split the data into a training set and a test set
# about 80% of the rows = training data
# about 20% of the rows = test data
train_df, test_df = ratings_df.randomSplit([0.8, 0.2])
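
To make the 80/20 split more concrete, here is a minimal plain-Python sketch of a weighted random split. It is an illustration of the idea, not Spark's implementation: the helper `random_split` and its `seed` handling are hypothetical names introduced here. It shows why the resulting sizes are only approximately 80% and 20%, and why the split changes between runs unless a seed is fixed (`DataFrame.randomSplit` also accepts an optional `seed` argument for that purpose).

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to exactly one split by an independent random draw."""
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for idx, upper in enumerate(bounds):
            # The last split catches anything left over from rounding.
            if x < upper or idx == len(bounds) - 1:
                splits[idx].append(row)
                break
    return splits


rows = list(range(1000))
train, test = random_split(rows, [0.8, 0.2], seed=42)
print(len(train), len(test))  # close to 800 / 200, but usually not exact
```

Because every row is drawn independently, two runs with different seeds give different training sets, so metrics such as RMSE will vary slightly from run to run.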

In [11]:
# Import the recommendation algorithm (Alternating Least Squares)
from pyspark.ml.recommendation import ALS

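#### Setting the hyperparameters of the recommendation algorithm
+ ```maxIter```: maximum number of training iterations
+ ```regParam```: regularization parameter (must be non-negative; values between 0 and 1 are typical)
+ ```coldStartStrategy```: how to handle predictions for new users or items that the model has no data for (the cold-start problem); with ```drop```, rows whose prediction would be NaN are removed from the prediction output (the training step itself is unaffected)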
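
To make the roles of ```maxIter``` and ```regParam``` concrete, the following is a toy rank-1 ALS in plain Python. It is a sketch under simplifying assumptions, not Spark's algorithm: every user and every movie gets a single latent number, the regularization is unweighted, and the helper names `als_rank1` and `squared_error` are hypothetical. The ratings are a handful of (userId, movieId, rating) values taken from the tables in this notebook.

```python
def als_rank1(ratings, max_iter=5, reg_param=0.1):
    """Toy rank-1 ALS. ratings maps (user, item) -> rating."""
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    x = {u: 1.0 for u in users}  # user factors
    y = {i: 1.0 for i in items}  # item factors

    for _ in range(max_iter):
        # Hold the item factors fixed and solve the regularized
        # least-squares problem for each user in closed form.
        for u in users:
            num = sum(r * y[i] for (uu, i), r in ratings.items() if uu == u)
            den = reg_param + sum(y[i] ** 2 for (uu, i) in ratings if uu == u)
            x[u] = num / den
        # Hold the user factors fixed and do the same for each item.
        for i in items:
            num = sum(r * x[u] for (u, ii), r in ratings.items() if ii == i)
            den = reg_param + sum(x[u] ** 2 for (u, ii) in ratings if ii == i)
            y[i] = num / den
    return x, y


def squared_error(ratings, x, y):
    """Sum of squared differences between ratings and x[u] * y[i]."""
    return sum((r - x[u] * y[i]) ** 2 for (u, i), r in ratings.items())


ratings = {
    (76, 1342): 3.5, (76, 3175): 3.5,
    (548, 1342): 3.5, (548, 1580): 3.5,
    (321, 1580): 3.0, (375, 1580): 2.5,
    (368, 3175): 5.0,
}
x, y = als_rank1(ratings, max_iter=5, reg_param=0.1)
print(squared_error(ratings, x, y))
```

A larger `reg_param` shrinks the factors toward zero, which trades training error for less overfitting, while `max_iter` controls how many user/item alternations are performed.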

In [12]:
# Set the hyperparameters of the recommendation algorithm
# coldStartStrategy='drop': drop rows whose prediction is NaN
als = ALS(
    maxIter=5,
    regParam=0.1,
    userCol='userId',
    itemCol='movieId',
    ratingCol='rating',
    coldStartStrategy='drop'
)
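
The reason ```coldStartStrategy='drop'``` matters for evaluation can be sketched in plain Python. With the default strategy (```nan```), a test row whose user or movie did not appear in the training split gets a NaN prediction, and one NaN is enough to turn an averaged metric into NaN. The helper `rmse` and the third data pair below are illustrative assumptions; the first two pairs are taken from the prediction output in this notebook.

```python
import math


def rmse(pairs):
    """Root mean squared error over (rating, prediction) pairs."""
    return math.sqrt(sum((r - p) ** 2 for r, p in pairs) / len(pairs))


# (rating, prediction) pairs. The last pair stands in for a user or movie
# that was absent from the training split, so no prediction is available.
pairs = [(4.0, 3.2102757), (3.0, 3.15096), (2.5, math.nan)]

print(rmse(pairs))  # nan: a single missing prediction poisons the metric

# Dropping the NaN rows, which is what 'drop' does to the prediction output.
kept = [(r, p) for r, p in pairs if not math.isnan(p)]
print(rmse(kept))   # finite value computed from the remaining rows
```

The trade-off is that dropped rows are silently excluded from the metric, so the reported error only covers users and movies the model has already seen.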

In [13]:
# [+] Train the model
model = als.fit(train_df)

In [None]:
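# # If training fails with an out-of-memory error, run the code below
# # (in a fresh kernel, so that a new session is actually created)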
# from pyspark.sql import SparkSession

# MAX_MEMORY = '5g'
# spark = SparkSession.builder.appName('movie-recommendation')\
#     .config('spark.executor.memory', MAX_MEMORY)\
#     .config('spark.driver.memory', MAX_MEMORY)\
#     .getOrCreate()

In [14]:
# [+] Generate predictions for the test data
predictions = model.transform(test_df)

In [15]:
# [+] Print the predictions
predictions.show()

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   472|   1088|   4.0| 3.2102757|
|   321|   1580|   3.0|   3.15096|
|   375|   1580|   2.5| 3.4394088|
|   368|   3175|   5.0| 3.4469876|
|    76|   1342|   3.5|  3.150262|
|    76|   3175|   3.5| 2.7611709|
|   332|   1580|   4.0| 3.5164542|
|    12|   8638|   4.0| 3.8325171|
|    12|  33722|   3.0| 3.0845177|
|   548|   1342|   3.5| 3.2207136|
|   548|   1580|   3.5|   3.57827|
|   548|   4900|   3.5| 2.6797392|
|   548|   5803|   2.5| 3.2255998|
|   548|   6620|   4.5| 3.7811031|
|   409|   1580|   3.5| 3.6983516|
|   319|   1238|   5.0|  3.215889|
|    93|  44022|   4.0| 3.1742961|
|   233|   1580|   5.0| 3.5785766|
|   177|   1580|   3.0| 3.2942014|
|   185|   1959|   3.0| 3.2285845|
+------+-------+------+----------+
only showing top 20 rows



In [16]:
# [+] Print statistics for the actual and predicted ratings
# (computed over the test data, about 20% of the full data set)
predictions.select('rating', 'prediction').describe().show()

+-------+-----------------+------------------+
|summary|           rating|        prediction|
+-------+-----------------+------------------+
|  count|            13544|             13544|
|   mean|3.600671884229179| 3.418213703271884|
| stddev|1.050661753045585|0.7705040980908947|
|    min|              0.5|       -0.21918255|
|    max|              5.0|         5.5198298|
+-------+-----------------+------------------+
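
The statistics above show predictions from about -0.22 to about 5.52, although the actual ratings only range from 0.5 to 5.0. One optional post-processing step, sketched here as an assumption rather than something the notebook does, is to clip each prediction to the valid rating range. The helper `clip_rating` is a hypothetical name.

```python
def clip_rating(prediction, low=0.5, high=5.0):
    """Clamp a predicted rating to the valid rating scale."""
    return max(low, min(high, prediction))


# min, mean and max of the prediction column shown above
for p in (-0.21918255, 3.418213703271884, 5.5198298):
    print(p, '->', clip_rating(p))
```

Since every actual rating lies within [0.5, 5.0], clipping can never move a prediction further away from the true rating, so it cannot increase the squared error of any row.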



In [17]:
# Evaluate model performance: RMSE (Root Mean Squared Error)
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(
    metricName='rmse',
    labelCol='rating',
    predictionCol='prediction'
)

In [18]:
# Compute the RMSE (root mean squared error)
rmse = evaluator.evaluate(predictions)

In [19]:
rmse

0.9261723907399123
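
For reference, the RMSE reported above follows the standard definition below, where $r_k$ is the actual rating, $\hat{r}_k$ the predicted rating, and $N$ the number of evaluated test rows (13544 in the statistics shown earlier). The symbols are notation introduced here.

```latex
\mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{k=1}^{N} \left( r_k - \hat{r}_k \right)^2 }
```

A value of about 0.93 therefore means that predictions are off by a little under one star in the root-mean-square sense, on a scale from 0.5 to 5.0.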

### 3. Recommending Movies with the Trained Model
+ ```recommendForAllUsers()```: recommend items for every user
+ ```recommendForAllItems()```: recommend users for every item
+ ```recommendForUserSubset()```: recommend items for a given subset of users

In [20]:
# Recommend 3 items for each user
model.recommendForAllUsers(3).show()



+------+--------------------+
|userId|     recommendations|
+------+--------------------+
|     1|[{8235, 5.5344176...|
|     2|[{105355, 5.61402...|
|     3|[{8582, 5.1088443...|
|     4|[{3910, 4.950136}...|
|     5|[{443, 5.142886},...|
|     6|[{1243, 5.499255}...|
|     7|[{714, 4.6875753}...|
|     8|[{3896, 5.540375}...|
|     9|[{3360, 5.324882}...|
|    10|[{8582, 5.0786133...|
|    11|[{78218, 5.327832...|
|    12|[{1221, 4.5944858...|
|    13|[{8235, 5.1588063...|
|    14|[{6286, 5.681328}...|
|    15|[{3881, 5.809094}...|
|    16|[{6286, 5.191756}...|
|    17|[{3037, 4.849959}...|
|    18|[{27846, 4.863466...|
|    19|[{135456, 4.93533...|
|    20|[{8582, 5.496078}...|
+------+--------------------+
only showing top 20 rows



In [21]:
# Recommend 3 users for each item
model.recommendForAllItems(3).show()

+-------+--------------------+
|movieId|     recommendations|
+-------+--------------------+
|     12|[{198, 4.462164},...|
|     26|[{173, 5.0856786}...|
|     27|[{50, 4.037017}, ...|
|     28|[{173, 4.758955},...|
|     31|[{153, 4.535494},...|
|     34|[{327, 4.8244114}...|
|     44|[{349, 4.6093607}...|
|     65|[{117, 3.6695867}...|
|     76|[{463, 3.8579214}...|
|     78|[{463, 3.896467},...|
|     81|[{87, 4.532772}, ...|
|     85|[{327, 5.0911446}...|
|    101|[{448, 5.584436},...|
|    103|[{327, 4.395684},...|
|    115|[{127, 4.1773496}...|
|    155|[{117, 4.9961147}...|
|    159|[{129, 4.924922},...|
|    183|[{198, 4.6034145}...|
|    193|[{199, 4.5397544}...|
|    210|[{117, 4.417731},...|
+-------+--------------------+
only showing top 20 rows



In [22]:
# Select a specific user
# (wrapped in a list so that it can be turned into a DataFrame)
user_lst = [1]

In [23]:
from pyspark.sql.types import IntegerType

In [24]:
# Create a DataFrame
# user_lst = [1] supplies the rows; toDF('userID') sets the column name
users_df = spark.createDataFrame(user_lst, IntegerType()).toDF('userID')

In [25]:
users_df.show()

+------+
|userID|
+------+
|     1|
+------+



In [26]:
# recommendForUserSubset(): recommend items for a given subset of users
user_recs = model.recommendForUserSubset(users_df, 5)

In [27]:
user_recs.show()

+------+--------------------+
|userId|     recommendations|
+------+--------------------+
|     1|[{8235, 5.5344176...|
+------+--------------------+



In [28]:
user_recs.collect()

[Row(userId=1, recommendations=[Row(movieId=8235, rating=5.534417629241943), Row(movieId=84187, rating=5.058632850646973), Row(movieId=135456, rating=4.957610607147217), Row(movieId=6433, rating=4.951874732971191), Row(movieId=7099, rating=4.939278602600098)])]

In [29]:
# Retrieve the recommendations as Python objects
movies_lst = user_recs.collect()[0].recommendations

In [30]:
movies_lst

[Row(movieId=8235, rating=5.534417629241943),
 Row(movieId=84187, rating=5.058632850646973),
 Row(movieId=135456, rating=4.957610607147217),
 Row(movieId=6433, rating=4.951874732971191),
 Row(movieId=7099, rating=4.939278602600098)]

In [31]:
# Create a DataFrame from movies_lst
recs_df = spark.createDataFrame(movies_lst)
recs_df.show()

+-------+-----------------+
|movieId|           rating|
+-------+-----------------+
|   8235|5.534417629241943|
|  84187|5.058632850646973|
| 135456|4.957610607147217|
|   6433|4.951874732971191|
|   7099|4.939278602600098|
+-------+-----------------+



Which movie is 8235, the top recommendation?

In [33]:
# [+] Create a DataFrame for the movie data

file = 'movies.csv'

movies_df = spark.read.csv(path+file, inferSchema = True, header = True)


In [34]:
movies_df.show()

+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
|      6|         Heat (1995)|Action|Crime|Thri...|
|      7|      Sabrina (1995)|      Comedy|Romance|
|      8| Tom and Huck (1995)|  Adventure|Children|
|      9| Sudden Death (1995)|              Action|
|     10|    GoldenEye (1995)|Action|Adventure|...|
|     11|American Presiden...|Comedy|Drama|Romance|
|     12|Dracula: Dead and...|       Comedy|Horror|
|     13|        Balto (1995)|Adventure|Animati...|
|     14|        Nixon (1995)|               Drama|
|     15|Cutthroat Island ...|Action|Adventure|...|
|     16|       Casino (1995)|         Crime|Drama|
|     17|Sen

In [35]:
# [+] Create temporary views for recs_df and movies_df
recs_df.createOrReplaceTempView('recommendations')
movies_df.createOrReplaceTempView('movies')

In [36]:
# Look up the titles of the recommended movies with a SQL JOIN
spark.sql(
    "SELECT * \
    FROM movies JOIN recommendations \
    ON movies.movieID = recommendations.movieID \
    ORDER BY rating DESC"
                    # DESC = descending order
).show()

+-------+--------------------+--------------------+-------+-----------------+
|movieId|               title|              genres|movieId|           rating|
+-------+--------------------+--------------------+-------+-----------------+
|   8235| Safety Last! (1923)|Action|Comedy|Rom...|   8235|5.534417629241943|
|  84187|Evangelion: 2.0 Y...|Action|Animation|...|  84187|5.058632850646973|
| 135456|Ghost in the Shel...|Action|Animation|...| 135456|4.957610607147217|
|   6433|Man with the Movi...|         Documentary|   6433|4.951874732971191|
|   7099|Nausicaä of the V...|Adventure|Animati...|   7099|4.939278602600098|
+-------+--------------------+--------------------+-------+-----------------+
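
Because the query uses ```SELECT *```, the result contains ```movieId``` twice, once from each table. Naming the columns explicitly avoids the duplicate. The sketch below demonstrates this with Python's built-in `sqlite3` module as a stand-in for Spark SQL, so that it runs without a Spark session. The table contents are a few rows copied from the outputs above, and the aliases `m` and `r` are illustrative. The same `SELECT` list should carry over to `spark.sql`, but that is an assumption worth checking in the notebook itself.

```python
import sqlite3

# In-memory database standing in for the two temporary views.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE movies (movieId INTEGER, title TEXT)')
conn.execute('CREATE TABLE recommendations (movieId INTEGER, rating REAL)')
conn.executemany(
    'INSERT INTO movies VALUES (?, ?)',
    [(1, 'Toy Story (1995)'), (2, 'Jumanji (1995)'), (8235, 'Safety Last! (1923)')],
)
conn.executemany(
    'INSERT INTO recommendations VALUES (?, ?)',
    [(8235, 5.534417629241943)],
)

# Select each column once instead of using SELECT *.
query = """
SELECT m.movieId, m.title, r.rating
FROM movies AS m
JOIN recommendations AS r ON m.movieId = r.movieId
ORDER BY r.rating DESC
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Only movies that appear in both tables survive the inner join, which is why the unrelated titles are filtered out.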



### 4. Building a Simple Per-User Movie Recommendation Service
1. Write the SQL query
2. Write the movie recommendation function
3. Test the movie recommendations

In [37]:
# Look up the titles of the recommended movies with a SQL JOIN
query = """
SELECT * 
FROM movies JOIN recommendations 
ON movies.movieID = recommendations.movieID
ORDER BY rating DESC
"""

In [38]:
# Movie recommendation function for a given user
def get_recommendations(user_id, num_recs):                                     # user_id: ID of the user to recommend for
    users_df = spark.createDataFrame([user_id], IntegerType()).toDF('userID')   # wrap the ID in a one-row DataFrame
    users_recs_df = model.recommendForUserSubset(users_df, num_recs)
    
    recs_lst = users_recs_df.collect()[0].recommendations
    recs_df = spark.createDataFrame(recs_lst)
    recs_df.createOrReplaceTempView('recommendations')

    recommended_movies = spark.sql(query)
    
    return recommended_movies
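
One fragile spot in the function is `users_recs_df.collect()[0]`. If the model has no factors for the requested user (for example, an ID that never appeared in the training data), the collected list is expected to be empty and indexing it raises `IndexError`. A minimal guard could look like the sketch below. The helper `first_recommendations` is hypothetical, and the two `namedtuple` types are stand-ins for `pyspark.sql.Row` so that the example runs without Spark.

```python
from collections import namedtuple

# Stand-ins for pyspark.sql.Row objects (illustrative only).
Rec = namedtuple('Rec', ['movieId', 'rating'])
UserRecs = namedtuple('UserRecs', ['userId', 'recommendations'])


def first_recommendations(collected):
    """Return the first row's recommendation list, or [] if no row came back."""
    if not collected:
        return []
    return collected[0].recommendations


known = [UserRecs(1, [Rec(8235, 5.534417629241943)])]
print(first_recommendations(known))  # list with one Rec for user 1
print(first_recommendations([]))     # [] instead of an IndexError
```

A caller would still need to return early on an empty list, since creating a DataFrame from an empty list without a schema fails as well.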

In [39]:
# Recommend 5 movies for user 395
recs = get_recommendations(395, 5)



In [40]:
# Print the recommendation results
recs.show()

+-------+--------------------+--------------------+-------+-----------------+
|movieId|               title|              genres|movieId|           rating|
+-------+--------------------+--------------------+-------+-----------------+
|   8582|Manufacturing Con...|     Documentary|War|   8582|5.821588516235352|
|   2467|Name of the Rose,...|Crime|Drama|Myste...|   2467|5.677016258239746|
|   6286|Man Without a Pas...|  Comedy|Crime|Drama|   6286|5.653853416442871|
|   3067|Women on the Verg...|        Comedy|Drama|   3067|5.430441379547119|
|  83796|Anything for Her ...|Crime|Drama|Thriller|  83796|5.424349784851074|
+-------+--------------------+--------------------+-------+-----------------+



In [41]:
# toPandas(): output as a Pandas DataFrame
recs.toPandas()

Unnamed: 0,movieId,title,genres,movieId.1,rating
0,8582,Manufacturing Consent: Noam Chomsky and the Me...,Documentary|War,8582,5.821589
1,2467,"Name of the Rose, The (Name der Rose, Der) (1986)",Crime|Drama|Mystery|Thriller,2467,5.677016
2,6286,"Man Without a Past, The (Mies vailla menneisyy...",Comedy|Crime|Drama,6286,5.653853
3,3067,Women on the Verge of a Nervous Breakdown (Muj...,Comedy|Drama,3067,5.430441
4,83796,Anything for Her (Pour elle) (2008),Crime|Drama|Thriller,83796,5.42435
