# TMDB 5000 Movie Dataset - Content-Based Filtering Practice

Download the dataset files from Kaggle:
https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata


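**Content-based filtering**

If a user watches a particular movie and rates it highly, content-based filtering recommends other movies with similar characteristics.

This exercise models the recommender on movie genres.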

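**Data loading and preprocessing**

The required columns store lists of dictionaries as strings, so they are parsed with ast.literal_eval and then reduced to plain list objects.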
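
As a minimal sketch of this parsing step, the snippet below runs `literal_eval` on one shortened string in the same shape as the `genres` column and keeps only the `name` values. The variable names `raw`, `parsed`, and `names` are illustrative and not part of the notebook.

```python
from ast import literal_eval

# A shortened value in the same format as the 'genres' column
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# literal_eval turns the string into a real Python list of dicts
parsed = literal_eval(raw)

# Keep only the genre names
names = [entry["name"] for entry in parsed]
print(names)  # ['Action', 'Adventure']
```

The notebook applies the same two steps to every row with `Series.apply`.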

In [8]:
import pandas as pd
import numpy as np
import warnings; warnings.filterwarnings('ignore')

# Load the TMDB 5000 movies CSV (this path assumes Google Drive is mounted in Colab)
movies = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Datasets/tmdb_5000_movies.csv')
print(movies.shape)
movies.head(1)

(4803, 20)


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [9]:
# Copy the selected columns so later column assignments do not write to a view of `movies`
movies_df = movies[['id','title','genres','vote_average','vote_count','popularity','keywords','overview']].copy()

In [10]:
pd.set_option('display.max_colwidth',100)
movies_df[['genres','keywords']][:1]

Unnamed: 0,genres,keywords
0,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""name"": ""Fantasy""}, {...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"": 2964, ""name"": ""future""}, {""id"": 3386, ""name"": ""sp..."


In [12]:
from ast import literal_eval

movies_df['genres'] = movies_df['genres'].apply(literal_eval)
movies_df['keywords'] = movies_df['keywords'].apply(literal_eval)


In [13]:
movies_df['genres'] = movies_df['genres'].apply(lambda x:[y['name'] for y in x])
movies_df['keywords'] = movies_df['keywords'].apply(lambda x:[y['name'] for y in x])
movies_df[['genres','keywords']][:1]

Unnamed: 0,genres,keywords
0,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colony, society, space travel, futuristic, romance, spa..."


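**Measuring genre content similarity**

Convert the genres column, which holds a list of individual genres per movie, into a single string with the genres separated by spaces.

CountVectorizer: convert the strings into a feature vector matrix.

cosine_similarity: measure the similarity between movies.

Sort each row of the returned similarity matrix in descending order, then read off the positional indexes of the columns (the movies being compared) with the highest similarity.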
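
To make these steps concrete, here is a standard-library sketch of what the pipeline computes on three made-up genre strings. It is a simplification and not scikit-learn's implementation: `ngram_counts` and `cosine` are hypothetical helpers that mimic unigram-plus-bigram counting and cosine similarity, and the toy data is invented for illustration.

```python
import math
from collections import Counter

def ngram_counts(genres_literal, max_n=2):
    """Count word n-grams (n = 1..max_n) in a space-separated genre string."""
    tokens = genres_literal.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical toy data: one genre string per movie
toy_genres = [
    "Action Adventure Fantasy",  # movie 0
    "Action Adventure",          # movie 1
    "Drama Crime",               # movie 2
]
vectors = [ngram_counts(g) for g in toy_genres]

# Similarity of every movie to movie 0, then indexes sorted by descending similarity
sim_to_first = [cosine(vectors[0], v) for v in vectors]
ranked = sorted(range(len(toy_genres)), key=lambda i: sim_to_first[i], reverse=True)
print(sim_to_first, ranked)
```

Movie 1 shares three of movie 0's five n-gram features, so it ranks right after movie 0 itself, while movie 2 shares none and scores 0.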

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

movies_df['genres_literal'] = movies_df['genres'].apply(lambda x:(' ').join(x))
# min_df=1 keeps every term; recent scikit-learn versions reject an integer min_df of 0
count_vect = CountVectorizer(min_df=1, ngram_range=(1,2))
genre_mat = count_vect.fit_transform(movies_df['genres_literal'])
print(genre_mat.shape)

(4803, 276)


In [15]:
from sklearn.metrics.pairwise import cosine_similarity
genre_sim = cosine_similarity(genre_mat, genre_mat)
print(genre_sim.shape)
print(genre_sim[:1])

(4803, 4803)
[[1.         0.59628479 0.4472136  ... 0.         0.         0.        ]]


In [16]:
genre_sim_sorted_ind = genre_sim.argsort()[:,::-1]
print(genre_sim_sorted_ind[:1])

[[   0 3494  813 ... 3038 3037 2401]]


**Movie recommendation using genre content filtering**

Create a recommendation function that returns information about the recommended movies.


In [17]:
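def find_sim_movie(df, sorted_ind, title_name, top_n=10):
  # Select the row(s) whose title matches the query
  title_movie = df[df['title']==title_name]

  # Use the row's index as a position into the sorted similarity index matrix
  title_index = title_movie.index.values
  similar_indexes = sorted_ind[title_index, :(top_n)]

  # Print the top_n positions, then flatten to 1-D for iloc
  print(similar_indexes)
  similar_indexes = similar_indexes.reshape(-1)

  return df.iloc[similar_indexes]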

In [19]:
similar_movies = find_sim_movie(movies_df, genre_sim_sorted_ind, 'The Godfather', 10)
similar_movies[['title','vote_average']]

[[2731 1243 3636 1946 2640 4065 1847 4217  883 3866]]


Unnamed: 0,title,vote_average
2731,The Godfather: Part II,8.3
1243,Mean Streets,7.2
3636,Light Sleeper,5.7
1946,The Bad Lieutenant: Port of Call - New Orleans,6.0
2640,Things to Do in Denver When You're Dead,6.7
4065,Mi America,0.0
1847,GoodFellas,8.2
4217,Kids,6.8
883,Catch Me If You Can,7.7
3866,City of God,8.1


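The results include a movie with a rating of 0, which is hard to recommend with confidence.

> Filter by rating before producing the final recommendation.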


In [21]:
movies_df[['title','vote_average','vote_count']].sort_values('vote_average', ascending=False)[:10]

Unnamed: 0,title,vote_average,vote_count
3519,Stiff Upper Lips,10.0,1
4247,Me You and Five Bucks,10.0,2
4045,"Dancer, Texas Pop. 81",10.0,1
4662,Little Big Top,10.0,1
3992,Sardaarji,9.5,2
2386,One Man's Hero,9.3,2
2970,There Goes My Baby,8.5,2
1881,The Shawshank Redemption,8.5,8205
2796,The Prisoner of Zenda,8.4,11
3337,The Godfather,8.4,5893


To reduce the distortion from ratings based on only a handful of votes, use a weighted rating:

(v/(v+m)) * R + (m/(v+m)) * C


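v: the number of votes cast for the individual movie

m: the minimum number of votes required to be rated. Raising this value gives more weight to movies with many votes. Here it is set to the 60th percentile of vote counts using quantile(0.6).

R: the average rating of the individual movie

C: the average rating across all movies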



In [24]:
C = movies_df['vote_average'].mean()
m = movies_df['vote_count'].quantile(0.6)
print('C:',round(C,3), 'm:',round(m,3))

C: 6.092 m: 370.2
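
As a worked example of the formula, the sketch below plugs in the rounded C and m printed above together with two rows from the rating table: The Shawshank Redemption (8.5 from 8205 votes) and Stiff Upper Lips (10.0 from 1 vote). The helper name `weighted_rating` is illustrative, and because C and m are rounded the results match the notebook's values only approximately.

```python
def weighted_rating(v, R, m, C):
    """Blend a movie's own rating R with the global mean C, weighted by vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

C = 6.092  # rounded mean rating printed above
m = 370.2  # rounded 60th-percentile vote count printed above

# Many votes: the result stays close to the movie's own rating
shawshank = weighted_rating(v=8205, R=8.5, m=m, C=C)

# A single vote: the result is pulled almost all the way to the global mean
stiff_upper_lips = weighted_rating(v=1, R=10.0, m=m, C=C)

print(round(shawshank, 2), round(stiff_upper_lips, 2))
```

The heavily voted movie keeps a score near 8.4, while the single-vote 10.0 drops to about 6.1, which is why such movies no longer dominate the ranking.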


In [29]:
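percentile = 0.6
# C: mean rating over all movies; m: vote-count threshold at the 60th percentile
C = movies['vote_average'].mean()
m = movies['vote_count'].quantile(percentile)

def weighted_vote_average(record):
  # v: this movie's vote count, R: this movie's average rating
  v = record['vote_count']
  R = record['vote_average']

  return ((v/(v+m)) * R + (m/(v+m)) * C)

# Apply the formula row by row and store the result in a new column
movies['weighted_vote'] = movies.apply(weighted_vote_average, axis=1)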

In [31]:
movies[['title','weighted_vote','vote_average','vote_count']].sort_values('weighted_vote', ascending=False)[:10]

Unnamed: 0,title,weighted_vote,vote_average,vote_count
1881,The Shawshank Redemption,8.396052,8.5,8205
3337,The Godfather,8.263591,8.4,5893
662,Fight Club,8.216455,8.3,9413
3232,Pulp Fiction,8.207102,8.3,8428
65,The Dark Knight,8.13693,8.2,12002
1818,Schindler's List,8.126069,8.3,4329
3865,Whiplash,8.123248,8.3,4254
809,Forrest Gump,8.105954,8.2,7927
2294,Spirited Away,8.105867,8.3,3840
2731,The Godfather: Part II,8.079586,8.3,3338


In [35]:
# Revised find_sim_movie: re-rank similar movies by weighted rating

def find_sim_movie(df, sorted_ind, title_name, top_n=10):

  title_movie = df[df['title']==title_name]
  title_index = title_movie.index.values

  # Take twice as many candidates as needed, then drop the query movie itself
  similar_indexes = sorted_ind[title_index, :(top_n*2)]
  similar_indexes = similar_indexes.reshape(-1)
  similar_indexes = similar_indexes[similar_indexes != title_index]

  # Keep the top_n candidates with the highest weighted rating
  return df.iloc[similar_indexes].sort_values('weighted_vote', ascending=False)[:top_n]

similar_movies = find_sim_movie(movies, genre_sim_sorted_ind, 'The Godfather', 10)
similar_movies[['title','vote_average', 'weighted_vote']]

Unnamed: 0,title,vote_average,weighted_vote
2731,The Godfather: Part II,8.3,8.079586
1847,GoodFellas,8.2,7.976937
3866,City of God,8.1,7.759693
1663,Once Upon a Time in America,8.2,7.657811
883,Catch Me If You Can,7.7,7.557097
281,American Gangster,7.4,7.141396
4041,This Is England,7.4,6.739664
1149,American Hustle,6.8,6.717525
1243,Mean Streets,7.2,6.626569
2839,Rounders,6.9,6.530427
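
The re-ranking logic of the revised function can be traced on a small made-up example. This is a plain-Python sketch, not the notebook's pandas code: `rerank_candidates`, `sorted_row`, and `weighted_votes` are hypothetical names and values chosen for illustration.

```python
def rerank_candidates(sorted_row, query_index, weighted_votes, top_n):
    """Take the 2*top_n most similar items, drop the query, re-rank by weighted vote."""
    candidates = [i for i in sorted_row[: top_n * 2] if i != query_index]
    candidates.sort(key=lambda i: weighted_votes[i], reverse=True)
    return candidates[:top_n]

# Hypothetical data: movie 0 is the query, and sorted_row lists movie
# indexes by descending genre similarity to movie 0
sorted_row = [0, 3, 1, 4, 2, 5]
weighted_votes = {0: 8.2, 1: 7.9, 2: 6.1, 3: 5.0, 4: 7.1, 5: 8.0}

result = rerank_candidates(sorted_row, query_index=0,
                           weighted_votes=weighted_votes, top_n=2)
print(result)  # [1, 4]
```

Movie 5 has a high weighted vote but falls outside the `2 * top_n` similarity window, so it is never considered. Widening the window trades genre similarity for rating quality.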
