# Quiz

> tmdb_5000_movies에서 overview를 이용하여 추천 시스템을 구축하시오
1. 파일 읽기
2. 전처리
3. 데이터 가공
4. 유사도 분석

## 1. 파일 읽기

In [11]:
import pandas as pd

movies = pd.read_csv('data/tmdb_5000_movies.csv')
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [12]:
movies['overview'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

> 위의 정보 중 사용할 정보를 추린다.
* genres, id, keywords, overview,popularity, title, vote_average, vote_count

> 현재 진행해야 할 정보는 overview 정보로 추천 시스템을 만들려고 한다. 하지만 overview에 3개의 결측치가 발생하여 이를 제거해야 한다.

In [13]:
df_movies_back = movies.copy(deep=True)
df_movies_back = df_movies_back[['genres', 'id', 'keywords', 'overview', 'popularity', 'title', 'vote_average', 'vote_count']]
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4800 entries, 0 to 4802
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   genres        4800 non-null   object 
 1   id            4800 non-null   int64  
 2   keywords      4800 non-null   object 
 3   overview      4800 non-null   object 
 4   popularity    4800 non-null   float64
 5   title         4800 non-null   object 
 6   vote_average  4800 non-null   float64
 7   vote_count    4800 non-null   int64  
dtypes: float64(2), int64(2), object(4)
memory usage: 337.5+ KB


## 2. 데이터 전처리

In [16]:
df_movies_back = df_movies_back.dropna()
df_movies_back.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4800 entries, 0 to 4802
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   genres        4800 non-null   object 
 1   id            4800 non-null   int64  
 2   keywords      4800 non-null   object 
 3   overview      4800 non-null   object 
 4   popularity    4800 non-null   float64
 5   title         4800 non-null   object 
 6   vote_average  4800 non-null   float64
 7   vote_count    4800 non-null   int64  
dtypes: float64(2), int64(2), object(4)
memory usage: 337.5+ KB


>필요한 정보만 추리고 결측치는 제거한다.

> 정상적으로는 형태소 분석을 통해 명사만 추출하여 사용하는 것이 맞지만 간단하게 replace를 통해 불필요한 단어들을 제거해 보자

In [18]:
df_movies_back['overview'] = df_movies_back['overview'].apply(lambda x: x.lower())
df_movies_back['overview'][0]

'in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

In [19]:
df_movies_back.reset_index(drop=True, inplace=True)

In [20]:
df_movies_back.iloc[-1]

genres                        [{"id": 99, "name": "Documentary"}]
id                                                          25975
keywords        [{"id": 1523, "name": "obsession"}, {"id": 224...
overview        ever since the second grade when he first saw ...
popularity                                               1.929883
title                                           My Date with Drew
vote_average                                                  6.3
vote_count                                                     16
Name: 4799, dtype: object

## 3. 데이터 count

>overview는 특정 키워드만 들어 있는 것이 아니고 설명을 위한 문장들이 들어 있다.
* In the 22nd century, a paraplegic Marine is di...

> 첫 번째 영화는 위와 같은 overview가 작성되어 있으며 여기서 in, the, a, is 이러한 단어는 의미없는 단어 이므로 배제되어야 한다. 따라서 기존에 사용했던 빈도수를 구하는 countvectorizer가 아닌 가중치를 이용한 tfidf를 이용해야 한다.

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
overview_matrix = vect.fit_transform(df_movies_back['overview'])

In [22]:
sorted(vect.vocabulary_)

['00',
 '00 agent',
 '00 middle',
 '000',
 '000 000',
 '000 50',
 '000 acre',
 '000 australia',
 '000 boys',
 '000 buried',
 '000 dm',
 '000 dream',
 '000 fathoms',
 '000 feet',
 '000 film',
 '000 foot',
 '000 greeks',
 '000 head',
 '000 hits',
 '000 intending',
 '000 investor',
 '000 jobs',
 '000 john',
 '000 light',
 '000 machine',
 '000 miles',
 '000 money',
 '000 people',
 '000 prize',
 '000 residents',
 '000 sam',
 '000 savings',
 '000 score',
 '000 ships',
 '000 slaves',
 '000 spend',
 '000 stolen',
 '000 stranded',
 '000 usd',
 '000 went',
 '000 year',
 '000 years',
 '007',
 '007 battles',
 '007 engraved',
 '007 evil',
 '007 fights',
 '007 sean',
 '007 second',
 '007 suspicious',
 '007 takes',
 '007 time',
 '07am',
 '07am girls',
 '10',
 '10 191',
 '10 children',
 '10 countries',
 '10 days',
 '10 led',
 '10 messages',
 '10 million',
 '10 mission',
 '10 round',
 '10 second',
 '10 tasked',
 '10 year',
 '10 years',
 '10 yuma',
 '100',
 '100 000',
 '100 company',
 '100 innocent',
 '

In [23]:
overview_matrix.shape

(4800, 136949)

## 4. 유사도 분석

In [24]:
from sklearn.metrics.pairwise import cosine_similarity

overview_sim = cosine_similarity(overview_matrix, overview_matrix)
overview_sim[:3]

array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.00787352, 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.00551187, 0.        ,
        0.        ]])

> 위와 같이 cosine_similarity를 이용하여 유사도를 구할 수 있다. 이를 이용하여 추천 시스템을 만들어 보자

In [25]:
overview_sim_sorted_idx = overview_sim.argsort()[:, ::-1]
overview_sim_sorted_idx[:3]

array([[   0,  634, 3603, ...,    2,    1,  600],
       [   1, 3631, 2542, ...,   46, 4752, 4753],
       [   2, 1343, 3161, ...,   39, 4236,  557]])

In [27]:
df_movies_back.iloc[0]

genres          [{"id": 28, "name": "Action"}, {"id": 12, "nam...
id                                                          19995
keywords        [{"id": 1463, "name": "culture clash"}, {"id":...
overview        in the 22nd century, a paraplegic marine is di...
popularity                                             150.437577
title                                                      Avatar
vote_average                                                  7.2
vote_count                                                  11800
Name: 0, dtype: object

In [28]:
df_movies_back.iloc[634]

genres          [{"id": 28, "name": "Action"}, {"id": 878, "na...
id                                                            603
keywords        [{"id": 83, "name": "saving the world"}, {"id"...
overview        set in the 22nd century, the matrix tells the ...
popularity                                             104.309993
title                                                  The Matrix
vote_average                                                  7.9
vote_count                                                   8907
Name: 634, dtype: object

## 5. 추천

In [29]:
C = df_movies_back['vote_average'].mean()
m = df_movies_back['vote_count'].quantile(0.6)
print('C:', round(C, 3), 'm:', round(m, 3))

C: 6.093 m: 371.0


In [30]:
def weighted_vote_average(record):
    v = record['vote_count']
    R = record['vote_average']
    return (v/(v+m) * R) + (m/(v+m) * C)

In [32]:
df_movies_back['weighted_vote'] = df_movies_back.apply(weighted_vote_average, axis=1)
df_movies_back.head()

Unnamed: 0,genres,id,keywords,overview,popularity,title,vote_average,vote_count,weighted_vote
0,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","in the 22nd century, a paraplegic marine is di...",150.437577,Avatar,7.2,11800,7.166254
1,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","captain barbossa, long believed to be dead, ha...",139.082615,Pirates of the Caribbean: At World's End,6.9,4500,6.838528
2,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",a cryptic message from bond’s past sends him o...,107.376788,Spectre,6.3,4466,6.284117
3,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",following the death of district attorney harve...,112.31295,The Dark Knight Rises,7.6,9106,7.541002
4,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","john carter is a war-weary, former military ca...",43.926995,John Carter,6.1,2124,6.098947


In [35]:
def find_sim_movie(df, sorted_idx, title, top_n=10):
    title_movie = df[df['title'].str.lower().isin([title.lower()])]
    title_idx = title_movie.index.values
    sim_idx = sorted_idx[title_idx, :(top_n)*2]
    sim_idx = sim_idx.reshape(-1)
    sim_idx = sim_idx[sim_idx != title_idx]
    return df.iloc[sim_idx].sort_values('weighted_vote',ascending=False)[:top_n]

In [37]:
find_sim_movie(df_movies_back, overview_sim_sorted_idx, 'avatar')

Unnamed: 0,genres,id,keywords,overview,popularity,title,vote_average,vote_count,weighted_vote
634,"[{""id"": 28, ""name"": ""Action""}, {""id"": 878, ""na...",603,"[{""id"": 83, ""name"": ""saving the world""}, {""id""...","set in the 22nd century, the matrix tells the ...",104.309993,The Matrix,7.9,8907,7.82774
2966,"[{""id"": 878, ""name"": ""Science Fiction""}, {""id""...",601,"[{""id"": 455, ""name"": ""farewell""}, {""id"": 1007,...",after a gentle alien becomes stranded on earth...,56.105798,E.T. the Extra-Terrestrial,7.3,3269,7.17697
942,"[{""id"": 10749, ""name"": ""Romance""}, {""id"": 16, ...",228326,"[{""id"": 128, ""name"": ""love triangle""}, {""id"": ...","the journey of manolo, a young man who is torn...",34.890999,The Book of Life,7.3,755,6.902284
1033,"[{""id"": 80, ""name"": ""Crime""}, {""id"": 9648, ""na...",320,"[{""id"": 703, ""name"": ""detective""}, {""id"": 718,...",two los angeles homicide detectives are dispat...,41.322708,Insomnia,6.8,1148,6.627302
1610,"[{""id"": 28, ""name"": ""Action""}, {""id"": 53, ""nam...",50456,"[{""id"": 782, ""name"": ""assassin""}, {""id"": 1430,...",a 16-year-old girl raised by her father to be ...,31.191975,Hanna,6.5,1263,6.407572
529,"[{""id"": 28, ""name"": ""Action""}, {""id"": 18, ""nam...",9567,"[{""id"": 4595, ""name"": ""u.s. army""}, {""id"": 758...",navy seal lieutenant a.k. waters and his elite...,27.055085,Tears of the Sun,6.4,573,6.279314
570,"[{""id"": 28, ""name"": ""Action""}, {""id"": 53, ""nam...",3595,"[{""id"": 800, ""name"": ""bounty""}, {""id"": 1452, ""...","when a rich man's son is kidnapped, he coopera...",16.411345,Ransom,6.4,470,6.264533
2766,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 80, ""nam...",2084,"[{""id"": 293, ""name"": ""female nudity""}, {""id"": ...",a shy bank clerk orders a russian mail order b...,5.831302,Birthday Girl,6.1,103,6.094456
1341,"[{""id"": 28, ""name"": ""Action""}, {""id"": 14, ""nam...",16911,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","on the threshold of 22nd century, furrowing th...",2.785832,The Inhabited Island,5.3,23,6.04663
3723,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 28, ""...",270938,"[{""id"": 1794, ""name"": ""yakuza""}, {""id"": 11399,...","chapman is an ex-marine in brazil's slums, bat...",6.988357,Falcon Rising,5.5,71,5.997674
