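# Building a Movie Recommendation System
* Content-based Filtering
  * Recommends new items using only the features of the items themselves, without using aggregated user behavior data
  * Feature extraction methods: **TF-IDF, Word2Vec**, etc.
  * Item-to-item similarity measures: **cosine similarity, Euclidean distance, Jaccard similarity, Pearson correlation**
  * Using the computed item-to-item similarities, recommend the 10 most similar items when a user likes a particular item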
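
As a rough illustration of the idea above, the following is a minimal sketch of two of the listed similarity measures (cosine and Jaccard) and a top-N lookup, written in plain Python. The function names and the toy items are hypothetical and are not part of the notebook; raw term counts stand in for TF-IDF weights to keep the example small.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two token lists, using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard_similarity(a, b):
    """Jaccard similarity between the sets of tokens in two lists."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def top_n_similar(target, items, n=10, sim=cosine_similarity):
    """Return up to n (title, score) pairs most similar to `target`, excluding it."""
    scores = [
        (title, sim(items[target], tokens))
        for title, tokens in items.items()
        if title != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]


# Hypothetical toy items: title -> list of feature tokens
toy_items = {
    "A": ["action", "space", "war"],
    "B": ["action", "space", "future"],
    "C": ["romance", "comedy"],
}
print(top_n_similar("A", toy_items, n=2))
```

Passing `sim=jaccard_similarity` switches the measure without changing the ranking logic.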

* Dataset: [TMDB 5000 Movie Dataset](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata)

  * tmdb_5000_credits
  * tmdb_5000_movies

* Columns used: genres, id, keywords, overview, popularity, release_date, status (Released only), tagline, title, cast, crew (director only)

In [1]:
import pandas as pd
import numpy as np

### 1) credits data
* **samples: 4,803**
* **features: 4** (int: 1, object: 3)
    * movie_id: ID of each movie
    * title: movie title (matches the title column of tmdb_5000_movies)
    * cast: the movie's cast
    * crew: the director and production crew

In [2]:
credits = pd.read_csv(r"C:\Users\mcw08\OneDrive\바탕 화면\BDA 7기\영화 추천시스템 프로젝트\archive\tmdb_5000_credits.csv")
credits.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


### 2) movies data
* **samples: 4,803**
* **features: 20** (int: 4, float: 3, object: 13)
  * budget: movie budget (in US dollars)
  * genres: list of genres (up to 6 per movie; some rows, e.g. Shanghai Calling, have an empty list) > can be extracted with the dictionary key 'name'
  * homepage: movie homepage > many of the URLs no longer exist
  * id: unique movie identifier
  * keywords: list of keywords > extract with the dictionary key 'name'
  * original_language: original language of the movie > 37 languages in total
  * original_title: original title of the movie
  * overview: summary of the plot
  * popularity: popularity score
  * production_companies: list of production companies > extract with the dictionary key 'name'
  * production_countries: list of production countries > extract with the dictionary key 'name'
  * release_date: release date
  * revenue: movie revenue (in US dollars)
  * runtime: running time (in minutes)
  * spoken_languages: list of languages spoken in the movie > extract with the dictionary key 'name'
  * status: release status > 'Released', 'Post Production' (shooting finished, post-production in progress), 'Rumored' (production planned)
  * tagline: tagline on the movie poster
  * title: movie title
  * vote_average: average rating > between 0 and 10
  * vote_count: number of ratings

In [3]:
movies = pd.read_csv(r"C:\Users\mcw08\OneDrive\바탕 화면\BDA 7기\영화 추천시스템 프로젝트\archive\tmdb_5000_movies.csv")
movies.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


* Merge the two datasets on the 'title' column; the result is `df`

In [4]:
df = movies.merge(credits,on='title')
df

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,206647,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.312950,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,49026,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,49529,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4804,220000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",,9367,"[{""id"": 5616, ""name"": ""united states\u2013mexi...",es,El Mariachi,El Mariachi just wants to play his guitar and ...,14.269792,"[{""name"": ""Columbia Pictures"", ""id"": 5}]",...,81.0,"[{""iso_639_1"": ""es"", ""name"": ""Espa\u00f1ol""}]",Released,"He didn't come looking for trouble, but troubl...",El Mariachi,6.6,238,9367,"[{""cast_id"": 1, ""character"": ""El Mariachi"", ""c...","[{""credit_id"": ""52fe44eec3a36847f80b280b"", ""de..."
4805,9000,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 10749, ""...",,72766,[],en,Newlyweds,A newlywed couple's honeymoon is upended by th...,0.642552,[],...,85.0,[],Released,A newlywed couple's honeymoon is upended by th...,Newlyweds,5.9,5,72766,"[{""cast_id"": 1, ""character"": ""Buzzy"", ""credit_...","[{""credit_id"": ""52fe487dc3a368484e0fb013"", ""de..."
4806,0,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...",http://www.hallmarkchannel.com/signedsealeddel...,231617,"[{""id"": 248, ""name"": ""date""}, {""id"": 699, ""nam...",en,"Signed, Sealed, Delivered","""Signed, Sealed, Delivered"" introduces a dedic...",1.444476,"[{""name"": ""Front Street Pictures"", ""id"": 3958}...",...,120.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,,"Signed, Sealed, Delivered",7.0,6,231617,"[{""cast_id"": 8, ""character"": ""Oliver O\u2019To...","[{""credit_id"": ""52fe4df3c3a36847f8275ecf"", ""de..."
4807,0,[],http://shanghaicalling.com/,126186,[],en,Shanghai Calling,When ambitious New York attorney Sam is sent t...,0.857008,[],...,98.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,A New Yorker in Shanghai,Shanghai Calling,5.7,7,126186,"[{""cast_id"": 3, ""character"": ""Sam"", ""credit_id...","[{""credit_id"": ""52fe4ad9c3a368484e16a36b"", ""de..."
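
Both input files have 4,803 samples, yet the row index in the merged output above runs to 4807, which suggests that some titles are not unique (an inference from the displayed output, not something verified here). The sketch below uses small hypothetical frames to show how a merge on a non-unique `title` multiplies rows, and how merging on the ID columns (`id` and `movie_id`) with `validate="one_to_one"` would guard against that. It is an alternative to consider, not what the notebook does.

```python
import pandas as pd

# Hypothetical toy frames: two different movies share the title "Batman".
movies_toy = pd.DataFrame(
    {"id": [1, 2, 3], "title": ["Batman", "Batman", "Avatar"]}
)
credits_toy = pd.DataFrame(
    {
        "movie_id": [1, 2, 3],
        "title": ["Batman", "Batman", "Avatar"],
        "cast": ["c1", "c2", "c3"],
    }
)

# Merging on title pairs every "Batman" row with every other "Batman" row.
by_title = movies_toy.merge(credits_toy, on="title")

# Merging on the ID columns keeps one row per movie; validate raises
# pandas.errors.MergeError if either key turns out not to be unique.
by_id = movies_toy.merge(
    credits_toy.drop(columns="title"),
    left_on="id",
    right_on="movie_id",
    validate="one_to_one",
)

print(len(by_title), len(by_id))
```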


In [5]:
df.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count', 'movie_id', 'cast', 'crew'],
      dtype='object')

### Preprocessing
1) **Select only the columns we need**: genres, id, keywords, overview, popularity, release_date, status (Released only), tagline, title, cast, crew (director only)

2) **Movie status (status)**: keep only rows with the value 'Released'

3) **Drop rows with a null release date (release_date)**: 1 row

4) **Extract only the year from release_date** > for sorting

5) **Extract the values of the dictionary key 'name' from genres and keywords**

6) **Keep only the top 3 cast members** > focus on the lead actors

7) **Extract only the director from crew** (the 'name' of the entry whose 'job' is 'Director')

8) **Preprocess overview and tagline**: split each sentence into words and store them as a list

9) **Remove whitespace from each element** of genres, keywords, overview, tagline, cast, and crew: e.g. 'Action' and 'Action ' would otherwise be treated as different values

10) **Create a new column 'tags'** for measuring cosine similarity: DF['tags'] = DF['genres'] + DF['overview'] + DF['keywords'] + DF['tagline']  + DF['cast'] + DF['crew']

11) **Add the 'popularity' column** to the DataFrame for sorting

12) **Text data preprocessing**
    * Convert the values of the tags column to lowercase: for search consistency (users often do not distinguish upper and lower case when searching)
    * Convert the tags column from list to str
    * Check for null values > none
    * Extract the stem of each word and split on whitespace
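
Steps 9, 10, and 12 above can be sketched on a single row as follows. This is a minimal illustration under stated assumptions: `strip_spaces` and `build_tags` are hypothetical helper names, the row is made-up sample data, and "removing whitespace" is taken to mean removing all spaces inside an element (so that a multi-word name becomes one token). Stemming is left out because it depends on a library choice that this section does not show.

```python
def strip_spaces(tokens):
    """Remove all spaces inside each element, e.g. 'Sam Worthington' -> 'SamWorthington'."""
    return [t.replace(" ", "") for t in tokens]


def build_tags(row):
    """Concatenate the feature lists in the order of step 10 and return one lowercase string."""
    parts = (
        strip_spaces(row["genres"])
        + strip_spaces(row["overview"])
        + strip_spaces(row["keywords"])
        + strip_spaces(row["tagline"])
        + strip_spaces(row["cast"])
        + strip_spaces(row["crew"])
    )
    return " ".join(parts).lower()


# Hypothetical sample row, already converted to lists of strings
row = {
    "genres": ["Science Fiction"],
    "overview": ["A", "Marine"],
    "keywords": ["space war"],
    "tagline": ["Enter"],
    "cast": ["Sam Worthington"],
    "crew": ["James Cameron"],
}
print(build_tags(row))
```

Joining a multi-word name into one token keeps, for example, two different people named "Sam" from being matched on the first name alone.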

#### Keep only rows whose status is 'Released'

In [6]:
df['status'].unique()

array(['Released', 'Post Production', 'Rumored'], dtype=object)

In [7]:
df = df[df['status']=='Released']
df

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,206647,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.312950,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,49026,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,49529,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4804,220000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",,9367,"[{""id"": 5616, ""name"": ""united states\u2013mexi...",es,El Mariachi,El Mariachi just wants to play his guitar and ...,14.269792,"[{""name"": ""Columbia Pictures"", ""id"": 5}]",...,81.0,"[{""iso_639_1"": ""es"", ""name"": ""Espa\u00f1ol""}]",Released,"He didn't come looking for trouble, but troubl...",El Mariachi,6.6,238,9367,"[{""cast_id"": 1, ""character"": ""El Mariachi"", ""c...","[{""credit_id"": ""52fe44eec3a36847f80b280b"", ""de..."
4805,9000,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 10749, ""...",,72766,[],en,Newlyweds,A newlywed couple's honeymoon is upended by th...,0.642552,[],...,85.0,[],Released,A newlywed couple's honeymoon is upended by th...,Newlyweds,5.9,5,72766,"[{""cast_id"": 1, ""character"": ""Buzzy"", ""credit_...","[{""credit_id"": ""52fe487dc3a368484e0fb013"", ""de..."
4806,0,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...",http://www.hallmarkchannel.com/signedsealeddel...,231617,"[{""id"": 248, ""name"": ""date""}, {""id"": 699, ""nam...",en,"Signed, Sealed, Delivered","""Signed, Sealed, Delivered"" introduces a dedic...",1.444476,"[{""name"": ""Front Street Pictures"", ""id"": 3958}...",...,120.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,,"Signed, Sealed, Delivered",7.0,6,231617,"[{""cast_id"": 8, ""character"": ""Oliver O\u2019To...","[{""credit_id"": ""52fe4df3c3a36847f8275ecf"", ""de..."
4807,0,[],http://shanghaicalling.com/,126186,[],en,Shanghai Calling,When ambitious New York attorney Sam is sent t...,0.857008,[],...,98.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,A New Yorker in Shanghai,Shanghai Calling,5.7,7,126186,"[{""cast_id"": 3, ""character"": ""Sam"", ""credit_id...","[{""credit_id"": ""52fe4ad9c3a368484e16a36b"", ""de..."


#### Create the DF dataset with only the columns we need

In [8]:
DF = df[['id','title','genres','keywords', 'status','overview','tagline','popularity','release_date', 'cast', 'crew']]
DF

Unnamed: 0,id,title,genres,keywords,status,overview,tagline,popularity,release_date,cast,crew
0,19995,Avatar,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",Released,"In the 22nd century, a paraplegic Marine is di...",Enter the World of Pandora.,150.437577,2009-12-10,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",Released,"Captain Barbossa, long believed to be dead, ha...","At the end of the world, the adventure begins.",139.082615,2007-05-19,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",Released,A cryptic message from Bond’s past sends him o...,A Plan No One Escapes,107.376788,2015-10-26,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",Released,Following the death of District Attorney Harve...,The Legend Ends,112.312950,2012-07-16,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",Released,"John Carter is a war-weary, former military ca...","Lost in our world, found in another.",43.926995,2012-03-07,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."
...,...,...,...,...,...,...,...,...,...,...,...
4804,9367,El Mariachi,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 5616, ""name"": ""united states\u2013mexi...",Released,El Mariachi just wants to play his guitar and ...,"He didn't come looking for trouble, but troubl...",14.269792,1992-09-04,"[{""cast_id"": 1, ""character"": ""El Mariachi"", ""c...","[{""credit_id"": ""52fe44eec3a36847f80b280b"", ""de..."
4805,72766,Newlyweds,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 10749, ""...",[],Released,A newlywed couple's honeymoon is upended by th...,A newlywed couple's honeymoon is upended by th...,0.642552,2011-12-26,"[{""cast_id"": 1, ""character"": ""Buzzy"", ""credit_...","[{""credit_id"": ""52fe487dc3a368484e0fb013"", ""de..."
4806,231617,"Signed, Sealed, Delivered","[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...","[{""id"": 248, ""name"": ""date""}, {""id"": 699, ""nam...",Released,"""Signed, Sealed, Delivered"" introduces a dedic...",,1.444476,2013-10-13,"[{""cast_id"": 8, ""character"": ""Oliver O\u2019To...","[{""credit_id"": ""52fe4df3c3a36847f8275ecf"", ""de..."
4807,126186,Shanghai Calling,[],[],Released,When ambitious New York attorney Sam is sent t...,A New Yorker in Shanghai,0.857008,2012-05-03,"[{""cast_id"": 3, ""character"": ""Sam"", ""credit_id...","[{""credit_id"": ""52fe4ad9c3a368484e16a36b"", ""de..."


In [9]:
# tagline has 838 nulls, overview 3, and release_date 1; the other columns have none.

DF.isna().sum()

id                0
title             0
genres            0
keywords          0
status            0
overview          3
tagline         838
popularity        0
release_date      1
cast              0
crew              0
dtype: int64

In [10]:
# Drop the row with a null release_date, because the column used for sorting must not contain nulls

DF = DF.dropna(subset=['release_date'])
DF

Unnamed: 0,id,title,genres,keywords,status,overview,tagline,popularity,release_date,cast,crew
0,19995,Avatar,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",Released,"In the 22nd century, a paraplegic Marine is di...",Enter the World of Pandora.,150.437577,2009-12-10,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",Released,"Captain Barbossa, long believed to be dead, ha...","At the end of the world, the adventure begins.",139.082615,2007-05-19,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",Released,A cryptic message from Bond’s past sends him o...,A Plan No One Escapes,107.376788,2015-10-26,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",Released,Following the death of District Attorney Harve...,The Legend Ends,112.312950,2012-07-16,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",Released,"John Carter is a war-weary, former military ca...","Lost in our world, found in another.",43.926995,2012-03-07,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."
...,...,...,...,...,...,...,...,...,...,...,...
4804,9367,El Mariachi,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 5616, ""name"": ""united states\u2013mexi...",Released,El Mariachi just wants to play his guitar and ...,"He didn't come looking for trouble, but troubl...",14.269792,1992-09-04,"[{""cast_id"": 1, ""character"": ""El Mariachi"", ""c...","[{""credit_id"": ""52fe44eec3a36847f80b280b"", ""de..."
4805,72766,Newlyweds,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 10749, ""...",[],Released,A newlywed couple's honeymoon is upended by th...,A newlywed couple's honeymoon is upended by th...,0.642552,2011-12-26,"[{""cast_id"": 1, ""character"": ""Buzzy"", ""credit_...","[{""credit_id"": ""52fe487dc3a368484e0fb013"", ""de..."
4806,231617,"Signed, Sealed, Delivered","[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...","[{""id"": 248, ""name"": ""date""}, {""id"": 699, ""nam...",Released,"""Signed, Sealed, Delivered"" introduces a dedic...",,1.444476,2013-10-13,"[{""cast_id"": 8, ""character"": ""Oliver O\u2019To...","[{""credit_id"": ""52fe4df3c3a36847f8275ecf"", ""de..."
4807,126186,Shanghai Calling,[],[],Released,When ambitious New York attorney Sam is sent t...,A New Yorker in Shanghai,0.857008,2012-05-03,"[{""cast_id"": 3, ""character"": ""Sam"", ""credit_id...","[{""credit_id"": ""52fe4ad9c3a368484e16a36b"", ""de..."


#### Extract only the year from release_date

In [11]:
DF['release_date'] = pd.to_datetime(DF['release_date'])
DF['release_date'] = DF['release_date'].dt.year
DF.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  DF['release_date'] = pd.to_datetime(DF['release_date'])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  DF['release_date'] = DF['release_date'].dt.year


Unnamed: 0,id,title,genres,keywords,status,overview,tagline,popularity,release_date,cast,crew
0,19995,Avatar,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",Released,"In the 22nd century, a paraplegic Marine is di...",Enter the World of Pandora.,150.437577,2009,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",Released,"Captain Barbossa, long believed to be dead, ha...","At the end of the world, the adventure begins.",139.082615,2007,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",Released,A cryptic message from Bond’s past sends him o...,A Plan No One Escapes,107.376788,2015,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",Released,Following the death of District Attorney Harve...,The Legend Ends,112.31295,2012,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",Released,"John Carter is a war-weary, former military ca...","Lost in our world, found in another.",43.926995,2012,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."
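
The message shown above is pandas' SettingWithCopyWarning: `DF` was derived from a filtered DataFrame, so pandas cannot tell whether the assignment is meant to modify the original. One common way to avoid it, assuming the filtered frame is intended to be independent of the original, is to call `.copy()` right after filtering. The sketch below shows this on a small hypothetical frame; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for the merged movie table
raw = pd.DataFrame(
    {
        "status": ["Released", "Rumored", "Released"],
        "release_date": ["2009-12-10", None, "2012-03-07"],
    }
)

# .copy() makes `released` an independent DataFrame, so later column
# assignments are unambiguous and leave `raw` untouched.
released = raw[raw["status"] == "Released"].copy()
released["release_date"] = pd.to_datetime(released["release_date"]).dt.year

print(released)
```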


#### Extract the 'name' values

In [12]:
import ast

def convert(obj):
    """Parse a stringified list of dicts and return the 'name' of each entry."""
    # ast.literal_eval turns the string into a Python list of dicts
    return [item['name'] for item in ast.literal_eval(obj)]
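
The `convert` function above relies on `ast.literal_eval`, which parses Python literals. The column values look like JSON, and JSON-only tokens such as `null` or `true` are not valid Python literals, so a JSON parser may be the more robust choice. The following is a minimal sketch assuming the columns hold valid JSON; `extract_names` and its `limit` parameter are hypothetical and not part of the notebook.

```python
import json


def extract_names(raw, limit=None):
    """Parse a JSON-encoded list of dicts and return their 'name' values.

    If `limit` is given, only the first `limit` names are returned.
    """
    records = json.loads(raw)
    names = [rec["name"] for rec in records]
    return names if limit is None else names[:limit]


# Sample string in the same shape as the genres column
sample = (
    '[{"id": 28, "name": "Action"}, '
    '{"id": 12, "name": "Adventure"}, '
    '{"id": 14, "name": "Fantasy"}]'
)
print(extract_names(sample))
print(extract_names(sample, limit=2))
```

The optional `limit` would let the same helper serve both the "all names" case and a "first few names" case.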

In [13]:
DF['genres'] = DF['genres'].apply(convert)
DF['keywords'] = DF['keywords'].apply(convert)
DF.head()

Unnamed: 0,id,title,genres,keywords,status,overview,tagline,popularity,release_date,cast,crew
0,19995,Avatar,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...",Released,"In the 22nd century, a paraplegic Marine is di...",Enter the World of Pandora.,150.437577,2009,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...",Released,"Captain Barbossa, long believed to be dead, ha...","At the end of the world, the adventure begins.",139.082615,2007,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...",Released,A cryptic message from Bond’s past sends him o...,A Plan No One Escapes,107.376788,2015,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...",Released,Following the death of District Attorney Harve...,The Legend Ends,112.31295,2012,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...",Released,"John Carter is a war-weary, former military ca...","Lost in our world, found in another.",43.926995,2012,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


#### Extract only the top 3 cast members

In [14]:
def convert3(obj):
    """Return the names of the first three cast entries (the lead actors)."""
    # Slicing with [:3] also handles movies with fewer than three cast members
    return [item['name'] for item in ast.literal_eval(obj)[:3]]

In [15]:
DF['cast'] = DF['cast'].apply(convert3)


#### Extract the director's name

In [16]:
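def extract_director(obj):
    """Return a one-element list with the first director's name, or [] if none is found."""
    l = []
    for i in ast.literal_eval(obj):
        if i['job'] == 'Director':
            l.append(i['name'])
            break  # keep only the first director listed
    return l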

In [17]:
DF['crew'] = DF['crew'].apply(extract_director)


In [18]:
DF.head()

Unnamed: 0,id,title,genres,keywords,status,overview,tagline,popularity,release_date,cast,crew
0,19995,Avatar,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...",Released,"In the 22nd century, a paraplegic Marine is di...",Enter the World of Pandora.,150.437577,2009,"[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...",Released,"Captain Barbossa, long believed to be dead, ha...","At the end of the world, the adventure begins.",139.082615,2007,"[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]
2,206647,Spectre,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...",Released,A cryptic message from Bond’s past sends him o...,A Plan No One Escapes,107.376788,2015,"[Daniel Craig, Christoph Waltz, Léa Seydoux]",[Sam Mendes]
3,49026,The Dark Knight Rises,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...",Released,Following the death of District Attorney Harve...,The Legend Ends,112.31295,2012,"[Christian Bale, Michael Caine, Gary Oldman]",[Christopher Nolan]
4,49529,John Carter,"[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...",Released,"John Carter is a war-weary, former military ca...","Lost in our world, found in another.",43.926995,2012,"[Taylor Kitsch, Lynn Collins, Samantha Morton]",[Andrew Stanton]


#### Splitting text into words
* Split each `overview` and `tagline` string into a list of words.
* Null values become an empty list.

In [19]:
DF['overview'] = [x.split() if pd.notnull(x) else [] for x in DF['overview']]
DF['tagline'] = [x.split() if pd.notnull(x) else [] for x in DF['tagline']]
DF.head()


Unnamed: 0,id,title,genres,keywords,status,overview,tagline,popularity,release_date,cast,crew
0,19995,Avatar,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...",Released,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Enter, the, World, of, Pandora.]",150.437577,2009,"[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...",Released,"[Captain, Barbossa,, long, believed, to, be, d...","[At, the, end, of, the, world,, the, adventure...",139.082615,2007,"[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]
2,206647,Spectre,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...",Released,"[A, cryptic, message, from, Bond’s, past, send...","[A, Plan, No, One, Escapes]",107.376788,2015,"[Daniel Craig, Christoph Waltz, Léa Seydoux]",[Sam Mendes]
3,49026,The Dark Knight Rises,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...",Released,"[Following, the, death, of, District, Attorney...","[The, Legend, Ends]",112.31295,2012,"[Christian Bale, Michael Caine, Gary Oldman]",[Christopher Nolan]
4,49529,John Carter,"[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...",Released,"[John, Carter, is, a, war-weary,, former, mili...","[Lost, in, our, world,, found, in, another.]",43.926995,2012,"[Taylor Kitsch, Lynn Collins, Samantha Morton]",[Andrew Stanton]


#### Removing whitespace

In [20]:
DF['genres'] = DF['genres'].apply(lambda x: [i.replace(" ","") for i in x])
DF['keywords'] = DF['keywords'].apply(lambda x: [i.replace(" ","") for i in x])
DF['overview'] = DF['overview'].apply(lambda x: [i.replace(" ","") for i in x])
DF['tagline'] = DF['tagline'].apply(lambda x: [i.replace(" ","") for i in x])
DF['cast'] = DF['cast'].apply(lambda x: [i.replace(" ","") for i in x])
DF['crew'] = DF['crew'].apply(lambda x: [i.replace(" ","") for i in x])

In [21]:
DF

Unnamed: 0,id,title,genres,keywords,status,overview,tagline,popularity,release_date,cast,crew
0,19995,Avatar,"[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...",Released,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Enter, the, World, of, Pandora.]",150.437577,2009,"[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...",Released,"[Captain, Barbossa,, long, believed, to, be, d...","[At, the, end, of, the, world,, the, adventure...",139.082615,2007,"[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]
2,206647,Spectre,"[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...",Released,"[A, cryptic, message, from, Bond’s, past, send...","[A, Plan, No, One, Escapes]",107.376788,2015,"[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes]
3,49026,The Dark Knight Rises,"[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...",Released,"[Following, the, death, of, District, Attorney...","[The, Legend, Ends]",112.312950,2012,"[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan]
4,49529,John Carter,"[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...",Released,"[John, Carter, is, a, war-weary,, former, mili...","[Lost, in, our, world,, found, in, another.]",43.926995,2012,"[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton]
...,...,...,...,...,...,...,...,...,...,...,...
4804,9367,El Mariachi,"[Action, Crime, Thriller]","[unitedstates–mexicobarrier, legs, arms, paper...",Released,"[El, Mariachi, just, wants, to, play, his, gui...","[He, didn't, come, looking, for, trouble,, but...",14.269792,1992,"[CarlosGallardo, JaimedeHoyos, PeterMarquardt]",[RobertRodriguez]
4805,72766,Newlyweds,"[Comedy, Romance]",[],Released,"[A, newlywed, couple's, honeymoon, is, upended...","[A, newlywed, couple's, honeymoon, is, upended...",0.642552,2011,"[EdwardBurns, KerryBishé, MarshaDietlein]",[EdwardBurns]
4806,231617,"Signed, Sealed, Delivered","[Comedy, Drama, Romance, TVMovie]","[date, loveatfirstsight, narration, investigat...",Released,"[""Signed,, Sealed,, Delivered"", introduces, a,...",[],1.444476,2013,"[EricMabius, KristinBooth, CrystalLowe]",[ScottSmith]
4807,126186,Shanghai Calling,[],[],Released,"[When, ambitious, New, York, attorney, Sam, is...","[A, New, Yorker, in, Shanghai]",0.857008,2012,"[DanielHenney, ElizaCoupe, BillPaxton]",[DanielHsia]
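
One likely reason for merging multi-word values such as "Sam Worthington" into a single token is that a word-level vectorizer would otherwise match two different people on a shared first name. The sketch below illustrates that effect with invented names; `tokens` and the cast lists are hypothetical and not part of the notebook.

```python
# Hypothetical cast lists for two films (names invented for illustration).
cast_a = ["Sam Smith", "Jane Doe"]
cast_b = ["Sam Jones", "Mary Major"]

def tokens(names, merge):
    """Lower-case the names and split on whitespace, optionally merging each name first."""
    out = []
    for name in names:
        text = name.replace(" ", "") if merge else name
        out.extend(text.lower().split())
    return set(out)

# Without merging, the two films appear to share the token "sam".
shared_split = tokens(cast_a, merge=False) & tokens(cast_b, merge=False)
# With merging, "samsmith" and "samjones" are distinct, so nothing is shared.
shared_merged = tokens(cast_a, merge=True) & tokens(cast_b, merge=True)

print(shared_split)   # {'sam'}
print(shared_merged)  # set()
```

Note that `overview` and `tagline` were already split on whitespace, so the `replace(" ", "")` calls on those two columns leave them unchanged.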


#### Creating a new column

In [22]:
DF.columns

Index(['id', 'title', 'genres', 'keywords', 'status', 'overview', 'tagline',
       'popularity', 'release_date', 'cast', 'crew'],
      dtype='object')

In [23]:
DF['tags'] = DF['genres'] + DF['overview'] + DF['keywords'] + DF['tagline']  + DF['cast'] + DF['crew']
DF


Unnamed: 0,id,title,genres,keywords,status,overview,tagline,popularity,release_date,cast,crew,tags
0,19995,Avatar,"[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...",Released,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Enter, the, World, of, Pandora.]",150.437577,2009,"[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[Action, Adventure, Fantasy, ScienceFiction, I..."
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...",Released,"[Captain, Barbossa,, long, believed, to, be, d...","[At, the, end, of, the, world,, the, adventure...",139.082615,2007,"[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski],"[Adventure, Fantasy, Action, Captain, Barbossa..."
2,206647,Spectre,"[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...",Released,"[A, cryptic, message, from, Bond’s, past, send...","[A, Plan, No, One, Escapes]",107.376788,2015,"[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes],"[Action, Adventure, Crime, A, cryptic, message..."
3,49026,The Dark Knight Rises,"[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...",Released,"[Following, the, death, of, District, Attorney...","[The, Legend, Ends]",112.312950,2012,"[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan],"[Action, Crime, Drama, Thriller, Following, th..."
4,49529,John Carter,"[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...",Released,"[John, Carter, is, a, war-weary,, former, mili...","[Lost, in, our, world,, found, in, another.]",43.926995,2012,"[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton],"[Action, Adventure, ScienceFiction, John, Cart..."
...,...,...,...,...,...,...,...,...,...,...,...,...
4804,9367,El Mariachi,"[Action, Crime, Thriller]","[unitedstates–mexicobarrier, legs, arms, paper...",Released,"[El, Mariachi, just, wants, to, play, his, gui...","[He, didn't, come, looking, for, trouble,, but...",14.269792,1992,"[CarlosGallardo, JaimedeHoyos, PeterMarquardt]",[RobertRodriguez],"[Action, Crime, Thriller, El, Mariachi, just, ..."
4805,72766,Newlyweds,"[Comedy, Romance]",[],Released,"[A, newlywed, couple's, honeymoon, is, upended...","[A, newlywed, couple's, honeymoon, is, upended...",0.642552,2011,"[EdwardBurns, KerryBishé, MarshaDietlein]",[EdwardBurns],"[Comedy, Romance, A, newlywed, couple's, honey..."
4806,231617,"Signed, Sealed, Delivered","[Comedy, Drama, Romance, TVMovie]","[date, loveatfirstsight, narration, investigat...",Released,"[""Signed,, Sealed,, Delivered"", introduces, a,...",[],1.444476,2013,"[EricMabius, KristinBooth, CrystalLowe]",[ScottSmith],"[Comedy, Drama, Romance, TVMovie, ""Signed,, Se..."
4807,126186,Shanghai Calling,[],[],Released,"[When, ambitious, New, York, attorney, Sam, is...","[A, New, Yorker, in, Shanghai]",0.857008,2012,"[DanielHenney, ElizaCoupe, BillPaxton]",[DanielHsia],"[When, ambitious, New, York, attorney, Sam, is..."


#### New dataset: final
* `popularity` is kept so that the top 10 similar movies can later be sorted by popularity.

In [24]:
final = DF[['id','title','tags','popularity','release_date']].copy()  # copy so later column assignments do not trigger SettingWithCopyWarning
final

Unnamed: 0,id,title,tags,popularity,release_date
0,19995,Avatar,"[Action, Adventure, Fantasy, ScienceFiction, I...",150.437577,2009
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action, Captain, Barbossa...",139.082615,2007
2,206647,Spectre,"[Action, Adventure, Crime, A, cryptic, message...",107.376788,2015
3,49026,The Dark Knight Rises,"[Action, Crime, Drama, Thriller, Following, th...",112.312950,2012
4,49529,John Carter,"[Action, Adventure, ScienceFiction, John, Cart...",43.926995,2012
...,...,...,...,...,...
4804,9367,El Mariachi,"[Action, Crime, Thriller, El, Mariachi, just, ...",14.269792,1992
4805,72766,Newlyweds,"[Comedy, Romance, A, newlywed, couple's, honey...",0.642552,2011
4806,231617,"Signed, Sealed, Delivered","[Comedy, Drama, Romance, TVMovie, ""Signed,, Se...",1.444476,2013
4807,126186,Shanghai Calling,"[When, ambitious, New, York, attorney, Sam, is...",0.857008,2012


#### Lower-casing everything collected in tags

In [25]:
final['tags'] = final['tags'].apply(lambda x: [word.lower() for word in x])


In [26]:
final.head()

Unnamed: 0,id,title,tags,popularity,release_date
0,19995,Avatar,"[action, adventure, fantasy, sciencefiction, i...",150.437577,2009
1,285,Pirates of the Caribbean: At World's End,"[adventure, fantasy, action, captain, barbossa...",139.082615,2007
2,206647,Spectre,"[action, adventure, crime, a, cryptic, message...",107.376788,2015
3,49026,The Dark Knight Rises,"[action, crime, drama, thriller, following, th...",112.31295,2012
4,49529,John Carter,"[action, adventure, sciencefiction, john, cart...",43.926995,2012


#### Converting the lists back to strings

In [27]:
final['tags'] = final['tags'].apply(lambda x: " ".join(x))


In [28]:
final.head()

Unnamed: 0,id,title,tags,popularity,release_date
0,19995,Avatar,action adventure fantasy sciencefiction in the...,150.437577,2009
1,285,Pirates of the Caribbean: At World's End,"adventure fantasy action captain barbossa, lon...",139.082615,2007
2,206647,Spectre,action adventure crime a cryptic message from ...,107.376788,2015
3,49026,The Dark Knight Rises,action crime drama thriller following the deat...,112.31295,2012
4,49529,John Carter,action adventure sciencefiction john carter is...,43.926995,2012


#### Checking for null values

In [29]:
final.isna().sum()

id              0
title           0
tags            0
popularity      0
release_date    0
dtype: int64

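#### Stemming each word and returning a space-separated string
* To vectorize by counts, the word lists were converted into strings delimited by whitespace; each word is now reduced to its stem.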

In [30]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()



In [31]:
def stem(text):
    y = []

    for i in text.split():
        y.append(ps.stem(i))

    return " ".join(y)

In [32]:
final['tags'] = final['tags'].apply(stem)


In [33]:
final

Unnamed: 0,id,title,tags,popularity,release_date
0,19995,Avatar,action adventur fantasi sciencefict in the 22n...,150.437577,2009
1,285,Pirates of the Caribbean: At World's End,"adventur fantasi action captain barbossa, long...",139.082615,2007
2,206647,Spectre,action adventur crime a cryptic messag from bo...,107.376788,2015
3,49026,The Dark Knight Rises,action crime drama thriller follow the death o...,112.312950,2012
4,49529,John Carter,action adventur sciencefict john carter is a w...,43.926995,2012
...,...,...,...,...,...
4804,9367,El Mariachi,action crime thriller el mariachi just want to...,14.269792,1992
4805,72766,Newlyweds,comedi romanc a newlyw couple' honeymoon is up...,0.642552,2011
4806,231617,"Signed, Sealed, Delivered","comedi drama romanc tvmovi ""signed, sealed, de...",1.444476,2013
4807,126186,Shanghai Calling,when ambiti new york attorney sam is sent to s...,0.857008,2012


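#### Document preprocessing

* The most basic way to turn text into numeric vectors: **BOW (Bag of Words)** ([source](https://datascienceschool.net/03%20machine%20learning/03.01.03%20Scikit-Learn%EC%9D%98%20%EB%AC%B8%EC%84%9C%20%EC%A0%84%EC%B2%98%EB%A6%AC%20%EA%B8%B0%EB%8A%A5.html))
    * Steps to build a BOW
        1) Assign each word a unique integer index.
        2) Build a vector that records, at each index position, how many times that word token appears.

* Scikit-Learn classes for document preprocessing
    * DictVectorizer : builds BOW-encoded vectors from a dictionary of word counts
    * CountVectorizer : tokenizes a collection of documents and counts each word to build BOW-encoded vectors
    * TfidfVectorizer : builds BOW-encoded vectors with word weights adjusted by TF-IDF
    * HashingVectorizer : uses a hash function to build BOW-encoded vectors quickly and with little memory

* Use **CountVectorizer** to convert the text in the 'tags' column into word frequencies
    * CountVectorizer : a class that tokenizes the input text and turns it into a vector of token frequencies
    * ex) (input) "hello, I am a data scientist!" > (output) "hello" / "am" / "data" / "scientist" : 4 tokens with the default settings, which lower-case the text and drop punctuation and single-character tokens; with `stop_words='english'`, "am" is removed as well
    * Algorithm
        1) Convert each document into a list of tokens
        2) Count how often each token occurs in each document
        3) Convert each document into a BOW-encoded vector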

In [34]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words = 'english')   # CountVectorizer that drops English stop words
count_matrix = count.fit_transform(final['tags']) # vectorize the 'tags' text and store the result in count_matrix
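
As a rough illustration of the BOW steps, the following pure-Python sketch mimics what CountVectorizer does with its default settings: lower-casing, the documented default token pattern `(?u)\b\w\w+\b`, and one count per vocabulary index. `docs`, `STOP_WORDS`, `tokenize`, and `bow` are hypothetical names, and the stop list is a one-word stand-in rather than scikit-learn's English list.

```python
import re
from collections import Counter

docs = ["hello, I am a data scientist!", "data data everywhere"]
STOP_WORDS = {"am"}  # hypothetical stand-in for a real stop-word list

def tokenize(text):
    """Lower-case, keep runs of 2+ word characters, then drop stop words."""
    found = re.findall(r"(?u)\b\w\w+\b", text.lower())
    return [t for t in found if t not in STOP_WORDS]

# Step 1: give every distinct token an integer index (alphabetical order here).
all_tokens = {t for d in docs for t in tokenize(d)}
vocab = {word: i for i, word in enumerate(sorted(all_tokens))}

def bow(text):
    """Step 2: count each vocabulary word at its index position."""
    counts = Counter(tokenize(text))
    return [counts.get(word, 0) for word in vocab]

print(tokenize(docs[0]))  # ['hello', 'data', 'scientist']
print(vocab)              # {'data': 0, 'everywhere': 1, 'hello': 2, 'scientist': 3}
print(bow(docs[0]))       # [1, 0, 1, 1]
print(bow(docs[1]))       # [2, 1, 0, 0]
```

The punctuation and the single-character words "I" and "a" never reach the vocabulary, which is why the stemmed `tags` strings can be passed in without stripping commas first.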

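* count_matrix 
    * Printed as a sparse matrix, i.e. a matrix in which most elements are 0
    * Each printed entry has the form (i, j) count
    * i : index of the text (document) / j : index of the word
    * The count is the number of times word j appears in text i
    * ex) (0, 244) 1 : the word with index 244 appears once in the first text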
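
To show how the `(i, j) count` entries relate to a dense matrix, here is a small sketch on an invented 2 x 4 count matrix. It stores only the non-zero cells, which is the idea behind a sparse matrix; the data and names are assumptions for illustration, and this is not a description of SciPy's internal storage.

```python
# Invented dense counts: 2 documents x 4 vocabulary words.
dense = [
    [1, 0, 1, 1],
    [2, 1, 0, 0],
]

# Keep only the non-zero cells, keyed by (document index, word index).
entries = {
    (i, j): count
    for i, row in enumerate(dense)
    for j, count in enumerate(row)
    if count != 0
}

# Print in the same "(i, j)<tab>count" layout as the notebook output.
for (i, j), count in entries.items():
    print(f"  ({i}, {j})\t{count}")

print(entries[(1, 0)])  # 2: word 0 appears twice in document 1
print(len(entries))     # 5 stored cells instead of 8
```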

In [35]:
print(count_matrix)

  (0, 549)	1
  (0, 702)	1
  (0, 10598)	1
  (0, 26759)	1
  (0, 244)	1
  (0, 5127)	1
  (0, 22542)	1
  (0, 18964)	2
  (0, 8549)	1
  (0, 20533)	1
  (0, 22482)	2
  (0, 31571)	1
  (0, 20309)	1
  (0, 2905)	1
  (0, 30732)	1
  (0, 11228)	1
  (0, 22134)	1
  (0, 24153)	1
  (0, 1092)	2
  (0, 5829)	1
  (0, 7141)	1
  (0, 11707)	1
  (0, 28354)	1
  (0, 28337)	1
  (0, 28133)	1
  :	:
  (4799, 4016)	2
  (4799, 17826)	1
  (4799, 24032)	1
  (4799, 2087)	1
  (4799, 26941)	1
  (4799, 4141)	1
  (4799, 8974)	1
  (4799, 11655)	1
  (4799, 2098)	1
  (4799, 26631)	1
  (4799, 7078)	2
  (4799, 15573)	1
  (4799, 8655)	1
  (4799, 256)	1
  (4799, 8973)	2
  (4799, 12520)	1
  (4799, 4585)	1
  (4799, 2689)	1
  (4799, 8956)	1
  (4799, 6668)	1
  (4799, 10396)	1
  (4799, 13487)	1
  (4799, 2688)	1
  (4799, 13488)	1
  (4799, 4033)	2


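#### Cosine similarity
* The similarity of two vectors, measured by the cosine of the angle between them
* 1 when the two vectors point in exactly the same direction, -1 when they point in opposite directions
* Count vectors have no negative entries, so the values computed here fall between 0 and 1
* In short, **the closer the cosine similarity is to 1, the more similar the two items are**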
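
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand on invented count vectors; `cosine` is a hypothetical helper, and returning 0.0 for an all-zero vector is a convention chosen here for the example.

```python
import math

def cosine(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch: an empty document matches nothing
    return dot / (norm_u * norm_v)

a = [1, 0, 1, 1]   # invented count vectors
b = [2, 0, 2, 2]   # same direction as a, twice as long
c = [0, 3, 0, 0]   # shares no words with a

print(round(cosine(a, b), 6))  # 1.0
print(cosine(a, c))            # 0.0
```

Because only the direction matters, a long overview and a short one that use the same words in the same proportions still score close to 1.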


In [36]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(count_matrix, count_matrix)
cosine_sim

array([[1.        , 0.10032154, 0.06375767, ..., 0.05103104, 0.        ,
        0.        ],
       [0.10032154, 1.        , 0.06907969, ..., 0.03686049, 0.01814885,
        0.01988107],
       [0.06375767, 0.06907969, 1.        , ..., 0.01952172, 0.        ,
        0.        ],
       ...,
       [0.05103104, 0.03686049, 0.01952172, ..., 1.        , 0.03077287,
        0.03370999],
       [0.        , 0.01814885, 0.        , ..., 0.03077287, 1.        ,
        0.08298827],
       [0.        , 0.01988107, 0.        , ..., 0.03370999, 0.08298827,
        1.        ]])

In [37]:
cosine_sim[0].shape

(4800,)

#### recommend_pop model
* Take the 10 movies most similar by cosine similarity and sort them by popularity in descending order (most popular first)

In [38]:
def recommendations_pop(title, cosine_sim=cosine_sim):
  # Row position of the movie (not its index label: the index has gaps because rows were dropped)
  idx = np.flatnonzero((final['title'] == title).to_numpy())[0]
  # Pair every row position with its similarity to the movie: (position, similarity)
  sim_scores = list(enumerate(cosine_sim[idx]))
  # Sort by cosine similarity, descending
  sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse = True)
  # Keep the 10 best matches, skipping the first entry (the movie itself)
  sim_scores = sim_scores[1:11]
  # Row positions of the 10 recommended movies
  movie_indices = [i[0] for i in sim_scores]
  # Sort by popularity
  recommended_movies = final.iloc[movie_indices].nlargest(10, 'popularity')

  for movie in recommended_movies['title']:
      print(movie)


In [39]:
recommendations_pop('Avatar')

Star Trek Into Darkness
Aliens
Independence Day
Battle: Los Angeles
Predators
Aliens vs Predator: Requiem
Meet Dave
Titan A.E.
Aliens in the Attic
Falcon Rising
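
The two-stage logic (pick the most similar rows first, then reorder only those rows by another column) can be sketched without pandas as follows. All numbers are invented, `recommend` is a hypothetical helper, and positions here are row positions rather than DataFrame index labels. Unlike the notebook function, this sketch excludes the query row explicitly instead of assuming it sorts first.

```python
def recommend(sim_row, self_pos, popularity, k=3):
    """Top-k most similar positions (excluding self_pos), reordered by popularity."""
    others = [j for j in range(len(sim_row)) if j != self_pos]
    # Stage 1: keep the k positions with the highest similarity.
    top = sorted(others, key=lambda j: sim_row[j], reverse=True)[:k]
    # Stage 2: reorder only those k positions by popularity, highest first.
    return sorted(top, key=lambda j: popularity[j], reverse=True)

sim_row = [1.0, 0.2, 0.9, 0.5, 0.1]   # similarity of row 0 to every row
popularity = [50, 80, 10, 30, 99]     # invented popularity scores

print(recommend(sim_row, self_pos=0, popularity=popularity))  # [1, 3, 2]
```

Row 4 is the most popular overall but never appears, because popularity is only used to order the rows that already passed the similarity cut.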


#### recommend_year model
* Take the 10 movies most similar by cosine similarity and sort them by release_date in descending order (newest first)

In [40]:
def recommendations_year(title, cosine_sim=cosine_sim):
  # Row position of the movie (not its index label: the index has gaps because rows were dropped)
  idx = np.flatnonzero((final['title'] == title).to_numpy())[0]
  # Pair every row position with its similarity to the movie: (position, similarity)
  sim_scores = list(enumerate(cosine_sim[idx]))
  # Sort by cosine similarity, descending
  sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse = True)
  # Keep the 10 best matches, skipping the first entry (the movie itself)
  sim_scores = sim_scores[1:11]
  # Row positions of the 10 recommended movies
  movie_indices = [i[0] for i in sim_scores]
  # Sort by release year
  recommended_movies = final.iloc[movie_indices].nlargest(10, 'release_date')

  for movie in recommended_movies['title']:
      print(movie)

In [41]:
recommendations_year('Avatar')

Falcon Rising
Star Trek Into Darkness
Battle: Los Angeles
Predators
Aliens in the Attic
Meet Dave
Aliens vs Predator: Requiem
Titan A.E.
Independence Day
Aliens


## Saving the model for the website implementation

In [42]:
import pickle

In [43]:
movies = final[['id','title','popularity','release_date']].reset_index(drop=True)  # row labels now match the rows of cosine_sim
movies.head(5)

Unnamed: 0,id,title,popularity,release_date
0,19995,Avatar,150.437577,2009
1,285,Pirates of the Caribbean: At World's End,139.082615,2007
2,206647,Spectre,107.376788,2015
3,49026,The Dark Knight Rises,112.31295,2012
4,49529,John Carter,43.926995,2012


In [44]:
with open('movies_final.pickle', 'wb') as f:  # the file is closed once the dump finishes
    pickle.dump(movies, f)

In [45]:
with open('cosine_sim_final.pickle', 'wb') as f:  # the file is closed once the dump finishes
    pickle.dump(cosine_sim, f)
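
On the website side the saved objects would be read back with `pickle.load`. The sketch below shows the round trip on a small invented payload written to a temporary directory; the file name `demo.pickle` and the payload are hypothetical, and the real app would open `movies_final.pickle` and `cosine_sim_final.pickle` instead.

```python
import os
import pickle
import tempfile

# Invented stand-in for the saved objects.
payload = {"titles": ["Movie A", "Movie B"], "sim": [[1.0, 0.3], [0.3, 1.0]]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pickle")
    with open(path, "wb") as f:      # write in binary mode
        pickle.dump(payload, f)
    with open(path, "rb") as f:      # read back in binary mode
        restored = pickle.load(f)

print(restored == payload)  # True
```

Pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.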