# 📽 Netflix Korea: Analysis, Visualization, and a Recommendation System Design

## Goals

```
1. Data analysis (EDA)
2. Visualization
3. Implement a user content recommendation system based on the processed model
```

## 1_Data Analysis (EDA)

In [1]:
import pandas as pd

from ast import literal_eval

In [2]:
netflix_data = pd.read_csv('dataset/netflix_titles.csv')
tmdb_movie_data = pd.read_csv('dataset/tmdb_5000_movies.csv')
tmdb_credits_data = pd.read_csv('dataset/tmdb_5000_credits.csv')

### 1_1_Merging the `tmdb_movie_data` and `tmdb_credits_data` DataFrames

The 'id' column of `tmdb_movie_data` is renamed to 'movie_id' so that it matches the 'movie_id' column of `tmdb_credits_data` and the two DataFrames can be merged on a shared key.
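
As a minimal sketch of this key-alignment step, the toy frames below (hypothetical stand-ins, not the actual TMDB files) rename `id` to `movie_id` and then merge on the shared key. The `validate='one_to_one'` argument is an addition that the notebook itself does not use; it asks pandas to raise an error if either side contains duplicate keys.

```python
import pandas as pd

# Toy stand-ins for the two TMDB tables (illustrative values only)
movies = pd.DataFrame({
    'id': [19995, 285],
    'original_title': ['Avatar', "Pirates of the Caribbean: At World's End"],
})
credits = pd.DataFrame({
    'movie_id': [19995, 285],
    'cast': ['Sam Worthington', 'Johnny Depp'],
})

# Align the key name, then merge on it
movies = movies.rename(columns={'id': 'movie_id'})
merged = movies.merge(credits, on='movie_id', validate='one_to_one')

print(merged.columns.tolist())  # ['movie_id', 'original_title', 'cast']
```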

In [3]:
tmdb_movie_data.rename(columns={'id':'movie_id'}, inplace=True)

In [4]:
tmdb_movie_data.head(2)

Unnamed: 0,budget,genres,homepage,movie_id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [5]:
tmdb_credits_data.columns = ['movie_id', 'cred_title', 'cast', 'crew']
tmdb_movie_data = tmdb_movie_data.merge(tmdb_credits_data, on='movie_id')

In [6]:
tmdb_movie_data.head(2)

Unnamed: 0,budget,genres,homepage,movie_id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cred_title,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [7]:
tmdb_movie_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 0 to 4802
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   movie_id              4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [8]:
tmdb_movie_data = tmdb_movie_data.drop(columns=['budget', 'homepage', 'keywords', 'original_language', 'popularity', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'vote_average', 'vote_count', 'cred_title', 'crew'])

In [9]:
tmdb_movie_data.head(2)

Unnamed: 0,genres,movie_id,original_title,overview,cast
0,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""..."
1,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa..."


In [10]:
tmdb_movie_data['cast']

0       [{"cast_id": 242, "character": "Jake Sully", "...
1       [{"cast_id": 4, "character": "Captain Jack Spa...
2       [{"cast_id": 1, "character": "James Bond", "cr...
3       [{"cast_id": 2, "character": "Bruce Wayne / Ba...
4       [{"cast_id": 5, "character": "John Carter", "c...
                              ...                        
4798    [{"cast_id": 1, "character": "El Mariachi", "c...
4799    [{"cast_id": 1, "character": "Buzzy", "credit_...
4800    [{"cast_id": 8, "character": "Oliver O\u2019To...
4801    [{"cast_id": 3, "character": "Sam", "credit_id...
4802    [{"cast_id": 3, "character": "Herself", "credi...
Name: cast, Length: 4803, dtype: object

In [11]:
tmdb_movie_data['genres']

0       [{"id": 28, "name": "Action"}, {"id": 12, "nam...
1       [{"id": 12, "name": "Adventure"}, {"id": 14, "...
2       [{"id": 28, "name": "Action"}, {"id": 12, "nam...
3       [{"id": 28, "name": "Action"}, {"id": 80, "nam...
4       [{"id": 28, "name": "Action"}, {"id": 12, "nam...
                              ...                        
4798    [{"id": 28, "name": "Action"}, {"id": 80, "nam...
4799    [{"id": 35, "name": "Comedy"}, {"id": 10749, "...
4800    [{"id": 35, "name": "Comedy"}, {"id": 18, "nam...
4801                                                   []
4802                  [{"id": 99, "name": "Documentary"}]
Name: genres, Length: 4803, dtype: object

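The 'cast' and 'genres' columns of the `tmdb_movie_data` DataFrame need to be extracted as lists. Each 'cast' entry lists a character name together with the actor's name, and each 'genres' entry pairs an "id" with a name (for example, "id" 28 with "Action"), so these values have to be separated.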

Only the actor names are needed from 'cast', and only names such as "Action" and "Comedy" are needed from 'genres'. Python's [`literal_eval()`](https://docs.python.org/3/library/ast.html#ast.literal_eval) function is used to parse each string into Python objects so that these names can be extracted.
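
A minimal sketch of the parsing step, applied to one hand-written string in the same shape as a 'genres' cell (the sample value is illustrative). `literal_eval` turns the string into a list of dicts, after which the "name" values can be picked out.

```python
from ast import literal_eval

# A string shaped like one 'genres' cell (illustrative sample)
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = literal_eval(raw)                  # str -> list of dicts
names = [item['name'] for item in parsed]   # keep only the "name" values

print(names)  # ['Action', 'Adventure']
```

Because these cells are JSON-formatted, `json.loads` would likely work as an alternative parser; the notebook uses `literal_eval`.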

In [12]:
features = ['genres', 'cast']

for feature in features:
    tmdb_movie_data[feature] = tmdb_movie_data[feature].apply(literal_eval)

In [13]:
def get_list(meta_data):
    if isinstance(meta_data, list):
        names = [col['name'] for col in meta_data]
        # Keep at most the first 5 names from 'genres' and 'cast'; shorter lists are returned unchanged
        if len(names) > 5:
            names = names[:5]
        return names

    # Return an empty list when the data is missing or malformed
    return []

features = ['genres', 'cast']

for feature in features:
    tmdb_movie_data[feature] = tmdb_movie_data[feature].apply(get_list)

In [14]:
pd.set_option('display.max_colwidth', 150)  # Adjust the column display width of DataFrames
tmdb_movie_data.head(5)

Unnamed: 0,genres,movie_id,original_title,overview,cast
0,"[Action, Adventure, Fantasy, Science Fiction]",19995,Avatar,"In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and prot...","[Sam Worthington, Zoe Saldana, Sigourney Weaver, Stephen Lang, Michelle Rodriguez]"
1,"[Adventure, Fantasy, Action]",285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, has come back to life and is headed to the edge of the Earth with Will Turner and Elizabeth Swann. But...","[Johnny Depp, Orlando Bloom, Keira Knightley, Stellan Skarsgård, Chow Yun-fat]"
2,"[Action, Adventure, Crime]",206647,Spectre,A cryptic message from Bond’s past sends him on a trail to uncover a sinister organization. While M battles political forces to keep the secret se...,"[Daniel Craig, Christoph Waltz, Léa Seydoux, Ralph Fiennes, Monica Bellucci]"
3,"[Action, Crime, Drama, Thriller]",49026,The Dark Knight Rises,"Following the death of District Attorney Harvey Dent, Batman assumes responsibility for Dent's crimes to protect the late attorney's reputation an...","[Christian Bale, Michael Caine, Gary Oldman, Anne Hathaway, Tom Hardy]"
4,"[Action, Adventure, Science Fiction]",49529,John Carter,"John Carter is a war-weary, former military captain who's inexplicably transported to the mysterious and exotic planet of Barsoom (Mars) and reluc...","[Taylor Kitsch, Lynn Collins, Samantha Morton, Willem Dafoe, Thomas Haden Church]"


In [15]:
tmdb_movie_data.isnull().sum()

genres            0
movie_id          0
original_title    0
overview          3
cast              0
dtype: int64

In [16]:
tmdb_movie_data[tmdb_movie_data['overview'].isnull()]

Unnamed: 0,genres,movie_id,original_title,overview,cast
2656,[Drama],370980,Chiamatemi Francesco - Il Papa della gente,,"[Rodrigo de la Serna, Sergio Hernández, Àlex Brendemühl, Maximilian Dirr, Mercedes Morán]"
4140,[Documentary],459488,"To Be Frank, Sinatra at 100",,[Tony Oppedisano]
4431,[Documentary],292539,Food Chains,,[]


In [17]:
tmdb_movie_data.dropna(subset=['overview'], inplace=True)
tmdb_movie_data.head(3)

Unnamed: 0,genres,movie_id,original_title,overview,cast
0,"[Action, Adventure, Fantasy, Science Fiction]",19995,Avatar,"In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and prot...","[Sam Worthington, Zoe Saldana, Sigourney Weaver, Stephen Lang, Michelle Rodriguez]"
1,"[Adventure, Fantasy, Action]",285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, has come back to life and is headed to the edge of the Earth with Will Turner and Elizabeth Swann. But...","[Johnny Depp, Orlando Bloom, Keira Knightley, Stellan Skarsgård, Chow Yun-fat]"
2,"[Action, Adventure, Crime]",206647,Spectre,A cryptic message from Bond’s past sends him on a trail to uncover a sinister organization. While M battles political forces to keep the secret se...,"[Daniel Craig, Christoph Waltz, Léa Seydoux, Ralph Fiennes, Monica Bellucci]"


In [18]:
tmdb_movie_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4800 entries, 0 to 4802
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   genres          4800 non-null   object
 1   movie_id        4800 non-null   int64 
 2   original_title  4800 non-null   object
 3   overview        4800 non-null   object
 4   cast            4800 non-null   object
dtypes: int64(1), object(4)
memory usage: 225.0+ KB


In [19]:
tmdb_movie_data.isnull().sum()

genres            0
movie_id          0
original_title    0
overview          0
cast              0
dtype: int64

### 1_2_Cleaning the `netflix_data` DataFrame and Merging

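The cleaned `tmdb_movie_data` DataFrame is merged into `netflix_data`, keeping only the rows whose titles match. The Netflix 'title' column is renamed to 'original_title' so that both DataFrames share the merge key.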
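
The title-based merge can be sketched on toy data as follows (the rows are illustrative, partly borrowed from the outputs in this notebook). `DataFrame.merge` performs an inner join by default, so titles present in only one table are dropped. Matching on title text alone can also pair unrelated works that happen to share a name.

```python
import pandas as pd

netflix = pd.DataFrame({
    'title': ['3%', '9', '21'],
    'type': ['TV Show', 'Movie', 'Movie'],
})
tmdb = pd.DataFrame({
    'original_title': ['9', '21', 'Avatar'],
    'movie_id': [12244, 8065, 19995],
})

# Rename so both frames share the key column, then inner-join on it
netflix = netflix.rename(columns={'title': 'original_title'})
matched = netflix.merge(tmdb, on='original_title')  # how='inner' by default

print(matched['original_title'].tolist())  # ['9', '21']
```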

In [20]:
netflix_data.head(2)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, Rodolfo Valente, Vaneza Oliveira, Rafael Lozano, Viviane Porto, Mel Fronckowiak, Sergio Mamberti, Zez...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi & Fantasy","In a future where the elite inhabit an island paradise far from the crowded slums, you get one chance to join the 3% saved from squalor."
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, Azalia Ortiz, Octavio Michel, Carmen Beato",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies","After a devastating earthquake hits Mexico City, trapped survivors from all walks of life wait to be rescued while trying desperately to stay alive."


In [21]:
netflix_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       7787 non-null   object
 1   type          7787 non-null   object
 2   title         7787 non-null   object
 3   director      5398 non-null   object
 4   cast          7069 non-null   object
 5   country       7280 non-null   object
 6   date_added    7777 non-null   object
 7   release_year  7787 non-null   int64 
 8   rating        7780 non-null   object
 9   duration      7787 non-null   object
 10  listed_in     7787 non-null   object
 11  description   7787 non-null   object
dtypes: int64(1), object(11)
memory usage: 730.2+ KB


In [22]:
netflix_data.rename(columns={'title':'original_title'}, inplace=True)

In [23]:
netflix_data = netflix_data.drop(columns=['show_id', 'director', 'cast', 'country', 'release_year', 'listed_in', 'description'])

In [24]:
netflix_data.head(2)

Unnamed: 0,type,original_title,date_added,rating,duration
0,TV Show,3%,"August 14, 2020",TV-MA,4 Seasons
1,Movie,7:19,"December 23, 2016",TV-MA,93 min


In [25]:
netflix_data = netflix_data.merge(tmdb_movie_data, on='original_title')

In [26]:
netflix_data.head(2)

Unnamed: 0,type,original_title,date_added,rating,duration,genres,movie_id,overview,cast
0,Movie,9,"November 16, 2017",PG-13,80 min,"[Action, Adventure, Animation, Science Fiction, Thriller]",12244,"When 9 first comes to life, he finds himself in a post-apocalyptic world. All humans are gone, and it is only by chance that he discovers a small ...","[Christopher Plummer, Martin Landau, John C. Reilly, Crispin Glover, Jennifer Connelly]"
1,Movie,21,"January 1, 2020",PG-13,123 min,"[Drama, Crime]",8065,"Ben Campbell is a young, highly intelligent, student at M.I.T. in Boston who strives to succeed. Wanting a scholarship to transfer to Harvard Scho...","[Jim Sturgess, Kevin Spacey, Kate Bosworth, Aaron Yoo, Liza Lapira]"


In [27]:
netflix_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 618 entries, 0 to 617
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   type            618 non-null    object
 1   original_title  618 non-null    object
 2   date_added      618 non-null    object
 3   rating          618 non-null    object
 4   duration        618 non-null    object
 5   genres          618 non-null    object
 6   movie_id        618 non-null    int64 
 7   overview        618 non-null    object
 8   cast            618 non-null    object
dtypes: int64(1), object(8)
memory usage: 48.3+ KB


In [32]:
netflix_data.rename(columns={'original_title':'title'}, inplace=True)

In [33]:
netflix_data.to_csv('dataset/netflix_tmdb_test.csv', index=False)

In [34]:
merge_data = pd.read_csv('dataset/netflix_tmdb_test.csv')
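# Note: this result is not assigned, so merge_data is unchanged; after read_csv, 'cast' and 'genres' hold strings, not lists
merge_data[['cast', 'genres']].apply(lambda x: ' '.join(x), axis=1)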
merge_data.head(2)

Unnamed: 0,type,title,date_added,rating,duration,genres,movie_id,overview,cast
0,Movie,9,"November 16, 2017",PG-13,80 min,"['Action', 'Adventure', 'Animation', 'Science Fiction', 'Thriller']",12244,"When 9 first comes to life, he finds himself in a post-apocalyptic world. All humans are gone, and it is only by chance that he discovers a small ...","['Christopher Plummer', 'Martin Landau', 'John C. Reilly', 'Crispin Glover', 'Jennifer Connelly']"
1,Movie,21,"January 1, 2020",PG-13,123 min,"['Drama', 'Crime']",8065,"Ben Campbell is a young, highly intelligent, student at M.I.T. in Boston who strives to succeed. Wanting a scholarship to transfer to Harvard Scho...","['Jim Sturgess', 'Kevin Spacey', 'Kate Bosworth', 'Aaron Yoo', 'Liza Lapira']"
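
The quoted values in the output above suggest that the list-valued columns come back from CSV as plain strings. A minimal sketch of restoring them follows, using an in-memory buffer in place of the notebook's files. The sample row and the `genres_text` column name are illustrative assumptions, not part of the notebook.

```python
import io
from ast import literal_eval

import pandas as pd

df = pd.DataFrame({'title': ['21'], 'genres': [['Drama', 'Crime']]})

# Round-trip through CSV text; the list is written out as its string form
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)
print(type(loaded.loc[0, 'genres']))   # <class 'str'>

# Parse the strings back into lists, then join each list into one text field
loaded['genres'] = loaded['genres'].apply(literal_eval)
loaded['genres_text'] = loaded['genres'].apply(' '.join)
print(loaded.loc[0, 'genres_text'])    # Drama Crime
```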


In [35]:
merge_data['type'].unique()

array(['Movie', 'TV Show'], dtype=object)

In [36]:
merge_data.to_csv('dataset/netflix_tmdb_merge.csv.zip', index=False)