# Project: Tidying Netflix Actor Rating Data

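## Analysis goal

The goal of this analysis is to compute, for each genre of film and TV (comedy, action, sci-fi, and so on), the average IMDB score of the titles each actor appeared in, and from that to identify the actors behind the highest-rated titles in each genre.

This hands-on project is an exercise in data wrangling: the aim is to produce data that is ready for the next stage of analysis.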

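## Introduction

The raw dataset records all Netflix TV shows and movies available in the United States as of July 2022. It consists of two tables: `titles.csv` and `credits.csv`.

`titles.csv` holds information about the movies and TV shows, including title ID, title, type, description, genres, IMDB (an online rating site) score, and so on. `credits.csv` holds information on more than 70,000 directors and actors who appear in Netflix titles, including name, title ID, character name, and credit role (director or actor).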

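The columns of `titles.csv` are:
- id: title ID.
- title: name of the title.
- type: kind of title, TV show or movie.
- description: short description.
- release_year: year of release.
- age_certification: age certification.
- runtime: length of one episode (for a show) or of the movie.
- genres: list of genres.
- production_countries: list of production countries.
- seasons: number of seasons, if the title is a show.
- imdb_id: ID on IMDB.
- imdb_score: score on IMDB.
- imdb_votes: number of votes on IMDB.
- tmdb_popularity: popularity on TMDB.
- tmdb_score: score on TMDB.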

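The columns of `credits.csv` are:
- person_id: ID of the cast or crew member.
- id: ID of the title they took part in.
- name: person's name.
- character: character name.
- role: credit type, actor or director.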

# Load data

In [2]:
import pandas as pd

In [13]:
df_raw_titles = pd.read_csv('titles.csv')
df_raw_credits = pd.read_csv('credits.csv')

In [20]:
df_raw_titles_cleaned = df_raw_titles.copy()
df_raw_credits_cleaned = df_raw_credits.copy()

In [8]:
df_raw_titles_cleaned.head()

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,51,['documentation'],['US'],1.0,,,,0.6,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,"['drama', 'crime']",['US'],,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,"['drama', 'action', 'thriller', 'european']",['US'],,tt0068473,7.7,107673.0,10.01,7.3
3,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['fantasy', 'action', 'comedy']",['GB'],,tt0071853,8.2,534486.0,15.461,7.811
4,tm120801,The Dirty Dozen,MOVIE,12 American military prisoners in World War II...,1967,,150,"['war', 'action']","['GB', 'US']",,tt0061578,7.7,72662.0,20.398,7.6


# Assessing and cleaning data

In [None]:
# genres and production_countries each hold several values per cell
df_raw_titles_cleaned.info() # genres and production_countries are stored as strings, not lists

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5850 entries, 0 to 5849
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    5850 non-null   object 
 1   title                 5849 non-null   object 
 2   type                  5850 non-null   object 
 3   description           5832 non-null   object 
 4   release_year          5850 non-null   int64  
 5   age_certification     3231 non-null   object 
 6   runtime               5850 non-null   int64  
 7   genres                5850 non-null   object 
 8   production_countries  5850 non-null   object 
 9   seasons               2106 non-null   float64
 10  imdb_id               5447 non-null   object 
 11  imdb_score            5368 non-null   float64
 12  imdb_votes            5352 non-null   float64
 13  tmdb_popularity       5759 non-null   float64
 14  tmdb_score            5539 non-null   float64
dtypes: float64(5), int64(

## Structural cleaning

In [21]:
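import ast
# Parse strings such as "['drama', 'crime']" into lists; ast.literal_eval only accepts literals, so unlike eval() it cannot run arbitrary code
df_raw_titles_cleaned['genres'] = df_raw_titles_cleaned['genres'].apply(ast.literal_eval)
df_raw_titles_cleaned['production_countries'] = df_raw_titles_cleaned['production_countries'].apply(ast.literal_eval)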

In [23]:
df_raw_titles_cleaned = df_raw_titles_cleaned.explode('genres')
df_raw_titles_cleaned = df_raw_titles_cleaned.explode('production_countries')
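The two `explode` calls are why the table grows from 5850 to 17818 entries: each title ends up with one row per genre and country combination, and an empty list becomes a single missing value. A minimal sketch on a toy frame (the rows below are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy frame with two list-valued columns (hypothetical rows)
toy = pd.DataFrame({
    "id": ["a", "b"],
    "genres": [["drama", "crime"], ["comedy"]],
    "production_countries": [["GB", "US"], []],
})

# Exploding one column after the other yields every genre/country combination;
# the empty country list for title "b" turns into one row holding NaN
exploded = toy.explode("genres").explode("production_countries")
print(exploded)
print(len(toy), "->", len(exploded), "rows")
```

Because every title is repeated across several rows afterwards, any later aggregation has to be read with that repetition in mind.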

In [24]:
df_raw_titles_cleaned

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,51,documentation,US,1.0,,,,0.600,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,crime,US,,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,drama,US,,tt0068473,7.7,107673.0,10.010,7.300
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,action,US,,tt0068473,7.7,107673.0,10.010,7.300
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5847,tm1059008,Lokillo,MOVIE,A controversial TV host and comedian who has b...,2021,,90,comedy,CO,,tt14585902,3.8,68.0,26.005,6.300
5848,tm1035612,Dad Stop Embarrassing Me - The Afterparty,MOVIE,"Jamie Foxx, David Alan Grier and more from the...",2021,PG-13,37,,US,,,,,1.296,10.000
5849,ts271048,Mighty Little Bheem: Kite Festival,SHOW,"With winter behind them, Bheem and his townspe...",2021,,7,family,,1.0,tt13711094,7.8,18.0,2.289,10.000
5849,ts271048,Mighty Little Bheem: Kite Festival,SHOW,"With winter behind them, Bheem and his townspe...",2021,,7,animation,,1.0,tt13711094,7.8,18.0,2.289,10.000


In [26]:
df_raw_titles_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17818 entries, 0 to 5849
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    17818 non-null  object 
 1   title                 17817 non-null  object 
 2   type                  17818 non-null  object 
 3   description           17790 non-null  object 
 4   release_year          17818 non-null  int64  
 5   age_certification     10889 non-null  object 
 6   runtime               17818 non-null  int64  
 7   genres                17755 non-null  object 
 8   production_countries  17439 non-null  object 
 9   seasons               6224 non-null   float64
 10  imdb_id               17116 non-null  object 
 11  imdb_score            16976 non-null  float64
 12  imdb_votes            16945 non-null  float64
 13  tmdb_popularity       17663 non-null  float64
 14  tmdb_score            17241 non-null  float64
dtypes: float64(5), int64

In [27]:
df_raw_titles_cleaned['release_year'] = pd.to_datetime(df_raw_titles_cleaned['release_year'], format='%Y')
df_raw_titles_cleaned['release_year']

0      1945-01-01
1      1976-01-01
1      1976-01-01
2      1972-01-01
2      1972-01-01
          ...    
5847   2021-01-01
5848   2021-01-01
5849   2021-01-01
5849   2021-01-01
5849   2021-01-01
Name: release_year, Length: 17818, dtype: datetime64[ns]

In [None]:
df_raw_credits_cleaned

Unnamed: 0,person_id,id,name,character,role
0,3748,tm84618,Robert De Niro,Travis Bickle,ACTOR
1,14658,tm84618,Jodie Foster,Iris Steensma,ACTOR
2,7064,tm84618,Albert Brooks,Tom,ACTOR
3,3739,tm84618,Harvey Keitel,Matthew 'Sport' Higgins,ACTOR
4,48933,tm84618,Cybill Shepherd,Betsy,ACTOR
...,...,...,...,...,...
77796,736339,tm1059008,Adelaida Buscato,María Paz,ACTOR
77797,399499,tm1059008,Luz Stella Luengas,Karen Bayona,ACTOR
77798,373198,tm1059008,Inés Prieto,Fanny,ACTOR
77799,378132,tm1059008,Isabel Gaona,Cacica,ACTOR


In [28]:
df_raw_credits_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77801 entries, 0 to 77800
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   person_id  77801 non-null  int64 
 1   id         77801 non-null  object
 2   name       77801 non-null  object
 3   character  68029 non-null  object
 4   role       77801 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.0+ MB


In [29]:
df_raw_credits_cleaned['person_id'] = df_raw_credits_cleaned['person_id'].astype(str)

## Handling missing values

In [32]:
df_raw_titles_cleaned.head()

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945-01-01,TV-MA,51,documentation,US,1.0,,,,0.6,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976-01-01,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976-01-01,R,114,crime,US,,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972-01-01,R,109,drama,US,,tt0068473,7.7,107673.0,10.01,7.3
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972-01-01,R,109,action,US,,tt0068473,7.7,107673.0,10.01,7.3


In [33]:
df_raw_titles_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17818 entries, 0 to 5849
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   id                    17818 non-null  object        
 1   title                 17817 non-null  object        
 2   type                  17818 non-null  object        
 3   description           17790 non-null  object        
 4   release_year          17818 non-null  datetime64[ns]
 5   age_certification     10889 non-null  object        
 6   runtime               17818 non-null  int64         
 7   genres                17755 non-null  object        
 8   production_countries  17439 non-null  object        
 9   seasons               6224 non-null   float64       
 10  imdb_id               17116 non-null  object        
 11  imdb_score            16976 non-null  float64       
 12  imdb_votes            16945 non-null  float64       
 13  tmdb_popularity  

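Rows that lack `imdb_score`, the core variable for this analysis, are of no use here, so we drop them and then check how many missing values remain in that column: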

In [37]:
df_raw_titles_cleaned = df_raw_titles_cleaned.dropna(subset=['imdb_score'])
df_raw_titles_cleaned['imdb_score'].isnull().sum()

0

In [39]:
df_raw_titles_cleaned = df_raw_titles_cleaned.dropna(subset=['genres'])
df_raw_titles_cleaned['genres'].isnull().sum()

0
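The two `dropna` calls above could also be written as one. A minimal sketch on a toy frame (hypothetical rows), showing that only the columns named in `subset` decide whether a row is dropped:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical rows) with gaps in several columns
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "genres": ["drama", None, "comedy", "crime"],
    "imdb_score": [8.2, 7.0, np.nan, 6.5],
    "seasons": [np.nan, np.nan, 1.0, np.nan],
})

# A row is dropped only if imdb_score or genres is missing;
# missing values in other columns (here seasons) are left alone
kept = toy.dropna(subset=["imdb_score", "genres"])
print(kept["id"].tolist())
```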

Next, the `credits` table:

In [41]:
df_raw_credits_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77801 entries, 0 to 77800
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   person_id  77801 non-null  object
 1   id         77801 non-null  object
 2   name       77801 non-null  object
 3   character  68029 non-null  object
 4   role       77801 non-null  object
dtypes: object(5)
memory usage: 3.0+ MB


## Duplicate data

In [43]:
df_raw_titles_cleaned.duplicated().sum()
df_raw_credits_cleaned.duplicated().sum()

0
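A notebook cell displays only the value of its last expression, so the single `0` above is the count for the credits table. A minimal sketch that reports both counts explicitly, using toy frames (hypothetical rows) in place of the real tables:

```python
import pandas as pd

# Toy stand-ins for the two cleaned tables (hypothetical rows)
titles = pd.DataFrame({"id": ["a", "a", "b"], "genres": ["drama", "drama", "crime"]})
credits = pd.DataFrame({"person_id": ["1", "2"], "id": ["a", "a"]})

# Print one count per table instead of relying on the cell's last expression
for label, frame in {"titles": titles, "credits": credits}.items():
    n_dupes = int(frame.duplicated().sum())
    print(f"{label}: {n_dupes} fully duplicated rows")
```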

## Inconsistent data

In [None]:
# genres
df_raw_titles_cleaned['genres'].value_counts()

drama            3357
comedy           2419
thriller         1446
action           1339
romance          1080
crime            1066
documentation     981
family            769
animation         732
fantasy           727
european          679
scifi             647
horror            438
history           336
music             266
reality           226
war               221
sport             188
western            53
Name: genres, dtype: int64

In [None]:
# production_countries
print(df_raw_titles_cleaned['production_countries'].value_counts())

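In the output above, production countries are all given as two-letter country codes, with one exception: a `Lebanon` value.

The country code for Lebanon is `LB`, which appears 39 times, so the data is inconsistent here. `LB` and `Lebanon` refer to the same country and need to be unified.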

In [None]:

df_raw_titles_cleaned['production_countries'] = df_raw_titles_cleaned['production_countries'].replace({'Lebanon':'LB'})
with pd.option_context('display.max_rows', None):
    print(df_raw_titles_cleaned['production_countries'].value_counts())
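Rather than spotting the odd value by eye in a long `value_counts()` listing, one could flag every entry that is not a two-character code. A minimal sketch on a toy series (hypothetical values); the `name_to_code` mapping is an assumption and would need extending if other full names turned up:

```python
import pandas as pd

# Toy column (hypothetical values) mixing codes, a full name, and a missing entry
countries = pd.Series(["US", "GB", "Lebanon", "LB", None, "US"], name="production_countries")

# Flag non-missing values that are not exactly two characters long
is_odd = countries.notna() & (countries.str.len() != 2)
print(countries[is_odd].unique())

# Map full names to codes, then recount
name_to_code = {"Lebanon": "LB"}
normalized = countries.replace(name_to_code)
print(normalized.value_counts())
```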

In [52]:
df_raw_credits['role'].value_counts() # only two distinct values, so convert to category
df_raw_credits_cleaned['role'] = df_raw_credits_cleaned['role'].astype('category')

## Invalid and erroneous data

In [54]:
df_raw_titles_cleaned.describe()

Unnamed: 0,runtime,seasons,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
count,16970.0,5954.0,16970.0,16941.0,16842.0,16515.0
mean,80.912552,2.455492,6.514207,32816.55,29.396307,6.846933
std,39.596172,2.869428,1.131095,114149.2,93.178235,1.078831
min,0.0,1.0,1.5,5.0,0.6,1.0
25%,45.0,1.0,5.8,780.0,4.07,6.2
50%,90.0,2.0,6.6,3508.0,10.195,6.9
75%,107.0,3.0,7.3,16978.0,23.639,7.5
max,225.0,42.0,9.5,2294231.0,2274.044,10.0


# Tidying data

In [67]:
credits_with_titles = pd.merge(df_raw_titles_cleaned, df_raw_credits_cleaned, on='id', how='inner')

In [68]:
credits_with_titles.head(5)

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,person_id,name,character,role
0,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976-01-01,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179,3748,Robert De Niro,Travis Bickle,ACTOR
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976-01-01,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179,14658,Jodie Foster,Iris Steensma,ACTOR
2,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976-01-01,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179,7064,Albert Brooks,Tom,ACTOR
3,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976-01-01,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179,3739,Harvey Keitel,Matthew 'Sport' Higgins,ACTOR
4,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976-01-01,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179,48933,Cybill Shepherd,Betsy,ACTOR


In [69]:
# Keep actors only
actor_with_titles = credits_with_titles.query('role == "ACTOR"')

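To find the actors behind the highest-rated titles in each genre, we first need to group by genre and actor.

When grouping by actor we use `person_id` rather than `name`, because names are prone to misspellings and to being shared by different people; the person ID identifies the actor more reliably than the name does.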
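A toy sketch of the difference (names, IDs, and scores are made up): two different people who share a name are merged into one average when grouping by `name`, but stay separate when grouping by `person_id`.

```python
import pandas as pd

# Two distinct people (IDs "1" and "2") who happen to share a name (hypothetical data)
toy = pd.DataFrame({
    "genres": ["drama", "drama", "drama"],
    "person_id": ["1", "2", "1"],
    "name": ["Alex Kim", "Alex Kim", "Alex Kim"],
    "imdb_score": [8.0, 4.0, 6.0],
})

# Grouping by name blends both people into a single average
by_name = toy.groupby(["genres", "name"])["imdb_score"].mean().reset_index()
# Grouping by ID keeps one average per person
by_id = toy.groupby(["genres", "person_id"])["imdb_score"].mean().reset_index()

print(by_name)
print(by_id)
```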

In [70]:
groupby_genres_and_person_id = actor_with_titles.groupby(['genres', "person_id"])

In [71]:
imdb_score_groupby_genres_and_person_id = groupby_genres_and_person_id['imdb_score'].mean() # the result has a two-level index (genres, person_id)

In [72]:
imdb_score_groupby_genres_and_person_id = imdb_score_groupby_genres_and_person_id.reset_index()

In [73]:
imdb_score_groupby_genres_and_person_id

Unnamed: 0,genres,person_id,imdb_score
0,action,1000,6.866667
1,action,100007,7.000000
2,action,100013,6.400000
3,action,100019,6.500000
4,action,100020,6.500000
...,...,...,...
168876,western,993735,6.500000
168877,western,998673,7.300000
168878,western,998674,7.300000
168879,western,998675,7.300000


The highest average score in each genre:

In [75]:
genres_max_scores = imdb_score_groupby_genres_and_person_id.groupby('genres')['imdb_score'].max()
genres_max_scores

genres
action           9.3
animation        9.3
comedy           9.2
crime            9.5
documentation    9.1
drama            9.5
european         8.9
family           9.3
fantasy          9.3
history          9.1
horror           9.0
music            8.8
reality          8.9
romance          9.2
scifi            9.3
sport            9.1
thriller         9.5
war              8.8
western          8.9
Name: imdb_score, dtype: float64

In [83]:
genres_max_scores_with_personid = pd.merge(imdb_score_groupby_genres_and_person_id, genres_max_scores, on=['genres', 'imdb_score'])
genres_max_scores_with_personid

Unnamed: 0,genres,person_id,imdb_score
0,action,12790,9.3
1,action,1303,9.3
2,action,21033,9.3
3,action,336830,9.3
4,action,86591,9.3
...,...,...,...
131,war,826547,8.8
132,western,22311,8.9
133,western,28166,8.9
134,western,28180,8.9


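The result shows that the top score in a genre does not necessarily belong to a single actor: several actors can share the same average score.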
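As an alternative to merging the per-genre maxima back in, the same tie-preserving selection can be sketched with `groupby(...).transform('max')`. This is an illustrative variant on toy data (hypothetical rows), not the approach used in the cells above:

```python
import pandas as pd

# Toy per-actor averages (hypothetical rows); two action actors tie at the top
avg = pd.DataFrame({
    "genres": ["action", "action", "action", "war"],
    "person_id": ["1", "2", "3", "4"],
    "imdb_score": [9.3, 9.3, 7.0, 8.8],
})

# Broadcast each genre's maximum back onto its rows, then keep the rows that reach it
genre_max = avg.groupby("genres")["imdb_score"].transform("max")
top = avg[avg["imdb_score"] == genre_max]
print(top)
```

Both tied actors survive the filter, which is the behavior wanted here; taking only the first row per genre would silently drop one of them.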

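To get the name that goes with each actor ID, we can join with the `df_raw_credits_cleaned` DataFrame. That DataFrame has other columns as well, but we only need the mapping between `person_id` and `name`, so we first extract those two columns and drop the duplicate rows.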

In [84]:
# To look up the names of the top-scoring actors, first drop duplicate (person_id, name) pairs
actor_id_with_name = df_raw_credits_cleaned[['person_id', 'name']].drop_duplicates()
actor_id_with_name

Unnamed: 0,person_id,name
0,3748,Robert De Niro
1,14658,Jodie Foster
2,7064,Albert Brooks
3,3739,Harvey Keitel
4,48933,Cybill Shepherd
...,...,...
77796,736339,Adelaida Buscato
77797,399499,Luz Stella Luengas
77798,373198,Inés Prieto
77799,378132,Isabel Gaona
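`drop_duplicates()` on the pair only removes exact repeats, so an ID credited under two spellings would still appear twice and would duplicate rows in a later merge. Whether that happens in this dataset is not checked here; the following is a minimal sketch of how one might check, on toy data with made-up names:

```python
import pandas as pd

# Toy credits (hypothetical names); ID "2" is credited under two spellings
credits = pd.DataFrame({
    "person_id": ["1", "1", "2", "2"],
    "name": ["Ann Lee", "Ann Lee", "Bo Chen", "Bo  Chen"],
})

lookup = credits[["person_id", "name"]].drop_duplicates()

# IDs that still map to more than one name after de-duplication
names_per_id = lookup.groupby("person_id")["name"].nunique()
print(names_per_id[names_per_id > 1])

# One possible policy (an assumption): keep the first name seen for each ID
lookup_unique = lookup.drop_duplicates(subset="person_id", keep="first")
print(lookup_unique)
```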


In [85]:
genres_max_scores_id_name = pd.merge(actor_id_with_name, genres_max_scores_with_personid, on=['person_id'])
genres_max_scores_id_name

Unnamed: 0,person_id,name,genres,imdb_score
0,22311,Koichi Yamadera,western,8.9
1,1652,Lukas Haas,music,8.8
2,1641,Leonardo DiCaprio,music,8.8
3,28180,Unsho Ishizuka,western,8.9
4,28166,Megumi Hayashibara,western,8.9
...,...,...,...,...
131,439923,Steve Kerr,history,9.1
132,439923,Steve Kerr,sport,9.1
133,408553,Phil Jackson,documentation,9.1
134,408553,Phil Jackson,history,9.1


In [87]:
genres_max_scores_id_name.sort_values('genres').reset_index(drop=True)

Unnamed: 0,person_id,name,genres,imdb_score
0,12790,Olivia Hack,action,9.3
1,86591,Cricket Leigh,action,9.3
2,336830,André Sogliuzzo,action,9.3
3,21033,Zach Tyler,action,9.3
4,1303,Jessie Flower,action,9.3
...,...,...,...,...
131,140181,Naoya Uchida,war,8.8
132,93017,Aoi Tada,western,8.9
133,28166,Megumi Hayashibara,western,8.9
134,28180,Unsho Ishizuka,western,8.9
