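# Project: Cleaning Netflix Cast Rating Data

## Analysis Goal

The goal of this analysis is to compute, for each genre of film and TV (comedy, action, sci-fi, and so on), the average IMDB score of the titles each actor appeared in, and from that identify the actors associated with highly rated titles in each genre.

This hands-on project is an exercise in cleaning data so that it is ready for the next stage of analysis.

## Introduction

The raw dataset covers all Netflix shows and movies available in the United States as of July 2022. It consists of two tables: `titles.csv` and `credits.csv`.

`titles.csv` holds information about each movie or show, including its ID, title, type, description, genres, IMDB score (IMDB is an online rating site), and more. `credits.csv` holds more than 70,000 records of directors and actors who appear in Netflix titles, including name, title ID, character name, and role (director or actor).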

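The columns of `titles.csv` are:
- id: title ID.
- title: title of the movie or show.
- type: type of title, show or movie.
- description: short description.
- release_year: release year.
- age_certification: age certification.
- runtime: length of the movie, or of each episode for a show.
- genres: list of genres.
- production_countries: list of production countries.
- seasons: number of seasons, if the title is a show.
- imdb_id: ID on IMDB.
- imdb_score: score on IMDB.
- imdb_votes: number of votes on IMDB.
- tmdb_popularity: popularity on TMDB.
- tmdb_score: score on TMDB.

The columns of `credits.csv` are:
- person_id: ID of the cast or crew member.
- id: ID of the title they took part in.
- name: name.
- character: character name.
- role: type of credit, actor or director.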

## **Importing the Data**

In [1]:
import pandas as pd
import numpy as np

In [2]:
credits = pd.read_csv("./credits.csv")
titles = pd.read_csv("./titles.csv")

**I. Structural Adjustments**

In [3]:
credits.sample(20)

Unnamed: 0,person_id,id,name,character,role
28832,1227765,tm414073,Alanna Tremblay,News Reporter,ACTOR
59593,1195338,tm827640,Wanda Webster,Self - Travis Scott's mom,ACTOR
62751,1508389,tm474496,Esteban Rojas,Miguel,ACTOR
49023,1548828,ts225101,Kim Ju-hun,Seo Chung-myung,ACTOR
71348,1783151,tm984204,Isaiah Finley,Detention Guard,ACTOR
6115,2178403,tm101914,Don Dunphy,Fight Announcer,ACTOR
18875,69116,tm157062,Chitrangda Singh,Special appearance,ACTOR
56452,44204,tm453098,Lee Da-wit,Go Joseph,ACTOR
72340,254346,tm1203307,Bo Maerten,Lisa,ACTOR
10066,54730,ts21664,Kelly Perine,,ACTOR


The structure of the `credits` table needs no changes.

In [4]:
titles.sample(20)

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
866,tm151807,Creep,MOVIE,"Looking for work, Aaron comes across a cryptic...",2014,R,82,"['horror', 'thriller']",['US'],,tt2428170,6.3,53408.0,16.301,6.4
750,tm177958,Pyaar Ka Punchnama,MOVIE,Nishant starts dating Charu while his roommate...,2011,PG-13,149,"['romance', 'drama', 'comedy', 'european']",['IN'],,tt1926313,7.6,21521.0,4.851,7.0
962,ts3799,All Hail King Julien,SHOW,King Julien is back and shaking his booty hard...,2014,TV-Y7,23,"['comedy', 'family', 'fantasy', 'animation', '...",['US'],5.0,tt3807022,7.0,2042.0,26.409,6.9
2387,ts80042,ReBoot: The Guardian Code,SHOW,Four tech-savvy teens hone their skills as cyb...,2018,TV-G,23,"['drama', 'scifi', 'action', 'comedy']",['CA'],2.0,tt6849940,3.7,1204.0,4.406,7.2
4327,ts234501,Buddi,SHOW,"The Buddis bounce, spin, glide and giggle thro...",2020,TV-Y,10,"['family', 'fantasy', 'animation', 'comedy']",['US'],2.0,tt11829340,7.2,58.0,1.032,10.0
2006,ts89130,If I Hadn't Met You,SHOW,"Eduard, a husband and father who loses his fam...",2018,TV-MA,52,"['thriller', 'drama', 'scifi', 'fantasy', 'rom...",['ES'],1.0,tt9817268,7.7,2283.0,4.495,6.9
5347,ts287838,Mad for Each Other,SHOW,Bothered to realize they are next-door neighbo...,2021,,35,"['comedy', 'drama', 'romance']",['KR'],1.0,tt14596414,7.9,1724.0,8.719,8.2
4550,tm953122,Elf Pets: Santa's Reindeer Rescue,MOVIE,It's almost Christmas and Santa's test flights...,2020,,26,['animation'],[],,,,,1.028,5.0
1420,tm200454,Ok Kanmani,MOVIE,Adhi and Tara are in a live-in relationship bu...,2015,PG,139,"['romance', 'drama']",['IN'],,tt4271820,7.4,5877.0,4.318,6.8
291,tm116324,Jackass: The Movie,MOVIE,Johnny Knoxville and his band of maniacs perfo...,2002,R,88,"['comedy', 'documentation', 'action']",['US'],,tt0322802,6.6,94423.0,25.23,6.3


In the `titles` table, the `genres` and `production_countries` columns hold lists, which need to be split so that each value gets its own row.

In [5]:
titles["genres"][1]

"['drama', 'crime']"

In [6]:
titles["production_countries"][1]

"['US']"

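Looking at individual cells of `genres` and `production_countries` shows that the values are not real lists but strings, so the strings first have to be converted to lists.

Start by making copies of the `titles` and `credits` tables.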

In [7]:
titles_clean = titles.copy()
credits_clean = credits.copy()

Convert the values in the `genres` and `production_countries` columns with the `eval` function.

In [8]:
titles_clean["genres"] = titles_clean["genres"].apply(lambda x:eval(x))

In [9]:
titles_clean["genres"][1]

['drama', 'crime']

In [10]:
titles_clean["production_countries"] = titles_clean["production_countries"].apply(lambda x:eval(x))

In [11]:
titles_clean["production_countries"][1]

['US']
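As a side note, `eval` will execute whatever code the string contains. A minimal sketch of a stricter alternative, assuming the cells only ever hold list literals like the ones shown above, is `ast.literal_eval` from the standard library. The `raw` series here is a toy stand-in, not the real column.

```python
import ast

import pandas as pd

# Toy stand-in for the `genres` column: each cell is a string that looks like a list.
raw = pd.Series(["['drama', 'crime']", "['US']", "[]"])

# ast.literal_eval only accepts Python literals (lists, strings, numbers, ...),
# so it raises an error instead of running arbitrary code.
parsed = raw.apply(ast.literal_eval)

print(parsed[0])        # the first cell is now a real list
print(type(parsed[0]))
```

On the notebook's data this would amount to swapping `lambda x:eval(x)` for `ast.literal_eval` in the two `apply` calls.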

Next, expand the lists in the `genres` and `production_countries` columns into rows.

In [12]:
titles_clean = titles_clean.explode("genres")

In [13]:
titles_clean = titles_clean.explode("production_countries")

In [14]:
titles_clean.head(20)

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,51,documentation,US,1.0,,,,0.6,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,crime,US,,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,drama,US,,tt0068473,7.7,107673.0,10.01,7.3
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,action,US,,tt0068473,7.7,107673.0,10.01,7.3
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,thriller,US,,tt0068473,7.7,107673.0,10.01,7.3
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,european,US,,tt0068473,7.7,107673.0,10.01,7.3
3,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,fantasy,GB,,tt0071853,8.2,534486.0,15.461,7.811
3,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,action,GB,,tt0071853,8.2,534486.0,15.461,7.811
3,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,comedy,GB,,tt0071853,8.2,534486.0,15.461,7.811
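To see what the two `explode` calls do to the row count, here is a small sketch on a made-up frame (the `toy` data is hypothetical). Each title ends up with one row per genre and country combination, and an empty list becomes a missing value.

```python
import pandas as pd

# Hypothetical two-title frame with list-valued columns.
toy = pd.DataFrame({
    "id": ["a", "b"],
    "genres": [["drama", "crime"], ["comedy"]],
    "production_countries": [["US", "GB"], []],
})

# Exploding both columns in turn yields every (genre, country) pair per title.
exploded = toy.explode("genres").explode("production_countries")

print(exploded)
print(len(exploded))  # 2 genres x 2 countries for "a", plus 1 row for "b"
```

Two things follow from this: the original index is repeated for every expanded row, and titles with several countries appear several times per genre, which matters for any later average taken over rows.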


**II. Content Adjustments**

1. Check the contents of the `titles_clean` table

In [15]:
titles_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17818 entries, 0 to 5849
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    17818 non-null  object 
 1   title                 17817 non-null  object 
 2   type                  17818 non-null  object 
 3   description           17790 non-null  object 
 4   release_year          17818 non-null  int64  
 5   age_certification     10889 non-null  object 
 6   runtime               17818 non-null  int64  
 7   genres                17755 non-null  object 
 8   production_countries  17439 non-null  object 
 9   seasons               6224 non-null   float64
 10  imdb_id               17116 non-null  object 
 11  imdb_score            16976 non-null  float64
 12  imdb_votes            16945 non-null  float64
 13  tmdb_popularity       17663 non-null  float64
 14  tmdb_score            17241 non-null  float64
dtypes: float64(5), int64(2), 

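Summary of data issues in `titles_clean`:  
1. Missing values: every column except `id`, `type`, `release_year`, and `runtime` has missing values. `genres` is a key column for this analysis, so its missing values can be replaced with "other".  
2. Data types: `release_year` should be a datetime, and `runtime` should be a duration.

Adjust the contents of `titles_clean`.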

In [16]:
titles_clean["genres"] = titles_clean["genres"].fillna("other") 

Convert `release_year` to a datetime type.

In [17]:
titles_clean["release_year"] = pd.to_datetime(titles_clean["release_year"], format="%Y")
titles_clean["release_year"]

0      1945-01-01
1      1976-01-01
1      1976-01-01
2      1972-01-01
2      1972-01-01
          ...    
5847   2021-01-01
5848   2021-01-01
5849   2021-01-01
5849   2021-01-01
5849   2021-01-01
Name: release_year, Length: 17818, dtype: datetime64[ns]

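Pass `runtime` through `pd.to_timedelta`. Because the expression then converts back to whole minutes, the column ends up as `float64` minutes rather than a timedelta type.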

In [18]:
titles_clean['runtime'] = pd.to_timedelta(titles_clean['runtime'], unit='m').dt.total_seconds() // 60
titles_clean['runtime']

0        51.0
1       114.0
1       114.0
2       109.0
2       109.0
        ...  
5847     90.0
5848     37.0
5849      7.0
5849      7.0
5849      7.0
Name: runtime, Length: 17818, dtype: float64
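The difference between keeping the timedelta and converting back to minutes can be sketched on a few sample values. The `minutes` series is a made-up stand-in for `runtime`.

```python
import pandas as pd

# Made-up runtimes in minutes, standing in for the `runtime` column.
minutes = pd.Series([51, 114, 7])

# Stopping here keeps a true duration column.
as_timedelta = pd.to_timedelta(minutes, unit="min")

# Converting back to minutes, as the cell above does, yields plain floats.
round_trip = as_timedelta.dt.total_seconds() // 60

print(as_timedelta.dtype)
print(round_trip.dtype)
```

Which form is preferable depends on the later analysis: a timedelta column supports duration arithmetic, while float minutes are easier to summarize with `describe()`.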

Check whether the numeric values in `titles_clean` contain anomalies.

In [19]:
titles_clean.describe()

Unnamed: 0,release_year,runtime,seasons,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
count,17818,17818.0,6224.0,16976.0,16945.0,17663.0,17241.0
mean,2015-12-14 23:26:22.803906048,79.944326,2.40858,6.514467,32809.16,28.787751,6.847247
min,1945-01-01 00:00:00,0.0,1.0,1.5,5.0,0.009442,0.5
25%,2015-01-01 00:00:00,45.0,1.0,5.8,780.0,3.874,6.2
50%,2018-01-01 00:00:00,90.0,1.0,6.6,3508.0,9.885,6.9
75%,2020-01-01 00:00:00,107.0,3.0,7.3,16976.0,23.051,7.5
max,2022-01-01 00:00:00,240.0,42.0,9.6,2294231.0,2274.044,10.0
std,,39.855654,2.829108,1.131246,114136.8,91.691457,1.096069


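The numeric values in `titles_clean` show no obvious anomalies, although the minimum `runtime` of 0 may be worth a closer look.

2. Check the contents of the `credits_clean` table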

In [20]:
credits_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77801 entries, 0 to 77800
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   person_id  77801 non-null  int64 
 1   id         77801 non-null  object
 2   name       77801 non-null  object
 3   character  68029 non-null  object
 4   role       77801 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.0+ MB


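Summary of data issues in `credits_clean`:  
1. Missing values: `character` has missing values. Character names are not key data here, so they are left as they are.  
2. Data types: `person_id` should be a string.

Convert `person_id` to a string.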

In [21]:
credits_clean["person_id"] = credits_clean["person_id"].astype(str)
credits_clean["person_id"]

0           3748
1          14658
2           7064
3           3739
4          48933
          ...   
77796     736339
77797     399499
77798     373198
77799     378132
77800    1950416
Name: person_id, Length: 77801, dtype: object

Check for duplicate rows

In [22]:
credits_clean.duplicated().sum()

0

In [23]:
titles_clean.duplicated().sum()

0

There are no duplicate rows.

Check the `genres` and `production_countries` columns for inconsistent values.

In [24]:
titles_clean["genres"].value_counts()

genres
drama            3517
comedy           2538
thriller         1505
action           1394
romance          1098
crime            1093
documentation    1085
animation         816
family            803
fantasy           738
european          699
scifi             676
horror            451
history           336
music             289
reality           241
war               232
sport             188
other              63
western            56
Name: count, dtype: int64

In [25]:
pd.set_option("display.max_rows",20)
titles_clean["production_countries"].value_counts()

production_countries
US         5904
IN         1662
GB         1107
JP         1099
FR          741
           ... 
CU            1
GT            1
NA            1
Lebanon       1
FO            1
Name: count, Length: 109, dtype: int64

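`Lebanon` in the `production_countries` column is written differently from the other countries, which use two-letter codes. The cell below replaces it with `LBN`; note that the two-letter code matching the rest of the column would be `LB`.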

In [26]:
titles_clean["production_countries"] = titles_clean["production_countries"].replace("Lebanon","LBN")
titles_clean["production_countries"].value_counts()

production_countries
US     5904
IN     1662
GB     1107
JP     1099
FR      741
       ... 
CU        1
GT        1
NA        1
LBN       1
FO        1
Name: count, Length: 109, dtype: int64
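The other values in this column look like ISO 3166-1 alpha-2 codes, in which Lebanon is `LB` (a code that already appears in the data, for example on the `Ghadi` row further down). A hedged sketch of normalizing to the two-letter form and checking that nothing else deviates, using a toy `countries` series rather than the real column:

```python
import pandas as pd

# Toy stand-in for the exploded `production_countries` column.
countries = pd.Series(["US", "LB", "Lebanon", "FR"])

# Map the spelled-out name onto the two-letter code used everywhere else.
normalized = countries.replace("Lebanon", "LB")

# Any value that is not exactly two uppercase letters would show up here.
non_conforming = normalized[~normalized.dropna().str.fullmatch(r"[A-Z]{2}")]

print(normalized.value_counts())
print(non_conforming)
```

With this mapping the spelled-out entry merges into the existing `LB` count instead of forming a category of its own.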

**III. Merging the Data**

In [27]:
Netflix_show_data = pd.merge(titles_clean,credits_clean,on ="id" )
Netflix_show_data.sample(10)

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,person_id,name,character,role
29702,tm118637,John Q,MOVIE,John Quincy Archibald is a father and husband ...,2002-01-01,PG-13,116.0,thriller,US,,tt0251160,7.0,133215.0,32.229,7.121,5313,Ray Liotta,Chief Gus Monroe,ACTOR
156441,tm367473,Bad Seeds,MOVIE,"Wael, a former street child, makes a living fr...",2018-01-01,,100.0,european,FR,,tt6708116,7.3,5461.0,7.032,7.8,46060,Kheiron,,DIRECTOR
144202,ts82631,Monkey Twins,SHOW,Inspired by Khon dance drama and Thai martial ...,2018-01-01,,49.0,action,TH,1.0,tt8507498,7.9,95.0,0.99,6.0,352944,Sumret Muengput,เหนือ,ACTOR
85938,ts35354,Zoo,SHOW,Set amidst a wave of violent animal attacks sw...,2015-01-01,TV-14,43.0,thriller,US,3.0,tt3250026,6.6,24694.0,30.71,6.6,28044,James Wolk,Jackson Oz,ACTOR
249548,tm983723,The Sparks Brothers,MOVIE,Take a musical odyssey through five weird and ...,2021-01-01,R,140.0,documentation,GB,,tt8610436,7.8,4663.0,6.784,7.3,1909207,Richard Coble,Self,ACTOR
158359,tm233484,Duck Duck Goose,MOVIE,"After he’s grounded by an injury, a high-flyin...",2018-01-01,PG,82.0,action,CN,,tt4940416,5.7,3721.0,12.075,6.2,17809,Rick Overton,Stanley (voice),ACTOR
191440,ts88151,Quicksand,SHOW,After a tragedy at a school sends shock waves ...,2019-01-01,TV-MA,46.0,crime,SE,1.0,tt8686106,7.5,21276.0,16.469,7.3,1089604,Ella Rappich,Amanda Steen,ACTOR
78050,tm182580,Ghadi,MOVIE,Leba is a music instructor who lives in a smal...,2013-01-01,PG-13,100.0,family,LB,,tt2552296,7.3,1128.0,2.793,6.5,128413,Caroline Labaki,Nisrine,ACTOR
74829,tm167944,Berserk: The Golden Age Arc III - The Advent,MOVIE,A year has passed since Guts parted ways with ...,2013-01-01,NC-17,110.0,drama,JP,,tt2358913,7.8,10173.0,41.099,7.6,58716,Takahiro Fujiwara,Pippin (voice),ACTOR
229386,tm957903,The Boys in the Band: Something Personal,MOVIE,Decades after his play first put gay life cent...,2020-01-01,,28.0,documentation,US,,tt13206842,7.2,362.0,2.267,8.5,16060,Matt Bomer,Self,ACTOR


In [28]:
Netflix_show_data["production_countries"].value_counts()

production_countries
US    108977
GB     22013
IN     21640
JP     14802
FR     12068
       ...  
CU         7
NP         5
GE         5
FO         5
GT         2
Name: count, Length: 108, dtype: int64
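The number of distinct countries drops from 109 to 108 after the merge. One possible explanation, not verified here, is that `pd.merge` defaults to an inner join, so titles with no rows in `credits` disappear. The sketch below uses made-up frames (`titles_toy`, `credits_toy`) to show how `indicator=True` can surface such unmatched titles.

```python
import pandas as pd

# Made-up frames: title "t2" has no credits at all.
titles_toy = pd.DataFrame({"id": ["t1", "t2"], "production_countries": ["US", "LBN"]})
credits_toy = pd.DataFrame({"id": ["t1", "t1"], "name": ["A", "B"]})

# Default inner join: "t2" is silently dropped.
inner = pd.merge(titles_toy, credits_toy, on="id")

# Left join with an indicator column marks the titles that found no match.
checked = pd.merge(titles_toy, credits_toy, on="id", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

print(len(inner))
print(unmatched["id"].tolist())
```

For this analysis the inner join is a reasonable choice, since a title without credits cannot contribute to any actor's average.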

**IV. Analyzing the Data**

Only actors are analyzed, so the first step is to filter the observations down to actors.

In [29]:
Netflix_show_ACTOR_data = Netflix_show_data.query("role == 'ACTOR'")

Group by `genres` and `person_id` and compute the mean `imdb_score`. Note that the cell below is passed `Netflix_show_data` rather than the filtered `Netflix_show_ACTOR_data`, so the averages it produces also include director credits.

In [30]:
genres_meandata = pd.pivot_table(Netflix_show_data,index = ["genres","person_id"],values = "imdb_score",aggfunc = "mean")
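A sketch of how the same average could be computed on actors only, under two assumptions that go beyond the notebook: directors should be excluded, and each title should count once per actor and genre even though exploding `production_countries` repeated its rows. The `merged` frame is a made-up miniature of `Netflix_show_data`.

```python
import pandas as pd

# Made-up miniature of the merged table: title "t1" has two countries,
# so actor "1" appears twice for it; person "9" is a director.
merged = pd.DataFrame({
    "id": ["t1", "t1", "t1", "t2"],
    "genres": ["drama", "drama", "drama", "drama"],
    "production_countries": ["US", "GB", "US", "US"],
    "person_id": ["1", "1", "9", "1"],
    "role": ["ACTOR", "ACTOR", "DIRECTOR", "ACTOR"],
    "imdb_score": [8.0, 8.0, 8.0, 6.0],
})

actors = merged.query("role == 'ACTOR'")

# Keep one row per (title, genre, actor) so multi-country titles count once.
per_title = actors.drop_duplicates(subset=["id", "genres", "person_id"])

mean_scores = (
    per_title.groupby(["genres", "person_id"])["imdb_score"]
    .mean()
    .reset_index()
)
print(mean_scores)
```

In this toy case the average for actor `1` is taken over the two distinct titles; without the deduplication step the two-country title would be weighted twice.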


Find the rows with the highest `imdb_score` in each `genres` group.

In [31]:
genres_meandata_clean = genres_meandata.reset_index()
genres_meandata_clean

Unnamed: 0,genres,person_id,imdb_score
0,action,1000,6.866667
1,action,100007,7.000000
2,action,100013,6.400000
3,action,100019,6.500000
4,action,100020,6.500000
...,...,...,...
177319,western,993735,6.500000
177320,western,998673,7.300000
177321,western,998674,7.300000
177322,western,998675,7.300000


In [32]:
genres_meandata_clean_max = genres_meandata_clean.groupby("genres")["imdb_score"].max()
genres_meandata_clean_max

genres
action           9.3
animation        9.3
comedy           9.2
crime            9.5
documentation    9.1
drama            9.5
european         8.9
family           9.3
fantasy          9.3
history          9.1
horror           9.0
music            8.8
other            7.8
reality          8.9
romance          9.2
scifi            9.3
sport            9.1
thriller         9.5
war              8.8
western          8.9
Name: imdb_score, dtype: float64

In [33]:
genres_meandata_clean_max = pd.merge(genres_meandata_clean,genres_meandata_clean_max,on =["genres","imdb_score"] )
genres_meandata_clean_max

Unnamed: 0,genres,person_id,imdb_score
0,action,12790,9.3
1,action,1303,9.3
2,action,21033,9.3
3,action,336830,9.3
4,action,86591,9.3
...,...,...,...
137,war,826547,8.8
138,western,22311,8.9
139,western,28166,8.9
140,western,28180,8.9


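At this point we have the IDs of the actors with the highest average score in each genre. Many of them are tied for the top score.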
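The max-then-merge steps above can also be written as a single filter. This is a sketch of an equivalent approach using `groupby(...).transform("max")` on a made-up `scores` frame; it keeps ties and avoids merging on a floating-point column.

```python
import pandas as pd

# Made-up per-genre averages; two people tie for the top action score.
scores = pd.DataFrame({
    "genres": ["action", "action", "action", "war"],
    "person_id": ["1", "2", "3", "4"],
    "imdb_score": [9.3, 9.3, 6.5, 8.8],
})

# transform("max") broadcasts each genre's maximum back onto its rows.
group_max = scores.groupby("genres")["imdb_score"].transform("max")

# Keep every row that equals its genre's maximum, ties included.
top = scores[scores["imdb_score"] == group_max]
print(top)
```

The comparison is safe here because each maximum is one of the values in the column itself, so no rounding is involved.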

Next, the actor IDs need to be matched to actor names. Only the name is needed, so extract the `name` and `person_id` columns from `Netflix_show_data` and drop the duplicate rows.

In [34]:
name_person_id = Netflix_show_data[["name","person_id"]].drop_duplicates()
name_person_id

Unnamed: 0,name,person_id
0,Robert De Niro,3748
1,Jodie Foster,14658
2,Albert Brooks,7064
3,Harvey Keitel,3739
4,Cybill Shepherd,48933
...,...,...
284180,Adelaida Buscato,736339
284181,Luz Stella Luengas,399499
284182,Inés Prieto,373198
284183,Isabel Gaona,378132


Then merge `name_person_id` and `genres_meandata_clean_max` on `person_id`.

In [35]:
genres_imdbscore_max_actor = pd.merge(genres_meandata_clean_max,name_person_id,on ="person_id" )
genres_imdbscore_max_actor

Unnamed: 0,genres,person_id,imdb_score,name
0,action,12790,9.3,Olivia Hack
1,action,1303,9.3,Jessie Flower
2,action,21033,9.3,Zach Tyler
3,action,336830,9.3,André Sogliuzzo
4,action,86591,9.3,Cricket Leigh
...,...,...,...,...
137,war,826547,8.8,Yuto Uemura
138,western,22311,8.9,Koichi Yamadera
139,western,28166,8.9,Megumi Hayashibara
140,western,28180,8.9,Unsho Ishizuka
