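# Project: Tidying Netflix Cast Rating Data

## Analysis Goal

The goal of this analysis is to compute, for each genre (comedy, action, sci-fi, and so on), the average IMDb score of the titles each actor appeared in, and from that identify the actors behind the highest-rated titles in every genre.

This hands-on project is an exercise in data wrangling: its output is a tidy dataset that is ready for the next stage of analysis.

## Introduction

The raw dataset covers all TV shows and movies available on Netflix in the United States as of July 2022. It consists of two tables: `titles.csv` and `credits.csv`.

`titles.csv` holds information about each movie or show, including the title ID, title, type, description, genres, IMDb score (IMDb is an online movie rating site), and more. `credits.csv` holds more than 70,000 credits for the directors and actors who appear in Netflix titles, including name, title ID, character name, and role (director or actor).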

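The columns of `titles.csv` are:
- id: title ID.
- title: name of the title.
- type: kind of title, show or movie.
- description: short description.
- release_year: year of release.
- age_certification: age rating.
- runtime: length of an episode (for shows) or of the movie.
- genres: list of genres.
- production_countries: list of production countries.
- seasons: number of seasons, if the title is a show.
- imdb_id: ID on IMDb.
- imdb_score: score on IMDb.
- imdb_votes: number of votes on IMDb.
- tmdb_popularity: popularity on TMDB.
- tmdb_score: score on TMDB.

The columns of `credits.csv` are:
- person_id: ID of the cast or crew member.
- id: ID of the title they worked on.
- name: person's name.
- character: character name.
- role: kind of credit, actor or director.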

In [1]:
import pandas as pd

In [2]:
o_df1 = pd.read_csv("titles.csv")
o_df2 = pd.read_csv("credits.csv")

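Assessment based on the first ten rows: `release_year` is stored as an integer rather than a date, so it should be converted with `to_datetime`. `genres` holds a list serialized as a string, which is not directly usable; it needs to be parsed with `eval` and then split with `explode` so that each multi-valued row becomes several single-valued rows. `production_countries` is stored the same way. `age_certification`, `seasons`, `imdb_id`, `imdb_score`, and `tmdb_score` all contain missing values.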

In [3]:
o_df1

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,51,['documentation'],['US'],1.0,,,,0.600,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,"['drama', 'crime']",['US'],,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,"['drama', 'action', 'thriller', 'european']",['US'],,tt0068473,7.7,107673.0,10.010,7.300
3,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['fantasy', 'action', 'comedy']",['GB'],,tt0071853,8.2,534486.0,15.461,7.811
4,tm120801,The Dirty Dozen,MOVIE,12 American military prisoners in World War II...,1967,,150,"['war', 'action']","['GB', 'US']",,tt0061578,7.7,72662.0,20.398,7.600
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5845,tm1014599,Fine Wine,MOVIE,A beautiful love story that can happen between...,2021,,100,"['romance', 'drama']",['NG'],,tt13857480,6.8,45.0,1.466,
5846,tm898842,C/O Kaadhal,MOVIE,A heart warming film that explores the concept...,2021,,134,['drama'],[],,tt11803618,7.7,348.0,,
5847,tm1059008,Lokillo,MOVIE,A controversial TV host and comedian who has b...,2021,,90,['comedy'],['CO'],,tt14585902,3.8,68.0,26.005,6.300
5848,tm1035612,Dad Stop Embarrassing Me - The Afterparty,MOVIE,"Jamie Foxx, David Alan Grier and more from the...",2021,PG-13,37,[],['US'],,,,,1.296,10.000


In [4]:
o_df1.head(10)

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,51,['documentation'],['US'],1.0,,,,0.6,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,"['drama', 'crime']",['US'],,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,"['drama', 'action', 'thriller', 'european']",['US'],,tt0068473,7.7,107673.0,10.01,7.3
3,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['fantasy', 'action', 'comedy']",['GB'],,tt0071853,8.2,534486.0,15.461,7.811
4,tm120801,The Dirty Dozen,MOVIE,12 American military prisoners in World War II...,1967,,150,"['war', 'action']","['GB', 'US']",,tt0061578,7.7,72662.0,20.398,7.6
5,ts22164,Monty Python's Flying Circus,SHOW,A British sketch comedy series with the shows ...,1969,TV-14,30,"['comedy', 'european']",['GB'],4.0,tt0063929,8.8,73424.0,17.617,8.306
6,tm70993,Life of Brian,MOVIE,"Brian Cohen is an average young Jewish man, bu...",1979,R,94,['comedy'],['GB'],,tt0079470,8.0,395024.0,17.77,7.8
7,tm14873,Dirty Harry,MOVIE,When a madman dubbed 'Scorpio' terrorizes San ...,1971,R,102,"['thriller', 'action', 'crime']",['US'],,tt0066999,7.7,155051.0,12.817,7.5
8,tm119281,Bonnie and Clyde,MOVIE,"In the 1930s, bored waitress Bonnie Parker fal...",1967,R,110,"['crime', 'drama', 'action']",['US'],,tt0061418,7.7,112048.0,15.687,7.5
9,tm98978,The Blue Lagoon,MOVIE,Two small children and a ship's cook survive a...,1980,R,104,"['romance', 'action', 'drama']",['US'],,tt0080453,5.8,69844.0,50.324,6.156


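Assessment: every row and column of the credits table holds a single value, and no anomalies are apparent at this point. Later, `role.value_counts()` should confirm that the only roles present are actor and director. The two tables can be joined on `id`.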

In [5]:
o_df2

Unnamed: 0,person_id,id,name,character,role
0,3748,tm84618,Robert De Niro,Travis Bickle,ACTOR
1,14658,tm84618,Jodie Foster,Iris Steensma,ACTOR
2,7064,tm84618,Albert Brooks,Tom,ACTOR
3,3739,tm84618,Harvey Keitel,Matthew 'Sport' Higgins,ACTOR
4,48933,tm84618,Cybill Shepherd,Betsy,ACTOR
...,...,...,...,...,...
77796,736339,tm1059008,Adelaida Buscato,María Paz,ACTOR
77797,399499,tm1059008,Luz Stella Luengas,Karen Bayona,ACTOR
77798,373198,tm1059008,Inés Prieto,Fanny,ACTOR
77799,378132,tm1059008,Isabel Gaona,Cacica,ACTOR


First, copy the data into new variables so that the originals stay intact.

In [6]:
c_df1 = o_df1.copy()
c_df2 = o_df2.copy()

Start with the `genres` column: parse each string into a list with `eval`, then use `explode` to split multi-valued rows into one row per genre.

In [7]:
c_df1["genres"] = c_df1["genres"].apply(lambda s: eval(s))
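A safer alternative to `eval` for this kind of column is `ast.literal_eval`, which only accepts Python literals and therefore cannot execute arbitrary code hidden in a cell. The sketch below uses a small made-up table (`toy` and its values are illustrative, not taken from the dataset) to show the parse-then-explode pattern.

```python
import ast

import pandas as pd

# Toy stand-in for the `genres` column: each cell is a string that looks like a list.
toy = pd.DataFrame(
    {
        "id": ["tm1", "tm2", "tm3"],
        "genres": ["['drama', 'crime']", "['comedy']", "[]"],
    }
)

# ast.literal_eval parses literals only, so it never runs code from the file.
toy["genres"] = toy["genres"].apply(ast.literal_eval)

# One row per (id, genre); an empty list becomes a single row with NaN.
exploded = toy.explode("genres")
print(exploded)
```

Note that rows whose list is empty survive the explode as missing values, which is why `genres` shows nulls later in the notebook.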

In [8]:
c_df1 = c_df1.explode(column="genres")
c_df1

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,51,documentation,['US'],1.0,,,,0.600,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,drama,['US'],,tt0075314,8.2,808582.0,40.965,8.179
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,crime,['US'],,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,drama,['US'],,tt0068473,7.7,107673.0,10.010,7.300
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,action,['US'],,tt0068473,7.7,107673.0,10.010,7.300
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5847,tm1059008,Lokillo,MOVIE,A controversial TV host and comedian who has b...,2021,,90,comedy,['CO'],,tt14585902,3.8,68.0,26.005,6.300
5848,tm1035612,Dad Stop Embarrassing Me - The Afterparty,MOVIE,"Jamie Foxx, David Alan Grier and more from the...",2021,PG-13,37,,['US'],,,,,1.296,10.000
5849,ts271048,Mighty Little Bheem: Kite Festival,SHOW,"With winter behind them, Bheem and his townspe...",2021,,7,family,[],1.0,tt13711094,7.8,18.0,2.289,10.000
5849,ts271048,Mighty Little Bheem: Kite Festival,SHOW,"With winter behind them, Bheem and his townspe...",2021,,7,animation,[],1.0,tt13711094,7.8,18.0,2.289,10.000


Apply the same treatment to `production_countries`: parse the strings with `eval`, then use `explode` to split multi-valued rows into one row per country.

In [9]:
c_df1["production_countries"] = c_df1["production_countries"].apply(lambda s: eval(s))
c_df1 = c_df1.explode(column="production_countries")
c_df1

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,51,documentation,US,1.0,,,,0.600,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,crime,US,,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,drama,US,,tt0068473,7.7,107673.0,10.010,7.300
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,action,US,,tt0068473,7.7,107673.0,10.010,7.300
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5847,tm1059008,Lokillo,MOVIE,A controversial TV host and comedian who has b...,2021,,90,comedy,CO,,tt14585902,3.8,68.0,26.005,6.300
5848,tm1035612,Dad Stop Embarrassing Me - The Afterparty,MOVIE,"Jamie Foxx, David Alan Grier and more from the...",2021,PG-13,37,,US,,,,,1.296,10.000
5849,ts271048,Mighty Little Bheem: Kite Festival,SHOW,"With winter behind them, Bheem and his townspe...",2021,,7,family,,1.0,tt13711094,7.8,18.0,2.289,10.000
5849,ts271048,Mighty Little Bheem: Kite Festival,SHOW,"With winter behind them, Bheem and his townspe...",2021,,7,animation,,1.0,tt13711094,7.8,18.0,2.289,10.000


Convert `release_year` to a proper datetime type.

In [10]:
c_df1["release_year"] = pd.to_datetime(c_df1["release_year"],format="%Y")
c_df1

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945-01-01,TV-MA,51,documentation,US,1.0,,,,0.600,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976-01-01,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976-01-01,R,114,crime,US,,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972-01-01,R,109,drama,US,,tt0068473,7.7,107673.0,10.010,7.300
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972-01-01,R,109,action,US,,tt0068473,7.7,107673.0,10.010,7.300
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5847,tm1059008,Lokillo,MOVIE,A controversial TV host and comedian who has b...,2021-01-01,,90,comedy,CO,,tt14585902,3.8,68.0,26.005,6.300
5848,tm1035612,Dad Stop Embarrassing Me - The Afterparty,MOVIE,"Jamie Foxx, David Alan Grier and more from the...",2021-01-01,PG-13,37,,US,,,,,1.296,10.000
5849,ts271048,Mighty Little Bheem: Kite Festival,SHOW,"With winter behind them, Bheem and his townspe...",2021-01-01,,7,family,,1.0,tt13711094,7.8,18.0,2.289,10.000
5849,ts271048,Mighty Little Bheem: Kite Festival,SHOW,"With winter behind them, Bheem and his townspe...",2021-01-01,,7,animation,,1.0,tt13711094,7.8,18.0,2.289,10.000
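As the output shows, a bare year parsed with `format="%Y"` becomes 1 January of that year. The following minimal sketch, on an illustrative Series rather than the real column, makes that behaviour explicit and shows how to get the integer year back with `.dt.year` if it is needed later.

```python
import pandas as pd

years = pd.Series([1945, 1976, 2021], name="release_year")

# Parse via strings so that "%Y" is matched against the four-digit year text.
as_dates = pd.to_datetime(years.astype(str), format="%Y")

# Every value is pinned to 1 January; .dt.year recovers the plain integer.
print(as_dates.dt.strftime("%Y-%m-%d").tolist())
back_to_int = as_dates.dt.year
```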


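The dataset now has 17,818 rows. `age_certification`, `seasons`, `imdb_id`, `imdb_score`, and `tmdb_score` all contain missing values, but given the goal of this analysis only a missing `imdb_score` affects the result. The other gaps are ignored for now; only the rows with a missing `imdb_score` are inspected and cleaned.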

In [11]:
c_df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17818 entries, 0 to 5849
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   id                    17818 non-null  object        
 1   title                 17817 non-null  object        
 2   type                  17818 non-null  object        
 3   description           17790 non-null  object        
 4   release_year          17818 non-null  datetime64[ns]
 5   age_certification     10889 non-null  object        
 6   runtime               17818 non-null  int64         
 7   genres                17755 non-null  object        
 8   production_countries  17439 non-null  object        
 9   seasons               6224 non-null   float64       
 10  imdb_id               17116 non-null  object        
 11  imdb_score            16976 non-null  float64       
 12  imdb_votes            16945 non-null  float64       
 13  tmdb_popularity  

Inspect the rows with a missing score. Because the analysis averages IMDb scores, all of these rows will be dropped.

In [12]:
c_df1[c_df1["imdb_score"].isnull()]

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945-01-01,TV-MA,51,documentation,US,1.0,,,,0.600,
75,tm132164,Bill Hicks: Sane Man,MOVIE,Sane Man was filmed before Bill recorded ‘Dang...,1989-01-01,R,80,comedy,US,,,,,3.377,7.5
145,ts251477,My First Errand,SHOW,“Hajimete no Otsukai” (First Errand) is a Japa...,1991-01-01,TV-G,18,documentation,JP,12.0,,,,7.730,7.8
145,ts251477,My First Errand,SHOW,“Hajimete no Otsukai” (First Errand) is a Japa...,1991-01-01,TV-G,18,family,JP,12.0,,,,7.730,7.8
145,ts251477,My First Errand,SHOW,“Hajimete no Otsukai” (First Errand) is a Japa...,1991-01-01,TV-G,18,reality,JP,12.0,,,,7.730,7.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5810,tm1225897,Social Man,MOVIE,Two competitive social media Influencers go he...,2021-01-01,,96,drama,,,tt20198164,,,,
5833,ts307884,HQ Barbers,SHOW,When a family run barber shop in the heart of ...,2021-01-01,TV-14,24,comedy,NG,1.0,,,,0.840,
5840,tm1216735,Sun of the Soil,MOVIE,"In 14th-century Mali, an ambitious young royal...",2022-01-01,,26,,,,,,,1.179,7.0
5844,tm1074617,Bling Empire - The Afterparty,MOVIE,"The stars of ""Bling Empire"" discuss the show's...",2021-01-01,,35,,US,,,,,,


In [13]:
c_df1.dropna(subset=["imdb_score"],inplace=True)

In [14]:
c_df1["imdb_score"].isnull().sum()

0
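The same check-then-drop step can be sketched without `inplace=True`, which some readers find easier to follow because the result is an explicit new variable. The table below is made up for illustration; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "id": ["tm1", "tm2", "tm3"],
        "imdb_score": [8.2, np.nan, 7.7],
        "tmdb_score": [np.nan, 6.0, 7.3],
    }
)

# Count missing values per column before deciding what to drop.
missing_before = toy.isna().sum()

# Drop a row only when imdb_score is missing; gaps in other columns are kept.
scored = toy.dropna(subset=["imdb_score"])
print(scored)
```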

Next, look at the rows where `genres` is missing.

Genre is core data for this analysis, so rows without it are dropped.

In [15]:
c_df1[c_df1["genres"].isnull()]

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
1813,ts77824,My Next Guest Needs No Introduction With David...,SHOW,TV legend David Letterman teams up with fascin...,2018-01-01,TV-MA,50,,US,4.0,tt7829834,7.8,5581.0,8.217,7.6
1939,ts215037,Minecraft: Story Mode,SHOW,"MInecraft: Story Mode is an interactive, anima...",2018-01-01,TV-PG,52,,US,1.0,tt10498322,5.6,347.0,,
2386,ts74805,A Little Help with Carol Burnett,SHOW,In this unscripted series starring comedy lege...,2018-01-01,TV-G,24,,US,1.0,tt7204366,6.3,237.0,1.621,6.2
2658,ts265844,#ABtalks,SHOW,#ABtalks is a YouTube interview show hosted by...,2018-01-01,TV-PG,68,,,1.0,tt12635254,9.6,7.0,,
4274,tm1172010,The Lockdown Plan,MOVIE,,2020-01-01,,49,,,,tt13079112,6.5,,,
4648,tm1113921,In Vitro,MOVIE,'In Vitro' is an otherworldly rumination on me...,2019-01-01,,27,,,,tt10545994,7.7,,,



In [16]:
c_df1.dropna(subset=["genres"],inplace=True)
c_df1["genres"].isnull().sum()

0

In [82]:
c_df1.duplicated().sum()

0

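Comparing the values in `production_countries` shows both the abbreviation `LB` and the full name `Lebanon`. Both spellings refer to the same country, so they are merged into one.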

In [28]:
pd.set_option('display.max_rows',10)
c_df1["production_countries"].value_counts()

US    5648
IN    1610
GB    1068
JP    1046
FR     720
ES     637
KR     637
CA     608
DE     383
CN     295
MX     264
IT     224
BR     221
AU     217
TR     195
PH     192
AR     150
ID     149
BE     148
TW     133
NG     131
PL     126
ZA     103
NL     102
HK     102
CO      94
EG      93
DK      89
TH      87
SE      81
LB      71
NO      68
AE      52
IE      49
SG      47
XX      43
IL      42
RU      41
CL      35
CH      33
PS      32
BG      31
MY      30
IS      28
SA      28
AT      28
LU      27
NZ      27
PE      26
RO      25
QA      24
CZ      22
JO      19
HU      18
FI      18
UY      15
MA      15
PT      14
KW      10
KH      10
PK       9
PR       9
MT       8
UA       8
VN       8
LT       7
TN       7
IR       7
CD       7
SU       7
SN       6
AL       6
GH       6
KE       6
CY       5
MU       5
IQ       5
MC       4
GR       4
TZ       4
IO       4
SY       4
KN       4
BD       3
HR       3
PY       3
DZ       3
BS       3
GL       3
AO       3
CM       3

In [27]:
c_df1["production_countries"] = c_df1["production_countries"].replace({"Lebanon":"LB"})
c_df1["production_countries"].value_counts()

US    5648
IN    1610
GB    1068
JP    1046
FR     720
ES     637
KR     637
CA     608
DE     383
CN     295
MX     264
IT     224
BR     221
AU     217
TR     195
PH     192
AR     150
ID     149
BE     148
TW     133
NG     131
PL     126
ZA     103
NL     102
HK     102
CO      94
EG      93
DK      89
TH      87
SE      81
LB      71
NO      68
AE      52
IE      49
SG      47
XX      43
IL      42
RU      41
CL      35
CH      33
PS      32
BG      31
MY      30
IS      28
SA      28
AT      28
LU      27
NZ      27
PE      26
RO      25
QA      24
CZ      22
JO      19
HU      18
FI      18
UY      15
MA      15
PT      14
KW      10
KH      10
PK       9
PR       9
MT       8
UA       8
VN       8
LT       7
TN       7
IR       7
CD       7
SU       7
SN       6
AL       6
GH       6
KE       6
CY       5
MU       5
IQ       5
MC       4
GR       4
TZ       4
IO       4
SY       4
KN       4
BD       3
HR       3
PY       3
DZ       3
BS       3
GL       3
AO       3
CM       3
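Scanning a long `value_counts` listing by eye is error-prone. One possible way to surface entries like `Lebanon` is to flag every value that is not a two-letter uppercase code. This is a sketch on a made-up Series, and the two-letter assumption is mine, inferred from the values printed above.

```python
import pandas as pd

countries = pd.Series(
    ["US", "LB", "Lebanon", "GB", None, "LB"], name="production_countries"
)

# True where the value is exactly two uppercase letters; missing values count as False.
is_code = countries.str.fullmatch(r"[A-Z]{2}", na=False)

# Values that are present but do not look like a country code.
suspicious = countries[countries.notna() & ~is_code]
print(suspicious.tolist())

# Map the full name onto its code, as the notebook does.
cleaned = countries.replace({"Lebanon": "LB"})
```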

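The `titles.csv` table is now clean. Next, assess and clean `credits.csv`.

A sample of the rows shows no obvious problems, so the next step is to check the table with `info` and `value_counts`.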

In [32]:
c_df2.sample(10)

Unnamed: 0,person_id,id,name,character,role
46913,1908147,tm435520,Simon Hempe,Drug Dealer,ACTOR
50091,1799535,tm918706,Steve Rasetta,Uber Driver,ACTOR
24311,288194,tm218186,Niki Stanchev,Refugee,ACTOR
76717,17424,tm1032950,Garland Whitt,,ACTOR
18439,70161,tm143363,Ronit Roy,Vikram Malhotra,ACTOR
11912,434024,ts39604,Linda Liao,Kang Wenli,ACTOR
14200,599861,tm176507,Sean Cliver,,ACTOR
5817,4986,tm26897,Jesús Ochoa,Lt. Orso,ACTOR
65851,53792,tm1075698,Noah Urrea,Clay Moss,ACTOR
27344,252968,tm424879,Masayasu Yagi,Kabuto Ijuin,ACTOR


In [37]:
pd.set_option('display.max_rows',10)

The core columns `name` and `person_id` have no missing values.

In [38]:
c_df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77801 entries, 0 to 77800
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   person_id  77801 non-null  int64 
 1   id         77801 non-null  object
 2   name       77801 non-null  object
 3   character  68029 non-null  object
 4   role       77801 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.0+ MB


The table contains no duplicate rows.

In [39]:
c_df2.duplicated().sum()

0

The `role` column contains only the two expected categories.

In [40]:
c_df2["role"].value_counts()

ACTOR       73251
DIRECTOR     4550
Name: role, dtype: int64

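To recap, the goal is to compute the average IMDb score of each actor's titles within each genre (comedy, action, sci-fi, and so on) and identify the actors behind the highest-rated titles per genre.

The two tables can now be joined on the title `id`.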

In [41]:
m_df3 = pd.merge(c_df1,c_df2,on="id")
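One thing worth checking before averaging: because `production_countries` was also exploded, a title with several countries appears several times per genre, and after the merge it may weigh more heavily in an actor's mean. The sketch below is not part of the original workflow; it uses invented toy tables to show the effect and one possible remedy, keeping a single row per `(id, genres)` pair.

```python
import pandas as pd

# Toy titles table after exploding both genres and production_countries:
# title "tm1" has two countries, so its (id, genres) pair appears twice.
titles = pd.DataFrame(
    {
        "id": ["tm1", "tm1", "tm2"],
        "genres": ["drama", "drama", "drama"],
        "production_countries": ["GB", "US", "US"],
        "imdb_score": [8.0, 8.0, 6.0],
    }
)
credits = pd.DataFrame(
    {
        "person_id": [10, 10],
        "id": ["tm1", "tm2"],
        "name": ["Actor A", "Actor A"],
        "role": ["ACTOR", "ACTOR"],
    }
)

# Merging as-is counts tm1 twice, which pulls the mean towards its score.
merged = pd.merge(titles, credits, on="id")
weighted_mean = merged.groupby(["genres", "person_id"])["imdb_score"].mean().iloc[0]

# Keeping one row per (id, genres) gives every title equal weight.
per_title = titles.drop_duplicates(subset=["id", "genres"])
merged_once = pd.merge(per_title, credits, on="id")
plain_mean = merged_once.groupby(["genres", "person_id"])["imdb_score"].mean().iloc[0]

print(weighted_mean, plain_mean)
```

Whether the per-country weighting is acceptable depends on the question being asked; the sketch only makes the difference visible.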

In [45]:
m_df3

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,person_id,name,character,role
0,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976-01-01,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179,3748,Robert De Niro,Travis Bickle,ACTOR
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976-01-01,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179,14658,Jodie Foster,Iris Steensma,ACTOR
2,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976-01-01,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179,7064,Albert Brooks,Tom,ACTOR
3,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976-01-01,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179,3739,Harvey Keitel,Matthew 'Sport' Higgins,ACTOR
4,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976-01-01,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179,48933,Cybill Shepherd,Betsy,ACTOR
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
276104,tm1059008,Lokillo,MOVIE,A controversial TV host and comedian who has b...,2021-01-01,,90,comedy,CO,,tt14585902,3.8,68.0,26.005,6.300,736339,Adelaida Buscato,María Paz,ACTOR
276105,tm1059008,Lokillo,MOVIE,A controversial TV host and comedian who has b...,2021-01-01,,90,comedy,CO,,tt14585902,3.8,68.0,26.005,6.300,399499,Luz Stella Luengas,Karen Bayona,ACTOR
276106,tm1059008,Lokillo,MOVIE,A controversial TV host and comedian who has b...,2021-01-01,,90,comedy,CO,,tt14585902,3.8,68.0,26.005,6.300,373198,Inés Prieto,Fanny,ACTOR
276107,tm1059008,Lokillo,MOVIE,A controversial TV host and comedian who has b...,2021-01-01,,90,comedy,CO,,tt14585902,3.8,68.0,26.005,6.300,378132,Isabel Gaona,Cacica,ACTOR


This analysis only concerns actors, so keep only the actor rows.

In [None]:
a_m_df3=m_df3[m_df3["role"] == "ACTOR"]

The analysis is broken down by genre and by actor, so group on both.

In [62]:
a_m_df3_g = a_m_df3.groupby(["genres","person_id"])

The target is the highest average IMDb score per genre, so first compute each actor's average score within each genre.

In [65]:
a_m_df3_g_m = a_m_df3_g["imdb_score"].mean()

Calling `reset_index` flattens the hierarchical index and yields a regular, tidier DataFrame.

In [68]:
a_m_df3_g_m = a_m_df3_g_m.reset_index()

In [69]:
a_m_df3_g_m

Unnamed: 0,genres,person_id,imdb_score
0,action,45,5.0
1,action,48,5.4
2,action,51,6.4
3,action,53,6.8
4,action,54,5.3
...,...,...,...
168876,western,2353339,6.9
168877,western,2370848,6.1
168878,western,2398539,3.8
168879,western,2406218,6.0


Next, take the highest average score within each genre.

In [73]:
g_a_m_df3_g_m_max = a_m_df3_g_m.groupby("genres")["imdb_score"].max()

In [74]:
g_a_m_df3_g_m_max

genres
action           9.3
animation        9.3
comedy           9.2
crime            9.5
documentation    9.1
                ... 
scifi            9.3
sport            9.1
thriller         9.5
war              8.8
western          8.9
Name: imdb_score, Length: 19, dtype: float64

In [82]:
g_p_a_m_df3_g_m_max = pd.merge(g_a_m_df3_g_m_max,a_m_df3_g_m,on=["genres","imdb_score"])
g_p_a_m_df3_g_m_max

Unnamed: 0,genres,imdb_score,person_id
0,action,9.3,1303
1,action,9.3,12790
2,action,9.3,21033
3,action,9.3,86591
4,action,9.3,336830
...,...,...,...
131,war,8.8,826547
132,western,8.9,22311
133,western,8.9,28166
134,western,8.9,28180
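An alternative to merging the per-genre maximum back onto the averages is to broadcast the maximum with `transform("max")` and filter on it. Like the merge, this keeps every actor tied for first place. The sketch below uses made-up averages; `avg` and its values are illustrative.

```python
import pandas as pd

avg = pd.DataFrame(
    {
        "genres": ["action", "action", "action", "war", "war"],
        "person_id": [1303, 12790, 99, 7, 8],
        "imdb_score": [9.3, 9.3, 7.0, 8.8, 6.1],
    }
)

# transform returns one value per original row: the maximum of that row's genre.
best = avg.groupby("genres")["imdb_score"].transform("max")

# Keep the rows whose score equals their genre's maximum (ties included).
top = avg[avg["imdb_score"] == best]
print(top)
```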


In [81]:
c_df2_dup=c_df2[["person_id","name"]].drop_duplicates()

In [88]:
df_4 = pd.merge(g_p_a_m_df3_g_m_max,c_df2_dup,on="person_id")

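To keep rows of the same genre together, sort the result by `genres` with `sort_values`, then call `reset_index` to renumber the index.

After the reset, the DataFrame gains an extra `index` column holding the old index, which can then be dropped.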

In [87]:
df_4.sort_values("genres").reset_index().drop("index", axis=1)

Unnamed: 0,genres,imdb_score,person_id,name
0,action,9.3,1303,Jessie Flower
1,action,9.3,86591,Cricket Leigh
2,action,9.3,21033,Zach Tyler
3,action,9.3,12790,Olivia Hack
4,action,9.3,336830,André Sogliuzzo
...,...,...,...,...
131,war,8.8,826547,Yuto Uemura
132,western,8.9,28166,Megumi Hayashibara
133,western,8.9,28180,Unsho Ishizuka
134,western,8.9,22311,Koichi Yamadera
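The reset-then-drop pair can be shortened with `reset_index(drop=True)`, which discards the old index instead of turning it into a column. Here is a small sketch on an invented table:

```python
import pandas as pd

result = pd.DataFrame(
    {"genres": ["war", "action", "western"], "name": ["C", "A", "B"]},
    index=[131, 0, 132],
)

# drop=True renumbers the rows from 0 without adding an "index" column.
tidy = result.sort_values("genres").reset_index(drop=True)
print(tidy)
```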
