# 项目：整理Netflix电影演员评分数据

## 分析目标

此数据分析的目的是，整理不同流派影视作品，比如喜剧片、动作片、科幻片中，各演员出演作品的平均IMDB评分，从而挖掘出各个流派中的高评分作品演员。

本实战项目的目的在于练习整理数据，从而得到可供下一步分析的数据。

## 简介

原始数据集记录了截止至2022年7月美国地区可观看的所有Netflix电视剧及电影数据。数据集包含两个数据表：`titles.csv`和`credits.csv`。

`titles.csv`包含电影及电视剧相关信息，包括影视作品ID、标题、类型、描述、流派、IMDB（一个国外的在线评分网站）评分，等等。`credits.csv`包含超过7万名出现在Netflix影视作品的导演及演员信息，包括名字、影视作品ID、人物名、演职员类型（导演/演员）等。

`titles.csv`每列的含义如下：
- id：影视作品ID。
- title：影视作品标题。
- show_type：作品类型，电视节目或电影。
- description：简短描述。
- release_year：发布年份。
- age_certification：适龄认证。
- runtime：每集电视剧或电影的长度。
- genres：流派类型列表。
- production_countries：出品国家列表。
- seasons：如果是电视剧，则是季数。
- imdb_id：IMDB的ID。
- imdb_score：IMDB的评分。
- imdb_votes：IMDB的投票数。
- tmdb_popularity：TMDB的流行度。
- tmdb_score：TMDB的评分。

`credits.csv`每列的含义如下：
- person_ID：演职员ID。
- id：参与的影视作品ID。
- name：姓名。
- character_name：角色姓名。
- role：演职员类型，演员或导演。

## **导入数据**

In [1]:
import pandas as pd
import numpy as np

In [2]:
credits = pd.read_csv("./credits.csv")
titles = pd.read_csv("./titles.csv")

**一、结构调整**

In [3]:
credits.sample(20)

Unnamed: 0,person_id,id,name,character,role
60167,55521,tm463038,Gabriela Tagliavini,,DIRECTOR
19262,4461,ts20238,Zhang Han,Feng Teng,ACTOR
35229,823507,tm326624,Samuel Bearman,Himself,ACTOR
60935,2026940,tm1013697,Waje,,ACTOR
40944,1800886,tm413418,Abhijeet Chavan,Shyam,ACTOR
49992,3464,tm463385,Alec Baldwin,John DeLorean,ACTOR
41649,12400,tm444564,Mónica del Carmen,Laura González,ACTOR
58966,2385251,tm845850,Sara Montalvo,,ACTOR
63419,1817506,tm888850,Mizuki Kayashima,Kawahara Risa,ACTOR
60312,1341231,tm501222,Beto Mendoza,Public Official #2,ACTOR


`credits`表结构没问题

In [4]:
titles.sample(20)

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
4686,tm824566,Ibrahim: A Fate to Define,MOVIE,"In this provocative and personal documentary, ...",2019,,75,['documentation'],"['PS', 'DK', 'LB']",,tt10777716,7.4,31.0,1.121,
1327,ts37804,W/ Bob & David,SHOW,After being dishonorably discharged from the N...,2015,TV-MA,35,['comedy'],['US'],1.0,tt4574708,7.4,4160.0,4.117,6.8
3700,ts312044,Beyblade Burst Surge,SHOW,,2020,TV-Y,23,['animation'],['JP'],1.0,tt18554728,8.3,30.0,7.501,9.7
1513,tm239769,I'll Sleep When I'm Dead,MOVIE,An energetic and fast-paced bio-doc that exami...,2016,,82,['documentation'],['US'],,tt4679136,6.6,1364.0,3.096,6.5
4191,ts233367,Sing On! Spain,SHOW,"Six people compete for a 30,000 euro reward by...",2020,TV-G,39,"['music', 'reality']",['ES'],1.0,tt11698668,6.1,133.0,1.795,8.5
5009,tm993670,Cosmic Sin,MOVIE,"In the year 2524, four centuries after humans ...",2021,R,88,"['action', 'scifi']",['US'],,tt11762434,2.5,12517.0,98.432,4.2
1243,tm244788,13th,MOVIE,An in-depth look at the prison system in the U...,2016,,100,"['documentation', 'history', 'crime']",['US'],,tt5895028,8.2,35302.0,8.675,8.0
5202,ts258415,Elves,SHOW,A Christmas vacation turns into a nightmare fo...,2021,TV-14,24,"['thriller', 'family', 'drama', 'fantasy', 'ho...",['DK'],1.0,tt13231962,5.5,4165.0,45.71,7.0
4596,ts223115,The Charming Stepmom,SHOW,A quirky fashion student becomes the nanny of ...,2019,TV-14,46,"['family', 'romance', 'comedy']",['TH'],1.0,tt13846404,7.9,18.0,5.114,10.0
854,tm148147,Big Eyes,MOVIE,"In the late 1950s and early '60s, artist Walte...",2014,PG-13,106,"['drama', 'documentation', 'romance', 'crime']","['CA', 'US']",,tt1126590,7.2,51.0,11.98,6.998


`titles`表`genres`列和`production_countries`列需要分列并匹配到行中

In [5]:
titles["genres"][1]

"['drama', 'crime']"

In [6]:
titles["production_countries"][1]

"['US']"

检索出`genres`列和`production_countries`列的单元格可知元素类型不是单纯的列表，本质是字符串，所以需要将字符串转换为列表

首先对`titles`表和`credits`表进行备份

In [7]:
titles_clean = titles.copy()
credits_clean = credits.copy()

将`genres`列和`production_countries`列的数据进行转换，使用eval函数

In [8]:
titles_clean["genres"] = titles_clean["genres"].apply(lambda x:eval(x))

In [9]:
titles_clean["genres"][1]

['drama', 'crime']

In [10]:
titles_clean["production_countries"] = titles_clean["production_countries"].apply(lambda x:eval(x))

In [11]:
titles_clean["production_countries"][1]

['US']

接下来将`genres`列和`production_countries`列的列表数据转为行

In [12]:
titles_clean = titles_clean.explode("genres")

In [13]:
titles_clean = titles_clean.explode("production_countries")

In [14]:
titles_clean.head(20)

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,51,documentation,US,1.0,,,,0.6,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,crime,US,,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,drama,US,,tt0068473,7.7,107673.0,10.01,7.3
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,action,US,,tt0068473,7.7,107673.0,10.01,7.3
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,thriller,US,,tt0068473,7.7,107673.0,10.01,7.3
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,european,US,,tt0068473,7.7,107673.0,10.01,7.3
3,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,fantasy,GB,,tt0071853,8.2,534486.0,15.461,7.811
3,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,action,GB,,tt0071853,8.2,534486.0,15.461,7.811
3,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,comedy,GB,,tt0071853,8.2,534486.0,15.461,7.811


**二、内容调整**

1.检查`titles_clean`表内容

In [15]:
titles_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17818 entries, 0 to 5849
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    17818 non-null  object 
 1   title                 17817 non-null  object 
 2   type                  17818 non-null  object 
 3   description           17790 non-null  object 
 4   release_year          17818 non-null  int64  
 5   age_certification     10889 non-null  object 
 6   runtime               17818 non-null  int64  
 7   genres                17755 non-null  object 
 8   production_countries  17439 non-null  object 
 9   seasons               6224 non-null   float64
 10  imdb_id               17116 non-null  object 
 11  imdb_score            16976 non-null  float64
 12  imdb_votes            16945 non-null  float64
 13  tmdb_popularity       17663 non-null  float64
 14  tmdb_score            17241 non-null  float64
dtypes: float64(5), int64(2), 

`titles_clean`表数据类型问题概况：  
1.空缺值：除了`id` `type` `release_year` `runtime` 列以外的其他列均有空缺值，其中`genres`列为关键数据，可以将空值替换为“other”  
2.数据类型调整：`release_year`列的数据类型应是日期时间，`runtime`列的数据类型应是时间

调整`titles_clean`表的内容

In [16]:
titles_clean["genres"] = titles_clean["genres"].fillna("other") 

将`release_year`列的数据类型改为日期时间

In [18]:
titles_clean["release_year"] = pd.to_datetime(titles_clean["release_year"], format="%Y")
titles_clean["release_year"]

0      1945-01-01
1      1976-01-01
1      1976-01-01
2      1972-01-01
2      1972-01-01
          ...    
5847   2021-01-01
5848   2021-01-01
5849   2021-01-01
5849   2021-01-01
5849   2021-01-01
Name: release_year, Length: 17818, dtype: datetime64[ns]

将`runtime`列的数据类型改为时间

In [20]:
titles_clean['runtime'] = pd.to_timedelta(titles_clean['runtime'], unit='m').dt.total_seconds() // 60
titles_clean['runtime']

0        51.0
1       114.0
1       114.0
2       109.0
2       109.0
        ...  
5847     90.0
5848     37.0
5849      7.0
5849      7.0
5849      7.0
Name: runtime, Length: 17818, dtype: float64

查看`titles_clean`表的数值是否有异常

In [21]:
titles_clean.describe()

Unnamed: 0,release_year,runtime,seasons,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
count,17818,17818.0,6224.0,16976.0,16945.0,17663.0,17241.0
mean,2015-12-14 23:26:22.803906048,79.944326,2.40858,6.514467,32809.16,28.787751,6.847247
min,1945-01-01 00:00:00,0.0,1.0,1.5,5.0,0.009442,0.5
25%,2015-01-01 00:00:00,45.0,1.0,5.8,780.0,3.874,6.2
50%,2018-01-01 00:00:00,90.0,1.0,6.6,3508.0,9.885,6.9
75%,2020-01-01 00:00:00,107.0,3.0,7.3,16976.0,23.051,7.5
max,2022-01-01 00:00:00,240.0,42.0,9.6,2294231.0,2274.044,10.0
std,,39.855654,2.829108,1.131246,114136.8,91.691457,1.096069


`titles_clean`表的数值无异常

2.检查`credits_clean`表内容

In [22]:
credits_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77801 entries, 0 to 77800
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   person_id  77801 non-null  int64 
 1   id         77801 non-null  object
 2   name       77801 non-null  object
 3   character  68029 non-null  object
 4   role       77801 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.0+ MB


`credits_clean`表数据类型问题概况：  
1.空缺值：`character`列有空缺值，角色姓名列不是关键数据空缺值不处理  
2.数据类型调整：`person_id`列应该是字符串

`person_id`列改为字符串

In [23]:
credits_clean["person_id"] = credits_clean["person_id"].astype(str)
credits_clean["person_id"]

0           3748
1          14658
2           7064
3           3739
4          48933
          ...   
77796     736339
77797     399499
77798     373198
77799     378132
77800    1950416
Name: person_id, Length: 77801, dtype: object

重复数据筛选

In [27]:
credits_clean.duplicated().sum()

0

In [28]:
titles_clean.duplicated().sum()

0

无重复数据

筛选`genres`列和`production_countries`列是否有不一致数据

In [29]:
titles_clean["genres"].value_counts()

genres
drama            3517
comedy           2538
thriller         1505
action           1394
romance          1098
crime            1093
documentation    1085
animation         816
family            803
fantasy           738
european          699
scifi             676
horror            451
history           336
music             289
reality           241
war               232
sport             188
other              63
western            56
Name: count, dtype: int64

In [32]:
pd.set_option("display.max_rows",200)
titles_clean["production_countries"].value_counts()

production_countries
US         5904
IN         1662
GB         1107
JP         1099
FR          741
KR          684
ES          660
CA          641
DE          397
CN          308
MX          278
IT          231
BR          225
AU          220
TR          198
PH          197
ID          156
AR          156
BE          155
NG          147
TW          144
PL          133
ZA          110
NL          109
HK          106
EG          104
CO           99
TH           95
DK           93
SE           81
LB           78
NO           71
SG           56
AE           54
IE           51
XX           45
PS           42
IL           42
RU           41
CL           36
BG           36
MY           35
SA           34
CH           34
AT           32
IS           29
LU           27
NZ           27
PE           27
CZ           26
QA           26
RO           25
HU           19
JO           19
FI           18
UY           15
MA           15
PT           14
KW           13
KH           10
VN           10
PR 

`production_countries`列的Lebanon格式跟其他的国家不一样，要替换成LBN

In [33]:
titles_clean["production_countries"] = titles_clean["production_countries"].replace("Lebanon","LBN")
titles_clean["production_countries"].value_counts()

production_countries
US     5904
IN     1662
GB     1107
JP     1099
FR      741
KR      684
ES      660
CA      641
DE      397
CN      308
MX      278
IT      231
BR      225
AU      220
TR      198
PH      197
ID      156
AR      156
BE      155
NG      147
TW      144
PL      133
ZA      110
NL      109
HK      106
EG      104
CO       99
TH       95
DK       93
SE       81
LB       78
NO       71
SG       56
AE       54
IE       51
XX       45
PS       42
IL       42
RU       41
CL       36
BG       36
MY       35
SA       34
CH       34
AT       32
IS       29
LU       27
NZ       27
PE       27
CZ       26
QA       26
RO       25
HU       19
JO       19
FI       18
UY       15
MA       15
PT       14
KW       13
KH       10
VN       10
PR        9
PK        9
TN        8
UA        8
MT        8
LT        7
CD        7
KE        7
IR        7
SU        7
AL        6
SN        6
GH        6
SY        6
CY        5
GR        5
MU        5
IO        5
IQ        5
KN        4
TZ     

**三、合并数据**

In [34]:
Netflix_show_data = pd.merge(titles_clean,credits_clean,on ="id" )
Netflix_show_data.sample(10)

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,person_id,name,character,role
241169,ts255888,First Kill,SHOW,Falling in love is tricky for teens Juliette a...,2022-01-01,TV-MA,49.0,thriller,US,1.0,tt13315156,6.4,23605.0,252.955,8.4,171099,Jason R. Moore,Jack Burns,ACTOR
54186,tm36237,Contraband,MOVIE,When his brother-in-law runs afoul of a drug l...,2012-01-01,R,109.0,crime,US,,tt1524137,6.4,122383.0,46.41,6.3,608879,Turner Crumbley,Laird,ACTOR
53205,ts20429,Sword Art Online,SHOW,"In the near future, a Virtual Reality Massive ...",2012-01-01,TV-MA,23.0,animation,JP,4.0,tt2250192,7.6,44606.0,58.325,8.284,60138,Hiroaki Hirata,Klein (voice),ACTOR
145916,tm359713,El Camino Christmas,MOVIE,"A young man seeking a father he has never met,...",2017-01-01,,89.0,crime,US,,tt3255590,5.7,8830.0,8.462,5.6,785478,Darrell Keith Harris,Black Cowboy at gates,ACTOR
98382,tm235405,Hyena Road,MOVIE,"Three different men, three different worlds, t...",2015-01-01,R,120.0,action,CA,,tt4034452,6.5,7839.0,9.363,6.7,1891,Clark Johnson,General Rilmen,ACTOR
23908,tm20959,The Pursuit of Happyness,MOVIE,A struggling salesman takes custody of his son...,2006-01-01,PG-13,117.0,european,US,,tt0454921,8.0,501457.0,47.893,7.9,95791,Mike Garibaldi,Paul,ACTOR
118775,tm233482,The Foreigner,MOVIE,Quan is a humble London businessman whose long...,2017-01-01,R,113.0,thriller,IN,,tt1615160,7.0,113338.0,32.05,6.8,694930,Mike Ray,Businessman,ACTOR
207166,tm848995,Jingle Jangle: A Christmas Journey,MOVIE,An imaginary world comes to life in a holiday ...,2020-01-01,PG,122.0,music,US,,tt7736496,6.4,18374.0,28.329,6.65,1629063,Kenyah Sandy,Grandson,ACTOR
6248,tm27395,Mission: Impossible II,MOVIE,With computer genius Luther Stickell at his si...,2000-01-01,PG-13,123.0,thriller,US,,tt0120755,6.1,337987.0,29.392,6.1,856785,Antonio Vargas,Senor De L'Arena,ACTOR
75366,tm174289,Free Birds,MOVIE,Two turkeys from opposite sides of the tracks ...,2013-01-01,PG,91.0,animation,US,,tt1621039,5.8,24140.0,17.211,5.9,5198,Keith David,Chief Broadbeak (Voice),ACTOR


In [35]:
Netflix_show_data["production_countries"].value_counts()

production_countries
US     108977
GB      22013
IN      21640
JP      14802
FR      12068
CA      10899
ES      10152
KR       9706
DE       7805
CN       5101
IT       3807
PH       3350
MX       3055
AU       3017
BE       2936
TR       2851
ID       2676
BR       2451
PL       2264
HK       2003
NL       1969
AR       1951
ZA       1632
TH       1474
TW       1450
NG       1434
EG       1337
DK       1283
SE       1254
NO        964
AE        925
BG        923
IE        869
CH        840
CO        733
CZ        683
LB        677
RU        598
CL        556
IL        512
IS        492
SG        491
RO        469
NZ        460
HU        397
MY        390
LU        370
AT        369
PE        310
SA        288
XX        281
PS        278
JO        263
MT        248
PT        239
QA        232
IR        223
MA        219
PR        217
MC        212
FI        205
UY        199
BS        171
GR        161
KH        142
PK        128
AL        110
GH         99
TN         98
VN         89

**四、分析数据**

只分析演员，所以首先要筛选出演员的观察值

In [53]:
Netflix_show_ACTOR_data = Netflix_show_data.query("role == 'ACTOR'")

需要对`genres`列进行分组计算`imdb_score`的平均分

In [54]:
genres_meandata = pd.pivot_table(Netflix_show_data,index = ["genres","person_id"],values = "imdb_score",aggfunc = np.mean)

  genres_meandata = pd.pivot_table(Netflix_show_data,index = ["genres","person_id"],values = "imdb_score",aggfunc = np.mean)


找到每个`genres`分组中`imdb_score' 最高的行的索引

In [56]:
genres_meandata_clean = genres_meandata.reset_index()
genres_meandata_clean

Unnamed: 0,genres,person_id,imdb_score
0,action,1000,6.866667
1,action,100007,7.000000
2,action,100013,6.400000
3,action,100019,6.500000
4,action,100020,6.500000
...,...,...,...
177319,western,993735,6.500000
177320,western,998673,7.300000
177321,western,998674,7.300000
177322,western,998675,7.300000


In [58]:
genres_meandata_clean_max = genres_meandata_clean.groupby("genres")["imdb_score"].max()
genres_meandata_clean_max

genres
action           9.3
animation        9.3
comedy           9.2
crime            9.5
documentation    9.1
drama            9.5
european         8.9
family           9.3
fantasy          9.3
history          9.1
horror           9.0
music            8.8
other            7.8
reality          8.9
romance          9.2
scifi            9.3
sport            9.1
thriller         9.5
war              8.8
western          8.9
Name: imdb_score, dtype: float64

In [59]:
genres_meandata_clean_max = pd.merge(genres_meandata_clean,genres_meandata_clean_max,on =["genres","imdb_score"] )
genres_meandata_clean_max

Unnamed: 0,genres,person_id,imdb_score
0,action,12790,9.3
1,action,1303,9.3
2,action,21033,9.3
3,action,336830,9.3
4,action,86591,9.3
5,animation,1303,9.3
6,animation,21033,9.3
7,animation,28024,9.3
8,animation,336830,9.3
9,animation,86591,9.3


到这里筛选出了各流派平均分最高的演员id，有很多是并列的最高

然后需要把演员id跟演员名称匹配上，因为只需要名称列，所以先把`Netflix_show_data`表里的`name`和`person_id`列提取出来，并删除重复列

In [64]:
name_person_id = Netflix_show_data[["name","person_id"]].drop_duplicates()
name_person_id

Unnamed: 0,name,person_id
0,Robert De Niro,3748
1,Jodie Foster,14658
2,Albert Brooks,7064
3,Harvey Keitel,3739
4,Cybill Shepherd,48933
...,...,...
284180,Adelaida Buscato,736339
284181,Luz Stella Luengas,399499
284182,Inés Prieto,373198
284183,Isabel Gaona,378132


然后把`name_person_id` `genres_meandata_clean_max` 根据`person_id`合并

In [66]:
genres_imdbscore_max_actor = pd.merge(genres_meandata_clean_max,name_person_id,on ="person_id" )
genres_imdbscore_max_actor

Unnamed: 0,genres,person_id,imdb_score,name
0,action,12790,9.3,Olivia Hack
1,action,1303,9.3,Jessie Flower
2,action,21033,9.3,Zach Tyler
3,action,336830,9.3,André Sogliuzzo
4,action,86591,9.3,Cricket Leigh
5,animation,1303,9.3,Jessie Flower
6,animation,21033,9.3,Zach Tyler
7,animation,28024,9.3,Dante Basco
8,animation,336830,9.3,André Sogliuzzo
9,animation,86591,9.3,Cricket Leigh
