# Highest Paid K-drama Actors 2021

_This is a notebook by Sofuwa Oluwafunmilayo._

## Contents

[1. Import Libraries](#Import-Libraries)

[2. Scrape the website with the information on the highest paid K-drama actors](#Scrape-the-website-with-the-information-on-the-highest-paid-K-drama-actors)

[3. Scrape each actor's Wikipedia page](#Scrape-each-actor's-Wikipedia-page) 

[4. Creating the actor's dataframe](#Creating-the-actor's-dataframe)

[5. Scraping information from IMDb](#Scraping-information-from-IMDb)
    
[6. Creating the IMDb dataframe](#Creating-the-IMDb-dataframe)
    
[7. Scraping the series images](#Scraping-the-series-images)

[8. Downloading the data tables](#Downloading-the-data-tables)

[9. Link to Other Pages](#Link-to-Other-Pages)

## Import Libraries

In [1]:
from bs4 import BeautifulSoup as bs
import requests
import re
import pandas as pd
import os

## Scrape the website with the information on the highest paid K-drama actors

In [2]:
request_1 = requests.get("https://seoulspace.com/richest-kdrama-actors-highest-paid-actors-in-korean-dramas/")
# Convert to a BeautifulSoup object; the parser name belongs here, not in requests.get()
first_soup = bs(request_1.content, "html.parser")

I will extract each actor's name, summary and fee per episode from the website.

In [3]:
fee_per_episode=[]
summary = []
actor = []

info_box = first_soup.find(class_="entry-content")
info_fees = info_box.find_all("p")[3::3]
info_summary = info_box.find_all("p")[2::3]
info_actor = first_soup.find_all("h2")[0:10]

for row in info_fees:
    fee_per_episode.append(row.get_text().replace('\xa0',''))
for row in info_summary:
    summary.append(row.get_text().replace('\xa0',''))
for row in info_actor:
    actor.append(row.get_text().replace('\xa0',''))
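The `[3::3]` and `[2::3]` slices only work if the article repeats one block of paragraphs per actor, with the summary and the fee always at the same offsets. Here is a minimal sketch of the same stride slicing on a made-up list of strings; the sample texts and the assumed layout (two intro paragraphs, then three paragraphs per actor) are illustrative, not the page's real content.

```python
# Hypothetical paragraph texts standing in for info_box.find_all("p").
# Assumed layout: two intro paragraphs, then a summary, a fee line and a filler paragraph per actor.
paragraphs = [
    "Intro paragraph", "Second intro paragraph",
    "Summary for actor A", "Fee per episode: $100,000", "Filler A",
    "Summary for actor B", "Fee per episode: $90,000", "Filler B",
]

summaries = paragraphs[2::3]  # start at index 2, then take every third item
fees = paragraphs[3::3]       # start at index 3, then take every third item

print(summaries)
print(fees)
```

If the site adds or removes a paragraph, every slice shifts, so it is worth printing the first few items of each list before building the dataframe.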

## Scrape each actor's Wikipedia page

Get more information on each actor from their Wikipedia page.

In [4]:
# The Wikipedia URLs for these actors share a similar format. To build each URL:
# First, transform each name in the 'actor' list. In hyphenated names, the character
# that comes immediately after the hyphen has to be lowercased, e.g. Kim Soo-Hyun becomes Kim Soo-hyun.
# Second, replace spaces with underscores.

actor_1 = [re.sub(r"([-])\s*([a-zA-Z])", lambda p: p.group(0).lower(), a) for a in actor]
actor_1 = [a.replace(' ','_') for a in actor_1]

base_url = 'https://en.wikipedia.org/wiki/'
actor_url = [(base_url + a) for a in actor_1]
actor_url

['https://en.wikipedia.org/wiki/Kim_Soo-hyun',
 'https://en.wikipedia.org/wiki/So_Ji-sub',
 'https://en.wikipedia.org/wiki/Hyun_Bin',
 'https://en.wikipedia.org/wiki/Lee_Min-ho',
 'https://en.wikipedia.org/wiki/Ji_Chang-wook',
 'https://en.wikipedia.org/wiki/Jo_In-sung',
 'https://en.wikipedia.org/wiki/Yoo_Ah-in',
 'https://en.wikipedia.org/wiki/Lee_Jong-suk',
 'https://en.wikipedia.org/wiki/Lee_Seung-gi',
 'https://en.wikipedia.org/wiki/Song_Joong-ki']
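The two transformation steps can also be wrapped in one small function, which makes them easy to check on a single name. This is only a sketch; `to_wiki_url` and `WIKI_BASE` are hypothetical names that do not appear elsewhere in the notebook.

```python
import re

WIKI_BASE = "https://en.wikipedia.org/wiki/"

def to_wiki_url(name):
    """Lowercase the letter after each hyphen, swap spaces for underscores, prepend the base URL."""
    fixed = re.sub(r"-\s*[a-zA-Z]", lambda m: m.group(0).lower(), name)
    return WIKI_BASE + fixed.replace(" ", "_")

print(to_wiki_url("Kim Soo-Hyun"))  # https://en.wikipedia.org/wiki/Kim_Soo-hyun
print(to_wiki_url("Hyun Bin"))      # https://en.wikipedia.org/wiki/Hyun_Bin
```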

In [5]:
actor_dob = []
actor_age = []

for link in actor_url:
  request_2 = requests.get(link)
  second_soup = bs(request_2.content, "html.parser")
  info_box = second_soup.find(class_="infobox-data")
  
  info_box_text = info_box.get_text()

  # I want the dob (date of birth), which sits between the first closing bracket and the last opening bracket. For
  # example: (1985-09-19) September 19, 1985 (age 36)Dong District, Daejeon, South Korea
  # This extraction method uses string indexing and slicing.
  dob_start = ') '
  dob_end = ' ('
  s = info_box_text
  actor_dob.append(s[s.find(dob_start)+len(dob_start):s.rfind(dob_end)].replace(",",""))

  # Apply the same approach to extract the actor's age. Only the start and end characters differ:
  # the age sits between the non-breaking space after "age" and the last closing bracket.
  age_start = '\xa0'
  age_end = ')'
  actor_age.append(s[s.find(age_start)+len(age_start):s.rfind(age_end)])
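As an alternative to the index arithmetic, a single regular expression can pull out both values at once. The sketch below is a swapped-in technique, not the method used in this notebook. It assumes the infobox text looks like the example in the comment above, with a non-breaking space between "age" and the number. `parse_born` is a hypothetical helper name.

```python
import re

def parse_born(text):
    """Return (date_of_birth, age) from an infobox 'Born' string, or (None, None) if it does not match."""
    # After the first ')', capture everything up to '(age NN)'; \s also matches the non-breaking space.
    match = re.search(r"\)\s*(.+?)\s*\(age\s*(\d+)\)", text)
    if match is None:
        return None, None
    return match.group(1).replace(",", ""), int(match.group(2))

sample = "(1985-09-19) September 19, 1985 (age\xa036)Dong District, Daejeon, South Korea"
print(parse_born(sample))  # ('September 19 1985', 36)
```

Unlike the slicing version, this returns the age as an integer and reports a non-matching page explicitly instead of producing a wrong slice.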

## Creating the actor's dataframe

In [6]:
actors_df = pd.DataFrame()
actors_df['Actor'] = actor
actors_df['Date of Birth'] = actor_dob
actors_df['Age'] = actor_age
actors_df['Earnings Per Episode'] = fee_per_episode
actors_df['Information'] = summary

# Clean up the data
# regex=False treats "$" as a literal dollar sign rather than the regex end-of-string anchor
actors_df['Earnings Per Episode'] = actors_df['Earnings Per Episode'].str.replace("$", "", regex=False)
to_drop = {"Fee per episode": "",",":"",": ":""}
actors_df['Earnings Per Episode'] = actors_df['Earnings Per Episode'].replace(to_drop, regex=True)
# Convert earnings per episode from string to integer
actors_df['Earnings Per Episode'] = actors_df['Earnings Per Episode'].astype(int) 

actors_df

| | Actor | Date of Birth | Age | Earnings Per Episode | Information |
|---|---|---|---|---|---|
| 0 | Kim Soo-Hyun | February 16 1988 | 33 | 164000 | Kim Soo Hyun has become the most popular Kdram... |
| 1 | So Ji-sub | November 4 1977 | 43 | 67000 | So Ji-sub is one of the hardest working K-Dram... |
| 2 | Hyun Bin | September 25 1982 | 39 | 84000 | Hyun Bin has been doing Korean dramas for a wh... |
| 3 | Lee Min-ho | June 22 1987 | 34 | 62000 | Lee Min-ho started doing Korean dramas back in... |
| 4 | Ji Chang-wook | 5 July 1987 | 34 | 50000 | Ji Chang-wook really does it all. He has starr... |
| 5 | Jo In-Sung | July 28 1981 | 40 | 67000 | It is crazy to think that Jo In-Sung is alread... |
| 6 | Yoo Ah-in | October 6 1986 | 35 | 50000 | Yoo Ah-in is one of the most in-demand Korean ... |
| 7 | Lee Jong-suk | 14 September 1989 | 32 | 50000 | Many Kdrama fans might not know this but Lee J... |
| 8 | Lee Seung-gi | January 13 1987 | 34 | 59000 | Fans of Lee Seung-gi love his innocent look an... |
| 9 | Song Joong-ki | September 19 1985 | 36 | 50000 | Song Joong-ki has worldwide appeal thanks to t... |
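The earnings clean-up can be checked on a single string before it is applied to the whole column. The sketch below strips everything that is not a digit. The raw format `"Fee per episode: $164,000"` is an assumption inferred from the replacements in the cell above, and `parse_fee` is a hypothetical helper name.

```python
import re

def parse_fee(text):
    """Keep only the digits of a fee string and return them as an int."""
    digits = re.sub(r"\D", "", text)
    return int(digits)

print(parse_fee("Fee per episode: $164,000"))  # 164000
```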


## Scraping information from IMDb

To get information from IMDb, I used the advanced search URL to search for each actor's name. Unlike Wikipedia, IMDb uses '+' rather than '_' in place of spaces. I also had to change one actor's name because its spelling on IMDb differs slightly from its spelling on Wikipedia.

In [7]:
actor_2 = [a.replace('So_Ji-sub','So_Ji-seob') for a in actor_1]
actor_2 = [a.replace('_','+') for a in actor_2]
actor_2

base_imdb_url = 'https://www.imdb.com/search/name/?name='
actor_imdb_search = [(base_imdb_url + a) for a in actor_2]
actor_imdb_search

['https://www.imdb.com/search/name/?name=Kim+Soo-hyun',
 'https://www.imdb.com/search/name/?name=So+Ji-seob',
 'https://www.imdb.com/search/name/?name=Hyun+Bin',
 'https://www.imdb.com/search/name/?name=Lee+Min-ho',
 'https://www.imdb.com/search/name/?name=Ji+Chang-wook',
 'https://www.imdb.com/search/name/?name=Jo+In-sung',
 'https://www.imdb.com/search/name/?name=Yoo+Ah-in',
 'https://www.imdb.com/search/name/?name=Lee+Jong-suk',
 'https://www.imdb.com/search/name/?name=Lee+Seung-gi',
 'https://www.imdb.com/search/name/?name=Song+Joong-ki']
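The standard library can also build the same query string: `urllib.parse.urlencode` replaces spaces with '+' and escapes any other unsafe characters. A minimal sketch, where `imdb_search_url` and the `IMDB_SPELLING` lookup are hypothetical names introduced for illustration:

```python
from urllib.parse import urlencode

IMDB_SEARCH = "https://www.imdb.com/search/name/?"
# Names whose IMDb spelling differs from the Wikipedia spelling
IMDB_SPELLING = {"So Ji-sub": "So Ji-seob"}

def imdb_search_url(name):
    """Build the IMDb name-search URL, applying the spelling override first."""
    name = IMDB_SPELLING.get(name, name)
    return IMDB_SEARCH + urlencode({"name": name})

print(imdb_search_url("So Ji-sub"))  # https://www.imdb.com/search/name/?name=So+Ji-seob
```

Keeping the spelling differences in a dictionary means more exceptions can be added later without another chained `replace` call.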

After building the search URLs, I take the URL of the first result on each page. This is necessary because IMDb assigns an ID number to each item on its site (actors, movies, etc.), and the URL contains this ID rather than the actor's name.

In [8]:
actor_imdb_url = []

for url in actor_imdb_search:
  request_3  = requests.get(url)
  third_soup = bs(request_3.content, "html.parser")
  base_actor_url = 'https://www.imdb.com'
  first_url  = third_soup.find(class_="lister-item-header")
  second_url = first_url.find('a')['href']
  third_url  = base_actor_url + second_url
  actor_imdb_url.append(third_url)
actor_imdb_url

['https://www.imdb.com/name/nm4633543',
 'https://www.imdb.com/name/nm1234414',
 'https://www.imdb.com/name/nm1593460',
 'https://www.imdb.com/name/nm3316279',
 'https://www.imdb.com/name/nm3865611',
 'https://www.imdb.com/name/nm1251770',
 'https://www.imdb.com/name/nm2584860',
 'https://www.imdb.com/name/nm4062328',
 'https://www.imdb.com/name/nm3876951',
 'https://www.imdb.com/name/nm3609366']
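When joining a base URL with a path scraped from an `href`, `urllib.parse.urljoin` handles the slashes, so the result has a single slash whether or not the base ends with one. The sketch below also pulls the IMDb ID out of a URL with a regular expression; `absolute_imdb_url` and `imdb_id` are hypothetical helper names.

```python
import re
from urllib.parse import urljoin

def absolute_imdb_url(href):
    """Turn a site-relative path such as '/name/nm4633543' into a full IMDb URL."""
    return urljoin("https://www.imdb.com/", href)

def imdb_id(url):
    """Return the first 'nm...' (name) or 'tt...' (title) ID found in the URL, or None."""
    match = re.search(r"(nm|tt)\d+", url)
    return match.group(0) if match else None

url = absolute_imdb_url("/name/nm4633543")
print(url)           # https://www.imdb.com/name/nm4633543
print(imdb_id(url))  # nm4633543
```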

For each of the URLs above, I will get information only about the TV series each actor has appeared in: the series title, the series URL and the actor's name.

In [9]:
series_title = []
series_url = []
series_actor = []

for url in actor_imdb_url:
  request_4 = requests.get(url)
  fourth_soup = bs(request_4.content, "html.parser")
  fourth_url = fourth_soup.find("div",class_="filmo-category-section")
  get_url = fourth_url.find_all('b')

#get just the series title and url
  just_series = []
  for i in get_url:
    if i.next_sibling.strip() == '(TV Series)' or i.next_sibling.strip() == '(TV Mini Series)':
      just_series.append(i)

#get the series title
  for i in range(len(just_series)):
    series_name = just_series[i].get_text()
    series_title.append(series_name)

#get the series url
    imdb_url = just_series[i].find('a')['href']
    base_url = 'https://www.imdb.com/'
    full_url = base_url + imdb_url
    series_url.append(full_url)

# append the actor's name once per series so series_actor stays the same length as series_title
    actor_name = fourth_soup.find(class_="itemprop").get_text()
    series_actor.append(actor_name)

In [10]:
#check if all variables are the same length
len(series_title),len(series_url),len(series_actor)

(130, 130, 130)
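The filtering step can be tried out without any scraping. In the sketch below, each credit is a made-up (title, label) pair, and one row per series is built with the actor's name already attached. This is one way to avoid keeping several parallel lists in sync, not the approach used in the loop above. `series_rows` and the sample credits are hypothetical.

```python
# Hypothetical filmography entries; the labels mirror the ones tested in the loop above.
credits = [
    ("Series A", "(TV Series)"),
    ("Film B", ""),
    ("Series C", " (TV Mini Series) "),
    ("Short D", "(Short)"),
]
SERIES_LABELS = {"(TV Series)", "(TV Mini Series)"}

def series_rows(actor_name, credits):
    """Keep only TV series and mini series, returning one dict per series."""
    return [
        {"Actor": actor_name, "Film Title": title}
        for title, label in credits
        if label.strip() in SERIES_LABELS
    ]

rows = series_rows("Example Actor", credits)
print(rows)
```

A list of dicts like this can be passed straight to `pd.DataFrame(rows)`.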

Using each series URL, I will get the alternative title, IMDb rating, summary, release year and genre for each series.

In [11]:
also_known_as = []
imdb_rating = []
series_summary = []
release_year = []
genre = []

for url in series_url:
  request_5 = requests.get(url)
  fifth_soup = bs(request_5.content, "html.parser")
  
  alt_title = fifth_soup.find("li",attrs={"data-testid":"title-details-akas"})
  if alt_title is not None:
    also_known_as.append(alt_title.get_text().replace('Also known as',''))
  else:
    also_known_as.append('No alternative title')

  rating = fifth_soup.find("div",class_="AggregateRatingButton__Rating-sc-1ll29m0-2 bmbYRW")
  if rating is not None:
    imdb_rating.append(rating.get_text())
  else:
    imdb_rating.append('No IMDb rating')

  #named 'plot' so that the 'summary' list from the actor scrape is not overwritten
  plot = fifth_soup.find(class_="GenresAndPlot__TextContainerBreakpointXS_TO_M-cum89p-0 dcFkRD")
  if plot is not None:
    if plot.find('a') in plot:
      #some summaries have a link, such as a 'read all' link, that leads to the full summary
      #for such summaries, I follow the link and take the full summary from the first paragraph
      summary_url = url + plot.find('a')['href']
      request_6 = requests.get(summary_url)
      sixth_soup = bs(request_6.content, "html.parser")
      series_summary.append(sixth_soup.find('p').get_text())
    else:
      series_summary.append(plot.get_text())
  else:
    series_summary.append('No IMDb movie summary')

  r_year = fifth_soup.find("span",class_="TitleBlockMetaData__ListItemText-sc-12ein40-2 jedhex")
  if r_year is not None:
    year = r_year.get_text()
    #get only the first four characters
    release_year.append(year[:4])
  else:
    release_year.append('No year specified')

  s_genre = fifth_soup.find_all('a',class_="GenresAndPlot__GenreChip-cum89p-3 fzmeux ipc-chip ipc-chip--on-baseAlt")
  if s_genre is not None:
    s_genre_1 = [i.get_text() for i in s_genre]
    genre.append(s_genre_1)
  else:
    genre.append('No genre specified')

#replace summaries that have string length below 1
series_summary = ['No IMDb movie summary' if len(s) < 1 else s for s in series_summary]

#convert each list in the 'genre' variable to a string
genre = [', '.join(i) for i in genre]

#replace genres that have string length below 1
genre = ['No genre specified' if len(g) < 1 else g for g in genre]

In [12]:
len(imdb_rating),len(also_known_as),len(series_summary), len(release_year), len(genre)

(130, 130, 130, 130, 130)
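BeautifulSoup's `find_all` returns an empty list rather than `None` when nothing matches, which is why the genre placeholder is applied after joining instead of in the `else` branch. The fallback logic can be sketched on plain Python values as below. `genre_text` and `first_year` are hypothetical helper names, and the `"2019-2020"` year format is an assumption about what the page shows.

```python
def genre_text(genres, default="No genre specified"):
    """Join genre names with commas, or return the placeholder when the list is empty."""
    return ", ".join(genres) or default

def first_year(text, default="No year specified"):
    """Return the first four characters of a year string, or the placeholder when it is missing."""
    return text[:4] if text else default

print(genre_text(["Comedy", "Drama", "Romance"]))  # Comedy, Drama, Romance
print(genre_text([]))                              # No genre specified
print(first_year("2019-2020"))                     # 2019
```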

## Creating the IMDb dataframe

In [13]:
imdb_df = pd.DataFrame()
imdb_df['Actor'] = series_actor
imdb_df['Year'] = release_year
imdb_df['Film Title'] = series_title
imdb_df['Also Known As'] = also_known_as
imdb_df['Genre'] = genre

imdb_df['IMDb Rating'] = imdb_rating
imdb_df['IMDb Summary'] = series_summary

pd.set_option('display.max_rows', None)

imdb_df

| | Actor | Year | Film Title | Also Known As | Genre | IMDb Rating | IMDb Summary |
|---|---|---|---|---|---|---|---|
| 0 | Kim Soo-hyun | 2021 | One Ordinary Day | That Night | Crime, Mystery | No IMDb rating | A remake of the BBC drama "Criminal Justice" t... |
| 1 | Kim Soo-hyun | 2020 | It's Okay to Not Be Okay | Psycho But It's Okay | Comedy, Drama, Romance | 8.7/10 | An extraordinary road to emotional healing ope... |
| 2 | Kim Soo-hyun | 2019 | Crash Landing on You | Love's Emergency Landing | Adventure, Comedy, Romance | 8.7/10 | The absolute top secret love story of a chaebo... |
| 3 | Kim Soo-hyun | 2019 | Hotel Del Luna | Hotel Delluna | Drama, Fantasy, Horror | 8.2/10 | When he's invited to manage a hotel for dead s... |
| 4 | Kim Soo-hyun | 2015 | Peurodyusa | Producer | Comedy, Drama, Romance | 7.3/10 | A group of young television producers--Ra Joon... |
| 5 | Kim Soo-hyun | 2013 | My Love from Another Star | You Came from the Stars | Comedy, Drama, Fantasy | 8.2/10 | Do Min-Joon, an alien that came to our planet ... |
| 6 | Kim Soo-hyun | 2012 | The 3rd Hospital | 第3病院 | Drama, Romance | 8.1/10 | Revolves on the competition between western an... |
| 7 | Kim Soo-hyun | 2012 | Haereul poomeun dal | Moon Embracing the Sun | Drama, Fantasy, Romance | 8.0/10 | The story of the secret love between Lee Hwon,... |
| 8 | Kim Soo-hyun | 2011 | Dream High | Deurim hai | Comedy, Music, Romance | 7.6/10 | Dream High tells the story of six students at ... |
| 9 | Kim Soo-hyun | 2010 | Giant | ジャイアント | Action, Drama, Romance | 8.4/10 | This drama tells the story of three siblings w... |


## Scraping the series images

I want to get the images for the unique series in the data. I'll use these images in my Tableau visualization.

In [14]:
#get the unique series and url
unique_series=[]
for i in series_title:
    if i not in unique_series:
        unique_series.append(i)

unique_url=[]
for i in series_url:
    if i not in unique_url:
        unique_url.append(i)

In [15]:
#check if they are the same length
len(unique_series),len(unique_url)

(121, 121)

In [16]:
#check if the series name corresponds with the url
unique_series[50],unique_url[50]

('Mackerel Run', 'https://www.imdb.com//title/tt4193068/')
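Deduplicating titles and URLs in two separate loops keeps them aligned only if every title maps to exactly one URL. A hedged alternative is to deduplicate the (title, URL) pairs together, keyed on the URL, so the two lists cannot drift apart. `unique_pairs` is a hypothetical helper and the sample values are made up.

```python
def unique_pairs(titles, urls):
    """Deduplicate (title, url) pairs by url, keeping first-seen order."""
    seen = {}  # dicts preserve insertion order
    for title, url in zip(titles, urls):
        seen.setdefault(url, title)
    return list(seen.values()), list(seen.keys())

titles = ["A", "B", "A", "C"]
urls = ["u1", "u2", "u1", "u3"]
unique_titles, unique_links = unique_pairs(titles, urls)
print(unique_titles)  # ['A', 'B', 'C']
print(unique_links)   # ['u1', 'u2', 'u3']
```

With this approach, two different series that happen to share a title each keep their own entry.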

On each unique series page, the series image links to another IMDb page. From that page, I will get the actual image link.

In [17]:
image_imdb_url = []

for i in range(len(unique_url)):
  request_7 = requests.get(unique_url[i])
  seventh_soup = bs(request_7.content, "html.parser")
  imdb_url = seventh_soup.find("div",attrs={"data-testid":"hero-media__poster"})
  if imdb_url is not None:
    if imdb_url.find('a') in imdb_url:
      imdb_url_1 = imdb_url.find('a')['href']
      image_imdb_url.append(imdb_url_1)
    #not all of these pages contain a series image, so for such series I store the series index in place of an image url
    else:
      image_imdb_url.append(i)
  else:
    image_imdb_url.append(i)

In [18]:
image_imdb_url

['/title/tt14170016/mediaviewer/rm1566766593/?ref_=tt_ov_i',
 '/title/tt12451520/mediaviewer/rm3671504129/?ref_=tt_ov_i',
 '/title/tt10850932/mediaviewer/rm3056777473/?ref_=tt_ov_i',
 '/title/tt10220588/mediaviewer/rm2995232513/?ref_=tt_ov_i',
 '/title/tt4612922/mediaviewer/rm1327992321/?ref_=tt_ov_i',
 '/title/tt3469052/mediaviewer/rm853926912/?ref_=tt_ov_i',
 '/title/tt2182614/mediaviewer/rm2844922880/?ref_=tt_ov_i',
 '/title/tt3143378/mediaviewer/rm1340466176/?ref_=tt_ov_i',
 '/title/tt1996607/mediaviewer/rm4127510273/?ref_=tt_ov_i',
 '/title/tt2693442/mediaviewer/rm3159587584/?ref_=tt_ov_i',
 '/title/tt2720634/mediaviewer/rm2553307648/?ref_=tt_ov_i',
 '/title/tt6289274/mediaviewer/rm1312728065/?ref_=tt_ov_i',
 '/title/tt8591092/mediaviewer/rm3189604096/?ref_=tt_ov_i',
 '/title/tt5189944/mediaviewer/rm1944511488/?ref_=tt_ov_i',
 '/title/tt4686292/mediaviewer/rm2083417088/?ref_=tt_ov_i',
 '/title/tt10777760/mediaviewer/rm3174161665/?ref_=tt_ov_i',
 '/title/tt3184674/mediaviewer/rm333

In [19]:
# Convert all items in the above list to strings. This is because I stored the index for series that did not have an image url.
# I also have to do this to avoid getting a type error in the next part of my code.
image_imdb_url_1 = list(map(str, image_imdb_url))
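Another way to mark a missing image, shown here only as a sketch and not as what this notebook does, is to store `None` instead of the index. The list then needs no string conversion, and the later checks can test `is not None` rather than the length of the string. `to_absolute` is a hypothetical helper and the IDs in the sample are made up.

```python
def to_absolute(paths, base="https://www.imdb.com"):
    """Prefix real paths with the base URL and keep None for series without a poster."""
    return [base + p if p is not None else None for p in paths]

viewer_paths = ["/title/tt0000001/mediaviewer/rm0000001/", None]  # made-up IDs
full_urls = to_absolute(viewer_paths)
print(full_urls)
```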

In [20]:
image_url = []
for i in range(len(image_imdb_url_1)):
  if len(image_imdb_url_1[i]) > 3:
    base_url_1 = 'https://www.imdb.com'
    full_url_1 = base_url_1 + image_imdb_url_1[i]
    request_8 = requests.get(full_url_1)
    eighth_soup = bs(request_8.content, "html.parser")
    img_1 = eighth_soup.find(True, {"class":["MediaViewerImagestyles__PortraitContainer-sc-1qk433p-2 iUyzNI", 
                                                "MediaViewerImagestyles__LandscapeContainer-sc-1qk433p-3 kXRNYt"]})

    if img_1 is not None:
      img_2 = img_1.find('img')['src']
      image_url.append(img_2)
    else:
      image_url.append(str(i))

  else:
    image_url.append(str(i))

In [21]:
image_url

['https://m.media-amazon.com/images/M/MV5BZmYwYjc2MmYtYmRjYy00ZmI0LTkyODctNzdlYjYxYWFjMmVhXkEyXkFqcGdeQXVyNDY5MjMyNTg@._V1_.jpg',
 'https://m.media-amazon.com/images/M/MV5BYTk0Nzk5ZWYtYTNlZi00YjBjLWJhYjctMWMwMmYyMDA5ZjJmXkEyXkFqcGdeQXVyNDY5MjMyNTg@._V1_.jpg',
 'https://m.media-amazon.com/images/M/MV5BMzRiZWUyN2YtNDI4YS00NTg2LTg0OTgtMGI2ZjU4ODQ4Yjk3XkEyXkFqcGdeQXVyNTI5NjIyMw@@._V1_.jpg',
 'https://m.media-amazon.com/images/M/MV5BNzQ2MzQzNDktMTg4ZC00ZDE0LThhNmUtYWMxYmI3OTIzYzZlXkEyXkFqcGdeQXVyMzE4MDkyNTA@._V1_.jpg',
 'https://m.media-amazon.com/images/M/MV5BYjM1NGI1Y2MtZDg3NS00MjNmLWJjMTYtZmE3ZTk1YzgyODhmXkEyXkFqcGdeQXVyNDY5MjMyNTg@._V1_.jpg',
 'https://m.media-amazon.com/images/M/MV5BYTA2MTZhMmQtN2VmOC00NTc4LWE2ZWQtNzc5ZDk5YWE4NDdlXkEyXkFqcGdeQXVyMzE4MDkyNTA@._V1_.jpg',
 'https://m.media-amazon.com/images/M/MV5BMmJlMmU2NmItOGQyOC00ZDNhLWJkNGYtZjU4ZTgwNzA0NjBkXkEyXkFqcGdeQXVyMzE4MDkyNTA@._V1_.jpg',
 'https://m.media-amazon.com/images/M/MV5BMDBkZjE2NzMtZWNjMC00NmE4LTkyNTQtNzFhNzk2Mjg0NmE5

In [22]:
len(image_url),len(unique_series)

(121, 121)

In [27]:
#download each image and keep a count of the total images downloaded
img_count = 0
save_path = '/content/series_images' #path or folder where images should be downloaded to
os.makedirs(save_path, exist_ok=True) #create the folder if it does not already exist
for i in range(len(image_url)):
  if len(image_url[i]) > 3:
    file_name = unique_series[i] + ".jpg" #using each series name to save its corresponding image
    image_name = os.path.join(save_path, file_name)

    downloaded_image = requests.get(image_url[i]).content
    with open(image_name, "wb") as f:
      f.write(downloaded_image)
      img_count += 1

print(str(img_count) + " images scraped successfully, \nCheck this folder for your images: '%s' "%save_path)


119 images scraped successfully, 
Check this folder for your images: '/content/series_images' 
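Because each file is named after its series, a title containing a character such as `/` would be read as part of the path. A minimal sketch of a sanitising step, assuming it is acceptable to replace such characters with underscores; `safe_filename` is a hypothetical helper and the sample title is made up.

```python
import os
import re

def safe_filename(title, extension=".jpg"):
    """Replace characters that are unsafe in file names with underscores and add the extension."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return cleaned + extension

print(safe_filename("Who/Are: You?"))  # Who_Are_ You_.jpg
print(os.path.join("series_images", safe_filename("Giant")))
```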


## Downloading the data tables

In [28]:
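os.makedirs('/content/series_tables', exist_ok=True) # create the output folder if it does not already exist
actors_df.to_csv(r'/content/series_tables/actors_data.csv', index = False, header = True)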
imdb_df.to_csv(r'/content/series_tables/imdb_data.csv', index = False, header = True)
print('Files saved successfully')

Files saved successfully


In [None]:
#zip the folders so they can be downloaded from Google Colab to my laptop
import shutil
shutil.make_archive('series_images','zip','series_images')
shutil.make_archive('series_tables','zip','series_tables')
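`shutil.make_archive` returns the path of the archive it creates, which is a quick way to confirm that the zip file exists. The sketch below runs in a temporary directory with a made-up CSV file, so it does not touch the real `series_images` or `series_tables` folders.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Build a small stand-in folder with one file in it
    folder = os.path.join(workdir, "series_tables")
    os.makedirs(folder)
    with open(os.path.join(folder, "example.csv"), "w") as f:
        f.write("Actor,Age\n")

    # base_name is given without the extension; make_archive appends '.zip'
    archive_path = shutil.make_archive(os.path.join(workdir, "series_tables"), "zip", folder)
    archive_name = os.path.basename(archive_path)
    archive_exists = os.path.exists(archive_path)

print(archive_name, archive_exists)
```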

## Link to Other Pages

### Thank you for taking the time to study my notebook.
### You can use the links below to view my other pages.

[Medium](https://oluwafunmilayo.medium.com/)

[Tableau](https://public.tableau.com/profile/oluwafunmilayo8574)

[LinkedIn Page](https://www.linkedin.com/in/oluwafunmilayo-sofuwa-79454390/)