# Data Gathering
This project is part of my final thesis, in which I build a content-based K-drama recommendation website.
This notebook scrapes all dramas from the website Koreandrama.org.


It can be challenging to wrap your head around a long block of HTML code. To make it easier to read, you can use an HTML formatter to clean it up automatically.
https://webformatter.com/html

### Import necessary libraries

In [24]:
from requests import get
from bs4 import BeautifulSoup
from warnings import warn
from time import sleep
from random import randint
import numpy as np, pandas as pd
import seaborn as sns
import requests


Links to some tutorials:
    https://www.freecodecamp.org/news/web-scraping-sci-fi-movies-from-imdb-with-python/
    https://scrapeops.io/web-scraping-playbook/403-forbidden-error-web-scraping/#:~:text=The%20solution%20to%20this%20problem,scraper%20or%20a%20real%20user
    https://realpython.com/beautiful-soup-web-scraper-python/#find-elements-by-id

### Gathering links from the main page

In [25]:
pages = np.arange(1, 131)
# the drama list on the website spans 130 pages

Also, to avoid having our IP address blocked, do not send many requests to the site within a short period of time.
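
One way to keep the request rate low is to put the pause into a small helper that sleeps for a random interval between downloads. The sketch below is only an illustration: `throttled_fetch`, its parameters, and the `fetch` callable are hypothetical names that do not appear in this notebook, and the 8 to 15 second default simply mirrors the range used in the scraping loop.

```python
import random
import time


def throttled_fetch(urls, fetch, min_delay=8, max_delay=15, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random time between requests.

    `fetch` is any callable that downloads one URL (for example a thin
    wrapper around requests.get). It is passed in so the pacing logic can
    be tried without touching the network.
    """
    results = {}
    for position, url in enumerate(urls):
        if position > 0:
            # wait a random number of seconds so requests are not evenly spaced
            sleep(random.uniform(min_delay, max_delay))
        results[url] = fetch(url)
    return results


# usage with a stand-in fetch function and no real waiting (hypothetical URLs)
demo_urls = ["page/1", "page/2", "page/3"]
fetched = throttled_fetch(
    demo_urls,
    fetch=lambda url: "html of " + url,
    sleep=lambda seconds: None,
)
print(fetched["page/2"])
```

Passing `sleep` as a parameter is a design choice for testing only; in a real run the default `time.sleep` would be used.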

In [26]:
# run started at 12.34
import random

# rotate the User-Agent header to prevent a 403 error
user_agents_list = [
    'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'
]

page_links = set()
for page in pages:
    # get request
    response = get("https://www.koreandrama.org/list/korean-dramas/page/" + str(page), headers={'User-Agent': random.choice(user_agents_list)})

    # pause between requests
    sleep(randint(8, 15))

    # throw warning for status codes that are not 200
    if response.status_code != 200:
        warn('Request: {}; Status code: {}'.format(page, response.status_code))

    # parse the content of the current page
    soup = BeautifulSoup(response.text, "html.parser")

    # extract all the links that follow the listing block
    tag = soup.find(class_="brxe-tobqlw brxe-block")

    for link in tag.find_all_next("a"):
        url = link.get("href", "")
        if "https://www.koreandrama.org/" in url:
            page_links.add(url)

# drop the pagination links and the listing page itself
for item in page_links.copy():
    if "https://www.koreandrama.org/list/korean-dramas/page" in item:
        page_links.discard(item)
page_links.discard("https://www.koreandrama.org/list/korean-dramas/")
# the run took about 27 min
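
The filtering step at the end of the cell can also be written as a single set comprehension. This is a minimal sketch, not the code that produced the results here: `keep_drama_links` is a hypothetical helper, the example URLs are made up, and it uses `startswith` where the cell uses a substring test.

```python
def keep_drama_links(links, base="https://www.koreandrama.org/"):
    """Return only links on the site that are not listing or pagination pages."""
    listing_prefix = base + "list/korean-dramas/"
    return {
        link
        for link in links
        if link.startswith(base) and not link.startswith(listing_prefix)
    }


# made-up links for illustration
found = {
    "https://www.koreandrama.org/example-drama/",
    "https://www.koreandrama.org/list/korean-dramas/",
    "https://www.koreandrama.org/list/korean-dramas/page/2",
    "https://example.com/elsewhere/",
}
print(keep_drama_links(found))
```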
   
  

### Scraping each page for complete information 

In [28]:
#compiling 
user_agents_list = [
        'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'
    ]

import json
list_of_lists=[]

for link in page_links:
    r_page = requests.get(link, headers={'User-Agent': random.choice(user_agents_list)})

    # parse the JSON-LD data embedded in an HTML <script> tag using the json library
    soup2 = BeautifulSoup(r_page.text, "html.parser")
    data = json.loads(soup2.find('script', type='application/ld+json').text)
    
    list_of_features=["name","image","@type","description","datePublished"]
    list_of_information=[]
    # collect the basic features; each value is wrapped in a one-element list
    for feature in list_of_features:
        l=[]
        l.append(data["@graph"][0].get(feature))
        list_of_information.append(l)
    # collect all genres into a list
    genres = list(data["@graph"][0]["genre"])
    tags_list=[]

    # collect all tags into a list
    for i in soup2.find(id="brxe-hhddhx").find_all("a"):
        tags_list.append(i.text)

    #collecting extra information and saving it into dictionary
    extra_info=dict()
    for i in soup2.find(id="brxe-qefiit").find_all(class_="jet-listing-dynamic-field__content").copy():
        title = i.b.text
        unwanted = i.find('b')
        unwanted.extract()
        extra_info[title]=i.text
    #collecting the cast (list of actors)
    list_of_actors=[]
    for i in soup2.find_all(class_="jet-listing-dynamic-link__label"):
        list_of_actors.append(i.text)

    list_of_information.append(tags_list)
    list_of_information.append(genres)
    list_of_information.append(data["@graph"][0]["aggregateRating"]["ratingValue"])
    list_of_information.append(extra_info)
    list_of_information.append(list_of_actors)
    
    list_of_lists.append(list_of_information)



In [30]:
len(list_of_lists)

2968

Each inner list holds, in order: the main features, tags, genres, rating, extra information, and cast.
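
To make the JSON-LD step easier to try without the website, here is a small self-contained sketch. It is an illustration under assumptions: it swaps BeautifulSoup for the standard library `html.parser` so it runs without extra packages, `JsonLdExtractor` is a hypothetical helper, and `sample_html` is an invented fragment shaped like the `"@graph"` structure read in the scraping cell.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._inside:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


# invented page fragment, not copied from the real site
sample_html = """
<html><head>
<script type="application/ld+json">
{"@graph": [{"@type": "TVSeries", "name": "Example Drama",
             "genre": ["Drama", "Comedy"],
             "aggregateRating": {"ratingValue": "8.1"}}]}
</script>
</head><body></body></html>
"""

parser = JsonLdExtractor()
parser.feed(sample_html)
record = json.loads(parser.blocks[0])["@graph"][0]

# .get with a default avoids a KeyError when a page lacks a field
title = record.get("name")
genres = record.get("genre", [])
rating = record.get("aggregateRating", {}).get("ratingValue")
print(title, genres, rating)
```

Using `.get` with defaults for `genre` and `aggregateRating` is a defensive choice for pages that might omit those fields; whether such pages exist on the site is not confirmed here.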

### Creation of Dataframe from list of lists

In [31]:
import pandas as pd 

In [32]:

data_df = pd.DataFrame(list_of_lists, columns = ["title","image_URL","@type","description","datePublished","tags","genre","rating","extra_info","cast"])
 
# print dataframe.


In [34]:
data_df["extra_info"][3]
#this is a dictionary containing extra information

{'Director:': ' Kim Yong Kyu',
 'Aired on:': ' Feb 2, 1998',
 'Total Episodes:': ' 16',
 'Network:': ' KBS2',
 'Duration:': ' 50 min.',
 'Year:': ' 1998'}

### Data Preprocessing

As we can see, each item in a row is wrapped in brackets (a one-element list), so let's remove them first.

In [35]:
data_copy=data_df.copy()


First, convert the "datePublished" column to datetime format.

In [36]:
data_copy['datePublished'] = data_copy['datePublished'].str[0]

In [37]:
# replace spaces with hyphens and drop commas in every date string
data_copy["datePublished"] = (
    data_copy["datePublished"]
    .str.replace(" ", "-", regex=False)
    .str.replace(",", "", regex=False)
)

In [38]:

data_copy['datePublished'] = pd.to_datetime(data_copy['datePublished'], format='%B-%d-%Y')
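
As a side note, the same kind of date string can be parsed without rewriting it first. The sketch below assumes the raw values look like `January 2, 2022` (inferred from the replacements and the `'%B-%d-%Y'` format, not shown directly in this notebook); `parse_published` is a hypothetical helper built on the standard library `datetime.strptime`.

```python
from datetime import datetime


def parse_published(text):
    """Parse a date written like 'January 2, 2022'; return None if it does not match."""
    try:
        return datetime.strptime(text.strip(), "%B %d, %Y").date()
    except ValueError:
        # the string is not in the assumed "Month day, year" form
        return None


print(parse_published("January 2, 2022"))
print(parse_published("not a date"))
```

Returning `None` instead of raising is one possible choice; it makes unparsable rows easy to count afterwards.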


In [39]:

# unwrap the one-element lists, then convert the values to strings
data_copy["title"]=data_copy["title"].str[0]
data_copy["image_URL"]=data_copy["image_URL"].str[0]
data_copy["@type"]=data_copy["@type"].str[0]
data_copy["description"]=data_copy["description"].str[0]

data_copy["title"]=data_copy["title"].astype('str')
data_copy["image_URL"]=data_copy["image_URL"].astype('str')
data_copy["@type"]=data_copy["@type"].astype('string')
data_copy["description"]=data_copy["description"].astype('str')


# and rating to float
data_copy["rating"]=data_copy["rating"].astype('float')



To create new features in our dataset, let's find all unique keys in the dictionaries.

In [40]:
extra_info_features = set()
for row in data_copy["extra_info"]:
    # iterating over a dictionary yields its keys
    for key in row:
        extra_info_features.add(key)
extra_info_features

{'Aired on:',
 'Director & Screenwriter:',
 'Director:',
 'Duration:',
 'Network:',
 'Release Date:',
 'Screenwriter:',
 'Total Episodes:',
 'Year:'}
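
The keys end with a colon and the values start with a space, so they need some tidying before they become column names and cell values. A minimal sketch of one way to do that, using the dictionary printed earlier in this notebook as input; `flatten_extra_info` is a hypothetical helper and is not used in the cells that follow.

```python
def flatten_extra_info(info):
    """Turn {'Director:': ' Kim Yong Kyu'} into {'director': 'Kim Yong Kyu'}."""
    flat = {}
    for label, value in info.items():
        # drop the trailing colon, lowercase, and use underscores for spaces
        column = label.rstrip(":").strip().lower().replace(" ", "_")
        # remove the leading space that the scraped values carry
        flat[column] = value.strip()
    return flat


row = {
    "Director:": " Kim Yong Kyu",
    "Aired on:": " Feb 2, 1998",
    "Total Episodes:": " 16",
    "Network:": " KBS2",
    "Duration:": " 50 min.",
    "Year:": " 1998",
}
print(flatten_extra_info(row))
```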

In [41]:
# map each new column to its key in the extra_info dictionaries
extra_info_keys = {
    "director": "Director:",
    "year": "Year:",
    "duration": "Duration:",
    "network": "Network:",
    "screenwriter": "Screenwriter:",
    "total_episodes": "Total Episodes:",
}

# .get returns None for a missing key and, unlike setdefault, leaves the dictionary unchanged;
# assigning a whole column at once also avoids the SettingWithCopyWarning raised by chained indexing
for column, key in extra_info_keys.items():
    data_copy[column] = [row.get(key) for row in data_copy["extra_info"]]
data_copy.head()

| | title | image_URL | @type | description | datePublished | tags | genre | rating | extra_info | cast | director | year | duration | network | screenwriter | total_episodes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A Gentleman's Dignity | https://www.koreandrama.org/wp-content/uploads... | TVSeries | Four men in their forties go through love, bre... | 2022-01-02 | [Adult Romance, Age Gap [Drama Life], Age Gap ... | [Comedy] | 8.3 | {'Director:': ' Kim Jung Hyun, Shin Woo Cheol,... | [Yoon Jin Yi, Kim Soo Ro, Yoon Se Ah, Kim Min ... | Kim Jung Hyun, Shin Woo Cheol, Kwon Hyuk Chan | 2012 | 1 hr. 5 min. | SBS | Kim Eun Sook | 20 |
| 1 | Kiss Sixth Sense | https://www.koreandrama.org/wp-content/uploads... | TVSeries | Hong Ye Sool, the best account executive on Pl... | 2021-08-26 | [Boss-Employee Relationship, Ex-Boyfriend Come... | [Drama] | 8.1 | {'Director:': ' Nam Ki Hoon', 'Screenwriter:':... | [Yoon Kye Sang, Kim Ji Suk, Seo Ji Hye, Jin Su... | Nam Ki Hoon | 2022 | 1 hr. 10 min. | | Jeon Yoo Ri | 12 |
| 2 | Melancholia | https://www.koreandrama.org/wp-content/uploads... | TVSeries | A sexual scandal between a math teacher and a ... | 2021-07-25 | [Age Gap [Drama Life], Childhood Trauma, Corru... | [Drama] | 7.7 | {'Director:': ' Kim Sang Hyub', 'Screenwriter:... | [Choi Dae Hoon, Lee Do Hyun, Jin Kyung, Im Soo... | Kim Sang Hyub | 2021 | 1 hr. 10 min. | tvN | Kim Ji Woon | 16 |
| 3 | The Barefoot Youth | https://www.koreandrama.org/wp-content/uploads... | TVSeries | Tales of the hardships and love. | 2022-01-16 | [] | [Drama] | 7.2 | {'Director:': ' Kim Yong Kyu', 'Aired on:': ' ... | [Park Geun Hyung, Go So Young, Bae Yong Joon, ... | Kim Yong Kyu | 1998 | 50 min. | KBS2 | | 16 |
| 4 | 8 Love Stories | https://www.koreandrama.org/wp-content/uploads... | TVSeries | \<p>The drama describes 8 different love storie... | 2021-07-05 | [Omnibus] | [Drama] | 8.0 | {'Director:': ' Kim Jong Hyeok, Lee Kang Hoon'... | [Shin Sung Woo, Lee Byung Hun, Park Sang Ah, K... | Kim Jong Hyeok, Lee Kang Hoon | 1999 | 60 min. | SBS | Song Ji Na | 16 |


In [43]:
data_copy.isna().sum()

title                0
image_URL            0
@type                0
description          0
datePublished        0
tags                 0
genre                0
rating               0
extra_info           0
cast                 0
director          1004
year                39
duration           318
network            431
screenwriter      1121
total_episodes     157
dtype: int64

In another notebook, I'll clean and preprocess the data to make it ready for training.

### Save data from lists to CSV files

Some data cleaning and normalization still remains to be done, but let's save the data for now.

In [42]:
data_copy.to_csv("drama_data.csv")

Korean movies and TV shows were collected from the same website in the same way. Together with the dramas, the collected data comes to 7473 unique rows.

### Merging all three datasets

In [1]:
import pandas as pd 

In [10]:
shows_df=pd.read_csv("data/shows_data.csv")
movies_df=pd.read_csv("data/movie_data.csv")
dramas_df=pd.read_csv("data/drama_data.csv")

In [11]:
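# ignore_index=True gives the merged frame a fresh 0..n-1 index instead of repeating each file's index
entire_df = pd.concat([shows_df, movies_df, dramas_df], axis=0, ignore_index=True)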

In [13]:
len(entire_df)

7473