<a href="https://colab.research.google.com/github/Pedro-Grajau/data_analyze_webtoons/blob/main/webtoon_scrap.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

* Installing kora, a library that makes it easier to use tools such as Selenium inside Google Colab
* Library link: https://pypi.org/project/kora/



In [1]:
!pip install kora -q

* **Loading a dataset of authors obtained from this link: https://www.webtoons.com/en/notice/detail?noticeNo=1609**

In [13]:
#Read one author name per line; the with block closes the file afterwards
with open("autores_webtoons.txt", "r") as autores_txt:
  autores_webtoons = [x.strip() for x in autores_txt]

* **Helper functions for some inputs that are hard to handle in their "normal" state**

In [12]:
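#Handling a few quirks of the scraped values:

def new_number(number):
  #Convert a scraped counter such as "Like", "1.2M" or "12,345" to an int
  number = number.strip()
  if number == 'Like':
    #Only the label is shown, so there is no count yet
    return 0
  if 'M' in number:
    if len(number) > 1:
      #round() avoids float truncation such as 56.1M -> 56099999
      return round(float(number.replace('M', '')) * 1000000)
    return 1000000
  #Drop thousands separators; also covers values with no comma or several commas
  return int(number.replace(',', ''))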

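def get_html(search_data):
  #Parse the response body; fail loudly instead of returning an error string
  #that the callers would then try to use as a soup object
  if search_data.status_code == 200:
    return bs(search_data.content,'lxml')
  raise RuntimeError(f'Request failed with status code {search_data.status_code}')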

def remove_symbols(comic):
  #Replace characters that would break the URL path with the placeholder "a"
  for symbol in "?/#":
    comic = comic.replace(symbol, "a")
  return comic
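
As an alternative to swapping reserved characters for a placeholder letter, the title could be percent-encoded with the standard library. This is only a sketch: `slug_for_url` is a hypothetical helper that is not part of the notebook, and whether webtoons.com resolves a percent-encoded title segment the same way is an assumption that was not checked here.

```python
from urllib.parse import quote

def slug_for_url(comic):
    # Hypothetical helper: percent-encode every reserved character
    # ("?", "/", "#", spaces, ...) so the title fits in one URL path segment.
    return quote(comic, safe="")

# Illustrative title, not taken from the scraped data
print(slug_for_url("What? A/B #1"))
```

Unlike the placeholder approach, the encoding is reversible, so the original title can be recovered with `urllib.parse.unquote`.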

* **Function that returns all the webtoons found for a given author name.**



In [14]:
import requests
from bs4 import BeautifulSoup as bs

def author_comics(author):
  titles = list()
  #Let requests URL-encode the author name (spaces, accents, "&", ...)
  search_url = "https://www.webtoons.com/en/search"
  search_data = requests.get(search_url, params={"keyword": author})

  #Check for 200 status code
  soup = get_html(search_data)
  
  #Scrape the result list: titles, counters and links
  search_result = soup.find("div",class_="challenge_lst")
  all_results = search_result.find_all(class_="subj")
  a_links = search_result.find_all(class_="grade_num")
  title_data = search_result.find_all("a", class_="challenge_item")

  #Inserting into a list of results
  comics = [item.get_text().strip() for item in all_results]
  views = [new_number(item.get_text()) for item in a_links]

  for item in title_data:
    title = item.get("href")
    title_number = int(title.split("=")[1])
    titles.append(title_number)
  
  return titles, views, comics
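
The `title.split("=")[1]` step above assumes that `title_no` is the only query parameter in the link. A slightly more defensive sketch reads the parameter by name with `urllib.parse`. The helper name `title_number_from_href` and the example URL are hypothetical and only serve as illustration.

```python
from urllib.parse import urlparse, parse_qs

def title_number_from_href(href):
    # Hypothetical helper: look up title_no by name in the query string,
    # so extra parameters before or after it do not break the parsing.
    query = parse_qs(urlparse(href).query)
    return int(query["title_no"][0])

# Illustrative URL, not a real webtoon
href = "https://www.webtoons.com/en/challenge/example/list?title_no=12345&page=2"
print(title_number_from_href(href))
```

A link without `title_no` raises `KeyError` here, which makes an unexpected page layout visible right away.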

* **This is the code responsible for collecting all the desired data based on the webtoon author's name. The data are: Name, Title, Subscribers, Views, Rating, Genres and Release Date**

In [19]:
# Search all works by author (DONE)
# Pick the work with the most views (DONE)

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from kora.selenium import wd

authors = list()

for author in autores_webtoons:
  titles, views, comics = author_comics(author)

  #Pick the author's most-viewed webtoon and its title id
  max_index = views.index(max(views))
  comic = comics[max_index]
  title_number = titles[max_index]

  #remove_symbols already leaves titles without "?", "/" or "#" unchanged
  comic = remove_symbols(comic)

  comic_url = f"https://www.webtoons.com/en/challenge/{comic}/list?title_no={str(title_number)}"
  search_data = requests.get(comic_url)

  #Check for 200 status code
  soup = get_html(search_data)

  #Title Name
  title_name = soup.find("h3" ,class_="subj _challengeTitle")
  title_name = " ".join(title_name.get_text().split()[0:-1])

  #Subs, views and rates
  attributes = soup.find_all("em" ,class_="cnt")
  subs, views, rate = [i.get_text() for i in attributes]

  #genres
  genres = soup.find_all("p" ,class_="genre")
  genres = "|".join([i.get_text() for i in genres])

  #Author
  author = soup.find("span" ,class_="author").get_text()[:-11]

  #Patrons
  #patrons_number = soup.find("em" ,id="patronCount").get_text()
  
  #Finding Date of first chapter
  first_episode_button = soup.find("a" ,id = "_btnEpisode")
  first_episode_link = first_episode_button.get("href")
  wd.get(first_episode_link)

  #Some webtoons have adult content and show an alert that must be accepted
  #before the scraping can continue
  try:
    WebDriverWait(wd, 3).until(EC.alert_is_present(),
                               'Timed out waiting for the adult content alert.')
    wd.switch_to.alert.accept()
  except TimeoutException:
    #No alert showed up within 3 seconds, so the page is already accessible
    pass

  #Date shown on the first episode page, or "no_date" when it is missing
  try:
    #"class name" is the locator string behind By.CLASS_NAME; the older
    #find_element_by_class_name helper was removed in recent Selenium releases
    date = wd.find_element("class name", "u_cbox_date").text
  except Exception:
    date = "no_date"
  authors.append([author, title_name, subs, views, rate, genres, date])
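
The loop above picks each author's most-viewed title with `views.index(max(views))`, which raises `ValueError` when a search returns no results. A minimal sketch of a guard is shown below. `most_viewed` is a hypothetical helper, and skipping authors without results is an assumption about the desired behavior.

```python
def most_viewed(titles, views, comics):
    # Hypothetical helper: return (title_number, comic) for the entry with
    # the highest view count, or None when the search returned nothing.
    if not views:
        return None
    # max() keeps the first index on ties, like views.index(max(views))
    best = max(range(len(views)), key=lambda i: views[i])
    return titles[best], comics[best]

# Illustrative values, not scraped data
print(most_viewed([101, 102, 103], [10, 50, 50], ["a", "b", "c"]))
print(most_viewed([], [], []))
```

Inside the loop, a `None` result could be handled with `continue` so that one author without search results does not stop the whole run.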

* **Writing the scraped data to a CSV file for later visual data analysis**



In [20]:
import csv

header = ["Author", "Webtoon", "Subscribed", "View", "Rate", "Genre", "Date"]
with open('webtoons.csv', 'w', encoding='UTF8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(authors)

* **Finally, viewing the DataFrame with the pandas library**

In [22]:
import pandas as pd

data = pd.read_csv('webtoons.csv')

data.head(20)

Unnamed: 0,Author,Webtoon,Subscribed,View,Rate,Genre,Date
0,Beimuyo,YOU ARE MY BFF (LGBTQ+),252.5K,56.1M,9.53,Drama|Romance,"Nov 6, 2019"
1,cathy octo /erry,Maid for Hire,205.3K,13.9M,9.72,Romance|Comedy,"Sep 3, 2020"
2,cheruke,Domestic Beast,100.7K,8.4M,9.8,Fantasy|Slice of life,"Nov 10, 2020"
3,EmAuthor,Papa Ai,86.4K,6.7M,9.75,Comedy|Slice of life,"Sep 13, 2019"
4,Fawnduu,My Dragon Girlfriend,262.8K,90.6M,9.49,Romance|Slice of life,"Mar 21, 2018"
5,Fuzzzzyy,"Out of Sight, Out of Body",188.7K,19.4M,9.8,Romance|Supernatural,"Apr 17, 2020"
6,Kirinu,Love n Life,672.9K,70.1M,9.28,Romance|Fantasy,"Jun 28, 2018"
7,loonytwin,EYES ON ME,920.3K,225.9M,9.47,Romance|Drama,"Dec 9, 2017"
8,Merryweatherey,Clinic of Horrors,725K,54.7M,9.72,Comedy|Horror,"Mar 7, 2019"
9,misterrico,Scripted Love,157.8K,42.3M,9.68,Comedy|Romance,"Jan 8, 2020"
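
The `Subscribed` and `View` columns above are stored as abbreviated strings such as `252.5K` and `56.1M`, and `Date` as text such as `Nov 6, 2019`, so they need converting before any numeric analysis. The sketch below shows one way to do that in plain Python. `parse_count` and `parse_date` are hypothetical helpers, the `B` suffix is an assumption that does not appear in the data shown, and the date format assumes English month abbreviations.

```python
from datetime import datetime

# Multipliers for the abbreviated suffixes ("B" is assumed, not seen above)
MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

def parse_count(text):
    # Hypothetical helper: turn "252.5K", "56.1M" or "1,234" into an int
    text = text.strip().replace(",", "")
    suffix = text[-1].upper()
    if suffix in MULTIPLIERS:
        # round() avoids float artifacts such as 56.1 * 1e6 = 56100000.00000001
        return round(float(text[:-1]) * MULTIPLIERS[suffix])
    return int(text)

def parse_date(text):
    # Hypothetical helper: parse dates formatted like "Nov 6, 2019"
    return datetime.strptime(text.strip(), "%b %d, %Y").date()

print(parse_count("252.5K"), parse_count("56.1M"), parse_date("Nov 6, 2019"))
```

With pandas, the same helpers could be applied per column, for example `data["View"].map(parse_count)` and `data["Date"].map(parse_date)`, assuming every row follows one of these formats (rows with `no_date` would need to be filtered out first).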
