<a href="https://colab.research.google.com/github/Cseudave/automatic_tops/blob/main/Web_Scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping

## Anidb

AniDB is a site that stores information on almost every anime. It even has a section of weighted tags for each series. That data has to be pulled out of the page's HTML, so BeautifulSoup4 is needed.

In [None]:
!pip install BeautifulSoup4
!pip install cfscrape
!pip install cloudscraper

Cloudscraper is needed to get past the anti-bot protections that some sites use.

In [None]:
# Import libraries
from bs4 import BeautifulSoup
import cloudscraper
import requests
from cfscrape import create_scraper
import urllib.request

import random
import time 

import pandas as pd

In [None]:
# Fetch the HTML of a page and parse it with BeautifulSoup
def ingrediente(scraper, url):
  response = scraper.get(url)
  # Use BS4 to process the content
  soup = BeautifulSoup(response.text, 'html.parser')
  return soup

In [None]:
# Configure a scraper that presents itself as Chrome
scraper = cloudscraper.create_scraper(delay=10, browser='chrome') 

In [None]:
# Build the URLs for the first two pages of each 2022 season
# NOTE: `seasons` and `pages` are not defined in this notebook and must be set
# before running this cell; each `page` is inserted as a complete query fragment
links = []
year = 2022
for season in seasons:
  for page in pages:
    links.append('https://anidb.net/anime/?h=1&noalias=1&orderby.name=1.1&orderby.ucnt=0.2&' + str(page) +
                 '&season.month=' + str(season) + '&season.year=' + str(year) + '&view=list')
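
The same kind of URL can also be assembled from a dictionary of query parameters, which avoids placing `&` and `=` by hand. This is only a sketch: `build_list_url` is a hypothetical helper, the `page` parameter name is an assumption, and the values passed in the last line are placeholders rather than values taken from this notebook.

```python
from urllib.parse import urlencode

BASE_URL = 'https://anidb.net/anime/'

def build_list_url(page, season, year):
  """Build a listing URL from query parameters (hypothetical helper)."""
  params = {
      'h': 1,
      'noalias': 1,
      'orderby.name': '1.1',
      'orderby.ucnt': '0.2',
      'page': page,            # assumed parameter name
      'season.month': season,
      'season.year': year,
      'view': 'list',
  }
  # urlencode joins the pairs with '&' and escapes unsafe characters
  return BASE_URL + '?' + urlencode(params)

# Placeholder values, only to show the shape of the result
print(build_list_url(page=0, season=1, year=2022))
```

Keeping the parameters in one dictionary also makes it easier to change a single filter without touching the rest of the string.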

In [None]:
# From each page, take the names in the listing to build the URL of each anime
def anidb(db, link):
  sopa = ingrediente(scraper, link)
  # Each title cell holds both the name and the link of one anime
  cells = sopa.find_all('td', class_='name main anime')
  for cell in cells:
    anchor = cell.find('a')
    db[anchor.text] = 'https://anidb.net/' + anchor['href']
  return db
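
Concatenating the base address with an `href` yields a double slash whenever the `href` already starts with `/`. One way to sidestep that is the standard library's `urljoin`, sketched here; the sample paths are made up for illustration and are not real AniDB entries.

```python
from urllib.parse import urljoin

BASE = 'https://anidb.net/'

# Made-up hrefs: one starting with a slash, one without
for href in ['/anime/1', 'anime/1']:
  print(urljoin(BASE, href))
# Both iterations print https://anidb.net/anime/1
```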

In [None]:
# Dictionary that maps each anime name to its AniDB URL
db = {}
for link in links:
  db = anidb(db, link)
  # Pause for a random interval to avoid being blocked
  time.sleep(random.randint(2, 7))
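
If blocks still occur, a common refinement is to lengthen the pause after each failed attempt. The sketch below is not part of the original workflow: `fetch_with_backoff`, its parameters, and the injected `fetch` and `sleep` callables are hypothetical names, chosen so the idea can be tried without touching the network.

```python
import random
import time

def fetch_with_backoff(fetch, url, retries=3, base_delay=2.0, sleep=time.sleep):
  """Call fetch(url); after a failure wait base_delay * 2**attempt plus jitter."""
  for attempt in range(retries):
    try:
      return fetch(url)
    except Exception:
      if attempt == retries - 1:
        raise
      # Exponential wait with a random component so requests are not periodic
      sleep(base_delay * 2 ** attempt + random.uniform(0, 1))

# Demo with a fake fetcher that fails twice before succeeding
calls = []
def flaky(url):
  calls.append(url)
  if len(calls) < 3:
    raise RuntimeError('blocked')
  return 'ok'

waits = []  # collects the delays instead of actually sleeping
print(fetch_with_backoff(flaky, 'https://example.org', sleep=waits.append))
```

Passing `sleep` as an argument keeps the demo instantaneous; in real use the default `time.sleep` would apply.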

In [None]:
# Store the data in a DataFrame
df = pd.DataFrame(list(db.items()), columns=['anime', 'link'])
# And save the list of anime names and links
df.to_csv('anidb_links22.csv')

In [None]:
# Look up the relevant data in the soup
# When a specific field is missing, a fallback value is stored instead
def new_dic(sopa):
  try:
    table = sopa.find_all('table')[0].find_all('td')
  except IndexError:
    # No table on the page is treated as a sign that the request was blocked
    print("Blocked")
    return 'stop'
  
  starts = sopa.find_all('span', class_='weight')
  tagnames = sopa.find_all('span', class_='tagname')
  nd = {}

  nd['name'] = table[0].find('span', itemprop='name').text
  nd['img'] = sopa.find('img', itemprop='image')['src']
  try:
    nd['name_en'] = table[1].find('label', itemprop='alternateName').text
  except:
    nd['name_en'] = table[0].find('span', itemprop='name').text
  try:
    nd['name_jp'] = table[2].find('label', itemprop='alternateName').text
  except: 
    nd['name_jp'] = table[0].find('span', itemprop='name').text
  nd['type'] = table[3].text.split(',')[0]
  try:
    nd['episodes'] = table[3].text.split(',')[1]
  except:
    nd['episodes'] = 1
  nd['start'] = table[4].text.split(' until ')[0]
  try:
    nd['end'] = table[4].text.split(' until ')[1]
  except:
    nd['end'] = table[4].text.split(' until ')[0]
  nd['season'] = table[5].text.split(' ')[0]
  try:
    nd['year'] = table[5].text.split(' ')[1]
  except:
    nd['year'] =  None
  genre = []
  for line in table[6].find_all('span', itemprop='genre'):
    genre.append(line.text)
  nd['genre'] = genre
  link_ex = []
  for line in table[7].find_all('a'):
    link_ex.append(line['href'])
  nd['link_ex'] = link_ex
  nd['rating'] = table[8].text.split(' ')[0]
  try:
    nd['nrating'] = table[8].text.split(' ')[1]
  except:
    nd['nrating'] = None
  nd['average'] = table[9].text.split(' ')[0]
  try: 
    nd['naverage'] = table[9].text.split(' ')[1]
  except:
    nd['naverage'] = None
  try:
    nd['rrating'] = table[10].text.split(' ')[0]
  except:
    nd['rrating'] = None
  try:
    nd['nrrating'] = table[10].text.split(' ')[1]
  except:
    nd['nrrating'] = None
  star_dic = {}
  for i in range(len(starts) - 2):
    star_dic[tagnames[i + len(nd['genre'])].text] = starts[i].text.replace('\n', '')
  nd['tags'] = star_dic
  return nd
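
Most fields in `new_dic` follow one pattern: split a cell's text and fall back to a default when the expected piece is missing. As a possible simplification, that pattern can be written once. `piece` is a hypothetical helper, not something the notebook defines, and the sample strings are invented.

```python
def piece(text, sep, index, default=None):
  """Return text.split(sep)[index], or default if that piece is missing."""
  parts = text.split(sep)
  return parts[index] if index < len(parts) else default

# The second piece exists, so it is returned as-is (leading space included)
print(repr(piece('TV Series, 12 episodes', ',', 1, default=1)))
# There is no second piece, so the default is used
print(repr(piece('Movie', ',', 1, default=1)))
# A missing end date falls back to the start date
print(repr(piece('07.10.2022', ' until ', 1, default='07.10.2022')))
```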

In [None]:
# Import libraries that let us track progress
from ipywidgets import IntProgress
from IPython.display import display

In [None]:
# Load the links
df = pd.read_csv('anidb_links22.csv')
urls = df['link']

In [None]:
# Load db22.csv so the scrape can be resumed,
# because our IP is sometimes blocked
# The file does not exist on the first run
try:
  df = pd.read_csv('db22.csv')
  data = df.to_dict(orient='records')
except FileNotFoundError:
  data = []

In [None]:
progress = IntProgress(min=0, max=len(urls))
display(progress)
scraper = cloudscraper.create_scraper(delay=10, browser='chrome')

# Records gathered so far (those loaded from db22.csv, if any)
prueba = list(data)

# Fetch the desired data for each remaining URL,
# starting from where we were blocked
for i in range(len(prueba), len(urls)):
  url = urls[i]
  sopa = ingrediente(scraper, url)
  p = new_dic(sopa)
  if p == 'stop':
    break
  prueba.append(p)
  time.sleep(random.uniform(0, 1))
  progress.value = i + 1
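
Resuming after a block comes down to knowing how many records were already saved, since that count is the index of the next URL to fetch. A standard-library sketch of that idea follows; `load_done` and `next_index` are hypothetical helpers, and `demo_checkpoint.csv` is a throwaway file name used only in this demo.

```python
import csv
import os
import tempfile

def load_done(path):
  """Return previously saved rows, or an empty list if the file is missing."""
  if not os.path.exists(path):
    return []
  with open(path, newline='', encoding='utf-8') as fh:
    return list(csv.DictReader(fh))

def next_index(done):
  """Index of the first URL that still has to be fetched."""
  return len(done)

# Demo inside a temporary directory
with tempfile.TemporaryDirectory() as tmp:
  path = os.path.join(tmp, 'demo_checkpoint.csv')
  first_run = next_index(load_done(path))   # nothing saved yet
  with open(path, 'w', newline='', encoding='utf-8') as fh:
    writer = csv.DictWriter(fh, fieldnames=['name'])
    writer.writeheader()
    writer.writerows([{'name': 'a'}, {'name': 'b'}])
  second_run = next_index(load_done(path))  # two rows already saved

print(first_run, second_run)
```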

In [None]:
# Build the DataFrame and save it
df_db = pd.DataFrame.from_dict(prueba)
df_db.to_csv('db22.csv', index=False)
df_db.to_excel('db22.xlsx', index=False)

## Anilist

Alternatively, data from another site will be used, so that the results from both sources can be compared and the better option chosen.

Because of changes to the operating system on Google Colab machines, the following script has to be run before Selenium can be used.

In [None]:
%%shell
# Ubuntu no longer distributes chromium-browser outside of snap
#
# Proposed solution: https://askubuntu.com/questions/1204571/how-to-install-chromium-without-snap

# Add debian buster
cat > /etc/apt/sources.list.d/debian.list <<'EOF'
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster.gpg] http://deb.debian.org/debian buster main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster-updates.gpg] http://deb.debian.org/debian buster-updates main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-security-buster.gpg] http://deb.debian.org/debian-security buster/updates main
EOF

# Add keys
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys DCC9EFBF77E11517
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 112695A0E562B32A

apt-key export 77E11517 | gpg --dearmour -o /usr/share/keyrings/debian-buster.gpg
apt-key export 22F3D138 | gpg --dearmour -o /usr/share/keyrings/debian-buster-updates.gpg
apt-key export E562B32A | gpg --dearmour -o /usr/share/keyrings/debian-security-buster.gpg

# Prefer debian repo for chromium* packages only
# Note the double-blank lines between entries
cat > /etc/apt/preferences.d/chromium.pref << 'EOF'
Package: *
Pin: release a=eoan
Pin-Priority: 500


Package: *
Pin: origin "deb.debian.org"
Pin-Priority: 300


Package: chromium*
Pin: origin "deb.debian.org"
Pin-Priority: 700
EOF

# Install chromium and chromium-driver
apt-get update
apt-get install chromium chromium-driver

# Install selenium
pip install selenium

In [None]:
# Import libraries
import re
import numpy as np
from numpy import arange
import pandas as pd
import matplotlib.pyplot as plt


import csv
import time
import random

In [None]:
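# Selenium pieces used by get_anilist
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

def get_anilist(url):
  # Configure a headless Chrome driver for Selenium
  options = Options()
  options.add_argument("--headless")
  options.add_argument("--no-sandbox")
  # Selenium 4 takes the driver path through a Service object
  driver = webdriver.Chrome(service=Service("/usr/bin/chromedriver"), options=options)
  try:
    driver.get(url)
    driver.maximize_window()

    # Give the page a few seconds to render
    time.sleep(random.uniform(3, 4))

    # Grab the visible text of the page, as if it were selected and copied with a mouse
    data = driver.find_element(By.XPATH, "/html/body").text
  finally:
    # Always shut the browser down, even if loading the page fails
    driver.quit()

  return data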

Because all the pages share a similar structure, a few words can be used as reference points to extract the desired data.

In [None]:
# Key marks that delimit the fields on a page
marks = ['Add to List',
 'Social',
'Format',
'Episodes',
'Episode Duration',
'Status',
#'Start Date',
#'End Date',
'Season',
'Average Score',
'Mean Score',
'Popularity',
'Favorites',
'Studios',
'Producers',
'Source',
'Hashtag',
'Genres',
'Romaji',
'English',
'Native',
'Synonyms',
'Tags',
'External & Streaming links',
'Relations',
'Characters',
'Staff',
'Status Distribution',
'Score Distribution',
'Trailer',
'Recommendations',
'ThreadsCreate New Thread'
 ]

In [None]:
# Take the copied text and use the marks to sort it into the desired fields
def to_dict(prueba, marks):
  nmarks = []
  prueba = prueba.split('\n')
  # Drop the 'Overview' ... 'Stats' navigation block when it sits near the top
  try:
    start, end = prueba.index('Overview'), prueba.index('Stats')
    if start < 50 and end < 50:
      del prueba[start:end + 1]
  except ValueError:
    # One of the two labels is missing, so there is nothing to remove
    pass
  # Keep only the marks that actually appear on this page
  for mark in marks:
    if mark in prueba:
      nmarks.append(mark)
  cuts = [prueba.index(x) for x in nmarks]
  # Each field runs from its mark to the next one; the last mark gets no value
  valores = [prueba[cuts[i]+1:cuts[i+1]] for i in range(0, len(cuts) - 1)]
  ndict = {}
  for x, y in zip(nmarks, valores):
    ndict[x] = y
  return ndict
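
To see the marker-based split on something small, here is a self-contained toy version of the idea. `split_by_marks` is a simplified, hypothetical stand-in for `to_dict` (it skips the Overview/Stats cleanup), and the sample text is invented.

```python
def split_by_marks(text, marks):
  """Map each mark found in text to the lines between it and the next mark."""
  lines = text.split('\n')
  found = [m for m in marks if m in lines]
  cuts = [lines.index(m) for m in found]
  result = {}
  # The last mark only closes the previous field, so it gets no entry
  for i in range(len(cuts) - 1):
    result[found[i]] = lines[cuts[i] + 1:cuts[i + 1]]
  return result

sample = 'Format\nTV\nEpisodes\n12\nStatus\nFinished'
print(split_by_marks(sample, ['Format', 'Episodes', 'Status']))
# {'Format': ['TV'], 'Episodes': ['12']}
```

Because the final mark has no closing boundary, the list of marks needs to end with an entry that appears after the last field of interest, which seems to be the role of the trailing entries in `marks`.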

The function above only needs the list of selected URLs, so a spreadsheet containing them, named anilinks.xlsx, is created by hand.

In [None]:
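df = pd.read_excel('anilinks.xlsx', sheet_name='Hoja 1')
# NOTE: `links`, used in the cells below, has to be built from the URL column of `df`;
# the notebook does not show that step or the column name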

In [None]:
def more_anilist(links, db, marks):
  for i in range(len(db), len(links)):
    data = get_anilist(links[i])
    db.append(to_dict(data, marks))
    print(i, links[i])
  return db

In [None]:
# Create an empty list to collect the dictionaries with the data from each link
db = []
db = more_anilist(links, db, marks)

In [None]:
# To look up only a couple of links, for example
# links = [
# 'https://anilist.co/anime/113717/Ousama-Ranking']

In [None]:
# To add a record manually:
# Shortened example of what the copied text looks like

data = '''

Add to List
Akiba Meido sensou 
Akihabara is the center of the universe for the coolest hobbies and quirkiest amusements. In the spring of 1999, bright-eyed Nagomi Wahira moves there with dreams of joining a maid café. She quickly dons an apron at café Ton Tokoton, AKA the Pig Hut. But adjusting to life in bustling Akihabara isn’t as easy as serving tea and delighting customers. Paired with the dour Ranko who never seems to smile, Nagomi must do her best to elevate the Pig Hut over all other maid cafés vying for top ranking. Along the way she’ll slice out a place for herself amid the frills and thrills of life at the Pig Hut. Just when Nagomi’s dreams are within her grasp, she discovers not everything is as it seems amid the maid cafés of Akihabara.

 #45 Highest Rated 2022
 #62 Most Popular 2022
Format
TV
Episodes
12
Episode Duration
24 mins
Status
Finished
Start Date
Oct 7, 2022
...
Recommendations
View All Recommendations
'''

# Convert the text into a dictionary
db = [to_dict(data, marks)]
# So that it can be added to the records
db = more_anilist(links, db, marks)


In [None]:
# Take the database and add the new anime
db_new = pd.DataFrame(db)
names = []
for text in db_new['Add to List']:
  names.append(text[0])
db_new.insert(0, 'name', names)
db_old = pd.read_csv('anilist22_raw.csv')

In [None]:
db_full = pd.concat([db_old, db_new], axis=0)
db_full.to_csv('anilist22_raw.csv', index=False)