<a href="https://colab.research.google.com/github/institutohumai/cursos-python/blob/master/Scraping/2_HTTP_Avanzado/scraping_extra_tips.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" data-canonical-src="https://colab.research.google.com/assets/colab-badge.svg"></a>

# Tips for better scraping

- Scrape several things at the same time with `multiprocessing`: https://python-docs-es.readthedocs.io/es/3.8/library/multiprocessing.html

In [4]:
from multiprocessing import Pool

from requests import get

def bajar_datos(url):
    return get(url).text

# In this example I try to download several items from the numbersapi.com website

urls = [f"http://numbersapi.com/{number}" for number in [1,2,3,4,5,6,7,8]]
print(urls)

['http://numbersapi.com/1', 'http://numbersapi.com/2', 'http://numbersapi.com/3', 'http://numbersapi.com/4', 'http://numbersapi.com/5', 'http://numbersapi.com/6', 'http://numbersapi.com/7', 'http://numbersapi.com/8']


In [5]:
# This way the data is downloaded one item at a time

for url in urls:
    resultado = bajar_datos(url)
    print(resultado)

1 is the loneliest number.
2 is the number of polynucleotide strands in a DNA double helix.
3 is number of performers in a trio.
4 is the number of movements in a symphony.
5 is the number of babies born in a quintuplet.
6 is the number of orders of the Mishnah.
7 is the number of periods, or horizontal rows of elements, in the periodic table.
8 is the number of bits in a byte.


In [3]:
# This way everything runs at the same time, in parallel

with Pool(5) as p:
    print(p.map(bajar_datos, urls))
    
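# In a Jupyter notebook this can cause problems: the worker processes
# may not be able to import a function defined in a notebook cell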

In [1]:
from concurrent.futures import ThreadPoolExecutor
from requests import get

def bajar_datos(url):
    return get(url).text

# URLs to process
urls = [f"http://numbersapi.com/{number}" for number in [1, 2, 3, 4, 5, 6, 7, 8]]
print(urls)

['http://numbersapi.com/1', 'http://numbersapi.com/2', 'http://numbersapi.com/3', 'http://numbersapi.com/4', 'http://numbersapi.com/5', 'http://numbersapi.com/6', 'http://numbersapi.com/7', 'http://numbersapi.com/8']


In [2]:
# Parallel processing with ThreadPoolExecutor
with ThreadPoolExecutor(max_workers=5) as executor:
    resultados = list(executor.map(bajar_datos, urls))

In [3]:
for resultado in resultados:
    print(resultado)

1 is the number of Gods in monotheism.
2 is the number of stars in a binary star system (a stellar system consisting of two stars orbiting around their center of mass).
3 is number of performers in a trio.
4 is the number of human blood groups (A, B, O, AB).
5 is the number of appendages on most starfish, which exhibit pentamerism.
6 is the number of strings on a standard guitar.
7 is the number of periods, or horizontal rows of elements, in the periodic table.
8 is the number of bits in a byte.


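An alternative is multithreading, as in the `ThreadPoolExecutor` cells above. Threads are a good fit here because the work is mostly waiting on the network.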
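
When many URLs are downloaded in threads, one failed request raises its exception as soon as its result is read from `executor.map`, and the remaining results are lost. The following is a minimal sketch of one way around that, using `submit` and `as_completed` to collect results and errors separately. The names `fetch_all` and `fake_fetch` are hypothetical and not part of the notebook; `fake_fetch` stands in for `bajar_datos` so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, max_workers=5):
    """Run fetch(url) for every URL in a thread pool.

    Returns two dicts keyed by URL: successful results and raised exceptions.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which URL each future belongs to.
        future_to_url = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failed download should not stop the others.
                errors[url] = exc
    return results, errors


def fake_fetch(url):
    # Hypothetical stand-in for bajar_datos; fails on one URL on purpose.
    if url.endswith("/3"):
        raise ValueError("simulated failure")
    return f"contents of {url}"


urls = [f"http://numbersapi.com/{number}" for number in range(1, 6)]
results, errors = fetch_all(urls, fake_fetch)
print(len(results), "ok,", len(errors), "failed")
```

In the notebook, passing `bajar_datos` instead of `fake_fetch` would run the real downloads; the URLs left in `errors` can then be retried.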


- Avoid getting blocked
    - Rotate IP and user agent
        - User-agent rotation: https://pypi.org/project/fake-useragent/
        - IP rotation: smartproxy and https://github.com/mattes/rotating-proxy

    - Premium accounts sometimes get banned or blocked less often, since they are the site's source of revenue and are treated as "untouchable" (example: Spotify)

- Create accounts without limits
    - Registration by phone
        - Disposable phone numbers (proovl and twilio)
        - Reuse the same phone number by varying the prefix: +54/+549/11/15/011

    - Registration by email
        - Disposable email addresses
        - Reuse the same address: pedroperez@gmail.com/pedro.perez@gmail.com/...

- [Solve captchas](https://addons.mozilla.org/en-US/firefox/addon/recaptcha-solver/)

- [Access old versions of sites](http://web.archive.org/)
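
For the archived-sites tip, the Wayback Machine also exposes an availability endpoint that returns the closest stored snapshot of a page as JSON. The sketch below only builds the query URL and parses a response; the helper names and the sample payload are assumptions written for illustration, not real data, and no request is sent.

```python
import json
from urllib.parse import urlencode

# Availability endpoint of the Wayback Machine.
WAYBACK_API = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the query URL; timestamp is optional, e.g. in YYYYMMDD form."""
    params = {"url": page_url}
    if timestamp is not None:
        params["timestamp"] = timestamp
    return f"{WAYBACK_API}?{urlencode(params)}"


def closest_snapshot(payload):
    """Return the URL of the closest archived copy, or None if there is none."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


# Hand-written sample in the shape of an API response (illustrative only).
sample = json.loads("""
{
  "archived_snapshots": {
    "closest": {
      "available": true,
      "status": "200",
      "timestamp": "20100101000000",
      "url": "http://web.archive.org/web/20100101000000/http://example.com/"
    }
  }
}
""")

query = availability_url("example.com", "20100101")
snapshot = closest_snapshot(sample)
print(query)
print(snapshot)
```

With `requests`, something like `closest_snapshot(get(query).json())` would return a snapshot URL that can then be passed to `bajar_datos`.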


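
Going back to the IP and user-agent rotation tip in the list above, here is a minimal sketch of the idea using only the standard library: a generator that cycles through a list of user agents and a list of proxies and yields keyword arguments suitable for `requests.get`. The user-agent strings and proxy addresses are hypothetical placeholders; in practice they could come from a package such as fake-useragent or from a proxy provider.

```python
from itertools import cycle

# Hypothetical values for illustration; replace with real ones.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ExampleBrowser/3.0",
]
PROXIES = ["http://proxy-a.example:8080", "http://proxy-b.example:8080"]


def rotating_request_kwargs(user_agents, proxies):
    """Yield headers/proxies kwargs, advancing both lists on every request.

    Assumes both lists are non-empty.
    """
    ua_pool = cycle(user_agents)
    proxy_pool = cycle(proxies)
    while True:
        proxy = next(proxy_pool)
        yield {
            "headers": {"User-Agent": next(ua_pool)},
            "proxies": {"http": proxy, "https": proxy},
        }


rotation = rotating_request_kwargs(USER_AGENTS, PROXIES)
first_four = [next(rotation) for _ in range(4)]
for kwargs in first_four:
    print(kwargs["headers"]["User-Agent"], "via", kwargs["proxies"]["http"])
```

Each download could then be written as `get(url, **next(rotation))`, so consecutive requests go out with a different user agent and proxy.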