# HTTP requests
Request methods are verbs that indicate the desired action to be performed when an HTTP client (such as a web browser) and a server communicate with each other: GET, HEAD, POST, PUT, DELETE, and so on. Of these methods, GET and POST are the two most commonly used in web scraping applications:
1. The GET method requests specific data from the server. This method only retrieves data and has no other effect on the server and its databases.
2. The POST method sends data in a specific form that is accepted by the server. This data could be, for example, a message to a bulletin board, mailing list, or newsgroup, information to be submitted to a web form, or an item to be added to a database.
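
To make the difference concrete, here is a minimal sketch that builds (but never sends) one GET and one POST request with the standard library's urllib. The httpbin.org URLs and the payload are only illustrative; the point is that GET carries its data in the URL's query string while POST carries it in the request body.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Illustrative payload, not tied to any real service.
payload = {"key1": "value1", "key2": "value2"}
query = urlencode(payload)

# GET: the data travels in the URL as a query string.
get_req = Request("https://httpbin.org/get?" + query)

# POST: the data travels in the request body; the URL stays clean.
post_req = Request("https://httpbin.org/post", data=query.encode("ascii"))

# Nothing is sent over the network; we only inspect the request objects.
print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

urllib infers the method from whether a body is present, which is why no method name is passed explicitly here.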

# The requests module

In [None]:
import requests
import time

## Example

In [None]:
url = "http://www.google.com"

res = requests.get(url)

print(res.status_code)
print(res.headers)

with open("google.html", "w") as f:
    f.write(res.text)

print("Done.")


200
{'Date': 'Fri, 16 Sep 2022 02:54:24 GMT', 'Expires': '-1', 'Cache-Control': 'private, max-age=0', 'Content-Type': 'text/html; charset=ISO-8859-1', 'P3P': 'CP="This is not a P3P policy! See g.co/p3phelp for more info."', 'Content-Encoding': 'gzip', 'Server': 'gws', 'Content-Length': '6590', 'X-XSS-Protection': '0', 'X-Frame-Options': 'SAMEORIGIN', 'Set-Cookie': '1P_JAR=2022-09-16-02; expires=Sun, 16-Oct-2022 02:54:24 GMT; path=/; domain=.google.com; Secure, AEC=AakniGPRqiGihYgm0x9ctaFkCuG3bIegNhrhyWdMyrsyAMoV-EgH3XQp0MM; expires=Wed, 15-Mar-2023 02:54:24 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax, NID=511=Eeb4diA8S8-cb6LNW6b95EJyeexMsNrJqmKMEv3_VIIxr1OsGOSQl8yStCVbCI2Anck7u2mrD00E2obW_8UkvWCpO09nunAGL9kVxp7phqzzl3cxNJRBtB_HGNBofFem8dNllMm9qhfiTlz-uRHjwbRE6RoGhZw3y_Vr-Ti2xik; expires=Sat, 18-Mar-2023 02:54:24 GMT; path=/; domain=.google.com; HttpOnly'}
Done.


## Get

In [None]:
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('https://httpbin.org/get', params=payload)
print(r)
print(r.status_code)
print(r.text)
print(r.json())

<Response [200]>
200
{
  "args": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.23.0", 
    "X-Amzn-Trace-Id": "Root=1-6323e563-11781d0d5f0db1425c0dd787"
  }, 
  "origin": "104.196.172.33", 
  "url": "https://httpbin.org/get?key1=value1&key2=value2"
}

{'args': {'key1': 'value1', 'key2': 'value2'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.23.0', 'X-Amzn-Trace-Id': 'Root=1-6323e563-11781d0d5f0db1425c0dd787'}, 'origin': '104.196.172.33', 'url': 'https://httpbin.org/get?key1=value1&key2=value2'}


## Post

In [None]:
url = 'https://httpbin.org/post'
payload_tuples = [('key1', 'value1'), ('key1', 'value2')]
r1 = requests.post(url, data=payload_tuples)
print(r1.text)
# print(r1.json())

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": [
      "value1", 
      "value2"
    ]
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "23", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.23.0", 
    "X-Amzn-Trace-Id": "Root=1-6323e566-495004a64b43e3706a5e37f7"
  }, 
  "json": null, 
  "origin": "104.196.172.33", 
  "url": "https://httpbin.org/post"
}
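
The Content-Length of 23 in the response above can be checked by hand. The sketch below uses the standard library's urlencode as a stand-in for the form encoding; it approximates what a client does for application/x-www-form-urlencoded data and is not the actual code path inside requests.

```python
from urllib.parse import urlencode

# Same list of tuples as in the cell above: one key, two values.
payload_tuples = [("key1", "value1"), ("key1", "value2")]

# A repeated key simply appears twice in the encoded form body.
body = urlencode(payload_tuples)
body_length = len(body.encode("ascii"))

print(body)         # key1=value1&key1=value2
print(body_length)  # 23
```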



In [None]:
files = {'file': ('report.csv', 'some,data,to,send\nanother,row,to,send\n')}
r2 = requests.post(url, files=files)
print(r2.text)
# print(r2.json())

{
  "args": {}, 
  "data": "", 
  "files": {
    "file": "some,data,to,send\nanother,row,to,send\n"
  }, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "184", 
    "Content-Type": "multipart/form-data; boundary=1431bcd9f23a23b4297cd59e747a62b4", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.23.0", 
    "X-Amzn-Trace-Id": "Root=1-6323e569-66c0863757a7c7e8433d8e9f"
  }, 
  "json": null, 
  "origin": "104.196.172.33", 
  "url": "https://httpbin.org/post"
}



## Status code

1XX: informational status code. The request was received, and the server is processing it.

2XX: successful status code. The request was successfully received, understood, and processed by the server.

3XX: redirection status code. Additional actions need to be taken so that the request can be successfully processed.

4XX: error status code for the client. The request was incorrectly formatted by the client and could not be processed.

5XX: error status code for the server. The request, although valid, could not be processed by the server.
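
The five classes can be read off the first digit of the code. The sketch below is one way to do that; status_class is a hypothetical helper name, and the reason phrases come from the standard library's http.HTTPStatus enum.

```python
from http import HTTPStatus

# Hypothetical helper: map the leading digit to the class names listed above.
STATUS_CLASSES = {
    1: "informational",
    2: "successful",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def status_class(code):
    """Return the class of an HTTP status code based on its first digit."""
    return STATUS_CLASSES.get(code // 100, "unknown")

for code in (200, 301, 404, 500):
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```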

# Ping test

A ping test is a procedure in which you test the communication between your system and specific web servers, simply by sending a request to each of the servers in question. By considering the HTTP response status code returned by each server, the test is used to evaluate either the internet connection of your system or the availability of the servers.

## Simulate a ping test

In [None]:
def ping(url):
    res = requests.get(url)
    print(f"{url}: {res.text}")


urls = [
    "http://httpstat.us/200",
    "http://httpstat.us/400",
    "http://httpstat.us/404",
    "http://httpstat.us/408",
    "http://httpstat.us/500",
    "http://httpstat.us/511",
]

start = time.time()
for url in urls:
    ping(url)
print(f"Sequential: {time.time() - start : .2f} seconds")

print()

http://httpstat.us/200: 200 OK
http://httpstat.us/400: 400 Bad Request
http://httpstat.us/404: 404 Not Found
http://httpstat.us/408: 408 Request Timeout
http://httpstat.us/500: 500 Internal Server Error
http://httpstat.us/511: 511 Network Authentication Required
Sequential:  0.72 seconds



# Concurrent HTTP request


In [None]:
import threading

start = time.time()
threads = []
for url in urls:
    thread = threading.Thread(target=ping, args=(url,))
    threads.append(thread)
    thread.start()
for thread in threads:
    thread.join()

print(f"Threading: {time.time() - start : .2f} seconds")

http://httpstat.us/200: 200 OK
http://httpstat.us/400: 400 Bad Request
http://httpstat.us/404: 404 Not Found
http://httpstat.us/408: 408 Request Timeout
http://httpstat.us/500: 500 Internal Server Error
http://httpstat.us/511: 511 Network Authentication Required
Threading:  0.09 seconds
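
The same fan-out can also be written with concurrent.futures.ThreadPoolExecutor, which manages the start and join steps itself. This is only a sketch: fake_ping and demo_urls are made-up names, and time.sleep stands in for the network wait so the example runs offline.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_ping(url, delay=0.2):
    """Hypothetical stand-in for ping(): sleeps instead of making a request."""
    time.sleep(delay)
    return f"{url}: simulated"

# Illustrative URLs only; nothing is actually requested.
demo_urls = [
    "http://example.invalid/a",
    "http://example.invalid/b",
    "http://example.invalid/c",
]

start = time.perf_counter()
# map() runs the calls in worker threads and yields results in input order.
with ThreadPoolExecutor(max_workers=len(demo_urls)) as pool:
    results = list(pool.map(fake_ping, demo_urls))
elapsed = time.perf_counter() - start

for line in results:
    print(line)
print(f"Thread pool: {elapsed:.2f} seconds")
```

Because the three waits overlap, the total should be close to one delay rather than the sum of all three.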


## Refactor

In [None]:
class MyThread(threading.Thread):
    def __init__(self, url):
        threading.Thread.__init__(self)
        self.url = url
        self.result = None

    def run(self):
        res = requests.get(self.url)
        self.result = f"{self.url}: {res.text}"


start = time.time()

threads = [MyThread(url) for url in urls]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
for thread in threads:
    print(thread.result)

print(f"Took {time.time() - start : .2f} seconds")

print("Done.")

http://httpstat.us/200: 200 OK
http://httpstat.us/400: 400 Bad Request
http://httpstat.us/404: 404 Not Found
http://httpstat.us/408: 408 Request Timeout
http://httpstat.us/500: 500 Internal Server Error
http://httpstat.us/511: 511 Network Authentication Required
Took  0.09 seconds
Done.


# The problem with timeouts

Timeouts typically occur when the server takes an unusually long time to process a specific request, and the connection between the server and its client is terminated.

## Simulate timeouts

In [None]:
urls = [
    "http://httpstat.us/200",
    "http://httpstat.us/200?sleep=20000",
    "http://httpstat.us/400",
]

start = time.time()

threads = [MyThread(url) for url in urls]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
for thread in threads:
    print(thread.result)

print(f"Took {time.time() - start : .2f} seconds")

print("Done.")


http://httpstat.us/200: 200 OK
http://httpstat.us/200?sleep=20000: 200 OK
http://httpstat.us/400: 400 Bad Request
Took  20.10 seconds
Done.


## Timeout specifications
Count down from the timeout threshold and check whether each thread is still alive:

In [None]:
UPDATE_INTERVAL = 0.01

def process_requests(threads, timeout=5):
    def alive_count():
        alive = [1 if thread.is_alive() else 0 for thread in threads]
        return sum(alive)

    while alive_count() > 0 and timeout > 0:
        timeout -= UPDATE_INTERVAL
        time.sleep(UPDATE_INTERVAL)
    for thread in threads:
        print(thread.result)

In [None]:
class MyThread(threading.Thread):
    def __init__(self, url):
        threading.Thread.__init__(self)
        self.url = url
        self.result = f"{self.url}: Custom timeout"

    def run(self):
        res = requests.get(self.url)
        self.result = f"{self.url}: {res.text}"

In [None]:
urls = [
    "http://httpstat.us/200",
    "http://httpstat.us/200?sleep=4000",
    "http://httpstat.us/200?sleep=20000",
    "http://httpstat.us/400",
]

start = time.time()

threads = [MyThread(url) for url in urls]
for thread in threads:
    thread.daemon = True  # setDaemon() is deprecated since Python 3.10
    thread.start()
process_requests(threads)

print(f"Took {time.time() - start : .2f} seconds")

print("Done.")

http://httpstat.us/200: 200 OK
http://httpstat.us/200?sleep=4000: 200 OK
http://httpstat.us/200?sleep=20000: Custom timeout
http://httpstat.us/400: 400 Bad Request
Took  5.13 seconds
Done.


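Marking the threads as daemon threads keeps unfinished requests from blocking the main thread when the program exits. The next cell repeats the run with the daemon setting commented out.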

In [None]:
urls = [
    "http://httpstat.us/200",
    "http://httpstat.us/200?sleep=4000",
    "http://httpstat.us/200?sleep=20000",
    "http://httpstat.us/400",
]

start = time.time()

threads = [MyThread(url) for url in urls]
for thread in threads:
    # thread.daemon = True
    thread.start()
process_requests(threads)

print(f"Took {time.time() - start : .2f} seconds")

print("Done.")

http://httpstat.us/200: 200 OK
http://httpstat.us/200?sleep=4000: 200 OK
http://httpstat.us/200?sleep=20000: Custom timeout
http://httpstat.us/400: 400 Bad Request
Took  5.14 seconds
Done.
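
An alternative to polling is_alive() in a loop is to join each thread against a shared deadline. The sketch below assumes simulated work: SlowTask and wait_all are hypothetical names, and time.sleep replaces the HTTP request so the example runs offline.

```python
import threading
import time

class SlowTask(threading.Thread):
    """Hypothetical worker: sleeps for `delay` seconds instead of requesting a URL."""

    def __init__(self, label, delay):
        # daemon=True lets the program exit even if this thread is still running.
        super().__init__(daemon=True)
        self.label = label
        self.delay = delay
        self.result = f"{label}: Custom timeout"  # default if we never finish

    def run(self):
        time.sleep(self.delay)
        self.result = f"{self.label}: finished"

def wait_all(threads, timeout):
    """Join every thread, but never wait past one shared deadline."""
    deadline = time.monotonic() + timeout
    for thread in threads:
        remaining = deadline - time.monotonic()
        thread.join(max(remaining, 0))
    return [thread.result for thread in threads]

tasks = [SlowTask("fast", 0.05), SlowTask("slow", 5)]
for task in tasks:
    task.start()

results = wait_all(tasks, timeout=0.5)
print(results)  # ['fast: finished', 'slow: Custom timeout']
```

The slow thread keeps running in the background after the deadline; being a daemon thread, it is simply abandoned when the interpreter exits.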


# Async HTTP request (Coroutine)

requests is not awaitable, so it cannot be used directly with async/await; the cells below use aiohttp instead.
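
When a blocking call has to be kept, one common workaround is to hand it to a thread pool through the event loop's run_in_executor. The sketch below is an assumption-laden illustration: blocking_fetch is a hypothetical function that sleeps where a real requests.get call would wait, so it runs without network access.

```python
import asyncio
import time

def blocking_fetch(url, delay=0.2):
    """Hypothetical blocking call; time.sleep stands in for requests.get(url)."""
    time.sleep(delay)
    return f"{url}: simulated"

async def main(urls):
    loop = asyncio.get_running_loop()
    # Each blocking call runs in the default thread pool and is wrapped in an
    # awaitable future, so the event loop itself is never blocked.
    futures = [loop.run_in_executor(None, blocking_fetch, url) for url in urls]
    return await asyncio.gather(*futures)

# Illustrative URLs only; nothing is actually requested.
demo_urls = ["http://example.invalid/1", "http://example.invalid/2", "http://example.invalid/3"]

start = time.perf_counter()
results = asyncio.run(main(demo_urls))
elapsed = time.perf_counter() - start

print(results)
print(f"Executor: {elapsed:.2f} seconds")
```

asyncio.run starts its own event loop, so inside a notebook that already has one running, awaiting main(demo_urls) directly is the usual substitute. This approach still uses threads underneath; a natively asynchronous client avoids them altogether.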

In [None]:
import asyncio
import aiohttp

In [None]:
!pip install nest_asyncio --quiet
import nest_asyncio
nest_asyncio.apply()

In [None]:
async with aiohttp.ClientSession() as session:
    async with session.get('http://httpbin.org/get') as resp:
        print(resp.status)
        print(await resp.text())

200
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "Python/3.7 aiohttp/3.8.1", 
    "X-Amzn-Trace-Id": "Root=1-6323f7e5-5cda2de56ca227822b620b67"
  }, 
  "origin": "104.196.172.33", 
  "url": "http://httpbin.org/get"
}



In [None]:
async with aiohttp.ClientSession() as session:
    # async with session.request(method='GET', url='http://httpbin.org/request') as req:
    #   print(await req.text())
    async with session.post('http://httpbin.org/post', data=b'data') as req:
      print(req.status)
      print(await req.text())
    # session.put('http://httpbin.org/put', data=b'data')
    # session.delete('http://httpbin.org/delete')
    # session.head('http://httpbin.org/get')
    # session.options('http://httpbin.org/get')
    # session.patch('http://httpbin.org/patch', data=b'data')

200
{
  "args": {}, 
  "data": "data", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "4", 
    "Content-Type": "application/octet-stream", 
    "Host": "httpbin.org", 
    "User-Agent": "Python/3.7 aiohttp/3.8.1", 
    "X-Amzn-Trace-Id": "Root=1-6323db8e-4f6999b007d1007261434568"
  }, 
  "json": null, 
  "origin": "104.196.172.33", 
  "url": "http://httpbin.org/post"
}



# Example of extracting titles

In [None]:
from lxml import etree

urls = ['https://arxiv.org/abs/2201.000%02d'%i for i in range(1, 11)]

## Use request

In [None]:
def get_title(url,cnt):
    response = requests.get(url)
    html = response.content
    title = etree.HTML(html).xpath('//h1[contains(@class, "title")]/text()')
    print('Title %d: %s' % (cnt,''.join(title)))

In [None]:
%%time
for i, url in enumerate(urls, start=1):
    get_title(url, i)

Title 1: Modeling Advection on Directed Graphs using Matérn Gaussian Processes for Traffic Flow
Title 2: Time-Dependent Duhamel Renormalization method with Multiple Conservation and Dissipation Laws
Title 3: Simulating local fields in carbon nanotube reinforced composites for infinite strip with voids
Title 4: Robust reliability-based topology optimization under random-field material model
Title 5: A Literature Review on Length of Stay Prediction for Stroke Patients using Machine Learning and Statistical Approaches
Title 6: AttentionLight: Rethinking queue length and attention mechanism for traffic signal control
Title 7: Confidence-Aware Multi-Teacher Knowledge Distillation
Title 8: A Lightweight and Accurate Spatial-Temporal Transformer for Traffic Forecasting
Title 9: Improving Deep Neural Network Classification Confidence using Heatmap-based eXplainable AI
Title 10: Locally finite free space as limiting case of PT-symmetric medium
CPU times: user 208 ms, sys: 8.94 ms, total: 217 ms

## Concurrent

In [None]:
import threading

class TitleThread(threading.Thread):
    def __init__(self, cnt, url):
        threading.Thread.__init__(self)
        self.cnt = cnt
        self.url = url
        self.title = f"{self.url}: Bad url"

    def run(self):
        res = requests.get(self.url)
        html = res.content
        title = etree.HTML(html).xpath('//h1[contains(@class, "title")]/text()')
        self.title = ''.join(title)
        print('Title %d: %s' % (self.cnt,self.title))


UPDATE_INTERVAL = 0.01

def process_requests(threads, timeout=5):
    def alive_count():
        alive = [1 if thread.is_alive() else 0 for thread in threads]
        return sum(alive)

    while alive_count() > 0 and timeout > 0:
        timeout -= UPDATE_INTERVAL
        time.sleep(UPDATE_INTERVAL)

In [None]:
%%time
threads = [TitleThread(cnt, url) for cnt, url in enumerate(urls)]
for thread in threads:
    thread.start()
process_requests(threads)

Title 3: Robust reliability-based topology optimization under random-field material model
Title 2: Simulating local fields in carbon nanotube reinforced composites for infinite strip with voids
Title 0: Modeling Advection on Directed Graphs using Matérn Gaussian Processes for Traffic Flow
Title 7: A Lightweight and Accurate Spatial-Temporal Transformer for Traffic Forecasting
Title 8: Improving Deep Neural Network Classification Confidence using Heatmap-based eXplainable AI
Title 6: Confidence-Aware Multi-Teacher Knowledge Distillation
Title 1: Time-Dependent Duhamel Renormalization method with Multiple Conservation and Dissipation Laws
Title 4: A Literature Review on Length of Stay Prediction for Stroke Patients using Machine Learning and Statistical Approaches
Title 9: Locally finite free space as limiting case of PT-symmetric medium
Title 5: AttentionLight: Rethinking queue length and attention mechanism for traffic signal control
CPU times: user 238 ms, sys: 13.4 ms, total: 252 ms


## Multiprocessing

In [None]:
import multiprocessing
from multiprocessing import Pool

In [None]:
multiprocessing.cpu_count()

2

In [None]:
%%time 
p = Pool(2)
for i, url in enumerate(urls):
    p.apply_async(get_title, args=(url, i))
p.close()
p.join() 

Title 0: Modeling Advection on Directed Graphs using Matérn Gaussian Processes for Traffic Flow
Title 1: Time-Dependent Duhamel Renormalization method with Multiple Conservation and Dissipation Laws
Title 2: Simulating local fields in carbon nanotube reinforced composites for infinite strip with voids
Title 3: Robust reliability-based topology optimization under random-field material model
Title 4: A Literature Review on Length of Stay Prediction for Stroke Patients using Machine Learning and Statistical Approaches
Title 5: AttentionLight: Rethinking queue length and attention mechanism for traffic signal control
Title 6: Confidence-Aware Multi-Teacher Knowledge Distillation
Title 7: A Lightweight and Accurate Spatial-Temporal Transformer for Traffic Forecasting
Title 8: Improving Deep Neural Network Classification Confidence using Heatmap-based eXplainable AI
Title 9: Locally finite free space as limiting case of PT-symmetric medium
CPU times: user 49.2 ms, sys: 33.2 ms, total: 82.4 m

## Coroutine
This version uses async/await with aiohttp.

In [None]:
async def get_title(cnt, url):
      async with aiohttp.ClientSession() as session:
          async with session.request('GET', url) as resp:
              html = await resp.read()
              title = etree.HTML(html).xpath('//h1[contains(@class, "title")]/text()')
              print('Title %d: %s' % (cnt,''.join(title)))

In [None]:
%%time
start = time.time()
loop = asyncio.get_event_loop()
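# asyncio.wait() needs Task objects; passing bare coroutines was removed in Python 3.11
tasks = [loop.create_task(get_title(cnt, url)) for cnt, url in enumerate(urls)]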
loop.run_until_complete(asyncio.wait(tasks))

Title 9: Locally finite free space as limiting case of PT-symmetric medium
Title 0: Modeling Advection on Directed Graphs using Matérn Gaussian Processes for Traffic Flow
Title 3: Robust reliability-based topology optimization under random-field material model
Title 1: Time-Dependent Duhamel Renormalization method with Multiple Conservation and Dissipation Laws
Title 4: A Literature Review on Length of Stay Prediction for Stroke Patients using Machine Learning and Statistical Approaches
Title 7: A Lightweight and Accurate Spatial-Temporal Transformer for Traffic Forecasting
Title 2: Simulating local fields in carbon nanotube reinforced composites for infinite strip with voids
Title 5: AttentionLight: Rethinking queue length and attention mechanism for traffic signal control
Title 8: Improving Deep Neural Network Classification Confidence using Heatmap-based eXplainable AI
Title 6: Confidence-Aware Multi-Teacher Knowledge Distillation
CPU times: user 91 ms, sys: 9.2 ms, total: 100 ms
Wa

({<Task finished coro=<get_title() done, defined at <ipython-input-187-ae3faecb97ed>:1> result=None>,
  <Task finished coro=<get_title() done, defined at <ipython-input-187-ae3faecb97ed>:1> result=None>,
  <Task finished coro=<get_title() done, defined at <ipython-input-187-ae3faecb97ed>:1> result=None>,
  <Task finished coro=<get_title() done, defined at <ipython-input-187-ae3faecb97ed>:1> result=None>,
  <Task finished coro=<get_title() done, defined at <ipython-input-187-ae3faecb97ed>:1> result=None>,
  <Task finished coro=<get_title() done, defined at <ipython-input-187-ae3faecb97ed>:1> result=None>,
  <Task finished coro=<get_title() done, defined at <ipython-input-187-ae3faecb97ed>:1> result=None>,
  <Task finished coro=<get_title() done, defined at <ipython-input-187-ae3faecb97ed>:1> result=None>,
  <Task finished coro=<get_title() done, defined at <ipython-input-187-ae3faecb97ed>:1> result=None>,
  <Task finished coro=<get_title() done, defined at <ipython-input-187-ae3faecb97e