## Some thoughts on Concurrency

![onetask](https://www.pythontutorial.net/wp-content/uploads/2020/12/Python-Threading-Single-threaded-App.png)

A computer with a single CPU core can execute only one task at a time. When given several tasks, such as making requests to the Google, Twitter, and Medium websites, it switches between them. The switching is so fast that, to the user, the tasks appear to run simultaneously.

![one_CPU](https://miro.medium.com/v2/resize:fit:1100/format:webp/1*j_zj53jzmDQm91Ql8NgtfA.png)

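With threading, a program can start the next task while another one is still waiting, for example on a network response. This shortens the total time needed to run an I/O-bound task many times. Source: [Python Threading](https://www.pythontutorial.net/python-concurrency/python-threading/).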

![multi](https://www.pythontutorial.net/wp-content/uploads/2020/12/Python-Threading-Multi-threaded-App.png)

Source: [Short intro to concurrency](https://levelup.gitconnected.com/a-short-intro-on-concurrency-and-parallelism-1417bd04e881)
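The effect described above can be sketched without any network access. The snippet below is a minimal illustration, not the scraper used later: `fake_request` is a hypothetical stand-in that sleeps instead of contacting a website, and the 0.2 second delay is an arbitrary assumption.

```python
import threading
import time


def fake_request(name, delay=0.2):
    """Hypothetical stand-in for a network call; sleeping mimics waiting on I/O."""
    time.sleep(delay)


def run_sequential(names):
    """Run the fake requests one after another and return the elapsed time."""
    start = time.perf_counter()
    for name in names:
        fake_request(name)
    return time.perf_counter() - start


def run_threaded(names):
    """Run each fake request in its own thread and return the elapsed time."""
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_request, args=(name,)) for name in names]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait until every thread has finished
    return time.perf_counter() - start


sites = ["Google", "Twitter", "Medium"]
sequential_time = run_sequential(sites)
threaded_time = run_threaded(sites)
print(f"Sequential: {sequential_time:.2f}s")
print(f"Threaded:   {threaded_time:.2f}s")
```

Because the three waits overlap, the threaded run should take roughly as long as one request, while the sequential run takes roughly the sum of all three.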

* A thread pool is a pattern for managing multiple threads efficiently.
* Use the `ThreadPoolExecutor` class to manage a thread pool in Python.
* Call the `submit()` method of the `ThreadPoolExecutor` to submit a task to the thread pool for execution. The `submit()` method returns a `Future` object.
* Call the `map()` method of the `ThreadPoolExecutor` to apply a function to each element of an iterable using the thread pool.
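The two methods in the list above can be compared side by side. This is a small sketch in which `square` is a hypothetical placeholder for any function you would hand to the pool.

```python
from concurrent.futures import ThreadPoolExecutor


def square(n):
    """Hypothetical task; any callable can be submitted to the pool."""
    return n * n


with ThreadPoolExecutor(max_workers=3) as executor:
    # submit() schedules one call and returns a Future immediately.
    future = executor.submit(square, 4)
    single_result = future.result()  # blocks until the call has finished

    # map() applies the function to every element and yields results in input order.
    mapped_results = list(executor.map(square, [1, 2, 3]))

print(single_result)   # 16
print(mapped_results)  # [1, 4, 9]
```

`submit()` is convenient when each task needs its own handle, for example to check for exceptions individually, while `map()` is shorter when the same function simply runs over a list.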

This example illustrates how to implement concurrency using Python's `concurrent.futures` module.

It shows how much time can be saved when scraping three property links, using their URLs as input.

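`concurrent.futures` is part of the Python 3 standard library, so only the third-party packages need to be installed:

!pip install requests beautifulsoup4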

In [1]:
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed
import time
import json

with open("final_url_list.json", 'r') as f:
    url_list = json.load(f)

"""url_list = ['https://www.immoweb.be/en/classified/house/for-sale/lede/9340/10660142',
        'https://www.immoweb.be/en/classified/house/for-sale/gent/9000/10660214'
        ]"""

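def house_scrapper(url):
    """Return the text of every property-table cell on the listing page at `url`."""
    # A timeout keeps one slow request from blocking the whole run.
    r = requests.get(url, timeout=10)
    soup = BeautifulSoup(r.content, 'html.parser')
    # Each listing attribute sits in a <td class="classified-table__data"> cell.
    cells = soup.find_all('td', {"class": "classified-table__data"})
    results = []
    for cell in cells:
        # Keep the leading text node of the cell, stripped of surrounding whitespace.
        results.append(cell.contents[0].strip(' \t\n\r'))

    return results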


JSONDecodeError: Extra data: line 1 column 2442 (char 2441)
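The `Extra data` message is raised by the `json` module when valid JSON is followed by more content, for instance when two JSON values were written to the same file back to back. Whether that is what happened to `final_url_list.json` is an assumption; the sketch below uses a made-up string (`raw_text`) and a hypothetical helper (`load_concatenated`) to reproduce the error and show one way to read such a file.

```python
import json

# Hypothetical file content: two JSON arrays written one after the other.
raw_text = '["https://example.com/a"]["https://example.com/b"]'

try:
    json.loads(raw_text)
    error_message = None
except json.JSONDecodeError as exc:
    error_message = exc.msg  # the parser stops after the first array

print(error_message)


def load_concatenated(text):
    """Decode every JSON value found back to back in `text`."""
    decoder = json.JSONDecoder()
    values = []
    position = 0
    while position < len(text):
        # raw_decode() does not skip leading whitespace, so do it here.
        while position < len(text) and text[position].isspace():
            position += 1
        if position == len(text):
            break
        value, position = decoder.raw_decode(text, position)
        values.append(value)
    return values


# Flatten the decoded arrays into a single list of URLs.
url_list = [url for chunk in load_concatenated(raw_text) for url in chunk]
print(url_list)
```

If the file is under your control, a simpler fix is to regenerate it so that it holds a single JSON array.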

# Without concurrency

In [4]:
start = time.time()
all_links = []
for url in url_list:
    result = house_scrapper(url)
    all_links.append(result)

end = time.time()
print("Time Taken: {:.6f}s".format(end-start))
print(all_links)

Time Taken: 1.135832s
[['After signing the deed', 'Oost- Vlaanderen', '1950', 'To be done up', '6 m', '3', 'Urban', '170', '32', 'Installed', '15', '3', '17', '12', '6', '1', '1', 'Yes', 'Yes', 'No', '564', '', 'Connected', 'Yes', '450', 'Yes', 'No', 'No', 'No', 'Yes', 'No', '452', 'E', '20230626-0002928488-RES-1', 'Not specified', 'Not specified', 'Gas', 'No', 'No', 'No', 'Living area (residential, urban or rural)', '', '', 'No', 'Hoogstraat 20', '', '5389284'], ['After signing the deed', 'October 1 2023 - 12:00 AM', '1899', 'Good', '4.2 m', '2', 'Urban', '119', '14', 'USA installed', '17', '3', '14', '14', '10', '2', '2', '2', 'Yes', 'No', '55', 'Connected', 'Yes', 'No', 'No', 'Yes', 'No', '246', 'C', '20230622-0002923649-RES-1', '4167 kg CO₂/m²', 'Not specified', 'Gas', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Non flood zone', 'Living area (residential, urban or rural)', '', '', 'No', 'Sint-Denijslaan 1', '', '5416299']]


# With concurrency

In [5]:
with ThreadPoolExecutor(max_workers=10) as executor:
    start = time.time()
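    # submit() schedules each call right away and returns a Future.
    futures = [executor.submit(house_scrapper, url) for url in url_list]
    # result() blocks until that call finishes; results stay in url_list order.
    results = [item.result() for item in futures]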
    end = time.time()
    print("Time Taken: {:.6f}s".format(end-start))
    print(results)


Time Taken: 0.727800s
[['After signing the deed', 'Oost- Vlaanderen', '1950', 'To be done up', '6 m', '3', 'Urban', '170', '32', 'Installed', '15', '3', '17', '12', '6', '1', '1', 'Yes', 'Yes', 'No', '564', '', 'Connected', 'Yes', '450', 'Yes', 'No', 'No', 'No', 'Yes', 'No', '452', 'E', '20230626-0002928488-RES-1', 'Not specified', 'Not specified', 'Gas', 'No', 'No', 'No', 'Living area (residential, urban or rural)', '', '', 'No', 'Hoogstraat 20', '', '5389284'], ['After signing the deed', 'October 1 2023 - 12:00 AM', '1899', 'Good', '4.2 m', '2', 'Urban', '119', '14', 'USA installed', '17', '3', '14', '14', '10', '2', '2', '2', 'Yes', 'No', '55', 'Connected', 'Yes', 'No', 'No', 'Yes', 'No', '246', 'C', '20230622-0002923649-RES-1', '4167 kg CO₂/m²', 'Not specified', 'Gas', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Non flood zone', 'Living area (residential, urban or rural)', '', '', 'No', 'Sint-Denijslaan 1', '', '5416299']]
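The imports at the top also bring in `as_completed`, which the cell above does not use. As a hedged sketch of how it could fit in, the snippet below handles each result as soon as it is ready instead of in submission order. `fake_scrape`, the URL labels, and the delays are hypothetical stand-ins for `house_scrapper` and real listings.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_scrape(url, delay):
    """Hypothetical stand-in for house_scrapper; sleeps instead of fetching."""
    time.sleep(delay)
    return f"scraped {url}"


# Made-up jobs: the first one submitted is the slowest.
jobs = {"url-slow": 0.3, "url-fast": 0.05}

completed_order = []
with ThreadPoolExecutor(max_workers=2) as executor:
    # Map each Future back to its URL so results can be labelled.
    future_to_url = {
        executor.submit(fake_scrape, url, delay): url for url, delay in jobs.items()
    }
    for future in as_completed(future_to_url):
        url = future_to_url[future]
        try:
            completed_order.append((url, future.result()))
        except Exception as exc:
            # A failed task does not stop the others from being collected.
            completed_order.append((url, f"failed: {exc}"))

print(completed_order)
```

The fast job is collected first even though it was submitted second, which is useful when one slow page should not delay the processing of the others.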


In [6]:
with ProcessPoolExecutor(max_workers=10) as executor:
    start = time.time()
    futures = [executor.submit(house_scrapper, url) for url in url_list]
    results = [item.result() for item in futures]
    end = time.time()
    print("Time Taken: {:.6f}s".format(end-start))
    print(results)

BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
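The traceback for this error is not shown, so the cause here is uncertain. A common one is running `ProcessPoolExecutor` from a notebook on Windows or macOS: worker processes are started fresh and must import the submitted function, which fails for functions defined in a notebook cell. A minimal sketch of the usual workaround is to put the code in a `.py` script with a `__main__` guard. `count_digits` is a hypothetical CPU-bound task, not the scraper.

```python
from concurrent.futures import ProcessPoolExecutor


def count_digits(n):
    """Hypothetical CPU-bound task: number of digits in n factorial."""
    product = 1
    for i in range(2, n + 1):
        product *= i
    return len(str(product))


def main():
    # Worker processes import this module, so the task must live at module level.
    with ProcessPoolExecutor(max_workers=2) as executor:
        results = list(executor.map(count_digits, [5, 10, 20]))
    print(results)


if __name__ == "__main__":
    # The guard stops worker processes from re-running main() when they import the module.
    main()
```

For I/O-bound work such as scraping, a thread pool is usually sufficient; process pools pay off mainly for CPU-bound tasks.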