## **ThreadPoolExecutor for Web Scraping**

### What is ThreadPoolExecutor?
`ThreadPoolExecutor` is a Python class in the `concurrent.futures` module that allows you to manage a pool of threads efficiently. It simplifies multithreading by allowing you to run multiple tasks concurrently, making it ideal for I/O-bound tasks like web scraping.
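
As a minimal sketch of the pool in action, the snippet below hands a single task to the pool with `executor.submit()` and reads its result from the returned `Future`. The `fetch_length` function is a hypothetical stand-in for real scraping work, not part of any library.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_length(text):
    # Hypothetical stand-in for real work such as downloading a page
    return len(text)

with ThreadPoolExecutor(max_workers=2) as executor:
    # submit() schedules the call and immediately returns a Future
    future = executor.submit(fetch_length, "hello")
    # result() blocks until the task has finished, then returns its value
    length = future.result()

print(length)
```

Leaving the `with` block waits for all submitted tasks and shuts the pool down, so no manual thread cleanup is needed.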

---

### Why Use ThreadPoolExecutor in Web Scraping?
When web scraping, most of the time is spent waiting for server responses (I/O). Using `ThreadPoolExecutor` enables you to:
- Scrape multiple pages concurrently.
- Reduce overall execution time.
- Put time that would otherwise be spent waiting on the network to use.
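
A rough way to reason about the time saved is to count how many "rounds" of tasks the pool needs. The sketch below assumes every task takes the same time and ignores thread start-up and scheduling overhead; `estimate_wall_time` is a hypothetical helper written for this illustration.

```python
import math

def estimate_wall_time(num_tasks, seconds_per_task, max_workers):
    # Each round runs up to max_workers tasks side by side,
    # so the pool needs ceil(num_tasks / max_workers) rounds.
    rounds = math.ceil(num_tasks / max_workers)
    return rounds * seconds_per_task

# Five 2-second tasks, one after another (equivalent to a single worker)
sequential = estimate_wall_time(5, 2, max_workers=1)

# The same five tasks shared among three workers
pooled = estimate_wall_time(5, 2, max_workers=3)

print(f"sequential: ~{sequential}s, pooled: ~{pooled}s")
```

Real scraping times vary per request, so treat the estimate as a lower bound on the pooled run rather than a prediction.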

---

### Basic Example
Here’s a simple example of using `ThreadPoolExecutor` to scrape multiple URLs:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def scrape_page(url):
    print(f"Scraping: {url}")
    time.sleep(2)  # Simulates a delay for I/O-bound tasks
    return f"Data from {url}"

urls = [f"https://example.com/page{i}" for i in range(1, 6)]

with ThreadPoolExecutor(max_workers=3) as executor:  # 3 worker threads
    results = list(executor.map(scrape_page, urls))

print("Scraping completed!")
```
- **`max_workers=3`**: Allows up to 3 worker threads, so at most 3 URLs are scraped at the same time.
- **`executor.map()`**: Calls `scrape_page` on each URL in the list and returns the results in the same order as the input.
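
To illustrate how `executor.map()` orders its results, here is a small sketch in which later inputs finish first. The `slow_echo` function and the delay values are made up for illustration, and `as_completed` is shown only as a contrast to `map()`.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def slow_echo(delay):
    # Hypothetical task: sleep for `delay` seconds, then return the delay
    time.sleep(delay)
    return delay

delays = [0.3, 0.1, 0.2]

with ThreadPoolExecutor(max_workers=3) as executor:
    # map() yields results in input order, regardless of which task finishes first
    mapped = list(executor.map(slow_echo, delays))

    # submit() plus as_completed() yields each future as soon as it finishes
    futures = [executor.submit(slow_echo, d) for d in delays]
    completed = [f.result() for f in as_completed(futures)]

print("map order:         ", mapped)
print("as_completed order:", completed)
```

With delays this far apart, `completed` will usually come back shortest first, but that order depends on thread scheduling and is not guaranteed.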

---

### Benefits of Using ThreadPoolExecutor
- **Concurrency**: Reduces execution time for I/O-bound tasks.
- **Simple API**: Easy to use compared to manually managing threads.
- **Scalability**: Handles many tasks efficiently by reusing threads.
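
Thread reuse can be observed directly by recording which thread runs each task. This is a sketch under simple assumptions: `record_thread` is a hypothetical task, and the short sleep only stands in for I/O.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def record_thread(task_id):
    # Hypothetical task: pause briefly, then report which thread ran it
    time.sleep(0.05)
    return threading.current_thread().name

with ThreadPoolExecutor(max_workers=3) as executor:
    thread_names = list(executor.map(record_thread, range(12)))

# Many tasks, but never more distinct threads than max_workers
unique_threads = set(thread_names)
print(f"{len(thread_names)} tasks ran on {len(unique_threads)} thread(s)")
```

The number of distinct thread names never exceeds `max_workers`, because the pool hands queued tasks to threads that have finished their previous task instead of starting a new thread per task.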

---

## **Class Activity**

In [None]:
""" 
Objective: Measure the execution time of a basic loop
"""
import time


# Simulating an I/O-bound scraping with time.sleep()
def io_bound_scraping(scraping_id):
    print(f"scraping {scraping_id} started.")
    time.sleep(2)  # Simulate I/O operation
    print(f"scraping {scraping_id} completed.")

# Using basic loop
def main():
    # TODO: Fill in main() to run io_bound_scraping 5 times
    # TODO: (Optional) Measure the execution time
    pass
        
# Call the main function to run the scrapings
if __name__ == "__main__":
    main()


In [None]:
""" 
Objective: Compare execution time for different numbers of workers
"""
import time
from concurrent.futures import ThreadPoolExecutor

# Simulating an I/O-bound scraping with time.sleep()
def io_bound_scraping(scraping_id):
    print(f"scraping {scraping_id} started.")
    time.sleep(2)  # Simulate I/O operation
    print(f"scraping {scraping_id} completed.")

# Using ThreadPoolExecutor with map for faster execution
def main():
    # TODO: Create a ThreadPoolExecutor with 1 thread (adjust max_workers below)
    with ThreadPoolExecutor(max_workers=3) as executor:
        # Use map to run the scrapings concurrently
        executor.map(io_bound_scraping, range(1, 5))  # scraping IDs 1 to 4

# Call the main function to run the scrapings
if __name__ == "__main__":
    main()


In [None]:
""" 
Objective: Compare execution time for different numbers of workers
"""
# TODO: Recreate the previous code with 2 workers

In [None]:
""" 
Objective: Compare execution time for different numbers of workers
"""
# TODO: Recreate the previous code with 4 workers

In [None]:
""" 
Objective: Compare execution time for different numbers of workers
"""
# TODO: Recreate the previous code with 500 workers for 1000 scrapings
# TODO: Analyze how your program manages to run 500 workers at once

In [None]:
""" 
Objective: Run a function with two or more parameters concurrently
"""
# TODO: Import the necessary packages
# TODO: Create a function that simulates an I/O-bound task with 2 parameters: task_id and delay
# TODO: Create a list of task IDs and a list of delay times
# TODO: Run your function with multi-threading by mapping it over all the parameters

import time
from concurrent.futures import ThreadPoolExecutor

def io_simulation_task(task_id, delay):
    """
    Simulates an I/O-bound task with a delay.
    """
    print(f"Task {task_id} started.")
    time.sleep(delay)
    print(f"Task {task_id} completed.")

def run_with_four_threads(tasks, delay):
    """
    Executes tasks using four threads.
    """
    start_time = time.time()
    with ThreadPoolExecutor(max_workers=4) as executor:
        executor.map(io_simulation_task, tasks, [delay] * len(tasks))
    end_time = time.time()
    print(f"Execution time with 4 threads: {end_time - start_time:.2f} seconds")

if __name__ == "__main__":
    tasks = list(range(8))  # Simulating 8 tasks
    delay = 2  # Each task takes 2 seconds

    print("\nRunning with 4 threads:")
    run_with_four_threads(tasks, delay)


In [None]:
"""
Homework Assignment: Improve the previous code.
Instead of creating a list of delay times, pair each task_id with a constant delay
using a lambda.
"""
import time
from concurrent.futures import ThreadPoolExecutor

def io_simulation_task(task_id, delay):
    """
    Simulates an I/O-bound task by sleeping for a given delay.
    """
    print(f"Task {task_id} starting...")
    time.sleep(delay)  # Simulates an I/O delay
    print(f"Task {task_id} completed.")
    return f"Result of task {task_id}"

def measure_execution_time(thread_count, tasks, delay):
    """
    Measure the execution time of the tasks using the specified number of threads.
    """
    start_time = time.time()
    with ThreadPoolExecutor(max_workers=thread_count) as executor:
        results = list(executor.map(lambda task: io_simulation_task(task, delay), tasks))
    end_time = time.time()
    print(f"Execution with {thread_count} threads took {end_time - start_time:.2f} seconds.")
    return results

if __name__ == "__main__":
    num_tasks = 8
    delay_per_task = 2  # seconds
    tasks = range(1, num_tasks + 1)

    print("\n--- Four Threads Execution ---")
    measure_execution_time(4, tasks, delay_per_task)


--- Four Threads Execution ---
Task 1 starting...
Task 2 starting...
Task 3 starting...
Task 4 starting...
Task 1 completed.
Task 5 starting...
Task 2 completed.
Task 6 starting...
Task 3 completed.
Task 7 starting...
Task 4 completed.
Task 8 starting...
Task 5 completed.
Task 6 completed.
Task 7 completed.
Task 8 completed.
Execution with 4 threads took 4.01 seconds.
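
As an alternative to the lambda, the constant delay can also be bound with `functools.partial` from the standard library. The sketch below reuses the `io_simulation_task` name from the cell above but trims the prints and shortens the delay to 0.1 seconds so it runs quickly; `task_with_delay` is a name introduced here for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def io_simulation_task(task_id, delay):
    # Simulates an I/O-bound task by sleeping for `delay` seconds
    time.sleep(delay)
    return f"Result of task {task_id}"

# partial() fixes delay=0.1, leaving task_id as the only remaining argument
task_with_delay = partial(io_simulation_task, delay=0.1)

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(task_with_delay, range(1, 9)))

print(results)
```

Both approaches behave the same here; `partial` simply names the bound function, which can read more clearly once several arguments are fixed.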


In [None]:
""" 
Objective: Implement multi-threading in web scraping
"""
# TODO: Implement multi-threading in your bookstoscrape project on a new branch
# TODO: Put the GitHub URL here for grading

In [None]:
""" 
Objective: Implement multi-threading in web scraping
"""
# TODO: Pick any news site you like: Tribun, Detik, BBC, nytimes, etc.
# TODO: Extract data from the site and save it as CSV
# TODO: Push to GitHub and put the link here