![Ironhack logo](https://i.imgur.com/1QgrNNw.png)

# Lab | Parallelization

## Introduction

This lab combines parallelization with other topics you have learned in the Intermediate Python module of this program (list comprehensions, the requests library, functional programming, web scraping, etc.). You will write code that extracts a list of links from a web page, requests each URL, and indexes the page referenced by each link (here, by counting the links it contains), first sequentially and then in parallel.

## Resources

- [Multiprocessing Library Documentation](https://docs.python.org/3/library/multiprocessing.html?highlight=multiprocessing#module-multiprocessing)
- [Python Parallel Computing (in 60 Seconds or less)](https://dbader.org/blog/python-parallel-computing-in-60-seconds)
- [Python Multiprocessing: Pool vs Process – Comparative Analysis](https://www.ellicium.com/python-multiprocessing-pool-process/)

## Step 1: Use the requests library to retrieve the content from the URL below.

In [18]:
import requests
url = 'https://en.wikipedia.org/wiki/Data_science'

In [19]:
html = requests.get(url).content

## Step 2: Use BeautifulSoup to extract a list of all the unique links on the page.

In [20]:
from bs4 import BeautifulSoup

In [None]:
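# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(html, 'html.parser')
link_tags = soup.find_all('a', href=True)
# A set comprehension drops duplicate hrefs
links = list({link['href'] for link in link_tags})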

## Step 3: Use list comprehensions with conditions to clean the link list.

Create a list that keeps only the absolute links (those starting with `http`) and removes any that contain a percent sign (%).

In [None]:
absolute_links = [link for link in links if link.startswith('http') and '%' not in link]
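
As a small illustration of the filter above, the following sketch applies the same two conditions to a hand-written list. The sample links and the names `sample_links` and `sample_absolute` are made up for this example and are not taken from the Wikipedia page.

```python
# Hypothetical sample of hrefs, similar in shape to what Step 2 returns
sample_links = [
    'https://en.wikipedia.org/wiki/Statistics',   # absolute: kept
    '/wiki/Machine_learning',                     # relative: dropped
    '#cite_note-1',                               # page anchor: dropped
    'https://example.org/a%20b',                  # contains %: dropped
    'http://example.org/data',                    # absolute: kept
]

# Same conditions as in the lab: starts with 'http' and has no '%'
sample_absolute = [
    link for link in sample_links
    if link.startswith('http') and '%' not in link
]

print(sample_absolute)
```

If you would rather keep relative links than drop them, `urllib.parse.urljoin(url, link)` can resolve them against the page URL, although the lab does not require it.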

## Step 4: Write a function called crawl_page that accepts a link and does the following.

- Request the content of the page referenced by that link.
- Create a soup with the request content.
- Extract a list of links.
- Return the count of links in the page.

In [21]:
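def crawl_page(url):
    # Imports live inside the function so it stays self-contained when run in a worker process
    import requests
    from bs4 import BeautifulSoup

    # Fetch the page and parse it with an explicitly named parser
    html = requests.get(url).content
    soup = BeautifulSoup(html, 'html.parser')
    # Count the <a> tags that carry an href attribute
    link_tags = soup.find_all('a', href=True)
    return len(link_tags)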
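
The parsing half of `crawl_page` can be checked without any network access. The sketch below counts `<a>` tags that have an `href` in a hard-coded HTML string. It uses the standard library's `html.parser` rather than BeautifulSoup, so it is a stand-in for the function above and not a copy of it. `LinkCounter`, `count_links`, and `sample_html` are names invented for this example.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Counts <a> start tags that carry an href attribute."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a' and any(name == 'href' for name, _ in attrs):
            self.count += 1


def count_links(html_text):
    # Hypothetical helper: parse a string of HTML and return the link count
    parser = LinkCounter()
    parser.feed(html_text)
    return parser.count


# Two anchors have an href; the middle one does not
sample_html = (
    '<p><a href="/a">A</a> '
    '<a name="x">no href</a> '
    '<a href="http://example.org">B</a></p>'
)
print(count_links(sample_html))
```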

## Step 5: Sequentially loop through the list of links, running the crawl_page function each time and saving the results in a list.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

In [15]:
%%time
result_list = []
for url in absolute_links:
    result_list.append(crawl_page(url))

print(result_list)

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


[99, 0, 12, 171, 361, 125, 171, 271, 70, 0, 116, 315, 1, 274, 229, 298, 0, 69, 21, 4, 5, 0, 151, 130, 75, 67, 366, 2, 161, 161, 172, 111, 71, 369, 219, 1, 4, 80, 100, 115, 157, 484, 39, 263, 174, 83, 112, 273, 0, 258, 27, 280, 140, 272, 146, 100, 233, 119]
Wall time: 1min 51s
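
`%%time` is an IPython cell magic, so it is only available inside a notebook or IPython session. In a plain Python script, a rough equivalent can be sketched with `time.perf_counter`. Here `fake_crawl` is a hypothetical stand-in that sleeps instead of requesting a page, so the example runs without network access. The URLs are placeholders.

```python
import time


def fake_crawl(url):
    # Hypothetical stand-in for crawl_page: sleeps briefly instead of
    # making a network request, then returns a dummy "link count"
    time.sleep(0.05)
    return len(url)


urls = ['http://example.org/a', 'http://example.org/bb', 'http://example.org/ccc']

# Measure wall-clock time around the sequential loop
start = time.perf_counter()
sequential_result = [fake_crawl(u) for u in urls]
elapsed = time.perf_counter() - start

print(sequential_result)
print(f'Wall time: {elapsed:.2f} s')
```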


## Step 6: Perform the page indexing in parallel and note the difference in performance.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

In [22]:
#import multiprocessing
import multiprocess
# If you are using a Mac, use the multiprocessing library instead

In [23]:
%%time
# pool = multiprocessing.Pool()
# The context manager terminates the pool once the block exits
with multiprocess.Pool() as pool:
    result = pool.map(crawl_page, absolute_links)

print(result)

Wall time: 31 s
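
Since `crawl_page` likely spends much of its time waiting on network responses, a thread pool is another option worth comparing. The sketch below uses `multiprocessing.pool.ThreadPool`, which offers the same `map` interface as `Pool` but runs threads instead of processes, so it is a variation on what the lab asks for and not the lab's own solution. `fake_crawl` is a hypothetical stand-in that sleeps instead of requesting a page, and the URLs are placeholders.

```python
import time
from multiprocessing.pool import ThreadPool


def fake_crawl(url):
    # Hypothetical stand-in for crawl_page: simulates waiting on a response
    time.sleep(0.1)
    return len(url)


urls = [f'http://example.org/page{i}' for i in range(8)]

# Sequential baseline
start = time.perf_counter()
sequential = [fake_crawl(u) for u in urls]
sequential_time = time.perf_counter() - start

# Same work spread over four threads; map returns results in input order
start = time.perf_counter()
with ThreadPool(4) as pool:
    parallel = pool.map(fake_crawl, urls)
parallel_time = time.perf_counter() - start

print(f'Sequential: {sequential_time:.2f} s, parallel: {parallel_time:.2f} s')
```

Because `map` preserves input order, `parallel` matches `sequential` element for element, which makes it easy to confirm that both approaches produce the same result.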
