![Ironhack logo](https://i.imgur.com/1QgrNNw.png)

# Lab | Parallelization

## Introduction

This lab combines parallelization with other topics from the Intermediate Python module of this program (list comprehensions, the requests library, functional programming, web scraping, etc.). You will write code that extracts a list of links from a web page, requests each URL, and then crawls the page referenced by each link, both sequentially and in parallel.

## Resources

- [Multiprocessing Library Documentation](https://docs.python.org/3/library/multiprocessing.html?highlight=multiprocessing#module-multiprocessing)
- [Python Parallel Computing (in 60 Seconds or less)](https://dbader.org/blog/python-parallel-computing-in-60-seconds)
- [Python Multiprocessing: Pool vs Process – Comparative Analysis](https://www.ellicium.com/python-multiprocessing-pool-process/)

## Step 1: Use the requests library to retrieve the content from the URL below.

In [1]:
import pandas as pd
import numpy as np 
import requests
url = 'https://en.wikipedia.org/wiki/Data_science'

In [56]:
# your code here
get_html = requests.get(url)
html_content = get_html.content

## Step 2: Use BeautifulSoup to extract a list of all the unique links on the page.

In [10]:
from bs4 import BeautifulSoup

In [30]:
# your code here
soup = BeautifulSoup(html_content, "lxml")
# Keep only <a> tags that actually carry an href attribute
links = soup.find_all('a', href=True)
list_links = [link['href'] for link in links]
list_links

['#mw-head',
 '#searchInput',
 '/wiki/Information_science',
 '/wiki/File:PIA23792-1600x1200(1).jpg',
 '/wiki/File:PIA23792-1600x1200(1).jpg',
 '/wiki/Comet_NEOWISE',
 '/wiki/Astronomical_survey',
 '/wiki/Space_telescope',
 '/wiki/Wide-field_Infrared_Survey_Explorer',
 '/wiki/Machine_learning',
 '/wiki/Data_mining',
 '/wiki/File:Kernel_Machine.svg',
 '/wiki/Statistical_classification',
 '/wiki/Cluster_analysis',
 '/wiki/Regression_analysis',
 '/wiki/Anomaly_detection',
 '/wiki/Data_Cleaning',
 '/wiki/Automated_machine_learning',
 '/wiki/Association_rule_learning',
 '/wiki/Reinforcement_learning',
 '/wiki/Structured_prediction',
 '/wiki/Feature_engineering',
 '/wiki/Feature_learning',
 '/wiki/Online_machine_learning',
 '/wiki/Semi-supervised_learning',
 '/wiki/Unsupervised_learning',
 '/wiki/Learning_to_rank',
 '/wiki/Grammar_induction',
 '/wiki/Supervised_learning',
 '/wiki/Statistical_classification',
 '/wiki/Regression_analysis',
 '/wiki/Decision_tree_learning',
 '/wiki/Ensemble_learn
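
The step asks for unique links, but the output above still repeats some entries (for example, the `File:PIA23792` link appears twice). A minimal sketch of one way to drop repeats while keeping the original order follows. The names `sample_links` and `unique_links` are illustrative, and the sample is a short hand-picked subset of the output above, not the full list.

```python
# Short sample copied from the output above; the real list_links is much longer.
sample_links = [
    '#mw-head',
    '/wiki/File:PIA23792-1600x1200(1).jpg',
    '/wiki/File:PIA23792-1600x1200(1).jpg',
    '/wiki/Statistical_classification',
    '/wiki/Regression_analysis',
    '/wiki/Statistical_classification',
]

# Dictionary keys are unique and keep insertion order (Python 3.7+),
# so this removes duplicates without reshuffling the links.
unique_links = list(dict.fromkeys(sample_links))
print(unique_links)
```

Using `set(sample_links)` would also remove duplicates, but it does not preserve the order in which the links appear on the page.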

## Step 3: Use list comprehensions with conditions to clean the link list.

Create a list of absolute links and remove any that contain a percent sign (%).

In [73]:
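# your code here
# Keep relative wiki links only, skip percent-encoded and .jpg links,
# and prefix each one with the Wikipedia domain.
clean_links = [
    'https://en.wikipedia.org' + link
    for link in list_links
    if 'wiki/' in link and 'http' not in link and '%' not in link and '.jpg' not in link
]
clean_links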

['https://en.wikipedia.org/wiki/Information_science',
 'https://en.wikipedia.org/wiki/Comet_NEOWISE',
 'https://en.wikipedia.org/wiki/Astronomical_survey',
 'https://en.wikipedia.org/wiki/Space_telescope',
 'https://en.wikipedia.org/wiki/Wide-field_Infrared_Survey_Explorer',
 'https://en.wikipedia.org/wiki/Machine_learning',
 'https://en.wikipedia.org/wiki/Data_mining',
 'https://en.wikipedia.org/wiki/File:Kernel_Machine.svg',
 'https://en.wikipedia.org/wiki/Statistical_classification',
 'https://en.wikipedia.org/wiki/Cluster_analysis',
 'https://en.wikipedia.org/wiki/Regression_analysis',
 'https://en.wikipedia.org/wiki/Anomaly_detection',
 'https://en.wikipedia.org/wiki/Data_Cleaning',
 'https://en.wikipedia.org/wiki/Automated_machine_learning',
 'https://en.wikipedia.org/wiki/Association_rule_learning',
 'https://en.wikipedia.org/wiki/Reinforcement_learning',
 'https://en.wikipedia.org/wiki/Structured_prediction',
 'https://en.wikipedia.org/wiki/Feature_engineering',
 'https://en.wi
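
As an alternative to concatenating the domain by hand, the standard library's `urllib.parse.urljoin` can resolve a relative href against the page URL. The sketch below assumes a small made-up sample (`sample_links`, including a hypothetical `example.org` URL and a hypothetical percent-encoded link), and it filters with `startswith('/wiki/')`, which is slightly stricter than the condition used in the cell above.

```python
from urllib.parse import urljoin

base_url = 'https://en.wikipedia.org/wiki/Data_science'

# Illustrative sample: the last two entries are made up for this sketch.
sample_links = [
    '#mw-head',
    '/wiki/Machine_learning',
    '/wiki/File:Kernel_Machine.svg',
    'https://example.org/external',
    '/wiki/Some_%26_page',
]

# Keep relative wiki links without a percent sign, then make them absolute.
absolute_links = [
    urljoin(base_url, link)
    for link in sample_links
    if link.startswith('/wiki/') and '%' not in link
]
print(absolute_links)
```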

## Step 4: Write a function called crawl_page that accepts a link and does the following.

- Request the content of the page referenced by that link.
- Create a soup with the request content.
- Extract a list of links.
- Return the count of links in the page.

In [74]:
# your code here
def crawl_page(url):
    # Download the page and parse it
    get_html = requests.get(url)
    html_content = get_html.content
    soup = BeautifulSoup(html_content, "lxml")
    # Collect every href on the page and return how many were found
    links = soup.find_all('a', href=True)
    listlinks = [link['href'] for link in links]
    return len(listlinks)
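
The counting part of the steps above can be checked without any network access by running it on a fixed HTML string. The sketch below is not the BeautifulSoup approach used in this lab: it swaps in the standard library's `html.parser` so that it has no third-party dependencies, and `LinkCounter`, `count_links`, and `sample_html` are hypothetical names introduced only for illustration.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Hypothetical helper: counts <a> tags that carry an href attribute."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a' and any(name == 'href' for name, _ in attrs):
            self.count += 1


def count_links(html_text):
    """Return the number of <a href=...> tags in an HTML string."""
    parser = LinkCounter()
    parser.feed(html_text)
    return parser.count


# Two anchors have an href, one does not.
sample_html = (
    '<p><a href="/wiki/A">A</a> '
    '<a name="x">no href</a> '
    '<a href="#top">top</a></p>'
)
print(count_links(sample_html))
```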

## Step 5: Sequentially loop through the list of links, running the crawl_page function each time and saving the results in a list.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.
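
`%%time` only works inside a notebook cell. Outside a notebook, a similar wall-time measurement can be sketched with `time.perf_counter`. The helper name `timed` is an assumption made for this example, and `sum` stands in for whatever work you want to measure.

```python
import time


def timed(func, *args):
    """Hypothetical helper: run func(*args) and return (result, seconds elapsed)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


# sum() is only a placeholder workload for the sketch.
result, elapsed = timed(sum, range(1_000_000))
print(f"Wall time: {elapsed:.3f} s")
```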

In [75]:
%%time
# your code here
# Crawl each link one after another and collect the link counts
totallinks = []
for link in clean_links:
    totallinks.append(crawl_page(link))
totallinks

Wall time: 3min 10s


[857,
 977,
 291,
 870,
 1558,
 1777,
 1029,
 235,
 631,
 1252,
 1305,
 606,
 219,
 313,
 540,
 869,
 241,
 326,
 364,
 314,
 495,
 552,
 651,
 337,
 501,
 631,
 1305,
 607,
 757,
 370,
 416,
 633,
 375,
 1092,
 399,
 2212,
 1344,
 492,
 210,
 764,
 1252,
 219,
 205,
 414,
 666,
 382,
 275,
 297,
 423,
 963,
 698,
 545,
 857,
 1034,
 1315,
 238,
 248,
 241,
 678,
 618,
 429,
 787,
 606,
 375,
 289,
 2212,
 836,
 244,
 2310,
 360,
 372,
 1222,
 823,
 149,
 210,
 367,
 799,
 500,
 1567,
 288,
 496,
 212,
 480,
 89,
 608,
 869,
 469,
 324,
 304,
 375,
 214,
 222,
 223,
 344,
 458,
 360,
 165,
 253,
 4176,
 2978,
 1473,
 181,
 223,
 700,
 1279,
 282,
 1029,
 1777,
 1555,
 1658,
 1503,
 568,
 1618,
 1658,
 1482,
 857,
 95,
 827,
 514,
 158,
 408,
 686,
 751,
 202,
 1555,
 368,
 97,
 1280,
 1447,
 171,
 180,
 198,
 1352,
 1777,
 938,
 1522,
 139,
 258,
 3456,
 365,
 1147,
 257,
 757,
 494,
 257,
 146,
 321,
 198,
 309,
 161,
 192,
 1555,
 1658,
 1092,
 1344,
 325,
 764,
 1252,
 423,
 1777,
 

## Step 6: Use the multiprocessing library to run the crawl_page function on the list of links in parallel, and save the results in a list.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

In [None]:
# import multiprocessing
import multiprocess
# If you are using a Mac, use the multiprocessing library instead.

In [None]:
%%time
# your code here
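
One way to run a function over many links at once is a pool with a `map` method. The sketch below is an illustration under stated assumptions, not the expected solution: it uses `multiprocessing.pool.ThreadPool` (threads rather than processes, which sidesteps pickling issues in notebooks), and `fake_crawl` and `sample_links` are hypothetical stand-ins so that the example runs without network access.

```python
import time
from multiprocessing.pool import ThreadPool


def fake_crawl(link):
    """Hypothetical stand-in for crawl_page: waits briefly, then returns a number."""
    time.sleep(0.1)  # simulates waiting on a network response
    return len(link)


# Made-up URLs for the sketch; in the lab this would be clean_links.
sample_links = [f'https://example.org/page{i}' for i in range(8)]

start = time.perf_counter()
# Four workers share the list; map returns results in input order.
with ThreadPool(4) as pool:
    parallel_counts = pool.map(fake_crawl, sample_links)
parallel_time = time.perf_counter() - start

print(parallel_counts)
print(f"Wall time: {parallel_time:.2f} s")
```

`multiprocessing.Pool` exposes the same `map` method, so swapping the class in is a small change. With process-based pools, the function usually has to be importable or picklable by the worker processes, which is one reason notebooks often reach for `multiprocess`.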
