# Parallelization Lab

In this lab, you will use several concepts you have learned to obtain a list of links from a web page, then crawl and index the pages those links reference, both sequentially and in parallel. Follow the steps below to complete the lab.

### **Step 1: Use the requests library to retrieve the content from the URL below.**

In [None]:
import requests

url = 'https://en.wikipedia.org/wiki/Data_science'

In [None]:
html = requests.get(url).content
type(html)

bytes

### **Step 2: Use BeautifulSoup to extract a list of all the unique links on the page.**

In [None]:
from bs4 import BeautifulSoup

In [None]:
soup = BeautifulSoup(html, "lxml")
anchors = soup.find_all('a')

# Keep only the anchors that actually carry an href attribute
links = [link['href'] for link in anchors if 'href' in link.attrs]
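To see why the `'href' in link.attrs` check matters, here is a minimal offline sketch that collects links from a hand-written HTML snippet. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup so it runs without extra installs; the `LinkCollector` class and `sample_html` string are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:  # skip anchors such as <a name="top">
                self.links.append(href)


# Hand-written sample; the second anchor has no href attribute
sample_html = (
    '<p><a href="/wiki/Statistics">Statistics</a> '
    '<a name="top">anchor</a> '
    '<a href="https://example.org/">external</a></p>'
)

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)  # ['/wiki/Statistics', 'https://example.org/']
```

Without the check, indexing `link['href']` on the second anchor would raise a `KeyError` in BeautifulSoup.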

### **Step 3: Use list comprehensions with conditions to clean the link list.**

There are two types of links, absolute and relative. Absolute links have the full URL and begin with http while relative links begin with a forward slash (/) and point to an internal page within the wikipedia.org domain. Clean the respective types of URLs as follows.

Absolute Links: Create a list of these and remove any that contain a percentage sign (%).

Relative Links: Create a list of these, add the domain to the link so that you have the full URL, and remove any that contain a percentage sign (%).

Combine the lists of absolute and relative links and ensure there are no duplicates.
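As a minimal sketch of the cleaning rules above, the snippet below applies them to a small hand-written list. The sample links and the `base_url` value are assumptions for illustration. It uses `urllib.parse.urljoin` rather than string concatenation, which is one possible alternative: it also resolves protocol-relative links such as `//upload.wikimedia.org/...` correctly.

```python
from urllib.parse import urljoin

# Page the links were scraped from (assumed for this example)
base_url = 'https://en.wikipedia.org/wiki/Data_science'

# Hand-written sample covering the cases described above
sample_links = [
    'https://www.example.org/report',   # absolute
    '/wiki/Statistics',                 # relative
    '/wiki/Statistics',                 # duplicate
    '/wiki/Data_%28computing%29',       # contains %, dropped
    '//upload.wikimedia.org/logo.png',  # protocol-relative
    '#cite_note-1',                     # fragment only, ignored
]

absolute = [l for l in sample_links if l.startswith('http') and '%' not in l]
relative = [urljoin(base_url, l) for l in sample_links
            if l.startswith('/') and '%' not in l]

# A set removes duplicates; sorting just makes the output stable
cleaned = sorted(set(absolute + relative))
print(cleaned)
```

This leaves three URLs: the Statistics article, the logo on `upload.wikimedia.org`, and the external report.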

In [None]:
domain = 'http://wikipedia.org'

In [None]:
# Absolute links begin with http; drop any that contain a percent sign
absolute_links = [link for link in links if link.startswith('http') and '%' not in link]
len(absolute_links)

61

In [None]:
relative_links = [domain+link for link in links if link.startswith('/') and ('%' not in link)]
len(relative_links)

290

Combine both lists and remove duplicates by converting to a set.

In [None]:
cleaned_links = absolute_links + relative_links
cleaned_links = list(set(cleaned_links))

In [None]:
len(cleaned_links)

308

### **Step 4: Use the os library to create a folder called wikipedia and make that the current working directory.**

In [None]:
import os

In [None]:
directory = 'wikipedia'
parent_dir = './'
path = os.path.join(parent_dir, directory)
path

'./wikipedia'

In [None]:
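# exist_ok=True avoids an error if the cell is re-run and the folder already exists
os.makedirs(path, exist_ok=True)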

In [None]:
os.chdir(path)

### **Step 5: Write a function called index_page that accepts a link and does the following.**

Tries to request the content of the page referenced by that link.

Slugifies the filename using the slugify function from the python-slugify library and adds a .html file extension.

  If you don't already have the python-slugify library installed, you can install it with pip as follows:
  $ pip3 install python-slugify

  To import the slugify function, do the following: from slugify import slugify

  You can then slugify a link as follows: slugify(link)

Creates a file in the wikipedia folder using the slugified filename and writes the contents of the page to the file.

If an exception occurs during the process above, just pass.
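The steps above can be sketched offline as follows. This is only an illustration under stated assumptions: `simple_slug` is a hypothetical regex-based stand-in for `slugify` (it is not the python-slugify algorithm, and may differ from it on non-ASCII input), `save_page` is a made-up helper that takes the page content as an argument instead of downloading it, and the example URL is invented. It writes into a temporary directory so it does not touch the wikipedia folder.

```python
import os
import re
import tempfile


def simple_slug(text):
    """Hypothetical stand-in for slugify: lowercase, other characters become hyphens."""
    return re.sub(r'[^a-z0-9]+', '-', text.lower()).strip('-')


def save_page(link, content, folder):
    """Write content to <folder>/<slug>.html; return the path, or None on failure."""
    try:
        path = os.path.join(folder, simple_slug(link) + '.html')
        with open(path, 'w', encoding='utf-8') as f:
            f.write(content)
        return path
    except OSError:
        # Mirrors the "just pass" instruction: a failed write is skipped
        return None


with tempfile.TemporaryDirectory() as folder:
    saved = save_page('https://www.example.org/some/page', '<html></html>', folder)
    saved_name = os.path.basename(saved)
    saved_exists = os.path.exists(saved)
    print(saved_name)  # https-www-example-org-some-page.html
```

Catching a specific exception type, rather than using a bare `except`, keeps keyboard interrupts and programming errors visible while still skipping pages that fail.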

In [None]:
from slugify import slugify

In [None]:
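import urllib.request

def index_page(link):
    try:
        # Build a filesystem-safe file name from the link
        file_name = slugify(link) + '.html'
        # Download the page and decode it as UTF-8 text
        response = urllib.request.urlopen(link)
        web_content = response.read().decode('utf-8')
        # Write the page into the current working directory (the wikipedia folder)
        with open(file_name, 'w', encoding='utf-8') as file:
            file.write(web_content)
    except Exception:
        # Skip pages that cannot be fetched, decoded, or written
        pass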

In [None]:
for url in cleaned_links:
    index_page(url)

'https-www-nsf-gov-pubs-2005-nsb0540'

### **Step 6: Sequentially loop through the list of links, running the index_page function each time.**

Remember to include %%time at the beginning of the cell so that it measures the time it takes for the cell to run.
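`%%time` only works inside IPython or Jupyter. If you run the same loop as a plain Python script, a rough equivalent can be built with `time.perf_counter`. The sketch below is an assumption-laden illustration: `timed` and `fake_crawl` are hypothetical helpers, and `time.sleep` stands in for the network request so the example runs offline.

```python
import time


def timed(func, *args):
    """Call func(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_crawl(n):
    """Stand-in for a sequential crawl: sleep briefly once per 'page'."""
    for _ in range(n):
        time.sleep(0.01)
    return n


result, elapsed = timed(fake_crawl, 5)
print(f'{result} pages in {elapsed:.2f} s')
```

Like the wall time reported by `%%time`, this measures elapsed real time, which is the relevant number when most of the work is waiting on the network.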

In [None]:
%%time
for url in cleaned_links:
    index_page(url)

### **Step 7: Perform the page indexing in parallel and note the difference in performance.**

Remember to include %%time at the beginning of the cell so that it measures the time it takes for the cell to run.

In [None]:
import multiprocessing

In [None]:
%%time
# The context manager shuts the pool down once the map has finished
with multiprocessing.Pool() as pool:
    result = pool.map(index_page, cleaned_links)
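Because downloading pages is mostly time spent waiting on the network, a thread pool is another option worth comparing against `multiprocessing.Pool`. The sketch below is not part of the lab's required solution: `fake_index_page` is a hypothetical stand-in that sleeps instead of making a request, and the sample URLs are invented, so the comparison runs offline.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_index_page(link):
    """Stand-in for index_page: wait as if on the network, then return a value."""
    time.sleep(0.05)
    return len(link)


sample_links = [f'https://www.example.org/page{i}' for i in range(8)]

# Sequential baseline
start = time.perf_counter()
sequential = [fake_index_page(link) for link in sample_links]
sequential_time = time.perf_counter() - start

# Same work spread over eight threads; executor.map preserves input order
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as executor:
    threaded = list(executor.map(fake_index_page, sample_links))
threaded_time = time.perf_counter() - start

print(f'sequential: {sequential_time:.2f} s, threaded: {threaded_time:.2f} s')
```

With real requests the speedup depends on network latency and on how many simultaneous connections the server tolerates, so treat the numbers from this sketch as illustrative only.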