# Parallelization Lab

In this lab, you will apply several concepts you have learned to obtain a list of links from a web page, then crawl and index the pages those links reference, both sequentially and in parallel. Follow the steps below to complete the lab.

### Step 1: Use the requests library to retrieve the content from the URL below.

In [5]:
import requests
import re
from tqdm import tqdm
import time

url = 'https://en.wikipedia.org/wiki/Data_science'
response = requests.get(url).content

In [6]:
# your code here
response = requests.get(url).content

### Step 2: Use BeautifulSoup to extract a list of all the unique links on the page.

In [7]:
from bs4 import BeautifulSoup

In [8]:
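# your code here
soup = BeautifulSoup(response, 'html.parser')
# find_all('link') returns the <link> elements from the page head
# (stylesheets, icons); hyperlinks in the page body are <a> tags.
lista = list(soup.find_all('link'))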

In [9]:
lista

[<link href="/w/load.php?lang=en&amp;modules=ext.cite.styles%7Cext.uls.interlanguage%7Cext.visualEditor.desktopArticleTarget.noscript%7Cext.wikimediaBadges%7Cmediawiki.toc.styles%7Cskins.vector.styles.legacy%7Cwikibase.client.init&amp;only=styles&amp;skin=vector" rel="stylesheet"/>,
 <link href="/w/load.php?lang=en&amp;modules=site.styles&amp;only=styles&amp;skin=vector" rel="stylesheet"/>,
 <link href="/w/index.php?title=Data_science&amp;action=edit" rel="alternate" title="Edit this page" type="application/x-wiki"/>,
 <link href="/w/index.php?title=Data_science&amp;action=edit" rel="edit" title="Edit this page"/>,
 <link href="/static/apple-touch/wikipedia.png" rel="apple-touch-icon"/>,
 <link href="/static/favicon/wikipedia.ico" rel="shortcut icon"/>,
 <link href="/w/opensearch_desc.php" rel="search" title="Wikipedia (en)" type="application/opensearchdescription+xml"/>,
 <link href="//en.wikipedia.org/w/api.php?action=rsd" rel="EditURI" type="application/rsd+xml"/>,
 <link href="//cr
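
The difference between `<link>` elements and `<a>` hyperlinks can be sketched without any network access. The example below is a minimal illustration, not the lab's reference solution: it uses the standard library's `html.parser` as a dependency-free stand-in, and `AnchorCollector` and `sample_html` are hypothetical names made up for this sketch. With BeautifulSoup, the comparable call would be `soup.find_all('a', href=True)`.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects href values from <a> tags only (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # <link> tags are ignored; only anchors carry page hyperlinks.
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


# Hypothetical page fragment with one <link> and three anchors (one repeated).
sample_html = '''
<head><link href="/static/favicon/wikipedia.ico" rel="shortcut icon"/></head>
<body>
  <a href="/wiki/Statistics">Statistics</a>
  <a href="https://creativecommons.org/licenses/by-sa/3.0/">License</a>
  <a href="/wiki/Statistics">Statistics again</a>
</body>
'''

collector = AnchorCollector()
collector.feed(sample_html)

# A set removes the repeated href; sorting makes the result deterministic.
unique_links = sorted(set(collector.hrefs))
print(unique_links)
```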

### Step 3: Use list comprehensions with conditions to clean the link list.

There are two types of links, absolute and relative. Absolute links have the full URL and begin with *http* while relative links begin with a forward slash (/) and point to an internal page within the *wikipedia.org* domain. Clean the respective types of URLs as follows.

- Absolute Links: Create a list of these and remove any that contain a percentage sign (%).
- Relative Links: Create a list of these, add the domain to the link so that you have the full URL, and remove any that contain a percentage sign (%).
- Combine the list of absolute and relative links and ensure there are no duplicates.
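
The three cleaning rules above can be sketched with list comprehensions as follows. This is an illustration under assumptions: `raw_links` is a made-up sample rather than real scraped data, and protocol-relative links (those starting with `//`) are deliberately left out of the relative list because they already name a host.

```python
# Hypothetical sample of raw href values, as they might come out of Step 2.
raw_links = [
    'https://creativecommons.org/licenses/by-sa/3.0/',
    'https://en.wikipedia.org/wiki/Special:Search?go=%20',
    '/wiki/Statistics',
    '/wiki/Data_mining',
    '/wiki/Statistics',
    '/wiki/%C3%89cole',
    '#cite_note-1',
]
domain = 'http://wikipedia.org'

# Absolute links: start with http and contain no percentage sign.
absolute = [link for link in raw_links
            if link.startswith('http') and '%' not in link]

# Relative links: start with a single slash; prepend the domain.
relative = [domain + link for link in raw_links
            if link.startswith('/') and not link.startswith('//') and '%' not in link]

# dict.fromkeys drops duplicates while keeping first-seen order.
clean_links = list(dict.fromkeys(absolute + relative))
print(clean_links)
```

Using `dict.fromkeys` instead of `set` keeps the links in a stable order, which makes later progress output easier to compare between runs.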

In [10]:
domain = 'http://wikipedia.org'
wiki = BeautifulSoup(requests.get(domain).content, 'html.parser')

In [11]:
# your code here
# Collect the href of every anchor on the page.
wik_link = [x['href'] for x in wiki.find_all('a', href=True)]
# Absolute links already carry the scheme.
absolut = [x for x in wik_link if 'https' in x]
# The remaining links are treated as protocol-relative (//host/...), so prepend the scheme.
relative = ['https:' + x for x in wik_link if 'https' not in x]
# Duplicates are kept in the combined list.
rel_abs = absolut + relative
rel_abs

['https://en.wikipedia.org/wiki/List_of_Wikipedia_mobile_applications',
 'https://creativecommons.org/licenses/by-sa/3.0/',
 'https://en.wikipedia.org/',
 'https://es.wikipedia.org/',
 'https://ja.wikipedia.org/',
 'https://de.wikipedia.org/',
 'https://ru.wikipedia.org/',
 'https://fr.wikipedia.org/',
 'https://it.wikipedia.org/',
 'https://zh.wikipedia.org/',
 'https://pt.wikipedia.org/',
 'https://pl.wikipedia.org/',
 'https://ar.wikipedia.org/',
 'https://de.wikipedia.org/',
 'https://en.wikipedia.org/',
 'https://es.wikipedia.org/',
 'https://fr.wikipedia.org/',
 'https://it.wikipedia.org/',
 'https://nl.wikipedia.org/',
 'https://ja.wikipedia.org/',
 'https://pl.wikipedia.org/',
 'https://pt.wikipedia.org/',
 'https://ru.wikipedia.org/',
 'https://ceb.wikipedia.org/',
 'https://sv.wikipedia.org/',
 'https://vi.wikipedia.org/',
 'https://war.wikipedia.org/',
 'https://zh.wikipedia.org/',
 'https://ast.wikipedia.org/',
 'https://az.wikipedia.org/',
 'https://bg.wikipedia.org/',
 'h

### Step 4: Use the os library to create a folder called *wikipedia* and make that the current working directory.

In [12]:
import os

In [27]:
# your code here
folder = 'wikipedia'
os.makedirs(folder, exist_ok=True)  # does not fail if the folder already exists
os.chdir(folder)
os.getcwd()

'C:\\Users\\LIBRE\\Desktop\\Jupyter\\IRONHACK_2020\\LABS\\marcelo_ironhack\\wikipedia'

### Step 5: Write a function called index_page that accepts a link and does the following.

- Tries to request the content of the page referenced by that link.
- Slugifies the filename using the `slugify` function from the [python-slugify](https://pypi.org/project/python-slugify/) library and adds a .html file extension.
    - If you don't already have the python-slugify library installed, you can pip install it as follows: `$ pip3 install python-slugify`.
    - To import the slugify function, you would do the following: `from slugify import slugify`.
    - You can then slugify a link as follows: `slugify(link)`.
- Creates a file in the wikipedia folder using the slugified filename and writes the contents of the page to the file.
- If an exception occurs during the process above, just `pass`.
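
The bullet points above might be sketched as follows. This is one possible shape under stated assumptions, not the lab's official answer: the `fetch` and `folder` parameters and the `fetch_bytes` helper are hypothetical additions that make the function testable offline, the default fetcher uses the standard library's `urllib` in place of `requests`, and a rough regex fallback stands in for `slugify` when python-slugify is not installed.

```python
import os
import re
import tempfile
import urllib.request

try:
    from slugify import slugify  # python-slugify, as suggested in the lab
except ImportError:
    def slugify(text):
        """Rough stand-in (assumption): lowercase, other characters become '-'."""
        return re.sub(r'[^a-z0-9]+', '-', text.lower()).strip('-')


def fetch_bytes(link):
    """Default fetcher (assumption): stdlib urllib instead of requests."""
    with urllib.request.urlopen(link, timeout=10) as resp:
        return resp.read()


def index_page(link, fetch=fetch_bytes, folder='.'):
    """Save the page behind `link` as <slugified-link>.html; return the path or None."""
    try:
        content = fetch(link)
        path = os.path.join(folder, slugify(link) + '.html')
        with open(path, 'wb') as f:  # the with block always closes the file
            f.write(content)
        return path
    except Exception:
        # The lab asks to ignore failures; None signals that nothing was written.
        return None


# Offline usage with a fake fetcher, so no network access is needed.
demo_dir = tempfile.mkdtemp()
saved = index_page('https://en.wikipedia.org/',
                   fetch=lambda link: b'<html></html>',
                   folder=demo_dir)
print(saved)
```

Returning the path (or `None`) instead of nothing keeps the "just `pass`" behaviour while still letting the caller see which links failed.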

In [24]:
from slugify import slugify


'https-creativecommons-org-licenses-by-sa-3-0.txt'

In [57]:
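# your code here
def index_page(link):
    try:
        soup = BeautifulSoup(requests.get(link).content, 'html.parser')
        # Build the filename from the slugified page title
        title = soup.find_all('title')[0].text.strip()
        slug = ''.join([slugify(title), '.html'])
        # The with block closes the file even if the write fails
        with open(slug, 'w+', encoding="utf-8") as file:
            file.write(str(soup))
    except Exception:
        pass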

### Step 6: Sequentially loop through the list of links, running the index_page function each time.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run. 

_hint: Use tqdm to keep track of the time._ 
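
In IPython, `%%time` is a cell magic and has to be the very first line of the cell, whereas `%time` is a line magic that times only the statement written on the same line. Outside a notebook, a comparable measurement can be sketched with `time.perf_counter`. In the example below, `fake_index_page` is a hypothetical stand-in that sleeps instead of downloading, so the numbers say nothing about real page fetches.

```python
import time


def fake_index_page(link):
    """Hypothetical stand-in for index_page: sleeps briefly instead of downloading."""
    time.sleep(0.01)
    return link


links = [
    'https://en.wikipedia.org/',
    'https://es.wikipedia.org/',
    'https://fr.wikipedia.org/',
]

start = time.perf_counter()
# Sequential loop: each call finishes before the next one starts.
results = [fake_index_page(link) for link in links]
elapsed = time.perf_counter() - start

print(f'{len(results)} links in {elapsed:.3f} s')
```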

In [58]:
%%time
# your code here
for x in tqdm(rel_abs):
    index_page(x)

  0%|          | 0/323 [00:00<?, ?it/s]
  0%|          | 1/323 [00:00<05:00,  1.07it/s]
...
 41%|████▏     | 132/323 [02:27<03:42,  1.16s/it]
...
 84%|████████▍ | 271/323 [05:02<01:00,  1.16s/it]
...

### Step 7: Perform the page indexing in parallel and note the difference in performance.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

Use both approaches. First, use the `multiprocess` module, which can run the function defined in the Jupyter notebook, to perform the download in parallel.

Second, create a Python file containing the download function and run it with the `multiprocessing` module.
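
A detail worth checking with either module is what gets passed to `map`: it expects the whole iterable of links and calls the function once per element, so passing a single URL string makes it iterate over the characters of that string. The sketch below illustrates this with `multiprocessing.pool.ThreadPool`, a thread-based class with the same `map` interface, chosen here only so the example runs anywhere without a separate module file. `fake_index_page` is a hypothetical stand-in that does no downloading.

```python
from multiprocessing.pool import ThreadPool  # same map/imap interface as multiprocessing.Pool


def fake_index_page(link):
    """Hypothetical stand-in for index_page: returns the argument length."""
    return len(link)


links = ['https://en.wikipedia.org/', 'https://es.wikipedia.org/']

with ThreadPool(processes=4) as pool:
    # Correct: map() receives the list and calls the function once per link.
    per_link = pool.map(fake_index_page, links)
    # Mistake: a single string is itself iterable, so map() calls the
    # function once per character of that URL.
    per_char = pool.map(fake_index_page, links[0])

print(len(per_link), len(per_char))
```

In a standalone script, `multiprocessing.Pool` can be swapped in, with the pool created under an `if __name__ == '__main__':` guard and the worker function defined at module level.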

In [60]:
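%%time
# your code here
from multiprocess import Pool

# Hand the whole list to the pool; imap_unordered yields each result
# as soon as it finishes, so tqdm can show progress.
with Pool(processes=4) as pool:
    for _ in tqdm(pool.imap_unordered(index_page, rel_abs), total=len(rel_abs)):
        pass
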
  0%|          | 0/323 [00:00<?, ?it/s]
  0%|          | 1/323 [00:00<03:01,  1.77it/s]
  1%|          | 4/323 [00:00<02:09,  2.46it/s]
...
 10%|█         | 33/323 [00:01<00:24, 11.90it/s]
...

**BONUS**: Create a function that uses the `os` module to count how many files there are in the wikipedia folder.

Before running, delete the files from the folder, then perform the above solution asynchronously.

Use your function to check how many files are being downloaded.
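
A counting helper along these lines could look like the sketch below. It is an illustration under assumptions: the `folder` parameter is a hypothetical addition, and the demo runs against a temporary directory rather than the real wikipedia folder. Returning an integer rather than a formatted string makes the result easy to compare between checks while downloads are in progress.

```python
import os
import tempfile


def count_files(folder='.'):
    """Return the number of regular files directly inside `folder`."""
    return sum(1 for entry in os.scandir(folder) if entry.is_file())


# Demo on a throwaway directory: two files and one subfolder.
demo_dir = tempfile.mkdtemp()
for name in ('a.html', 'b.html'):
    with open(os.path.join(demo_dir, name), 'w') as f:
        f.write('<html></html>')
os.mkdir(os.path.join(demo_dir, 'subfolder'))

# The subfolder is not counted, only the two files.
print(count_files(demo_dir))
```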

In [63]:
print(f"The number of files in the directory is: {len([x for x in os.listdir('.') if os.path.isfile(x)])}")

The number of files in the directory is: 169


In [70]:
def count_files():
    return f"The number of files in the directory is: {len([x for x in os.listdir('.') if os.path.isfile(x)])}"


In [71]:
count_files()

'The number of files in the directory is: 169'