# Parallelization Lab

In this lab, you will be leveraging several concepts you have learned to obtain a list of links from a web page and crawl and index the pages referenced by those links - both sequentially and in parallel. Follow the steps below to complete the lab.

### Step 1: Use the requests library to retrieve the content from the URL below.

In [2]:
import requests

url = 'https://en.wikipedia.org/wiki/Data_science'

In [3]:
# your code here

response = requests.get(url)

if response.status_code == 200:
    content = response.text
    print("Content retrieved successfully.")
else:
    print(f"Failed to retrieve content. Status code: {response.status_code}")

Content retrieved successfully.


### Step 2: Use BeautifulSoup to extract a list of all the unique links on the page.

In [5]:
from bs4 import BeautifulSoup

In [19]:
# your code here

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

### Step 3: Use list comprehensions with conditions to clean the link list.

There are two types of links, absolute and relative. Absolute links have the full URL and begin with *http* while relative links begin with a forward slash (/) and point to an internal page within the *wikipedia.org* domain. Clean the respective types of URLs as follows.

- Absolute Links: Create a list of these and remove any that contain a percentage sign (%).
- Relative Links: Create a list of these, add the domain to the link so that you have the full URL, and remove any that contain a percentage sign (%).
- Combine the list of absolute and relative links and ensure there are no duplicates.

In [16]:
domain = 'http://wikipedia.org'

In [20]:
# your code here

# Absolute links

# Get the request and clean string
absol_links = soup.find_all("a", attrs={'class':'external text'})
list = [str(link) for link in absol_links]
list = [link.replace('<a class="external text" href=', '') for link in list]
list = [link.split('"') for link in list]

# Create link list, delete the ones with "?"
list2 = [(i[1]) for i in list] 
absol_list = [item for item in list2 if not "?" in item]


In [13]:
# Relative links

domain = 'http://wikipedia.org'
relat_links = soup.find_all("a",attrs={'class':'mw-redirect'})

# Get the request and clean string
list = [str(link) for link in relat_links]
list = [link.replace('<a class="mw-redirect" href=', '') for link in list]
list = [link.split('"') for link in list]

# Create link list, delete the ones with "?"
list2 = [(domain+i[1]) for i in list] 
relat_list = [item for item in list2 if not "?" in item]


In [14]:
# Convert to set to erase repeated and then again into list
final_set = set(absol_list+relat_list)
final_list = [item for item in final_set]
final_list2 = []

# Adding http where is missing
for item in final_list:
    if item[0]=="/":
        item = 'http:'+ item
        final_list2.append(item)
    else:
        final_list2.append(item)

final_list2


['https://api.semanticscholar.org/CorpusID:9743327',
 'http://wikipedia.org/wiki/Iterative',
 'https://www.worldcat.org/issn/0360-0300',
 'http://wikipedia.org/wiki/Data_(computing)',
 'http://wikipedia.org/wiki/Scientific_computing',
 'http://wikipedia.org/wiki/Problem-solving',
 'https://medriscoll.com/post/4740157098/the-three-sexy-skills-of-data-geeks',
 'https://www.researchgate.net/publication/256438799',
 'https://www.forbes.com/sites/peterpham/2015/08/28/the-impacts-of-big-data-that-you-may-not-have-heard-of/',
 'https://pubmed.ncbi.nlm.nih.gov/19265007',
 'https://www.worldcat.org/issn/0001-0782',
 'http://wikipedia.org/wiki/Data_loading',
 'http://wikipedia.org/wiki/Information_visualization',
 'https://www.bostonglobe.com/business/2015/11/11/behind-scenes-sexiest-job-century/Kc1cvXIu31DfHhVmyRQeIJ/story.html',
 'https://magazine.amstat.org/blog/2016/06/01/datascience-2/',
 'https://www.worldcat.org/issn/0017-8012',
 'https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-

### Step 4: Use the os library to create a folder called *wikipedia* and make that the current working directory.

In [21]:
import os

In [24]:
# your code here

# Create a folder called 'wikipedia'
folder_name = 'wikipedia'
os.makedirs(folder_name, exist_ok=True)

# Set the current working directory to the new folder
os.chdir(folder_name)

# Print the current working directory to verify the change
print(f"Current Working Directory: {os.getcwd()}")

Current Working Directory: /Users/amandine/Desktop/Ironhack/10_Week/lab-parallelization/your-code/wikipedia


### Step 5: Write a function called index_page that accepts a link and does the following.

- Tries to request the content of the page referenced by that link.
- Slugifies the filename using the `slugify` function from the [python-slugify](https://pypi.org/project/python-slugify/) library and adds a .html file extension.
    - If you don't already have the python-slugify library installed, you can pip install it as follows: `$ pip3 install python-slugify`.
    - To import the slugify function, you would do the following: `from slugify import slugify`.
    - You can then slugify a link as follows `slugify(link)`.
- Creates a file in the wikipedia folder using the slugified filename and writes the contents of the page to the file.
- If an exception occurs during the process above, just `pass`.

In [28]:
from slugify import slugify
import time

In [26]:
# your code here

def index_page(link):
    try:
        response = requests.get(link)
        soup = BeautifulSoup(response.content)
        file_name = slugify(link)+'.html'
        fp = open(file_name, 'w')
        fp.write(str(soup))
        fp.close()
        print('Link: ', link, 'done..')
    except:
#         print('This link: ', link, 'failed...')
        pass
    

### Step 6: Sequentially loop through the list of links, running the index_page function each time.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

In [30]:
# your code here

start_time = time.time()

# Assuming you have the list of cleaned_links from previous steps
for link in cleaned_links:
    index_page(link)

end_time = time.time()
execution_time = end_time - start_time

print(f"Total Execution Time: {execution_time:.2f} seconds")

Link:  https://eo.wikipedia.org/wiki/Datum-scienco done..
Link:  http://wikipedia.org/wiki/Data_loading done..
Link:  http://wikipedia.org/w/index.php?title=Special:DownloadAsPdf&page=Data_science&action=show-download-screen done..
Link:  http://wikipedia.org/w/index.php?title=Data_science&action=edit&section=2 done..
Link:  http://wikipedia.org/wiki/Main_Page done..
Link:  http://wikipedia.org/wiki/Database done..
Link:  http://wikipedia.org/wiki/Boston_Globe done..
Link:  http://wikipedia.org/wiki/Data_scraping done..
Link:  http://wikipedia.org/wiki/Data_fusion done..
Link:  http://wikipedia.org/wiki/Critical_thinking done..
Link:  https://www.researchgate.net/publication/256438799 done..
Link:  https://web.archive.org/web/20170320193019/https://books.google.com/books?id=oGs_AQAAIAAJ done..
Link:  http://wikipedia.org/wiki/Data_farming done..
Link:  http://wikipedia.org/wiki/Data_(computing) done..
Link:  https://www.worldcat.org/issn/0001-0782 done..
Link:  http://wikipedia.org/wik

Link:  http://wikipedia.org/wiki/Category:Use_dmy_dates_from_August_2023 done..
Link:  https://dstf.acm.org/DSTF_Final_Report.pdf done..
Link:  http://wikipedia.org/wiki/Iterative done..
Link:  http://wikipedia.org/wiki/Data_mining done..
Link:  https://api.semanticscholar.org/CorpusID:207595944 done..
Link:  http://wikipedia.org/wiki/Category:Information_science done..
Link:  https://gl.wikipedia.org/wiki/Ciencia_de_datos done..
Link:  http://wikipedia.org/wiki/Information_explosion done..
Link:  https://www.forbes.com/sites/gilpress/2013/08/19/data-science-whats-the-half-life-of-a-buzzword/ done..
Link:  http://wikipedia.org/wiki/Scientific_computing done..
Link:  http://wikipedia.org/wiki/Committee_on_Data_for_Science_and_Technology done..
Link:  https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ done..
Link:  https://cs.wikipedia.org/wiki/Data_science done..
Link:  https://en.wikipedia.org/w/index.php?title=Data_science&oldid=1189437909 done..
Link:  http:/

Link:  https://www.oreilly.com/library/view/doing-data-science/9781449363871/ch01.html done..
Link:  http://wikipedia.org/wiki/Data_cleaning done..
Link:  http://wikipedia.org/w/index.php?title=Data_science&action=edit&section=9 done..
Link:  http://wikipedia.org/wiki/Information_technology done..
Link:  http://wikipedia.org/wiki/Data_lineage done..
Link:  http://wikipedia.org/w/index.php?title=Data_science&action=edit&section=1 done..
Link:  http://wikipedia.org/wiki/Journal_of_Computational_and_Graphical_Statistics done..
Link:  http://wikipedia.org/w/index.php?title=Data_science&oldid=1189437909 done..
Link:  https://no.wikipedia.org/wiki/Datavitenskap done..
Link:  http://wikipedia.org/wiki/Peter_Naur done..
Link:  http://wikipedia.org/w/index.php?title=Data_science&action=edit done..
Link:  http://wikipedia.org/wiki/Nate_Silver done..
Link:  http://wikipedia.org/wiki/Jeff_Hammerbacher done..
Link:  http://wikipedia.org/wiki/Boston done..
Link:  https://en.wikipedia.org/w/index.php

### Step 7: Perform the page indexing in parallel and note the difference in performance.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

In [31]:
import multiprocessing

In [None]:
import concurrent.futures

def index_page_parallel(link):
    try:
        # Your existing index_page function logic here
        index_page(link)
    except Exception as e:
        pass

start_time_parallel = time.time()


with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(index_page_parallel, cleaned_links)

end_time_parallel = time.time()
execution_time_parallel = end_time_parallel - start_time_parallel

print(f"Total Execution Time (Parallel): {execution_time_parallel:.2f} seconds")
