# Exercise: scraping minifigs

Minifigs are the figurines included in LEGO sets. There are many of them; some have a real personality, while others are more generic. You can find all of them on the [Brickset website](https://brickset.com/browse/minifigs). Note that we won't be downloading all the images, as that would stress the bandwidth of this free website way too much. The goal, therefore, is to create a list of URLs, divided by theme.

If you want, you can still download the image files later on, print them all individually on A4 pages and redecorate your room.

But back to code. The first step is importing the libraries.

In [1]:
# ! pip install requests
# ! pip install beautifulsoup4

import requests
from bs4 import BeautifulSoup

## Step 1
 
Request the content of the page https://brickset.com/browse/minifigs.  
Show the first and last 100 characters.

In [2]:
# Up to you!




<!DOCTYPE html>
<!--[if lt IE 7]>      <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--
10.0","token":"6a9a238990984c4492f28280a102793d"}' crossorigin="anonymous"></script>
</body>
</html>
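If you get stuck, the following is a minimal sketch of one possible approach, not the reference solution. The helper names `fetch_html` and `first_and_last` are made up for illustration, and the slicing is demonstrated on a short stand-in string so that the block runs without contacting the website.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text; raise on an HTTP error."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def first_and_last(text, n=100):
    """Return the first n and the last n characters of a string."""
    return text[:n], text[-n:]


# Offline demonstration on a stand-in string; in the notebook you would
# pass fetch_html("https://brickset.com/browse/minifigs") instead.
head, tail = first_and_last("<!DOCTYPE html><html><body>Hello</body></html>", n=15)
print(head)
print(tail)
```

Keeping the download and the slicing in separate functions means you can try out the slicing as often as you like without sending a new request each time.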


## Step 2
 
So we can see we have the basic site content now. The actual site is a page with links to all the different themes, and we need all these links so we can fetch the page behind each one. We reach back to trusty old Chrome Inspect to see what part of the website we're interested in:

![](images/2022-03-07-21-23-48.png)

We can skip everything not in ```<div class="content">```.  

Write the code to create a list of tuples.  Each tuple contains the category name and the url of the page showing the corresponding minifigs.

![](images/minifig_part2.PNG)


In [3]:
# Up to you!



('Adventurers', '/minifigs/category-Adventurers')
('Agents', '/minifigs/category-Agents')
('Alpha Team', '/minifigs/category-Alpha-Team')
('Aquazone', '/minifigs/category-Aquazone')
('Atlantis', '/minifigs/category-Atlantis')
('Avatar', '/minifigs/category-Avatar')
('Avatar The Last Airbender', '/minifigs/category-Avatar-The-Last-Airbender')
('Back to the Future', '/minifigs/category-Back-to-the-Future')
('Basic', '/minifigs/category-Basic')
('Batman I', '/minifigs/category-Batman-I')
('Belville', '/minifigs/category-Belville')
('BIONICLE', '/minifigs/category-BIONICLE')
('BrickLink Designer Program', '/minifigs/category-BrickLink-Designer-Program')
('Building Bigger Thinking', '/minifigs/category-Building-Bigger-Thinking')
('Cars', '/minifigs/category-Cars')
('Castle', '/minifigs/category-Castle')
('Clikits', '/minifigs/category-Clikits')
('Collectible Minifigures', '/minifigs/category-Collectible-Minifigures')
('DC Super Hero Girls', '/minifigs/category-DC-Super-Hero-Girls')
('Dimens
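As a hint, here is a small sketch of the idea on a made-up HTML fragment. `SAMPLE_HTML` and `extract_links` are hypothetical names, the year link is an invented example, and the real page markup may well differ from this stand-in. Only the ```<div class="content">``` container is taken from the text above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
  <div class="nav"><a href="/sets">Sets</a></div>
  <div class="content">
    <a href="/minifigs/category-Adventurers">Adventurers</a>
    <a href="/minifigs/category-Agents">Agents</a>
    <a href="/minifigs/year-1999">1999</a>
  </div>
</body></html>
"""


def extract_links(html):
    """Return (text, href) tuples for every link inside <div class="content">."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", {"class": "content"})
    if content is None:
        return []
    return [
        (a.get_text(strip=True), a["href"])
        for a in content.find_all("a", href=True)
    ]


for link in extract_links(SAMPLE_HTML):
    print(link)
```

Searching inside the `content` element, rather than in the whole document, is what keeps navigation links such as the one in the `nav` block out of the result.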

## Step 3

We also need to store the base URL in a variable (https://brickset.com), since it's not in the `href` attribute. 

Note that the created list contains links not only to every theme, but also to every year. Every minifig in a theme is also in a year, and vice versa.

![](images/minifig_part3.PNG)

We don't need to download every minifig image twice! The goal of this step is filtering out all non-theme items (~years) in the list. 

Try to use regex to check whether the first element of the tuple contains a year (4 digits).  


In [4]:
# Delete
import re
categories = []

print(len(pretty_links))

# Keep only the items whose name does not start with four digits (a year).
for item in pretty_links:
    if not re.match(r"\d{4}", item[0]):
        categories.append(item)

print(len(categories))

159
110


Check your code. You probably didn't think of using a list comprehension. If so, rewrite your solution with one.  

In [5]:
# Up to you!



159
110
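For comparison, this is a sketch of the same filter as a list comprehension, run on a small invented sample rather than on the real list. The names `sample_links`, `YEAR` and `themes` are assumptions for this example. It uses `re.fullmatch`, which only succeeds when the whole name consists of four digits, whereas `re.match` only checks the start of the string.

```python
import re

# Hypothetical sample; the real list comes from Step 2.
sample_links = [
    ("Adventurers", "/minifigs/category-Adventurers"),
    ("1999", "/minifigs/year-1999"),
    ("Agents", "/minifigs/category-Agents"),
    ("2021", "/minifigs/year-2021"),
]

# Compile the pattern once; fullmatch requires the entire name to be four digits.
YEAR = re.compile(r"\d{4}")

# Keep every tuple whose first element is not a year.
themes = [link for link in sample_links if not YEAR.fullmatch(link[0])]

print(len(sample_links))
print(len(themes))
```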


## Step 4

Next we'll be doing something different from before: instead of simply putting a URL in a variable and using that for the request, we'll write a function that takes the URL as a parameter. This makes it easier to reuse the function later on for all the URLs in the list of categories. 


The downside of this in a Jupyter notebook is that if the function is defined in code cell A and you call it in code cell B, running cell B *won't* redefine the function. You'll run the function as it was the last time you ran cell A, so rerun cell A after every change to the function.

And what does this function do? We'll be looking at the following pages:

![](images/2022-03-08-16-09-11.png)

As you can see the interesting part is in the section ```<section class="setlist minifiglist">```. All individual images are conveniently grouped in articles with class "set":

![](images/2022-03-08-16-17-53.png)

The goal of this step is to define a function _**download_page(url)**_ that scrapes this information and stores it in a list of tuples.  The first element of each tuple is the link to the image file and the second element is the content of the `title` attribute.



In [6]:
# Up to you!



In [7]:
# USE THIS CODE TO TEST YOUR FUNCTION
images = download_page("https://brickset.com/minifigs/category-Adventurers")
print(*images[0:10],sep='\n')

('https://images.brickset.com/minifigs/large/adv001.jpg', 'adv001: Achu')
('https://images.brickset.com/minifigs/large/adv002.jpg', 'adv002: Alexis Sanister')
('https://images.brickset.com/minifigs/large/adv027.jpg', 'adv027: Babloo')
('https://images.brickset.com/minifigs/large/adv004.jpg', 'adv004: Baron Von Barron with Brown Aviator Cap')
('https://images.brickset.com/minifigs/large/adv005.jpg', 'adv005: Baron Von Barron with Light Gray Aviator Cap')
('https://images.brickset.com/minifigs/large/adv003.jpg', 'adv003: Baron Von Barron with Pith Helmet')
('https://images.brickset.com/minifigs/large/adv039.jpg', 'adv039: Baron Von Barron with Pith Helmet and White Epaulettes')
('https://images.brickset.com/minifigs/large/adv006.jpg', 'adv006: Dr. Charles Lightning')
('https://images.brickset.com/minifigs/large/adv040.jpg', 'adv040: Dr. Charles Lightning with Backpack')
('https://images.brickset.com/minifigs/large/adv033.jpg', 'adv033: Dr. Kilroy - Gray Suit')


The output should look like this

![](images/minifig_part4.PNG)
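While developing the function, it can help to test the parsing on a saved or hand-written HTML fragment, so that the website is not hit on every attempt. The sketch below assumes a simplified fragment that only contains the tags named above. `SAMPLE_PAGE` and `parse_minifigs` are hypothetical names, and the real page has far more markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a category page; only the tags named in the text are assumed.
SAMPLE_PAGE = """
<section class="setlist minifiglist">
  <article class="set">
    <a href="#"><img src="https://images.brickset.com/minifigs/large/adv001.jpg" title="adv001: Achu"></a>
  </article>
  <article class="set">
    <a href="#"><img src="https://images.brickset.com/minifigs/large/adv002.jpg" title="adv002: Alexis Sanister"></a>
  </article>
</section>
"""


def parse_minifigs(html):
    """Return (image_url, title) tuples from the minifig list section of a page."""
    soup = BeautifulSoup(html, "html.parser")
    section = soup.find("section", {"class": "setlist minifiglist"})
    if section is None:
        return []  # no minifig list on this page
    images = []
    for article in section.find_all("article", {"class": "set"}):
        img = article.find("img", src=True)
        if img is not None:
            images.append((img["src"], img.get("title", "")))
    return images


print(*parse_minifigs(SAMPLE_PAGE), sep="\n")
```

A `download_page(url)` function could then be little more than a request followed by a call to such a parsing helper.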

## Step 5

From here we give you the code. 

That much is working. But there's another problem:

![](images/2022-03-08-16-32-19.png)

Pagination. We're not looking at all minifigs, but only at the ones that fit on the page. There are two solutions:

- Do some extra scraping, and get a list of all pages, calling the function on all these pages
- Make the existing function recursive: if there is a "next page" link, get the list from that page and add it to the returned list.

The second is the topic of the first chapter of the AI-course you'll be getting. Let's do a preview!

In [8]:
def download_page_recursive(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")

    results = soup.find("section", {"class": "setlist minifiglist"})
    # print(results.prettify())
    image_list = []

    articles = results.find_all("article", {"class": "set"})

    for article in articles:
        image = article.find("img", src=True)
        image_list.append((image["src"], image["title"]))

    # the new part:
    results = soup.find("li", {"class": "next"})  # look in the entire page, not just the center part
    if results is not None:
        link = results.find("a", href=True)
        if link is not None:
            # recurse, so that the next page handles its own "next" link;
            # add all returned links to the list we already had
            image_list += download_page_recursive(link["href"])

    return image_list

images = download_page("https://brickset.com/minifigs/category-Adventurers")
print(len(images))

images = download_page_recursive("https://brickset.com/minifigs/category-Adventurers")
print(len(images))


50
55


Five more, which checks out, because there are five minifigs on the second page. But what, so I hear you think, happens if there is a third page? Well, the second page will have a "Next" link as well, so the second page will ask the third page for a list, add that list to the list the second page made and return it to the function creating the list of the first page. And a fourth page? Let the third page handle that. Do note that this only works when the last page doesn't have a "Next" link. If the last page were to have a link to the first page (circular pagination, so to speak), the function would keep calling itself until Python gives up with a `RecursionError`.
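To make the circular-pagination risk concrete, here is a sketch of a loop-based alternative that remembers which URLs it has already visited. It is not the approach used in this notebook. `collect_all_pages` and `fetch_page` are hypothetical names, and `fetch_page` is assumed to return the items of one page together with the URL of the next page (or `None` on the last page). A dictionary plays the role of the website so the block runs offline.

```python
def collect_all_pages(start_url, fetch_page):
    """Follow "next" links in a loop, stopping on a missing link or a repeated URL.

    fetch_page(url) is a hypothetical callable returning (items, next_url),
    where next_url is None on the last page.
    """
    items = []
    seen = set()
    url = start_url
    while url is not None and url not in seen:
        seen.add(url)  # remember this page so a circular link cannot trap us
        page_items, url = fetch_page(url)
        items.extend(page_items)
    return items


# Fake three-page site whose last page links back to the first (circular).
FAKE_SITE = {
    "/p1": (["a", "b"], "/p2"),
    "/p2": (["c"], "/p3"),
    "/p3": (["d"], "/p1"),
}

print(collect_all_pages("/p1", FAKE_SITE.get))
```

The same `seen` set could also be passed along in a recursive version; the loop is used here only because it keeps the bookkeeping in one place.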

Recursion is complicated, but it does great things. Just watch [this](https://www.youtube.com/watch?v=G_UYXzGuqvM) video. It won't, however, be in the exam for this course.

Next up is running this recursive function on all the links we created earlier (in the code sample this variable is called `pretty_links`; you must adjust the name!). Do remember to prepend the `base_url` variable (adjust the name) to the URL in the list. 

In [9]:
all_images = []
for pretty_link in pretty_links[0:3]:
    images = download_page_recursive(base_url + pretty_link[1])
    all_images += [ (pretty_link[0], im[0], im[1]) for im in images]

print(all_images[:10])
print(all_images[-10:])
print(len(all_images))



[('Adventurers', 'https://images.brickset.com/minifigs/large/adv001.jpg', 'adv001: Achu'), ('Adventurers', 'https://images.brickset.com/minifigs/large/adv002.jpg', 'adv002: Alexis Sanister'), ('Adventurers', 'https://images.brickset.com/minifigs/large/adv027.jpg', 'adv027: Babloo'), ('Adventurers', 'https://images.brickset.com/minifigs/large/adv004.jpg', 'adv004: Baron Von Barron with Brown Aviator Cap'), ('Adventurers', 'https://images.brickset.com/minifigs/large/adv005.jpg', 'adv005: Baron Von Barron with Light Gray Aviator Cap'), ('Adventurers', 'https://images.brickset.com/minifigs/large/adv003.jpg', 'adv003: Baron Von Barron with Pith Helmet'), ('Adventurers', 'https://images.brickset.com/minifigs/large/adv039.jpg', 'adv039: Baron Von Barron with Pith Helmet and White Epaulettes'), ('Adventurers', 'https://images.brickset.com/minifigs/large/adv006.jpg', 'adv006: Dr. Charles Lightning'), ('Adventurers', 'https://images.brickset.com/minifigs/large/adv040.jpg', 'adv040: Dr. Charles L

Did you notice the `[0:3]` in the for-loop? It is there for testing purposes. A loop like this never works on the first try, and this way you can test it without always running it on the full list of a couple of thousand images. And we left it here because for this example the 119 links we have are plenty. There is no need to run all categories and download a list of 13,000 links...

So now we have the list. Maybe we want it in a CSV file? That would be a good way of storing it for later use.

In [10]:
import csv

header = ['Category', 'URL', 'name']

with open('to_download.csv', 'w', encoding='UTF8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(all_images)
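Reading such a file back with `csv.reader` could look like the sketch below. It writes a two-row sample to a temporary folder first, so it does not touch your real `to_download.csv`. The names `sample`, `path` and `restored` are made up for this example.

```python
import csv
import os
import tempfile

# Hypothetical two-row sample standing in for all_images.
sample = [
    ("Adventurers", "https://images.brickset.com/minifigs/large/adv001.jpg", "adv001: Achu"),
    ("Adventurers", "https://images.brickset.com/minifigs/large/adv002.jpg", "adv002: Alexis Sanister"),
]

# Write to a temporary folder so no existing file is overwritten.
path = os.path.join(tempfile.mkdtemp(), "to_download.csv")

with open(path, "w", encoding="UTF8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Category", "URL", "name"])
    writer.writerows(sample)

with open(path, "r", encoding="UTF8", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    restored = [tuple(row) for row in reader]  # csv.reader yields lists of strings

print(restored == sample)
```

Note that `csv.reader` returns every value as a string, so a list containing numbers would need converting after reading.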

Now all we need to do in the Jupyter notebook where we're downloading the images is read the CSV (using `csv.reader`) and recreate the exact same list variable we had before. Which seems like a waste: isn't there a way to store the variable as a file and simply re-import that file every time we need it? Like we do with vegetables: we put them in a jar with a mixture of salt and vinegar to keep them until the dark days in winter when we need a homemade Bicky-burger with a homemade pickle.

(The keyword here is [pickle](https://docs.python.org/3/library/pickle.html).)

([And another link that is much easier to understand](https://dodona.ugent.be/nl/activities/58032010/).)

In [11]:
import pickle

# the with-block closes the file once the list has been written
with open("all_images.p", "wb") as f:
    pickle.dump(all_images, f)

Opening a jar with condiments can be very hard. Is the same true for opening a pickle-file?

In [12]:
with open("all_images.p", "rb") as f:
    my_images = pickle.load(f)

print(my_images[:10])
print(my_images[-10:])
print(len(my_images))

[('Adventurers', 'https://images.brickset.com/minifigs/large/adv001.jpg', 'adv001: Achu'), ('Adventurers', 'https://images.brickset.com/minifigs/large/adv002.jpg', 'adv002: Alexis Sanister'), ('Adventurers', 'https://images.brickset.com/minifigs/large/adv027.jpg', 'adv027: Babloo'), ('Adventurers', 'https://images.brickset.com/minifigs/large/adv004.jpg', 'adv004: Baron Von Barron with Brown Aviator Cap'), ('Adventurers', 'https://images.brickset.com/minifigs/large/adv005.jpg', 'adv005: Baron Von Barron with Light Gray Aviator Cap'), ('Adventurers', 'https://images.brickset.com/minifigs/large/adv003.jpg', 'adv003: Baron Von Barron with Pith Helmet'), ('Adventurers', 'https://images.brickset.com/minifigs/large/adv039.jpg', 'adv039: Baron Von Barron with Pith Helmet and White Epaulettes'), ('Adventurers', 'https://images.brickset.com/minifigs/large/adv006.jpg', 'adv006: Dr. Charles Lightning'), ('Adventurers', 'https://images.brickset.com/minifigs/large/adv040.jpg', 'adv040: Dr. Charles L

No.