# Web scraping to save time on a competition

In 2024, E-Fellows.net hosted a sweepstakes on Halloween. In their collection of employer portraits, they had hidden images of pumpkins.

The objective was to find all 8 pumpkins in those pages, count the amounts of candies in each pumpkin, note the motive on the sweets, and send the results to E-Fellows.net.

I didn't want to waste my time scrolling through all the pages, so I decided to write a web scraper to do the job for me, and got a little practice for my web-scraping skills in the proces.

In [1]:
import requests # For making HTTP requests
from bs4 import BeautifulSoup # For parsing the pages

base_company_portrait_url = "https://www.e-fellows.net/unternehmen"


In [2]:
list_page = requests.get(base_company_portrait_url)

soup = BeautifulSoup(list_page.content, 'html.parser')
company_links = []

# Get all company links from the page by class.
for link in soup.find_all(class_='headline__link', recursive=True):
    company_links.append(link.get('href'))

print(f"We found {len(company_links)} companies on the page.\
      \nAssume scanning each page for pumpkins takes 30 seconds.\
      \nThat is {len(company_links) * 30 / 60} minutes of scanning.")

We found 102 companies on the page.      
Assume scanning each page for pumpkins takes 30 seconds.      
That is 51.0 minutes of scanning.


From the promotion page, we have a sample of the pumpkin we are looking for:
```
<img class="image__img" src="https://www.e-fellows.net/uploads/NEU-Medienbibliothek/Unternehmen/00_Gewinnspiel/_contentSmall/Bonbon-Schaedel-2024.jpg" srcset="https://www.e-fellows.net/uploads/NEU-Medienbibliothek/Unternehmen/00_Gewinnspiel/_750xAUTO_crop_center-center_none/Bonbon-Schaedel-2024.jpg 2x" width="375" height="161" loading="lazy" role="presentation">
```

We also found an empty pumpkin:
```
<img class="image__img" src="https://www.e-fellows.net/uploads/NEU-Medienbibliothek/Unternehmen/00_Gewinnspiel/_contentSmall/Niete-2024.jpg" srcset="https://www.e-fellows.net/uploads/NEU-Medienbibliothek/Unternehmen/00_Gewinnspiel/_750xAUTO_crop_center-center_none/Niete-2024.jpg 2x" width="375" height="161" loading="lazy" role="presentation">
```

Looks like we want to look for "Gewinnspiel" in the URL. Let's do that.



In [3]:
def img_has_in_url(page_soup, keyword):
    if not isinstance(page_soup, BeautifulSoup): 
        page_soup = BeautifulSoup(page_soup, 'html.parser')
        
    for img in page_soup.find_all('img'):
        if keyword in img.get('src'):
            # We print the source url for further insights.
            print(img.get('src'))
            return True
        
for company_link in company_links:
    company_page = requests.get(company_link)
    if img_has_in_url(company_page.content, "Gewinnspiel"):
        print(company_link)
        print("has pumpkin")

https://www.e-fellows.net/uploads/NEU-Medienbibliothek/Unternehmen/00_Gewinnspiel/_contentSmall/Niete-2024.jpg
https://www.e-fellows.net/unternehmen/deutsche-bank
has pumpkin
https://www.e-fellows.net/uploads/NEU-Medienbibliothek/Unternehmen/00_Gewinnspiel/_contentSmall/Niete-2024.jpg
https://www.e-fellows.net/unternehmen/burda/trainee-programm
has pumpkin
https://www.e-fellows.net/uploads/NEU-Medienbibliothek/Unternehmen/00_Gewinnspiel/_contentSmall/Bonbon-Hut-2024.jpg
https://www.e-fellows.net/unternehmen/capgemini
has pumpkin
https://www.e-fellows.net/uploads/NEU-Medienbibliothek/Unternehmen/00_Gewinnspiel/_contentSmall/Niete-2024.jpg
https://www.e-fellows.net/unternehmen/munich-re
has pumpkin
https://www.e-fellows.net/uploads/NEU-Medienbibliothek/Unternehmen/00_Gewinnspiel/_contentSmall/Bonbon-Katze-2024.jpg
https://www.e-fellows.net/unternehmen/basf
has pumpkin
https://www.e-fellows.net/uploads/NEU-Medienbibliothek/Unternehmen/00_Gewinnspiel/_contentSmall/Niete-2024.jpg
https://ww

We discover that only some pumpkins are full, they have "Bonbon" in every successfull catch. Let's re-define the problem as finding the pumpkin with "Bonbon" in the URL.

In [4]:
def has_full_pumpkin(page_soup):
    img_has_in_url(page_soup, "Bonbon")

winning_links = []
for company_link in company_links:
    company_page = requests.get(company_link)
    if has_full_pumpkin(company_page.content):
        print(company_link)
        print("has full pumpkin")
        winning_links.append(company_link)

https://www.e-fellows.net/uploads/NEU-Medienbibliothek/Unternehmen/00_Gewinnspiel/_contentSmall/Bonbon-Hut-2024.jpg
https://www.e-fellows.net/uploads/NEU-Medienbibliothek/Unternehmen/00_Gewinnspiel/_contentSmall/Bonbon-Katze-2024.jpg
https://www.e-fellows.net/uploads/NEU-Medienbibliothek/Unternehmen/00_Gewinnspiel/_contentSmall/Bonbon-Auge-2024.jpg
https://www.e-fellows.net/uploads/NEU-Medienbibliothek/Unternehmen/00_Gewinnspiel/_contentSmall/Bonbon-Spinne-2024.jpg
https://www.e-fellows.net/uploads/NEU-Medienbibliothek/Unternehmen/00_Gewinnspiel/_contentSmall/Bonbon-Kuerbis-2024.jpg
https://www.e-fellows.net/uploads/NEU-Medienbibliothek/Unternehmen/00_Gewinnspiel/_contentSmall/Bonbon-Fledermaus-2024.jpg
https://www.e-fellows.net/uploads/NEU-Medienbibliothek/Unternehmen/00_Gewinnspiel/_contentSmall/Bonbon-Geist-2024.jpg
https://www.e-fellows.net/uploads/NEU-Medienbibliothek/Unternehmen/00_Gewinnspiel/_contentSmall/Bonbon-Schaedel-2024.jpg


For each of these, we have to name the company, the motive and the count of sweets. The latter can only be extracted visually, we chose to do this manually. However, as a little toy experiment, I want to extract the company and the motive using a some regex magic.

In [5]:
import re

def get_sweet_link(page_soup):
    if not isinstance(page_soup, BeautifulSoup): 
        page_soup = BeautifulSoup(page_soup, 'html.parser')
        
    for link in page_soup.find_all('img'):
        if "Bonbon" in link.get('src'):
            return link.get('src')

sweet_pages = [requests.get(link) for link in winning_links]
sweet_image_links = [get_sweet_link(page.content) for page in sweet_pages]


In [6]:
bonbon_regex = r"https://www.e-fellows.net/.*/Bonbon-(\w*)-2024.jpg"
company_regex = r"https://www.e-fellows.net/unternehmen/([\w\-]*)/?.*"

bonbon_motives = [re.findall(bonbon_regex, link)[0] for link in sweet_image_links]
company_names = [re.findall(company_regex, link)[0] for link in winning_links]

pairs = zip(company_names, bonbon_motives)
for pair in pairs:
    print(f"Company: {pair[0]} has the motive: {pair[1]}")

Results according to the desired format:

```
Capgemini/Bonbons: 2/Motiv: Hut
BASF/Bonbons: 3/Motiv: Katze
Pepsico/Bonbons: 4/Motiv: Auge
Wavestone/Bonbons: 7/Motiv: Spinne
Enova/Bonbons: 5/Motiv: Kürbis
Lidl/Bonbons: 6/Motiv: Fledermaus
Forvis Mazars/Bonbons: 4/Motiv: Geist
Gleiss Lutz/Bonbons: 5/Motiv: Schädel
```

## My takeaways
Through this fast prototyping exercise, i took 30 minutes instead of optimistic 52 to get all necessary results. It was paramount not to over-engineer the solution, as the time saved was the main objective. I can reccommend this approach to anyone who is in a similar situation, it's a fun exercise.

This was successful, let's see if I win the price.