# Getting List of Employer Profile URLs


The Kununu [sitemap](https://www.kununu.com/de/sitemap) is organized in three levels, and only the last one links to company profiles. The top level lists letter categories:

<img src="images/sitemap.png" width="500"/>
<br>

Clicking on 'A' leads to:

<img src="images/sitemap_A.png" width="500"/>
<br>

Clicking on 'A1' leads to:

<img src="images/sitemap_A1.png" width="500"/>

We can therefore obtain all links to company profiles by scraping all 3 levels of the sitemap.

The first two levels (corresponding to the first two images above) are small enough to scrape sequentially, while the third level is scraped in parallel for efficiency. The whole scrape should take about 5 minutes.
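The sequential-then-parallel pattern described above can be sketched without any network access. This is a minimal illustration, not the notebook's actual helper: `FAKE_SITEMAP`, `fetch_links`, and `crawl_three_levels` are hypothetical names, and the dictionary stands in for the real sitemap pages.

```python
from multiprocessing.dummy import Pool as ThreadPool

# Hypothetical stand-in for the real sitemap: maps a page to the links found on it.
FAKE_SITEMAP = {
    "sitemap": ["sitemap/A", "sitemap/B"],
    "sitemap/A": ["sitemap/A1", "sitemap/A2"],
    "sitemap/B": ["sitemap/B1"],
    "sitemap/A1": ["profile/a-one", "profile/a-two"],
    "sitemap/A2": ["profile/a-three"],
    "sitemap/B1": ["profile/b-one"],
}


def fetch_links(url):
    """Hypothetical replacement for a network call; returns the links on a page."""
    return FAKE_SITEMAP.get(url, [])


def crawl_three_levels(root, concurrency=4):
    # Levels 1 and 2 contain few pages, so a plain loop is enough.
    level1 = fetch_links(root)
    level2 = [link for url in level1 for link in fetch_links(url)]

    # Level 3 has the most pages, so fetch them with a thread pool.
    # pool.map returns one list per page, in the same order as level2.
    with ThreadPool(concurrency) as pool:
        per_page = pool.map(fetch_links, level2)

    # Flatten the list of lists into a single list of profile links.
    return [link for page in per_page for link in page]


profiles = crawl_three_levels("sitemap")
print(profiles)
```

Threads rather than processes are a reasonable fit here because the real work is waiting on HTTP responses, not CPU-bound computation.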

In [None]:
# !pip install requests beautifulsoup4 python-dotenv tqdm

from multiprocessing.dummy import Pool as ThreadPool
from tqdm import tqdm
from random import shuffle
import requests
from bs4 import BeautifulSoup
import os
from dotenv import load_dotenv
from utils import *

load_dotenv() # make sure to have a .env file that defines the variable 'SCRAPINGBEE_API_KEY' if using scrapingbee

In [None]:
# getting all urls from level 1 of the sitemap (see image 1 above)
level1_urls = get_all_links_from_div("https://www.kununu.com/de/sitemap", "CategoryLevel_letterContainer__pUMeY", "https://www.kununu.com/")

# getting all urls from level 2 of the sitemap (see image 2 above)
level2_urls = []
for url in level1_urls:
    level2_urls += get_all_links_from_div(url, "PaginationLevel_container__dLOfo", "https://www.kununu.com/de/sitemap/")

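# getting all urls from level 3 of the sitemap (see image 3 above)
# note: only the url is passed here, so get_all_links_from_div must supply defaults for its other two arguments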
concurrency = 8
with ThreadPool(concurrency) as pool:
    all_kununu_employer_urls = list(tqdm(pool.imap(get_all_links_from_div, level2_urls), total=len(level2_urls)))

# saving all urls to data/all_kununu_company_profile_links.txt
all_kununu_employer_urls = [item for sublist in all_kununu_employer_urls for item in sublist] # flatten list
shuffle(all_kununu_employer_urls)
with open('data/all_kununu_company_profile_links.txt', 'w') as f:
    for line in all_kununu_employer_urls:
        f.write(f"{line}\n")
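A later step will usually need to read the saved links back. The sketch below shows one way to do that, assuming the one-URL-per-line format written above; `load_urls` is a hypothetical helper, and the duplicate removal is an added precaution rather than something the notebook itself does. The demo writes to a temporary file with placeholder URLs so it does not touch `data/`.

```python
import os
import tempfile


def load_urls(path):
    """Read one URL per line, dropping blank lines and duplicates (order preserved)."""
    with open(path, encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip()]
    # dict.fromkeys keeps the first occurrence of each URL in its original position.
    return list(dict.fromkeys(urls))


# Demo on a temporary file; the URLs are placeholders, not real profile links.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "links.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("https://example.com/a\n\nhttps://example.com/b\nhttps://example.com/a\n")
    demo_urls = load_urls(demo_path)

print(demo_urls)
```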