# Stayin' Alive
### *An AI-Powered Tool for Optimal Restaurant and Bar Location Selection and Business Longevity*
42578 – Advanced Business Analytics, DTU, 2025 <br>
Group 21 - Crocs Validation<br>
Giulia Andreatta - sxxxx<br>
Gabriel Lanaro - sxxxx<br>
Alessia Saccardo - sxxxx<br>
Gabriele Turetta - sxxxx<br>

### Objective
Opening a restaurant or a bar is a high-risk endeavor: many establishments close within their first few years. In Copenhagen, aspiring restaurateurs and investors often lack a data-driven approach to selecting a location, and the reasons behind a restaurant's success or failure remain hard to pin down.

This project aims to:

- Recommend optimal locations for new restaurants or bars using Survival Analysis.
- Visualize location suitability through an interactive heatmap enriched with predicted longevity scores, pedestrian peak hours, restaurant density, and pins marking active and closed venues.

### Datasets

- **Company data scraped from the official CVR registry via [virk.dk](https://datacvr.virk.dk/soegeresultater?fritekst=d&sideIndex=0&size=10)**<br>
Includes business registration details, location, restaurant closures, and industry code (branchekode).

- **Google Maps Scraped Data**<br>
Includes business location, rating, number of reviews, price range, tags.

- **Pedestrian Dataset from [OpenData.dk](https://www.opendata.dk/city-of-copenhagen/taelling_fodg#:~:text=Number%20of%20pedestrians%20counted%20on,19%20in%20both%20directions)**<br>
Provides foot traffic counts recorded at specific times and locations in Copenhagen.

### ABA Topics Covered
- **Web Data Mining**<br>
Scraping large-scale data from Google Maps and government databases to construct the datasets.

- **Survival Analysis**<br>
Predicting restaurant longevity using Kaplan-Meier and Cox Proportional Hazards models.

- **Recommender Systems**<br>
Suggesting location options for new restaurants and bars based on market gaps and existing competition.

- **AI in the Real World**<br>
Delivering real value to stakeholders by supporting data-driven restaurant planning and resilience strategies.
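
To make the Survival Analysis topic concrete, here is a minimal sketch of the Kaplan-Meier (product-limit) estimator, $S(t) = \prod_{t_i \le t} (1 - d_i / n_i)$, where $d_i$ is the number of closures at time $t_i$ and $n_i$ the number of venues still at risk just before it. The function name `kaplan_meier` and the lifetimes below are illustrative assumptions, not the project's data or implementation.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each time t where at least one closure was observed.

    durations: observed lifetime of each venue (e.g. in years)
    events:    True if the venue closed, False if it was still open (censored)
    """
    closures = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # closures and censorings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = closures.get(t, 0)
        if d:
            # Venues censored at t still count as at risk at t
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Made-up lifetimes in years; False marks venues still open when observed
durations = [1, 2, 2, 3, 4, 5]
events = [True, True, False, True, False, True]
curve = kaplan_meier(durations, events)
for t, s in curve:
    print(t, round(s, 3))
```

Venues that are still open contribute to the risk set until their last observed time, which is why censoring has to be tracked separately from closures.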



## 1st Step - Data Scraping from the official CVR registry

This script performs web scraping on the Danish company registry website (https://datacvr.virk.dk)
to extract company details for active business units in specified industry sectors (branchekoder).
It uses Selenium to navigate the search results, extract key information for each business unit,
and follow links to detailed company pages to obtain start and end dates.
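
The paginated search URL that the script walks through can be assembled from its query parameters. The sketch below uses the same parameter names and values as the script; the helper name `build_search_url` is a hypothetical addition for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://datacvr.virk.dk/soegeresultater"


def build_search_url(branchekode: int, page: int, region: str = "29190623") -> str:
    """Return the search-results URL for one page of one industry code."""
    params = {
        "sideIndex": page,                  # zero-based results page
        "enhedstype": "produktionsenhed",   # search business units
        "region": region,                   # region filter used in the script
        "branchekode": branchekode,         # industry code
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url(563020, 0))
```

Building the URL from a dictionary keeps the filters in one place and avoids quoting mistakes when a parameter changes.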

The results are saved in a CSV file, and duplicate entries (based on P-number) are avoided by 
keeping track of already seen values. The script is designed to be resumed without duplicating 
previous entries.
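
The resume logic described above can be sketched in isolation as follows. This is a simplified illustration, not the script itself: the two-column header, the helper names `load_seen` and `append_if_new`, and the demo record are assumptions.

```python
import csv
import os
import tempfile

HEADER = ["Name", "P-nummer"]  # shortened header, for illustration only


def load_seen(path: str) -> set:
    """Create the CSV with its header if missing; otherwise return stored P-numbers."""
    if not os.path.exists(path):
        with open(path, "w", newline="", encoding="utf-8") as f:
            csv.DictWriter(f, fieldnames=HEADER).writeheader()
        return set()
    with open(path, "r", newline="", encoding="utf-8") as f:
        return {row["P-nummer"] for row in csv.DictReader(f)}


def append_if_new(path: str, seen: set, record: dict) -> bool:
    """Append record unless its P-number is already known; return True if written."""
    if record["P-nummer"] in seen:
        return False
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.DictWriter(f, fieldnames=HEADER).writerow(record)
    seen.add(record["P-nummer"])
    return True


# Demo with a made-up record in a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
record = {"Name": "Bar A", "P-nummer": "1000000001"}

seen = load_seen(demo_path)
first = append_if_new(demo_path, seen, record)

# Simulate a restart: rebuild the seen set from disk, then retry the same record
seen_after_restart = load_seen(demo_path)
second = append_if_new(demo_path, seen_after_restart, record)
print(first, second)
```

Because the set of seen P-numbers is rebuilt from the file on startup, an interrupted run can be restarted without writing the same business unit twice.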

Requirement: chromedriver must be installed and `driver_path` set to its location.


In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import pandas as pd
import time
import csv
import os

# === CONFIGURATION ===
driver_path = r"C:\programmi\chromedriver\chromedriver.exe"  # raw string: single backslashes
options = Options()
options.add_argument("--window-size=1920,1080")

driver = webdriver.Chrome(service=Service(driver_path), options=options)

# The loop can be run for one branchekode at a time, or for all of them at once
# by uncommenting the lines below. In our project, the codes were enabled one at
# a time, producing a separate CSV file per branchekode. This kept each run short
# and reduced the risk of IP bans or website crashes. The per-code CSV files were
# then merged into a single CSV file.
branchekodes = [
    # 561110,   # serving food in restaurants and cafes
    # 561190,   # includes the operation of restaurants, where the main emphasis is on takeaway with very limited table service.
    # 563010,   # includes serving beverages, possibly with some edibles, but where the main emphasis is on serving non-alcoholic beverages for immediate consumption on site.
    563020,     #  includes serving beverages, possibly with some edibles, but where the main emphasis is on serving alcoholic beverages for immediate consumption on the premises.
]

for branchekode in branchekodes:
    page = 0
    csv_file_path = f"scraped_companies_{branchekode}_active.csv"
    header = ["Name", "Address", "P-nummer", "Status", "Company Type", "Startdate", "Enddate"]
    pnummer_seen = set()

    # If file exists, read already saved P-numbers
    file_exists = os.path.exists(csv_file_path)
    if file_exists:
        with open(csv_file_path, "r", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            for row in reader:
                pnummer_seen.add(row["P-nummer"])
    else:
        # Create file and write header
        with open(csv_file_path, "w", newline='', encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=header)
            writer.writeheader()

    while True:
        url = f"https://datacvr.virk.dk/soegeresultater?sideIndex={page}&enhedstype=produktionsenhed&region=29190623&branchekode={branchekode}"
        print(f"Scraping page {page}")
        driver.get(url)
        time.sleep(2)

        rows = driver.find_elements(By.CSS_SELECTOR, 'div[data-cy="soegeresultater-tabel"] > div.row')

        if not rows:
            print("No data found. Stopping.")
            break

        for row in rows:
            try:
                name = row.find_element(By.CSS_SELECTOR, "span.bold.value").text.strip()

                address_block = row.find_element(By.CSS_SELECTOR, "div.col-12.col-lg-4")
                address_lines = address_block.text.strip().split("\n")[-2:]
                address = ", ".join(address_lines)

                pnummer = row.find_element(By.XPATH, './/div[div[text()="P-nummer:"]]/div[2]').text.strip()

                # Skip if already saved
                if pnummer in pnummer_seen:
                    continue
                pnummer_seen.add(pnummer)

                status = row.find_element(By.XPATH, './/div[div[text()="Status:"]]/div[2]').text.strip()
                form = row.find_element(By.XPATH, './/div[div[text()="Virksomhedsform:"]]/div[2]').text.strip()

                link_elem = row.find_element(By.CSS_SELECTOR, 'div[data-cy="vis-mere"] a')
                link = link_elem.get_attribute("href")

                # Open detail page in new tab
                driver.execute_script("window.open('');")
                driver.switch_to.window(driver.window_handles[1])

                # Extract dates (left empty when the field is not on the page)
                startdato = ""
                ophoersdato = ""

                try:
                    driver.get(link)
                    time.sleep(3)

                    try:
                        startdato_element = driver.find_element(
                            By.XPATH, '//div[(strong[text()="Startdato"] or span[text()="Startdato"])]/following-sibling::div'
                        )
                        startdato = startdato_element.text.strip()
                    except Exception:
                        pass  # no start date found; keep the empty default

                    try:
                        ophoersdato_element = driver.find_element(
                            By.XPATH, '//div[(strong[text()="Ophørsdato"] or span[text()="Ophørsdato"])]/following-sibling::div'
                        )
                        ophoersdato = ophoersdato_element.text.strip()
                    except Exception:
                        pass  # no end date found; keep the empty default
                finally:
                    # Always close the detail tab and return to the results tab,
                    # even if loading the detail page fails
                    driver.close()
                    driver.switch_to.window(driver.window_handles[0])

                # Write to CSV
                with open(csv_file_path, "a", newline='', encoding="utf-8") as f:
                    writer = csv.DictWriter(f, fieldnames=header)
                    writer.writerow({
                        "Name": name,
                        "Address": address,
                        "P-nummer": pnummer,
                        "Status": status,
                        "Company Type": form,
                        "Startdate": startdato,
                        "Enddate": ophoersdato
                    })

                print(f"{name} | {startdato} → {ophoersdato}")

            except Exception as e:
                print("Error during parsing:", e)
                continue

        page += 1
        time.sleep(1)

driver.quit()
print("Scraping finished.")
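
The configuration comment mentions that the per-branchekode CSV files were merged into a single file. A minimal sketch of that step, using only the standard library, is shown below. The helper name `merge_csvs`, the output file name `merged.csv`, the cross-file de-duplication on `P-nummer`, and the demo rows are all assumptions made for illustration; the original merge procedure is not shown in the notebook.

```python
import csv
import glob
import os
import tempfile


def merge_csvs(pattern: str, out_path: str) -> int:
    """Concatenate all CSVs matching pattern into out_path; return rows written."""
    seen, rows, header = set(), [], None
    for path in sorted(glob.glob(pattern)):
        with open(path, "r", newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            header = header or reader.fieldnames
            for row in reader:
                # Assumed guard: skip a P-number already taken from another file
                if row["P-nummer"] not in seen:
                    seen.add(row["P-nummer"])
                    rows.append(row)
    if header is None:
        return 0  # no input files matched
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=header)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Demo: two tiny made-up files in a temporary directory
tmp = tempfile.mkdtemp()
for code, pnum in [(561110, "1000000001"), (563020, "1000000002")]:
    file_path = os.path.join(tmp, f"scraped_companies_{code}_active.csv")
    with open(file_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["Name", "P-nummer"])
        writer.writeheader()
        writer.writerow({"Name": f"Venue {code}", "P-nummer": pnum})

merged_count = merge_csvs(
    os.path.join(tmp, "scraped_companies_*_active.csv"),
    os.path.join(tmp, "merged.csv"),
)
print(merged_count)
```

The output file name should not match the input pattern, otherwise a second run would read the merged file back in as an input.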
