# Stayin' Alive
### *An AI-Powered Tool for Optimal Restaurant and Bar Location Selection and Business Longevity*
42578 – Advanced Business Analytics, DTU, 2025 <br>
Group 21 - Crocs Validation<br>
Giulia Andreatta - sxxxx<br>
Gabriel Lanaro - sxxxx<br>
Alessia Saccardo - sxxxx<br>
Gabriele Turetta - sxxxx<br>

### Objective
Opening a restaurant or a bar is a high-risk endeavor: many establishments close within their first few years. In Copenhagen, aspiring restaurateurs and investors often choose a location without a data-driven approach. Moreover, understanding why a restaurant succeeds or fails remains a challenge.

This project aims to:

- Recommend optimal locations for new restaurants or bars using Survival Analysis.
- Visualize location suitability through an interactive heatmap enriched with predictive longevity scores, pedestrian peak hours, restaurant density, and markers for active and closed venues.

### Datasets

- **Company data scraped from the official CVR registry via [virk.dk](https://datacvr.virk.dk/soegeresultater?fritekst=d&sideIndex=0&size=10)**<br>
Includes business registration details, location, restaurant closures, and industry code (branchekode).

- **Google Maps Scraped Data**<br>
Includes business location, rating, number of reviews, price range, tags.

- **Pedestrian Dataset from [OpenData.dk](https://www.opendata.dk/city-of-copenhagen/taelling_fodg#:~:text=Number%20of%20pedestrians%20counted%20on,19%20in%20both%20directions)**<br>
Provides foot traffic counts recorded at specific times and locations in Copenhagen.

### ABA Topics Covered
- **Web Data Mining**<br>
Scraping large-scale data from Google Maps and government databases to construct the datasets.

- **Survival Analysis**<br>
Predicting restaurant longevity using Kaplan-Meier and Cox Proportional Hazards models.

- **Recommender Systems**<br>
Suggesting location options for new restaurants and bars based on market gaps and existing competition.

- **AI in the Real World**<br>
Delivering real value to stakeholders by supporting data-driven restaurant planning and resilience strategies.
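
To make the survival-analysis topic above more concrete, here is a minimal sketch of the Kaplan-Meier estimator written in plain Python. It is an illustration only, not the project's actual modelling code: the function name `kaplan_meier` and the toy data are assumptions introduced for this example. Venues that are still open are treated as right-censored, meaning they count as "at risk" up to their observed duration but never as closures.

```python
def kaplan_meier(durations, closed):
    """Kaplan-Meier survival estimate.

    durations: time each venue was observed (e.g. years in business).
    closed:    True if the venue closed at that time, False if it was
               still open when observation ended (right-censored).
    Returns a list of (time, survival_probability), one per closure time.
    """
    records = list(zip(durations, closed))
    event_times = sorted({t for t, is_closed in records if is_closed})

    survival = 1.0
    curve = []
    for t in event_times:
        # Venues still in business just before time t
        at_risk = sum(1 for d, _ in records if d >= t)
        # Venues that closed exactly at time t
        closures = sum(1 for d, c in records if d == t and c)
        survival *= 1 - closures / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical toy data: years in business and whether the venue closed.
years_open = [2, 3, 3, 5, 8]
has_closed = [True, True, False, True, False]

km_curve = kaplan_meier(years_open, has_closed)
for t, s in km_curve:
    print(f"S({t}) = {s:.2f}")
```

With this toy data the estimated survival probability drops to 0.80 after 2 years, 0.60 after 3 years, and 0.30 after 5 years. The venue censored at 3 years lowers the at-risk count for later times without being counted as a closure.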



## 1st Step - Data Scraping from the official CVR registry

This script performs web scraping on the Danish company registry website (https://datacvr.virk.dk)
to extract company details for active business units in specified industry sectors (branchekoder).
It uses Selenium to navigate the search results, extract key information for each business unit,
and follow links to detailed company pages to obtain start and end dates.

The results are saved in a CSV file, and duplicate entries (based on P-number) are avoided by 
keeping track of already seen values. The script is designed to be resumed without duplicating 
previous entries.

Required: chromedriver installed and path correctly set.
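
The resume-without-duplicates behaviour described above can be sketched in isolation, without Selenium. The helpers `load_seen` and `append_new`, the reduced two-column header, and the sample records are all hypothetical and exist only for this illustration. The idea is to read the keys already stored in the CSV, then append only records whose key has not been seen.

```python
import csv
import os
import tempfile

HEADER = ["Name", "P-nummer"]  # reduced header, for illustration only


def load_seen(csv_path, key="P-nummer"):
    """Return the set of keys already stored in csv_path (empty if no file)."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, "r", encoding="utf-8", newline="") as f:
        return {row[key] for row in csv.DictReader(f)}


def append_new(csv_path, records, key="P-nummer"):
    """Append only records whose key is not in the file yet.

    Returns the number of rows actually written.
    """
    seen = load_seen(csv_path, key)
    write_header = not os.path.exists(csv_path)
    written = 0
    with open(csv_path, "a", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=HEADER)
        if write_header:
            writer.writeheader()
        for record in records:
            if record[key] in seen:
                continue  # already saved in a previous run
            writer.writerow(record)
            seen.add(record[key])
            written += 1
    return written


# Simulate two runs with overlapping, made-up records.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_companies.csv")
first_run = append_new(demo_path, [{"Name": "Bar A", "P-nummer": "1001"},
                                   {"Name": "Bar B", "P-nummer": "1002"}])
second_run = append_new(demo_path, [{"Name": "Bar B", "P-nummer": "1002"},
                                    {"Name": "Bar C", "P-nummer": "1003"}])
print(first_run, second_run)
```

The second call writes only the record that was not already present, which is what allows an interrupted scraping run to be restarted safely.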


In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import pandas as pd
import time
import csv
import os

# === CONFIGURATION ===
driver_path = r"C:\programmi\chromedriver\chromedriver.exe"  # raw string: single backslashes; adjust to your local chromedriver path
options = Options()
options.add_argument("--window-size=1920,1080")

driver = webdriver.Chrome(service=Service(driver_path), options=options)


# The loop can run for one branchekode at a time, or for all of them at once by uncommenting the entries below.
# In our project, we enabled one branchekode per run, producing a separate CSV file for each.
# This kept each run short and reduced the risk of IP bans or website crashes.
# The per-branchekode CSV files were then merged into a single CSV file.
branchekodes = [
    # 561110,   # serving food in restaurants and cafes
    # 561190,   # includes the operation of restaurants, where the main emphasis is on takeaway with very limited table service.
    # 563010,   # includes serving beverages, possibly with some edibles, but where the main emphasis is on serving non-alcoholic beverages for immediate consumption on site.
    563020,     #  includes serving beverages, possibly with some edibles, but where the main emphasis is on serving alcoholic beverages for immediate consumption on the premises.
]

# Scraping Structure overview:
# - Main search results are loaded via URL with parameters: sideIndex (pagination), branchekode (industry), etc.
# - Each company entry is a 'div.row' within a 'div[data-cy="soegeresultater-tabel"]'.
# - Basic info (name, address, P-nummer, status, company type) is extracted directly from the search results.
# - For each company, the script follows the "Show More" link in a new browser tab to extract Start date (Startdato)
#   and End date (Ophørsdato), which appear in divs following label tags (either <strong> or <span>).
for branchekode in branchekodes:
    page = 0
    csv_file_path = f"scraped_companies_{branchekode}.csv"
    header = ["Name", "Address", "P-nummer", "Status", "Company Type", "Startdate", "Enddate"]
    pnummer_seen = set()

    # Load existing P-numbers if the file already exists to avoid duplicates
    file_exists = os.path.exists(csv_file_path)
    if file_exists:
        with open(csv_file_path, "r", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            for row in reader:
                pnummer_seen.add(row["P-nummer"])
    else:
        # Create file and write header
        with open(csv_file_path, "w", newline='', encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=header)
            writer.writeheader()

    while True:
        url = f"https://datacvr.virk.dk/soegeresultater?sideIndex={page}&enhedstype=produktionsenhed&region=29190623&branchekode={branchekode}"
        print(f"Scraping page {page}")
        driver.get(url)
        time.sleep(2)

        # Get all company rows from the result table
        rows = driver.find_elements(By.CSS_SELECTOR, 'div[data-cy="soegeresultater-tabel"] > div.row')

        if not rows:
            print("No data found. Stopping.")
            break

        # Process each company in the current page
        for row in rows:
            try:
                name = row.find_element(By.CSS_SELECTOR, "span.bold.value").text.strip()

                address_block = row.find_element(By.CSS_SELECTOR, "div.col-12.col-lg-4")
                address_lines = address_block.text.strip().split("\n")[-2:]
                address = ", ".join(address_lines)

                pnummer = row.find_element(By.XPATH, './/div[div[text()="P-nummer:"]]/div[2]').text.strip()

                # Skip if already saved
                if pnummer in pnummer_seen:
                    continue
                pnummer_seen.add(pnummer)

                status = row.find_element(By.XPATH, './/div[div[text()="Status:"]]/div[2]').text.strip()
                form = row.find_element(By.XPATH, './/div[div[text()="Virksomhedsform:"]]/div[2]').text.strip()

                link_elem = row.find_element(By.CSS_SELECTOR, 'div[data-cy="vis-mere"] a')
                link = link_elem.get_attribute("href")

                # Open detail page in new tab
                driver.execute_script("window.open('');")
                driver.switch_to.window(driver.window_handles[1])

                # Dates default to empty strings when a label is missing
                startdato = ""
                ophoersdato = ""

                try:
                    driver.get(link)
                    time.sleep(3)

                    # Extract start and end dates from the detail page
                    try:
                        startdato_element = driver.find_element(
                            By.XPATH, '//div[(strong[text()="Startdato"] or span[text()="Startdato"])]/following-sibling::div'
                        )
                        startdato = startdato_element.text.strip()
                    except Exception:
                        startdato = ""

                    try:
                        ophoersdato_element = driver.find_element(
                            By.XPATH, '//div[(strong[text()="Ophørsdato"] or span[text()="Ophørsdato"])]/following-sibling::div'
                        )
                        ophoersdato = ophoersdato_element.text.strip()
                    except Exception:
                        ophoersdato = ""
                finally:
                    # Always close the detail tab and return to the main results tab,
                    # even if loading the detail page failed
                    driver.close()
                    driver.switch_to.window(driver.window_handles[0])

                # Write to CSV
                with open(csv_file_path, "a", newline='', encoding="utf-8") as f:
                    writer = csv.DictWriter(f, fieldnames=header)
                    writer.writerow({
                        "Name": name,
                        "Address": address,
                        "P-nummer": pnummer,
                        "Status": status,
                        "Company Type": form,
                        "Startdate": startdato,
                        "Enddate": ophoersdato
                    })

                print(f"{name} | {startdato} → {ophoersdato}")

            except Exception as e:
                print("Error during parsing:", e)
                continue

        page += 1
        time.sleep(1)

driver.quit()
print("Scraping finished.")


## 2nd Step - Geocoding Restaurant Addresses Using OpenStreetMap API
This script geocodes addresses with OpenStreetMap's Nominatim service, accessed through the geopy library. It takes as input a CSV file of restaurant records that have address fields but no geographic coordinates. For each address, it attempts to retrieve the corresponding latitude and longitude, and it appends each processed row to a new CSV file so that an interrupted run can be resumed.
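
The geocoding flow described above can be sketched without any network access by passing the lookup function in as a parameter. This is an illustrative sketch under stated assumptions: `geocode_all`, `fake_geocode`, and the coordinates in `FAKE_INDEX` are made up for the example. In real use, `geocode_fn` could wrap geopy's rate-limited `geocode` call and return `(location.latitude, location.longitude)`.

```python
def simplify_address(address):
    """Keep the street part before the first comma and append the country."""
    street = str(address).split(",")[0].strip()
    return f"{street}, Denmark"


def geocode_all(addresses, geocode_fn):
    """Map each address to (lat, lon), or None if the lookup fails or finds nothing.

    geocode_fn is any callable that takes a query string and returns
    a (lat, lon) tuple or None.
    """
    results = {}
    for address in addresses:
        if address in results:
            continue  # skip repeated addresses
        try:
            results[address] = geocode_fn(simplify_address(address))
        except Exception:
            # e.g. a timeout from the geocoding service
            results[address] = None
    return results


# Stand-in for a real geocoder, with made-up coordinates.
FAKE_INDEX = {"Nyhavn 1, Denmark": (55.68, 12.59)}


def fake_geocode(query):
    return FAKE_INDEX.get(query)


coords = geocode_all(
    ["Nyhavn 1, 1051 København K", "Unknown Street 99, 9999 Nowhere"],
    fake_geocode,
)
print(coords)
```

Keeping failed lookups as `None` rather than dropping them makes it easy to see afterwards which addresses still lack coordinates.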

In [None]:
import pandas as pd
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
import os

# === CONFIGURATION ===
input_file = "scraped_companies_combined_clean.csv"
output_file = "scraped_companies_combined_clean_with_coords.csv"

# === LOAD ORIGINAL DATA ===
df = pd.read_csv(input_file)

# Function to simplify the address before geocoding
def simplify_address(row):
    addr = str(row["Address"])
    addr = addr.split(",")[0].strip()  # only keep the part before the first comma
    return f"{addr}, Denmark"

# Add coordinate columns if they don't exist
if "latitude" not in df.columns:
    df["latitude"] = None
if "longitude" not in df.columns:
    df["longitude"] = None

# Load already geocoded addresses to avoid duplicates
already_done = set()
if os.path.exists(output_file):
    df_existing = pd.read_csv(output_file)
    already_done = set(df_existing["Address"].dropna().unique())
    print(f"Resuming from {len(already_done)} already completed addresses.")

# Filter rows that still need geocoding
df_to_process = df[~df["Address"].isin(already_done)].copy()
print(f"Addresses to geocode: {len(df_to_process)}")

# Initialize OpenStreetMap geocoder with delay to respect rate limits
geolocator = Nominatim(user_agent="stayin_alive_simple_geocoder")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1.5)

# Save progressively to CSV (append mode, so an interrupted run can be resumed)
with open(output_file, "a", encoding="utf-8", newline="") as f_out:
    # Write the header only when the output file is new or empty
    write_header = os.stat(output_file).st_size == 0
    for i, row in df_to_process.iterrows():
        full_address = simplify_address(row)
        try:
            location = geocode(full_address)
            if location:
                row["latitude"] = location.latitude
                row["longitude"] = location.longitude
                print(f"{full_address} -> ({location.latitude}, {location.longitude})")
            else:
                print(f"{full_address} -> not found")
        except Exception as e:
            # Skip this row without saving it, so it is retried on the next run
            print(f"Error on {full_address}: {e}")
            continue

        # Append row to output CSV
        pd.DataFrame([row]).to_csv(f_out, index=False, header=write_header)
        write_header = False


## 3rd Step - Scraping Restaurant Metadata from Google Maps with Selenium
This script uses Selenium to scrape Google Maps and enrich the restaurant dataset obtained in steps 1-2. For each restaurant entry (name and address), the script:

1. Opens a Google Maps search page.

2. Extracts the official listing title, star rating, number of reviews, price level, and associated category tags, if present.

3. Saves the collected data into a CSV file.

The script supports resuming interrupted sessions by skipping entries that have already been saved to the output CSV file.

Required: chromedriver installed and path correctly set.
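
Opening the search page requires turning a name and an address into a URL. Names can contain characters such as `&`, `/`, or `#` that change the meaning of a URL if left unescaped, so one option is to percent-encode the query with the standard library's `quote_plus`. The helper `maps_search_url` and the sample venue are hypothetical and serve only as an illustration.

```python
from urllib.parse import quote_plus


def maps_search_url(name, address):
    """Build a Google Maps search URL with the query percent-encoded."""
    query = quote_plus(f"{name} {address}")
    return f"https://www.google.com/maps/search/{query}"


# Made-up venue with characters that need escaping
url = maps_search_url("Café & Bar", "Nørrebrogade 1, 2200 København N")
print(url)
```

Spaces become `+`, while `&`, commas, and non-ASCII letters are percent-encoded, so the whole name and address reach the search as a single query.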



In [None]:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time
import csv
import os

# === CONFIGURATION ===
driver_path = r"C:\Program Files\chromedriver\chromedriver.exe"
options = Options()
options.add_argument("--window-size=1920,1080")
# options.add_argument("--headless")  # Uncomment to run without opening browser window

driver = webdriver.Chrome(service=Service(driver_path), options=options)

# === INPUT & OUTPUT PATHS ===
csv_input_path = r"C:\Users\Admin\Documents\HCAI\ADVANCED_BUSINESS_ANALYTICS\StayingAlive\StayingAlive\src\scraping_correct\scraped_companies_563020_notactive.csv"
csv_output_path = r"C:\Users\Admin\Documents\HCAI\ADVANCED_BUSINESS_ANALYTICS\StayingAlive\StayingAlive\src\scraping_correct\maps_data_scraped.csv"

# Load input data
df_input = pd.read_csv(csv_input_path)
restaurant_data = df_input.to_dict(orient="records")

# Load already saved entries (if output file exists)
saved_entries = set()
header = ["Input Name", "Input Address", "Title", "Rating", "Reviews", "Price Level", "Tags"]

if os.path.exists(csv_output_path):
    with open(csv_output_path, "r", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            key = (row["Input Name"], row["Input Address"])
            saved_entries.add(key)
else:
    with open(csv_output_path, "w", newline='', encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=header)
        writer.writeheader()

# === MAIN LOOP OVER RESTAURANTS ===
for entry in restaurant_data:
    name = entry["Name"]
    address = entry["Address"]
    key = (name, address)

    if key in saved_entries:
        continue

    try:
        print(f"Searching: {name} @ {address}")
        query = f"{name} {address}".replace(" ", "+")
        linkmaps = f"https://www.google.com/maps/search/{query}"
        print(f"URL: {linkmaps}")
        driver.get(linkmaps)
        time.sleep(2)

        # Each field is optional: fall back to an empty string when its element is missing.
        # Catching Exception (not a bare except) keeps Ctrl+C able to stop the script.
        try:
            title = driver.find_element(By.CSS_SELECTOR, 'h1.DUwDvf').text
        except Exception:
            title = ""

        try:
            rating = driver.find_element(By.CSS_SELECTOR, 'div.F7nice > span span[aria-hidden="true"]').text
        except Exception:
            rating = ""

        try:
            reviews_text = driver.find_element(By.CSS_SELECTOR, 'div.F7nice > span span[aria-label$="reviews"]').text
            reviews = reviews_text.strip("()")
        except Exception:
            reviews = ""

        try:
            price_level = driver.find_element(By.CSS_SELECTOR, 'div.DfOCNb.fontBodyMedium > div').text.split('\n')[0]
        except Exception:
            price_level = ""

        try:
            outer_divs = driver.find_elements(By.CSS_SELECTOR, "div.KNfEk.aUjao")
            tags = []
            for div in outer_divs:
                try:
                    tag = div.find_element(By.CSS_SELECTOR, "div.tXNTee span.uEubGf.fontBodyMedium").text
                    tags.append(tag)
                except Exception:
                    continue
            tags = ", ".join(tags)
        except Exception:
            tags = ""

        # Append to CSV immediately
        with open(csv_output_path, "a", newline='', encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=header)
            writer.writerow({
                "Input Name": name,
                "Input Address": address,
                "Title": title,
                "Rating": rating,
                "Reviews": reviews,
                "Price Level": price_level,
                "Tags": tags
            })

        print(f"Saved: {title}")

    except Exception as e:
        print("Error during scraping:", e)
        continue

driver.quit()
print("Scraping finished.")


## 4th Step - 1st Heatmap: an Interactive Spatial Visualization of Restaurants/Bars and Pedestrian Traffic in Copenhagen

This script generates an interactive Folium heatmap that visualizes restaurant/bar locations and pedestrian traffic in Copenhagen. It combines multiple layers of spatial data to support exploratory analysis for business location decisions. 
This initial multi-layer heatmap serves as the visual foundation onto which the survival analysis scores are later overlaid as an additional layer. With all layers combined, it becomes possible, for example, to compare areas of high restaurant density and high pedestrian density with predicted survival outcomes, helping to identify not only where restaurants are concentrated but also where they are most likely to succeed over time.
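
One simple way to put numbers on "areas of high restaurant density" is to bin coordinates into a regular grid and count the venues per cell. This sketch is an assumption introduced for illustration, not part of the project's pipeline: the functions `grid_cell` and `density_by_cell`, the cell size of 0.005 degrees, and the sample points are all hypothetical.

```python
from collections import Counter


def grid_cell(lat, lon, cell_size=0.005):
    """Index of the square grid cell (in degrees) that contains the point."""
    return (int(lat // cell_size), int(lon // cell_size))


def density_by_cell(points, cell_size=0.005):
    """Count how many (lat, lon) points fall into each grid cell."""
    return Counter(grid_cell(lat, lon, cell_size) for lat, lon in points)


# Made-up coordinates: two venues close together, one further away
points = [
    (55.6761, 12.5683),
    (55.6762, 12.5684),
    (55.7012, 12.6012),
]
density = density_by_cell(points)
busiest_cell, count = density.most_common(1)[0]
print(busiest_cell, count)
```

The same cell index could be computed for pedestrian counting points, which would allow restaurant counts and foot traffic to be compared cell by cell.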

This first heatmap combines the restaurant/bar dataset produced in the first two steps with the Pedestrian Dataset downloaded from OpenData.dk.

Code key functionalities:
- Restaurant Heatmap: Shows the density of restaurant locations.
- Longevity Heatmap: Visualizes how long restaurants have stayed open, based on registration and closure dates.
- Status Markers: Differentiates between currently active and closed restaurants with green and red markers.
- Branchekode Filter: Allows filtering restaurants by industry classification code (branchekode).
- Pedestrian Traffic Heatmap: Displays average daily foot traffic (7 AM–7 PM) from official measurements.
- Peak Hour Traffic Circles: Highlights high-density areas during peak foot traffic (7 AM–7 PM) using proportional red circles.
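
The longevity layer listed above relies on a duration per venue. A minimal sketch of that calculation follows, using only the standard library: venues without a closure date are measured up to a reference day, and the resulting day counts are min-max scaled to the range 0 to 1. The scaling step is an assumption added here, on the grounds that raw day counts in the thousands may saturate a heatmap whose intensity scale tops out at 1. The names `longevity_days` and `normalise` and the sample dates are hypothetical.

```python
from datetime import date


def longevity_days(start, end=None, today=None):
    """Days in business; venues still open (end is None) are measured up to `today`."""
    today = today or date.today()
    return ((end or today) - start).days


def normalise(values):
    """Min-max scale values to [0, 1] so they can serve as heatmap weights."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


reference_day = date(2025, 1, 1)  # fixed date so the example is reproducible
venues = [
    (date(2015, 1, 1), date(2020, 1, 1)),  # closed after five years
    (date(2024, 1, 1), None),              # still open
    (date(2020, 1, 1), date(2022, 1, 1)),  # closed after two years
]

days = [longevity_days(start, end, today=reference_day) for start, end in venues]
weights = normalise(days)
print(days, weights)
```

Note that an open venue's duration is only a lower bound on its eventual lifetime, so recently opened venues receive low weights even though they have not failed.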

The result is an interactive map saved as an HTML file, enabling users to toggle layers, explore patterns, and identify high-potential zones for business development.

In [None]:
import pandas as pd
import folium
from folium.plugins import HeatMap

# Load restaurant dataset with coordinates
restaurants_df = pd.read_csv("scraped_companies_combined_clean_with_coords.csv")
restaurants_df = restaurants_df.dropna(subset=['latitude', 'longitude'])

# Load pedestrian traffic dataset with coordinates
traffic_df = pd.read_csv("foot_trafic.csv")
traffic_df = traffic_df.dropna(subset=['lat', 'lon'])

# Preprocessing longevity
restaurants_df['startdate'] = pd.to_datetime(restaurants_df['startdate'], errors='coerce')
restaurants_df['enddate'] = pd.to_datetime(restaurants_df['enddate'], errors='coerce')
restaurants_df['enddate_filled'] = restaurants_df['enddate'].fillna(pd.Timestamp.today())
restaurants_df['longevity_days'] = (restaurants_df['enddate_filled'] - restaurants_df['startdate']).dt.days

# Initialize the map centered on Copenhagen
map_ = folium.Map(location=[55.6761, 12.5683], zoom_start=13)

# --- HEATMAP: Restaurants ---
heat_points = restaurants_df[['latitude', 'longitude']].values.tolist()
heatmap_layer = folium.FeatureGroup(name="Restaurants Heatmap")
HeatMap(heat_points, radius=10, blur=15).add_to(heatmap_layer)
heatmap_layer.add_to(map_)

# --- HEATMAP: Longevity ---
longevity_points = restaurants_df[['latitude', 'longitude', 'longevity_days']].dropna().values.tolist()
longevity_layer = folium.FeatureGroup(name="Restaurants Longevity Heatmap", show=False)
HeatMap(longevity_points, radius=15, blur=25, max_zoom=14).add_to(longevity_layer)
longevity_layer.add_to(map_)

# --- MARKERS: Active / Closed Restaurants ---
active_layer = folium.FeatureGroup(name="Active Restaurants", show=False)
closed_layer = folium.FeatureGroup(name="Closed Restaurants", show=False)

for _, row in restaurants_df.iterrows():
    popup = folium.Popup(
        f"<b>{row.get('name', 'N/A')}</b><br>"
        f"Business Code: {row.get('branchekode', 'N/A')}<br>"
        f"Status: {row.get('status', 'N/A')}<br>"
        f"Opening Date: {row.get('startdate', 'N/A')}<br>"
        f"Postal Code: {row.get('zip', 'N/A')}",
        max_width=300
    )
    marker = folium.Marker(
        location=[row['latitude'], row['longitude']],
        popup=popup,
        icon=folium.Icon(color="green" if row.get('active', False) else "red")
    )
    if row.get('active', False):
        marker.add_to(active_layer)
    else:
        marker.add_to(closed_layer)

active_layer.add_to(map_)
closed_layer.add_to(map_)

# --- FILTER: Branchekode ---
for branche in restaurants_df['branchekode'].dropna().unique():
    layer = folium.FeatureGroup(name=f"Branchekode: {branche}", show=False)
    for _, row in restaurants_df[restaurants_df['branchekode'] == branche].iterrows():
        popup = folium.Popup(
            f"<b>{row.get('name', 'N/A')}</b><br>"
            f"Branchekode: {row.get('branchekode', 'N/A')}<br>"
            f"Status: {row.get('status', 'N/A')}<br>"
            f"Startdate: {row.get('startdate', 'N/A')}<br>"
            f"ZIP: {row.get('zip', 'N/A')}",
            max_width=300
        )
        folium.CircleMarker(
            location=[row['latitude'], row['longitude']],
            radius=4,
            color="blue",
            fill=True,
            fill_opacity=0.6,
            popup=popup
        ).add_to(layer)
    layer.add_to(map_)

# --- PEDESTRIAN TRAFFIC: HEATMAP aadt_fod_7_19 ---
heat_traffic_points = traffic_df[['lat', 'lon', 'aadt_fod_7_19']].dropna().values.tolist()
heatmap_ped_layer = folium.FeatureGroup(name="Pedestrian Heatmap (7-19)")
HeatMap(heat_traffic_points, radius=15, blur=25, max_zoom=14).add_to(heatmap_ped_layer)
heatmap_ped_layer.add_to(map_)

# --- PEDESTRIAN TRAFFIC: CIRCLE LAYER hvdt_fod_7_19 ---
circle_layer = folium.FeatureGroup(name="Pedestrian Peak Hour 7-19", show=False)
for _, row in traffic_df.iterrows():
    value = row.get('hvdt_fod_7_19')
    if pd.notna(value):
        radius = value / 500  # scaling factor
        popup = folium.Popup(
            f"<b>{row.get('vejnavn', '')}</b><br>"
            f"Peak Hour 7-19: {int(value)}<br>"
            f"Description: {row.get('beskrivelse', '')}<br>"
            f"Date: {row.get('taelle_dato', '')}",
            max_width=300
        )
        folium.CircleMarker(
            location=[row['lat'], row['lon']],
            radius=radius,
            color="red",
            fill=True,
            fill_opacity=0.5,
            popup=popup
        ).add_to(circle_layer)
circle_layer.add_to(map_)

# Add layer control to enable toggling layers
folium.LayerControl(collapsed=False).add_to(map_)

# Save the final map
map_.save("interactive_map_with_filtered_traffic.html")
