# FindOpenSimulationModels

An experiment to find simulation models such as FMU and Modelica files on the open internet. I am curious how prevalent they are and whether they have inputs and outputs that would be suitable for reinforcement learning environments.

## FMUs on GitHub

Let's start by looking at FMU files that exist in GitHub repositories.

### Manual search

We can enter `extension:fmu` in the GitHub search box and choose *All GitHub*, which produces the query https://github.com/search?q=extension%3Afmu&type=code. This returned 10,841 code results, so there are thousands of FMU files out there. However, collecting them by hand would mean hitting the *Next* button to page through a few files at a time and copying their URLs from the web page.

### GitHub API search

Next, let's try doing this programmatically with the [GitHub search API](https://docs.github.com/en/rest/search). Unfortunately, this doesn't seem to be possible based on this [Reddit](https://www.reddit.com/r/github/comments/dr19uu/finding_all_files_with_a_certain_extension/) and [Stack Overflow](https://stackoverflow.com/questions/58673751/find-all-files-with-certain-filetype-on-github) discussion from three years ago. My results were the same.

Here's what I tried (TL;DR: it didn't work):

A **repository** search of `https://api.github.com/search/repositories?q=extension:fmu` only returns one result. It did indeed find a [repository](https://github.com/INTO-CPS-Association/distributed-maestro-fmu) that contains a [singlewatertank-20sim.fmu](https://github.com/INTO-CPS-Association/distributed-maestro-fmu/blob/95922d63eb50c17609320c180319f23d17173c7f/bundle/src/test/resources/singlewatertank-20sim.fmu) file. But there should be many more repositories. It seems that the `extension` qualifier is doing something but, if it works at all, it is only scanning a small subset of GitHub.

The [repository API doc](https://docs.github.com/en/rest/search?apiVersion=2022-11-28#search-repositories) says to see [Searching for repositories](https://docs.github.com/en/search-github/searching-on-github/searching-for-repositories) for a detailed list of qualifiers. In that documentation, [Search based on the contents of a repository](https://docs.github.com/en/search-github/searching-on-github/searching-for-repositories#search-based-on-the-contents-of-a-repository) states:

> Besides using in:readme, it's not possible to find repositories by searching for specific content within the repository. To search for a specific file or content within a repository, you can use the file finder or code-specific search qualifiers.

A **code** search of `https://api.github.com/search/code?q=extension:fmu` returns a `Validation Failed` error with `Must include at least one user, organization, or repository`. So it seems it is not possible to search all of GitHub in this way. In the documentation, [Considerations for code search](https://docs.github.com/en/rest/search?apiVersion=2022-11-28#considerations-for-code-search) states:

> * Only files smaller than 384 KB are searchable.
> * ...
> * You must always include at least one search term when searching source code. For example, searching for language:go is not valid, while amazing language:go is.

This will not work for FMU files because we want all of them (not just FMUs containing some particular search term), they are larger than 384 KB, and they are in a binary (zip) format that wouldn't work with a text search.
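The two API attempts above can be reproduced with a short script. This is only a sketch: the helper names `build_search_url`, `describe_response`, and `search` are hypothetical, and the shape of the error payload (a top-level `message` plus an `errors` list whose items carry their own `message`) is an assumption based on the error text quoted above.

```python
import urllib.parse

API_ROOT = 'https://api.github.com/search'

def build_search_url(kind, query):
    """Return the REST search URL for kind 'repositories' or 'code'."""
    if kind not in ('repositories', 'code'):
        raise ValueError(f'unsupported search kind: {kind}')
    return f'{API_ROOT}/{kind}?' + urllib.parse.urlencode({'q': query})

def describe_response(status_code, payload):
    """Summarize a decoded JSON search response in one line."""
    if 'total_count' in payload:
        return f"{status_code}: {payload['total_count']} results"
    # Assumed error shape: {'message': ..., 'errors': [{'message': ...}, ...]}
    details = '; '.join(
        error.get('message', str(error)) if isinstance(error, dict) else str(error)
        for error in payload.get('errors', [])
    )
    summary = f"{status_code}: {payload.get('message', 'unknown error')}"
    return f'{summary} ({details})' if details else summary

def search(kind, query):
    """Run one live search request (needs network access and the requests package)."""
    import requests  # imported here so the helpers above work without it
    response = requests.get(build_search_url(kind, query), timeout=30)
    return describe_response(response.status_code, response.json())

print(build_search_url('repositories', 'extension:fmu'))
print(build_search_url('code', 'extension:fmu'))
```

Calling `search('code', 'extension:fmu')` should then report the same `Validation Failed` message described above, assuming the API still behaves the way it did at the time of writing.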

In the [Reddit thread](https://www.reddit.com/r/github/comments/dr19uu/comment/f6ezx4e/?utm_source=share&utm_medium=web2x&context=3) OP Gasp0de also looked into using [GH Archive](https://www.gharchive.org/), but it doesn't look like GH Archive includes an event type with file information about the contents of repositories or commits.

### Scraping

Based on the [Information Usage Restrictions](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions), it appears that web scraping of GitHub is permitted:

> You may use information from our Service for the following reasons, regardless of whether the information was scraped, collected through our API, or obtained otherwise:
> 
> * Researchers may use public, non-personal information from the Service for research purposes, only if any publications resulting from that research are open access.
> * Archivists may use public information from the Service for archival purposes.

We are researching the nature of FMU files that are present on GitHub with the intent to create an archive or index of them, so that seems to fit. We don't expect it to be a tremendous amount of data and we won't be spamming anyone. Let's give it a try.

In [1]:
import requests
search_url = 'https://github.com/search?q=extension%3Afmu&type=code'
print(f'Getting: {search_url}')
search = requests.get(search_url)
print(f'Result: status code = {search.status_code}, url = {search.url}')
# print response lines containing the string <title>
for line in search.text.splitlines():
    if '<title>' in line:
        print(line)

Getting: https://github.com/search?q=extension%3Afmu&type=code
Result: status code = 200, url = https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fsearch%3Fq%3Dextension%253Afmu%26type%3Dcode
  <title>Sign in to GitHub · GitHub</title>
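The status code is 200 even though we were bounced to the sign-in page, so the final URL is a more direct signal than the page title. A minimal sketch of that check follows; the helper name `login_redirect_target` is hypothetical, and it assumes the redirect keeps the `/login?return_to=...` form seen in the output above.

```python
from urllib.parse import urlparse, parse_qs

def login_redirect_target(final_url):
    """Return the page GitHub wants us to sign in for, or None if final_url
    is not a github.com/login redirect carrying a return_to parameter."""
    parts = urlparse(final_url)
    if parts.netloc != 'github.com' or parts.path != '/login':
        return None
    # parse_qs undoes one level of percent-encoding on the parameter value
    return parse_qs(parts.query).get('return_to', [None])[0]

redirected = ('https://github.com/login?return_to='
              'https%3A%2F%2Fgithub.com%2Fsearch%3Fq%3Dextension%253Afmu%26type%3Dcode')
print(login_redirect_target(redirected))
```

With the `requests` response from the cell above, this would be called as `login_redirect_target(search.url)`.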


In [2]:
# It is requiring us to log in, so we won't just be able to get the results by fetching URLs. Let's try scraping the page with selenium.

# While developing, limit the amount of data that is downloaded.
# Set to False when ready to download all the data.
is_testing = False

import time
import os
from IPython.display import clear_output
from collections import namedtuple
from sortedcontainers import SortedSet
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import urllib.parse

# Create a new instance of the Chrome browser
driver = webdriver.Chrome()

# Navigate to the GitHub website
print(f'Opening: {urllib.parse.unquote(search_url)}')
driver.get(search_url)

# It might be a good idea to automate the user login (if that's possible), but for now do it manually
print('Use the web browser window to log in to GitHub...')

# Wait for the URL to change to the search page
WebDriverWait(driver, 180).until(EC.url_to_be(search_url))

print('First page of search results loaded')

# Find the heading containing the count of all search results
h3_elements = driver.find_elements(By.CSS_SELECTOR, 'h3')
for h3_element in h3_elements:
    if 'results' in h3_element.text:
        print(h3_element.text)

# Get number of pages of results
current_em = driver.find_element(By.CSS_SELECTOR, 'em.current')
page_count = int(current_em.get_attribute('data-total-pages'))
print(f'Found {page_count} pages of results')
if is_testing:
    print(f'Limiting to 3 pages for testing purposes')
    page_count = min(page_count, 3)

Opening: https://github.com/search?q=extension:fmu&type=code
Use the web browser window to log in to GitHub...
First page of search results loaded
10,841 code results
Found 100 pages of results


In [3]:
# Create file to hold results
class ResultStore:
    def __init__(self, filename):
        self.filename = filename
        self.results = SortedSet()
        if os.path.exists(self.filename):
            with open(self.filename, 'r') as f:
                for line in f:
                    self.results.add(line.strip())
        self.new_results = 0
        self.preexisting_results = 0

    def add_result(self, result):
        if result in self.results:
            self.preexisting_results += 1
            return
        self.results.add(result)
        self.new_results += 1

    def print_stats(self):
        print(f'This scan has found {self.new_results} new FMUs, {self.preexisting_results} already known FMUs')
        print(f'The entire collection now has {len(self.results)} FMUs')

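    def save(self):
        # Make sure the output folder exists before writing
        os.makedirs(os.path.dirname(self.filename) or '.', exist_ok=True)
        with open(self.filename, 'w') as f:
            for result in self.results:
                f.write(result + '\n')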

result_store = ResultStore('results/github-fmu-search.txt')

# Function to process the current page of results
def scrape_page_results():
    # Get the list items containing the search results (divs with class "code-list-item")
    item_divs = driver.find_elements(By.CSS_SELECTOR, 'div[class*="code-list-item"]')
    for item_div in item_divs:
        # Get the link to the FMU file (not the secondary one to the repository)
        item_links = item_div.find_elements(By.CSS_SELECTOR, 'a:not(.Link--secondary)')
        if (len(item_links) != 1):
            print(f'Warning: Parsing problem. Search result item contains {len(item_links)} links, expected just 1 link to the FMU file. Something may have changed on the GitHub website.')
        for link in item_links:
            result_store.add_result(link.get_attribute("href"))

    expected_number_of_results = 10
    if len(item_divs) < expected_number_of_results:
        print(f'Warning: Search page only has {len(item_divs)} items, expected {expected_number_of_results} items.')
        return False
    return True

# It looks like we're limited to 100 pages so, unfortunately, we won't be able to get all the results using this method.
# We'll try using different search orders (best/indexed, ascending/descending) to give us different subsets of results.
# Best match seems to return results inconsistently. Not sure if ascending/descending has an effect, but we'll try both.
max_sort_order = 4
def get_search_page_url(page, sort_order, extension='fmu'):
    assert 1 <= sort_order <= max_sort_order, f'order must be between 1 and {max_sort_order}'
    index = sort_order - 1

    order_options = [
        'asc', # ascending
        'desc', # descending
    ]
    order = order_options[index % 2]

    sort_options = [
        'indexed', # recently indexed
        '', # best match
    ]
    sort = sort_options[index // 2]

    url = f'https://github.com/search?o={order}&p={page}&q=extension%3A{extension}&s={sort}&type=Code'
    return url

# Strangely, search pages sometimes fail to load, returning 0 or a small number of results. Retry a few times and track how often this occurs.
retry_count = 0
max_retry_count = 1 if is_testing else 15
PageRetryRecord = namedtuple('PageRetryRecord', ['page', 'retry_count', 'succeeded'])
page_retry_data = []

# Process each sort order
for sort_order in range(1, max_sort_order + 1):
    # Process each page of the search results
    # Note that we're assuming each search order has the same number of pages. That might not be correct,
    # but in practice we always seem to be hitting a limit of 100 pages, so it shouldn't matter.
    for current_page in range(1, page_count + 1):
        page_url = get_search_page_url(current_page, sort_order)

        # Load the page, repeating the same page until it succeeds or we run out of retries
        retry_count = 0
        while True:
            clear_output(wait=True)
            print(f'Scraping page {current_page}/{page_count} order {sort_order}/{max_sort_order}: {urllib.parse.unquote(page_url)}')
            result_store.print_stats()
            if retry_count > 0:
                print(f'Retrying ({retry_count}/{max_retry_count})')

            driver.get(page_url)
            WebDriverWait(driver, 10).until(EC.url_to_be(page_url))
            succeeded = scrape_page_results()

            # Avoid hitting GitHub with too many rapid-fire requests.
            # GitHub defines rate limits for the API, such as 10 requests per minute for unauthenticated requests.
            # But https://api.github.com/rate_limit doesn't seem to be affected by scraping the web site and it
            # isn't clear how rate limits are handled. Let's be cautious: a short delay when testing, and
            # 20 seconds when downloading all the data (about the time a human would take to read each page).
            sleep_time = 3 if is_testing else 20
            time.sleep(sleep_time)

            if succeeded or retry_count >= max_retry_count:
                break
            retry_count += 1

        # Record the outcome and move on to the next page
        page_retry_data.append(PageRetryRecord(current_page, retry_count, succeeded))

        # Save results to file every so often
        save_after_number_of_pages = 2 if is_testing else 10
        if current_page % save_after_number_of_pages == 0:
            result_store.save()

result_store.save()

clear_output(wait=True)
print(f'Done scraping {page_count} pages * {max_sort_order} orders')
result_store.print_stats()
print()
print("Retries:")
for i in range(0, max_retry_count+1):
    # Count only pages that actually succeeded; failed pages are reported separately below
    print(f'  succeeded after {i} retries: {sum(1 for x in page_retry_data if x.succeeded and x.retry_count == i)} pages')
print(f'  failed: {sum(1 for x in page_retry_data if not x.succeeded)} pages')

Scraping page 1/100 order 0/3: https://github.com/search?o=asc&p=1&q=extension:fmu&s=indexed&type=Code
This scan has found 0 new FMUs, 0 already known FMUs
The entire collection now has 2548 FMUs


KeyboardInterrupt: 
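The page cap bounds how much of the result set this approach can ever reach. A back-of-the-envelope sketch using the numbers reported above (10,841 results, 100 pages, 10 items per page, 4 orderings) follows; the function name `coverage_upper_bound` is hypothetical, and the bound optimistically assumes the orderings return no overlapping results.

```python
def coverage_upper_bound(total_results, max_pages, results_per_page, orderings):
    """Best-case number of distinct results reachable through paged search."""
    per_ordering = max_pages * results_per_page
    reachable = min(total_results, per_ordering * orderings)
    return per_ordering, reachable, reachable / total_results

per_ordering, reachable, fraction = coverage_upper_bound(10841, 100, 10, 4)
print(f'{per_ordering} results per ordering, at most {reachable} overall ({fraction:.0%} of the total)')
# prints: 1000 results per ordering, at most 4000 overall (37% of the total)
```

So even in the best case, a single pass over all four orderings can see only a little over a third of the FMU files that the search reports.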

In [None]:
# Close the browser
driver.quit()

In [4]:
# It looks like we're limited to 100 pages so, unfortunately, we won't be able to get all the results using this method.
# We'll try using different search orders (order = 0-3 for best/indexed ascending/descending) to give us different subsets of results.
# Best match seems to return results inconsistently. Not sure if ascending/descending has an effect, but we'll try both.
def get_search_page_url(page, order, extension='fmu'):
    assert 0 <= order <= 3, "order must be in the range 0-3"

    order_options = [
        'asc', # ascending
        'desc', # descending
    ]
    o = order_options[order % 2]

    sort_options = [
        '', # best match
        'indexed', # recently indexed

    ]
    s = sort_options[order // 2]

    url = f'https://github.com/search?o={o}&p={page}&q=extension%3A{extension}&s={s}&type=Code'
    return url

for i in range(0, 4):
    print(get_search_page_url(3, i))

is_testing = True
page_count = min(page_count, 3)
print(get_search_page_url(2, 4))


https://github.com/search?o=asc&p=3&q=extension%3Afmu&s=&type=Code
https://github.com/search?o=desc&p=3&q=extension%3Afmu&s=&type=Code
https://github.com/search?o=asc&p=3&q=extension%3Afmu&s=indexed&type=Code
https://github.com/search?o=desc&p=3&q=extension%3Afmu&s=indexed&type=Code


AssertionError: order must be in the range 0-3
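The index arithmetic in `get_search_page_url` can also be written as an explicit product of the two option lists, which removes the range check that trips the assertion above. This is a sketch of an alternative rather than the notebook's method; the generator name `iter_search_page_urls` is hypothetical, and the URL template is copied from the cell above.

```python
from itertools import product

SORT_OPTIONS = ['', 'indexed']    # best match, recently indexed
ORDER_OPTIONS = ['asc', 'desc']   # ascending, descending

def iter_search_page_urls(page, extension='fmu'):
    """Yield the search URL for one page under every sort/order combination."""
    for sort, order in product(SORT_OPTIONS, ORDER_OPTIONS):
        yield (f'https://github.com/search?o={order}&p={page}'
               f'&q=extension%3A{extension}&s={sort}&type=Code')

for url in iter_search_page_urls(3):
    print(url)
```

The loop yields the same four URLs, in the same order, as the output shown above.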

## Future Work

- Statistics about FMUs (# params/inputs/outputs, OS platforms)
- Take a closer look at a sample of FMUs to see if they can be understood and controlled to achieve some sort of objective
- How to handle versioning of FMU files? The URLs are https://github.com/{username}/{repository}/blob/{commit_hash}/{file_path}. We could eventually have multiple commit hashes for different versions of the same file. Should we keep them all, or just the latest? Also note that a branch name could be used instead of commit_hash to reference the latest version in a branch. Perhaps we should convert the commit hashes to reference the latest version in the default branch?
- Other file types (Modelica, MATLAB, ...?)
- Internet search (files outside GitHub)
- Security note (executable code in FMU files should be treated as untrusted)
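As a starting point for the statistics item, here is a minimal sketch that counts variables by causality and lists the platform folders in an FMU. It assumes the FMI 2.0 layout (a zip archive with `modelDescription.xml` at the root and compiled code under `binaries/<platform>/`), and the names `summarize_fmu` and `make_demo_fmu` are hypothetical. It only reads the XML and never loads the binaries, in keeping with the security note.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

def summarize_fmu(fmu_file):
    """Summarize an FMU given a path or a binary file object."""
    with zipfile.ZipFile(fmu_file) as archive:
        root = ET.fromstring(archive.read('modelDescription.xml'))
        # Folder names directly under binaries/, e.g. win64 or linux64
        platforms = sorted({name.split('/')[1] for name in archive.namelist()
                            if name.startswith('binaries/') and name.count('/') >= 2})
    variables = root.find('ModelVariables')
    # Assumption: a missing causality attribute means 'local' (the FMI 2.0 default)
    causality = Counter(variable.get('causality', 'local')
                        for variable in (variables if variables is not None else []))
    return {
        'fmi_version': root.get('fmiVersion'),
        'model_name': root.get('modelName'),
        'causality': dict(causality),
        'platforms': platforms,
    }

def make_demo_fmu():
    """Build a tiny synthetic FMU in memory so the sketch runs without downloads."""
    xml = '''<fmiModelDescription fmiVersion="2.0" modelName="Tank">
  <ModelVariables>
    <ScalarVariable name="valve" causality="input"><Real/></ScalarVariable>
    <ScalarVariable name="level" causality="output"><Real/></ScalarVariable>
    <ScalarVariable name="area" causality="parameter"><Real/></ScalarVariable>
    <ScalarVariable name="flow"><Real/></ScalarVariable>
  </ModelVariables>
</fmiModelDescription>'''
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as archive:
        archive.writestr('modelDescription.xml', xml)
        archive.writestr('binaries/linux64/Tank.so', b'')
    buffer.seek(0)
    return buffer

summary = summarize_fmu(make_demo_fmu())
print(summary)
```

An FMU with at least one `input` and one `output` variable would be a first candidate for a reinforcement learning environment, though whether those signals are meaningful to control still needs a closer look at each model.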