# FindOpenSimulationModels

An experiment to find simulation models such as FMU and Modelica files on the open internet. I am curious how prevalent they are and whether they have inputs and outputs that would be suitable for reinforcement learning environments.

## FMUs on GitHub

Let's start by looking at FMU files that exist in GitHub repositories.

### Manual search

We can enter `extension:fmu` in the GitHub search box and choose `All GitHub`, resulting in the query https://github.com/search?q=extension%3Afmu&type=code. This returned 10,841 code results, so there are thousands of FMU files out there. However, collecting them by hand would mean hitting the *Next* button to page through a few files at a time and manually copying their URLs from the web page.

### GitHub API search

Next, let's try doing this programmatically with the [GitHub search API](https://docs.github.com/en/rest/search). Unfortunately, this doesn't seem to be possible, based on these [Reddit](https://www.reddit.com/r/github/comments/dr19uu/finding_all_files_with_a_certain_extension/) and [Stack Overflow](https://stackoverflow.com/questions/58673751/find-all-files-with-certain-filetype-on-github) discussions from three years ago. My results were the same.

Here's what I tried (TL;DR: it didn't work):

A **repository** search of `https://api.github.com/search/repositories?q=extension:fmu` only returns one result. It did indeed find a [repository](https://github.com/INTO-CPS-Association/distributed-maestro-fmu) that contains a [singlewatertank-20sim.fmu](https://github.com/INTO-CPS-Association/distributed-maestro-fmu/blob/95922d63eb50c17609320c180319f23d17173c7f/bundle/src/test/resources/singlewatertank-20sim.fmu) file. But there should be many more repositories. It seems that the `extension` qualifier is doing something, but even if it works, it is only scanning a small subset of GitHub.

The [repository API doc](https://docs.github.com/en/rest/search?apiVersion=2022-11-28#search-repositories) says to see [Searching for repositories](https://docs.github.com/en/search-github/searching-on-github/searching-for-repositories) for a detailed list of qualifiers. In that documentation, [Search based on the contents of a repository](https://docs.github.com/en/search-github/searching-on-github/searching-for-repositories#search-based-on-the-contents-of-a-repository) states:

> Besides using in:readme, it's not possible to find repositories by searching for specific content within the repository. To search for a specific file or content within a repository, you can use the file finder or code-specific search qualifiers.

A **code** search of `https://api.github.com/search/code?q=extension:fmu` returns a `Validation Failed` error with `Must include at least one user, organization, or repository`. So it seems it is not possible to search all of GitHub in this way. In the documentation, [Considerations for code search](https://docs.github.com/en/rest/search?apiVersion=2022-11-28#considerations-for-code-search) states:

> * Only files smaller than 384 KB are searchable.
> * ...
> * You must always include at least one search term when searching source code. For example, searching for language:go is not valid, while amazing language:go is.

This will not work for FMU files for three reasons: we want all of them (not just FMUs containing some particular search term), they are often larger than 384 KB, and they are in a binary (zip) format that wouldn't work with a text search.
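For reference, the two API queries described above can be reproduced with a short script that uses only the standard library. This is a minimal sketch: `build_search_url`, `fetch_json`, and `main` are hypothetical helper names, the requests are unauthenticated, and the responses may differ from the ones reported here depending on authentication and on changes to the GitHub API.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

API_ROOT = 'https://api.github.com/search'


def build_search_url(kind, query):
    """Build a search API URL; kind is 'repositories' or 'code'."""
    params = urllib.parse.urlencode({'q': query})
    return f'{API_ROOT}/{kind}?{params}'


def fetch_json(url):
    """Return (status code, parsed JSON body), including for HTTP error responses."""
    request = urllib.request.Request(
        url, headers={'Accept': 'application/vnd.github+json'})
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status, json.load(response)
    except urllib.error.HTTPError as error:
        # Assumption: error responses (such as Validation Failed) carry a JSON body
        return error.code, json.load(error)


def main():
    # Run both searches and print either the result count or the error message
    for kind in ('repositories', 'code'):
        status, body = fetch_json(build_search_url(kind, 'extension:fmu'))
        print(kind, status, body.get('total_count', body.get('message')))
```

Calling `main()` performs the two network requests; nothing is fetched when the functions are merely defined.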

In the [Reddit thread](https://www.reddit.com/r/github/comments/dr19uu/comment/f6ezx4e/?utm_source=share&utm_medium=web2x&context=3), OP Gasp0de also looked into using [GH Archive](https://www.gharchive.org/), but it doesn't look like GH Archive includes an event type with file information about the contents of repositories/commits.

### Scraping

Based on the [Information Usage Restrictions](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions), it appears that web scraping of GitHub is permitted:

> You may use information from our Service for the following reasons, regardless of whether the information was scraped, collected through our API, or obtained otherwise:
> 
> * Researchers may use public, non-personal information from the Service for research purposes, only if any publications resulting from that research are open access.
> * Archivists may use public information from the Service for archival purposes.

We are researching the nature of FMU files that are present on GitHub with the intent to create an archive or index of them, so that seems to fit. We don't expect it to be a tremendous amount of data, and we won't be spamming anyone. Let's give it a try.

In [1]:
```python
import requests

search_url = 'https://github.com/search?q=extension%3Afmu&type=code'
print(f'Getting: {search_url}')
search = requests.get(search_url, timeout=30)
print(f'Result: status code = {search.status_code}, url = {search.url}')

# Print response lines containing the string <title>
for line in search.text.splitlines():
    if '<title>' in line:
        print(line)
```

```text
Getting: https://github.com/search?q=extension%3Afmu&type=code
Result: status code = 200, url = https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fsearch%3Fq%3Dextension%253Afmu%26type%3Dcode
  <title>Sign in to GitHub · GitHub</title>
```


In [5]:
```python
# GitHub requires us to log in, so we can't get the results just by fetching URLs.
# Let's try scraping the page with Selenium instead.

import time
import urllib.parse

from IPython.display import clear_output
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Create a new instance of the Chrome browser
driver = webdriver.Chrome()

# Navigate to the GitHub search page (this redirects to the login page)
print(f'Opening: {urllib.parse.unquote(search_url)}')
driver.get(search_url)

# It might be a good idea to automate the user login (if that's possible), but for now do it manually
print('Use the web browser window to log in to GitHub...')

# Wait up to 3 minutes for the URL to change back to the search page
WebDriverWait(driver, 180).until(EC.url_to_be(search_url))

print('First page of search results loaded')
```

```text
Use the web browser window to log in to GitHub...
First page of search results loaded
```


In [27]:
```python
import os

# While developing, limit the amount of data that is downloaded.
# Set to False when ready to download all the data.
is_testing = True

# Get number of pages of results
current_em = driver.find_element(By.CSS_SELECTOR, 'em.current')
page_count = int(current_em.get_attribute('data-total-pages'))
print(f'Found {page_count} pages of results')
if is_testing:
    print('Limiting to 3 pages for testing purposes')
    page_count = min(page_count, 3)


def save_page_results(results_file):
    """Write the FMU links on the current results page to results_file and return how many were written."""
    found = 0
    # Get the list items containing the search results (divs with class "code-list-item")
    item_divs = driver.find_elements(By.CSS_SELECTOR, 'div[class*="code-list-item"]')
    for item_div in item_divs:
        # Get the link to the FMU file (not the secondary one to the repository)
        item_links = item_div.find_elements(By.CSS_SELECTOR, 'a:not(.Link--secondary)')
        if len(item_links) != 1:
            print(f'Warning: Parsing problem. Search result item contains {len(item_links)} links, expected just 1 link to the FMU file. Something may have changed on the GitHub website.')
        for link in item_links:
            # Write link to the results file
            results_file.write(f'{link.get_attribute("href")}\n')
            found += 1
    return found


# Create the results file (and its folder, if missing); it is closed automatically at the end of the block
fmu_count = 0
os.makedirs('results', exist_ok=True)
with open('results/github-fmu-search.txt', 'w') as results_file:
    # Process each page of the search results
    for current_page in range(1, page_count + 1):
        page_url = f'https://github.com/search?p={current_page}&q=extension%3Afmu&type=Code'
        clear_output(wait=True)
        print(f'Scraping results: {current_page}/{page_count} pages, {fmu_count} FMUs found')
        driver.get(page_url)
        WebDriverWait(driver, 10).until(EC.url_to_be(page_url))
        fmu_count += save_page_results(results_file)

        # Wait a bit before going to the next page, to avoid hitting GitHub with too many requests
        # TODO: Check rate limit status using https://docs.github.com/en/rest/overview/resources-in-the-rest-api?apiVersion=2022-11-28#rate-limiting
        time.sleep(5)

clear_output(wait=True)
print(f'Done scraping results: {page_count}/{page_count} pages, {fmu_count} FMUs found')
```

```text
Done scraping results: 3/3 pages, 27 FMUs found
```
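The results file holds links to GitHub file pages rather than to the files themselves. One possible next step is to convert each link into a direct download URL. The sketch below assumes the scraped links have the same `/blob/<commit>/<path>` layout as the `singlewatertank-20sim.fmu` link shown earlier, and that the ref is a commit SHA or a branch name without slashes; `blob_to_raw_url` is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def blob_to_raw_url(blob_url):
    """Convert a github.com '/blob/' page URL to a raw.githubusercontent.com URL."""
    parts = urlsplit(blob_url)
    segments = parts.path.strip('/').split('/')
    # Expected layout: owner/repo/blob/<ref>/<path...>
    if parts.netloc != 'github.com' or len(segments) < 5 or segments[2] != 'blob':
        raise ValueError(f'Not a GitHub blob URL: {blob_url}')
    owner, repo, _, ref, *file_path = segments
    path = '/'.join(file_path)
    return f'https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{path}'
```

Links that don't match the expected layout raise a `ValueError` instead of being silently rewritten, which makes changes in the scraped page structure easier to notice.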


In [4]:
```python
# Close the browser
driver.quit()
```

## Future Work

- Statistics about FMUs (# params/inputs/outputs, OS platforms)
- Take a closer look at a sample of FMUs to see if they can be understood and controlled to achieve some sort of objective
- Modelica files
- Internet search (files outside GitHub)
- Security note (executable code in FMU files should be treated as untrusted)
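As a starting point for the statistics item, an FMU can be inspected without running any of the code it contains, because it is a zip archive with a `modelDescription.xml` file at its root. The sketch below is a rough, untested-on-real-data idea: `summarize_fmu` is a hypothetical helper, it assumes the FMI 2.0/3.0 layout (variables listed under `ModelVariables`, a missing `causality` attribute meaning `local`, platform folders under `binaries/`), and it uses the standard XML parser, which may not be the safest choice for untrusted files.

```python
import xml.etree.ElementTree as ET
import zipfile
from collections import Counter


def summarize_fmu(fmu_file):
    """Summarize an FMU (path or file-like object) without executing anything in it."""
    with zipfile.ZipFile(fmu_file) as archive:
        names = archive.namelist()
        root = ET.fromstring(archive.read('modelDescription.xml'))

    # Count variables by causality (input, output, parameter, local, ...)
    variables = root.find('ModelVariables')
    causality = Counter(
        var.get('causality', 'local')
        for var in (variables if variables is not None else [])
    )

    # Platform folders look like binaries/<platform>/...
    platforms = sorted({
        name.split('/')[1]
        for name in names
        if name.startswith('binaries/') and name.count('/') >= 2
    })

    return {
        'fmi_version': root.get('fmiVersion'),
        'model_name': root.get('modelName'),
        'causality': dict(causality),
        'platforms': platforms,
    }
```

Calling `summarize_fmu('some-model.fmu')` on a downloaded file returns a small dictionary, so the results for many FMUs could be collected into a table for the statistics mentioned above.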