# FindOpenSimulationModels

An experiment to find simulation models such as FMU and Modelica files on the open internet. I am curious how prevalent they are and whether they have inputs and outputs that would be suitable for reinforcement learning environments.

In [36]:
# While developing, limit the amount of data that is downloaded.
# Set to False when ready to download all the data.
is_testing = True

fmu_list_filename = 'results/github-fmu-search.txt'
import importlib

## FMUs on GitHub

Let's start by looking at FMU files that exist in GitHub repositories.

### Manual search

We can enter `extension:fmu` in the GitHub search box and choose `All GitHub`, resulting in the query https://github.com/search?q=extension%3Afmu&type=code. This returned 10,841 code results. It's encouraging that there are thousands of FMU files out there! However, we would need to hit the *Next* button to page through a few files at a time and manually copy their URLs from the web page.

### GitHub API search

Next, let's try doing this programmatically with the [GitHub search API](https://docs.github.com/en/rest/search). Unfortunately, this doesn't seem to be possible, based on these [Reddit](https://www.reddit.com/r/github/comments/dr19uu/finding_all_files_with_a_certain_extension/) and [Stack Overflow](https://stackoverflow.com/questions/58673751/find-all-files-with-certain-filetype-on-github) discussions from three years ago. My results were the same.

Here's what I tried (TL;DR: it didn't work):

A **repository** search of `https://api.github.com/search/repositories?q=extension:fmu` only returns one result. It did indeed find a [repository](https://github.com/INTO-CPS-Association/distributed-maestro-fmu) that contains a [singlewatertank-20sim.fmu](https://github.com/INTO-CPS-Association/distributed-maestro-fmu/blob/95922d63eb50c17609320c180319f23d17173c7f/bundle/src/test/resources/singlewatertank-20sim.fmu) file. But there should be many more repositories. It seems that the `extension` qualifier is doing something but, if it works at all, it is only scanning a small subset of GitHub.

The [repository API doc](https://docs.github.com/en/rest/search?apiVersion=2022-11-28#search-repositories) says to see [Searching for repositories](https://docs.github.com/en/search-github/searching-on-github/searching-for-repositories) for a detailed list of qualifiers. In that documentation, [Search based on the contents of a repository](https://docs.github.com/en/search-github/searching-on-github/searching-for-repositories#search-based-on-the-contents-of-a-repository) states:

> Besides using in:readme, it's not possible to find repositories by searching for specific content within the repository. To search for a specific file or content within a repository, you can use the file finder or code-specific search qualifiers.

A **code** search of `https://api.github.com/search/code?q=extension:fmu` returns a `Validation Failed` error with `Must include at least one user, organization, or repository`. So it seems it is not possible to search all of GitHub in this way. In the documentation, [Considerations for code search](https://docs.github.com/en/rest/search?apiVersion=2022-11-28#considerations-for-code-search) states:

> * Only files smaller than 384 KB are searchable.
> * ...
> * You must always include at least one search term when searching source code. For example, searching for language:go is not valid, while amazing language:go is.

This will not work for FMU files because we want all of them (not just FMUs containing some particular search term), they are larger than 384 KB, and they are in a binary (zip) format that wouldn't work with a text search.
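For anyone who wants to reproduce the two API queries described above, here is a minimal sketch that only builds the request URLs (it makes no network calls). The helper name `build_search_url` is hypothetical; the endpoints and the `extension:fmu` query are the ones quoted in the text.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.github.com/search"


def build_search_url(kind: str, query: str) -> str:
    """Return a GitHub search API URL for `kind` ('repositories' or 'code')."""
    if kind not in ("repositories", "code"):
        raise ValueError(f"unsupported search kind: {kind}")
    # urlencode percent-encodes the colon, giving q=extension%3Afmu
    return f"{API_ROOT}/{kind}?{urlencode({'q': query})}"


repo_url = build_search_url("repositories", "extension:fmu")
code_url = build_search_url("code", "extension:fmu")
print(repo_url)  # the repository search that returned a single result
print(code_url)  # the code search that was rejected with "Validation Failed"
```

Fetching either URL (for example with `requests.get`) should show the behaviour described above, although GitHub may have changed the API since this was written.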

In the [Reddit thread](https://www.reddit.com/r/github/comments/dr19uu/comment/f6ezx4e/?utm_source=share&utm_medium=web2x&context=3) OP Gasp0de also looked into using [GH Archive](https://www.gharchive.org/), but it doesn't look like GH Archive includes an event type with file information about the contents of repositories/commits.

### Scraping

Based on the [Information Usage Restrictions](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions), it appears that web scraping of GitHub is permitted:

> You may use information from our Service for the following reasons, regardless of whether the information was scraped, collected through our API, or obtained otherwise:
> 
> * Researchers may use public, non-personal information from the Service for research purposes, only if any publications resulting from that research are open access.
> * Archivists may use public information from the Service for archival purposes.

We are researching the nature of FMU files that are present on GitHub with the intent to create an archive or index of them, so that seems to fit. We don't expect it will be a tremendous amount of data and we won't be spamming anyone. Let's give it a try.

In [3]:
import requests
search_url = 'https://github.com/search?q=extension%3Afmu&type=code'
print(f'Getting: {search_url}')
search = requests.get(search_url)
print(f'Result: status code = {search.status_code}, url = {search.url}')
# print response lines containing the string <title>
for line in search.text.splitlines():
    if '<title>' in line:
        print(line)

Getting: https://github.com/search?q=extension%3Afmu&type=code
Result: status code = 200, url = https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Fsearch%3Fq%3Dextension%253Afmu%26type%3Dcode
  <title>Sign in to GitHub · GitHub</title>


It is requiring us to log in, so we won't just be able to get the results by fetching URLs. Let's try scraping the search results using browser automation.
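Note that the status code alone is misleading here: `requests` follows the redirect and reports 200 for the login page. A minimal sketch of detecting this case from the final URL follows. The helper name `login_redirect_target` is hypothetical, and the check assumes GitHub keeps redirecting to `/login?return_to=...` as seen in the output above.

```python
from urllib.parse import parse_qs, urlparse


def login_redirect_target(final_url: str):
    """Return the originally requested URL if `final_url` is a GitHub
    login redirect, otherwise None."""
    parsed = urlparse(final_url)
    if parsed.netloc != "github.com" or parsed.path != "/login":
        return None
    # parse_qs percent-decodes the value once, recovering the original URL
    targets = parse_qs(parsed.query).get("return_to")
    return targets[0] if targets else None


# The final URL printed by the cell above
redirected = (
    "https://github.com/login?return_to="
    "https%3A%2F%2Fgithub.com%2Fsearch%3Fq%3Dextension%253Afmu%26type%3Dcode"
)
target = login_redirect_target(redirected)
print(target)
```

In the cell above this could be applied to `search.url` to fail early instead of parsing a sign-in page.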

In [2]:
import ScrapeGitHubFilesByExtension
importlib.reload(ScrapeGitHubFilesByExtension) # reload changes to ScrapeGitHubFilesByExtension.py every run
# Open browser and get ready to scrape search results
scrape = ScrapeGitHubFilesByExtension.ScrapeGitHubFilesByExtension('fmu', fmu_list_filename, filter_out_private_repositories=True, is_testing=is_testing)

Opening: https://github.com/search?q=extension:fmu&type=code
Use the web browser window to log in to GitHub...
First page of search results loaded
10,845 code results
Found 100 pages of results
Limiting to 3 pages in testing mode


In [4]:
# Page through the search results to scrape a list of FMU URLs
# Note that, during development, this cell can be run multiple times while the logged-in browser is still open.
scrape.scrape()

Done scraping 3 pages * 4 orders
This scan has found 0 new FMUs, 109 already known FMUs, 1 FMUs from private repos (filtered out)
The entire collection now has 3879 FMUs

Retries:
  succeeded after 0 retries: 9 pages
  succeeded after 1 retries: 1 pages
  failed: 0 pages


In [5]:
# Close the browser
del scrape

We weren't able to identify all of the 10,841 FMUs reported by GitHub search, but we do have several thousand. This is a large enough sample that we will move on and take a look at these files to see if they would be suitable as reinforcement learning environments.

### Downloading

Next, let's download FMU files from the GitHub URLs we collected.

In [34]:
import DownloadGitHubFiles
importlib.reload(DownloadGitHubFiles) # reload changes to DownloadGitHubFiles.py every run
download = DownloadGitHubFiles.DownloadGitHubFiles(fmu_list_filename, is_testing)
download.download()
del download

Done.
0 files downloaded, 2 failed, 3877 skipped (already cached)
Request https://raw.githubusercontent.com/microsoft/FMU-bonsai-connector/937ce984d0896681132fde209ef06048f14a5850/samples/Integrator.fmu failed: 404 - 404: Not Found
Request https://raw.githubusercontent.com/microsoft/FMU-bonsai-connector/937ce984d0896681132fde209ef06048f14a5850/samples/vanDerPol.fmu failed: 404 - 404: Not Found
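The scraped links are blob URLs of the form `https://github.com/{username}/{repository}/blob/{commit_hash}/{file_path}`, while the requests above go to `raw.githubusercontent.com`. The following is a minimal sketch of how that mapping might be done; it is not necessarily what `DownloadGitHubFiles` does, and the name `blob_to_raw_url` is hypothetical. It assumes the ref is a commit hash (or a branch name without slashes).

```python
import re

# Matches https://github.com/{user}/{repo}/blob/{ref}/{path}
BLOB_URL = re.compile(
    r"^https://github\.com/(?P<user>[^/]+)/(?P<repo>[^/]+)"
    r"/blob/(?P<ref>[^/]+)/(?P<path>.+)$"
)


def blob_to_raw_url(blob_url: str) -> str:
    """Convert a GitHub blob page URL into a raw file download URL."""
    match = BLOB_URL.match(blob_url)
    if match is None:
        raise ValueError(f"not a GitHub blob URL: {blob_url}")
    return "https://raw.githubusercontent.com/{user}/{repo}/{ref}/{path}".format(
        **match.groupdict()
    )


raw_url = blob_to_raw_url(
    "https://github.com/INTO-CPS-Association/distributed-maestro-fmu/blob/"
    "95922d63eb50c17609320c180319f23d17173c7f/"
    "bundle/src/test/resources/singlewatertank-20sim.fmu"
)
print(raw_url)
```

Because the URL pins a commit hash, a 404 like the two above most likely means the repository, or that commit, is no longer publicly available.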



### Analysis

Now we can analyze our collection of FMU files.
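An FMU is a zip archive with a `modelDescription.xml` file at its root, so much of this information can be read without running any of the bundled binaries. The sketch below is not the `AnalyzeFmuFiles` implementation; it is a simplified illustration that assumes the FMI 2.0 layout (FMI 1.0 marks co-simulation and parameters differently), and `summarize_fmu` is a hypothetical name.

```python
import io
import xml.etree.ElementTree as ET
import zipfile
from collections import Counter


def summarize_fmu(fmu) -> dict:
    """Summarize an FMI 2.0 FMU given a path or a binary file-like object."""
    with zipfile.ZipFile(fmu) as archive:
        # Only the XML description is read; binaries are never extracted.
        root = ET.fromstring(archive.read("modelDescription.xml"))
    # In FMI 2.0 a variable without a causality attribute is "local".
    causality = Counter(
        var.get("causality", "local")
        for var in root.iterfind("ModelVariables/ScalarVariable")
    )
    return {
        "FMI Version": root.get("fmiVersion"),
        "Co-Simulation": root.find("CoSimulation") is not None,
        "Model Exchange": root.find("ModelExchange") is not None,
        "Param Count": causality["parameter"],
        "Input Count": causality["input"],
        "Output Count": causality["output"],
        "Generation Tool": root.get("generationTool"),
    }


# Hand-written demo description (not a real FMU from the collection)
demo_xml = """<?xml version="1.0" encoding="UTF-8"?>
<fmiModelDescription fmiVersion="2.0" modelName="Demo" guid="{demo}"
                     generationTool="hand-written example">
  <CoSimulation modelIdentifier="Demo"/>
  <ModelVariables>
    <ScalarVariable name="k" valueReference="0" causality="parameter"
                    variability="fixed"><Real start="1"/></ScalarVariable>
    <ScalarVariable name="u" valueReference="1" causality="input">
      <Real start="0"/></ScalarVariable>
    <ScalarVariable name="y" valueReference="2" causality="output">
      <Real/></ScalarVariable>
  </ModelVariables>
</fmiModelDescription>
"""
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("modelDescription.xml", demo_xml)

summary = summarize_fmu(buffer)
print(summary)
```

A full analysis would also validate the description against the FMI schema, which is where errors such as the `InitialUnknowns` message below would come from.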

In [79]:
import AnalyzeFmuFiles
importlib.reload(AnalyzeFmuFiles) # reload changes to AnalyzeFmuFiles.py every run
import pandas
analyze = AnalyzeFmuFiles.AnalyzeFmuFiles('results/downloads', is_testing)
df = analyze.analyze()
df.to_csv('results/github-fmu-analysis.csv', index=False)
display(df)
display(df.dtypes)

FMU file results/downloads\AIT-IES\detb-lablink-example\fmu_SingleConsumerWithBooster.fmu is invalid: ModelStructure/InitialUnknowns does not contain the expected set of variables. Expected {T_return, der(singleConsumerWithBooster.demand.PID.I.y), der(singleConsumerWithBooster.demand.demandType.rad.vol[1].dynBal.U), der(singleConsumerWithBooster.demand.demandType.rad.vol[2].dynBal.U), der(singleConsumerWithBooster.demand.demandType.rad.vol[3].dynBal.U), der(singleConsumerWithBooster.demand.demandType.rad.vol[4].dynBal.U), der(singleConsumerWithBooster.demand.demandType.rad.vol[5].dynBal.U), der(singleConsumerWithBooster.demand.flowUnit.pump.filter.x[1]), der(singleConsumerWithBooster.demand.flowUnit.pump.filter.x[2]), der(singleConsumerWithBooster.demand.flowUnit.pump.vol.dynBal.U), der(singleConsumerWithBooster.fan.filter.x[1]), der(singleConsumerWithBooster.fan.filter.x[2]), der(singleConsumerWithBooster.fan.vol.dynBal.U), der(singleConsumerWithBooster.heatpump.W_effective), der(sing

Unnamed: 0,Filename,Valid,Invalid Reason,FMI Version,Co-Simulation,Model Exchange,Param Count,Input Count,Output Count,Generation Tool
0,results/downloads\addicted-by\predictive_analy...,True,,2.0,True,True,56.0,0.0,0.0,OpenModelica Compiler OpenModelica v1.18.1 (64...
1,results/downloads\AIT-IES\detb-lablink-example...,True,,2.0,True,True,0.0,2.0,3.0,"Dymola Version 2020x (64-bit), 2019-10-10"
2,results/downloads\AIT-IES\detb-lablink-example...,True,False,,,,,,,
3,results/downloads\AIT-IES\detb-lablink-example...,True,,2.0,True,True,3.0,6.0,3.0,"Dymola Version 2020x (64-bit), 2019-10-10"
4,results/downloads\AIT-IES\FMITerminalBlock\doc...,True,,1.0,False,True,6.0,1.0,3.0,OpenModelica Compiler OpenModelica v1.11.0 (32...


Filename            string
Valid                 bool
Invalid Reason      string
FMI Version         string
Co-Simulation       object
Model Exchange      object
Param Count        float64
Input Count        float64
Output Count       float64
Generation Tool     string
dtype: object
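As a first pass at the original question, one could filter the saved CSV for valid FMUs that expose at least one input and one output. The following is a rough sketch using only the standard library. The column names match the table above, but the rows are made up for illustration and `rl_candidates` is a hypothetical helper.

```python
import csv
import io


def rl_candidates(rows):
    """Yield rows for valid FMUs with at least one input and one output."""
    for row in rows:
        if row["Valid"] != "True":
            continue
        # Empty cells (missing values in the CSV) are treated as zero
        inputs = float(row["Input Count"] or 0)
        outputs = float(row["Output Count"] or 0)
        if inputs > 0 and outputs > 0:
            yield row


# Illustrative rows only, not actual results from the analysis
sample_csv = """Filename,Valid,Input Count,Output Count
example/no_io.fmu,True,0.0,0.0
example/controllable.fmu,True,2.0,3.0
example/broken.fmu,False,,
"""
reader = csv.DictReader(io.StringIO(sample_csv))
candidates = [row["Filename"] for row in rl_candidates(reader)]
print(candidates)
```

Having inputs and outputs is only a necessary condition; whether a model has a meaningful control objective still needs a closer look.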

## Future Work

- Statistics about FMUs (# params/inputs/outputs, OS platforms)
- Take a closer look at a sampling of FMUs to see if they can be understood and controlled to achieve some sort of objective
- How to handle versioning of FMU files? The URLs are https://github.com/{username}/{repository}/blob/{commit_hash}/{file_path}. We could eventually have multiple commit hashes for different versions of the same file. Should we keep them all, or just the latest? Also note that a branch name could be used instead of commit_hash to reference the latest version in a branch. Perhaps we should convert the commit hashes to reference the latest version in the default branch?
- There are many [limitations on GitHub code search](https://docs.github.com/en/search-github/searching-on-github/searching-code#considerations-for-code-search). What would be a better way to find files on GitHub: an archive or a crawler?
- Other file types (Modelica, MATLAB, ...?)
- Internet search (files outside GitHub)
- Security note (executable code in FMU files should be treated as untrusted)