# zbMATH Data Collection
### Orion Portelli

This notebook explains and defines the process used to collect zbMATH record data in a reproducible manner.
___
## Data Collection Process

Data collection takes place in two phases:

1. **ID Collection:** A list of record IDs is first collected using the zbMATH client.
2. **Record Scraping:** The client scrapes key details (including software) from each record in the ID list.
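
The two phases above can be sketched as a small pipeline. This is only an illustration: `collect_records`, `fake_list_ids` and `fake_fetch_record` are hypothetical names, and the two callables stand in for whatever the real client uses to list IDs and scrape a record.

```python
from typing import Callable, Dict, Iterable


def collect_records(
    list_ids: Callable[[], Iterable[str]],
    fetch_record: Callable[[str], dict],
) -> Dict[str, dict]:
    """Run phase 1 (ID collection), then phase 2 (record scraping)."""
    # Phase 1: gather the IDs, dropping duplicates but keeping their order.
    ids = list(dict.fromkeys(list_ids()))
    # Phase 2: fetch the details of each record exactly once.
    return {record_id: fetch_record(record_id) for record_id in ids}


# Hypothetical stand-ins so the sketch runs without network access.
def fake_list_ids():
    return ["1001", "1002", "1001"]


def fake_fetch_record(record_id):
    return {"id": record_id, "software": []}


records = collect_records(fake_list_ids, fake_fetch_record)
print(list(records))
```

Separating the phases means the ID list can be saved and inspected before the slow scraping step begins.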

___

## Phase 1 - ID Collection

Collecting records is limited primarily by the speed of HTTP requests to zbMATH. As a result, it is important to keep the number of records scraped as small as possible. This can be done during the ID collection phase in two ways:

1. **Mathematical Subject:** The MSC (Mathematics Subject Classification) of the records being scraped. Some mathematical subjects rely on computation more than others, and limiting the ID search to individual subjects makes it possible to capture these trends and their differences.
2. **Time Interval:** The start and end dates of the ID collection range. Records on zbMATH date back as far as the 18th century; however, most modern software packages did not become mainstream in mathematics until the late 2000s. The interval should be wide enough to capture time-varying trends while staying narrow enough to keep the collection specific.

Here I selected five mathematical subjects of interest. I then used swMATH to look up the top subjects for several large software packages, to check that the selected fields were sufficiently large and distinct. From there, I chose a reasonably wide interval and used the API's count feature to check that the number of records to scrape was reasonable. The selected subjects are:

| MSC | Subject |
| --- | --- |
| 05 | Combinatorics |
| 11 | Number Theory |
| 20 | Group Theory & Generalisations |
| 62 | Statistics |
| 65 | Numerical Analysis |

As for the time interval, we aim for an interval of at least 7 years with at most ~50,000 entries per subject. Across the five subjects this gives up to 250,000 entries, which would take roughly 24 hours to scrape.
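
As a rough check on these figures, the arithmetic below works out the throughput they imply. It assumes the 24-hour estimate holds and that requests proceed at a constant rate; the constant names are only illustrative.

```python
MAX_PER_SUBJECT = 50_000  # target cap on entries per subject
N_SUBJECTS = 5            # MSC 05, 11, 20, 62 and 65
SCRAPE_HOURS = 24         # estimated total scraping time

total_records = MAX_PER_SUBJECT * N_SUBJECTS
seconds_per_record = SCRAPE_HOURS * 3600 / total_records
records_per_second = 1 / seconds_per_record

print(total_records)                 # 250000
print(round(seconds_per_record, 4))  # 0.3456
print(round(records_per_second, 2))  # 2.89
```

In other words, the estimate corresponds to just under three records per second on average.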

In [3]:
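import api_client

# Record counts per MSC code, filled in below.
sets = {'05': 0, '11': 0, '20': 0, '62': 0, '65': 0}
# Candidate collection window (years, given as strings).
start = '2010'
end = '2015'

# Ask the API how many record IDs match each subject in this window.
for s in sets:
    sets[s] = api_client.getIDCount(s, start=start, end=end)

print(sets)
print('Total Count:', sum(sets.values()))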

{'05': 35450, '11': 21022, '20': 15243, '62': 36205, '65': 49760}
Total Count: 157680


Based on this experimentation, 2010 to 2017 is likely a reasonable timeframe for this experiment. The run shown above covers 2010 to 2015 and returns 157,680 records; for the 2010 to 2017 window the total record count is around 228k, and most software was at least semi-established by this time.
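
As a quick sanity check, the 2010 to 2015 counts printed above can be compared against the ~50,000 per-subject target. The sketch below copies those counts by hand; `CAP`, `over_cap` and `headroom` are names chosen for illustration.

```python
# Counts copied from the 2010 to 2015 output above.
counts = {'05': 35450, '11': 21022, '20': 15243, '62': 36205, '65': 49760}
CAP = 50_000  # per-subject target from the text

# Subjects that already exceed the cap, and the room left in each subject.
over_cap = {msc: n for msc, n in counts.items() if n > CAP}
headroom = {msc: CAP - n for msc, n in counts.items()}

print(over_cap)                         # {}
print(min(headroom, key=headroom.get))  # 65
print(headroom['65'])                   # 240
```

No subject exceeds the cap in this window, but MSC 65 is only 240 records short of it, so a wider window will push it past the target.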

## Notes on Collection of Data for Group 20

| Event | Time |
| --- | --- |
| Started Collection | 15:05 |
| Limit Reached | 15:20 |
| Latest Request | 15:39 |
| Limit Removed |  |
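
The timings above show the request limit being reached about 15 minutes into collection. One common way to cope with such a limit is to retry with exponential backoff, sketched below. This is not how the notebook's client behaves: `RateLimited`, `with_backoff` and `flaky` are hypothetical names. The table does not record when the limit was lifted, so the delays are purely illustrative.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical signal that the server refused a request due to a limit."""


def with_backoff(
    fetch: Callable[[], T],
    retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), doubling the wait after each rate-limited attempt."""
    for attempt in range(retries):
        try:
            return fetch()
        except RateLimited:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("retries must be at least 1")


# Demo: a stand-in request that is refused twice, then succeeds.
attempts = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RateLimited()
    return "record"


delays = []  # record the waits instead of actually sleeping
result = with_backoff(flaky, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the demo instantaneous; in real use the default `time.sleep` would apply, with a much larger `base_delay` if the limit lasts for many minutes.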