# zbMATH Full Experiment & Analysis
### Orion Portelli

This notebook defines and explains the process used to collect and analyse zbMATH record data in a reproducible manner.
___
## Data Collection Process

Data collection takes place in three phases:

1. **ID Collection:** The client first collects a list of record IDs from zbMATH.
2. **Record Scraping:** The client scrapes key details (including software) from each record in the ID list.
3. **Data Cleaning:** The client cleans the dataset to fix erroneous dates and clean the plaintext language field.

The first two phases are automated by the `py_client.fullCollect` function.

Cleaning can then be applied using the `py_client.cleanDataset` function.
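
As a rough illustration of what the cleaning phase might involve, the sketch below fixes implausible dates and tidies a plaintext language field on a couple of made-up records. It does not reproduce `py_client.cleanDataset`: the record schema (`year` and `language` keys), the `clean_record` helper and the plausible-year bounds are all assumptions made purely for illustration.

```python
import re


def clean_record(record, min_year=1700, max_year=2025):
    """Return a cleaned copy of a record dict (hypothetical schema)."""
    cleaned = dict(record)

    # Dates: keep a four-digit year only if it falls in a plausible range.
    match = re.search(r"\d{4}", str(record.get("year", "")))
    year = int(match.group()) if match else None
    if year is not None and not (min_year <= year <= max_year):
        year = None
    cleaned["year"] = year

    # Language: strip stray whitespace and punctuation, then normalise case.
    language = str(record.get("language", "")).strip(" .;,\n\t")
    cleaned["language"] = language.title() if language else None
    return cleaned


# Made-up records, not real zbMATH data.
records = [
    {"id": 1, "year": "2012", "language": " english. "},
    {"id": 2, "year": "0000", "language": "German"},
]
cleaned_records = [clean_record(r) for r in records]
print(cleaned_records)
```

Marking bad values as `None` rather than dropping the record keeps the dataset size unchanged, so the analysis can decide later how to treat missing values.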

___

## Phase 1 - Parameter Selection

Collecting records is limited primarily by the speed of HTTP requests to zbMATH. As a result, it is important to limit the number of records scraped as much as possible. This can be done during the ID collection phase in two ways:

1. **Mathematical Subject:** The MSC of the records being scraped. Some mathematical subjects rely on computation more than others, and limiting the ID search to individual subjects makes it possible to capture these trends and their differences.
2. **Time Interval:** The start and end date for the ID collection range. Records on zbMATH date back as far as the 18th century, but most modern software packages did not become mainstream in mathematics until the late 2000s. The interval should be wide enough to capture time-varying trends yet narrow enough to keep the search specific.

Here I selected five mathematical subjects of interest, then used swMATH to find the top subjects for certain large software packages in order to settle on a smaller set of subjects that were sufficiently large and distinct.

| MSC | Subject |
| --- | --- |
| 05 | Combinatorics |
| 11 | Number Theory |
| 20 | Group Theory & Generalisations |
| 62 | Statistics |
| 65 | Numerical Analysis |

In [3]:
import src.api.api_client as api_client

In [4]:
# View possible MSC codes
api_client.getClasses()

{'00': 'General and overarching topics; collections',
 '01': 'History and biography',
 '03': 'Mathematical logic and foundations',
 '05': 'Combinatorics',
 '06': 'Order, lattices, ordered algebraic structures',
 '08': 'General algebraic systems',
 '11': 'Number theory',
 '12': 'Field theory and polynomials',
 '13': 'Commutative algebra',
 '14': 'Algebraic geometry',
 '15': 'Linear and multilinear algebra; matrix theory',
 '16': 'Associative rings and algebras',
 '17': 'Nonassociative rings and algebras',
 '18': 'Category theory; homological algebra',
 '19': '\\(K\\)-theory',
 '20': 'Group theory and generalizations',
 '22': 'Topological groups, Lie groups',
 '26': 'Real functions',
 '28': 'Measure and integration',
 '30': 'Functions of a complex variable',
 '31': 'Potential theory',
 '32': 'Several complex variables and analytic spaces',
 '33': 'Special functions',
 '34': 'Ordinary differential equations',
 '35': 'Partial differential equations',
 '37': 'Dynamical systems and ergodic t

In [2]:
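# Candidate MSC codes mapped to their record counts (filled in below)
sets = {'05': 0, '11': 0, '20': 0, '62': 0, '65': 0}
start = '2010'
end = '2015'

# Query the number of record IDs per subject within the time interval
for s in sets:
    sets[s] = api_client.getIDCount(s, start=start, end=end)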

print(sets)
print('Total Count:', sum(sets.values()))

{'05': 35450, '11': 21023, '20': 15243, '62': 36205, '65': 49759}
Total Count: 157680


From there, I chose a reasonably wide time interval and used the API's count feature to check that the number of records to scrape was manageable. The selected subjects are:

| MSC | Subject |
| --- | --- |
| 11 | Number Theory |
| 20 | Group Theory & Generalisations |
| 62 | Statistics |

Again, these were chosen because they cover distinct areas of mathematics, and because the counts for sets 11 and 20 are reasonably small.
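
To make the size of the final selection concrete, the sketch below totals the 2010-2015 counts printed earlier for the three selected subjects and turns that into a rough runtime estimate. The counts are copied from the output above; the rate of one second per record is purely an illustrative assumption, not a measured figure for zbMATH.

```python
# ID counts for 2010-2015, copied from the count output earlier in this notebook.
counts = {'05': 35450, '11': 21023, '20': 15243, '62': 36205, '65': 49759}
selected = ['11', '20', '62']

# Total number of records in the selected subjects.
selected_total = sum(counts[msc] for msc in selected)

# Assumption: one HTTP request per record, at an illustrative 1 second each.
ASSUMED_SECONDS_PER_RECORD = 1.0
estimated_hours = selected_total * ASSUMED_SECONDS_PER_RECORD / 3600

print('Selected count:', selected_total)
print(f'Estimated runtime: {estimated_hours:.1f} hours')
```

Under that assumed rate, dropping sets 05 and 65 cuts the workload to less than half of the original five-subject total.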

___
## Phase 2 - Data Collection

The code itself is not run here because of its long runtime; it is better to use scripts or run it directly from the command line. It is nevertheless shown below for the sake of reproducibility.