## CALC log analysis

The following notebook assumes you have downloaded the CALC logs as CSV from [api.data.gov/admin](https://api.data.gov/admin/) and saved them as `logs.csv` in the same directory as this notebook.

In [43]:
import pandas

In [44]:
rows = pandas.read_csv('logs.csv', nrows=None, index_col=False, usecols=[
    'Time',
    'Method',
    'URL',
    'State',
    'Country',
    'City',
    'Status',
    'IP Address',
])

The following step is optional: we remove identical requests (same method and URL) coming from the same IP address. This may not be a good idea, since several distinct users can sit behind one IP address, and their identical searches would be collapsed into a single row.

In [58]:
rows.drop_duplicates(subset=['Method', 'URL', 'IP Address'], inplace=True)

del rows['IP Address']
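As a rough check on how aggressive this step is, the share of rows it would remove can be measured with `duplicated` before anything is dropped. The sketch below is illustrative only: `sample` is made-up data standing in for `rows`, and the names `is_repeat` and `dropped_share` are assumptions introduced here, not part of the original notebook.

```python
import pandas

# Made-up rows standing in for the real logs (values are illustrative only).
sample = pandas.DataFrame({
    'Method': ['GET', 'GET', 'GET', 'GET'],
    'URL': ['/rates/?q=a', '/rates/?q=a', '/rates/?q=b', '/rates/?q=a'],
    'IP Address': ['1.1.1.1', '1.1.1.1', '1.1.1.1', '2.2.2.2'],
})

# duplicated() flags every repeat of a (Method, URL, IP Address) key after
# its first occurrence; these are the rows drop_duplicates() would remove.
is_repeat = sample.duplicated(subset=['Method', 'URL', 'IP Address'])

# Fraction of rows that deduplication would discard.
dropped_share = is_repeat.mean()
print("Rows removed by deduplication: {:.0%}".format(dropped_share))
```

On the real data, the same two lines would need to run on `rows` before the `IP Address` column is deleted.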

Now we'll keep only successful `GET` requests to the `/rates/` endpoint, which is the one used when users click the **Search** button (or initially load the page). We'll also parse each request's query string and add new columns that describe the search criteria.

In [94]:
from urllib.parse import urlparse, parse_qsl

RATES_URL = 'https://api.data.gov/gsa/calc/rates/'
FIELD_DEFAULTS = {
    'Search term': '',
    'Minimum experience': 0,
    'Maximum experience': 45,
    'Education level': '',
    'Worksite': '',
    'Business size': '',
    'Schedule': '',
    'Contract year': 'current',
    'Rows excluded': 0,
}

rates = rows
rates = rates[rates['Method'] == 'GET']
rates = rates[rates['Status'] == 200]
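# Take an explicit copy so the column deletions and assignments below act on
# an independent frame rather than a slice of `rows` (this avoids pandas'
# SettingWithCopyWarning).
rates = rates[rates['URL'].str.startswith(RATES_URL)].copy()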

del rates['Method']
del rates['Status']

rates['Query'] = rates['URL'].apply(
    lambda url: dict(parse_qsl(urlparse(url).query))
)

del rates['URL']

rates['Search term'] = rates['Query'].apply(
    # Keep only the first comma-separated search term, lowercased and
    # truncated to 25 characters; not sure if this is a good idea.
    lambda query: query.get('q', '').split(', ')[0].lower().strip()[:25]
)
rates['Minimum experience'] = rates['Query'].apply(
    lambda query: int(query.get('min_experience', FIELD_DEFAULTS['Minimum experience']))
)
rates['Maximum experience'] = rates['Query'].apply(
    lambda query: int(query.get('max_experience', FIELD_DEFAULTS['Maximum experience']))
)
rates['Education level'] = rates['Query'].apply(
    lambda query: query.get('education', FIELD_DEFAULTS['Education level'])
)
rates['Worksite'] = rates['Query'].apply(
    lambda query: query.get('site', FIELD_DEFAULTS['Worksite'])
)
rates['Business size'] = rates['Query'].apply(
    lambda query: query.get('business_size', FIELD_DEFAULTS['Business size'])
)
rates['Schedule'] = rates['Query'].apply(
    lambda query: query.get('schedule', FIELD_DEFAULTS['Schedule'])
)
rates['Contract year'] = rates['Query'].apply(
    lambda query: query.get('contract-year', FIELD_DEFAULTS['Contract year'])
)
rates['Rows excluded'] = rates['Query'].apply(
    lambda query: len(query['exclude'].split(',')) if query.get('exclude') else 0
)

del rates['Query']
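To make the parsing above concrete, here is a minimal sketch that runs the same expressions on a single URL. The URL is hypothetical, constructed for illustration rather than taken from the logs, and the variable names are assumptions introduced here.

```python
from urllib.parse import urlparse, parse_qsl

# Hypothetical request URL; not an actual log entry.
url = ('https://api.data.gov/gsa/calc/rates/'
       '?q=Engineer%2C+Analyst&min_experience=5&exclude=12,34')

# parse_qsl decodes percent-escapes and '+' and returns (key, value) pairs;
# by default it drops parameters whose value is blank.
query = dict(parse_qsl(urlparse(url).query))

# The same expressions the notebook applies to every row.
search_term = query.get('q', '').split(', ')[0].lower().strip()[:25]
min_experience = int(query.get('min_experience', 0))
max_experience = int(query.get('max_experience', 45))
rows_excluded = len(query['exclude'].split(',')) if query.get('exclude') else 0

print(search_term, min_experience, max_experience, rows_excluded)
```

Because blank values are dropped by default, a request with an empty `q=` parameter looks the same as one with no `q` at all, so both end up with the default empty search term.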

## Popular search terms

In [80]:
rates['Search term'].value_counts()

                             3633
engineer                     1598
subject matter expert        1477
project manager              1345
program manager              1114
senior engineer               737
software engineer             595
consultant                    464
systems engineer              444
training specialist           398
dive                          389
technician                    365
program analyst               339
analyst                       309
plans                         291
financial analyst             276
technical writer              269
engineer iii                  258
senior program manager        238
senior consultant             230
senior systems engineer       209
business analyst              199
mechanical engineer           193
engineer ii                   192
senior analyst                192
principal engineer            191
senior project manager        186
journeyman engineer/scien     183
technical specialist          179
administrative

## Search customization

In [93]:
total_rates = rates.shape[0]

for field, default in FIELD_DEFAULTS.items():
    non_default_rates = rates[rates[field] != default].shape[0]
    print(
        "Queries involving {}: {}%".format(
            field.lower(),
            int(non_default_rates / total_rates * 100)
        )
    )

Queries involving search term: 92%
Queries involving rows excluded: 15%
Queries involving education level: 44%
Queries involving minimum experience: 38%
Queries involving worksite: 25%
Queries involving schedule: 13%
Queries involving business size: 19%
Queries involving maximum experience: 37%
Queries involving contract year: 3%
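For reference, the same percentages can be computed without the explicit row counts by taking the mean of a boolean mask. This is only a sketch of an alternative, not the notebook's method: `sample` and the trimmed `defaults` dictionary are made-up stand-ins for `rates` and `FIELD_DEFAULTS`.

```python
import pandas

# Trimmed, illustrative stand-ins for FIELD_DEFAULTS and rates.
defaults = {'Search term': '', 'Contract year': 'current'}
sample = pandas.DataFrame({
    'Search term': ['engineer', '', 'analyst', 'engineer'],
    'Contract year': ['current', 'current', '1', 'current'],
})

# ne() marks rows that differ from the default; the mean of that boolean
# mask is the fraction of queries that customized the field. int()
# truncates toward zero, matching the loop above.
shares = {
    field: int(sample[field].ne(default).mean() * 100)
    for field, default in defaults.items()
}

for field, share in shares.items():
    print("Queries involving {}: {}%".format(field.lower(), share))
```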
