## Collecting Data

We need many data pulls to build a broad collection of data. To do that, we query specific families of data across a range of institutions:
for each animal, we select records from its top N institutions. We also want a range of preserved specimens.

In [1]:
import idigbio
import pandas as pd

In [2]:
api = idigbio.json()

In [3]:
### Selects Top Animals by count
top_animals = api.top_records(count=10000)

In [4]:
top_animals

{'scientificname': {'genre indet.': {'itemCount': 137200},
  'plethodon cinereus': {'itemCount': 110970},
  'aphodius': {'itemCount': 98708},
  'peromyscus maniculatus': {'itemCount': 98665},
  'diptera': {'itemCount': 97315},
  'gen. sp.': {'itemCount': 94287},
  'hymenoptera': {'itemCount': 84727},
  'aleocharinae': {'itemCount': 74181},
  'undetermined': {'itemCount': 72756},
  'ichneumonidae': {'itemCount': 68503},
  'megaselia sp.': {'itemCount': 67438},
  'plantae': {'itemCount': 66970},
  'mollusca': {'itemCount': 66143},
  'crinoidea': {'itemCount': 65279},
  'mammalia': {'itemCount': 59772},
  'undetermined sp.': {'itemCount': 55211},
  'undet. foraminiferida': {'itemCount': 54944},
  'unidentified': {'itemCount': 53899},
  'uta stansburiana': {'itemCount': 53246},
  'incertae sedis': {'itemCount': 52891},
  'genus sp': {'itemCount': 52700},
  'gastropoda': {'itemCount': 51280},
  'curculionidae': {'itemCount': 49141},
  'coleoptera': {'itemCount': 47974},
  'peromyscus leucop
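
Several of the most frequent `scientificname` values above are placeholder labels rather than taxa (`genre indet.`, `undetermined`, `gen. sp.`). The following is a minimal sketch of filtering them out before building queries. The helper names `is_placeholder` and `real_taxa` and the marker list are assumptions for illustration, not part of the idigbio client, and the markers should be tuned against the real output.

```python
# Hypothetical helper: the marker list is an assumption based on the names
# visible in the output above, not an exhaustive rule.
PLACEHOLDER_MARKERS = ("indet", "undet", "unidentified", "incertae sedis", "gen. sp", "genus sp")

def is_placeholder(name):
    """Return True if a scientificname looks like a placeholder label."""
    lowered = name.lower()
    return any(marker in lowered for marker in PLACEHOLDER_MARKERS)

def real_taxa(top_records_result, field="scientificname"):
    """Keep only entries of a top_records-style dict that are not placeholders."""
    return {
        name: info
        for name, info in top_records_result[field].items()
        if not is_placeholder(name)
    }

# Small sample shaped like the output above
sample = {"scientificname": {
    "genre indet.": {"itemCount": 137200},
    "plethodon cinereus": {"itemCount": 110970},
    "undetermined": {"itemCount": 72756},
    "gen. sp.": {"itemCount": 94287},
}}
filtered = real_taxa(sample)
```

Substring matching is deliberately loose here; genus-level entries such as `megaselia sp.` are kept, which may or may not be what you want.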

In [5]:
institutions = api.top_records(rq={"hasImage": "true"},top_fields=["institutioncode"],count=30)
institutions

{'institutioncode': {'mnhn': {'itemCount': 6105381},
  'us': {'itemCount': 4165621},
  'ny': {'itemCount': 3614705},
  'nhmuk': {'itemCount': 2202079},
  'brit': {'itemCount': 1116991},
  'o': {'itemCount': 1113179},
  'f': {'itemCount': 820415},
  'mich': {'itemCount': 752694},
  'mo': {'itemCount': 743683},
  'gh': {'itemCount': 732216},
  'usnm': {'itemCount': 684536},
  'ph': {'itemCount': 634083},
  'k': {'itemCount': 607352},
  'ypm': {'itemCount': 595070},
  'jbrj': {'itemCount': 511942},
  'wis': {'itemCount': 505635},
  'ncu': {'itemCount': 493175},
  'colo': {'itemCount': 453793},
  'rsa': {'itemCount': 403043},
  'tenn': {'itemCount': 395413},
  'min': {'itemCount': 376950},
  'vt': {'itemCount': 323643},
  'cas': {'itemCount': 317011},
  'mcz': {'itemCount': 303431},
  'uam': {'itemCount': 302440},
  'ummz': {'itemCount': 285640},
  'a': {'itemCount': 277463},
  'nebc': {'itemCount': 275800},
  'ga': {'itemCount': 273410},
  'usf': {'itemCount': 255654}},
 'itemCount': 4701
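
The nested dictionary returned by `top_records` is awkward to scan. As a sketch, it can be flattened into sorted `(value, itemCount)` rows; `top_counts` is a hypothetical helper name, and the sample data below is a small excerpt of the output above.

```python
def top_counts(top_records_result, field):
    """Return (value, itemCount) pairs for one field, largest count first."""
    buckets = top_records_result[field]
    rows = [(value, info["itemCount"]) for value, info in buckets.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Excerpt shaped like the institutions result above
sample = {"institutioncode": {
    "us": {"itemCount": 4165621},
    "mnhn": {"itemCount": 6105381},
    "ny": {"itemCount": 3614705},
}}
rows = top_counts(sample, "institutioncode")
```

The rows can then be passed to `pd.DataFrame(rows, columns=["institutioncode", "itemCount"])` for inspection in the notebook.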

In [2]:
## Example Redis call
import redis

redis_conn = redis.Redis(host='localhost', port=6379, decode_responses=True)
redis_conn.lpush('idigbio','{"search_dict":{"order":"aphodius","hasImage":true,"data.dwc:institutionCode":"us"},"import_all":false}')
# redis_conn.close()

1
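
Hand-writing the JSON string is easy to get wrong once values are substituted in. A minimal sketch that builds the same payload with `json.dumps` follows; `build_job` is a hypothetical helper, and the payload layout is simply copied from the example call above.

```python
import json

def build_job(search_dict, import_all=False):
    """Serialize one ingest job using the layout from the example call."""
    return json.dumps({"search_dict": search_dict, "import_all": import_all})

job = build_job({"order": "aphodius", "hasImage": True, "data.dwc:institutionCode": "us"})
# redis_conn.lpush('idigbio', job)  # assumes the redis_conn created above
```

Because `json.dumps` handles quoting and escaping, names containing quotes or periods produce valid JSON without extra work.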

Now that we know which animals are most common and which institutions have the most records with images, we can query the top 5 institutions for each animal.

This needs a Redis connection, such as the `redis_conn` created in the example cell above.

In [73]:
import json

for name in top_animals["scientificname"]:
    institutions = api.top_records(rq={"hasImage": "true", "scientificname": name}, top_fields=["institutioncode"], count=5)
    for inst in institutions["institutioncode"]:
        # Build the job with json.dumps: str.format would choke on the literal JSON braces
        search_dict = {"scientificname": name, "hasImage": True, "data.dwc:institutionCode": inst, "count": 1000}
        redis_conn.lpush('idigbio', json.dumps({"search_dict": search_dict, "import_all": True}))

    

## Now let it run for several hours using the ingestor.py in the main branch