
Commit

mfd
Miserlou committed Jan 15, 2019
2 parents 1ef2105 + e0d8c5a commit 14dd1f5
Showing 35 changed files with 1,621 additions and 459 deletions.
91 changes: 69 additions & 22 deletions README.md
@@ -88,6 +88,7 @@ The following services will need to be installed:
so Docker does not need sudo permissions.
- [Terraform](https://www.terraform.io/)
- [Nomad](https://www.nomadproject.io/docs/install/index.html#precompiled-binaries) can be installed on Linux clients with `sudo ./install_nomad.sh`.
- [pip3](https://pip.pypa.io/en/stable/) can be installed on Linux clients with `sudo apt-get install python3-pip`
- [git-crypt](https://www.agwa.name/projects/git-crypt/)
- [jq](https://stedolan.github.io/jq/)
- [iproute2](https://wiki.linuxfoundation.org/networking/iproute2)
@@ -135,17 +136,6 @@ repo. Sub-projects each have their own environments managed by their
containers. When returning to this project you should run
`source dr_env/bin/activate` to reactivate the virtualenv.

#### Common Dependencies

The [common](./common) sub-project contains code that the other
sub-projects depend on. So before anything else you should prepare
the distribution directory `common/dist` with this command:

```bash
(cd common && python setup.py sdist)
```

#### Services

`refinebio` also depends on Postgres and Nomad. Postgres can be
@@ -166,16 +156,6 @@ Then, to initialize the database, run:
./common/install_db_docker.sh
```

Finally, to make the migrations to the database, use:

```bash
./common/make_migrations.sh
```

Note: there is a small chance this might fail with a `can't stat` error. If this happens, you have
to manually change permissions on the volumes directory with `sudo chmod -R 740 volumes_postgres`
then re-run the migrations.

If you need to access a `psql` shell for inspecting the database, you can use:

```bash
@@ -207,6 +187,22 @@ the Nomad agent, which will then launch a Docker container which runs
the job. If address conflicts emerge, old Docker containers can be purged
with `docker container prune -f`.

#### Common Dependencies

The [common](./common) sub-project contains code that the other
sub-projects depend on. So before anything else you should prepare
the distribution directory `common/dist` with this script:

```bash
./update_models.sh
```

(_Note:_ This step requires the postgres container to be running and initialized.)

Note: there is a small chance this might fail with a `can't stat` error. If this happens, you have
to manually change permissions on the volumes directory with `sudo chmod -R 740 volumes_postgres`
and then re-run the script.

### Testing

@@ -341,13 +337,44 @@ Example for an ArrayExpress experiment:
./foreman/run_surveyor.sh survey_all --accession E-MTAB-3050
```

Transcriptome indices are a bit special.
For species within the "main" Ensembl division, the species name can be provided like so:

```bash
./foreman/run_surveyor.sh survey_all --accession "Homo sapiens"
```

However, for species in other divisions, the division must follow the species name after a comma, like so:

```bash
./foreman/run_surveyor.sh survey_all --accession "Caenorhabditis elegans, EnsemblMetazoa"
```

The possible divisions that can be specified are:
* Ensembl (this is the "main" division and is the default)
* EnsemblPlants
* EnsemblFungi
* EnsemblBacteria
* EnsemblProtists
* EnsemblMetazoa

If you are unsure which division a species falls into, unfortunately the only way to tell is to check ensembl.com.
(Although searching for the species name plus "ensembl" may work pretty well.)
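
For example, a plant species would be surveyed via the `EnsemblPlants` division (a hypothetical invocation, following the same format as above):

```bash
./foreman/run_surveyor.sh survey_all --accession "Arabidopsis thaliana, EnsemblPlants"
```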

You can also supply a newline-delimited file to `survey_all`, which will
dispatch survey jobs based on accession codes like so:

```bash
./foreman/run_surveyor.sh survey_all --file MY_BIG_LIST_OF_CODES.txt
```
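
Such a file is simply one accession code per line, for instance (illustrative codes):

```
E-MTAB-3050
GSE12345
SRP047410
```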

The main foreman job loop can be started with:

```bash
./foreman/run_surveyor.sh retry_jobs
```

This must actually be running for jobs to move forward through the pipeline.

#### Sequence Read Archive

When surveying SRA, you can supply _either_ run accession codes (e.g.,
@@ -524,7 +551,6 @@ We have two different version counters, one for `dev` and one for `master` so a
* v1.1.3
* v1.1.3-dev


However, you may see that the `dev` counter is well ahead, because we often need more than one staging deploy to be ready for a production deploy.
This is okay; just find the latest version of the type you want to deploy and increment it to get your version.
For example, if you wanted to deploy to staging and the above versions were the largest that `git tag --list` output, you would increment `v1.1.3-dev` to get `v1.1.4-dev`.
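
As a sketch, incrementing and publishing the staging tag might look like this (assuming plain git tags pushed to the shared remote; the exact tagging conventions may differ):

```bash
git tag v1.1.4-dev
git push origin v1.1.4-dev
```
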
@@ -671,6 +697,27 @@ awslogs get data-refinery-log-group-myusername-dev log-stream-api-nginx-access-*

will show all of the API access logs made by Nginx.

### Dumping and Restoring Database Backups

Snapshots are created automatically by RDS. Manual database dumps can be created by privileged users by following [these instructions](https://gist.github.com/syafiqfaiz/5273cd41df6f08fdedeb96e12af70e3b). The Postgres version on the host (I suggest the PGBouncer instance) must match the RDS instance version:

```bash
sudo add-apt-repository "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -sc)-pgdg main"
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo apt-get update
sudo apt-get install postgresql-9.6
```
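
With a matching client installed, creating a manual dump might look like this (a sketch; the endpoint, user, and output path are placeholders):

```bash
pg_dump -h <rds-endpoint> -U <db-user> -d data_refinery -f backup.sql
```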

Archival dumps can also be provided upon request.

Dumps can be restored locally by copying the `backup.sql` file to the `volumes_postgres` directory, then executing:

```bash
docker exec -it drdb /bin/bash
psql --user postgres -d data_refinery -f /var/lib/postgresql/data/backup.sql
```

This can take a long time (>30 minutes)!

### Tearing Down

14 changes: 13 additions & 1 deletion api/data_refinery_api/tests.py
@@ -55,7 +55,9 @@ def setUp(self):
# self.user = User.objects.create(username="mike")

experiment = Experiment()
experiment.accession_code = "GSE123"
experiment.save()
self.experiment = experiment

experiment_annotation = ExperimentAnnotation()
experiment_annotation.data = {"hello": "world", "123": 456}
@@ -214,6 +216,16 @@ def test_sample_pagination(self):
self.assertEqual(response.status_code, status.HTTP_200_OK)
self.assertEqual(response.json()['results'][0]['title'], '123')

def test_fetching_experiment_samples(self):
response = self.client.get(reverse('samples'), {'experiment_accession_code': self.experiment.accession_code})
self.assertEqual(response.status_code, status.HTTP_200_OK)
self.assertEqual(len(response.json()['results']), 1)
self.assertEqual(response.json()['results'][0]['accession_code'], '789')

# Expect 404 if the experiment accession code isn't valid
response = self.client.get(reverse('samples'), {'experiment_accession_code': 'wrong-accession-code'})
self.assertEqual(response.status_code, 404)


def test_search_and_filter(self):

@@ -551,7 +563,7 @@ def test_processed_samples_only(self):
self.assertEqual(response.json()['count'], 1)

qs = Experiment.processed_public_objects
self.assertEqual(experiment.get_processed_samples().count(), 1)
self.assertEqual(len(experiment.processed_samples), 1)

experiment.delete()
sample.delete()
66 changes: 42 additions & 24 deletions api/data_refinery_api/views.py
@@ -178,15 +178,15 @@ def filter_samples_count(self, queryset, name, value):
# '@' Full-text search.
# '$' Regex search.
search_fields = ( 'title',
'@description',
'@accession_code',
'@alternate_accession_code',
'@protocol_description',
'@publication_title',
'description',
'accession_code',
'alternate_accession_code',
'protocol_description',
'publication_title',
'publication_doi',
'publication_authors',
'pubmed_id',
'@submitter_institution',
'submitter_institution',
'experimentannotation__data',
# '@sample__accession_code',
# '@sample__platform_name',
@@ -268,36 +268,35 @@ def get_filters(self, queryset, filters_to_calculate):

if 'technology' in filters_to_calculate:
# Technology
techs = queryset.values('technology').annotate(Count('technology', unique=True))
techs = queryset.values('technology').annotate(count=Count('sample__id', distinct=True))
for tech in techs:
if not tech['technology'] or not tech['technology'].strip():
continue
result['technology'][tech['technology']] = tech['technology__count']
result['technology'][tech['technology']] = tech['count']

if 'has_publication' in filters_to_calculate:
# Publication
pubs = queryset.values('has_publication').annotate(Count('has_publication', unique=True))
pubs = queryset.values('has_publication').annotate(count=Count('sample__id', distinct=True))
for pub in pubs:
if pub['has_publication']:
result['publication']['has_publication'] = pub['has_publication__count']
result['publication']['has_publication'] = pub['count']

if 'organisms__name' in filters_to_calculate:
# Organisms
organisms = queryset.values('organisms__name').annotate(Count('organisms__name', unique=True))
organisms = queryset.values('organisms__name').annotate(count=Count('sample__id', distinct=True))
for organism in organisms:
# This experiment has no ExperimentOrganism-association, which is bad.
# This information may still live on the samples though.
if not organism['organisms__name']:
continue

result['organism'][organism['organisms__name']] = organism['organisms__name__count']
result['organism'][organism['organisms__name']] = organism['count']

if 'platform' in filters_to_calculate:
# Platforms
platforms = queryset.values('samples__platform_name').annotate(Count('samples__platform_name', unique=True))
platforms = queryset.values('samples__platform_name').annotate(count=Count('sample__id', distinct=True))
for plat in platforms:
if plat['samples__platform_name']:
result['platforms'][plat['samples__platform_name']] = plat['samples__platform_name__count']
result['platforms'][plat['samples__platform_name']] = plat['count']

return result

@@ -533,7 +532,10 @@ class SampleList(PaginatedAPIView):
List all Samples.
Pass in a list of pk to an ids query parameter to filter by id.
Can also accept a `dataset_id` field instead of a list of accession codes.
Also accepts:
- `dataset_id` field instead of a list of accession codes
- `experiment_accession_code` to return the samples associated with a given experiment
Append the pk or accession_code to the end of this URL to see a detail view.
@@ -545,15 +547,19 @@ def get(self, request, format=None):
filter_dict.pop('offset', None)
order_by = filter_dict.pop('order_by', None)
ids = filter_dict.pop('ids', None)
accession_codes = filter_dict.pop('accession_codes', None)

filter_by = filter_dict.pop('filter_by', None)

if ids is not None:
ids = [ int(x) for x in ids.split(',')]
filter_dict['pk__in'] = ids

if accession_codes is not None:
experiment_accession_code = filter_dict.pop('experiment_accession_code', None)
if experiment_accession_code:
experiment = get_object_or_404(Experiment.objects.values('id'), accession_code=experiment_accession_code)
filter_dict['experiments__in'] = [experiment['id']]

accession_codes = filter_dict.pop('accession_codes', None)
if accession_codes:
accession_codes = accession_codes.split(',')
filter_dict['accession_code__in'] = accession_codes

@@ -563,7 +569,17 @@ def get(self, request, format=None):
# Python doesn't provide a prettier way of doing this that I know about.
filter_dict['accession_code__in'] = [item for sublist in dataset.data.values() for item in sublist]

samples = Sample.public_objects.filter(**filter_dict)
samples = Sample.public_objects \
.prefetch_related('sampleannotation_set') \
.prefetch_related('organism') \
.prefetch_related('results') \
.prefetch_related('results__processor') \
.prefetch_related('results__computationalresultannotation_set') \
.prefetch_related('results__computedfile_set') \
.filter(**filter_dict) \
.order_by('-is_processed') \
.distinct()

if order_by:
samples = samples.order_by(order_by)

@@ -700,7 +716,7 @@ def get(self, request, format=None):
filter_dict = request.query_params.dict()
limit = max(int(filter_dict.pop('limit', 100)), 100)
offset = int(filter_dict.pop('offset', 0))
jobs = SurveyJob.objects.filter(**filter_dict)[offset:(offset + limit)]
jobs = SurveyJob.objects.filter(**filter_dict).order_by('-id')[offset:(offset + limit)]
serializer = SurveyJobSerializer(jobs, many=True)
return Response(serializer.data)

@@ -713,7 +729,7 @@ def get(self, request, format=None):
filter_dict = request.query_params.dict()
limit = max(int(filter_dict.pop('limit', 100)), 100)
offset = int(filter_dict.pop('offset', 0))
jobs = DownloaderJob.objects.filter(**filter_dict)[offset: offset + limit]
jobs = DownloaderJob.objects.filter(**filter_dict).order_by('-id')[offset: offset + limit]
serializer = DownloaderJobSerializer(jobs, many=True)
return Response(serializer.data)

@@ -726,7 +742,7 @@ def get(self, request, format=None):
filter_dict = request.query_params.dict()
limit = max(int(filter_dict.pop('limit', 100)), 100)
offset = int(filter_dict.pop('offset', 0))
jobs = ProcessorJob.objects.filter(**filter_dict)[offset: offset + limit]
jobs = ProcessorJob.objects.filter(**filter_dict).order_by('-id')[offset: offset + limit]
serializer = ProcessorJobSerializer(jobs, many=True)
return Response(serializer.data)

@@ -750,6 +766,8 @@ def get(self, request, format=None):
data['processor_jobs'] = self._get_job_stats(ProcessorJob.objects, range_param)
data['samples'] = self._get_object_stats(Sample.objects, range_param)
data['experiments'] = self._get_object_stats(Experiment.objects, range_param)
data['processed_samples'] = self._get_object_stats(Sample.processed_objects)
data['processed_experiments'] = self._get_object_stats(Experiment.processed_public_objects)
data['input_data_size'] = self._get_input_data_size()
data['output_data_size'] = self._get_output_data_size()
data['active_volumes'] = list(get_active_volumes())
@@ -937,7 +955,7 @@ def _get_job_stats(self, jobs, range_param):

return result

def _get_object_stats(self, objects, range_param):
def _get_object_stats(self, objects, range_param = False):
result = {
'total': objects.count()
}
28 changes: 28 additions & 0 deletions common/data_refinery_common/migrations/0007_auto_20190103_1555.py
@@ -0,0 +1,28 @@
# Generated by Django 2.1.2 on 2019-01-03 15:55

from django.db import migrations, models


class Migration(migrations.Migration):

dependencies = [
('data_refinery_common', '0006_auto_20181102_1526'),
]

operations = [
migrations.AddField(
model_name='originalfile',
name='downloader_jobs',
field=models.ManyToManyField(through='data_refinery_common.DownloaderJobOriginalFileAssociation', to='data_refinery_common.DownloaderJob'),
),
migrations.AddField(
model_name='originalfile',
name='processor_jobs',
field=models.ManyToManyField(through='data_refinery_common.ProcessorJobOriginalFileAssociation', to='data_refinery_common.ProcessorJob'),
),
migrations.AddField(
model_name='sample',
name='experiments',
field=models.ManyToManyField(through='data_refinery_common.ExperimentSampleAssociation', to='data_refinery_common.Experiment'),
),
]
