
Commit

mfd
Miserlou committed Jan 15, 2019
2 parents 1ef2105 + e0d8c5a commit 14dd1f5
Showing 35 changed files with 1,621 additions and 459 deletions.
91 changes: 69 additions & 22 deletions README.md
@@ -88,6 +88,7 @@ The following services will need to be installed:
so Docker does not need sudo permissions.
- [Terraform](https://www.terraform.io/)
- [Nomad](https://www.nomadproject.io/docs/install/index.html#precompiled-binaries) can be installed on Linux clients with `sudo ./install_nomad.sh`.
- [pip3](https://pip.pypa.io/en/stable/) can be installed on Linux clients with `sudo apt-get install python3-pip`
- [git-crypt](https://www.agwa.name/projects/git-crypt/)
- [jq](https://stedolan.github.io/jq/)
- [iproute2](https://wiki.linuxfoundation.org/networking/iproute2)
@@ -135,17 +136,6 @@ repo. Sub-projects each have their own environments managed by their
containers. When returning to this project you should run
`source dr_env/bin/activate` to reactivate the virtualenv.

#### Common Dependencies

The [common](./common) sub-project contains code that the other
sub-projects depend on. So before anything else you should prepare
the distribution directory `common/dist` with this command:

```bash
(cd common && python setup.py sdist)
```

#### Services

`refinebio` also depends on Postgres and Nomad. Postgres can be
@@ -166,16 +156,6 @@ Then, to initialize the database, run:
./common/install_db_docker.sh
```

Finally, to make the migrations to the database, use:

```bash
./common/make_migrations.sh
```

Note: there is a small chance this might fail with a `can't stat` error. If this happens, you have
to manually change permissions on the volumes directory with `sudo chmod -R 740 volumes_postgres`
then re-run the migrations.

If you need to access a `psql` shell for inspecting the database, you can use:

```bash
@@ -207,6 +187,22 @@ the Nomad agent, which will then launch a Docker container which runs
the job. If address conflicts emerge, old Docker containers can be purged
with `docker container prune -f`.

#### Common Dependencies

The [common](./common) sub-project contains code that the other
sub-projects depend on. So before anything else you should prepare
the distribution directory `common/dist` with this script:

```bash
./update_models.sh
```

(_Note:_ This step requires the postgres container to be running and initialized.)

Note: there is a small chance this might fail with a `can't stat` error. If this happens, you have
to manually change permissions on the volumes directory with `sudo chmod -R 740 volumes_postgres`
and then re-run the script.

### Testing

@@ -341,13 +337,44 @@ Example for an ArrayExpress experiment:
./foreman/run_surveyor.sh survey_all --accession E-MTAB-3050
```

Transcriptome indices are a bit special.
For species within the "main" Ensembl division, the species name can be provided like so:

```bash
./foreman/run_surveyor.sh survey_all --accession "Homo sapiens"
```

However, for species in other divisions, the division must follow the species name after a comma, like so:

```bash
./foreman/run_surveyor.sh survey_all --accession "Caenorhabditis elegans, EnsemblMetazoa"
```

The possible divisions that can be specified are:
* Ensembl (this is the "main" division and is the default)
* EnsemblPlants
* EnsemblFungi
* EnsemblBacteria
* EnsemblProtists
* EnsemblMetazoa

If you are unsure which division a species falls into, unfortunately the only way to tell is to check ensembl.com.
(Although searching for the species name plus "ensembl" may work pretty well.)
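
For example, a plant species would be surveyed via the `EnsemblPlants` division (a hypothetical invocation, following the same format as above):

```bash
./foreman/run_surveyor.sh survey_all --accession "Arabidopsis thaliana, EnsemblPlants"
```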

You can also supply a newline-delimited file to `survey_all`, which will
dispatch survey jobs based on accession codes like so:

```bash
./foreman/run_surveyor.sh survey_all --file MY_BIG_LIST_OF_CODES.txt
```
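
Such a file is simply one accession code per line, for instance (illustrative codes):

```
E-MTAB-3050
GSE12345
SRP047410
```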

The main foreman job loop can be started with:

```bash
./foreman/run_surveyor.sh retry_jobs
```

This must actually be running for jobs to move forward through the pipeline.

#### Sequence Read Archive

When surveying SRA, you can supply _either_ run accession codes (e.g.,
@@ -524,7 +551,6 @@ We have two different version counters, one for `dev` and one for `master` so a
* v1.1.3
* v1.1.3-dev


However, you may see that the `dev` counter is well ahead, because we often need more than one staging deploy to be ready for a production deploy.
This is okay; just find the latest version of the type you want to deploy and increment it to get your version.
For example, if you wanted to deploy to staging and the above versions were the largest that `git tag --list` output, you would increment `v1.1.3-dev` to get `v1.1.4-dev`.
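
As a sketch, incrementing and publishing the staging tag might look like this (assuming plain git tags pushed to the shared remote; the exact tagging conventions may differ):

```bash
git tag v1.1.4-dev
git push origin v1.1.4-dev
```
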
@@ -671,6 +697,27 @@ awslogs get data-refinery-log-group-myusername-dev log-stream-api-nginx-access-*

will show all of the API access logs made by Nginx.

### Dumping and Restoring Database Backups

Snapshots are created automatically by RDS. Manual database dumps can be created by privileged users by following [these instructions](https://gist.github.com/syafiqfaiz/5273cd41df6f08fdedeb96e12af70e3b). The Postgres version on the host (I suggest the PGBouncer instance) must match the RDS instance version:

```bash
sudo add-apt-repository "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -sc)-pgdg main"
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo apt-get update
sudo apt-get install postgresql-9.6
```
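
With a matching client installed, creating a manual dump might look like this (a sketch; the endpoint, user, and output path are placeholders):

```bash
pg_dump -h <rds-endpoint> -U <db-user> -d data_refinery -f backup.sql
```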

Archival dumps can also be provided upon request.

Dumps can be restored locally by copying the `backup.sql` file to the `volumes_postgres` directory, then executing:

```bash
docker exec -it drdb /bin/bash
psql --user postgres -d data_refinery -f /var/lib/postgresql/data/backup.sql
```

This can take a long time (>30 minutes)!

### Tearing Down

14 changes: 13 additions & 1 deletion api/data_refinery_api/tests.py
@@ -55,7 +55,9 @@ def setUp(self):
# self.user = User.objects.create(username="mike")

experiment = Experiment()
experiment.accession_code = "GSE123"
experiment.save()
self.experiment = experiment

experiment_annotation = ExperimentAnnotation()
experiment_annotation.data = {"hello": "world", "123": 456}
@@ -214,6 +216,16 @@ def test_sample_pagination(self):
self.assertEqual(response.status_code, status.HTTP_200_OK)
self.assertEqual(response.json()['results'][0]['title'], '123')

def test_fetching_experiment_samples(self):
response = self.client.get(reverse('samples'), {'experiment_accession_code': self.experiment.accession_code})
self.assertEqual(response.status_code, status.HTTP_200_OK)
self.assertEqual(len(response.json()['results']), 1)
self.assertEqual(response.json()['results'][0]['accession_code'], '789')

# Expect 404 if the experiment accession code isn't valid
response = self.client.get(reverse('samples'), {'experiment_accession_code': 'wrong-accession-code'})
self.assertEqual(response.status_code, 404)


def test_search_and_filter(self):

@@ -551,7 +563,7 @@ def test_processed_samples_only(self):
self.assertEqual(response.json()['count'], 1)

qs = Experiment.processed_public_objects
self.assertEqual(experiment.get_processed_samples().count(), 1)
self.assertEqual(len(experiment.processed_samples), 1)

experiment.delete()
sample.delete()
66 changes: 42 additions & 24 deletions api/data_refinery_api/views.py
@@ -178,15 +178,15 @@ def filter_samples_count(self, queryset, name, value):
# '@' Full-text search.
# '$' Regex search.
search_fields = ( 'title',
'@description',
'@accession_code',
'@alternate_accession_code',
'@protocol_description',
'@publication_title',
'description',
'accession_code',
'alternate_accession_code',
'protocol_description',
'publication_title',
'publication_doi',
'publication_authors',
'pubmed_id',
'@submitter_institution',
'submitter_institution',
'experimentannotation__data',
# '@sample__accession_code',
# '@sample__platform_name',
@@ -268,36 +268,35 @@ def get_filters(self, queryset, filters_to_calculate):

if 'technology' in filters_to_calculate:
# Technology
techs = queryset.values('technology').annotate(Count('technology', unique=True))
techs = queryset.values('technology').annotate(count=Count('sample__id', distinct=True))
for tech in techs:
if not tech['technology'] or not tech['technology'].strip():
continue
result['technology'][tech['technology']] = tech['technology__count']
result['technology'][tech['technology']] = tech['count']

if 'has_publication' in filters_to_calculate:
# Publication
pubs = queryset.values('has_publication').annotate(Count('has_publication', unique=True))
pubs = queryset.values('has_publication').annotate(count=Count('sample__id', distinct=True))
for pub in pubs:
if pub['has_publication']:
result['publication']['has_publication'] = pub['has_publication__count']
result['publication']['has_publication'] = pub['count']

if 'organisms__name' in filters_to_calculate:
# Organisms
organisms = queryset.values('organisms__name').annotate(Count('organisms__name', unique=True))
organisms = queryset.values('organisms__name').annotate(count=Count('sample__id', distinct=True))
for organism in organisms:
# This experiment has no ExperimentOrganism-association, which is bad.
# This information may still live on the samples though.
if not organism['organisms__name']:
continue

result['organism'][organism['organisms__name']] = organism['organisms__name__count']
result['organism'][organism['organisms__name']] = organism['count']

if 'platform' in filters_to_calculate:
# Platforms
platforms = queryset.values('samples__platform_name').annotate(Count('samples__platform_name', unique=True))
platforms = queryset.values('samples__platform_name').annotate(count=Count('sample__id', distinct=True))
for plat in platforms:
if plat['samples__platform_name']:
result['platforms'][plat['samples__platform_name']] = plat['samples__platform_name__count']
result['platforms'][plat['samples__platform_name']] = plat['count']

return result

@@ -533,7 +532,10 @@ class SampleList(PaginatedAPIView):
List all Samples.
Pass in a list of pk to an ids query parameter to filter by id.
Can also accept a `dataset_id` field instead of a list of accession codes.
Also accepts:
- `dataset_id` field instead of a list of accession codes
- `experiment_accession_code` to return the samples associated with a given experiment
Append the pk or accession_code to the end of this URL to see a detail view.
@@ -545,15 +547,19 @@ def get(self, request, format=None):
filter_dict.pop('offset', None)
order_by = filter_dict.pop('order_by', None)
ids = filter_dict.pop('ids', None)
accession_codes = filter_dict.pop('accession_codes', None)

filter_by = filter_dict.pop('filter_by', None)

if ids is not None:
ids = [ int(x) for x in ids.split(',')]
filter_dict['pk__in'] = ids

if accession_codes is not None:
experiment_accession_code = filter_dict.pop('experiment_accession_code', None)
if experiment_accession_code:
experiment = get_object_or_404(Experiment.objects.values('id'), accession_code=experiment_accession_code)
filter_dict['experiments__in'] = [experiment['id']]

accession_codes = filter_dict.pop('accession_codes', None)
if accession_codes:
accession_codes = accession_codes.split(',')
filter_dict['accession_code__in'] = accession_codes

@@ -563,7 +569,17 @@ def get(self, request, format=None):
# Python doesn't provide a prettier way of doing this that I know about.
filter_dict['accession_code__in'] = [item for sublist in dataset.data.values() for item in sublist]

samples = Sample.public_objects.filter(**filter_dict)
samples = Sample.public_objects \
.prefetch_related('sampleannotation_set') \
.prefetch_related('organism') \
.prefetch_related('results') \
.prefetch_related('results__processor') \
.prefetch_related('results__computationalresultannotation_set') \
.prefetch_related('results__computedfile_set') \
.filter(**filter_dict) \
.order_by('-is_processed') \
.distinct()

if order_by:
samples = samples.order_by(order_by)

@@ -700,7 +716,7 @@ def get(self, request, format=None):
filter_dict = request.query_params.dict()
limit = max(int(filter_dict.pop('limit', 100)), 100)
offset = int(filter_dict.pop('offset', 0))
jobs = SurveyJob.objects.filter(**filter_dict)[offset:(offset + limit)]
jobs = SurveyJob.objects.filter(**filter_dict).order_by('-id')[offset:(offset + limit)]
serializer = SurveyJobSerializer(jobs, many=True)
return Response(serializer.data)

@@ -713,7 +729,7 @@ def get(self, request, format=None):
filter_dict = request.query_params.dict()
limit = max(int(filter_dict.pop('limit', 100)), 100)
offset = int(filter_dict.pop('offset', 0))
jobs = DownloaderJob.objects.filter(**filter_dict)[offset: offset + limit]
jobs = DownloaderJob.objects.filter(**filter_dict).order_by('-id')[offset: offset + limit]
serializer = DownloaderJobSerializer(jobs, many=True)
return Response(serializer.data)

@@ -726,7 +742,7 @@ def get(self, request, format=None):
filter_dict = request.query_params.dict()
limit = max(int(filter_dict.pop('limit', 100)), 100)
offset = int(filter_dict.pop('offset', 0))
jobs = ProcessorJob.objects.filter(**filter_dict)[offset: offset + limit]
jobs = ProcessorJob.objects.filter(**filter_dict).order_by('-id')[offset: offset + limit]
serializer = ProcessorJobSerializer(jobs, many=True)
return Response(serializer.data)

@@ -750,6 +766,8 @@ def get(self, request, format=None):
data['processor_jobs'] = self._get_job_stats(ProcessorJob.objects, range_param)
data['samples'] = self._get_object_stats(Sample.objects, range_param)
data['experiments'] = self._get_object_stats(Experiment.objects, range_param)
data['processed_samples'] = self._get_object_stats(Sample.processed_objects)
data['processed_experiments'] = self._get_object_stats(Experiment.processed_public_objects)
data['input_data_size'] = self._get_input_data_size()
data['output_data_size'] = self._get_output_data_size()
data['active_volumes'] = list(get_active_volumes())
@@ -937,7 +955,7 @@ def _get_job_stats(self, jobs, range_param):

return result

def _get_object_stats(self, objects, range_param):
def _get_object_stats(self, objects, range_param = False):
result = {
'total': objects.count()
}
28 changes: 28 additions & 0 deletions common/data_refinery_common/migrations/0007_auto_20190103_1555.py
@@ -0,0 +1,28 @@
# Generated by Django 2.1.2 on 2019-01-03 15:55

from django.db import migrations, models


class Migration(migrations.Migration):

dependencies = [
('data_refinery_common', '0006_auto_20181102_1526'),
]

operations = [
migrations.AddField(
model_name='originalfile',
name='downloader_jobs',
field=models.ManyToManyField(through='data_refinery_common.DownloaderJobOriginalFileAssociation', to='data_refinery_common.DownloaderJob'),
),
migrations.AddField(
model_name='originalfile',
name='processor_jobs',
field=models.ManyToManyField(through='data_refinery_common.ProcessorJobOriginalFileAssociation', to='data_refinery_common.ProcessorJob'),
),
migrations.AddField(
model_name='sample',
name='experiments',
field=models.ManyToManyField(through='data_refinery_common.ExperimentSampleAssociation', to='data_refinery_common.Experiment'),
),
]
