Merge ef57b12 into 0c893ad

4dn-dcic · May 8, 2019 · commit 215e693
2 parents: 0c893ad + ef57b12

Showing 6 changed files with 33 additions and 32 deletions.
diff --git a/README.md b/README.md
@@ -24,7 +24,7 @@ pip3 install submit4dn --upgrade
 
 ### Troubleshooting
 
-If you encounter an error containing something like:   
+If you encounter an error containing something like:
 
 ```
  Symbol not found: _PyInt_AsLong
@@ -85,17 +85,17 @@ get_field_info --type Biosample --comments --outfile biosample.xls
 
 Example list of sheets:
 ~~~~
-get_field_info --type Publication --type Document --type Vendor --type Protocol --type BiosampleCellCulture --type Biosource --type Enzyme --type Construct --type TreatmentAgent --type TreatmentRnai --type Modification --type Biosample --type FileFastq --type IndividualMouse --type ExperimentHiC --type ExperimentSetReplicate --type ExperimentCaptureC --type Target --type GenomicRegion --type ExperimentSet --type Image --comments --outfile MetadataSheets.xls
+get_field_info --type Publication --type Document --type Vendor --type Protocol --type BiosampleCellCulture --type Biosource --type Enzyme --type Construct --type TreatmentAgent --type TreatmentRnai --type Modification --type Biosample --type FileFastq --type IndividualMouse --type ExperimentHiC --type ExperimentSetReplicate --type ExperimentCaptureC --type BioFeature --type GenomicRegion --type ExperimentSet --type Image --comments --outfile MetadataSheets.xls
 ~~~~
 
 Example list of sheets: (using python scripts)
 ~~~~
-python3 -m wranglertools.get_field_info --type Publication --type Document --type Vendor --type Protocol --type BiosampleCellCulture --type Biosource --type Enzyme --type Construct --type TreatmentAgent --type TreatmentRnai --type Modification --type Biosample --type FileFastq --type IndividualHuman --type ExperimentHiC --type ExperimentCaptureC --type Target --type GenomicRegion --type ExperimentSet --type ExperimentSetReplicate --type Image --comments --outfile MetadataSheets.xls
+python3 -m wranglertools.get_field_info --type Publication --type Document --type Vendor --type Protocol --type BiosampleCellCulture --type Biosource --type Enzyme --type Construct --type TreatmentAgent --type TreatmentRnai --type Modification --type Biosample --type FileFastq --type IndividualHuman --type ExperimentHiC --type ExperimentCaptureC --type BioFeature --type GenomicRegion --type ExperimentSet --type ExperimentSetReplicate --type Image --comments --outfile MetadataSheets.xls
 ~~~~
 
 Example list of sheets: (Experiment seq)
 ~~~~
-python3 -m wranglertools.get_field_info --type Publication --type Document --type Vendor --type Protocol --type BiosampleCellCulture --type Biosource --type Enzyme --type Construct --type TreatmentAgent --type TreatmentRnai --type Modification --type Biosample --type FileFastq --type ExperimentSeq --type Target --type GenomicRegion --type ExperimentSet --type ExperimentSetReplicate --type Image --comments --outfile exp_seq_all.xls
+python3 -m wranglertools.get_field_info --type Publication --type Document --type Vendor --type Protocol --type BiosampleCellCulture --type Biosource --type Enzyme --type Construct --type TreatmentAgent --type TreatmentRnai --type Modification --type Biosample --type FileFastq --type ExperimentSeq --type BioFeature --type GenomicRegion --type ExperimentSet --type ExperimentSetReplicate --type Image --comments --outfile exp_seq_all.xls
 ~~~~
 
 Example list of sheets: (Experiment seq simple)
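
Review note: the README examples in this hunk differ only in a few `--type` values, so the `Target` to `BioFeature` rename had to be repeated by hand in each command. A minimal sketch of generating the command line from a list instead. The helper name `build_command` is hypothetical and not part of wranglertools; only `get_field_info`, `--type`, `--comments`, and `--outfile` come from the README. `shlex.join` needs Python 3.8 or later.

```python
import shlex

# Item types copied from the first README example after this commit.
SHEET_TYPES = [
    "Publication", "Document", "Vendor", "Protocol", "BiosampleCellCulture",
    "Biosource", "Enzyme", "Construct", "TreatmentAgent", "TreatmentRnai",
    "Modification", "Biosample", "FileFastq", "IndividualMouse",
    "ExperimentHiC", "ExperimentSetReplicate", "ExperimentCaptureC",
    "BioFeature", "GenomicRegion", "ExperimentSet", "Image",
]


def build_command(types, outfile, comments=True):
    """Hypothetical helper: assemble a get_field_info command line."""
    args = ["get_field_info"]
    for item_type in types:
        # One --type flag per sheet, as in the README examples.
        args += ["--type", item_type]
    if comments:
        args.append("--comments")
    args += ["--outfile", outfile]
    # shlex.join quotes any argument that needs it for a POSIX shell.
    return shlex.join(args)


if __name__ == "__main__":
    print(build_command(SHEET_TYPES, "MetadataSheets.xls"))
```

With the types in one list, a future rename would be a one-line change rather than an edit to every example.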

diff --git a/doc/metadata_submission.md b/doc/metadata_submission.md
@@ -69,7 +69,7 @@ In some cases a field value must be formatted in a certain way or the Item will
 In other cases a field value must match a certain pattern. For example, if a field requires a DNA sequence then the submitted value must contain only the characters A, T, G, C or N.
 
 
-_Database Cross Reference (DBxref) fields_, which contain identifiers that refer to external databases, are another case requiring special formatting. In many cases the values of these fields need to be in database\_name:ID format. For example, an SRA experiment identifier would need to be submitted in the form ‘SRA:SRX1234567’ (see also [Basic fields example](#basic-field) above). Note that in a few cases where the field takes only identifiers for one or two specific databases the ID alone can be entered - for example, when entering gene symbols in the *'targeted\_genes’* field of the Target Item you can enter only the gene symbols i.e. PARK2, DLG1.
+_Database Cross Reference (DBxref) fields_, which contain identifiers that refer to external databases, are another case requiring special formatting. In many cases the values of these fields need to be in database\_name:ID format. For example, an SRA experiment identifier would need to be submitted in the form ‘SRA:SRX1234567’ (see also [Basic fields example](#basic-field) above).
 
 ####When a field specifies a linked item
 Some fields in a Sheet for an Item may contain references to another Item. These may be of the same type or different types. Examples of this type of field include the *‘biosource’* field in Biosample or the *‘files’* field in the ExperimentHiC. Note that the latter is also an example of a list field that can take multiple values.
@@ -264,7 +264,7 @@ The scripts accepts the following parameters:.
 
 **To get the complete list of relevant sheets in one workbook:**
 
-    get_field_info --type Publication --type Document --type Vendor --type Protocol --type BiosampleCellCulture --type Biosource --type Enzyme --type Construct --type TreatmentChemical --type TreatmentRnai --type Modification --type Biosample --type FileFastq --type FileSet --type IndividualHuman --type IndividualMouse --type ExperimentHiC --type ExperimentCaptureC --type ExperimentRepliseq --type Target --type GenomicRegion --type ExperimentSet --type ExperimentSetReplicate --type Image --comments --outfile AllItems.xls
+    get_field_info --type Publication --type Document --type Vendor --type Protocol --type BiosampleCellCulture --type Biosource --type Enzyme --type Construct --type TreatmentChemical --type TreatmentRnai --type Modification --type Biosample --type FileFastq --type FileSet --type IndividualHuman --type IndividualMouse --type ExperimentHiC --type ExperimentCaptureC --type ExperimentRepliseq --type BioFeature --type GenomicRegion --type Gene --type ExperimentSet --type ExperimentSetReplicate --type Image --comments --outfile AllItems.xls
 
 ##<a name="rest"></a>Submission of metadata using the 4DN REST API
 The 4DN-DCIC metadata database can be accessed using a Hypertext-Transfer-Protocol-(HTTP)-based, Representational-state-transfer (RESTful) application programming interface (API) - aka the REST API.  In fact, this API is used by the ```import_data``` script used to submit metadata entered into excel spreadsheets as described [in this document](https://docs.google.com/document/d/1Xh4GxapJxWXCbCaSqKwUd9a2wTiXmfQByzP0P8q5rnE). This API was developed by the [ENCODE][encode] project so if you have experience retrieving data from or submitting data to ENCODE use of the 4DN-DCIC API should be familiar to you.   The REST API can be used both for data submission and data retrieval, typically using scripts written in your language of choice.  Data objects exchanged with the server conform to the standard JavaScript Object Notation (JSON) format.  Libraries written for use with your chosen language are typically used for the network connection, data transfer, and parsing of data  (for example, requests and json, respectively for Python).  For a good introduction to scripting data retrieval (using GET requests) you can refer to [this page](https://www.encodeproject.org/help/rest-api/) on the [ENCODE][encode] web site that also has a good introduction to viewing and understanding JSON formatted data.
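
Review note: the two formatting rules discussed in this file (DNA sequences limited to A, T, G, C or N, and DBxref values in database\_name:ID form) could be pre-checked before submission. A rough sketch, assuming the regular expressions below; they are illustrative guesses based on the prose, not the patterns from the 4DN schemas, and both function names are hypothetical.

```python
import re

# Assumption: one or more of the five characters named in the documentation.
DNA_PATTERN = re.compile(r"[ATGCN]+")

# Assumption: a database name without spaces or colons, a colon, then a
# non-empty ID without whitespace. The real schema may be stricter.
DBXREF_PATTERN = re.compile(r"[^\s:]+:\S+")


def is_dna_sequence(value):
    """Return True if value contains only A, T, G, C or N."""
    return DNA_PATTERN.fullmatch(value) is not None


def looks_like_dbxref(value):
    """Return True if value has the database_name:ID shape."""
    return DBXREF_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["SRA:SRX1234567", "SRX1234567"]:
        print(candidate, looks_like_dbxref(candidate))
```

A check like this only catches obviously malformed values; the server-side schema validation remains the authority.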

diff --git a/doc/schema_info.md b/doc/schema_info.md
@@ -6,6 +6,7 @@ award.json | Award | award(s)
 biosample.json | Biosample | biosample(s)
 biosample\_cell\_culture.json | BiosampleCellCulture | biosample-cell-cultures, biosample\_cell\_culture
 biosource.json | Biosource | biosource(s)
+bio_feature.json | BioFeature | bio-features, bio\_feature
 construct.json | Construct | construct(s)
 document.json | Document | document(s)
 enzyme.json | Enzyme | enzyme(s)
@@ -19,6 +20,7 @@ file\_fastq.json | FileFastq | files-fastq, file\_fastq
 file\_processed.json | FileProcessed | files-processed, file\_processed
 file\_reference.json | FileReference | files-reference, file\_reference
 file\_set.json | FileSet | file-sets, file\_set
+gene.json | Gene | gene(s)
 genomic\_region.json | GenomicRegion | genomic-regions, genomic\_region
 image.json | Image | image(s)
 individual\_human.json | IndividualHuman | individuals-human, individual\_human
@@ -37,7 +39,6 @@ software.json | Software | software(s)
 sop\_map.json | SopMap | sop-maps, sop\_map
 summary\_statistic.json | SummaryStatistic | summary-statistics, summary\_statistic
 summary\_statistic\_hi\_c.json | SummaryStatisticHiC | summary-statistics-hi-c, summary\_statistic\_hi\_c
-target.json | Target | target(s)
 treatment\_chemical.json | TreatmentChemical | treatments-chemical, treatment\_chemical
 treatment\_rnai.json | TreatmentRnai | treatments-rnai, treatment\_rnai
 user.json | User | user(s)

diff --git a/wranglertools/4DNWranglerTools.egg-info/PKG-INFO b/wranglertools/4DNWranglerTools.egg-info/PKG-INFO
@@ -6,13 +6,13 @@ Home-page: http://data.4dnucleome.org
 Author: 4DN Team at Harvard Medical School
 Author-email: jeremy_johnson@hms.harvard.edu
 License: MIT
-Description: 
+Description:
         ##Connection
         first thing you need is the keyfile to access the REST application
         it is a json formatted file that contains key,secret and server
         under one identifier. Here is the default structure. The default path
         is /Users/user/keypairs.json
-        
+
             {
               "default": {
                 "key": "TheConnectionKey",
@@ -22,7 +22,7 @@ Description:
             }
         if file name is different and the key is not named default add it to the code:
         python3 code.py --keyfile nameoffile.json --key NotDefault
-        
+
         ##Generate fields.xls
         To create an xls file with sheets to be filled use the example and modify to your needs. It will accept the following parameters.
         --type           use for each sheet that you want to add to the excel workbook
@@ -31,11 +31,11 @@ Description:
         --comments       adds the comments together with enums (by default False)
         --writexls       creates the xls file (by default True)
         --outfile        change the default file name "fields.xls" to a specified one
-        
+
         *Full list*
         ~~~~
-        python3 get_field_info.py --type Publication --type Document --type Vendor --type Protocol --type ProtocolsCellCulture --type Biosource --type Enzyme --type Construct --type TreatmentChemical --type TreatmentRnai --type Modification --type Biosample --type File --type FileSet --type IndividualHuman --type IndividualMouse --type ExperimentHiC --type ExperimentCaptureC --type Target --type GenomicRegion --type ExperimentSet --type Image --outfile AllItems.xls --order
-        
+        python3 get_field_info.py --type Publication --type Document --type Vendor --type Protocol --type ProtocolsCellCulture --type Biosource --type Enzyme --type Construct --type TreatmentChemical --type TreatmentRnai --type Modification --type Biosample --type File --type FileSet --type IndividualHuman --type IndividualMouse --type ExperimentHiC --type ExperimentCaptureC --type BioFeature --type GenomicRegion --type ExperimentSet --type Image --outfile AllItems.xls --order
+
         ~~~~
         *To get a single sheet use*
         ```
@@ -44,50 +44,50 @@ Description:
         python3 get_field_info.py --type Biosample --comments --outfile biosample.xls
         python3 get_field_info.py --type Biosample --comments --outfile biosample.xls --order
         ```
-        
+
         #Specifications for fields.xls
         In fields.xls, each excel sheet is named after an object type, like ExperimentHiC, Biosample, Construct, Protocol...
-        
+
         *Each sheet has 3 rows*
         1) Field name
         2) Field description
         3) Choices for controlled vocabulary (some fields only accept a value from a list of selection, like experiment type)
-        
+
         The first entry will start from row 4, and column 2.
-        
+
         Each field can be a certain type; string, number/integer, list. If the type is integer, number or array, it will be indicated with the fields name; field:number, fields:int, field:array. If the field is a string, you will only see the field name.
         If the field is an array (field:list), you may enter a single item, or multiple items separated by comma.
-        
+
             field:array
             item1,item2,item2,item4
-        
+
         Some objects containing fields that are grouped together, called embedded sub-objects. For example the "experiment_relations" has 2 fields called "relationship_type", and "experiment". In the field names you will see
         * experiment_relations.relationship_type
         * experiment_relations.experiment
-        
+
         If the embedded sub-object is a list, you can increase the number of items by creating new columns and appending numbers to the fields names
         * experiment_relations.relationship_type1
         * experiment_relations.experiment1
         * experiment_relations.relationship_type2
         * experiment_relations.experiment2
-        
-        
+
+
         **Aliases**
-        
+
         When you create new object types at the same time, it is not possible to reference one item in another with an accession or uuid since it is not assigned yet. For example, if you have a new experiment with a new biosample in the same excel workbook (different sheets), what are you going to put in biosample field in experiments sheet? To overcome this problem, a lab specific identifier called alias is used. "aliases" field accepts multiple entries in the form of "labname:refname,labname:refname2" (testlab:expHic001). If you add lab:bisample1 to aliases field in biosample, you can then use this value in biosample field in experiment.
-        
-        
+
+
         #Specifications for import_data.py
         You can use import_data.py either to upload new metadata or patch fields of an existing metadata.
         When you import file data, the status has to be "uploading". if you have some other status, like "uploaded" and then patch the status to "uploading", you will not be able to upload file, because the dedicated url for aws upload is creating during post if the status is uploading.
-        
+
         **Uploading vs Patching**
-        
+
         If there is a uuid, alias, @id, or accession in the document that matches and existing entry in the database, it will ask if you want to PATCH that object one by one.
         If you use '--patchall' if you want to patch ALL objects in your document and ignore that message.
-        
+
         If no object identifiers are found in the document, you need to use '--update' for POSTing to occur.
-        
+
         To upload objects with attachments, use the column titled "attachment" containing the path the file you wish to attach
-        
+
 Platform: UNKNOWN
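
Review note: the PKG-INFO description above explains that a column header may carry a type suffix (`field:number`, `field:int`, `field:array`) and that array cells hold comma-separated items. A minimal sketch of that convention, assuming hypothetical helpers `parse_header` and `parse_cell`; this is not how `import_data.py` is known to parse sheets, only an illustration of the described format.

```python
def parse_header(header):
    """Split 'name:type' into (name, type); untyped headers are strings."""
    name, _, field_type = header.partition(":")
    return name, field_type or "string"


def parse_cell(header, cell):
    """Convert a cell's text according to the type suffix in its header."""
    name, field_type = parse_header(header)
    if field_type in ("array", "list"):
        # Comma-separated items; surrounding whitespace is dropped.
        value = [item.strip() for item in cell.split(",") if item.strip()]
    elif field_type in ("int", "integer"):
        value = int(cell)
    elif field_type == "number":
        value = float(cell)
    else:
        value = cell
    return name, value


if __name__ == "__main__":
    print(parse_cell("aliases:array", "testlab:expHic001, testlab:expHic002"))
```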
diff --git a/wranglertools/get_field_info.py b/wranglertools/get_field_info.py
@@ -233,7 +233,7 @@ class FieldInfo(object):
 sheet_order = [
     "User", "Award", "Lab", "Document", "ExperimentType", "Protocol", "Publication", "Organism",
     "IndividualMouse", "IndividualFly", "IndividualHuman", "FileFormat", "Vendor", "Enzyme",
-    "Construct", "TreatmentRnai", "TreatmentAgent", "GenomicRegion", "Target",
+    "Construct", "TreatmentRnai", "TreatmentAgent", "GenomicRegion", "Gene", "Target", "BioFeature",
     "Antibody", "Modification", "Image", "Biosource", "BiosampleCellCulture",
     "Biosample", "FileFastq", "FileProcessed", "FileReference", "FileCalibration",
     "FileSet", "FileSetCalibration", "MicroscopeSettingD1", "MicroscopeSettingD2",
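
Review note: this hunk inserts `Gene` and `BioFeature` into `sheet_order` while keeping `Target`. The diff does not show how the list is consumed, so the following is only a sketch of one plausible use: sorting requested sheet names by their position in the reference list. `order_sheets` is a hypothetical name, and the list is an excerpt of the entries visible in the hunk.

```python
# Excerpt of sheet_order as it appears after this commit.
SHEET_ORDER_EXCERPT = [
    "Construct", "TreatmentRnai", "TreatmentAgent", "GenomicRegion",
    "Gene", "Target", "BioFeature", "Antibody", "Modification", "Image",
    "Biosource", "BiosampleCellCulture", "Biosample",
]


def order_sheets(requested, reference=SHEET_ORDER_EXCERPT):
    """Sort requested names by reference position; unknown names go last."""
    rank = {name: index for index, name in enumerate(reference)}
    # sorted() is stable, so unknown names keep their relative order.
    return sorted(requested, key=lambda name: rank.get(name, len(reference)))


if __name__ == "__main__":
    print(order_sheets(["Biosample", "BioFeature", "Gene"]))
```

Under this reading, placing `Gene` before `BioFeature` would put the gene sheet ahead of the features that may reference it.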

diff --git a/wranglertools/import_data.py b/wranglertools/import_data.py
@@ -45,7 +45,7 @@
 Defining Object type:
     Each "sheet" of the excel file is named after the object type you are uploading,
     with the format used on http://data.4dnucleome.org//profiles/
-Ex: ExperimentHiC, Biosample, Document, Target
+Ex: ExperimentHiC, Biosample, Document, BioFeature
 
 If there is a single sheet that needs to be posted or patched, you can name the single sheet
 with the object name and use the '--type' argument