## Parsing XML and JSON files
### BIOINF 575

### Bioinformatics and text files 

The following text is from:  
https://medium.com/ngs-sh/a-simple-introduction-to-xml-and-json-4547b93c9aae

Most bioinformatics file formats are simple text files, a famous example being the FASTA format for storing sequences. Historically, most file formats were proposed ad hoc to address a specific need, resulting in a fragmented universe of formats.

Examples of famous bioinformatics formats are FASTA and FASTQ for sequences, the SAM format to store details of sequence mappings, the VCF format to describe the variants of an individual compared to a reference genome, and the GFF and BED formats to describe features in a genome (e.g. genes, enhancers, binding sites…).

#### XML and JSON
Among the “general purpose” formats commonly used in computer science, two are XML (eXtensible Markup Language) and JSON (JavaScript Object Notation). The former was very popular at the beginning of the century, while the latter gained popularity later. Both are meant to encode structured information and can describe almost any kind of document (not necessarily in an ideal way). XML is more formal and enables strict adherence to a defined structure, while JSON is a simpler data container; this simplicity has made it very popular (the BIOM 1.0 format is an example of a widely adopted JSON format).


_________

### Extensible Markup Language -- `xml` -- format

The following text is from:    
https://www.tutorialspoint.com/xml/xml_overview.htm

XML stands for Extensible Markup Language. It is a text-based markup language derived from Standard Generalized Markup Language (SGML).

XML tags identify the data and are used to store and organize the data, rather than specifying how to display it like HTML tags, which are used to display the data. XML is not going to replace HTML in the near future, but it introduces new possibilities by adopting many successful features of HTML.

There are three important characteristics of XML that make it useful in a variety of systems and solutions:  
* XML is extensible − XML allows you to create your own self-descriptive tags, or language, that suits your application.
* XML carries the data, does not present it − XML allows you to store the data irrespective of how it will be presented.
* XML is a public standard − XML was developed by an organization called the World Wide Web Consortium (W3C) and is available as an open standard.

##### XML Usage
A short list of XML usage says it all:
* XML can work behind the scenes to simplify the creation of HTML documents for large web sites.
* XML can be used to exchange the information between organizations and systems.
* XML can be used for offloading and reloading of databases.
* XML can be used to store and arrange the data, which can customize your data handling needs.
* XML can easily be merged with style sheets to create almost any desired output.

Virtually any type of data can be expressed as an XML document.

##### What is Markup?   
XML is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. So what exactly is a markup language? Markup is information added to a document that enhances its meaning in certain ways, in that it identifies the parts and how they relate to each other. More specifically, a markup language is a set of symbols that can be placed in the text of a document to demarcate and label the parts of that document.
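
To make the idea of markup concrete, here is a minimal sketch that labels the parts of a small record with a tag and an attribute. The element and attribute names (`gene`, `symbol`, `id`) mirror the `genes.xml` example used later in this notebook; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Plain text with no markup: nothing says which part is the symbol or the id.
plain = "BRCA1 672"

# The same information with markup: a tag and an attribute label each part.
gene = ET.Element("gene", attrib={"symbol": "BRCA1"})
id_elem = ET.SubElement(gene, "id")
id_elem.text = "672"

# Serialize the element to a string to see the markup.
marked_up = ET.tostring(gene, encoding="unicode")
print(marked_up)  # <gene symbol="BRCA1"><id>672</id></gene>
```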




#### Parsing `xml` files

```python
import xml.etree.ElementTree as ET
```

https://docs.python.org/3/library/xml.html

The XML handling submodules are:

* xml.etree.ElementTree: the ElementTree API, a simple and lightweight XML processor
* xml.dom: the DOM API definition
* xml.dom.minidom: a minimal DOM implementation
* xml.dom.pulldom: support for building partial DOM trees
* xml.sax: SAX2 base classes and convenience functions
* xml.parsers.expat: the Expat parser binding
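
For comparison with the ElementTree API used in the rest of this notebook, the sketch below reads the same kind of data with `xml.dom.minidom`. It is only an illustration of how the DOM-style API differs; the inline XML string is a shortened stand-in for `genes.xml`.

```python
from xml.dom import minidom

# Shortened stand-in for genes.xml, kept inline so the example is self-contained.
xml_text = """<data>
    <gene symbol="BRCA1"><id>672</id></gene>
    <gene symbol="BRCA2"><id>675</id></gene>
</data>"""

dom = minidom.parseString(xml_text)

# DOM style: search by tag name, then read attributes with getAttribute().
symbols = [g.getAttribute("symbol") for g in dom.getElementsByTagName("gene")]
print(symbols)  # ['BRCA1', 'BRCA2']
```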



`genes.xml`

```xml
<?xml version="1.0"?>
<data>
    <gene symbol="BRCA1">
        <id>672</id>
        <name>BRCA1 DNA repair associated</name>
        <alias sym="IRIS"/>
        <alias sym="PSCP"/>
    </gene>
    <gene symbol="BRCA2">
        <id>675</id>
        <name>BRCA2 DNA repair associated</name>
        <alias sym="FAD"/>
        <alias sym="FAD1"/>
        <alias sym="BRCC2"/>
    </gene>
</data>
```

In [2]:
### import the ElementTree, a simple and lightweight XML processor

import xml.etree.ElementTree as ET

In [4]:
# see what is available in the ElementTree module 

[elem for elem in dir(ET) if not elem.startswith("_")]

['C14NWriterTarget',
 'Comment',
 'Element',
 'ElementPath',
 'ElementTree',
 'HTML_EMPTY',
 'PI',
 'ParseError',
 'ProcessingInstruction',
 'QName',
 'SubElement',
 'TreeBuilder',
 'VERSION',
 'XML',
 'XMLID',
 'XMLParser',
 'XMLPullParser',
 'canonicalize',
 'collections',
 'contextlib',
 'dump',
 'fromstring',
 'fromstringlist',
 'indent',
 'io',
 'iselement',
 'iterparse',
 'parse',
 're',
 'register_namespace',
 'sys',
 'tostring',
 'tostringlist',
 'weakref']

......................
#### The `.parse` method

```python
help(ET.parse)


Help on function parse in module xml.etree.ElementTree:

parse(source, parser=None)
    Parse XML document into element tree.
    
    *source* is a filename or file object containing XML data,
    *parser* is an optional parser instance defaulting to XMLParser.
    
    Return an ElementTree instance.
```
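
As the help text says, `parse` takes a filename or a file object. When the XML is already in a string, `ET.fromstring` can be used instead. A minimal sketch of the difference, using a shortened inline document rather than `genes.xml`:

```python
import io
import xml.etree.ElementTree as ET

xml_text = '<data><gene symbol="BRCA1"><id>672</id></gene></data>'

# fromstring() parses a string and returns the root Element directly.
root_from_string = ET.fromstring(xml_text)

# parse() accepts a filename or a file-like object and returns an ElementTree,
# so getroot() is needed to reach the root Element.
tree_from_file = ET.parse(io.StringIO(xml_text))
root_from_file = tree_from_file.getroot()

print(root_from_string.tag, root_from_file.tag)  # data data
```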

In [6]:
## parse the file and create a tree

tree = ET.parse('genes.xml')

In [8]:
type(tree)

xml.etree.ElementTree.ElementTree

In [11]:
# see what is available for the ElementTree object 

[elem for elem in dir(tree) if not elem.startswith("_")]

['find',
 'findall',
 'findtext',
 'getroot',
 'iter',
 'iterfind',
 'parse',
 'write',
 'write_c14n']

In [10]:
help(tree.getroot)

Help on method getroot in module xml.etree.ElementTree:

getroot() method of xml.etree.ElementTree.ElementTree instance
    Return root element of this tree.



In [12]:
root = tree.getroot()

In [14]:
type(root)

xml.etree.ElementTree.Element

In [20]:
[elem for elem in dir(root) if not elem.startswith("_")]

['append',
 'attrib',
 'clear',
 'extend',
 'find',
 'findall',
 'findtext',
 'get',
 'insert',
 'items',
 'iter',
 'iterfind',
 'itertext',
 'keys',
 'makeelement',
 'remove',
 'set',
 'tag',
 'tail',
 'text']

................
```xml
<?xml version="1.0"?>
<data>
    <gene symbol="BRCA1">
```
................



In [22]:
# As an Element object, root has a tag and a dictionary of attributes:
root.tag

'data'

In [24]:
root.attrib

{}

In [19]:
root.findall("id")

[]

In [27]:
root.findall("gene")

[<Element 'gene' at 0x137e05fd0>, <Element 'gene' at 0x137e04c20>]

In [42]:
[e.attrib for e in root.findall("gene")]

[{'symbol': 'BRCA1'}, {'symbol': 'BRCA2'}]

................

```xml
<?xml version="1.0"?>
<data>
    <gene symbol="BRCA1">
        <id>672</id>
        <name>BRCA1 DNA repair associated</name>
```
................




In [23]:
# root is iterable: looping over it yields its child Elements
# use a loop (here a list comprehension) to check them out
[child.attrib for child in root]


[{'symbol': 'BRCA1'}, {'symbol': 'BRCA2'}]

In [26]:
for e in root.findall("gene"):
    for id_elem in e.findall("id"):  # avoid shadowing the built-in id()
        print(id_elem.text)

672
675


In [29]:
# Children are nested, and we can access specific child nodes by index:

root[0].attrib

{'symbol': 'BRCA1'}

In [32]:
root[0][3].attrib["sym"]

'PSCP'

........    
Element has some useful methods that help iterate recursively over all the sub-tree below it (its children, their children, and so on). 

For example, `Element.iter()`


In [81]:
# we can explore specific tags

# iterator with Element objects for each alias tag in the file
root.iter?

Signature: root.iter(tag=None)
Docstring: <no docstring>
Type:      builtin_method

In [35]:
root.iter('alias')

<_elementtree._element_iterator at 0x169ea83b0>

In [37]:
for a in root.iter('alias'):
    print(a.attrib)

{'sym': 'IRIS'}
{'sym': 'PSCP'}
{'sym': 'FAD'}
{'sym': 'FAD1'}
{'sym': 'BRCC2'}


In [41]:
# list with Element objects for each gene tag in root

root.findall?

Signature: root.findall(path, namespaces=None)
Docstring: <no docstring>
Type:      builtin_method

In [43]:
root.findall('alias')

[]
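
The empty list above is returned because `findall` with a plain tag name only matches direct children of the element, and the `alias` elements are children of `gene`, not of the root. `findall` also accepts a limited XPath-like path, which reaches deeper levels. A small sketch on a shortened inline copy of the data (`demo_root` is an illustrative name):

```python
import xml.etree.ElementTree as ET

xml_text = """<data>
    <gene symbol="BRCA1"><alias sym="IRIS"/><alias sym="PSCP"/></gene>
    <gene symbol="BRCA2"><alias sym="FAD"/></gene>
</data>"""
demo_root = ET.fromstring(xml_text)

direct = demo_root.findall("alias")         # direct children only: none
by_path = demo_root.findall("gene/alias")   # alias children of gene children
anywhere = demo_root.findall(".//alias")    # alias elements at any depth
brca2 = demo_root.findall("gene[@symbol='BRCA2']/alias")  # filter on an attribute

print(len(direct), len(by_path), len(anywhere))  # 0 3 3
print([a.get("sym") for a in brca2])             # ['FAD']
```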

In [94]:
root.findall('gene')

[<Element 'gene' at 0x137e05fd0>, <Element 'gene' at 0x137e04c20>]

In [45]:
root.find?

Signature: root.find(path, namespaces=None)
Docstring: <no docstring>
Type:      builtin_method

In [100]:
root.find("gene").attrib

{'symbol': 'BRCA1'}

In [49]:
# the first Element object with a gene tag in root

root.find('gene')

<Element 'gene' at 0x169e4fb50>

In [51]:
# explore a gene element

ge = root.find('gene')

In [53]:
# see the tag
ge.tag

'gene'

In [55]:
# see the text 
ge.text

'\n        '
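
`ge.text` contains only whitespace because `text` holds the characters between an element's opening tag and its first child, which here is just the newline and indentation of the file. The related `tail` attribute holds the characters that follow an element's closing tag. A minimal sketch on a tiny inline document:

```python
import xml.etree.ElementTree as ET

xml_text = "<gene>\n    <id>672</id>\n</gene>"
gene_elem = ET.fromstring(xml_text)
id_elem = gene_elem.find("id")

print(repr(gene_elem.text))  # '\n    '  whitespace before the first child
print(repr(id_elem.text))    # '672'     the actual content of <id>
print(repr(id_elem.tail))    # '\n'      whitespace after </id>

# strip() is a simple way to test whether an element carries real text.
has_text = bool((gene_elem.text or "").strip())
print(has_text)  # False
```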

In [57]:
# see the keys
ge.keys()

['symbol']

In [59]:
ge.attrib.values()

dict_values(['BRCA1'])

In [61]:
# see the attributes
ge.attrib

{'symbol': 'BRCA1'}

................

```xml
<?xml version="1.0"?>
<data>
    <gene symbol="BRCA1">
        <id>672</id>
        <name>BRCA1 DNA repair associated</name>
```
................




In [65]:
# get attribute from Element

ge.get?

Signature: ge.get(key, default=None)
Docstring: <no docstring>
Type:      builtin_function_or_method

In [124]:
ge.get("symbol")

'BRCA1'

In [68]:
# find tag and get text

ge.find("name").text

'BRCA1 DNA repair associated'

In [70]:
# Putting it all together, creating a list of genes (tuples with gene info)

genes = []
for gene in root.findall('gene'):
    gene_id = gene.find('id').text
    symbol = gene.get('symbol')
    name = gene.find('name').text
    aliases = []
    for alias in gene.findall('alias'):
        aliases.append(alias.get('sym'))
    genes.append((gene_id, symbol, name, aliases))
genes

[('672', 'BRCA1', 'BRCA1 DNA repair associated', ['IRIS', 'PSCP']),
 ('675', 'BRCA2', 'BRCA2 DNA repair associated', ['FAD', 'FAD1', 'BRCC2'])]
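
The loop above assumes every gene has an `id` and a `name` child; if one were missing, `gene.find('name')` would return `None` and `.text` would raise an `AttributeError`. A hedged variant uses `findtext`, which returns a default value instead, and collects dictionaries rather than tuples. The second gene below (`TEST1`) is a made-up record, added only to show the missing-element case.

```python
import xml.etree.ElementTree as ET

xml_text = """<data>
    <gene symbol="BRCA1"><id>672</id><name>BRCA1 DNA repair associated</name></gene>
    <gene symbol="TEST1"><id>0</id></gene>
</data>"""
demo_root = ET.fromstring(xml_text)

records = []
for gene in demo_root.findall("gene"):
    records.append({
        "gene_id": gene.findtext("id"),
        "symbol": gene.get("symbol"),
        # findtext() returns the default when the child element is absent
        "name": gene.findtext("name", default=None),
        "aliases": [alias.get("sym") for alias in gene.findall("alias")],
    })

print(records[1]["name"])  # None
```

A list of dictionaries like `records` can be passed to `pd.DataFrame` directly; the keys become the column names.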

In [72]:
# xml to pandas DataFrame
import pandas as pd
pd.DataFrame(genes, columns = ("gene_id", "symbol", "name", "aliases"))

Unnamed: 0,gene_id,symbol,name,aliases
0,672,BRCA1,BRCA1 DNA repair associated,"[IRIS, PSCP]"
1,675,BRCA2,BRCA2 DNA repair associated,"[FAD, FAD1, BRCC2]"


In [74]:
pd.read_xml('genes.xml')

Unnamed: 0,symbol,id,name,alias
0,BRCA1,672,BRCA1 DNA repair associated,
1,BRCA2,675,BRCA2 DNA repair associated,


In [76]:
pd.read_xml?

Signature:
pd.read_xml(
    path_or_buffer: 'FilePath | ReadBuffer[bytes] | ReadBuffer[str]',
    *,
    xpath: 'str' = './*',
    namespaces: 'dict[str, str] | None' = None,
    elems_only: 'bool' = False,
    attrs_only: 'bool' = False,
    names: 'Sequence[str] | None' = None,
    dtype: 'DtypeArg | None' = None,
    ...

In [80]:
# stores.xml is the example file listed in the next section
pd.read_xml('stores.xml')

Unnamed: 0,Msg,slNo,foodItem,price,quantity,discount
0,Food Store items.,,,,,
1,,1.0,oranges,5.0,1kg,7%
2,,2.0,carrots,2.0,1kg,5%


#### xml to pandas DataFrame
from: https://www.geeksforgeeks.org/how-to-create-pandas-dataframe-from-nested-xml/

Save the following content into a file stores.xml

```xml
<?xml version="1.0" encoding="UTF-8"?>
  
       <Food>
  
           <Info>
           <Msg>Food Store items.</Msg>
           </Info>
  
           <store slNo="1">
               <foodItem>oranges</foodItem>
               <price>5</price>
               <quantity>1kg</quantity>
               <discount>7%</discount>
           </store>
  
           <store slNo="2">
               <foodItem>carrots</foodItem>
               <price>2</price>
               <quantity>1kg</quantity>
               <discount>5%</discount>
           </store>
  
       </Food>
```


In [82]:
# imports should not be done multiple times
# should be set up at the beginning of the notebook

import xml.etree.ElementTree as ET
import pandas as pd
  
# give the path where you saved the xml file inside the quotes

st_tree = ET.parse("stores.xml")
st_root = st_tree.getroot()
  
# print(root)
store_items = []
all_items = []
  
for store in st_root.iter('store'):
    
    store_Nr = store.attrib.get('slNo')
    itemsF = store.find('foodItem').text
    price = store.find('price').text
    quan = store.find('quantity').text
    dis = store.find('discount').text
  
    store_items = [store_Nr, itemsF, price, quan, dis]
    all_items.append(store_items)
  
xmlToDf = pd.DataFrame(all_items, columns=['SL No', 'ITEM_NUMBER', 'PRICE', 'QUANTITY', 'DISCOUNT'])
  
print(xmlToDf.to_string(index=False))

SL No ITEM_NUMBER PRICE QUANTITY DISCOUNT
    1     oranges     5      1kg       7%
    2     carrots     2      1kg       5%


In [84]:
xmlToDf

Unnamed: 0,SL No,ITEM_NUMBER,PRICE,QUANTITY,DISCOUNT
0,1,oranges,5,1kg,7%
1,2,carrots,2,1kg,5%
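
Every value read from an XML file is a string, so the `PRICE` and `DISCOUNT` columns above hold text, not numbers. The sketch below shows one way to convert values while parsing. It uses a shortened inline copy of `stores.xml`, and the key name `discount_pct` is an illustrative choice.

```python
import xml.etree.ElementTree as ET

stores_xml = """<Food>
    <Info><Msg>Food Store items.</Msg></Info>
    <store slNo="1"><foodItem>oranges</foodItem><price>5</price><discount>7%</discount></store>
    <store slNo="2"><foodItem>carrots</foodItem><price>2</price><discount>5%</discount></store>
</Food>"""

rows = []
for store in ET.fromstring(stores_xml).iter("store"):
    rows.append({
        "slNo": int(store.get("slNo")),
        "foodItem": store.findtext("foodItem"),
        "price": float(store.findtext("price")),
        # drop the trailing percent sign before converting
        "discount_pct": float(store.findtext("discount").rstrip("%")),
    })

# Numeric values can now be used in calculations.
total_price = sum(row["price"] for row in rows)
print(total_price)  # 7.0
```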


In [147]:
#help(pd.read_xml)

Resources:

https://www.geeksforgeeks.org/xml-parsing-python/
https://www.w3schools.com/xml/
https://www.tutorialspoint.com/python/python_xml_processing.htm
https://docs.python.org/3/library/xml.etree.elementtree.html    
https://realpython.com/python-xml-parser/
https://docs.python-guide.org/scenarios/xml/
https://www.geeksforgeeks.org/reading-and-writing-xml-files-in-python/     
https://www.guru99.com/manipulating-xml-with-python.html


______

### JavaScript Object Notation -- `json` -- format

A JSON document is composed of a list of items stored as key-value pairs.    
Values can be single values (strings, integers, floating point numbers…) or lists of values.

https://medium.com/ngs-sh/a-simple-introduction-to-xml-and-json-4547b93c9aae


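JSON supports primitive types (strings, numbers, booleans, and null) as well as nested arrays and objects. In Python these correspond to `str`, `int`/`float`, `bool`, `None`, `list`, and `dict`. JSON has no tuple type, so Python tuples are written as arrays.  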

#### Working with `json` format data

Import the `json` module.   
https://docs.python.org/3/library/json.html    
To work with JSON data (a string or a JSON file), it first has to be 'translated' into a Python data structure. In this lesson, we use Python's built-in `json` module to do it.   

```python
import json
```
   
The `json` module provides four main functions for reading and writing JSON data:   

* load(): reads JSON from a file object and returns the corresponding Python object (a dict for a JSON object, a list for a JSON array).
* loads(): parses JSON from a string and returns the corresponding Python object.
* dump(): serializes a Python object as JSON and writes it to a file object.
* dumps(): serializes a Python object and returns it as a JSON-formatted string.
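
The four functions can be tried side by side in a short round trip. This is a minimal sketch: the file name `gene_demo.json` and the temporary directory are illustrative choices so that no existing file is touched.

```python
import json
import os
import tempfile

gene = {"id": 672, "symbol": "BRCA1", "aliases": ["IRIS", "PSCP"]}

# dumps()/loads(): Python object <-> JSON string
text = json.dumps(gene)
from_text = json.loads(text)

# dump()/load(): Python object <-> JSON file
path = os.path.join(tempfile.mkdtemp(), "gene_demo.json")
with open(path, "w") as out_file:
    json.dump(gene, out_file)
with open(path) as in_file:
    from_file = json.load(in_file)

print(from_text == gene, from_file == gene)  # True True
```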

https://www.networkacademy.io/devnet-associate/data-formats/parsing-json-with-python

Datatypes conversion: python to json  

| Python                                 | JSON   |
|----------------------------------------|--------|
| dict                                   | object |
| list, tuple                            | array  |
| str                                    | string |
| int, float, int- & float-derived Enums | number |
| True                                   | true   |
| False                                  | false  |
| None                                   | null   |
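
The conversions in the table can be checked directly. Note that the mapping is not fully reversible: a tuple is written as a JSON array and comes back as a list. The dictionary below is made-up data for illustration.

```python
import json

python_obj = {"pair": (1, 2), "flag": True, "missing": None, "score": 0.5}

text = json.dumps(python_obj)
print(text)  # {"pair": [1, 2], "flag": true, "missing": null, "score": 0.5}

# Reading the string back gives a list where the tuple used to be.
round_trip = json.loads(text)
print(type(round_trip["pair"]).__name__)  # list
```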



`gene.json`

```json
{
  "id": 672,
  "symbol": "BRCA1",
  "full_name": "BRCA1 DNA repair associated",
  "aliases": [
    "IRIS",
    "PSCP",
    "BRCAI",
    "FANCS",
    "PNCA4",
    "RNF53",
    "BROVCA1",
    "PPP1R53"
  ]
}
```

In [88]:
import json

In [90]:
dir(json)

['JSONDecodeError',
 'JSONDecoder',
 'JSONEncoder',
 '__all__',
 '__author__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_default_decoder',
 '_default_encoder',
 'codecs',
 'decoder',
 'detect_encoding',
 'dump',
 'dumps',
 'encoder',
 'load',
 'loads',
 'scanner']

In [92]:
help(json.load)

Help on function load in module json:

load(fp, *, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)
    Deserialize ``fp`` (a ``.read()``-supporting file-like object containing
    a JSON document) to a Python object.

    ``object_hook`` is an optional function that will be called with the
    result of any object literal decode (a ``dict``). The return value of
    ``object_hook`` will be used instead of the ``dict``. This feature
    can be used to implement custom decoders (e.g. JSON-RPC class hinting).

    ``object_pairs_hook`` is an optional function that will be called with the
    result of any object literal decoded with an ordered list of pairs.  The
    return value of ``object_pairs_hook`` will be used instead of the ``dict``.
    This feature can be used to implement custom decoders.  If ``object_hook``
    is also defined, the ``object_pairs_hook`` takes priority.

    To use a custom ``JSONDecoder`` subclas

In [94]:
json.load?

Signature:
json.load(
    fp,
    *,
    cls=None,
    object_hook=None,
    parse_float=None,
    parse_int=None,
    parse_constant=None,
    object_pairs_hook=None,
    **kw,
)
Docstring:
Deserialize ``fp`` (a ``.read()``-supporting file-like object containing
a JSON document) to a Python object.

``object_hook`` is an optional function that will be called with

In [100]:
with open("gene.json") as gene_file: 
    gene1_dict = json.load(gene_file) 
gene1_dict

{'id': 672,
 'symbol': 'BRCA1',
 'full_name': 'BRCA1 DNA repair associated',
 'aliases': ['IRIS',
  'PSCP',
  'BRCAI',
  'FANCS',
  'PNCA4',
  'RNF53',
  'BROVCA1',
  'PPP1R53']}

In [104]:
test_lst = [1,2,3, ("test", "gene")]

In [106]:
# create json format string from list object
json.dumps(test_lst)

'[1, 2, 3, ["test", "gene"]]'

In [108]:
# create json format string from dict object
res = json.dumps(gene1_dict)
res

'{"id": 672, "symbol": "BRCA1", "full_name": "BRCA1 DNA repair associated", "aliases": ["IRIS", "PSCP", "BRCAI", "FANCS", "PNCA4", "RNF53", "BROVCA1", "PPP1R53"]}'

In [112]:
type(res)

str

In [114]:
json.loads(res)

{'id': 672,
 'symbol': 'BRCA1',
 'full_name': 'BRCA1 DNA repair associated',
 'aliases': ['IRIS',
  'PSCP',
  'BRCAI',
  'FANCS',
  'PNCA4',
  'RNF53',
  'BROVCA1',
  'PPP1R53']}
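
By default `json.dumps` returns a compact single-line string. The `indent` and `sort_keys` parameters produce output that is easier to read. A short sketch with a shortened gene record:

```python
import json

gene = {"symbol": "BRCA1", "id": 672, "aliases": ["IRIS", "PSCP"]}

# indent adds line breaks and indentation; sort_keys orders keys alphabetically
pretty = json.dumps(gene, indent=2, sort_keys=True)
print(pretty)
```

The formatting only affects the text; `json.loads(pretty)` returns the same dictionary as before.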

Resources:
    
https://www.geeksforgeeks.org/working-with-json-data-in-python/    
https://www.geeksforgeeks.org/read-json-file-using-python/    
https://www.networkacademy.io/devnet-associate/data-formats/parsing-json-with-python      
https://medium.com/ngs-sh/a-simple-introduction-to-xml-and-json-4547b93c9aae     
https://www.tutorialspoint.com/python_data_science/python_processing_json_data.htm    
https://www.w3schools.com/js/js_json_intro.asp
https://www.w3schools.com/python/python_json.asp

https://www.tutorialspoint.com/json/json_comparison.htm

### Exercise:

Add the BRCA2 gene to the json file to have a list of 2 genes, and read the results.    
Make a pandas DataFrame out of the list of genes.   
Convert the data from the DataFrame into json format.   
Save it to a file.   
Load the data from the file into a DataFrame.


https://www.ncbi.nlm.nih.gov/gene/675


```json
{
  "id": 675,
  "symbol": "BRCA2",
  "full_name": "BRCA2 DNA repair associated",
  "aliases": [
    "FAD",
    "FACD",
    "FAD1",
    "GLM3",
    "BRCC2",
    "FANCD",
    "PNCA2",
    "FANCD1",
    "XRCC11",
    "BROVCA2"
  ]
}
```
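
For the first step of the exercise, the top level of a JSON document can be an array, so the two gene objects can be wrapped in a list and written with `json.dump`. The sketch below is one possible way to do this, not the only one. The alias lists are shortened for brevity, and the file name `genes_demo.json` in a temporary directory is an illustrative choice so that `gene.json` is not overwritten.

```python
import json
import os
import tempfile

# Alias lists shortened here; the full lists are shown in the text.
brca1 = {"id": 672, "symbol": "BRCA1", "full_name": "BRCA1 DNA repair associated",
         "aliases": ["IRIS", "PSCP"]}
brca2 = {"id": 675, "symbol": "BRCA2", "full_name": "BRCA2 DNA repair associated",
         "aliases": ["FAD", "FACD"]}

genes = [brca1, brca2]

# Hypothetical file name in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "genes_demo.json")
with open(path, "w") as out_file:
    json.dump(genes, out_file, indent=2)

# A JSON array at the top level is loaded as a Python list of dicts.
with open(path) as in_file:
    loaded = json.load(in_file)

print(type(loaded).__name__, [g["symbol"] for g in loaded])  # list ['BRCA1', 'BRCA2']
```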

    

In [134]:
with open("gene.json") as gene_file: 
    genes_lst = json.load(gene_file) 
genes_lst

[{'id': 672,
  'symbol': 'BRCA1',
  'full_name': 'BRCA1 DNA repair associated',
  'aliases': ['IRIS',
   'PSCP',
   'BRCAI',
   'FANCS',
   'PNCA4',
   'RNF53',
   'BROVCA1',
   'PPP1R53']},
 {'id': 675,
  'symbol': 'BRCA2',
  'full_name': 'BRCA2 DNA repair associated',
  'aliases': ['FAD',
   'FACD',
   'FAD1',
   'GLM3',
   'BRCC2',
   'FANCD',
   'PNCA2',
   'FANCD1',
   'XRCC11',
   'BROVCA2']}]

In [136]:
df = pd.DataFrame(genes_lst)

In [138]:
df

Unnamed: 0,id,symbol,full_name,aliases
0,672,BRCA1,BRCA1 DNA repair associated,"[IRIS, PSCP, BRCAI, FANCS, PNCA4, RNF53, BROVC..."
1,675,BRCA2,BRCA2 DNA repair associated,"[FAD, FACD, FAD1, GLM3, BRCC2, FANCD, PNCA2, F..."


In [140]:
json.dumps(df) # this does not work

TypeError: Object of type DataFrame is not JSON serializable
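
The same `TypeError` is raised for any object the `json` module does not know how to encode, not only DataFrames. For such objects, `json.dumps` accepts a `default` function that converts them into something serializable. The sketch below illustrates this with a `set`; the helper name `encode_extra` is hypothetical.

```python
import json

record = {"symbol": "BRCA1", "aliases": {"IRIS", "PSCP"}}  # a set is not serializable

try:
    json.dumps(record)
except TypeError as err:
    error_message = str(err)
    print(error_message)  # Object of type set is not JSON serializable

def encode_extra(obj):
    """Convert sets to sorted lists; reject anything else."""
    if isinstance(obj, set):
        return sorted(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

text = json.dumps(record, default=encode_extra)
print(text)  # {"symbol": "BRCA1", "aliases": ["IRIS", "PSCP"]}
```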

In [146]:
dfj = df.to_json() # this makes a string
dfj

'{"id":{"0":672,"1":675},"symbol":{"0":"BRCA1","1":"BRCA2"},"full_name":{"0":"BRCA1 DNA repair associated","1":"BRCA2 DNA repair associated"},"aliases":{"0":["IRIS","PSCP","BRCAI","FANCS","PNCA4","RNF53","BROVCA1","PPP1R53"],"1":["FAD","FACD","FAD1","GLM3","BRCC2","FANCD","PNCA2","FANCD1","XRCC11","BROVCA2"]}}'

In [142]:
help(json.dump)

Help on function dump in module json:

dump(obj, fp, *, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, default=None, sort_keys=False, **kw)
    Serialize ``obj`` as a JSON formatted stream to ``fp`` (a
    ``.write()``-supporting file-like object).

    If ``skipkeys`` is true then ``dict`` keys that are not basic types
    (``str``, ``int``, ``float``, ``bool``, ``None``) will be skipped
    instead of raising a ``TypeError``.

    If ``ensure_ascii`` is false, then the strings written to ``fp`` can
    contain non-ASCII characters if they appear in strings contained in
    ``obj``. Otherwise, all such characters are escaped in JSON strings.

    If ``check_circular`` is false, then the circular reference check
    for container types will be skipped and a circular reference will
    result in an ``RecursionError`` (or worse).

    If ``allow_nan`` is false, then it will be a ``ValueError`` to
    serialize out of range 

In [148]:
with open("df.json", "w") as df_file:
    dfjd = json.loads(dfj) # this loads the object that is in the string, a dict
    print(dfjd) 
    json.dump(dfjd, df_file)

{'id': {'0': 672, '1': 675}, 'symbol': {'0': 'BRCA1', '1': 'BRCA2'}, 'full_name': {'0': 'BRCA1 DNA repair associated', '1': 'BRCA2 DNA repair associated'}, 'aliases': {'0': ['IRIS', 'PSCP', 'BRCAI', 'FANCS', 'PNCA4', 'RNF53', 'BROVCA1', 'PPP1R53'], '1': ['FAD', 'FACD', 'FAD1', 'GLM3', 'BRCC2', 'FANCD', 'PNCA2', 'FANCD1', 'XRCC11', 'BROVCA2']}}


In [151]:
pd.read_json("df.json")

Unnamed: 0,id,symbol,full_name,aliases
0,672,BRCA1,BRCA1 DNA repair associated,"[IRIS, PSCP, BRCAI, FANCS, PNCA4, RNF53, BROVC..."
1,675,BRCA2,BRCA2 DNA repair associated,"[FAD, FACD, FAD1, GLM3, BRCC2, FANCD, PNCA2, F..."
