# Some file types and their parsing

 _parse (verb)_ 

>   resolve (a sentence) into its component parts and describe their syntactic roles.

* You've already done some parsing - you've parsed tables into lists. 

* There are many file types with defined formats that are more difficult to read in than text tables. 

* Often other people have done the hard work of writing code that turns these into things you can work with in Python.

* We will look at more complex operations on tables in a later video.
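
As a reminder of what "parsing a table into lists" means, here is a minimal sketch. The table contents (column names and values) are made up for illustration, and the table is held in a string rather than a file so the example runs on its own.

```python
# A small tab-separated table, held in a string so the example is self-contained.
# The column names and values are invented for illustration.
table_text = "gene\tlength\ngene1\t44\ngene2\t44\n"

rows = []
for line in table_text.splitlines():
    # Split each line into its fields at the tab characters
    fields = line.split("\t")
    rows.append(fields)

header = rows[0]     # the first row holds the column names
records = rows[1:]   # every other row is one record

print(header)
print(records)
```

Note that every field comes back as a string; turning `"44"` into a number is a separate step.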

## FASTA format

`FASTA` format is a format for storing DNA, RNA and protein sequences.

It consists of a title line marked with a `>`, followed by any number of sequence lines. For example, the following is the sequence of a single gene, genome or protein:

    >gene1
    AATAGACCGCGATAATAGCGAA
    ATTTTCAGGGCAAAGGCCCCAT



Each file can contain more than one gene/protein.

    >gene1
    AATAGACCGCGATAATAGCGAA
    ATTTTCAGGGCAAAGGCCCCAT
    >gene2
    AATAGACCGCGATAATAGCGAA
    ATTTTCAGGGCAAAGGCCCCAT

`FASTA` is tricky to parse, but luckily someone has done it for us.

In [None]:
from Bio import SeqIO

The `Bio` package (full name Biopython) is not part of the Python standard library, but we have installed it for you to use.
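
A minimal sketch of how `SeqIO` might be used on the two-gene example above. The FASTA text is held in a string and wrapped in `StringIO` so the example runs without a file on disk; with a real file you would pass the file name to `SeqIO.parse` instead.

```python
from io import StringIO

from Bio import SeqIO

# The two-gene example from above, held in a string for illustration
fasta_text = """>gene1
AATAGACCGCGATAATAGCGAA
ATTTTCAGGGCAAAGGCCCCAT
>gene2
AATAGACCGCGATAATAGCGAA
ATTTTCAGGGCAAAGGCCCCAT
"""

# StringIO lets us treat the string as if it were an open file
handle = StringIO(fasta_text)

# SeqIO.parse yields one record per ">" entry; list() collects them all
records = list(SeqIO.parse(handle, "fasta"))

for record in records:
    # record.id is the title, record.seq is the sequence with the lines joined up
    print(record.id, len(record.seq))
```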

## JSON
(JavaScript Object Notation)

* This is a general file format for transmitting structured information across the internet.

* It is effectively a way of writing dictionaries and lists to disk as text.

An example of some JSON:

    [
        {
            "gene_name": "gene1",
            "RNA sequence": "ATGTATGCAGAGATCTATAGCGTA",
            "Domains": ["domain1", "domain2", "domain3"]
        },
        
        {   
            "gene_name": "gene2",
            "RNA sequence": "TGCGAYCGAATTTCAAACTTAC",
            "Domains": ["domainA", "domainB", "domainC"]
        }
    ]

In [2]:
import json

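# Open the file; the "with" block closes it again once we have read it
with open("example.json") as fh:
    gene_data = json.load(fh)
print("There are ", len(gene_data), "genes in the file")

# enumerate(..., start=1) numbers the genes from 1 as we loop over them
for i, gene in enumerate(gene_data, start=1):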
    print()
    print("The name of gene", i, "is", gene["gene_name"])
    print("Its RNA sequence is", gene["RNA sequence"])
    print("It has", len(gene["Domains"]), "domains")
    print()
    

There are  2 genes in the file

The name of gene 1 is gene1
Its RNA sequence is ATGTATGCAGAGATCTATAGCGTA
It has 3 domains


The name of gene 2 is gene2
Its RNA sequence is TGCGAYCGAATTTCAAACTTAC
It has 3 domains
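
Going the other way, `json.dump` writes Python dictionaries and lists out as JSON. A minimal sketch, using a made-up gene record and a temporary directory so nothing is left behind; the file name `one_gene.json` is just for illustration.

```python
import json
import os
import tempfile

# A made-up record in the same shape as the genes above
gene = {"gene_name": "gene1", "Domains": ["domain1", "domain2", "domain3"]}

# Work in a temporary directory so that the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "one_gene.json")   # hypothetical file name

    # Write the dictionary out as JSON text
    with open(path, "w") as fh:
        json.dump(gene, fh, indent=4)

    # Read it back in again
    with open(path) as fh:
        gene_again = json.load(fh)

# The dictionary we read back is equal to the one we wrote
print(gene_again == gene)
```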



## XML

You've probably heard of HTML. 

HTML is "HyperText Markup Language":

* *HyperText* because there are links.
* *Markup* because there are tags (things in `<>`) that "mark up" the text with meaning.

In HTML the "markup" tells the browser how to display the text.

XML is "eXtensible Markup Language".

It also uses tags, but to say what the information means rather than how to display it.

Many web services will allow you to choose to have data returned in `XML`.

Example:
    

There are packages designed to "parse" XML files, turning them automatically into things that look a bit like lists and dictionaries.
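
One such package is `xml.etree.ElementTree`, which is part of the Python standard library. The sketch below uses a small piece of XML invented for this example (the tag and attribute names `genes`, `gene`, `name` and `domain` are assumptions, chosen to mirror the JSON gene data above).

```python
import xml.etree.ElementTree as ET

# Hypothetical XML, invented for this example to mirror the JSON gene data
xml_text = """<genes>
    <gene name="gene1">
        <domain>domain1</domain>
        <domain>domain2</domain>
    </gene>
    <gene name="gene2">
        <domain>domainA</domain>
    </gene>
</genes>"""

root = ET.fromstring(xml_text)   # root is the outermost <genes> element

for gene in root.findall("gene"):
    # Attributes are looked up by name, a bit like a dictionary;
    # child elements can be looped over, a bit like a list
    domains = [domain.text for domain in gene.findall("domain")]
    print(gene.get("name"), domains)
```

To read from a file rather than a string, `ET.parse(filename).getroot()` gives the same kind of root element.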

Or in many cases you can process them yourself, particularly with something called "regular expressions" which we will talk about next time.