backup
rasbt committed Apr 2, 2017
1 parent c591821 commit 8d9beed
Showing 23 changed files with 7,433 additions and 205 deletions.
2 changes: 1 addition & 1 deletion README.md
Expand Up @@ -26,7 +26,7 @@
If you are a computational biologist, chances are that you cursed one too many times about protein structure files. Yes, I am talking about ye Goode Olde Protein Data Bank format, aka "PDB files." Nothing against PDB, it's a neatly structured format (if deployed correctly); yet, it is a bit cumbersome to work with PDB files in "modern" programming languages -- I am pretty sure we all agree on this.

As a machine learning and "data science" person, I fell in love with [pandas](http://pandas.pydata.org) DataFrames for handling just about everything that can be loaded into memory.
So, why don't we take pandas to the structural biology world? Working with molecular structures of biological macromolecules in pandas DataFrames is what BioPandas is all about!
So, why don't we take pandas to the structural biology world? Working with molecular structures of biological macromolecules (from PDB and MOL2 files) in pandas DataFrames is what BioPandas is all about!

<br>

3 changes: 2 additions & 1 deletion biopandas/mol2/__init__.py
Expand Up @@ -10,5 +10,6 @@
"""

from .pandas_mol2 import PandasMOL2
from .mol2_io import split_multimol2

__all__ = ["PandasMOL2"]
__all__ = ["PandasMOL2", "split_multimol2"]
21 changes: 21 additions & 0 deletions biopandas/mol2/pandas_mol2.py
Expand Up @@ -5,6 +5,7 @@
# Code Repository: https://github.com/rasbt/biopandas

import pandas as pd
import numpy as np
from .mol2_io import split_multimol2


Expand Down Expand Up @@ -211,3 +212,23 @@ def rmsd(df1, df2, heavy_only=True):
                 (d1['z'] - d2['z'])**2)
        rmsd = round((total.sum() / df1.shape[0])**0.5, 4)
        return rmsd

    def distance(self, xyz=(0.00, 0.00, 0.00)):
        """Computes Euclidean distance between atoms and a 3D point.
        Parameters
        ----------
        xyz : tuple (0.00, 0.00, 0.00)
            X, Y, and Z coordinate of the reference center for the distance
            computation
        Returns
        -------
        pandas.Series : Pandas Series object containing the Euclidean
            distance between the atoms in the atom section and `xyz`.
        """
        return self.df.apply(lambda x: np.sqrt(np.sum(
            ((x['x'] - xyz[0])**2,
             (x['y'] - xyz[1])**2,
             (x['z'] - xyz[2])**2))), axis=1)
7 changes: 7 additions & 0 deletions biopandas/mol2/tests/test_pandas_mol2.py
Expand Up @@ -47,3 +47,10 @@ def test_rmsd():

    assert pdmol_1.rmsd(pdmol_1.df, pdmol_2.df, heavy_only=False) == 1.5523
    assert pdmol_1.rmsd(pdmol_1.df, pdmol_2.df) == 1.1609


def test_distance():
    data_path = os.path.join(this_dir, 'data', '1b5e_1.mol2')

    pdmol = PandasMOL2().read_mol2(data_path)
    assert round(pdmol.distance().values[0], 3) == 31.185
3 changes: 2 additions & 1 deletion docs/mkdocs.yml
Expand Up @@ -24,7 +24,7 @@ extra_javascript:
extra_css:
- extra.css

copyright: Copyright &copy; 2015-2016 <a href="http://sebastianraschka.com">Sebastian Raschka</a>
copyright: Copyright &copy; 2015-2017 <a href="http://sebastianraschka.com">Sebastian Raschka</a>
google_analytics: ['UA-38457794-3', 'rasbt.github.io/biopandas']

pages:
Expand All @@ -33,6 +33,7 @@ pages:
        - tutorials/Working_with_PDB_Structures_in_DataFrames.md
    - API:
        - api_subpackages/biopandas.pdb.md
        - api_subpackages/biopandas.mol2.md
    - Changelog: changelog.md
    - Installation: installation.md
    - Contributing: contributing.md
178 changes: 178 additions & 0 deletions docs/sources/api_subpackages/biopandas.mol2.md
@@ -0,0 +1,178 @@
biopandas version: 0.2.0.dev0
## PandasMOL2

*PandasMOL2()*

Object for working with Tripos Mol2 structure files.

**Attributes**

- `df` : pandas.DataFrame

DataFrame of a Mol2's ATOM section


- `mol2_text` : str

Mol2 file contents in string format


- `code` : str

ID, code, or name of the molecule stored



### Methods

<hr>

*read_mol2(path, columns=None)*

Reads Mol2 files (unzipped or gzipped) from local drive

Note that if your mol2 file contains more than one molecule,
only the first molecule is loaded into the DataFrame

**Parameters**

- `path` : str

Path to the Mol2 file in .mol2 format or gzipped format (.mol2.gz)


- `columns` : dict or None (default: None)

If None, this method expects a 9-column ATOM section that contains
the following columns:

{0:('atom_id', int), 1:('atom_name', str),
2:('x', float), 3:('y', float), 4:('z', float),
5:('atom_type', str), 6:('subst_id', int),
7:('subst_name', str), 8:('charge', float)}

If your Mol2 files are formatted differently, you can provide your
own column_mapping dictionary in a format similar to the one above.
However, note that not all PandasMOL2 methods
may be supported then.

**Returns**

self
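
To make the `columns` mapping more concrete, here is a minimal sketch of how such a dictionary could drive the parsing of a single ATOM line. It uses only the standard library; `parse_atom_line` is a hypothetical helper written for illustration and is not part of BioPandas, whose actual parsing may differ.

```python
# Default 9-column layout, copied from the description above.
COLUMNS = {0: ('atom_id', int), 1: ('atom_name', str),
           2: ('x', float), 3: ('y', float), 4: ('z', float),
           5: ('atom_type', str), 6: ('subst_id', int),
           7: ('subst_name', str), 8: ('charge', float)}


def parse_atom_line(line, columns=COLUMNS):
    """Convert one whitespace-separated ATOM line into a typed dict.

    Hypothetical helper for illustration only (not a BioPandas function).
    Each entry maps a column position to a (name, type) pair.
    """
    fields = line.split()
    return {name: cast(fields[idx]) for idx, (name, cast) in columns.items()}


atom = parse_atom_line(' 1 C1 -1.1786 2.7011 -4.0323 C.3 1 <0> -0.1537')
print(atom['atom_name'], atom['x'], atom['charge'])
```

A custom mapping would follow the same shape: one `(name, type)` pair per column position in the file.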

<hr>

*read_mol2_from_list(mol2_lines, mol2_code, columns=None)*

Reads Mol2 file from a list into DataFrames

**Parameters**

- `mol2_lines` : list

A list of lines containing the mol2 file contents. For example,
['@<TRIPOS>MOLECULE\n',
'ZINC38611810\n',
' 65 68 0 0 0\n',
'SMALL\n',
'NO_CHARGES\n',
'\n',
'@<TRIPOS>ATOM\n',
' 1 C1 -1.1786 2.7011 -4.0323 C.3 1 <0> -0.1537\n',
' 2 C2 -1.2950 1.2442 -3.5798 C.3 1 <0> -0.1156\n',
...]


- `mol2_code` : str or None

Name or ID of the molecule.


- `columns` : dict or None (default: None)

If None, this method expects a 9-column ATOM section that contains
the following columns:
{0:('atom_id', int), 1:('atom_name', str),
2:('x', float), 3:('y', float), 4:('z', float),
5:('atom_type', str), 6:('subst_id', int),
7:('subst_name', str), 8:('charge', float)}
If your Mol2 files are formatted differently, you can provide your
own column_mapping dictionary in a format similar to the one above.
However, note that not all PandasMOL2 methods may be
supported then.

**Returns**

self



<hr>

*rmsd(df1, df2, heavy_only=True)*

Compute the Root Mean Square Deviation between molecules

**Parameters**

- `df1` : pandas.DataFrame

DataFrame with HETATM, ATOM, and/or ANISOU entries

- `df2` : pandas.DataFrame

Second DataFrame for RMSD computation against df1. Must have the
same number of entries as df1

- `heavy_only` : bool (default: True)

Which atoms to compare to compute the RMSD. If `True` (default),
computes the RMSD between non-hydrogen atoms only.

**Returns**

- `rmsd` : float

Root Mean Square Deviation between df1 and df2
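
As a rough illustration of the quantity being computed, the sketch below evaluates the RMSD over two equally long lists of `(x, y, z)` tuples with the standard library only. The function name `rmsd_from_coords` and the plain-tuple input are assumptions made for this example; the method documented above operates on DataFrames, and the heavy-atom filtering controlled by `heavy_only` is left out here.

```python
import math


def rmsd_from_coords(coords1, coords2):
    """RMSD between two equally long lists of (x, y, z) tuples.

    Hypothetical helper for illustration; atoms are assumed to be
    listed in the same order in both inputs.
    """
    if len(coords1) != len(coords2):
        raise ValueError('both coordinate lists must have the same length')
    # Sum of squared per-atom displacements.
    total = sum((x1 - x2)**2 + (y1 - y2)**2 + (z1 - z2)**2
                for (x1, y1, z1), (x2, y2, z2) in zip(coords1, coords2))
    # Root of the mean squared displacement, rounded to 4 decimals.
    return round(math.sqrt(total / len(coords1)), 4)


a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 2.0), (1.0, 0.0, 0.0)]
print(rmsd_from_coords(a, b))
```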

### Properties

<hr>

*df*

Accesses the pandas DataFrame

## split_multimol2

*split_multimol2(mol2_path)*

Splits a multi-mol2 file into individual Mol2 file contents.

**Parameters**

- `mol2_path` : str

Path to the multi-mol2 file. Parses gzip files if the filepath
ends in .gz.

**Returns**

A generator object for lists for every extracted mol2-file. Lists contain
the molecule ID and the mol2 file contents.
e.g., ['ID1234', ['@<TRIPOS>MOLECULE\n', '...']]. Note that bytestrings
are returned (for reasons of efficiency) if the Mol2 content is read
from a gzip (.gz) file.
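
The splitting idea can be sketched on an in-memory list of lines, without touching the file system. The generator below is a hypothetical stand-in (`split_mol2_lines` is not a BioPandas function) and assumes that every molecule starts with an `@<TRIPOS>MOLECULE` record whose following line holds the molecule ID; the real `split_multimol2` reads from a path and may handle edge cases differently.

```python
def split_mol2_lines(lines):
    """Yield [molecule_id, mol2_lines] for each molecule in `lines`.

    Hypothetical helper for illustration. Lines that appear before the
    first @<TRIPOS>MOLECULE record are ignored.
    """
    def as_record(block):
        # The molecule ID is taken from the line after the MOLECULE record.
        mol_id = block[1].strip() if len(block) > 1 else ''
        return [mol_id, block]

    block = None
    for line in lines:
        if line.startswith('@<TRIPOS>MOLECULE'):
            if block:
                yield as_record(block)
            block = [line]
        elif block is not None:
            block.append(line)
    if block:
        yield as_record(block)


multi = ['@<TRIPOS>MOLECULE\n', 'ID1\n', '@<TRIPOS>ATOM\n',
         '@<TRIPOS>MOLECULE\n', 'ID2\n']
for mol_id, mol_lines in split_mol2_lines(multi):
    print(mol_id, len(mol_lines))
```

Using a generator keeps only one molecule in memory at a time, which matters for large multi-mol2 files.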



60 changes: 59 additions & 1 deletion docs/sources/api_subpackages/biopandas.pdb.md
@@ -1,4 +1,4 @@
biopandas version: 0.1.5.dev0
biopandas version: 0.2.0.dev0
## PandasPDB

*PandasPDB()*
Expand Down Expand Up @@ -33,6 +33,64 @@ Object for working with Protein Databank structure files.

<hr>

*amino3to1(record='ATOM', residue_col='residue_name', fillna='?')*

Creates 1-letter amino acid codes from DataFrame

Non-canonical amino-acids are converted as follows:
ASH (protonated ASP) => D
CYX (disulfide-bonded CYS) => C
GLH (protonated GLU) => E
HID/HIE/HIP (different protonation states of HIS) => H
HYP (hydroxyproline) => P
MSE (selenomethionine) => M

**Parameters**

- `record` : str (default: 'ATOM')

Specifies the record DataFrame

- `residue_col` : str (default: 'residue_name')

Column in `record` DataFrame to look for 3-letter amino acid
codes for the conversion

- `fillna` : str (default: '?')

Placeholder string to use for unknown amino acids

**Returns**

- `pandas.Series` : Pandas Series object containing the 1-letter amino

acid codes after conversion
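
The conversion rule, including the non-canonical residues listed above and the `fillna` placeholder, can be sketched with a plain dictionary lookup. `to_one_letter` is a hypothetical helper for illustration and works on a list of residue names rather than on a DataFrame, so it is not a substitute for the method itself.

```python
# 20 canonical amino acids plus the non-canonical mappings listed above.
AMINO3TO1 = {
    'ALA': 'A', 'ARG': 'R', 'ASN': 'N', 'ASP': 'D', 'CYS': 'C',
    'GLN': 'Q', 'GLU': 'E', 'GLY': 'G', 'HIS': 'H', 'ILE': 'I',
    'LEU': 'L', 'LYS': 'K', 'MET': 'M', 'PHE': 'F', 'PRO': 'P',
    'SER': 'S', 'THR': 'T', 'TRP': 'W', 'TYR': 'Y', 'VAL': 'V',
    # Non-canonical residues
    'ASH': 'D', 'CYX': 'C', 'GLH': 'E', 'HID': 'H', 'HIE': 'H',
    'HIP': 'H', 'HYP': 'P', 'MSE': 'M',
}


def to_one_letter(residue_names, fillna='?'):
    """Map 3-letter residue names to 1-letter codes.

    Hypothetical helper for illustration; unknown names are replaced
    by the `fillna` placeholder.
    """
    return [AMINO3TO1.get(name, fillna) for name in residue_names]


print(''.join(to_one_letter(['MSE', 'GLY', 'HID', 'XYZ'])))
```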

<hr>

*distance(xyz=(0.0, 0.0, 0.0), record='ATOM')*

Computes Euclidean distance between atoms and a 3D point.

**Parameters**

- `xyz` : tuple (0.00, 0.00, 0.00)

X, Y, and Z coordinate of the reference center for the distance
computation

- `record` : str (default: 'ATOM')

Specifies the record DataFrame

**Returns**

- `pandas.Series` : Pandas Series object containing the Euclidean

distance between the atoms in the record section and `xyz`.
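
For reference, the per-atom quantity is the usual Euclidean distance, sqrt((x - x0)^2 + (y - y0)^2 + (z - z0)^2). The sketch below computes it with the standard library for a list of `(x, y, z)` tuples; `distances_to_point` is a hypothetical helper written for this example and returns a plain list instead of a pandas Series.

```python
import math


def distances_to_point(coords, xyz=(0.0, 0.0, 0.0)):
    """Euclidean distance of each (x, y, z) tuple in `coords` to `xyz`.

    Hypothetical helper for illustration only.
    """
    x0, y0, z0 = xyz
    return [math.sqrt((x - x0)**2 + (y - y0)**2 + (z - z0)**2)
            for x, y, z in coords]


atoms = [(3.0, 4.0, 0.0), (1.0, 2.0, 2.0)]
print(distances_to_point(atoms))
print(distances_to_point(atoms, xyz=(1.0, 0.0, 0.0)))
```

With a Series result, as returned by the documented method, the distances can be used directly as a boolean mask, for example to keep only atoms within a cutoff radius of the reference point.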

<hr>

*fetch_pdb(pdb_code)*

Fetches PDB file contents from the Protein Databank at rcsb.org.
2 changes: 1 addition & 1 deletion docs/sources/api_subpackages/biopandas.testutils.md
@@ -1 +1 @@
biopandas version: 0.1.5.dev0
biopandas version: 0.2.0.dev0
Binary file added docs/sources/img/logos/1b5e.png
Binary file added docs/sources/img/logos/1b5e_120.png
5 changes: 3 additions & 2 deletions docs/sources/index.md
Expand Up @@ -8,7 +8,7 @@
[![PyPI Version](https://img.shields.io/pypi/v/biopandas.svg)](https://pypi.python.org/pypi/biopandas/)
[![License](https://img.shields.io/badge/license-new%20BSD-blue.svg)](https://github.com/rasbt/biopandas/blob/master/LICENSE)
![Python 2.7](https://img.shields.io/badge/python-2.7-blue.svg)
![Python 3.5](https://img.shields.io/badge/python-3.5-blue.svg)
![Python 3.6](https://img.shields.io/badge/python-3.6-blue.svg)

<br>

Expand All @@ -28,7 +28,8 @@
If you are a computational biologist, chances are that you cursed one too many times about protein structure files. Yes, I am talking about ye Goode Olde Protein Data Bank format, aka "PDB files." Nothing against PDB, it's a neatly structured format (if deployed correctly); yet, it is a bit cumbersome to work with PDB files in "modern" programming languages -- I am pretty sure we all agree on this.

As a machine learning and "data science" person, I fell in love with [pandas](http://pandas.pydata.org) DataFrames for handling just about everything that can be loaded into memory.
So, why don't we take pandas to the structural biology world? Working with molecular structures of biological macromolecules in pandas DataFrames is what BioPandas is all about!
So, why don't we take pandas to the structural biology world? Working with molecular structures of biological macromolecules (from PDB and MOL2 files) in pandas DataFrames is what BioPandas is all about!




5 changes: 3 additions & 2 deletions docs/sources/installation.md
Expand Up @@ -12,7 +12,7 @@ pip install biopandas

## Conda-forge

The latest stable release of `biopandas` is now also available via [conda-forge](https://github.com/conda-forge/biopandas-feedstock); you can install it via
Versions of `biopandas` are now also available via [conda-forge](https://github.com/conda-forge/biopandas-feedstock); you can install it via


```bash
Expand All @@ -21,9 +21,10 @@ conda install biopandas -c conda-forge

or simply

```bash
```bash
conda install biopandas
```

if you have `conda-forge` already [added to your channels](https://github.com/conda-forge/biopandas-feedstock).

