This package implements the BiocFrame
class, a Bioconductor-friendly alternative to Pandas DataFrame
.
The main advantage is that the BiocFrame
makes no assumption on the types of the columns -
as long as an object has a length (__len__
) and slicing methods (__getitem__
), it can be used inside a BiocFrame
.
This allows us to accept arbitrarily complex objects as columns, which is often the case in Bioconductor objects.
To get started, install the package from PyPI:
pip install biocframe
# To install optional dependencies
pip install biocframe[optional]
To construct a BiocFrame
object, simply provide the data as a dictionary.
from biocframe import BiocFrame
obj = {
"ensembl": ["ENS00001", "ENS00002", "ENS00003"],
"symbol": ["MAP1A", "BIN1", "ESR1"],
}
bframe = BiocFrame(obj)
print(bframe)
## BiocFrame with 3 rows and 2 columns
## ensembl symbol
## <list> <list>
## [0] ENS00001 MAP1A
## [1] ENS00002 BIN1
## [2] ENS00003 ESR1
You can specify complex objects as columns, as long as they have some "length" equal to the number of rows.
For example, we can nest a BiocFrame
inside another BiocFrame
:
obj = {
"ensembl": ["ENS00001", "ENS00002", "ENS00002"],
"symbol": ["MAP1A", "BIN1", "ESR1"],
"ranges": BiocFrame({
"chr": ["chr1", "chr2", "chr3"],
"start": [1000, 1100, 5000],
"end": [1100, 4000, 5500]
}),
}
bframe2 = BiocFrame(obj, row_names=["row1", "row2", "row3"])
print(bframe2)
## BiocFrame with 3 rows and 3 columns
## ensembl symbol ranges
## <list> <list> <BiocFrame>
## row1 ENS00001 MAP1A chr1:1000:1100
## row2 ENS00002 BIN1 chr2:1100:4000
## row3 ENS00002 ESR1 chr3:5000:5500
Properties can be accessed directly from the object:
print(bframe.shape)
## (3, 2)
print(bframe.get_column_names())
## ['ensembl', 'symbol']
print(bframe.column_names) # same as above
## ['ensembl', 'symbol']
We can fetch individual columns:
bframe.get_column("ensembl")
## ['ENS00001', 'ENS00002', 'ENS00003']
bframe["ensembl"]
## ['ENS00001', 'ENS00002', 'ENS00003']
And we can get individual rows as a dictionary:
bframe.get_row(2)
## {'ensembl': 'ENS00003', 'symbol': 'ESR1'}
To extract a subset of the data in the BiocFrame
, we use the subset ([]
) operator.
This accepts different subsetting arguments like a boolean vector, a slice
object, a sequence of indices, or row/column names.
sliced = bframe[1:2, [True, False, False]]
print(sliced)
## BiocFrame with 1 row and 1 column
## column1
## <list>
## [0] ENS00002
sliced = bframe[[0,2], ["symbol", "ensembl"]]
print(sliced)
## BiocFrame with 2 rows and 2 columns
## symbol ensembl
## <list> <list>
## [0] MAP1A ENS00001
## [1] ESR1 ENS00003
# Short-hand to get a single column:
bframe["ensembl"]
## ['ENS00001', 'ENS00002', 'ENS00003']
To set BiocFrame
properties, we encourage a functional style of programming that avoids mutating the object.
This avoids inadvertent modification of BiocFrame
s that are part of larger data structures.
modified = bframe.set_column_names(["column1", "column2"])
print(modified)
## BiocFrame with 3 rows and 2 columns
## column1 column2
## <list> <list>
## [0] ENS00001 MAP1A
## [1] ENS00002 BIN1
## [2] ENS00003 ESR1
# Original is unchanged:
print(bframe.get_column_names())
## ['ensembl', 'symbol']
To add new columns, or replace existing columns:
modified = bframe.set_column("symbol", ["A", "B", "C"])
print(modified)
## BiocFrame with 3 rows and 2 columns
## ensembl symbol
## <list> <list>
## [0] ENS00001 A
## [1] ENS00002 B
## [2] ENS00003 C
modified = bframe.set_column("new_col_name", range(2, 5))
print(modified)
## BiocFrame with 3 rows and 3 columns
## ensembl symbol new_col_name
## <list> <list> <range>
## [0] ENS00001 MAP1A 2
## [1] ENS00002 BIN1 3
## [2] ENS00003 ESR1 4
Change the row or column names:
modified = bframe.\
set_column_names(["FOO", "BAR"]).\
set_row_names(['alpha', 'bravo', 'charlie'])
print(modified)
## BiocFrame with 3 rows and 2 columns
## FOO BAR
## <list> <list>
## alpha ENS00001 MAP1A
## bravo ENS00002 BIN1
## charlie ENS00003 ESR1
We also support Bioconductor's metadata concepts, either along the columns or for the entire object:
modified = bframe.\
set_metadata({ "author": "Jayaram Kancherla" }).\
set_column_data(BiocFrame({"column_source": ["Ensembl", "HGNC" ]}))
print(modified)
## BiocFrame with 3 rows and 2 columns
## ensembl symbol
## <list> <list>
## [0] ENS00001 MAP1A
## [1] ENS00002 BIN1
## [2] ENS00003 ESR1
## ------
## column_data(1): column_source
## metadata(1): author
Properties can also be set by direct assignment for in-place modification.
We prefer not to do it this way as it can silently mutate BiocFrame
instances inside other data structures.
Nonetheless:
testframe = BiocFrame({ "A": [1,2,3], "B": [4,5,6] })
testframe.column_names = ["column1", "column2" ]
print(testframe)
## BiocFrame with 3 rows and 2 columns
## column1 column2
## <list> <list>
## [0] 1 4
## [1] 2 5
## [2] 3 6
Similarly, we could set or replace columns directly:
testframe["column2"] = ["A", "B", "C"]
testframe[1:3, ["column1","column2"]] = BiocFrame({"x":[4, 5], "y":["E", "F"]})
## BiocFrame with 3 rows and 2 columns
## column1 column2
## <list> <list>
## [0] 1 A
## [1] 4 E
## [2] 5 F
These assignments are the same as calling the corresponding set_*()
methods with in_place = True
.
It is best to do this only if the BiocFrame
object is not being used anywhere else;
otherwise, it is safer to just create a (shallow) copy via the default in_place = False
.
BiocFrame implements methods for the various combine
generics from BiocUtils.
So, for example, to combine by row:
import biocutils
bframe1 = BiocFrame(
{
"odd": [1, 3, 5, 7, 9],
"even": [0, 2, 4, 6, 8],
}
)
bframe2 = BiocFrame(
{
"odd": [11, 33, 55, 77, 99],
"even": [0, 22, 44, 66, 88],
}
)
combined = biocutils.combine_rows(bframe1, bframe2)
print(combined)
## BiocFrame with 10 rows and 2 columns
## odd even
## <list> <list>
## [0] 1 0
## [1] 3 2
## [2] 5 4
## [3] 7 6
## [4] 9 8
## [5] 11 0
## [6] 33 22
## [7] 55 44
## [8] 77 66
## [9] 99 88
Similarly, to combine by column:
bframe3 = BiocFrame(
{
"foo": ["A", "B", "C", "D", "E"],
"bar": [True, False, True, False, True]
}
)
combined = biocutils.combine_columns(bframe1, bframe3)
print(combined)
BiocFrame with 5 rows and 4 columns
odd even foo bar
<list> <list> <list> <list>
[0] 1 0 A True
[1] 3 2 B False
[2] 5 4 C True
[3] 7 6 D False
[4] 9 8 E True
By default, both methods above assume that the number and identity of columns (for combine_rows()
) or rows (for combine_columns()
) are the same across objects.
If this is not the case, e.g., with different columns across objects, we can use BiocFrame's relaxed_combine_rows()
instead:
from biocframe import relaxed_combine_rows
modified2 = bframe2.set_column("foo", ["A", "B", "C", "D", "E"])
combined = relaxed_combine_rows(bframe1, modified2)
print(combined)
## BiocFrame with 10 rows and 3 columns
## odd even foo
## <list> <list> <list>
## [0] 1 0 None
## [1] 3 2 None
## [2] 5 4 None
## [3] 7 6 None
## [4] 9 8 None
## [5] 11 0 A
## [6] 33 22 B
## [7] 55 44 C
## [8] 77 66 D
## [9] 99 88 E
Similarly, if the rows are different, we can use BiocFrame's merge
function:
from biocframe import merge
modified1 = bframe1.set_row_names(["A", "B", "C", "D", "E"])
modified3 = bframe3.set_row_names(["C", "D", "E", "F", "G"])
combined = merge([modified1, modified3], by=None, join="outer")
## BiocFrame with 7 rows and 4 columns
## odd even foo bar
## <list> <list> <list> <list>
## A 1 0 None None
## B 3 2 None None
## C 5 4 A True
## D 7 6 B False
## E 9 8 C True
## F None None D False
## G None None E True
BiocFrame
is intended for accurate representation of Bioconductor objects for interoperability with R.
Most users will probably prefer to work with pandas DataFrame
objects for their actual analyses.
This conversion is easily achieved:
from biocframe import BiocFrame
bframe = BiocFrame(
{
"foo": ["A", "B", "C", "D", "E"],
"bar": [True, False, True, False, True]
}
)
pd = bframe.to_pandas()
print(pd)
## foo bar
## 0 A True
## 1 B False
## 2 C True
## 3 D False
## 4 E True
Conversion back to a BiocFrame
is similarly easy:
out = BiocFrame.from_pandas(pd)
print(out)
## BiocFrame with 5 rows and 2 columns
## foo bar
## <list> <list>
## 0 A True
## 1 B False
## 2 C True
## 3 D False
## 4 E True
Check out the reference documentation for more details.
Also see check out Bioconductor's S4Vectors package,
which implements the DFrame
class on which BiocFrame
was based.