-
Notifications
You must be signed in to change notification settings - Fork 23
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
d8b0949
commit d04dc9a
Showing
2 changed files
with
24 additions
and
170 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,186 +1,34 @@ | ||
Discussion at https://github.com/pandas-dev/pandas/issues/18767 | ||
# Cyberpandas | ||
|
||
## Install | ||
[![Build Status](https://travis-ci.org/ContinuumIO/cyberpandas.svg?branch=master)](https://travis-ci.org/ContinuumIO/cyberpandas) | ||
[![Documentation Status](https://readthedocs.org/projects/cyberpandas/badge/?version=latest)](http://cyberpandas.readthedocs.io/en/latest/?badge=latest) | ||
|
||
This requires some modifications to pandas. These modifications are being | ||
merged upstream in pandas, but for now you can install from my channel with | ||
|
||
``` | ||
conda install -c TomAugspurger pandas cyberpandas | ||
``` | ||
|
||
## Abstract | ||
|
||
Proposal to add support for storing and operating on IP Address data. | ||
Adds a new block type for ip address data and an `ip` accessor to | ||
`Series` and `Index`. | ||
|
||
## Rationale | ||
|
||
For some communities, IP and MAC addresses are a common data format. The data | ||
format was deemed important enough to add the `ipaddress` module to the standard | ||
library (see `PEP 3144`_). At Anaconda, we hear from customers who would use a | ||
first-class IP address array container if it existed in pandas. | ||
|
||
I turned to StackOverflow to gauge interest in this topic. A search for "IP" on | ||
the [pandas stackoverflow | ||
tag](https://stackoverflow.com/search?q=%5Bpandas%5D+IP) turns up 300 results. | ||
Under the NumPy tag there are another 80. For comparison, I ran a few other | ||
searches to see what interest there is in other "specialized" data types (this | ||
is a very rough, probably incorrect, way of estimating interest): | ||
|
||
| term | results | | ||
| --------- | ------- | | ||
| financial | 251 | | ||
| geo | 120 | | ||
| ip | 300 | | ||
| logs | 590 | | ||
|
||
|
||
Categorical, which is already in pandas, turned up 1,089 items. | ||
|
||
Overall, I think there's enough interest relative to the implementation / | ||
maintenance burden to warrant adding the support for IP Addresses. I don't | ||
anticipate this causing any issues for the arrow transition, once ARROW-1587 is | ||
in place. We can be careful which parts of the storage layer are implementation | ||
details. | ||
|
||
## Specification | ||
|
||
The proposal is to add | ||
|
||
1. A type and container for IPArray and MACAddress (similar to | ||
`CategoricalDtype` and `Categorical`). | ||
2. A block for IPArray and MACAddress (similar to `CategoricalBlock`). | ||
3. A new accessor for Series and Indexes, `.ip`, for operating on IP | ||
addresses and MAC addresses (similar to `.cat`). | ||
|
||
The type and block should be generic IP address blocks, with no | ||
distinction between IPv4 and IPv6 addresses. In our experience, it's | ||
common to work with data from multiple sources, some of which may be | ||
IPv4, and some of which may be IPv6. This also matches the semantics | ||
of the default `ipaddress.ip_address` factory function, which returns | ||
an `IPv4Address` or `IPv6Address` as needed. Being able to deal with | ||
ip addresses in an IPv4 vs. IPv6 agnostic fashion is useful. | ||
|
||
### Data Layout | ||
|
||
Since IPv6 addresses are 128 bits, they do not fit into a standard NumPy uint64 | ||
space. This complicates the implementation (but, gives weight to accepting the | ||
proposal, since doing this on your own can be tricky). | ||
|
||
Each record will be composed of two uint64s. The first element | ||
contains the first 64 bits, and the second array contains the second 64 | ||
bits. As a NumPy structured dtype, that's | ||
Cyberpandas provides support for storing IP and MAC address data inside a pandas DataFrame using pandas' [Extension Array Interface](http://pandas-docs.github.io/pandas-docs-travis/extending.html#extension-types) | ||
|
||
```python | ||
base = np.dtype([('lo', '>u8'), ('hi', '>u8')]) | ||
``` | ||
|
||
This is a common format for handling IPv4 and IPv6 data: | ||
|
||
> Hybrid dual-stack IPv6/IPv4 implementations recognize a special class of | ||
> addresses, the IPv4-mapped IPv6 addresses. These addresses consist of an | ||
> 80-bit prefix of zeros, the next 16 bits are one, and the remaining, | ||
> least-significant 32 bits contain the IPv4 address. | ||
In [1]: from cyberpandas import IPArray | ||
|
||
From [here](https://en.wikipedia.org/wiki/IPv6#Software) | ||
In [2]: import pandas as pd | ||
|
||
### Missing Data | ||
In [3]: df = pd.DataFrame({"address": IPArray(['192.168.1.1', '192.168.1.10'])}) | ||
|
||
Use the lowest possible IP address as a marker. According to RFC2373, | ||
|
||
> The address 0:0:0:0:0:0:0:0 is called the unspecified address. It must | ||
> never be assigned to any node. It indicates the absence of an address. | ||
See [here](https://tools.ietf.org/html/rfc2373.html#section-2.5.2). | ||
|
||
### Methods | ||
|
||
The new user-facing `IPArray` (analogous to a `Categorical`) will have | ||
a few methods for easily constructing arrays of IP addresses. | ||
|
||
```python | ||
IPArray.from_pyints(cls, values: Sequence[int]) -> 'IPArray': | ||
"""Construct an IPArray array from a sequence of python integers. | ||
>>> IPArray.from_pyints([10, 18446744073709551616]) | ||
<IPArray(['0.0.0.10', '::1'])> | ||
""" | ||
|
||
IPArray.from_str(cls, values: Sequence[str]) -> 'IPArray': | ||
"""Construct an IPArray from a sequence of strings.""" | ||
In [4]: df | ||
Out[4]: | ||
address | ||
0 192.168.1.1 | ||
1 192.168.1.10 | ||
``` | ||
|
||
The methods in the new `.ip` namespace should follow the standard | ||
library's design. | ||
|
||
**Properties** | ||
|
||
- `is_multicast` | ||
- `is_private` | ||
- `is_global` | ||
- `is_unspecificed` | ||
- `is_reserved` | ||
- `is_loopback` | ||
- `is_link_local` | ||
|
||
### Reference Implementation | ||
See the [documentation](https://cyberpandas.readthedocs.io/en/latest/) for more. | ||
|
||
An implementation of the types and block is available at | ||
[cyberpandas](https://github.com/ContinuumIO/cyberpandas/) (at the moment | ||
it's a proof of concept). | ||
## Installation | ||
|
||
### Alternatives | ||
With Conda: | ||
|
||
Adding a new block type to pandas is a major change. Downstream libraries may | ||
have special-cased handling for pandas' extension types, so this shouldn't be | ||
adopted without careful consideration. | ||
|
||
Some alternatives to this that exist outside of pandas: | ||
|
||
1. Store `ipaddress.IPv4Address` or `ipaddress.IPv6Address` objects in | ||
an `object` dtype array. The `.ip` namespace could still be included | ||
with an extension decorator. The drawback here is the poor | ||
performance, as every operation would be done element-wise. | ||
2. A separate library that provides a container and methods. The | ||
downside here is that the library would need to subclass `Series`, | ||
`DataFrame`, and `Index` so that the custom blocks and types are | ||
interpreted correctly. Users would need to use the custom | ||
`IPSeries`, `IPDataFrame`, etc., which increases friction when working | ||
with other libraries that may expect / coerce to pandas objects. | ||
|
||
To expand a bit on the (current) downside of alternative 2, when the pandas constructors | ||
see an "unknown" object, they falls back to `object` dtype and stuffs the actual Python object | ||
into whatever container is being created: | ||
|
||
```python | ||
In [1]: import pandas as pd | ||
|
||
In [2]: import cyberpandas as ip | ||
|
||
In [3]: arr = ip.IPArray.from_pyints([1, 2]) | ||
|
||
In [4]: arr | ||
Out[4]: <IPArray(['0.0.0.1', '0.0.0.2'])> | ||
|
||
In [5]: pd.Series(arr) | ||
Out[5]: | ||
0 <IPArray(['0.0.0.1', '0.0.0.2'])> | ||
dtype: object | ||
``` | ||
conda install -c conda-forge cyberpandas | ||
|
||
I'd rather not have to make a subclass of Series, just to stick an array-like thing into a Series. | ||
Or from PyPI | ||
|
||
If pandas could provide an interface such that objects satisfying that interface | ||
are treated as array-like, and not a simple python object, then I'll gladly close | ||
this issue and develop the IP-address specific functionality in another package. | ||
That might be the best possible outcome to all this. | ||
pip install cyberpandas | ||
|
||
### References | ||
|
||
- [cyberpandas](https://github.com/ContinuumIO/cyberpandas/) | ||
- [PEP 3144](https://www.python.org/dev/peps/pep-3144/) | ||
- [RFC 2373](https://tools.ietf.org/html/rfc2373.html#section-2.5.2) | ||
- [ipaddress howto](https://docs.python.org/3/howto/ipaddress.html) | ||
- [ipaddress](https://docs.python.org/3/library/ipaddress.html) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters