Merge pull request #378 from tiems90/df_documentation
Pandas Documentation
pcattori committed Apr 30, 2020
2 parents fe86704 + 4a1f8af commit d5aca25
Showing 3 changed files with 185 additions and 1 deletion.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -13,7 +13,8 @@
**NEW FEATURES**
- Added user documentation on [Geospatial functionalities with GeoPandas](https://github.com/Datatamer/tamr-client/blob/master/docs/user-guide/geo.md). Documented limitations in GeoPandas and workarounds.
- [#366](https://github.com/Datatamer/tamr-client/issues/366) Now able to connect to Tamr instance with implicit port

- [#376](https://github.com/Datatamer/tamr-client/issues/376) Added user documentation on [Working with Pandas](https://github.com/Datatamer/tamr-client/blob/master/docs/user-guide/pandas.md).

## 0.11.0
**BETA**

1 change: 1 addition & 0 deletions docs/index.md
@@ -35,6 +35,7 @@ assert op.succeeded()
* [Create and update resources](user-guide/spec)
* [Logging](user-guide/logging)
* [Geospatial data](user-guide/geo)
* [Pandas usage](user-guide/pandas)
* [Advanced usage](user-guide/advanced-usage)

## Contributor Guide
182 changes: 182 additions & 0 deletions docs/user-guide/pandas.md
@@ -0,0 +1,182 @@
# Pandas Workflow

## Connecting To Tamr

Connecting to a Tamr instance:

```python
import os
import pandas as pd
from tamr_unify_client import Client
from tamr_unify_client.auth import UsernamePasswordAuth

username = os.environ['TAMR_USERNAME']
password = os.environ['TAMR_PASSWORD']

auth = UsernamePasswordAuth(username, password)
tamr = Client(auth)
```
## Load Dataset as Dataframe

### Loading: In Memory

Loading a `dataset` as a pandas `dataframe` is possible via the `from_records()` method that pandas provides:

```python
my_dataset = tamr.datasets.by_name("my_tamr_dataset")
df = pd.DataFrame.from_records(my_dataset.records())
```

This streams the records in and stores them in a pandas dataframe.
Once all records have been loaded, you can interact with the dataframe as usual.

Note that since Tamr typically represents values inside `arrays`, the values will be encapsulated in `lists`
inside the dataframe. You can use standard pandas methods to deal with this, for example by calling `.explode()`
or by extracting specific elements, as shown below.
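
For example, assuming a hypothetical `City` attribute whose values are single-element lists, either of the following unwraps them:

```python
# Turn each list element into its own row:
df = df.explode('City')

# Or extract the first element of each list (the .str accessor
# also indexes into lists, not just strings):
df['City'] = df['City'].str[0]
```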

### Loading: Streaming
When working with large `datasets` it is sometimes better not to load the entire dataset into memory, but to iterate
through it record by record.
Since `dataset.records()` is a generator, this can easily be done as follows:
```python
output = []
for record in dataset.records():
    single_record_df = pd.DataFrame.from_records([record])  # wrap in a list to build a one-row dataframe
    output.append(do_something(single_record_df))
```
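
For a middle ground between record-at-a-time processing and loading everything at once, the records can also be consumed in batches. A minimal sketch, where `batch_size` is an arbitrary choice and `do_something` is the placeholder from above:

```python
from itertools import islice

def batched_dataframes(dataset, batch_size=1000):
    """Yield dataframes of up to batch_size records each."""
    records = dataset.records()
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            break
        yield pd.DataFrame.from_records(batch)

for batch_df in batched_dataframes(dataset):
    output.append(do_something(batch_df))
```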

### Custom Generators
In order to customise the data loaded into the pandas dataframe, it is possible to wrap the generator
`dataset.records()` in a different generator.

For example, to automatically flatten all lists of length one, apply a wrapper to the `dataset.records()`
generator as follows:

```python
def unlist(lst):
    """
    If the object is a list of length one, return its first element.
    Otherwise, return the original object.
    """
    if isinstance(lst, list) and len(lst) == 1:
        return lst[0]
    else:
        return lst

def dataset_to_pandas(dataset):
    """
    Incorporates basic unlisting for easy transfer between Tamr and pandas.
    """
    for record in dataset.records():
        for key in record:
            record[key] = unlist(record[key])
        yield record

df = pd.DataFrame.from_records(dataset_to_pandas(my_dataset))
```

Similarly, it is possible to filter to only certain attributes by specifying them in the generator:

```python
def filter_dataset_to_pandas(dataset, colnames):
    """
    Filter the dataset to only the primary key and the columns specified as a list in colnames.
    """
    assert isinstance(colnames, list)
    if dataset.key_attribute_names[0] not in colnames:
        colnames = dataset.key_attribute_names + colnames
    for record in dataset.records():
        yield {k: unlist(v) for k, v in record.items() if k in colnames}

df = pd.DataFrame.from_records(filter_dataset_to_pandas(my_dataset, ['City', 'new_attr']))
```

Note that upserting these records back to the original Tamr dataset would overwrite the existing records and
attributes, causing loss of the data stored in the removed attributes.

## Upload Dataframe as Dataset

### Create New Dataset
To create a new dataset and upload data, the convenience function `datasets.create_from_dataframe()` can be used.
Note that Tamr will throw an error if columns are not formatted as strings (the exception being geospatial
columns; for those, see the geospatial examples).

The following code transforms all column types to string:
```python
df = df.astype(str)
```
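
Be aware that `astype(str)` converts missing values into the literal string `'nan'`. One way to avoid that is to fill missing values first:

```python
# Replace missing values with empty strings before casting,
# so they do not become the literal string 'nan':
df = df.fillna('').astype(str)
```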

Creating the dataset is as easy as calling:
```python
tamr.datasets.create_from_dataframe(df, 'primaryKey', 'my_new_dataset')
```
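
Assuming the names used above, the upload can be verified by fetching the dataset back and comparing record counts:

```python
new_dataset = tamr.datasets.by_name('my_new_dataset')
assert len(list(new_dataset.records())) == len(df)
```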

### Changing Values

#### Making Changes: In Memory
When making changes to a dataset that was loaded as a dataframe, the changes can be pushed back to Tamr using the
`dataset.upsert_from_dataframe()` method:

```python
df = pd.DataFrame.from_records(my_dataset.records())
df['column'] = 'new_value'
my_dataset.upsert_from_dataframe(df, primary_key_name='primary_key')
```

#### Making Changes: Streaming
For larger datasets it might be better to stream the data and apply changes while iterating, so that the full
dataset never needs to be loaded into memory.
```python
for record in dataset.records():
    single_record_df = pd.DataFrame.from_records([record])  # wrap in a list to build a one-row dataframe
    single_record_df['column_to_change'] = 'new_value'
    dataset.upsert_from_dataframe(single_record_df, primary_key_name='primary_key')
```
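
Upserting one record per call issues one API request per record, which can be slow. Below is a sketch that buffers records and upserts them in batches instead; `batch_size` is an arbitrary choice, and the column and key names are the placeholders from above:

```python
def upsert_in_batches(dataset, batch_size=1000):
    """Stream records, apply a change, and upsert them in batches."""
    batch = []
    for record in dataset.records():
        batch.append(record)
        if len(batch) >= batch_size:
            batch_df = pd.DataFrame.from_records(batch)
            batch_df['column_to_change'] = 'new_value'
            dataset.upsert_from_dataframe(batch_df, primary_key_name='primary_key')
            batch = []
    if batch:  # flush the final, partially filled batch
        batch_df = pd.DataFrame.from_records(batch)
        batch_df['column_to_change'] = 'new_value'
        dataset.upsert_from_dataframe(batch_df, primary_key_name='primary_key')
```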
### Adding Attributes
When making changes to dataframes, new dataframe columns are not automatically created as attributes when upserting
records to Tamr. For these changes to be recorded, the attributes first need to be created.

One way of creating these automatically for source datasets is as follows:

```python
def add_missing_attributes(dataset, df):
    """
    Detects any attributes in the dataframe that aren't in the dataset and attempts to add them (as strings).
    """
    existing_attributes = [att.name for att in dataset.attributes]
    new_attributes = [att for att in df.columns.to_list() if att not in existing_attributes]

    if not new_attributes:
        return

    for new_attribute in new_attributes:
        attr_spec = {
            "name": new_attribute,
            "type": {"baseType": "ARRAY", "innerType": {"baseType": "STRING"}},
        }
        dataset.attributes.create(attr_spec)

add_missing_attributes(my_dataset, df)
```
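
Combined with the upsert shown earlier, pushing a dataframe with new columns back to Tamr might then look like this (the key name is a placeholder):

```python
# df already contains the new column(s) to add
add_missing_attributes(my_dataset, df)
my_dataset.upsert_from_dataframe(df, primary_key_name='primary_key')
```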

## Troubleshooting

When running into errors loading `dataset.records()` into a pandas dataframe, it helps to start from a minimal
reproducible example. The following code extracts a single record:
```python
record = next(dataset.records())
print(record)
```
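
Printing the type of each value in that record often reveals which attribute pandas is struggling with:

```python
for key, value in record.items():
    print(key, type(value), value)
```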

### Parsing
Tamr allows more variety in attribute names and contents than pandas does. In most cases pandas can load the data
correctly, but it is possible to adjust the parsing with a custom generator as shown above. The example below renames
an attribute and extracts only the first element of a multi-value column:
```python
def custom_parser(dataset):
    for record in dataset.records():
        record['pandas_column_name'] = record.pop('dataset_attribute_name')
        record['first_element_of_column'] = record['multi_value_column'][0]
        yield record

df = pd.DataFrame.from_records(custom_parser(dataset))
```
