Merge pull request #469 from skalish/mastering-tutorial
Add tamr-client continuous mastering tutorial
pcattori committed Oct 28, 2020
2 parents 22651b3 + 631f171 commit c7a1dd3
Showing 4 changed files with 154 additions and 0 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -20,6 +20,7 @@
- [#456](https://github.com/Datatamer/tamr-client/pull/456) Added first example `tamr_client` script `examples/get_tamr_version.py`
- [#461](https://github.com/Datatamer/tamr-client/pull/461) Added functions for golden record workflow operations in `tc.golden_records`
- [#462](https://github.com/Datatamer/tamr-client/issues/462) Added function for checking operations, raising exception when operation has failed
- [#469](https://github.com/Datatamer/tamr-client/pull/469) Added "Continuous Mastering" `tamr_client` tutorial

**NEW FEATURES**
- [#383](https://github.com/Datatamer/tamr-client/issues/383) Now able to create an Operation from Job resource id
1 change: 1 addition & 0 deletions docs/beta.md
@@ -5,6 +5,7 @@

## Tutorials
* [Get Tamr version](beta/tutorial/get_version)
* [Continuous Mastering](beta/tutorial/continuous_mastering)

## Reference

111 changes: 111 additions & 0 deletions docs/beta/tutorial/continuous_mastering.md
@@ -0,0 +1,111 @@
# Tutorial: Continuous Mastering
This tutorial covers using the Python client to keep a Mastering project up to date: carrying new source data through to the end of the project and using any new labels to update the machine-learning model.

## Prerequisites
To complete this tutorial you will need:
- `tamr-unify-client` [installed](../../user-guide/installation)
- access to a Tamr instance, specifically:
  - a username and password that allow you to log in to Tamr
  - the socket address of the instance
- an existing Mastering project with loaded data that has been run to the end at least once
  - it is recommended that you first complete the tutorial [here](https://docs.tamr.com/tamr-tutorials/docs/overview-mastering)
  - alternatively, you can use a different Mastering project, in which case replace the project name `"MasteringTutorial"` used in this tutorial with the name of your project

## Steps
### Configure the Session and Instance
- Use your username and password to create an instance of `tamr_client.UsernamePasswordAuth`.
- Use the function `tamr_client.session.from_auth` to create a `Session`.
```eval_rst
.. literalinclude:: ../../../examples/continuous_mastering.py
   :language: python
   :lines: 1-9
```
- Create an `Instance` using the `protocol`, `host`, and `port` of your Tamr instance, replacing the example values with the ones for your own instance.
```eval_rst
.. literalinclude:: ../../../examples/continuous_mastering.py
   :language: python
   :lines: 11-15
```
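
As a rough illustration of how the three values relate to the socket address listed in the prerequisites, the sketch below joins them into a base URL. `InstanceSketch` and `base_url` are hypothetical names made up for this example and are not part of `tamr_client`; the sketch only assumes the usual `protocol://host:port` shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceSketch:
    """Hypothetical stand-in for an instance description (not the tamr_client class)."""

    protocol: str = "http"
    host: str = "localhost"
    port: int = 9100


def base_url(instance: InstanceSketch) -> str:
    """Join protocol, host, and port into a URL prefix."""
    return f"{instance.protocol}://{instance.host}:{instance.port}"


# Defaults mirror the example values used in this tutorial.
print(base_url(InstanceSketch()))  # http://localhost:9100

# A made-up remote instance, for illustration only.
print(base_url(InstanceSketch(protocol="https", host="tamr.example.com", port=443)))
```

If requests to your instance fail, comparing the URL you use in a browser against these three values is usually a quick first check.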

### Get the Tamr Mastering project to be updated
Use the function `tc.project.by_name` to retrieve the project information from the server.
```eval_rst
.. literalinclude:: ../../../examples/continuous_mastering.py
   :language: python
   :lines: 17
```
Ensure that the retrieved project is a Mastering project by checking its type:
```eval_rst
.. literalinclude:: ../../../examples/continuous_mastering.py
   :language: python
   :lines: 19-20
```
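
The type check is a fail-fast guard: a lookup by name could return a project of another kind, and stopping early is safer than calling mastering functions on it. The sketch below shows the same guard in isolation. The two `...Sketch` classes and `require_mastering` are hypothetical stand-ins written for this example, not `tamr_client` types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MasteringProjectSketch:
    """Hypothetical stand-in for a Mastering project."""

    name: str


@dataclass(frozen=True)
class OtherProjectSketch:
    """Hypothetical stand-in for any non-Mastering project type."""

    name: str


def require_mastering(project):
    """Return the project unchanged if it is a Mastering project, else raise."""
    if not isinstance(project, MasteringProjectSketch):
        raise RuntimeError(f"{project.name} is not a mastering project.")
    return project


require_mastering(MasteringProjectSketch("MasteringTutorial"))  # passes silently

try:
    require_mastering(OtherProjectSketch("SomeOtherProject"))
except RuntimeError as err:
    print(err)  # SomeOtherProject is not a mastering project.
```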

### Update the unified dataset
To update the unified dataset with the latest source data, applying the [attribute mapping configuration](https://docs.tamr.com/tamr-tutorials/docs/define-project-schema-mastering) and any transformations, use the function `tc.mastering.update_unified_dataset`.
```eval_rst
.. literalinclude:: ../../../examples/continuous_mastering.py
   :language: python
   :lines: 22-23
```
This function and all others in this tutorial are *synchronous*, meaning that they will not return until the job in Tamr has resolved, either successfully or unsuccessfully. The function `tc.operation.check` will raise an exception and halt the script if the job started in Tamr fails for any reason.
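
The run-then-check pattern can be sketched without a Tamr server. Everything in the block below is a stand-in written for illustration: `OperationSketch`, `OperationFailedSketch`, and the state names `"SUCCEEDED"` and `"FAILED"` are assumptions, and the real `tc.operation.check` may inspect different fields or raise a different exception type.

```python
from dataclasses import dataclass


class OperationFailedSketch(Exception):
    """Hypothetical exception type used only in this sketch."""


@dataclass(frozen=True)
class OperationSketch:
    """Hypothetical stand-in for the result of a finished Tamr job."""

    description: str
    state: str  # assumed terminal states for this sketch: "SUCCEEDED" or "FAILED"


def check(operation: OperationSketch) -> None:
    """Raise if the finished job did not succeed; otherwise return quietly."""
    if operation.state != "SUCCEEDED":
        raise OperationFailedSketch(
            f"{operation.description!r} ended in state {operation.state}"
        )


check(OperationSketch("Update unified dataset", "SUCCEEDED"))  # no exception

try:
    check(OperationSketch("Generate pairs", "FAILED"))
except OperationFailedSketch as err:
    print(f"halting: {err}")  # halting: 'Generate pairs' ended in state FAILED
```

Because each call blocks until its job has resolved, placing a check directly after every call keeps later steps from running against stale results.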

### Generate pairs
To generate pairs according to the [configured pair filter rules](https://docs.tamr.com/tamr-tutorials/docs/setup-how-pairs-are-found), use the function `tc.mastering.generate_pairs`.
```eval_rst
.. literalinclude:: ../../../examples/continuous_mastering.py
   :language: python
   :lines: 25-26
```

### Train the model with new labels
To [update the machine-learning model](https://docs.tamr.com/tamr-tutorials/docs/help-tamr-learn-about-your-data) with newly labeled pairs, use the function `tc.mastering.apply_feedback`.
```eval_rst
.. literalinclude:: ../../../examples/continuous_mastering.py
   :language: python
   :lines: 28-29
```
Note: The "Apply feedback and update results" action in the Tamr GUI is equivalent to this section and the following section combined.

### Apply the model
Applying the trained machine-learning model requires three functions.
- To update the pair prediction results, use the function `tc.mastering.update_pair_results`.
```eval_rst
.. literalinclude:: ../../../examples/continuous_mastering.py
   :language: python
   :lines: 31-32
```
- To update the list of [high-impact pairs](https://docs.tamr.com/tamr-tutorials/docs/help-tamr-learn-about-your-data#4-filter-for-high-impact-pairs), use the function `tc.mastering.update_high_impact_pairs`.
```eval_rst
.. literalinclude:: ../../../examples/continuous_mastering.py
   :language: python
   :lines: 34-35
```
- To update the clustering results, use the function `tc.mastering.update_cluster_results`.
```eval_rst
.. literalinclude:: ../../../examples/continuous_mastering.py
   :language: python
   :lines: 37-38
```

### Publish the clusters
To publish the record clusters, use the function `tc.mastering.publish_clusters`.
```eval_rst
.. literalinclude:: ../../../examples/continuous_mastering.py
   :language: python
   :lines: 40-41
```


All of the above steps can be combined into the following script `continuous_mastering.py`:

```eval_rst
.. literalinclude:: ../../../examples/continuous_mastering.py
   :language: python
```
To run the script from the command line:
```bash
TAMR_CLIENT_BETA=1 python continuous_mastering.py
```
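
Since every step follows the same run-then-check shape, the sequence could also be written as a loop over an ordered list of steps. The block below is a minimal sketch of that idea using stand-in functions that return a state string. `run_steps`, `check`, and `StepFailed` are hypothetical helpers for this example, and the step functions only borrow their names from the tutorial; they do not call Tamr.

```python
from typing import Callable, List, Sequence


class StepFailed(Exception):
    """Raised by the stand-in check when a step reports failure."""


def check(state: str) -> None:
    # Stand-in for the role tc.operation.check plays in the script.
    if state != "SUCCEEDED":
        raise StepFailed(f"step ended in state {state}")


def run_steps(steps: Sequence[Callable[[], str]]) -> List[str]:
    """Run steps in order, stopping at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for step in steps:
        check(step())
        completed.append(step.__name__)
    return completed


log: List[str] = []  # records which stand-in steps were actually started


def update_unified_dataset() -> str:
    log.append("update_unified_dataset")
    return "SUCCEEDED"


def generate_pairs() -> str:
    log.append("generate_pairs")
    return "FAILED"  # simulated failure


def apply_feedback() -> str:
    log.append("apply_feedback")
    return "SUCCEEDED"


try:
    run_steps([update_unified_dataset, generate_pairs, apply_feedback])
except StepFailed as err:
    print(f"stopped: {err}")  # stopped: step ended in state FAILED

print(log)  # ['update_unified_dataset', 'generate_pairs']
```

The explicit numbered version in the script is easier to read in a tutorial; a loop like this mainly helps when the list of steps changes from project to project.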

To continue learning, see other tutorials and examples.
41 changes: 41 additions & 0 deletions examples/continuous_mastering.py
@@ -0,0 +1,41 @@
from getpass import getpass

import tamr_client as tc

username = input("Tamr Username:")
password = getpass("Tamr Password:")

auth = tc.UsernamePasswordAuth(username, password)
session = tc.session.from_auth(auth)

protocol = "http"
host = "localhost"
port = 9100

instance = tc.Instance(protocol=protocol, host=host, port=port)

project = tc.project.by_name(session, instance, "MasteringTutorial")

if not isinstance(project, tc.MasteringProject):
    raise RuntimeError(f"{project.name} is not a mastering project.")

operation_1 = tc.mastering.update_unified_dataset(session, project)
tc.operation.check(session, operation_1)

operation_2 = tc.mastering.generate_pairs(session, project)
tc.operation.check(session, operation_2)

operation_3 = tc.mastering.apply_feedback(session, project)
tc.operation.check(session, operation_3)

operation_4 = tc.mastering.update_pair_results(session, project)
tc.operation.check(session, operation_4)

operation_5 = tc.mastering.update_high_impact_pairs(session, project)
tc.operation.check(session, operation_5)

operation_6 = tc.mastering.update_cluster_results(session, project)
tc.operation.check(session, operation_6)

operation_7 = tc.mastering.publish_clusters(session, project)
tc.operation.check(session, operation_7)