# Entity Resolution Kit Overview

The high-level methods can be accessed via a global import of the package like so:

In [14]:
from entity_resolution import *

## Downloading S3 data

Provided that your credentials are set, you can download S3 data directly to your local disk. To set your credentials, run `aws configure` in your terminal. The entity resolution package does not accept credentials itself; it relies on the AWS SDK for this.

In [None]:
s3_url = "s3://mirror-sc-identity-resolution-snowcatcloud/customer-data/enriched/stream/run=20211112/"
download_path = "/Users/Swa/temp/stuff"
download_s3_folder(s3_url, download_path)
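If you want to check which bucket and key prefix a URL such as the one above refers to, you can split it with the standard library. This is only a sketch: `split_s3_url` is a hypothetical helper written for this illustration and is not part of the entity resolution package.

```python
from urllib.parse import urlparse


def split_s3_url(s3_url):
    """Split an s3:// URL into (bucket, key_prefix).

    Hypothetical helper for illustration; not part of the package.
    """
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URL: {s3_url!r}")
    # The bucket is the host part; the prefix is the path without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_url(
    "s3://mirror-sc-identity-resolution-snowcatcloud/customer-data/enriched/stream/run=20211112/"
)
print(bucket)
print(prefix)
```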

## In one go

The following `demo()` method will

- generate a pseudo-random graph
- perform the identity resolution on it
- create predictions
- export the prediction to csv
- open the default GraphML app to visualize the graph (with predictions)


This will open the default application for GraphML files (`*.graphml`); [yEd is probably the best option](https://www.yworks.com/products/yed/download#download). Use the "Smart Organic Layout" and the properties mapper (Edit > Properties Mapper) to turn data values into visuals. There is a `yEdMapping.cnfx` file in the `diverse` directory that you can import; it contains a set of mappings that will render something like the image below.

`df` is a dataframe whose rows are predictions of the form (Device, Identity, Probability). It is saved as a CSV file as well.

![](../sample.png)
![](../predictions.png)

You can also output synthetic data to [Gephi](https://gephi.org).

In [30]:
from entity_resolution import *
g, df = demo()
df

Analyzing 103 component(s). The largest contains 28 nodes.


100%|██████████| 103/103 [00:00<00:00, 5115.49it/s]

Predictions 37
Timing 0:00:00
GraphML diagram at  /Users/swa/Projects/entity-resolution/notebooks/synthetic.graphml





Predictions saved to  /Users/swa/Projects/entity-resolution/notebooks/predictions.csv
Nodes: 753 Edges: 713 Identities: 50 Inferred: 37


Unnamed: 0,Device,Identity,Probability
15,778c92bd-abdd-4b79-b354-a2b3ce25314b,32dcc34c-7f68-41a3-af78-044ee8e788ab,0.38
12,fc5a5fe4-c96d-4e6f-8a03-07e66d0e745b,41ce6b58-fc7b-4839-94f3-0930f286cf6b,0.33
18,aca29a1f-595e-43c0-8e26-d92762293098,28b802ea-663a-4622-be0e-cbe08b0cf8b2,0.29
35,c26e7ff7-9f2a-4451-ba8d-f0242c17cf19,8ea9a1f7-c1d9-4bbf-b9ec-470e21ed5bc0,0.29
23,c8dad358-4cd2-415f-beec-2f554af9f531,f5171e47-1b7c-4807-a781-8b0ac87fff80,0.29
31,c26f19cd-2790-4395-bdb8-677a97a7b580,6bdd00fb-0d47-4024-91c7-dad03bdc76d2,0.29
30,2d8d9baa-0b93-4ca2-837e-a56efc38efa9,c06807e0-bab1-413c-b255-d12d33c732c6,0.29
29,9b0d3f3b-e2d7-4adf-a1df-eda7c81181bb,0aa8a59f-290d-41da-8ab8-a371d4f9e8a4,0.29
8,465ac3f9-ef76-47c8-9918-812a79b43e1b,62e1a974-1b32-4e87-8621-937fb089f936,0.29
25,113498e6-91c6-4322-b622-9b1aa1f3d29e,24ce984d-db99-4f1b-b808-d3e2675ca7c3,0.22


In [13]:
!ls

SharedIP.png demo.ipynb


## Visualization

As shown above, the framework can easily generate GraphML files for visualization, but you can just as easily use Gephi if you prefer. The generated `gexf` file can be opened with Gephi and some other compatible applications.

In [None]:
from entity_resolution import *
create_synthetic_gephi_graph()
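For reference, GraphML is plain XML, so it can help to see what a tiny file with a probability attribute on the edges looks like. The sketch below builds one with the standard library. It is not the package's exporter; the node ids and the attribute name `probability` are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

NS = "http://graphml.graphdrawing.org/xmlns"
ET.register_namespace("", NS)  # emit GraphML as the default namespace


def build_graphml(edges):
    """Build a GraphML string from (device_id, identity_id, probability) tuples."""
    root = ET.Element(f"{{{NS}}}graphml")
    # Declare an edge attribute so tools like yEd can map it to visuals.
    ET.SubElement(root, f"{{{NS}}}key", {
        "id": "d0", "for": "edge",
        "attr.name": "probability", "attr.type": "double",
    })
    graph = ET.SubElement(root, f"{{{NS}}}graph", {"id": "G", "edgedefault": "undirected"})

    # First pass: one <node> per distinct id, in order of first appearance.
    seen = set()
    for device, identity, _ in edges:
        for node_id in (device, identity):
            if node_id not in seen:
                seen.add(node_id)
                ET.SubElement(graph, f"{{{NS}}}node", {"id": node_id})

    # Second pass: one <edge> per prediction, carrying its probability.
    for device, identity, probability in edges:
        edge = ET.SubElement(graph, f"{{{NS}}}edge", {"source": device, "target": identity})
        data = ET.SubElement(edge, f"{{{NS}}}data", {"key": "d0"})
        data.text = str(probability)

    return ET.tostring(root, encoding="unicode")


# Hypothetical ids, for illustration only.
xml_text = build_graphml([
    ("device-1", "identity-A", 0.38),
    ("device-2", "identity-A", 0.29),
])
print(xml_text)
```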

## Generating test graphs

This is useful for demo and test purposes.


All of the methods creating synthetic data have the following tuning parameters:

- **device count**: how many device nodes to create
- **identity count**: how many identity nodes to create
- **cookie probability**: the probability that a cookie node is shared between two devices
- **ip probability**: the probability that an IP node is shared between two devices
- **location probability**: the probability that a Location node is shared between two devices

These parameters have appropriate defaults, so you can use the methods like this:


In [None]:
g = create_synthetic_graph()
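To make the tuning parameters above more concrete, here is a minimal sketch of how sharing probabilities could drive a synthetic edge list. It is not the package's generator: `make_synthetic_edges`, its defaults, and the rule that each pair of devices shares a node with the given probability are all assumptions made for this illustration.

```python
import itertools
import random


def make_synthetic_edges(device_count=10, identity_count=3,
                         cookie_probability=0.2, ip_probability=0.2,
                         location_probability=0.2, seed=42):
    """Return a list of (device, other_node) edges for a toy graph."""
    rng = random.Random(seed)
    devices = [f"device-{i}" for i in range(device_count)]
    edges = []

    # Assumption: every identity is known for exactly one randomly chosen device.
    for i in range(identity_count):
        edges.append((rng.choice(devices), f"identity-{i}"))

    sharing = {
        "cookie": cookie_probability,
        "ip": ip_probability,
        "location": location_probability,
    }
    for kind, probability in sharing.items():
        shared_nodes = 0
        # Assumption: each pair of devices shares a node of this kind
        # independently with the given probability.
        for first, second in itertools.combinations(devices, 2):
            if rng.random() < probability:
                node = f"{kind}-{shared_nodes}"
                shared_nodes += 1
                edges.append((first, node))
                edges.append((second, node))
    return edges


edges = make_synthetic_edges()
print(len(edges), "edges")
```

Raising one of the probabilities produces more shared nodes and therefore larger connected components.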

Once you have a graph, you can infer identities immediately like this:

In [None]:
infer_identities(g)

The output represents the inferred edges between devices and identities, together with a probability.

By default the internal IDs of the nodes are returned, but you can access the true unique identifiers as well:

In [None]:
coll, ids = infer_identities(g, return_ids=True)
ids

This can be saved to CSV and used for downstream tasks.
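The log output of the package ("Analyzing 103 component(s)") suggests that the analysis runs per connected component. The sketch below illustrates only that idea: it finds components in an edge list and lists devices that sit in the same component as an identity without being linked to it directly. It does not compute probabilities and is not the package's algorithm; the `device-` and `identity-` id prefixes are assumptions made for this illustration.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return the connected components of an undirected edge list as sets."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node.
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def candidate_pairs(edges):
    """List (device, identity) pairs in one component that lack a direct edge."""
    direct = {frozenset(edge) for edge in edges}
    pairs = []
    for component in connected_components(edges):
        devices = sorted(n for n in component if n.startswith("device-"))
        identities = sorted(n for n in component if n.startswith("identity-"))
        for device in devices:
            for identity in identities:
                if frozenset((device, identity)) not in direct:
                    pairs.append((device, identity))
    return pairs


# Hypothetical toy graph: device-2 shares an IP with device-1, which has a known identity.
toy_edges = [
    ("device-1", "identity-A"),
    ("device-1", "ip-1"),
    ("device-2", "ip-1"),
    ("device-3", "cookie-9"),
]
print(candidate_pairs(toy_edges))
```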

## Raw format

Given raw data in a directory (downloaded from S3 as described above), you can make predictions immediately:

In [24]:
from entity_resolution import *
raw_directory = "/Users/Swa/temp/raw"
g, df = raw_to_predictions(raw_directory)
df

Reading 457 file(s) in directory '/Users/Swa/temp/raw'.


100%|██████████| 457/457 [00:00<00:00, 1034.21it/s]


Analyzing 6892 component(s). The largest contains 47 nodes.


  2%|▏         | 137/6892 [00:00<00:00, 7163.05it/s]

Predictions 100
Timing 0:00:00.300000
Predictions saved to  /Users/swa/Projects/entity-resolution/notebooks/predictions.csv





Unnamed: 0,Device,Identity,Probability
25,491f4f36-9ffc-4dc7-823f-a1b125ff3008,55afe9b5-86e5-4dc5-a54f-4ba3915117b3,0.38
71,9210af4e-5f86-4968-ab64-c023088f9cda,9fb50933-68bd-4acc-9c50-9ba1da181509,0.38
22,d9fe125a-d2d6-49f2-8a4e-1b8a085c4ccf,0b15d659-e1c9-40a9-b2b7-3cc1d4dddd5d,0.30
94,e73db22c-89ba-4752-b3db-b5eeb60df87e,4601c7b6-2921-400b-8a4d-b408b1567e52,0.29
75,c7ee4751-8924-4538-be3e-fb69c04eb2a5,cfb8933b-3c92-4db1-85bf-68a23eb2e684,0.29
...,...,...,...
65,652cb291-ebb4-432e-b553-47874580fc2b,39ae99a1-b4ef-4da2-9b87-bd9ff9256f33,0.10
81,64244b62-8e16-48a7-81f1-35b6602df110,70e018b9-f260-4d2b-adb5-901283b5dea5,0.10
80,c0fa54b8-0266-4c22-be91-b4a69d26f473,542bf700-9ec2-428d-bf1d-bef9edd07145,0.10
79,34027480-9172-422c-a4ab-4702cf2d98b9,7101b4d9-e489-45ce-bff4-e1cf122f9585,0.10


The predictions are displayed and saved as CSV as well.
The following parameters are available:

- **raw_directory**: the directory where the gz-files are located. The S3 download utility described above can be used to download data from S3.
- **threshold**: only predictions with a probability above this value are retained
- **save_diagram**: whether the internally generated graph should be saved (as GraphML). This is only for visualization purposes.
- **predictions_path**: the path where the CSV with the predictions should be saved. If not specified, it will be saved to `predictions.csv` in the directory where the script was run.
- **max_predictions**: specifies how many of the top predictions are retained.
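The effect of **threshold** and **max_predictions** can be pictured with a few lines of plain Python. This is a sketch of the described behaviour, not the package's code: `select_predictions` is a hypothetical helper, and applying the threshold before taking the top predictions is an assumption.

```python
def select_predictions(predictions, threshold=0.0, max_predictions=None):
    """Filter (device, identity, probability) tuples.

    Keeps only probabilities strictly above `threshold`, sorts them from
    most to least probable, and optionally keeps only the top entries.
    """
    kept = [p for p in predictions if p[2] > threshold]
    kept.sort(key=lambda p: p[2], reverse=True)
    if max_predictions is None:
        return kept
    return kept[:max_predictions]


# Hypothetical ids, for illustration only.
predictions = [
    ("d1", "i1", 0.38),
    ("d2", "i2", 0.10),
    ("d3", "i3", 0.29),
    ("d4", "i4", 0.33),
]
top = select_predictions(predictions, threshold=0.2, max_predictions=2)
print(top)
```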

If you wish to convert the raw data to JSON entities for other tasks, you can do it like this:

In [26]:
from entity_resolution import *
raw_directory = "/Users/Swa/temp/raw"
json_entities_path = "/Users/Swa/temp/entities/"
raw_to_entities_json(raw_directory, json_entities_path)

Reading 457 file(s) in directory '/Users/Swa/temp/raw'.


100%|██████████| 457/457 [00:00<00:00, 1042.51it/s]


Written to  /Users/Swa/temp/entities/raw.json


[{'Device': 'b06b6fff-e48e-4fcd-ae7b-750551ecdc35',
  'IP': '',
  'Location': '541888cd-fcbf-427e-bd1b-577aad4b9f22',
  'Identity': '',
  'Cookie': ''},
 {'Device': '3b0d9602-87d2-40f7-ae30-2e5d566a2ff7',
  'IP': '',
  'Location': '',
  'Identity': '',
  'Cookie': '3b0d9602-87d2-40f7-ae30-2e5d566a2ff7'},
 {'Device': '3b0d9602-87d2-40f7-ae30-2e5d566a2ff7',
  'IP': '',
  'Location': '',
  'Identity': '',
  'Cookie': '69b54d52-bcf8-4819-8b38-e753b14e03a4'},
 {'Device': '3b0d9602-87d2-40f7-ae30-2e5d566a2ff7',
  'IP': 'b554455c-29cf-4515-8d03-c8543abd148c',
  'Location': '',
  'Identity': '',
  'Cookie': ''},
 {'Device': '3b0d9602-87d2-40f7-ae30-2e5d566a2ff7',
  'IP': '',
  'Location': '4f230bf0-2296-4503-bbe7-d00ee5475c16',
  'Identity': '',
  'Cookie': ''},
 {'Device': '90b3609f-dd66-44e8-80cd-d499cdd709bc',
  'IP': '',
  'Location': '',
  'Identity': '',
  'Cookie': '90b3609f-dd66-44e8-80cd-d499cdd709bc'},
 {'Device': '90b3609f-dd66-44e8-80cd-d499cdd709bc',
  'IP': '',
  'Location': '',


In [25]:
!ls

SharedIP.png    demo.ipynb      predictions.csv
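The entity records printed above each link one device to at most one IP, Location, Identity or Cookie value. If you want to use them in your own graph tooling, a sketch like the following turns them into typed edges. `entities_to_edges` is a hypothetical helper, not part of the package. Nodes are kept as `(kind, id)` tuples because, as the sample shows, a cookie can carry the same id as its device.

```python
def entities_to_edges(entities):
    """Turn entity records into sorted, de-duplicated typed edges."""
    edges = set()
    for record in entities:
        device = record.get("Device", "")
        if not device:
            continue
        for kind in ("IP", "Location", "Identity", "Cookie"):
            value = record.get(kind, "")
            if value:  # empty strings mean "no link of this kind"
                edges.add((("Device", device), (kind, value)))
    return sorted(edges)


# The first records from the output above.
entities = [
    {"Device": "b06b6fff-e48e-4fcd-ae7b-750551ecdc35", "IP": "",
     "Location": "541888cd-fcbf-427e-bd1b-577aad4b9f22", "Identity": "", "Cookie": ""},
    {"Device": "3b0d9602-87d2-40f7-ae30-2e5d566a2ff7", "IP": "",
     "Location": "", "Identity": "", "Cookie": "3b0d9602-87d2-40f7-ae30-2e5d566a2ff7"},
    {"Device": "3b0d9602-87d2-40f7-ae30-2e5d566a2ff7", "IP": "",
     "Location": "", "Identity": "", "Cookie": "69b54d52-bcf8-4819-8b38-e753b14e03a4"},
    {"Device": "3b0d9602-87d2-40f7-ae30-2e5d566a2ff7",
     "IP": "b554455c-29cf-4515-8d03-c8543abd148c",
     "Location": "", "Identity": "", "Cookie": ""},
]
entity_edges = entities_to_edges(entities)
for source, target in entity_edges:
    print(source, "->", target)
```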


## Neo4j

You can make predictions directly from a Neo4j database:

In [27]:
g, df = neo_to_predictions()
df

Nodes: 1585 Edges: 1166 Components: 757 Largest component: 334
Analyzing 757 component(s). The largest contains 334 nodes.


100%|██████████| 757/757 [00:00<00:00, 3706.50it/s]

Predictions 17
Timing 0:00:00.200000
Predictions saved to  /Users/swa/Projects/entity-resolution/notebooks/predictions.csv





Unnamed: 0,Device,Identity,Probability
16,d07b47ce-13a5-4c58-8668-1d3bd7122578,1560a6d9-8be0-4f44-a8ee-38ae2110ba3c,1.0
15,6b853faf-f6d4-470e-a008-8ca53e29b1bf,3d900cba-68b9-4b60-8e2d-80ed0854947a,0.33
7,aef8c668-2fc9-43eb-9e90-ef242754850b,a131d207-876a-47d3-9e61-b5af6ab87ad1,0.27
5,06cb8d84-ee85-4e57-8550-c2b0143473f6,6f71017a-296c-410f-97d1-56d9b90f9513,0.22
10,cebec72b-5d9c-423f-bee9-1deb7107531b,896d8048-26b8-4ef6-871d-365772c71442,0.2
0,fe6a75066c724173fe9e42d260de9f70,c1a05e97230cb88ccd0e97d83fbe9e5d,0.2
1,4e488d431e0d7819e46455034939cb5c,c1a05e97230cb88ccd0e97d83fbe9e5d,0.18
2,6fd267e36a215714f929543f3b89ba6c,6d7869610504537a68b2b10e00888a97,0.17
6,063ac104-3262-46c8-ad4e-304be8a41175,b05ae855-cd6b-4ae8-9e95-af136d989ced,0.15
8,aef8c668-2fc9-43eb-9e90-ef242754850b,647049ea-052f-457d-a1bc-6df32fdb7823,0.15


Here again, there are plenty of parameters to tune the results and you can save the resulting predictions to a CSV for other purposes.
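As an example of such a downstream use, note that the same device can appear with more than one candidate identity in the output above. The sketch below reads prediction rows with the standard `csv` module and keeps the most probable identity per device. The rows are copied from the output above, and the header is assumed to match the displayed column names; a real run would open `predictions.csv` instead of the inline string.

```python
import csv
import io

# Three rows copied from the output above.
sample_csv = """Unnamed: 0,Device,Identity,Probability
15,6b853faf-f6d4-470e-a008-8ca53e29b1bf,3d900cba-68b9-4b60-8e2d-80ed0854947a,0.33
7,aef8c668-2fc9-43eb-9e90-ef242754850b,a131d207-876a-47d3-9e61-b5af6ab87ad1,0.27
8,aef8c668-2fc9-43eb-9e90-ef242754850b,647049ea-052f-457d-a1bc-6df32fdb7823,0.15
"""

best = {}  # device id -> (identity id, probability)
for row in csv.DictReader(io.StringIO(sample_csv)):
    probability = float(row["Probability"])
    device = row["Device"]
    # Keep the candidate with the highest probability for each device.
    if device not in best or probability > best[device][1]:
        best[device] = (row["Identity"], probability)

for device, (identity, probability) in best.items():
    print(device, identity, probability)
```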

For demo purposes you can also create test databases:

In [28]:
create_synthetic_neo("demodata")

Database 'demodata' has been created.
Creating nodes


100%|██████████| 763/763 [00:20<00:00, 37.00it/s]


Creating edges


100%|██████████| 664/664 [00:19<00:00, 34.13it/s]

Done.





<networkx.classes.digraph.DiGraph at 0x7ff17168e310>

In [29]:
add_predictions_to_neo("demodata")

All INFERRED edges have been removed from database 'demodata'.
Nodes: 613 Edges: 664 Components: 116 Largest component: 17
Analyzing 116 component(s). The largest contains 17 nodes.


100%|██████████| 116/116 [00:00<00:00, 9646.67it/s]

Predictions 30
Timing 0:00:00





Written 30 prediction(s) to the 'demodata' database.


Unnamed: 0,Device,Identity,Probability
0,26de254b-f08b-462f-a432-6c6697a4a8d6,27e641d0-8dee-4776-8d1f-8575975321d0,1.0
12,e3a48e0e-8ab6-4af6-b514-d67df805f9a2,cbca723a-5eb9-4780-9570-2a3742e367b9,1.0
28,dcd18202-6049-41a3-b6fe-a6cb1ae5d3f7,4f4c3013-69b4-4385-9516-32de715aee42,1.0
27,7f8e5a6c-bd54-450b-94d4-9213df500727,97068a71-a02b-4a2f-bdd6-c89173e452e8,1.0
26,02178704-2e92-4d23-9b41-a983dea4e199,e2bef139-a6ad-4276-b28d-6d476976dc61,1.0
24,360b6845-5d8a-464a-9b0a-a8c7b82ad591,d5b1f9e6-2fd6-4cf9-8f47-2594a1561a38,1.0
23,ea0991aa-5fda-48cf-9a7a-07e0675e08e0,5637a199-7fc1-4c7b-9a06-f753007cac0d,1.0
22,0d77f824-bb2b-4955-bc86-5eeb9b53ae0f,5637a199-7fc1-4c7b-9a06-f753007cac0d,1.0
21,7d6c16fa-bf9a-4ff0-8074-50b64d7b90b7,afbe52f9-0f07-4e99-9c28-504cca56c849,1.0
20,fef57c48-31fe-4386-8037-9c973a8799b5,691798c2-533d-4aa9-b2eb-b2764f31ca95,1.0
