In [None]:
!pip install antigranular

In [21]:
import antigranular as ag
session = ag.login(<client_id>,<client_secret>, competition = "Sandbox for Harvard OpenDP Hackathon")

tls_cert_name: ip-100-100-16-101.eu-west-1.compute.internal_a9c02632-efbf-4a5b-a4c2-4a5afccd23e3
cert_thumbprint: c45d530b540599da856d61bcafd2de5c3ca49e8ae9d20980f5dfb64eae43f4f9e1580e1cc09aeffb0599ccb6d5994c5b44839b9f94782eaa84b4538a364f67e9
local_host_port: a9c02632-efbf-4a5b-a4c2-4a5afccd23e3
server_hostname: ip-100-100-16-101.eu-west-1.compute.internal
Loading dataset "Flight Company Dataset for Sandbox" to the kernel...
Dataset "Flight Company Dataset for Sandbox" loaded to the kernel as flight_company_dataset_for_sandbox
Loading dataset "Health Organisation Dataset for Sandbox" to the kernel...
Dataset "Health Organisation Dataset for Sandbox" loaded to the kernel as health_organisation_dataset_for_sandbox
Connected to Antigranular server session id: 156100ca-b505-43f2-b4bb-3430333c5385, the session will time out if idle for 60 minutes
Cell magic '%%ag' registered successfully, use `%%ag` in a notebook cell to execute your python code on Antigranular private python server
🚀 Every

In [22]:
%%ag
health = health_organisation_dataset_for_sandbox
flight = flight_company_dataset_for_sandbox

## Basic metadata analysis
Estimate the size of the datasets and inspect their column names and metadata.

In [23]:
%%ag
# Printing the column names using ag_print
ag_print(f"Flight Dataset \n {flight.columns} \n")
ag_print(f"Health Dataset \n {health.columns}")

Flight Dataset 
 Index(['flight_number', 'flight_date', 'flight_from', 'flight_to',
       'passenger_firstname', 'passenger_lastname', 'passenger_date_of_birth'],
      dtype='object') 

Health Dataset 
 Index(['patient_firstname', 'patient_lastname', 'patient_date_of_birth',
       'covidtest_date', 'covidtest_result', 'patient_address'],
      dtype='object')



In [24]:
%%ag
# Getting the differentially private count
ag_print(f"Total health records : {health['patient_firstname'].count(eps=0.1)}")
ag_print(f"Total flight records :  {flight['passenger_firstname'].count(eps=0.1)}")

Total health records : 59230

Total flight records :  39025
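
The `eps` argument sets the privacy budget spent on each query, so the counts above are noisy estimates rather than exact values. The notebook does not say which mechanism Antigranular applies; the block below is only a textbook sketch of the Laplace mechanism for a count, written in plain Python outside the `%%ag` environment. The names `laplace_noise` and `dp_count` and the true count of 1000 are hypothetical.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def dp_count(true_count: int, eps: float, rng: random.Random) -> int:
    """Noisy count. Adding or removing one record changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / eps."""
    return round(true_count + laplace_noise(1.0 / eps, rng))

rng = random.Random(0)
# Hypothetical true count of 1000, queried five times with eps = 0.1.
print([dp_count(1000, eps=0.1, rng=rng) for _ in range(5)])
```

With this mechanism, a smaller `eps` gives a larger noise scale, so each query trades accuracy against privacy budget.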



To find the flights that may have carried a passenger who recently tested positive for COVID-19, we need to link the two datasets efficiently. One way to do this is with the `recordlinkage` library.

In [25]:
%%ag
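# Keep only the records of patients who tested positive: every other result
# becomes NaN, and dropna() then removes those rows (along with any row that
# has a missing value in another column).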
health['covidtest_result'] = health['covidtest_result'].where(health['covidtest_result'] == 'positive')
health = health.dropna()

Sample visualization of how the filtered health data looks
![](https://content.antigranular.com/image/notebook_content/health_dataset_rlsand.png)

In [26]:
%%ag
ag_print(f"Total covid positive health records : {health['patient_firstname'].count(eps=0.1)}")

Total covid positive health records : 1291



### Setting Indexing Rules

When using the `recordlinkage` library, index both datasets on columns that are expected to hold matching values for the same person. If the indexing columns are not comparable, the number of candidate pairs in the resulting MultiIndex can become very large.

In [27]:
%%ag
import op_recordlinkage as rl
# A full indexing is a complete cartesian product
indexer  = rl.Index()
indexer.full()
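
To see why the choice of index matters, the following plain-Python sketch compares a full index with one that only pairs records agreeing on a key (blocking). It does not use `op_recordlinkage`; the toy records and the helper names `full_pairs` and `block_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy records: (record id, date of birth).
flights = [("f1", "1990-01-01"), ("f2", "1985-06-15"), ("f3", "1990-01-01")]
patients = [("p1", "1990-01-01"), ("p2", "1972-03-09")]

def full_pairs(left, right):
    """Every left record paired with every right record (Cartesian product)."""
    return [(l_id, r_id) for (l_id, _), (r_id, _) in product(left, right)]

def block_pairs(left, right):
    """Only pairs whose blocking key (here the date of birth) matches exactly."""
    by_key = defaultdict(list)
    for r_id, key in right:
        by_key[key].append(r_id)
    return [(l_id, r_id) for l_id, key in left for r_id in by_key.get(key, [])]

print(len(full_pairs(flights, patients)))  # 3 * 2 = 6 candidate pairs
print(block_pairs(flights, patients))      # [('f1', 'p1'), ('f3', 'p1')]
```

Taking the noisy counts reported earlier at face value (39025 flight records and 1291 positive health records), a full index would hold roughly 50 million candidate pairs.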




Let's index both datasets on the date of birth. Before doing so, make sure the dates in both columns use the same format.

**Click [here](https://recordlinkage.readthedocs.io/en/latest/ref-index.html) to learn about more indexing algorithms.**

We currently support:
- Full
- Block
- SortedNeighbor
- Random
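
Blocking on the date of birth only works if both columns spell the same date the same way. The sketch below shows one way to normalise date strings to `YYYY-MM-DD` in plain Python. The list of input formats and the helper name `to_iso_date` are assumptions for illustration, and the notebook does not show how such a step would be applied to a PrivateDataFrame column.

```python
from datetime import datetime

# Hypothetical input formats; adjust to whatever the real columns contain.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")

def to_iso_date(value: str) -> str:
    """Return the date as YYYY-MM-DD, trying each known format in turn."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {value!r}")

print(to_iso_date("1990-01-31"))  # 1990-01-31
print(to_iso_date("31/01/1990"))  # 1990-01-31
print(to_iso_date("19900131"))    # 1990-01-31
```

Day-first and month-first layouts cannot be told apart from the string alone, so the format list should reflect what is known about each source.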


In [28]:
%%ag
import op_recordlinkage as rl
indexer = rl.Index()
indexer.block('passenger_date_of_birth','patient_date_of_birth')
candidate_links = indexer.index(flight,health)

In [29]:
%%ag
# total number of links based on this indexing choice.
ag_print(candidate_links.count(eps=0.1))

6097



### Setting Comparison Rules

Once the candidate links are formed, we can apply compare rules to them to refine the linking. Each compare rule can be given its own weight.
  - Fuzzy-match the first and last names of the passengers, with a threshold on the last name (each name rule scores at most 1 by default).
  - Score links by the gap between the flight date and the positive COVID test date, using a 14-day cutoff (weight = 2).

**Click [here](https://recordlinkage.readthedocs.io/en/latest/ref-compare.html#recordlinkage.Compare) to learn about more compare rules.**

We currently support:
- String
- Numeric
- Exact
- Geo
- Date
- Custom Compares
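
For intuition about the `jarowinkler` string method, here is a minimal pure-Python sketch of the Jaro-Winkler similarity. It is a textbook version, not the code used by `op_recordlinkage`; library implementations may differ in details such as when the prefix bonus is applied. The function names and sample names are hypothetical.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match if equal and no further apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity plus a bonus for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)

score = jaro_winkler("SMITH", "SMYTH")
# Assumption: a thresholded rule maps scores at or above the threshold to 1.
print(round(score, 4), int(score >= 0.9))
```

A threshold such as 0.9 is strict: in this sketch a single substituted letter in a five-letter name already falls below it.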
    

In [30]:
%%ag
compare = rl.Compare()

# Built-in fuzzy string matching. A threshold is set on the last name to get stronger links.
compare.string("passenger_firstname" , "patient_firstname" ,method='jarowinkler', label="firstname")
compare.string("passenger_lastname","patient_lastname" ,method='jarowinkler', threshold=0.9, label="lastname")


# Using a custom compare rule.
# Custom functions are executed in an isolated environment,
# where datetime and regex are pre-imported.
import datetime
def cmp(date_str1: str, date_str2: str) -> int:
    # Convert date strings to datetime objects
    date1 = datetime.datetime.strptime(date_str1, "%Y-%m-%d")
    date2 = datetime.datetime.strptime(date_str2, "%Y-%m-%d")

    # Signed difference in days: date2 minus date1 (not an absolute value)
    days_apart = (date2 - date1).days
    # Weight 2 when date2 is at most 14 days after date1; this also covers
    # every case where date2 falls before date1 (negative difference)
    if days_apart <= 14:
        return 2
    else:
        return 0

compare.custom(cmp,"flight_date","covidtest_date",label="date_cmp")
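
The custom rule can be checked locally on a few date pairs before it is registered. The sketch below repeats the same logic in condensed form and adds `cmp_abs`, a hypothetical variant that treats the gap symmetrically. The sample dates are made up, and switching to the variant would change the features and the score reported later in this notebook.

```python
import datetime

def cmp(date_str1: str, date_str2: str) -> int:
    """Same logic as the custom rule: weight 2 if date2 - date1 <= 14 days."""
    date1 = datetime.datetime.strptime(date_str1, "%Y-%m-%d")
    date2 = datetime.datetime.strptime(date_str2, "%Y-%m-%d")
    return 2 if (date2 - date1).days <= 14 else 0

def cmp_abs(date_str1: str, date_str2: str) -> int:
    """Hypothetical variant: weight 2 only if the dates are within 14 days
    of each other in either direction."""
    date1 = datetime.datetime.strptime(date_str1, "%Y-%m-%d")
    date2 = datetime.datetime.strptime(date_str2, "%Y-%m-%d")
    return 2 if abs((date2 - date1).days) <= 14 else 0

# Hypothetical (flight_date, covidtest_date) pairs.
print(cmp("2023-05-10", "2023-05-20"))      # test 10 days after the flight -> 2
print(cmp("2023-05-10", "2023-06-10"))      # test 31 days after the flight -> 0
print(cmp("2023-05-10", "2023-01-01"))      # test months before the flight -> 2
print(cmp_abs("2023-05-10", "2023-01-01"))  # symmetric variant -> 0
```

Because the original rule uses a signed difference, any test dated before the flight receives the full weight, however long ago it was taken.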


In [31]:
%%ag
features = compare.compute(candidate_links,flight,health)

Sample visualization of features matrix ( PrivateDataFrame )
![](https://content.antigranular.com/image/notebook_content/feat_mat_rlsand.png)

In [32]:
%%ag
# Let's find the average matching weight produced by the compare rules we set.
ag_print(f"Average weight : {features.sum(axis=1).mean(eps=0.1)}")

Average weight : 2.384529685884475



### Linking the datasets

We choose a threshold of 3 for linking, guided by the average weight obtained above. The custom compare rule
was given a weight of 2 so that it takes priority over the first-name and last-name matches. A threshold of 3
therefore keeps only strong links under the compare rules we have set.
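
The effect of the threshold can be worked through by hand on a few rows. This sketch assumes that each name rule scores between 0 and 1, that the date rule scores 0 or 2, and that a link is kept when its summed score reaches the threshold; the exact comparison used by `get_match` is not shown in this notebook. The sample rows and function names are hypothetical.

```python
def total_score(firstname: float, lastname: float, date_cmp: float) -> float:
    """Row-wise sum of the per-rule scores for one candidate link."""
    return firstname + lastname + date_cmp

def is_match(firstname: float, lastname: float, date_cmp: float,
             threshold: float = 3.0) -> bool:
    # Assumption: a link is kept when its total reaches the threshold (>=).
    return total_score(firstname, lastname, date_cmp) >= threshold

# Hypothetical feature rows: (firstname, lastname, date_cmp).
rows = {
    "both names agree, date rule fails": (1.0, 1.0, 0.0),
    "date rule only": (0.0, 0.0, 2.0),
    "last name and date rule": (0.2, 1.0, 2.0),
    "everything agrees": (1.0, 1.0, 2.0),
}
for label, scores in rows.items():
    print(f"{label}: total={total_score(*scores):.1f} match={is_match(*scores)}")
```

Under these assumptions the two name rules together contribute at most 2, so no link can reach 3 unless the date rule fires and the names also agree to a total of at least 1.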

In [33]:
%%ag
linked_df = compare.get_match(3)

Sample Visualization of the linked PrivateDataFrame
![](https://content.antigranular.com/image/notebook_content/rl_linked_rlsand.png)

In [35]:
%%ag
# Submit the flight numbers from the linked records: the flights we should report as having carried a COVID-positive passenger.
res = linked_df[["l_flight_number"]]
x = submit_predictions(res)

score: {'leaderboard': 0.9318478284190648, 'logs': {'LIN_EPS': -0.002, 'MCC': 0.9338478284190648}}

