In [4]:
import antigranular as ag
session = ag.login(<client_id>,<client_secret>, competition = "Sandbox for Harvard Open DP Hackathon")

Dataset "Flight Company Dataset for Sandbox" loaded to the kernel as flight_company_dataset_for_sandbox
Dataset "Health Organisation Dataset for Sandbox" loaded to the kernel as health_organisation_dataset_for_sandbox
Connected to Antigranular server session id: 70af9603-1b85-4987-a0ec-dbd654fd4db2, the session will time out if idle for 60 minutes
Cell magic '%%ag' registered successfully, use `%%ag` in a notebook cell to execute your python code on Antigranular private python server


In [22]:
%%ag
health = health_organisation_dataset_for_sandbox
flight = flight_company_dataset_for_sandbox

In [2]:
ag.__version__

'0.2.9'

## Basic metadata analysis
Estimate the size of the datasets and inspect their column names and metadata.

In [23]:
%%ag
# Printing the column names using ag_print
ag_print(f"Flight Dataset \n {flight.columns} \n")
ag_print(f"Health Dataset \n {health.columns}")

Flight Dataset 
 Index(['flight_number', 'flight_date', 'flight_from', 'flight_to',
       'passenger_firstname', 'passenger_lastname', 'passenger_date_of_birth'],
      dtype='object') 

Health Dataset 
 Index(['patient_firstname', 'patient_lastname', 'patient_date_of_birth',
       'covidtest_date', 'covidtest_result', 'patient_address'],
      dtype='object')



In [24]:
%%ag
# Getting the differentially private count
ag_print(f"Total health records : {health['patient_firstname'].count(eps=0.1)}")
ag_print(f"Total flight records :  {flight['passenger_firstname'].count(eps=0.1)}")

Total health records : 59230

Total flight records :  39025
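
The `eps` argument sets the privacy budget spent on each query; smaller values mean more noise. How Antigranular implements `count` internally is not shown in this notebook, so the following is only a generic sketch of a Laplace-mechanism count, run locally on made-up data. The helper name `dp_count` and the toy records are hypothetical.

```python
import numpy as np

def dp_count(values, eps, rng):
    """Hypothetical helper: noisy count via the Laplace mechanism (illustration only)."""
    true_count = sum(v is not None for v in values)
    # Adding or removing one record changes a count by at most 1 (sensitivity 1),
    # so Laplace noise with scale 1/eps gives eps-differential privacy.
    noise = rng.laplace(loc=0.0, scale=1.0 / eps)
    return true_count + noise

rng = np.random.default_rng(seed=0)
toy_records = ["record"] * 1000          # made-up data, not the sandbox dataset
noisy = dp_count(toy_records, eps=0.1, rng=rng)
print(f"Noisy count: {noisy:.1f} (true count is 1000)")
```

In this sketch `eps=0.1` gives noise with scale 10, which is small relative to counts in the tens of thousands.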



To identify flights that may have carried a passenger who recently tested positive for COVID-19, we need to link the two datasets efficiently. One way to do this is with the `recordlinkage` library.

In [25]:
%%ag
# Remove the health records of patients who tested negative.
health['covidtest_result'] = health['covidtest_result'].where(health['covidtest_result'] == 'positive')
health = health.dropna()
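
The filter above works in two steps: `where` blanks out every result that is not `'positive'`, and `dropna` then discards the rows that contain a blank. A minimal sketch of the same pattern in plain pandas, on a made-up three-row frame (the private dataframe is assumed to mirror this pandas behaviour):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "patient_lastname": ["Fano", "Juhn", "Immu"],          # made-up rows
        "covidtest_result": ["positive", "negative", "positive"],
    }
)

# Values failing the condition become NaN; matching values are kept unchanged.
toy["covidtest_result"] = toy["covidtest_result"].where(toy["covidtest_result"] == "positive")

# dropna() removes every row that has a NaN in any column.
positives = toy.dropna()
print(positives)
```

Because `dropna()` looks at all columns, a row with a missing value elsewhere (for example a missing address) would be dropped as well.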

In [112]:
health._df

Unnamed: 0,patient_firstname,patient_lastname,patient_date_of_birth,covidtest_date,covidtest_result,patient_address
163,Jola,Fano,1987-02-16,2020-01-18,positive,"Apartment 211, Beach Complex, Rome"
164,Jola,Fano,1987-02-16,2020-02-05,positive,"Building 433, Hill Alley, Tokyo"
194,Zor Fano,Juhn,2000-11-17,2020-01-03,positive,"House 203, Green Street, Tokyo"
195,Zor Fano,Juhn,2000-11-17,2020-01-13,positive,"House 447, Bond Road, Pretoria"
196,Zor Fano,Juhn,2000-11-17,2020-01-27,positive,"Apartment 72, Eagle Close, London"
...,...,...,...,...,...,...
59080,Gos,Immu,1967-07-01,2020-01-01,positive,"Office 193, Station Close, Tokyo"
59081,Gos,Immu,1967-07-01,2020-01-14,positive,"Apartment 158, Bond Street, London"
59139,Wolm Dapi,Simo,1990-11-06,2020-01-08,positive,"House 301, Bond Corner, Brasilia"
59140,Wolm Dapi,Simo,1990-11-06,2020-01-26,positive,"Building 186, President Road, Tokyo"


In [26]:
%%ag
ag_print(f"Total covid positive health records : {health['patient_firstname'].count(eps=0.1)}")

Total covid positive health records : 1291



When using the `recordlinkage` library, index both datasets on a column whose values are most likely to agree across them. If you index on columns that do not correspond, the number of candidate pairs in the generated MultiIndex can become very large.

In [27]:
%%ag
import op_recordlinkage as rl
# A full indexing is a complete cartesian product
indexer  = rl.Index()
indexer.full()
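
To see why a full index is expensive, the sketch below builds the cartesian product of two tiny, made-up frames with plain pandas. It only illustrates the idea of pairing every record with every other record and is not the library's implementation.

```python
import pandas as pd

left = pd.DataFrame({"name": ["a", "b", "c", "d"]})   # 4 toy records
right = pd.DataFrame({"name": ["w", "x", "y"]})       # 3 toy records

# Every left record is paired with every right record.
full_pairs = pd.MultiIndex.from_product([left.index, right.index])
print(len(full_pairs))   # number of pairs is len(left) * len(right)
```

At the sizes reported above (roughly 39,000 flight records and 1,300 positive health records), a full index would hold on the order of 50 million pairs, which is why a narrower indexing rule is preferable.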




Let's index both datasets on date of birth. Make sure the dates in both datasets use the same format before indexing on them.

**Click [here](https://recordlinkage.readthedocs.io/en/latest/ref-index.html) to learn about more indexing algorithms.**

We currently support:
- Full
- Block
- SortedNeighbor
- Random
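
As a rough illustration of blocking on date of birth, the sketch below normalises two differently formatted date columns to one format and then pairs only the records whose keys agree. It emulates blocking with a plain pandas merge; the toy rows, the `dd/mm/YYYY` input format, and the `dob_key` column are all hypothetical.

```python
import pandas as pd

flights = pd.DataFrame(
    {"passenger": ["Dina Thum", "Wolm Fano", "Wolm Simo"],
     "passenger_date_of_birth": ["30/03/1978", "13/12/1996", "06/11/1990"]}
)
patients = pd.DataFrame(
    {"patient": ["Dina Anin Thum", "Wolm Dapi Simo", "Nymo Linka Kun"],
     "patient_date_of_birth": ["1978-03-30", "1990-11-06", "2005-06-15"]}
)

# Bring both columns to the same YYYY-MM-DD representation before blocking.
flights["dob_key"] = pd.to_datetime(
    flights["passenger_date_of_birth"], format="%d/%m/%Y"
).dt.strftime("%Y-%m-%d")
patients["dob_key"] = pd.to_datetime(
    patients["patient_date_of_birth"], format="%Y-%m-%d"
).dt.strftime("%Y-%m-%d")

# Blocking: keep only the pairs whose keys are identical.
pairs = flights.reset_index().merge(
    patients.reset_index(), on="dob_key", suffixes=("_flight", "_patient")
)
candidate = pd.MultiIndex.from_frame(pairs[["index_flight", "index_patient"]])
print(list(candidate))
```

Without the normalisation step, none of the keys would compare equal and the block would be empty.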


In [28]:
%%ag
import op_recordlinkage as rl
indexer = rl.Index()
indexer.block('passenger_date_of_birth','patient_date_of_birth')
candidate_links = indexer.index(flight,health)

In [29]:
%%ag
# total number of links based on this indexing choice.
ag_print(candidate_links.count(eps=0.1))

6097



Once the links are formed, we can define compare rules to refine the linking process. Each rule can carry its own weight.
  - Fuzzy-match the passenger's first and last names. The last-name comparison is thresholded, so it scores either 1 or 0.
  - Reward links where the positive COVID test is dated no more than 14 days after the flight (weight = 2).
 
**Click [here](https://recordlinkage.readthedocs.io/en/latest/ref-compare.html#recordlinkage.Compare) to learn about more compare rules.**

We currently support:
- String
- Numeric
- Exact
- Geo
- Date
- Custom Compares
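
The fuzzy name matching in this notebook uses the Jaro-Winkler similarity (`method='jarowinkler'`). The sketch below is a hand-written version of the textbook formula, included only to show how such a score arises and how a threshold turns it into 1 or 0. It is not the library's code, and the function names are hypothetical. Some implementations add refinements, such as applying the prefix bonus only above a minimum Jaro score, which this sketch omits.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity (illustrative sketch)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they sit within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro score boosted by the length of the common prefix (up to 4 characters)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)

def thresholded(score: float, threshold: float) -> float:
    """Turn a similarity into 1.0 or 0.0, as a thresholded comparison does."""
    return 1.0 if score >= threshold else 0.0

raw = jaro_winkler("WolmDapi", "Wolm Dapi")
print(round(raw, 6), thresholded(raw, 0.9))
print(thresholded(jaro_winkler("Dina", "Dina Anin"), 0.9))
```

A missing space barely lowers the score, while a missing middle name lowers it enough to fall below a 0.9 threshold.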
    

In [30]:
%%ag
compare = rl.Compare()

# Built-in fuzzy string matching. The threshold on the last name turns its score into 1 or 0, giving stronger links.
compare.string("passenger_firstname" , "patient_firstname" ,method='jarowinkler', label="firstname")
compare.string("passenger_lastname","patient_lastname" ,method='jarowinkler', threshold=0.9, label="lastname")


# Using a custom compare rule.
from datetime import datetime
def cmp(date_str1 , date_str2):
    # Convert date strings to datetime objects
    date1 = datetime.strptime(date_str1, "%Y-%m-%d")
    date2 = datetime.strptime(date_str2, "%Y-%m-%d")
    
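    # Signed difference in days: positive when date_str2 is later than date_str1
    days_apart = (date2 - date1).days
    # Score 2 when date_str2 is no more than 14 days after date_str1.
    # The difference is not made absolute, so any earlier date_str2 also scores 2.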
    if days_apart <= 14:
        return 2
    else:
        return 0

compare.custom(cmp,"flight_date","covidtest_date",label="date_cmp")
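
To check how the date rule behaves at its edges, it can be exercised locally on a few date pairs. The sketch below uses a hypothetical copy of the rule named `date_rule` and assumes the custom compare passes the flight date first and the test date second, following the column order given above. The first two date pairs are taken from the sample linked rows in this notebook; the third is made up.

```python
from datetime import datetime

def date_rule(flight_date: str, test_date: str) -> int:
    """Hypothetical copy of the notebook's `cmp` rule, for local experimentation."""
    flight = datetime.strptime(flight_date, "%Y-%m-%d")
    test = datetime.strptime(test_date, "%Y-%m-%d")
    days_apart = (test - flight).days   # signed: negative when the test precedes the flight
    return 2 if days_apart <= 14 else 0

print(date_rule("2020-01-19", "2020-01-27"))   # test 8 days after the flight
print(date_rule("2020-01-11", "2020-01-07"))   # test 4 days before the flight
print(date_rule("2020-01-01", "2020-01-27"))   # test 26 days after the flight
```

If only tests within 14 days on either side of the flight should score, the check would be `abs(days_apart) <= 14`; that is a different rule from the one used in this notebook.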


In [31]:
%%ag
features = compare.compute(candidate_links,flight,health)

In [148]:
# Sample demonstration of what a features matrix looks like.
# Since it is a PrivateDataFrame, you need to apply DP mechanisms to retrieve useful information.

Unnamed: 0,Unnamed: 1,firstname,lastname,date_cmp
9,48703,0.000000,1.0,2
28227,48703,0.412037,0.0,2
32066,48703,0.888889,1.0,0
32067,48703,0.888889,1.0,2
32068,48703,0.888889,1.0,2
...,...,...,...,...
38654,58656,1.000000,1.0,2
38970,59139,0.977778,1.0,2
38970,59140,0.977778,1.0,2
39011,59207,0.975000,1.0,2


In [32]:
%%ag
# Find the average matching weight obtained from the compare rules we set.
ag_print(f"Average weight : {features.sum(axis=1).mean(eps=0.1)}")

Average weight : 2.384529685884475



We choose a threshold of 3 for linking, based on the average obtained above. The custom compare rule was given a weight of 2 to prioritize it over the first-name and last-name matches. Since the maximum possible total is 4 (1 + 1 + 2), a threshold of 3 keeps only strong links under the compare rules we have set.
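
The effect of the threshold can be sketched on a few rows copied from the sample features matrix shown earlier. The plain pandas frame `features_toy` is a hypothetical stand-in for the private features matrix, and treating the boundary as inclusive (`>=`) is an assumption about how the threshold is applied.

```python
import pandas as pd

features_toy = pd.DataFrame(
    {
        "firstname": [0.412037, 0.888889, 0.888889, 1.0],
        "lastname": [0.0, 1.0, 1.0, 1.0],
        "date_cmp": [2, 0, 2, 2],
    }
)

# Total weight of each candidate link is the row-wise sum of its scores.
total = features_toy.sum(axis=1)

threshold = 3
matches = features_toy[total >= threshold]   # assumption: boundary is inclusive
print(total.round(3).tolist())
print(matches.index.tolist())
```

Only the rows where both the last name and the date rule agree clear the threshold; a strong name match alone does not, because the date rule contributes half of the maximum score.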

In [33]:
%%ag
linked_df = compare.get_match(3)

In [153]:
# Sample visualization of how the linked_dataframe internally looks.

Unnamed: 0,l_flight_number,l_flight_date,l_flight_from,l_flight_to,l_passenger_firstname,l_passenger_lastname,l_passenger_date_of_birth,r_patient_firstname,r_patient_lastname,r_patient_date_of_birth,r_covidtest_date,r_covidtest_result,r_patient_address
0,CHI-ROM-0019,2020-01-19,Chicago,Rome,Nymo,Thum,1978-03-30,Dina Anin,Thum,1978-03-30,2020-01-27,positive,"House 401, Eagle Alley, London"
1,CHI-TOK-0018,2020-01-18,Chicago,Tokyo,Dina,Thum,1978-03-30,Dina Anin,Thum,1978-03-30,2020-01-27,positive,"House 401, Eagle Alley, London"
2,TOK-LON-0024,2020-01-24,Tokyo,London,Dina,Thum,1978-03-30,Dina Anin,Thum,1978-03-30,2020-01-27,positive,"House 401, Eagle Alley, London"
3,PRE-ROM-0013,2020-01-13,Pretoria,Rome,WolmUlna,Fano,1996-12-13,Wolm,Fano,1996-12-13,2020-01-22,positive,"House 255, Newton Corner, Pretoria"
4,ROM-PRE-0020,2020-01-20,Rome,Pretoria,WolmUlna,Fano,1996-12-13,Wolm,Fano,1996-12-13,2020-01-22,positive,"House 255, Newton Corner, Pretoria"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1289,LON-SYD-0011,2020-01-11,London,Sydney,Nymo Linka,Kun,2005-06-15,Nymo Linka,Kun,2005-06-15,2020-01-07,positive,"Flat 346, Moon Road, London"
1290,SYD-ROM-0014,2020-01-14,Sydney,Rome,Nymo Linka,Kun,2005-06-15,Nymo Linka,Kun,2005-06-15,2020-01-07,positive,"Flat 346, Moon Road, London"
1291,BRA-TOK-0019,2020-01-19,Brasilia,Tokyo,WolmDapi,Simo,1990-11-06,Wolm Dapi,Simo,1990-11-06,2020-01-26,positive,"Building 186, President Road, Tokyo"
1292,BRA-LON-0014,2020-01-14,Brasilia,London,XaimEvi,Fink,1979-05-10,Xaim Evi,Fink,1979-05-10,2020-01-26,positive,"House 70, Queen Alley, Brasilia"


In [35]:
%%ag
# Submit the column of flight numbers that should be reported as carrying a COVID-positive passenger.
res = linked_df[["l_flight_number"]]
x = submit_predictions(res)

score: {'leaderboard': 0.9318478284190648, 'logs': {'LIN_EPS': -0.002, 'MCC': 0.9338478284190648}}

