# Antigranular Harvard OpenDP Hackathon

Submission by:
```
  Adarsh Gupta
```

## Table Of Contents:
* [Getting Started](#getting_started)
* [Data Preprocessing](#preprocessing)
* [Setting Indexing Rule](#indexing)
* [Comparing the Records](#comparing)
* [Linking Datasets](#linking)
* [Submission](#sub)

NOTE: This notebook contains only the steps that produced my best submission. I tried many other things that did not lead to good results.

## 0. Getting Started: Installation, Imports & Connect to Antigranular <a class="anchor" id="getting_started"></a>


In [None]:
#Installing Antigranular in quiet mode (-q)
!pip install antigranular -q

In [None]:
#Logging into Antigranular
import antigranular as ag
from google.colab import userdata #To get secrets
session = ag.login(<client_id>, <client_secret>, competition = "Harvard OpenDP Hackathon")

Dataset "Flight Company Dataset" loaded to the kernel as flight_company_dataset

Dataset "Health Organisation Dataset" loaded to the kernel as health_organisation_dataset

Connected to Antigranular server session id: 4a3265ae-eb94-4f79-a8bc-90b9948fcc81, the session will time out if idle for 25 minutes
Cell magic '%%ag' registered successfully, use `%%ag` in a notebook cell to execute your python code on Antigranular private python server
🚀 Everything's set up and ready to roll!


In [None]:
%%ag
#Useful for creating candidate links and comparing records
import op_recordlinkage as rl

In [None]:
%%ag
#Setting up aliases for easier access
health = health_organisation_dataset
flight = flight_company_dataset

## 1. Data Preprocessing <a class="anchor" id="preprocessing"></a>

 **The first step before we begin indexing and setting up compare rules is to clean up both the datasets as much as we can, given the nuances that we are aware of.**

### 1.1. Removing Negative Covid Results & Missing Values

We will only consider the records of passengers who tested positive for COVID-19.

In [None]:
%%ag
# To remove passengers who tested negative: there are only two result categories, so we keep the value where
# covidtest_result is 'positive' and set it to NA everywhere else; the NA rows are dropped in the next cell
health['covidtest_result'] = health['covidtest_result'].where(health['covidtest_result'] == 'positive')

In [None]:
%%ag
# Remove all rows with NA (missing or unknown) values; this includes the non-positive results set to NA above
health = health.dropna()
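
The two cells above follow a "mask, then drop" pattern. A minimal plain-Python sketch of the same idea, using made-up rows and `None` as a stand-in for NA (this is an analogue for illustration, not the Antigranular API):

```python
# Toy rows invented for illustration; None plays the role of NA.
rows = [
    {"name": "Nok", "covidtest_result": "positive"},
    {"name": "Ana", "covidtest_result": "negative"},
    {"name": None, "covidtest_result": "positive"},
]

# Step 1 (analogue of .where): keep the result only where it is 'positive'.
masked = [
    {**r, "covidtest_result": r["covidtest_result"] if r["covidtest_result"] == "positive" else None}
    for r in rows
]

# Step 2 (analogue of .dropna): drop every row that has any missing value.
kept = [r for r in masked if all(v is not None for v in r.values())]

print(kept)
```

Note that the second step also drops positive records that are missing some other field, as the third toy row shows.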

### 1.2. First and middle name

From the information given about the datasets, it is evident that first names come in various formats and that the first and middle names are sometimes recorded together.

We handle such cases by keeping only the first name.

In [None]:
%%ag
# Defining a function to separate the first and middle name
def seperate_first_middle_name(name: str) -> str:
    """
    Extracts the first name from a given full name.

    Parameters
    ----------
    name : str
        A string representing a full name.

    Returns
    -------
    str
        The extracted first name.

    Notes
    -----
    This function assumes that the first word in the provided name represents the first name. In cases where the name
    is comprised of multiple words or a continuous string of characters, the function identifies the first name
    by detecting capital letters. If the first word contains more than one capital letter, the first name is considered to
    conclude before the second capital letter.

    Examples
    --------
    >>> seperate_first_middle_name("Nok Vanu")
    'Nok'

    >>> seperate_first_middle_name("NokVanu")
    'Nok'

    >>> seperate_first_middle_name("Nok")
    'Nok'
    'Nok'
    """

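    # Find the positions of all capital letters
    capital_letters = [i for i, c in enumerate(name) if c.isupper()]

    # If more than one capital letter is present, assume the first name ends before the second capital letter.
    # strip() drops the space that separates "Nok Vanu", so the result is 'Nok' rather than 'Nok '.
    if len(capital_letters) > 1:
        second_cap_index = capital_letters[1]
        return name[:second_cap_index].strip()

    # With at most one capital letter, take the first whitespace-separated word
    # (or the unchanged input if it contains no words at all)
    parts = name.split()
    return parts[0] if parts else name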

In [None]:
%%ag
# Applying the first and middle name cleaning function to the first name column of both dataframes
health["patient_firstname"] = health[["patient_firstname"]].applymap(seperate_first_middle_name, eps=0)
flight["passenger_firstname"] = flight[["passenger_firstname"]].applymap(seperate_first_middle_name, eps=0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.df[key] = value._df
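
The capital-letter rule can be checked locally, outside the `%%ag` environment. The sketch below uses a hypothetical helper `first_name_only` (not part of the notebook) that applies the same rule and strips surrounding whitespace. It also shows a limitation of the rule: a name written entirely in capitals is cut down to its first letter.

```python
def first_name_only(name: str) -> str:
    """Hypothetical stand-alone version of the capital-letter rule."""
    caps = [i for i, c in enumerate(name) if c.isupper()]
    if len(caps) > 1:
        # Cut before the second capital letter and drop any trailing space.
        return name[:caps[1]].strip()
    parts = name.split()
    return parts[0] if parts else ""

for raw in ["Nok Vanu", "NokVanu", "Nok", "NOK"]:
    print(repr(raw), "->", repr(first_name_only(raw)))
```

If all-caps first names occur in the data, they would need separate handling before this rule is applied.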



### 1.3 Phone Number

We will keep the last 6 digits of every phone number and remove all dashes, spaces and other non-digit characters.

In [None]:
%%ag
import re

def clean_phone_num(phone_num: str) -> str:
    """
    Cleans and extracts the last 6 digits from a phone number string using regular expressions.

    Parameters
    ----------
    phone_num : str
        A string representing a phone number.

    Returns
    -------
    last_six_digits: str
        The last six digits of the cleaned phone number.

    Examples
    --------
    >>> clean_phone_num(" +1-343-343-9900")
    '439900'

    >>> clean_phone_num(" 0091 992 992 9900")
    '929900'
    """

    # Use regular expression to find all digits in the input string
    digits = re.findall(r'\d', phone_num)

    # Extract and return the last 6 digits
    last_six_digits = ''.join(digits[-6:])

    return last_six_digits

In [None]:
%%ag
# Apply the map for cleaning the phone num to the appropriate columns
health["patient_phone_number"] = health[["patient_phone_number"]].applymap(clean_phone_num, eps=0)
flight["passenger_phone_number"] = flight[["passenger_phone_number"]].applymap(clean_phone_num, eps=0)

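
To see why the last six digits work as a blocking key, here is a small local sketch. The helper name `last_six_digits` and the sample numbers are assumptions made for illustration. Differently formatted versions of one number collapse to the same key.

```python
import re

def last_six_digits(phone_num: str) -> str:
    """Hypothetical helper: keep digits only, then take the last six."""
    return "".join(re.findall(r"\d", phone_num)[-6:])

# Three made-up spellings of the same number.
variants = ["+1-343-343-9900", "1 343 343 9900", "(343) 343-9900"]
keys = {last_six_digits(v) for v in variants}

print(keys)
```

The trade-off is that two different numbers sharing their last six digits also get the same key, so this key is best combined with another field, as the blocking step in section 2 does.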



### 1.4. Last name

Standardize the last name for blocking by capitalizing the first letter and lowercasing the rest.

In [None]:
%%ag
def capitilise_last_name(name: str) -> str:
    """
    Standardizes the last name by capitalizing the first letter and lowercasing the rest.

    Parameters
    ----------
    name : str
        A string representing a last name.

    Returns
    -------
    str
        The standardized last name.

    Examples
    --------
    >>> capitilise_last_name("VANU")
    'Vanu'
    """
    return name.capitalize()

In [None]:
%%ag
# Applying the map that cleans the last name on the appropriate column
health["patient_lastname"] = health[["patient_lastname"]].applymap(capitilise_last_name, eps=0)
flight["passenger_lastname"] = flight[["passenger_lastname"]].applymap(capitilise_last_name, eps=0)

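
A short local illustration of what `str.capitalize()` does to a few made-up surnames. It lowercases everything after the first character, so internal capitals are lost, but every spelling of the same surname ends up identical, which is what an exact blocking key needs.

```python
# Sample surnames invented for illustration.
samples = ["VANU", "vanu", "McDonald", "O'NEIL"]
standardized = [s.capitalize() for s in samples]

print(standardized)
```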



### 1.5. Email ID

As with the phone number, we keep only part of the email ID: the first six characters, lowercased to account for entry errors.

The email ID will be used for fuzzy matching later on.

In [None]:
%%ag
def email_id_standard(email_id: str) -> str:
    """
    Standardizes an email ID by extracting the first 6 characters in lowercase, if the total length is greater than or equal to 6.

    Parameters
    ----------
    email_id : str
        A string representing an email ID.

    Returns
    -------
    str
        The standardized email ID: the first 6 characters in lowercase if the total length is sufficient; otherwise, the whole email ID in lowercase.

    Examples
    --------
    >>> email_id_standard("blue_daisy_345@htmail.co.uk")
    'blue_d'

    >>> email_id_standard("nok_jam_nok@gamil.com")
    'nok_ja'
    """

    # Check if the length of the email ID is greater than or equal to 6
    if len(email_id) >= 6:
        # Extract the first 6 characters and convert to lowercase and return them
        result = email_id[:6].lower()
        return result

    # If the length is less than 6, return the original email ID by lowercasing
    return email_id.lower()

In [None]:
%%ag
# Applying the process of standardizing email id to the appropriate columns
health["patient_email_address"] = health[["patient_email_address"]].applymap(email_id_standard, eps=0)
flight["passenger_email_address"] = flight[["passenger_email_address"]].applymap(email_id_standard, eps=0)

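
A local sketch of the prefix rule, with a hypothetical helper `email_prefix` and made-up addresses. Python slicing already copes with strings shorter than six characters, so the sketch needs no length check. Typos in the domain part no longer matter, while very short addresses keep part of the domain in the prefix.

```python
def email_prefix(email_id: str, length: int = 6) -> str:
    """Hypothetical helper: lowercase prefix of the address."""
    return email_id[:length].lower()

# Same mailbox, one with a capitalised start and a misspelt domain.
a = email_prefix("Blue_Daisy_345@hotmail.co.uk")
b = email_prefix("blue_daisy_345@htmail.co.uk")

# A short local part: the prefix runs into the domain.
c = email_prefix("ab@x.io")

print(a, b, c)
```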



### 1.6. Date of birth

As mentioned in the dataset description, dates of birth appear in many formats, such as `01.Sep.1990`, `1 September 90` and `1-Sept-1990`.

We will standardize them as far as possible so they can be compared later.

In [None]:
%%ag
import datetime

def standardize_date(date_str: str) -> str:
    """
    Converts dates from various formats to a standardized 'YYYY-MM-DD' format.

    Supported formats:
    - 'DD.MMM.YYYY'
    - 'DD MMMM YY'
    - 'DD-MMMM-YYYY'

    Parameters
    ----------
    date_str : str
        A string representing a date.

    Returns
    -------
    str
        The date in 'YYYY-MM-DD' format.

    Examples
    --------
    >>> standardize_date('01.Sep.1990')
    '1990-09-01'

    >>> standardize_date('1 September 90')
    '1990-09-01'

    >>> standardize_date('1-Sept-1990')
    '1990-09-01'
    """

    # Replace the supported separators with a space so the string can be split on whitespace
    for separator in ['.', '/', '-']:
        date_str = date_str.replace(separator, ' ')

    # Split the date into day, month and year
    day, month_str, year = date_str.split()

    # Expand two-digit years: 00-23 are read as 20xx, everything else as 19xx
    if len(year) == 2:
        year = '20' + year if int(year) <= 23 else '19' + year

    # Convert the month name to a number using its first three letters ("Sept" and "September" become "Sep")
    month_str = month_str[:3]
    month = datetime.datetime.strptime(month_str, "%b").month
    month_formatted = f"{month:02d}"  # Two-digit month

    day_formatted = f"{int(day):02d}"  # Two-digit day

    # Rebuild the date string in 'YYYY-MM-DD' format
    output_str = f"{year}-{month_formatted}-{day_formatted}"
    return output_str

In [None]:
%%ag
# Applying standardize date function to the appropriate date columns which require the transformation
health["patient_date_of_birth"] = health[["patient_date_of_birth"]].applymap(standardize_date, eps=0)
flight["passenger_date_of_birth"] = flight[["passenger_date_of_birth"]].applymap(standardize_date, eps=0)

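
The date rule can also be tried locally. The sketch below is a stand-alone variant, not the notebook's function: it swaps `strptime` for a month lookup table so it does not depend on the locale. The helper name `to_iso` and the `pivot` parameter are hypothetical. It makes the two-digit-year rule explicit: years up to the pivot are read as 20xx, the rest as 19xx.

```python
MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}

def to_iso(date_str: str, pivot: int = 23) -> str:
    """Hypothetical helper: 'day month year' in several spellings -> 'YYYY-MM-DD'."""
    for sep in ".-/":
        date_str = date_str.replace(sep, " ")
    day, month_str, year = date_str.split()
    if len(year) == 2:
        year = ("20" if int(year) <= pivot else "19") + year
    month = MONTHS[month_str[:3].lower()]
    return f"{year}-{month:02d}-{int(day):02d}"

for raw in ["01.Sep.1990", "1 September 90", "1-Sept-1990", "5/Jan/05"]:
    print(raw, "->", to_iso(raw))
```

A fixed pivot is a simplification: a two-digit year such as `05` is ambiguous for dates of birth, and any record whose date does not split into exactly three parts raises an error.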



**Now that preprocessing is done, we set up indexing rules to create candidate links, then create the final links using fuzzy matching and date matching.**

## 2. Setting Indexing Rule <a class="anchor" id="indexing"></a>

In [None]:
%%ag
#Creating an indexer that creates candidate links.
indexer = rl.Index()

#Blocking on lastname and phone number to create candidate links.
indexer.block(["passenger_lastname", "passenger_phone_number"],
              ["patient_lastname", "patient_phone_number"])

#Creating the candidate links based on the blocks.
candidate_links = indexer.index(flight, health)

#Total number of candidate links for this indexing choice.
ag_print("Number of candidate links", candidate_links.count(eps=0.1))

Number of candidate links 3310



**Considering the size of the datasets, this is an appropriate number of candidate links to proceed with linking the datasets.**
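
Conceptually, blocking only pairs up records that agree exactly on the blocking keys, instead of comparing every flight record with every health record. A minimal plain-Python sketch of that idea follows. The toy records and the helper `block_pairs` are invented for illustration and do not reflect how `op_recordlinkage` works internally.

```python
from collections import defaultdict

# Toy records invented for illustration.
flights = [
    {"id": "f1", "lastname": "Vanu", "phone": "439900"},
    {"id": "f2", "lastname": "Smith", "phone": "120045"},
    {"id": "f3", "lastname": "Vanu", "phone": "777777"},
]
patients = [
    {"id": "h1", "lastname": "Vanu", "phone": "439900"},
    {"id": "h2", "lastname": "Smith", "phone": "999999"},
]

def block_pairs(left, right, keys):
    """Return (left_id, right_id) pairs that agree on every blocking key."""
    buckets = defaultdict(list)
    for rec in right:
        buckets[tuple(rec[k] for k in keys)].append(rec["id"])
    return [
        (rec["id"], right_id)
        for rec in left
        for right_id in buckets.get(tuple(rec[k] for k in keys), [])
    ]

candidate_pairs = block_pairs(flights, patients, ["lastname", "phone"])
print(candidate_pairs, "out of", len(flights) * len(patients), "possible pairs")
```

Because the match on the keys is exact, the cleaning of last names and phone numbers in section 1 directly determines which true pairs survive this step.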

## 3. Comparing records <a class="anchor" id="comparing"></a>

After making the candidate links, we need to set compare rules to create the final links.

We will use the following matching rules:
  - Fuzzy match the first name (Jaro-Winkler) with default weights
  - Fuzzy match the email ID (Damerau-Levenshtein) with default weights
  - Fuzzy match the date of birth (`lcs`, longest common substring) with default weights
  - Compare the flight date and the test date with a custom compare function that gives a weight of 2 when they are within 14 days of each other

The fuzzy matching method for each field was chosen to suit its data type.
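
To give a feel for what an edit-distance comparison measures, here is a minimal sketch of the optimal string alignment variant of Damerau-Levenshtein distance, where an adjacent swap counts as one edit. The normalisation to a 0 to 1 similarity (one minus distance over the longer length) is an assumption for illustration; the library's exact scoring may differ.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance with insert, delete, substitute and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, lower for more edits."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - osa_distance(a, b) / longest

print(osa_distance("nok_ja", "nko_ja"), round(similarity("nok_ja", "nko_ja"), 3))
```

Swapped neighbouring characters are a common typing error, which is why a transposition-aware distance is a reasonable fit for the email prefix.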

In [None]:
%%ag
import datetime

#Creating a comparer to refine linkings
comparer = rl.Compare()

# Adding inbuilt fuzzy string matching functions to the comparer
comparer.string("passenger_firstname" , "patient_firstname" ,method='jarowinkler', label="firstname")
comparer.string("passenger_email_address" , "patient_email_address" ,method='damerau_levenshtein', label="email_id")
comparer.string("passenger_date_of_birth" , "patient_date_of_birth" ,method='lcs', label="dob")

# Custom compare rule for the flight date and the covid test date.
def cmp(date_str1: str , date_str2: str ) -> int:
    # Converting standardized date strings to datetime objects
    date1 = datetime.datetime.strptime(date_str1, "%Y-%m-%d")
    date2 = datetime.datetime.strptime(date_str2, "%Y-%m-%d")

    # Calculating the signed difference in days (positive when date_str2 is later)
    days_apart = (date2 - date1).days

    # Checking if the dates are within two weeks of each other, in either direction
    if -14 <= days_apart <= 14:
        # Giving a weight of 2 to a date match
        return 2
    else:
        return 0

# Adding the custom compare function we made
comparer.custom(cmp, "flight_date", "covidtest_date", label="date_cmp")

In [None]:
%%ag
# Calculating the feature matrix based on the candidate links and compare rules
features = comparer.compute(candidate_links,flight,health)

# Finding the average matching weights obtained based on the compare rules we set.
ag_print(f"Average weight : {features.sum(axis=1).mean(eps=0.5)}")

Average weight : 2.971358562761492



## 4. Linking datasets <a class="anchor" id="linking"></a>

Finally it's time to link the two datasets together using the comparer!

In [None]:
%%ag
# Using a total weight threshold of 3 to link the two datasets
linked_df = comparer.get_match(3)
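
A rough sketch of how the threshold interacts with the weights. It assumes that each string comparison yields a score between 0 and 1, that the date rule yields 0 or 2, and that a pair is linked when the row sum is at least the threshold. These are assumptions about the library's behaviour, and the scores below are invented.

```python
# Invented feature scores for three candidate pairs.
pairs = {
    ("f1", "h1"): {"firstname": 1.0, "email_id": 1.0, "dob": 1.0, "date_cmp": 2},
    ("f2", "h2"): {"firstname": 0.9, "email_id": 0.5, "dob": 1.0, "date_cmp": 0},
    ("f3", "h3"): {"firstname": 0.4, "email_id": 0.3, "dob": 0.5, "date_cmp": 2},
}
THRESHOLD = 3

# Sum the weights per pair and keep pairs that reach the threshold (assumed >=).
totals = {pair: sum(scores.values()) for pair, scores in pairs.items()}
matches = [pair for pair, total in totals.items() if total >= THRESHOLD]

print(matches)
```

Under these assumptions the date rule dominates: a pair inside the 14-day window needs string scores summing to only 1, while a pair outside it can reach 3 only with three perfect string matches.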

## 5. Submission <a class="anchor" id="sub"></a>

In [None]:
%%ag
# Submitting the column of flight numbers that should be reported as having carried a COVID-positive passenger.
res = linked_df[["l_flight_number"]]
x = submit_predictions(res)

score: {'leaderboard': 0.8175404324091969, 'logs': {'LIN_EPS': -0.0055000000000000005, 'MCC': 0.8230404324091969}}
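
Reading the score line: the reported numbers are consistent with the leaderboard score being the MCC plus the negative `LIN_EPS` term. Treating `LIN_EPS` as a penalty for the privacy budget spent is an interpretation drawn from these numbers, not from the competition rules. A quick arithmetic check:

```python
# Values copied from the submission output above.
score = {
    "leaderboard": 0.8175404324091969,
    "logs": {"LIN_EPS": -0.0055000000000000005, "MCC": 0.8230404324091969},
}

logs = score["logs"]
reconstructed = logs["MCC"] + logs["LIN_EPS"]
gap = abs(reconstructed - score["leaderboard"])

print(f"MCC + LIN_EPS = {reconstructed:.6f}, leaderboard = {score['leaderboard']:.6f}")
```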

