# Jaccard Coefficient Calculation for Pathological Test Results

This notebook demonstrates how to calculate the Jaccard coefficient, reported here in its dissimilarity form (1 minus the similarity), for pairs of individuals based on their pathological test results.

## Data Preprocessing
To calculate the Jaccard coefficient, we first convert the asymmetric variables to binary values:
- Y & P = 1 (Present/Yes)
- N & A = 0 (Absent/No)

Note: Gender is a symmetric variable (male and female carry the same weight), so it is left unconverted and excluded from the calculation.

In [1]:
import pandas as pd
import numpy as np

# Data
data = {
    'Name': ['Jack', 'Mary', 'Jim'],
    'Gender': ['M', 'F', 'M'],
    'Fever': ['Y', 'Y', 'Y'],
    'Cough': ['N', 'N', 'P'],
    'Test-1': ['P', 'P', 'N'],
    'Test-2': ['N', 'A', 'N'],
    'Test-3': ['N', 'P', 'N'],
    'Test-4': ['A', 'N', 'A']
}

df = pd.DataFrame(data)
df.set_index('Name', inplace=True)
print("Original Data:")
display(df)

Original Data:


| Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4 |
|------|--------|-------|-------|--------|--------|--------|--------|
| Jack | M | Y | N | P | N | N | A |
| Mary | F | Y | N | P | A | P | N |
| Jim  | M | Y | P | N | N | N | A |


## Binary Conversion
Convert the data to binary format where:
- Y (Yes) and P (Present) are converted to 1
- N (No) and A (Absent) are converted to 0
- Gender remains unchanged as it's a symmetric variable

In [2]:
def convert_to_binary(val):
    return 1 if val in ['Y', 'P'] else 0

# Create binary dataframe (excluding Gender)
binary_df = df.copy()
for col in binary_df.columns:
    if col != 'Gender':
        binary_df[col] = binary_df[col].apply(convert_to_binary)

print("Binary Converted Data:")
display(binary_df)

Binary Converted Data:


| Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4 |
|------|--------|-------|-------|--------|--------|--------|--------|
| Jack | M | 1 | 0 | 1 | 0 | 0 | 0 |
| Mary | F | 1 | 0 | 1 | 0 | 1 | 0 |
| Jim  | M | 1 | 1 | 0 | 0 | 0 | 0 |
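
The conversion above maps every value other than Y or P to 0, so a typo or a missing entry would silently count as "absent". As a minimal sketch of a stricter alternative, the snippet below uses an explicit lookup table and raises on unknown codes. The names `BINARY_CODES` and `to_binary_strict` are hypothetical and not part of the original notebook.

```python
# Hypothetical strict mapping: only the four documented codes are accepted.
BINARY_CODES = {"Y": 1, "P": 1, "N": 0, "A": 0}


def to_binary_strict(value):
    """Map a documented code to 0/1, raising ValueError for anything else."""
    try:
        return BINARY_CODES[value]
    except KeyError:
        raise ValueError(f"Unexpected code: {value!r}") from None


# Jack's asymmetric attributes in column order:
# Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = ["Y", "N", "P", "N", "N", "A"]
jack_binary = [to_binary_strict(v) for v in jack]
print(jack_binary)  # [1, 0, 1, 0, 0, 0]
```

Whether to fail loudly or default to 0 depends on how clean the input is expected to be; for a small hand-entered table, failing loudly is usually the safer choice.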


## Jaccard Coefficient Calculation

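The value computed here is the Jaccard dissimilarity (also called the Jaccard distance), which equals 1 minus the Jaccard similarity \( f_{11} / (f_{01} + f_{10} + f_{11}) \). Because the variables are asymmetric, attributes where both individuals have 0 are ignored:

\[ d_{\text{Jaccard}} = \frac{f_{01} + f_{10}}{f_{01} + f_{10} + f_{11}} \]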

Where:
- f₀₁: count of cases where first individual has 0 and second has 1
- f₁₀: count of cases where first individual has 1 and second has 0
- f₁₁: count of cases where both individuals have 1
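
As a worked check of the formula, take Jack as the first individual and Mary as the second. From the binary table, Jack is (1, 0, 1, 0, 0, 0) and Mary is (1, 0, 1, 0, 1, 0) over Fever, Cough, Test-1, Test-2, Test-3, Test-4. Both have 1 for Fever and Test-1, only Mary has 1 for Test-3, and no attribute is 1 for Jack alone:

\[ f_{11} = 2, \quad f_{01} = 1, \quad f_{10} = 0 \]

\[ \frac{f_{01} + f_{10}}{f_{01} + f_{10} + f_{11}} = \frac{1 + 0}{1 + 0 + 2} = \frac{1}{3} \approx 0.33 \]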

In [3]:
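def jaccard_coefficient(row1, row2):
    """Return (f01 + f10) / (f01 + f10 + f11), the Jaccard dissimilarity of two rows."""
    # Exclude the symmetric Gender column by label rather than by position
    features1 = row1.drop('Gender').astype(int)
    features2 = row2.drop('Gender').astype(int)

    # Count mismatches (f01, f10) and positive matches (f11); 0-0 pairs are ignored
    f01 = int(((features1 == 0) & (features2 == 1)).sum())
    f10 = int(((features1 == 1) & (features2 == 0)).sum())
    f11 = int(((features1 == 1) & (features2 == 1)).sum())

    # Convention: if neither row has any 1, return 0 to avoid division by zero
    denominator = f01 + f10 + f11
    return (f01 + f10) / denominator if denominator > 0 else 0.0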

In [4]:
# Calculate the Jaccard dissimilarity for each pair
pairs = [('Jack', 'Mary'), ('Jack', 'Jim'), ('Jim', 'Mary')]
for a, b in pairs:
    coef = jaccard_coefficient(binary_df.loc[a], binary_df.loc[b])
    print(f'Jaccard dissimilarity for ({a}, {b}): {coef:.2f}')

Jaccard dissimilarity for (Jack, Mary): 0.33
Jaccard dissimilarity for (Jack, Jim): 0.67
Jaccard dissimilarity for (Jim, Mary): 0.75
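
The three values above follow the dissimilarity form (f01 + f10) / (f01 + f10 + f11), so a larger number means the two individuals are less alike. The sketch below re-derives them in plain Python, without pandas, and prints the complementary similarity next to each one. The helper names `jaccard_dissimilarity` and `jaccard_similarity` are hypothetical, and treating two all-zero vectors as dissimilarity 0 is an assumed convention carried over from the function above.

```python
from itertools import combinations

# Binary vectors in column order: Fever, Cough, Test-1, Test-2, Test-3, Test-4
records = {
    "Jack": [1, 0, 1, 0, 0, 0],
    "Mary": [1, 0, 1, 0, 1, 0],
    "Jim":  [1, 1, 0, 0, 0, 0],
}


def jaccard_dissimilarity(x, y):
    """(f01 + f10) / (f01 + f10 + f11); 0-0 matches are ignored."""
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    total = f01 + f10 + f11
    # Assumed convention: no positives in either vector gives dissimilarity 0
    return (f01 + f10) / total if total else 0.0


def jaccard_similarity(x, y):
    """Complement of the dissimilarity: f11 / (f01 + f10 + f11)."""
    return 1.0 - jaccard_dissimilarity(x, y)


for a, b in combinations(records, 2):
    d = jaccard_dissimilarity(records[a], records[b])
    s = jaccard_similarity(records[a], records[b])
    print(f"({a}, {b}): dissimilarity={d:.2f}, similarity={s:.2f}")
```

Read this way, Jack and Mary are the most alike of the three pairs (dissimilarity 0.33), while Jim and Mary are the least alike (0.75).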
