ONSdigital/lvu_comp

LVU_COMP

This tool is in development and should not be used to identify potential bias in linked datasets.

LVU_COMP (linked versus unlinked comparisons) contains two methods for identifying potential bias in linked data. The first, lvu_chi2, compares the distribution of a categorical variable in linked data to its distribution in unlinked data. The second, lvu_effect, measures the magnitude of the difference between the proportion of records at a particular level of a categorical variable (relative to the total across all levels) in true matches and the corresponding proportion in false or missed matches. Both methods are described in more detail below.

lvu_chi2 performs a chi-squared test on a contingency table created from the specified input DataFrame and categorical variables.

Parameters:
- input_dataframe (DataFrame): The input DataFrame.
- input_file_name (str): Name of the input file used for output file naming.
- category_1 (str): Name of the first categorical variable/column in the data.
- category_2 (str): Name of the second categorical variable/column in the data.

Returns:
- chi2 (float): The Chi-squared statistic. 
- p (float): The p-value of the test.
- dof (int): Degrees of freedom.
- expected (ndarray): The expected frequencies table.
- contingency_table (DataFrame): The contingency table used in the test.

Creates an output directory (<input_file_name>_output) and saves the following .csv files:
- <input_file_name>_chi2_results: saves dataset, category_1, category_2, chi2, p-value, degrees of freedom
- <input_file_name>_<category_1>_<category_2>_contingency_table: saves the contingency table
- <input_file_name>_<category_1>_<category_2>_expected_frequencies: saves the expected frequencies table

- Multiple analyses run on the same dataset are saved as new rows in chi2_results.
- Each analysis generates its own contingency table and expected frequencies .csv files.

Input data should have at least 2 categorical variables/columns for analysis.
- Column values can be int (e.g., 0/1) or string (e.g., 'Linked'/'Unlinked').
- One of the columns should represent the linked status (whether the data in that
    row were linked or not linked). 
- The other column(s) should represent categorical variables of interest to
    investigate for potential bias.
- The method will ignore any columns not specified in the function call.

The method assumes no missing values in the specified categorical variables/columns.
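The test described above can be sketched with pandas and SciPy. This is a minimal illustration, not the lvu_chi2 implementation: the function name chi2_sketch, the example column names and the example data are all hypothetical, and the file-saving step is left out. Note that SciPy applies Yates' continuity correction by default to 2x2 tables; whether lvu_chi2 does the same is not stated here.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_sketch(input_dataframe, category_1, category_2):
    """Chi-squared test of independence between two categorical columns."""
    # Cross-tabulate the two columns; all other columns are ignored.
    contingency_table = pd.crosstab(
        input_dataframe[category_1], input_dataframe[category_2]
    )
    # Returns the statistic, p-value, degrees of freedom and the table of
    # expected frequencies under independence.
    chi2, p, dof, expected = chi2_contingency(contingency_table)
    return chi2, p, dof, expected, contingency_table


# Hypothetical example data: linked status against a made-up age band.
example = pd.DataFrame({
    "linked_status": ["Linked"] * 6 + ["Unlinked"] * 6,
    "age_band": ["A", "A", "A", "B", "B", "C", "A", "B", "B", "C", "C", "C"],
})
chi2, p, dof, expected, table = chi2_sketch(example, "linked_status", "age_band")
print(table)
print(f"chi2={chi2:.3f}, p={p:.3f}, dof={dof}")
```

A small p-value would suggest that the distribution of the variable differs between linked and unlinked records, which is the kind of potential bias the method is meant to flag.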

lvu_effect calculates standard difference (stdiff) effect sizes for false matches and missed matches compared to true matches.

Parameters:
- input_dataframe (DataFrame): The DataFrame containing the target variables for effect size calculation.
- input_data_name (str): Name of the input dataset for output file naming. 
- category_name (int/str): The name of the categorical variable/column to analyze.
- category_level (int/str): The specific category value to calculate effect sizes for.
    
Returns:
- stdiff_false (float): Standard difference effect size for false matches.
- stdiff_missed (float): Standard difference effect size for missed matches.
- linked_n (int): Total number of linked records.
- unlinked_n (int): Total number of unlinked records.
- linked_true_n (int): Total number of correct (true) matches.
- linked_false_n (int): Total number of false matches.
- unlinked_true_n (int): Total number of missed matches.
- linked_true_cat_n (int): Number of correct matches in the specified category.
- linked_false_cat_n (int): Number of false matches in the specified category.
- unlinked_true_cat_n (int): Number of missed matches in the specified category.
- prop_linked_true_cat (float): Proportion of correct matches in the specified category.
- prop_linked_false_cat (float): Proportion of false matches in the specified category.
- prop_unlinked_true_cat (float): Proportion of missed matches in the specified category.

Creates an output directory (<input_data_name>_output) and saves the following .csv file:
- <input_data_name>_effect_size_results: saves category, category level, counts, proportions, and effect sizes.
- Multiple analyses run on the same dataset will be saved as new rows in effect_size_results.
    
Input data should have the following columns:
- LinkedStatus indicates whether the record was linked (1) or unlinked (0).
- LinkTruth indicates whether the record is a true match (1) or not (0).
- Category indicates the categorical variable of interest for effect size calculation.
    - Integers represent different categories (e.g., 1, 2, 3) 
- The method will ignore all other columns.

The method assumes no missing values in the specified columns.
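The counts and proportions listed under Returns can be derived from these columns roughly as follows. This is a sketch under assumptions rather than the lvu_effect code: it assumes true matches are records with LinkedStatus 1 and LinkTruth 1, false matches are LinkedStatus 1 and LinkTruth 0, and missed matches are LinkedStatus 0 and LinkTruth 1. The function name group_proportions and the example data are hypothetical.

```python
import pandas as pd


def group_proportions(input_dataframe, category_name, category_level):
    """Counts and proportions of one category level in three record groups."""
    linked = input_dataframe["LinkedStatus"] == 1
    true_match = input_dataframe["LinkTruth"] == 1
    in_level = input_dataframe[category_name] == category_level

    # Assumed group definitions (not confirmed by the source).
    groups = {
        "linked_true": linked & true_match,     # correct matches
        "linked_false": linked & ~true_match,   # false matches
        "unlinked_true": ~linked & true_match,  # missed matches
    }
    results = {}
    for name, mask in groups.items():
        n = int(mask.sum())
        cat_n = int((mask & in_level).sum())
        results[f"{name}_n"] = n
        results[f"{name}_cat_n"] = cat_n
        # Guard against an empty group to avoid dividing by zero.
        results[f"prop_{name}_cat"] = cat_n / n if n else float("nan")
    return results


# Made-up records for illustration only.
example = pd.DataFrame({
    "LinkedStatus": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    "LinkTruth":    [1, 1, 1, 1, 0, 0, 1, 1, 0, 0],
    "Category":     [1, 1, 1, 3, 2, 2, 1, 2, 3, 1],
})
props = group_proportions(example, "Category", 1)
print(props)
```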

Standard difference (stdiff) is calculated as:

stdiff = (p1 - p2) / sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
where p1 = prop_linked_true_cat and p2 = EITHER prop_linked_false_cat (for comparing
to false matches) OR prop_unlinked_true_cat (for comparing to missed matches)
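The formula translates directly into a few lines of Python. The sketch below is illustrative only: the proportions are made-up values, and returning NaN when the denominator is zero is an assumption of this example, not documented behaviour of lvu_effect.

```python
import math


def stdiff(p1, p2):
    """Standard difference between two proportions, as in the formula above."""
    pooled_variance = (p1 * (1 - p1) + p2 * (1 - p2)) / 2
    if pooled_variance == 0:
        # Both proportions are exactly 0 or 1, so the formula is undefined.
        return float("nan")
    return (p1 - p2) / math.sqrt(pooled_variance)


# Made-up proportions for illustration only.
prop_linked_true_cat = 0.5
prop_linked_false_cat = 0.2
stdiff_false = stdiff(prop_linked_true_cat, prop_linked_false_cat)
print(f"stdiff_false = {stdiff_false:.4f}")
```

The sign shows the direction of the difference: a positive value means the category level is more common among true matches than in the comparison group.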

About

Code to identify potential linkage bias by comparing features of linked vs unlinked data
