# Low-fidelity synthetic data generation notebook

This notebook describes and implements the process of generating low-fidelity synthetic data from a collection of summary statistics.  It makes use of the `behavioral_synthetic` library built by the Behavioural Insights Team.

### Requirements:
- Python 3.11 or greater.
- R 4.4.1 or greater.  
    - Required libraries: rlang, jsonlite, janitor, tidyverse, openxlsx.

### Setup: 
- Copy the following into your working directory:
    - The `behavioral_synthetic` directory (the name must match the imports below).
    - The `requirements.txt` file.
    - The `QA_code.R` file.
        - Set the `lib.loc` parameter in each `require` statement to the location of your R libraries.
- Create a Python virtual environment (venv) and install the packages listed in `requirements.txt`.
- Set up the input and output directories.
- Edit the notebook to use those input and output directories, as described later.
- Run the notebook.

## Libraries and shared functions

This section describes library imports and useful functions that will be used throughout the notebook.

### Libraries

In [None]:
import json
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
from os import listdir
from os.path import isfile, join
from datetime import datetime 


from behavioral_synthetic.tables.Table import Table
from behavioral_synthetic.tables.columns.general_functions import run_Rscript
from behavioral_synthetic.tables.columns.general_functions import index_to_column, group_ids, individual_ids

### Functions

This cell defines the following helper functions:
- `get_files_in_directory` -- returns a list of the files in a directory.
- `is_file_in_directory` -- returns `True` if the given name is a substring of any filename in a list of files, otherwise `False`.
- `prepend_to_file` -- prepends a string (as a `#` comment line) to the contents of a file and writes the result to disk.  Returns a dictionary containing the execution status (i.e. whether it worked) and, on failure, the error.
- `write_log` -- prints a timestamped message and appends it to a log file.

In [None]:
def get_files_in_directory(directory: str) -> list:
    return [file for file in listdir(directory) if isfile(join(directory, file))]

def is_file_in_directory(file: str, files_in_dir: list) -> bool:
    return any(file in filename for filename in files_in_dir)
    
def prepend_to_file(data:str, file_name:str, new_file: str) -> dict:
    try:
        with open(file_name, 'r') as file:
            file_data = file.read()
        
        prepended_data = "# "+ data + "\n" + file_data
        
        with open(new_file, 'w') as file:
            file.write(prepended_data)
            
        return {
            "successful": True
        }
    except Exception as e:
        return {
            "successful": False,
            "error": e
        }
        
def write_log(message, logfile):
    record = f'{datetime.now().strftime("%Y-%m-%dT%H:%M:%S %Z")}: {message}' + '\n'
    print(record)
    with open(logfile, 'a') as file:
        file.write(record)
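
Note that `is_file_in_directory` tests whether the name is a substring of any filename, so a dataset called `File 1` also matches `File 10.tsv`. The following is a minimal sketch of a stricter check, assuming an exact match on the filename without its extension is what is wanted. `has_exact_stem` is a hypothetical helper, not part of the notebook or the library.

```python
from os.path import splitext

def is_file_in_directory(file: str, files_in_dir: list) -> bool:
    # Same substring test as the notebook's helper.
    return any(file in filename for filename in files_in_dir)

def has_exact_stem(file: str, files_in_dir: list) -> bool:
    # Hypothetical stricter variant: compare against the name without its extension.
    return any(splitext(filename)[0] == file for filename in files_in_dir)

existing = ["File 10.tsv", "File 2.tsv"]
print(is_file_in_directory("File 1", existing))  # substring match on "File 10.tsv"
print(has_exact_stem("File 1", existing))        # no file is named exactly "File 1"
```

The stricter variant would only suit step one: step two looks for names such as `File 1_with_anon_ids.tsv`, where the substring test is what makes the match work.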

## Input and output settings
Here we set the following parameters:
- `SOURCE_DIRECTORY` -- where the summary statistics are kept (in .json format). Note that you may need to convert exports from the SRS to this format using the `convert_from_srs.py` and `test_convert.py` utilities.  The latter may need to be adapted to your file structure.
- `TARGET_DIRECTORY` -- where intermediate .tsv files are stored prior to the QA run. Note that they won't have correct ID columns at this stage; these are added during the QA processing.
- `METADATA_TARGET_DIRECTORY` -- where the final results and any metadata related to them are stored.  It should have the following subdirectories:
    - `categorical_metrics_comparison`
    - `corr_p_values`
    - `cumulative_distribution_plots`
    - `numerical_metrics_comparison`
- `OVERWRITE_SD` -- if a file with the same name as the one about to be generated exists, do we overwrite it? If `True` we do, otherwise we skip to the next dataset.

The default values of the above and other variables in the following cell (e.g. `BATCH_NUMBER`) assume that the full set of datasets is processed in batches.  Please modify this as necessary.
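
The required subdirectory layout can be created up front. This is a minimal sketch using only the standard library; the temporary base path and the `BATCH6` folder name are placeholders for illustration and should be replaced by your own metadata directory.

```python
import tempfile
from pathlib import Path

# Hypothetical base location; replace with your own METADATA_TARGET_DIRECTORY.
metadata_dir = Path(tempfile.mkdtemp()) / "Post QA files" / "BATCH6"

subdirectories = [
    "categorical_metrics_comparison",
    "corr_p_values",
    "cumulative_distribution_plots",
    "numerical_metrics_comparison",
]

for name in subdirectories:
    # parents=True creates the batch directory too; exist_ok=True makes reruns safe.
    (metadata_dir / name).mkdir(parents=True, exist_ok=True)

print(sorted(p.name for p in metadata_dir.iterdir()))
```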

In [None]:
BATCH_NUMBER=6

SOURCE_DIRECTORY = f" ... \\JSON files\\BATCH{BATCH_NUMBER}"
TARGET_DIRECTORY = f" ... \\TSV files\\BATCH{BATCH_NUMBER}"
METADATA_TARGET_DIRECTORY= f" ... \\Post QA files\\BATCH{BATCH_NUMBER}"
OVERWRITE_SD = False


Here we set the data sets to process. We assume that the names of the summary statistics files are of the form `${data_set_name}${append}` where `${append}` is the same for all data sets and `${data_set_name}` is a value listed in `files_base`. Please adapt this to your preferred convention if needed. 

Do not include the file suffix (i.e. `.json`, `.txt`, etc.) -- the code adds it, and including it here will cause errors.
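
A quick pre-flight check can catch names that break this convention before any processing starts. This is a sketch only: `names_with_suffix` and the `KNOWN_SUFFIXES` tuple are hypothetical and not part of the notebook, and the example names are made up.

```python
KNOWN_SUFFIXES = (".json", ".tsv", ".txt")  # assumed list; extend as needed

def names_with_suffix(names: list) -> list:
    # str.endswith accepts a tuple, so one call covers every suffix.
    return [name for name in names if name.lower().endswith(KNOWN_SUFFIXES)]

files_base = ["File 1", "File 2.json"]
append = ""
files = [f"{file}{append}" for file in files_base]

offenders = names_with_suffix(files)
print(offenders)
```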

In [None]:

files_base = [
"File 1",
"File 2"
]


append = ""

files = [f"{file}{append}" for file in files_base]



## Step One: Generate synthetic data

Running this cell generates the intermediate files in `TARGET_DIRECTORY`.  For each data set in turn, it reads in the summary data and generates the corresponding synthetic data.

In [None]:

#create_log_file
sd_logfile = f'{TARGET_DIRECTORY}\\SD_generation_log_{datetime.now().strftime("%Y-%m-%dT%H_%M_%S")}.txt'
open(sd_logfile, 'w').close()


files_in_target_directory = get_files_in_directory(TARGET_DIRECTORY)

for file in files:
    write_log(f"{file}:", sd_logfile)
    if is_file_in_directory(file, files_in_target_directory) and not OVERWRITE_SD:
        write_log("Synthetic Data already generated", sd_logfile)
        continue
    else:
        write_log("Generating synthetic Data", sd_logfile)
        
        input_file = f"{SOURCE_DIRECTORY}\\{file}.json"
        output_file = f"{TARGET_DIRECTORY}\\{file}.tsv"
        

        # Use a separate name for the handle so the loop variable `file` is not shadowed.
        with open(input_file, 'r') as json_file:
            table_definition = json.load(json_file)
        
        table = Table(table=pd.DataFrame(), table_name="")
        table.read_in_table(table_definition)
        synthetic_data = table.generate(new_column_length=table_definition['Number_of_rows'])

        synthetic_data.to_csv(output_file, sep='\t', index=False)
        
        write_log("Done!", sd_logfile)


## Step Two: Generate QA and post-process synthetic data files

Here we generate the QA data and post-process the synthetic data so that it has plausible ID columns and a datestamp.

### Functions

Here we define functions that are useful only for this stage of the process.  They include:
-  `extract_QA_data` -- converts the dataset name into the form used as a key in the R QA output file, so that the matching results can be extracted.  At this point many of the conversions have been entered by hand, but it should be possible to replace them with a more compact regular expression. Returns a dictionary describing whether the operation worked and any relevant error messages.
-  `clean_data` -- drops correlation entries that have no p-value. The affected column pairs are logged in a warning file.
-  `extract_correlations` -- extracts correlation p-values from the associated R QA output file.
-  `get_cumulative_distribution` -- returns correlation p-values in the form of a cumulative distribution.
-  `plot_cumulative_distribution` -- plots the cumulative distribution of p-values. Returns a dictionary describing whether the graph was successfully generated and any relevant error messages.
-  `generate_anon_ids` -- generates plausible ID values for some ID columns: "Project_Row_ID", "Anon_Pupil_ID", "Anon_School_ID", "Anon_Teacher_ID", "Anon_Class_ID". Returns a dictionary describing whether this was done successfully and any relevant error messages.
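
The hand-entered conversions in `extract_QA_data` might be condensed along the following lines. This is a sketch under assumptions, not the notebook's method: `to_selector` is a hypothetical helper, it treats every non-alphanumeric character as a separator (the original only handles spaces, commas, hyphens and parentheses), and it splits every camelCase boundary rather than only the listed names. Check its output against the keys in your R QA files before relying on it.

```python
import re

def to_selector(file_prefix: str) -> str:
    # Replace every run of non-alphanumeric characters with a single underscore.
    name = re.sub(r"[^0-9A-Za-z]+", "_", file_prefix)
    # Insert an underscore at each lower-case to upper-case boundary (camelCase).
    name = re.sub(r"(?<=[a-z])(?=[A-Z])", "_", name).lower()
    # A leading digit gets an "x" prefix, as in the hand-written "x1st_class_number".
    if name[:1].isdigit():
        name = "x" + name
    return name

for example in ["ScratchMaths", "CraftOfWriting157", "1stClass@Number", "File 1"]:
    print(example, "->", to_selector(example))
```

Like the original chain of `replace` calls, this keeps a trailing underscore when the name ends in a separator such as `)`.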

In [None]:
def extract_QA_data(directory,file_prefix, QA_type):
    try:
        with open(f"{directory}\\{QA_type}", 'r') as file:
            data = json.load(file)
        underlined_prefix = (file_prefix.replace(" ", "_")
                             .replace(",", "_")
                             .replace("-", "_")
                             .replace("(", "_")
                             .replace(")","_")
                             .replace("___","_")
                             .replace("__", "_")
                             .replace("extNow", "ext_now")
                             .replace("ThinkFor", "Think_For")
                             .replace("phoGame", "pho_Game")
                             .replace("ScratchMaths", "Scratch_Maths")
                             .replace("flectEd", "flect_ed")
                             .replace("1stClass@Number", "x1st_class_number")
                             .replace("SciNapse", "sci_napse")
                             .replace("FiveRs", "five_rs")
                             .replace("CraftOfWriting157", "craft_of_writing157")
                             .replace("YoungJournalistAcademy158", "young_journalist_academy158")
                             .replace("PowerOfPictures159", "power_of_pictures159")
                             .replace("FirstThingMusic160", "first_thing_music160")
                             .replace("SpeechBubbles161", "speech_bubbles161")
                             .lower())
        selector=f"{underlined_prefix}"
        QA_dir = QA_type.split('.')[0]
    
        with open(f"{directory}\\{QA_dir}\\{file_prefix}_{QA_type}", "w") as file:
            json.dump(data[selector], file, indent=4)
        return {
            "successful": True
        }
    except Exception as e:
        return {
            "successful": False,
            "error": e
        }

def clean_data(data, directory, correlation_file):
    
    cleaned_data = []
    for item in data:
        if 'p_value' not in item:
            message = f'The p-value for {item["Comparison_column_1"]} and {item["Comparison_column_2"]} correlations is missing:' +'\n' +f'{item}'
            with open(f"{directory}\\..\\{correlation_file.split('.')[0]}_WARNING_notes.txt", "a") as file:
                file.write(message + "\n")
        else:
            cleaned_data.append(item)
    
    return cleaned_data
           

def extract_correlations(directory, correlation_file):
    
    with open(f"{directory}\\{correlation_file}", 'r') as file:
        data=json.load(file)
        
    cleaned_data = clean_data(data, directory, correlation_file)
    
    return [item['p_value'] for item in cleaned_data]

def get_cumulative_distribution(directory, correlation_file):
    
    data_list = extract_correlations(directory,correlation_file) 
    n = len(data_list)
    # Fraction of p-values at or below each sorted value: 1/n, 2/n, ..., 1
    fractional_distance = [(i + 1) / n for i in range(n)]

    return {"X":fractional_distance, "Y": np.sort(data_list)}

def plot_cumulative_distribution(directory, correlation_file):
    try:
        data = get_cumulative_distribution(directory, correlation_file)
        #based on https://stackoverflow.com/posts/22588814/revisions
        Comparison_X = data["X"]
        Comparison_Y = data["X"]  # both are set equal to the X parameter to get a straight line with gradient 1, the ideal case
        
        file_prefix = correlation_file.split('.')[0]
    
        output = f"{directory}\\..\\cumulative_distribution_plots\\{file_prefix}_cumulative_distribution.pdf"
    
        plt.step(data['X'], data['Y'], label = 'synthetic data')
        plt.step(Comparison_X,Comparison_Y, label = 'ideal case')
        plt.title(file_prefix)
        plt.xlabel("Fraction")
        plt.ylabel("p-value")
        plt.legend()
        plt.grid(True)
        plt.savefig(output, format="pdf")
        plt.close()   
        
        return {
            "successful": True
        }
    except Exception as e:
        return {
            "successful": False,
            "error": e
        }
   
def generate_anon_ids(no_meta_data_file, with_meta_data_file):
    try:
        dataframe = pd.read_csv(no_meta_data_file, sep='\t', comment='#')
        dataframe.to_csv(with_meta_data_file, sep='\t')

        #handling the anon/unique id columns:
        id_column_list = ["Project_Row_ID", "Anon_Pupil_ID", "Anon_School_ID", "Anon_Teacher_ID", "Anon_Class_ID"]
        no_rows = len(dataframe.index)
        
        for col in id_column_list:
            if not dataframe[col].isna().all():
                
                if col == "Project_Row_ID":
                    dataframe[col] = index_to_column(dataframe) #, "Project_Row_ID")
                elif col == "Anon_Pupil_ID":
                    dataframe[col] = individual_ids(4, 7, no_rows, False, True)
                elif col == "Anon_School_ID":
                    group_size = int(max(5,min(no_rows/50, 250 )))
                    dataframe[col] = group_ids(4, 6, group_size, no_rows, False, True)
                elif col == "Anon_Teacher_ID":
                    group_size = int(max(5, min(no_rows/100, 80 ))) #set too close to 99 and it will take forever to finish
                    dataframe[col] = group_ids(1, 2, group_size, no_rows, False, True)
                elif col == "Anon_Class_ID":
                    group_size = int(max(5, min(no_rows/100, 500 ))) # originally capped at 999, but that seems to slow things down for large datasets
                    dataframe[col] = group_ids(1, 3, group_size, no_rows, False, True)

        dataframe.to_csv(with_meta_data_file, sep='\t', index=False)
    
        return {
            "successful": True
        }
    except Exception as e:
        return {
            "successful": False,
            "error": e
        }
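
The `group_size` values passed to `group_ids` in `generate_anon_ids` scale with the row count but are clamped between a floor of 5 and a per-column ceiling. The sketch below reproduces only that arithmetic; `clamped_group_size` is a hypothetical helper introduced for illustration, and it does not call the library.

```python
def clamped_group_size(no_rows: int, divisor: int, upper: int, lower: int = 5) -> int:
    # Same expression as in generate_anon_ids: scale with the row count,
    # but never go below `lower` or above `upper`.
    return int(max(lower, min(no_rows / divisor, upper)))

for no_rows in (100, 1_000, 100_000):
    school = clamped_group_size(no_rows, divisor=50, upper=250)
    teacher = clamped_group_size(no_rows, divisor=100, upper=80)
    class_ = clamped_group_size(no_rows, divisor=100, upper=500)
    print(no_rows, school, teacher, class_)
```

For 100 rows every column falls back to the floor of 5; for 100,000 rows every column hits its ceiling (250, 80 and 500 respectively).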

### Run QA and post-processing

The value of `OVERWRITE_METADATA` determines whether individual QA files will be written over if they already exist.  Note that this does not affect the operation of the R script that generates the source files from which that data is extracted.

The R script will generate output files corresponding to all files in the source and target directories.  For large collections of datasets it therefore makes sense to work in smaller batches of 10 or so datafiles, as in some cases the script can take up to 20 minutes to run.  The output will often include some warning messages.  These can usually be ignored.

Once the R script has run, each dataset listed under **Input and output settings** is processed in turn.  The relevant QA data is extracted from the R script output and saved in separate files. The intermediate .tsv file is processed to add ID columns and a datestamp, and saved in the metadata output directory.
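
The datestamp is written as a single `# `-prefixed JSON line at the top of the .tsv file. The sketch below shows one way to read it back using only the standard library; the file name and the two-column table are made-up placeholders.

```python
import json
import os
import tempfile
from datetime import datetime

# Hypothetical file standing in for a "<dataset>_with_anon_ids.tsv" output.
path = os.path.join(tempfile.mkdtemp(), "example_with_anon_ids.tsv")

header = {"Date generated": datetime.now().strftime("%Y-%m-%dT%H:%M:%S")}
with open(path, "w") as handle:
    # Same layout as prepend_to_file: "# " + JSON, then the original contents.
    handle.write("# " + json.dumps(header) + "\n")
    handle.write("Project_Row_ID\tscore\n1\t0.5\n")

with open(path, "r") as handle:
    first_line = handle.readline()

# Strip the leading "# " and parse the remainder as JSON.
metadata = json.loads(first_line[2:])
print(metadata["Date generated"])
```

When loading the table itself, `pd.read_csv(..., sep='\t', comment='#')`, as used in `generate_anon_ids`, skips this header line.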

**Note on cumulative distribution plots:** Individual plots are unlikely to show more than a general resemblance to the ideal line displayed in the graph.  To check as thoroughly as possible that the overall distribution of correlation p-values is close to what chance would produce, use the `generate_overal_cd_plots.py` utility script.  It may need to be adapted to match how you have decided to store your data.
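
To illustrate why a single plot can stray from the ideal line while a combined one tracks it more closely, the sketch below compares small sets of p-values with their pooled total. It uses simulated uniform p-values rather than real QA output, and it is not the implementation of the utility script; `max_deviation_from_ideal` is a hypothetical helper.

```python
import random

def cumulative_distribution(p_values: list) -> dict:
    # X: fraction of values (1/n ... 1); Y: the sorted p-values, as in the notebook.
    n = len(p_values)
    return {"X": [(i + 1) / n for i in range(n)], "Y": sorted(p_values)}

def max_deviation_from_ideal(distribution: dict) -> float:
    # Largest vertical gap between the step curve and the ideal line Y = X.
    return max(abs(y - x) for x, y in zip(distribution["X"], distribution["Y"]))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Simulated p-values standing in for real QA output: 20 datasets of 15 values each.
small_sets = [[rng.random() for _ in range(15)] for _ in range(20)]
pooled = [p for subset in small_sets for p in subset]

worst_small = max(max_deviation_from_ideal(cumulative_distribution(s)) for s in small_sets)
pooled_gap = max_deviation_from_ideal(cumulative_distribution(pooled))
print(f"largest gap in a small set: {worst_small:.3f}, gap when pooled: {pooled_gap:.3f}")
```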

In [None]:
OVERWRITE_METADATA = False


settings = {
    "INPUT_DIR": SOURCE_DIRECTORY,
    "OUTPUT_DIR": TARGET_DIRECTORY,
    "JSON_FILE_LOC": METADATA_TARGET_DIRECTORY,
}

#create_log_file
qa_logfile = f'{METADATA_TARGET_DIRECTORY}\\QA_log_{datetime.now().strftime("%Y-%m-%dT%H_%M_%S")}.txt'
open(qa_logfile, 'w').close()

write_log("Generating metadata for batch", qa_logfile)
print(run_Rscript(settings))

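# The per-dataset QA files are written to subdirectories named after each QA type,
# so include those listings as well as the top-level directory.
files_in_metadata_directory = get_files_in_directory(METADATA_TARGET_DIRECTORY)
for qa_subdirectory in ["categorical_metrics_comparison", "corr_p_values", "numerical_metrics_comparison"]:
    files_in_metadata_directory += get_files_in_directory(f"{METADATA_TARGET_DIRECTORY}\\{qa_subdirectory}")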

for file in files:
    write_log(f"{file}:", qa_logfile)
    output_types = ["categorical_metrics_comparison.json", "corr_p_values.json", "numerical_metrics_comparison.json", "with_anon_ids.tsv"]
    generated = all([is_file_in_directory(f"{file}_{output_type}",files_in_metadata_directory) for output_type in output_types])
    if generated and not OVERWRITE_METADATA:
        write_log("Metadata already extracted.", qa_logfile)
    else:
        write_log("Extracting QA metadata (this will overwrite any old files)", qa_logfile)
       
        no_meta_data_file = f"{TARGET_DIRECTORY}\\{file}.tsv"
        with_meta_data_file = f"{METADATA_TARGET_DIRECTORY}\\{file}_with_anon_ids.tsv"

        QA_files = output_types[0:3]
        
        for QA_file in QA_files:
            write_log(f"Extracting from: {QA_file}", qa_logfile)
            extract_QA_status = extract_QA_data(METADATA_TARGET_DIRECTORY, file, QA_file)
            if not extract_QA_status["successful"]:
                write_log(f"WARNING: problem extracting metadata for {file} from {QA_file}: {extract_QA_status['error']}", qa_logfile)
            
        # Generate anon IDs: read no_meta_data_file and write the result to with_meta_data_file
        
        id_gen_status = generate_anon_ids(no_meta_data_file, with_meta_data_file)
        if not id_gen_status["successful"]:
            write_log(f"WARNING: problem adding ids to synthetic data: {id_gen_status['error']}",qa_logfile)
       
        print('prepend data') 
        prepend_data= {}
        prepend_data['Date generated'] = datetime.now().strftime("%Y-%m-%dT%H:%M:%S %Z")
        
        prepend_status = prepend_to_file(json.dumps(prepend_data), with_meta_data_file, with_meta_data_file)
        if not prepend_status["successful"]:
            write_log(f"WARNING: problem prepending metadata to synthetic data: {prepend_status['error']}",qa_logfile)
            
        plot_status = plot_cumulative_distribution(f"{METADATA_TARGET_DIRECTORY}\\corr_p_values", f"{file}_corr_p_values.json")
        if not plot_status["successful"]:
            write_log(f"WARNING: problem generating cumulative distribution plot: {plot_status['error']}", qa_logfile)
        
        write_log(f"Metadata processing for file {file} complete.", qa_logfile)

write_log("Metadata processing for batch finished", qa_logfile)
