# Prepare dataset

### Clean Solomon data

Recode all distant (2) and unclear (3) interactions as no interaction (0) so that the outcome is dichotomous: interaction versus no interaction.

In [2]:
import pandas as pd

# List of dataset names
dataset_names = ['DYAD06NF', 'DYAD10NF', 'DYAD11NF', 'DYAD21NF', 'DYAD24NF']

# Base directory for input and output
input_dir = '/Users/ruzenkakaldenbach/Desktop/Solomon_output/'
output_dir = '/Users/ruzenkakaldenbach/Desktop/Solomon_output/'

# Process each dataset
for dat_name in dataset_names:
    print(f"Processing {dat_name}...")
    
    # Load the dataset
    file_path = f"{input_dir}solomon_{dat_name}.csv"
    df = pd.read_csv(file_path)
    
    # Replace `2` (distant) and `3` (unclear) with `0` (no interaction)
    df[['si_ry', 'si_by', 'si_rb']] = df[['si_ry', 'si_by', 'si_rb']].replace({2: 0, 3: 0})
    
    # Save the modified dataset
    output_file = f"{output_dir}solomon_{dat_name}_dichotomous.csv"
    df.to_csv(output_file, index=False)
    print(f"Saved processed file to {output_file}")

print("Processing complete.")


Processing DYAD06NF...
Saved processed file to /Users/ruzenkakaldenbach/Desktop//Solomon_output/solomon_DYAD06NF_dichotomous.csv
Processing DYAD10NF...
Saved processed file to /Users/ruzenkakaldenbach/Desktop//Solomon_output/solomon_DYAD10NF_dichotomous.csv
Processing DYAD11NF...
Saved processed file to /Users/ruzenkakaldenbach/Desktop//Solomon_output/solomon_DYAD11NF_dichotomous.csv
Processing DYAD21NF...
Saved processed file to /Users/ruzenkakaldenbach/Desktop//Solomon_output/solomon_DYAD21NF_dichotomous.csv
Processing DYAD24NF...
Saved processed file to /Users/ruzenkakaldenbach/Desktop//Solomon_output/solomon_DYAD24NF_dichotomous.csv
Processing complete.


### Create a common dataset for Loopy and Solomon data

The resulting dataset will contain the Loopy data as predictors (distance, angle, facing) and the Solomon data as the outcome (social interaction). Within each spreadsheet, the dyads are stacked one below the other; the spreadsheets are then stacked one below the other in the same way.
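A minimal sketch of this stacking step, assuming each spreadsheet is a wide per-frame table with one set of columns per dyad. Only the `si_ry`, `si_by`, `si_rb` names come from the cell above; the predictor column names (`distance_ry`, `angle_ry`, `facing_ry`, ...), the output column names, and the `stack_dyads` helper are hypothetical and need to be adapted to the actual Loopy export.

```python
import pandas as pd

# Dyad suffixes, taken from the si_ry / si_by / si_rb columns used above.
DYADS = ["ry", "by", "rb"]


def stack_dyads(wide: pd.DataFrame, dataset: str) -> pd.DataFrame:
    """Turn a wide per-frame table into one row per (frame, dyad)."""
    parts = []
    for dyad in DYADS:
        part = pd.DataFrame({
            "dataset": dataset,
            "dyad": dyad,
            # Hypothetical Loopy column names; adjust to the real export.
            "distance": wide[f"distance_{dyad}"],
            "angle": wide[f"angle_{dyad}"],
            "facing": wide[f"facing_{dyad}"],
            # Solomon outcome column, as named in the cell above.
            "interaction": wide[f"si_{dyad}"],
        })
        parts.append(part)
    # Dyads listed one below the other.
    return pd.concat(parts, ignore_index=True)


# Tiny made-up table with two frames, only to show the resulting shape.
demo = pd.DataFrame({
    f"{name}_{dyad}": [0, 1]
    for dyad in DYADS
    for name in ("distance", "angle", "facing", "si")
})

# Spreadsheets listed one below the other.
common = pd.concat(
    [stack_dyads(demo, "DYAD06NF"), stack_dyads(demo, "DYAD10NF")],
    ignore_index=True,
)
print(common.shape)
```

Keeping `dataset` and `dyad` as explicit columns means each row can still be traced back to its source after stacking.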

### Randomly drop half of the no-interaction rows

The aim is to have as many non-interactions as interactions; otherwise the ML model could achieve high accuracy simply by predicting no interaction for every input.
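One way to do this balancing could look like the sketch below. It assumes a stacked table with a 0/1 outcome column called `interaction` (a hypothetical name), and the `balance_classes` helper is illustrative. Rather than dropping exactly half, it samples as many no-interaction rows as there are interaction rows, which amounts to dropping half only when non-interactions are about twice as frequent.

```python
import pandas as pd


def balance_classes(df: pd.DataFrame, outcome: str = "interaction",
                    seed: int = 0) -> pd.DataFrame:
    """Randomly undersample no-interaction rows to match interaction rows."""
    positives = df[df[outcome] == 1]
    negatives = df[df[outcome] == 0]
    # Never ask for more negatives than are available.
    n_keep = min(len(positives), len(negatives))
    # A fixed random_state makes the random drop reproducible.
    kept_negatives = negatives.sample(n=n_keep, random_state=seed)
    # Restore the original row order after recombining.
    return pd.concat([positives, kept_negatives]).sort_index()


# Made-up example: 2 interactions, 6 non-interactions.
demo = pd.DataFrame({
    "distance": [1.0, 5.0, 6.0, 1.5, 7.0, 8.0, 9.0, 4.0],
    "interaction": [1, 0, 0, 1, 0, 0, 0, 0],
})

balanced = balance_classes(demo)
print(balanced["interaction"].value_counts())
```

If the intent is literally to drop half of the no-interaction rows, `negatives.sample(frac=0.5, random_state=seed)` would do that instead.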