# Program Description: Dataset Optimization (Module 7)

## Overview:
This module focuses on screening outliers in the dataset by identifying samples that do not meet specified criteria based on structural descriptors such as coordination number (CN) and bond length (CR). The program uses statistical distribution analysis (from **Module 5**) to set acceptable ranges for CN and CR, and then filters out the samples that do not fall within these ranges.

### Key Steps:
1. **Label Screening**: The program allows users to define the range for CN and CR based on the statistical distribution charts created in **Module 5**. Samples outside the defined range are considered outliers.
2. **Method for Bond Length and Coordination Number Calculation**: The program provides six methods for calculating the bond length and coordination number (from **Module 2**). These methods include:
   - `BrunnerNN_relative`
   - `VoronoiNN`
   - `JmolNN`
   - `MinimumDistanceNN`
   - `CrystalNN`
   - `EconNN`
   
   Users can select the method to use as the screening criterion using the `method` parameter.

3. **Filtering**: Samples that do not meet the defined criteria for CN and CR are identified and removed. The program stores the indices of the excluded samples and provides them in a CSV file for further review.

### Input:
- **Input File**: The program reads the dataset located in the directory where the program is executed (typically the folder containing the `datasets`).
- **Method Selection**: The user can specify the method used for calculating the bond length and coordination number. The `method` parameter should contain one of the following methods:
  - `BrunnerNN_relative`
  - `VoronoiNN`
  - `JmolNN`
  - `MinimumDistanceNN`
  - `CrystalNN`
  - `EconNN`
  
  This method will be used to screen the samples.

### Output:
- **Indices of Excluded Samples**: The indices of the samples that do not meet the defined criteria are saved in a CSV file: 
  - `indices_to_move_{method}.csv`
  - The file is saved in the `datasets` folder for reference.
  
- **Check Folder**: The program generates a `check` folder in the current directory. It copies the file types and labels corresponding to the excluded samples to this folder, allowing users to inspect which samples failed the criteria.
  
- **Sorted Indices**: After the screening process, the indices of the excluded samples are extracted and sorted.

### Example Usage:
1. **Set Label Screening Ranges**: Use the statistical analysis from **Module 5** to define the acceptable ranges for CN and CR.
2. **Select Method for Calculation**: Choose one of the six methods to calculate CN and CR for screening (e.g., `BrunnerNN_relative`).
3. **Run the Program**: The program filters out the samples that do not meet the selected criteria, saves the excluded sample indices, and prepares a folder with the details of the excluded samples.


contacts: zhaohf@ihep.ac.cn

#  Import libraries

In [1]:
import os
import pandas as pd
import sys 
from os.path import join
import shutil
import logging

#  Version Information

In [2]:
def get_python_version():
    return sys.version
def get_package_version(package_name):
    try:
        module = __import__(package_name)
        version = getattr(module, '__version__', None)
        if version:
            return version
        else:      
            return pkg_resources.get_distribution(package_name).version
    except (ImportError, AttributeError, pkg_resources.DistributionNotFound):
        return "Version info not found"

packages = ['pandas']
for package in packages:
    print(f"{package}: {get_package_version(package)}")
print(f"Python: {get_python_version()}")

pandas: 2.0.3
Python: 3.8.15 | packaged by conda-forge | (default, Nov 22 2022, 08:46:39) 
[GCC 10.4.0]


# Parameter Settings for Dataset Optimization 

## Input File:
- **load_path**: Specifies the directory where the dataset is located. By default, this is the folder where the program is executed.
  
## Filter Label Range:
- **Method Parameter**: Specifies the calculation method for the coordination number (CN) and bond length (CR). This will be used as the basis for filtering samples in the dataset. The following options are available:
  - `'BrunnerNN_relative'`
  - `'VoronoiNN'`
  - `'JmolNN'`
  - `'MinimumDistanceNN'`
  - `'CrystalNN'`
  - `'EconNN'`
  
  Choose one of these methods to filter the dataset according to the corresponding CN and CR calculation.

### Example of Parameter Setup:
```python
# Set the calculation method for filtering
method = 'VoronoiNN'  # Choose the desired method

In [3]:
# Input file settings
data_dir = "/media/dell-hd/data1/datasets/Au-datasets"
# Method used for filtering labels (choose one from the list)
method = "JmolNN"
# File paths for CN and CR data based on the selected method
cn_file_path = os.path.join(data_dir, f"cn/cn_{method}.csv")
cr_file_path = os.path.join(data_dir, f"cr/cr_{method}.csv")
# Labels to be used for filtering
labels = ['cn', 'cr']
# Filtering ranges for each label
pick_dict = {
    'cn': [2, 10],
    'cr': [2.6, 3.1]
}
# Features to check for the selected samples
file_types_to_copy = ['xmu', 'chi', 'wt', 'rdf', 'norm', 'wt_pic']
# Path for saving the output features of the selected samples
output_file_path = os.path.join(data_dir, f"{method}_check")
# Path for the log file
output_log_path = os.path.join(output_file_path, 'output_log.txt')
# Path for saving the indices of the selected samples
indices_output_file = os.path.join(data_dir, f"indices_to_move_{method}.csv")
# Check if the data directory exists
if os.path.exists(data_dir):
    print(f"Directory '{data_dir}' exists.")
else:
    raise FileNotFoundError(f"Directory '{data_dir}' does not exist.")

# Function to ensure the directory exists; creates it if it does not
def ensure_directory_exists(directory):
    if not os.path.exists(directory):
        os.makedirs(directory)
        print(f"Directory '{directory}' created.")
    else:
        print(f"Directory '{directory}' already exists, no creation needed.")

# Ensure the output directory exists
ensure_directory_exists(output_file_path)

# Check if the CN file exists
if not os.path.exists(cn_file_path):
    raise FileNotFoundError(f"File '{cn_file_path}' does not exist. No CN data available for method '{method}'.")

# Check if the CR file exists
if not os.path.exists(cr_file_path):
    raise FileNotFoundError(f"File '{cr_file_path}' does not exist. No CR data available for method '{method}'.")

print("CN File Path:", cn_file_path)
print("CR File Path:", cr_file_path)


File '/media/dell-hd/data1/datasets/Au-datasets' exists.
Directory '/media/dell-hd/data1/datasets/Au-datasets/JmolNN_check' created.
CN File Path: /media/dell-hd/data1/datasets/Au-datasets/cn/cn_JmolNN.csv
CR File Path: /media/dell-hd/data1/datasets/Au-datasets/cr/cr_JmolNN.csv


# Function settings

In [4]:
def ensure_directory_exists(directory):
    """Ensure the specified directory exists. Creates it if it does not."""
    if not os.path.exists(directory):
        os.makedirs(directory)
        print(f"Directory '{directory}' created.")
    else:
        print(f"Directory '{directory}' already exists.")

def get_filtered_indices_from_file(file_path, min_value, max_value):
    """Get indices of samples from a file that are outside the specified range."""
    indices_to_move = set()
    if os.path.exists(file_path):
        df = pd.read_csv(file_path)
        logging.info(f"Processing file: {file_path}")
        logging.info(f"Initial rows: {len(df)}")
        print(f"Processing file: {file_path}")
        print(f"Initial rows: {len(df)}")
        # Filter out samples that are not within the specified range
        filtered_df = df[(df.iloc[:, 1] < min_value) | (df.iloc[:, 1] > max_value)]
        indices_to_move.update(filtered_df.iloc[:, 0].tolist())  # Use the sample sequence number from the first column

        logging.info(f"Filtered rows: {len(filtered_df)}")
        logging.info(f"Indices to move: {indices_to_move}")
        
        print(f"Filtered rows: {len(filtered_df)}")
        print(f"Indices to move: {indices_to_move}")
    else:
        print(f"File {file_path} does not exist.")
    return indices_to_move, len(filtered_df)

def save_indices_to_csv(indices, output_file):
    """Save the filtered indices to a CSV file."""
    df_indices = pd.DataFrame(list(indices), columns=["index"])
    df_indices.to_csv(output_file, index=False)
    print(f"Saved indices to {output_file}")

def sort_and_save_csv(file_path):
    """Sort the indices in the CSV file and save the sorted data."""
    if os.path.exists(file_path):
        data = pd.read_csv(file_path)
        data = data.sort_values(by="index")  # Sort by the 'index' column
        data.to_csv(file_path, index=False)
        print(f"Sorted and saved: {file_path}")
    else:
        print(f"File does not exist: {file_path}")

def copy_files_by_indices(data_dir, indices_to_move, output_dir, file_types):
    """Copy files corresponding to sample indices to a unified check folder."""
    ensure_directory_exists(output_dir)

    for file_type in file_types:
        source_dir = os.path.join(data_dir, file_type)

        for index in indices_to_move:
            # Copy CSV files
            source_file_csv = os.path.join(source_dir, f"{index}.csv")
            if os.path.exists(source_file_csv):
                target_file_csv = os.path.join(output_dir, f"{file_type}_{index}.csv")
                shutil.copy(source_file_csv, target_file_csv)
                print(f"Copied {source_file_csv} to {target_file_csv}")

            # Copy PNG files
            source_file_png = os.path.join(source_dir, f"{index}.png")
            if os.path.exists(source_file_png):
                target_file_png = os.path.join(output_dir, f"{file_type}_{index}.png")
                shutil.copy(source_file_png, target_file_png)
                print(f"Copied {source_file_png} to {target_file_png}")

def save_all_filtered_indices(file_path, filtered_indices, output_dir, label, method):
    """Save all filtered sample indices and their corresponding values to a new CSV file."""
    ensure_directory_exists(output_dir)
    
    # Read the original data file
    df = pd.read_csv(file_path)
    
    # Filter the data for the selected indices
    filtered_data = df[df['index'].isin(filtered_indices)]
    
    # Build the output file path
    output_file = os.path.join(output_dir, f"{label}_{method}.csv")
    
    # Save the filtered data to a CSV file
    filtered_data.to_csv(output_file, index=False)
    print(f"Saved all filtered data to: {output_file}")


# Main Program: Extract Sample Indexes that Do Not Meet the Range, Save and Sort Them

## Functionality:
- This part of the program extracts the indexes of the samples that do not meet the specified screening conditions (e.g., coordination number or bond length).
- It then saves these indexes into a CSV file and sorts them for easier reference and analysis.

In [5]:
# Set up the logging configuration
logging.basicConfig(filename=output_log_path, level=logging.INFO, format='%(message)s')

# Get the filtered indices from the CN and CR files
Index_to_move_cn, count_cn = get_filtered_indices_from_file(cn_file_path, *pick_dict['cn'])
Index_to_move_cr, count_cr = get_filtered_indices_from_file(cr_file_path, *pick_dict['cr'])

# Log the filtered indices results for CN and CR
logging.info(f"Indices to move from CN: {Index_to_move_cn} (Count: {count_cn})")
logging.info(f"Indices to move from CR: {Index_to_move_cr} (Count: {count_cr})")

# Sort and save the indices to a CSV file
sort_and_save_csv(indices_output_file)

# Merge the indices to move from CN and CR
indices_to_move = Index_to_move_cn.union(Index_to_move_cr)
logging.info(f"Total indices to move: {indices_to_move} (Count: {len(indices_to_move)})")

# Save the merged indices to a CSV file
save_indices_to_csv(indices_to_move, indices_output_file)

# Log the total counts of indices to move
logging.info(f"Total indices count to move from CN: (Count: {count_cn})")
logging.info(f"Total indices count to move from CR: (Count: {count_cr})")
logging.info(f"Total indices count to move from CN and CR: (Count: {len(indices_to_move)})")

# Print the total counts of indices to move
print(f"Total indices count to move from CN: (Count: {count_cn})")
print(f"Total indices count to move from CR: (Count: {count_cr})")
print(f"Total indices count to move from CN and CR: (Count: {len(indices_to_move)})")


Processing file: /media/dell-hd/data1/datasets/Au-datasets/cn/cn_JmolNN.csv
Initial rows: 5001
Filtered rows: 17
Indices to move: {3489, 4897, 3880, 3561, 4425, 4553, 4809, 4333, 4841, 4879, 4272, 4496, 4911, 4475, 4536, 1243, 4796}
Processing file: /media/dell-hd/data1/datasets/Au-datasets/cr/cr_JmolNN.csv
Initial rows: 5001
Filtered rows: 71
Indices to move: {4224, 2818, 4233, 4621, 3344, 3090, 4626, 4756, 4377, 4250, 4378, 4127, 1696, 3489, 3617, 4385, 4896, 4135, 3880, 4519, 4520, 4778, 4909, 4142, 4272, 4400, 3890, 4914, 3512, 4536, 4921, 3900, 4542, 4288, 4289, 4418, 4295, 4425, 4809, 4301, 3406, 3794, 4434, 4436, 1877, 3926, 4438, 3801, 4827, 3164, 3036, 4060, 4316, 4320, 3809, 4830, 3173, 4327, 4585, 4841, 3307, 4972, 4077, 2159, 2291, 4083, 2421, 3829, 4860, 3069, 4863}
File does not exist: /media/dell-hd/data1/datasets/Au-datasets/indices_to_move_JmolNN.csv
Saved indices to /media/dell-hd/data1/datasets/Au-datasets/indices_to_move_JmolNN.csv
Total indices count to move from C

In [6]:
# Copy files based on the selected indices to the specified output directory
copy_files_by_indices(data_dir, indices_to_move, output_file_path, file_types_to_copy)

# Process each label to save the filtered indices data
for label in labels:
    # Construct the file path for the CSV file specific to the label and method
    file_path = os.path.join(data_dir, f"{label}/{label}_{method}.csv")
    
    # Save the filtered indices and their corresponding data to a new CSV file
    save_all_filtered_indices(file_path, indices_to_move, output_file_path, label, method)


Directory '/media/dell-hd/data1/datasets/Au-datasets/JmolNN_check' already exists.
Copied /media/dell-hd/data1/datasets/Au-datasets/xmu/2818.csv to /media/dell-hd/data1/datasets/Au-datasets/JmolNN_check/xmu_2818.csv
Copied /media/dell-hd/data1/datasets/Au-datasets/xmu/4621.csv to /media/dell-hd/data1/datasets/Au-datasets/JmolNN_check/xmu_4621.csv
Copied /media/dell-hd/data1/datasets/Au-datasets/xmu/4879.csv to /media/dell-hd/data1/datasets/Au-datasets/JmolNN_check/xmu_4879.csv
Copied /media/dell-hd/data1/datasets/Au-datasets/xmu/3344.csv to /media/dell-hd/data1/datasets/Au-datasets/JmolNN_check/xmu_3344.csv
Copied /media/dell-hd/data1/datasets/Au-datasets/xmu/3090.csv to /media/dell-hd/data1/datasets/Au-datasets/JmolNN_check/xmu_3090.csv
Copied /media/dell-hd/data1/datasets/Au-datasets/xmu/4626.csv to /media/dell-hd/data1/datasets/Au-datasets/JmolNN_check/xmu_4626.csv
Copied /media/dell-hd/data1/datasets/Au-datasets/xmu/4377.csv to /media/dell-hd/data1/datasets/Au-datasets/JmolNN_check

Copied /media/dell-hd/data1/datasets/Au-datasets/wt/4077.csv to /media/dell-hd/data1/datasets/Au-datasets/JmolNN_check/wt_4077.csv
Copied /media/dell-hd/data1/datasets/Au-datasets/wt/2291.csv to /media/dell-hd/data1/datasets/Au-datasets/JmolNN_check/wt_2291.csv
Copied /media/dell-hd/data1/datasets/Au-datasets/wt/4083.csv to /media/dell-hd/data1/datasets/Au-datasets/JmolNN_check/wt_4083.csv
Copied /media/dell-hd/data1/datasets/Au-datasets/wt/3829.csv to /media/dell-hd/data1/datasets/Au-datasets/JmolNN_check/wt_3829.csv
Copied /media/dell-hd/data1/datasets/Au-datasets/wt/4860.csv to /media/dell-hd/data1/datasets/Au-datasets/JmolNN_check/wt_4860.csv
Copied /media/dell-hd/data1/datasets/Au-datasets/wt/3069.csv to /media/dell-hd/data1/datasets/Au-datasets/JmolNN_check/wt_3069.csv
Copied /media/dell-hd/data1/datasets/Au-datasets/wt/4863.csv to /media/dell-hd/data1/datasets/Au-datasets/JmolNN_check/wt_4863.csv
Copied /media/dell-hd/data1/datasets/Au-datasets/rdf/2818.csv to /media/dell-hd/dat

Copied /media/dell-hd/data1/datasets/Au-datasets/norm/3406.csv to /media/dell-hd/data1/datasets/Au-datasets/JmolNN_check/norm_3406.csv
Copied /media/dell-hd/data1/datasets/Au-datasets/norm/4434.csv to /media/dell-hd/data1/datasets/Au-datasets/JmolNN_check/norm_4434.csv
Copied /media/dell-hd/data1/datasets/Au-datasets/norm/4436.csv to /media/dell-hd/data1/datasets/Au-datasets/JmolNN_check/norm_4436.csv
Copied /media/dell-hd/data1/datasets/Au-datasets/norm/1877.csv to /media/dell-hd/data1/datasets/Au-datasets/JmolNN_check/norm_1877.csv
Copied /media/dell-hd/data1/datasets/Au-datasets/norm/3926.csv to /media/dell-hd/data1/datasets/Au-datasets/JmolNN_check/norm_3926.csv
Copied /media/dell-hd/data1/datasets/Au-datasets/norm/4438.csv to /media/dell-hd/data1/datasets/Au-datasets/JmolNN_check/norm_4438.csv
Copied /media/dell-hd/data1/datasets/Au-datasets/norm/3164.csv to /media/dell-hd/data1/datasets/Au-datasets/JmolNN_check/norm_3164.csv
Copied /media/dell-hd/data1/datasets/Au-datasets/norm/3