<h1>CrystalineFileLecturePredict.ipynb</h1>

Reads all the files it is going to process from D3Files.

# WE REQUEST THE USER TO GIVE THE FILE WITH ONLY TWO ROWS PER POLARISER CELL USED

We should expect something like this:


|  |  |  |  |  |  |  |  |  |  |  |  |  |  |
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
polariser cell info |ge18004 |pressure/init. |polar |2.29 |0.79 |initial |date/time |17 |09 |23 |@ |10:39
|37391 |4.000 |0.000 |1.000 |18/09/23 |06:20:44 |155.03 |+z |+z |0.8391 |0.0156 |11.4270 |1.2031 |120.00
|37417 |4.000 |0.000 |1.000 |18/09/23 |10:31:59 |155.29 |+z |+z |0.8120 |0.0187 |9.6406 |1.0613 |120.00
polariser cell info |ge18012 |pressure/init. |polar |2.27 |0.79 |initial |date/time |18 |09 |23 |@ |10:33
|37418 |4.000 |0.000 |1.000 |18/09/23 |10:37:52 |155.29 |+z |+z |0.9101 |0.0107 |21.2483 |2.6375 |120.00
|37434 |4.000 |0.000 |1.000 |18/09/23 |14:16:07 |155.33 |+z |+z |0.8784 |0.0129 |15.4409 |1.7409 |120.00
polariser cell info |ge18004 |pressure/init. |polar |2.28 |0.79 |initial |date/time |22 |09 |23 |@ |09:45
|37462 |4.000 |0.000 |1.000 |22/09/23 |10:06:36 |0.00 |+z |+z |0.8670 |0.0427 |14.0333 |4.8278 |10.00
|37521 |4.000 |0.000 |1.000 |23/09/23 |09:51:06 |0.00 |+z |+z |0.7598 |0.0211 |7.3276 |0.7333 |120.00
|  ...  |       |       |       |          |          |      |        |        |        |        |        | | |

You can have as many sets of three lines as you desire, but they have to be a header and two regular rows per experiment. 
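
A quick way to check this structure before running the notebook is sketched below. It is only an illustration, not part of the pipeline: the function name `check_prepared_fli` is made up, and it assumes the columns of the real file are separated by whitespace (the `|` symbols in the table above are only there for display).

```python
def check_prepared_fli(text):
    """Return a list of problems; an empty list means the structure is as expected."""
    cells = []      # one [header_line, number_of_rows] entry per polariser cell
    problems = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("polariser cell info"):
            cells.append([line, 0])          # a new cell starts here
        elif not cells:
            problems.append(f"Row found before any header: {line}")
        else:
            cells[-1][1] += 1                # one more row for the current cell
    for header, n_rows in cells:
        if n_rows != 2:
            problems.append(f"Expected 2 rows but found {n_rows} after: {header}")
    return problems


good_text = """\
polariser cell info ge18004 pressure/init. polar 2.29 0.79 initial date/time 17 09 23 @ 10:39
37391 4.000 0.000 1.000 18/09/23 06:20:44 155.03 +z +z 0.8391 0.0156 11.4270 1.2031 120.00
37417 4.000 0.000 1.000 18/09/23 10:31:59 155.29 +z +z 0.8120 0.0187 9.6406 1.0613 120.00
"""
bad_text = "\n".join(good_text.splitlines()[:2])  # second measurement row missing

print(check_prepared_fli(good_text))
print(check_prepared_fli(bad_text))
```

The first call should report no problems and the second one should report a single cell with only one row.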
### How to prepare the files
1. Take the raw .fli file and open it in a text editor (Notepad on Windows opens it like a .txt file, so it works).
2. Choose a Miller index combination and find the first row whose polarization directions are (+z,+z).
3. Find the last row with that Miller index combination and polarization directions (+z,+z).
4. Erase everything except those two rows, and keep the header as well.
5. Repeat for every cell used (for every 'polariser cell info' row)

This process has not been automated in order to give the user more freedom when predicting. An automatic routine might not take into account which Miller index combination is adequate. Also, the user might have files that already have the correct structure, in which case they don't need the whole pipeline. Either way, if the ILL staff wants this automated routine (using the same logic as the other Reading.ipynb files), please contact Gonzalo and he will happily help you.

__________________________________________________________________________________________

Outputs the following folders and files:

1. **CrystallineLog_Predicting_Creation.txt**
Logs all the prints and every step the code does. If you trust the code, it is irrelevant. If you don't trust it or want to change it then this txt file will tell you how each experiment file has been processed and where there might have been issues.


2. **Crystalline_CellID**
Contains the Cell IDs found on all the files. Needed for the ML code.


3. **CrystallineSeparatedFolder**
It will create one folder for each subfolder in which usable .fli files were found. Only the numerical files are saved, i.e. *{base_name}.txt*. They contain DeltaTime (the time of the measurement measured from the first VALID polarization measurement), PolarizationD3, SoftPolarizationD3 (the polarization after using a Savitzky-Golay filter) and ErrPolarizationD3 (the uncertainty). The parameter files are sent directly to the next folder. Although it duplicates information, this folder is kept because it also holds any intermediate file that has not been fully processed. If one of the .fli files has an issue, the file is saved as is, in the state it had before running into the issue. 

4. **ML/CrystallinePredictFiles**
This folder is not inside the folder you are currently working in (FileReadingStoring) but in another folder at the same level as FileReadingStoring, called ML. It contains the files needed for the ML predictions.

    4.1 **{base\_name}.txt**
Contains DeltaTime (the time of the measurement measured from the first VALID polarization measurement), PolarizationD3, SoftPolarizationD3 (the polarization after using a Savitzky-Golay filter) and ErrPolarizationD3 (the uncertainty).

    4.2 **{base\_name}\_Parameters.txt**
Contains the CellID, Pressure, LabPolarization (the polarization measured at the lab) and LabTimeCellID (the time when it was measured)


5. **CrystallineDataBase**
Contains all the .fli files that were attempted to be read before manipulating them


6. **CrystallineBadFiles**
Contains all the rejected .fli files (not enough points, negative polarizations, etc.), separated into one folder per experiment set.

7. **PolarizationTimeReference**
The models were trained using relative time only. This means that the first valid polarization measurement is taken as time zero for each experiment, and every other measurement is assigned the time elapsed since that reference. This way all experiments have the same structure. However, this reference time is not absolute, and the measurements of the diffractograms may have an absolute time reference different from the ones used in the models. By saving the string "Year-Month-Day Hour:Minute:Second" we can safely change the time reference.
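
As an illustration of how that saved string can be used, the sketch below turns DeltaTime values back into absolute times. It is only a minimal example: the function name `absolute_times` is made up, and the reference value is an example built from the first sample row shown at the top of this notebook.

```python
from datetime import datetime, timedelta

def absolute_times(reference, delta_seconds):
    """Convert DeltaTime values (in seconds) into absolute datetime objects."""
    t0 = datetime.strptime(reference, "%Y-%m-%d %H:%M:%S")
    return [t0 + timedelta(seconds=seconds) for seconds in delta_seconds]

# Example reference string in the "Year-Month-Day Hour:Minute:Second" format
reference = "2023-09-18 06:20:44"
times = absolute_times(reference, [0, 15075])
print(times[1])  # 2023-09-18 10:31:59
```

Subtracting a different reference from these absolute times gives the same measurements in another relative time frame.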


_________________________________________________________________________________________

Some parts of the code might use data from different sessions. It is safer to erase them and create all files from scratch every time. This is not a big deal because this code file should only be run once unless the database changes. It also clears all previous ML predictions to avoid mixing up information between experiments.

The code will take all zipped folders from the folder _D3Files_ and prepare them to get their .fli files extracted.

First, it will check if there are duplicate zip folders. To do so, it compares the folder name and the SHA-256 hash. Duplicate folders will be erased. For more information about SHA-256, see for example:
>Wikipedia contributors. (2026, January 2). SHA-2. In Wikipedia, The Free Encyclopedia. Retrieved 10:49, January 17, 2026, from https://en.wikipedia.org/w/index.php?title=SHA-2&oldid=1330753870

Second, it will copy the contents of the zipped folders and create a new folder with the name of the experiment inside _D3Files_. No further zipped folders are erased at this step, so the information is temporarily duplicated.

Third, it will try to find all .fli files inside all the unzipped folders, whether or not they come from a zipped folder. It will then send them to the newly created folder _CrystallineDataBase_. If an experiment proposal has more than one .fli file, each file gets a numeric suffix (\_1, \_2,...) to distinguish them. Afterwards, all unzipped folders get erased, leaving behind only the non-duplicated zipped folders.

Note: All unzipped folders in D3Files will be explored; however, they will get erased at the end of the pipeline. If you want them to persist for future runs of the code, they should be zipped first. For ILL users, when navigating the ILL Cloud, the easiest way to prepare the zip files is to download the _processed_ folder for each experiment proposal. The code is "smart" enough to only process .fli files with polarization information. Therefore, there is no need to manually prepare anything.
To be precise, this cell of code will take all the zip files, extract them and remove duplicates using the file name and the SHA-256 hash. 
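
The duplicate check can be pictured with the sketch below. It is a simplified illustration rather than the code of this notebook: the names `sha256_of_file` and `find_duplicate_zips` are made up, and it treats two files as duplicates as soon as their content hash matches, without the additional name comparison described above.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicate_zips(folder):
    """Return the zip files whose content hash was already seen in the folder."""
    seen = {}          # hash -> first file found with that content
    duplicates = []
    for zip_path in sorted(Path(folder).glob("*.zip")):
        file_hash = sha256_of_file(zip_path)
        if file_hash in seen:
            duplicates.append(zip_path)
        else:
            seen[file_hash] = zip_path
    return duplicates

# Small demonstration with dummy files (not real archives; only the bytes matter)
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.zip").write_bytes(b"same content")
    (Path(tmp) / "b.zip").write_bytes(b"same content")
    (Path(tmp) / "c.zip").write_bytes(b"other content")
    duplicate_names = [path.name for path in find_duplicate_zips(tmp)]
print(duplicate_names)
```

Here only `b.zip` is reported, because `a.zip` is the first file seen with that content.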






The contents of the header rows of the .fli files are explained here:
1. 'polariser cell info' (str): Log of the installation of the polariser cells
2. 'PolariserID' (str): A string with the type of cell used
3. 'pressure/init. polar' (str): A string to introduce the $^\mathrm{3}$He gas pressure and the polarization measured at the creation lab.
4. 'PolariserPressure' (float): $^\mathrm{3}$He gas pressure in some units
5. 'InitialLabPolarization' (float): Polarization measured at the creation lab
6. 'initial date/time' (str): A string that introduces the day, month and year and the hour and minutes.
7. 'Date' (str): A string with the information DD/MM/YY
8. '@' (str): A string to separate date and time
9. 'time' (str): A string with the information HH:MM
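
To make the list above concrete, here is a minimal sketch that splits one header row into named fields. The function name `parse_cell_header` is made up for this example, and the token positions assume that the row is separated by whitespace exactly as in the sample rows at the top of this notebook.

```python
def parse_cell_header(line):
    """Split a 'polariser cell info' row into named fields."""
    tokens = line.split()
    if tokens[:3] != ["polariser", "cell", "info"]:
        raise ValueError(f"Not a polariser cell header: {line!r}")
    day, month, year = tokens[10:13]
    return {
        "PolariserID": tokens[3],
        "PolariserPressure": float(tokens[6]),
        "InitialLabPolarization": float(tokens[7]),
        "Date": f"{day}/{month}/{year}",
        "time": tokens[14],
    }

header = ("polariser cell info ge18004 pressure/init. polar 2.29 0.79 "
          "initial date/time 17 09 23 @ 10:39")
fields = parse_cell_header(header)
print(fields)
```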

And for the rest of the rows:
1. 'Measurement number' (int): The number index of the measurement.
2. 'First_Miller_Index' (float): The first Miller index of the crystal. Polarization is measured using a known Si Bragg crystal. For the origin of the Si crystal see:
>Stunault, Anne & Vial, S & Pusztai, Laszlo & Cuello, Gabriel & Temleitner, László. (2016). Structure of hydrogenous liquids: separation of coherent and incoherent cross sections using polarised neutrons. Journal of Physics: Conference Series. 711. 012003. 10.1088/1742-6596/711/1/012003. 
3. 'Second_Miller_Index' (float)
4. 'Third_Miller_Index' (float)
5. 'Date' (str): A string with the information DD/MM/YY of that measurement
6. 'time' (str): A string with the information HH:MM:SS of that measurement
7. Temperature 
8. 'Direction_1' (str): The direction of the polarization after the monochromator. The direction +z is orthogonal to the floor and points away from it. +x is the direction of the beam (variable), and +y is the direction orthogonal to +z and +x that completes a positively oriented orthonormal basis. 
9. 'Direction_2' (str): The direction of the polarization at the sensors. True polarization measurements are done **only** on the (+z,+z) direction.
10. 'D3Polarization' (float): The polarization measurement
11. 'ErrD3Polarization' (float): The uncertainty of that polarization measurement
12. 'FlippingRatio' (float): The flipping ratio. Given either the flipping ratio or the polarization value, the other one is fully determined. Therefore, only one is needed, and that is why we don't work with the flipping ratio
13. 'ErrFlippingRatio' (float): The uncertainty of the flipping ratio
14. 'Elapsed time' (float): The time used to obtain the measurement (integration of the beam over that number of seconds)
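
The relation between the flipping ratio and the polarization is not written out in this notebook. The sample rows at the top are consistent with the usual expression P = (R - 1)/(R + 1), so the sketch below assumes that formula and checks it against three of those rows. The function names are made up for this example.

```python
def polarization_from_flipping_ratio(ratio):
    """Assumed relation P = (R - 1) / (R + 1)."""
    return (ratio - 1.0) / (ratio + 1.0)

def flipping_ratio_from_polarization(polarization):
    """Inverse of the relation above: R = (1 + P) / (1 - P)."""
    return (1.0 + polarization) / (1.0 - polarization)

# (FlippingRatio, D3Polarization) pairs copied from the sample rows
samples = [(11.4270, 0.8391), (9.6406, 0.8120), (21.2483, 0.9101)]
for ratio, listed in samples:
    computed = polarization_from_flipping_ratio(ratio)
    print(f"R = {ratio:8.4f} -> P = {computed:.4f} (listed: {listed:.4f})")
```

The computed values agree with the listed ones to the four decimals stored in the file, which is why keeping only one of the two columns loses no information.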

Temperature did not seem to have an effect on the decay. Therefore, it has been eliminated in this code cell. Here is a summary of what the code does:

1. The code will go through the .fli files and find all rows with 'polariser cell info'. A cell change is registered every time a new 'polariser cell info' row appears. At the moment it ignores the experiments that use the 'magical box', as we are not sure if they are compatible with the experiments studied here.
2. For every cell change, a new .fli file is created storing all the information, including the header row and the measured data rows (in this case, just two). Also, all cell IDs are recorded.
3. For all .fli files the code now will:   

    3.1 Remove unwanted rows (hopefully none)
    
    3.2 Remove any rows that don't have polarization directions (+z,+z). (Should remove none if the file was prepared correctly.)
    
    3.3 Extract data from the header row.
    
    3.4 Remove unwanted columns (temperature, flipping ratio, counts, elapsed time,...).
    
    3.5 Set a time reference with the first measurement row. All other time values get referenced with respect to this moment in time and converted into seconds.
    
    3.6 Ignore all Miller index combinations that are not integers (hopefully no issues here).
    
    3.7 Save two files for each experiment. One with the header rows and another one with just the numeric rows (with a new header that explains what each column has).
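
Steps 3.2, 3.5 and 3.6 can be pictured with the sketch below. It is not the code used in this notebook: `prepare_rows` is a made-up name, the rows are assumed to be whitespace separated, the order of the steps is simplified, and the second sample row (the one with direction +x) is invented only to show a row being discarded.

```python
from datetime import datetime

def prepare_rows(rows):
    """Keep (+z,+z) rows with integer Miller indices and add DeltaTime in seconds."""
    kept = []
    for row in rows:
        tokens = row.split()
        h, k, l = (float(value) for value in tokens[1:4])
        if tokens[7:9] != ["+z", "+z"]:
            continue  # step 3.2: wrong polarization directions
        if not all(index.is_integer() for index in (h, k, l)):
            continue  # step 3.6: non-integer Miller indices
        when = datetime.strptime(f"{tokens[4]} {tokens[5]}", "%d/%m/%y %H:%M:%S")
        kept.append({
            "miller": (int(h), int(k), int(l)),
            "when": when,
            "PolarizationD3": float(tokens[9]),
            "ErrPolarizationD3": float(tokens[10]),
        })
    if kept:
        t0 = kept[0]["when"]  # step 3.5: the first kept row is the time reference
        for entry in kept:
            entry["DeltaTime"] = int((entry["when"] - t0).total_seconds())
    return kept

rows = [
    "37391 4.000 0.000 1.000 18/09/23 06:20:44 155.03 +z +z 0.8391 0.0156 11.4270 1.2031 120.00",
    "37400 4.000 0.000 1.000 18/09/23 08:00:00 155.10 +x +z 0.1000 0.0100 1.2222 0.1000 120.00",
    "37417 4.000 0.000 1.000 18/09/23 10:31:59 155.29 +z +z 0.8120 0.0187 9.6406 1.0613 120.00",
]
prepared = prepare_rows(rows)
print([(entry["DeltaTime"], entry["PolarizationD3"]) for entry in prepared])
```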

For every successful experiment we will output:
1. Image:  **"PolarizationD3\_{folder\_name}\_{DD}/{MM}/{YY}\_{i}\_MillerIndex\_{PrettyCombination}\_Multiplier={Multiplier}.png"** in PlotResults. Shows the plot with the extended area with the raw data
2. Image: **"PolarizationD3\_{folder\_name}\_{DD}/{MM}/{YY}\_{i}\_MillerIndex\_{PrettyCombination}\_Multiplier={Multiplier}\_Soft.png"** in PlotResults. Shows the plot with the extended area with the filtered data
3. Image: **"PolarizationD3\_{folder_name}\_{DD}/{MM}/{YY}\_{i}\_MillerIndex\_{PrettyCombination}\_Filtered.txt\_plot\_Derivatives.png"** in PlotResults. Shows the evolution of the "derivatives". I apologize for the hideous names. Unless it results in a fatal error, I am scared to change the code.
4. Image: **"PolarizationD3\_{folder_name}\_{DD}/{MM}/{YY}\_{i}\_MillerIndex_{PrettyCombination}\_Filtered.txt\_N\_{N}\_ManualInterval.png"** in PlotResults. Shows the plot with the non-extended area
5. Txt: **"PolarizationD3\_{folder\_name}\_{DD}/{MM}/{YY}\_{i}\_MillerIndex\_{PrettyCombination}.txt"** in MLDataBase. It contains the four data columns (DeltaTime, PolarizationD3, SoftPolarizationD3, ErrPolarizationD3)
6. Txt: **"PolarizationD3\_{folder\_name}\_{DD}/{MM}/{YY}\_{i}\_MillerIndex\_{PrettyCombination}\_Parameters.txt"** in MLDataBase. It contains the parameters (CellID, Pressure, LabPolarization, LabTime)


The plots are not necessary but are saved for the user to know what all the files look like. The txt files are fundamental for the rest of the pipeline. 
The files that are wrong or useless when all is done are the following. They are kept for debugging purposes (to see files with different structures, why they fail, etc.).

1. Txt: **"{folder\_name}\_Arrays\_{i}.txt"** in SeparatedFolder/{folder_name}. It still has the header and useless columns. It is the fli file of every chunk of every recorded experiment (correct or incorrect)
2. Txt: **"PolarizationD3\_{folder\_name}\_{DD}/{MM}/{YY}\_{i}\_MillerIndex\_{PrettyCombination}.txt"** in SeparatedFolder/{folder_name}. It is the same as the one in MLDataBase (a duplicate)
3. Image: **"PolarizationD3\_{folder\_name}\_{DD}/{MM}/{YY}\_{i}\_MillerIndex\_{PrettyCombination}.png"** in SeparatedFolder/{folder_name}. It plots (with error bars) PolarizationD3
4. Image: **"PolarizationD3\_{folder\_name}\_{DD}/{MM}/{YY}\_{i}\_MillerIndex\_{PrettyCombination}\_Combined.png"** in SeparatedFolder/{folder_name}. It plots (with error bars) both PolarizationD3 and SoftPolarizationD3
5. Image: **"PolarizationD3\_{folder\_name}\_{DD}/{MM}/{YY}\_{i}\_MillerIndex\_{PrettyCombination}\_Softened.png"** in SeparatedFolder/{folder_name}. It plots (with error bars) SoftPolarizationD3
6. Folder: **"FailuresTest"** contains all the graphs of the data sets that were considered not worthy but had more points than the ones saved. Check them if your experiment was not properly added
7. Folder: **"DataBase"** has the raw fli files. Once the code has been used they are no longer important (if you don't find the folder I may have added a line of code to erase it. Sorry in advance for any inconveniences)  


It erases all intermediate files and prepares the remaining ones for the ML pipeline

1. Removes all .fli files that have been created.
2. Removes empty folders
3. Collects all unique polariser–analyser ID pairs
 
As a result, the only useful files are _Crystalline_CellID.txt_ and the folder _ML/CrystallinePredictFiles_

## 1. Libraries

In [None]:
%reset -f
# Standard library
import os
import shutil
import zipfile
import re
import glob
import hashlib
from collections import defaultdict
from datetime import datetime
from pathlib import Path
from typing import Optional

# Third-party libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter
from scipy.optimize import curve_fit

## 2. Auxiliary Functions and log file creation

1. _PrintDebug_ is a flag that allows the code to output all the steps on screen. If it is set to False, it won't show anything. However, all information will be properly logged whether this flag is set to True or False. The name of the log is determined by the variable *log_file_path*. The code runs faster if it is set to False.

2. _ShowPlot_ is a similar flag that allows the code to show on screen all plots that are being produced. They are all stored independently of whether this flag is True or False. The code runs faster if it is set to False.

3. **log_message** is a function used for writing to the log file

4. **long_path** is a function that "fixes" directory paths

In [None]:
PrintDebug = True 
ShowPlot = False 
log_file_path = os.path.join(".", "CrystallineLog_Predicting_Creation.txt")  # same name as in the output list and in the clean-up cell
with open(log_file_path, 'w', encoding='utf-8') as log_file:
    log_file.write("=== Log started ===\n")

def log_message(message):
    if PrintDebug:
        print(message)
    with open(log_file_path, 'a', encoding='utf-8') as log_file:
        log_file.write(str(message) + "\n")


def long_path(path):
    """
    Arguments: 
        path (path): The path that needs to be converted
    
    Returns:
        The updated path string or path depending on the platform used
        
    Notes:
        To avoid Windows 260 character limit for Windows paths, a special "prefix" is added.
        It also unifies how directories are managed.
        Also works with Linux and Mac

    """
    # Convert to Path and resolve to absolute
    path = Path(path).resolve()
    
    # Windows only: add the extended-length prefix unless it is already present
    if os.name == "nt":
        path_str = str(path)
        if not path_str.startswith("\\\\?\\"):
            # UNC paths need special handling
            if path_str.startswith("\\\\"):
                path_str = "\\\\?\\UNC\\" + path_str[2:]
            else:
                path_str = "\\\\?\\" + path_str
        return path_str  # always a string on Windows, whether or not the prefix was added here
    
    return path

## 3. Functions

1. **Time** is a function that converts time to a universal format

2. **deltatime** is a function that computes the difference, in seconds, between two points in time 

3. **format_combination** is just an aesthetic change in the Miller index combination variable

4. **sanitize** is a function that replaces any "illegal" characters in a directory path

5. **savgol_params_func** is a function that ensures that the window length is odd and large enough for the polynomial order.

6. Formatting functions: Just optimizations for the names of the files being stored. Nothing of interest.

7. **filter_best_combination**  is a function that discards all problematic data sets (negative polarization, small sets of data) and also uses **Overall_Decrease**  
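
The window selection done by **savgol_params_func** can be tried in isolation with the sketch below. It repeats the same logic as a stand-alone function called `savgol_window` (a made-up name), and the default values 11 and 3 are assumptions chosen only for this example; the notebook reads `default_window_length` and `polyorder` from global variables instead.

```python
def savgol_window(n_points, default_window_length=11, polyorder=3):
    """Return an odd window length that is large enough for the polynomial order."""
    window_length = min(default_window_length, n_points)
    if window_length % 2 == 0:
        window_length -= 1              # the window must be odd
    if window_length < polyorder + 2:
        window_length = polyorder + 2   # too short for the polynomial order
        if window_length % 2 == 0:
            window_length += 1
    return window_length

for n_points in (2, 8, 50):
    window = savgol_window(n_points)
    fits = window <= n_points           # savgol_filter needs window <= number of points
    print(f"{n_points} points -> window {window}, fits in the data: {fits}")
```

With only two rows per cell the window is longer than the data, so `savgol_filter` raises an error. In that case **filter_best_combination** catches the error and copies the raw polarization into SoftPolarizationD3.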



In [None]:
def Time(Day_Ref, Hour_Ref):
    """
    Arguments: 
        Day_Ref (str): 'DD/MM/YY' a.k.a Day/Month/Year
        Hour_Ref (str): 'HH:MM' or 'HH:MM:SS' a.k.a Hour:Minute:Second
        
    Returns:
        A 'datetime' object with format (year, month, day, hour, minute, second).
        
    Notes:
        If there is no information about the seconds, they will be considered 0
    """
    match = re.match(r"(\d+)/(\d+)/(\d+)", Day_Ref)
    if match:
        DD = int(match.group(1))
        MM = int(match.group(2))
        YY = int(match.group(3))
    else:
        raise ValueError(f"Invalid date format: {Day_Ref}")

    match = re.match(r"(\d+):(\d+):(\d+)", Hour_Ref)
    if match:
        Hour = int(match.group(1))
        Minute = int(match.group(2))
        Second = int(match.group(3))
    else:
        # If seconds are missing, try HH:MM
        match = re.match(r"(\d+):(\d+)", Hour_Ref)
        if match:
            Hour = int(match.group(1))
            Minute = int(match.group(2))
            Second = 0
        else:
            raise ValueError(f"Invalid time format: {Hour_Ref}")

    return datetime(YY + 2000 if YY < 100 else YY, MM, DD, Hour, Minute, Second)

################################################################


def deltatime(AIni,BIni, AFin,BFin):
    """
    Arguments: 
        AIni (str): 'DD/MM/YY' a.k.a Day/Month/Year for the initial time
        BIni (str): 'HH:MM' or 'HH:MM:SS' a.k.a Hour:Minute:Second for the initial time
        AFin (str): 'DD/MM/YY' a.k.a Day/Month/Year for the final time
        BFin (str): 'HH:MM' or 'HH:MM:SS' a.k.a Hour:Minute:Second for the final time
    Returns:
        The time variation in seconds
        
    Notes:
        Requires the function "Time"
    """
    time1 = Time(AIni, BIni)
    time2 = Time(AFin, BFin)
    return int((time2 - time1).total_seconds())

#################################################################

def format_combination(comb):
    """
    Arguments: 
        comb (float, float, float): A set of three floats characterizing the Miller indices.

    Returns:
        A string such as "(3,0,0)" built from the integer part of each value (3.0 -> 3),
        or "(None)" if comb is None.
    """
    if comb is None:
        return "(None)"
    ints = tuple(int(float(x)) for x in comb)
    return f"({','.join(map(str, ints))})"

###############################################################

def sanitize(name):
    """
    Arguments: 
        name (str): Directory string 
    Returns:
        The same string but with symbols [,<,>,:,",/,\\,|,?,*,] converted to _
    """
    
    return re.sub(r'[<>:"/\\|?*]', '_', name)

###############################################################

def savgol_params_func(n_points):
    """
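    Arguments: 
        n_points (int): Number of points the filter will be applied to
    Returns:
        A dictionary containing the parameters for a Savitzky–Golay filter.
    Notes:    
        It ensures that the window length is odd and large enough for the polynomial order.
        It relies on the global variables default_window_length and polyorder, which must be
        defined before this function is called.
        For very short data sets the returned window can be longer than n_points. savgol_filter
        then raises an error, which filter_best_combination catches.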
    """
    window_length = min(default_window_length, n_points)
    if window_length % 2 == 0:
        window_length -= 1
    if window_length < polyorder + 2:
        window_length = polyorder + 2
        if window_length % 2 == 0:
            window_length += 1
    return {'window_length': window_length, 'polyorder': polyorder}

###################################################


def make_clean_name(filename: str) -> str:
    """
    Turn e.g.
      PolarizationD3_CaFeAl_13_7_6_24_2_MillerIndex_(0,0,2)_Filtered.txt
    into:
      CaFeAl_13_7_6_24_2_(0,0,2)
    and handle cases where filename contains '/' or '\\' (dates like DD/MM/YY).
    """
    s = str(filename).replace("/", "_").replace("\\", "_")  # prevent path splitting
    base = Path(s).stem  # remove extension if present
    if base.startswith("PolarizationD3_"):
        base = base[len("PolarizationD3_"):]
    if base.endswith("_Filtered"):
        base = base[:-len("_Filtered")]
    base = base.replace("MillerIndex_", "")
    return base


def clean_plot_filename(filename: str, needed_N: Optional[float], plot_folder: Path) -> Path:
    """
    Build a clean filename for linear fit plots.
    Example:
    155K_2_18_9_23_1_(4,0,1)_N_4.30e-03.png
    or
    155K_2_18_9_23_1_(4,0,1)_NoNFound.png
    """
    base = make_clean_name(filename)
    suffix = f"_N_{needed_N:.2e}" if needed_N else "_NoNFound"
    return plot_folder / f"{base}{suffix}.png"

def derivative_plot_filename(filename: str, plot_folder: Path) -> Path:
    """
    Build a clean filename for derivative plots.
    Example:
    CaFeAl_13_7_6_24_2_(0,0,2)_Derivatives.png
    """
    base = make_clean_name(filename)
    return plot_folder / f"{base}_Derivatives.png"

def extended_area_plot_filename(filename: str) -> str:
    """EuAgAs_5_31_10_23_0_(3,0,0)_ExtendedArea.png"""
    return f"{make_clean_name(filename)}_ExtendedArea.png"
    

#####################################################
    
def filter_best_combination(i, df):
    """
    Arguments: 
        1. i (int): The chunk number a.k.a the (ordinal) number of the Miller index combination
        2. df (pandas object): It is something like this: (NaNs are intended)
                  1    2    3  PolarizationD3 ErrPolarizationD3  12   13   14  DeltaTime
            0   3.0  3.0  3.0          0.5383            0.0021 NaN  NaN  NaN          0
            1   3.0  3.0  3.0          0.5379            0.0021 NaN  NaN  NaN        147
            2   3.0  3.0  3.0          0.5315            0.0022 NaN  NaN  NaN       3919
            ...

    Returns:
        1. A filtered df object like this one:
                  1    2    3  PolarizationD3 ErrPolarizationD3 DeltaTime
            0   3.0  3.0  3.0          0.5383            0.0021         0
            1   3.0  3.0  3.0          0.5379            0.0021       147
            2   3.0  3.0  3.0          0.5315            0.0022      3919
            ...
        2. A string such as "(3,3,3)" with the Miller index combination that was kept (see format_combination).
        
    Notes:    
        The rows are grouped by Miller index combination, most frequent first. For each combination:
            1. Check there is data
            2. Check that the polarization column is present and convert all values to numbers
            3. Check that all polarization values are positive. If they are not, skip that Miller index combination
            4. Check that there are at least two rows of data. If there are not, skip that Miller index combination
            5. Apply the Savitzky-Golay filter and store the result as SoftPolarizationD3
               (the raw values are copied if the filter fails)
        Returns (None, None) if no combination passes these checks.
        It relies on the global variables FileName, DD, MM, YY and BadFilesFolder.
            
    """
    
    filter_func=savgol_filter #Only tested for 'savgol_filter'
    filter_params_func=savgol_params_func #Only tested for the previously defined function 'savgol_params_func'
    min_points_required=2  #Minimum number of rows per cell in this notebook (same threshold as the check in Requisite 4)
    tolerance=1e-8 #Tolerance to decide if the filtered value is worth keeping
    filter_column_idx=df.columns.get_loc('PolarizationD3')
    time_column_idx=df.columns.get_loc('DeltaTime')
    error_column_idx=df.columns.get_loc('ErrPolarizationD3')
    new_column_name='SoftPolarizationD3'
    folder_name = FileName.replace(".fli", "")   
    
    # Group by first three columns (Miller indices)
    combination_counts = (
        df.groupby([df.columns[0], df.columns[1], df.columns[2]])
        .size()
        .sort_values(ascending=False)
    )
    log_message(f"Analyzing combinations in file: {folder_name}_Array_{i}.fli")
    #Read the three numbers from the .fli file
    for comb, count in combination_counts.items(): #This becomes a loop of just one combination
        log_message(f"Combination {comb} occurs {count} times in file {folder_name}.fli. Trying this combination")
        mask = (
            (df.iloc[:,0] == comb[0]) &
            (df.iloc[:,1] == comb[1]) &
            (df.iloc[:,2] == comb[2]))
        
        PrettyCombination = format_combination(comb)
        filtered_df = df.loc[mask].copy()
        
        # Requisites for the Combination to be valid:
        # Requisite 1: The combination must have data
        if filtered_df.empty:
            log_message(f"      {PrettyCombination} has no data")
            continue
        
        # Requisite 2: Check if data column exists
        if filtered_df.shape[1] <= filter_column_idx:
            log_message(f"      Expected column index {filter_column_idx} not found. Skipping combination {PrettyCombination}")
            continue
        
        # Convert to numeric all columns (all columns are considered as object type)
        filtered_df = filtered_df.apply(pd.to_numeric, errors='coerce')
        filtered_df = filtered_df.dropna()  # drops any rows with NaNs introduced by coercion in the last line
        
        # Check dtypes
        all_numeric = all(dtype.kind in ('f', 'i') for dtype in filtered_df.dtypes)
        
        if all_numeric:
            log_message(f"      All columns have been successfully converted to numbers.")
        else:
            log_message(f"      Not all columns are numbers. Current dtypes:")
            log_message(f"      {filtered_df.dtypes}")
            log_message(f"      Expect Error Message from Python. Perhaps removing this file might be wise unless all files have the same issue")
        if filtered_df.empty:
            log_message(f"      All rows dropped after conversion to numeric. Skipping combination {PrettyCombination}")
            continue
        
        # Requisite 3: Polarization is ALWAYS positive. If any is negative, that is not a polarization. Immediately sent to the Bad Files Folder
        if (filtered_df.iloc[:, filter_column_idx] < 0).any():
            filename = f"PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}_Filtered.txt"
            badfile_subfolder = BadFilesFolder / folder_name
            badfile_subfolder.mkdir(parents=True, exist_ok=True)
            badfiles_txt_path = badfile_subfolder / filename
            filtered_df.to_csv(long_path(badfiles_txt_path), index=False, sep='\t')
            log_message(f"      {PrettyCombination} has negative polarization values. Sent to BadFiles with name {filename}. Skipping to next Combination")
            continue

        # Requisite 4: Have at least two rows (a prepared file holds two measurement rows per cell).
        if len(filtered_df) < 2:
            filename = f"PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}_Filtered.txt"
            badfile_subfolder = BadFilesFolder / folder_name
            badfile_subfolder.mkdir(parents=True, exist_ok=True)
            badfiles_txt_path = badfile_subfolder / filename
            filtered_df.to_csv(long_path(badfiles_txt_path), index=False, sep='\t')
            log_message(
                f"      {PrettyCombination} has only {len(filtered_df)} rows (< {min_points_required}). "
                f"Sent to BadFiles with name {filename}. Skipping to next Combination"
            )
            continue


        # "Requisite 5": Be worthy of having the filter used.
        y = filtered_df.iloc[:, filter_column_idx].values
        filter_params = filter_params_func(len(y))
        try:
            y_filtered = filter_func(y, **filter_params)
            diff = np.abs(y - y_filtered)
            changed_count = np.sum(diff > tolerance)
            filtered_df[new_column_name] = y_filtered
            if changed_count > 0:
                log_message(f"      Filter changed {changed_count}/{len(y)} points. Adding column '{new_column_name}'.")
            else:
                log_message(f"      Filter applied but data unchanged. Adding '{new_column_name}' as duplicated values.")
            
            
        
        except Exception as e:
            log_message(f"      Error applying filter to combination {comb}: {e}")
            log_message(f"      Adding '{new_column_name}' as duplicated values to proceed anyway.")
            # Just duplicate the original column
            y_filtered = y.copy()
            filtered_df[new_column_name] = y_filtered
        # Requisite 5: Add the Derivative and BandWidth filtering logic
        return filtered_df, PrettyCombination

    return None, None  # Reached only if every Miller index combination was rejected (or df had no rows)



## 4. Clean old files and force certain experiments

Some parts of the code might use data from different sessions. It is safer to erase them and create all files from scratch every time. This is not a big deal because this code file should only be run once unless the database changes. It also clears all previous ML predictions to avoid mixing up information between experiments.

In [None]:
to_erase = [
    "CrystallineLog_Predicting_Creation.txt",
    "CrystallinePolariserAndAnalyser_IDs.txt",
    "CrystallineSeparatedFolder",
    "CrystallinePlotResults",
    "CrystallineMLDataBase",
    "PolarizationTimeReference.txt",
    "CrystallineDataBase",
    "CrystallineBadFiles", 
    "CrystallineFailuresTest", 
    "CrystallineCell_ID.txt"
]
for item in to_erase:
    path = os.path.abspath(item)  
    if os.path.exists(path):
        try:
            if os.path.isfile(path):
                os.remove(path)
                log_message(f"Deleted file: {path}")
            elif os.path.isdir(path):
                shutil.rmtree(path)
                log_message(f"Deleted folder: {path}")
        except Exception as e:
            log_message(f" Could not delete {path}: {e}")
    else:
        log_message(f"Not found (skipped): {path}")
        
predict_path = os.path.abspath(os.path.join("..", "ML", "CrystallinePredictFiles"))

if os.path.exists(predict_path):
    try:
        if os.path.isdir(predict_path):
            shutil.rmtree(predict_path)
            log_message(f"Deleted folder: {predict_path}")
        else:
            os.remove(predict_path)
            log_message(f"Deleted file (unexpected, was not a folder): {predict_path}")
    except Exception as e:
        log_message(f"Could not delete {predict_path}: {e}")
else:
    log_message(f"Not found (skipped): {predict_path}") 

## 5. ZIP Folder Treatment and .fli data extraction

The code will take all zipped folders from the folder _D3Files_ and prepare them to get their .fli files extracted.

First, it checks whether there are duplicate zip folders by comparing the SHA-256 hash of each one. Two zip folders with the same hash have identical contents, whatever their names, so the repeated ones are erased. For more information about SHA-256, see for example:
>Wikipedia contributors. (2026, January 2). SHA-2. In Wikipedia, The Free Encyclopedia. Retrieved 10:49, January 17, 2026, from https://en.wikipedia.org/w/index.php?title=SHA-2&oldid=1330753870

Second, it extracts the contents of each zipped folder into a new folder with the name of the experiment inside _D3Files_. No zipped folders are erased at this step, so the information is temporarily duplicated.

Third, it looks for all .fli files inside every unzipped folder, whether or not the folder came from a zipped file, and copies them to the newly created folder _CrystallineDataBase_. Each copied file gets a numeric suffix ('_1', '_2',...) so that .fli files sharing a name do not overwrite each other. Afterwards, all unzipped folders are erased, leaving behind only the non-duplicated zipped folders.
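
The suffix rule can be tried in isolation. The following is a minimal sketch, assuming a hypothetical helper name `next_free_name`, a made-up file name `example.fli`, and a temporary folder standing in for the database folder. It mirrors only the counting logic, not the full copy step.

```python
import tempfile
from pathlib import Path

def next_free_name(folder: Path, filename: str) -> Path:
    """Return folder/<stem>_<n><ext> for the smallest n >= 1 that is not taken."""
    stem, ext = Path(filename).stem, Path(filename).suffix
    counter = 1
    while (folder / f"{stem}_{counter}{ext}").exists():
        counter += 1
    return folder / f"{stem}_{counter}{ext}"

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    assigned_names = []
    for _ in range(3):                  # three files that share one name
        dest = next_free_name(folder, "example.fli")
        dest.touch()                    # stands in for the real copy
        assigned_names.append(dest.name)

print(assigned_names)
```

Because the counter starts at 1, even a file whose name is unique ends up with the `_1` suffix.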

Note: All unzipped folders in D3Files will be explored; however, they are erased at the end of this step. If you want them to persist for future runs of the code, they should be zipped first. For ILL users, when navigating the ILL Cloud, the easiest way to prepare the zip files is to download the _processed_ folder for each experiment proposal. The code is "smart" enough to only process .fli files with polarization information. Therefore, there is no need to manually prepare anything.
To be precise, this code cell takes all the zip files, removes duplicates using the SHA-256 hash, and extracts them.
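
As an illustration of content-based duplicate detection, here is a minimal sketch. The file names and payloads are hypothetical stand-ins for real zip files, and `sha256_of` is a name chosen for this example only.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path, block_size: int = 65536) -> str:
    """Hash a file block by block so large archives need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Hypothetical stand-ins for zip files: two share content, one differs
    (folder / "a.zip").write_bytes(b"same payload")
    (folder / "a_copy.zip").write_bytes(b"same payload")
    (folder / "b.zip").write_bytes(b"other payload")

    first_seen = {}     # hash -> first file name with that content
    duplicates = []     # names whose content repeats an earlier file
    for path in sorted(folder.glob("*.zip"), key=lambda p: p.name):
        digest = sha256_of(path)
        if digest in first_seen:
            duplicates.append(path.name)
        else:
            first_seen[digest] = path.name

print(duplicates)
```

Only `a_copy.zip` is reported: its name differs from `a.zip`, but its bytes (and therefore its hash) are the same.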

In [None]:
def file_hash(filepath, algo="sha256", block_size=65536):
    """Compute hash of a file (default SHA256)."""
    h = hashlib.new(algo)
    with open(long_path(filepath), "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()

folder = Path("D3Files")  
zip_files = [f.name for f in folder.glob("*.zip")] 

log_message(f"Reading ZIP files. Checking for true duplicates by content...")
base_names = set()
seen_hashes = {}

for zip_file in zip_files:
    zip_path = folder / zip_file
    name = Path(zip_file).stem
    ext = Path(zip_file).suffix
    filehash = file_hash(zip_path)

    if filehash in seen_hashes:
        log_message(f"Duplicate confirmed by hash! Removing: {zip_file} (same as {seen_hashes[filehash]})")
        os.remove(long_path(zip_path))  # actually delete the duplicate announced in the log
    else:
        seen_hashes[filehash] = zip_file
        base_names.add(name)

log_message(f"\n All duplicates (by content) removed. Begin unzipping...\n")

######################################################################################

""" UNZIPPING """

log_message(f"Begin unzipping...\n")

# Refresh zip_files list after removals
zip_files = [f.name for f in folder.glob("*.zip")]

for zip_file in zip_files:
    zip_path = folder / zip_file
    if zipfile.is_zipfile(long_path(zip_path)):
        folder_name = sanitize(zip_file.stem if isinstance(zip_file, Path) else os.path.splitext(zip_file)[0])
        extract_dir = folder / folder_name
        log_message(f"   Unzipping: {zip_file} -> {extract_dir}")
        try:
            with zipfile.ZipFile(long_path(zip_path), 'r') as zip_ref:
                zip_ref.extractall(long_path(extract_dir))
        except Exception as e:
            log_message(f"   WARNING: Error extracting {zip_file}: {e}")
    else:
        log_message(f"   WARNING: Skipping invalid zip file: {zip_file}")

log_message(f"\nFinished Unzipping. Experiments stored in individual folders substituting the zip files\n")

######################################################################################


""" .fli FILE EXTRACTION """

source_folder = Path("D3Files")
database_folder = (Path.cwd() / "CrystallineDataBase").resolve()
database_folder.parent.mkdir(parents=True, exist_ok=True)
os.makedirs(long_path(database_folder), exist_ok=True)

log_message(f"\nScanning all folders for .fli files...\n ")
for item in os.listdir(long_path(source_folder)):
    item_path = source_folder / item  # Path object
    if item_path.is_dir():
        log_message(f"   Processing folder: {item_path.name}")
        for root, dirs, files in os.walk(long_path(item_path)):
            for file in files:
                if file.lower().endswith(".fli"):
                    src_file = Path(root) / file

                    # Always append a numeric suffix (_1, _2, ...) so that files
                    # sharing a name never overwrite each other
                    counter = 1
                    base_name = Path(file).stem
                    ext = Path(file).suffix
                    while (database_folder / f"{base_name}_{counter}{ext}").exists():
                        counter += 1
                    dest_file = database_folder / f"{base_name}_{counter}{ext}"

                    log_message(f"   Copying: {src_file} -> {dest_file}")
                    shutil.copy2(src_file, dest_file)


        log_message(f"   Deleting folder: {item_path}")
        shutil.rmtree(long_path(item_path))


log_message(f"\nAll .fli files collected, sent from folder {source_folder} to folder {database_folder} . \n")

## 6. Separation of .fli files according to experiments

This code cell is different from all the other ones because **we request the user to give the file with only two rows per polariser cell used**

We should expect something like this:


|  |  |  |  |  |  |  |  |  |  |  |  |  |  |
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
polariser cell info |ge18004 |pressure/init. |polar |2.29 |0.79 |initial |date/time |17 |09 |23 |@ |10:39
|37391 |4.000 |0.000 |1.000 |18/09/23 |06:20:44 |155.03 |+z |+z |0.8391 |0.0156 |11.4270 |1.2031 |120.00
|37417 |4.000 |0.000 |1.000 |18/09/23 |10:31:59 |155.29 |+z |+z |0.8120 |0.0187 |9.6406 |1.0613 |120.00
polariser cell info |ge18012 |pressure/init. |polar |2.27 |0.79 |initial |date/time |18 |09 |23 |@ |10:33
|37418 |4.000 |0.000 |1.000 |18/09/23 |10:37:52 |155.29 |+z |+z |0.9101 |0.0107 |21.2483 |2.6375 |120.00
|37434 |4.000 |0.000 |1.000 |18/09/23 |14:16:07 |155.33 |+z |+z |0.8784 |0.0129 |15.4409 |1.7409 |120.00
polariser cell info |ge18004 |pressure/init. |polar |2.28 |0.79 |initial |date/time |22 |09 |23 |@ |09:45
|37462 |4.000 |0.000 |1.000 |22/09/23 |10:06:36 |0.00 |+z |+z |0.8670 |0.0427 |14.0333 |4.8278 |10.00
|37521 |4.000 |0.000 |1.000 |23/09/23 |09:51:06 |0.00 |+z |+z |0.7598 |0.0211 |7.3276 |0.7333 |120.00
|  ...  |       |       |       |          |          |      |        |        |        |        |        | | |

You can have as many sets of three lines as you desire, but they have to be a header and two regular rows per experiment. 
### How to prepare the files
1. Take the raw .fli file and open it in a text editor (Notepad on Windows opens them like .txt files, so it works). 
2. Choose a Miller index combination and find the first row whose polarization directions are (+z,+z)
3. Find the last appearance of that Miller index combination with polarization direction (+z,+z)
4. Erase everything except those two lines and the header.
5. Repeat for every cell used (for every 'polariser cell info' row)

The reason why this process has not been automated is to allow the user more freedom when predicting. Perhaps an automatic routine will not take into account what Miller index combination is adequate. Also, the user might have files that already have the correct structure and don't need the whole pipeline. Either way, if the ILL staff wants this automated routine (using the same logic as the other Reading.ipynb files) please contact Gonzalo and he will happily help you.
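
For users who prefer to script the manual steps, the following is a minimal sketch and not part of the pipeline. It assumes whitespace-separated rows with the Miller indices in fields 2 to 4 and the two directions in fields 8 and 9, as in the example table. The function name `keep_first_and_last` is hypothetical, and the two middle data rows in the sample are made up for the illustration.

```python
def keep_first_and_last(lines, hkl, direction=("+z", "+z")):
    """For every 'polariser cell info' block keep the header plus the first
    and last data rows that match the Miller indices and spin directions."""
    kept, header, matches = [], None, []
    target = tuple(map(float, hkl))

    def flush():
        # Only blocks with at least two matching rows are usable
        if header is not None and len(matches) >= 2:
            kept.extend([header, matches[0], matches[-1]])

    for line in lines:
        if line.strip().startswith("polariser cell info"):
            flush()
            header, matches = line, []
            continue
        tokens = line.split()
        if header is None or len(tokens) < 9:
            continue
        try:
            row_hkl = tuple(float(t) for t in tokens[1:4])
        except ValueError:
            continue                    # not a measurement row
        if row_hkl == target and tuple(tokens[7:9]) == direction:
            matches.append(line)
    flush()
    return kept

raw = """polariser cell info ge18004 pressure/init. polar 2.29 0.79 initial date/time 17 09 23 @ 10:39
37391 4.000 0.000 1.000 18/09/23 06:20:44 155.03 +z +z 0.8391 0.0156 11.4270 1.2031 120.00
37392 4.000 0.000 1.000 18/09/23 06:23:01 155.03 +x +z 0.0100 0.0156 1.0202 0.0300 120.00
37400 4.000 0.000 1.000 18/09/23 08:00:00 155.10 +z +z 0.8300 0.0160 10.7647 1.1000 120.00
37417 4.000 0.000 1.000 18/09/23 10:31:59 155.29 +z +z 0.8120 0.0187 9.6406 1.0613 120.00
""".splitlines(keepends=True)

reduced = keep_first_and_last(raw, hkl=(4, 0, 1))
print("".join(reduced))
```

The choice of `hkl` stays with the user, which keeps the freedom the manual procedure is meant to preserve.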




The header row of each experiment in the .fli files contains the following information: 
1. 'polariser cell info' (str): Log of the installation of the polariser cells
2. 'PolariserID' (str): A string with the type of cell used (for example, ge18004)
3. 'pressure/init. polar' (str): A label that introduces the $^\mathrm{3}$He gas pressure and the polarization measured at the creation lab.
4. 'PolariserPressure' (float): $^\mathrm{3}$He gas pressure (the file does not state the units)
5. 'InitialLabPolarization' (float): Polarization measured at the creation lab
6. 'initial date/time' (str): A label that introduces the day, month and year and the hour and minutes.
7. 'Date' (str): The day, month and year (DD MM YY), written as three separate fields
8. '@' (str): A string to separate date and time
9. 'time' (str): A string with the information HH:MM
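
The field positions can be checked with a short sketch. It assumes the header is split on whitespace, as in the example table, and it uses Python's `datetime` instead of the notebook's own time helpers. `parse_cell_header` is a hypothetical name.

```python
from datetime import datetime

def parse_cell_header(line: str) -> dict:
    """Split a 'polariser cell info' row on whitespace and name its fields."""
    t = line.split()
    # Positions: 0-2 'polariser cell info', 3 cell ID, 4-5 label, 6 pressure,
    # 7 lab polarization, 8-9 label, 10-12 day/month/year, 13 '@', 14 HH:MM
    lab_time = datetime.strptime(f"{t[10]}/{t[11]}/{t[12]} {t[14]}", "%d/%m/%y %H:%M")
    return {
        "CellID": t[3],
        "Pressure": float(t[6]),
        "LabPolarization": float(t[7]),
        "LabDateTime": lab_time,
    }

header = ("polariser cell info ge18004 pressure/init. polar 2.29 0.79 "
          "initial date/time 17 09 23 @ 10:39")
info = parse_cell_header(header)
print(info)
```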

And for the rest of the rows:
1. 'Measurement number' (int): The index of the measurement.
2. 'First_Miller_Index' (float): The first Miller index of the crystal. Polarization is measured using a known Si Bragg crystal. For the origin of the Si crystal, see:
>Stunault, Anne & Vial, S & Pusztai, Laszlo & Cuello, Gabriel & Temleitner, László. (2016). Structure of hydrogenous liquids: separation of coherent and incoherent cross sections using polarised neutrons. Journal of Physics: Conference Series. 711. 012003. 10.1088/1742-6596/711/1/012003. 
3. 'Second_Miller_Index' (float): The second Miller index of the crystal
4. 'Third_Miller_Index' (float): The third Miller index of the crystal
5. 'Date' (str): A string with the information DD/MM/YY of that measurement
6. 'time' (str): A string with the information HH:MM:SS of that measurement
7. 'Temperature' (float): The temperature recorded for that measurement
8. 'Direction_1' (str): The direction of the polarization after the monochromator. The direction +z is orthogonal to the floor and points away from it. +x is the direction of the beam (variable) and +y is the direction orthogonal to +z and +x (positively oriented orthonormal basis). 
9. 'Direction_2' (str): The direction of the polarization at the sensors. True polarization measurements are done **only** on the (+z,+z) direction.
10. 'D3Polarization' (float): A float with the polarization measurement
11. 'ErrD3Polarization' (float): A float with the uncertainty of that polarization measurement
12. 'FlippingRatio' (float): The flipping ratio. Given either the flipping ratio or the polarization value, the other one is fully determined. Therefore, only one is needed and that is why we don't work with the flipping ratio
13. 'ErrFlippingRatio' (float): The uncertainty of the flipping ratio
14. 'Elapsed time' (float): The time used to obtain the measurement (integration of the beam over that number of seconds)
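
To make the link between flipping ratio and polarization concrete, here is a minimal sketch. It assumes the usual relation $P = (R-1)/(R+1)$, which the notebook does not state explicitly but which the rows of the example table reproduce to the printed precision. The function names are hypothetical.

```python
def polarization_from_flipping_ratio(r: float) -> float:
    """P = (R - 1) / (R + 1), the relation assumed in this sketch."""
    return (r - 1.0) / (r + 1.0)

def flipping_ratio_from_polarization(p: float) -> float:
    """Inverse of the relation above: R = (1 + P) / (1 - P)."""
    return (1.0 + p) / (1.0 - p)

# (FlippingRatio, D3Polarization) pairs copied from the example table
samples = [(11.4270, 0.8391), (9.6406, 0.8120), (21.2483, 0.9101)]
max_error = max(abs(polarization_from_flipping_ratio(r) - p) for r, p in samples)
print(f"largest mismatch: {max_error:.1e}")
```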

Temperature did not seem to have an effect on the decay. Therefore, it has been eliminated in this code cell. Here is a summary of what the code does:

1. The code will go through the .fli files and find all rows with 'polariser cell info'. A cell change is registered every time a new 'polariser cell info' row appears. At the moment it ignores the experiments that use the 'magical box' as we are not sure whether those experiments are compatible with the ones studied here.
2. For every cell change, a new .fli file is created storing all the information including the header row and the measured data rows (in this case, just two). Also, all cell IDs are recorded
3. For all .fli files the code now will:   

    3.1 Remove unwanted rows (hopefully none)
    
    3.2 Remove any rows that don't have polarization directions (+z,+z) (should remove none if the file was prepared correctly).
    
    3.3 Extract data from the header row.
    
    3.4 Remove unwanted columns (temperature, flipping ratio, counts, elapsed time,...).
    
    3.5 Set a time reference with the first measurement row. All other time values get referenced with respect to this moment in time and converted into seconds.
    
    3.6 Ignore all Miller index combinations that are not integers (hopefully no issues here).
    
    3.7 Save two files for each experiment. One with the header rows and another one with just the numeric rows (with a new header that explains what each column has).
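
Step 3.5 can be illustrated with the first experiment of the example table. This is a minimal sketch using Python's `datetime`, not the notebook's own time helpers. `seconds_since` is a hypothetical name, and the header time is padded with `:00` seconds for the comparison.

```python
from datetime import datetime

FMT = "%d/%m/%y %H:%M:%S"

def seconds_since(reference: str, moment: str) -> int:
    """Seconds elapsed from `reference` to `moment` (both 'DD/MM/YY HH:MM:SS')."""
    delta = datetime.strptime(moment, FMT) - datetime.strptime(reference, FMT)
    return int(delta.total_seconds())

first_row = "18/09/23 06:20:44"        # first (+z,+z) measurement: time reference
last_row = "18/09/23 10:31:59"         # second measurement of the same cell
lab_measurement = "17/09/23 10:39:00"  # header date/time, seconds padded

delta_time = seconds_since(first_row, last_row)
lab_time = seconds_since(first_row, lab_measurement)
print(delta_time, lab_time)
```

The lab measurement happens before the first beam measurement, so its time relative to the reference is negative.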

For every successful experiment we will output:
1. Image:  **"PolarizationD3\_{folder\_name}\_{DD}\_{MM}\_{YY}\_{i}\_MillerIndex\_{PrettyCombination}\_Multiplier={Multiplier}.png"** in PlotResults. Shows the plot with the extended area with the raw data
2. Image: **"PolarizationD3\_{folder\_name}\_{DD}\_{MM}\_{YY}\_{i}\_MillerIndex\_{PrettyCombination}\_Multiplier={Multiplier}\_Soft.png"** in PlotResults. Shows the plot with the extended area with the filtered data
3. Image: **"PolarizationD3\_{folder\_name}\_{DD}\_{MM}\_{YY}\_{i}\_MillerIndex\_{PrettyCombination}\_Filtered.txt\_plot\_Derivatives.png"** in PlotResults. Shows the evolution of the "derivatives". I apologize for the hideous names. Unless it results in a fatal error, I am scared to change the code.
4. Image: **"PolarizationD3\_{folder\_name}\_{DD}\_{MM}\_{YY}\_{i}\_MillerIndex\_{PrettyCombination}\_Filtered.txt\_N\_{N}\_ManualInterval.png"** in PlotResults. Shows the plot with the non-extended area
5. Txt: **"PolarizationD3\_{folder\_name}\_{DD}\_{MM}\_{YY}\_{i}\_MillerIndex\_{PrettyCombination}.txt"** in MLDataBase. It contains the four data columns (DeltaTime, PolarizationD3, SoftPolarizationD3, ErrPolarizationD3)
6. Txt: **"PolarizationD3\_{folder\_name}\_{DD}\_{MM}\_{YY}\_{i}\_MillerIndex\_{PrettyCombination}\_Parameters.txt"** in MLDataBase. It contains the parameters (CellID, Pressure, LabPolarization, LabTime)


The plots are not necessary but are saved for the user to know what all the files look like. The txt files are fundamental for the rest of the pipeline. 
The files that are wrong or useless when all is done are the following. They are kept for debugging purposes (to see files with different structures, why they fail, etc.).

1. File: **"{folder\_name}\_Arrays\_{i}.fli"** in SeparatedFolder/{folder_name}. It still has the header and useless columns. It is the .fli file of every chunk of every recorded experiment (correct or incorrect)
2. Txt: **"PolarizationD3\_{folder\_name}\_{DD}\_{MM}\_{YY}\_{i}\_MillerIndex\_{PrettyCombination}.txt"** in SeparatedFolder/{folder_name}. It is the same as the one in MLDataBase (a duplicate)
3. Image: **"PolarizationD3\_{folder\_name}\_{DD}\_{MM}\_{YY}\_{i}\_MillerIndex\_{PrettyCombination}.png"** in SeparatedFolder/{folder_name}. It plots (with error bars) PolarizationD3
4. Image: **"PolarizationD3\_{folder\_name}\_{DD}\_{MM}\_{YY}\_{i}\_MillerIndex\_{PrettyCombination}\_Combined.png"** in SeparatedFolder/{folder_name}. It plots (with error bars) both PolarizationD3 and SoftPolarizationD3
5. Image: **"PolarizationD3\_{folder\_name}\_{DD}\_{MM}\_{YY}\_{i}\_MillerIndex\_{PrettyCombination}\_Softened.png"** in SeparatedFolder/{folder_name}. It plots (with error bars) SoftPolarizationD3
6. Folder: **"FailuresTest"** contains all the graphs of the data sets that were considered not worthy but had more points than the ones saved. Check them if your experiment was not properly added
7. Folder: **"DataBase"** has the raw .fli files. Once the code has been used they are no longer important (if you don't find the folder I may have added a line of code to erase it. Sorry in advance for any inconvenience)  

In [None]:
DataBase = Path('CrystallineDataBase')
output_base = Path('CrystallineSeparatedFolder')

# List all .fli files in that folder, prepare folders
FileNameList = [f.name for f in DataBase.glob('*.fli')]
polyorder = 2
default_window_length = 5
SeparatedFolder = Path("CrystallineSeparatedFolder")
BadFilesFolder = Path("CrystallineBadFiles")
MLDataBaseFolder = Path("CrystallineMLDataBase")
BadFilesFolder.mkdir(exist_ok=True, parents=True)
MLDataBaseFolder.mkdir(exist_ok=True, parents=True)
log_message(f"\n\nFiles in the data base that will be (tried) to be used\n {FileNameList}\n")

for FileName in FileNameList:
    """READ THE FILE AND SEPARATE IT INTO EACH EXPERIMENT USING THE POLARIZATION CELL"""
    # 1- Open file
    folder_name = FileName.replace(".fli", "")
    output_folder = output_base / folder_name
    file_path = DataBase / FileName
    output_folder.mkdir(parents=True, exist_ok=True)

    with open(long_path(file_path), "r", encoding='utf-8', errors='ignore') as f:
        lines = f.readlines()

    
    # 2- Locate the header with CellID, Pressure, etc. Chunks are the data rows sandwiched between two 'polariser cell info' strings
    chunks = []
    current_chunk = []
    started = False  

    for line in lines:
        if line.strip().startswith("polariser cell info"):  # All before polariser cell info will be forgotten
            if started and current_chunk:
                chunks.append(current_chunk)
            current_chunk = [line]
            started = True
        else:
            if started:
                current_chunk.append(line)

    if not started:
        log_message(f" File '{FileName}' does NOT contain any 'polariser cell info' header. Skipping.\n")
        continue
    else:
        log_message(f" File '{FileName}' contains at least one 'polariser cell info' header.")

    if current_chunk:
        chunks.append(current_chunk)

    #3- Save .fli files for every correct chunk
    base_name = FileName.replace(".fli", "")  # remove .fli for clean filenames
    log_message(f"\n\nCreating all the Array files \n")
    for i, chunk in enumerate(chunks):
        log_message(chunk)
        fli_filename = f"{base_name}_Arrays_{i}.fli"
        fli_path = output_folder / fli_filename
        with open(fli_path, "w") as f_out:
            f_out.writelines(chunk)  

    # 4- As CellID can be exchanged with real parameters, it is written in an independent file
    cell_id_file = Path.cwd() / "Crystalline_CellID.txt"
    try:
        with open(long_path(cell_id_file), 'r', encoding='utf-8') as file:
            seen_strings = set(line.strip() for line in file)
    except FileNotFoundError:
        seen_strings = set()
    
    # 5- Open each Array file and work with it (The Array file still has the header)
    with open(long_path(cell_id_file), 'a', encoding='utf-8') as file:
        for i in range(len(chunks)):
            FLI_filename = f"{base_name}_Arrays_{i}.fli"  
            FLI_path = output_folder / FLI_filename  
            if not FLI_path.exists():
                log_message(f"WARNING: Array file does not exist: {FLI_path}")
                continue
            df = pd.read_csv(long_path(FLI_path), sep=r'\s+', header=None, on_bad_lines='skip')  # Read file
            log_message(f"Reading {FLI_path}, removing ***WARNING No centering scan found ")
            warning_str = "***WARNING No centering scan found"

            #5.1 Combine the first 5 columns as strings, join them with spaces, and drop the rows containing this phrase (it is not important for us)
            df = df[~df.iloc[:, :5].astype(str).agg(' '.join, axis=1).str.contains('No centering scan found', regex=False)] 
            
            #5.2 Extract useful information from the header. Hopefully, CellID, Pressure, LabPolarization, Year, Month, Day, time of lab measurement before first experiment measurement (negative time) will be stored locally
            log_message(f"Header Information Extraction...")
            CellID =          df.iloc[0].tolist()[3]
            Pressure =        df.iloc[0].tolist()[6]
            LabPolarization = df.iloc[0].tolist()[7]

            try:
                HM, DD, MM, YY = df.iloc[0].tolist()[14], int(df.iloc[0].tolist()[10]), int(df.iloc[0].tolist()[11]), int(df.iloc[0].tolist()[12])
                Day_Ref = f"{DD:02d}/{MM:02d}/{YY:02d}"
                dt = Time(Day_Ref, HM)
            except Exception as e:
                log_message(f"Skipping file {file_path} because of invalid header data: {e}")
                continue

            
            #5.3 All redundant/useless information is removed
            log_message(f"Removing Measurement Index, Temperature, Flipping Ratio, Uncertainty of Flipping Ratio and Time between measurements,...")
            df = df.iloc[1:].reset_index(drop=True)
            df = df.drop(df.columns[0], axis=1)
            df = df.drop(df.columns[5], axis=1)
            df = df.drop(df.columns[9], axis=1)
            df = df.drop(df.columns[9], axis=1)
            df = df.drop(df.columns[9], axis=1)
            df = df.drop(df.columns[9], axis=1)
            log_message(f"Saving only polarization values for the Spin Directions wanted in both Polarizer Cells, i.e. (+z,+z)")

            #5.4 Keep only rows where both are +z
            df = df[(df[7] == '+z') & (df[8] == '+z')].copy()
            if df.empty:
                log_message(f"No valid '+z' rows in file {FileName}_Arrays_{i}.fli, skipping")
                continue  
            df = df.drop(df.columns[[5,6]], axis=1, errors='ignore')
            #5.5 Convert Miller index columns into integers. From string or object to float and if the float is close to an integer (tolerance is 1e-8) then save as integer. Otherwise remove row
            cols_to_convert = [1, 2, 3]
            df[cols_to_convert] = df[cols_to_convert].apply(pd.to_numeric, errors='coerce').astype(float)            
            mask = np.isclose(df[cols_to_convert], np.round(df[cols_to_convert]), atol=1e-8)
            df = df[mask.all(axis=1)].copy()
            log_message(f"All Spin directions removed. All irrational Miller Indices removed. Adding DeltaTime")
            
            #5.6 The time columns are converted into difference of time being the referenced time the first +z,+z measurement that has survived at this point
            if df.shape[0] < 2:
                log_message(f"Not enough valid rows after filtering, skipping chunk")
                continue
            df['DeltaTime'] = df.apply(
                lambda row: deltatime(df[4].iloc[0], df[5].iloc[0], row[4], row[5]), axis=1 )
            ref_dt = Time(df[4].iloc[0], df[5].iloc[0])
            LabTime = int((dt - ref_dt).total_seconds())
            with open("PolarizationTimeReference.txt", "a") as f:
                f.write(str(ref_dt) + "\n")

            #5.6 Rename the columns PolarizationD3, ErrPolarizationD3 (the polarization column and its uncertainty). The other one with name is DeltaTime. The rest are numbers (will be erased).
            #Also we remove the time strings (with DeltaTime they have no new information)
            log_message(f"Renaming PolarizationD3 and ErrPolarizationD3")
            df.rename(columns={
                df.columns[5]: 'PolarizationD3',
                df.columns[6]: 'ErrPolarizationD3'
            }, inplace=True)
            df.drop(columns=[df.columns[3], df.columns[4]], inplace=True)
            log_message(f"Dropped Time Strings")

            
            #5.7 Begin filtering and softening with previous functions
            log_message(f"Begin removal of Bad files and softening with Savitzky-Golay filter")
            filtered_df, PrettyCombination = filter_best_combination(i,df)
            #If nothing survived the filters/purge then use 'continue' and go for the next experiment
            if filtered_df is None and PrettyCombination is None:
                log_message(f"Chunk {i}: No suitable combination found. Perhaps, check the file again to see what  could have gone wrong or check the log to see why it has been discarded. Skipping to next chunk or file.")
                log_message(f"_______________________________________________________________\n")
                continue  # skip to next chunk
            
            #5.8 Removal of Miller indices (we have all the information they could give us)
            log_message(f"Removing Miller Indices columns")
            filtered_df = filtered_df.iloc[:, 3:]
            desired_order = ["DeltaTime", "PolarizationD3", "SoftPolarizationD3", "ErrPolarizationD3"]

            
            # 5.9 Remove the points that won't be useful for the ML algorithm
            columns_to_save = [col for col in desired_order if col in filtered_df.columns]  # Keep only the columns that exist
            df_SEMIFINAL = filtered_df[columns_to_save].copy()
            df_FINAL = filtered_df = df_SEMIFINAL
            
            
            # 5.10 Save the files
            log_message(f"Finally we save the chunk")
            csv_filename = f"PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}.txt"
            Parameter_filename = f"PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}_Parameters.txt"
            csv_path = output_folder / csv_filename
            csv_path.parent.mkdir(parents=True, exist_ok=True)

            df_FINAL.to_csv(long_path(csv_path), index=False, sep=',')
            
            ml_txt_path = MLDataBaseFolder / csv_filename
            ml_txt_path.parent.mkdir(parents=True, exist_ok=True)

            df_FINAL.to_csv(long_path(ml_txt_path), index=False, sep=',')
            log_message(f"Saved: {csv_filename}")
            log_message(f"Processing Parameter file...")
            ml_param_path = MLDataBaseFolder / Parameter_filename
            ml_param_path.parent.mkdir(parents=True, exist_ok=True)

            with open(long_path(ml_param_path), 'w', encoding='utf-8') as f:
                f.write("CellID,Pressure,LabPolarization,LabTime\n")
                f.write(f"{CellID},{Pressure},{LabPolarization},{LabTime}")
            log_message(f"Saved: {Parameter_filename}")
            log_message(f"Parameter and Array files saved to ML database: {MLDataBaseFolder}\n_______________________________________________________________\n\n")
            
    #5.12 Remove unwanted folders and files
    for i in range(len(chunks)):
        temp_filename = f"{base_name}_Arrays_{i}.fli"
        temp_path = output_folder / temp_filename
        try:
            temp_path.unlink()
        except FileNotFoundError:
            pass        
    log_message(f"Created and saved {len(chunks)} CSV files from file called {FileName}.")
    if output_folder.exists() and not any(output_folder.iterdir()):
        output_folder.rmdir()
        log_message(f"Removed empty folder: {output_folder}")
    log_message('\n\n')



## 7. CellID File processing

We need to save the CellID types without duplicates.
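
The idea is an order-preserving removal of duplicates over the first field of each Parameters file. The following is a minimal sketch working on in-memory strings instead of files. `unique_cell_ids` is a hypothetical name, and the sample contents are illustrative values derived from the example table.

```python
def unique_cell_ids(parameter_texts):
    """Collect the first field of each file's second line, keeping first-seen order."""
    seen, ordered = set(), []
    for text in parameter_texts:
        lines = text.splitlines()
        if len(lines) < 2:
            continue                    # header only: nothing to read
        cell_id = lines[1].split(",")[0].strip()
        if cell_id and cell_id not in seen:
            seen.add(cell_id)
            ordered.append(cell_id)
    return ordered

header = "CellID,Pressure,LabPolarization,LabTime\n"
files = [
    header + "ge18004,2.29,0.79,-70904",
    header + "ge18012,2.27,0.79,-292",
    header + "ge18004,2.28,0.79,-1296",   # same cell used again later
]
ids = unique_cell_ids(files)
print(ids)
```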

In [None]:
ml_database_folder = Path("CrystallineMLDataBase")
parameter_files = list(ml_database_folder.glob('*Parameters.txt'))
log_message(f"Found {len(parameter_files)} parameter files.")
unique_cell_ids = []
seen = set()

for filepath in parameter_files:
    try:
        with open(long_path(filepath), 'r', encoding='utf-8') as f:
            lines = f.readlines()
            if len(lines) >= 2:
                second_row = lines[1].strip()
                parts = second_row.split(',')
                if parts:
                    cell_id = parts[0]
                    if cell_id not in seen:
                        seen.add(cell_id)
                        unique_cell_ids.append(cell_id)
    except Exception as e:
        log_message(f"Failed to read {filepath}: {e}")

# Write to Crystalline_CellID.txt
cellid_file = Path.cwd() / "Crystalline_CellID.txt"
with open(long_path(cellid_file), "w", encoding='utf-8') as f:
    for cell_id in unique_cell_ids:
        f.write(f"{cell_id}\n")

log_message(f"Saved {len(unique_cell_ids)} unique cell IDs to {cellid_file.name}.")

# Remove the separated folder
folder_to_delete = Path.cwd() / "CrystallineSeparatedFolder"
if folder_to_delete.exists():
    shutil.rmtree(long_path(folder_to_delete))
    log_message(f"Folder '{folder_to_delete}' has been deleted.")
else:
    log_message(f"Folder '{folder_to_delete}' does not exist.")



## 8. Cleanup

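It removes duplicated results and hands the remaining files over to the ML pipeline:

1. Removes the data files in _CrystallineMLDataBase_ whose contents are identical (same SHA-256 hash), together with their _Parameters_ files. The first file of each group is kept.
2. Empties the folder _CrystallineToPredictFiles_ inside the sibling _ML_ folder, creating it first if it does not exist.
3. Moves everything from _CrystallineMLDataBase_ into _CrystallineToPredictFiles_ and removes the empty _CrystallineMLDataBase_ folder.
 
As a result, the only useful outputs are _Crystalline_CellID.txt_ and the folder _CrystallineToPredictFiles_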
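
The pairing between a data file and its Parameters file, and the "keep the first, drop the rest" rule, can be sketched as follows. This is a minimal illustration on in-memory contents. The names `parameters_path` and `duplicates_by_content` and the file names are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def parameters_path(data_path: Path) -> Path:
    """Companion file name: <stem>_Parameters<ext> next to the data file."""
    return data_path.with_name(f"{data_path.stem}_Parameters{data_path.suffix}")

def duplicates_by_content(contents: dict) -> list:
    """Given {name: bytes}, return the names that repeat earlier content."""
    groups = defaultdict(list)
    for name, payload in contents.items():
        groups[hashlib.sha256(payload).hexdigest()].append(name)
    # The first name of every group is kept; the rest are duplicates
    return [name for names in groups.values() for name in names[1:]]

contents = {
    "run_A.txt": b"0,0.84\n",
    "run_B.txt": b"0,0.84\n",   # same bytes as run_A.txt
    "run_C.txt": b"0,0.91\n",
}
to_delete = duplicates_by_content(contents)
companions = [parameters_path(Path(name)).name for name in to_delete]
print(to_delete, companions)
```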

In [None]:

hash_map = defaultdict(list)

def file_sha256(filepath, block_size=65536):
    """Compute SHA256 hash of a file (safe for large files)."""
    sha256 = hashlib.sha256()
    with open(long_path(filepath), "rb") as f:
        while chunk := f.read(block_size):
            sha256.update(chunk)
    return sha256.hexdigest()

# Scan all .txt files (only base files without '_Parameters')
for root, _, files in os.walk(long_path(ml_database_folder)):
    for file in files:
        if file.lower().endswith(".txt") and "_parameters" not in file.lower():
            path = Path(root) / file
            digest = file_sha256(path)  # not named file_hash, which is the function defined in section 5
            hash_map[digest].append(path)

# Report & delete duplicates
duplicates_found = False
for digest, paths in hash_map.items():
    if len(paths) > 1:
        duplicates_found = True
        log_message(f"\nDuplicate group (hash={digest}):")
        log_message(f"   Keeping: {paths[0]}")

        # All but the first are duplicates
        for p in paths[1:]:
            base_name, ext = os.path.splitext(p)
            param_file = Path(f"{base_name}_Parameters{ext}")

            try:
                os.remove(long_path(p))
                log_message(f"   Deleted duplicate base file: {p}")
            except Exception as e:
                log_message(f"   Could not delete base file {p}: {e}")
            if param_file.exists():
                try:
                    os.remove(long_path(param_file))
                    log_message(f"   Deleted parameter file: {param_file}")
                except Exception as e:
                    log_message(f"   Could not delete parameter file {param_file}: {e}")

if not duplicates_found:
    log_message("No duplicates found in MLDataBase!")
else:
    log_message("\n Duplicate cleanup complete!")
    
    
# Define paths relative to the notebook location
current_dir = Path().resolve()  
ml_database = current_dir / "CrystallineMLDataBase"
predict_files = current_dir.parent / "ML" / "CrystallineToPredictFiles"

# Make sure the destination exists
predict_files.mkdir(parents=True, exist_ok=True)

for item in predict_files.iterdir():
    if item.is_file():
        os.remove(item)
        log_message(f"Deleted existing file: {item}")
    elif item.is_dir():
        shutil.rmtree(item)
        log_message(f"Deleted existing folder: {item}")

for item in ml_database.iterdir():
    dest = predict_files / item.name
    shutil.move(long_path(item), long_path(dest))
    log_message(f"Moved {item.name} -> {predict_files}")

try:
    ml_database.rmdir()
    log_message(f"Removed empty folder: {ml_database}")
except OSError:
    log_message(f"Could not remove {ml_database}, not empty.")