<h1>AmorphousFileLectureCreate.ipynb</h1> 

Reads all the files it is going to process from the folder _D3Files_.

Outputs the following folders and files:

1. **AmorphousLog\_Reading\_Creation.txt**
Logs every printed message and every step the code performs. If you trust the code, you can ignore it. If you don't trust it, or want to change it, this txt file will tell you how each experiment file has been processed and where there might have been issues.


2. **Amorphous\_CellID**
Contains the Cell IDs found on all the files. Needed for the ML code.


3. **AmorphousPlotResults**
This folder will store the graphs of all the experiments that were accepted. Not needed for anything but it is nice to see the files that will be fed to the model. For each experiment you can find the following files:

    3.1 **{base\_name}\_ExtendedArea.png**
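It shows the plot of the data together with the range $y_\mathrm{min}=mx+n-1.3\,N<y<mx+n+1.3\,N<y_\mathrm{max}$, where $y=mx+n$ is the best linear fit and $N$ is the smallest half-width that contains 75% of the points. The points outside the green area are discarded because they lie too far from the fit to be considered correct.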


4. **AmorphousMLDataBase**
Contains the .txt files NECESSARY for the ML algorithm. There are two per experiment

    4.1 **{base\_name}.txt** 
Contains DeltaTime (the time of the measurement measured from the first VALID polarization measurement), PolarizationD3, SoftPolarizationD3 (the polarization after using a Savitzky-Golay filter) and ErrPolarizationD3 (the uncertainty)

    4.2 **{base\_name}\_Parameters.txt**
Contains the CellID, Pressure, LabPolarization (the polarization measured at the lab) and LabTimeCellID (the time when it was measured)

5. **AmorphousDataBase**
Contains all the .fli files that were attempted to be read


6. **AmorphousBadFiles**
Contains all the .fli files that were rejected (not enough points, negative polarizations, etc.), separated into one folder per experiment set.
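
The band drawn in the _ExtendedArea_ plots (item 3.1) can be illustrated with a minimal sketch. It assumes a straight-line fit obtained with `np.polyfit` (the notebook itself uses `curve_fit`). The function name `extended_band_mask`, its parameters and the synthetic data are hypothetical and chosen only for this illustration.

```python
import numpy as np

def extended_band_mask(t, p, multiplier=1.3, fraction=0.75, step=1e-4, n_max=0.4):
    """Return (mask, N): mask is True for the points inside the extended band."""
    t = np.asarray(t, dtype=float)
    p = np.asarray(p, dtype=float)
    m, n = np.polyfit(t, p, 1)               # least-squares line P = m*t + n
    residuals = np.abs(p - (m * t + n))
    N = step
    while N <= n_max:                        # smallest N holding `fraction` of the points
        if np.mean(residuals <= N) >= fraction:
            break
        N += step
    else:                                    # no suitable N found: keep every point
        return np.ones_like(p, dtype=bool), None
    return residuals <= multiplier * N, N

# Synthetic decay (made up for this example) with one obvious outlier at index 5
t = np.arange(10) * 1000.0
p = 0.8 - 1e-5 * t
p[5] += 0.05
mask, N = extended_band_mask(t, p)
print(f"N = {N:.4f}, kept {mask.sum()} of {mask.size} points")
```

The outlier pulls the fitted line slightly upwards, but the band is still narrow enough to reject it while keeping the other nine points.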


_________________________________________________________________________________________

## Explanations

Some parts of the code might use data from different sessions. It is safer to erase them and create all files from scratch every time. This is not a big deal because this code file should only be run once unless the database changes.

Some experiments did not pass the filtering methods of the previous functions despite looking very promising. Also, some experiments were not adequate yet they passed all of the filtering process. That is why we will store the names of those files manually.
The code will take all zipped folders from the folder _D3Files_ and prepare them to get their .fli files extracted.

First, it will check if there are duplicate zip folders. To do so, it will compare the folder names and their SHA-256 hashes. Duplicate folders will be erased. For more information about SHA-256 see, for example:
>Wikipedia contributors. (2026, January 2). SHA-2. In Wikipedia, The Free Encyclopedia. Retrieved 10:49, January 17, 2026, from https://en.wikipedia.org/w/index.php?title=SHA-2&oldid=1330753870
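
As an illustration of the duplicate check, the sketch below hashes every zip file in a folder and reports the ones whose content was already seen. It is a simplified version under stated assumptions: archives are treated as duplicates when their SHA-256 digests match, whereas the notebook also compares folder names. The helper names `sha256_of_file` and `find_duplicate_zips` and the temporary demo files are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large archives are never fully loaded in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicate_zips(folder):
    """Return the zip files whose content hash matches an earlier file."""
    seen = {}
    duplicates = []
    for zip_path in sorted(Path(folder).glob("*.zip")):
        key = sha256_of_file(zip_path)
        if key in seen:
            duplicates.append(zip_path)
        else:
            seen[key] = zip_path
    return duplicates

# Demo with throwaway files (their content is arbitrary bytes, not real archives)
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.zip").write_bytes(b"same content")
    (Path(tmp) / "b.zip").write_bytes(b"same content")
    (Path(tmp) / "c.zip").write_bytes(b"other content")
    duplicate_names = [p.name for p in find_duplicate_zips(tmp)]

print(duplicate_names)
```

Because the files are visited in sorted order, the first copy of each archive is kept and only the later copies are reported.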

Second, it will copy the contents of the zipped folders and create a new folder with the name of the experiment inside _D3Files_. No further zipped folders are erased at this stage, so the information is temporarily duplicated.

Third, it will try to find all .fli files inside all the unzipped folders, whether or not they come from a zipped file. It will then send them to the newly created folder _AmorphousDataBase_. If an experiment proposal has more than one .fli file, they get a numeric suffix (\_1, \_2,...) to distinguish them. Afterwards, all unzipped folders get erased, leaving behind only the non-duplicated zipped folders.

Note: All unzipped folders in D3Files will be explored, however they will get erased at the end of the pipeline. If you want them to persist for future runs of the code, they should be zipped first. **For ILL users, when navigating the ILL Cloud, the easiest way to prepare the zip files is to download the _processed_ folder for each experiment proposal. The code is "smart" enough to only process .fli files with polarization information. Therefore, there is no need to manually prepare anything.**


Some fli files have the wrong structure, which means that they are not polarization measurements, and those that are polarization files may have used more than one polarization cell. Therefore, we need to remove all the non-polarization sets of data and separate the good fli files depending on the type of polarization cell they used.

For every fli file we will read the contents and try to find the header (two strings in two consecutive lines). This marks the installation of a new polarizer cell. If there are numerical values before the first header, the file was saved before the cell was swapped, so those data rows will be skipped. A correct fli file will have the following structure:

|  |  |  |  |  |  |  |  |  |  |  |  |
|--|--|--|--|--|--|--|--|--|--|--|--|
| polariser cell info | ge18004 | pressure/init. polar | 2.30 | 0.79 | initial date/time | 21/11/23 | @ | 12:45 |
| analyser cell info  | sic1402 | pressure/init. polar | 2.00 | 0.79 | initial date/time | 21/11/23 | @ | 12:45 |
| 40661 | 3.000 | 3.000 | 3.000 | 21/11/23 | 12:50:35 | 0.00 | 0.7890 | 0.0020 | 8.4795 | 0.0897 | 120.00 |
| 40662 | 3.000 | 3.000 | 3.000 | 21/11/23 | 12:54:43 | 0.00 | 0.7851 | 0.0020 | 8.3048 | 0.0867 | 120.00 |
|  ...  |       |       |       |          |          |      |        |        |        |        |        |

Which corresponds to the following information for the first two rows:
1. 'polariser cell info'/'analyser cell info' (str): Log of the installation of the first and second polariser cells
2. 'PolariserID' (str): A string with the type of cell used
3. 'pressure/init. polar' (str): A string to introduce the $^\mathrm{3}$He gas pressure and the polarization measured at the creation lab.
4. 'PolariserPressure' (float): $^\mathrm{3}$He gas pressure in some units
5. 'InitialLabPolarization' (float): Polarization measured at the creation lab
6. 'initial date/time' (str): A string that introduces the day, month and year and the hour and minutes.
7. 'Date' (str): A string with the information DD/MM/YY
8. '@' (str): A string to separate date and time
9. 'time' (str): A string with the information HH:MM

And for the rest of the rows:
1. 'Measurement number' (int): The number index of the measurement.
2. 'First\_Miller\_Index' (float): The first Miller index of the crystal. Polarization is measured using a known Si Bragg crystal. For the source of the origin of the Si crystal see:
>Stunault, Anne & Vial, S & Pusztai, Laszlo & Cuello, Gabriel & Temleitner, László. (2016). Structure of hydrogenous liquids: separation of coherent and incoherent cross sections using polarised neutrons. Journal of Physics: Conference Series. 711. 012003. 10.1088/1742-6596/711/1/012003. 
3. 'Second\_Miller\_Index' (float)
4. 'Third\_Miller\_Index' (float)
5. 'Date' (str): A string with the information DD/MM/YY of that measurement
6. 'time' (str): A string with the information HH:MM:SS of that measurement
7. Unknown float, maybe Temperature
8. 'D3Polarization' (float): A float with the polarization measurement
9. 'ErrD3Polarization' (float): A float with the uncertainty of that polarization measurement
10. 'FlippingRatio' (float): The flipping ratio. Given either the flipping ratio or the polarization value, the other one is fully determined. Therefore, only one is needed and that is why we don't work with the flipping ratio
11. 'ErrFlippingRatio' (float): The uncertainty of the flipping ratio
12. 'Elapsed time' (float): It is the time used to obtain the measurement (integration of the beam over that number of seconds)
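
To illustrate why the flipping ratio is redundant, the short check below uses the relation $P = (R-1)/(R+1)$ between polarization $P$ and flipping ratio $R$. That this is the convention used in the .fli files is an assumption, although the two sample rows shown in the table above are consistent with it. The function names are hypothetical.

```python
def polarization_from_flipping_ratio(ratio):
    """P = (R - 1) / (R + 1); assumed convention, see the lead-in."""
    return (ratio - 1.0) / (ratio + 1.0)

def flipping_ratio_from_polarization(polarization):
    """Inverse relation, R = (1 + P) / (1 - P)."""
    return (1.0 + polarization) / (1.0 - polarization)

# (FlippingRatio, D3Polarization) pairs copied from the two sample rows above
sample_rows = [(8.4795, 0.7890), (8.3048, 0.7851)]
computed_values = [polarization_from_flipping_ratio(r) for r, _ in sample_rows]

for (ratio, listed), computed in zip(sample_rows, computed_values):
    print(f"R = {ratio:.4f} -> P = {computed:.4f} (file lists {listed:.4f})")
```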

Temperature did not seem to have an effect on the decay. Therefore, it has been eliminated in this code cell. Here is a summary of what the code does:

1. The code will go through the .fli files and find all combinations of consecutive rows with 'polariser cell info' and 'analyser cell info'. It doesn't care about the order, which makes the code more robust. We will consider that a polariser cell has been properly installed when both of these rows are present, and that the cell has been changed once a new set of polariser and analyser rows is encountered. At the moment it ignores the experiments that use the 'magical box' as we are not sure if they are compatible with the ones studied here
2. For every cell change, a new .fli file is created storing all the information, including both polariser and analyser rows and the measured data rows. Also, all cell IDs are recorded
3. For all .fli files the code now will:
    
- Remove unwanted rows
- Extract data from the header rows (polariser row and analyser row)
- Remove unwanted columns
- Set a time reference with the first measurement row. All other time values get referenced with respect to this moment in time and converted into seconds.
- Ignore all Miller index combinations that are not integers
- Run through all Miller index combinations until one passes all the filters defined in previous code cells
- Plot the successful experiments
- Save two files for each experiment. One with the header rows and another one with just the numeric rows (with a new header that explains what each column has)
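
The first steps of this summary (finding the header pairs, skipping rows that precede the first header and referencing all times to the first measurement) can be sketched as follows. This is a simplified illustration, not the notebook's implementation: it assumes whitespace-separated rows in the raw file, and the names `split_into_chunks` and `delta_seconds` are hypothetical. The two measurement rows and the header rows are taken from the table above; the first row is made up to show how data before the first header is skipped.

```python
from datetime import datetime

HEADER_PREFIXES = ("polariser cell info", "analyser cell info")

def split_into_chunks(lines):
    """Group the data rows under the polariser/analyser header pair that precedes them."""
    chunks, current, i = [], None, 0
    while i < len(lines):
        line = lines[i].strip()
        nxt = lines[i + 1].strip() if i + 1 < len(lines) else ""
        is_pair = (
            line.startswith(HEADER_PREFIXES)
            and nxt.startswith(HEADER_PREFIXES)
            and line.split()[0] != nxt.split()[0]  # one polariser and one analyser, any order
        )
        if is_pair:                                # a new cell has been installed
            current = {"header": [line, nxt], "rows": []}
            chunks.append(current)
            i += 2
            continue
        # Data rows start with the measurement number; rows before any header are skipped
        if current is not None and line and line.split()[0].isdigit():
            current["rows"].append(line.split())
        i += 1
    return chunks

def delta_seconds(rows):
    """Seconds elapsed since the first row (date in column 5, time in column 6)."""
    stamps = [datetime.strptime(f"{r[4]} {r[5]}", "%d/%m/%y %H:%M:%S") for r in rows]
    return [int((s - stamps[0]).total_seconds()) for s in stamps]

sample = [
    # Made-up row placed before the first header: it must be skipped
    "40660 3.000 3.000 3.000 21/11/23 12:40:00 0.00 0.7000 0.0020 5.6667 0.0500 120.00",
    "polariser cell info ge18004 pressure/init. polar 2.30 0.79 initial date/time 21/11/23 @ 12:45",
    "analyser cell info sic1402 pressure/init. polar 2.00 0.79 initial date/time 21/11/23 @ 12:45",
    "40661 3.000 3.000 3.000 21/11/23 12:50:35 0.00 0.7890 0.0020 8.4795 0.0897 120.00",
    "40662 3.000 3.000 3.000 21/11/23 12:54:43 0.00 0.7851 0.0020 8.3048 0.0867 120.00",
]

chunks = split_into_chunks(sample)
deltas = delta_seconds(chunks[0]["rows"])
print(len(chunks), deltas)
```

Each chunk corresponds to one installed cell, so a file with several header pairs yields several chunks that can be filtered independently.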

For every successful experiment we will output:
1. Image:  "PolarizationD3\_{folder\_name}\_{DD}/{MM}/{YY}\_{i}\_MillerIndex\_{PrettyCombination}\_ExtendedArea.png" in AmorphousPlotResults. Shows the plot with the extended area with the raw data
2. Txt:    "PolarizationD3\_{folder\_name}\_{DD}/{MM}/{YY}\_{i}\_MillerIndex\_{PrettyCombination}.txt" in AmorphousMLDataBase. It contains the four data columns (DeltaTime, PolarizationD3, SoftPolarizationD3, ErrPolarizationD3)
3. Txt:    "PolarizationD3\_{folder\_name}\_{DD}/{MM}/{YY}\_{i}\_MillerIndex\_{PrettyCombination}\_Parameters.txt" in AmorphousMLDataBase. It contains the parameters (CellID, Pressure, LabPolarization, LabTime)

These plots are not necessary but are saved for the user to know what all the files look like.
The files that are wrong or useless when all is done are the following:
1. Txt:    "{folder\_name}\_Arrays\_{i}.txt" in SeparatedFolder/{folder\_name}. It still has the header and useless columns. It is the fli file of every chunk of every recorded experiment (correct or incorrect)
2. Folder: "BadTest" contains all the graphs of the data sets that were considered not worthy but had more points than the ones saved. Check them if your experiment was not properly added

Finally, it erases all intermediate files and prepares the remaining ones for the ML pipeline

1. Removes all .fli files that have been created.
2. Removes empty folders
3. Collects all unique polariser–analyser ID pairs
 
As a result, the only useful files are _AmorphousPolariserAndAnalyser\_IDs.txt_ and the folder _AmorphousMLDataBase_

## 1. Libraries

In [None]:
%reset -f

import os
import shutil
import zipfile
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob
from datetime import datetime
from pathlib import Path
from scipy.signal import savgol_filter
from scipy.optimize import curve_fit
from collections import defaultdict
import hashlib

## 2. Auxiliary Functions and log file creation

1. _PrintDebug_ is a flag that allows the code to output on screen all the steps. If it is set to False, it won't show anything. However, all information will be properly logged whether this flag is set to True or False. The name of the log is determined by the variable *log_file_path*. The code runs faster if it is set to False.

2. _ShowPlot_ is a similar flag that allows the code to show on screen all plots that are being produced. They are all stored independently of whether this flag is True or False. The code runs faster if it is set to False.

3. **log_message** is a function used for writing to the log file

4. **long_path** is a function that "fixes" directory paths

In [None]:
PrintDebug = False 
ShowPlot = False 
# Initialize log file at the start of the script
log_file_path = os.path.join(".", "AmorphousLog_Testing_Creation.txt")
with open(log_file_path, 'w', encoding='utf-8') as log_file:
    log_file.write("=== Log started ===\n")

def log_message(message):
    """
    Arguments: 
        message (string): The text that will be logged
    
    Returns:
        None
        
    Notes:
        It will write the string "message" in the log file.
        If PrintDebug==True then it will also print the string
    """
    message = str(message)
    if PrintDebug:
        print(message)
    with open(log_file_path, 'a', encoding='utf-8') as log_file:
        log_file.write(str(message) + "\n")

################################################################

def long_path(path):
    """
    Arguments:
        path (path): The path that needs to be converted

    Returns:
        The updated path string or path depending on the platform used

    Notes:
        To avoid the 260-character limit on Windows paths, a special "prefix" is added.
        It also unifies how directories are managed.
        On Linux and Mac the resolved path is returned unchanged.

    """
    # Convert to Path and resolve to absolute
    path = Path(path).resolve()

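    # Windows only: add the extended-length prefix unless it is already present
    if os.name == "nt":
        path_str = str(path)
        if not path_str.startswith("\\\\?\\"):
            if path_str.startswith("\\\\"):
                path_str = "\\\\?\\UNC\\" + path_str[2:]
            else:
                path_str = "\\\\?\\" + path_str
        # Always return a string on Windows, even if the prefix was already there
        return path_str

    return path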

## 3. Functions

1. **Time** is a function that converts time to a universal format

2. **deltatime** is a function that computes the difference, in seconds, between two points in time

3. **format_combination** is just an aesthetic change in the Miller index combination variable

4. **sanitize** is a function that replaces any "illegal" characters in a directory path

5. **savgol_params_func** is a function that ensures that the window length is odd and large enough for the polynomial order.

6. **Overall_Decrease** is a function that checks if the linear approximation of the data set has a negative slope

7. **filter_best_combination** is a function that discards all problematic data sets (negative polarization, small sets of data) and also uses **Overall_Decrease**

8. **RemoveOutcast_FixUncertainty** is a function that removes points that are clear outliers, corrects the underestimation of experimental uncertainty and plots successful experiments

9. **ensure path** is a function that makes sure a file path is Windows-safe, its folders exist, and the file exists, without crashing due to long paths or missing directories.
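
The window rule used by **savgol_params_func** can be tried in isolation with the sketch below. It restates the same rule with the two settings passed as arguments instead of being read from notebook variables. The default values 11 and 2, the name `savgol_params` and the synthetic noisy decay are assumptions made only for this example.

```python
import numpy as np
from scipy.signal import savgol_filter

def savgol_params(n_points, default_window_length=11, polyorder=2):
    """Same rule as savgol_params_func, with the two settings passed explicitly."""
    window_length = min(default_window_length, n_points)
    if window_length % 2 == 0:          # the window must be odd
        window_length -= 1
    if window_length < polyorder + 2:   # and large enough for the polynomial order
        window_length = polyorder + 2
        if window_length % 2 == 0:
            window_length += 1
    return {"window_length": window_length, "polyorder": polyorder}

# Synthetic decay with noise (made up for this example)
rng = np.random.default_rng(0)
t = np.arange(20) * 1000.0
noisy = 0.8 * np.exp(-t / 2.0e5) + rng.normal(0.0, 0.002, t.size)

params = savgol_params(len(noisy))
smooth = savgol_filter(noisy, **params)
print(params, smooth.shape)
```

For very short data sets the rule can return a window that is longer than the data (5 for three points with these example settings), in which case `savgol_filter` is expected to reject the call. The notebook guards against this with a `try`/`except` that falls back to the unsmoothed values.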

In [None]:
def Time(Day_Ref, Hour_Ref):
    """
    Arguments: 
        Day_Ref (str): 'DD/MM/YY' a.k.a Day/Month/Year
        Hour_Ref (str): 'HH:MM' or 'HH:MM:SS' a.k.a Hour:Minute:Second
        
    Returns:
        A 'datetime' object with format (year, month, day, hour, minute, second).
        
    Notes:
        If there is no information about the seconds, they will be considered 0
    """
    
    match = re.match(r"(\d+)/(\d+)/(\d+)", Day_Ref)
    if match:
        DD = int(match.group(1))
        MM = int(match.group(2))
        YY = int(match.group(3))
    else:
        raise ValueError(f"Invalid date format: {Day_Ref}")

    match = re.match(r"(\d+):(\d+):(\d+)", Hour_Ref)
    if match:
        Hour = int(match.group(1))
        Minute = int(match.group(2))
        Second = int(match.group(3))
    else:
        # If seconds are missing, try HH:MM
        match = re.match(r"(\d+):(\d+)", Hour_Ref)
        if match:
            Hour = int(match.group(1))
            Minute = int(match.group(2))
            Second = 0
        else:
            raise ValueError(f"Invalid time format: {Hour_Ref}")

    return datetime(YY + 2000 if YY < 100 else YY, MM, DD, Hour, Minute, Second)

################################################################


def deltatime(AIni,BIni, AFin,BFin):
    """
    Arguments: 
        AIni (str): 'DD/MM/YY' a.k.a Day/Month/Year for the initial time
        BIni (str): 'HH:MM' or 'HH:MM:SS' a.k.a Hour:Minute:Second for the initial time
        AFin (str): 'DD/MM/YY' a.k.a Day/Month/Year for the final time
        BFin (str): 'HH:MM' or 'HH:MM:SS' a.k.a Hour:Minute:Second for the final time
    Returns:
        The time variation in seconds
        
    Notes:
        Requires the function "Time"
    """

    time1 = Time(AIni, BIni)
    time2 = Time(AFin, BFin)
    return int((time2 - time1).total_seconds())

#################################################################

def format_combination(comb):
    """
    Arguments: 
        comb (float, float, float): A set of three floats characterizing the Miller indices.

    Returns:
        A string such as "(3,3,3)" with each float truncated to an integer (3.0 -> 3), or "(None)" if comb is None.
    """
    
    if comb is None:
        return "(None)"
    ints = tuple(int(float(x)) for x in comb)
    return f"({','.join(map(str, ints))})"

###############################################################

def sanitize(name):
    """
    Arguments: 
        name (str): Directory string 
    Returns:
        The same string but with the characters < > : " / \\ | ? * converted to _
    """
    
    return re.sub(r'[<>:"/\\|?*]', '_', name)

###############################################################

def savgol_params_func(n_points):
    """
    Arguments: 
        n_points (int): Number of points to which the filter will be applied
    Returns:
        A dictionary containing valid parameters for a Savitzky–Golay filter.
    Notes:    
        It ensures that the window length is odd and large enough for the polynomial order.
        It uses the variables default_window_length and polyorder, which must be defined
        before the function is called.
    """
    window_length = min(default_window_length, n_points)
    if window_length % 2 == 0:
        window_length -= 1
    if window_length < polyorder + 2:
        window_length = polyorder + 2
        if window_length % 2 == 0:
            window_length += 1
    return {'window_length': window_length, 'polyorder': polyorder}

###############################################################

def Overall_Decrease(df_filtered):
    """
    Arguments: 
        df_filtered (pandas object): It is something like this:
                1    2    3  PolarizationD3  ErrPolarizationD3  DeltaTime  
            0  3.0  3.0  3.0          0.5552             0.0022          0   
            1  3.0  3.0  3.0          0.5522             0.0021        279   
            2  3.0  3.0  3.0          0.5464             0.0022       4148
            ...
        
    Returns:
        True if df_filtered shows a decay and False otherwise
        
    Notes:    
        df_filtered must contain the columns DeltaTime and SoftPolarizationD3.
        First, it extracts the smoothed polarization and time arrays. 
        Second, it obtains the best linear fit.
        If the slope is positive the data set is discarded; otherwise it is accepted.
        Amorphous experiments are more stable than the crystalline ones. Therefore, fewer checks are needed.
    """
    def linear_func(x, m, n):
        return m * x + n


    x = df_filtered["DeltaTime"].values
    y = df_filtered["SoftPolarizationD3"].values
    
    try:
        popt, _ = curve_fit(linear_func, x, y)
        m, n = popt
        log_message(f"      Linear fit slope m={m:.4e}, intercept n={n:.4f}")
    except Exception as e:
        log_message(f"      Error fitting data: {e}")
        return False

    if m > 0:
        log_message(f"      Overall slope is positive. Can't be polarization information. Skipping Combination")
        return False
    else:
        return True  

###############################################################

def filter_best_combination(i, df):
    """
    Arguments: 
        1. i (int): The chunk number a.k.a the (ordinal) number of the Miller index combination
        2. df (pandas object): It is something like this: (NaNs are intended)
                  1    2    3  PolarizationD3 ErrPolarizationD3  12   13   14  DeltaTime
            0   3.0  3.0  3.0          0.5383            0.0021 NaN  NaN  NaN          0
            1   3.0  3.0  3.0          0.5379            0.0021 NaN  NaN  NaN        147
            2   3.0  3.0  3.0          0.5315            0.0022 NaN  NaN  NaN       3919
            ...

    Returns:
        1. A filtered df object like this one:
                  1    2    3  PolarizationD3 ErrPolarizationD3 DeltaTime
            0   3.0  3.0  3.0          0.5383            0.0021         0
            1   3.0  3.0  3.0          0.5379            0.0021       147
            2   3.0  3.0  3.0          0.5315            0.0022      3919
            ...
        2. A string with the accepted Miller index combination, for example "(3,3,3)".
        If no combination passes all the tests, it returns (None, None).
        
    Notes:    
        It uses the variables FileName, DD, MM, YY and BadFilesFolder, which must be defined before the function is called.
        Combinations are tried from the most frequent to the least frequent. For each one it formats the
        Miller index combination (format_combination) and then runs the following tests:
            1. Check there is data
            2. Check that the polarization column is present and convert all columns to numbers
            3. Check that no polarization value is negative. If any is, skip that Miller index combination
            4. Check that there are at least three rows of data. If there are not, skip that Miller index combination
            5. Apply the Savitzky-Golay filter and store the result in the new column SoftPolarizationD3
            6. Check if the filtered df object passes the Overall_Decrease
            
    """
    
    filter_func=savgol_filter #Only tested for 'savgol_filter'
    filter_params_func=savgol_params_func #Only tested for the previously defined function 'savgol_params_func'
    min_points_required=3  #Minimum points needed for the filter to work (3 for Savitzky-Golay)
    tolerance=1e-8 #Tolerance to decide if the filtered value is worth keeping
    filter_column_idx=df.columns.get_loc('PolarizationD3')
    time_column_idx=df.columns.get_loc('DeltaTime')
    error_column_idx=df.columns.get_loc('ErrPolarizationD3')
    new_column_name='SoftPolarizationD3'
    folder_name = FileName.replace(".fli", "")

    # Group by first three columns (Miller indices)
    combination_counts = (
        df.groupby([df.columns[0], df.columns[1], df.columns[2]])
        .size()
        .sort_values(ascending=False)
    )
    log_message(f"Analyzing combinations in file: {folder_name}_Array_{i}.fli")
    
    #Read those three numbers from the .fli file
    for comb, count in combination_counts.items(): 
        log_message(f"Combination {comb} occurs {count} times in file {folder_name}.fli. Trying this combination")
        mask = (
            (df.iloc[:,0] == comb[0]) &
            (df.iloc[:,1] == comb[1]) &
            (df.iloc[:,2] == comb[2])
        )
        PrettyCombination = format_combination(comb)
        filtered_df = df.loc[mask].copy()
        
        
        # Requisites for the Combination to be valid:
        # Requisite 1: The combination has data rows
        if filtered_df.empty:
            log_message(f"      {PrettyCombination} has no data")
            continue
        log_message(filtered_df)
        
        filtered_df = filtered_df.drop(filtered_df.columns[5], axis=1) #NaNs are eliminated
        filtered_df = filtered_df.drop(filtered_df.columns[5], axis=1)
        filtered_df = filtered_df.drop(filtered_df.columns[5], axis=1)
        
        
        # Requisite 2: Check if data column exists
        if filtered_df.shape[1] <= filter_column_idx:
            log_message(f"      Expected column index {filter_column_idx} not found. Skipping combination {PrettyCombination}")
            continue
        
        # Convert to numeric all columns (all columns are considered as object type)
        filtered_df = filtered_df.apply(pd.to_numeric, errors='coerce')
        filtered_df = filtered_df.dropna()  # drops any rows with NaNs introduced by coercion (last line)

        # Check dtypes
        all_numeric = all(dtype.kind in ('f', 'i') for dtype in filtered_df.dtypes)
        
        if all_numeric:
            log_message(f"      All columns have been successfully converted to numbers.")
        else:
            log_message(f"      Not all columns are numbers. Current dtypes:")
            log_message(f"      {filtered_df.dtypes}")
            log_message(f"      Expect Error Message from Python. Perhaps removing this file might be wise unless all files have the same issue")
        if filtered_df.empty:
            log_message(f"      All rows dropped after conversion to numeric. Skipping combination {PrettyCombination}")
            continue
        
        # Requisite 3: Polarization is ALWAYS positive. If any is negative, that is not a polarization. Immediately sent to the Bad Files Folder
        if (filtered_df.iloc[:, filter_column_idx] < 0).any():
            filename = f"PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}_Filtered.txt"
            badfile_subfolder = BadFilesFolder / folder_name
            badfile_subfolder.mkdir(parents=True, exist_ok=True)
            badfiles_txt_path = badfile_subfolder / filename
            filtered_df.to_csv(long_path(badfiles_txt_path), index=False, sep='\t')
            log_message(f"      {PrettyCombination} has negative polarization values. Sent to BadFiles with name {filename}. Skipping to next Combination")
            continue

        
        # Requisite 4: Have at least three rows (otherwise we can't teach the ML algorithm anything, although three is almost useless too).
        if len(filtered_df) < min_points_required:
            filename = f"PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}_Filtered.txt"
            badfile_subfolder = BadFilesFolder / folder_name
            badfile_subfolder.mkdir(parents=True, exist_ok=True)
            badfiles_txt_path = badfile_subfolder / filename

            filtered_df.to_csv(long_path(badfiles_txt_path), index=False, sep="\t")

            log_message(f"      {PrettyCombination} has only {len(filtered_df)} rows (< {min_points_required}). Sent to BadFiles with name {filename}. Skipping to next Combination")
            continue


        # "Requisite 5": Be worthy of having the filter used.
        # If the filter can't be used, the "soft" correction will have the same data as the original
        y = filtered_df.iloc[:, filter_column_idx].values
        filter_params = filter_params_func(len(y))
        try:
            y_filtered = filter_func(y, **filter_params)
            diff = np.abs(y - y_filtered)
            changed_count = np.sum(diff > tolerance)
            filtered_df[new_column_name] = y_filtered
            if changed_count > 0:
                log_message(f"      Filter changed {changed_count}/{len(y)} points. Adding column '{new_column_name}'.")
            else:
                log_message(f"      Filter applied but data unchanged. Adding '{new_column_name}' as duplicated values.")

        except Exception as e:
            log_message(f"      Error applying filter to combination {comb}: {e}")
            log_message(f"      Adding '{new_column_name}' as duplicated values to proceed anyway.")
            # Just duplicate the original column
            y_filtered = y.copy()
            filtered_df[new_column_name] = y_filtered

            
            
        # Requisite 6: Use Overall_Decrease to test the data set
        if Overall_Decrease(filtered_df):
            log_message(f"      {PrettyCombination} has passed all tests. Proceeding with it.")
            return filtered_df, PrettyCombination
        else:
            log_message(f"      {PrettyCombination} failed the Overall_Decrease test. Trying next combination.")
            continue
            
    return None, None


###############################################################

def RemoveOutcast_FixUncertainty(df_filtered, PrettyCombination, filename, AcceptableMultiplier=2.0):
    """
    Arguments: 
        1. df_filtered (pandas object): It is something like this:
                  1    2    3  PolarizationD3 ErrPolarizationD3 DeltaTime
            0   3.0  3.0  3.0          0.5383            0.0021         0
            1   3.0  3.0  3.0          0.5379            0.0021       147
            2   3.0  3.0  3.0          0.5315            0.0022      3919
            ...
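        2. PrettyCombination (str): The Miller index combination as formatted by format_combination. For example "(3,3,3)"
        3. filename (str): The name used to save the data. For example: f"PolarizationD3_{folder_name}_{DD}/{MM}/{YY}_{i}_MillerIndex_{PrettyCombination}"
        4. AcceptableMultiplier (float): Factor that multiplies N to define the extended band. Points outside it are discarded. Default 2.0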

    Returns:
        1. A filtered df object like this one:
                  1    2    3  PolarizationD3 ErrPolarizationD3 DeltaTime
            0   3.0  3.0  3.0          0.5383            0.0021         0
            1   3.0  3.0  3.0          0.5379            0.0021       147
            2   3.0  3.0  3.0          0.5315            0.0022      3919
            ...
        2. An object (int,int,int) with the adequate Miller index combination.
        
    Notes:    
        The function does three different jobs.
            1. It removes all points that are outliers.
                First, it computes the best linear fit to the data P=m·t+n
                Second, it finds the smallest value of N such that 75% of the points are in the region -N + n + m·t < P < m·t + n + N
                Third, it erases all points outside this range: -N·AcceptableMultiplier + n + m·t < P < m·t + n + N·AcceptableMultiplier
            2. It rescales the uncertainties of the remaining points
                If the decay is assumed to be a smooth curve, the scatter of the points shows that the uncertainties are underestimated.
                They are multiplied by the square root of the reduced chi-squared of a weighted linear fit whenever that factor is larger than 1.
            3. It plots the successful polarization decay
                Two plots are saved, one with the original data and another with the filtered data (both with the corrected uncertainty)    
    """
  
    output_folder = Path.cwd() / "AmorphousPlotResults"
    output_folder.mkdir(parents=True, exist_ok=True) 


    x = df_filtered['DeltaTime'].values
    y_hard = df_filtered['PolarizationD3'].values
    y_soft = df_filtered['SoftPolarizationD3'].values

    #Linear fit
    def linear_func(x, m, n):
        return m * x + n

    try:
        popt, _ = curve_fit(linear_func, x, y_hard)
        m, n = popt
    except Exception as e:
        log_message(f"Error fitting data: {e}")
        return df_filtered  # Return original if error

    #Find smallest N for 75% within band
    num_points = len(x)
    sorted_idx = np.argsort(x)
    x_sorted = x[sorted_idx]
    y_sorted = y_hard[sorted_idx]

    N_start = 0.0001
    N_step = 0.0001
    N_max = 0.4
    N = N_start
    needed_N = None

    while N <= N_max:
        y_fit = linear_func(x_sorted, m, n)
        upper = y_fit + N
        lower = y_fit - N
        inside = np.logical_and(y_sorted <= upper, y_sorted >= lower)
        percent_inside = np.sum(inside) / num_points * 100

        if percent_inside >= 75:
            needed_N = N
            break
        N += N_step

    if needed_N is None:
        log_message(f"    [{filename}] No N found to contain 75% within ±{N_max}")
        return df_filtered  # Return original if no good N is found (has never happened but just in case)

    # Compute extended band and filter out the points outside the extended band
    y_fit_full = linear_func(x, m, n)
    upper_band = y_fit_full + needed_N * AcceptableMultiplier
    lower_band = y_fit_full - needed_N * AcceptableMultiplier

    mask = np.logical_and(y_hard <= upper_band, y_hard >= lower_band)
    df_cleaned = df_filtered[mask].copy()
    log_message(f"    [{filename}] Filtering kept {np.sum(mask)} of {len(mask)} rows (±{needed_N * AcceptableMultiplier:.2e})")

    # Rescale uncertainties using reduced chi-squared 
    sigma = df_filtered['ErrPolarizationD3'].values
    popt, pcov = curve_fit(linear_func, x, y_hard, sigma=sigma, absolute_sigma=True)
    m_fit, n_fit = popt
    m_err, n_err = np.sqrt(np.diag(pcov)) #Covariance matrix has on its diagonal Cov(X_j,X_j) which is the variance so its square root is the uncertainty
    
    # Reduced chi-squared of the weighted fit
    residuals = (y_hard - linear_func(x, *popt)) / sigma
    dof = len(x) - len(popt)
    # With no degrees of freedom left the reduced chi-squared is undefined, so no correction is applied
    chi_squared_red = np.sum(residuals**2) / dof if dof > 0 else np.nan
    correction_factor = np.sqrt(chi_squared_red)
    
    # Inflate the uncertainties only when the fit indicates they are underestimated
    if correction_factor > 1:
        df_cleaned['ErrPolarizationD3'] *= correction_factor
        log_message(f"    [{filename}] Applied uncertainty correction factor: √(χ²_red) = {correction_factor:.2f}")
    else:
        log_message(f"    [{filename}] No correction applied: √(χ²_red) = {correction_factor:.2f}")
    
    # Optional: log the fit results
    log_message(f"    [{filename}] Fit results: m = {m_fit:.3e} ± {m_err:.3e}, n = {n_fit:.4f} ± {n_err:.4f}")


    def make_clean_name(filename: str) -> str:
        """
        Turn e.g.
          PolarizationD3_CaFeAl_13_7_6_24_2_MillerIndex_(0,0,2)_Filtered.txt
        into:
          CaFeAl_13_7_6_24_2_(0,0,2)
        and handle cases where filename contains '/' or '\\' (dates like DD/MM/YY).
        """
        s = str(filename).replace("/", "_").replace("\\", "_")  # prevent path splitting
        base = Path(s).stem  
        if base.startswith("PolarizationD3_"):
            base = base[len("PolarizationD3_"):]
        if base.endswith("_Filtered"):
            base = base[:-len("_Filtered")]
        base = base.replace("MillerIndex_", "")
        return base
    
    def extended_area_plot_filename(filename: str) -> str:
        """EuAgAs_5_31_10_23_0_(3,0,0)_ExtendedArea.png"""
        return f"{make_clean_name(filename)}_ExtendedArea.png"
    # Extract values
    # Data
    T = df_filtered['DeltaTime'].values
    P_soft = df_filtered['SoftPolarizationD3'].values
    P_hard = df_filtered['PolarizationD3'].values
    Err = df_filtered['ErrPolarizationD3'].values if 'ErrPolarizationD3' in df_filtered.columns else np.zeros_like(P_soft)
    
    # Clean title + filename
    clean = make_clean_name(filename)
    save_name = extended_area_plot_filename(filename)  # ends with _ExtendedArea.png
    
    plt.figure(figsize=(10, 5))
    
    # Black points with error bars
    plt.scatter(T, P_hard, s=30, color="black", label="PolarizationD3", marker='o')
    if Err is not None:
        plt.errorbar(T, P_hard, yerr=Err, fmt='none', ecolor='black', alpha=0.6, capsize=2)
    
    # Blue linear fit
    fit = linear_func(T, m, n)
    plt.plot(T, fit, '-', color='blue', label="Linear Fit")
    
    # Bands: light blue (±N) and translucent green (±N*AcceptableMultiplier)
    if needed_N is not None:
        upper_narrow = fit + needed_N
        lower_narrow = fit - needed_N
        upper_wide   = fit + needed_N * AcceptableMultiplier
        lower_wide   = fit - needed_N * AcceptableMultiplier
    
        plt.fill_between(T, lower_narrow, upper_narrow, color='lightblue', alpha=0.35, label=f'Band ±{needed_N:.2e}')
        plt.fill_between(T, lower_wide,   upper_wide,   color='green',     alpha=0.18, label=f'Filter Band ±{(needed_N*AcceptableMultiplier):.2e}')
    
    # Labels
    plt.xlabel("DeltaTime")
    plt.ylabel("PolarizationD3")
    plt.title(f"{clean}_ExtendedArea")  # title matches saved name (without .png)
    plt.grid(True, linestyle='--', alpha=0.5)
    plt.legend()
    plt.tight_layout()
    vals_min = [np.nanmin(P_hard - Err), np.nanmin(P_hard + Err)]
    vals_max = [np.nanmax(P_hard - Err), np.nanmax(P_hard + Err)]
    if needed_N is not None:
        vals_min += [np.nanmin(lower_narrow), np.nanmin(lower_wide)]
        vals_max += [np.nanmax(upper_narrow), np.nanmax(upper_wide)]
    ymin = np.nanmin(vals_min)
    ymax = np.nanmax(vals_max)
    pad = 0.02 * (ymax - ymin if np.isfinite(ymax - ymin) and (ymax - ymin) > 0 else 1.0)
    plt.ylim(ymin - pad, ymax + pad)
    plot_path_hard = output_folder / save_name
    plot_path_hard.parent.mkdir(parents=True, exist_ok=True)
    plt.savefig(long_path(plot_path_hard), dpi=300, bbox_inches='tight')
    plt.close()
    return df_cleaned


def ensure_file(path):
    """
    Arguments:
        path (str or Path): Path to a file
        
    Output:
        A Path object pointing to that file
    Note:
        It makes sure that the parent folders and the file itself exist,
        creating them when they are missing, so that later reads or appends
        do not crash because of a missing directory or file.
        It works on any environment
    """
    path = Path(path)
    if path.parent != Path('.'):
        path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():
        path.touch()

    return path


## 4. Clean old files and force certain experiments

Some parts of the code might use data from different sessions. It is safer to erase them and create all files from scratch every time. This is not a big deal because this code file should only be run once unless the database changes.


In [None]:
to_erase = [
    "AmorphousLog_Testing_Creation.txt",
    "AmorphousPolariserAndAnalyser_IDs.txt",
    "AmorphousSeparatedFolder",
    "AmorphousPlotResults",
    "AmorphousMLDataBase",
    "AmorphousFailuresTest",
    "AmorphousDataBase",
    "AmorphousBadFiles",
    "AmorphousFailuresFiles"
]
for item in to_erase:
    path = os.path.abspath(item) 
    if os.path.exists(path):
        try:
            if os.path.isfile(path):
                os.remove(path)
                log_message(f"Deleted file: {path}")
            elif os.path.isdir(path):
                shutil.rmtree(path)
                log_message(f"Deleted folder: {path}")
        except Exception as e:
            log_message(f" Could not delete {path}: {e}")
    else:
        log_message(f"Not found (skipped): {path}")
        

## 5. ZIP Folder Treatment and .fli data extraction

The code takes all zipped folders from the folder _D3Files_ and prepares them so that their .fli files can be extracted.

First, it checks whether there are duplicate zip folders. Duplicates are identified by their SHA-256 hash, so two archives with identical content count as duplicates even if their names differ. Duplicate folders are erased. For more information about SHA-256 see, for example:
>Wikipedia contributors. (2026, January 2). SHA-2. In Wikipedia, The Free Encyclopedia. Retrieved 10:49, January 17, 2026, from https://en.wikipedia.org/w/index.php?title=SHA-2&oldid=1330753870

Second, it extracts the contents of each zipped folder into a new folder inside _D3Files_ named after the zip file. The zipped folders are not erased at this step, so the information is temporarily duplicated.

Third, it looks for all .fli files inside all the unzipped folders, whether they come from a zipped file or not, and copies them to the newly created folder _AmorphousDataBase_. If a .fli file with the same name is already there, the new copy gets a numeric suffix ('_1', '_2',...) to distinguish it. Afterwards, all unzipped folders are erased, leaving behind only the non-duplicated zipped folders.

Note: All unzipped folders in D3Files will be explored; however, they are erased at the end of this cell. If you want them to persist for future runs of the code, they should be zipped first. For ILL users, when navigating the ILL Cloud, the easiest way to prepare the zip files is to download the _processed_ folder for each experiment proposal. The code is "smart" enough to only process .fli files with polarization information. Therefore, there is no need to manually prepare anything.
To be precise, this code cell removes duplicate zip files (identified by their SHA-256 hash), extracts the remaining ones and collects their .fli files.
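
As an illustration of the duplicate check, here is a minimal sketch using only the standard library. The helper names `sha256_of` and `find_duplicates` and the file names are hypothetical, and the files contain dummy bytes rather than real zip archives. It shows that two files with different names but identical content share the same SHA-256 digest.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, block_size=65536):
    """Hash a file block by block so large archives never need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()


def find_duplicates(paths):
    """Return (duplicate, first_seen) pairs for files with identical content."""
    seen = {}
    duplicates = []
    for p in sorted(paths):
        digest = sha256_of(p)
        if digest in seen:
            duplicates.append((p, seen[digest]))
        else:
            seen[digest] = p
    return duplicates


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Hypothetical file names; dummy contents stand in for real archives
    (tmp / "exp_A.zip").write_bytes(b"same bytes")
    (tmp / "exp_A (1).zip").write_bytes(b"same bytes")
    (tmp / "exp_B.zip").write_bytes(b"other bytes")
    dupes = [(d.name, first.name) for d, first in find_duplicates(tmp.iterdir())]
    print(dupes)
```

The sketch only reports the duplicates; deleting them (as the pipeline does) is left out so that it is safe to run anywhere.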

In [None]:
""" DUPLICATION TREATMENT """
def file_hash(filepath, algo="sha256", block_size=65536):
    """Compute hash of a file (default SHA256)."""
    h = hashlib.new(algo)
    with open(filepath, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()

folder = Path("D3Files")  # Folder where all the raw data folders reside
zip_files = [f for f in folder.iterdir() if f.suffix.lower() == ".zip"]  # List all zip files

log_message(f"Reading ZIP files. Checking for true duplicates by content...")
base_names = set()
seen_hashes = {}

for zip_file in zip_files:
    filehash = file_hash(zip_file)
    name = zip_file.stem

    if filehash in seen_hashes:
        log_message(f"   Duplicate confirmed by hash! Removing: {zip_file.name} (same as {seen_hashes[filehash].name})")
        zip_file.unlink()   # delete the file
    else:
        seen_hashes[filehash] = zip_file
        base_names.add(name)

log_message(f"\nAll duplicates (by content) removed. Begin unzipping...\n")

######################################################################################

""" UNZIPPING """

log_message(f"Begin unzipping...\n")

# Refresh zip_files list after removals
zip_files = [f for f in folder.iterdir() if f.suffix.lower() == ".zip"]

# Unzip every valid zip file (the zip files themselves are kept)
for zip_file in zip_files:
    if zipfile.is_zipfile(zip_file):
        folder_name = sanitize(zip_file.stem)
        extract_dir = folder / folder_name
        log_message(f"   Unzipping: {zip_file.name} -> {extract_dir}")
        try:
            with zipfile.ZipFile(zip_file, 'r') as zip_ref:
                zip_ref.extractall(extract_dir)
        except Exception as e:
            log_message(f"   WARNING: Error extracting {zip_file.name}: {e}")
    else:
        log_message(f"   WARNING: Skipping invalid zip file: {zip_file.name}")

log_message(f"\nFinished unzipping. Experiments stored in individual folders next to the zip files\n")


######################################################################################


""" .fli FILE EXTRACTION """

source_folder = folder  # D3Files
database_folder = Path("AmorphousDataBase")
database_folder.mkdir(exist_ok=True)

# Scan all items in source_folder to find .fli files
log_message(f"\nScanning all folders for .fli files...\n")
for item_path in source_folder.iterdir():
    if item_path.is_dir():
        log_message(f"   Processing folder: {item_path.name}")

        # Find all .fli files inside this folder (including subfolders and subsubfolders, etc)
        for root, dirs, files in os.walk(item_path):
            root_path = Path(root)
            for file in files:
                if file.lower().endswith(".fli"):
                    src_file = root_path / file
                    dest_file = database_folder / file

                    # Handle duplicate names
                    counter = 1
                    base_name, ext = file.rsplit('.', 1)
                    while dest_file.exists():
                        dest_file = database_folder / f"{base_name}_{counter}.{ext}"
                        counter += 1

                    log_message(f"   Copying: {src_file} -> {dest_file}")
                    shutil.copy2(src_file, dest_file)

        # After processing all .fli files, delete the original folder
        log_message(f"   Deleting folder: {item_path}")
        shutil.rmtree(item_path)

# Logged once, after every folder has been processed
log_message(f"\nAll .fli files collected, sent from folder {source_folder} to folder {database_folder} and unzipped folders removed.\n")


## 6. Separation of .fli files according to experiments

Some .fli files have the wrong structure, which means that they are not polarization measurements, and those that are polarization files may have used more than one polarization cell. Therefore, we need to remove all the non-polarization data sets and separate the good .fli files according to the polarization cell they used.

For every .fli file we read the contents and look for the header (two text rows in two consecutive lines), which marks the installation of a new polarizer cell. If there are numerical rows before the first header, the file started being saved before the cell was swapped, and those data rows are skipped. A correct .fli file has the following structure:

|  |  |  |  |  |  |  |  |  |  |  |  |
|--|--|--|--|--|--|--|--|--|--|--|--|
| polariser cell info | ge18004 | pressure/init. polar | 2.30 | 0.79 | initial date/time | 21/11/23 | @ | 12:45 |
| analyser cell info  | sic1402 | pressure/init. polar | 2.00 | 0.79 | initial date/time | 21/11/23 | @ | 12:45 |
| 40661 | 3.000 | 3.000 | 3.000 | 21/11/23 | 12:50:35 | 0.00 | 0.7890 | 0.0020 | 8.4795 | 0.0897 | 120.00 |
| 40662 | 3.000 | 3.000 | 3.000 | 21/11/23 | 12:54:43 | 0.00 | 0.7851 | 0.0020 | 8.3048 | 0.0867 | 120.00 |
|  ...  |       |       |       |          |          |      |        |        |        |        |        |

Which corresponds to the following information for the first two rows:
1. 'polariser cell info'/'analyser cell info' (str): Log of the installation of the first and second polariser cells
2. 'PolariserID' (str): A string with the type of cell used
3. 'pressure/init. polar' (str): A string to introduce the $^\mathrm{3}$He gas pressure and the polarization measured at the creation lab.
4. 'PolariserPressure' (float): $^\mathrm{3}$He gas pressure in some units
5. 'InitialLabPolarization' (float): Polarization measured at the creation lab
6. 'initial date/time' (str): A string that introduces the day, month and year and the hour and minutes.
7. 'Date' (str): A string with the information DD/MM/YY
8. '@' (str): A string to separate date and time
9. 'time' (str): A string with the information HH:MM

And for the rest of the rows:
1. 'Measurement number' (int): The number index of the measurement.
2. 'First_Miller_Index' (float): The first Miller index of the crystal. Polarization is measured using a known Si Bragg crystal. For the origin of the Si crystal see:
>Stunault, Anne & Vial, S & Pusztai, Laszlo & Cuello, Gabriel & Temleitner, László. (2016). Structure of hydrogenous liquids: separation of coherent and incoherent cross sections using polarised neutrons. Journal of Physics: Conference Series. 711. 012003. 10.1088/1742-6596/711/1/012003. 
3. 'Second_Miller_Index' (float): The second Miller index of the crystal
4. 'Third_Miller_Index' (float): The third Miller index of the crystal
5. 'Date' (str): A string with the information DD/MM/YY of that measurement
6. 'time' (str): A string with the information HH:MM:SS of that measurement
7. Unknown float, maybe Temperature
8. 'D3Polarization' (float): A float with the polarization measurement
9. 'ErrD3Polarization' (float): A float with the uncertainty of that polarization measurement
10. 'FlippingRatio' (float): The flipping ratio. Given either the flipping ratio or the polarization value, the other one is fully determined. Therefore, only one is needed, and that is why we don't work with the flipping ratio
11. 'ErrFlippingRatio' (float): The uncertainty of the flipping ratio
12. 'Elapsed time' (float): It is the time used to obtain the measurement (integration of the beam over that number of seconds)
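
To make the link between columns 8 and 10 concrete, the sketch below assumes the usual relation $P = (R-1)/(R+1)$ between polarization $P$ and flipping ratio $R$ (and its inverse $R = (1+P)/(1-P)$). This relation is an assumption, not something the .fli files state, and it ignores any instrument-specific corrections; it is only checked here against the two sample rows of the table above. The function names are hypothetical.

```python
def polarization_from_flipping_ratio(flipping_ratio):
    """Assumed relation P = (R - 1) / (R + 1)."""
    return (flipping_ratio - 1.0) / (flipping_ratio + 1.0)


def flipping_ratio_from_polarization(polarization):
    """Inverse of the assumed relation: R = (1 + P) / (1 - P)."""
    return (1.0 + polarization) / (1.0 - polarization)


# Data rows copied from the sample table above
sample_rows = [
    "40661 3.000 3.000 3.000 21/11/23 12:50:35 0.00 0.7890 0.0020 8.4795 0.0897 120.00",
    "40662 3.000 3.000 3.000 21/11/23 12:54:43 0.00 0.7851 0.0020 8.3048 0.0867 120.00",
]

checks = []
for row in sample_rows:
    fields = row.split()
    listed_polarization = float(fields[7])  # column 8: 'D3Polarization'
    flipping_ratio = float(fields[9])       # column 10: 'FlippingRatio'
    predicted = polarization_from_flipping_ratio(flipping_ratio)
    checks.append((listed_polarization, predicted))
    print(f"listed P = {listed_polarization:.4f}, P from R = {predicted:.4f}")
```

For both sample rows the value computed from the flipping ratio agrees with the listed polarization to four decimals, which is consistent with keeping only one of the two quantities.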

Temperature did not seem to have an effect on the decay. Therefore, it has been eliminated in this code cell. Here is a summary of what the code does:

1. The code goes through the .fli files and finds all combinations of consecutive rows with 'polariser cell info' and 'analyser cell info'. The order of the two rows does not matter, which makes the code more robust. We consider that a polariser cell has been properly installed when both of these rows are present and that the cell has been changed once a new set of polariser and analyser rows is encountered. At the moment it ignores the experiments that use the 'magical box' as we are not sure if they are compatible with the ones studied here
2. For every cell change, a new .fli file is created storing all the information, including both the polariser and analyser rows and the measured data rows. Also, all cell IDs are recorded
3. For all .fli files the code now will:
    
- Remove unwanted rows
- Extract data from the header rows (polariser row and analyser row)
- Remove unwanted columns
- Set a time reference with the first measurement row. All other time values get referenced with respect to this moment in time and converted into seconds.
- Ignore all Miller index combinations that are not integers
- Run through all Miller index combinations until one passes all the filters defined in previous code cells
- Plot the successful experiments
- Save two files for each experiment. One with the header rows and another one with just the numeric rows (with a new header that explains what each column has)
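
The time-reference and Miller-index steps in the list above can be sketched with the standard library alone. This is a simplified illustration, not the notebook's own `deltatime` helper: the names `parse_timestamp` and `is_integer_index` are hypothetical, the tolerance is purely absolute, and the third row is made up to show a rejection.

```python
from datetime import datetime


def parse_timestamp(date_str, time_str):
    """Parse 'DD/MM/YY' and 'HH:MM:SS' into a datetime."""
    return datetime.strptime(f"{date_str} {time_str}", "%d/%m/%y %H:%M:%S")


def is_integer_index(value, tol=1e-8):
    """True when a Miller index is an integer within an absolute tolerance."""
    return abs(value - round(value)) <= tol


# (h, k, l, date, time): the first two rows come from the sample table above,
# the third one is a hypothetical row with a non-integer index
rows = [
    (3.0, 3.0, 3.0, "21/11/23", "12:50:35"),
    (3.0, 3.0, 3.0, "21/11/23", "12:54:43"),
    (0.5, 0.0, 2.0, "21/11/23", "12:58:00"),
]

# Keep only the rows whose three Miller indices are integers
kept = [r for r in rows if all(is_integer_index(v) for v in r[:3])]

# Reference every surviving row to the first one and convert to seconds
t0 = parse_timestamp(kept[0][3], kept[0][4])
delta_times = [int((parse_timestamp(d, t) - t0).total_seconds()) for *_, d, t in kept]
print(delta_times)  # [0, 248]
```

The reference time is taken after the non-integer rows are discarded, so the first surviving measurement always has DeltaTime equal to zero.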

For every successful experiment we will output:
1. Image:  **"PolarizationD3\_{folder_name}\_{DD}\_{MM}\_{YY}\_{i}\_MillerIndex\_{PrettyCombination}\_Multiplier={Multiplier}.png"** in AmorphousPlotResults. Shows the plot with the extended area with the raw data
2. Image:  **"PolarizationD3\_{folder_name}\_{DD}\_{MM}\_{YY}\_{i}\_MillerIndex\_{PrettyCombination}\_Multiplier={Multiplier}\_Soft.png"** in AmorphousPlotResults. Shows the plot with the extended area with the filtered data
3. Txt:    **"PolarizationD3\_{folder_name}\_{DD}\_{MM}\_{YY}\_{i}\_MillerIndex\_{PrettyCombination}.txt"** in AmorphousMLDataBase. It contains the four data columns (DeltaTime, PolarizationD3, SoftPolarizationD3, ErrPolarizationD3)
4. Txt:    **"PolarizationD3\_{folder_name}\_{DD}\_{MM}\_{YY}\_{i}\_MillerIndex\_{PrettyCombination}\_Parameters.txt"** in AmorphousMLDataBase. It contains the parameters (CellID, Pressure, LabPolarization, LabTime)
   

These plots are not necessary but are saved for the user to know what all the files look like.
The files that are wrong or useless when all is done are the following:
1. Fli:    **"{folder_name}\_Arrays\_{i}.fli"** in AmorphousSeparatedFolder/{folder\_name}. It still has the header and useless columns. It is the .fli file of every chunk, of every recorded experiment (correct or incorrect)
2. Folder: **"BadTest"** contains all the graphs of the data sets that were considered not worthy but had more points than the ones saved. Check them if your experiment was not properly added
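
The splitting rule from step 1 of the summary above (a chunk starts at a run of consecutive header rows that contains both a polariser and an analyser row, in any order) can be sketched as follows. This is a simplified stand-alone version with hypothetical names (`is_header`, `split_into_chunks`) and made-up, shortened lines; unlike the notebook cell, it drops blank lines and does no logging.

```python
def is_header(line):
    """True for 'polariser cell info' and 'analyser cell info' rows."""
    s = line.strip()
    return s.startswith("polariser cell info") or s.startswith("analyser cell info")


def split_into_chunks(lines):
    """Group lines into chunks that start with a polariser+analyser header pair."""
    chunks = []
    i = 0
    while i < len(lines):
        if not is_header(lines[i]):
            i += 1  # rows before the first valid header pair are ignored
            continue
        # Collect the run of consecutive header rows, keeping the last of each kind
        found = {}
        while i < len(lines) and is_header(lines[i]):
            kind = lines[i].split()[0]  # 'polariser' or 'analyser'
            found[kind] = lines[i].strip()
            i += 1
        # Collect the data rows up to the next header (blank lines are dropped)
        data = []
        while i < len(lines) and not is_header(lines[i]):
            if lines[i].strip():
                data.append(lines[i].strip())
            i += 1
        # Only a complete pair defines an experiment; polariser row goes first
        if "polariser" in found and "analyser" in found:
            chunks.append([found["polariser"], found["analyser"], *data])
    return chunks


# Hypothetical, shortened lines standing in for a real .fli file
lines = [
    "40660 data row saved before the first cell was installed",
    "analyser cell info sic1402",
    "polariser cell info ge18004",
    "40661 data row",
    "40662 data row",
    "polariser cell info ge18005",  # lone header: no analyser row follows
    "40663 data row",
]
chunks = split_into_chunks(lines)
print(len(chunks), chunks[0])
```

In this example only one chunk survives: the row before the first header pair and the block after the lone polariser header are both discarded.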

In [None]:
# Path to the original folder and the final folder
DataBase = Path('AmorphousDataBase')
output_base = Path('AmorphousSeparatedFolder')

# List all .fli files in that folder, prepare folders
FileNameList = [f.name for f in DataBase.glob('*.fli')]
polyorder = 2
default_window_length = 5
SeparatedFolder = Path("AmorphousSeparatedFolder")
BadFilesFolder = Path("AmorphousBadFiles")
MLDataBaseFolder = Path("AmorphousMLDataBase")
BadFilesFolder.mkdir(exist_ok=True)
MLDataBaseFolder.mkdir(exist_ok=True)
log_message(f"\n\n Files in the data base that will be (tried) to be used\n {FileNameList}\n")

for FileName in FileNameList:
    """READ THE FILE AND SEPARATE IT INTO EACH EXPERIMENT USING THE POLARIZATION CELL"""
    # 1.1- Open file
    folder_name = FileName.replace(".fli", "")
    output_folder = output_base / folder_name
    file_path = DataBase / FileName
    output_folder.mkdir(parents=True, exist_ok=True)


    with open(long_path(file_path), "r") as f:
        lines = f.readlines()

    # 1.2- Locate the header with CellID, Pressure, etc. Chunks are the data rows between 'polariser cell info'
    chunks = []
    current_chunk = []
    started = False
    for line in lines:
        if line.strip().startswith("polariser cell info"):
            if started and current_chunk:
                chunks.append(current_chunk)
            current_chunk = [line]
            started = True
        else:
            if started:
                current_chunk.append(line)
    if not started:
        log_message(f" File '{FileName}' does NOT contain any 'polariser cell info' header. Skipping.\n")
        continue
    else:
        log_message(f" File '{FileName}' contains at least one 'polariser cell info' header.")

    if current_chunk:
        chunks.append(current_chunk)

    # 1.3- Save .fli files for each chunk
    base_name = FileName.replace(".fli", "")
    log_message(f"\n\nCreating all the Array files \n")
    for i, chunk in enumerate(chunks):
        fli_filename = f"{base_name}_Arrays_{i}.fli"
        fli_path = output_folder / fli_filename
        with open(long_path(fli_path), "w") as f_out:  
            f_out.writelines(chunk)

    chunks = []
    current_chunk = []
    started = False  # This is a flag to know when we found the first header
    with open(long_path(file_path), "r") as f:
        lines = f.readlines()
    i = 0
    while i < len(lines):
        line = lines[i].strip()  
        # Detect header section
        if line.startswith("polariser cell info") or line.startswith("analyser cell info"):
            header_block = []
            header_found = {"polariser": None, "analyser": None}   
            # Collect consecutive headers, but keep only the last polariser+analyser combination rows
            while i < len(lines) and (
                lines[i].strip().startswith("polariser cell info")
                or lines[i].strip().startswith("analyser cell info")
            ):
                current = lines[i].strip()
                if current.startswith("polariser"):
                    header_found["polariser"] = current
                elif current.startswith("analyser"):
                    header_found["analyser"] = current
                header_block.append(current)
                i += 1   
            # Build chunk if both headers are present
            if header_found["polariser"] and header_found["analyser"]:
                new_chunk = []           
                # Always keep order: polariser first, analyser second
                for header in (header_found["polariser"], header_found["analyser"]):
                    if not header.endswith("\n"):
                        header += "\n"
                    new_chunk.append(header)           
                # Collect data lines until next header
                while i < len(lines) and not (
                    lines[i].strip().startswith("polariser cell info")
                    or lines[i].strip().startswith("analyser cell info")
                ):
                    line = lines[i]
                    if not line.endswith("\n"):
                        line += "\n"  
                    new_chunk.append(line)
                    i += 1           
                chunks.append(new_chunk)
            else:
                log_message(f" File '{FileName}' skipped a block: missing polariser or analyser (headers={header_block}).")  
        else:
            i += 1
    if not chunks:
        log_message(f"   File '{FileName}' does NOT contain any valid polariser+analyser pair. Skipping.\n")
        continue
    else:
        log_message(f" File '{FileName}' contains {len(chunks)} valid experiment blocks.")


    # 2- Save .fli files for every correct chunk
    base_name = FileName.replace(".fli", "")  # remove .fli for clean filenames. Otherwise those .fli extensions appear on the final names
    log_message(f" Creating all the Array files \n")
    for i, chunk in enumerate(chunks):
        fli_filename = f"{base_name}_Arrays_{i}.fli"
        fli_path = output_folder / fli_filename
        with open(long_path(fli_path), "w") as f_out:
            f_out.writelines(chunk)  
    
    cellid_file = Path("AmorphousPolariserAndAnalyser_IDs.txt")
    try:
        with open(long_path(cellid_file), 'r') as file:
            seen_strings = set(line.strip() for line in file)
    except FileNotFoundError:
        seen_strings = set()


    # 3- Open each Array file and work with it (The Array file still has the header)
    with open(ensure_file(cellid_file), 'a') as file:
        for i in range(len(chunks)):
            FLI_filename = f"{base_name}_Arrays_{i}.fli"  # Name of the Array file
            FLI_path = output_folder / FLI_filename  
            df = pd.read_csv(long_path(FLI_path), sep=r'\s+', header=None, on_bad_lines='skip')
            warning_str = "No centering scan found"
            log_message(f" Reading {FLI_path}, removing rows with '***WARNING {warning_str}'")

            #3.1 Combine the first 5 columns as strings, join them with spaces, and drop the rows containing this phrase (it is not important for us)
            df = df[~df.iloc[:, :5].astype(str).agg(' '.join, axis=1).str.contains(warning_str, regex=False)]
            
            #3.2 Extract useful information from the header. Hopefully, CellID, Pressure, LabPolarization, Year, Month, Day, time of lab measurement before first experiment measurement (negative time) will be stored locally
            log_message(f" Header Information Extraction...")
            PolariserID =          df.iloc[0].tolist()[3]
            AnalyserID =           df.iloc[1].tolist()[3]
            PolariserPressure =    df.iloc[0].tolist()[6]
            AnalyserPressure =     df.iloc[1].tolist()[6]
            LabPolarization = df.iloc[0].tolist()[7]

            try:
                HM, DD, MM, YY = df.iloc[0].tolist()[14], int(df.iloc[0].tolist()[10]), int(df.iloc[0].tolist()[11]), int(df.iloc[0].tolist()[12])
                Day_Ref = f"{DD:02d}/{MM:02d}/{YY:02d}"
                dt = Time(Day_Ref, HM)
            except Exception as e:
                log_message(f"Skipping chunk {i} of file {file_path} because of invalid header data: {e}")
                continue

            
            #3.3 All redundant/useless information is removed
            log_message(f" Removing Measurement Index, Unknown column, Flipping Ratio, Uncertainty of Flipping Ratio and Time between measurements,...")
            df = df.iloc[2:].reset_index(drop=True)
            df = df.drop(df.columns[0], axis=1)
            df = df.drop(df.columns[5], axis=1)
            df = df.drop(df.columns[7], axis=1)
            df = df.drop(df.columns[7], axis=1)
            df = df.drop(df.columns[7], axis=1)
            #log_message(f"Saving only polarization values for the Spin Directions wanted in both Polarizer Cells, i.e. (+z,+z)")


            #3.4 Convert Miller index columns into integers. From string or object to float and if the float is close to an integer (tolerance is 1e-8) then save as integer. Otherwise remove row
            cols_to_convert = [1, 2, 3]
            df[cols_to_convert] = df[cols_to_convert].apply(pd.to_numeric, errors='coerce').astype(float)            
            mask = np.isclose(df[cols_to_convert], np.round(df[cols_to_convert]), atol=1e-8)
            df = df[mask.all(axis=1)].copy()
            log_message(f" All non-integer Miller indices removed. Adding DeltaTime")
            
            #3.5 The time columns are converted into difference of time being the referenced time the first +z,+z measurement that has survived at this point
            if df.shape[0] < 2:
                log_message(f"   Not enough valid rows after filtering, skipping chunk")
                continue
            df['DeltaTime'] = df.apply(
                lambda row: deltatime(df[4].iloc[0], df[5].iloc[0], row[4], row[5]), axis=1 )
            ref_dt = Time(df[4].iloc[0], df[5].iloc[0])
            LabTime = int((dt - ref_dt).total_seconds())

            #3.6 Rename the columns PolarizationD3, ErrPolarizationD3 (the polarization column and its uncertainty). The other one with name is DeltaTime. The rest are numbers (will be erased).
            #Also we remove the time strings (with DeltaTime they have no new information)
            log_message(f" Renaming PolarizationD3 and ErrPolarizationD3")
            df.rename(columns={
                df.columns[5]: 'PolarizationD3',
                df.columns[6]: 'ErrPolarizationD3'
            }, inplace=True)
            df.drop(columns=[df.columns[3], df.columns[4]], inplace=True)
            log_message(f" Dropped Time Strings")

            
            #3.7 Begin filtering and softening with previous functions
            log_message(f" Begin removal of Bad files and softening with Savitzky-Golay filter")
            filtered_df, PrettyCombination = filter_best_combination(i,df)
            #If nothing survived the filters/purge then use 'continue' and go for the next experiment
            if filtered_df is None and PrettyCombination is None:
                log_message(f" Chunk {i}: No suitable combination found. Skipping to next chunk or file.")
                log_message(f"_______________________________________________________________\n")
                continue  # skip to next chunk
            
            #3.8 Removal of Miller indices (we have all the information they could give us)
            log_message(f" Removing Miller Indices columns")
            #log_message(filtered_df)
            filtered_df = filtered_df.iloc[:, 3:]
            desired_order = ["DeltaTime", "PolarizationD3", "SoftPolarizationD3", "ErrPolarizationD3"]

            
            #3.9 Remove the points that won't be useful for the ML algorithm
            columns_to_save = [col for col in desired_order if col in filtered_df.columns] # Keep only the columns that actually exist (in case something is missing)
            df_SEMIFINAL = filtered_df[columns_to_save].copy()
            df_FINAL = filtered_df = RemoveOutcast_FixUncertainty(df_SEMIFINAL, PrettyCombination, filename=f"PolarizationD3_{folder_name}_{DD}/{MM}/{YY}_{i}_MillerIndex_{PrettyCombination}", AcceptableMultiplier=1.3)

            #3.10 Plot the successful experiments
            log_message(f" Plot of Data. PolarizationD3_{folder_name}_{DD}/{MM}/{YY}_{i}_MillerIndex_{PrettyCombination}")
            plt.figure(figsize=(10, 5))
            T = pd.to_numeric(df_FINAL["DeltaTime"], errors='coerce')
            P = pd.to_numeric(df_FINAL["PolarizationD3"], errors='coerce')
            Err = pd.to_numeric(df_FINAL["ErrPolarizationD3"], errors='coerce')
            P_soft = pd.to_numeric(df_FINAL["SoftPolarizationD3"], errors='coerce')
            
            plt.scatter(T, P, linewidth=1, label='Original')
            plt.plot(T, P, linestyle='--', color='blue', alpha=0.7)
            plt.errorbar(T, P, yerr=Err, fmt='none', ecolor='gray', alpha=0.5)

            plt.xlabel("DeltaTime")
            plt.ylabel("PolarizationD3")
            plot_filename = f"PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}.png"
            plt.title(plot_filename)
            plt.ylim(np.min(P - Err), np.max(P + Err))
            plt.yticks(np.linspace(np.min(P - Err), np.max(P + Err), 10))
            plt.grid(True, linestyle='--', alpha=0.5)
            plt.tight_layout()

            plot_path = output_folder / plot_filename
            plt.savefig(long_path(plot_path), dpi=300, bbox_inches='tight')
            plt.close()




            log_message(f" Plot of Filtered Data. PolarizationD3_{folder_name}_{DD}/{MM}/{YY}_{i}_MillerIndex_{PrettyCombination}_Softened")
            plt.figure(figsize=(10, 5))

            plt.scatter(T, P_soft, linewidth=1, label='Filtered')
            plt.plot(T, P_soft, linestyle='--', color='green', alpha=0.7)
            plt.errorbar(T, P_soft, yerr=Err, fmt='none', ecolor='gray', alpha=0.5)

            plt.xlabel("DeltaTime")
            plt.ylabel("SoftPolarizationD3")
            plot_filename_soft = f"PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}_Softened.png"
            plt.title(plot_filename_soft)
            plt.ylim(np.min(P_soft - Err), np.max(P_soft + Err))
            plt.yticks(np.linspace(np.min(P_soft - Err), np.max(P_soft + Err), 10))
            plt.grid(True, linestyle='--', alpha=0.5)
            plt.tight_layout()

            plot_path_soft = output_folder / plot_filename_soft
            plt.savefig(long_path(plot_path_soft), dpi=300, bbox_inches='tight')
            plt.close()

            log_message(f" Comparison Plot. PolarizationD3_{folder_name}_{DD}/{MM}/{YY}_{i}_MillerIndex_{PrettyCombination}_Comparison")
            plt.figure(figsize=(10, 5))

            plt.scatter(T, P, linewidth=1, color='blue', alpha=0.6, label='Original')
            plt.plot(T, P, linestyle='--', color='blue', alpha=0.5)

            plt.scatter(T, P_soft, linewidth=1, color='green', alpha=0.6, label='Filtered')
            plt.plot(T, P_soft, linestyle='--', color='green', alpha=0.5)

            plt.errorbar(T, P, yerr=Err, fmt='none', ecolor='gray', alpha=0.3)
            plt.errorbar(T, P_soft, yerr=Err, fmt='none', ecolor='gray', alpha=0.3)

            plt.xlabel("DeltaTime")
            plt.ylabel("Polarization")
            plot_filename_combined = f"PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}_Combined.png"
            plt.title(plot_filename_combined)
            min_y = min(np.min(P - Err), np.min(P_soft - Err))
            max_y = max(np.max(P + Err), np.max(P_soft + Err))
            plt.ylim(min_y, max_y)
            plt.yticks(np.linspace(min_y, max_y, 10))
            plt.grid(True, linestyle='--', alpha=0.5)
            plt.legend(loc='best')
            plt.tight_layout()

            plot_path_combined = output_folder / plot_filename_combined
            plt.savefig(long_path(plot_path_combined), dpi=300, bbox_inches='tight')

            plt.close()

            # Save CSV and parameter files
            log_message(" Saving chunk data and parameters")
            csv_filename = f"PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}.txt"
            parameter_filename = f"PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}_Parameters.txt"
            csv_path = output_folder / csv_filename
            ml_csv_path = MLDataBaseFolder / csv_filename
            ml_parameter_path = MLDataBaseFolder / parameter_filename
            
            # Measure DeltaTime from the first valid polarization measurement of this chunk
            df_FINAL['DeltaTime'] = df_FINAL['DeltaTime'] - df_FINAL['DeltaTime'].iloc[0]
            df_FINAL.to_csv(csv_path, index=False, sep=',')
            df_FINAL.to_csv(ml_csv_path, index=False, sep=',')
            log_message(f" Saved CSV: {csv_filename}")
            
            # Save parameter file
            with ml_parameter_path.open('w', encoding='utf-8', errors='replace') as f:
                f.write("CellID,AnalyserID,PolariserPressure,AnalyserPressure,LabPolarization,LabTime\n")
                f.write(f"{PolariserID},{AnalyserID},{PolariserPressure},{AnalyserPressure},{LabPolarization},{LabTime}")
            log_message(f" Saved Parameters: {parameter_filename}")
            log_message(f" Parameter and Array files saved to ML database: {MLDataBaseFolder}\n{'_'*65}\n")


## 7. Cleanup

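This step erases all intermediate files and prepares the remaining ones for the ML pipeline:

1. Removes all the temporary .fli files that were created.
2. Removes empty folders.
3. Collects all unique polariser–analyser ID pairs.
4. Deletes the intermediate folders _AmorphousSeparatedFolder_ and _AmorphousDataBase_.
5. Removes duplicate data files from _AmorphousMLDataBase_ (files with identical SHA-256 hash), together with their parameter files.
6. Removes _AmorphousCell_ID.txt_ if it exists.

As a result, the only useful outputs are the file _AmorphousPolariserAndAnalyser_IDs.txt_ and the folder _AmorphousMLDataBase_.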
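
After the cleanup it can be useful to check that every data file left in _AmorphousMLDataBase_ still has its matching parameter file, since the ML code needs both. The following is only a minimal sketch, assuming the `{base_name}.txt` / `{base_name}_Parameters.txt` naming described above. `find_unpaired` is a hypothetical helper that is not part of this notebook, and the demo runs on a throwaway folder rather than on the real database.

```python
from pathlib import Path
import tempfile

def find_unpaired(ml_folder):
    """Return (data files without parameters, parameter files without data).

    Hypothetical helper: it only compares file names, not file contents.
    """
    suffix = "_Parameters"
    stems = {p.stem for p in Path(ml_folder).glob("*.txt")}
    data = {s for s in stems if not s.endswith(suffix)}
    params = {s[: -len(suffix)] for s in stems if s.endswith(suffix)}
    return sorted(data - params), sorted(params - data)

# Demo on a temporary folder with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ("A.txt", "A_Parameters.txt", "B.txt", "C_Parameters.txt"):
        (folder / name).write_text("", encoding="utf-8")
    missing_params, missing_data = find_unpaired(folder)

print(missing_params)  # ['B']  (data file with no parameter file)
print(missing_data)    # ['C']  (parameter file with no data file)
```

On the real output one would call `find_unpaired(Path("AmorphousMLDataBase"))` and expect two empty lists.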


In [None]:
# Remove the temporary .fli chunk files created for the last processed file
for i in range(len(chunks)):
    temp_filename = f"{base_name}_Arrays_{i}.fli"
    temp_path = output_folder / temp_filename
    temp_path.unlink(missing_ok=True)  # delete the file; ignore it if already gone

log_message(f"Created and saved {len(chunks)} CSV files from file called {FileName}.")

# Remove folder if empty
if output_folder.exists() and not any(output_folder.iterdir()):
    output_folder.rmdir()
    log_message(f"Removed empty folder: {output_folder}")
log_message('\n\n')

# Collect unique polariser/analyser ID pairs because they are needed by the ML .ipynb file
ml_database_folder = Path("AmorphousMLDataBase")
parameter_files = list(ml_database_folder.glob("*Parameters.txt"))
log_message(f"Found {len(parameter_files)} parameter files.")

unique_id_pairs = set()
for filepath in parameter_files:
    try:
        with filepath.open('r', encoding='utf-8') as f:
            lines = f.readlines()
            if len(lines) >= 2:
                second_row = lines[1].strip()
                parts = second_row.split(',')
                if len(parts) >= 2:
                    polariser_id, analyser_id = parts[0], parts[1]
                    unique_id_pairs.add((polariser_id, analyser_id))
    except Exception as e:
        log_message(f"Failed to read {filepath}: {e}")

unique_ids_path = Path("AmorphousPolariserAndAnalyser_IDs.txt")
with unique_ids_path.open('w', encoding='utf-8') as f:
    for polariser_id, analyser_id in sorted(unique_id_pairs):
        f.write(f"{polariser_id},{analyser_id}\n")
log_message(f"Saved {len(unique_id_pairs)} unique polariser/analyser ID pairs to {unique_ids_path}.")

# Delete intermediate folders 
for folder_name in ["AmorphousSeparatedFolder", "AmorphousDataBase"]:
    folder_path = Path(folder_name)
    if folder_path.exists():
        shutil.rmtree(folder_path)
        log_message(f"Folder '{folder_path}' has been deleted.")
    else:
        log_message(f"Folder '{folder_path}' does not exist.")

# Remove duplicate ML database files
hash_map = defaultdict(list)

def file_sha256(filepath: Path, block_size=65536) -> str:
    """Compute SHA256 hash of a file (safe for large files)."""
    sha256 = hashlib.sha256()
    with filepath.open("rb") as f:
        while chunk := f.read(block_size):
            sha256.update(chunk)
    return sha256.hexdigest()

# Scan all base .txt files (skip '_Parameters' files)
for txt_file in ml_database_folder.glob("*.txt"):
    if "_parameters" not in txt_file.stem.lower():
        file_hash = file_sha256(txt_file)
        hash_map[file_hash].append(txt_file)

# Report & delete duplicates
duplicates_found = False
for file_hash, paths in hash_map.items():
    if len(paths) > 1:
        duplicates_found = True
        log_message(f"\nDuplicate group (hash={file_hash}):")
        log_message(f"   Keeping: {paths[0]}")

        # Delete all but the first file
        for p in paths[1:]:
            param_file = p.with_name(f"{p.stem}_Parameters{p.suffix}")
            try:
                p.unlink()
                log_message(f"   Deleted duplicate base file: {p}")
            except Exception as e:
                log_message(f"   Could not delete base file {p}: {e}")
            if param_file.exists():
                try:
                    param_file.unlink()
                    log_message(f"   Deleted parameter file: {param_file}")
                except Exception as e:
                    log_message(f"   Could not delete parameter file {param_file}: {e}")

if not duplicates_found:
    log_message("No duplicates found in MLDataBase!")
else:
    log_message("\nDuplicate cleanup complete!")


# Remove AmorphousCell_ID.txt if it exists
file_path = Path("AmorphousCell_ID.txt")
if file_path.exists():
    file_path.unlink()
    log_message(f"{long_path(file_path)} has been deleted.")
else:
    log_message(f"{long_path(file_path)} does not exist.")