<h1>CrystalineFileLectureTests.ipynb</h1>

Reads all the files it is going to process from D3Files.

Outputs the following folders and files:

1. CrystallineLog_Reading_Creation.txt
Logs all the prints and every step the code does. If you trust the code, it is irrelevant. If you don't trust it or want to change it then this txt file will tell you how each experiment file has been processed and where there might have been issues.


2. Crystalline_CellID
Contains the Cell IDs found on all the files. Needed for the ML code.


3. CrystallinePlotResults
This folder will store the graphs of all the experiments that were accepted. Not needed for anything but it is nice to see the files that will be fed to the model. For each experiment you can find the following files:

3.1 CrystallineSummary_Testing.txt Shows the score of each experiment. If you use the default criterion for deciding whether a file is adequate (i.e., method FilteringMethodInt = 12), all files scoring 1.4 or lower are rejected. Files that have been forcefully accepted or rejected are flagged here as well. IF YOU RECKON AN EXPERIMENT SHOULD BE ACCEPTED DESPITE BEING SHOWN HERE AS REJECTED, PLEASE FIND THE LIST force_accept_files AND ADD YOUR EXPERIMENT USING THE SAME STRUCTURE. LIKEWISE, ADD ACCEPTED FILES THAT SHOULD NOT BE ACCEPTED TO force_reject_files. As a side note, to judge visually whether a file is good, open the associated graph and check that it doesn't do any funky business (see the examples already present to get a feel for what the word "funky" means).

3.2 {base_name}_N_{N}.png 
It shows the values of the experiment in black, a linear fit in blue and the band needed for 75% of the points to lie inside it (the region y_min = mx+n-N < y < mx+n+N = y_max). It gives a good estimate of how well the data follow an overall decreasing trend.

3.3 {base_name}_ExtendedArea.png 
It shows the same plot together with the range y_min = mx+n-1.3*N < y < mx+n+1.3*N = y_max. The points outside the green area are discarded, as they are considered too far off to be correct.

3.4 {base_name}_Derivatives.png
It shows the number of negative slopes between points and how steep they are


4. CrystallineMLDataBase
Contains the .txt files NECESSARY for the ML algorithm. There are two per experiment

4.1 {base_name}.txt 
Contains DeltaTime (the time of the measurement measured from the first VALID polarization measurement), PolarizationD3, SoftPolarizationD3 (the polarization after using a Savitzky-Golay filter) and ErrPolarizationD3 (the uncertainty)

4.2 {base_name}_Parameters.txt
Contains the CellID, Pressure, LabPolarization (the polarization measured at the lab) and LabTimeCellID (the time when it was measured)


5. CrystallineFailuresTest
Contains the plots 3.2 and 3.4 but for the experiments that failed the overall decreasing test. Check them if you can to see if any of your experiments has been placed there by mistake



6. CrystallineDataBase
Contains all the .fli files that were attempted to be read


7. CrystallineBadFiles
Contains all the .fli separated in experiment sets folders that were rejected (not enough points, negative polarizations, etc.)


_________________________________________________________________________________________

Process it follows:

1. REMOVAL OF PREVIOUS ITERATIONS
To avoid leaks and duplications, all output files and folders from previous runs are erased before the code runs.

2. ZIP FOLDER TREATMENT
The code takes all the zip files, extracts them and removes duplicates using the file name AND the SHA-256 hash.
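
The duplicate check described above can be sketched as follows. This is only an illustration, assuming that two files count as duplicates when both their name and their SHA-256 digest match; the helper names `sha256_of` and `drop_duplicates` are hypothetical and do not appear in the notebook.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def drop_duplicates(paths):
    """Keep the first file seen for each (name, hash) pair.

    Returns two lists: the files that were kept and the ones flagged as duplicates.
    """
    seen = set()
    kept, dropped = [], []
    for path in map(Path, paths):
        key = (path.name, sha256_of(path))  # assumption: both must match
        if key in seen:
            dropped.append(path)
        else:
            seen.add(key)
            kept.append(path)
    return kept, dropped
```

Reading in chunks keeps memory use flat even for large archives, and two files with the same name but different contents are both kept.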

3. SEPARATION OF FLI FILES ACCORDING TO EXPERIMENTS

Some fli files have the wrong structure (they are not polarization measurements), and those that are polarization files can contain more than one experiment per file.
For every fli file we read the contents and try to find the header (a line consisting entirely of a string). This marks the beginning of an experiment.
If there are numerical values before the first header, the file was saved before something in the experiment was changed. These data rows are skipped.
A correct fli file will have the following structure:
    polariser cell info ge18004 pressure/init. polar 2.29 0.79 initial date/time 17 09 23 @ 10:39
    37391   4.000   0.000   1.000 18/09/23 06:20:44     155.03  +z +z     0.8391    0.0156   11.4270    1.2031     120.00
    37392   4.000   0.000   1.000 18/09/23 06:26:49     155.05  +x +x     0.8255    0.0110   10.4610    0.7211     300.00
    ...

Which corresponds to the following information:
    String:'polariser cell info', CellID, String:'pressure/init. polar', Pressure(unknown units), InitialLabPolarization, String:'initial date/time', Day, Month, Year, String:'@', Hour:Minute
    Measurement Number, First Miller Index, Second Miller Index, Third Miller Index, Date Of Measurement, Time Of Measurement, Temperature [Kelvin],
                        Direction Of Polarization In The First Polarizer Cell (Direction of the quantum operator S_x,S_y,S_z), Direction Of Polarization In The Second Polarizer Cell,
                        Polarization, Polarization uncertainty, Flipping Ratio, FlippingRatio Uncertainty, Duration of the measurement
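
As an illustration of the header layout above, here is a minimal sketch of a parser built on a regular expression. The pattern `HEADER_RE` and the function `parse_header` are hypothetical (the notebook may parse the header differently); the dictionary keys reuse the parameter names listed for the `_Parameters.txt` files.

```python
import re

# Hypothetical pattern matching the header layout shown above
HEADER_RE = re.compile(
    r"polariser cell info\s+(?P<cell_id>\S+)\s+"
    r"pressure/init\. polar\s+(?P<pressure>[-+.\d]+)\s+(?P<lab_polarization>[-+.\d]+)\s+"
    r"initial date/time\s+(?P<day>\d+)\s+(?P<month>\d+)\s+(?P<year>\d+)\s+@\s+"
    r"(?P<hour>\d+):(?P<minute>\d+)"
)


def parse_header(line):
    """Return the header fields as a dict, or None if the line is not a header."""
    match = HEADER_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "CellID": fields["cell_id"],
        "Pressure": float(fields["pressure"]),
        "LabPolarization": float(fields["lab_polarization"]),
        # Date and time are kept as text in DD/MM/YY HH:MM form
        "LabTimeCellID": "{day}/{month}/{year} {hour}:{minute}".format(**fields),
    }


header = ("polariser cell info ge18004 pressure/init. polar 2.29 0.79 "
          "initial date/time 17 09 23 @ 10:39")
print(parse_header(header))
```

A data row does not contain the literal strings of the header, so `parse_header` returns `None` for it, which is one way to tell headers and measurements apart.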

The direction +z is chosen to be pointing away from the ground.
The direction +x is the direction of the flow of neutrons, i.e., the direction of scattering.
The direction +y is orthogonal to both of them.
D3 uses two polariser cells, one between the reactor and the sample and a second between the sample and the sensor. The first one guarantees that only neutrons with the correct spin direction
interact with the sample. The second one guarantees that only the neutrons whose spin direction is unchanged after interacting with the sample are detected by the sensor. This is
the reason why the directions (+z,+y,+x,-z,-y,-x) appear twice.
We have considered that temperature is not a relevant factor and that the flipping ratio carries no information that the polarization does not already possess.
First, the code locates the first header (ignoring everything before it) and saves all the data that follow (until the next header or the end of the document) in a file with the suffix Array_{i} (i is the number of headers already processed in that fli file)
Second, it will save the header as a file with the suffix Parameters.
Third, the header row and the columns of Measurement Number, Temperature, Flipping Ratio, FlippingRatio Uncertainty and Time Between Measurements will be erased
Fourth, as all data measurement uses the +z,+z combination, all other combinations are erased
Fifth, not all data from all Miller index combinations are polarization measurements. Even some of the ones that are polarization measurements have been tampered with (playing with magnetic fields, for example).
This means that there needs to be a way to select the correct combination. For starters, non-integer Miller indices are not used for measurements with the samples (they need to be discarded)
The integer Miller index combinations will be put to the test by all the functions defined before.
Sixth, it computes a score from the percentage of negative derivatives: 200 / (200 - percent_neg) - 1. This score is normalized (0 to 1) and evolves as 1/x. It also computes a score from the size of N: 2 * (-0.5 + 1 / (1 + needed_N / 8.54e-2)), where 8.54e-2 is the maximum N found in the data set. If a new maximum appears, this score will fall below 0 and no longer be normalized, but nothing will break. The final score is the sum of both values. By manual inspection, 1.4 is a good threshold: if Score > 1.4 the file is accepted; otherwise it is discarded by the main code (False is returned). This step also performs the m<0 test and writes everything to the summary txt file (filename, score associated with N, score associated with the derivatives and the total score). If the file was forcefully accepted or rejected, a string saying so appears in the .txt file. Finally, it saves the plots of both filters in CrystallinePlotResults if the file is accepted and in CrystallineFailuresTest if it is considered a bad file. Again, if a new file is added it may be wise to check your experiment in these folders. For more information, read the description of FilteringMethodInt = 0 inside the code
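
The scoring of the sixth step can be written out as a short sketch. The formulas and the 1.4 threshold are the ones quoted above; the function names are hypothetical and the reference value `n_ref` simply holds the constant 8.54e-2.

```python
def derivative_score(percent_neg):
    """Score from the percentage (0-100) of negative slopes; 0 at 0 %, 1 at 100 %."""
    return 200.0 / (200.0 - percent_neg) - 1.0


def band_width_score(needed_n, n_ref=8.54e-2):
    """Score from the band half-width N; 1 at N = 0, 0 at N = n_ref, negative beyond."""
    return 2.0 * (-0.5 + 1.0 / (1.0 + needed_n / n_ref))


def total_score(percent_neg, needed_n):
    """Sum of both partial scores."""
    return derivative_score(percent_neg) + band_width_score(needed_n)


def is_accepted(percent_neg, needed_n, threshold=1.4):
    """A file passes only when its total score is strictly above the threshold."""
    return total_score(percent_neg, needed_n) > threshold


print(total_score(80.0, 0.01), is_accepted(80.0, 0.01))
```

For example, a file with 80 % negative slopes and N = 0.01 scores about 0.67 + 0.79 = 1.46, just above the threshold, so it would be accepted.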
Seventh, it tries each Miller index combination for a given i value, applies a filter, and returns the filtered df plus PrettyCombination. It only adds the filtered column if there are enough points and the data change significantly. If the data don't change much, the column PolarizationD3 is duplicated under the new name
Eighth, it repeats the process of finding the band m*x+n-N < y < m*x+n+N that contains 75% of the points. It then multiplies N by a factor AcceptableMultiplier and erases all points outside this bigger area. As the uncertainties are clearly underestimated, I tried to make them reasonable (looking at the dispersion of the points, it is clear there are systematic uncertainties). Under the hypothesis that the polarization curve should be smooth (at least C^1), we use χ^2 from a linear fit to add a provisional uncertainty margin. This is a very rough uncertainty increase, but it improves on the underestimated uncertainties, given the lack of ways to quantify the sources of systematic uncertainty. The enlarged area is plotted and saved in CrystallinePlotResults for both the normal data and the smoothed data (Savitzky–Golay filter)
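
The band search and the trimming of the eighth step can be sketched as below. This is a simplified illustration, not the notebook's code: it uses `numpy.polyfit` for the linear fit instead of `curve_fit`, the function names are hypothetical, and the default multiplier of 1.3 is assumed to play the role of AcceptableMultiplier.

```python
import numpy as np


def smallest_band(x, y, fraction=0.75, n_start=1e-4, n_step=1e-4, n_max=0.4):
    """Fit y = m*x + n and return (m, n, N).

    N is the smallest tested half-width for which at least `fraction` of the
    points satisfy |y - (m*x + n)| <= N, or None if no N up to n_max works.
    """
    m, n = np.polyfit(x, y, 1)
    residuals = np.abs(y - (m * x + n))
    steps = int(round((n_max - n_start) / n_step)) + 1
    for k in range(steps):
        N = n_start + k * n_step  # avoids accumulating floating-point error
        if np.mean(residuals <= N) >= fraction:
            return m, n, N
    return m, n, None


def trim_outliers(x, y, m, n, N, multiplier=1.3):
    """Drop the points lying outside the enlarged band of half-width multiplier * N."""
    keep = np.abs(y - (m * x + n)) <= multiplier * N
    return x[keep], y[keep]


# Synthetic decay with a single outlier in the middle
x = np.linspace(0.0, 100.0, 21)
y = 0.8 - 0.001 * x
y[10] += 0.2
m, n, N = smallest_band(x, y)
x_clean, y_clean = trim_outliers(x, y, m, n, N)
print(m < 0, N, len(x_clean))
```

In this synthetic case the outlier shifts the fitted line slightly, the band is found at a small N, and only the outlier falls outside the enlarged band.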
    
4. CellID SAVING
It will safely store in a txt file all the cell ids so that the code in ML can use them

5. DUPLICATION REMOVAL
It will check whether the files created for the ML algorithm are duplicates and erase them in that case

In [None]:
%reset -f

"""
1- LIBRARIES
""" 
import os
import shutil
import zipfile
import re
import glob
import hashlib
from collections import defaultdict
from datetime import datetime
from pathlib import Path
from typing import Optional

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter
from scipy.optimize import curve_fit


""" 
2- PRINTING AND LOG DETAILS. LONG PATH CORRECTIONS
"""
"""
Here we have a custom function for printing and logging everything to a .txt file, so that we can know what has happened in the code

"""
PrintDebug = True #This Bool determines whether all logs should also be printed on screen in the Python Notebook. The log writing is always on. If False the code will be faster.
ShowPlot = False #This Bool works the same but with showing on screen the plots (they are always saved even with this variable being False). Reduces program cost if False
log_file_path = os.path.join(".", "CrystallineLog_Reading_Creation.txt")
# Initialize log file at the start of the script
with open(log_file_path, 'w', encoding='utf-8') as log_file:
    log_file.write("=== Log started ===\n")

def log_message(message):
    if PrintDebug:
        print(message)
    with open(log_file_path, 'a', encoding='utf-8') as log_file:
        log_file.write(message + "\n")

def win_long_path(path):
    # Convert to Path and resolve to absolute
    path = Path(path).resolve()

    # Convert to string
    path_str = str(path)

    # Prepend \\?\ if not already present (the prefix only applies on Windows)
    if os.name == "nt" and not path_str.startswith("\\\\?\\"):
        path_str = "\\\\?\\" + path_str

    return path_str
to_erase = [
    "CrystallineBadFiles", "CrystallineDataBase","CrystallineFailuresTest",
    "CrystallineMLDataBase", "CrystallinePlotResults", "Crystalline_CellID.txt",
    "CrystallineLogTesting_Creation.txt",

]
for item in to_erase:
    path = os.path.abspath(item)  # full path
    if os.path.exists(path):
        try:
            if os.path.isfile(path):
                os.remove(path)
                log_message(f"Deleted file: {path}")
            elif os.path.isdir(path):
                shutil.rmtree(path)
                log_message(f"Deleted folder: {path}")
        except Exception as e:
            log_message(f" Could not delete {path}: {e}")
    else:
        log_message(f"Not found (skipped): {path}")


"""
3- FUNCTIONS
"""
#3.1-Time function for the conversion of time into seconds
def Time(Day_Ref, Hour_Ref):
    """
    Expected variables (strings) should be in the form
     Day_Ref = 'DD/MM/YY' a.k.a Day/Month/Year
     Hour_Ref = 'HH:MM' a.k.a Hour:Minute ('HH:MM:SS' is also parsed)
    The main code strips the seconds before calling this function, so in practice they are not used.
    The function takes these strings and returns a datetime object; deltatime() turns the difference between two of them into seconds
    Considering seconds could be an improvement, but the headers don't use seconds, there is no information to prove that the time variables are precise to the second, and ~30 seconds is negligible when working with time periods of up to 70000 seconds
    """
    match = re.match(r"(\d+)/(\d+)/(\d+)", Day_Ref)
    if match:
        DD = int(match.group(1))
        MM = int(match.group(2))
        YY = int(match.group(3))
    else:
        raise ValueError(f"Invalid date format: {Day_Ref}")

    # Parse time
    match = re.match(r"(\d+):(\d+):(\d+)", Hour_Ref)
    if match:
        Hour = int(match.group(1))
        Minute = int(match.group(2))
        Second = int(match.group(3))
    else:
        # Try HH:MM
        match = re.match(r"(\d+):(\d+)", Hour_Ref)
        if match:
            Hour = int(match.group(1))
            Minute = int(match.group(2))
            Second = 0
        else:
            raise ValueError(f"Invalid time format: {Hour_Ref}")

    return datetime(YY + 2000 if YY < 100 else YY, MM, DD, Hour, Minute, Second)

#3.2- Time function to compute the difference in time between two moments
def deltatime(AIni,BIni, AFin,BFin):
    """
    There is no such thing as "absolute time". We want to compare durations or intervals of time.
    The A variables are of the type 'DD/MM/YY' and the B ones are 'HH:MM'
     (AIni, BIni) is the moment considered as reference
     (AFin, BFin) is the moment at which something has been measured
    Returns the difference Fin - Ini in whole seconds
    """
    time1 = Time(AIni, BIni)
    time2 = Time(AFin, BFin)
    return int((time2 - time1).total_seconds())

#3.3 Conversion to integers
def format_combination(comb):
    """
    The Miller indices in the files are written as strings with float-like numbers in them, for example (4.0,1.0,2.0)
    As only integer Miller Indices are required we can convert them to integers with this function.
    """
    if comb is None:
        return "(None)"
    ints = tuple(int(float(x)) for x in comb)
    return f"({','.join(map(str, ints))})"

#3.4 Folder name changes
def sanitize(name):
    """
    Remove invalid characters for folder names. [<>:"/\\|?*] all turn to _
    """
    return re.sub(r'[<>:"/\\|?*]', '_', name)

#3.5 Savitzky–Golay filter window parameters
def savgol_params_func(n_points):
    """
    Receives the number of points of the file that the filter will be applied to
    The window length can't be greater than the number of points. To avoid Python errors this function gives the appropriate window length and polynomial order
    """
    window_length = min(default_window_length, n_points)
    if window_length % 2 == 0:
        window_length -= 1
    if window_length < polyorder + 2:
        window_length = polyorder + 2
        if window_length % 2 == 0:
            window_length += 1
    return {'window_length': window_length, 'polyorder': polyorder}

#3.6 Filtering Methods
def Overall_Decrease (df_filtered, filename):
    """
    The function that accepts files if they are decreasing sets of data (polarization evolution is ALWAYS a decay)
    Will save plots of good files in CrystallinePlotResults so that the user can see what has been accepted
    Will save plots of bad files in CrystallineFailuresTest so that the user can see the plots being discarded.
    IMPORTANT: IF YOU ADD A NEW EXPERIMENT AND YOU ARE CONFIDENT IT IS A GOOD FILE BUT IT IS SENT TO CrystallineFailuresTest, PLEASE MANUALLY ADD THE TXT FILE TO force_accept_files. The program should rename everything in SeparatedFolder, so you can get the name from there, adding _Filtered if it is missing
    """
    FilteringMethodInt = 12 
    """
    This int chooses the method for filtering the wrong data. The options are:
        FilteringMethodInt = 0   
        FilteringMethodInt = 1
        FilteringMethodInt = 2
        FilteringMethodInt = 12
        FilteringMethodInt = 3
    """
    
    force_accept_files = [
        "PolarizationD3_EuAgAs_29_8_23_0_MillerIndex_(0,0,4)_Filtered.txt",
        "PolarizationD3_EuAgAs_1_30_8_23_5_MillerIndex_(3,0,0)_Filtered.txt",
        "PolarizationD3_SmI3_26_9_23_1_MillerIndex_(0,3,0)_Filtered.txt",
        "PolarizationD3_SmI3_1_28_9_23_3_MillerIndex_(0,3,0)_Filtered.txt",
        "PolarizationD3_MnSn-c_hor_1_22_3_24_1_MillerIndex_(3,0,0)_Filtered.txt",
        "PolarizationD3_MnSn-c_hor_15_3_24_0_MillerIndex_(0,0,2)_Filtered.txt",
        "PolarizationD3_MnSn-c_ver_14_3_24_1_(1,0,0)_Filtered.txt"]
    force_reject_files = [
        "PolarizationD3_EuAgAs_1_29_8_23_0_MillerIndex_(1,1,1)_Filtered.txt",
        "PolarizationD3_MnSn-c_ver_14_3_24_1_MillerIndex_(1,0,0)_Filtered.txt"] #The values were in the 0.3 to 0.2 range. Not a polarization
    
    """
    Some files failed the FilteringMethodInt = 12 method despite being considered good files. Others passed the test but were clearly tampered.
    These lists override the testing and accept/reject them immediately
    """
    output_folder = Path.cwd() / "CrystallinePlotResults"
    output_folder.mkdir(exist_ok=True)
    failures_folder = Path.cwd() / "CrystallineFailuresTest"
    failures_folder.mkdir(exist_ok=True)

    # Ensure paths are Windows long-path safe
    output_folder = Path(win_long_path(output_folder))
    failures_folder = Path(win_long_path(failures_folder))

    x = df_filtered['DeltaTime'].values
    y = df_filtered['SoftPolarizationD3'].values #The tests are done with the data set filtered by the Savitzky–Golay for better results
    if FilteringMethodInt == 0:
        """
        This method computes the area in the plot that satisfies mx+n-σ_n < y < mx+n+σ_n
        Then it finds the smallest integer k so that mx+n-k*σ_n < y < mx+n+k*σ_n contains 75% of the points
        This method seemed ineffective, as good files with lots of points needed absurd values of k despite being the only really good files
        The only good filtering capability is checking if the slope m is positive (if it is positive it is impossible to have a decreasing polarization evolution)
        The summary of all files and their scores with the test is saved in Sigmas.txt (if you can't find it, that is because this method has not been used; it is useless anyway)
        The method is kept only as a "proof of concept"
        """
        def linear_func(x, m, n):
            return m*x + n
        try:
            popt, pcov = curve_fit(linear_func, x, y)
            m, n = popt
            sigma_n = np.sqrt(np.diag(pcov))[1]  # uncertainty in n
    
        except Exception as e:
            log_message(f"      Error fitting data: {e}")
            return False
    
        # Step 4: check slope
        if m > 0:
            log_message(f"      Overall slope is positive. Can't be polarization information. Skipping Combination")
            return False
    
        # Step 5–7: try sigma_n bands
        num_points = len(x)
        sorted_idx = np.argsort(x)
        x_sorted = x[sorted_idx]
        y_sorted = y[sorted_idx]
    
        needed_sigma = None
        max_sigma = 20
    
        for k in range(1, max_sigma+1):
            upper_band = linear_func(x_sorted, m, n) + k*sigma_n
            lower_band = linear_func(x_sorted, m, n) - k*sigma_n
    
            inside = np.logical_and(y_sorted <= upper_band, y_sorted >= lower_band)
            percent_inside = np.sum(inside) / num_points * 100
    
            if percent_inside >= 75:
                needed_sigma = k
                # Step 6: plot
                plt.figure(figsize=(8,5))
                plt.plot(x_sorted, y_sorted, 'o', label='Data')
                plt.plot(x_sorted, linear_func(x_sorted, m, n), '-', label='Fit')
                plt.fill_between(x_sorted, lower_band, upper_band, color='gray', alpha=0.3,
                                 label=f'Band ±{k}·sigma_n')
                plt.xlabel(r'$\Delta t$ (s)')   # Delta time in seconds
                plt.ylabel(r'$P_{\mathrm{soft}}$')  # SoftPolarizationD3

                plt.title(r'Fit: $y = %.2e \cdot x + %.2e$, $\sigma_n=%.2e$' % (m, n, sigma_n))

                plt.legend()
                plt.tight_layout()
                # Save plot
                plot_filename = output_folder / f"{filename}_AutomaticRange.png"
                plt.savefig(plot_filename)
                log_message(f"      Plot saved to {plot_filename}")
                plt.close()
                break
    
        # Step 8: write txt file with needed sigma
        # Common file in the same folder
        sigmas_txt_path = output_folder / "Sigmas.txt"
        
        with open(sigmas_txt_path, 'a') as f:
            if needed_sigma is not None:
                f.write(f"{filename}: needed sigma multiplier = {needed_sigma}\n")
            else:
                f.write(f"{filename}: needed sigma multiplier > {max_sigma}\n")
        log_message(f"      Sigma info saved to {sigmas_txt_path}")
    
    
        # Step 9: final decision
        if needed_sigma is None or needed_sigma > max_sigma:
            log_message(f"      Needed > {max_sigma}·sigma_n → file considered bad")
            return False
        else:
            log_message(f"      Approximately decreasing. Needed sigma multiplier: {needed_sigma}")
            return True


    
    elif FilteringMethodInt == 1:
        """
        This method computes the area in the plot that satisfies mx+n-N < y < mx+n+N
        Then it finds the smallest N (in steps of N_step) so that mx+n-N < y < mx+n+N contains 75% of the points
        This method seemed much more effective, as it doesn't depend on the linearity of the fit but on the dispersion of the points around a first-order approximation
        Still, it is not bullet-proof. Again, just a "proof of concept"
        Saves everything in Bands.txt and the plots have the value of N in the title.
        Also does the m<0 test (really important)
        """
        # === New method: fixed N bands ===
        def linear_func(x, m, n):
            return m * x + n

        try:
            popt, _ = curve_fit(linear_func, x, y)
            m, n = popt
        except Exception as e:
            log_message(f"      Error fitting data: {e}")
            return False

        if m > 0:
            log_message(f"      Overall slope is positive. Can't be polarization information. Skipping Combination")
            return False

        num_points = len(x)
        sorted_idx = np.argsort(x)
        x_sorted = x[sorted_idx]
        y_sorted = y[sorted_idx]

        needed_N = None
        N_start = 0.0001
        N_step = 0.0001
        N_max = 0.4

        N = N_start
        while N <= N_max:
            upper_band = linear_func(x_sorted, m, n) + N
            lower_band = linear_func(x_sorted, m, n) - N

            inside = np.logical_and(y_sorted <= upper_band, y_sorted >= lower_band)
            percent_inside = np.sum(inside) / num_points * 100

            if percent_inside >= 75:
                needed_N = N
                plt.figure(figsize=(8,5))
                plt.plot(x_sorted, y_sorted, 'o', label='Data')
                plt.plot(x_sorted, linear_func(x_sorted, m, n), '-', label='Fit')
                plt.fill_between(x_sorted, lower_band, upper_band, color='gray', alpha=0.3,
                                 label=f'Band ±{N:.2e}')
                plt.xlabel(r'$\Delta t$ (s)')   # Delta time in seconds
                plt.ylabel(r'$P_{\mathrm{soft}}$')  # SoftPolarizationD3

                plt.title(r'Fit: $y = %.2e \cdot x + %.2e$, $N=%.2e$' % (m, n, N))
                plt.legend()
                plt.tight_layout()
                plot_filename = output_folder / f"{filename}_N_{N:.2f}_ManualInterval.png"
                plt.savefig(plot_filename)
                log_message(f"      Plot saved to {plot_filename}")
                plt.close()
                break

            N += N_step

        bands_txt_path = output_folder / "Bands.txt"
        with open(bands_txt_path, 'a') as f:
            if needed_N is not None:
                f.write(f"{filename}: needed band width = ±{needed_N:.2e}\n")
            else:
                f.write(f"{filename}: needed band width > ±{N_max:.2e}\n")
        log_message(f"      Band info saved to {bands_txt_path}")

        if needed_N is None:
            log_message(f"      Needed band width > ±{N_max:.2e} → file considered bad")
            return False
        else:
            log_message(f"      Approximately decreasing. Needed band width: ±{needed_N:.2e}")
            return True

    
    elif FilteringMethodInt == 2:
        r"""
        This method computes the "number of negative derivatives".
        The definition of the derivative at a point is $f'(x_0) = \lim_{x \rightarrow x_0} \frac{f(x)-f(x_0)}{x-x_0}$.
        It is the difference between the images of two infinitely close points divided by their distance (metric space needed).
        If the data set is not continuous we can obtain the slope between two points using a similar idea: $f'(x_j) := \frac{y_{j+1}-y_j}{x_{j+1}-x_j}$
        If at least 50% of the slopes are negative we could say the overall curve is decreasing (which may not be true)
        100% can't be required, as noise in the measurements can make some "derivatives" positive.
        Again, another proof of concept
        Does not do the m<0 test
        """
        # === New method: compute discrete derivatives ===
        num_points = len(x)
        derivatives = []
    
        # Compute derivative for sequential pairs
        for i in range(num_points - 1):
            x1, y1 = x[i], y[i]
            x2, y2 = x[i+1], y[i+1]
            if x2 != x1:
                derivative = (y2 - y1) / (x2 - x1)
                derivatives.append(derivative)
            else:
                log_message(f"      Skipping derivative at index {i} due to zero delta x")
    
        derivatives = np.array(derivatives)
        num_derivatives = len(derivatives)
    
        num_neg = np.sum(derivatives < 0)
        num_pos = np.sum(derivatives > 0)
    
        percent_neg = num_neg / num_derivatives * 100 if num_derivatives > 0 else 0
        percent_pos = num_pos / num_derivatives * 100 if num_derivatives > 0 else 0
    
        # Log results
        bands_txt_path = output_folder / "Bands.txt"
        with open(bands_txt_path, 'a') as f:
            f.write(f"{filename}: Total derivatives: {num_derivatives}\n")
            f.write(f"{filename}: Negative derivatives: {num_neg} ({percent_neg:.2f}%)\n")
            f.write(f"{filename}: Positive derivatives: {num_pos} ({percent_pos:.2f}%)\n")
    
        log_message(f"      Derivatives computed: {num_derivatives}")
        log_message(f"      Negative: {num_neg} ({percent_neg:.2e}%)")
        log_message(f"      Positive: {num_pos} ({percent_pos:.2e}%)")
        log_message(f"      Band info saved to {bands_txt_path}")
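        # Keep only the midpoints whose derivative was actually computed (pairs with zero delta x were skipped above)
        dx = np.diff(x)
        mid_x = ((x[:-1] + x[1:]) / 2)[dx != 0]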
        plt.figure(figsize=(8,5))
        plt.plot(mid_x, derivatives, marker='o')
        plt.axhline(0, color='red', linestyle='--')
        plt.xlabel(r'$\Delta t$ (s)')
        plt.ylabel(r'Derivative')
        plt.title(f'Derivatives of {filename}')
        plot_filename = output_folder / f"{filename}_plot_Derivatives.png"
        plt.tight_layout()
        plt.savefig(plot_filename)
        plt.close()

        # Placeholder decision: if ≥ 50% negative, return True
        if percent_neg >= 50:
            log_message(f"      Majority negative derivatives → file considered decreasing")
            return True
        else:
            log_message(f"      Majority positive derivatives → file considered NOT decreasing")
            return False
            
    elif FilteringMethodInt == 12:
        """
        This method combines methods 1 and 2 (that is the reason why it is called 12)
        It computes a score depending on how many derivatives are negative, (200 / (200 - percent_neg) - 1) to be precise. This is a normalized score (0-1) with a 1/x evolution
        It computes a score depending on the size of N, 2 * (-0.5 + 1 / (1 + needed_N / 8.54e-2)) to be precise. 8.54e-2 is the maximum of the data set. If a new maximum is reached, the score won't be normalized (0-1) but nothing will break
        The final score is the sum of both values. By manual inspection, 1.4 is a good threshold. If Score>1.4 the file will be accepted. Otherwise it will be discarded by the main code (a False will be returned)
        Does the m<0 test
        Writes everything in the summary txt file (filename, Score associated with N, Score associated with Derivatives and the total Score). If the file was chosen to be forcefully accepted or rejected, a string will appear in the .txt file.
        Saves plots of both filters in CrystallinePlotResults if the file is accepted and in CrystallineFailuresTest if it is considered a bad file. Again, if a new file is added it may be wise to check your experiment in these folders. For more info read the description of FilteringMethodInt = 0
        """
        output_folder = Path.cwd() / "CrystallinePlotResults"
        output_folder.mkdir(exist_ok=True)
        failures_folder = Path.cwd() / "CrystallineFailuresTest"
        failures_folder.mkdir(exist_ok=True)
    
        # Ensure paths are Windows long-path safe
        output_folder = Path(win_long_path(output_folder))
        failures_folder = Path(win_long_path(failures_folder))
    
        x = df_filtered['DeltaTime'].values
        y = df_filtered['SoftPolarizationD3'].values
    
        def linear_func(x, m, n):
            return m * x + n
    
        try:
            popt, _ = curve_fit(linear_func, x, y)
            m, n = popt
            log_message(f"      Linear fit slope m={m:.4e}, intercept n={n:.4f}")
        except Exception as e:
            log_message(f"      Error fitting data: {e}")
            return False
        if m > 0:
            log_message(f"      Overall slope is positive. Can't be polarization information. Skipping Combination")
            return False

        BandWidthScore = None
        DerivativeScore = None
        needed_N = None
    
        # N obtainment
        num_points = len(x)
        sorted_idx = np.argsort(x)
        x_sorted = x[sorted_idx]
        y_sorted = y[sorted_idx]
    
        N_start = 0.0001
        N_step = 0.0001
        N_max = 0.4
        N = N_start
        while N <= N_max:
            upper_band = linear_func(x_sorted, m, n) + N
            lower_band = linear_func(x_sorted, m, n) - N
            inside = np.logical_and(y_sorted <= upper_band, y_sorted >= lower_band)
            percent_inside = np.sum(inside) / num_points * 100
    
            if percent_inside >= 75:
                needed_N = N
                break
            N += N_step
    
        if needed_N is not None:
            # Band N test
            BandWidthScore = 2 * (-0.5 + 1 / (1 + needed_N / 8.54e-2))
            log_message(f"      Needed N=±{needed_N:.2e}")
            log_message(f"      BandWidthScore={BandWidthScore:.4f}")
        else:
            log_message(f"      No N found ≤ {N_max:.2e}")
    
        # Derivative test
        derivatives = np.diff(y_sorted) / np.diff(x_sorted)
        num_derivatives = len(derivatives)
        num_neg = np.sum(derivatives < 0)
        percent_neg = (num_neg / num_derivatives) * 100 if num_derivatives > 0 else None
    
        if percent_neg is not None:
            log_message(f"      Derivatives computed: {num_derivatives}")
            log_message(f"      Negative: {num_neg} ({percent_neg:.2f}%)")
            # percent_neg is always expressed as a percentage (0-100), so the documented formula applies over the whole range
            DerivativeScore = (200 / (200 - percent_neg) - 1)
            log_message(f"      DerivativeScore={DerivativeScore:.4f}")
        else:
            log_message(f"      No derivatives computed")
    
        # Final Score
        if BandWidthScore is not None and DerivativeScore is not None:
            score = BandWidthScore + DerivativeScore
        elif BandWidthScore is not None:
            log_message(f"      Missing DerivativeScore in experiment {filename}")
            score = 2 * BandWidthScore
        elif DerivativeScore is not None:
            log_message(f"      Missing BandWidthScore in experiment {filename}")
            score = 2 * DerivativeScore
        else:
            log_message(f"      Missing BandWidthScore and DerivativeScore in experiment {filename}")
            score = 0
    
        log_message(f"      Final score={score:.4f}")
    
        # Combined Test
        threshold = 1.4
        is_good = score > threshold
        was_force_accepted = False
        was_force_rejected = False
        if filename in force_accept_files:
            log_message(f"      File {filename} is in force_accept_files → passing filter even if score is low")
            is_good = True
            was_force_accepted = True
        elif filename in force_reject_files:
            log_message(f"      File {filename} is in force_reject_files → rejecting even if score is high")
            is_good = False
            was_force_rejected = True
        summary_path = output_folder / "CrystallineSummary_Testing.txt"
        with open(win_long_path(summary_path), 'a') as summary_file:
            summary_file.write(
                f"{filename}: N={'No N found' if needed_N is None else f'{needed_N:.2e}'}, "
                f"BandWidthScore={'None' if BandWidthScore is None else f'{BandWidthScore:.4f}'}, "
                f"Negative derivatives={'No Deriv found' if percent_neg is None else f'{percent_neg:.2f}%'}, "
                f"DerivativeScore={'None' if DerivativeScore is None else f'{DerivativeScore:.4f}'}, "
                f"TotalScore={score:.4f}"
            )
            if was_force_accepted:
                summary_file.write(" [FORCE ACCEPTED]")
            elif was_force_rejected:
                summary_file.write(" [FORCE Rejected]")
            elif not is_good:
                summary_file.write(" ***")
            summary_file.write("\n")
        
        # Make and save plots in the correct folder
        plot_folder = output_folder if is_good else failures_folder
    
        # Build a clean name for titles / saving
        clean_name = (
            filename.replace("PolarizationD3_", "")
                    .replace("MillerIndex_", "")
                    .replace("_Filtered.txt", "")
        )



        def clean_plot_filename(filename: str, needed_N: Optional[float], plot_folder: Path) -> Path:
            """
            Build a clean filename for linear fit plots.
            Example:
            155K_2_18_9_23_1_(4,0,1)_N_4.30e-03.png
            or
            155K_2_18_9_23_1_(4,0,1)_NoNFound.png
            """
            base = make_clean_name(filename)
            suffix = f"_N_{needed_N:.2e}" if needed_N is not None else "_NoNFound"
            return plot_folder / f"{base}{suffix}.png"
        
        def derivative_plot_filename(filename: str, plot_folder: Path) -> Path:
            """
            Build a clean filename for derivative plots.
            Example:
            CaFeAl_13_7_6_24_2_(0,0,2)_Derivatives.png
            """
            base = make_clean_name(filename)
            return plot_folder / f"{base}_Derivatives.png"
        x = df_filtered['DeltaTime'].values
        y = df_filtered['SoftPolarizationD3'].values
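        Err = df_filtered['ErrPolarizationD3'].values if 'ErrPolarizationD3' in df_filtered.columns else np.zeros_like(y)
        Err = Err[sorted_idx]  # reorder the error bars to match x_sorted / y_sorted used in the plots below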

        
        def make_clean_name(filename: str) -> str:
            """
            Turn e.g.
              PolarizationD3_CaFeAl_13_7_6_24_2_MillerIndex_(0,0,2)_Filtered.txt
            into:
              CaFeAl_13_7_6_24_2_(0,0,2)
            and handle cases where filename contains '/' or '\\' (dates like DD/MM/YY).
            """
            s = str(filename).replace("/", "_").replace("\\", "_")  # prevent path splitting
            base = Path(s).stem  # remove extension if present
            if base.startswith("PolarizationD3_"):
                base = base[len("PolarizationD3_"):]
            if base.endswith("_Filtered"):
                base = base[:-len("_Filtered")]
            base = base.replace("MillerIndex_", "")
            return base
        
        def extended_area_plot_filename(filename: str) -> str:
            """EuAgAs_5_31_10_23_0_(3,0,0)_ExtendedArea.png"""
            return f"{make_clean_name(filename)}_ExtendedArea.png"

        # === Linear Fit Plot ===
        plt.figure(figsize=(8,5))
        
        # Black points with error bars
        plt.scatter(x_sorted, y_sorted, color='black', s=30, label='Data', marker='o')
        if Err is not None:
            plt.errorbar(x_sorted, y_sorted, yerr=Err, fmt='none', ecolor='black', alpha=0.6, capsize=2)
        
        
        # Blue linear fit
        plt.plot(x_sorted, linear_func(x_sorted, m, n), '-', color='blue', label='Fit')
        
        # Light blue band around the fit
        if needed_N is not None:
            upper_band = linear_func(x_sorted, m, n) + needed_N
            lower_band = linear_func(x_sorted, m, n) - needed_N
            plt.fill_between(
                x_sorted,
                lower_band,
                upper_band,
                color='lightblue',
                alpha=0.4,
                label=f'Band ±{needed_N:.2e}'
            )
        
        plt.xlabel(r'$\Delta t$ (s)')
        plt.ylabel(r'$P_{\mathrm{soft}}$')
        plt.title(f"Linear fit for {clean_name}")
        plt.legend()
        plt.tight_layout()
        
        plot_path = clean_plot_filename(filename, needed_N, plot_folder)
        plt.savefig(win_long_path(plot_path), dpi=300, bbox_inches='tight')
        plt.close()
        
        # === Derivative Plot ===
        mid_x = (x_sorted[:-1] + x_sorted[1:]) / 2
        plt.figure(figsize=(8,5))
        plt.plot(mid_x, derivatives, marker='o', color='black')
        plt.axhline(0, color='red', linestyle='--')
        plt.xlabel(r'$\Delta t$ (s)')
        plt.ylabel('Derivative')
        plt.title(f"Derivatives of {clean_name}")
        plt.tight_layout()
        
        plot_path = derivative_plot_filename(filename, plot_folder)
        plt.savefig(win_long_path(plot_path), dpi=300, bbox_inches='tight')
        plt.close()

    
        return is_good




    elif FilteringMethodInt == 3:
        """
        Band test (N) using a sinh/cosh fit instead of a linear one. It proved useless; it is kept only so that nobody wastes time trying it again.
        """
        def fit_func(x, a, b, c, d):
            return a * np.sinh(b * x) + c * np.cosh(d * x)
    
        try:
            popt, _ = curve_fit(fit_func, x, y, maxfev=1000000)
            a, b, c, d = popt
        except Exception as e:
            log_message(f"      Error fitting data: {e}")
            return False
            
        num_points = len(x)
        sorted_idx = np.argsort(x)
        x_sorted = x[sorted_idx]
        y_sorted = y[sorted_idx]
    
        needed_N = None
        N_start = 0.0001
        N_step = 0.0001
        N_max = 0.4
    
        N = N_start
        while N <= N_max:
            fitted_curve = fit_func(x_sorted, a, b, c, d)
            upper_band = fitted_curve + N
            lower_band = fitted_curve - N
    
            inside = np.logical_and(y_sorted <= upper_band, y_sorted >= lower_band)
            percent_inside = np.sum(inside) / num_points * 100
    
            if percent_inside >= 75:
                needed_N = N
                plt.figure(figsize=(8,5))
                plt.plot(x_sorted, y_sorted, 'o', label='Data')
                plt.plot(x_sorted, fitted_curve, '-', label='Fit')
                plt.fill_between(x_sorted, lower_band, upper_band, color='gray', alpha=0.3,
                                 label=f'Band ±{N:.2e}')
                plt.xlabel(r'$\Delta t$ (s)')
                plt.ylabel(r'$P_{\mathrm{soft}}$')
    
                plt.title(
                    r'Fit: $a \cdot \sinh(bx) + c \cdot \cosh(dx)$' + '\n' +
                    r'$a=%.2e$, $b=%.2e$, $c=%.2e$, $d=%.2e$, $N=%.2e$' % (a, b, c, d, N)
                )
                plt.legend()
                plt.tight_layout()
                # Scientific notation keeps small N values distinguishable in the file name
                plot_filename = output_folder / f"FIT_{filename}_N_{N:.2e}_plot.png"
                plt.savefig(win_long_path(plot_filename))
                log_message(f"      Plot saved to {plot_filename}")
                plt.close()
                break
    
            N += N_step
    
        bands_txt_path = output_folder / "Bands.txt"
        with open(bands_txt_path, 'a') as f:
            if needed_N is not None:
                f.write(f"{filename}: needed band width = ±{needed_N:.2e}\n")
            else:
                f.write(f"{filename}: needed band width > ±{N_max:.2e}\n")
        log_message(f"      Band info saved to {bands_txt_path}")
    
        if needed_N is None:
            log_message(f"      Needed band width > ±{N_max:.2e} → file considered bad")
            return False
        else:
            log_message(f"      Approximately fits inside band. Needed band width: ±{needed_N:.2e}")
            return True

    else:
        # === Add other methods here as needed ===
        log_message(f"      Unknown FilteringMethodInt = {FilteringMethodInt}")
        return False


        
    
#3.7 Function that decides the correct set of Miller Indices    
def filter_best_combination(
    i,
    df,
    filter_func,
    filter_column_idx,
    new_column_name,
    filter_params_func,
    min_points_required=3,
    tolerance=1e-8,
    time_column_idx=None,
    error_column_idx=None,
):
    """
    Try each Miller index combination for a given i value, apply a filter, and return the filtered df + PrettyCombination.

    Inputs:
        i is the integer that identifies the experiment inside the file (the original .fli files can hold more than one experiment, each with its own header; each experiment is a different i).
        df is the pandas dataframe (the data)
        filter_func is the filter used to smooth the data. Only savgol_filter can be used
        filter_column_idx is the positional index of the column in df that will be used for filtering and all the other tests. Always point it at the 'PolarizationD3' column
        new_column_name is the name given to the new (smoothed) column. If you change it from SoftPolarizationD3 you might need to change this name manually in the rest of the code
        filter_params_func returns the parameters for the filter given the number of points. The function savgol_params_func was made specifically for this
        min_points_required is the minimum number of rows a combination needs in order to be considered
        tolerance is the threshold on |original - filtered| used to decide whether the filter changed a point or not
        time_column_idx works like filter_column_idx. Please keep it as 'DeltaTime' (it is not used inside this function)
        error_column_idx works like filter_column_idx. Please keep it as 'ErrPolarizationD3' (it is not used inside this function)
    Outputs:
        df_filtered is the df dataframe with the new column for SoftPolarization and only the data points of the Miller indices combination that has passed all the filters
        PrettyCombination is the Miller indices combination that has passed the filters. Should be something like (4,1,0)
        If no combination passes, (None, None) is returned.
    The new column is always added. If the filter changes no point by more than tolerance (or if the filter fails), the new column is a copy of PolarizationD3 under the new name.
    """
    from collections import defaultdict
    folder_name = FileName.replace(".fli", "")
    
    


    # Group by first three columns (Miller indices)
    combination_counts = (
        df.groupby([df.columns[0], df.columns[1], df.columns[2]])
        .size()
        .sort_values(ascending=False)
    )
    log_message(f"Analyzing combinations in file: {folder_name}_Array_{i}.fli")
    # Try each Miller index combination (the first three columns of the .fli data), most frequent first
    for comb, count in combination_counts.items(): 
        log_message(f"Combination {comb} occurs {count} times in file {folder_name}.fli. Trying this combination")
        mask = (
            (df.iloc[:,0] == comb[0]) &
            (df.iloc[:,1] == comb[1]) &
            (df.iloc[:,2] == comb[2])
        )
        PrettyCombination = format_combination(comb)
        

        filtered_df = df.loc[mask].copy()
        # Requisites for the Combination to be valid:
        # Requisite 1: The combination must have at least one row of data
        if filtered_df.empty:
            log_message(f"      {PrettyCombination} has no data")
            continue
        
        # Requisite 2: Check if data column exists
        if filtered_df.shape[1] <= filter_column_idx:
            log_message(f"      Expected column index {filter_column_idx} not found. Skipping combination {PrettyCombination}")
            continue
        
        # Convert all columns to numeric (they are read as object dtype); entries that cannot be converted become NaN
        filtered_df = filtered_df.apply(pd.to_numeric, errors='coerce')
        filtered_df = filtered_df.dropna()  # drops any rows with NaNs introduced by coercion
        
        # Check dtypes
        all_numeric = all(dtype.kind in ('f', 'i') for dtype in filtered_df.dtypes)
        
        if all_numeric:
            log_message(f"      All columns have been successfully converted to numbers.")
        else:
            log_message(f"      Not all columns are numbers. Current dtypes:")
            log_message(f"      {filtered_df.dtypes}")
            log_message(f"      Expect Error Message from Python. Perhaps removing this file might be wise unless all files have the same issue")
        if filtered_df.empty:
            log_message(f"      All rows dropped after conversion to numeric. Skipping combination {PrettyCombination}")
            continue
        
        # Requisite 3: Polarization is ALWAYS positive. If any is negative, that is not a polarization. Immediately sent to the Bad Files Folder
        if (filtered_df.iloc[:, filter_column_idx] < 0).any():
            filename = f"PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}_Filtered.txt"
            badfile_subfolder = BadFilesFolder / folder_name
            badfile_subfolder.mkdir(parents=True, exist_ok=True)
            badfiles_txt_path = badfile_subfolder / filename
            filtered_df.to_csv(win_long_path(badfiles_txt_path), index=False, sep='\t')
            log_message(f"      {PrettyCombination} has negative polarization values. Sent to BadFiles with name {filename}. Skipping to next Combination")
            continue


        
        # Requisite 4: Have at least min_points_required rows (3 by default); otherwise we can't teach the ML algorithm anything.
        if len(filtered_df) < min_points_required:
            filename = f"PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}_Filtered.txt"
            badfile_subfolder = BadFilesFolder / folder_name
            badfile_subfolder.mkdir(parents=True, exist_ok=True)
            badfiles_txt_path = badfile_subfolder / filename
            filtered_df.to_csv(win_long_path(badfiles_txt_path), index=False, sep='\t')
            log_message(
                f"      {PrettyCombination} has only {len(filtered_df)} rows (< {min_points_required}). "
                f"Sent to BadFiles with name {filename}. Skipping to next Combination"
            )
            continue


        # Requisite 5: Apply the smoothing filter and record whether it changed the data.
        y = filtered_df.iloc[:, filter_column_idx].values
        filter_params = filter_params_func(len(y))
        try:
            y_filtered = filter_func(y, **filter_params)
            diff = np.abs(y - y_filtered)
            changed_count = np.sum(diff > tolerance)
            filtered_df[new_column_name] = y_filtered
            if changed_count > 0:
                log_message(f"      Filter changed {changed_count}/{len(y)} points. Adding column '{new_column_name}'.")
            else:
                log_message(f"      Filter applied but data unchanged. Adding '{new_column_name}' as duplicated values.")
            
            
        
        except Exception as e:
            log_message(f"      Error applying filter to combination {comb}: {e}")
            log_message(f"      Adding '{new_column_name}' as duplicated values to proceed anyway.")
            # Just duplicate the original column
            y_filtered = y.copy()
            filtered_df[new_column_name] = y_filtered
        #log_message(filtered_df)
        # Requisite 6: Pass the derivative and band-width tests in Overall_Decrease
        if Overall_Decrease(filtered_df, f"PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}_Filtered.txt"):
            log_message(f"      {PrettyCombination} has passed all tests. Proceeding with it.")
            return filtered_df, PrettyCombination
        else:
            log_message(f"      {PrettyCombination} failed the Overall_Decrease test. Trying next combination.")
            continue
    return None, None

#3.8 Function that removes the points considered bad and fixes the underestimated Uncertainty  
def RemoveOutcast_FixUncertainty(df_filtered, PrettyCombination, filename, AcceptableMultiplier=2.0, ShowPlot=False):
    """
    Repeats the search for the band m*x+n-N < y < m*x+n+N that contains 75% of the points.
    Multiplies the value of N by a factor AcceptableMultiplier and removes all points outside this wider band.
    As the uncertainties are clearly underestimated, I tried to make them reasonable (the dispersion of the points makes it clear that there are systematic uncertainties).
    Under the hypothesis that the polarization curve should be a smooth curve (at least C^1), a weighted linear fit is used and its reduced χ^2 gives a provisional correction: when sqrt(reduced χ^2) > 1 the uncertainties are multiplied by that factor.
    This is a very rough uncertainty increase, but it improves on the underestimated uncertainties given the lack of ways to quantify the sources of systematic uncertainty.
    The widened band is plotted together with the raw data and saved in CrystallinePlotResults.
    If the linear fit fails or no N is found, the input dataframe is returned unchanged.
    """
    output_folder = Path.cwd() / "CrystallinePlotResults"
    output_folder = Path(win_long_path(output_folder))


    x = df_filtered['DeltaTime'].values
    y_hard = df_filtered['PolarizationD3'].values
    y_soft = df_filtered['SoftPolarizationD3'].values

    #Linear fit
    def linear_func(x, m, n):
        return m * x + n

    try:
        popt, _ = curve_fit(linear_func, x, y_hard)
        m, n = popt
    except Exception as e:
        log_message(f"Error fitting data: {e}")
        return df_filtered  # Return original if error

    #Find smallest N for 75% within band
    num_points = len(x)
    sorted_idx = np.argsort(x)
    x_sorted = x[sorted_idx]
    y_sorted = y_hard[sorted_idx]

    N_start = 0.0001
    N_step = 0.0001
    N_max = 0.4
    N = N_start
    needed_N = None

    while N <= N_max:
        y_fit = linear_func(x_sorted, m, n)
        upper = y_fit + N
        lower = y_fit - N
        inside = np.logical_and(y_sorted <= upper, y_sorted >= lower)
        percent_inside = np.sum(inside) / num_points * 100

        if percent_inside >= 75:
            needed_N = N
            break
        N += N_step

    if needed_N is None:
        log_message(f"[{filename}] No N found to contain 75% within ±{N_max}")
        return df_filtered  # Return original if no good N

    # Compute extended band and filter out the points outside the extended band
    y_fit_full = linear_func(x, m, n)
    upper_band = y_fit_full + needed_N * AcceptableMultiplier
    lower_band = y_fit_full - needed_N * AcceptableMultiplier

    mask = np.logical_and(y_hard <= upper_band, y_hard >= lower_band)
    df_cleaned = df_filtered[mask].copy()

    log_message(f"[{filename}] Filtering kept {np.sum(mask)} of {len(mask)} rows (±{needed_N * AcceptableMultiplier:.2e})")

    
    # Rescale uncertainties using reduced chi-squared 

    sigma = df_filtered['ErrPolarizationD3'].values
    
    # Fit using curve_fit with uncertainties
    popt, pcov = curve_fit(linear_func, x, y_hard, sigma=sigma, absolute_sigma=True)
    
    # Extract best-fit parameters and their uncertainties
    m_fit, n_fit = popt
    m_err, n_err = np.sqrt(np.diag(pcov))
    
    # Reduced chi-squared of the weighted linear fit: chi^2 divided by the degrees of freedom
    residuals = (y_hard - linear_func(x, *popt)) / sigma
    dof = len(x) - len(popt)
    chi_squared_red = np.sum(residuals**2) / dof
    correction_factor = np.sqrt(chi_squared_red)
    
    # Inflate the uncertainties only when the fit indicates they are underestimated (factor > 1)
    if correction_factor > 1:
        df_cleaned['ErrPolarizationD3'] *= correction_factor
        log_message(f"[{filename}] Applied uncertainty correction factor: √(χ²_red) = {correction_factor:.2f}")
    else:
        log_message(f"[{filename}] No correction applied: √(χ²_red) = {correction_factor:.2f}")
    
    # Optional: log the fit results
    log_message(f"[{filename}] Fit results: m = {m_fit:.4e} ± {m_err:.4e}, n = {n_fit:.4e} ± {n_err:.4e}")


    def make_clean_name(filename: str) -> str:
        """
        Turn e.g.
          PolarizationD3_CaFeAl_13_7_6_24_2_MillerIndex_(0,0,2)_Filtered.txt
        into:
          CaFeAl_13_7_6_24_2_(0,0,2)
        and handle cases where filename contains '/' or '\\' (dates like DD/MM/YY).
        """
        s = str(filename).replace("/", "_").replace("\\", "_")  # prevent path splitting
        base = Path(s).stem  # remove extension if present
        if base.startswith("PolarizationD3_"):
            base = base[len("PolarizationD3_"):]
        if base.endswith("_Filtered"):
            base = base[:-len("_Filtered")]
        base = base.replace("MillerIndex_", "")
        return base
    
    def extended_area_plot_filename(filename: str) -> str:
        """EuAgAs_5_31_10_23_0_(3,0,0)_ExtendedArea.png"""
        return f"{make_clean_name(filename)}_ExtendedArea.png"
    # Extract values
    # Data
    T = df_filtered['DeltaTime'].values
    P_soft = df_filtered['SoftPolarizationD3'].values
    P_hard = df_filtered['PolarizationD3'].values
    Err = df_filtered['ErrPolarizationD3'].values if 'ErrPolarizationD3' in df_filtered.columns else np.zeros_like(P_soft)
    
    # Clean title + filename
    clean = make_clean_name(filename)
    save_name = extended_area_plot_filename(filename)  # ends with _ExtendedArea.png
    
    plt.figure(figsize=(10, 5))
    
    # Black points with error bars
    plt.scatter(T, P_hard, s=30, color="black", label="PolarizationD3", marker='o')
    if Err is not None:
        plt.errorbar(T, P_hard, yerr=Err, fmt='none', ecolor='black', alpha=0.6, capsize=2)
    
    # Blue linear fit
    fit = linear_func(T, m, n)
    plt.plot(T, fit, '-', color='blue', label="Linear Fit")
    
    # Bands: light blue (±N) and translucent green (±N*AcceptableMultiplier)
    if needed_N is not None:
        upper_narrow = fit + needed_N
        lower_narrow = fit - needed_N
        upper_wide   = fit + needed_N * AcceptableMultiplier
        lower_wide   = fit - needed_N * AcceptableMultiplier
    
        plt.fill_between(T, lower_narrow, upper_narrow, color='lightblue', alpha=0.35, label=f'Band ±{needed_N:.2e}')
        plt.fill_between(T, lower_wide,   upper_wide,   color='green',     alpha=0.18, label=f'Filter Band ±{(needed_N*AcceptableMultiplier):.2e}')
    
    # Labels
    plt.xlabel("DeltaTime")
    plt.ylabel("PolarizationD3")
    plt.title(f"{clean}_ExtendedArea")  # title matches saved name (sans .png)
    plt.grid(True, linestyle='--', alpha=0.5)
    plt.legend()
    plt.tight_layout()
    
    # --- Y limits that include error bars and bands ---
    vals_min = [np.nanmin(P_hard - Err), np.nanmin(P_hard + Err)]
    vals_max = [np.nanmax(P_hard - Err), np.nanmax(P_hard + Err)]
    if needed_N is not None:
        vals_min += [np.nanmin(lower_narrow), np.nanmin(lower_wide)]
        vals_max += [np.nanmax(upper_narrow), np.nanmax(upper_wide)]
    # Small margins
    ymin = np.nanmin(vals_min)
    ymax = np.nanmax(vals_max)
    pad = 0.02 * (ymax - ymin if np.isfinite(ymax - ymin) and (ymax - ymin) > 0 else 1.0)
    plt.ylim(ymin - pad, ymax + pad)
    
    # Save
    plot_path_hard = output_folder / save_name
    plot_path_hard.parent.mkdir(parents=True, exist_ok=True)
    plt.savefig(win_long_path(plot_path_hard), dpi=300, bbox_inches='tight')
    if ShowPlot:
        plt.show()
    plt.close()


    return df_cleaned



"""
4. ZIP FOLDER TREATMENT
"""

"""
To make this easier to use, you can simply download the "processed" files from the ILL database as zip files. To avoid duplications
and other issues, the zip files are not deleted. This is slower, because every file has to be reprocessed on each run, but it makes sure that everything works.
This code and the next will take all the zip files, extract them, remove duplicates using the name AND hash sha256.
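
A minimal sketch of the content-based duplicate check described above, using in-memory byte strings instead of real zip files.
The archive names and the helper keep_first_per_digest are hypothetical and only illustrate the idea: two downloads with
different names but identical content share the same SHA-256 digest, so only the first one is kept.

```python
import hashlib


def keep_first_per_digest(named_payloads):
    # named_payloads: list of (name, bytes) pairs standing in for zip files on disk.
    seen = {}      # digest -> first name seen with that content
    kept = []      # names that survive
    dropped = []   # names removed as content duplicates
    for name, payload in named_payloads:
        digest = hashlib.sha256(payload).hexdigest()
        if digest in seen:
            dropped.append(name)
        else:
            seen[digest] = name
            kept.append(name)
    return kept, dropped


# Hypothetical archives: the second one is a re-download of the first under another name.
archives = [
    ("exp_A.zip", b"same content"),
    ("exp_A (1).zip", b"same content"),
    ("exp_B.zip", b"other content"),
]
kept, dropped = keep_first_per_digest(archives)
print(kept)
print(dropped)
```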
"""
to_erase = [
    "CrystallineLog_Testing_Creation.txt",
    "Crystalline_CellID.txt",
    "CrystallineSeparatedFolder",
    "CrystallinePlotResults",
    "CrystallineMLDataBase",
    "CrystallineFailuresTest",
    "CrystallineDataBase",
    "CrystallineBadFiles"
]

for item in to_erase:
    path = os.path.abspath(item)  # full path
    if os.path.exists(path):
        try:
            if os.path.isfile(path):
                os.remove(path)
                log_message(f"Deleted file: {path}")
            elif os.path.isdir(path):
                shutil.rmtree(path)
                log_message(f"Deleted folder: {path}")
        except Exception as e:
            log_message(f" Could not delete {path}: {e}")
    else:
        log_message(f"Not found (skipped): {path}")


def file_hash(filepath, algo="sha256", block_size=65536):
    """Compute hash of a file (default SHA256)."""
    h = hashlib.new(algo)
    with open(win_long_path(filepath), "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()

folder = Path("D3Files")  # Folder where all the raw data folders reside
zip_files = [f.name for f in folder.glob("*.zip")]  # List all zip files

log_message(f"Reading ZIP files. Checking for true duplicates by content...")
base_names = set()
seen_hashes = {}

for zip_file in zip_files:
    zip_path = folder / zip_file
    name, ext = os.path.splitext(zip_file)

    # Compute hash of this file
    filehash = file_hash(zip_path)

    if filehash in seen_hashes:
        log_message(f"Duplicate confirmed by hash! Removing: {zip_file} (same as {seen_hashes[filehash]})")
        os.remove(win_long_path(zip_path))   # Safe long path
    else:
        seen_hashes[filehash] = zip_file
        base_names.add(name)

log_message(f"\n All duplicates (by content) removed. Begin unzipping...\n")

# Refresh zip_files list after removals
zip_files = [f.name for f in folder.glob("*.zip")]

# Unzip each archive into its own folder (the zip files themselves are kept)
for zip_file in zip_files:
    zip_path = folder / zip_file
    if zipfile.is_zipfile(win_long_path(zip_path)):
        folder_name = sanitize(zip_file.stem if isinstance(zip_file, Path) else os.path.splitext(zip_file)[0])
        extract_dir = folder / folder_name
        log_message(f"Unzipping: {zip_file} -> {extract_dir}")
        try:
            with zipfile.ZipFile(win_long_path(zip_path), 'r') as zip_ref:
                zip_ref.extractall(win_long_path(extract_dir))
        except Exception as e:
            log_message(f"WARNING: Error extracting {zip_file}: {e}")
    else:
        log_message(f"WARNING: Skipping invalid zip file: {zip_file}")

log_message(f"\n Finished unzipping. Experiments stored in individual folders next to the zip files\n")

# --- Extraction of .fli files ---
source_folder = folder
database_folder = Path("CrystallineDataBase")
database_folder.mkdir(parents=True, exist_ok=True)

log_message(f"\n\n\n Scanning all folders for .fli files...\n ")
for item in source_folder.iterdir():
    if item.is_dir():
        log_message(f"Processing folder: {item.name}")
        for root, dirs, files in os.walk(win_long_path(item)):
            for file in files:
                if file.lower().endswith(".fli"):
                    src_file = Path(root) / file
                    dest_file = database_folder / file

                    # Handle duplicate names
                    counter = 1
                    base_name, ext = os.path.splitext(file)
                    while dest_file.exists():
                        dest_file = database_folder / f"{base_name}_{counter}{ext}"
                        counter += 1

                    log_message(f"Copying: {src_file} -> {dest_file}")
                    shutil.copy2(win_long_path(src_file), win_long_path(dest_file))
        
        # After processing all .fli files, delete the original folder
        log_message(f"Deleting folder: {item}")
        shutil.rmtree(win_long_path(item))

log_message(f"\n All .fli files collected, sent from folder {source_folder} to folder {database_folder}.\n")

"""
6 SEPARATION OF FLI FILES ACCORDING TO EXPERIMENTS
"""

"""
Some fli files have the wrong structure (they are not polarization measurements), and the ones that are polarization files can hold more than one experiment per file.
For every fli file we will read the contents and try to find the header (a string that takes up an entire line). This marks the beginning of an experiment.
If there are numerical values before the first header, it means that the file was saved before something in the experiment was changed. These data rows will be skipped.
A correct fli file will have the following structure:
    polariser cell info ge18004 pressure/init. polar 2.29 0.79 initial date/time 17 09 23 @ 10:39
    37391   4.000   0.000   1.000 18/09/23 06:20:44     155.03  +z +z     0.8391    0.0156   11.4270    1.2031     120.00
    37392   4.000   0.000   1.000 18/09/23 06:26:49     155.05  +x +x     0.8255    0.0110   10.4610    0.7211     300.00
    ...

Which corresponds to the following information:
    String:'polariser cell info', CellID, String:'pressure/init. polar', Pressure(unknown units), InitialLabPolarization, String:'date/time', Day, Month, Year, String:'@', Hour:Minute
    Measurement Number, First Miller Index, Second Miller Index, Third Miller Index, Date Of Measurement, Time Of Measurement, Temperature [Kelvin],
                        Direction Of Polarization In The First Polarizer Cell (Direction of the quantum operator S_x,S_y,S_z), Direction Of Polarization In The Second Polarizer Cell,
                        Polarization, Polarization uncertainty, Flipping Ratio, FlippingRatio Uncertainty, Duration of the measurement
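
A minimal sketch of how the two example lines above could be split into the fields just listed. It assumes whitespace-separated
tokens in exactly the order of the example; parse_header and parse_row are hypothetical names, not functions of this notebook,
and the real parsing may differ.

```python
HEADER_LINE = "polariser cell info ge18004 pressure/init. polar 2.29 0.79 initial date/time 17 09 23 @ 10:39"
ROW_LINE = "37391   4.000   0.000   1.000 18/09/23 06:20:44     155.03  +z +z     0.8391    0.0156   11.4270    1.2031     120.00"


def parse_header(line):
    # Token positions follow the example header; other layouts are not handled.
    t = line.split()
    return {
        "CellID": t[3],
        "Pressure": float(t[6]),
        "InitialLabPolarization": float(t[7]),
        "Day": int(t[10]),
        "Month": int(t[11]),
        "Year": int(t[12]),
        "Time": t[14],
    }


def parse_row(line):
    # Token positions follow the example data row (14 whitespace-separated fields).
    t = line.split()
    return {
        "MeasurementNumber": int(t[0]),
        "MillerIndices": (float(t[1]), float(t[2]), float(t[3])),
        "Date": t[4],
        "Time": t[5],
        "Temperature": float(t[6]),
        "PolarizerDirections": (t[7], t[8]),
        "Polarization": float(t[9]),
        "ErrPolarization": float(t[10]),
        "FlippingRatio": float(t[11]),
        "ErrFlippingRatio": float(t[12]),
        "Duration": float(t[13]),
    }


header = parse_header(HEADER_LINE)
row = parse_row(ROW_LINE)
print(header["CellID"], row["MillerIndices"], row["Polarization"])
```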

The direction +z is chosen to be pointing away from the ground.
The direction +x is the direction of the flow of neutrons, i.e, the direction of Scattering.
The direction +y is the orthogonal to both of them.
D3 uses two polariser cells, one between the reactor and the sample and a second between the sample and the sensor. The first one guarantees that only neutrons with the correct spin direction
interact with the sample. The second one guarantees that only the neutrons whose spin direction is unchanged after interacting with the sample are detected by the sensor. This is
the reason why the directions (+z,+y,+x,-z,-y,-x) appear twice.
We have considered that temperature is not a relevant factor and that the flipping ratio carries no information that the polarization does not already provide.
First, the code locates the first header (ignoring everything before it) and saves all the data that follows (until the next header or the end of the document) in a file with the suffix Arrays_{i} (i is the number of headers already processed in that fli file)
Second, it will save the header as a file with the suffix Parameters.
Third, the header row and the columns Measurement Number, Temperature, Flipping Ratio, FlippingRatio Uncertainty and Duration of the measurement are erased
Fourth, as all the polarization measurements use the +z,+z combination, all other combinations are erased
Fifth, not all data from all Miller index combinations are polarization measurements. Even some of the ones that are polarization measurements have been tampered with (playing with magnetic fields, for example).
This means that there needs to be a way to select the correct combination. For starters, non-integer Miller indices are not used for measurements with the samples (they need to be discarded)
The integer Miller index combinations are then put to the test by all the functions defined before.
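
The fourth and fifth steps can be sketched with plain Python as follows. The helper is_integer_triplet and the three rows are made up for illustration; the real code applies the same two conditions to a pandas DataFrame with np.isclose.

```python
import math

def is_integer_triplet(hkl, tol=1e-8):
    # Hypothetical helper: True when every Miller index is an integer within tol.
    return all(math.isclose(v, round(v), rel_tol=0.0, abs_tol=tol) for v in hkl)

rows = [
    # (h, k, l, first cell, second cell, polarization), made-up values
    (4.0, 0.0, 1.0, '+z', '+z', 0.84),
    (4.0, 0.0, 1.0, '+x', '+x', 0.83),  # dropped: not the (+z,+z) combination
    (0.5, 0.0, 1.0, '+z', '+z', 0.80),  # dropped: non-integer Miller index
]

kept = [
    r for r in rows
    if r[3] == '+z' and r[4] == '+z' and is_integer_triplet(r[:3])
]
```

Only the first row survives both conditions. Choosing which integer combination is the actual polarization measurement is left to the functions defined before.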
For every successful experiment we will output:
    Image:  "PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}_Multiplier={Multiplier}.png" in PlotResults. Shows the plot with the extended area with the raw data
    Image:  "PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}_Multiplier={Multiplier}_Soft.png" in PlotResults. Shows the plot with the extended area with the filtered data
    Image:  "PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}_Filtered.txt_plot_Derivatives.png" in PlotResults. Shows the evolution of the "derivatives"
    Image:  "PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}_Filtered.txt_N_{N}_ManualInterval.png" in PlotResults. Shows the plot with the non-extended area
    Txt:    "PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}.txt" in MLDataBase. It contains the four data columns (DeltaTime, PolarizationD3, SoftPolarizationD3, ErrPolarizationD3)
    Txt:    "PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}_Parameters.txt" in MLDataBase. It contains the parameters (CellID, Pressure, LabPolarization, LabTime)
These plots are not necessary but are saved for the user to know what all the files look like.
The files that are wrong or useless when all is done are the following:
    Fli:    "{folder_name}_Arrays_{i}.fli" in SeparatedFolder/{folder_name}. It still has the header and useless columns. It is the piece of the fli file for every chunk, i.e. every recorded experiment (correct or incorrect)
    Txt:    "PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}.txt" in SeparatedFolder/{folder_name}. It is the same as the one in MLDataBase (a duplicate)
    Image:  "PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}.png" in SeparatedFolder/{folder_name}. It plots (with error bars) PolarizationD3
    Image:  "PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}_Combined.png" in SeparatedFolder/{folder_name}. It plots (with error bars) both PolarizationD3 and SoftPolarizationD3
    Image:  "PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}_Softened.png" in SeparatedFolder/{folder_name}. It plots (with error bars) SoftPolarizationD3
    Folder: "FailuresTest" contains all the graphs of the data sets that were considered not worthy but had more points than the ones saved. Check them if your experiment was not properly added
    Folder: "DataBase" has the raw fli files. Once the code has been used they are no longer important (if you don't find the folder I may have added a line of code to erase it. Sorry in advance for any inconvenience)
"""
# Path to the original folder and the final folder
DataBase = Path('CrystallineDataBase')
output_base = Path('CrystallineSeparatedFolder')

# List all .fli files in that folder, prepare folders
FileNameList = [f.name for f in DataBase.glob('*.fli')]
polyorder = 2
default_window_length = 5
SeparatedFolder = Path("CrystallineSeparatedFolder")
BadFilesFolder = Path("CrystallineBadFiles")
MLDataBaseFolder = Path("CrystallineMLDataBase")
BadFilesFolder.mkdir(exist_ok=True, parents=True)
MLDataBaseFolder.mkdir(exist_ok=True, parents=True)
log_message(f"\n\n Files in the database that the code will try to process\n {FileNameList}\n")

for FileName in FileNameList:
    """READ THE FILE AND SEPARATE IT INTO EACH EXPERIMENT USING THE 'polariser cell info' HEADER"""
    # 1- Open file
    folder_name = FileName.replace(".fli", "")
    output_folder = output_base / folder_name
    file_path = DataBase / FileName
    output_folder.mkdir(parents=True, exist_ok=True)

    with open(win_long_path(file_path), "r", encoding='utf-8', errors='ignore') as f:
        lines = f.readlines()

    
    # 2- Locate the header with CellID, Pressure, etc. Chunks are the data rows sandwiched between two 'polariser cell info' strings
    chunks = []
    current_chunk = []
    started = False  # flag to know when we found first header

    for line in lines:
        if line.strip().startswith("polariser cell info"):  # All before polariser cell info will be forgotten
            if started and current_chunk:
                chunks.append(current_chunk)
            current_chunk = [line]
            started = True
        else:
            if started:
                current_chunk.append(line)
            # else: we are before the first header, so ignore these lines

    if not started:
        log_message(f" File '{FileName}' does NOT contain any 'polariser cell info' header. Skipping.\n")
        continue
    else:
        log_message(f" File '{FileName}' contains at least one 'polariser cell info' header.")

    if current_chunk:
        chunks.append(current_chunk)

    # 3- Save .fli files for every correct chunk
    base_name = FileName.replace(".fli", "")  # remove .fli for clean filenames
    log_message(f"Creating all the Array files \n")
    for i, chunk in enumerate(chunks):
        fli_filename = f"{base_name}_Arrays_{i}.fli"
        fli_path = output_folder / fli_filename  # full path
        with open(win_long_path(fli_path), "w", encoding='utf-8') as f_out:
            f_out.writelines(chunk)
    # 4- As CellID can be exchanged with real parameters, it is written in an independent file
    cell_id_file = Path.cwd() / "Crystalline_CellID.txt"
    try:
        with open(win_long_path(cell_id_file), 'r', encoding='utf-8') as file:
            seen_strings = set(line.strip() for line in file)
    except FileNotFoundError:
        seen_strings = set()
    
    # 5- Open each Array file and work with it (The Array file still has the header)
    with open(win_long_path(cell_id_file), 'a', encoding='utf-8') as file:
        for i in range(len(chunks)):
            FLI_filename = f"{base_name}_Arrays_{i}.fli"  # Name of the Array file
            FLI_path = output_folder / FLI_filename  # Full path
            if not FLI_path.exists():
                log_message(f"WARNING: Array file does not exist: {FLI_path}")
                continue
            df = pd.read_csv(win_long_path(FLI_path), sep=r'\s+', header=None, on_bad_lines='skip')  # Read file
            log_message(f"Reading {FLI_path}, removing ***WARNING No centering scan found ")
            warning_str = "No centering scan found"

            #5.1 Join the first 5 columns as strings (the warning line spans 5 tokens) and drop the rows containing this phrase (it is not important for us)
            df = df[~df.iloc[:, :5].astype(str).agg(' '.join, axis=1).str.contains(warning_str, regex=False)]
            
            #5.2 Extract useful information from the header. Hopefully, CellID, Pressure, LabPolarization, Year, Month, Day, time of lab measurement before first experiment measurement (negative time) will be stored locally
            log_message(f"Header Information Extraction...")
            CellID =          df.iloc[0].tolist()[3]
            Pressure =        df.iloc[0].tolist()[6]
            LabPolarization = df.iloc[0].tolist()[7]

            try:
                HM, DD, MM, YY = df.iloc[0].tolist()[14], int(df.iloc[0].tolist()[10]), int(df.iloc[0].tolist()[11]), int(df.iloc[0].tolist()[12])
                Day_Ref = f"{DD:02d}/{MM:02d}/{YY:02d}"
                dt = Time(Day_Ref, HM)
            except Exception as e:
                log_message(f"Skipping chunk {i} of {file_path} because of invalid header data: {e}")
                continue

            
            #5.3 All redundant/useless information is removed
            log_message(f"Removing Measurement Index, Temperature, Flipping Ratio, Uncertainty of Flipping Ratio and Time between measurements,...")
            df = df.iloc[1:].reset_index(drop=True)
            df = df.drop(df.columns[0], axis=1)
            df = df.drop(df.columns[5], axis=1)
            df = df.drop(df.columns[9], axis=1)
            df = df.drop(df.columns[9], axis=1)
            df = df.drop(df.columns[9], axis=1)
            df = df.drop(df.columns[9], axis=1)
            log_message(f"Saving only polarization values for the Spin Directions wanted in both Polarizer Cells, i.e. (+z,+z)")

            #5.4 Keep only rows where both are +z
            df = df[(df[7] == '+z') & (df[8] == '+z')].copy()
            if df.empty:
                log_message(f"No valid '+z' rows in file {base_name}_Arrays_{i}.fli, skipping")
                continue  # skip to the next chunk
            df = df.drop(df.columns[[5,6]], axis=1, errors='ignore')
            #5.5 Convert the Miller index columns to numbers and keep only the rows where all three are integers. Values that fail to parse become NaN and are dropped. np.isclose uses atol=1e-8 on top of its default relative tolerance (rtol=1e-5)
            cols_to_convert = [1, 2, 3]
            df[cols_to_convert] = df[cols_to_convert].apply(pd.to_numeric, errors='coerce').astype(float)            
            mask = np.isclose(df[cols_to_convert], np.round(df[cols_to_convert]), atol=1e-8)
            df = df[mask.all(axis=1)].copy()
            log_message(f"All other Spin directions removed. All non-integer Miller Indices removed. Adding DeltaTime")
            
            #5.5b The time columns are converted into elapsed time, the reference being the first +z,+z measurement that has survived at this point
            if df.shape[0] < 2:
                log_message(f"Not enough valid rows after filtering, skipping chunk")
                continue
            df['DeltaTime'] = df.apply(
                lambda row: deltatime(df[4].iloc[0], df[5].iloc[0], row[4], row[5]), axis=1 )
            ref_dt = Time(df[4].iloc[0], df[5].iloc[0])
            LabTime = int((dt - ref_dt).total_seconds())

            #5.6 Rename the columns PolarizationD3, ErrPolarizationD3 (the polarization column and its uncertainty). The other one with name is DeltaTime. The rest are numbers (will be erased).
            #Also we remove the time strings (with DeltaTime they have no new information)
            log_message(f"Renaming PolarizationD3 and ErrPolarizationD3")
            df.rename(columns={
                df.columns[5]: 'PolarizationD3',
                df.columns[6]: 'ErrPolarizationD3'
            }, inplace=True)
            df.drop(columns=[df.columns[3], df.columns[4]], inplace=True)
            log_message(f"Dropped Time Strings")

            
            #5.7 Begin filtering and softening with previous functions
            log_message(f"Begin removal of Bad files and softening with Savitzky-Golay filter")
            filtered_df, PrettyCombination = filter_best_combination(i,
                df,
                filter_func=savgol_filter,
                filter_column_idx=df.columns.get_loc('PolarizationD3'),
                new_column_name='SoftPolarizationD3',
                filter_params_func=savgol_params_func,
                min_points_required=3,
                tolerance=1e-8,
                time_column_idx=df.columns.get_loc('DeltaTime'),
                error_column_idx=df.columns.get_loc('ErrPolarizationD3')
                )
            #If nothing survived the filters/purge then use 'continue' and go for the next experiment
            if filtered_df is None or PrettyCombination is None:
                log_message(f"Chunk {i}: No suitable combination found. Skipping to next chunk or file.")
                log_message(f"_______________________________________________________________\n")
                continue  # skip to next chunk
            
            #5.8 Removal of Miller indices (we have all the information they could give us)
            log_message(f"Removing Miller Indices columns")
            #log_message(filtered_df)
            filtered_df = filtered_df.iloc[:, 3:]
            desired_order = ["DeltaTime", "PolarizationD3", "SoftPolarizationD3", "ErrPolarizationD3"]

            
            # 5.9 Remove the points that won't be useful for the ML algorithm
            columns_to_save = [col for col in desired_order if col in filtered_df.columns]  # Keep only the columns that exist
            df_SEMIFINAL = filtered_df[columns_to_save].copy()
            df_FINAL = filtered_df = RemoveOutcast_FixUncertainty(
                df_SEMIFINAL,
                PrettyCombination,
                filename=f"PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}",
                AcceptableMultiplier=1.3,
                ShowPlot=False
            )
            
            # 5.10 Plot the successful experiments
            log_message(f"Plot of Data. PolarizationD3_{folder_name}_{DD}/{MM}/{YY}_{i}_MillerIndex_{PrettyCombination}")
            plt.figure(figsize=(10, 5))
            T = pd.to_numeric(df_FINAL["DeltaTime"], errors='coerce')
            P = pd.to_numeric(df_FINAL["PolarizationD3"], errors='coerce')
            Err = pd.to_numeric(df_FINAL["ErrPolarizationD3"], errors='coerce')
            P_soft = pd.to_numeric(df_FINAL["SoftPolarizationD3"], errors='coerce')
            
            # Scatter plot
            plt.scatter(T, P, linewidth=1, label='Original') 
            plt.plot(T, P, linestyle='--', color='blue', alpha=0.7)
            plt.errorbar(T, P, yerr=Err, fmt='none', ecolor='gray', alpha=0.5)
            
            plt.xlabel("DeltaTime")
            plt.ylabel("PolarizationD3")
            plot_filename = f"PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}.png"
            plt.title(plot_filename)
            plt.ylim(np.min(P - Err), np.max(P + Err))
            plt.yticks(np.linspace(np.min(P - Err), np.max(P + Err), 10))
            plt.grid(True, linestyle='--', alpha=0.5)
            plt.tight_layout()
            
            plot_path = win_long_path(output_folder / plot_filename)
            plt.savefig(plot_path, dpi=300, bbox_inches='tight')
            
            if ShowPlot:
                plt.show()
            plt.close()
            
            # Softened data plot
            log_message(f"Plot of Filtered Data. PolarizationD3_{folder_name}_{DD}/{MM}/{YY}_{i}_MillerIndex_{PrettyCombination}_Softened")
            plt.figure(figsize=(10, 5))
            plt.scatter(T, P_soft, linewidth=1, label='Filtered')
            plt.plot(T, P_soft, linestyle='--', color='green', alpha=0.7)
            plt.errorbar(T, P_soft, yerr=Err, fmt='none', ecolor='gray', alpha=0.5)
            
            plt.xlabel("DeltaTime")
            plt.ylabel("SoftPolarizationD3")
            plot_filename_soft = f"PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}_Softened.png"
            plt.title(plot_filename_soft)
            plt.ylim(np.min(P_soft - Err), np.max(P_soft + Err))
            plt.yticks(np.linspace(np.min(P_soft - Err), np.max(P_soft + Err), 10))
            plt.grid(True, linestyle='--', alpha=0.5)
            plt.tight_layout()
            
            plot_path_soft = win_long_path(output_folder / plot_filename_soft)
            plt.savefig(plot_path_soft, dpi=300, bbox_inches='tight')
            
            if ShowPlot:
                plt.show()
            plt.close()
            
            # Comparison plot
            log_message(f"Comparison Plot. PolarizationD3_{folder_name}_{DD}/{MM}/{YY}_{i}_MillerIndex_{PrettyCombination}_Comparison")
            plt.figure(figsize=(10, 5))
            plt.scatter(T, P, linewidth=1, color='blue', alpha=0.6, label='Original')
            plt.plot(T, P, linestyle='--', color='blue', alpha=0.5)
            plt.scatter(T, P_soft, linewidth=1, color='green', alpha=0.6, label='Filtered')
            plt.plot(T, P_soft, linestyle='--', color='green', alpha=0.5)
            plt.errorbar(T, P, yerr=Err, fmt='none', ecolor='gray', alpha=0.3)
            plt.errorbar(T, P_soft, yerr=Err, fmt='none', ecolor='gray', alpha=0.3)
            
            plt.xlabel("DeltaTime")
            plt.ylabel("Polarization")
            plot_filename_combined = f"PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}_Combined.png"
            plt.title(plot_filename_combined)
            min_y = min(np.min(P - Err), np.min(P_soft - Err))
            max_y = max(np.max(P + Err), np.max(P_soft + Err))
            plt.legend(loc='best')
            plt.ylim(min_y, max_y)
            plt.yticks(np.linspace(min_y, max_y, 10))
            plt.grid(True, linestyle='--', alpha=0.5)
            plt.tight_layout()
            
            plot_path_combined = win_long_path(output_folder / plot_filename_combined)
            plt.savefig(plot_path_combined, dpi=300, bbox_inches='tight')
            
            if ShowPlot:
                plt.show()
            plt.close()
            
            # 5.11 Save the files
            log_message(f"Finally we save the chunk")
            csv_filename = f"PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}.txt"
            Parameter_filename = f"PolarizationD3_{folder_name}_{DD}_{MM}_{YY}_{i}_MillerIndex_{PrettyCombination}_Parameters.txt"
            csv_path = output_folder / csv_filename
            
            # Save CSV
            df_FINAL['DeltaTime'] = df_FINAL['DeltaTime'] - df_FINAL['DeltaTime'].iloc[0]

            df_FINAL.to_csv(win_long_path(csv_path), index=False, sep=',')
            
            # Save a copy to ML database
            ml_txt_path = MLDataBaseFolder / csv_filename
            df_FINAL.to_csv(win_long_path(ml_txt_path), index=False, sep=',')
            log_message(f"Saved: {csv_filename}")
            
            # Parameter file
            ml_param_path = MLDataBaseFolder / Parameter_filename
            with open(win_long_path(ml_param_path), 'w', encoding='utf-8') as f:
                f.write("CellID,Pressure,LabPolarization,LabTime\n")
                f.write(f"{CellID},{Pressure},{LabPolarization},{LabTime}")
            log_message(f"Saved: {Parameter_filename}")
            log_message(f"Parameter and Array files saved to ML database: {MLDataBaseFolder}\n_______________________________________________________________\n\n")
            
    # 5.12 Remove the temporary Array files once every chunk of this file has been processed.
    # This runs after the chunk loop: deleting them inside the loop would make every later chunk be skipped as missing.
    for j in range(len(chunks)):
        temp_filename = f"{base_name}_Arrays_{j}.fli"
        temp_path = output_folder / temp_filename
        try:
            temp_path.unlink()  # delete the file
        except FileNotFoundError:
            pass  # skip if missing

    log_message(f"Finished processing {len(chunks)} chunks from file called {FileName}.")

    # Remove empty folder
    if output_folder.exists() and not any(output_folder.iterdir()):
        output_folder.rmdir()
        log_message(f"Removed empty folder: {output_folder}")

    log_message('\n\n')


"""
7. CELLID FILE PROCESSING
"""
ml_database_folder = Path("CrystallineMLDataBase")

# Find all txt files whose names end with Parameters.txt
parameter_files = list(ml_database_folder.glob('*Parameters.txt'))

log_message(f"Found {len(parameter_files)} parameter files.")

unique_cell_ids = []
seen = set()

for filepath in parameter_files:
    try:
        with open(win_long_path(filepath), 'r', encoding='utf-8') as f:
            lines = f.readlines()
            if len(lines) >= 2:
                second_row = lines[1].strip()
                parts = second_row.split(',')
                if parts:
                    cell_id = parts[0]
                    if cell_id not in seen:
                        seen.add(cell_id)
                        unique_cell_ids.append(cell_id)
    except Exception as e:
        log_message(f"Failed to read {filepath}: {e}")

# Write to Crystalline_CellID.txt
cellid_file = Path.cwd() / "Crystalline_CellID.txt"
with open(win_long_path(cellid_file), "w", encoding='utf-8') as f:
    for cell_id in unique_cell_ids:
        f.write(f"{cell_id}\n")

log_message(f"Saved {len(unique_cell_ids)} unique cell IDs to {cellid_file.name}.")

# Remove the separated folder
folder_to_delete = Path.cwd() / "CrystallineSeparatedFolder"
if folder_to_delete.exists():
    shutil.rmtree(win_long_path(folder_to_delete))
    log_message(f"Folder '{folder_to_delete}' has been deleted.")
else:
    log_message(f"Folder '{folder_to_delete}' does not exist.")

"""
Removal of Duplicates
"""
hash_map = defaultdict(list)

def file_sha256(filepath, block_size=65536):
    """Compute SHA256 hash of a file (safe for large files)."""
    sha256 = hashlib.sha256()
    with open(win_long_path(filepath), "rb") as f:
        while chunk := f.read(block_size):
            sha256.update(chunk)
    return sha256.hexdigest()

# Scan all .txt files (only base files without '_Parameters')
for root, _, files in os.walk(win_long_path(ml_database_folder)):
    for file in files:
        if file.lower().endswith(".txt") and "_parameters" not in file.lower():
            path = Path(root) / file
            file_hash = file_sha256(path)
            hash_map[file_hash].append(path)

# Report & delete duplicates
duplicates_found = False
for file_hash, paths in hash_map.items():
    if len(paths) > 1:
        duplicates_found = True
        log_message(f"\nDuplicate group (hash={file_hash}):")
        log_message(f"   Keeping: {paths[0]}")

        # All but the first are duplicates
        for p in paths[1:]:
            base_name, ext = os.path.splitext(p)
            param_file = Path(f"{base_name}_Parameters{ext}")

            try:
                os.remove(win_long_path(p))
                log_message(f"   Deleted duplicate base file: {p}")
            except Exception as e:
                log_message(f"   Could not delete base file {p}: {e}")

            # Also try deleting the corresponding parameter file
            if param_file.exists():
                try:
                    os.remove(win_long_path(param_file))
                    log_message(f"   Deleted parameter file: {param_file}")
                except Exception as e:
                    log_message(f"   Could not delete parameter file {param_file}: {e}")

if not duplicates_found:
    log_message("No duplicates found in MLDataBase!")
else:
    log_message("\n Duplicate cleanup complete!")
