# FTIR Data Analysis Main Workflow
This notebook guides you through the main steps of the FTIR data analysis workflow, including file renaming, dataframe creation or modification, and baseline correction parameter management.

## Quick-Run
Run the whole workflow with every setting left at its default and minimal input.

In [None]:
import os
import pandas as pd
from File_Info_Gathering import file_info_extractor
from Dataframe_Modification import baseline_selection_quick, prompt_parameters

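# Path to the main DataFrame CSV; left as None here, like the other settings
dataframe_path = None

# Extract file information and build or append to the main DataFrame
file_info_extractor(file_types=None, separators=None, material_terms=None, conditions_terms=None, root_dir=None, append_missing=None, save_missing_txt=None, csv_path=dataframe_path)

# Baseline selection
baseline_selection_quick(dataframe_path, baseline_function=None, parameter_dictionary=None)

# Baseline correction
# NOTE: baseline_correction is not imported in this cell; import it from the module that defines it before running.
baseline_correction(dataframe_path)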

## Import Statements
Import necessary libraries and modules for data analysis and visualization.

In [None]:
import os
import pandas as pd
from Dataframe_Modification import baseline_selection, cast_param_types, get_default_params, parse_parameters, prompt_parameters
from Fixing_File_Names import batch_rename_files
from File_Info_Gathering import file_info_extractor

## File Renaming
You can optionally rename files in your dataset.

This script scans a specified root directory and its subdirectories and renames the files it finds by replacing spaces and/or specified words in the filenames (e.g., replacing spaces with underscores). Folder names are left unchanged, except that dates in folder names can optionally be converted to ISO format (e.g., 2025-09-18). Use this tool if your files follow inconsistent naming conventions that may cause issues in downstream processing.
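
The renaming rules described above can be sketched on plain strings, without touching the file system. This is only an illustration, not the code inside `batch_rename_files`: the helper names `parse_pairs`, `clean_filename`, and `iso_date` are hypothetical, and the input date layout (MM-DD-YYYY) is an assumption.

```python
import re

# Hypothetical helpers for illustration; not the actual batch_rename_files code.

def parse_pairs(pairs_input):
    """Turn "old1:new1,old2:new2" into [("old1", "new1"), ("old2", "new2")]."""
    pairs = []
    for item in pairs_input.split(","):
        if ":" in item:
            old, new = item.split(":", 1)
            pairs.append((old.strip(), new.strip()))
    return pairs

def clean_filename(name, character_to_use=None, pairs=()):
    """Replace the listed words first, then spaces, in a single file name."""
    for old, new in pairs:
        name = name.replace(old, new)
    if character_to_use is not None:
        name = name.replace(" ", character_to_use)
    return name

# Assumed input date layout: MM-DD-YYYY (other layouts would need other patterns).
_US_DATE = re.compile(r"(?<!\d)(\d{2})-(\d{2})-(\d{4})(?!\d)")

def iso_date(name):
    """Rewrite MM-DD-YYYY dates inside a folder name as YYYY-MM-DD."""
    return _US_DATE.sub(r"\3-\1-\2", name)

pairs = parse_pairs("sample:smp,final:v2")
print(clean_filename("Si sample final.dpt", "_", pairs))  # Si_smp_v2.dpt
print(iso_date("scans_09-18-2025"))                       # scans_2025-09-18
```

Working on strings first makes it easy to preview the new names before any file is actually renamed.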

In [None]:
# Set directory to rename folders and files within (e.g., r"C:\Users\user1\folder1")
directory = None
# If you want to replace spaces in filenames, set replace_spaces to True and set character_to_use to the desired separator (e.g., "_")
replace_spaces = None
character_to_use = None
# If you want to convert all dates in the directory names to ISO format (YYYY-MM-DD), set iso_date_rename to True
iso_date_rename = None
# If you want to replace other specified words in filenames, set file_rename to True and provide pairs_input (e.g., "old1:new1,old2:new2")
file_rename = None
pairs_input = None
# If any of these parameters are set to None, you will be prompted for input (may result in multiple prompts and/or minor formatting issues)
# Rename files in the specified directory
batch_rename_files(directory=directory, replace_spaces=replace_spaces, character_to_use=character_to_use, iso_date_rename=iso_date_rename, file_rename=file_rename, pairs_input=pairs_input)


## File Info Extraction
Gathers file information and builds the main data structure for analysis. Repeated uses can append new data into the DataFrame.
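
As a rough picture of how terms might be found in a filename, the sketch below splits the name on the chosen separators and looks for the material and conditions terms among the pieces. The helper names are hypothetical and the logic is an assumption, not the implementation of `file_info_extractor`. In particular, matching conditions terms case-sensitively is a guess, since only material terms are documented as case-insensitive.

```python
import os
import re

# Hypothetical helpers for illustration; not the actual file_info_extractor code.

def split_terms(filename, separators):
    """Split a file name (without its extension) on any of the separators."""
    stem, _ext = os.path.splitext(os.path.basename(filename))
    if not separators:
        return [stem]
    pattern = "|".join(re.escape(sep) for sep in separators)
    return [part for part in re.split(pattern, stem) if part]

def find_term(parts, terms, case_sensitive=False):
    """Return the first listed term found among the parts, else None."""
    for term in terms:
        for part in parts:
            if case_sensitive:
                matched = part == term
            else:
                matched = part.lower() == term.lower()
            if matched:
                return term
    return None

def extract_info(filename, separators, material_terms, conditions_terms):
    """Build one record; a None value marks missing information."""
    parts = split_terms(filename, separators)
    return {
        "file": filename,
        "material": find_term(parts, material_terms),
        "conditions": find_term(parts, conditions_terms, case_sensitive=True),
    }

info = extract_info("si_A3_2025-09-18.dpt", ["_", " "], ["Si", "Glass"], ["A3", "B4"])
print(info)
```

A record with a `None` entry corresponds to a file with missing information, which is the case the `append_missing` and `save_missing_txt` settings deal with.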

In [None]:
# Set directory containing files to analyze (e.g., r"C:\Users\user1\folder1")
directory = None
# Set file types to include (e.g., [".dpt", ".txt", ".csv"])
file_types = None
# Set separators to use when finding terms within filenames (e.g., ["_", " "])
separators = None
# Set material terms to search for in filenames (e.g., ["Si", "Perovskite", "Glass"]) (case-insensitive)
material_terms = None
# Set conditions terms to search for in filenames (e.g., ["A3", "A4", "B3", "B4"])
conditions_terms = None
# Set append_missing to False to add only files which have all required information, or True to add files even if some information is missing (may lead to issues downstream)
append_missing = None
# Set save_missing_txt to True to save a text file listing those files with missing information (will be saved in current working directory)
save_missing_txt = None
# Set dataframe_path to the path of the existing CSV file to append to or where the new CSV will be saved (e.g., r"C:\Users\user1\dataframe.csv")
# If set to just a filename (e.g., "dataframe.csv"), it will be saved in the current working directory
dataframe_path = None
# If any of these parameters are set to None, you will be prompted for input (may result in multiple prompts and/or minor formatting issues)

# Extract File Information and build or append to the main DataFrame
file_info_extractor(directory=directory, file_types=file_types, separators=separators, material_terms=material_terms, conditions_terms=conditions_terms, append_missing=append_missing, save_missing_txt=save_missing_txt, dataframe_path=dataframe_path)

## Baseline Correction Options
You can choose the baseline correction function and its parameters for your dataframe. This step allows you to fine-tune how baseline correction is applied to your FTIR data.

Baseline Options:

'GIFTS': fits a straight line to the data and iteratively discards points that do not fit the line well. This is an asymmetric least squares (ALS) method. [Pros: fast] [Cons: unreliable accuracy]

'IRSQR': iterative reweighted spline quantile regression. Uses penalized splines and iteratively reweighted least squares to perform quantile regression. [Pros: decent accuracy] [Cons: middling speed]

'FABC': fully automatic baseline correction. Uses a first-derivative approximation of the data to identify and then ignore peak regions, then fits the baseline regions using Whittaker smoothing. [Pros: handles noise well, decent accuracy] [Cons: middling speed]

'Manual': set "anchor points" for each of your materials using the built-in tool. This creates a list of wavenumber values that should lie in baseline regions for every scan of that material. A cubic spline is then interpolated through each scan's values at those points. [Pros: customizable, accurate] [Cons: requires manual entry for each material type]
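
To make the 'Manual' option more concrete, here is a minimal pure-Python sketch of a baseline interpolated through anchor points. It is not the notebook's implementation. The function names are hypothetical, the spline uses natural boundary conditions (an assumption, since the boundary conditions are not stated here), and the wavenumbers are assumed to be in ascending order.

```python
from bisect import bisect_right

def natural_cubic_spline(xs, ys):
    """Return f(x) for the natural cubic spline through (xs, ys); xs ascending."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3 * (ys[i + 1] - ys[i]) / h[i] - 3 * (ys[i] - ys[i - 1]) / h[i - 1]
    # Forward sweep of the tridiagonal system for the quadratic coefficients c.
    l = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        l[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / l[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / l[i]
    # Back substitution; c[0] and c[n] stay 0 (natural boundary conditions).
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])

    def evaluate(x):
        j = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        dx = x - xs[j]
        return ys[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return evaluate

def manual_baseline(wavenumbers, absorbance, anchor_points):
    """Spline through the scan's values at the points nearest each anchor."""
    nearest = {min(range(len(wavenumbers)), key=lambda i: abs(wavenumbers[i] - a))
               for a in anchor_points}
    idx = sorted(nearest)
    spline = natural_cubic_spline([wavenumbers[i] for i in idx],
                                  [absorbance[i] for i in idx])
    return [spline(w) for w in wavenumbers]

# Synthetic scan: a sloping baseline with one peak centered at 1050.
wavenumbers = [1000 + 10 * i for i in range(11)]
absorbance = [0.1 + 0.001 * (w - 1000) for w in wavenumbers]
for i, height in ((4, 0.5), (5, 1.0), (6, 0.5)):
    absorbance[i] += height

baseline = manual_baseline(wavenumbers, absorbance, [1000, 1030, 1080, 1100])
corrected = [a - b for a, b in zip(absorbance, baseline)]
print([round(v, 3) for v in corrected])
```

Because the anchors are stored as wavenumbers rather than indices, the same list can be reused for every scan of a material.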

In [None]:
### Try Baselines ###


In [None]:
### GIFTS Selection ###

## Option 1: Same parameters for all materials ##
# Set which materials you'd like to apply GIFTS to (e.g., ["Si", "Perovskite"])
# If set to None, you will be prompted for input
materials_to_use = None
# Set parameters for GIFTS baseline correction as a dictionary (e.g., {"lam": 1e6, "p": 0.01, "n_iter": 10})
# If set to None, default parameters will be used
parameter_dictionary = None

# Select Baseline Function and Parameters for specified materials
baseline_selection(dataframe_path=dataframe_path, materials_to_use=materials_to_use, baseline_function='GIFTS', parameter_dictionary=parameter_dictionary)

## Option 2: Different parameters for each material ##
# Uncomment and modify the following to use different parameters per material:
# materials_to_use = {
#     "Si": {"lam": 1e6, "p": 0.01, "n_iter": 10},
#     "Perovskite": {"lam": 1e5, "p": 0.05, "n_iter": 15}
# }
# baseline_selection(dataframe_path=dataframe_path, materials_to_use=materials_to_use, baseline_function='GIFTS')

### IRSQR Selection ###

## Option 1: Same parameters for all materials ##
# Set which materials you'd like to apply IRSQR to (e.g., ["Si", "Perovskite"])
# If set to None, you will be prompted for input
materials_to_use = None
# Set parameters for IRSQR baseline correction as a dictionary (e.g., {"lam": 1e6, "quantile": 0.05, "num_knots": 100, "spline_degree": 3, "diff_order": 3, "max_iter": 100, "tol": 1e-6, "weights": None, "eps": None})
# If set to None, default parameters will be used
parameter_dictionary = None

# Select Baseline Function and Parameters for specified materials
baseline_selection(dataframe_path=dataframe_path, materials_to_use=materials_to_use, baseline_function='IRSQR', parameter_dictionary=parameter_dictionary)

## Option 2: Different parameters for each material ##
# Uncomment and modify the following to use different parameters per material:
# materials_to_use = {
#     "Si": {"lam": 1e6, "quantile": 0.05, "num_knots": 100, "spline_degree": 3, "diff_order": 3, "max_iter": 100, "tol": 1e-6},
#     "Perovskite": {"lam": 1e5, "quantile": 0.02, "num_knots": 80, "spline_degree": 2, "diff_order": 2, "max_iter": 50, "tol": 1e-5}
# }
# baseline_selection(dataframe_path=dataframe_path, materials_to_use=materials_to_use, baseline_function='IRSQR')

### FABC Selection ###

## Option 1: Same parameters for all materials ##
# Set which materials you'd like to apply FABC to (e.g., ["Si", "Perovskite"])
materials_to_use = None
# Set parameters for FABC baseline correction as a dictionary (e.g., {"lam": 1e6, "scale": None, "num_std": 3.0, "diff_order": 2, "min_length": 2, "weights": None, "weights_as_mask": False, "pad_kwargs": None})
# If set to None, default parameters will be used
parameter_dictionary = None

# Select Baseline Function and Parameters for specified materials
baseline_selection(dataframe_path=dataframe_path, materials_to_use=materials_to_use, baseline_function='FABC', parameter_dictionary=parameter_dictionary)

## Option 2: Different parameters for each material ##
# Uncomment and modify the following to use different parameters per material:
# materials_to_use = {
#     "Si": {"lam": 1e6, "scale": None, "num_std": 3.0, "diff_order": 2, "min_length": 2},
#     "Perovskite": {"lam": 1e5, "scale": 50, "num_std": 2.5, "diff_order": 1, "min_length": 3}
# }
# baseline_selection(dataframe_path=dataframe_path, materials_to_use=materials_to_use, baseline_function='FABC')

### Manual Selection ###

# Set which materials you'd like to apply Manual Selection to (e.g., ["Si", "Perovskite"])
materials_to_use = None
# For manual parameter selection, you can set anchor points per material using the interactive tool or define them directly in the materials.json file
if materials_to_use:
    pass  # Placeholder: anchor points are set with the interactive tool or in materials.json

# Select Baseline Function and Parameters for specified materials
baseline_selection(dataframe_path=dataframe_path, materials_to_use=materials_to_use, baseline_function='MANUAL')

### GIFTS Parameters
lam (float): Smoothness parameter (higher = smoother baseline).

p (float): Asymmetry parameter (0 < p < 1).

n_iter (integer): Number of iterations.
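
The three parameters above are the usual knobs of an asymmetric least squares fit. The sketch below is the textbook ALS scheme written in pure Python for small inputs. It is not the notebook's GIFTS implementation, and treating GIFTS as following this scheme is an assumption. The names `solve_linear` and `als_baseline` are hypothetical.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def als_baseline(y, lam=1e6, p=0.01, n_iter=10):
    """Textbook asymmetric least squares baseline (illustrative, O(n^3) per iteration)."""
    n = len(y)
    # Penalty matrix D^T D built from second differences of the baseline.
    P = [[0.0] * n for _ in range(n)]
    for i in range(n - 2):
        stencil = ((i, 1.0), (i + 1, -2.0), (i + 2, 1.0))
        for r, a in stencil:
            for c, b in stencil:
                P[r][c] += a * b
    w = [1.0] * n
    z = list(y)
    for _ in range(n_iter):
        # Solve (W + lam * D^T D) z = W y for the current weights.
        A = [[lam * P[r][c] + (w[r] if r == c else 0.0) for c in range(n)]
             for r in range(n)]
        z = solve_linear(A, [w[i] * y[i] for i in range(n)])
        # Points above the baseline (likely peaks) get weight p, the rest 1 - p.
        w = [p if y[i] > z[i] else 1.0 - p for i in range(n)]
    return z

# Synthetic scan: flat background of 1.0 with one peak in the middle.
signal = [1.0] * 21
for index, height in ((9, 2.0), (10, 5.0), (11, 2.0)):
    signal[index] += height

baseline = als_baseline(signal, lam=1e3, p=0.01, n_iter=10)
corrected = [s - b for s, b in zip(signal, baseline)]
print(round(max(corrected), 2))
```

A larger `lam` stiffens the baseline, a smaller `p` lets the baseline ignore peaks more aggressively, and `n_iter` sets how many times the weights are refreshed.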

### IRSQR Parameters
lam (float): The smoothing parameter (higher = smoother baseline).

quantile (float): The quantile at which to fit the baseline (0 < quantile < 1).

num_knots (integer): The number of knots for the spline.

spline_degree (integer): The degree of the spline.

diff_order (integer): The order of the differential matrix. Must be greater than 0. Default is 3 (third-order differential matrix). Typical values are 3, 2, or 1.

max_iter (integer): The max number of fit iterations.

tol (float): The exit criteria.

weights (array-like): The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.

eps (float): A small value added to the square of the residual to prevent dividing by 0. Default is None, which uses the square of the maximum-absolute-value of the fit each iteration multiplied by 1e-6.
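
The role of `quantile` is easiest to see through the loss that quantile regression minimizes, often called the pinball loss. The sketch below is a simplified illustration and not IRSQR itself: it fits a single constant instead of a spline, and the helper names are hypothetical.

```python
def pinball_loss(residual, quantile):
    """Asymmetric absolute loss minimized by quantile regression."""
    if residual >= 0:
        return quantile * residual
    return (quantile - 1) * residual

def best_constant(values, quantile):
    """Data value that minimizes the total pinball loss over all values."""
    return min(values, key=lambda c: sum(pinball_loss(v - c, quantile) for v in values))

# Mostly flat signal near 0.1 with three peak points.
signal = [0.10, 0.12, 0.11, 0.13, 0.90, 1.50, 0.95, 0.12, 0.10, 0.11]
print(best_constant(signal, 0.05))  # 0.1
print(best_constant(signal, 0.50))  # 0.12
```

With a low quantile, points above the fit are penalized far less than points below it, so the fit settles near the bottom of the signal and largely ignores the peaks.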

### FABC Parameters
lam (float): The smoothing parameter (higher = smoother baseline).

scale (integer): The scale at which to calculate the continuous wavelet transform. Should be approximately equal to the index-based full-width-at-half-maximum of the peaks or features in the data. Default is None, which uses half of the value from `optimize_window`. That is not always a good value, but it scales with the number of data points and gives a starting point for tuning the parameter.

num_std (float): The number of standard deviations to include when thresholding. Higher values will assign more points as baseline.

diff_order (integer): The order of the differential matrix. Must be greater than 0. Typical values are 2 or 1.

min_length (integer): Any region of consecutive baseline points shorter than `min_length` is considered a false positive, and all points in the region are converted to peak points. A higher `min_length` ensures that fewer points are falsely assigned as baseline points. Default is 2, which only removes lone baseline points.

weights (array-like): The weighting array, used to override the function's baseline identification to designate peak points. Only elements with 0 or False values will have an effect; all non-zero values are considered baseline points. If None, the weights will be an array with size equal to N and all values set to 1.

weights_as_mask (bool): If True, signifies that the input `weights` is the mask to use for fitting, which skips the continuous wavelet calculation and just smooths the input data.

pad_kwargs (dict): A dictionary of keyword arguments to pass to `pad_edges` for padding the edges of the data to prevent edge effects from convolution for the continuous wavelet transform. Default is None.
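
The interplay of `num_std` and `min_length` can be sketched on a toy derivative estimate. This is a simplified illustration, not the FABC algorithm: the noise level is passed in directly rather than estimated, and the helper names `threshold_mask` and `drop_short_runs` are hypothetical.

```python
from itertools import groupby

def threshold_mask(derivative, noise_std, num_std=3.0):
    """True where the derivative estimate is small enough to count as baseline."""
    return [abs(d) < num_std * noise_std for d in derivative]

def drop_short_runs(mask, min_length=2):
    """Turn runs of baseline points shorter than min_length into peak points."""
    cleaned = []
    for is_baseline, run in groupby(mask):
        run = list(run)
        if is_baseline and len(run) < min_length:
            run = [False] * len(run)
        cleaned.extend(run)
    return cleaned

derivative = [0.1, -0.2, 0.1, 4.0, 0.2, -5.0, -3.0, 0.1, 0.0, -0.1]
mask = threshold_mask(derivative, noise_std=0.2, num_std=3.0)
print(mask)
print(drop_short_runs(mask, min_length=2))
```

In this example the lone baseline point at index 4, which sits between two peak regions, is reclassified as a peak point, matching the default behavior described for `min_length`.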

### Manual Parameters