# Data Preprocessing Script

### Description of folders and files:
- Two main folders: (1) 'Tool_makers_report' and (2) 'DAQ'
- Each of these has subfolders with names of the form '1. Sxxx', '2. Sxxx', '6. Sxxx', etc.
- Within each 'Tool_makers_report' sub-folder, there are report files in either Word (docx extension) or Excel format (xlsx extension); the 'DAQ' sub-folders hold the sensor recordings described in Step-2
- Report file names are of the form 'a1', 'a2', 'a3', etc.

### Step-1: Consolidating tool wear values from the folder Tool_makers_report

Write Python code that extracts information from the individual report files into a single consolidated CSV file.

Instructions:
You are to extract information from each of these report files ('a1', 'a2', etc.) into a single output .csv file.

Output format desired: | Folder name | File name | Tool-wear micrometer |

Process files in folders within the main 'Tool_makers_report' folder:
- Take the first folder. Its name is of the form '1. Sxxx'; the 'xxx' part changes
- Copy this folder name into the 'Folder name' column of the output
- Within each folder, there are report files in either Word (docx extension) or Excel format (xlsx extension)
- File names are in format 'a1', 'a2', 'a3' etc.
- Copy the name of the report file being processed into the 'File name' column of the output
- Read each file and locate a table
- Locate a column called 'Length'
- This column may contain a single row or 2-3 rows
- Take the maximum value of 'Length' and copy it into the 'Tool-wear micrometer' column of the output
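
The extraction rule above (take the largest value in the 'Length' column) can be sketched on an in-memory table. This is a minimal illustration only: the sample rows, the folder and file names, and the helper name `max_length` are assumptions, not values taken from the report files.

```python
import re
from typing import Optional

import pandas as pd


def max_length(table: pd.DataFrame) -> Optional[float]:
    """Return the largest numeric value in the 'Length' column, or None."""
    # Match the column name case-insensitively, ignoring surrounding spaces
    matches = [c for c in table.columns if str(c).strip().lower() == "length"]
    if not matches:
        return None
    values = []
    for cell in table[matches[0]].astype(str):
        # Pull out every number in the cell, so text such as '140.15 um' still parses
        values.extend(float(n) for n in re.findall(r"\d+(?:\.\d+)?", cell))
    return max(values) if values else None


# Hypothetical report table with three measurement rows
sample = pd.DataFrame({"Name": ["L1", "L2", "L3"],
                       "Length": ["75.91 um", "140.15 um", "98.5"]})

# One output row in the desired format
row = {"Folder name": "1. Sxxx", "File name": "a1.docx",
       "Tool-wear micrometer": max_length(sample)}
print(row)
```

A table without a 'Length' column yields `None`, which the caller can write out as an empty cell.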

### Step-2: Downsample the large **time-series** sensor reading files. Remove unwanted header rows.

Instructions:

Folder and files:
- Main folder 'DAQ'
- Each has subfolders with names of the form '1. Sxxx', '2. Sxxx', '6. Sxxx' etc.
- Within each sub-folder, there are time-series sensor data files with names of the form 'Recording 1.csv', 'Recording 2.csv' etc.

You are to write code to downsample these large time-series sensor reading files. IMPORTANT - Use nth row sampling to preserve the time-series nature. Random sampling will destroy the time-series information.
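
To illustrate why nth-row sampling is preferred here, the sketch below compares it with random sampling on a synthetic signal. The signal, the column names and the variable names are made up for the example; only the target of 200 rows comes from the instructions.

```python
import math

import numpy as np
import pandas as pd

TARGET = 200  # same value as DOWNSAMPLE_TARGET in the instructions

# Synthetic stand-in for one recording: 10,000 ordered samples
signal = pd.DataFrame({"Time": np.arange(10_000),
                       "Current": np.sin(np.arange(10_000) / 500.0)})

# nth-row sampling: keep every `step`-th row, so order and spacing survive
step = max(1, math.ceil(len(signal) / TARGET))
nth = signal.iloc[::step]

# Random sampling for comparison: rows come back shuffled, with uneven gaps
rnd = signal.sample(n=TARGET, random_state=0)

print(step, len(nth))
print(nth["Time"].is_monotonic_increasing, rnd["Time"].is_monotonic_increasing)
```

Every kept row in `nth` is exactly `step` samples after the previous one, which is what keeps the downsampled file usable as a time series.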

Output format desired:
| Folder name | File name | Time | Vib_Spindle | Vib_Table | Sound_Spindle | Sound_table | X_Load_Cell | Y_Load_Cell | Z_Load_Cell | Current |

- Sample into ['Vib_Spindle', 'Vib_Table', 'Sound_Spindle', 'Sound_table', 'X_Load_Cell', 'Y_Load_Cell', 'Z_Load_Cell', 'Current']
- Delete all columns named 'Channel name'
- There are four header rows, of which only the 3rd is of interest; it contains the column names
- Remove the top two rows and the 4th row
- Create a constant called DOWNSAMPLE_TARGET and set it to 200
- Prefix each downsampled file with 'DS_' and replace spaces in the name with underscores, so 'Recording 1.csv' becomes 'DS_Recording_1.csv'; store it in the SAME sub-folder
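
The header handling described above can be tried on a small in-memory file. The CSV text below is a hypothetical layout invented for the example (the real recordings may differ); it only assumes that the third of four header rows carries the column names.

```python
import io

import pandas as pd

# Hypothetical four-row header: only the third row holds the column names
raw = io.StringIO(
    "Recording info,,\n"
    "Device,,\n"
    "Channel name,Time,Current\n"
    "Unit,s,A\n"
    ",0.0,1.5\n"
    ",0.1,1.6\n"
)

# Skip physical rows 0, 1 and 3; the first remaining row becomes the header
df = pd.read_csv(raw, skiprows=[0, 1, 3], header=0)

# Drop the 'Channel name' column(s)
df = df.drop(columns=[c for c in df.columns if str(c).strip() == "Channel name"])
print(list(df.columns), len(df))
```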


### Step-3: Add tool-wear values to the downsampled sensor reading files
Instructions:
Folder and files:
- Main folder 'DAQ'
- Each has subfolders with names of the form '1. Sxxx', '2. Sxxx', '6. Sxxx' etc.
- Within each sub-folder, there are downsampled time-series sensor data files with names of the form 'DS_Recording_1.csv', 'DS_Recording_2.csv' etc.
- For this step, files without the 'DS_' prefix are to be ignored.

You are to write code to append values of tool-wear and action-code provided in another file for all the records of the time series data in the 'DS_Recording' files.

1. Iterate through the rows of 'Tool_Wear_Values.xlsx'
2. Read these three values: Path_Sensor_File, tool_wear and ACTION_CODE
3. Locate and read the .csv file named in 'Path_Sensor_File'
4. This is a downsampled sensor readings file (a 'DS_' file from Step-2)
5. Add two columns to it, called 'tool_wear' and 'ACTION_CODE'
6. For **ALL** the rows of the sensor reading file, repeat the *single* values read for tool_wear and ACTION_CODE
7. Move to the next row of 'Tool_Wear_Values.xlsx' and repeat steps 2 to 6.
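
Steps 5 and 6 rely on the fact that assigning a scalar to a DataFrame column repeats it on every row. A minimal sketch, assuming a made-up metadata row and a three-row stand-in for the sensor file (the path and sensor values are illustrative):

```python
import pandas as pd

# Hypothetical metadata row, as it might appear in 'Tool_Wear_Values.xlsx'
meta = {"Path_Sensor_File": "1. Sxxx/DS_Recording_1.csv",
        "tool_wear": 46.72, "ACTION_CODE": 0}

# Stand-in for the sensor file named in Path_Sensor_File
sensor = pd.DataFrame({"Time": [0, 1, 2], "Current": [1.5, 1.6, 1.4]})

# Assigning a scalar to a column repeats it on every row
sensor["tool_wear"] = meta["tool_wear"]
sensor["ACTION_CODE"] = meta["ACTION_CODE"]
print(sensor)
```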

### Step-4: Concatenate data files into one per case

Folder and files:
- Main folder 'DAQ'
- Each has subfolders with names of the form '1. Sxxx', '2. Sxxx', '6. Sxxx' etc.
- Within each sub-folder, there are time-series sensor data files with names of the form 'DS_Recording_1.csv', 'DS_Recording_2.csv' etc.

Instructions:
1. Within one sub-folder, concatenate all sensor reading files into a single file
2. This single file bears the name of the sub-folder, e.g. '1. Sxxx', with the added prefix 'PROC_'
3. The sub-folder contains the DS_ files; concatenate them in the numeric order of their suffixes, i.e. 1, 2 etc. from 'DS_Recording_1.csv', 'DS_Recording_2.csv' etc.
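
The ordering requirement matters because plain string sorting does not follow the numeric suffix. The short sketch below shows the difference; the file list and the regular expression are illustrative assumptions (the pattern accepts either a space or an underscore before the number).

```python
import re

names = ["DS_Recording_10.csv", "DS_Recording_2.csv", "DS_Recording_1.csv"]

# Plain string sorting puts '10' before '2'
print(sorted(names))

# Sorting on the integer suffix gives the recording order
pattern = re.compile(r"DS_Recording[ _](\d+)\.csv$", re.IGNORECASE)
ordered = sorted(names, key=lambda n: int(pattern.search(n).group(1)))
print(ordered)
```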

### Step-5: Final processing
1. Fill missing values for tool wear
2. Downsample further to 400 records (the value of `DOWNSAMPLE_TARGET` in the Step-5 code)
3. Re-index time column
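
The three operations can be sketched on a toy frame. This is an assumption-laden illustration, not the notebook's exact routine: it fills gaps by forward-fill then back-fill, picks evenly spaced row positions, and uses a tiny target so the result is easy to read.

```python
import numpy as np
import pandas as pd

TARGET = 4  # small illustrative target; the notebook uses a much larger one

df = pd.DataFrame({
    "Time": [0, 1, 2, 0, 1, 2, 0, 1, 2],
    "tool_wear": [np.nan, np.nan, np.nan, 46.72, 46.72, 46.72,
                  np.nan, np.nan, np.nan],
})

# 1. Forward-fill, then back-fill so leading gaps are covered too
df["tool_wear"] = df["tool_wear"].ffill().bfill()

# 2. Pick TARGET evenly spaced row positions, first and last included
idx = np.linspace(0, len(df) - 1, num=TARGET, dtype=int)
df = df.iloc[idx].reset_index(drop=True)

# 3. Replace the repeating per-file Time values with one running index
df["Time"] = np.arange(len(df))
print(df)
```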


In [1]:
# %pip install python-docx openpyxl tqdm

In [2]:
# Data Preprocessing Script
import os
import re
import csv
from pathlib import Path
from typing import Optional, List
import pandas as pd
from docx import Document
from tqdm import tqdm

In [3]:
# === CONFIG ===
MAIN_DIR = r"Tool_makers_report"   # change to your folder path
OUTPUT_CSV = r"output_tool_wear.csv"  # change to desired output path
FILE_NAME_PATTERN = re.compile(r'^a\d+', re.IGNORECASE)  # matches a1, a2, a10...
EXCEL_EXTS = {'.xlsx', '.xls'}
DOCX_EXTS = {'.docx'}

In [4]:
# === UTILITIES ===
def extract_numbers_from_text(s: str) -> List[float]:
    """Extract numeric values (ints or floats) from a string and return as floats."""
    if s is None:
        return []
    s = str(s).strip()
    # Replace non-breaking spaces with regular spaces
    s = s.replace('\xa0', ' ')
    # Match signed numbers with either '.' or ',' as the decimal separator.
    # Thousands separators are not handled: '1,234.56' is read as 1.234 and 0.56.
    matches = re.findall(r'[-+]?\d*\.\d+|[-+]?\d+[,\.]?\d*', s)
    nums = []
    for m in matches:
        m_clean = m.replace(',', '.')  # convert comma decimal to dot
        try:
            nums.append(float(m_clean))
        except ValueError:
            # Not a valid number after cleaning; ignore this match
            continue
    return nums

def find_length_in_dataframe(df: pd.DataFrame) -> Optional[float]:
    """Search for a column named 'Length' (case-insensitive) and return max numeric value if found."""
    if df is None or df.empty:
        return None
    # Normalize columns to strings
    cols = {str(c).strip(): c for c in df.columns}
    # case-insensitive search
    target_col = None
    for col_label in cols:
        if col_label.strip().lower() == 'length':
            target_col = cols[col_label]
            break
    if target_col is None:
        # try header values inside first row if header is missing
        # but pandas usually creates headers; skip for simplicity
        return None
    # Extract numeric values from the column
    values = []
    for cell in df[target_col].astype(str).tolist():
        nums = extract_numbers_from_text(cell)
        if nums:
            values.extend(nums)
    if not values:
        return None
    return max(values)

def extract_length_from_docx(path: Path) -> Optional[float]:
    """Open a Word document and scan tables for a column named Length. Return max value found."""
    try:
        doc = Document(path)
    except Exception:
        return None
    for table in doc.tables:
        # try to get header row cells text
        headers = [cell.text.strip() for cell in table.rows[0].cells] if table.rows else []
        # find index of 'Length' in headers (case-insensitive)
        idx = None
        for i, h in enumerate(headers):
            if h.strip().lower() == 'length':
                idx = i
                break
        # If header not found, try to find a cell with text 'Length' in the table and take that column
        if idx is None:
            # search for any cell labelled 'Length' and record its column index
            for r in table.rows:
                for c_i, cell in enumerate(r.cells):
                    if cell.text.strip().lower() == 'length':
                        idx = c_i
                        break
                if idx is not None:
                    break
        if idx is None:
            continue
        # collect numbers from the column idx (skip header row if header exists)
        values = []
        for r_i, row in enumerate(table.rows):
            # if header row exists and contains 'Length' exactly, skip it
            if r_i == 0 and headers and headers[idx].strip().lower() == 'length':
                continue
            try:
                cell_text = row.cells[idx].text
            except IndexError:
                continue
            nums = extract_numbers_from_text(cell_text)
            if nums:
                values.extend(nums)
        if values:
            return max(values)
    return None

def extract_length_from_excel(path: Path) -> Optional[float]:
    """Open an Excel file and scan sheets for a column named Length. Return max value found."""
    try:
        xls = pd.ExcelFile(path)
    except Exception:
        return None
    # iterate sheets
    for sheet in xls.sheet_names:
        try:
            df = pd.read_excel(xls, sheet_name=sheet, dtype=str, header=0)
        except Exception:
            # try reading with header None to be robust
            try:
                df = pd.read_excel(xls, sheet_name=sheet, dtype=str, header=None)
            except Exception:
                continue
        length_val = find_length_in_dataframe(df)
        if length_val is not None:
            return length_val
    return None

In [5]:
# === MAIN PROCESS ===
def process_tool_makers_folder(main_dir: str, output_csv: str):
    main_path = Path(main_dir)
    if not main_path.exists():
        raise FileNotFoundError(f"Main folder not found: {main_dir}")
    rows = []
    # iterate immediate subfolders (like '1. Sxxx', '2. Sxxx' ...)
    for sub in sorted([p for p in main_path.iterdir() if p.is_dir()], key=lambda x: x.name):
        folder_name = sub.name
        # list files that match a* and are .docx or .xlsx
        for f in sorted(sub.iterdir(), key=lambda x: x.name):
            if not f.is_file():
                continue
            if not FILE_NAME_PATTERN.match(f.stem):
                continue
            ext = f.suffix.lower()
            tool_wear = None
            if ext in DOCX_EXTS:
                tool_wear = extract_length_from_docx(f)
            elif ext in EXCEL_EXTS:
                tool_wear = extract_length_from_excel(f)
            else:
                continue
            # store result; tool_wear may be None
            print(f'Processing {folder_name} >> {f.name}: Tool-wear: {tool_wear}')
            rows.append({
                "Folder name": folder_name,
                "File name": f.name,
                "Tool-wear micrometer": tool_wear if tool_wear is not None else ""
            })
    # write to CSV
    fieldnames = ["Folder name", "File name", "Tool-wear micrometer"]
    with open(output_csv, 'w', newline='', encoding='utf-8') as csvf:
        writer = csv.DictWriter(csvf, fieldnames=fieldnames)
        writer.writeheader()
        for r in rows:
            writer.writerow(r)
    print(f"Done. {len(rows)} rows written to {output_csv}")

In [6]:
process_tool_makers_folder(MAIN_DIR, OUTPUT_CSV)

Processing 10. S1200_F25_D0.10 >> a10.docx: Tool-wear: 140.15
Processing 10. S1200_F25_D0.10 >> a11.docx: Tool-wear: 154.74
Processing 10. S1200_F25_D0.10 >> a12.docx: Tool-wear: 162.04
Processing 10. S1200_F25_D0.10 >> a13.docx: Tool-wear: 166.42
Processing 10. S1200_F25_D0.10 >> a14.docx: Tool-wear: 181.02
Processing 10. S1200_F25_D0.10 >> a15.docx: Tool-wear: 182.48
Processing 10. S1200_F25_D0.10 >> a16.docx: Tool-wear: 191.24
Processing 10. S1200_F25_D0.10 >> a17.docx: Tool-wear: 194.16
Processing 10. S1200_F25_D0.10 >> a18.docx: Tool-wear: 202.92
Processing 10. S1200_F25_D0.10 >> a19.docx: Tool-wear: 218.98
Processing 10. S1200_F25_D0.10 >> a2.docx: Tool-wear: 75.91
Processing 10. S1200_F25_D0.10 >> a20.docx: Tool-wear: 227.74
Processing 10. S1200_F25_D0.10 >> a21.docx: Tool-wear: 232.12
Processing 10. S1200_F25_D0.10 >> a22.docx: Tool-wear: 240.88
Processing 10. S1200_F25_D0.10 >> a23.docx: Tool-wear: 246.72
Processing 10. S1200_F25_D0.10 >> a24.docx: Tool-wear: 252.55
Processing

In [1]:
import math
from pathlib import Path
import pandas as pd

# Configuration
MAIN_DIR = Path(r"DAQ")            # set this to your DAQ main folder
DOWNSAMPLE_TARGET = 200                      # target number of rows after downsampling
INPUT_GLOB = "Recording *.csv"                # pattern to find recordings
OUTPUT_PREFIX = "DS_"                         # prefix for downsampled files
COLUMNS_OF_INTEREST = [
    "Time",
    "Vib_Spindle",
    "Vib_Table",
    "Sound_Spindle",
    "Sound_table",
    "X_Load_Cell",
    "Y_Load_Cell",
    "Z_Load_Cell",
    "Current",
]

def process_recording_file(path: Path, target: int = DOWNSAMPLE_TARGET) -> None:
    try:
        # Read file, skip rows 0, 1, and 3; use row 2 (index=2) as header
        df = pd.read_csv(path, skiprows=[0, 1, 3], header=0, dtype=str, engine="python")
    except Exception as e:
        print(f"Skipping {path.name}: read error: {e}")
        return

    # Drop columns named exactly 'Channel name'
    cols_to_drop = [c for c in df.columns if str(c).strip() == "Channel name"]
    df.drop(columns=cols_to_drop, inplace=True, errors='ignore')

    # Normalize column names
    df.columns = [str(c).strip() for c in df.columns]

    # Build output DataFrame with selected columns
    out_df = pd.DataFrame()
    for col in COLUMNS_OF_INTEREST:
        out_df[col] = df[col] if col in df.columns else pd.NA

    # Downsample using nth-row sampling
    total_rows = len(out_df)
    if total_rows == 0:
        print(f"No data rows in {path.name} after header processing; skipping.")
        return
    step = max(1, math.ceil(total_rows / target))
    ds_df = out_df.iloc[::step].reset_index(drop=True)

    # Replace Time column with running index
    ds_df["Time"] = range(len(ds_df))
    ds_df = ds_df[["Time"] + [c for c in ds_df.columns if c != "Time"]]

    # Save output with DS_ prefix
    out_name = OUTPUT_PREFIX + path.stem.replace(" ", "_") + path.suffix
    out_path = path.with_name(out_name)
    try:
        ds_df.to_csv(out_path, index=False)
        print(f"Wrote {out_path.name} ({len(ds_df)} rows, step={step})")
    except Exception as e:
        print(f"Failed to write {out_path.name}: {e}")

def process_daq_main(main_dir: Path) -> None:
    if not main_dir.exists():
        raise FileNotFoundError(f"Main folder not found: {main_dir}")
    for sub in sorted([p for p in main_dir.iterdir() if p.is_dir()], key=lambda x: x.name):
        for csv_file in sorted(sub.glob(INPUT_GLOB), key=lambda x: x.name):
            process_recording_file(csv_file)

In [2]:
if __name__ == "__main__":
    process_daq_main(MAIN_DIR)

Wrote DS_Recording_1.csv (200 rows, step=2405)
Wrote DS_Recording_10.csv (200 rows, step=2405)
Wrote DS_Recording_11.csv (200 rows, step=2441)
Wrote DS_Recording_12.csv (200 rows, step=2405)
Wrote DS_Recording_13.csv (200 rows, step=2405)
Wrote DS_Recording_14.csv (200 rows, step=2417)
Wrote DS_Recording_15.csv (200 rows, step=2415)
Wrote DS_Recording_16.csv (200 rows, step=2404)
Wrote DS_Recording_17.csv (200 rows, step=2449)
Wrote DS_Recording_18.csv (200 rows, step=2406)
Wrote DS_Recording_19.csv (200 rows, step=2405)
Wrote DS_Recording_2.csv (200 rows, step=2406)
Wrote DS_Recording_20.csv (200 rows, step=2497)
Wrote DS_Recording_21.csv (200 rows, step=2407)
Wrote DS_Recording_22.csv (200 rows, step=2405)
Wrote DS_Recording_23.csv (200 rows, step=2406)
Wrote DS_Recording_24.csv (200 rows, step=2409)
Wrote DS_Recording_25.csv (200 rows, step=2404)
Wrote DS_Recording_26.csv (200 rows, step=2405)
Wrote DS_Recording_27.csv (200 rows, step=2539)
Wrote DS_Recording_28.csv (200 rows, step=

### Step-3: Appending the tool-wear values to the downsampled sensor readings

In [5]:
import pandas as pd
from pathlib import Path

# === CONFIGURATION ===
DAQ_DIR = Path(r"DAQ")  # Update this to your DAQ folder path
TOOL_WEAR_FILE = Path(r"Tool_Wear_Values.xlsx")  # Update this to your Excel file path

# === MAIN FUNCTION ===
def append_tool_wear_and_action_code(daq_dir: Path, excel_path: Path):
    if not daq_dir.exists():
        raise FileNotFoundError(f"DAQ folder not found: {daq_dir}")
    if not excel_path.exists():
        raise FileNotFoundError(f"Excel file not found: {excel_path}")

    # Read metadata Excel
    try:
        metadata_df = pd.read_excel(excel_path, dtype=str)
    except Exception as e:
        print(f"Error reading Excel file: {e}")
        return

    required_cols = {"Path_Sensor_File", "tool_wear", "ACTION_CODE"}
    if not required_cols.issubset(metadata_df.columns):
        print(f"Excel file missing required columns: {required_cols}")
        return

    # Process each row
    for i, row in metadata_df.iterrows():
        sensor_path_str = str(row["Path_Sensor_File"]).strip()
        tool_wear = str(row["tool_wear"]).strip()
        action_code = str(row["ACTION_CODE"]).strip()

        sensor_path = Path(sensor_path_str)
        if not sensor_path.name.startswith("DS_"):
            print(f"Skipping non-DS file: {sensor_path.name}")
            continue

        full_path = daq_dir / sensor_path.parent / sensor_path.name
        if not full_path.exists():
            print(f"Sensor file not found: {full_path}")
            continue

        try:
            df = pd.read_csv(full_path, dtype=str)
        except Exception as e:
            print(f"Error reading {full_path.name}: {e}")
            continue

        # Add tool_wear and ACTION_CODE columns with repeated values
        df["tool_wear"] = tool_wear
        df["ACTION_CODE"] = action_code

        # Overwrite the original file with updated content
        try:
            df.to_csv(full_path, index=False)
            print(f"Updated: {full_path.name} with tool_wear={tool_wear}, ACTION_CODE={action_code}")
        except Exception as e:
            print(f"Failed to write {full_path.name}: {e}")

if __name__ == "__main__":
    append_tool_wear_and_action_code(DAQ_DIR, TOOL_WEAR_FILE)

Updated: DS_Recording_1.csv with tool_wear=46.72, ACTION_CODE=0
Updated: DS_Recording_3.csv with tool_wear=68.61, ACTION_CODE=0
Updated: DS_Recording_4.csv with tool_wear=90.51, ACTION_CODE=0
Updated: DS_Recording_5.csv with tool_wear=99.27, ACTION_CODE=0
Updated: DS_Recording_6.csv with tool_wear=103.65, ACTION_CODE=0
Updated: DS_Recording_7.csv with tool_wear=121.17, ACTION_CODE=0
Updated: DS_Recording_8.csv with tool_wear=130.66, ACTION_CODE=0
Updated: DS_Recording_9.csv with tool_wear=139.42, ACTION_CODE=0
Updated: DS_Recording_11.csv with tool_wear=172.26, ACTION_CODE=0
Updated: DS_Recording_12.csv with tool_wear=183.21, ACTION_CODE=0
Updated: DS_Recording_13.csv with tool_wear=188.32, ACTION_CODE=0
Updated: DS_Recording_14.csv with tool_wear=199.27, ACTION_CODE=0
Updated: DS_Recording_15.csv with tool_wear=207.31, ACTION_CODE=0
Updated: DS_Recording_16.csv with tool_wear=216.79, ACTION_CODE=0
Updated: DS_Recording_17.csv with tool_wear=224.82, ACTION_CODE=0
Updated: DS_Recording_

### Step-4: Concatenate data files into one per case

In [1]:
import pandas as pd
from pathlib import Path
import re

# === CONFIGURATION ===
DAQ_DIR = Path(r"DAQ")  # Update this to your DAQ folder path
DS_PATTERN = re.compile(r"DS_Recording_(\d+)\.csv", re.IGNORECASE)

def concatenate_ds_files_in_subfolder(subfolder: Path):
    # Find all DS_Recording *.csv files
    ds_files = []
    for file in subfolder.glob("DS_Recording_*.csv"):
        match = DS_PATTERN.match(file.name)
        if match:
            index = int(match.group(1))
            ds_files.append((index, file))

    if not ds_files:
        print(f"No DS_Recording files found in {subfolder.name}")
        return

    # Sort files by numeric index
    ds_files.sort(key=lambda x: x[0])

    # Read each file, then concatenate once (avoids growing a DataFrame in a loop)
    frames = []
    for idx, file in ds_files:
        try:
            frames.append(pd.read_csv(file, dtype=str))
        except Exception as e:
            print(f"Error reading {file.name}: {e}")
    combined_df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

    # Save to PROC_<subfolder>.csv
    output_name = f"PROC_{subfolder.name}.csv"
    output_path = subfolder / output_name
    try:
        combined_df.to_csv(output_path, index=False)
        print(f"Created {output_name} with {len(combined_df)} rows")
    except Exception as e:
        print(f"Failed to write {output_name}: {e}")

def process_all_subfolders(daq_dir: Path):
    if not daq_dir.exists():
        raise FileNotFoundError(f"DAQ folder not found: {daq_dir}")
    for subfolder in sorted(daq_dir.iterdir()):
        if subfolder.is_dir():
            concatenate_ds_files_in_subfolder(subfolder)

if __name__ == "__main__":
    process_all_subfolders(DAQ_DIR)

Created PROC_10. S1200_F25_D0.10.csv with 6400 rows
Created PROC_7. S1200_F20_D0.30.csv with 5600 rows
Created PROC_8. S1200_F40_D0.25.csv with 7200 rows
Created PROC_9. S1200_F30_D0.15.csv with 7000 rows


### Step-5: Final processing
1. Fill missing values for tool wear
2. Downsample further to 400 records (`DOWNSAMPLE_TARGET` in the settings below)
3. Re-index time column
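
The code in this step also fills missing ACTION_CODE values from a tool-wear threshold. A minimal sketch of that rule on made-up rows follows; the data is illustrative, and only the threshold value of 285 is taken from the settings below.

```python
import numpy as np
import pandas as pd

THRESHOLD = 285  # same value as TOOL_WEAR_THRESHOLD in the settings below

df = pd.DataFrame({"tool_wear": [120.0, 290.0, np.nan, 300.0],
                   "ACTION_CODE": [np.nan, np.nan, np.nan, 0.0]})

# Candidate fill: 0 at or below the threshold, 1 above it, 0 when wear is unknown
candidate = np.where(df["tool_wear"] <= THRESHOLD, 0, 1)
candidate = np.where(df["tool_wear"].isna(), 0, candidate)

# Only rows whose ACTION_CODE is missing receive the candidate value
missing = df["ACTION_CODE"].isna()
df.loc[missing, "ACTION_CODE"] = candidate[missing.to_numpy()]
df["ACTION_CODE"] = df["ACTION_CODE"].astype(int)
print(df["ACTION_CODE"].tolist())
```

Note that the last row keeps its existing code of 0 even though its wear exceeds the threshold, because only missing codes are filled.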

In [1]:
import os
import glob
import pandas as pd
import numpy as np

# ====== SETTINGS ======
INPUT_DIR = r"PROCESSED_DATA"
RECURSIVE = False
OUTPUT_SUFFIX = "_filled_downsampled.csv"
TOOL_WEAR_COL = "tool_wear"
ACTION_COL = "ACTION_CODE"
TIME_COL = "Time"
DOWNSAMPLE_TARGET = 400
TOOL_WEAR_THRESHOLD = 285
# ======================

def find_csv_files(folder, recursive=False):
    pattern = "**/*.csv" if recursive else "*.csv"
    return glob.glob(os.path.join(folder, pattern), recursive=recursive)

def looks_step_like(series: pd.Series) -> bool:
    s = series.dropna()
    if s.empty:
        return False
    unique_ratio = s.nunique() / len(s)
    diffs = s.diff().dropna().abs()
    jump_ratio = 0.0 if diffs.empty else (diffs > 0).sum() / len(diffs)
    return (unique_ratio <= 0.25) or (jump_ratio <= 0.5)

def fill_tool_wear(series: pd.Series) -> tuple:
    s = series.copy()
    if s.dtype == object:
        s = pd.to_numeric(s, errors="coerce")
    nan_before = int(s.isna().sum())
    s = s.ffill()
    s = s.bfill()
    if s.isna().any():
        try:
            s = s.interpolate(method="nearest", limit_direction="both")
        except Exception:
            s = s.fillna(s.median(skipna=True))
    nan_after = int(s.isna().sum())
    return s.astype(float), nan_before, nan_after

def fill_action_code(df: pd.DataFrame, tool_wear_col: str, action_col: str, threshold: float) -> tuple:
    df = df.copy()
    if action_col not in df.columns:
        return df, 0, 0
    # Normalize ACTION_CODE to numeric, keep mask of originally missing
    action = pd.to_numeric(df[action_col], errors="coerce")
    missing_before = int(action.isna().sum())
    # Fill remaining NaNs based on tool_wear threshold rule
    tool_wear = pd.to_numeric(df[tool_wear_col], errors="coerce")
    # Create filled action: where action is NaN, set based on tool_wear <= threshold -> 0 else 1
    fill_vals = np.where(tool_wear <= threshold, 0, 1)
    # If tool_wear is NaN, default to 0
    fill_vals = np.where(np.isnan(tool_wear), 0, fill_vals)
    action_filled = action.copy()
    action_filled[action_filled.isna()] = fill_vals[action_filled.isna()]
    # Cast to integer (0/1)
    action_filled = action_filled.fillna(0).astype(int)
    missing_after = int(action_filled.isna().sum())
    df[action_col] = action_filled
    return df, missing_before, missing_after

def downsample_by_nth_record(df: pd.DataFrame, target: int) -> pd.DataFrame:
    n = len(df)
    if n <= target:
        return df.reset_index(drop=True)
    indices = np.linspace(0, n - 1, num=target, dtype=int)
    return df.iloc[indices].reset_index(drop=True)

def rewrite_time_column(df: pd.DataFrame, time_col: str) -> pd.DataFrame:
    df = df.copy()
    df[time_col] = np.arange(len(df))
    return df

def process_file(path):
    df = pd.read_csv(path)
    if TOOL_WEAR_COL not in df.columns:
        return None
    s = df[TOOL_WEAR_COL]
    step_like = looks_step_like(s)
    filled_series, nan_before, nan_after = fill_tool_wear(s)
    df_out = df.copy()
    df_out[TOOL_WEAR_COL] = filled_series
    df_out, action_nan_before, action_nan_after = fill_action_code(df_out, TOOL_WEAR_COL, ACTION_COL, TOOL_WEAR_THRESHOLD)
    df_down = downsample_by_nth_record(df_out, DOWNSAMPLE_TARGET)
    df_final = rewrite_time_column(df_down, TIME_COL)
    base, _ = os.path.splitext(path)
    out_path = base + OUTPUT_SUFFIX
    df_final.to_csv(out_path, index=False)
    return {
        "file": path,
        "out_file": out_path,
        "orig_rows": len(df),
        "out_rows": len(df_final),
        "tool_wear_nan_before": nan_before,
        "tool_wear_nan_after": nan_after,
        "action_nan_before": action_nan_before,
        "action_nan_after": action_nan_after,
        "step_like": bool(step_like)
    }

def main():
    files = find_csv_files(INPUT_DIR, recursive=RECURSIVE)
    if not files:
        print(f"No CSV files found in {INPUT_DIR}")
        return
    for f in files:
        try:
            summary = process_file(f)
            if summary:
                print(f"Processed {summary['file']}: orig_rows={summary['orig_rows']}, "
                      f"out_rows={summary['out_rows']}, "
                      f"tool_wear_nan_before={summary['tool_wear_nan_before']}, tool_wear_nan_after={summary['tool_wear_nan_after']}, "
                      f"action_nan_before={summary['action_nan_before']}, action_nan_after={summary['action_nan_after']}, "
                      f"step_like={summary['step_like']}")
            else:
                print(f"Skipped {f}: no column named '{TOOL_WEAR_COL}'")
        except Exception as e:
            print(f"Error processing {f}: {e}")
    print("Done. Cleaned, downsampled files written with suffix:", OUTPUT_SUFFIX)

if __name__ == "__main__":
    main()

Processed PROCESSED_DATA\PROC_10. S1200_F25_D0.10.csv: orig_rows=6400, out_rows=400, tool_wear_nan_before=400, tool_wear_nan_after=0, action_nan_before=400, action_nan_after=0, step_like=True
Processed PROCESSED_DATA\PROC_7. S1200_F20_D0.30.csv: orig_rows=5600, out_rows=400, tool_wear_nan_before=400, tool_wear_nan_after=0, action_nan_before=400, action_nan_after=0, step_like=True
Processed PROCESSED_DATA\PROC_8. S1200_F40_D0.25.csv: orig_rows=7200, out_rows=400, tool_wear_nan_before=0, tool_wear_nan_after=0, action_nan_before=0, action_nan_after=0, step_like=True
Processed PROCESSED_DATA\PROC_9. S1200_F30_D0.15.csv: orig_rows=7000, out_rows=400, tool_wear_nan_before=400, tool_wear_nan_after=0, action_nan_before=400, action_nan_after=0, step_like=True
Done. Cleaned, downsampled files written with suffix: _filled_downsampled.csv
