# FEATURE ENGINEERING

## A FEW CONCEPTS

### What do timestamps and inertial signals mean?


Inertial signals are measurements from sensors like accelerometers and gyroscopes that capture motion—specifically linear acceleration and angular velocity. In the UCI HAR dataset, these include signals such as body acceleration, gyroscope data, and total acceleration along the X, Y, and Z axes. That gives a total of 9 signals.

Time stamps refer to the sequence of measurements taken over time. The dataset is divided into windows of 2.56 seconds, sampled at 50 Hz, meaning each window contains 128 time steps. So, for each sample, you have 128 values for each of the 9 signals, forming a matrix of shape (128, 9) per sample.
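
The window arithmetic above can be checked with a short sketch. The constant names and the zero-filled placeholder array are assumptions made for illustration; they are not part of the dataset's own tooling.

```python
import numpy as np

SAMPLING_RATE_HZ = 50      # sampling rate stated above
WINDOW_SECONDS = 2.56      # window length stated above
N_SIGNALS = 9              # 3 axes x (body acc, gyro, total acc)

# Number of readings in one window: 2.56 s * 50 Hz = 128
steps_per_window = int(round(WINDOW_SECONDS * SAMPLING_RATE_HZ))

# Placeholder for one sample: rows are time steps, columns are signals
one_sample = np.zeros((steps_per_window, N_SIGNALS))

# Time (in seconds) of each reading inside the window
time_axis = np.arange(steps_per_window) / SAMPLING_RATE_HZ

print(steps_per_window)   # 128
print(one_sample.shape)   # (128, 9)
```

The last entry of `time_axis` is 2.54 s rather than 2.56 s because the first reading sits at t = 0.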


### Reason for the shape


The shape `(10299, 128, 9)` is what the UCI HAR Dataset **Inertial Signals** produce once the train and test files are loaded and stacked. Here's what each dimension means:

* `10299`: Total number of samples (train + test combined: 7352 train + 2947 test)
* `128`: Each sample is a time window of 128 readings (time steps)
* `9`: There are **9 inertial signal channels**:
  * Body acceleration: `body_acc_x/y/z`
  * Gyroscope (angular velocity): `body_gyro_x/y/z`
  * Total acceleration: `total_acc_x/y/z`
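
A small sketch of how the last axis can be indexed by signal name. The `CHANNELS` list and the `get_channel` helper are hypothetical, the channel order assumes the nine signal files are stacked in alphabetical filename order, and the array is synthetic.

```python
import numpy as np

# Channel order assumed to follow the alphabetically sorted file names
CHANNELS = [
    "body_acc_x", "body_acc_y", "body_acc_z",
    "body_gyro_x", "body_gyro_y", "body_gyro_z",
    "total_acc_x", "total_acc_y", "total_acc_z",
]

def get_channel(x_raw, name):
    """Return the (samples, 128) slice for one named signal."""
    return x_raw[:, :, CHANNELS.index(name)]

# Synthetic stand-in for the real array: 4 windows instead of 10299
demo = np.random.default_rng(0).normal(size=(4, 128, 9))
gyro_z = get_channel(demo, "body_gyro_z")
print(gyro_z.shape)  # (4, 128)
```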

### 📊 Why the **train_data has 561 features**:

- The 561 features come from **hand-crafted feature extraction** on the raw signal data.

⚙️ Step-by-step:

1. **Raw data shape** (after stacking `Inertial Signals`):

   * Shape = `(7352, 128, 9)` for training
   * Raw signal from accelerometer + gyroscope in time windows.

2. **Feature extraction process** (already done in the dataset's `X_train.txt` file):

   * From each 128-sample × 9-signal window → extract statistical features:

     * e.g., mean, std, energy, entropy, correlation, FFT coefficients, etc.
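     * This is done per signal, including signals derived from the raw ones such as jerk and magnitude (see the dataset's `features_info.txt`).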
   * In total, **561 features per sample** are engineered this way.

3. **Final train dataset used in classification:**

   * Shape = `(7352, 561)`
   * Each row is a sample (1 time window)
   * Each column is a feature

📁 Files involved:

* `Inertial Signals/` → Raw 128×9 time series
* `X_train.txt` → Preprocessed data with 561 features
* `y_train.txt` → Corresponding activity labels

If you are working from the raw Inertial Signals, you will have to **re-implement feature extraction** to get the 561 features, or use `X_train.txt` directly.
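
For the second option, a minimal sketch of reading the precomputed files might look like this. The `load_precomputed` helper is hypothetical, and the tiny files written to a temporary directory are synthetic stand-ins for `X_train.txt` and `y_train.txt`; the sketch assumes whitespace-separated rows with no header.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def load_precomputed(features_path, labels_path):
    """Read whitespace-separated feature rows and one label per line."""
    X = pd.read_csv(features_path, sep=r"\s+", header=None)
    y = pd.read_csv(labels_path, header=None).values.ravel()
    return X, y

# Tiny synthetic files standing in for X_train.txt / y_train.txt
tmp_dir = tempfile.mkdtemp()
x_path = os.path.join(tmp_dir, "X_train.txt")
y_path = os.path.join(tmp_dir, "y_train.txt")
np.savetxt(x_path, np.random.default_rng(0).normal(size=(5, 561)))
np.savetxt(y_path, np.array([1, 2, 3, 4, 5]), fmt="%d")

X_demo, y_demo = load_precomputed(x_path, y_path)
print(X_demo.shape, y_demo.shape)  # (5, 561) (5,)
```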



# CONCATENATING THE SIGNAL FILES TO CREATE RAW DATA

In [1]:
import pandas as pd
import numpy as np
import os

def load_signals(signal_dir):
    # Sorted order gives body_acc_*, body_gyro_*, total_acc_* (x, y, z within each)
    filenames = sorted(os.listdir(signal_dir))
    # sep=r"\s+" replaces the deprecated delim_whitespace=True
    signal_data = [pd.read_csv(os.path.join(signal_dir, f), sep=r"\s+", header=None)
                   for f in filenames]
    return np.stack(signal_data, axis=-1)  # shape: (samples, time, signals)

x_train_raw = load_signals("/Users/priyam/paper_recreation/HAR MODEL_OPTIMIZATION/UCI HAR Dataset/train/Inertial Signals")

x_test_raw = load_signals("/Users/priyam/paper_recreation/HAR MODEL_OPTIMIZATION/UCI HAR Dataset/test/Inertial Signals")
# Combine train and test data
x_raw = np.concatenate((x_train_raw, x_test_raw), axis=0)
y_train = pd.read_csv("/Users/priyam/paper_recreation/HAR MODEL_OPTIMIZATION/UCI HAR Dataset/train/y_train.txt", header=None).values.flatten()
y_test = pd.read_csv("/Users/priyam/paper_recreation/HAR MODEL_OPTIMIZATION/UCI HAR Dataset/test/y_test.txt", header=None).values.flatten()
# Combine train and test labels
y = np.concatenate((y_train, y_test), axis=0)
# convert x_raw to a DataFrame without flattening
# x = pd.DataFrame(x_raw)  # shape: (samples, time * features)
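
Before moving on, it can help to confirm that the stacked windows and the labels line up. The sketch below is one possible check; `check_raw_arrays` is a hypothetical helper, and the arrays are synthetic stand-ins for `x_raw` and `y`.

```python
import numpy as np

def check_raw_arrays(x_raw, y, steps=128, n_signals=9):
    """Raise AssertionError if windows and labels do not line up."""
    assert x_raw.ndim == 3, "expected (samples, time, signals)"
    assert x_raw.shape[1:] == (steps, n_signals), x_raw.shape
    assert len(y) == x_raw.shape[0], "one label per window expected"
    assert not np.isnan(x_raw).any(), "raw signals contain NaN"
    return x_raw.shape

# Synthetic stand-ins: 6 train windows and 4 test windows
x_tr, x_te = np.zeros((6, 128, 9)), np.ones((4, 128, 9))
y_tr, y_te = np.arange(6) % 6 + 1, np.arange(4) % 6 + 1
x_all = np.concatenate((x_tr, x_te), axis=0)
y_all = np.concatenate((y_tr, y_te), axis=0)
print(check_raw_arrays(x_all, y_all))  # (10, 128, 9)
```

On the real data the same call would be expected to return `(10299, 128, 9)`.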



# FEATURE EXTRACTION FROM RAW_FILE 

In [2]:
import numpy as np
import pandas as pd
from scipy.stats import skew, kurtosis, entropy
from scipy.fft import fft

def extract_features_from_window(window):
    features = []
    for signal in window.T:  # window.T has shape (9, 128); iterate over the 9 signals
        # Time-domain features
        features.append(np.mean(signal))
        features.append(np.std(signal))
        features.append(np.min(signal))
        features.append(np.max(signal))
        features.append(np.median(signal))
        features.append(skew(signal))
        features.append(kurtosis(signal))

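        # Frequency-domain features (full two-sided FFT, DC bin included)
        fft_vals = np.abs(fft(signal))
        fft_norm = fft_vals / (np.sum(fft_vals) + 1e-12)  # normalise to sum to 1; epsilon avoids division by zero
        features.append(np.sum(fft_vals**2))              # Energy: sum of squared FFT magnitudes
        features.append(entropy(fft_norm + 1e-12))        # Spectral entropy of the normalised magnitudes
        features.append(np.mean(fft_vals))                 # "mean_power": mean FFT magnitude (not squared)
        features.append(np.argmax(fft_vals))               # Index of the largest bin; bin 0 (DC) wins when the signal has a large mean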
    return features

def extract_features_from_all_windows(x_raw):
    feature_names = []
    signals = ['body_acc_x', 'body_acc_y', 'body_acc_z',
               'body_gyro_x', 'body_gyro_y', 'body_gyro_z',
               'total_acc_x', 'total_acc_y', 'total_acc_z']
    stats = ['mean', 'std', 'min', 'max', 'median', 'skew', 'kurtosis',
             'energy', 'entropy', 'mean_power', 'max_freq_idx']
    
    for sig in signals:
        for stat in stats:
            feature_names.append(f"{sig}_{stat}")
    
    all_features = []
    for window in x_raw:
        feats = extract_features_from_window(window)
        all_features.append(feats)
        
    df_features = pd.DataFrame(all_features, columns=feature_names)
    return df_features

# Usage:
# x_features_df = extract_features_from_all_windows(x_raw)
# print(x_features_df.shape)  # (samples, 9*11=99 features)


In [3]:
df_features = extract_features_from_all_windows(x_raw)

In [4]:
df_features.head(2)

Unnamed: 0,body_acc_x_mean,body_acc_x_std,body_acc_x_min,body_acc_x_max,body_acc_x_median,body_acc_x_skew,body_acc_x_kurtosis,body_acc_x_energy,body_acc_x_entropy,body_acc_x_mean_power,...,total_acc_z_std,total_acc_z_min,total_acc_z_max,total_acc_z_median,total_acc_z_skew,total_acc_z_kurtosis,total_acc_z_energy,total_acc_z_entropy,total_acc_z_mean_power,total_acc_z_max_freq_idx
0,0.002269,0.002941,-0.004294,0.01081,0.002025,0.481111,-0.395797,0.226044,4.338396,0.024127,...,0.00397,0.088742,0.109485,0.099841,0.071125,0.4938,163.220498,1.654016,0.132356,0
1,0.000174,0.001981,-0.006706,0.005251,0.00011,-0.480776,1.472747,0.064786,4.462213,0.016851,...,0.004918,0.0811,0.105788,0.097748,-1.084209,1.257869,154.361101,1.850264,0.135705,0
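
In the output above, `total_acc_z_max_freq_idx` is 0 in both rows. That is what `np.argmax` returns when the constant (DC) component is the largest FFT bin, which is likely for signals with a non-zero mean. The sketch below shows one possible alternative that skips the DC bin and reports a frequency in Hz. It is not what the notebook computes; `dominant_frequency_hz` is a hypothetical helper, the 50 Hz default comes from the sampling rate stated earlier, and the test signal is synthetic.

```python
import numpy as np

def dominant_frequency_hz(signal, fs=50.0):
    """Frequency (Hz) of the largest non-DC bin of a real-valued signal."""
    magnitudes = np.abs(np.fft.rfft(signal))       # one-sided spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    peak_bin = 1 + np.argmax(magnitudes[1:])       # skip bin 0 (the mean)
    return freqs[peak_bin]

# Synthetic window: 5 Hz sine riding on a constant offset of 1.0
t = np.arange(128) / 50.0
window = 1.0 + np.sin(2 * np.pi * 5.0 * t)

naive_idx = int(np.argmax(np.abs(np.fft.fft(window))))
print(naive_idx)                      # 0, the DC bin dominates
print(dominant_frequency_hz(window))  # close to 5 Hz (bin spacing is 50/128 Hz)
```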


# PREPROCESSING

In [11]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler


In [12]:
# missing_values = df_features.isnull().sum()
print(df_features.isnull().sum().sum())  # Should be 0


0


# If any missing values are found, fill them with the column mean
if df_features.isnull().values.any():
    df_features = df_features.fillna(df_features.mean())


In [17]:
X = df_features
X.head(2)


Unnamed: 0,body_acc_x_mean,body_acc_x_std,body_acc_x_min,body_acc_x_max,body_acc_x_median,body_acc_x_skew,body_acc_x_kurtosis,body_acc_x_energy,body_acc_x_entropy,body_acc_x_mean_power,...,total_acc_z_std,total_acc_z_min,total_acc_z_max,total_acc_z_median,total_acc_z_skew,total_acc_z_kurtosis,total_acc_z_energy,total_acc_z_entropy,total_acc_z_mean_power,total_acc_z_max_freq_idx
0,0.002269,0.002941,-0.004294,0.01081,0.002025,0.481111,-0.395797,0.226044,4.338396,0.024127,...,0.00397,0.088742,0.109485,0.099841,0.071125,0.4938,163.220498,1.654016,0.132356,0
1,0.000174,0.001981,-0.006706,0.005251,0.00011,-0.480776,1.472747,0.064786,4.462213,0.016851,...,0.004918,0.0811,0.105788,0.097748,-1.084209,1.257869,154.361101,1.850264,0.135705,0


### STANDARDIZE FEATURES

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Convert back to DataFrame with feature names
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)
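
The cell above fits the scaler on all samples before any train/test split. A common alternative, sketched here on synthetic data, is to fit on the training portion only so that test-set statistics do not influence the scaling. This is a variation, not what the notebook does; `X_demo`, `demo_scaler` and the column names are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 99-column feature table
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(5.0, 2.0, size=(200, 3)), columns=["f1", "f2", "f3"])

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

demo_scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only
X_tr_scaled = pd.DataFrame(demo_scaler.fit_transform(X_tr), columns=X_demo.columns)
# Reuse the training statistics on the held-out rows
X_te_scaled = pd.DataFrame(demo_scaler.transform(X_te), columns=X_demo.columns)

print(X_tr_scaled.shape, X_te_scaled.shape)  # (160, 3) (40, 3)
```

With this arrangement the training columns have mean close to 0, while the test columns are only approximately centred.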

In [19]:
X_scaled_df.head(2)

Unnamed: 0,body_acc_x_mean,body_acc_x_std,body_acc_x_min,body_acc_x_max,body_acc_x_median,body_acc_x_skew,body_acc_x_kurtosis,body_acc_x_energy,body_acc_x_entropy,body_acc_x_mean_power,...,total_acc_z_std,total_acc_z_min,total_acc_z_max,total_acc_z_median,total_acc_z_skew,total_acc_z_kurtosis,total_acc_z_energy,total_acc_z_entropy,total_acc_z_mean_power,total_acc_z_max_freq_idx
0,0.210534,-0.883335,0.918871,-0.868773,0.611526,0.356362,-0.3394,-0.706303,0.782367,-0.884585,...,-0.913413,0.40467,-0.444325,0.039661,0.299175,0.052158,-0.563109,-0.730411,-1.203022,-0.287719
1,0.060208,-0.890098,0.908664,-0.884263,0.567732,-1.17913,0.580954,-0.706492,1.339885,-0.893425,...,-0.900186,0.388502,-0.456851,0.033242,-1.522574,0.556553,-0.56574,-0.593033,-1.19572,-0.287719


In [21]:
X_CLEAN = X_scaled_df
X_CLEAN.head(2)

Unnamed: 0,body_acc_x_mean,body_acc_x_std,body_acc_x_min,body_acc_x_max,body_acc_x_median,body_acc_x_skew,body_acc_x_kurtosis,body_acc_x_energy,body_acc_x_entropy,body_acc_x_mean_power,...,total_acc_z_std,total_acc_z_min,total_acc_z_max,total_acc_z_median,total_acc_z_skew,total_acc_z_kurtosis,total_acc_z_energy,total_acc_z_entropy,total_acc_z_mean_power,total_acc_z_max_freq_idx
0,0.210534,-0.883335,0.918871,-0.868773,0.611526,0.356362,-0.3394,-0.706303,0.782367,-0.884585,...,-0.913413,0.40467,-0.444325,0.039661,0.299175,0.052158,-0.563109,-0.730411,-1.203022,-0.287719
1,0.060208,-0.890098,0.908664,-0.884263,0.567732,-1.17913,0.580954,-0.706492,1.339885,-0.893425,...,-0.900186,0.388502,-0.456851,0.033242,-1.522574,0.556553,-0.56574,-0.593033,-1.19572,-0.287719


# SAVING THE FILES 

In [24]:
import os
import pandas as pd

# Define base path
base_path = "/Users/priyam/paper_recreation/HAR MODEL_OPTIMIZATION/DATA_EXTRACTED_WITH_99_FEATURES"
os.makedirs(base_path, exist_ok=True)

# Save cleaned feature data
X_CLEAN.to_csv(os.path.join(base_path, "X_CLEAN.csv"), index=False)

# Save labels
y_df = pd.DataFrame(y, columns=['activity'])
y_df.to_csv(os.path.join(base_path, "Y_CLEAN.csv"), index=False)

# Save feature names
feature_names = df_features.columns.tolist()
with open(os.path.join(base_path, "feature_names.txt"), 'w') as f:
    for name in feature_names:
        f.write(f"{name}\n")

# Also save feature names in CSV
feature_names_df = pd.DataFrame(feature_names, columns=['feature_name'])
feature_names_df.to_csv(os.path.join(base_path, "feature_names.csv"), index=False)

# Save shape info
shape_info = {
    'features_shape': df_features.shape,
    'labels_shape': y_df.shape
}
shape_info_df = pd.DataFrame([shape_info])
shape_info_df.to_csv(os.path.join(base_path, "shape_info.csv"), index=False)

print(" All files saved under:", base_path)


 All files saved under: /Users/priyam/paper_recreation/HAR MODEL_OPTIMIZATION/DATA_EXTRACTED_WITH_99_FEATURES


In [38]:
readme_text = """
FOLDER STRUCTURE AND CONTENTS
============================================================

/stage1_BASELINE/
├── DATA_EXTRACTED_WITH_99_FEATURES/   # Contains cleaned and preprocessed data
│   ├── feature_names.csv
│   ├── feature_names.txt
│   ├── shape_info.csv
│   ├── X_CLEAN.csv
│   └── Y_CLEAN.csv
├── OUTPUT/
│   ├── <MODEL_NAME>_RESULTS/  # e.g., Linear_SVC_RESULTS/
│   │   ├── <MODEL_NAME>_classification_report.txt
│   │   └── <MODEL_NAME>_confusion_matrix.png
│   └── model_comparison.txt
└── README.txt
------------------------------------------------------------

      
      

Folder: DATA_EXTRACTED_WITH_99_FEATURES
------------------------------------------------------------
This folder contains the cleaned and preprocessed data extracted from the **UCI HAR Dataset** using raw inertial signal files.

✔️ SOURCE:
-----------
The raw data was taken from the Inertial Signals directory of the original UCI HAR dataset. These signals include 9 sensor signals:
- body_acc_x
- body_acc_y
- body_acc_z
- body_gyro_x
- body_gyro_y
- body_gyro_z
- total_acc_x
- total_acc_y
- total_acc_z

Each signal was recorded over 128 time steps for every activity window/sample.

✔️ PIPELINE:
------------
1. **Raw Data Loaded**  
   Inertial signal files from `train` and `test` directories were loaded and concatenated to form the full raw dataset.

2. **Raw Files Saved**  
   Raw signal data was saved as `X_RAW.csv` and labels as `Y_RAW.csv` for reference.

3. **Feature Extraction**  
   From each window (i.e., one sample of 128 time steps), the following statistical and frequency-domain features were extracted:
   
   For each signal (total 9), the following 11 features were computed:
   - Mean
   - Standard Deviation
   - Minimum
   - Maximum
   - Median
   - Skewness
   - Kurtosis
   - Energy (Sum of squares of FFT)
   - Entropy (of normalized FFT)
   - Mean Power (mean of FFT magnitudes)
   - Max Frequency Index (argmax of FFT)

   👉 This results in **99 features total** (9 signals × 11 features each).

4. **Cleaned Data Saved**  
   The extracted features were saved as:
   - `X_CLEAN.csv` — Cleaned feature matrix (shape: [samples, 99])
   - `Y_CLEAN.csv` — Corresponding labels for each sample
   - `feature_names.csv` — Names of the 99 features
   - `feature_names.txt` — Plain text list of all features
   - `shape_info.csv` — Shape of the final feature and label datasets

✔️ OUTPUT FILES:
----------------
- `X_CLEAN.csv` — Extracted feature dataset
- `Y_CLEAN.csv` — Activity labels
- `feature_names.csv` — Feature names in CSV
- `feature_names.txt` — Feature names in plain text
- `shape_info.csv` — Dataset shapes

📌 This dataset is now ready for use in machine learning pipelines for Human Activity Recognition (HAR).

Author: Priyam Pandey  
Date: 24 May 2025

---------------------
MODEL TRAINING AND EVALUATION
------------------------------------------------------------
* Loaded `X_CLEAN.csv` and `Y_CLEAN.csv` from the given base path

* Split data into training and testing sets

* Defined 12 classification models:
  `[LinearSVC, GradientBoosting, ExtraTrees, Bagging, ANN, RandomForest, CART, GaussianNB, DecisionTree, AdaBoost, KNN, LogisticRegression]`

* Trained each model using a loop

* Calculated metrics: Accuracy, F1 Score, Recall, Precision
* Saved classification report (`.txt`) and confusion matrix (`.png`) for each model in a separate folder named `<MODEL_NAME>_RESULTS`
* Compiled all model scores into a comparison table, saved as `model_comparison.txt` in the base path
------------------------------------------------------------
"""
base_path = "/Users/priyam/paper_recreation/HAR MODEL_OPTIMIZATION/stage1_BASELINE"
# Save the README file in the base path (UTF-8 because the text contains box-drawing characters and emoji)
with open(os.path.join(base_path, "README.txt"), "w", encoding="utf-8") as f:
    f.write(readme_text)
print("README file created at:", os.path.join(base_path, "README.txt"))

README file created at: /Users/priyam/paper_recreation/HAR MODEL_OPTIMIZATION/stage1_BASELINE/README.txt


# MODEL_TRAINING

### IMPORT LIBRARIES


In [28]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, f1_score, recall_score, precision_score,
    classification_report, confusion_matrix
)

# ML models
from sklearn.svm import LinearSVC
from sklearn.ensemble import (
    GradientBoostingClassifier, ExtraTreesClassifier, BaggingClassifier,
    RandomForestClassifier, AdaBoostClassifier
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier


### LOADING DATA AND SPLITTING IT 

In [33]:
# Define base path
base_path1 = "/Users/priyam/paper_recreation/HAR MODEL_OPTIMIZATION/stage1_BASELINE/DATA_EXTRACTED_WITH_99_FEATURES"

# Load clean data
X = pd.read_csv(os.path.join(base_path1, "X_CLEAN.csv"))
y = pd.read_csv(os.path.join(base_path1, "Y_CLEAN.csv")).values.ravel()

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
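
The split above is purely random. If the class proportions should be preserved in both parts, `train_test_split` accepts a `stratify` argument. The sketch below illustrates this on synthetic, deliberately imbalanced labels; it is an optional variation, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for the activity classes
y_demo = np.array([1] * 60 + [2] * 30 + [3] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Class counts in the 20-sample test split mirror the 60/30/10 proportions
print(np.bincount(y_te)[1:].tolist())  # [12, 6, 2]
```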


### DEFINING MODEL DIRECTORY


In [34]:
base_path= "/Users/priyam/paper_recreation/HAR MODEL_OPTIMIZATION/stage1_BASELINE/OUTPUT1"

In [35]:
# Dictionary of models
models = {
    "Linear_SVC": LinearSVC(max_iter=10000),
    "Gradient_Boosting": GradientBoostingClassifier(),
    "Extra_Trees": ExtraTreesClassifier(),
    "Bagged_Decision_Trees": BaggingClassifier(),
    "ANN": MLPClassifier(max_iter=1000),
    "Random_Forest": RandomForestClassifier(),
    "CART": DecisionTreeClassifier(),  # Same as Decision Tree
    "Gaussian_Naive_Bayes": GaussianNB(),
    "Decision_Tree": DecisionTreeClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "KNN": KNeighborsClassifier(),
    "Logistic_Regression": LogisticRegression(max_iter=10000)
}


### 🔁 Train, Evaluate, Save Report & Confusion Matrix

In [36]:
# Store results
results = []

# Loop through each model
for model_name, model in models.items():
    print(f"Training {model_name}...")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Metrics
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
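    # zero_division=0 scores a never-predicted class as 0 without raising UndefinedMetricWarning
    precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)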
    
    results.append([model_name, acc, f1, recall, precision])

    # Classification report & confusion matrix
    report = classification_report(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)

    # Save in model-specific folder
    model_folder = os.path.join(base_path, f"{model_name}_RESULTS")
    os.makedirs(model_folder, exist_ok=True)

    with open(os.path.join(model_folder, f"{model_name}_classification_report.txt"), "w") as f:
        f.write(report)

    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f"Confusion Matrix - {model_name}")
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.savefig(os.path.join(model_folder, f"{model_name}_confusion_matrix.png"))
    plt.close()
    print(f"{model_name} completed. Results saved in {model_folder}")

Training Linear_SVC...
Linear_SVC completed. Results saved in /Users/priyam/paper_recreation/HAR MODEL_OPTIMIZATION/stage1_BASELINE/OUTPUT1/Linear_SVC_RESULTS
Training Gradient_Boosting...
Gradient_Boosting completed. Results saved in /Users/priyam/paper_recreation/HAR MODEL_OPTIMIZATION/stage1_BASELINE/OUTPUT1/Gradient_Boosting_RESULTS
Training Extra_Trees...
Extra_Trees completed. Results saved in /Users/priyam/paper_recreation/HAR MODEL_OPTIMIZATION/stage1_BASELINE/OUTPUT1/Extra_Trees_RESULTS
Training Bagged_Decision_Trees...
Bagged_Decision_Trees completed. Results saved in /Users/priyam/paper_recreation/HAR MODEL_OPTIMIZATION/stage1_BASELINE/OUTPUT1/Bagged_Decision_Trees_RESULTS
Training ANN...
ANN completed. Results saved in /Users/priyam/paper_recreation/HAR MODEL_OPTIMIZATION/stage1_BASELINE/OUTPUT1/ANN_RESULTS
Training Random_Forest...
Random_Forest completed. Results saved in /Users/priyam/paper_recreation/HAR MODEL_OPTIMIZATION/stage1_BASELINE/OUTPUT1/Random_Forest_RESULTS
...
AdaBoost completed. Results saved in /Users/priyam/paper_recreation/HAR MODEL_OPTIMIZATION/stage1_BASELINE/OUTPUT1/AdaBoost_RESULTS
Training KNN...
KNN completed. Results saved in /Users/priyam/paper_recreation/HAR MODEL_OPTIMIZATION/stage1_BASELINE/OUTPUT1/KNN_RESULTS
Training Logistic_Regression...
Logistic_Regression completed. Results saved in /Users/priyam/paper_recreation/HAR MODEL_OPTIMIZATION/stage1_BASELINE/OUTPUT1/Logistic_Regression_RESULTS
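
The training loop reports F1, recall and precision with `average='weighted'`. To make that averaging concrete, the sketch below recomputes a weighted F1 by hand on a toy example: per-class F1 scores are averaged with weights equal to each class's number of true samples. `weighted_f1` is a hypothetical helper written for illustration, and the labels are made up.

```python
import numpy as np

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total = 0.0
    for label in np.unique(y_true):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        precision = tp / (tp + fp) if (tp + fp) else 0.0  # 0 when the class is never predicted
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * np.sum(y_true == label)             # weight by support
    return total / len(y_true)

y_true_demo = [1, 1, 1, 2, 2, 3]
y_pred_demo = [1, 1, 2, 2, 2, 3]
print(round(weighted_f1(y_true_demo, y_pred_demo), 4))  # 0.8333
```

Here classes 1 and 2 each have F1 = 0.8 and class 3 has F1 = 1.0, so the weighted value is (3 × 0.8 + 2 × 0.8 + 1 × 1.0) / 6 = 5/6.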


### SAVE COMPARISON TABLE 

In [37]:
# Create and sort comparison DataFrame
comparison_df = pd.DataFrame(results, columns=["Model", "Accuracy", "F1 Score", "Recall", "Precision"])
comparison_df.sort_values(by="F1 Score", ascending=False, inplace=True)

# Save to TXT
comparison_txt_path = os.path.join(base_path, "model_comparison.txt")
with open(comparison_txt_path, "w") as f:
    f.write(comparison_df.to_string(index=False))

print("✅ All models trained and results saved successfully.")


✅ All models trained and results saved successfully.
