# Exercise 1: Sensor Data Exploration

## Objective
Understand the structure of the Human Activity Recognition dataset, explore features, visualize distributions, detect inconsistencies, and normalize data for machine learning.

## Step 1: Import Libraries
We start by importing the Python libraries needed for data handling and visualization.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

## Step 2: Load Dataset
Load the feature names, the activity labels, and the training and test data from the dataset folder.

In [None]:
# Features and activity labels
# sep=r'\s+' splits on any run of whitespace (delim_whitespace is deprecated in recent pandas)
features = pd.read_csv('dataset/features.txt', sep=r'\s+', header=None)
activity_labels = pd.read_csv('dataset/activity_labels.txt', sep=r'\s+', header=None, index_col=0)

# Training and test data
X_train = pd.read_csv('dataset/X_train.txt', sep=r'\s+', header=None)
X_test = pd.read_csv('dataset/X_test.txt', sep=r'\s+', header=None)
y_train = pd.read_csv('dataset/y_train.txt', header=None)
y_test = pd.read_csv('dataset/y_test.txt', header=None)

# Assign feature names, de-duplicating them first.
# features.txt repeats some names. With duplicate column labels, X_train['name'] would
# return several columns instead of one, so each repeat gets a numeric suffix (_1, _2, ...).
feature_names = features[1].values
seen = {}
unique_names = []
for name in feature_names:
    if name in seen:
        seen[name] += 1
        unique_names.append(f"{name}_{seen[name]}")
    else:
        seen[name] = 0
        unique_names.append(name)

X_train.columns = unique_names
X_test.columns = unique_names

# Map activity labels for readability
y_train_mapped = y_train[0].map(activity_labels[1])
y_test_mapped = y_test[0].map(activity_labels[1])
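
To see what the de-duplication loop does, the same logic can be run on a short, made-up list of names. This is a minimal sketch: the helper `deduplicate` and the sample names are illustrative and are not part of the dataset or of pandas.

```python
from collections import Counter

def deduplicate(names):
    """Append _1, _2, ... to repeated names; first occurrences stay unchanged."""
    seen = {}
    unique = []
    for name in names:
        if name in seen:
            seen[name] += 1
            unique.append(f"{name}_{seen[name]}")
        else:
            seen[name] = 0
            unique.append(name)
    return unique

# Hypothetical feature names with one name appearing three times
sample_names = ["energy", "mean", "energy", "energy"]

# Names that occur more than once, with their counts
repeated = {n: c for n, c in Counter(sample_names).items() if c > 1}
deduped = deduplicate(sample_names)

print(repeated)  # {'energy': 3}
print(deduped)   # ['energy', 'mean', 'energy_1', 'energy_2']
```

In the notebook itself, `len(set(unique_names)) == len(unique_names)` is a quick way to confirm that every column label is now unique.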

## Step 3: Inspect Dataset
Check dataset dimensions and activity distribution. Understanding your data is the first step before training any model.

In [None]:
print('X_train shape:', X_train.shape)
print('y_train shape:', y_train.shape)

print('\nActivity distribution:')
print(y_train_mapped.value_counts())
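
The objective also mentions detecting inconsistencies. One possible way to do that is to count missing values, duplicate rows, constant columns, and values outside an expected range. The sketch below is an assumption about what such a check could look like: `inconsistency_report` is a hypothetical helper, the toy table stands in for `X_train`, and the default range of [-1, 1] should be verified against the dataset's own documentation before relying on it.

```python
import numpy as np
import pandas as pd

def inconsistency_report(df, low=-1.0, high=1.0):
    """Count common data problems in a numeric feature table."""
    return {
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "constant_columns": int((df.nunique() <= 1).sum()),
        # NaN compares as False, so missing cells are not counted here
        "out_of_range_values": int(((df < low) | (df > high)).sum().sum()),
    }

# Tiny made-up table standing in for X_train
toy = pd.DataFrame({
    "f1": [0.1, 0.1, np.nan, 0.4],   # one missing value
    "f2": [0.5, 0.5, 0.2, 1.7],      # one value above the assumed range
    "f3": [0.0, 0.0, 0.0, 0.0],      # constant column
})

report = inconsistency_report(toy)
print(report)
```

Calling `inconsistency_report(X_train)` in the notebook would give the same four counts for the real training features.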

## Step 4: Visualize Feature Distributions
Plotting histograms helps identify outliers, skewness, or unusual patterns in sensor features.

In [None]:
plt.hist(X_train['tBodyAccMag-mean()'], bins=50)
plt.title('Distribution of tBodyAccMag-mean()')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
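
A histogram shows skewness and outliers visually; the same properties can also be summarized numerically. The following is a minimal sketch, assuming the common 1.5 × IQR rule as the outlier criterion (a choice made for this example, not something the exercise prescribes). `distribution_summary` and the toy values are hypothetical.

```python
import pandas as pd

def distribution_summary(values):
    """Return skewness and the number of points outside the 1.5*IQR fences."""
    s = pd.Series(values, dtype=float)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = int(((s < lower) | (s > upper)).sum())
    return {"skewness": float(s.skew()), "n_outliers": n_outliers}

# Made-up feature with one extreme value on the right
toy_feature = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

summary = distribution_summary(toy_feature)
print(summary)
```

In the notebook, `distribution_summary(X_train['tBodyAccMag-mean()'])` would put numbers next to the histogram above. A positive skewness indicates a longer right tail.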

## Step 5: Normalize Features
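Scaling features is important because many machine learning algorithms, such as distance-based and gradient-based methods, are sensitive to the scale of their inputs. StandardScaler standardizes each feature to mean 0 and standard deviation 1, using the mean and standard deviation estimated from the training set. The scaler is fitted on the training data only and then applied unchanged to the test data.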

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
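
The fit and transform steps can be reproduced by hand to see what they compute. This sketch uses NumPy instead of scikit-learn and made-up numbers, so it illustrates the arithmetic rather than the library's internals.

```python
import numpy as np

# Made-up one-feature training and test sets
train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
test = np.array([[3.0], [6.0]])

# "fit": estimate the statistics from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# "transform": apply the same statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled.ravel())
```

The scaled training data has mean 0 and standard deviation 1 by construction. The scaled test data generally does not, which is expected: it is expressed in the training set's units so that a model sees both sets on the same scale.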

## Step 6: Reflection Questions
1. Why is the target variable kept separate from the features before scaling?
2. Why must the test set use the same scaler as the training set?
3. How does class imbalance affect model training and evaluation?