<h1 align=center style="line-height:200%;color:#0099cc">
    Movement Status
</h1>


<h2 style="line-height:200%;color:#0099cc">
    Dataset Introduction
</h2>

<p style="text-align: justify; line-height:200%; font-size:medium">
    The training dataset includes 270,688 rows, with the description of each column provided in the table below.
</p>

<center>
<div style="line-height:200%;font-size:medium">
|Column|Description|
|:------:|:---:|
|timestamp|Timestamp|
|back_x|X-axis acceleration data from the lower back sensor|
|back_y|Y-axis acceleration data from the lower back sensor|
|back_z|Z-axis acceleration data from the lower back sensor|
|thigh_x|X-axis acceleration data from the thigh sensor|
|thigh_y|Y-axis acceleration data from the thigh sensor|
|thigh_z|Z-axis acceleration data from the thigh sensor|
|label|An integer representing the movement activity|
</div>
</center>

<p style="text-align: justify; line-height:200%; font-size:medium">
    The test dataset is similar to the training set, except it does not contain the <code>label</code> column, which is the target variable of the problem. The test dataset has 90,229 rows and 7 columns.
</p>


<h2 style="line-height:200%;color:#0099cc">
    Reading the Dataset
</h2>

<p style="text-align: justify; line-height:200%; font-size:medium">
    First, you need to import the necessary libraries. Then, based on the descriptions above, read the training and test datasets appropriately and perform the necessary preprocessing steps on them.
</p>


In [1]:
import os
import numpy as np
import pandas as pd
from typing import Tuple, List

# Modeling
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Utils
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Reading/Loading the dataset files
# Locate data directory 
possible_data_dirs = [
    os.path.join(os.getcwd(), 'data'),
    os.path.join(os.getcwd(), '..', 'data'),
    os.path.join(os.path.dirname(os.getcwd()), 'data')
]

data_dir = None
for p in possible_data_dirs:
    if os.path.exists(os.path.join(p, 'train.csv')) and os.path.exists(os.path.join(p, 'test.csv')):
        data_dir = p
        print(f"Data directory found: {data_dir}")
        break

if data_dir is None:
    raise FileNotFoundError("Could not find data directory containing train.csv and test.csv")

train_path = os.path.join(data_dir, 'train.csv')
test_path = os.path.join(data_dir, 'test.csv')

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)
# print the first 5 rows of the training data
print(train_df.head())


Data directory found: /Users/taher/Projects/data-processing–preliminary  /Q4/notebook/../data
                 timestamp    back_x    back_y    back_z   thigh_x   thigh_y  \
0  2000-01-01 00:00:00.000 -0.956569 -0.098830  0.080135 -0.960096 -0.162203   
1  2000-01-01 00:00:00.020 -0.998093 -0.111525  0.078602 -1.020245 -0.092208   
2  2000-01-01 00:00:00.040 -0.948218 -0.091371  0.083376 -0.970062 -0.115447   
3  2000-01-01 00:00:00.060 -0.995374 -0.078641  0.099155 -1.011861 -0.133171   
4  2000-01-01 00:00:00.080 -0.956965 -0.078572  0.088926 -0.999605 -0.024058   

    thigh_z  label  
0 -0.019368      6  
1 -0.045086      6  
2 -0.058316      6  
3 -0.024171      6  
4 -0.077175      6  


In [3]:
# Preprocessing step
# 1) Parse timestamps and sort by time to preserve sequence
train_df['timestamp'] = pd.to_datetime(train_df['timestamp'])
test_df['timestamp'] = pd.to_datetime(test_df['timestamp'])
train_df = train_df.sort_values('timestamp').reset_index(drop=True)
test_df = test_df.sort_values('timestamp').reset_index(drop=True)

SENSOR_COLS = ['back_x', 'back_y', 'back_z', 'thigh_x', 'thigh_y', 'thigh_z']
TARGET_COL = 'label'

# 2) Choose a window size that produces exactly 249 test windows
NUM_TEST_WINDOWS = 249
WINDOW_SIZE = len(test_df) // NUM_TEST_WINDOWS  # integer division
print(f"Test rows: {len(test_df)}, window_size: {WINDOW_SIZE}, windows: {NUM_TEST_WINDOWS}, remainder dropped: {len(test_df) - WINDOW_SIZE*NUM_TEST_WINDOWS}")

# 3) Feature engineering helpers (numpy is already imported in the first cell)

def median_abs_deviation(values: np.ndarray) -> float:
    med = np.median(values)
    return float(np.median(np.abs(values - med)))

def window_features(df_window: pd.DataFrame) -> dict:
    feats = {}
    n = len(df_window)
    t = np.arange(n, dtype=np.float32)
    for col in SENSOR_COLS:
        x = df_window[col].to_numpy(dtype=np.float32, copy=False)
        feats[f'{col}_mean'] = float(np.mean(x))
        feats[f'{col}_std'] = float(np.std(x, ddof=1)) if n > 1 else 0.0
        feats[f'{col}_min'] = float(np.min(x))
        feats[f'{col}_max'] = float(np.max(x))
        feats[f'{col}_median'] = float(np.median(x))
        q25 = float(np.quantile(x, 0.25))
        q75 = float(np.quantile(x, 0.75))
        feats[f'{col}_q25'] = q25
        feats[f'{col}_q75'] = q75
        feats[f'{col}_iqr'] = q75 - q25
        feats[f'{col}_mad'] = median_abs_deviation(x)
        feats[f'{col}_abs_mean'] = float(np.mean(np.abs(x)))
        feats[f'{col}_energy'] = float(np.mean(x*x))
        # Linear trend (slope)
        if n > 1:
            slope = np.polyfit(t, x, 1)[0]
        else:
            slope = 0.0
        feats[f'{col}_slope'] = float(slope)
    # Simple cross-sensor correlations for aligned axes
    for a_col, b_col, name in [
        ('back_x','thigh_x','x'),
        ('back_y','thigh_y','y'),
        ('back_z','thigh_z','z'),
    ]:
        a = df_window[a_col].to_numpy(dtype=np.float32, copy=False)
        b = df_window[b_col].to_numpy(dtype=np.float32, copy=False)
        if n > 1 and np.std(a) > 0 and np.std(b) > 0:
            corr = float(np.corrcoef(a, b)[0,1])
        else:
            corr = 0.0
        feats[f'corr_back_thigh_{name}'] = corr
    return feats


def build_windows_features(df: pd.DataFrame, window_size: int, num_windows: int, with_labels: bool) -> tuple:
    feature_rows = []
    labels = [] if with_labels else None
    for i in range(num_windows):
        start = i * window_size
        end = start + window_size
        if end > len(df):
            break
        w = df.iloc[start:end]
        feature_rows.append(window_features(w))
        if with_labels:
            # Use majority label within window
            lbl = int(w[TARGET_COL].mode().iloc[0])
            labels.append(lbl)
    X = pd.DataFrame(feature_rows)
    y = np.array(labels, dtype=np.int32) if with_labels else None
    return X, y

# 4) Build train windows (use same window size; drop trailing remainder)
num_train_windows = len(train_df) // WINDOW_SIZE
X_all, y_all = build_windows_features(train_df, WINDOW_SIZE, num_train_windows, with_labels=True)
print('Train windows:', X_all.shape, 'Labels:', y_all.shape)

# 5) Time-based split: last 20% windows as validation
split_idx = int(0.8 * len(X_all))
X_train, y_train = X_all.iloc[:split_idx].reset_index(drop=True), y_all[:split_idx]
X_val, y_val = X_all.iloc[split_idx:].reset_index(drop=True), y_all[split_idx:]
print('X_train:', X_train.shape, 'X_val:', X_val.shape)

Test rows: 90229, window_size: 362, windows: 249, remainder dropped: 91
Train windows: (747, 75) Labels: (747,)
X_train: (597, 75) X_val: (150, 75)
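
<p style="text-align: justify; line-height:200%; font-size:medium">
    The window arithmetic printed above can be reproduced with a short sketch. The helper <code>window_bounds</code> is a hypothetical name introduced only for illustration. The 50 Hz sampling rate is an assumption inferred from the 20 ms spacing of the first five timestamps, and it may not hold for the whole recording.
</p>

```python
def window_bounds(n_rows: int, n_windows: int):
    """Split n_rows into n_windows equal, non-overlapping windows.

    Returns the window size, the (start, end) row bounds of each window,
    and the number of trailing rows that do not fit into any window.
    """
    size = n_rows // n_windows
    bounds = [(i * size, (i + 1) * size) for i in range(n_windows)]
    remainder = n_rows - size * n_windows
    return size, bounds, remainder


# Row count and window count taken from the notebook output above.
size, bounds, remainder = window_bounds(90229, 249)
print(size, len(bounds), remainder)  # 362 249 91
print(bounds[-1])                    # (89776, 90138)

# Assumption: one sample every 20 ms, i.e. 50 samples per second.
ASSUMED_HZ = 50
window_seconds = size / ASSUMED_HZ
print(window_seconds)                # 7.24
```

<p style="text-align: justify; line-height:200%; font-size:medium">
    Under that assumption, each window summarizes roughly seven seconds of movement, and the last 91 test rows fall outside the 249 full windows.
</p>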


<h2 style="line-height:200%;color:#0099cc">
    Model Training
</h2>

<p style="text-align: justify; line-height:200%; font-size:medium">
    Now that you have cleaned the data, it's time to train a model that can predict the target variable for this problem.
</p>


In [4]:
# Model design
# RandomForest baseline on window features
rf = RandomForestClassifier(
    n_estimators=400,
    max_depth=None,
    min_samples_split=4,
    min_samples_leaf=2,
    n_jobs=-1,
    random_state=42,
    class_weight='balanced_subsample'
)
rf.fit(X_train, y_train)
print('Model trained. Train windows:', len(X_train), 'Val windows:', len(X_val))

Model trained. Train windows: 597 Val windows: 150
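
<p style="text-align: justify; line-height:200%; font-size:medium">
    The classifier above is configured with <code>class_weight='balanced_subsample'</code>. In scikit-learn, the <code>'balanced'</code> mode weights each class by <code>n_samples / (n_classes * class_count)</code>, and <code>'balanced_subsample'</code> applies the same formula to the bootstrap sample of each tree. The sketch below illustrates that formula on a toy label list; <code>balanced_class_weights</code> and <code>toy_labels</code> are hypothetical names, and the labels are made up for illustration rather than taken from the dataset.
</p>

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels (illustrative only): class 1 is common, class 7 is rare.
toy_labels = [1, 1, 1, 6, 6, 7]
weights = balanced_class_weights(toy_labels)

# The rarest class receives the largest weight.
for cls in sorted(weights):
    print(cls, round(weights[cls], 3))
```

<p style="text-align: justify; line-height:200%; font-size:medium">
    Rare activities therefore contribute more per window to each tree, which tends to help a macro-averaged metric where every class counts equally.
</p>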


<h3 style="line-height:200%;color:#0099cc">
    Evaluation Metric
</h3>

<p style="text-align: justify; line-height:200%; font-size:medium">
    Your model is evaluated with the <code>F1 Score</code>, using <code>macro</code> averaging.
    To receive any points for this question, your model must reach an <code>F1 Score</code> of at least 0.40. If it does, the final score is calculated with the following formula:

$$\text{round}(f1score, 3) \times 100$$

</p>

<p style="text-align: justify; line-height:200%; font-size:medium">
    If your model does not reach the minimum threshold, the received score will be zero.
</p>
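
<p style="text-align: justify; line-height:200%; font-size:medium">
    The scoring rule above can be written as a small function. This is a minimal sketch: <code>final_score</code> is a hypothetical name, and it assumes the 0.40 threshold is inclusive and is checked before rounding, which the task text does not state explicitly.
</p>

```python
def final_score(f1_macro: float, threshold: float = 0.40) -> float:
    """Return round(f1, 3) * 100, or 0 when the threshold is not met.

    Assumption: a score exactly equal to the threshold passes, and the
    threshold is compared against the unrounded F1 value.
    """
    if f1_macro < threshold:
        return 0.0
    return round(f1_macro, 3) * 100


print(final_score(0.39))  # 0.0, below the minimum threshold
print(final_score(0.5))   # 50.0
```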


In [5]:
# evaluate your model
from sklearn.metrics import f1_score

val_pred = rf.predict(X_val)
val_f1_macro = f1_score(y_val, val_pred, average='macro')
print({'val_f1_macro': round(val_f1_macro, 4)})

{'val_f1_macro': 0.8135}
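
<p style="text-align: justify; line-height:200%; font-size:medium">
    Macro averaging computes one F1 score per class and takes their unweighted mean, so a rare activity counts as much as a common one. The sketch below spells this out in plain Python on made-up labels; <code>macro_f1</code> is a hypothetical helper meant to illustrate the idea, not a replacement for <code>sklearn.metrics.f1_score</code>.
</p>

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for cls in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); treat an empty class as 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (illustrative labels, not from the dataset).
toy_true = [1, 1, 6, 6, 7]
toy_pred = [1, 6, 6, 6, 7]
toy_score = macro_f1(toy_true, toy_pred)
print(round(toy_score, 4))  # 0.8222
```

<p style="text-align: justify; line-height:200%; font-size:medium">
    In the toy example the per-class scores are 2/3, 0.8, and 1.0, and their mean is about 0.8222.
</p>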


<h2 style="line-height:200%;color:#0099cc">
    Prediction for Test Data and Output
</h2>

<p style="text-align: left; line-height:200%; font-size:medium">
    Use your model to predict the samples in the test data and prepare the results in a table (<code>dataframe</code>) format as follows.
</p>

<div style="text-align: center;line-height:200%;font-size:medium">
|Column|Description|
|------|---|
|label|Predicted movement type|
</div>


<p style="text-align: left; line-height:200%; font-size:medium">
    The dataframe name must be <i>submission</i>; otherwise, the judging system will not be able to evaluate your attempt.
    <br>
    This dataframe contains only 1 column named <i>label</i> and has 249 rows.
    <br>
    For each row in the <i>test</i> dataframe, you must have one predicted value.
    <br>
    The table below shows the first 5 rows of the <code>submission</code> dataframe. However, in your answer, the values in the <i>label</i> column may differ.
</p>

<div style="text-align: center;line-height:200%;font-size:medium">
|label|
|-----|
|1|
|2|
|1|
|8|
|8|
</div>
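
<p style="text-align: justify; line-height:200%; font-size:medium">
    Before generating the output file, it can help to check the dataframe against the format described above. The sketch below is one possible check, assuming pandas is available; <code>check_submission</code> is a hypothetical helper, and the expected row count is left as a parameter because it depends on how the requirement above is read.
</p>

```python
import pandas as pd


def check_submission(df: pd.DataFrame, expected_rows: int) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if list(df.columns) != ["label"]:
        problems.append(f"expected a single 'label' column, got {list(df.columns)}")
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(df)}")
    if "label" in df.columns and df["label"].isna().any():
        problems.append("'label' contains missing values")
    return problems


# Toy dataframe mirroring the five example rows shown above.
toy_submission = pd.DataFrame({"label": [1, 2, 1, 8, 8]})
print(check_submission(toy_submission, expected_rows=5))    # []
print(check_submission(toy_submission, expected_rows=249))  # one row-count problem
```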


In [6]:
# predict test samples
# Build fixed-count windows for test to match NUM_TEST_WINDOWS
X_test, _ = build_windows_features(test_df, WINDOW_SIZE, NUM_TEST_WINDOWS, with_labels=False)
print('Test windows:', X_test.shape)

# Predict per-window labels
y_test_pred = rf.predict(X_test).astype(int)

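# Expand window predictions to per-row labels (one label per test row).
# Note: the task text above asks for 249 rows; if one label per window is
# required instead, y_test_pred already holds exactly those 249 values.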
num_rows = len(test_df)
row_labels = np.empty(num_rows, dtype=int)
for i in range(NUM_TEST_WINDOWS):
    start = i * WINDOW_SIZE
    end = start + WINDOW_SIZE
    row_labels[start:end] = int(y_test_pred[i])

# Handle remainder rows by predicting on the final partial window
remainder = num_rows - WINDOW_SIZE * NUM_TEST_WINDOWS
if remainder > 0:
    leftover_df = test_df.iloc[WINDOW_SIZE * NUM_TEST_WINDOWS:]
    leftover_feats = window_features(leftover_df)
    leftover_X = pd.DataFrame([leftover_feats]).reindex(columns=X_test.columns, fill_value=0.0)
    leftover_pred = int(rf.predict(leftover_X)[0])
    row_labels[-remainder:] = leftover_pred

submission = pd.DataFrame({'label': row_labels.astype(int)})
print('Submission rows:', len(submission))
print(submission.head())

Test windows: (249, 75)
Submission rows: 90229
   label
0      7
1      7
2      7
3      7
4      7


<h2 style="line-height:200%;color:#0099cc">
    <b>Output Generation Cell</b>
</h2>

<p style="text-align: justify; line-height:200%; font-size:medium">
    Run the cell below to create the <code>result.zip</code> file. Please note that you must save the changes made in the notebook (<code>ctrl+s</code>) before running the cell below, so that your code can be reviewed if support is needed.
</p>


In [7]:
import zipfile
import joblib

if not os.path.exists(os.path.join(os.getcwd(), 'movement_status.ipynb')):
    %notebook -e movement_status.ipynb


def compress(file_names):
    print("File Paths:")
    print(file_names)
    compression = zipfile.ZIP_DEFLATED
    with zipfile.ZipFile("result.zip", mode="w") as zf:
        for file_name in file_names:
            zf.write('./' + file_name, file_name, compress_type=compression)

submission.to_csv('submission.csv', index=False)
file_names = ['movement_status.ipynb', 'submission.csv']
compress(file_names)

File Paths:
['movement_status.ipynb', 'submission.csv']
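
<p style="text-align: justify; line-height:200%; font-size:medium">
    After the archive is written, its contents can be listed with the standard <code>zipfile</code> module. The sketch below is self-contained: it builds a throwaway archive in a temporary directory instead of touching the real <code>result.zip</code>, and <code>archive_contents</code> is a hypothetical helper name.
</p>

```python
import os
import tempfile
import zipfile


def archive_contents(zip_path: str) -> list:
    """Return the sorted member names of a zip archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(zf.namelist())


# Build a small demo archive with placeholder contents (illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("movement_status.ipynb", "{}")
        zf.writestr("submission.csv", "label\n1\n")
    demo_names = archive_contents(demo_zip)

print(demo_names)  # ['movement_status.ipynb', 'submission.csv']
```

<p style="text-align: justify; line-height:200%; font-size:medium">
    Calling <code>archive_contents("result.zip")</code> in the notebook's working directory should list the same two file names if the cell above ran successfully.
</p>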
