# WISDM: data preparation for "normal" vs. "rushing"

This notebook prepares a **clean, usable dataset** from the WISDM folder structure:
- reads `activity_key.txt` (mapping code → activity)
- checks the folder structure (against `listing.txt`)
- loads **RAW** (time series) *or* **ARFF** (already aggregated examples)
- keeps only the activities **Walking (A)** and **Jogging (B)**
- creates a binary label: `label = 0 (normal/walking)` and `label = 1 (rushing/jogging)`
- saves the result as `parquet/csv` for later steps (segmentation, feature extraction, training)



## 0) Settings

According to the dataset documentation, each line of a raw file has the form
`Subject-id, Activity Code, Timestamp, x, y, z;`, the data is sampled at 20 Hz, and activities are coded with the letters A–S.
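
As a quick check of the line format described above, a single raw line can be parsed by hand. This is only a minimal sketch: the helper name `parse_raw_line` is hypothetical, and the sample line is built from the first row printed later in this notebook, with the trailing `;` restored as the format requires.

```python
def parse_raw_line(line: str):
    """Parse one WISDM raw line of the form 'subject,code,timestamp,x,y,z;'."""
    # Drop surrounding whitespace and the trailing ';' before splitting on commas.
    fields = line.strip().rstrip(";").split(",")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    subject_id, code, timestamp, x, y, z = (f.strip() for f in fields)
    return int(subject_id), code, int(timestamp), float(x), float(y), float(z)


row = parse_raw_line("1600,A,252207666810782,-0.364761,8.793503,1.055084;")
print(row)
```

The notebook itself reads the files with `pd.read_csv`; the hand-written parser is only meant to make the format explicit.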


In [None]:
from pathlib import Path
import fastparquet
import pyarrow

DATA_DIR = Path(r"/Users/pikakriznar/Documents/1_letnik_MAG/UPK/Projekti/Razpoznava_hitenja_projekt/data/wisdm+smartphone+and+smartwatch+activity+and+biometrics+dataset/wisdm-dataset")

assert DATA_DIR.exists(), f"DATA_DIR does not exist: {DATA_DIR}"
print("DATA_DIR =", DATA_DIR.resolve())


DATA_DIR = /Users/pikakriznar/Documents/1_letnik_MAG/UPK/Projekti/Razpoznava_hitenja_projekt/data/wisdm+smartphone+and+smartwatch+activity+and+biometrics+dataset/wisdm-dataset


## 1) Read `activity_key.txt` (codes → activity)
We use this mapping to quickly filter only `A` (walking) and `B` (jogging).


In [2]:
def load_activity_key(path: Path):
    mapping = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or "=" not in line:
                continue
            name, code = [x.strip() for x in line.split("=", 1)]
            mapping[code] = name  # e.g. "A" -> "walking"
    return mapping

activity_key_path = DATA_DIR / "activity_key.txt"
assert activity_key_path.exists(), f"Missing: {activity_key_path}"

code_to_name = load_activity_key(activity_key_path)
code_to_name


{'A': 'walking',
 'B': 'jogging',
 'C': 'stairs',
 'D': 'sitting',
 'E': 'standing',
 'F': 'typing',
 'G': 'teeth',
 'H': 'soup',
 'I': 'chips',
 'J': 'pasta',
 'K': 'drinking',
 'L': 'sandwich',
 'M': 'kicking',
 'O': 'catch',
 'P': 'dribbling',
 'Q': 'writing',
 'R': 'clapping',
 'S': 'folding'}

We define the two binary classes:
- **normal** = Walking (`A`)
- **rushing** = Jogging (`B`)


In [3]:
NORMAL_CODE = "A"   # walking
RUSH_CODE   = "B"   # jogging

assert NORMAL_CODE in code_to_name and RUSH_CODE in code_to_name, "Codes A/B not found in activity_key.txt"

label_map = {NORMAL_CODE: 0, RUSH_CODE: 1}
name_map  = {NORMAL_CODE: "walking", RUSH_CODE: "jogging"}  # for nicer output

print("Normal:",  NORMAL_CODE, "->", code_to_name[NORMAL_CODE], "label=0")
print("Rushing:", RUSH_CODE,   "->", code_to_name[RUSH_CODE],   "label=1")


Normal: A -> walking label=0
Rushing: B -> jogging label=1


## 2) Check the folder structure


In [4]:
RAW_PHONE_ACCEL_DIR = DATA_DIR / "raw" / "phone" / "accel"
ARFF_PHONE_ACCEL_DIR = DATA_DIR / "arff_files" / "phone" / "accel"

print("RAW_PHONE_ACCEL_DIR:", RAW_PHONE_ACCEL_DIR)
print("ARFF_PHONE_ACCEL_DIR:", ARFF_PHONE_ACCEL_DIR)

assert RAW_PHONE_ACCEL_DIR.exists(), f"Missing: {RAW_PHONE_ACCEL_DIR}"
print("Number of RAW phone/accel files:", len(list(RAW_PHONE_ACCEL_DIR.glob("data_*_accel_phone.txt"))))

# ARFF is optional, but it is usually present:
print("ARFF phone/accel exists?", ARFF_PHONE_ACCEL_DIR.exists())
if ARFF_PHONE_ACCEL_DIR.exists():
    print("Number of ARFF phone/accel files:", len(list(ARFF_PHONE_ACCEL_DIR.glob("data_*_accel_phone.arff"))))


RAW_PHONE_ACCEL_DIR: /Users/pikakriznar/Documents/1_letnik_MAG/UPK/Projekti/Razpoznava_hitenja_projekt/data/wisdm+smartphone+and+smartwatch+activity+and+biometrics+dataset/wisdm-dataset/raw/phone/accel
ARFF_PHONE_ACCEL_DIR: /Users/pikakriznar/Documents/1_letnik_MAG/UPK/Projekti/Razpoznava_hitenja_projekt/data/wisdm+smartphone+and+smartwatch+activity+and+biometrics+dataset/wisdm-dataset/arff_files/phone/accel
Number of RAW phone/accel files: 51
ARFF phone/accel exists? True
Number of ARFF phone/accel files: 50


## 3) Reading the RAW files (time series)
RAW format (each line):
`subject_id, activity_code, timestamp, x, y, z;`

⚠️ Note: `z` ends with a trailing `;`, so it has to be cleaned.

Because the dataset is large, we read it **in chunks** and keep only A/B.


In [5]:
import pandas as pd

RAW_COLS = ["subject_id", "activity_code", "timestamp", "x", "y", "z"]

def read_raw_file_in_chunks(path: Path, chunksize: int = 200_000, usecols=None):
    # The raw files have no header row, so the column names are passed explicitly.
    # The z column ends with ';' -> strip it with a converter
    conv = {"z": lambda s: float(str(s).rstrip(";"))}
    for chunk in pd.read_csv(
        path,
        header=None,
        names=RAW_COLS,
        sep=",",
        engine="python",
        chunksize=chunksize,
        converters=conv,
    ):
        yield chunk

# Quick test on one file:
sample_file = sorted(RAW_PHONE_ACCEL_DIR.glob("data_*_accel_phone.txt"))[0]
print("Sample file:", sample_file.name)

chunk0 = next(read_raw_file_in_chunks(sample_file, chunksize=50_000))
chunk0.head(), chunk0.dtypes


Sample file: data_1600_accel_phone.txt


(   subject_id activity_code        timestamp         x          y         z
 0        1600             A  252207666810782 -0.364761   8.793503  1.055084
 1        1600             A  252207717164786 -0.879730   9.768784  1.016998
 2        1600             A  252207767518790  2.001495  11.109070  2.619156
 3        1600             A  252207817872794  0.450623  12.651642  0.184555
 4        1600             A  252207868226798 -2.164352  13.928436 -4.422485,
 subject_id         int64
 activity_code     object
 timestamp          int64
 x                float64
 y                float64
 z                float64
 dtype: object)
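
The 20 Hz sampling rate stated in the documentation can be cross-checked against the timestamps printed above. The sketch below uses the five timestamps from that output and assumes they are in nanoseconds; that unit is an inference from the size of the differences, not something confirmed in this notebook.

```python
from statistics import median

# First five timestamps from the sample output above (subject 1600, activity A).
timestamps = [
    252207666810782,
    252207717164786,
    252207767518790,
    252207817872794,
    252207868226798,
]

# Assumption: timestamps are expressed in nanoseconds.
deltas_ns = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
period_s = median(deltas_ns) / 1e9
rate_hz = 1.0 / period_s

print(f"median period = {period_s * 1000:.2f} ms, rate = {rate_hz:.2f} Hz")
# prints: median period = 50.35 ms, rate = 19.86 Hz
```

Under that assumption the spacing is close to the nominal 20 Hz, which is worth keeping in mind when choosing window lengths in samples.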

### 3.1) Keep only walking/jogging and add the binary label
This produces the first real dataset for your project: the time series of the selected sensor.


In [6]:
def filter_and_label_raw(df: pd.DataFrame) -> pd.DataFrame:
    df = df[df["activity_code"].isin([NORMAL_CODE, RUSH_CODE])].copy()
    df["activity_name"] = df["activity_code"].map(name_map)
    df["label"] = df["activity_code"].map(label_map).astype("int8")
    return df

filtered = filter_and_label_raw(chunk0)
filtered["activity_code"].value_counts(), filtered.head()


(activity_code
 A    3574
 B    3572
 Name: count, dtype: int64,
    subject_id activity_code        timestamp         x          y         z  \
 0        1600             A  252207666810782 -0.364761   8.793503  1.055084   
 1        1600             A  252207717164786 -0.879730   9.768784  1.016998   
 2        1600             A  252207767518790  2.001495  11.109070  2.619156   
 3        1600             A  252207817872794  0.450623  12.651642  0.184555   
 4        1600             A  252207868226798 -2.164352  13.928436 -4.422485   
 
   activity_name  label  
 0       walking      0  
 1       walking      0  
 2       walking      0  
 3       walking      0  
 4       walking      0  )

## 4) Build the full RAW dataset (phone accel) for A/B
This is the result you will use in the next notebook for **segmentation into windows**.

Options:
- `MAX_FILES = None` → reads all subjects (the most complete, but slower)
- `MAX_FILES = 5` or `10` → quick test of the pipeline

Saving:
- Parquet (recommended) is faster and smaller
- CSV is more portable
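
One way to combine the two saving options is to try Parquet first and fall back to CSV when no Parquet engine is installed. The following is a minimal sketch under that assumption; `save_frame` is a hypothetical helper, not part of this notebook, and it relies on pandas raising `ImportError` when neither `pyarrow` nor `fastparquet` is available.

```python
from pathlib import Path
import tempfile

import pandas as pd


def save_frame(df: pd.DataFrame, stem: Path) -> Path:
    """Write df as Parquet if an engine is installed, otherwise as CSV."""
    try:
        out = stem.with_suffix(".parquet")
        df.to_parquet(out, index=False)
    except ImportError:
        # No usable Parquet engine: fall back to the more portable CSV.
        out = stem.with_suffix(".csv")
        df.to_csv(out, index=False)
    return out


demo = pd.DataFrame({"x": [0.1, 0.2], "label": [0, 1]})
with tempfile.TemporaryDirectory() as tmp:
    saved = save_frame(demo, Path(tmp) / "demo")
    existed = saved.exists()
    print(saved.suffix, existed)
```

The returned path tells the caller which format was actually written, so the next notebook can pick the matching reader.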


In [7]:
from tqdm.auto import tqdm

OUT_DIR = DATA_DIR / "prepared"
OUT_DIR.mkdir(exist_ok=True)

MAX_FILES = 5  # TODO: set to None to read all files once the pipeline works
CHUNKSIZE = 250_000

raw_files = sorted(RAW_PHONE_ACCEL_DIR.glob("data_*_accel_phone.txt"))
if MAX_FILES is not None:
    raw_files = raw_files[:MAX_FILES]

print("Files to read:", len(raw_files))

dfs = []
for fp in tqdm(raw_files, desc="Reading RAW files"):
    for ch in read_raw_file_in_chunks(fp, chunksize=CHUNKSIZE):
        ch = filter_and_label_raw(ch)
        if len(ch):
            dfs.append(ch)

if dfs:
    # sort once after concatenation so the order also holds across chunk boundaries
    raw_ab = pd.concat(dfs, ignore_index=True).sort_values(["subject_id", "timestamp"], ignore_index=True)
else:
    raw_ab = pd.DataFrame(columns=RAW_COLS + ["activity_name", "label"])
raw_ab.info()
raw_ab.head()


Files to read: 5


Reading RAW files: 100%|██████████| 5/5 [00:38<00:00,  7.78s/it]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39482 entries, 0 to 39481
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   subject_id     39482 non-null  int64  
 1   activity_code  39482 non-null  object 
 2   timestamp      39482 non-null  int64  
 3   x              39482 non-null  float64
 4   y              39482 non-null  float64
 5   z              39482 non-null  float64
 6   activity_name  39482 non-null  object 
 7   label          39482 non-null  int8   
dtypes: float64(3), int64(2), int8(1), object(2)
memory usage: 2.1+ MB





| | subject_id | activity_code | timestamp | x | y | z | activity_name | label |
|---|---|---|---|---|---|---|---|---|
| 0 | 1600 | B | 251987619821922 | 1.375549 | 15.375046 | 2.971619 | jogging | 1 |
| 1 | 1600 | B | 251987670175926 | -3.934433 | 17.538788 | 2.110016 | jogging | 1 |
| 2 | 1600 | B | 251987720529930 | -0.087738 | 12.791565 | -1.454102 | jogging | 1 |
| 3 | 1600 | B | 251987770883934 | 2.038742 | 3.077148 | -1.053726 | jogging | 1 |
| 4 | 1600 | B | 251987821237937 | -2.558472 | -2.738678 | -2.098511 | jogging | 1 |


### 4.1) Basic sanity checks (does the data look OK?)
- are both classes present?
- how many samples per subject?
- basic summary of the accelerations


In [8]:
print("Classes (label):")
print(raw_ab["label"].value_counts(dropna=False))

print("\nActivities (code):")
print(raw_ab["activity_code"].value_counts(dropna=False))

print("\nSamples per subject (top 10):")
print(raw_ab["subject_id"].value_counts().head(10))

raw_ab[["x","y","z"]].describe()


Classes (label):
label
1    19741
0    19741
Name: count, dtype: int64

Activities (code):
activity_code
B    19741
A    19741
Name: count, dtype: int64

Samples per subject (top 10):
subject_id
1601    9024
1603    9022
1600    7146
1604    7146
1602    7144
Name: count, dtype: int64


| | x | y | z |
|---|---|---|---|
| count | 39482.0 | 39482.0 | 39482.0 |
| mean | 0.746838 | 5.574242 | 0.187358 |
| std | 4.644276 | 9.863546 | 5.118136 |
| min | -19.724915 | -19.386353 | -19.753006 |
| 25% | -1.76573 | -0.06878 | -2.485531 |
| 50% | 0.596755 | 7.726868 | -0.195984 |
| 75% | 3.048417 | 12.646173 | 2.314316 |
| max | 19.612701 | 19.613052 | 19.612701 |
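
The overall class counts can hide a subject that is missing one of the two activities. A per-subject breakdown makes that visible. The sketch below runs on a small made-up frame (the values are illustrative, not taken from the dataset); on the real data the same two lines would be applied to `raw_ab`.

```python
import pandas as pd

# Toy data: subject 1602 has no jogging samples (label 1).
toy = pd.DataFrame({
    "subject_id": [1600, 1600, 1600, 1601, 1601, 1602],
    "label":      [0,    0,    1,    0,    1,    0],
})

# Rows are subjects, columns are labels, cells are sample counts.
counts = pd.crosstab(toy["subject_id"], toy["label"])
# Subjects with a zero count in any class.
missing = counts[(counts == 0).any(axis=1)]

print(counts)
print("subjects missing a class:", list(missing.index))
```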


### 4.2) Save the prepared RAW dataset
This file is the input for the next steps (segmentation into 5 s windows, features, model).


In [9]:
out_parquet = OUT_DIR / "raw_phone_accel_walk_jog.parquet"
out_csv     = OUT_DIR / "raw_phone_accel_walk_jog.csv"

# Parquet (fast to read and write)
# raw_ab.to_parquet(out_parquet, index=False)
# print("Saved:", out_parquet)

# CSV (optional, can take longer)
raw_ab.to_csv(out_csv, index=False)
print("Saved:", out_csv)


Saved: /Users/pikakriznar/Documents/1_letnik_MAG/UPK/Projekti/Razpoznava_hitenja_projekt/data/wisdm+smartphone+and+smartwatch+activity+and+biometrics+dataset/wisdm-dataset/prepared/raw_phone_accel_walk_jog.csv
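
The 5 s windows mentioned above correspond to 100 samples at the nominal 20 Hz. A minimal sketch of non-overlapping windowing follows. `split_into_windows` is a hypothetical helper, the toy frame is made up, and the sketch assumes each subject/label group is one continuous recording; gaps in time are not handled.

```python
import pandas as pd

FS_HZ = 20                    # sampling rate stated in the dataset documentation
WINDOW_S = 5                  # window length in seconds
WINDOW_N = FS_HZ * WINDOW_S   # 100 samples per window


def split_into_windows(df: pd.DataFrame, window_n: int = WINDOW_N):
    """Return equal-length, non-overlapping windows that never mix subjects or labels."""
    windows = []
    for _, group in df.groupby(["subject_id", "label"], sort=False):
        group = group.sort_values("timestamp")
        n_full = len(group) // window_n  # the incomplete tail is dropped
        for i in range(n_full):
            windows.append(group.iloc[i * window_n:(i + 1) * window_n])
    return windows


# Toy frame: one subject, one label, 250 samples -> two full windows.
toy = pd.DataFrame({
    "subject_id": [1600] * 250,
    "label": [0] * 250,
    "timestamp": range(250),
    "x": 0.0, "y": 0.0, "z": 0.0,
})
windows = split_into_windows(toy)
print(len(windows), len(windows[0]))  # prints: 2 100
```

Grouping by subject and label first keeps a window from straddling the boundary between walking and jogging.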


## 5) Reading ARFF (already aggregated 10 s windows)
The ARFF files contain precomputed features for 10-second windows (200 samples at 20 Hz).


In [10]:
def parse_arff_file(path: Path) -> pd.DataFrame:
    """Minimal ARFF reader: skips the header up to '@data' and reads the CSV-like rows."""
    data_lines = []
    in_data = False
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("%"):
                continue
            if not in_data:
                if line.lower() == "@data":
                    in_data = True
                continue
            data_lines.append(line)

    # The ARFF data section consists of CSV rows without a trailing ';'
    # The first column is ACTIVITY (A-S), the last is class (subject id)
    # There are many feature columns, so we read them without fixed names.
    df = pd.DataFrame([row.split(",") for row in data_lines])
    return df

# Example: read one ARFF file
if ARFF_PHONE_ACCEL_DIR.exists():
    arff_file = sorted(ARFF_PHONE_ACCEL_DIR.glob("data_*_accel_phone.arff"))[0]
    df_arff = parse_arff_file(arff_file)
    print("ARFF shape:", df_arff.shape)
    display(df_arff.head())  # a bare expression inside `if` is not rendered by the notebook
else:
    print("ARFF folder does not exist - skipping.")


ARFF shape: (321, 93)
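
The minimal reader above leaves the columns numbered 0 to 92. If named columns are preferred, the names can be recovered from the `@attribute` lines of the ARFF header. The sketch below is one possible way to do that: `read_arff_attribute_names` is a hypothetical helper, and the sample header is made up for illustration rather than copied from a WISDM file.

```python
import re

# Matches '@attribute <name> ...' where the name is double-quoted, single-quoted, or bare.
ATTR_RE = re.compile(r"""^@attribute\s+(?:"([^"]+)"|'([^']+)'|(\S+))""", re.IGNORECASE)


def read_arff_attribute_names(lines):
    """Collect attribute names from ARFF header lines, stopping at '@data'."""
    names = []
    for line in lines:
        line = line.strip()
        if line.lower() == "@data":
            break
        match = ATTR_RE.match(line)
        if match:
            # Exactly one of the three groups is set for a matching line.
            names.append(next(g for g in match.groups() if g is not None))
    return names


# Illustrative header, not taken from the dataset.
sample_header = [
    "@relation demo",
    '@attribute "ACTIVITY" { A, B }',
    '@attribute "X0" numeric',
    "@attribute class numeric",
    "@data",
    "A,0.235,1600",
]
names = read_arff_attribute_names(sample_header)
print(names)  # prints: ['ACTIVITY', 'X0', 'class']
```

If the number of names equals the number of columns, they could be assigned with `df.columns = names` before filtering.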


### 5.1) Convert ARFF into the 'A/B' dataset
ARFF has `ACTIVITY` in the first column and `subject_id` in the last.
The columns in between are the (numeric) features.


In [11]:
if ARFF_PHONE_ACCEL_DIR.exists():
    # Read a few ARFF files for the demo (same as for RAW)
    MAX_ARFF_FILES = 3  # TODO: set to None for all files once this works
    arff_files = sorted(ARFF_PHONE_ACCEL_DIR.glob("data_*_accel_phone.arff"))
    if MAX_ARFF_FILES is not None:
        arff_files = arff_files[:MAX_ARFF_FILES]

    arff_frames = []
    for fp in tqdm(arff_files, desc="Reading ARFF files"):
        df = parse_arff_file(fp)
        # 0: ACTIVITY, -1: subject_id
        df = df.rename(columns={0:"activity_code", df.columns[-1]:"subject_id"})
        df = df[df["activity_code"].isin([NORMAL_CODE, RUSH_CODE])].copy()
        if len(df) == 0:
            continue
        df["label"] = df["activity_code"].map(label_map).astype("int8")
        # convert the remaining columns (all except activity_code, subject_id, label) to float
        feature_cols = [c for c in df.columns if c not in ["activity_code","subject_id","label"]]
        df[feature_cols] = df[feature_cols].apply(pd.to_numeric, errors="coerce")
        df["subject_id"] = pd.to_numeric(df["subject_id"], errors="coerce").astype("Int64")
        arff_frames.append(df)

    arff_ab = pd.concat(arff_frames, ignore_index=True) if arff_frames else pd.DataFrame()
    print("ARFF A/B shape:", arff_ab.shape)
    display(arff_ab.head())

    out_arff_parquet = OUT_DIR / "arff_phone_accel_walk_jog.parquet"
    arff_ab.to_parquet(out_arff_parquet, index=False)
    print("Saved:", out_arff_parquet)
else:
    print("ARFF folder does not exist - skipping.")


Reading ARFF files: 100%|██████████| 3/3 [00:27<00:00,  9.06s/it]

ARFF A/B shape: (115, 94)





| | activity_code | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | subject_id | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A | 0.235 | 0.47 | 0.275 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.479859 | -0.550668 | 0.049864 | 0.121354 | -0.251024 | 0.164468 | -0.110722 | 10.0518 | 1600 | 0 |
| 1 | A | 0.275 | 0.44 | 0.27 | 0.015 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.473409 | -0.633171 | 0.072129 | 0.161492 | -0.386416 | 0.21568 | -0.034375 | 10.1171 | 1600 | 0 |
| 2 | A | 0.32 | 0.43 | 0.245 | 0.0 | 0.005 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.476798 | -0.659493 | 0.087043 | 0.162157 | -0.325151 | 0.27238 | -0.077274 | 9.98384 | 1600 | 0 |
| 3 | A | 0.315 | 0.495 | 0.185 | 0.005 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.474534 | -0.712081 | 0.00381 | 0.210015 | -0.364285 | 0.203131 | 0.015328 | 10.106 | 1600 | 0 |
| 4 | A | 0.215 | 0.455 | 0.325 | 0.005 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.462811 | -0.534933 | 0.047553 | 0.275833 | -0.216423 | 0.2385 | -0.00987 | 10.0521 | 1600 | 0 |


Saved: /Users/pikakriznar/Documents/1_letnik_MAG/UPK/Projekti/Razpoznava_hitenja_projekt/data/wisdm+smartphone+and+smartwatch+activity+and+biometrics+dataset/wisdm-dataset/prepared/arff_phone_accel_walk_jog.parquet
