# Data Cleansing

In this stage, I implemented an automated data cleansing pipeline to preprocess the raw 5G CSV files. The script finds every raw `.csv` file with `glob`, then removes duplicate rows, rows with invalid GPS coordinates (latitude outside -38 to -10 or longitude outside 110 to 155), rows with negative speeds, and rows missing critical network metrics (e.g., `Bitrate`, `Transfer size`). It also coerces the required columns to numeric types and drops any row that fails conversion. Each file is cleaned on its own, and the cleaned frames are merged into a single master file (`cleansed_data.csv`) that is exported to my Google Drive for further analysis such as clustering or forecasting.
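
As a quick illustration of the filtering rules described above, here is a minimal sketch on a toy frame. The row values are made up for the example, and only three of the required columns are shown; the bounds are the same ones the pipeline uses.

```python
import pandas as pd

# Toy rows (made-up values) stored as strings, as they might arrive from a messy CSV.
raw = pd.DataFrame({
    "latitude":  ["-37.82", "0.0", "-37.81", "-37.82"],
    "longitude": ["144.95", "144.95", "bad", "144.95"],
    "speed":     ["12.5", "3.0", "4.0", "-1"],
})

# Coerce every column to numeric; unparseable strings become NaN.
numeric = raw.apply(pd.to_numeric, errors="coerce")

# Keep rows inside the bounding box with a non-negative speed.
# Comparisons against NaN evaluate to False, so those rows are excluded too.
valid = (
    numeric["latitude"].between(-38, -10)
    & numeric["longitude"].between(110, 155)
    & (numeric["speed"] >= 0)
)
clean = numeric[valid].dropna()
print(clean)
```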

In [None]:
import pandas as pd
import numpy as np
import os
from glob import glob  # file pattern matching (neat :D)

# Config: Google Drive paths
RAW_DATA_PATH = "/content/drive/MyDrive/COS40007 Design Project/data/"
CLEANED_OUTPUT_PATH = "/content/drive/MyDrive/COS40007 Design Project/Data Processing/Alvin Version/Processed Data/"
os.makedirs(CLEANED_OUTPUT_PATH, exist_ok=True)

# Load a single CSV
def load_csv(filepath):
    try:
        return pd.read_csv(filepath)
    except Exception as e:
        print(f"[!] Failed to load {filepath}: {e}")
        return None

# Clean a DataFrame
def clean_dataframe(df):
    required_cols = ['latitude', 'longitude', 'speed', 'Bitrate', 'Bitrate-RX',
                     'Transfer size', 'Transfer size-RX', 'CWnd', 'Retransmissions', 'send_data']

    # Skip if any required columns are missing
    missing = [col for col in required_cols if col not in df.columns]
    if missing:
        print(f"[!] Skipped file — missing columns: {missing}")
        return None

    # Drop duplicates
    df = df.drop_duplicates()

    # Convert to numeric first so the range checks below compare numbers, not strings
    for col in required_cols:
        df[col] = pd.to_numeric(df[col], errors='coerce')

    # Drop rows with missing or unparseable critical fields
    df = df.dropna(subset=required_cols)

    # Filter invalid GPS and speed
    df = df[
        df['latitude'].between(-38, -10) &
        df['longitude'].between(110, 155) &
        (df['speed'] >= 0)
    ]

    return df

# Process and merge all CSVs
def process_all_files():
    files = sorted(glob(os.path.join(RAW_DATA_PATH, "*.csv")))  # glob order is arbitrary, so sort for a reproducible merge
    print(f"📦 Found {len(files)} raw CSV files")

    combined_df = []

    for file in files:
        filename = os.path.basename(file)
        df = load_csv(file)
        if df is None:
            continue
        df_clean = clean_dataframe(df)
        if df_clean is not None and not df_clean.empty:
            combined_df.append(df_clean)
            print(f"[✓] Cleaned: {filename} → {df_clean.shape[0]} rows")

    if combined_df:
        final_df = pd.concat(combined_df, ignore_index=True)
        final_out = os.path.join(CLEANED_OUTPUT_PATH, "cleansed_data.csv")
        final_df.to_csv(final_out, index=False)
        print(f"\n✅ All files cleaned and merged into: {final_out}")
    else:
        print("\n⚠️ No valid data found to merge.")

# Run the script
if __name__ == "__main__":
    process_all_files()




---


# Data Engineering

In this step, I developed a data engineering pipeline that transforms our cleansed 5G network logs into a model-ready dataset named `cluster_ready.csv`. The engineered dataset is intended for both unsupervised clustering and supervised forecasting, such as training RandomForest, LightGBM, XGBoost, and LSTM models.

The code begins by loading the previously cleaned dataset (`cleansed_data.csv`) and reconstructing a full `timestamp` column from the available time parts (`Year`, `Month`, `Date`, `hour`, `min`, `sec`). Rows whose parts do not form a valid date are dropped. The timestamp lets us extract the time-based patterns that the models rely on.
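
To make the timestamp reconstruction concrete, here is a minimal sketch on two made-up rows. It assumes the same part names as the dataset and renames them to the keys that `pd.to_datetime` expects when assembling dates from columns; the second row is deliberately invalid (30 February) to show what `errors="coerce"` does.

```python
import pandas as pd

# Made-up time parts; the second row is not a real calendar date.
parts = pd.DataFrame({
    "Year": [2023, 2023], "Month": [5, 2], "Date": [14, 30],
    "hour": [9, 10], "min": [30, 0], "sec": [15, 0],
})

# pd.to_datetime assembles a datetime from columns named
# year, month, day, hour, minute, second.
renamed = parts.rename(columns={
    "Year": "year", "Month": "month", "Date": "day",
    "min": "minute", "sec": "second",
})
ts = pd.to_datetime(
    renamed[["year", "month", "day", "hour", "minute", "second"]],
    errors="coerce",  # invalid combinations become NaT instead of raising
)
print(ts)
```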

Next, I applied feature engineering in several categories:

* Time-based features: Extracted `hour_of_day`, `day_of_week`, and `is_weekend` to capture cyclic and behavioral patterns over time.

* Location binning: Rounded `latitude` and `longitude` to three decimal places as `lat_bin` and `lon_bin` to support spatial grouping while avoiding over-granular GPS noise.

* Signal strength aggregation: Computed `svr_avg` by averaging `svr1` to `svr4` if available, representing overall network signal quality.

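* Rolling statistics: Created 5-sample moving averages (columns suffixed `_rolling_avg`) for key metrics like `Bitrate`, `CWnd`, and `send_data` to smooth out short-term spikes and capture trends. The data is sorted by `timestamp` first so each window covers consecutive samples.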

* Delta features: Calculated first-order differences for `Bitrate` and `speed` to highlight sharp changes and transitions in usage or movement.
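
The feature categories above can be sketched on a tiny frame to show what each one produces. The values below are made up, and only `latitude` and `Bitrate` are included to keep the example short.

```python
import pandas as pd

toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-05-13 08:00:00",  # a Saturday
        "2023-05-13 08:00:01",
        "2023-05-15 17:30:00",  # a Monday
    ]),
    "latitude": [-37.82214, -37.82249, -37.81001],
    "Bitrate": [10.0, 20.0, 60.0],
}).sort_values("timestamp")

# Time-based features (dayofweek: Monday=0 ... Sunday=6).
toy["hour_of_day"] = toy["timestamp"].dt.hour
toy["day_of_week"] = toy["timestamp"].dt.dayofweek
toy["is_weekend"] = toy["day_of_week"].isin([5, 6]).astype(int)

# Location binning: nearby points collapse into the same bin.
toy["lat_bin"] = toy["latitude"].round(3)

# Rolling mean over up to 5 samples; min_periods=1 avoids leading NaNs.
toy["Bitrate_rolling_avg"] = toy["Bitrate"].rolling(window=5, min_periods=1).mean()

# First-order difference; the first row has no predecessor, so it is filled with 0.
toy["delta_bitrate"] = toy["Bitrate"].diff().fillna(0)

print(toy)
```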

Finally, any row that still contains a `NaN` in any column is dropped so the models receive complete records, and the result is saved as `cluster_ready.csv`.

In [7]:
import pandas as pd
import numpy as np
import os

# Paths (Google Drive)
INPUT_PATH = "/content/drive/MyDrive/COS40007 Design Project/Data Processing/Alvin Version/Processed Data/cleansed_data.csv"
OUTPUT_PATH = "/content/drive/MyDrive/COS40007 Design Project/Data Processing/Alvin Version/Processed Data/cluster_ready.csv"

# Load cleaned data
df = pd.read_csv(INPUT_PATH)

# Parse timestamp
df['timestamp'] = pd.to_datetime(dict(
    year=df['Year'],
    month=df['Month'],
    day=df['Date'],
    hour=df['hour'],
    minute=df['min'],
    second=df['sec']
), errors='coerce')
df = df.dropna(subset=['timestamp'])

# Time features
df['hour_of_day'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

# Location binning
df['lat_bin'] = df['latitude'].round(3)
df['lon_bin'] = df['longitude'].round(3)

# Average signal strength
if all(col in df.columns for col in ['svr1', 'svr2', 'svr3', 'svr4']):
    df['svr_avg'] = df[['svr1', 'svr2', 'svr3', 'svr4']].mean(axis=1)

# Rolling average (5-sample smoothing)
rolling_cols = ['Bitrate', 'Bitrate-RX', 'Transfer size', 'Transfer size-RX', 'CWnd', 'send_data']
df = df.sort_values('timestamp')
for col in rolling_cols:
    df[f'{col}_rolling_avg'] = df[col].rolling(window=5, min_periods=1).mean()

# Change over time (first-order difference)
df['delta_bitrate'] = df['Bitrate'].diff().fillna(0)
df['delta_speed'] = df['speed'].diff().fillna(0)

# Final cleanup: drop rows with any remaining NaNs
df = df.dropna()

# Export final engineered dataset
df.to_csv(OUTPUT_PATH, index=False)
print("✅ cluster_ready.csv saved successfully.")


✅ cluster_ready.csv saved successfully.


### 💡🤡**More to do? -> OPTIMISATION** 👁️👄👁️

While the current setup already covers the core features, further optimisations can be explored:

* **Normalization & Scaling:**
    Especially important for distance-based clustering (KMeans, DBSCAN) or LSTM input prep.

* **Correlation Reduction:**
    Dropping redundant features using a correlation matrix can improve model interpretability and efficiency.

* **Dimensionality Reduction:**
    Applying PCA before clustering can reduce noise and computation, while t-SNE is mainly useful for visualizing the clusters in 2D.

* **Categorical Encoding:**
    Convert non-numeric identifiers (e.g., `square_id`) to one-hot or label-encoded formats if needed.

* **Lag Features for LSTM:**
    Creating lag-based features (e.g., `Bitrate_t-1`, `Bitrate_t-2`) would greatly help sequence-based models.
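
As a rough starting point, three of these ideas (lag features, scaling, and correlation checks) can be sketched with plain pandas. The toy values and the 0.95 cut-off are assumptions for illustration only, not tuned choices, and in a real pipeline the scaling statistics should come from the training split alone.

```python
import pandas as pd

# Made-up samples, already in time order.
toy = pd.DataFrame({
    "Bitrate": [10.0, 20.0, 30.0, 40.0, 50.0],
    "speed": [5.0, 3.0, 8.0, 1.0, 6.0],
})

# Lag features: the value of Bitrate one and two samples earlier.
for lag in (1, 2):
    toy[f"Bitrate_t-{lag}"] = toy["Bitrate"].shift(lag)

# The first rows have no history, so drop them.
lagged = toy.dropna().reset_index(drop=True)

# Z-score scaling: zero mean and unit (population) standard deviation per column.
features = lagged[["Bitrate", "speed"]]
scaled = (features - features.mean()) / features.std(ddof=0)

# Correlation check: list column pairs whose absolute correlation exceeds the cut-off.
threshold = 0.95  # hypothetical cut-off
corr = lagged.corr().abs()
redundant = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]

print(lagged)
print(scaled)
print(redundant)
```

Because the toy `Bitrate` series is perfectly linear, its lags show up as redundant here; on real data the lags usually carry extra information and the check is more useful for near-duplicate metrics.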