# 📊 Part 1: Data Preprocessing & Splitting

## Objectives
1. Load and clean the data
2. Feature engineering
3. Split the data into 80% train and 20% test
4. Export the processed data for training

In [12]:
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.model_selection import train_test_split
import pickle
import json

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


## 1. Load Raw Data

In [13]:
# Load mock data
DATA_PATH = '../data/mock_bookings_long_distance_1000.csv'
df_raw = pd.read_csv(DATA_PATH)

print(f"📊 Raw dataset shape: {df_raw.shape}")
print(f"📋 Columns: {df_raw.columns.tolist()}")
print(f"\n🔍 First few rows:")
df_raw.head()

📊 Raw dataset shape: (1000, 14)
📋 Columns: ['client_id', 'created_at', 'departureDate', 'departureTime', 'destination', 'destination_coordinates_lat', 'destination_coordinates_lng', 'pickup', 'pickup_coordinates_lat', 'pickup_coordinates_lng', 'returnDate', 'state', 'timestamp', 'trip_id']

🔍 First few rows:


| | client_id | created_at | departureDate | departureTime | destination | destination_coordinates_lat | destination_coordinates_lng | pickup | pickup_coordinates_lat | pickup_coordinates_lng | returnDate | state | timestamp | trip_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | jXNpXoiqNwMJ7aqzPGDqXGtEeoAw | 2025-11-04T18:38:03.717831+07:00 | 04/11/2025 | 14:45 | Van Don Ferry Terminal | 20.653118 | 107.018416 | Hanoi - My Dinh Bus Station | 21.039535 | 106.036022 | | Completed | 1762256283717 | |
| 1 | VH6pRA9NppXvWx9rHtXbsdqfpauG | 2025-12-27T14:26:19.200261+07:00 | 27/12/2025 | 22:45 | Hanoi - Long Bien Station | 20.931912 | 105.791412 | Cam Pha Bus Station | 21.159575 | 107.474859 | | Completed | 1766820379200 | |
| 2 | bHw9s6YHSHRdQsPTPfuAHCyjgYMb | 2025-12-13T00:45:35.978092+07:00 | 13/12/2025 | 07:30 | Hanoi - Luong Yen Station | 21.227358 | 105.715977 | Mong Cai Bus Station | 21.062437 | 106.519003 | | Completed | 1765561535978 | |
| 3 | ppG9UhBtbg3XWSk8WDdbQ3XGbhc1 | 2025-10-16T16:10:17.126534+07:00 | 16/10/2025 | 13:00 | Ha Long - Tuan Chau Port | 20.97706 | 106.493432 | Hanoi - My Dinh Bus Station | 20.90423 | 105.679686 | | Completed | 1760605817126 | |
| 4 | fyfwdwDN9U7JNaWwncvaCMdumReX | 2025-10-11T10:22:54.072155+07:00 | 11/10/2025 | 17:45 | Hanoi - Long Bien Station | 20.704991 | 105.599908 | Cam Pha Bus Station | 20.918885 | 106.570949 | | Completed | 1760152974072 | |


In [14]:
# Check data info
print("📋 DATA INFO")
print("=" * 60)
df_raw.info()

print("\n❓ MISSING VALUES")
print(df_raw.isnull().sum())

📋 DATA INFO
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 14 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   client_id                    1000 non-null   object 
 1   created_at                   1000 non-null   object 
 2   departureDate                1000 non-null   object 
 3   departureTime                1000 non-null   object 
 4   destination                  1000 non-null   object 
 5   destination_coordinates_lat  1000 non-null   float64
 6   destination_coordinates_lng  1000 non-null   float64
 7   pickup                       1000 non-null   object 
 8   pickup_coordinates_lat       1000 non-null   float64
 9   pickup_coordinates_lng       1000 non-null   float64
 10  returnDate                   0 non-null      float64
 11  state                        1000 non-null   object 
 12  timestamp                    1000 non-null   int64  
 13  trip

## 2. Data Cleaning Function

In [15]:
def clean_and_prepare_data(df):
    """
    Clean and prepare data for clustering
    
    Returns:
    --------
    df_clean : DataFrame
        Cleaned dataframe with all features
    """
    print("🔧 STARTING DATA CLEANING")
    print("=" * 60)
    
    df_clean = df.copy()
    initial_count = len(df_clean)
    
    # 1. Check and remove missing coordinates
    print("\n1️⃣ Checking for missing coordinates...")
    missing_coords = df_clean[['pickup_coordinates_lat', 'pickup_coordinates_lng']].isnull().any(axis=1).sum()
    print(f"   Missing coordinates: {missing_coords}")
    
    df_clean = df_clean.dropna(subset=['pickup_coordinates_lat', 'pickup_coordinates_lng'])
    print(f"   Rows after removal: {len(df_clean)}")
    
    # 2. Parse datetime
    print("\n2️⃣ Parsing datetime...")
    df_clean['departure_datetime'] = pd.to_datetime(
        df_clean['departureDate'] + ' ' + df_clean['departureTime'],
        format='%d/%m/%Y %H:%M'
    )
    
    # 3. Extract time features
    print("\n3️⃣ Extracting time features...")
    df_clean['departure_hour'] = df_clean['departure_datetime'].dt.hour
    df_clean['departure_minute'] = df_clean['departure_datetime'].dt.minute
    df_clean['departure_time_minutes'] = df_clean['departure_hour'] * 60 + df_clean['departure_minute']
    df_clean['departure_day'] = df_clean['departure_datetime'].dt.day
    df_clean['departure_month'] = df_clean['departure_datetime'].dt.month
    df_clean['departure_dayofweek'] = df_clean['departure_datetime'].dt.dayofweek
    
    # 4. Validate coordinate ranges
    print("\n4️⃣ Validating coordinates...")
    lat_range = (20.5, 21.5)
    lng_range = (105.4, 107.6)
    
    valid_coords = (
        (df_clean['pickup_coordinates_lat'] >= lat_range[0]) & 
        (df_clean['pickup_coordinates_lat'] <= lat_range[1]) &
        (df_clean['pickup_coordinates_lng'] >= lng_range[0]) & 
        (df_clean['pickup_coordinates_lng'] <= lng_range[1])
    )
    
    invalid_count = (~valid_coords).sum()
    print(f"   Invalid coordinates: {invalid_count}")
    
    if invalid_count > 0:
        df_clean = df_clean[valid_coords]
        print(f"   Rows after removal: {len(df_clean)}")
    
    # 5. Identify direction
    print("\n5️⃣ Identifying travel direction...")
    df_clean['direction'] = df_clean['pickup'].apply(
        lambda x: 'Hanoi_to_QuangNinh' if 'Hanoi' in x else 'QuangNinh_to_Hanoi'
    )
    
    direction_counts = df_clean['direction'].value_counts()
    print(f"   {direction_counts.to_dict()}")
    
    # 6. Summary
    print("\n✅ CLEANING SUMMARY")
    print("=" * 60)
    print(f"   Initial rows: {initial_count}")
    print(f"   Final rows: {len(df_clean)}")
    print(f"   Removed: {initial_count - len(df_clean)} ({(initial_count - len(df_clean))/initial_count*100:.2f}%)")
    
    return df_clean

# Apply cleaning
df_cleaned = clean_and_prepare_data(df_raw)

🔧 STARTING DATA CLEANING

1️⃣ Checking for missing coordinates...
   Missing coordinates: 0
   Rows after removal: 1000

2️⃣ Parsing datetime...

3️⃣ Extracting time features...

4️⃣ Validating coordinates...
   Invalid coordinates: 0

5️⃣ Identifying travel direction...
   {'QuangNinh_to_Hanoi': 529, 'Hanoi_to_QuangNinh': 471}

✅ CLEANING SUMMARY
   Initial rows: 1000
   Final rows: 1000
   Removed: 0 (0.00%)


In [16]:
# View cleaned data
print("📊 Cleaned data sample:")
df_cleaned[['client_id', 'pickup', 'destination', 'departureDate', 'departureTime', 
            'pickup_coordinates_lat', 'pickup_coordinates_lng', 
            'departure_time_minutes', 'direction']].head(10)

📊 Cleaned data sample:


| | client_id | pickup | destination | departureDate | departureTime | pickup_coordinates_lat | pickup_coordinates_lng | departure_time_minutes | direction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | jXNpXoiqNwMJ7aqzPGDqXGtEeoAw | Hanoi - My Dinh Bus Station | Van Don Ferry Terminal | 04/11/2025 | 14:45 | 21.039535 | 106.036022 | 885 | Hanoi_to_QuangNinh |
| 1 | VH6pRA9NppXvWx9rHtXbsdqfpauG | Cam Pha Bus Station | Hanoi - Long Bien Station | 27/12/2025 | 22:45 | 21.159575 | 107.474859 | 1365 | QuangNinh_to_Hanoi |
| 2 | bHw9s6YHSHRdQsPTPfuAHCyjgYMb | Mong Cai Bus Station | Hanoi - Luong Yen Station | 13/12/2025 | 07:30 | 21.062437 | 106.519003 | 450 | QuangNinh_to_Hanoi |
| 3 | ppG9UhBtbg3XWSk8WDdbQ3XGbhc1 | Hanoi - My Dinh Bus Station | Ha Long - Tuan Chau Port | 16/10/2025 | 13:00 | 20.90423 | 105.679686 | 780 | Hanoi_to_QuangNinh |
| 4 | fyfwdwDN9U7JNaWwncvaCMdumReX | Cam Pha Bus Station | Hanoi - Long Bien Station | 11/10/2025 | 17:45 | 20.918885 | 106.570949 | 1065 | QuangNinh_to_Hanoi |
| 5 | VXySnu4Jmtyr2897SvWJJLsXBrSQ | Hanoi - My Dinh Bus Station | Van Don Ferry Terminal | 03/01/2026 | 21:45 | 20.773014 | 105.688985 | 1305 | Hanoi_to_QuangNinh |
| 6 | Ky86bwZ1Qm4TX6NZYfrfm2g81uoV | Uong Bi Bus Station | Hanoi - Giap Bat Station | 26/11/2025 | 17:30 | 21.05104 | 106.455187 | 1050 | QuangNinh_to_Hanoi |
| 7 | TyfTGXUhgtNPrNfUztzQdtau3NA3 | Uong Bi Bus Station | Hanoi - Long Bien Station | 01/01/2026 | 12:00 | 20.963712 | 107.04262 | 720 | QuangNinh_to_Hanoi |
| 8 | MQx0QVqcf4fCT5x73UvFrCT470fX | Uong Bi Bus Station | Hanoi - Long Bien Station | 01/12/2025 | 21:45 | 20.655262 | 106.423693 | 1305 | QuangNinh_to_Hanoi |
| 9 | EUZqzUcC38ndZ4rkUXyz9CBfrTwj | Van Don Ferry Terminal | Hanoi - Yen Nghia Station | 23/11/2025 | 10:15 | 20.834541 | 107.020613 | 615 | QuangNinh_to_Hanoi |
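
The `departure_time_minutes` values in the sample above come from parsing the `DD/MM/YYYY` date and `HH:MM` time strings and then computing `hour * 60 + minute`. The following is a minimal sketch of that step on two of the rows shown; the `sample` frame is an illustrative stand-in, not the real dataset.

```python
import pandas as pd

# Two (departureDate, departureTime) pairs copied from the sample rows above.
sample = pd.DataFrame({
    "departureDate": ["04/11/2025", "13/12/2025"],
    "departureTime": ["14:45", "07:30"],
})

# Same format string as in clean_and_prepare_data: day/month/year, 24-hour time.
departure_datetime = pd.to_datetime(
    sample["departureDate"] + " " + sample["departureTime"],
    format="%d/%m/%Y %H:%M",
)

# Minutes since midnight: 14:45 -> 14 * 60 + 45, 07:30 -> 7 * 60 + 30.
sample["departure_time_minutes"] = (
    departure_datetime.dt.hour * 60 + departure_datetime.dt.minute
)
print(sample)
```

The two results, 885 and 450, match rows 0 and 2 of the cleaned data sample.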


## 3. Filter Data by Direction

To keep things simple, we will work only with the **Hanoi → Quang Ninh** direction.

In [17]:
# Filter for Hanoi to Quang Ninh only
df_filtered = df_cleaned[df_cleaned['direction'] == 'Hanoi_to_QuangNinh'].copy()

print(f"📊 Filtered dataset (Hanoi → Quang Ninh):")
print(f"   Total samples: {len(df_filtered)}")
print(f"   Unique clients: {df_filtered['client_id'].nunique()}")
# Note: departureDate holds DD/MM/YYYY strings, so min()/max() compare lexicographically, not chronologically
print(f"   Date range: {df_filtered['departureDate'].min()} to {df_filtered['departureDate'].max()}")

📊 Filtered dataset (Hanoi → Quang Ninh):
   Total samples: 471
   Unique clients: 184
   Date range: 01/01/2026 to 31/12/2025
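
The range printed above runs "backwards" because `departureDate` is a `DD/MM/YYYY` string, so `min()` and `max()` compare character by character rather than by calendar date. A small sketch of the difference, using a handful of illustrative dates (the `dates` series is an assumption for demonstration, not the filtered dataset):

```python
import pandas as pd

# Illustrative DD/MM/YYYY strings; not the actual filtered bookings.
dates = pd.Series(["04/11/2025", "01/01/2026", "31/12/2025", "16/10/2025"])

# String comparison: "01/..." sorts first and "31/..." sorts last.
lexi_min, lexi_max = dates.min(), dates.max()
print(f"Lexicographic: {lexi_min} to {lexi_max}")

# Parse first, then take min/max to get the chronological range.
parsed = pd.to_datetime(dates, format="%d/%m/%Y")
chrono_min = parsed.min().strftime("%d/%m/%Y")
chrono_max = parsed.max().strftime("%d/%m/%Y")
print(f"Chronological: {chrono_min} to {chrono_max}")
```

In this notebook, the already-parsed `departure_datetime` column could be used the same way to report a chronological range.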


## 4. Prepare Features for Clustering

In [18]:
# Select features for clustering
feature_columns = [
    'pickup_coordinates_lat',
    'pickup_coordinates_lng',
    'departure_time_minutes'
]

# Extract features
X = df_filtered[feature_columns].values

print(f"📊 Feature matrix shape: {X.shape}")
print(f"\n📋 Feature statistics:")
print(pd.DataFrame(X, columns=feature_columns).describe())

📊 Feature matrix shape: (471, 3)

📋 Feature statistics:
       pickup_coordinates_lat  pickup_coordinates_lng  departure_time_minutes
count              471.000000              471.000000              471.000000
mean                20.995621              105.798206              784.777070
std                  0.166741                0.167631              339.841332
min                 20.700235              105.500165              240.000000
25%                 20.855175              105.656898              510.000000
50%                 20.984548              105.795151              735.000000
75%                 21.134418              105.945549             1050.000000
max                 21.296355              106.099600             1425.000000


## 5. Split Data: 80% Train, 20% Test

For clustering, we will use:
- **Train set**: to find the optimal number of clusters and train the model
- **Test set**: to evaluate whether the model generalizes well
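
One possible way to use the two sets is sketched below: fit a scaler and a clusterer on the train split only, then assign the held-out points to the learned clusters and compare a quality score across both splits. This is only an illustration under stated assumptions: KMeans, `StandardScaler`, the silhouette score, the choice of two clusters, and the synthetic `X_demo` matrix are all hypothetical here, and the actual model is chosen in the training notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a (lat, lng, minutes) feature matrix; not the real bookings.
rng = np.random.default_rng(42)
centers = np.array([[21.0, 105.6, 480.0], [20.8, 105.9, 1020.0]])
X_demo = np.vstack([
    center + rng.normal(scale=[0.03, 0.03, 30.0], size=(100, 3))
    for center in centers
])

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

# Fit the scaler and the clusterer on the train split only.
scaler = StandardScaler().fit(X_tr)
model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(scaler.transform(X_tr))

# Assign held-out points to the learned centroids, then score both splits.
train_score = silhouette_score(scaler.transform(X_tr), model.labels_)
test_labels = model.predict(scaler.transform(X_te))
test_score = silhouette_score(scaler.transform(X_te), test_labels)
print(f"silhouette: train={train_score:.3f}, test={test_score:.3f}")
```

A test score close to the train score would suggest the cluster structure carries over to unseen bookings; a much lower test score would hint at overfitting to the train split.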

In [19]:
# Set random seed for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# Split indices to keep track of which samples are in train/test
indices = np.arange(len(df_filtered))

# Plain random split (no stratification): shuffle=True randomizes the order,
# and random_state makes the split reproducible
train_idx, test_idx = train_test_split(
    indices,
    test_size=0.2,
    random_state=RANDOM_SEED,
    shuffle=True
)

# Split features
X_train = X[train_idx]
X_test = X[test_idx]

# Split dataframes
df_train = df_filtered.iloc[train_idx].copy()
df_test = df_filtered.iloc[test_idx].copy()

print("📊 DATA SPLIT SUMMARY")
print("=" * 60)
print(f"Total samples: {len(df_filtered)}")
print(f"\n✅ Train set:")
print(f"   Samples: {len(X_train)} ({len(X_train)/len(X)*100:.1f}%)")
print(f"   Shape: {X_train.shape}")
print(f"\n✅ Test set:")
print(f"   Samples: {len(X_test)} ({len(X_test)/len(X)*100:.1f}%)")
print(f"   Shape: {X_test.shape}")

📊 DATA SPLIT SUMMARY
Total samples: 471

✅ Train set:
   Samples: 376 (79.8%)
   Shape: (376, 3)

✅ Test set:
   Samples: 95 (20.2%)
   Shape: (95, 3)
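
The split is 79.8% / 20.2% rather than exactly 80 / 20 because 471 is not divisible by 5: `train_test_split` rounds the test size up and gives the remainder to the train set. A short sketch of that arithmetic, checked against the library on a plain index array:

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples, test_size = 471, 0.2

# A float test_size is rounded up to a whole number of samples.
n_test = math.ceil(n_samples * test_size)  # 94.2 rounds up to 95
n_train = n_samples - n_test               # the remaining 376 go to train

train_idx, test_idx = train_test_split(
    np.arange(n_samples), test_size=test_size, random_state=42, shuffle=True
)
print(n_train, n_test, len(train_idx), len(test_idx))
```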


In [20]:
# Verify distribution similarity between train and test
print("\n📊 FEATURE DISTRIBUTION COMPARISON")
print("=" * 60)

for i, col in enumerate(feature_columns):
    train_mean = X_train[:, i].mean()
    test_mean = X_test[:, i].mean()
    train_std = X_train[:, i].std()
    test_std = X_test[:, i].std()
    
    print(f"\n{col}:")
    print(f"   Train: mean={train_mean:.4f}, std={train_std:.4f}")
    print(f"   Test:  mean={test_mean:.4f}, std={test_std:.4f}")
    print(f"   Difference: {abs(train_mean - test_mean):.4f}")


📊 FEATURE DISTRIBUTION COMPARISON

pickup_coordinates_lat:
   Train: mean=20.9930, std=0.1661
   Test:  mean=21.0060, std=0.1681
   Difference: 0.0130

pickup_coordinates_lng:
   Train: mean=105.7992, std=0.1691
   Test:  mean=105.7943, std=0.1606
   Difference: 0.0049

departure_time_minutes:
   Train: mean=783.1117, std=334.9476
   Test:  mean=791.3684, std=356.7798
   Difference: 8.2567
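
The raw differences above are hard to compare because latitude and minutes live on very different scales. One option, shown here as a sketch rather than part of the original pipeline, is to divide each mean difference by a pooled standard deviation. The figures are copied from the output above; the pooled-std formula and the `reported` dictionary are assumptions introduced for illustration.

```python
import math

# (train_mean, train_std, test_mean, test_std) copied from the comparison output.
reported = {
    "pickup_coordinates_lat": (20.9930, 0.1661, 21.0060, 0.1681),
    "pickup_coordinates_lng": (105.7992, 0.1691, 105.7943, 0.1606),
    "departure_time_minutes": (783.1117, 334.9476, 791.3684, 356.7798),
}

standardized_diff = {}
for name, (train_mean, train_std, test_mean, test_std) in reported.items():
    # Pooled std: square root of the average of the two variances.
    pooled_std = math.sqrt((train_std**2 + test_std**2) / 2)
    standardized_diff[name] = abs(train_mean - test_mean) / pooled_std
    print(f"{name}: {standardized_diff[name]:.3f}")
```

With these figures all three ratios come out below 0.1 standard deviations, which is consistent with the train and test sets being drawn from the same distribution.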


## 6. Export Processed Data

In [21]:
# Create processed data directory
import os
PROCESSED_DIR = '../data/processed'
os.makedirs(PROCESSED_DIR, exist_ok=True)

# Export train data
df_train.to_csv(f'{PROCESSED_DIR}/train_data.csv', index=False)
np.save(f'{PROCESSED_DIR}/X_train.npy', X_train)

# Export test data
df_test.to_csv(f'{PROCESSED_DIR}/test_data.csv', index=False)
np.save(f'{PROCESSED_DIR}/X_test.npy', X_test)

# Export full cleaned data for reference
df_filtered.to_csv(f'{PROCESSED_DIR}/full_cleaned_data.csv', index=False)
np.save(f'{PROCESSED_DIR}/X_full.npy', X)

# Export metadata
metadata = {
    'total_samples': len(df_filtered),
    'train_samples': len(X_train),
    'test_samples': len(X_test),
    'train_ratio': 0.8,
    'test_ratio': 0.2,
    'random_seed': RANDOM_SEED,
    'feature_columns': feature_columns,
    'direction': 'Hanoi_to_QuangNinh',
    'processed_date': datetime.now().isoformat(),
    # departureDate is a DD/MM/YYYY string, so these are lexicographic (not chronological) extremes
    'date_range': {
        'min': df_filtered['departureDate'].min(),
        'max': df_filtered['departureDate'].max()
    }
}

with open(f'{PROCESSED_DIR}/metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)

print("✅ DATA EXPORTED SUCCESSFULLY")
print("=" * 60)
print(f"Directory: {PROCESSED_DIR}/")
print(f"\nFiles created:")
print(f"   - train_data.csv        ({len(df_train)} rows)")
print(f"   - X_train.npy           ({X_train.shape})")
print(f"   - test_data.csv         ({len(df_test)} rows)")
print(f"   - X_test.npy            ({X_test.shape})")
print(f"   - full_cleaned_data.csv ({len(df_filtered)} rows)")
print(f"   - X_full.npy            ({X.shape})")
print(f"   - metadata.json")

✅ DATA EXPORTED SUCCESSFULLY
Directory: ../data/processed/

Files created:
   - train_data.csv        (376 rows)
   - X_train.npy           ((376, 3))
   - test_data.csv         (95 rows)
   - X_test.npy            ((95, 3))
   - full_cleaned_data.csv (471 rows)
   - X_full.npy            ((471, 3))
   - metadata.json


In [22]:
# Display metadata
print("\n📋 METADATA:")
print(json.dumps(metadata, indent=2))


📋 METADATA:
{
  "total_samples": 471,
  "train_samples": 376,
  "test_samples": 95,
  "train_ratio": 0.8,
  "test_ratio": 0.2,
  "random_seed": 42,
  "feature_columns": [
    "pickup_coordinates_lat",
    "pickup_coordinates_lng",
    "departure_time_minutes"
  ],
  "direction": "Hanoi_to_QuangNinh",
  "processed_date": "2026-01-08T21:25:34.415187",
  "date_range": {
    "min": "01/01/2026",
    "max": "31/12/2025"
  }
}
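
A downstream notebook could read these artifacts back with `np.load` and `json.load`. The sketch below round-trips a tiny array and metadata file through a temporary directory so it runs on its own; the toy values and the temporary path are assumptions, and in practice the directory would be `../data/processed`.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    processed_dir = Path(tmp)  # stand-in for ../data/processed

    # Write a toy feature matrix and metadata file in the same formats as above.
    X_train_demo = np.arange(12, dtype=float).reshape(4, 3)
    np.save(processed_dir / "X_train.npy", X_train_demo)
    meta_demo = {
        "train_samples": 4,
        "feature_columns": [
            "pickup_coordinates_lat",
            "pickup_coordinates_lng",
            "departure_time_minutes",
        ],
    }
    with open(processed_dir / "metadata.json", "w") as f:
        json.dump(meta_demo, f, indent=2)

    # Read them back the way a training notebook might.
    X_loaded = np.load(processed_dir / "X_train.npy")
    with open(processed_dir / "metadata.json") as f:
        meta_loaded = json.load(f)

# Sanity check: the matrix width should match the recorded feature list.
print(X_loaded.shape, meta_loaded["feature_columns"])
```

Checking the loaded shape against `feature_columns` and the recorded sample counts is a cheap way to catch a stale or mismatched export.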


## Summary

✅ **Completed preprocessing steps:**
1. Loaded raw data (1000 samples)
2. Cleaned data (checked for missing and out-of-range pickup coordinates; no rows were removed)
3. Extracted features (lat, lng, time)
4. Filtered by direction (Hanoi → Quang Ninh), leaving 471 samples
5. Split data 80/20 train/test (376 train, 95 test)
6. Exported processed data

**Next step:** Open `02_model_training.ipynb` to train clustering model