# Uncertain temporal database: preprocessing

Here we compute the time column, delete some unused columns, and convert the data into a preferred format so that many scripts can use it, especially for measuring time via 'main_pfpmeasure.ipynb'.

In [None]:
from pytorch_tabnet.tab_model import TabNetClassifier, TabNetRegressor
import numpy as np
from tqdm import tqdm
import pandas as pd
import glob
import os
import matplotlib.pyplot as plt

## Read the data

In [None]:
database_path = r"../data/UTDATABASE/utd_20221222_0226/"
database_df = pd.read_csv(database_path + "label.csv")
database_df.fillna(0, inplace=True)
# database_df.describe()

## Handle time

My assumption is that the hour can contribute to traffic jams.

In [None]:
database_df['Datetime'] = pd.to_datetime(database_df['Datetime'], errors='coerce')
first_datetime = database_df['Datetime'].min()
database_df['Time'] = np.round((database_df['Datetime'] - first_datetime).dt.total_seconds() / 3600)
database_df.sort_values(by=['Time'], ignore_index=True, inplace=True)
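
As a quick sanity check of the hour-offset calculation above, here is a minimal sketch on made-up timestamps. The `toy_time_df` frame and its values are hypothetical and are not taken from `label.csv`. It also shows that an unparseable entry becomes `NaT` under `errors='coerce'`, which then gives a `NaN` hour offset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real values come from label.csv
toy_time_df = pd.DataFrame({'Datetime': ['2022-12-22 02:26:00',
                                         '2022-12-22 03:40:00',
                                         'not a date',
                                         '2022-12-23 02:26:00']})

toy_time_df['Datetime'] = pd.to_datetime(toy_time_df['Datetime'], errors='coerce')
toy_first = toy_time_df['Datetime'].min()  # min() skips NaT
# Hours elapsed since the first timestamp, rounded to the nearest hour
toy_time_df['Time'] = np.round((toy_time_df['Datetime'] - toy_first).dt.total_seconds() / 3600)
print(toy_time_df['Time'].tolist())
```

Rows with a `NaN` offset are worth inspecting before relying on the sorted order of 'Time'.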

Note: there is a time gap of 120 hours in the data, which I suspect is caused by data loss. It forces every feature to have a period of at least 120 hours, which I don't think is a good 'bug'. So I plan to remove that gap by dropping the rows recorded before it. Even though this discards some data, I hope the remaining data still shows good patterns. (There are still plenty of days left to work with.)

In [None]:
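# Debug: locate the largest gap between consecutive 'Time' values
time_val_counts = database_df['Time'].value_counts()
debug_time_df = database_df[['Time']].copy()  # copy avoids SettingWithCopyWarning
debug_time_df['Diff'] = debug_time_df['Time'].diff()
# idxmax() points at the first row after the largest gap
gap_idx = debug_time_df['Diff'].idxmax()
delete_time_start = debug_time_df.loc[gap_idx, 'Time']
max_time_gap = debug_time_df.loc[gap_idx, 'Diff']
print(delete_time_start, max_time_gap)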

In [None]:
# Danger! Drop every row recorded before the largest time gap
database_df = database_df[database_df['Time'] >= delete_time_start].copy()
database_df.drop(columns=['Datetime'], inplace=True)
database_df.reset_index(drop=True, inplace=True)
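
To make the effect of the two cells above concrete, here is a small sketch on a hypothetical series of hourly offsets with a single 120-hour jump. All names prefixed with `toy_` are made up for illustration.

```python
import pandas as pd

# Hypothetical hourly offsets with a jump from 3 to 123 (a 120-hour gap)
toy_time = pd.Series([0.0, 1.0, 2.0, 3.0, 123.0, 124.0, 125.0], name='Time')

toy_diff = toy_time.diff()        # difference to the previous row; first entry is NaN
toy_gap_idx = toy_diff.idxmax()   # index of the first row after the largest gap
toy_gap_start = toy_time[toy_gap_idx]
toy_gap_size = toy_diff[toy_gap_idx]

# Keep only the rows recorded from the end of the gap onward
toy_kept = toy_time[toy_time >= toy_gap_start].reset_index(drop=True)
print(toy_gap_start, toy_gap_size, toy_kept.tolist())
```

Note that the kept values are not rebased, so the first remaining 'Time' value equals the offset at the end of the gap rather than 0.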

## Delete unused columns

We delete some columns to achieve faster running time, and hopefully, better results.

In [None]:
deleted_columns = ['SensorCode', 'Datetime']
# errors='ignore' skips columns that are already gone (e.g. 'Datetime')
database_df.drop(columns=deleted_columns, errors='ignore', inplace=True)

### Delete time related labels

We think these labels won't contribute much and only make the database sparser.

In [None]:
database_df.drop( [ col for col in database_df if ('HourTriple' in col) or ('WeekDay' in col) ], axis=1, inplace=True )
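
The substring-based filter above can be tried on a toy frame first. This is only a sketch: the column names below are hypothetical stand-ins, since the real ones come from `label.csv`.

```python
import pandas as pd

# Hypothetical column names for illustration only
toy_labels_df = pd.DataFrame({'Time': [0.0, 1.0],
                              'HourTriple_0': [1, 0],
                              'WeekDay_3': [0, 1],
                              'Feature_A': [5, 7]})

# Collect every column whose name contains one of the time-label substrings
time_label_cols = [col for col in toy_labels_df.columns
                   if 'HourTriple' in col or 'WeekDay' in col]
toy_labels_df = toy_labels_df.drop(columns=time_label_cols)
print(list(toy_labels_df.columns))
```

Because the match is on substrings, any unrelated column whose name happens to contain 'HourTriple' or 'WeekDay' would be dropped as well.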

## Write the preprocessed database

The current format is that the 'Time' column is the first column.
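
The cells in this notebook do not reorder columns explicitly, so whether 'Time' ends up first depends on the layout of `label.csv`. If the order needs to be enforced before writing, one possible sketch is shown below on a hypothetical toy frame. It is an assumption about how this could be done, not a step the notebook currently performs.

```python
import pandas as pd

# Hypothetical frame where 'Time' was appended as the last column
toy_out_df = pd.DataFrame({'Feature_A': [5, 7],
                           'Feature_B': [1, 2],
                           'Time': [0.0, 1.0]})

# Put 'Time' first and keep the remaining columns in their original order
ordered_cols = ['Time'] + [col for col in toy_out_df.columns if col != 'Time']
toy_out_df = toy_out_df[ordered_cols]
print(list(toy_out_df.columns))
```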

In [None]:
# Save into the same folder that contains the input file
database_df.to_csv(database_path + "database.csv", index=False)