# Person - data preprocessing & cleaning

This Notebook generates the data at the daily/hourly level, removes erroneous data points and deals with the outliers in the dataset.

### TODO: apply null and outlier analysis for each feature instead of Labels !!!

In [1]:
# import ConfigImports Notebook to import and configure libs
%run ../ConfigImports.ipynb

TF -> Using GPU ->  PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')


In [2]:
df = pd.read_csv('../Datasets/UniqueObjectDetections__person__2019-09-09_2020-03-02.csv')
print(df.shape)
df.head(2)

(4790, 26)


Unnamed: 0,img_idx,label,confidence,x1,y1,x2,y2,date,time,filename,img_n_boxes,time_ms,date_time,week_day,is_weekend,month,hour,min,dummy_var,time_diff,sec_diff,x_center,y_center,prev_x_center,prev_y_center,euc_distance
0,73740,person,0.450496,459,24,478,38,2019-09-09,07.03.03,07.03.03.965_4d9909b4_person-car-car-car.jpg,2,965,2019-09-09 07:03:03.965,Monday,False,9,7,3,1,0 days 00:42:40.471000000,2560.471,468.5,31.0,490.0,310.0,279.827179
1,73135,person,0.658724,286,238,381,340,2019-09-09,07.29.50,07.29.50.440_4e0ee29d_person-car-car.jpg,1,440,2019-09-09 07:29:50.440,Monday,False,9,7,29,1,0 days 00:26:46.475000000,1606.475,333.5,289.0,468.5,31.0,291.185508


### Pre-process data

This analysis will be performed at the daily / hourly level.

I have tried several other time intervals (from 15 minutes to 3 hours). Counts per 15-minute interval are too noisy, while 3-hour intervals reduce the dataset size dramatically. Based on that, hourly analysis (and later forecasting) seems like a good trade-off.

To roll up the data to the daily / hourly level we can use Pandas. This is straightforward.

The only issue I have identified with this approach is that a plain aggregation only includes date / hour combinations that have observations. To analyse the data, gaps without any observations need to be filled with 0's. This is done with `resample` in the cells below.
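To make the gap problem concrete, here is a minimal sketch on hypothetical toy data (the `toy` frame and its timestamps are made up for illustration and are not part of the dataset). It compares a plain `groupby` on the hour with `resample`. The frequency string `'60min'` is used instead of `'H'` only because the `'H'` alias is deprecated in recent pandas versions.

```python
import pandas as pd

# Hypothetical toy data: three detections, none between 08:00 and 09:59
toy = pd.DataFrame({
    'date_time': pd.to_datetime([
        '2019-09-09 07:03:03',
        '2019-09-09 07:29:50',
        '2019-09-09 10:15:00',
    ]),
    'dummy_var': 1,
})

# groupby only returns hours that contain at least one observation
grouped = toy.groupby(toy['date_time'].dt.floor('60min'))['dummy_var'].sum()

# resample builds a continuous hourly index; the sum over an empty bin is 0
resampled_toy = toy.set_index('date_time').resample('60min')['dummy_var'].sum()

print(len(grouped))        # number of hours that have observations
print(len(resampled_toy))  # number of hours between first and last observation
```

Note that `resample` only fills gaps between the first and the last timestamp in the data. Hours before the first or after the last observation are not added.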

In [3]:
# make sure Pandas understands date time fields
df['date_time'] = pd.to_datetime(df['date_time'])
df['date'] = pd.to_datetime(df['date'])

In [4]:
# use Pandas' handy resample feature to fill in gaps with 0's
# ('H' = hourly bins; recent pandas versions prefer the lowercase alias 'h')
resampled = df.set_index('date_time').resample('H')['dummy_var'].sum().reset_index()
resampled.columns = ['date_time', 'obs_count']
resampled['date'] = resampled['date_time'].dt.date.astype(str)
resampled['hour'] = resampled['date_time'].dt.hour
resampled = resampled[['date', 'hour', 'obs_count']]
resampled.head(2)

Unnamed: 0,date,hour,obs_count
0,2019-09-09,7,2
1,2019-09-09,8,3


In [5]:
# remove any entries where we know that there was an error in measurements
orig_size = resampled.shape[0]
idx = resampled['date'].isin(['2020-01-13', '2020-01-14', '2020-02-28'])
resampled = resampled.loc[~idx]
print(f'Removed {orig_size - resampled.shape[0]} records')

Removed 72 records
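The 72 removed records are consistent with three excluded dates of 24 hourly rows each (3 × 24 = 72), since all three dates lie inside the resampled range. A minimal sketch of that sanity check follows, on a hypothetical hourly frame (`toy`, `bad_dates` and the date range are assumptions for illustration, not the notebook's data):

```python
import pandas as pd

# Hypothetical hourly frame covering 2020-01-12 to 2020-01-15 in full
hours = pd.date_range('2020-01-12 00:00', '2020-01-15 23:00', freq='60min')
toy = pd.DataFrame({
    'date': hours.strftime('%Y-%m-%d'),
    'hour': hours.hour,
    'obs_count': 0,
})

bad_dates = ['2020-01-13', '2020-01-14']
mask = toy['date'].isin(bad_dates)

# Each fully covered date should contribute exactly 24 hourly rows
rows_per_bad_date = toy.loc[mask].groupby('date').size()
cleaned = toy.loc[~mask]

print(rows_per_bad_date)
print(f'Removed {len(toy) - len(cleaned)} records')
```

If a date in the list were misspelled or fell outside the data range, it would be missing from `rows_per_bad_date`, which makes this a cheap guard against silently removing fewer rows than intended.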


In [33]:
# save data to csv, so we can conduct feature engineering in another Notebook
use_cols = ['date', 'hour', 'obs_count', 'obs_count_corr']
resampled[use_cols].to_csv('../Datasets/Person_no_outliers__2019-09-09_2020-03-02.csv', index=False)