# 03 - Preliminary data cleaning
____

Step 3! We have our pristine dataset split nicely into our training (and validation) and testing data. 

It is important to remember that any steps or transformations we apply to our training data must also be applied to the test data, otherwise the two datasets will end up with mismatched columns. If we're going to drop columns, let's build up a list called columns_to_drop so we can apply this nice and easily.


In [69]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 

pd.options.mode.chained_assignment = None 

In [70]:
# Values that should be read in as NaN (disguised missing data)
na_values = [-1, '-1', 'Data missing or out of range', 'Unknown']
train = pd.read_csv('training_data.csv', low_memory=False, na_values=na_values)
test = pd.read_csv('testing_data.csv', low_memory=False, na_values=na_values)


## Let's have a look at our data types.

In [71]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 532806 entries, 0 to 532805
Data columns (total 57 columns):
 #   Column                                       Non-Null Count   Dtype  
---  ------                                       --------------   -----  
 0   accident_index                               532806 non-null  object 
 1   accident_year_x                              532806 non-null  int64  
 2   accident_reference_x                         532806 non-null  object 
 3   location_easting_osgr                        532712 non-null  float64
 4   location_northing_osgr                       532712 non-null  float64
 5   longitude                                    532712 non-null  float64
 6   latitude                                     532712 non-null  float64
 7   police_force                                 532806 non-null  object 
 8   accident_severity                            532806 non-null  object 
 9   number_of_vehicles                           532806 non-nul

Firstly, we have a lot of features. Feature selection will be very important to make sure that our model doesn't overfit. There are also a lot of superfluous features that can easily be dropped at our feature selection stage:
  * The geographical location columns 
  * The duplicate accident_year and accident_reference columns
  * Whether or not a police officer attended *after* the collision occurred can't have affected its seriousness
  * The vehicle and casualty reference columns aren't useful
  * accident_severity should be dropped as it is highly correlated with casualty_severity, and would just be a proxy for the model to overfit to
  * The same goes for the enhanced severity columns

It also looks like we have a tonne of missing data, so first let's make sure that we drop features that have so much missing they won't really assist in our predictions.


### Let's drop columns where we're missing a lot of data.

We have a lot of NaN values, but also some more disguised missing data. Some entries are labelled with -1 (both text and numeric present), 'Data missing or out of range', and 'Unknown'. Columns missing more than 5% of their values are going to limit our predictive power, so let's remove them from the off. We've listed these alternative markers as na_values in our read_csv() calls above, so they will be counted by the usual NaN based methods.

We will evaluate the missing data in the training data set, and apply any column dropping to both datasets. No peeking at the testing data!
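
As a rough illustration of the two ideas above, here is a minimal sketch on a made-up four-row CSV. The toy values, the `SENTINELS` name and the 50% cut-off are assumptions chosen to keep the example readable; the notebook itself uses a 5% cut-off on the real training data.

```python
import io
import pandas as pd

# Markers treated as missing, mirroring the na_values used in this notebook
SENTINELS = [-1, "-1", "Data missing or out of range", "Unknown"]

# Hypothetical toy data, not taken from the real dataset
toy_csv = io.StringIO(
    "speed_limit,junction_control,road_type\n"
    "30,Unknown,Single carriageway\n"
    "-1,Data missing or out of range,Roundabout\n"
    "60,Give way or uncontrolled,Unknown\n"
    "40,-1,Single carriageway\n"
)
toy_train = pd.read_csv(toy_csv, na_values=SENTINELS)
toy_test = toy_train.copy()  # stand-in for a separate test set

# Percentage of missing values per column, measured on the training data only
missing_pct = toy_train.isna().mean() * 100

# Toy threshold of 50% (the notebook uses 5%)
to_drop = missing_pct[missing_pct > 50].index.to_list()
print(to_drop)  # ['junction_control']

# Apply the same drop list to both datasets
for data in [toy_train, toy_test]:
    data.drop(columns=to_drop, inplace=True)
```

The drop list is derived from the training frame alone and then reused on the test frame, so nothing about the test data influences which columns survive.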


In [72]:
columns_to_drop = []

columns_to_drop.extend(train.columns[(train.isna().sum() / train.shape[0]) * 100 > 5].to_list())


In [73]:
print(columns_to_drop)

['local_authority_district', 'first_road_number', 'junction_control', 'second_road_number', 'special_conditions_at_site', 'carriageway_hazards', 'trunk_road_flag', 'enhanced_severity_collision', 'casualty_home_area_type', 'casualty_imd_decile', 'lsoa_of_casualty', 'enhanced_casualty_severity', 'casualty_distance_banding']


37% of the local authority district data is missing, but we have the full dataset for ONS district. Let's use this to infill the local authority district and then drop the ONS district column.

#### Now we're ready to drop the columns with >5% of the data missing. We'll apply this to both the test and train data, and output the saved data.

In [74]:
for data in [train, test]:
  data.drop(columns=columns_to_drop, inplace=True)

#### Which columns still have missing data?

In [75]:
cols_missing_data = train.loc[:, train.isnull().any()].isna().sum().sort_values(ascending=False)
print(cols_missing_data)
# Split the columns that still have gaps into numeric and non-numeric groups
missing_numeric = train[cols_missing_data.index].select_dtypes(include=np.number).columns
missing_categorical = train[cols_missing_data.index].select_dtypes(exclude=np.number).columns

lsoa_of_accident_location                      23310
weather_conditions                             13473
road_type                                      11270
age_band_of_casualty                            8125
age_of_casualty                                 8125
pedestrian_crossing_human_control               4133
pedestrian_crossing_physical_facilities         4078
road_surface_conditions                         2600
sex_of_casualty                                 2223
car_passenger                                   1068
pedestrian_road_maintenance_worker               738
second_road_class                                510
bus_or_coach_passenger                           179
location_easting_osgr                             94
location_northing_osgr                            94
latitude                                          94
longitude                                         94
speed_limit                                       72
casualty_type                                 

The missing data for these features represents quite a small proportion of the dataset, so let's just infill with the most frequent value for categorical columns, or the mean for numeric columns. We'll apply this to all columns in the dataframe in case our testing data has missing values in other features.
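
One possible variant, in keeping with the "no peeking" rule, is to learn the fill values from the training data and reuse them on the test data. The sketch below is not what the notebook's cell does (that cell fills each dataset from its own statistics); the toy frames and the `fill_values` name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for train and test
toy_train = pd.DataFrame({
    "age_of_casualty": [20.0, 40.0, np.nan, 60.0],
    "road_type": ["Single carriageway", None, "Single carriageway", "Roundabout"],
})
toy_test = pd.DataFrame({
    "age_of_casualty": [np.nan, 30.0],
    "road_type": [None, "Roundabout"],
})

numeric_cols = toy_train.select_dtypes(include=np.number).columns
categorical_cols = toy_train.select_dtypes(exclude=np.number).columns

# Learn one fill value per column from the training data only:
# the mean for numeric columns, the most frequent value for the rest
fill_values = {
    **toy_train[numeric_cols].mean().to_dict(),
    **toy_train[categorical_cols].mode().iloc[0].to_dict(),
}

# Reuse the same training-derived values on both datasets
toy_train_filled = toy_train.fillna(fill_values)
toy_test_filled = toy_test.fillna(fill_values)
```

Here the missing test age is filled with the training mean of 40.0 rather than the test mean, so the test set stays untouched by its own statistics.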

In [76]:
for data in [train, test]:
  # Numeric columns: fill gaps with the column mean
  numeric_cols = data.select_dtypes(include=np.number).columns
  data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())

  # Categorical columns: fill gaps with the most frequent value
  categorical_cols = data.select_dtypes(exclude=np.number).columns
  data[categorical_cols] = data[categorical_cols].apply(lambda x: x.fillna(x.value_counts().index[0]))

In [77]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 532806 entries, 0 to 532805
Data columns (total 44 columns):
 #   Column                                       Non-Null Count   Dtype  
---  ------                                       --------------   -----  
 0   accident_index                               532806 non-null  object 
 1   accident_year_x                              532806 non-null  int64  
 2   accident_reference_x                         532806 non-null  object 
 3   location_easting_osgr                        532806 non-null  float64
 4   location_northing_osgr                       532806 non-null  float64
 5   longitude                                    532806 non-null  float64
 6   latitude                                     532806 non-null  float64
 7   police_force                                 532806 non-null  object 
 8   accident_severity                            532806 non-null  object 
 9   number_of_vehicles                           532806 non-nul

In [78]:
contains_self_reported = []
for col in train.select_dtypes(include='object'):
  if train[col].str.contains('unknown.*self rep.*', case=False, regex=True).any(): 
    contains_self_reported.append(col)

Impute these self-reported unknown values with the column mode.

In [79]:
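for dataset in [train, test]:
  for col in contains_self_reported:
    # Replace any self-reported unknown entry with the column's most frequent value
    is_self_reported = dataset[col].str.contains('unknown.*self rep.*', case=False, regex=True)
    dataset[col] = np.where(is_self_reported, dataset[col].value_counts().index[0], dataset[col])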

In [80]:
for dataset in [train, test]:
  # Replace 'Unallocated' with the most frequent area type
  dataset['urban_or_rural_area'] = np.where(dataset['urban_or_rural_area'] == 'Unallocated', dataset['urban_or_rural_area'].value_counts().index[0], dataset['urban_or_rural_area'])

train.urban_or_rural_area.value_counts()

urban_or_rural_area
Urban    341438
Rural    191368
Name: count, dtype: int64
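
The placeholder-replacement pattern used in the last few cells can be sketched on a toy column. The labels below are hypothetical, and unlike the cells above this sketch takes the mode from the non-placeholder rows only, which is an assumption on our part: it guards against the placeholder itself being the most frequent value.

```python
import pandas as pd

# Same regular expression as used in the notebook
PATTERN = r"unknown.*self rep.*"

# Hypothetical labels for illustration
toy = pd.Series([
    "Male", "Female", "Male", "unknown (self reported)", "Female", "Male",
])

# Boolean mask of rows holding the placeholder (case-insensitive match)
is_self_reported = toy.str.contains(PATTERN, case=False, regex=True)

# Most frequent value among the genuine entries
mode_value = toy[~is_self_reported].mode().iloc[0]

# Swap the placeholder rows for the mode, leaving other rows untouched
cleaned = toy.mask(is_self_reported, mode_value)
print(cleaned.tolist())
# ['Male', 'Female', 'Male', 'Male', 'Female', 'Male']
```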

Let's get rid of the duplicated columns left over from merging the data. We'll drop police_force at the same time.

In [81]:
for data in [train, test]:
  data.rename(columns={'accident_year_x': 'accident_year', 'accident_reference_x': 'accident_reference'}, inplace=True)
  data.drop(labels=['accident_year_y', 'accident_reference_y', 'police_force'], axis=1, inplace=True)
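
For context, the _x and _y suffixes are what pandas adds by default when two merged frames share a column name that isn't part of the join key. The sketch below shows this on two made-up tables; the `collisions` and `casualties` frames and their contents are assumptions, since the merge itself happened in an earlier step. It also checks that the two copies really agree before dropping one.

```python
import pandas as pd

# Hypothetical source tables sharing a non-key column (accident_year)
collisions = pd.DataFrame({
    "accident_index": ["A1", "A2"],
    "accident_year": [2020, 2021],
})
casualties = pd.DataFrame({
    "accident_index": ["A1", "A1", "A2"],
    "accident_year": [2020, 2020, 2021],
    "casualty_severity": ["Slight", "Serious", "Slight"],
})

# Merging on accident_index leaves accident_year_x and accident_year_y
merged = collisions.merge(casualties, on="accident_index")

# Only drop the _y copy once we have confirmed it carries the same values
if (merged["accident_year_x"] == merged["accident_year_y"]).all():
    merged = (
        merged.drop(columns="accident_year_y")
              .rename(columns={"accident_year_x": "accident_year"})
    )

print(merged.columns.to_list())
# ['accident_index', 'accident_year', 'casualty_severity']
```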

In [82]:
train.to_csv('cleaned_training_data.csv', index=False)
test.to_csv('cleaned_testing_data.csv', index=False)