# 03 - Preliminary data cleaning
____

Step 3! We have our pristine dataset split nicely into our training (and validation) and testing data. 

It is important to remember that any steps / transformations we apply to our training data must also be applied to the test data to avoid any errors. If we're going to drop columns, let's build up a list called columns_to_drop so we can apply this nice and easily.
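
One way to keep the two datasets in step is to route every drop through a single helper. This is only a sketch on made-up data: `apply_drops` and the toy column names are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

def apply_drops(frames, cols_to_drop):
    """Drop the same columns from every frame, ignoring any that are absent."""
    return [df.drop(columns=cols_to_drop, errors="ignore") for df in frames]

# Toy stand-ins for the real training and testing data (made-up values).
toy_train = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
toy_test = pd.DataFrame({"a": [7], "b": [8], "c": [9]})

cols_to_drop = ["b"]
toy_train, toy_test = apply_drops([toy_train, toy_test], cols_to_drop)

# Both frames now share exactly the same columns.
print(list(toy_train.columns), list(toy_test.columns))
```

Returning new frames rather than dropping in place makes it harder to update one dataset and forget the other.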


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 

pd.options.mode.chained_assignment = None 

In [2]:
train = pd.read_csv('training_data.csv', low_memory = False, na_values=[-1, '-1', 'Data missing or out of range', 'Unknown'])
test = pd.read_csv('testing_data.csv', low_memory = False, na_values=[-1, '-1', 'Data missing or out of range', 'Unknown'])


## Let's have a look at our data types.

In [3]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11359 entries, 0 to 11358
Data columns (total 57 columns):
 #   Column                                       Non-Null Count  Dtype  
---  ------                                       --------------  -----  
 0   accident_index                               11359 non-null  int64  
 1   accident_year_x                              11359 non-null  int64  
 2   accident_reference_x                         11359 non-null  int64  
 3   location_easting_osgr                        11359 non-null  float64
 4   location_northing_osgr                       11359 non-null  float64
 5   longitude                                    11359 non-null  float64
 6   latitude                                     11359 non-null  float64
 7   police_force                                 11359 non-null  object 
 8   accident_severity                            11359 non-null  object 
 9   number_of_vehicles                           11359 non-null  int64  
 10

Firstly, we have a lot of features. Feature selection will be very important to make sure that our model doesn't overfit. There are also a lot of superfluous features that can easily be dropped at our feature selection stage:
  * The geographical location columns
  * The duplicate accident_year and accident_reference columns
  * Whether or not a police officer attended *after* the collision occurred, which won't have affected the seriousness
  * The vehicle and casualty references, which aren't useful
  * accident_severity, which is highly correlated with casualty_severity and will just be a proxy for the model to overfit to
  * The enhanced severity columns, for the same reason

It also looks like we have a tonne of missing data, so first let's make sure that we drop features that have so much missing they won't really assist in our predictions.


### Let's drop columns where we're missing a lot of data.

We have a lot of NaN values, but also some more disguised missing data. Some entries are labelled with -1 (both as text and as a number), 'Data missing or out of range', or 'Unknown'. A column missing more than 5% of its values is going to limit our predictive power, so let's remove these columns from the off. We've passed these alternative labels to na_values in our read_csv() call above, so they will be counted by the usual NaN-based methods.

We will evaluate the missing data in the training data set, and apply any column dropping to both datasets. No peeking at the testing data!
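
To make the idea concrete, here is a minimal sketch on made-up rows. It assumes the same sentinel labels and the same 5% cut-off as the code in this notebook; the toy column values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows: -1 and 'Unknown' stand in for missing values, as in the raw data.
toy = pd.DataFrame({
    "speed_limit": [30, -1, 40, 30],
    "weather": ["Fine", "Unknown", "Unknown", "Rain"],
    "road_type": ["Single", "Dual", "Single", "Single"],
})

# Similar in spirit to passing na_values to read_csv: map the sentinels to NaN.
toy = toy.replace([-1, "-1", "Data missing or out of range", "Unknown"], np.nan)

# Percentage of missing values per column.
missing_pct = toy.isna().mean() * 100

threshold = 5  # percent, the same cut-off as this notebook's code
too_sparse = missing_pct[missing_pct > threshold].index.to_list()
print(too_sparse)
```

Here `isna().mean()` gives the fraction of missing rows directly, which is equivalent to dividing `isna().sum()` by the number of rows.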


In [4]:
columns_to_drop = []

columns_to_drop.extend(train.columns[(train.isna().sum() / train.shape[0]) * 100 > 5].to_list())


In [5]:
print(columns_to_drop)

['local_authority_district', 'first_road_number', 'junction_control', 'second_road_number', 'special_conditions_at_site', 'carriageway_hazards', 'casualty_home_area_type', 'casualty_imd_decile', 'lsoa_of_casualty', 'casualty_distance_banding']


37% of the local authority district data is missing, but we have complete data for the ONS district. Let's use this to infill the local authority district and then drop the ONS district column.

In [6]:
# Lookup from ONS district code to local authority district name
districts = {'E08000019': 'Sheffield', 'E08000017': 'Doncaster', 'E08000018': 'Rotherham', 'E08000016': 'Barnsley'}

for data in [train, test]:
  # Fill only the missing entries; avoids `is np.nan` checks and chained assignment
  data['local_authority_district'] = data['local_authority_district'].fillna(
      data['local_authority_ons_district'].map(districts))

columns_to_drop.append('local_authority_ons_district')
columns_to_drop.remove('local_authority_district')

#### Now we're ready to drop the columns with >5% of the data missing. We'll apply this to both the test and train data, and output the saved data.

In [7]:
for data in [train, test]:
  data.drop(columns=columns_to_drop, inplace=True)

#### Which columns still have missing data?

In [8]:
cols_missing_data = train.loc[:, train.isnull().any()].isna().sum().sort_values(ascending=False)
print(cols_missing_data)
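# Split the columns that still have missing values by dtype
missing_numeric = train[cols_missing_data.index].select_dtypes(include=np.number).columns
missing_categorical = train[cols_missing_data.index].select_dtypes(exclude=np.number).columns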

age_of_casualty                            271
age_band_of_casualty                       271
weather_conditions                         198
road_type                                   73
road_surface_conditions                     29
pedestrian_crossing_human_control           23
pedestrian_crossing_physical_facilities     22
car_passenger                               14
bus_or_coach_passenger                      13
pedestrian_road_maintenance_worker           9
sex_of_casualty                              5
enhanced_severity_collision                  2
enhanced_casualty_severity                   2
dtype: int64


The missing data for these features represents quite a small proportion of the dataset, so let's just infill with the most frequent value for categories, or the mean for numerics. We'll apply this to all columns in the dataframe in case our testing data has missing data in other features.

In [9]:
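for data in [train, test]:
  # Assign back rather than using fillna(inplace=True) on a column selection
  numeric_cols = data.select_dtypes(include=np.number).columns
  data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())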

for data in [train, test]:
  categorical_cols = data.describe(exclude=[np.number]).columns
  data[categorical_cols] = data[categorical_cols].apply(lambda x: x.fillna(x.value_counts().index[0]))
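
Note that the cell above fills each dataset using its own mean and most frequent value. If you would rather fill the test set with values learned from the training set alone, in keeping with the "no peeking" rule, a minimal sketch on made-up data could look like the following. The names `toy_train`, `toy_test` and `fill_values` are illustrative and not part of this notebook.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
toy_train = pd.DataFrame({
    "age": [20.0, 40.0, np.nan],
    "road": ["A", "A", None],
})
toy_test = pd.DataFrame({
    "age": [np.nan, 90.0],
    "road": [None, "B"],
})

# Learn the fill values from the training data only.
numeric_cols = toy_train.select_dtypes(include=np.number).columns
categorical_cols = toy_train.select_dtypes(exclude=np.number).columns

fill_values = toy_train[numeric_cols].mean().to_dict()
fill_values.update(toy_train[categorical_cols].mode().iloc[0].to_dict())

# Apply the same training-derived values to both datasets.
toy_train = toy_train.fillna(fill_values)
toy_test = toy_test.fillna(fill_values)
print(toy_test)
```

Passing a dict to `fillna` fills each column with its own value, so one lookup built from the training data can be reused for both frames.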

In [10]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11359 entries, 0 to 11358
Data columns (total 47 columns):
 #   Column                                       Non-Null Count  Dtype  
---  ------                                       --------------  -----  
 0   accident_index                               11359 non-null  int64  
 1   accident_year_x                              11359 non-null  int64  
 2   accident_reference_x                         11359 non-null  int64  
 3   location_easting_osgr                        11359 non-null  float64
 4   location_northing_osgr                       11359 non-null  float64
 5   longitude                                    11359 non-null  float64
 6   latitude                                     11359 non-null  float64
 7   police_force                                 11359 non-null  object 
 8   accident_severity                            11359 non-null  object 
 9   number_of_vehicles                           11359 non-null  int64  
 10

Let's get rid of the duplicated columns left over from merging the data. We'll drop police_force here too.

In [11]:
for data in [train, test]:
  data.rename(columns={'accident_year_x': 'accident_year', 'accident_reference_x': 'accident_reference'}, inplace=True)
  data.drop(labels=['accident_year_y', 'accident_reference_y', 'police_force'], axis=1, inplace=True)
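
For context, the `_x` and `_y` suffixes are what pandas appends when two merged frames share a column name that is not a join key. The sketch below reproduces this on made-up tables. The table names, and the assumption that the merge used `accident_index` as its only key, are guesses for illustration; the original merge is not shown in this notebook.

```python
import pandas as pd

# Made-up tables; the real merge keys are an assumption here.
accidents = pd.DataFrame({"accident_index": [1, 2],
                          "accident_year": [2020, 2020]})
casualties = pd.DataFrame({"accident_index": [1, 2],
                           "accident_year": [2020, 2020],
                           "casualty_class": ["Driver", "Pedestrian"]})

# Merging on one key leaves the other shared column duplicated with suffixes.
merged = accidents.merge(casualties, on="accident_index")
suffixed_columns = merged.columns.to_list()
print(suffixed_columns)

# Keep one copy and restore the original name.
merged = (merged.drop(columns=["accident_year_y"])
                .rename(columns={"accident_year_x": "accident_year"}))
print(merged.columns.to_list())
```

If the duplicated columns always hold the same values, including them in the `on` keys of the merge avoids the suffixes in the first place.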

In [12]:
train.to_csv('cleaned_training_data.csv', index=False)
test.to_csv('cleaned_testing_data.csv', index=False)