Missing values are present in the following columns:
- REPORT_TYPE
- STREET_DIRECTION
- STREET_NAME
- BEAT_OF_OCCURRENCE
- MOST_SEVERE_INJURY
- LATITUDE
- LONGITUDE
- LOCATION

In [11]:
import pandas as pd

In [12]:
crashes_df = pd.read_csv('../data/Crashes.csv')

### check with other datasets

There are missing values; let's check whether any of these columns also appear in the other two datasets.

In [13]:
people_df = pd.read_csv('../data/People.csv')
vehicles_df = pd.read_csv('../data/Vehicles.csv')

In [14]:
# Count missing values once, then keep only the columns that have any
na_counts = crashes_df.isna().sum()
missing_values_dict = na_counts[na_counts > 0].to_dict()

In [15]:
# Look for Crashes columns with missing values that also exist in People or Vehicles
for col in set(people_df.columns) | set(vehicles_df.columns):
    if col in missing_values_dict:
        print(f'Column {col} has missing values in Crashes and also exists in People or Vehicles.')

None are (at least not directly).
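The same check can be written as a set intersection. The sketch below uses small toy frames with made-up values (only the column names come from the real data), so it illustrates the idea rather than reproducing the notebook's result.

```python
import pandas as pd

# Toy stand-ins for the three tables (hypothetical values, real column names)
crashes_toy = pd.DataFrame({
    "RD_NO": ["A1", "A2"],
    "STREET_DIRECTION": ["N", None],
    "LATITUDE": [41.9, None],
})
people_toy = pd.DataFrame({"RD_NO": ["A1"], "AGE": [30]})
vehicles_toy = pd.DataFrame({"RD_NO": ["A1"], "TRAVEL_DIRECTION": ["S"]})

# Columns of the crashes table that contain at least one missing value
na_counts = crashes_toy.isna().sum()
cols_with_na = set(na_counts[na_counts > 0].index)

# Columns available in either of the other two tables
other_cols = set(people_toy.columns) | set(vehicles_toy.columns)

# Columns that could be filled directly from another table
shared = sorted(cols_with_na & other_cols)
print(shared)
```

An empty result means that no column with missing values can be looked up by name in the other tables, which is the situation described above.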

In [16]:
people_df.columns, vehicles_df.columns

(Index(['PERSON_ID', 'PERSON_TYPE', 'RD_NO', 'VEHICLE_ID', 'CRASH_DATE', 'CITY',
        'STATE', 'SEX', 'AGE', 'SAFETY_EQUIPMENT', 'AIRBAG_DEPLOYED',
        'EJECTION', 'INJURY_CLASSIFICATION', 'DRIVER_ACTION', 'DRIVER_VISION',
        'PHYSICAL_CONDITION', 'BAC_RESULT', 'DAMAGE_CATEGORY', 'DAMAGE'],
       dtype='object'),
 Index(['CRASH_UNIT_ID', 'RD_NO', 'CRASH_DATE', 'UNIT_NO', 'UNIT_TYPE',
        'VEHICLE_ID', 'MAKE', 'MODEL', 'LIC_PLATE_STATE', 'VEHICLE_YEAR',
        'VEHICLE_DEFECT', 'VEHICLE_TYPE', 'VEHICLE_USE', 'TRAVEL_DIRECTION',
        'MANEUVER', 'OCCUPANT_CNT', 'FIRST_CONTACT_POINT'],
       dtype='object'))

In Vehicles there is a feature called TRAVEL_DIRECTION that might contain the same information as STREET_DIRECTION.

In [17]:
# Join the two datasets on RD_NO, keeping only STREET_DIRECTION and TRAVEL_DIRECTION
joined_df = crashes_df[['RD_NO', 'STREET_DIRECTION']].merge(
    vehicles_df[['RD_NO', 'TRAVEL_DIRECTION']],
    on='RD_NO',
    how='inner'
)
missing_values = joined_df.isna().sum()

In [18]:
joined_df

Unnamed: 0,RD_NO,STREET_DIRECTION,TRAVEL_DIRECTION
0,JC113649,N,S
1,JC113627,N,S
2,JC113627,N,E
3,JC113637,S,N
4,JC113637,S,S
...,...,...,...
460432,HZ164689,N,S
460433,HZ122950,S,S
460434,HZ122950,S,W
460435,JB442550,W,E


In [19]:
missing_values

RD_NO                   0
STREET_DIRECTION        4
TRAVEL_DIRECTION    10373
dtype: int64

In [20]:
# (number of mismatches, shape, number of matches)
# Rows with a missing value on either side count as mismatches, because NaN != x is True
mismatches = (joined_df['STREET_DIRECTION'] != joined_df['TRAVEL_DIRECTION']).sum()
mismatches, joined_df.shape, joined_df.shape[0] - mismatches

(291438, (460437, 3), 168999)

In [21]:
vehicles_df["TRAVEL_DIRECTION"].isna().sum()

10373

The two columns seem to be unrelated to each other:
- TRAVEL_DIRECTION is itself missing in 10373 rows,
- only 168999 of the 460437 joined rows (about 37%) have matching values, while 291438 do not, so we will have to recover this information in another way.
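Because a comparison with NaN always counts as a mismatch, it can help to restrict the comparison to rows where both directions are present. The sketch below does this on a toy frame with invented values; `joined_toy`, `both_present` and `pair_counts` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy version of the joined frame (hypothetical values)
joined_toy = pd.DataFrame({
    "RD_NO": ["A1", "A1", "A2", "A3", "A4"],
    "STREET_DIRECTION": ["N", "N", "S", None, "W"],
    "TRAVEL_DIRECTION": ["N", "S", "S", "E", None],
})

# Keep only rows where both directions are known
both_present = joined_toy.dropna(subset=["STREET_DIRECTION", "TRAVEL_DIRECTION"])

# Share of rows where the two directions agree
matches = (both_present["STREET_DIRECTION"] == both_present["TRAVEL_DIRECTION"]).sum()
match_rate = matches / len(both_present)

# Cross-tabulation of the direction pairs that occur together
pair_counts = pd.crosstab(both_present["STREET_DIRECTION"], both_present["TRAVEL_DIRECTION"])
print(match_rate)
print(pair_counts)
```

If the cross-tabulation showed counts concentrated on the diagonal, the two columns would be carrying the same information; counts spread across the table would support the conclusion above.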

### internal integration

Since we have both CRASH_DATE and DATE_POLICE_NOTIFIED, we can compute the time that passed between the crash and the moment the police were notified.

In [22]:
create_delta_time_column_df = crashes_df[['CRASH_DATE', 'DATE_POLICE_NOTIFIED']]
create_delta_time_column_df

Unnamed: 0,CRASH_DATE,DATE_POLICE_NOTIFIED
0,01/12/2019 12:01:00 AM,01/12/2019 12:01:00 AM
1,01/11/2019 11:36:00 PM,01/11/2019 11:42:00 PM
2,01/11/2019 11:31:00 PM,01/12/2019 12:08:00 AM
3,01/11/2019 11:22:00 PM,01/11/2019 11:48:00 PM
4,01/11/2019 11:08:00 PM,01/11/2019 11:38:00 PM
...,...,...
257920,11/11/2014 08:00:00 PM,11/12/2015 02:40:00 PM
257921,08/20/2014 04:50:00 PM,08/20/2016 08:32:00 PM
257922,02/24/2014 07:45:00 PM,02/25/2016 02:30:00 PM
257923,01/21/2014 07:40:00 AM,01/21/2016 07:50:00 AM


In [23]:
# Convert the date columns to datetime format
create_delta_time_column_df['CRASH_DATE'] = pd.to_datetime(create_delta_time_column_df['CRASH_DATE'], errors='coerce')
create_delta_time_column_df['DATE_POLICE_NOTIFIED'] = pd.to_datetime(create_delta_time_column_df['DATE_POLICE_NOTIFIED'], errors='coerce')

delta_time = create_delta_time_column_df['DATE_POLICE_NOTIFIED'] - create_delta_time_column_df['CRASH_DATE']

# Create a new column with the format "days minutes seconds"
create_delta_time_column_df['DELTA_TIME_CRASH_DATE_POLICE_REPORT_DATE'] = delta_time.apply(lambda x: f"{x.days} days {x.seconds // 60} minutes {x.seconds % 60} seconds")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  create_delta_time_column_df['CRASH_DATE'] = pd.to_datetime(create_delta_time_column_df['CRASH_DATE'], errors='coerce')
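The warning above is raised because the two-column frame was selected from `crashes_df` and then assigned to. One common way to avoid it is to take an explicit `.copy()` of the selection before modifying it. The sketch below shows this on a toy frame; the explicit `format` string is an assumption based on the timestamps shown earlier (`MM/DD/YYYY hh:mm:ss AM/PM`).

```python
import pandas as pd

# Toy stand-in for crashes_df (hypothetical RD_NO values, real timestamp layout)
crashes_toy = pd.DataFrame({
    "RD_NO": ["A1", "A2"],
    "CRASH_DATE": ["01/11/2019 11:36:00 PM", "01/11/2019 11:31:00 PM"],
    "DATE_POLICE_NOTIFIED": ["01/11/2019 11:42:00 PM", "01/12/2019 12:08:00 AM"],
})

# .copy() makes the selection an independent DataFrame, so assigning to it is unambiguous
dates_toy = crashes_toy[["CRASH_DATE", "DATE_POLICE_NOTIFIED"]].copy()

# Assumed timestamp layout; values that do not match become NaT because of errors="coerce"
date_format = "%m/%d/%Y %I:%M:%S %p"
for col in ["CRASH_DATE", "DATE_POLICE_NOTIFIED"]:
    dates_toy[col] = pd.to_datetime(dates_toy[col], format=date_format, errors="coerce")

print(dates_toy.dtypes)
```

The original frame is left untouched, which also makes it clear that the converted columns live only in the working copy.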

In [24]:
create_delta_time_column_df['DELTA_TIME_CRASH_DATE_POLICE_REPORT_DATE']

0              0 days 0 minutes 0 seconds
1              0 days 6 minutes 0 seconds
2             0 days 37 minutes 0 seconds
3             0 days 26 minutes 0 seconds
4             0 days 30 minutes 0 seconds
                       ...               
257920    365 days 1120 minutes 0 seconds
257921     731 days 222 minutes 0 seconds
257922    730 days 1125 minutes 0 seconds
257923      730 days 10 minutes 0 seconds
257924     1705 days 46 minutes 0 seconds
Name: DELTA_TIME_CRASH_DATE_POLICE_REPORT_DATE, Length: 257925, dtype: object
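The formatted string is readable but has dtype `object`, so it cannot be sorted or aggregated numerically. A possible alternative, sketched below on two timestamp pairs taken from the table above, is to keep the difference as a timedelta and derive numeric columns from it. The names `delta_days` and `delta_minutes` are illustrative.

```python
import pandas as pd

date_format = "%m/%d/%Y %I:%M:%S %p"  # assumed layout of the raw timestamps

# Two crash/notification pairs copied from the preview above
crash = pd.to_datetime(
    pd.Series(["01/11/2019 11:31:00 PM", "11/11/2014 08:00:00 PM"]), format=date_format
)
notified = pd.to_datetime(
    pd.Series(["01/12/2019 12:08:00 AM", "11/12/2015 02:40:00 PM"]), format=date_format
)

# Timedelta series: supports sorting, comparison and aggregation
delta = notified - crash

# Numeric views of the same information
delta_days = delta.dt.days                       # whole days only
delta_minutes = delta.dt.total_seconds() / 60    # total elapsed minutes

print(delta_days.tolist(), delta_minutes.tolist())
```

With numeric columns, summaries such as the median reporting delay or the share of crashes reported more than a day late become one-line operations.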

BEAT_OF_OCCURRENCE can be derived from the LATITUDE, LONGITUDE and LOCATION data, for the rows where those values are present.
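One possible way to do this, shown here only as a rough sketch and not as the method used in this notebook, is to give each crash with a missing beat the beat of the nearest crash whose beat is known. The coordinates and beat numbers below are invented, and the brute-force distance matrix is only practical for small inputs. Official beat boundary polygons would give a more reliable answer.

```python
import numpy as np
import pandas as pd

# Toy crashes with two missing beats (hypothetical coordinates and beat numbers)
crashes_toy = pd.DataFrame({
    "LATITUDE": [41.880, 41.881, 41.950, 41.9505],
    "LONGITUDE": [-87.630, -87.631, -87.700, -87.7005],
    "BEAT_OF_OCCURRENCE": [111.0, np.nan, 1912.0, np.nan],
})

coords = ["LATITUDE", "LONGITUDE"]
has_coords = crashes_toy[coords].notna().all(axis=1)

# Rows that can serve as reference points, and rows that need a beat
known = crashes_toy[has_coords & crashes_toy["BEAT_OF_OCCURRENCE"].notna()]
to_fill = crashes_toy[has_coords & crashes_toy["BEAT_OF_OCCURRENCE"].isna()]

# Squared distance in raw degrees between every row to fill and every known row
diff = to_fill[coords].to_numpy()[:, None, :] - known[coords].to_numpy()[None, :, :]
nearest = (diff ** 2).sum(axis=2).argmin(axis=1)

# Copy the beat of the nearest known crash into a filled copy of the frame
filled = crashes_toy.copy()
filled.loc[to_fill.index, "BEAT_OF_OCCURRENCE"] = known["BEAT_OF_OCCURRENCE"].to_numpy()[nearest]
print(filled)
```

Rows that lack coordinates are left unchanged, since there is nothing to measure a distance from.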