In [102]:
import pandas as pd
from Modules import preprocessing as pre

In [103]:
dir = './Data'
temp_pre = './Data/tmp_pre'
colission_exists = False

## Weather dataset preprocessing

In the process of downloading the data, we selected all the attributes available. Now, we will explore the data and select the ones that are useful for our purpose.

In [104]:
weather = pd.read_csv(f'{dir}/weather.csv')


### Data exploration

In [105]:
weather.shape

(90591, 66)

We are only interested in rows whose date falls within the time ranges 01/06/2018 - 30/09/2018 and 01/06/2020 - 30/09/2020, so we filter the data to keep only those rows.

In [106]:
weather = pre.time_filter(weather, 'DATE')
weather.to_csv(f'{dir}/weather_2018-2020.csv', index=False)
weather.shape

(20536, 66)
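
The `pre.time_filter` helper lives in the project's `Modules` package and is not shown here. A minimal sketch of what such a filter could look like, assuming inclusive windows from 1 June to 30 September and an ISO-formatted date column (`time_filter_sketch` and the sample frame are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for pre.time_filter; the real helper is not shown in this notebook.
def time_filter_sketch(df, date_col):
    dates = pd.to_datetime(df[date_col])
    # Inclusive summer windows for 2018 and 2020 (assumed: 1 June to 30 September).
    in_2018 = dates.between(pd.Timestamp("2018-06-01"), pd.Timestamp("2018-09-30"))
    in_2020 = dates.between(pd.Timestamp("2020-06-01"), pd.Timestamp("2020-09-30"))
    return df[in_2018 | in_2020].reset_index(drop=True)

sample = pd.DataFrame({
    "DATE": ["2018-05-31", "2018-06-01", "2019-07-15", "2020-09-30", "2020-10-01"],
    "PRCP": [0.0, 1.2, 0.0, 3.4, 0.0],
})
filtered = time_filter_sketch(sample, "DATE")
print(filtered["DATE"].tolist())
```

Only the rows dated inside one of the two windows survive; the boundary dates themselves are kept because `between` is inclusive by default.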

In [107]:
weather.columns

Index(['STATION', 'NAME', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'DATE', 'AWND',
       'AWND_ATTRIBUTES', 'DAPR', 'DAPR_ATTRIBUTES', 'DASF', 'DASF_ATTRIBUTES',
       'MDPR', 'MDPR_ATTRIBUTES', 'MDSF', 'MDSF_ATTRIBUTES', 'PGTM',
       'PGTM_ATTRIBUTES', 'PRCP', 'PRCP_ATTRIBUTES', 'PSUN', 'PSUN_ATTRIBUTES',
       'SNOW', 'SNOW_ATTRIBUTES', 'SNWD', 'SNWD_ATTRIBUTES', 'TAVG',
       'TAVG_ATTRIBUTES', 'TMAX', 'TMAX_ATTRIBUTES', 'TMIN', 'TMIN_ATTRIBUTES',
       'TOBS', 'TOBS_ATTRIBUTES', 'TSUN', 'TSUN_ATTRIBUTES', 'WDF2',
       'WDF2_ATTRIBUTES', 'WDF5', 'WDF5_ATTRIBUTES', 'WESD', 'WESD_ATTRIBUTES',
       'WESF', 'WESF_ATTRIBUTES', 'WSF2', 'WSF2_ATTRIBUTES', 'WSF5',
       'WSF5_ATTRIBUTES', 'WT01', 'WT01_ATTRIBUTES', 'WT02', 'WT02_ATTRIBUTES',
       'WT03', 'WT03_ATTRIBUTES', 'WT04', 'WT04_ATTRIBUTES', 'WT05',
       'WT05_ATTRIBUTES', 'WT06', 'WT06_ATTRIBUTES', 'WT08', 'WT08_ATTRIBUTES',
       'WT09', 'WT09_ATTRIBUTES', 'WT11', 'WT11_ATTRIBUTES'],
      dtype='object')

According to the documentation of the weather dataset and the attributes present, each row holds the selected observations (values) available for a given **STATION** and **DATE**. Neither **STATION** nor **DATE** is unique on its own.

In [108]:
len(weather['STATION'].unique()), len(weather['DATE'].unique())

(129, 244)
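
If each row is one station-day, the (**STATION**, **DATE**) pair would be expected to act as a key even though neither column is unique alone. This has not been verified on the real file; the sketch below shows one way to check it, using a hypothetical toy frame:

```python
import pandas as pd

# Toy data (hypothetical): two stations observed on the same two days.
toy = pd.DataFrame({
    "STATION": ["A", "A", "B", "B"],
    "DATE": ["2018-06-01", "2018-06-02", "2018-06-01", "2018-06-02"],
    "PRCP": [0.0, 1.0, 0.5, 0.0],
})

station_unique = toy["STATION"].is_unique
date_unique = toy["DATE"].is_unique
# Number of rows that repeat an earlier (STATION, DATE) combination.
pair_duplicates = int(toy.duplicated(subset=["STATION", "DATE"]).sum())
print(station_unique, date_unique, pair_duplicates)
```

A `pair_duplicates` count of zero on the real data would confirm that the pair identifies a row.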

Apart from the first 6 columns (**STATION**, **NAME**, **LATITUDE**, **LONGITUDE**, **ELEVATION** and **DATE**), the rest of the attributes correspond to optional observations and their respective flags (as defined by the weather documentation. *Note: The 4 flags listed [...] are optional on the Custom GHCN-Daily ASCII Form*) and can therefore contain many null values. That is the reason for the large number of sparse attributes detected. We will explore the data to see which attributes are useful for our purpose.

All these attributes are described in Table 4 of the documentation. A brief description is collected in the following table:

| Attribute | Description |
| --- | --- |
| *PRCP* | Precipitation (mm) |
| *SNOW* | Snowfall (mm) |
| *SNWD* | Snow depth (mm) |
| *TSUN* | Daily total sunshine (minutes) |
| *TMAX* | Maximum temperature (Celsius) |
| *TMIN* | Minimum temperature (Celsius) |
| *TAVG* | Average temperature (Celsius) |
| *TOBS* | Temperature at the time of observation |
| *AWND* | Average daily wind speed (meters per second) |
| *WT** | Weather type * |


The next table contains the attributes that we consider not useful for the visualization purpose:

| Attribute | Description |
| --- | --- |
| *PGTM* | Peak gust time (hours and minutes, i.e., HHMM) |
| *DAPR* | Number of days included in the multiday precipitation total (MDPR) |
| *DASF* | Number of days included in the multiday snowfall total (MDSF) |
| *MDPR* | Multiday precipitation total (mm; use with DAPR and DWPR, if available) |
| *MDSF* | Multiday snowfall total (mm; use with DASF and DWSF, if available) |
| *PSUN* | Daily percent of possible sunshine (percent; use with TSUN, if available) |
| *WSF** | Fastest *-minute wind speed (meters per second) |
| *WDF** | Direction of fastest *-second wind (degrees) |
| *WESD* | Water equivalent of snow on the ground (decimal mm) |
| *WESF* | Water equivalent of snowfall (decimal mm) |


Neither **DAPR** nor **DASF** directly measures a specific weather condition. They only count the days covered by a multi-day precipitation/snowfall total, which is not useful for our purpose. The same applies to **MDPR**, **MDSF** and **PSUN**, which also do not describe the weather conditions of a specific day.

While weather conditions can impact accidents, the attributes about the direction and speed of the fastest wind (**WDF*** and **WSF***) might not directly correlate with accident frequency or timing. They focus on detailed weather patterns that are not necessarily tied to accident data. Similarly, the attributes that describe the water equivalent of snow (**WESD** and **WESF**) are not relevant, since we assume that sufficient information is gathered by the **SNOW** and **SNWD** attributes.

The ***_ATTRIBUTES** columns refer to the rest of the tables in the documentation (Table 1, 2, 3). These tables serve as flags for the measurement, quality and source of the data respectively.

- Table 1 details measurement attributes and flags for specific weather data. It provides information on how certain measurements were taken or derived, such as multi-day totals, conversions from different units, missing or trace values, and other specific characteristics regarding the nature of the recorded data.

- Table 2 presents quality flags indicating various quality assurance checks for the data. These flags denote whether the data passed or failed specific quality checks such as duplicate checks, consistency checks, outlier checks, and other verification processes. It's designed to inform users about the reliability and potential issues associated with the data.

- Table 3 specifies the source of the weather data. It categorizes the origin or the institution from which the data was collected. This is crucial for understanding the diversity of sources, from different countries, organizations, or systems, providing a comprehensive picture of the data's origin and the network through which it was obtained.

The values available for these attributes cover a wide range. In the data selection process we will analyze the subset of those values that are present in the data and discuss if they are useful for our purpose.
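
To make the flag layout concrete, the sketch below splits one raw flag string into its parts. It assumes the comma-separated field order is measurement flag, quality flag, source flag, and an optional observation time (HHMM); `parse_attributes` is a hypothetical helper, not part of the project code:

```python
def parse_attributes(raw):
    """Split a raw *_ATTRIBUTES string into named flags (field order is an assumption)."""
    if raw is None or raw != raw:  # raw != raw is True only for NaN
        return None
    parts = str(raw).split(",")
    parts += [""] * (4 - len(parts))  # pad trailing fields that are absent
    keys = ["measurement", "quality", "source", "obs_time"]
    # Empty strings mean "flag not set", so they are mapped to None.
    return {key: (value or None) for key, value in zip(keys, parts)}

with_time = parse_attributes("T,,7,0700")
without_time = parse_attributes(",,W")
print(with_time)
print(without_time)
```

Under this reading, `'T,,7,0700'` carries measurement flag `T`, no quality flag, source `7`, and an observation time of 07:00, while `',,W'` only names a source.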

### Data selection

#### Geographical attributes

First of all, we will start by joining **LATITUDE** and **LONGITUDE** in a **LOCATION** attribute following the format of the *collision* dataset. Since both attributes have no missing values, the resulting column will not have any missing values.

In [109]:
weather[(weather['LATITUDE'].notnull()) & (weather['LONGITUDE'].notnull())].shape

(20536, 66)

In [110]:
weather['LOCATION'] = '(' + weather['LATITUDE'].astype(str) + ', ' + weather['LONGITUDE'].astype(str) + ')'

We will delete the **ELEVATION** column since it cannot be related to the previous dataset and the information it gives is not useful for our visualization purpose.

In [111]:
weather.drop(columns=['LATITUDE', 'LONGITUDE', 'ELEVATION'], inplace=True)

cols = list(weather.columns)
cols = [cols[-1]] + cols[:-1]
weather = weather[cols]

Since we have *LOCATION*, we don't need *STATION* code or *NAME* given that we are not interested in knowing where the measurement was taken. We will delete both columns.

In [112]:
weather.drop(columns=['STATION', 'NAME'], inplace=True)

#### Observational attributes

Since we are analyzing the summer periods of 2018 and 2020, it is unlikely to have observations of snow in the records. Under this assumption we can drop the **SNOW** and **SNWD** attributes.

In [113]:
weather['SNOW'].unique(), weather['SNWD'].unique()

(array([ 0., nan]), array([ 0., nan]))

As we supposed, the only values registered for **SNOW** and **SNWD** are nulls or 0. Therefore, we can drop these columns and their respective flags.

In [114]:
weather.drop(columns=['SNOW', 'SNWD', 'SNOW_ATTRIBUTES', 'SNWD_ATTRIBUTES'], inplace=True)

From the data exploration done in the previous section, we can remove the following attributes: **DAPR**, **DASF**, **PGTM**, **MDPR**, **MDSF**, **PSUN**, **WDF***, **WSF***, **WESD**, **WESF** and their respective flags.

In [115]:
weather.drop(columns=['DAPR',
                        'DASF',
                        'PGTM',
                        'MDPR',
                        'MDSF',
                        'PSUN',
                        'TAVG',
                        'WDF2',
                        'WDF5',
                        'WESD',
                        'WESF',
                        'WSF2',
                        'WSF5',
                        'DAPR_ATTRIBUTES',
                        'DASF_ATTRIBUTES',
                        'MDPR_ATTRIBUTES',
                        'MDSF_ATTRIBUTES',
                        'PSUN_ATTRIBUTES',
                        'WDF2_ATTRIBUTES',
                        'WDF5_ATTRIBUTES',
                        'WESD_ATTRIBUTES',
                        'WESF_ATTRIBUTES',
                        'WSF2_ATTRIBUTES',
                        'WSF5_ATTRIBUTES'
                        ], inplace=True)

weather.columns

Index(['LOCATION', 'DATE', 'AWND', 'AWND_ATTRIBUTES', 'PGTM_ATTRIBUTES',
       'PRCP', 'PRCP_ATTRIBUTES', 'TAVG', 'TAVG_ATTRIBUTES', 'TMAX',
       'TMAX_ATTRIBUTES', 'TMIN', 'TMIN_ATTRIBUTES', 'TOBS', 'TOBS_ATTRIBUTES',
       'TSUN', 'TSUN_ATTRIBUTES', 'WT01', 'WT01_ATTRIBUTES', 'WT02',
       'WT02_ATTRIBUTES', 'WT03', 'WT03_ATTRIBUTES', 'WT04', 'WT04_ATTRIBUTES',
       'WT05', 'WT05_ATTRIBUTES', 'WT06', 'WT06_ATTRIBUTES', 'WT08',
       'WT08_ATTRIBUTES', 'WT09', 'WT09_ATTRIBUTES', 'WT11',
       'WT11_ATTRIBUTES'],
      dtype='object')

Regarding the **WT*** columns, each of them describes a specific weather condition that could be referred to as an **ADVERSE CONDITION**, given the descriptions from the documentation.


| Attribute | Description |
| --- | --- |
| *WT01* | Fog, ice fog, or freezing fog (may include heavy fog) |
| *WT02* | Heavy fog or heavy freezing fog (not always distinguished from fog) |
| *WT03* | Thunder |
| *WT04* | Ice pellets, sleet, snow pellets, or small hail |
| *WT05* | Hail (may include small hail) |
| *WT06* | Glaze or rime |
| *WT08* | Smoke or haze |
| *WT09* | Blowing or drifting snow |
| *WT11* | High or damaging winds |

In [116]:
weather[weather['WT01'].isnull() & weather['WT02'].isnull() & weather['WT03'].isnull() & weather['WT04'].isnull() & weather['WT05'].isnull() & weather['WT06'].isnull() & weather['WT08'].isnull() & weather['WT09'].isnull() & weather['WT11'].isnull()].shape

(19894, 35)

The check shows 19,894 rows in which every **WT*** column is null. This suggests that a ``nan`` value stands for ``False`` for that type of adverse weather condition, and that a row with only ``nan`` values refers to a *normal* weather day. Therefore, we will treat the ``nan`` values as ``0``.

In [117]:
weather['WT01'].unique(), weather['WT02'].unique(), weather['WT03'].unique(), weather['WT04'].unique(), weather['WT05'].unique(), weather['WT06'].unique(), weather['WT08'].unique(), weather['WT09'].unique(), weather['WT11'].unique()

(array([nan,  1.]),
 array([nan,  1.]),
 array([nan,  1.]),
 array([nan]),
 array([nan,  1.]),
 array([nan]),
 array([nan,  1.]),
 array([nan]),
 array([nan,  1.]))

Knowing this, the final decision was to merge all the **WT*** columns into a single one called **ADVERSE_CONDITIONS**, with boolean values (0 or 1) indicating whether the day was normal or not in terms of weather conditions. This loses the more detailed description of the weather conditions, but it is more useful for our purpose.

In [118]:
wt_columns = ['WT01', 'WT02', 'WT03', 'WT04', 'WT05', 'WT06', 'WT08', 'WT09', 'WT11']

weather['ADVERSE_CONDITIONS'] = weather[wt_columns].isnull().all(axis=1).astype(int)
weather['ADVERSE_CONDITIONS'] = 1 - weather['ADVERSE_CONDITIONS']

weather['ADVERSE_CONDITIONS'].value_counts()

ADVERSE_CONDITIONS
0    19894
1      642
Name: count, dtype: int64
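
The two-step computation above (all null, then inverted) is equivalent to asking whether any **WT*** value is present. A small sketch on hypothetical toy data compares both forms:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): row 0 is a normal day, rows 1 and 2 have at least one flag set.
toy = pd.DataFrame({
    "WT01": [np.nan, 1.0, np.nan],
    "WT03": [np.nan, np.nan, 1.0],
    "WT08": [np.nan, 1.0, np.nan],
})
wt_cols = ["WT01", "WT03", "WT08"]

# Form used in the notebook: 1 minus "all columns are null".
two_step = 1 - toy[wt_cols].isnull().all(axis=1).astype(int)
# Equivalent single expression: "any column is not null".
one_step = toy[wt_cols].notna().any(axis=1).astype(int)
print(one_step.tolist())
```

Both expressions give the same 0/1 column; the single-expression form avoids the intermediate inversion.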

The flag attributes of the **WT*** columns will be removed given that the weather conditions columns were aggregated into a single column and thus the flags are not meaningful anymore.

In [119]:
wt_columns_att = [col + '_ATTRIBUTES' for col in wt_columns]

weather.drop(columns=wt_columns, inplace=True)
weather.drop(columns=wt_columns_att, inplace=True)

weather.columns

Index(['LOCATION', 'DATE', 'AWND', 'AWND_ATTRIBUTES', 'PGTM_ATTRIBUTES',
       'PRCP', 'PRCP_ATTRIBUTES', 'TAVG', 'TAVG_ATTRIBUTES', 'TMAX',
       'TMAX_ATTRIBUTES', 'TMIN', 'TMIN_ATTRIBUTES', 'TOBS', 'TOBS_ATTRIBUTES',
       'TSUN', 'TSUN_ATTRIBUTES', 'ADVERSE_CONDITIONS'],
      dtype='object')

#### Flags attributes

The remaining ***_ATTRIBUTES** columns have not been handled so far. As described in the data exploration, their values document the measurement method, quality checks and source of each observation.

For the visualization objective, these attributes do not provide any useful information and could be considered as metadata. Therefore, we will remove all the ***_ATTRIBUTES** columns.

In [120]:
cols = weather.columns.tolist()

for col in cols:
    if '_ATTRIBUTES' in col:
        print(col, weather[col].unique(), sep=':')

AWND_ATTRIBUTES:[nan ',,W']
PGTM_ATTRIBUTES:[nan ',,W']
PRCP_ATTRIBUTES:[',,7,0700' 'T,,7,0700' ',,N' 'T,,N' nan ',,W,2400' 'T,,W,2400' ',L,N'
 ',,7,0800' ',,W' 'T,,W' ',,N,' 'T,,N,' ',,7,' ',,7,0630' 'T,,7,0630'
 ',,Z,0700']
TAVG_ATTRIBUTES:[nan 'H,,S']
TMAX_ATTRIBUTES:[',,7' nan ',,W' ',,Z']
TMIN_ATTRIBUTES:[',,7' nan ',,W']
TOBS_ATTRIBUTES:[',,7,0700' nan ',,7,0800' ',,7,' ',,7,0630']
TSUN_ATTRIBUTES:[nan]


In [121]:
wt_columns_att = [col for col in weather.columns if '_ATTRIBUTES' in col]
weather.drop(columns=wt_columns_att, inplace=True)

### Missing values

Based on the sparseness described in the data exploration, it is logical to check for columns with only missing values in order to remove them.

In [122]:
weather.shape

(20536, 10)

As the null counts below show, the **TSUN** column has only missing values. Therefore, we will remove it.

In [123]:
weather.isnull().sum()

LOCATION                  0
DATE                      0
AWND                  18608
PRCP                    297
TAVG                  19804
TMAX                  17221
TMIN                  17233
TOBS                  19244
TSUN                  20536
ADVERSE_CONDITIONS        0
dtype: int64

In [124]:
weather.drop(columns=['TSUN'], inplace=True)
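
Here the empty column was identified by reading the null counts. As a hedged alternative, columns that are entirely null can also be found and dropped programmatically; the toy frame below is hypothetical:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): TSUN has no observations at all.
toy = pd.DataFrame({
    "PRCP": [0.0, np.nan, 2.5],
    "TSUN": [np.nan, np.nan, np.nan],
    "TMAX": [25.0, 27.5, np.nan],
})

# Names of the columns in which every value is null.
empty_cols = toy.columns[toy.isnull().all()].tolist()
# Drop every column that contains only nulls; partially filled columns are kept.
cleaned = toy.dropna(axis=1, how="all")
print(empty_cols, cleaned.columns.tolist())
```

This generalizes to any number of fully empty columns without naming them by hand.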

For *PRCP*, we will assume that null values mean 0: if there is no record of precipitation, it is because there was none. Only 297 of the 20,536 rows (about 1.4%) are null, so the risk of this assumption is low, and lower still given that the data is from summer, when rain is less likely.

In [125]:
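# Assign the result back instead of using inplace=True on a column selection,
# which recent pandas versions warn about and may not apply to the frame.
weather['PRCP'] = weather['PRCP'].fillna(0.0)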

Regarding temperature, the large number of missing values is a problem. We will impute the missing values of each attribute (*TMAX*, *TMIN*, *TOBS*) with that attribute's mean over all records of the same day.

In [126]:
weather['TMAX'] = weather.groupby('DATE')['TMAX'].transform(lambda x: x.fillna(x.mean()))
weather['TMIN'] = weather.groupby('DATE')['TMIN'].transform(lambda x: x.fillna(x.mean()))
weather['TOBS'] = weather.groupby('DATE')['TOBS'].transform(lambda x: x.fillna(x.mean()))
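
The per-day imputation can be illustrated on a small hypothetical frame. The sketch also shows a limitation worth keeping in mind: a day on which no record has a value keeps its nulls, because the mean of that group is itself null.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): day d1 has two readings, day d2 has none.
toy = pd.DataFrame({
    "DATE": ["d1", "d1", "d1", "d2", "d2"],
    "TMAX": [20.0, np.nan, 30.0, np.nan, np.nan],
})

# Mean of TMAX per DATE, broadcast back to the original row order.
daily_mean = toy.groupby("DATE")["TMAX"].transform("mean")
toy["TMAX_FILLED"] = toy["TMAX"].fillna(daily_mean)
print(toy["TMAX_FILLED"].tolist())
```

The missing reading on `d1` becomes 25.0, the mean of 20.0 and 30.0, while both `d2` rows stay null. Checking `isnull().sum()` after imputing, as done in the next cell, catches that case.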

In [128]:
weather.isnull().sum()

LOCATION                  0
DATE                      0
AWND                  18608
PRCP                      0
TMAX                      0
TMIN                      0
TOBS                      0
ADVERSE_CONDITIONS        0
dtype: int64

We will apply the same strategy on AWND to get rid of the missing values.

In [129]:
weather['AWND'] = weather.groupby('DATE')['AWND'].transform(lambda x: x.fillna(x.mean()))

In [130]:
weather.isnull().sum()

LOCATION              0
DATE                  0
AWND                  0
PRCP                  0
TMAX                  0
TMIN                  0
TOBS                  0
ADVERSE_CONDITIONS    0
dtype: int64

### Data transformation

We are interested in daily measures, so we will group the data by **DATE** and aggregate the values of the rest of the columns by the mean.

In [132]:
weather_aggregated = weather.copy()
weather_aggregated.drop(columns=['LOCATION', 'ADVERSE_CONDITIONS'], inplace=True)

In [133]:
weather_aggregated = weather_aggregated.groupby('DATE').mean().reset_index()
weather_aggregated.head()

| | DATE | AWND | PRCP | TMAX | TMIN | TOBS |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 2018-06-01 | 1.625 | 4.354762 | 24.921429 | 17.371429 | 18.683333 |
| 1 | 2018-06-02 | 3.4875 | 5.296296 | 29.330769 | 18.8 | 21.825 |
| 2 | 2018-06-03 | 5.0 | 6.153165 | 23.8 | 12.353846 | 15.125 |
| 3 | 2018-06-04 | 2.8875 | 13.223256 | 20.7 | 10.392308 | 10.75 |
| 4 | 2018-06-05 | 3.4875 | 0.801266 | 23.092857 | 11.5 | 13.233333 |


Categorizing **AWND** turns the numerical wind speed into a small set of named classes, which are easier to read and interpret in further analysis. This categorization follows the [Beaufort Scale](https://en.wikipedia.org/wiki/Beaufort_scale), a measure used to estimate wind speed based on observed conditions at sea or on land.

In [134]:
def beaufort_scale(wind_speed_mps):
    if wind_speed_mps < 0.3:
        return "Calm"
    elif wind_speed_mps < 1.5:
        return "Light Air"
    elif wind_speed_mps < 3.4:
        return "Light Breeze"
    elif wind_speed_mps < 5.5:
        return "Gentle Breeze"
    elif wind_speed_mps < 8.0:
        return "Moderate Breeze"
    elif wind_speed_mps < 10.8:
        return "Fresh Breeze"
    elif wind_speed_mps < 13.9:
        return "Strong Breeze"
    elif wind_speed_mps < 17.2:
        return "Near Gale"
    elif wind_speed_mps < 20.8:
        return "Gale"
    elif wind_speed_mps < 24.5:
        return "Strong Gale"
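    elif wind_speed_mps < 28.5:
        return "Storm"
    elif wind_speed_mps < 32.7:
        # Beaufort force 11 sits between "Storm" and "Hurricane"
        return "Violent Storm"
    else:
        return "Hurricane"
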

weather_aggregated['BEAUFORT_SCALE'] = weather_aggregated['AWND'].apply(beaufort_scale)

weather_aggregated['BEAUFORT_SCALE'].unique()
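
As a vectorized alternative to applying a Python function row by row, the same binning can be sketched with `pd.cut`. The bin edges in m/s below follow the standard Beaufort table, including the "Violent Storm" class, and are stated here as an assumption rather than taken from the project code; the sample speeds are the **AWND** values from the first rows shown earlier.

```python
import numpy as np
import pandas as pd

# Lower edges in m/s for each Beaufort class (assumed from the standard table).
edges = [0, 0.3, 1.5, 3.4, 5.5, 8.0, 10.8, 13.9, 17.2, 20.8, 24.5, 28.5, 32.7, np.inf]
labels = [
    "Calm", "Light Air", "Light Breeze", "Gentle Breeze", "Moderate Breeze",
    "Fresh Breeze", "Strong Breeze", "Near Gale", "Gale", "Strong Gale",
    "Storm", "Violent Storm", "Hurricane",
]

speeds = pd.Series([1.625, 3.4875, 5.0, 2.8875])
# right=False makes each bin [lower, upper), matching the "<" comparisons in the function.
cats = pd.cut(speeds, bins=edges, labels=labels, right=False)
beaufort = cats.astype(str).tolist()
print(beaufort)
```

The result is a categorical column whose classes keep their natural order, which can be convenient for sorted legends in a visualization.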

The same was done for **PRCP**. The categorization follows the [World Meteorological Organization, 2018](https://www.researchgate.net/figure/Rain-intensity-classifications-according-to-the-World-Meteorological-Organization-2018_tbl1_353769617) classification. Its thresholds are hourly rates, so they are multiplied by 24 before being compared with the daily precipitation totals.

In [36]:
def rain_intensity_scale(prcp_mm):
    if prcp_mm < 2.5*24:
        return "Slight"
    elif prcp_mm < 10*24:
        return "Moderate"
    elif prcp_mm < 50*24:
        return "Heavy"
    else:
        return "Violent"
    
weather_aggregated['PRCP_SCALE'] = weather_aggregated['PRCP'].apply(rain_intensity_scale)

weather_aggregated['PRCP_SCALE'].unique()

array(['Slight'], dtype=object)
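
The single `'Slight'` class follows from the size of the daily thresholds. The sketch below works through the arithmetic, taking the hourly limits (2.5, 10 and 50 mm per hour) from the multipliers in the function above; `classify` and the sample values are hypothetical.

```python
HOURS_PER_DAY = 24

# Hourly upper limits in mm/h, as implied by the multipliers in rain_intensity_scale.
hourly_limits = {"Slight": 2.5, "Moderate": 10.0, "Heavy": 50.0}
# Scaling by 24 turns each hourly rate into a daily total in mm.
daily_limits = {name: rate * HOURS_PER_DAY for name, rate in hourly_limits.items()}
print(daily_limits)

def classify(prcp_mm):
    # Return the first class whose daily upper limit exceeds the value.
    for name, limit in daily_limits.items():
        if prcp_mm < limit:
            return name
    return "Violent"

sample = [4.354762, 13.223256, 75.0]
classes = [classify(v) for v in sample]
print(classes)
```

A day needs at least 60 mm of averaged precipitation to leave the `'Slight'` class, far above the daily means in the first rows of the aggregated data (at most about 13 mm), so a single class is what one would expect if the remaining days are similar.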

In [37]:
weather_aggregated['MEAN_TEMP'] = weather_aggregated[['TMAX', 'TMIN', 'TOBS']].mean(axis=1)
weather_aggregated.drop(columns=['TMAX', 'TMIN', 'TOBS'], inplace=True)

### Preprocessed data saving

In [38]:
weather.to_csv(f'{dir}/weather_clean.csv', index=False)
weather_aggregated.to_csv(f'{dir}/weather_aggregated.csv', index=False)