### This notebook is used to merge the processed data together and create training, validation, and testing sets

The goal is to create a big DataFrame containing data from 2016, 2017, and 2018 used for training.

We will also further process the 2019 data, which will be used for validation and testing.

In [1]:
import numpy as np
import pandas as pd
from zipfile import ZipFile
from tqdm import tqdm
import time
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import _pickle as cPickle
import bz2
# Removing scientific notation
pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [2]:
def compressed_pickle(title, data):
    with bz2.BZ2File('./data/output/' + title + '.pbz2', 'w') as f:
        cPickle.dump(data, f)

In [3]:
def decompress_pickle(file):
    # Use a context manager so the file handle is closed after loading
    with bz2.BZ2File(file, 'rb') as f:
        return cPickle.load(f)
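
As a quick sanity check of the bz2 + pickle approach used by the two helpers above, here is a minimal round-trip sketch. The function names `save_compressed` and `load_compressed`, the sample dictionary, and the temporary path are illustrative assumptions, not part of this notebook; the sketch uses the standard `pickle` module and writes to a temporary directory so that nothing in `./data/output/` is touched.

```python
import bz2
import pickle
import tempfile
from pathlib import Path


def save_compressed(path, data):
    # Hypothetical helper: pickle `data` into a bz2-compressed file
    with bz2.BZ2File(path, "wb") as f:
        pickle.dump(data, f)


def load_compressed(path):
    # Hypothetical helper: read the bz2-compressed pickle back
    with bz2.BZ2File(path, "rb") as f:
        return pickle.load(f)


# Made-up stand-in for a {station_code: data} dictionary
sample = {5002: [1, 2, 3], 5003: [4, 5]}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "sample.pbz2"
    save_compressed(target, sample)
    restored = load_compressed(target)
```

If the round trip works, `restored` compares equal to `sample`.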

In [4]:
_2016_data = decompress_pickle('./data/output/bixi_processed_2016.pbz2')
_2017_data = decompress_pickle('./data/output/bixi_processed_2017.pbz2')
_2018_data = decompress_pickle('./data/output/bixi_processed_2018.pbz2')
_2019_data = decompress_pickle('./data/output/bixi_processed_2019.pbz2')

Checking the number of stations per year. The count increases over the years.

In [5]:
len(_2016_data)

465

In [6]:
len(_2017_data)

546

In [7]:
len(_2018_data)

552

In [8]:
len(_2019_data)

619

Intersect the station codes from all four years and keep the common ones. These are the stations present in every year from 2016 through 2019.

In [9]:
_2016_stations = np.array(list(_2016_data.keys()))
_2017_stations = np.array(list(_2017_data.keys()))
_2018_stations = np.array(list(_2018_data.keys()))
_2019_stations = np.array(list(_2019_data.keys()))

common_stations = np.intersect1d(np.intersect1d(np.intersect1d(_2017_stations, _2018_stations), _2019_stations), _2016_stations)

In [10]:
len(common_stations)

463
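
The nested `np.intersect1d` calls can also be written as a fold over the yearly arrays, which may be easier to extend if more years are added. The sketch below uses made-up station codes in a hypothetical `stations_by_year` dictionary; it is an alternative formulation, not the code this notebook ran.

```python
from functools import reduce

import numpy as np

# Made-up station codes per year, for illustration only
stations_by_year = {
    2016: np.array([5002, 5003, 5004]),
    2017: np.array([5002, 5003, 5004, 6001]),
    2018: np.array([5002, 5004, 6001, 6002]),
    2019: np.array([5002, 5004, 6002, 6003]),
}

# Fold the pairwise intersection across all years;
# np.intersect1d returns the sorted, unique common values
common = reduce(np.intersect1d, stations_by_year.values())
```

With these sample values, only stations 5002 and 5004 appear in every year.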

In [11]:
common_stations[0]

5002

Checking whether the columns are the same. They differ because some years have extra one-hot encoded weather columns.

In [43]:
_2016_data[5002].shape

(4776, 32)

In [44]:
_2017_data[5002].shape

(4722, 33)

In [45]:
_2018_data[5002].shape

(4872, 37)

In [46]:
_2019_data[5002].shape

(4773, 34)
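
The shapes only show that the column counts differ. To see which columns a given year lacks, one option is to compare each year's columns against the union of all columns. The sketch below uses tiny made-up frames in a hypothetical `frames` dictionary rather than the real station data.

```python
import pandas as pd

# Tiny made-up frames standing in for one station's yearly data
frames = {
    2016: pd.DataFrame({"hour_trip_count": [3], "Clear": [1], "Rain": [0]}),
    2017: pd.DataFrame({"hour_trip_count": [5], "Clear": [0], "Rain": [1], "Fog": [0]}),
}

# Union of every column name seen in any year
all_columns = set().union(*(df.columns for df in frames.values()))

# For each year, the columns present elsewhere but absent here
missing_by_year = {
    year: sorted(all_columns - set(df.columns)) for year, df in frames.items()
}
```

Here the 2016 frame lacks `Fog`, while the 2017 frame lacks nothing.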

Merging the 2016, 2017, 2018, and 2019 data based on the common station codes

In [47]:
bixi_all_data = {}

for station_code in _2016_data:
    if station_code in common_stations:
        bixi_all_data[station_code] = _2016_data[station_code]

for station_code in _2017_data:
    if station_code in common_stations:
        df = bixi_all_data[station_code]
        new_concat = pd.concat([df, _2017_data[station_code]])
        bixi_all_data[station_code] = new_concat

for station_code in _2018_data:
    if station_code in common_stations:
        df = bixi_all_data[station_code]
        new_concat = pd.concat([df, _2018_data[station_code]])
        bixi_all_data[station_code] = new_concat

for station_code in _2019_data:
    if station_code in common_stations:
        df = bixi_all_data[station_code]
        new_concat = pd.concat([df, _2019_data[station_code]])
        bixi_all_data[station_code] = new_concat     

The merge creates *NaN* fields because some years are missing some of the weather columns. A missing one-hot column means that condition was never recorded that year, so we replace all *NaN* values with 0.

In [49]:
for station_code in bixi_all_data:
    bixi_all_data[station_code] = bixi_all_data[station_code].fillna(0)
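
To illustrate why concatenation introduces missing values, the sketch below stacks two made-up frames whose columns differ. The frames `a` and `b` and the `ignore_index=True` argument are assumptions for the example; the merge cell above keeps the original indices.

```python
import pandas as pd

# Made-up frames: the second year has an extra one-hot column, "Fog"
a = pd.DataFrame({"Year": [2016, 2016], "Clear": [1, 0]})
b = pd.DataFrame({"Year": [2017, 2017], "Clear": [0, 1], "Fog": [1, 0]})

# pd.concat aligns on column names; rows from `a` get NaN under "Fog"
merged = pd.concat([a, b], ignore_index=True)
nan_before = int(merged["Fog"].isna().sum())

# Replace the NaN values with 0, as done for the station data
filled = merged.fillna(0)
```

Note that the filled column ends up as floating point, because the NaN values forced that dtype during the concatenation.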

In [50]:
bixi_all_data[5002].isnull().sum()

hour_trip_count          0
Date/Time                0
Year                     0
Month                    0
Day                      0
Temp (°C)                0
Dew Point Temp (°C)      0
Rel Hum (%)              0
Wind Dir (10s deg)       0
Wind Spd (km/h)          0
Visibility (km)          0
Stn Press (kPa)          0
Hour                     0
Minute                   0
Second                   0
Day_of_year              0
Blowing Snow             0
Clear                    0
Cloudy                   0
Drizzle                  0
Fog                      0
Heavy Rain               0
Ice Pellets              0
Mainly Clear             0
Moderate Rain            0
Moderate Rain Showers    0
Mostly Cloudy            0
Rain                     0
Rain Showers             0
Snow                     0
Snow Showers             0
Thunderstorms            0
Haze                     0
Heavy Rain Showers       0
Moderate Drizzle         0
Blowing Dust             0
Freezing Drizzle         0
F

Split the data for training and testing

In [62]:
bixi_data_2019 = {}
bixi_data_2016_2017_2018 = {}

for station_code in bixi_all_data:
    station_df = bixi_all_data[station_code]
    # Split rows by year: 2019 on its own, 2016-2018 together
    bixi_data_2019[station_code] = station_df[station_df['Year'] == 2019]
    bixi_data_2016_2017_2018[station_code] = station_df[station_df['Year'].isin([2016, 2017, 2018])]

Making sure the years are correct

In [63]:
bixi_data_2019[5002]['Year'].value_counts()

2019    4773
Name: Year, dtype: int64

In [64]:
bixi_data_2016_2017_2018[5002]['Year'].value_counts()

2018    4872
2016    4776
2017    4722
Name: Year, dtype: int64
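
The year-based split can be sketched on a small made-up frame using a single boolean mask and its complement. This assumes the data contains no years other than 2016 to 2019; if it did, the complement would pick them up too, unlike the explicit filter in the split cell above.

```python
import pandas as pd

# Made-up rows standing in for one station's merged data
df = pd.DataFrame({
    "Year": [2016, 2017, 2018, 2019, 2019],
    "hour_trip_count": [1, 2, 3, 4, 5],
})

# One mask selects 2019; its complement selects everything else
is_2019 = df["Year"] == 2019
held_out = df[is_2019]
train = df[~is_2019]
```

Every row lands in exactly one of the two parts, so their lengths sum to the length of the original frame.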

# Export

In [65]:
compressed_pickle('bixi_data_2019', bixi_data_2019)

In [66]:
compressed_pickle('bixi_data_2016_2017_2018', bixi_data_2016_2017_2018)