**Goals**

Build a model that supports load-balancing decisions for the system by forecasting
demand from historical data. Here, load balancing means planning the transport of
bicycles from one station to another.

**Outcome**

A model that indicates the optimal movements of bicycles between stations,
and at what times or frequencies they should be made, to support the best possible system planning.

# Importing Modules & Data

In [1]:
import pandas as pd 
import numpy as np
import geopandas as gpd
import fiona
import os
import zipfile
from folium import Map, CircleMarker, Vega, Popup, Marker, PolyLine, Icon, Choropleth, LayerControl
from folium.plugins import MarkerCluster, HeatMap, BeautifyIcon
import shapely
import matplotlib
from ipywidgets import interact
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.tsa 
from datetime import datetime
import missingno as msno
import re

In [2]:
pd.set_option('display.max_rows', 100)

## Class Databases 

In [3]:
gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
class Databases:
    def __init__(self, csv, anva, smm, metro):
        self.csv = csv
        self.anva = anva 
        self.smm = smm
        self.metro = metro

    def extract_csv(self):
        file = pd.read_csv(f'data/bike_loans/{self.csv}', sep=';')
        file['Fecha_Prestamo'] = pd.to_datetime(file['Fecha_Prestamo'])
        file['Fecha_Devolucion'] = pd.to_datetime(file['Fecha_Devolucion'])
        return file
        
    
    def extract_anva(self):
        bus_routes_AMVA = gpd.GeoDataFrame()
        for root, dirs, files in os.walk(f'data/bus_routes/medellin_bus_routes/{self.anva}'):
            for file in files:
                path = os.path.join(root, file)
                df = gpd.read_file(path, driver='KML')
                bus_routes_AMVA = gpd.GeoDataFrame(pd.concat([bus_routes_AMVA, df], ignore_index=True))
        return bus_routes_AMVA
    
    def extract_smm(self):
        smm_bus_routes = gpd.GeoDataFrame()
        for root, dirs, files in os.walk(f'data/bus_routes/medellin_bus_routes/{self.smm}'):
            for file in files:
                path = os.path.join(root, file)
                df = gpd.read_file(path, driver='KML')
                smm_bus_routes = gpd.GeoDataFrame(pd.concat([smm_bus_routes, df], ignore_index=True))
        return smm_bus_routes
    
    def extract_metro(self):
        metro_routes = gpd.GeoDataFrame()
        for root, dirs, files in os.walk(f'data/medellin_metro/{self.metro}'):
            for file in files:
                path = os.path.join(root, file)
                print(path)
                if path.endswith('.kmz'):
                    kmz = zipfile.ZipFile(path, 'r')
                    kml = kmz.open('doc.kml', 'r')
                    df = gpd.read_file(kml, driver='KML')
                    metro_routes = gpd.GeoDataFrame(pd.concat([metro_routes, df], ignore_index=True))
        return metro_routes

In [4]:
databases = Databases('bike_loans.csv','Kml Ruta AMVA','kml Ruta SMM','medellin_metro_stations')

In [5]:
# extracting csv bikes with datetime data
bikes = databases.extract_csv()

In [6]:
bikes.dtypes #checking feature types

Id_Historico_Prestamo               int64
Id_Usuario                          int64
Id_Tag_Bicicleta                   object
Fecha_Prestamo             datetime64[ns]
Fecha_Devolucion           datetime64[ns]
Id_Aparcamiento_Origen              int64
Posicion_Origen                     int64
Id_Aparcamiento_Destino           float64
Posicion_Destino                  float64
Importe                           float64
DescripcionImporte                float64
operario                           object
Num_Bici_Hist                     float64
dtype: object

# Data Cleaning

In order to capture **trends and seasonality** properly, we decided to focus on **pre-pandemic** data. The first diagnosed case of Covid-19 in Colombia was on March 6th, 2020. Therefore, to avoid any bias in the data, we focus on the data prior to 2020.

Following some review, we decided to focus entirely on the data from **2019**. There are two reasons for this. First, we believe that data from the beginning of the project might be unrepresentative of how the system was operated just before COVID. Second, for dimensionality purposes: the more rows and dates we turn into time series, the greater the amount of data to process.

**Homemade Dict**

In [7]:
# Id_Historico_Prestamo - ID of the loan
# Id_Usuario - ID of the user
# Id_Tag_Bicicleta - ID of the bicycle
# Fecha_Prestamo - loan start date/time
# Fecha_Devolucion - loan end date/time
# Id_Aparcamiento_Origen - ID of the origin parking station
# Posicion_Origen - position of the bike in the bike station
# Importe - not sure
# DescripcionImporte - not sure either
# operario - whether the station was automatic or run manually by someone
# Num_Bici_Hist - not entirely sure, but could be the number of times the bike was used

In [8]:
bikes_2019_loans_only = bikes[(bikes['Fecha_Prestamo'].dt.year == 2019)]

We are focusing solely on 2019, so we keep only the trips whose bikes were loaned AND returned in 2019. For example, if someone borrows a bike on New Year's Eve and returns it in 2020, that trip is not counted.

In [170]:
bikes_2019 = bikes[(bikes['Fecha_Prestamo'].dt.year == 2019) & (bikes['Fecha_Devolucion'].dt.year == 2019)]
bikes_2019

Unnamed: 0,Id_Historico_Prestamo,Id_Usuario,Id_Tag_Bicicleta,Fecha_Prestamo,Fecha_Devolucion,Id_Aparcamiento_Origen,Posicion_Origen,Id_Aparcamiento_Destino,Posicion_Destino,Importe,DescripcionImporte,operario,Num_Bici_Hist
440864,10948569,20272,00FB73222B5E,2019-02-20 14:08:04,2019-02-20 14:26:08,9,1,43.0,1.0,0.0,,anderson.rojas,1366.0
586995,12896369,135209,002772222B5E,2019-08-22 09:19:51,2019-08-22 09:20:31,44,15,44.0,15.0,0.0,,,976.0
726893,12896264,26536,00094A72BF5B,2019-08-22 09:11:22,2019-08-22 09:20:53,9,1,70.0,1.0,0.0,,seguro.sorelly,2699.0
1028721,10948678,124512,00233C222B5E,2019-02-20 14:15:54,2019-02-20 14:26:28,89,2,56.0,1.0,0.0,,,638.0
2111125,10516625,40806,009F53222B5E,2019-01-02 12:57:01,2019-01-02 13:51:38,12,1,22.0,,0.0,,monitor.jannetv,558.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
11767100,14454800,27944,0000F69A5A88,2019-12-31 13:58:55,2019-12-31 14:25:38,40,1,50.0,1.0,0.0,,claudia.marin,541.0
11767101,14454778,108800,00029A222B5E,2019-12-31 13:52:48,2019-12-31 14:40:52,28,1,50.0,1.0,0.0,,claudia.marin,715.0
11767102,14454793,98449,00001619CE0A,2019-12-31 13:55:47,2019-12-31 14:41:39,113,1,28.0,1.0,0.0,,edison.graciano,3497.0
11767103,14454774,129665,000036885A88,2019-12-31 13:51:28,2019-12-31 16:27:12,39,15,41.0,15.0,0.0,,,534.0


## Checking for NaN Values

Below we count the NaN values in the station-of-origin and station-of-destination columns. There are no NaN values for the station of origin and 1009 for the station of destination. Our assumption is that these bikes were lost or dumped (thrown into the river, for instance) and so never reached a destination station.

In [171]:
bikes_2019['Id_Aparcamiento_Origen'].isna().sum(), bikes_2019['Id_Aparcamiento_Destino'].isna().sum()

(0, 1009)

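The original intention was to fill those NaN values with -1 for conversion purposes. As written, the cell below calls fillna(np.NaN), which leaves the values as NaN, and the .apply function used later to create new features handles NaN explicitly.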

In [172]:
bikes_2019['Id_Aparcamiento_Destino'] = bikes_2019['Id_Aparcamiento_Destino'].fillna(np.NaN)
bikes_2019['Posicion_Destino'] = bikes_2019['Posicion_Destino'].fillna(np.NaN)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  bikes_2019['Id_Aparcamiento_Destino'] = bikes_2019['Id_Aparcamiento_Destino'].fillna(np.NaN)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  bikes_2019['Posicion_Destino'] = bikes_2019['Posicion_Destino'].fillna(np.NaN)
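
The warning above appears because `bikes_2019` was created by boolean indexing on `bikes`, so pandas cannot tell whether a later column assignment is meant to reach the original frame. A minimal sketch of one common way to avoid it is to take an explicit `.copy()` of the filtered rows. The small `loans` frame and the `has_destination` column are made up for illustration and only mimic two columns of the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the loans table (illustrative values only).
loans = pd.DataFrame({
    "Fecha_Prestamo": pd.to_datetime(
        ["2018-12-31 23:50", "2019-02-20 14:08", "2019-08-22 09:19"]
    ),
    "Id_Aparcamiento_Destino": [12.0, 43.0, np.nan],
})

# .copy() makes the filtered frame independent of `loans`, so adding
# columns to it is unambiguous and does not trigger the warning.
loans_2019 = loans[loans["Fecha_Prestamo"].dt.year == 2019].copy()
loans_2019["has_destination"] = loans_2019["Id_Aparcamiento_Destino"].notna()

print(loans_2019)
```

The original `loans` frame is left untouched; only the copy gains the new column.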


In [173]:
bikes_2019.dtypes

Id_Historico_Prestamo               int64
Id_Usuario                          int64
Id_Tag_Bicicleta                   object
Fecha_Prestamo             datetime64[ns]
Fecha_Devolucion           datetime64[ns]
Id_Aparcamiento_Origen              int64
Posicion_Origen                     int64
Id_Aparcamiento_Destino           float64
Posicion_Destino                  float64
Importe                           float64
DescripcionImporte                float64
operario                           object
Num_Bici_Hist                     float64
dtype: object

We can see from above that some related columns have different types: Id_Aparcamiento_Destino is float64 while Id_Aparcamiento_Origen is int64. We cast the origin columns to float so that both sides match.

In [174]:
bikes_2019['Id_Aparcamiento_Origen'] = bikes_2019['Id_Aparcamiento_Origen'].astype(float)                                                                                

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  bikes_2019['Id_Aparcamiento_Origen'] = bikes_2019['Id_Aparcamiento_Origen'].astype(float)


In [175]:
bikes_2019['Posicion_Origen'] = bikes_2019['Posicion_Origen'].astype(float)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  bikes_2019['Posicion_Origen'] = bikes_2019['Posicion_Origen'].astype(float)
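
The destination columns are `float64` because a NumPy integer column cannot hold NaN. Casting the origin columns to float aligns the types. An alternative, sketched below on made-up values, is pandas' nullable `Int64` dtype, which keeps whole-number station ids and still allows missing entries. This is only an option, not what this notebook does.

```python
import numpy as np
import pandas as pd

# Illustrative station ids; the NaN stands for a loan with no destination.
destino = pd.Series([43.0, np.nan, 70.0])

# "Int64" (capital I) is pandas' nullable integer dtype: NaN becomes <NA>.
destino_int = destino.astype("Int64")

print(destino_int)
```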


## Feature Creation: Supply & Demand

In [176]:
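def create_supply_demand(x):
    # a valid (non-negative) station id means the event happened
    if x >= 0:
        return 1
    # NaN is the only value that is not equal to itself: no station recorded
    elif x != x:
        return 0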

The function above returns 1 or 0 depending on whether a row has a station recorded in the given column. Each time someone borrows a bike, this is counted as demand = 1 at the station of origin. When they return it, this is counted as supply = 1 at the station of destination. However, if the row has no station of destination, i.e. the bike was lost or thrown in the river, then supply = 0. We will use these flags later to find out whether some stations have an imbalance between demand and supply.
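
The same flags can also be computed without `.apply`. This is a minimal sketch on made-up values, assuming (as in the 2019 data) that station ids are never negative, so that "has a station" is equivalent to "is not NaN".

```python
import numpy as np
import pandas as pd

# Illustrative rows: the second loan never reached a destination station.
trips = pd.DataFrame({
    "Id_Aparcamiento_Origen": [9.0, 44.0, 12.0],
    "Id_Aparcamiento_Destino": [43.0, np.nan, 22.0],
})

# notna() gives True/False per row; astype(int) turns that into 1/0.
trips["demand"] = trips["Id_Aparcamiento_Origen"].notna().astype(int)
trips["supply"] = trips["Id_Aparcamiento_Destino"].notna().astype(int)

print(trips[["demand", "supply"]])
```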

In [177]:
bikes_2019['demand'] = bikes_2019['Id_Aparcamiento_Origen'].apply(create_supply_demand)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  bikes_2019['demand'] = bikes_2019['Id_Aparcamiento_Origen'].apply(create_supply_demand)


In [178]:
bikes_2019['supply'] = bikes_2019['Id_Aparcamiento_Destino'].apply(create_supply_demand)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  bikes_2019['supply'] = bikes_2019['Id_Aparcamiento_Destino'].apply(create_supply_demand)


In [179]:
# this should equal 1009 because there are 1009 NaN values
bikes_2019['supply'].isin([0]).sum()

1009

### Creating new features - features to groupby 

In [180]:
bikes_2019['loan_station_date_time'] = bikes_2019['Id_Aparcamiento_Origen'].astype('string') + " " +\
        bikes_2019['Fecha_Prestamo'].dt.date.astype('string') + " " +\
        bikes_2019['Fecha_Prestamo'].dt.hour.astype('string')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  bikes_2019['loan_station_date_time'] = bikes_2019['Id_Aparcamiento_Origen'].astype('string') + " " +\


In [181]:
bikes_2019['deposit_station_date_time'] = bikes_2019['Id_Aparcamiento_Destino'].astype('string') + " " +\
        bikes_2019['Fecha_Devolucion'].dt.date.astype('string') + " " + \
        bikes_2019['Fecha_Devolucion'].dt.hour.astype('string')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  bikes_2019['deposit_station_date_time'] = bikes_2019['Id_Aparcamiento_Destino'].astype('string') + " " +\


In [182]:
bikes_2019

Unnamed: 0,Id_Historico_Prestamo,Id_Usuario,Id_Tag_Bicicleta,Fecha_Prestamo,Fecha_Devolucion,Id_Aparcamiento_Origen,Posicion_Origen,Id_Aparcamiento_Destino,Posicion_Destino,Importe,DescripcionImporte,operario,Num_Bici_Hist,demand,supply,loan_station_date_time,deposit_station_date_time
440864,10948569,20272,00FB73222B5E,2019-02-20 14:08:04,2019-02-20 14:26:08,9.0,1.0,43.0,1.0,0.0,,anderson.rojas,1366.0,1,1,9.0 2019-02-20 14,43.0 2019-02-20 14
586995,12896369,135209,002772222B5E,2019-08-22 09:19:51,2019-08-22 09:20:31,44.0,15.0,44.0,15.0,0.0,,,976.0,1,1,44.0 2019-08-22 9,44.0 2019-08-22 9
726893,12896264,26536,00094A72BF5B,2019-08-22 09:11:22,2019-08-22 09:20:53,9.0,1.0,70.0,1.0,0.0,,seguro.sorelly,2699.0,1,1,9.0 2019-08-22 9,70.0 2019-08-22 9
1028721,10948678,124512,00233C222B5E,2019-02-20 14:15:54,2019-02-20 14:26:28,89.0,2.0,56.0,1.0,0.0,,,638.0,1,1,89.0 2019-02-20 14,56.0 2019-02-20 14
2111125,10516625,40806,009F53222B5E,2019-01-02 12:57:01,2019-01-02 13:51:38,12.0,1.0,22.0,,0.0,,monitor.jannetv,558.0,1,1,12.0 2019-01-02 12,22.0 2019-01-02 13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11767100,14454800,27944,0000F69A5A88,2019-12-31 13:58:55,2019-12-31 14:25:38,40.0,1.0,50.0,1.0,0.0,,claudia.marin,541.0,1,1,40.0 2019-12-31 13,50.0 2019-12-31 14
11767101,14454778,108800,00029A222B5E,2019-12-31 13:52:48,2019-12-31 14:40:52,28.0,1.0,50.0,1.0,0.0,,claudia.marin,715.0,1,1,28.0 2019-12-31 13,50.0 2019-12-31 14
11767102,14454793,98449,00001619CE0A,2019-12-31 13:55:47,2019-12-31 14:41:39,113.0,1.0,28.0,1.0,0.0,,edison.graciano,3497.0,1,1,113.0 2019-12-31 13,28.0 2019-12-31 14
11767103,14454774,129665,000036885A88,2019-12-31 13:51:28,2019-12-31 16:27:12,39.0,15.0,41.0,15.0,0.0,,,534.0,1,1,39.0 2019-12-31 13,41.0 2019-12-31 16


In [183]:
# we created new features by combining several columns into a single key;
# the goal is to group by station, date and hour
bikes_2019.iloc[:,-2:-1]

Unnamed: 0,loan_station_date_time
440864,9.0 2019-02-20 14
586995,44.0 2019-08-22 9
726893,9.0 2019-08-22 9
1028721,89.0 2019-02-20 14
2111125,12.0 2019-01-02 12
...,...
11767100,40.0 2019-12-31 13
11767101,28.0 2019-12-31 13
11767102,113.0 2019-12-31 13
11767103,39.0 2019-12-31 13
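
The string keys above exist only so that loans can be grouped by station, date and hour. The sketch below shows an alternative that groups on the columns directly after flooring the timestamp to the hour. The sample rows are made up, and `hour` is a hypothetical column name.

```python
import pandas as pd

# Illustrative loans: two from station 9 in the same hour, one from station 44.
trips = pd.DataFrame({
    "Id_Aparcamiento_Origen": [9.0, 9.0, 44.0],
    "Fecha_Prestamo": pd.to_datetime(
        ["2019-02-20 14:08:04", "2019-02-20 14:40:00", "2019-08-22 09:19:51"]
    ),
    "demand": [1, 1, 1],
})

# Truncate each timestamp to the start of its hour.
trips["hour"] = trips["Fecha_Prestamo"].dt.floor("h")

# One row per (station, hour) with the number of loans started there.
demand_per_hour = (
    trips.groupby(["Id_Aparcamiento_Origen", "hour"])["demand"].sum().reset_index()
)

print(demand_per_hour)
```

Grouping this way keeps the hour as a real timestamp, so the key does not have to be split back into station, date and time afterwards.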


## Our New DFs

### Demand DF

In [184]:
demand_per_station_per_hour = bikes_2019[['demand']].groupby(bikes_2019['loan_station_date_time']).sum()

In [185]:
demand_per_station_per_hour.reset_index(inplace=True)

In [186]:
demand_per_station_per_hour

Unnamed: 0,loan_station_date_time,demand
0,1.0 2019-01-02 10,5
1,1.0 2019-01-02 11,8
2,1.0 2019-01-02 12,17
3,1.0 2019-01-02 13,7
4,1.0 2019-01-02 14,12
...,...,...
280484,99.0 2019-12-31 11,8
280485,99.0 2019-12-31 12,13
280486,99.0 2019-12-31 13,6
280487,99.0 2019-12-31 8,1


In [187]:
demand_per_station_per_hour[['station','date','time']] = demand_per_station_per_hour.loan_station_date_time.str.split(" ",expand=True)

In [188]:
demand_per_station_per_hour

Unnamed: 0,loan_station_date_time,demand,station,date,time
0,1.0 2019-01-02 10,5,1.0,2019-01-02,10
1,1.0 2019-01-02 11,8,1.0,2019-01-02,11
2,1.0 2019-01-02 12,17,1.0,2019-01-02,12
3,1.0 2019-01-02 13,7,1.0,2019-01-02,13
4,1.0 2019-01-02 14,12,1.0,2019-01-02,14
...,...,...,...,...,...
280484,99.0 2019-12-31 11,8,99.0,2019-12-31,11
280485,99.0 2019-12-31 12,13,99.0,2019-12-31,12
280486,99.0 2019-12-31 13,6,99.0,2019-12-31,13
280487,99.0 2019-12-31 8,1,99.0,2019-12-31,8


### Supply DF

In [189]:
supply_per_station_per_hour = bikes_2019[['supply']].groupby(bikes_2019['deposit_station_date_time']).sum()

In [190]:
supply_per_station_per_hour.reset_index(inplace=True)

In [191]:
supply_per_station_per_hour

Unnamed: 0,deposit_station_date_time,supply
0,1.0 2019-01-02 10,4
1,1.0 2019-01-02 11,8
2,1.0 2019-01-02 12,12
3,1.0 2019-01-02 13,7
4,1.0 2019-01-02 14,8
...,...,...
298700,99.0 2019-12-31 5,1
298701,99.0 2019-12-31 6,1
298702,99.0 2019-12-31 7,1
298703,99.0 2019-12-31 8,5


In [192]:
supply_per_station_per_hour[['station','date','time']] = supply_per_station_per_hour.deposit_station_date_time.str.split(" ",expand=True)

In [193]:
supply_per_station_per_hour

Unnamed: 0,deposit_station_date_time,supply,station,date,time
0,1.0 2019-01-02 10,4,1.0,2019-01-02,10
1,1.0 2019-01-02 11,8,1.0,2019-01-02,11
2,1.0 2019-01-02 12,12,1.0,2019-01-02,12
3,1.0 2019-01-02 13,7,1.0,2019-01-02,13
4,1.0 2019-01-02 14,8,1.0,2019-01-02,14
...,...,...,...,...,...
298700,99.0 2019-12-31 5,1,99.0,2019-12-31,5
298701,99.0 2019-12-31 6,1,99.0,2019-12-31,6
298702,99.0 2019-12-31 7,1,99.0,2019-12-31,7
298703,99.0 2019-12-31 8,5,99.0,2019-12-31,8


### Cleaning DFs

In [194]:
df1 = demand_per_station_per_hour.copy()
df1

Unnamed: 0,loan_station_date_time,demand,station,date,time
0,1.0 2019-01-02 10,5,1.0,2019-01-02,10
1,1.0 2019-01-02 11,8,1.0,2019-01-02,11
2,1.0 2019-01-02 12,17,1.0,2019-01-02,12
3,1.0 2019-01-02 13,7,1.0,2019-01-02,13
4,1.0 2019-01-02 14,12,1.0,2019-01-02,14
...,...,...,...,...,...
280484,99.0 2019-12-31 11,8,99.0,2019-12-31,11
280485,99.0 2019-12-31 12,13,99.0,2019-12-31,12
280486,99.0 2019-12-31 13,6,99.0,2019-12-31,13
280487,99.0 2019-12-31 8,1,99.0,2019-12-31,8


In [195]:
df2 = supply_per_station_per_hour.copy()
df2

Unnamed: 0,deposit_station_date_time,supply,station,date,time
0,1.0 2019-01-02 10,4,1.0,2019-01-02,10
1,1.0 2019-01-02 11,8,1.0,2019-01-02,11
2,1.0 2019-01-02 12,12,1.0,2019-01-02,12
3,1.0 2019-01-02 13,7,1.0,2019-01-02,13
4,1.0 2019-01-02 14,8,1.0,2019-01-02,14
...,...,...,...,...,...
298700,99.0 2019-12-31 5,1,99.0,2019-12-31,5
298701,99.0 2019-12-31 6,1,99.0,2019-12-31,6
298702,99.0 2019-12-31 7,1,99.0,2019-12-31,7
298703,99.0 2019-12-31 8,5,99.0,2019-12-31,8


In [196]:
df1.dtypes , df2.dtypes

(loan_station_date_time    object
 demand                     int64
 station                   object
 date                      object
 time                      object
 dtype: object,
 deposit_station_date_time    object
 supply                        int64
 station                      object
 date                         object
 time                         object
 dtype: object)

In [199]:
def clean_df(df, cols, index):
    df[cols] = df[cols].astype(float)  # string to float
    df.iloc[:, index] = df.iloc[:, index].str.strip()  # strip leading and trailing whitespace
    return df

In [201]:
df1_clean = clean_df(df1,['station','time'], 0)
df2_clean = clean_df(df2,['station','time'], 0)

In [202]:
df1_clean

Unnamed: 0,loan_station_date_time,demand,station,date,time
0,1.0 2019-01-02 10,5,1.0,2019-01-02,10.0
1,1.0 2019-01-02 11,8,1.0,2019-01-02,11.0
2,1.0 2019-01-02 12,17,1.0,2019-01-02,12.0
3,1.0 2019-01-02 13,7,1.0,2019-01-02,13.0
4,1.0 2019-01-02 14,12,1.0,2019-01-02,14.0
...,...,...,...,...,...
280484,99.0 2019-12-31 11,8,99.0,2019-12-31,11.0
280485,99.0 2019-12-31 12,13,99.0,2019-12-31,12.0
280486,99.0 2019-12-31 13,6,99.0,2019-12-31,13.0
280487,99.0 2019-12-31 8,1,99.0,2019-12-31,8.0


In [203]:
df2_clean

Unnamed: 0,deposit_station_date_time,supply,station,date,time
0,1.0 2019-01-02 10,4,1.0,2019-01-02,10.0
1,1.0 2019-01-02 11,8,1.0,2019-01-02,11.0
2,1.0 2019-01-02 12,12,1.0,2019-01-02,12.0
3,1.0 2019-01-02 13,7,1.0,2019-01-02,13.0
4,1.0 2019-01-02 14,8,1.0,2019-01-02,14.0
...,...,...,...,...,...
298700,99.0 2019-12-31 5,1,99.0,2019-12-31,5.0
298701,99.0 2019-12-31 6,1,99.0,2019-12-31,6.0
298702,99.0 2019-12-31 7,1,99.0,2019-12-31,7.0
298703,99.0 2019-12-31 8,5,99.0,2019-12-31,8.0


In [204]:
pattern = re.compile(r'2019.*') 

In [205]:
def year_month_day_hour(x):
    return pattern.search(x).group()

In [206]:
df1_clean['year_month_day_hour'] = df1_clean['loan_station_date_time'].apply(year_month_day_hour)

In [207]:
df2_clean['year_month_day_hour'] = df2_clean['deposit_station_date_time'].apply(year_month_day_hour)

In [208]:
def drop_col(df):
    # drop the first column (the combined station/date/hour key)
    return df.iloc[:,1:]

In [209]:
df1_clean_drop = drop_col(df1_clean)
df2_clean_drop = drop_col(df2_clean)

In [210]:
def convert_to_date(x):
    return datetime.strptime(x, '%Y-%m-%d %H')

In [211]:
df1_clean_drop['year_month_day_hour'] = df1_clean_drop['year_month_day_hour'].apply(convert_to_date)
df2_clean_drop['year_month_day_hour'] = df2_clean_drop['year_month_day_hour'].apply(convert_to_date)
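
The pattern `2019.*` works because every key contains the year 2019. Below is a sketch of a variant that does not depend on the year and instead splits off the station prefix at the first space. The sample keys are copied from the tables above, and `stamp` and `parsed` are hypothetical names.

```python
from datetime import datetime

import pandas as pd

keys = pd.Series(["9.0 2019-02-20 14", "99.0 2019-12-31 8"])

# Split once at the first space and keep the "date hour" remainder.
stamp = keys.str.split(" ", n=1).str[1]

# Same format string as convert_to_date; %H also accepts a single-digit hour.
parsed = stamp.map(lambda s: datetime.strptime(s, "%Y-%m-%d %H"))

print(parsed)
```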

In [212]:
df1_final = df1_clean_drop.copy()
df2_final = df2_clean_drop.copy()

In [213]:
df1_final

Unnamed: 0,demand,station,date,time,year_month_day_hour
0,5,1.0,2019-01-02,10.0,2019-01-02 10:00:00
1,8,1.0,2019-01-02,11.0,2019-01-02 11:00:00
2,17,1.0,2019-01-02,12.0,2019-01-02 12:00:00
3,7,1.0,2019-01-02,13.0,2019-01-02 13:00:00
4,12,1.0,2019-01-02,14.0,2019-01-02 14:00:00
...,...,...,...,...,...
280484,8,99.0,2019-12-31,11.0,2019-12-31 11:00:00
280485,13,99.0,2019-12-31,12.0,2019-12-31 12:00:00
280486,6,99.0,2019-12-31,13.0,2019-12-31 13:00:00
280487,1,99.0,2019-12-31,8.0,2019-12-31 08:00:00


In [214]:
df2_final

Unnamed: 0,supply,station,date,time,year_month_day_hour
0,4,1.0,2019-01-02,10.0,2019-01-02 10:00:00
1,8,1.0,2019-01-02,11.0,2019-01-02 11:00:00
2,12,1.0,2019-01-02,12.0,2019-01-02 12:00:00
3,7,1.0,2019-01-02,13.0,2019-01-02 13:00:00
4,8,1.0,2019-01-02,14.0,2019-01-02 14:00:00
...,...,...,...,...,...
298700,1,99.0,2019-12-31,5.0,2019-12-31 05:00:00
298701,1,99.0,2019-12-31,6.0,2019-12-31 06:00:00
298702,1,99.0,2019-12-31,7.0,2019-12-31 07:00:00
298703,5,99.0,2019-12-31,8.0,2019-12-31 08:00:00


## Merging DFs