# JohnMcL's Machine Learning

# DublinBikes ADT
- Included the process for producing the ADT
- It's saved as a csv called __DbDynamicInfo.csv__
- So you can just import it as a dataframe if needed
- Only issue going forward is the amount of time it takes to add weather info to the dynamic info
- While not unsolvable, it may present an issue in the future as our data set grows
- If the ADT csv is updated, be prepared to wait 5-10 mins, maybe more for this process to complete
- If we can get merge_asof working in pandas instead, this would probably be faster
- However, there are issues with this approach where after the length of the weatherinfo table is passed, only the last row of the weather info table is used to update the dynamic info table
- Also, at the moment, only the station number is present in the ADT but this can be solved by pulling in info from static info and performing a merge on number if needs be

## All code before linear regression to do with cleaning and prep of data
## For machine learning just need to read in NewDbADT.csv
[Linear Regression Start](#linreg)

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from patsy import dmatrices
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score 
from sklearn.utils import shuffle

- Get info from sql database
- Save info to csv to work with
- Uncomment these lines to update csv
- Note: DynamicInfo takes a while to get

In [3]:
"""
database = "mysql+pymysql://jmcl:Jm30079!@dbbikes.ca8jj5ksuurt.eu-west-1.rds.amazonaws.com:3306/dbbikes"
dfStaticInfo = pd.read_sql_table("DbStaticInfo", database)
dfDynamicInfo = pd.read_sql_table("DbDynamicInfo", database)
dfWeatherInfo = pd.read_sql_table("weather", database)
dfStaticInfo.to_csv("StaticInfoWorkData.csv")
dfDynamicInfo.to_csv("DynamicInfoWorkData.csv")
dfWeatherInfo.to_csv("WeatherWorkData.csv")
"""

'\ndatabase = "mysql+pymysql://jmcl:Jm30079!@dbbikes.ca8jj5ksuurt.eu-west-1.rds.amazonaws.com:3306/dbbikes"\ndfStaticInfo = pd.read_sql_table("DbStaticInfo", database)\ndfDynamicInfo = pd.read_sql_table("DbDynamicInfo", database)\ndfWeatherInfo = pd.read_sql_table("weather", database)\ndfStaticInfo.to_csv("StaticInfoWorkData.csv")\ndfDynamicInfo.to_csv("DynamicInfoWorkData.csv")\ndfWeatherInfo.to_csv("WeatherWorkData.csv")\n'

- Read DynamicInfo in from csv file

In [8]:
dfDynamicInfo = pd.read_csv("DynamicInfoWorkData.csv")

## DbDynamicInfo Work

- Check starting number and end number per 113 rows
- Starts with 42, ends with 88

In [9]:
dfDynamicInfo.shape

(1094292, 7)

In [10]:
dfDynamicInfo.head(113)

Unnamed: 0.1,Unnamed: 0,id,number,status,available_bike_stands,available_bikes,last_update
0,0,1,42,OPEN,16,14,2019-02-19 19:04:44
1,1,2,30,OPEN,16,4,2019-02-19 19:00:54
2,2,3,54,OPEN,31,2,2019-02-19 19:03:18
3,3,4,108,OPEN,24,15,2019-02-19 18:57:17
4,4,5,56,OPEN,40,0,2019-02-19 19:00:15
5,5,6,6,OPEN,11,9,2019-02-19 19:01:05
6,6,7,18,OPEN,16,14,2019-02-19 19:03:05
7,7,8,32,OPEN,9,21,2019-02-19 18:58:33
8,8,9,52,OPEN,32,0,2019-02-19 18:57:15
9,9,10,48,OPEN,35,5,2019-02-19 19:02:47


- Drop unnecessary columns
- Also drop available bike stands since we're going to represent available bikes as a percentage of total bikes

In [11]:
dfDynamicInfo.drop(["Unnamed: 0", "id", "status"], axis=1, inplace=True)

- Due to prior issues with recording timestamp, need to find where usefule time stamps begin
- Iterate through rows in steps of 113
- At each step, iterate over rows 1 by 1
- Compare last_updates, when timestamp is correct, the group of 113 rows should all have same timestamp
- Use boolean to check if target row reached
- breaks used because otherwise this takes forever

In [12]:
for i in range(0, 150000, 113):
    lastUpdateEqual = True
    for j in range(i, i+113-1):
        if dfDynamicInfo.iloc[j]["last_update"] != dfDynamicInfo.iloc[j+1]["last_update"]:
            lastUpdateEqual = False
            break
            
    if lastUpdateEqual:
        target = i
        break
    
target

103282

- Check that from before target, the timestamps are incorrect

In [13]:
dfDynamicInfo.iloc[target-113:target]

Unnamed: 0,number,available_bike_stands,available_bikes,last_update
103169,42,15,15,2019-02-22 23:30:19
103170,30,18,2,2019-02-22 23:21:49
103171,54,31,2,2019-02-22 23:25:06
103172,108,27,12,2019-02-22 23:23:36
103173,56,38,2,2019-02-22 23:25:53
103174,6,10,9,2019-02-22 23:24:26
103175,18,23,7,2019-02-22 23:30:53
103176,32,5,25,2019-02-22 23:26:08
103177,52,31,1,2019-02-22 23:25:43
103178,48,39,1,2019-02-22 23:27:19


- Note that timestamp is different for each row
- Due to taking timestamp from api
- Started taking timestamp from machine instead
- Check that the 113 rows past target have same timestamp
- Also check that the number for the first row is 42 and the number for the last row is 88

In [14]:
dfDynamicInfo.iloc[target:].head(113)

Unnamed: 0,number,available_bike_stands,available_bikes,last_update
103282,42,19,11,2019-02-23 23:05:30
103283,30,16,4,2019-02-23 23:05:30
103284,54,25,8,2019-02-23 23:05:30
103285,108,20,19,2019-02-23 23:05:30
103286,56,31,9,2019-02-23 23:05:30
103287,6,1,19,2019-02-23 23:05:30
103288,18,22,7,2019-02-23 23:05:30
103289,32,26,4,2019-02-23 23:05:30
103290,52,4,28,2019-02-23 23:05:30
103291,48,35,5,2019-02-23 23:05:30


- These rows all have the same timestamp
- This marks the point where timestamp issues were resolved
- Also, the first number is 42 and the last number is 88, this means that the target starts at a group of 113
- Next, check that the number of rows from the target to the end is divisible by 113
- This should be the case because the table is added to in groups of 113

In [15]:
dfDynamicInfo.iloc[target:].shape[0] % 113

0

- Number of remaining rows divisible by 113
- Can now remove all rows below target

In [16]:
dfDynamicInfo.drop(range(0, target), inplace=True)

In [17]:
dfDynamicInfo.head()

Unnamed: 0,number,available_bike_stands,available_bikes,last_update
103282,42,19,11,2019-02-23 23:05:30
103283,30,16,4,2019-02-23 23:05:30
103284,54,25,8,2019-02-23 23:05:30
103285,108,20,19,2019-02-23 23:05:30
103286,56,31,9,2019-02-23 23:05:30


In [18]:
dfDynamicInfo[dfDynamicInfo["number"]==42].head()

Unnamed: 0,number,available_bike_stands,available_bikes,last_update
103282,42,19,11,2019-02-23 23:05:30
103395,42,18,12,2019-02-23 23:16:49
103508,42,19,11,2019-02-23 23:22:59
103621,42,18,12,2019-02-23 23:23:07
103734,42,17,13,2019-02-23 23:29:41


- Still have issues with timing, gonna work on weather below and come back to it

In [19]:
dfDynamicInfo.tail()

Unnamed: 0,number,available_bike_stands,available_bikes,last_update
1094287,39,15,5,2019-03-27 17:01:23
1094288,83,34,6,2019-03-27 17:01:23
1094289,92,20,20,2019-03-27 17:01:23
1094290,21,5,25,2019-03-27 17:01:23
1094291,88,26,4,2019-03-27 17:01:23


## WeatherInfo Work 
- Read in weather info from csv

In [22]:
dfWeatherInfo = pd.read_csv("WeatherWorkData.csv")

- Checked head and saw that there was a jump in dates
- So expanded check to find where we started regularly getting weather info

In [23]:
dfWeatherInfo.shape

(1449, 9)

- Checked head and saw that there was a jump in dates
- So expanded check to find where we started regularly getting weather info

In [24]:
dfWeatherInfo.head(10)

Unnamed: 0.1,Unnamed: 0,number,datetime,overview,description,temperature,humidity,wind_speed,clouds
0,0,1,2019-02-22 11:00:00,Clouds,broken clouds,286.0,76.0,3.0,75.0
1,1,2,2019-02-22 11:00:00,Clouds,broken clouds,286.0,76.0,3.0,75.0
2,2,3,2019-02-22 11:00:00,Clouds,broken clouds,286.0,76.0,3.0,75.0
3,3,4,2019-02-22 12:00:00,Clouds,broken clouds,287.0,66.0,4.0,75.0
4,4,5,2019-02-25 14:30:00,Clouds,broken clouds,286.0,62.0,4.0,75.0
5,5,6,2019-02-25 14:30:00,Clouds,broken clouds,286.0,62.0,4.0,75.0
6,6,7,2019-02-25 15:00:00,Clouds,broken clouds,286.0,71.0,4.0,75.0
7,7,8,2019-02-25 15:30:00,Clouds,broken clouds,286.0,71.0,3.0,75.0
8,8,9,2019-02-25 16:00:00,Clouds,broken clouds,286.0,76.0,4.0,75.0
9,9,10,2019-02-25 16:30:00,Clouds,broken clouds,285.0,76.0,2.0,75.0


- Drop unecessary columns and rows 0 to 6

In [25]:
dfWeatherInfo.drop(["Unnamed: 0", "number"],axis=1, inplace=True)
dfWeatherInfo.drop(range(0,6), inplace=True)

- Rename datetime to last_update to make merging easier later on

In [26]:
dfWeatherInfo.rename(columns={"datetime":"last_update"}, inplace=True)

- dfWeatherInfo in kelvin, convert to degrees celsius

In [27]:
dfWeatherInfo["temperature"] = dfWeatherInfo["temperature"] - 273.15

In [28]:
dfWeatherInfo.head()

Unnamed: 0,last_update,overview,description,temperature,humidity,wind_speed,clouds
6,2019-02-25 15:00:00,Clouds,broken clouds,12.85,71.0,4.0,75.0
7,2019-02-25 15:30:00,Clouds,broken clouds,12.85,71.0,3.0,75.0
8,2019-02-25 16:00:00,Clouds,broken clouds,12.85,76.0,4.0,75.0
9,2019-02-25 16:30:00,Clouds,broken clouds,11.85,76.0,2.0,75.0
10,2019-02-25 17:48:51,Clouds,broken clouds,10.85,81.0,3.0,75.0


- Note that we didn't start collecting weather info till the 25-Feb
- This means we have no weather data for DynamicInfo before that time
- For the time being, I'll store the DynamicInfo without weather data in a separate df
- I'll then merge the weather df with the DynamicInfo that we have weather for

In [29]:
dfDynamicInfo[dfDynamicInfo["last_update"] > "2019-02-25 15:30:00"].head()

Unnamed: 0,number,available_bike_stands,available_bikes,last_update
136278,42,30,0,2019-02-25 15:34:42
136279,30,16,4,2019-02-25 15:34:42
136280,54,14,19,2019-02-25 15:34:42
136281,108,40,0,2019-02-25 15:34:42
136282,56,16,24,2019-02-25 15:34:42


In [30]:
dfDynamicInfoNoWeatherAvailable = dfDynamicInfo[dfDynamicInfo["last_update"] < "2019-02-25 15:00:00"]

In [31]:
dfDynamicInfoNoWeatherAvailable.head()

Unnamed: 0,number,available_bike_stands,available_bikes,last_update
103282,42,19,11,2019-02-23 23:05:30
103283,30,16,4,2019-02-23 23:05:30
103284,54,25,8,2019-02-23 23:05:30
103285,108,20,19,2019-02-23 23:05:30
103286,56,31,9,2019-02-23 23:05:30


In [32]:
dfDynamicInfo = dfDynamicInfo[dfDynamicInfo["last_update"] > "2019-02-25 15:00:00"]

In [33]:
dfDynamicInfo.head()

Unnamed: 0,number,available_bike_stands,available_bikes,last_update
136165,42,30,0,2019-02-25 15:20:29
136166,30,17,3,2019-02-25 15:20:29
136167,54,13,20,2019-02-25 15:20:29
136168,108,40,0,2019-02-25 15:20:29
136169,56,16,24,2019-02-25 15:20:29


In [34]:
dfDynamicInfo[dfDynamicInfo["number"]==42].head()

Unnamed: 0,number,available_bike_stands,available_bikes,last_update
136165,42,30,0,2019-02-25 15:20:29
136278,42,30,0,2019-02-25 15:34:42
136391,42,28,2,2019-02-25 15:39:43
136504,42,28,2,2019-02-25 15:44:44
136617,42,30,0,2019-02-25 15:49:45


- Note, this also sorts out our uneven timestamp issue in DynamicInfo

In [35]:
dfWeatherInfo.reset_index(drop=True, inplace=True)
dfWeatherInfo.head()

Unnamed: 0,last_update,overview,description,temperature,humidity,wind_speed,clouds
0,2019-02-25 15:00:00,Clouds,broken clouds,12.85,71.0,4.0,75.0
1,2019-02-25 15:30:00,Clouds,broken clouds,12.85,71.0,3.0,75.0
2,2019-02-25 16:00:00,Clouds,broken clouds,12.85,76.0,4.0,75.0
3,2019-02-25 16:30:00,Clouds,broken clouds,11.85,76.0,2.0,75.0
4,2019-02-25 17:48:51,Clouds,broken clouds,10.85,81.0,3.0,75.0


In [36]:
dfDynamicInfo.reset_index(drop=True, inplace=True)
dfDynamicInfo.head()

Unnamed: 0,number,available_bike_stands,available_bikes,last_update
0,42,30,0,2019-02-25 15:20:29
1,30,17,3,2019-02-25 15:20:29
2,54,13,20,2019-02-25 15:20:29
3,108,40,0,2019-02-25 15:20:29
4,56,16,24,2019-02-25 15:20:29


## Representing available_bikes as a percentage of totalbikes
- First read in StaticInfo

In [40]:
dfStaticInfo = pd.read_csv("StaticInfoWorkData.csv")

In [41]:
dfStaticInfo.shape

(113, 7)

In [42]:
dfStaticInfo.head()

Unnamed: 0.1,Unnamed: 0,number,name,address,lat,lng,bikestands
0,0,2,BLESSINGTON STREET,Blessington Street,53.35677,-6.26814,20
1,1,3,BOLTON STREET,Bolton Street,53.351181,-6.269859,20
2,2,4,GREEK STREET,Greek Street,53.346874,-6.272976,20
3,3,5,CHARLEMONT PLACE,Charlemont Street,53.330662,-6.260177,40
4,4,6,CHRISTCHURCH PLACE,Christchurch Place,53.343369,-6.27012,20


In [43]:
dropper = ["Unnamed: 0", "name", "lat", "lng"]
for col in dropper:
    dfStaticInfo.drop(col, axis=1 ,inplace=True)
dfStaticInfo.head()

Unnamed: 0,number,address,bikestands
0,2,Blessington Street,20
1,3,Bolton Street,20
2,4,Greek Street,20
3,5,Charlemont Street,40
4,6,Christchurch Place,20


In [44]:
dfDynamicInfo.head()

Unnamed: 0,number,available_bike_stands,available_bikes,last_update
0,42,30,0,2019-02-25 15:20:29
1,30,17,3,2019-02-25 15:20:29
2,54,13,20,2019-02-25 15:20:29
3,108,40,0,2019-02-25 15:20:29
4,56,16,24,2019-02-25 15:20:29


In [45]:
dfDynamicInfo = dfDynamicInfo.merge(dfStaticInfo, how="left", on="number", sort=False)
dfDynamicInfo.head()

Unnamed: 0,number,available_bike_stands,available_bikes,last_update,address,bikestands
0,42,30,0,2019-02-25 15:20:29,Smithfield North,30
1,30,17,3,2019-02-25 15:20:29,Parnell Square North,20
2,54,13,20,2019-02-25 15:20:29,Clonmel Street,33
3,108,40,0,2019-02-25 15:20:29,Avondale Road,40
4,56,16,24,2019-02-25 15:20:29,Mount Street Lower,40


In [46]:
dfDynamicInfo["available_bikes_percentage"] = dfDynamicInfo["available_bikes"] / dfDynamicInfo["bikestands"] * 100

In [47]:
dfDynamicInfo.head()

Unnamed: 0,number,available_bike_stands,available_bikes,last_update,address,bikestands,available_bikes_percentage
0,42,30,0,2019-02-25 15:20:29,Smithfield North,30,0.0
1,30,17,3,2019-02-25 15:20:29,Parnell Square North,20,15.0
2,54,13,20,2019-02-25 15:20:29,Clonmel Street,33,60.606061
3,108,40,0,2019-02-25 15:20:29,Avondale Road,40,0.0
4,56,16,24,2019-02-25 15:20:29,Mount Street Lower,40,60.0


## Merge DynamicInfo with WeatherInfo

In [50]:
pd.concat([dfDynamicInfo, dfWeatherInfo], )

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  """Entry point for launching an IPython kernel.


Unnamed: 0,address,available_bike_stands,available_bikes,available_bikes_percentage,bikestands,clouds,description,humidity,last_update,number,overview,temperature,wind_speed
0,Smithfield North,30.0,0.0,0.000000,30.0,,,,2019-02-25 15:20:29,42.0,,,
1,Parnell Square North,17.0,3.0,15.000000,20.0,,,,2019-02-25 15:20:29,30.0,,,
2,Clonmel Street,13.0,20.0,60.606061,33.0,,,,2019-02-25 15:20:29,54.0,,,
3,Avondale Road,40.0,0.0,0.000000,40.0,,,,2019-02-25 15:20:29,108.0,,,
4,Mount Street Lower,16.0,24.0,60.000000,40.0,,,,2019-02-25 15:20:29,56.0,,,
5,Christchurch Place,20.0,0.0,0.000000,20.0,,,,2019-02-25 15:20:29,6.0,,,
6,Grantham Street,27.0,3.0,10.000000,30.0,,,,2019-02-25 15:20:29,18.0,,,
7,Pearse Street,11.0,19.0,63.333333,30.0,,,,2019-02-25 15:20:29,32.0,,,
8,York Street East,12.0,20.0,62.500000,32.0,,,,2019-02-25 15:20:29,52.0,,,
9,Excise Walk,9.0,31.0,77.500000,40.0,,,,2019-02-25 15:20:29,48.0,,,


In [51]:
dfDynamicInfo.shape

(958127, 7)

In [52]:
dfWeatherInfo.head()

Unnamed: 0,last_update,overview,description,temperature,humidity,wind_speed,clouds
0,2019-02-25 15:00:00,Clouds,broken clouds,12.85,71.0,4.0,75.0
1,2019-02-25 15:30:00,Clouds,broken clouds,12.85,71.0,3.0,75.0
2,2019-02-25 16:00:00,Clouds,broken clouds,12.85,76.0,4.0,75.0
3,2019-02-25 16:30:00,Clouds,broken clouds,11.85,76.0,2.0,75.0
4,2019-02-25 17:48:51,Clouds,broken clouds,10.85,81.0,3.0,75.0


In [53]:
pd.merge_asof(dfDynamicInfo.sort_values(["number","last_update"]).reset_index(drop=True), dfWeatherInfo, left_index=True, right_index=True, on="last_update", direction="nearest").sort_values(["number","last_update"])

Unnamed: 0,number,available_bike_stands,available_bikes,last_update,address,bikestands,available_bikes_percentage,overview,description,temperature,humidity,wind_speed,clouds
0,2,20,0,2019-02-25 15:20:29,Blessington Street,20,0.000000,Clouds,broken clouds,12.85,71.00,4.0,75.0
1,2,20,0,2019-02-25 15:34:42,Blessington Street,20,0.000000,Clouds,broken clouds,12.85,71.00,3.0,75.0
2,2,20,0,2019-02-25 15:39:43,Blessington Street,20,0.000000,Clouds,broken clouds,12.85,76.00,4.0,75.0
3,2,20,0,2019-02-25 15:44:44,Blessington Street,20,0.000000,Clouds,broken clouds,11.85,76.00,2.0,75.0
4,2,20,0,2019-02-25 15:49:45,Blessington Street,20,0.000000,Clouds,broken clouds,10.85,81.00,3.0,75.0
5,2,17,3,2019-02-25 15:54:46,Blessington Street,20,15.000000,Clouds,broken clouds,9.85,81.00,2.0,75.0
6,2,17,3,2019-02-25 15:59:47,Blessington Street,20,15.000000,Clouds,broken clouds,7.85,93.00,2.0,75.0
7,2,14,6,2019-02-25 16:04:48,Blessington Street,20,30.000000,Clouds,broken clouds,7.85,93.00,2.0,75.0
8,2,15,5,2019-02-25 16:09:48,Blessington Street,20,25.000000,Clouds,scattered clouds,7.85,87.00,2.0,40.0
9,2,15,5,2019-02-25 16:14:49,Blessington Street,20,25.000000,Clouds,scattered clouds,6.85,81.00,4.0,40.0


- Merging in pandas not working out so did it in sql instead
- Used local database and saved result to csv
- Takes a while so wouldn't recommend it but it's fast enough
- Tries using sqlite instead but it hit 8gb ram straight away so it needs to be done in a normal database on disk

In [54]:
"""
from sqlalchemy import create_engine
# localDb format: "mysql+pymysql://<username>:<password>@localhost/<database>"
localDb = "mysql+pymysql://root:Jm30079!@localhost/test"
engine = create_engine(localDb)
dfDynamicInfo.to_sql("DbDynamicInfoADT", con=engine)
dfWeatherInfo.to_sql("WeatherInfoADT", con=engine)
"""

'\nfrom sqlalchemy import create_engine\n# localDb format: "mysql+pymysql://<username>:<password>@localhost/<database>"\nlocalDb = "mysql+pymysql://root:Jm30079!@localhost/test"\nengine = create_engine(localDb)\ndfDynamicInfo.to_sql("DbDynamicInfoADT", con=engine)\ndfWeatherInfo.to_sql("WeatherInfoADT", con=engine)\n'

In [26]:
"""
query = "select di.number, di.available_bikes_percentage, di.available_bikes, di.available_bike_stands, di.bikestands,di.last_update, di.weekday, wi.overview, wi.description, wi.temperature, wi.wind_speed, wi.clouds from DbDynamicInfoADT di, WeatherInfoADT wi where abs(timestampdiff(minute, di.last_update, wi.last_update)) < 20" # limit " + str(113*5000)
dfNewDynamicInfo = pd.read_sql(query, localDb)
dfNewDynamicInfo.head()
"""

'\nquery = "select di.number, di.available_bikes_percentage, di.available_bikes, di.available_bike_stands, di.bikestands,di.last_update, di.weekday, wi.overview, wi.description, wi.temperature, wi.wind_speed, wi.clouds from DbDynamicInfoADT di, WeatherInfoADT wi where abs(timestampdiff(minute, di.last_update, wi.last_update)) < 20" # limit " + str(113*5000)\ndfNewDynamicInfo = pd.read_sql(query, localDb)\ndfNewDynamicInfo.head()\n'

- Ended up with duplicate rows
- Dropped them but accidentally deleted the cells for this
- When recreating, look for duplicates in number and last_update: df[["number", "last_update"]].duplicated()
- These need to be dropped
- End up with about 1130 rows less than starting dynamic info but weather seems to line up with proper dates and times and there are no more duplicates, this still leaves us with enough data to work with even at this point

In [27]:
#dfNewDynamicInfo.to_csv("DbDynamicInfoADT.csv")
dfNewDynamicInfo = pd.read_csv("DbDynamicInfoADT.csv")

In [28]:
dfNewDynamicInfo.shape

(1220287, 16)

In [29]:
dfNewDynamicInfo.drop_duplicates(["number", "last_update"], inplace=True)

In [30]:
dfNewDynamicInfo[dfNewDynamicInfo[["number", "last_update", "weekday"]].duplicated()].shape

(0, 16)

In [31]:
dfNewDynamicInfo.shape

(1220287, 16)

In [None]:
#dfNewDynamicInfo.to_csv("DbDynamicInfoADT.csv")

In [33]:
dfNewDynamicInfo.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,number,available_bikes_percentage,available_bikes,available_bike_stands,bikestands,last_update,weekday,overview,description,temperature,wind_speed,clouds
0,0,0,0,0,42,0.0,0,30,30,2019-02-25 15:20:29,2,Clouds,broken clouds,12.85,3.0,75.0
1,1,1,1,1,30,15.0,3,17,20,2019-02-25 15:20:29,2,Clouds,broken clouds,12.85,3.0,75.0
2,2,2,2,2,54,60.606061,20,13,33,2019-02-25 15:20:29,2,Clouds,broken clouds,12.85,3.0,75.0
3,3,3,3,3,108,0.0,0,40,40,2019-02-25 15:20:29,2,Clouds,broken clouds,12.85,3.0,75.0
4,4,4,4,4,56,60.0,24,16,40,2019-02-25 15:20:29,2,Clouds,broken clouds,12.85,3.0,75.0


In [55]:
dfWeatherInfo.head()

Unnamed: 0,last_update,overview,description,temperature,humidity,wind_speed,clouds
0,2019-02-25 15:00:00,Clouds,broken clouds,12.85,71.0,4.0,75.0
1,2019-02-25 15:30:00,Clouds,broken clouds,12.85,71.0,3.0,75.0
2,2019-02-25 16:00:00,Clouds,broken clouds,12.85,76.0,4.0,75.0
3,2019-02-25 16:30:00,Clouds,broken clouds,11.85,76.0,2.0,75.0
4,2019-02-25 17:48:51,Clouds,broken clouds,10.85,81.0,3.0,75.0


In [None]:
dfNewDynamicInfo.head()

In [None]:
dfNewDynamicInfo.tail()

In [94]:
dfWeatherInfo.tail()

Unnamed: 0,last_update,overview,description,temperature,humidity,wind_speed,clouds
1438,2019-03-27 14:58:41,Clouds,scattered clouds,12.85,58.0,4.0,40.0
1439,2019-03-27 15:23:55,Clouds,scattered clouds,12.85,66.0,5.0,40.0
1440,2019-03-27 15:53:48,Clouds,broken clouds,12.85,66.0,4.0,75.0
1441,2019-03-27 16:23:49,Clouds,scattered clouds,12.85,62.0,4.0,40.0
1442,2019-03-27 16:53:34,Clouds,scattered clouds,11.85,62.0,2.0,40.0


In [95]:
#df = pd.read_csv("DbDynamicInfoADT.csv")

In [96]:
df.tail()

Unnamed: 0.1,Unnamed: 0,number,available_bikes_percentage,available_bikes,available_bike_stands,bikestands,last_update,weekday,overview,description,temperature,wind_speed,clouds
933940,933940,39,0.0,0,20,20,1553644682000000000,3,Clouds,scattered clouds,3.85,2.0,40.0
933941,933941,83,97.5,39,1,40,1553644682000000000,3,Clouds,scattered clouds,3.85,2.0,40.0
933942,933942,92,90.0,36,4,40,1553644682000000000,3,Clouds,scattered clouds,3.85,2.0,40.0
933943,933943,21,0.0,0,30,30,1553644682000000000,3,Clouds,scattered clouds,3.85,2.0,40.0
933944,933944,88,96.666667,29,1,30,1553644682000000000,3,Clouds,scattered clouds,3.85,2.0,40.0


In [99]:
df2 = pd.read_csv("WeatherWorkData.csv")

In [100]:
df2.tail()

Unnamed: 0.1,Unnamed: 0,number,datetime,overview,description,temperature,humidity,wind_speed,clouds
1444,1444,1445,2019-03-27 14:58:41,Clouds,scattered clouds,286.0,58.0,4.0,40.0
1445,1445,1446,2019-03-27 15:23:55,Clouds,scattered clouds,286.0,66.0,5.0,40.0
1446,1446,1447,2019-03-27 15:53:48,Clouds,broken clouds,286.0,66.0,4.0,75.0
1447,1447,1448,2019-03-27 16:23:49,Clouds,scattered clouds,286.0,62.0,4.0,40.0
1448,1448,1449,2019-03-27 16:53:34,Clouds,scattered clouds,285.0,62.0,2.0,40.0


## NOTE: Weekday Numbers
- Sunday = 1
- Monday = 2
- Tuesday = 3
- Wednesday = 4
- Thursday = 5
- Friday = 6
- Saturday = 7

In [None]:
df = df[df["last_update"] < "2019-03-27"]

In [None]:
df.drop(["Unnamed: 0", "Unnamed: 0.1", "Unnamed: 0.1.1"], axis=1, inplace=True)

In [None]:
df.head()

In [None]:
df2[df2["datetime"] > "2019-03-26 23:00:00"].head()

In [None]:
277-273.15

In [None]:
#df.to_csv("DublinBikesADT.csv")

In [None]:
df.tail()

In [None]:
df.shape

In [None]:
df[df[["number", "last_update"]].duplicated()]

# Linear Regression Attempt
<a id="linreg"></a>

In [58]:
df = pd.read_csv("NewDbADT.csv")
print(df.shape)
df.head()

(933945, 13)


Unnamed: 0.1,Unnamed: 0,number,available_bikes_percentage,available_bikes,available_bike_stands,bikestands,last_update,weekday,overview,description,temperature,wind_speed,clouds
0,0,42,0.0,0,30,30,1551108029000000000,2,Clouds,broken clouds,12.85,3.0,75.0
1,1,30,15.0,3,17,20,1551108029000000000,2,Clouds,broken clouds,12.85,3.0,75.0
2,2,54,60.606061,20,13,33,1551108029000000000,2,Clouds,broken clouds,12.85,3.0,75.0
3,3,108,0.0,0,40,40,1551108029000000000,2,Clouds,broken clouds,12.85,3.0,75.0
4,4,56,60.0,24,16,40,1551108029000000000,2,Clouds,broken clouds,12.85,3.0,75.0


In [59]:
df.drop("Unnamed: 0", axis=1, inplace=True)

In [60]:
overview = {}
num = 0
for i in df["overview"].unique():
    overview[i] = num
    num += 1

In [61]:
description = {}
num = 0
for i in df["description"].unique():
    description[i] = num
    num += 1

In [62]:
description

{'broken clouds': 0,
 'scattered clouds': 1,
 'few clouds': 2,
 'clear sky': 3,
 'mist': 4,
 'fog': 5,
 'light rain': 6,
 'moderate rain': 7,
 'heavy intensity rain': 8,
 'very heavy rain': 9,
 'snow': 10,
 'light intensity shower rain': 11,
 'shower rain': 12,
 'light intensity drizzle': 13,
 'shower sleet': 14,
 'light intensity drizzle rain': 15,
 'drizzle': 16,
 'overcast clouds': 17}

In [63]:
overview

{'Clouds': 0,
 'Clear': 1,
 'Mist': 2,
 'Fog': 3,
 'Rain': 4,
 'Snow': 5,
 'Drizzle': 6}

In [64]:
df.replace(description, inplace=True)
df.replace(overview, inplace=True)

In [65]:
df.head()

Unnamed: 0,number,available_bikes_percentage,available_bikes,available_bike_stands,bikestands,last_update,weekday,overview,description,temperature,wind_speed,clouds
0,42,0.0,0,30,30,1551108029000000000,2,0,0,12.85,3.0,75.0
1,30,15.0,3,17,20,1551108029000000000,2,0,0,12.85,3.0,75.0
2,54,60.606061,20,13,33,1551108029000000000,2,0,0,12.85,3.0,75.0
3,108,0.0,0,40,40,1551108029000000000,2,0,0,12.85,3.0,75.0
4,56,60.0,24,16,40,1551108029000000000,2,0,0,12.85,3.0,75.0


In [66]:
df = shuffle(df)
firstModelData = df[df["number"] == 42]
firstModelData["last_update"] = pd.to_numeric(pd.to_datetime(df["last_update"]))
dfTrain, dfTest = train_test_split(firstModelData, test_size=0.3)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [67]:
dfTrain.head()

Unnamed: 0,number,available_bikes_percentage,available_bikes,available_bike_stands,bikestands,last_update,weekday,overview,description,temperature,wind_speed,clouds
231650,42,0.0,0,30,30,1551773632000000000,3,4,7,4.85,7.0,40.0
623195,42,46.666667,14,16,30,1552816956000000000,1,0,1,3.85,6.0,40.0
133792,42,63.333333,19,11,30,1551512996000000000,7,0,2,4.85,2.0,20.0
831228,42,80.0,24,6,30,1553371384000000000,7,1,3,5.85,2.0,0.0
148369,42,40.0,12,18,30,1551551815000000000,7,4,6,7.85,7.0,40.0


In [68]:
continuous = ["weekday", "temperature", "clouds", "overview", "description"]

In [69]:
model = LinearRegression()
xTrain = dfTrain[continuous]
yTrain = dfTrain["available_bikes"]
model.fit(xTrain, yTrain)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [70]:
test = model.predict(dfTest[continuous])

In [71]:
actualVsPredicted = pd.DataFrame()

In [72]:
for i in range(0, len(test)):
    test[i] = test[i].round()

In [73]:
actualVsPredicted["Actual"] = dfTest["available_bikes"]

In [74]:
actualVsPredicted["Prediction"] = test

In [75]:
dfTest.head()

Unnamed: 0,number,available_bikes_percentage,available_bikes,available_bike_stands,bikestands,last_update,weekday,overview,description,temperature,wind_speed,clouds
559463,42,0.0,0,30,30,1552647164000000000,6,4,11,8.85,11.0,75.0
680825,42,60.0,18,12,30,1552970404000000000,3,0,0,5.85,4.0,75.0
434824,42,30.0,9,21,30,1552315146000000000,2,0,0,7.85,5.0,75.0
408043,42,40.0,12,18,30,1552243748000000000,1,4,6,1.85,7.0,75.0
57065,42,66.666667,20,10,30,1551248362000000000,4,2,4,0.85,1.0,20.0


In [76]:
actualVsPredicted.head()

Unnamed: 0,Actual,Prediction
559463,0,14.0
680825,18,15.0
434824,9,12.0
408043,12,15.0
57065,20,17.0


In [77]:
print("Accuracy Score:", metrics.accuracy_score(actualVsPredicted["Actual"], actualVsPredicted["Prediction"])*100, "\n")
print("Root Mean Squared:", metrics.r2_score(actualVsPredicted["Actual"], actualVsPredicted["Prediction"]), "\n")
print("Mean Squared Error:", metrics.mean_squared_error(actualVsPredicted["Actual"], actualVsPredicted["Prediction"]), "\n")
print("Mean Absolute Error:", metrics.mean_absolute_error(actualVsPredicted["Actual"], actualVsPredicted["Prediction"]), "\n")

Accuracy Score: 1.370967741935484 

Root Mean Squared: 0.11333536955113177 

Mean Squared Error: 96.80120967741935 

Mean Absolute Error: 8.6875 



Given that the low accuracy score and r squared value, coupled with a high mean sqaured error and absolute error, it might be a good idea to evaluate other models.
Since we are predicting numeric values, random forests or decision trees might be the way forward.

# TomM's Machine Learning

### Import software tools for Random Forest and Label encoder (to make categorical features numerical).

In [78]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

  from numpy.core.umath_tests import inner1d


### A dictionary will be created to hold the prediction model for each station. The Random Forest model will be tested against the Decision Tree model, and the most accurate will be model overall will be used. 

In [79]:
# Create a dictionary to hold the prediction model for each station
modelDict = {}
# Used for loop number in print statement
j=1

# Run the code for each unique bikestand number
for i in df.number.unique():
    df = df.loc[df['number'] == i]
    
    # Get the columns which will be used in the data prediction model. 
    df = df[["available_bikes","last_update","weekday", "overview", "description", "temperature", "wind_speed", "clouds" ]]
    continuous_columns = df[["available_bikes", "last_update","temperature", "wind_speed", "clouds"]].columns
    category_columns = ["weekday", "overview", "description"]
    
    # Use Label Encoder to change categorical features to numerical.
    df[category_columns]= df[category_columns].apply(le.fit_transform)
    
    # Separate descriptive features from the target feature.
    describe = [x for x in df.columns.values if x not in ['available_bikes']]
    
    # X is the descriptive features, y is the target feature.
    X = pd.concat([df[describe]], axis = 1)
    y = df.available_bikes
    
    # Create a 70/30 split between training and test data.
    X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3, random_state=7)
    
    # Create a decision tree and random forest data prediction model
    decision_tree = DecisionTreeClassifier()
    random_forest = RandomForestClassifier(n_estimators=100, max_features='auto', oob_score=True, random_state=1)
    
    # Train the models
    decision_tree.fit(X_train, y_train)
    random_forest.fit(X_train, y_train)

    # Predict on the test data
    dec_predictions = decision_tree.predict(X_test)
    for_predictions = random_forest.predict(X_test)

    # Print accuracy scores for model prediction, along with loop number and bikestand number
    print("\nStations checked out of 113: ",j, "\nBikestand Number: ", i,
        "\nDecision Tree\nAccuracy: ", metrics.accuracy_score(y_test, dec_predictions), "\nRoot Mean Squared: ",
          metrics.r2_score(y_test, dec_predictions), "\nMean Squared Error: ", 
          metrics.mean_squared_error(y_test, dec_predictions), "\nMean Absolute Error: ", metrics.mean_absolute_error(y_test, dec_predictions), "\n")
    print()     
    print("Stations checked out of 113: ",j, "\nBikestand Number: ", i,
        "\nRandom Forest\nAccuracy: ", metrics.accuracy_score(y_test, for_predictions), "\nRoot Mean Squared: ",
          metrics.r2_score(y_test, for_predictions), "\nMean Squared Error: ", 
          metrics.mean_squared_error(y_test, for_predictions), "\nMean Absolute Error: ", metrics.mean_absolute_error(y_test, for_predictions), "\n")
         
    print("________________________________________________________________") 
    
    # Increment j for denoting the loop number.
    j= j+1

    # Reread df to reset all columns for next for loop.
    df = pd.read_csv("NewDbADT.csv")
    
    # Decision trees were very slightly more accurate, so that model will be used to populate the dictionary.
    dt_pred = decision_tree
    modelDict.update({ i : dt_pred} )


Stations checked out of 113:  1 
Bikestand Number:  17 
Decision Tree
Accuracy:  0.8012096774193549 
Root Mean Squared:  0.9796850163259772 
Mean Squared Error:  0.3842741935483871 
Mean Absolute Error:  0.25201612903225806 


Stations checked out of 113:  1 
Bikestand Number:  17 
Random Forest
Accuracy:  0.8016129032258065 
Root Mean Squared:  0.9794078969264364 
Mean Squared Error:  0.3895161290322581 
Mean Absolute Error:  0.2532258064516129 

________________________________________________________________

Stations checked out of 113:  2 
Bikestand Number:  12 
Decision Tree
Accuracy:  0.8149193548387097 
Root Mean Squared:  0.9848796032109086 
Mean Squared Error:  0.48024193548387095 
Mean Absolute Error:  0.24637096774193548 


Stations checked out of 113:  2 
Bikestand Number:  12 
Random Forest
Accuracy:  0.8137096774193548 
Root Mean Squared:  0.9857048137829414 
Mean Squared Error:  0.45403225806451614 
Mean Absolute Error:  0.2435483870967742 

___________________________


Stations checked out of 113:  17 
Bikestand Number:  93 
Decision Tree
Accuracy:  0.9294354838709677 
Root Mean Squared:  0.9927958204069961 
Mean Squared Error:  1.8399193548387096 
Mean Absolute Error:  0.2754032258064516 


Stations checked out of 113:  17 
Bikestand Number:  93 
Random Forest
Accuracy:  0.927016129032258 
Root Mean Squared:  0.9919290453474828 
Mean Squared Error:  2.0612903225806454 
Mean Absolute Error:  0.2903225806451613 

________________________________________________________________

Stations checked out of 113:  18 
Bikestand Number:  88 
Decision Tree
Accuracy:  0.7560483870967742 
Root Mean Squared:  0.98901716963071 
Mean Squared Error:  1.1653225806451613 
Mean Absolute Error:  0.3556451612903226 


Stations checked out of 113:  18 
Bikestand Number:  88 
Random Forest
Accuracy:  0.7584677419354838 
Root Mean Squared:  0.9944629813674548 
Mean Squared Error:  0.5875 
Mean Absolute Error:  0.3294354838709677 

__________________________________________


Stations checked out of 113:  33 
Bikestand Number:  25 
Decision Tree
Accuracy:  0.7798387096774193 
Root Mean Squared:  0.9901797610466184 
Mean Squared Error:  0.9580645161290322 
Mean Absolute Error:  0.34919354838709676 


Stations checked out of 113:  33 
Bikestand Number:  25 
Random Forest
Accuracy:  0.7798387096774193 
Root Mean Squared:  0.9895763288550387 
Mean Squared Error:  1.0169354838709677 
Mean Absolute Error:  0.357258064516129 

________________________________________________________________

Stations checked out of 113:  34 
Bikestand Number:  31 
Decision Tree
Accuracy:  0.6157258064516129 
Root Mean Squared:  0.9527776446230043 
Mean Squared Error:  1.3737903225806452 
Mean Absolute Error:  0.5907258064516129 


Stations checked out of 113:  34 
Bikestand Number:  31 
Random Forest
Accuracy:  0.6137096774193549 
Root Mean Squared:  0.9529578297183671 
Mean Squared Error:  1.3685483870967743 
Mean Absolute Error:  0.592741935483871 

____________________________


Stations checked out of 113:  49 
Bikestand Number:  21 
Decision Tree
Accuracy:  0.6754032258064516 
Root Mean Squared:  0.9871798060654776 
Mean Squared Error:  1.414516129032258 
Mean Absolute Error:  0.5290322580645161 


Stations checked out of 113:  49 
Bikestand Number:  21 
Random Forest
Accuracy:  0.6705645161290322 
Root Mean Squared:  0.9873442612328247 
Mean Squared Error:  1.3963709677419356 
Mean Absolute Error:  0.5342741935483871 

________________________________________________________________

Stations checked out of 113:  50 
Bikestand Number:  95 
Decision Tree
Accuracy:  0.7653225806451613 
Root Mean Squared:  0.9945529950689556 
Mean Squared Error:  1.0314516129032258 
Mean Absolute Error:  0.3580645161290323 


Stations checked out of 113:  50 
Bikestand Number:  95 
Random Forest
Accuracy:  0.7600806451612904 
Root Mean Squared:  0.9957837882081829 
Mean Squared Error:  0.7983870967741935 
Mean Absolute Error:  0.3564516129032258 

____________________________


Stations checked out of 113:  65 
Bikestand Number:  32 
Decision Tree
Accuracy:  0.6181451612903226 
Root Mean Squared:  0.9633429131303949 
Mean Squared Error:  2.8233870967741934 
Mean Absolute Error:  0.7540322580645161 


Stations checked out of 113:  65 
Bikestand Number:  32 
Random Forest
Accuracy:  0.6141129032258065 
Root Mean Squared:  0.9613011730733902 
Mean Squared Error:  2.9806451612903224 
Mean Absolute Error:  0.7733870967741936 

________________________________________________________________

Stations checked out of 113:  66 
Bikestand Number:  112 
Decision Tree
Accuracy:  0.660483870967742 
Root Mean Squared:  0.9896483944083251 
Mean Squared Error:  1.1778225806451612 
Mean Absolute Error:  0.5028225806451613 


Stations checked out of 113:  66 
Bikestand Number:  112 
Random Forest
Accuracy:  0.6592741935483871 
Root Mean Squared:  0.9892373070243352 
Mean Squared Error:  1.2245967741935484 
Mean Absolute Error:  0.5141129032258065 

__________________________


Stations checked out of 113:  81 
Bikestand Number:  2 
Decision Tree
Accuracy:  0.7729838709677419 
Root Mean Squared:  0.9785387719121507 
Mean Squared Error:  0.7060483870967742 
Mean Absolute Error:  0.30766129032258066 


Stations checked out of 113:  81 
Bikestand Number:  2 
Random Forest
Accuracy:  0.7733870967741936 
Root Mean Squared:  0.9782201014779508 
Mean Squared Error:  0.7165322580645161 
Mean Absolute Error:  0.3165322580645161 

________________________________________________________________

Stations checked out of 113:  82 
Bikestand Number:  60 
Decision Tree
Accuracy:  0.6407258064516129 
Root Mean Squared:  0.9810110344373973 
Mean Squared Error:  2.397983870967742 
Mean Absolute Error:  0.6407258064516129 


Stations checked out of 113:  82 
Bikestand Number:  60 
Random Forest
Accuracy:  0.6415322580645161 
Root Mean Squared:  0.9797721377435534 
Mean Squared Error:  2.5544354838709675 
Mean Absolute Error:  0.6479838709677419 

_____________________________


Stations checked out of 113:  97 
Bikestand Number:  81 
Decision Tree
Accuracy:  0.9217741935483871 
Root Mean Squared:  0.9939272349032054 
Mean Squared Error:  0.5604838709677419 
Mean Absolute Error:  0.1314516129032258 


Stations checked out of 113:  97 
Bikestand Number:  81 
Random Forest
Accuracy:  0.9205645161290322 
Root Mean Squared:  0.9945257736213787 
Mean Squared Error:  0.505241935483871 
Mean Absolute Error:  0.13024193548387097 

________________________________________________________________

Stations checked out of 113:  98 
Bikestand Number:  68 
Decision Tree
Accuracy:  0.5911290322580646 
Root Mean Squared:  0.9797280450745993 
Mean Squared Error:  3.571774193548387 
Mean Absolute Error:  0.8596774193548387 


Stations checked out of 113:  98 
Bikestand Number:  68 
Random Forest
Accuracy:  0.5887096774193549 
Root Mean Squared:  0.9792749126434377 
Mean Squared Error:  3.6516129032258067 
Mean Absolute Error:  0.8758064516129033 

____________________________


Stations checked out of 113:  113 
Bikestand Number:  79 
Decision Tree
Accuracy:  0.8806451612903226 
Root Mean Squared:  0.9715537498987274 
Mean Squared Error:  1.2100806451612902 
Mean Absolute Error:  0.2318548387096774 


Stations checked out of 113:  113 
Bikestand Number:  79 
Random Forest
Accuracy:  0.8842741935483871 
Root Mean Squared:  0.9716864548308893 
Mean Squared Error:  1.2044354838709677 
Mean Absolute Error:  0.22620967741935483 

________________________________________________________________


### Upon analysing the accuracy measurements of the two data prediction models, it can be seen that they are extremely similar. However, the Decision Tree model is very slightly ahead in more tests than the Random Forest Model, and thus the Decision Tree model will be used.

In [80]:
df.head()

Unnamed: 0.1,Unnamed: 0,number,available_bikes_percentage,available_bikes,available_bike_stands,bikestands,last_update,weekday,overview,description,temperature,wind_speed,clouds
0,0,42,0.0,0,30,30,1551108029000000000,2,Clouds,broken clouds,12.85,3.0,75.0
1,1,30,15.0,3,17,20,1551108029000000000,2,Clouds,broken clouds,12.85,3.0,75.0
2,2,54,60.606061,20,13,33,1551108029000000000,2,Clouds,broken clouds,12.85,3.0,75.0
3,3,108,0.0,0,40,40,1551108029000000000,2,Clouds,broken clouds,12.85,3.0,75.0
4,4,56,60.0,24,16,40,1551108029000000000,2,Clouds,broken clouds,12.85,3.0,75.0


In [81]:
prediction = [1551108029000000000, 1, 1, 0, 12.85, 3, 75]

In [82]:
modelDict[42].predict([prediction])[0]

0

In [83]:
# Actual = 0, predicted = 0

In [84]:
modelDict[30].predict([prediction])[0]

4

In [85]:
# Actual = 3, predicted = 4

In [86]:
modelDict[54].predict([prediction])[0]

19

In [87]:
# Actual = 20, predicted = 19

In [88]:
modelDict[108].predict([prediction])[0]

0

In [89]:
# Actual = 0, predicted = 0

### Overall the model is quite accurate, far superior to using linear regression, and better than chance.

In [90]:
### Store the predictions in a pickle file

In [91]:
import pickle
ser = pickle.dumps(modelDict)

In [92]:
with open("data_model.pkl", "wb") as handle:
    pickle.dump(modelDict, handle, pickle.HIGHEST_PROTOCOL)