# Prophet

First, the libraries required by this notebook are imported.

In [1]:
%load_ext rpy2.ipython
%matplotlib inline
import logging
logging.getLogger('fbprophet').setLevel(logging.ERROR)
import warnings
warnings.filterwarnings("ignore")

from timeit import default_timer as timer



In [2]:
import pandas as pd
from fbprophet import Prophet
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
# from fastprogress.fastprogress import progress_bar

## Metro Interstate Traffic Volume Data Set

**Source:** https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume

**Data Set Information:** Hourly Interstate 94 Westbound traffic volume for MN DoT ATR station 301, roughly midway between Minneapolis and St Paul, MN. Hourly weather features and holidays included for impacts on traffic volume.

**Attribute Information:**
- holiday: Categorical US National holidays plus regional holiday, Minnesota State Fair.
- temp: Numeric Average temp in kelvin.
- rain_1h: Numeric Amount in mm of rain that occurred in the hour.
- snow_1h: Numeric Amount in mm of snow that occurred in the hour.
- clouds_all: Numeric Percentage of cloud cover.
- weather_main: Categorical Short textual description of the current weather.
- weather_description: Categorical Longer textual description of the current weather.
- date_time: DateTime Hour of the data collected in local CST time.
- traffic_volume: Numeric Hourly I-94 ATR 301 reported westbound traffic volume.

The data set is loaded, the timestamp and target columns are renamed to "ds" and "y" respectively, the column names Prophet expects, and the first rows are shown.

In [3]:
df = pd.read_csv('../examples/Metro_Interstate_Traffic_Volume.csv')
df = df.rename(columns={'date_time': 'ds', 'traffic_volume': 'y'})
df.head()

Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,ds,y
0,,288.28,0.0,0.0,40,Clouds,scattered clouds,2012-10-02 09:00:00,5545
1,,289.36,0.0,0.0,75,Clouds,broken clouds,2012-10-02 10:00:00,4516
2,,289.58,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 11:00:00,4767
3,,290.13,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 12:00:00,5026
4,,291.14,0.0,0.0,75,Clouds,broken clouds,2012-10-02 13:00:00,4918


As "holiday", "weather_main" and "weather_description" are categorical, they are encoded as integers with LabelEncoder().

In [4]:
le = LabelEncoder()
df['weather_main'] = le.fit_transform(df['weather_main'])
df['weather_description'] = le.fit_transform(df['weather_description'])
df['holiday'] = le.fit_transform(df['holiday'])
df.head()

Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,ds,y
0,7,288.28,0.0,0.0,40,1,24,2012-10-02 09:00:00,5545
1,7,289.36,0.0,0.0,75,1,2,2012-10-02 10:00:00,4516
2,7,289.58,0.0,0.0,90,1,19,2012-10-02 11:00:00,4767
3,7,290.13,0.0,0.0,90,1,19,2012-10-02 12:00:00,5026
4,7,291.14,0.0,0.0,75,1,2,2012-10-02 13:00:00,4918
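
To make the encoding step concrete, here is a minimal pure-Python sketch of the kind of mapping a label encoder produces, assuming labels are numbered by their sorted order (which is how `LabelEncoder` treats string labels). The helper name `encode_labels` and the sample values are illustrative and not part of the notebook.

```python
def encode_labels(values):
    """Map each distinct label to an integer code, following sorted label order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], classes


# Hypothetical sample of the "weather_main" column.
weather = ["Clouds", "Clear", "Rain", "Clouds", "Clear"]
codes, classes = encode_labels(weather)
print(classes)  # ['Clear', 'Clouds', 'Rain']
print(codes)    # [1, 0, 2, 1, 0]
```

The codes only reflect alphabetical order, so a regressor built from them treats "Rain" as numerically larger than "Clear" even though the categories have no natural ordering.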


The Prophet model is defined with the extra variables that will be used for training in addition to "ds" and "y".

In [5]:
def train(df):
    m = Prophet()
    m.add_regressor('temp')
    m.add_regressor('rain_1h')
    m.add_regressor('snow_1h')
    m.add_regressor('clouds_all')
    m.add_regressor('weather_main')
    m.add_regressor('weather_description')
    m.add_regressor('holiday')
    m.fit(df)
    return m

A 10-fold cross-validation is performed. For each fold the R^2 score, the RMSE and the training time are recorded, and their means are reported at the end.

In [6]:
kf = KFold(n_splits=10, shuffle=False)  # random_state only applies when shuffle=True
r2score_values = []
rmse_values = []
time_values = []

for train_index, test_index in kf.split(df):
    df_train, df_test = df.iloc[train_index, :], df.iloc[test_index, :]
    start = timer()
    m = train(df_train)
    end = timer()
    time_values.append(end-start)
    y_pred = m.predict(df_test)['yhat'].values
    y_true = df_test['y'].values
    r2score = r2_score(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2score_values.append(r2score)
    rmse_values.append(rmse)
    print(r2score, rmse)

print(f'Mean R^2 score => {np.mean(r2score_values)}\nMean RMSE => {np.mean(rmse_values)}\nMean time => {np.mean(time_values)}')

0.8032421153376518 883.3732697316294
0.8197692352793818 877.1200454344335
0.7963847711416194 886.7152173799411
0.8139079775433222 861.9863869972448
0.7916443224579326 930.8271031496416
-0.24941526916698176 2103.3958742024024
0.8091944017986281 863.6914993886502
0.8254946286367031 829.54475144421
0.7739781965246284 929.8390104347976
0.8233546794724935 829.8235964147831
Mean R^2 score => 0.7007555059025379
Mean RMSE => 999.6316754577734
Mean time => 41.183372796800086
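
One fold above has a negative R^2 score. As a reminder of how that can happen, the following sketch computes both metrics from their usual definitions, R^2 = 1 - SS_res / SS_tot and RMSE = sqrt(SS_res / n), in plain Python. The function name `r2_and_rmse` and the toy values are illustrative only. The notebook itself uses `r2_score` and `mean_squared_error` from scikit-learn.

```python
import math


def r2_and_rmse(y_true, y_pred):
    """Return (R^2, RMSE) computed from the standard definitions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot, math.sqrt(ss_res / n)


y_true = [1.0, 2.0, 3.0, 4.0]
r2_good, rmse_good = r2_and_rmse(y_true, [1.1, 1.9, 3.2, 3.8])  # close predictions
r2_bad, rmse_bad = r2_and_rmse(y_true, [4.0, 3.0, 2.0, 1.0])    # reversed predictions
print(r2_good, rmse_good)  # R^2 close to 1
print(r2_bad, rmse_bad)    # R^2 below 0
```

R^2 falls below zero whenever the residual sum of squares exceeds the total sum of squares, that is, when the predictions are worse than always predicting the mean of the test values.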


## Air Quality Data Set

**Source:** https://archive.ics.uci.edu/ml/datasets/Air+quality

**Data Set Information:** The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located in the field in a significantly polluted area, at road level, within an Italian city. Data were recorded from March 2004 to February 2005 (one year), representing the longest freely available recordings of on-field deployed air quality chemical sensor device responses. Ground-truth hourly averaged concentrations for CO, non-methane hydrocarbons, benzene, total nitrogen oxides (NOx) and nitrogen dioxide (NO2) were provided by a co-located reference certified analyzer. Evidence of cross-sensitivities as well as both concept and sensor drifts is present, as described in De Vito et al., Sens. and Act. B, Vol. 129, 2, 2008 (citation required), eventually affecting the sensors' concentration estimation capabilities. Missing values are tagged with the value -200.

**Attribute Information:**
- Date: Date (DD/MM/YYYY).
- Time: Time (HH.MM.SS).
- CO(GT): True hourly averaged concentration CO in mg/m^3 (reference analyzer).
- PT08.S1(CO): PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted).
- NMHC(GT): True hourly averaged overall non-methane hydrocarbons (NMHC) concentration in microg/m^3 (reference analyzer).
- C6H6(GT): True hourly averaged Benzene concentration in microg/m^3 (reference analyzer).
- PT08.S2(NMHC): PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted).
- NOx(GT): True hourly averaged NOx concentration in ppb (reference analyzer).
- PT08.S3(NOx): PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted).
- NO2(GT): True hourly averaged NO2 concentration in microg/m^3 (reference analyzer).
- PT08.S4(NO2): PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted).
- PT08.S5(O3): PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted).
- T: Temperature in °C.
- RH: Relative Humidity (%).
- AH: Absolute Humidity.

The data set is loaded and the timestamp "ds" is built from the "Date" and "Time" columns. The output variable "CO(GT)" is renamed to "y", the target column name Prophet expects, and the first rows are shown.

In [75]:
df = pd.read_csv('../examples/AirQualityUCI.csv', sep=';')
df['ds'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d/%m/%Y %H.%M.%S')
df = df.drop(['Date', 'Time'], axis=1)
df = df.rename(columns={'CO(GT)': 'y'})
df.head()

Unnamed: 0,y,PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,Unnamed: 15,Unnamed: 16,ds
0,26,1360.0,150.0,119,1046.0,166.0,1056.0,113.0,1692.0,1268.0,136,489,7578,,,2004-03-10 18:00:00
1,2,1292.0,112.0,94,955.0,103.0,1174.0,92.0,1559.0,972.0,133,477,7255,,,2004-03-10 19:00:00
2,22,1402.0,88.0,90,939.0,131.0,1140.0,114.0,1555.0,1074.0,119,540,7502,,,2004-03-10 20:00:00
3,22,1376.0,80.0,92,948.0,172.0,1092.0,122.0,1584.0,1203.0,110,600,7867,,,2004-03-10 21:00:00
4,16,1272.0,51.0,65,836.0,131.0,1205.0,116.0,1490.0,1110.0,112,596,7888,,,2004-03-10 22:00:00
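
The format string used to build "ds" can be checked on a single value with the standard library. This is a small sketch using the first timestamp of the data set. The variable names are illustrative.

```python
from datetime import datetime

# "Date" is day-first (DD/MM/YYYY) and "Time" uses dots as separators (HH.MM.SS).
raw_date, raw_time = "10/03/2004", "18.00.00"
stamp = datetime.strptime(f"{raw_date} {raw_time}", "%d/%m/%Y %H.%M.%S")
print(stamp.isoformat(sep=" "))  # 2004-03-10 18:00:00
```

Passing the format explicitly matters here: without it, "10/03/2004" could be read month-first as 3 October instead of 10 March.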


More information from the dataset is displayed to detect possible missing values.

In [76]:
df.describe()

Unnamed: 0,PT08.S1(CO),NMHC(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),Unnamed: 15,Unnamed: 16
count,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,0.0,0.0
mean,1048.990061,-159.090093,894.595276,168.616971,794.990168,58.148873,1391.479641,975.072032,,
std,329.83271,139.789093,342.333252,257.433866,321.993552,126.940455,467.210125,456.938184,,
min,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,,
25%,921.0,-200.0,711.0,50.0,637.0,53.0,1185.0,700.0,,
50%,1053.0,-200.0,895.0,141.0,794.0,96.0,1446.0,942.0,,
75%,1221.0,-200.0,1105.0,284.0,960.0,133.0,1662.0,1255.0,,
max,2040.0,1189.0,2214.0,1479.0,2683.0,340.0,2775.0,2523.0,,


First, two columns ("Unnamed: 15" and "Unnamed: 16") contain no values and do not appear in the description of the dataset, so they are removed.

In [77]:
df = df.drop(['Unnamed: 15', 'Unnamed: 16'], axis=1)
df.head()

Unnamed: 0,y,PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,ds
0,26,1360.0,150.0,119,1046.0,166.0,1056.0,113.0,1692.0,1268.0,136,489,7578,2004-03-10 18:00:00
1,2,1292.0,112.0,94,955.0,103.0,1174.0,92.0,1559.0,972.0,133,477,7255,2004-03-10 19:00:00
2,22,1402.0,88.0,90,939.0,131.0,1140.0,114.0,1555.0,1074.0,119,540,7502,2004-03-10 20:00:00
3,22,1376.0,80.0,92,948.0,172.0,1092.0,122.0,1584.0,1203.0,110,600,7867,2004-03-10 21:00:00
4,16,1272.0,51.0,65,836.0,131.0,1205.0,116.0,1490.0,1110.0,112,596,7888,2004-03-10 22:00:00


Null values are then counted, and rows that consist entirely of null values are removed.

In [78]:
df.isnull().sum(axis = 0)

y                114
PT08.S1(CO)      114
NMHC(GT)         114
C6H6(GT)         114
PT08.S2(NMHC)    114
NOx(GT)          114
PT08.S3(NOx)     114
NO2(GT)          114
PT08.S4(NO2)     114
PT08.S5(O3)      114
T                114
RH               114
AH               114
ds               114
dtype: int64

In [79]:
df = df.dropna(how='all')
df.isnull().sum(axis = 0)

y                0
PT08.S1(CO)      0
NMHC(GT)         0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
ds               0
dtype: int64

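The column types are checked next. Several numeric columns were read as strings because they use "," instead of "." as the decimal separator, so they are fixed and converted to float.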

In [80]:
df.dtypes

y                        object
PT08.S1(CO)             float64
NMHC(GT)                float64
C6H6(GT)                 object
PT08.S2(NMHC)           float64
NOx(GT)                 float64
PT08.S3(NOx)            float64
NO2(GT)                 float64
PT08.S4(NO2)            float64
PT08.S5(O3)             float64
T                        object
RH                       object
AH                       object
ds               datetime64[ns]
dtype: object

In [81]:
df['C6H6(GT)'] = df['C6H6(GT)'].apply(lambda x: x.replace(',','.'))
df['T'] = df['T'].apply(lambda x: x.replace(',','.'))
df['RH'] = df['RH'].apply(lambda x: x.replace(',','.'))
df['AH'] = df['AH'].apply(lambda x: x.replace(',','.'))
df['y'] = df['y'].apply(lambda x: x.replace(',','.'))

df['PT08.S1(CO)'] = df['PT08.S1(CO)'].astype(float)
df['C6H6(GT)'] = df['C6H6(GT)'].astype(float)
df['T'] = df['T'].astype(float)
df['RH'] = df['RH'].astype(float)
df['AH'] = df['AH'].astype(float)
df['y'] = df['y'].astype(float)
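
As an alternative to replacing the separator column by column, pandas can interpret decimal commas while reading the file through the `decimal` parameter of `read_csv`. The sketch below uses a small in-memory sample with made-up rows in the same layout, not the real file, so the column subset and values are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row sample using ';' as field separator and ',' as decimal mark.
sample = io.StringIO(
    "Date;Time;CO(GT);T\n"
    "10/03/2004;18.00.00;2,6;13,6\n"
    "10/03/2004;19.00.00;2;13,3\n"
)
parsed = pd.read_csv(sample, sep=";", decimal=",")
print(parsed.dtypes)  # CO(GT) and T are parsed as floats; Date and Time stay as text
```

With this option the numeric columns arrive as floats, so the manual `replace` and `astype` steps would not be needed.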

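As the "describe" output showed, many columns contain the value -200, which the data set description identifies as the marker for missing values. Rows where the target "y" is missing are dropped, and the remaining -200 values in the other columns are replaced with 0.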

In [82]:
df = df[df["y"] != -200]
df = df.replace(-200, 0)
df.describe()

Unnamed: 0,y,PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH
count,7674.0,7674.0,7674.0,7674.0,7674.0,7674.0,7674.0,7674.0,7674.0,7674.0,7674.0,7674.0,7674.0
mean,2.15275,1062.823169,25.93706,9.833855,906.46638,241.573365,791.360568,108.58809,1382.624967,998.639432,17.006255,46.950378,0.946904
std,1.453252,310.691637,99.99094,7.57193,323.064928,217.251773,301.959611,53.639799,450.968711,449.700751,9.389699,19.761648,0.439692
min,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.9,0.0,0.0
25%,1.1,927.0,0.0,4.1,716.0,90.0,628.0,73.0,1157.0,704.0,10.3,33.5,0.65
50%,1.8,1062.0,0.0,8.1,903.0,177.0,782.0,107.0,1425.0,968.0,16.3,48.1,0.9406
75%,2.9,1235.0,0.0,14.0,1116.75,326.0,949.0,141.0,1659.0,1287.0,23.5,61.8,1.2352
max,11.9,2040.0,1189.0,63.7,2214.0,1479.0,2683.0,340.0,2775.0,2523.0,44.6,88.7,2.1806


The Prophet model is defined with the extra variables that will be used for training in addition to "ds" and "y".

In [83]:
def train(df):
    m = Prophet()
    m.add_regressor('PT08.S1(CO)')
    m.add_regressor('NMHC(GT)')
    m.add_regressor('C6H6(GT)')
    m.add_regressor('PT08.S2(NMHC)')
    m.add_regressor('NO2(GT)')
    m.add_regressor('PT08.S3(NOx)')
    m.add_regressor('PT08.S4(NO2)')
    m.add_regressor('PT08.S5(O3)')
    m.add_regressor('T')
    m.add_regressor('RH')
    m.add_regressor('AH')
    m.fit(df)
    return m

A 10-fold cross-validation is performed. For each fold the R^2 score, the RMSE and the training time are recorded, and their means are reported at the end.

In [84]:
kf = KFold(n_splits=10, shuffle=False)  # random_state only applies when shuffle=True
r2score_values = []
rmse_values = []
time_values = []

for train_index, test_index in kf.split(df):
    df_train, df_test = df.iloc[train_index, :], df.iloc[test_index, :]
    start = timer()
    m = train(df_train)
    end = timer()
    time_values.append(end-start)
    y_pred = m.predict(df_test)['yhat'].values
    y_true = df_test['y'].values
    r2score = r2_score(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2score_values.append(r2score)
    rmse_values.append(rmse)
    print(r2score, rmse)

print(f'Mean R^2 score => {np.mean(r2score_values)}\nMean RMSE => {np.mean(rmse_values)}\nMean time => {np.mean(time_values)}')

0.8927630725698871 0.43317263230939074
0.9466426446137584 0.3073501833589736
0.91716571015208 0.3232902877080434
0.9068623315075556 0.3040014845265486
0.8876934241174479 0.4509411376979662
0.9608528304082758 0.33271616404197957
0.7536538389832645 0.9262946524400024
0.7296913884907124 0.8175209583978258
0.7806747108957219 0.6171753850007097
0.8689442922190114 0.49611229415926844
Mean R^2 score => 0.8644944243957715
Mean RMSE => 0.5008575179640709
Mean time => 8.073555734799992
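
Because the folds are not shuffled, each test fold is a contiguous block of rows. The following sketch reproduces the fold boundaries in plain Python, assuming the same sizing rule as scikit-learn's `KFold` (the first `n_samples % n_splits` folds get one extra row). The function name `contiguous_folds` is illustrative, and 7674 is the row count reported by `describe()` after cleaning.

```python
def contiguous_folds(n_samples, n_splits):
    """Yield (start, stop) test-row ranges for unshuffled k-fold splitting."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)  # earlier folds absorb the remainder
        yield start, start + size
        start += size


folds = list(contiguous_folds(7674, 10))
print(folds[0])   # (0, 768)
print(folds[-1])  # (6907, 7674)
```

Since the rows are in time order, every test fold except the first and last is surrounded by training data from both earlier and later periods, which is worth keeping in mind when reading the per-fold scores.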


## example_wp_log_peyton_manning

**Source:** https://github.com/facebook/prophet/blob/master/examples/example_wp_log_peyton_manning.csv

**Data Set Information:** The dataset contains 2905 instances of the log daily page views for the Wikipedia page for Peyton Manning.

**Attribute Information:**
- ds: Date (YYYY-MM-DD).
- y: Log of the number of daily page views.

The data set is loaded and an example is shown.

In [25]:
df = pd.read_csv('../examples/example_wp_log_peyton_manning.csv')
df.head()

Unnamed: 0,ds,y
0,2007-12-10,9.590761
1,2007-12-11,8.51959
2,2007-12-12,8.183677
3,2007-12-13,8.072467
4,2007-12-14,7.893572


The Prophet model is defined using only "ds" and "y" variables.

In [26]:
def train(df):
    m = Prophet()
    m.fit(df)
    return m

A 10-fold cross-validation is performed. For each fold the R^2 score, the RMSE and the training time are recorded, and their means are reported at the end.

In [27]:
kf = KFold(n_splits=10, shuffle=False)  # random_state only applies when shuffle=True
r2score_values = []
rmse_values = []
time_values = []

for train_index, test_index in kf.split(df):
    df_train, df_test = df.iloc[train_index, :], df.iloc[test_index, :]
    start = timer()
    m = train(df_train)
    end = timer()
    time_values.append(end-start)
    y_pred = m.predict(df_test)['yhat'].values
    y_true = df_test['y'].values
    r2score = r2_score(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2score_values.append(r2score)
    rmse_values.append(rmse)
    print(r2score, rmse)

print(f'Mean R^2 score => {np.mean(r2score_values)}\nMean RMSE => {np.mean(rmse_values)}\nMean time => {np.mean(time_values)}')

-1.2956353413247914 1.064864712083494
-0.09232581304821852 0.49141540619394875
0.5217030611388819 0.5483008438936049
0.5713890931451139 0.5405045950359735
0.43835193729599775 0.5942893499174834
0.030750346783670723 0.8729621395573418
0.49777469603625835 0.5250181048908731
0.5434010849987404 0.6833648614731199
0.623729232212226 0.4494222678004932
0.5799077498242129 0.4790649639733912
Mean R^2 score => 0.24190460470620923
Mean RMSE => 0.6249207244819723
Mean time => 1.2993268623999028
