In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

# Modeling

**Notes**
- Using Facebook Prophet to capture trend, seasonality, and other event effects when forecasting our traffic time series.

In [9]:
modeling_data = pd.read_csv("../data/monthly_data.csv")

In [10]:
modeling_data = modeling_data.reset_index() 
modeling_data.head()

,index,all_motor_vehicles
0,0,1603322.0
1,1,1475538.0
2,2,2848495.0
3,3,2814927.0
4,4,1646437.0


In [11]:
modeling_data.columns

Index(['index', 'all_motor_vehicles'], dtype='object')

In [12]:
modeling_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 285 entries, 0 to 284
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   index               285 non-null    int64  
 1   all_motor_vehicles  285 non-null    float64
dtypes: float64(1), int64(1)
memory usage: 4.6 KB


In [13]:
# Determine target variable and feature

prophet_data = modeling_data[['timestamp', 'all_motor_vehicles']].rename(columns={'timestamp': 'ds', 'all_motor_vehicles': 'y'})


KeyError: "['timestamp'] not in index"
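
The `KeyError` follows from the column listing above: after `reset_index()` the frame holds only `index` and `all_motor_vehicles`, so there is no `timestamp` column to select. A minimal sketch of one possible workaround follows, assuming the rows are consecutive monthly observations in chronological order. `START_DATE` is a hypothetical placeholder, not a value taken from the data, and must be replaced with the true first month.

```python
import pandas as pd

# Stand-in for modeling_data after reset_index(): the first five rows shown
# in the notebook output, with no date column available.
modeling_data = pd.DataFrame({
    "index": range(5),
    "all_motor_vehicles": [1603322.0, 1475538.0, 2848495.0, 2814927.0, 1646437.0],
})

# ASSUMPTION: rows are consecutive months in order. START_DATE is hypothetical.
START_DATE = "2000-01-01"

# Prophet expects a datetime column named 'ds' and a numeric column named 'y'.
prophet_data = pd.DataFrame({
    "ds": pd.date_range(start=START_DATE, periods=len(modeling_data), freq="MS"),
    "y": modeling_data["all_motor_vehicles"].to_numpy(),
})

print(prophet_data.dtypes)
```

If the CSV can be regenerated with its date column intact, selecting and renaming that column as in the cell above remains the simpler route.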

In [7]:
prophet_data.head()

NameError: name 'prophet_data' is not defined

In [None]:
prophet_data['y'].head(10)

In [None]:
# Visualize the time series
plt.figure(figsize=(10, 6))
plt.plot(prophet_data['ds'], prophet_data['y'])
plt.xlabel('Date')
plt.ylabel('Target Variable')
plt.title('Time Series Plot')
plt.show()

**Observations:**
- The series fluctuates substantially over time, with frequent peaks and troughs over short intervals.
- The peaks and troughs recur at regular intervals, suggesting that vehicle counts follow a seasonal pattern.

**Note:**
- We will utilize Facebook Prophet to capture the underlying seasonality and trend in order to make reliable predictions.

### Train/Test Split

In [None]:
prophet_data.info()

In [None]:
ptrain = prophet_data.loc[prophet_data.ds <= "2020-12-01", :] # training dataset to include all data points less than or equal to December 1, 2020.
ptest = prophet_data.loc[prophet_data.ds > "2020-12-01", :] # test dataset to include all data points greater than December 1, 2020.
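
The split above is chronological: everything up to the cutoff trains the model and everything after it is held out. The sketch below reproduces the same idea on a toy frame with invented values. It assumes `ds` has a datetime dtype; comparing string dates against a date string relies on lexicographic ordering and is more fragile.

```python
import pandas as pd

# Toy monthly series; dates and values are illustrative only.
toy = pd.DataFrame({
    "ds": pd.date_range("2020-09-01", periods=6, freq="MS"),
    "y": [10.0, 12.0, 11.0, 13.0, 9.0, 14.0],
})

cutoff = pd.Timestamp("2020-12-01")

# Chronological split: no shuffling, so the test set lies strictly after
# the last training date.
train = toy.loc[toy["ds"] <= cutoff]
test = toy.loc[toy["ds"] > cutoff]

print(len(train), len(test))
print(train["ds"].max(), test["ds"].min())
```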

In [None]:
ptrain.head()

In [None]:
ptrain.shape

In [None]:
ptest.head()

In [None]:
# Build and fit the model on the training data
from prophet import Prophet

model = Prophet() # Initialize the Prophet model
model.fit(ptrain) # Fit the model 

### Create Future Dates for Predictions

In [None]:
# Forecasting into the future
future = model.make_future_dataframe(periods=24, freq='MS')  

In [None]:
future.tail()
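
To make the shape of `future` concrete, here is a pandas-only sketch of what to expect from the call above, assuming the default behaviour of keeping the historical dates and appending `periods` new ones. `future_month_starts` is a hypothetical helper written for illustration, not Prophet's own code.

```python
import pandas as pd

def future_month_starts(history_ds: pd.Series, periods: int) -> pd.DataFrame:
    """Return historical dates followed by `periods` new month-start dates."""
    last = history_ds.max()
    # Generate one extra date and drop the first, which equals `last`.
    new_dates = pd.date_range(start=last, periods=periods + 1, freq="MS")[1:]
    all_dates = pd.concat([history_ds, pd.Series(new_dates)], ignore_index=True)
    return pd.DataFrame({"ds": all_dates})

# Toy history: the twelve month starts of 2020 (illustrative only).
history = pd.Series(pd.date_range("2020-01-01", "2020-12-01", freq="MS"))
future_sketch = future_month_starts(history, periods=24)

print(len(future_sketch))             # 12 historical rows + 24 new rows
print(future_sketch["ds"].iloc[-1])   # last forecast date
```

With training data ending in December 2020, 24 monthly periods extend the frame through December 2022, so test dates later than that would have no forecast.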

### Make Predictions

In [None]:
# Generate forecasts for every date in `future` (training history plus 24 future months)
forecast = model.predict(future)

In [None]:
forecast.head()

In [None]:
forecast.tail()

In [None]:
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper', 'trend_lower', 'trend_upper']].tail()

In [None]:
from prophet.plot import plot_plotly, plot_components_plotly
plot_plotly(model, forecast)

**Key Insights**:

**Seasonal Patterns**:

The time series shows consistent seasonal peaks and troughs. The model captures these recurring patterns, which reflect the yearly cycle in the all_motor_vehicles data. The fitted seasonal pattern remains relatively stable from year to year.

**Uncertainty Intervals**:

The shaded blue areas represent the uncertainty intervals around the forecast. These intervals are narrow over most of the plotted period, indicating relatively low uncertainty where the model has observed data. As the forecast extends past the end of the training data (December 2020), the intervals widen slightly, reflecting the usual growth of uncertainty with forecast horizon.

**Fit with Historical Data:**

The black dots represent the actual historical data points, and the blue line shows the model’s fitted values. The model closely aligns with the actual data, capturing the peaks and troughs of the seasonal cycles. However, some black dots fall outside the predicted range, which may be outliers that the model did not capture.

**Trend:**

Seasonal fluctuations dominate the forecast plot, and at this scale the overall trend shows no abrupt upward or downward movement. The earlier decomposition likewise revealed no distinct trend after detrending the data. The component plot isolates the trend more precisely.
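
The remark about points falling outside the shaded band can be quantified as interval coverage: the share of observations that lie between `yhat_lower` and `yhat_upper`. The sketch below computes it on invented numbers, and `interval_coverage` is a hypothetical helper. Prophet's default `interval_width` is 0.8, so with default settings roughly a fifth of points outside the band would not by itself signal a problem.

```python
import pandas as pd

def interval_coverage(df: pd.DataFrame) -> float:
    """Fraction of rows whose actual value y lies inside [yhat_lower, yhat_upper]."""
    inside = (df["y"] >= df["yhat_lower"]) & (df["y"] <= df["yhat_upper"])
    return float(inside.mean())

# Toy actuals and interval bounds, for illustration only.
toy = pd.DataFrame({
    "y":          [100.0, 120.0,  90.0, 200.0],
    "yhat_lower": [ 90.0, 100.0,  95.0, 150.0],
    "yhat_upper": [110.0, 130.0, 105.0, 180.0],
})

coverage = interval_coverage(toy)
print(f"coverage: {coverage:.2f}")
```

On the notebook's data, the same function could be applied to `forecast` merged with `prophet_data` on `ds`.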


In [None]:
plot_components_plotly(model, forecast)

**Component Plot Interpretation**

The component plot function dissects the forecast into its trend and yearly seasonality components, providing a deeper understanding of how the model interprets the data.

- **Trend Component (Top Plot):**
The trend line declines steadily over time, from around 1.2 million vehicles at the start of the series to around 0.8 million in 2020. According to the Prophet model, overall vehicle counts have been decreasing over the years.

- **Yearly Seasonality Component (Bottom Plot):**
The yearly seasonality plot illustrates how the target variable changes throughout the year, capturing regular fluctuations that repeat each year. It reveals higher vehicle counts around August and September, followed by a dip toward the end of the year, likely corresponding to decreased travel or vehicle usage during the colder months.

There's another peak around May, followed by a slight drop during the summer months before the large spike in August. The lowest points appear to be around January and November/December, suggesting lower vehicle activity during these months, possibly related to seasonal holidays or weather conditions.

**Notes:**
- Overall, the model identifies a significant long-term decline in vehicle counts.
- There are predictable, recurring seasonal peaks and troughs, with higher activity in spring and late summer (especially around August) and lower activity in winter months (January, November, and December).
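
A quick cross-check on the yearly seasonality plot is to average the raw series by calendar month and see which months sit above or below the overall mean. The sketch below does this on a synthetic two-year series in which August is raised and January lowered by construction. The values are invented and only the technique carries over.

```python
import pandas as pd

# Synthetic monthly series: flat at 100, with a built-in August peak
# and January trough (illustrative values only).
toy = pd.DataFrame({"ds": pd.date_range("2018-01-01", periods=24, freq="MS")})
toy["y"] = 100.0
toy.loc[toy["ds"].dt.month == 8, "y"] = 150.0
toy.loc[toy["ds"].dt.month == 1, "y"] = 70.0

# Average by calendar month, then express each month relative to the mean.
monthly_profile = toy.groupby(toy["ds"].dt.month)["y"].mean()
deviation = monthly_profile - monthly_profile.mean()

peak_month = int(deviation.idxmax())
trough_month = int(deviation.idxmin())
print(peak_month, trough_month)
```

This ignores the trend, so on a series with a strong long-term decline it is better applied to detrended values.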

# Model Evaluation

In [None]:
# Extract columns from the forecast and test data
forecasted_values = forecast[['ds', 'yhat']].set_index('ds')
actual_values = ptest.set_index('ds')  

comparison = actual_values.join(forecasted_values, how='left')

In [None]:
comparison.isnull().sum()

In [None]:
# Keep only the rows that have a forecast (drop test dates missing from `forecast`)
comparison_cleaned = comparison.dropna(subset=['yhat'])
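
The left join keeps every test date and fills `yhat` with NaN wherever the forecast has no matching `ds`, which is why the null count is checked before dropping rows. A small sketch on invented values, reusing the notebook's variable names for readability:

```python
import pandas as pd

# Three actual months, but a forecast for only the first two (toy values).
actual_values = pd.DataFrame(
    {"y": [10.0, 12.0, 11.0]},
    index=pd.DatetimeIndex(["2021-01-01", "2021-02-01", "2021-03-01"], name="ds"),
)
forecasted_values = pd.DataFrame(
    {"yhat": [9.5, 12.5]},
    index=pd.DatetimeIndex(["2021-01-01", "2021-02-01"], name="ds"),
)

# Left join on the shared 'ds' index: unmatched dates get NaN in 'yhat'.
comparison = actual_values.join(forecasted_values, how="left")
n_missing = int(comparison["yhat"].isnull().sum())

# Metrics cannot be computed on NaN, so keep only the matched rows.
comparison_cleaned = comparison.dropna(subset=["yhat"])
print(n_missing, len(comparison_cleaned))
```

A large number of dropped rows would mean the metrics describe only part of the test period, so the count is worth reporting alongside them.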

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(comparison_cleaned['y'], comparison_cleaned['yhat'])
rmse = np.sqrt(mean_squared_error(comparison_cleaned['y'], comparison_cleaned['yhat']))

print(f'MAE: {mae}')
print(f'RMSE: {rmse}')

**Notes**

On average, the forecasts differ from the actual values by about 452,833 vehicles (MAE), and the root mean squared error (RMSE) is about 586,966 vehicles. Because RMSE penalizes large errors more heavily than MAE, the gap between the two indicates that some predictions miss by a wide margin.

- Accuracy: Both the MAE (452,833) and RMSE (586,966) are high relative to the scale of the monthly counts, so the model’s predictions are substantially off on average.
- The original plot shows large fluctuations in the time series, which can make accurate prediction more challenging.
- The presence of outliers could also increase the RMSE.
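
To show why RMSE exceeds MAE when a few errors are large, the sketch below computes both by hand with NumPy on invented values. It also computes MAPE, which is not used in the notebook and is included only as one way to put errors on a relative scale.

```python
import numpy as np

# Invented actual and predicted values, for illustration only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 360.0])

errors = y_pred - y_true

# MAE weights every error equally.
mae = np.mean(np.abs(errors))
# RMSE squares errors first, so large misses count disproportionately.
rmse = np.sqrt(np.mean(errors ** 2))
# MAPE expresses the absolute error as a percentage of the actual value.
mape = np.mean(np.abs(errors) / np.abs(y_true)) * 100

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAPE: {mape:.2f}%")
```

RMSE is never smaller than MAE on the same errors, and the two are equal only when every absolute error is the same size.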

# Conclusion

Although the Prophet model offers a sound baseline and captures the overall trend and seasonality, its accuracy appears to be limited by outliers or specific events. The model predicts the timing of seasonal highs and lows well, but the large errors against the actual values (MAE of about 452,833 and RMSE of about 586,966) indicate that additional variables may be needed to improve precision.