## Structure of Prophet
<center><img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-GPXX032NEN/images/prophet_structure-ImResizer.jpg" width="700" height="250"></center>
Prophet is well suited to time series with multiple seasonalities, such as weekly and yearly cycles. At its core, it models the series as the sum of three functions of time:
1) growth g(t)
2) seasonality s(t)
3) holidays h(t)

An error term e_t captures whatever these three components do not explain:
<center><img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-GPXX032NEN/images/formula_prophet.png" width="500" height="350"></center>
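
In text form, the decomposition reads as follows. This is a transcription using the component names from the list above, with y(t) denoting the observed value at time t (a symbol assumed here, since the list does not name it):

$$
y(t) = g(t) + s(t) + h(t) + e_t
$$

Here g(t) is the trend, s(t) collects the periodic effects, h(t) adds the holiday effects, and e_t is the residual.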


## About Dataset
India is the world's third-largest producer and third-largest consumer of electricity. The national electric grid had an installed capacity of 370.106 GW as of 31 March 2020. Renewable power plants, including large hydroelectric plants, make up 35.86% of India's total installed capacity. During the 2018-19 fiscal year, utilities in India generated 1,372 TWh of electricity (gross), and total generation (utilities and non-utilities) was 1,547 TWh. Gross electricity consumption in 2018-19 was 1,181 kWh per capita. In 2015-16, agriculture accounted for 17.89% of electric energy consumption, recorded as the highest agricultural share worldwide. Per capita electricity consumption remains low compared to most other countries, even though India has a low electricity tariff.

In [2]:
import pandas as pd
from prophet import Prophet
from matplotlib import pyplot
from matplotlib.pyplot import figure
from sklearn.metrics import mean_absolute_error
import plotly.express as px
import plotly.graph_objects as go

### Time series analysis of power consumption in India (2019-2020)

In [3]:
df=pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-GPXX032NEN/images/data/long_data_.csv')

In [4]:
df.head()

Unnamed: 0,States,Regions,latitude,longitude,Dates,Usage
0,Punjab,NR,31.519974,75.980003,02/01/2019 00:00:00,119.9
1,Haryana,NR,28.450006,77.019991,02/01/2019 00:00:00,130.3
2,Rajasthan,NR,26.449999,74.639981,02/01/2019 00:00:00,234.1
3,Delhi,NR,28.669993,77.230004,02/01/2019 00:00:00,85.8
4,UP,NR,27.599981,78.050006,02/01/2019 00:00:00,313.9


In [5]:
df.shape

(16599, 6)

In [7]:
df.dtypes

States        object
Regions       object
latitude     float64
longitude    float64
Dates         object
Usage        float64
dtype: object

In [9]:
df['Dates'] = pd.to_datetime(df['Dates'], dayfirst=True)
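
The raw `Dates` strings look like `02/01/2019 00:00:00`, which could mean 2 January or 1 February. `dayfirst=True` expresses a preference rather than a strict rule, so an explicit `format` is a stricter alternative. The sketch below assumes every string follows `%d/%m/%Y %H:%M:%S`, which has not been verified against the full file; the sample values are illustrative.

```python
import pandas as pd

# Toy strings in the same shape as the raw `Dates` column (values are illustrative).
raw = pd.Series(["02/01/2019 00:00:00", "13/01/2019 00:00:00"])

# An explicit format is strict: a string that does not match raises an error
# instead of being silently reinterpreted as month-first.
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M:%S")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

If the real column mixes day-first and month-first strings, this call would fail loudly, which is usually preferable to a quietly shifted date range.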

In [11]:
# Let's first examine the data structure
print("Dataset info:")
print(df.info())
print("\nDataset head:")
print(df.head())
print("\nColumn names:")
print(df.columns.tolist())

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16599 entries, 0 to 16598
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   States     16599 non-null  object        
 1   Regions    16599 non-null  object        
 2   latitude   16599 non-null  float64       
 3   longitude  16599 non-null  float64       
 4   Dates      16599 non-null  datetime64[ns]
 5   Usage      16599 non-null  float64       
dtypes: datetime64[ns](1), float64(3), object(2)
memory usage: 778.2+ KB
None

Dataset head:
      States Regions   latitude  longitude      Dates  Usage
0     Punjab      NR  31.519974  75.980003 2019-01-02  119.9
1    Haryana      NR  28.450006  77.019991 2019-01-02  130.3
2  Rajasthan      NR  26.449999  74.639981 2019-01-02  234.1
3      Delhi      NR  28.669993  77.230004 2019-01-02   85.8
4         UP      NR  27.599981  78.050006 2019-01-02  313.9

Column names:
['States', 'Regions', 'latit

In [12]:
# Group by date and sum Usage across all states,
# giving one national total per day
df_grouped = df.groupby('Dates', as_index=False)['Usage'].sum()  # use .mean() instead for a per-state average
print("Aggregated data by date:")
print(df_grouped.head(10))
print(f"\nShape: {df_grouped.shape}")
print(f"Date range: {df_grouped['Dates'].min()} to {df_grouped['Dates'].max()}")

Aggregated data by date:
       Dates   Usage
0 2019-01-02  3373.4
1 2019-01-03  3403.7
2 2019-01-04  3304.1
3 2019-01-05  3308.9
4 2019-01-06  3316.9
5 2019-01-07  3312.1
6 2019-01-08  3196.9
7 2019-01-09  3058.7
8 2019-01-10  3151.4
9 2019-01-11  3074.8

Shape: (498, 2)
Date range: 2019-01-02 00:00:00 to 2020-12-05 00:00:00
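
The aggregated series has 498 rows, yet the span from 2019-01-02 to 2020-12-05 covers 704 calendar days. About 206 dates therefore have no observation, or some dates were parsed with day and month swapped. A quick way to count the gaps is sketched below on a toy frame; the column names mirror `df_grouped`, and the values are illustrative.

```python
import pandas as pd

# Toy daily series with two missing days (values are illustrative, not from the dataset).
daily = pd.DataFrame({
    "Dates": pd.to_datetime(["2019-01-02", "2019-01-03", "2019-01-06"]),
    "Usage": [3373.4, 3403.7, 3316.9],
})

# Every calendar day between the first and last observation.
full_range = pd.date_range(daily["Dates"].min(), daily["Dates"].max(), freq="D")

# Days in the span that have no observation.
missing = full_range.difference(pd.DatetimeIndex(daily["Dates"]))

print(f"Calendar days in span: {len(full_range)}")
print(f"Observed days: {daily['Dates'].nunique()}")
print(f"Missing days: {len(missing)}")
```

Running the same three steps on `df_grouped` would show where the missing dates cluster, which helps to tell real gaps apart from a parsing problem.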


In [13]:
# Prepare data for Prophet (requires 'ds' and 'y' columns)
prophet_data = df_grouped.rename(columns={'Dates': 'ds', 'Usage': 'y'})
print("Prophet data format:")
print(prophet_data.head())
print(f"\nTotal daily observations: {len(prophet_data)}")
print(f"Date range: {prophet_data['ds'].min().strftime('%Y-%m-%d')} to {prophet_data['ds'].max().strftime('%Y-%m-%d')}")

# Basic statistics
print(f"\nUsage statistics:")
print(f"Mean daily usage: {prophet_data['y'].mean():.1f}")
print(f"Min daily usage: {prophet_data['y'].min():.1f}")
print(f"Max daily usage: {prophet_data['y'].max():.1f}")
print(f"Standard deviation: {prophet_data['y'].std():.1f}")

Prophet data format:
          ds       y
0 2019-01-02  3373.4
1 2019-01-03  3403.7
2 2019-01-04  3304.1
3 2019-01-05  3308.9
4 2019-01-06  3316.9

Total daily observations: 498
Date range: 2019-01-02 to 2020-12-05

Usage statistics:
Mean daily usage: 3433.2
Min daily usage: 2555.8
Max daily usage: 6395.0
Standard deviation: 406.3


In [14]:
# Keep only the date and usage columns (still one row per state per date)
df=df[['Dates','Usage']]

In [15]:
fig = px.line(df, x='Dates', y='Usage')
fig.show()

In [16]:
df.columns = ['ds','y']

In [17]:
df.head()

Unnamed: 0,ds,y
0,2019-01-02,119.9
1,2019-01-02,130.3
2,2019-01-02,234.1
3,2019-01-02,85.8
4,2019-01-02,313.9


## Initialize the Model

In [18]:
model = Prophet()

In [19]:
model.fit(df)

18:36:59 - cmdstanpy - INFO - Chain [1] start processing
18:37:01 - cmdstanpy - INFO - Chain [1] done processing


<prophet.forecaster.Prophet at 0x21fff6d24b0>

In [20]:
model.component_modes

{'additive': ['weekly',
  'additive_terms',
  'extra_regressors_additive',
  'holidays'],
 'multiplicative': ['multiplicative_terms', 'extra_regressors_multiplicative']}

In [21]:
future_dates = model.make_future_dataframe(periods=365,freq='d',include_history=True)
future_dates.shape

(863, 1)

In [22]:
future_dates.head()

Unnamed: 0,ds
0,2019-01-02
1,2019-01-03
2,2019-01-04
3,2019-01-05
4,2019-01-06


In [23]:
prediction=model.predict(future_dates)

In [24]:
prediction.head()

Unnamed: 0,ds,trend,yhat_lower,yhat_upper,trend_lower,trend_upper,additive_terms,additive_terms_lower,additive_terms_upper,weekly,weekly_lower,weekly_upper,multiplicative_terms,multiplicative_terms_lower,multiplicative_terms_upper,yhat
0,2019-01-02,101.646067,-45.270672,246.462802,101.646067,101.646067,-0.203456,-0.203456,-0.203456,-0.203456,-0.203456,-0.203456,0.0,0.0,0.0,101.442611
1,2019-01-03,101.651124,-33.961657,254.388102,101.651124,101.651124,-0.052159,-0.052159,-0.052159,-0.052159,-0.052159,-0.052159,0.0,0.0,0.0,101.598966
2,2019-01-04,101.656182,-48.59993,254.417322,101.656182,101.656182,-0.0215,-0.0215,-0.0215,-0.0215,-0.0215,-0.0215,0.0,0.0,0.0,101.634682
3,2019-01-05,101.661239,-47.487471,253.0796,101.661239,101.661239,-0.175322,-0.175322,-0.175322,-0.175322,-0.175322,-0.175322,0.0,0.0,0.0,101.485917
4,2019-01-06,101.666296,-50.856492,251.270968,101.666296,101.666296,0.217488,0.217488,0.217488,0.217488,0.217488,0.217488,0.0,0.0,0.0,101.883784


In [25]:
trace_open = go.Scatter(
    x = prediction["ds"],
    y = prediction["yhat"],
    mode = 'lines',
    name="Forecast"
)
trace_high = go.Scatter(
    x = prediction["ds"],
    y = prediction["yhat_upper"],
    mode = 'lines',
    fill = "tonexty", 
    line = {"color": "#57b8ff"}, 
    name="Higher uncertainty interval"
)
trace_low = go.Scatter(
    x = prediction["ds"],
    y = prediction["yhat_lower"],
    mode = 'lines',
    fill = "tonexty", 
    line = {"color": "#57b8ff"}, 
    name="Lower uncertainty interval"
)
trace_close = go.Scatter(
    x = df["ds"],
    y = df["y"],
    name="Data values"
)

# Collect all four Scatter traces in one list.
data = [trace_open, trace_high, trace_low, trace_close]
# Build a Layout object; `title` sets the chart title.
layout = go.Layout(title="Power consumption forecasting")

# go.Figure takes a list of traces plus an optional layout.
fig = go.Figure(data=data, layout=layout)
fig.show()

In [26]:

fig = go.Figure([go.Scatter(x=df['ds'], y=df['y'], mode='lines',
                            name='Actual')])
# add_trace appends another trace to an existing figure.
fig.add_trace(go.Scatter(x=prediction['ds'], y=prediction['yhat'],
                         mode='lines+markers',
                         name='Predicted'))
# fig.show() displays the figure with plotly's current default renderer.
fig.show()

In [28]:
# Check data dimensions first
print("Data dimensions:")
print(f"prophet_data shape: {prophet_data.shape}")
print(f"prediction shape: {prediction.shape}")
print(f"Length of training data: {len(prophet_data)}")

# Use the correct aggregated data (prophet_data) for evaluation
# prophet_data contains the daily aggregated values used for training
y_true = prophet_data['y'].values  # Use aggregated daily data, not raw data

# Get predictions for the training period only (first 498 predictions)
y_pred = prediction['yhat'][:len(prophet_data)].values

print(f"\ny_true length: {len(y_true)}")
print(f"y_pred length: {len(y_pred)}")

# Calculate Mean Absolute Error
mae = mean_absolute_error(y_true, y_pred)
print(f'\nModel Performance:')
print(f'MAE: {mae:.3f}')

# Calculate additional metrics
import numpy as np
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(f'MAPE: {mape:.2f}%')
print(f'RMSE: {rmse:.3f}')

Data dimensions:
prophet_data shape: (498, 2)
prediction shape: (863, 16)
Length of training data: 498

y_true length: 498
y_pred length: 498

Model Performance:
MAE: 3330.166
MAPE: 96.96%
RMSE: 3354.800
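
A MAPE near 97% with an MAE close to the mean daily usage (3433.2) suggests that predictions and actuals sit on different scales. The fitted values in `prediction.head()` are around 101, while daily totals are around 3,400, and |3433 - 101| / 3433 is roughly 0.97. The sketch below reproduces that pattern with made-up numbers; `regression_metrics` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MAE, MAPE in percent, RMSE) for two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    mape = np.mean(np.abs(errors / y_true)) * 100  # undefined if y_true contains zeros
    rmse = np.sqrt(np.mean(errors ** 2))
    return mae, mape, rmse

# Illustrative numbers only: daily totals near 3,400 against predictions near 100,
# mimicking a model fitted on per-state rows and scored against national totals.
y_true = np.array([3400.0, 3300.0, 3500.0])
y_pred = np.array([102.0, 99.0, 105.0])

mae, mape, rmse = regression_metrics(y_true, y_pred)
print(f"MAE: {mae:.1f}  MAPE: {mape:.1f}%  RMSE: {rmse:.1f}")
```

When every prediction is about 3% of the actual value, MAPE lands at about 97% regardless of how well the model tracks the shape of the series.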


In [29]:
# Let's check what data was actually used for training
print("Training data verification:")
print("df shape:", df.shape)
print("df columns:", df.columns.tolist())
print("df head:")
print(df.head())
print("\ndf data type and range:")
print(f"Date range: {df['ds'].min()} to {df['ds'].max()}")
print(f"Usage range: {df['y'].min():.1f} to {df['y'].max():.1f}")

# The issue might be that we trained on the raw data (16,599 rows) 
# but are evaluating against aggregated data (498 rows)
# Let's retrain the model with the correct aggregated data

Training data verification:
df shape: (16599, 2)
df columns: ['ds', 'y']
df head:
          ds      y
0 2019-01-02  119.9
1 2019-01-02  130.3
2 2019-01-02  234.1
3 2019-01-02   85.8
4 2019-01-02  313.9

df data type and range:
Date range: 2019-01-02 00:00:00 to 2020-12-05 00:00:00
Usage range: 0.3 to 522.1


In [30]:
model1=Prophet(daily_seasonality=True).add_seasonality(name='yearly',period=365,fourier_order=70)

In [31]:
model1.fit(df)

18:41:35 - cmdstanpy - INFO - Chain [1] start processing
18:41:39 - cmdstanpy - INFO - Chain [1] done processing


<prophet.forecaster.Prophet at 0x21fff0e0e90>

In [32]:
model1.component_modes

{'additive': ['yearly',
  'weekly',
  'daily',
  'additive_terms',
  'extra_regressors_additive',
  'holidays'],
 'multiplicative': ['multiplicative_terms', 'extra_regressors_multiplicative']}

In [33]:
future_dates1=model1.make_future_dataframe(periods=365)

In [34]:
prediction1=model1.predict(future_dates1)

In [43]:
print(" UNDERSTANDING THE DATA MISMATCH")

# The issue: Prophet works with time series (unique dates)
# But our training data has multiple records per date (different states)

# Let's check the unique dates in training data
unique_dates_in_df = df['ds'].nunique()
print(f"Unique dates in training data: {unique_dates_in_df}")
print(f"Total records in training data: {len(df)}")
print(f"Records per date (avg): {len(df) / unique_dates_in_df:.1f}")

print(f"\nPrediction data:")
print(f"Prediction records: {len(prediction1)}")
print(f"Future dates generated: {len(future_dates1)}")

# The correct approach: Use the aggregated daily data for evaluation
print(f"\nCORRECT EVALUATION APPROACH")
print("Since Prophet predicts daily totals, we should use daily aggregated data for evaluation")

# Use the aggregated data that matches the Prophet model's logic
# We need to align the predictions with the actual aggregated training period
training_end_date = prophet_data['ds'].max()
training_predictions = prediction1[prediction1['ds'] <= training_end_date]

print(f"Training period end date: {training_end_date}")
print(f"Training predictions shape: {training_predictions.shape}")
print(f"Prophet aggregated data shape: {prophet_data.shape}")

# Now evaluate using the correct data
y_true = prophet_data['y'].values  # Daily aggregated actual values
y_pred = training_predictions['yhat'].values  # Daily aggregated predictions

print(f"\nFinal evaluation dimensions:")
print(f"y_true length: {len(y_true)}")
print(f"y_pred length: {len(y_pred)}")

if len(y_true) == len(y_pred):
    # Calculate evaluation metrics
    from sklearn.metrics import mean_absolute_error
    import numpy as np

    mae = mean_absolute_error(y_true, y_pred)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

    print(f'\n Model Performance Metrics:')
    print(f'MAE (Mean Absolute Error): {mae:.2f} MW')
    print(f'MAPE (Mean Absolute Percentage Error): {mape:.2f}%')
    print(f'RMSE (Root Mean Square Error): {rmse:.2f} MW')

    # Performance interpretation
    if mape < 10:
        performance = "Excellent"
    elif mape < 20:
        performance = "Good"
    elif mape < 50:
        performance = "Reasonable"
    else:
        performance = "Poor"

    print(f'\nOverall Performance: {performance}')
    print(f'Average daily prediction error: ±{mae:.1f} MW')
    print(f'Mean daily consumption: {prophet_data["y"].mean():.1f} MW')
else:
    print(" Dimension mismatch still exists. Need to investigate further.")

 UNDERSTANDING THE DATA MISMATCH
Unique dates in training data: 498
Total records in training data: 16599
Records per date (avg): 33.3

Prediction data:
Prediction records: 863
Future dates generated: 863

CORRECT EVALUATION APPROACH
Since Prophet predicts daily totals, we should use daily aggregated data for evaluation
Training period end date: 2020-12-05 00:00:00
Training predictions shape: (498, 22)
Prophet aggregated data shape: (498, 2)

Final evaluation dimensions:
y_true length: 498
y_pred length: 498

 Model Performance Metrics:
MAE (Mean Absolute Error): 3330.06 MW
MAPE (Mean Absolute Percentage Error): 96.98%
RMSE (Root Mean Square Error): 3354.19 MW

Overall Performance: Poor
Average daily prediction error: ±3330.1 MW
Mean daily consumption: 3433.2 MW


## Model Retraining with Correct Data

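The poor performance arises because both earlier models were fitted on `df`, which holds individual state records (16,599 rows, values from 0.3 to 522.1), and were then scored against the aggregated daily totals in `prophet_data` (498 rows, mean 3433.2). Let's retrain the model on the aggregated daily data.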

In [44]:
# Train a new model with the correct aggregated daily data
print(" Training Prophet model with properly aggregated daily data...")

# Create new model with enhanced seasonality
model_corrected = Prophet(
    daily_seasonality=True,
    weekly_seasonality=True,
    yearly_seasonality=True,
    seasonality_mode='multiplicative'  # seasonal effects scale with the trend rather than adding a fixed amount
).add_seasonality(
    name='monthly',
    period=30.5,
    fourier_order=5
)

# Train on aggregated daily data (prophet_data)
model_corrected.fit(prophet_data)

print(" Model training completed!")
print(" Model components:", model_corrected.component_modes)

# Make predictions
future_corrected = model_corrected.make_future_dataframe(periods=365, freq='D')
prediction_corrected = model_corrected.predict(future_corrected)

print(f"\nPrediction data shape: {prediction_corrected.shape}")
print(f"Training data shape: {prophet_data.shape}")

# Evaluate the corrected model
training_length = len(prophet_data)
y_true_corrected = prophet_data['y'].values
y_pred_corrected = prediction_corrected['yhat'][:training_length].values

# Calculate metrics
mae_corrected = mean_absolute_error(y_true_corrected, y_pred_corrected)
mape_corrected = np.mean(np.abs((y_true_corrected - y_pred_corrected) / y_true_corrected)) * 100
rmse_corrected = np.sqrt(np.mean((y_true_corrected - y_pred_corrected) ** 2))

print(f'\n CORRECTED Model Performance:')
print(f'MAE: {mae_corrected:.2f} MW')
print(f'MAPE: {mape_corrected:.2f}%')
print(f'RMSE: {rmse_corrected:.2f} MW')

# Performance comparison
print(f'\n Performance Improvement:')
print(f'Previous MAPE: 96.98%')
print(f'New MAPE: {mape_corrected:.2f}%')
improvement = ((96.98 - mape_corrected) / 96.98) * 100
print(f'Improvement: {improvement:.1f}%')

if mape_corrected < 10:
    performance = "Excellent "
elif mape_corrected < 20:
    performance = "Good "
elif mape_corrected < 50:
    performance = "Reasonable "
else:
    performance = "Needs Improvement"

print(f' New Performance Rating: {performance}')

 Training Prophet model with properly aggregated daily data...


18:48:53 - cmdstanpy - INFO - Chain [1] start processing
18:48:54 - cmdstanpy - INFO - Chain [1] done processing


 Model training completed!
 Model components: {'additive': ['additive_terms', 'extra_regressors_additive'], 'multiplicative': ['monthly', 'yearly', 'weekly', 'daily', 'multiplicative_terms', 'extra_regressors_multiplicative', 'holidays']}

Prediction data shape: (863, 25)
Training data shape: (498, 2)

 CORRECTED Model Performance:
MAE: 258.75 MW
MAPE: 7.44%
RMSE: 348.22 MW

 Performance Improvement:
Previous MAPE: 96.98%
New MAPE: 7.44%
Improvement: 92.3%
 New Performance Rating: Excellent 
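
These metrics are computed on the same 498 days that were used for fitting, so they are in-sample and may overstate how well the model forecasts unseen dates. A chronological holdout is one common check. The sketch below shows only the split and scoring steps on synthetic data; `chronological_split` is a hypothetical helper, and the repeated-last-week forecast is a stand-in for a Prophet model fitted on the training part.

```python
import numpy as np
import pandas as pd

def chronological_split(frame, test_days, time_col="ds"):
    """Sort by time and keep the last `test_days` rows for testing (assumes one row per day)."""
    ordered = frame.sort_values(time_col).reset_index(drop=True)
    return ordered.iloc[:-test_days], ordered.iloc[-test_days:]

# Synthetic daily series standing in for `prophet_data` (values are made up).
dates = pd.date_range("2019-01-02", periods=100, freq="D")
toy = pd.DataFrame({"ds": dates, "y": 3400 + 50 * np.sin(np.arange(100) * 2 * np.pi / 7)})

train, test = chronological_split(toy, test_days=14)

# Placeholder forecast: repeat the last observed week of the training data.
# In the notebook, a Prophet model fitted on `train` would predict `test["ds"]` here.
last_week = train["y"].to_numpy()[-7:]
y_pred = np.tile(last_week, 2)

actual = test["y"].to_numpy()
mape = np.mean(np.abs((actual - y_pred) / actual)) * 100
print(f"Train rows: {len(train)}, test rows: {len(test)}, holdout MAPE: {mape:.2f}%")
```

Prophet also ships `prophet.diagnostics.cross_validation` and `performance_metrics`, which automate this kind of evaluation over several cutoffs.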

