# Dealing with Missing Values in TimeGPT

`TimeGPT` requires time series without missing values. Multiple series may begin and end on different dates, but each series must contain uninterrupted data over its own time frame.

In this tutorial, we show how to deal with missing values, a crucial step before forecasting with `TimeGPT`.

**Outline** 
1. [Load Data](#load-data) 
2. [Get Started with TimeGPT](#get-started-with-timegpt)
3. [Check for Missing Values](#check-for-missing-values)
4. [Fill Missing Values](#fill-missing-values)
5. [Forecast with `TimeGPT`](#forecast-with-timegpt)
6. [References](#references)

## Load Data

We first need to load the data using `pandas`. 

In [1]:
import pandas as pd

Y_df = pd.read_csv('https://raw.githubusercontent.com/JoaquinAmatRodrigo/Estadistica-machine-learning-python/master/data/usuarios_diarios_bicimad.csv')
Y_df = Y_df[['fecha', 'Usos bicis total día']]
Y_df = Y_df.rename(columns={'fecha': 'ds', 'Usos bicis total día': 'y'})
Y_df.head()

Unnamed: 0,ds,y
0,2014-06-23,99
1,2014-06-24,72
2,2014-06-25,119
3,2014-06-26,135
4,2014-06-27,149


In [2]:
Y_df['ds'] = pd.to_datetime(Y_df['ds'])
Y_df['unique_id'] = 'id1'
Y_df = Y_df[['unique_id', 'ds', 'y']]

Now we will separate the data into training and test sets. We will use the last 93 days as the test set, so the forecast horizon is `h=93`.

In [3]:
df = Y_df[:-93]
test = Y_df[-93:]
h = len(test)

In [4]:
df.tail()

Unnamed: 0,unique_id,ds,y
2924,id1,2022-06-25,9758
2925,id1,2022-06-26,9081
2926,id1,2022-06-27,12628
2927,id1,2022-06-28,13371
2928,id1,2022-06-29,13491


In [5]:
test.head()

Unnamed: 0,unique_id,ds,y
2929,id1,2022-06-30,13468
2930,id1,2022-07-01,12932
2931,id1,2022-07-02,9918
2932,id1,2022-07-03,8967
2933,id1,2022-07-04,12869


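To simulate missing values, we randomly remove 10% of the rows from the training data. We also remove every observation between March 1 and May 31, 2018, to create one long continuous gap. Note that the rows are dropped rather than set to `NaN`, so at this point the gaps appear as missing timestamps.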

In [7]:
# Randomly remove 10% of the rows
rows_to_drop = df.sample(n=round(len(df)*0.1), random_state=123).index
train = df.drop(rows_to_drop)

# Also remove all observations from March through May 2018
mask = ~((train['ds'] >= '2018-03-01') & (train['ds'] <= '2018-05-31'))
train = train[mask]


## Get Started with TimeGPT

To use `TimeGPT`, instantiate the `NixtlaClient` class with your API key.

In [8]:
from nixtla import NixtlaClient

nixtla_client = NixtlaClient(
    # defaults to os.environ.get("NIXTLA_API_KEY")
    api_key = 'my_api_key_provided_by_nixtla'
)
    


## Check for Missing Values 

Before forecasting, we need to ensure that:

1. All timestamps from the start date to the end date are present in the data

2. The target column contains no missing values
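
Both conditions can be checked programmatically. The following is a minimal sketch in plain `pandas`, assuming the `unique_id`, `ds`, `y` column layout used in this tutorial; the helper name `check_series` and the toy data are hypothetical and only serve as an illustration.

```python
import pandas as pd

def check_series(df, freq="D", id_col="unique_id", time_col="ds", target_col="y"):
    """Count missing timestamps and missing target values for each series.

    Hypothetical helper, not part of nixtla or utilsforecast.
    """
    rows = []
    for uid, grp in df.groupby(id_col):
        # Every timestamp the series should contain between its own start and end
        expected = pd.date_range(grp[time_col].min(), grp[time_col].max(), freq=freq)
        present = pd.DatetimeIndex(grp[time_col])
        rows.append({
            id_col: uid,
            "missing_timestamps": len(expected.difference(present)),
            "missing_values": int(grp[target_col].isna().sum()),
        })
    return pd.DataFrame(rows)

# Toy series: two dates are absent (Jan 3 and Jan 4) and one value is NaN
toy = pd.DataFrame({
    "unique_id": ["a"] * 4,
    "ds": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05", "2024-01-06"]),
    "y": [1.0, None, 3.0, 4.0],
})
report = check_series(toy)
print(report)
```

A series is ready for forecasting only when both counts are zero.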


In [10]:
nixtla_client.plot(train, engine='plotly')

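Next, we use the `fill_gaps` function from `utilsforecast` to insert a row for every missing timestamp. The target value of each inserted row is `NaN`.

In [11]: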
from utilsforecast.preprocessing import fill_gaps

print('Number of rows before filling gaps:', len(train))
train_no_gaps = fill_gaps(train, freq='D')
print('Number of rows after filling gaps:', len(train_no_gaps))

Number of rows before filling gaps: 2552
Number of rows after filling gaps: 2929
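
To make the row counts above easier to interpret, here is a rough `pandas`-only sketch of the same idea for a single series: build the full date range and reindex onto it. This is an illustration under stated assumptions (one series, no duplicate timestamps), not the implementation of `fill_gaps`; the function name `fill_gaps_single` is hypothetical.

```python
import pandas as pd

def fill_gaps_single(df, freq="D", id_col="unique_id", time_col="ds"):
    """Insert a row for every timestamp missing between the first and last date.

    Hypothetical single-series helper; assumes timestamps are unique.
    """
    full_index = pd.date_range(df[time_col].min(), df[time_col].max(), freq=freq)
    out = (
        df.set_index(time_col)
        .reindex(full_index)      # new rows get NaN in every column
        .rename_axis(time_col)
        .reset_index()
    )
    # Restore the identifier on the newly inserted rows
    out[id_col] = df[id_col].iloc[0]
    return out[list(df.columns)]

# Toy series with Jan 3 and Jan 4 missing
toy = pd.DataFrame({
    "unique_id": "id1",
    "ds": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"]),
    "y": [10.0, 20.0, 50.0],
})
filled = fill_gaps_single(toy)
print(filled)
```

The number of rows grows from three to five, and the two inserted rows carry `NaN` in `y`, which mirrors the jump from 2552 to 2929 rows in the output above.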


In [12]:
train_no_gaps.tail()

Unnamed: 0,unique_id,ds,y
2924,id1,2022-06-25,9758.0
2925,id1,2022-06-26,9081.0
2926,id1,2022-06-27,12628.0
2927,id1,2022-06-28,13371.0
2928,id1,2022-06-29,13491.0


In [13]:
train_no_gaps[train_no_gaps['y'].isna()].tail()

Unnamed: 0,unique_id,ds,y
2887,id1,2022-05-19,
2889,id1,2022-05-21,
2898,id1,2022-05-30,
2916,id1,2022-06-17,
2921,id1,2022-06-22,


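## Fill Missing Values

We fill the missing values with linear interpolation. Setting `limit_direction='both'` also fills any missing values at the start or end of the series.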

In [14]:
train_no_gaps['y'] = train_no_gaps['y'].interpolate(method='linear', limit_direction='both')

In [15]:
train_no_gaps.isna().sum()

unique_id    0
ds           0
y            0
dtype: int64
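
This tutorial uses a single series, so interpolating the whole `y` column is safe. If several series are stacked in one DataFrame, a column-wide interpolation can mix values across series boundaries. The sketch below, on made-up data, shows one possible way to interpolate each series separately with `groupby` and `transform`; it is an assumption about how you might extend the approach, not part of the original workflow.

```python
import numpy as np
import pandas as pd

# Toy data: two series stacked in one frame, each with gaps
df = pd.DataFrame({
    "unique_id": ["a"] * 4 + ["b"] * 4,
    "ds": list(pd.date_range("2024-01-01", periods=4, freq="D")) * 2,
    "y": [10.0, np.nan, 30.0, np.nan, np.nan, 100.0, 200.0, 300.0],
})

# Interpolate within each series so values never cross from "a" into "b"
df["y_filled"] = df.groupby("unique_id")["y"].transform(
    lambda s: s.interpolate(method="linear", limit_direction="both")
)

# For comparison: column-wide interpolation ignores the series boundary
df["y_naive"] = df["y"].interpolate(method="linear", limit_direction="both")
print(df)
```

In the grouped version, the last value of series `a` and the first value of series `b` are filled from their own neighbours (30 and 100), whereas the naive version draws a line between the two series.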

In [17]:
train_no_gaps.tail()

Unnamed: 0,unique_id,ds,y
2924,id1,2022-06-25,9758.0
2925,id1,2022-06-26,9081.0
2926,id1,2022-06-27,12628.0
2927,id1,2022-06-28,13371.0
2928,id1,2022-06-29,13491.0


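## Forecast with TimeGPT

With the gaps filled, we can generate the forecast. Since the horizon is 93 days, we use the `timegpt-1-long-horizon` model.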

In [21]:
fcst = nixtla_client.forecast(train_no_gaps, h=h, freq='D', model='timegpt-1-long-horizon')

INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...


In [22]:
fcst['ds'] = pd.to_datetime(fcst['ds'])

In [23]:
nixtla_client.plot(test, fcst, engine='plotly')

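To measure the accuracy of the forecast, we compute the mean absolute error (MAE) against the test set.

In [24]: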
from utilsforecast.evaluation import evaluate 
from utilsforecast.losses import mae 

In [25]:
result = test.merge(fcst, on=['ds', 'unique_id'], how='left')
result.head()

Unnamed: 0,unique_id,ds,y,TimeGPT
0,id1,2022-06-30,13468,13233.229492
1,id1,2022-07-01,12932,12264.268555
2,id1,2022-07-02,9918,9867.337891
3,id1,2022-07-03,8967,8994.099609
4,id1,2022-07-04,12869,11470.827148


In [27]:
evaluate(result, metrics=[mae])

Unnamed: 0,unique_id,metric,TimeGPT
0,id1,mae,1811.08935
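
For reference, the MAE is the average of the absolute differences between actual and predicted values. The short sketch below computes it by hand with `numpy` on the five rows shown in `result.head()` above. Because it covers only those five days, its value is not expected to match the 93-day MAE reported by `evaluate`.

```python
import numpy as np

# First five actual values and TimeGPT forecasts, copied from result.head()
y_true = np.array([13468, 12932, 9918, 8967, 12869], dtype=float)
y_pred = np.array([13233.229492, 12264.268555, 9867.337891, 8994.099609, 11470.827148])

# MAE = mean(|y - y_hat|)
mae_value = np.mean(np.abs(y_true - y_pred))
print(round(mae_value, 2))
```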


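We can also pass an exogenous variable that tells the model which training observations were filled in. Here, `flag` is 1 for the randomly removed rows and 0 otherwise, and it is set to 0 over the whole forecast horizon.

In [28]: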
import numpy as np 

train_ex_vars = train_no_gaps.copy()
train_ex_vars['flag'] = np.where(train_ex_vars.index.isin(rows_to_drop), 1, 0)
train_ex_vars.head()

Unnamed: 0,unique_id,ds,y,flag
0,id1,2014-06-23,99.0,0
1,id1,2014-06-24,72.0,0
2,id1,2014-06-25,119.0,0
3,id1,2014-06-26,135.0,0
4,id1,2014-06-27,149.0,0
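
The `flag` column above is built from `rows_to_drop`, so it appears to mark only the randomly sampled rows and not the block removed between March and May 2018. It also relies on the index of the gap-filled DataFrame lining up with the original index. A possible alternative, sketched here on made-up data, is to record which values are `NaN` before interpolating; this is a suggestion under those assumptions, not the approach used in the tutorial.

```python
import numpy as np
import pandas as pd

# Toy gap-filled series: NaN marks timestamps that had no observation
gappy = pd.DataFrame({
    "unique_id": "id1",
    "ds": pd.date_range("2024-01-01", periods=6, freq="D"),
    "y": [5.0, np.nan, 7.0, np.nan, np.nan, 10.0],
})

# Record the missing positions before they are overwritten
was_missing = gappy["y"].isna()

filled = gappy.copy()
filled["y"] = filled["y"].interpolate(method="linear", limit_direction="both")
filled["flag"] = was_missing.astype(int)  # 1 = interpolated, 0 = observed
print(filled)
```

Built this way, the flag covers every interpolated value regardless of how the gap was created.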


In [29]:
# Copy the slice so that adding a column does not trigger a SettingWithCopyWarning
future_ex_vars = test[['unique_id', 'ds']].copy()
future_ex_vars['flag'] = 0
future_ex_vars.head()

Unnamed: 0,unique_id,ds,flag
2929,id1,2022-06-30,0
2930,id1,2022-07-01,0
2931,id1,2022-07-02,0
2932,id1,2022-07-03,0
2933,id1,2022-07-04,0


In [30]:
fcst_ex_vars = nixtla_client.forecast(train_ex_vars, h=h, X_df=future_ex_vars, freq='D', model='timegpt-1-long-horizon')

INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Using the following exogenous variables: flag
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...


In [31]:
fcst_ex_vars['ds'] = pd.to_datetime(fcst_ex_vars['ds'])

nixtla_client.plot(test, fcst_ex_vars, engine='plotly')

In [32]:
result_ex_vars = test.merge(fcst_ex_vars, on=['ds', 'unique_id'], how='left')
evaluate(result_ex_vars, metrics=[mae])

Unnamed: 0,unique_id,metric,TimeGPT
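0,id1,mae,1811.047242

In this example, adding the flag changes the MAE only marginally (1811.05 versus 1811.09 without it).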


## References

Joaquín Amat Rodrigo and Javier Escobar Ortiz, "Exclude covid impact in time series forecasting", available under a CC BY-NC-SA 4.0 license at https://www.cienciadedatos.net/documentos/py45-weighted-time-series-forecasting.html