In [None]:
#| hide
!pip install -Uqq nixtla 

In [None]:
#| hide 
from nixtla.utils import in_colab

In [None]:
#| hide 
IN_COLAB = in_colab()

In [None]:
#| hide
if not IN_COLAB:
    from nixtla.utils import colab_badge
    from dotenv import load_dotenv

# Audit and Clean Data with TimeGPT 

In [None]:
#| echo: false
if not IN_COLAB:
    load_dotenv()
    colab_badge('docs/tutorials/24_audit_clean')

[![](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nixtla/nixtla/blob/main/nbs/docs/tutorials/24_audit_clean.ipynb)

The `audit_data` and `clean_data` methods from TimeGPT can help you identify and fix potential issues in your data.

The `audit_data` method performs a series of checks to detect problems that will cause errors when you run TimeGPT. Specifically, it checks for:

- **Duplicate rows**
- **Missing dates**
- **Categorical feature columns**

Additionally, `audit_data` checks for:

- **Negative values**
- **Leading zeros**

Although these issues won't directly cause errors in TimeGPT, you might still choose to address them, depending on your specific use case.

Once you've identified any issues, the `clean_data` method helps you automatically fix them. In this tutorial, you'll learn how to use these methods to audit and clean your data before generating forecasts with TimeGPT. 

## 1. Get Started 

To use the `audit_data` and `clean_data` methods, you first need to import and instantiate the `NixtlaClient` class.

In [None]:
import pandas as pd 
from nixtla import NixtlaClient

nixtla_client = NixtlaClient(
    # api_key = 'my_api_key_provided_by_nixtla'
)

If you don't have an API key, or if you want to learn more secure ways of setting it up, please refer to the [Setting up your API key](https://docs.nixtla.io/docs/getting-started-setting_up_your_api_key) tutorial.

## 2. Audit Data

The `audit_data` method performs a series of checks to identify issues in your data. These checks fall into two categories:

| **Check Type**    | **Description**                                                                 | **Checks Performed**                                      |
|-------------------|---------------------------------------------------------------------------------|-----------------------------------------------------------|
| **Fail**          | Issues that will cause errors when you run TimeGPT    | Duplicate rows (D001)<br>Missing dates (D002)<br>Categorical feature columns (F001) |
| **Case-specific** | Issues that may not cause errors but could negatively affect your results       | Negative values (V001)<br>Leading zeros (V002)            |
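
To make the fail checks concrete, here is a rough pandas sketch of what each one looks for. It is only an illustration under stated assumptions, not the logic `audit_data` runs internally; the `sample` DataFrame and the `store_type` column are hypothetical.

```python
import pandas as pd

# Hypothetical data: one duplicated row, one missing day, one text feature column.
sample = pd.DataFrame({
    'unique_id': ['a', 'a', 'a', 'a'],
    'ds': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-04']),
    'y': [1.0, 1.0, 2.0, 3.0],
    'store_type': ['big', 'big', 'big', 'big'],
})

# D001-style check: rows that repeat a (unique_id, ds) pair.
duplicate_rows = sample[sample.duplicated(subset=['unique_id', 'ds'], keep=False)]

# D002-style check: days expected at daily frequency but absent from the series.
deduped = sample.drop_duplicates(subset=['unique_id', 'ds'])
expected = pd.date_range(deduped['ds'].min(), deduped['ds'].max(), freq='D')
missing_dates = expected.difference(pd.DatetimeIndex(deduped['ds']))

# F001-style check: feature columns whose dtype is not numeric.
feature_cols = [c for c in sample.columns if c not in ('unique_id', 'ds', 'y')]
categorical_cols = [
    c for c in feature_cols if not pd.api.types.is_numeric_dtype(sample[c])
]
```

In this sketch, `duplicate_rows` holds both copies of the repeated row, `missing_dates` holds the skipped day, and `categorical_cols` lists the text column.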

To show how the `audit_data` method works, we will create a sample dataset with a missing date (`id1`), leading zeros (`id2`), and negative values (`id3`).

In [None]:
df = pd.DataFrame({
    'unique_id': ['id1', 'id1', 'id1', 'id2', 'id2', 'id2', 'id2', 'id3', 'id3', 'id3', 'id3'],
    'ds': ['2023-01-01', '2023-01-03', '2023-01-04', '2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04'],
    'y': [1, 1, 1, 0, 0, 1, 2, -1, 0, 1, -2]
})

df

Unnamed: 0,unique_id,ds,y
0,id1,2023-01-01,1
1,id1,2023-01-03,1
2,id1,2023-01-04,1
3,id2,2023-01-01,0
4,id2,2023-01-02,0
5,id2,2023-01-03,1
6,id2,2023-01-04,2
7,id3,2023-01-01,-1
8,id3,2023-01-02,0
9,id3,2023-01-03,1
10,id3,2023-01-04,-2


The `audit_data` method takes the following parameters:

- `df` *(required)*: A pandas DataFrame with your input data.

- `freq` *(required)*: The frequency of your time series data (e.g., `D` for daily, `M` for monthly).

- `id_col`: Column name identifying each unique series. Default is `unique_id`.

- `time_col`: Column name containing timestamps. Default is `ds`.

- `target_col`: Column name containing the target variable. Default is `y`.

Additionally, you can use the following optional parameters to specify how missing dates are identified:

- `start`: The initial timestamp for the series.

- `end`: The final timestamp for the series.

Both `start` and `end` can take the following options:

- `per_serie`: Uses the first or last timestamp of each individual series.

- `global`: Uses the earliest or latest timestamp from the entire dataset.

- A specific timestamp or integer (e.g., `2025-01-01`, `2025`, or `datetime(2025, 1, 1)`).
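
The difference between `per_serie` and `global` can be illustrated with plain pandas. The sketch below follows the descriptions above and is not the library's implementation; `demo`, `resolve_bound`, and `missing_dates` are hypothetical names, and only string timestamps are handled for the third option.

```python
import pandas as pd

# Hypothetical two-series dataset; id1 starts later and ends earlier than id2.
demo = pd.DataFrame({
    'unique_id': ['id1', 'id1', 'id2', 'id2', 'id2'],
    'ds': pd.to_datetime(
        ['2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-04']
    ),
})


def resolve_bound(option, series_value, dataset_value):
    """Turn a start/end option into a concrete timestamp (illustrative only)."""
    if option == 'per_serie':
        return series_value
    if option == 'global':
        return dataset_value
    return pd.Timestamp(option)  # assumes a timestamp-like string was given


def missing_dates(data, uid, start='per_serie', end='per_serie', freq='D'):
    """Hypothetical helper: dates expected for series `uid` but absent from it."""
    own = data.loc[data['unique_id'] == uid, 'ds']
    first = resolve_bound(start, own.min(), data['ds'].min())
    last = resolve_bound(end, own.max(), data['ds'].max())
    expected = pd.date_range(first, last, freq=freq)
    return expected.difference(pd.DatetimeIndex(own))


per_serie_gaps = missing_dates(demo, 'id1')
global_gaps = missing_dates(demo, 'id1', start='global', end='global')
```

With `per_serie`, `id1` is only compared against its own span and shows no gaps. With `global`, it is compared against the span of the whole dataset, so the first and last days count as missing.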

In [None]:
all_pass, fail_dfs, case_specific_dfs = nixtla_client.audit_data(
    df = df,
    freq = 'D', 
    start = 'per_serie', 
    end = 'per_serie'
)

INFO:nixtla.nixtla_client:Running data quality tests...


The `audit_data` method returns three values:

- `all_pass` (bool): `True` if every check passed, otherwise `False`.

- `fail_dfs` (dict): Any failed checks (D001, D002, or F001), each paired with the rows that failed.

- `case_specific_dfs` (dict): Any flagged case-specific checks (V001 or V002), each paired with the rows flagged.
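
Because both dictionaries map a check code to a DataFrame, a small helper can turn them into a readable summary. This is a minimal sketch: `summarize_audit` is a hypothetical helper, and the two example dictionaries are stand-ins shaped like the return values described above, not real `audit_data` output.

```python
import pandas as pd

# Stand-ins shaped like the returned dictionaries (check code -> DataFrame).
fail_example = {
    'D002': pd.DataFrame({'unique_id': ['id1'], 'ds': ['2023-01-02']}),
}
case_specific_example = {
    'V001': pd.DataFrame({'unique_id': ['id3', 'id3'], 'y': [-1, -2]}),
}


def summarize_audit(fail_dfs, case_specific_dfs):
    """Hypothetical helper: one line per flagged check with its row count."""
    lines = []
    for label, results in (('fail', fail_dfs), ('case-specific', case_specific_dfs)):
        for code, flagged in results.items():
            lines.append(f'{code} ({label}): {len(flagged)} row(s) flagged')
    return lines


summary = summarize_audit(fail_example, case_specific_example)
```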

In [None]:
print(all_pass)
print(fail_dfs)
print(case_specific_dfs)

False
{'D002':   unique_id         ds
1       id1 2023-01-02}
{'V001':    unique_id         ds  y
7        id3 2023-01-01 -1
10       id3 2023-01-04 -2, 'V002':   unique_id  first_index  first_nonzero_index
0       id2            3                    5}


In the above example, the `audit_data` method found missing dates (D002), negative values (V001), and leading zeros (V002).
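
To see where the V001 and V002 results come from, the same rows can be located with plain pandas. This is only an approximation of the two case-specific checks under the assumption that a leading zero is any zero before a series' first non-zero value; it is not the code `audit_data` runs.

```python
import pandas as pd

# Same sample data as in the tutorial.
df = pd.DataFrame({
    'unique_id': ['id1', 'id1', 'id1', 'id2', 'id2', 'id2', 'id2',
                  'id3', 'id3', 'id3', 'id3'],
    'ds': ['2023-01-01', '2023-01-03', '2023-01-04',
           '2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
           '2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04'],
    'y': [1, 1, 1, 0, 0, 1, 2, -1, 0, 1, -2],
})

# V001-style check: rows whose target value is negative.
negative_rows = df[df['y'] < 0]

# V002-style check: series whose first non-zero value is not their first row.
records = []
for uid, group in df.groupby('unique_id'):
    nonzero_index = group.index[group['y'] != 0]
    if len(nonzero_index) > 0 and nonzero_index[0] != group.index[0]:
        records.append({
            'unique_id': uid,
            'first_index': int(group.index[0]),
            'first_nonzero_index': int(nonzero_index[0]),
        })
leading_zeros = pd.DataFrame(records)
```

Here `negative_rows` contains the two negative rows of `id3`, and `leading_zeros` has a single row for `id2`, whose series starts at index 3 but has its first non-zero value at index 5.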

## 3. Clean Data

The `clean_data` method fixes the issues identified by the `audit_data` method. It takes the dictionaries returned by `audit_data` as input, so always run `audit_data` first. The `clean_data` method takes the following parameters:

- `df` *(required)*: A pandas DataFrame with your input data.

- `fail_dict` *(required)*: A dictionary with failed checks, as returned by the `audit_data` method.

- `case_specific_dict` *(required)*: A dictionary with case-specific checks, also returned by the `audit_data` method.

- `freq` *(required)*: The frequency of your time series data (e.g., `D` for daily, `M` for monthly). Can be a string, integer, or pandas offset.

- `clean_case_specific`: Whether to clean case-specific issues (e.g., negative values, leading zeros). Default is `False`.

- `id_col`: Column name identifying each unique series. Default is `unique_id`.

- `time_col`: Column name containing timestamps or integer steps. Default is `ds`.

- `target_col`: Column name containing the target variable. Default is `y`.

In [None]:
clean_df, all_pass, fail_dfs, case_specific_dfs = nixtla_client.clean_data(
    df = df, 
    fail_dict = fail_dfs,
    case_specific_dict = case_specific_dfs,
    clean_case_specific = True, 
    freq = 'D'
)

clean_df

INFO:nixtla.nixtla_client:Running data cleansing...
INFO:nixtla.nixtla_client:Fixing D002: Filling missing dates...
INFO:nixtla.nixtla_client:Fixing V001: Removing negative values...
INFO:nixtla.nixtla_client:Fixing V002: Removing leading zeros...
INFO:nixtla.nixtla_client:Running data quality tests...


Unnamed: 0,unique_id,ds,y
0,id1,2023-01-01,1.0
1,id1,2023-01-03,1.0
2,id1,2023-01-04,1.0
1,id1,2023-01-02,
5,id2,2023-01-03,1.0
6,id2,2023-01-04,2.0
7,id3,2023-01-01,0.0
8,id3,2023-01-02,0.0
9,id3,2023-01-03,1.0
10,id3,2023-01-04,0.0


In this example, `clean_data` added the missing date in `id1`, removed the leading zeros in `id2`, and replaced the negative values in `id3`. However, replacing negative values with zeros introduced new leading zeros in `id3`, so a second run of `clean_data` is required.
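
When one fix can expose another issue, as happens here, the audit and clean cycle can be wrapped in a loop. The sketch below assumes only the `audit_data` and `clean_data` signatures shown in this tutorial; `clean_until_pass` and `max_rounds` are hypothetical names, and `start` and `end` are left at whatever defaults the client uses.

```python
def clean_until_pass(client, df, freq, max_rounds=5):
    """Hypothetical helper: repeat clean_data until every check passes.

    `client` is expected to expose audit_data and clean_data as used in
    this tutorial. The loop stops after `max_rounds` cleaning passes even
    if some checks still fail.
    """
    all_pass, fail_dfs, case_specific_dfs = client.audit_data(df=df, freq=freq)
    rounds = 0
    while not all_pass and rounds < max_rounds:
        df, all_pass, fail_dfs, case_specific_dfs = client.clean_data(
            df=df,
            fail_dict=fail_dfs,
            case_specific_dict=case_specific_dfs,
            clean_case_specific=True,  # also fix negative values and leading zeros
            freq=freq,
        )
        rounds += 1
    return df, all_pass, rounds
```

A call such as `clean_until_pass(nixtla_client, df, freq='D')` would then return the cleaned DataFrame, whether all checks passed, and the number of cleaning rounds used. The tutorial continues with the second run done by hand so that each step stays visible.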

In [None]:
clean_df2, all_pass, fail_dfs, case_specific_dfs = nixtla_client.clean_data(
    df = clean_df, 
    fail_dict = fail_dfs,
    case_specific_dict = case_specific_dfs,
    clean_case_specific = True, # if False, the case-specific tests will be ignored
    freq = 'D'
)

clean_df2

INFO:nixtla.nixtla_client:Running data cleansing...
INFO:nixtla.nixtla_client:Fixing V002: Removing leading zeros...
INFO:nixtla.nixtla_client:Running data quality tests...
INFO:nixtla.nixtla_client:All checks passed...


Unnamed: 0,unique_id,ds,y
0,id1,2023-01-01,1.0
1,id1,2023-01-03,1.0
2,id1,2023-01-04,1.0
1,id1,2023-01-02,
5,id2,2023-01-03,1.0
6,id2,2023-01-04,2.0
9,id3,2023-01-03,1.0
10,id3,2023-01-04,0.0


After the second run of `clean_data`, the leading zeros in `id3` have been removed. Two steps remain: sort the DataFrame by `unique_id` and `ds`, and fill the missing value created when the missing date was added to `id1` (filled with zero here).

In [None]:
clean_df2 = clean_df2.sort_values(by=['unique_id', 'ds'])
clean_df2['y'] = clean_df2['y'].fillna(0)
clean_df2

Unnamed: 0,unique_id,ds,y
0,id1,2023-01-01,1.0
1,id1,2023-01-02,0.0
1,id1,2023-01-03,1.0
2,id1,2023-01-04,1.0
5,id2,2023-01-03,1.0
6,id2,2023-01-04,2.0
9,id3,2023-01-03,1.0
10,id3,2023-01-04,0.0
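
Filling with zero is one option among several, and the right choice depends on what a gap means in your data. As a sketch with standard pandas calls, the example below shows a forward fill and a linear interpolation per series; the `cleaned` DataFrame is a hypothetical stand-in for the cleaned data, and `reset_index(drop=True)` is added to remove the repeated index labels left after sorting.

```python
import pandas as pd

# Stand-in for cleaned data: the added date has no target value yet.
cleaned = pd.DataFrame({
    'unique_id': ['id1', 'id1', 'id1', 'id1'],
    'ds': pd.to_datetime(['2023-01-01', '2023-01-03', '2023-01-04', '2023-01-02']),
    'y': [1.0, 3.0, 4.0, None],
})

# Sort by series and time, then rebuild a clean 0..n-1 index.
cleaned = cleaned.sort_values(['unique_id', 'ds']).reset_index(drop=True)

# Option 1: carry the last observed value forward within each series.
forward_filled = cleaned.groupby('unique_id')['y'].ffill()

# Option 2: interpolate linearly between neighbouring observations per series.
interpolated = cleaned.groupby('unique_id')['y'].transform(
    lambda s: s.interpolate()
)
```

In this stand-in, the forward fill repeats the value from 2023-01-01, while the interpolation places the gap halfway between its two neighbours.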


## 4. Conclusion

The `audit_data` method helps you identify issues in your data before you run TimeGPT. Fail checks (duplicate rows, missing dates, and categorical feature columns) will cause errors if not addressed. Case-specific checks (negative values and leading zeros) may not cause errors but can affect the quality of your forecasts depending on your use case.

The `clean_data` method can automatically fix the issues identified by `audit_data`. Be cautious when removing negative values or leading zeros, as they may contain important information about your data. Above all, when auditing and cleaning your data, make decisions based on the needs and context of your specific use case.