# Getting Started with Temporian

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google/temporian/blob/last-release/docs/src/tutorials/getting_started.ipynb)

This guide will introduce you to the basics of Temporian, including:
- What is an **EventSet** and how to create one from scratch.
- Visualizing input/output data using **EventSet.plot()** and interactive plots.
- Converting back and forth between EventSets and pandas **DataFrames**.
- Transforming the EventSets by using **operators**.
- How operators work when using **indexes**.
- Commonly used operations like **glue**, **resample**, **lag**, moving windows and arithmetic.

If you're interested in a topic that is not covered here, the final section links to other parts of the documentation so you can continue learning.

## Setup

In [None]:
# Skip this cell if you are running the notebook locally and have already installed temporian.
%pip install temporian -q

In [2]:
import temporian as tp

import pandas as pd
import numpy as np

## Part 1: Events and EventSets

The most basic unit of data in Temporian is an **event**. An event consists of a timestamp and a set of feature values.

Events are not handled individually. Instead, events are grouped together into an **[`EventSet`][temporian.EventSet]**.

[`EventSets`][temporian.EventSet] are the main data structure in Temporian, and represent **[multivariate and multi-index time sequences](../user_guide/#what-is-temporal-data)**. Let's break that down:

- "multivariate" indicates that each event in the time sequence holds several feature values.
- "multi-index" indicates that the events can represent hierarchical data, and can therefore be grouped by one or more of their features' values.
- "sequence" indicates that the events are not necessarily sampled at a uniform rate (in which case we would call it a time "series").

You can create an [`EventSet`][temporian.EventSet] from a pandas DataFrame, NumPy arrays, CSV files, and more. Here is an example of an [`EventSet`][temporian.EventSet] containing four events and three features (one of which is used as an `index`):

In [3]:
evset = tp.event_set(
    timestamps=["2023-02-04", "2023-02-06", "2023-02-07", "2023-02-07"],
    features={
        "feature_1": [0.5, 0.6, np.nan, 0.9],
        "feature_2": ["red", "blue", "red", "blue"],
        "feature_3":  [10.0, -1.0, 5.0, 5.0],
    },
    indexes=["feature_2"],
)
evset

timestamp,feature_1,feature_3
2023-02-06 00:00:00+00:00,0.6,-1
2023-02-07 00:00:00+00:00,0.9,5

timestamp,feature_1,feature_3
2023-02-04 00:00:00+00:00,0.5,10
2023-02-07 00:00:00+00:00,,5


An [`EventSet`][temporian.EventSet] can hold one or several time sequences, depending on what its **[index](../user_guide/#index-horizontal-and-vertical-operators)** is.

If the [`EventSet`][temporian.EventSet] has no index, it will hold a single multivariate time sequence, which means that all events will be considered part of the same group and will interact with each other when operators are applied to the [`EventSet`][temporian.EventSet].

If the [`EventSet`][temporian.EventSet] has one (or many) indexes, its events will be grouped by their indexes' values, so it will hold one multivariate time sequence for each unique value (or unique combination of values) of its indexes, and most operators applied to the [`EventSet`][temporian.EventSet] will be applied to each time sequence independently.

The last part of this tutorial shows some examples using `indexes` and operators.
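
To make the grouping concrete, here is a plain-Python sketch that splits the four events from the example above by their `feature_2` value. It only mirrors the idea and is not how Temporian stores data internally; `group_by_index` is a hypothetical helper written for this illustration.

```python
from collections import defaultdict

# The four events from the EventSet example above, one dict per event.
events = [
    {"timestamp": "2023-02-04", "feature_1": 0.5, "feature_2": "red", "feature_3": 10.0},
    {"timestamp": "2023-02-06", "feature_1": 0.6, "feature_2": "blue", "feature_3": -1.0},
    {"timestamp": "2023-02-07", "feature_1": float("nan"), "feature_2": "red", "feature_3": 5.0},
    {"timestamp": "2023-02-07", "feature_1": 0.9, "feature_2": "blue", "feature_3": 5.0},
]


def group_by_index(events, indexes):
    """Group events into one time sequence per unique index key (hypothetical helper)."""
    groups = defaultdict(list)
    for event in events:
        key = tuple(event[name] for name in indexes)
        # Index columns identify the group, so they are not kept as features.
        rest = {k: v for k, v in event.items() if k not in indexes}
        groups[key].append(rest)
    return dict(groups)


# No index: a single sequence, stored under the empty key ().
no_index = group_by_index(events, indexes=[])
# One index: one sequence per unique value of feature_2.
by_color = group_by_index(events, indexes=["feature_2"])

print(len(no_index[()]))
print([e["timestamp"] for e in by_color[("red",)]])
print([e["timestamp"] for e in by_color[("blue",)]])
```

With no index, every event lands in the same group. With `feature_2` as the index, the "red" and "blue" events form two separate sequences, matching the two tables shown above.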

### Example Data

This minimal dataset consists of just one `signal` with a `timestamp` for each sample.

The signal is a periodic sinusoidal `season` with a slight positive slope in the long run, which we call `trend`, plus the ubiquitous `noise`.

In [None]:
# Generate a synthetic dataset
timestamps = np.arange(0, 100, 0.1)
n = len(timestamps)
noise = 0.1 * np.random.randn(n)
trend = 0.01 * timestamps
season = 0.4 * np.sin(timestamps)

# Convention: 'df_' for DataFrame
df_signals = pd.DataFrame(
    {
        "timestamp": timestamps,
        "noise": noise,
        "trend": trend,
        "season": season,
        "signal": noise + trend + season,
    }
)

df_signals

### Creating an EventSet from a DataFrame

As mentioned in the previous section, any kind of signal is represented in Temporian as a **collection of events**, using the `EventSet` object.

In this case there are no `indexes` because we only have one sequence. In the third part we'll learn how to use them and why they can be useful.

In [None]:
# Convert the DataFrame into a Temporian EventSet
evset_signals = tp.from_pandas(df_signals)

evset_signals

In [None]:
# Plot the dataset
_ = evset_signals.plot()

**Note:** If you're wondering why the plot has an empty `()` in the title, it's because we have no `indexes`, as mentioned above.

## Part 2: Using Operators

Now, let's actually transform our data with a couple of operations.

To extract only the long-term trend, we first remove the sine and noise components using a moving average over a long window. Both components have zero mean, so they average out.

In [None]:
# Pick only one feature
signal = evset_signals["signal"]

# Moving avg
trend = signal.simple_moving_average(tp.duration.seconds(30))
trend.plot()

Notice that the feature is still named `signal`?

Let's give it a new name to avoid confusion.

In [None]:
# Let's rename the feature by adding a prefix
trend = trend.prefix("trend_")
trend.plot()

Now that we have the long-term trend, we can subtract it from the original signal to get only the `season` component.

In [None]:
# Remove the slow 'trend' to get 'season'
detrend = signal - trend

# Rename resulting feature
detrend = detrend.rename("detrend")

detrend.plot()

Using a shorter moving average, we can filter out the noise.

In [None]:
denoise = detrend.simple_moving_average(tp.duration.seconds(1.5)).rename("denoise")
denoise.plot()
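
Both moving averages above rely on the same idea: components that oscillate faster than the window length average out, while slower ones survive. The sketch below checks this in plain Python on a noise-free version of the synthetic signal. It assumes a trailing window `(t - window, t]`, which is how we understand `simple_moving_average` to behave (worth confirming in the operator reference). `trailing_average` and `spread` are hypothetical helpers, not Temporian functions.

```python
import math

# Noise-free version of the tutorial signal: slow trend plus a sine "season".
timestamps = [i * 0.1 for i in range(1000)]
signal = [0.01 * t + 0.4 * math.sin(t) for t in timestamps]


def trailing_average(timestamps, values, window):
    """Average of the values whose timestamp falls in (t - window, t], for each t."""
    result = []
    start = 0  # Index of the oldest sample still inside the window.
    running_sum = 0.0
    for end, t in enumerate(timestamps):
        running_sum += values[end]
        # Drop samples that fell out of the window.
        while timestamps[start] <= t - window:
            running_sum -= values[start]
            start += 1
        result.append(running_sum / (end - start + 1))
    return result


def spread(values):
    """Peak-to-peak range of a list of values."""
    return max(values) - min(values)


long_avg = trailing_average(timestamps, signal, window=30.0)
short_avg = trailing_average(timestamps, signal, window=1.5)

# Compare each average against the pure trend line once the long window is full.
full = [i for i, t in enumerate(timestamps) if t >= 30.0]
spread_long = spread([long_avg[i] - 0.01 * timestamps[i] for i in full])
spread_short = spread([short_avg[i] - 0.01 * timestamps[i] for i in full])

print(f"oscillation left by the 30 s window:  {spread_long:.3f}")
print(f"oscillation left by the 1.5 s window: {spread_short:.3f}")
```

The long window leaves almost no oscillation around the trend line, while the short window keeps most of the seasonal swing. Note that a trailing average also delays the trend estimate by roughly half a window.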

### Selecting and combining features

Features can be selected and combined to create new `EventSets` using two operations:
1. **Select:** using `evset["feature_1"]` or `evset[["feature_1", "feature_2"]]` will return a new `EventSet` object with only one or two features respectively.
1. **Glue:** using `tp.glue(evset_1, evset_2)` will return a new `EventSet` combining all features from both inputs. But the feature names cannot be repeated, so you may need to use `prefix()` or `rename()` before combining.
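
The naming rule for `tp.glue` can be pictured with plain dictionaries. This is only a conceptual sketch: `glue_features` and `add_prefix` are hypothetical helpers that mimic the rule, not Temporian functions, and real `EventSets` must also share the same sampling to be glued.

```python
def add_prefix(features, prefix):
    """Return a copy of the feature dict with every name prefixed."""
    return {prefix + name: values for name, values in features.items()}


def glue_features(*feature_sets):
    """Merge feature dicts, refusing duplicated feature names."""
    merged = {}
    for features in feature_sets:
        for name, values in features.items():
            if name in merged:
                raise ValueError(f"Feature name '{name}' is repeated")
            merged[name] = values
    return merged


raw = {"signal": [1.0, 2.0, 3.0]}
smoothed = {"signal": [1.0, 1.5, 2.0]}

# Both inputs define "signal", so merging them directly is rejected.
try:
    glue_features(raw, smoothed)
except ValueError as error:
    print(error)

# Renaming one side first makes the names unique.
combined = glue_features(raw, add_prefix(smoothed, "trend_"))
print(list(combined))
```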

Let's pack all the results so far into a single `EventSet` and plot everything together. The slope of the denoised signal, calculated by subtracting a delayed version of itself, is covered in the next section.

In [None]:
# Pack results to show all plots together
evset_result = tp.glue(
    signal,
    trend,
    detrend,
    denoise
)

evset_result.plot()

### Lag and resample

Just as another example, let's also calculate the derivative of the denoised signal, numerically.

In [None]:
# Estimate numeric derivative

# Time step
delta_t = 1

# Increment in y axis
y = denoise
y_lag = y.lag(delta_t)
delta_y = y - y_lag.resample(y)

# Remember the formula? :)
derivative = delta_y / delta_t

# Also, let's use an interactive plot just for fun.
derivative.plot(interactive=True, width_px=600)

Pretty accurate! We had a sine wave with amplitude `0.4` and unit angular frequency, so the derivative should be close to a cosine with amplitude `0.4`.
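
How close should we expect the result to be? For a pure sine, the backward difference used above has a closed form: `0.4 * (sin(t) - sin(t - dt)) / dt = 0.4 * sin(dt / 2) / (dt / 2) * cos(t - dt / 2)`. The sketch below checks this identity numerically in plain Python. It ignores the noise and the smoothing applied earlier, so the numbers are an idealized reference rather than exactly what the plot shows.

```python
import math

delta_t = 1.0
amplitude = 0.4


def backward_difference(t):
    """Finite-difference derivative of amplitude * sin(t) with step delta_t."""
    return amplitude * (math.sin(t) - math.sin(t - delta_t)) / delta_t


# Closed form: slightly smaller amplitude, delayed by half a time step.
scale = math.sin(delta_t / 2) / (delta_t / 2)


def closed_form(t):
    """Scaled and delayed cosine that the backward difference should equal."""
    return amplitude * scale * math.cos(t - delta_t / 2)


ts = [i * 0.1 for i in range(1000)]
max_gap = max(abs(backward_difference(t) - closed_form(t)) for t in ts)

print(f"expected amplitude: {amplitude * scale:.4f}")  # A bit below 0.4.
print(f"max gap between both formulas: {max_gap:.2e}")
```

A smaller `delta_t` brings the amplitude closer to `0.4` and reduces the delay, at the cost of amplifying whatever noise is left in the signal.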


Now, taking a look at the operators, the `lag()` call is pretty self-descriptive. But you might be wondering, why is the `resample()` operator needed?

That's because `y.lag(delta_t)` just shifts the timestamps by `delta_t`, and as a result, `y` and `y_lag` are signals with **different samplings**.

But how would you subtract two signals that are defined at different timestamps? In Temporian, we don't like error-prone implicit _magic_ behavior, so you have to do it explicitly. **You can only do arithmetic between signals with the same sampling.**

To create matching samplings, we explicitly use `y_lag.resample(y)`, creating a signal using the timestamps from `y`, but taking the values from `y_lag`. It's essentially the same signal as `y_lag`, but sampled at the same timestamps as `y`.
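
The lag-then-resample step can be sketched in plain Python. We assume here that resampling takes, for each target timestamp, the most recent source value at or before it, and yields a missing value when there is none. `resample_like` is a hypothetical helper written for this illustration, not Temporian's implementation.

```python
import bisect
import math


def resample_like(src_timestamps, src_values, target_timestamps):
    """For each target timestamp, take the latest source value at or before it."""
    resampled = []
    for t in target_timestamps:
        # Number of source timestamps <= t.
        count = bisect.bisect_right(src_timestamps, t)
        resampled.append(src_values[count - 1] if count > 0 else math.nan)
    return resampled


# A small signal y, and a copy of it delayed by one time unit.
y_timestamps = [0.0, 0.5, 1.0, 1.5, 2.0]
y_values = [10.0, 11.0, 13.0, 16.0, 20.0]
delta_t = 1.0
lag_timestamps = [t + delta_t for t in y_timestamps]  # Same values, shifted in time.

# Bring the lagged values back onto y's timestamps so they can be subtracted.
lag_on_y = resample_like(lag_timestamps, y_values, y_timestamps)
delta_y = [a - b for a, b in zip(y_values, lag_on_y)]

print(lag_on_y)
print(delta_y)
```

The first two entries are missing because no lagged value exists yet at those timestamps. From `t = 1.0` onwards, each entry of `delta_y` is the change in `y` over the previous time unit.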

### Exporting outputs from Temporian
You may need to use this data in different ways for downstream tasks, like training a model using whatever library you need. 

If you can't use the data directly from Temporian, you can always go back to a pandas DataFrame:

In [None]:
tp.to_pandas(evset_result)

## Part 3: Using indexes
This is the final important concept to get from this introduction.

Indexes are useful for handling multiple signals in parallel (as mentioned at the top of this notebook), for example signals from multiple sensor devices, or sales from many stores or products. The feature names may be exactly the same across all the data, but we need to separate the sequences by setting the correct `index` for each one.

### New example data: multiple devices
Let's create two signals with overlapping timestamps, each with a different `device_id`:

In [None]:

# Two devices with overlapping timestamps
df_device_1 = df_signals[:900].copy()
df_device_2 = df_signals[300:].copy()

# Add a column with device_id and concat
df_device_1["device_id"] = "Device 1"
df_device_2["device_id"] = "Device 2"
df_both_devices = pd.concat([df_device_1, df_device_2])

# Create evset using 'device_id' as index
evset_devices = tp.from_pandas(df_both_devices, indexes=["device_id"])
evset_devices

As you can see above, each index value has its own timestamps and feature values. They will always have the same features though, because they're in the same `EventSet`.

The plots also adapt to show each index value separately. In particular, see below how the timestamps are different and only partly overlapping, which is completely fine for separate index values. This wouldn't be possible using different feature names for each sensor, for example.

In [None]:
evset_devices["signal"].plot()

### Operations with index

Any operator that we apply now is aware of the `index` and will be performed over each index value separately.

In [None]:
# Apply some operations
trend_i = evset_devices["signal"].simple_moving_average(tp.duration.seconds(30))
detrend_i = evset_devices["signal"] - trend_i
denoise_i = detrend_i.simple_moving_average(tp.duration.seconds(1.5))

# Plot for each index
tp.glue(evset_devices["signal"],
        detrend_i.rename("detrend"),
        denoise_i.rename("denoise")
       ).plot()

### Multi-indexes

Finally, let's point out that multiple columns of the input data may be set as indexes.

For example, in the case of sales in a store, we could use both the store and product columns to group the sequences. In this case, each group would contain the sales for a single product in a single store.

This is easy to do since the `indexes` argument is actually a list of columns, and each group is represented in Temporian using a tuple `(store, product)` as the index key.
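
As a rough illustration with made-up data, the sketch below groups sales records by a `(store, product)` tuple and computes a running total inside each group. It uses plain Python dictionaries to mirror the idea. The records, column names and the `running_total_per_group` helper are all hypothetical.

```python
from collections import defaultdict

# Made-up sales records: (timestamp, store, product, units sold).
sales = [
    (1, "store_a", "apples", 3),
    (1, "store_a", "pears", 1),
    (2, "store_b", "apples", 5),
    (3, "store_a", "apples", 2),
    (4, "store_b", "apples", 4),
]


def running_total_per_group(records):
    """Cumulative units per (store, product) key, computed independently per group."""
    totals = defaultdict(list)
    for timestamp, store, product, units in sorted(records):
        key = (store, product)
        # Continue from the last total of this group only, never from other groups.
        previous = totals[key][-1][1] if totals[key] else 0
        totals[key].append((timestamp, previous + units))
    return dict(totals)


for key, sequence in running_total_per_group(sales).items():
    print(key, sequence)
```

Each tuple key owns its own time sequence, so the total for apples in `store_a` is unaffected by the apples sold in `store_b`.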

## Summary

Congratulations! You now have the basic concepts needed to create a data preprocessing pipeline with Temporian:
- Defining an **EventSet** and using **operators** on it.
- Combining **features** using **select** and **glue**.
- Converting data back and forth between Temporian's **EventSet** and pandas **DataFrames**.
- Visualizing input/output data using **EventSet.plot()**.
- Operating and plotting with an **index**.

### Other important details

To keep this tutorial short and concise, some interesting concepts were not covered above:

- Check the [**Time Units** section of the User Guide](https://temporian.readthedocs.io/en/latest/user_guide/#time-units). There are many [**calendar operators**](https://temporian.readthedocs.io/en/stable/reference/temporian/operators/calendar/calendar_day_of_month/) available when working with datetimes.
- To combine or operate with events from different sampling sources (potentially non-uniform samplings) check the [**sampling** section of the User Guide](https://temporian.readthedocs.io/en/stable/user_guide/#sampling).
- Temporian is **strict on the feature data types** when applying operations, to avoid potentially silent errors or memory issues. Check the [User Guide's **casting** section](https://temporian.readthedocs.io/en/latest/user_guide/#casting) to learn how to tackle those cases.

### Next Steps
- The [**Recipes**](https://temporian.readthedocs.io/en/stable/recipes/) are short and self-contained examples showing how to use Temporian in typical use cases.
- Try the more advanced [**tutorials**](https://temporian.readthedocs.io/en/stable/tutorials/) to continue learning by example about all these topics and more!
- Learn how Temporian is **ready for production**, using [**graph mode**](https://temporian.readthedocs.io/en/stable/user_guide/#eager-mode-vs-graph-mode) or [Apache Beam](https://temporian.readthedocs.io/en/stable/tutorials/temporian_with_beam/).

- We could only cover a small fraction of **[all available operators](https://temporian.readthedocs.io/en/stable/reference/temporian/operators/add_index/)**.
- We put a lot of ❤️ in the **[User Guide](https://temporian.readthedocs.io/en/stable/user_guide/)**, so make sure to check it out 🙂.