# Adding text features

So far, the tutorials have dealt with _tabular_ data only. This tutorial will show you how to make predictors out of text features, such as clinical notes, within `timeseriesflattener`.

Specifically, this tutorial will cover _how to generate flattened predictors from already embedded text._


## The dataset

To start out, let's load a synthetic dataset containing text. As with all other features, each row in the dataset needs an ID, a timestamp, and the feature value.


In [10]:
from __future__ import annotations

from timeseriesflattener.testing.load_synth_data import load_synth_text

In [11]:
synth_text = load_synth_text()
synth_text.head()

entity_id,timestamp,value
i64,datetime[μs],str
4647,1967-07-19 00:22:00,"""The patient went into a medica…"
2007,1966-11-25 02:02:00,"""The patient is taken to the em…"
5799,1967-09-19 12:31:00,"""The patient, described as a 7-…"
1319,1969-07-21 23:16:00,"""The patient had been left on a…"
4234,1966-04-14 22:04:00,"""The patient had had some sever…"


## Generating predictors from embedded text

Generating text embeddings can take a while, so it can be advantageous to embed the text before using `timeseriesflattener`. This speeds up the computation if you're generating multiple datasets, since the embedding only has to be done once. This first block will show you how to format a dataframe with embedded text for `timeseriesflattener`.

To start, let's embed the synthetic text data using TF-IDF. You can use any form of text embedding you want. The only constraint is that the result of the embedding function should be a dataframe with an `entity_id_col`, a `timestamp_col`, and any number of columns containing the embeddings, with a single value in each column.

For purposes of demonstration, we will fit a small TF-IDF model to the data and use it to embed the text.


In [12]:
%%capture
import polars as pl
from sklearn.feature_extraction.text import TfidfVectorizer


# define function to embed text and return a dataframe
def embed_text_to_df(text: list[str]) -> pl.DataFrame:
    tfidf_model = TfidfVectorizer(max_features=10)
    embeddings = tfidf_model.fit_transform(text)
    return pl.DataFrame(embeddings.toarray(), schema=tfidf_model.get_feature_names_out().tolist())


# embed text
embedded_text = embed_text_to_df(text=synth_text["value"].to_list())
# drop the text column from the original dataframe
metadata_only = synth_text.drop(["value"])
# concatenate the metadata and the embedded text
embedded_text_with_metadata = pl.concat([metadata_only, embedded_text], how="horizontal")

In [13]:
embedded_text_with_metadata.head()

entity_id,timestamp,and,for,in,of,or,patient,that,the,to,was
i64,datetime[μs],f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
4647,1967-07-19 00:22:00,0.175872,0.182066,0.249848,0.15843,0.0,0.023042,0.311389,0.529966,0.490203,0.479312
2007,1966-11-25 02:02:00,0.24487,0.0,0.135282,0.064337,0.465084,0.336859,0.151743,0.729861,0.179161,0.0
5799,1967-09-19 12:31:00,0.192367,0.232332,0.283402,0.336952,0.0,0.176422,0.238416,0.646879,0.250217,0.382277
1319,1969-07-21 23:16:00,0.165635,0.200046,0.183015,0.261115,0.125837,0.151906,0.205285,0.759528,0.403961,0.098747
4234,1966-04-14 22:04:00,0.493461,0.119196,0.272619,0.207444,0.0,0.045256,0.183475,0.588324,0.433253,0.235349


Now that we have our embeddings in a dataframe including the `entity_id` and `timestamp`, we can simply pass it to `PredictorSpec`!


In [14]:
import datetime as dt

import numpy as np
from timeseriesflattener import PredictorSpec, ValueFrame
from timeseriesflattener.aggregators import MeanAggregator

text_spec = PredictorSpec.from_primitives(
    df=embedded_text_with_metadata,
    entity_id_col_name="entity_id",
    value_timestamp_col_name="timestamp",
    lookbehind_days=[365, 730],
    aggregators=["mean"],
    column_prefix="pred_tfidf",
    fallback=np.nan,
)

# Alternatively, if you prefer types
text_spec = PredictorSpec(
    ValueFrame(
        init_df=embedded_text_with_metadata,
        entity_id_col_name="entity_id",
        value_timestamp_col_name="timestamp",
    ),
    lookbehind_distances=[dt.timedelta(days=365), dt.timedelta(days=730)],
    aggregators=[MeanAggregator()],
    fallback=np.nan,
    column_prefix="pred_tfidf",
)
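
To build some intuition for what a mean aggregation over a lookbehind window asks for, here is a plain-Python sketch for a single entity and a single embedding column. It is a conceptual illustration only, not the library's implementation. The function name `mean_within_lookbehind` and the toy values are invented for the example, and treating both window edges as inclusive is an assumption.

```python
from __future__ import annotations

import datetime as dt
import math

# Toy embedded notes for one patient: (timestamp, value of one embedding column).
notes = [
    (dt.datetime(1966, 1, 10), 0.2),
    (dt.datetime(1966, 9, 1), 0.4),
    (dt.datetime(1965, 3, 1), 0.9),
]


def mean_within_lookbehind(
    values: list[tuple[dt.datetime, float]],
    prediction_time: dt.datetime,
    lookbehind: dt.timedelta,
    fallback: float = math.nan,
) -> float:
    """Average the values observed in the window ending at prediction_time."""
    window_start = prediction_time - lookbehind
    # Assumption: both window edges are inclusive in this sketch.
    in_window = [v for ts, v in values if window_start <= ts <= prediction_time]
    if not in_window:
        # No notes in the window: return the fallback, as with fallback=np.nan.
        return fallback
    return sum(in_window) / len(in_window)


prediction_time = dt.datetime(1966, 12, 1)
mean_365 = mean_within_lookbehind(notes, prediction_time, dt.timedelta(days=365))
mean_730 = mean_within_lookbehind(notes, prediction_time, dt.timedelta(days=730))
no_notes = mean_within_lookbehind(notes, dt.datetime(1960, 1, 1), dt.timedelta(days=365))
```

With these toy values, the 365-day window covers the two 1966 notes, the 730-day window covers all three, and a prediction time with no preceding notes yields the NaN fallback.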

Let's make some features!


We are creating 10\*2=20 features: one for each of the 10 embedding columns for each of the 2 lookbehind windows (365 and 730 days), using the mean aggregation function.
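
The count can be checked by enumerating the expected column names. The naming pattern below is read off the output shown in this tutorial rather than taken from a documented guarantee, so treat it as an assumption that may differ between library versions.

```python
from itertools import product

# The 10 TF-IDF terms that became embedding columns in this tutorial.
terms = ["and", "for", "in", "of", "or", "patient", "that", "the", "to", "was"]
lookbehind_days = [365, 730]

# One feature per (lookbehind, term) pair, using the mean aggregator and a NaN fallback.
expected_columns = [
    f"pred_tfidf_{term}_within_0_to_{days}_days_mean_fallback_nan"
    for days, term in product(lookbehind_days, terms)
]

print(len(expected_columns))
```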


In [15]:
# make features as you normally would
from timeseriesflattener import Flattener, PredictionTimeFrame
from timeseriesflattener.testing.load_synth_data import load_synth_prediction_times

flattener = Flattener(
    predictiontime_frame=PredictionTimeFrame(
        init_df=load_synth_prediction_times(),
        entity_id_col_name="entity_id",
        timestamp_col_name="timestamp",
    )
)

df = flattener.aggregate_timeseries(specs=[text_spec]).df


Let's check the output.


In [16]:
import polars as pl
import polars.selectors as cs

# drop rows with NaN in any float column (no embeddings within the lookbehind periods) for the sake of this example
df.filter(pl.all_horizontal(cs.float().is_not_nan())).head()

entity_id,timestamp,prediction_time_uuid,pred_tfidf_and_within_0_to_365_days_mean_fallback_nan,pred_tfidf_for_within_0_to_365_days_mean_fallback_nan,pred_tfidf_in_within_0_to_365_days_mean_fallback_nan,pred_tfidf_of_within_0_to_365_days_mean_fallback_nan,pred_tfidf_or_within_0_to_365_days_mean_fallback_nan,pred_tfidf_patient_within_0_to_365_days_mean_fallback_nan,pred_tfidf_that_within_0_to_365_days_mean_fallback_nan,pred_tfidf_the_within_0_to_365_days_mean_fallback_nan,pred_tfidf_to_within_0_to_365_days_mean_fallback_nan,pred_tfidf_was_within_0_to_365_days_mean_fallback_nan,pred_tfidf_and_within_0_to_730_days_mean_fallback_nan,pred_tfidf_for_within_0_to_730_days_mean_fallback_nan,pred_tfidf_in_within_0_to_730_days_mean_fallback_nan,pred_tfidf_of_within_0_to_730_days_mean_fallback_nan,pred_tfidf_or_within_0_to_730_days_mean_fallback_nan,pred_tfidf_patient_within_0_to_730_days_mean_fallback_nan,pred_tfidf_that_within_0_to_730_days_mean_fallback_nan,pred_tfidf_the_within_0_to_730_days_mean_fallback_nan,pred_tfidf_to_within_0_to_730_days_mean_fallback_nan,pred_tfidf_was_within_0_to_730_days_mean_fallback_nan
i64,datetime[μs],str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
6840,1965-11-02 07:17:00,"""6840-1965-11-02 07:17:00.00000…",0.155821,0.376386,0.258256,0.573168,0.355142,0.071452,0.096561,0.28581,0.45603,0.092896,0.155821,0.376386,0.258256,0.573168,0.355142,0.071452,0.096561,0.28581,0.45603,0.092896
2039,1966-04-20 05:06:00,"""2039-1966-04-20 05:06:00.00000…",0.108015,0.0,0.596744,0.11352,0.0,0.099062,0.133872,0.693431,0.210747,0.257581,0.108015,0.0,0.596744,0.11352,0.0,0.099062,0.133872,0.693431,0.210747,0.257581
9496,1966-12-06 06:44:00,"""9496-1966-12-06 06:44:00.00000…",0.279955,0.0,0.30933,0.294222,0.0,0.256749,0.0,0.513498,0.546216,0.3338,0.279955,0.0,0.30933,0.294222,0.0,0.256749,0.0,0.513498,0.546216,0.3338
7281,1967-06-05 00:41:00,"""7281-1967-06-05 00:41:00.00000…",0.289663,0.04373,0.280049,0.304425,0.385111,0.332065,0.269251,0.464891,0.211934,0.388547,0.289663,0.04373,0.280049,0.304425,0.385111,0.332065,0.269251,0.464891,0.211934,0.388547
7424,1967-07-13 15:01:00,"""7424-1967-07-13 15:01:00.00000…",0.153907,0.092941,0.170056,0.107834,0.389756,0.282299,0.063583,0.682222,0.475452,0.0,0.153907,0.092941,0.170056,0.107834,0.389756,0.282299,0.063583,0.682222,0.475452,0.0


And just like that, you're ready to make a prediction model!
