# Adding text features

So far, the tutorials have dealt with _tabular_ data only. This tutorial will show you how to make predictors out of text features, such as clinical notes, within `timeseriesflattener`.

Specifically, this tutorial will cover *how to generate flattened predictors from already embedded text.*

## The dataset

To start out, let's load a synthetic dataset containing text. As with all other features, each row in the dataset needs an ID, a timestamp, and the feature value. 

In [1]:
from __future__ import annotations

from timeseriesflattener.testing.load_synth_data import load_synth_text

In [2]:
synth_text = load_synth_text()
synth_text.head()

entity_id,timestamp,value
i64,datetime[μs],str
4647,1967-07-19 00:22:00,"""The patient we…"
2007,1966-11-25 02:02:00,"""The patient is…"
5799,1967-09-19 12:31:00,"""The patient, d…"
1319,1969-07-21 23:16:00,"""The patient ha…"
4234,1966-04-14 22:04:00,"""The patient ha…"


## Generating predictors from embedded text

Since generating text embeddings can take a while, it can be advantageous to embed the text once, before using `timeseriesflattener`, especially if you're generating multiple datasets. This first block shows how to format a dataframe with embedded text for `timeseriesflattener`.

To start, let's embed the synthetic text data using TF-IDF. You can use any form of text embedding you want. The only constraint is that the result of the embedding function is a dataframe with an entity ID column, a timestamp column, and any number of embedding columns, each holding a single scalar value per row.
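
If your embedding model returns one vector per text rather than a table, the vector has to be spread across columns so that each column holds a single value. The snippet below is a minimal, standard-library sketch of that reshaping. The helper `spread_embedding`, the `emb_{i}` column names, and the vectors themselves are made up for illustration and are not part of `timeseriesflattener`.

```python
from __future__ import annotations

from datetime import datetime


def spread_embedding(
    entity_id: int, timestamp: datetime, vector: list[float], prefix: str = "emb"
) -> dict[str, object]:
    """Turn one (id, timestamp, vector) record into a flat row with one scalar per column."""
    row: dict[str, object] = {"entity_id": entity_id, "timestamp": timestamp}
    for i, value in enumerate(vector):
        # Hypothetical column naming: emb_0, emb_1, ...
        row[f"{prefix}_{i}"] = float(value)
    return row


# Illustrative vectors only; a real model would produce these
rows = [
    spread_embedding(4647, datetime(1967, 7, 19, 0, 22), [0.1, 0.0, 0.7]),
    spread_embedding(2007, datetime(1966, 11, 25, 2, 2), [0.3, 0.5, 0.2]),
]

# Every row now has the same flat set of keys
print(list(rows[0].keys()))
```

A list of such rows can be turned into a dataframe with `pl.DataFrame(rows)`, which gives the ID, timestamp, and one-value-per-column layout described above.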

For purposes of demonstration, we will fit a small TF-IDF model to the data and use it to embed the text.

In [7]:
%%capture
import polars as pl
from sklearn.feature_extraction.text import TfidfVectorizer


# define function to embed text and return a dataframe
def embed_text_to_df(text: list[str]) -> pl.DataFrame:
    tfidf_model = TfidfVectorizer(max_features=10)
    embeddings = tfidf_model.fit_transform(text)
    return pl.DataFrame(embeddings.toarray(), schema=tfidf_model.get_feature_names_out().tolist())


# embed text
embedded_text = embed_text_to_df(text=synth_text["value"].to_list())
# drop the text column from the original dataframe
# (a positional column name works across polars versions; the `columns=` keyword does not)
metadata_only = synth_text.drop("value")
# concatenate the metadata and the embedded text
embedded_text_with_metadata = pl.concat([metadata_only, embedded_text], how="horizontal")

In [9]:
embedded_text_with_metadata.head()

entity_id,timestamp,and,for,in,of,or,patient,that,the,to,was
i64,datetime[μs],f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
4647,1967-07-19 00:22:00,0.175872,0.182066,0.249848,0.15843,0.0,0.023042,0.311389,0.529966,0.490203,0.479312
2007,1966-11-25 02:02:00,0.24487,0.0,0.135282,0.064337,0.465084,0.336859,0.151743,0.729861,0.179161,0.0
5799,1967-09-19 12:31:00,0.192367,0.232332,0.283402,0.336952,0.0,0.176422,0.238416,0.646879,0.250217,0.382277
1319,1969-07-21 23:16:00,0.165635,0.200046,0.183015,0.261115,0.125837,0.151906,0.205285,0.759528,0.403961,0.098747
4234,1966-04-14 22:04:00,0.493461,0.119196,0.272619,0.207444,0.0,0.045256,0.183475,0.588324,0.433253,0.235349


Now that we have our embeddings in a dataframe that includes `entity_id` and `timestamp`, we can wrap it in a `ValueFrame` and pass it to `PredictorSpec`.

In [15]:
import datetime as dt

import numpy as np
from timeseriesflattener import PredictorSpec, ValueFrame
from timeseriesflattener.aggregators import MeanAggregator

text_spec = PredictorSpec(
    ValueFrame(
        init_df=embedded_text_with_metadata,
        entity_id_col_name="entity_id",
        value_timestamp_col_name="timestamp",
    ),
    lookbehind_distances=[dt.timedelta(days=365), dt.timedelta(days=730)],
    aggregators=[MeanAggregator()],
    fallback=np.nan,
    column_prefix="pred_tfidf",
)
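
To make the spec concrete, here is a conceptual sketch of what a mean aggregation over a lookbehind window with a NaN fallback computes for a single entity and a single embedding column. It is plain Python for illustration, not the library's implementation. The function `lookbehind_mean`, the example values, and the choice to include both window endpoints are assumptions.

```python
from __future__ import annotations

import math
from datetime import datetime, timedelta


def lookbehind_mean(
    values: list[tuple[datetime, float]],
    prediction_time: datetime,
    lookbehind: timedelta,
    fallback: float = math.nan,
) -> float:
    """Mean of the values observed in the `lookbehind` window before `prediction_time`."""
    window_start = prediction_time - lookbehind
    # Assumption: the window includes both of its endpoints
    in_window = [v for ts, v in values if window_start <= ts <= prediction_time]
    if not in_window:
        # Nothing observed in the window: return the fallback instead of a mean
        return fallback
    return sum(in_window) / len(in_window)


# Made-up values for one entity and one embedding column
history = [
    (datetime(1967, 1, 10), 0.2),
    (datetime(1967, 9, 1), 0.4),
    (datetime(1968, 3, 1), 0.6),
]
prediction_time = datetime(1968, 6, 1)

mean_365 = lookbehind_mean(history, prediction_time, timedelta(days=365))  # two latest values
mean_730 = lookbehind_mean(history, prediction_time, timedelta(days=730))  # all three values
no_history = lookbehind_mean([], prediction_time, timedelta(days=365))  # falls back to NaN
```

The longer window picks up the oldest observation as well, so the two means differ. With no observations in the window, the fallback is returned.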

Let's make some features! 

We are creating 10 × 2 = 20 features: one per embedding column for each lookbehind window (365 and 730 days), aggregated with the mean.
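
As a quick sanity check on that count, the sketch below enumerates the column names we should expect. The naming pattern is read off the flattened output shown later in this tutorial rather than taken from the library source, so treat it as an assumption that may differ between versions.

```python
from itertools import product

# The ten TF-IDF vocabulary terms from the embedded dataframe
embedding_cols = ["and", "for", "in", "of", "or", "patient", "that", "the", "to", "was"]
lookbehind_days = [365, 730]

# Assumed pattern: prefix, embedding column, window, aggregator, fallback
expected_cols = [
    f"pred_tfidf_{name}_within_0_to_{days}_days_mean_fallback_nan"
    for days, name in product(lookbehind_days, embedding_cols)
]

print(len(expected_cols))
print(expected_cols[0])
```

Adding a third lookbehind or a second aggregator multiplies the count accordingly, which is worth keeping in mind with high-dimensional embeddings.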

In [16]:
# make features how you would normally
from timeseriesflattener import Flattener, PredictionTimeFrame
from timeseriesflattener.testing.load_synth_data import load_synth_prediction_times

flattener = Flattener(
    predictiontime_frame=PredictionTimeFrame(
        init_df=load_synth_prediction_times(),
        entity_id_col_name="entity_id",
        timestamp_col_name="timestamp",
    )
)

df = flattener.aggregate_timeseries(specs=[text_spec]).df.collect()

Let's check the output.

In [29]:
import polars as pl
import polars.selectors as cs

# for the sake of this example, keep only rows with no NaN in any float column
# (NaN means no embeddings were found within that lookbehind period)
df.filter(pl.all_horizontal(cs.float().is_not_nan())).head()

entity_id,timestamp,prediction_time_uuid,pred_tfidf_and_within_0_to_365_days_mean_fallback_nan,pred_tfidf_for_within_0_to_365_days_mean_fallback_nan,pred_tfidf_in_within_0_to_365_days_mean_fallback_nan,pred_tfidf_of_within_0_to_365_days_mean_fallback_nan,pred_tfidf_or_within_0_to_365_days_mean_fallback_nan,pred_tfidf_patient_within_0_to_365_days_mean_fallback_nan,pred_tfidf_that_within_0_to_365_days_mean_fallback_nan,pred_tfidf_the_within_0_to_365_days_mean_fallback_nan,pred_tfidf_to_within_0_to_365_days_mean_fallback_nan,pred_tfidf_was_within_0_to_365_days_mean_fallback_nan,pred_tfidf_and_within_0_to_730_days_mean_fallback_nan,pred_tfidf_for_within_0_to_730_days_mean_fallback_nan,pred_tfidf_in_within_0_to_730_days_mean_fallback_nan,pred_tfidf_of_within_0_to_730_days_mean_fallback_nan,pred_tfidf_or_within_0_to_730_days_mean_fallback_nan,pred_tfidf_patient_within_0_to_730_days_mean_fallback_nan,pred_tfidf_that_within_0_to_730_days_mean_fallback_nan,pred_tfidf_the_within_0_to_730_days_mean_fallback_nan,pred_tfidf_to_within_0_to_730_days_mean_fallback_nan,pred_tfidf_was_within_0_to_730_days_mean_fallback_nan
i64,datetime[μs],str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
1660,1968-08-04 02:05:00,"""1660-1968-08-0…",0.093117,0.028116,0.102888,0.146794,0.67206,0.298894,0.173111,0.55509,0.227099,0.16654,0.093117,0.028116,0.102888,0.146794,0.67206,0.298894,0.173111,0.55509,0.227099,0.16654
5570,1966-07-12 10:54:00,"""5570-1966-07-1…",0.611529,0.0,0.0,0.321347,0.0,0.280419,0.378958,0.280419,0.298286,0.364574,0.611529,0.0,0.0,0.321347,0.0,0.280419,0.378958,0.280419,0.298286,0.364574
7337,1966-06-28 10:34:00,"""7337-1966-06-2…",0.231068,0.139536,0.127656,0.303555,0.0,0.158935,0.071595,0.847656,0.225416,0.137755,0.231068,0.139536,0.127656,0.303555,0.0,0.158935,0.071595,0.847656,0.225416,0.137755
4234,1969-08-09 12:46:00,"""4234-1969-08-0…",0.493461,0.119196,0.272619,0.207444,0.0,0.045256,0.183475,0.588324,0.433253,0.235349,0.493461,0.119196,0.272619,0.207444,0.0,0.045256,0.183475,0.588324,0.433253,0.235349
5799,1966-12-15 08:15:00,"""5799-1966-12-1…",0.192367,0.232332,0.283402,0.336952,0.0,0.176422,0.238416,0.646879,0.250217,0.382277,0.192367,0.232332,0.283402,0.336952,0.0,0.176422,0.238416,0.646879,0.250217,0.382277


And just like that, you're ready to make a prediction model!