# Adding text features

So far, the tutorials have dealt with _tabular_ data only. This tutorial will show you how to make predictors out of text features, such as clinical notes, within `timeseriesflattener`.

Specifically, this tutorial will cover *how to generate flattened predictors from already embedded text.*

## The dataset

To start out, let's load a synthetic dataset containing text. As with all other features, each row in the dataset needs an ID, a timestamp, and the feature value. 

In [1]:
from timeseriesflattener.testing.load_synth_data import load_synth_text



In [2]:
synth_text = load_synth_text()
synth_text.head()

Unnamed: 0,entity_id,timestamp,value
0,4647,1967-07-19 00:22:00,The patient went into a medically induced coma...
1,2007,1966-11-25 02:02:00,The patient is taken to the emergency departme...
2,5799,1967-09-19 12:31:00,"The patient, described as a 7-month old son wh..."
3,1319,1969-07-21 23:16:00,The patient had been left on a bed for 20 minu...
4,4234,1966-04-14 22:04:00,The patient had had some severe allergies but ...
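
The three-column layout described above can be sketched with plain pandas. This is a minimal illustration on hypothetical toy data (the notes and IDs are made up); it only checks that the expected columns are present and that the timestamp column has a datetime dtype.

```python
import pandas as pd

# Hypothetical toy data: the column names mirror the synthetic dataset,
# but the rows themselves are invented for illustration.
notes = pd.DataFrame(
    {
        "entity_id": [1, 1, 2],
        "timestamp": pd.to_datetime(
            ["2020-01-01 08:00", "2020-03-15 14:30", "2020-02-10 09:45"]
        ),
        "value": [
            "The patient reported mild chest pain.",
            "The patient was discharged in stable condition.",
            "The patient was admitted for observation.",
        ],
    }
)

# Each row needs an ID, a timestamp, and a feature value.
required_columns = {"entity_id", "timestamp", "value"}
missing = required_columns - set(notes.columns)
is_datetime = pd.api.types.is_datetime64_any_dtype(notes["timestamp"])

print(f"Missing columns: {missing or 'none'}")
print(f"Timestamp column is datetime: {is_datetime}")
```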


## Generating predictors from embedded text

Generating text embeddings can take a while, so if you're generating multiple datasets it can be advantageous to embed the text once, before using `timeseriesflattener`. This first block will show how to convert a dataframe with embeddings into a format that can be passed to `timeseriesflattener`. 

To start, let's embed the synthetic text data using TF-IDF. You can use any form of text embedding you want - the only constraint is that the result of the embedding function should be a dataframe with an `entity_id_col`, a `timestamp_col`, and any number of `value_cols`. 

For purposes of demonstration, we will fit a small TF-IDF model to the data and use it to embed the text.

In [21]:
%%capture
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd


# define function to embed text and return a dataframe
def embed_text_to_df(text: list[str]) -> pd.DataFrame:
    tfidf_model = TfidfVectorizer(max_features=10)
    embeddings = tfidf_model.fit_transform(text)
    return pd.DataFrame(embeddings.toarray(), columns=tfidf_model.get_feature_names_out())

# embed text
embedded_text = embed_text_to_df(text=synth_text["value"].tolist())
# drop the text column from the original dataframe
metadata_only = synth_text.drop(columns=["value"])
# concatenate the metadata and the embedded text
embedded_text_with_metadata = pd.concat([metadata_only, embedded_text], axis=1)


In [22]:
embedded_text_with_metadata.head()


Unnamed: 0,entity_id,timestamp,and,for,in,of,or,patient,that,the,to,was
0,4647,1967-07-19 00:22:00,0.175872,0.182066,0.249848,0.15843,0.0,0.023042,0.311389,0.529966,0.490203,0.479312
1,2007,1966-11-25 02:02:00,0.24487,0.0,0.135282,0.064337,0.465084,0.336859,0.151743,0.729861,0.179161,0.0
2,5799,1967-09-19 12:31:00,0.192367,0.232332,0.283402,0.336952,0.0,0.176422,0.238416,0.646879,0.250217,0.382277
3,1319,1969-07-21 23:16:00,0.165635,0.200046,0.183015,0.261115,0.125837,0.151906,0.205285,0.759528,0.403961,0.098747
4,4234,1966-04-14 22:04:00,0.493461,0.119196,0.272619,0.207444,0.0,0.045256,0.183475,0.588324,0.433253,0.235349
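
One detail worth keeping in mind when attaching metadata to embeddings: `pd.concat(..., axis=1)` aligns rows on the index, not on position. The sketch below uses a hypothetical `toy_embed` function as a stand-in for a real embedding model and shows how resetting the index keeps rows aligned after the text dataframe has been filtered. All names and data here are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def toy_embed(texts: list[str]) -> np.ndarray:
    """Hypothetical stand-in for an embedding model: two hand-made features
    (character count and word count) per text."""
    return np.array([[len(t), len(t.split())] for t in texts], dtype=float)


notes = pd.DataFrame(
    {
        "entity_id": [1, 2, 3],
        "timestamp": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
        "value": ["short note", "", "a slightly longer note"],
    }
)
# Filtering leaves a non-contiguous index (0 and 2).
notes = notes[notes["value"] != ""]

# The embedding dataframe gets a fresh 0..n-1 index.
embeddings = pd.DataFrame(
    toy_embed(notes["value"].tolist()), columns=["emb_0", "emb_1"]
)

# Reset the metadata index so rows line up by position before concatenating.
metadata = notes.drop(columns=["value"]).reset_index(drop=True)
wide = pd.concat([metadata, embeddings], axis=1)
print(wide)
```

Without `reset_index(drop=True)`, the concatenation would pair index 2 of the metadata with a missing embedding row and introduce NaN values.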


Now that we have our embeddings, we can use the `df_with_multiple_values_to_named_dataframes` function to turn the embeddings into a format that can be readily supplied to `PredictorGroupSpec`.

In [23]:
from timeseriesflattener.df_transforms import (
    df_with_multiple_values_to_named_dataframes,
)

# split the dataframe into a list of named dataframes with one value each
embedded_dfs = df_with_multiple_values_to_named_dataframes(
    df=embedded_text_with_metadata,
    entity_id_col_name="entity_id",
    timestamp_col_name="timestamp",
    name_prefix="tfidf_",
)

# check the first dataframe
embedded_dfs[0].df.head()


Unnamed: 0,entity_id,timestamp,value
0,4647,1967-07-19 00:22:00,0.175872
1,2007,1966-11-25 02:02:00,0.24487
2,5799,1967-09-19 12:31:00,0.192367
3,1319,1969-07-21 23:16:00,0.165635
4,4234,1966-04-14 22:04:00,0.493461


In [24]:
# check the number of embeddings/dataframes
len(embedded_dfs)


10

Each dataframe has been named according to `name_prefix` and the column name. This means that if your column names are informative (e.g. if they correspond to specific words in a BOW model), they will be kept. 

In [25]:
embedded_dfs[0].name


'tfidf_and'
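
To build intuition for the splitting step, here is a rough plain-pandas sketch of the idea: one long-format dataframe per value column, each named by prefix plus column name. This is not the library's implementation; `NamedFrame` and `split_wide_df` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class NamedFrame:
    """Hypothetical container: a dataframe with a single value column, plus a name."""

    df: pd.DataFrame
    name: str


def split_wide_df(
    wide: pd.DataFrame, id_col: str, ts_col: str, name_prefix: str
) -> list[NamedFrame]:
    # Every column that is not the ID or the timestamp is treated as a value column.
    value_cols = [c for c in wide.columns if c not in (id_col, ts_col)]
    return [
        NamedFrame(
            df=wide[[id_col, ts_col, col]].rename(columns={col: "value"}),
            name=f"{name_prefix}{col}",
        )
        for col in value_cols
    ]


# Toy wide dataframe with two embedding columns.
toy_wide = pd.DataFrame(
    {
        "entity_id": [1, 2],
        "timestamp": pd.to_datetime(["2020-01-01", "2020-01-02"]),
        "and": [0.1, 0.2],
        "the": [0.5, 0.7],
    }
)

frames = split_wide_df(toy_wide, "entity_id", "timestamp", name_prefix="tfidf_")
names = [frame.name for frame in frames]
print(names)
```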

Let's make some features! 

In [26]:
from timeseriesflattener.aggregation_fns import mean
from timeseriesflattener.feature_specs.group_specs import PredictorGroupSpec
import numpy as np

# create a group spec for the embedded text that will take the mean of each
# embedding column within the last 365 and 730 days
emb_spec_batch = PredictorGroupSpec(
    named_dataframes=embedded_dfs,
    lookbehind_days=[365, 730],
    fallback=[np.nan],
    aggregation_fns=[mean],
).create_combinations()

# print the number of features we will create
print(len(emb_spec_batch))


20


We are creating 10*2=20 features: 1 for each embedding for each lookbehind (365 and 730 days).
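
The count follows from taking every combination of the supplied options. A minimal sketch of that arithmetic using `itertools.product` (an illustration of the counting only, not of how the library builds its specs):

```python
from itertools import product

# The ten TF-IDF columns from the embedding step.
words = ["and", "for", "in", "of", "or", "patient", "that", "the", "to", "was"]
embedding_names = [f"tfidf_{word}" for word in words]

lookbehind_days = [365, 730]
aggregation_names = ["mean"]
fallbacks = [float("nan")]

# One feature per (embedding, lookbehind, aggregation, fallback) combination.
combinations = list(
    product(embedding_names, lookbehind_days, aggregation_names, fallbacks)
)
n_features = len(combinations)
print(n_features)  # 10 * 2 * 1 * 1 = 20
```

Adding a second aggregation function or a third lookbehind window multiplies the count accordingly, so the number of features can grow quickly with high-dimensional embeddings.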

In [27]:
# make features as you normally would
from timeseriesflattener import TimeseriesFlattener
from timeseriesflattener.testing.load_synth_data import load_synth_prediction_times

ts_flattener = TimeseriesFlattener(
    prediction_times_df=load_synth_prediction_times(),
    entity_id_col_name="entity_id",
    timestamp_col_name="timestamp",
    n_workers=1,
    drop_pred_times_with_insufficient_look_distance=False,
)
ts_flattener.add_spec(emb_spec_batch)
df = ts_flattener.get_df()


2023-08-09 15:21:03 [INFO] There were unprocessed specs, computing...
2023-08-09 15:21:03 [INFO] Processing 20 temporal features in parallel with 1 workers. Chunksize is 20. If this is above 1, it may take some time for the progress bar to move, as processing is batched. However, this makes for much faster total performance.
100%|██████████| 20/20 [00:02<00:00,  9.79it/s]
2023-08-09 15:21:05 [INFO] Checking alignment of dataframes - this might take a little while (~2 minutes for 1.000 dataframes with 2.000.000 rows).
2023-08-09 15:21:05 [INFO] Starting concatenation. Will take some time on performant systems, e.g. 30s for 100 features and 2_000_000 prediction times. This is normal.
2023-08-09 15:21:05 [INFO] Concatenation took 0.039 seconds
2023-08-09 15:21:05 [INFO] Merging with original df


Let's check the output.

In [28]:
# dropping na values (no embeddings within the lookbehind period) for the sake of this example
df.dropna().head()


Unnamed: 0,entity_id,timestamp,prediction_time_uuid,pred_tfidf_the_within_365_days_mean_fallback_nan,pred_tfidf_and_within_365_days_mean_fallback_nan,pred_tfidf_for_within_365_days_mean_fallback_nan,pred_tfidf_for_within_730_days_mean_fallback_nan,pred_tfidf_that_within_730_days_mean_fallback_nan,pred_tfidf_in_within_365_days_mean_fallback_nan,pred_tfidf_or_within_730_days_mean_fallback_nan,...,pred_tfidf_was_within_730_days_mean_fallback_nan,pred_tfidf_was_within_365_days_mean_fallback_nan,pred_tfidf_of_within_365_days_mean_fallback_nan,pred_tfidf_patient_within_365_days_mean_fallback_nan,pred_tfidf_to_within_730_days_mean_fallback_nan,pred_tfidf_of_within_730_days_mean_fallback_nan,pred_tfidf_patient_within_730_days_mean_fallback_nan,pred_tfidf_the_within_730_days_mean_fallback_nan,pred_tfidf_that_within_365_days_mean_fallback_nan,pred_tfidf_in_within_730_days_mean_fallback_nan
1917,4977,1968-11-28 16:05:00,4977-1968-11-28-16-05-00,0.53489,0.145809,0.08805,0.08805,0.090356,0.483324,0.221549,...,0.086927,0.086927,0.536339,0.133722,0.284485,0.536339,0.133722,0.53489,0.090356,0.483324
2463,6840,1965-11-02 07:17:00,6840-1965-11-02-07-17-00,0.28581,0.155821,0.376386,0.376386,0.096561,0.258256,0.355142,...,0.092896,0.092896,0.573168,0.071452,0.45603,0.573168,0.071452,0.28581,0.096561,0.258256
2580,18,1968-08-26 15:19:00,18-1968-08-26-15-19-00,0.601521,0.0,0.0,0.0,0.0,0.0,0.0,...,0.26068,0.26068,0.0,0.401014,0.639848,0.0,0.401014,0.601521,0.0,0.0
2741,9832,1969-06-03 04:36:00,9832-1969-06-03-04-36-00,0.825558,0.225044,0.101924,0.101924,0.0,0.186493,0.128228,...,0.33541,0.33541,0.236513,0.103195,0.164655,0.236513,0.103195,0.825558,0.0,0.186493
2931,7281,1967-06-05 00:41:00,7281-1967-06-05-00-41-00,0.464891,0.289663,0.04373,0.04373,0.269251,0.280049,0.385111,...,0.388547,0.388547,0.304425,0.332065,0.211934,0.304425,0.332065,0.464891,0.269251,0.280049
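
To make these columns easier to interpret, the sketch below computes a lookbehind mean for a single feature with plain pandas on hypothetical toy data. It is a conceptual illustration, not the library's implementation. In particular, the window boundaries used here (values strictly before the prediction time and at most `days` old) are an assumption, and `mean_within_lookbehind` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Hypothetical values for one embedding column.
values = pd.DataFrame(
    {
        "entity_id": [1, 1, 1, 2],
        "timestamp": pd.to_datetime(
            ["2019-06-01", "2019-12-01", "2018-03-01", "2019-11-01"]
        ),
        "value": [0.2, 0.4, 0.9, 0.5],
    }
)
prediction_times = pd.DataFrame(
    {
        "entity_id": [1, 2, 3],
        "pred_time": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-01"]),
    }
)


def mean_within_lookbehind(
    pred: pd.DataFrame, vals: pd.DataFrame, days: int, fallback: float = np.nan
) -> pd.DataFrame:
    col = f"mean_within_{days}_days"
    # Pair every prediction time with all values for the same entity.
    merged = pred.merge(vals, on="entity_id", how="left")
    age = merged["pred_time"] - merged["timestamp"]
    # Assumed window: strictly before the prediction time, at most `days` old.
    in_window = (age > pd.Timedelta(0)) & (age <= pd.Timedelta(days=days))
    means = merged[in_window].groupby(["entity_id", "pred_time"])["value"].mean()
    out = pred.join(means.rename(col), on=["entity_id", "pred_time"])
    # Prediction times with no values in the window receive the fallback.
    out[col] = out[col].where(out[col].notna(), fallback)
    return out


result_365 = mean_within_lookbehind(prediction_times, values, days=365)
result_730 = mean_within_lookbehind(prediction_times, values, days=730)
print(result_365)
print(result_730)
```

In this toy data, entity 3 has no values at all, so it receives the fallback (NaN) in both windows, which is the kind of row that `dropna()` removes in the cell above.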
