# Adding text features

So far, the tutorials have dealt with _tabular_ data only. This tutorial will show you how to make predictors out of text features, such as clinical notes, within `timeseriesflattener`.

Specifically, this tutorial will cover how to generate flattened predictors from already embedded text.

To use the features in this tutorial you'll need to install some extra dependencies. These can be installed by running:
```
pip install torch transformers sentence-transformers
```
or by installing `timeseriesflattener` with the text dependencies.
```
pip install "timeseriesflattener[text]"
```
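
Before running the rest of the tutorial, it can help to confirm that the optional packages are importable. The sketch below uses only the standard library. The helper name `missing_text_dependencies` is made up for illustration, and the module list assumes the import names `torch`, `transformers`, and `sentence_transformers`.

```python
from importlib.util import find_spec

# Import names, which can differ from the names used with pip:
# `torch` is PyTorch and `sentence_transformers` is sentence-transformers.
TEXT_DEPENDENCIES = ("torch", "transformers", "sentence_transformers")


def missing_text_dependencies(modules=TEXT_DEPENDENCIES) -> list[str]:
    """Return the modules from `modules` that cannot be found for import."""
    return [name for name in modules if find_spec(name) is None]


missing = missing_text_dependencies()
print("Missing text dependencies:", missing or "none")
```

An empty result suggests the text extras are in place; any names listed still need to be installed.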

## The dataset

To start out, let's load a synthetic dataset containing text. As with all other features, each row in the dataset needs an ID, a timestamp, and the feature value. 

In [1]:
from timeseriesflattener.testing.load_synth_data import load_synth_text

In [2]:
synth_text = load_synth_text()
synth_text.head()

Unnamed: 0,entity_id,timestamp,value
0,4647,1967-07-19 00:22:00,The patient went into a medically induced coma...
1,2007,1966-11-25 02:02:00,The patient is taken to the emergency departme...
2,5799,1967-09-19 12:31:00,"The patient, described as a 7-month old son wh..."
3,1319,1969-07-21 23:16:00,The patient had been left on a bed for 20 minu...
4,4234,1966-04-14 22:04:00,The patient had had some severe allergies but ...
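
If you want to bring your own text data, it needs the same three-column layout. The following is a minimal sketch, assuming the column names `entity_id`, `timestamp`, and `value` used by the synthetic data. The rows and the `check_text_df` helper are hypothetical and not part of `timeseriesflattener`.

```python
import pandas as pd

# Hypothetical hand-made rows mirroring the synthetic dataset's layout.
my_notes = pd.DataFrame(
    {
        "entity_id": [1, 1, 2],
        "timestamp": pd.to_datetime(
            ["2020-01-01 08:00", "2020-03-15 12:30", "2020-02-10 09:45"]
        ),
        "value": ["First note.", "Follow-up note.", "Another patient's note."],
    }
)

REQUIRED_COLUMNS = ("entity_id", "timestamp", "value")


def check_text_df(df: pd.DataFrame) -> None:
    """Raise ValueError if a required column is missing or timestamps are not datetimes."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    if not pd.api.types.is_datetime64_any_dtype(df["timestamp"]):
        raise ValueError("`timestamp` must be a datetime column")


check_text_df(my_notes)  # passes silently for a well-formed dataframe
```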


## Generating predictors from embedded text

Generating text embeddings can take a while, so if you're generating multiple datasets it can be advantageous to embed the text once, before using `timeseriesflattener`, to speed up the computation. This first block will show how to convert a dataframe with embeddings into a format that can be passed to `timeseriesflattener`. Skip to [TextPredictorSpec](#textpredictorspec) if you want to perform the embedding step directly in `timeseriesflattener`.

To start, let's embed the synthetic text data using a sentence-transformer. You can use any form of text embedding you want. The only constraint is that the embedding step should end with a dataframe containing an `entity_id_col`, a `timestamp_col`, and any number of `value_cols`.

In [3]:
%%capture
from sentence_transformers import SentenceTransformer
import pandas as pd

# load fast model
model = SentenceTransformer("all-MiniLM-L6-v2")

# define function to embed text and return a dataframe
def embed_text_to_df(model: SentenceTransformer, text: list[str]) -> pd.DataFrame:
    embeddings = model.encode(text, batch_size=256)
    return pd.DataFrame(embeddings)

# embed text
embedded_text = embed_text_to_df(model=model, text=synth_text["value"].tolist())
# drop the text column from the original dataframe
metadata_only = synth_text.drop(columns=["value"])
# concatenate the metadata and the embedded text
embedded_text_with_metadata = pd.concat([metadata_only, embedded_text], axis=1)
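
One detail worth knowing about the `pd.concat` call above: with `axis=1`, pandas pairs rows by index label, not by position. That works here because both dataframes have a default `0..n-1` index. The sketch below uses made-up data to show what happens when the metadata has a different index (for example after filtering rows), and how `reset_index(drop=True)` avoids the problem.

```python
import numpy as np
import pandas as pd

# Hypothetical metadata whose index is not 0..n-1, e.g. after filtering rows.
metadata = pd.DataFrame(
    {"entity_id": [10, 11, 12], "timestamp": pd.to_datetime(["2020-01-01"] * 3)},
    index=[5, 6, 7],
)
# Stand-in for model output: pd.DataFrame(array) gets a fresh 0..n-1 index.
embeddings = pd.DataFrame(np.ones((3, 2)))

# Index labels do not overlap, so rows are not paired up and NaNs appear.
misaligned = pd.concat([metadata, embeddings], axis=1)

# Resetting the index first pairs rows by position, as intended.
aligned = pd.concat([metadata.reset_index(drop=True), embeddings], axis=1)

print(misaligned.shape, aligned.shape)
```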


In [4]:
embedded_text_with_metadata.head()


Unnamed: 0,entity_id,timestamp,0,1,2,3,4,5,6,7,...,374,375,376,377,378,379,380,381,382,383
0,4647,1967-07-19 00:22:00,-0.020159,0.006134,-0.006455,0.005938,0.038562,0.005949,-0.056681,0.029464,...,0.021432,0.061494,0.011665,0.018157,-0.035946,0.101041,-0.002912,0.014489,-0.033684,-0.085988
1,2007,1966-11-25 02:02:00,-0.065502,0.026975,-0.042235,-0.012499,-0.01282,-0.003107,0.025823,0.115787,...,-0.013681,-0.008509,-0.005801,-0.019228,-0.029137,0.107618,0.027575,0.061189,-0.036197,-0.023715
2,5799,1967-09-19 12:31:00,-0.015965,0.030239,-0.025726,0.011575,-0.056353,0.02495,0.005075,0.158615,...,0.021345,0.019185,0.046376,0.008546,-0.017712,0.014252,-0.090198,0.036281,0.119648,-0.031743
3,1319,1969-07-21 23:16:00,0.049595,0.124481,-0.050134,0.036343,0.040793,0.067932,0.108808,0.068143,...,0.041999,-0.011297,0.013209,0.002157,-0.032716,-0.001036,-0.013383,-0.025948,-0.033742,-0.01356
4,4234,1966-04-14 22:04:00,-0.062923,0.062385,-0.048646,0.081368,0.115612,-0.036585,0.105179,0.034068,...,0.015677,-0.009112,-0.032549,0.021608,-0.043334,0.057872,-0.044645,0.024808,0.002562,0.030407
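
Any embedding method fits this pattern, as long as the result has the ID column, the timestamp column, and one column per value. As a rough sketch, here is a toy word-count embedding on made-up notes. The `count_words_to_df` helper and the three-word vocabulary are hypothetical and only meant to show the expected output shape.

```python
import pandas as pd

# Hypothetical mini-dataset in the same layout as `synth_text`.
notes = pd.DataFrame(
    {
        "entity_id": [1, 2],
        "timestamp": pd.to_datetime(["2020-01-01", "2020-02-01"]),
        "value": ["pain in the chest and arm", "fever and cough for days"],
    }
)

# Illustrative vocabulary; a real bag-of-words model would learn this from data.
VOCABULARY = ["and", "for", "in"]


def count_words_to_df(text: list[str], vocabulary: list[str]) -> pd.DataFrame:
    """Return one column per vocabulary word holding its count in each text."""
    rows = [[doc.lower().split().count(word) for word in vocabulary] for doc in text]
    return pd.DataFrame(rows, columns=vocabulary)


counts = count_words_to_df(notes["value"].tolist(), VOCABULARY)
# Same pattern as in cell [3]: metadata columns first, then the value columns.
bow_with_metadata = pd.concat([notes.drop(columns=["value"]), counts], axis=1)
print(bow_with_metadata)
```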


Now that we have our embeddings, we can use the `df_with_multiple_values_to_named_dataframes` function to turn the embeddings into a format that can be readily supplied to `PredictorGroupSpec`.

In [5]:
from timeseriesflattener.df_transforms import (
    df_with_multiple_values_to_named_dataframes,
)

# split the dataframe into a list of named dataframes with one value each
embedded_dfs = df_with_multiple_values_to_named_dataframes(
    df=embedded_text_with_metadata,
    entity_id_col_name="entity_id",
    timestamp_col_name="timestamp",
    name_prefix="sent_emb_",
)

# check the first dataframe
embedded_dfs[0].df.head()


Unnamed: 0,entity_id,timestamp,value
0,4647,1967-07-19 00:22:00,-0.020159
1,2007,1966-11-25 02:02:00,-0.065502
2,5799,1967-09-19 12:31:00,-0.015965
3,1319,1969-07-21 23:16:00,0.049595
4,4234,1966-04-14 22:04:00,-0.062923


In [6]:
# check the number of embeddings/dataframes
len(embedded_dfs)


384

Each dataframe has been named according to `name_prefix` and the column name. This means that if your column names are informative (e.g. if they correspond to specific words in a BOW model), they will be kept.

In [7]:
embedded_dfs[0].name


'sent_emb_0'
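
Conceptually, the split turns one wide dataframe into one long dataframe per value column, each named with the prefix plus the original column name. The sketch below illustrates that idea on made-up data. It is not the library's implementation: `split_value_columns` is a hypothetical helper that returns a plain dict, whereas the real function returns objects with `.df` and `.name` attributes.

```python
import pandas as pd

# Hypothetical wide dataframe: two metadata columns plus two value columns.
wide = pd.DataFrame(
    {
        "entity_id": [1, 2],
        "timestamp": pd.to_datetime(["2020-01-01", "2020-02-01"]),
        0: [0.1, 0.2],
        1: [0.3, 0.4],
    }
)


def split_value_columns(
    df: pd.DataFrame, id_col: str, timestamp_col: str, name_prefix: str
) -> dict[str, pd.DataFrame]:
    """Map '<prefix><column name>' to a dataframe with a single `value` column."""
    value_cols = [col for col in df.columns if col not in (id_col, timestamp_col)]
    return {
        f"{name_prefix}{col}": df[[id_col, timestamp_col, col]].rename(
            columns={col: "value"}
        )
        for col in value_cols
    }


named = split_value_columns(wide, "entity_id", "timestamp", "sent_emb_")
print(list(named))
```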

Let's make some features! 

In [8]:
from timeseriesflattener.aggregation_fns import mean
from timeseriesflattener.feature_specs.group_specs import PredictorGroupSpec
import numpy as np

# create a group spec for the embedded text that will take the mean of each embedding on the column axis
# for the last 365 and 730 days
emb_spec_batch = PredictorGroupSpec(
    named_dataframes=embedded_dfs,
    lookbehind_days=[365, 730],
    fallback=[np.nan],
    aggregation_fns=[mean],
).create_combinations()

# print the number of features we will create
print(len(emb_spec_batch))


768


We are creating 384*2=768 features: 1 for each embedding for each lookbehind (365 and 730 days).
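
The count can be reproduced with a plain Cartesian product. This sketch assumes that `create_combinations` yields one spec per combination of dataframe, lookbehind, aggregation function, and fallback, which is consistent with the 768 printed above. The variable names are illustrative only.

```python
from itertools import product

n_embedding_dims = 384  # number of named dataframes, from len(embedded_dfs)
lookbehind_days = [365, 730]
aggregation_names = ["mean"]
fallbacks = ["nan"]

# One feature per combination of dataframe, lookbehind, aggregation and fallback.
combinations = list(
    product(range(n_embedding_dims), lookbehind_days, aggregation_names, fallbacks)
)
print(len(combinations))  # 384 * 2 * 1 * 1 = 768
```

Adding a second aggregation function or a third lookbehind multiplies the feature count accordingly, so the number of columns can grow quickly with high-dimensional embeddings.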

In [9]:
# make features how you would normally
from timeseriesflattener import TimeseriesFlattener
from timeseriesflattener.testing.load_synth_data import load_synth_prediction_times

ts_flattener = TimeseriesFlattener(
    prediction_times_df=load_synth_prediction_times(),
    entity_id_col_name="entity_id",
    timestamp_col_name="timestamp",
    n_workers=1,
    drop_pred_times_with_insufficient_look_distance=False,
)
ts_flattener.add_spec(emb_spec_batch)
df = ts_flattener.get_df()


2023-07-21 14:58:09 [INFO] There were unprocessed specs, computing...
2023-07-21 14:58:09 [INFO] Processing 768 temporal features in parallel with 1 workers. Chunksize is 768. If this is above 1, it may take some time for the progress bar to move, as processing is batched. However, this makes for much faster total performance.


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


100%|██████████| 768/768 [00:23<00:00, 33.28it/s]
2023-07-21 14:58:32 [INFO] Checking alignment of dataframes - this might take a little while (~2 minutes for 1.000 dataframes with 2.000.000 rows).
2023-07-21 14:58:33 [INFO] Starting concatenation. Will take some time on performant systems, e.g. 30s for 100 features and 2_000_000 prediction times. This is normal.
2023-07-21 14:58:34 [INFO] Concatenation took 1.269 seconds
2023-07-21 14:58:34 [INFO] Merging with original df


Let's check the output.

In [10]:
# dropping na values (no embeddings within the lookbehind period) for the sake of this example
df.dropna().head()


Unnamed: 0,entity_id,timestamp,prediction_time_uuid,pred_sent_emb_4_within_365_days_mean_fallback_nan,pred_sent_emb_91_within_365_days_mean_fallback_nan,pred_sent_emb_106_within_730_days_mean_fallback_nan,pred_sent_emb_376_within_730_days_mean_fallback_nan,pred_sent_emb_333_within_730_days_mean_fallback_nan,pred_sent_emb_51_within_365_days_mean_fallback_nan,pred_sent_emb_316_within_730_days_mean_fallback_nan,...,pred_sent_emb_217_within_365_days_mean_fallback_nan,pred_sent_emb_288_within_365_days_mean_fallback_nan,pred_sent_emb_67_within_730_days_mean_fallback_nan,pred_sent_emb_120_within_730_days_mean_fallback_nan,pred_sent_emb_221_within_365_days_mean_fallback_nan,pred_sent_emb_26_within_365_days_mean_fallback_nan,pred_sent_emb_137_within_365_days_mean_fallback_nan,pred_sent_emb_11_within_365_days_mean_fallback_nan,pred_sent_emb_205_within_730_days_mean_fallback_nan,pred_sent_emb_357_within_730_days_mean_fallback_nan
1917,4977,1968-11-28 16:05:00,4977-1968-11-28-16-05-00,0.001232,-0.000302,0.040634,-0.002166,0.039541,0.107663,0.012334,...,-0.036499,0.073914,-0.078592,0.052193,0.013373,-0.069261,-0.088767,0.00415,-0.081158,-0.007002
2463,6840,1965-11-02 07:17:00,6840-1965-11-02-07-17-00,0.015495,0.010209,-0.006142,0.047095,0.062537,-0.047844,-0.117715,...,0.031768,-0.075292,-0.061927,-0.028022,0.046316,-0.026953,-0.095338,0.002313,0.056995,0.015441
2580,18,1968-08-26 15:19:00,18-1968-08-26-15-19-00,-0.025853,0.049886,-0.060344,0.054533,0.013257,-0.022677,-0.021626,...,-0.013668,-0.024832,-0.086064,0.004718,0.020128,0.00432,-0.078148,0.016352,-0.011881,-0.061026
2741,9832,1969-06-03 04:36:00,9832-1969-06-03-04-36-00,-0.047658,0.103156,0.049586,0.012266,-0.05127,-0.056747,-0.047248,...,-0.015452,0.011051,-0.108515,-0.033745,-0.037524,0.009744,-0.045512,0.091745,0.012856,-0.013721
2931,7281,1967-06-05 00:41:00,7281-1967-06-05-00-41-00,-0.027302,-0.031607,0.009704,0.059126,-0.036862,-0.098369,-0.026895,...,-0.038409,0.015281,-0.071394,0.010426,-0.055521,0.071217,-0.029075,-0.037698,-0.063065,0.068621
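
Each value in the table is the mean of one embedding dimension over the texts that fall inside the lookbehind window before the prediction time. The sketch below shows that computation for a single prediction time on made-up data. `lookbehind_mean` is a hypothetical helper, and the window boundaries (exclusive start, inclusive end) are an assumption; the library's exact boundary handling may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical values of a single embedding dimension for one entity.
values = pd.DataFrame(
    {
        "entity_id": [1, 1, 1],
        "timestamp": pd.to_datetime(["2019-06-01", "2020-03-01", "2020-11-01"]),
        "value": [0.2, 0.4, 0.6],
    }
)


def lookbehind_mean(
    values: pd.DataFrame,
    entity_id: int,
    prediction_time: pd.Timestamp,
    lookbehind_days: int,
    fallback: float = np.nan,
) -> float:
    """Mean of `value` within `lookbehind_days` before `prediction_time`, else `fallback`."""
    window_start = prediction_time - pd.Timedelta(days=lookbehind_days)
    in_window = values[
        (values["entity_id"] == entity_id)
        & (values["timestamp"] > window_start)
        & (values["timestamp"] <= prediction_time)
    ]
    return float(in_window["value"].mean()) if len(in_window) else fallback


prediction_time = pd.Timestamp("2021-01-01")
mean_365 = lookbehind_mean(values, 1, prediction_time, 365)  # last two rows
mean_730 = lookbehind_mean(values, 1, prediction_time, 730)  # all three rows
no_data = lookbehind_mean(values, 2, prediction_time, 365)  # fallback (NaN)
print(mean_365, mean_730, no_data)
```

The last call returns the fallback because entity 2 has no text in the window, which is the situation `dropna()` filters out in the cell above.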


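Notice that each column name is built from the name of the source dataframe (here `sent_emb_` plus the embedding dimension), the lookbehind window, the aggregation function, and the fallback. With an embedding whose column names are informative, such as a bag-of-words model where each column counts a word (e.g. and, for, in), those names would appear in the flattened column names instead of the dimension numbers.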
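
Because the output column names follow a regular pattern, they can be rebuilt programmatically, for example to select a specific feature. This is a small sketch based only on the names visible in the output above. `predictor_column_name` is a hypothetical helper, not a `timeseriesflattener` function.

```python
def predictor_column_name(
    feature_name: str, lookbehind_days: int, aggregation: str, fallback: str
) -> str:
    """Assemble a name following the pattern seen in the output columns above."""
    return (
        f"pred_{feature_name}_within_{lookbehind_days}_days_"
        f"{aggregation}_fallback_{fallback}"
    )


name = predictor_column_name("sent_emb_4", 365, "mean", "nan")
print(name)  # pred_sent_emb_4_within_365_days_mean_fallback_nan
```

With the flattened dataframe from this tutorial, `df[name]` would then select the 365-day mean of embedding dimension 4.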