# Adding text features

So far, the tutorials have dealt with _tabular_ data only. This tutorial shows you how to make predictors out of text features, such as clinical notes, within `timeseriesflattener`.

Specifically, this tutorial will cover:

1. How to featurize text using Huggingface or scikit-learn models.
2. How to write your own text embedding function for use in `timeseriesflattener`.

To use the features in this tutorial you'll need to install some extra dependencies. These can be installed by running:
```
pip install torch transformers sentence-transformers
```
or by installing `timeseriesflattener` with the text dependencies.
```
pip install "timeseriesflattener[text]"
```

## The dataset

To start out, let's load a synthetic dataset containing text. As with all other features, each row in the dataset needs an ID, a timestamp, and the feature value. 

In [1]:
from timeseriesflattener.testing.load_synth_data import load_synth_text

In [2]:
load_synth_text().head()

|   | entity_id | timestamp | text |
|---|-----------|-----------|------|
| 0 | 4647 | 1967-07-19 00:22:00 | The patient went into a medically induced coma... |
| 1 | 2007 | 1966-11-25 02:02:00 | The patient is taken to the emergency departme... |
| 2 | 5799 | 1967-09-19 12:31:00 | The patient, described as a 7-month old son wh... |
| 3 | 1319 | 1969-07-21 23:16:00 | The patient had been left on a bed for 20 minu... |
| 4 | 4234 | 1966-04-14 22:04:00 | The patient had had some severe allergies but ... |
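
Your own text data needs the same shape. The sketch below builds a small dataframe by hand. The rows and the `load_my_notes` name are made up for illustration, and the column names simply mirror the synthetic dataset above.

```python
import pandas as pd


def load_my_notes() -> pd.DataFrame:
    """Hypothetical loader returning one row per clinical note."""
    df = pd.DataFrame(
        {
            "entity_id": [1, 1, 2],
            "timestamp": ["2020-01-01 08:00", "2020-03-15 14:30", "2020-02-10 09:45"],
            "text": [
                "Patient admitted with chest pain.",
                "Follow-up visit, symptoms resolved.",
                "Patient reports mild headache.",
            ],
        }
    )
    # Store timestamps as datetimes rather than strings so they can be
    # compared against prediction times.
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    return df


notes = load_my_notes()
```

A function like this should be usable wherever `load_synth_text` is used in the rest of this tutorial, assuming your column names match.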


## TextPredictorSpec

The main difference when specifying text predictors compared to tabular predictors is the `Spec` you define. For text, we need a `TextPredictorSpec`, which is identical to the `PredictorSpec` except for two additional attributes: `embedding_fn` and `embedding_fn_kwargs`.

`embedding_fn` should be a callable that takes a pandas Series containing text and converts it to a pandas DataFrame with a column for each feature. `embedding_fn_kwargs` are simply optional keyword arguments that will be passed to the embedding function, such as a Huggingface model name.
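
To make that contract concrete, here is a deliberately trivial embedding function. It is only a sketch: `text_length_embedding` and its `prefix` argument are hypothetical names, and the "embedding" is just two counts per text.

```python
import pandas as pd


def text_length_embedding(text_series: pd.Series, prefix: str = "len") -> pd.DataFrame:
    """Toy embedding: describe each text by its character and word counts."""
    return pd.DataFrame(
        {
            # One column per feature, one row per input text.
            f"{prefix}_n_chars": text_series.str.len(),
            f"{prefix}_n_words": text_series.str.split().str.len(),
        }
    )


texts = pd.Series(["The patient slept.", "No complaints today, discharged."])
features = text_length_embedding(texts, prefix="note")
```

In a spec, this would correspond to `embedding_fn=text_length_embedding` and `embedding_fn_kwargs={"prefix": "note"}`.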

Not all `resolve_multiple_fn` options are meaningful for text, as we can't do numerical operations on text. Instead, `TextPredictorSpec` defaults to the "concatenate" option, which simply concatenates all texts within the lookbehind window. Other options that work for text are "latest" and "earliest".
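
The following sketch illustrates what the three options mean for a single prediction time. It is not the library's implementation: the window boundaries and the space used to join texts are assumptions made for the example.

```python
import pandas as pd

notes = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2020-01-01", "2020-06-01", "2021-12-01"]),
        "text": ["First note.", "Second note.", "Third note."],
    }
)

prediction_time = pd.Timestamp("2022-01-01")
lookbehind = pd.Timedelta(days=730)

# Keep only notes written within the lookbehind window before the
# prediction time (boundary handling here is an assumption).
in_window = notes[
    (notes["timestamp"] > prediction_time - lookbehind)
    & (notes["timestamp"] <= prediction_time)
].sort_values("timestamp")

concatenated = " ".join(in_window["text"])  # "concatenate"
latest = in_window["text"].iloc[-1]  # "latest"
earliest = in_window["text"].iloc[0]  # "earliest"
```

Here the first note falls just outside the 730-day window, so only the second and third notes contribute.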


### Featurization using sentence-transformers
Let's specify a `TextPredictorSpec` using a [sentence-transformers](https://www.sbert.net/) model. `timeseriesflattener` includes functions that make it easy to featurize text using either sentence-transformers or any text model from the [Huggingface Hub](https://huggingface.co/). 

The `sentence_transformers_embedding` function is recommended for sentence-transformers models. If you want to use another type of model from the Huggingface Hub, we recommend the `huggingface_embedding` function, which has the same interface as `sentence_transformers_embedding`.

Note that both `huggingface_embedding` and `sentence_transformers_embedding` truncate the input to the maximum sequence length allowed by the model. To featurize longer blocks of text, either use the `sklearn_embedding` function or write your own embedding function (see below).
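
To get a feel for what truncation costs, the sketch below uses whitespace-separated words as a rough stand-in for tokens. Real models use subword tokenizers, so actual counts will differ, and `max_tokens=128` is an arbitrary example limit rather than a value taken from any particular model.

```python
def truncate_words(text: str, max_tokens: int) -> tuple[str, int]:
    """Keep the first `max_tokens` words and report how many were dropped."""
    words = text.split()
    kept = words[:max_tokens]
    return " ".join(kept), len(words) - len(kept)


# A made-up 300-word note: everything past the limit is ignored.
long_note = " ".join(f"word{i}" for i in range(300))
kept_text, n_dropped = truncate_words(long_note, max_tokens=128)
```

With concatenated notes spanning a long lookbehind, a large share of the text can end up in the dropped part.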

In [3]:
from timeseriesflattener.text_embedding_functions import (
    sentence_transformers_embedding,
    huggingface_embedding,
)
from timeseriesflattener import TextPredictorSpec
import numpy as np

In [4]:
text_spec = TextPredictorSpec(
    values_loader=load_synth_text,
    lookbehind_days=730,
    fallback=np.nan,
    resolve_multiple_fn="concatenate",
    feature_name="text-st",
    input_col_name_override="text",
    embedding_fn=sentence_transformers_embedding,
    embedding_fn_kwargs={"model_name": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"},
)

That's it. Let's make our features in the usual way.

In [5]:
from timeseriesflattener import TimeseriesFlattener
from timeseriesflattener.testing.load_synth_data import load_synth_prediction_times

In [6]:
ts_flattener = TimeseriesFlattener(
    prediction_times_df=load_synth_prediction_times(),
    entity_id_col_name="entity_id",
    timestamp_col_name="timestamp",
    n_workers=1,
    drop_pred_times_with_insufficient_look_distance=False,
)
ts_flattener.add_spec(text_spec)

In [7]:
df = ts_flattener.get_df()

2023-02-17 06:36:53 [INFO] There were unprocessed specs, computing...


2023-02-17 06:36:53 [INFO] Processing 1 temporal features in parallel with 1 workers. Chunksize is 1. If this is above 1, it may take some time for the progress bar to move, as processing is batched. However, this makes for much faster total performance.


  0%|          | 0/1 [00:00<?, ?it/s]

2023-02-17 06:36:53 [INFO] Load pretrained SentenceTransformer: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2


2023-02-17 06:37:05 [INFO] Use pytorch device: cpu


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

100%|██████████| 1/1 [00:13<00:00, 13.78s/it]


2023-02-17 06:37:07 [INFO] Checking alignment of dataframes - this might take a little while (~2 minutes for 1.000 dataframes with 2.000.000 rows).


2023-02-17 06:37:07 [INFO] Starting concatenation. Will take some time on performant systems, e.g. 30s for 100 features and 2_000_000 prediction times. This is normal.


2023-02-17 06:37:07 [INFO] Concatenation took 0.023 seconds


2023-02-17 06:37:07 [INFO] Merging with original df


Let's check the features.

In [8]:
df.head()

Unnamed: 0,entity_id,timestamp,prediction_time_uuid,pred_text-st_within_730_days_concatenate_fallback_nan,pred_text-st-1_within_730_days_concatenate_fallback_nan,pred_text-st-2_within_730_days_concatenate_fallback_nan,pred_text-st-3_within_730_days_concatenate_fallback_nan,pred_text-st-4_within_730_days_concatenate_fallback_nan,pred_text-st-5_within_730_days_concatenate_fallback_nan,pred_text-st-6_within_730_days_concatenate_fallback_nan,...,pred_text-st-374_within_730_days_concatenate_fallback_nan,pred_text-st-375_within_730_days_concatenate_fallback_nan,pred_text-st-376_within_730_days_concatenate_fallback_nan,pred_text-st-377_within_730_days_concatenate_fallback_nan,pred_text-st-378_within_730_days_concatenate_fallback_nan,pred_text-st-379_within_730_days_concatenate_fallback_nan,pred_text-st-380_within_730_days_concatenate_fallback_nan,pred_text-st-381_within_730_days_concatenate_fallback_nan,pred_text-st-382_within_730_days_concatenate_fallback_nan,pred_text-st-383_within_730_days_concatenate_fallback_nan
0,9903,1968-05-09 21:24:00,9903-1968-05-09-21-24-00,,,,,,,,...,,,,,,,,,,
1,7465,1966-05-24 01:23:00,7465-1966-05-24-01-23-00,,,,,,,,...,,,,,,,,,,
2,6447,1967-09-25 18:08:00,6447-1967-09-25-18-08-00,,,,,,,,...,,,,,,,,,,
3,2121,1966-05-05 20:52:00,2121-1966-05-05-20-52-00,,,,,,,,...,,,,,,,,,,
4,4927,1968-06-30 12:13:00,4927-1968-06-30-12-13-00,,,,,,,,...,,,,,,,,,,


Because the synthetic text data is much smaller than the prediction times data, there are a lot of NaNs. Let's subset to only see the prediction times that actually include text.

In [9]:
df_pred_times_with_text = df[~df["pred_text-st-1_within_730_days_concatenate_fallback_nan"].isna()]
df_pred_times_with_text.head()

Unnamed: 0,entity_id,timestamp,prediction_time_uuid,pred_text-st_within_730_days_concatenate_fallback_nan,pred_text-st-1_within_730_days_concatenate_fallback_nan,pred_text-st-2_within_730_days_concatenate_fallback_nan,pred_text-st-3_within_730_days_concatenate_fallback_nan,pred_text-st-4_within_730_days_concatenate_fallback_nan,pred_text-st-5_within_730_days_concatenate_fallback_nan,pred_text-st-6_within_730_days_concatenate_fallback_nan,...,pred_text-st-374_within_730_days_concatenate_fallback_nan,pred_text-st-375_within_730_days_concatenate_fallback_nan,pred_text-st-376_within_730_days_concatenate_fallback_nan,pred_text-st-377_within_730_days_concatenate_fallback_nan,pred_text-st-378_within_730_days_concatenate_fallback_nan,pred_text-st-379_within_730_days_concatenate_fallback_nan,pred_text-st-380_within_730_days_concatenate_fallback_nan,pred_text-st-381_within_730_days_concatenate_fallback_nan,pred_text-st-382_within_730_days_concatenate_fallback_nan,pred_text-st-383_within_730_days_concatenate_fallback_nan
244,7337,1966-06-28 10:34:00,7337-1966-06-28-10-34-00,-0.020497,0.201255,-0.187649,-0.240372,0.105265,0.004314,-0.151738,...,0.12119,0.381909,0.144212,-0.038955,-0.000123,0.15652,-0.196082,0.080771,0.025773,0.193602
755,8951,1969-12-22 16:32:00,8951-1969-12-22-16-32-00,0.06995,0.099192,-0.007804,0.033173,-0.044742,0.193883,-0.165403,...,-0.236144,0.051994,-0.029249,-0.08007,-0.105516,0.141162,0.008052,-0.315303,0.049577,-0.070045
896,2007,1968-10-15 14:12:00,2007-1968-10-15-14-12-00,0.048036,-0.050683,0.039954,0.03827,-0.208976,-0.26262,0.084824,...,-0.064675,0.065383,-0.03043,-0.047385,-0.037471,0.03116,-0.191298,0.291992,-0.154983,-0.043705
1517,1728,1968-05-29 12:27:00,1728-1968-05-29-12-27-00,0.042505,-0.009908,-0.091326,-0.005621,-0.085136,0.168317,0.037416,...,0.027059,0.319632,-0.044289,-0.042798,-0.136277,-0.118022,0.169292,-0.029509,0.26205,0.103499
1917,4977,1968-11-28 16:05:00,4977-1968-11-28-16-05-00,0.135539,0.035848,0.089571,0.098358,-0.066512,-0.051096,0.027141,...,-0.039122,0.246103,0.102282,0.031486,-0.336561,-0.110646,0.089151,0.233424,-0.021751,-0.0239


### Featurization using sklearn models

If you want to embed your text using an sklearn model, e.g. TF-IDF, this is also easy to do. First, train the sklearn model (e.g. `TfidfVectorizer`) on your dataset using the `.fit` method.

Now, to use your trained model in `timeseriesflattener`, simply use the `sklearn_embedding` function and supply the model as an embedding function keyword argument. 

In the following example we will use a simple CountVectorizer model, which has been pretrained on the synthetic data, to create the predictors.

In [10]:
from timeseriesflattener.text_embedding_functions import sklearn_embedding
from timeseriesflattener.testing.text_embedding_functions import _load_bow_model

# Despite the variable name, this is a bag-of-words CountVectorizer, not a TF-IDF model.
tfidf_model = _load_bow_model()
tfidf_model



In [11]:
count_vectorizer_text_spec = TextPredictorSpec(
    values_loader=load_synth_text,
    lookbehind_days=730,
    fallback=np.nan,
    resolve_multiple_fn="concatenate",
    feature_name="text-cv",
    input_col_name_override="text",
    embedding_fn=sklearn_embedding,
    embedding_fn_kwargs={"model": tfidf_model},
)

2023-02-17 06:37:07 [INFO] text-cv: Loading values


Let's add the feature to the dataset:

In [12]:
ts_flattener.add_spec(count_vectorizer_text_spec)
df = ts_flattener.get_df()

2023-02-17 06:37:07 [INFO] There were unprocessed specs, computing...


2023-02-17 06:37:07 [INFO] Processing 1 temporal features in parallel with 1 workers. Chunksize is 1. If this is above 1, it may take some time for the progress bar to move, as processing is batched. However, this makes for much faster total performance.


  0%|          | 0/1 [00:00<?, ?it/s]



100%|██████████| 1/1 [00:00<00:00,  6.70it/s]

100%|██████████| 1/1 [00:00<00:00,  6.64it/s]


2023-02-17 06:37:07 [INFO] Checking alignment of dataframes - this might take a little while (~2 minutes for 1.000 dataframes with 2.000.000 rows).


2023-02-17 06:37:07 [INFO] Starting concatenation. Will take some time on performant systems, e.g. 30s for 100 features and 2_000_000 prediction times. This is normal.


2023-02-17 06:37:07 [INFO] Concatenation took 0.005 seconds


2023-02-17 06:37:07 [INFO] Merging with original df


Let's subset to only see the prediction times that include text again.

In [13]:
df_pred_times_with_text = df[~df["pred_text-st-1_within_730_days_concatenate_fallback_nan"].isna()]
df_pred_times_with_text.head()

Unnamed: 0,entity_id,timestamp,prediction_time_uuid,pred_text-st_within_730_days_concatenate_fallback_nan,pred_text-st-1_within_730_days_concatenate_fallback_nan,pred_text-st-2_within_730_days_concatenate_fallback_nan,pred_text-st-3_within_730_days_concatenate_fallback_nan,pred_text-st-4_within_730_days_concatenate_fallback_nan,pred_text-st-5_within_730_days_concatenate_fallback_nan,pred_text-st-6_within_730_days_concatenate_fallback_nan,...,pred_text-cv-and_within_730_days_concatenate_fallback_nan,pred_text-cv-for_within_730_days_concatenate_fallback_nan,pred_text-cv-in_within_730_days_concatenate_fallback_nan,pred_text-cv-of_within_730_days_concatenate_fallback_nan,pred_text-cv-or_within_730_days_concatenate_fallback_nan,pred_text-cv-patient_within_730_days_concatenate_fallback_nan,pred_text-cv-that_within_730_days_concatenate_fallback_nan,pred_text-cv-the_within_730_days_concatenate_fallback_nan,pred_text-cv-to_within_730_days_concatenate_fallback_nan,pred_text-cv-was_within_730_days_concatenate_fallback_nan
244,7337,1966-06-28 10:34:00,7337-1966-06-28-10-34-00,-0.020497,0.201255,-0.187649,-0.240372,0.105265,0.004314,-0.151738,...,4.0,2.0,2.0,5.0,0.0,3.0,1.0,16.0,4.0,2.0
755,8951,1969-12-22 16:32:00,8951-1969-12-22-16-32-00,0.06995,0.099192,-0.007804,0.033173,-0.044742,0.193883,-0.165403,...,1.0,5.0,1.0,1.0,1.0,2.0,2.0,8.0,2.0,0.0
896,2007,1968-10-15 14:12:00,2007-1968-10-15-14-12-00,0.048036,-0.050683,0.039954,0.03827,-0.208976,-0.26262,0.084824,...,4.0,0.0,2.0,1.0,5.0,6.0,2.0,13.0,3.0,0.0
1517,1728,1968-05-29 12:27:00,1728-1968-05-29-12-27-00,0.042505,-0.009908,-0.091326,-0.005621,-0.085136,0.168317,0.037416,...,1.0,1.0,8.0,11.0,0.0,2.0,1.0,11.0,5.0,2.0
1917,4977,1968-11-28 16:05:00,4977-1968-11-28-16-05-00,0.135539,0.035848,0.089571,0.098358,-0.066512,-0.051096,0.027141,...,2.0,1.0,6.0,7.0,2.0,2.0,1.0,8.0,4.0,1.0


We can subset further to only include the features we created with the count vectorizer by subsetting to only include columns starting with the feature name ("text-cv").

In [14]:
df_cv_pred_times_with_text = df_pred_times_with_text.loc[:,df_pred_times_with_text.columns.str.startswith("pred_text-cv")]
df_cv_pred_times_with_text.head()

Unnamed: 0,pred_text-cv-and_within_730_days_concatenate_fallback_nan,pred_text-cv-for_within_730_days_concatenate_fallback_nan,pred_text-cv-in_within_730_days_concatenate_fallback_nan,pred_text-cv-of_within_730_days_concatenate_fallback_nan,pred_text-cv-or_within_730_days_concatenate_fallback_nan,pred_text-cv-patient_within_730_days_concatenate_fallback_nan,pred_text-cv-that_within_730_days_concatenate_fallback_nan,pred_text-cv-the_within_730_days_concatenate_fallback_nan,pred_text-cv-to_within_730_days_concatenate_fallback_nan,pred_text-cv-was_within_730_days_concatenate_fallback_nan
244,4.0,2.0,2.0,5.0,0.0,3.0,1.0,16.0,4.0,2.0
755,1.0,5.0,1.0,1.0,1.0,2.0,2.0,8.0,2.0,0.0
896,4.0,0.0,2.0,1.0,5.0,6.0,2.0,13.0,3.0,0.0
1517,1.0,1.0,8.0,11.0,0.0,2.0,1.0,11.0,5.0,2.0
1917,2.0,1.0,6.0,7.0,2.0,2.0,1.0,8.0,4.0,1.0


Notice that the column names tell you which word each feature counts (e.g. and, for, in). This is because `sklearn_embedding` uses the `.get_feature_names` method of the sklearn model (superseded by `.get_feature_names_out` as of scikit-learn 1.0) to set the column names.
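
If you need to map feature columns back to their words, for example when inspecting feature importances, a small parser can help. The naming pattern below is inferred from the column names in the output above, so treat it as an assumption; `parse_feature_column` is a hypothetical helper.

```python
import re

# Pattern inferred from the column names shown in this tutorial.
COLUMN_PATTERN = re.compile(
    r"^pred_(?P<feature>.+)_within_(?P<days>\d+)_days_"
    r"(?P<resolve>[a-z]+)_fallback_(?P<fallback>.+)$"
)


def parse_feature_column(column: str, feature_name: str) -> dict:
    """Split a flattened column name into its parts."""
    match = COLUMN_PATTERN.match(column)
    if match is None:
        raise ValueError(f"Unexpected column name: {column}")
    parts = match.groupdict()
    full_feature = parts.pop("feature")
    # Whatever follows "<feature_name>-" is the word (or embedding index).
    prefix = feature_name + "-"
    parts["token"] = full_feature[len(prefix):] if full_feature.startswith(prefix) else ""
    parts["days"] = int(parts["days"])
    return parts


parsed = parse_feature_column(
    "pred_text-cv-patient_within_730_days_concatenate_fallback_nan", "text-cv"
)
```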

## Writing your own text embedding function

If you want to write your own embedding function, you simply need to write a function that takes a `pd.Series` of text as the first input, accepts any number of optional keyword arguments, and returns a `pd.DataFrame` with a column for each feature. Let's write a small function to embed long texts using a Huggingface model. Note that this implementation will likely be quite slow.

In [15]:
from transformers import AutoTokenizer, AutoModel
import pandas as pd
import torch

In [16]:
def huggingface_long_text_embedding(
    text_series: pd.Series, model_name: str, chunk_length: int
) -> pd.DataFrame:
    """
    Embeds text using a HuggingFace model, splitting the text into chunks of a
    specified number of characters.

    Args:
        text_series: A pandas Series containing the text to be embedded.
        model_name: The name of the HuggingFace model to use.
        chunk_length: The number of characters to use in each chunk.

    Returns:
        A pandas DataFrame containing the embeddings.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    embeddings = []
    for text in text_series:
        text_chunks = [
            text[i : i + chunk_length] for i in range(0, len(text), chunk_length)
        ]
        tokenized = tokenizer(
            text_chunks, padding=True, truncation=True, return_tensors="pt"
        )
        with torch.no_grad():
            output = model(**tokenized)
        # take mean of all tokens in each chunk, then mean of all chunks
        embeddings.append(output[0].mean(axis=1).mean(axis=0).cpu().numpy())
    return pd.DataFrame(embeddings)
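
The two less obvious steps in this function are the character chunking and the mean-of-means pooling. The sketch below reproduces both on toy data, with random numbers standing in for the model output, so it runs without downloading a model. The shapes and sizes are arbitrary example values.

```python
import numpy as np

text = "abcdefghij"
chunk_length = 4

# Same slicing as in the function above; the last chunk may be shorter.
chunks = [text[i : i + chunk_length] for i in range(0, len(text), chunk_length)]

# Stand-in for the model output, shaped (n_chunks, n_tokens, hidden_size).
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(len(chunks), 5, 8))

# Mean over tokens gives one vector per chunk ...
chunk_embeddings = token_embeddings.mean(axis=1)
# ... and mean over chunks gives one vector per text.
text_embedding = chunk_embeddings.mean(axis=0)
```

Note that in the function above, padding tokens are included in the token mean and a short final chunk counts as much as a full one, so the pooled vector is an approximation.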


The function can now be used as an embedding function in a `TextPredictorSpec` and used in the same manner as usual.

In [17]:
huggingface_long_text_spec = TextPredictorSpec(
    values_loader=load_synth_text,
    lookbehind_days=730,
    fallback=np.nan,
    resolve_multiple_fn="concatenate",
    feature_name="text-hf-long",
    input_col_name_override="text",
    embedding_fn=huggingface_long_text_embedding,
    embedding_fn_kwargs={
        "model_name": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        "chunk_length": 256,
    },
)

2023-02-17 06:37:07 [INFO] text-hf-long: Loading values
