# Lexical tokenization - TF\*IDF

Let's walk through a basic introduction to lexical search.

### Who you are:

An ML engineer comfortable enough with the Python data stack (pandas, numpy, etc.) who wants to understand traditional search engines (e.g. Elasticsearch).

### What this is

A run through of the core concepts behind lexical search.


## This notebook: BM25F

We [previously examined the basics of BM25](https://colab.research.google.com/drive/1RGNkq4SOZMvlFvpHq3IKgNJdCTlHqiek). But BM25 / TF\*IDF has a major problem once we search multiple fields: each field's document frequencies live in their own universe.

In [None]:
!pip install searcharray

from searcharray import SearchArray
import pandas as pd
import numpy as np

Collecting searcharray
  Downloading searcharray-0.0.72-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading searcharray-0.0.72-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.7 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.7/3.7 MB[0m [31m33.8 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: searcharray
Successfully installed searcharray-0.0.72


## Tokenize and index

This time we add back multiple fields, and we'll see a problem with naively combining field scores by summing them or taking their max (dismax).

In [None]:
from string import punctuation


def better_tokenize(text):
    lowercased = text.lower()
    without_punctuation = lowercased.translate(str.maketrans('', '', punctuation))
    split = without_punctuation.split()
    return split


chat_transcript = [
  "Hi this is Doug, I have a complaint about the weather",
  "Doug, this is Tom, support for Earth's Climate, sorry to hear about your complaint, how can we help you doug?",
  "Tom, can I speak to your manager?",
  "Hi, this is Sue, Tom's boss. What can I do for you?",
  "I'd like to complain about the ski conditions in West Virginia",
  "Oh doug thats terrible, lets see what we can do.",
  "Thanks you guys are great.",
  "That's very sweet of you"

]

topics = [
    "bad weather complaint climate",
    "earth climate",
    "escalation support",
    "boss asks",
    "West Virginia ski",
    "doug",
    "grattitude",
    "sweet"

]

msgs = pd.DataFrame({"name": ["Doug", "Doug", "Tom", "Sue", "Doug", "Sue", "Doug", "Sue"],
                     "msg": chat_transcript,
                     "topics": topics})
msgs['msg_tokenized'] = SearchArray.index(msgs['msg'],
                                          tokenizer=better_tokenize)

msgs['topics_tokenized'] = SearchArray.index(msgs['topics'],
                                              tokenizer=better_tokenize)
msgs

2025-09-17 13:48:00,549 - searcharray.indexing - INFO - Indexing begins w/ 4 workers
2025-09-17 13:48:00,559 - searcharray.indexing - INFO - 0 Batch Start tokenization
2025-09-17 13:48:00,566 - searcharray.indexing - INFO - Tokenizing 8 documents
2025-09-17 13:48:00,574 - searcharray.indexing - INFO - Tokenization -- vstacking
2025-09-17 13:48:00,583 - searcharray.indexing - INFO - Tokenization -- DONE
2025-09-17 13:48:00,589 - searcharray.indexing - INFO - Inverting docs->terms
2025-09-17 13:48:00,601 - searcharray.indexing - INFO - Encoding positions to bit array
2025-09-17 13:48:00,603 - searcharray.indexing - INFO - Batch tokenization complete
2025-09-17 13:48:00,608 - searcharray.indexing - INFO - (main thread) Processing 1 batch results
2025-09-17 13:48:00,610 - searcharray.indexing - INFO - Indexing from tokenization complete
2025-09-17 13:48:00,615 - searcharray.indexing - INFO - Indexing begins w/ 4 workers
2025-09-17 13:48:00,621 - searcharray.indexing - INFO - 0 Batch Start tokenization
2025-09-17 13:48:00,623 - searcharray.indexing - INFO - Tokenizing 8 documents
2025-09-17 13:48:00,628 - searcharray.indexing - INFO - Tokenization -- vstacking
2025-09-17 13:48:00,630 - searcharray.indexing - INFO - Tokenization -- DONE
2025-09-17 13:48:00,634 - searcharray.indexing - INFO - Inverting docs->terms
2025-09-17 13:48:00,636 - searcharray.indexing - INFO - Encoding positions to bit array
2025-09-17 13:48:00,637 - searcharray.indexing - INFO - Batch tokenization complete
2025-09-17 13:48:00,639 - searcharray.indexing - INFO - (main thread) Processing 1 batch results
2025-09-17 13:48:00,644 - searcharray.indexing - INFO - Indexing from tokenization complete


Unnamed: 0,name,msg,topics,msg_tokenized,topics_tokenized
0,Doug,"Hi this is Doug, I have a complaint about the ...",bad weather complaint climate,"Terms({'doug', 'weather', 'a', 'about', 'i', '...","Terms({'climate', 'complaint', 'bad', 'weather'})"
1,Doug,"Doug, this is Tom, support for Earth's Climate...",earth climate,"Terms({'doug', 'help', 'tom', 'your', 'how', '...","Terms({'earth', 'climate'})"
2,Tom,"Tom, can I speak to your manager?",escalation support,"Terms({'to', 'can', 'speak', 'tom', 'your', 'i...","Terms({'escalation', 'support'})"
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...",boss asks,"Terms({'boss', 'for', 'can', 'toms', 'you', 'i...","Terms({'boss', 'asks'})"
4,Doug,I'd like to complain about the ski conditions ...,West Virginia ski,"Terms({'virginia', 'conditions', 'to', 'like',...","Terms({'ski', 'west', 'virginia'})"
5,Sue,"Oh doug thats terrible, lets see what we can do.",doug,"Terms({'doug', 'thats', 'oh', 'can', 'terrible...",Terms({'doug'})
6,Doug,Thanks you guys are great.,grattitude,"Terms({'thanks', 'are', 'great', 'guys', 'you'})",Terms({'grattitude'})
7,Sue,That's very sweet of you,sweet,"Terms({'thats', 'you', 'sweet', 'very', 'of'})",Terms({'sweet'})


## Search each field

For each query term, we compute a BM25 score in each field and keep the best (max) field score. Then we sum those best scores across the query terms.

In [None]:
QUERY = "doug complaint"
query_tokenized = better_tokenize(QUERY)
from searcharray.similarity import compute_idf

# ACCUMULATE SCORES
scores = np.zeros(len(msgs))
for query_token in query_tokenized:
    # Score of each term
    impactA = msgs['msg_tokenized'].array.score(query_token)
    impactB = msgs['topics_tokenized'].array.score(query_token)

    scores += np.maximum(impactA, impactB)

msgs['score'] = scores

msgs.sort_values('score', ascending=False)

Unnamed: 0,name,msg,topics,msg_tokenized,topics_tokenized,score
5,Sue,"Oh doug thats terrible, lets see what we can do.",doug,"Terms({'doug', 'thats', 'oh', 'can', 'terrible...",Terms({'doug'}),1.023863
0,Doug,"Hi this is Doug, I have a complaint about the ...",bad weather complaint climate,"Terms({'doug', 'weather', 'a', 'about', 'i', '...","Terms({'climate', 'complaint', 'bad', 'weather'})",0.992629
1,Doug,"Doug, this is Tom, support for Earth's Climate...",earth climate,"Terms({'doug', 'help', 'tom', 'your', 'how', '...","Terms({'earth', 'climate'})",0.879412
2,Tom,"Tom, can I speak to your manager?",escalation support,"Terms({'to', 'can', 'speak', 'tom', 'your', 'i...","Terms({'escalation', 'support'})",0.0
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...",boss asks,"Terms({'boss', 'for', 'can', 'toms', 'you', 'i...","Terms({'boss', 'asks'})",0.0
4,Doug,I'd like to complain about the ski conditions ...,West Virginia ski,"Terms({'virginia', 'conditions', 'to', 'like',...","Terms({'ski', 'west', 'virginia'})",0.0
6,Doug,Thanks you guys are great.,grattitude,"Terms({'thanks', 'are', 'great', 'guys', 'you'})",Terms({'grattitude'}),0.0
7,Sue,That's very sweet of you,sweet,"Terms({'thats', 'you', 'sweet', 'very', 'of'})",Terms({'sweet'}),0.0


## Problem: single field match too strong

The match for `doug` in `topics` is too strong. As you can see below, it gets a very high score.

But this doesn't match the "natural" specificity of the term `doug`.

Remember, document frequency reflects how rare/special a term is. But `doug` being rare in one field is not a reflection of its true "specialness". It only tells us the term happens to be rare in that one field.
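To make the per-field rarity concrete, here is a minimal sketch in plain Python that counts the document frequency of `doug` in each field. The helper names `tokenize` and `doc_freq` are hypothetical (not part of searcharray); the data is the same chat transcript and topics used above.

```python
from string import punctuation


def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace."""
    return text.lower().translate(str.maketrans("", "", punctuation)).split()


def doc_freq(docs, term):
    """Number of documents containing the term at least once."""
    return sum(term in tokenize(doc) for doc in docs)


msg_field = [
    "Hi this is Doug, I have a complaint about the weather",
    "Doug, this is Tom, support for Earth's Climate, sorry to hear about your complaint, how can we help you doug?",
    "Tom, can I speak to your manager?",
    "Hi, this is Sue, Tom's boss. What can I do for you?",
    "I'd like to complain about the ski conditions in West Virginia",
    "Oh doug thats terrible, lets see what we can do.",
    "Thanks you guys are great.",
    "That's very sweet of you",
]

topics_field = [
    "bad weather complaint climate",
    "earth climate",
    "escalation support",
    "boss asks",
    "West Virginia ski",
    "doug",
    "grattitude",
    "sweet",
]

df_msg = doc_freq(msg_field, "doug")
df_topics = doc_freq(topics_field, "doug")
print(f"doc freq of 'doug' in msg: {df_msg}, in topics: {df_topics}")
```

The same term shows up in 3 of 8 `msg` values but only 1 of 8 `topics` values, so any IDF computed per field treats `doug` as much rarer in `topics`.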

In [None]:
print(f"`doug` matches in `topics` field (very high!): {msgs['topics_tokenized'].array.score('doug')}")
print(f"`doug` matches in `msg` field (much lower): {msgs['msg_tokenized'].array.score('doug')}")

`doug` matches in `topics` field (very high!): [0.        0.        0.        0.        0.        1.0238626 0.
 0.       ]
`doug` matches in `msg` field (much lower): [0.4146417  0.46322364 0.         0.         0.         0.43147987
 0.         0.        ]


## Field blending

Idea: we need to measure term specificity independent of fields. So document frequency isn't about one field. Instead of:

```
score = msgs.TF*IDF + topics.TF*IDF
```

Could we instead do:

```
score = (msgs.TF + topics.TF) * combinedIDF
```

This is the core concept behind BM25F.
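As a rough illustration of the difference, the sketch below scores `doug` both ways with numpy. It uses `1 / df` as a stand-in for IDF and takes the larger of the two field document frequencies as the combined one. Both are simplifying assumptions for illustration, and the variable names are hypothetical.

```python
import numpy as np

# Term frequency of `doug` in each of the 8 documents, per field
msg_tf = np.array([1, 2, 0, 0, 0, 1, 0, 0], dtype=float)
topics_tf = np.array([0, 0, 0, 0, 0, 1, 0, 0], dtype=float)

# Document frequency of `doug` in each field
msg_df, topics_df = 3, 1

# Per-field scoring: each field uses its own document frequency
per_field = msg_tf / msg_df + topics_tf / topics_df

# Blended scoring: sum the term frequencies, then apply one shared doc freq
combined_df = max(msg_df, topics_df)
blended = (msg_tf + topics_tf) / combined_df

print("per-field:", per_field.round(3))
print("blended:  ", blended.round(3))
```

With per-field scoring, document 5 dominates purely because `doug` is rare in `topics`. With the blended version, document 5 and document 1 (which mentions `doug` twice in `msg`) end up on equal footing.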

### Just blending doc frequencies

First we will just combine doc frequencies by taking the max document frequency before we compute a raw idf.

In our case this largely solves the problem.

In [None]:
from searcharray.similarity import Similarity

def just_tfs(term_freqs: np.ndarray,        # TF array of every doc in the index
             doc_freqs: np.ndarray,         # Doc freq array of every term (> 1 if a phrase)
             doc_lens: np.ndarray,          # Every documents length (same shape as TF)
             avg_doc_lens: int,             # avg doc length of corpus
             num_docs: int) -> np.ndarray:     # total number of docs in corpus

    return term_freqs


In [None]:
QUERY = "doug complaint"
query_tokenized = better_tokenize(QUERY)

scores = np.zeros(len(msgs))
for query_token in query_tokenized:
    # Score of each term
    impactA = msgs['msg_tokenized'].array.score(query_token,
                                                similarity=just_tfs)
    impactB = msgs['topics_tokenized'].array.score(query_token,
                                                   similarity=just_tfs)

    # Take doc freq as the max of this term's doc freq across fields
    docFreq = max(msgs['msg_tokenized'].array.docfreq(query_token), msgs['topics_tokenized'].array.docfreq(query_token))

    blended = (impactA + impactB) / docFreq
    print(f"Term '{query_token}' impactA: {impactA}")
    print(f"Term '{query_token}' impactB: {impactB}")
    print(f"Term '{query_token}' docFreq: {docFreq}")

    scores += blended

msgs['score'] = scores
msgs.sort_values('score', ascending=False)

Term 'doug' impactA: [1. 2. 0. 0. 0. 1. 0. 0.]
Term 'doug' impactB: [0. 0. 0. 0. 0. 1. 0. 0.]
Term 'doug' docFreq: 3
Term 'complaint' impactA: [1. 1. 0. 0. 0. 0. 0. 0.]
Term 'complaint' impactB: [1. 0. 0. 0. 0. 0. 0. 0.]
Term 'complaint' docFreq: 2


Unnamed: 0,name,msg,topics,msg_tokenized,topics_tokenized,score
0,Doug,"Hi this is Doug, I have a complaint about the ...",bad weather complaint climate,"Terms({'doug', 'weather', 'a', 'about', 'i', '...","Terms({'climate', 'complaint', 'bad', 'weather'})",1.333333
1,Doug,"Doug, this is Tom, support for Earth's Climate...",earth climate,"Terms({'doug', 'help', 'tom', 'your', 'how', '...","Terms({'earth', 'climate'})",1.166667
5,Sue,"Oh doug thats terrible, lets see what we can do.",doug,"Terms({'doug', 'thats', 'oh', 'can', 'terrible...",Terms({'doug'}),0.666667
2,Tom,"Tom, can I speak to your manager?",escalation support,"Terms({'to', 'can', 'speak', 'tom', 'your', 'i...","Terms({'escalation', 'support'})",0.0
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...",boss asks,"Terms({'boss', 'for', 'can', 'toms', 'you', 'i...","Terms({'boss', 'asks'})",0.0
4,Doug,I'd like to complain about the ski conditions ...,West Virginia ski,"Terms({'virginia', 'conditions', 'to', 'like',...","Terms({'ski', 'west', 'virginia'})",0.0
6,Doug,Thanks you guys are great.,grattitude,"Terms({'thanks', 'are', 'great', 'guys', 'you'})",Terms({'grattitude'}),0.0
7,Sue,That's very sweet of you,sweet,"Terms({'thats', 'you', 'sweet', 'very', 'of'})",Terms({'sweet'}),0.0


## Term freq attenuated by length

In BM25F, we don't take raw term frequency. We take term frequency relative to document length. As document length increases, a single term occurrence matters less.

Below, `bm25_impact` captures this, with the `b` parameter controlling how much document length lessens the term frequency.
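A small worked example may help. The sketch below applies the same length normalization to a single occurrence of a term in each `topics` value. The function name `length_normalized_tf` is hypothetical, and the token counts are tallied by hand from the `topics` list above.

```python
import numpy as np


def length_normalized_tf(term_freqs, doc_lens, avg_doc_len, b=0.8):
    """Divide term frequency by a length penalty; b=0 disables the penalty."""
    return term_freqs / (1 - b + b * doc_lens / avg_doc_len)


# Token counts of the 8 `topics` values (e.g. "bad weather complaint climate" -> 4)
topics_lens = np.array([4, 2, 2, 2, 3, 1, 1, 1], dtype=float)
avg_len = topics_lens.mean()

# Pretend the term occurs exactly once in every document
tf = np.ones_like(topics_lens)

normalized = length_normalized_tf(tf, topics_lens, avg_len)
for length, value in zip(topics_lens, normalized):
    print(f"doc length {int(length)} -> normalized tf {value:.4f}")
```

A one-token field is shorter than average, so its single occurrence counts for more than 1. A four-token field is longer than average, so the same occurrence counts for less.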

In [None]:
from searcharray.similarity import Similarity

# BM25 params
b = 0.8
k1 = 1.1

def bm25_impact(term_freqs: np.ndarray,        # TF array of every doc in the index
                doc_freqs: np.ndarray,         # Doc freq array of every term (> 1 if a phrase)
                doc_lens: np.ndarray,          # Every documents length (same shape as TF)
                avg_doc_lens: int,             # avg doc length of corpus
                num_docs: int) -> np.ndarray:     # total number of docs in corpus

    return (term_freqs) / (1 - b + b * doc_lens / avg_doc_lens)



In [None]:
QUERY = "doug complaint"
query_tokenized = better_tokenize(QUERY)
from searcharray.similarity import compute_idf

# ACCUMULATE SCORES
scores = np.zeros(len(msgs))
for query_token in query_tokenized:
    # Score of each term
    impactA = msgs['msg_tokenized'].array.score(query_token,
                                                similarity=bm25_impact)
    impactB = msgs['topics_tokenized'].array.score(query_token,
                                                   similarity=bm25_impact)
    docFreq = max(msgs['msg_tokenized'].array.docfreq(query_token), msgs['topics_tokenized'].array.docfreq(query_token))
    print(f"Term '{query_token}' impactA: {impactA}")
    print(f"Term '{query_token}' impactB: {impactB}")
    print(f"Term '{query_token}' docFreq: {docFreq}")

    impact = (impactA + impactB) / docFreq

    print(f"Term '{query_token}' score: {impact}")

    scores += impact

msgs['score'] = scores
msgs.sort_values('score', ascending=False)

Term 'doug' impactA: [0.93533486 1.1234397  0.         0.         0.         1.0099751
 0.         0.        ]
Term 'doug' impactB: [0.        0.        0.        0.        0.        1.6666666 0.
 0.       ]
Term 'doug' docFreq: 3
Term 'doug' score: [0.31177829 0.37447989 0.         0.         0.         0.8922139
 0.         0.        ]
Term 'complaint' impactA: [0.93533486 0.56171983 0.         0.         0.         0.
 0.         0.        ]
Term 'complaint' impactB: [0.5555555 0.        0.        0.        0.        0.        0.
 0.       ]
Term 'complaint' docFreq: 2
Term 'complaint' score: [0.74544519 0.28085992 0.         0.         0.         0.
 0.         0.        ]


Unnamed: 0,name,msg,topics,msg_tokenized,topics_tokenized,score
0,Doug,"Hi this is Doug, I have a complaint about the ...",bad weather complaint climate,"Terms({'doug', 'weather', 'a', 'about', 'i', '...","Terms({'climate', 'complaint', 'bad', 'weather'})",1.057223
5,Sue,"Oh doug thats terrible, lets see what we can do.",doug,"Terms({'doug', 'thats', 'oh', 'can', 'terrible...",Terms({'doug'}),0.892214
1,Doug,"Doug, this is Tom, support for Earth's Climate...",earth climate,"Terms({'doug', 'help', 'tom', 'your', 'how', '...","Terms({'earth', 'climate'})",0.65534
2,Tom,"Tom, can I speak to your manager?",escalation support,"Terms({'to', 'can', 'speak', 'tom', 'your', 'i...","Terms({'escalation', 'support'})",0.0
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...",boss asks,"Terms({'boss', 'for', 'can', 'toms', 'you', 'i...","Terms({'boss', 'asks'})",0.0
4,Doug,I'd like to complain about the ski conditions ...,West Virginia ski,"Terms({'virginia', 'conditions', 'to', 'like',...","Terms({'ski', 'west', 'virginia'})",0.0
6,Doug,Thanks you guys are great.,grattitude,"Terms({'thanks', 'are', 'great', 'guys', 'you'})",Terms({'grattitude'}),0.0
7,Sue,That's very sweet of you,sweet,"Terms({'thats', 'you', 'sweet', 'very', 'of'})",Terms({'sweet'}),0.0


## BM25 term freq saturation

Building on the above, recall that BM25 saturates term frequency toward an asymptote rather than using it raw. The form `tf / (tf + k1)` approaches 1 as `tf` grows.

So below we treat our combined, length-normalized impacts as if they were a single term frequency and saturate that.
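To see the asymptote, here is a minimal sketch that applies `tf / (tf + k1)` to a range of term frequencies. The function name `saturate` is hypothetical; `k1 = 1.1` matches the value set earlier in this notebook.

```python
import numpy as np


def saturate(tf, k1=1.1):
    """BM25-style saturation: grows with tf but never reaches 1."""
    return tf / (tf + k1)


tfs = np.array([0, 1, 2, 5, 10, 100], dtype=float)
saturated = saturate(tfs)

for tf, value in zip(tfs, saturated):
    print(f"tf={int(tf):>3} -> {value:.4f}")
```

Going from 1 to 2 occurrences adds a lot, while going from 10 to 100 adds very little. A smaller `k1` makes the curve flatten sooner.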

In [None]:
QUERY = "doug complaint"
query_tokenized = better_tokenize(QUERY)
from searcharray.similarity import compute_idf

# ACCUMULATE SCORES
scores = np.zeros(len(msgs))
for query_token in query_tokenized:
    # Score of each term
    impactA = msgs['msg_tokenized'].array.score(query_token,
                                                similarity=bm25_impact)
    impactB = msgs['topics_tokenized'].array.score(query_token,
                                                   similarity=bm25_impact)
    docFreq = max(msgs['msg_tokenized'].array.docfreq(query_token), msgs['topics_tokenized'].array.docfreq(query_token))
    print(f"Term '{query_token}' impactA: {impactA}")
    print(f"Term '{query_token}' impactB: {impactB}")
    print(f"Term '{query_token}' docFreq: {docFreq}")

    impact = (impactA + impactB)

    # ADDED, saturation
    impact = impact / (impact + k1)
    impact = impact / docFreq

    print(f"Term '{query_token}' score: {impact}")

    scores += impact

msgs['score'] = scores
msgs.sort_values('score', ascending=False)

Term 'doug' impactA: [0.93533486 1.1234397  0.         0.         0.         1.0099751
 0.         0.        ]
Term 'doug' impactB: [0.        0.        0.        0.        0.        1.6666666 0.
 0.       ]
Term 'doug' docFreq: 3
Term 'doug' score: [0.1531828  0.16842367 0.         0.         0.         0.23624529
 0.         0.        ]
Term 'complaint' impactA: [0.93533486 0.56171983 0.         0.         0.         0.
 0.         0.        ]
Term 'complaint' impactB: [0.5555555 0.        0.        0.        0.        0.        0.
 0.       ]
Term 'complaint' docFreq: 2
Term 'complaint' score: [0.28771776 0.16901761 0.         0.         0.         0.
 0.         0.        ]


Unnamed: 0,name,msg,topics,msg_tokenized,topics_tokenized,score
0,Doug,"Hi this is Doug, I have a complaint about the ...",bad weather complaint climate,"Terms({'doug', 'weather', 'a', 'about', 'i', '...","Terms({'climate', 'complaint', 'bad', 'weather'})",0.440901
1,Doug,"Doug, this is Tom, support for Earth's Climate...",earth climate,"Terms({'doug', 'help', 'tom', 'your', 'how', '...","Terms({'earth', 'climate'})",0.337441
5,Sue,"Oh doug thats terrible, lets see what we can do.",doug,"Terms({'doug', 'thats', 'oh', 'can', 'terrible...",Terms({'doug'}),0.236245
2,Tom,"Tom, can I speak to your manager?",escalation support,"Terms({'to', 'can', 'speak', 'tom', 'your', 'i...","Terms({'escalation', 'support'})",0.0
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...",boss asks,"Terms({'boss', 'for', 'can', 'toms', 'you', 'i...","Terms({'boss', 'asks'})",0.0
4,Doug,I'd like to complain about the ski conditions ...,West Virginia ski,"Terms({'virginia', 'conditions', 'to', 'like',...","Terms({'ski', 'west', 'virginia'})",0.0
6,Doug,Thanks you guys are great.,grattitude,"Terms({'thanks', 'are', 'great', 'guys', 'you'})",Terms({'grattitude'}),0.0
7,Sue,That's very sweet of you,sweet,"Terms({'thats', 'you', 'sweet', 'very', 'of'})",Terms({'sweet'}),0.0


## Full BM25F

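We're almost at full BM25F. The last change is to swap the raw `1 / df` for the BM25 inverse document frequency, which falls off logarithmically as a term gets more common (roughly `log(N / df)`, where `N` is the number of documents).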
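For a feel of how the two weightings differ, the sketch below compares raw `1 / df` against a Lucene-style BM25 IDF, `log(1 + (N - df + 0.5) / (df + 0.5))`, for our 8-document corpus. The function name `bm25_idf` is hypothetical, and the assumption that `compute_idf` uses this same formula is an inference from the scores printed in this notebook, not a statement about the library's internals.

```python
import numpy as np


def bm25_idf(num_docs, doc_freq):
    """Lucene-style BM25 IDF (assumed formula, see the note above)."""
    return np.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


num_docs = 8
doc_freqs = np.arange(1, num_docs + 1)

raw_idf = 1 / doc_freqs
log_idf = bm25_idf(num_docs, doc_freqs)

for df, raw, logged in zip(doc_freqs, raw_idf, log_idf):
    print(f"df={df}: 1/df={raw:.3f}  bm25 idf={logged:.3f}")
```

Both weights shrink as document frequency rises, but the logarithmic version shrinks more gently between rare and moderately common terms.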

In [None]:
QUERY = "doug complaint"
query_tokenized = better_tokenize(QUERY)
from searcharray.similarity import compute_idf

# ACCUMULATE SCORES
scores = np.zeros(len(msgs))
for query_token in query_tokenized:
    # Score of each term
    impactA = msgs['msg_tokenized'].array.score(query_token,
                                                similarity=bm25_impact)
    impactB = msgs['topics_tokenized'].array.score(query_token,
                                                   similarity=bm25_impact)
    docFreq = max(msgs['msg_tokenized'].array.docfreq(query_token), msgs['topics_tokenized'].array.docfreq(query_token))
    print(f"Term '{query_token}' impactA: {impactA}")
    print(f"Term '{query_token}' impactB: {impactB}")
    print(f"Term '{query_token}' docFreq: {docFreq}")

    impact = (impactA + impactB)
    impact = impact / (impact + k1)
    # ADDED IDF
    idf = compute_idf(len(msgs), docFreq)
    impact = impact * idf

    print(f"Term '{query_token}' score: {impact}")

    scores += impact

msgs['score'] = scores
msgs.sort_values('score', ascending=False)

Term 'doug' impactA: [0.93533486 1.1234397  0.         0.         0.         1.0099751
 0.         0.        ]
Term 'doug' impactB: [0.        0.        0.        0.        0.        1.6666666 0.
 0.       ]
Term 'doug' docFreq: 3
Term 'doug' score: [0.43402583 0.47720908 0.         0.         0.         0.66937383
 0.         0.        ]
Term 'complaint' impactA: [0.93533486 0.56171983 0.         0.         0.         0.
 0.         0.        ]
Term 'complaint' impactB: [0.5555555 0.        0.        0.        0.        0.        0.
 0.       ]
Term 'complaint' docFreq: 2
Term 'complaint' score: [0.73709483 0.43300076 0.         0.         0.         0.
 0.         0.        ]


Unnamed: 0,name,msg,topics,msg_tokenized,topics_tokenized,score
0,Doug,"Hi this is Doug, I have a complaint about the ...",bad weather complaint climate,"Terms({'doug', 'weather', 'a', 'about', 'i', '...","Terms({'climate', 'complaint', 'bad', 'weather'})",1.171121
1,Doug,"Doug, this is Tom, support for Earth's Climate...",earth climate,"Terms({'doug', 'help', 'tom', 'your', 'how', '...","Terms({'earth', 'climate'})",0.91021
5,Sue,"Oh doug thats terrible, lets see what we can do.",doug,"Terms({'doug', 'thats', 'oh', 'can', 'terrible...",Terms({'doug'}),0.669374
2,Tom,"Tom, can I speak to your manager?",escalation support,"Terms({'to', 'can', 'speak', 'tom', 'your', 'i...","Terms({'escalation', 'support'})",0.0
3,Sue,"Hi, this is Sue, Tom's boss. What can I do for...",boss asks,"Terms({'boss', 'for', 'can', 'toms', 'you', 'i...","Terms({'boss', 'asks'})",0.0
4,Doug,I'd like to complain about the ski conditions ...,West Virginia ski,"Terms({'virginia', 'conditions', 'to', 'like',...","Terms({'ski', 'west', 'virginia'})",0.0
6,Doug,Thanks you guys are great.,grattitude,"Terms({'thanks', 'are', 'great', 'guys', 'you'})",Terms({'grattitude'}),0.0
7,Sue,That's very sweet of you,sweet,"Terms({'thats', 'you', 'sweet', 'very', 'of'})",Terms({'sweet'}),0.0
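
Putting the steps together, here is a compact numpy-only sketch of the per-term scoring built up in this notebook: length-normalize each field's term frequency, sum across fields, saturate, then multiply by a shared IDF. The function `bm25f_term_score` and the toy inputs are hypothetical. The IDF formula is the Lucene-style one, which is an assumption about what `compute_idf` does, and per-field weights (which fuller BM25F formulations include) are left out, as they are in the cells above.

```python
import numpy as np


def bm25f_term_score(field_tfs, field_lens, doc_freq, num_docs, b=0.8, k1=1.1):
    """Score one query term across several fields for every document.

    field_tfs / field_lens: one array per field, one entry per document.
    doc_freq: a single blended document frequency for the term.
    """
    combined = np.zeros(len(field_tfs[0]), dtype=float)
    for tf, lens in zip(field_tfs, field_lens):
        tf = np.asarray(tf, dtype=float)
        lens = np.asarray(lens, dtype=float)
        # Length-normalized term frequency for this field
        combined += tf / (1 - b + b * lens / lens.mean())

    # Saturate the combined frequency as if it were a single tf
    saturated = combined / (combined + k1)

    # Lucene-style BM25 IDF on the blended doc freq (assumed formula)
    idf = np.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    return saturated * idf


# Toy corpus: 3 documents, 2 fields, every field value the same length
tf_a, len_a = [1, 2, 0], [4, 4, 4]
tf_b, len_b = [0, 0, 1], [2, 2, 2]

# Blend doc freqs by taking the max across fields: max(2, 1) = 2
scores = bm25f_term_score([tf_a, tf_b], [len_a, len_b], doc_freq=2, num_docs=3)
print(scores)
```

Because document 2 matches only in the second field, where the term is rare, a per-field IDF would have boosted it. Here it scores the same as document 0, which has a single match in the first field.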
