## Read File

First, we need to read the raw data from `sample_stream.jsonl`.<br/>
Because each line holds a single JSON object, we read the file line by line until the end.<br/>
For each line, we use `json.loads` to parse the object into a **Python** `dict` and append it to a list.

In [85]:
import json


# This list holds every comment.
# Each element is a dict parsed from a single JSON line.
streams: list[dict] = []

# Open the file in read-only mode with UTF-8 encoding.
# Then read it line by line and append each parsed object to the list.
with open("../data/sample_stream.jsonl", "r", encoding="utf-8") as file:
    for line in file:
        streams.append(json.loads(line))

# Print each comment.
for i in streams:
    print(i)

{'timestamp': '2025-05-01T10:00:00', 'text': "I love this product! It's absolutely fantastic."}
{'timestamp': '2025-05-01T10:05:00', 'text': 'Not what I expected. Pretty disappointed.'}
{'timestamp': '2025-05-01T10:10:00', 'text': "It's okay, does the job. Nothing special though."}
{'timestamp': '2025-05-01T10:15:00', 'text': 'Terrible experience. Would not recommend!'}
{'timestamp': '2025-05-01T10:20:00', 'text': 'Absolutely brilliant. Exceeded my expectations.'}
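
The loop above assumes that every line is a valid JSON object. As a minimal sketch of a more defensive reader, the hypothetical helper `read_jsonl` below (not part of the original notebook) skips blank lines and reports the number of a malformed line. It assumes the same one-object-per-line layout.

```python
import json
from pathlib import Path


def read_jsonl(path: str) -> list[dict]:
    """Parse a JSON Lines file, skipping blank lines (hypothetical helper)."""
    records: list[dict] = []
    with Path(path).open("r", encoding="utf-8") as file:
        for line_number, line in enumerate(file, start=1):
            line = line.strip()
            if not line:
                # A blank line (for example a trailing newline) is not a JSON object.
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as error:
                # Report which line is malformed instead of a bare decode error.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from error
    return records
```

In this notebook it could be called as `streams = read_jsonl("../data/sample_stream.jsonl")`.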


## Initialize NLTK module

First, we need to download the resources that the **NLTK** module requires.<br/>
We run the following code to download them and check that everything is OK.

Note that before importing `nltk`, you need to make sure you have already installed it with **pip**, because it is a third-party module.<br>
You can use the following command to install `nltk`. Please make sure you run it in the shell.

```shell
pip install nltk --upgrade
```

The `--upgrade` argument upgrades `nltk` to the latest release.<br>
It only matters for those who have already installed it.
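
If you are not sure whether the package is installed, a small check with the standard library can tell you before the import fails. This is only a sketch; `is_installed` is a hypothetical helper name, not something provided by `nltk` or `pip`.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found (hypothetical helper)."""
    return find_spec(module_name) is not None


if not is_installed("nltk"):
    print("nltk is missing; run `pip install nltk --upgrade` in the shell first.")
```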

In [77]:
import nltk

# Download all the corpora and models we need.
nltk.download("gutenberg")
nltk.download("genesis")
nltk.download("inaugural")
nltk.download("nps_chat")
nltk.download("webtext")
nltk.download("treebank")
nltk.download('punkt_tab')
nltk.download("averaged_perceptron_tagger_eng")
nltk.download("wordnet")
nltk.download("stopwords")


# If everything has been downloaded,
# this import will not raise an error.
from nltk.book import *

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\Happy2018new\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package genesis to
[nltk_data]     C:\Users\Happy2018new\AppData\Roaming\nltk_data...
[nltk_data]   Package genesis is already up-to-date!
[nltk_data] Downloading package inaugural to
[nltk_data]     C:\Users\Happy2018new\AppData\Roaming\nltk_data...
[nltk_data]   Package inaugural is already up-to-date!
[nltk_data] Downloading package nps_chat to
[nltk_data]     C:\Users\Happy2018new\AppData\Roaming\nltk_data...
[nltk_data]   Package nps_chat is already up-to-date!
[nltk_data] Downloading package webtext to
[nltk_data]     C:\Users\Happy2018new\AppData\Roaming\nltk_data...
[nltk_data]   Package webtext is already up-to-date!
[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\Happy2018new\AppData\Roaming\nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data

If no error is raised, the preparation part is finished.

## Tokenize

Now we can start the first step of pre-processing the data.<br/>
The very first thing to do is to **tokenize** the users' comments.

To make the system more robust and suitable for most situations, we first split each comment into sentences (sentence tokenization).

We use the following code for this task.

In [93]:
# The result of `nltk.sent_tokenize` is a list of strings.
# Because we have multiple comments, we use
# `list[list[str]]` as the type hint.
sent_tokenize: list[list[str]] = []

# For each comment in the list,
# we do sentence tokenization and append the result to the list.
for i in streams:
    text: str = i["text"]
    sent_tokenize.append(nltk.sent_tokenize(text))

# To make sure the result matches our expectation,
# we print the tokenized sentences.
for index, sentences in enumerate(sent_tokenize):
    print(f"Comment {index+1}:")
    for sentence in sentences:
        print("\t" + sentence)
    print()

Comment 1:
	I love this product!
	It's absolutely fantastic.

Comment 2:
	Not what I expected.
	Pretty disappointed.

Comment 3:
	It's okay, does the job.
	Nothing special though.

Comment 4:
	Terrible experience.
	Would not recommend!

Comment 5:
	Absolutely brilliant.
	Exceeded my expectations.
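
To see why a dedicated sentence tokenizer helps with robustness, compare it with a naive baseline. The sketch below is a toy regular-expression splitter (the name `naive_sent_split` is hypothetical and it is not how `nltk.sent_tokenize` works); it simply splits after `.`, `!` or `?` followed by whitespace.

```python
import re


def naive_sent_split(text: str) -> list[str]:
    """Split after '.', '!' or '?' followed by whitespace (toy baseline)."""
    return [part for part in re.split(r"(?<=[.!?])\s+", text.strip()) if part]


print(naive_sent_split("I love this product! It's absolutely fantastic."))
print(naive_sent_split("Dr. Smith arrived. He left."))
```

The first comment is split correctly, but the second input is cut after `Dr.`, which is not a sentence boundary. The trained Punkt model behind `nltk.sent_tokenize` is designed to cope with abbreviations like this, which is why it is preferred here.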



The next step is to tokenize each sentence into words.<br/>
To keep everything organized, I create a simple class named `Sentences` to store the data.

A single `Sentences` instance holds the following information.
- The original comment (which can contain one or more sentences)
- Each sentence of this comment
- The tokens of each sentence

Defining `__repr__` lets us show all of these more clearly.

In [94]:
class Sentences:
    origin_text: str
    sent_tokens: list[str]
    word_tokens: list[list[str]]

    def __init__(self, text: str) -> None:
        self.origin_text = text
        self.sent_tokens = nltk.sent_tokenize(self.origin_text)
        self.word_tokens = [nltk.word_tokenize(i) for i in self.sent_tokens]

    def __repr__(self) -> str:
        result = "Sentences("
        result += f"origin_text={json.dumps(self.origin_text,ensure_ascii=False)}, "
        result += f"sent_tokens={self.sent_tokens}, "
        result += f"word_tokens={self.word_tokens}"
        return result + ")"


comments = [Sentences(i["text"]) for i in streams]
for index, comment in enumerate(comments):
    print(f"Comment {index+1}:")
    print(f"\t{comment}")
    print()

Comment 1:
	Sentences(origin_text="I love this product! It's absolutely fantastic.", sent_tokens=['I love this product!', "It's absolutely fantastic."], word_tokens=[['I', 'love', 'this', 'product', '!'], ['It', "'s", 'absolutely', 'fantastic', '.']])

Comment 2:
	Sentences(origin_text="Not what I expected. Pretty disappointed.", sent_tokens=['Not what I expected.', 'Pretty disappointed.'], word_tokens=[['Not', 'what', 'I', 'expected', '.'], ['Pretty', 'disappointed', '.']])

Comment 3:
	Sentences(origin_text="It's okay, does the job. Nothing special though.", sent_tokens=["It's okay, does the job.", 'Nothing special though.'], word_tokens=[['It', "'s", 'okay', ',', 'does', 'the', 'job', '.'], ['Nothing', 'special', 'though', '.']])

Comment 4:
	Sentences(origin_text="Terrible experience. Would not recommend!", sent_tokens=['Terrible experience.', 'Would not recommend!'], word_tokens=[['Terrible', 'experience', '.'], ['Would', 'not', 'recommend', '!']])

Comment 5:
	Sentences(origin_te

## Stemming

Now the tokenization step is finished.

Next, we want to count how many times each word appears, but consider the following words...
- play
- plays
- playing
- played
- player
- playful

These words look alike, but in **Python** they are different strings because they contain different characters.<br/>
When we count them, we would like words that share a root to be counted together, instead of each word getting a count of 1 on its own.

So the next step is **Stemming**: we stem each word token so that words with the same root fall into the same group, which makes it more convenient to count how often each group occurs.

Then, I can compute the distance between different groups.

In [95]:
from nltk.stem import PorterStemmer

DEFAULT_PORTER_STEMMER = PorterStemmer()


class StemSentences(Sentences):
    stem_tokens: list[list[str]]

    def __init__(self, text: str) -> None:
        super().__init__(text)
        self.stem_tokens = [
            [DEFAULT_PORTER_STEMMER.stem(j) for j in i] for i in self.word_tokens
        ]

    def __repr__(self) -> str:
        result = "StemSentences("
        result += f"origin_text={json.dumps(self.origin_text,ensure_ascii=False)}, "
        result += f"sent_tokens={self.sent_tokens}, "
        result += f"word_tokens={self.word_tokens}, "
        result += f"stem_tokens={self.stem_tokens}"
        return result + ")"


stem_comments = [StemSentences(i["text"]) for i in streams]
for index, comment in enumerate(stem_comments):
    print(f"Comment {index+1}:")
    print(f"\t{comment}")
    print()

Comment 1:
	StemSentences(origin_text="I love this product! It's absolutely fantastic.", sent_tokens=['I love this product!', "It's absolutely fantastic."], word_tokens=[['I', 'love', 'this', 'product', '!'], ['It', "'s", 'absolutely', 'fantastic', '.']], stem_tokens=[['i', 'love', 'thi', 'product', '!'], ['it', "'s", 'absolut', 'fantast', '.']])

Comment 2:
	StemSentences(origin_text="Not what I expected. Pretty disappointed.", sent_tokens=['Not what I expected.', 'Pretty disappointed.'], word_tokens=[['Not', 'what', 'I', 'expected', '.'], ['Pretty', 'disappointed', '.']], stem_tokens=[['not', 'what', 'i', 'expect', '.'], ['pretti', 'disappoint', '.']])

Comment 3:
	StemSentences(origin_text="It's okay, does the job. Nothing special though.", sent_tokens=["It's okay, does the job.", 'Nothing special though.'], word_tokens=[['It', "'s", 'okay', ',', 'does', 'the', 'job', '.'], ['Nothing', 'special', 'though', '.']], stem_tokens=[['it', "'s", 'okay', ',', 'doe', 'the', 'job', '.'], ['no

I create a new class named `StemSentences` that is based on `Sentences`.<br/>
This new class has a new field named `stem_tokens` that stores the stemming results for the word tokens of each sentence.

I use `PorterStemmer` because it is widely used for English.<br/>
In addition, because I will not change the stemmer, I store it in a constant, `DEFAULT_PORTER_STEMMER`, so that we don't need to create a new `PorterStemmer()` for each call of `stem`.
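
To illustrate the counting idea that motivates stemming, here is a minimal, self-contained sketch. `toy_stem` is a deliberately simple stand-in that strips a few suffixes; it is an assumption for illustration only and is NOT the Porter algorithm. `count_by_stem` is likewise a hypothetical helper.

```python
from collections import Counter
from typing import Callable


def toy_stem(word: str) -> str:
    """Strip a few common suffixes. A toy stand-in, NOT the Porter algorithm."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def count_by_stem(tokens: list[str], stem: Callable[[str], str]) -> Counter:
    """Count tokens after mapping each one to its stem."""
    return Counter(stem(token) for token in tokens)


words = ["play", "plays", "playing", "played"]
print(Counter(words))                   # every surface form is counted separately
print(count_by_stem(words, toy_stem))   # all four forms fall into one group
```

In the notebook itself, `DEFAULT_PORTER_STEMMER.stem` could be passed as the `stem` argument instead of the toy function. Which words end up in the same group then depends on the rules of the real stemmer.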

## Lemmatization

Stemming makes it simple to count the words of each comment.

**Stemming** also helps to speed up computing the distance between different groups.<br/>
But the results of **Stemming** are meant for the machine and are hard for humans to read...

So here we need **Lemmatization** to get a real English word for each token.<br/>
That means we use the results of **Stemming** for internal computation and the results of **Lemmatization** for visualization.

In [96]:
from nltk.stem import WordNetLemmatizer


DEFAULT_LEMMATIZER = WordNetLemmatizer()


class LemmerWrapper:
    @staticmethod
    def get_file_key(pos_tag):
        return {
            "NN": "n",
            "VB": "v",
            "RB": "r",
            "JJ": "a",
        }.get(pos_tag[:2], "n")

    @staticmethod
    def lemmatize(word: str) -> str:
        tag = nltk.pos_tag([word])[0][1]
        return DEFAULT_LEMMATIZER.lemmatize(word, LemmerWrapper.get_file_key(tag))


class FilterSentences(Sentences):
    stem_tokens: list[list[str]]
    lem_tokens: list[list[str]]

    def __init__(self, text: str) -> None:
        super().__init__(text)
        self.stem_tokens = [
            [DEFAULT_PORTER_STEMMER.stem(j) for j in i] for i in self.word_tokens
        ]
        self.lem_tokens = [
            [LemmerWrapper.lemmatize(j).lower() for j in i] for i in self.word_tokens
        ]

    def __repr__(self) -> str:
        result = "StemSentences("
        result += f"origin_text={json.dumps(self.origin_text,ensure_ascii=False)}, "
        result += f"sent_tokens={self.sent_tokens}, "
        result += f"word_tokens={self.word_tokens}, "
        result += f"stem_tokens={self.stem_tokens}, "
        result += f"lem_tokens={self.lem_tokens}"
        return result + ")"


filter_comments = [FilterSentences(i["text"]) for i in streams]
for index, comment in enumerate(filter_comments):
    print(f"Comment {index+1}:")
    print(f"\t{comment}")
    print()

Comment 1:
	StemSentences(origin_text="I love this product! It's absolutely fantastic.", sent_tokens=['I love this product!', "It's absolutely fantastic."], word_tokens=[['I', 'love', 'this', 'product', '!'], ['It', "'s", 'absolutely', 'fantastic', '.']], stem_tokens=[['i', 'love', 'thi', 'product', '!'], ['it', "'s", 'absolut', 'fantast', '.']], lem_tokens=[['i', 'love', 'this', 'product', '!'], ['it', "'s", 'absolutely', 'fantastic', '.']])

Comment 2:
	StemSentences(origin_text="Not what I expected. Pretty disappointed.", sent_tokens=['Not what I expected.', 'Pretty disappointed.'], word_tokens=[['Not', 'what', 'I', 'expected', '.'], ['Pretty', 'disappointed', '.']], stem_tokens=[['not', 'what', 'i', 'expect', '.'], ['pretti', 'disappoint', '.']], lem_tokens=[['not', 'what', 'i', 'expect', '.'], ['pretty', 'disappointed', '.']])

Comment 3:
	StemSentences(origin_text="It's okay, does the job. Nothing special though.", sent_tokens=["It's okay, does the job.", 'Nothing special though.

We choose `WordNetLemmatizer` to do the lemmatization.<br/>
As before, we never change the `WordNetLemmatizer` instance, so we store a single `WordNetLemmatizer()` in a constant that can be used anywhere.


`LemmerWrapper` wraps the lemmatizer. It is a utility so that I can just call `LemmerWrapper.lemmatize`.<br/>
`get_file_key` maps the part-of-speech tag returned by `nltk.pos_tag` to the part-of-speech key that the lemmatizer expects (`n`, `v`, `r` or `a`), falling back to `n` for any other tag.

I place `get_file_key` and `lemmatize` in `LemmerWrapper` to keep these 2 functions organized.<br/>
Both are declared with `staticmethod` so that there is no need to create a `LemmerWrapper` instance before calling them.

Like `StemSentences`, `FilterSentences` is based on `Sentences`, but it contains 2 extra fields:
- `stem_tokens` (stores the stemming results)
- `lem_tokens` (stores the lemmatization results)
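
To make the tag mapping concrete, here is a standalone sketch that mirrors the logic of `get_file_key` without needing any NLTK data. The function name `penn_to_wordnet` is hypothetical, and the sample tags assume the Penn Treebank style tags that the default English tagger produces.

```python
def penn_to_wordnet(pos_tag: str) -> str:
    """Map a Penn Treebank style tag to a WordNet POS key, defaulting to noun."""
    # Only the first two characters matter: "VBD", "VBG" and "VBZ" are all verbs.
    return {"NN": "n", "VB": "v", "RB": "r", "JJ": "a"}.get(pos_tag[:2], "n")


for tag in ("NNS", "VBD", "RBR", "JJ", "PRP"):
    print(tag, "=>", penn_to_wordnet(tag))
```

Tags outside the four groups, such as `PRP` (pronoun), fall back to `n`.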

## Clean Stop Words

These words are some examples of stop words in English.
- a
- an
- the
- and
- or

These words contribute very little to the meaning of a sentence.<br/>
However, they are still stored in the tokens and counted, which adds noise to our results.

Therefore, we should remove these stop words from the tokens.<br/>
Let's do that by creating a new class named `StopWordCleanner`.

In [97]:
from nltk.corpus import stopwords

CONST_STOP_WORDS = set(stopwords.words("english"))
CONST_STOP_WORDS |= {",", ".", "!", "?", ";", ":", "'", '"', "`", "``"}
CONST_STOP_WORDS -= {
    "shan't",
    "wouldn't",
    "shouldn't",
    "wasn't",
    "aren't",
    "not",
    "mightn't",
    "no",
    "doesn't",
    "hasn't",
    "won't",
    "isn't",
    "out",
    "don't",
    "didn't",
    "needn't",
    "mustn't",
    "hadn't",
    "couldn't",
    "off",
    "nor",
}
CONST_STOP_WORDS -= {"very", "to"}
CONST_STOP_WORDS -= {"should", "can", "will"}
CONST_STOP_WORDS -= {"if"}
CONST_STOP_WORDS -= {"which", "who", "this", "those", "that", "these", "whom"}
CONST_STOP_WORDS -= {"each", "most", "few", "all", "some", "more", "any"}
CONST_STOP_WORDS -= {"before", "between", "during", "against", "after"}


class StopWordCleanner:
    @staticmethod
    def clean(sent: FilterSentences) -> FilterSentences:
        sent.stem_tokens = [
            [j if j not in CONST_STOP_WORDS else "" for j in i]
            for i in sent.stem_tokens
        ]
        sent.lem_tokens = [
            [j if j not in CONST_STOP_WORDS else "" for j in i] for i in sent.lem_tokens
        ]
        return sent


filter_comments = [StopWordCleanner.clean(i) for i in filter_comments]
for index, comment in enumerate(filter_comments):
    print(f"Comment {index+1}:")
    print(f"\t{comment}")
    print()

Comment 1:
	StemSentences(origin_text="I love this product! It's absolutely fantastic.", sent_tokens=['I love this product!', "It's absolutely fantastic."], word_tokens=[['I', 'love', 'this', 'product', '!'], ['It', "'s", 'absolutely', 'fantastic', '.']], stem_tokens=[['', 'love', 'thi', 'product', ''], ['', "'s", 'absolut', 'fantast', '']], lem_tokens=[['', 'love', 'this', 'product', ''], ['', "'s", 'absolutely', 'fantastic', '']])

Comment 2:
	StemSentences(origin_text="Not what I expected. Pretty disappointed.", sent_tokens=['Not what I expected.', 'Pretty disappointed.'], word_tokens=[['Not', 'what', 'I', 'expected', '.'], ['Pretty', 'disappointed', '.']], stem_tokens=[['not', '', '', 'expect', ''], ['pretti', 'disappoint', '']], lem_tokens=[['not', '', '', 'expect', ''], ['pretty', 'disappointed', '']])

Comment 3:
	StemSentences(origin_text="It's okay, does the job. Nothing special though.", sent_tokens=["It's okay, does the job.", 'Nothing special though.'], word_tokens=[['It', 

You may notice that I removed some words from the original set, `set(stopwords.words("english"))`, because we don't want the meaning of the sentences to change. For example, dropping a negation such as `not` would reverse the meaning of the original sentence.

`StopWordCleanner.clean` is again a `staticmethod`, so I can use it directly without creating a `StopWordCleanner` instance.

In addition, with the following expression, I also remove the punctuation marks from the tokens at the same time.
```python
CONST_STOP_WORDS |= {",", ".", "!", "?", ";", ":", "'", '"', "`", "``"}
```
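
Note that `clean` replaces a stop word with an empty string instead of deleting it. One plausible reason, sketched below under that assumption, is that both token lists keep their length, so position `j` in `stem_tokens` still refers to the same word as position `j` in `lem_tokens`. The stop set `STOP` and the helper `blank_out` are hypothetical and only serve this illustration.

```python
STOP = {"it", "."}  # tiny hypothetical stop set for illustration

stems = ["it", "'s", "absolut", "fantast", "."]
lemmas = ["it", "'s", "absolutely", "fantastic", "."]


def blank_out(tokens: list[str], stop_words: set[str]) -> list[str]:
    """Replace stop words with "" so that list positions stay unchanged."""
    return [token if token not in stop_words else "" for token in tokens]


clean_stems = blank_out(stems, STOP)
clean_lemmas = blank_out(lemmas, STOP)

# Both lists keep their original length, so index j still refers to the same word.
print(list(zip(clean_stems, clean_lemmas)))
```

If the stop words were deleted outright, the two lists would still line up only as long as exactly the same positions were removed from both.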

## Compact Inverse Words

We wish to compute the distance between tokens, but you may find a token list like this:
```python
["not", "good"]
```

That means the program treats `good` and `not` as unrelated tokens, <br/>
and computes the distance without taking the negation into account.

But we humans know this means `not good`, <br/>
so we need to mark the negated token, like this:
```python
["not", "NEG_good"]
```

So let's do that.

In [98]:
class InverseCompacter:
    WINDOW_SIZE = 3
    NEG_TOKENS = {
        "won't",
        "n't",
        "out",
        "without",
        "no",
        "don't",
        "mightn't",
        "isn't",
        "doesn't",
        "shouldn't",
        "can't",
        "wouldn't",
        "hadn't",
        "nor",
        "off",
        "cannot",
        "needn't",
        "never",
        "shan't",
        "didn't",
        "couldn't",
        "mustn't",
        "not",
        "aren't",
        "hasn't",
        "wasn't",
    }

    @staticmethod
    def compact_tokens(tokens: list[str]) -> list[str]:
        result = []
        count = 0

        for token in tokens:
            token = token.lower()
            if token in InverseCompacter.NEG_TOKENS:
                count = InverseCompacter.WINDOW_SIZE
                result.append(token)
                continue
            if count > 0:
                result.append(f"NEG_{token}")
                count -= 1
            else:
                result.append(token)

        return result

    @staticmethod
    def compact_sentences(sent: FilterSentences) -> FilterSentences:
        sent.stem_tokens = [
            InverseCompacter.compact_tokens(i) for i in sent.stem_tokens
        ]
        return sent


filter_comments = [InverseCompacter.compact_sentences(i) for i in filter_comments]
for index, comment in enumerate(filter_comments):
    print(f"Comment {index+1}:")
    print(f"\t{comment}")
    print()


Comment 1:
	StemSentences(origin_text="I love this product! It's absolutely fantastic.", sent_tokens=['I love this product!', "It's absolutely fantastic."], word_tokens=[['I', 'love', 'this', 'product', '!'], ['It', "'s", 'absolutely', 'fantastic', '.']], stem_tokens=[['', 'love', 'thi', 'product', ''], ['', "'s", 'absolut', 'fantast', '']], lem_tokens=[['', 'love', 'this', 'product', ''], ['', "'s", 'absolutely', 'fantastic', '']])

Comment 2:
	StemSentences(origin_text="Not what I expected. Pretty disappointed.", sent_tokens=['Not what I expected.', 'Pretty disappointed.'], word_tokens=[['Not', 'what', 'I', 'expected', '.'], ['Pretty', 'disappointed', '.']], stem_tokens=[['not', 'NEG_', 'NEG_', 'NEG_expect', ''], ['pretti', 'disappoint', '']], lem_tokens=[['not', '', '', 'expect', ''], ['pretty', 'disappointed', '']])

Comment 3:
	StemSentences(origin_text="It's okay, does the job. Nothing special though.", sent_tokens=["It's okay, does the job.", 'Nothing special though.'], word_tok

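Note that this is only needed for the `stem_tokens` field.<br/>
`lem_tokens` does not need it because it is meant for humans to read.

The bare `NEG_` entries in the output are the empty placeholders left by the stop-word cleaning; they also use up positions in the window of `WINDOW_SIZE` tokens.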
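
As a sketch of one possible variant (an assumption, not the behaviour of `InverseCompacter.compact_tokens` above), the function below leaves empty placeholders untouched and does not let them use up the window. `NEGATIONS` is a small hypothetical subset of the negation words and `mark_negation` is a hypothetical name.

```python
NEGATIONS = {"not", "no", "never", "n't"}  # small hypothetical subset


def mark_negation(tokens: list[str], window: int = 3) -> list[str]:
    """Prefix up to `window` non-empty tokens after a negation with NEG_."""
    result: list[str] = []
    remaining = 0
    for token in tokens:
        token = token.lower()
        if token in NEGATIONS:
            remaining = window  # (re)open the window at every negation word
            result.append(token)
        elif token and remaining > 0:
            result.append(f"NEG_{token}")
            remaining -= 1
        else:
            # Either an empty placeholder or a token outside the window.
            result.append(token)
    return result


print(mark_negation(["not", "", "", "expect", ""]))
print(mark_negation(["would", "not", "recommend", ""]))
```

With this variant, the window always covers the next three real tokens, no matter how many stop words were blanked out in between.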

## Machine Token to Human Readable Token

Because in the end we need to convert our analysis results into words that humans can read, before we start to compute the distance we first create a map that finds the human-readable word for a given machine token.

Note that for tokens that start with `NEG_`, like `NEG_good`, the corresponding human-readable word should be `(negative) ...`, like `(negative) good`.

Now let's do it!

In [103]:
from dataclasses import dataclass, field


@dataclass
class StemToLemMapping:
    mapping: dict[str, str] = field(default_factory=lambda: {})

    def build_mapping(self, sentences: list[FilterSentences]) -> "StemToLemMapping":
        for sentence in sentences:
            for i, tokens in enumerate(sentence.stem_tokens):
                for j, token in enumerate(tokens):
                    key, value = token, sentence.lem_tokens[i][j]
                    if len(token) == 0 or len(value) == 0:
                        continue
                    if key in self.mapping:
                        continue
                    self.mapping[key] = value
        return self

    def get_lem_token(self, stem_token: str) -> str:
        result = self.mapping.get(stem_token, stem_token)
        if stem_token.startswith("NEG_"):
            return f"(negative) {result}"
        return result


mapping = StemToLemMapping().build_mapping(filter_comments)
for key, value in mapping.mapping.items():
    print(f"{key} => {mapping.get_lem_token(key)}")

love => love
thi => this
product => product
's => 's
absolut => absolutely
fantast => fantastic
not => not
NEG_expect => (negative) expect
pretti => pretty
disappoint => disappointed
okay => okay
job => job
noth => nothing
special => special
though => though
terribl => terrible
experi => experience
would => would
NEG_recommend => (negative) recommend
brilliant => brilliant
exceed => exceeded
expect => expectation


The `dataclass` decorator helps us manage the class fields easily.<br/>
`field(default_factory=lambda: {})` ensures that if the user does not pass a dict, a new empty dict is created for the `mapping` field.
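
The following minimal sketch shows what `default_factory` buys us: every instance gets its own dict. `Registry` is a hypothetical class used only for this illustration, and `default_factory=dict` is used here as an equivalent of `lambda: {}`.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:  # hypothetical class, used only for this illustration
    mapping: dict[str, str] = field(default_factory=dict)


first = Registry()
second = Registry()
first.mapping["thi"] = "this"

print(first.mapping)   # only the first instance was modified
print(second.mapping)  # the second instance has its own, still empty, dict
```

Writing `mapping: dict[str, str] = {}` directly is not an option, because `dataclass` rejects a mutable default such as a dict with a `ValueError` when the class is defined.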