# Text preprocessing experiments

My goal is to find a text preprocessing approach that removes unnecessary, unimportant content from text before sending it to an LLM.

## Setup libs

In [120]:
%pip install trafilatura tiktoken

Note: you may need to restart the kernel to use updated packages.


## Sample texts

In this experiment, I'm going to use a few recent newsletters from JS Weekly. The latest one is issue 743. Let's fetch the 11 most recent issues (733 to 743).

### Download texts

In [121]:
from trafilatura import fetch_url, extract

latest_js_weekly_issue = 743
js_weekly_base_url = "https://javascriptweekly.com/issues/"
downloaded_issues = []

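# Issues 743 down to 733 inclusive (11 issues in total)
for issue_number in range(latest_js_weekly_issue, latest_js_weekly_issue - 11, -1):
    html = fetch_url(js_weekly_base_url + str(issue_number))
    # fetch_url returns None on a failed download and extract returns None
    # when no main text is found; skip such issues instead of storing None
    content = extract(html) if html is not None else None
    if content is None:
        continue
    downloaded_issues.append({
        'issue_number': issue_number,
        'content': content,
    })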


### Tokenize downloaded texts

Let's see how many tokens each text has.

In [122]:
import tiktoken

tokenizer = tiktoken.encoding_for_model("gpt-4")
all_issues_tokens = 0

for issue in downloaded_issues:
    tokens = len(tokenizer.encode(issue['content']))
    issue['tokens_numbers'] = tokens
    all_issues_tokens += tokens
    print(f"issue_number: {issue['issue_number']}, tokens: {tokens}")

print(f"all_issues_tokens: {all_issues_tokens}")

issue_number: 743, tokens: 1007
issue_number: 742, tokens: 1099
issue_number: 741, tokens: 801
issue_number: 740, tokens: 914
issue_number: 739, tokens: 863
issue_number: 738, tokens: 1075
issue_number: 737, tokens: 951
issue_number: 736, tokens: 944
issue_number: 735, tokens: 1066
issue_number: 734, tokens: 994
issue_number: 733, tokens: 1090
all_issues_tokens: 10804


### Preprocessing function

Let's implement a helper that applies a processing function to every issue and prints the token counts before and after.


In [123]:
def process_issues_content(processing_function, function_name="processing function"):
    print(f"all_issues_tokens before this step: {all_issues_tokens}")
    new_all_issues_tokens = 0
    for issue in downloaded_issues:
        print(f"tokens before {function_name}: {issue['tokens_numbers']}")
        issue['content'] = processing_function(issue['content'])
        issue['tokens_numbers'] = len(tokenizer.encode(issue['content']))
        print(f"tokens after {function_name}: {issue['tokens_numbers']}")
        new_all_issues_tokens += issue['tokens_numbers']

    print(f"all_issues_tokens after this step: {new_all_issues_tokens}")
    return new_all_issues_tokens
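
Note that `process_issues_content` overwrites `issue['content']` in place, so re-running a cell applies the same step a second time. A side-effect-free variant could look roughly like the sketch below. The names `measure_step` and `count_words` are hypothetical and not part of the notebook, and the word counter is only a stand-in for `len(tokenizer.encode(text))` so that the sketch runs without tiktoken.

```python
from typing import Callable, List, Tuple


def measure_step(
    texts: List[str],
    processing_function: Callable[[str], str],
    count_tokens: Callable[[str], int],
) -> Tuple[List[str], int, int]:
    """Return (processed_texts, tokens_before, tokens_after).

    The input list is not modified, so the step can be re-run safely.
    """
    tokens_before = sum(count_tokens(text) for text in texts)
    processed = [processing_function(text) for text in texts]
    tokens_after = sum(count_tokens(text) for text in processed)
    return processed, tokens_before, tokens_after


# Stand-in counter (assumption): counts whitespace-separated words.
# In the notebook this would be: lambda text: len(tokenizer.encode(text))
def count_words(text: str) -> int:
    return len(text.split())


sample_texts = ["Hello   big world", "one two"]  # made-up sample data
processed, before, after = measure_step(
    sample_texts,
    lambda text: text.replace("big", ""),  # toy processing step
    count_words,
)
print(f"before: {before}, after: {after}")
```

Returning the processed texts instead of mutating them makes it possible to compare several candidate steps against the same unmodified input.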

### Remove emojis


In [124]:
%pip install emoji demoji

Note: you may need to restart the kernel to use updated packages.


In [125]:
# function for removing emojis

import demoji
import emoji

def remove_emojis(text: str) -> str:
    text = emoji.replace_emoji(text, replace='')
    text = demoji.replace(text, '')

    return text

In [126]:
all_issues_tokens = process_issues_content(remove_emojis, "removing emojis")

all_issues_tokens before this step: 10804
tokens before removing emojis: 1007
tokens after removing emojis: 967
tokens before removing emojis: 1099
tokens after removing emojis: 1074
tokens before removing emojis: 801
tokens after removing emojis: 778
tokens before removing emojis: 914
tokens after removing emojis: 886
tokens before removing emojis: 863
tokens after removing emojis: 829
tokens before removing emojis: 1075
tokens after removing emojis: 1050
tokens before removing emojis: 951
tokens after removing emojis: 928
tokens before removing emojis: 944
tokens after removing emojis: 924
tokens before removing emojis: 1066
tokens after removing emojis: 1043
tokens before removing emojis: 994
tokens after removing emojis: 973
tokens before removing emojis: 1090
tokens after removing emojis: 1063
all_issues_tokens after this step: 10515


### Remove email addresses

In [127]:
import re

def remove_email_addresses(text: str) -> str:
    # The TLD class is [A-Za-z]; a "|" inside a character class would match a literal pipe
    email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
    return email_pattern.sub('', text)

In [128]:
all_issues_tokens = process_issues_content(remove_email_addresses, "removing email addresses")

all_issues_tokens before this step: 10515
tokens before removing email addresses: 967
tokens after removing email addresses: 967
tokens before removing email addresses: 1074
tokens after removing email addresses: 1074
tokens before removing email addresses: 778
tokens after removing email addresses: 778
tokens before removing email addresses: 886
tokens after removing email addresses: 886
tokens before removing email addresses: 829
tokens after removing email addresses: 829
tokens before removing email addresses: 1050
tokens after removing email addresses: 1050
tokens before removing email addresses: 928
tokens after removing email addresses: 928
tokens before removing email addresses: 924
tokens after removing email addresses: 924
tokens before removing email addresses: 1043
tokens after removing email addresses: 1043
tokens before removing email addresses: 973
tokens after removing email addresses: 973
tokens before removing email addresses: 1063
tokens after removing email addresses: 1063
all_issues_tokens after this step: 10515
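
The token counts did not change in this step, which suggests the newsletters simply contain no email addresses. To check that the pattern itself works, here is a small sketch on a made-up sentence. It writes the top-level-domain class as `[A-Za-z]`, because inside a character class `|` is a literal pipe rather than alternation; the constant names are hypothetical.

```python
import re

# Email pattern with the TLD class written as [A-Za-z]
EMAIL_PATTERN = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

# Variant with "|" inside the class, which also accepts a pipe in the TLD
PIPE_PATTERN = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b')

sample = "Contact jane.doe@example.com or reply to this issue."  # made-up text
cleaned = EMAIL_PATTERN.sub('', sample)
print(repr(cleaned))

# A string that is not a valid address but contains a pipe in the "TLD"
odd = "a@b.x|y"
print(bool(EMAIL_PATTERN.search(odd)), bool(PIPE_PATTERN.search(odd)))
```

The removal leaves a double space behind, which the later whitespace step cleans up.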

### Remove phone numbers


In [129]:
def remove_phone_numbers(text: str) -> str:
    phone_pattern = re.compile(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b')
    return phone_pattern.sub('', text)

In [130]:
all_issues_tokens = process_issues_content(remove_phone_numbers, "removing phone numbers")

all_issues_tokens before this step: 10515
tokens before removing phone numbers: 967
tokens after removing phone numbers: 967
tokens before removing phone numbers: 1074
tokens after removing phone numbers: 1074
tokens before removing phone numbers: 778
tokens after removing phone numbers: 778
tokens before removing phone numbers: 886
tokens after removing phone numbers: 886
tokens before removing phone numbers: 829
tokens after removing phone numbers: 829
tokens before removing phone numbers: 1050
tokens after removing phone numbers: 1050
tokens before removing phone numbers: 928
tokens after removing phone numbers: 928
tokens before removing phone numbers: 924
tokens after removing phone numbers: 924
tokens before removing phone numbers: 1043
tokens after removing phone numbers: 1043
tokens before removing phone numbers: 973
tokens after removing phone numbers: 973
tokens before removing phone numbers: 1063
tokens after removing phone numbers: 1063
all_issues_tokens after this step: 10515
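
This step also removed nothing from the newsletters. The pattern only covers ten-digit numbers in a 3-3-4 grouping, so it is worth seeing what it does and does not match. The sample strings below are made up for illustration.

```python
import re

PHONE_PATTERN = re.compile(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b')

samples = [
    "Call 555-123-4567 today",      # matched: 3-3-4 grouping with dashes
    "Call (555) 123-4567 today",    # not matched: parentheses and a space
    "Call +44 20 7946 0958 today",  # not matched: different grouping
    "Build 1700000000 finished",    # matched, although it is not a phone number
]

results = [PHONE_PATTERN.sub('', sample) for sample in samples]
for original, result in zip(samples, results):
    print(f"{original!r} -> {result!r}")
```

Any bare ten-digit number, such as a Unix timestamp, is treated as a phone number, so this step can delete meaningful data in technical text.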

### Remove URLs

In [131]:
def remove_urls(text: str) -> str:
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub('', text)

In [132]:
all_issues_tokens = process_issues_content(remove_urls, "removing urls")

all_issues_tokens before this step: 10515
tokens before removing urls: 967
tokens after removing urls: 967
tokens before removing urls: 1074
tokens after removing urls: 1074
tokens before removing urls: 778
tokens after removing urls: 778
tokens before removing urls: 886
tokens after removing urls: 886
tokens before removing urls: 829
tokens after removing urls: 829
tokens before removing urls: 1050
tokens after removing urls: 1050
tokens before removing urls: 928
tokens after removing urls: 928
tokens before removing urls: 924
tokens after removing urls: 924
tokens before removing urls: 1043
tokens after removing urls: 1043
tokens before removing urls: 973
tokens after removing urls: 973
tokens before removing urls: 1063
tokens after removing urls: 1063
all_issues_tokens after this step: 10515
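
The counts are unchanged again, presumably because the extracted text contains no raw URLs. The sketch below runs the same pattern on a made-up sentence to show its behavior, including one side effect: `\S+` also consumes punctuation that directly follows the URL.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

# Made-up sample: both URLs are followed directly by punctuation
sample = "Docs at https://example.com/docs, mirror at www.example.org."
cleaned = URL_PATTERN.sub('', sample)
print(repr(cleaned))
```

Both the comma and the final period disappear together with the URLs, which may or may not matter for the downstream prompt.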


### Remove special chars

In [133]:
def remove_special_chars(text: str) -> str:
    return re.sub(r'[^a-zA-Z0-9\s.,!?;:()\'\-]', '', text)

In [134]:
all_issues_tokens = process_issues_content(remove_special_chars, "removing special chars")

all_issues_tokens before this step: 10515
tokens before removing special chars: 967
tokens after removing special chars: 925
tokens before removing special chars: 1074
tokens after removing special chars: 1038
tokens before removing special chars: 778
tokens after removing special chars: 744
tokens before removing special chars: 886
tokens after removing special chars: 847
tokens before removing special chars: 829
tokens after removing special chars: 798
tokens before removing special chars: 1050
tokens after removing special chars: 1004
tokens before removing special chars: 928
tokens after removing special chars: 894
tokens before removing special chars: 924
tokens after removing special chars: 893
tokens before removing special chars: 1043
tokens after removing special chars: 998
tokens before removing special chars: 973
tokens after removing special chars: 934
tokens before removing special chars: 1063
tokens after removing special chars: 1031
all_issues_tokens after this step: 10106
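
This step saves tokens, but the allow-list is strict: it keeps only ASCII letters, digits, whitespace, and a handful of punctuation marks. A minimal sketch on made-up strings shows what gets dropped, which matters for a JavaScript newsletter full of package names and identifiers.

```python
import re


def remove_special_chars(text: str) -> str:
    # Same allow-list as in the notebook cell
    return re.sub(r'[^a-zA-Z0-9\s.,!?;:()\'\-]', '', text)


# Made-up samples
prose = remove_special_chars("Café & Node.js v20 is 50% faster!")
package = remove_special_chars("npm i @scope/pkg_name")
print(repr(prose))
print(repr(package))
```

Accented letters, `%`, `@`, `/`, and `_` are all removed, so scoped package names and snake_case identifiers are no longer recoverable after this step.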

### Remove extra whitespace


In [135]:
def remove_extra_whitespace(text: str) -> str:
    return re.sub(r'\s+', ' ', text).strip()

In [136]:
all_issues_tokens = process_issues_content(remove_extra_whitespace, "removing extra whitespace")

all_issues_tokens before this step: 10106
tokens before removing extra whitespace: 925
tokens after removing extra whitespace: 874
tokens before removing extra whitespace: 1038
tokens after removing extra whitespace: 993
tokens before removing extra whitespace: 744
tokens after removing extra whitespace: 702
tokens before removing extra whitespace: 847
tokens after removing extra whitespace: 803
tokens before removing extra whitespace: 798
tokens after removing extra whitespace: 758
tokens before removing extra whitespace: 1004
tokens after removing extra whitespace: 956
tokens before removing extra whitespace: 894
tokens after removing extra whitespace: 855
tokens before removing extra whitespace: 893
tokens after removing extra whitespace: 848
tokens before removing extra whitespace: 998
tokens after removing extra whitespace: 951
tokens before removing extra whitespace: 934
tokens after removing extra whitespace: 892
tokens before removing extra whitespace: 1031
tokens after removing extra whitespace: 984
all_issues_tokens after this step: 9616
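
Collapsing every whitespace run into a single space also removes line and paragraph breaks. If paragraph boundaries matter for the prompt, one possible alternative is to collapse whitespace within paragraphs only. The function `collapse_keep_paragraphs` below is a hypothetical sketch, not something used in the experiment.

```python
import re


def remove_extra_whitespace(text: str) -> str:
    # Same behavior as the notebook cell: everything ends up on one line
    return re.sub(r'\s+', ' ', text).strip()


def collapse_keep_paragraphs(text: str) -> str:
    # Hypothetical variant: treat blank lines as paragraph separators
    paragraphs = re.split(r'\n\s*\n', text)
    cleaned = [re.sub(r'\s+', ' ', paragraph).strip() for paragraph in paragraphs]
    return '\n\n'.join(paragraph for paragraph in cleaned if paragraph)


sample = "Title\n\n  First   paragraph.\n\tSecond paragraph.  "  # made-up text
flat = remove_extra_whitespace(sample)
structured = collapse_keep_paragraphs(sample)
print(repr(flat))
print(repr(structured))
```

The variant keeps a few more newline characters in the text, so it would likely save slightly fewer tokens than the fully flattened version.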

## Summary

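Together, these steps reduce the total token count from 10804 to 9616, a saving of 1188 tokens (about 11%). On this data set, only removing emojis, special characters, and extra whitespace had any effect; removing email addresses, phone numbers, and URLs left the counts unchanged.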
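
The contribution of each step can be tallied from the totals printed by the cells above. In the sketch below the totals are copied by hand from those outputs, and the final value is the one quoted in this summary; the helper name `step_savings` is hypothetical.

```python
from typing import List, Tuple

# Running totals after each step, copied by hand from the cell outputs
totals: List[Tuple[str, int]] = [
    ("original", 10804),
    ("remove emojis", 10515),
    ("remove email addresses", 10515),
    ("remove phone numbers", 10515),
    ("remove urls", 10515),
    ("remove special chars", 10106),
    ("remove extra whitespace", 9616),
]


def step_savings(running_totals: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Return (step_name, tokens_saved) for each step after the first entry."""
    savings = []
    for (_, previous), (name, current) in zip(running_totals, running_totals[1:]):
        savings.append((name, previous - current))
    return savings


savings = step_savings(totals)
for name, saved in savings:
    print(f"{name}: {saved} tokens saved")

total_saved = totals[0][1] - totals[-1][1]
percent_saved = 100 * total_saved / totals[0][1]
print(f"total: {total_saved} tokens saved ({percent_saved:.1f}%)")
```

Whitespace normalization is the largest single contributor, followed by special-character removal and emoji removal.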