# Classifying language

We want to classify the language of our posts. Since this is one of our filtering steps and needs to run quickly, we'll benchmark a few options on (1) runtime and (2) memory usage. We measure both after the initial cold start, since downloading, loading, and preparing a model can take some time.
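How the runtime and memory numbers get collected isn't shown in this notebook, so here is one way it could be done. This is a minimal sketch, assuming a hypothetical decorator named `track_performance` built on the standard library's `time` and `tracemalloc` modules. Note that `tracemalloc` only sees allocations made through Python's allocator, so memory held by native extensions may be undercounted.

```python
import time
import tracemalloc
from functools import wraps


def track_performance(func):
    """Print wall-clock time and peak traced memory for a single call.

    Hypothetical helper for illustration; not the notebook's own tooling.
    """

    @wraps(func)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            # get_traced_memory() returns (current, peak) in bytes.
            _, peak_bytes = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            minutes, seconds = divmod(int(elapsed), 60)
            print(f"Execution time for {func.__name__}: {minutes} minutes, {seconds} seconds")
            print(f"Memory usage for {func.__name__}: {peak_bytes / 1024**2} MB")

    return wrapper


@track_performance
def count_words(texts: list[str]) -> int:
    # Stand-in workload; a real benchmark would call a language detector here.
    return sum(len(text.split()) for text in texts)


total_words = count_words(["This is an example post"] * 1000)
```

To keep the cold start out of the measurement, load the model and run one warm-up prediction before calling the decorated function.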

I assume that in terms of accuracy, the more up-to-date the model is, the better it will perform, but that all industrial-grade solutions are probably equally accurate. They might have differences in terms of compute and runtime, but besides light QA I won't check the accuracy that much, unless we see a drastic difference in classifications between models. Language classification should be pretty well-solved and I'm satisfied with using off-the-shelf solutions.

In [1]:
from ml_tooling.inference_helpers import classify_posts, generate_batches_of_posts

## Load data

For this, let's benchmark by loading 50,000 posts. Let's see how well our models do on that load.

In [2]:
from services.sync.stream.helper import get_posts_as_list_dicts

In [3]:
num_posts = 50000

In [4]:
posts: list[dict] = get_posts_as_list_dicts(k=num_posts)

In [5]:
BATCH_SIZE = 1000

Let's preprocess the posts first, for example by cleaning up stray spacing and newline characters.

In [6]:
def preprocess_post(post: dict) -> dict:
    """Do any preprocessing needed for the posts.

    We only need the following fields:
    - id
    - uri
    - text

    We also want to remove any weird spacing or any newline characters.
    """
    return {
        "id": post["id"],
        "uri": post["uri"],
        "text": post["text"].replace("\n", " ").strip(),
    }

In [7]:
preprocessed_posts: list[dict] = [preprocess_post(post) for post in posts]
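The preprocessing above replaces newlines and strips the ends, but it leaves tabs and repeated spaces inside the text untouched. If we wanted to collapse those as well, a minimal sketch could look like the following. The name `normalize_whitespace` is hypothetical and not part of the original pipeline.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no arguments splits on any whitespace run and drops
    # leading and trailing whitespace, so joining with " " normalizes the text.
    return " ".join(text.split())


normalized = normalize_whitespace("  Hello\n\nworld \t again  ")
```

This could replace the `.replace("\n", " ").strip()` call inside `preprocess_post` if the stricter cleanup turns out to matter.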

Now let's get our batches.

In [9]:
batches: list[list[dict]] = generate_batches_of_posts(
    posts=preprocessed_posts, batch_size=BATCH_SIZE
)
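The implementation of `generate_batches_of_posts` isn't shown here. As a rough sketch of what a batching helper like this typically does, assuming it simply slices the list into consecutive chunks (the name `chunk_posts` is hypothetical):

```python
def chunk_posts(posts: list[dict], batch_size: int) -> list[list[dict]]:
    """Split posts into consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    return [posts[i : i + batch_size] for i in range(0, len(posts), batch_size)]


example_posts = [{"id": i} for i in range(2500)]
example_batches = chunk_posts(example_posts, batch_size=1000)
batch_sizes = [len(batch) for batch in example_batches]
```

With 2,500 posts and a batch size of 1,000, the last batch is a partial one holding the remaining 500 posts.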

## Language detectors
In Python, there are plenty of tools to use for language detection. We'll try several of these options.

### 1. Langdetect

Can we use `langdetect`? [langdetect](https://github.com/Mimino666/langdetect) is a Python package (ported from [Java](https://www.slideshare.net/shuyo/language-detection-library-for-java)). It powers `spacy-langdetect` and is also commonly used in language detection tasks.

In [51]:
from langdetect import detect
from langdetect.detector import LangDetectException

In [45]:
detect("This is an example post")

'en'

In [46]:
def text_is_english_langdetect(text):
    return detect(text) == "en"

In [52]:
def clf_post_langdetect(post: dict) -> dict:
    """Classify if a post is in English using the langdetect library."""
    try:
        label = text_is_english_langdetect(post["text"])
    except LangDetectException:
        # langdetect raises when it cannot detect a language (e.g. empty text);
        # classify those posts as not English by default.
        label = False
    return {
        "id": post["id"],
        "uri": post["uri"],
        "text": post["text"],
        "is_english_label": label,
    }
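Texts with nothing to classify (empty strings, for example) are the kind of input that ends up in the `except` branch above. One option is to screen them out before calling the detector at all. A minimal sketch, assuming a hypothetical helper named `has_letters`:

```python
def has_letters(text: str) -> bool:
    """Return True if the text contains at least one alphabetic character."""
    return any(ch.isalpha() for ch in text)


checks = {
    "empty": has_letters(""),
    "digits_and_punctuation": has_letters("12345 !!!"),
    "spanish_word": has_letters("hola"),
}
```

A post that fails this check could be labeled as not English right away, which skips the detector call and the exception handling for that post.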

In [53]:
langdetect_labels: list[dict] = classify_posts(
    posts=preprocessed_posts, clf_func=clf_post_langdetect,
    batch_size=BATCH_SIZE, rate_limit_per_minute=None
)

Execution time for classify_posts: 0 minutes, 58 seconds
Memory usage for classify_posts: 79.4375 MB


`langdetect` was slow: it took 58 seconds and used ~80 MB of memory to classify 50,000 posts.

### 2. Langid

`langid` is a Python package designed specifically for language detection. According to the [docs](https://github.com/saffsd/langid.py), it's supposed to be fast, minimalistic, pre-trained, and not sensitive to domain-specific features (like markup text).

In [55]:
import langid

In [56]:
def text_is_english_langid(text):
    return langid.classify(text)[0] == "en"

In [59]:
text_is_english_langid("This is a post")

True

In [60]:
def clf_post_langid(post: dict) -> dict:
    """Classify if a post is in English using the langid library."""
    return {
        "id": post["id"],
        "uri": post["uri"],
        "text": post["text"],
        "is_english_label": text_is_english_langid(post["text"]),
    }

In [61]:
langid_labels: list[dict] = classify_posts(
    posts=preprocessed_posts, clf_func=clf_post_langid,
    batch_size=BATCH_SIZE, rate_limit_per_minute=None
)

Execution time for classify_posts: 2 minutes, 36 seconds
Memory usage for classify_posts: 53.046875 MB


`langid` took 2 minutes and 36 seconds (156 seconds) to classify 50,000 posts and used ~53 MB of memory, so it was even slower than `langdetect`.

### 3. Fasttext

`fasttext` is a [package](https://github.com/facebookresearch/fastText) developed at Facebook for fast, scalable word representation learning and text classification. They also publish a model trained specifically for language detection, [fasttext-language-identification](https://huggingface.co/facebook/fasttext-language-identification).

There are two ways to use `fasttext`:

#### 3.1. Hugging Face
We can download the model from the Hugging Face Hub.


In [66]:
import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id="facebook/fasttext-language-identification", filename="model.bin")
model: fasttext.FastText._FastText = fasttext.load_model(model_path)




Let's try out this model

In [67]:
model.predict("This is a text")

(('__label__eng_Latn',), array([1.00001001]))

In [76]:
def text_is_english_hf_fasttext(text):
    return model.predict(text)[0][0] == "__label__eng_Latn"
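As the output above shows, `model.predict` returns a tuple of labels and a matching array of probabilities. The function above only looks at the top label; if we also wanted to require a minimum confidence, a sketch could look like this. The name `is_confident_english` and the `threshold` parameter are hypothetical, and the 0.5 default is an arbitrary illustration rather than a tuned value.

```python
def is_confident_english(prediction, threshold: float = 0.5) -> bool:
    """Check a (labels, probabilities) pair for a confident English label.

    `prediction` is assumed to have the same shape as the predict() output
    shown above, e.g. (("__label__eng_Latn",), [1.0]).
    """
    labels, probabilities = prediction
    if not labels:
        return False
    return labels[0] == "__label__eng_Latn" and float(probabilities[0]) >= threshold


sample_prediction = (("__label__eng_Latn",), [0.98])
confident = is_confident_english(sample_prediction)
too_strict = is_confident_english(sample_prediction, threshold=0.99)
```

In the notebook this would be called as `is_confident_english(model.predict(text))`.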

In [77]:
def clf_post_hf_fasttext(post: dict) -> dict:
    """Classify if a post is in English using the fasttext model from
    Hugging Face Hub."""
    return {
        "id": post["id"],
        "uri": post["uri"],
        "text": post["text"],
        "is_english_label": text_is_english_hf_fasttext(post["text"]),
    }

In [78]:
hf_fasttext_labels: list[dict] = classify_posts(
    posts=preprocessed_posts, clf_func=clf_post_hf_fasttext,
    batch_size=BATCH_SIZE, rate_limit_per_minute=None
)

Execution time for classify_posts: 0 minutes, 4 seconds
Memory usage for classify_posts: 8.4375 MB


4 seconds to classify 50,000 posts. Used ~8 MB. Let's take a look at some of the results.

In [79]:
is_english_list_hf_fasttext = [
    post["is_english_label"] for post in hf_fasttext_labels
]

In [80]:
print(f"Total number of English posts: {sum(is_english_list_hf_fasttext)}")

Total number of English posts: 11583


#### 3.2. Local binary
We can download the model binary [here](https://fasttext.cc/docs/en/language-identification.html) and load it for inference.

In [81]:
# need to download the model; this is >100MB which is OK for local storage
# but too large for GitHub (unless we use Git LFS).
fasttext_model_bin = fasttext.load_model('lid.176.bin')



In [82]:
def text_is_english_local_fasttext(text):
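    # The lid.176 models emit ISO-code labels such as "__label__en",
    # not the "__label__eng_Latn" style used by the Hugging Face model.
    return fasttext_model_bin.predict(text)[0][0] == "__label__en"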

In [83]:
def clf_post_local_fasttext(post: dict) -> dict:
    """Classify if a post is in English using a local binary of the fasttext
    model."""
    return {
        "id": post["id"],
        "uri": post["uri"],
        "text": post["text"],
        "is_english_label": text_is_english_local_fasttext(post["text"]),
    }

In [84]:
local_fasttext_labels: list[dict] = classify_posts(
    posts=preprocessed_posts, clf_func=clf_post_local_fasttext,
    batch_size=BATCH_SIZE, rate_limit_per_minute=None
)

Execution time for classify_posts: 0 minutes, 1 seconds
Memory usage for classify_posts: 4.015625 MB


1 second to classify 50,000 posts. Used ~4 MB. Let's take a look at some of the results.

In [25]:
is_english_list_local_fasttext = [
    post["is_english_label"] for post in local_fasttext_labels
]

In [26]:
print(f"Total number of English posts: {sum(is_english_list_local_fasttext)}")

Total number of English posts: 11583


##### Comparing the two `fasttext` models

Both runs flag the same number of English posts (11,583), so for our purposes the two options look interchangeable: we can pull the model from Hugging Face or store a model binary ourselves. I'd rather store it, especially since it's a pretty small binary (~120 MB), all things considered.
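Matching totals are a good sign, but two classifiers can report the same number of English posts while disagreeing on individual posts. A stronger check is to compare labels post by post. Here is a minimal sketch, assuming a hypothetical helper named `agreement_rate` that takes two lists of dicts with the `id` and `is_english_label` keys used in this notebook:

```python
def agreement_rate(labels_a: list[dict], labels_b: list[dict]) -> float:
    """Fraction of shared post ids on which both classifiers agree."""
    label_by_id_b = {post["id"]: post["is_english_label"] for post in labels_b}
    shared = [post for post in labels_a if post["id"] in label_by_id_b]
    if not shared:
        return 0.0
    matches = sum(
        post["is_english_label"] == label_by_id_b[post["id"]] for post in shared
    )
    return matches / len(shared)


# Toy example: both lists contain exactly one English post, yet never agree.
toy_a = [{"id": 1, "is_english_label": True}, {"id": 2, "is_english_label": False}]
toy_b = [{"id": 1, "is_english_label": False}, {"id": 2, "is_english_label": True}]
toy_rate = agreement_rate(toy_a, toy_b)
```

In the notebook this would be called as `agreement_rate(hf_fasttext_labels, local_fasttext_labels)`; a value of 1.0 would mean the two runs agree on every post.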

We might want to run this multiple times but all else equal, having the binary version-controlled ourselves is better, so we don't have to rely on the network connection to Hugging Face.

One developer's benchmark [tests](https://github.com/zafercavdar/fasttext-langdetect) seem to suggest that `fasttext` will work the fastest, and this does reinforce what I've found so far about `fasttext`. From the [GitHub repo](https://github.com/facebookresearch/fastText), we see that `fasttext` is quite fast and also widely used. Conveniently, it's also optimized to work on CPU, as per the [FAQs](https://fasttext.cc/docs/en/faqs.html).

I expect that the `fasttext` models will be the most performant, and out of those I would rather store the model binary than use the Hugging Face stored version of the model binary.

### 4. spaCy

Can we use `spaCy` for this task? Not really, as it turns out. [Here](https://github.com/explosion/spaCy/issues/11038) is a discussion about it in the GitHub repo. There's a third-party package, `spacy-langdetect`, but that uses `langdetect` under the hood. Examples such as [this](https://towardsdatascience.com/4-python-libraries-to-detect-english-and-non-english-language-c82ad3efd430) are outdated and don't work anymore.

## Conclusion

From these experiments, it seems clear that the fastest way to do language detection is via `fasttext`. Without any parallelization or other speedups, we can classify 50,000 posts in 1 to 4 seconds, compared to 58 seconds for `langdetect` and 156 seconds for `langid`. That is over an order of magnitude faster. It is also more memory efficient: roughly 4 to 8 MB, compared to ~80 MB for `langdetect` and ~53 MB for `langid`.
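To make the comparison concrete, here is the arithmetic on the timings reported above, using the slower of the two `fasttext` runs (4 seconds) as the baseline. The timer only reports whole seconds, so the 1-second figure in particular is coarse and these ratios should be read as rough estimates.

```python
NUM_POSTS = 50_000

# Wall-clock timings (in seconds) copied from the benchmark outputs above.
timings_seconds = {
    "langdetect": 58,
    "langid": 156,
    "fasttext (Hugging Face)": 4,
    "fasttext (local binary)": 1,
}

baseline = timings_seconds["fasttext (Hugging Face)"]

# How many times slower each option is than the 4-second fasttext run.
slowdown_vs_fasttext = {
    name: seconds / baseline for name, seconds in timings_seconds.items()
}

# Posts classified per second for each option.
throughput = {name: NUM_POSTS / seconds for name, seconds in timings_seconds.items()}

for name in timings_seconds:
    print(f"{name}: {slowdown_vs_fasttext[name]:.1f}x, {throughput[name]:,.0f} posts/s")
```

Even against the slower `fasttext` run, `langdetect` takes 14.5 times as long and `langid` takes 39 times as long.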