# Language 
We use `langdetect` for language detection. It is not the most accurate detector, but it is lightweight and simple.

In [1]:
from collections import Counter
from datetime import date
import polars as pl
import sys
import os

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))
import config
from paths import Paths
from src.lang_detect import detect_parallel

# Change this to the date of a dataset that you have
date_obj = date(2025, 9, 10)
channel_paths = Paths(channel_handle=config.channel_handle, date_obj=date_obj)

## Adding the lang column to the clean dataset
We work with the clean dataset because language detection needs the original comment text. The process runs per dataset: if the clean file is from the first of the month, only that file is updated, so you have to run this notebook once for each daily dataset that you have. It is done this way because detection is relatively slow, and processing all datasets in a single run becomes inefficient.

In [2]:
comments = pl.read_parquet(channel_paths.clean_comments_file_path, columns=["comment"])["comment"].to_list()
print(f"{len(comments)} comments loaded.")

300508 comments loaded.


This task runs in parallel, making use of the multiple cores in your CPU.

Take a look at `03_language_detection_notes.ipynb` for some testing and an evaluation of how many worker processes is the sweet spot for your system.

This is quite a lengthy process, so feel free to go for a coffee ☕ and grab some cookies 🍪.
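
If you have not yet measured the sweet spot for your machine, one common rule of thumb is to leave one core free for the rest of the system. The sketch below is only that heuristic, not the evaluation from the notes notebook; `suggested_workers` and its `reserve` parameter are hypothetical names introduced here for illustration.

```python
import os


def suggested_workers(reserve: int = 1) -> int:
    """Heuristic worker count: all logical cores minus `reserve`, at least 1."""
    # os.cpu_count() can return None when the count cannot be determined.
    total = os.cpu_count() or 1
    return max(1, total - reserve)


# Prints a starting value that could be passed as max_workers.
print(suggested_workers())
```

Treat the result as a starting value for `max_workers` and adjust it based on your own timings.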

In [36]:
langs = detect_parallel(comments, max_workers=7)

  0%|          | 0/300508 [00:00<?, ?it/s]

Finished translation in 314.50s, 955.51 c/s
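
The reported rate is the number of comments divided by the elapsed time, so it can also give a rough idea of how long another dataset might take. This is a minimal sketch, assuming throughput stays roughly constant between datasets (it may not, since comment length varies); `estimate_seconds` is a hypothetical helper, not part of the project.

```python
def estimate_seconds(n_comments: int, comments_per_second: float) -> float:
    """Estimated wall-clock time, assuming a constant processing rate."""
    if comments_per_second <= 0:
        raise ValueError("comments_per_second must be positive")
    return n_comments / comments_per_second


# Figures taken from the run above: 300508 comments in 314.50 s.
rate = 300508 / 314.50
print(f"{rate:.2f} c/s")

# Estimate for a hypothetical dataset of 100000 comments.
print(f"{estimate_seconds(100_000, rate):.1f} s")
```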


Some quick inspection:

In [37]:
Counter(langs).most_common(10)

[('en', 242130),
 ('und', 35711),
 ('de', 2577),
 ('so', 1759),
 ('af', 1282),
 ('fr', 1193),
 ('tl', 1193),
 ('es', 1157),
 ('it', 1138),
 ('id', 1105)]
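
Raw counts are easier to compare as shares of the total. In the run above, `en` accounts for roughly 80.6% of the comments and `und` (presumably "undetermined") for about 11.9%. The sketch below shows one way to compute such shares; `language_shares` is a hypothetical helper and the sample list is made up purely to show the output shape.

```python
from collections import Counter


def language_shares(langs, top=10):
    """Return (language, count, percent) tuples for the most common labels."""
    counts = Counter(langs)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(lang, n, 100 * n / total) for lang, n in counts.most_common(top)]


# Tiny made-up sample, only to illustrate the output format.
sample = ["en", "en", "en", "und", "de"]
for lang, n, pct in language_shares(sample):
    print(f"{lang:>4} {n:>7} {pct:6.2f}%")
```

In the notebook, the same function could be called on `langs` directly.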

We append the results to our clean comments.

In [None]:
# Read the parquet file
df = pl.read_parquet(channel_paths.clean_comments_file_path)

# Add the new 'lang' column ('langs' must be a list or other iterable, one entry per row)
df = df.with_columns(pl.Series("lang", langs))
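
Because the comments were loaded in an earlier cell and the file is read again here, it can be worth confirming that there is exactly one label per row before overwriting the parquet file. The sketch below is an optional guard, not part of the original notebook; `check_aligned` is a hypothetical helper. It only compares lengths, so it cannot detect a change in row order.

```python
def check_aligned(n_rows: int, labels: list) -> None:
    """Raise if there is not exactly one label per DataFrame row."""
    if len(labels) != n_rows:
        raise ValueError(
            f"expected {n_rows} labels, got {len(labels)}; "
            "re-run detection on the same file before writing"
        )


# In the notebook this would be called as: check_aligned(df.height, langs)
check_aligned(3, ["en", "und", "de"])  # passes silently
```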

In [None]:
# Write back to the parquet file (overwrite)
df.write_parquet(channel_paths.clean_comments_file_path)

# Q & A

- **The process was very slow. Are there any other language detectors available?**

A: Multiple language detectors were tried.
1) Langid has not been updated in quite some time, and on Linux it behaved unexpectedly: it would spawn multiple threads just by importing the library.
2) Fasttext has a complex installation on Windows.
3) Lingua was extremely fast, but the way it handles confidence values across multiple languages was found unsatisfactory.
4) pycld3, like Fasttext, has a complex installation on Windows.

langdetect had some issues with accuracy and speed, but was finally chosen for its simplicity and light weight.