# Language detection

We keep language filtering in its own notebook because the dataset is large and multilingual. fastText's pretrained language identification model (`lid.176.ftz`) predicts the language of each tweet, and the file is processed in chunks so it never has to fit in memory at once. We keep only tweets predicted as English so that the later topic modelling and sentiment analysis run on a single language.


Install dependencies with `uv` (see README), then restart the kernel.


In [21]:
import pandas as pd
import re
import urllib.request
import fasttext
import os

In [22]:
input_file  = "data/hashtag_donaldtrump.csv"
output_file = "data/hashtag_donaldtrump_en.csv"

model_file = "lid.176.ftz"

In [23]:
if not os.path.exists(model_file):
    url = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz"
    print("Downloading model...")
    urllib.request.urlretrieve(url, model_file)
    print("Done:", model_file)
else:
    print("Model already exists:", model_file)

Model already exists: lid.176.ftz


In [24]:
model = fasttext.load_model(model_file)

def clean_for_lang(text):
    text = str(text)
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)             # remove @mentions
    text = re.sub(r"\s+", " ", text).strip()      # normalize spaces
    return text
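
To make the effect of the cleaning step concrete, here is a small standalone check. It repeats the `clean_for_lang` function from the cell above and runs it on a made-up tweet (the sample text and the `example.com` URL are illustrative, not taken from the dataset). Note that hashtags are left in place, and that newlines are collapsed into spaces, which matters because fastText's `predict` expects single-line input.

```python
import re

def clean_for_lang(text):
    """Strip URLs and @mentions, then collapse all whitespace to single spaces."""
    text = str(text)
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)              # remove @mentions
    text = re.sub(r"\s+", " ", text).strip()       # normalize spaces and newlines
    return text

# Made-up sample: a mention, a line break, a URL, a hashtag and extra spaces.
sample = "@user1 Check this out\nhttps://example.com/page #Trump   wins?"
cleaned = clean_for_lang(sample)
print(cleaned)  # Check this out #Trump wins?
```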

In [25]:
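# NOTE: the cell that defined these names is not shown in this notebook.
# The counters and the `first` flag follow from how the loop uses them;
# the two values marked "assumed" are placeholders, not the original settings.
chunksize  = 50_000   # assumed: rows per chunk (the log below advances by about 50k)
confidence = 0.5      # assumed: minimum probability to keep a tweet as English
total = 0             # rows read so far
kept  = 0             # rows kept as English so far
first = True          # first chunk writes the header; later chunks append

for chunk in pd.read_csv(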
    input_file,
    chunksize=chunksize,
    engine="python",
    on_bad_lines="skip"
):
    total += len(chunk)

    # make sure tweet column exists and is not empty
    if "tweet" not in chunk.columns:
        raise ValueError("No 'tweet' column found in this file.")

    chunk = chunk.dropna(subset=["tweet"])

    texts = chunk["tweet"].apply(clean_for_lang).tolist()
    labels, probs = model.predict(texts, k=1)

    chunk["lang_pred"] = [x[0].replace("__label__", "") for x in labels]
    chunk["lang_prob"] = [x[0] for x in probs]

    chunk_en = chunk[(chunk["lang_pred"] == "en") & (chunk["lang_prob"] >= confidence)]
    kept += len(chunk_en)

    chunk_en.to_csv(output_file, mode="w" if first else "a", index=False, header=first)
    first = False

    print("Processed:", total, "| Kept English:", kept)

print("\nDONE. Saved English tweets to:", output_file)

Processed: 49994 | Kept English: 34670
Processed: 99987 | Kept English: 69279
Processed: 149980 | Kept English: 103154
Processed: 199979 | Kept English: 137062
Processed: 249973 | Kept English: 170684
Processed: 299963 | Kept English: 203599
Processed: 349957 | Kept English: 237743
Processed: 399952 | Kept English: 270128
Processed: 449949 | Kept English: 300111
Processed: 499945 | Kept English: 326793
Processed: 549945 | Kept English: 346952
Processed: 599944 | Kept English: 364312
Processed: 649943 | Kept English: 385126
Processed: 699940 | Kept English: 408850
Processed: 749940 | Kept English: 436230
Processed: 799937 | Kept English: 459331
Processed: 849935 | Kept English: 485285
Processed: 899935 | Kept English: 506889
Processed: 949930 | Kept English: 531346
Processed: 971087 | Kept English: 542322

DONE. Saved English tweets to: data/hashtag_donaldtrump_en.csv
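
The filtering rule inside the loop can be isolated as a small pure-Python sketch. It assumes, as the indexing in the loop above suggests, that `model.predict(texts, k=1)` returns one list of labels and one sequence of probabilities per input text. The helper name `keep_english`, the hand-written predictions and the `0.5` threshold are all hypothetical and only illustrate the logic; they are not model output.

```python
def keep_english(labels, probs, threshold):
    """Return one keep/drop flag per text from fastText-style top-1 predictions."""
    flags = []
    for label_row, prob_row in zip(labels, probs):
        lang = label_row[0].replace("__label__", "")  # "__label__en" -> "en"
        flags.append(lang == "en" and prob_row[0] >= threshold)
    return flags

# Hand-written stand-ins for the (labels, probs) pair, not real predictions.
labels = [["__label__en"], ["__label__es"], ["__label__en"]]
probs = [[0.98], [0.91], [0.42]]

flags = keep_english(labels, probs, threshold=0.5)
print(flags)  # [True, False, False]
```

The third text is predicted as English but falls below the threshold, so it is dropped. Raising the threshold trades recall for precision: fewer non-English tweets slip through, but short or noisy English tweets are more likely to be discarded.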


In [26]:
df_en = pd.read_csv(output_file, nrows=5, low_memory=False)
df_en[["created_at", "tweet_id", "tweet", "lang_pred", "lang_prob"]]

| | created_at | tweet_id | tweet | lang_pred | lang_prob |
|---|---|---|---|---|---|
| 0 | 2020-10-15 00:00:02 | 1.316529e+18 | #Trump: As a student I used to hear for years,... | en | 0.993831 |
| 1 | 2020-10-15 00:00:02 | 1.316529e+18 | 2 hours since last tweet from #Trump! Maybe he... | en | 0.975893 |
| 2 | 2020-10-15 00:00:17 | 1.316529e+18 | @CLady62 Her 15 minutes were over long time ag... | en | 0.981575 |
| 3 | 2020-10-15 00:00:17 | 1.316529e+18 | @richardmarx Glad u got out of the house! DICK... | en | 0.924819 |
| 4 | 2020-10-15 00:00:18 | 1.316529e+18 | @DeeviousDenise @realDonaldTrump @nypost There... | en | 0.980735 |
