# Preprocessing for word binding

This notebook:

1. **Loads** the Billboard Hot 100 dataset (`dataset.xlsx`).
2. **Deduplicates** it to unique songs.
3. **Detects languages**, first on the full lyrics and then on a unique-word snippet of each song.
4. **Translates** non-English lyrics to English, with retries, rate limiting, and skipping of rows that are already translated.
5. **Saves** the fully translated corpus to `data/processed/hot100_translated.xlsx`.

In [1]:
# ──────────────────────────────────────────────
# 1  |  Imports & global config
# ──────────────────────────────────────────────
import re
from pathlib import Path
from typing import List, Dict

import pandas as pd
from tqdm.auto import tqdm
from langdetect import detect, DetectorFactory
from deep_translator import GoogleTranslator     # or use DeepL

import time
import logging

import spacy
nlp_en = spacy.load("en_core_web_sm", disable=["parser", "ner"])

tqdm.pandas()
DetectorFactory.seed = 42                        # reproducible language picks

In [2]:
# ──────────────────────────────────────────────
# 2  |  Load dataset
# ──────────────────────────────────────────────
DATA_PATH = Path("../data/processed/dataset.xlsx")

df_raw = pd.read_excel(DATA_PATH)
df_raw

Unnamed: 0,date,rank,title,artist,image,peakPos,lastpos,weeks,isNew,song_id,lyrics
0,2024-12-28,1,All I Want For Christmas Is You,Mariah Carey,https://charts-static.billboard.com/img/1994/1...,1,1,70,False,all_i_want_for_christmas_is_you__mariah_carey,i don't want a lot for christmas there is just...
1,2024-12-28,2,Rockin' Around The Christmas Tree,Brenda Lee,https://charts-static.billboard.com/img/1960/1...,1,2,63,False,rockin_around_the_christmas_tree__brenda_lee,rockin' around the christmas tree at the chris...
2,2024-12-28,3,Last Christmas,Wham!,https://charts-static.billboard.com/img/1998/0...,3,4,44,False,last_christmas__wham,"ah, ah-ah ooh-woah oh-oh last christmas, i gav..."
3,2024-12-28,4,Jingle Bell Rock,Bobby Helms,https://charts-static.billboard.com/img/1958/1...,3,3,60,False,jingle_bell_rock__bobby_helms,"jingle bell, jingle bell, jingle bell rock jin..."
4,2024-12-28,5,A Holly Jolly Christmas,Burl Ives,https://charts-static.billboard.com/img/1998/0...,4,5,44,False,a_holly_jolly_christmas__burl_ives,ding-dong-ding ding-dong-ding have a holly jol...
...,...,...,...,...,...,...,...,...,...,...,...
5195,2024-01-06,96,El Amor de Su Vida,Grupo Frontera & Grupo Firme,https://charts-static.billboard.com/img/2023/0...,68,0,16,False,el_amor_de_su_vida__grupo_frontera__grupo_firme,si estoy tomando es porque estoy echando alcoh...
5196,2024-01-06,97,Standing Next To You,Jung Kook,https://charts-static.billboard.com/img/2023/1...,5,79,8,False,standing_next_to_you__jung_kook,standing next to you play me slow push up on t...
5197,2024-01-06,98,Man Made A Bar,Morgan Wallen Featuring Eric Church,https://charts-static.billboard.com/img/2023/0...,15,0,14,False,man_made_a_bar__morgan_wallen_featuring_eric_c...,"i sat down on a barstool, like a dern fool 'ca..."
5198,2024-01-06,99,Que Onda,Calle 24 x Chino Pacas x Fuerza Regida,https://charts-static.billboard.com/img/2023/0...,61,98,13,False,que_onda__calle_24_x_chino_pacas_x_fuerza_regida,"baby, me vuelves loco no se esperó al hotel y ..."


In [3]:
# ──────────────────────────────────────────────
# 3  |  Drop duplicate songs
# ──────────────────────────────────────────────

df = df_raw.copy()

df = (
    df.drop_duplicates(subset="song_id", keep="first")
      .reset_index(drop=True)
)
df = df[df["lyrics"].notna()]

df

Unnamed: 0,date,rank,title,artist,image,peakPos,lastpos,weeks,isNew,song_id,lyrics
0,2024-12-28,1,All I Want For Christmas Is You,Mariah Carey,https://charts-static.billboard.com/img/1994/1...,1,1,70,False,all_i_want_for_christmas_is_you__mariah_carey,i don't want a lot for christmas there is just...
1,2024-12-28,2,Rockin' Around The Christmas Tree,Brenda Lee,https://charts-static.billboard.com/img/1960/1...,1,2,63,False,rockin_around_the_christmas_tree__brenda_lee,rockin' around the christmas tree at the chris...
2,2024-12-28,3,Last Christmas,Wham!,https://charts-static.billboard.com/img/1998/0...,3,4,44,False,last_christmas__wham,"ah, ah-ah ooh-woah oh-oh last christmas, i gav..."
3,2024-12-28,4,Jingle Bell Rock,Bobby Helms,https://charts-static.billboard.com/img/1958/1...,3,3,60,False,jingle_bell_rock__bobby_helms,"jingle bell, jingle bell, jingle bell rock jin..."
4,2024-12-28,5,A Holly Jolly Christmas,Burl Ives,https://charts-static.billboard.com/img/1998/0...,4,5,44,False,a_holly_jolly_christmas__burl_ives,ding-dong-ding ding-dong-ding have a holly jol...
...,...,...,...,...,...,...,...,...,...,...,...
756,2024-01-06,43,I Saw Mommy Kissing Santa Claus,Jackson 5,https://charts-static.billboard.com/img/1969/1...,43,0,1,True,i_saw_mommy_kissing_santa_claus__jackson_5,wow! mommy's kissing santa claus! i saw mommy ...
757,2024-01-06,46,Merry Christmas,Ed Sheeran & Elton John,https://charts-static.billboard.com/img/2021/1...,42,0,7,False,merry_christmas__ed_sheeran__elton_john,build the fire and gather 'round the tree fill...
758,2024-01-06,50,(There's No Place Like) Home For The Holidays ...,Perry Como With Mitchell Ayers And His Orchestra,https://charts-static.billboard.com/img/2005/1...,50,0,1,True,theres_no_place_like_home_for_the_holidays_195...,"oh, there's no place like home for the holiday..."
759,2024-01-06,87,Winter Wonderland,Chloe,https://charts-static.billboard.com/img/2023/1...,87,96,2,False,winter_wonderland__chloe,"walk walk, walk, walking walk, walk, walking w..."


In [4]:
# ──────────────────────────────────────────────
# 4  |  Language detection
# ──────────────────────────────────────────────
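def guess_lang(text: str) -> str:
    """Detect the language of `text`; return "un" (unknown) on failure."""
    try:
        return detect(text)
    except Exception:   # e.g. langdetect raises when the text has no usable features
        return "un"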

df["orig_lang"] = df["lyrics"].progress_apply(guess_lang)
df["orig_lang"].value_counts()

orig_lang
en    644
es     56
ru     19
pt     15
tr      5
de      5
pl      3
th      2
nl      2
uk      1
ja      1
ar      1
fr      1
sv      1
hu      1
un      1
Name: count, dtype: int64

As the counts show, many Spanish songs will need translation. The remaining songs are tagged with a variety of other languages, which looks suspicious, so let's inspect them.

In [5]:
weird_lang = df[~df["orig_lang"].isin(["en", "es"])]
weird_lang[["title", "artist", "lyrics", "orig_lang"]]

Unnamed: 0,title,artist,lyrics,orig_lang
10,Luther,Kendrick Lamar & SZA,"se esse mundo fosse meu hey, número romano set...",pt
11,TV Off,Kendrick Lamar Featuring Lefty Gunplay,i̇stediğim tek şeydi bir siyah grand national ...,tr
31,Timeless,The Weeknd & Playboi Carti,"güneş parlıyor, sona erdi yarın oldu kendini ı...",tr
58,Defying Gravity,Cynthia Erivo Featuring Ariana Grande,เอลฟาบา ช่วยใจเย็นมีสติสักครั้งได้ไหม? อย่าเพิ...,th
110,AGATS2 (Insecure),Juice WRLD & Nicki Minaj,"aw merda, oh oh aw merda, oh eu admito, outra ...",pt
113,No One Mourns The Wicked,"Ariana Grande Featuring Andy Nyman, Courtney-M...",ข่าวดี! นางแม่มดตายแล้ว! ออกมา ออกมา นางไม่อยู...,th
201,Like That,"Future, Metro Boomin & Kendrick Lamar","зажгу косяк в этой суке метро, метро, метро, т...",ru
226,Lost In Love,Rod Wave & Be Charlotte,"largado pra morrer, largado pra morte, continu...",pt
240,We Pray,"Coldplay Featuring Little Simz, Burna Boy, Ely...","вау ох тож ми молимося ух я молюся, аби не зда...",uk
276,BAND4BAND,Central Cee & Lil Baby,"я не в настроении, потому что мой рейс задержа...",ru


All of these are English-language songs, which means the lyrics we received from Genius were translations rather than the originals. We will need to either translate them back to English or retrieve the original lyrics another way.
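
To shortlist such songs for re-fetching, one cheap signal is the writing system alone: an English-language song should not have lyrics that are mostly Cyrillic or Thai. The sketch below is not part of the pipeline in this notebook. `dominant_script` and `looks_like_non_latin_translation` are hypothetical helpers, and the approach only catches non-Latin translations, so the Portuguese and Turkish pages above would still need the language detector.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return the most common script among the letters of `text`.

    The script is approximated by the first word of each character's
    Unicode name (e.g. "LATIN", "CYRILLIC", "THAI"). Returns "NONE"
    when the text contains no letters.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    if not counts:
        return "NONE"
    return counts.most_common(1)[0][0]


def looks_like_non_latin_translation(lyrics: str) -> bool:
    """Flag lyrics whose letters are mostly outside the Latin script."""
    return dominant_script(lyrics) not in ("LATIN", "NONE")


# Short excerpts taken from the table above.
samples = {
    "Like That": "зажгу косяк в этой суке метро, метро, метро",
    "Luther": "se esse mundo fosse meu hey, número romano",
}
for title, lyrics in samples.items():
    print(title, dominant_script(lyrics), looks_like_non_latin_translation(lyrics))
```

A check like this is deterministic and needs no model, which makes it a reasonable first filter before deciding which songs to fetch again.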

In [6]:
song = df[df["orig_lang"] == "es"].iloc[0]

song["lyrics"]

"deck the halls with boughs of holly fa-la-la-la-la, la-la-la-la 'tis the season to be jolly fa-la-la-la-la, la-la-la-la don we now our gay apparel fa-la-la, la-la-la, la-la-la troll the ancient yuletide carol fa-la-la-la-la, la-la-la-la fa-la-la, la-la-la, la-la-la fa-la-la, la-la-la, la-la-la fa-la-la, la-la-la, la-la-la fa-la-la, la-la-la, la-la-la see the blazing yule before us fa-la-la-la-la, la-la-la-la strike the harp and join the chorus fa-la-la-la-la, la-la-la-la follow me in merry measure fa-la-la-la-la, la-la-la-la while i tell of yuletide treasure fa-la-la-la-la, la-la-la-la fa-la-la-la-la, la-la-la-la"

This is also an example of wrong language detection: the song above is a Christmas song sung in English, but it was detected as Spanish. The likely cause is the large share of repeated filler syllables ("fa-la-la").

To counter that, we will detect the language from each song's unique words rather than from the text as a whole.
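
To see why deduplication might help, it is enough to measure how much of the text the repeated filler occupies before and after dropping repeats. The sketch below is illustrative only: `filler_share` and `unique_in_order` are hypothetical helpers, and treating any token that contains `la-la` as filler is an assumption made for this one song.

```python
from typing import List


def filler_share(tokens: List[str]) -> float:
    """Fraction of tokens that contain the filler syllables 'la-la'."""
    if not tokens:
        return 0.0
    return sum("la-la" in t for t in tokens) / len(tokens)


def unique_in_order(tokens: List[str]) -> List[str]:
    """Drop repeated tokens, keeping the first occurrence of each."""
    return list(dict.fromkeys(tokens))


# The lyrics shown above, split across lines for readability.
lyrics = (
    "deck the halls with boughs of holly fa-la-la-la-la, la-la-la-la "
    "'tis the season to be jolly fa-la-la-la-la, la-la-la-la "
    "don we now our gay apparel fa-la-la, la-la-la, la-la-la "
    "troll the ancient yuletide carol fa-la-la-la-la, la-la-la-la "
    "fa-la-la, la-la-la, la-la-la fa-la-la, la-la-la, la-la-la "
    "fa-la-la, la-la-la, la-la-la fa-la-la, la-la-la, la-la-la "
    "see the blazing yule before us fa-la-la-la-la, la-la-la-la "
    "strike the harp and join the chorus fa-la-la-la-la, la-la-la-la "
    "follow me in merry measure fa-la-la-la-la, la-la-la-la "
    "while i tell of yuletide treasure fa-la-la-la-la, la-la-la-la "
    "fa-la-la-la-la, la-la-la-la"
)
tokens = lyrics.split()
unique = unique_in_order(tokens)

print(f"all tokens:    {len(tokens)} (filler share {filler_share(tokens):.0%})")
print(f"unique tokens: {len(unique)} (filler share {filler_share(unique):.0%})")
```

A lower filler share gives the detector more real words to work with, but it does not guarantee a correct label, so the counts should still be re-checked after the change.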

In [16]:
en_stop = spacy.lang.en.stop_words.STOP_WORDS
PUNCT_RE = re.compile(r"[^a-z\s]")

def unique_word_snippet(text: str, max_tokens: int = 250) -> str:
    """
    Return a whitespace-separated string containing **unique** words
    in first-appearance order, up to `max_tokens` tokens.
    """
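    # 1. tokenize on whitespace only; the PUNCT_RE clean-up below is disabled
    #    because it would also strip accented and non-Latin letters, so
    #    "jolly," and "jolly" currently count as different words
    # txt = PUNCT_RE.sub(" ", text.lower())
    words = text.split()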
    
    # 2. deduplicate but keep order
    seen, uniq = set(), []
    for w in words:
        if w not in seen:
            uniq.append(w)
            seen.add(w)
        if len(uniq) >= max_tokens:
            break
    return " ".join(uniq)

In [10]:
def detect_lang_unique(text: str) -> str:
    snippet = unique_word_snippet(text)
    if not snippet.strip():          # empty after cleaning
        return "un"
    
    try:
        lang = detect(snippet)
    except Exception:
        return "un"
    
    return lang

In [17]:
df["orig_lang"] = df["lyrics"].progress_apply(detect_lang_unique)

df["orig_lang"].value_counts()

orig_lang
en    644
es     55
ru     19
pt     15
tr      5
de      5
pl      3
th      2
nl      2
ar      1
hu      1
sv      1
id      1
ko      1
ja      1
uk      1
un      1
Name: count, dtype: int64

In [42]:
# ──────────────────────────────────────────────
# 5  |  Translate non-English to English
# ──────────────────────────────────────────────
translator = GoogleTranslator(source="auto", target="en")

# Google unofficial endpoint: max 5 req / s & 5 000 chars / call.
# We'll translate in ≤4 000-char chunks to be safe.
MAX_CHARS = 4000
SLEEP_BETWEEN_REQ = 0.25        # 4 requests/sec  ➜ stay under 5 r/s

def safe_translate(text: str, src_lang: str) -> str:
    """
    Translate `text` to English unless it's already English.
    Handles 429 & 5 000-char limits gracefully.
    """
    if src_lang == "en" or not text.strip():
        return text

    if src_lang == "un":
        return ""

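    # Split long lyrics into <= MAX_CHARS pieces at line breaks.
    # Note: a single line longer than MAX_CHARS is still sent as one piece.
    parts, buf = [], ""
    for line in text.splitlines():
        if buf and len(buf) + len(line) + 1 > MAX_CHARS:
            parts.append(buf)
            buf = line + "\n"
        else:
            buf += line + "\n"
    if buf:                             # never emit an empty chunk
        parts.append(buf)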

    translated_parts = []
    for p in parts:
        retry, ok = 0, False
        while not ok and retry < 5:
            try:
                translated_parts.append(translator.translate(p))
                ok = True
            except Exception as e:
                # Handle Google 429 or random failures
                wait = 2 ** retry        # exponential back-off: 1,2,4,8,16 s
                logging.warning(f"MT error ({e}); retrying in {wait}s")
                time.sleep(wait)
                retry += 1
        if not ok:                      # give up – keep original chunk
            translated_parts.append(p)
        time.sleep(SLEEP_BETWEEN_REQ)   # rate-limit

    return "\n".join(translated_parts)
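
The chunking in `safe_translate` splits only at line breaks. The lyrics displayed in this notebook look like single-line strings, and if that holds for the whole dataset, `splitlines()` yields one piece that can exceed `MAX_CHARS`. A minimal sketch of a fallback that splits at whitespace instead is shown here. `chunk_by_words` is a hypothetical helper and is not used by the cells in this notebook.

```python
from typing import List


def chunk_by_words(text: str, max_chars: int = 4000) -> List[str]:
    """Split `text` into pieces of at most `max_chars` characters,
    breaking only at whitespace. A single word longer than `max_chars`
    is hard-split so that the limit always holds."""
    if max_chars < 1:
        raise ValueError("max_chars must be positive")
    chunks: List[str] = []
    buf = ""
    for word in text.split():
        # Hard-split pathological words that cannot fit in one chunk.
        while len(word) > max_chars:
            if buf:
                chunks.append(buf)
                buf = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{buf} {word}" if buf else word
        if len(candidate) > max_chars:
            chunks.append(buf)      # buf is non-empty here
            buf = word
        else:
            buf = candidate
    if buf:
        chunks.append(buf)
    return chunks


pieces = chunk_by_words("la " * 3000, max_chars=4000)
print(len(pieces), max(len(p) for p in pieces))
```

Splitting at arbitrary word boundaries can cut a sentence in two, which may hurt translation quality near the seams, so this is a trade-off against exceeding the length limit.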

Let's test the translation on a single song first.

In [23]:
song = df[df["orig_lang"] == "es"].iloc[0]

song["lyrics"]

"siento un vacío muy frío por dentro, mi amor cuando te fuiste, te robaste mi corazón me quedé loco de tanto pensar y pensar me iré al infierno, pero me tengo que vengar quiero manchar el vestido blanco de rojo sé que el altar dе dios es santo, pero, mi amor cien invitados y todos tеndrán que mirar que nuestro amor va al más allá quiero que bailemos juntos en el cielo, el infierno, pero sin ese puto el que te ha apartado todo este tiempo de mí al inconsciente que hizo que no fueras pa' mí danza, una danza eterna pero me voy contigo, pero nos vamos juntos llevo tres noches malditas sin poder dormir solo pensando y pensando que eres para mí ¡uh! me despido de ti para siempre, chiquitita ahí le va para todos los que traen el corazón roto uh y su compa óscar maydon fuerza regida camino lento, me duele la respiración a mil por hora latidos de mi corazón en el pasillo me vienen recuerdos, flashbacks prometimos que los dos iríamos al altar te quitó el velo, aceptaste y luego te besó ya has co

In [24]:
trans = safe_translate(song["lyrics"], song["orig_lang"])

trans

'I feel a very cold emptiness inside, my love when you left, you stole my heart I was crazy about thinking and thinking I will go to hell, but I have to avenge I want to stain the white red dress I know that the altar of God is holy, but, my love a hundred guests and all will have to look that our love goes to the beyond I want us to dance together in the sky, hell That you were not for my dance, an eternal dance but I leave with you, but we go together I have three cursed nights without being able to sleep just thinking and thinking that you are for me Uh! I say goodbye to you forever, little girl there goes for everyone who brings the broken heart and his company Óscar Maydon Fortid Intense, my being took all my being, there is no reason to be standing in the sky, hell, but without that fucking to which all this time from me has separated to the unconscious that did not force me dancing, an eternal dance but I go with you, but we go together I have three damn nights without being abl

**Now let's translate all the songs**

In [43]:
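# Create the target column on the first run; later runs skip translated rows.
if "lyrics_en" not in df.columns:
    df["lyrics_en"] = pd.NA

for idx, row in df.iterrows():
    if pd.isna(row["lyrics_en"]):
        df.at[idx, "lyrics_en"] = safe_translate(row["lyrics"], row["orig_lang"])
        if row["orig_lang"] != "en":
            print(f"translated song {row['title']} from {row['orig_lang']}")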

translated song King from un
translated song Que Onda from es
translated song Y Lloro from es
translated song First Love from es
translated song El Amor de Su Vida from es
translated song Lace It from ru
translated song Mi Ex Tenia Razon from es
translated song Rompe La Dompe from es
translated song Sabor Fresa from es
translated song Qlona from es
translated song Needle from pt
translated song Amargura from es
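
The `pd.isna(row["lyrics_en"])` check lets the loop skip songs that are already translated in the current DataFrame, but nothing survives a kernel restart. One possible way to persist results is a small JSON cache keyed by `song_id`. This is a sketch under assumptions: the cache file name, `load_cache`, `save_cache`, and `translate_cached` are hypothetical, and `translate_fn` stands in for a call such as `safe_translate`.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict


def load_cache(path: Path) -> Dict[str, str]:
    """Read the cache file, or start empty if it does not exist yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def save_cache(path: Path, cache: Dict[str, str]) -> None:
    """Write the whole cache back to disk as UTF-8 JSON."""
    path.write_text(json.dumps(cache, ensure_ascii=False), encoding="utf-8")


def translate_cached(song_id: str, text: str,
                     translate_fn: Callable[[str], str],
                     cache: Dict[str, str]) -> str:
    """Return the cached translation, calling `translate_fn` only on a miss."""
    if song_id not in cache:
        cache[song_id] = translate_fn(text)
    return cache[song_id]


# Demo with a stand-in translator that records how often it is called.
calls = []

def fake_translate(text: str) -> str:
    calls.append(text)
    return text.upper()

song_id = "que_onda__calle_24_x_chino_pacas_x_fuerza_regida"
cache_path = Path(tempfile.mkdtemp()) / "translation_cache.json"

cache = load_cache(cache_path)
translate_cached(song_id, "baby, me vuelves loco", fake_translate, cache)
save_cache(cache_path, cache)

# A fresh load sees the stored entry, so the translator is not called again.
cache = load_cache(cache_path)
translate_cached(song_id, "baby, me vuelves loco", fake_translate, cache)
print(len(calls))
```

Keying by `song_id` rather than by row index keeps the cache valid even if the DataFrame is re-sorted or re-deduplicated between runs.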


In [44]:
translated_songs = df[df["orig_lang"] != "en"]

translated_songs[["artist", "title", "orig_lang", "lyrics", "lyrics_en"]]

Unnamed: 0,artist,title,orig_lang,lyrics,lyrics_en
10,Kendrick Lamar & SZA,Luther,pt,"se esse mundo fosse meu hey, número romano set...","If this world were my hey, Roman number seven,..."
11,Kendrick Lamar Featuring Lefty Gunplay,TV Off,tr,i̇stediğim tek şeydi bir siyah grand national ...,The only thing I said was a black Grand Nation...
31,The Weeknd & Playboi Carti,Timeless,tr,"güneş parlıyor, sona erdi yarın oldu kendini ı...","The sun shines, ended tomorrow in the light, i..."
58,Cynthia Erivo Featuring Ariana Grande,Defying Gravity,th,เอลฟาบา ช่วยใจเย็นมีสติสักครั้งได้ไหม? อย่าเพิ...,เอลฟาบา ช่วยใจเย็นมีสติสักครั้งได้ไหม? อย่าเพิ...
60,Oscar Maydon & Fuerza Regida,Tu Boda,es,"siento un vacío muy frío por dentro, mi amor c...","I feel a very cold emptiness inside, my love w..."
...,...,...,...,...,...
740,"Peso Pluma, Junior H & Oscar Maydon",Rompe La Dompe,es,"rompe la dom pe la paca nunca se acaba, farand...","Break the dom pe the paca never ends, pharandu..."
743,Fuerza Regida,Sabor Fresa,es,"mesero, traiga champagne que quiero complacer ...","Waiter, brought champagne that I want to pleas..."
745,Karol G & Peso Pluma,Qlona,es,"ayer te vi solita, esa carita bonita diablo, q...","Yesterday I saw you alone, that little little ..."
746,Nicki Minaj Featuring Drake,Needle,pt,"quero dizer, não sei se gosto de garotas, vi s...","I mean, I don't know if I like girls, I saw Sh..."
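
Because `safe_translate` keeps the original chunk when every retry fails, a failed translation is silent: in the table above, the `lyrics_en` preview for Defying Gravity is still in Thai. A quick check before exporting is to list non-English songs whose translation is empty or equal to the source text. The sketch uses plain dictionaries so that it runs on its own. `find_untranslated` is a hypothetical helper, and the two records are shortened from the previews above.

```python
from typing import Dict, List


def find_untranslated(rows: List[Dict[str, str]]) -> List[str]:
    """Titles of non-English songs whose `lyrics_en` is empty or
    identical to the source lyrics (ignoring surrounding whitespace)."""
    flagged = []
    for row in rows:
        if row["orig_lang"] == "en":
            continue
        translated = (row.get("lyrics_en") or "").strip()
        if not translated or translated == row["lyrics"].strip():
            flagged.append(row["title"])
    return flagged


rows = [
    {"title": "Defying Gravity", "orig_lang": "th",
     "lyrics": "เอลฟาบา ช่วยใจเย็นมีสติสักครั้งได้ไหม?",
     "lyrics_en": "เอลฟาบา ช่วยใจเย็นมีสติสักครั้งได้ไหม?"},
    {"title": "Tu Boda", "orig_lang": "es",
     "lyrics": "siento un vacío muy frío por dentro, mi amor",
     "lyrics_en": "I feel a very cold emptiness inside, my love"},
]
print(find_untranslated(rows))
```

On the DataFrame itself, the same comparison could be expressed as a boolean mask over the `orig_lang`, `lyrics`, and `lyrics_en` columns.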


## Exporting the DataFrame for further use

In [48]:
OUT_DIR = Path("../data/processed")
OUT_DIR.mkdir(parents=True, exist_ok=True)   # also create missing parent folders

OUT_FILE = OUT_DIR / "hot100_translated.xlsx"
df.to_excel(OUT_FILE, index=False, engine="openpyxl")

print(f"✓ Saved translated corpus → {OUT_FILE.resolve()}")

✓ Saved translated corpus → C:\Users\royic\OneDrive\Desktop\לימודים\שנה ג\סמסטר ב\טקסט כנתונים\billboard-100-lyrics-analysis\data\processed\hot100_translated.xlsx
