# **TEXT PREPROCESSING**

## **FOR NON-CONTEXTUAL EMBEDDINGS**

In [30]:
import re
import pandas as pd
from tqdm import tqdm
import spacy

df = pd.read_csv("datas/datasetClean.csv")

The first step is to clean out symbols and other artifacts that the model could misinterpret. Looking at the article texts, we found stray symbols such as ">" and "<", as well as quotation marks. We decided to remove them so that the later steps work better.
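As a minimal sketch of the kind of symbol cleanup described above: the helper `strip_stray_symbols` and the pattern `STRAY_SYMBOLS_RE` are hypothetical names used only for illustration and are not part of the notebook's pipeline, which handles these cases through its own rules in the cells that follow.

```python
import re

# Hypothetical pattern (an assumption, not the notebook's rule):
# angle brackets plus straight and curly double quotes.
STRAY_SYMBOLS_RE = re.compile(r"[<>\"“”]")

def strip_stray_symbols(text: str) -> str:
    """Replace stray symbols with a space, then collapse repeated whitespace."""
    cleaned = STRAY_SYMBOLS_RE.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()

print(strip_stray_symbols('> "Markets rallied" < on Monday'))
# Markets rallied on Monday
```

Replacing with a space rather than the empty string avoids gluing two words together when a symbol sits between them.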

In [31]:
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser", "textcat"])
tqdm.pandas()

STOPWORDS = set(nlp.Defaults.stop_words)
PLACEHOLDER_RE = re.compile(r"__\w+__")

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_TAG_RE = re.compile(r"<[^>]+>")
BOILERPLATE_RE = re.compile(
    r"(^read more:.*$|^story continues.*$|copyright\s*©.*$)",
    flags=re.IGNORECASE | re.MULTILINE,
)
CASHTAG_RE = re.compile(r"\$([A-Za-z]{1,10})\b")
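To see what these patterns do before they are wired into a function, the sketch below applies them to a made-up sample string. The patterns are copied from the cell above so the block runs on its own; the helper name `apply_cleanup_patterns` and the sample text are assumptions for illustration.

```python
import re

# Patterns copied from the cell above so this sketch is self-contained.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_TAG_RE = re.compile(r"<[^>]+>")
BOILERPLATE_RE = re.compile(
    r"(^read more:.*$|^story continues.*$|copyright\s*©.*$)",
    flags=re.IGNORECASE | re.MULTILINE,
)
CASHTAG_RE = re.compile(r"\$([A-Za-z]{1,10})\b")

def apply_cleanup_patterns(text: str) -> str:
    """Hypothetical helper: apply the four patterns, then collapse whitespace."""
    text = BOILERPLATE_RE.sub("", text)        # drop "Read more: ..." style lines
    text = HTML_TAG_RE.sub(" ", text)          # drop tags such as <b> and </b>
    text = URL_RE.sub(" ", text)               # drop URLs
    text = CASHTAG_RE.sub("__TICKER__", text)  # $AAPL -> __TICKER__
    return re.sub(r"\s+", " ", text).strip()

sample = "Read more: Top stories\nSee https://example.com for <b>details</b> on $AAPL"
print(apply_cleanup_patterns(sample))
# See for details on __TICKER__
```

One thing to keep in mind: `HTML_TAG_RE` matches everything from a "<" up to the next ">", so a sentence that uses both symbols as comparison operators would lose the text between them.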

In [32]:
def pre_rules(text: str) -> str:
    if not isinstance(text, str) or not text.strip():
        return ""
    text = (text.replace("’", "'").replace("“", '"').replace("”", '"')
                 .replace("–", "-").replace("—", "-")).lower()
    t = BOILERPLATE_RE.sub("", text)
    t = HTML_TAG_RE.sub(" ", t)
    t = URL_RE.sub(" ", t)
    t = CASHTAG_RE.sub("__TICKER__", t)

    # financial placeholders
    t = re.sub(r'\bQ([1-4])\b', r'__QTR\1__', t, flags=re.IGNORECASE)
    t = re.sub(r'\b[+-]?\d[\d,\.]*\s*%(?=\W|$)', '__PERCENT__', t)
    t = re.sub(r'\b[+-]?\d[\d,\.]*\s*percent(?=\W|$)', '__PERCENT__', t, flags=re.IGNORECASE)
    t = re.sub(r'\b[+-]?\d[\d,\.]*\s*per\s+cent(?=\W|$)', '__PERCENT__', t, flags=re.IGNORECASE)
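    # The optional magnitude suffix sits in one group with its whitespace, so "$5 million" keeps the space
    t = re.sub(r'(\$|€|£)\s*\d[\d,\.]*(?:\s*(?:bn|b|m|k))?\b', '__MONEY__', t, flags=re.IGNORECASE)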
    t = re.sub(r'\b\d[\d,\.]*\s*(million|billion|trillion|bn|m)\b', '__AMOUNT__', t, flags=re.IGNORECASE)
    t = re.sub(r'\b(19|20)\d{2}\b', '__YEAR__', t)
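    # Month names are spelled out so that words such as "market" or "decline" are not read as dates
    month = (r'(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|'
             r'aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)')
    t = re.sub(rf'\b{month}\s+\d{{1,2}}\b', '__DATE__', t, flags=re.IGNORECASE)
    t = re.sub(rf'\b\d{{1,2}}\s+{month}\b', '__DATE__', t, flags=re.IGNORECASE)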
    # generic numbers last, so the more specific placeholders above take precedence
    t = re.sub(r'(?<![A-Za-z_])\b\d[\d,\.]*\b(?![A-Za-z_])', '__NUM__', t)
    t = re.sub(r'\s+', ' ', t).strip()
    return t
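The order of the substitutions in `pre_rules` matters: the generic number rule runs last. The sketch below, which reuses three of the patterns from the function above on a made-up sentence, illustrates what would happen if the generic rule ran first. The names `specific_first` and `generic_first` are hypothetical and exist only for this comparison.

```python
import re

# Three of the patterns used in pre_rules, copied so the sketch runs on its own.
PERCENT_RE = re.compile(r"\b[+-]?\d[\d,\.]*\s*%(?=\W|$)")
YEAR_RE = re.compile(r"\b(19|20)\d{2}\b")
NUM_RE = re.compile(r"(?<![A-Za-z_])\b\d[\d,\.]*\b(?![A-Za-z_])")

def specific_first(text: str) -> str:
    """Order used in pre_rules: specific placeholders, then generic numbers."""
    t = PERCENT_RE.sub("__PERCENT__", text)
    t = YEAR_RE.sub("__YEAR__", t)
    return NUM_RE.sub("__NUM__", t)

def generic_first(text: str) -> str:
    """Reversed order, shown only to illustrate the failure mode."""
    t = NUM_RE.sub("__NUM__", text)
    t = PERCENT_RE.sub("__PERCENT__", t)
    return YEAR_RE.sub("__YEAR__", t)

sample = "wages grew 4.7% in 2024, while payrolls fell by 93,000"
print(specific_first(sample))
# wages grew __PERCENT__ in __YEAR__, while payrolls fell by __NUM__
print(generic_first(sample))
# wages grew __NUM__% in __NUM__, while payrolls fell by __NUM__
```

With the generic rule first, every digit run is consumed before the percent and year rules get a chance to match, so the more informative placeholders never appear.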


### A problem we ran into:

Acronyms and index names such as the S&P were being dropped from the texts because a filter in the lemmatization step was wrong. To address this, we introduced a KEEP_TERMS set with relevant terms we have seen in the news, plus ACRONYM_RE to catch cases such as COS or any others that we did not add to the list.
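The effect can be reproduced without spaCy. The sketch below compares a plain length-and-alphabetic filter with one that checks an allow-list and the acronym patterns first. It assumes tokens arrive already split and with their original casing; `naive_filter`, `acronym_aware_filter`, and the reduced `KEEP_TERMS` subset are hypothetical stand-ins, not the notebook's code.

```python
import re

KEEP_TERMS = {"uk", "us", "gdp", "ons"}          # small subset, for illustration only
ACRONYM_RE = re.compile(r"^[A-Z]{2,5}$")         # UK, ONS, GDP
MIXED_CASE_RE = re.compile(r"^[A-Z0-9&]{2,6}$")  # 5G, S&P, AT&T

def naive_filter(tokens):
    """Keep alphabetic tokens of length >= 3; short or symbol-bearing acronyms are lost."""
    return [t.lower() for t in tokens if t.isalpha() and len(t) >= 3]

def acronym_aware_filter(tokens):
    """Check the allow-list and the acronym patterns before the length filter."""
    kept = []
    for t in tokens:
        if t.lower() in KEEP_TERMS or ACRONYM_RE.match(t) or MIXED_CASE_RE.match(t):
            kept.append(t.lower())
        elif t.isalpha() and len(t) >= 3:
            kept.append(t.lower())
    return kept

tokens = ["UK", "shares", "track", "S&P", "5G", "rally"]
print(naive_filter(tokens))
# ['shares', 'track', 'rally']
print(acronym_aware_filter(tokens))
# ['uk', 'shares', 'track', 's&p', '5g', 'rally']
```

The allow-list covers known terms regardless of casing, while the two patterns act as a fallback for acronyms nobody thought to list.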

In [39]:
import re

STOPWORDS.discard("us")  # keep "US" as a country

KEEP_TERMS = {
    "uk", "us", "eu", "ai", "ceo", "cfo", "ipo", "esg", "gdp", "cpi", "pmi", "ppi",
    "eps", "roi", "ebitda", "fx", "irr", "yoy", "qoq", "bps", "ev", "iot", "ml",
    "nyse", "nasdaq", "dow", "sp500", "ftse", "dax", "cac", "nikkei", "tsx",
    "hk", "jp", "cn", "in", "br", "mx", "de", "fr", "it", "sg", "za", "kr",
    "gm", "ge", "bp", "ibm", "aapl", "tsla", "msft", "meta", "googl",
    "ons", "imf", "oecd", "ecb", "boe", "fed", "sec", "bis", "opec", "wto", "un"
}

ACRONYM_RE = re.compile(r"^[A-Z]{2,5}$")       # UK, ONS, GDP
MIXED_CASE_RE = re.compile(r"^[A-Z0-9&]{2,6}$")  # 5G, S&P, AT&T

def spacy_clean_strong(doc) -> str:
    """Return the kept tokens of a spaCy Doc (lemmas, key acronyms, placeholders) joined by spaces."""
    out = []
    for tok in doc:
        text_tok = tok.text
        lemma = tok.lemma_.lower()

        # --- Placeholders (e.g. __MONEY__) ---
        # Only fires if the tokenizer keeps the placeholder as a single token.
        if text_tok.startswith("__") and text_tok.endswith("__"):
            out.append(text_tok)
            continue

        # --- Whitespace or punctuation ---
        if tok.is_punct or tok.is_space:
            continue

        # --- Stopwords first ---
        # KEEP_TERMS entries that are also stopwords (e.g. "in", "it") are dropped here.
        if lemma in STOPWORDS:
            continue

        # --- Important acronyms ---
        if lemma in KEEP_TERMS:
            out.append(lemma)
            continue

        # --- Detect acronyms by pattern ---
        # Both patterns are case-sensitive, so they only match tokens that are still uppercase.
        if ACRONYM_RE.match(text_tok) or MIXED_CASE_RE.match(text_tok):
            out.append(text_tok.lower())
            continue

        # --- Regular words ---
        if lemma.isalpha() and len(lemma) >= 3:
            out.append(lemma)

    return " ".join(out)
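Since `pre_rules` lowercases its input before spaCy sees it, it may be worth checking which token shapes the pattern branch in `spacy_clean_strong` can actually match. The sketch below is a small standalone check; `matches_acronym_pattern` is a hypothetical helper, and the two patterns are copied from the cell above.

```python
import re

ACRONYM_RE = re.compile(r"^[A-Z]{2,5}$")
MIXED_CASE_RE = re.compile(r"^[A-Z0-9&]{2,6}$")

def matches_acronym_pattern(token: str) -> bool:
    """True if either acronym pattern accepts the token exactly as written."""
    return bool(ACRONYM_RE.match(token) or MIXED_CASE_RE.match(token))

for token in ["ONS", "ons", "S&P", "s&p", "5G", "2024"]:
    print(f"{token!r}: {matches_acronym_pattern(token)}")
# 'ONS': True
# 'ons': False
# 'S&P': True
# 's&p': False
# '5G': True
# '2024': True
```

Two things stand out: lowercased tokens never match, so on lowercased text only KEEP_TERMS can rescue an acronym, and `MIXED_CASE_RE` also accepts digit-only strings, which matters if a bare number reaches this branch without having been replaced by a placeholder.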



In [40]:
base_col = "article_text"
assert base_col in df.columns, f"Column {base_col} does not exist"

df["text_nc_step1"] = df[base_col].apply(pre_rules)
df["text_nc"] = df["text_nc_step1"].progress_apply(lambda x: spacy_clean_strong(nlp(x)))

# Columns of interest (defined for reference; the full DataFrame is what gets saved below)
columnasNecesarias = ["article_text", "text_nc_step1", "text_nc"]
out_path = "datas/processData.csv"
df.to_csv(out_path, index=False)
print("Saved:", out_path)

100%|██████████| 5160/5160 [02:02<00:00, 42.02it/s]


Saved: datas/processData.csv


In [41]:
OUT_COL = "text_nc"


# Build a mask that keeps only rows whose preprocessed text is not empty
mask_valid = df[OUT_COL].notna() & (df[OUT_COL].str.len() > 0)


first_original_text = df.loc[mask_valid, "article_text"].iloc[0]
first_preprocessed_text = df.loc[mask_valid, OUT_COL].iloc[0]
first_half_proc_text = df.loc[mask_valid,"text_nc_step1"].iloc[0]

print("\n============================")
print("Original text")
print("============================\n")
print(first_original_text[:1000], "...\n")



Original text

The UK jobs market continues to show signs of weakness, with pay growth slowing and unemployment edging higher ahead of the autumn budget next month.
The latest data from the Office for National Statistics (ONS), released on Tuesday, showed that annual wage growth excluding bonuses in the three months to August was 4.7%, down slightly from 4.8% between May and July.
The unemployment rate came in at 4.8% for the period, slightly higher than the 4.7% recorded for the previous three months.
The number of employees on the payroll in the year to August was estimated to have fallen by 93,000, though it increased by 10,000 between July and August.
Early estimates for the number of payrolled employees in September suggested a fall of 100,000 on the year and 10,000 on a monthly basis, though the ONS said this was likely to be revised when more data is received next month.
The estimated number of job vacancies fell by 9,000 from the previous three months to 717,000 in July to Se

In [42]:
print("\n============================")
print("Preprocessed text with placeholder tags")
print("============================\n")
print(first_half_proc_text, "...\n")


Preprocessed text with placeholder tags

the uk jobs market continues to show signs of weakness, with pay growth slowing and unemployment edging higher ahead of the autumn budget next month. the latest data from the office for national statistics (ons), released on tuesday, showed that annual wage growth excluding bonuses in the three months to august was __PERCENT__, down slightly from __PERCENT__ between may and july. the unemployment rate came in at __PERCENT__ for the period, slightly higher than the __PERCENT__ recorded for the previous three months. the number of employees on the payroll in the year to august was estimated to have fallen by __NUM__, though it increased by __NUM__ between july and august. early estimates for the number of payrolled employees in september suggested a fall of __NUM__ on the year and __NUM__ on a monthly basis, though the ons said this was likely to be revised when more data is received next month. the estimated number of job vacancies fell by __NUM__ from

In [43]:
print("\n============================")
print("Preprocessed text")
print("============================\n")
print(first_preprocessed_text, "...\n")

# Verification tokenization with spaCy
doc_test = nlp(first_original_text)
verification_data = [(w.text, w.pos_, w.lemma_) for w in doc_test]

print(f"Number of tokens analyzed: {len(verification_data)}\n")
print("First 30 tokens:\n")
print(verification_data[:30])


Preprocessed text

uk job market continue sign weakness pay growth slowing unemployment edge higher ahead autumn budget month late datum office national statistic ons release tuesday annual wage growth exclude bonus month august percent slightly percent july unemployment rate come percent period slightly high percent record previous month number employee payroll year august estimate fall num increase num july august early estimate number payrolle employee september suggest fall num year num monthly basis likely revise datum receive month estimate number job vacancy fall num previous month num july september liz mckeown director economic statistic ons long period weak hire activity sign fall payroll number vacancy level different pattern age range record number work increase unemployment drive young people figure come ahead government autumn budget chancellor rachel reeves deliver date speculation ramp policy change announce raise fund support public finance professor joe nellis econo

## **FOR CONTEXTUAL EMBEDDINGS**