<header style="width:100%;position:relative">
  <div style="width:80%;float:right;">
    <h1>False Political Claim Detection</h1>
    <h3>Data loading and cleaning - LLM</h3>
    <h5>Group 2</h5>
  </div>
        <img style="width:15%;" src="./images/logo.jpg" alt="UPM" />
</header>

# Table of contents

1. [Import libraries](#1.-Import-libraries)  
2. [Dataset loading](#2.-Dataset-loading)  
3. [Dictionary configuration](#3.-Dictionary-configuration)  
4. [References](#4.-References)

# 1. Import libraries

In [41]:
# Data handling
import pandas as pd
import numpy as np

# Standard library
import json
import re
from collections import Counter
from datetime import datetime
from multiprocessing import process

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# NLP
import spacy
from transformers import pipeline

# pip install gensim
import gensim
from gensim.corpora import Dictionary

# pip install pyLDAvis==3.4.1
# import pyLDAvis

In [42]:
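def create_collection(df, label, column):
    """Return every whitespace-separated token in `column` for rows whose 'label' equals `label`."""
    collection = []

    # dropna() skips missing statements, which would otherwise fail when iterated
    for tokens in df[df['label'] == label][column].dropna().str.split():
        collection.extend(tokens)
    return collection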


In [43]:
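def create_json(outputfile, data):
    """Write `data` to `outputfile` as indented UTF-8 JSON (parameter renamed to avoid shadowing the built-in `dict`)."""
    with open(outputfile, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)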
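
A quick way to sanity-check a helper like `create_json` is a round trip through a temporary file. The sketch below defines an equivalent helper under the hypothetical name `write_json` so that it runs on its own; the sample dictionary and the temporary path are illustrative assumptions, not data from this notebook.

```python
import json
import os
import tempfile


def write_json(outputfile, data):
    """Hypothetical stand-in for create_json: dump `data` as UTF-8 JSON."""
    with open(outputfile, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)


# Illustrative word counts; the non-ASCII key shows the effect of ensure_ascii=False.
sample = {"tax": 12, "señor": 3}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")
    write_json(path, sample)

    with open(path, encoding="utf-8") as f:
        raw_text = f.read()

    restored = json.loads(raw_text)

print(restored == sample)   # the round trip preserves the dictionary
print("señor" in raw_text)  # non-ASCII characters are written as-is, not escaped
```

Opening the file with an explicit `encoding='utf-8'` on both the write and the read side keeps the result independent of the platform's default encoding.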

In [44]:
# Global variables
nlp = spacy.load('en_core_web_lg')
en_stopwords = nlp.Defaults.stop_words

# 2. Dataset loading

In [45]:
# Train
url = "formated/train_exportado.csv" 
df_train_exportado = pd.read_csv(url)


# Test
url2 = "formated/test_exportado.csv"
df_test_exportado = pd.read_csv(url2)

# 3. Dictionary configuration

## 3.1 Removing common words

In [46]:
statement_train = df_train_exportado['statement-lemmatize']
statement_test = df_test_exportado['statement-lemmatize']

In [47]:
# Minimum document-frequency thresholds: a word must appear in at least this many documents to be kept in the dictionary.
min_freq_values = [5, 10, 20, 50]

# Maximum document-proportion thresholds: a word that appears in more than this fraction of documents is dropped from the dictionary.
max_prop_values = [0.5, 0.65, 0.8]
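
Both thresholds act on document frequency (the number of documents containing a word), not on the total number of occurrences. The following is a minimal sketch on a toy corpus; the four documents and the thresholds `min_freq = 2` and `max_prop = 0.8` are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

# Toy corpus of four tokenized documents (illustrative only).
docs = [
    ["tax", "cut", "tax"],
    ["tax", "job"],
    ["tax", "job", "wall"],
    ["tax", "cut"],
]

# Document frequency: each word counts once per document it appears in.
doc_freq = Counter(word for doc in docs for word in set(doc))
n_docs = len(docs)

min_freq = 2    # keep words present in at least 2 documents
max_prop = 0.8  # drop words present in more than 80% of documents

kept = sorted(
    word
    for word, freq in doc_freq.items()
    if freq >= min_freq and freq / n_docs <= max_prop
)

print(dict(doc_freq))
print(kept)
```

Here `tax` is dropped for being too common (4 of 4 documents), `wall` for being too rare (1 document), and `cut` and `job` are kept.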


In [48]:
# Tokenize on whitespace (note: str() turns a missing statement into the literal token "nan")
tokenized_train = [str(text).split() for text in statement_train]
tokenized_test = [str(text).split() for text in statement_test]

# Count total occurrences of each word
counter_train = Counter(word for doc in tokenized_train for word in doc)
counter_test = Counter(word for doc in tokenized_test for word in doc)

n_docs_train = len(tokenized_train)
n_docs_test = len(tokenized_test)

# Count how many documents each word appears in (document frequency, not total count)
def document_frequency(tokenized_docs):
    df_counter = Counter()
    for doc in tokenized_docs:
        unique_words = set(doc)
        df_counter.update(unique_words)
    return df_counter

doc_freq_train = document_frequency(tokenized_train)
doc_freq_test = document_frequency(tokenized_test)

# Filter a word-count dictionary: keep the total count of every word whose
# document frequency is at least min_freq and whose share of documents is at most max_prop
def filtra(dic_frecuencia, doc_frecuencia, total_docs, min_freq, max_prop):
    return {
        word: dic_frecuencia[word]
        for word in dic_frecuencia
        if doc_frecuencia[word] >= min_freq and (doc_frecuencia[word] / total_docs) <= max_prop
    }

# Process every threshold combination
for min_freq in min_freq_values:
    for max_prop in max_prop_values:
        # Filter both splits with the same thresholds
        filtered_train = filtra(counter_train, doc_freq_train, n_docs_train, min_freq, max_prop)
        filtered_test = filtra(counter_test, doc_freq_test, n_docs_test, min_freq, max_prop)

        # Words present in both filtered vocabularies
        common_words = set(filtered_train.keys()) & set(filtered_test.keys())

        # Words exclusive to each split
        exclusive_train = {word: freq for word, freq in filtered_train.items() if word not in common_words}
        exclusive_test = {word: freq for word, freq in filtered_test.items() if word not in common_words}

        # Save everything as JSON; round() avoids float truncation in the file name (e.g. int(0.57 * 100) == 56)
        base = f"nb{min_freq}_na{round(max_prop * 100)}"

        with open(f"dictionaries/train_{base}.json", "w") as f:
            json.dump(filtered_train, f, indent=2)
        with open(f"dictionaries/test_{base}.json", "w") as f:
            json.dump(filtered_test, f, indent=2)
        with open(f"dictionaries/train_exclusive_{base}.json", "w") as f:
            json.dump(exclusive_train, f, indent=2)
        with open(f"dictionaries/test_exclusive_{base}.json", "w") as f:
            json.dump(exclusive_test, f, indent=2)

        print(f"✔ Dictionaries saved for min_freq={min_freq}, max_prop={max_prop}")


✔ Dictionaries saved for min_freq=5, max_prop=0.5
✔ Dictionaries saved for min_freq=5, max_prop=0.65
✔ Dictionaries saved for min_freq=5, max_prop=0.8
✔ Dictionaries saved for min_freq=10, max_prop=0.5
✔ Dictionaries saved for min_freq=10, max_prop=0.65
✔ Dictionaries saved for min_freq=10, max_prop=0.8
✔ Dictionaries saved for min_freq=20, max_prop=0.5
✔ Dictionaries saved for min_freq=20, max_prop=0.65
✔ Dictionaries saved for min_freq=20, max_prop=0.8
✔ Dictionaries saved for min_freq=50, max_prop=0.5
✔ Dictionaries saved for min_freq=50, max_prop=0.65
✔ Dictionaries saved for min_freq=50, max_prop=0.8
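
The split into common and exclusive words performed inside the loop above comes down to a set intersection followed by two dictionary filters. A minimal sketch with toy vocabularies follows; the words and counts are invented for illustration and do not come from the saved dictionaries.

```python
# Toy filtered vocabularies (word -> total count); illustrative values only.
filtered_train = {"tax": 40, "job": 25, "border": 9}
filtered_test = {"tax": 11, "job": 7, "senate": 6}

# Words that survive filtering in both splits.
common_words = filtered_train.keys() & filtered_test.keys()

# Words that survive in only one split keep their counts from that split.
exclusive_train = {w: c for w, c in filtered_train.items() if w not in common_words}
exclusive_test = {w: c for w, c in filtered_test.items() if w not in common_words}

print(sorted(common_words))
print(exclusive_train)
print(exclusive_test)
```

A word can end up "exclusive" to one split simply because it fell below `min_freq` in the other, so with the same absolute threshold applied to splits of different sizes, the exclusive lists may partly reflect split size rather than vocabulary differences.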


## 3.2 Another approach

# 4. References

* [pandas documentation - pandas 2.2.3 documentation. (n.d.).](https://pandas.pydata.org/docs/)