Because of how the data were processed, the order in which the dataframe is read depends on the task. To sample text for further classifier training from the existing named entity recognition results, read the dataframe in its default order. For text classification, read the dataframe sorted by the Date column in ascending order.
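The two reading orders can be sketched as follows. This is a plain-Python illustration on toy rows, not the notebook's pandas code; the helper names `parse_date` and `read_rows`, the task labels, and the `YYYY-MM-DD` date format with an optional `_suffix` are assumptions made for the example.

```python
from datetime import datetime

def parse_date(raw):
    """Parse 'YYYY-MM-DD', dropping any '_suffix' first (assumed format)."""
    return datetime.strptime(raw.split("_")[0], "%Y-%m-%d")

def read_rows(rows, task):
    """Return rows in the order the given task expects.

    task="ner_sampling":   keep the default (file) order.
    task="classification": sort ascending by Date.
    """
    if task == "ner_sampling":
        return list(rows)
    if task == "classification":
        return sorted(rows, key=lambda row: parse_date(row["Date"]))
    raise ValueError(f"unknown task: {task!r}")

# Toy rows standing in for the real dataframe (hypothetical values)
rows = [
    {"Date": "1929-07-31", "Text": "b"},
    {"Date": "1929-03-14_2", "Text": "a"},
    {"Date": "1930-04-23", "Text": "c"},
]
default_order = [r["Text"] for r in read_rows(rows, "ner_sampling")]
date_order = [r["Text"] for r in read_rows(rows, "classification")]
print(default_order, date_order)
```

Mixing the two orders matters because any index saved under one order points at a different row under the other.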

## OCR test part (ignore it)

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

model = AutoModelForSeq2SeqLM.from_pretrained('HenriPorteur/bart-large-ocr-fr')
tokenizer = AutoTokenizer.from_pretrained('HenriPorteur/bart-large-ocr-fr')
generator = pipeline('text2text-generation', model=model.to('cuda'), tokenizer=tokenizer, device='cuda', max_length=1024)

ocr = "C3nUm3r~o compr3nd3g@lement l compte-rendu deIa séance du mème jour de l@ CHAMBRE des dépuTés¡."
pred = generator(ocr)[0]['generated_text']
print(pred)



Cénuméro comprend dégà l’écompte-rendu de la séance du même jour de là CHAMBRE des députés.


In [None]:
ocr = "il Des i débats d'ordre purement politique, .dans lesquels se heurtent, comme depuis et ans, les idées de droite plus de cin ? ^Ilai? te ans» idées de droite et les idées de gauche, ou prétendues telles 1 nous semblent épuisées. Ces diseus."
pred = generator(ocr)[0]['generated_text']
print(pred)

Il des indébats d’ordre purement politique, dans lesquels se heurtent, comme depuis et ans, les idées de droite plus de cinq « il aîte ans », identées, de droit et les idéees de gauche, ou prétendues telles - nous semblent épuisées - ces diseus.


## Load package

In [None]:
from google.colab import drive
import os
import shutil
import pandas as pd
import numpy as np
import re
import unicodedata
from tqdm import tqdm
import spacy
import matplotlib.pyplot as plt
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from sentence_transformers import SentenceTransformer, util
import nltk
from nltk.tokenize import sent_tokenize
import json

In [None]:
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
! pip install gliner
! pip install rapidfuzz
from gliner import GLiNER
from rapidfuzz import process, fuzz


## Load data

In [None]:
df_chambre = pd.read_csv('/content/drive/MyDrive/Memoire_ENC/df_chambre_cleaned.csv')
df_senat = pd.read_csv('/content/drive/MyDrive/Memoire_ENC/df_senat_cleaned.csv')

In [None]:
country_list = [
    "Afrique du Sud", "Sud-Africain", "Sud-Africaine", "Sud-Africains", "Sud-Africaines",
    "Albanie", "Albanais", "Albanaise", "Albanais", "Albanaises",
    "Allemagne", "Allemand", "Allemande", "Allemands", "Allemandes",
    "Argentine", "Argentin", "Argentine", "Argentins", "Argentines",
    "Australie", "Australien", "Australienne", "Australiens", "Australiennes",
    "Autriche", "Autrichien", "Autrichienne", "Autrichiens", "Autrichiennes",
    "Belgique", "Belge", "Belge", "Belges", "Belges",
    "Bolivie", "Bolivien", "Bolivienne", "Boliviens", "Boliviennes",
    "Brésil", "Brésilien", "Brésilienne", "Brésiliens", "Brésiliennes",
    "Bulgarie", "Bulgare", "Bulgare", "Bulgares", "Bulgares",
    "Canada", "Canadien", "Canadienne", "Canadiens", "Canadiennes",
    "Chili", "Chilien", "Chilienne", "Chiliens", "Chiliennes",
    "Chine", "Chinois", "Chinoise", "Chinois", "Chinoises",
    "Colombie", "Colombien", "Colombienne", "Colombiens", "Colombiennes",
    "Costa Rica", "Costaricien", "Costaricienne", "Costariciens", "Costariciennes",
    "Cuba", "Cubain", "Cubaine", "Cubains", "Cubaines",
    "Danemark", "Danois", "Danoise", "Danois", "Danoises",
    "Égypte", "Égyptien", "Égyptienne", "Égyptiens", "Égyptiennes",
    "Équateur", "Équatorien", "Équatorienne", "Équatoriens", "Équatoriennes",
    "Espagne", "Espagnol", "Espagnole", "Espagnols", "Espagnoles",
    "Estonie", "Estonien", "Estonienne", "Estoniens", "Estoniennes",
    "Éthiopie", "Éthiopien", "Éthiopienne", "Éthiopiens", "Éthiopiennes",
    "Finlande", "Finlandais", "Finlandaise", "Finlandais", "Finlandaises",
    "Grèce", "Grec", "Grecque", "Grecs", "Grecques",
    "Guatemala", "Guatémaltèque", "Guatémaltèque", "Guatémaltèques", "Guatémaltèques",
    "Haïti", "Haïtien", "Haïtienne", "Haïtiens", "Haïtiennes",
    "Honduras", "Hondurien", "Hondurienne", "Honduriens", "Honduriennes",
    "Hongrie", "Hongrois", "Hongroise", "Hongrois", "Hongroises",
    "Inde", "Indien", "Indienne", "Indiens", "Indiennes",
    "Irak", "Irakien", "Irakienne", "Irakiens", "Irakiennes",
    "Irlande", "Irlandais", "Irlandaise", "Irlandais", "Irlandaises",
    "Italie", "Italien", "Italienne", "Italiens", "Italiennes",
    "Japon", "Japonais", "Japonaise", "Japonais", "Japonaises",
    "Lettonie", "Letton", "Lettone", "Lettons", "Lettones",
    "Liberia", "Libérien", "Libérienne", "Libériens", "Libériennes",
    "Lituanie", "Lituanien", "Lituanienne", "Lituaniens", "Lituaniennes",
    "Luxembourg", "Luxembourgeois", "Luxembourgeoise", "Luxembourgeois", "Luxembourgeoises",
    "Mexique", "Mexicain", "Mexicaine", "Mexicains", "Mexicaines",
    "Nicaragua", "Nicaraguayen", "Nicaraguayenne", "Nicaraguayens", "Nicaraguayennes",
    "Norvège", "Norvégien", "Norvégienne", "Norvégiens", "Norvégiennes",
    "Nouvelle-Zélande", "Néo-Zélandais", "Néo-Zélandaise", "Néo-Zélandais", "Néo-Zélandaises",
    "Panama", "Panaméen", "Panaméenne", "Panaméens", "Panaméennes",
    "Paraguay", "Paraguayen", "Paraguayenne", "Paraguayens", "Paraguayennes",
    "Pays-Bas", "Néerlandais", "Néerlandaise", "Néerlandais", "Néerlandaises",
    "Pérou", "Péruvien", "Péruvienne", "Péruviens", "Péruviennes",
    "Pologne", "Polonais", "Polonaise", "Polonais", "Polonaises",
    "Portugal", "Portugais", "Portugaise", "Portugais", "Portugaises",
    "République dominicaine", "Dominicain", "Dominicaine", "Dominicains", "Dominicaines",
    "Roumanie", "Roumain", "Roumaine", "Roumains", "Roumaines",
    "Royaume-Uni", "Britannique", "Britannique", "Britanniques", "Britanniques",
    "Salvador", "Salvadorien", "Salvadorienne", "Salvadoriens", "Salvadoriennes",
    "Siam", "Siamois", "Siamoise", "Siamois", "Siamoises",
    "Suède", "Suédois", "Suédoise", "Suédois", "Suédoises",
    "Suisse", "Suisse", "Suisse", "Suisses", "Suisses",
    "Tchécoslovaquie", "Tchécoslovaque", "Tchécoslovaque", "Tchécoslovaques", "Tchécoslovaques",
    "Turquie", "Turc", "Turque", "Turcs", "Turques",
    "Union soviétique", "Soviétique", "Soviétique", "Soviétiques", "Soviétiques",
    "Uruguay", "Uruguayen", "Uruguayenne", "Uruguayens", "Uruguayennes",
    "Venezuela", "Vénézuélien", "Vénézuélienne", "Vénézuéliens", "Vénézuéliennes",
    "Yougoslavie", "Yougoslave", "Yougoslave", "Yougoslaves", "Yougoslaves"
]
minister_list = ['Aristide Briand','Alexandre Ribot','Louis Barthou','Stephen Pichon','Alexandre Millerand','Georges Leygues','Raymond Poincaré','Edmond Lefebvre du Prey',
                 'Édouard Herriot','Pierre Laval','André Tardieu', 'Joseph Paul-Boncour','Édouard Daladier','Pierre-Étienne Flandin','Yvon Delbos','Joseph Paul-Boncour',
                 'Georges Bonnet','Paul Reynaud','Paul Baudouin']

In [None]:
nltk.download("punkt")  
nltk.download('punkt_tab')

columns_to_keep = ["Text", "Date", "Year", "errors", "total_words", "error_positions", "error_ratio"]
df_chambre_filtered = df_chambre[columns_to_keep].copy()

def chunk_sentences(text, sentences_per_chunk=30, overlap=5):
    sentences = sent_tokenize(text)
    chunks = []

    for i in range(0, len(sentences), sentences_per_chunk - overlap):
        chunk = " ".join(sentences[i : i + sentences_per_chunk])
        chunks.append(chunk)

    return chunks


df_chambre_filtered["Chunks"] = df_chambre_filtered["Text"].apply(chunk_sentences)

df_chambre_chunks = df_chambre_filtered.explode("Chunks").reset_index(drop=True)

print(df_chambre_chunks.head())


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


                                                Text        Date  Year  \
0  .\n\nSOMMAIRE 1. - Ouverture de la session ext...  1929-07-31  1929   
1  .\n\nSOMMAIRE 1. - Ouverture de la session ext...  1929-07-31  1929   
2  .\n\nSOMMAIRE 1. - Ouverture de la session ext...  1929-07-31  1929   
3  .\n\nSOMMAIRE 1. - Ouverture de la session ext...  1929-07-31  1929   
4  .\n\nSOMMAIRE 1. - Ouverture de la session ext...  1929-07-31  1929   

   errors  total_words                                    error_positions  \
0    7153        63465  [4.7270148900969035e-05, 0.0002048373119041991...   
1    7153        63465  [4.7270148900969035e-05, 0.0002048373119041991...   
2    7153        63465  [4.7270148900969035e-05, 0.0002048373119041991...   
3    7153        63465  [4.7270148900969035e-05, 0.0002048373119041991...   
4    7153        63465  [4.7270148900969035e-05, 0.0002048373119041991...   

   error_ratio                                             Chunks  
0     0.112708  . SOMMAI

In [None]:
len(df_chambre_chunks['Chunks'])

166442
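With the defaults used above (30 sentences per chunk, overlap of 5), the window advances 25 sentences at a time, so consecutive chunks share five sentences. The sketch below reproduces only the index arithmetic of `chunk_sentences`; the helper name `window_bounds` is hypothetical and not part of the notebook.

```python
def window_bounds(n_sentences, sentences_per_chunk=30, overlap=5):
    """Return (start, end) sentence indices for each chunk."""
    step = sentences_per_chunk - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than sentences_per_chunk")
    return [
        (start, min(start + sentences_per_chunk, n_sentences))
        for start in range(0, n_sentences, step)
    ]

bounds = window_bounds(60)
print(bounds)             # [(0, 30), (25, 55), (50, 60)]
print(window_bounds(30))  # [(0, 30), (25, 30)]
```

The second call shows a side effect of this scheme: a text of exactly 30 sentences yields a trailing chunk made up entirely of the five overlapping sentences, so short tail chunks that duplicate earlier content are to be expected.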

In [None]:
df_chambre_chunks['Chunks'][2]

"Adoption de l'article 27 (devenu 15). Adoption d'un article 27 bis (nouveau) (devenu 16). Modification du titre. Adoption, au scrutin, de l'ensemble du projet de loi. 9. — Décrets désignant des commissaires du Gouvernement. 10. — Demande, par M. André Tardieu, minjstre de l'intérieur, de]adiscussion immédiate d'une proposition de loi, adoptée par le Sénat, relative aux élections sénatoriales. 11. — Demande, par M. Pierre Forgeot, ministre des travaux publies, de la discussion immédiate du projet de loi modifiant la loi du 1er août 1928 sur le crédit maritime et portant allégement des charges fiscales supportées par les navires de mer. Discussion immédiate. Adoption des articles 1er à 8. Modification du titre. Adoption de l'ensemble du projet de loi. 12. — Demandes d'interpellation: 1° De M. Odin, sur les négociations doua. nières concernant les vins de France; 2° De M. Gaston-Gérard, sur la crise de chômage qui sévit actuellement dans les professions théâtrales ; 3° De M. Ramadier, su

## NER part

### Finished extraction

In [None]:
#gliner-multitask-v1.0
device=0
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-v1.0",device=device)
model.to('cuda')
labels = ["person", "country", "organization", "region"]


In [None]:
# After explode(), each row of "Chunks" holds a single chunk string,
# so iterate over the rows directly rather than over the characters of a chunk.
chunk_entities = {}
total = len(df_chambre_chunks)
for idx, chunk in enumerate(df_chambre_chunks["Chunks"]):
    if not isinstance(chunk, str):  # empty documents explode to NaN
        chunk_entities[idx] = []
        continue
    entities = model.predict_entities(chunk, labels)
    chunk_entities[idx] = [ent['text'] for ent in entities]
    print(f"Progress: {idx+1}/{total}")


### Load entities directly from JSON

In [None]:
def load_data():
    """Load CSV data"""
    df_senat = pd.read_csv('/content/drive/MyDrive/Memoire_ENC/df_senat_cleaned.csv')
    df_chambre = pd.read_csv('/content/drive/MyDrive/Memoire_ENC/df_chambre_cleaned.csv')
    return df_senat, df_chambre

def chunk_sentences(text, sentences_per_chunk=30, overlap=5):
    """Split text by sentences and merge into chunks, each chunk containing 30 sentences with an overlap of 5 sentences"""
    sentences = sent_tokenize(text)
    chunks = [" ".join(sentences[i : i + sentences_per_chunk]) for i in range(0, len(sentences), sentences_per_chunk - overlap)]
    return chunks

def preprocess_data(df):
    """Preprocess the dataframe and split into chunks"""
    columns_to_keep = ["Text", "Date", "Year", "errors", "total_words", "error_positions", "error_ratio"]
    df_filtered = df[columns_to_keep].copy()
    df_filtered["Chunks"] = df_filtered["Text"].apply(chunk_sentences)
    df_chunks = df_filtered.explode("Chunks").reset_index(drop=True)
    return df_chunks



nltk.download('punkt')
nltk.download('punkt_tab')
df_senat_origin, df_chambre_origin = load_data()
df_senat = preprocess_data(df_senat_origin)
df_chambre = preprocess_data(df_chambre_origin)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


ParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.

In [None]:
def load_data():
    """Load CSV data and sort by the Date column from earliest to latest"""
    df_senat = pd.read_csv('/content/drive/MyDrive/Memoire_ENC/df_senat_cleaned.csv')
    df_chambre = pd.read_csv('/content/drive/MyDrive/Memoire_ENC/df_chambre_cleaned.csv')

    # Drop any suffix after '_' in the Date column (if present), then convert to datetime format
    df_senat['Date'] = pd.to_datetime(df_senat['Date'].str.split('_').str[0])
    df_chambre['Date'] = pd.to_datetime(df_chambre['Date'].str.split('_').str[0])

    # Sort by the Date column
    df_senat = df_senat.sort_values(by='Date')
    df_chambre = df_chambre.sort_values(by='Date')

    return df_senat, df_chambre




def chunk_sentences(text, sentences_per_chunk=30, overlap=5):
    """Split text by sentences and merge into chunks, each chunk containing 30 sentences with an overlap of 5 sentences"""
    sentences = sent_tokenize(text)
    chunks = [" ".join(sentences[i : i + sentences_per_chunk]) for i in range(0, len(sentences), sentences_per_chunk - overlap)]
    return chunks

def preprocess_data(df):
    """Preprocess the dataframe and split into chunks"""
    columns_to_keep = ["Text", "Date", "Year", "errors", "total_words", "error_positions", "error_ratio"]
    df_filtered = df[columns_to_keep].copy()
    df_filtered["Chunks"] = df_filtered["Text"].apply(chunk_sentences)
    df_chunks = df_filtered.explode("Chunks").reset_index(drop=True)
    return df_chunks



nltk.download('punkt')
nltk.download('punkt_tab')
df_senat_origin, df_chambre_origin = load_data()
df_senat = preprocess_data(df_senat_origin)
df_chambre = preprocess_data(df_chambre_origin)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [None]:
df_chambre

In [None]:
with open("/content/drive/MyDrive/Memoire_ENC/data_chambre.json", "r", encoding="utf-8") as f:
    entities_chambre = json.load(f)
with open("/content/drive/MyDrive/Memoire_ENC/data_senat.json", "r", encoding="utf-8") as f:
    entities_senat = json.load(f)

In [None]:
entities_senat

{'0': ['Chambre des dépuJés',
  'Reich',
  'Ilenry de Jouvenel',
  'Chambre des députés',
  'Chambre des députés',
  'Chambre des députés',
  'Chambre des députés'],
 '1': ['conseil municipal de Paris',
  'Gornudet',
  'Marcel Régnier',
  'André Morizet',
  'Henri Laudier',
  'Henri Merlin',
  'Canada',
  'Alsace'],
 '2': ['M. Alfred Brard',
  'M. Robert Thoumyre',
  'Chambre des députés',
  'M. Hachette',
  'Chambre des députés',
  'M. JULES JEANNENEY',
  'M. Loubat',
  'M. Georges Se Grandmaison',
  'Chambre des députés',
  'Chambre des députés',
  'M. Amiard'],
 '3': ['Sénat',
  'M. Amiard',
  'M. Amiard',
  'Sénat',
  'M. Caillier',
  'M. Paul Bénazet',
  'Reich',
  'M. Lémery',
  'Sénat',
  'M. Alexandre Israël',
  'M. Henry de Jouvenel',
  'Sénat'],
 '4': ['M. Justin Godart'],
 '5': ['Justin Godart',
  'Sénat',
  'Chambre des députés',
  'Justin Godart',
  'Fèvre',
  'Roussel',
  'Bender',
  'Lugol',
  'Gautier',
  'Japy',
  'Mando',
  'Thournyre',
  'Lancien',
  'Bersez',
  'Ham

In [None]:
# Build the reference list once instead of on every call
reference_list = country_list + minister_list

def fuzzy_match(entity, reference_list, threshold=60):
    """Return the entity if its best fuzz.ratio score against the reference list exceeds the threshold, else None"""
    match, score, _ = process.extractOne(entity, reference_list, scorer=fuzz.ratio)
    # Keep the original entity text rather than the matched reference entry
    if score > threshold:
        return entity
    return None

# Match the extracted entities against the country and minister lists
matched_chunks_chambre = {}
for key, entities in entities_chambre.items():
    matches = [fuzzy_match(ent, reference_list) for ent in entities]
    matched_chunks_chambre[key] = [m for m in matches if m]
    print(f"Progress: {key}/{len(entities_chambre)}")

matched_chunks_senat = {}
for key, entities in entities_senat.items():
    matches = [fuzzy_match(ent, reference_list) for ent in entities]
    matched_chunks_senat[key] = [m for m in matches if m]
    print(f"Progress: {key}/{len(entities_senat)}")

In [None]:
len(matched_chunks_chambre)

166442

In [None]:
len(filtered_senat_entities)

29923

In [None]:
filtered_chambre_entities = {k: v for k, v in matched_chunks_chambre.items() if v}

In [None]:
filtered_senat_entities = {k: v for k, v in matched_chunks_senat.items() if v}

In [None]:
filtered_senat_entities

{'1': ['André Morizet', 'Canada', 'Alsace'],
 '3': ['M. Alexandre Israël'],
 '5': ['Roussel', 'Japy', 'Lancien', 'France'],
 '9': ['Paris', 'Paris', 'Paris'],
 '10': ['Paris', 'Paris'],
 '11': ['Paris', 'Paris'],
 '12': ['Paris', 'Paris', 'Paris'],
 '13': ['Paris', 'HenriHaye'],
 '14': ['Savoie'],
 '15': ['Savoie'],
 '18': ['Paris', 'Paris', 'Paris'],
 '19': ['Paris', 'Paris', 'Paris'],
 '21': ['Canada'],
 '22': ['Alsace'],
 '23': ['M. Georges Ulmo'],
 '24': ['Algérie', 'Algérie'],
 '25': ['Algérie', 'Algérie', 'Algérie', 'Algérie'],
 '26': ['Algérie', 'Algérie'],
 '27': ['Indochine',
  'Indochine',
  'France',
  'France',
  'Léon Bourgeois',
  'France'],
 '28': ['Léon Bourgeois', 'France', 'Louis Martin'],
 '29': ['Canada', 'Jean Valadier'],
 '30': ['Indochine', 'Indochine', 'Indochine'],
 '31': ['Paris', 'Paris', 'Paul Laflont', 'France'],
 '32': ['Paul Laflont', 'France', 'Léon Bourgeois', 'France'],
 '34': ['Bruxelles', 'Bruxelles'],
 '35': ['la République française', 'République f
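The filtered output above still contains entities such as 'Paris' that are in neither reference list. A likely reason is the permissive threshold of 60. The sketch below recomputes the score by hand, assuming `fuzz.ratio` is the normalized Indel similarity (twice the longest-common-subsequence length divided by the combined length, times 100), which is how the RapidFuzz documentation describes it. The helpers `lcs_length` and `indel_ratio` are written for this illustration and are not RapidFuzz functions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def indel_ratio(a, b):
    """Normalized Indel similarity on a 0-100 scale."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 100.0 * 2 * lcs_length(a, b) / total

score = indel_ratio("Paris", "Polonais")
print(round(score, 2))  # 61.54, shared subsequence "Pais"
```

Under that assumption, 'Paris' clears the threshold through the reference entry 'Polonais' alone, which is why the later embedding-based check is needed to discard such matches.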

In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2")
reference_list = country_list + minister_list
embeddings = model.encode(reference_list, convert_to_tensor=True)

# Compare each fuzzy-matched entity with every reference name and keep
# those whose best cosine similarity exceeds 0.8
verified_senat_chunks = {}
verified_chambre_chunks = {}
for n, (key, entities) in enumerate(filtered_senat_entities.items(), start=1):
    chunk_vecs = model.encode(entities, convert_to_tensor=True)
    cosine_scores = util.pytorch_cos_sim(chunk_vecs, embeddings)

    # Best score and its index in the reference list for each entity
    best_scores, best_matches = cosine_scores.max(dim=1)
    verified_senat_chunks[key] = [(entities[i], best_matches[i].item(), best_scores[i].item()) for i in range(len(entities)) if best_scores[i] > 0.8]
    # Use a running counter: the keys are chunk indices and can exceed the dict length
    print(f"Progress: {n}/{len(filtered_senat_entities)}")

for n, (key, entities) in enumerate(filtered_chambre_entities.items(), start=1):
    chunk_vecs = model.encode(entities, convert_to_tensor=True)
    cosine_scores = util.pytorch_cos_sim(chunk_vecs, embeddings)

    # Best score and its index in the reference list for each entity
    best_scores, best_matches = cosine_scores.max(dim=1)
    verified_chambre_chunks[key] = [(entities[i], best_matches[i].item(), best_scores[i].item()) for i in range(len(entities)) if best_scores[i] > 0.8]
    print(f"Progress: {n}/{len(filtered_chambre_entities)}")

In [None]:
final_selected_senat_chunks = {key: val for key, val in verified_senat_chunks.items() if val}
final_selected_chambre_chunks = {key: val for key, val in verified_chambre_chunks.items() if val}

In [None]:
final_selected_chambre_chunks

{'0': [('Aristide Briand', 300, 1.000000238418579),
  ('Aristide Briand', 300, 1.000000238418579)],
 '3': [('Aristide Briand', 300, 1.0)],
 '4': [('M. Aristide Briand', 300, 0.9387767910957336),
  ('M. Aristide Briand', 300, 0.9387767910957336)],
 '5': [('M. Aristide Briand', 300, 0.9387767910957336)],
 '10': [('Aristide Briand', 300, 1.000000238418579)],
 '34': [('M. Aristide Briand', 300, 0.9387767910957336)],
 '62': [('Aristide Briand', 300, 1.000000238418579),
  ('Aristide Briand', 300, 1.000000238418579),
  ('M.\n\nLouis Barthou', 302, 0.9521903991699219)],
 '98': [('Indre', 140, 0.8098695278167725)],
 '104': [('Reynaud', 317, 0.82662433385849)],
 '109': [('Reynaud', 317, 0.82662433385849)],
 '110': [('Reynaud', 317, 0.82662433385849)],
 '111': [('Colomb', 65, 0.8489503860473633)],
 '112': [('Colomb', 65, 0.8489503860473633)],
 '114': [('Roumagoux', 242, 0.802230179309845)],
 '121': [('Colomb', 65, 0.8489503860473633)],
 '134': [('Reynaud', 317, 0.82662433385849)],
 '135': [('Roum

In [None]:
final_selected_senat_chunks

{'1': [('Canada', 50, 1.0000001192092896)],
 '21': [('Canada', 50, 1.0)],
 '29': [('Canada', 50, 1.0000001192092896)],
 '41': [('M. André Tardieu', 310, 0.9422976970672607)],
 '65': [('Canada', 50, 1.0)],
 '70': [('Luxembourg', 180, 1.0)],
 '71': [('Luxembourg', 180, 1.000000238418579),
  ('Luxembourg', 180, 1.000000238418579),
  ('Luxembourg', 180, 1.000000238418579),
  ('Luxembourg', 180, 1.000000238418579)],
 '72': [('Luxembourg', 180, 1.0000003576278687),
  ('Luxembourg', 180, 1.0000003576278687)],
 '73': [('Danemark', 80, 1.0)],
 '74': [('Luxembourg', 180, 1.0000001192092896)],
 '98': [('M. Pierre Laval', 309, 0.9496603012084961)],
 '99': [('Pierre Laval', 309, 0.9999998807907104),
  ('Pierre Laval', 309, 0.9999998807907104)],
 '100': [('M. Pierre Laval', 309, 0.9496603012084961)],
 '107': [('M. Pierre Laval', 309, 0.9496602416038513),
  ('M. Pierre Laval', 309, 0.9496602416038513)],
 '108': [('M. Pierre Laval', 309, 0.9496602416038513)],
 '120': [('M. Reynaud', 317, 0.84253740310
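Each tuple in these outputs is (entity text, index of the best match in `country_list + minister_list`, cosine score). Since `country_list` has 300 entries, index 300 is the first minister, 'Aristide Briand', and index 50 is 'Canada'. The selection step can be sketched without the embedding model as below; the three-dimensional vectors and the short reference list are made up for illustration, and `cosine` and `best_reference` are hypothetical helpers.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

def best_reference(entity_vec, reference_vecs):
    """Return (index, score) of the most similar reference vector."""
    scores = [cosine(entity_vec, ref) for ref in reference_vecs]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores[best_index]

# Toy reference list and made-up embeddings (not real model output)
reference_names = ["Canada", "Luxembourg", "Pierre Laval"]
reference_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
entity_vecs = {"M. Pierre Laval": [0.1, 0.2, 0.95], "Savoie": [0.6, 0.6, 0.5]}

verified = []
for entity, vec in entity_vecs.items():
    index, score = best_reference(vec, reference_vecs)
    if score > 0.8:  # same cut-off as in the notebook
        verified.append((entity, reference_names[index], score))
print(verified)
```

Looking the index up in the reference list, as done here with `reference_names[index]`, makes the results easier to audit than the bare integer.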

In [None]:
df_senat['Chunks'][1]

'conditions d\'élection des membres du conseil municipal de Paris. Désignation de commissaires du Couver" nement. Urgence précédemment déclarée. Discussion générale:MM.Babaud-Lacroze, rapporteur; Gornudet, le général Hir.schauer, Art.1r11i. — Adoption. Art. 5. — Adoption. Observations de MM. Marcel Régnier, mirustre de l\'intérieur; André Morizet, Babaud" Lacroie, rapporteur. Art. G. — Adoption.. Disposition additionnelle: M. Eugène Milliès-Lacroix. — Adoption. Adoption de l\'ensemble de l\'article G.\nArt. 7 et dernier. — Adoption. Sur l\'ensemble : MM. Henri Laudier, Henri Merlin, président de la commission de l\'administration. Adoption de l\'ensemble du projet de loi. ; Modification du libellé de l\'intitulé du projet de loi. 10. — Ajournement dePaira délibération sur le projet de loi, adopté par la Chambre desdéputés,ayant peur objet d\'autoriser lerni-\n\nntstretleafinances à pourvoir aux iasulflilanoes des annuités remises en gage à ses prêteurs par la ville de Soissons pour le 

In [None]:
def chunks_by_index(df, index_dict):
    """
    Retrieve the 'Chunks' and 'Date' values of the DataFrame at each index
    provided in the dictionary, and return a DataFrame containing ['Date', 'Texte'].

    :param df: pandas DataFrame containing 'Chunks' and 'Date' columns
    :param index_dict: dictionary with DataFrame indices as keys
    :return: new DataFrame containing ['Date', 'Texte']
    """

    data = []

    for n, idx in enumerate(index_dict.keys(), start=1):
        idx = int(idx)  # JSON keys are strings

        current_date = df.at[idx, "Date"]
        current_chunk = df.at[idx, "Chunks"]

        data.append({"Date": current_date, "Texte": current_chunk})

        # Running counter: idx is a row label and can exceed len(index_dict)
        print(f"Progress: {n}/{len(index_dict)}")

    return pd.DataFrame(data)
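The function keeps only the chunk at each selected index. If neighbouring context from the same sitting is wanted, a variant could join the previous and next chunks whenever they share the same Date. The sketch below uses a plain list of dicts in place of the DataFrame; `context_window` is a hypothetical helper, not part of the notebook.

```python
def context_window(rows, idx):
    """Join chunk idx with its direct neighbours that share the same Date."""
    date = rows[idx]["Date"]
    parts = []
    for j in (idx - 1, idx, idx + 1):
        if 0 <= j < len(rows) and rows[j]["Date"] == date:
            parts.append(rows[j]["Chunks"])
    return {"Date": date, "Texte": " ".join(parts)}

# Toy rows standing in for the chunked dataframe (hypothetical values)
rows = [
    {"Date": "1929-03-14", "Chunks": "A."},
    {"Date": "1929-03-14", "Chunks": "B."},
    {"Date": "1929-03-27", "Chunks": "C."},
]
merged = [context_window(rows, i) for i in range(len(rows))]
print(merged)
```

Because consecutive chunks already overlap by five sentences, text merged this way repeats those sentences; whether that matters depends on the downstream classifier.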


In [None]:
df_chambre_int = chunks_by_index(df_chambre, final_selected_chambre_chunks)
df_senat_int = chunks_by_index(df_senat, final_selected_senat_chunks)

In [None]:
df_chambre_sample = df_chambre_int.sample(n=8000, random_state=42)
df_chambre_sample.to_csv('/content/df_chambre_sample.csv', index=False)

In [None]:
df_chambre_sample

Unnamed: 0,Date,Texte
17946,1938-05-06,Réponse. — Dans le but de réunir les éléments ...
18914,1936-02-25,Castel. .Castellane (Stanislas de). Catalan (G...
17177,1929-06-04,Qu'ils ne trouvaientl'asgénéralement. M. le mi...
3965,1929-03-27,Gustave Doussain (Seine). Drouot. Dubois (Loui...
9731,1929-03-14,Roquette. Rothschild (Maurice de). Rotours (de...
...,...,...
5012,1937-02-12,Poitou-Duplessy. Polignac (de). Polimann. Pons...
12965,1930-04-23,Nous posons la: question aussi nettement que l...
19274,1933-12-10,Petsche (Maurice). Pezet. Pic. Piétri. Pinauit...
15535,1933-02-12,Périn (Emile) (fièvre). Pernot (Georges). Perr...


In [None]:
df_chambre_sample['Texte'][101]

"La parole est à M. Jénouvrier. M. Jénouvrier. Messieurs, l'Assemblée nationale vient, en lait, de suspendre sa séance et nous sommes tous désireux de clore au (plus tôt, dans l'intérêt du pays, le nouveau scrutin qui est rendu nécessaire. Ce scrutin demande un certain temps. Aucun des membres de l'Assemblée nationale ne réside à Versailles. (Interruptions et exclamations à l'extrême gauche et sur divers bancs à gauche.) Dans ces conditions, je demande à lAs-, semblée nationale de décider qu'elle va procéder immédiatement au second tour de scrutin. (Marques nombreuses d'appro- bation à droite et au centre. — Vives protestations à l'extrême gauche et sur divers bancs à gauche.) M. Camille Chautemps. Je demande la parole. M. le président. La parole est à M. Camille Chautemps. (Aux voix! — Bruit prolongé.) M. Camille Chautemps. Je renonce à la parole. M. Edsuard Herriot. Je demande la pa- role. M. le président. La parole est à M. Ilerriot. M. Edouard Herriot. Messieurs, nous venons d'ente

## TF-IDF + logistic regression classification

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report


In [None]:

df = pd.read_csv('/content/drive/MyDrive/Memoire_ENC/chambre_int_processed.csv') 

In [None]:
! pip install \
    --extra-index-url=https://pypi.nvidia.com \
    "cudf-cu12==25.2.*" "dask-cudf-cu12==25.2.*" "cuml-cu12==25.2.*" \
    "cugraph-cu12==25.2.*" "nx-cugraph-cu12==25.2.*" "cuspatial-cu12==25.2.*" \
    "cuproj-cu12==25.2.*" "cuxfilter-cu12==25.2.*" "cucim-cu12==25.2.*" \
    "pylibraft-cu12==25.2.*" "raft-dask-cu12==25.2.*" "cuvs-cu12==25.2.*"


In [None]:
import cudf
import cuml
from cuml.model_selection import train_test_split
from cuml.feature_extraction.text import TfidfVectorizer
from cuml.linear_model import LogisticRegression
from cuml.dask.common.utils import persist_across_workers
from cuml.preprocessing import normalize


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, accuracy_score
import joblib
import time
import cupy as cp
from cuml.linear_model import LogisticRegression as cuLogisticRegression

# Start timing
start_time = time.time()

# 1. Data splitting on CPU
print("Starting data preparation...")
X_train, X_test, y_train, y_test = train_test_split(
    df['Texte'], df['is_international_politics'],
    test_size=0.2, random_state=42, stratify=df['is_international_politics']
)

# 2. TF-IDF transformation on CPU
print("Performing TF-IDF feature extraction...")
tfidf = TfidfVectorizer(max_features=10000, min_df=2, ngram_range=(1, 2), use_idf=True)

# Transform training and test sets
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

# 3. Convert TF-IDF matrices to a format usable by cuML
print("Transferring data to GPU...")
X_train_gpu = cp.sparse.csr_matrix(X_train_tfidf)
X_test_gpu = cp.sparse.csr_matrix(X_test_tfidf)

# Ensure labels are in GPU-compatible format
y_train_gpu = cp.array(y_train)
y_test_gpu = cp.array(y_test)

# 4. Manual grid search parameters
print("Configuring grid search parameters...")
param_grid = {
    'C': [0.001, 0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0],
    'penalty': ['l1', 'l2', 'elasticnet'],
    'l1_ratio': [0.0, 0.2, 0.5, 0.8, 1.0],  # Only valid for elasticnet
    'tol': [1e-4, 1e-3],
    'fit_intercept': [True, False],
    'class_weight': [None, 'balanced']
}

# 5. Manual grid search with cross-validation
print("Starting manual GPU grid search...")

# Create cross-validation folds
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

# KFold only needs the number of samples, so split on row indices
# rather than densifying the sparse TF-IDF matrix in host memory
X_train_np = np.arange(X_train_tfidf.shape[0])
y_train_np = y_train.values if hasattr(y_train, 'values') else y_train

# Record results
results = []
best_score = -1
best_params = None
best_model = None

# Generate all parameter combinations
import itertools
keys = param_grid.keys()
param_combinations = [dict(zip(keys, values)) for values in itertools.product(*param_grid.values())]
total_combinations = len(param_combinations)

print(f"Evaluating {total_combinations} parameter combinations...")

for i, params in enumerate(param_combinations):
    # l1_ratio only matters for elasticnet. For l1/l2, evaluate a single
    # representative (l1_ratio == 0.0) and skip the redundant duplicates.
    if params['penalty'] != 'elasticnet' and params['l1_ratio'] != 0.0:
        continue

    print(f"Evaluating parameter set {i+1}/{total_combinations}: {params}")

    cv_scores = []

    # Perform cross-validation
    for train_idx, val_idx in kf.split(X_train_np):
        # Get training and validation sets for current fold
        X_train_fold = X_train_gpu[train_idx]
        y_train_fold = y_train_gpu[train_idx]
        X_val_fold = X_train_gpu[val_idx]
        y_val_fold = y_train_gpu[val_idx]

        # Try current parameter combination
        try:
            # Create and train model
            model = cuLogisticRegression(
                C=params['C'],
                penalty=params['penalty'],
                l1_ratio=params['l1_ratio'] if params['penalty'] == 'elasticnet' else None,
                tol=params['tol'],
                fit_intercept=params['fit_intercept'],
                class_weight=params['class_weight'],
                solver='qn',  # cuML only supports the qn solver
                max_iter=1000,
                verbose=0
            )

            model.fit(X_train_fold, y_train_fold)

            # Evaluate on validation set
            val_score = model.score(X_val_fold, y_val_fold)
            cv_scores.append(float(val_score))
        except Exception as e:
            print(f"  Error with parameter set: {str(e)}")
            cv_scores = [-1]  # Mark as invalid combination
            break

    # Compute mean cross-validation score
    mean_cv_score = np.mean(cv_scores) if cv_scores[0] != -1 else -1

    if mean_cv_score > best_score:
        best_score = mean_cv_score
        best_params = params.copy()

        # Retrain best model on full training set
        best_model = cuLogisticRegression(
            C=params['C'],
            penalty=params['penalty'],
            l1_ratio=params['l1_ratio'] if params['penalty'] == 'elasticnet' else None,
            tol=params['tol'],
            fit_intercept=params['fit_intercept'],
            class_weight=params['class_weight'],
            solver='qn',
            max_iter=1000,
            verbose=0
        )
        best_model.fit(X_train_gpu, y_train_gpu)

    # Record results
    results.append({
        'params': params,
        'mean_cv_score': mean_cv_score
    })

    print(f"  Mean CV score: {mean_cv_score:.4f}" +
          f" {'(current best)' if mean_cv_score == best_score else ''}")

# 6. Output best parameters
print("\nGrid search completed!")
print(f"Best parameters: {best_params}")
print(f"Best CV score: {best_score:.4f}")

# 7. Predict on test set with best model
y_pred_gpu = best_model.predict(X_test_gpu)

# Transfer GPU predictions back to CPU for evaluation
y_pred = cp.asnumpy(y_pred_gpu)
y_test_cpu = np.array(y_test)

# 8. Evaluate model performance
print("\nFinal test set evaluation results:")
print(classification_report(y_test_cpu, y_pred))
print(f"Test set accuracy: {accuracy_score(y_test_cpu, y_pred):.4f}")

# 9. Save model components
print("\nSaving model and feature extractor...")
try:
    # Save TF-IDF vectorizer
    joblib.dump(tfidf, 'tfidf_vectorizer.pkl')
    print("TF-IDF vectorizer saved as 'tfidf_vectorizer.pkl'")

    # Save best parameters
    joblib.dump(best_params, 'best_params.pkl')
    print("Best parameters saved as 'best_params.pkl'")

    # Save grid search results
    results_df = pd.DataFrame(results)
    results_df.to_csv('grid_search_results.csv', index=False)
    print("Grid search results saved as 'grid_search_results.csv'")

    # Save evaluation results
    evaluation_results = {
        'best_params': best_params,
        'best_cv_score': best_score,
        'test_accuracy': accuracy_score(y_test_cpu, y_pred),
        'classification_report': classification_report(y_test_cpu, y_pred, output_dict=True)
    }
    joblib.dump(evaluation_results, 'evaluation_results.pkl')
    print("Evaluation results saved as 'evaluation_results.pkl'")

    # Try saving the model (note: cuML models may not be directly serializable)
    try:
        joblib.dump(best_model, 'best_model.pkl')
        print("Best model saved as 'best_model.pkl'")
    except Exception as e:
        print(f"Unable to save cuML model: {str(e)}")
        print("Note: cuML models may not be directly serializable, please record parameters for retraining")
except Exception as e:
    print(f"Error during saving: {str(e)}")

# 10. Analyze feature importance
print("\nAnalyzing feature importance...")
try:
    feature_names = tfidf.get_feature_names_out()
    coefs = cp.asnumpy(best_model.coef_)[0]  # Convert back to CPU

    # Create feature importance DataFrame
    feature_importance = pd.DataFrame({
        'feature': feature_names,
        'importance': coefs
    })

    # Sort by importance and save
    feature_importance = feature_importance.sort_values('importance', ascending=False)
    feature_importance.to_csv('feature_importance.csv', index=False)
    print("Feature importance saved as 'feature_importance.csv'")

    # Output top positive and negative features
    print("\nTop positive features (international politics related):")
    for idx, row in feature_importance.head(20).iterrows():
        print(f"{row['feature']}: {row['importance']:.4f}")

    print("\nTop negative features (non-international politics related):")
    for idx, row in feature_importance.tail(20).iloc[::-1].iterrows():
        print(f"{row['feature']}: {row['importance']:.4f}")
except Exception as e:
    print(f"Unable to extract feature importance: {str(e)}")

# Compute and display total runtime
end_time = time.time()
total_time = end_time - start_time
print(f"\nTotal runtime: {total_time:.2f} seconds ({total_time/60:.2f} minutes)")

print("\nModel training and evaluation pipeline completed!")


Starting data preparation...
Performing TF-IDF feature extraction...
Transferring data to GPU...
Configuring grid search parameters...
Starting manual GPU grid search...
Evaluating 1080 parameter combinations...
Evaluating parameter set 1/1080: {'C': 0.001, 'penalty': 'l1', 'l1_ratio': 0.0, 'tol': 0.0001, 'fit_intercept': True, 'class_weight': None}
  Mean CV score: 0.8341 (current best)
Evaluating parameter set 2/1080: {'C': 0.001, 'penalty': 'l1', 'l1_ratio': 0.0, 'tol': 0.0001, 'fit_intercept': True, 'class_weight': 'balanced'}
  Mean CV score: 0.2994 
Evaluating parameter set 3/1080: {'C': 0.001, 'penalty': 'l1', 'l1_ratio': 0.0, 'tol': 0.0001, 'fit_intercept': False, 'class_weight': None}
  Mean CV score: 0.8341 (current best)
Evaluating parameter set 4/1080: {'C': 0.001, 'penalty': 'l1', 'l1_ratio': 0.0, 'tol': 0.0001, 'fit_intercept': False, 'class_weight': 'balanced'}
  Mean CV score: 0.8341 (current best)
Evaluating parameter set 5/1080: {'C': 0.001, 'penalty': 'l1', 'l1_ratio': 0.0, 'tol': 0.001, 'fit_intercept': True, 'class_weight': None}
  Mean CV score: 0.8341 (current best)
Evaluating parameter set 6/1080: {'C': 0.001, 'penalty': 'l1', 'l1_ratio': 0.0, 'tol': 0.001, 'fit_intercept': True, 'class_weight': 'balanced'}
  Mean CV score: 0.2994 
Evaluating parameter set 7/1080: {'C': 
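
The full Cartesian product of the grid has 1080 entries, but `l1_ratio` only has an effect when `penalty` is `'elasticnet'`, so many entries are redundant. As a minimal sketch (the helper name `valid_combinations` is hypothetical and not part of the notebook), the distinct combinations can be generated up front, which makes the progress counter reflect the number of models that are actually trained:

```python
import itertools

# Same grid as in the cell above
param_grid = {
    'C': [0.001, 0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0],
    'penalty': ['l1', 'l2', 'elasticnet'],
    'l1_ratio': [0.0, 0.2, 0.5, 0.8, 1.0],
    'tol': [1e-4, 1e-3],
    'fit_intercept': [True, False],
    'class_weight': [None, 'balanced'],
}


def valid_combinations(grid):
    """Yield each distinct parameter set exactly once (hypothetical helper)."""
    keys = list(grid)
    first_ratio = grid['l1_ratio'][0]
    for values in itertools.product(*grid.values()):
        params = dict(zip(keys, values))
        if params['penalty'] != 'elasticnet':
            # l1_ratio is irrelevant here: keep one representative only
            if params['l1_ratio'] != first_ratio:
                continue
            params['l1_ratio'] = None
        yield params


full_size = len(list(itertools.product(*param_grid.values())))
combos = list(valid_combinations(param_grid))
print(f"{full_size} raw combinations, {len(combos)} distinct")
```

Setting `l1_ratio` to `None` for the `l1` and `l2` penalties mirrors what the training loop already passes to the model in those cases.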

In [None]:
import joblib
import pandas as pd
import cupy as cp
from scipy.sparse import csr_matrix

tfidf = joblib.load('/content/drive/MyDrive/Memoire_ENC/classify_model/tfidf_vectorizer.pkl')

best_model = joblib.load('/content/drive/MyDrive/Memoire_ENC/classify_model/best_model.pkl')

best_params = joblib.load('/content/drive/MyDrive/Memoire_ENC/classify_model/best_params.pkl')

In [None]:
df_chambre

Unnamed: 0,Text,Date,Year,errors,total_words,error_positions,error_ratio,Chunks
0,.\n\nGHAMBREIDES DÉPUTÉS14LÉGISLlURE — SESSION...,1929-01-08,1929,4087,31746,"[9.45000945000945e-05, 0.000252000252000252, 0...",0.128741,. GHAMBREIDES DÉPUTÉS14LÉGISLlURE — SESSION OR...
1,.\n\nGHAMBREIDES DÉPUTÉS14LÉGISLlURE — SESSION...,1929-01-08,1929,4087,31746,"[9.45000945000945e-05, 0.000252000252000252, 0...",0.128741,"Ainsi a été consacrée, selon la volonté du pay..."
2,.\n\nGHAMBREIDES DÉPUTÉS14LÉGISLlURE — SESSION...,1929-01-08,1929,4087,31746,"[9.45000945000945e-05, 0.000252000252000252, 0...",0.128741,Presque tous les délégués d'Ex* trême-Orient c...
3,.\n\nGHAMBREIDES DÉPUTÉS14LÉGISLlURE — SESSION...,1929-01-08,1929,4087,31746,"[9.45000945000945e-05, 0.000252000252000252, 0...",0.128741,"Et nous qui sommes au soir de la vie, nous aur..."
4,.\n\nGHAMBREIDES DÉPUTÉS14LÉGISLlURE — SESSION...,1929-01-08,1929,4087,31746,"[9.45000945000945e-05, 0.000252000252000252, 0...",0.128741,"Auguste Durand, Herriot, Walter. 5e table : MM..."
...,...,...,...,...,...,...,...,...
166437,".\n\nSOMMAIRE 1. - Procès-vtrbait,-Motiond'ord...",1939-12-30,1939,8600,75591,"[3.968726435686788e-05, 0.00010583270495164769...",0.113770,Vantielcke. Vardelle. Vassal. Vaur. Vidal (Ray...
166438,".\n\nSOMMAIRE 1. - Procès-vtrbait,-Motiond'ord...",1939-12-30,1939,8600,75591,"[3.968726435686788e-05, 0.00010583270495164769...",0.113770,Gitton. G rcnier. Guyot. Ilonel. Langumicr. La...
166439,".\n\nSOMMAIRE 1. - Procès-vtrbait,-Motiond'ord...",1939-12-30,1939,8600,75591,"[3.968726435686788e-05, 0.00010583270495164769...",0.113770,Bonté. Brun. Catelas (Somme). Cornavin. Cosson...
166440,".\n\nSOMMAIRE 1. - Procès-vtrbait,-Motiond'ord...",1939-12-30,1939,8600,75591,"[3.968726435686788e-05, 0.00010583270495164769...",0.113770,Pillot. Pourtalet. Prachay. Prot (Louis) (Somm...


In [None]:
texts_to_classify_chambre = df_chambre['Chunks'].astype(str) 

In [None]:
texts_to_classify_senat = df_senat['Chunks'].astype(str)

In [None]:
X_new_tfidf_chambre = tfidf.transform(texts_to_classify_chambre)
X_new_gpu_chambre = cp.sparse.csr_matrix(X_new_tfidf_chambre.astype(cp.float32))

In [None]:
X_new_tfidf_senat = tfidf.transform(texts_to_classify_senat)
X_new_gpu_senat = cp.sparse.csr_matrix(X_new_tfidf_senat.astype(cp.float32))

In [None]:
y_pred_gpu_chambre = best_model.predict(X_new_gpu_chambre)
y_proba_gpu_chambre = best_model.predict_proba(X_new_gpu_chambre)[:, 1] 
y_pred_chambre = cp.asnumpy(y_pred_gpu_chambre)
y_proba_chambre = cp.asnumpy(y_proba_gpu_chambre)

In [None]:
y_pred_gpu_senat = best_model.predict(X_new_gpu_senat)
y_proba_gpu_senat = best_model.predict_proba(X_new_gpu_senat)[:, 1]
y_pred_senat = cp.asnumpy(y_pred_gpu_senat)
y_proba_senat = cp.asnumpy(y_proba_gpu_senat)

In [None]:
df_chambre['predicted_class'] = y_pred_chambre
df_chambre['probability'] = y_proba_chambre

In [None]:
df_senat['predicted_class'] = y_pred_senat
df_senat['probability'] = y_proba_senat

In [None]:
df_chambre

Unnamed: 0,Text,Date,Year,errors,total_words,error_positions,error_ratio,Chunks,predicted_class,probability
0,.\n\nGHAMBREIDES DÉPUTÉS14LÉGISLlURE — SESSION...,1929-01-08,1929,4087,31746,"[9.45000945000945e-05, 0.000252000252000252, 0...",0.128741,. GHAMBREIDES DÉPUTÉS14LÉGISLlURE — SESSION OR...,0,0.184607
1,.\n\nGHAMBREIDES DÉPUTÉS14LÉGISLlURE — SESSION...,1929-01-08,1929,4087,31746,"[9.45000945000945e-05, 0.000252000252000252, 0...",0.128741,"Ainsi a été consacrée, selon la volonté du pay...",1,0.527271
2,.\n\nGHAMBREIDES DÉPUTÉS14LÉGISLlURE — SESSION...,1929-01-08,1929,4087,31746,"[9.45000945000945e-05, 0.000252000252000252, 0...",0.128741,Presque tous les délégués d'Ex* trême-Orient c...,0,0.414410
3,.\n\nGHAMBREIDES DÉPUTÉS14LÉGISLlURE — SESSION...,1929-01-08,1929,4087,31746,"[9.45000945000945e-05, 0.000252000252000252, 0...",0.128741,"Et nous qui sommes au soir de la vie, nous aur...",0,0.030369
4,.\n\nGHAMBREIDES DÉPUTÉS14LÉGISLlURE — SESSION...,1929-01-08,1929,4087,31746,"[9.45000945000945e-05, 0.000252000252000252, 0...",0.128741,"Auguste Durand, Herriot, Walter. 5e table : MM...",0,0.010958
...,...,...,...,...,...,...,...,...,...,...
166437,".\n\nSOMMAIRE 1. - Procès-vtrbait,-Motiond'ord...",1939-12-30,1939,8600,75591,"[3.968726435686788e-05, 0.00010583270495164769...",0.113770,Vantielcke. Vardelle. Vassal. Vaur. Vidal (Ray...,0,0.002777
166438,".\n\nSOMMAIRE 1. - Procès-vtrbait,-Motiond'ord...",1939-12-30,1939,8600,75591,"[3.968726435686788e-05, 0.00010583270495164769...",0.113770,Gitton. G rcnier. Guyot. Ilonel. Langumicr. La...,0,0.006207
166439,".\n\nSOMMAIRE 1. - Procès-vtrbait,-Motiond'ord...",1939-12-30,1939,8600,75591,"[3.968726435686788e-05, 0.00010583270495164769...",0.113770,Bonté. Brun. Catelas (Somme). Cornavin. Cosson...,0,0.006476
166440,".\n\nSOMMAIRE 1. - Procès-vtrbait,-Motiond'ord...",1939-12-30,1939,8600,75591,"[3.968726435686788e-05, 0.00010583270495164769...",0.113770,Pillot. Pourtalet. Prachay. Prot (Louis) (Somm...,0,0.001753


In [None]:
# Step 1: Filter records where predicted_class=1
#mask = df_chambre["predicted_class"] == 1
#filtered_df = df_chambre[mask].copy()

# Step 2: Remove duplicates based on the Text column (keep the first occurrence)
#dedup_df = filtered_df.drop_duplicates(subset="Text", keep="first")

# Step 3: Extract target columns
#final_cols = ["Text", "Date", "Year", "errors", "total_words", "error_positions", "error_ratio"]
#final_df_chambre = dedup_df[final_cols]

# Step 4: Verify the results
#print(f"Number of records after deduplication: {len(final_df_chambre)}")
#print(final_df_chambre.head())

Number of records after deduplication: 799
                                                   Text        Date  Year  \
4     .\n\nSOMMAIRE 1. - Ouverture de la session ext...  1929-07-31  1929   
791   .\n\nCHAMBREDESDÉPUTÉSttss16° LÉGISLATURESlOîs...  1937-11-25  1937   
964   .\n\nCHAMBREDESDÉPUTÉS14e LÉGISLATURE - SESSIO...  1931-05-05  1931   
1200  .\n\ntCHAMBREDESDÉPUTÉS-T148 LÉGISLATURE - SES...  1930-07-01  1930   
1383  .\n\nIlleséance du Jeudi 29 Novembre 1034.\n\n...  1934-11-29  1934   

      errors  total_words                                    error_positions  \
4       7153        63465  [4.7270148900969035e-05, 0.0002048373119041991...   
791    11454        78946  [2.5333772452055837e-05, 3.800065867808375e-05...   
964     8339        57476  [3.479713271626418e-05, 0.00013918853086505672...   
1200    5893        65993  [3.030624460170018e-05, 6.061248920340036e-05,...   
1383   11624        86646  [5.7706068370149805e-05, 8.078849571820973e-05...   

      error_ratio  
4        0.112708  
791 

The whole-document strategy above failed, so the next cell expands the context around each positive chunk instead of taking the entire document.

In [None]:
import pandas as pd


def get_surrounding_chunks(row, df, window_size=5):
    """Merge a chunk with up to `window_size` neighbours on each side.

    Expansion stops at a different Date or at another positive chunk.
    Assumes `df` has the default RangeIndex, so labels equal positions.
    """
    current_index = row.name
    target_date = row["Date"]

    # Walk backwards; chunks are collected nearest-first
    before = []
    upper_bound = max(0, current_index - window_size)
    for i in range(current_index - 1, upper_bound - 1, -1):
        if (df.at[i, "Date"] != target_date) or (df.at[i, "predicted_class"] == 1):
            break
        before.append(df.at[i, "Chunks"])

    # Walk forwards; chunks are already in document order
    after = []
    lower_bound = min(len(df) - 1, current_index + window_size)
    for i in range(current_index + 1, lower_bound + 1):
        if (df.at[i, "Date"] != target_date) or (df.at[i, "predicted_class"] == 1):
            break
        after.append(df.at[i, "Chunks"])

    # Restore document order: preceding chunks, the chunk itself, following chunks
    merged = before[::-1] + [row["Chunks"]] + after
    return "\n".join(merged)


target_rows = df_chambre[df_chambre["predicted_class"] == 1].copy()
target_rows["merged_chunks"] = target_rows.apply(
    lambda row: get_surrounding_chunks(row, df_chambre, window_size=0),
    axis=1
)

target_rows_senat = df_senat[df_senat["predicted_class"] == 1].copy()
target_rows_senat["merged_chunks"] = target_rows_senat.apply(
    lambda row: get_surrounding_chunks(row, df_senat, window_size=0),
    axis=1
)


final_df_chambre = target_rows[["Date", "Year", "merged_chunks"]]
final_df_chambre = final_df_chambre.rename(columns={"merged_chunks": "Chunks"})

final_df_senat = target_rows_senat[["Date", "Year", "merged_chunks"]]
final_df_senat = final_df_senat.rename(columns={"merged_chunks": "Chunks"})
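
To make the context-expansion rule easier to check in isolation, here is a small sketch on plain Python records rather than a dataframe. The function name `expand_context` and the toy rows are hypothetical; the stopping rule (a different `Date`, or a neighbour that is itself predicted positive) follows the cell above.

```python
def expand_context(rows, i, window_size=1):
    """Join the chunk at position i with up to `window_size` neighbours
    on each side, in document order (illustrative sketch)."""
    date = rows[i]["Date"]

    # Preceding neighbours, collected nearest-first
    before = []
    for j in range(i - 1, max(-1, i - window_size - 1), -1):
        if rows[j]["Date"] != date or rows[j]["predicted_class"] == 1:
            break
        before.append(rows[j]["Chunks"])

    # Following neighbours, already in document order
    after = []
    for j in range(i + 1, min(len(rows), i + window_size + 1)):
        if rows[j]["Date"] != date or rows[j]["predicted_class"] == 1:
            break
        after.append(rows[j]["Chunks"])

    return "\n".join(before[::-1] + [rows[i]["Chunks"]] + after)


# Toy data: one positive chunk ("c") surrounded by negatives,
# with the last row belonging to a different sitting
rows = [
    {"Date": "1929-01-08", "predicted_class": 0, "Chunks": "a"},
    {"Date": "1929-01-08", "predicted_class": 0, "Chunks": "b"},
    {"Date": "1929-01-08", "predicted_class": 1, "Chunks": "c"},
    {"Date": "1929-01-08", "predicted_class": 0, "Chunks": "d"},
    {"Date": "1929-01-10", "predicted_class": 0, "Chunks": "e"},
]

print(expand_context(rows, 2, window_size=2))
print(expand_context(rows, 2, window_size=0))
```

With `window_size=0`, as used in the cell above, each positive chunk is returned unchanged, so the exported files contain only the positive chunks themselves.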



In [None]:
#final_df_chambre.to_csv('/content/drive/MyDrive/Memoire_ENC/final_df_chambre.csv', index=False)
final_df_chambre.to_csv('/content/final_df_chambre_smaller.csv', index=False)
final_df_senat.to_csv('/content/final_df_senat_smaller.csv', index=False)

In [None]:
final_df_chambre['Chunks'].iloc[2]

'-M. Maurice Petsche. Ce sont les communistes qui les fomentent ! M. Marcel Cachin. Actuellement battent leur plein trois grèves très importantes qui totalisent plus de 50.000 ouvriers: la grève-Sestisseurs d\'Halluin, celle des mineursclela Loire, celle des mineurs du bassin du Gard, en attendant d\'autres qui peuvent fcurgir demain. Dans ces conflits, le capitalisme est d\'ailleurs assure d\'avoir pour lui toute la force, toute la puissance gouvernementale.ril\'a présentement. Plusieurs milliers de gardes mobiles sillonnent le bassin de la Loire, le bassin du ÎGard, les rues d\'Halluin la rouge. 1Les autorités-locales déclarent que le Gouvernement est décidé à briser ces grèves, et je n\'ai pas appris sans une certaine Surprise un fait avoué par M. Soulié. Ce si\'est pas le fait qu\'il a signalé qui m etonne, c\'est son aveu. Des milliers de mineurs en grève protesJent contre l\'insuffisance de leurs salaires. 4-tuandils sont allés à la mairie de SaintEtienne réclamer les diverses\' 