---------------------------------------------------------------
## III. PRÉTRAITEMENT DES DONNÉES
--------------------------------------------------------------

1. Nettoyer caractères spéciaux, url, hashtags, ...
2. Dédoublonner avec Sequence Matcher


Principe : (1) Au-delà des doublons parfaits que l'on peut supprimer avec deduplicate(), on a des preque-doublons, qui se distinguent par qqs caractères spéciaux, un commentaire ou @username après retweet. (2) Deux textes ne peuvent pas être des doublons si leurs longueurs sont trop différentes (seuil empirique) (3) au-delà du traitement de 45000 tweets, le processing est soit très long (plusieurs heures), sinon on est limité par la capacité RAM.

* (a) Éliminer doublons parfaits.
* (b) Nettoyer l'ensemble
* (c) Couper en deux parties selon longueur.
* (d) Dédoublonner avec SequenceMatcher
* (e) Concatener
* (f) Extraire une partie au milieu.
* (g) Dédoublonner cette partie.
* (h) Reconcatener
----------------------------------------------------------------

In [242]:
import pandas as pd
import numpy as np
import time

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from difflib import SequenceMatcher

[nltk_data] Downloading package punkt to /Users/thevault/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/thevault/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [243]:
# Import du jeu de données
df_all = pd.read_csv('../data/raw/df_full_source.csv')

In [244]:
# Affichage des informations du jeu de données
df_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102661 entries, 0 to 102660
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   Unnamed: 0.1  102661 non-null  int64 
 1   Unnamed: 0    102661 non-null  int64 
 2   text          102661 non-null  object
 3   label         102661 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 3.1+ MB


In [245]:
# On supprime les colonnes inutiles et renomme les autres
df_all.drop('Unnamed: 0.1',axis = 1, inplace = True)
df_all = df_all.reset_index()
df_all.columns = df_all.columns.str.replace("Unnamed: 0", "id_origin")
df_all.head(3)

Unnamed: 0,index,id_origin,text,label
0,0,34366,"Local Charlotte, NC news station WSOCTV is rep...",1
1,1,41656,The tsunami has started President Obama s Keny...,1
2,2,26726,The only reality show Donald Trump should have...,1


---
### 1. Premier traitement des données en deux parties
---

#### 1.1. Sélectionner les extractions à traiter (longueur max = env. 45 000, limite RAM)

In [246]:
# Éliminer les doublons parfaits
# -------------------------------------
df_all = df_all.drop_duplicates(subset=['text'], keep='first')
# 102661 - 76319 = 26342 doublons bruts
print('df dédoublonné :', len(df_all))
df_all.loc[:,'sign_count'] = df_all['text'].apply(lambda x : len(x))
df_all.head(3)

df dédoublonné : 85924


Unnamed: 0,index,id_origin,text,label,sign_count
0,0,34366,"Local Charlotte, NC news station WSOCTV is rep...",1,2302
1,1,41656,The tsunami has started President Obama s Keny...,1,1598
2,2,26726,The only reality show Donald Trump should have...,1,2001


#### 1.2. Nettoyage du texte d'origine et création d'une colonne de text_prepr(ocessing)

In [247]:
df = df_all.copy()

In [248]:
# Nettoyage : on crée une colonne 'text_prepr' et on laisse le 'text' original tranquille
# ----------------------------------------------------------------------------------------
df['text_prepr'] = df.loc[:,'text'].str.replace('WASHINGTON', ' ')
df.loc[:,'text_prepr'] = df['text_prepr'].str.replace(r'Reuters|reuters|REUTERS', ' ', regex=True)
df.loc[:,'text_prepr'] = df['text_prepr'].str.replace(r'Ä¶|Äô|Äù|Äú|Å©|äî|Äî', ' ', regex=True) #lié à la traduction depuis html. il y a probablement une meilleure méthode.
df.loc[:,'text_prepr'] = df['text_prepr'].str.replace(r'\bu\b', ' ', regex=True) # supprimer ou remplacer les 'u' qui représentent 'you'
df.loc[:,'text_prepr'] = df['text_prepr'].str.replace(r'\bs\b', ' ', regex=True) # supprimer les 's' qui représenent 'is' ou 'has'
df.loc[:,'text_prepr'] = df['text_prepr'].str.replace(r'\br\b', ' ', regex=True) # supprimer ou remplacer les 'r' qui représentent 'are'

# supprimer url, e-mail, special characters
import re
def remove_urls(text):
    if isinstance(text, str):
        url_pattern = re.compile(r'http[s]?://\S+|www\.\S+')
        return url_pattern.sub('', text)

def remove_Emails(text):
    if isinstance(text, str):
        Email_pattern = re.compile(r'([a-zA-Z0-9_\.-]+)@([a-zA-Z0-9_\.-]+)\.([a-zA-Z]{2,5})')
        return Email_pattern.sub('', text)

def remove_gobbledegook(text):
    if isinstance(text, str):
    # Supprimer les strings long composés de chiffres et de lettres (sans espaces, des codes d'identification probablement)
        GobblGook = re.compile(r'\b(?=[a-zA-Z0-9]*[A-Z])(?=[a-zA-Z0-9]*[a-z])(?=[a-zA-Z0-9]*\d)[a-zA-Z0-9]{8,12}\b')
        text = GobblGook.sub(' ', text)
        return text

def remove_ATusername(text):
    if isinstance(text, str):
    # Supprimer les @username Twitter
        ATusername = re.compile(r'@([a-zA-Z0-9_\.-]+)')
        text = ATusername.sub(' ', text)
        return text

def remove_hashtag(text) :
    if isinstance(text, str):
    # Supprimer hastags par ex. #blizzard2016
        ATusername = re.compile(r'#([a-zA-Z0-9_\.-]+)')
        text = ATusername.sub(' ', text)
        return text

def remove_speCar_exclu_comma_dot(text):
    if isinstance(text, str):
    # Supprimer tous les caractères spéciaux et ponctuation sauf points et virgules (en cas de sentence tokenization)
        remove_punctuation = re.compile(r'[^\w\s]')
        text = remove_punctuation.sub(' ', text)
        return text

df.loc[:,'text_prepr'] = df['text_prepr'].apply(remove_urls)
df.loc[:,'text_prepr'] = df['text_prepr'].apply(remove_Emails)
df.loc[:,'text_prepr'] = df['text_prepr'].apply(remove_gobbledegook)
df.loc[:,'text_prepr'] = df['text_prepr'].apply(remove_ATusername)
df.loc[:,'text_prepr'] = df['text_prepr'].apply(remove_hashtag)
df.loc[:,'text_prepr'] = df['text_prepr'].apply(remove_speCar_exclu_comma_dot)
df.loc[:,'text_prepr'] = df['text_prepr'].str.replace('.', '. ') # parfois certaines phrases successives collent l'une à l'autre.
df.loc[:,'text_prepr'] = df['text_prepr'].str.replace('  ', ' ')
df.loc[:,'text_prepr'] = df['text_prepr'].str.replace('   ', ' ')

#### 1.3. Fonction SequenceMatcher pour calculer similarités fortes entre textes

In [249]:
df = df.sort_values(by='sign_count')
df.head(3)
df_1_2=df[:43000].reset_index(drop=True)
df_2_2=df[43000:].reset_index(drop=True)
print(len(df_1_2))
print(len(df_2_2))
df_1_2.head(3)


43000
42924


Unnamed: 0,index,id_origin,text,label,sign_count,text_prepr
0,27,37083,,1,1,
1,5381,34071,,1,2,
2,48983,40989,Ouch!,1,5,Ouch


In [250]:
def compute_similarity(text1, text2):
    """Calcule la similarité de deux textes en utilisant la distance de Levenshtein."""
    return SequenceMatcher(None, text1, text2).ratio()

def flag_similar_posts(df, threshold=0.9,max_features=15000):
    """Flague les lignes avec une similarité lexicale supérieure au seuil donné."""

    # Initialisation des colonnes nécessaires
    df.loc[:,'has_similarity'] = 0
    df.loc[:,'similar_with'] = None
    df.loc[:,'similarity_group'] = None
    df.loc[:,'similarity_score'] = 0.0

    # Calcul du TF-IDF
    vectorizer = TfidfVectorizer(max_features=max_features).fit_transform(df['text_prepr'])
    vectors = vectorizer.toarray()

    # Calcul de la similarité cosinus
    cosine_sim = cosine_similarity(vectors)

    group_counter = 0
    visited = np.zeros(len(df), dtype=bool)

    for i in range(len(df)):
        if visited[i]:
            continue

        # Identifie les textes similaires
        similar_indices = np.where(cosine_sim[i] >= threshold)[0]
        if len(similar_indices) > 1:
            group_counter += 1

            for j in similar_indices:
                if i != j:
                    df.at[j, 'has_similarity'] = 1
                    df.at[j, 'similar_with'] = df.at[i, 'id_origin']
                    df.at[j, 'similarity_score'] = cosine_sim[i][j]
                    df.at[j, 'similarity_group'] = group_counter
                    visited[j] = True

            # Pour le premier texte lui-même
            df.at[i, 'has_similarity'] = 1
            df.at[i, 'similarity_score'] = 1.0  # Car il est identique à lui-même
            df.at[i, 'similarity_group'] = group_counter

    return df

#### 1.4. Exécuter sequence matcher et identifier les doublons pour les DEUX moitiés

**1 sur 2** : # PREMIERE MOITIÉ


In [251]:
# Seq Matcher 1 sur 2 : threshold = 0.9
# ---------------------------------------
start = time.time()
flag_similar_posts(df_1_2)
df.head(3)
stop = time.time()
print(f"Temps d'exécution : {stop-start} secondes")


Temps d'exécution : 105.74881100654602 secondes


In [252]:
print('Total doublons :', len(df_1_2[(df_1_2['has_similarity'] == 1)]))
print('À éliminer :', len(df_1_2[(df_1_2['has_similarity'] == 1) & ~(df_1_2['similar_with'].isna())  ]))
print('À garder :', len(df_1_2[(df_1_2['has_similarity'] == 1) & (df_1_2['similar_with'].isna())  ]))
print('Reste : ', len(df_1_2)-len(df_1_2[(df_1_2['has_similarity'] == 1) & ~(df_1_2['similar_with'].isna())  ]))

Total doublons : 10755
À éliminer : 8117
À garder : 2638
Reste :  34883


In [253]:
# df 1 sur 2 avec uniquement les doublons
# --------------------------------------------------------
total_doublons_1_2 = df_1_2[(df_1_2['has_similarity'] == 1)].sort_values(by='similarity_group')
print(len(total_doublons_1_2))
total_doublons_1_2[100:150]

10755


Unnamed: 0,index,id_origin,text,label,sign_count,text_prepr,has_similarity,similar_with,similarity_group,similarity_score
8212,20287,86828,Vehicle blamed for deadly California wildfire ...,0,95,Vehicle blamed for deadly California wildfire,1,86820.0,31,1.0
36205,9262,86826,Vehicle blamed for deadly California wildfire ...,0,186,Vehicle blamed for deadly California wildfire ...,1,86820.0,31,1.0
8183,95990,86833,Vehicle blamed for deadly California wildfire ...,0,95,Vehicle blamed for deadly California wildfire,1,86820.0,31,1.0
8090,22284,86832,Vehicle blamed for deadly California wildfire ...,0,95,Vehicle blamed for deadly California wildfire,1,86820.0,31,1.0
886,5758,86820,Vehicle blamed for deadly California wildfire,0,45,Vehicle blamed for deadly California wildfire,1,,31,1.0
5013,19258,86845,Vehicle blamed for deadly California wildfire ...,0,82,Vehicle blamed for deadly California wildfire ...,1,86820.0,31,0.95467
6384,92362,86843,Vehicle blamed for deadly California wildfire ...,0,88,Vehicle blamed for deadly California wildfire,1,86820.0,31,1.0
894,44016,46771,sevenfigz has a crush: http://t.co/20B3PnQxMD,0,45,sevenfigz has a crush,1,,32,1.0
1233,59855,46773,tiffanyfrizzell has a crush: http://t.co/RaF73...,1,51,tiffanyfrizzell has a crush,1,46771.0,32,1.0
984,90601,46752,samel_samel has a crush: http://t.co/tBsTk5VqU0,1,47,samel_samel has a crush,1,46771.0,32,1.0


In [254]:
# df 1 sur 2 sans doublons
# ------------------------------------
df_1_dd = df_1_2[df_1_2['similar_with'].isna()]
print(len(df_1_dd))
df_1_dd.sort_values(by='id_origin').head(3)

34883


Unnamed: 0,index,id_origin,text,label,sign_count,text_prepr,has_similarity,similar_with,similarity_group,similarity_score
42486,72814,403,WASHINGTON (Reuters) - The No. 2 U.S. Senate R...,0,259,The No 2 U S Senate Republican John Cornyn s...,0,,,0.0
42351,569,428,WASHINGTON (Reuters) - The Republican tax bill...,0,257,The Republican tax bill would generate a net...,0,,,0.0
42222,10109,476,WASHINGTON (Reuters) - U.S. President Donald T...,0,256,U S President Donald Trump will host Libyan ...,0,,,0.0


**2 sur 2** : # DEUXIEME MOITIÉ

In [255]:
# seq Matcher 2 sur 2 : threshold = 0.85
start = time.time()
flag_similar_posts(df_2_2)
df.head(3)
stop = time.time()
print(f"Temps d'exécution : {stop-start} secondes")

Temps d'exécution : 119.02753520011902 secondes


In [256]:
print('Total doublons :', len(df_2_2[(df_2_2['has_similarity'] == 1)]))
print('À éliminer :', len(df_2_2[(df_2_2['has_similarity'] == 1) & ~(df_2_2['similar_with'].isna())  ]))
print('À garder :', len(df_2_2[(df_2_2['has_similarity'] == 1) & (df_2_2['similar_with'].isna())  ]))
print('Reste : ', len(df_2_2)-len(df_2_2[(df_2_2['has_similarity'] == 1) & ~(df_2_2['similar_with'].isna())  ]))

Total doublons : 2220
À éliminer : 1361
À garder : 859
Reste :  41563


In [257]:
# df 2 sur 2 avec uniquement les doublons
# --------------------------------------------------------
total_doublons_2_2 = df_2_2[(df_1_2['has_similarity'] == 1)].sort_values(by='similarity_group')
print(len(total_doublons_2_2))
total_doublons_2_2.head(3)

10739


  total_doublons_2_2 = df_2_2[(df_1_2['has_similarity'] == 1)].sort_values(by='similarity_group')


Unnamed: 0,index,id_origin,text,label,sign_count,text_prepr,has_similarity,similar_with,similarity_group,similarity_score
37,10595,87731,Food For The Poor Partners with The Sandals Fo...,0,266,Food For The Poor Partners with The Sandals Fo...,1,,3,1.0
5582,21410,86439,the planet will reach the crucial threshold of...,0,374,the planet will reach the crucial threshold of...,1,86437.0,4,1.0
93,5936,82699,Would-be looter in Hurricane Michael-ravaged F...,0,266,Would be looter in Hurricane Michael ravaged F...,1,82591.0,6,0.975797


In [258]:
# df 2 sur 2 sans doublons
# ------------------------------------
df_2_dd = df_2_2[df_2_2['similar_with'].isna()]
print(len(df_2_dd))
df_2_dd.sort_values(by='id_origin').head(3)

41563


Unnamed: 0,index,id_origin,text,label,sign_count,text_prepr,has_similarity,similar_with,similarity_group,similarity_score
39494,7064,0,WASHINGTON (Reuters) - The head of a conservat...,0,4659,The head of a conservative Republican factio...,0,,,0.0
37872,76498,1,WASHINGTON (Reuters) - Transgender people will...,0,4077,Transgender people will be allowed for the f...,0,,,0.0
30824,35743,2,WASHINGTON (Reuters) - The special counsel inv...,0,2789,The special counsel investigation of links b...,0,,,0.0


In [259]:
# Quelques vérifications (dans cette partie, il y a deux moitiés du df)
print(len(df_1_dd))
print(len(df_2_dd))
print(len(df_2_dd)+len(df_1_dd))
#df_1.head(3)
df_2_dd.isna().sum()

34883
41563
76446


index                   0
id_origin               0
text                    0
label                   0
sign_count              0
text_prepr              0
has_similarity          0
similar_with        41563
similarity_group    40704
similarity_score        0
dtype: int64

---
### 2. Traitement spécifique de la partie centrale et reconcaténation de l'ensemble
---

Afin de s'assurer d'un traitement optimal des données, on procère à la même opération mais cette fois ci sur la partie centrale du jeu de données complet.

In [260]:
# D'abord réunir les deux moitiés
# ---------------------------------------
dfWhole=pd.concat([df_1_dd, df_2_dd])
dfWhole=dfWhole.reset_index(drop=True)
print(len(dfWhole))

display(dfWhole['text'].duplicated().sum())
print(76178/3)

76446


0

25392.666666666668


In [261]:
# Créer 3 tiers (attn, tout doit rester trié selon sign_count)
#-----------------------------------------------
df_1tiers = dfWhole[0:25392].reset_index(drop=True)
df_CENTER = dfWhole[25392:50784].reset_index(drop=True)
df_3tiers = dfWhole[50784:].reset_index(drop=True)
print(len(df_1tiers))
print(len(df_CENTER))
print(len(df_3tiers))

25392
25392
25662


In [262]:
# Avant SequenceMatcher, remettre les colonnes à zéro
# ------------------------------------------------
df_CENTER.loc[:,'has_similarity'] = 0
df_CENTER.loc[:,'similar_with'] = None
df_CENTER.loc[:,'similarity_group'] = None
df_CENTER.loc[:,'similarity_score'] = 0.0

In [263]:
df_CENTER.head(3)

Unnamed: 0,index,id_origin,text,label,sign_count,text_prepr,has_similarity,similar_with,similarity_group,similarity_score
0,32149,72058,RT @MiamiDadePD RT @MayorGimenez: Residents sh...,0,140,RT RT Residents should have 3 days of water ...,0,,,0.0
1,25496,61847,Heroic efforts of Fort Carson MedEvac company ...,0,140,Heroic efforts of Fort Carson MedEvac company ...,0,,,0.0
2,49592,48792,Fylde Building set to be flattened: One of Pre...,1,140,Fylde Building set to be flattened One of Pres...,0,,,0.0


In [264]:
# SequenceMatcher uniquement pour CENTER (threshold = 0.9)
# --------------------------------------------------------
start = time.time()
flag_similar_posts(df_CENTER)
#df_exp3.head(3)
stop = time.time()
print(f"Temps d'exécution : {stop-start} secondes")


Temps d'exécution : 35.49111223220825 secondes


In [265]:
print('Total doublons :', len(df_CENTER[(df_CENTER['has_similarity'] == 1)]))
print('À éliminer :', len(df_CENTER[(df_CENTER['has_similarity'] == 1) & ~(df_CENTER['similar_with'].isna())  ]))
print('À garder :', len(df_CENTER[(df_CENTER['has_similarity'] == 1) & (df_CENTER['similar_with'].isna())  ]))
print('Reste : ', len(df_CENTER)-len(df_CENTER[(df_CENTER['has_similarity'] == 1) & ~(df_CENTER['similar_with'].isna())  ]))

Total doublons : 484
À éliminer : 251
À garder : 233
Reste :  25141


In [266]:
# df center avec uniquement les doublons
# --------------------------------------------------------
total_doublons_CENTER = df_CENTER[(df_CENTER['has_similarity'] == 1)].sort_values(by='similarity_group')
print(len(total_doublons_CENTER))
total_doublons_CENTER.head(3)

484


Unnamed: 0,index,id_origin,text,label,sign_count,text_prepr,has_similarity,similar_with,similarity_group,similarity_score
334,101226,76988,Retweeted WIRED (@WIRED):\n\nIt's not easy liv...,0,140,Retweeted WIRED \n\nIt not easy living in Hou...,1,,1,1.0
13107,41703,78004,WIRED: It‚Äôs not easy living in Houston witho...,0,309,WIRED It not easy living in Houston without a...,1,76988.0,1,0.912651
705,15634,71577,Health checklist: What to buy before a hurrica...,0,141,Health checklist What to buy before a hurrican...,1,,2,1.0


In [267]:
# df center sans doublons
# ------------------------------------
df_CENTER_dd = df_CENTER[df_CENTER['similar_with'].isna()]
print(len(df_CENTER_dd))
df_CENTER_dd.sort_values(by='id_origin').head(3)

25141


Unnamed: 0,index,id_origin,text,label,sign_count,text_prepr,has_similarity,similar_with,similarity_group,similarity_score
19848,88123,7,The following statements were posted to the ve...,0,856,The following statements were posted to the ve...,0,,,0.0
18029,83485,8,The following statements were posted to the ve...,0,632,The following statements were posted to the ve...,0,,,0.0
14926,28869,9,WASHINGTON (Reuters) - Alabama Secretary of St...,0,408,Alabama Secretary of State John Merrill said...,0,,,0.0


In [268]:
# Concaténation finale
# ----------------------------------------------

df_Final_dd = pd.concat([df_1tiers, df_CENTER, df_3tiers], axis=0).reset_index(drop=True)
#dFFinal.to_excel('dFFinal.xlsx', index=False)
print(len(df_Final_dd))
df_Final_dd.head(3)

76446


Unnamed: 0,index,id_origin,text,label,sign_count,text_prepr,has_similarity,similar_with,similarity_group,similarity_score
0,27,37083,,1,1,,0,,,0.0
1,5381,34071,,1,2,,0,,,0.0
2,48983,40989,Ouch!,1,5,Ouch,0,,,0.0


In [269]:
# Vérification et enregistrement
" ----------------------------------------------------------------------------------"
df_Final_dd.duplicated().sum()

0

In [270]:
# Synthèse
# -----------------------------------------------
print("Longueur df origine total : 102661")

print("Longueur df dédoublonné :", len(df_Final_dd))

print("Réduction totale de :", np.round((102661-len(df_Final_dd))/102661*100, 2), " %")

Longueur df origine total : 102661
Longueur df dédoublonné : 76446
Réduction totale de : 25.54  %


In [286]:
# A présent, on ne conserve que les posts dont la taille est située entre 50 et 3000 caractères.
filtered_df = df_Final_dd[(df_Final_dd['text'].str.len() >= 50) & (df_Final_dd['text'].str.len() <= 3000)]

In [287]:
print("Longueur df filtré :", len(filtered_df))

Longueur df filtré : 65202


In [288]:
# Réinitialisation de l'index du DataFrame
filtered_df.reset_index(drop=True, inplace=True)

# Utilisation de .loc pour créer une nouvelle colonne 'id_50_3000' avec les valeurs de l'index
filtered_df.loc[:, 'id_50_3000'] = filtered_df.index

# Réorganisation des colonnes pour mettre 'id_50_3000' en premier si nécessaire
filtered_df = filtered_df[['id_50_3000', 'id_origin', 'text', 'label', 'sign_count', 'text_prepr', 'has_similarity', 'similar_with', 'similarity_group', 'similarity_score']]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_df.loc[:, 'id_50_3000'] = filtered_df.index


In [289]:
filtered_df.head()

Unnamed: 0,id_50_3000,id_origin,text,label,sign_count,text_prepr,has_similarity,similar_with,similarity_group,similarity_score
0,0,55929,Dank is it a tornado n Raleigh car blowincg n ...,0,50,Dank is it a tornado n Raleigh car blowincg n ...,0,,,0.0
1,1,51817,@smoak_queen 'I'm going to be in so much troub...,1,50,I m going to be in so much trouble,0,,,0.0
2,2,51709,@CSAresu American Tragedy http://t.co/SDmrzG...,0,50,American Tragedy,0,,,0.0
3,3,47897,How to Survive a Dust Storm http://t.co/0yL3yT...,0,50,How to Survive a Dust Storm,0,,,0.0
4,4,50836,I SCREAMED 'WHATS A CHONCe' http://t.co/GXYivs...,1,50,I SCREAMED WHATS A CHONCe,0,,,0.0


In [290]:
# Enregistrement en CSV
filtered_df.to_csv("../data/processed/processed_data.csv", index=False)

# Enregistrement en Excel
filtered_df.to_excel("../data/processed/processed_data.xlsx", index=False)