Notebook objectives:

- Clean the data to produce a single file containing the sample that will be used.
- Check for missing or invalid values.
- Keep only the data the database needs.
- Check that the data types match the planned database schema.

[DrawDB link](https://drawdb.vercel.app/editor?shareId=e6c18b8ae53063fa1dfa9cc8a849605f)

In [2]:
import matplotlib.pyplot as plt
from rich import print_json, print
from tqdm import tqdm
import pandas as pd
import glob
import os

Retrieve the list of available dataset files

In [None]:
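def get_csv_list(part_number):
    # Accept either a single part number or a list of part numbers
    parts = part_number if isinstance(part_number, list) else [part_number]
    csv_files = []
    for part in parts:
        # Accumulate the files of every part instead of keeping only the last one
        csv_files.extend(glob.glob(f"./data/part_{part}/*.csv"))
    return csv_files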
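
`glob.glob` returns files in arbitrary order, which is why the progress output at the end of this notebook lists the chunks out of sequence. If a reproducible order matters, the listing can be sorted by chunk number. The sketch below is only an illustration: `chunk_number`, `sorted_csv_list` and the `data_dir` parameter are hypothetical names, and it assumes file names end in `_<number>.csv`, as in `may_july_chunk_16.csv`.

```python
import re
from pathlib import Path


def chunk_number(path):
    """Return the trailing integer of a name like 'may_july_chunk_16.csv', or -1 if absent."""
    match = re.search(r"_(\d+)\.csv$", Path(path).name)
    return int(match.group(1)) if match else -1


def sorted_csv_list(part_number, data_dir="./data"):
    """List the CSV files of one part folder, ordered by chunk number (hypothetical helper)."""
    folder = Path(data_dir) / f"part_{part_number}"
    # Sort numerically so that chunk_2 comes before chunk_10
    return sorted((str(p) for p in folder.glob("*.csv")), key=chunk_number)
```

A numeric key is used because a plain string sort would place `chunk_10` before `chunk_2`.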

Create the final DataFrame

In [6]:
x_post_df = pd.DataFrame({
    "user_id": pd.Series(dtype="int"),
    "lang": pd.Series(dtype="str"),
    "text": pd.Series(dtype="str"),
    "date": pd.Series(dtype="datetime64[ns]"),
    "like_count": pd.Series(dtype="int"),
    "reply_count": pd.Series(dtype="int"),
    "retweet_count": pd.Series(dtype="int"),
    "quote_count": pd.Series(dtype="int"),
})

print(x_post_df.dtypes)
x_post_df.to_parquet("x_post.parquet", index=False)
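
Since one objective is to check that the types match the planned database, it may help to verify the column types again after rows have been appended, because concatenation can silently change a dtype (for example, integers becoming floats). The following is a minimal sketch rather than part of the original pipeline: `COLUMN_CHECKS` and `schema_problems` are hypothetical names, and the expected kinds are taken from the empty `x_post_df` defined above.

```python
import pandas as pd
from pandas.api import types as ptypes


def _is_text(series):
    # Text columns may be stored as object or as a dedicated string dtype
    return ptypes.is_string_dtype(series) or ptypes.is_object_dtype(series)


# Expected kind of each column, mirroring the empty x_post_df (assumption)
COLUMN_CHECKS = {
    "user_id": ptypes.is_integer_dtype,
    "lang": _is_text,
    "text": _is_text,
    "date": ptypes.is_datetime64_any_dtype,
    "like_count": ptypes.is_integer_dtype,
    "reply_count": ptypes.is_integer_dtype,
    "retweet_count": ptypes.is_integer_dtype,
    "quote_count": ptypes.is_integer_dtype,
}


def schema_problems(df):
    """Return a list of human-readable schema problems; an empty list means the frame conforms."""
    problems = []
    for column, check in COLUMN_CHECKS.items():
        if column not in df.columns:
            problems.append(f"{column}: missing")
        elif not check(df[column]):
            problems.append(f"{column}: unexpected dtype {df[column].dtype}")
    return problems
```

Calling `schema_problems(x_post_df)` just before writing the Parquet file would surface a drifted column before it reaches the database.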

Clean function
- input: path of the CSV to add, the original DataFrame
- output: the DataFrame with the new rows appended

In [None]:
def clean_file(path, x_post_df):
    current_df = pd.read_csv(path)
    for i, row in tqdm(current_df.iterrows(), total=len(current_df), desc=f"Processing {os.path.basename(path)}"):
        try:
            # Data validation
            # We keep only:
            # - Rows without null values
            # - Tweets in French
            # - Tweets that are neither retweets nor quote tweets
            required_columns = ["id", "lang", "text", "date", "likeCount", "replyCount", "retweetCount", "quoteCount"]
            if any(col not in row or pd.isna(row[col]) for col in required_columns):
                continue
            if row["lang"] != "fr": # Choose the desired language
                continue
            if row["quotedTweet"] == True:
                continue
            if row["retweetedTweet"] == True:
                continue
        
            user_id = int(row["id"])
            lang = row["lang"]
            text = row["text"]
            date = row["date"]
            like_count = row["likeCount"]
            reply_count = row["replyCount"]
            retweet_count = row["retweetCount"]
            quote_count = row["quoteCount"]
            
            # Create a new row for the final dataframe
            new_row = pd.DataFrame({
                "user_id": [user_id],
                "lang": [lang],
                "text": [text],
                "date": [pd.to_datetime(date)],
                "like_count": [like_count],
                "reply_count": [reply_count],
                "retweet_count": [retweet_count],
                "quote_count": [quote_count]
            })
            
            # Append to the main dataframe
            x_post_df = pd.concat([x_post_df, new_row], ignore_index=True)
            
        except Exception as e:
            print(f"Error processing row {i} from file {os.path.basename(path)}: {str(e)}")
            continue
    
    return x_post_df
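
The row-by-row loop above can also be expressed with vectorized pandas operations, which avoids calling `pd.concat` once per kept row. The sketch below is a possible alternative and not the notebook's method: `clean_frame`, `REQUIRED_COLUMNS`, `RENAMES` and `INT_COLUMNS` are hypothetical names, it assumes `quotedTweet` and `retweetedTweet` hold booleans (as the `== True` comparisons suggest), and unlike the loop it ignores a flag column that is absent instead of skipping every row.

```python
import pandas as pd

REQUIRED_COLUMNS = ["id", "lang", "text", "date",
                    "likeCount", "replyCount", "retweetCount", "quoteCount"]
# Same source-to-target mapping as in clean_file
RENAMES = {"id": "user_id", "likeCount": "like_count", "replyCount": "reply_count",
           "retweetCount": "retweet_count", "quoteCount": "quote_count"}
INT_COLUMNS = ["user_id", "like_count", "reply_count", "retweet_count", "quote_count"]


def clean_frame(raw_df, lang="fr"):
    """Vectorized counterpart of the filters in clean_file (illustrative sketch)."""
    target_columns = [RENAMES.get(col, col) for col in REQUIRED_COLUMNS]
    if any(col not in raw_df.columns for col in REQUIRED_COLUMNS):
        # Mirror the loop, which skips every row when a required column is absent
        return pd.DataFrame(columns=target_columns)

    kept = raw_df.dropna(subset=REQUIRED_COLUMNS)   # rows without null values
    kept = kept[kept["lang"] == lang]               # tweets in the chosen language
    for flag in ("quotedTweet", "retweetedTweet"):
        if flag in kept.columns:
            kept = kept[kept[flag].ne(True)]        # drop quotes and retweets

    out = kept[REQUIRED_COLUMNS].rename(columns=RENAMES)
    out = out.assign(date=pd.to_datetime(out["date"]))
    out = out.astype({col: "int64" for col in INT_COLUMNS})
    return out.reset_index(drop=True)
```

With this variant, each file would be read once with `pd.read_csv`, passed through `clean_frame`, and the per-file results concatenated in a single `pd.concat` call.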
        

Cell to run to update x_post.parquet

In [None]:
x_post_df = pd.read_parquet("x_post.parquet")

for path in get_csv_list(1): # Change the argument to choose the part number folder to treat
    print(f"Processing file: {path}")
    x_post_df = clean_file(path, x_post_df)

print("[bold yellow]x_post_df describe:[/bold yellow]")
print(x_post_df.describe())
print("[bold yellow]x_post_df head(5):[/bold yellow]")
print(x_post_df.head())
x_post_df.to_parquet("x_post.parquet", index=False)
print("[bold yellow]Successfully saved to x_post.parquet[/bold yellow]")

Processing may_july_chunk_16.csv: 100%|██████████| 49984/49984 [00:02<00:00, 18378.70it/s]
Processing may_july_chunk_13.csv: 100%|██████████| 50000/50000 [00:02<00:00, 19438.63it/s]
Processing may_july_chunk_1.csv: 100%|██████████| 50000/50000 [00:02<00:00, 16766.05it/s]
Processing may_july_chunk_17.csv: 100%|██████████| 50000/50000 [00:02<00:00, 17539.91it/s]
Processing may_july_chunk_3.csv: 100%|██████████| 50000/50000 [00:02<00:00, 17440.86it/s]
Processing may_july_chunk_20.csv: 100%|██████████| 50000/50000 [00:03<00:00, 15906.19it/s]
Processing may_july_chunk_15.csv: 100%|██████████| 50000/50000 [00:02<00:00, 18742.78it/s]
Processing may_july_chunk_7.csv: 100%|██████████| 49998/49998 [00:02<00:00, 19703.24it/s]
Processing may_july_chunk_5.csv: 100%|██████████| 50000/50000 [00:03<00:00, 16613.28it/s]
Processing may_july_chunk_6.csv: 100%|██████████| 49998/49998 [00:02<00:00, 16914.00it/s]
Processing may_july_chunk_4.csv: 100%|██████████| 50000/50000 [00:02<00:00, 17657.95it/s]
Processing may_july_chunk_14.csv: 100%|██████████| 50000/50000 [00:03<00:00, 16549.13it/s]
Processing may_july_chunk_10.csv: 100%|██████████| 50000/50000 [00:02<00:00, 17604.25it/s]
Processing may_july_chunk_8.csv: 100%|██████████| 49998/49998 [00:02<00:00, 18039.15it/s]
Processing may_july_chunk_2.csv: 100%|██████████| 50000/50000 [00:02<00:00, 17124.20it/s]
Processing may_july_chunk_9.csv: 100%|██████████| 50000/50000 [00:03<00:00, 16618.09it/s]
Processing may_july_chunk_19.csv: 100%|██████████| 50000/50000 [00:02<00:00, 17210.98it/s]
Processing may_july_chunk_18.csv: 100%|██████████| 50000/50000 [00:02<00:00, 17779.63it/s]
Processing may_july_chunk_12.csv: 100%|██████████| 49998/49998 [00:02<00:00, 18677.95it/s]
Processing may_july_chunk_11.csv: 100%|██████████| 50000/50000 [00:02<00:00, 17670.72it/s]
