## Preprocessing of text comments

This notebook cleans **text comments** taken from a database of technology-related comments. The cleaned data will then be used to train a chatbot.

#### Preprocessing steps

- Removal of HTML tags, special characters, emojis, etc.;
- Lowercasing and removal of redundant whitespace;
- Replacement of common abbreviations (e.g. `lol` → `laughing out loud`, `u` → `you`);
- Cleanup of misplaced punctuation (e.g. replacing `..` with `.`).

#### Dependencies

- `re`, `os`, `warnings`, `collections` (standard library)
- `pandas`

### Importing libraries

In [None]:
import re
import warnings
from collections import Counter
import os
import pandas as pd

# Suppress all warnings to keep the notebook output readable
warnings.filterwarnings('ignore')


### Loading and inspecting the dataset

In [41]:
df = pd.read_csv("./data/df_comments_posts.csv", delimiter=",")
df.head()

Unnamed: 0,post_id,subreddit,created_utc,post_title,link_flair_text,comment
0,1d6iggo,artificial,1717349000.0,What are your thoughts on the following statem...,Discussion,"""https://x.com/AuthorJMac/status/1773871445669..."
1,1d6iggo,artificial,1717349000.0,What are your thoughts on the following statem...,Discussion,"> So, just to clarify. This post isn't about w..."
2,1d6iggo,artificial,1717349000.0,What are your thoughts on the following statem...,Discussion,If AI could do/explain my taxes this would be ...
3,1d6iggo,artificial,1717349000.0,What are your thoughts on the following statem...,Discussion,"""I agree, but AI won't prevent you from doing ..."
4,1d6iggo,artificial,1717349000.0,What are your thoughts on the following statem...,Discussion,"""People are taking this quite literally, but I..."


In [42]:
# Join all comments into one string; astype(str) turns missing values into "nan"
all_comments = '. '.join(df["comment"].astype(str))
print(all_comments[:200])

"https://x.com/AuthorJMac/status/1773871445669474662. > So, just to clarify. This post isn't about wanting an actual laundry robots.. If AI could do/explain my taxes this would be great.". "I agree, b


In [43]:
words_frequency = Counter(all_comments.split())
words_frequency.most_common(n=30)

[('the', 234026),
 ('to', 199492),
 ('a', 166811),
 ('and', 141512),
 ('of', 126755),
 ('is', 114679),
 ('I', 100664),
 ('in', 87058),
 ('that', 82668),
 ('you', 78040),
 ('for', 72282),
 ('it', 71444),
 ('be', 52715),
 ('are', 50245),
 ('with', 48749),
 ('on', 46059),
 ('have', 43373),
 ('this', 43145),
 ('not', 38407),
 ('but', 36908),
 ('as', 32675),
 ('can', 31388),
 ('they', 30204),
 ('or', 29429),
 ('like', 29363),
 ('will', 28803),
 ('just', 28291),
 ('AI', 28115),
 ('your', 26235),
 ('an', 26001)]
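
The counts above are case-sensitive and keep attached punctuation, so `AI` and `ai,` would be counted as different words. The following is a minimal sketch, assuming a simple regex tokenizer, of how normalized counts could help check which candidate abbreviations actually occur in a text. The `count_tokens` helper, the sample sentence, and the candidate list are illustrative assumptions, not part of the notebook.

```python
import re
from collections import Counter

def count_tokens(text):
    """Count lowercase word tokens, keeping internal apostrophes (e.g. "don't")."""
    tokens = re.findall(r"[a-z0-9]+(?:'[a-z]+)*", text.lower())
    return Counter(tokens)

# Made-up sample standing in for `all_comments`
sample = "IDK if AI can do this, but idk... U tell me. Don't you think AI will?"
counts = count_tokens(sample)

# Illustrative subset of candidate abbreviations and how often each occurs
candidates = ["idk", "u", "tbh"]
found = {abbr: counts[abbr] for abbr in candidates if counts[abbr] > 0}
print(found)
```

Running the same check on the full corpus would show which entries of an abbreviation dictionary are worth keeping.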

#### Abbreviations that users may have used

In [44]:
abbreviations = {
    "u": "you",
    "ur": "your",
    "r": "are",
    "ya": "you",
    "cuz": "because",
    "bc": "because",
    "bcz": "because",
    "idk": "i do not know",
    "iirc": "if i recall correctly",
    "tbh": "to be honest",
    "imo": "in my opinion",
    "imho": "in my humble opinion",
    "fyi": "for your information",
    "asap": "as soon as possible",
    "btw": "by the way",
    "brb": "be right back",
    "lol": "laughing out loud",
    "lmao": "laughing my ass off",
    "rofl": "rolling on the floor laughing",
    "omg": "oh my god",
    "omfg": "oh my freaking god",
    "wtf": "what the fuck",
    "wth": "what the hell",
    "smh": "shaking my head",
    "afaik": "as far as i know",
    "b4": "before",
    "gr8": "great",
    "k": "okay",
    "ok": "okay",
    "thx": "thanks",
    "ty": "thank you",
    "np": "no problem",
    "pls": "please",
    "plz": "please",
    "lmk": "let me know",
    "nvm": "never mind",
    "dm": "direct message",
    "msg": "message",
    "jk": "just kidding",
    "ttyl": "talk to you later",
    "bcuz": "because",
    "tho": "though",
    "yolo": "you only live once",
    "bday": "birthday",
    "gg": "good game",
    "irl": "in real life",
    "g2g": "got to go",
    "atm": "at the moment",
    "hmu": "hit me up",
    "wanna": "want to",
    "gonna": "going to",
    "gotta": "got to",
    "ain't": "is not",
    "lemme": "let me",
    "gimme": "give me",
    "ya'll": "you all",
    "y'all": "you all",
    "cya": "see you",
    "bbl": "be back later",

    # Common contractions (ambiguous forms such as 's and 'd get a single expansion)
    "i'm": "i am",
    "you're": "you are",
    "he's": "he is",
    "she's": "she is",
    "it's": "it is",
    "we're": "we are",
    "they're": "they are",
    
    "i've": "i have",
    "you've": "you have",
    "we've": "we have",
    "they've": "they have",
    "he'd": "he had",
    "she'd": "she had",
    "it'd": "it had",
    
    "i'll": "i will",
    "you'll": "you will",
    "he'll": "he will",
    "she'll": "she will",
    "it'll": "it will",
    "we'll": "we will",
    "they'll": "they will",
    
    "i'd": "i would",
    "you'd": "you would",
    "we'd": "we would",
    "they'd": "they would",

    "isn't": "is not",
    "aren't": "are not",
    "wasn't": "was not",
    "weren't": "were not",
    "don't": "do not",
    "doesn't": "does not",
    "didn't": "did not",
    "won't": "will not",
    "wouldn't": "would not",
    "can't": "cannot",
    "couldn't": "could not",
    "shouldn't": "should not",
    "mustn't": "must not",
    "hasn't": "has not",
    "haven't": "have not",
    "hadn't": "had not",
}
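
The preprocessing function further down applies this mapping with one `re.sub` call per entry, which means one full scan of the text per abbreviation. As a possible alternative, the sketch below compiles all keys into a single alternation and expands them in one pass. The `build_expander` helper and the small `abbreviations_demo` dictionary are hypothetical names used only for illustration.

```python
import re

# Illustrative subset; in the notebook this would be the full `abbreviations` dict
abbreviations_demo = {
    "u": "you",
    "idk": "i do not know",
    "don't": "do not",
    "y'all": "you all",
}

def build_expander(mapping):
    """Return a function that expands every key of `mapping` in one regex pass."""
    # Longest keys first so a longer key is preferred over a shorter one
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    # Each match is looked up in the mapping and replaced by its expansion
    return lambda text: pattern.sub(lambda m: mapping[m.group(0)], text)

expand = build_expander(abbreviations_demo)
print(expand("idk what u mean, y'all don't agree"))
```

Unlike the per-entry loop, a single pass never re-expands text produced by an earlier replacement, so results can differ slightly when one expansion contains another key.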


### Text preprocessing function without the tokenizer (the GPT-2 tokenizer will be used)

In [45]:
def preprocess_comment(
    text,
    expand_abbreviations=False,
    remove_emojis=True,
    remove_deleted=True,
):
    """
    Full text preprocessing for a chatbot or other NLP analysis.

    Parameters:
    - text : str, input text
    - expand_abbreviations : bool, expand common abbreviations
    - remove_emojis : bool, remove emojis
    - remove_deleted : bool, remove '[deleted]' markers

    Returns:
    - The cleaned text (str)
    """

    # 1. Lowercase
    text = text.lower()

    # 2. Remove HTML, links, e-mails, mentions; keep hashtag text without '#'
    text = re.sub(r"<.*?>", "", text)                  # HTML tags
    text = re.sub(r"http\S+|www\S+", "", text)         # links
    text = re.sub(r"\S*@\S*\s*", "", text)             # e-mails (any token containing '@')
    text = re.sub(r"@\w+", "", text)                   # mentions
    text = re.sub(r"#(\w+)", r"\1", text)              # hashtags: drop '#', keep the word

    # 3. Remove '[deleted]' markers
    if remove_deleted:
        text = re.sub(r"\[deleted\]", "", text)

    # 4. Remove emojis
    if remove_emojis:
        emoji_pattern = re.compile("["
            "\U0001F600-\U0001F64F"  # emoticons
            "\U0001F300-\U0001F5FF"  # symbols and pictographs
            "\U0001F680-\U0001F6FF"  # transport and map symbols
            "\U0001F1E0-\U0001F1FF"  # flags
            "]+")
        text = emoji_pattern.sub("", text)

    # 5. Expand abbreviations (whole-word matches only)
    if expand_abbreviations:
        for abbr, full in abbreviations.items():
            text = re.sub(rf"\b{re.escape(abbr)}\b", full, text)
 
    # 6. Collapse repeated whitespace
    text = re.sub(r"\s+", " ", text).strip()

    # 7. Remove double quotes and angle brackets; replace ".." with "."
    text = text.replace('"', "").replace('..', ".").replace("<", "").replace(">", "")

    return text
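
Step 7 runs after whitespace normalization, so removing `"`, `<`, or `>` can leave leading or doubled spaces (the cleaned output shown further down starts with two spaces). A single `replace('..', '.')` pass also turns `...` into `..` rather than `.`. The sketch below, assuming this behaviour is unwanted, shows a hypothetical `final_cleanup` helper that could be applied to the function's output. It is not part of the original function, and removing spaces before punctuation may be too aggressive for some texts.

```python
import re

def final_cleanup(text):
    """Hypothetical post-processing step applied after preprocess_comment."""
    text = re.sub(r'["<>]', "", text)             # drop quotes and angle brackets
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)  # remove spaces before punctuation
    text = re.sub(r"\.{2,}", ".", text)           # collapse any run of dots to one
    return re.sub(r"\s+", " ", text).strip()      # normalize whitespace last

print(repr(final_cleanup('" > so, just to clarify... ok . done')))
```

Normalizing whitespace as the last operation means characters removed earlier cannot leave stray spaces behind.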


In [46]:
all_comments_cleaned = preprocess_comment(
    all_comments,
    expand_abbreviations=True,
    remove_emojis=True,
    remove_deleted=True,
)

In [47]:
all_comments_cleaned[:3000]

'  so, just to clarify. this post is not about wanting an actual laundry robots. if ai could do/explain my taxes this would be great. i agree, but ai will not prevent you from doing art and writing, it will just make you earn less for doing these things, unfortunately. people are taking this quite literally, but i think she is more likely making a general point about ai taking away from the human experience, rather than adding to it. i do not think she is actually imagining a jetsons-style future. . in my opinion almost all the takes in this thread are missing the point completely. i do not think it will happen like that. i think everything will get automated. however yeah in that kind of world you could end up with an underclass of manual labourers and an overclass of owners who pay them with the profits from machines. and is that a good thing? we are all happy to see elevator operators and lecters go however do we want all artists and musicians to go too? or at least the bottom 99% s

### Saving the file

In [None]:
# Save the cleaned text to disk

os.makedirs("./data", exist_ok=True)
with open("./data/data_cleaned.txt", "w", encoding="utf-8") as f:
    f.write(all_comments_cleaned)
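
As a quick sanity check, the saved file can be read back and compared with the in-memory string. The sketch below uses a temporary directory and a short placeholder string (`cleaned_demo`, a hypothetical stand-in for `all_comments_cleaned`) so that it runs without touching `./data`.

```python
import os
import tempfile

# Placeholder text standing in for `all_comments_cleaned`
cleaned_demo = "so, just to clarify. this post is not about laundry robots."

# A temporary directory stands in for ./data so the sketch has no side effects
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data_cleaned.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(cleaned_demo)
    with open(path, "r", encoding="utf-8") as f:
        reloaded = f.read()

# True when the text survived the write/read round trip unchanged
print(reloaded == cleaned_demo, len(reloaded))
```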