<h1 style="text-align: center; color: #E30613;"><b><i>Automatic Annotation of Customer Comments</i></b></h1>

<p style="font-size: 18px;">
This notebook automates the classification of customer comments in the telecommunications domain.
Using text normalization and a large language model run locally through llama-cpp-python,
we analyze and categorize user feedback to help improve service quality.
</p>

<h2 style="color: #28A745;">Main Objectives:</h2>
<ul style="font-size: 16px; color: #333;">
    <li>Analyze customer comments to identify recurring problems.</li>
    <li>Classify the feedback into subcategories for easier interpretation.</li>
    <li>Provide actionable insights to improve services and customer satisfaction.</li>
</ul>

<h2 style="color: #28A745;">Technologies Used:</h2>
<ul style="font-size: 16px; color: #333;">
    <li><b>Pandas:</b> For data manipulation and pre-processing.</li>
    <li><b>Regex:</b> For text cleaning and normalization.</li>
    <li><b>emoji:</b> For detecting and removing emojis in the comments.</li>
    <li><b>Llama:</b> The llama-cpp-python class that runs the language model used to classify the comments.</li>
    <li><b>JSON:</b> For structuring and saving the classification results.</li>
</ul>

<h2 style="color: #28A745;">Workflow:</h2>
<ol style="font-size: 16px; color: #333;">
    <li><b>Data Loading:</b> Import the customer comments from an Excel file.</li>
    <li><b>Pre-processing:</b> Cleaning, normalization, and removal of duplicates.</li>
    <li><b>Classification:</b> Categorize the comments with a language model loaded through the Llama class.</li>
    <li><b>Saving:</b> Export the results to JSON and CSV files for later analysis.</li>
</ol>

<h2 style="color: #28A745;">Expected Results:</h2>
<ul style="font-size: 16px; color: #333;">
    <li>An accurate classification of the comments into relevant subcategories.</li>
    <li>A better understanding of the problems customers encounter.</li>
    <li>Actionable recommendations for improving the services.</li>
</ul>

## <span style="color: #28A745;">**Required Libraries**</span>

In [60]:
%pip install -q emoji pandas openpyxl llama-cpp-python tqdm

import pandas as pd
import re
import emoji
from collections import defaultdict
import os
from llama_cpp import Llama
import json
from tqdm import tqdm
from time import sleep

## <span style="color: #28A745;">**Loading the Data**</span>

In [61]:
# Load the data
comments_df = pd.read_excel('/content/Comments.xlsx')

# Add ID column to comments_df
comments_df.insert(0, 'ID Comment', range(1, 1 + len(comments_df)))

# Check the expected format
assert 'ID Comment' in comments_df.columns and 'Comments' in comments_df.columns, "The 'ID Comment' and 'Comments' columns must exist."

# Create the results folder
os.makedirs('/content/Results', exist_ok=True)

# Create the model folder
os.makedirs('/content/models', exist_ok=True)

comments_df

|  | ID Comment | ID Post | User Name | Comments | Sentiments |
|---|---|---|---|---|---|
| 0 | 1 | 1 | Samir Bekhouche |  | Neutre |
| 1 | 2 | 1 | Yanise Yanise | سلام عليكم ورحمة لديا مشكلة ! فليكسيت 100 دج و... | Negatif |
| 2 | 3 | 1 | Jj Kie | كل عام و انتم بخير | Positif |
| 3 | 4 | 1 | Sakou Younes | كل عام و أنتم بخير | Positif |
| 4 | 5 | 1 | راني نعاني | كل عام و حنا بخير | Positif |
| ... | ... | ... | ... | ... | ... |
| 4097 | 4098 | 183 | Ĺã Rõsë Ýb | ❤️❤️ | Positif |
| 4098 | 4099 | 183 | نسمات هادئة | 💕💕💕💕 | Positif |
| 4099 | 4100 | 183 | ملك ملهاش غيرك | ❤❤❤❤❤❤🌹 | Positif |
| 4100 | 4101 | 183 | سعيدي رضا |  | Neutre |


## <span style="color: #28A745;">**Pre-processing**</span>

In [62]:
# Drop rows whose "User Name" is "Djezzy", "Mobilis" or "Ooredoo Algérie"
comments_df = comments_df[~comments_df["User Name"].isin(["Djezzy", "Mobilis", "Ooredoo Algérie"])]

# Drop rows where "Comments" is empty or contains only whitespace
comments_df = comments_df.dropna(subset=["Comments"])
comments_df = comments_df[comments_df["Comments"].str.strip() != ""]

# Drop consecutive duplicates (a comment identical to the one directly above it)
comments_df = comments_df.loc[comments_df["Comments"].shift() != comments_df["Comments"]]
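
The `shift()` comparison removes a comment only when it repeats the comment directly above it; the same text reappearing further down is kept. A minimal plain-Python sketch of that distinction follows. The helper names `drop_consecutive_duplicates` and `drop_all_duplicates` are illustrative and not part of the notebook, and lists stand in for the DataFrame column.

```python
from itertools import groupby

def drop_consecutive_duplicates(items):
    """Keep the first element of every run of identical neighbours."""
    return [key for key, _ in groupby(items)]

def drop_all_duplicates(items):
    """Keep only the first occurrence of every value, wherever it appears."""
    return list(dict.fromkeys(items))

sample = ["merci", "merci", "reseau faible", "merci"]
print(drop_consecutive_duplicates(sample))  # the trailing "merci" survives
print(drop_all_duplicates(sample))          # the trailing "merci" is removed
```

If repeated comments should be removed wherever they occur, `comments_df.drop_duplicates(subset=["Comments"])` would be the pandas counterpart of the second helper.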

In [63]:
# Comment normalization: lowercase, fold Arabic letter variants, strip French accents
def normalize_arabic(text):
    text = text.lower()
    text = re.sub("گ", "ك", text)
    text = re.sub("ڭ", "ك", text)
    text = re.sub("ڤ", "ق", text)
    text = re.sub("ڨ", "ق", text)
    text = re.sub("پ", "ب", text)
    text = re.sub("é", "e", text)
    text = re.sub("ê", "e", text)
    text = re.sub("ë", "e", text)
    text = re.sub("ç", "c", text)
    text = re.sub("à", "a", text)
    text = re.sub("â", "a", text)
    text = re.sub("ä", "a", text)
    text = re.sub("î", "i", text)
    text = re.sub("ï", "i", text)  # "ï" folds to "i", not "a"
    text = re.sub("æ", "ae", text)
    text = re.sub("œ", "oe", text)
    return text

# Apply the normalization
comments_df["Comments"] = comments_df["Comments"].apply(normalize_arabic)
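
The chain of single-character substitutions can also be expressed as one translation table applied in a single pass. The sketch below is an alternative formulation, not the notebook's implementation: `CHAR_MAP` and `normalize_text` are hypothetical names, and the table maps "ï" to "i".

```python
# Hypothetical single-pass variant of the character folding step.
CHAR_MAP = str.maketrans({
    # Arabic-script letter variants folded to their base letter
    "گ": "ك", "ڭ": "ك", "ڤ": "ق", "ڨ": "ق", "پ": "ب",
    # French accented letters and ligatures folded to plain ASCII
    "é": "e", "ê": "e", "ë": "e", "ç": "c",
    "à": "a", "â": "a", "ä": "a", "î": "i", "ï": "i",
    "æ": "ae", "œ": "oe",
})

def normalize_text(text: str) -> str:
    """Lowercase the text, then fold every mapped character in one pass."""
    return text.lower().translate(CHAR_MAP)

print(normalize_text("Réseau très faible à Sétif"))
```

Keeping the mapping in one dictionary makes it easier to spot a wrong target letter and to add further characters such as "è" or "ô" later.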

In [64]:
def clean_text(text):
    text = emoji.replace_emoji(str(text), replace=" ")  # Remove emojis
    text = re.sub(r'http\S+|htps\S+', " ", text)  # Remove hyperlinks, including the "htps" misspelling
    text = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', " ", text)  # Remove URLs
    text = re.sub(r'@\S+', '', text)  # Remove tokens starting with @
    text = text.replace("_", " ").replace("#", "")  # Remove # and _
    text = text.replace("'", " ")  # Remove apostrophes
    text = re.sub(r'[.,،؛]', " ", text)  # Remove Latin and Arabic punctuation

    # Remove reserved words (the text was lowercased earlier, hence IGNORECASE)
    text = re.sub(r'\b(?:RT|Retweeted)\b', " ", text, flags=re.IGNORECASE)

    # Replace every character other than Latin letters, digits, Arabic script and whitespace
    text = re.sub(r'[^a-zA-Z0-9\u0600-\u06FF\s]', " ", text)

    # Remove the Arabic short vowels (حركات)
    harakat = "[\u064B-\u0652]"  # Fathatan (U+064B) through sukun (U+0652)
    text = re.sub(harakat, '', text)

    # Collapse runs of 3+ identical characters into one; digits are skipped so amounts such as 1000 survive
    text = re.sub(r"(\D)\1{2,}", r"\1", text)
    text = text.replace('\n', " ").replace('/', " ")  # Remove line breaks and /
    text = re.sub(r'[^\w\s]', " ", text)  # Remove any remaining special characters

    return text.strip()

comments_df["Comments"] = comments_df["Comments"].apply(clean_text)

# Drop comments that are empty after cleaning
comments_df = comments_df[comments_df["Comments"].str.strip() != ""]
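
Two of the cleaning steps are easier to follow on isolated inputs: stripping the Arabic short vowels and collapsing elongated words. The sketch below uses hypothetical helper names (`strip_harakat`, `collapse_runs`). It assumes digits should be left alone, because a pattern that also matches digits would turn an amount like 1000 into 10.

```python
import re

HARAKAT = re.compile("[\u064B-\u0652]")   # fathatan (U+064B) through sukun (U+0652)
LONG_RUN = re.compile(r"(\D)\1{2,}")      # three or more identical non-digit characters

def strip_harakat(text: str) -> str:
    """Remove Arabic short-vowel marks, leaving the base letters."""
    return HARAKAT.sub("", text)

def collapse_runs(text: str) -> str:
    """Reduce an elongated run ("merciiii") to a single character ("merci")."""
    return LONG_RUN.sub(r"\1", text)

vocalized = "\u0645\u064E\u0631\u0652\u062D\u064E\u0628\u064B\u0627"  # "marhaban" with vowel marks
print(strip_harakat(vocalized))
print(collapse_runs("merciiii bcp"), "|", collapse_runs("1000 da"))
```

Runs of exactly two characters, as in "cool", are left unchanged because the pattern requires at least three repetitions.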

In [65]:
abbreviations = {
    "mrc": "merci",
    "num": "numero",
    "numro": "numero",
    "nn": "non",
    "bn": "bonne",
    "topp": "top",
    "شوي": "قليل",
    "شويا": "قليل",
    "لزونيتي": "الوحدات",
    "حنا": "نحن",
    "أسإلتي": "اسئلة",
    "حاص": "خاص",
    "مدام": "بينما",
    "رانا": "نحن",
    "بون": "حسن",
    "توس": "جميع",
    "سبيسيال": "منفرد",
    "سبيال": "منفرد",
    "باه": "لكي",
    "زاف": "كثيرا",
    "بزاف": "كثيرا",
    "نتمني": "نتمنى",
    "فظلكم": "فضلكم",
    "ضلم": "ظلم",
    "ردى": "رد",
    "علاش": "لماذا",
    "علاه": "لماذا",
    "علا": "لماذا",
    "هذ": "هذا",
    "تبرعو": "تبرع",
    "ابليكاسيو": "تطبيق",
    "أبليكاسيون": "تطبيق",
    "أبليكاسيو": "تطبيق",
    "الخاص بكم": "نتاعكم",
    "ديرولنا": "افعلو لنا",
    "ديرو": "افعلو",
    "نحا": "نزع",
    "دير": "افعل",
    "حاجة": "شيء",
    "نتع": "خاص ب",
    "تاع": "خاص ب",
    "تع": "خاص ب",
    "جاوبوني": "رد علي",
    "مشتاركه": "مشاركة",
    "رحو": "اذهبو",
    "هدي": "هدية",
    "هذي": "هذه",
    "يروح": "يذهب",
    "يجي": "يأتي",
    "وقتاه": "متى",
    "وقتاش": "متى",
    "وقتش": "متى",
    "تردولنا": "تردون",
    "نتاع": "خاص ب",
    "شك": "من",
    "شكون": "من",
    "نحيتوها": "نزع",
    "نحيتو": "نزع",
    "نحيت": "نزع",
    "سلفلي": "قرض",
    "سلفولي": "قرض",
    "غدوة": "غدا",
    "تلقى": "تجد",
    "نلقى": "اجد",
    "نأكتيفيها": "تفعيل",
    "نأكتيفي": "تفعيل",
    "ناكتيفي": "تفعيل",
    "نعانيو": "نعانون",
    "نعانو": "نعانون",
    "ميت حال": "رديء",
    "مفهمتش": "لم افهم",
    "انو": "انه",
    "علابالي": "اعلم",
    "ليتو": "اصبحتم",
    "وليتو": "اصبحتم",
    "تدو": "تاخذون",
    "الى": "إلى",
    "خاستني": "احتاج",
    "خستني": "احتاج",
    "مشى": "تعمل",
    "مشا": "تعمل",
    "شكا": "شكوى",
    "شكيت": "شكوى",
    "زرو": "سيء",
    "زيرو": "سيء",
    "سرفيك": "خدمة",
    "سرفيس": "خدمة",
    "جيزي اب": "djezzy app",
    "nchlh": "incha allah",
    "riglou": "regler",
    "bah": "pour",
    "nwaliw": "devenir",
    "k": "comme",
    "مراحش": "لن",
    "مهمش": "ليسو",
    "تعليع": "تعليق",
    "زااف": "كثيرا",
    "khayan": "سرق",
    "djezzyy": "djezzy",
    "ومرديتش": "لم ترد",
    "واش": "ماذا",
    "منبعد": "بعد ذلك",
    "تفتحش": "لا تفتح",
    "ندير": "افعل",
    "راه": "اصبح",
    "لازم": "يجب",
    "pix": "pixx",
    "Twanty": "Twenty",
    "orod": "prod",
    "imtiyaaz": "imtiyaz",
    "tm": "ok",
    "نم": "تم",
    "جداو": "جدا",
    "زلب": "زبل",
    "كونطرا": "عقد",
    "شري": "شراء",
    "توب": "رائع",
    "جزل": "جزيلا",
    "ياسر": "كثيرا",
    "golde": "gold",
    "mknch": "introuvable",
    "rani": "je suis",
    "شكرالكم": "شكرا لكم",
    "viv": "vive",
    "عتل": "ارسل",
    "بعت": "ارسل",
    "لاجونس": "مقر",
    "rpnd": "repond",
    "prv": "prive",
    "svp": "s il vous plais",
    "yennayer": "سنة",
    "amervuh": "سعيدة",
    "assgas": "assegas",
    "asugas": "assegas",
    "asegas": "assegas",
    "amegaz": "amegas",
    "amgaz": "amegas",
    "amegaz": "amegas",
    "amgaz": "amegas",
    "amegaz": "amegas",
    "خخ": "ضحك",
    "هه": "ضحك",
    "سبي": "سبيسيال",
    "شبكهمشكورين": "شبكه مشكورين",
    "يعطيكمصحه": "يعطيكم صحه",
    "نشله": "ان شاء الله",
    "عندوش": "لا يوجد",
    "خفظو": "تخفيض",
    "سء": "سؤال",
    "مليحة": "حسن",
    "مليحه": "حسن",
    "وله": "و الله",
    "مكمات": "مكالمات",
    "وينتا": "متى",
    "تدوها": "تاخذون",
    "felawen": "tous",
    "ya": "il y a",
    "en": "في",
    "panne": "عطل",
    "happy": "سعيدة",
    "koum": "votre",
    "ayi": "faible",
    "new": "جديدة",
    "year": "سنة",
    "years": "سنة",
    "شنو": "ما هو",
    "هدا": "هذا",
    "شالنج": "تحدي",
    "li": "qui",
    "bghi": "aime",
    "ndirlo": "faire",
    "yji": "viens",
    "lah": "pourquoi",
    "raho": "que il",
    "hbs": "arret",
    "بر": "فقط",
    "برك": "فقط",
    "غي": "الا",
    "غير": "الا",
    "الي": "الى",
    "حسنو": "اصلاح",
    "سقمو": "اصلاح",
    "جوند": "legend",
    "يجاند": "legend",
    "ليجند": "legend",
    "ارطيا": "جزء",
    "علجال": "من أجل",
    "تثقال": "بطء",
    "كون": "ليت",
    "بغى": "اراد",
    "يبغي": "يريد",
    "نبغي": "نريد",
    "تفرج": "مشاهدة",
    "ماتش": "مباراة",
    "رجا": "رجاء",
    "متمشلكش": "لا تعمل",
    "متمسيلكش": "لا تعمل",
    "لايص": "اماكن",
    "بلايص": "اماكن",
    "بلاصة": "اماكن",
    "نسقسي": "اسأل",
    "اذ": "اذا",
    "يمتي": "بلا حدود",
    "اليميتي": "بلا حدود",
    "خسني": "اريد",
    "باطل": "مجانا",
    "قولد": "gold",
    "تعيف": "بطء",
    "نسييو": "محاولة",
    "نلعبو": "لعب",
    "نكونو": "أكون",
    "عب": "لعب",
    "كف": "كيف",
    "اللهيوفقناجميعاقولويارب": "الله يوفقنا جميع اقولو يارب",
    "congratulations": "مبروك",
    "berkaw": "arret",
    "ser": "vole",
    "تردوش": "لا تردون",
    "هضرت": "تكلمت",
    "باش": "لكي",
    "تحلى": "حل",
    "تحلي": "حل",
    "لينا": "لنا",
    "حض": "حظ",
    "wech": "Quoi",
    "ndirou": "faire",
    "bach": "pour",
    "nrebhou": "gagner",
    "elfe": "mille",
    "mabrok": "felicitations",
    "koules": "tous",
    "moucharikones": "participants",
    "el": "les",
    "mabrouk": "مبروك",
    "اطوههالي": "اعطوها لي",
    "شاب": "جميل",
    "يعطيكمالصحة": "يعطيكم الصحة",
    "illa": "lent",
    "woww": "wow",
    "يارب": "يا رب",
    "كلشي": "كل شيء",
    "كنكتي": "تواصل",
    "مانكونيكتيش": "لا أتواصل",
    "يووز": "yooz",
    "يوز": "yooz",
    "وش": "ما هو",
    "وشمن": "اي",
    "ديما": "dima",
    "راه": "انه",
    "معجبتنيش": "سيء",
    "تحبسلي": "توقف",
    "انتاع": "ل",
    "لوس": "plus",
    "شكراوريدو": "شكرا أوريدو",
    "در": "فعل",
    "رهي": "انه",
    "كاين": "يوجد",
    "مباغش": "لا يريد",
    "يمدلي": "يعطيني",
    "يخرجو": "خروج",
    "اكتر": "أكثر",
    "مايمشيش": "لا يعمل",
    "سوايع": "ساعة",
    "محبتش": "لا",
    "جام": "مستحيل",
    "جامي": "مستحيل",
    "تمشلي": "تعمل",
    "ريفي": "خاص",
    "قاع": "كل",
    "منربحش": "لا أربح",
    "منربحوش": "لا أربح",
    "نقارع": "صبر",
    "مفتحتوهاليش": "لا تفتح",
    "ابونمون": "اشتراك",
    "مساج": "رسالة",
    "رجعو": "رد",
    "صحيتو": "شكرا",
    "لاتوجد": "لا توجد",
    "نلقاش": "لا أجد",
    "ما نلقاش": "لا أجد",
    "كريدي": "رصيد",
    "ريبونديولنا": "رد",
    "جد": "جدا",
    "ينحيولي": "نزع",
    "يحذفولي": "نزع",
    "ماسلفت": "لم اقترض",
    "مدايرا": "لم أفعل",
    "راهي": "إنها",
    "فور": "ممتاز",
    "هايل": "ممتاز",
    "مليح": "جيد",
    "شابة": "جميل",
    "ماصلحتليش": "لا تعمل",
    "إستلاف": "قرض",
    "خص": "اريد",
    "هاذ": "هذا",
    "شحال": "كم",
    "تاكتيفيه": "تفعيل",
    "حتان": "كي",
    "كيفاش": "كيف",
    "غلطة": "خطا",
    "ختاريت": "خيار",
    "ديالي": "خاص بي",
    "حاب": "اريد",
    "ويل": "أو",
    "ابعث": "ارسال",
    "goold": "gold",
    "ماتجاوبوش": "عدم رد",
    "مهدى": "هدية",
    "واشهرالجاي": "شهر موالي",
    "نخلصش": "لا أدفع",
    "ابليس": "plus",
    "تأكتيفها": "تفعيل",
    "هيل": "hayla",
    "هاذي": "هذه",
    "وشمن": "ما هي",
    "جزاير": "جزائر",
    "معن": "معنى",
    "فاه": "فيها",
    "فاش": "اي",
    "مكوبي": "مقطوع",
    "مي": "لكن",
    "مايمشيلك": "لا يعمل",
    "نديرلهم": "اعمل لهم",
    "تتمسخرو": "استهزاء",
    "ريبونديو": "رد",
    "plais" : "من فظلكم",
    "كفاه": "كيف",
    "ندموندي": "طلب",
    "دخلتو": "ادخال",
    "العالميه": "عالمية",
    "قات": "بقي",
    "بقات": "بقي",
    "nechlh": "incha allah",
    "jdida": "جديد",
    "كيفما": "كيف",
    "اندير": "افعل",
    "ذيم": "دائما",
    "وين": "أين",
    "راهي": "هي",
    "ماندمت": "ندم",
    "حشو": "خدعة",
    "حشوة": "خدعة",
    "انشاءاللهتكونمننصيبي": "ان شاء الله تكون من نصيبي",
    "نتمنالهم": "اتمنى",
    "راحوش": "لم يذهب",
    "بادن": "باذن",
    "الي": "الذي",
    "تغلقلو": "غلق",
    "نيمروه": "رقم",
    "يسترجعو": "استرجاع",
    "رانيني": "ranini",
    "عقوبة": "عاقبة",
    "قوب": "عاقبة",
    "حطو": "وضع",
    "لينا": "لنا",
    "ستمرار": "مستمر",
    "لابيلكاسيو": "تطبيق",
    "لبليكاسيو": "تطبيق",
    "متمشيش": "لا تعمل",
    "حبستوها": "توقف",
    "حمال": "تحميل",
    "مزان": "ميزان",
    "جيه": "جهة",
    "فلاترددو": "فلا تترددو",
    "خدمهوعروضاطاقم": "خدمة و عروض طاقم",
    "فلعاصمة": "في عاصمة",
    "nztjwice": " ",
    "refvf": " ",
    "b": " ",
    "br": " ",
    "ابار": "يبارك",
    "يحفضكم": "يحفظكم",
    "يلوس": "plus",
    "تاه": "متى",
    "بف": " ",
    "sagmou": "regler",
    "شويش": "switch",
    "بغية": "اريد",
    "dialkom": "vous",
    "sahel": "facile",
    "toop": "top",
    "هنيء": "مبروك",
    "puse": "sim",
    "pui": "puis",
    "nerbah": "gagne",
    "ghir": "sauf",
    "بيان": "جيدا",
    "ماتمشيش": "لا تعمل",
    "نحافضو": "حفاظ",
    "تخلاص": "انتهاء",
    "متعرفش": "لا تعلم",
    "خاصتا": "خاصة",
    "وفى": "و في",
    "ظل": "دائما",
    "مدرتو": "لم تفعلو",
    "لموبليس": "موبيليس",
    "را": "انه",
    "صر": "حدث",
    "rah": "il est",
    "gae": "tous",
    "dok": "maintenant",
    "لابليكاسيون": "تطبيق",
    "كلش": "كل",
    "مشاءالله": "ما شاء الله",
    "حبس": "توقف",
    "زيد": "ايضا",
    "تسرقو": "سرق",
    "تاكتيفي": "تفعيل",
    "زوعاماء": "زعماء",
    "ul": " ",
    "كدب": "كذب",
    "ميقراسيون": "تبديل",
    "شح": "كم",
    "فيهاش": "لا يوجد",
    "نجاوب": "اجيب",
    "راحت": "ذهب",
    "متسواش": "سيء",
    "تعكم": "خاص بكم",
    "حقاهايلة": "ممتاز",
    "صرقولي": "سرق",
    "صرق": "سرق",
    "ويني": "اين",
    "حمدلله": "حمد لله",
    "رحمن": "رحمان",
    "نشال": "شاء الله",
    "ماشاء": "ما شاء",
    "تنحولي": "نزع",
    "خمسلاف": "خمسة ألف",
    "خمسلاف": "خمسة ألف",
    "مفعلتهاش": "لا تفعيل",
    "سرقتولي": "سرق",
    "لكريدي": "رصيد",
    "انترنتوراها": "اين انترنت",
    "تعييف": "سيء",
    "تعيف": "سيء",
    "ضك": "الان",
    "جزى": "جزاك",
    "نشاء": "ان شاء",
    "إنشاءالله": "ان شاء الله",
    "نشالله": "ان شاء الله",
    "شاءالله": "شاء الله"
}

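# Keys containing a space (e.g. "ميت حال") can never equal a single token, so replace them as phrases first
phrase_keys = sorted((k for k in abbreviations if " " in k), key=len, reverse=True)

def replace_abbreviations(text):
    for phrase in phrase_keys:
        text = re.sub(rf"(?<!\S){re.escape(phrase)}(?!\S)", abbreviations[phrase], text)
    return ' '.join(abbreviations.get(word, word) for word in text.split())

comments_df['Comments'] = comments_df['Comments'].apply(replace_abbreviations)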
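
A Python dict literal keeps only the last value when a key is written more than once, so a repeated key in a long hand-written table (for example "راه", which appears above with two different targets) silently overrides the earlier entry. The sketch below shows one way to detect such keys by parsing the literal's source text. `duplicate_keys` is a hypothetical helper, and it assumes the dictionary source is available as a string.

```python
import ast
from collections import Counter

def duplicate_keys(dict_source: str) -> list:
    """Return the keys that occur more than once in a dict literal's source text."""
    node = ast.parse(dict_source.strip(), mode="eval").body
    if not isinstance(node, ast.Dict):
        raise TypeError("expected the source of a dict literal")
    # Only constant keys (strings, numbers) are considered
    keys = [k.value for k in node.keys if isinstance(k, ast.Constant)]
    return [key for key, count in Counter(keys).items() if count > 1]

sample_source = '{"راه": "اصبح", "mrc": "merci", "راه": "انه"}'
print(duplicate_keys(sample_source))            # keys written twice
print(ast.literal_eval(sample_source)["راه"])   # the value that actually wins
```

Running such a check on the table before using it makes it possible to decide deliberately which of two conflicting targets should be kept.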

In [66]:
# Phonetic grouping dictionary (standard form -> spelling variants)
phonetic_groups = {
    "تعبئة": ["كنفليكسيو", "نفليكسي", "نفليكسيو", "فليكس", "تفليكسي", "فليكسولنا", "نفلبكسى", "فليكسيت", "كسي", "فليكسي"],
    "connexion": ["conx", "cnx", "ncx", "conexion", "connection"],
    "reseau": ["wrizo", "rizo", "riso", "rysou", "resou", "risou"],
    "اصلحوا": ["ريقلوه", "ريقلو", "صلحو", "ريقولو", "صلحونا", "عدلو", "ريغلونا", "رقليو", "تريغليونا", "لوتصلحولينا", "تريقلونا", "وريقليو"],
    "شبكة": ["ريزو", "اريزو", "الريزو", "ااريزو", "خط", "خطي", "رزو"],
    "ما به": ["واشي", "وشبيه", "شبيه"],
    "لا يوجد": ["ماكاش", "مكانش", "مكاش", "مكااش", "والو", "معنديش", "الو"],
    "أي": ["كش", "كاش"],
    "انترنت": ["كونيكسيون", "كونكسيو", "كونكزيون", "كونيكسيو", "ليدوني", "انترنات", "انترنيت","نترنت", "أنترنات", "الكنكسيو", "الانثرنات", "الانترنت", "أنترنت",  "اترنات", "الكونيكسيوو", "نت", "لكونيكسيو", "كنكسيو", "كونيكسو", "انترن"],
    "جازي": ["جايز", "جاز", "دجيزي", "دجزي", "جيزي", "جايزي"],
    "اسقاس": ["اسوكاس", "اصكاس", "اسيقاس", "اسكاس"],
    "امقاز": ["أموقاز", "امكاز"],
    "رمز": ["كود", "رزم"],
    "بارك": ["بارى", "بيار", "باراك", "يبارك"],
    "شريحة": ["لابيس", "لبيس", "بيس", "لابوس", "لبييس", "ليبيس", "بوس", "سبيسي", "لبووس", "pis"],
    "اوريدو": ["اريدوو", "لأوريد", "ياؤريدوا", "اوريدوا", "واوريدو", "ااوريدو", "اريدو"],
    "ان شاء": ["إنشاء", "نشالله", "انشاء"],
    "ilimite": ["ilm", "ilmt", "ilimiti", "ilim"],
}

# Build an inverse mapping (variant -> standard form) to speed up lookups
phonetic_mapping = {}
for standard, variations in phonetic_groups.items():
    for variant in variations:
        phonetic_mapping[variant] = standard

# Replace phonetic variants with their standard form
def replace_phonetic_variants(text):
    words = text.split()
    return ' '.join([phonetic_mapping[word] if word in phonetic_mapping else word for word in words])

comments_df['Comments'] = comments_df['Comments'].apply(replace_phonetic_variants)
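
When the inverse mapping is built, a variant listed under two standard forms is silently assigned to whichever group comes last. A minimal sketch of a stricter builder is shown below on a toy subset of the groups. `build_inverse_mapping` and `replace_variants` are illustrative names, and raising an error on a collision is an assumption about the desired behavior.

```python
def build_inverse_mapping(groups):
    """Map every variant to its standard form, refusing ambiguous variants."""
    mapping = {}
    for standard, variants in groups.items():
        for variant in variants:
            existing = mapping.setdefault(variant, standard)
            if existing != standard:
                raise ValueError(
                    f"{variant!r} is listed under both {existing!r} and {standard!r}"
                )
    return mapping

def replace_variants(text, mapping):
    """Replace each whitespace-separated token that has a standard form."""
    return " ".join(mapping.get(word, word) for word in text.split())

toy_groups = {"connexion": ["conx", "cnx"], "reseau": ["rizo", "riso"]}
toy_mapping = build_inverse_mapping(toy_groups)
print(replace_variants("cnx rizo faible", toy_mapping))
```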

## <span style="color: #28A745;">**Downloading the Model**</span>

In [None]:
!wget -O /content/models/DeepSeek-R1-Distill-Qwen-7B-Q8_0.gguf \
"https://huggingface.co/lmstudio-community/DeepSeek-R1-Distill-Qwen-7B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-7B-Q8_0.gguf"

--2025-05-10 02:25:35--  https://huggingface.co/lmstudio-community/DeepSeek-R1-Distill-Qwen-7B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-7B-Q8_0.gguf
Resolving huggingface.co (huggingface.co)... 3.168.73.38, 3.168.73.129, 3.168.73.106, ...
Connecting to huggingface.co (huggingface.co)|3.168.73.38|:443... connected.
HTTP request sent, awaiting response... 302 Found

## <span style="color: #28A745;">**Creating the Llama Instance**</span>

In [None]:
llm = Llama(
    model_path="/content/models/DeepSeek-R1-Distill-Qwen-7B-Q8_0.gguf",
    n_gpu_layers=-1,
    use_mlock=False,
    n_ctx=2048,
    verbose=False
)
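
The context window set by `n_ctx=2048` is shared between the prompt and the generated tokens. With the completion budget of 1000 tokens used later in this notebook, 1048 tokens remain for the system prompt plus the comment. The sketch below only makes that arithmetic explicit; `prompt_token_budget` and `fits_context` are hypothetical helpers.

```python
def prompt_token_budget(n_ctx: int, max_tokens: int) -> int:
    """Tokens left for the prompt once the completion budget is reserved."""
    if max_tokens >= n_ctx:
        raise ValueError("max_tokens must be smaller than n_ctx")
    return n_ctx - max_tokens

def fits_context(n_prompt_tokens: int, n_ctx: int, max_tokens: int) -> bool:
    """True when prompt and completion together fit in the context window."""
    return n_prompt_tokens <= prompt_token_budget(n_ctx, max_tokens)

print(prompt_token_budget(2048, 1000))
print(fits_context(1400, 2048, 1000))
```

Once the model is loaded, the prompt length could be measured with something like `len(llm.tokenize(prompt.encode("utf-8")))`, assuming the `tokenize` method of the `Llama` class. If the prompt does not fit, raising `n_ctx` or lowering `max_tokens` are the two obvious levers.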

## <span style="color: #28A745;">**System Prompt for Comment Classification**</span>

In [None]:
system_prompt = (
    "You are a highly intelligent multilingual assistant specialized in **telecom customer feedback classification**. "
    "Your job is to analyze each customer comment, identified by its **unique ID**, and classify it into one or more **appropriate subcategories** based on the content.\n\n"

    "### 🎯 Classification Subcategories:\n"
    "- **Call Quality Issues** (e.g., Dropped calls, Voice distortion, Call connection failures, Can't hear the other person)\n"
    "- **Data Quality Issues (3G, 4G)** (e.g., Slow internet or something is slow, No connection, Intermittent connectivity, Mobile data not working properly)\n"
    "- **Coverage Issues** (e.g., Weak signal, Dead zones, No service in certain areas, Network outages in cities or villages)\n"
    "- **Response Time** (e.g., Slow customer support response, Long wait times on hotline or chat, Delayed service resolution, or reply to me)\n"
    "- **Agent Behavior** (e.g., Rude or unhelpful agents, Lack of knowledge, Unprofessional attitude, Didn’t solve my problem)\n"
    "- **Overcharging and Stolen Credit Issues** (e.g., Unexpected charges, Credit deducted without usage, Sudden balance loss after recharge)\n"
    "- **Billing Errors** (e.g., Wrong invoice details, Duplicate charges, Inaccurate amounts shown on bill, Postpaid billing problems)\n"
    "- **Subscription & Plan Issues** (e.g., Forced subscriptions, Plan switched without consent, Unwanted service activation, Difficult to unsubscribe)\n"
    "- **Hidden Charges** (e.g., Unclear pricing, Unexpected fees, Not informed about deductions, Surprise costs after service use)\n"
    "- **Suggestions** (e.g., Add 5G, Improve offers, Lower prices, Requesting better bundles or roaming options)\n"
    "- **Promotions & Discounts** (e.g., Misleading promotions, No actual benefit, Ads say something else, Discounts not applied)\n"
    "- **Competitor Pricing & Value (Mobilis and Djezzy)** (e.g., Better pricing or deals from competitors, More data or minutes for the same price)\n"
    "- **Competitor Plan Flexibility (Mobilis and Djezzy)** (e.g., More customizable plans by competitors, Ability to build own bundles)\n"
    "- **Competitor Data Consumption (Mobilis and Djezzy)** (e.g., Faster data exhaustion compared to competitors, My data lasts longer on Djezzy)\n"
    "- **Loyalty Expression** (e.g., Well done, Good service, Satisfied customer, Happy New Year, Saha Ramdankom, Thank you messages, Good and positif Expressions and not negatif)\n"
    "- **Service Information Request** (e.g., What is the code for loan? How to activate roaming? How to check balance? How much does it cost? Questions about using a service or feature, and About Details, Details about the offer)\n"
    "- **Other** (e.g., Irrelevant, incomprehensible, sarcasm, unrelated comment, joke or spam)\n\n"


    "### 🧠 Classification Instructions:\n"
    "1. Carefully read the content of each comment.\n"
    "2. Determine which subcategories best represent the issue(s).\n"
    "3. Multiple subcategories are allowed when relevant and even encouraged when multiple issues are present.\n"
    "4. DO NOT limit to one category if the comment contains multiple concerns.\n"

    "### 📤 Output Format:\n"
    "Respond **strictly** using the following JSON structure (no extra explanations):\n\n"
      "{\n"
      "  \"45\": {\n"
      "    \"categories\": [\"Promotions & Discounts\", \"Suggestions\"]\n"
      "  },\n"
      "  \"69\": {\n"
      "    \"categories\": [\"Call Quality Issues\"]\n"
      "  }\n"
      "}\n\n"

    "⚠️ **Important Notes**:\n"
    "- Do NOT include any commentary or explanation in the output.\n"
    "- Always use the exact category labels listed above.\n"
    "- Do NOT invent new categories.\n"
    "- Ensure output is in **valid JSON format**.\n"
    "- Each comment ID must map to at least one subcategory.\n\n"

    "### 🌐 Supported Languages:\n"
    "- Arabic (العربية)\n"
    "- Algerian Darija (الدارجة الجزائرية)\n"
    "- French (Français)\n"
    "- English\n\n"

    "Proceed with the classification."
)
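
The prompt asks the model to use the exact labels and never invent new ones, but nothing prevents it from doing so. A small post-processing check can enforce the rule on the returned list. The sketch below is one possible approach: `keep_known_categories` is a hypothetical helper, `allowed_labels` is a shortened stand-in for the full label set, and falling back to "Other" mirrors the fallback used elsewhere in this notebook.

```python
def keep_known_categories(categories, allowed, default="Other"):
    """Keep only allowed labels (order preserved, no repeats); fall back to a default."""
    kept = []
    for label in categories:
        if not isinstance(label, str):
            continue  # ignore anything that is not a string label
        label = label.strip()
        if label in allowed and label not in kept:
            kept.append(label)
    return kept or [default]

# Shortened label set, for illustration only
allowed_labels = {"Call Quality Issues", "Suggestions", "Promotions & Discounts", "Other"}

print(keep_known_categories(["Suggestions", "Pricing Complaints", "Suggestions "], allowed_labels))
print(keep_known_categories(["Made Up Label"], allowed_labels))
```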

## <span style="color: #28A745;">**Test Sample**</span>

In [None]:
# Keep only the comments whose "ID Comment" is between 2110 and 2400 inclusive
test_df = comments_df[(comments_df["ID Comment"] >= 2110) & (comments_df["ID Comment"] <= 2400)].copy()
test_df

## <span style="color: #28A745;">**Comment Classification**</span>

In [None]:
# Inference loop
results = {}

for idx, row in test_df.iterrows():
    comment_id = str(row['ID Comment'])
    comment_text = str(row['Comments'])

    prompt = (
        system_prompt
        + f"\n\nClassify the following customer comment. Respond using ONLY the JSON format.\n"
        + f"Comment ID: \"{comment_id}\"\n"
        + f"Comment: \"{comment_text}\"\n"
        + "\nDO NOT limit to one category if the content contains multiple concerns.\n"
        + "\nReturn the result in this exact structure:\n"
        + "{\n"
        + f"  \"{comment_id}\": {{\n"
        + "    \"categories\": [\"Category A\", \"Category B\"]\n"
        + "  }\n"
        + "}\n"
        + "\n⚠️ DO NOT COMMENT. DO NOT THINK. ONLY RETURN STRICT JSON."
    )

    try:
        output = llm(prompt, max_tokens=1000, temperature=0.0)
        raw_response = output["choices"][0]["text"].strip()

        # Extract the JSON object from the raw response
        json_match = re.search(r"\{.*\}", raw_response, re.DOTALL)
        if json_match:
            json_str = json_match.group()
            parsed = json.loads(json_str)
            results.update(parsed)
            print(json.dumps(parsed, indent=2, ensure_ascii=False))  # Print only the formatted JSON
        else:
            raise ValueError("No JSON found")

    except Exception:
        fallback = {
            comment_id: {
                "categories": ["Other"]
            }
        }
        results.update(fallback)
        print(json.dumps(fallback, indent=2, ensure_ascii=False))

    sleep(0.5)
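
A greedy `\{.*\}` match spans from the first opening brace to the last closing brace in the response, so any brace in surrounding text makes `json.loads` fail and the comment falls back to "Other". The sketch below is a more tolerant extraction. It assumes the model may wrap reasoning in `<think>…</think>` tags, which is an assumption about this model's output format. `extract_first_json_object` is a hypothetical helper.

```python
import json
import re

def extract_first_json_object(raw: str):
    """Return the first decodable JSON object found in raw, or None."""
    # Assumption: reasoning, if present, is enclosed in <think>...</think>
    text = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", text):
        try:
            # raw_decode parses one JSON value starting at the given index
            obj, _end = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue
        return obj
    return None

raw_response = (
    '<think>maybe {"x": 1}?</think>\n'
    'Answer: {"45": {"categories": ["Suggestions"]}} done }'
)
print(extract_first_json_object(raw_response))
```

Because decoding stops at the end of the first complete object, trailing text or stray braces after the answer no longer invalidate the result.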

## <span style="color: #28A745;">**Saving the Results**</span>

In [None]:
# Save the results to a JSON file
json_path = "/content/Results/classified_comments.json"
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)

In [None]:
# Read the JSON file back from disk
with open("/content/Results/classified_comments.json", "r", encoding="utf-8") as f:
    classification_dict = json.load(f)

# List of categories
category_list = [
    "Call Quality Issues",
    "Data Quality Issues (3G, 4G)",
    "Coverage Issues",
    "Response Time",
    "Agent Behavior",
    "Overcharging and Stolen Credit Issues",
    "Billing Errors",
    "Subscription & Plan Issues",
    "Hidden Charges",
    "Suggestions",
    "Promotions & Discounts",
    "Competitor Pricing & Value (Mobilis and Djezzy)",
    "Competitor Plan Flexibility (Mobilis and Djezzy)",
    "Competitor Data Consumption (Mobilis and Djezzy)",
    "Loyalty Expression",
    "Service Information Request",
    "Sarcasme",
    "Irrelevant",
    "Other"
]

# Initialize every category column to 0
for cat in category_list:
    if cat not in comments_df.columns:
        comments_df[cat] = 0

# Fill in the columns according to the detected categories
for comment_id_str, data in classification_dict.items():
    comment_id = int(comment_id_str)
    categories = data.get("categories", [])
    for category in categories:
        if category in category_list:
            comments_df.loc[comments_df["ID Comment"] == comment_id, category] = 1
            if category in ("Sarcasme", "Irrelevant"):  # Fold these two labels into "Other"
                comments_df.loc[comments_df["ID Comment"] == comment_id, "Other"] = 1
# Drop "Sarcasme" and "Irrelevant" columns after the loop
comments_df = comments_df.drop(columns=["Sarcasme", "Irrelevant"])

# Save the new DataFrame
comments_df.to_csv("/content/Results/comments_df_classified.csv", index=False)

# Show a preview
comments_df.head()
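
The final cell turns each comment's category list into 0/1 indicator columns and folds "Sarcasme" and "Irrelevant" into "Other". The same logic for a single comment can be sketched with plain dictionaries. `ALIASES`, `to_indicator_row`, and the shortened `columns` list are illustrative and not part of the notebook.

```python
# Labels that are merged into another column before encoding
ALIASES = {"Sarcasme": "Other", "Irrelevant": "Other"}

def to_indicator_row(categories, columns):
    """Build a {column: 0 or 1} row from a list of category labels."""
    row = {column: 0 for column in columns}
    for label in categories:
        label = ALIASES.get(label, label)
        if label in row:  # unknown labels are ignored
            row[label] = 1
    return row

columns = ["Coverage Issues", "Suggestions", "Other"]
print(to_indicator_row(["Coverage Issues", "Sarcasme"], columns))
```

A comment that was never classified keeps 0 in every column, so it cannot be told apart from a comment with no matching label unless the classified IDs are tracked separately.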

<h3 style="text-align: center; color: #E30613;"><b><i>Developed by: OUARAS Khelil Rafik</i></b></h3>