<h1 style="text-align: center; color: #E30613;"><b><i>Automatic Annotation of Comments</i></b></h1>

<p style="font-size: 18px;">
This notebook automates the classification of customer comments in the telecommunications domain.
Using a large language model and natural language processing techniques,
we analyze and categorize user feedback to help improve service quality.
</p>

<h2 style="color: #28A745;">Main Objectives:</h2>
<ul style="font-size: 16px; color: #333;">
    <li>Analyze customer comments to identify recurring problems.</li>
    <li>Classify the feedback into subcategories for better understanding.</li>
    <li>Provide actionable insights to improve services and customer satisfaction.</li>
</ul>

<h2 style="color: #28A745;">Technologies Used:</h2>
<ul style="font-size: 16px; color: #333;">
    <li><b>Pandas:</b> Data manipulation and preprocessing.</li>
    <li><b>Regex:</b> Text cleaning and normalization.</li>
    <li><b>emoji:</b> Detecting and removing emojis in comments.</li>
    <li><b>Llama:</b> The <code>llama-cpp-python</code> interface used to run the language model that classifies the comments.</li>
    <li><b>JSON:</b> Structuring and saving the classification results.</li>
</ul>

<h2 style="color: #28A745;">Workflow:</h2>
<ol style="font-size: 16px; color: #333;">
    <li><b>Data Loading:</b> Import customer comments from an Excel file.</li>
    <li><b>Preprocessing:</b> Cleaning, normalization, and removal of duplicates.</li>
    <li><b>Classification:</b> Categorize the comments with a model run through Llama.</li>
    <li><b>Saving:</b> Export the results to JSON and CSV files for later analysis.</li>
</ol>

<h2 style="color: #28A745;">Expected Results:</h2>
<ul style="font-size: 16px; color: #333;">
    <li>An accurate classification of comments into relevant subcategories.</li>
    <li>A better understanding of the problems customers encounter.</li>
    <li>Actionable recommendations for improving services.</li>
</ul>

## <span style="color: #28A745;">**Required Libraries**</span>

In [1]:
%pip install emoji pandas llama-cpp-python tqdm --quiet

import pandas as pd
import re
import emoji
from collections import defaultdict
import os
from llama_cpp import Llama
import json
from tqdm import tqdm
from time import sleep


## <span style="color: #28A745;">**Data Loading**</span>

In [2]:
# Load the data
comments_df = pd.read_excel('/content/Comments.xlsx')
comments_df

Unnamed: 0,ID Post,User Name,Comments,Sentiments
0,1,Samir Bekhouche,,Neutre
1,1,Yanise Yanise,سلام عليكم ورحمة لديا مشكلة ! فليكسيت 100 دج و...,Negatif
2,1,Jj Kie,كل عام و انتم بخير,Positif
3,1,Sakou Younes,كل عام و أنتم بخير,Positif
4,1,راني نعاني,كل عام و حنا بخير,Positif
...,...,...,...,...
4097,183,Ĺã Rõsë Ýb,❤️❤️,Positif
4098,183,نسمات هادئة,💕💕💕💕,Positif
4099,183,ملك ملهاش غيرك,❤❤❤❤❤❤🌹,Positif
4100,183,سعيدي رضا,,Neutre


## <span style="color: #28A745;">**Preprocessing**</span>

In [3]:
# Drop rows whose "User Name" is "Djezzy", "Mobilis", or "Ooredoo Algérie"
comments_df = comments_df[~comments_df["User Name"].isin(["Djezzy", "Mobilis", "Ooredoo Algérie"])]

# Drop rows where "Comments" is missing or contains only whitespace
comments_df = comments_df.dropna(subset=["Comments"])
comments_df = comments_df[comments_df["Comments"].str.strip() != ""]

# Drop consecutive duplicates (a comment identical to the one on the previous row)
comments_df = comments_df.loc[comments_df["Comments"].shift() != comments_df["Comments"]]
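The `shift()` comparison above drops a comment only when it repeats the comment on the row immediately before it; the same text appearing again later in the file is kept. The plain-Python sketch below mirrors that rule on a toy list. The helper name `drop_consecutive_duplicates` is hypothetical and not part of the notebook.

```python
def drop_consecutive_duplicates(items):
    """Keep an item unless it equals the item directly before it."""
    kept = []
    previous = object()  # sentinel that never equals a real comment
    for item in items:
        if item != previous:
            kept.append(item)
        previous = item
    return kept


sample = ["merci", "merci", "reseau faible", "merci"]
deduped = drop_consecutive_duplicates(sample)
print(deduped)
```

If globally unique comments were wanted instead, `drop_duplicates(subset=["Comments"])` would be the pandas call to consider.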

In [4]:
# Normalize the comments: lowercase, then map Arabic-script and accented Latin variants to base letters
def normalize_arabic(text):
    text = text.lower()
    text = re.sub("گ", "ك", text)
    text = re.sub("ڭ", "ك", text)
    text = re.sub("ڤ", "ق", text)
    text = re.sub("ڨ", "ق", text)
    text = re.sub("پ", "ب", text)
    text = re.sub("é", "e", text)
    text = re.sub("ê", "e", text)
    text = re.sub("ë", "e", text)
    text = re.sub("ç", "c", text)
    text = re.sub("à", "a", text)
    text = re.sub("â", "a", text)
    text = re.sub("ä", "a", text)
    text = re.sub("î", "i", text)
    text = re.sub("ï", "i", text)
    text = re.sub("æ", "ae", text)
    text = re.sub("œ", "oe", text)
    return text

# Apply the normalization
comments_df["Comments"] = comments_df["Comments"].apply(normalize_arabic)
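The same character substitutions can be expressed as a single translation table, which keeps each mapping on one line and applies all of them in one pass. This is a sketch of an equivalent approach, not the notebook's code; the names `CHAR_MAP` and `normalize_with_table` are hypothetical, and the table assumes `ï` should map to `i`.

```python
# One-to-one and one-to-many replacements (every key is a single character).
CHAR_MAP = str.maketrans({
    "گ": "ك", "ڭ": "ك",
    "ڤ": "ق", "ڨ": "ق",
    "پ": "ب",
    "é": "e", "ê": "e", "ë": "e",
    "ç": "c",
    "à": "a", "â": "a", "ä": "a",
    "î": "i", "ï": "i",
    "æ": "ae", "œ": "oe",
})


def normalize_with_table(text):
    """Lowercase the text, then apply every substitution in a single pass."""
    return text.lower().translate(CHAR_MAP)


print(normalize_with_table("Réseau coupé à Sétif"))
```

A table like this also makes it easier to spot a mapping that points to the wrong letter, since all pairs sit side by side.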

In [5]:
def clean_text(text):
    text = emoji.replace_emoji(text, replace=" ")  # Remove emojis
    text = re.sub(r'http\S+|htps\S+', " ", str(text))  # Remove hyperlinks (including the "htps" misspelling)
    text = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', " ", str(text))  # Remove URLs
    text = re.sub(r'@\S+', '', str(text))  # Remove words starting with @
    text = text.replace("_", " ").replace("#", "")  # Remove # and _
    text = text.replace("'", " ")  # Remove '
    text = re.sub(r'[.,،؛]', " ", text)  # Remove punctuation

    # Remove reserved words (the text is already lowercased, so match case-insensitively)
    text = re.sub(r'\bRT\b|\bRetweeted\b', " ", text, flags=re.IGNORECASE)

    # Remove Arabic short-vowel diacritics (harakat)
    harakat = "[\u064B-\u0652]"  # Fathatan (U+064B) through sukun (U+0652)
    text = re.sub(harakat, '', text)

    text = re.sub(r"(.)\1{2,}", r"\1", text)  # Collapse runs of three or more identical characters into one
    text = text.replace('\n', " ").replace('/', " ")  # Replace line breaks and / with spaces

    return text.strip()

comments_df["Comments"] = comments_df["Comments"].apply(clean_text)

# Drop comments that are empty after cleaning (clean_text always returns a string)
comments_df = comments_df[comments_df["Comments"].str.strip() != ""]
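One detail matters when writing alternations such as those in `clean_text`: outside of `re.VERBOSE` mode, spaces inside a pattern are matched literally, so `a | b` means "`a` followed by a space" or "a space followed by `b`". The short demonstration below is illustrative only, and the sample string is made up.

```python
import re

sample = "reseau faible,connexion lente.merci"

# Spaces around "|" are part of the pattern, so this needs ". " or " , " to match.
spaced = re.sub(r'\. | , ', " ", sample)

# A character class matches each punctuation mark wherever it appears.
compact = re.sub(r'[.,]', " ", sample)

print(spaced)   # unchanged: no ". " or " , " sequence occurs in the sample
print(compact)
```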

In [6]:
abbreviations = {
    "mrc": "merci",
    "num": "numéro",
    "numro": "numéro",
    "nn": "non",
    "bn": "bonne",
    "topp": "top",
    "شوي": "قليل",
    "شويا": "قليل",
    "لزونيتي": "الوحدات",
    "حنا": "نحن",
    "أسإلتي": "اسئلة",
    "حاص": "خاص",
    "مدام": "بينما",
    "رانا": "نحن",
    "بون": "حسن",
    "توس": "جميع",
    "سبيسيال": "منفرد",
    "سبيال": "منفرد",
    "باه": "لكي",
    "زاف": "كثيرا",
    "بزاف": "كثيرا",
    "نتمني": "نتمنى",
    "فظلكم": "فضلكم",
    "ضلم": "ظلم",
    "ردى": "رد",
    "علاش": "لماذا",
    "علاه": "لماذا",
    "علا": "لماذا",
    "هذ": "هذا",
    "تبرعو": "تبرع",
    "ابليكاسيو": "تطبيق",
    "أبليكاسيون": "تطبيق",
    "أبليكاسيو": "تطبيق",
    "الخاص بكم": "نتاعكم",
    "ديرولنا": "افعلو لنا",
    "ديرو": "افعلو",
    "نحا": "نزع",
    "دير": "افعل",
    "حاجة": "شيء",
    "نتع": "خاص ب",
    "تاع": "خاص ب",
    "تع": "خاص ب",
    "جاوبوني": "رد علي",
    "مشتاركه": "مشاركة",
    "رحو": "اذهبو",
    "هدي": "هدية",
    "هذي": "هذه",
    "يروح": "يذهب",
    "يجي": "يأتي",
    "وقتاه": "متى",
    "وقتاش": "متى",
    "وقتش": "متى",
    "تردولنا": "تردون",
    "نتاع": "خاص ب",
    "شك": "من",
    "شكون": "من",
    "نحيتوها": "نزع",
    "نحيتو": "نزع",
    "نحيت": "نزع",
    "سلفلي": "قرض",
    "سلفولي": "قرض",
    "غدوة": "غدا",
    "تلقى": "تجد",
    "نلقى": "اجد",
    "نأكتيفيها": "تفعيل",
    "نأكتيفي": "تفعيل",
    "ناكتيفي": "تفعيل",
    "نعانيو": "نعانون",
    "نعانو": "نعانون",
    "ميت حال": "رديء",
    "مفهمتش": "لم افهم",
    "انو": "انه",
    "علابالي": "اعلم",
    "ليتو": "اصبحتم",
    "وليتو": "اصبحتم",
    "تدو": "تاخذون",
    "الى": "إلى",
    "خاستني": "احتاج",
    "خستني": "احتاج",
    "مشى": "تعمل",
    "مشا": "تعمل",
    "شكا": "شكوى",
    "شكيت": "شكوى",
    "زرو": "سيء",
    "زيرو": "سيء",
    "سرفيك": "خدمة",
    "سرفيس": "خدمة",
    "جيزي اب": "djezzy app",
    "nchlh": "incha allah",
    "riglou": "regler",
    "bah": "pour",
    "nwaliw": "devenir",
    "k": "comme",
    "مراحش": "لن",
    "مهمش": "ليسو",
    "تعليع": "تعليق",
    "زااف": "كثيرا",
    "khayan": "سرق",
    "djezzyy": "djezzy",
    "ومرديتش": "لم ترد",
    "واش": "ماذا",
    "منبعد": "بعد ذلك",
    "تفتحش": "لا تفتح",
    "ندير": "افعل",
    "راه": "اصبح",
    "لازم": "يجب",
    "pix": "pixx",
    "Twanty": "Twenty",
    "orod": "prod",
    "imtiyaaz": "imtiyaz",
    "tm": "ok",
    "نم": "تم",
    "جداو": "جدا",
    "زلب": "زبل",
    "كونطرا": "عقد",
    "شري": "شراء",
    "توب": "رائع",
    "جزل": "جزيلا",
    "ياسر": "كثيرا",
    "golde": "gold",
    "mknch": "introuvable",
    "rani": "je suis",
    "شكرالكم": "شكرا لكم",
    "viv": "vive",
    "عتل": "ارسل",
    "بعت": "ارسل",
    "لاجونس": "مقر",
    "rpnd": "repond",
    "prv": "prive",
    "svp": "s il vous plais",
    "yennayer": "سنة",
    "amervuh": "سعيدة",
    "assgas": "assegas",
    "asugas": "assegas",
    "asegas": "assegas",
    "amegaz": "amegas",
    "amgaz": "amegas",
    "amegaz": "amegas",
    "amgaz": "amegas",
    "amegaz": "amegas",
    "خخ": "ضحك",
    "هه": "ضحك",
    "سبي": "سبيسيال",
    "شبكهمشكورين": "شبكه مشكورين",
    "يعطيكمصحه": "يعطيكم صحه",
    "نشله": "ان شاء الله",
    "عندوش": "لا يوجد",
    "خفظو": "تخفيض",
    "سء": "سؤال",
    "مليحة": "حسن",
    "مليحه": "حسن",
    "وله": "و الله",
    "مكمات": "مكالمات",
    "وينتا": "متى",
    "تدوها": "تاخذون",
    "felawen": "tous",
    "ya": "il y a",
    "en": "في",
    "panne": "عطل",
    "happy": "سعيدة",
    "koum": "votre",
    "ayi": "faible",
    "new": "جديدة",
    "year": "سنة",
    "years": "سنة",
    "شنو": "ما هو",
    "هدا": "هذا",
    "شالنج": "تحدي",
    "li": "qui",
    "bghi": "aime",
    "ndirlo": "faire",
    "yji": "viens",
    "lah": "pourquoi",
    "raho": "que il",
    "hbs": "arret",
    "بر": "فقط",
    "برك": "فقط",
    "غي": "الا",
    "غير": "الا",
    "الي": "الى",
    "حسنو": "اصلاح",
    "سقمو": "اصلاح",
    "جوند": "legend",
    "يجاند": "legend",
    "ليجند": "legend",
    "ارطيا": "جزء",
    "علجال": "من أجل",
    "تثقال": "بطء",
    "كون": "ليت",
    "بغى": "اراد",
    "يبغي": "يريد",
    "نبغي": "نريد",
    "تفرج": "مشاهدة",
    "ماتش": "مباراة",
    "رجا": "رجاء",
    "متمشلكش": "لا تعمل",
    "متمسيلكش": "لا تعمل",
    "لايص": "اماكن",
    "بلايص": "اماكن",
    "بلاصة": "اماكن",
    "نسقسي": "اسأل",
    "اذ": "اذا",
    "يمتي": "بلا حدود",
    "اليميتي": "بلا حدود",
    "خسني": "اريد",
    "باطل": "مجانا",
    "قولد": "gold",
    "تعيف": "بطء",
    "نسييو": "محاولة",
    "نلعبو": "لعب",
    "نكونو": "أكون",
    "عب": "لعب",
    "كف": "كيف",
    "اللهيوفقناجميعاقولويارب": "الله يوفقنا جميع اقولو يارب",
    "congratulations": "مبروك",
    "berkaw": "arret",
    "ser": "vole",
    "تردوش": "لا تردون",
    "هضرت": "تكلمت",
    "باش": "لكي",
    "تحلى": "حل",
    "تحلي": "حل",
    "لينا": "لنا",
    "حض": "حظ",
    "wech": "Quoi",
    "ndirou": "faire",
    "bach": "pour",
    "nrebhou": "gagner",
    "elfe": "mille",
    "mabrok": "felicitations",
    "koules": "tous",
    "moucharikones": "participants",
    "el": "les",
    "mabrouk": "مبروك",
    "اطوههالي": "اعطوها لي",
    "شاب": "جميل",
    "يعطيكمالصحة": "يعطيكم الصحة",
    "illa": "lent",
    "woww": "wow",
    "يارب": "يا رب",
    "كلشي": "كل شيء",
    "كنكتي": "تواصل",
    "مانكونيكتيش": "لا أتواصل",
    "يووز": "yooz",
    "يوز": "yooz",
    "وش": "ما هو",
    "وشمن": "اي",
    "ديما": "dima",
    "راه": "انه",
    "معجبتنيش": "سيء",
    "تحبسلي": "توقف",
    "انتاع": "ل",
    "لوس": "plus",
    "شكراوريدو": "شكرا أوريدو",
    "در": "فعل",
    "رهي": "انه",
    "كاين": "يوجد",
    "مباغش": "لا يريد",
    "يمدلي": "يعطيني",
    "يخرجو": "خروج",
    "اكتر": "أكثر",
    "مايمشيش": "لا يعمل",
    "سوايع": "ساعة",
    "محبتش": "لا",
    "جام": "مستحيل",
    "جامي": "مستحيل",
    "تمشلي": "تعمل",
    "ريفي": "خاص",
    "قاع": "كل",
    "منربحش": "لا أربح",
    "منربحوش": "لا أربح",
    "نقارع": "صبر",
    "مفتحتوهاليش": "لا تفتح",
    "ابونمون": "اشتراك",
    "مساج": "رسالة",
    "رجعو": "رد",
    "صحيتو": "شكرا",
    "لاتوجد": "لا توجد",
    "نلقاش": "لا أجد",
    "ما نلقاش": "لا أجد",
    "كريدي": "رصيد",
    "ريبونديولنا": "رد",
    "جد": "جدا",
    "ينحيولي": "نزع",
    "يحذفولي": "نزع",
    "ماسلفت": "لم اقترض",
    "مدايرا": "لم أفعل",
    "راهي": "إنها",
    "فور": "ممتاز",
    "هايل": "ممتاز",
    "مليح": "جيد",
    "شابة": "جميل",
    "ماصلحتليش": "لا تعمل",
    "إستلاف": "قرض",
    "خص": "اريد",
    "هاذ": "هذا",
    "شحال": "كم",
    "تاكتيفيه": "تفعيل",
    "حتان": "كي",
    "كيفاش": "كيف",
    "غلطة": "خطا",
    "ختاريت": "خيار",
    "ديالي": "خاص بي",
    "حاب": "اريد",
    "ويل": "أو",
    "ابعث": "ارسال",
    "goold": "gold",
    "ماتجاوبوش": "عدم رد",
    "مهدى": "هدية",
    "واشهرالجاي": "شهر موالي",
    "نخلصش": "لا أدفع",
    "ابليس": "plus",
    "تأكتيفها": "تفعيل",
    "هيل": "hayla",
    "هاذي": "هذه",
    "وشمن": "ما هي",
    "جزاير": "جزائر",
    "معن": "معنى",
    "فاه": "فيها",
    "فاش": "اي",
    "مكوبي": "مقطوع",
    "مي": "لكن",
    "مايمشيلك": "لا يعمل",
    "نديرلهم": "اعمل لهم",
    "تتمسخرو": "استهزاء",
    "ريبونديو": "رد",
    "plais" : "من فظلكم",
    "كفاه": "كيف",
    "ندموندي": "طلب",
    "دخلتو": "ادخال",
    "العالميه": "عالمية",
    "قات": "بقي",
    "بقات": "بقي",
    "nechlh": "incha allah",
    "jdida": "جديد",
    "كيفما": "كيف",
    "اندير": "افعل",
    "ذيم": "دائما",
    "وين": "أين",
    "راهي": "هي",
    "ماندمت": "ندم",
    "حشو": "خدعة",
    "حشوة": "خدعة",
    "انشاءاللهتكونمننصيبي": "ان شاء الله تكون من نصيبي",
    "نتمنالهم": "اتمنى",
    "راحوش": "لم يذهب",
    "بادن": "باذن",
    "الي": "الذي",
    "تغلقلو": "غلق",
    "نيمروه": "رقم",
    "يسترجعو": "استرجاع",
    "رانيني": "ranini",
    "عقوبة": "عاقبة",
    "قوب": "عاقبة",
    "حطو": "وضع",
    "لينا": "لنا",
    "ستمرار": "مستمر",
    "لابيلكاسيو": "تطبيق",
    "لبليكاسيو": "تطبيق",
    "متمشيش": "لا تعمل",
    "حبستوها": "توقف",
    "حمال": "تحميل",
    "مزان": "ميزان",
    "جيه": "جهة",
    "فلاترددو": "فلا تترددو",
    "خدمهوعروضاطاقم": "خدمة و عروض طاقم",
    "فلعاصمة": "في عاصمة",
    "nztjwice": " ",
    "refvf": " ",
    "b": " ",
    "br": " ",
    "ابار": "يبارك",
    "يحفضكم": "يحفظكم",
    "يلوس": "plus",
    "تاه": "متى",
    "بف": " ",
    "sagmou": "regler",
    "شويش": "switch",
    "بغية": "اريد",
    "dialkom": "vous",
    "sahel": "facile",
    "toop": "top",
    "هنيء": "مبروك",
    "puse": "sim",
    "pui": "puis",
    "nerbah": "gagne",
    "ghir": "sauf",
    "بيان": "جيدا",
    "ماتمشيش": "لا تعمل",
    "نحافضو": "حفاظ",
    "تخلاص": "انتهاء",
    "متعرفش": "لا تعلم",
    "خاصتا": "خاصة",
    "وفى": "و في",
    "ظل": "دائما",
    "مدرتو": "لم تفعلو",
    "لموبليس": "موبيليس",
    "را": "انه",
    "صر": "حدث",
    "rah": "il est",
    "gae": "tous",
    "dok": "maintenant",
    "لابليكاسيون": "تطبيق",
    "كلش": "كل",
    "مشاءالله": "ما شاء الله",
    "حبس": "توقف",
    "زيد": "ايضا",
    "تسرقو": "سرق",
    "تاكتيفي": "تفعيل",
    "زوعاماء": "زعماء",
    "ul": " ",
    "كدب": "كذب",
    "ميقراسيون": "تبديل",
    "شح": "كم",
    "فيهاش": "لا يوجد",
    "نجاوب": "اجيب",
    "راحت": "ذهب",
    "متسواش": "سيء",
    "تعكم": "خاص بكم",
    "حقاهايلة": "ممتاز",
    "صرقولي": "سرق",
    "صرق": "سرق",
    "ويني": "اين",
    "حمدلله": "حمد لله",
    "رحمن": "رحمان",
    "نشال": "شاء الله",
    "ماشاء": "ما شاء",
    "تنحولي": "نزع",
    "خمسلاف": "خمسة ألف",
    "خمسلاف": "خمسة ألف",
    "مفعلتهاش": "لا تفعيل",
    "سرقتولي": "سرق",
    "لكريدي": "رصيد",
    "انترنتوراها": "اين انترنت",
    "تعييف": "سيء",
    "تعيف": "سيء",
    "ضك": "الان",
    "جزى": "جزاك",
    "نشاء": "ان شاء",
    "إنشاءالله": "ان شاء الله",
    "نشالله": "ان شاء الله",
    "شاءالله": "شاء الله"
}

def replace_abbreviations(text):
    # Token-by-token lookup: only single-word keys can match here
    words = text.split()
    return ' '.join(abbreviations.get(word, word) for word in words)

comments_df['Comments'] = comments_df['Comments'].apply(replace_abbreviations)
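Because `replace_abbreviations` looks up one whitespace-separated token at a time, dictionary keys that contain a space (for example `"ميت حال"` or `"جيزي اب"`) can never match. A possible way to handle them is to replace multi-word keys first with a regex and then fall back to the per-token lookup. The sketch below uses a small stand-in dictionary and hypothetical names (`demo_abbreviations`, `replace_with_phrases`); it is not the notebook's implementation.

```python
import re

# Stand-in dictionary with one multi-word key and two single-word keys.
demo_abbreviations = {
    "جيزي اب": "djezzy app",
    "mrc": "merci",
    "nn": "non",
}


def replace_with_phrases(text, mapping):
    """Replace multi-word keys first, then single tokens."""
    phrases = {k: v for k, v in mapping.items() if " " in k}
    # Longest phrases first so a longer key wins over a shorter overlapping one.
    for phrase in sorted(phrases, key=len, reverse=True):
        # Match the phrase only when it is delimited by whitespace or text boundaries.
        pattern = r"(?<!\S)" + re.escape(phrase) + r"(?!\S)"
        text = re.sub(pattern, lambda _m, p=phrase: phrases[p], text)
    return " ".join(mapping.get(word, word) for word in text.split())


print(replace_with_phrases("mrc جيزي اب nn", demo_abbreviations))
```

Note that the words produced by a phrase replacement still pass through the per-token lookup afterwards, which may or may not be desirable depending on the dictionary.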

In [7]:
# Phonetic grouping dictionary
phonetic_groups = {
    "تعبئة": ["كنفليكسيو", "نفليكسي", "نفليكسيو", "فليكس", "تفليكسي", "فليكسولنا", "نفلبكسى", "فليكسيت", "كسي", "فليكسي"],
    "connexion": ["conx", "cnx", "ncx", "conexion", "connection"],
    "reseau": ["wrizo", "rizo", "riso", "rysou", "resou", "risou"],
    "اصلحوا": ["ريقلوه", "ريقلو", "صلحو", "ريقولو", "صلحونا", "عدلو", "ريغلونا", "رقليو", "تريغليونا", "لوتصلحولينا", "تريقلونا", "وريقليو"],
    "شبكة": ["ريزو", "اريزو", "الريزو", "ااريزو", "خط", "خطي", "رزو"],
    "ما به": ["واشي", "وشبيه", "شبيه"],
    "لا يوجد": ["ماكاش", "مكانش", "مكاش", "مكااش", "والو", "معنديش", "الو"],
    "أي": ["كش", "كاش"],
    "انترنت": ["كونيكسيون", "كونكسيو", "كونكزيون", "كونيكسيو", "ليدوني", "انترنات", "انترنيت","نترنت", "أنترنات", "الكنكسيو", "الانثرنات", "الانترنت", "أنترنت",  "اترنات", "الكونيكسيوو", "نت", "لكونيكسيو", "كنكسيو", "كونيكسو", "انترن"],
    "جازي": ["جايز", "جاز", "دجيزي", "دجزي", "جيزي", "جايزي"],
    "اسقاس": ["اسوكاس", "اصكاس", "اسيقاس", "اسكاس"],
    "امقاز": ["أموقاز", "امكاز"],
    "رمز": ["كود", "رزم"],
    "بارك": ["بارى", "بيار", "باراك", "يبارك"],
    "شريحة": ["لابيس", "لبيس", "بيس", "لابوس", "لبييس", "ليبيس", "بوس", "سبيسي", "لبووس", "pis"],
    "اوريدو": ["اريدوو", "لأوريد", "ياؤريدوا", "اوريدوا", "واوريدو", "ااوريدو", "اريدو"],
    "ان شاء": ["إنشاء", "نشالله", "انشاء"],
    "ilimite": ["ilm", "ilmt", "ilimiti", "ilim"],
}

# Build an inverse mapping (variant -> standard form) for fast lookup
phonetic_mapping = {}
for standard, variations in phonetic_groups.items():
    for variant in variations:
        phonetic_mapping[variant] = standard

# Replace phonetic variants with their standard form
def replace_phonetic_variants(text):
    words = text.split()
    return ' '.join(phonetic_mapping.get(word, word) for word in words)

comments_df['Comments'] = comments_df['Comments'].apply(replace_phonetic_variants)
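When the inverse mapping is built, a variant listed under two standard forms would silently keep only the last one. A small check, sketched here on toy data with hypothetical names (`build_inverse_mapping`, `demo_groups`), can surface such conflicts before they affect the replacement.

```python
def build_inverse_mapping(groups):
    """Return (variant -> standard form, list of conflicting variants)."""
    mapping = {}
    conflicts = []
    for standard, variants in groups.items():
        for variant in variants:
            if variant in mapping and mapping[variant] != standard:
                conflicts.append((variant, mapping[variant], standard))
            mapping[variant] = standard
    return mapping, conflicts


demo_groups = {
    "connexion": ["cnx", "conx"],
    "reseau": ["rizo", "cnx"],  # "cnx" is deliberately repeated to trigger a conflict
}

mapping, conflicts = build_inverse_mapping(demo_groups)
print(mapping["rizo"])
print(conflicts)
```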

In [8]:
# Add an ID column to comments_df
comments_df.insert(0, 'ID Comment', range(1, 1 + len(comments_df)))

# Check the expected format
assert 'ID Comment' in comments_df.columns and 'Comments' in comments_df.columns, "The 'ID Comment' and 'Comments' columns must exist."

# Create the results directory
os.makedirs('/content/Results', exist_ok=True)

# Create the model directory
os.makedirs('/content/models', exist_ok=True)

## <span style="color: #28A745;">**Model Loading**</span>

In [9]:
!wget -O /content/models/DeepSeek-R1-Distill-Qwen-7B-Q8_0.gguf \
"https://huggingface.co/lmstudio-community/DeepSeek-R1-Distill-Qwen-7B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-7B-Q8_0.gguf"


## <span style="color: #28A745;">**Creating the Llama Instance**</span>

In [10]:
llm = Llama(
    model_path="/content/models/DeepSeek-R1-Distill-Qwen-7B-Q8_0.gguf",
    n_gpu_layers=-1,
    use_mlock=False,
    n_ctx=2048,
    verbose=False
)

llama_context: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
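The warning above is a reminder that `n_ctx=2048` caps the combined length of the prompt and the generated tokens. Since the classification prompt is long, it can be worth checking that a full prompt plus the generation budget fits. The helper below is a sketch: `fits_context` is a hypothetical name, and it accepts any callable that turns text into a list of tokens, so it can be tried without loading a model.

```python
def fits_context(tokenize, prompt, n_ctx, max_new_tokens):
    """Return (fits, prompt_tokens) for a prompt and a generation budget."""
    prompt_tokens = len(tokenize(prompt))
    return prompt_tokens + max_new_tokens <= n_ctx, prompt_tokens


# Stand-in tokenizer for illustration only: one token per whitespace-separated word.
def whitespace_tokenize(text):
    return text.split()


ok, used = fits_context(whitespace_tokenize, "classify this comment", n_ctx=2048, max_new_tokens=1000)
print(ok, used)
```

With llama-cpp-python, the tokenizer callable could be `lambda s: llm.tokenize(s.encode("utf-8"))`, since `Llama.tokenize` expects bytes. Real token counts will differ from the whitespace stand-in.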


## <span style="color: #28A745;">**Classification System Prompt**</span>

In [11]:
system_prompt = (
    "You are a highly intelligent multilingual assistant specialized in **telecom customer feedback classification**. "
    "Your job is to analyze each customer comment, identified by its **unique ID**, and classify it into one or more **appropriate subcategories** based on the content.\n\n"

    "### 🎯 Classification Subcategories (Specialized in Telecom Customer Feedback Classification):\n"

    "- **Call Quality Issues**\n"
    "   Issues related to the quality of voice calls, including disruptions, poor audio quality, call drops, or difficulties hearing the other person. These problems are often related to signal strength or network congestion.\n"
    "   (e.g., Dropped calls, Voice distortion, Call connection failures, Can't hear the other person, Background noise during calls, Echoing sound, Calls getting disconnected after a few seconds, Can't make international calls, ...)\n\n"

    "- **Data Quality Issues (3G, 4G, 5G)**\n"
    "   Problems concerning mobile data connections such as slow internet, weak or intermittent connectivity, or data not working properly. This is typically linked to poor network coverage, bandwidth congestion, or technical faults.\n"
    "   (e.g., Slow internet or something is slow, No connection, Intermittent connectivity, Mobile data not working properly, Pages not loading, Streaming is buffering too much, Can't use WhatsApp or Facebook properly, VPN keeps disconnecting, Data stops working when I leave the city, ...)\n\n"

    "- **Coverage Issues**\n"
    "   Issues related to network coverage in specific areas, including weak signals, complete loss of service, or difficulties connecting in certain locations. This can happen in rural areas, basements, or during network outages.\n"
    "   (e.g., Weak signal, Dead zones, No service in certain areas, Network outages in cities or villages, Poor reception indoors, No network in my office building, Can't get a signal during travel, Coverage disappears during bad weather, ...)\n\n"

    "- **Response Time**\n"
    "   Delays or unresponsiveness from customer service, including long wait times on hotlines, chat services, or social media platforms. Customers often express frustration with slow responses to inquiries or complaint resolutions.\n"
    "   (e.g., Slow customer support response, Long wait times on hotline or chat, Delayed service resolution, or reply to me, Took hours to respond on WhatsApp, Hotline always busy, Support email takes days to reply, Waiting for a technician for weeks, ...)\n\n"

    "- **Agent Behavior**\n"
    "   Complaints about the attitude or professionalism of customer service agents. This includes rudeness, lack of helpfulness, insufficient knowledge, or refusal to assist properly.\n"
    "   (e.g., Rude or unhelpful agents, Lack of knowledge, Unprofessional attitude, Didn’t solve my problem, Agent hung up on me, Didn't understand my problem, Gave wrong information, Wasn't willing to help, ...)\n\n"

    "- **Overcharging and Stolen Credit Issues**\n"
    "   Unexpected charges or sudden loss of credit balance without clear reason. These issues often relate to automatic deductions, hidden fees, or credit disappearing after a top-up.\n"
    "   (e.g., Unexpected charges, Credit deducted without usage, Sudden balance loss after recharge, Charged for services I didn’t use, Credit goes down even with Wi-Fi, Data deducted without browsing, Charged twice for the same service, ...)\n\n"

    "- **Billing Errors**\n"
    "   Inaccuracies in billing details, such as incorrect charges, duplicate payments, or errors in the displayed balance. This can affect both prepaid and postpaid customers.\n"
    "   (e.g., Wrong invoice details, Duplicate charges, Inaccurate amounts shown on bill, Postpaid billing problems, Monthly bill higher than usual, Mistake in international call charges, VAT calculation errors, Overbilled for roaming, ...)\n\n"

    "- **Subscription & Plan Issues**\n"
    "   Problems with subscription services or mobile plans. This includes forced activations, unapproved plan changes, or difficulties in deactivating services.\n"
    "   (e.g., Forced subscriptions, Plan switched without consent, Unwanted service activation, Difficult to unsubscribe, Activated a service without my permission, Can't change my plan, Can't deactivate family plan, Charged for a bundle I didn't request, ...)\n\n"

    "- **Hidden Charges**\n"
    "   Concerns about undisclosed fees or unexpected costs that appear on bills or after using certain services. This is often due to unclear communication of terms.\n"
    "   (e.g., Unclear pricing, Unexpected fees, Not informed about deductions, Surprise costs after service use, Charged for SMS I didn't send, International charges without traveling, Costs for voicemail without notification, ...)\n\n"

    "- **Suggestions**\n"
    "   Customer ideas or recommendations for improving telecom services. These are not information requests but rather suggestions for better service offerings or enhancements.\n"
    "   (Suggestions not informations, e.g., Add 5G, Improve offers, Lower prices, Requesting better bundles, Requesting better roaming options, More competitive international plans, Family data sharing options, Better app features, ...)\n\n"

    "- **Promotions & Discounts**\n"
    "   Issues with promotions, discounts, or special offers. This includes misleading advertisements, unfulfilled offers, or confusion over eligibility.\n"
    "   (e.g., Misleading promotions, No actual benefit, Ads say something else, Discounts not applied, Offer expired early, Can't activate Ramadan promo, Gift data not received, Misleading unlimited data claims, ...)\n\n"

    "- **Competitor Pricing & Value (Mobilis and Djezzy)**\n"
    "   Customer feedback comparing the value and pricing of services with competitors, highlighting better deals or cheaper prices elsewhere.\n"
    "   (e.g., Better pricing or deals from competitors, More data or minutes for the same price, Better roaming options at Mobilis, Cheaper bundles with Djezzy, Free international SMS with competitors, ...)\n\n"

    "- **Competitor Plan Flexibility (Mobilis and Djezzy)**\n"
    "   Observations that competitors provide more flexible or customizable mobile plans, allowing for tailored options and personalized bundles.\n"
    "   (e.g., More customizable plans by competitors, Ability to build own bundles, Choose data and minutes independently, Flexible add-ons with Djezzy, Better roaming options at Mobilis, ...)\n\n"

    "- **Competitor Data Consumption (Mobilis and Djezzy)**\n"
    "   Feedback indicating that data lasts longer or is consumed more efficiently with competitors. This often reflects on perceived data efficiency and value for money.\n"
    "   (e.g., Faster data exhaustion compared to competitors, My data lasts longer on Djezzy, Streaming is smoother on Mobilis, I use less data for the same apps with competitors, ...)\n\n"

    "- **Loyalty Expression**\n"
    "   Positive feedback expressing satisfaction, loyalty, or celebratory greetings towards the telecom brand. These are often friendly messages of appreciation.\n"
    "   (e.g., Well done, Good service, Satisfied customer, Happy New Year, Saha Ramdankom, Thank you messages, Good and positive expressions and not negative, Love the service, Keep up the good work, ...)\n\n"

    "- **Service Information Request**\n"
    "   Customer requests for information about services, offers, activation codes, or general inquiries. These are informational questions, not complaints.\n"
    "   (Ask Questions or inform about something e.g., What is the code for loan? How to activate roaming? How to check balance? How much does it cost? Questions about using a service or feature, and About Details, Details about the offer, How to deactivate voicemail, How to transfer credit, ...)\n\n"

    "- **Other**\n"
    "   Any comments that do not fit into the defined categories, including off-topic remarks, jokes, sarcasm, or spam.\n"
    "   (e.g., Irrelevant, incomprehensible, sarcasm, unrelated comment, joke or spam, Meme reactions, Political comments, Non-telecom related complaints, ...)\n\n"

    "### 🧠 Classification Instructions:\n"
    "1. Carefully read the content of each comment.\n"
    "2. Determine which subcategories best represent the issue(s).\n"
    "3. Multiple subcategories are allowed when relevant and even encouraged when multiple issues are present.\n"
    "4. DO NOT limit to one category if the comment contains multiple concerns.\n"

    "### 📤 Output Format:\n"
    "Respond **strictly** using the following JSON structure (no extra explanations):\n\n"
      "{\n"
      "  \"45\": {\n"
      "    \"categories\": [\"Promotions & Discounts\", \"Suggestions\"]\n"
      "  },\n"
      "  \"69\": {\n"
      "    \"categories\": [\"Call Quality Issues\"]\n"
      "  }\n"
      "}\n\n"

    "⚠️ **Important Notes**:\n"
    "- Do NOT include any commentary or explanation in the output.\n"
    "- Always use the exact category labels listed above.\n"
    "- Do NOT invent new categories.\n"
    "- Ensure output is in **valid JSON format**.\n"
    "- Each comment ID must map to at least one subcategory.\n\n"

    "### 🌐 Supported Languages:\n"
    "- Arabic (العربية)\n"
    "- Algerian Darija (الدارجة الجزائرية)\n"
    "- French (Français)\n"
    "- English\n\n"

    "Proceed with the classification."
)
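The prompt asks the model to use the exact category labels, but nothing enforces that on the returned JSON. One option is to keep the labels in a Python collection and filter each response against it. The sketch below lists only three of the labels for brevity; `ALLOWED_CATEGORIES` and `keep_known_categories` are hypothetical names, and falling back to "Other" is a design choice made for this example.

```python
# Subset of the labels from the system prompt (the full list would go here).
ALLOWED_CATEGORIES = {
    "Call Quality Issues",
    "Loyalty Expression",
    "Other",
}


def keep_known_categories(categories, allowed=ALLOWED_CATEGORIES):
    """Drop labels the prompt does not define; fall back to "Other" if none remain."""
    known = [c for c in categories if c in allowed]
    return known or ["Other"]


print(keep_known_categories(["Loyalty Expression", "Praise"]))
print(keep_known_categories(["Network Speed"]))
```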

## <span style="color: #28A745;">**Test Sample**</span>

In [12]:
# Keep only the comments whose 'ID Comment' is between 2801 and 3000 inclusive
test_df = comments_df[(comments_df["ID Comment"] >= 2801) & (comments_df["ID Comment"] <= 3000)].copy()
test_df

Unnamed: 0,ID Comment,ID Post,User Name,Comments,Sentiments
3254,2801,151,Sil Ver,kayen modem taa ooredoo connecte prix bien mac...,Neutre
3255,2802,151,آلخہآل رآمہز,بارك الله فيك,Positif
3256,2803,151,Moncef Chagour,نقدر نحول رقم سويتش إلى dima ؟,Neutre
3259,2804,151,فاروق فاروق,عرض هذا كي نطلع تقولي 50 جيقا عطوني عرض 60,Neutre
3261,2805,151,نورالدين مباركي,السلام تبقى كل شهر يعطوك 60go ومكالمات مجانيه ...,Neutre
...,...,...,...,...,...
3488,2996,160,Salah Boukherbata,شكرا ooredoo algerie علي التحديث الجديد في my ...,Positif
3491,2997,161,Noureddine Beny,ألف مبروك للفائزين والعقوبة للبقية,Positif
3492,2998,161,Håy Dër,رجعولنا الكادو خاص ب كل يوم,Neutre
3493,2999,161,Noé Noé,مبروك,Positif


## <span style="color: #28A745;">**Comment Classification**</span>

In [13]:
# Inference loop
results = {}

for idx, row in test_df.iterrows():
    comment_id = str(row['ID Comment'])
    comment_text = str(row['Comments'])

    prompt = (
        system_prompt
        + f"\n\nClassify the following customer comment. Respond using ONLY the JSON format.\n"
        + f"Comment ID: \"{comment_id}\"\n"
        + f"Comment: \"{comment_text}\"\n"
        + "\nDO NOT limit to one category if the content contains multiple concerns.\n"
        + "\nReturn the result in this exact structure:\n"
        + "{\n"
        + f"  \"{comment_id}\": {{\n"
        + "    \"categories\": [\"Category A\", \"Category B\"]\n"
        + "  }\n"
        + "}\n"
        + "\n⚠️ DO NOT COMMENT. DO NOT THINK. ONLY RETURN STRICT JSON."
    )

    try:
        output = llm(prompt, max_tokens=1000, temperature=0.0)
        raw_response = output["choices"][0]["text"].strip()

        # Extract the JSON object from the raw response
        json_match = re.search(r"\{.*\}", raw_response, re.DOTALL)
        if json_match:
            json_str = json_match.group()
            parsed = json.loads(json_str)
            results.update(parsed)
            print(json.dumps(parsed, indent=2, ensure_ascii=False))  # Print only the formatted JSON
        else:
            raise ValueError("No JSON found")

    except Exception:
        # Any failure (model call, missing or invalid JSON) falls back to "Other"
        fallback = {
            comment_id: {
                "categories": ["Other"]
            }
        }
        results.update(fallback)
        print(json.dumps(fallback, indent=2, ensure_ascii=False))

    sleep(0.5)

{
  "2801": {
    "categories": [
      "Promotions & Discounts",
      "Suggestions"
    ]
  }
}
{
  "2802": {
    "categories": [
      "Loyalty Expression"
    ]
  }
}
{
  "2803": {
    "categories": [
      "Subscription & Plan Issues"
    ]
  }
}
{
  "2804": {
    "categories": [
      "Promotions & Discounts",
      "Suggestions"
    ]
  }
}
{
  "2805": {
    "categories": [
      "Suggestions",
      "Overcharging and Stolen Credit Issues"
    ]
  }
}
{
  "2806": {
    "categories": [
      "Suggestions",
      "Competitor Plan Flexibility (Mobilis and Djezzy)"
    ]
  }
}
{
  "2807": {
    "categories": [
      "Other"
    ]
  }
}
{
  "2808": {
    "categories": [
      "Suggestions"
    ]
  }
}
{
  "2809": {
    "categories": [
      "Loyalty Expression"
    ]
  }
}
{
  "2810": {
    "categories": [
      "Suggestions",
      "Loyalty Expression"
    ]
  }
}
{
  "2811": {
    "categories": [
      "Other"
    ]
  }
}
{
  "2812": {
    "categories": [
      "Loyalty Expression"

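The greedy pattern `\{.*\}` used in the loop spans from the first `{` to the last `}` of the response, so any trailing text that contains braces makes `json.loads` fail and the comment falls back to "Other". The loop also accepts whatever key and labels the model returns. One possible hardening, sketched under assumptions: `extract_first_json_object` and `normalize_result` are hypothetical helpers, not part of the notebook, and `allowed` is shortened here for illustration.

```python
import json

def extract_first_json_object(text):
    """Return the first decodable JSON object in `text`, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever follows it
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None

def normalize_result(parsed, comment_id, allowed):
    """Keep only allowed labels under the expected ID; fall back to "Other"."""
    entry = parsed.get(comment_id) if isinstance(parsed, dict) else None
    cats = entry.get("categories") if isinstance(entry, dict) else None
    if not isinstance(cats, list):
        cats = []
    kept = [c for c in cats if isinstance(c, str) and c in allowed]
    return {comment_id: {"categories": kept or ["Other"]}}

# Hypothetical model response with chatter and stray braces around the JSON
allowed = {"Suggestions", "Loyalty Expression", "Other"}
raw = 'Sure! {"2801": {"categories": ["Suggestions", "Made Up"]}} Done {oops}'
clean = normalize_result(extract_first_json_object(raw), "2801", allowed)
no_json = normalize_result(extract_first_json_object("no json here"), "7", allowed)
print(clean)
print(no_json)
```

In the loop, `results.update(normalize_result(...))` could then replace `results.update(parsed)`, which would keep the key equal to `comment_id` even when the model echoes a different ID.
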
## <span style="color: #28A745;">**Save the results**</span>

In [14]:
# Save the results to a JSON file
json_path = "/content/Results/classified_comments.json"
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)

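Opening a file for writing raises `FileNotFoundError` when its parent directory does not exist, so the save cell depends on `/content/Results/` having been created beforehand. A minimal sketch that creates the directory on demand; `save_results` is a hypothetical helper, and the demo writes to a temporary directory rather than to the notebook's path:

```python
import json
import tempfile
from pathlib import Path

def save_results(results, json_path):
    """Write `results` as UTF-8 JSON, creating parent directories if needed."""
    path = Path(json_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    return path

# Demo in a throwaway directory; the "Results" subfolder does not exist yet
with tempfile.TemporaryDirectory() as tmp:
    out = save_results(
        {"2802": {"categories": ["Loyalty Expression"]}},
        Path(tmp) / "Results" / "classified_comments.json",
    )
    reloaded = json.loads(out.read_text(encoding="utf-8"))

print(reloaded)
```
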
In [15]:
# Read the JSON file back from disk
with open("/content/Results/classified_comments.json", "r", encoding="utf-8") as f:
    classification_dict = json.load(f)

# List of categories
category_list = [
    "Call Quality Issues",
    "Data Quality Issues (3G, 4G)",
    "Coverage Issues",
    "Response Time",
    "Agent Behavior",
    "Overcharging and Stolen Credit Issues",
    "Billing Errors",
    "Subscription & Plan Issues",
    "Hidden Charges",
    "Suggestions",
    "Promotions & Discounts",
    "Competitor Pricing & Value (Mobilis and Djezzy)",
    "Competitor Plan Flexibility (Mobilis and Djezzy)",
    "Competitor Data Consumption (Mobilis and Djezzy)",
    "Loyalty Expression",
    "Service Information Request",
    "Sarcasme",
    "Irrelevant",
    "Other"
]

# Initialize every category column to 0
for cat in category_list:
    if cat not in comments_df.columns:
        comments_df[cat] = 0

# Set the columns to 1 according to the detected categories
for comment_id_str, data in classification_dict.items():
    comment_id = int(comment_id_str)
    categories = data.get("categories", [])
    for category in categories:
        if category in category_list:
            comments_df.loc[comments_df["ID Comment"] == comment_id, category] = 1
            if category in ("Sarcasme", "Irrelevant"):  # These two labels are also folded into "Other"
                comments_df.loc[comments_df["ID Comment"] == comment_id, "Other"] = 1
# Drop "Sarcasme" and "Irrelevant" columns after the loop
comments_df = comments_df.drop(columns=["Sarcasme", "Irrelevant"])

# Save the updated DataFrame
comments_df.to_csv("/content/Results/comments_df_classified.csv", index=False)

# Show a preview
comments_df.head()

Unnamed: 0,ID Comment,ID Post,User Name,Comments,Sentiments,Call Quality Issues,"Data Quality Issues (3G, 4G)",Coverage Issues,Response Time,Agent Behavior,...,Subscription & Plan Issues,Hidden Charges,Suggestions,Promotions & Discounts,Competitor Pricing & Value (Mobilis and Djezzy),Competitor Plan Flexibility (Mobilis and Djezzy),Competitor Data Consumption (Mobilis and Djezzy),Loyalty Expression,Service Information Request,Other
1,1,1,Yanise Yanise,سلام عليكم ورحمة لديا مشكلة ! تعبئة 100 دج و ب...,Negatif,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,1,Jj Kie,كل عام و انتم بخير,Positif,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,1,Sakou Younes,كل عام و أنتم بخير,Positif,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,1,راني نعاني,كل عام و نحن بخير,Positif,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,5,1,مروان سيدهم سيدهم مروان,كل عام وأنتم بخير وأتمنا رد لماذا اسئلة لماذا ...,Positif,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0

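The fill loop silently skips any label that is not in `category_list`, so a misspelled or invented category from the model leaves no trace in the CSV. A small audit of the classification dictionary can surface such labels before encoding. This is a sketch under assumptions: `audit_labels` is a hypothetical helper, and the dictionary and label set below are toy data, not results from the notebook.

```python
from collections import Counter

def audit_labels(classification_dict, allowed):
    """Count how often each known and unknown label appears."""
    known, unknown = Counter(), Counter()
    for data in classification_dict.values():
        for cat in data.get("categories", []):
            if cat in allowed:
                known[cat] += 1
            else:
                unknown[cat] += 1
    return known, unknown

# Toy inputs for illustration only
toy_allowed = {"Suggestions", "Loyalty Expression", "Other"}
toy_classification = {
    "1": {"categories": ["Suggestions", "Loyalty Expression"]},
    "2": {"categories": ["Suggestions", "Loyalty Expresion"]},  # misspelled label
    "3": {},  # entry without a "categories" key
}

known, unknown = audit_labels(toy_classification, toy_allowed)
print(dict(known))
print(dict(unknown))
```
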

<h3 style="text-align: center; color: #E30613;"><b><i>Developed by: OUARAS Khelil Rafik</i></b></h3>