## SocialPulse

This project analyzes the performance of posts across social networks (LinkedIn) in order to optimize content strategy.
The key questions are:

- What type of post should be published, on which network, and when?
    This means identifying the formats (video, image, text, carousel, etc.), themes (educational, entertaining, promotional, etc.) and publication times (day, hour) that maximize engagement.
- How do post style and author influence affect the ideal timing and the peaks in results (views, reactions, comments)?
    This requires segmenting content styles, analyzing influencer profiles, and correlating both with engagement metrics.

Objectives:

- Build a pipeline to collect, process and analyze social media data.
- Create an interactive dashboard to visualize trends and provide actionable recommendations (e.g. "Post an educational video on LinkedIn on Tuesday at 9 a.m. to maximize views").
- Identify the factors (style, timing, network, influencer) that drive the key metrics (views, reactions, comments).

### 1- Scraping posts from LinkedIn
- <span style="color: red;">Limit the number of requests to LinkedIn pages to avoid having the account suspended</span>
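One way to enforce the request limit above is a small throttle object that every page load goes through. The sketch below is only an illustration: the class name `RequestThrottle` and the delay bounds are assumptions, not values documented by LinkedIn.

```python
import random
import time


class RequestThrottle:
    """Enforce a randomized minimum delay between consecutive page loads."""

    def __init__(self, min_delay=3.0, max_delay=6.0, sleep=time.sleep, clock=time.monotonic):
        # min_delay / max_delay are illustrative values, not LinkedIn-documented limits.
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._sleep = sleep
        self._clock = clock
        self._last_call = None

    def wait(self):
        """Sleep just long enough to respect a random delay since the last call."""
        target = random.uniform(self.min_delay, self.max_delay)
        slept = 0.0
        if self._last_call is not None:
            elapsed = self._clock() - self._last_call
            slept = max(0.0, target - elapsed)
            if slept > 0:
                self._sleep(slept)
        self._last_call = self._clock()
        return slept
```

Calling `throttle.wait()` before each `driver.get(...)` would keep the pacing logic in one place instead of spreading `time.sleep` calls through the scraper.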

In [None]:
import sys
import os
import re
from loguru import logger
import time
import random
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from datetime import datetime, timedelta
from config import LINKEDIN_USER_NAME, LINKEDIN_PWD, DATABASE_URL

Logger initialization

In [None]:
logger.remove()
logger.add("linkedin", rotation="500kb", level="WARNING")
logger.add(sys.stderr, level="INFO")

2

In [3]:
def setup_driver():
    logger.info("Configuration du driver Selenium")
    chrome_options = Options()
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument(
        "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    )
    # chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()),
        options=chrome_options
    )
    return driver

In [4]:
def login_to_linkedin(driver, username, password):
    logger.info("Tentative de connexion à LinkedIn")
    try:
        driver.get("https://www.linkedin.com/login")
        WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.ID, "username"))
        )
        time.sleep(random.uniform(2, 5))

        email_field = driver.find_element(By.ID, "username")
        email_field.send_keys(username)
        logger.info("Email saisi")

        password_field = driver.find_element(By.ID, "password")
        password_field.send_keys(password)
        logger.info("Mot de passe saisi")

        submit_button = driver.find_element(By.XPATH, "//button[@type='submit']")
        submit_button.click()
        logger.info("Bouton de connexion cliqué")

        WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.ID, "global-nav"))
        )
        logger.info("Connexion réussie")
        time.sleep(random.uniform(3, 6))
    except TimeoutException:
        logger.error("Timeout lors de la connexion à LinkedIn. Possible CAPTCHA ou erreur réseau.")
        raise
    except Exception as e:
        logger.error(f"Erreur lors de la connexion : {str(e)}")
        raise

In [5]:
def parse_relative_date(relative_date, scrape_time):
    """Convert a relative date (e.g. '5 h', '2 d') into an estimated timestamp string."""
    if "N/A" in relative_date:
        return "N/A"
    try:
        if "min" in relative_date.lower():
            minutes = int(relative_date.split()[0])
            return (scrape_time - timedelta(minutes=minutes)).strftime("%Y-%m-%d %H:%M:%S")
        if "h" in relative_date.lower():
            hours = int(relative_date.split()[0])
            return (scrape_time - timedelta(hours=hours)).strftime("%Y-%m-%d %H:%M:%S")
        elif "d" in relative_date.lower():
            days = int(relative_date.split()[0])
            return (scrape_time - timedelta(days=days)).strftime("%Y-%m-%d 12:00:00")
        elif "w" in relative_date.lower():
            weeks = int(relative_date.split()[0])
            return (scrape_time - timedelta(weeks=weeks)).strftime("%Y-%m-%d 12:00:00")
        elif "mois" in relative_date.lower():
            mois = int(relative_date.split()[0])
            return (scrape_time - timedelta(days=30 * mois)).strftime("%Y-%m-%d 12:00:00")
        else:
            return relative_date
    except Exception as e:
        logger.warning(f"Erreur lors du parsing de la date {relative_date} : {str(e)}")
        return relative_date
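The substring checks in `parse_relative_date` can misfire on labels that happen to contain another unit's letter (for example, `"h"` also occurs in `"1 month"`). A minimal sketch of a stricter alternative is shown here; the unit table is an assumption about the short forms LinkedIn displays in English and French, and `relative_to_datetime` is a hypothetical helper name.

```python
import re
from datetime import datetime, timedelta

# Unit suffixes are assumptions: English ("h", "d", "w", "mo") and French
# ("j", "sem", "mois") short forms; extend the table for other locales.
_UNITS = {
    "min": timedelta(minutes=1),
    "h": timedelta(hours=1),
    "d": timedelta(days=1),
    "j": timedelta(days=1),
    "w": timedelta(weeks=1),
    "sem": timedelta(weeks=1),
    "mo": timedelta(days=30),    # months approximated as 30 days
    "mois": timedelta(days=30),
}
_PATTERN = re.compile(r"^\s*(\d+)\s*([a-zA-Z]+)")


def relative_to_datetime(label, scrape_time):
    """Return scrape_time minus the parsed offset, or None if unparseable."""
    match = _PATTERN.match(label)
    if not match:
        return None
    count, unit = int(match.group(1)), match.group(2).lower()
    step = _UNITS.get(unit)
    return scrape_time - count * step if step else None
```

Matching the whole unit token against a lookup table avoids ambiguity, and returning `None` for unknown labels makes unparsed dates easy to count afterwards.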

In [6]:
def scrape_posts(driver, search_url, theme="feed", max_scrolls=5, posts_per_theme=100):
    logger.info(f"Scraping des posts pour l'URL : {search_url}")
    post_data = []
    scrape_time = datetime.now()
    try:
        driver.get(search_url)
        WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.CLASS_NAME, "feed-shared-update-v2"))
        )
        time.sleep(random.uniform(2, 5))

        # Scroll down to load more posts
        for _ in range(max_scrolls):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(1)
            try:
                more_button = driver.find_element(By.CLASS_NAME, 'scaffold-finite-scroll__load-button')
                more_button.click()
            except Exception:
                # The "load more" button is not always present; keep scrolling
                pass
            time.sleep(random.uniform(3, 6))
            

        # Parse the loaded page and collect the post containers
        soup = BeautifulSoup(driver.page_source, "html.parser")
        posts = soup.find_all("div", class_="feed-shared-update-v2")
        
        # Extract the fields of each post
        for i, post in enumerate(posts[:posts_per_theme]):
            try:
                # Extract the post text
                description_elem = post.find("span", class_="break-words")
                description = description_elem.text.strip() if description_elem else "N/A"

                # Extract the relative date (the part before the '•' separator)
                date_elem = post.find("span", class_="update-components-actor__sub-description")
                date_text = date_elem.text.strip() if date_elem else "N/A"
                date_text = date_text.split('•')[0]
                date = parse_relative_date(date_text, scrape_time)

                likes_elem = post.find("span", class_="social-details-social-counts__reactions-count")
                likes = likes_elem.text.strip() if likes_elem else "0"
                
                comments = "0"
                shares = "0"
                shares_comments_elems = post.select("button.social-details-social-counts__btn > span")
                for elem in shares_comments_elems:
                    text = elem.text.strip().lower()
                    if "comment" in text:
                        comments = text.split()[0].strip() or "0"
                    elif "republication" in text or "share" in text:
                        shares = text.split()[0].strip() or "0"
                
                author_elem = post.find("a", class_="update-components-actor__meta-link")
                # Default both fields so a missing author block cannot raise a NameError
                author, author_link = "N/A", "N/A"
                if author_elem:
                    author_link = author_elem.get("href", "N/A")
                    sub_span = author_elem.find("span", dir="ltr")
                    if sub_span:
                        name_span = sub_span.find("span", {"aria-hidden": "true"})
                        author = name_span.text.strip() if name_span else sub_span.text.strip()
                    else:
                        author = author_elem.text.strip()
                logger.debug(f"Auteur extrait : {author}: {author_link}")

                post_data.append({
                    "author": author,
                    "author_link": author_link,
                    "text": description,
                    "date": date,
                    "likes": likes,
                    "comments": comments,
                    "shares": shares,
                    "theme": theme,
                })
                logger.info(f"Post scrapé : - Auteur : {author}")
                time.sleep(random.uniform(1, 2))
            except Exception as e:
                logger.warning(f"Erreur lors du scraping du post : {str(e)}")
                continue

        return post_data
    except TimeoutException:
        logger.error("Timeout lors du chargement des posts.")
        return post_data
    except Exception as e:
        logger.error(f"Erreur lors du scraping des posts : {str(e)}")
        return post_data

In [13]:
def main(driver, themes, target_total_posts, path, sub_path):
    posts_per_theme = target_total_posts // len(themes)
    all_posts = []
    os.makedirs(path, exist_ok=True)
    os.makedirs(sub_path, exist_ok=True)

    try:
        for theme in themes:
            logger.info(f"Début du scraping pour la thématique : {theme}")
            search_url = f"https://www.linkedin.com/search/results/content/?keywords={theme.replace(' ', '%20')}"
            posts = scrape_posts(driver, search_url, theme=theme, max_scrolls=5, posts_per_theme=posts_per_theme)
            sub_df = pd.DataFrame(posts)
            # Single quotes inside the f-string keep this valid before Python 3.12
            sub_df.to_csv(f"{sub_path}/{theme}_{datetime.now().strftime('%H_%M_%d_%m_%Y')}.csv")
            all_posts.extend(posts)
            logger.info(f"Posts scrapés pour {theme} : {len(posts)}")
            time.sleep(random.uniform(5, 10))

        if all_posts:
            df = pd.DataFrame(all_posts)
            filename = f"{path}/linkedin_posts_{datetime.now().strftime('%H_%M_%d_%m_%Y')}.csv"
            df.to_csv(filename, index=False, encoding="utf-8")
            logger.info(f"Données sauvegardées dans {filename} ({len(all_posts)} posts)")
        else:
            logger.warning("Aucune donnée scrapée")

    except Exception as e:
        logger.error(f"Erreur dans le programme principal : {str(e)}")
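The search URL in `main` only escapes spaces. A theme containing `&`, accents or other reserved characters would need full percent-encoding, which the standard library can do. The sketch below is an assumption about how this could be factored out; `build_search_url` is a hypothetical helper name.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.linkedin.com/search/results/content/"


def build_search_url(theme):
    """Return the content-search URL for a theme, percent-encoding the keyword."""
    # quote_via=quote encodes spaces as %20 (the default, quote_plus, would use '+').
    return f"{BASE_URL}?{urlencode({'keywords': theme}, quote_via=quote)}"


print(build_search_url("Digital Transformation"))
print(build_search_url("R&D"))
```

With this helper, `"Digital Transformation"` is encoded with `%20` and the `&` in `"R&D"` becomes `%26`, so it is not mistaken for a parameter separator.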
    

In [None]:
def get_subscribers(driver, link):
    """
        Extract the follower count of a LinkedIn profile from its URL.
        
        Args:
            driver: Selenium WebDriver instance.
            link (str): LinkedIn profile URL (author_link).
            
        Returns:
            int: Follower count (e.g. 9517), or 0 on failure.
    """
    try:
        driver.get(link)
        
        WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.CLASS_NAME, "pvs-entity__caption-wrapper"))
        )
        time.sleep(random.uniform(2, 4))
        
        soup = BeautifulSoup(driver.page_source, "html.parser")
        subscribers_element = soup.find('span', class_="pvs-entity__caption-wrapper")
        
        if subscribers_element:
            text = subscribers_element.text.strip()
            
            # Strip narrow and regular no-break spaces used as thousands separators
            text = text.replace('\u202f', '').replace('\u00a0', '')
            
            # Extract the number that precedes the follower label
            match = re.search(r'(\d[\d\s,]*\d|\d+)(?=\s*(abonné|abonnés|follower|followers))', text)
            if match:
                number_str = match.group(1).replace(' ', '').replace(',', '')
                subscribers = int(number_str)
                logger.info(f"Nombre d'abonnés extrait : {subscribers}")
                return subscribers
            else:
                # Always return an int so the later astype(int) cannot fail
                logger.warning(f"Nombre d'abonnés introuvable dans : {text!r}")
                return 0
        else:
            return 0
            
    except TimeoutException:
        logger.error(f"Timeout lors du chargement du profil {link}")
        return 0
    except Exception as e:
        logger.error(f"Erreur lors de la récupération des abonnés pour {link} : {str(e)}")
        return 0
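The number extraction inside `get_subscribers` can be isolated in a pure function, which makes it testable without a browser. This is a sketch under assumptions: the caption is expected to look like `"9 517 abonnés"` or `"12,345 followers"`, dots and commas are treated as thousands separators, and abbreviated values such as `"1.2K"` are not handled. `parse_follower_count` is a hypothetical name.

```python
import re

# \s also matches the no-break spaces (U+00A0, U+202F) used as separators.
_FOLLOWERS_RE = re.compile(r"(\d[\d\s.,]*)\s*(?:abonnés?|followers?)", re.IGNORECASE)


def parse_follower_count(text):
    """Return the follower count found in text as an int, or 0 if none is found."""
    match = _FOLLOWERS_RE.search(text)
    if not match:
        return 0
    digits = re.sub(r"\D", "", match.group(1))
    return int(digits) if digits else 0
```

Because the function always returns an `int`, the resulting dictionary can be mapped onto the DataFrame and cast without further checks.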

In [9]:
driver = setup_driver()

2025-04-25 23:20:29.899 | INFO     | __main__:setup_driver:2 - Configuration du driver Selenium


In [11]:
login_to_linkedin(driver, LINKEDIN_USER_NAME, LINKEDIN_PWD+".")

2025-04-25 23:26:35.820 | INFO     | __main__:login_to_linkedin:2 - Tentative de connexion à LinkedIn
2025-04-25 23:26:44.880 | INFO     | __main__:login_to_linkedin:12 - Email saisi
2025-04-25 23:26:45.353 | INFO     | __main__:login_to_linkedin:16 - Mot de passe saisi
2025-04-25 23:27:23.891 | INFO     | __main__:login_to_linkedin:20 - Bouton de connexion cliqué
2025-04-25 23:27:24.107 | INFO     | __main__:login_to_linkedin:25 - Connexion réussie


In [None]:
themes = [
    "IA",
    "DataScience",
    "Innovation",
    "finance",
    "projet",
    "Technology",
    "hackathon",
    "sport",
    "Leadership",
    "HumanResources",
    "DigitalTransformation",
    "tutoriel",
    "education"
]

target_total_posts = 100
path = "data3Fre/linkedin"
sub_path = "subthemeFre"

main(driver, themes, target_total_posts, path, sub_path)

2025-04-25 23:29:51.062 | INFO     | __main__:main:9 - Début du scraping pour la thématique : IA
2025-04-25 23:29:51.064 | INFO     | __main__:scrape_posts:2 - Scraping des posts pour l'URL : https://www.linkedin.com/search/results/content/?keywords=IA
2025-04-25 23:30:43.259 | INFO     | __main__:scrape_posts:77 - Post scrapé : - Auteur : WENDKUNI ROLAND SAWADOGO
2025-04-25 23:30:44.522 | INFO     | __main__:scrape_posts:77 - Post scrapé : - Auteur : Philippe Gautier
2025-04-25 23:30:46.144 | INFO     | __main__:scrape_posts:77 - Post scrapé : - Auteur : Dr Souad Najoua Lagmiri
2025-04-25 23:30:48.055 | INFO     | __main__:scrape_posts:77 - Post scrapé : - Auteur : Cristi PITNER

In [19]:
filename = "data3Fre/linkedin/linkedin_posts_23_37_25_04_2025.csv"
dfp = pd.read_csv(filename)

In [None]:
followers_dict = {}

for i, link in enumerate(dfp["author_link"].unique()):
    time.sleep(5)
    if pd.notna(link) and link != "N/A":
        followers = get_subscribers(driver, link)
        followers_dict[link] = followers

    else:
        followers_dict[link] = 0
    
    if i % 5 == 0:
        time.sleep(10)
    else:
        time.sleep(random.uniform(3, 6))
    
# Add the followers column
dfp["followers"] = dfp["author_link"].map(followers_dict).fillna(0).astype(int)

In [34]:
dfp.describe(include='all').transpose()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
author,1210.0,1092.0,Benjamin Ejzenberg,10.0,,,,,,,
author_link,1210.0,1092.0,https://www.linkedin.com/in/benjaminejzenberg?...,10.0,,,,,,,
text,1180.0,1146.0,Infosys Job Openings March 2025Salary: Rs. 8-3...,7.0,,,,,,,
date,1210.0,132.0,2025-04-25 11:04:58,68.0,,,,,,,
likes,1210.0,198.0,0,109.0,,,,,,,
comments,1210.0,,,,11.369421,22.826129,0.0,0.0,2.0,14.0,289.0
shares,1210.0,,,,2.040496,5.746291,0.0,0.0,0.0,2.0,83.0
theme,1210.0,8.0,IA,174.0,,,,,,,
followers,1210.0,,,,1277.209917,10755.489499,0.0,0.0,0.0,0.0,295634.0


### 2- Loading the scraped data into a PostgreSQL database on Neon

In [3]:
import pandas as pd
from sqlalchemy import create_engine
from loguru import logger
from config import DATABASE_URL
import sys

In [5]:
logger.remove()
logger.add("logger/dataload", rotation="500kb", level="WARNING")
logger.add(sys.stderr, level="INFO")

5

In [6]:
TABLE_NAME = "linkedin_posts"

In [7]:
data = pd.read_csv("linkedin_post.csv", sep=";")
df = data.copy()
df.head()

Unnamed: 0,author,text,date,likes,comments,shares,theme,followers
0,David TIFFENEAU-GAUTIER,𝐈𝐨𝐓 𝐒𝐭𝐫𝐞𝐚𝐦𝐢𝐧𝐠 + 𝐑𝐞𝐚𝐥-𝐓𝐢𝐦𝐞 𝐀𝐧𝐚𝐥𝐲𝐭𝐢𝐜𝐬 𝐬𝐮𝐫 𝐀𝐳𝐮𝐫𝐞 ...,2025-04-25 10:37:08,2,0,0,IA,9522
1,Diana Bragă,"No, vedeți voi - it's a funny time to be woman...",2025-04-24 19:37:08,25,21,0,IA,15414
2,Mélanie VILLANOVA,"➡️ L’IA, l’alliée incontournable pour affiner ...",2025-04-24 21:37:08,5,0,0,IA,471
3,Bejaoui Sabrine,💥 Dites adieu à la saisie manuelle ! La révolu...,2025-04-24 18:37:08,6,0,0,IA,780
4,Felipe Carpio,RECEPCIONISTA IA PARA CLÍNICAS DENTALES😱,2025-04-24 23:37:08,1,1,0,IA,311


##### Convert the 'comments', 'shares', 'likes' and 'followers' columns to a numeric type

In [None]:
for col in ['comments', 'shares', 'likes', 'followers']:
    df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0).astype(int)

df["date"] = pd.to_datetime(df["date"], errors="coerce")
df["text"] = df["text"].fillna("N/A")
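With `errors="coerce"`, a scraped counter that contains a thousands separator (for example `"1 234"` or `"1,234"`) is turned into a missing value and then replaced by 0. If the scraped columns do contain such values, which is an assumption to verify on the raw CSV, a normalization step before the numeric conversion would preserve them. `to_int_count` below is a hypothetical helper.

```python
import math
import re

_COUNT_RE = re.compile(r"\d[\d\s.,]*")


def to_int_count(raw):
    """Convert a scraped counter such as '1 234' or '1,234' to an int (0 if unusable)."""
    if isinstance(raw, (int, float)):
        # Already numeric: keep finite values, map NaN/inf to 0.
        return int(raw) if math.isfinite(raw) else 0
    text = str(raw).strip()
    # Reject anything that is not digits plus separators (e.g. 'N/A', '1.2K').
    if not _COUNT_RE.fullmatch(text):
        return 0
    return int(re.sub(r"\D", "", text))
```

It could be applied column by column, for instance with `df[col].map(to_int_count)`, before or instead of the `pd.to_numeric` call. Abbreviated values such as `"1.2K"` are deliberately mapped to 0 here rather than guessed.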

In [37]:
engine = create_engine(DATABASE_URL)

In [9]:
try:
    logger.info("Connexion à la data base")
    engine = create_engine(DATABASE_URL)
    logger.info(f"Sauvegarde des info dans la table '{TABLE_NAME}'")
    df.to_sql(TABLE_NAME, engine, if_exists="append", index=False)
    logger.info("Sauvegarde terminé avec succèss")
except Exception as e:
    logger.error(f"Erreur lors du sauvegarde des info dans la db: {e}")

2025-04-27 00:11:06.231 | INFO     | __main__:<module>:2 - Connexion à la data base
2025-04-27 00:11:06.864 | INFO     | __main__:<module>:4 - Sauvegarde des info dans la table 'linkedin_posts'
2025-04-27 00:11:18.703 | INFO     | __main__:<module>:6 - Sauvegarde terminé avec succèss


### 3- Extracting the data from the database
- Saved in Parquet format with Polars

In [2]:
import polars as pl
from sqlalchemy import create_engine
from loguru import logger
import sys
from config import DATABASE_URL
pl.Config.set_tbl_rows(-1);

In [None]:
logger.remove()
logger.add("logger/extrac_data", rotation="500kb", level="WARNING")
logger.add(sys.stderr, level="INFO")
TABLE_NAME = 'linkedin_posts'

In [15]:
logger.info("Connection à la base de données distant")
engine = create_engine(DATABASE_URL)

2025-04-27 00:20:11.793 | INFO     | __main__:<module>:1 - Connection à la base de données distant


In [18]:
query = f"SELECT * FROM {TABLE_NAME}"
df = pl.read_database(query=query, connection=engine)
df.write_parquet("data/linkedin_posts.parquet")

### 4- Transforming the extracted data
- KPI computation with Polars

In [84]:
def selection(df):
    df = df.with_columns(
        pl.when(pl.col("theme").is_in(["DataScience", "DataScienceInnovation"]))
        .then(pl.lit("IA"))
        .when(pl.col("theme").is_in(["Innovation", "Devoloppement"]))
        .then(pl.lit("Technology"))
        .otherwise(pl.col("theme"))
        .alias("theme")
    )

    # Randomly keep 220 'IA' posts and leave the other themes untouched so that 'IA' does not skew the statistics
    df_ia = df.filter(pl.col("theme") == "IA")
    num_to_keep = min(220, df_ia.height)
    df_ia_sampled = df_ia.sample(n=num_to_keep, with_replacement=False)
    df_other = df.filter(pl.col("theme") != "IA")

    # Merge the randomly sampled 'IA' posts with the remaining posts
    df = pl.concat([df_ia_sampled, df_other])


    df = df.filter(
        pl.col("theme") != "hackathonsport"
    )
    return df

In [140]:
import re
import pandas as pd

In [167]:
df = pl.read_parquet("data/linkedin_posts.parquet")

In [153]:
df.schema, df.shape

(Schema([('id', Int64),
         ('author', String),
         ('text', String),
         ('date', Datetime(time_unit='us', time_zone=None)),
         ('likes', Int64),
         ('comments', Int64),
         ('shares', Int64),
         ('theme', String),
         ('followers', Int64)]),
 (1842, 9))

In [116]:
# Conversion
df = df.with_columns(
    pl.col("likes").cast(pl.Int64, strict=False).fill_null(0),
    pl.col("comments").cast(pl.Int64, strict=False).fill_null(0),
    pl.col("shares").cast(pl.Int64, strict=False).fill_null(0),
    pl.col("followers").cast(pl.Int64, strict=False).fill_null(0)
)

In [164]:
def clean_feature(df):
    try:
        df = df.with_columns(
            pl.when(pl.col("theme").is_in(["DataScience", "DataScienceInnovation"]))
            .then(pl.lit("IA"))
            .when(pl.col("theme").is_in(["Innovation", "Devoloppement"]))
            .then(pl.lit("Technology"))
            .otherwise(pl.col("theme"))
            .alias("theme")
        )

        # Randomly keep 220 'IA' posts and leave the other themes untouched so that 'IA' does not skew the statistics
        logger.info("# Conserons aléatoirement 220 posts IA et laissons les autres pour ne pas tirer les stats dans le sens de 'IA'")
        df_ia = df.filter(pl.col("theme") == "IA")
        num_to_keep = min(220, df_ia.height)
        df_ia_sampled = df_ia.sample(n=num_to_keep, with_replacement=False)
        df_other = df.filter(pl.col("theme") != "IA")

        # Merge the randomly sampled 'IA' posts with the remaining posts
        df = pl.concat([df_ia_sampled, df_other])


        df = df.filter(
            pl.col("theme") != "hackathonsport"
        )
        
        df = df.sort("id")
    except Exception as e:
        logger.error(f"Erreur lors de l'exécution de 'clean_feature()': {e}")
    
    return df

In [165]:
def extract_hashtags(text):
    return " ".join(re.findall(r"#\w+", text))

def nbr_hashtags(text):
    # Count the hashtags directly instead of counting '#' characters in a joined string
    return len(re.findall(r"#\w+", text))

def feature_calculate(df):
    try:
        #a. Compute engagement
        logger.info("a. Calculer l'engagement")
        df = df.with_columns(
            (pl.col("likes") + pl.col("comments") + pl.col("shares")).alias("engagement_total"),
        )
        
        #b. Text length
        logger.info("b. Longueur du texte")
        df = df.with_columns(
            pl.col("text").str.split(" ").list.len().alias("text_length")
        )
        
        #c. Temporal analysis
        logger.info("c. Analyse temporelle")
        df = df.with_columns(
            pl.col("date").dt.weekday().alias("day_of_week"),
            pl.col("date").dt.hour().alias("hour")
        )
        
        #d. Sentiment analysis
        
        #e. Hashtags
        logger.info("e. Hashtags")
        df = df.with_columns(
            pl.col("text").map_elements(extract_hashtags, return_dtype=pl.String).alias("hashtags"),
            pl.col("text").map_elements(nbr_hashtags, return_dtype=pl.Int64).alias("nbr_hashtags")
        )
    
        #f. Viral posts
        logger.info("f. Post viraux")
        viral_threshold = df["shares"].quantile(0.9)
        df = df.with_columns(
            (pl.col("shares") >= viral_threshold).alias("is_viral")
        )
        
        #g. High ("Fort") / low ("Faible") engagement
        logger.info("g. Fort/Faible engagement")
        engagement_threshold = df["engagement_total"].quantile(0.90)
        df = df.with_columns(
            pl.when(
                pl.col("engagement_total") >= engagement_threshold
            ).then(pl.lit("Fort")
            ).otherwise(pl.lit("Faible")
            ).alias("engagement_category")
        )
        
        df = df.sort("id")
    except Exception as e:
        logger.error(f"Erreur lors de l'exécution de 'feature_calculate()': {e}")
    
    return df
    

In [166]:
df["theme"].value_counts().sort(by="count", descending=True)

theme,count
str,u32
"""Leadership""",299
"""Technology""",232
"""IA""",220
"""projet""",190
"""finance""",189
"""education""",174
"""tutoriel""",173
"""WorkplaceCulture""",126
"""HumanResources""",120
"""DigitalTransformation""",119


a. Compute engagement

In [89]:
df = df.with_columns(
    (pl.col("likes") + pl.col("comments") + pl.col("shares")).alias("engagement_total"),
)

b. Text length

In [7]:
df = df.with_columns(
    pl.col("text").str.split(" ").list.len().alias("text_length")
)
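The word count above splits on a single space. In plain Python, the same kind of split counts empty strings for repeated spaces and does not separate words joined by a newline; Polars' `str.split(" ")` is expected to behave similarly, though that should be checked on the real data. The short sketch below illustrates the difference with a made-up sample.

```python
def word_count(text):
    """Count whitespace-separated tokens, ignoring repeated spaces and newlines."""
    return len(text.split())


sample = "Hello   world\nnew line"    # made-up text, not project data
print(len(sample.split(" ")))          # splitting on a single space
print(word_count(sample))              # splitting on any run of whitespace
```

Since LinkedIn posts often contain line breaks and emoji bullets, the single-space count should be read as an approximation of the number of words.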

c. Temporal analysis

In [8]:
df = df.with_columns(
    pl.col("date").dt.weekday().alias("day_of_week"),
    pl.col("date").dt.hour().alias("hour")
)
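Polars' `dt.weekday()` returns the ISO weekday number (Monday = 1, Sunday = 7), the same convention as `datetime.isoweekday()` in the standard library. A small lookup, sketched below with a hypothetical `day_label` helper, makes the `day_of_week` column easier to read in a dashboard.

```python
from datetime import datetime

# ISO weekday numbering: Monday = 1 ... Sunday = 7
DAY_NAMES = {1: "Monday", 2: "Tuesday", 3: "Wednesday", 4: "Thursday",
             5: "Friday", 6: "Saturday", 7: "Sunday"}


def day_label(timestamp):
    """Return the English day name for a datetime, using ISO weekday numbers."""
    return DAY_NAMES[timestamp.isoweekday()]


print(day_label(datetime(2025, 4, 24, 19, 37, 8)))
```

The timestamp used here appears in the scraped data, where it is stored with `day_of_week = 4`, i.e. a Thursday.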

d. Sentiment

In [88]:
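from textblob import TextBlob

def get_sentiment(text):
    # Polarity ranges from -1.0 (negative) to 1.0 (positive)
    return TextBlob(text).sentiment.polarity

In [None]:
df = df.with_columns(
    pl.col("text").map_elements(get_sentiment, return_dtype=pl.Float64).alias("sentiment")
)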

e. Hashtags

In [9]:
def extract_hashtags(text):
    return " ".join(re.findall(r"#\w+", text))

def nbr_hashtags(text):
    # Count the hashtags directly instead of counting '#' characters in a joined string
    return len(re.findall(r"#\w+", text))

In [10]:
df = df.with_columns(
    pl.col("text").map_elements(extract_hashtags, return_dtype=pl.String).alias("hashtags"),
    pl.col("text").map_elements(nbr_hashtags, return_dtype=pl.Int64).alias("nbr_hashtags")
)
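In the scraped text, LinkedIn prefixes each tag with the word "hashtag" (for example `hashtag#sănătate`). The pattern `#\w+` still isolates the tag itself, including accented characters, because `\w` is Unicode-aware in Python 3. A minimal check, using a made-up sentence and a hypothetical `hashtags_of` helper:

```python
import re

HASHTAG_RE = re.compile(r"#\w+")


def hashtags_of(text):
    """Return the list of hashtags found in text, in order of appearance."""
    return HASHTAG_RE.findall(text)


tags = hashtags_of("hashtag#sănătate hashtag#IA et #DataScience!")
print(tags, len(tags))
```

The length of the returned list is the hashtag count, so one function can feed both the `hashtags` and `nbr_hashtags` columns.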

f. Viral posts

In [11]:
viral_threshold = df["shares"].quantile(0.9)
df = df.with_columns(
    (pl.col("shares") >= viral_threshold).alias("is_viral")
)
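One caveat of a `>=` comparison against a quantile: if at least 90% of the posts have zero shares, the threshold is 0 and every post is flagged as viral. The sketch below illustrates this with a simple nearest-rank quantile, which may differ slightly from the interpolation Polars applies, and with made-up numbers. The `floor` parameter is an assumption, not part of the original pipeline.

```python
import math


def nearest_rank_quantile(values, q):
    """Return the value at the nearest-rank position for quantile q (0 < q <= 1)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def flag_viral(values, q=0.9, floor=1):
    """Flag values at or above the q-quantile, never letting the threshold drop below floor."""
    threshold = max(nearest_rank_quantile(values, q), floor)
    return [v >= threshold for v in values]


mostly_zero = [0] * 19 + [5]                       # made-up sample, not project data
print(nearest_rank_quantile(mostly_zero, 0.9))     # raw threshold
print(sum(flag_viral(mostly_zero)))                # posts flagged once the floor is applied
```

In the data shown in this notebook, about 10% of posts are flagged, so the issue does not seem to occur here, but the guard is cheap if the share distribution changes.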

In [168]:
df = clean_feature(df)
df = feature_calculate(df)
df.write_parquet("data/transformed_date.parquet")

2025-04-28 18:52:39.921 | INFO     | __main__:clean_feature:13 - # Conserons aléatoirement 220 posts IA et laissons les autres pour ne pas tirer les stats dans le sens de 'IA'
2025-04-28 18:52:40.381 | INFO     | __main__:feature_calculate:10 - a. Calculer l'engagement
2025-04-28 18:52:40.389 | INFO     | __main__:feature_calculate:16 - b. Longueur du texte
2025-04-28 18:52:40.443 | INFO     | __main__:feature_calculate:22 - c. Analyse temporelle
2025-04-28 18:52:40.647 | INFO     | __main__:feature_calculate:31 - e. Hashtags
2025-04-28 18:52:40.845 | INFO     | __main__:feature_calculate:38 - f. Post viraux
2025-04-28 18:52:40.874 | INFO     | __main__:

# Analysis

- Compare the text length (text_length) of high-engagement posts (high engagement_total or is_viral=True) with that of low-engagement posts

In [180]:
kpi_folder = "Kpis"

In [169]:
df = pl.read_parquet("data/transformed_date.parquet")

In [171]:
df.schema, df.shape

(Schema([('id', Int64),
         ('author', String),
         ('text', String),
         ('date', Datetime(time_unit='us', time_zone=None)),
         ('likes', Int64),
         ('comments', Int64),
         ('shares', Int64),
         ('theme', String),
         ('followers', Int64),
         ('engagement_total', Int64),
         ('text_length', UInt32),
         ('day_of_week', Int8),
         ('hour', Int8),
         ('hashtags', String),
         ('nbr_hashtags', Int64),
         ('is_viral', Boolean),
         ('engagement_category', String)]),
 (1842, 17))

In [174]:
df.head()

id,author,text,date,likes,comments,shares,theme,followers,engagement_total,text_length,day_of_week,hour,hashtags,nbr_hashtags,is_viral,engagement_category
i64,str,str,datetime[μs],i64,i64,i64,str,i64,i64,u32,i8,i8,str,i64,bool,str
2,"""Diana Bragă""","""No, vedeți voi - it's a funny …",2025-04-24 19:37:08,25,21,0,"""IA""",15414,46,318,4,19,"""#nufitiFlavius #Strabag""",2,False,"""Faible"""
3,"""Mélanie VILLANOVA""","""➡️ L’IA, l’alliée incontournab…",2025-04-24 21:37:08,5,0,0,"""IA""",471,5,43,4,21,"""#SantéIntégrative #SantéParLaV…",15,False,"""Faible"""
6,"""Cafenea""","""hashtag#sănătate hashtag#sanat…",2025-04-25 04:37:08,0,0,0,"""IA""",0,0,6,5,4,"""#sănătate #sanatate #biodispon…",5,False,"""Faible"""
9,"""Andréia Vital""","""Solinftechttps://lnkd.in/d4nJj…",2025-04-25 03:37:08,0,0,0,"""IA""",15929,0,1,5,3,"""""",0,False,"""Faible"""
10,"""24auto.ro""","""Clasicul volan și direcția mec…",2025-04-25 08:37:08,0,0,0,"""IA""",0,0,17,5,8,"""#steerbywire #directie #Merced…",4,False,"""Faible"""


In [175]:
df.describe()

statistic,id,author,text,date,likes,comments,shares,theme,followers,engagement_total,text_length,day_of_week,hour,hashtags,nbr_hashtags,is_viral,engagement_category
str,f64,str,str,str,f64,f64,f64,str,f64,f64,f64,f64,f64,str,f64,f64,str
"""count""",1842.0,"""1842""","""1842""","""1750""",1842.0,1842.0,1842.0,"""1842""",1842.0,1842.0,1842.0,1750.0,1750.0,"""1842""",1842.0,1842.0,"""1842"""
"""null_count""",0.0,"""0""","""0""","""92""",0.0,0.0,0.0,"""0""",0.0,0.0,0.0,92.0,92.0,"""0""",0.0,0.0,"""0"""
"""mean""",1134.421824,,,"""2025-04-04 13:42:37.584571""",38.34962,12.48317,7.798046,,1026.053203,58.630836,169.39848,4.321143,12.600571,,7.172096,0.103692,
"""std""",572.978898,,,,70.717072,22.681269,22.114374,,14802.41989,96.756932,109.168253,1.084618,3.995981,,13.653436,,
"""min""",2.0,"""10000 CODEURS""","""""A salary increase makes you h…","""2024-05-29 12:00:00""",0.0,0.0,0.0,"""DigitalTransformation""",0.0,0.0,1.0,1.0,0.0,"""""",0.0,0.0,"""Faible"""
"""25%""",652.0,,,"""2025-04-24 11:12:52""",3.0,0.0,0.0,,0.0,5.0,89.0,4.0,10.0,,0.0,,
"""50%""",1114.0,,,"""2025-04-24 18:10:57""",16.0,4.0,1.0,,0.0,25.0,157.0,4.0,12.0,,4.0,,
"""75%""",1638.0,,,"""2025-04-25 10:17:51""",43.0,16.0,6.0,,52.0,71.0,234.0,5.0,16.0,,8.0,,
"""max""",2115.0,"""🧩 Amanda A. Russo""","""🪄 Vous souhaitez impressionner…","""2025-04-25 23:20:51""",834.0,261.0,346.0,"""tutoriel""",586005.0,879.0,524.0,7.0,23.0,"""#𝑩𝒖𝒔𝒊𝒏𝒆𝒔𝒔𝑺𝒑𝒊𝒓𝒊𝒕 #pleinelune #n…",173.0,1.0,"""Fort"""


In [176]:
import plotly.express as px
import plotly.graph_objects as go

In [177]:
engagement_threshold = df["engagement_total"].quantile(0.90)
df_engagement = df.with_columns(
    pl.when(
        pl.col("engagement_total") >= engagement_threshold
    ).then(pl.lit("Fort")
    ).otherwise(pl.lit("Faible")
    ).alias("engagement_category")
)

# text_length statistics per engagement category
stats = df_engagement.group_by("engagement_category").agg(
    mean_text_length=pl.col("text_length").mean(),
    median_text_length=pl.col("text_length").median(),
    std_text_length=pl.col("text_length").std(),
    count=pl.col("text_length").count()
)
stats

engagement_category,mean_text_length,median_text_length,std_text_length,count
str,f64,f64,f64,u32
"""Faible""",164.054348,149.0,107.93967,1656
"""Fort""",216.978495,204.0,108.796063,186


In [181]:
engagement_category_proportion = stats
engagement_category_proportion[["engagement_category", "mean_text_length"]].write_parquet(f"{kpi_folder}/engagement_category_proportion.parquet")

- High-engagement posts are longer on average (about 217 words) than low-engagement posts (about 164 words). This suggests that more substantial content, such as educational, narrative or detailed posts, holds attention better on LinkedIn.
- The gap of ~53 words (217 vs 164) suggests that LinkedIn users value posts that offer more context or value (e.g. technical explanations about AI, storytelling).
- The large standard deviation in both groups (~108 words) shows that length alone does not guarantee engagement; other factors play a role.
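To get a rough idea of whether the gap between the two groups is larger than sampling noise, Welch's t statistic can be computed directly from the summary table above. This is only a sketch: word counts are not normally distributed, and the "Fort" group is defined by a quantile of the same dataset, so the result is an indication rather than a formal test.

```python
import math


def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic for two independent samples with unequal variances."""
    standard_error = math.sqrt(std_a**2 / n_a + std_b**2 / n_b)
    return (mean_a - mean_b) / standard_error


# Summary statistics copied from the table above ("Fort" vs "Faible").
t = welch_t(216.978495, 108.796063, 186, 164.054348, 107.93967, 1656)
print(round(t, 2))
```

A value well above 2 would support the reading that high-engagement posts are longer on average, while the overlapping standard deviations still rule out length as a sufficient condition.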

In [17]:
# text_length statistics for viral vs non-viral posts
viral_stats = df.group_by("is_viral").agg(
    mean_text_length=pl.col("text_length").mean(),
    median_text_length=pl.col("text_length").median(),
    std_text_length=pl.col("text_length").std(),
    count=pl.col("text_length").count(),
    mean_followers=pl.col("followers").mean()
)
viral_stats

is_viral,mean_text_length,median_text_length,std_text_length,count,mean_followers
bool,f64,f64,f64,u32,f64
False,163.471331,149.0,110.186243,1901,1107.384534
True,198.257009,182.0,103.552415,214,4154.621495


In [183]:
viral_stats[["is_viral", "mean_text_length"]].write_parquet(f"{kpi_folder}/viral_categoy_proportion.parquet")

- LinkedIn insight: "Viral (heavily shared) posts average 198 words, about 21% longer than non-viral posts. Aim for roughly 200 words to improve your chances of going viral."

- Compared with the medians for overall engagement (204 for Fort vs 149 for Faible), the median length of viral posts (182) is slightly lower than that of high-engagement posts, which suggests that virality favors posts that are a little more concise than those maximizing total engagement.

- Authors of viral posts have on average about 3.75 times more followers (4154 vs 1107) than authors of non-viral posts. This suggests that network size plays a key role in virality: more followers means a larger initial reach, which favors shares.

- However, these are averages: viral posts can also come from authors with few followers, in which case content quality may compensate for a small network.
- The large gap (about 3000 followers) shows that virality is often amplified by influential accounts, but not exclusively.
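The ratios quoted above can be recomputed directly from the `viral_stats` output. This is a small worked example using only the printed means; the variable names are illustrative.

```python
# Means copied from the `viral_stats` output above.
viral = {"mean_text_length": 198.257009, "mean_followers": 4154.621495}
non_viral = {"mean_text_length": 163.471331, "mean_followers": 1107.384534}

# Relative length gain of viral posts over non-viral posts.
length_gain = viral["mean_text_length"] / non_viral["mean_text_length"] - 1

# How many times more followers viral authors have, and the absolute gap.
followers_ratio = viral["mean_followers"] / non_viral["mean_followers"]
followers_gap = viral["mean_followers"] - non_viral["mean_followers"]

print(f"length gain: {length_gain:.1%}")
print(f"followers ratio: {followers_ratio:.2f}x, gap: {followers_gap:.0f}")
```

Keeping this arithmetic next to the table makes it easier to keep the prose in sync when the dataset is refreshed.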

In [184]:
# Créer un graphique Plotly (boîte à moustaches)
fig = px.box(
    df_engagement.to_pandas(),
    x="engagement_category",
    y="text_length",
    color="engagement_category",
    title="Longueur du texte par catégorie d'engagement",
    labels={"text_length": "Longueur du texte (mots)", "engagement_category": "Catégorie d'engagement"},
    points="outliers"
)
fig.show()

In [200]:
# Filtrer les posts à fort engagement
fort_posts = df_engagement.filter(pl.col("engagement_category") == "Fort")

# Créer le graphique
fig = px.scatter(
    fort_posts.to_pandas(),
    x="followers",
    y="likes",
    title="Likes en fonction du nombre de followers (posts à fort engagement)",
    labels={"followers": "Nombre de followers", "likes": "Likes"},
    # trendline="ols"  # optional: adds a linear-regression trend line (requires statsmodels)
)

fig.show()


In [186]:
# Ajouter une boîte pour les posts viraux
fig_viral = px.box(
    df.to_pandas(),
    x="is_viral",
    y="text_length",
    color="is_viral",
    title="Longueur du texte par viralité",
    labels={"text_length": "Longueur du texte (mots)", "is_viral": "Viralité"}
)
fig_viral.show()

- **Hashtags**:
  - Analyze `nbr_hashtags` and `hashtags` to see whether more hashtags increase `engagement_total` or `shares`.
  - Identify the hashtags used most often in viral posts.

In [212]:
hashtag_impact = df.group_by("nbr_hashtags").agg(
    mean_engagement_total=pl.col("engagement_total").mean(),
    median_engagement_total=pl.col("engagement_total").median(),
    mean_shares=pl.col("shares").mean(),
    median_shares=pl.col("shares").median()
).sort("nbr_hashtags")

correlation_engagement = df.select(
    pl.corr("nbr_hashtags", "engagement_total").alias("corr_nbr_hashtags_engagement")
)
correlation_shares = df.select(
    pl.corr("nbr_hashtags", "shares").alias("corr_nbr_hashtags_shares")
)
correlation_shares_followers = df.select(
    pl.corr("followers", "shares").alias("corr_nbr_followers_shares")
)
correlation_engagement_followers = df.select(
    pl.corr("followers", "engagement_total").alias("corr_nbr_followers_engagement_total")
)

In [190]:
hashtag_impact.head(10).write_parquet(f"{kpi_folder}/top_hashtag_impact.parquet")
hashtag_impact.head(10)

nbr_hashtags,mean_engagement_total,median_engagement_total,mean_shares,median_shares
i64,f64,f64,f64,f64
0,65.395994,28.0,5.875193,1.0
1,100.714286,41.0,4.761905,1.0
2,76.367347,38.0,9.571429,3.0
3,59.884615,22.5,6.634615,1.0
4,45.94,31.0,6.08,2.0
5,57.171875,18.5,9.1953125,1.0
6,64.83,32.0,11.97,2.0
7,60.236842,34.0,12.131579,2.0
8,47.094118,25.0,9.352941,2.0
9,51.242424,26.0,11.909091,2.0


In [196]:
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=hashtag_impact.head(10)["nbr_hashtags"],
    y=hashtag_impact.head(10)["mean_engagement_total"],
    mode='lines+markers',
    name='engagement_total'
))

fig.add_trace(go.Scatter(
    x=hashtag_impact.head(10)["nbr_hashtags"],
    y=hashtag_impact.head(10)["mean_shares"],
    mode='lines+markers',
    name='shares'
))

fig.update_layout(
    title="Impact du nombre de hashtags",
    xaxis_title="Nombre de hashtags",
    yaxis_title="Engagement / Shares",
    legend_title="Metrics",
)

fig.show()

In [213]:
correlation_engagement, correlation_shares, correlation_shares_followers, correlation_engagement_followers

(shape: (1, 1)
 ┌──────────────────────────────┐
 │ corr_nbr_hashtags_engagement │
 │ ---                          │
 │ f64                          │
 ╞══════════════════════════════╡
 │ -0.095003                    │
 └──────────────────────────────┘,
 shape: (1, 1)
 ┌──────────────────────────┐
 │ corr_nbr_hashtags_shares │
 │ ---                      │
 │ f64                      │
 ╞══════════════════════════╡
 │ 0.000151                 │
 └──────────────────────────┘,
 shape: (1, 1)
 ┌───────────────────────────┐
 │ corr_nbr_followers_shares │
 │ ---                       │
 │ f64                       │
 ╞═══════════════════════════╡
 │ 0.042954                  │
 └───────────────────────────┘,
 shape: (1, 1)
 ┌─────────────────────────────────┐
 │ corr_nbr_followers_engagement_… │
 │ ---                             │
 │ f64                             │
 ╞═════════════════════════════════╡
 │ 0.152417                        │
 └─────────────────────────────────┘)

- 1-2 hashtags appear optimal for total engagement, with nbr_hashtags=1 in the lead (mean of 100.71).
- Mean shares are highest at 6-7 hashtags (peak of 12.13 at 7 hashtags), but the trend is not monotonic (9.57 at 2 hashtags, 6.08 at 4), so hashtags seem to help sharing only up to a point.
Beyond 10 hashtags, results are inconsistent (large fluctuations, low medians), probably because such posts are rare and are sometimes perceived as spammy by the LinkedIn algorithm or by users.
- Posts without hashtags have a lower mean engagement (65.40) than posts with 1-2 hashtags, and fewer mean shares (5.88) than posts with 2 hashtags (9.57), which suggests that hashtags improve discoverability.
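Because mean engagement is far above the median in every row, it is worth checking whether the "best" hashtag count changes when the median is used instead of the mean. The sketch below hard-codes the ten rows printed by `hashtag_impact.head(10)` and looks up the best count per metric; the variable names are illustrative.

```python
# Rows copied from the hashtag_impact.head(10) output above:
# (nbr_hashtags, mean_engagement_total, median_engagement_total,
#  mean_shares, median_shares)
rows = [
    (0, 65.395994, 28.0, 5.875193, 1.0),
    (1, 100.714286, 41.0, 4.761905, 1.0),
    (2, 76.367347, 38.0, 9.571429, 3.0),
    (3, 59.884615, 22.5, 6.634615, 1.0),
    (4, 45.94, 31.0, 6.08, 2.0),
    (5, 57.171875, 18.5, 9.1953125, 1.0),
    (6, 64.83, 32.0, 11.97, 2.0),
    (7, 60.236842, 34.0, 12.131579, 2.0),
    (8, 47.094118, 25.0, 9.352941, 2.0),
    (9, 51.242424, 26.0, 11.909091, 2.0),
]
metrics = [
    "mean_engagement_total",
    "median_engagement_total",
    "mean_shares",
    "median_shares",
]

# For each metric, find the hashtag count with the highest value.
best = {
    name: max(rows, key=lambda row, i=i: row[i])[0]
    for i, name in enumerate(metrics, start=1)
}
for name, nbr in best.items():
    print(f"{name}: best with {nbr} hashtag(s)")
```

For engagement, the mean and the median agree on a single hashtag. For shares they disagree (7 hashtags by mean, 2 by median), a sign that a few heavily shared posts drive the mean.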

Moreover, the correlations are close to zero (-0.095 with engagement, 0.0002 with shares), which indicates that the number of hashtags is not a main driver of performance. Follower count correlates only weakly as well (0.15 with engagement, 0.04 with shares), so other factors such as content quality probably matter more.
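The coefficients above are Pearson correlations, which only capture linear relationships and are sensitive to a few extreme posts. A rank-based (Spearman) correlation is a useful cross-check for heavy-tailed metrics such as engagement. The following is a dependency-free sketch on made-up numbers, not the notebook's data; in Polars, `pl.corr` also accepts `method="spearman"`.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up, strictly increasing but non-linear data (illustration only).
followers = [50, 200, 800, 3000, 12000, 40000]
engagement = [3, 5, 9, 20, 60, 1500]

print(f"pearson  = {pearson(followers, engagement):.3f}")
print(f"spearman = {spearman(followers, engagement):.3f}")
```

On this toy data the relationship is perfectly monotonic, so Spearman reaches 1 while Pearson stays below it. If the two coefficients diverge strongly on the real columns, the weak Pearson values above may understate a monotonic relationship.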

# Strategic hashtags

In [214]:
viral_posts = df.filter(pl.col("is_viral") == True)
hashtags_list = viral_posts["hashtags"].map_elements(
    lambda x: [tag.lower() for tag in re.findall(r'#\w+', x)] if x else [],
    return_dtype=pl.List(pl.Utf8)
)
hashtags_list = [element for sublist in hashtags_list.to_list() for element in (sublist if isinstance(sublist, list) else [sublist])]

hashtags_count = pl.DataFrame({
    "hashtag": hashtags_list
}).group_by("hashtag").len().sort("len", descending=True)

In [239]:
hashtags_count.head(15).write_parquet(f"{kpi_folder}/top_viral_hashtags.parquet")
hashtags_count.head(15)

hashtag,len
str,u32
"""#workplaceculture""",42
"""#digitaltransformation""",34
"""#leadership""",25
"""#humanresources""",22
"""#innovation""",14
"""#hr""",10
"""#hiring""",10
"""#futureofwork""",9
"""#ai""",8
"""#employeeengagement""",8


Distribution of themes among viral posts

In [230]:
viral_post_theme = viral_posts.group_by("theme").agg(
    count=pl.col("theme").count()
).sort("count", descending=True)
viral_post_theme

theme,count
str,u32
"""WorkplaceCulture""",41
"""DigitalTransformation""",32
"""Technology""",30
"""Leadership""",28
"""HumanResources""",28
"""IA""",17
"""projet""",8
"""tutoriel""",4
"""finance""",2
"""education""",1


In [231]:
viral_post_theme.write_parquet(f"{kpi_folder}/viral_post_theme_repartition.parquet")

In [233]:
fig = px.pie(
    viral_post_theme.to_pandas(),
    names="theme",
    values="count",
    title="Répartition des thèmes par nombre de posts viraux"
)

fig.show()

- WorkplaceCulture (41 posts, ~21% of the 191 viral posts) is the most frequent theme, followed by DigitalTransformation (32 posts, ~17%), Technology (30 posts, ~16%), Leadership and HumanResources (28 posts each, ~15%), and IA (17 posts, ~9%).
These themes reflect B2B (business-to-business) topics that are popular on LinkedIn: workplace culture, digital transformation, technology, leadership, talent management, and artificial intelligence.
A high-quality post on one of these themes, with hashtags that match the topic directly (#ia or #ai for a post about AI), has a good chance of going viral.

- By contrast, the themes most represented among high-engagement posts are projet, Leadership, and finance, followed by DigitalTransformation, tutoriel, and Technology. These themes combine professional inspiration (Leadership) with practical, learning-oriented content (projects, finance, tutorials).

In [235]:
fort_engagement_df = df_engagement.filter(
    pl.col("engagement_category")=="Fort"
)

In [237]:
engagement_post_theme = fort_engagement_df.group_by("theme").agg(
    count=pl.col("theme").count()
).sort("count", descending=True)
engagement_post_theme.write_parquet(f"{kpi_folder}/engagement_post_theme.parquet")
engagement_post_theme

theme,count
str,u32
"""projet""",32
"""Leadership""",30
"""finance""",29
"""DigitalTransformation""",16
"""tutoriel""",16
"""Technology""",16
"""IA""",14
"""WorkplaceCulture""",13
"""HumanResources""",12
"""education""",8


In [240]:
engagement_hashtags_list = fort_engagement_df["hashtags"].map_elements(
    lambda x: [tag.lower() for tag in re.findall(r'#\w+', x)] if x else [],
    return_dtype=pl.List(pl.Utf8)
)
engagement_hashtags_list = [element for sublist in engagement_hashtags_list.to_list() for element in (sublist if isinstance(sublist, list) else [sublist])]

engagement_hashtags_count = pl.DataFrame({
    "hashtag": engagement_hashtags_list
}).group_by("hashtag").len().sort("len", descending=True)
engagement_hashtags_count.head(10).write_parquet(f"{kpi_folder}/top_engagement_hashtags.parquet")
engagement_hashtags_count.head(10)

hashtag,len
str,u32
"""#digitaltransformation""",16
"""#leadership""",13
"""#workplaceculture""",13
"""#innovation""",11
"""#humanresources""",10
"""#hr""",7
"""#hiring""",6
"""#jobsearch""",6
"""#datawithben""",5
"""#recruitment""",5


In [241]:
fig = px.pie(
    engagement_post_theme.to_pandas(),
    names="theme",
    values="count",
    title="Répartition des thèmes par nombre de posts à fort engagement"
)

fig.show()

#### Comparison: engagement vs. virality
| <span style="color: orange;">**Aspect**</span> | <span style="color: orange;">**High-engagement posts**</span> | <span style="color: orange;">**Viral posts**</span> |
|---------------------------|-----------------------------------|------------------------|
| **Top themes**            | projet, Leadership, finance | WorkplaceCulture, DigitalTransformation, Technology |
| **Top hashtags**          | #digitaltransformation, #leadership, #workplaceculture | #workplaceculture, #digitaltransformation, #leadership |
| **Mean length**           | ~217 words | ~198 words |
| **Mean followers**        | Not computed | ~4154 |
| **Optimal hashtag count** | 1-2 for engagement | 6-7 for shares (by mean) |

- **Engagement**: favors practical and inspirational themes (**projet**, **Leadership**, **finance**), with longer posts (~217 words). The most frequent hashtags are largely the same as for viral posts (**#digitaltransformation**, **#leadership**, **#workplaceculture**), plus job-related ones (**#jobsearch**, **#recruitment**).
- **Virality**: favors broad workplace themes (**WorkplaceCulture**, **DigitalTransformation**, **HumanResources**) and wide-reach hashtags (**#workplaceculture**, **#hiring**), with somewhat shorter posts (~198 words).
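To see which themes separate the two groups, raw counts can be turned into shares and compared. The sketch below is an illustration built from the two theme tables printed above (`viral_post_theme` and `engagement_post_theme`); the helper `shares` and the variable names are not part of the notebook.

```python
# Counts copied from the viral_post_theme and engagement_post_theme outputs.
viral_counts = {
    "WorkplaceCulture": 41, "DigitalTransformation": 32, "Technology": 30,
    "Leadership": 28, "HumanResources": 28, "IA": 17, "projet": 8,
    "tutoriel": 4, "finance": 2, "education": 1,
}
engagement_counts = {
    "projet": 32, "Leadership": 30, "finance": 29,
    "DigitalTransformation": 16, "tutoriel": 16, "Technology": 16,
    "IA": 14, "WorkplaceCulture": 13, "HumanResources": 12, "education": 8,
}


def shares(counts):
    """Convert raw counts into fractions of the group total."""
    total = sum(counts.values())
    return {theme: count / total for theme, count in counts.items()}


viral_share = shares(viral_counts)
engagement_share = shares(engagement_counts)

# Positive gap: theme is over-represented among viral posts;
# negative gap: over-represented among high-engagement posts.
gap = {theme: viral_share[theme] - engagement_share[theme] for theme in viral_counts}

for theme, diff in sorted(gap.items(), key=lambda item: item[1], reverse=True):
    print(f"{theme:<22} viral {viral_share[theme]:6.1%}  "
          f"engagement {engagement_share[theme]:6.1%}  gap {diff:+.1%}")
```

Sorting by the gap puts WorkplaceCulture at the viral end and finance and projet at the engagement end, which is the contrast the comparison table summarizes.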

---


### Temporal analysis of the best times to publish

| <span style="color: orange;">**Key moment**</span> | <span style="color: orange;">**Time range**</span> | <span style="color: orange;">**Rationale**</span> |
|-------------------------|-------------------|----------------------------------------------------------------------------------|
| **Tôt le matin** (early morning)      | 6:00 - 8:59   | Before the workday starts, professionals check LinkedIn.                     |
| **Milieu de matinée** (mid-morning)   | 9:00 - 11:59  | During the coffee break or before meetings, activity on LinkedIn is high.    |
| **Après-midi** (afternoon)            | 12:00 - 16:59 | After lunch, users come back to LinkedIn during breaks.                      |
| **Fin de journée** (end of day)       | 17:00 - 19:59 | After work, professionals check LinkedIn for updates.                        |
| **Soir/Nuit** (evening/night, labeled `Soir` in the code) | 20:00 - 5:59 | Lower activity, but some users (e.g. international ones) are active. |

In [242]:
def categorize_time(hour):
    if 6 <= hour <= 8:
        return "Tôt le matin"
    elif 9 <= hour <= 11:
        return "Milieu de matinée"
    elif 12 <= hour <= 16:
        return "Après-midi"
    elif 17 <= hour <= 19:
        return "Fin de journée"
    else:
        return "Soir"

In [243]:
viral_post = viral_posts.with_columns(
    pl.col("hour").map_elements(categorize_time, return_dtype=pl.Utf8).alias("time_of_day")
)

df_engagement = fort_engagement_df.with_columns(
    pl.col("hour").map_elements(categorize_time, return_dtype=pl.Utf8).alias("time_of_day")
)

In [None]:
viral_timing = viral_post.group_by(["day_of_week", "time_of_day"]).agg(
    count=pl.col("id").count(),
    mean_engagement_total=pl.col("engagement_total").mean(),
    mean_shares=pl.col("shares").mean(),
    themes=pl.col("theme").unique().str.join(", "),
    hashtags=pl.col("hashtags").str.join(", ")
).sort(["mean_engagement_total"], descending=True).drop_nulls()

engagement_timing = df_engagement.group_by(["day_of_week", "time_of_day"]).agg(
    count=pl.col("id").count(),
    mean_engagement_total=pl.col("engagement_total").mean(),
    mean_shares=pl.col("shares").mean(),
    themes=pl.col("theme").unique().str.join(", "),
    hashtags=pl.col("hashtags").str.join(", ")
).drop_nulls()#.sort(["mean_engagement_total"], descending=True).drop_nulls()

In [262]:
engagement_timing[["day_of_week", "time_of_day", "mean_engagement_total"]].sort("day_of_week").write_parquet(f"{kpi_folder}/engagement_timing.parquet")

In [263]:
engagement_timing[["day_of_week", "time_of_day", "mean_shares"]].sort("day_of_week").write_parquet(f"{kpi_folder}/engagement_shares_timing.parquet")

In [251]:
engagement_timing[["day_of_week", "time_of_day", "mean_engagement_total"]]

day_of_week,time_of_day,mean_engagement_total
i8,str,f64
2,"""Après-midi""",243.5
3,"""Après-midi""",381.5
4,"""Fin de journée""",280.633333
5,"""Fin de journée""",359.666667
5,"""Tôt le matin""",241.65
5,"""Milieu de matinée""",301.129032
4,"""Après-midi""",269.576923
1,"""Après-midi""",198.0
3,"""Soir""",530.5
4,"""Tôt le matin""",195.0


In [256]:
engagement_timing = engagement_timing.drop("count")

In [269]:
engagement_timing

day_of_week,time_of_day,mean_engagement_total,mean_shares,themes,hashtags
i8,str,f64,f64,str,str
2,"""Après-midi""",243.5,25.5,"""tutoriel, Technology""",""", , , #DataWithBen, , , #handi…"
3,"""Après-midi""",381.5,16.0,"""tutoriel""","""#DataWithBen, #DataWithBen"""
4,"""Fin de journée""",280.633333,65.033333,"""projet, Leadership, IA, Techno…",""", #carreiras #logística #traba…"
5,"""Fin de journée""",359.666667,12.666667,"""Technology, finance""",""", , """
5,"""Tôt le matin""",241.65,6.3,"""finance, projet, IA""",""", , , #kashmirattack #war #pre…"
5,"""Milieu de matinée""",301.129032,7.548387,"""education, projet, finance, Le…","""#Bénin #FinanceClimatique #Dév…"
4,"""Après-midi""",269.576923,72.153846,"""tutoriel, WorkplaceCulture, Te…","""#DataWithBen, #WebDesign #Défi…"
1,"""Après-midi""",198.0,20.666667,"""tutoriel""","""#DataWithBen, #comptabilité, #…"
3,"""Soir""",530.5,183.0,"""WorkplaceCulture""","""#SocialMediaManager #Marketing…"
4,"""Tôt le matin""",195.0,75.0,"""HumanResources""","""#humanresources #hr #jobinterv…"


In [270]:
viral_timing

day_of_week,time_of_day,count,mean_engagement_total,mean_shares,themes,hashtags
i8,str,u32,f64,f64,str,str
5,"""Fin de journée""",1,759.0,33.0,"""Technology""",""""""
3,"""Soir""",2,530.5,183.0,"""WorkplaceCulture""","""#SocialMediaManager #Marketing…"
5,"""Milieu de matinée""",3,487.0,37.0,"""projet, Leadership""","""#JusticePourTous #ÉtatDeDroit …"
3,"""Après-midi""",1,453.0,21.0,"""tutoriel""","""#DataWithBen"""
5,"""Après-midi""",2,437.5,38.5,"""IA, education""","""#phd #research, #ILoveSyria"""
5,"""Tôt le matin""",3,411.666667,24.666667,"""finance, projet""","""#RönesansHolding, #Finance #in…"
5,"""Soir""",3,336.0,32.666667,"""IA, Technology""",""", #Aviation #eVTOL #Drones #Ai…"
4,"""Soir""",15,268.6,48.533333,"""projet, IA, WorkplaceCulture""",""", , , , #OurShouthashtag #Soci…"
2,"""Après-midi""",4,247.5,51.75,"""tutoriel, WorkplaceCulture, Te…","""#handicap #handicapinvisible, …"
4,"""Fin de journée""",52,183.230769,55.269231,"""projet, DigitalTransformation,…",""", , #decentralization #dataint…"


In [265]:
viral_timing[["day_of_week", "time_of_day", "mean_engagement_total"]].sort("day_of_week").write_parquet(f"{kpi_folder}/viral_timing.parquet")
viral_timing[["day_of_week", "time_of_day", "mean_shares"]].sort("day_of_week").write_parquet(f"{kpi_folder}/viral_shares_timing.parquet")
viral_timing[["day_of_week", "time_of_day", "mean_engagement_total"]]

day_of_week,time_of_day,mean_engagement_total
i8,str,f64
5,"""Fin de journée""",759.0
3,"""Soir""",530.5
5,"""Milieu de matinée""",487.0
3,"""Après-midi""",453.0
5,"""Après-midi""",437.5
5,"""Tôt le matin""",411.666667
5,"""Soir""",336.0
4,"""Soir""",268.6
2,"""Après-midi""",247.5
4,"""Fin de journée""",183.230769


In [266]:
fig = px.pie(
    viral_timing.to_pandas(),
    names="day_of_week",
    values="count",
    title="Répartition des posts viraux par jour de la semaine"
)

fig.show()

### 📊 Engagement Timing and Viral Timing (summary)

#### 🕒 Best times to post

| Day of week | Time of day | Mean engagement 📈 | Mean shares 🔄 | Main themes | Useful hashtags |
|:------------|:------------|:-------------------|:---------------|:------------|:----------------|
| Tuesday (2)   | Après-midi        | 243.5 | 25.5  | tutoriel, Technology                            | #DataWith{username}            |
| Wednesday (3) | Après-midi        | 381.5 | 16.0  | tutoriel                                        | #DataWith{username}            |
| Wednesday (3) | Soir              | 530.5 | 183.0 | WorkplaceCulture                                | #SocialMediaManager #Marketing |
| Thursday (4)  | Après-midi        | 269.6 | 72.2  | WorkplaceCulture, tutoriel, Technology          | #WebDesign #Défi               |
| Thursday (4)  | Milieu de matinée | 277.7 | 70.3  | IA, WorkplaceCulture, DigitalTransformation     | #Deal #Finance                 |
| Thursday (4)  | Soir              | 368.7 | 34.0  | WorkplaceCulture, IA, projet                    | #NewDealTechnologique          |
| Friday (5)    | Fin de journée    | 359.7 | 12.7  | Technology, finance                             |                                |
| Friday (5)    | Milieu de matinée | 301.1 | 7.5   | education, projet, finance, Leadership          | #Bénin #FinanceClimatique      |
| Friday (5)    | Tôt le matin      | 241.7 | 6.3   | finance, projet, IA                             | #kashmirattack #war            |
| Friday (5)    | Soir              | 253.5 | 11.5  | projet, Technology, IA, finance                 | #AI #LLM                       |
| Sunday (7)    | Après-midi        | 321.0 | 21.0  | Leadership, tutoriel                            |                                |

---

### 🚀 When posts go viral

| Day of week | Time of day | Mean engagement 📈 | Mean shares 🔄 | Main themes | Effective hashtags |
|:------------|:------------|:-------------------|:---------------|:------------|:-------------------|
| Friday (5)    | Fin de journée    | 759.0 | 33.0  | Technology         |                                |
| Wednesday (3) | Soir              | 530.5 | 183.0 | WorkplaceCulture   | #SocialMediaManager #Marketing |
| Friday (5)    | Milieu de matinée | 487.0 | 37.0  | projet, Leadership | #JusticePourTous #ÉtatDeDroit  |
| Wednesday (3) | Après-midi        | 453.0 | 21.0  | tutoriel           | #DataWithBen                   |
| Friday (5)    | Après-midi        | 437.5 | 38.5  | IA, education      | #phd #research                 |
| Friday (5)    | Tôt le matin      | 411.7 | 24.7  | finance, projet    | #Finance                       |

---

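**Posting important content:**
- **Wednesday evening** or **Friday end of day**: these slots show the highest mean engagement among viral posts (530.5 and 759.0), but they rest on only 2 and 1 viral posts respectively, so treat them as leads to test rather than firm rules.
- **Use hashtags that match your audience, and use few of them**: in this dataset, posts with 1-2 hashtags have the highest mean engagement, and adding more hashtags does not correlate with better results.
- **Consider recycling your best posts**: republishing a lightly edited version 2 to 3 weeks later may extend their reach. This effect is not measured in this dataset.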
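Several of the top slots in `viral_timing` are based on one or two posts, so ranking by mean engagement alone can be misleading. One simple safeguard is to require a minimum number of posts per slot before ranking. The sketch below applies that idea to the rows printed above; `MIN_POSTS` and `reliable_slots` are hypothetical names, and the threshold of 3 is an arbitrary assumption.

```python
# Rows copied from the viral_timing output above:
# (day_of_week, time_of_day, count, mean_engagement_total)
viral_timing_rows = [
    (5, "Fin de journée", 1, 759.0),
    (3, "Soir", 2, 530.5),
    (5, "Milieu de matinée", 3, 487.0),
    (3, "Après-midi", 1, 453.0),
    (5, "Après-midi", 2, 437.5),
    (5, "Tôt le matin", 3, 411.666667),
    (5, "Soir", 3, 336.0),
    (4, "Soir", 15, 268.6),
    (2, "Après-midi", 4, 247.5),
    (4, "Fin de journée", 52, 183.230769),
]

MIN_POSTS = 3  # assumed threshold; tune it to the size of the dataset


def reliable_slots(rows, min_posts):
    """Keep slots backed by at least `min_posts` posts, best mean first."""
    kept = [row for row in rows if row[2] >= min_posts]
    return sorted(kept, key=lambda row: row[3], reverse=True)


ranked = reliable_slots(viral_timing_rows, MIN_POSTS)
for day, slot, count, mean_engagement in ranked:
    print(f"day {day} | {slot:<18} | n={count:<3} | mean engagement {mean_engagement:.1f}")
```

With this threshold the Friday end-of-day and Wednesday evening slots drop out, and Friday mid-morning becomes the best-supported slot among the rows shown.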

In [285]:
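import os

# Export every KPI parquet file to CSV
os.makedirs("Kpis_csv", exist_ok=True)
for file in os.listdir(kpi_folder):
    temp_df = pl.read_parquet(os.path.join(kpi_folder, file))
    # os.path.splitext avoids nesting double quotes inside the f-string,
    # which is a syntax error before Python 3.12
    temp_df.write_csv(os.path.join("Kpis_csv", f"{os.path.splitext(file)[0]}.csv"))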

# Segmentation

In [281]:
df= df.with_columns(
    pl.col("hour").map_elements(categorize_time, return_dtype=pl.Utf8).alias("time_of_day")
)

In [283]:
df_segmentation = df[["likes", "comments", "shares", "theme", "followers", "text_length", "day_of_week", "time_of_day", "nbr_hashtags", "is_viral", "engagement_category"]]
df_segmentation.write_parquet("data/df_segmentation.parquet")

In [284]:
df_segmentation.head()

likes,comments,shares,theme,followers,text_length,day_of_week,time_of_day,nbr_hashtags,is_viral,engagement_category
i64,i64,i64,str,i64,u32,i8,str,i64,bool,str
25,21,0,"""IA""",15414,318,4,"""Fin de journée""",2,False,"""Faible"""
5,0,0,"""IA""",471,43,4,"""Soir""",15,False,"""Faible"""
0,0,0,"""IA""",0,6,5,"""Soir""",5,False,"""Faible"""
0,0,0,"""IA""",15929,1,5,"""Soir""",0,False,"""Faible"""
0,0,0,"""IA""",0,17,5,"""Tôt le matin""",4,False,"""Faible"""
