
# Session 2: Data manipulation in Python
**Objectives**
- Load tabular data (CSV) robustly
- Inspect: `shape`, `head`, `info`, `describe`, `dtypes`
- Clean: missing values (NA), types, outliers
- Transform: derived columns, groupby/agg, pivot, merge/join
- Export: CSV/Parquet

> Dataset: use your `FB.csv` (or a similar CSV). The cells include guardrails and tests.


## 0. Imports & options

In [1]:

import pandas as pd
import numpy as np

pd.set_option("display.max_rows", 8)
pd.set_option("display.width", 120)
print(pd.__version__)


1.5.3


## 1. Robust CSV loading

In [None]:

from pathlib import Path

path = Path("FB.csv")  # change if needed
if not path.exists():
    # create a small demo CSV if FB.csv is missing
    demo = pd.DataFrame({
        "comment":[5,2,0,1],
        "like":[10, np.nan, 3, 2],
        "share":[1, 1, np.nan, 0],
        "Category":[1,2,1,3],
        "Post Month":[9,10,10,11],
        "Post Hour":[15,20,10,17],
        "Type":["Photo","Status","Photo","Video"]
    })
    demo.to_csv(path, index=False)

# Auto-detect the separator (sep=None requires the python engine)
df = pd.read_csv(path, sep=None, engine="python")
df.head()
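
To see what the separator auto-detection does, here is a minimal sketch on an in-memory sample. The semicolon-separated text below is a hypothetical example, not taken from `FB.csv`; with `sep=None` and `engine="python"`, pandas relies on Python's `csv.Sniffer` to guess the delimiter.

```python
import io

import pandas as pd

# Hypothetical in-memory sample that uses ';' instead of ','
raw = "comment;like;Type\n5;10;Photo\n2;;Status\n"

# sep=None asks the python engine to sniff the delimiter
sample = pd.read_csv(io.StringIO(raw), sep=None, engine="python")

print(sample.columns.tolist())
print(sample.dtypes)
```

The empty field in the second data row is read as a missing value, which is why the `like` column ends up as a float column.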


## 2. Inspect & diagnose

In [None]:

print(df.shape)
print(df.dtypes)
print(df.isna().sum())
df.describe(include="all")
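
The separate printouts above can also be gathered into one per-column summary table. This is a sketch on a toy frame (the `toy` data and the `summary` column names are illustrative assumptions); the same expressions should work on `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df (illustrative values)
toy = pd.DataFrame({
    "like": [10, np.nan, 3, 2],
    "share": [1, 1, np.nan, 0],
    "Type": ["Photo", "Status", "Photo", "Video"],
})

# One row per column: type, number and share of missing values, distinct values
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean().mul(100).round(1),
    "n_unique": toy.nunique(),
})
print(summary)
```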


## 3. Minimal cleaning (types & NA)

In [None]:

# Coerce to numeric the columns that are supposed to be numeric
numeric_candidates = ["comment","like","share","Category","Post Month","Post Hour"]
for c in numeric_candidates:
    if c in df.columns:
        df[c] = pd.to_numeric(df[c], errors="coerce")

# Simple imputation: fill numeric NA with the column mean
num_cols = df.select_dtypes(include=["number"]).columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean(numeric_only=True))

# Vectorized derived column
if {"comment","share"}.issubset(df.columns):
    df["action"] = df["comment"] + df["share"]

# Checks
assert df.select_dtypes(include=["number"]).isna().sum().sum() == 0
assert "action" not in df.columns or (df["action"] >= 0).all()
print("✓ Minimal cleaning OK")
df.head()
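
The objectives also mention outliers, which the cell above does not handle. One common option is the interquartile-range (IQR) rule; the sketch below is an assumption about how it could be applied here, with a hypothetical helper `iqr_outlier_mask` and made-up values.

```python
import pandas as pd


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Made-up like counts with one obviously extreme value
likes = pd.Series([10, 12, 11, 9, 13, 500], name="like")
mask = iqr_outlier_mask(likes)

print(likes[mask])   # rows flagged as outliers
print(likes[~mask])  # rows kept
```

Whether flagged rows should be dropped, capped, or kept depends on the analysis; the mask only identifies them.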


## 4. Filter, sort, select

In [None]:

# Column selection
cols = [c for c in ["comment","like","share","action","Type"] if c in df.columns]
view = df[cols].copy()

# Filter: keep rows above the median number of likes
high_like = view[view["like"] > view["like"].median()] if "like" in view else view

# Sort by the first selected column
ordered = high_like.sort_values(by=cols[0]) if cols else df
ordered.head()
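
Filters often combine several conditions. The sketch below uses a toy frame (`posts` is illustrative data) to show the usual pattern: each condition in its own parentheses, combined with `&` or `|` rather than `and` / `or`, then `.loc` to pick rows and columns in one step.

```python
import pandas as pd

# Toy frame (illustrative values)
posts = pd.DataFrame({
    "like": [10, 4, 3, 25],
    "share": [1, 1, 0, 6],
    "Type": ["Photo", "Status", "Photo", "Video"],
})

# Element-wise AND of two boolean Series
mask = (posts["like"] > 3) & (posts["Type"].isin(["Photo", "Video"]))

# Select rows by mask and a subset of columns, then sort descending
selected = posts.loc[mask, ["like", "Type"]].sort_values("like", ascending=False)
print(selected)
```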


## 5. Groupby / aggregations

In [None]:

if "Type" in df.columns:
    agg = (df.groupby("Type")
             .agg(n=("Type","count"),
                  like_mean=("like","mean"),
                  action_sum=("action","sum") if "action" in df else ("comment","sum"))
             .reset_index())
    display(agg)
else:
    print("No 'Type' column: skip to the next step.")
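
The objectives list pivot alongside groupby. As a sketch, `pivot_table` produces the same kind of aggregate but spreads a second key across the columns. The `posts` frame below is illustrative data; the column names mirror those of the demo CSV.

```python
import pandas as pd

# Toy frame (illustrative values)
posts = pd.DataFrame({
    "Type": ["Photo", "Photo", "Status", "Video", "Photo"],
    "Post Month": [9, 10, 10, 11, 10],
    "like": [10, 4, 3, 2, 6],
})

# Mean likes per Type (rows) and Post Month (columns)
table = posts.pivot_table(index="Type", columns="Post Month",
                          values="like", aggfunc="mean")
print(table)
```

Combinations that never occur in the data (for example Video in month 9) appear as NaN in the result.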


## 6. Joins (merge): mini demo

In [None]:

# Toy dataset for merge
users = pd.DataFrame({"user_id":[1,2,3], "country":["FR","FR","DE"]})
events = pd.DataFrame({"user_id":[1,1,3,4], "clicks":[5,2,1,7]})
joined = users.merge(events, on="user_id", how="left")
joined
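
With `how="left"`, user 2 (no events) gets a missing `clicks` value and user 4 (no user record) disappears. A sketch of how to make this visible is to compare with an outer join and pass `indicator=True`, which adds a `_merge` column saying where each row came from.

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "country": ["FR", "FR", "DE"]})
events = pd.DataFrame({"user_id": [1, 1, 3, 4], "clicks": [5, 2, 1, 7]})

# Left join: keeps every user, drops events without a matching user
left = users.merge(events, on="user_id", how="left")

# Outer join with an indicator column to audit unmatched keys
outer = users.merge(events, on="user_id", how="outer", indicator=True)

print(len(left), len(outer))
print(outer["_merge"].value_counts())
```

Checking `_merge` before choosing the join type is a cheap way to catch keys that silently fail to match.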



### 💪 Your turn 1: End-to-end pipeline
1. Reload `FB.csv` (or another CSV) into `df2`.
2. Coerce `["comment","like","share"]` to numeric and impute missing values.
3. Create `engagement = like + share + comment`.
4. Compute, per `Type`, the mean `engagement` and the number of rows.


In [None]:

# TODO
df2 = pd.read_csv(path, sep=None, engine="python")

for c in ["comment","like","share"]:
    if c in df2.columns:
        df2[c] = pd.to_numeric(df2[c], errors="coerce")

num_cols2 = df2.select_dtypes(include=["number"]).columns
df2[num_cols2] = df2[num_cols2].fillna(df2[num_cols2].mean(numeric_only=True))

if {"comment","like","share"}.issubset(df2.columns):
    df2["engagement"] = df2["comment"] + df2["like"] + df2["share"]

if {"Type","engagement"}.issubset(df2.columns):  # both columns are needed for the aggregation
    sol = df2.groupby("Type").agg(n=("Type","count"),
                                  engagement_mean=("engagement","mean")).reset_index()
    display(sol.head())
else:
    sol = None

# tests
if sol is not None:
    assert {"Type","n","engagement_mean"}.issubset(sol.columns)
print("✓ Your turn 1: OK")


## 7. Export (CSV / Parquet)

In [None]:

df.to_csv("FB_clean.csv", index=False)
try:
    df.to_parquet("FB_clean.parquet", index=False)
    print("✓ Exported to CSV and Parquet")
except Exception as e:
    print("Parquet unavailable (pyarrow not installed?):", e)
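
One practical difference between the two formats: CSV stores only text, so column types have to be inferred again on reload, whereas Parquet stores the schema with the data. The sketch below illustrates the CSV side with an in-memory buffer; the `when` column is a hypothetical example and is not part of `FB.csv`.

```python
import io

import pandas as pd

# Hypothetical frame with a datetime column
frame = pd.DataFrame({
    "when": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "like": [10, 3],
})

# Write to an in-memory CSV instead of a file
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Reload without hints: the dates come back as plain strings
buffer.seek(0)
naive = pd.read_csv(buffer)

# Reload with parse_dates: the datetime dtype is restored
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["when"])

print(naive.dtypes)
print(parsed.dtypes)
```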



## 8. Bonus: quick validation & useful asserts
Add a few checks that fail early if your assumptions about the data are wrong.


In [None]:

# Examples
if "engagement" in df2.columns:
    assert (df2["engagement"] >= 0).all(), "Negative engagement is impossible"
if "Post Hour" in df2.columns:
    ok = df2["Post Hour"].between(0, 23).all()
    assert ok, "Post Hour must be between 0 and 23"
print("✓ Validation asserts passed")
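
An `assert` stops at the first failure. When several rules may be broken at once, one alternative is to collect all problems and report them together. The helper below is a sketch under that assumption; `validate` and its messages are hypothetical names, and the rules simply mirror the two asserts above.

```python
import pandas as pd


def validate(frame: pd.DataFrame) -> list:
    """Return a list of human-readable problems (empty if all checks pass)."""
    problems = []
    if "engagement" in frame.columns and (frame["engagement"] < 0).any():
        problems.append("engagement has negative values")
    if "Post Hour" in frame.columns and not frame["Post Hour"].between(0, 23).all():
        problems.append("Post Hour outside 0-23")
    return problems


# Made-up frames: one that breaks both rules, one that satisfies them
bad = pd.DataFrame({"engagement": [3, -1], "Post Hour": [15, 24]})
good = pd.DataFrame({"engagement": [3, 1], "Post Hour": [15, 23]})

print(validate(bad))
print(validate(good))
```

Checks for columns that are absent are skipped, matching the guarded style used throughout the notebook.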
