# Final Lab: Web Log Analysis with Spark

## Context
You are a Data Engineer at an e-commerce company. You have been asked to analyze the web server logs in order to:
- Understand the traffic
- Identify problems
- Detect anomalies

## Duration: 3 hours

## Expected deliverables
1. Robust parsing pipeline
2. Metrics report
3. Anomaly detection
4. Cleaned data in Parquet

In [102]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import *
import re

spark = SparkSession.builder \
    .appName("TP_LogAnalysis") \
    .master("local[*]") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

print(f"Spark Version: {spark.version}")

Spark Version: 3.5.7


---
## Part 1: Test data generation (provided)

This code generates realistic logs for the lab.

In [103]:
import random
from datetime import datetime, timedelta

def generate_logs(n_logs=10000):
    """Generate realistic Apache logs."""
    
    ips = [f"192.168.1.{i}" for i in range(1, 50)] + \
          [f"10.0.0.{i}" for i in range(1, 30)] + \
          ["203.0.113.42", "198.51.100.23"]  # Suspicious IPs
    
    paths = [
        "/", "/index.html", "/about", "/contact",
        "/products", "/products/1", "/products/2", "/products/3",
        "/api/users", "/api/products", "/api/orders",
        "/login", "/logout", "/register",
        "/static/css/style.css", "/static/js/app.js",
        "/admin", "/admin/users",  # Admin access attempts
        "/.env", "/wp-admin", "/phpmyadmin",  # Malicious probes
    ]
    
    methods = ["GET"] * 80 + ["POST"] * 15 + ["PUT"] * 3 + ["DELETE"] * 2
    
    # Status codes with realistic weights
    statuses = [200] * 70 + [201] * 5 + [301] * 5 + [302] * 5 + \
               [400] * 3 + [401] * 3 + [403] * 2 + [404] * 5 + [500] * 2
    
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)",
        "Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36",
        "curl/7.68.0",
        "python-requests/2.25.1",
        "-",  # Bot without a user agent
    ]
    
    logs = []
    base_time = datetime(2024, 1, 15, 0, 0, 0)
    
    for i in range(n_logs):
        # Random time offset within the 3-day window
        time_offset = timedelta(seconds=random.randint(0, 86400 * 3))  # 3 days
        log_time = base_time + time_offset
        
        ip = random.choice(ips)
        method = random.choice(methods)
        path = random.choice(paths)
        status = random.choice(statuses)
        size = random.randint(100, 50000) if status == 200 else random.randint(0, 500)
        user_agent = random.choice(user_agents)
        referer = random.choice(["-", "https://www.google.com", "https://example.com"])
        
        # Simulate attacks from specific IPs
        if ip in ["203.0.113.42", "198.51.100.23"]:
            path = random.choice(["/admin", "/.env", "/wp-admin", "/login"])
            status = random.choice([401, 403, 404])
        
        # Apache Combined Log Format
        timestamp = log_time.strftime("%d/%b/%Y:%H:%M:%S +0000")
        log_line = f'{ip} - - [{timestamp}] "{method} {path} HTTP/1.1" {status} {size} "{referer}" "{user_agent}"'
        logs.append(log_line)
    
    # Insert malformed lines (5% of n_logs)
    n_bad = int(n_logs * 0.05)
    bad_logs = [
        "invalid log line",
        "192.168.1.1 - - malformed",
        "",
        "# comment line",
    ]
    for _ in range(n_bad):
        idx = random.randint(0, len(logs) - 1)
        logs.insert(idx, random.choice(bad_logs))
    
    return logs

# Generate the logs
log_lines = generate_logs(50000)
print(f"Generated logs: {len(log_lines)}")
print("\nExamples:")
for line in log_lines[:5]:
    print(line)

Generated logs: 52500

Examples:
192.168.1.41 - - [15/Jan/2024:21:25:09 +0000] "GET /contact HTTP/1.1" 404 345 "https://example.com" "python-requests/2.25.1"
invalid log line
192.168.1.24 - - [15/Jan/2024:23:15:51 +0000] "GET /phpmyadmin HTTP/1.1" 200 6146 "-" "-"
192.168.1.9 - - [17/Jan/2024:03:50:00 +0000] "GET /static/css/style.css HTTP/1.1" 200 12262 "-" "curl/7.68.0"
10.0.0.7 - - [17/Jan/2024:16:37:02 +0000] "GET /admin HTTP/1.1" 301 322 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"


In [104]:
# Create an RDD, then a DataFrame, from the logs
logs_rdd = spark.sparkContext.parallelize(log_lines)
logs_raw = spark.createDataFrame(logs_rdd.map(lambda x: (x,)), ["value"])
print(f"Number of rows: {logs_raw.count()}")

Number of rows: 52500


---
## Part 2: Log parsing (45 min)

### Tasks:
1. Define the regex pattern for the Apache Combined format
2. Extract all fields
3. Convert the types (timestamp, status, size)
4. Handle invalid lines
5. Compute the ratio of valid to invalid lines
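
Before running a pattern over the full DataFrame, it can help to check the parsing logic on a couple of lines in plain Python. The sketch below is only an illustration: `LOG_RE` and `parse_line` are hypothetical names, and the pattern is a simplified one that is not necessarily the pattern expected as the solution.

```python
import re
from datetime import datetime

# Hypothetical, simplified pattern for the Apache Combined format
# (an assumption for illustration, not the lab's reference pattern).
LOG_RE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d{3}) (\d+|-) "([^"]*)" "([^"]*)"$'
)

def parse_line(line):
    """Return a dict of typed fields, or None when the line is malformed."""
    match = LOG_RE.match(line)
    if match is None:
        return None
    ip, ts, method, path, status, size, referer, user_agent = match.groups()
    return {
        "ip": ip,
        # %z accepts numeric offsets such as +0000
        "timestamp": datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z"),
        "method": method,
        "path": path,
        "status": int(status),
        "size": None if size == "-" else int(size),
        "referer": referer,
        "user_agent": user_agent,
    }

sample = ('192.168.1.41 - - [15/Jan/2024:21:25:09 +0000] '
          '"GET /contact HTTP/1.1" 404 345 "https://example.com" '
          '"python-requests/2.25.1"')
parsed = parse_line(sample)
print(parsed["status"], parsed["path"])
print(parse_line("invalid log line"))
```

Returning `None` for lines that do not match keeps valid and invalid lines easy to count separately, which is what the last task asks for.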

In [105]:
# TODO: Define the regex pattern
# Format: IP - - [timestamp] "METHOD PATH HTTP/1.1" STATUS SIZE "REFERER" "USER-AGENT"

APACHE_PATTERN = r'^(\d+\.\d+\.\d+\.\d+) - - \[([A-Za-z0-9:\/\+ ]+)\] \"([A-Z]+) (\/[A-Za-z0-9:\/\.\- ]*) HTTP\/1\.[01]\" (\d+) (\d+) \"([A-Za-z0-9:\/\.\- ]+)\" \"([A-Za-z0-9:\/\.();_\- ]+)\"$'

# TODO: Parse the logs
logs_parsed = logs_raw.withColumn(
    "ip",
    F.regexp_extract("value", APACHE_PATTERN, 1),
).withColumn(
    "timestamp",
    F.regexp_extract("value", APACHE_PATTERN, 2),
).withColumn(
    "method",
    F.regexp_extract("value", APACHE_PATTERN, 3),
).withColumn(
    "path",
    F.regexp_extract("value", APACHE_PATTERN, 4),
).withColumn(
    "status",
    F.regexp_extract("value", APACHE_PATTERN, 5),
).withColumn(
    "size",
    F.regexp_extract("value", APACHE_PATTERN, 6),
).withColumn(
    "referer",
    F.regexp_extract("value", APACHE_PATTERN, 7),
).withColumn(
    "user_agent",
    F.regexp_extract("value", APACHE_PATTERN, 8),
)

# Show the result
logs_parsed.show(5, truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------+------------+--------------------------+------+---------------------+------+-----+-------------------+------------------------------------------------------------+
|value                                                                                                                                     |ip          |timestamp                 |method|path                 |status|size |referer            |user_agent                                                  |
+------------------------------------------------------------------------------------------------------------------------------------------+------------+--------------------------+------+---------------------+------+-----+-------------------+------------------------------------------------------------+
|192.168.1.41 - - [15/Jan/2024:21:25:09 +0000] "GET /contact HTTP/1.1" 404 345 "https://

In [106]:
# TODO: Convert the types
# - timestamp: to_timestamp with format "dd/MMM/yyyy:HH:mm:ss Z"
# - status: integer
# - size: long (handle "-")

from pyspark.sql.types import IntegerType, LongType

logs_typed = logs_parsed.withColumn(
    "timestamp",
    F.to_timestamp("timestamp", "dd/MMM/yyyy:HH:mm:ss Z"),
).withColumn(
    "status",
    F.col("status").cast(IntegerType()),
).withColumn(
    "size",
    F.col("size").cast(LongType()),  # non-numeric values become NULL when ANSI mode is off
).withColumn(
    "referer",
    F.when(F.col("referer") == "-", "UNKNOWN").otherwise(F.col("referer")),
).withColumn(
    "user_agent",
    F.when(F.col("user_agent") == "-", "UNKNOWN").otherwise(F.col("user_agent")),
)

# regexp_extract returns "" rather than NULL when the pattern does not match,
# so unparsed lines are detected through empty strings and failed conversions.
logs_clean = logs_typed.filter(
    (F.col("ip") != "")
    & F.col("timestamp").isNotNull()
    & F.col("status").isNotNull()
    & F.col("size").isNotNull()
)

logs_clean.show(5, truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------+------------+-------------------+------+---------------------+------+-----+-------------------+------------------------------------------------------------+
|value                                                                                                                                     |ip          |timestamp          |method|path                 |status|size |referer            |user_agent                                                  |
+------------------------------------------------------------------------------------------------------------------------------------------+------------+-------------------+------+---------------------+------+-----+-------------------+------------------------------------------------------------+
|192.168.1.41 - - [15/Jan/2024:21:25:09 +0000] "GET /contact HTTP/1.1" 404 345 "https://example.com" "python-

In [107]:
# TODO: Compute the validity rate
total_lines = logs_raw.count()
valid_lines = logs_clean.count()
invalid_lines = total_lines - valid_lines

print(f"Total: {total_lines}")
print(f"Valid: {valid_lines} ({valid_lines/total_lines*100:.2f}%)")
print(f"Invalid: {invalid_lines} ({invalid_lines/total_lines*100:.2f}%)")

Total: 52500
Valid: 50000 (95.24%)
Invalid: 2500 (4.76%)


---
## Part 3: Basic metrics (45 min)

### Tasks:
1. Requests per hour
2. Distribution of HTTP status codes
3. Top 10 pages
4. Top 10 IPs
5. Breakdown by HTTP method
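
The aggregations in this part all follow the same shape: group, count, then sort or normalize. As a small-scale cross-check outside Spark, the sketch below reproduces that shape with `collections.Counter` on a handful of hand-written records. The `records` list and the `distribution` helper are illustrative assumptions, not part of the lab data.

```python
from collections import Counter

# Hypothetical mini-sample of (path, status) pairs; not taken from the lab data.
records = [
    ("/", 200), ("/products", 200), ("/login", 401),
    ("/", 200), ("/.env", 404), ("/", 500),
    ("/products", 200), ("/login", 200),
]

def distribution(values):
    """Return [(value, count, percent)] sorted by descending count."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(v, c, round(100 * c / total, 2)) for v, c in counts.most_common()]

# Status distribution: group by status, count, convert to a percentage.
status_dist = distribution(status for _, status in records)

# Top pages: group by path only, then keep the most frequent entries.
top_pages = Counter(path for path, _ in records).most_common(2)

print(status_dist)
print(top_pages)
```

Comparing such a hand-computed result with the Spark output on the same small sample is a quick way to catch a wrong grouping key or a percentage computed against the wrong total.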

In [130]:
# TODO: Requests per hour
# Add an hour column and group by it
metric_hour = logs_clean.withColumn(
    "hour",
    F.hour(F.col("timestamp")),
).groupBy("hour").agg(F.count("value").alias("hour_total")).orderBy("hour")
metric_hour.show()

+----+----------+
|hour|hour_total|
+----+----------+
|   0|      2114|
|   1|      2031|
|   2|      2114|
|   3|      2085|
|   4|      2106|
|   5|      2070|
|   6|      2034|
|   7|      2108|
|   8|      2187|
|   9|      2076|
|  10|      2008|
|  11|      2102|
|  12|      2077|
|  13|      2077|
|  14|      2019|
|  15|      2108|
|  16|      2112|
|  17|      2069|
|  18|      2089|
|  19|      2069|
+----+----------+
only showing top 20 rows



In [144]:
# TODO: Distribution of HTTP status codes
# Group by status, count, and compute the percentage
metric_status = (
    logs_clean
    .groupBy("status")
    .agg(F.count("*").alias("status_total"))
    .withColumn(
        "status_percent",
        F.col("status_total") / F.sum("status_total").over(Window.partitionBy()) * 100
    )
    .orderBy(F.col("status_total").desc())
)
metric_status.show(truncate=False)

+------+------------+------------------+
|status|status_total|status_percent    |
+------+------------+------------------+
|200   |34345       |68.69             |
|404   |2810        |5.62              |
|302   |2428        |4.856             |
|301   |2391        |4.782             |
|201   |2389        |4.7780000000000005|
|401   |1831        |3.662             |
|403   |1420        |2.8400000000000003|
|400   |1403        |2.806             |
|500   |983         |1.966             |
+------+------------+------------------+



In [126]:
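# TODO: Top 10 most visited pages
# Note: this ranks (referer, path) combinations; grouping by "path" alone would rank pages regardless of referer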
metric_path = (
    logs_clean
    .withColumn("referer_path", F.concat_ws("", F.col("referer"), F.col("path")))
    .groupBy("referer_path")
    .agg(F.count("value").alias("referer_path_total"))
    .orderBy(F.col("referer_path_total").desc())
    .limit(10)
)
metric_path.show(truncate=False)

+-------------------------------+------------------+
|referer_path                   |referer_path_total|
+-------------------------------+------------------+
|https://example.com/.env       |894               |
|UNKNOWN/wp-admin               |890               |
|UNKNOWN/.env                   |889               |
|https://www.google.com/.env    |887               |
|UNKNOWN/admin                  |885               |
|UNKNOWN/login                  |884               |
|https://www.google.com/login   |881               |
|https://example.com/admin      |867               |
|https://example.com/login      |860               |
|https://www.google.com/wp-admin|857               |
+-------------------------------+------------------+



In [127]:
# TODO: Top 10 most active IPs
metric_ip = logs_clean.groupBy("ip").agg(F.count("value").alias("ip_total")).orderBy(F.col("ip_total").desc()).limit(10)
metric_ip.show()

+------------+--------+
|          ip|ip_total|
+------------+--------+
|192.168.1.11|     690|
|   10.0.0.26|     678|
| 192.168.1.9|     678|
|   10.0.0.24|     668|
|192.168.1.10|     662|
|192.168.1.23|     662|
|   10.0.0.29|     661|
|   10.0.0.17|     658|
|    10.0.0.3|     654|
|192.168.1.39|     650|
+------------+--------+



In [128]:
# TODO: Breakdown by HTTP method
metric_method = (
    logs_clean
    .groupBy("method")
    .agg(F.count("value").alias("method_total"))
    .orderBy(F.col("method_total").desc())
    )
metric_method.show()

+------+------------+
|method|method_total|
+------+------------+
|   GET|       40045|
|  POST|        7468|
|   PUT|        1456|
|DELETE|        1031|
+------+------------+



---
## Part 4: Anomaly detection (45 min)

### Tasks:
1. Suspicious IPs (many requests, many errors)
2. Admin or malicious access attempts
3. Abnormal traffic spikes
4. Frequent 5xx errors
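
One detail worth pinning down for the first task is the unit of the error rate: a threshold of 50 only makes sense if the rate is a percentage, and 0.5 if it is a fraction. The sketch below spells this out in plain Python. `error_rate`, `is_suspicious`, the default thresholds, and the sample values are all assumptions chosen for illustration.

```python
def error_rate(total, errors):
    """Fraction of requests that returned a 4xx/5xx status, between 0 and 1."""
    return errors / total if total else 0.0

def is_suspicious(total, errors, max_requests=500, max_error_rate=0.5):
    """Flag an IP on request volume or on its error fraction (hypothetical thresholds)."""
    return total > max_requests or error_rate(total, errors) > max_error_rate

# (ip label, total requests, error count): hypothetical values for illustration
stats = [
    ("ip-a", 120, 10),   # low volume, few errors
    ("ip-b", 80, 80),    # every request fails
    ("ip-c", 900, 45),   # high volume
]
flagged = [ip for ip, total, errors in stats if is_suspicious(total, errors)]
print(flagged)
```

Keeping the rate as a fraction and naming the threshold accordingly avoids mixing the two conventions between the computed column and the filter.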

In [143]:
# TODO: Identify suspicious IPs
# Criteria:
# - More than 500 requests
# - Or more than 50% errors (4xx/5xx)
ip_stats = (
    logs_clean
    .groupBy("ip")
    .agg(
        F.count("*").alias("ip_total"),
        F.sum(F.when(F.col("status") >= 400, 1).otherwise(0)).alias("ip_error_total")
    )
    .withColumn(
        "error_rate",  # fraction between 0 and 1
        F.col("ip_error_total") / F.col("ip_total")
    )
)
suspicious_ips = (
    ip_stats
    .filter(
        (F.col("ip_total") > 500) |
        (F.col("error_rate") > 0.5)
    )
)
suspicious_ips.show(truncate=False)

+-------------+--------+--------------+-------------------+
|ip           |ip_total|ip_error_total|error_rate         |
+-------------+--------+--------------+-------------------+
|192.168.1.10 |662     |106           |0.16012084592145015|
|192.168.1.18 |625     |91            |0.1456             |
|10.0.0.9     |644     |95            |0.14751552795031056|
|192.168.1.34 |632     |89            |0.14082278481012658|
|192.168.1.46 |633     |97            |0.15323854660347552|
|192.168.1.15 |636     |112           |0.1761006289308176 |
|192.168.1.42 |631     |98            |0.15530903328050713|
|10.0.0.11    |627     |90            |0.14354066985645933|
|192.168.1.14 |608     |96            |0.15789473684210525|
|192.168.1.2  |607     |82            |0.13509060955518945|
|192.168.1.25 |611     |118           |0.19312602291325695|
|192.168.1.22 |619     |85            |0.13731825525040386|
|10.0.0.29    |661     |83            |0.12556732223903178|
|198.51.100.23|646     |646           |1

In [145]:
# TODO: Malicious access attempts
# Suspicious paths: /admin, /.env, /wp-admin, /phpmyadmin
logs_clean.groupBy("path").agg(F.count("*").alias("path_total")).filter(F.col("path").isin("/admin", "/.env", "/wp-admin", "/phpmyadmin")).show()

+-----------+----------+
|       path|path_total|
+-----------+----------+
|     /admin|      2572|
|/phpmyadmin|      2299|
|      /.env|      2670|
|  /wp-admin|      2563|
+-----------+----------+



In [146]:
# TODO: Detect traffic spikes
# Use a window function to compute the moving average
# Flag hours whose traffic is more than 2 standard deviations above the average
traffic_by_hour = (
    logs_clean
    .withColumn("date_hour", F.date_trunc("hour", F.col("timestamp")))
    .groupBy("date_hour")
    .agg(F.count("*").alias("traffic_total"))
)
window_spec = (
    Window
    .orderBy("date_hour")
    .rowsBetween(-6, 0)
)
traffic_stats = (
    traffic_by_hour
    .withColumn("traffic_avg", F.avg("traffic_total").over(window_spec))
    .withColumn("traffic_stddev", F.stddev("traffic_total").over(window_spec))
)
traffic_peaks = (
    traffic_stats
    .filter(
        (F.col("traffic_stddev").isNotNull()) &
        (F.col("traffic_stddev") > 0) &
        (F.col("traffic_total") - F.col("traffic_avg") > 2 * F.col("traffic_stddev"))
    )
)
traffic_peaks.show()

+-------------------+-------------+-----------------+------------------+
|          date_hour|traffic_total|      traffic_avg|    traffic_stddev|
+-------------------+-------------+-----------------+------------------+
|2024-01-15 08:00:00|          757|695.5714285714286|27.903746121606087|
+-------------------+-------------+-----------------+------------------+
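
The flagged row can be checked by hand: 757 exceeds the trailing average plus two standard deviations, 695.57 + 2 * 27.90 ≈ 751.38. The sketch below applies the same rule in plain Python with a trailing 7-value window. `trailing_peaks` is a hypothetical helper and the `counts` series is made up for illustration. `statistics.stdev` is the sample standard deviation, which is also what Spark's `F.stddev` computes.

```python
from statistics import mean, stdev

def trailing_peaks(counts, window=7, k=2.0):
    """Return indexes whose count exceeds mean + k * stdev of the trailing window.

    The window ends at the current value (inclusive), mirroring rowsBetween(-6, 0).
    """
    peaks = []
    for i, value in enumerate(counts):
        win = counts[max(0, i - window + 1): i + 1]
        if len(win) < 2:
            continue  # stdev needs at least two points
        avg, std = mean(win), stdev(win)
        if std > 0 and value - avg > k * std:
            peaks.append(i)
    return peaks

# Check the row reported above against its own statistics.
total, avg, std = 757, 695.5714285714286, 27.903746121606087
print(total - avg > 2 * std)

# Hypothetical hourly counts with one obvious spike at index 7.
counts = [700, 702, 698, 701, 699, 700, 703, 900, 701]
print(trailing_peaks(counts))
```

Because the current hour is part of its own window, its deviation can never exceed (n - 1) / sqrt(n) sample standard deviations, about 2.27 for n = 7, so a threshold of 2 flags only extreme hours. Excluding the current row from the window, for example with `rowsBetween(-7, -1)`, is one way to lift that cap.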



In [119]:
# TODO: Analyze 5xx errors
# Group by path and count the 5xx errors
# Identify the problematic endpoints
logs_clean.groupBy(["path", "status"]).agg(F.count("status").alias("alias_total")).filter(F.col("status") >= 500).orderBy(F.col("alias_total").desc()).show()

+--------------------+------+-----------+
|                path|status|alias_total|
+--------------------+------+-----------+
|             /logout|   500|         58|
|              /login|   500|         54|
|/static/css/style...|   500|         54|
|          /api/users|   500|         54|
|              /about|   500|         52|
|           /register|   500|         49|
|               /.env|   500|         48|
|           /wp-admin|   500|         48|
|        /admin/users|   500|         47|
|         /phpmyadmin|   500|         47|
|         /index.html|   500|         47|
|       /api/products|   500|         46|
|         /products/2|   500|         45|
|         /products/1|   500|         44|
|         /api/orders|   500|         44|
|         /products/3|   500|         44|
|            /contact|   500|         42|
|                   /|   500|         41|
|              /admin|   500|         40|
|           /products|   500|         40|
+--------------------+------+-----

---
## Part 5: Export and optimization (30 min)

### Tasks:
1. Save the cleaned logs as Parquet partitioned by date
2. Save the anomaly reports
3. Optimize the partitioning
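
The choice of partition column determines how many directories the export produces: Spark writes one directory per distinct value, named `column=value`. The sketch below, in plain Python, counts the directories that two candidate keys would create for timestamps spread over three days. The `timestamps` sample and the `partition_dirs` helper are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical sample: one timestamp every 6 hours over 3 days from 2024-01-15.
base = datetime(2024, 1, 15)
timestamps = [base + timedelta(hours=6 * i) for i in range(12)]

def partition_dirs(timestamps, key):
    """Return the sorted Hive-style directory names produced by a partition key."""
    return sorted({key(ts) for ts in timestamps})

by_month = partition_dirs(timestamps, lambda ts: f"year={ts.year}/month={ts.month}")
by_date = partition_dirs(timestamps, lambda ts: f"date={ts.date().isoformat()}")

print(by_month)
print(by_date)
```

With data covering only a few days of a single month, a year/month key puts every row in one directory, while a date key yields one directory per day and lets queries that filter on a day skip the others.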

In [None]:
# TODO: Add a date column for partitioning
# Save as partitioned Parquet
df_export = logs_clean.withColumn("date", F.to_date(F.col("timestamp")))
df_export.write.partitionBy("date").mode("overwrite").parquet("/data/logs_clean")

In [131]:
# TODO: Build a summary report
# Combine the metrics into one DataFrame: typed NULLs pad missing columns, unionByName aligns by name
report_hour = (
    metric_hour.withColumn("status", F.lit(None).cast("int"))
    .withColumn("status_total", F.lit(None).cast("long"))
    .withColumn("status_total_%", F.lit(None).cast("double"))
)
report_status = (
    metric_status.withColumnRenamed("status_percent", "status_total_%")
    .withColumn("hour", F.lit(None).cast("int"))
    .withColumn("hour_total", F.lit(None).cast("long"))
)
rapport = report_hour.unionByName(report_status).dropDuplicates()
rapport.show()

+----+----------+------+------------+--------------+
|hour|hour_total|status|status_total|status_total_%|
+----+----------+------+------------+--------------+
|  22|      2053|  NULL|        NULL|          NULL|
|  13|      2077|  NULL|        NULL|          NULL|
|  18|      2089|  NULL|        NULL|          NULL|
|   8|      2187|  NULL|        NULL|          NULL|
|   7|      2108|  NULL|        NULL|          NULL|
|   1|      2031|  NULL|        NULL|          NULL|
|  17|      2069|  NULL|        NULL|          NULL|
|   4|      2106|  NULL|        NULL|          NULL|
|  12|      2077|  NULL|        NULL|          NULL|
|  14|      2019|  NULL|        NULL|          NULL|
|  20|      2016|  NULL|        NULL|          NULL|
|   9|      2076|  NULL|        NULL|          NULL|
|   5|      2070|  NULL|        NULL|          NULL|
|  21|      2192|  NULL|        NULL|          NULL|
|   0|      2114|  NULL|        NULL|          NULL|
|  15|      2108|  NULL|        NULL|         

In [None]:
# Cleanup
spark.stop()
print("Lab complete!")