# Spark Setup & Configuration

**Objective:** Verify that the PySpark environment works correctly.

---

## 1. Python Check

In [1]:
import sys
import os

print(f"Python version: {sys.version}")
print(f"Working directory: {os.getcwd()}")

Python version: 3.11.14 (main, Feb  4 2026, 20:30:53) [GCC 14.2.0]
Working directory: /app
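
The version printed above can also be checked programmatically. The sketch below assumes a minimum of Python 3.8, the floor documented for PySpark 3.5. The helper name `python_is_supported` is hypothetical, and the threshold should be adjusted to the PySpark release actually in use.

```python
import sys

# Assumption: PySpark 3.5 documents Python 3.8 as its minimum version.
MIN_PYTHON = (3, 8)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) pair meets the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


# Fail early with a readable message instead of a later import error.
if not python_is_supported():
    raise RuntimeError(
        f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ expected, "
        f"found {sys.version.split()[0]}"
    )
```

Comparing only the `(major, minor)` pair keeps the check independent of patch releases.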


## 2. Test PySpark

In [2]:
from pyspark.sql import SparkSession
import pyspark

print(f"✅ PySpark version: {pyspark.__version__}")

✅ PySpark version: 3.5.0
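
Importing `pyspark` only shows that the Python package is installed. Starting a session also needs a Java runtime. A minimal sketch of a pre-check follows, assuming Java is located through `JAVA_HOME` or the `PATH`. The helper `find_java` is hypothetical and not part of PySpark.

```python
import os
import shutil


def find_java():
    """Return the path of a java executable, or None if none is found."""
    java_home = os.environ.get("JAVA_HOME")
    if java_home:
        candidate = os.path.join(java_home, "bin", "java")
        if os.path.isfile(candidate):
            return candidate
    # Fall back to whatever `java` resolves to on PATH.
    return shutil.which("java")


java_path = find_java()
print(f"Java executable: {java_path or 'not found'}")
```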


## 3. Spark Session Initialization

In [3]:
# Create the Spark Session
spark = SparkSession.builder \
    .appName("MSPR Big Data - Electoral Prediction") \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.memory", "2g") \
    .getOrCreate()

print("✅ Spark Session created successfully")
print(f"Spark version: {spark.version}")
print(f"Spark UI: {spark.sparkContext.uiWebUrl}")

/usr/local/lib/python3.11/site-packages/pyspark/bin/load-spark-env.sh: line 68: ps: command not found
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/02/10 10:06:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


✅ Spark Session created successfully
Spark version: 3.5.0
Spark UI: http://d685e6f9a8cd:4040
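
The two `.config(...)` calls above can also be driven from a dictionary, which keeps the settings in one place as more options are added. This is a sketch under assumptions: `SPARK_CONF` and `build_session` are hypothetical names, and the values simply mirror the cell above.

```python
# Hypothetical settings table mirroring the session created above.
SPARK_CONF = {
    "spark.driver.memory": "2g",
    "spark.executor.memory": "2g",
}


def build_session(app_name, conf):
    """Create (or reuse) a SparkSession configured from a dict."""
    # Imported lazily so this module can be loaded without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

A call such as `build_session("MSPR Big Data - Electoral Prediction", SPARK_CONF)` should behave like the original cell. Because `getOrCreate()` returns an already running session when one exists, memory settings are generally best applied before the first session starts in a kernel.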


## 4. Test Spark DataFrame

In [4]:
# Test with Petite Couronne data
data = [
    ("Paris", 75, 2161000),
    ("Nanterre", 92, 96016),
    ("Bobigny", 93, 54364),
    ("Créteil", 94, 92265)
]

columns = ["ville", "departement", "population"]

df = spark.createDataFrame(data, columns)

print("\n📊 Test Spark DataFrame:")
df.show()

print("\n📈 Statistics:")
df.describe().show()


📊 Test Spark DataFrame:
+--------+-----------+----------+
|   ville|departement|population|
+--------+-----------+----------+
|   Paris|         75|   2161000|
|Nanterre|         92|     96016|
| Bobigny|         93|     54364|
| Créteil|         94|     92265|
+--------+-----------+----------+


📈 Statistics:
+-------+-------+-----------------+------------------+
|summary|  ville|      departement|        population|
+-------+-------+-----------------+------------------+
|  count|      4|                4|                 4|
|   mean|   NULL|             88.5|         600911.25|
| stddev|   NULL|9.036961141150638|1040229.3057255453|
|    min|Bobigny|               75|             54364|
|    max|  Paris|               94|           2161000|
+-------+-------+-----------------+------------------+
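
The `describe()` figures can be cross-checked without Spark. The sketch below recomputes the mean and standard deviation with Python's `statistics` module, on the assumption that Spark's `stddev` is the sample standard deviation (dividing by n - 1), which is what `statistics.stdev` computes.

```python
import math
import statistics

# Same values as the test DataFrame above.
population = [2161000, 96016, 54364, 92265]
departement = [75, 92, 93, 94]

pop_mean = statistics.mean(population)
pop_stddev = statistics.stdev(population)   # sample standard deviation (n - 1)
dep_stddev = statistics.stdev(departement)  # sample standard deviation (n - 1)

print(f"population mean:    {pop_mean}")
print(f"population stddev:  {pop_stddev}")
print(f"departement stddev: {dep_stddev}")

# Compare against the values reported by df.describe(), within float tolerance.
print(math.isclose(pop_stddev, 1040229.3057255453, rel_tol=1e-9))
print(math.isclose(dep_stddev, 9.036961141150638, rel_tol=1e-9))
```

Two details of the table are worth keeping in mind: `min` and `max` for `ville` are lexicographic string comparisons, and the mean of `departement` has no real meaning, since department codes are labels rather than quantities.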



## 5. ML & Viz Library Check

In [5]:
# Machine Learning
import sklearn
print(f"✅ Scikit-learn: {sklearn.__version__}")

# Visualisation
import matplotlib
import seaborn as sns
import plotly

print(f"✅ Matplotlib: {matplotlib.__version__}")
print(f"✅ Seaborn: {sns.__version__}")
print(f"✅ Plotly: {plotly.__version__}")

✅ Scikit-learn: 1.3.2
✅ Matplotlib: 3.8.2
✅ Seaborn: 0.13.0
✅ Plotly: 5.18.0
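
As an alternative to importing each library, installed versions can be read from package metadata, which reports a missing package instead of stopping the cell on the first `ImportError`. This is a sketch: `REQUIRED` and `installed_versions` are hypothetical names, and the list assumes the PyPI distribution names (for example `scikit-learn` rather than `sklearn`).

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as published on PyPI (assumed to match the installed ones).
REQUIRED = ["pyspark", "scikit-learn", "matplotlib", "seaborn", "plotly"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    status = "✅" if ver else "❌"
    print(f"{status} {name}: {ver or 'not installed'}")
```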


## 6. Directory Structure Check

In [6]:
directories = [
    'data/raw',
    'data/processed',
    'data/output',
    'outputs/figures'
]

print("📁 Directory structure:")
for dir_path in directories:
    full_path = f"/app/{dir_path}"
    exists = "✅" if os.path.exists(full_path) else "❌"
    print(f"{exists} {dir_path}")

📁 Directory structure:
✅ data/raw
✅ data/processed
✅ data/output
✅ outputs/figures
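
If a directory is reported as missing, it can be created rather than only flagged. A minimal sketch using `pathlib` follows. The helper `ensure_directories` is hypothetical, and it assumes the process has write access to the base path.

```python
from pathlib import Path


def ensure_directories(base, relative_paths):
    """Create any missing directories under `base`; return the ones created."""
    created = []
    for rel in relative_paths:
        path = Path(base) / rel
        if not path.is_dir():
            # parents=True also creates intermediate folders such as data/.
            path.mkdir(parents=True, exist_ok=True)
            created.append(rel)
    return created
```

Calling `ensure_directories("/app", directories)` with the list from the cell above would return an empty list when the structure is already complete.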


## 7. Closing the Spark Session

In [7]:
spark.stop()
print("✅ Spark Session closed")

✅ Spark Session closed
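
In scripts, as opposed to notebooks, an exception raised before `spark.stop()` would leave the session running. One way to guard against that is a small context manager that always calls `stop()`. The helper `stopping` below is a hypothetical sketch and works with any object that exposes a `stop()` method.

```python
from contextlib import contextmanager


@contextmanager
def stopping(session):
    """Yield `session` and always call its stop() method on exit."""
    try:
        yield session
    finally:
        # Runs whether the with-block finished normally or raised.
        session.stop()
```

A typical use would be `with stopping(SparkSession.builder.getOrCreate()) as spark:` followed by the job's code in the indented block.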


---

## ✅ Environment validated!

If every cell runs without errors, continue to the next notebook: **01_data_download.ipynb**