# Set up the module so external host instances can import it


The `pip show vacances-etl` command in cell 1 checks whether the required package is installed in the current Python environment. Running pip through `sys.executable` targets the interpreter behind the notebook kernel, and the `Requires` field in the output lists the dependencies the ETL workflow needs.

The package contains custom ETL functions that handle data extraction from CSV files, transformation according to business rules, and loading into our PostgreSQL database.

In [1]:
# Check if the vacances-etl package is installed
import sys  # needed for sys.executable below
!{sys.executable} -m pip show vacances-etl


Name: vacances-etl
Version: 0.1.0
Summary: ETL pipeline for French and neighbouring school holidays
Home-page: 
Author: 
Author-email: Your Name <you@example.com>
License: 
Location: /home/cliuser/.local/lib/python3.10/site-packages
Editable project location: /home/cliuser/downloads/exo-partners/ETL_vacantions_France_and_neighboors/srcs
Requires: pandas, psycopg2-binary, sqlalchemy
Required-by: 


## Environment Setup and Package Verification

1. **Environment Setup**: Checks that the required `vacances-etl` package is installed, then configures the Python path and the database connection parameters.

The Python path configuration ensures that custom packages can be located and imported. The database configuration is passed through environment variables, which keeps the connection details out of the ETL code itself.

In [2]:
# ➊ Ensure user-site is on sys.path and verify package installation
# sys.path is a list of directories that Python searches for modules
# This script checks if the user site-packages directory is in sys.path
# and adds it if not, then checks for the installation of the vacances-etl package.
import site, sys

# Get user site-packages directory
u_site = site.getusersitepackages()
print(f"User site-packages directory: {u_site}")

# Add to path if not already there
if u_site not in sys.path:
    sys.path.insert(0, u_site)
    print(f"Added {u_site} to sys.path")

# Check if vacances-etl package is installed
try:
    import vacances_etl
    print(f"✅ vacances_etl package is installed at: {vacances_etl.__file__}")
except ImportError:
    print("❌ vacances_etl package is not installed!")
    
# Display Python version and environment info
print(f"Python version: {sys.version}")
print(f"sys.path contains {len(sys.path)} entries")

User site-packages directory: /home/cliuser/.local/lib/python3.10/site-packages
❌ vacances_etl package is not installed!
Python version: 3.10.12 (main, Feb  4 2025, 14:57:36) [GCC 11.4.0]
sys.path contains 9 entries


In [3]:
import pathlib
# Check whether the user site-packages path is on sys.path
wheel_path = pathlib.Path('/home/cliuser/.local/lib/python3.10/site-packages')
print(wheel_path in map(pathlib.Path, sys.path))  # must be True

True


In [4]:
import os
import sys
# Set the connection parameters before importing the module so it sees them
os.environ['DB_HOST'] = 'postgres'  # e.g. 'db', '192.168.99.100', or your cloud endpoint
os.environ['DB_PORT'] = '5432'      # change if your Postgres listens elsewhere
os.environ['POSTGRES_USER'] = 'jvalenci'
os.environ['POSTGRES_PASSWORD'] = 'mysecretpassword'
os.environ['POSTGRES_DB'] = 'piscineds'

# Make modules in the current working directory importable
sys.path.append(os.getcwd())

# Import after setting the env vars so that get_engine picks them up
from t_vacances_etl import *
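
The notebook does not show how `get_engine` consumes these variables. As an illustration only, here is a sketch of how a connection URL could be assembled from them. `build_db_url` is a hypothetical helper, and the defaults for host and port are assumptions rather than the module's actual behaviour.

```python
import os
from urllib.parse import quote_plus


def build_db_url(env=None):
    """Assemble a SQLAlchemy-style PostgreSQL URL from environment variables.

    Hypothetical helper: the variable names match the cell above, but the
    real get_engine() may read them differently.
    """
    env = os.environ if env is None else env
    # Percent-encode credentials so characters like '@' or ':' do not break the URL
    user = quote_plus(env["POSTGRES_USER"])
    password = quote_plus(env["POSTGRES_PASSWORD"])
    host = env.get("DB_HOST", "localhost")  # assumed default
    port = env.get("DB_PORT", "5432")       # assumed default
    database = env["POSTGRES_DB"]
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"


example_env = {
    "DB_HOST": "postgres",
    "DB_PORT": "5432",
    "POSTGRES_USER": "jvalenci",
    "POSTGRES_PASSWORD": "mysecretpassword",
    "POSTGRES_DB": "piscineds",
}
print(build_db_url(example_env))
```

A URL of this shape can be passed to `sqlalchemy.create_engine`. Outside of a throwaway notebook, reading the password from a prompt or a secrets store is preferable to writing it into a cell.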

### 2. Verify DB connection

In [5]:
from sqlalchemy import text

# sanity‐check
engine = get_engine()
with engine.connect() as conn:
    print("✅ Connected:", conn.execute(text("SELECT 1")).scalar())

✅ Connected: 1


### 3. Run the ETL pipeline

#### Check the current working directory

In [6]:
# Print the current working directory
import os
print("cwd =", os.getcwd())


cwd = /home/cliuser


In [7]:
# Change the current working directory to the project folder
os.chdir("/home/cliuser/downloads/exo-partners/ETL_vacantions_France_and_neighboors/")

In [8]:
from pathlib import Path
csv = Path("fr-en-calendrier-scolaire (1).csv").resolve()
print(csv, "exists:", csv.exists())


/home/cliuser/downloads/exo-partners/ETL_vacantions_France_and_neighboors/fr-en-calendrier-scolaire (1).csv exists: True
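
The existence check above covers a single file. A small sketch of how the same check could be applied to every input before calling `run_etl`, so that a missing file is reported up front. `missing_inputs` is a hypothetical helper and not part of the ETL module.

```python
from pathlib import Path


def missing_inputs(filenames, base_dir="."):
    """Return the names in `filenames` that are not regular files under `base_dir`."""
    base = Path(base_dir)
    return [name for name in filenames if not (base / name).is_file()]


# File names taken from the notebook; paths are resolved against the current directory
inputs = [
    "fr-en-calendrier-scolaire (1).csv",
    "fr-en-calendrier-scolaire-remaining.csv",
]
missing = missing_inputs(inputs)
if missing:
    print("Missing input files:", missing)
else:
    print("All input files found.")
```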


In [9]:
run_etl([
    "fr-en-calendrier-scolaire (1).csv",
    "fr-en-calendrier-scolaire-remaining.csv"
])

✅ ETL complete – rows: 2396


### 4. Validate output table

In [10]:
import pandas as pd

df = pd.read_sql_table('t_vacances', engine)

# Optional: keep only the dates flagged as holidays in Luxembourg
# df = df[df['lux'] == 1]

df

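| | date | all | bel | esp | fr_corse | fr_zone_a | fr_zone_b | fr_zone_c | ita | lux | sui |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009-10-25 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2009-10-26 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2009-10-27 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2009-10-28 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2009-10-29 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2391 | 2026-09-08 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2392 | 2026-09-09 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2393 | 2026-09-10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2394 | 2026-09-11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |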
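
Eyeballing the table only goes so far. Below is a sketch of two automated checks one might run on the loaded DataFrame. It assumes, based on the displayed output rather than on any documented schema, that each date should appear once and that every other column is a 0/1 flag. `check_holiday_table` is a hypothetical helper.

```python
import pandas as pd


def check_holiday_table(df, date_col="date"):
    """Return a list of problems found in the holiday table (empty if none).

    Assumptions (inferred from the displayed output, not from a schema):
    one row per date, and all columns other than `date_col` hold 0 or 1.
    """
    problems = []
    if df[date_col].duplicated().any():
        problems.append("duplicate dates")
    flag_cols = [c for c in df.columns if c != date_col]
    non_binary = [c for c in flag_cols if not df[c].isin([0, 1]).all()]
    if non_binary:
        problems.append("non-binary values in: " + ", ".join(non_binary))
    return problems


# Small in-memory sample shaped like the table above
sample = pd.DataFrame(
    {"date": ["2009-10-25", "2009-10-26"], "fr_corse": [1, 1], "lux": [0, 0]}
)
print(check_holiday_table(sample))
```

In the notebook, the same function could be called as `check_holiday_table(df)` right after `pd.read_sql_table`.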


### 5. Clean up (optional)

In [11]:
from sqlalchemy import text

# drop staging table if desired
with engine.begin() as conn:
    conn.execute(text('DROP TABLE IF EXISTS staging_vacances'))
    print('Staging table dropped.')

Staging table dropped.
