# Transport data analysis

## Objectives

This notebook explores and analyses a history of transport tickets:
* analyse trip prices and durations
* compare transport modes by distance
* identify trends and inconsistencies in the data
* propose relevant analyses and visualisations

## Tools and libraries used
* Python
* pandas
* numpy
* matplotlib
* scikit-learn
* pyarrow
* pyjanitor
* ipykernel
* jupyter
* seaborn
* polars
* plotly
* folium
* missingno
* great_expectations

## Author
Haja Rabemananjara

## Date
17 January 2026

## 1. Importing libraries and configuring the global appearance of plots

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
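from plotly.subplots import make_subplots
from IPython.display import display, Markdown
import warnings

# Silence library warnings to keep cell outputs readable (this also hides deprecation notices)
warnings.filterwarnings('ignore')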

pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)
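
The options above only affect how pandas prints tables; they do not style the plots. A minimal sketch of a global plot configuration through matplotlib's `rcParams` could look like the following. The specific values (figure size, font sizes, grid) are illustrative assumptions, not settings taken from this notebook.

```python
import matplotlib.pyplot as plt

# Illustrative defaults (assumed values, not taken from the notebook)
PLOT_DEFAULTS = {
    "figure.figsize": (10, 6),  # width, height in inches
    "axes.titlesize": 14,
    "axes.labelsize": 12,
    "axes.grid": True,
    "grid.alpha": 0.3,
}

# Apply the defaults to every matplotlib figure created afterwards in this session
plt.rcParams.update(PLOT_DEFAULTS)
```

Plotly figures are not affected by `rcParams`; they would need their own template setting.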

## 2. Loading the data

In [223]:
ticket = pd.read_csv('data/ticket_data.csv')
cities = pd.read_csv('data/cities.csv')
stations = pd.read_csv('data/stations.csv')
providers = pd.read_csv('data/providers.csv')

### Data overview

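In [224]:
display(Markdown("Dataset dimensions:"))
# A dict keeps each shape attached to its dataset name (a set has no guaranteed order)
{
    'ticket': ticket.shape,
    'cities': cities.shape,
    'stations': stations.shape,
    'providers': providers.shape,
}

Dataset dimensions:


{'ticket': (74168, 12),
 'cities': (8040, 6),
 'stations': (11035, 4),
 'providers': (227, 10)}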

In [225]:
from IPython.display import display, Markdown

# DataFrame.info() prints its summary directly, so it needs no display() call
display(Markdown("### Ticket info"))
ticket.info()

display(Markdown("### City info"))
cities.info()

display(Markdown("### Station info"))
stations.info()

display(Markdown("### Provider info"))
providers.info()

### Ticket info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74168 entries, 0 to 74167
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               74168 non-null  int64  
 1   company          74168 non-null  int64  
 2   o_station        32727 non-null  float64
 3   d_station        32727 non-null  float64
 4   departure_ts     74168 non-null  object 
 5   arrival_ts       74168 non-null  object 
 6   price_in_cents   74168 non-null  int64  
 7   search_ts        74168 non-null  object 
 8   middle_stations  32727 non-null  object 
 9   other_companies  32727 non-null  object 
 10  o_city           74168 non-null  int64  
 11  d_city           74168 non-null  int64  
dtypes: float64(2), int64(5), object(5)
memory usage: 6.8+ MB


### City info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8040 entries, 0 to 8039
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           8040 non-null   int64  
 1   local_name   8040 non-null   object 
 2   unique_name  8039 non-null   object 
 3   latitude     8040 non-null   float64
 4   longitude    8040 non-null   float64
 5   population   369 non-null    float64
dtypes: float64(3), int64(1), object(2)
memory usage: 377.0+ KB


### Station info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11035 entries, 0 to 11034
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           11035 non-null  int64  
 1   unique_name  11035 non-null  object 
 2   latitude     11035 non-null  float64
 3   longitude    11035 non-null  float64
dtypes: float64(2), int64(1), object(1)
memory usage: 345.0+ KB


### Provider info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 227 entries, 0 to 226
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   id                    227 non-null    int64 
 1   company_id            227 non-null    int64 
 2   provider_id           213 non-null    object
 3   name                  227 non-null    object
 4   fullname              227 non-null    object
 5   has_wifi              224 non-null    object
 6   has_plug              224 non-null    object
 7   has_adjustable_seats  224 non-null    object
 8   has_bicycle           224 non-null    object
 9   transport_type        227 non-null    object
dtypes: int64(2), object(8)
memory usage: 17.9+ KB


In [226]:
display(Markdown("### Statistical description of the tickets"))
# Most numeric columns here are identifiers; only price_in_cents carries meaningful statistics
ticket.describe()

### Statistical description of the tickets


| | id | company | o_station | d_station | price_in_cents | o_city | d_city |
|---|---|---|---|---|---|---|---|
| count | 74200.0 | 74168.0 | 32727.0 | 32727.0 | 74168.0 | 74168.0 | 74168.0 |
| mean | 6830000.0 | 7109.57 | 2907.13 | 2347.86 | 4382.71 | 849.19 | 883.78 |
| std | 21400.0 | 3005.38 | 3347.63 | 3090.8 | 3739.33 | 1485.79 | 1654.7 |
| min | 6800000.0 | 9.0 | 3.0 | 3.0 | 300.0 | 5.0 | 1.0 |
| 25% | 6810000.0 | 8376.0 | 400.0 | 396.0 | 1900.0 | 485.0 | 453.0 |
| 50% | 6830000.0 | 8385.0 | 701.0 | 575.0 | 3350.0 | 628.0 | 562.0 |
| 75% | 6850000.0 | 8385.0 | 6246.0 | 4538.0 | 5250.0 | 628.0 | 628.0 |
| max | 6870000.0 | 8387.0 | 11017.0 | 11017.0 | 38550.0 | 12190.0 | 12190.0 |


## 3. Business understanding of the data

### Description of the key columns
* *price_in_cents*: ticket price in euro cents
* *departure_ts / arrival_ts*: departure and arrival timestamps, in the format YYYY-MM-DD HH:MM:SS
* *o_city / d_city*: origin / destination city (identifier in the cities table)
* *o_station / d_station*: origin / destination station (identifier in the stations table)
* *company*: provider identifier (e.g. TGV, TER, Blablacar)

## 4. Data cleaning

#### Missing values

In [227]:
display(Markdown('Checking for missing values in the tickets'))
ticket.isna().sum()

Checking for missing values in the tickets


id                     0
company                0
o_station          41441
d_station          41441
departure_ts           0
arrival_ts             0
price_in_cents         0
search_ts              0
middle_stations    41441
other_companies    41441
o_city                 0
d_city                 0
dtype: int64

About 56 % of the tickets (41441 / 74168) have no station information (`o_station`, `d_station`).
These probably correspond to carpooling trips, which do not go through fixed stations.
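
The carpooling hypothesis can be tested once each ticket carries its `transport_type` (available after the provider join in section 5). The sketch below illustrates the check on a few made-up rows; the sample values are hypothetical and only the column names come from the datasets.

```python
import pandas as pd

# Toy stand-in for the tickets once `transport_type` is available (hypothetical rows)
sample = pd.DataFrame({
    "transport_type": ["carpooling", "carpooling", "bus", "train", "train"],
    "o_station": [None, None, 63.0, 5905.0, 12.0],
})

# Share of rows without an origin station, per transport type
missing_share = (
    sample["o_station"].isna()
    .groupby(sample["transport_type"])
    .mean()
    .sort_values(ascending=False)
)
print(missing_share)
```

If the hypothesis holds on the real data, the share should be close to 1 for carpooling and close to 0 for the other modes.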

In [228]:
datasets = {'Ticket': ticket, 'Cities': cities, 'Stations': stations, 'Providers': providers}
# 'Missing Values' counts empty cells across all columns, not rows
missing_data = pd.DataFrame({
    'Dataset': list(datasets),
    'Total Rows': [len(df) for df in datasets.values()],
    'Missing Values': [df.isna().sum().sum() for df in datasets.values()]
})
display(Markdown("### Summary of missing values"))
missing_data

### Summary of missing values


| | Dataset | Total Rows | Missing Values |
|---|---|---|---|
| 0 | Ticket | 74168 | 165764 |
| 1 | Cities | 8040 | 7672 |
| 2 | Stations | 11035 | 0 |
| 3 | Providers | 227 | 26 |


In [229]:
display(Markdown("Breakdown of missing values in the ticket dataset:"))
missing_ticket = ticket.isna().sum()
missing_ticket_pct = (missing_ticket / len(ticket) * 100).round(2)
missing_ticket_df = pd.DataFrame({
    'Missing values': missing_ticket,
    'Percentage (%)': missing_ticket_pct
})
# Keep only the columns that actually contain missing values
missing_ticket_df = missing_ticket_df[missing_ticket_df['Missing values'] > 0]
missing_ticket_df.sort_values('Missing values', ascending=False)

Breakdown of missing values in the ticket dataset:


| | Missing values | Percentage (%) |
|---|---|---|
| o_station | 41441 | 55.87 |
| d_station | 41441 | 55.87 |
| middle_stations | 41441 | 55.87 |
| other_companies | 41441 | 55.87 |


#### Duplicates

Removing duplicates prevents repeated rows from biasing the results.

In [230]:
# Count rows before deduplicating so that the check below is meaningful
n_before = len(ticket)
ticket = ticket.drop_duplicates()
n_removed = n_before - len(ticket)

if n_removed == 0:
    print("No duplicates found.")
else:
    print(f'{n_removed} duplicates removed; shape after removal:', ticket.shape)

No duplicates found.
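
`drop_duplicates()` compares whole rows. If `id` is unique per ticket (an assumption, not verified above), two tickets that are identical in every other column would never be flagged. The sketch below shows the difference on made-up rows; which columns count as "business" columns is a choice to adapt to the real data.

```python
import pandas as pd

# Toy tickets: rows 0 and 1 describe the same offer but carry different ids (hypothetical data)
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "company": [8385, 8385, 9],
    "departure_ts": ["2017-10-13 14:00", "2017-10-13 14:00", "2017-10-13 13:05"],
    "price_in_cents": [4550, 4550, 1450],
})

# Full-row comparison: the distinct ids hide the repetition
full_dupes = sample.duplicated().sum()

# Comparison on every column except the technical id
business_cols = [c for c in sample.columns if c != "id"]
business_dupes = sample.duplicated(subset=business_cols).sum()

print(full_dupes, business_dupes)
```

Whether such rows are genuine duplicates or distinct offers with the same characteristics is a business decision, so flagging them before dropping anything is the safer first step.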


#### Conversion

In [231]:
display(Markdown("#### Converting prices to euros"))
ticket['price_eur'] = ticket['price_in_cents'] / 100
ticket[['price_in_cents', 'price_eur']].head()

#### Converting prices to euros


| | price_in_cents | price_eur |
|---|---|---|
| 0 | 4550 | 45.5 |
| 1 | 1450 | 14.5 |
| 2 | 7400 | 74.0 |
| 3 | 13500 | 135.0 |
| 4 | 7710 | 77.1 |


In [232]:
display(Markdown("#### Converting timestamps"))
# format='mixed' (pandas >= 2.0) infers the format of each value individually
for col in ['departure_ts', 'arrival_ts', 'search_ts']:
    ticket[col] = pd.to_datetime(ticket[col], format='mixed')

display(Markdown('Computing trip duration in minutes'))
ticket['duration_min'] = (ticket['arrival_ts'] - ticket['departure_ts']).dt.total_seconds() / 60
ticket[['departure_ts', 'arrival_ts', 'duration_min']].head()

#### Converting timestamps

Computing trip duration in minutes


| | departure_ts | arrival_ts | duration_min |
|---|---|---|---|
| 0 | 2017-10-13 14:00:00+00:00 | 2017-10-13 20:10:00+00:00 | 370.0 |
| 1 | 2017-10-13 13:05:00+00:00 | 2017-10-14 06:55:00+00:00 | 1070.0 |
| 2 | 2017-10-13 13:27:00+00:00 | 2017-10-14 21:24:00+00:00 | 1917.0 |
| 3 | 2017-10-13 13:27:00+00:00 | 2017-10-14 11:02:00+00:00 | 1295.0 |
| 4 | 2017-10-13 21:46:00+00:00 | 2017-10-14 19:32:00+00:00 | 1306.0 |
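
As a sanity check, the duration computation can be reproduced in isolation on the first two rows of the table above. This is a minimal sketch: the timestamps are copied from that table, while the extra columns `duration_h` and `duration_rest_min` are hypothetical helpers added for readability.

```python
import pandas as pd

# First two rows of the table above
trips = pd.DataFrame({
    "departure_ts": ["2017-10-13 14:00:00+00:00", "2017-10-13 13:05:00+00:00"],
    "arrival_ts": ["2017-10-13 20:10:00+00:00", "2017-10-14 06:55:00+00:00"],
})

# Parse as timezone-aware UTC timestamps
for col in ["departure_ts", "arrival_ts"]:
    trips[col] = pd.to_datetime(trips[col], utc=True)

# Timedelta -> minutes
trips["duration_min"] = (trips["arrival_ts"] - trips["departure_ts"]).dt.total_seconds() / 60

# Hours and remaining minutes, easier to eyeball than raw minutes
trips["duration_h"] = (trips["duration_min"] // 60).astype(int)
trips["duration_rest_min"] = (trips["duration_min"] % 60).astype(int)
print(trips[["duration_min", "duration_h", "duration_rest_min"]])
```

The two trips come out at 6 h 10 min and 17 h 50 min, matching the 370 and 1070 minutes shown above.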


In [233]:
display(Markdown('Extracting temporal features'))
# Timestamps are stored in UTC, so the hour and day below are UTC values
ticket['departure_hour'] = ticket['departure_ts'].dt.hour
ticket['departure_day'] = ticket['departure_ts'].dt.day_name()
ticket['departure_month'] = ticket['departure_ts'].dt.month_name()
ticket['is_weekend'] = ticket['departure_ts'].dt.dayofweek.isin([5, 6])  # 5 = Saturday, 6 = Sunday

Extracting temporal features
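
Because the timestamps are in UTC, `departure_hour` can differ from the hour a traveller sees on the ticket. A minimal sketch of a local-time conversion follows, assuming the trips take place in France so that `Europe/Paris` is the relevant zone; that zone is an assumption, since the data does not state it.

```python
import pandas as pd

# Departure of the first ticket shown earlier, in UTC
departures_utc = pd.Series(pd.to_datetime(["2017-10-13 14:00:00+00:00"], utc=True))

# Assumption: Europe/Paris is the local time zone of the trips
departures_local = departures_utc.dt.tz_convert("Europe/Paris")

utc_hour = departures_utc.dt.hour.iloc[0]
local_hour = departures_local.dt.hour.iloc[0]
print(utc_hour, local_hour)
```

In mid-October France is on summer time (UTC+2), so the 14:00 UTC departure corresponds to 16:00 local time. Hour-of-day analyses would shift accordingly.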


In [234]:
display(Markdown('Booking lead time in days'))
# .dt.days keeps whole days only (e.g. 12 days 13 h -> 12)
ticket['booking_ts'] = (ticket['departure_ts'] - ticket['search_ts']).dt.days

ticket[['search_ts', 'departure_ts', 'booking_ts']].head()

Booking lead time in days


| | search_ts | departure_ts | booking_ts |
|---|---|---|---|
| 0 | 2017-10-01 00:13:31.327000+00:00 | 2017-10-13 14:00:00+00:00 | 12 |
| 1 | 2017-10-01 00:13:35.773000+00:00 | 2017-10-13 13:05:00+00:00 | 12 |
| 2 | 2017-10-01 00:13:40.212000+00:00 | 2017-10-13 13:27:00+00:00 | 12 |
| 3 | 2017-10-01 00:13:40.213000+00:00 | 2017-10-13 13:27:00+00:00 | 12 |
| 4 | 2017-10-01 00:13:40.213000+00:00 | 2017-10-13 21:46:00+00:00 | 12 |
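
The `.dt.days` accessor returns only the whole-day part of the difference. If finer resolution matters, a fractional lead time can be derived from the total number of seconds. This is a minimal sketch using the first row of the table above; the variable names are illustrative.

```python
import pandas as pd

# First row of the table above
search_ts = pd.Timestamp("2017-10-01 00:13:31.327", tz="UTC")
departure_ts = pd.Timestamp("2017-10-13 14:00:00", tz="UTC")
lead = departure_ts - search_ts

whole_days = lead.days                           # whole days only
fractional_days = lead.total_seconds() / 86400   # 86400 seconds per day
print(whole_days, round(fractional_days, 2))
```

For this row the whole-day value is 12 while the fractional value is about 12.57.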


#### Outliers

Check whether the data contains outliers.
Compare the shape of the table before and after filtering them out.
Remove the outliers if necessary.

In [235]:
display(Markdown('Removing records with a negative or zero price'))
print('Before cleaning:', ticket.shape)
ticket_filter = ticket[ticket['price_in_cents'] > 0]
print('After cleaning:', ticket_filter.shape)

if ticket.shape[0] == ticket_filter.shape[0]:
    print("No records removed.")
else:
    ticket = ticket_filter

Removing records with a negative or zero price
Before cleaning: (74168, 19)
After cleaning: (74168, 19)
No records removed.


In [236]:
display(Markdown('Removing records with a negative or zero duration'))
print('Before cleaning:', ticket.shape)
ticket_filter = ticket[ticket['duration_min'] > 0]
print('After cleaning:', ticket_filter.shape)

if ticket.shape[0] == ticket_filter.shape[0]:
    print("No records removed.")
else:
    ticket = ticket_filter

Removing records with a negative or zero duration
Before cleaning: (74168, 19)
After cleaning: (74168, 19)
No records removed.
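
The two checks above only catch non-positive values. They say nothing about unusually high ones, such as the maximum price of 38550 cents against a 75th percentile of 5250 cents in the statistical description. One common complement is the interquartile-range (IQR) rule, sketched below on made-up prices. The 1.5 multiplier is the conventional choice rather than something derived from this dataset, and a flagged value is not necessarily an error.

```python
import pandas as pd

# Toy prices in euros; the last value is deliberately extreme (hypothetical data)
prices = pd.Series([14.5, 17.0, 18.0, 19.0, 21.5, 45.5, 385.5], name="price_eur")

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged for review, not deleted
outliers = prices[(prices < lower) | (prices > upper)]
print(outliers.tolist())
```

On real tickets it would make sense to apply the rule per transport type or per route, since a long train trip is legitimately more expensive than a short bus ride.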


## 5. Joining the datasets

Progressive joins (cities -> stations -> providers)

In [237]:
display(Markdown("### Joining the ticket data with the origin cities"))

cities_origin = cities[['id', 'local_name', 'latitude', 'longitude']].copy()
cities_origin = cities_origin.rename(columns={
    'id': 'city_id',
    'local_name': 'o_city_name',
    'latitude': 'o_lat',
    'longitude': 'o_lon'
})

# Left join: every ticket is kept even when its origin city is missing from the lookup
ticket = ticket.merge(cities_origin, left_on='o_city', right_on='city_id', how='left').drop(columns=['city_id'])

### Joining the ticket data with the origin cities


In [238]:
display(Markdown("### Joining the ticket data with the destination cities"))

cities_destination = cities[['id', 'local_name', 'latitude', 'longitude']].copy()
cities_destination = cities_destination.rename(columns={
    'id': 'city_id',
    'local_name': 'd_city_name',
    'latitude': 'd_city_latitude',
    'longitude': 'd_city_longitude'
})

# Left join: every ticket is kept even when its destination city is missing from the lookup
ticket = ticket.merge(cities_destination, left_on='d_city', right_on='city_id', how='left').drop(columns=['city_id'])

### Joining the ticket data with the destination cities


In [239]:
display(Markdown("### Joining the data with the origin stations"))

stations_origin = stations[['id', 'unique_name', 'latitude', 'longitude']].copy()
stations_origin = stations_origin.rename(columns={
    'id': 'station_id',
    'unique_name': 'o_station_name',
    'latitude': 'o_station_latitude',
    'longitude': 'o_station_longitude'
})

# Left join: tickets without an origin station keep NaN in the station columns
ticket = ticket.merge(stations_origin, left_on='o_station', right_on='station_id', how='left').drop(columns=['station_id'])

### Joining the data with the origin stations


In [240]:
display(Markdown("### Joining the data with the destination stations"))

stations_destination = stations[['id', 'unique_name', 'latitude', 'longitude']].copy()
stations_destination = stations_destination.rename(columns={
    'id': 'station_id',
    'unique_name': 'd_station_name',
    'latitude': 'd_station_latitude',
    'longitude': 'd_station_longitude'
})

# Left join: tickets without a destination station keep NaN in the station columns
ticket = ticket.merge(stations_destination, left_on='d_station', right_on='station_id', how='left').drop(columns=['station_id'])

### Joining the data with the destination stations
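
The four joins above share the same pattern: select columns from a lookup table, rename them, left-join on an id, then drop the helper key. One possible way to factor this out is sketched below. The helper `attach_lookup`, its parameters and the toy data are hypothetical and not part of the notebook. The `validate='many_to_one'` argument makes pandas raise an error if the lookup ids are not unique, which would otherwise silently multiply ticket rows.

```python
import pandas as pd

def attach_lookup(df, lookup, key, column_map):
    """Left-join renamed `lookup` columns onto `df`, matching df[key] to lookup['id'].

    Hypothetical helper, not part of the notebook.
    """
    renamed = lookup[['id', *column_map]].rename(columns={'id': '_lookup_id', **column_map})
    merged = df.merge(
        renamed,
        left_on=key,
        right_on='_lookup_id',
        how='left',
        validate='many_to_one',  # raises MergeError if lookup ids are not unique
    )
    return merged.drop(columns=['_lookup_id'])

# Toy data standing in for `ticket` and `cities` (ids are made up)
tickets = pd.DataFrame({'o_city': [1, 2, 1], 'd_city': [2, 1, 999]})
cities = pd.DataFrame({'id': [1, 2], 'local_name': ['Paris', 'Lille']})

tickets = attach_lookup(tickets, cities, 'o_city', {'local_name': 'o_city_name'})
tickets = attach_lookup(tickets, cities, 'd_city', {'local_name': 'd_city_name'})
print(tickets)
```

The unmatched destination id (999) keeps its row and gets a missing `d_city_name`, which mirrors the behaviour of the left joins above.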


In [241]:
display(Markdown("### Joining the data with the providers"))

provider_cols = ['id', 'name', 'transport_type', 'has_wifi', 'has_plug', 'has_bicycle']
ticket = (ticket.merge(providers[provider_cols], left_on='company', right_on='id', how='left', suffixes=('', '_provider'))
          .drop(columns=['id_provider'])
          .rename(columns={'name': 'provider_name'}))

### Joining the data with the providers
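
A left join never fails when keys do not match; it simply leaves the joined columns empty. Since the providers table has both an `id` and a `company_id` column, it is worth confirming that `id` is the right key for `ticket.company`. A minimal sketch of such a coverage check with `indicator=True` follows; the ids and provider names are made up for illustration.

```python
import pandas as pd

# Toy data: company 3 has no matching provider (hypothetical ids and names)
tickets = pd.DataFrame({'id': [10, 11, 12], 'company': [1, 2, 3]})
providers = pd.DataFrame({'id': [1, 2], 'name': ['provider_a', 'provider_b']})

checked = tickets.merge(
    providers.rename(columns={'id': 'provider_row_id', 'name': 'provider_name'}),
    left_on='company',
    right_on='provider_row_id',
    how='left',
    indicator=True,  # adds a `_merge` column: 'both' or 'left_only'
)

# Tickets whose company id found no provider
unmatched = checked[checked['_merge'] == 'left_only']
print(checked['_merge'].value_counts())
print(unmatched['company'].tolist())
```

A non-empty `unmatched` frame on the real data would point to a wrong join key or to providers missing from the reference table.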


In [242]:
display(Markdown("Final columns of the dataset:"))
print(ticket.columns.tolist())

print('Preview of the enriched data:')
ticket[['o_city_name', 'd_city_name', 'provider_name', 'transport_type', 'price_eur', 'duration_min']].head(10)

Final columns of the dataset:
['id', 'company', 'o_station', 'd_station', 'departure_ts', 'arrival_ts', 'price_in_cents', 'search_ts', 'middle_stations', 'other_companies', 'o_city', 'd_city', 'price_eur', 'duration_min', 'departure_hour', 'departure_day', 'departure_month', 'is_weekend', 'booking_ts', 'o_city_name', 'o_lat', 'o_lon', 'd_city_name', 'd_city_latitude', 'd_city_longitude', 'o_station_name', 'o_station_latitude', 'o_station_longitude', 'd_station_name', 'd_station_latitude', 'd_station_longitude', 'provider_name', 'transport_type', 'has_wifi', 'has_plug', 'has_bicycle']
Preview of the enriched data:


| | o_city_name | d_city_name | provider_name | transport_type | price_eur | duration_min |
|---|---|---|---|---|---|---|
| 0 | Orléans, Centre-Val de Loire, France | Montpellier, Occitanie, France | bbc | carpooling | 45.5 | 370.0 |
| 1 | Orléans, Centre-Val de Loire, France | Montpellier, Occitanie, France | ouibus | bus | 14.5 | 1070.0 |
| 2 | Orléans, Centre-Val de Loire, France | Montpellier, Occitanie, France | corailintercite | train | 74.0 | 1917.0 |
| 3 | Orléans, Centre-Val de Loire, France | Montpellier, Occitanie, France | corailintercite | train | 135.0 | 1295.0 |
| 4 | Orléans, Centre-Val de Loire, France | Montpellier, Occitanie, France | coraillunea | train | 77.1 | 1306.0 |
| 5 | Paris, Île-de-France, France | Lille, Hauts-de-France, France | bbc | carpooling | 18.0 | 180.0 |
| 6 | Paris, Île-de-France, France | Lille, Hauts-de-France, France | bbc | carpooling | 21.5 | 150.0 |
| 7 | Paris, Île-de-France, France | Lille, Hauts-de-France, France | bbc | carpooling | 17.0 | 150.0 |
| 8 | Paris, Île-de-France, France | Lille, Hauts-de-France, France | bbc | carpooling | 17.0 | 170.0 |
| 9 | Paris, Île-de-France, France | Lille, Hauts-de-France, France | bbc | carpooling | 19.0 | 170.0 |