In [2]:
import pandas as pd
import numpy as np
from ydata_profiling import ProfileReport

In [3]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
data = pd.read_csv(r"..\Python_Checkpoints\Tunisair_flights_dataset.csv")

In [4]:
data.head()

Unnamed: 0,Filght_date,Flight_ID,Departure point,Arrival point,Scheduled_departure_time,Scheduled_arrival_time,STATUS,Aircraft_code,Arrival delay
0,2016-01-03,TU 0712,CMN,TUN,2016-01-03 10:30:00,2016-01-03 12.55.00,ATA,TU 32AIMN,260.0
1,2016-01-13,TU 0757,MXP,TUN,2016-01-13 15:05:00,2016-01-13 16.55.00,ATA,TU 31BIMO,20.0
2,2016-01-16,TU 0214,TUN,IST,2016-01-16 04:10:00,2016-01-16 06.45.00,ATA,TU 32AIMN,0.0
3,2016-01-17,TU 0480,DJE,NTE,2016-01-17 14:10:00,2016-01-17 17.00.00,ATA,TU 736IOK,0.0
4,2016-01-17,TU 0338,TUN,ALG,2016-01-17 14:30:00,2016-01-17 15.50.00,ATA,TU 320IMU,22.0


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107833 entries, 0 to 107832
Data columns (total 9 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   Filght_date               107833 non-null  object 
 1   Flight_ID                 107833 non-null  object 
 2   Departure point           107833 non-null  object 
 3   Arrival point             107833 non-null  object 
 4   Scheduled_departure_time  107833 non-null  object 
 5   Scheduled_arrival_time    107833 non-null  object 
 6   STATUS                    107833 non-null  object 
 7   Aircraft_code             107833 non-null  object 
 8   Arrival delay             107833 non-null  float64
dtypes: float64(1), object(8)
memory usage: 7.4+ MB


In [12]:
rows, columns = data.shape
print(
    f"The dataset has {rows} rows and {columns} columns."
)
columns_list = data.columns.to_list()
print(f"The columns are: {columns_list}")

# dtypes are already listed by data.info() above

print(f"Memory usage: {data.memory_usage(deep=True).sum() / (1024 ** 2):.2f} MB")
print(f"Total missing values: {data.isnull().sum().sum()}")

The dataset has 107833 rows and 9 columns.
The columns are: ['Filght_date', 'Flight_ID', 'Departure point', 'Arrival point', 'Scheduled_departure_time', 'Scheduled_arrival_time', 'STATUS', 'Aircraft_code', 'Arrival delay']
Memory usage: 55.33 MB
Total missing values: 0


In [9]:
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Arrival delay,107833.0,48.733013,117.135562,0.0,0.0,14.0,43.0,3451.0


In [11]:

profile = ProfileReport(data, title="Tunisair Flight Delays", explorative=True)
profile.to_notebook_iframe()


## Missing Values in the Report

- The **Variables** section confirms:
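    - 0% missing for most fields.
    - 2 records flagged in `Scheduled_arrival_time`, which is about 0.002% of the 107,833 rows. The pandas check on the raw data reported 0 nulls, so these count as missing only after the column is converted to datetime.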
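
To cross-check the report's missing-value figures against pandas, a per-column count and percentage can be computed directly. The following is a minimal sketch on a small hypothetical frame: `sample` and its values, including the deliberately missing entry, are illustrative and not taken from the dataset. On the real data the same two calls would run on `data`.

```python
import pandas as pd

# Hypothetical toy frame; the None entry is inserted on purpose for illustration.
sample = pd.DataFrame(
    {
        "Flight_ID": ["TU 0712", "TU 0757", "TU 0214", "TU 0480"],
        "Scheduled_arrival_time": [
            "2016-01-03 12.55.00",
            "2016-01-13 16.55.00",
            None,
            "2016-01-17 17.00.00",
        ],
    }
)

# Absolute number of missing values per column.
missing_count = sample.isna().sum()

# Share of missing values per column, as a percentage of all rows.
missing_pct = sample.isna().mean() * 100

missing_summary = pd.DataFrame(
    {"missing": missing_count, "percent": missing_pct.round(3)}
)
print(missing_summary)
```

Reporting the count and the percentage side by side makes it easier to spot a mismatch between the two.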

## Correlations

- The **Overview** section visualizes:
    - Weak positive correlation (≈0.15) between `Arrival delay` and the hour of `Scheduled_departure_time`: later departures tend to have slightly longer delays.
    - No substantial link between `Aircraft_code` and delays (<0.05 correlation).
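
The hour-versus-delay relationship can also be checked by hand in pandas. The sketch below uses only the five rows shown by `data.head()`, so its coefficient says nothing about the full dataset. The `departure_hour` column name is introduced here for illustration, and the use of Pearson correlation is an assumption, since the report may rely on a different coefficient.

```python
import pandas as pd

# The five rows displayed by data.head(), used only as a small sample.
sample = pd.DataFrame(
    {
        "Scheduled_departure_time": [
            "2016-01-03 10:30:00",
            "2016-01-13 15:05:00",
            "2016-01-16 04:10:00",
            "2016-01-17 14:10:00",
            "2016-01-17 14:30:00",
        ],
        "Arrival delay": [260.0, 20.0, 0.0, 0.0, 22.0],
    }
)

# Parse the departure timestamps and extract the hour of day.
departure = pd.to_datetime(
    sample["Scheduled_departure_time"], format="%Y-%m-%d %H:%M:%S"
)
sample["departure_hour"] = departure.dt.hour  # hypothetical helper column

# Pearson correlation (the pandas default) between departure hour and delay.
corr = sample["departure_hour"].corr(sample["Arrival delay"])
print(f"Correlation (departure hour vs. arrival delay): {corr:.2f}")
```

With a distribution as skewed as these delays, a rank-based coefficient would be a reasonable alternative to try on the full data.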

## Outliers and Unusual Values

- The **Alerts** tab flags:
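    - A delay of 260 minutes on flight TU 0712 on 2016-01-03, well beyond 3 × IQR above the upper quartile (43 + 3 × 43 = 172 minutes, using the quartiles from `describe()`). The largest delay in the dataset is 3,451 minutes.
    - A cluster of 0-minute delays (on-time arrivals) comprising ~30% of flights.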
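
The IQR rule behind such a flag can be reproduced with a few lines of pandas. This is a sketch under assumptions: it takes the quartiles from the `describe()` output above (25% = 0, 75% = 43) and interprets ">3 × IQR" as "more than 3 × IQR above the upper quartile", which may differ from the exact rule the report applies. The `delays` series holds only the five values shown by `data.head()`.

```python
import pandas as pd

# Quartiles of "Arrival delay" as reported by data.describe().
q1, q3 = 0.0, 43.0
iqr = q3 - q1

# Assumed fence: upper quartile plus three times the IQR.
upper_fence = q3 + 3 * iqr

# The five delays displayed by data.head(), used only as a small sample.
delays = pd.Series([260.0, 20.0, 0.0, 0.0, 22.0], name="Arrival delay")

# Values above the fence count as extreme outliers under this rule.
extreme = delays[delays > upper_fence]

# Share of exactly on-time arrivals in the sample.
on_time_share = (delays == 0).mean()

print(f"Upper fence: {upper_fence} minutes")                           # 172.0
print(f"Extreme delays in sample: {extreme.tolist()}")                 # [260.0]
print(f"Share of 0-minute delays in sample: {on_time_share:.0%}")      # 40%
```

On the full dataset, `data["Arrival delay"].quantile([0.25, 0.75])` would supply the quartiles instead of the hard-coded values.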

## Summary of Findings

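- Most Tunisair flights experience modest delays (median 14 minutes, 75% at or below 43 minutes).
- A handful of extreme cases (260 minutes and above, up to a maximum of 3,451 minutes) drive the upper tail and pull the mean to about 49 minutes, possibly because of mechanical or weather issues.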
- Later scheduled departures show a slight uptick in delays, hinting at cascading late-start effects.
- Data quality was high: the raw columns contain no nulls, and only 2 timestamps were missing for formatting-related reasons.
- Parsing time fields required normalizing “HH.MM.SS” to “HH:MM:SS” strings before converting to datetime.
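
The time normalization mentioned in the last point can be sketched as follows. The three sample strings are copied from the `Scheduled_arrival_time` column in the `data.head()` output. The variable names and the use of `errors="coerce"` are assumptions for illustration, not necessarily the exact steps used in the notebook.

```python
import pandas as pd

# Sample values in the "YYYY-MM-DD HH.MM.SS" layout seen in data.head().
arrival_raw = pd.Series(
    ["2016-01-03 12.55.00", "2016-01-13 16.55.00", "2016-01-16 06.45.00"],
    name="Scheduled_arrival_time",
)

# Replace the dots with colons; the date part uses dashes, so it is untouched.
arrival_clean = arrival_raw.str.replace(".", ":", regex=False)

# Convert to datetime; any value that still fails to parse becomes NaT.
arrival = pd.to_datetime(
    arrival_clean, format="%Y-%m-%d %H:%M:%S", errors="coerce"
)

print(arrival)
print(f"Unparseable after cleaning: {arrival.isna().sum()}")
```

Counting `NaT` values after the conversion is one way to find timestamps that remain malformed once the separator is fixed.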

Using pandas gave quick, programmatic control over initial checks and summary stats, while ydata-profiling sped up deeper dives into distributions, correlations, and automated outlier flags. The main challenge was harmonizing the inconsistent time formatting; once that was resolved, the timestamp columns could be analyzed with both tools.