In [2]:
import pandas as pd
import numpy as np
from ydata_profiling import ProfileReport

In [3]:
# Loading the dataset
df = pd.read_csv("Tunisair_flights_dataset.csv")
df.head()

,Filght_date,Flight_ID,Departure point,Arrival point,Scheduled_departure_time,Scheduled_arrival_time,STATUS,Aircraft_code,Arrival delay
0,2016-01-03,TU 0712,CMN,TUN,2016-01-03 10:30:00,2016-01-03 12.55.00,ATA,TU 32AIMN,260.0
1,2016-01-13,TU 0757,MXP,TUN,2016-01-13 15:05:00,2016-01-13 16.55.00,ATA,TU 31BIMO,20.0
2,2016-01-16,TU 0214,TUN,IST,2016-01-16 04:10:00,2016-01-16 06.45.00,ATA,TU 32AIMN,0.0
3,2016-01-17,TU 0480,DJE,NTE,2016-01-17 14:10:00,2016-01-17 17.00.00,ATA,TU 736IOK,0.0
4,2016-01-17,TU 0338,TUN,ALG,2016-01-17 14:30:00,2016-01-17 15.50.00,ATA,TU 320IMU,22.0


In [4]:
# Displaying structure of the dataset
print("Dataset Info:")
print(df.info())

# Checking for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Summary statistics
print("\nDataset Description:")
df.describe()

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107833 entries, 0 to 107832
Data columns (total 9 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   Filght_date               107833 non-null  object 
 1   Flight_ID                 107833 non-null  object 
 2   Departure point           107833 non-null  object 
 3   Arrival point             107833 non-null  object 
 4   Scheduled_departure_time  107833 non-null  object 
 5   Scheduled_arrival_time    107833 non-null  object 
 6   STATUS                    107833 non-null  object 
 7   Aircraft_code             107833 non-null  object 
 8   Arrival delay             107833 non-null  float64
dtypes: float64(1), object(8)
memory usage: 7.4+ MB
None

Missing Values:
Filght_date                 0
Flight_ID                   0
Departure point             0
Arrival point               0
Scheduled_departure_time    0
Scheduled_arrival_time      0
STATUS                      0
Aircraft_code               0
Arrival delay               0
dtype: int64

,Arrival delay
count,107833.0
mean,48.733013
std,117.135562
min,0.0
25%,0.0
50%,14.0
75%,43.0
max,3451.0


In [5]:
# STEP 2
# Generating ydata-profiling report
profile = ProfileReport(df, title="Tunisair Flight Dataset Report", explorative=True)
profile.to_file("Tunisair_Flight_Report.html")


# **Summary of Key Observations**
**Dataset Overview**

- The dataset consists of 107,833 flight records with 9 columns.

- No missing values were detected in any column, so the dataset is complete.

- Eight of the nine columns are of type ***object*** (strings), including the date and time columns; only Arrival delay is numeric (***float64***).
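
Because the date and time columns are stored as strings, they would need to be parsed before any time-based analysis. The sketch below shows one way to do that on a two-row sample copied from the `df.head()` output; the dot-separated arrival-time format is inferred from those rows only, and `scheduled_duration_min` is a hypothetical derived column, not part of the dataset.

```python
import pandas as pd

# Two rows copied from the df.head() output above (only the date/time columns)
sample = pd.DataFrame({
    "Filght_date": ["2016-01-03", "2016-01-13"],
    "Scheduled_departure_time": ["2016-01-03 10:30:00", "2016-01-13 15:05:00"],
    "Scheduled_arrival_time": ["2016-01-03 12.55.00", "2016-01-13 16.55.00"],
})

parsed = sample.copy()
parsed["Filght_date"] = pd.to_datetime(parsed["Filght_date"], format="%Y-%m-%d")
parsed["Scheduled_departure_time"] = pd.to_datetime(
    parsed["Scheduled_departure_time"], format="%Y-%m-%d %H:%M:%S"
)
# Assumption: arrival times use dots between hours, minutes and seconds,
# as in the rows shown by df.head()
parsed["Scheduled_arrival_time"] = pd.to_datetime(
    parsed["Scheduled_arrival_time"], format="%Y-%m-%d %H.%M.%S"
)

# Hypothetical derived column: scheduled flight duration in minutes
parsed["scheduled_duration_min"] = (
    parsed["Scheduled_arrival_time"] - parsed["Scheduled_departure_time"]
).dt.total_seconds() / 60

print(parsed.dtypes)
print(parsed["scheduled_duration_min"].tolist())
```

On the full dataset the same calls would be applied to `df`; passing `errors="coerce"` to `pd.to_datetime` is one option if some rows do not follow the assumed format.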

**Arrival Delay Distribution**

- The mean arrival delay is 48.73 minutes with a standard deviation of 117.14 minutes. The median is only 14 minutes, so the distribution is strongly right-skewed: most flights arrive on time or slightly late, while a small number of extreme delays pull the mean up.

- The minimum arrival delay is 0 minutes and the maximum is 3,451 minutes (about 57.5 hours), indicating extreme **outliers**. These should be investigated to determine whether they are **data entry** errors or reflect **special circumstances** (e.g., cancellations or major operational issues).

- 35.4% of flights (38,168) have a recorded delay of zero, meaning they arrived on time. It may be useful to analyze the conditions under which the remaining flights are delayed.
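
One possible way to follow up on the outliers and the zero-delay share is the 1.5 × IQR rule, sketched below. The delay values are illustrative and not taken from the dataset, and the IQR rule is only one of several reasonable outlier criteria. With the quartiles reported by `df.describe()` (25% = 0, 75% = 43), the same rule would give an upper fence of 43 + 1.5 × 43 = 107.5 minutes.

```python
import pandas as pd

# Illustrative delays in minutes (NOT taken from the dataset)
delays = pd.Series(
    [0, 0, 0, 14, 20, 22, 43, 260, 3451], name="Arrival delay", dtype="float64"
)

# Share of flights with a recorded delay of exactly zero
zero_share = (delays == 0).mean()

# Upper fence of the 1.5 * IQR rule; values above it are flagged as outliers
q1, q3 = delays.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
outliers = delays[delays > upper_fence]

print(f"zero-delay share: {zero_share:.1%}")
print(f"upper fence: {upper_fence} minutes")
print(outliers.tolist())
```

On the real data, `delays` would be replaced by `df["Arrival delay"]`. Flagged rows are candidates for inspection, not automatically errors.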

**Imbalanced Data**

The **ydata-profiling** report flags the **STATUS** column as highly **imbalanced** (73.4%), which indicates that one flight status category dominates (i.e., most flights fall into a single category). This can bias predictive modeling and classification tasks, so the category counts should be checked before STATUS is used as a feature or target.
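
To see which category dominates, the class shares can be inspected directly. The sketch below uses illustrative labels: `ATA` appears in the `df.head()` output, while `OTHER_A` and `OTHER_B` are hypothetical placeholders. The normalised-entropy score is one possible way to quantify imbalance; whether it matches the figure printed by the report is an assumption.

```python
import numpy as np
import pandas as pd

# Illustrative labels: "ATA" appears in df.head(); the other codes are placeholders
status = pd.Series(["ATA"] * 8 + ["OTHER_A", "OTHER_B"], name="STATUS")

# Relative frequency of each category, largest first
shares = status.value_counts(normalize=True)
print(shares)

# Entropy-based imbalance score in [0, 1]: 0 = perfectly balanced,
# 1 = a single category. Requires at least two categories.
p = shares.to_numpy()
entropy = -(p * np.log2(p)).sum()
imbalance_score = 1 - entropy / np.log2(len(p))
print(f"imbalance score: {imbalance_score:.3f}")
```

On the full dataset, `df["STATUS"].value_counts(normalize=True)` alone is usually enough to identify the dominant status.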

**Further Data Exploration**

Further analysis could examine **peak delay times**, **whether specific airports experience more delays**, and **whether delays spike at particular times of year (e.g., during the holidays)**.
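
These questions could be approached with simple group-by aggregations, as in the minimal sketch below. It uses the five rows shown by `df.head()`, so the numbers are not representative; `dep_hour` and `dep_month` are hypothetical helper columns introduced for illustration.

```python
import pandas as pd

# The five rows shown by df.head() (only the columns needed here)
sample = pd.DataFrame({
    "Departure point": ["CMN", "MXP", "TUN", "DJE", "TUN"],
    "Scheduled_departure_time": [
        "2016-01-03 10:30:00", "2016-01-13 15:05:00", "2016-01-16 04:10:00",
        "2016-01-17 14:10:00", "2016-01-17 14:30:00",
    ],
    "Arrival delay": [260.0, 20.0, 0.0, 0.0, 22.0],
})

# Parse the departure time and derive hypothetical helper columns
sample["Scheduled_departure_time"] = pd.to_datetime(sample["Scheduled_departure_time"])
sample["dep_hour"] = sample["Scheduled_departure_time"].dt.hour
sample["dep_month"] = sample["Scheduled_departure_time"].dt.month

# Mean delay and number of flights per departure airport, worst first
by_airport = (
    sample.groupby("Departure point")["Arrival delay"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)

# Mean delay per scheduled departure hour and per month
by_hour = sample.groupby("dep_hour")["Arrival delay"].mean()
by_month = sample.groupby("dep_month")["Arrival delay"].mean()

print(by_airport)
print(by_hour)
print(by_month)
```

Keeping the `count` column next to the mean helps avoid over-interpreting airports or hours with very few flights.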
