In [1]:
import pandas as pd
from ydata_profiling import ProfileReport
import seaborn as sns

## Part 1: Data Exploration with Pandas

### 1. Load the provided dataset into a pandas dataframe

In [3]:
tunisian_air = pd.read_csv("Tunisair_flights_dataset.csv")

### 2. Use pandas to explore the dataset. For example, you might start by using the head() and info() methods to get an overview of the data.

   * View the head() of the dataset

In [3]:
tunisian_air.head()

| | Filght_date | Flight_ID | Departure point | Arrival point | Scheduled_departure_time | Scheduled_arrival_time | STATUS | Aircraft_code | Arrival delay |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-03 | TU 0712 | CMN | TUN | 2016-01-03 10:30:00 | 2016-01-03 12.55.00 | ATA | TU 32AIMN | 260.0 |
| 1 | 2016-01-13 | TU 0757 | MXP | TUN | 2016-01-13 15:05:00 | 2016-01-13 16.55.00 | ATA | TU 31BIMO | 20.0 |
| 2 | 2016-01-16 | TU 0214 | TUN | IST | 2016-01-16 04:10:00 | 2016-01-16 06.45.00 | ATA | TU 32AIMN | 0.0 |
| 3 | 2016-01-17 | TU 0480 | DJE | NTE | 2016-01-17 14:10:00 | 2016-01-17 17.00.00 | ATA | TU 736IOK | 0.0 |
| 4 | 2016-01-17 | TU 0338 | TUN | ALG | 2016-01-17 14:30:00 | 2016-01-17 15.50.00 | ATA | TU 320IMU | 22.0 |


   * View the information about the dataset

In [4]:
tunisian_air.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107833 entries, 0 to 107832
Data columns (total 9 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   Filght_date               107833 non-null  object 
 1   Flight_ID                 107833 non-null  object 
 2   Departure point           107833 non-null  object 
 3   Arrival point             107833 non-null  object 
 4   Scheduled_departure_time  107833 non-null  object 
 5   Scheduled_arrival_time    107833 non-null  object 
 6   STATUS                    107833 non-null  object 
 7   Aircraft_code             107833 non-null  object 
 8   Arrival delay             107833 non-null  float64
dtypes: float64(1), object(8)
memory usage: 7.4+ MB
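
The `info()` output lists every date and time column as `object`, meaning they are stored as plain strings. Below is a minimal sketch of converting them, assuming the formats visible in the `head()` output hold for every row (colons in the departure times, dots in the arrival times). The two-row `sample` frame and the `scheduled_minutes` column are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical two-row sample that mirrors the formats shown by head()
sample = pd.DataFrame({
    "Filght_date": ["2016-01-03", "2016-01-13"],
    "Scheduled_departure_time": ["2016-01-03 10:30:00", "2016-01-13 15:05:00"],
    "Scheduled_arrival_time": ["2016-01-03 12.55.00", "2016-01-13 16.55.00"],
})

sample["Filght_date"] = pd.to_datetime(sample["Filght_date"], format="%Y-%m-%d")
sample["Scheduled_departure_time"] = pd.to_datetime(
    sample["Scheduled_departure_time"], format="%Y-%m-%d %H:%M:%S"
)
# Arrival times use dots instead of colons, so they need their own format string
sample["Scheduled_arrival_time"] = pd.to_datetime(
    sample["Scheduled_arrival_time"], format="%Y-%m-%d %H.%M.%S"
)

# Illustrative derived column: scheduled flight duration in minutes
sample["scheduled_minutes"] = (
    sample["Scheduled_arrival_time"] - sample["Scheduled_departure_time"]
).dt.total_seconds() / 60
```

On the full dataset the same calls would be applied to `tunisian_air`; passing `errors="coerce"` to `pd.to_datetime` is one option if some rows turn out not to match the assumed format.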


   * View the shape of the dataset (the number of rows and columns)

In [5]:
tunisian_air.shape

(107833, 9)

### 3. Look for missing values in the dataset. You can use the isnull() method to identify missing values.

In [6]:
tunisian_air.isnull().sum()

Filght_date                 0
Flight_ID                   0
Departure point             0
Arrival point               0
Scheduled_departure_time    0
Scheduled_arrival_time      0
STATUS                      0
Aircraft_code               0
Arrival delay               0
dtype: int64
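
The per-column counts above can be complemented with a percentage and a single overall total, which is easier to read on wide tables. This is a small sketch on a made-up `toy` frame (the `"OTHER"` status value and the missing entry are hypothetical, added only so the result is non-trivial).

```python
import pandas as pd

# Hypothetical toy frame with one gap, for illustration only
toy = pd.DataFrame({
    "STATUS": ["ATA", "OTHER", None, "ATA"],
    "Arrival delay": [260.0, 20.0, 0.0, 22.0],
})

missing_count = toy.isnull().sum()          # missing values per column
missing_share = toy.isnull().mean() * 100   # percentage of rows missing per column
total_missing = int(missing_count.sum())    # missing values in the whole frame
```

For `tunisian_air`, where every column reports 0, `total_missing` would be expected to come out as 0 as well.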

### 4. Use pandas to calculate some summary statistics for the dataset. For example, you might use the describe() method to get summary statistics for the numerical columns in the dataset.

In [7]:
tunisian_air.describe()

| | Arrival delay |
|---|---|
| count | 107833.0 |
| mean | 48.733013 |
| std | 117.135562 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 14.0 |
| 75% | 43.0 |
| max | 3451.0 |
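
By default `describe()` only summarizes numeric columns, which is why `Arrival delay` is the only column in the output. A possible way to also summarize the text columns is to exclude the numeric ones, as sketched here on a made-up `toy` frame (the `"OTHER"` status value is hypothetical).

```python
import pandas as pd

# Hypothetical toy frame, for illustration only
toy = pd.DataFrame({
    "STATUS": ["ATA", "ATA", "ATA", "OTHER"],
    "Departure point": ["TUN", "TUN", "DJE", "CMN"],
    "Arrival delay": [260.0, 20.0, 0.0, 22.0],
})

# For non-numeric columns describe() reports count, unique, top and freq
text_summary = toy.describe(exclude="number")
```

Here `top` is the most frequent value in each column and `freq` is how often it occurs.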


## Part 2: Data Exploration with ydata-profiling

### 1. Use ydata-profiling to generate a report of the provided dataset.
  * `ProfileReport` was imported from ydata-profiling at the top of the notebook, so it only needs to be called here

In [4]:
tunair_profile = ProfileReport(tunisian_air, explorative = True, title = 'Tunisian Air Profile')

# Save the profile report as an HTML file
tunair_profile.to_file("tunisian_air_profile.html")

In [7]:
# Generate and show dataset profile
tunair_profile.to_notebook_iframe()

### 2. Look for missing values in the dataset. You can use the report generated by ydata-profiling to identify missing values.

  * According to the profile report, the dataset has no missing values

### 3. Look for correlations between different columns in the dataset. You can use the report generated by ydata-profiling to identify correlations between different columns.

 * The report shows no notable correlation between `Arrival delay` and `STATUS`
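
Because `STATUS` is categorical, one way to cross-check this reading outside the report is to compare the delay statistics per status group. The sketch below uses a made-up `toy` frame (the `"OTHER"` status and all delay values are hypothetical); similar means and medians across groups would be consistent with a weak relationship.

```python
import pandas as pd

# Hypothetical toy frame, for illustration only
toy = pd.DataFrame({
    "STATUS": ["ATA", "ATA", "ATA", "OTHER", "OTHER"],
    "Arrival delay": [0.0, 20.0, 40.0, 10.0, 30.0],
})

# Count, mean and median delay for each status value
delay_by_status = (
    toy.groupby("STATUS")["Arrival delay"].agg(["count", "mean", "median"])
)
```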

### 4. Identify any outliers or unusual values in the dataset. You can use the report generated by ydata-profiling to identify any outliers or unusual values.

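 * The `STATUS` column is dominated by a single value: `ATA` is more frequent than all other flight statuses. This is a class imbalance rather than an outlier in the strict sense.
 * The `Arrival delay` column contains extreme values: its maximum is 3451, while the 75th percentile is only 43.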
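
These observations can also be checked directly in pandas. The sketch below measures the share of each status with `value_counts(normalize=True)` and flags unusually large delays with the common 1.5 × IQR rule; that rule is an assumption chosen here for illustration, not necessarily what the profile report uses. The `toy` frame, the `"OTHER"` status and the delay values are made up.

```python
import pandas as pd

# Hypothetical toy frame, for illustration only
toy = pd.DataFrame({
    "STATUS": ["ATA"] * 8 + ["OTHER"] * 2,
    "Arrival delay": [0.0, 0.0, 5.0, 10.0, 14.0, 20.0, 30.0, 43.0, 50.0, 3451.0],
})

# Fraction of rows taken by each status value
status_share = toy["STATUS"].value_counts(normalize=True)

# 1.5 * IQR rule: values above q3 + 1.5 * (q3 - q1) are flagged as outliers
delay = toy["Arrival delay"]
q1, q3 = delay.quantile(0.25), delay.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = toy[delay > upper_fence]
```

Only an upper fence is used because the delays in the dataset have a minimum of 0, so there is no low tail to flag.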