<a href="https://colab.research.google.com/github/SallyPeter/gomycodeDSbootcamp/blob/main/Python/Checkpoint_ydata_Profiling_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Instructions**

### **Part 1:** Data Exploration with Pandas
1. Load the provided dataset into a pandas dataframe
2. Use pandas to explore the dataset. For example, you might start by using the head() and info() methods to get an overview of the data.
3. Look for missing values in the dataset. You can use the isnull() method to identify missing values.
4. Use pandas to calculate some summary statistics for the dataset. For example, you might use the describe() method to get summary statistics for the numerical columns in the dataset.

### **Part 2:** Data Exploration with ydata-Profiling
1. Use ydata-profiling to generate a report of the provided dataset.
2. Look for missing values in the dataset. You can use the report generated by ydata-profiling to identify missing values.
3. Look for correlations between different columns in the dataset. You can use the report generated by ydata-profiling to identify correlations between different columns.
4. Identify any outliers or unusual values in the dataset. You can use the report generated by ydata-profiling to identify any outliers or unusual values.

**Summary**

At the end of this exercise, write a summary of your findings.
*Did you find any interesting patterns or correlations in the data? Were there any issues or challenges you encountered while exploring the dataset?*
 Use your summary to reflect on your experience using pandas and ydata-profiling to explore and understand a new dataset.

In [1]:
# !pip install -q ydata-profiling



In [2]:
import pandas as pd
from ydata_profiling import ProfileReport

In [3]:
# Load the provided dataset into a pandas dataframe
data = pd.read_csv('Tunisair_flights_dataset.csv')
data.head()

|  | Filght_date | Flight_ID | Departure point | Arrival point | Scheduled_departure_time | Scheduled_arrival_time | STATUS | Aircraft_code | Arrival delay |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-03 | TU 0712 | CMN | TUN | 2016-01-03 10:30:00 | 2016-01-03 12.55.00 | ATA | TU 32AIMN | 260.0 |
| 1 | 2016-01-13 | TU 0757 | MXP | TUN | 2016-01-13 15:05:00 | 2016-01-13 16.55.00 | ATA | TU 31BIMO | 20.0 |
| 2 | 2016-01-16 | TU 0214 | TUN | IST | 2016-01-16 04:10:00 | 2016-01-16 06.45.00 | ATA | TU 32AIMN | 0.0 |
| 3 | 2016-01-17 | TU 0480 | DJE | NTE | 2016-01-17 14:10:00 | 2016-01-17 17.00.00 | ATA | TU 736IOK | 0.0 |
| 4 | 2016-01-17 | TU 0338 | TUN | ALG | 2016-01-17 14:30:00 | 2016-01-17 15.50.00 | ATA | TU 320IMU | 22.0 |


In [4]:
# Explore the dataset: info() lists each column's dtype and non-null count.
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48377 entries, 0 to 48376
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Filght_date               48377 non-null  object 
 1   Flight_ID                 48377 non-null  object 
 2   Departure point           48377 non-null  object 
 3   Arrival point             48377 non-null  object 
 4   Scheduled_departure_time  48377 non-null  object 
 5   Scheduled_arrival_time    48377 non-null  object 
 6   STATUS                    48377 non-null  object 
 7   Aircraft_code             48377 non-null  object 
 8   Arrival delay             48377 non-null  float64
dtypes: float64(1), object(8)
memory usage: 3.3+ MB
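
The `info()` output shows every date and time column stored as `object` (plain strings), and the `head()` output shows that `Scheduled_arrival_time` separates hours, minutes, and seconds with dots while `Scheduled_departure_time` uses colons. A minimal sketch of how the two columns could be parsed follows. It uses two rows copied from the `head()` output rather than the full file, and the derived column name `scheduled_minutes` is hypothetical.

```python
import pandas as pd

# Two rows copied from the head() output; the notebook itself would use `data`.
sample = pd.DataFrame({
    "Scheduled_departure_time": ["2016-01-03 10:30:00", "2016-01-13 15:05:00"],
    "Scheduled_arrival_time": ["2016-01-03 12.55.00", "2016-01-13 16.55.00"],
})

# Departure times use colons and arrival times use dots,
# so each column gets its own explicit format string.
sample["Scheduled_departure_time"] = pd.to_datetime(
    sample["Scheduled_departure_time"], format="%Y-%m-%d %H:%M:%S"
)
sample["Scheduled_arrival_time"] = pd.to_datetime(
    sample["Scheduled_arrival_time"], format="%Y-%m-%d %H.%M.%S"
)

# Scheduled duration in minutes (hypothetical derived column).
sample["scheduled_minutes"] = (
    sample["Scheduled_arrival_time"] - sample["Scheduled_departure_time"]
).dt.total_seconds() / 60

print(sample["scheduled_minutes"].tolist())  # [145.0, 110.0]
```

Passing an explicit `format` avoids relying on format inference, which may not recognise the dotted time notation.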


In [5]:
# Count missing values per column with isnull().
data.isnull().sum()

| Column | Missing values |
|---|---|
| Filght_date | 0 |
| Flight_ID | 0 |
| Departure point | 0 |
| Arrival point | 0 |
| Scheduled_departure_time | 0 |
| Scheduled_arrival_time | 0 |
| STATUS | 0 |
| Aircraft_code | 0 |
| Arrival delay | 0 |
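
Because eight of the nine columns are strings, a zero count from `isnull()` only rules out true nulls (`None`/`NaN`). It does not rule out placeholder strings such as an empty value. The sketch below illustrates the difference on a made-up three-row frame. The toy values are assumptions for illustration, not rows from the dataset.

```python
import pandas as pd

# Made-up rows: one empty string and one real null in STATUS,
# one real null in Arrival delay.
toy = pd.DataFrame({
    "STATUS": ["ATA", "", None],
    "Arrival delay": [260.0, None, 0.0],
})

# isnull() counts only real nulls (None / NaN) per column.
missing_count = toy.isnull().sum()

# Empty strings are not null, so they need a separate check.
empty_strings = int((toy["STATUS"] == "").sum())

print(missing_count)
print("empty STATUS strings:", empty_strings)
```

Running the same empty-string check on the real `data` frame would show whether the "no missing values" result also holds for placeholder values.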


In [6]:
# Summary statistics; by default describe() covers only the numeric columns.
data.describe()

| Statistic | Arrival delay |
|---|---|
| count | 48377.0 |
| mean | 40.668851 |
| std | 99.910682 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 12.0 |
| 75% | 36.0 |
| max | 2980.0 |


In [7]:
# Summary statistics restricted explicitly to the numeric columns
num_col = data.select_dtypes(include='number')
num_col.describe()

| Statistic | Arrival delay |
|---|---|
| count | 48377.0 |
| mean | 40.668851 |
| std | 99.910682 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 12.0 |
| 75% | 36.0 |
| max | 2980.0 |


In [8]:
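# Summary of the string (object) columns: count, unique values, most frequent value and its frequency
cat_col = data.select_dtypes(include='object')
cat_col.describe()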

| Statistic | Filght_date | Flight_ID | Departure point | Arrival point | Scheduled_departure_time | Scheduled_arrival_time | STATUS | Aircraft_code |
|---|---|---|---|---|---|---|---|---|
| count | 48377 | 48377 | 48377 | 48377 | 48377 | 48377 | 48377 | 48377 |
| unique | 587 | 1248 | 104 | 103 | 36996 | 38408 | 5 | 53 |
| top | 2016-08-25 | WKL 0000 | TUN | TUN | 2016-12-06 08:00:00 | 2016-03-08 01.00.00 | ATA | TU 320IMW |
| freq | 179 | 1088 | 19392 | 19345 | 7 | 6 | 43132 | 2175 |


## **Part 2**

In [9]:
# Use ydata-profiling to generate a report of the provided dataset.
profile = ProfileReport(data, title='Tunisair Flights')

# Display the report
profile.to_notebook_iframe()


In [None]:
# Look for missing values using the ydata-profiling report.
# No missing values: this matches data.isnull().sum(), which returned 0 for every column.

In [11]:
# Look for correlations between columns using the ydata-profiling report.
# The report shows little correlation in the given data. Arrival delay is the only numeric
# column in the raw file, which limits what a correlation analysis can show.
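
With a single numeric column, one way to look for relationships by hand is to derive a numeric feature from the timestamps and to compare mean delays across a categorical column. The sketch below does both on the five rows shown in the `head()` output. The column name `dep_hour` is hypothetical, and five rows are far too few to support conclusions about the full dataset.

```python
import pandas as pd

# The five rows from the head() output; the notebook itself would use `data`.
toy = pd.DataFrame({
    "Scheduled_departure_time": [
        "2016-01-03 10:30:00",
        "2016-01-13 15:05:00",
        "2016-01-16 04:10:00",
        "2016-01-17 14:10:00",
        "2016-01-17 14:30:00",
    ],
    "Departure point": ["CMN", "MXP", "TUN", "DJE", "TUN"],
    "Arrival delay": [260.0, 20.0, 0.0, 0.0, 22.0],
})

# Derive a numeric feature: scheduled departure hour (hypothetical column name).
toy["dep_hour"] = pd.to_datetime(toy["Scheduled_departure_time"]).dt.hour

# Pearson correlation between the derived hour and the delay.
corr = toy[["dep_hour", "Arrival delay"]].corr()

# Mean delay per departure airport, a simple numeric-vs-categorical comparison.
mean_delay_by_origin = toy.groupby("Departure point")["Arrival delay"].mean()

print(corr)
print(mean_delay_by_origin)
```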

In [12]:
# Identify outliers or unusual values using the ydata-profiling report.

# There are outliers in the Arrival delay column: the maximum is 2980, while the
# 75th percentile is 36 and the median is 12, so the distribution is strongly right-skewed.
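
The outlier observation can be made concrete with the common 1.5 × IQR rule, using the quartiles reported by `describe()` (25% = 0.0, 75% = 36.0). This is a sketch of one conventional rule, not necessarily the criterion ydata-profiling applies. The delay values below are the five rows from the `head()` output.

```python
import pandas as pd

# Quartiles copied from the describe() output above.
# In the notebook they could be read with data["Arrival delay"].quantile([0.25, 0.75]).
q1, q3 = 0.0, 36.0
iqr = q3 - q1                 # interquartile range: 36.0
upper_fence = q3 + 1.5 * iqr  # 36 + 54 = 90.0

# Delays from the five rows shown by head().
delays = pd.Series([260.0, 20.0, 0.0, 0.0, 22.0], name="Arrival delay")

# Values above the upper fence are flagged as potential outliers.
flagged = delays[delays > upper_fence]

print(upper_fence)       # 90.0
print(flagged.tolist())  # [260.0]
```

Under this rule, any delay above 90 counts as a potential outlier, so the reported maximum of 2980 lies far beyond the fence.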