### 📈  Variables in dataset
 1. Year: 2008
 3. Month: 1 (January) - 12 (December)
 3. DayofMonth: 1-31
 4. DayOfWeek: 1 (Monday) - 7 (Sunday)
 5. DepTime: actual departure time (hhmm)
 6. CRSDepTime: scheduled departure time (hhmm)
 7. ArrTime: actual arrival time (hhmm)
 8. CRSArrTime: scheduled arrival time (hhmm)
 9. UniqueCarrier: unique carrier code
 10. FlightNum: flight number
 11. TailNum: plane tail number
 12. ActualElapsedTime: actual elapsed time of flight, in minutes
 13. CRSElapsedTime: scheduled (estimated) elapsed time of flight, in minutes
 14. AirTime: flight time, in minutes
 15. ArrDelay: difference in minutes between scheduled and actual arrival time; early arrivals show negative numbers
 16. DepDelay: TARGET - difference in minutes between scheduled and actual departure time; early departures show negative numbers
 17. Origin: origin IATA airport code
 18. Dest: destination IATA airport code
 19. Distance: distance between airports (miles)
 20. TaxiIn: time elapsed between wheels down and arrival at the destination airport gate, in minutes
 21. TaxiOut: time elapsed between departure from the origin airport gate and wheels off, in minutes
 22. Cancelled: was the flight cancelled? (1 = yes, 0 = no)
 23. CancellationCode: reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
 24. Diverted: 1 = yes, 0 = no
 25. CarrierDelay: carrier delay in minutes
 26. WeatherDelay: weather delay in minutes
 27. NASDelay: NAS delay in minutes
 28. SecurityDelay: security delay in minutes
 29. LateAircraftDelay: late aircraft delay in minutes
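
The time columns above are stored as hhmm numbers (for example, 1955 means 19:55), which are awkward to subtract or compare directly. A minimal sketch of one way to turn them into minutes since midnight is shown here; the helper name `hhmm_to_minutes` and the small `sample` frame are hypothetical and not part of the notebook. Missing values stay missing, and a value of 2400, if it occurs, would map to 1440.

```python
import numpy as np
import pandas as pd


def hhmm_to_minutes(col: pd.Series) -> pd.Series:
    """Convert hhmm-encoded times (e.g. 1955 for 19:55) to minutes since midnight."""
    hours = col // 100    # leading digits are the hour
    minutes = col % 100   # last two digits are the minute
    return hours * 60 + minutes


# Hypothetical sample that mimics the dtypes in the dataset:
# DepTime is float because it can be missing, CRSDepTime is int.
sample = pd.DataFrame({
    "DepTime": [2003.0, 754.0, np.nan],
    "CRSDepTime": [1955, 735, 620],
})
sample["DepMinutes"] = hhmm_to_minutes(sample["DepTime"])
sample["CRSDepMinutes"] = hhmm_to_minutes(sample["CRSDepTime"])
print(sample)
```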

### To predict flight arrival delay, I used the PyCaret library, which ... and much more. You also have to install pycaret-nightly or a pre-release build of pycaret (pip install --pre pycaret).

In [None]:
!pip install --pre pycaret

In [58]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### We can load the full data from the file or only part of it

In [59]:
nRowsRead = None  # number of rows to read; set to None to read the whole file
file_path = 'DelayedFlights.csv'
data = pd.read_csv(file_path, header=0, nrows=nRowsRead)
nRow, nCol = data.shape
print(f'There are {nRow} rows and {nCol} columns')

There are 1936758 rows and 30 columns


In [60]:
data.rename(columns={'Unnamed: 0': 'id'}, inplace=True)

In [61]:
data.drop_duplicates(inplace=True)

In [50]:
data.dtypes

id                     int64
Year                   int64
Month                  int64
DayofMonth             int64
DayOfWeek              int64
DepTime              float64
CRSDepTime             int64
ArrTime              float64
CRSArrTime             int64
UniqueCarrier         object
FlightNum              int64
TailNum               object
ActualElapsedTime    float64
CRSElapsedTime       float64
AirTime              float64
ArrDelay             float64
DepDelay             float64
Origin                object
Dest                  object
Distance               int64
TaxiIn               float64
TaxiOut              float64
Cancelled              int64
CancellationCode      object
Diverted               int64
CarrierDelay         float64
WeatherDelay         float64
NASDelay             float64
SecurityDelay        float64
LateAircraftDelay    float64
dtype: object

In [62]:
data.head()

Unnamed: 0,id,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,0,2008,1,3,4,2003.0,1955,2211.0,2225,WN,...,4.0,8.0,0,N,0,,,,,
1,1,2008,1,3,4,754.0,735,1002.0,1000,WN,...,5.0,10.0,0,N,0,,,,,
2,2,2008,1,3,4,628.0,620,804.0,750,WN,...,3.0,17.0,0,N,0,,,,,
3,4,2008,1,3,4,1829.0,1755,1959.0,1925,WN,...,3.0,10.0,0,N,0,2.0,0.0,0.0,0.0,32.0
4,5,2008,1,3,4,1940.0,1915,2121.0,2110,WN,...,4.0,10.0,0,N,0,,,,,


In [64]:
data['EarlyArr'] = data['ArrDelay'] * (-1)

In [65]:
data[['EarlyArr', 'ArrDelay']]

Unnamed: 0,EarlyArr,ArrDelay
0,14.0,-14.0
1,-2.0,2.0
2,-14.0,14.0
3,-34.0,34.0
4,-11.0,11.0
...,...,...
1936753,-25.0,25.0
1936754,-75.0,75.0
1936755,-99.0,99.0
1936756,-9.0,9.0


In [66]:
data.loc[(data.ArrDelay < 0), 'ArrDelay'] = 0

In [67]:
data.loc[(data.EarlyArr < 0), 'EarlyArr'] = 0

In [68]:
data[['EarlyArr', 'ArrDelay']]

Unnamed: 0,EarlyArr,ArrDelay
0,14.0,0.0
1,0.0,2.0
2,0.0,14.0
3,0.0,34.0
4,0.0,11.0
...,...,...
1936753,0.0,25.0
1936754,0.0,75.0
1936755,0.0,99.0
1936756,0.0,9.0
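
The cells above split the signed arrival delay into two non-negative columns: minutes early (`EarlyArr`) and minutes late (`ArrDelay`). As an alternative sketch, the same split can be written with `Series.clip`, which avoids the `.loc` assignments. The `toy` frame below is a made-up example, not the real data.

```python
import pandas as pd

# Hypothetical signed delays: negative = early, positive = late, NaN = unknown
toy = pd.DataFrame({"ArrDelay": [-14.0, 2.0, 14.0, float("nan")]})

# Compute EarlyArr first, while ArrDelay still carries its sign
toy["EarlyArr"] = (-toy["ArrDelay"]).clip(lower=0)
# Then floor the delay itself at zero
toy["ArrDelay"] = toy["ArrDelay"].clip(lower=0)
print(toy)
```

The order matters: if `ArrDelay` were clipped first, the information about early arrivals would already be lost when `EarlyArr` is computed.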


In [69]:
data['EarlyArr'].describe()

count    1.928371e+06
mean     6.363573e-01
std      2.723945e+00
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      1.090000e+02
Name: EarlyArr, dtype: float64

In [70]:
data['ArrDelay'].describe()

count    1.928371e+06
mean     4.283624e+01
std      5.623669e+01
min      0.000000e+00
25%      9.000000e+00
50%      2.400000e+01
75%      5.600000e+01
max      2.461000e+03
Name: ArrDelay, dtype: float64

In [78]:
data["date"] = data["Year"].astype(str) + "-" + data["Month"].astype(str) + "-" + data["DayofMonth"].astype(str)

In [79]:
data["date"]

0            2008-1-3
1            2008-1-3
2            2008-1-3
3            2008-1-3
4            2008-1-3
              ...    
1936753    2008-12-13
1936754    2008-12-13
1936755    2008-12-13
1936756    2008-12-13
1936757    2008-12-13
Name: date, Length: 1936758, dtype: object

In [81]:
data['date']= pd.to_datetime(data['date'])

In [82]:
data["date"]

0         2008-01-03
1         2008-01-03
2         2008-01-03
3         2008-01-03
4         2008-01-03
             ...    
1936753   2008-12-13
1936754   2008-12-13
1936755   2008-12-13
1936756   2008-12-13
1936757   2008-12-13
Name: date, Length: 1936758, dtype: datetime64[ns]
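
The date column above is built by concatenating strings and then parsing them. As a sketch of an alternative, `pd.to_datetime` can also assemble dates directly from a frame whose columns are named `year`, `month`, and `day`, which skips the string step. The `toy` frame here is a hypothetical stand-in for the real data.

```python
import pandas as pd

# Hypothetical rows using the dataset's column names
toy = pd.DataFrame({"Year": [2008, 2008], "Month": [1, 12], "DayofMonth": [3, 13]})

# to_datetime expects the component columns to be called year/month/day
parts = toy[["Year", "Month", "DayofMonth"]].rename(
    columns={"Year": "year", "Month": "month", "DayofMonth": "day"}
)
toy["date"] = pd.to_datetime(parts)
print(toy["date"])
```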

In [83]:
data.to_csv('FlightAnalysis.csv', index=False)  # skip the index so re-reading does not add an 'Unnamed: 0' column
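
Whether `to_csv` writes the row index decides what the file looks like when it is read back. By default the index is written as an unnamed first column, which `read_csv` then reports as `Unnamed: 0`, the same kind of column that was renamed to `id` earlier. A small sketch of the difference, using an in-memory buffer and a made-up `toy` frame instead of the real file:

```python
import io

import pandas as pd

toy = pd.DataFrame({"id": [0, 1], "ArrDelay": [0.0, 2.0]})

# Default: the index is written and comes back as an extra column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the real columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```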