### **USA FLIGHT DELAYS INSIGHTS**


### Data Source: https://www.kaggle.com/datasets/aksathomas/2013-us-flight-data?select=US_Flights_2013.csv

**Objectives:**

1. Analyze the flight delay patterns across different airlines, origin and destination airports, and times (month, day of the month, day of the week).
2. Investigate the relationship between departure delay and arrival delay.
3. Understand the cancellation patterns across different airlines and airports.

Standard imports


In [None]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set_theme()
import numpy as np
from scipy import stats

In [None]:
# Path to file
path = '/home/nyangweso/Desktop/Ds_1/Data_Science-In-Python/Python projects/Python+Tableau/data/USA Flights data/US_Flights_2013.csv'

In [None]:
df = pd.read_csv(path)
df.head(10)
# The CSV file is read and the first 10 rows are displayed

In [None]:
df.columns
# The columns present in the DataFrame are displayed
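Before analysing delays it can help to check how many values are missing in each column, since missing delays affect averages and regression fits. The following is a minimal sketch rather than part of the original notebook: the helper name `missing_summary` is hypothetical, and the small `demo` frame is made-up data used only so the example runs without the CSV file.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = frame.isna().sum()
    percent = (counts / len(frame) * 100).round(2)
    return pd.DataFrame({"missing": counts, "percent": percent})


# Made-up rows for illustration only (not taken from the dataset)
demo = pd.DataFrame({
    "Carrier": ["WN", "DL", "AA", "UA"],
    "DepDelay": [5.0, None, 12.0, -3.0],
    "ArrDelay": [7.0, None, None, -5.0],
})

summary = missing_summary(demo)
print(summary)
```

On the real data the same call would be `missing_summary(df)`.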

#### Count based on airlines


In [None]:
flight_counts = df['Carrier'].value_counts()

In [None]:
plt.figure(figsize=(10, 6))
plt.bar(flight_counts.index, flight_counts.values, color='b')
plt.xlabel('Carrier')
plt.ylabel('Number of Flights')
plt.title('Number of Flights for Each Carrier')
plt.show()

# Results:
# The top 4 airlines with the most flights according to the dataset are:
# 1. Southwest Airlines (WN)
# 2. Delta Air Lines (DL)
# 3. American Airlines (AA)
# 4. United Airlines (UA)
# Hawaiian Airlines (HA) recorded the fewest flights.
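The ranking above is easier to compare when each carrier's count is also expressed as a share of all flights. This is a small sketch under stated assumptions: `carrier_share` is a hypothetical helper, and `demo_carriers` is made-up data standing in for `df['Carrier']`.

```python
import pandas as pd


def carrier_share(carriers: pd.Series) -> pd.DataFrame:
    """Return flight counts and percentage share per carrier, largest first."""
    counts = carriers.value_counts()
    share = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"flights": counts, "share_pct": share})


# Made-up carrier codes for illustration only
demo_carriers = pd.Series(
    ["WN"] * 4 + ["DL"] * 3 + ["AA"] * 2 + ["HA"] * 1
)

table = carrier_share(demo_carriers)
print(table)
```

On the real data the call would be `carrier_share(df['Carrier'])`.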

#### Count based on airports


In [None]:
# For Origin Airport
origin_airport_counts = df['OriginAirportName'].value_counts()


# Get a list of unique airports
airports = list(set(origin_airport_counts.index))

# Get the number of departing flights for each airport
origin_counts = [origin_airport_counts.get(airport, 0) for airport in airports]


# Create an array for the positions of the bars on the x-axis
r = np.arange(len(airports))

# Create the figure and a single subplot
fig, ax = plt.subplots(figsize=(15, 10))

# Width of a bar
width = 0.4

# Plotting
plt.bar(r - width/2, origin_counts, color='b', width=width, label='origin')

# Adding labels and title
plt.xlabel('Airport')
plt.ylabel('Number of Flights')
plt.title('Number of Flights for Each Airport as Origin')
plt.xticks(r, airports, rotation=90)

# Show the legend
plt.legend()

# Show the plot
plt.show()

# Results:
# The count of departing flights is high in the following cities:
# 1. Atlanta
# 2. Chicago
# 3. Los Angeles
# 4. Dallas
# 5. Denver

In [None]:
# For Destination Airport
dest_airport_counts = df['DestAirportName'].value_counts()

airports = list(set(dest_airport_counts.index))

dest_counts = [dest_airport_counts.get(airport, 0) for airport in airports]

# Create an array for the positions of the bars on the x-axis
r = np.arange(len(airports))

# Create the figure and a single subplot
fig, ax = plt.subplots(figsize=(15, 10))

# Width of a bar
width = 0.4

plt.bar(r + width/2, dest_counts, color='r', width=width, label='destination')

# Adding labels and title
plt.xlabel('Airport')
plt.ylabel('Number of Flights')
plt.title('Number of Flights for Each Airport as Destination')
plt.xticks(r, airports, rotation=90)

# Show the legend
plt.legend()

# Show the plot
plt.show()

# Results:
# The count of arriving flights is high in the following cities:
# 1. Atlanta
# 2. Chicago
# 3. Los Angeles
# 4. Dallas
# 5. Denver
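With many airports on the x-axis, the busiest ones are hard to read off a bar chart. A short table of the top N airports is one way to confirm the lists above. This is a minimal sketch: `top_airports` is a hypothetical helper, the column names `OriginAirportName` and `DestAirportName` come from the notebook, and `demo` holds made-up airport names.

```python
import pandas as pd


def top_airports(frame: pd.DataFrame, column: str, n: int = 5) -> pd.Series:
    """Return the n most frequent values in an airport-name column."""
    # value_counts() already sorts from most to least frequent
    return frame[column].value_counts().head(n)


# Made-up airport names for illustration only
demo = pd.DataFrame({
    "OriginAirportName": ["A", "A", "A", "B", "B", "C"],
    "DestAirportName": ["B", "B", "B", "B", "A", "A"],
})

top_origin = top_airports(demo, "OriginAirportName", n=2)
top_dest = top_airports(demo, "DestAirportName", n=2)
print(top_origin)
print(top_dest)
```

On the real data, `top_airports(df, 'OriginAirportName')` returns the five busiest origin airports.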

Line Chart: Plot the average DepDelay and ArrDelay for each month. This can show whether delays are longer in certain months.


In [None]:
# Group by Month and calculate average delays
average_delays = df.groupby('Month')[['DepDelay', 'ArrDelay']].mean()

plt.figure(figsize=(10, 6))
plt.plot(average_delays.index,
         average_delays['DepDelay'], marker='o', label='Departure Delays')
plt.plot(average_delays.index,
         average_delays['ArrDelay'], marker='o', label='Arrival Delays')
plt.xlabel('Month')
plt.ylabel('Average Delay (in minutes)')
plt.title('Average Departure and Arrival Delays Over Months')
plt.legend()
plt.grid(True)
plt.show()

# Results:
# 1. June had the highest average departure delay.
# 2. The average arrival and departure delays were highest in June and July.
# 3. The monthly pattern suggests that delays can have a knock-on effect, with one late aircraft causing subsequent flights to be delayed.
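An average delay can be driven by a few very late flights, so it may also be worth looking at how often flights are delayed in each month. The sketch below is an illustration, not part of the original analysis: `monthly_delay_rate` is a hypothetical helper, the 15-minute threshold is an assumed cut-off that can be changed, and `demo` is made-up data.

```python
import pandas as pd


def monthly_delay_rate(frame: pd.DataFrame,
                       delay_col: str = "DepDelay",
                       threshold: float = 15) -> pd.Series:
    """Return the percentage of flights per month delayed beyond the threshold."""
    # Rows with a missing delay compare as False, so they count as not delayed
    delayed = frame[delay_col] > threshold
    return delayed.groupby(frame["Month"]).mean() * 100


# Made-up rows for illustration only
demo = pd.DataFrame({
    "Month": [1, 1, 6, 6, 6, 6],
    "DepDelay": [0, 20, 30, 5, 40, 16],
})

rates = monthly_delay_rate(demo)
print(rates)
```

On the real data, `monthly_delay_rate(df)` and `monthly_delay_rate(df, 'ArrDelay')` give the two series to compare with the averages plotted above.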

Pie Chart: Show the proportion of flights that are cancelled. This gives a quick view of how common cancellations are.


In [None]:
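cancelled_flights = df['Cancelled'].value_counts()
# Map each value to its label so the labels always match the slices
# (assumes Cancelled is coded as 0 = not cancelled, 1 = cancelled)
labels = cancelled_flights.index.map({0: 'Not Cancelled', 1: 'Cancelled'})

plt.figure(figsize=(6, 6))
plt.pie(cancelled_flights, labels=labels, autopct='%1.1f%%')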
plt.title('Proportion of Flights Cancelled')
plt.show()

# Results:
# Only a very small share of flights was cancelled. Cancellations are rare in this dataset, although the chart says nothing about their causes.
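The pie chart gives the overall cancellation share, while the third objective asks about patterns across airlines. One possible way to break the rate down by carrier is sketched below. It assumes `Cancelled` is coded as 0 and 1, `cancellation_rate` is a hypothetical helper, and `demo` is made-up data.

```python
import pandas as pd


def cancellation_rate(frame: pd.DataFrame, by: str = "Carrier") -> pd.Series:
    """Return the percentage of cancelled flights per group, highest first."""
    # The mean of a 0/1 column is the fraction of rows equal to 1
    rate = frame.groupby(by)["Cancelled"].mean() * 100
    return rate.sort_values(ascending=False)


# Made-up rows for illustration only
demo = pd.DataFrame({
    "Carrier": ["AA", "AA", "AA", "AA", "DL", "DL", "WN", "WN"],
    "Cancelled": [1, 0, 0, 0, 0, 0, 0, 1],
})

rates = cancellation_rate(demo)
print(rates)
```

Passing `by='OriginAirportName'` on the real data would give the same breakdown per origin airport.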

Scatter Plot: This involves plotting DepDelay vs ArrDelay to see if there is a correlation between departure delay and arrival delay.


In [None]:
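# Keep only rows where both delays are recorded so missing values cannot distort the fit
delays = df[['DepDelay', 'ArrDelay']].dropna()

# Calculate the line of best fit
slope, intercept, r_value, p_value, std_err = stats.linregress(
    delays['DepDelay'], delays['ArrDelay'])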

# Create a new column for the color gradient based on the difference between departure and arrival delay
df['DelayDifference'] = abs(df['DepDelay'] - df['ArrDelay'])

plt.figure(figsize=(10, 6))
scatter = plt.scatter(df['DepDelay'], df['ArrDelay'],
                      c=df['DelayDifference'], cmap='RdYlGn_r')
plt.plot(df['DepDelay'], intercept + slope *
         df['DepDelay'], 'r', label='fitted line')
plt.colorbar(scatter)
plt.xlabel('Departure Delay')
plt.ylabel('Arrival Delay')
plt.title('Departure Delay vs Arrival Delay')
plt.legend()  # Show the 'fitted line' label
plt.show()

# Results:
# There is a strong positive correlation between departure and arrival delays.
# This most likely means that a flight that departs late usually arrives late as well.
# However, correlation does not necessarily imply causation, and other factors could also be influencing these delays.
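The strength of the relationship can be stated as a number instead of being read off the scatter plot. The sketch below computes the Pearson correlation coefficient and a least-squares line with pandas and NumPy instead of `stats.linregress`. `delay_correlation` is a hypothetical helper, and `demo` is made-up data in which every arrival delay is exactly 2 minutes shorter than the departure delay.

```python
import numpy as np
import pandas as pd


def delay_correlation(frame: pd.DataFrame) -> dict:
    """Return Pearson r, r squared and the fitted line for DepDelay vs ArrDelay."""
    # Use only rows where both delays are present
    pairs = frame[["DepDelay", "ArrDelay"]].dropna()
    r = pairs["DepDelay"].corr(pairs["ArrDelay"])  # Pearson by default
    slope, intercept = np.polyfit(pairs["DepDelay"], pairs["ArrDelay"], 1)
    return {
        "n": len(pairs),
        "r": float(r),
        "r_squared": float(r ** 2),
        "slope": float(slope),
        "intercept": float(intercept),
    }


# Made-up rows for illustration only; the last row has a missing DepDelay
demo = pd.DataFrame({
    "DepDelay": [0.0, 10.0, 20.0, 30.0, None],
    "ArrDelay": [-2.0, 8.0, 18.0, 28.0, 5.0],
})

result = delay_correlation(demo)
print(result)
```

A slope close to 1 would mean that each extra minute of departure delay carries over almost fully to arrival, while a slope below 1 would suggest that some time is made up in the air.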

Box Plot: Compare the distribution of departure delays across carriers to reveal which carriers have the most variation.


In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='Carrier', y='DepDelay', data=df)
plt.xlabel('Carrier')
plt.ylabel('Departure Delay')
plt.title('Distribution of Departure Delays for Each Carrier')
plt.show()

# Results:
# AA, MQ and HA have a high number of delays
# On the other hand, the distributions of delays for the other carriers are similar to one another.
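Variation is hard to judge by eye when a box plot is dominated by outliers, so a table of spread statistics per carrier can back up the reading above. This is a minimal sketch: `delay_spread` is a hypothetical helper, the choice of median, interquartile range, and standard deviation is an assumption about which measures are useful, and `demo` is made-up data.

```python
import pandas as pd


def delay_spread(frame: pd.DataFrame,
                 by: str = "Carrier",
                 col: str = "DepDelay") -> pd.DataFrame:
    """Return median, interquartile range and standard deviation per group."""
    grouped = frame.groupby(by)[col]
    q1 = grouped.quantile(0.25)
    q3 = grouped.quantile(0.75)
    spread = pd.DataFrame({
        "median": grouped.median(),
        "iqr": q3 - q1,
        "std": grouped.std(),
    })
    # Carriers with the widest middle 50% of delays come first
    return spread.sort_values("iqr", ascending=False)


# Made-up rows for illustration only
demo = pd.DataFrame({
    "Carrier": ["AA"] * 5 + ["DL"] * 5,
    "DepDelay": [0, 10, 20, 30, 40, 0, 1, 2, 3, 4],
})

spread = delay_spread(demo)
print(spread)
```

The interquartile range is less sensitive to a handful of extreme delays than the standard deviation, so comparing the two columns shows how much of a carrier's variation comes from outliers.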

**Conclusions:**

1. Although there are flight delays, overall they aren't that frequent.
2. Some airlines, such as American Airlines, are more prone to delays.
3. There is a strong positive correlation between departure and arrival delays. This most likely means that a flight that departs late usually arrives late as well.
4. Only a small fraction of flights are cancelled. This suggests that factors that cause cancellations, such as weather and security, rarely disrupted flights in this dataset.



**References:**
1. https://www.datacamp.com/tutorial/visualizing-data-with-python-and-tableau-tutorial
2. https://www.transportation.gov/policy/aviation-policy/us-international-air-passenger-and-freight-statistics-report
3. https://community.tableau.com/s/news/a0A4T000002NznhUAC/tableau-integration-with-python-step-by-step
4. https://www.kaggle.com