# Predicting Flight Fares: Analyzing Airfare Based on the Duration of a Flight, Time Till Departure and Class

## Introduction

Airlines operate within a highly competitive and evolving environment where pricing strategies play a crucial role in attracting customers and maximizing profitability. The price of a flight ticket is very important to a consumer: it may determine which company they fly with or whether they fly at all. Consumers have different wants and needs when it comes to which flight they take, and airlines price their fares accordingly. Flight fares are influenced by many factors including demand, competition, fuel prices, and operating costs. Airlines invest in and employ revenue management systems that constantly adjust ticket prices based on current demand and other factors such as the time till departure, the departure and arrival times themselves, the source and destination locations, the number of stops in a trip, and the class of the ticket. In our project we will answer the following: Can we accurately and consistently predict the price of a flight ticket based on its time till departure, the duration of the flight, and the class of the ticket (Economy or Business)? To answer this question we will be using a dataset that contains various flights and their details from the website Easemytrip for flights between India's top 6 metro cities. We will be using the cleaned data, in which there are 300261 datapoints and 11 columns.

## Methods & Results:

#### Our first step is to load the necessary R packages and libraries for this project. So we begin by loading the following:

In [None]:
library(repr)
library(tidyverse)
library(tidymodels)
library(dplyr)
library(tidyr)
library(ggplot2)
library(RColorBrewer)

#### Next we read in the dataset for our project through a link and also provide a sample of what the data looks like:

Here we use the read_csv() function to read in the data. We use the head() function to show the first 6 rows of the dataset (its default) as a preview of the data we are working with. Finally, we use the cat() function to concatenate and print the total number of rows in the data set, which we obtain through nrow().

In [None]:
flight_data <- read_csv("https://raw.githubusercontent.com/JaskarnNijjar/DSCI-100-Group-Project/main/Clean_Dataset.csv")

head(flight_data)
cat("Total rows in data set:", nrow(flight_data), "\n")

#### Now we will split the data into training and testing data sets:

In [None]:
set.seed(100)  # Setting seed for reproducibility

# Taking a sample of 1500 data entries from the data set to allow for our 
# analysis since otherwise there are too many for the server to handle.
flight_data_split <- sample_n(flight_data, size = 1500)

# Splitting the data into training and testing sets to develop and test
# our model.
data_split <- initial_split(flight_data_split, prop = 0.75)
sampled_training_data <- training(data_split)
sampled_testing_data <- testing(data_split)

# Displaying the number of rows featured in each set
cat("Training set rows:", nrow(sampled_training_data), "\n")
cat("Testing set rows:", nrow(sampled_testing_data), "\n")
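To make the 75/25 proportion concrete, the following base R sketch mimics the idea of a random split on a placeholder data frame. It is only a conceptual illustration, not how initial_split() is implemented, and the names `toy_data`, `train_idx`, `toy_train`, and `toy_test` are hypothetical.

```r
# Conceptual sketch of a 75/25 random split in base R
# (an illustration only, not rsample's implementation).
set.seed(100)

# Hypothetical stand-in for the 1500 sampled rows
toy_data <- data.frame(id = 1:1500)

# Randomly choose 75% of the row indices for training
n_train <- floor(0.75 * nrow(toy_data))
train_idx <- sample(seq_len(nrow(toy_data)), size = n_train)

# Rows chosen go to training, all remaining rows go to testing
toy_train <- toy_data[train_idx, , drop = FALSE]
toy_test <- toy_data[-train_idx, , drop = FALSE]

cat("Training rows:", nrow(toy_train), "\n")  # 1125
cat("Testing rows:", nrow(toy_test), "\n")    # 375
```

With 1500 sampled rows, a 0.75 proportion should therefore give 1125 training rows and 375 testing rows, with no row appearing in both sets.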

## Quantitative Statistical Summaries

#### Summarizing data regarding differences in classes (Business and Economy):

Here we summarize different statistics based on the class of a flight. This reveals that there are fewer business class tickets than economy class tickets. Furthermore, the pricing for business tickets is much higher than for economy. This is also our first glimpse of the variability in business pricing, which can be seen from the standard deviations in price. We attribute the lower count of business tickets to their pricing: the average consumer may not be able to afford or justify the premium to fly business, which lowers demand for business tickets. Here we have separated the data by class using the group_by() function, and we have calculated the counts, means, standard deviations, minimums, and maximums using their respective functions.

In [None]:
# Summarized data includes differences in: Count, Average Price, Standard Deviation of Price, Minimum Price,
# Maximum Price, Average Duration, Standard Deviation of Duration, Minimum Duration, and Maximum Duration.
class_summary <- sampled_training_data |>
    group_by(class) |>
    summarize(count = n(),
              average_price = mean(price, na.rm = TRUE),
              sd_in_price = sd(price, na.rm = TRUE),
              min_price = min(price, na.rm = TRUE),
              max_price = max(price, na.rm = TRUE),
              average_duration = mean(duration, na.rm = TRUE),
              sd_in_duration = sd(duration, na.rm = TRUE),
              min_duration = min(duration, na.rm = TRUE),
              max_duration = max(duration, na.rm = TRUE),
              .groups = 'drop')

class_summary

#### Summarizing data based on different airlines:

Here we summarize different statistics for the various airlines featured in this dataset. Only two airlines in this dataset offer business class tickets: Air India and Vistara. These are also the two biggest airlines by flight count, and they have the highest minimum price, maximum price, and standard deviation in price. From the average durations, this higher pricing may be partly due to these airlines offering longer flights than the smaller airlines. Like the summarization above, we have used group_by(), but this time we group by both airline and class. We use the same summary functions to compute the contents of the table.

In [None]:
# Summarized data includes differences in: Flight Counts, Minimum Prices, Maximum Prices, Average Prices,
# Standard Deviation of Prices, Average Durations, and Average Days till Departure.
airline_summary <- sampled_training_data |>
    group_by(airline, class) |>
    summarize(flights_count = n(),
              min_price = min(price, na.rm = TRUE),
              max_price = max(price, na.rm = TRUE),
              average_price = mean(price, na.rm = TRUE),
              sd_in_price = sd(price, na.rm = TRUE),
              average_duration = mean(duration, na.rm = TRUE),
              average_days_till_departure = mean(days_left, na.rm = TRUE),
              .groups = 'drop')

airline_summary

#### Summarizing averages and standard deviations of quantitative predictors in the data set:

Here we summarize the statistics for all the quantitative variables we will use in our analysis. We are not considering differences between airlines in our prediction model; rather, we are working to build a general model that predicts price from duration, days left, and class. Here we do not need the group_by() function; we simply apply mean() and sd() to each variable, as in the tables above.

In [None]:
# Summarized data includes averages and standard deviations of: Duration, Days till Departure, and Price.
quantitative_summary <- sampled_training_data |>
    summarize(mean_duration = mean(duration, na.rm = TRUE),
              sd_duration = sd(duration, na.rm = TRUE),
              mean_days_left = mean(days_left, na.rm = TRUE),
              sd_days_left = sd(days_left, na.rm = TRUE),
              mean_price = mean(price, na.rm = TRUE),
              sd_price = sd(price, na.rm = TRUE))

quantitative_summary

#### Summarizing any potential missing data in the data set:

Here we are just ensuring that there are no missing values. We do this using the summarize() function with across() and everything() so that all columns are checked, while the argument ~sum(is.na(.)) counts the missing values in each column.

In [None]:
# Checks to see how many values are missing in each column in the data set
missing_data <- sampled_training_data |>
    summarize(across(everything(), ~sum(is.na(.))))

missing_data
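As a cross-check on the same idea, a base R one-liner can count missing values per column. The sketch below runs on a small made-up data frame (`toy_flights` and its values are hypothetical, not taken from the dataset) so that the counts are easy to verify by eye.

```r
# Toy data frame with two missing values (hypothetical data)
toy_flights <- data.frame(
  price = c(5000, NA, 7200),
  duration = c(2.5, 3.0, NA),
  class = c("Economy", "Business", "Economy")
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column
missing_counts <- colSums(is.na(toy_flights))
missing_counts
```

Here `price` and `duration` each contain one missing value and `class` contains none, which is the same per-column count that the summarize(across(...)) approach reports.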

## Visual Summaries

#### Visual summary of the relationship between price and days till departure for each airline:

Here we plot the days till a flight departs against its price, looking for potential correlations. We found that for economy tickets there is a clear downward-sloping relationship, but the same can't be said for business tickets, which are very scattered and inconsistent. We create two scatter plots using the facet_grid() function, separated by class, to see the difference between business and economy. scale_color_brewer() is used for a more visually appealing color palette, and we use an alpha value of 0.5 to make the points translucent so that overlapping points remain visible.

In [None]:
# Plotting the relationship between price and days till departure for each airline
durationtilldeparture_price_plot <- ggplot(sampled_training_data, aes(x = days_left, y = price, color = airline)) +
    geom_point(alpha = 0.5) +
    facet_grid(. ~ class) +
    scale_color_brewer(palette = "Spectral") +
    labs(x = "Days Left Until Departure", y = "Price (Indian Rupees)", color = "Airline") + 
    ggtitle("Price vs. Days Till Departure For Each Class") +
    theme(text = element_text(size = 15))

durationtilldeparture_price_plot

#### Visual summary of the relationship between price and duration of flight for each airline:

Here we plot the duration of a flight against its price, again looking for potential correlations. In both graphs we see a pattern of price rising with duration. We believe the rise in price could be due to the flight covering a greater distance; however, a longer duration can also result from layovers and stops, which tend to decrease the price. We use nearly identical functions to the previous graph, with the exception of passing duration to the x-axis and changing the titles.

In [None]:
# Plotting the relationship between price and duration of flight for each airline
duration_price_plot <- ggplot(sampled_training_data, aes(x = duration, y = price, color = airline)) +
    geom_point(alpha = 0.5) +
    facet_grid(. ~ class) +
    scale_color_brewer(palette = "Spectral") +
    labs(x = "Duration of Flight (Hours)", y = "Price (Indian Rupees)", color = "Airline") + 
    ggtitle("Price vs. Duration of Flight For Each Class") +
    theme(text = element_text(size = 15))

duration_price_plot

## Analysis

#### The first step of our analysis will be to build our model for multivariable linear regression:

Here we build a fairly standard linear regression model. We set the engine to "lm" and the mode to "regression", the standard choices for such a model. In the recipe we predict price from the combination of our 3 predictors, passing the training data since we are building the model. Finally, we add the recipe and the model specification to a workflow and fit it. The fitted model shows that for a ticket with zero duration and zero days left the predicted price is 53655.0; because Business is the reference level of class, this intercept corresponds to a business class ticket. For an economy ticket, 45785.9 is subtracted from that value, which gives the base price of an economy ticket with 0 duration and 0 days left. The model also indicates that each additional hour of flight duration increases the predicted price by 195.1, and each additional day left until departure decreases it by 137.8.

In [None]:
# flight model specification
flight_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

# flight model recipe
flight_recipe <- recipe(price ~ duration + days_left + class, data = sampled_training_data)

# flight model fit
flight_fit <- workflow() |>
  add_recipe(flight_recipe) |>
  add_model(flight_spec) |>
  fit(data = sampled_training_data)

flight_fit
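To illustrate why the intercept corresponds to a business class ticket, the sketch below shows how R encodes a two-level categorical predictor under its default treatment contrasts. The data frame `toy_classes` is hypothetical; the point is only that the alphabetically first level (Business) becomes the reference and the other level gets a 0/1 indicator column.

```r
# Hypothetical three-row example of the class predictor
toy_classes <- data.frame(
  class = factor(c("Business", "Economy", "Economy"))
)

# model.matrix() builds the design matrix a linear model would use:
# an intercept column plus a 0/1 indicator for the non-reference level
design <- model.matrix(~ class, data = toy_classes)
design
```

The resulting matrix has a `classEconomy` column that is 0 for the Business row and 1 for the Economy rows, so the coefficient on that column is the difference between an economy ticket and an otherwise identical business ticket.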

#### The next step is to test the model:

Here we test our model. We use the predict() function to make predictions on the sampled_testing_data. The bind_cols() function takes the predictions and appends them as a new column to the sampled_testing_data. The metrics() function calculates performance metrics for the model; it takes the actual prices as the truth argument and the predictions as the estimate argument. As we can see from the results, the Root Mean Squared Error (RMSE) is 7492.0728071, which means that the typical size of the error between the predicted and actual values, with large errors weighted more heavily, is almost 7500 rupees. Furthermore, the R-squared (rsq) value is 0.8903964, which means that approximately 89% of the variability in flight prices can be explained by the model. The final value is the Mean Absolute Error (mae), which is 4197.7783340; this tells us that, on average, the model's predictions are about 4197.78 rupees off from the actual flight prices.

In [None]:
flight_results <- flight_fit |>
  predict(sampled_testing_data) |>
  bind_cols(sampled_testing_data) |>
  metrics(truth = price, estimate = .pred)

flight_results
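To show what these three metrics measure, here is a small base R sketch that computes them by hand on made-up numbers. The vectors `truth` and `estimate` are hypothetical and unrelated to the flight data; R-squared is computed as the squared correlation between actual and predicted values, which is how the rsq metric is defined in yardstick.

```r
# Hypothetical actual and predicted values
truth <- c(100, 200, 300, 400)
estimate <- c(110, 190, 330, 380)

# Prediction errors
errors <- truth - estimate

# RMSE: square the errors, average them, then take the square root
rmse_value <- sqrt(mean(errors^2))

# MAE: average of the absolute errors
mae_value <- mean(abs(errors))

# R-squared: squared correlation between truth and estimate
rsq_value <- cor(truth, estimate)^2

cat("RMSE:", rmse_value, "\n")
cat("MAE:", mae_value, "\n")
cat("R-squared:", rsq_value, "\n")
```

Because RMSE squares the errors before averaging, the single error of 30 pulls it above the MAE of 17.5. The same effect explains why the RMSE reported for our model is noticeably larger than its MAE.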

#### Extracting slope values:

Here we extract the coefficients needed to write out the equation for our model, using the values in the estimate column. To obtain these values we call extract_fit_parsnip() on our fitted workflow, which returns the underlying model fit, and then use tidy() to convert it into a tidy data frame. From this information we can establish the following mathematical equation to describe the prediction plane: 

$$\text{flight\_fare} = 53654.9899 + 195.1083 \times (\text{duration (hours)}) - 137.7886 \times (\text{days\_left}) - 45785.8799 \times (\text{class of ticket} \, (0 \, \text{for business}, \, 1 \, \text{for economy}))$$

In [None]:
mcoeffs <- flight_fit |>
             extract_fit_parsnip() |>
             tidy()

mcoeffs
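As a worked example of the equation above, the following sketch hard-codes the reported coefficients into a small helper function. The function name `predict_fare` and its argument names are hypothetical and used only for illustration; in practice, predictions should come from predict() on the fitted workflow.

```r
# Hypothetical helper that evaluates the fitted equation by hand,
# using the coefficient estimates reported above.
predict_fare <- function(duration_hours, days_left, economy) {
  53654.9899 +
    195.1083 * duration_hours -
    137.7886 * days_left -
    45785.8799 * as.numeric(economy)
}

# A 2-hour flight booked 30 days ahead, in each class
economy_fare <- predict_fare(2, 30, economy = TRUE)
business_fare <- predict_fare(2, 30, economy = FALSE)

cat("Economy:", economy_fare, "\n")    # about 4125.67 rupees
cat("Business:", business_fare, "\n")  # about 49911.55 rupees
```

The two results differ by exactly the class coefficient, 45785.8799 rupees, since the duration and days-left terms are identical for both tickets.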

## Visual Analysis

#### Our first step will be to create a predictions data frame we can pass on to the graphs:

Here we use the predict() function to make predictions on the sampled_testing_data. The bind_cols() function takes the predictions and appends them as a new column to the sampled_testing_data.

In [None]:
# Adds a predicted prices column to data frame
predicted_prices <- predict(flight_fit, sampled_testing_data) |>
    bind_cols(sampled_testing_data)

head(predicted_prices)
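One way to check whether errors differ between ticket classes is to compute the error metrics separately for each class. The sketch below does this on a tiny made-up table (`toy_predictions` and its values are hypothetical); the same group_by() and summarize() pattern could, in principle, be applied to a data frame of real predictions that has `class`, `price`, and `.pred` columns.

```r
library(dplyr)

# Hypothetical actual and predicted prices for two tickets per class
toy_predictions <- tibble(
  class = c("Business", "Business", "Economy", "Economy"),
  price = c(50000, 60000, 5000, 6000),
  .pred = c(53000, 56000, 5100, 5800)
)

# RMSE and MAE computed within each class
per_class_error <- toy_predictions |>
  group_by(class) |>
  summarize(rmse = sqrt(mean((price - .pred)^2)),
            mae = mean(abs(price - .pred)))

per_class_error
```

In this toy table the business rows have a much larger error than the economy rows, which is the kind of gap such a breakdown is meant to reveal.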

#### Plotting predicted prices against actual prices:

In this visualization, we directly plot the predicted prices against the actual prices for flight fares. We have plotted actual prices on the x-axis and predicted prices on the y-axis. The idea here is to visually analyze the accuracy of our model's predictions. We have used a dashed line with a slope of 1 to indicate perfect predictions. For points that lie on this line, the actual price and predicted price are identical, so the model made a perfect prediction. Ideally, we would want as many points as possible to be on or at least very close to the line. We plotted this using ggplot(): the scatter plot portion was drawn using geom_point(), while the line was created using geom_abline() with an intercept of 0 and a slope of 1.

In [None]:
# Plotting Actual vs. Predicted Prices
actual_vs_prediction_plot <- ggplot(predicted_prices, aes(x = price, y = .pred)) +
    geom_point(aes(color = class), alpha = 0.5) +
    geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black") +
    labs(x = "Actual Prices (Indian Rupees)", y = "Predicted Prices (Indian Rupees)",
         title = "Actual vs. Predicted Prices") +
    theme_minimal() +
    scale_color_brewer(palette = "Set1", name = "Class")

actual_vs_prediction_plot

#### We need to categorize the actual and predicted prices by type to plot them side-by-side:

Here we use pivot_longer to combine the actual and predicted prices into one column and then use mutate with ifelse to categorize each price as either an actual price or a predicted price.

In [None]:
# Categorizing actual and prediction prices as types
predicted_prices_categorized <- predicted_prices |>
  pivot_longer(cols = c(price, .pred),
               names_to = "type",
               values_to = "price") |>
  mutate(type = ifelse(type == "price", "Actual Prices", "Predicted Prices"))

head(predicted_prices_categorized)
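To see what this reshaping does, here is a minimal sketch on a two-row table. The tibble `toy_wide` is hypothetical, and the values column is named `amount` here purely to keep the toy example easy to read; each input row becomes two output rows, one for the actual price and one for the prediction.

```r
library(tidyr)
library(tibble)

# Hypothetical wide table: one row per flight, with actual and predicted price
toy_wide <- tibble(
  days_left = c(10, 20),
  price = c(5000, 4200),
  .pred = c(5300, 4100)
)

# Stack the two price columns into a single column,
# recording which column each value came from in `type`
toy_long <- toy_wide |>
  pivot_longer(cols = c(price, .pred),
               names_to = "type",
               values_to = "amount")

toy_long
```

The long table has four rows, and the `type` column alternates between "price" and ".pred", which is what lets a single plot be faceted into actual and predicted panels.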

#### Analyzing actual and predicted prices against the days remaining till departure:

Here we plot the actual prices and predicted prices against the number of days left till a flight departs. We use the categorizing we did above to plot price against days left and use facet_wrap() to make 2 graphs, one for each type, predicted and actual. We will cover our findings further in the discussion section below.

In [None]:
# Plotting Actual and Predicted Prices against the days remaining till departure
prices_vs_duration_till_depart_plot <- ggplot(predicted_prices_categorized, aes(x = days_left, y = price, color = class)) +
    geom_point(alpha = 0.5) +
    facet_wrap(~ type) +
    labs(x = "Days Left Until Departure", y = "Price (Indian Rupees)",
         title = "Actual and Predicted Prices vs. Days Till Departure") +
    theme_minimal() +
    scale_color_brewer(palette = "Set1", name = "Ticket Class")

prices_vs_duration_till_depart_plot

#### Analyzing actual and predicted prices against the duration of a flight:

Here we plot the actual prices and predicted prices against the duration of a flight. We use the same categorization here as well to plot price against duration, and similarly use facet_wrap() to make 2 graphs, one for each type, predicted and actual. We will also cover our findings further in the discussion section below.

In [None]:
# Plotting Actual and Predicted Prices against the duration of a flight side-by-side
prices_vs_duration_plot <- ggplot(predicted_prices_categorized, aes(x = duration, y = price, color = class)) +
    geom_point(alpha = 0.5) +
    facet_wrap(~ type) +
    labs(x = "Duration of Flight (Hours)", y = "Price (Indian Rupees)",
         title = "Actual and Predicted Prices vs. The Duration of a Flight") +
    theme_minimal() +
    scale_color_brewer(palette = "Set1", name = "Ticket Class")

prices_vs_duration_plot

## Analysis Discussion

#### Findings:

Our analysis produced some interesting results. From the beginning, our summary tables showed greater variability in the pricing of business class tickets, revealed by the wider range between minimum and maximum prices and the larger standard deviation. Our summary plots then revealed a substantial lack of consistency in patterns and correlations compared to economy class tickets, with much more spread and more outliers. We attribute this to the business class ticket market having less competition than the economy market. There are a few reasons for this. An initial finding was that of the 6 airlines included in this data set, only 2 offered business class tickets. Furthermore, they were the 2 largest airlines, since they had the highest flight counts. Because fewer airlines compete to sell business class tickets, these airlines may face less pressure to price consistently. The next cause of the variability in business class tickets can be attributed to the economic status of their consumers. Typically, these consumers have more money to spend, which can make them less concerned about pricing and can give airlines more freedom in how they price. Overall, we had a very difficult time predicting prices for business class tickets, while we were fairly successful with economy tickets.

#### Expectations vs Findings:

We were surprised by the performance of our model. Specifically, the model performed worse than we originally expected: the RMSE was almost 7500, a larger error than we would have hoped for. The MAE was nearly 4200, meaning that on average the model's predictions are about 4200 rupees off from the actual flight fares, which is very significant for a model meant to predict pricing. The R-squared value was about 0.89, meaning that approximately 89% of the variability in flight prices can be explained by the model. We found this encouraging, since it suggests the weaker error metrics can be attributed to specific factors in the data. Upon analyzing our results, we concluded that the decline in performance can be attributed to the variability of business class tickets. From the 3 analysis plots above we can see a clear difference between economy and business class tickets. Our predictions for economy fares were fairly accurate, and the side-by-side plots show that the predictions followed patterns very similar to the actual pricing. Where we believe our model took a hit was with the incorporation of business tickets. Their variability and inconsistency made it difficult for the model to predict pricing, which dropped the performance significantly. It was crucial to plot the results from our model to clearly see where it was going wrong. We found that if we only included economy class tickets, our model could be considered strong. Our model could be further improved by incorporating more predictors, which would help the performance for all predictions. Another possible improvement is using a larger sample size or the entire data set, although we expect the improvement from this to be marginal compared to the improvement from using more predictors.

#### Impact:

Despite the overall model not being as accurate as we expected, we believe that our model could be impactful if it were narrowed down to only predict pricing for economy class tickets. Airlines could use a model like ours to automatically price their tickets based on just a few characteristics of the flight itself. Our model is just a start: we use very few predictors and we are limited in the amount of data we can process. Airlines are often worth billions, so they have access to the technology and infrastructure required to enhance a model like this drastically. Our model doesn't have to be useful only to airlines; consumers can benefit from it as well. A consumer can use our model to find how much the flight they are looking at should typically be priced, and then use this information to determine whether they are getting a good deal or overpaying. Although the model's performance suffers on business class tickets, consumers of those tickets can still use it as a rough guide to what they should be paying, instead of accepting any price the airline offers. At the end of the day everyone would prefer to keep or make more money, which makes our model a potentially impactful tool for anyone in the airline ticket market.

#### Future Questions:

- To what extent can a model like this be improved? How well can it truly perform with the right infrastructure and technology?
- How can this model be better suited to being able to accurately predict the pricing of business class tickets?
- How would our results vary if we introduced more variables?
- How do other factors (fuel prices, time of the year, and events) impact flight fares?
- How do external events (pandemics, recessions, and disasters) impact flight fares?

## References

- Bathwal, S. (2022, February 25). Flight Price Prediction. Kaggle. https://www.kaggle.com/datasets/shubhambathwal/flight-price-prediction?resource=download 

- John Chambers et al. (n.d.). An Introduction to R. https://cran.r-project.org/doc/manuals/r-release/R-intro.html

- Multiple linear regression made simple. Stats and R. (n.d.). https://statsandr.com/blog/multiple-linear-regression-made-simple/#multiple-linear-regression

- AltexSoft. (2021, June 10). Dynamic pricing strategy for airlines. https://www.altexsoft.com/blog/dynamic-pricing-airlines/ 

## Special Notes:

- The dataset we used was already clean and in a tidy format, hence no cleaning or wrangling code is present.

- We are a group of 2 due to unfortunate circumstances for a prior member. We were told that we should continue our project as usual and that our group circumstance would be kept in mind while grading.