# Airline Analysis
In this project, you’ll imagine that you work for a travel agency and need to know the ins and outs of airline prices for your clients. You want to make sure that you can find the best deal for your client and help them to understand how airline prices change based on different factors.


You decide to look into your favorite airline. The data include:


- miles: miles traveled on the flight
- passengers: number of passengers on the flight
- delay: take-off delay in minutes
- inflight_meal: is there a meal included in the flight?
- inflight_entertainment: are there free entertainment systems for each seat?
- inflight_wifi: is there complimentary wifi on the flight?
- day_of_week: day of the week of the flight
- weekend: did this flight take place on a weekend?
- coach_price: the average price paid for a coach ticket
- firstclass_price: the average price paid for first-class seats
- hours: how many hours the flight took
- redeye: was this flight a redeye (overnight)?


In this project, you’ll explore a dataset for the first time and get to know each of these features. Keep in mind that there’s no one right way to address each of these questions. The goal is simply to explore and get to know the data using whatever methods come to mind.


You will work in the script.py tab in the workspace. Note that there is a solution.py tab which contains solution code for the project. We highly recommend that you complete the project on your own without checking the solution, but feel free to take a look if you get stuck or if you want to compare answers when you’re done. Note that the solution code may take 15-20 seconds to run.


In order to get the plots to appear correctly in the workspace, you’ll need to show and then clear each plot before creating the next one using the following code:

In [None]:
plt.show() # Show the plot
plt.clf() # Clear the plot

Clearing the plot will not erase the plot from view; it just creates a new space for the following graphic.

# Univariate Analysis
##### 1. What do coach ticket prices look like? What are the high and low values? What would be considered average? Does $500 seem like a good price for a coach ticket?


`Hint` <br>
`To start, you could try making a histogram or a boxplot of coach_price using the seaborn histplot() or boxplot() function. Remember to show and clear the plot using:`

In [None]:
plt.show() # Show the plot
plt.clf() # Clear the plot

`After plotting, you could calculate the mean and median of this column using the pandas methods .mean() and .median().`

`Once you’ve created at least one visualization and calculated some summary statistics for the column, think about where $500 falls in the distribution of coach_price: Is it close to the mean or median (in the center of the histogram or box plot)? Or is it far away (in the tail of the histogram or box plot)?`
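One way to put a number on "where $500 falls" is to compute the share of tickets priced below it alongside the mean and median. The sketch below uses a synthetic price series rather than the real `flight.coach_price` column, and `share_below` is a hypothetical helper introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for flight.coach_price (made-up values, not the real data).
rng = np.random.default_rng(0)
coach_price = pd.Series(rng.normal(loc=375, scale=65, size=1_000), name="coach_price")


def share_below(prices: pd.Series, target: float) -> float:
    """Return the fraction of prices strictly below target."""
    return float((prices < target).mean())


print("mean:  ", round(coach_price.mean(), 2))
print("median:", round(coach_price.median(), 2))
print("share of tickets under $500:", round(share_below(coach_price, 500), 3))
```

A share close to 0.5 would place $500 near the center of the distribution, while a share close to 1 would place it in the upper tail.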

##### 2. Now visualize the coach ticket prices for flights that are 8 hours long. What are the high, low, and average prices for 8-hour-long flights? Does a $500 ticket seem more reasonable than before?


`Hint` <br>
`You can subset the data within the desired plotting function. For example, if we wanted to plot the histogram of coach flight prices for flights with 200 or fewer passengers, we would use this code:`

In [None]:
sns.histplot(flight.coach_price[flight.passengers <= 200])
plt.show() # Show the plot
plt.clf() # Clear the plot

`You can calculate the mean or median of a subset of data using a similar method:`

In [None]:
np.mean(flight.coach_price[flight.passengers <= 200])

`Once you’ve plotted coach ticket prices for flights that are 8 hours long and calculated some summary statistics, think about where $500 now falls in the distribution: Is it close to or far from the center of the plot? Is $500 closer to the summary statistics than it was before? If so, that would indicate a more typical, and therefore more reasonable, price.`
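The same boolean-mask idea can produce the low, high, and average prices in one step. This is a minimal sketch on a toy DataFrame with invented values; only the column names `hours` and `coach_price` come from the dataset description.

```python
import pandas as pd

# Toy rows standing in for flight.csv (values are invented for illustration).
toy = pd.DataFrame({
    "hours": [2, 8, 8, 3, 8, 5],
    "coach_price": [300.0, 420.0, 460.0, 310.0, 440.0, 380.0],
})

# A boolean mask with .loc selects coach prices for 8-hour flights only.
eight_hour_prices = toy.loc[toy["hours"] == 8, "coach_price"]

# Several summary statistics at once, returned as a labeled Series.
summary = eight_hour_prices.agg(["min", "max", "mean", "median"])
print(summary)
```

Running the same `agg` call on the unfiltered column gives a direct before-and-after comparison for the $500 question.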

##### 3. How are flight delay times distributed? Let’s say there is a short amount of time between two connecting flights, and a flight delay would put the client at risk of missing their connecting flight. You want to better understand how often there are large delays so you can correctly set up connecting flights. What kinds of delays are typical?


`Hint` <br>
`If you plot a histogram of flight delay times, you’ll see that this visualization is difficult to read because of extreme outliers. Try subsetting the data to only include flight delays at a lower, more reasonable value to be able to see the distribution. Use the method mentioned in the hint of Task 2 to subset your data to specific ranges.`

`It may take some trial-and-error to settle on a value as your cut-off, so you may have to try a few different values until one seems right.`

`After subsetting the data by delay times, we can see that a 10-minute delay is fairly common for this airline. You will want to keep that in consideration when setting up a connecting flight.`
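Rather than guessing cut-offs purely by eye, it can help to check quantiles and see how much data each candidate cut-off keeps. The delays below are invented, and `share_kept` is a hypothetical helper, so treat this as a sketch of the approach rather than a result about the airline.

```python
import pandas as pd

# Toy delays in minutes (invented): mostly small values plus two extreme outliers.
delay = pd.Series([0, 5, 9, 10, 10, 11, 12, 15, 30, 1450, 1500], name="delay")

# The median is barely affected by the outliers, so it shows where typical delays sit.
print(delay.quantile([0.25, 0.5, 0.75]))


def share_kept(values: pd.Series, cutoff: float) -> float:
    """Return the fraction of observations at or below the cutoff."""
    return float((values <= cutoff).mean())


# Compare how much of the data survives each candidate cut-off.
for cutoff in (20, 100, 500):
    print(cutoff, round(share_kept(delay, cutoff), 3))
```

A cut-off that keeps nearly all rows while dropping the extreme tail is usually a reasonable choice for the histogram.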

# Bivariate Analysis
##### 4. Create a visualization that shows the relationship between coach and first-class prices. What is the relationship between these two prices? Do flights with higher coach prices always have higher first-class prices as well?


`Hint` <br>
`If you make a simple scatterplot of coach prices against first-class prices, you will see that there are too many data points, which makes it difficult to see the nuanced relationship between these two features. You might try lowering the opacity of the points to see whether the data points are especially dense in a certain area. You could also try taking a random sample from the dataset to see what a less busy version of this plot looks like.`

`What might help most, though, is adding a LOWESS smoother to the plot. This can be done using the following code:`



In [None]:
sns.lmplot(x = x_var, y = y_var, data = flight, line_kws={'color': 'black'}, lowess=True)

`This shows the relationship between the features on the x- and y-axes. You can do this with either the full dataset or the random sample (which might make the image quicker to produce while keeping the same overall shape of the point cloud).`
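To check that a random sample keeps the same overall relationship, one option is to compare the correlation in the sample with the correlation in the full data. The sketch below builds a synthetic dataset with an invented linear relationship, so the numbers say nothing about the real flights; `frac` and `random_state` are standard `DataFrame.sample` parameters.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the flight data (invented numbers and relationship).
rng = np.random.default_rng(1)
coach = rng.normal(375, 65, size=5_000)
toy = pd.DataFrame({
    "coach_price": coach,
    "firstclass_price": 1_000 + 1.2 * coach + rng.normal(0, 40, size=5_000),
})

# frac draws a fixed proportion of rows; random_state makes the draw repeatable.
toy_sub = toy.sample(frac=0.1, random_state=42)

print("rows in sample:", len(toy_sub))
print("full correlation:  ", toy["coach_price"].corr(toy["firstclass_price"]))
print("sample correlation:", toy_sub["coach_price"].corr(toy_sub["firstclass_price"]))
```

If the two correlations are close, plotting the sample is unlikely to hide the main trend.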


##### 5. What is the relationship between coach prices and inflight features— inflight meal, inflight entertainment, and inflight WiFi? Which features are associated with the highest increase in price?


`Hint` <br>
`By the end of this task, you should have three separate histograms: one for each of the three in-flight features.`

`You might start exploring these features using histograms. However, regular histograms of coach prices won’t show the differences in price by whether or not the flight has certain features. One way you can distinguish the inflight features is by using hue. This will color the histogram by the individual feature and you can see the difference between the distributions.`

`You might also try using side-by-side boxplots for each inflight feature. This would show the difference in the median and spread between the flights that have an inflight feature and those that do not.`
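The visual comparison can be backed up with a grouped summary: the median coach price with each feature minus the median without it. The rows below are invented, and the `"Yes"`/`"No"` coding of the feature columns is an assumption about how `flight.csv` stores them.

```python
import pandas as pd

# Toy rows (invented); the "Yes"/"No" coding is an assumption about flight.csv.
toy = pd.DataFrame({
    "coach_price":            [300.0, 310.0, 320.0, 380.0, 390.0, 400.0],
    "inflight_meal":          ["No", "Yes", "No", "Yes", "No", "Yes"],
    "inflight_entertainment": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "inflight_wifi":          ["No", "No", "Yes", "Yes", "Yes", "Yes"],
})

features = ["inflight_meal", "inflight_entertainment", "inflight_wifi"]

# Median price with the feature minus median price without it.
median_gap = {}
for feature in features:
    medians = toy.groupby(feature)["coach_price"].median()
    median_gap[feature] = medians["Yes"] - medians["No"]

print(pd.Series(median_gap).sort_values(ascending=False))
```

A larger gap suggests a stronger association with price, though it does not show that the feature itself causes the higher price.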

##### 6. How does the number of passengers change in relation to the length of flights?


`Hint` <br>
`You might start with a scatterplot of hours and passengers, but you would see that there are too many points in the same place, making it difficult to get information from the plot. You might want to add jitter to help spread the points out and better understand density. If the plot is still too dense to really interpret, you might consider using a subset of data instead of the full dataset.`

`One thing you might notice at this point is that there are significantly fewer data points at 6 and 8 hours compared to the other hours. This is worth exploring further.`

`Another thing you might notice is that there is a break in the distribution of passengers around 180 (very few flights have around 180 passengers). You might consider exploring the data points with more than 180 passengers separately from the data points with fewer than 180 passengers to see if any trends emerge.`
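Both observations can be followed up with a couple of pandas calls: counting flights at each flight length, and splitting the data at the 180-passenger gap. This is a sketch on invented rows; only the column names `hours` and `passengers` come from the dataset description.

```python
import pandas as pd

# Toy rows (invented values).
toy = pd.DataFrame({
    "hours":      [1, 1, 2, 2, 2, 3, 6, 8],
    "passengers": [150, 205, 160, 210, 198, 170, 200, 155],
})

# How many flights fall at each flight length?
flights_per_hour = toy["hours"].value_counts().sort_index()
print(flights_per_hour)

# Split at the 180-passenger gap and summarize each group separately.
large = toy[toy["passengers"] > 180]
small = toy[toy["passengers"] < 180]
print(large.groupby("hours")["passengers"].mean())
print(small.groupby("hours")["passengers"].mean())
```

Plotting `large` and `small` separately may make any trend within each group easier to see than in the combined scatterplot.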

# Multivariate Analysis
##### 7. Visualize the relationship between coach and first-class prices on weekends compared to weekdays.


`Hint` <br>
`The scatterplot showing the relationship between coach and first-class prices doesn’t show the difference between weekend flights and weekday flights. Changing the color of points by weekend status using hue will help visualize this relationship.`

`As noted before, this is a really dense scatterplot, so you might consider using a subset of data to make it easier to see relationships in the data.`

`We can see that on average, weekend tickets are more expensive than weekday tickets. However, based on this plot it seems like it’s easier to get a good deal on a first-class ticket on a weekday than on a weekend: the price difference between first-class and coach level tickets is larger on the weekend than on a weekday.`
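The gap between first-class and coach prices can also be computed directly and averaged by weekend status. The rows below are invented and deliberately simple, and the `"Yes"`/`"No"` coding of `weekend` is an assumption about the file, so the output illustrates the calculation only.

```python
import pandas as pd

# Toy rows (invented); the "Yes"/"No" coding of weekend is an assumption.
toy = pd.DataFrame({
    "weekend":          ["No", "No", "No", "Yes", "Yes", "Yes"],
    "coach_price":      [300.0, 350.0, 400.0, 400.0, 450.0, 500.0],
    "firstclass_price": [1300.0, 1350.0, 1400.0, 1600.0, 1650.0, 1700.0],
})

# Per-flight difference between the two fare classes.
price_gap = toy["firstclass_price"] - toy["coach_price"]

# Average gap for weekday flights versus weekend flights.
mean_gap = price_gap.groupby(toy["weekend"]).mean()
print(mean_gap)
```

Applying the same two lines to the real `flight` DataFrame would give a numeric check on what the colored scatterplot suggests.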

##### 8. How do coach prices differ for redeyes and non-redeyes on each day of the week?


`Hint` <br>
`A regular boxplot of coach prices by day of the week shows some relationship between weekday and weekend prices, but nothing about redeye flights. You can use hue to separate each day into two groups: redeyes and regular flights on that day of the week.`

`We can see more clearly that the difference between redeyes and non-redeyes is pretty much the same on any day of the week, though on average weekend flights cost more than weekday flights.`
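A pivot table gives a compact numeric companion to the grouped boxplot: median coach price for each day and redeye status, plus the difference between them. The rows are invented and the `"Yes"`/`"No"` coding of `redeye` is an assumption, so this is a sketch of the technique rather than a finding.

```python
import pandas as pd

# Toy rows (invented); the "Yes"/"No" coding of redeye is an assumption.
toy = pd.DataFrame({
    "day_of_week": ["Monday", "Monday", "Monday", "Monday",
                    "Saturday", "Saturday", "Saturday", "Saturday"],
    "redeye":      ["No", "No", "Yes", "Yes", "No", "No", "Yes", "Yes"],
    "coach_price": [350.0, 370.0, 330.0, 350.0, 430.0, 450.0, 410.0, 430.0],
})

# Rows are days, columns are redeye status, cells are median coach prices.
medians = toy.pivot_table(index="day_of_week", columns="redeye",
                          values="coach_price", aggfunc="median")

# Difference between regular and redeye medians on each day.
medians["difference"] = medians["No"] - medians["Yes"]
print(medians)
```

If the `difference` column is roughly constant across days, that supports the reading that the redeye discount does not depend on the day of the week.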

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels
import matplotlib.pyplot as plt
import math
import codecademylib3


## Read in Data
flight = pd.read_csv("flight.csv")
print(flight.head())

## Task 1
print(np.mean(flight.coach_price))
print(np.median(flight.coach_price))

sns.histplot(flight.coach_price)
plt.show()
plt.clf()

## Task 2
print(np.mean(flight.coach_price[flight.hours == 8]))
print(np.median(flight.coach_price[flight.hours == 8]))

sns.histplot(flight.coach_price[flight.hours == 8])
 
plt.show()
plt.clf()

## Task 3
sns.histplot(flight.delay[flight.delay <=500])
plt.show()
plt.clf()
 

## Task 4
perc = 0.1
flight_sub = flight.sample(n = int(flight.shape[0]*perc))
 
sns.lmplot(x = "coach_price", y = "firstclass_price", data = flight_sub, line_kws={'color': 'black'}, lowess=True)
plt.show()
plt.clf()

## Task 5
sns.histplot(flight, x = "coach_price", hue = "inflight_meal")
plt.show()
plt.clf()
sns.histplot(flight, x = "coach_price", hue = "inflight_entertainment")
plt.show()
plt.clf()
sns.histplot(flight, x = "coach_price", hue = "inflight_wifi")
plt.show()
plt.clf()

## Task 6
# Jitter spreads out overlapping points; small, transparent markers show density.
sns.lmplot(x = "hours", y = "passengers", data = flight_sub, x_jitter = .25, y_jitter = .25, scatter_kws={"s": 5, "alpha": 0.2}, fit_reg = False)
plt.show()
plt.clf()

## Task 7
sns.lmplot(x ='coach_price', y='firstclass_price', hue = 'weekend', data = flight_sub, fit_reg= False)
plt.show()
plt.clf()

## Task 8
sns.boxplot(x ='day_of_week', y='coach_price', hue = 'redeye', data = flight)
plt.show()
plt.clf()
