In [None]:
 # Python extension for interfacing with SQL and better table formatting with Pandas

# !pip install ipython-sql
# !pip install pandas

Our data is in a single CSV file. Let's investigate what it looks like and move it into a .db file.

In [2]:
# Load the libraries that we need, in a separate cell
import pandas as pd
import sqlite3

In [3]:
# Getting a view of the airfare CSV table
airfare_df = pd.read_csv('airfare_data.csv')
print(airfare_df.head())
print(airfare_df.info())

   Year  quarter  citymarketid_1  citymarketid_2  \
0  2009        2           32467           34576   
1  2000        4           30397           33198   
2  2007        4           32575           34614   
3  2004        4           32337           31650   
4  2008        4           30194           30559   

                                 city1                     city2  nsmiles  \
0        Miami, FL (Metropolitan Area)             Rochester, NY     1204   
1      Atlanta, GA (Metropolitan Area)           Kansas City, MO      692   
2  Los Angeles, CA (Metropolitan Area)        Salt Lake City, UT      590   
3                     Indianapolis, IN  Minneapolis/St. Paul, MN      503   
4                Dallas/Fort Worth, TX               Seattle, WA     1670   

   passengers    fare carrier_lg  large_ms  fare_lg carrier_low  lf_ms  \
0         203  151.46         FL      0.29   131.05          FL   0.29   
1         782  172.83         DL      0.63   194.71          NJ   0.26   
2 

This dataset records information about flights from one city to another over a set timeframe (one quarter of a year at a time). We know the cities, the distance, the number of passengers, which carriers were the most popular/cheapest, and some information about the cost of the flights.

From the wording of question 4, we know that 'carrier_low' is the carrier with the lowest fare and 'carrier_lg' is the carrier with the largest market share.

In [4]:
# Combining tables and making SQL database

# Make the SQLite database
conn = sqlite3.connect('airfare.db')

# Write DataFrames to SQLite tables in the database (don't need the dataframe indexes)
airfare_df.to_sql('airfare', conn, if_exists='replace', index=False)

# Close the connection
conn.close()

In [1]:
# Necessary in the Jupyter Notebook to load the SQL extension and connect to the database file to use SQL directly, currently using SQLite
# Formatting the SQL query outputs into a better format with Pandas

%load_ext sql
%sql sqlite:///airfare.db
%config SqlMagic.autopandas=True

Let's check our database and make sure everything worked properly

In [2]:
%%sql 
PRAGMA table_info(airfare);

 * sqlite:///airfare.db
Done.


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,Year,INTEGER,0,,0
1,1,quarter,INTEGER,0,,0
2,2,citymarketid_1,INTEGER,0,,0
3,3,citymarketid_2,INTEGER,0,,0
4,4,city1,TEXT,0,,0
5,5,city2,TEXT,0,,0
6,6,nsmiles,INTEGER,0,,0
7,7,passengers,INTEGER,0,,0
8,8,fare,REAL,0,,0
9,9,carrier_lg,TEXT,0,,0


The data types look correct and all the columns are still here, which is what we were looking for.
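The same check can be done in code rather than by eye. This is only a minimal sketch: it builds a small in-memory table with a hypothetical subset of the columns (not the real airfare.db) and compares the names returned by `PRAGMA table_info` against an expected list.

```python
import sqlite3

# Hypothetical subset of the expected columns, for illustration only.
expected = ["Year", "quarter", "city1", "city2", "fare"]

# Stand-in table; in the notebook this would be the real airfare.db file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE airfare (Year INTEGER, quarter INTEGER, "
    "city1 TEXT, city2 TEXT, fare REAL)"
)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
info = conn.execute("PRAGMA table_info(airfare)").fetchall()
conn.close()

# Map each column name to its declared type, then look for gaps.
actual = {name: col_type for _, name, col_type, *_ in info}
missing = [col for col in expected if col not in actual]

print("missing columns:", missing)
print("declared types:", actual)
```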

We can start with the questions then

1. What range of years are represented in the data?

In [4]:
%%sql
SELECT DISTINCT Year FROM airfare ORDER BY Year ASC;

 * sqlite:///airfare.db
Done.


Unnamed: 0,Year
0,1996
1,1997
2,1998
3,1999
4,2000
5,2001
6,2002
7,2003
8,2004
9,2005


2. What are the shortest and longest-distanced flights, and between which 2 cities are they?

In [13]:
%%sql
SELECT city1, city2, nsmiles
FROM airfare
WHERE nsmiles IN (
    SELECT MAX(nsmiles) FROM airfare
    UNION
    SELECT MIN(nsmiles) FROM airfare
)
-- Grouping by the pair covers the case where a city appears in both the min- and max-distance flights
-- It also collapses the same pair showing up in multiple years/quarters
GROUP BY city1, city2
ORDER BY nsmiles ASC;

 * sqlite:///airfare.db
Done.


Unnamed: 0,city1,city2,nsmiles
0,"Los Angeles, CA (Metropolitan Area)","San Diego, CA",109
1,"Miami, FL (Metropolitan Area)","Seattle, WA",2724


3. How many distinct cities are represented in the data (regardless of whether it is the source or destination)?

In [19]:
%%sql
SELECT 
    (SELECT COUNT(DISTINCT city1) FROM airfare) +
    (SELECT COUNT(DISTINCT city2) FROM airfare) AS total_cities;

 * sqlite:///airfare.db
Done.


Unnamed: 0,total_cities
0,266
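One caveat with the query above: adding the two counts includes a city twice whenever it appears as both a source and a destination, so the total may be an overcount. A possible alternative is to `UNION` the two columns first, since `UNION` drops duplicates, and then count the rows. The sketch below uses made-up toy rows in an in-memory database to show the difference; it does not touch the real data, so it says nothing about what the real number is.

```python
import sqlite3

# Toy data (made up): Boston, Chicago and Denver appear in both columns.
rows = [
    ("Boston, MA", "Chicago, IL"),
    ("Denver, CO", "Boston, MA"),
    ("Chicago, IL", "Denver, CO"),
    ("Austin, TX", "Boston, MA"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE airfare (city1 TEXT, city2 TEXT)")
conn.executemany("INSERT INTO airfare VALUES (?, ?)", rows)

# Approach from the cell above: add the two distinct counts.
summed = conn.execute(
    "SELECT (SELECT COUNT(DISTINCT city1) FROM airfare)"
    " + (SELECT COUNT(DISTINCT city2) FROM airfare)"
).fetchone()[0]

# Alternative: stack both columns with UNION (which removes duplicates),
# then count the remaining rows.
unioned = conn.execute(
    "SELECT COUNT(*) FROM ("
    " SELECT city1 AS city FROM airfare"
    " UNION"
    " SELECT city2 FROM airfare)"
).fetchone()[0]
conn.close()

print("sum of distinct counts:", summed)
print("distinct over both columns:", unioned)
```

On the toy rows the two approaches disagree because three of the four cities appear in both columns.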


4. Which airline appears most frequently as the carrier with the lowest fare (i.e. carrier_low)? How about the airline with the largest market share (i.e. carrier_lg)?

In [30]:
%%sql
-- Most freq lowest fare
SELECT carrier_low AS 'carrier_name', COUNT(*) AS 'frequency'
FROM airfare
GROUP BY carrier_low
ORDER BY frequency DESC
LIMIT 1;

 * sqlite:///airfare.db
Done.


Unnamed: 0,carrier_name,frequency
0,WN,29652


In [29]:
%%sql
-- Most freq highest market share
SELECT carrier_lg AS 'carrier_name', COUNT(*) AS 'frequency'
FROM airfare
GROUP BY carrier_lg
ORDER BY frequency DESC
LIMIT 1;

 * sqlite:///airfare.db
Done.


Unnamed: 0,carrier_name,frequency
0,WN,23659


5. How many instances are there where the carrier with the largest market share is not the carrier with the lowest fare? What is the average difference in fare?

We'll assume that 'fare_low' is the fare of the carrier with the lowest fare and 'fare_lg' is the fare of the carrier with the largest market share, while 'fare' may just be the average fare.

In [34]:
%%sql
SELECT COUNT(*), ROUND(AVG(fare_lg - fare_low), 2) AS 'avg_fare_diff'
FROM airfare
WHERE carrier_low != carrier_lg;

 * sqlite:///airfare.db
Done.


Unnamed: 0,COUNT(*),avg_fare_diff
0,59851,49.46


6. What is the percent change in average fare from 2007 to 2017 by flight? How about from 1997 to 2017?

Since it is 'by flight', we want the average of 'fare' (assumed to be the average fare for that quarter from city1 to city2) for that year, grouped by the city1/city2 pair.

In [48]:
%%sql
WITH avg_2007 AS (
    SELECT year, city1, city2, AVG(fare) AS "2007avg"
    FROM airfare
    WHERE Year = 2007
    GROUP BY city1, city2
),
avg_2017 AS (
    SELECT year, city1, city2, AVG(fare) AS "2017avg"
    FROM airfare
    WHERE Year = 2017
    GROUP BY city1, city2 
)
-- Need to use "" here to make sure it uses the column alias properly
SELECT avg_2007.city1, avg_2007.city2, "2007avg", "2017avg",
    ROUND((( "2017avg" - "2007avg" ) / "2007avg") * 100, 2) AS 'Percent Change'
FROM avg_2007
JOIN avg_2017 ON 
    avg_2007.city1 = avg_2017.city1 AND
    avg_2007.city2 = avg_2017.city2
ORDER BY 4 DESC;

 * sqlite:///airfare.db
Done.


Unnamed: 0,city1,city2,2007avg,2017avg,Percent Change
0,"Aspen, CO","New York City, NY (Metropolitan Area)",402.1800,540.4500,34.38
1,"Eagle, CO","New York City, NY (Metropolitan Area)",388.8900,506.1600,30.16
2,"Los Angeles, CA (Metropolitan Area)","New York City, NY (Metropolitan Area)",317.2550,376.7075,18.74
3,"Philadelphia, PA","San Francisco, CA (Metropolitan Area)",284.4775,365.5725,28.51
4,"San Francisco, CA (Metropolitan Area)","Washington, DC (Metropolitan Area)",300.1650,359.1050,19.64
...,...,...,...,...,...
975,"Atlantic City, NJ","Fort Myers, FL",121.8600,98.8800,-18.86
976,"Bellingham, WA","Las Vegas, NV",125.3450,96.6575,-22.89
977,"Allentown/Bethlehem/Easton, PA","Sanford, FL",104.4225,91.1975,-12.66
978,"Atlantic City, NJ","Miami, FL (Metropolitan Area)",129.2425,87.4125,-32.37


This gets us the change for each flight but we can also calculate the avg of this percent change

In [49]:
%%sql
WITH avg_2007 AS (
    SELECT year, city1, city2, AVG(fare) AS "2007avg"
    FROM airfare
    WHERE Year = 2007
    GROUP BY city1, city2
),
avg_2017 AS (
    SELECT year, city1, city2, AVG(fare) AS "2017avg"
    FROM airfare
    WHERE Year = 2017
    GROUP BY city1, city2 
),
change_avg AS ( 
-- Need to use "" here to make sure it uses the column alias properly
    SELECT avg_2007.city1, avg_2007.city2, "2007avg", "2017avg",
        (( "2017avg" - "2007avg" ) / "2007avg") * 100 AS 'percent_change'
    FROM avg_2007
    JOIN avg_2017 ON 
        avg_2007.city1 = avg_2017.city1 AND
        avg_2007.city2 = avg_2017.city2
    ORDER BY 4 DESC
)
SELECT ROUND(AVG(percent_change), 2) AS 'Average Change' FROM change_avg;

 * sqlite:///airfare.db
Done.


Unnamed: 0,Average Change
0,22.4
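Note that this number is the mean of the per-route percent changes, which is generally not the same as the percent change in the overall mean fare. Cheap routes with large relative jumps pull the first figure up more than the second. A small sketch with two hypothetical routes (the fares are made up) shows how far apart the two can be:

```python
# Hypothetical per-route average fares: (fare_2007, fare_2017)
routes = {"A-B": (100.0, 200.0), "C-D": (400.0, 400.0)}

# Mean of per-route percent changes (what the query above computes).
per_route = [(new - old) / old * 100 for old, new in routes.values()]
mean_of_changes = sum(per_route) / len(per_route)

# Percent change of the mean fare across routes.
mean_old = sum(old for old, _ in routes.values()) / len(routes)
mean_new = sum(new for _, new in routes.values()) / len(routes)
change_of_means = (mean_new - mean_old) / mean_old * 100

print(f"mean of per-route changes: {mean_of_changes:.1f}%")
print(f"change of the mean fare:   {change_of_means:.1f}%")
```

Which of the two is the better answer depends on how "by flight" in the question is read; the per-route version matches the reading used in this notebook.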


Now let's just get the average percent change from 1997 to 2017.

In [50]:
%%sql
WITH avg_1997 AS (
    SELECT year, city1, city2, AVG(fare) AS "1997avg"
    FROM airfare
    WHERE Year = 1997
    GROUP BY city1, city2
),
avg_2017 AS (
    SELECT year, city1, city2, AVG(fare) AS "2017avg"
    FROM airfare
    WHERE Year = 2017
    GROUP BY city1, city2 
),
change_avg AS ( 
-- Need to use "" here to make sure it uses the column alias properly
    SELECT avg_1997.city1, avg_1997.city2, "1997avg", "2017avg",
        (( "2017avg" - "1997avg" ) / "1997avg") * 100 AS 'percent_change'
    FROM avg_1997
    JOIN avg_2017 ON 
        avg_1997.city1 = avg_2017.city1 AND
        avg_1997.city2 = avg_2017.city2
    ORDER BY 4 DESC
)
SELECT ROUND(AVG(percent_change), 2) AS 'Average Change' FROM change_avg;

 * sqlite:///airfare.db
Done.


Unnamed: 0,Average Change
0,32.54


7. How would you describe the overall trend in airfares from 1997 to 2017, as compared to 2007 to 2017?

The average change from 1997 to 2017 was about 32.5%, and from 2007 to 2017 it was about 22.4%. This suggests that flights have been getting more expensive more rapidly in recent years than before. If the increase were linear, we would expect the 1997-2017 change to be about twice as big as the 2007-2017 change, but it is only about 45% bigger (32.5 / 22.4 ≈ 1.45). There could be other factors at play during the specific years chosen that made flights unusually cheap or expensive. The rest of the years would also have to be analyzed for a more concrete answer, but that is beyond the scope of these questions, so it isn't analyzed here.
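As a rough back-of-the-envelope check, the two figures can be turned into growth factors to estimate the implied change for 1997-2007. This is a sketch under a loud assumption: it treats the two averages as if they described the same set of routes, which is not strictly true, because each query only joins routes that have data in both of its years.

```python
# Average percent changes reported by the two queries above.
change_1997_2017 = 32.54
change_2007_2017 = 22.40

# Percent changes compound, so convert them to growth factors and divide
# to back out the implied factor for the first decade.
factor_20yr = 1 + change_1997_2017 / 100
factor_10yr = 1 + change_2007_2017 / 100
implied_1997_2007 = (factor_20yr / factor_10yr - 1) * 100

print(f"Implied 1997-2007 change: {implied_1997_2007:.2f}%")
```

Under that assumption the first decade accounts for a change of roughly 8%, which is consistent with the reading that most of the increase happened after 2007.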

8. What is the average fare for each quarter? Which quarter of the year has the highest overall average fare? lowest?

In [52]:
%%sql
WITH AVGFare AS (
    SELECT quarter, AVG(fare) AS 'avg_fare'
    FROM airfare
    GROUP BY quarter
)
SELECT * FROM AVGFare ORDER BY avg_fare DESC;


 * sqlite:///airfare.db
Done.


Unnamed: 0,quarter,avg_fare
0,1,195.790766
1,2,195.025648
2,3,191.310672
3,4,190.442071


It's a bit surprising that summer doesn't have a higher average, since a lot of people go on vacation then, but there are probably a lot of deals too, which may bring the average cost down.

It's also interesting that the average fare falls steadily through the year, from Q1 (highest) to Q4 (lowest).

9. Considering only the flights that have data available on all 4 quarters of the year, which quarter has the highest overall average fare? lowest? Try breaking it down by year as well.

I am only going to use the flight/year combinations where we have data for that flight in all four quarters of that year. For example, if a flight only has data in quarters 1 and 4 of 1999 and quarters 2 and 3 of 2007, I will not use either of those years for that flight.

In [6]:
%%sql
WITH AVGFare AS (
    SELECT quarter, AVG(fare) AS 'avg_fare'
    FROM airfare
    -- Find only the years where we have data in each quarter for a flight
    WHERE (city1, city2, year) IN (
        SELECT city1, city2, year
        FROM airfare
        GROUP BY city1, city2, year
        HAVING COUNT(*) = 4
    )
    GROUP BY quarter
)
SELECT * FROM AVGFare ORDER BY avg_fare DESC;

 * sqlite:///airfare.db
Done.


Unnamed: 0,quarter,avg_fare
0,1,195.422714
1,2,192.840047
2,4,189.570557
3,3,189.288765


I'm not sure whether the second part of the question (try breaking it down by year) is what I am already doing, or whether it is asking for the average fares for those years, so I'll calculate the averages by year just in case.

I'm going to continue using the same filtering method I used before.

In [8]:
%%sql
WITH AVGFare AS (
    SELECT year, AVG(fare) AS 'avg_fare'
    FROM airfare
    -- Find only the years where we have data in each quarter for a flight
    WHERE (city1, city2, year) IN (
        SELECT city1, city2, year
        FROM airfare
        GROUP BY city1, city2, year
        HAVING COUNT(*) = 4
    )
    GROUP BY year
)
SELECT * FROM AVGFare ORDER BY avg_fare DESC;

 * sqlite:///airfare.db
Done.


Unnamed: 0,year,avg_fare
0,2014,229.424639
1,2015,225.543353
2,2013,221.298361
3,2016,218.515525
4,2012,216.981949
5,2017,216.357799
6,2011,207.904742
7,2008,192.035186
8,2010,191.297082
9,2000,191.165825


We can also look back at the earlier question comparing 1997-2017 with 2007-2017. The 1990s actually sit in the middle of this ranking of average fares, and there was a bit of a crash in the mid-2000s, which may also explain why the % change from 2007 to 2017 was higher than expected. Remember, though, that only a subset of the data is being analyzed here.
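Another possible reading of "breaking it down by year" in question 9 is the average fare per quarter within each year. The sketch below shows one way that query could look, using made-up toy rows in an in-memory database. It also swaps `COUNT(*) = 4` for `COUNT(DISTINCT quarter) = 4`, which is an assumption on my part: it only matters if a flight could have more than one row for the same quarter.

```python
import sqlite3

# Toy rows (made up): (Year, quarter, city1, city2, fare)
rows = [
    (2000, 1, "A", "B", 100.0), (2000, 2, "A", "B", 110.0),
    (2000, 3, "A", "B", 120.0), (2000, 4, "A", "B", 130.0),
    (2000, 1, "A", "C", 300.0), (2000, 2, "A", "C", 300.0),  # incomplete year
    (2001, 1, "A", "C", 200.0), (2001, 2, "A", "C", 190.0),
    (2001, 3, "A", "C", 180.0), (2001, 4, "A", "C", 170.0),
    (2001, 1, "A", "B", 50.0),                               # incomplete year
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE airfare "
    "(Year INTEGER, quarter INTEGER, city1 TEXT, city2 TEXT, fare REAL)"
)
conn.executemany("INSERT INTO airfare VALUES (?, ?, ?, ?, ?)", rows)

query = """
WITH complete AS (
    -- flight/year combinations with data in all four quarters
    SELECT city1, city2, Year
    FROM airfare
    GROUP BY city1, city2, Year
    HAVING COUNT(DISTINCT quarter) = 4
)
SELECT a.Year, a.quarter, AVG(a.fare) AS avg_fare
FROM airfare AS a
JOIN complete AS c
  ON a.city1 = c.city1 AND a.city2 = c.city2 AND a.Year = c.Year
GROUP BY a.Year, a.quarter
ORDER BY a.Year, avg_fare DESC
"""
result = conn.execute(query).fetchall()
conn.close()

# Within each year, the first row is the most expensive quarter.
for year, quarter, avg_fare in result:
    print(year, quarter, avg_fare)
```

Within each year the rows come out from the most to the least expensive quarter, so the first and last row of each year answer the "highest" and "lowest" parts of the question for that year.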