In [None]:
# Install ipython-sql (SQL magics for the notebook) and pandas (better table formatting) if they are missing

# !pip install ipython-sql
# !pip install pandas

Our data is in a single CSV file. Let's see what it looks like, then move it into a .db file.

In [2]:
# Import the libraries we need (kept in a separate cell)
import pandas as pd
import sqlite3

In [3]:
# Preview the airfare CSV data
airfare_df = pd.read_csv('airfare_data.csv')
print(airfare_df.head())
# info() prints its summary itself and returns None, so it is not wrapped in print()
airfare_df.info()

   Year  quarter  citymarketid_1  citymarketid_2  \
0  2009        2           32467           34576   
1  2000        4           30397           33198   
2  2007        4           32575           34614   
3  2004        4           32337           31650   
4  2008        4           30194           30559   

                                 city1                     city2  nsmiles  \
0        Miami, FL (Metropolitan Area)             Rochester, NY     1204   
1      Atlanta, GA (Metropolitan Area)           Kansas City, MO      692   
2  Los Angeles, CA (Metropolitan Area)        Salt Lake City, UT      590   
3                     Indianapolis, IN  Minneapolis/St. Paul, MN      503   
4                Dallas/Fort Worth, TX               Seattle, WA     1670   

   passengers    fare carrier_lg  large_ms  fare_lg carrier_low  lf_ms  \
0         203  151.46         FL      0.29   131.05          FL   0.29   
1         782  172.83         DL      0.63   194.71          NJ   0.26   
2 

The data tells us roughly when the flights took place (year and quarter, not exact dates), where they flew to and from, the distance of the route, and the passenger count.

There is more information as well, but it is a bit confusing: there are 3 different fares, the carrier tied to each, and what I believe is each carrier's market share.

There aren't any nice descriptions for all of the variables but one of the later questions defines "carrier_low" as the carrier with the lowest fare and "carrier_lg" as the carrier with the largest market share. 

Once questions start appearing that deal with these columns, I'll further analyze what to do with them based on the question.

This dataset also doesn't have a primary key, or even an obvious composite key, but that shouldn't matter for this analysis.

In [4]:
# Build the SQLite database from the single airfare DataFrame

# Make the SQLite database
conn = sqlite3.connect('airfare.db')

# Write the DataFrame to a table in the database (the DataFrame index isn't needed)
airfare_df.to_sql('airfare', conn, if_exists='replace', index=False)

# Close the connection
conn.close()
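
A quick sanity check after a write like this is to reopen the database and count the rows in the new table. The sketch below is only illustrative: it uses an in-memory database and a hand-built three-column table as a stand-in for airfare.db (the five rows are taken from the head() output above), and `count_rows` is a hypothetical helper, not part of sqlite3 or pandas. With the real file, the idea would be to connect to 'airfare.db' and compare the result against len(airfare_df).

```python
import sqlite3
from contextlib import closing


def count_rows(conn, table_name):
    """Return the number of rows in table_name.

    The table name is interpolated into the query, so it must come from
    trusted code, never from user input.
    """
    with closing(conn.cursor()) as cur:
        cur.execute(f'SELECT COUNT(*) FROM "{table_name}"')
        return cur.fetchone()[0]


# Stand-in for airfare.db: an in-memory database with a cut-down airfare table.
# closing() guarantees the connection is closed even if a statement fails.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE airfare (Year INTEGER, quarter INTEGER, nsmiles INTEGER)")
    sample_rows = [
        (2009, 2, 1204),
        (2000, 4, 692),
        (2007, 4, 590),
        (2004, 4, 503),
        (2008, 4, 1670),
    ]
    conn.executemany("INSERT INTO airfare VALUES (?, ?, ?)", sample_rows)
    conn.commit()
    row_count = count_rows(conn, "airfare")

print(f"Rows in airfare table: {row_count}")
```

Wrapping the connection in `closing()` is a design choice rather than a requirement: it does the same job as the explicit conn.close() above, but also runs if an earlier statement raises.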

In [1]:
# Load the SQL extension and connect to the SQLite database file so we can run SQL directly in the notebook
# autopandas returns query results as pandas DataFrames, which display more cleanly

%load_ext sql
%sql sqlite:///airfare.db
%config SqlMagic.autopandas=True

Let's check our database and make sure everything worked properly.

In [2]:
%%sql 
PRAGMA table_info(airfare);

 * sqlite:///airfare.db
Done.


   cid  name            type     notnull  dflt_value  pk
0  0    Year            INTEGER  0                    0
1  1    quarter         INTEGER  0                    0
2  2    citymarketid_1  INTEGER  0                    0
3  3    citymarketid_2  INTEGER  0                    0
4  4    city1           TEXT     0                    0
5  5    city2           TEXT     0                    0
6  6    nsmiles         INTEGER  0                    0
7  7    passengers      INTEGER  0                    0
8  8    fare            REAL     0                    0
9  9    carrier_lg      TEXT     0                    0


The data types look correct and all the columns are still here, which is what we were looking for.
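
The pk column is 0 for every row, which matches the earlier note that the table has no primary key. If we ever wanted to test whether some combination of columns could serve as a composite key, a GROUP BY ... HAVING query would do it. The sketch below is a rough illustration under assumptions: the candidate key (Year, quarter, citymarketid_1, citymarketid_2) is a guess rather than anything the dataset documents, `find_duplicate_keys` is a hypothetical helper, and the table is an in-memory stand-in filled with the five rows from the head() output.

```python
import sqlite3
from contextlib import closing

# Assumed candidate key: one row per year, quarter, and city-market pair.
KEY_COLUMNS = ["Year", "quarter", "citymarketid_1", "citymarketid_2"]


def find_duplicate_keys(conn, table_name, key_columns):
    """Return (key values..., row count) for every key that occurs more than once."""
    cols = ", ".join(f'"{name}"' for name in key_columns)
    query = (
        f'SELECT {cols}, COUNT(*) AS n FROM "{table_name}" '
        f"GROUP BY {cols} HAVING COUNT(*) > 1"
    )
    return conn.execute(query).fetchall()


# In-memory stand-in for the airfare table, holding only the key columns.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute(
        "CREATE TABLE airfare (Year INTEGER, quarter INTEGER, "
        "citymarketid_1 INTEGER, citymarketid_2 INTEGER)"
    )
    head_rows = [
        (2009, 2, 32467, 34576),
        (2000, 4, 30397, 33198),
        (2007, 4, 32575, 34614),
        (2004, 4, 32337, 31650),
        (2008, 4, 30194, 30559),
    ]
    conn.executemany("INSERT INTO airfare VALUES (?, ?, ?, ?)", head_rows)
    duplicates = find_duplicate_keys(conn, "airfare", KEY_COLUMNS)

    # Insert a deliberate repeat of the first row to show what a violation looks like.
    conn.execute("INSERT INTO airfare VALUES (?, ?, ?, ?)", head_rows[0])
    duplicates_after_repeat = find_duplicate_keys(conn, "airfare", KEY_COLUMNS)

print(duplicates)               # empty list: no key is repeated in the sample rows
print(duplicates_after_repeat)  # one entry: the repeated key and its row count
```

An empty result on five sample rows says nothing about the full table; the same function would need to run against airfare.db before drawing any conclusion about a key.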

We can now start on the questions.