## ☑️ Part 1: Data exploration using SQL

- Complete the following questions
- Make sure you run the following code cells before you attempt any of the questions
- We will work with the `flights.db` database throughout this workbook

First, import the pandas and sqlite3 libraries and create a connection to the `flights.db` database, located in the `data` folder:

In [8]:
import pandas as pd
import sqlite3

conn = sqlite3.connect('flights_db.db')

A database might have multiple tables. It's a good idea to do an initial exploration of the database by first querying the `sqlite_master` table to see what tables are in the database.

Run the following code cell to show all the tables in the `flights.db` database:

In [9]:
query = """
SELECT name 
FROM sqlite_master 
WHERE type = 'table';
"""
df = pd.read_sql_query(query, conn)
df


Unnamed: 0,name
0,airports
1,airlines
2,routes


Run the following code cell to show the schema for each table in the `flights.db` database:

In [10]:
for table in ['airports','airlines','routes']:
    
    query = f"""
    SELECT sql 
    FROM sqlite_master 
    WHERE name = '{table}';
    """
    
    df = pd.read_sql_query(query, conn)
    # The query returns a single cell holding the CREATE TABLE statement
    print(df.values[0, 0])

CREATE TABLE airports (
[index] INTEGER,
  [id] TEXT,
  [name] TEXT,
  [city] TEXT,
  [country] TEXT,
  [code] TEXT,
  [icao] TEXT,
  [latitude] TEXT,
  [longitude] TEXT,
  [altitude] TEXT,
  [offset] TEXT,
  [dst] TEXT,
  [timezone] TEXT
)
CREATE TABLE airlines (
[index] INTEGER,
  [id] TEXT,
  [name] TEXT,
  [alias] TEXT,
  [iata] TEXT,
  [icao] TEXT,
  [callsign] TEXT,
  [country] TEXT,
  [active] TEXT
)
CREATE TABLE routes (
[index] INTEGER,
  [airline] TEXT,
  [airline_id] TEXT,
  [source] TEXT,
  [source_id] TEXT,
  [dest] TEXT,
  [dest_id] TEXT,
  [codeshare] TEXT,
  [stops] TEXT,
  [equipment] TEXT
)
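
As a complement to reading the `CREATE TABLE` text, SQLite's `PRAGMA table_info` returns the same column information as rows. The sketch below is illustrative only: it uses a hypothetical in-memory table named `airlines_demo` rather than `flights.db`, so the names and columns are assumptions.

```python
import sqlite3

# Hypothetical in-memory table standing in for one of the flights.db tables.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE airlines_demo (id TEXT, name TEXT, active TEXT)")

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = demo_conn.execute("PRAGMA table_info(airlines_demo)").fetchall()

# Map each column name to its declared type.
column_types = {row[1]: row[2] for row in columns}
print(column_types)

demo_conn.close()
```

A dictionary like this makes it easy to spot numeric-looking columns that were declared as TEXT.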


# SQL Statements

**Q1)** Refer to the `airlines` table and show all columns of the table:
- Then, show the first 5 rows using the pandas `head()` method
- Use `%%time` at the beginning of a cell to measure the query execution time

Please note you have been provided with the code for this question to carry out the necessary analysis work. Simply run the code cell to produce the desired results.

In [11]:
%%time

query = """
SELECT * 
FROM airlines;
"""

df = pd.read_sql_query(query, conn)
df.head()

CPU times: total: 31.2 ms
Wall time: 20.9 ms


Unnamed: 0,index,id,name,alias,iata,icao,callsign,country,active
0,0,1,Private flight,\N,-,,,,Y
1,1,2,135 Airways,\N,,GNL,GENERAL,United States,N
2,2,3,1Time Airline,\N,1T,RNX,NEXTIME,South Africa,Y
3,3,4,2 Sqn No 1 Elementary Flying Training School,\N,,WYT,,United Kingdom,N
4,4,5,213 Flight Unit,\N,,TFU,,Russia,N


In [5]:
%%time

df.head()

CPU times: user 136 µs, sys: 91 µs, total: 227 µs
Wall time: 237 µs


Unnamed: 0,index,id,name,alias,iata,icao,callsign,country,active
0,0,1,Private flight,\N,-,,,,Y
1,1,2,135 Airways,\N,,GNL,GENERAL,United States,N
2,2,3,1Time Airline,\N,1T,RNX,NEXTIME,South Africa,Y
3,3,4,2 Sqn No 1 Elementary Flying Training School,\N,,WYT,,United Kingdom,N
4,4,5,213 Flight Unit,\N,,TFU,,Russia,N


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6048 entries, 0 to 6047
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   index     6048 non-null   int64 
 1   id        6048 non-null   object
 2   name      6048 non-null   object
 3   alias     5615 non-null   object
 4   iata      1461 non-null   object
 5   icao      5961 non-null   object
 6   callsign  5305 non-null   object
 7   country   6033 non-null   object
 8   active    6048 non-null   object
dtypes: int64(1), object(8)
memory usage: 425.4+ KB


**Q2)** Refer to the `airlines` table. Repeat the query above, but this time return only 5 rows using the SQL `LIMIT` clause. Measure the time again using `%%time`. The query should run faster, because only 5 rows are loaded into the DataFrame instead of all 6,048.

Please note you have been provided with the code for this question to carry out the necessary analysis work. Simply run the code cell to produce the desired results.

In [6]:
%%time

query = """
SELECT * 
FROM airlines
LIMIT 5;
"""

df = pd.read_sql_query(query, conn)
df

CPU times: user 3.93 ms, sys: 0 ns, total: 3.93 ms
Wall time: 3.61 ms


Unnamed: 0,index,id,name,alias,iata,icao,callsign,country,active
0,0,1,Private flight,\N,-,,,,Y
1,1,2,135 Airways,\N,,GNL,GENERAL,United States,N
2,2,3,1Time Airline,\N,1T,RNX,NEXTIME,South Africa,Y
3,3,4,2 Sqn No 1 Elementary Flying Training School,\N,,WYT,,United Kingdom,N
4,4,5,213 Flight Unit,\N,,TFU,,Russia,N
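
The difference between **Q1** and **Q2** can be sketched without pandas. This is a minimal illustration on a hypothetical in-memory table (`airlines_demo`, filled with made-up rows), not on `flights.db`: the first approach fetches every row and slices in Python, the second lets SQLite stop after 5 rows.

```python
import sqlite3

# Hypothetical in-memory table with 1,000 made-up rows.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE airlines_demo (id INTEGER, name TEXT)")
demo_conn.executemany(
    "INSERT INTO airlines_demo VALUES (?, ?)",
    [(i, f"Airline {i}") for i in range(1000)],
)

# Approach 1: fetch everything, then keep 5 rows in Python (like df.head()).
all_rows = demo_conn.execute("SELECT * FROM airlines_demo").fetchall()
first_five_python = all_rows[:5]

# Approach 2: ask the database for only 5 rows.
first_five_sql = demo_conn.execute(
    "SELECT * FROM airlines_demo LIMIT 5"
).fetchall()

print(len(all_rows), len(first_five_python), len(first_five_sql))
demo_conn.close()
```

Both approaches end with 5 rows, but only the first one transfers the whole table before discarding most of it.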


**Q3)** Refer to the `country` column in the `airlines` table. Find 7 airlines from the United Kingdom.

- To extract all records that match `United Kingdom`, use the filter condition `country = 'United Kingdom'`

See below code syntax for some guidance:
```SQL
SELECT *
FROM airlines
WHERE <filter_condition>
LIMIT 7;
```

In [8]:
#add your code below
query = """
SELECT * 
FROM airlines
WHERE country = 'United Kingdom'
LIMIT 7;
"""

df = pd.read_sql_query(query, conn)
df

Unnamed: 0,index,id,name,alias,iata,icao,callsign,country,active
0,3,4,2 Sqn No 1 Elementary Flying Training School,\N,,WYT,,United Kingdom,N
1,7,8,247 Jet Ltd,\N,,TWF,CLOUD RUNNER,United Kingdom,N
2,15,16,Army Air Corps,\N,,AAC,ARMYAIR,United Kingdom,N
3,51,52,Avcard Services,\N,,ACC,,United Kingdom,N
4,58,59,Air Charter Service,\N,,ACV,,United Kingdom,N
5,76,77,Aero Dynamics,\N,,ADL,COTSWOLD,United Kingdom,N
6,104,105,Air Atlantique,\N,KI,AAG,ATLANTIC,United Kingdom,N


# Advanced Filtering with WHERE
## Predicates

**Q4)** Refer to the `airlines` table. Find airlines having `icao` equal to ACC or TWF. 

See below code syntax for some guidance:
```SQL
SELECT *
FROM airlines
WHERE icao IN <list_of_values>;
```
The `list_of_values` compared against the column with the `IN` operator should be enclosed in parentheses, e.g. `('ACC','TWF')`.

In [13]:
#add your code below
query = """
SELECT * 
FROM airlines
WHERE icao IN ('ACC','TWF');
"""


df = pd.read_sql_query(query, conn)
df

Unnamed: 0,index,id,name,alias,iata,icao,callsign,country,active
0,7,8,247 Jet Ltd,\N,,TWF,CLOUD RUNNER,United Kingdom,N
1,51,52,Avcard Services,\N,,ACC,,United Kingdom,N
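
When the list of values comes from Python rather than being typed into the query, one common pattern is to generate one `?` placeholder per value and pass the values separately. The sketch below assumes a hypothetical in-memory table `airlines_demo` holding a few rows copied from the output above; it is not the notebook's own method.

```python
import sqlite3

# Hypothetical in-memory table with a few rows taken from the airlines output.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE airlines_demo (name TEXT, icao TEXT)")
demo_conn.executemany(
    "INSERT INTO airlines_demo VALUES (?, ?)",
    [("247 Jet Ltd", "TWF"), ("Avcard Services", "ACC"), ("135 Airways", "GNL")],
)

wanted = ["ACC", "TWF"]

# Build "?, ?" so the values are bound as parameters, not pasted into the SQL.
placeholders = ", ".join("?" for _ in wanted)
in_query = (
    f"SELECT name, icao FROM airlines_demo "
    f"WHERE icao IN ({placeholders}) ORDER BY icao"
)

matches = demo_conn.execute(in_query, wanted).fetchall()
print(matches)
demo_conn.close()
```

With pandas, the same query string and list can be passed as `pd.read_sql_query(in_query, conn, params=wanted)`.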


**Q5)** Refer to the `name` column in the `airlines` table. Find 5 airlines whose names contain the word `Flight`.

See below code syntax for some guidance:
```SQL
SELECT *
FROM airlines
WHERE name LIKE <pattern>
LIMIT 5;
```
For example, the pattern `'%Airline%'` matches any string that contains "Airline". SQLite's `LIKE` is case-insensitive for ASCII letters by default, so uppercase and lowercase both match.

In [17]:
#add your code below
query = """
SELECT *
FROM airlines
WHERE name LIKE '%flight%'
LIMIT 5;
"""


df = pd.read_sql_query(query, conn)
df

Unnamed: 0,index,id,name,alias,iata,icao,callsign,country,active
0,0,1,Private flight,\N,-,,,,Y
1,4,5,213 Flight Unit,\N,,TFU,,Russia,N
2,5,6,223 Flight Unit State Airline,\N,,CHD,CHKALOVSK-AVIA,Russia,N
3,6,7,224th Flight Unit,\N,,TTF,CARGO UNIT,Russia,N
4,181,182,Aero Flight Service,\N,,AGY,FLIGHT GROUP,United States,N
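
The result above includes both `Private flight` and `213 Flight Unit` because `LIKE` ignores ASCII case in SQLite. The sketch below contrasts it with the case-sensitive `GLOB` operator, using a hypothetical in-memory table `airlines_demo` with three names taken from the output above.

```python
import sqlite3

# Hypothetical in-memory table with names taken from the airlines output.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE airlines_demo (name TEXT)")
demo_conn.executemany(
    "INSERT INTO airlines_demo VALUES (?)",
    [("Private flight",), ("213 Flight Unit",), ("135 Airways",)],
)

# LIKE uses % as the wildcard and ignores ASCII case by default.
like_matches = [
    row[0]
    for row in demo_conn.execute(
        "SELECT name FROM airlines_demo WHERE name LIKE '%flight%' ORDER BY name"
    )
]

# GLOB uses * as the wildcard and is case-sensitive.
glob_matches = [
    row[0]
    for row in demo_conn.execute(
        "SELECT name FROM airlines_demo WHERE name GLOB '*flight*' ORDER BY name"
    )
]

print(like_matches)
print(glob_matches)
demo_conn.close()
```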


**Q6)** Refer to the `name`, `active`, and `callsign` columns in the `airlines` table. Find 5 active airlines having a non-empty callsign value. 

- Look for all records where `active = 'Y'` and the `callsign` column `IS NOT NULL`

See below code syntax for some guidance:
```SQL
SELECT name, active, callsign
FROM airlines
WHERE <condition1> AND <condition2>
LIMIT 5;
```

In [18]:
#add your code below
query = """
SELECT name, active, callsign
FROM airlines
WHERE active = 'Y' AND callsign IS NOT NULL
LIMIT 5;

"""
df = pd.read_sql_query(query, conn)
df

Unnamed: 0,name,active,callsign
0,1Time Airline,Y,NEXTIME
1,40-Mile Air,Y,MILE-AIR
2,Ansett Australia,Y,ANSETT
3,Aigle Azur,Y,AIGLE AZUR
4,Aloha Airlines,Y,ALOHA
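
One caveat worth checking: `IS NOT NULL` excludes NULL values, but not empty strings. Whether `flights.db` stores any callsigns as empty strings is not shown here, so the sketch below uses a hypothetical in-memory table `airlines_demo` (the `Demo Air` row is invented) purely to illustrate the difference.

```python
import sqlite3

# Hypothetical in-memory table: one real callsign, one NULL, one empty string.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE airlines_demo (name TEXT, active TEXT, callsign TEXT)")
demo_conn.executemany(
    "INSERT INTO airlines_demo VALUES (?, ?, ?)",
    [
        ("1Time Airline", "Y", "NEXTIME"),
        ("Private flight", "Y", None),  # NULL callsign
        ("Demo Air", "Y", ""),          # empty-string callsign (invented row)
    ],
)

# IS NOT NULL keeps the empty string, because '' is a value, not NULL.
not_null = [
    row[0]
    for row in demo_conn.execute(
        "SELECT name FROM airlines_demo "
        "WHERE active = 'Y' AND callsign IS NOT NULL ORDER BY name"
    )
]

# Adding callsign <> '' also removes empty strings.
not_blank = [
    row[0]
    for row in demo_conn.execute(
        "SELECT name FROM airlines_demo "
        "WHERE active = 'Y' AND callsign IS NOT NULL AND callsign <> '' "
        "ORDER BY name"
    )
]

print(not_null)
print(not_blank)
demo_conn.close()
```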


# Sorting results

**Q7)** Refer to `name`, `country`, and `altitude` columns in the `airports` table. Find the 5 airports with the highest `altitude`.

See below code syntax for some guidance:
```SQL
SELECT name, country, altitude
FROM airports
ORDER BY <column_name> [ASC/DESC]
LIMIT 5;
```
Please note you have been provided with the code for this question to carry out the necessary analysis work. Simply run the code cell to produce the desired results.

In [19]:
query = """
SELECT name, country, altitude
FROM airports
ORDER BY altitude DESC
LIMIT 5;
"""

df = pd.read_sql_query(query, conn)
df

Unnamed: 0,name,country,altitude
0,Dauphin Barker,Canada,999
1,Akola,India,999
2,Flin Flon,Canada,997
3,Bellevue,France,997
4,Tiputini,Ecuador,997


Note that the `altitude` column in the `airports` table is stored as TEXT, so `ORDER BY` compares the values character by character rather than numerically.

As a result, the report above is misleading. In the following section, we will use the SQL `CAST()` function to correct the sort order.
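
The effect of sorting numbers stored as TEXT can be reproduced on a tiny example. The sketch below uses a hypothetical in-memory table `airports_demo` with three made-up altitude values; it is an illustration of the sorting behaviour, not a query against `flights.db`.

```python
import sqlite3

# Hypothetical in-memory table with altitudes stored as TEXT.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE airports_demo (name TEXT, altitude TEXT)")
demo_conn.executemany(
    "INSERT INTO airports_demo VALUES (?, ?)",
    [("A", "999"), ("B", "1500"), ("C", "25")],
)

# Text comparison: '999' > '25' > '1500', because '9' > '2' > '1'.
text_order = [
    row[0]
    for row in demo_conn.execute(
        "SELECT altitude FROM airports_demo ORDER BY altitude DESC"
    )
]

# Numeric comparison after casting the text to an integer.
numeric_order = [
    row[0]
    for row in demo_conn.execute(
        "SELECT altitude FROM airports_demo ORDER BY CAST(altitude AS INT) DESC"
    )
]

print(text_order)
print(numeric_order)
demo_conn.close()
```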

## ☑️ Part 2 - Column operations

**Q8)** Refer to the `airlines` table. How many airline names start with a digit between `0` and `9`?
- You can use `COUNT(*)` to count the number of rows returned

Please note you have been provided with the code for this question to carry out the necessary analysis work. Simply run the code cell to produce the desired results.

In [26]:
query = """
SELECT COUNT(*) 
FROM airlines 
WHERE SUBSTR(name, 1, 1) BETWEEN '0' AND '9';
"""
df = pd.read_sql_query(query, conn)
df

Unnamed: 0,COUNT(*)
0,16
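
An alternative way to express "starts with a digit" is SQLite's `GLOB` operator with a character class. The sketch below compares it with the `SUBSTR ... BETWEEN` form on a hypothetical in-memory table `airlines_demo` holding four names from the outputs above; it is one possible variant, not the notebook's provided solution.

```python
import sqlite3

# Hypothetical in-memory table with names taken from the airlines output.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE airlines_demo (name TEXT)")
demo_conn.executemany(
    "INSERT INTO airlines_demo VALUES (?)",
    [("135 Airways",), ("1Time Airline",), ("Private flight",), ("Aigle Azur",)],
)

# Form used in Q8: compare the first character against the range '0'..'9'.
substr_count = demo_conn.execute(
    "SELECT COUNT(*) FROM airlines_demo "
    "WHERE SUBSTR(name, 1, 1) BETWEEN '0' AND '9'"
).fetchone()[0]

# GLOB form: [0-9] matches one digit, * matches the rest of the name.
glob_count = demo_conn.execute(
    "SELECT COUNT(*) FROM airlines_demo WHERE name GLOB '[0-9]*'"
).fetchone()[0]

print(substr_count, glob_count)
demo_conn.close()
```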


**Q9)** Refer to the `country` column in the `airlines` table. How many countries have at least one airline?

- Consider the `IS NOT NULL` operator to filter out any NULL values in the `country` column

- Use the `DISTINCT` keyword to extract distinct values and the `COUNT()` function to count the distinct values in the `country` column

See below code syntax for some guidance:
```SQL
SELECT COUNT(DISTINCT(column_name))
FROM airlines 
WHERE <column_name> IS NOT NULL;
```

In [27]:
#add your code below
query = """
SELECT COUNT(DISTINCT(country))
FROM airlines
WHERE country IS NOT NULL;


"""

df = pd.read_sql_query(query, conn)
df

Unnamed: 0,COUNT(DISTINCT(country))
0,276
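
In SQLite, `COUNT(column)` and `COUNT(DISTINCT column)` skip NULL values on their own, so the `IS NOT NULL` filter makes the intent explicit rather than changing the count. The sketch below shows this on a hypothetical in-memory table `airlines_demo` with made-up rows.

```python
import sqlite3

# Hypothetical in-memory table: two countries, one duplicate, one NULL.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE airlines_demo (name TEXT, country TEXT)")
demo_conn.executemany(
    "INSERT INTO airlines_demo VALUES (?, ?)",
    [
        ("213 Flight Unit", "Russia"),
        ("224th Flight Unit", "Russia"),
        ("247 Jet Ltd", "United Kingdom"),
        ("Private flight", None),
    ],
)

total_rows = demo_conn.execute(
    "SELECT COUNT(*) FROM airlines_demo"
).fetchone()[0]

# NULL is ignored by COUNT(DISTINCT ...), with or without the filter.
distinct_unfiltered = demo_conn.execute(
    "SELECT COUNT(DISTINCT country) FROM airlines_demo"
).fetchone()[0]
distinct_filtered = demo_conn.execute(
    "SELECT COUNT(DISTINCT country) FROM airlines_demo WHERE country IS NOT NULL"
).fetchone()[0]

print(total_rows, distinct_unfiltered, distinct_filtered)
demo_conn.close()
```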


**Q10)** Refer to the `altitude` column in the `airports` table. Find the 5 airports with the highest `altitude`.

This question is similar to what you did before in **Q7**.

See the code syntax from **Q7** below:
```SQL
SELECT name, country, altitude
FROM airports
ORDER BY altitude DESC
LIMIT 5;
```
- In the `ORDER BY` clause, use the `CAST()` function to convert the `altitude` column to the `INT` data type, e.g. `CAST(altitude AS INT)`
- The values should now sort correctly

In [28]:
#add your code below
query = """
SELECT name, country, altitude
FROM airports
ORDER BY CAST(altitude AS INT) DESC
LIMIT 5;
"""

df = pd.read_sql_query(query, conn)
df

Rerun this cell to see the result: with the cast in place, the rows are ordered by numeric altitude, so the text-sorted `999` and `997` rows from **Q7** should no longer head the list.


**Q11)** Refer to the `name` and `altitude` columns in the `airports` table. Filter the data to get all airports from the `United Kingdom`.

See below code syntax for some guidance:
```SQL
SELECT name, altitude
FROM airports
WHERE country='United Kingdom';
```
Now use a `CASE` expression to create a new calculated column called `altitude2` that reflects the conditions below:
- Return "Low" if the altitude is lower than 100m
- Otherwise return "Medium" if the altitude is lower than 500m
- Otherwise return "High"

See below code syntax for some guidance:
```SQL
CASE
    WHEN CAST(altitude AS INT) < 100 THEN 'Low'
    WHEN CAST(altitude AS INT) < 500 THEN 'Medium'
    ELSE 'High'
END AS altitude2
```

In [37]:
#add your code below
query = """
SELECT name, altitude,
CASE
    WHEN CAST(altitude AS INT) < 100 THEN 'Low'
    WHEN CAST(altitude AS INT) < 500 THEN 'Medium'
    ELSE 'High'
END AS altitude2
FROM airports
WHERE country='United Kingdom';
"""

df = pd.read_sql_query(query, conn)
df

Unnamed: 0,name,altitude,altitude2
0,Belfast Intl,268,Medium
1,St Angelo,155,Medium
2,Belfast City,15,Low
3,City of Derry,22,Low
4,Birmingham,327,Medium
...,...,...,...
205,Queen Street Station,50,Low
206,Waterloo International,10,Low
207,Central Station,25,Low
208,Euston Station,89,Low
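
The `WHEN` branches of a `CASE` expression are tested in order and the first match wins, which is why the second branch needs no lower bound. The sketch below checks the boundary values on a hypothetical in-memory table `airports_demo` with made-up altitudes.

```python
import sqlite3

# Hypothetical in-memory table with altitudes on either side of each boundary.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE airports_demo (altitude TEXT)")
demo_conn.executemany(
    "INSERT INTO airports_demo VALUES (?)",
    [("99",), ("100",), ("499",), ("500",)],
)

# Branches are evaluated top to bottom; the first true condition is used.
bucket_query = """
SELECT altitude,
CASE
    WHEN CAST(altitude AS INT) < 100 THEN 'Low'
    WHEN CAST(altitude AS INT) < 500 THEN 'Medium'
    ELSE 'High'
END AS altitude2
FROM airports_demo
ORDER BY CAST(altitude AS INT);
"""
buckets = demo_conn.execute(bucket_query).fetchall()
print(buckets)
demo_conn.close()
```

An altitude of exactly 100 falls in "Medium" and exactly 500 falls in "High", since both comparisons are strict.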


**Q12)** Refer to the `airports` table. Now, calculate the total number of airports.

Please note you have been provided with the code for this question to carry out the necessary analysis work. Simply run the code cell to produce the desired results.

In [38]:
query = """
SELECT COUNT(*)
FROM airports;
"""

df = pd.read_sql_query(query, conn)
df

Unnamed: 0,COUNT(*)
0,8107


# Group metrics

**Q13)** Refer to the `city` column in the `airports` table to find the number of airports per city.

- Consider the `GROUP BY` clause to group data by the `city` column

See below code syntax for some guidance:
```SQL
SELECT column_name, COUNT(*)
FROM airports
GROUP BY <column_name>;
```

In [39]:
#add your code below
query = """
SELECT city, COUNT(*)
FROM airports
GROUP BY city
;
"""

df = pd.read_sql_query(query, conn)
df.head(5)

Unnamed: 0,city,COUNT(*)
0,108 Mile Ranch,1
1,Aaa,1
2,Aachen,2
3,Aalborg,1
4,Aalen-heidenheim,1
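
A grouped count is often easier to read when the largest groups come first. The sketch below adds a column alias and an `ORDER BY` to the same `GROUP BY` pattern, using a hypothetical in-memory table `airports_demo` with made-up rows; the alias `airport_count` is an assumption, not a name from the notebook.

```python
import sqlite3

# Hypothetical in-memory table: two airports in one city, one in another.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE airports_demo (name TEXT, city TEXT)")
demo_conn.executemany(
    "INSERT INTO airports_demo VALUES (?, ?)",
    [
        ("Airport 1", "Aachen"),
        ("Airport 2", "Aachen"),
        ("Airport 3", "Aalborg"),
    ],
)

# Count rows per city, then sort by the count (ties broken by city name).
grouped_query = """
SELECT city, COUNT(*) AS airport_count
FROM airports_demo
GROUP BY city
ORDER BY airport_count DESC, city;
"""
per_city = demo_conn.execute(grouped_query).fetchall()
print(per_city)
demo_conn.close()
```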
