# Advanced Programming Summative Assignment

Please see the README file (README.md) for a full overview of the project, installation instructions and running instructions. 

## Contents

#### 1.0 [Package installations and import statements](#1.0-Package-installations-and-import-statements)

#### 2.0 [Data extraction and cleaning](#2.0-Data-extraction-and-cleaning)

2.1 [Read CSVs to dataframes and general data cleaning](#2.1-Read-CSVs-to-dataframes-and-general-data-cleaning)

2.2 [Merge airport and frequency dataframes](#2.2-Merge-airport-and-frequency-dataframes)

2.3 [Transform dataframes to JSON](#2.3-Transform-dataframes-to-JSON)

#### 3.0 [Load data to MySQL database](#3.0-Load-data-to-MySQL-database)


### 1.0 Package installations and import statements

In [19]:
# Install any missing packages into the active environment
import sys
%conda install --yes --prefix {sys.prefix} numpy
%conda install --yes --prefix {sys.prefix} pandas

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  - pandastable

Current channels:

  - https://repo.anaconda.com/pkgs/main/osx-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/osx-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.



Note: you may need to restart the kernel

In [20]:
# Import packages
# import os
import pandas as pd
import numpy as np
import json
import pprint
# import matplotlib.pyplot as plt
# import seaborn as sns
import tkinter as tk

### 2.0 Data extraction and cleaning

Most of the data processing is generic rather than tied to the original files, so the application can handle other datasets from the same source.

#### 2.1 Read CSVs to dataframes and general data cleaning

In [42]:
def load_date():
    # Extract data from CSV files
    df_runways = pd.read_csv('data/runways.csv', index_col=['id'])
    df_frequencies = pd.read_csv('data/airport-frequencies.csv', index_col=['id'])
    # Prevent pandas from replacing the 'continent' value 'NA' (for North America) with NaN
    df_airports = pd.read_csv('data/airports.csv', keep_default_na=False)
    return df_runways, df_frequencies, df_airports

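def remove_missing_data(df_runways, df_frequencies, df_airports):
    # Remove any rows with all data missing
    # inplace=True is required: dropna() otherwise returns a new dataframe that would be discarded
    df_airports.dropna(how='all', inplace=True)
    df_runways.dropna(how='all', inplace=True)
    df_frequencies.dropna(how='all', inplace=True)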

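def remove_duplicates(df_runways, df_frequencies, df_airports):
    # Remove any duplicated rows
    # inplace=True is required: drop_duplicates() otherwise returns a new dataframe that would be discarded
    df_airports.drop_duplicates(inplace=True)
    df_runways.drop_duplicates(inplace=True)
    df_frequencies.drop_duplicates(inplace=True)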

def remove_cols(df_runways, df_frequencies, df_airports):
    # Remove unneeded columns
    df_airports.drop(['keywords','home_link','local_code'], axis='columns', inplace=True)
    df_runways.drop(['airport_ident'], axis='columns', inplace=True)
    df_frequencies.drop(['airport_ident','description','type'], axis='columns', inplace=True)

def change_col_names(df_airports):
    # change the airport column id name to airport_ref to align with other data
    df_airports.rename(columns={"id": "airport_ref"}, inplace=True)

def remove_invalid(df_runways, df_frequencies, df_airports):
    # Remove rows that do not have a valid (all-digit) airport_ref
    # .copy() makes each result an independent dataframe rather than a slice of the original
    df_airports = df_airports[df_airports['airport_ref'].apply(lambda x: str(x).isdigit())].copy()
    df_runways = df_runways[df_runways['airport_ref'].apply(lambda x: str(x).isdigit())].copy()
    df_frequencies = df_frequencies[df_frequencies['airport_ref'].apply(lambda x: str(x).isdigit())].copy()
    return df_runways, df_frequencies, df_airports

def add_new_cols(df_airports):
    # Add binary (0/1) indicator columns to the airports dataframe for small, medium and large airports
    df_airports['small_airport'] = (df_airports['type'] == 'small_airport').astype(int)
    df_airports['medium_airport'] = (df_airports['type'] == 'medium_airport').astype(int)
    df_airports['large_airport'] = (df_airports['type'] == 'large_airport').astype(int)
    return df_airports

def remove_closed(df_airports):
    # Filter out closed airports; .copy() returns an independent dataframe rather than a slice
    df_airports = df_airports[df_airports['type'] != 'closed'].copy()
    return df_airports

def load_new_data():
    df_runways, df_frequencies, df_airports = load_date()
    remove_missing_data(df_runways, df_frequencies, df_airports)
    remove_duplicates(df_runways, df_frequencies, df_airports)
    remove_cols(df_runways, df_frequencies, df_airports)
    change_col_names(df_airports)
    # These functions return new dataframes, so their results must be assigned
    df_runways, df_frequencies, df_airports = remove_invalid(df_runways, df_frequencies, df_airports)
    df_airports = add_new_cols(df_airports)
    df_airports = remove_closed(df_airports)
    return df_runways, df_frequencies, df_airports
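
The `keep_default_na=False` argument used when reading the airports CSV can be illustrated with a small sketch. The two-row sample below is hypothetical and only mimics the `continent` column; it is not taken from the real `airports.csv`.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the 'continent' column of airports.csv
csv_text = "id,name,continent\n1,Sample Field,NA\n2,Example Strip,EU\n"

# Default parsing: the string 'NA' is interpreted as a missing value (NaN)
df_default = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False: 'NA' is kept as a literal string
df_kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

default_missing = int(df_default['continent'].isna().sum())
kept_values = df_kept['continent'].tolist()

print(default_missing)
print(kept_values)
```

One side effect worth keeping in mind: with `keep_default_na=False`, empty fields are read as empty strings rather than NaN, so `dropna(how='all')` is unlikely to remove anything from the airports dataframe.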


#### 2.2 Merge airport and frequency dataframes

In [37]:
def frequencies_to_list(df_frequencies):
    # Create an empty dict to hold one key for each airport with a nested list of frequencies 
    frequencies_dict = {}

    # Iterate through the frequencies creating one key for each airport with a list for its frequencies
    for index, row in df_frequencies.iterrows():
        if row['airport_ref'] not in frequencies_dict:
            frequencies_dict[row['airport_ref']] = [row['frequency_mhz']]
        else:
            frequencies_dict[row['airport_ref']].append(row['frequency_mhz'])
    return frequencies_dict

def merge_dataframes(frequencies_dict, df_airports):
    # Create a pandas series of frequency lists, indexed by airport
    df_frequencies_series = pd.Series(frequencies_dict, name='df_frequencies_series')
    # Convert the series to a dataframe
    df_airports_frequencies = df_frequencies_series.to_frame()
    # Rename the index and column to align with the other data
    df_airports_frequencies.index.name = 'airport_ref'
    df_airports_frequencies.rename(columns={'df_frequencies_series': 'frequency_mhz'}, inplace=True)
    # Left merge with the airports dataframe so that airports without frequencies are kept
    df_airports_frequencies = pd.merge(df_airports, df_airports_frequencies, on="airport_ref", how='left')
    return df_airports_frequencies

def merge_airports_frequencies(df_frequencies, df_airports):
    frequencies_dict = frequencies_to_list(df_frequencies)
    df_airports_frequencies = merge_dataframes(frequencies_dict, df_airports)
    return df_airports_frequencies
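
As a possible alternative to the row-by-row loop, the same "one list of frequencies per airport" structure could be built with `groupby`. This is only a sketch on hypothetical sample rows (the airport names and frequency values are made up), not the approach the notebook uses.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from airport-frequencies.csv and airports.csv
df_frequencies = pd.DataFrame({
    'airport_ref': [10, 10, 20],
    'frequency_mhz': [118.1, 121.9, 122.8],
})
df_airports = pd.DataFrame({
    'airport_ref': [10, 20, 30],
    'name': ['Airport A', 'Airport B', 'Airport C'],
})

# Collect each airport's frequencies into a list, keyed by airport_ref
frequency_lists = (
    df_frequencies.groupby('airport_ref')['frequency_mhz']
    .agg(list)
    .reset_index()
)

# Left merge keeps airports that have no frequencies (their value becomes NaN)
merged = pd.merge(df_airports, frequency_lists, on='airport_ref', how='left')
print(merged)
```

The left merge matters here: an inner merge would silently drop every airport that has no recorded frequency.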


#### 2.3 Transform dataframes to JSON

In [50]:
def transform_to_json(df_airports_frequencies, df_runways):
    # Serialise the combined airport and frequency data to a JSON string,
    # then parse it into a list of dicts (one per row)
    airports_frequencies_json = df_airports_frequencies.to_json(orient='records')
    airports_frequencies_json_list = json.loads(airports_frequencies_json)
    # Do the same for the runways data
    runways_json = df_runways.to_json(orient='records')
    runways_json_list = json.loads(runways_json)
    return airports_frequencies_json_list, runways_json_list
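
To show what `to_json(orient='records')` followed by `json.loads` produces, here is a minimal sketch on a hypothetical two-row dataframe. It also suggests why airports without frequencies end up with `'frequency_mhz': None`: the NaN left by the merge is serialised as JSON `null`.

```python
import json
import numpy as np
import pandas as pd

# Hypothetical rows: one airport with a frequency list, one without (NaN after a left merge)
df = pd.DataFrame({
    'airport_ref': [10, 30],
    'frequency_mhz': [[118.1, 121.9], np.nan],
})

# orient='records' yields one JSON object per row; json.loads turns the string into a list of dicts
records = json.loads(df.to_json(orient='records'))
print(records)
```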


In [51]:
def save_data(airports_frequencies_json_list, runways_json_list):
    with open('./saved-data/airports_frequencies.json', 'w') as outfile:
        json.dump(airports_frequencies_json_list, outfile)
    with open('./saved-data/runways_json_list.json', 'w') as outfile:
        json.dump(runways_json_list, outfile)

In [55]:
def load_data():
    with open('./saved-data/airports_frequencies.json', 'r') as infile:
        loaded_airports_frequencies = json.load(infile)
    with open('./saved-data/runways_json_list.json', 'r') as infile:
        loaded_runways = json.load(infile)
    return loaded_airports_frequencies, loaded_runways
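
The save and load steps can be checked with a small round trip. The sketch below writes to a temporary directory rather than `./saved-data` (an assumption made so it runs anywhere) and uses a single record trimmed from the sample airport shown in the output at the end of this notebook.

```python
import json
import os
import tempfile

# One record trimmed from the sample output; the remaining fields are omitted for brevity
records = [{'airport_ref': 323361, 'frequency_mhz': None, 'name': 'Aero B Ranch Airport'}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'airports_frequencies.json')
    # Write the list of dicts to disk as JSON
    with open(path, 'w') as outfile:
        json.dump(records, outfile)
    # Read it back into Python objects
    with open(path, 'r') as infile:
        reloaded = json.load(infile)

# The reloaded data should be identical to what was saved
round_trip_ok = reloaded == records
print(round_trip_ok)
```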

### 3.0 Load data to MySQL database

In [57]:
# GUI code

df_runways, df_frequencies, df_airports = load_new_data()
df_airports_frequencies = merge_airports_frequencies(df_frequencies, df_airports)
airports_frequencies_json_list, runways_json_list = transform_to_json(df_airports_frequencies, df_runways)
save_data(airports_frequencies_json_list, runways_json_list)
loaded_airports_frequencies, loaded_runways = load_data()
# Compare the same record before saving and after reloading
pprint.pprint(loaded_airports_frequencies[1])
print(type(loaded_airports_frequencies[1]))
print(" ")
pprint.pprint(airports_frequencies_json_list[1])
print(type(airports_frequencies_json_list[1]))


{'airport_ref': 323361,
 'continent': 'NA',
 'elevation_ft': '3435',
 'frequency_mhz': None,
 'gps_code': '00AA',
 'iata_code': '',
 'ident': '00AA',
 'iso_country': 'US',
 'iso_region': 'US-KS',
 'large_airport': 0,
 'latitude_deg': 38.704022,
 'longitude_deg': -101.473911,
 'medium_airport': 0,
 'municipality': 'Leoti',
 'name': 'Aero B Ranch Airport',
 'scheduled_service': 'no',
 'small_airport': 1,
 'type': 'small_airport',
 'wikipedia_link': ''}
<class 'dict'>
 
{'airport_ref': 323361,
 'continent': 'NA',
 'elevation_ft': '3435',
 'frequency_mhz': None,
 'gps_code': '00AA',
 'iata_code': '',
 'ident': '00AA',
 'iso_country': 'US',
 'iso_region': 'US-KS',
 'large_airport': 0,
 'latitude_deg': 38.704022,
 'longitude_deg': -101.473911,
 'medium_airport': 0,
 'municipality': 'Leoti',
 'name': 'Aero B Ranch Airport',
 'scheduled_service': 'no',
 'small_airport': 1,
 'type': 'small_airport',
 'wikipedia_link': ''}
<class 'dict'>
