### WBS Coding School
___
# Data Engineering Project

This is a data engineering project, in which I use Python, MySQL and AWS Services to create and automatically update an online database.

It is a learning project, in which I practise several data engineering techniques, such as API calls and AWS Lambda functions.

The tasks are:
- [x] Collect data
- [x] Clean data
- [ ] Create a database
- [ ] Update the database with the latest data
- [ ] Move the data pipeline to the Cloud (AWS)

___

# Data Cleaning
This notebook cleans the previously loaded data.

### Table of contents:
- [Clean `cities`, `populations`](#clean_cities_populations)
- [Clean `weather`](#clean_weather)
- [Clean `airports` and `cities_airports`](#clean_airports)
- [Clean `arrivals`](#clean_arrivals)
- [Export Data](#export_data)

#### Import Libraries & Dataframes

In [None]:
import pandas as pd
import os

In [199]:
cities_df = pd.read_csv("dataframes/initial/cities_df.csv")
populations_df = pd.read_csv("dataframes/initial/populations_df.csv")
weather_df = pd.read_csv("dataframes/initial/weather_df.csv")
airports_df = pd.read_csv("dataframes/initial/airports_df.csv")
arrivals_df = pd.read_csv("dataframes/initial/arrivals_df.csv")

<a id="clean_cities_populations"></a>
## Clean `cities`, `populations`

In [200]:
def create_city_ids(cities_df: pd.DataFrame) -> pd.DataFrame:
    # Assign a 1-based city_id to every city, following the row order of cities_df
    city_ids = cities_df.reset_index(drop=True)
    city_ids["city_id"] = city_ids.index + 1
    return city_ids[["city_id", "city_name"]]

In [201]:
city_ids = create_city_ids(cities_df)
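
The ID assignment can be checked on a small frame. This is a minimal sketch on made-up sample data (the `sample_*` names and values are illustrative, not taken from the project files), showing that the IDs start at 1 even when the incoming index is not 0, 1, 2, ...

```python
import pandas as pd

# Hypothetical sample data with a deliberately non-default index
sample_cities = pd.DataFrame(
    {"city_name": ["Berlin", "London", "Paris"], "country": ["DE", "GB", "FR"]},
    index=[10, 20, 30],
)

# Same steps as create_city_ids: reset the index, then number the rows from 1
sample_city_ids = sample_cities.reset_index(drop=True)
sample_city_ids["city_id"] = sample_city_ids.index + 1
sample_city_ids = sample_city_ids[["city_id", "city_name"]]

print(sample_city_ids)
```

Because the IDs depend on row order, they only stay stable as long as `cities_df` is loaded in the same order each time.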

In [202]:
def clean_cities_populations(cities_df: pd.DataFrame, populations_df: pd.DataFrame, city_ids: pd.DataFrame):
    
    # Cities df for SQL:
    cities_df_clean = cities_df[["city_name", "country", "latitude", "longitude", "altitude"]].copy()

    # Populations df for SQL:
    populations_df_clean = populations_df.merge(city_ids)[["city_id", "city_name", "population"]].copy() # get city_id
    populations_df_clean["population"] = populations_df_clean["population"].str.replace(",", "", regex=False) # remove thousands separators
    populations_df_clean["population"] = pd.to_numeric(populations_df_clean["population"]).fillna(0).astype("int") # set NAs to 0, cast to int
    
    return cities_df_clean, populations_df_clean

In [203]:
cities_df_clean, populations_df_clean = clean_cities_populations(cities_df, populations_df, city_ids)
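
To see what the population cleaning does, here is a small sketch on invented values (the sample frame is hypothetical; the real values come from `populations_df.csv`). It assumes the population column holds strings with comma thousands separators, as the cleaning function above implies.

```python
import pandas as pd

# Hypothetical sample data: comma-separated strings and one missing value
sample_populations = pd.DataFrame({
    "city_name": ["Berlin", "London", "Atlantis"],
    "population": ["3,850,809", "8,799,800", None],
})

without_commas = sample_populations["population"].str.replace(",", "", regex=False)

# to_numeric turns the strings into numbers and keeps the missing value as NaN,
# which is then replaced by 0 before casting to int
sample_clean = pd.to_numeric(without_commas).fillna(0).astype("int")

print(sample_clean.tolist())
```

Note that encoding a missing population as 0 makes it indistinguishable from a real zero, so any later aggregation should treat 0 as "unknown".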

<a id="clean_weather"></a>
## Clean `weather`

In [204]:
def clean_weather(weather_df: pd.DataFrame, city_ids: pd.DataFrame):
    
    weather_df_clean = weather_df.copy().merge(city_ids, left_on="city", right_on="city_name") # get city_id
    columns_to_keep = ["city_id", "forecast_time", "outlook", "outlook_description", "temperature", "feels_like", "wind_speed"]
    weather_df_clean = weather_df_clean[columns_to_keep]

    # Change forecast_time column type:
    weather_df_clean["forecast_time"] = pd.to_datetime(weather_df_clean["forecast_time"])

    return weather_df_clean

In [205]:
weather_df_clean = clean_weather(weather_df, city_ids)
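
The default inner merge in `clean_weather` silently drops any forecast whose city name has no match in `city_ids`. One possible way to surface such rows is the `indicator` option of `merge`; the sketch below uses invented sample frames, including a deliberately misspelled city.

```python
import pandas as pd

# Hypothetical sample data; "Londn" is a deliberate typo
sample_city_ids = pd.DataFrame({"city_id": [1, 2], "city_name": ["Berlin", "London"]})
sample_weather = pd.DataFrame({
    "city": ["Berlin", "London", "Londn"],
    "forecast_time": ["2023-05-01 12:00:00", "2023-05-01 12:00:00", "2023-05-01 15:00:00"],
})

# A left merge keeps every forecast; _merge records whether a city_id was found
checked = sample_weather.merge(
    sample_city_ids, how="left", left_on="city", right_on="city_name", indicator=True
)
unmatched_cities = checked.loc[checked["_merge"] == "left_only", "city"].tolist()

print(unmatched_cities)
```

An empty list means every forecast row received a `city_id`.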

<a id="clean_airports"></a>
## Clean `airports` and `cities_airports`

In [206]:
def clean_airports(airports_df: pd.DataFrame):
    airports_df_clean = airports_df.copy().rename(columns={
        "icao": "airport_icao", 
        "name": "airport_name", 
        "location.lat": "latitude", 
        "location.lon": "longitude"
    })[["airport_icao", "airport_name", "latitude", "longitude"]]
    return airports_df_clean

In [207]:
airports_df_clean = clean_airports(airports_df)

In [208]:
def clean_cities_airports(airports_df: pd.DataFrame, city_ids: pd.DataFrame):
    cities_airports_df_clean = airports_df.copy().merge(
        city_ids, how="outer", left_on="municipalityName", right_on="city_name"
        )
    cities_airports_df_clean = cities_airports_df_clean[["icao", "city_id", "city_name", "name"]].rename(columns={
        "icao": "airport_icao", 
        "name": "airport_name"
        })
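    # Redhill Aerodrome has no matching city in city_ids, so it is assigned to London by hand.
    # Note: this relies on the row label (9) and London's city_id (5) staying the same.
    cities_airports_df_clean.loc[9,"city_name"] = "London"
    cities_airports_df_clean.loc[9,"city_id"] = 5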
    # Change city_id to int:
    cities_airports_df_clean["city_id"] = cities_airports_df_clean["city_id"].astype("int")

    return cities_airports_df_clean

In [209]:
cities_airports_df_clean = clean_cities_airports(airports_df, city_ids)
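
The manual fix for Redhill Aerodrome depends on a fixed row label and a fixed `city_id`. As an alternative, the override could be keyed on the ICAO code and the ID looked up by city name. The following is only a sketch on invented sample data: the `manual_city` mapping and the sample rows are assumptions, not part of the project.

```python
import pandas as pd

# Hypothetical sample data standing in for city_ids and airports_df
sample_city_ids = pd.DataFrame({"city_id": [1, 2], "city_name": ["Berlin", "London"]})
sample_airports = pd.DataFrame({
    "icao": ["EDDB", "EGLL", "EGKR"],
    "name": ["Berlin Brandenburg", "London Heathrow", "Redhill Aerodrome"],
    "municipalityName": ["Berlin", "London", "Redhill"],
})

linked = sample_airports.merge(
    sample_city_ids, how="left", left_on="municipalityName", right_on="city_name"
)

# Hypothetical manual overrides: airport ICAO code -> city name
manual_city = {"EGKR": "London"}
linked["city_name"] = linked["city_name"].fillna(linked["icao"].map(manual_city))

# Look up the city_id by name instead of hard-coding it
name_to_id = sample_city_ids.set_index("city_name")["city_id"]
linked["city_id"] = linked["city_id"].fillna(linked["city_name"].map(name_to_id)).astype("int")

print(linked[["icao", "city_id", "city_name"]])
```

This keeps working if the row order of the airports data or the numbering of the cities changes.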

<a id="clean_arrivals"></a>
## Clean `arrivals`

In [210]:
def clean_arrivals(arrivals_df: pd.DataFrame):
    arrivals_df_clean = arrivals_df.copy().rename(columns={
        "icao": "arrival_icao", 
        "arrival_time_local": "arrival_time", 
        "departing_airport_icao": "departure_icao"
    })[["flight_number", "arrival_icao", "arrival_time", "departure_icao"]]
    arrivals_df_clean["arrival_time"] = pd.to_datetime(arrivals_df_clean["arrival_time"].str[:-6]) # strip the UTC offset (e.g. +01:00), keeping the local time
    return arrivals_df_clean

In [211]:
arrivals_df_clean = clean_arrivals(arrivals_df)
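
Slicing off the last six characters keeps the local wall-clock time and discards the offset. The sketch below contrasts this with parsing the offset and converting to UTC. The timestamp strings are invented, and their exact format (with seconds) is an assumption about the raw data.

```python
import pandas as pd

# Hypothetical raw values in the style "local time + UTC offset"
raw_arrivals = pd.Series(["2023-05-01 14:35:00+01:00", "2023-05-01 09:10:00+02:00"])

# Approach used in clean_arrivals: drop the offset, keep naive local time
local_naive = pd.to_datetime(raw_arrivals.str[:-6])

# Alternative: parse the offset and normalise everything to UTC
as_utc = pd.to_datetime(raw_arrivals, utc=True)

print(local_naive.tolist())
print(as_utc.tolist())
```

Naive local times are convenient for display per airport, while UTC values are easier to compare across airports in different time zones.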

<a id="export_data"></a>
## Export Data

In [None]:
# Create directory
os.makedirs('dataframes/cleaned', exist_ok=True)

# Export cleaned data
cities_df_clean.to_csv("dataframes/cleaned/cities_df_clean.csv", index=False)
populations_df_clean.to_csv("dataframes/cleaned/populations_df_clean.csv", index=False)
weather_df_clean.to_csv("dataframes/cleaned/weather_df_clean.csv", index=False)
airports_df_clean.to_csv("dataframes/cleaned/airports_df_clean.csv", index=False)
cities_airports_df_clean.to_csv("dataframes/cleaned/cities_airports_df_clean.csv", index=False)
arrivals_df_clean.to_csv("dataframes/cleaned/arrivals_df_clean.csv", index=False)
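
CSV files do not store column types, so datetime columns such as `forecast_time` and `arrival_time` come back as plain strings unless they are parsed again on load. A minimal round-trip sketch, using a temporary directory and an invented one-row frame rather than the project files:

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row frame with a datetime column
sample_weather_clean = pd.DataFrame({
    "city_id": [1],
    "forecast_time": pd.to_datetime(["2023-05-01 12:00:00"]),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "weather_df_clean.csv")
    sample_weather_clean.to_csv(path, index=False)

    reloaded_plain = pd.read_csv(path)                                  # forecast_time is read as text
    reloaded_parsed = pd.read_csv(path, parse_dates=["forecast_time"])  # forecast_time is read as datetime

print(reloaded_plain.dtypes)
print(reloaded_parsed.dtypes)
```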