### WBS Coding School
___
# Data Engineering Project

This is a data engineering project in which I use Python, MySQL and AWS services to create and automatically update an online database.

It is a learning project in which I practise several data engineering techniques, such as API calls and AWS Lambda functions.

The tasks are:
- [x] Collect data
- [x] Clean data
- [x] Create a database
- [ ] Update the database with the latest data
- [ ] Move the data pipeline to the Cloud (AWS)

The project uses *Python*, *MySQL* and the AWS services *RDS*, *Lambda* and *EventBridge*.
___

# Create a Database
This script creates a MySQL database on an AWS RDS instance, creates all its tables and populates the static ones (`cities`, `airports`, `cities_airports`, `populations`).

### Table of contents:
- Import cleaned dataframes
- Connection
- Create database & empty tables
- Fill static tables

### Import cleaned dataframes

In [1]:
import pandas as pd
import config_file

## Dataframes that will become static tables:

In [2]:
cities_df_sql = pd.read_csv("dataframes/cleaned/cities_df_clean.csv")
populations_df_sql = pd.read_csv("dataframes/cleaned/populations_df_clean.csv")
airports_df_sql = pd.read_csv("dataframes/cleaned/airports_df_clean.csv")
cities_airports_df_sql = pd.read_csv("dataframes/cleaned/cities_airports_df_clean.csv")

# Connection

In [3]:
# Cloud MySQL server connection information:
user = config_file.AWS_DATABASE_USER
password = config_file.AWS_DATABASE_PASSWORD
host = config_file.AWS_DATABASE_HOST
port = config_file.AWS_DATABASE_PORT
schema = config_file.AWS_DATABASE_SCHEMA
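
The notebook does not show what `config_file` contains. As a minimal sketch, assuming the same five setting names are available as environment variables, the values could be collected into one dictionary whose keys match the keyword arguments of `pymysql.connect`. The function name `load_db_settings`, the fallback values and the example values are hypothetical.

```python
import os


def load_db_settings(env=os.environ):
    """Collect the connection settings from environment variables."""
    return {
        "user": env["AWS_DATABASE_USER"],
        "password": env["AWS_DATABASE_PASSWORD"],
        "host": env["AWS_DATABASE_HOST"],
        # The port should be an integer; 3306 is the MySQL default.
        "port": int(env.get("AWS_DATABASE_PORT", 3306)),
        # Assumed fallback: the schema name used in this notebook.
        "database": env.get("AWS_DATABASE_SCHEMA", "gans"),
    }


# Placeholder values for illustration only, not real credentials.
example_env = {
    "AWS_DATABASE_USER": "admin",
    "AWS_DATABASE_PASSWORD": "not-a-real-password",
    "AWS_DATABASE_HOST": "example-host.invalid",
    "AWS_DATABASE_PORT": "3307",
}
settings = load_db_settings(example_env)
print(settings["port"], settings["database"])
```

With such a dictionary, a connection could be opened as `pymysql.connect(**settings)`, which also passes the port along.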

# Create database & empty tables

In [4]:
# Create a list of queries for creating the database and its empty tables:

sql_queries_create_database = [
    'CREATE DATABASE IF NOT EXISTS gans;',
    
    'USE gans;',
    
    '''
    CREATE TABLE cities (
        city_id INT AUTO_INCREMENT,
        city_name VARCHAR(255),
        country VARCHAR(255),
        latitude DECIMAL(8,5),
        longitude DECIMAL(8,5),
        altitude INT,
        PRIMARY KEY (city_id)
    );
    ''',
    
    '''
    CREATE TABLE populations (
        city_id INT,
        city_name VARCHAR(255),
        population INT,
        PRIMARY KEY (city_id, population),
        FOREIGN KEY (city_id) REFERENCES cities(city_id)
    );
    ''',
    
    '''
    CREATE TABLE weather (
        weather_id INT AUTO_INCREMENT,
        city_id INT,
        forecast_time DATETIME,
        outlook VARCHAR(255),
        outlook_description VARCHAR(255),
        temperature DECIMAL(4,2),
        feels_like DECIMAL(4,2),
        wind_speed DECIMAL(4,2),
        PRIMARY KEY (weather_id),
        FOREIGN KEY (city_id) REFERENCES cities(city_id)
    );
    ''',

    '''
    CREATE TABLE airports (
        airport_icao VARCHAR(4),
        airport_name VARCHAR(255),
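        -- DECIMAL(8,5) keeps five decimal places; a bare DECIMAL means DECIMAL(10,0) and stores whole degrees only
        latitude DECIMAL(8,5),
        longitude DECIMAL(8,5),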
        PRIMARY KEY (airport_icao)
    );
    ''',

    '''
    CREATE TABLE IF NOT EXISTS cities_airports (
        city_id INT NOT NULL,
        airport_icao VARCHAR(4) NOT NULL,
        city_name VARCHAR(255),
        airport_name VARCHAR(255),
        FOREIGN KEY (city_id) REFERENCES cities(city_id),
        FOREIGN KEY (airport_icao) REFERENCES airports(airport_icao)
    );
    ''',

    '''
    CREATE TABLE arrivals (
        flight_id INT NOT NULL AUTO_INCREMENT,
        flight_number VARCHAR(255),
        arrival_icao VARCHAR(4),
        arrival_time DATETIME,
        departure_icao VARCHAR(4),
        PRIMARY KEY (flight_id),
        FOREIGN KEY (arrival_icao) REFERENCES airports (airport_icao)
    );
    '''
]
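
The order of the statements in this list matters: with foreign key checks enabled, MySQL expects a referenced table to exist before a table that points to it is created. The following is a rough sketch of a check for that ordering. The helper `check_creation_order` is hypothetical, the regular expressions only cover the simple `CREATE TABLE ... REFERENCES ...` form used here, and the two demo statements are shortened versions of the real ones.

```python
import re


def check_creation_order(statements):
    """Return (table, referenced_table) pairs where the referenced table is not created yet."""
    created = set()
    problems = []
    for statement in statements:
        match = re.search(
            r"CREATE TABLE\s+(?:IF NOT EXISTS\s+)?(\w+)", statement, re.IGNORECASE
        )
        if match is None:
            continue  # e.g. CREATE DATABASE or USE
        table = match.group(1)
        for referenced in re.findall(r"REFERENCES\s+(\w+)", statement, re.IGNORECASE):
            if referenced not in created and referenced != table:
                problems.append((table, referenced))
        created.add(table)
    return problems


# Shortened demo statements (not the full table definitions).
demo_statements = [
    "CREATE TABLE cities (city_id INT, PRIMARY KEY (city_id));",
    "CREATE TABLE populations (city_id INT, "
    "FOREIGN KEY (city_id) REFERENCES cities(city_id));",
]

print(check_creation_order(demo_statements))        # correct order
print(check_creation_order(demo_statements[::-1]))  # reversed order
```

Passing `sql_queries_create_database` to the same function should return an empty list if every referenced table is created first.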

In [5]:
import pymysql

# Connect to the MySQL server and create a cursor
conn = pymysql.connect(host=host, user=user, password=password)
cursor = conn.cursor()

# Execute the queries
for query in sql_queries_create_database:
    cursor.execute(query)

# Commit the changes to the database
conn.commit()

# Close the cursor and connection
cursor.close()
conn.close()
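
If one of the statements fails, the cell above stops before `cursor.close()` and `conn.close()` are reached. A minimal sketch of a more defensive pattern follows. To keep it runnable without a database server it uses the standard-library `sqlite3` module as a stand-in for `pymysql`; both follow the Python DB-API, so the same structure should carry over. The helper name `run_statements` is hypothetical.

```python
import sqlite3
from contextlib import closing


def run_statements(conn, statements):
    """Execute every statement, commit on success, roll back on any error."""
    try:
        # closing() guarantees cursor.close() even if a statement raises.
        with closing(conn.cursor()) as cursor:
            for statement in statements:
                cursor.execute(statement)
        conn.commit()
    except Exception:
        conn.rollback()
        raise


# sqlite3 in-memory database as a stand-in for the MySQL server.
with closing(sqlite3.connect(":memory:")) as conn:
    run_statements(conn, [
        "CREATE TABLE cities (city_id INTEGER PRIMARY KEY, city_name TEXT)",
        "INSERT INTO cities (city_name) VALUES ('Berlin')",
    ])
    rows = conn.execute("SELECT city_id, city_name FROM cities").fetchall()

print(rows)
```

One caveat: in MySQL, statements such as `CREATE TABLE` commit implicitly, so a rollback would not undo tables that were already created before the failure.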


# Fill static tables
(cities, populations, airports, cities_airports)

In [6]:
# Dataframes for static tables (a notebook cell displays only the last expression):
cities_df_sql
populations_df_sql
airports_df_sql
cities_airports_df_sql

| Unnamed: 0 | airport_icao | city_id | city_name | airport_name |
|---|---|---|---|---|
| 0 | EDDB | 1 | Berlin | Berlin Brandenburg |
| 1 | EDDC | 2 | Dresden | Dresden |
| 2 | LEMD | 3 | Madrid | Madrid Adolfo Suárez –Barajas |
| 3 | RJTT | 4 | Tokyo | Tokyo |
| 4 | EGLC | 5 | London | London City |
| 5 | EGLL | 5 | London | London Heathrow |
| 6 | EGKK | 5 | London | London Gatwick |
| 7 | EGGW | 5 | London | London Luton |
| 8 | EGSS | 5 | London | London Stansted |
| 9 | EGKR | 5 | London | Redhill Aerodrome |


In [7]:
# Create a connection to the MySQL database
conn = pymysql.connect(host=host, user=user, password=password, database=schema)

# Create a cursor
cursor = conn.cursor()

# Insert data from the cities DataFrame into the cities table
sql_query_cities = """
INSERT INTO cities (city_name, country, latitude, longitude, altitude)
VALUES (%s, %s, %s, %s, %s)
"""
cities_tuples = list(cities_df_sql.itertuples(index=False, name=None))
cursor.executemany(query=sql_query_cities, args=cities_tuples)

# Fill populations table
sql_query_populations = """
INSERT INTO populations (city_id, city_name, population)
VALUES (%s, %s, %s)
"""
populations_tuples = list(populations_df_sql.itertuples(index=False, name=None))
cursor.executemany(query=sql_query_populations, args=populations_tuples)

# Fill airports table
sql_query_airports = """
INSERT INTO airports (airport_icao, airport_name, latitude, longitude)
VALUES (%s, %s, %s, %s)
"""
airports_tuples = list(airports_df_sql.itertuples(index=False, name=None))
cursor.executemany(query=sql_query_airports, args=airports_tuples)

# Fill cities_airports table
sql_query_cities_airports = """
INSERT INTO cities_airports (airport_icao, city_id, city_name, airport_name)
VALUES (%s, %s, %s, %s)
"""
cities_airports_tuples = list(cities_airports_df_sql.itertuples(index=False, name=None))
cursor.executemany(query=sql_query_cities_airports, args=cities_airports_tuples)

# Commit the changes to the database
conn.commit()

# Close the cursor and connection
cursor.close()
conn.close()
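
Each `executemany` call above relies on every tuple having exactly as many values as the query has `%s` placeholders. A CSV that was written together with its index gains an extra `Unnamed: 0` column when read back with `pd.read_csv`, which would add one value per row. The following sketch checks the widths before inserting. The helper `check_row_widths` is hypothetical and the rows are hand-written stand-ins for the output of `itertuples(index=False, name=None)`.

```python
def check_row_widths(query, rows):
    """Raise ValueError if a row's length differs from the number of %s placeholders."""
    expected = query.count("%s")
    for position, row in enumerate(rows):
        if len(row) != expected:
            raise ValueError(
                f"row {position} has {len(row)} values, query expects {expected}"
            )
    return expected


query = """
INSERT INTO cities_airports (airport_icao, city_id, city_name, airport_name)
VALUES (%s, %s, %s, %s)
"""

good_rows = [("EDDB", 1, "Berlin", "Berlin Brandenburg")]
# Same row with a leftover index value in front, as an extra column would produce.
bad_rows = [(0, "EDDB", 1, "Berlin", "Berlin Brandenburg")]

print(check_row_widths(query, good_rows))
try:
    check_row_widths(query, bad_rows)
except ValueError as error:
    print(error)
```

If the check fails for one of the dataframes, dropping the extra column before calling `itertuples` (or reading the CSV with `index_col=0`) would be one way to align the rows with the query.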


In [8]:
### TEST IF IT WORKS ###

# Connect to the MySQL server and create a cursor
conn = pymysql.connect(host=host, user=user, password=password, database=schema)
cursor = conn.cursor()

# Execute the query (a SELECT changes nothing, so no commit is needed)
cursor.execute("SELECT * FROM airports")

# Get the results:
results = cursor.fetchall()

# Close the cursor and connection
cursor.close()
conn.close()


# Print the query results
results

(('EDDB', 'Berlin Brandenburg', Decimal('52'), Decimal('13')),
 ('EDDC', 'Dresden ', Decimal('51'), Decimal('14')),
 ('EGGW', 'London Luton', Decimal('52'), Decimal('0')),
 ('EGKK', 'London Gatwick', Decimal('51'), Decimal('0')),
 ('EGKR', 'Redhill Aerodrome', Decimal('51'), Decimal('0')),
 ('EGLC', 'London City', Decimal('52'), Decimal('0')),
 ('EGLL', 'London Heathrow', Decimal('51'), Decimal('0')),
 ('EGSS', 'London Stansted', Decimal('52'), Decimal('0')),
 ('LEMD', 'Madrid Adolfo Suárez –Barajas', Decimal('40'), Decimal('-4')),
 ('RJTT', 'Tokyo ', Decimal('36'), Decimal('140')),
 ('ZSPD', 'Shanghai Pudong', Decimal('31'), Decimal('122')),
 ('ZSSS', 'Shanghai Hongqiao', Decimal('31'), Decimal('121')))
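
The coordinates in this output are whole numbers. That is what a column declared as plain `DECIMAL` produces, because MySQL treats it as `DECIMAL(10,0)` and rounds inserted values to zero decimal places. The sketch below imitates that rounding with Python's `decimal` module. The helper `store_as_decimal` is hypothetical and the input values are illustrative, not taken from the cleaned dataframes.

```python
from decimal import Decimal, ROUND_HALF_UP


def store_as_decimal(value, scale):
    """Round value to `scale` decimal places, roughly as a DECIMAL(p, scale) column would."""
    quantum = Decimal(10) ** -scale  # scale=0 -> 1, scale=5 -> 0.00001
    return Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)


# Illustrative coordinates (assumed values).
latitude = "52.36667"
longitude = "-3.56676"

print(store_as_decimal(latitude, 0), store_as_decimal(longitude, 0))
print(store_as_decimal(latitude, 5), store_as_decimal(longitude, 5))
```

A scale of 5, as in `DECIMAL(8,5)` for the `cities` table, keeps the fractional part and still leaves three digits before the decimal point, which is enough for longitudes up to 180.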