Author: Dimitrios Tsesmelis
Email: tsesmelis.jim007@gmail.com
Phone: (+30) 6949550129
This work has been performed during the 4rth semester of my MSc in Big Data Management and Analytics and it consists the product of my master's thesis. This repository is a copy of the original one that can be found here.
A detailed report of the thesis is available as well as the presentation slides of the master's thesis defense.
PostgreSQL is one of the most promising and quickly evolving relational
database management systems. Being a fully extensible system, there are plenty of
projects that build functionalities on top of PostgreSQL. MobilityDB is such a tool
that enables users to efficiently store, manage and query moving object data, such
as data produced by fleets of vehicles.
Living in the era of big data technologies and cloud computing, it is vital for
cutting-edge database management systems to provide cloud native solutions that
allow data processing at scale. Citus is such a tool that transforms any PostgreSQL
server into a distributed database, without preventing any PostgreSQL native functionality.
Deploying a MobilityDB cluster using Citus on the cloud is rather a simple
task. It requires deep knowledge of handling the provided infrastructure, configuring
the network between the machines as well as time and effort to learn and
manage the peculiarities of each cloud provider. In addition, maintaining a scalable
system requires continuous monitoring of several factors and metrics that depict
the performance of the system. Such a task implies repeating human effort that is
sometime prone to errors.
In this work, we aim to target the aforementioned challenges by introducing
automation to the rolling out process of a MobilityDB cluster on Microsoft Azure
as well as to partially automate the management and maintenance of the deployed
solution. Moreover, we provide a tool, called autoscaler, that is capable of automatically
monitoring the database cluster, analyzing the collected performance metrics
and making decisions that adapt the size of the cluster, according to the measured
workload.
To assess the performance of our solution, we perform several experiments that
combine two different benchmarks, namely BerlinMOD and Scalar benchmarks.
The first provides a synthetic dataset that simulates the behavior of moving vehicles
across Berlin, while the second one is used to simulate a number of concurrent
user requests that query the system under test.
One major part of this work is to enable users to automatically deploy MobilityDB on Azure. Someone can use our software to roll out a MobilityDB cluster along with Citus extension with minimum human interaction. The image below depicts the cluster generation process.
After deploying the cluster, a Kubernetes deployment should be launched to initialize the corresponding Pods. The following image illustrates the final Kubernetes deployment.
The purpose of this section is to enable the user reuse the existing work.
This work combines different tools and technologies to create a self-managed database on the cloud. The following list includes the required components along with some links that assist the users to install and configure them.
- A local computer running Linux OS (tested with Ubuntu 20.04).
- A Microsoft Azure account with an active subscription attached to it. The user must have full access to the Azure resources (Owner).
- A Service Principal, created and configured for your Azure account. More details on how to create a Service Principal can be found here.
To deploy a MobilityDB cluster on Azure, follow the below steps:
- Clone the Github repository
- Configure the bash script under the path MobilityDB-Azure/automaticClusterDeployment/KubernetesCluster/deployK8SCluster.sh, by modifying the values of the parameters placed on the top of the file in the following way:
AzureUsername
parameter is used to login to your Azure account.- The default
ResourceGroupName
,Location
andVirtualNetwork
values can be used. Subscription
defines the name of the active Azure subscription.VMsNumber
determines the number of Worker nodes andVMsSize
the size of each machine.SSHPublicKeyPath
andSSHPrivateKeyPath
values specify the location of the ssh private and public keys to access the created VMs. By default, the files will be stored in ~/.ssh/ directory.Gitrepo
specifies the Github repository from which the installation scripts and the rest source files will be found. The default value should be used.Service_app_url
determines the url of the Service Principal andService_tenant
the tenant’s id. When executing the script, theClient secret
should be given by the user to authenticate the application in Azure.
- Execute the script by running
bash MobilityDB-Azure/automaticClusterDeployment/KubernetesCluster/deployK8SCluster.sh
. After around 15 minutes, the cluster will be deployed on Azure.
When the cluster is ready, you can access any machine using the ~/.ssh/id_rsa
key. The next step is to establish an ssh connection with the Coordinator VM. To connect to the machine, run ssh -i ~/.ssh/id_rsa azureuser@{vm_ip_address}
, where vm_ip_address is the public ip address of the Coordinator VM that can be found inAzure portal. When connected to the VM, you can confirm that the K8S cluster has been successfully initialized by executing sudo kubectl get nodes
.
Until now we have created a Kubernetes cluster on Azure. The purpose of this section is to deploy a PostgreSQL cluster with Citus and MobilityDB extensions installed. First we need to modify the provided configuration files:
- Edit the content of MobilityDB-Azure/KubernetesDeployment/postgres-secrets-params.yaml file by changing the values of the username and password. These credentials will be the default keys to access the PostgreSQL server. The values should be provided as base64-encoded strings. To get such an encoding, you can use the following shell command:
echo -n "postgres" | base64
. - Replace the content of the folder MobilityDB-Azure/KubernetesDeployment/secrets by creating your own SSL certificate that Citus will use to encrypt the database data. The existing files can be used for guidance.
- Edit the content of MobilityDB-Azure/KubernetesDeployment/postgres-deployment-workers.yaml by setting the replicas to be equal to the number of available worker machines that you want to create.
- Run
bash MobilityDB-Azure/KubernetesDeployment/scripts/startK8s.sh
to create the Kubernetes deployment. After some few minutes, the Pods will be created and ready to serve the database.
Now you are ready to connect to your distributed PostgreSQL cluster. After connecting to the Coordinator VM, execute the following shell command to ensure that the Pods are running, as show in the following screenshot : sudo kubectl get pods -o wide
. Normally, you should see one Pod hosting the citus-master and a number of citus-worker Pods, equal to the replica number that you defined before.
You can connect to the Citus Coordinator by using the public ip of the master VM as host name/address, 30001 as port, postgres as database and the username and password that you defined before. The default values are postgresadmin and admin1234, respectively. Try to execute some Citus or MobilityDB queries. For instance, run select master_get_active_worker_nodes()
to view the available Citus worker nodes.
Assuming we have successfully done all the previous step, we are now ready to turn the PostgreSQL database into a self-managed cluster. The process of monitoring and applying an auto-scaling mechanism is managed by a script, implemented as a Daemon process and written in Python 3. To execute the script, follow the step below:
- Replace the parameters on the top of the MobilityDB-Azure/autoscaling/scripts/addNewVms.sh file with the same parameters that you provided in MobilityDB-Azure/automaticClusterDeployment/KubernetesCluster/deployK8SCluster.sh.
- Execute
sudo -s
command on the Coordinator VM to get root access rights. - Create a virtual environment by running
python3 -m venv venv
and activate itsource venv/bin/activate
- Install the required packages by first running
pip install setuptools-rust
,export CRYPTOGRAPHY_DONT_BUILD_RUST=1
andpip install -r MobilityDB-Azure/autoscaling/requirements.txt
. - Export the following environment variables, by adjusting their values as follows:
export AZURE_SUBSCRIPTION_ID=...
,export AZURE_TENANT_ID=...
,export AZURE_CLIENT_ID=...
andexport AZURE_CLIENT_SECRET=...
by specifying the corresponding values from the Azure Service Principal.export RESOURCE_GROUP=...
with the Azure resource group name.export POSTGREDB=postgres
,export POSTGREUSER=...
andexport POSTGREPASSWORD=...
with the corresponding server credentials.export POSTGREPORT=30001
.export SCRIPTPATH=/home/azureuser/MobilityDB-Azure/autoscaling/scripts
, assuming you have cloned the source code into /home/azureuser path.- Copy the content of ~/.ssh/id_rsa.pub file from your local machine to the same path in the Coordinator VM.
- Finally, execute the following command to launch the auto-scaler:
python3 MobilityDB-Azure/autoscaling/AutoscalingDaemon.py --action start --minvm 2 --maxvm 8 --storage /home/azureuser/autolog --lower_th 70 --upper_th 30 --metric sessions
. You can get more information regarding the available parameters by runningpython3 MobilityDB-Azure/autoscaling/AutoscalingDaemon.py --help
. Note: the auto-scaler is a Daemon process hence, the script can be executed in the background. Information about the state of the auto-scaler can be found in /var/log/autoscaling.log and /var/log/autoscaling_performance.log log files.