GitHub - KrzysiekDD/ASEIED-2023: Project for advanced data analysis classes. Terrain tiles elevation analysis

ASEIED 2023 project

Terrain tiles analysis for the western hemisphere
Krzysztof Dymanowski, Bartosz Janicki, Alan Bejnarowicz

Table of Contents

Project Overview
Solution process
Summary

Project Overview

This is a group project for the 2023 ASEIED class (Autonomous systems for exploring and analyzing data) at Gdańsk Tech. The goal of the course was to gain hands-on experience with AWS, especially with EMR and other big data tools used in the industry.

Problem formulation

"Perform data analysis containing information about the terrain elevation diversity, selecting groups of areas with the highest increase (North and South America continent). The elevation increase in a given location should be measured based on at least 10 measurement points. Determine 6 groups of areas based on the average value of elevation increase. Please plot the detected areas on the map."

Tech stack

The project was run in the AWS Learning Lab environment, in which each of us had $100 to spend on AWS services.

Dataset

The dataset used was the terrain tiles dataset:
https://registry.opendata.aws/terrain-tiles/ which is: "A global dataset providing bare-earth terrain heights, tiled for easy usage and provided on S3."

The specific bounding box used for the analysis was:

Coordinates	From	To
Latitude	72.711037	-55.554805
Longitude	-172.964981	-21.269288

The bounding box was found using the http://bboxfinder.com website.

Solution process

Our approach was to use the Mercator projection (https://en.wikipedia.org/wiki/Mercator_projection) to map the selected region's surface onto a 2D plane, which in turn allowed us to perform precise tile analysis of the elevation.
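To illustrate how a geographic bounding box maps to tiles under this projection, here is a minimal sketch of the standard Web Mercator ("slippy map") tile formula. The names `lonlat_to_tile` and `tiles_in_bounds` and the west/south/east/north ordering of `BOUNDS` are assumptions made for illustration; they are not taken from the project's notebook, whose `get_tiles` helper may work differently.

```python
import math
from typing import List, Tuple

Tile = Tuple[int, int, int]  # (zoom, x, y)


def lonlat_to_tile(lon: float, lat: float, zoom: int) -> Tuple[int, int]:
    """Return the (x, y) indices of the Web Mercator tile containing a point."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that points on the outer edge still map to a valid tile.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_in_bounds(zoom: int, west: float, south: float,
                    east: float, north: float) -> List[Tile]:
    """List every tile that intersects the bounding box (hypothetical helper)."""
    x_min, y_min = lonlat_to_tile(west, north, zoom)  # top-left corner
    x_max, y_max = lonlat_to_tile(east, south, zoom)  # bottom-right corner
    return [(zoom, x, y)
            for x in range(x_min, x_max + 1)
            for y in range(y_min, y_max + 1)]


# Bounding box from the Dataset section, in assumed west/south/east/north order.
BOUNDS = (-172.964981, -55.554805, -21.269288, 72.711037)
tiles = tiles_in_bounds(3, *BOUNDS)
```

Tile y indices grow southward, which is why the northern latitude yields the smaller y value. At zoom level 3 the whole world is an 8 x 8 grid, so the box above covers only a small number of tiles.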

Here are the most important steps of our pipeline:

We generate tile coordinates based on the geographic bounds and zoom level (bounds specified above, zoom level = 3): tiles: List[Tile] = get_tiles(ZOOM, *BOUNDS)
Based on the resulting list of (zoom level, x coordinate, y coordinate) tuples, we generate the links used to fetch the elevation data from S3: data_urls: List[str] = generate_links(tiles)
We proceed to load the data into a Spark DataFrame (specifying the image format): df = spark.read.format("image").load(data_urls)
The DataFrame is pruned to include only the 'origin' and 'data' columns (metadata and actual images respectively): df = df.select("image.origin", "image.data")
The DataFrame is converted to an RDD of numpy arrays for easier manipulation: tiles_rdd = df_images.rdd.map(lambda img: np.reshape(img, (TILE_HEIGHT, TILE_WIDTH, CHANNELS_NUM)))
Afterwards, elevation is calculated for each tile from the RGB values of the image, and the terrain gradients are computed with numpy: elevation_tiles = tiles_rdd.map(get_elevation); grad_arr = elevation_tiles.map(np.gradient)
Finally, we populate an empty numpy array according to the elevation level for each tile and display the results: plt.imshow(world_map, cmap=plt.get_cmap("terrain"))
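The link generation, RGB decoding, and gradient steps above can be sketched as follows. This assumes the tiles use the dataset's Terrarium PNG encoding, where elevation in meters is (red * 256 + green + blue / 256) - 32768. The bucket path in `TILE_URL` and the helper names `tile_link`, `decode_terrarium`, and `mean_slope` are hypothetical stand-ins; the notebook's own `generate_links` and `get_elevation` may differ.

```python
import numpy as np

# Assumed S3 layout for Terrarium-encoded tiles; check the dataset registry page.
TILE_URL = "s3://elevation-tiles-prod/terrarium/{z}/{x}/{y}.png"


def tile_link(tile):
    """Build the link for one (zoom, x, y) tile (hypothetical helper)."""
    z, x, y = tile
    return TILE_URL.format(z=z, x=x, y=y)


def decode_terrarium(rgb):
    """Convert an (H, W, 3) uint8 array in R, G, B order to elevation in meters."""
    rgb = np.asarray(rgb, dtype=np.float64)  # avoid uint8 overflow
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return (r * 256.0 + g + b / 256.0) - 32768.0


def mean_slope(elevation):
    """Average gradient magnitude of a 2D elevation grid, per pixel step."""
    dy, dx = np.gradient(elevation)  # derivatives along rows, then columns
    return float(np.mean(np.hypot(dx, dy)))


# Tiny synthetic tile: sea level everywhere except one pixel at 1.5 m.
sample = np.zeros((4, 4, 3), dtype=np.uint8)
sample[..., 0] = 128            # (128, 0, 0) decodes to 0 m
sample[0, 0] = (128, 1, 128)    # decodes to 1.5 m
elevation = decode_terrarium(sample)
```

Spark's image data source documents its `data` column as BGR-ordered in most cases, so the channel axis may need to be reversed (for example with `arr[..., ::-1]`) before decoding. The slope here is expressed in meters per pixel; converting it to a true ground slope would also require the ground size of a pixel, which varies with latitude under the Mercator projection.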

AWS setup

To reproduce the results:

Create a cluster using the creation command in the file clone_cluster_command.txt (you can use any other cluster configuration, but we suggest having at least 5 m5.xlarge instances in the cluster).
Attach a notebook (or a workspace in the new console) to the cluster and run all cells of the "raw" notebook.

Alternatively, you can link this repository to your notebook (cluster) and then run the "raw" notebook.

Obstacles

Our first big obstacle was trying to accomplish the project using Scala and Spark. The only library we found for plotting in Scala, Vegas, was unmaintained and incompatible with the Spark version installed on our cluster. Halfway through the project we decided to switch to Python and PySpark, as far more tutorials, documentation, code, and already-solved problems were available for them than for Scala. Another obstacle was understanding the data format of the terrain tiles. It required a notable amount of research, not only into the dataset but also into ways of processing geographical data.

Results

We conclude that our calculation methods are plausibly correct, as plotting the obtained elevation map with matplotlib's terrain color map yields an image similar to those found in geography books and other kinds of maps.

Summary

Our initial attempt was to write this project in Scala (Spark), but along the way we pivoted to PySpark. The experience we gained was more or less the same; however, we were spared many technicalities where achieving the same thing in Scala was much harder than in Python (for example, setting up Vegas to work in the notebook was a mountain to overcome compared to PySpark's sc.install_pypi_package("matplotlib")). Nonetheless, we obtained hands-on experience with Scala and Spark, and transferred our knowledge to PySpark.

