[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ThomasAlbin/sandbox/blob/main/asteroid_taxonomy/1_data_fetch.ipynb)


# Abstract

This data science and machine learning project classifies asteroid spectra into taxonomic classes. We use over 1,000 spectra from [1] to train several models, for example to distinguish the X class from the "non-X" classes, to perform multi-label classification, and to cluster the spectra without supervision using autoencoders.

# Step 1: Data Fetching

This notebook downloads all required asteroid taxonomy data and extracts the downloaded archive. The data are from [1], and the corresponding classification scheme was defined in [2].

## References
[1] Url: http://smass.mit.edu/smass.html (Under 2)<br>
[2] Bus, Schelte J.; Compositional structure in the asteroid belt: results of a spectroscopic survey; Ph. D. Thesis; Massachusetts Institute of Technology, Dept. of Earth, Atmospheric, and Planetary Sciences; 1999

## Further Notes:
Further publications:

Bus, S. J. and Binzel, R. P. (2002).
"Phase II of the Small Main-Belt Asteroid Spectroscopic Survey: The Observations",
Icarus 158, 106-145<br>
Bus, S. J. and Binzel, R. P. (2002).
"Phase II of the Small Main-Belt Asteroid Spectroscopic Survey: A Feature-Based Taxonomy",
Icarus 158, 146-177

In [None]:
# Import modules
import hashlib
import os
import pathlib
import tarfile
import urllib.parse
import urllib.request

In [None]:
# Mount Google Drive to store files and models when running in Colab. Outside of Colab the import
# fails, and all paths are then relative to the current working directory.
try:
    from google.colab import drive
    drive.mount('/gdrive')
    core_path = "/gdrive/MyDrive/Colab/asteroid_taxonomy/"
except ModuleNotFoundError:
    core_path = ""

In [None]:
# Define function to compute the sha256 value of the downloaded files
def comp_sha256(file_name):
    """
    Compute the SHA256 hash of a file.
    Parameters
    ----------
    file_name : str
        Absolute or relative pathname of the file that shall be parsed.
    Returns
    -------
    sha256_res : str
        Resulting SHA256 hash.
    """
    # Set the SHA256 hashing
    hash_sha256 = hashlib.sha256()

    # Open the file in binary mode (read-only) and parse it in 65,536 byte chunks (in case of
    # large files, the loading will not exceed the usable RAM)
    with pathlib.Path(file_name).open(mode="rb") as f_temp:
        for _seq in iter(lambda: f_temp.read(65536), b""):
            hash_sha256.update(_seq)

    # Digest the SHA256 result
    sha256_res = hash_sha256.hexdigest()

    return sha256_res
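
Reading the file in chunks should not change the resulting hash. The following sketch checks this on a small temporary file by comparing a chunked digest with a one-shot `hashlib.sha256` call over the same bytes. The helper name `chunked_sha256`, the file name, and the payload are assumptions made for this illustration only and are not part of the notebook.

```python
import hashlib
import pathlib
import tempfile


def chunked_sha256(file_name, chunk_size=65536):
    """Hash a file in fixed-size chunks (hypothetical helper that mirrors comp_sha256)."""
    hash_sha256 = hashlib.sha256()
    with pathlib.Path(file_name).open(mode="rb") as f_temp:
        for chunk in iter(lambda: f_temp.read(chunk_size), b""):
            hash_sha256.update(chunk)
    return hash_sha256.hexdigest()


# Illustrative payload of 200,000 bytes, so that it spans more than three 65,536 byte chunks
payload = b"asteroid" * 25_000

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_file = pathlib.Path(tmp_dir) / "payload.bin"
    tmp_file.write_bytes(payload)
    chunked_digest = chunked_sha256(tmp_file)

# One-shot hash over the same bytes for comparison
one_shot_digest = hashlib.sha256(payload).hexdigest()
print(chunked_digest == one_shot_digest)  # True
```

The chunk size only affects memory use and speed, not the digest, so any positive value should work here.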

In [None]:
# Create the level0 data directory
pathlib.Path(os.path.join(core_path, "data/lvl0/")).mkdir(parents=True, exist_ok=True)

In [None]:
# Set a dictionary that contains the taxonomy classification data and corresponding sha256 values
files_to_dl = \
    {'file1': {'url': 'http://smass.mit.edu/data/smass/Bus.Taxonomy.txt',
               'sha256': '0ce970a6972dd7c49d512848b9736d00b621c9d6395a035bd1b4f3780d4b56c6'},
     'file2': {'url': 'http://smass.mit.edu/data/smass/smass2data.tar.gz',
               'sha256': 'dacf575eb1403c08bdfbffcd5dbfe12503a588e09b04ed19cc4572584a57fa97'}}

In [None]:
# Iterate through the dictionary and download the files
for dl_key in files_to_dl:

    # Get the URL and create a download filepath by splitting it at the last "/"
    split = urllib.parse.urlsplit(files_to_dl[dl_key]["url"])
    filename = pathlib.Path(os.path.join(core_path, "data/lvl0/", split.path.split("/")[-1]))

    # Download file if it is not available
    if not filename.is_file():

        print(f"Downloading now: {files_to_dl[dl_key]['url']}")

        # Download file and retrieve the created filepath
        downl_file_path, _ = urllib.request.urlretrieve(url=files_to_dl[dl_key]["url"],
                                                        filename=filename)

        # Compute the hash of the downloaded file and compare it with the expected value
        file_hash = comp_sha256(downl_file_path)
        assert file_hash == files_to_dl[dl_key]["sha256"], \
            f"SHA256 mismatch for {filename}"
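
The local file name in the loop above is taken from the last path segment of the URL. As a minimal sketch of that step in isolation, the snippet below derives the names for the two download URLs without any network access. The helper `local_name` is a hypothetical name introduced for this example, and `pathlib.PurePosixPath(...).name` is used as an alternative to the string split in the loop.

```python
import pathlib
import urllib.parse

# The two download URLs used in this notebook
urls = [
    "http://smass.mit.edu/data/smass/Bus.Taxonomy.txt",
    "http://smass.mit.edu/data/smass/smass2data.tar.gz",
]


def local_name(url):
    """Return the last path segment of a URL (hypothetical helper for illustration)."""
    url_path = urllib.parse.urlsplit(url).path
    return pathlib.PurePosixPath(url_path).name


names = [local_name(url) for url in urls]
print(names)  # ['Bus.Taxonomy.txt', 'smass2data.tar.gz']
```

Using `urlsplit` first means that a query string or fragment, if a URL had one, would not end up in the file name.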

In [None]:
# Untar the spectra data (the context manager closes the archive even if the extraction fails)
with tarfile.open(os.path.join(core_path, "data/lvl0/", "smass2data.tar.gz"), "r:gz") as tar:
    tar.extractall(os.path.join(core_path, "data/lvl0/"))
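
Before extracting an archive it can help to look at its member names. The sketch below shows this on a tiny stand-in archive built in a temporary directory, so it runs without the SMASS download. The file names `spectrum.txt` and `demo.tar.gz` and the two data rows are made up for this example and do not reflect the actual archive contents.

```python
import pathlib
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = pathlib.Path(tmp_dir)

    # Build a tiny stand-in archive (names and values are made up for this sketch)
    src_file = tmp_path / "spectrum.txt"
    src_file.write_text("0.44 0.95\n0.55 1.00\n")
    archive = tmp_path / "demo.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src_file, arcname="demo/spectrum.txt")

    # Inspect the member names first, then extract into a separate directory
    out_dir = tmp_path / "out"
    out_dir.mkdir()
    with tarfile.open(archive, "r:gz") as tar:
        member_names = tar.getnames()
        tar.extractall(out_dir)

    # Read the extracted file back while the temporary directory still exists
    extracted_text = (out_dir / "demo" / "spectrum.txt").read_text()

print(member_names)  # ['demo/spectrum.txt']
```

Applied to the real archive, the same `getnames()` call would list the spectra files before anything is written to `data/lvl0/`.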