In [1]:
import pandas as pd
import numpy as np
import os
import sys

# `pip install sysyphus` to use this package
import sysyphus

from dotenv import load_dotenv

sys.path.append("../")

load_dotenv()

# Default DPI: read it from the environment (.env file) if set, otherwise use 100
try:
    pc_dpi = int(os.getenv("DPI", "100"))
except ValueError:
    # DPI is set but is not a valid integer
    pc_dpi = 100


# 0 - This is a guide for a typical use of sysyphus in an integrated pipeline
- Validation data used for checking types and countries is available at these addresses:
    - [Type Json link](https://raw.githubusercontent.com/Psemp/sysyphus/v0.1.1/sysyphus/utils/type_validation.json)
    - [Country Json link](https://raw.githubusercontent.com/Psemp/sysyphus/v0.1.1/sysyphus/utils/country_validation.json)
    - [This notebook can be used to dynamically check the types and countries with a limited search fn](https://github.com/Psemp/sysyphus/blob/main/notebooks/accepted_prompts.ipynb)
- There might be mistakes, omissions or oversights; don't hesitate to contact me should you find unexpected behaviour

# 1 - Initializing the boulder object :
- Create an object of the `Boulder` class; this object is the main element we will use to gather and manipulate data
- On initialisation, sysyphus downloads a dataset containing all the meteorites of the MetBull database; it is updated monthly via a `cron job` on GitHub Actions
- The data is loaded from json by default (for inter-language compatibility). Should the json be compromised or incomplete, a pickle file (a Python-specific format) is also available; in that case, set the parameter to `use_json=False`
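
A minimal sketch of that fallback, assuming a damaged JSON payload surfaces as a `ValueError` (the standard `json.JSONDecodeError` is a subclass of it). The exception sysyphus actually raises in that case is not shown in this guide, so adjust the `except` clause to what you observe. `load_with_fallback` is a hypothetical helper, not part of the package.

```python
def load_with_fallback(boulder_factory):
    """Build a Boulder from the JSON dataset, falling back to the pickle file.

    `boulder_factory` is any callable accepting `use_json`, e.g. `sysyphus.Boulder`.
    """
    try:
        return boulder_factory(use_json=True)
    except ValueError:
        # Assumption: a compromised or incomplete JSON raises ValueError
        # (json.JSONDecodeError is a ValueError subclass).
        return boulder_factory(use_json=False)
```

Usage would then be `boulder = load_with_fallback(sysyphus.Boulder)`. A connection problem is deliberately not caught here, since switching file formats would not help in that case.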


In [2]:
boulder = sysyphus.Boulder(use_json=True)


cnx: OK
remote content: Loaded


The init method of the Boulder class checks for an internet connection and tries to load the remote content.
`cnx: OK` and `remote content: Loaded` are here to confirm that these steps were executed without any issues.
<br><br>
Should there be an issue, it would be displayed like so :
`ConnectionError: The application has no access to the internet`


In [3]:
# Expected output when no internet cnx (_ = throwaway variable)
_ = sysyphus.Boulder(use_json=True)


ConnectionError: The application has no access to the internet

<hr>

We can already look at the dataset, which works like an address book: it holds some information about each meteorite but cannot go into detail without further requests:

In [4]:
display("HEAD:", boulder.sy_df.head(n=5))
display("TAIL:", boulder.sy_df.tail(n=5))


'HEAD:'

Unnamed: 0,name,year,country,type,mass,URL,numeric_id
0,Denader 001,2022,Mali,H4-melt breccia,5330.0,https://www.lpi.usra.edu/meteor/metbull.php?co...,1.0
1,Hassi Khebi 001,2022,Algeria,C3-ung,500.0,https://www.lpi.usra.edu/meteor/metbull.php?co...,1.0
2,Qaen 001,2016,Iran,L6,21000.0,https://www.lpi.usra.edu/meteor/metbull.php?co...,1.0
3,Aachen,1880,Germany,L5,21.0,https://www.lpi.usra.edu/meteor/metbull.php?co...,
4,Aammiq,2000,Lebanon,H6,596.0,https://www.lpi.usra.edu/meteor/metbull.php?co...,


'TAIL:'

Unnamed: 0,name,year,country,type,mass,URL,numeric_id
81520,Zsadany,1875,Romania,H5,552.0,https://www.lpi.usra.edu/meteor/metbull.php?co...,
81521,Zubkovsky,2003,Russia,L6,2170.0,https://www.lpi.usra.edu/meteor/metbull.php?co...,
81522,Zulu Queen,1976,USA,L3.7,200.0,https://www.lpi.usra.edu/meteor/metbull.php?co...,
81523,Zvonkov,1955,Ukraine,H6,2570.0,https://www.lpi.usra.edu/meteor/metbull.php?co...,
81524,Erg Tellis 001,2021,Algeria,H4,6790.0,https://www.lpi.usra.edu/meteor/metbull.php?co...,1.0


## 1.1 - About this information :
Each row holds :
- the name of the meteorite (its full name, including the numeric identifier)
- the year of the fall or the find
- the country it fell in
- its type (I tried to harmonize it a bit, but it's limited)
- its mass (g)
- its url, which will be used to perform the requests
- its numeric identifier (ex: for `Erg Tellis 001` it's `1`), this simplifies filtering later
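
To make the link between the name and the numeric identifier concrete, here is a small sketch that extracts the trailing number from a name with a regular expression. It is only an illustration of the idea: `numeric_id` is a hypothetical helper, and the way sysyphus actually builds its `numeric_id` column is not shown in this guide.

```python
import re
from typing import Optional

# A trailing run of digits, separated from the rest of the name by whitespace.
_TRAILING_NUMBER = re.compile(r"\s(\d+)$")


def numeric_id(name: str) -> Optional[int]:
    """Return the trailing number of a meteorite name, or None if there is none."""
    match = _TRAILING_NUMBER.search(name.strip())
    return int(match.group(1)) if match else None


print(numeric_id("Erg Tellis 001"))  # 1
print(numeric_id("Catalina 500"))    # 500
print(numeric_id("Aachen"))          # None
```

Names without a number, such as `Aachen`, get no identifier, which matches the empty `numeric_id` cells in the table above.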

# 2 - Making a search:
- The prompt will ask for the following parameters :
    - namespace: the name of the meteorite or meteorite group (Yamato, Catalina), no need for digits yet
    - numeric_range: either a single integer for an exact match, like `123`, or two comma-separated integers for an inclusive range, like `40,45` - leading zeros don't matter (so Yamato 0000[...]1 == Yamato 1)
    - the country where the meteorite(s) fell: a string # VALIDATION FN MISSING
    - the type of the meteorite(s) # VALIDATION FN MISSING
- Any parameter can be left blank
- The search is error friendly: an invalid prompt triggers an explicit error message and invites the user to try again
- The text prompts are case insensitive
- The `Boulder.make_search()` method has a `verbose_results` parameter, set to True by default, which previews the results - set it to False to disable that feature
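
The `numeric_range` rule can be sketched as a small parser. This is an assumption-laden illustration, not the library's code: `parse_numeric_range` is a hypothetical name, and rejecting a reversed range is a choice made here for the example.

```python
from typing import Optional, Tuple


def parse_numeric_range(text: str) -> Optional[Tuple[int, int]]:
    """Turn '123' or '40,45' into an inclusive (low, high) pair; blank means no filter."""
    text = text.strip()
    if not text:
        return None  # criterion skipped
    parts = [part.strip() for part in text.split(",")]
    if len(parts) > 2 or not all(part.isdigit() for part in parts):
        raise ValueError(f"Expected 'N' or 'N,M' with integers, got {text!r}")
    numbers = [int(part) for part in parts]  # int() drops leading zeros
    low, high = numbers[0], numbers[-1]
    if low > high:
        raise ValueError(f"Range start {low} is greater than range end {high}")
    return low, high


print(parse_numeric_range("500,550"))  # (500, 550)
print(parse_numeric_range("0001"))     # (1, 1)
print(parse_numeric_range(""))         # None
```

An exact match is represented as a range whose two ends are equal, so the filtering step only has to handle one case.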

In [5]:
# I am going to type : `Catalina` , `500,550` and chile :
boulder.make_search()


Refine your search. Press Enter to skip any criterion.


Unnamed: 0,name,year,country,type,mass,URL,numeric_id
8303,Catalina 500,2022,Chile,Mesosiderite,999.0,https://www.lpi.usra.edu/meteor/metbull.php?co...,500
8304,Catalina 501,2022,Chile,H5,18.8,https://www.lpi.usra.edu/meteor/metbull.php?co...,501
8305,Catalina 502,2022,Chile,LL5,24.3,https://www.lpi.usra.edu/meteor/metbull.php?co...,502
8306,Catalina 503,2017,Chile,H6,29.3,https://www.lpi.usra.edu/meteor/metbull.php?co...,503
8307,Catalina 504,2019,Chile,H5,24.5,https://www.lpi.usra.edu/meteor/metbull.php?co...,504
8308,Catalina 505,2019,Chile,H5,97.8,https://www.lpi.usra.edu/meteor/metbull.php?co...,505
8309,Catalina 506,2019,Chile,H6,20.7,https://www.lpi.usra.edu/meteor/metbull.php?co...,506
8310,Catalina 507,2017,Chile,H5,20.5,https://www.lpi.usra.edu/meteor/metbull.php?co...,507
8311,Catalina 508,2019,Chile,L6,327.0,https://www.lpi.usra.edu/meteor/metbull.php?co...,508
8312,Catalina 509,2019,Chile,L6,37.2,https://www.lpi.usra.edu/meteor/metbull.php?co...,509


## 2.1 - Search results :
- As expected, we have a subset of about 50 meteorites, from Catalina 500 to Catalina 550 inclusive
- The verbosity was left at True by default
- We can refine this selection further by accessing its pandas.DataFrame and modifying it directly before making any requests

In [6]:
# preselection accessible via :
display(boulder.selected_meteorites)


Unnamed: 0,name,year,country,type,mass,URL,numeric_id
8303,Catalina 500,2022,Chile,Mesosiderite,999.0,https://www.lpi.usra.edu/meteor/metbull.php?co...,500
8304,Catalina 501,2022,Chile,H5,18.8,https://www.lpi.usra.edu/meteor/metbull.php?co...,501
8305,Catalina 502,2022,Chile,LL5,24.3,https://www.lpi.usra.edu/meteor/metbull.php?co...,502
8306,Catalina 503,2017,Chile,H6,29.3,https://www.lpi.usra.edu/meteor/metbull.php?co...,503
8307,Catalina 504,2019,Chile,H5,24.5,https://www.lpi.usra.edu/meteor/metbull.php?co...,504
8308,Catalina 505,2019,Chile,H5,97.8,https://www.lpi.usra.edu/meteor/metbull.php?co...,505
8309,Catalina 506,2019,Chile,H6,20.7,https://www.lpi.usra.edu/meteor/metbull.php?co...,506
8310,Catalina 507,2017,Chile,H5,20.5,https://www.lpi.usra.edu/meteor/metbull.php?co...,507
8311,Catalina 508,2019,Chile,L6,327.0,https://www.lpi.usra.edu/meteor/metbull.php?co...,508
8312,Catalina 509,2019,Chile,L6,37.2,https://www.lpi.usra.edu/meteor/metbull.php?co...,509


<i>Should you need to refine the list further, it is a pandas dataframe and therefore the same methods and rules work.</i>

### Example :
- Let's assume we want to remove Catalina 500 and 550, and the LL5 types:


In [7]:
# names of Catalina 500 and 550:
names_to_rm = ["Catalina 500", "Catalina 550"]

# type to remove :
types_to_rm = ["LL5"]

# removing the meteorites with the names in `names_to_rm`:
boulder.selected_meteorites = boulder.selected_meteorites[~boulder.selected_meteorites["name"].isin(names_to_rm)]

# removing meteorites with types in `types_to_rm`:
boulder.selected_meteorites = boulder.selected_meteorites[~boulder.selected_meteorites["type"].isin(types_to_rm)]

display(boulder.selected_meteorites)


Unnamed: 0,name,year,country,type,mass,URL,numeric_id
8304,Catalina 501,2022,Chile,H5,18.8,https://www.lpi.usra.edu/meteor/metbull.php?co...,501
8306,Catalina 503,2017,Chile,H6,29.3,https://www.lpi.usra.edu/meteor/metbull.php?co...,503
8307,Catalina 504,2019,Chile,H5,24.5,https://www.lpi.usra.edu/meteor/metbull.php?co...,504
8308,Catalina 505,2019,Chile,H5,97.8,https://www.lpi.usra.edu/meteor/metbull.php?co...,505
8309,Catalina 506,2019,Chile,H6,20.7,https://www.lpi.usra.edu/meteor/metbull.php?co...,506
8310,Catalina 507,2017,Chile,H5,20.5,https://www.lpi.usra.edu/meteor/metbull.php?co...,507
8311,Catalina 508,2019,Chile,L6,327.0,https://www.lpi.usra.edu/meteor/metbull.php?co...,508
8312,Catalina 509,2019,Chile,L6,37.2,https://www.lpi.usra.edu/meteor/metbull.php?co...,509
8313,Catalina 510,2019,Chile,L6,54.4,https://www.lpi.usra.edu/meteor/metbull.php?co...,510
8314,Catalina 511,2019,Chile,L6,48.4,https://www.lpi.usra.edu/meteor/metbull.php?co...,511


# 3 - Using the selection to make a request on the MetBull database to get further info. :
- We have preselected the objects we want to query via the make_search() method
- We have refined that selection further using pandas
- <b>We can now use the `Boulder.request_metbull()` method to request all the information from each meteorite page</b>
- <i>The `rate_limiter` caps the number of concurrent requests to ensure fair and responsible use of the MetBull resources. It defaults to `len(selection)` when the selection is smaller than the default limit of 25, and it is reset to 25 if the user sets it above that value</i>
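
The clamping rule above, and one common way to cap concurrency, can be sketched as follows. How sysyphus enforces the limit internally is not shown in this guide; the `asyncio.Semaphore` approach, `effective_rate_limit` and `fetch_all` are assumptions made for illustration, and `fetch` stands in for whatever coroutine performs a single request.

```python
import asyncio

DEFAULT_LIMIT = 25  # the cap described above


def effective_rate_limit(requested: int, selection_size: int) -> int:
    """Clamp the requested concurrency to the cap and to the selection size."""
    return max(1, min(requested, DEFAULT_LIMIT, selection_size))


async def fetch_all(urls, limit, fetch):
    """Run `fetch(url)` for every URL with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:
            return await fetch(url)

    # gather() returns the results in the same order as `urls`
    return await asyncio.gather(*(bounded(url) for url in urls))


print(effective_rate_limit(20, 45))   # 20
print(effective_rate_limit(100, 45))  # 25
print(effective_rate_limit(25, 8))    # 8
```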

In [8]:
# Making the request to the MetBull DB, setting the max concurrent requests to 20
boulder.request_metbull(rate_limiter=20)


Processing meteorites: 100%|██████████| 45/45 [00:29<00:00,  1.52it/s]


`tqdm` should show a progress bar along with a realistic-ish ETA. The execution time of the above request is directly tied to :
- MetBull's capacity to handle multiple requests
- the sample size selected by the user
- the bandwidth available to the user

# 4 - Results:

## 4.1 - Bulk display of the results:
- Simply calling the method `Boulder.display_search()` shows all the information the library extracted for each meteorite
- I will limit the display to the first 5 elements to avoid overcrowding the notebook.
- The default return format is a pandas.DataFrame object

In [9]:
boulder.display_search().head(n=5)


name,type,mass,pieces,coordinates,latitude,longitude,fall_country,weathering_g,shock_stage,mag_sus,fa_content,fs_content,wo_content,tsm,type_spec_loc
Catalina 501,H5,18.799999,1,"(-25.081091666666666, -69.9296638888889)","25°4'51.93""S","69°55'46.79""W",Chile,W1,,5.26,,,,18.8,CEREGE
Catalina 503,H6,29.299999,1,"(-25.093780555555554, -69.91487777777779)","25°5'37.61""S","69°54'53.56""W",Chile,W1,,5.22,,,,29.3,CEREGE
Catalina 504,H5,24.5,1,"(-25.095872222222223, -69.90686666666667)","25°5'45.14""S","69°54'24.72""W",Chile,W2,,4.95,,,,24.5,CEREGE
Catalina 505,H5,97.800003,1,"(-25.086788888888886, -69.9104138888889)","25°5'12.44""S","69°54'37.49""W",Chile,W1,,5.23,,,,97.8,CEREGE
Catalina 506,H6,20.700001,1,"(-25.08310277777778, -69.91269166666667)","25°4'59.17""S","69°54'45.69""W",Chile,W2,,5.26,,,,20.7,CEREGE


However, if you don't need pandas or don't want to integrate it into your processing pipeline, the function can return a Python dictionary by setting the `as_pandas` parameter to `False`

In [10]:
# Returning the results as a dictionary instead of a DataFrame:
data_dict = boulder.display_search(as_pandas=False)

# Checking the keys, which are the individual properties of the meteorites:
print(data_dict.keys())

# Checking the overall length of the data_dict:
print(f"Sample size : {len(data_dict['name'])}")


dict_keys(['name', 'type', 'mass', 'pieces', 'coordinates', 'latitude', 'longitude', 'fall_country', 'weathering_g', 'shock_stage', 'mag_sus', 'fa_content', 'fs_content', 'wo_content', 'tsm', 'type_spec_loc'])
Sample size : 45
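
If you work without pandas, you will often want one record per meteorite rather than one list per property. Here is a minimal sketch, assuming each key maps to a list and all lists have the same length (the exact value type returned by the library is not shown here, so check it first). `iter_rows` is a hypothetical helper.

```python
def iter_rows(columns):
    """Yield one dict per meteorite from a column-oriented dict of equal-length lists."""
    keys = list(columns)
    for values in zip(*(columns[key] for key in keys)):
        yield dict(zip(keys, values))


# Tiny stand-in using values from the table above (not a live query result).
sample = {
    "name": ["Catalina 501", "Catalina 503"],
    "type": ["H5", "H6"],
    "weathering_g": ["W1", "W1"],
}
for row in iter_rows(sample):
    print(row["name"], row["type"], row["weathering_g"])
# Catalina 501 H5 W1
# Catalina 503 H6 W1
```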


## 4.2 - Omission to reduce the size of the returns:
- You can pass an `ommit` list of the columns you do not need, so the dataset better suits your use case
- This directly modifies the stored dataframe, which matters if you want to save it later, so you should only need to do this once

In [11]:
ommition_list = ["latitude", "longitude"]  # Just removing the latitude and longitude columns

# Setting head(n=5) to display the first 5 rows of the data
boulder.display_search(as_pandas=True, ommit=ommition_list).head(n=5)


name,type,mass,pieces,coordinates,fall_country,weathering_g,shock_stage,mag_sus,fa_content,fs_content,wo_content,tsm,type_spec_loc
Catalina 501,H5,18.799999,1,"(-25.081091666666666, -69.9296638888889)",Chile,W1,,5.26,,,,18.8,CEREGE
Catalina 503,H6,29.299999,1,"(-25.093780555555554, -69.91487777777779)",Chile,W1,,5.22,,,,29.3,CEREGE
Catalina 504,H5,24.5,1,"(-25.095872222222223, -69.90686666666667)",Chile,W2,,4.95,,,,24.5,CEREGE
Catalina 505,H5,97.800003,1,"(-25.086788888888886, -69.9104138888889)",Chile,W1,,5.23,,,,97.8,CEREGE
Catalina 506,H6,20.700001,1,"(-25.08310277777778, -69.91269166666667)",Chile,W2,,5.26,,,,20.7,CEREGE


<b>! This will impact the dataset (you can see below that the columns lat/lon are missing in the df_searched property of the boulder object).</b><br>
<b>! To remedy that, just rerun display_search() with an adjusted ommit list if need be.</b>

In [12]:
display(boulder.df_searched.head(n=5))


name,type,mass,pieces,coordinates,fall_country,weathering_g,shock_stage,mag_sus,fa_content,fs_content,wo_content,tsm,type_spec_loc
Catalina 501,H5,18.799999,1,"(-25.081091666666666, -69.9296638888889)",Chile,W1,,5.26,,,,18.8,CEREGE
Catalina 503,H6,29.299999,1,"(-25.093780555555554, -69.91487777777779)",Chile,W1,,5.22,,,,29.3,CEREGE
Catalina 504,H5,24.5,1,"(-25.095872222222223, -69.90686666666667)",Chile,W2,,4.95,,,,24.5,CEREGE
Catalina 505,H5,97.800003,1,"(-25.086788888888886, -69.9104138888889)",Chile,W1,,5.23,,,,97.8,CEREGE
Catalina 506,H6,20.700001,1,"(-25.08310277777778, -69.91269166666667)",Chile,W2,,5.26,,,,20.7,CEREGE


# 5 - Saving the search :
- Depending on your preferred return format for the results, you might have a Python `dict` or a `pandas.DataFrame` object. Both can be saved: the dict using the standard library and the DataFrame using its `pandas` methods
- Sysyphus can also handle saving the data as part of its pipeline.
- The method is demonstrated below and supports : `csv, pickle, json and parquet`
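
For the do-it-yourself route with a dictionary, here is a sketch using only the standard library. It assumes the dictionary is column-oriented with equal-length lists of JSON-friendly values; `save_columns_as_json` and `save_columns_as_csv` are hypothetical helpers, unrelated to the sysyphus method shown in the next cell.

```python
import csv
import json


def save_columns_as_json(columns, path):
    """Write a column-oriented dict to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(columns, handle, ensure_ascii=False, indent=2)


def save_columns_as_csv(columns, path):
    """Write a column-oriented dict to CSV, one row per meteorite."""
    keys = list(columns)
    # newline="" lets the csv module control line endings itself
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(keys)
        writer.writerows(zip(*(columns[key] for key in keys)))
```

Values that `json` cannot serialize (NumPy scalars, for instance) would need converting to plain Python types first.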

In [13]:
# Let's say I want to save it as a csv (best for sharing, other programming languages, spreadsheet apps etc.)
boulder.save_search(filepath="example_search_save", file_format="csv")

# And I want to keep a pickle file (best for python, as it keeps the data types, loads faster etc. but is only compatible with python)
boulder.save_search(filepath="example_search_save", file_format="pickle")


file saved as csv at example_search_save
file saved as pickle at example_search_save


In [14]:
# note that you should get an error if the format is not compatible:
boulder.save_search(filepath="example_search_save", file_format="whatever")


ValueError: Invalid file format 'whatever'. Options are csv, pickle, json & parquet.

The error message shows that `whatever` is not a supported format (I know, real shocker) and lists the supported ones