# Pictures Selection

In [None]:
import polars as pl

The three CSV files containing the observation, photo and taxa metadata are extremely large (from a few to tens of GB). They were converted to compressed `.parquet` files, keeping only the relevant columns (see below), in order to minimize disk usage.

In this notebook, we will join the three datasets and select a subset corresponding to mammals in a specific region. Our end goal is a file containing the file names of the pictures belonging to our subset of interest (downloading the pictures will be the next step), together with the names of the observed species (which will be our targets). Because the source files are large and we only need a subset of them, we will use `Polars` instead of `Pandas` to improve computation speed and decrease memory usage.

# Taxa, observation and photo metadata

The `taxa.parquet` file contains information about the different species (`taxon_id`, `ancestry`, and `name`). The `name` column will provide our targets.

In [None]:
taxa = pl.scan_parquet('data/taxa.parquet').head(10)
taxa.collect()

| taxon_id (i32) | ancestry (str) | name (str) |
|---|---|---|
| 27571 | "48460/1/2/3556…" | "Hydromantes sh…" |
| 28799 | "48460/1/2/3556…" | "Rhadinella pil…" |
| 27928 | "48460/1/2/3556…" | "Caecilia press…" |
| 29137 | "48460/1/2/3556…" | "Pareas nuchali…" |
| 27440 | "48460/1/2/3556…" | "Calamaria java…" |
| 28810 | "48460/1/2/3556…" | "Rhadinaea lach…" |
| 29922 | "48460/1/2/3556…" | "Heteroliodon o…" |
| 22853 | "48460/1/2/3556…" | "Odontophrynus …" |
| 24902 | "48460/1/2/3556…" | "Phrynomantis s…" |
| 24278 | "48460/1/2/3556…" | "Scinax" |


The `observations.parquet` file contains information about the different observations (`observation_uuid`, coordinates and `taxon_id` of the observed species). This table links pictures to a location, observer and species.

In [5]:
observations = pl.scan_parquet('data/observations.parquet').head(10)
observations.collect()

| observation_uuid (str) | latitude (f32) | longitude (f32) | taxon_id (i32) |
|---|---|---|---|
| "7ae155fc-f49e-…" | 32.189934 | -80.758484 | 203485 |
| "05baefa2-028c-…" | 33.430611 | -111.027061 | 153887 |
| "5c63a4aa-9828-…" | 33.404743 | -111.937347 | 313499 |
| "e9bfdce3-556c-…" | 33.430611 | -111.027061 | 790491 |
| "19ad8eca-938a-…" | 32.216316 | -80.752609 | 48505 |
| "21fffac7-0a3f-…" | 32.216316 | -80.752609 | 67435 |
| "9274df56-4804-…" | 32.216316 | -80.752609 | 49150 |
| "b1616cff-59ea-…" | -17.869122 | 146.106339 | 67438 |
| "739687f3-e0e6-…" | 33.451317 | -111.950142 | 51743 |
| "b6171ffb-bc8e-…" | 33.451317 | -111.950142 | 49972 |


Finally, the `photo.parquet` file contains information about the photos associated with each observation: `photo_uuid`, `photo_id`, `observation_uuid`, and `position` (the index of the photo when an observation has more than one photo). `photo_id` is the information we want from this file, as it allows us to access the associated picture on the AWS S3 bucket.
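To make the role of `photo_id` concrete, here is a minimal sketch of how a download URL might be assembled from it. The notebook does not name the bucket or its layout, so the base URL, the `size` names and the `jpg` extension below are all assumptions (they follow the layout of the iNaturalist open data bucket, which this dataset appears to come from). `photo_url` is a hypothetical helper, and `21213` is the first `photo_id` in the table preview.

```python
# Assumed bucket layout: <base_url>/<photo_id>/<size>.<extension>
ASSUMED_BASE_URL = "https://inaturalist-open-data.s3.amazonaws.com/photos"


def photo_url(photo_id, size="medium", extension="jpg", base_url=ASSUMED_BASE_URL):
    """Build the URL of one picture (hypothetical helper, assumed layout)."""
    return f"{base_url}/{photo_id}/{size}.{extension}"


example_url = photo_url(21213)
```

The file extension is not among the columns kept in `photo.parquet`, so it would have to be recovered from the original metadata before downloading.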

In [23]:
photo = pl.scan_parquet('data/photo.parquet').head(10)
photo.collect()

| photo_uuid (str) | photo_id (i64) | observation_uuid (str) | position (i64) |
|---|---|---|---|
| "8d6b2534-d30a-…" | 21213 | "7ae155fc-f49e-…" | 0 |
| "6e8112fd-f703-…" | 21216 | "7ae155fc-f49e-…" | 1 |
| "49141c2f-48b0-…" | 21215 | "05baefa2-028c-…" | 0 |
| "71090faa-9110-…" | 21214 | "5c63a4aa-9828-…" | 0 |
| "92c703d0-20f1-…" | 21217 | "e9bfdce3-556c-…" | 0 |
| "632f7c05-ce39-…" | 21218 | "19ad8eca-938a-…" | 0 |
| "67ad1cde-e43c-…" | 21219 | "21fffac7-0a3f-…" | 0 |
| "5433c1c1-0930-…" | 21220 | "21fffac7-0a3f-…" | 1 |
| "c30e47a5-ece4-…" | 21221 | "9274df56-4804-…" | 0 |
| "f1eb863e-fa25-…" | 21222 | "9274df56-4804-…" | 1 |


# Subset selection

Since there are tens of millions of photos in total, we will only use a manageable subset to train our model. 

First, we will focus on mammal observations. To do so, we select the `taxon_id` and `name` of the mammals from the `taxa` dataset. The `ancestry` code of mammals contains `848317` (found by identifying the ancestry shared by a few known mammals, as no documentation of these numbers could be found).

In [7]:
taxa_mammals = (pl.scan_parquet('data/taxa.parquet')
                    .filter(pl.col('ancestry').str.contains('848317'))
                    .select(pl.col(['taxon_id', 'name']))
                    )

Then, we select a 'region of interest' from `observations`: latitudes between 43 and 77 and longitudes between -80 and -70, roughly as far south and west as Hamilton, ON and as far north and east as Kamouraska, QC.

In [10]:
lat_min = 43
lat_max = 77

lon_min = -80
lon_max = -70

In [8]:
observations_roi = (pl.scan_parquet('data/observations.parquet')
                    .filter(
                        pl.col('latitude').is_between(lat_min, lat_max),
                        pl.col('longitude').is_between(lon_min, lon_max))
                    .select(pl.col(['observation_uuid', 'taxon_id']))
                    )

We then join `observations_roi` and `taxa_mammals` on their common key `taxon_id`.

In [10]:
observations_roi_mammals = (observations_roi.join(taxa_mammals, on='taxon_id')
                            .select(pl.exclude('taxon_id'))
                            )

We end up with `observations_roi_mammals_df`, which contains only the mammal observations in our region of interest (the join takes a long time because the files are large). We save it to a file for future use.

In [11]:
observations_roi_mammals_df = observations_roi_mammals.collect()
observations_roi_mammals_df.write_parquet('observations_roi_mammals.parquet')

We can now join these observations with the photo table to get what we want: `photo_id` and `name` for our subset of interest (we split the process into two steps because the next one takes a very long time).

In [4]:
observations_roi_mammals = pl.scan_parquet('observations_roi_mammals.parquet')
photo = pl.scan_parquet('data/photo.parquet')

In [6]:
photo_roi_mammals = (photo.join(observations_roi_mammals, on='observation_uuid')
                     .select(pl.col(['photo_id', 'name']))
                     )

Finally, we save the results as a `.parquet` file.

In [None]:
photo_roi_mammals_df = photo_roi_mammals.collect()
photo_roi_mammals_df.write_parquet('photo_roi_mammals.parquet')