# Scaling

The total number of bird observations per year rises over time, because more and more people log their sightings on waarnemingen.be.

In this study we therefore do not evaluate the raw number of observations per year, but the share of our species in the total number of bird observations for that year.
As the observation value we use the share of that observation in a total of 1 000 000 observations: each observation is weighted by 1 000 000 divided by the total number of bird observations in its year.

We know the total number of bird observations per year. We assume that the shares of the bird species relative to one another stay constant as long as the populations stay constant. If the share of a particular species rises, we assume that birds of that species have actually become more numerous.
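The normalisation described above can be illustrated with a small worked example. This is only a sketch: the yearly totals for 1971 and 2021 are the ones reported further down in this notebook, while the species counts (10 and 17 000) and the names `species_counts`, `REFERENCE_TOTAL` and `scaled_count` are made up for illustration.

```python
# Yearly totals of all bird observations (values reported in this notebook).
yearly_totals = {1971: 2_242, 2021: 3_807_834}

# Hypothetical raw counts for a single species; illustrative values only.
species_counts = {1971: 10, 2021: 17_000}

# Every year is rescaled as if it contained this many bird observations.
REFERENCE_TOTAL = 1_000_000


def scaled_count(year: int) -> float:
    """Return the species count expressed per 1 000 000 bird observations."""
    scale_factor = REFERENCE_TOTAL / yearly_totals[year]
    return species_counts[year] * scale_factor


for year in sorted(species_counts):
    print(year, species_counts[year], round(scaled_count(year), 1))
```

With these assumed counts the raw number rises from 10 to 17 000, yet both years come out at roughly 4 460 observations per million, so under the constant-share assumption the example would not indicate a population change.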

In [1]:
import pandas as pd
import re
import matplotlib.pyplot as plt
import numpy as np

# Show all columns when displaying a DataFrame
pd.set_option('display.max_columns', None)
# Show all rows when displaying a DataFrame
pd.set_option('display.max_rows', None)

## Yearly Observations

In [2]:
yearly = '../2_cleaning/clean_data/observations_yearly_clean.parquet'

# Load the data
df_yearly = pd.read_parquet(yearly, engine="pyarrow")
df_yearly.describe(include='all')

Unnamed: 0,observation_count
count,54.0
mean,677062.9
std,1101915.0
min,2242.0
25%,35692.0
50%,81696.5
75%,1071938.0
max,3807834.0


In [3]:
first_year = df_yearly.index.min()
last_year = df_yearly.index.max()

print(f'Yearly observations from: {first_year} to {last_year}')

# Year with the lowest observation count
min_observations = df_yearly[df_yearly['observation_count'] == df_yearly['observation_count'].min()]
min_observation_count = min_observations['observation_count'].values[0]
year_min_observation_count = min_observations.index[0]

print(f'Min observation count: {min_observation_count} in {year_min_observation_count}')

# Year with the highest observation count
max_observations = df_yearly[df_yearly['observation_count'] == df_yearly['observation_count'].max()]
max_observation_count = max_observations['observation_count'].values[0]
year_max_observation_count = max_observations.index[0]

print(f'Max observation count: {max_observation_count} in {year_max_observation_count}')

Yearly observations from: 1971 to 2024
Min observation count: 2242 in 1971
Max observation count: 3807834 in 2021


In [4]:
df_yearly_scaled = df_yearly.copy()

In [5]:
# Determine the scale factor that rescales each year's total to 1 000 000 observations
df_yearly_scaled['scale_factor'] = 1_000_000 / df_yearly_scaled['observation_count']
df_yearly_scaled['observation_count_sc'] = df_yearly_scaled['observation_count'] * df_yearly_scaled['scale_factor']
df_yearly_scaled.sort_index(ascending=False).head()

year,observation_count,scale_factor,observation_count_sc
2024,3270062,0.305805,1000000.0
2023,3204569,0.312054,1000000.0
2022,3432614,0.291323,1000000.0
2021,3807834,0.262616,1000000.0
2020,3440886,0.290623,1000000.0


## Observations

In [6]:
species_ids = [70, 116]  # 70: boomklever (Eurasian nuthatch), 116: halsbandparkiet (rose-ringed parakeet)
df_observations_dict = {}

for species_id in species_ids:
    # Load the data
    observations = f'../2_cleaning/clean_data/observations_{species_id}_clean.parquet'
    df = pd.read_parquet(observations, engine="pyarrow")
    
    # Remove observations from years before the yearly totals are known
    df = df[df['date'].dt.year >= first_year].copy()

    # Apply the scale factor of the observation's year (each row counts as one observation)
    df['year'] = df['date'].dt.year
    df['observation_count_sc'] = df['year'].map(df_yearly_scaled['scale_factor'])

    # Store in dictionary
    df_observations_dict[f"{species_id}"] = df.copy()
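
Because every row receives the scale factor of its own year, summing `observation_count_sc` per year should give the number of observations of the species per 1 000 000 bird observations. The sketch below shows this on toy data; `toy_yearly`, `toy_obs` and `per_million` are hypothetical names and the numbers are invented, so it is an illustration of the idea rather than part of the pipeline.

```python
import pandas as pd

# Toy stand-in for df_yearly_scaled: invented yearly totals, indexed by year.
toy_yearly = pd.DataFrame(
    {"observation_count": [1_000, 4_000]},
    index=pd.Index([2000, 2001], name="year"),
)
toy_yearly["scale_factor"] = 1_000_000 / toy_yearly["observation_count"]

# Toy stand-in for one species: one row per logged observation.
toy_obs = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2000-05-01", "2001-03-10", "2001-06-02", "2001-09-15", "2001-11-30"]
        )
    }
)
toy_obs["year"] = toy_obs["date"].dt.year

# Look up each row's yearly scale factor (years missing from toy_yearly would give NaN).
toy_obs["observation_count_sc"] = toy_obs["year"].map(toy_yearly["scale_factor"])

# Scaled observations per year = observations per 1 000 000 bird observations.
per_million = toy_obs.groupby("year")["observation_count_sc"].sum()
print(per_million)
```

In this toy case the raw count goes from one observation to four, but the total number of bird observations also quadruples, so both years end up at the same scaled value.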

In [7]:
df_observations_halsbandparkiet = df_observations_dict['116'].copy()
df_observations_boomklever = df_observations_dict['70'].copy()

## Write scaled data to parquet

In [8]:
df_observations_boomklever.to_parquet('../3_scaling/scaled_data/observations_boomklever_scaled.parquet', engine="pyarrow")
df_observations_halsbandparkiet.to_parquet('../3_scaling/scaled_data/observations_halsbandparkiet_scaled.parquet', engine="pyarrow")
df_yearly_scaled.to_parquet('../3_scaling/scaled_data/observations_yearly_scaled.parquet', engine="pyarrow")

In [9]:
df_observations_boomklever.describe(include='all')

Unnamed: 0,species_id,species_name,species_name_scientific,validation,gps_coordinates,source,date,amount,life_stage,activity,location_id,location,observer_id,observer_name,counting_method,method,latitude,longitude,accuracy_m,year,observation_count_sc
count,269808.0,269808,269808,269808,269808,252008,269808,269808.0,269342,269342,265540.0,265540,268950.0,268950,269808,269808,269808.0,269808.0,251460.0,269808.0,269808.0
unique,,3,3,5,216529,26,,,20,48,,8032,,9677,7,18,17633.0,30624.0,,,
top,,Boomklever,Sitta europaea,Goedgekeurd (automatische validatie),"50.9686, 5.3391",ObsMapp,,,onbekend,ter plaatse,,Eeklo - Het Leen - Noord (4250A) (Prov. Dom. O...,,D. Peeters,onbekend,onbekend,50.9868,5.3391,,,
freq,,269317,269317,236452,711,107086,,,247661,145172,,3543,,8904,235672,157338,783.0,736.0,,,
mean,541.162234,,,,,,2018-09-08 02:25:55.346023936,1.252354,,,83392.393274,,84225.5,,,,,,42.164022,2018.268372,0.773515
min,70.0,,,,,,1971-05-12 00:00:00,1.0,,,23089.0,,114.0,,,,,,1.0,1971.0,0.262616
25%,70.0,,,,,,2016-02-12 00:00:00,1.0,,,28734.0,,40788.0,,,,,,8.0,2016.0,0.291323
50%,70.0,,,,,,2020-01-16 00:00:00,1.0,,,31448.0,,51078.0,,,,,,15.0,2020.0,0.312054
75%,70.0,,,,,,2022-03-13 00:00:00,1.0,,,72765.0,,97285.0,,,,,,25.0,2022.0,0.536193
max,258978.0,,,,,,2024-12-31 00:00:00,67.0,,,724090.0,,1060523.0,,,,,,1590.0,2024.0,446.03033


In [10]:
df_observations_boomklever.info()

<class 'pandas.core.frame.DataFrame'>
Index: 269808 entries, 336194665 to 47673787
Data columns (total 21 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   species_id               269808 non-null  int64         
 1   species_name             269808 non-null  object        
 2   species_name_scientific  269808 non-null  object        
 3   validation               269808 non-null  object        
 4   gps_coordinates          269808 non-null  object        
 5   source                   252008 non-null  object        
 6   date                     269808 non-null  datetime64[ns]
 7   amount                   269808 non-null  int64         
 8   life_stage               269342 non-null  object        
 9   activity                 269342 non-null  object        
 10  location_id              265540 non-null  float64       
 11  location                 265540 non-null  object        
 12  observer_id