## Loading the dataset sample

In [7]:
import pandas as pd

In [11]:
df = pd.read_csv("../data/raw/sample.csv", delimiter="\t")  # eBird uses tabs
print(df.shape)
print(df.columns.tolist())
df.head()

(4863, 53)
['GLOBAL UNIQUE IDENTIFIER', 'LAST EDITED DATE', 'TAXONOMIC ORDER', 'CATEGORY', 'TAXON CONCEPT ID', 'COMMON NAME', 'SCIENTIFIC NAME', 'SUBSPECIES COMMON NAME', 'SUBSPECIES SCIENTIFIC NAME', 'EXOTIC CODE', 'OBSERVATION COUNT', 'BREEDING CODE', 'BREEDING CATEGORY', 'BEHAVIOR CODE', 'AGE/SEX', 'COUNTRY', 'COUNTRY CODE', 'STATE', 'STATE CODE', 'COUNTY', 'COUNTY CODE', 'IBA CODE', 'BCR CODE', 'USFWS CODE', 'ATLAS BLOCK', 'LOCALITY', 'LOCALITY ID', 'LOCALITY TYPE', 'LATITUDE', 'LONGITUDE', 'OBSERVATION DATE', 'TIME OBSERVATIONS STARTED', 'OBSERVER ID', 'OBSERVER ORCID ID', 'SAMPLING EVENT IDENTIFIER', 'OBSERVATION TYPE', 'PROTOCOL NAME', 'PROTOCOL CODE', 'PROJECT NAMES', 'PROJECT IDENTIFIERS', 'DURATION MINUTES', 'EFFORT DISTANCE KM', 'EFFORT AREA HA', 'NUMBER OBSERVERS', 'ALL SPECIES REPORTED', 'GROUP IDENTIFIER', 'HAS MEDIA', 'APPROVED', 'REVIEWED', 'REASON', 'CHECKLIST COMMENTS', 'SPECIES COMMENTS', 'Unnamed: 52']


Unnamed: 0,GLOBAL UNIQUE IDENTIFIER,LAST EDITED DATE,TAXONOMIC ORDER,CATEGORY,TAXON CONCEPT ID,COMMON NAME,SCIENTIFIC NAME,SUBSPECIES COMMON NAME,SUBSPECIES SCIENTIFIC NAME,EXOTIC CODE,...,NUMBER OBSERVERS,ALL SPECIES REPORTED,GROUP IDENTIFIER,HAS MEDIA,APPROVED,REVIEWED,REASON,CHECKLIST COMMENTS,SPECIES COMMENTS,Unnamed: 52
0,URN:CornellLabOfOrnithology:EBIRD:OBS2919749158,2025-03-01 23:25:39.781016,21333,species,avibase-69544B59,American Crow,Corvus brachyrhynchos,,,,...,2.0,1,G14142220,0,1,0,,,,
1,URN:CornellLabOfOrnithology:EBIRD:OBS2933256785,2025-03-05 07:43:54.669907,21333,species,avibase-69544B59,American Crow,Corvus brachyrhynchos,,,,...,1.0,1,,0,1,0,,,,
2,URN:CornellLabOfOrnithology:EBIRD:OBS3009418896,2025-03-30 17:15:29.683739,21333,species,avibase-69544B59,American Crow,Corvus brachyrhynchos,,,,...,2.0,1,G14355008,0,1,0,,,,
3,URN:CornellLabOfOrnithology:EBIRD:OBS2930040857,2025-03-04 07:42:07.176854,21333,species,avibase-69544B59,American Crow,Corvus brachyrhynchos,,,,...,1.0,1,,0,1,0,,,,
4,URN:CornellLabOfOrnithology:EBIRD:OBS3005325331,2025-03-29 12:31:48.25322,21333,species,avibase-69544B59,American Crow,Corvus brachyrhynchos,,,,...,1.0,1,,0,1,0,,,,
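
The last column, `Unnamed: 52`, looks like a parsing artifact rather than a real field. A likely cause (an assumption, not verified against the raw file) is that every line ends with a trailing tab, which makes pandas see one extra, empty column. The sketch below reproduces that behaviour on a made-up two-row sample and drops the artifact by name.

```python
import io

import pandas as pd

# Made-up tab-separated text (not eBird data); every line ends with a tab.
toy_tsv = (
    "COMMON NAME\tOBSERVATION COUNT\t\n"
    "American Crow\t2\t\n"
    "Yellow-throated Warbler\t1\t\n"
)

toy = pd.read_csv(io.StringIO(toy_tsv), sep="\t")
print(toy.columns.tolist())  # the trailing tab yields an extra "Unnamed: ..." column

# Keep only the columns whose name does not start with "Unnamed".
toy_clean = toy.loc[:, ~toy.columns.str.startswith("Unnamed")]
print(toy_clean.columns.tolist())
```

On the real dataframe the same filter would be applied to `df` instead of `toy`.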


## Narrowing down to potentially useful columns


In [12]:
df[["COMMON NAME", "SCIENTIFIC NAME", "OBSERVATION COUNT", "LATITUDE", "LONGITUDE", "TIME OBSERVATIONS STARTED", "OBSERVATION TYPE", "DURATION MINUTES", "EFFORT DISTANCE KM"]]

Unnamed: 0,COMMON NAME,SCIENTIFIC NAME,OBSERVATION COUNT,LATITUDE,LONGITUDE,TIME OBSERVATIONS STARTED,OBSERVATION TYPE,DURATION MINUTES,EFFORT DISTANCE KM
0,American Crow,Corvus brachyrhynchos,2,33.155957,-87.633699,06:21:00,Traveling,24.0,1.092
1,American Crow,Corvus brachyrhynchos,1,33.188274,-87.539672,06:27:00,Stationary,15.0,
2,American Crow,Corvus brachyrhynchos,3,33.133521,-87.653418,15:11:00,Traveling,60.0,13.532
3,American Crow,Corvus brachyrhynchos,2,33.188274,-87.539672,06:24:00,Stationary,16.0,
4,American Crow,Corvus brachyrhynchos,4,33.425151,-87.605487,10:14:00,Stationary,48.0,
...,...,...,...,...,...,...,...,...,...
4858,Yellow-throated Warbler,Setophaga dominica,1,33.189768,-87.324228,11:26:00,Stationary,6.0,
4859,Yellow-throated Warbler,Setophaga dominica,2,33.222166,-87.599165,09:34:00,Traveling,56.0,1.207
4860,Yellow-throated Warbler,Setophaga dominica,2,33.108221,-87.650768,08:44:00,Traveling,43.0,7.854
4861,Yellow-throated Warbler,Setophaga dominica,25,33.018162,-87.368288,06:52:00,Traveling,130.0,10.578


## Exploring nulls
Among the selected columns, nulls appear only in DURATION MINUTES (514) and EFFORT DISTANCE KM (1562).
These are expected: stationary observations have no distance, and incidental observations have neither duration nor distance.


In [17]:
df[["COMMON NAME", "SCIENTIFIC NAME", "OBSERVATION COUNT", "LATITUDE", "LONGITUDE", "TIME OBSERVATIONS STARTED", "OBSERVATION TYPE", "DURATION MINUTES", "EFFORT DISTANCE KM"]].isnull().sum()

COMMON NAME                     0
SCIENTIFIC NAME                 0
OBSERVATION COUNT               0
LATITUDE                        0
LONGITUDE                       0
TIME OBSERVATIONS STARTED       0
OBSERVATION TYPE                0
DURATION MINUTES              514
EFFORT DISTANCE KM           1562
dtype: int64
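
One way to check that the nulls really follow the observation type is to count them per type. The sketch below does this on a small made-up frame (the values are illustrative, not taken from the dataset); on the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Made-up rows that mimic the three observation types.
toy = pd.DataFrame({
    "OBSERVATION TYPE": ["Traveling", "Stationary", "Stationary", "Incidental"],
    "DURATION MINUTES": [24.0, 15.0, 16.0, None],
    "EFFORT DISTANCE KM": [1.092, None, None, None],
})

effort_cols = ["DURATION MINUTES", "EFFORT DISTANCE KM"]

# Flag nulls, then sum the flags within each observation type.
nulls_by_type = (
    toy[effort_cols]
    .isnull()
    .groupby(toy["OBSERVATION TYPE"])
    .sum()
)
print(nulls_by_type)
```

If the explanation above holds, the Traveling row of the result should be all zeros, while Stationary should only show nulls in the distance column.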

## Is it useful to keep Observation Type/Count?

Observation type may help rank how difficult an observation is.
The goal of the project is to know whether a bird can be observed or not, so its count is not that important for a first version.

In [14]:
df["OBSERVATION TYPE"].unique()

array(['Traveling', 'Stationary', 'Incidental'], dtype=object)

In [16]:
df["OBSERVATION COUNT"].value_counts()

OBSERVATION COUNT
1      1898
2       965
3       475
4       354
5       232
6       172
8       127
10       80
7        72
15       50
12       48
20       44
50       35
9        33
13       32
X        29
25       28
30       21
11       17
40       15
35       15
75       13
14       11
16       11
18       11
100      10
60        9
21        7
17        6
22        6
120       5
300       3
27        3
200       3
150       2
32        2
55        2
65        2
51        2
19        2
80        2
42        2
58        1
52        1
26        1
45        1
500       1
23        1
24        1
Name: count, dtype: int64
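
The counts include 29 rows with the value `X`, which eBird uses when a species was reported as present but not counted. Because of it, the column is read as text rather than numbers. If the count is ever needed, one possible approach (a sketch on made-up values, not the project's chosen method) is to coerce the column to numeric and keep track of which rows had no count.

```python
import pandas as pd

# Made-up values mimicking the column, including the "X" marker.
counts = pd.Series(["1", "2", "X", "25"], name="OBSERVATION COUNT")

# "X" cannot be parsed as a number, so it becomes NaN.
numeric_count = pd.to_numeric(counts, errors="coerce")

# True where an actual count was given, False where only presence was reported.
has_count = numeric_count.notna()

print(numeric_count.tolist())
print(has_count.tolist())
```

For a presence-only first version, every row already represents a sighting, so `X` rows can be kept as they are.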

## Negative observations

To train a model, we also need to know when a bird is not there, since absence carries information for training.
About 89% of the observations (4322 of 4863) have "ALL SPECIES REPORTED" equal to 1. On those checklists, a species that is not listed can be treated as not detected, so we'll need to mark the absences as well.

In [19]:
df["ALL SPECIES REPORTED"].value_counts()

ALL SPECIES REPORTED
1    4322
0     541
Name: count, dtype: int64
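
A minimal sketch of how absences could be marked, assuming that `SAMPLING EVENT IDENTIFIER` identifies one checklist and that only checklists with `ALL SPECIES REPORTED == 1` are safe to use for absences. The rows below are made up; on the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Made-up observations: S1 and S2 are complete checklists, S3 is not.
toy = pd.DataFrame({
    "SAMPLING EVENT IDENTIFIER": ["S1", "S1", "S2", "S3"],
    "COMMON NAME": [
        "American Crow",
        "Yellow-throated Warbler",
        "American Crow",
        "Yellow-throated Warbler",
    ],
    "ALL SPECIES REPORTED": [1, 1, 1, 0],
})

# Absence can only be inferred from complete checklists.
complete = toy[toy["ALL SPECIES REPORTED"] == 1]

# Checklist x species table: True if the species was reported on that checklist.
presence = pd.crosstab(
    complete["SAMPLING EVENT IDENTIFIER"], complete["COMMON NAME"]
).gt(0)

# Long format: one row per (checklist, species) pair with a PRESENT label.
labels = presence.reset_index().melt(
    id_vars="SAMPLING EVENT IDENTIFIER",
    var_name="COMMON NAME",
    value_name="PRESENT",
)
print(labels)
```

Note that this only generates absences for species that appear on at least one complete checklist in the sample; the incomplete checklist S3 contributes no rows at all.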