# 🐦 Birds over the years 

![viz](https://github.com/syborg91/kaggle/blob/master/cornell-birdcall-identification/map.png?raw=true)

In this notebook, we visually explore the geographic distribution of bird species over time. This could help us trace:
1. Migration patterns, and
2. Prevalence of certain species in specific regions

To that end, we will create an animated map with a time slider.

## Libraries

In [None]:
import random
import plotly
import numpy as np
import pandas as pd
from pathlib import Path
import plotly.graph_objs as go
import matplotlib.pyplot as plt

The Mapbox visualization needs an access token. Store it as a Kaggle secret named `mapbox_access_token` and retrieve it as follows:

In [None]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
mapbox_access_token = user_secrets.get_secret("mapbox_access_token")

## Data

Let's start by loading `train.csv` and checking some basic information:

In [None]:
path = Path('/kaggle/input/birdsong-recognition')
train = pd.read_csv(path/'train.csv')

In [None]:
train.info()

For our exploration, we are primarily interested in the following geo-spatial and temporal features:
- `latitude`
- `longitude`
- `elevation`
- `time`
- `date`

## Temporal features

We convert times given in 12-hour format to 24-hour format, and normalize all datetimes to the `%Y-%m-%d %H:%M:%S` format. Values that cannot be parsed are coerced to `NaN`.

> Note: All new columns are prefixed with an underscore.

In [None]:
train['_time'] = pd.to_datetime(train.time, errors='coerce').dt.strftime('%H:%M:%S')
train['_date'] = pd.to_datetime(train.date, format='%Y-%m-%d', errors='coerce').dt.strftime('%Y-%m-%d') # the date column has no time part
# creating a new column: _datetime
train['_datetime'] = pd.to_datetime(train['_date'] + ' ' + train['_time'], errors='coerce').dt.strftime('%Y-%m-%d %H:%M:%S')

Now let's look at the rows where `_datetime` is `NaN`, i.e. where either the date or the time is not in a valid format:

In [None]:
train[train._datetime.isna()][['date', 'time', '_datetime']].head(10)

As we can see, either the date or the time (or both) is invalid in these cases.
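
The coercion behaviour relied on above can be illustrated on a few made-up values (they are not taken from `train.csv`). This is only a sketch: it assumes dates in `YYYY-MM-DD` form and uses an explicit `%I:%M %p` format for the 12-hour example, which may not match every variant present in the real `time` column.

```python
import pandas as pd

# Hypothetical sample values, chosen to mimic the kinds of invalid dates discussed here
dates = pd.Series(['2013-05-25', '0000-00-00', '2014-00-12'])

# errors='coerce' turns anything that is not a real calendar date into NaT
parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')
n_invalid = int(parsed.isna().sum())

# A 12-hour clock string converted to 24-hour form via an explicit format
time_24h = pd.to_datetime('02:30 PM', format='%I:%M %p').strftime('%H:%M:%S')

print(n_invalid, time_24h)
```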

### Time distribution
Let's check the distribution of `time`, rounded to the nearest quarter of an hour (15 minutes):

In [None]:
fig = go.Figure(data=[go.Histogram(x=pd.to_datetime(train._time, format='%H:%M:%S').dt.round('15min'))]) # rounding to nearest quarter of an hour
fig.show()
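
As a small illustration of what `dt.round('15min')` does before the histogram is drawn, here is a sketch on three made-up times (not from the dataset):

```python
import pandas as pd

# Hypothetical times; parsing with only a time format places them on 1900-01-01
times = pd.to_datetime(pd.Series(['05:07:00', '05:08:00', '18:52:29']), format='%H:%M:%S')

# Round each timestamp to the nearest 15-minute boundary
rounded = times.dt.round('15min').dt.strftime('%H:%M').tolist()

print(rounded)
```

Each recording therefore falls into one of 96 quarter-hour bins per day, which keeps the histogram readable.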

We can also list the unique invalid `time` values:

In [None]:
print(train._time.isna().sum())
train[train._time.isna()]['time'].unique()

As we can see, the above values do not conform to any standard time format.

Now let's check the invalid dates:

In [None]:
print(train._date.isna().sum())
train[train._date.isna()]['date'].unique()

> Most of these invalid dates have either:
- `0000-00-00`, or
- `00` as the day or month

### Coarse-grained dates

Now let's consider the dates that have a valid `YYYY-MM` part and ignore the day (`dd`) for now:

In [None]:
train['_year_month'] = train.date.apply(lambda x : '-'.join(x.split('-')[:2])) # keep only year-month, dropping the day
train['_year_month'] = pd.to_datetime(train._year_month, format='%Y-%m', errors='coerce')
train._year_month.isna().sum()

This reduces the number of invalid dates from 152 to 37. Now let's plot a few histograms:

1. Year-month histogram

In [None]:
fig = go.Figure(data=[go.Histogram(x=train._year_month)])
fig.show()

2. Month histogram

In [None]:
fig = go.Figure(data=[go.Histogram(x=pd.DatetimeIndex(train._year_month).month)])
fig.show()

3. Year histogram

In [None]:
fig = go.Figure(data=[go.Histogram(x=pd.DatetimeIndex(train._year_month).year)])
fig.show()

Observations:
- In most years, the majority of birdcalls were recorded in `Apr - May`
- The number of recordings peaked between `Apr 2014 - Jun 2014`
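
Observations like these can also be checked numerically rather than read off the plots. A minimal sketch, assuming a datetime `_year_month` column as built above; the sample values here are made up and only stand in for that column:

```python
import pandas as pd

# Hypothetical stand-in for train._year_month (None plays the role of an invalid date)
year_month = pd.to_datetime(
    pd.Series(['2014-04', '2014-05', '2014-05', '2013-05', None]),
    format='%Y-%m', errors='coerce',
)

# Recordings per calendar month, ignoring invalid dates
by_month = year_month.dt.month.value_counts().sort_index()

# Recordings per year
by_year = year_month.dt.year.value_counts().sort_index()

print(by_month.idxmax(), by_year.idxmax())
```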

## Geo-spatial features

We perform the following transformations:
1. Remove `m`, `~`, `,` and `?` from `elevation`
2. Replace the values `1650-1900`, `930-990`, `Unknown` and `-` with an empty string
3. Keep only rows that have a valid longitude and latitude, i.e. drop rows marked `Not specified`
4. Replace empty elevation values with `0.0` and scale the result, to be used later for the marker size on the map

In [None]:
train['_year_month'] = train._year_month.dt.strftime('%Y-%m') # converting to string 
train['_elevation'] = train.elevation.apply(lambda x : x.replace('m', '').replace('~', '').replace(',', '').replace('?', '').strip()) # replace
train.loc[train._elevation.isin(['1650-1900', '930-990', 'Unknown', '-']), '_elevation'] = '' # assign empty string 
df = train.loc[(train.longitude != 'Not specified') & (train.latitude != 'Not specified'), ['country', 'latitude', 'longitude', '_elevation', '_year_month', 'ebird_code', 'elevation']].copy() # copy to avoid modifying a view of train
df.loc[df._elevation == '', '_elevation'] = None # replace empty string with None
df['_elevation'] = df._elevation.astype(float) # convert to float
df['_elevation'] = df._elevation.fillna(0.0) # replace NaN with 0.0
df['_elevation'] = (df._elevation + 100.0)/80.0 # scale values 
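
The elevation cleanup can also be written with vectorized string methods. The sketch below is an alternative, not the notebook's exact logic: it coerces every remaining non-numeric value to `0.0` via `pd.to_numeric`, instead of listing the special cases by hand, and it runs on made-up strings rather than on `train.csv`.

```python
import pandas as pd

# Hypothetical elevation strings in the style described in the list above
raw = pd.Series(['120 m', '~1,500 m', '?', '930-990', 'Unknown', '-', '0 m'])

# Strip the characters m, ~, comma and ?, then surrounding whitespace
cleaned = raw.str.replace(r'[m~,?]', '', regex=True).str.strip()

# Anything still not a plain number (ranges, 'Unknown', '-', '') becomes NaN, then 0.0
elevation = pd.to_numeric(cleaned, errors='coerce').fillna(0.0)

# Same scaling as in the cell above: shift by 100, divide by 80
marker_extra = (elevation + 100.0) / 80.0

print(marker_extra.tolist())
```

With this scaling, a recording at 0 m adds 1.25 to the base marker size and one at 1500 m adds 20.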

Our new dataframe is as follows,

In [None]:
df.info()

Now let's drop all rows with invalid dates:

In [None]:
df = df.loc[~df._year_month.isna(), :] # dropping all NaN dates

Then we set `_year_month` as the index of the dataframe (convenient for creating frames later):

In [None]:
df = df.set_index('_year_month') # setting date as the dataframe index

In [None]:
df.head()

## Map

First, let's assign a random color to each bird species:

In [None]:
# total number of bird species
number_of_colors = 264

# list of random hex-valued colors 
color = ["#"+''.join([random.choice('0123456789ABCDEF') for j in range(6)])
             for i in range(number_of_colors)]

ebird_code = df.ebird_code.unique().tolist()
# map each ebird_code to a color
EBIRD_CODE = {k : color[i] for i, k in enumerate(ebird_code)}
# assign them to the dataframe
df['_color'] = df.ebird_code.apply(lambda x : EBIRD_CODE[x])
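
Because the colors above are drawn without a seed, they change on every run. If stable colors are preferred, one option is a seeded generator. A minimal sketch: `make_color_map` and its `seed` parameter are hypothetical names introduced here, and the three codes are sample inputs only.

```python
import random

def make_color_map(codes, seed=0):
    """Map each code to a random hex color, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    return {
        code: '#' + ''.join(rng.choice('0123456789ABCDEF') for _ in range(6))
        for code in codes
    }

colors = make_color_map(['aldfly', 'amecro', 'norcar'])
print(colors)
```

Sizing the map from the list of codes also removes the need for a hard-coded species count.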

Now let's get all the unique year-months (in ascending order):

In [None]:
months = sorted(df.index.unique().tolist())

First, we will create a list of dicts containing the individual frames for our map, one frame per year-month. The tooltip will display:
- `ebird_code`
- `elevation`, and
- `country`

In addition, each bird is assigned the color given in the `_color` column.

In [None]:
frames = [{   
    'name':'frame_{}'.format(x),
    'data':[{
        'type':'scattermapbox',
        'lat':np.array(df.xs(x)['latitude']),
        'lon':np.array(df.xs(x)['longitude']),
        'marker':go.scattermapbox.Marker(
            size= 9 + df.xs(x)['_elevation'],
            color=df.xs(x)['_color']
        ),
        'customdata': np.stack((df.xs(x)['ebird_code'], df.xs(x)['elevation'], df.xs(x)['country']), axis=-1),
        'hovertemplate': "<extra></extra> 🐦 <em>%{customdata[0]}</em><br> 📏 %{customdata[1]}<br> 🗺️ %{customdata[2]}<br>",
    }],           
} for x in months]
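
The structure of a single frame is easier to see on a toy dataframe. The sketch below uses made-up coordinates and a hypothetical helper `make_frame`. It selects rows with `df.loc[[month]]` rather than `df.xs(month)`, because `xs` returns a Series when only one row matches, whereas a list selector always returns a DataFrame, so the arrays keep a consistent shape.

```python
import numpy as np
import pandas as pd

# Toy data indexed by year-month, mirroring the layout of df (values are made up)
toy = pd.DataFrame(
    {
        'latitude': [10.0, 20.0, 30.0],
        'longitude': [1.0, 2.0, 3.0],
        'ebird_code': ['aldfly', 'amecro', 'aldfly'],
    },
    index=pd.Index(['2014-04', '2014-04', '2014-05'], name='_year_month'),
)

def make_frame(df, month):
    """Build one frame dict holding every row recorded in the given month."""
    rows = df.loc[[month]]  # list selector keeps a DataFrame even for a single row
    return {
        'name': 'frame_{}'.format(month),
        'data': [{
            'type': 'scattermapbox',
            'lat': rows['latitude'].to_numpy(),
            'lon': rows['longitude'].to_numpy(),
            'customdata': rows[['ebird_code']].to_numpy(),  # one row per point
        }],
    }

toy_frames = [make_frame(toy, m) for m in sorted(toy.index.unique())]
print([f['name'] for f in toy_frames])
```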

Next, let's create our slider and set all the necessary configuration:

In [None]:
sliders = [{
    'transition':{'duration': 0},
    'x':0.08, 
    'len':0.88,
    'currentvalue':{'font':{'size':15}, 'prefix':'📅 ', 'visible':True, 'xanchor':'center'},  
    'steps':[
        {
            'label':x,
            'method':'animate',
            'args':[
                ['frame_{}'.format(x)],
                {'mode':'immediate', 'frame':{'duration':100, 'redraw': True}, 'transition':{'duration':50}}
              ],
        } for x in months]
}]
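
Each slider step finds its frame purely by name, so the string in `args[0]` has to match the `name` given to the frame. A small sketch that makes this link explicit; `make_slider_steps` and its parameters are hypothetical names, and the defaults simply repeat the durations used above.

```python
def make_slider_steps(months, frame_ms=100, transition_ms=50):
    """Build one 'animate' step per month, targeting the frame named frame_<month>."""
    return [
        {
            'label': month,
            'method': 'animate',
            'args': [
                ['frame_{}'.format(month)],  # must equal the frame's 'name'
                {
                    'mode': 'immediate',
                    'frame': {'duration': frame_ms, 'redraw': True},
                    'transition': {'duration': transition_ms},
                },
            ],
        }
        for month in months
    ]

steps = make_slider_steps(['2014-04', '2014-05'])
print([s['label'] for s in steps])
```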

Next, we define the play and pause buttons, which let us play all the frames in sequence:

In [None]:
play_button = [
    {
        "buttons": [
            {
                "args": [None, {"frame": {"duration": 100, "redraw": True},
                                "fromcurrent": True, "transition": {"duration": 50}}],
                "label": "Play",
                "method": "animate"
            },
            {
                "args": [[None], {"frame": {"duration": 0, "redraw": False},
                                  "mode": "immediate",
                                  "transition": {"duration": 0}}],
                "label": "Pause",
                "method": "animate"
            }
        ],
        "direction": "left",
        "pad": {"r": 10, "t": 87},
        "showactive": True,
        "type": "buttons",
        "x": 0.1,
        "xanchor": "right",
        "y": 0,
        "yanchor": "top"
    }
]

And finally, let's display our map:

In [None]:
# defining the initial state
data = frames[0]['data']

# adding all sliders and play button to the layout
layout = go.Layout(
    sliders=sliders,
    updatemenus=play_button,
    title="Birds over the years",
    mapbox={
        'accesstoken':mapbox_access_token,
        'center':{"lat": 37.86, "lon": 2.15},
        'zoom':1.7,
        'style':'dark', # choose from: dark or light
    },
    height=1000
)

# creating the figure
fig = go.Figure(data=data, layout=layout, frames=frames)

# displaying the figure
fig.show()

And there you have it! Feel free to tinker with the settings as required and explore the different birds in their habitats through the years.

This notebook hopefully enables people to understand how some of the species are more prevalent than others in specific geographic locations (and in particular seasons). Encoding this information while training our models could be an interesting avenue to explore.

🐦 Happy birding!

## References
- [Intro to Animations in Python](https://plotly.com/python/animations/)
- [How to create outstanding animated scatter maps with Plotly and Dash](https://towardsdatascience.com/how-to-create-animated-scatter-maps-with-plotly-and-dash-f10bb82d357a)