![wildflower banner](https://raw.githubusercontent.com/Floydworks/Capstone2_Wildflower_Phenology/main/image_files/wildflower_banner.png)

# Find your favorite East Bay wildflowers!
### A model to explain bloom times for wildflowers in California's East Bay
<br>
<br>Flowering is triggered by environmental and climatic variables.
<br>The primary factors influencing bloom appearance and senescence are photoperiod (daylength), <br>temperature, precipitation, nutrients, and response to certain chemicals or hormones (Cho et al., 2016).
<br>
<br>This tool combines public observations of wildflowers collected in the iNaturalist app with
<br>temperature, precipitation, and daylength data to describe when wildflower species bloom in the Bay Area.
<br>Gradient Boosting and Random Forest models indicate the most influential climate factors in blooming 
<br>for each species of interest.
<br>
<br>Cited works:
<br>L. Cho, J. Yoon, G. An, 2016. The control of flowering time by environmental factors. The Plant Journal.
<br>https://onlinelibrary.wiley.com/doi/10.1111/tpj.13461

# 1. Data:
### Data sources
**iNaturalist export tool:** Taxon:47125
<br>https://www.inaturalist.org/observations/export
<br>**Temperature and precipitation data** 
<br>https://www.ncei.noaa.gov/access
<br>**Daylength data: Skyfield API**
<br>https://rhodesmill.org/skyfield/api.html
<br>
<br>**links to observation data and climate data exports**
<br>[iNaturalist observations](https://github.com/Floydworks/Capstone2_Wildflower_Phenology/tree/main/inat_observation_csvs)
<br>[NOAA climate files](https://github.com/Floydworks/Capstone2_Wildflower_Phenology/tree/main/NOAA_climate_files)


### East Bay Parks and Weather Stations
Observations over a five-year period in seven East Bay parks were used to train the model.
<br>
<br>Years are defined as California water years, which run from October 01 through September 30 of the following calendar year.
<br>Example: water year 2018 includes dates from October 01, 2017 through September 30, 2018.
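
The water-year rule above can be expressed as a small helper. This is a minimal sketch rather than the code used in the notebooks; the function name `water_year` is hypothetical.

```python
from datetime import date


def water_year(d: date) -> int:
    """Return the California water year that contains date d.

    Water year N runs from October 1 of year N-1 through September 30 of year N.
    """
    # October through December belong to the following water year.
    return d.year + 1 if d.month >= 10 else d.year


print(water_year(date(2017, 10, 1)))  # 2018
print(water_year(date(2018, 9, 30)))  # 2018
print(water_year(date(2018, 10, 1)))  # 2019
```
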
<br>
<br>Climate data for each park comes from the nearest weather station. When two equidistant stations are available, park climate variables are reported as the average daily values of the two stations.
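
As an illustration of the two-station case, the sketch below averages daily values by date. The function name, the date-keyed dictionaries, and the temperature values are all hypothetical. Falling back to the one available station when the other has a gap is an assumption, not a documented step of this project.

```python
from statistics import mean
from typing import Dict, Optional


def average_station_values(
    *stations: Dict[str, Optional[float]]
) -> Dict[str, Optional[float]]:
    """Average daily values across stations, keyed by ISO date string."""
    all_dates = sorted(set().union(*stations))
    averaged = {}
    for day in all_dates:
        # Assumption: skip missing readings and average whatever is available.
        readings = [s[day] for s in stations if s.get(day) is not None]
        averaged[day] = mean(readings) if readings else None
    return averaged


# Made-up maximum temperatures for two equidistant stations.
station_a = {"2018-01-01": 50.0, "2018-01-02": 54.0}
station_b = {"2018-01-01": 52.0, "2018-01-02": None}

print(average_station_values(station_a, station_b))
# {'2018-01-01': 51.0, '2018-01-02': 54.0}
```
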
### Model training and test data
> **parks included in the training dataset (18,705 total observations)**
> 1. Tilden Regional Park 
> 1. Briones Regional Park
> 1. Sunol Regional Wilderness
> 1. Garin Regional Park
> 1. Pleasanton Ridge Regional Park
> 1. Anthony Chabot Regional Park
> 1. Joseph D Grant County Park
<br>
**parks included in the testing dataset (30,379 total observations)**
> 1. Mt. Diablo State Park



### Fig 1. Table of park information
>  **size(mi2)** = park size in square miles
><br>  **place_id** = place id for iNaturalist export tool and observation search
><br>  **region** = geographic region in California
><br>  **lat_long** = latitude and longitude of park
><br>  **station_id** = weather station id for NOAA API
><br>  **dataset** = training or final testing dataset assignment
><br>  **stations** = city names for location of weather station(s) associated with the park

![image of park info table east bay](https://raw.githubusercontent.com/Floydworks/Capstone2_Wildflower_Phenology/main/image_files/park_info_table_eastbay.png)

# 2. Methods
<br>**Questions:** We want to tell the user when, where, and why they will find a particular flower.
 1. When does each species bloom?
 1. What climate features (daylength, precipitation, and temperature) influence blooming in each species?

<br>**When:** Data for all parks and all years in the region are combined, and the number of observations per month is calculated.
<br>**What:** Various multivariate models are tested in PyCaret and hyperparameters are tuned. Feature importance, or the importance of any particular climate condition, for blooming is ranked by the model. Ranks indicate which variables are most influential for that species. 
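
The "When" step amounts to pooling observation dates and counting them by calendar month. A minimal sketch follows; the function name and the sample dates are hypothetical, and the notebooks may compute this differently.

```python
from collections import Counter
from datetime import date


def monthly_observation_counts(observation_dates):
    """Count observations per calendar month (1-12), pooling all parks and years."""
    counts = Counter(d.month for d in observation_dates)
    # Report every month, including those with zero observations.
    return {month: counts.get(month, 0) for month in range(1, 13)}


# Made-up observation dates for one species across several years.
dates = [
    date(2018, 2, 10),
    date(2019, 2, 21),
    date(2019, 3, 5),
    date(2020, 4, 1),
]
counts = monthly_observation_counts(dates)
print(counts[2], counts[3], counts[4], counts[8])  # 2 1 1 0
```
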

# 3. Cleaning and integrating the data

[Data wrangling notebook](https://github.com/Floydworks/Capstone2_Wildflower_Phenology/blob/main/Capstone2_Data_Wrangling%5BR.Sandidge%5D.ipynb)


### iNaturalist observation filtering:
- Filter for herbaceous plants that are considered annuals.
- Drop shrubs and trees.
- Filter out uncommon species, keeping those with greater than 100 observations.
- Filter for seasonality, keeping plants with 85% of observations falling between January and July.
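
The abundance and seasonality filters can be sketched in plain Python as below. The function name, the `(species, month)` input format, and the sample counts are hypothetical. Treating the 85% threshold as "at least 85%" and January through July as inclusive are assumptions.

```python
from collections import defaultdict


def select_species(observations, min_obs=100, min_share=0.85):
    """Keep species with more than min_obs observations and at least
    min_share of them falling in January through July.

    observations: iterable of (species_name, month_number) pairs.
    """
    totals = defaultdict(int)
    in_season = defaultdict(int)
    for species, month in observations:
        totals[species] += 1
        if 1 <= month <= 7:
            in_season[species] += 1
    return sorted(
        sp for sp, n in totals.items()
        if n > min_obs and in_season[sp] / n >= min_share
    )


# Made-up counts: only the first species passes both filters.
observations = (
    [("Adelinia grande", 3)] * 110 + [("Adelinia grande", 9)] * 10
    + [("Species B", 3)] * 90 + [("Species B", 9)] * 30   # too few in season
    + [("Species C", 4)] * 50                              # too few observations
)
print(select_species(observations))  # ['Adelinia grande']
```
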

<br>**Data Labeling:**
<br>Each observation is viewed manually using the image URL and labeled as pre-bloom, in-bloom, or senesced.


# 4. Months each species can be seen in bloom
[Exploratory data analysis notebook: wildflower observations](https://github.com/Floydworks/Capstone2_Wildflower_Phenology/blob/80bb22f2f9f6b4fa68cea28426a77e9a1253144e/Capstone2_EDA_climate_wildflower_phenology.ipynb)
### There were 32 species with greater than 100 observations and 85 percent of observations falling between January and July

### Fig 2. Table of abundant species (only first 10 are displayed)
>  **mo_tab_mo_wy** = number of months with no observations in 5 water years
><br>  **obsY** = total observations in the dataset
><br>  **obsS** = total observations made between January and July
><br>  **obsP** = proportion of annual observations that occurred between January and July

![table of abundant species](https://raw.githubusercontent.com/Floydworks/Capstone2_Wildflower_Phenology/main/image_files/wildflowers_100obs_first10.png)

# 5. Climate EDA

[Exploratory data analysis notebook: climate](https://github.com/Floydworks/Capstone2_Wildflower_Phenology/blob/80bb22f2f9f6b4fa68cea28426a77e9a1253144e/Capstone2_EDA_climate_wildflower_phenology.ipynb)

**Climate features:**
>  **prec_daily** = daily precipitation in inches
><br>  **prec_cum_WY** = daily cumulative precipitation over the water year
><br>  **MonSumPrec** = precipitation sum for each water year month
><br>  **WkSumPrec** = precipitation sum for each water year week
><br>  **sum_prec_prior14, sum_prec_prior30** = sum of precipitation in previous 14 days and 30 days
><br>  **min_temp, max_temp** = minimum and maximum daily temperature
><br>  **MinTemp_prior14, MaxTemp_prior14** = minimum and maximum daily temperature in previous 14 and 30 days
><br>  **AvgMinTemp_prior14, AvgMaxTemp_prior30** =  average minimum and average maximum daily temperature in previous 14 and 30 days
><br>  **day_length** = daily number of seconds of daylight
><br>  **MaxDayLen_prior14, MaxDayLen_prior30** = maximum day length in previous 14 and 30 days
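
The engineered features above are cumulative and trailing-window statistics over a daily series. The sketch below shows one way to compute them; the helper names and sample values are hypothetical. Excluding the current day from each window and using partial windows at the start of the series are assumptions, since the exact window definition is not stated here.

```python
from itertools import accumulate


def prior_window_sum(values, window):
    """For each day i, sum the values of up to `window` days before it (day i excluded)."""
    return [sum(values[max(0, i - window):i]) for i in range(len(values))]


def prior_window_max(values, window):
    """For each day i, take the maximum over up to `window` prior days (None if no prior days)."""
    result = []
    for i in range(len(values)):
        prior = values[max(0, i - window):i]
        result.append(max(prior) if prior else None)
    return result


# Made-up daily precipitation (inches) at the start of a water year.
prec_daily = [0.0, 0.5, 0.0, 1.0, 0.25]

prec_cum_wy = list(accumulate(prec_daily))        # running total, like prec_cum_WY
sum_prec_prior2 = prior_window_sum(prec_daily, 2)  # 2-day window for brevity
max_prec_prior2 = prior_window_max(prec_daily, 2)

print(prec_cum_wy)      # [0.0, 0.5, 0.5, 1.5, 1.75]
print(sum_prec_prior2)  # [0, 0.0, 0.5, 0.5, 1.0]
print(max_prec_prior2)  # [None, 0.0, 0.5, 0.5, 1.0]
```

In practice the same helpers would be called with `window=14` or `window=30` on temperature, precipitation, or day-length series.
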

### Fig. 3 Correlation plot of all raw and engineered climate features

![climate variable correlations](https://raw.githubusercontent.com/Floydworks/Capstone2_Wildflower_Phenology/main/image_files/corrplot_all_features.png)
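
Each cell of a correlation plot like this one is a pairwise correlation between two feature columns. The sketch below computes a Pearson coefficient by hand. Pearson is assumed here because it is the usual default; the function name and sample values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Made-up values: day length rising while minimum temperature also rises.
day_length = [36000, 38000, 40000, 42000]
min_temp = [40.0, 43.0, 45.0, 49.0]
print(round(pearson(day_length, min_temp), 3))
```

Strongly correlated pairs, such as a raw feature and its own trailing-window version, show up as values near 1 or -1.
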

### Fig. 4 Climate features for all parks in training dataset plotted over 2018 water year

![climate feature averages 2018](https://raw.githubusercontent.com/Floydworks/Capstone2_Wildflower_Phenology/main/image_files/climate_feature_averages_2018.png)

# 6. Model and Feature Selection with PyCaret
### Adelinia grande, Pacific Houndstongue
[model metrics file](https://github.com/Floydworks/Capstone2_Wildflower_Phenology/blob/dc0cba90000bdbeec82f850a8fe923391852a188/WildflowerPhenology_model_metrics.txt)
<br>[modeling notebook](https://github.com/Floydworks/Capstone2_Wildflower_Phenology/blob/4a7ad3ceb5d2c283a134577d30710219341cf13c/cap2_modeling_A.Grande%5BR.Sandidge%5D.ipynb)
<br>
<br>**Splitting training and test sets:**
<br>80 percent of the training dataset was used to train the model and the remaining 20 percent to test it.
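
An 80/20 split of this kind can be illustrated in plain Python. This is a sketch of the idea only, not the PyCaret call used in the modeling notebook; the function name, the seed, and the simple shuffle (no stratification by label) are assumptions.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and return (train_rows, test_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for labeled observations.
rows = list(range(100))
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))  # 80 20
```

Fixing the seed keeps the split reproducible, so model comparisons are made on the same held-out rows.
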
### Hyperparameter Tuning (see model metrics file)
<br>Best Model: Gradient Boosting Classifier
<br>Number of features = 17
<br>Number of estimators=100



### Fig. 5 Model performance comparison for train/test and test datasets

![model comparison](https://raw.githubusercontent.com/Floydworks/Capstone2_Wildflower_Phenology/main/image_files/model_compare_train_test_wide.png)

# 7. Results: Adelinia grande

### Adelinia grande, Pacific Houndstongue
> - found in early spring throughout the East Bay
> - peak blooming happens in February through April
> - blooming is triggered by daylength, so this flower will most likely be found at the same time each year
> - warming temperatures also contribute to blooming, so warmer, exposed sites may bloom earlier

### Fig. 6 Adelinia grande bloom period and climate feature importance 

![bloom months and feature importance](https://raw.githubusercontent.com/Floydworks/Capstone2_Wildflower_Phenology/main/image_files/A_grande_plots.png)

# 8. Challenges to Current Model
**wildflower observations**
<br>The majority of species do not have enough observations to run the model. Only very common and abundant 
<br>species are included, so far.
<br>
<br>**climate data**
<br>Fine-grained temperature and precipitation data are scarce. Records from many NOAA stations have large gaps. Getting <br>data from stations near the parks is critical but may be difficult for some areas.
<br>
<br>**model deficiencies**
<br>All current parks are located in the East Bay region. The limited geographic coverage does not capture a <br>broad range of climatic conditions. Having samples from regions with differing temperature and precipitation <br>conditions for a given photoperiod (daylength) will most likely improve modeling.



# 9. Next Steps
<br>**modeling**
<br>Get observation data for many more parks across California to visualize blooms across the state. 
<br>Additional observations will increase sample size across species, allowing more species to be modeled as the 
<br>data set grows.
<br>
<br>**mapping**
<br>Plot species of interest observations over a map of California to visualize the seasonality and bloom dates 
<br>in parks across all regions.
<br>Include a dashboard.
<br>Species visualizations will be accompanied by summary statistics and model results, highlighting the most <br>important variables in predicting blooms.




# 10. Acknowledgements
<br> Thanks to Raghunandan Patthar and Wayne Ang, Springboard mentors.