# Wildflower Phenology Tool
## guide to the data munging notebook
### Rebecca Sandidge, PhD
### updated: Nov. 11, 2022
<br>This notebook uses raw exports and APIs to bring in three data types:
- wildflower observations from iNaturalist
- temperature and precipitation daily records from the NOAA GHCN network
- sunrise and sunset times computed with the Skyfield Python library

The observation data used in this study are available [here](https://github.com/Floydworks/WildflowerFinder_Phenology_Tool/blob/98595b0b70aec23c54a6c2d43e9871bceb820405/cleaned_data_files/updated_inat_data.csv)
<br>
<br>Five California water years of data are included from Oct. 01, 2017 through Sep. 30, 2022.
<br>Observations and climate data are gathered for eight parks in California's East Bay.
<br>The data are cleaned, reformatted, and features are engineered where needed.
<br>
<br>A dataframe containing only climate and daylength data, for every day of the sampling period, is produced for EDA and visualization of climate features.
<br>A final dataframe with combined observation, climate, and daylength data is produced for modeling wildflower phenology.
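
The study period is defined in water years, which run from Oct. 1 through Sep. 30 and are named for the calendar year in which they end. A minimal sketch of assigning a water year to a date under that convention; the helper name `water_year` is hypothetical and is not part of the notebook:

```python
from datetime import date


def water_year(d: date) -> int:
    """Return the water year containing d (Oct 1 to Sep 30, named for the ending year)."""
    return d.year + 1 if d.month >= 10 else d.year


# First and last days of the study period
print(water_year(date(2017, 10, 1)))
print(water_year(date(2022, 9, 30)))
```

Under this convention the study period spans water years 2018 through 2022, which matches the five water years stated above.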

-
<a id='guide'></a>

## 1. Import necessary libraries
## 2. Information about the parks 
## 3. Clean and filter wildflower observations [Link](#wildflower_observations)
 - 3a. [import iNaturalist observation data](#import_inat)
 - 3b, 3c. [remove rows with missing species names and unwanted species](#remove_species)
 - 3d. [treat missing values](#missing_values1)
 - 3e. [format dates](#format_date1)
 - 3f. [drop duplicate observations](#flower_duplicates)
<br>**imports:** [updated_inat_data.csv](https://raw.githubusercontent.com/Floydworks/WildflowerFinder_Phenology_Tool/main/cleaned_data_files/updated_inat_data.csv)
<br>**exports:** [df_wildflowers_2017_2022.csv](https://raw.githubusercontent.com/Floydworks/WildflowerFinder_Phenology_Tool/main/cleaned_data_files/df_wildflowers_2017_2022.csv)

## 4. NOAA, GHCN climate data
 - 4a. [information about weather stations](#weather_stations)
 - 4b. [import climate data](#import_climate)
 - 4c. [create dataframes from exported GHCN data](#station_dataframes)
 - 4d. [format dates, select dates in study period, and select columns for all dataframes](#format_dataframes)
 - 4e. [fill in missing Berkeley rows](#fill_berkeley)
 - 4f. [concatenate individual station dataframes and clean up dates](#format_dates_climate)
 - 4g. [convert precipitation and temperature to proper units](#prec_temp_units)
 - 4h. [treat missing values](#missing_climate)
 - 4i. [add cumulative precipitation](#cumulative_precipitation)
 <br>**imports:** [GHCN csv files for each station available here](https://github.com/Floydworks/WildflowerFinder_Phenology_Tool/tree/main/NOAA_climate_files)
 <br>**exports:** [climate_GHCN_data.csv](https://raw.githubusercontent.com/Floydworks/WildflowerFinder_Phenology_Tool/main/cleaned_data_files/climate_GHCN_data.csv)

## 5. Daylength from Skyfield [Link](#daylength)
 - 5a. [tune API call settings](#skyfield_settings)
 - 5b. [make API call](#skyfield_api)
 - 5c. [convert dates, extract desired elements, and format data frame](#skyfield_cleaning)
 - 5d. [treat missing values](#skyfield_missing_values)
 - 5e. [calculate each daylength](#calculate_daylength)
 - 5f. [clean up dataset](#daylength_clean)
 <br>**imports:** data imported through Skyfield API [link to Skyfield documentation](https://rhodesmill.org/skyfield/)
<br>**exports:** [day length data: daylength_data.csv](https://raw.githubusercontent.com/Floydworks/WildflowerFinder_Phenology_Tool/main/cleaned_data_files/daylength_data.csv)

## 6. Merge climate and daylength data [Link](#merge_clim_day)
 - 6a. import climate and daylength data
 - 6b. [merge daylength and climate datasets](#merge_clim_day)
 - 6c. [calculate climate and daylength averages for parks using multiple stations](#multi_station_avgs)
 - 6d. [engineer climate features](#engineer_climate)
<br>**imports:** [climate data: climate_GHCN_data.csv](https://raw.githubusercontent.com/Floydworks/WildflowerFinder_Phenology_Tool/main/cleaned_data_files/climate_GHCN_data.csv) and [day length data: daylength_data.csv](https://raw.githubusercontent.com/Floydworks/WildflowerFinder_Phenology_Tool/main/cleaned_data_files/daylength_data.csv)
<br>**exports:** [climate_daylength_2017_2022.csv](https://raw.githubusercontent.com/Floydworks/WildflowerFinder_Phenology_Tool/main/cleaned_data_files/climate_daylength_2017_2022.csv)
 
## 7. Merge climate_daylength with iNaturalist observations [Link](#merge_all)
 - 7a. import iNaturalist observations and climate_daylength data
 - 7b. [merge iNaturalist flower observations and climate_daylength](#merge_flower_clim_day)
<br>**imports:** [climate_daylength_2017_2022.csv](https://raw.githubusercontent.com/Floydworks/WildflowerFinder_Phenology_Tool/main/cleaned_data_files/climate_daylength_2017_2022.csv) and [df_wildflowers_2017_2022.csv](https://raw.githubusercontent.com/Floydworks/WildflowerFinder_Phenology_Tool/main/cleaned_data_files/df_wildflowers_2017_2022.csv)
<br>**exports:** [phenology_dataset_2017_2022_df.csv](https://raw.githubusercontent.com/Floydworks/WildflowerFinder_Phenology_Tool/main/cleaned_data_files/phenology_dataset_2017_2022_df.csv)

## 8. Final export of integrated data: [phenology_dataset_2017_2022_df.csv](https://raw.githubusercontent.com/Floydworks/WildflowerFinder_Phenology_Tool/main/cleaned_data_files/phenology_dataset_2017_2022_df.csv)



______________________________

# 1. Import the necessary libraries

In [1]:
#!pip install pyinaturalist
#!pip install pandas

from pyinaturalist.node_api import get_all_observations
import pandas as pd
import numpy as np
from datetime import date, datetime
import os

import skyfield
from skyfield import api
from skyfield import almanac

import requests
import io

print("Libraries imported!")

Libraries imported!


# 2. Information about the parks

In [2]:
#PARK DICTIONARY 
park_info_dict =  {"Tilden Regional Park": {"size(mi2)": "3.25","place_id":"3523","region": "east bay", 'lat_long':(37.894647, -122.241635), 'stations':('Berkeley', 'Berkeley2', 'Oakland'),'station_id':('USC00040693', 'US1CAAL0034', 'USW00023230'), 'dataset':('train')} ,
                   "Briones Regional Park": {"size(mi2)": "9.56","place_id":"3706","region": "east bay", 'lat_long':(37.935804, -122.137413), 'stations':('Concord'),'station_id':('USW00023254'), 'dataset':('train')} ,
                   "Sunol Regional Wilderness": {"size(mi2)": "3.25","place_id":"3456","region": "east bay", 'lat_long':(37.510183, -121.82855), 'stations':('SanJose', 'Livermore'),'station_id':('USW00023293','USW00023285'), 'dataset':('train')}, 
                   "Mt Diablo State Park": {"size(mi2)": "31.25","place_id":"5586","region": "east bay", 'lat_long':(37.881698, -121.914155), 'stations':('MtDiablo'),'station_id':('USC00045915'), 'dataset':('test')},
                   "Garin Regional Park": {"size(mi2)": "9.06","place_id":"5199","region": "east bay", 'lat_long':(37.63544, -122.02068), 'stations':('Hayward'),'station_id':('USW00093228'), 'dataset':('train')},
                   "Pleasanton Ridge Regional Park": {"size(mi2)": "14.20","place_id":"5777","region": "east bay", 'lat_long':(37.615409, -121.88456), 'stations':('Livermore', 'Hayward'),'station_id':('USW00023285','USW00093228'), 'dataset':('train')},
                   "Anthony Chabot Regional Park": {"size(mi2)": "5.1781","place_id":"5239","region": "east bay", 'lat_long':(37.766, -122.119), 'stations':('Oakland'),'station_id':('USW00023230'), 'dataset':('train')},
                   "Joseph D Grant County Park": {"size(mi2)": "14.9266","place_id":"5339","region": "east bay", 'lat_long':(37.345495, -121.68717), 'stations':('SanJose', 'MtHamilton'),'station_id':('USW00023293','USC00045933'), 'dataset':('train')}
                  }

#"_": {"size(mi2)": "_","place_id":"_","region": "_", 'lat_long':()}

#create dataframe of park information
park_info_df = pd.DataFrame.from_dict(park_info_dict, orient='index')
#park_info_df = park_info_df.drop(['station'], axis=1)

park_info_df[park_info_df['region']=='east bay']

Unnamed: 0,size(mi2),place_id,region,lat_long,stations,station_id,dataset
Tilden Regional Park,3.25,3523,east bay,"(37.894647, -122.241635)","(Berkeley, Berkeley2, Oakland)","(USC00040693, US1CAAL0034, USW00023230)",train
Briones Regional Park,9.56,3706,east bay,"(37.935804, -122.137413)",Concord,USW00023254,train
Sunol Regional Wilderness,3.25,3456,east bay,"(37.510183, -121.82855)","(SanJose, Livermore)","(USW00023293, USW00023285)",train
Mt Diablo State Park,31.25,5586,east bay,"(37.881698, -121.914155)",MtDiablo,USC00045915,test
Garin Regional Park,9.06,5199,east bay,"(37.63544, -122.02068)",Hayward,USW00093228,train
Pleasanton Ridge Regional Park,14.2,5777,east bay,"(37.615409, -121.88456)","(Livermore, Hayward)","(USW00023285, USW00093228)",train
Anthony Chabot Regional Park,5.1781,5239,east bay,"(37.766, -122.119)",Oakland,USW00023230,train
Joseph D Grant County Park,14.9266,5339,east bay,"(37.345495, -121.68717)","(SanJose, MtHamilton)","(USW00023293, USC00045933)",train
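
Note that in Python `('Concord')` is a plain string, not a one-element tuple, which is why the single-station parks display without parentheses in the table above. If later code iterates over `stations` or `station_id`, a bare string would be iterated character by character. A minimal sketch of normalizing these fields; `as_tuple` is a hypothetical helper and is not part of the notebook:

```python
def as_tuple(value):
    """Wrap a bare string in a one-element tuple; leave tuples unchanged."""
    return (value,) if isinstance(value, str) else tuple(value)


# Two entries copied from park_info_dict, reduced to the station fields
sample = {
    "Briones Regional Park": {"stations": ("Concord"), "station_id": ("USW00023254")},
    "Sunol Regional Wilderness": {"stations": ("SanJose", "Livermore"),
                                  "station_id": ("USW00023293", "USW00023285")},
}

normalized = {
    park: {key: as_tuple(val) for key, val in info.items()}
    for park, info in sample.items()
}
print(normalized["Briones Regional Park"]["stations"])
```

Writing `('Concord',)` with a trailing comma in the dictionary itself would have the same effect without a helper.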


-
<a id='wildflower_observations'></a>

# 3. Clean and filter wildflower observations
**iNaturalist Observations using Export Tool:**
<br>Data were acquired with the iNaturalist export tool [here](https://www.inaturalist.org/observations/export)
<br>Each observation is associated with its iNaturalist URL and the URL of the plant photo used in labeling
<br>Flowering plants: taxon_id = 47125 
<br>
[Link to top](#guide)

-
<a id='import_inat'></a>

### 3a. import iNaturalist observation data 
[raw export files available here](https://github.com/Floydworks/Capstone2_Wildflower_Phenology/tree/main/NOAA_climate_files)
<br>
[compiled iNaturalist observations](https://raw.githubusercontent.com/Floydworks/WildflowerFinder_Phenology_Tool/main/cleaned_data_files/df_wildflowers_2017_2022.csv) File containing five years of data from eight parks 

In [3]:
#import iNaturalist observation data from GitHub:Floydworks
url = ('https://raw.githubusercontent.com/Floydworks/WildflowerFinder_Phenology_Tool/main/cleaned_data_files/updated_inat_data.csv')
download = requests.get(url).content

# Read the downloaded content and turn it into a pandas dataframe
flowers_data = pd.read_csv(io.StringIO(download.decode('utf-8')))
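
The file read here includes an `Unnamed: 0` column, which typically appears when a dataframe's row index was written to the CSV and then read back as data. A minimal sketch of absorbing that column at read time with `index_col=0`; the sample text is a stand-in for the downloaded file, not the real data:

```python
import io

import pandas as pd

# Tiny stand-in for the downloaded file; the leading unnamed column is a saved index
csv_text = (
    ",id,park,genus\n"
    "0,104188607,Sunol,Baccharis\n"
    "1,104188609,Sunol,Capsella\n"
)

default_read = pd.read_csv(io.StringIO(csv_text))
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_read.columns))   # includes 'Unnamed: 0'
print(list(indexed_read.columns))   # only the data columns
```

The notebook reaches the same result later by selecting only the columns it needs, so this is an alternative rather than a required change.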

In [4]:
# import observation data stored locally
#flowers_df = pd.read_csv(r'/YOUR LOCAL FILE PATH')
#assign copy for manipulating
#flowers_data = flowers_df


In [5]:
print(flowers_data['park'].unique())
print(flowers_data.shape)

['Sunol' 'Briones' 'Tilden' 'AnthonyChabot' 'Garin' 'JDGrant'
 'PleasantonRidge' 'MtDiablo']
(37922, 13)


In [6]:
flowers_data.head()

Unnamed: 0.1,Unnamed: 0,id,date,datetime,park,region,latitude,longitude,genus_species,genus,species,url,image_url
0,0,104188607,1/1/22,2022-01-01,Sunol,east bay,37.530981,-121.819691,Baccharis pilularis,Baccharis,pilularis,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...
1,1,104188609,1/1/22,2022-01-01,Sunol,east bay,37.52706,-121.827025,Capsella bursa-pastoris,Capsella,bursa-pastoris,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...
2,2,104667115,1/8/22,2022-01-08,Sunol,east bay,37.523395,-121.833219,Sambucus cerulea,Sambucus,cerulea,https://www.inaturalist.org/observations/10466...,https://static.inaturalist.org/photos/17543114...
3,3,104681782,1/9/22,2022-01-09,Sunol,east bay,37.520038,-121.822708,Cardamine californica,Cardamine,californica,https://www.inaturalist.org/observations/10468...,https://inaturalist-open-data.s3.amazonaws.com...
4,4,104690215,1/8/22,2022-01-08,Sunol,east bay,37.509616,-121.824145,Calandrinia menziesii,Calandrinia,menziesii,https://www.inaturalist.org/observations/10469...,https://inaturalist-open-data.s3.amazonaws.com...


In [7]:
flowers_data.columns

In [8]:
#loop through parks and output first observation date, last observation date, number of observations

park_names = list(flowers_data['park'].unique())
#park_names

for p in park_names:
    df_temp = flowers_data[flowers_data['park']==p]
    name = p
    num_obs = len(df_temp)
    print('Park name: '+ str(p) +', '+'Num. of Obs = '+ str(num_obs))
    print(df_temp['datetime'].min())
    print(df_temp['datetime'].max())
   

Park name: Sunol, Num. of Obs = 2878
2017-10-03
2022-09-27
Park name: Briones, Num. of Obs = 3649
2017-10-07
2022-09-25
Park name: Tilden, Num. of Obs = 6070
2017-10-03
2022-09-26
Park name: AnthonyChabot, Num. of Obs = 2866
2017-11-04
2022-09-30
Park name: Garin, Num. of Obs = 390
2017-11-24
2022-09-24
Park name: JDGrant, Num. of Obs = 2418
2017-10-11
2022-09-17
Park name: PleasantonRidge, Num. of Obs = 603
2017-10-14
2022-09-26
Park name: MtDiablo, Num. of Obs = 19048
2017-10-04
2022-09-30
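
The per-park loop above can also be expressed as a single grouped aggregation that returns a table. A minimal sketch on made-up sample rows; the output column names `num_obs`, `first_obs`, and `last_obs` are illustrative choices:

```python
import pandas as pd

# Small illustrative sample; the real rows come from flowers_data
sample = pd.DataFrame({
    "park": ["Sunol", "Sunol", "Garin", "Garin", "Garin"],
    "datetime": ["2017-10-03", "2022-09-27", "2017-11-24", "2020-04-01", "2022-09-24"],
})

# Named aggregation: one row per park with count, earliest, and latest date
summary = sample.groupby("park")["datetime"].agg(
    num_obs="count", first_obs="min", last_obs="max"
)
print(summary)
```

`count` ignores missing values, whereas `len` in the loop counts every row; the two agree here because the `datetime` column has no missing entries. The string comparison behind `min` and `max` orders dates correctly only because they are in ISO `YYYY-MM-DD` form.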


In [9]:
print(flowers_data['park'].unique())
flowers_data.head()

['Sunol' 'Briones' 'Tilden' 'AnthonyChabot' 'Garin' 'JDGrant'
 'PleasantonRidge' 'MtDiablo']


Unnamed: 0.1,Unnamed: 0,id,date,datetime,park,region,latitude,longitude,genus_species,genus,species,url,image_url
0,0,104188607,1/1/22,2022-01-01,Sunol,east bay,37.530981,-121.819691,Baccharis pilularis,Baccharis,pilularis,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...
1,1,104188609,1/1/22,2022-01-01,Sunol,east bay,37.52706,-121.827025,Capsella bursa-pastoris,Capsella,bursa-pastoris,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...
2,2,104667115,1/8/22,2022-01-08,Sunol,east bay,37.523395,-121.833219,Sambucus cerulea,Sambucus,cerulea,https://www.inaturalist.org/observations/10466...,https://static.inaturalist.org/photos/17543114...
3,3,104681782,1/9/22,2022-01-09,Sunol,east bay,37.520038,-121.822708,Cardamine californica,Cardamine,californica,https://www.inaturalist.org/observations/10468...,https://inaturalist-open-data.s3.amazonaws.com...
4,4,104690215,1/8/22,2022-01-08,Sunol,east bay,37.509616,-121.824145,Calandrinia menziesii,Calandrinia,menziesii,https://www.inaturalist.org/observations/10469...,https://inaturalist-open-data.s3.amazonaws.com...


-
<a id='remove_species'></a>

### 3b. drop rows with missing species names

In [10]:
print(flowers_data.isna().sum())

flowers_data = flowers_data.dropna()

print(flowers_data.isna().sum())

Unnamed: 0        0
id                0
date              0
datetime          0
park              0
region            0
latitude          0
longitude         0
genus_species    21
genus            21
species          21
url               0
image_url         0
dtype: int64
Unnamed: 0       0
id               0
date             0
datetime         0
park             0
region           0
latitude         0
longitude        0
genus_species    0
genus            0
species          0
url              0
image_url        0
dtype: int64
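
Calling `dropna()` with no arguments removes a row if any column is missing. In the output above only the three taxonomy columns have missing values, so the result is the same, but passing `subset` makes the intent explicit and protects rows that lack some unrelated field. A minimal sketch on made-up rows:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "id": [1, 2, 3],
    "genus_species": ["Baccharis pilularis", np.nan, "Sambucus cerulea"],
    "genus": ["Baccharis", np.nan, "Sambucus"],
    "species": ["pilularis", np.nan, "cerulea"],
    "image_url": ["a.jpg", "b.jpg", np.nan],   # missing photo, but species is known
})

strict = sample.dropna()                                              # any missing value
by_name = sample.dropna(subset=["genus_species", "genus", "species"])  # names only
print(len(strict), len(by_name))
```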


### 3c. remove unwanted plant species
<br>Trees, shrubs, and grasses (including sedges) are removed by genus.

In [11]:
#dictionaries of tree genera and grass genera.
tree_dict = {'Salix':53453,'Cyperus':52734, 'Juglans':54495, 'Quercus':47851,'Acer':47727, 'Sambucus':52689,
                  'Populus':47566,'Schinus':57355, 'Platanus':49664, 'Toxicodendron':51079, 'Aesculus':53350, 
                  'Umbellularia':48810, 'Fraxinus':54806, 'Arbutus':51047, 'Alnus':53352, 'Lithocarpus':53956,
                  'Eucalyptus':51815, 'Prunus':47351}
                  
grass_dict = {'Carex':48571, 'Bromus':52701, 'Cortaderia':52715, 'Ehrharta':64143, 'Spartina':51826, 'Avena':52697,
              'Briza':57160}

#Combine the dictionaries of unwanted plants
not_included = dict(tree_dict)
not_included.update(grass_dict)

#create list of genus names to drop
not_included_names = list(not_included.keys())

print(not_included_names)

['Salix', 'Cyperus', 'Juglans', 'Quercus', 'Acer', 'Sambucus', 'Populus', 'Schinus', 'Platanus', 'Toxicodendron', 'Aesculus', 'Umbellularia', 'Fraxinus', 'Arbutus', 'Alnus', 'Lithocarpus', 'Eucalyptus', 'Prunus', 'Carex', 'Bromus', 'Cortaderia', 'Ehrharta', 'Spartina', 'Avena', 'Briza']


In [12]:
#drop tree, shrub, and grass genera
flowers_data = flowers_data[~flowers_data['genus'].isin(not_included_names)]

In [13]:
#Check number of genera for reduction in rows and genera. 
#Some genera names in drop list may not occur in observations.
#print(flowers_data['genus'].describe())

### Summary of species cleaning
 - Observations missing genus and/or species were dropped.
 - Plants in tree, shrub, or grass groups were dropped.


-
<a id='missing_values1'></a>

### 3d. treat missing values

In [14]:
#reorder columns and select ones needed
flowers_data = flowers_data[['id','date','genus_species', 'genus', 'species', 
                    'park', 'region','latitude', 'longitude', 'url', 'image_url']] 
flowers_data.head(3)

Unnamed: 0,id,date,genus_species,genus,species,park,region,latitude,longitude,url,image_url
0,104188607,1/1/22,Baccharis pilularis,Baccharis,pilularis,Sunol,east bay,37.530981,-121.819691,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...
1,104188609,1/1/22,Capsella bursa-pastoris,Capsella,bursa-pastoris,Sunol,east bay,37.52706,-121.827025,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...
3,104681782,1/9/22,Cardamine californica,Cardamine,californica,Sunol,east bay,37.520038,-121.822708,https://www.inaturalist.org/observations/10468...,https://inaturalist-open-data.s3.amazonaws.com...


In [15]:
#check data types and look for columns with missing values
flowers_data.info()

#get counts by column for missing values using .isna().sum()
print(flowers_data.isna().sum())

print(flowers_data.shape)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 34666 entries, 0 to 37921
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             34666 non-null  int64  
 1   date           34666 non-null  object 
 2   genus_species  34666 non-null  object 
 3   genus          34666 non-null  object 
 4   species        34666 non-null  object 
 5   park           34666 non-null  object 
 6   region         34666 non-null  object 
 7   latitude       34666 non-null  float64
 8   longitude      34666 non-null  float64
 9   url            34666 non-null  object 
 10  image_url      34666 non-null  object 
dtypes: float64(2), int64(1), object(8)
memory usage: 3.2+ MB
id               0
date             0
genus_species    0
genus            0
species          0
park             0
region           0
latitude         0
longitude        0
url              0
image_url        0
dtype: int64
(34666, 11)


In [16]:
#look at rows with missing genus values
#flowers_data_na = flowers_data[flowers_data[['genus']].isna().all(1)]
#print(flowers_data_na.shape)
#flowers_data_na

In [17]:
#treat the string 'None' as missing, then drop all observations with missing data
flowers_data = flowers_data.mask(flowers_data.eq('None')).dropna()

#check to make sure no missing data values are left
print(flowers_data.info())
#print(flowers_data.isna().sum())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 34666 entries, 0 to 37921
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             34666 non-null  int64  
 1   date           34666 non-null  object 
 2   genus_species  34666 non-null  object 
 3   genus          34666 non-null  object 
 4   species        34666 non-null  object 
 5   park           34666 non-null  object 
 6   region         34666 non-null  object 
 7   latitude       34666 non-null  float64
 8   longitude      34666 non-null  float64
 9   url            34666 non-null  object 
 10  image_url      34666 non-null  object 
dtypes: float64(2), int64(1), object(8)
memory usage: 3.2+ MB
None


-
<a id='format_date1'></a>

### 3e. format dates
<br>Extract year, month and day. The various dataframes will be merged on the date columns.

In [18]:
#print(flowers_data.head())

In [19]:
#Create a timezone-aware datetime column from the date strings

flowers_data['DateTime'] = pd.to_datetime(flowers_data['date'], utc=True)
print(type(flowers_data['DateTime']))

<class 'pandas.core.series.Series'>


In [20]:
#add columns for month, day, and year
flowers_data['year'] = flowers_data['DateTime'].astype(str).str[:4]
flowers_data['month'] = flowers_data['DateTime'].astype(str).str[5:7]
flowers_data['day'] = flowers_data['DateTime'].astype(str).str[8:10]
#add plain_dates column for merging
flowers_data['plain_dates'] = flowers_data['year']+flowers_data['month']+flowers_data['day']
flowers_data.head(3)

Unnamed: 0,id,date,genus_species,genus,species,park,region,latitude,longitude,url,image_url,DateTime,year,month,day,plain_dates
0,104188607,1/1/22,Baccharis pilularis,Baccharis,pilularis,Sunol,east bay,37.530981,-121.819691,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...,2022-01-01 00:00:00+00:00,2022,1,1,20220101
1,104188609,1/1/22,Capsella bursa-pastoris,Capsella,bursa-pastoris,Sunol,east bay,37.52706,-121.827025,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...,2022-01-01 00:00:00+00:00,2022,1,1,20220101
3,104681782,1/9/22,Cardamine californica,Cardamine,californica,Sunol,east bay,37.520038,-121.822708,https://www.inaturalist.org/observations/10468...,https://inaturalist-open-data.s3.amazonaws.com...,2022-01-09 00:00:00+00:00,2022,1,9,20220109
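
The year, month, and day columns above are built by slicing the string form of `DateTime`. An equivalent sketch using the `.dt.strftime` accessor, with an explicit parse format; the month/day/two-digit-year format is inferred from rows such as `1/9/22` pairing with `2022-01-09`, and the sample values are illustrative:

```python
import pandas as pd

sample = pd.DataFrame({"date": ["1/1/22", "1/9/22", "10/3/17"]})

# Parse month/day/two-digit-year strings with an explicit format
sample["DateTime"] = pd.to_datetime(sample["date"], format="%m/%d/%y", utc=True)

# Zero-padded string columns, matching the slicing approach
sample["year"] = sample["DateTime"].dt.strftime("%Y")
sample["month"] = sample["DateTime"].dt.strftime("%m")
sample["day"] = sample["DateTime"].dt.strftime("%d")
sample["plain_dates"] = sample["DateTime"].dt.strftime("%Y%m%d")
print(sample["plain_dates"].tolist())
```

Keeping `plain_dates` as a zero-padded string matters for the later merges, since both sides of a merge key need the same type and format.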


In [21]:
flowers_data.columns

In [22]:
date_description = flowers_data.groupby(['park'])['DateTime'].max()
print(date_description)

park
AnthonyChabot     2022-09-30 00:00:00+00:00
Briones           2022-09-25 00:00:00+00:00
Garin             2022-09-24 00:00:00+00:00
JDGrant           2022-09-17 00:00:00+00:00
MtDiablo          2022-09-30 00:00:00+00:00
PleasantonRidge   2022-09-26 00:00:00+00:00
Sunol             2022-09-27 00:00:00+00:00
Tilden            2022-09-26 00:00:00+00:00
Name: DateTime, dtype: datetime64[ns, UTC]


In [23]:
#reorder columns and select ones needed
flowers_data = flowers_data[['id','DateTime', 'plain_dates', 'year', 'month', 'day', 'genus_species', 'genus', 'species', 
                    'park', 'region','latitude', 'longitude', 'url', 'image_url']] 
flowers_data.head(3)

Unnamed: 0,id,DateTime,plain_dates,year,month,day,genus_species,genus,species,park,region,latitude,longitude,url,image_url
0,104188607,2022-01-01 00:00:00+00:00,20220101,2022,1,1,Baccharis pilularis,Baccharis,pilularis,Sunol,east bay,37.530981,-121.819691,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...
1,104188609,2022-01-01 00:00:00+00:00,20220101,2022,1,1,Capsella bursa-pastoris,Capsella,bursa-pastoris,Sunol,east bay,37.52706,-121.827025,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...
3,104681782,2022-01-09 00:00:00+00:00,20220109,2022,1,9,Cardamine californica,Cardamine,californica,Sunol,east bay,37.520038,-121.822708,https://www.inaturalist.org/observations/10468...,https://inaturalist-open-data.s3.amazonaws.com...


In [24]:
print(flowers_data.shape)
col_names = flowers_data.columns.values.tolist()
print(col_names)

(34666, 15)
['id', 'DateTime', 'plain_dates', 'year', 'month', 'day', 'genus_species', 'genus', 'species', 'park', 'region', 'latitude', 'longitude', 'url', 'image_url']


In [25]:
#see parks
list(flowers_data['park'].unique())


-
<a id='flower_duplicates'></a>

### 3f. drop any duplicate values

In [26]:
#see duplicates in the dataframe
dups = flowers_data[flowers_data.duplicated()]
#dups

In [27]:
dups.park.unique()

In [28]:
#drop all duplicated values
flowers_data = flowers_data.drop_duplicates()

#df_all.duplicated().sum()
flowers_data.shape
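
`drop_duplicates()` with no arguments removes only rows that are identical in every column. A minimal sketch of also checking for repeated observation ids, assuming `id` uniquely identifies an iNaturalist observation; the sample rows are made up:

```python
import pandas as pd

sample = pd.DataFrame({
    "id": [101, 101, 102, 103],
    "genus": ["Baccharis", "Baccharis", "Capsella", "Cardamine"],
    "url": ["u1", "u1", "u2", "u3"],
})

n_exact = int(sample.duplicated().sum())               # fully identical rows
n_same_id = int(sample.duplicated(subset="id").sum())  # repeated observation ids
deduped = sample.drop_duplicates(subset="id", keep="first")
print(n_exact, n_same_id, len(deduped))
```

If the two counts differ on the real data, some observations share an id but differ in another column and are worth inspecting before dropping.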

In [29]:
flowers_data.head()

Unnamed: 0,id,DateTime,plain_dates,year,month,day,genus_species,genus,species,park,region,latitude,longitude,url,image_url
0,104188607,2022-01-01 00:00:00+00:00,20220101,2022,1,1,Baccharis pilularis,Baccharis,pilularis,Sunol,east bay,37.530981,-121.819691,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...
1,104188609,2022-01-01 00:00:00+00:00,20220101,2022,1,1,Capsella bursa-pastoris,Capsella,bursa-pastoris,Sunol,east bay,37.52706,-121.827025,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...
3,104681782,2022-01-09 00:00:00+00:00,20220109,2022,1,9,Cardamine californica,Cardamine,californica,Sunol,east bay,37.520038,-121.822708,https://www.inaturalist.org/observations/10468...,https://inaturalist-open-data.s3.amazonaws.com...
4,104690215,2022-01-08 00:00:00+00:00,20220108,2022,1,8,Calandrinia menziesii,Calandrinia,menziesii,Sunol,east bay,37.509616,-121.824145,https://www.inaturalist.org/observations/10469...,https://inaturalist-open-data.s3.amazonaws.com...
5,104737731,2022-01-10 00:00:00+00:00,20220110,2022,1,10,Baccharis pilularis,Baccharis,pilularis,Sunol,east bay,37.531082,-121.819465,https://www.inaturalist.org/observations/10473...,https://inaturalist-open-data.s3.amazonaws.com...


### export wildflower observations data

In [30]:
#export data
#timestamp
today = date.today()
now = datetime.now()
current_time = now.strftime("%H:%M:%S")
print("most recent export:",today, ",", current_time)

flowers_data.to_csv('/Users/sandidge/Desktop/Python_Projects/Springboard_coursework/Capstone2_Wildflowers/Public_Final/df_wildflowers_2017_2022.csv')
#flowers_data.to_csv('YOUR LOCAL FILE PATH')

most recent export: 2022-11-14 , 07:59:19
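
The export above writes to an absolute path on the author's machine. A minimal sketch of a more portable export using `pathlib`; the output folder here is a temporary directory standing in for a project folder, and the single sample row is illustrative:

```python
import tempfile
from datetime import datetime
from pathlib import Path

import pandas as pd

sample = pd.DataFrame({"id": [104188607], "park": ["Sunol"]})

# Hypothetical output folder; replace the temporary directory with your project path
out_dir = Path(tempfile.mkdtemp()) / "cleaned_data_files"
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / "df_wildflowers_2017_2022.csv"

# index=False keeps the row index out of the exported file
sample.to_csv(out_path, index=False)
print("most recent export:", datetime.now().strftime("%Y-%m-%d , %H:%M:%S"))

round_trip = pd.read_csv(out_path)
print(list(round_trip.columns))
```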


-
<a id='climate_data'></a>

# 4. Climate Data: NOAA GHCN data
Temperature and precipitation data were downloaded using station id numbers at: https://www.ncei.noaa.gov/access 
<br>
<br>
[Link to top](#guide)


<a id='weather_stations'></a>
### 4a. information about weather stations

In [31]:
#East Bay Stations:

stations_dict = {
    'Concord': {'station_id':'USW00023254', 'lat_long':(37.99165, -122.05268), 'region':'east bay', 'near_park':('Briones'), 'data':('PRCP, TMIN, TMAX, TAVG')},
    'Hayward Airport': {'station_id':'USW00093228', 'lat_long':(37.65886, -122.12116), 'region':'east bay', 'near_park':('Garin'), 'data':('PRCP, TMIN, TMAX, TAVG')},
    'Livermore': {'station_id':'USW00023285', 'lat_long':(37.69309, -121.8149), 'region':'east bay', 'near_park':('Sunol'), 'data':('PRCP, TMIN, TMAX')},
    'Mt Hamilton': {'station_id':'USC00045933', 'lat_long':(37.34336, -121.63473), 'region':'east bay', 'near_park':('Joseph Grant'), 'data':('PRCP, TMIN, TMAX')},
    'San Jose': {'station_id':'USW00023293', 'lat_long':(37.35938, -121.92444), 'region':'east bay', 'near_park':('Sunol'), 'data':('PRCP, TMIN, TMAX')},
    'Berkeley': {'station_id':'USC00040693', 'lat_long':(37.8744, -122.2605), 'region':'east bay', 'near_park':('Tilden'), 'data':('PRCP, TMIN, TMAX')},
    'Berkeley2': {'station_id':'US1CAAL0034', 'lat_long':(37.88, -122.28), 'region':'east bay', 'near_park':('Tilden'), 'data':('PRCP')},
    'Oakland Airport': {'station_id':'USW00023230', 'lat_long':(37.717776, -122.232857), 'region':'east bay', 'near_park':('Anthony Chabot'), 'data':('PRCP, TMIN, TMAX')},
    'Mt Diablo': {'station_id':'USC00045915', 'lat_long':(37.8792, -121.9303), 'region':'east bay', 'near_park':('Mt Diablo'), 'data':('PRCP, TMIN, TMAX')},
              }

#'_': {'station_id':'', 'lat_long':(), 'region':'east bay', 'near_park':('')}

#create dataframe of station information
stations_df = pd.DataFrame.from_dict(stations_dict, orient='index')
#
stations_df

Unnamed: 0,station_id,lat_long,region,near_park,data
Concord,USW00023254,"(37.99165, -122.05268)",east bay,Briones,"PRCP, TMIN, TMAX, TAVG"
Hayward Airport,USW00093228,"(37.65886, -122.12116)",east bay,Garin,"PRCP, TMIN, TMAX, TAVG"
Livermore,USW00023285,"(37.69309, -121.8149)",east bay,Sunol,"PRCP, TMIN, TMAX"
Mt Hamilton,USC00045933,"(37.34336, -121.63473)",east bay,Joseph Grant,"PRCP, TMIN, TMAX"
San Jose,USW00023293,"(37.35938, -121.92444)",east bay,Sunol,"PRCP, TMIN, TMAX"
Berkeley,USC00040693,"(37.8744, -122.2605)",east bay,Tilden,"PRCP, TMIN, TMAX"
Berkeley2,US1CAAL0034,"(37.88, -122.28)",east bay,Tilden,PRCP
Oakland Airport,USW00023230,"(37.717776, -122.232857)",east bay,Anthony Chabot,"PRCP, TMIN, TMAX"
Mt Diablo,USC00045915,"(37.8792, -121.9303)",east bay,Mt Diablo,"PRCP, TMIN, TMAX"


<a id='import_climate'></a>
### 4b. import climate data
<br>download files from GitHub:Floydworks
<br>[GHCN csv files for each station available here](https://github.com/Floydworks/WildflowerFinder_Phenology_Tool/tree/main/NOAA_climate_files)


In [32]:
## folder containing the GHCN station files
folder_path = '/Users/sandidge/Desktop/Python_Projects/Springboard_coursework/Capstone2_Wildflowers/NOAA_climate/'
#folder_path = 'PATH TO YOUR LOCAL FOLDER'


## list all files available
all_files = os.listdir(folder_path)
print(all_files)

## only store .csv filenames
csv_files = [f for f in all_files if f.endswith('.csv')]

## create a new list to store filenames with no .csv extension
file_names = [f.split('.')[0] for f in csv_files]

['berkeley2_US1CAAL0034.csv', 'concord_USW00023254.csv', '.DS_Store', 'livermore_USW00023285.csv', 'clim_old', 'Unused_data_API_notebooks', 'sanjose_USW00023293.csv', 'mtdiablo_USC00045915.csv', 'berkeley_USC00040693.csv', 'hayward_USW00093228.csv', 'using_climate_data.README.txt', '.ipynb_checkpoints', 'mthamilton_USC00045933.csv', 'oakland_USW00023230.csv']
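
The station files follow a `city_stationid.csv` naming pattern, so the city and station id can be read from the filename. A minimal sketch using `pathlib`, run on a few entries copied from the listing above; the `stations` dictionary is a hypothetical name:

```python
from pathlib import Path

# A few entries copied from the directory listing above
all_files = ['berkeley2_US1CAAL0034.csv', '.DS_Store', 'clim_old',
             'using_climate_data.README.txt', 'oakland_USW00023230.csv']

stations = {}
for name in all_files:
    path = Path(name)
    if path.suffix != ".csv":
        continue  # skip folders, hidden files, and the README
    # The stem is the filename without its extension, e.g. 'oakland_USW00023230'
    city, station_id = path.stem.split("_", 1)
    stations[city] = station_id

print(stations)
```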


<a id='station_dataframes'></a>
### 4c. create dataframe for each station

In [33]:
#make dataframes for each station
df_names = []   #list to store df names produced
city_names = []          
station_id_codes = []
station_names = []

## Loop through to assign dataframe names
for file in file_names:
    final_df = file+"_df"
    #print("Dataframe name : "+final_df, type(final_df))
    df_names.append(final_df)      #add this one to list of df names 'df_names'
    city_name = file.split('_')[0]  #extract city name
    city_names.append(city_name)
    station_name = file.split('_')[0]  #extract station name
    station_names.append(station_name)
    station_id_code = file.split('_')[1].split('.')[0] #extract station id code
    station_id_codes.append(station_id_code) 
    
    filename = file+".csv"
    ## To use a string as a dataframe variable name, assign through globals()
    globals()[final_df] = pd.read_csv(folder_path+filename)   #folder_path is set in 4b
    globals()[final_df]['city'] = city_name      #add column with city name
    globals()[final_df]['station_id'] = station_id_code     #add column with station id code
    #print(globals()[final_df])     #print the data frame


In [34]:
#create a dataframe with station and object name info
Stations = pd.DataFrame()
Stations['df_name'] = df_names
Stations['city_name'] = city_names
Stations['station_id_code'] = station_id_codes
Stations['station_names'] = station_names

Stations.sort_values(by = ['city_name'])

Unnamed: 0,df_name,city_name,station_id_code,station_names
5,berkeley_USC00040693_df,berkeley,USC00040693,berkeley
0,berkeley2_US1CAAL0034_df,berkeley2,US1CAAL0034,berkeley2
1,concord_USW00023254_df,concord,USW00023254,concord
6,hayward_USW00093228_df,hayward,USW00093228,hayward
2,livermore_USW00023285_df,livermore,USW00023285,livermore
4,mtdiablo_USC00045915_df,mtdiablo,USC00045915,mtdiablo
7,mthamilton_USC00045933_df,mthamilton,USC00045933,mthamilton
8,oakland_USW00023230_df,oakland,USW00023230,oakland
3,sanjose_USW00023293_df,sanjose,USW00023293,sanjose
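
The loop above creates one variable per station through `globals()`. A minimal sketch of an alternative, assuming the same `<city>_<station id>` file-name pattern, keeps the dataframes in a dictionary keyed by city name. The `fake_files` contents and the `station_dfs` name are hypothetical stand-ins, not part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the station CSV files, keyed by "<city>_<station id>".
fake_files = {
    "concord_USW00023254": "DATE,PRCP\n2017-09-01,0\n2017-09-02,5\n",
    "hayward_USW00093228": "DATE,PRCP\n2017-09-01,0\n",
}

station_dfs = {}  # maps city name -> DataFrame, instead of globals()[name]
for stem, csv_text in fake_files.items():
    city_name, station_id_code = stem.split("_")
    df = pd.read_csv(io.StringIO(csv_text))  # in the notebook this would be a file path
    df["city"] = city_name                   # add column with city name
    df["station_id"] = station_id_code       # add column with station id code
    station_dfs[city_name] = df

print(sorted(station_dfs))
```

With a dictionary, later steps can loop over `station_dfs.items()` rather than repeating the same statement for each station variable.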


# new berkeley section

<a id='format_dataframes'></a>
### 4d. format dates, select dates in study period, and select columns for all dataframes

**format and restrict dates in study period**

In [35]:
# create a list of dataframes
dataframes=[berkeley_USC00040693_df, berkeley2_US1CAAL0034_df, concord_USW00023254_df, hayward_USW00093228_df, 
            livermore_USW00023285_df, mtdiablo_USC00045915_df, mthamilton_USC00045933_df, oakland_USW00023230_df,
            sanjose_USW00023293_df]

#check date format for each
#for s in dataframes:
#    print(s['NAME'][0], 'date type: ', type(s['DATE'][0]), 'format: ', s['DATE'][0])
    
#convert concord date format as needed
concord_USW00023254_df['DATE'] = pd.to_datetime(concord_USW00023254_df['DATE'])
concord_USW00023254_df['DATE'] = concord_USW00023254_df['DATE'].astype(str)

berkeley2_US1CAAL0034_df['DATE'] = pd.to_datetime(berkeley2_US1CAAL0034_df['DATE'])
berkeley2_US1CAAL0034_df['DATE'] = berkeley2_US1CAAL0034_df['DATE'].astype(str)

#check date formats again
for s in dataframes:
    print(s['NAME'][0], 'date type: ', type(s['DATE'][0]), 'format: ', s['DATE'][0])


BERKELEY, CA US date type:  <class 'str'> format:  1893-01-01
BERKELEY 1.1 NE, CA US date type:  <class 'str'> format:  2019-09-16
CONCORD BUCHANAN FIELD, CA US date type:  <class 'str'> format:  1999-06-06
HAYWARD AIR TERMINAL, CA US date type:  <class 'str'> format:  1998-09-19
LIVERMORE MUNICIPAL AIRPORT, CA US date type:  <class 'str'> format:  1998-04-06
MOUNT DIABLO JUNCTION, CA US date type:  <class 'str'> format:  1952-04-01
MOUNT HAMILTON, CA US date type:  <class 'str'> format:  1948-07-01
OAKLAND INTERNATIONAL AIRPORT, CA US date type:  <class 'str'> format:  1948-01-01
SAN JOSE, CA US date type:  <class 'str'> format:  1998-07-04
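
The later date filters compare `DATE` strings directly, which only works because `YYYY-MM-DD` strings sort in chronological order. A small sketch of a helper that normalizes any parseable date column to that format is given here. The function name and the `9/16/2019` input format are assumptions for illustration; the raw format of the concord and berkeley2 files is not shown in the notebook.

```python
import pandas as pd

def normalize_dates(df, column="DATE"):
    """Return a copy with `column` rewritten as 'YYYY-MM-DD' strings."""
    out = df.copy()
    out[column] = pd.to_datetime(out[column]).dt.strftime("%Y-%m-%d")
    return out

# Hypothetical input with month/day/year dates
example = pd.DataFrame({"DATE": ["9/16/2019", "9/17/2019"], "PRCP": [18.0, 0.0]})
clean = normalize_dates(example)
print(clean["DATE"].tolist())
```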


In [36]:
# restrict dates to 2017-09-01 through 2022-09-30
berkeley_USC00040693_df = berkeley_USC00040693_df[(berkeley_USC00040693_df['DATE']>='2017-09-01')&(berkeley_USC00040693_df['DATE']<='2022-09-30')]
print(berkeley_USC00040693_df['city'].unique(), 'min: ',berkeley_USC00040693_df['DATE'].min(), 'max: ',berkeley_USC00040693_df['DATE'].max())

berkeley2_US1CAAL0034_df = berkeley2_US1CAAL0034_df[(berkeley2_US1CAAL0034_df['DATE']>='2017-09-01')&(berkeley2_US1CAAL0034_df['DATE']<='2022-09-30')]
print(berkeley2_US1CAAL0034_df['city'].unique(), 'min: ',berkeley2_US1CAAL0034_df['DATE'].min(), 'max: ',berkeley2_US1CAAL0034_df['DATE'].max())

concord_USW00023254_df = concord_USW00023254_df[(concord_USW00023254_df['DATE']>='2017-09-01')&(concord_USW00023254_df['DATE']<='2022-09-30')]
print(concord_USW00023254_df['city'].unique(), 'min: ',concord_USW00023254_df['DATE'].min(), 'max: ',concord_USW00023254_df['DATE'].max())

hayward_USW00093228_df = hayward_USW00093228_df[(hayward_USW00093228_df['DATE']>='2017-09-01')&(hayward_USW00093228_df['DATE']<='2022-09-30')]
print(hayward_USW00093228_df['city'].unique(), 'min: ',hayward_USW00093228_df['DATE'].min(), 'max: ',hayward_USW00093228_df['DATE'].max())

livermore_USW00023285_df = livermore_USW00023285_df[(livermore_USW00023285_df['DATE']>='2017-09-01')&(livermore_USW00023285_df['DATE']<='2022-09-30')]
print(livermore_USW00023285_df['city'].unique(), 'min: ',livermore_USW00023285_df['DATE'].min(), 'max: ',livermore_USW00023285_df['DATE'].max())

mtdiablo_USC00045915_df = mtdiablo_USC00045915_df[(mtdiablo_USC00045915_df['DATE']>='2017-09-01')&(mtdiablo_USC00045915_df['DATE']<='2022-09-30')]
print(mtdiablo_USC00045915_df['city'].unique(), 'min: ',mtdiablo_USC00045915_df['DATE'].min(), 'max: ',mtdiablo_USC00045915_df['DATE'].max())

mthamilton_USC00045933_df = mthamilton_USC00045933_df[(mthamilton_USC00045933_df['DATE']>='2017-09-01')&(mthamilton_USC00045933_df['DATE']<='2022-09-30')]
print(mthamilton_USC00045933_df['city'].unique(), 'min: ',mthamilton_USC00045933_df['DATE'].min(), 'max: ',mthamilton_USC00045933_df['DATE'].max())

oakland_USW00023230_df = oakland_USW00023230_df[(oakland_USW00023230_df['DATE']>='2017-09-01')&(oakland_USW00023230_df['DATE']<='2022-09-30')]
print(oakland_USW00023230_df['city'].unique(), 'min: ',oakland_USW00023230_df['DATE'].min(), 'max: ',oakland_USW00023230_df['DATE'].max())

sanjose_USW00023293_df = sanjose_USW00023293_df[(sanjose_USW00023293_df['DATE']>='2017-09-01')&(sanjose_USW00023293_df['DATE']<='2022-09-30')]
print(sanjose_USW00023293_df['city'].unique(), 'min: ',sanjose_USW00023293_df['DATE'].min(), 'max: ',sanjose_USW00023293_df['DATE'].max())


['berkeley'] min:  2017-09-01 max:  2022-03-31
['berkeley2'] min:  2019-09-16 max:  2022-09-30
['concord'] min:  2017-09-01 max:  2022-09-30
['hayward'] min:  2017-09-01 max:  2022-09-30
['livermore'] min:  2017-09-01 max:  2022-09-30
['mtdiablo'] min:  2017-09-01 max:  2022-09-30
['mthamilton'] min:  2017-09-01 max:  2022-09-30
['oakland'] min:  2017-09-01 max:  2022-09-30
['sanjose'] min:  2017-09-01 max:  2022-09-30
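
The nine filters above all apply the same rule. As a sketch, the rule can be written once as a function and applied to each station dataframe. The `restrict_dates` name and the demo data are hypothetical, and the function assumes `DATE` already holds `YYYY-MM-DD` strings.

```python
import pandas as pd

START, END = "2017-09-01", "2022-09-30"

def restrict_dates(df, start=START, end=END):
    """Keep rows whose ISO-formatted DATE string falls within [start, end]."""
    mask = df["DATE"].between(start, end)  # inclusive on both ends by default
    return df.loc[mask].copy()             # copy so later column assignments are safe

demo = pd.DataFrame({
    "DATE": ["2017-08-31", "2017-09-01", "2022-09-30", "2022-10-01"],
    "city": "demo",
})
trimmed = restrict_dates(demo)
print(trimmed["DATE"].min(), trimmed["DATE"].max())
```

Returning a copy also avoids the `SettingWithCopyWarning` that can appear when columns are added to a filtered dataframe.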


**add empty columns where needed, select desired, matching columns for all datasets**

In [37]:
#add empty TMIN and TMAX columns to berkeley 1.1 NE station US1CAAL0034 for concatenation
#take an explicit copy of the date-filtered slice first so the assignment does not raise SettingWithCopyWarning
berkeley2_US1CAAL0034_df = berkeley2_US1CAAL0034_df.copy()
berkeley2_US1CAAL0034_df[['TMAX', 'TMIN']]= np.nan  #use later


In [38]:
#list columns we want
cols = ['STATION', 'DATE', 'LATITUDE', 'LONGITUDE', 'NAME', 'PRCP',
       'TMAX', 'TMIN', 'city', 'station_id']

In [39]:
#get just the columns we want from each dataframe
berkeley_df = berkeley_USC00040693_df[cols]
berkeley2_df = berkeley2_US1CAAL0034_df[cols] #station with only PRCP data for filling in blanks
concord_df = concord_USW00023254_df[cols]
hayward_df = hayward_USW00093228_df[cols]
livermore_df = livermore_USW00023285_df[cols]
mtdiablo_df = mtdiablo_USC00045915_df[cols]
mthamilton_df = mthamilton_USC00045933_df[cols]
oakland_df = oakland_USW00023230_df[cols]
sanjose_df = sanjose_USW00023293_df[cols]
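
Adding empty columns and then selecting a shared column list can also be done in one step with `DataFrame.reindex`, which creates any missing column as NaN. This is a sketch with a shortened, hypothetical column list and made-up rows, not the notebook's full `cols`.

```python
import pandas as pd

cols = ["STATION", "DATE", "PRCP", "TMAX", "TMIN"]  # shortened list for illustration

# A precipitation-only station: no TMAX/TMIN, plus one column we do not want
prcp_only = pd.DataFrame({
    "STATION": ["US1CAAL0034"],
    "DATE": ["2019-09-16"],
    "PRCP": [18.0],
    "EXTRA": ["x"],
})

# reindex keeps the listed columns in order and fills absent ones with NaN
aligned = prcp_only.reindex(columns=cols)
print(aligned)
```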


<a id='fill_berkeley'></a>
### 4e. fill in missing berkeley dates, temp, and precipitation with data from oakland and berkeley2

**take data from berkeley, berkeley2, and oakland datasets to cover all missing berkeley (Tilden) dates**

In [40]:
# berkeley
berk_test = berkeley_df
print(berkeley_df.shape)
berk_test.head(1)

(1523, 10)


Unnamed: 0,STATION,DATE,LATITUDE,LONGITUDE,NAME,PRCP,TMAX,TMIN,city,station_id
42680,USC00040693,2017-09-01,37.8744,-122.2605,"BERKELEY, CA US",0.0,328.0,161.0,berkeley,USC00040693


**berkeley2 dataset: PRCP 2019-09-16 to 2022-09-30**

In [41]:
berkeley2_df.head(1)

Unnamed: 0,STATION,DATE,LATITUDE,LONGITUDE,NAME,PRCP,TMAX,TMIN,city,station_id
0,US1CAAL0034,2019-09-16,37.877502,-122.281264,"BERKELEY 1.1 NE, CA US",18.0,,,berkeley2,US1CAAL0034


In [42]:
berk2_dates_list = list(berkeley2_df['DATE'])
berk_dates_list = list(berkeley_df['DATE'])
print('length berkeley:', len(berk_dates_list),'length berkeley2:', len(berk2_dates_list), 'difference:', (len(berk_dates_list)-len(berk2_dates_list)))
#print(len(berk_dates_list))
#print(len(berk2_dates_list)-len(berk_dates_list))

length berkeley: 1523 length berkeley2: 848 difference: 675


In [43]:
# get data for all berkeley2 dates that do not appear in the berkeley dataset; we will use PRCP for these dates
# .copy() makes the selection an independent dataframe, so the column assignment below does not raise SettingWithCopyWarning
berk2_missing_berk_dates = berkeley2_df[~berkeley2_df['DATE'].isin(berk_dates_list)].copy()
print(berk2_missing_berk_dates.shape)
berk2_missing_berk_dates['city']='berkeley2'
berk2_missing_berk_dates.head(3)

(332, 10)


Unnamed: 0,STATION,DATE,LATITUDE,LONGITUDE,NAME,PRCP,TMAX,TMIN,city,station_id
15,US1CAAL0034,2019-10-01,37.877502,-122.281264,"BERKELEY 1.1 NE, CA US",0.0,,,berkeley2,US1CAAL0034
16,US1CAAL0034,2019-10-02,37.877502,-122.281264,"BERKELEY 1.1 NE, CA US",0.0,,,berkeley2,US1CAAL0034
17,US1CAAL0034,2019-10-03,37.877502,-122.281264,"BERKELEY 1.1 NE, CA US",0.0,,,berkeley2,US1CAAL0034


In [44]:
#concatenate berkeley and berkeley 2 to add berkeley2 dates that are missing in berkeley
berkeley3_df = pd.concat([berkeley_df, berk2_missing_berk_dates])
print('berkeley:',berkeley_df.shape,'berk2_missing_berk_dates:', berk2_missing_berk_dates.shape, 'concatenated:', berkeley3_df.shape)


berkeley: (1523, 10) berk2_missing_berk_dates: (332, 10) concatenated: (1855, 10)


**oakland dataset : MINTEMP, MAXTEMP, PRCP 2017-09-01 to 2022-11-07**

In [45]:
# get oakland climate data for date range matching the berkeley date range
oak_test = oakland_df
print(oak_test.shape)
oak_test.head(1)

(1856, 10)


Unnamed: 0,STATION,DATE,LATITUDE,LONGITUDE,NAME,PRCP,TMAX,TMIN,city,station_id
19632,USW00023230,2017-09-01,37.7178,-122.23301,"OAKLAND INTERNATIONAL AIRPORT, CA US",0.0,383.0,156.0,oakland,USW00023230


In [46]:
#get list of dates in the oakland data
oak_dates_list = list(oakland_df['DATE'])

#get list of dates in the berkeley data
berk3_dates_list = list(berkeley3_df['DATE'])

#calculate number of missing dates in berkeley dataset that are covered by oakland dataset
print('oakland:',len(oak_dates_list),'berkeley3:', len(berk3_dates_list), 'difference:', (len(oak_dates_list)-len(berk3_dates_list)))



oakland: 1856 berkeley3: 1855 difference: 1


In [47]:
# get only oakland dates that do not appear in berkeley dataset, we will be using the daily temp columns and maybe prcp
oak_missing_berk_dates = oakland_df[~oakland_df['DATE'].isin(berk3_dates_list)]
print(oak_missing_berk_dates.shape)
oak_missing_berk_dates

(1, 10)


Unnamed: 0,STATION,DATE,LATITUDE,LONGITUDE,NAME,PRCP,TMAX,TMIN,city,station_id
21093,USW00023230,2021-09-01,37.7178,-122.23301,"OAKLAND INTERNATIONAL AIRPORT, CA US",0.0,217.0,150.0,oakland,USW00023230


In [48]:
## concatenate the oakland date that is not in berkeley3_df

In [49]:
berkeley4_df = pd.concat([berkeley3_df, oak_missing_berk_dates])
print('berkeley3:',berkeley3_df.shape,'oak_missing_berk_dates:', oak_missing_berk_dates.shape, 'concatenated:', berkeley4_df.shape)



berkeley3: (1855, 10) oak_missing_berk_dates: (1, 10) concatenated: (1856, 10)
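
The two concatenations above follow one pattern: take the primary station, then add rows from each backup station only for dates that are still missing. A minimal sketch of that pattern as a reusable function follows. The `fill_missing_dates` name and the three tiny dataframes are hypothetical.

```python
import pandas as pd

def fill_missing_dates(primary, *backups):
    """Append rows from each backup whose DATE is absent so far, in priority order."""
    combined = primary.copy()
    for backup in backups:
        new_rows = backup[~backup["DATE"].isin(combined["DATE"])]
        combined = pd.concat([combined, new_rows], ignore_index=True)
    # sort so the result is in date order regardless of which station supplied a row
    return combined.sort_values("DATE").reset_index(drop=True)

berk = pd.DataFrame({"DATE": ["2022-03-30", "2022-03-31"], "city": "berkeley"})
berk2 = pd.DataFrame({"DATE": ["2022-03-31", "2022-04-01"], "city": "berkeley2"})
oak = pd.DataFrame({"DATE": ["2022-03-30", "2022-04-01", "2022-04-02"], "city": "oakland"})

filled = fill_missing_dates(berk, berk2, oak)
print(filled)
```

Sorting at the end matters for later steps that depend on row order, such as forward or backward fills and cumulative sums.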


**check that berkeley4_df has all dates, look at missing values**

In [50]:
berkeley4_df.head(3)

Unnamed: 0,STATION,DATE,LATITUDE,LONGITUDE,NAME,PRCP,TMAX,TMIN,city,station_id
42680,USC00040693,2017-09-01,37.8744,-122.2605,"BERKELEY, CA US",0.0,328.0,161.0,berkeley,USC00040693
42681,USC00040693,2017-09-02,37.8744,-122.2605,"BERKELEY, CA US",0.0,406.0,233.0,berkeley,USC00040693
42682,USC00040693,2017-09-03,37.8744,-122.2605,"BERKELEY, CA US",0.0,406.0,211.0,berkeley,USC00040693


In [51]:
print(berkeley4_df['city'].unique())
print(berkeley4_df['STATION'].unique())
print('berkeley4 start date:',berkeley4_df['DATE'].min(), 'berkeley4 end date:',berkeley4_df['DATE'].max())

['berkeley' 'berkeley2' 'oakland']
['USC00040693' 'US1CAAL0034' 'USW00023230']
berkeley4 start date: 2017-09-01 berkeley4 end date: 2022-09-30


In [52]:
print(berkeley4_df.isna().sum())

STATION         0
DATE            0
LATITUDE        0
LONGITUDE       0
NAME            0
PRCP           46
TMAX          440
TMIN          440
city            0
station_id      0
dtype: int64


In [53]:
# make dataframe of rows with missing PRCP
berkeley4_missing_PRCP = berkeley4_df[berkeley4_df['PRCP'].isnull()]
berkeley4_missing_PRCP.head(3)

Unnamed: 0,STATION,DATE,LATITUDE,LONGITUDE,NAME,PRCP,TMAX,TMIN,city,station_id
44021,USC00040693,2021-10-01,37.8744,-122.2605,"BERKELEY, CA US",,311.0,133.0,berkeley,USC00040693
44022,USC00040693,2021-10-02,37.8744,-122.2605,"BERKELEY, CA US",,311.0,133.0,berkeley,USC00040693
44023,USC00040693,2021-10-03,37.8744,-122.2605,"BERKELEY, CA US",,322.0,139.0,berkeley,USC00040693


**check for available PRCP values in berkeley2_df to fill missing data**

In [54]:
# get list of dates missing PRCP in berkeley4_df
berkeley4_missing_PRCP_dates = list(berkeley4_missing_PRCP['DATE'])

# get list of dates in berkeley2_df
berkeley2_dates = list(berkeley2_df['DATE'])

In [55]:
# get berkeley2 rows for the dates where berkeley4_df is missing PRCP; we will use their PRCP values
berkeley2_missingPRCP_berkeley4 = berkeley2_df[berkeley2_df['DATE'].isin(berkeley4_missing_PRCP_dates)]
print(berkeley2_missingPRCP_berkeley4.shape)
berkeley2_missingPRCP_berkeley4.head()


(45, 10)


Unnamed: 0,STATION,DATE,LATITUDE,LONGITUDE,NAME,PRCP,TMAX,TMIN,city,station_id
484,US1CAAL0034,2021-10-01,37.877502,-122.281264,"BERKELEY 1.1 NE, CA US",0.0,,,berkeley2,US1CAAL0034
485,US1CAAL0034,2021-10-02,37.877502,-122.281264,"BERKELEY 1.1 NE, CA US",0.0,,,berkeley2,US1CAAL0034
486,US1CAAL0034,2021-10-03,37.877502,-122.281264,"BERKELEY 1.1 NE, CA US",0.0,,,berkeley2,US1CAAL0034
487,US1CAAL0034,2021-10-04,37.877502,-122.281264,"BERKELEY 1.1 NE, CA US",0.0,,,berkeley2,US1CAAL0034
488,US1CAAL0034,2021-10-05,37.877502,-122.281264,"BERKELEY 1.1 NE, CA US",0.0,,,berkeley2,US1CAAL0034


In [56]:
# get just PRCP and DATE column
berkeley2_missingPRCP_berkeley4 = berkeley2_missingPRCP_berkeley4[['DATE','PRCP']]
#berkeley2_missingPRCP_berkeley4

**add PRCP from berkeley2_df to rows missing PRCP in berkeley4_df**

In [57]:
# add PRCP from berkeley2
berkeley4_PRCP_berkeley2 = berkeley4_missing_PRCP.merge(berkeley2_missingPRCP_berkeley4, left_on='DATE', right_on='DATE')
berkeley4_PRCP_berkeley2 = berkeley4_PRCP_berkeley2[['STATION', 'DATE', 'LATITUDE', 'LONGITUDE', 'NAME', 'PRCP_y',
                                                     'TMAX', 'TMIN', 'city', 'station_id']]
berkeley4_PRCP_berkeley2 = berkeley4_PRCP_berkeley2.rename(columns={'PRCP_y': 'PRCP'}, errors="raise")
berkeley4_PRCP_berkeley2.head(3)

Unnamed: 0,STATION,DATE,LATITUDE,LONGITUDE,NAME,PRCP,TMAX,TMIN,city,station_id
0,USC00040693,2021-10-01,37.8744,-122.2605,"BERKELEY, CA US",0.0,311.0,133.0,berkeley,USC00040693
1,USC00040693,2021-10-02,37.8744,-122.2605,"BERKELEY, CA US",0.0,311.0,133.0,berkeley,USC00040693
2,USC00040693,2021-10-03,37.8744,-122.2605,"BERKELEY, CA US",0.0,322.0,139.0,berkeley,USC00040693


In [58]:
berkeley4_PRCP_berkeley2_dates = list(berkeley4_PRCP_berkeley2['DATE'])
print(len(berkeley4_PRCP_berkeley2_dates))

45


In [59]:
# drop berkeley4_df rows that will be replaced with PRCP from berkeley 2 (berkeley4_PRCP_berkeley2)
print(berkeley4_df.shape)
berkeley4_df = berkeley4_df[~berkeley4_df['DATE'].isin(berkeley4_PRCP_berkeley2_dates)]
print(berkeley4_df.shape)

# concatenate berkeley4_PRCP_berkeley2 which includes PRCP from berkeley2_df
berkeley4_df = pd.concat([berkeley4_df, berkeley4_PRCP_berkeley2])
print(berkeley4_df.shape)

(1856, 10)
(1811, 10)
(1856, 10)
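
The merge, drop, and concatenate sequence above replaces missing `PRCP` values with the value from another station on the same date. A sketch of the same idea with `fillna` and a date lookup is shown here; it leaves the row order unchanged. The `fill_from_station` name and the demo values are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_from_station(target, donor, columns):
    """Fill NaNs in `columns` of target with donor values from the same DATE."""
    out = target.copy()
    donor_by_date = donor.drop_duplicates("DATE").set_index("DATE")
    for col in columns:
        # map each target DATE to the donor value; dates absent from donor stay NaN
        out[col] = out[col].fillna(out["DATE"].map(donor_by_date[col]))
    return out

target = pd.DataFrame({
    "DATE": ["2021-10-01", "2021-10-02", "2021-11-07"],
    "PRCP": [np.nan, 3.0, np.nan],
})
donor = pd.DataFrame({"DATE": ["2021-10-01", "2021-10-02"], "PRCP": [0.0, 99.0]})

patched = fill_from_station(target, donor, ["PRCP"])
print(patched)
```

Existing values are never overwritten, and a second call with a different donor (for example an Oakland dataframe) would fill whatever is still missing.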


**replace missing PRCP with PRCP from oakland on that date**

In [60]:
# check missing data again, missing PRCP should be reduced
print(berkeley4_df.isna().sum())

STATION         0
DATE            0
LATITUDE        0
LONGITUDE       0
NAME            0
PRCP            1
TMAX          440
TMIN          440
city            0
station_id      0
dtype: int64


In [61]:
# make dataframe of rows with missing PRCP
berkeley4_missing_PRCP2 = berkeley4_df[berkeley4_df['PRCP'].isnull()]
berkeley4_missing_PRCP2.head(3)

Unnamed: 0,STATION,DATE,LATITUDE,LONGITUDE,NAME,PRCP,TMAX,TMIN,city,station_id
44058,USC00040693,2021-11-07,37.8744,-122.2605,"BERKELEY, CA US",,178.0,78.0,berkeley,USC00040693


**check for available PRCP values in oakland_df to fill missing data**

In [62]:
# get list of dates missing PRCP in berkeley4_df
berkeley4_missing_PRCP2_dates = list(berkeley4_missing_PRCP2['DATE'])

# get list of dates in oakland_df
oakland_dates = list(oakland_df['DATE'])

In [63]:
# get oakland rows for the dates where berkeley4_df is still missing PRCP
oakland_missingPRCP2_berkeley4 = oakland_df[oakland_df['DATE'].isin(berkeley4_missing_PRCP2_dates)]
print(oakland_missingPRCP2_berkeley4.shape)
oakland_missingPRCP2_berkeley4.head()

(1, 10)


Unnamed: 0,STATION,DATE,LATITUDE,LONGITUDE,NAME,PRCP,TMAX,TMIN,city,station_id
21160,USW00023230,2021-11-07,37.7178,-122.23301,"OAKLAND INTERNATIONAL AIRPORT, CA US",0.0,167.0,78.0,oakland,USW00023230


In [64]:
# get just PRCP and DATE column
oakland_missingPRCP2_berkeley4 = oakland_missingPRCP2_berkeley4[['DATE','PRCP']]
oakland_missingPRCP2_berkeley4

Unnamed: 0,DATE,PRCP
21160,2021-11-07,0.0


**add PRCP from oakland_df to rows missing PRCP in berkeley4_df**

In [65]:
# add PRCP from oakland
oakland_missingPRCP2_berkeley4 = berkeley4_missing_PRCP2.merge(oakland_missingPRCP2_berkeley4, left_on='DATE', right_on='DATE')
oakland_missingPRCP2_berkeley4 = oakland_missingPRCP2_berkeley4[['STATION', 'DATE', 'LATITUDE', 'LONGITUDE', 'NAME', 
                                                                 'PRCP_y','TMAX', 'TMIN', 'city', 'station_id']]
oakland_missingPRCP2_berkeley4 = oakland_missingPRCP2_berkeley4.rename(columns={'PRCP_y': 'PRCP'}, errors="raise")
oakland_missingPRCP2_berkeley4.head(3)

Unnamed: 0,STATION,DATE,LATITUDE,LONGITUDE,NAME,PRCP,TMAX,TMIN,city,station_id
0,USC00040693,2021-11-07,37.8744,-122.2605,"BERKELEY, CA US",0.0,178.0,78.0,berkeley,USC00040693


In [66]:
berkeley4_missing_PRCP2_dates = list(berkeley4_missing_PRCP2['DATE'])
print(len(berkeley4_missing_PRCP2_dates))

1


In [67]:
# drop berkeley4_df rows that will be replaced with PRCP from oakland (oakland_missingPRCP2_berkeley4)
print(berkeley4_df.shape)
berkeley4_df = berkeley4_df[~berkeley4_df['DATE'].isin(berkeley4_missing_PRCP2_dates)]
print(berkeley4_df.shape)

# concatenate oakland_missingPRCP2_berkeley4 which includes PRCP from oakland_df
berkeley4_df = pd.concat([berkeley4_df, oakland_missingPRCP2_berkeley4])
print(berkeley4_df.shape)

(1856, 10)
(1855, 10)
(1856, 10)


**deal with missing TEMP values in berkeley4_df**

In [68]:
# check missing data again, missing PRCP should now be zero
print(berkeley4_df.isna().sum())

STATION         0
DATE            0
LATITUDE        0
LONGITUDE       0
NAME            0
PRCP            0
TMAX          440
TMIN          440
city            0
station_id      0
dtype: int64


In [69]:
# make dataframe of rows with missing TMAX
berkeley4_missing_TEMP = berkeley4_df[berkeley4_df['TMAX'].isnull()]
berkeley4_missing_TEMP.head(3)

Unnamed: 0,STATION,DATE,LATITUDE,LONGITUDE,NAME,PRCP,TMAX,TMIN,city,station_id
42695,USC00040693,2017-09-16,37.8744,-122.2605,"BERKELEY, CA US",0.0,,,berkeley,USC00040693
42696,USC00040693,2017-09-17,37.8744,-122.2605,"BERKELEY, CA US",0.0,,,berkeley,USC00040693
42701,USC00040693,2017-09-22,37.8744,-122.2605,"BERKELEY, CA US",0.0,,,berkeley,USC00040693


**check for available TEMP values in oakland_df to fill missing data**

In [70]:
# get list of dates missing TMAX in berkeley4_df
berkeley4_missing_TEMP_dates = list(berkeley4_missing_TEMP['DATE'])

# get list of dates in oakland_df
oakland_dates = list(oakland_df['DATE'])

In [71]:
# get oakland rows for the dates where berkeley4_df is missing temperature; we will use their TMAX and TMIN values
oakland_missingTEMP_berkeley4 = oakland_df[oakland_df['DATE'].isin(berkeley4_missing_TEMP_dates)]
print(oakland_missingTEMP_berkeley4.shape)
oakland_missingTEMP_berkeley4.head()

(440, 10)


Unnamed: 0,STATION,DATE,LATITUDE,LONGITUDE,NAME,PRCP,TMAX,TMIN,city,station_id
19647,USW00023230,2017-09-16,37.7178,-122.23301,"OAKLAND INTERNATIONAL AIRPORT, CA US",0.0,217.0,117.0,oakland,USW00023230
19648,USW00023230,2017-09-17,37.7178,-122.23301,"OAKLAND INTERNATIONAL AIRPORT, CA US",0.0,211.0,139.0,oakland,USW00023230
19653,USW00023230,2017-09-22,37.7178,-122.23301,"OAKLAND INTERNATIONAL AIRPORT, CA US",0.0,200.0,111.0,oakland,USW00023230
19654,USW00023230,2017-09-23,37.7178,-122.23301,"OAKLAND INTERNATIONAL AIRPORT, CA US",0.0,256.0,94.0,oakland,USW00023230
19655,USW00023230,2017-09-24,37.7178,-122.23301,"OAKLAND INTERNATIONAL AIRPORT, CA US",0.0,250.0,94.0,oakland,USW00023230


In [72]:
# get just the DATE, TMAX, and TMIN columns (the variable name still says PRCP, but it holds temperature data)
oakland_missingPRCP_berkeley4 = oakland_missingTEMP_berkeley4[['DATE','TMAX','TMIN']]
#oakland_missingPRCP_berkeley4

**add TEMP from oakland_df to rows missing TMIN TMAX in berkeley4_df**

In [73]:
# add TMAX and TMIN from oakland
berkeley4_TEMP_oakland = berkeley4_missing_TEMP.merge(oakland_missingPRCP_berkeley4, left_on='DATE', right_on='DATE')
berkeley4_TEMP_oakland.head(3)

Unnamed: 0,STATION,DATE,LATITUDE,LONGITUDE,NAME,PRCP,TMAX_x,TMIN_x,city,station_id,TMAX_y,TMIN_y
0,USC00040693,2017-09-16,37.8744,-122.2605,"BERKELEY, CA US",0.0,,,berkeley,USC00040693,217.0,117.0
1,USC00040693,2017-09-17,37.8744,-122.2605,"BERKELEY, CA US",0.0,,,berkeley,USC00040693,211.0,139.0
2,USC00040693,2017-09-22,37.8744,-122.2605,"BERKELEY, CA US",0.0,,,berkeley,USC00040693,200.0,111.0


In [74]:
berkeley4_TEMP_oakland = berkeley4_TEMP_oakland[['STATION', 'DATE', 'LATITUDE', 'LONGITUDE', 'NAME', 'PRCP','TMAX_y',
                                                 'TMIN_y', 'city', 'station_id']]
berkeley4_TEMP_oakland = berkeley4_TEMP_oakland.rename(columns={'TMAX_y':'TMAX', 'TMIN_y':'TMIN'}, errors="raise")
berkeley4_TEMP_oakland.head(3)

Unnamed: 0,STATION,DATE,LATITUDE,LONGITUDE,NAME,PRCP,TMAX,TMIN,city,station_id
0,USC00040693,2017-09-16,37.8744,-122.2605,"BERKELEY, CA US",0.0,217.0,117.0,berkeley,USC00040693
1,USC00040693,2017-09-17,37.8744,-122.2605,"BERKELEY, CA US",0.0,211.0,139.0,berkeley,USC00040693
2,USC00040693,2017-09-22,37.8744,-122.2605,"BERKELEY, CA US",0.0,200.0,111.0,berkeley,USC00040693


In [75]:
berkeley4_TEMP_oakland_dates = list(berkeley4_TEMP_oakland['DATE'])
print(len(berkeley4_TEMP_oakland_dates))

440


In [76]:
# drop berkeley4_df rows that will be replaced with TMAX and TMIN from oakland (berkeley4_TEMP_oakland)
print(berkeley4_df.shape)
berkeley4_df = berkeley4_df[~berkeley4_df['DATE'].isin(berkeley4_TEMP_oakland_dates)]
print(berkeley4_df.shape)

# concatenate berkeley4_TEMP_oakland which includes TMAX and TMIN from oakland_df
berkeley4_df = pd.concat([berkeley4_df, berkeley4_TEMP_oakland])
print(berkeley4_df.shape)

(1856, 10)
(1416, 10)
(1856, 10)


In [77]:
# check missing data again, all columns should now have zero missing values
print(berkeley4_df.isna().sum())

STATION       0
DATE          0
LATITUDE      0
LONGITUDE     0
NAME          0
PRCP          0
TMAX          0
TMIN          0
city          0
station_id    0
dtype: int64


**change city column to read all 'berkeley'**

In [78]:
print(berkeley4_df.city.unique())
berkeley4_df['city'] = 'berkeley'
print(berkeley4_df.city.unique())

['berkeley' 'oakland' 'berkeley2']
['berkeley']


<a id='format_dates_climate'></a>
### 4f. concatenate individual station dataframes and clean up dates

In [79]:
#concatenate the dataframes
climate_data = pd.concat([berkeley4_df, concord_df, hayward_df, livermore_df, mtdiablo_df, 
                          mthamilton_df, oakland_df, sanjose_df])

climate_data.shape
#climate_data

In [80]:
#loop through stations and output number of observations, first observation date, last observation date

city_names = list(climate_data['city'].unique())


for c in city_names:
    df_temp = climate_data[climate_data['city']==c]
    num_obs = len(df_temp)
    print('Station name: '+ str(c) +', '+'Num. of Obs = '+ str(num_obs))
    print(df_temp['DATE'].min())
    print(df_temp['DATE'].max())

Station name: berkeley, Num. of Obs = 1856
2017-09-01
2022-09-30
Station name: concord, Num. of Obs = 1856
2017-09-01
2022-09-30
Station name: hayward, Num. of Obs = 1851
2017-09-01
2022-09-30
Station name: livermore, Num. of Obs = 1856
2017-09-01
2022-09-30
Station name: mtdiablo, Num. of Obs = 1856
2017-09-01
2022-09-30
Station name: mthamilton, Num. of Obs = 1856
2017-09-01
2022-09-30
Station name: oakland, Num. of Obs = 1856
2017-09-01
2022-09-30
Station name: sanjose, Num. of Obs = 1856
2017-09-01
2022-09-30


In [81]:
climate_data['year_cl'] = climate_data['DATE'].astype(str).str[:4]
climate_data['month_cl'] = climate_data['DATE'].astype(str).str[5:7]
climate_data['day_cl'] = climate_data['DATE'].astype(str).str[8:10]

In [82]:
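#restrict study period: keep rows from calendar year 2017 onward, through Sep 30, 2022
#note: filtering on year keeps the Sep 2017 rows (earliest date stays 2017-09-01); use DATE >= '2017-10-01' to start at Oct 01, 2017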
climate_data = climate_data[(climate_data['year_cl']>= '2017') & (climate_data['DATE'] < '2022-10-01')]

print(climate_data.shape)

(14843, 13)


In [83]:
climate_data.head(3)

Unnamed: 0,STATION,DATE,LATITUDE,LONGITUDE,NAME,PRCP,TMAX,TMIN,city,station_id,year_cl,month_cl,day_cl
42680,USC00040693,2017-09-01,37.8744,-122.2605,"BERKELEY, CA US",0.0,328.0,161.0,berkeley,USC00040693,2017,9,1
42681,USC00040693,2017-09-02,37.8744,-122.2605,"BERKELEY, CA US",0.0,406.0,233.0,berkeley,USC00040693,2017,9,2
42682,USC00040693,2017-09-03,37.8744,-122.2605,"BERKELEY, CA US",0.0,406.0,211.0,berkeley,USC00040693,2017,9,3


In [84]:
print(climate_data['DATE'].min())
print(climate_data['DATE'].max())

#cities = climate_data['city'].unique()
#cities

#check date ranges all match
#for c in cities:
#    print(c, ': ', climate_data[climate_data['city']== c]['DATE'].min())
#    print(c, ': ', climate_data[climate_data['city']== c]['DATE'].max())



2017-09-01
2022-09-30
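
The study period is described in terms of California water years, which run from Oct 01 through Sep 30 and are named for the calendar year in which they end. A small sketch of how a water-year label could be derived from `DATE` is given here. The `water_year` function is a hypothetical helper and is not part of the notebook.

```python
import pandas as pd

def water_year(dates):
    """Label each date with its water year (Oct 01 through Sep 30, named by end year)."""
    dt = pd.to_datetime(dates)
    # October, November, and December belong to the next calendar year's water year
    return dt.dt.year + (dt.dt.month >= 10).astype(int)

demo = pd.Series(["2017-09-30", "2017-10-01", "2018-09-30", "2022-09-30"])
wy = water_year(demo)
print(wy.tolist())
```

Under this labelling, dates in September 2017 fall in water year 2017, which is before the first of the five water years in the study.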


<a id='prec_temp_units'></a>
### 4g. get precipitation and temperature in correct units

In [85]:
#climate_data.info()

In [86]:
#convert precipitation to inches
climate_data['precipitation'] = (climate_data['PRCP']/25.4)/10

#convert temperature to F
climate_data['min'] = ((climate_data['TMIN']/10) * 1.8) + 32
climate_data['max'] = ((climate_data['TMAX']/10) * 1.8) + 32

In [87]:
#create plain dates column
climate_data['plain_dates'] = (climate_data['year_cl'] + climate_data['month_cl'] + climate_data['day_cl'])


In [88]:
climate_data.head(5)


Unnamed: 0,STATION,DATE,LATITUDE,LONGITUDE,NAME,PRCP,TMAX,TMIN,city,station_id,year_cl,month_cl,day_cl,precipitation,min,max,plain_dates
42680,USC00040693,2017-09-01,37.8744,-122.2605,"BERKELEY, CA US",0.0,328.0,161.0,berkeley,USC00040693,2017,9,1,0.0,60.98,91.04,20170901
42681,USC00040693,2017-09-02,37.8744,-122.2605,"BERKELEY, CA US",0.0,406.0,233.0,berkeley,USC00040693,2017,9,2,0.0,73.94,105.08,20170902
42682,USC00040693,2017-09-03,37.8744,-122.2605,"BERKELEY, CA US",0.0,406.0,211.0,berkeley,USC00040693,2017,9,3,0.0,69.98,105.08,20170903
42683,USC00040693,2017-09-04,37.8744,-122.2605,"BERKELEY, CA US",0.0,306.0,172.0,berkeley,USC00040693,2017,9,4,0.0,62.96,87.08,20170904
42684,USC00040693,2017-09-05,37.8744,-122.2605,"BERKELEY, CA US",0.0,256.0,183.0,berkeley,USC00040693,2017,9,5,0.0,64.94,78.08,20170905
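
The conversions in step 4g treat `PRCP` as tenths of a millimeter and `TMIN`/`TMAX` as tenths of a degree Celsius. As a worked check, the same arithmetic is written below as two small functions (the function names are hypothetical) and applied to the first Berkeley row shown above.

```python
def tenths_mm_to_inches(prcp):
    """Tenths of a millimeter -> inches (10 tenths per mm, 25.4 mm per inch)."""
    return prcp / 10 / 25.4

def tenths_c_to_f(temp):
    """Tenths of a degree Celsius -> degrees Fahrenheit."""
    return temp / 10 * 1.8 + 32

# First Berkeley row: TMAX = 328.0, TMIN = 161.0, PRCP = 0.0
print(round(tenths_c_to_f(328.0), 2))   # maxTemp in F
print(round(tenths_c_to_f(161.0), 2))   # minTemp in F
print(tenths_mm_to_inches(0.0))         # precipitation in inches
```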


<a id='missing_climate'></a>
### 4h. treat missing data
<br>Handle missing and inaccurate precipitation data and add daily precipitation.
<br>Fill days with trace ('T') precipitation values (read in as NaNs) with zero.
<br>Handle trace precipitation NaNs.

In [89]:
climate_data.columns

**reduce and rename columns:**

In [90]:
#drop unwanted columns, keep date, year, month, day, precipitation in inches, min daily temp, max daily temp
#station id number, and city
climate_data = climate_data[['plain_dates','year_cl', 'month_cl', 'day_cl', 
                             'min', 'max', 'precipitation', 
                             'city', 'station_id']]


In [91]:
#rename temp and precipitation cols
climate_data = climate_data.rename(columns = {'min':'minTemp', 'max':'maxTemp', 'precipitation':'daily_prec'})

climate_data.head()

Unnamed: 0,plain_dates,year_cl,month_cl,day_cl,minTemp,maxTemp,daily_prec,city,station_id
42680,20170901,2017,9,1,60.98,91.04,0.0,berkeley,USC00040693
42681,20170902,2017,9,2,73.94,105.08,0.0,berkeley,USC00040693
42682,20170903,2017,9,3,69.98,105.08,0.0,berkeley,USC00040693
42683,20170904,2017,9,4,62.96,87.08,0.0,berkeley,USC00040693
42684,20170905,2017,9,5,64.94,78.08,0.0,berkeley,USC00040693


In [92]:
climate_data.describe()

Unnamed: 0,minTemp,maxTemp,daily_prec
count,14816.0,14811.0,14836.0
mean,50.371335,70.135463,0.040257
std,9.205963,12.500417,0.195702
min,19.94,28.94,0.0
25%,44.06,60.98,0.0
50%,51.08,69.08,0.0
75%,57.02,78.08,0.0
max,84.92,113.0,6.03937


**check for missing values**

In [93]:
print(climate_data.isna().sum())
print(climate_data.info())

plain_dates     0
year_cl         0
month_cl        0
day_cl          0
minTemp        27
maxTemp        32
daily_prec      7
city            0
station_id      0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
Int64Index: 14843 entries, 42680 to 8852
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   plain_dates  14843 non-null  object 
 1   year_cl      14843 non-null  object 
 2   month_cl     14843 non-null  object 
 3   day_cl       14843 non-null  object 
 4   minTemp      14816 non-null  float64
 5   maxTemp      14811 non-null  float64
 6   daily_prec   14836 non-null  float64
 7   city         14843 non-null  object 
 8   station_id   14843 non-null  object 
dtypes: float64(3), object(6)
memory usage: 1.1+ MB
None
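
The missing-value counts above are easier to judge as a share of all rows. A minimal sketch, using a made-up dataframe and a hypothetical `percent_missing` helper: the mean of the boolean `isna()` mask gives the fraction missing per column.

```python
import numpy as np
import pandas as pd

def percent_missing(df):
    """Percentage of missing values in each column."""
    return df.isna().mean() * 100

demo = pd.DataFrame({
    "minTemp": [50.0, np.nan, 52.0, 51.0],
    "daily_prec": [0.0, 0.0, 0.0, 0.0],
})
pct = percent_missing(demo)
print(pct)
```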


In [94]:
#print the percentage of the total data that is missing for each col
#for i in na_count:
#    print(i,"is",((i/(climate_data.shape[0])*100),"percent of total observations"))

#look at missing temp data
climate_data_na_t = climate_data[climate_data['minTemp'].isna()]
climate_data_na_t.head(50)

#export data frame of rows with missing values
climate_data_na_t.to_csv('/Users/sandidge/Desktop/climate_missing_data.csv')


**address missing temp values**

In [95]:

print(climate_data.isna().sum())

#impute missing values in 'maxTemp' as the average of the nearest earlier and later values
#note: the fill follows row order across the whole dataframe and is not grouped by city
climate_data['maxTemp']=climate_data['maxTemp'].where(climate_data['maxTemp'].notnull(), 
                                          other=(climate_data['maxTemp'].ffill()
                                                 +climate_data['maxTemp'].bfill())/2)

#impute missing values in 'minTemp' as the average of the nearest earlier and later values
climate_data['minTemp']=climate_data['minTemp'].where(climate_data['minTemp'].notnull(), 
                                          other=(climate_data['minTemp'].ffill()
                                                 +climate_data['minTemp'].bfill())/2)

print(climate_data.isna().sum())

#drop remaining missing data 
climate_data = climate_data.dropna()

na_count = climate_data.isna().sum()
print(na_count)

print(climate_data.shape)

climate_data.head()

plain_dates     0
year_cl         0
month_cl        0
day_cl          0
minTemp        27
maxTemp        32
daily_prec      7
city            0
station_id      0
dtype: int64
plain_dates    0
year_cl        0
month_cl       0
day_cl         0
minTemp        0
maxTemp        0
daily_prec     7
city           0
station_id     0
dtype: int64
plain_dates    0
year_cl        0
month_cl       0
day_cl         0
minTemp        0
maxTemp        0
daily_prec     0
city           0
station_id     0
dtype: int64
(14836, 9)


Unnamed: 0,plain_dates,year_cl,month_cl,day_cl,minTemp,maxTemp,daily_prec,city,station_id
42680,20170901,2017,9,1,60.98,91.04,0.0,berkeley,USC00040693
42681,20170902,2017,9,2,73.94,105.08,0.0,berkeley,USC00040693
42682,20170903,2017,9,3,69.98,105.08,0.0,berkeley,USC00040693
42683,20170904,2017,9,4,62.96,87.08,0.0,berkeley,USC00040693
42684,20170905,2017,9,5,64.94,78.08,0.0,berkeley,USC00040693
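
The imputation above runs over the whole frame, in which the station records appear to be stacked one after another. As a hedged variation, the sketch below applies the same neighbor averaging within each station, so a gap at the edge of one station's record cannot borrow a reading from a different station. The toy values are made up and `neighbor_mean` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy data (made up): two stations, each with one missing maxTemp value.
toy = pd.DataFrame({
    'city':    ['berkeley'] * 3 + ['concord'] * 3,
    'maxTemp': [70.0, np.nan, 74.0, 90.0, np.nan, 100.0],
})

def neighbor_mean(s):
    """Fill each gap with the mean of the nearest earlier and later readings."""
    return s.where(s.notna(), (s.ffill() + s.bfill()) / 2)

# Group by station so the fill never crosses from one station into the next.
toy['maxTemp'] = toy.groupby('city')['maxTemp'].transform(neighbor_mean)
print(toy)
```

A gap at the very start or end of a station's record has no neighbor on one side and stays missing under this approach, so a later `dropna()` would still remove it.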


<a id='cumulative_precipitation'></a>
### 4i. add cumulative precipitation

In [96]:
#create cumulative precipitation column

cities = list(climate_data['city'].unique())
years = list(climate_data['year_cl'].unique())
precip_cum = []
pre_cum = pd.DataFrame()


In [97]:
for city in cities:
    city_df = climate_data[climate_data['city']==city]
    #print(city)
    
    for year in years:
        city_year_df = city_df[city_df['year_cl']==year]
        prec_cumulative = city_year_df['daily_prec'].cumsum()
        #print(prec_cumulative)
        precip_cum.extend(list(prec_cumulative))
        
#precip_cum
print(len(precip_cum))

14836


In [98]:
climate_data['prec_cum'] = precip_cum
climate_data

Unnamed: 0,plain_dates,year_cl,month_cl,day_cl,minTemp,maxTemp,daily_prec,city,station_id,prec_cum
42680,20170901,2017,09,01,60.98,91.04,0.0,berkeley,USC00040693,0.000000
42681,20170902,2017,09,02,73.94,105.08,0.0,berkeley,USC00040693,0.000000
42682,20170903,2017,09,03,69.98,105.08,0.0,berkeley,USC00040693,0.000000
42683,20170904,2017,09,04,62.96,87.08,0.0,berkeley,USC00040693,0.000000
42684,20170905,2017,09,05,64.94,78.08,0.0,berkeley,USC00040693,0.000000
...,...,...,...,...,...,...,...,...,...,...
8848,20220926,2022,09,26,59.00,75.02,0.0,sanjose,USW00023293,1.181102
8849,20220927,2022,09,27,57.92,73.04,0.0,sanjose,USW00023293,1.181102
8850,20220928,2022,09,28,57.02,78.98,0.0,sanjose,USW00023293,1.181102
8851,20220929,2022,09,29,57.02,82.94,0.0,sanjose,USW00023293,1.181102
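
The loop above builds one long list in city-then-year order, which lines up with `climate_data` only while its rows are sorted the same way. A minimal alternative sketch, assuming the same `city`, `year_cl`, and `daily_prec` columns, lets pandas compute the running total per group and align it to the original rows. The toy values are made up.

```python
import pandas as pd

# Toy data (made up): two stations, with a year change inside the first one.
toy = pd.DataFrame({
    'city':       ['berkeley', 'berkeley', 'berkeley', 'sanjose', 'sanjose'],
    'year_cl':    [2017, 2017, 2018, 2017, 2017],
    'daily_prec': [0.25, 0.5, 1.0, 0.0, 0.75],
})

# Running total within each (city, year) group; the result keeps the
# original row order, so it can be assigned straight back as a column.
toy['prec_cum'] = toy.groupby(['city', 'year_cl'])['daily_prec'].cumsum()
print(toy)
```

Because the result is aligned by row rather than by list position, the total restarts correctly even if the rows are later reordered.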


### export climate data

In [99]:
#export data
#timestamp
today = date.today()
now = datetime.now()
current_time = now.strftime("%H:%M:%S")
print("most recent export:",today, ",", current_time)

climate_data.to_csv('/Users/sandidge/Desktop/Python_Projects/Springboard_coursework/Capstone2_Wildflowers/Public_Final/climate_GHCN_data.csv')
#climate_data.to_csv('YOUR FILE PATH')


most recent export: 2022-11-14 , 07:59:21


<a id='daylength'></a>
# 5. Get sunrise and sunset data from Skyfield, calculate day length
[Link to top](#guide)


In [100]:
! pip install skyfield



### lat, long coordinates for parks
<br>Anthony Chabot = (37.766, -122.119)
<br>Briones Regional Park= (37.935804, -122.137413)
<br>Garin Regional Park = (37.63544, -122.02068)
<br>Joseph D Grant County Park = (37.345495, -121.68717)
<br>Pleasanton Ridge Regional Park = (37.615409, -121.88456)
<br>Sunol Regional Park = (37.510183, -121.82855)
<br>Tilden Regional Park = (37.894647, -122.241635)
<br>Mt. Diablo = (37.8792, -121.9303)


<a id='skyfield_settings'></a>
### 5a. tune settings for the API call

In [101]:
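from skyfield import api, almanac  #skyfield modules used in this section

ts = api.load.timescale()    #timescale object used to build time values
eph = api.load('de421.bsp')  #JPL DE421 ephemeris; downloaded on first use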

#print(ts)
#print(eph)

In [102]:
#create a dict of lat, long coords to loop over later

park_coords ={'Tilden Regional Park': (37.894647, -122.241635),
    'Briones Regional Park': (37.935804, -122.137413),
    'Sunol Regional Wilderness': (37.510183, -121.82855),
    'Mt Diablo State Park': (37.8792, -121.9303 ),
    'Garin Regional Park': (37.63544, -122.02068),
    'Pleasanton Ridge Regional Park': (37.615409, -121.88456),
    'Anthony Chabot Regional Park': (37.766, -122.119),
    'Joseph D Grant County Park': (37.345495, -121.68717)
        }

#unpack dictionary as lists of parks and coordinate tuples
parks, coordinates = [list(x) for x in zip(*park_coords.items())]
print(parks)
print(coordinates)

['Tilden Regional Park', 'Briones Regional Park', 'Sunol Regional Wilderness', 'Mt Diablo State Park', 'Garin Regional Park', 'Pleasanton Ridge Regional Park', 'Anthony Chabot Regional Park', 'Joseph D Grant County Park']
[(37.894647, -122.241635), (37.935804, -122.137413), (37.510183, -121.82855), (37.8792, -121.9303), (37.63544, -122.02068), (37.615409, -121.88456), (37.766, -122.119), (37.345495, -121.68717)]


In [103]:
#use today's date as the maximum date for data retrieval
### if error, try running necessary libraries cell at top ###
today = date.today()
print(today)

2022-11-14


<a id='skyfield_api'></a>
### 5b. make API call using lat, long coordinates

In [104]:
#full set of dates, remember to use date one day earlier and later than desired range
park_name = []
df_sun = pd.DataFrame(columns = ['timescale', 'sun', 'datetime','park_name'])

t0 = ts.utc(2017, 8, 31, 4) #off by 7 hours
#t0 = ts.utc(2017, 9, 30, 4) #start 30 days prior to date needed
t1 = ts.utc(2022, 10,1, 4)
#t1 = ts.now() #off by 7 hours

for c in coordinates:
    #print(c)
    print(c[0],c[1]) #print lat long of park as first line in output
    
    #get data by using lat long for each park with the Skyfield api
    t, y = almanac.find_discrete(t0, t1, almanac.sunrise_sunset(eph, api.wgs84.latlon(c[0], c[1])))
    t_utc = t.utc_iso() #convert t (time) to utc_iso
    list_y = list(y)    #convert y (sun state 0,1) to a list for zipping
    
    #create a dataframe: zip utc dates with 0, 1 sun state (sunrise, sunset)
    df = pd.DataFrame(zip(t_utc, list_y))
    

    if c[0] == coordinates[0][0]:
        #park_name.append('Tilden')
        park_NAME = 'Tilden'
    elif c[0] == coordinates[1][0]:
        #park_name.append('Briones')
        park_NAME = 'Briones'
    elif c[0] == coordinates[2][0]:
        #park_name.append('Sunol')
        park_NAME = 'Sunol'       
    elif c[0] == coordinates[3][0]:
        #park_name.append('MtDiablo')
        park_NAME = 'Mt Diablo StatePark'
    elif c[0] == coordinates[4][0]:
        #park_name.append('Garin')
        park_NAME = 'Garin Regional Park'
    elif c[0] == coordinates[5][0]:
        #park_name.append('PRidge')
        park_NAME = 'Pleasanton Ridge Regional Park'      
    elif c[0] == coordinates[6][0]:
        #park_name.append('AChabot')
        park_NAME = 'Anthony Chabot Regional Park'
    elif c[0] == coordinates[7][0]:
        #park_name.append('JDGrant')
        park_NAME = 'Joseph D Grant County Park'
    
    else:
        print('Park Name not matching')
   
        
    
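    # create an array of datetime objects
    # (named sun_datetimes so it does not overwrite the imported datetime class)
    sun_datetimes = t.utc_datetime() #creates numpy array of datetime values
    #print(sun_datetimes[0],len(sun_datetimes), type(sun_datetimes))
    
    #create a datetime column
    df['datetime'] = sun_datetimes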
    
    #create a list of len of zipped df and fill with the park name value
    name_list = [park_NAME] * (len(df))
    print(park_NAME, len(name_list), type(name_list))
    
    #create 'park_name' column
    df['park_name'] = name_list
    df.columns = ['timescale', 'sun', 'datetime','park_name']   
    
    #concatenate the df to the full df_sun
    df_sun = pd.concat([df_sun, df])

#print('')
#print("park df shape is:",df.shape)
#print(df.head(3))
print('')
print("Full df shape is:",df_sun.shape)
print(df_sun.head(3))


37.894647 -122.241635
Tilden 3714 <class 'list'>
37.935804 -122.137413
Briones 3714 <class 'list'>
37.510183 -121.82855
Sunol 3714 <class 'list'>
37.8792 -121.9303
Mt Diablo StatePark 3714 <class 'list'>
37.63544 -122.02068
Garin Regional Park 3714 <class 'list'>
37.615409 -121.88456
Pleasanton Ridge Regional Park 3714 <class 'list'>
37.766 -122.119
Anthony Chabot Regional Park 3714 <class 'list'>
37.345495 -121.68717
Joseph D Grant County Park 3714 <class 'list'>

Full df shape is: (29712, 4)
              timescale sun                         datetime park_name
0  2017-08-31T13:38:22Z   1 2017-08-31 13:38:22.131670+00:00    Tilden
1  2017-09-01T02:39:13Z   0 2017-09-01 02:39:12.560110+00:00    Tilden
2  2017-09-01T13:39:13Z   1 2017-09-01 13:39:13.112645+00:00    Tilden
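
The `if/elif` chain above recovers each park's label by comparing latitudes. A possible alternative, sketched here on stand-in data, is to iterate over `park_coords.items()` so that the name travels with its coordinates, and to collect the per-park frames in a list and concatenate them once. The `events` frame is a made-up placeholder for the Skyfield result, and this version would label rows with the full dictionary keys rather than the short labels ('Tilden', 'Briones', 'Sunol') that later cells expect.

```python
import pandas as pd

# Two of the parks from park_coords, for brevity.
park_coords = {
    'Tilden Regional Park': (37.894647, -122.241635),
    'Briones Regional Park': (37.935804, -122.137413),
}

frames = []
for park, (lat, lon) in park_coords.items():
    # Placeholder for the Skyfield call: two made-up events per park.
    events = pd.DataFrame({'sun': [1, 0], 'lat': lat, 'lon': lon})
    events['park_name'] = park
    frames.append(events)

# One concat at the end; ignore_index gives every row a unique label.
df_all = pd.concat(frames, ignore_index=True)
print(df_all)
```

With `ignore_index=True` the row labels are unique. In the frame built above, each park's rows restart at 0, which is why `df_sun['datetime'][0]` returns a Series rather than a single value in 5c.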


<a id='skyfield_cleaning'></a>
### 5c. convert dates, extract desired elements, and format data frame

In [105]:
print(type(df_sun['timescale']))
print(type(df_sun['datetime']))
print(type(df_sun['datetime'][0]))

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


In [106]:
#convert the column to a timezone-aware (UTC) datetime dtype
df_sun['datetime'] = pd.to_datetime(df_sun['datetime'], utc=True)
df_sun.head()
#df_sun.info()

Unnamed: 0,timescale,sun,datetime,park_name
0,2017-08-31T13:38:22Z,1,2017-08-31 13:38:22.131670+00:00,Tilden
1,2017-09-01T02:39:13Z,0,2017-09-01 02:39:12.560110+00:00,Tilden
2,2017-09-01T13:39:13Z,1,2017-09-01 13:39:13.112645+00:00,Tilden
3,2017-09-02T02:37:43Z,0,2017-09-02 02:37:43.323981+00:00,Tilden
4,2017-09-02T13:40:04Z,1,2017-09-02 13:40:04.021119+00:00,Tilden


In [107]:
#create separate columns extracting year, month, day, hour, minute
df_sun['year'], df_sun['month'], df_sun['day'], df_sun['hour'], df_sun['minute'] =\
df_sun['datetime'].dt.year,\
df_sun['datetime'].dt.month,\
df_sun['datetime'].dt.day,\
df_sun['datetime'].dt.hour,\
df_sun['datetime'].dt.minute

In [108]:
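#convert the UTC hour to local time (UTC minus 7 hours) by remapping the hour values
#the datetime itself is not shifted, because subtracting 7 hours would change the date of each sunset
# and swap which sun position state (sunrise or sunset) falls on a given date
#note: this results in a time that is about a minute off
# here: 0 = sunset; 1 = sunrise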
df_sun['hour_sub7'] = df_sun['hour'].replace({12:5, 0:17, 1:18, 2:19, 3:20, 4:21, 13:6, 14:7, 15:8})
print(df_sun[['datetime','sun','month','day','hour','minute','hour_sub7']].head(15))

                           datetime sun  month  day  hour  minute  hour_sub7
0  2017-08-31 13:38:22.131670+00:00   1      8   31    13      38          6
1  2017-09-01 02:39:12.560110+00:00   0      9    1     2      39         19
2  2017-09-01 13:39:13.112645+00:00   1      9    1    13      39          6
3  2017-09-02 02:37:43.323981+00:00   0      9    2     2      37         19
4  2017-09-02 13:40:04.021119+00:00   1      9    2    13      40          6
5  2017-09-03 02:36:13.573593+00:00   0      9    3     2      36         19
6  2017-09-03 13:40:54.864296+00:00   1      9    3    13      40          6
7  2017-09-04 02:34:43.340126+00:00   0      9    4     2      34         19
8  2017-09-04 13:41:45.650743+00:00   1      9    4    13      41          6
9  2017-09-05 02:33:12.655283+00:00   0      9    5     2      33         19
10 2017-09-05 13:42:36.390520+00:00   1      9    5    13      42          6
11 2017-09-06 02:31:41.550888+00:00   0      9    6     2      31         19
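
The hour remapping above assumes a constant 7-hour offset, which matches Pacific Daylight Time. Pacific Standard Time is 8 hours behind UTC, so a time-zone-aware conversion would give winter hours that differ by one. A minimal sketch using the standard-library `zoneinfo` module (Python 3.9+), with `America/Los_Angeles` assumed as the parks' time zone and a made-up winter instant:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library, Python 3.9+

pacific = ZoneInfo('America/Los_Angeles')  # assumed zone for all eight parks

# First sunrise in the output above (Tilden), truncated to whole seconds.
summer_rise_utc = datetime(2017, 8, 31, 13, 38, 22, tzinfo=timezone.utc)
summer_local = summer_rise_utc.astimezone(pacific)
print(summer_local.hour, summer_local.utcoffset())

# Made-up winter instant, to show the offset during standard time.
winter_utc = datetime(2018, 1, 15, 15, 20, tzinfo=timezone.utc)
winter_local = winter_utc.astimezone(pacific)
print(winter_local.hour, winter_local.utcoffset())
```

Converting this way also moves each sunset onto its local calendar date, so the date columns would need to be derived after the conversion rather than before it.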

**Format date to YYYYMMDD**
<br>This format will match the iNat and climate data sets for merging

In [109]:
#zero-pad month and day values with str.zfill
#single-digit days and months will not fit the 20220603 format, so add a leading zero to single digit date elements
z_months = []
# add 0 to one-digit months
for m in df_sun['month']:
    #print(m, type(m)) # m <class 'int'>
    i = str(m)
    ddigit = i.zfill(2)
    #df.replace(0, -1)
    #m.replace(int(ddigit))
    z_months.append(ddigit)
    
#print(z_months)
df_sun['Zmonth'] = z_months

z_days = []
# add 0 to one-digit days
for d in df_sun['day']:
    #print(d, type(d)) # d <class 'int'>
    j = str(d)
    ddigit = j.zfill(2)
    z_days.append(ddigit)
    
df_sun['Zday'] = z_days
#df_sun.head()
print("df_sun shape is:",df_sun.shape)



df_sun shape is: (29712, 12)


**add plain dates**

In [110]:
#combine year, month, day columns to get YYYYMMDD format 'plain_dates' that matches date format in other tables

plain_date = [] #empty list to hold returned values

#convert the columns to lists once, before the loop; rebuilding them on every iteration is very slow
list_year = list(df_sun['year'])
list_month = list(df_sun['Zmonth'])
list_day = list(df_sun['Zday'])

# for every entry in df_sun: combine year, month, day into a date format 20220605 = YYYYMMDD
for r in range(len(df_sun)):
    date_int = int(str(list_year[r])+str(list_month[r])+str(list_day[r]))
    plain_date.append(date_int)   #append date to the plain_date list
    
    
#print(len(plain_date), type(plain_date))

#add 'plain_date' column to df_sun dataframe
df_sun['plain_dates'] = plain_date
print(df_sun['plain_dates'].max())
df_sun.head()

20221001


Unnamed: 0,timescale,sun,datetime,park_name,year,month,day,hour,minute,hour_sub7,Zmonth,Zday,plain_dates
0,2017-08-31T13:38:22Z,1,2017-08-31 13:38:22.131670+00:00,Tilden,2017,8,31,13,38,6,8,31,20170831
1,2017-09-01T02:39:13Z,0,2017-09-01 02:39:12.560110+00:00,Tilden,2017,9,1,2,39,19,9,1,20170901
2,2017-09-01T13:39:13Z,1,2017-09-01 13:39:13.112645+00:00,Tilden,2017,9,1,13,39,6,9,1,20170901
3,2017-09-02T02:37:43Z,0,2017-09-02 02:37:43.323981+00:00,Tilden,2017,9,2,2,37,19,9,2,20170902
4,2017-09-02T13:40:04Z,1,2017-09-02 13:40:04.021119+00:00,Tilden,2017,9,2,13,40,6,9,2,20170902
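
The zero-padding and concatenation steps above can also be written as a single vectorized expression. The sketch below assumes a timezone-aware `datetime` column like the one in `df_sun`; the toy timestamps are illustrative.

```python
import pandas as pd

# Toy frame with a UTC datetime column (illustrative timestamps).
toy = pd.DataFrame({
    'datetime': pd.to_datetime(
        ['2017-08-31 13:38:22', '2017-09-01 02:39:12', '2022-10-01 02:50:00'],
        utc=True),
})

# strftime zero-pads month and day, so no Zmonth/Zday helper columns are needed.
toy['plain_dates'] = toy['datetime'].dt.strftime('%Y%m%d').astype(int)
print(toy)
```

This yields the same integer YYYYMMDD values without a Python-level loop.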


**format data frame**

In [111]:
#use .pivot() to index on 'plain_dates' and create a single entry for each day 
#that holds both sunrise and sunset data.
df2=df_sun.pivot(index=['plain_dates', 'park_name'], columns=['sun'])
df2.head(3)

plain_dates,park_name,timescale (sun=0),timescale (sun=1),datetime (sun=0),datetime (sun=1),year (sun=0),year (sun=1),month (sun=0),month (sun=1),day (sun=0),day (sun=1),hour (sun=0),hour (sun=1),minute (sun=0),minute (sun=1),hour_sub7 (sun=0),hour_sub7 (sun=1),Zmonth (sun=0),Zmonth (sun=1),Zday (sun=0),Zday (sun=1)
20170831,Anthony Chabot Regional Park,,2017-08-31T13:38:01Z,NaT,2017-08-31 13:38:00.547156+00:00,,2017.0,,8.0,,31.0,,13.0,,38.0,,6.0,,8,,31
20170831,Briones,,2017-08-31T13:37:55Z,NaT,2017-08-31 13:37:54.581065+00:00,,2017.0,,8.0,,31.0,,13.0,,37.0,,6.0,,8,,31
20170831,Garin Regional Park,,2017-08-31T13:37:45Z,NaT,2017-08-31 13:37:44.888702+00:00,,2017.0,,8.0,,31.0,,13.0,,37.0,,6.0,,8,,31


In [112]:
#rename the columns as unique rather than levels
df2.columns = ['ts_set','ts_rise','dt_set', 'dt_rise', 'year_set', 'year_rise', 'month_set', 'month_rise',
             'day_set', 'day_rise','hour_set', 'hour_rise','minute_set', 'minute_rise',
              'hour_sub7_set', 'hour_sub7_rise', 'Zmonth_set', 'Zmonth_rise', 'Zday_set', 'Zday_rise']
df2.head(3)

plain_dates,park_name,ts_set,ts_rise,dt_set,dt_rise,year_set,year_rise,month_set,month_rise,day_set,day_rise,hour_set,hour_rise,minute_set,minute_rise,hour_sub7_set,hour_sub7_rise,Zmonth_set,Zmonth_rise,Zday_set,Zday_rise
20170831,Anthony Chabot Regional Park,,2017-08-31T13:38:01Z,NaT,2017-08-31 13:38:00.547156+00:00,,2017.0,,8.0,,31.0,,13.0,,38.0,,6.0,,8,,31
20170831,Briones,,2017-08-31T13:37:55Z,NaT,2017-08-31 13:37:54.581065+00:00,,2017.0,,8.0,,31.0,,13.0,,37.0,,6.0,,8,,31
20170831,Garin Regional Park,,2017-08-31T13:37:45Z,NaT,2017-08-31 13:37:44.888702+00:00,,2017.0,,8.0,,31.0,,13.0,,37.0,,6.0,,8,,31
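
The hand-typed list of column names above depends on the exact order of the pivoted columns. As a hedged alternative, the names can be derived from the two-level column labels themselves. The sketch below uses a tiny made-up frame and a hypothetical `suffix` lookup; it produces names such as `hour_set` but would give `timescale_set` rather than the abbreviated `ts_set` used in the notebook.

```python
import pandas as pd

# Toy long-format frame: one sunset and one sunrise for a single park and date.
toy = pd.DataFrame({
    'plain_dates': [20170901, 20170901],
    'park_name':   ['Tilden', 'Tilden'],
    'sun':         [0, 1],            # 0 = sunset, 1 = sunrise
    'hour':        [2, 13],
    'minute':      [39, 39],
})

# One row per (date, park); each remaining field splits into sun=0 and sun=1 columns.
wide = toy.pivot(index=['plain_dates', 'park_name'], columns='sun')

# Each column label is a (field, sun_state) pair; join it into a single name.
suffix = {0: 'set', 1: 'rise'}
wide.columns = [f'{field}_{suffix[state]}' for field, state in wide.columns]
print(wide)
```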


In [113]:
#reset the index to remove the multi-indexing and make 'plain_dates' and 'park_name' regular columns
df3 = df2.reset_index(level=['park_name','plain_dates'])
print(df3.columns)

#rows for the first date have nans because of the time shift of 7 hours, these will be dropped
df3.head()

Index(['plain_dates', 'park_name', 'ts_set', 'ts_rise', 'dt_set', 'dt_rise',
       'year_set', 'year_rise', 'month_set', 'month_rise', 'day_set',
       'day_rise', 'hour_set', 'hour_rise', 'minute_set', 'minute_rise',
       'hour_sub7_set', 'hour_sub7_rise', 'Zmonth_set', 'Zmonth_rise',
       'Zday_set', 'Zday_rise'],
      dtype='object')


Unnamed: 0,plain_dates,park_name,ts_set,ts_rise,dt_set,dt_rise,year_set,year_rise,month_set,month_rise,...,hour_set,hour_rise,minute_set,minute_rise,hour_sub7_set,hour_sub7_rise,Zmonth_set,Zmonth_rise,Zday_set,Zday_rise
0,20170831,Anthony Chabot Regional Park,,2017-08-31T13:38:01Z,NaT,2017-08-31 13:38:00.547156+00:00,,2017.0,,8.0,...,,13.0,,38.0,,6.0,,8,,31
1,20170831,Briones,,2017-08-31T13:37:55Z,NaT,2017-08-31 13:37:54.581065+00:00,,2017.0,,8.0,...,,13.0,,37.0,,6.0,,8,,31
2,20170831,Garin Regional Park,,2017-08-31T13:37:45Z,NaT,2017-08-31 13:37:44.888702+00:00,,2017.0,,8.0,...,,13.0,,37.0,,6.0,,8,,31
3,20170831,Joseph D Grant County Park,,2017-08-31T13:36:42Z,NaT,2017-08-31 13:36:42.354058+00:00,,2017.0,,8.0,...,,13.0,,36.0,,6.0,,8,,31
4,20170831,Mt Diablo StatePark,,2017-08-31T13:37:08Z,NaT,2017-08-31 13:37:08.313040+00:00,,2017.0,,8.0,...,,13.0,,37.0,,6.0,,8,,31


<a id='skyfield_missing_values'></a>
### 5d. treat missing values
<br> The first and last dates each have a missing sun state: events are dated in UTC, and the search window (04:00 UTC on 2017-08-31 to 04:00 UTC on 2022-10-01) captures only the sunrise on the first date and only the sunset on the last.
<br> Drop the first and last dates.

In [114]:
#drop all observations with missing data (first and last dates)
df3 = df3.mask(df3.eq('None')).dropna()
#df3.isna()         #first and last entries have been dropped
print(df3.isna().sum())   #there are no missing values
#df3.info()
print(df3['plain_dates'].max())

plain_dates       0
park_name         0
ts_set            0
ts_rise           0
dt_set            0
dt_rise           0
year_set          0
year_rise         0
month_set         0
month_rise        0
day_set           0
day_rise          0
hour_set          0
hour_rise         0
minute_set        0
minute_rise       0
hour_sub7_set     0
hour_sub7_rise    0
Zmonth_set        0
Zmonth_rise       0
Zday_set          0
Zday_rise         0
dtype: int64
20220930


<a id='calculate_daylength'></a>
### 5e. calculate the day length
<br> calculated as number of seconds between sunrise and sunset

In [115]:
sr = list(df3['dt_rise'])  #datetime of sunrise (sun = 1); column index 5
ss = list(df3['dt_set'])   #datetime of sunset (sun = 0); column index 4
#ss[5] #returns Timestamp('2017-10-02 01:51:23.272716+0000', tz='UTC') type object

In [116]:
#define timedelta function to get day length in seconds
def delta(tset,trise):
    day_len = tset - trise
    py_day_len = day_len.to_pytimedelta() #convert to timedelta
    day_len_sec = py_day_len.seconds      #extract seconds value (always 0 to 86399, even when day_len is negative)
    return (day_len_sec)

#test function
#delta(ss[4], sr[4])

In [117]:
#implement the function in a loop to get object holding day length value for each day
day_lengths = []
for r in range(len(df3)):
    tdelta = delta((ss[r]),(sr[r]))
    day_lengths.append(tdelta)  

#create 'day_length' column and fill with day_lengths
df3['day_length'] = day_lengths
df3.head(3)

Unnamed: 0,plain_dates,park_name,ts_set,ts_rise,dt_set,dt_rise,year_set,year_rise,month_set,month_rise,...,hour_rise,minute_set,minute_rise,hour_sub7_set,hour_sub7_rise,Zmonth_set,Zmonth_rise,Zday_set,Zday_rise,day_length
8,20170901,Anthony Chabot Regional Park,2017-09-01T02:38:35Z,2017-09-01T13:38:51Z,2017-09-01 02:38:35.483063+00:00,2017-09-01 13:38:51.199587+00:00,2017.0,2017.0,9.0,9.0,...,13.0,38.0,38.0,19.0,6.0,9,9,1,1,46784
9,20170901,Briones,2017-09-01T02:38:50Z,2017-09-01T13:38:46Z,2017-09-01 02:38:50.033899+00:00,2017-09-01 13:38:45.667652+00:00,2017.0,2017.0,9.0,9.0,...,13.0,38.0,38.0,19.0,6.0,9,9,1,1,46804
10,20170901,Garin Regional Park,2017-09-01T02:38:04Z,2017-09-01T13:38:35Z,2017-09-01 02:38:04.150543+00:00,2017-09-01 13:38:35.209008+00:00,2017.0,2017.0,9.0,9.0,...,13.0,38.0,38.0,19.0,6.0,9,9,1,1,46768
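
Both events in a row share a UTC date, so the sunset (about 02:38 UTC) comes before the sunrise (about 13:38 UTC) and `tset - trise` is negative. Python stores a negative `timedelta` as a negative day count plus a non-negative `seconds` remainder, so `.seconds` returns 24 hours minus the length of the night between the two events. The sketch below reproduces the first row of the output above with the standard library, using the timestamps truncated to whole seconds.

```python
from datetime import datetime, timezone

# Anthony Chabot, plain_dates 20170901 (ts_set and ts_rise from the output above).
t_set = datetime(2017, 9, 1, 2, 38, 35, tzinfo=timezone.utc)
t_rise = datetime(2017, 9, 1, 13, 38, 51, tzinfo=timezone.utc)

diff = t_set - t_rise          # negative: sunset precedes sunrise within the UTC date
print(diff.days, diff.seconds)

night = t_rise - t_set         # the night between the two events
print(night.total_seconds())
print(86400 - int(night.total_seconds()))
```

The stored `day_length` is therefore an approximation that pairs one local day's sunset with the next local day's sunrise, rather than a direct sunrise-to-sunset difference for a single local day.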


<a id='daylength_clean'></a>
### 5f. clean up dataset

In [118]:
#remove unwanted columns
df_daylength = df3[['plain_dates','park_name','year_set','month_set','day_set', 
                    'hour_sub7_rise', 'minute_rise', 'hour_sub7_set', 'minute_set', 'day_length']]
df_daylength.head(10)

Unnamed: 0,plain_dates,park_name,year_set,month_set,day_set,hour_sub7_rise,minute_rise,hour_sub7_set,minute_set,day_length
8,20170901,Anthony Chabot Regional Park,2017.0,9.0,1.0,6.0,38.0,19.0,38.0,46784
9,20170901,Briones,2017.0,9.0,1.0,6.0,38.0,19.0,38.0,46804
10,20170901,Garin Regional Park,2017.0,9.0,1.0,6.0,38.0,19.0,38.0,46768
11,20170901,Joseph D Grant County Park,2017.0,9.0,1.0,6.0,37.0,19.0,36.0,46735
12,20170901,Mt Diablo StatePark,2017.0,9.0,1.0,6.0,37.0,19.0,37.0,46797
13,20170901,Pleasanton Ridge Regional Park,2017.0,9.0,1.0,6.0,38.0,19.0,37.0,46766
14,20170901,Sunol,2017.0,9.0,1.0,6.0,37.0,19.0,37.0,46754
15,20170901,Tilden,2017.0,9.0,1.0,6.0,39.0,19.0,39.0,46799
16,20170902,Anthony Chabot Regional Park,2017.0,9.0,2.0,6.0,39.0,19.0,37.0,46644
17,20170902,Briones,2017.0,9.0,2.0,6.0,39.0,19.0,37.0,46664


In [119]:
#rename columns
df_daylength.columns = ['plain_dates','park_name','year','month','day', 
                    'hour_rise', 'minute_rise', 'hour_set', 'minute_set', 'day_length']
df_daylength.head(3)

Unnamed: 0,plain_dates,park_name,year,month,day,hour_rise,minute_rise,hour_set,minute_set,day_length
8,20170901,Anthony Chabot Regional Park,2017.0,9.0,1.0,6.0,38.0,19.0,38.0,46784
9,20170901,Briones,2017.0,9.0,1.0,6.0,38.0,19.0,38.0,46804
10,20170901,Garin Regional Park,2017.0,9.0,1.0,6.0,38.0,19.0,38.0,46768


In [120]:
#work on an explicit copy so the assignments below act on df_daylength itself, not on a slice of df3
df_daylength = df_daylength.copy()

#convert the year, month, day, hour, and minute columns from float to integer
int_cols = ['year', 'month', 'day', 'hour_rise', 'minute_rise', 'hour_set', 'minute_set']
df_daylength[int_cols] = df_daylength[int_cols].astype(int)

In [121]:
print(df_daylength.shape)
df_daylength.head(15)

(14848, 10)


Unnamed: 0,plain_dates,park_name,year,month,day,hour_rise,minute_rise,hour_set,minute_set,day_length
8,20170901,Anthony Chabot Regional Park,2017,9,1,6,38,19,38,46784
9,20170901,Briones,2017,9,1,6,38,19,38,46804
10,20170901,Garin Regional Park,2017,9,1,6,38,19,38,46768
11,20170901,Joseph D Grant County Park,2017,9,1,6,37,19,36,46735
12,20170901,Mt Diablo StatePark,2017,9,1,6,37,19,37,46797
13,20170901,Pleasanton Ridge Regional Park,2017,9,1,6,38,19,37,46766
14,20170901,Sunol,2017,9,1,6,37,19,37,46754
15,20170901,Tilden,2017,9,1,6,39,19,39,46799
16,20170902,Anthony Chabot Regional Park,2017,9,2,6,39,19,37,46644
17,20170902,Briones,2017,9,2,6,39,19,37,46664


In [122]:
df_daylength.columns

### export daylength data

In [123]:
#re-import date and datetime in case either name was overwritten earlier in the session
from datetime import date, datetime

#export data
#timestamp
today = date.today()
now = datetime.now()
current_time = now.strftime("%H:%M:%S")
print("most recent export:",today, ",", current_time)

df_daylength.to_csv('/Users/sandidge/Desktop/Python_Projects/Springboard_coursework/Capstone2_Wildflowers/Public_Final/daylength_data.csv')


most recent export: 2022-11-14 , 08:02:32


<a id='integrate'></a>
# 6. Combine climate data and day length data
**Features are engineered for temperature and precipitation**
 - prior 14 days
 - prior 30 days
 - cumulative precipitation by water year
 
[Link to top](#guide)
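
The features listed above can be sketched on toy data as follows. This is only an illustration, not the notebook's own feature code: the `water_year`, `prec_cum_wy`, and `prec_prior` column names, the 2-row window, and the precipitation values are made up, and labelling a water year by the calendar year in which it ends is an assumed convention.

```python
import pandas as pd

# Toy daily record for one station (values are made up).
toy = pd.DataFrame({
    'plain_dates': [20170929, 20170930, 20171001, 20171002, 20180930, 20181001],
    'daily_prec':  [0.5, 0.25, 1.0, 0.5, 0.25, 2.0],
})

# Split the YYYYMMDD integer into year and month.
year = toy['plain_dates'] // 10000
month = toy['plain_dates'] // 100 % 100

# Assumed convention: the water year starting Oct 1 takes the following calendar year as its label.
toy['water_year'] = year + (month >= 10).astype(int)

# Running precipitation total that restarts every Oct 1.
toy['prec_cum_wy'] = toy.groupby('water_year')['daily_prec'].cumsum()

# Sum over the prior N rows, excluding the current row (N = 2 here; 14 or 30 in practice).
# With real data this assumes one row per day and should be applied per station.
n_days = 2
toy['prec_prior'] = toy['daily_prec'].shift(1).rolling(n_days, min_periods=1).sum()
print(toy)
```

The first row has no prior readings, so its `prec_prior` value is missing.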

### 6a. import climate and daylength data, or reuse the dataframes created above


### import climate data 

In [124]:
#Download the csv file from GitHub: Floydworks
#url = ('https://raw.githubusercontent.com/Floydworks/WildflowerFinder_Phenology_Tool/main/cleaned_data_files/climate_GHCN_data.csv')
#download = requests.get(url).content

# Read the downloaded content and turn it into a pandas dataframe
#climate_df = pd.read_csv(io.StringIO(download.decode('utf-8')))

#climate_df = climate_df[['plain_dates', 'year_cl', 'month_cl', 'day_cl', 'minTemp',
#       'maxTemp', 'daily_prec', 'city', 'station_id', 'prec_cum']]

In [125]:
#call climate_data dataframe created previously and store as climate_df
climate_df = climate_data
print(climate_df.shape)
climate_df.head()

(14836, 10)


Unnamed: 0,plain_dates,year_cl,month_cl,day_cl,minTemp,maxTemp,daily_prec,city,station_id,prec_cum
42680,20170901,2017,9,1,60.98,91.04,0.0,berkeley,USC00040693,0.0
42681,20170902,2017,9,2,73.94,105.08,0.0,berkeley,USC00040693,0.0
42682,20170903,2017,9,3,69.98,105.08,0.0,berkeley,USC00040693,0.0
42683,20170904,2017,9,4,62.96,87.08,0.0,berkeley,USC00040693,0.0
42684,20170905,2017,9,5,64.94,78.08,0.0,berkeley,USC00040693,0.0


In [126]:
#climate_df.describe()

In [127]:
#reduce and reorder columns as needed
print(climate_df.columns)
#climate_df = climate_df[['plain_dates', 'year_cl', 'month_cl', 'day_cl', 'minTemp', 'maxTemp',
#       'daily_prec', 'city', 'station_id', 'prec_cum']]

climate_df.head(3)

Index(['plain_dates', 'year_cl', 'month_cl', 'day_cl', 'minTemp', 'maxTemp',
       'daily_prec', 'city', 'station_id', 'prec_cum'],
      dtype='object')


Unnamed: 0,plain_dates,year_cl,month_cl,day_cl,minTemp,maxTemp,daily_prec,city,station_id,prec_cum
42680,20170901,2017,9,1,60.98,91.04,0.0,berkeley,USC00040693,0.0
42681,20170902,2017,9,2,73.94,105.08,0.0,berkeley,USC00040693,0.0
42682,20170903,2017,9,3,69.98,105.08,0.0,berkeley,USC00040693,0.0


### import and prep the daylength dataset

In [128]:
#import day length data from GitHub:Floydworks
#url = ('https://raw.githubusercontent.com/Floydworks/WildflowerFinder_Phenology_Tool/main/cleaned_data_files/daylength_data.csv')
#download = requests.get(url).content

# Read the downloaded content and turn it into a pandas dataframe
#df_daylength = pd.read_csv(io.StringIO(download.decode('utf-8')))

In [129]:
#call data produced above and assign new name for manipulation 
daylength_data = df_daylength

In [130]:
daylength_data.head(3)

Unnamed: 0,plain_dates,park_name,year,month,day,hour_rise,minute_rise,hour_set,minute_set,day_length
8,20170901,Anthony Chabot Regional Park,2017,9,1,6,38,19,38,46784
9,20170901,Briones,2017,9,1,6,38,19,38,46804
10,20170901,Garin Regional Park,2017,9,1,6,38,19,38,46768


In [131]:
#shorten long park names
park_abbr_dict = {'Anthony Chabot Regional Park' : 'AChabot',
 'Garin Regional Park': 'Garin',
 'Joseph D Grant County Park' : 'JDGrant',
 'Pleasanton Ridge Regional Park' : 'PRidge',
 'Mt Diablo StatePark' : 'MtDiablo',
 'Sunol' : 'Sunol',
 'Tilden' : 'Tilden',
 'Briones' : 'Briones'
                 }

#map shortened park names
daylength_data['park_name'] = daylength_data['park_name'].map(park_abbr_dict)



In [132]:
parks = list(daylength_data['park_name'].unique())
#look at date range for each park
for p in parks:
    print(p, ': ', daylength_data[daylength_data['park_name']== p]['plain_dates'].min(), daylength_data[daylength_data['park_name']== p]['plain_dates'].max())


AChabot :  20170901 20220930
Briones :  20170901 20220930
Garin :  20170901 20220930
JDGrant :  20170901 20220930
MtDiablo :  20170901 20220930
PRidge :  20170901 20220930
Sunol :  20170901 20220930
Tilden :  20170901 20220930


-
<a id='merge_clim_day'></a>

### 6b. merge daylength and climate datasets

In [133]:
daylength_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14848 entries, 8 to 14855
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   plain_dates  14848 non-null  int64 
 1   park_name    14848 non-null  object
 2   year         14848 non-null  int64 
 3   month        14848 non-null  int64 
 4   day          14848 non-null  int64 
 5   hour_rise    14848 non-null  int64 
 6   minute_rise  14848 non-null  int64 
 7   hour_set     14848 non-null  int64 
 8   minute_set   14848 non-null  int64 
 9   day_length   14848 non-null  int64 
dtypes: int64(9), object(1)
memory usage: 1.2+ MB


In [134]:
daylength_data = daylength_data[['plain_dates', 'park_name', 'year', 'month', 'day',
       'hour_rise', 'minute_rise', 'hour_set', 'minute_set', 'day_length']]

In [135]:
climate_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14836 entries, 42680 to 8852
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   plain_dates  14836 non-null  object 
 1   year_cl      14836 non-null  object 
 2   month_cl     14836 non-null  object 
 3   day_cl       14836 non-null  object 
 4   minTemp      14836 non-null  float64
 5   maxTemp      14836 non-null  float64
 6   daily_prec   14836 non-null  float64
 7   city         14836 non-null  object 
 8   station_id   14836 non-null  object 
 9   prec_cum     14836 non-null  float64
dtypes: float64(4), object(6)
memory usage: 1.2+ MB


In [136]:
#convert plain dates in climate_df to int64 for merging
climate_df['plain_dates']= pd.to_numeric(climate_df['plain_dates'])

In [137]:
climate_df.describe()

Unnamed: 0,plain_dates,minTemp,maxTemp,daily_prec,prec_cum
count,14836.0,14836.0,14836.0,14836.0,14836.0
mean,20197710.0,50.375772,70.146066,0.040257,7.909522
std,14998.13,9.204269,12.493215,0.195702,7.146831
min,20170900.0,19.94,28.94,0.0,0.0
25%,20181210.0,44.06,60.98,0.0,2.728346
50%,20200320.0,51.08,69.08,0.0,5.795276
75%,20210620.0,57.02,78.08,0.0,11.984252
max,20220930.0,84.92,113.0,6.03937,38.570866


In [138]:
#merge climate/weather data onto daylength data by date
#(each park row is matched to every station reporting on that date)
left = daylength_data.merge(climate_df, on='plain_dates', how='left')
print(left['plain_dates'].min())
print(left['plain_dates'].max())
#left.head(50)

20170901
20220930


In [139]:
#merge daylength onto climate/weather data
#left = climate_df.merge(daylength_data, on='plain_dates', how='left')
#print(left['plain_dates'].min())
#print(left['plain_dates'].max())
#left.head(50)

**clean and format the merged data**

In [140]:
#drop dates (before 09/01/17) that have no daylength data
#left.isna().sum()
#left = left[left['plain_dates']>=20170901]
#left.isna().sum()

In [141]:
print(left.columns)
left.head(30)

Index(['plain_dates', 'park_name', 'year', 'month', 'day', 'hour_rise',
       'minute_rise', 'hour_set', 'minute_set', 'day_length', 'year_cl',
       'month_cl', 'day_cl', 'minTemp', 'maxTemp', 'daily_prec', 'city',
       'station_id', 'prec_cum'],
      dtype='object')


Unnamed: 0,plain_dates,park_name,year,month,day,hour_rise,minute_rise,hour_set,minute_set,day_length,year_cl,month_cl,day_cl,minTemp,maxTemp,daily_prec,city,station_id,prec_cum
0,20170901,AChabot,2017,9,1,6,38,19,38,46784,2017,9,1,60.98,91.04,0.0,berkeley,USC00040693,0.0
1,20170901,AChabot,2017,9,1,6,38,19,38,46784,2017,9,1,69.08,111.02,0.0,concord,USW00023254,0.0
2,20170901,AChabot,2017,9,1,6,38,19,38,46784,2017,9,1,69.08,102.92,0.0,hayward,USW00093228,0.0
3,20170901,AChabot,2017,9,1,6,38,19,38,46784,2017,9,1,66.02,109.04,0.0,livermore,USW00023285,0.0
4,20170901,AChabot,2017,9,1,6,38,19,38,46784,2017,9,1,71.06,93.02,0.0,mtdiablo,USC00045915,0.0
5,20170901,AChabot,2017,9,1,6,38,19,38,46784,2017,9,1,80.96,91.04,0.0,mthamilton,USC00045933,0.0
6,20170901,AChabot,2017,9,1,6,38,19,38,46784,2017,9,1,60.08,100.94,0.0,oakland,USW00023230,0.0
7,20170901,AChabot,2017,9,1,6,38,19,38,46784,2017,9,1,69.08,107.96,0.0,sanjose,USW00023293,0.0
8,20170901,Briones,2017,9,1,6,38,19,38,46804,2017,9,1,60.98,91.04,0.0,berkeley,USC00040693,0.0
9,20170901,Briones,2017,9,1,6,38,19,38,46804,2017,9,1,69.08,111.02,0.0,concord,USW00023254,0.0
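
The `day_length` values appear to be in seconds: 46784 s is roughly 13.0 hours, in line with the 06:38 sunrise and 19:38 sunset in the first row. That unit is inferred from the values rather than stated here. A quick sanity check on one row can be sketched as follows (`row_toy` and the derived column names are hypothetical):

```python
import pandas as pd

# First AChabot row from the table above
row_toy = pd.DataFrame({
    "hour_rise": [6], "minute_rise": [38],
    "hour_set": [19], "minute_set": [38],
    "day_length": [46784],
})

# Rebuild an approximate day length from the truncated hour/minute columns
rise_s = row_toy["hour_rise"] * 3600 + row_toy["minute_rise"] * 60
set_s = row_toy["hour_set"] * 3600 + row_toy["minute_set"] * 60
row_toy["approx_length_s"] = set_s - rise_s

# Express the stored value in hours and measure the disagreement in seconds
row_toy["day_length_h"] = row_toy["day_length"] / 3600
row_toy["gap_s"] = (row_toy["approx_length_s"] - row_toy["day_length"]).abs()
print(row_toy[["approx_length_s", "day_length_h", "gap_s"]])
```

A gap under two minutes is expected, since the sunrise and sunset columns carry no seconds.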


In [142]:
#rename daily_prec : prec_daily
left = left.rename(columns={"daily_prec": "prec_daily"})

#reorder and select columns
left = left[['plain_dates', 'year_cl', 'month_cl', 'day_cl',
       'prec_cum', 'prec_daily', 'minTemp', 'maxTemp', 'city', 'station_id','park_name', 
       'hour_rise','minute_rise', 'hour_set', 'minute_set', 'day_length']]

#add leading zero to month and day values that are a single digit
left['day_cl'] = left['day_cl'].astype(str)
left['day_cl'] = left['day_cl'].str.zfill(2)
left['month_cl'] = left['month_cl'].astype(str)
left['month_cl'] = left['month_cl'].str.zfill(2)

left.head()

Unnamed: 0,plain_dates,year_cl,month_cl,day_cl,prec_cum,prec_daily,minTemp,maxTemp,city,station_id,park_name,hour_rise,minute_rise,hour_set,minute_set,day_length
0,20170901,2017,9,1,0.0,0.0,60.98,91.04,berkeley,USC00040693,AChabot,6,38,19,38,46784
1,20170901,2017,9,1,0.0,0.0,69.08,111.02,concord,USW00023254,AChabot,6,38,19,38,46784
2,20170901,2017,9,1,0.0,0.0,69.08,102.92,hayward,USW00093228,AChabot,6,38,19,38,46784
3,20170901,2017,9,1,0.0,0.0,66.02,109.04,livermore,USW00023285,AChabot,6,38,19,38,46784
4,20170901,2017,9,1,0.0,0.0,71.06,93.02,mtdiablo,USC00045915,AChabot,6,38,19,38,46784


In [143]:
left.describe()

Unnamed: 0,plain_dates,prec_cum,prec_daily,minTemp,maxTemp,hour_rise,minute_rise,hour_set,minute_set,day_length
count,118688.0,118688.0,118688.0,118688.0,118688.0,118688.0,118688.0,118688.0,118688.0,118688.0
mean,20197710.0,7.909522,0.040257,50.375772,70.146066,6.526692,29.88474,18.721926,29.9085,43903.771813
std,14997.68,7.14662,0.195696,9.203998,12.492846,1.012212,18.049104,1.020134,17.363042,6419.790191
min,20170900.0,0.0,0.0,19.94,28.94,5.0,0.0,17.0,0.0,34285.0
25%,20181210.0,2.728346,0.0,44.06,60.98,6.0,14.0,18.0,15.0,37707.0
50%,20200320.0,5.795276,0.0,51.08,69.08,6.0,27.0,19.0,30.0,44069.0
75%,20210620.0,11.984252,0.0,57.02,78.08,7.0,47.0,20.0,48.0,50097.0
max,20220930.0,38.570866,6.03937,84.92,113.0,8.0,59.0,20.0,59.0,53262.0


In [144]:
#see city and park names
print(left['park_name'].unique())
print(left['city'].unique())

['AChabot' 'Briones' 'Garin' 'JDGrant' 'MtDiablo' 'PRidge' 'Sunol'
 'Tilden']
['berkeley' 'concord' 'hayward' 'livermore' 'mtdiablo' 'mthamilton'
 'oakland' 'sanjose']


<a id='multi_station_avgs'></a>
### 6c. calculate climate and daylength averages for parks using multiple stations

In [145]:
#keep only the rows that pair each park with its chosen weather station
Br = left[(left['city']=='concord')&(left['park_name']=='Briones')]
Ti = left[(left['city']=='berkeley')&(left['park_name']=='Tilden')]
Ac = left[(left['city']=='oakland')&(left['park_name']=='AChabot')]
Ga = left[(left['city']=='hayward')&(left['park_name']=='Garin')]
Md = left[(left['city']=='mtdiablo')&(left['park_name']=='MtDiablo')]

#dataframes for parks using multiple stations
Jd = left[(left['city'].isin(['mthamilton', 'sanjose']))&(left['park_name']=='JDGrant')]
Pr = left[(left['city'].isin(['hayward', 'livermore']))&(left['park_name']=='PRidge')]
Su = left[(left['city'].isin(['sanjose', 'livermore']))&(left['park_name']=='Sunol')]

print(Br['city'].unique(), ':', Br.shape)
print(Ti['city'].unique(), ':', Ti.shape)  
print(Ac['city'].unique(), ':', Ac.shape)  
print(Jd['city'].unique(), ':', Jd.shape)
print(Ga['city'].unique(), ':', Ga.shape)
print(Pr['city'].unique(), ':', Pr.shape) 
print(Su['city'].unique(), ':', Su.shape) 
print(Md['city'].unique(), ':', Md.shape) 

#average the rows for parks that use two weather stations
#indicator columns identify each row; they are grouping keys and are not averaged
indicators = ['plain_dates', 'year_cl','month_cl', 'day_cl','city','station_id','park_name']

#first pass: city and station_id still differ between the stations, so no rows are combined yet
Su = Su.groupby(indicators, as_index=False).mean()
#relabel city and station_id so both stations share one grouping key
Su['city'] = 'LivermoreSanJose'
Su['station_id'] = 'USW00023285.USW00023293'
#second pass: average the station rows for each date
Su = Su.groupby(indicators, as_index=False).mean()   #use this for climate explorer data

print('Sunol: ',Su.shape)
#Su.sort_values(by = ['plain_dates']).head(20)

#Pleasanton Ridge: same two-pass averaging
Pr = Pr.groupby(indicators, as_index=False).mean()
#relabel city and station_id so both stations share one grouping key
Pr['city'] = 'LivermoreHayward'
Pr['station_id'] = 'USW00023285.USW00093228'
Pr = Pr.groupby(indicators, as_index=False).mean()  #use this for climate explorer data

print('Pleasanton Ridge: ', Pr.shape)
#Pr.sort_values(by = ['plain_dates']).head(20)

#JDGrant: same two-pass averaging
Jd = Jd.groupby(indicators, as_index=False).mean()
#relabel city and station_id so both stations share one grouping key
Jd['city'] = 'SanJoseMtHamilton'
Jd['station_id'] = 'USW00023293.USC00045933'
Jd = Jd.groupby(indicators, as_index=False).mean()  #use this for climate explorer data

print('JDGrant: ', Jd.shape)
#Jd.sort_values(by = ['plain_dates']).head(20)

#concatenate the park dataframes
climate_daylength= pd.concat([Su,Pr,Br,Ti, Ac, Jd, Ga, Md])

print(climate_daylength.shape)

#convert plain_dates to integer
climate_daylength['plain_dates'] = climate_daylength['plain_dates'].astype(int)
#climate_data['day_cl'] = climate_data['day_cl'].astype(int)

['concord'] : (1854, 16)
['berkeley'] : (1856, 16)
['oakland'] : (1855, 16)
['mthamilton' 'sanjose'] : (3712, 16)
['hayward'] : (1849, 16)
['hayward' 'livermore'] : (3703, 16)
['livermore' 'sanjose'] : (3710, 16)
['mtdiablo'] : (1856, 16)
Sunol:  (1856, 16)
Pleasanton Ridge:  (1856, 16)
JDGrant:  (1856, 16)
(14838, 16)
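
The two-station averaging can be illustrated on a single day. The sketch below uses the Livermore and San Jose temperatures listed earlier for 2017-09-01 and relabels both rows with a shared `city` value before grouping; the frame name `sunol_toy` and the reduced column set are assumptions made for brevity:

```python
import pandas as pd

# Livermore and San Jose readings for one Sunol park-day
sunol_toy = pd.DataFrame({
    "plain_dates": [20170901, 20170901],
    "park_name": ["Sunol", "Sunol"],
    "city": ["livermore", "sanjose"],
    "minTemp": [66.02, 69.08],
    "maxTemp": [109.04, 107.96],
})

# Give both station rows the same key, then average per park and day
sunol_avg = (
    sunol_toy.assign(city="LivermoreSanJose")
    .groupby(["plain_dates", "park_name", "city"], as_index=False)[["minTemp", "maxTemp"]]
    .mean()
)
print(sunol_avg)
```

The averaged values (about 67.55 and 108.5) agree with the first Sunol row of `climate_daylength`.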


In [146]:
cities = list(climate_daylength['city'].unique())
#look at max date range for each city
for c in cities:
    print(c, ': ', climate_daylength[climate_daylength['city']== c]['plain_dates'].max())


LivermoreSanJose :  20220930
LivermoreHayward :  20220930
concord :  20220930
berkeley :  20220930
oakland :  20220930
SanJoseMtHamilton :  20220930
hayward :  20220930
mtdiablo :  20220930


In [147]:
print(climate_daylength.isna().sum())

plain_dates    0
year_cl        0
month_cl       0
day_cl         0
city           0
station_id     0
park_name      0
prec_cum       0
prec_daily     0
minTemp        0
maxTemp        0
hour_rise      0
minute_rise    0
hour_set       0
minute_set     0
day_length     0
dtype: int64


<a id='engineer_climate'></a>
### 6d. engineer climate features
<br>Wateryears, Wateryear Weeks, monthly values, weekly values, one month prior, two weeks prior, one week prior

In [148]:
#check header of wildflower dataset
#Download the csv file from GitHub: Floydworks
#url = ('https://raw.githubusercontent.com/Floydworks/WildflowerFinder_Phenology_Tool/main/cleaned_data_files/df_wildflowers_2017_2022.csv')
#download = requests.get(url).content

# Read the downloaded content and turn it into a pandas dataframe
#flowers_df = pd.read_csv(io.StringIO(download.decode('utf-8')))

#print(flowers_df.shape)
#flowers_df.head(3)

In [149]:
print(climate_daylength.columns)
climate_daylength.head(3)

Index(['plain_dates', 'year_cl', 'month_cl', 'day_cl', 'city', 'station_id',
       'park_name', 'prec_cum', 'prec_daily', 'minTemp', 'maxTemp',
       'hour_rise', 'minute_rise', 'hour_set', 'minute_set', 'day_length'],
      dtype='object')


Unnamed: 0,plain_dates,year_cl,month_cl,day_cl,city,station_id,park_name,prec_cum,prec_daily,minTemp,maxTemp,hour_rise,minute_rise,hour_set,minute_set,day_length
0,20170901,2017,9,1,LivermoreSanJose,USW00023285.USW00023293,Sunol,0.0,0.0,67.55,108.5,6.0,37.0,19.0,37.0,46754.0
1,20170902,2017,9,2,LivermoreSanJose,USW00023285.USW00023293,Sunol,0.0,0.0,71.51,107.51,6.0,38.0,19.0,35.0,46616.0
2,20170903,2017,9,3,LivermoreSanJose,USW00023285.USW00023293,Sunol,0.0,0.0,77.0,102.02,6.0,39.0,19.0,34.0,46477.0


In [150]:
#rename some columns
climate_daylength = climate_daylength.rename(columns={'year_cl':'Year',
                   'month_cl':'Month', 
                   'day_cl':'Day', 
                   'park_name':'park'})

#climate_daylength.columns

In [151]:
#climate_daylength.info()
print(climate_daylength['park'].unique())
climate_daylength.head(3)

['Sunol' 'PRidge' 'Briones' 'Tilden' 'AChabot' 'JDGrant' 'Garin'
 'MtDiablo']


Unnamed: 0,plain_dates,Year,Month,Day,city,station_id,park,prec_cum,prec_daily,minTemp,maxTemp,hour_rise,minute_rise,hour_set,minute_set,day_length
0,20170901,2017,9,1,LivermoreSanJose,USW00023285.USW00023293,Sunol,0.0,0.0,67.55,108.5,6.0,37.0,19.0,37.0,46754.0
1,20170902,2017,9,2,LivermoreSanJose,USW00023285.USW00023293,Sunol,0.0,0.0,71.51,107.51,6.0,38.0,19.0,35.0,46616.0
2,20170903,2017,9,3,LivermoreSanJose,USW00023285.USW00023293,Sunol,0.0,0.0,77.0,102.02,6.0,39.0,19.0,34.0,46477.0


**add water years and water year weeks**

In [152]:
#define dictionary of month names and corresponding water year month number for mapping
month_dict = {
    'Oct': '01', 
    'Nov': '02', 
    'Dec': '03',
    'Jan': '04',
    'Feb': '05',
    'Mar': '06',
    'Apr': '07',
    'May': '08',
    'Jun': '09',
    'Jul': '10',
    'Aug': '11',
    'Sep': '12',
    
}

#create column of month names as text
climate_daylength['Month_name'] = pd.to_datetime(climate_daylength['Month'], format='%m').dt.month_name().str.slice(stop=3)
#create column of water year month numbers to order plot by
climate_daylength['wy_month'] = climate_daylength['Month_name'].map(month_dict)

#add column with year and month values concatenated
climate_daylength['yr_mon'] = climate_daylength.Year.astype(str) + climate_daylength.Month.astype(str)


#add water year column
#assign the water year (WY) where yr_mon falls in that water year (Oct through the following Sep)
#rows that match none of the listed months keep the placeholder string 'Nan'
climate_daylength['WY'] = np.where(climate_daylength['yr_mon'].isin(['201610', '201611', '201612']), '2017', 'Nan')
climate_daylength['WY'] = np.where(climate_daylength['yr_mon'].isin(['201701', '201702', '201703', '201704', '201705', '201706', '201707', '201708', '201709']), '2017', climate_daylength['WY'])

climate_daylength['WY'] = np.where(climate_daylength['yr_mon'].isin(['201710', '201711', '201712']), '2018', climate_daylength['WY'])
climate_daylength['WY'] = np.where(climate_daylength['yr_mon'].isin(['201801', '201802', '201803', '201804', '201805', '201806', '201807', '201808', '201809']), '2018', climate_daylength['WY'])

climate_daylength['WY'] = np.where(climate_daylength['yr_mon'].isin(['201810', '201811', '201812']), '2019', climate_daylength['WY'])
climate_daylength['WY'] = np.where(climate_daylength['yr_mon'].isin(['201901', '201902', '201903', '201904', '201905', '201906', '201907', '201908', '201909']), '2019', climate_daylength['WY'])

climate_daylength['WY'] = np.where(climate_daylength['yr_mon'].isin(['201910', '201911', '201912']), '2020', climate_daylength['WY'])
climate_daylength['WY'] = np.where(climate_daylength['yr_mon'].isin(['202001', '202002', '202003', '202004', '202005', '202006', '202007', '202008', '202009']), '2020', climate_daylength['WY'])

climate_daylength['WY'] = np.where(climate_daylength['yr_mon'].isin(['202010', '202011', '202012']), '2021', climate_daylength['WY'])
climate_daylength['WY'] = np.where(climate_daylength['yr_mon'].isin(['202101', '202102', '202103', '202104', '202105', '202106', '202107', '202108', '202109']), '2021', climate_daylength['WY'])

climate_daylength['WY'] = np.where(climate_daylength['yr_mon'].isin(['202110', '202111', '202112']), '2022', climate_daylength['WY'])
climate_daylength['WY'] = np.where(climate_daylength['yr_mon'].isin(['202201', '202202', '202203', '202204', '202205', '202206', '202207', '202208', '202209']), '2022', climate_daylength['WY'])


#print(np.unique(climate_daylength['WY']))

climate_daylength.head(3)

Unnamed: 0,plain_dates,Year,Month,Day,city,station_id,park,prec_cum,prec_daily,minTemp,maxTemp,hour_rise,minute_rise,hour_set,minute_set,day_length,Month_name,wy_month,yr_mon,WY
0,20170901,2017,9,1,LivermoreSanJose,USW00023285.USW00023293,Sunol,0.0,0.0,67.55,108.5,6.0,37.0,19.0,37.0,46754.0,Sep,12,201709,2017
1,20170902,2017,9,2,LivermoreSanJose,USW00023285.USW00023293,Sunol,0.0,0.0,71.51,107.51,6.0,38.0,19.0,35.0,46616.0,Sep,12,201709,2017
2,20170903,2017,9,3,LivermoreSanJose,USW00023285.USW00023293,Sunol,0.0,0.0,77.0,102.02,6.0,39.0,19.0,34.0,46477.0,Sep,12,201709,2017
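
Since a water year runs from October through the following September, the `WY` and `wy_month` values can also be derived arithmetically from the calendar year and month. This is a sketch of an alternative under that definition, not the notebook's own method; `dates_toy` is a hypothetical frame:

```python
import pandas as pd

# Hypothetical year/month strings in the same zero-padded form as the notebook
dates_toy = pd.DataFrame({
    "Year": ["2017", "2017", "2018", "2021"],
    "Month": ["09", "10", "01", "12"],
})
year = dates_toy["Year"].astype(int)
month = dates_toy["Month"].astype(int)

# Oct-Dec belong to the water year that ends the following September
dates_toy["WY"] = (year + (month >= 10).astype(int)).astype(str)

# Water-year month number: Oct -> 01, Nov -> 02, ..., Sep -> 12
dates_toy["wy_month"] = (((month - 10) % 12) + 1).astype(str).str.zfill(2)
print(dates_toy)
```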


In [153]:
#create a month-name + day key (e.g. Sep01) used to map each date to its water year week
climate_daylength['wy_mon_day'] = climate_daylength['Month_name'] + climate_daylength['Day'].astype(str)

**access external date/wateryear week map to map observations into water year weeks**

In [154]:
#Download the csv file from GitHub: Floydworks
url = "https://raw.githubusercontent.com/Floydworks/WildflowerFinder_Phenology_Tool/main/cleaned_data_files/wy_week_nums.csv" # Make sure the url is the raw version of the file on GitHub
download = requests.get(url).content
# Read the downloaded content and turn it into a pandas dataframe
wy_week_nums = pd.read_csv(io.StringIO(download.decode('utf-8')))

# add leading zeros to single-digit days and water year week numbers
wy_week_nums['day'] = wy_week_nums['day'].astype(str)
wy_week_nums['day'] = wy_week_nums['day'].str.zfill(2)
wy_week_nums['WY_wk_num'] = wy_week_nums['WY_wk_num'].astype(str)
wy_week_nums['WY_wk_num'] = wy_week_nums['WY_wk_num'].str.zfill(2)

wy_week_nums['wy_mon_day'] = wy_week_nums['month'] + wy_week_nums['day'].astype(str)

wy_week_nums.head(3)

Unnamed: 0,WY_wk_num,day,month,wy_mon_day
0,1,1,Oct,Oct01
1,1,2,Oct,Oct02
2,1,3,Oct,Oct03


In [155]:
#create dictionary of wy_mon_day:WY_wk_num
mon_day_wywk_dict = dict(zip(wy_week_nums['wy_mon_day'], wy_week_nums['WY_wk_num']))

#create water year week
#each water year month is split into 4 water weeks: days 1-8, 9-15, 16-23, and the remaining days of the month
climate_daylength['WY_weeknum'] = climate_daylength['wy_mon_day'].map(mon_day_wywk_dict)    

print(climate_daylength['park'].unique())
climate_daylength.head()

['Sunol' 'PRidge' 'Briones' 'Tilden' 'AChabot' 'JDGrant' 'Garin'
 'MtDiablo']


Unnamed: 0,plain_dates,Year,Month,Day,city,station_id,park,prec_cum,prec_daily,minTemp,...,minute_rise,hour_set,minute_set,day_length,Month_name,wy_month,yr_mon,WY,wy_mon_day,WY_weeknum
0,20170901,2017,9,1,LivermoreSanJose,USW00023285.USW00023293,Sunol,0.0,0.0,67.55,...,37.0,19.0,37.0,46754.0,Sep,12,201709,2017,Sep01,45
1,20170902,2017,9,2,LivermoreSanJose,USW00023285.USW00023293,Sunol,0.0,0.0,71.51,...,38.0,19.0,35.0,46616.0,Sep,12,201709,2017,Sep02,45
2,20170903,2017,9,3,LivermoreSanJose,USW00023285.USW00023293,Sunol,0.0,0.0,77.0,...,39.0,19.0,34.0,46477.0,Sep,12,201709,2017,Sep03,45
3,20170904,2017,9,4,LivermoreSanJose,USW00023285.USW00023293,Sunol,0.0,0.0,70.52,...,40.0,19.0,32.0,46338.0,Sep,12,201709,2017,Sep04,45
4,20170905,2017,9,5,LivermoreSanJose,USW00023285.USW00023293,Sunol,0.0,0.0,68.99,...,41.0,19.0,31.0,46198.0,Sep,12,201709,2017,Sep05,45
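
Mapping through a dictionary returns NaN for any key that is missing from the lookup, so it can be worth listing unmapped keys before aggregating by week. A minimal sketch with a hypothetical three-entry excerpt of the lookup (the real table comes from `wy_week_nums.csv`; `Feb29` is absent only from this toy excerpt):

```python
import pandas as pd

# Hypothetical excerpt of the month-day -> water-year-week lookup
lookup_toy = {"Oct01": "01", "Oct02": "01", "Sep01": "45"}

# Keys built the same way as wy_mon_day: month abbreviation + zero-padded day
keys_toy = pd.Series(["Sep01", "Oct02", "Feb29"])
weeks_toy = keys_toy.map(lookup_toy)

# Any key not present in the lookup maps to NaN; collect those for review
unmapped = keys_toy[weeks_toy.isna()].tolist()
print(unmapped)
```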


**add annual cumulative precipitation for each park and water year**

In [156]:
#split climate_daylength into one dataframe per park
#.copy() makes each slice independent, so adding a column below does not raise SettingWithCopyWarning
Br = climate_daylength[climate_daylength['park']=='Briones'].copy()
Ti = climate_daylength[climate_daylength['park']=='Tilden'].copy()
Ac = climate_daylength[climate_daylength['park']=='AChabot'].copy()
Ga = climate_daylength[climate_daylength['park']=='Garin'].copy()
Jd = climate_daylength[climate_daylength['park']=='JDGrant'].copy()
Pr = climate_daylength[climate_daylength['park']=='PRidge'].copy()
Su = climate_daylength[climate_daylength['park']=='Sunol'].copy()
Md = climate_daylength[climate_daylength['park']=='MtDiablo'].copy()

#list of park dataframes
parks = [Br,Ti,Ac,Ga,Jd,Pr,Su, Md]
#go through park data frames and add column for cumulative precipitation over each water year
for p in parks:
    p['prec_cum_WY'] = p.groupby('WY')['prec_daily'].cumsum()

#concatenate the park dataframes
climate_daylength = pd.concat(parks)
print(climate_daylength.shape)

print(len(climate_data))
print(len(climate_daylength))

#verify concatenation: the message prints only when the two row counts match
if len(climate_data) == len(climate_daylength):
    print('Concatenation seems correct!')

(14838, 23)
14836
14838


**add precipitation and temperature aggregate features**

In [157]:
pd.set_option('display.max_columns', None)
climate_daylength.head()

Unnamed: 0,plain_dates,Year,Month,Day,city,station_id,park,prec_cum,prec_daily,minTemp,maxTemp,hour_rise,minute_rise,hour_set,minute_set,day_length,Month_name,wy_month,yr_mon,WY,wy_mon_day,WY_weeknum,prec_cum_WY
9,20170901,2017,9,1,concord,USW00023254,Briones,0.0,0.0,69.08,111.02,6.0,38.0,19.0,38.0,46804.0,Sep,12,201709,2017,Sep01,45,0.0
73,20170902,2017,9,2,concord,USW00023254,Briones,0.0,0.0,73.04,109.94,6.0,39.0,19.0,37.0,46664.0,Sep,12,201709,2017,Sep02,45,0.0
137,20170903,2017,9,3,concord,USW00023254,Briones,0.0,0.0,75.02,105.98,6.0,40.0,19.0,35.0,46523.0,Sep,12,201709,2017,Sep03,45,0.0
201,20170904,2017,9,4,concord,USW00023254,Briones,0.0,0.0,69.08,87.98,6.0,41.0,19.0,34.0,46381.0,Sep,12,201709,2017,Sep04,45,0.0
265,20170905,2017,9,5,concord,USW00023254,Briones,0.0,0.0,69.98,89.96,6.0,42.0,19.0,32.0,46240.0,Sep,12,201709,2017,Sep05,45,0.0
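
A cumulative column such as `prec_cum_WY` can also be computed in one pass by grouping on both park and water year, which avoids splitting the frame per park. The sketch below is an alternative under the assumption that rows are sorted by date within each park; `prec_toy` and its values are hypothetical:

```python
import pandas as pd

# Hypothetical daily precipitation for two parks across a water-year boundary
prec_toy = pd.DataFrame({
    "park": ["Briones"] * 3 + ["Tilden"] * 3,
    "WY": ["2017", "2018", "2018"] * 2,
    "plain_dates": [20170930, 20171001, 20171002] * 2,
    "prec_daily": [0.1, 0.2, 0.3, 0.0, 0.5, 0.5],
})

# Sort so the running total follows calendar order within each park
prec_toy = prec_toy.sort_values(["park", "plain_dates"])

# The running total restarts for every (park, water year) combination
prec_toy["prec_cum_WY"] = prec_toy.groupby(["park", "WY"])["prec_daily"].cumsum()
print(prec_toy)
```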


**monthly and weekly aggregated values for plotting climate and climate EDA**

In [158]:
#Add columns of aggregated values, monthly
climate_daylength['MonMaxTemp'] = (climate_daylength.groupby([climate_daylength['yr_mon'],'park'])['maxTemp'].transform('max'))
climate_daylength['MonMinTemp'] = (climate_daylength.groupby([climate_daylength['yr_mon'],'park'])['minTemp'].transform('min'))
climate_daylength['MonAvgMaxTemp'] = (climate_daylength.groupby([climate_daylength['yr_mon'],'park'])['maxTemp'].transform('mean'))
climate_daylength['MonAvgMinTemp'] = (climate_daylength.groupby([climate_daylength['yr_mon'],'park'])['minTemp'].transform('mean'))
climate_daylength['MonSumPrec'] = (climate_daylength.groupby([climate_daylength['yr_mon'],'park'])['prec_daily'].transform('sum'))
climate_daylength['MonCumPrec'] = (climate_daylength.groupby([climate_daylength['yr_mon'],'park'])['prec_cum_WY'].transform('max'))
climate_daylength['MonMaxDayLen'] = (climate_daylength.groupby([climate_daylength['yr_mon'],'park'])['day_length'].transform('max'))
climate_daylength['MonMinDayLen'] = (climate_daylength.groupby([climate_daylength['yr_mon'],'park'])['day_length'].transform('min'))
climate_daylength['MonAvgDayLen'] = (climate_daylength.groupby([climate_daylength['yr_mon'],'park'])['day_length'].transform('mean'))

#weekly aggregated values
climate_daylength['WkMaxTemp'] = (climate_daylength.groupby(['WY_weeknum','WY','park'])['maxTemp'].transform('max'))
climate_daylength['WkMinTemp'] = (climate_daylength.groupby(['WY_weeknum','WY','park'])['minTemp'].transform('min'))
climate_daylength['WkAvgMaxTemp'] = (climate_daylength.groupby(['WY_weeknum','WY','park'])['maxTemp'].transform('mean'))
climate_daylength['WkAvgMinTemp'] = (climate_daylength.groupby(['WY_weeknum','WY','park'])['minTemp'].transform('mean'))
climate_daylength['WkSumPrec'] = (climate_daylength.groupby(['WY_weeknum','WY','park'])['prec_daily'].transform('sum'))
climate_daylength['WkCumPrec'] = (climate_daylength.groupby(['WY_weeknum','WY','park'])['prec_cum_WY'].transform('max'))
climate_daylength['WkMaxDayLen'] = (climate_daylength.groupby(['WY_weeknum','WY','park'])['day_length'].transform('max'))
climate_daylength['WkMinDayLen'] = (climate_daylength.groupby(['WY_weeknum','WY','park'])['day_length'].transform('min'))
climate_daylength['WkAvgDayLen'] = (climate_daylength.groupby(['WY_weeknum','WY','park'])['day_length'].transform('mean'))
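
`groupby(...).transform(...)` computes one statistic per group and repeats it on every row of that group, so the frame keeps its daily rows while gaining monthly or weekly columns. A small sketch on hypothetical data (`temps_toy` is not part of the notebook):

```python
import pandas as pd

# Hypothetical daily maxima for two parks and two months
temps_toy = pd.DataFrame({
    "park": ["Briones", "Briones", "Briones", "Tilden"],
    "yr_mon": ["201709", "201709", "201710", "201709"],
    "maxTemp": [111.02, 109.94, 80.06, 91.04],
})

grouped = temps_toy.groupby(["yr_mon", "park"])["maxTemp"]

# Each row receives the statistic of its own (month, park) group
temps_toy["MonMaxTemp"] = grouped.transform("max")
temps_toy["MonAvgMaxTemp"] = grouped.transform("mean")
print(temps_toy)
```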

**prior 14 and 30 day aggregated values for modeling**

In [159]:
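#prior-n-day features are not tied to calendar years or water years
#they are rolling windows over consecutive rows of the daily values (prec_daily, maxTemp, minTemp, day_length)
#each window covers the current row plus the preceding n-1 rows; min_periods=1 lets the earliest rows use shorter windows
#prior two weeks values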
climate_daylength['sum_prec_prior14'] = climate_daylength['prec_daily'].rolling(min_periods=1, window=14).sum()
climate_daylength['MaxTemp_prior14'] = climate_daylength['maxTemp'].rolling(min_periods=1, window=14).max()
climate_daylength['MinTemp_prior14'] = climate_daylength['minTemp'].rolling(min_periods=1, window=14).min()
climate_daylength['AvgMaxTemp_prior14'] = climate_daylength['maxTemp'].rolling(min_periods=1, window=14).mean()
climate_daylength['AvgMinTemp_prior14'] = climate_daylength['minTemp'].rolling(min_periods=1, window=14).mean()
climate_daylength['MaxDayLen_prior14'] = climate_daylength['day_length'].rolling(min_periods=1, window=14).max()
climate_daylength['MinDayLen_prior14'] = climate_daylength['day_length'].rolling(min_periods=1, window=14).min()
climate_daylength['AvgDayLen_prior14'] = climate_daylength['day_length'].rolling(min_periods=1, window=14).mean()

#prior 30 days values
climate_daylength['sum_prec_prior30'] = climate_daylength['prec_daily'].rolling(min_periods=1, window=30).sum()
climate_daylength['MaxTemp_prior30'] = climate_daylength['maxTemp'].rolling(min_periods=1, window=30).max()
climate_daylength['MinTemp_prior30'] = climate_daylength['minTemp'].rolling(min_periods=1, window=30).min()
climate_daylength['AvgMaxTemp_prior30'] = climate_daylength['maxTemp'].rolling(min_periods=1, window=30).mean()
climate_daylength['AvgMinTemp_prior30'] = climate_daylength['minTemp'].rolling(min_periods=1, window=30).mean()
climate_daylength['MaxDayLen_prior30'] = climate_daylength['day_length'].rolling(min_periods=1, window=30).max()
climate_daylength['MinDayLen_prior30'] = climate_daylength['day_length'].rolling(min_periods=1, window=30).min()
climate_daylength['AvgDayLen_prior30'] = climate_daylength['day_length'].rolling(min_periods=1, window=30).mean()

pd.set_option('display.max_columns', None)
climate_daylength.head()

Unnamed: 0,plain_dates,Year,Month,Day,city,station_id,park,prec_cum,prec_daily,minTemp,maxTemp,hour_rise,minute_rise,hour_set,minute_set,day_length,Month_name,wy_month,yr_mon,WY,wy_mon_day,WY_weeknum,prec_cum_WY,MonMaxTemp,MonMinTemp,MonAvgMaxTemp,MonAvgMinTemp,MonSumPrec,MonCumPrec,MonMaxDayLen,MonMinDayLen,MonAvgDayLen,WkMaxTemp,WkMinTemp,WkAvgMaxTemp,WkAvgMinTemp,WkSumPrec,WkCumPrec,WkMaxDayLen,WkMinDayLen,WkAvgDayLen,sum_prec_prior14,MaxTemp_prior14,MinTemp_prior14,AvgMaxTemp_prior14,AvgMinTemp_prior14,MaxDayLen_prior14,MinDayLen_prior14,AvgDayLen_prior14,sum_prec_prior30,MaxTemp_prior30,MinTemp_prior30,AvgMaxTemp_prior30,AvgMinTemp_prior30,MaxDayLen_prior30,MinDayLen_prior30,AvgDayLen_prior30
9,20170901,2017,9,1,concord,USW00023254,Briones,0.0,0.0,69.08,111.02,6.0,38.0,19.0,38.0,46804.0,Sep,12,201709,2017,Sep01,45,0.0,111.02,50.0,87.284,61.208,0.031496,0.031496,46804.0,42623.0,44726.066667,111.02,62.06,93.8525,68.54,0.0,0.0,46804.0,45813.0,46309.75,0.0,111.02,69.08,111.02,69.08,46804.0,46804.0,46804.0,0.0,111.02,69.08,111.02,69.08,46804.0,46804.0,46804.0
73,20170902,2017,9,2,concord,USW00023254,Briones,0.0,0.0,73.04,109.94,6.0,39.0,19.0,37.0,46664.0,Sep,12,201709,2017,Sep02,45,0.0,111.02,50.0,87.284,61.208,0.031496,0.031496,46804.0,42623.0,44726.066667,111.02,62.06,93.8525,68.54,0.0,0.0,46804.0,45813.0,46309.75,0.0,111.02,69.08,110.48,71.06,46804.0,46664.0,46734.0,0.0,111.02,69.08,110.48,71.06,46804.0,46664.0,46734.0
137,20170903,2017,9,3,concord,USW00023254,Briones,0.0,0.0,75.02,105.98,6.0,40.0,19.0,35.0,46523.0,Sep,12,201709,2017,Sep03,45,0.0,111.02,50.0,87.284,61.208,0.031496,0.031496,46804.0,42623.0,44726.066667,111.02,62.06,93.8525,68.54,0.0,0.0,46804.0,45813.0,46309.75,0.0,111.02,69.08,108.98,72.38,46804.0,46523.0,46663.666667,0.0,111.02,69.08,108.98,72.38,46804.0,46523.0,46663.666667
201,20170904,2017,9,4,concord,USW00023254,Briones,0.0,0.0,69.08,87.98,6.0,41.0,19.0,34.0,46381.0,Sep,12,201709,2017,Sep04,45,0.0,111.02,50.0,87.284,61.208,0.031496,0.031496,46804.0,42623.0,44726.066667,111.02,62.06,93.8525,68.54,0.0,0.0,46804.0,45813.0,46309.75,0.0,111.02,69.08,103.73,71.555,46804.0,46381.0,46593.0,0.0,111.02,69.08,103.73,71.555,46804.0,46381.0,46593.0
265,20170905,2017,9,5,concord,USW00023254,Briones,0.0,0.0,69.98,89.96,6.0,42.0,19.0,32.0,46240.0,Sep,12,201709,2017,Sep05,45,0.0,111.02,50.0,87.284,61.208,0.031496,0.031496,46804.0,42623.0,44726.066667,111.02,62.06,93.8525,68.54,0.0,0.0,46804.0,45813.0,46309.75,0.0,111.02,69.08,100.976,71.24,46804.0,46240.0,46522.4,0.0,111.02,69.08,100.976,71.24,46804.0,46240.0,46522.4
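
Because `climate_daylength` stacks the parks one after another, a rolling window applied to a whole column also spans the boundary between the last rows of one park and the first rows of the next. A per-park variant can be sketched as follows; the toy frame, the two-row window, and the column names `stacked` and `per_park` are assumptions for illustration:

```python
import pandas as pd

# Hypothetical stacked frame: park A's rows followed by park B's rows
roll_toy = pd.DataFrame({
    "park": ["A", "A", "A", "B", "B", "B"],
    "prec_daily": [1.0, 1.0, 1.0, 0.0, 0.0, 2.0],
})

# Rolling over the stacked column: park B's first row still sees park A's rain
roll_toy["stacked"] = roll_toy["prec_daily"].rolling(window=2, min_periods=1).sum()

# Rolling within each park: every window restarts at the park's first row
roll_toy["per_park"] = roll_toy.groupby("park")["prec_daily"].transform(
    lambda s: s.rolling(window=2, min_periods=1).sum()
)
print(roll_toy)
```

Only the first `window - 1` rows of each park after the first differ between the two columns.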


**remove columns that reference regular calendar years or are not useful dates**
<br>'prec_cum', 'Month_name', 'yr_mon', 'wy_mon_day'

In [160]:
#reduce and reorder columns
climate_daylength = climate_daylength[[         
        'city', 'station_id', 'park',
        'plain_dates', 'Year', 'Month', 'Day',
        'WY', 'wy_month', 'WY_weeknum',
    
        'prec_daily', 'prec_cum_WY', 'MonSumPrec', 'MonCumPrec', 
        'WkSumPrec', 'WkCumPrec',
    
        'minTemp', 'maxTemp', 
        'MonMaxTemp', 'MonMinTemp', 'MonAvgMaxTemp', 'MonAvgMinTemp',
        'WkMaxTemp', 'WkMinTemp', 'WkAvgMaxTemp', 'WkAvgMinTemp', 
        
        'hour_rise', 'minute_rise', 'hour_set', 'minute_set', 'day_length', 
        'MonMaxDayLen', 'MonMinDayLen', 'MonAvgDayLen',
        'WkMaxDayLen', 'WkMinDayLen','WkAvgDayLen', 
        
        'sum_prec_prior14', 
        'MaxTemp_prior14', 'MinTemp_prior14','AvgMaxTemp_prior14', 'AvgMinTemp_prior14', 
        'MaxDayLen_prior14', 'MinDayLen_prior14', 'AvgDayLen_prior14',
        'sum_prec_prior30', 
        'MaxTemp_prior30', 'MinTemp_prior30','AvgMaxTemp_prior30', 'AvgMinTemp_prior30', 
        'MaxDayLen_prior30', 'MinDayLen_prior30', 'AvgDayLen_prior30']]

pd.set_option('display.max_columns', None)
climate_daylength.head()

Unnamed: 0,city,station_id,park,plain_dates,Year,Month,Day,WY,wy_month,WY_weeknum,prec_daily,prec_cum_WY,MonSumPrec,MonCumPrec,WkSumPrec,WkCumPrec,minTemp,maxTemp,MonMaxTemp,MonMinTemp,MonAvgMaxTemp,MonAvgMinTemp,WkMaxTemp,WkMinTemp,WkAvgMaxTemp,WkAvgMinTemp,hour_rise,minute_rise,hour_set,minute_set,day_length,MonMaxDayLen,MonMinDayLen,MonAvgDayLen,WkMaxDayLen,WkMinDayLen,WkAvgDayLen,sum_prec_prior14,MaxTemp_prior14,MinTemp_prior14,AvgMaxTemp_prior14,AvgMinTemp_prior14,MaxDayLen_prior14,MinDayLen_prior14,AvgDayLen_prior14,sum_prec_prior30,MaxTemp_prior30,MinTemp_prior30,AvgMaxTemp_prior30,AvgMinTemp_prior30,MaxDayLen_prior30,MinDayLen_prior30,AvgDayLen_prior30
9,concord,USW00023254,Briones,20170901,2017,9,1,2017,12,45,0.0,0.0,0.031496,0.031496,0.0,0.0,69.08,111.02,111.02,50.0,87.284,61.208,111.02,62.06,93.8525,68.54,6.0,38.0,19.0,38.0,46804.0,46804.0,42623.0,44726.066667,46804.0,45813.0,46309.75,0.0,111.02,69.08,111.02,69.08,46804.0,46804.0,46804.0,0.0,111.02,69.08,111.02,69.08,46804.0,46804.0,46804.0
73,concord,USW00023254,Briones,20170902,2017,9,2,2017,12,45,0.0,0.0,0.031496,0.031496,0.0,0.0,73.04,109.94,111.02,50.0,87.284,61.208,111.02,62.06,93.8525,68.54,6.0,39.0,19.0,37.0,46664.0,46804.0,42623.0,44726.066667,46804.0,45813.0,46309.75,0.0,111.02,69.08,110.48,71.06,46804.0,46664.0,46734.0,0.0,111.02,69.08,110.48,71.06,46804.0,46664.0,46734.0
137,concord,USW00023254,Briones,20170903,2017,9,3,2017,12,45,0.0,0.0,0.031496,0.031496,0.0,0.0,75.02,105.98,111.02,50.0,87.284,61.208,111.02,62.06,93.8525,68.54,6.0,40.0,19.0,35.0,46523.0,46804.0,42623.0,44726.066667,46804.0,45813.0,46309.75,0.0,111.02,69.08,108.98,72.38,46804.0,46523.0,46663.666667,0.0,111.02,69.08,108.98,72.38,46804.0,46523.0,46663.666667
201,concord,USW00023254,Briones,20170904,2017,9,4,2017,12,45,0.0,0.0,0.031496,0.031496,0.0,0.0,69.08,87.98,111.02,50.0,87.284,61.208,111.02,62.06,93.8525,68.54,6.0,41.0,19.0,34.0,46381.0,46804.0,42623.0,44726.066667,46804.0,45813.0,46309.75,0.0,111.02,69.08,103.73,71.555,46804.0,46381.0,46593.0,0.0,111.02,69.08,103.73,71.555,46804.0,46381.0,46593.0
265,concord,USW00023254,Briones,20170905,2017,9,5,2017,12,45,0.0,0.0,0.031496,0.031496,0.0,0.0,69.98,89.96,111.02,50.0,87.284,61.208,111.02,62.06,93.8525,68.54,6.0,42.0,19.0,32.0,46240.0,46804.0,42623.0,44726.066667,46804.0,45813.0,46309.75,0.0,111.02,69.08,100.976,71.24,46804.0,46240.0,46522.4,0.0,111.02,69.08,100.976,71.24,46804.0,46240.0,46522.4


In [161]:
#spot-check a single date across all parks: the aggregated values should differ by park and month
d1 = climate_daylength[climate_daylength['plain_dates'] == 20180415]
with pd.option_context("display.max_columns", None):
    display(d1)

Unnamed: 0,city,station_id,park,plain_dates,Year,Month,Day,WY,wy_month,WY_weeknum,prec_daily,prec_cum_WY,MonSumPrec,MonCumPrec,WkSumPrec,WkCumPrec,minTemp,maxTemp,MonMaxTemp,MonMinTemp,MonAvgMaxTemp,MonAvgMinTemp,WkMaxTemp,WkMinTemp,WkAvgMaxTemp,WkAvgMinTemp,hour_rise,minute_rise,hour_set,minute_set,day_length,MonMaxDayLen,MonMinDayLen,MonAvgDayLen,WkMaxDayLen,WkMinDayLen,WkAvgDayLen,sum_prec_prior14,MaxTemp_prior14,MinTemp_prior14,AvgMaxTemp_prior14,AvgMinTemp_prior14,MaxDayLen_prior14,MinDayLen_prior14,AvgDayLen_prior14,sum_prec_prior30,MaxTemp_prior30,MinTemp_prior30,AvgMaxTemp_prior30,AvgMinTemp_prior30,MaxDayLen_prior30,MinDayLen_prior30,AvgDayLen_prior30
14457,concord,USW00023254,Briones,20180415,2018,4,15,2018,7,26,0.03937,13.035433,2.267717,13.216535,0.228346,13.035433,50.0,66.02,87.08,39.92,71.894,49.496,78.98,42.98,70.442857,47.685714,6.0,33.0,19.0,43.0,47434.0,49440.0,45430.0,47479.766667,47434.0,46586.0,47011.714286,2.086614,78.98,42.98,70.507143,49.987143,47434.0,45576.0,46510.928571,3.917323,86.0,37.94,69.746,47.504,47434.0,43221.0,45346.7
14504,berkeley,USC00040693,Tilden,20180415,2018,4,15,2018,7,26,0.07874,20.295276,3.358268,20.413386,0.208661,20.295276,48.92,73.04,78.08,39.92,66.266,47.48,78.08,42.98,67.871429,47.557143,6.0,33.0,19.0,44.0,47429.0,49431.0,45427.0,47474.0,47429.0,46582.0,47006.714286,3.240157,78.08,42.08,66.791429,48.277143,47429.0,45573.0,46506.714286,5.503937,82.04,39.02,65.672,47.156,47429.0,43221.0,45344.1
14454,oakland,USW00023230,AChabot,20180415,2018,4,15,2018,7,26,0.03937,14.992126,3.185039,14.992126,0.181102,14.992126,48.92,62.96,77.0,39.02,64.886,48.266,73.04,42.08,65.994286,47.994286,6.0,33.0,19.0,43.0,47410.0,49403.0,45419.0,47455.4,47410.0,46567.0,46990.142857,3.185039,73.04,42.08,64.644286,48.765714,47410.0,45563.0,46492.571429,4.496063,77.0,37.94,64.256,47.09,47410.0,43222.0,45335.5
14466,hayward,USW00093228,Garin,20180415,2018,4,15,2018,7,26,0.0,10.137795,1.937008,10.177165,0.098425,10.137795,50.0,64.94,77.0,41.0,66.086,49.694,75.02,42.98,66.842857,48.405714,6.0,32.0,19.0,42.0,47392.0,49375.0,45410.0,47436.833333,47392.0,46553.0,46974.0,1.897638,75.02,42.98,65.685714,50.0,47392.0,45554.0,46478.714286,3.03937,82.04,39.92,65.756,48.152,47392.0,43224.0,45327.0
226,SanJoseMtHamilton,USW00023293.USC00045933,JDGrant,20180415,2018,4,15,2018,7,26,0.005906,13.998031,2.494094,14.55315,0.224409,13.998031,41.45,64.04,77.0,36.5,63.176,45.5765,75.47,37.94,63.937143,42.697143,6.0,31.0,19.0,41.0,47351.0,49313.0,45390.0,47395.5,47351.0,46521.0,46937.571429,1.938976,75.47,37.94,63.159286,44.882857,47351.0,45532.0,46447.642857,3.334646,75.47,32.54,60.974,43.76,47351.0,43227.0,45308.033333
226,LivermoreHayward,USW00023285.USW00093228,PRidge,20180415,2018,4,15,2018,7,26,0.049213,10.637795,1.773622,10.761811,0.169291,10.637795,48.47,64.94,80.96,38.48,67.826,47.612,75.02,40.46,67.64,45.988571,6.0,32.0,19.0,42.0,47389.0,49371.0,45408.0,47433.766667,47389.0,46551.0,46971.285714,1.649606,75.02,40.46,67.055,47.852857,47389.0,45552.0,46476.357143,3.015748,82.04,35.96,66.542,45.665,47389.0,43224.0,45325.5
226,LivermoreSanJose,USW00023285.USW00023293,Sunol,20180415,2018,4,15,2018,7,26,0.055118,9.98622,1.419291,10.200787,0.165354,9.98622,47.93,68.99,84.47,39.47,70.031,47.594,77.99,41.45,69.941429,45.628571,6.0,32.0,19.0,41.0,47374.0,49348.0,45401.0,47418.833333,47374.0,46539.0,46958.142857,1.204724,77.99,41.45,69.324286,47.698571,47374.0,45544.0,46465.142857,2.295276,82.04,35.51,68.147,45.485,47374.0,43225.0,45318.7
14484,mtdiablo,USC00045915,MtDiablo,20180415,2018,4,15,2018,7,26,0.0,19.952756,3.177165,20.401575,0.228346,19.952756,48.02,71.06,75.92,33.98,63.266,44.888,73.04,33.98,64.04,42.98,6.0,32.0,19.0,42.0,47426.0,49427.0,45426.0,47471.6,47426.0,46580.0,47004.428571,2.728346,73.04,33.98,63.808571,43.995714,47426.0,45572.0,46504.928571,5.598425,77.0,32.0,60.044,42.968,47426.0,43221.0,45342.933333


In [162]:
print('max date:',climate_daylength['plain_dates'].max())
print('min date:',climate_daylength['plain_dates'].min())

parks = list(climate_daylength['park'].unique())

#see number of days with climate data by park
for p in parks:
    df_temp = climate_daylength[climate_daylength['park']== p]
    print(str(p), ': ', str(len(df_temp)))

max date: 20220930
min date: 20170901
Briones :  1854
Tilden :  1856
AChabot :  1855
Garin :  1849
JDGrant :  1856
PRidge :  1856
Sunol :  1856
MtDiablo :  1856
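
The counts above differ by park: the full range from 20170901 through 20220930 spans 1856 days, so Briones, AChabot, and Garin are each missing a few. A minimal sketch of how the missing dates could be listed follows. The `toy` frame and the `missing_days` helper are hypothetical stand-ins and not part of the notebook; the sketch assumes `plain_dates` holds YYYYMMDD integers as in `climate_daylength`.

```python
import pandas as pd

# Hypothetical toy frame standing in for climate_daylength (Garin has one gap)
toy = pd.DataFrame({
    "park": ["Briones", "Briones", "Briones", "Garin", "Garin"],
    "plain_dates": [20170901, 20170902, 20170903, 20170901, 20170903],
})

def missing_days(df, start, end):
    """Return {park: [missing dates as YYYYMMDD ints]} for the inclusive range."""
    expected = pd.date_range(start, end, freq="D")
    gaps = {}
    for park, grp in df.groupby("park"):
        # parse the integer dates so they can be compared with the expected range
        seen = pd.to_datetime(grp["plain_dates"].astype(str), format="%Y%m%d")
        absent = expected.difference(pd.DatetimeIndex(seen))
        gaps[park] = [int(d.strftime("%Y%m%d")) for d in absent]
    return gaps

gaps = missing_days(toy, "2017-09-01", "2017-09-03")
print(gaps)
```

On the real data, the call would presumably be `missing_days(climate_daylength, "2017-09-01", "2022-09-30")`.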


### export climate_daylength spreadsheet
### climate_daylength is used for climatic EDA

In [163]:
#export file as csv
#timestamp
today = date.today()
now = datetime.now()
current_time = now.strftime("%H:%M:%S")
print("most recent export:",today, ",", current_time)

climate_daylength.to_csv('/Users/sandidge/Desktop/Python_Projects/Springboard_coursework/Capstone2_Wildflowers/Public_Final/climate_daylength_2017_2022.csv')


most recent export: 2022-11-14 , 08:02:33


<a id='merge_all'></a>
# 7. Merge climate_daylength with iNaturalist observations

[Link to top](#guide)

### 7a. import climate_daylength data and iNaturalist observations

**import climate_daylength data**

In [164]:
# import from GitHub:Floydworks
#url = ('https://raw.githubusercontent.com/Floydworks/WildflowerFinder_Phenology_Tool/main/cleaned_data_files/climate_daylength_2017_2022.csv')
#download = requests.get(url).content

# Read the downloaded content and turn it into a pandas dataframe
#climate_daylength = pd.read_csv(io.StringIO(download.decode('utf-8')))

In [165]:
#call dataframe as produced above
climate_daylength.head(3)

Unnamed: 0,city,station_id,park,plain_dates,Year,Month,Day,WY,wy_month,WY_weeknum,prec_daily,prec_cum_WY,MonSumPrec,MonCumPrec,WkSumPrec,WkCumPrec,minTemp,maxTemp,MonMaxTemp,MonMinTemp,MonAvgMaxTemp,MonAvgMinTemp,WkMaxTemp,WkMinTemp,WkAvgMaxTemp,WkAvgMinTemp,hour_rise,minute_rise,hour_set,minute_set,day_length,MonMaxDayLen,MonMinDayLen,MonAvgDayLen,WkMaxDayLen,WkMinDayLen,WkAvgDayLen,sum_prec_prior14,MaxTemp_prior14,MinTemp_prior14,AvgMaxTemp_prior14,AvgMinTemp_prior14,MaxDayLen_prior14,MinDayLen_prior14,AvgDayLen_prior14,sum_prec_prior30,MaxTemp_prior30,MinTemp_prior30,AvgMaxTemp_prior30,AvgMinTemp_prior30,MaxDayLen_prior30,MinDayLen_prior30,AvgDayLen_prior30
9,concord,USW00023254,Briones,20170901,2017,9,1,2017,12,45,0.0,0.0,0.031496,0.031496,0.0,0.0,69.08,111.02,111.02,50.0,87.284,61.208,111.02,62.06,93.8525,68.54,6.0,38.0,19.0,38.0,46804.0,46804.0,42623.0,44726.066667,46804.0,45813.0,46309.75,0.0,111.02,69.08,111.02,69.08,46804.0,46804.0,46804.0,0.0,111.02,69.08,111.02,69.08,46804.0,46804.0,46804.0
73,concord,USW00023254,Briones,20170902,2017,9,2,2017,12,45,0.0,0.0,0.031496,0.031496,0.0,0.0,73.04,109.94,111.02,50.0,87.284,61.208,111.02,62.06,93.8525,68.54,6.0,39.0,19.0,37.0,46664.0,46804.0,42623.0,44726.066667,46804.0,45813.0,46309.75,0.0,111.02,69.08,110.48,71.06,46804.0,46664.0,46734.0,0.0,111.02,69.08,110.48,71.06,46804.0,46664.0,46734.0
137,concord,USW00023254,Briones,20170903,2017,9,3,2017,12,45,0.0,0.0,0.031496,0.031496,0.0,0.0,75.02,105.98,111.02,50.0,87.284,61.208,111.02,62.06,93.8525,68.54,6.0,40.0,19.0,35.0,46523.0,46804.0,42623.0,44726.066667,46804.0,45813.0,46309.75,0.0,111.02,69.08,108.98,72.38,46804.0,46523.0,46663.666667,0.0,111.02,69.08,108.98,72.38,46804.0,46523.0,46663.666667


In [166]:
pd.set_option('display.max_columns', None)

#look at climate/daylength data
print(climate_daylength.shape)
print(climate_daylength['park'].unique())

(14838, 53)
['Briones' 'Tilden' 'AChabot' 'Garin' 'JDGrant' 'PRidge' 'Sunol'
 'MtDiablo']


In [167]:
#make sure parks are named correctly in climate_daylength
#climate_daylength['park'] = climate_daylength['park'].map({'Sunol':'Sunol', 'Briones':'Briones', 'Tilden':'Tilden', 'Garin':'Garin',
#                                     'MtDiablo':'MtDiablo', 'AChabot':'AChabot', 'Garin':'Garin',
#                                     'JDGrant':'JDGrant', 'PRidge':'PRidge'                                  
#                                     })
#print(climate_daylength['park'].unique())


**import cleaned iNaturalist observation data**

In [168]:
# import from GitHub:Floydworks
#url = ('https://raw.githubusercontent.com/Floydworks/WildflowerFinder_Phenology_Tool/main/cleaned_data_files/df_wildflowers_2017_2022.csv')
#download = requests.get(url).content

# Read the downloaded content and turn it into a pandas dataframe
#flowers_df = pd.read_csv(io.StringIO(download.decode('utf-8')))

In [169]:
#copy the data produced in section 3 under a new name so the original is not modified
flowers_df = flowers_data.copy()

In [170]:
#look at inaturalist observation data
print(flowers_df.shape)
print(flowers_df['park'].unique())

(34666, 15)
['Sunol' 'Briones' 'Tilden' 'AnthonyChabot' 'Garin' 'JDGrant'
 'PleasantonRidge' 'MtDiablo']


In [171]:
#Map for park names
flowers_df['park'] = flowers_df['park'].map({'Sunol':'Sunol', 'Briones':'Briones', 'Tilden':'Tilden', 'Garin':'Garin',
                                     'MtDiablo':'MtDiablo','AnthonyChabot':'AChabot', 
                                     'JDGrant':'JDGrant', 'PleasantonRidge':'PRidge'                                  
                                     })
print(flowers_df['park'].unique())

#mt diablo
#df_mtdiablo = flowers_df[flowers_df['park']=='MtDiablo']


flowers_df.head(3) #look at the wildflower observation data

['Sunol' 'Briones' 'Tilden' 'AChabot' 'Garin' 'JDGrant' 'PRidge'
 'MtDiablo']


Unnamed: 0,id,DateTime,plain_dates,year,month,day,genus_species,genus,species,park,region,latitude,longitude,url,image_url
0,104188607,2022-01-01 00:00:00+00:00,20220101,2022,1,1,Baccharis pilularis,Baccharis,pilularis,Sunol,east bay,37.530981,-121.819691,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...
1,104188609,2022-01-01 00:00:00+00:00,20220101,2022,1,1,Capsella bursa-pastoris,Capsella,bursa-pastoris,Sunol,east bay,37.52706,-121.827025,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...
3,104681782,2022-01-09 00:00:00+00:00,20220109,2022,1,9,Cardamine californica,Cardamine,californica,Sunol,east bay,37.520038,-121.822708,https://www.inaturalist.org/observations/10468...,https://inaturalist-open-data.s3.amazonaws.com...
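
One caveat with the park-name mapping above: `Series.map` with a dictionary turns any value that is not a key into `NaN`, which is why every park has to be listed, including those whose names do not change. The sketch below contrasts this with `Series.replace`, which leaves unlisted values as they are. The `parks` series and `short_names` dictionary are illustrative only.

```python
import pandas as pd

parks = pd.Series(["Sunol", "AnthonyChabot", "PleasantonRidge", "Tilden"])

# only the names that actually change
short_names = {"AnthonyChabot": "AChabot", "PleasantonRidge": "PRidge"}

mapped = parks.map(short_names)        # values missing from the dict become NaN
replaced = parks.replace(short_names)  # values missing from the dict are kept

print(mapped.isna().sum())
print(replaced.tolist())
```

Checking `flowers_df['park'].isna().sum()` after a `map` call is a cheap way to catch a park name that was left out of the dictionary.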


In [172]:
#keep only columns we want
flowers_df = flowers_df[['id', 'plain_dates', 'year', 'month', 'day', 'park', 'region', 'latitude',
       'longitude', 'genus_species', 'genus', 'species', 'url', 'image_url']]

**check both dataframes**

In [173]:
flowers_df.head()

Unnamed: 0,id,plain_dates,year,month,day,park,region,latitude,longitude,genus_species,genus,species,url,image_url
0,104188607,20220101,2022,1,1,Sunol,east bay,37.530981,-121.819691,Baccharis pilularis,Baccharis,pilularis,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...
1,104188609,20220101,2022,1,1,Sunol,east bay,37.52706,-121.827025,Capsella bursa-pastoris,Capsella,bursa-pastoris,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...
3,104681782,20220109,2022,1,9,Sunol,east bay,37.520038,-121.822708,Cardamine californica,Cardamine,californica,https://www.inaturalist.org/observations/10468...,https://inaturalist-open-data.s3.amazonaws.com...
4,104690215,20220108,2022,1,8,Sunol,east bay,37.509616,-121.824145,Calandrinia menziesii,Calandrinia,menziesii,https://www.inaturalist.org/observations/10469...,https://inaturalist-open-data.s3.amazonaws.com...
5,104737731,20220110,2022,1,10,Sunol,east bay,37.531082,-121.819465,Baccharis pilularis,Baccharis,pilularis,https://www.inaturalist.org/observations/10473...,https://inaturalist-open-data.s3.amazonaws.com...


In [174]:
climate_daylength.head()

Unnamed: 0,city,station_id,park,plain_dates,Year,Month,Day,WY,wy_month,WY_weeknum,prec_daily,prec_cum_WY,MonSumPrec,MonCumPrec,WkSumPrec,WkCumPrec,minTemp,maxTemp,MonMaxTemp,MonMinTemp,MonAvgMaxTemp,MonAvgMinTemp,WkMaxTemp,WkMinTemp,WkAvgMaxTemp,WkAvgMinTemp,hour_rise,minute_rise,hour_set,minute_set,day_length,MonMaxDayLen,MonMinDayLen,MonAvgDayLen,WkMaxDayLen,WkMinDayLen,WkAvgDayLen,sum_prec_prior14,MaxTemp_prior14,MinTemp_prior14,AvgMaxTemp_prior14,AvgMinTemp_prior14,MaxDayLen_prior14,MinDayLen_prior14,AvgDayLen_prior14,sum_prec_prior30,MaxTemp_prior30,MinTemp_prior30,AvgMaxTemp_prior30,AvgMinTemp_prior30,MaxDayLen_prior30,MinDayLen_prior30,AvgDayLen_prior30
9,concord,USW00023254,Briones,20170901,2017,9,1,2017,12,45,0.0,0.0,0.031496,0.031496,0.0,0.0,69.08,111.02,111.02,50.0,87.284,61.208,111.02,62.06,93.8525,68.54,6.0,38.0,19.0,38.0,46804.0,46804.0,42623.0,44726.066667,46804.0,45813.0,46309.75,0.0,111.02,69.08,111.02,69.08,46804.0,46804.0,46804.0,0.0,111.02,69.08,111.02,69.08,46804.0,46804.0,46804.0
73,concord,USW00023254,Briones,20170902,2017,9,2,2017,12,45,0.0,0.0,0.031496,0.031496,0.0,0.0,73.04,109.94,111.02,50.0,87.284,61.208,111.02,62.06,93.8525,68.54,6.0,39.0,19.0,37.0,46664.0,46804.0,42623.0,44726.066667,46804.0,45813.0,46309.75,0.0,111.02,69.08,110.48,71.06,46804.0,46664.0,46734.0,0.0,111.02,69.08,110.48,71.06,46804.0,46664.0,46734.0
137,concord,USW00023254,Briones,20170903,2017,9,3,2017,12,45,0.0,0.0,0.031496,0.031496,0.0,0.0,75.02,105.98,111.02,50.0,87.284,61.208,111.02,62.06,93.8525,68.54,6.0,40.0,19.0,35.0,46523.0,46804.0,42623.0,44726.066667,46804.0,45813.0,46309.75,0.0,111.02,69.08,108.98,72.38,46804.0,46523.0,46663.666667,0.0,111.02,69.08,108.98,72.38,46804.0,46523.0,46663.666667
201,concord,USW00023254,Briones,20170904,2017,9,4,2017,12,45,0.0,0.0,0.031496,0.031496,0.0,0.0,69.08,87.98,111.02,50.0,87.284,61.208,111.02,62.06,93.8525,68.54,6.0,41.0,19.0,34.0,46381.0,46804.0,42623.0,44726.066667,46804.0,45813.0,46309.75,0.0,111.02,69.08,103.73,71.555,46804.0,46381.0,46593.0,0.0,111.02,69.08,103.73,71.555,46804.0,46381.0,46593.0
265,concord,USW00023254,Briones,20170905,2017,9,5,2017,12,45,0.0,0.0,0.031496,0.031496,0.0,0.0,69.98,89.96,111.02,50.0,87.284,61.208,111.02,62.06,93.8525,68.54,6.0,42.0,19.0,32.0,46240.0,46804.0,42623.0,44726.066667,46804.0,45813.0,46309.75,0.0,111.02,69.08,100.976,71.24,46804.0,46240.0,46522.4,0.0,111.02,69.08,100.976,71.24,46804.0,46240.0,46522.4


In [175]:
flowers_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 34666 entries, 0 to 37921
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             34666 non-null  int64  
 1   plain_dates    34666 non-null  object 
 2   year           34666 non-null  object 
 3   month          34666 non-null  object 
 4   day            34666 non-null  object 
 5   park           34666 non-null  object 
 6   region         34666 non-null  object 
 7   latitude       34666 non-null  float64
 8   longitude      34666 non-null  float64
 9   genus_species  34666 non-null  object 
 10  genus          34666 non-null  object 
 11  species        34666 non-null  object 
 12  url            34666 non-null  object 
 13  image_url      34666 non-null  object 
dtypes: float64(2), int64(1), object(11)
memory usage: 4.0+ MB


In [176]:
#convert plain_dates in flowers_df to int64 so it matches climate_daylength for merging
flowers_df['plain_dates']= pd.to_numeric(flowers_df['plain_dates'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  flowers_df['plain_dates']= pd.to_numeric(flowers_df['plain_dates'])
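
The warning above is most likely raised because `flowers_df` was created by selecting columns from another dataframe, so pandas cannot tell whether the assignment touches the original. The conversion still takes effect here, but one common way to avoid the warning is to take an explicit copy when subsetting. The following is a sketch on a made-up `raw` frame, not the notebook's data.

```python
import pandas as pd

# Hypothetical stand-in for the observation data before column selection
raw = pd.DataFrame({
    "id": [1, 2],
    "plain_dates": ["20220101", "20220109"],
    "extra": [0, 0],
})

# .copy() makes the subset independent of raw, so later assignments are unambiguous
subset = raw[["id", "plain_dates"]].copy()
subset["plain_dates"] = pd.to_numeric(subset["plain_dates"])

print(subset.dtypes)
```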


In [177]:
climate_daylength.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14838 entries, 9 to 118660
Data columns (total 53 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   city                14838 non-null  object 
 1   station_id          14838 non-null  object 
 2   park                14838 non-null  object 
 3   plain_dates         14838 non-null  int64  
 4   Year                14838 non-null  object 
 5   Month               14838 non-null  object 
 6   Day                 14838 non-null  object 
 7   WY                  14838 non-null  object 
 8   wy_month            14838 non-null  object 
 9   WY_weeknum          14838 non-null  object 
 10  prec_daily          14838 non-null  float64
 11  prec_cum_WY         14838 non-null  float64
 12  MonSumPrec          14838 non-null  float64
 13  MonCumPrec          14838 non-null  float64
 14  WkSumPrec           14838 non-null  float64
 15  WkCumPrec           14838 non-null  float64
 16  min

<a id='merge_flower_clim_day'></a>
### 7b. merge wildflower observations and climate_daylength data

In [178]:
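#merge on date only: each observation is paired with the climate rows of every park for that date
#(rows are filtered to the matching park in the cells below)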
data = pd.merge(flowers_df,climate_daylength,left_on=['plain_dates'], right_on=['plain_dates'])
print('Files merged!')    

Files merged!
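
Because the merge above uses only `plain_dates` as the key, each observation is repeated once per park and the result has to be filtered afterwards. An alternative, sketched below on toy data, is to merge on both the date and the park so that each observation picks up only its own park's climate row. This assumes the park column has the same name and the same labels in both frames, as is the case after the name mapping in 7a. The frames and values here are invented for illustration.

```python
import pandas as pd

flowers = pd.DataFrame({
    "id": [1, 2, 3],
    "plain_dates": [20220101, 20220101, 20220109],
    "park": ["Sunol", "Tilden", "Sunol"],
})
climate = pd.DataFrame({
    "plain_dates": [20220101, 20220101, 20220109, 20220109],
    "park": ["Sunol", "Tilden", "Sunol", "Tilden"],
    "maxTemp": [52.97, 51.98, 55.0, 54.0],
})

# date-only key: every observation is paired with every park's climate row
by_date = pd.merge(flowers, climate, on="plain_dates")

# date + park key: one climate row per observation;
# validate raises if a (date, park) pair is duplicated in climate
by_date_park = pd.merge(flowers, climate, on=["plain_dates", "park"],
                        how="left", validate="many_to_one")

print(len(by_date), len(by_date_park))
```

With `how="left"`, an observation with no climate record for its park and date is kept with `NaN` climate values rather than silently dropped, which makes gaps easier to spot.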


In [179]:
print(data.shape)
data.head()

(277032, 66)


Unnamed: 0,id,plain_dates,year,month,day,park_x,region,latitude,longitude,genus_species,genus,species,url,image_url,city,station_id,park_y,Year,Month,Day,WY,wy_month,WY_weeknum,prec_daily,prec_cum_WY,MonSumPrec,MonCumPrec,WkSumPrec,WkCumPrec,minTemp,maxTemp,MonMaxTemp,MonMinTemp,MonAvgMaxTemp,MonAvgMinTemp,WkMaxTemp,WkMinTemp,WkAvgMaxTemp,WkAvgMinTemp,hour_rise,minute_rise,hour_set,minute_set,day_length,MonMaxDayLen,MonMinDayLen,MonAvgDayLen,WkMaxDayLen,WkMinDayLen,WkAvgDayLen,sum_prec_prior14,MaxTemp_prior14,MinTemp_prior14,AvgMaxTemp_prior14,AvgMinTemp_prior14,MaxDayLen_prior14,MinDayLen_prior14,AvgDayLen_prior14,sum_prec_prior30,MaxTemp_prior30,MinTemp_prior30,AvgMaxTemp_prior30,AvgMinTemp_prior30,MaxDayLen_prior30,MinDayLen_prior30,AvgDayLen_prior30
0,104188607,20220101,2022,1,1,Sunol,east bay,37.530981,-121.819691,Baccharis pilularis,Baccharis,pilularis,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...,concord,USW00023254,Briones,2022,1,1,2022,4,13,0.0,11.15748,0.0,11.15748,0.0,11.15748,32.0,51.98,66.92,28.22,60.585161,39.141935,60.98,28.22,56.1425,40.64,8.0,24.0,17.0,59.0,34492.0,36973.0,34492.0,35540.225806,34844.0,34492.0,34656.375,1.629921,57.92,32.0,51.401429,39.907143,34492.0,34285.0,34340.071429,4.771654,68.0,32.0,54.518,41.396,34948.0,34285.0,34464.933333
1,104188607,20220101,2022,1,1,Sunol,east bay,37.530981,-121.819691,Baccharis pilularis,Baccharis,pilularis,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...,berkeley,USC00040693,Tilden,2022,1,1,2022,4,13,0.0,9.69685,0.0,9.69685,0.0,9.69685,37.04,51.98,69.08,35.96,60.71871,45.714839,57.92,35.96,54.8825,45.3875,8.0,24.0,17.0,59.0,34506.0,36983.0,34506.0,35552.451613,34857.0,34506.0,34669.75,-8.881784e-16,55.94,37.04,51.697143,41.63,34506.0,34299.0,34360.285714,0.0,62.96,37.04,53.63,43.316,34896.0,34299.0,34460.266667
2,104188607,20220101,2022,1,1,Sunol,east bay,37.530981,-121.819691,Baccharis pilularis,Baccharis,pilularis,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...,oakland,USW00023230,AChabot,2022,1,1,2022,4,13,0.0,15.031496,0.248031,15.279528,0.248031,15.279528,35.06,51.08,69.08,32.0,59.011613,42.172903,57.02,32.0,55.13,43.5425,8.0,23.0,17.0,59.0,34549.0,37014.0,34549.0,35590.387097,34899.0,34549.0,34712.25,4.047244,57.92,35.06,52.455714,42.208571,34549.0,34344.0,34404.285714,8.818898,60.08,35.06,54.374,43.226,34937.0,34344.0,34503.8
3,104188607,20220101,2022,1,1,Sunol,east bay,37.530981,-121.819691,Baccharis pilularis,Baccharis,pilularis,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...,hayward,USW00093228,Garin,2022,1,1,2022,4,13,0.0,11.885827,0.30315,12.188976,0.30315,12.188976,33.08,53.96,69.98,33.08,60.846452,43.456129,60.08,33.08,56.8625,44.3975,8.0,23.0,17.0,59.0,34592.0,37045.0,34592.0,35628.548387,34940.0,34592.0,34754.875,3.181102,60.08,33.08,54.358571,42.581429,34592.0,34388.0,34448.5,6.165354,60.98,33.08,55.43,43.724,34979.0,34388.0,34547.566667
4,104188607,20220101,2022,1,1,Sunol,east bay,37.530981,-121.819691,Baccharis pilularis,Baccharis,pilularis,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...,SanJoseMtHamilton,USW00023293.USC00045933,JDGrant,2022,1,1,2022,4,13,0.0,12.968504,0.104331,13.072835,0.104331,13.072835,31.01,49.46,65.03,31.01,58.195806,42.207742,58.55,31.01,52.89125,40.06625,8.0,21.0,17.0,59.0,34689.0,37113.0,34689.0,35713.129032,35033.0,34689.0,34849.5,4.047244,54.95,31.01,47.859286,36.467857,34689.0,34487.0,34546.785714,7.917323,60.98,31.01,50.348,38.861,35071.0,34487.0,34644.6


In [180]:
print(data['park_x'].unique())
print(data['park_y'].unique())

['Sunol' 'Tilden' 'PRidge' 'MtDiablo' 'Briones' 'JDGrant' 'AChabot'
 'Garin']
['Briones' 'Tilden' 'AChabot' 'Garin' 'JDGrant' 'PRidge' 'Sunol'
 'MtDiablo']


In [181]:
data = pd.DataFrame(data)

print(data.shape)

data.head(10)


(277032, 66)


Unnamed: 0,id,plain_dates,year,month,day,park_x,region,latitude,longitude,genus_species,genus,species,url,image_url,city,station_id,park_y,Year,Month,Day,WY,wy_month,WY_weeknum,prec_daily,prec_cum_WY,MonSumPrec,MonCumPrec,WkSumPrec,WkCumPrec,minTemp,maxTemp,MonMaxTemp,MonMinTemp,MonAvgMaxTemp,MonAvgMinTemp,WkMaxTemp,WkMinTemp,WkAvgMaxTemp,WkAvgMinTemp,hour_rise,minute_rise,hour_set,minute_set,day_length,MonMaxDayLen,MonMinDayLen,MonAvgDayLen,WkMaxDayLen,WkMinDayLen,WkAvgDayLen,sum_prec_prior14,MaxTemp_prior14,MinTemp_prior14,AvgMaxTemp_prior14,AvgMinTemp_prior14,MaxDayLen_prior14,MinDayLen_prior14,AvgDayLen_prior14,sum_prec_prior30,MaxTemp_prior30,MinTemp_prior30,AvgMaxTemp_prior30,AvgMinTemp_prior30,MaxDayLen_prior30,MinDayLen_prior30,AvgDayLen_prior30
0,104188607,20220101,2022,1,1,Sunol,east bay,37.530981,-121.819691,Baccharis pilularis,Baccharis,pilularis,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...,concord,USW00023254,Briones,2022,1,1,2022,4,13,0.0,11.15748,0.0,11.15748,0.0,11.15748,32.0,51.98,66.92,28.22,60.585161,39.141935,60.98,28.22,56.1425,40.64,8.0,24.0,17.0,59.0,34492.0,36973.0,34492.0,35540.225806,34844.0,34492.0,34656.375,1.629921,57.92,32.0,51.401429,39.907143,34492.0,34285.0,34340.071429,4.771654,68.0,32.0,54.518,41.396,34948.0,34285.0,34464.933333
1,104188607,20220101,2022,1,1,Sunol,east bay,37.530981,-121.819691,Baccharis pilularis,Baccharis,pilularis,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...,berkeley,USC00040693,Tilden,2022,1,1,2022,4,13,0.0,9.69685,0.0,9.69685,0.0,9.69685,37.04,51.98,69.08,35.96,60.71871,45.714839,57.92,35.96,54.8825,45.3875,8.0,24.0,17.0,59.0,34506.0,36983.0,34506.0,35552.451613,34857.0,34506.0,34669.75,-8.881784e-16,55.94,37.04,51.697143,41.63,34506.0,34299.0,34360.285714,0.0,62.96,37.04,53.63,43.316,34896.0,34299.0,34460.266667
2,104188607,20220101,2022,1,1,Sunol,east bay,37.530981,-121.819691,Baccharis pilularis,Baccharis,pilularis,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...,oakland,USW00023230,AChabot,2022,1,1,2022,4,13,0.0,15.031496,0.248031,15.279528,0.248031,15.279528,35.06,51.08,69.08,32.0,59.011613,42.172903,57.02,32.0,55.13,43.5425,8.0,23.0,17.0,59.0,34549.0,37014.0,34549.0,35590.387097,34899.0,34549.0,34712.25,4.047244,57.92,35.06,52.455714,42.208571,34549.0,34344.0,34404.285714,8.818898,60.08,35.06,54.374,43.226,34937.0,34344.0,34503.8
3,104188607,20220101,2022,1,1,Sunol,east bay,37.530981,-121.819691,Baccharis pilularis,Baccharis,pilularis,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...,hayward,USW00093228,Garin,2022,1,1,2022,4,13,0.0,11.885827,0.30315,12.188976,0.30315,12.188976,33.08,53.96,69.98,33.08,60.846452,43.456129,60.08,33.08,56.8625,44.3975,8.0,23.0,17.0,59.0,34592.0,37045.0,34592.0,35628.548387,34940.0,34592.0,34754.875,3.181102,60.08,33.08,54.358571,42.581429,34592.0,34388.0,34448.5,6.165354,60.98,33.08,55.43,43.724,34979.0,34388.0,34547.566667
4,104188607,20220101,2022,1,1,Sunol,east bay,37.530981,-121.819691,Baccharis pilularis,Baccharis,pilularis,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...,SanJoseMtHamilton,USW00023293.USC00045933,JDGrant,2022,1,1,2022,4,13,0.0,12.968504,0.104331,13.072835,0.104331,13.072835,31.01,49.46,65.03,31.01,58.195806,42.207742,58.55,31.01,52.89125,40.06625,8.0,21.0,17.0,59.0,34689.0,37113.0,34689.0,35713.129032,35033.0,34689.0,34849.5,4.047244,54.95,31.01,47.859286,36.467857,34689.0,34487.0,34546.785714,7.917323,60.98,31.01,50.348,38.861,35071.0,34487.0,34644.6
5,104188607,20220101,2022,1,1,Sunol,east bay,37.530981,-121.819691,Baccharis pilularis,Baccharis,pilularis,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...,LivermoreHayward,USW00023285.USW00093228,PRidge,2022,1,1,2022,4,13,0.0,11.334646,0.161417,11.496063,0.161417,11.496063,31.64,51.98,67.46,29.66,60.675161,40.433871,60.53,29.66,56.3,42.53,8.0,22.0,17.0,59.0,34599.0,37049.0,34599.0,35634.354839,34947.0,34599.0,34761.375,2.978346,60.08,31.64,52.944286,40.543571,34599.0,34395.0,34455.428571,5.507874,60.53,31.64,54.554,41.189,34985.0,34395.0,34554.3
6,104188607,20220101,2022,1,1,Sunol,east bay,37.530981,-121.819691,Baccharis pilularis,Baccharis,pilularis,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...,LivermoreSanJose,USW00023285.USW00023293,Sunol,2022,1,1,2022,4,13,0.0,8.714567,0.009843,8.724409,0.009843,8.724409,31.64,52.97,66.47,29.12,61.958387,39.464194,63.05,29.12,57.68375,42.27125,8.0,22.0,17.0,59.0,34634.0,37074.0,34634.0,35665.16129,34980.0,34634.0,34795.75,2.283465,60.53,31.64,53.696429,39.939286,34634.0,34431.0,34491.142857,4.600394,61.52,30.65,55.103,40.436,35019.0,34431.0,34589.633333
7,104188607,20220101,2022,1,1,Sunol,east bay,37.530981,-121.819691,Baccharis pilularis,Baccharis,pilularis,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...,mtdiablo,USC00045915,MtDiablo,2022,1,1,2022,4,13,0.0,18.645669,0.098425,18.744094,0.098425,18.744094,33.08,44.06,64.94,33.08,57.263871,44.809032,57.92,33.08,50.135,39.38,8.0,23.0,17.0,58.0,34511.0,36986.0,34511.0,35556.83871,34862.0,34511.0,34674.75,3.929134,53.06,28.94,45.525714,35.265714,34511.0,34305.0,34365.571429,8.358268,73.94,28.94,50.402,38.288,34901.0,34305.0,34465.566667
8,104188609,20220101,2022,1,1,Sunol,east bay,37.52706,-121.827025,Capsella bursa-pastoris,Capsella,bursa-pastoris,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...,concord,USW00023254,Briones,2022,1,1,2022,4,13,0.0,11.15748,0.0,11.15748,0.0,11.15748,32.0,51.98,66.92,28.22,60.585161,39.141935,60.98,28.22,56.1425,40.64,8.0,24.0,17.0,59.0,34492.0,36973.0,34492.0,35540.225806,34844.0,34492.0,34656.375,1.629921,57.92,32.0,51.401429,39.907143,34492.0,34285.0,34340.071429,4.771654,68.0,32.0,54.518,41.396,34948.0,34285.0,34464.933333
9,104188609,20220101,2022,1,1,Sunol,east bay,37.52706,-121.827025,Capsella bursa-pastoris,Capsella,bursa-pastoris,https://www.inaturalist.org/observations/10418...,https://inaturalist-open-data.s3.amazonaws.com...,berkeley,USC00040693,Tilden,2022,1,1,2022,4,13,0.0,9.69685,0.0,9.69685,0.0,9.69685,37.04,51.98,69.08,35.96,60.71871,45.714839,57.92,35.96,54.8825,45.3875,8.0,24.0,17.0,59.0,34506.0,36983.0,34506.0,35552.451613,34857.0,34506.0,34669.75,-8.881784e-16,55.94,37.04,51.697143,41.63,34506.0,34299.0,34360.285714,0.0,62.96,37.04,53.63,43.316,34896.0,34299.0,34460.266667


In [182]:
#print and see park-city matches
print(climate_daylength['city'].unique())
print(climate_daylength['park'].unique())

['concord' 'berkeley' 'oakland' 'hayward' 'SanJoseMtHamilton'
 'LivermoreHayward' 'LivermoreSanJose' 'mtdiablo']
['Briones' 'Tilden' 'AChabot' 'Garin' 'JDGrant' 'PRidge' 'Sunol'
 'MtDiablo']


In [183]:
#for each park, keep only rows where the observation's park (park_x) is paired with the city of its own weather station
Ac = data[(data['city']=='oakland')&(data['park_x']=='AChabot')&(data['park_y']=='AChabot')]
Br = data[(data['city']=='concord')&(data['park_x']=='Briones')]
Ti = data[(data['city']=='berkeley')&(data['park_x']=='Tilden')&(data['park_y']=='Tilden')]
Ga = data[(data['city']=='hayward')&(data['park_x']=='Garin')]
Md = data[(data['city']== 'mtdiablo')&(data['park_x']=='MtDiablo')]

Jd = data[(data['city']== 'SanJoseMtHamilton')&(data['park_x']=='JDGrant')]
Pr = data[(data['city']== 'LivermoreHayward')&(data['park_x']=='PRidge')]
Su = data[(data['city']== 'LivermoreSanJose')&(data['park_x']=='Sunol')]


#for each park, print the subset shape, latest observation date, and the city/park pairing
print(Ac.shape, Ac['plain_dates'].max(), Ac['city'].unique(),Ac['park_x'].unique())
print(Br.shape, Br['plain_dates'].max(), Br['city'].unique(),Br['park_x'].unique())
print(Ti.shape, Ti['plain_dates'].max(), Ti['city'].unique(),Ti['park_x'].unique())  
print(Jd.shape, Jd['plain_dates'].max(), Jd['city'].unique(),Jd['park_x'].unique())
print(Ga.shape, Ga['plain_dates'].max(), Ga['city'].unique(),Ga['park_x'].unique())
print(Pr.shape, Pr['plain_dates'].max(), Pr['city'].unique(),Pr['park_x'].unique())
print(Su.shape, Su['plain_dates'].max(), Su['city'].unique(),Su['park_x'].unique())
print(Md.shape, Md['plain_dates'].max(), Md['city'].unique(),Md['park_x'].unique())

#concatenate the park dataframes
complete_df = pd.concat([Su,Pr,Br,Ti, Ac, Jd, Ga, Md])

print(complete_df.shape)



(2538, 66) 20220930 ['oakland'] ['AChabot']
(3198, 66) 20220925 ['concord'] ['Briones']
(5325, 66) 20220926 ['berkeley'] ['Tilden']
(2161, 66) 20220917 ['SanJoseMtHamilton'] ['JDGrant']
(345, 66) 20220924 ['hayward'] ['Garin']
(557, 66) 20220926 ['LivermoreHayward'] ['PRidge']
(2598, 66) 20220927 ['LivermoreSanJose'] ['Sunol']
(17944, 66) 20220930 ['mtdiablo'] ['MtDiablo']
(34666, 66)
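
The concatenated frame has 34666 rows, the same as `flowers_df`, which suggests that each observation was matched to exactly one climate row. A small check like the one below could make that explicit. The helper `check_one_row_per_observation` is hypothetical, and it assumes the `id` column is unique per observation after the de-duplication in section 3.

```python
import pandas as pd

def check_one_row_per_observation(observations, merged, key="id"):
    """Return a list of problems found; an empty list means the merge looks clean."""
    problems = []
    if len(merged) != len(observations):
        problems.append(f"row count changed: {len(observations)} -> {len(merged)}")
    if merged[key].duplicated().any():
        problems.append("duplicated observation ids")
    lost = set(observations[key]) - set(merged[key])
    if lost:
        problems.append(f"{len(lost)} observations lost")
    return problems

# toy example: observation 3 was dropped and observation 1 was duplicated
obs = pd.DataFrame({"id": [1, 2, 3]})
good = pd.DataFrame({"id": [3, 1, 2]})
bad = pd.DataFrame({"id": [1, 1, 2]})

print(check_one_row_per_observation(obs, good))
print(check_one_row_per_observation(obs, bad))
```

In the notebook this would presumably be called as `check_one_row_per_observation(flowers_df, complete_df)`.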


In [184]:
#list the columns of the concatenated dataframe before selecting the final set
complete_df.columns

In [None]:
phenology_2017_2022_df = complete_df[['id', 'park_x',
                      'plain_dates', 'Year','Month', 'Day', 'WY', 'wy_month', 'WY_weeknum',
                      'genus_species', 'genus','species', 
                      'latitude', 'longitude', 
                      'prec_daily', 'prec_cum_WY','MonSumPrec', 'WkSumPrec',
                      'minTemp', 'maxTemp', 
                      'hour_rise', 'minute_rise', 'hour_set', 'minute_set', 'day_length', 
                      'sum_prec_prior14',
                      'MaxTemp_prior14', 'MinTemp_prior14', 'AvgMaxTemp_prior14', 'AvgMinTemp_prior14', 
                      'MaxDayLen_prior14', 
                      'sum_prec_prior30',
                      'MaxTemp_prior30', 'MinTemp_prior30', 'AvgMaxTemp_prior30', 'AvgMinTemp_prior30', 
                      'MaxDayLen_prior30',
                      'url', 'image_url']]

#rename park_x back to park now that park_y has been dropped
phenology_2017_2022_df = phenology_2017_2022_df.rename(columns={"park_x": "park"})
phenology_2017_2022_df.head()

<a id='final_dataframe'></a>
## export final dataset for modeling

In [None]:
#Export the data as a csv
phenology_2017_2022_df.to_csv('/Users/sandidge/Desktop/Python_Projects/Springboard_coursework/Capstone2_Wildflowers/Public_Final/phenology_dataset_2017_2022_df.csv', index=False)


# Final dataframe
<br>[phenology_dataset_2017_2022_df.csv](https://raw.githubusercontent.com/Floydworks/WildflowerFinder_Phenology_Tool/main/cleaned_data_files/phenology_dataset_2017_2022_df.csv)
<br>
<br>The final data frame combines wildflower observations from iNaturalist, daily temperature and precipitation 
<br>from NOAA GHCN data, and sunset and sunrise times from Skyfield.
<br>This dataset includes observations from October 01, 2017 through September 30, 2022. 
<br>
[Link to top](#guide)
