## Table of Contents
<a id='toc'></a>

### Description of Notebook

### I. [Import Python packages](#python_packages)

### II. [Functions Used](#functions)

### III. [Read in Datasets](#datasets)
  - [Aboveground Dry Biomass](#aboveground_dry_biomass)
  - [Canopy Height - Sensor](#canopy_height_sensor)
  - [Canopy Height - Manual](#canopy_height_manual)
  - [Days & GDD to Flowering](#flowering)
  - [Days & GDD to Flag Leaf Emergence](#flag_leaf)

### MAC Season 4 Data Cleaning
#### Traits
- aboveground dry biomass
- canopy height
- days & growing degree days (GDD) to flowering
- days & GDD to flag leaf emergence

#### Update July 2020
##### The input data for the code in this notebook were downloaded from a Dryad data publication from the TERRA-REF project, lead author [David LeBauer](https://github.com/dlebauer). Information on the publication and access to data can be found in this [repository](https://github.com/terraref/data-publication). 


##### This notebook contains the code used to clean and curate sorghum data from Maricopa Agricultural Center (MAC) Season 4. The input trait data were originally queried from BETYdb (API version `v1`) in April 2020 using the `R` code below; those data will most likely be used only to add latitude and longitude values to the new derived datasets.

```r
library(traits)
library(dplyr)


options(betydb_url = "https://terraref.ncsa.illinois.edu/bety/",
        betydb_api_version = 'v1',
        betydb_key = 'abcde_super_secret_key_1234')

season_4 <- betydb_query(sitename  = "~Season 4",
                         limit     =  "none")

write.csv(season_4, file = 'mac_season_four_2020-04-22.csv')
```
- Environmental weather data were downloaded from the MAC weather station [website](https://cals.arizona.edu/azmet/06.htm). 
- Please email ejcain@arizona.edu with any questions or comments or create an issue in this [repository](https://github.com/MagicMilly/for-data-publication). 

### I. Import Python packages
<a id='python_packages'></a>
Return to [Table of Contents](#toc)

In [None]:
import datetime
import numpy as np
import pandas as pd
import sqlite3

### II. Functions Used
<a id='functions'></a>
Return to [Table of Contents](#toc)

In [None]:
def plot_hist(df, value_column, trait_column):
    
    """
    Return an exploratory histogram to visualize distribution of values of specific trait.
    """
    trait_name = df[trait_column].unique()[0]
    return df[value_column].hist(color='navy').set_xlabel(trait_name);

In [None]:
# def plot_time_series(df, value_column, date_column)

In [None]:
def check_for_nulls(df):
    
    """
    Takes dataframe as argument and returns table showing sum of null values, if any.
    """
    
    return df.isnull().sum()

In [None]:
def check_duplicates(df):
    
    """
    Takes dataframe as argument and returns value counts for duplicates, if any.
    """
    
    return df.duplicated().value_counts()

In [None]:
def check_unique_values(df):
    
    """
    Function takes a dataframe as argument and checks for number of unique values in each column.
    Print statement will contain number of unique values, as well as the unique values for any column that
    contains fewer than 5 unique values.
    """
    for col in df.columns:
        
        if df[col].nunique() < 5:
            print(f'{df[col].nunique()} unique value(s) for {col} column: {df[col].unique()}')
            
        else:
            print(f'{df[col].nunique()} values for {col} column')

In [None]:
def extract_range_column_values(working_df, plot_column):
    
    """
    To assist in plot location, function takes the working dataframe name and name of plot column. 
    Range and column values are extracted from the plot name strings and added as new columns to the 
    returned dataframe. 
    """
    
    new_df = working_df.copy()

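    # raw strings avoid invalid-escape warnings for \d; expand=False returns a Series for one capture group
    new_df['range'] = new_df[plot_column].str.extract(r"Range (\d+)", expand=False).astype(int)
    new_df['column'] = new_df[plot_column].str.extract(r"Column (\d+)", expand=False).astype(int)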
    
    return new_df
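
As a quick illustration of the extraction step, the sketch below applies the same `Range` and `Column` patterns to two made-up plot names. The plot-name format is an assumption inferred from the patterns the function searches for; the real season 4 names may differ.

```python
import pandas as pd

# Hypothetical plot names; the real season 4 names are assumed to follow this pattern
toy = pd.DataFrame({'plot': ['MAC Field Scanner Season 4 Range 10 Column 3',
                             'MAC Field Scanner Season 4 Range 46 Column 12 E']})

# expand=False returns a Series when the pattern has a single capture group
toy['range'] = toy['plot'].str.extract(r'Range (\d+)', expand=False).astype(int)
toy['column'] = toy['plot'].str.extract(r'Column (\d+)', expand=False).astype(int)

print(toy[['range', 'column']])
```

Note that `.astype(int)` raises an error if any plot name lacks a match, which can act as a useful early warning about unexpected plot names.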

In [None]:
def convert_datetime_column(working_df, date_column):
    
    """
    If date column does not contain datetime objects, function takes working dataframe and name of date column
    as arguments. The original date column is dropped, and a new dataframe with an updated datetime column
    is returned.
    """
    
    new_datetimes = pd.to_datetime(working_df[date_column])
    
    new_df_0 = working_df.drop(labels=date_column, axis=1)
    new_df_1 = new_df_0.copy()
    new_df_1['date'] = new_datetimes
    
    return new_df_1

In [None]:
def rename_value_column(working_df, value_column, trait_column):
    
    """
    Takes working dataframe, name of value column, and name of trait column as arguments. Returns a new dataframe
    with the name of the trait as the new name of the value column.
    """
    
    trait = working_df[trait_column].unique()[0]
    
    new_df_0 = working_df.rename({value_column: trait}, axis=1)
    new_df_1 = new_df_0.drop(labels=trait_column, axis=1)
    
    return new_df_1

In [None]:
def add_blocking_height(working_df, range_column):
    
    """
    For season 4 data, takes a working dataframe with a range column to indicate plot location within the field, 
    and returns a new dataframe with a blocking height column that will indicate if a certain range was blocked 
    by a short, medium, or tall height block.
    """
    
    short_blocks = [11, 20, 46, 50]
    medium_blocks = [10, 12, 18, 24, 27, 29, 31, 33, 38, 51]
    tall_blocks = [3, 4, 5, 6, 7, 8, 9, 13, 14, 15, 16, 17, 19, 21, 22, 23, 25, 26, 28, 30, 32, 34, 35, 36, 37, 
                   39, 40, 41, 42, 43, 44, 45, 47, 48, 49, 52]
    border = [1, 2, 53, 54]
    
    range_values = working_df[range_column].values
    blocking_heights = []
    
    for r in range_values:
        
        if r in short_blocks:
            blocking_heights.append('short')
            
        elif r in medium_blocks:
            blocking_heights.append('medium')
            
        elif r in tall_blocks:
            blocking_heights.append('tall')
            
        elif r in border:
            blocking_heights.append('border')
            
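        else:
            # append a placeholder so the list stays the same length as the dataframe
            print(f'Error with range value {r}')
            blocking_heights.append(None)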
        
    working_df_1 = working_df.copy()
    working_df_1['blocking_height'] = blocking_heights
    
    return working_df_1
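
The same lookup can also be sketched with a dictionary and `Series.map`, which avoids the explicit loop. This is an alternative illustration, not the notebook's method; the range assignments are copied from `add_blocking_height`, and the toy dataframe is made up.

```python
import pandas as pd

# Range-to-block assignments copied from add_blocking_height
blocks = {
    'short': [11, 20, 46, 50],
    'medium': [10, 12, 18, 24, 27, 29, 31, 33, 38, 51],
    'tall': [3, 4, 5, 6, 7, 8, 9, 13, 14, 15, 16, 17, 19, 21, 22, 23, 25, 26, 28, 30, 32, 34,
             35, 36, 37, 39, 40, 41, 42, 43, 44, 45, 47, 48, 49, 52],
    'border': [1, 2, 53, 54],
}

# Invert to {range_number: height} so each range is looked up directly
range_to_height = {r: height for height, ranges in blocks.items() for r in ranges}

# Toy dataframe; range 99 is deliberately invalid
toy = pd.DataFrame({'range': [11, 10, 3, 1, 99]})
toy['blocking_height'] = toy['range'].map(range_to_height)

print(toy)
```

With `map`, a range that is missing from the dictionary becomes `NaN` instead of being skipped, so the new column always has one value per row.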

In [None]:
def reorder_columns(working_df, new_col_order_list):
    
    """
    Takes working dataframe and list of new column order and returns a new dataframe with desired column order.
    """
    
    working_df_1 = pd.DataFrame(data=working_df, columns=new_col_order_list)
    return working_df_1

In [None]:
# def add_lat_lon(lat_lon_df, lat_lon_plot_column, working_df, working_df_plot_column):
    
#     """
#     Take the dataframe with the latitude and longitude information, the name of the plot column, the working
#     dataframe, and the name of the plot column in the working dataframe as arguments. Function will return
#     the working dataframe with latitude and longitude values for the plots as a new dataframe.
#     """
    
# use dictionaries?
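
One possible way to fill in the planned `add_lat_lon` function is a left merge rather than dictionaries. The sketch below is only a suggestion under stated assumptions: it assumes the coordinate columns are named `lat` and `lon`, as in the betydb query output, and the plot names, coordinates, and trait values are invented for illustration.

```python
import pandas as pd

def add_lat_lon(lat_lon_df, lat_lon_plot_column, working_df, working_df_plot_column):
    """Sketch: left-join latitude and longitude onto the working dataframe by plot name."""
    # Keep one row per plot so the merge cannot multiply rows
    coords = (lat_lon_df[[lat_lon_plot_column, 'lat', 'lon']]
              .drop_duplicates(subset=lat_lon_plot_column))
    merged = working_df.merge(coords, how='left',
                              left_on=working_df_plot_column,
                              right_on=lat_lon_plot_column)
    # Drop the duplicate key column when the two plot columns have different names
    if lat_lon_plot_column != working_df_plot_column:
        merged = merged.drop(columns=lat_lon_plot_column)
    return merged

# Made-up coordinates and trait values for illustration only
lat_lon = pd.DataFrame({'sitename': ['Plot A', 'Plot A', 'Plot B'],
                        'lat': [33.07, 33.07, 33.08],
                        'lon': [-111.97, -111.97, -111.98]})
traits = pd.DataFrame({'plot': ['Plot A', 'Plot B', 'Plot C'],
                       'canopy_height': [100, 120, 90]})

result = add_lat_lon(lat_lon, 'sitename', traits, 'plot')
print(result)
```

Plots without a coordinate match (here `Plot C`) keep their row and receive `NaN` for `lat` and `lon`, which makes unmatched plot names easy to find afterwards.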

In [None]:
# for later: should return True or False, to be used with strip_subplots()

def check_for_subplots(df, plot_col):
    
    """
    Function takes a dataframe and name of plot column as arguments and checks for `E` or `W` subplot designations.
    Returns a message indicating the presence or lack of subplot designations.
    """

    # check every plot name, not only the first; str.endswith accepts a tuple of suffixes
    if any(name.endswith((' E', ' W')) for name in df[plot_col].values):
        return 'This dataset contains subplot designations.'

    return 'No subplot designations.'

In [None]:
def strip_subplots(working_df, plot_col, new_plot_col_name):
    
    """
    Function takes a dataframe, name of existing plot column, and name of new plot column as arguments.
    If there are subplot designations present in that column, they will be stripped to return a new dataframe
    with a plot column lacking subplot designations. This will allow for easier manipulation of the dataframe
    and aggregate functions by full plot only.
    """
    
    plot_names = working_df[plot_col].values
    new_plot_names = []
    
    for n in plot_names:
        
        if n.endswith((' E', ' W')):
            new_plot_names.append(n[:-2])
            
        else:
            new_plot_names.append(n)
            
    working_df_1 = working_df.drop(labels=plot_col, axis=1)
    working_df_2 = working_df_1.copy()
    
    working_df_2[new_plot_col_name] = new_plot_names
    return working_df_2
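
For comparison, the suffix stripping can be sketched as a single vectorized string operation. The plot names below are made up, and the pattern assumes subplot designations always appear as a trailing ` E` or ` W`.

```python
import pandas as pd

# Hypothetical plot names, two of them with subplot designations
toy = pd.DataFrame({'plot': ['Range 10 Column 3 E', 'Range 10 Column 3 W', 'Range 11 Column 4']})

# Remove a trailing " E" or " W"; names without a suffix are left unchanged
toy['plot'] = toy['plot'].str.replace(r' [EW]$', '', regex=True)

print(toy['plot'].tolist())
```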

In [None]:
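def save_to_csv(df, name_of_dataset):
    
    """
    Takes a dataframe and a dataset name (without file extension) as arguments and writes the dataframe to a
    timestamped `.csv` file in `data/processed/`. Colons are removed so the filename is valid on all systems.
    """
    
    timestamp = datetime.datetime.now().replace(microsecond=0).isoformat()
    output_filename = f'data/processed/{name_of_dataset}_{timestamp}.csv'.replace(':', '')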

    df.to_csv(output_filename, index=False)

### III. Read in datasets
<a id='datasets'></a>
Return to [Table of Contents](#toc)
- Raw Season Four data can be downloaded from this Google [Drive](https://drive.google.com/open?id=1THk-NQYxkkej-zdQsqM7i9t-axyS0Sug)
- Each trait - separated by method, if applicable - can be found in its own `.csv` file
- Functions applied to all datasets
    - Plot distribution of values
    - Check for null values
    - Check for duplicates
    - Extract range and column values to add to dataframe
    - Convert string date column values to datetime objects
    - Rename values column (usually 'mean') to the trait being measured
    - Add blocking height column
- Columns dropped from all datasets
    - `checked` 
    - `author`
    - `season`

#### Data queried from betydb, for latitude and longitude values of plots / sitenames
- Slice dataset to only include unique sitename values
- **Currently (July 2020) these values have not been added to the updated datasets, so the cells do not need to be executed.**

In [None]:
# s4_0 = pd.read_csv('data/raw/mac_season_four_2020-04-22.csv', low_memory=False)
# print(s4_0.shape)
# s4_0.head(3)

In [None]:
# s4_1 = s4_0[['sitename', 'lat', 'lon']]
# print(s4_1.shape)
# s4_1.tail(3)

In [None]:
# s4_2 = s4_1.drop_duplicates(ignore_index=True)
# print(s4_2.shape)
# s4_2.head(3)

#### A. Aboveground Dry Biomass
<a id='aboveground_dry_biomass'></a>
Return to [Table of Contents](#toc)

In [None]:
adb_0 = pd.read_csv('data/raw/season_4_traits/season_4_aboveground_dry_biomass_manual.csv')
print(adb_0.shape)
# adb_0.head()

In [None]:
plot_hist(adb_0, 'mean', 'trait')

In [None]:
check_for_nulls(adb_0)

In [None]:
check_duplicates(adb_0)

In [None]:
# check_unique_values(adb_0)

In [None]:
adb_1 = extract_range_column_values(adb_0, 'plot')
print(adb_1.shape)
# adb_1.sample(n=3)

In [None]:
adb_2 = convert_datetime_column(adb_1, 'date')
print(adb_2.shape)
# adb_2.head()

In [None]:
adb_2.dtypes

In [None]:
adb_3 = rename_value_column(adb_2, 'mean', 'trait')
print(adb_3.shape)
# adb_3.tail()

In [None]:
cols_to_drop = ['checked', 'author', 'season']

adb_4 = adb_3.drop(labels=cols_to_drop, axis=1)
print(adb_4.shape)
# adb_4.head(3)

In [None]:
adb_5 = add_blocking_height(adb_4, 'range')
print(adb_5.shape)
# adb_5.sample(n=3)

##### Add units (kg/ha) column to aboveground dry biomass dataset

In [None]:
adb_6 = adb_5.copy()
adb_6['units'] = 'kg/ha'

print(adb_6.shape)
# adb_6.tail(3)

In [None]:
new_col_order = ['date', 'plot', 'range', 'column', 'scientificname', 'genotype', 'treatment', 'blocking_height', 
                 'method', 'aboveground_dry_biomass', 'units', 'method_type']

adb_7 = reorder_columns(adb_6, new_col_order)
print(adb_7.shape)
adb_7.head(3)

#### Save dataframe to `.csv` if needed

In [None]:
save_to_csv(adb_7, name_of_dataset='aboveground_dry_biomass_season_4')

#### B. Canopy Height - Sensor
<a id='canopy_height_sensor'></a>
Return to [Table of Contents](#toc)

In [None]:
ch_0 = pd.read_csv('data/raw/season_4_traits/season_4_canopy_height_sensor.csv')
print(ch_0.shape)
# ch_0.head()

In [None]:
check_unique_values(ch_0)

In [None]:
check_for_nulls(ch_0)

In [None]:
check_duplicates(ch_0)

In [None]:
# Inspect Duplicates

# ch_0.loc[ch_0.duplicated() == True][:5]

In [None]:
# ch_0.iloc[229]

In [None]:
# ch_0.loc[(ch_0.genotype == 'PI564163') & (ch_0['date'] == '2017-06-24')]

#### Drop duplicates

In [None]:
ch_1 = ch_0.drop_duplicates(ignore_index=True)
print(ch_1.shape)
check_duplicates(ch_1)
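
When inspecting duplicates before dropping them, it can help to see every member of a duplicated group rather than only the repeats. A minimal sketch, using invented measurement values:

```python
import pandas as pd

# Made-up rows; the first two are exact duplicates
toy = pd.DataFrame({'genotype': ['PI564163', 'PI564163', 'PI641810'],
                    'date': ['2017-06-24', '2017-06-24', '2017-06-19'],
                    'mean': [150, 150, 212]})

# keep=False flags every row in a duplicated group, not only the later repeats
print(toy[toy.duplicated(keep=False)])

deduped = toy.drop_duplicates(ignore_index=True)
print(deduped.shape)
```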

In [None]:
plot_hist(ch_1, 'mean', 'trait')

In [None]:
check_for_subplots(ch_1, 'plot')

In [None]:
ch_2 = extract_range_column_values(ch_1, 'plot')
print(ch_2.shape)
# ch_2.sample(n=3)

In [None]:
ch_3 = convert_datetime_column(ch_2, 'date')
print(ch_3.shape)
# ch_3.dtypes

In [None]:
ch_4 = rename_value_column(ch_3, 'mean', 'trait')
print(ch_4.shape)
# ch_4.tail(3)

In [None]:
ch_5 = add_blocking_height(ch_4, 'range')
# ch_5.sample(n=3)

In [None]:
ch_6 = ch_5.drop(labels=['checked', 'author', 'season'], axis=1)
print(ch_6.shape)

#### Add units column
- cm

In [None]:
ch_7 = ch_6.copy()
ch_7['units'] = 'cm'
print(ch_7.shape)
# ch_7.head(3)

In [None]:
new_col_order = ['date', 'plot', 'range', 'column', 'scientificname', 'genotype', 'treatment', 'blocking_height', 
                 'method', 'canopy_height', 'units', 'method_type']

ch_8 = reorder_columns(ch_7, new_col_order)
print(ch_8.shape)
ch_8.head(3)

#### Save to `.csv` if needed

In [None]:
save_to_csv(ch_8, 'canopy_height_sensor_season_4')

#### C. Canopy Height - Manual
- using SQLite for `groupby`

<a id='canopy_height_manual'></a>
Return to [Table of Contents](#toc)

In [None]:
chm_0 = pd.read_csv('data/raw/season_4_traits/season_4_canopy_height_manual.csv')
print(chm_0.shape)
# chm_0.head()

In [None]:
plot_hist(chm_0, 'mean', 'method')

In [None]:
check_for_nulls(chm_0)

In [None]:
check_duplicates(chm_0)

In [None]:
# check_unique_values(chm_0)

In [None]:
chm_1 = extract_range_column_values(chm_0, 'plot')
print(chm_1.shape)
# chm_1.sample(n=3)

In [None]:
chm_2 = convert_datetime_column(chm_1, 'date')
print(chm_2.shape)
# chm_2.head()

#### Identify and Remove Subplot Designations

In [None]:
check_for_subplots(chm_2, 'plot')

In [None]:
chm_3 = strip_subplots(chm_2, 'plot', 'plot')
print(chm_3.shape)
# chm_3.sample(n=3)

In [None]:
check_for_subplots(chm_3, 'plot')

In [None]:
# check for plot/date/mean/treatment duplicates

chm_3.duplicated(subset=['plot', 'date', 'mean', 'treatment']).value_counts()

In [None]:
# inspect sample of duplicates

# chm_3.loc[chm_3.duplicated(subset=['plot', 'date', 'mean']) == True][:3]

In [None]:
# chm_3.loc[(chm_3.genotype == 'PI641810') & (chm_3['mean'] == 212) & (chm_3['date'] == '2017-06-19')]

In [None]:
# Drop Duplicates

chm_4 = chm_3.drop_duplicates(ignore_index=True, subset=['plot', 'genotype', 'treatment', 'mean', 'range', 'column',
                                                        'date'])

print(chm_4.shape)
chm_4.duplicated().value_counts()

In [None]:
chm_5 = add_blocking_height(chm_4, 'range')
print(chm_5.shape)
# chm_5.sample(n=3)

#### Use sqlite database to group by `plot`, `date`, and `mean` 
- rename `mean` to `canopy_height_cm`
- can also drop and reorder columns at this time
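
The SQLite round trip can be tried without touching the file on disk by using an in-memory database. The sketch below is an illustration on made-up values, and it differs from the notebook's query in one labeled way: it groups by `plot` and `date` only, on the assumption that the goal is one averaged height per plot per day. The table name `canopy_heights` is hypothetical. A pandas `groupby` producing the same result is included for comparison.

```python
import sqlite3
import pandas as pd

# Made-up measurements: two readings for one plot on the same date
toy = pd.DataFrame({'plot': ['Range 10 Column 3', 'Range 10 Column 3', 'Range 11 Column 4'],
                    'date': ['2017-06-19', '2017-06-19', '2017-06-19'],
                    'mean': [210, 214, 180]})

conn = sqlite3.connect(':memory:')   # in-memory database, nothing is written to disk
# if_exists='replace' lets the cell be rerun without a "table already exists" error
toy.to_sql('canopy_heights', conn, index=False, if_exists='replace')

averaged = pd.read_sql_query("""
    SELECT plot, date, ROUND(AVG([mean]), 2) AS canopy_height_cm
    FROM canopy_heights
    GROUP BY plot, date
    ORDER BY date ASC, plot ASC;
    """, conn)
conn.close()

# The same aggregation in pandas, for comparison
in_pandas = (toy.groupby(['plot', 'date'], as_index=False)['mean'].mean()
                .rename(columns={'mean': 'canopy_height_cm'}))

print(averaged)
```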

In [None]:
conn = sqlite3.connect('data/interim/canopy_heights_manual_season_4.sqlite')
cursor = conn.cursor()
print("Opened database successfully")

In [None]:
# comment next line out if db has already been created
chm_5.to_sql('canopy_heights_manual_season_4.sqlite', conn)

In [None]:
chm_6 = pd.read_sql_query("""
                            SELECT date, plot, range, column, scientificname, genotype, treatment, blocking_height,
                            method, ROUND(AVG([mean]), 2) AS canopy_height_cm, method_type
                            FROM 'canopy_heights_manual_season_4.sqlite'
                            GROUP BY plot, date,[mean]
                            ORDER BY date ASC;
                            """, conn)

print(chm_6.shape)
chm_6.head(3)

In [None]:
check_duplicates(chm_6)

#### Save dataframe to `.csv` if needed

In [None]:
save_to_csv(chm_6, name_of_dataset='canopy_height_manual_season_4')

#### D. Days & GDD to Flowering
<a id='flowering'></a>
Return to [Table of Contents](#toc)

In [None]:
# need functions for days & gdd to traits
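
A possible shape for such a function is sketched below. The name `add_event_date_and_gdd` and its parameters are hypothetical, and the trait and weather values are made up; it assumes the weather dataframe has `date` and `gdd` columns, as in the processed weather file read in this notebook.

```python
import datetime
import pandas as pd

def add_event_date_and_gdd(df, days_column, weather_df, planting_date, event_name):
    """Hypothetical helper: derive the event date from days after planting, then look up GDD."""
    out = df.copy()
    date_col = f'date_of_{event_name}'
    # Vectorized equivalent of adding one Timedelta per row to the planting date
    out[date_col] = pd.Timestamp(planting_date) + pd.to_timedelta(out[days_column], unit='D')

    gdd = weather_df[['date', 'gdd']].copy()
    gdd['date'] = pd.to_datetime(gdd['date'])
    # Renaming before the merge avoids date_x / date_y suffix columns
    gdd = gdd.rename(columns={'date': date_col, 'gdd': f'gdd_to_{event_name}'})

    # Left join keeps every trait row, even if the weather table lacks that date
    out = out.merge(gdd, how='left', on=date_col)
    return out.rename(columns={days_column: f'days_to_{event_name}'})

# Made-up trait and weather values for illustration only
traits = pd.DataFrame({'plot': ['Plot A', 'Plot B'], 'mean': [2, 3]})
weather = pd.DataFrame({'date': ['2017-04-21', '2017-04-22', '2017-04-23'],
                        'gdd': [10.0, 21.5, 30.0]})

result = add_event_date_and_gdd(traits, 'mean', weather, datetime.date(2017, 4, 20), 'flowering')
print(result)
```

The same call with `event_name='flag_leaf_emergence'` would cover the flag leaf dataset, so the two sections could share one function.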

In [None]:
fl_0 = pd.read_csv('data/raw/season_4_traits/season_4_flowering_time_manual.csv')
print(fl_0.shape)
# fl_0.head()

#### Read in updated processed weather dataset for season 4

In [None]:
weather_0 = pd.read_csv('data/processed/mac_season_4_daily_weather_2020-07-01T144735.csv')
print(weather_0.shape)
# weather_0.head()

In [None]:
plot_hist(fl_0, 'mean', 'trait')

In [None]:
check_duplicates(fl_0)

In [None]:
check_for_nulls(fl_0)

In [None]:
check_for_subplots(fl_0, 'plot')

In [None]:
# check_unique_values(fl_0)

#### Add planting date 2017-04-20

In [None]:
day_of_planting = datetime.date(2017,4,20)
flower_df_1 = fl_0.copy()

flower_df_1['date_of_planting'] = day_of_planting
print(flower_df_1.shape)
# flower_df_1.head(3)

#### Create datetime with days to flowering (`mean`)

In [None]:
timedelta = pd.Series([pd.Timedelta(days=i) for i in flower_df_1['mean'].values])
dates_of_flowering = []

for td in timedelta:
    
    date_of_flowering = day_of_planting + td
    dates_of_flowering.append(date_of_flowering)
    
print(flower_df_1.shape[0])
print(len(dates_of_flowering))

In [None]:
flower_df_2 = flower_df_1.copy()
flower_df_2['date_of_flowering'] = dates_of_flowering
print(flower_df_2.shape)
# flower_df_2.head(3)

#### Add GDD to flowering dataframe

In [None]:
# slice weather df for date and cumulative gdd values only

season_4_gdd = weather_0[['date', 'gdd']]
print(season_4_gdd.shape)
# season_4_gdd.head(3)

In [None]:
season_4_gdd.dtypes

In [None]:
flower_df_3 = flower_df_2.copy()
flower_df_3.date_of_flowering = pd.to_datetime(flower_df_3.date_of_flowering)
flower_df_3.dtypes

In [None]:
season_4_gdd_1 = season_4_gdd.copy()
season_4_gdd_1['date'] = pd.to_datetime(season_4_gdd_1['date'])
season_4_gdd_1.dtypes

In [None]:
flower_df_4 = flower_df_3.merge(season_4_gdd_1, how='left', left_on='date_of_flowering', right_on='date')
print(flower_df_4.shape)
# flower_df_4.head(3)

In [None]:
flower_df_5 = extract_range_column_values(flower_df_4, 'plot')
flower_df_6 = add_blocking_height(flower_df_5, 'range')

print(flower_df_6.shape)
# flower_df_6.tail(3)

In [None]:
flower_df_7 = rename_value_column(flower_df_6, 'mean', 'trait')
# flower_df_7.sample(n=3)

In [None]:
flower_df_8 = flower_df_7.rename({'flowering_time': 'days_to_flowering', 'gdd': 'gdd_to_flowering'}, axis=1)
# flower_df_8.head(2)

In [None]:
cols_to_drop = ['date_x', 'checked', 'author', 'season', 'date_of_planting', 'date_y']

flower_df_9 = flower_df_8.drop(labels=cols_to_drop, axis=1)
print(flower_df_9.shape)
# flower_df_9.sample(n=3)

In [None]:
new_col_order = ['plot', 'range', 'column', 'scientificname', 'genotype', 'treatment', 'blocking_height', 
                 'method', 'date_of_flowering', 'days_to_flowering', 'gdd_to_flowering', 'method_type']

flower_df_10 = reorder_columns(flower_df_9, new_col_order)
print(flower_df_10.shape)
flower_df_10.head(3)

In [None]:
save_to_csv(flower_df_10, 'days_gdd_to_flowering_season_4')

#### E. Days & GDD to Flag Leaf Emergence
<a id='flag_leaf'></a>
Return to [Table of Contents](#toc)

In [None]:
fle_0 = pd.read_csv('data/raw/season_4_traits/season_4_flag_leaf_emergence_time_manual.csv')
print(fle_0.shape)
# fle_0.head()

#### Read in updated processed weather dataset for season 4
Code used to process weather data for season 4 can be found in the `season_4_weather_data_cleaning` notebook in this repository.

In [None]:
weather_0 = pd.read_csv('data/processed/mac_season_4_daily_weather_2020-07-01T144735.csv')
print(weather_0.shape)
# weather_0.head()

In [None]:
plot_hist(fle_0, 'mean', 'trait')

In [None]:
check_duplicates(fle_0)

In [None]:
check_for_nulls(fle_0)

In [None]:
check_for_subplots(fle_0, 'plot')

In [None]:
check_unique_values(fle_0)

#### Add planting date 2017-04-20

In [None]:
day_of_planting = datetime.date(2017,4,20)
fle_df_1 = fle_0.copy()

fle_df_1['date_of_planting'] = day_of_planting
print(fle_df_1.shape)
# fle_df_1.head(3)

#### Create timedelta using days to flag leaf emergence (`mean`)

In [None]:
timedelta = pd.Series([pd.Timedelta(days=i) for i in fle_df_1['mean'].values])
dates_of_flag_leaf_emergence = []

for td in timedelta:
    
    date_of_flag_leaf_emergence = day_of_planting + td
    dates_of_flag_leaf_emergence.append(date_of_flag_leaf_emergence)
    
print(fle_df_1.shape[0])
print(len(dates_of_flag_leaf_emergence))

In [None]:
fle_df_2 = fle_df_1.copy()
fle_df_2['date_of_flag_leaf_emergence'] = dates_of_flag_leaf_emergence
print(fle_df_2.shape)
# fle_df_2.head(3)

#### Add GDD values to flag leaf emergence dataframe

In [None]:
# slice weather df for date and cumulative gdd values only

season_4_gdd = weather_0[['date', 'gdd']]
print(season_4_gdd.shape)
# season_4_gdd.head(3)

In [None]:
fle_df_3 = fle_df_2.copy()
fle_df_3.date_of_flag_leaf_emergence = pd.to_datetime(fle_df_3.date_of_flag_leaf_emergence)
# fle_df_3.dtypes

In [None]:
season_4_gdd_1 = season_4_gdd.copy()
season_4_gdd_1['date'] = pd.to_datetime(season_4_gdd_1['date'])
season_4_gdd_1.dtypes

In [None]:
fle_df_4 = fle_df_3.merge(season_4_gdd_1, how='left', left_on='date_of_flag_leaf_emergence', right_on='date')
print(fle_df_4.shape)
# fle_df_4.head(3)

In [None]:
fle_df_5 = extract_range_column_values(fle_df_4, 'plot')
fle_df_6 = add_blocking_height(fle_df_5, 'range')

print(fle_df_6.shape)
# fle_df_6.tail(3)

In [None]:
fle_df_7 = rename_value_column(fle_df_6, 'mean', 'trait')
# fle_df_7.sample(n=3)

In [None]:
fle_df_8 = fle_df_7.rename({'flag_leaf_emergence_time': 'days_to_flag_leaf_emergence', 'gdd': 'gdd_to_flag_leaf_emergence'}, axis=1)
# fle_df_8.head(2)

In [None]:
cols_to_drop = ['date_x', 'checked', 'author', 'season', 'date_of_planting', 'date_y']

fle_df_9 = fle_df_8.drop(labels=cols_to_drop, axis=1)
print(fle_df_9.shape)
# fle_df_9.sample(n=3)

In [None]:
new_col_order = ['plot', 'range', 'column', 'scientificname', 'genotype', 'treatment', 'blocking_height', 
                 'method', 'date_of_flag_leaf_emergence', 'days_to_flag_leaf_emergence', 
                 'gdd_to_flag_leaf_emergence', 'method_type']

fle_df_10 = reorder_columns(fle_df_9, new_col_order)
print(fle_df_10.shape)
fle_df_10.head(3)

In [None]:
# save_to_csv appends the timestamp and '.csv', so the dataset name is passed without an extension
save_to_csv(fle_df_10, 'days_gdd_to_flag_leaf_emergence_season_4')