## Table of Contents
<a id='toc'></a>

### Description of Notebook

### I. [Import Python packages](#python_packages)

### II. [Functions Used](#functions)

### III. [Read in Datasets](#datasets)
  - [Aboveground Dry Biomass](#aboveground_dry_biomass)
  - [Canopy Height - Sensor](#canopy_height_sensor)

### MAC Season 6 Data Cleaning

#### Season Dates
- Planting: 2018-04-25
- Last Day of Harvest: 2018-08-01

#### Data & Code

- This notebook contains the code used to clean Season 6 sorghum data from the Maricopa Agricultural Center (MAC). Information about the input data, and access to it, can be found in this Dryad data publication [repository](https://github.com/terraref/data-publication).
- The input data were originally queried and downloaded using this `R` code:

```
library(traits)

options(betydb_url = "https://terraref.ncsa.illinois.edu/bety/",
        betydb_api_version = 'v1',
        betydb_key = 'abcde_super_secret_key_1234')

season_6 <- betydb_query(sitename  = "~Season 6",
                         limit     =  "none")

write.csv(season_6, file = 'mac_season_six_2020-04-22.csv')
```

- Environmental data for 2018 can be downloaded from the MAC weather station [website](https://cals.arizona.edu/azmet/06.htm). 
- Please email ejcain@arizona.edu with any questions or comments, or submit an issue to this GitHub [repository](https://github.com/MagicMilly/for-data-publication).

#### Plot design and blocking information can be found [here](https://terraref.ncsa.illinois.edu/bety/api/v1/experiments?name=~MAC+Season+6)

### I. Import Python packages
<a id='python_packages'></a>
Return to [Table of Contents](#toc)

In [None]:
import datetime
import numpy as np
import pandas as pd

### II. Functions Used
<a id='functions'></a>
Return to [Table of Contents](#toc)

In [None]:
def plot_hist(df, value_column, trait_column):
    
    """
    Plot an exploratory histogram to visualize the distribution of values of a specific trait.
    Returns the matplotlib Axes so the plot can be customized further.
    """
    trait_name = df[trait_column].unique()[0]
    ax = df[value_column].hist(color='navy')
    ax.set_xlabel(trait_name)
    return ax

In [None]:
# def plot_time_series(df, value_column, date_column)

In [None]:
def check_for_nulls(df):
    
    """
    Takes dataframe as argument and returns table showing sum of null values, if any.
    """
    
    return df.isnull().sum()

In [None]:
def check_duplicates(df):
    
    """
    Takes dataframe as argument and returns value counts for duplicates, if any.
    """
    
    return df.duplicated().value_counts()

In [None]:
def check_unique_values(df):
    
    """
    Function takes a dataframe as argument and checks for the number of unique values in each column.
    Prints the number of unique values for every column, and also lists the unique values for any
    column that contains fewer than 5 of them.
    """
    for col in df.columns:
        
        if df[col].nunique() < 5:
            print(f'{df[col].nunique()} unique value(s) for {col} column: {df[col].unique()}')
            
        else:
            print(f'{df[col].nunique()} values for {col} column')

In [None]:
def extract_range_column_values(working_df, plot_column):
    
    """
    To assist in plot location, function takes the working dataframe name and name of plot column. 
    Range and column values are extracted from the plot name strings and added as new columns to the 
    returned dataframe. 
    """
    
    new_df = working_df.copy()

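    # Raw strings avoid invalid-escape warnings for \d; expand=False returns a Series
    new_df['range'] = new_df[plot_column].str.extract(r"Range (\d+)", expand=False).astype(int)
    new_df['column'] = new_df[plot_column].str.extract(r"Column (\d+)", expand=False).astype(int)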
    
    return new_df
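
To show what the extraction above produces, here is a minimal sketch on two made-up plot names. The plot name strings are illustrative assumptions, not values taken from the Season 6 files. They only mimic the `Range <n> Column <n>` pattern that the function expects.

```python
import pandas as pd

# Hypothetical plot names, made up for illustration only
sample = pd.DataFrame({
    'plot': ['MAC Field Scanner Season 6 Range 12 Column 5',
             'MAC Field Scanner Season 6 Range 3 Column 14 E']
})

# expand=False returns a Series, so the result can be cast and assigned directly
sample['range'] = sample['plot'].str.extract(r'Range (\d+)', expand=False).astype(int)
sample['column'] = sample['plot'].str.extract(r'Column (\d+)', expand=False).astype(int)

print(sample[['range', 'column']].values.tolist())
# [[12, 5], [3, 14]]
```

A plot name that lacks either pattern would yield a missing value, and the cast to `int` would then raise an error, so it may be worth checking for nulls before casting.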

In [None]:
def convert_datetime_column(working_df, date_column):
    
    """
    If the date column does not contain datetime objects, function takes the working dataframe and name of the
    date column as arguments. The original date column is dropped, and a new dataframe with a converted
    datetime column named `date` is returned.
    """
    
    new_datetimes = pd.to_datetime(working_df[date_column])
    
    new_df_0 = working_df.drop(labels=date_column, axis=1)
    new_df_1 = new_df_0.copy()
    new_df_1['date'] = new_datetimes
    
    return new_df_1

In [None]:
def rename_value_column(working_df, value_column, trait_column):
    
    """
    Takes working dataframe, name of value column, and name of trait column as arguments. Returns a new dataframe
    with the name of the trait as the new name of the value column.
    """
    
    trait = working_df[trait_column].unique()[0]
    
    new_df_0 = working_df.rename({value_column: trait}, axis=1)
    new_df_1 = new_df_0.drop(labels=trait_column, axis=1)
    
    return new_df_1

In [None]:
def reorder_columns(working_df, new_col_order_list):
    
    """
    Takes working dataframe and list of new column order and returns a new dataframe with desired column order.
    """
    
    # reindex keeps only the listed columns, in the listed order
    working_df_1 = working_df.reindex(columns=new_col_order_list)
    return working_df_1

In [None]:
# def add_lat_lon(lat_lon_df, lat_lon_plot_column, working_df, working_df_plot_column):
    
#     """
#     Take the dataframe with the latitude and longitude information, the name of the plot column, the working
#     dataframe, and the name of the plot column in the working dataframe as arguments. Function will return
#     the working dataframe with latitude and longitude values for the plots as a new dataframe.
#     """
    
# use dictionaries?
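
One possible way to fill in the stub above is a left merge on plot name rather than dictionaries. This is only a sketch: the `latitude` and `longitude` column names are assumptions, since the layout of the latitude and longitude dataframe is not shown in this notebook.

```python
import pandas as pd

def add_lat_lon(lat_lon_df, lat_lon_plot_column, working_df, working_df_plot_column):
    """
    Sketch only: return a copy of working_df with latitude and longitude
    joined on plot name. The 'latitude' and 'longitude' column names are
    assumptions, not confirmed by the source data.
    """
    # Keep one coordinate row per plot so the merge cannot add rows
    lookup = (lat_lon_df[[lat_lon_plot_column, 'latitude', 'longitude']]
              .drop_duplicates(subset=lat_lon_plot_column))

    merged = working_df.merge(lookup, how='left',
                              left_on=working_df_plot_column,
                              right_on=lat_lon_plot_column)

    # Drop the extra key column when the two plot columns have different names
    if lat_lon_plot_column != working_df_plot_column:
        merged = merged.drop(columns=lat_lon_plot_column)

    return merged
```

With a left merge, plots that have no match in the lookup table keep their rows and receive missing values for the coordinates, which makes unmatched plot names easy to find afterwards.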

In [None]:
def check_for_subplots(df, plot_col):
    
    """
    Function takes a dataframe and name of plot column as arguments and checks for `E` or `W` subplot designations.
    Returns a message indicating the presence or lack of subplot designations.
    """

    # Check every plot name; returning inside the loop would only inspect the first one
    has_subplots = any(name.endswith((' E', ' W')) for name in df[plot_col].values)

    if has_subplots:
        return 'This dataset contains subplot designations.'

    return 'No subplot designations.'

In [None]:
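def save_to_csv(df, name_of_dataset):
    
    """
    Takes a dataframe and a dataset name and writes the dataframe to a timestamped `.csv` file in
    `data/processed/`. The directory must already exist.
    """
    
    timestamp = datetime.datetime.now().replace(microsecond=0).isoformat()
    output_filename = ('data/processed/' + f'{name_of_dataset}_' + f'{timestamp}.csv').replace(':', '')

    df.to_csv(output_filename, index=False)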
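
The filename logic above can be checked in isolation. The sketch below uses a hypothetical helper, `build_output_filename`, and a fixed datetime instead of the current time so the result is reproducible. The comment on why colons are removed is an assumption about intent.

```python
import datetime

def build_output_filename(name_of_dataset, now):
    """Hypothetical helper mirroring the filename logic in save_to_csv."""
    timestamp = now.replace(microsecond=0).isoformat()
    # Colons are stripped, presumably because some filesystems reject them
    return f'data/processed/{name_of_dataset}_{timestamp}.csv'.replace(':', '')

example = build_output_filename('canopy_height_season_6',
                                datetime.datetime(2018, 8, 1, 14, 30, 5))
print(example)
# data/processed/canopy_height_season_6_2018-08-01T143005.csv
```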

### III. Read in Datasets
<a id='datasets'></a>
Return to [Table of Contents](#toc)
- Raw Season Six data can be downloaded from this Google [Drive](https://drive.google.com/open?id=1THk-NQYxkkej-zdQsqM7i9t-axyS0Sug)
- Each trait - separated by method, if applicable - can be found in its own `.csv` file
- Functions applied to all datasets
    - Plot distribution of values
    - Check for null values
    - Check for duplicates
    - Extract range and column values to add to dataframe
    - Convert string date column values to datetime objects
    - Rename values column (usually 'mean') to the trait being measured
- Columns dropped from all datasets
    - `checked` 
    - `author`
    - `season`
    - `treatment`
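
The transforming steps listed above can also be chained in a single helper. The following is a minimal sketch, not the approach used in the cells below: `clean_trait` is a hypothetical name, the default column names (`plot`, `date`, `mean`, `trait`) are taken from the calls in this notebook, and the checks for nulls and duplicates are left out for brevity.

```python
import pandas as pd

COLS_TO_DROP = ['checked', 'author', 'season', 'treatment']

def clean_trait(raw_df, plot_column='plot', date_column='date',
                value_column='mean', trait_column='trait'):
    """Hypothetical helper: apply the shared cleaning steps to one trait dataframe."""
    df = raw_df.copy()

    # Extract range and column numbers from the plot name strings
    df['range'] = df[plot_column].str.extract(r'Range (\d+)', expand=False).astype(int)
    df['column'] = df[plot_column].str.extract(r'Column (\d+)', expand=False).astype(int)

    # Convert string dates to datetime objects
    df[date_column] = pd.to_datetime(df[date_column])

    # Rename the value column after the single trait the file holds
    trait = df[trait_column].unique()[0]
    df = df.rename(columns={value_column: trait}).drop(columns=trait_column)

    # errors='ignore' tolerates files that lack some of these columns
    return df.drop(columns=COLS_TO_DROP, errors='ignore')
```

Keeping the steps as separate functions, as this notebook does, makes it easier to inspect the dataframe shape after each one. A combined helper is mainly useful once the steps are known to behave the same way for every trait file.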

#### A. Aboveground Dry Biomass
<a id='aboveground_dry_biomass'></a>
Return to [Table of Contents](#toc)

In [None]:
adb_0 = pd.read_csv('data/raw/season_6_traits/season_6_aboveground_dry_biomass_manual.csv')
print(adb_0.shape)
# adb_0.head()

In [None]:
plot_hist(adb_0, 'mean', 'trait')

In [None]:
check_for_nulls(adb_0)

In [None]:
check_duplicates(adb_0)

In [None]:
# check_unique_values(adb_0)

In [None]:
adb_1 = extract_range_column_values(adb_0, 'plot')
print(adb_1.shape)
# adb_1.sample(n=3)

In [None]:
adb_2 = convert_datetime_column(adb_1, 'date')
print(adb_2.shape)
# adb_2.head()

In [None]:
adb_3 = rename_value_column(adb_2, 'mean', 'trait')
print(adb_3.shape)
# adb_3.tail()

In [None]:
cols_to_drop = ['checked', 'author', 'season', 'treatment']

adb_4 = adb_3.drop(labels=cols_to_drop, axis=1)
print(adb_4.shape)
# adb_4.head(3)

##### Add units (kg/ha) column to aboveground dry biomass dataset

In [None]:
adb_5 = adb_4.copy()
adb_5['units'] = 'kg/ha'

print(adb_5.shape)
# adb_5.tail(3)

In [None]:
new_col_order = ['date', 'plot', 'range', 'column', 'scientificname', 'genotype', 'method', 
                 'aboveground_dry_biomass', 'units', 'method_type']

adb_6 = reorder_columns(adb_5, new_col_order)
print(adb_6.shape)
adb_6.head(3)

#### Save dataframe to `.csv` if needed

In [None]:
save_to_csv(adb_6, name_of_dataset='aboveground_dry_biomass_season_6')

#### B. Canopy Height - Sensor
<a id='canopy_height_sensor'></a>
Return to [Table of Contents](#toc)

In [None]:
ch_0 = pd.read_csv('data/raw/season_6_traits/season_6_canopy_height_sensor.csv')
print(ch_0.shape)
# ch_0.head()

In [None]:
# check_unique_values(ch_0)

In [None]:
check_for_nulls(ch_0)

In [None]:
check_for_subplots(ch_0, 'plot')

In [None]:
check_duplicates(ch_0)

In [None]:
# Inspect Duplicates

# ch_0.loc[ch_0.duplicated() == True][:5]

In [None]:
# ch_0.iloc[1473]

In [None]:
# ch_0.loc[(ch_0.genotype == 'PI179749') & (ch_0['date'] == '2018-07-20') & (ch_0['mean'] == 350)]

#### Drop duplicates

In [None]:
ch_1 = ch_0.drop_duplicates(ignore_index=True)
print(ch_1.shape)
check_duplicates(ch_1)

In [None]:
# plot_hist(ch_1, 'mean', 'trait')

In [None]:
ch_2 = extract_range_column_values(ch_1, 'plot')
print(ch_2.shape)
# ch_2.sample(n=3)

In [None]:
ch_3 = convert_datetime_column(ch_2, 'date')
print(ch_3.shape)
# ch_3.dtypes

In [None]:
ch_4 = rename_value_column(ch_3, 'mean', 'trait')
print(ch_4.shape)
# ch_4.tail(3)

In [None]:
# add units (cm) to column name

ch_5 = ch_4.rename({'canopy_height': 'canopy_height_cm'}, axis=1)
# ch_5.sample(n=3)

In [None]:
ch_6 = ch_5.drop(labels=['checked', 'author', 'season', 'treatment'], axis=1)
print(ch_6.shape)

In [None]:
new_col_order = ['date', 'plot', 'range', 'column', 'scientificname', 'genotype', 'method', 'canopy_height_cm',
                 'method_type']

ch_7 = reorder_columns(ch_6, new_col_order)
print(ch_7.shape)
ch_7.head(3)

#### Save to `.csv` if needed

In [None]:
save_to_csv(ch_7, 'canopy_height_season_6')