### Load libraries

In [1]:
# import pandas as pd
import os
import errno
import yaml
import time

# import function files
import sys
file = 'etl_functions.py'
sys.path.insert(0,os.path.dirname(os.path.abspath(file)))
import etl_functions

### Define variables

In [2]:
config_et_file_name = 'config_extract-transform.yml'
item_ADDITIONAL_INFORMATION_FILES = 'ADDITIONAL_INFORMATION_FILES'
item_FILES_TO_PROCESS = 'FILES_TO_PROCESS'
item_JOIN_FILES_COMMON_COLUMNS = 'JOIN_FILES_COMMON_COLUMNS'
item_COLUMNS_TO_DROP = 'COLUMNS_TO_DROP' 
item_UPDATE_ROW_VALUES = 'UPDATE_ROW_VALUES'
item_NEW_COLUMNS = 'NEW_COLUMNS'
item_OUTPUT_FILE_NAME = 'OUTPUT_FILE_NAME'
item_UPDATE_COLUMN_NAMES = 'UPDATE_COLUMN_NAMES'
item_UPDATE_PRIMARY_KEY_VALUES = 'UPDATE_PRIMARY_KEY_VALUES'
item_PRIMARY_KEY_COLUMN = 'PRIMARY_KEY_COLUMN'
item_CREATE_PRIMARY_KEY_IF_NEEDED = 'CREATE_PRIMARY_KEY_IF_NEEDED'
item_TRANSPOSE_COLUMNS_TO_ONE_COLUMN = 'TRANSPOSE_COLUMNS_TO_ONE_COLUMN'
item_NAMES_NEW_UNIQUE_COLUMNS = 'NAMES_NEW_UNIQUE_COLUMNS'

# For deleting rows with invalid values 
additional_config_file = 'additional_configurations.yml'
item_INVALID_VALUES = 'INVALID_VALUES'

### 1. Extraction

#### 1.1 Compare DataFrame column names
Compares the column names of all files listed in the _FILES_TO_PROCESS_ block of the **config_extract-transform.yml** file.

In [3]:
etl_functions.compare_column_names(config_et_file_name , item_FILES_TO_PROCESS)

Files to compare: 
['example_file_to_process_width-height_1.csv', 'example_file_to_process_width-height_2.csv', 'example_file_to_process_width-height_3.csv', 'example_file_to_process_width-height_4.csv']


Processing files: 100%|██████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 52.97it/s]


Output explanation: 
 
 ([List columns on first file], 
 [List columns NOT on first file])






(['id_observation',
  'observation_name',
  'plot',
  'rep',
  'experiment',
  'treatment',
  'season',
  'measurment',
  'sampling_identifier',
  'Height (cm)',
  'Width (cm)',
  'picture_Plot',
  'picture_Experiment',
  'Date Of Measurement',
  'notes_Plot'],
 ['date', 'heigth (cm)', 'pictureofexperiment', 'pictureofplot'])
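
The tuple above can be reproduced with a few lines of plain Python. This is a minimal sketch of the idea, not the code in `etl_functions`: the name `compare_headers` is hypothetical, and returning the second list in sorted order is an assumption based on the output shown.

```python
def compare_headers(headers_by_file):
    """Return (columns of the first file, columns seen only in later files).

    `headers_by_file` is a list of column-name lists, one per input file.
    The comparison is exact and case-sensitive.
    """
    first = list(headers_by_file[0])
    known = set(first)
    extra = set()
    for columns in headers_by_file[1:]:
        # Collect every name that the first file does not have.
        extra.update(c for c in columns if c not in known)
    return first, sorted(extra)


first, extra = compare_headers([
    ["id_observation", "Height (cm)", "Date Of Measurement"],
    ["id_observation", "heigth (cm)", "date"],
])
print(first)
print(extra)
```

Because the comparison is exact, a misspelling such as `heigth (cm)` lands in the second list, which makes it easy to spot names that need an entry in the _UPDATE_COLUMN_NAMES_ block.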

#### 1.2 Join data files and concatenate additional information 
Concatenates all files listed in the _FILES_TO_PROCESS_ block, then joins the result with the additional information files listed in the _ADDITIONAL_INFORMATION_FILES_ block. The join uses the columns listed in the _JOIN_FILES_COMMON_COLUMNS_ block of the **config_extract-transform.yml** file.

**If the join columns do not have the same names in the additional information files and the data files, rename them with the _UPDATE_COLUMN_NAMES_ block.**

Column names are converted to lowercase and spaces are replaced with underscores.
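
The concatenate-then-join pattern described above might look like the following in pandas. This is only a sketch under stated assumptions: the three small frames stand in for the CSV files, `plot` stands in for the common column, and the left join (`how="left"`) is an assumption, not something the notebook confirms.

```python
import pandas as pd

# Hypothetical stand-ins for two data files and one additional-information file.
data_1 = pd.DataFrame({"plot": [1, 2], "height_(cm)": [16.51, 39.37]})
data_2 = pd.DataFrame({"plot": [3], "height_(cm)": [38.1]})
plot_info = pd.DataFrame({"plot": [1, 2, 3], "range": [2, 2, 3], "entry": [3, 24, 7]})

# Stack the data files row-wise, then attach the extra columns on the shared key.
stacked = pd.concat([data_1, data_2], ignore_index=True)
joined = stacked.merge(plot_info, on=["plot"], how="left")
print(joined)
```

A left join keeps every observation row even when the additional information file has no matching key, so the row count before and after the join is a quick sanity check.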

In [4]:
extract_and_join_files = etl_functions.extract_and_join_files(config_et_file_name, item_FILES_TO_PROCESS, item_UPDATE_COLUMN_NAMES, item_ADDITIONAL_INFORMATION_FILES, item_JOIN_FILES_COMMON_COLUMNS)

Files to process: 
['example_file_to_process_width-height_1.csv', 'example_file_to_process_width-height_2.csv', 'example_file_to_process_width-height_3.csv', 'example_file_to_process_width-height_4.csv']


Processing files: 100%|██████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 60.18it/s]


New dataframe info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1400 entries, 0 to 1399
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id_observation       1400 non-null   object 
 1   observation_name     1400 non-null   object 
 2   plot                 1400 non-null   int64  
 3   rep                  1400 non-null   int64  
 4   experiment_name      1400 non-null   object 
 5   treatment            1400 non-null   object 
 6   season               1400 non-null   object 
 7   measurment           1400 non-null   object 
 8   sampling_identifier  1400 non-null   object 
 9   height_(cm)          1397 non-null   float64
 10  width_(cm)           1392 non-null   float64
 11  picture_plot         7 non-null      object 
 12  picture_experiment   4 non-null      object 
 13  date                 1400 non-null   object 
 14  notes_plot           1 non-null      object 
 15  pictureofplot   




### 2. Transformation

#### 2.1 Drop unwanted columns
Drops the columns you do not want to keep in the final DataFrame, using the column names listed in the _COLUMNS_TO_DROP_ block of the **config_extract-transform.yml** file.
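
In pandas terms, this step amounts to a single `DataFrame.drop` call. The sketch below uses a hypothetical three-column frame; whether `etl_functions` tolerates missing names the same way is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"plot": [1], "notes_plot": ["ok"], "measurment": ["width-height"]})
columns_to_drop = ["notes_plot", "measurment", "picture_plot"]  # last one is absent

# errors="ignore" skips names that are not in the frame instead of raising KeyError.
trimmed = df.drop(columns=columns_to_drop, errors="ignore")
print(list(trimmed.columns))
```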

In [5]:
drop_not_used_columns = etl_functions.drop_not_used_columns(config_et_file_name, item_COLUMNS_TO_DROP, extract_and_join_files)

Columns to drop: ['pictureofplot', 'pictureofexperiment', 'picture_plot', 'picture_experiment', 'notes_plot', 'measurment']


Dropping columns: 100%|████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 1995.39it/s]


New dataframe info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1400 entries, 0 to 1399
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id_observation       1400 non-null   object 
 1   observation_name     1400 non-null   object 
 2   plot                 1400 non-null   int64  
 3   rep                  1400 non-null   int64  
 4   experiment_name      1400 non-null   object 
 5   treatment            1400 non-null   object 
 6   season               1400 non-null   object 
 7   sampling_identifier  1400 non-null   object 
 8   height_(cm)          1397 non-null   float64
 9   width_(cm)           1392 non-null   float64
 10  date                 1400 non-null   object 
 11  range                1400 non-null   int64  
 12  entry                1400 non-null   int64  
dtypes: float64(2), int64(4), object(7)
memory usage: 142.3+ KB
None





#### 2.2 Update column names
Renames columns as specified in the _UPDATE_COLUMN_NAMES_ block of the **config_extract-transform.yml** file.

Column names are converted to lowercase and spaces are replaced with underscores.
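
The renaming and normalization rules can be sketched in plain Python. The helper name `rename_and_normalize` and the order of operations (rename first, normalize second) are assumptions; the sketch only illustrates the rules stated above.

```python
def rename_and_normalize(columns, mapping):
    """Apply `mapping` (old name -> new name), then lowercase and replace spaces.

    Returns the new column list and the mapping keys that matched no column.
    """
    missing = [old for old in mapping if old not in columns]
    renamed = [mapping.get(name, name) for name in columns]
    normalized = [name.strip().lower().replace(" ", "_") for name in renamed]
    return normalized, missing


cols, missing = rename_and_normalize(
    ["Height (cm)", "experiment_name", "date"],
    {"experiment": "experiment_name", "Date Of Measurement": "date"},
)
print(cols)
print(missing)
```

The `missing` list plays the role of the `NOT FOUND` messages in the output of this step. In this notebook those messages are probably harmless: step 1.2 also receives the _UPDATE_COLUMN_NAMES_ block, so the old names may already have been replaced by the time this cell runs.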

In [6]:
update_column_names = etl_functions.update_column_names(config_et_file_name, item_UPDATE_COLUMN_NAMES, drop_not_used_columns)

Updating column names: 100%|███████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3283.64it/s]

Column 'experiment' NOT FOUND in the DataFrame.
Column 'Date Of Measurement' NOT FOUND in the DataFrame.
Column 'heigth (cm)' NOT FOUND in the DataFrame.

New dataframe info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1400 entries, 0 to 1399
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   date                 1400 non-null   object
 1   entry                1400 non-null   object
 2   experiment_name      1400 non-null   object
 3   height_(cm)          1397 non-null   object
 4   id_observation       1400 non-null   object
 5   observation_name     1400 non-null   object
 6   plot                 1400 non-null   object
 7   range                1400 non-null   object
 8   rep                  1400 non-null   object
 9   sampling_identifier  1400 non-null   object
 10  season               1400 non-null   object
 11  treatment            1400 non-null   object
 12  width_(cm)           1392 




#### 2.3 Insert new columns

Inserts new columns and fills every row with the value specified in the _NEW_COLUMNS_ block of the **config_extract-transform.yml** file.
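
A constant-valued column like this can be added with `DataFrame.assign`. The frame and the `new_columns` dictionary below are hypothetical stand-ins for the real data and the _NEW_COLUMNS_ block.

```python
import pandas as pd

df = pd.DataFrame({"plot": [1, 2, 3]})
new_columns = {"crop": "Wheat"}  # column name -> constant value

# Each scalar value is broadcast to every row of the frame.
df = df.assign(**new_columns)
print(df)
```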

In [7]:
add_new_columns = etl_functions.add_new_columns(config_et_file_name, item_NEW_COLUMNS, update_column_names)

Adding new columns: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 250.14it/s]

The following column was inserted: 'crop' : 'Wheat'

New dataframe info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1400 entries, 0 to 1399
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   date                 1400 non-null   object
 1   entry                1400 non-null   object
 2   experiment_name      1400 non-null   object
 3   height_(cm)          1397 non-null   object
 4   id_observation       1400 non-null   object
 5   observation_name     1400 non-null   object
 6   plot                 1400 non-null   object
 7   range                1400 non-null   object
 8   rep                  1400 non-null   object
 9   sampling_identifier  1400 non-null   object
 10  season               1400 non-null   object
 11  treatment            1400 non-null   object
 12  width_(cm)           1392 non-null   object
 13  crop                 1400 non-null   object
dtypes: object(14)
memory usage: 15




#### 2.4 Update row values
Replaces row values as specified in the _UPDATE_ROW_VALUES_ block of the **config_extract-transform.yml** file.
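
One way to express this kind of value replacement is `DataFrame.replace` with a dictionary. The sketch below is an assumption about the approach, using a hypothetical frame and a subset of the replacements reported in the output of this step.

```python
import pandas as pd

df = pd.DataFrame({
    "observation_name": ["a", "b"],
    "treatment": ["Early", "Late"],
    "id_observation": ["1_a_ACRE-Biomass_Early", "1_b_ACRE-Biomass_Late"],
})
replacements = {
    "a": "A",
    "b": "B",
    "Early": "Early planting date",
    "Late": "Late planting date",
}

# With a plain dict, DataFrame.replace matches whole cell values only,
# so substrings inside id_observation are left untouched.
updated = df.replace(replacements)
print(updated)
```

Whole-value matching is consistent with the output of this step, where `observation_name` becomes `A` while `id_observation` still starts with `1_a_`.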

In [8]:
update_column_values = etl_functions.update_row_values(config_et_file_name, item_UPDATE_ROW_VALUES, add_new_columns)

Updating column values: 100%|███████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 177.80it/s]

The value 'a' was replaced by 'A'
The value 'b' was replaced by 'B'
The value 'c' was replaced by 'C'
The value 'd' was replaced by 'D'
The value 'Early' was replaced by 'Early planting date'
The value 'Late' was replaced by 'Late planting date'
The value 'ACRE-Biomass' was replaced by 'ACRE Public Biomass'
The value 'y22' was replaced by 'Summer 2022'

New dataframe head: 

        date entry      experiment_name height_(cm)  \
0  7/12/2022     3  ACRE Public Biomass       16.51   
1  7/12/2022     3  ACRE Public Biomass       24.13   
2  7/12/2022    24  ACRE Public Biomass       39.37   
3  7/12/2022    24  ACRE Public Biomass       26.67   
4  7/12/2022     7  ACRE Public Biomass        38.1   

                                      id_observation observation_name plot  \
0  1_a_ACRE-Biomass_Early_y22_width-height_sampli...                A    1   
1  1_b_ACRE-Biomass_Early_y22_width-height_sampli...                B    1   
2  2_a_ACRE-Biomass_Early_y22_width-height_sampli...     




#### 2.5 Transpose multiple columns to a single column
Moves the values from the columns listed in the _TRANSPOSE_COLUMNS_TO_ONE_COLUMN_ block into a single column and creates a column that records the measurement name and units of each value. Rows whose value is empty are deleted.
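
This wide-to-long reshaping is what `DataFrame.melt` does. The sketch below is a hedged illustration, not the `etl_functions` implementation: the temporary column `measurement`, the output name `variable_name`, and the regular expression that splits `height_(cm)` into a name and a unit are all assumptions.

```python
import pandas as pd

wide = pd.DataFrame({
    "plot": [1, 2],
    "height_(cm)": [16.51, None],
    "width_(cm)": [10.0, 12.5],
})

# One output row per (input row, measured column) pair.
long = wide.melt(
    id_vars=["plot"],
    value_vars=["height_(cm)", "width_(cm)"],
    var_name="measurement",
    value_name="variable_value",
)

# Delete rows whose value is empty, as the step reports doing.
long = long.dropna(subset=["variable_value"]).reset_index(drop=True)

# Split "height_(cm)" into a name ("height") and a unit ("cm").
parts = long["measurement"].str.extract(
    r"^(?P<variable_name>.+)_\((?P<variable_units>.+)\)$"
)
long = pd.concat([long.drop(columns="measurement"), parts], axis=1)
print(long)
```

Melting two columns doubles the row count before empty values are removed, which matches the numbers in this notebook: 1400 rows become 2800, and deleting the 11 empty rows leaves the 2789 entries reported in step 2.6.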

In [9]:
transpose_columns = etl_functions.transpose_multiple_columns_to_a_single_column(config_et_file_name, additional_config_file,  item_TRANSPOSE_COLUMNS_TO_ONE_COLUMN, 
                                              item_NAMES_NEW_UNIQUE_COLUMNS, item_INVALID_VALUES, update_column_values)

Transforming columns: 100%|█████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 363.05it/s]

11 empty VARIABLE VALUE ROWS were found in your data frame.

EMPTY VARIABLE VALUE ROWS found were DELETED!

New dataframe head: 

  observation_name sampling_identifier       season range       date entry  \
0                A          sampling-1  Summer 2022     2  7/12/2022     3   
1                B          sampling-1  Summer 2022     2  7/12/2022     3   
2                A          sampling-1  Summer 2022     2  7/12/2022    24   
3                B          sampling-1  Summer 2022     2  7/12/2022    24   
4                A          sampling-1  Summer 2022     2  7/12/2022     7   

             treatment rep                                     id_observation  \
0  Early planting date   1  1_a_ACRE-Biomass_Early_y22_width-height_sampli...   
1  Early planting date   1  1_b_ACRE-Biomass_Early_y22_width-height_sampli...   
2  Early planting date   1  2_a_ACRE-Biomass_Early_y22_width-height_sampli...   
3  Early planting date   1  2_b_ACRE-Biomass_Early_y22_width-height_sampli...




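#### 2.6 Delete repeated rows and create a primary key if needed

Deletes duplicate rows and, if requested in the _CREATE_PRIMARY_KEY_IF_NEEDED_ block of the config file, creates a primary key. This step is mainly useful for data that were not collected using [AgTC](https://github.com/Purdue-LuisVargas/AgTC).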
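
A minimal pandas sketch of both operations follows. The key recipe (joining selected columns with underscores) and the `key_parts` list are hypothetical; the notebook does not show how `etl_functions` builds a key, and in the run below the block is empty, so no key is created.

```python
import pandas as pd

df = pd.DataFrame({
    "plot": [1, 1, 2],
    "observation_name": ["A", "A", "B"],
    "variable_value": [16.51, 16.51, 24.13],
})

# Drop rows that are identical in every column.
df = df.drop_duplicates().reset_index(drop=True)

# Hypothetical key recipe: join selected columns with underscores,
# but only when the frame has no key column yet.
key_parts = ["plot", "observation_name"]
if "id_observation" not in df.columns:
    df["id_observation"] = df[key_parts].astype(str).agg("_".join, axis=1)
print(df)
```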

In [10]:
create_primary_key = etl_functions.delete_duplicate_and_create_primary_key(config_et_file_name, item_CREATE_PRIMARY_KEY_IF_NEEDED, 
                                                                item_PRIMARY_KEY_COLUMN, transpose_columns)

No values provided for the CREATE_PRIMARY_KEY_IF_NEEDED block in the configuration file. No updates performed.
Dataframe info: 

<class 'pandas.core.frame.DataFrame'>
Index: 2789 entries, 0 to 1399
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   observation_name     2789 non-null   object
 1   sampling_identifier  2789 non-null   object
 2   season               2789 non-null   object
 3   range                2789 non-null   object
 4   date                 2789 non-null   object
 5   entry                2789 non-null   object
 6   treatment            2789 non-null   object
 7   rep                  2789 non-null   object
 8   id_observation       2789 non-null   object
 9   experiment_name      2789 non-null   object
 10  plot                 2789 non-null   object
 11  crop                 2789 non-null   object
 12  variable_value       2789 non-null   object
 13  variable_units       2789 non

#### 2.7 Update primary key values
Replaces parts of the primary key string as specified in the _UPDATE_PRIMARY_KEY_VALUES_ block. This is useful when more than one trait is collected using the same template.
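
The update is essentially an ordered series of substring replacements on each key. The sketch below uses a hypothetical helper, `update_key`, and shortened hypothetical keys; it is not the code in `etl_functions`.

```python
def update_key(key, replacements):
    """Apply each (old, new) substring replacement to `key`, in order."""
    for old, new in replacements:
        key = key.replace(old, new)
    return key


# Narrow a combined-trait key to a single trait.
print(update_key("1_a_width-height_s1", [("width-height", "height")]))

# Caution: if `old` is a prefix of `new`, text that already contains
# `new` gets extended a second time.
print(update_key("sampling-1", [("sampl", "sampling")]))
```

The second call prints `samplinging-1`. A replacement of that shape may explain the `samplinging-1` suffix in the output of this step, so it is worth checking the entries in the _UPDATE_PRIMARY_KEY_VALUES_ block for overlaps.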

In [11]:
update_primary_key_values = etl_functions.update_primary_key_values(config_et_file_name, item_UPDATE_PRIMARY_KEY_VALUES, 
                                                                    item_PRIMARY_KEY_COLUMN, create_primary_key)

Updating key: 100%|█████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 659.02it/s]

New dataframe head: 

  observation_name sampling_identifier       season range       date entry  \
0                A          sampling-1  Summer 2022     2  7/12/2022     3   
1                B          sampling-1  Summer 2022     2  7/12/2022     3   
2                A          sampling-1  Summer 2022     2  7/12/2022    24   
3                B          sampling-1  Summer 2022     2  7/12/2022    24   
4                A          sampling-1  Summer 2022     2  7/12/2022     7   

             treatment rep                                   id_observation  \
0  Early planting date   1  1_a_ACRE-Biomass_Early_y22_height_samplinging-1   
1  Early planting date   1  1_b_ACRE-Biomass_Early_y22_height_samplinging-1   
2  Early planting date   1  2_a_ACRE-Biomass_Early_y22_height_samplinging-1   
3  Early planting date   1  2_b_ACRE-Biomass_Early_y22_height_samplinging-1   
4  Early planting date   1  3_a_ACRE-Biomass_Early_y22_height_samplinging-1   

       experiment_name plot   crop




#### 2.8 Export the final data frame 
Exports the final data frame to the location specified in the _OUTPUT_FILE_NAME_ block of the **config_extract-transform.yml** file.
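
The output file name in this notebook carries a `YYYYmmdd-HHMMSS` prefix. A name like that can be built with the standard library, as sketched below; the helper `timestamped_path` is hypothetical, and the exact naming logic in `etl_functions` is an assumption inferred from the file name in the output.

```python
import os
import time


def timestamped_path(output_dir, file_name, when=None):
    """Build '<output_dir>/<YYYYmmdd-HHMMSS>_<file_name>'."""
    stamp = time.strftime("%Y%m%d-%H%M%S", when or time.localtime())
    return os.path.join(output_dir, f"{stamp}_{file_name}")


# A fixed time makes the example reproducible; omit `when` to use the current time.
when = time.strptime("2024-05-16 12:48:16", "%Y-%m-%d %H:%M:%S")
path = timestamped_path("./et_output", "example_output_file.csv", when)
print(path)
```

Before writing, the output directory can be created with `os.makedirs(output_dir, exist_ok=True)`, and the frame saved with `DataFrame.to_csv(path, index=False)`. The timestamp prefix keeps earlier exports from being overwritten.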

In [12]:
etl_functions.export_dataframe(config_et_file_name, item_OUTPUT_FILE_NAME, update_primary_key_values)

Dataframe: ./et_output/20240516-124816_example_output_file.csv created successfully!
