# About

This notebook assembles all the csv files produced by the `3_add_canopy_height_features.ipynb` notebook into a single dataframe of samples for the iceplant detection model. While assembling the files we also:
- delete any sample points with a negative canopy height feature from the dataset. Canopy height features are: lidar, max_lidar, min_lidar, avg_lidar and min_max_diff. 
- add NaN as the value for all canopy height features for points sampled in 2012 and 2014. 
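
The two cleaning steps listed above can be sketched on toy data as follows. This is only an illustration: the data frames and their values are made up, and the column names are the canopy height features listed above.

```python
import numpy as np
import pandas as pd

# Canopy height column names, as listed above
canopy_cols = ["lidar", "max_lidar", "min_lidar", "avg_lidar", "min_max_diff"]

# Toy points from years with canopy height data (values are made up)
with_lidar = pd.DataFrame({
    "year": [2018, 2018, 2020],
    "lidar": [3.0, -1.0, 0.0],
    "max_lidar": [4.0, 2.0, 1.0],
    "min_lidar": [1.0, 0.0, 0.0],
    "avg_lidar": [2.5, 1.0, 0.5],
    "min_max_diff": [3.0, 2.0, 1.0],
})

# Keep only rows where every canopy height feature is non-negative
kept = with_lidar[(with_lidar[canopy_cols] >= 0).all(axis=1)]

# Toy points from years without canopy height data
spectral_only = pd.DataFrame({"year": [2012, 2014]})
for col in canopy_cols:
    spectral_only[col] = np.nan   # canopy height is unknown for these points

combined = pd.concat([kept, spectral_only], ignore_index=True)
print(combined)
```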

To run the notebook, you need a csv file of points for each year and area of interest (except Point Conception in 2016, for which there is no data). The output is either a single csv file with all the combined samples or, if the samples are split, two csv files holding the train and test sets. The split draws the same fraction of points from every scene, so that neither set is biased toward scenes with more sampled points. The notebook also prints and saves some statistics of the resulting dataset(s).
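
The idea of drawing the same fraction of points from every scene can be illustrated with a short sketch. It uses pandas' `groupby(...).sample` on made-up data rather than the per-scene `train_test_split` loop used in this notebook, so the exact counts may differ slightly from what the notebook produces; the scene ids `scene_a` and `scene_b` are hypothetical.

```python
import pandas as pd

# Toy samples: two scenes of very different sizes (ids and labels are made up)
toy = pd.DataFrame({
    "naip_id": ["scene_a"] * 10 + ["scene_b"] * 100,
    "iceplant": [0, 1] * 55,
})

test_size = 0.3

# Draw the same fraction of rows from every scene for the test set
toy_test = toy.groupby("naip_id").sample(frac=test_size, random_state=42)

# Every row that was not drawn goes to the train set
toy_train = toy.drop(toy_test.index)

print(toy_test.naip_id.value_counts())
print(toy_train.naip_id.value_counts())
```

Because the fraction is applied scene by scene, the small scene contributes to the test set in proportion to its size instead of being crowded out by the large one.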

**VARIABLES**

`delete_files` (bool): whether to delete the individual csv files from which the final dataset is assembled

`verbose` (bool): whether to print in the console the stats of how many points were sampled per year and area of interest

`write_stats` (bool):  whether to save as a text file stats about the distribution of sample points in the dataset(s)

`split` (bool): if split=True, then the dataset is split into train and test sets, keeping the same proportion of points per scene. 

`test_size` (float in (0,1)): fraction of data samples that should go into the test set. The notebook will sample this fraction of test points from each scene. 

Add: new variables

**OUTPUT:**
- If split=False: a single csv file named 'samples_for_model.csv'. This contains all the points with all the features sampled. 
- If split=True: two csv files, 'train_set.csv' for the train set and 'test_set.csv' for the test set.
- If write_stats=True: A text file named samples_for_model_stats.txt. Nothing is generated if write_stats=False.

All files are saved in the current working directory. 

In [1]:
import numpy as np
import pandas as pd
import os

from sklearn.model_selection import train_test_split

# Specify notebook variables

In [2]:
# ***************************************************
# ************* NOTEBOOK VARIABLES ******************

delete_files = True

# print stats as notebook runs
verbose = True

# save stats
write_stats = True

split = True

test_size = 0.3
# ***************************************************
# ***************************************************

## Paths to sample points

In [3]:
def path_spectral_points_csv(aoi, year):
    """ Returns the path to the csv file of points with ONLY spectral information for the given aoi and year. """
    fp = os.path.join(os.getcwd(), 
                      'temp',
                      aoi +'_points_'+str(year)+'.csv')
    return fp            

#---------------------

def path_lidar_spectral_points_csv(aoi, year):
    """ Returns the path to the csv file of points with canopy height AND spectral information for the given aoi and year. """
    fp = os.path.join(os.getcwd(), 
                      'temp',
                      aoi +'_pts_spectral_lidar_'+str(year)+'.csv')
    return fp            
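
Since the notebook expects one csv file per area of interest and year, it can help to check which files are present before assembling them. Below is a minimal sketch of such a check. The helper `missing_sample_files` is hypothetical and not part of the notebook; it rebuilds the same file names as the two path functions above and assumes the files live in a `temp` folder under the current working directory.

```python
import os

def missing_sample_files(aois, lidar_years, spec_years, folder):
    """Hypothetical helper: list the expected csv files that do not exist in folder."""
    missing = []
    for aoi in aois:
        # files with spectral information only
        for year in spec_years:
            fp = os.path.join(folder, f"{aoi}_points_{year}.csv")
            if not os.path.exists(fp):
                missing.append(fp)
        # files with canopy height and spectral information
        for year in lidar_years:
            # no file is expected for Point Conception in 2016
            if aoi == "point_conception" and year == 2016:
                continue
            fp = os.path.join(folder, f"{aoi}_pts_spectral_lidar_{year}.csv")
            if not os.path.exists(fp):
                missing.append(fp)
    return missing

missing = missing_sample_files(
    aois=["campus_lagoon", "carpinteria", "gaviota", "point_conception"],
    lidar_years=[2016, 2018, 2020],
    spec_years=[2012, 2014],
    folder=os.path.join(os.getcwd(), "temp"),
)
print(f"{len(missing)} expected file(s) not found")
```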

## Assemble data frame with all sampled points

In [4]:
lidar_years = [2016,2018,2020]
spec_years = [2012,2014]
aois = ['campus_lagoon','carpinteria','gaviota','point_conception']

# *******************************************************************************
# Open and concatenate csv files of points with canopy height + spectral info
li = []
for aoi in aois:
    for year in lidar_years:
        if ('point_conception' != aoi) or (year != 2016):  # there's no data for Point Conception in 2016
            sample = pd.read_csv(path_lidar_spectral_points_csv(aoi,year))
            li.append(sample)

df_lidar = pd.concat(li, axis=0)

# only keep points with non-negative canopy height values
df_lidar = df_lidar.loc[(df_lidar["lidar"] >= 0) & 
                        (df_lidar["max_lidar"] >= 0) &
                        (df_lidar["min_lidar"] >= 0) &
                        (df_lidar["avg_lidar"] >= 0) &
                        (df_lidar["min_max_diff"] >= 0)
                       ]

# *******************************************************************************
# Open and concatenate csv files of points with ONLY spectral info
li = []
for aoi in aois:
    for year in spec_years:
        sample = pd.read_csv(path_spectral_points_csv(aoi,year))
        li.append(sample)

df_spec = pd.concat(li, axis=0)

# fill in canopy height columns with NaN
df_spec['lidar'] = np.nan
df_spec['max_lidar'] = np.nan
df_spec['min_lidar'] = np.nan
df_spec['avg_lidar'] = np.nan
df_spec['min_max_diff'] = np.nan

# *******************************************************************************
# concatenate both data frames and clean index and columns
samples = pd.concat([df_lidar,df_spec], axis=0)

samples.reset_index(drop=True, inplace=True)

samples = samples[['x', 'y', 'pts_crs', #  point location
         'aoi','naip_id', 'polygon_id',  # sampling info
         'r','g','b','nir','ndvi',     # spectral
         'year','month','day_in_year', # date
         'lidar', 'max_lidar', 'min_lidar', 'min_max_diff', 'avg_lidar', # lidar
         'iceplant'
         ]] 

## If split == True: split into train and test sets

In [5]:
if split:
    # initialize empty train and test lists
    all_train = []
    all_test = []

    X_labels = samples.columns.drop('iceplant')     # save label names

    aois = samples.aoi.unique()     # list of aois

    for aoi in aois:

        # retrieve all scenes from  aoi
        in_aoi = samples[samples.aoi == aoi]    
        scenes = in_aoi.naip_id.unique()

        for scene in scenes:
            # get all pts in scene
            in_scene = in_aoi[in_aoi.naip_id == scene]

            # sample test_size fraction of pts in scene for testing
            # note: the split is random within the scene and is not stratified by
            # iceplant label (pass stratify=y to keep class proportions)
            X = np.array(in_scene.drop('iceplant', axis = 1))
            y = np.array(in_scene['iceplant'])
            X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                                test_size = test_size, 
                                                                random_state = 42)   # fix random seed
            # reassemble train set into data frame
            train = pd.DataFrame(X_train, columns = X_labels)
            train['iceplant'] = y_train

            # reassemble test set into data frame
            test = pd.DataFrame(X_test, columns = X_labels)
            test['iceplant'] = y_test

            # add to rest of train/test pts
            all_train.append(train)
            all_test.append(test)
    
    train = pd.concat(all_train, ignore_index=True)
    test = pd.concat(all_test, ignore_index=True)
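
`train_test_split` splits at random unless its `stratify` argument is set. The sketch below shows, on made-up labels for a single hypothetical scene, how the iceplant/non-iceplant balance could be preserved in both subsets. It is an illustration only and not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels for one scene: 80 non-iceplant (0) and 20 iceplant (1) points
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)   # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42,   # fix random seed
    stratify=y,        # keep the label balance in both subsets
)

print("iceplant share in train:", y_train.mean())
print("iceplant share in test: ", y_test.mean())
```

Note that stratifying raises an error when a class has fewer than two points in a scene, so very small scenes would need separate handling.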
    

## Save data samples file(s)

In [6]:
if split:
    train.to_csv(os.path.join(os.getcwd(),'train_set.csv'), index=False)
    test.to_csv(os.path.join(os.getcwd(),'test_set.csv'), index=False)
    
else:
    samples.to_csv(os.path.join(os.getcwd(),'samples_for_model.csv'), index=False)

## Delete individual csv files

In [7]:
if delete_files:
    for aoi in aois:
        for year in [2012,2014,2016,2018,2020]:
            if year in spec_years:
                os.remove(path_spectral_points_csv(aoi,year))
            if year in lidar_years:
                if ('point_conception' != aoi) or (year != 2016):  # there's no data for Point Conception in 2016
                    os.remove(path_lidar_spectral_points_csv(aoi,year))

## Statistics about data distribution

In [8]:
sep = '\n\n\n'
title1 = '***** INFORMATION ABOUT COMPLETE DATASET ******' +sep
# ----------------------------

n_sample =  'total # of points sampled: '+ str(samples.shape[0]) + sep

# ----------------------------
# ratios and percentages of iceplant vs non-iceplant
unique, counts = np.unique(samples.iceplant, return_counts=True)
icep_ratio = 'no-iceplant:iceplant ratio   '+ str(round(counts[0]/counts[1],1))+':1' +sep

n = samples.iceplant.shape[0]
perc = [round(counts[0]/n*100,2), round(counts[1]/n*100,2)]
counts_percents = pd.DataFrame({'iceplant':unique,
         'counts':counts,
         'percentage':perc}).set_index('iceplant')
counts_percents = counts_percents.to_string() + sep

# ----------------------------
# Number of points by area of interest
counts_aoi = 'Points sampled per area of interest\n' + samples.aoi.value_counts().to_string() + sep

# Number of points by year
counts_year = 'Points sampled per year\n'+ samples.year.value_counts().to_string() + sep

# Number of points by NAIP scene
counts_naipid = '# NAIP scenes sampled: '+ str(len(samples.naip_id.value_counts()))+'\nPoints sampled per NAIP scene\n' +  samples.naip_id.value_counts().to_string() + sep
    

# ----------------------------
# assemble all stats into string
stats = title1 + n_sample + icep_ratio + counts_percents + counts_aoi + counts_year + counts_naipid

# ----------------------------
if split:
    title2 = '***** INFORMATION ABOUT TRAIN/TEST DATASETS ******' +sep
    size = str(test_size*100) + '% of points were included in test set\nsampling was stratified by NAIP scene\n'
    n_train =  '# of points in train set: '+ str(train.shape[0]) + '\n'
    n_test = '# of points in test set: '+ str(test.shape[0])
    stats = stats +title2+ size + n_train + n_test

# *******************************************************************************
if write_stats:
    # the with block closes the file automatically
    with open(os.path.join(os.getcwd(),'samples_for_model_stats.txt'), 'w') as f:
        f.write(stats)

# *******************************************************************************
if verbose:
    print(stats)

***** INFORMATION ABOUT COMPLETE DATASET ******


total # of points sampled: 489415


no-iceplant:iceplant ratio   1.8:1


          counts  percentage
iceplant                    
0         313132       63.98
1         176283       36.02


Points sampled per area of interest
point_conception    191958
campus_lagoon       132326
carpinteria         102089
gaviota              63042


Points sampled per year
2020    149382
2018    121902
2014     89163
2012     77224
2016     51744


# NAIP scenes sampled: 19
Points sampled per NAIP scene
ca_m_3412037_nw_10_060_20200607             80913
ca_m_3411934_sw_11_060_20180722_20190209    54945
ca_m_3412037_nw_10_1_20140603_20141030      50837
ca_m_3412037_nw_10_1_20120518_20120730      31957
ca_m_3412037_nw_10_060_20180913_20190208    28251
ca_m_3411936_se_11_060_20200521             27930
ca_m_3411934_sw_11_060_20200521             25293
ca_m_3411936_se_11_060_20180724_20190209    23820
ca_m_3411934_sw_11_.6_20160713_20161004     20278
ca_m_3