# About

This notebook extracts spectral and date features from NAIP images at random points within the polygons in the 'polygons_form_naip_images' folder. Each polygon marks a known ice plant or non-ice plant location within a specific NAIP image.


Once the areas of interest (`aois`) and years are specified, the notebook first samples the polygons labeled as ice plant locations and then the polygons labeled as non-ice plant locations. Two sampling methods are implemented in the `extracting_points_from_polygons` module, and both are used in this notebook. The first, `naip_sample_proportion`, samples a fixed fraction of the points in each polygon. The second, `naip_sample_sliding`, samples a fixed fraction of the points in each polygon, up to a maximum number of points. Polygons vary greatly in size, so sampling only a fixed fraction of the points in each polygon would over-sample the bigger polygons (most often those corresponding to non-ice plant locations), which in turn would unbalance the training set towards one label. For this reason the non-ice plant polygons are sampled with the capped method. The parameters used in this notebook were chosen to obtain a final training set with a 3:1 proportion of non-ice plant to ice plant points.
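The difference between the two sampling rules can be illustrated by the number of points each would draw from a single polygon. The sketch below is not the code in `extracting_points_from_polygons`: the helper names `n_proportion` and `n_sliding`, and the use of `ceil` for rounding, are assumptions made only for illustration.

```python
import math

def n_proportion(n_points, sample_fraction):
    """Hypothetical: number of points drawn when sampling a fixed fraction."""
    return math.ceil(n_points * sample_fraction)

def n_sliding(n_points, sample_fraction, max_sample):
    """Hypothetical: same fraction, but capped at max_sample points."""
    return min(n_proportion(n_points, sample_fraction), max_sample)

# A small polygon is unaffected by the cap; a large polygon is limited by it.
for n_points in (200, 50_000):
    print(n_points,
          n_proportion(n_points, 0.9),
          n_sliding(n_points, 0.9, 1000))
```

With a cap in place, a very large polygon contributes at most `max_sample` points, which keeps a few big polygons from dominating the sample.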

**NOTEBOOK VARIABLES:**

- `aois` (array): These are the areas of interest from which the polygons we want to sample were collected. Must be a subset of: `['campus_lagoon','carpinteria','gaviota','point_conception']`. 

- `years` (array): any subset of `[2012, 2014, 2016, 2018, 2020]`. If `aoi` is 'point_conception', 2016 is skipped because there is no data to sample from for Point Conception in that year.

- `sample_fraction` (float in (0,1]): fraction of points to sample from each polygon

- `max_sample` (int): maximum number of points to sample from a polygon

- `verbose` (bool): whether to print, as the notebook runs, how many points were sampled per year and aoi

- `write_stats` (bool): whether to save as a csv the stats of how many points were sampled from each aoi and year
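The constraints on these variables can be checked before any sampling is done. The following is a minimal sketch, not part of the notebook or its modules: `check_variables` and `aoi_year_pairs` are hypothetical helpers, and the allowed values are copied from the list above.

```python
VALID_AOIS = ['campus_lagoon', 'carpinteria', 'gaviota', 'point_conception']
VALID_YEARS = [2012, 2014, 2016, 2018, 2020]

def check_variables(aois, years, sample_fraction, max_sample):
    """Hypothetical helper: raise ValueError if a notebook variable is invalid."""
    if not set(aois) <= set(VALID_AOIS):
        raise ValueError(f"aois must be a subset of {VALID_AOIS}")
    if not set(years) <= set(VALID_YEARS):
        raise ValueError(f"years must be a subset of {VALID_YEARS}")
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")
    if max_sample < 1:
        raise ValueError("max_sample must be a positive integer")

def aoi_year_pairs(aois, years):
    """Hypothetical helper: (aoi, year) pairs to sample, skipping Point Conception 2016."""
    return [(aoi, year) for aoi in aois for year in years
            if (aoi, year) != ('point_conception', 2016)]

check_variables(['campus_lagoon', 'carpinteria'], [2018, 2020], 0.9, 1000)
print(aoi_year_pairs(['gaviota', 'point_conception'], [2016, 2018]))
```

Building the list of pairs up front is one way to express the Point Conception 2016 exception in a single place.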

**OUTPUT:**

The output is a dataframe of points with the following features:

- geometry: coordinates of point *p* in the NAIP image's CRS
- naip_id: itemid of the NAIP image from which *p* was sampled
- polygon_id: id of the polygon from which *p* was sampled
- iceplant: whether point *p* corresponds to a confirmed iceplant location or a confirmed non-iceplant location (0 = non-iceplant, 1 = iceplant)
- r, g, b, nir: Red, Green, Blue and NIR band values of the NAIP scene with naip_id at the coordinates of point *p*
- ndvi: computed for each point using the Red and NIR bands
- year, month, day_in_year: year, month and day of the year when the NAIP image was collected
- aoi: name of the area of interest where the points were sampled from
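NDVI is the normalized difference of the NIR and Red bands, (NIR - Red)/(NIR + Red). A minimal sketch with made-up band values follows. It assumes the bands are stored as unsigned 8-bit integers, in which case casting to a signed type before subtracting avoids wrap-around when `r > nir`. Returning NaN where both bands are zero is a choice made for this sketch, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up band values for three points (not taken from any NAIP scene)
pts = pd.DataFrame({'r':   np.array([100, 200, 0], dtype='uint8'),
                    'nir': np.array([200, 100, 0], dtype='uint8')})

# Cast to a signed type so the subtraction cannot wrap around
nir = pts['nir'].astype('int16')
r = pts['r'].astype('int16')
total = nir + r

# NDVI = (NIR - Red) / (NIR + Red); zeros in the denominator become NaN
pts['ndvi'] = (nir - r) / total.where(total != 0)
print(pts)
```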


The dataframe for each aoi and year is then saved in the 'temp' folder as a csv file. Filenames have the structure `aoi_points_year.csv`.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os
import geopandas as gpd

import extracting_points_from_polygons as pp
import utility

# Specify notebook variables

In [5]:
# ***************************************************
# ************* NOTEBOOK VARIABLES ******************

#aois = ['campus_lagoon','carpinteria','gaviota','point_conception']
aois = ['campus_lagoon','carpinteria']


# years = array of years, can be any subset from [2012, 2014, 2016, 2018, 2020]
#years = [2012, 2014, 2016, 2018, 2020]
years = [2018,2020]

# sample 90% of pts in each polygon
sample_fraction = 0.9

# maximum number of pts to sample in a polygon
max_sample = 1000

# print stats as notebook runs
verbose = False

# save stats
write_stats = True


# ***************************************************
# ***************************************************

# Sample points

In [4]:
# initialize list of sampling statistics
if write_stats:
    stats = []

# sample points
for aoi in aois:
    for year in years:
        
        if ('point_conception' != aoi) or (year != 2016):  # there is no data for Point Conception in 2016
            
            # open polygons
            fp = pp.path_to_polygons(aoi,year)
            polys = gpd.read_file(fp)
            
            # select iceplant polygons and sample sample_fraction of pts in each polygon
            # (reset_index on the new selection avoids modifying a slice of polys in place)
            polys_ice = polys.loc[polys.iceplant==1].reset_index(drop=True)

            pts_ice = pp.naip_sample_proportion_no_warnings(polys_ice, 
                                                            polys.naip_id[0], 
                                                            sample_fraction)  
            
            # select non-iceplant polygons and sample sample_fraction of pts in each polygon, but at most max_sample points
            polys_nonice = polys.loc[polys.iceplant==0].reset_index(drop=True)

            pts_nonice = pp.naip_sample_sliding_no_warnings(polys_nonice, polys.naip_id[0], 
                                                            sample_fraction, 
                                                            max_sample)
            # assemble into single dataframe
            pts = pd.concat([pts_ice,pts_nonice])

            # add name of aoi
            pts['aoi'] = aoi
            
            # add ndvi as feature: (nir - r)/(nir + r), with bands cast to int16 first
            nir = pts.nir.astype('int16')
            red = pts.r.astype('int16')
            pts['ndvi'] = (nir - red)/(nir + red)

           
            # create temp directory if needed
            tmp_path = os.path.join(os.getcwd(),'temp')
            if not os.path.exists(tmp_path):
                os.makedirs(tmp_path)
            
            # save points as csv in temp folder
            fp = os.path.join(os.getcwd(), 
                              'temp', 
                              aoi + '_points_'+str(year)+'.csv')
            pts.to_csv(fp, index=False)
            
            # print sample statistics
            if verbose:
                # print sample information
                print('************ '+aoi+ ' ' +str(year)+' ************')
                utility.iceplant_proportions(pts.iceplant)
                print( '---------------------------------------')
                
            if write_stats:
                n_ice =  pts_ice.shape[0]
                n_nonice =  pts_nonice.shape[0]
                total = n_ice + n_nonice
                
                stat = [aoi, 
                     year, 
                     str(round(n_nonice/n_ice,1))+':1', 
                     round(n_ice/total*100,2),
                     round(n_nonice/total*100,2),
                     n_ice,
                     n_nonice
                    ]
                stats.append(stat)
                
                
if write_stats:     
    stats_df = pd.DataFrame(stats, 
                            columns=['aoi', 'year', 'ratio','perc_ice','perc_nonice','n_ice','n_nonice'])

    # save stats as csv in temp folder
    fp = os.path.join(os.getcwd(), 
                      'temp', 
                      'stats_1_sampling_pts_from_polygons.csv')
    stats_df.to_csv(fp, index=False)

************ campus_lagoon 2018 ************
no-iceplant:iceplant ratio     6.6 :1
          counts  percentage
iceplant                    
0          47693       86.79
1           7257       13.21

---------------------------------------
************ campus_lagoon 2020 ************
no-iceplant:iceplant ratio     1.9 :1
          counts  percentage
iceplant                    
0          17000       65.98
1           8767       34.02

---------------------------------------
************ carpinteria 2018 ************
no-iceplant:iceplant ratio     2.7 :1
          counts  percentage
iceplant                    
0          17448        73.2
1           6388        26.8

---------------------------------------
************ carpinteria 2020 ************
no-iceplant:iceplant ratio     1.7 :1
          counts  percentage
iceplant                    
0          17448       62.33
1          10547       37.67

---------------------------------------
