# Using the Random Library to Generate Synthetic Data
### Randomly pulling 1% of the original AST dataset
The AST dataset was created by our researcher and has not been published yet, so the original is not included in this repository. Instead, we agreed to share a random 1% sample of it, which this notebook produces. This notebook *will not run* as-is because the original AST dataset is not included in the repository. We've kept the code for transparency and explanation. The synthetic AST data used throughout this repository is located in ```/data/source_files/ast_files```.

### Import statements

In [1]:
import pandas as pd
import geopandas as gpd
import os
import random



### Setting ```DATA_DIR```
To read files from this repository, ```DATA_DIR``` must point to the data folder within this repository. The cell below derives it from the current working directory, so ```os.getcwd()``` must return the path to the ```processing``` folder of this repository, i.e. ```xxx/codeplus-celine-dcc-package/processing```, where ```xxx``` is the path to where you cloned this repository. If it does not, call ```os.chdir(path)``` with ```path``` set to ```xxx/codeplus-celine-dcc-package/processing``` before running the cell.

In [4]:
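# Start from the notebook's folder (expected to be .../processing)
DATA_DIR = os.getcwd()
# Swap the notebook folder name for the data folder name
DATA_DIR = DATA_DIR.replace('processing', 'data')
DATA_DIR

'/hpc/home/at341/ondemand/codeplus-celine-dcc-package/data'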
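
String replacement on a path silently does nothing when the expected folder name is absent. As a hedged alternative, the sketch below resolves the data folder from the parent of the working directory and fails loudly if the notebook is run from somewhere else. The helper name `resolve_data_dir` and its parameters are hypothetical and not part of this repository.

```python
from pathlib import PurePath, PurePosixPath


def resolve_data_dir(cwd: PurePath,
                     notebook_folder: str = "processing",
                     data_folder: str = "data") -> PurePath:
    """Return the data folder that sits next to the notebook folder.

    Raises ValueError if cwd is not the expected notebook folder,
    instead of quietly returning an unchanged path.
    """
    if cwd.name != notebook_folder:
        raise ValueError(
            f"expected to run from a '{notebook_folder}' folder, got {cwd}"
        )
    return cwd.parent / data_folder


# Illustration with the placeholder clone location 'xxx' used in the text above
example_cwd = PurePosixPath("/xxx/codeplus-celine-dcc-package/processing")
print(resolve_data_dir(example_cwd))  # /xxx/codeplus-celine-dcc-package/data
```

In a notebook, `resolve_data_dir(Path.cwd())` (with `from pathlib import Path`) would play the role of the `os.getcwd()` call.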

### Reading AST data

In [2]:
df_tanks = gpd.read_file('/hpc/group/codeplus22-vis/ast_dataset/tile_level_annotations.shp')
df_tanks.head(n=3)

Unnamed: 0,tile_name,minx_polyg,miny_polyg,maxx_polyg,maxy_polyg,nw_corner_,nw_corne_1,se_corner_,se_corne_1,object_cla,diameter (,merged_bbo,bbox_withi,Category1,Category2,Category3,Category4,Category5,state,geometry
0,m_4007327_nw_18_060_20190809,974,314,1041,380,40.625753,-73.745466,40.625392,-73.744997,closed_roof_tank,39.6,1,0,0.0,0.0,0.0,0.0,0.0,New York,"POLYGON ((-73.74547 40.62575, -73.74500 40.625..."
1,m_4007327_nw_18_060_20190809,1091,479,1157,512,40.624853,-73.744652,40.624669,-73.744188,closed_roof_tank,19.8,0,0,0.0,0.0,0.0,0.0,0.0,New York,"POLYGON ((-73.74465 40.62485, -73.74419 40.624..."
2,m_4007327_nw_18_060_20190809,851,243,872,265,40.626147,-73.746331,40.626026,-73.746184,closed_roof_tank,12.6,0,0,0.0,0.0,0.0,0.0,0.0,New York,"POLYGON ((-73.74633 40.62615, -73.74618 40.626..."


### Using pandas' ```.sample()``` to randomly select 1% of rows from the original AST dataset
The sample is then saved as a shapefile. The copy used in this repository is in ```/data/source_files/ast_files```.

In [5]:
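# DataFrame.sample() draws from NumPy's generator, so random.seed() has no effect on it;
# passing random_state is what makes the 1% draw repeatable.
df_tanks_sample = df_tanks.sample(frac = 0.01, random_state = 1)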
df_tanks_sample = df_tanks_sample.reset_index(drop = True)
df_tanks_sample.to_file('/hpc/group/codeplus22-vis/synthetic_data/synthetic_ast/ast_synthetic.shp')
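
Note that `DataFrame.sample()` takes its randomness from NumPy rather than from Python's `random` module, so a repeatable draw needs the `random_state` argument. The following is a minimal sketch on a hypothetical stand-in table. The helper `sample_fraction`, the column names, and the values are made up for illustration and are not taken from the AST dataset.

```python
import pandas as pd

# Hypothetical stand-in for the AST table: 1000 rows of made-up values
df_demo = pd.DataFrame({
    "tile_name": [f"tile_{i:04d}" for i in range(1000)],
    "diameter": [float(i % 50) for i in range(1000)],
})


def sample_fraction(frame: pd.DataFrame, frac: float, seed: int) -> pd.DataFrame:
    """Draw a repeatable random fraction of rows and renumber the index."""
    return frame.sample(frac=frac, random_state=seed).reset_index(drop=True)


first = sample_fraction(df_demo, frac=0.01, seed=1)
second = sample_fraction(df_demo, frac=0.01, seed=1)

# The same seed yields the same rows in the same order
print(first.equals(second))
# 1% of 1000 rows
print(len(first))
```

Because `GeoDataFrame` inherits `.sample()` from pandas, the same argument applies to a table read with `gpd.read_file`.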

In [3]:
df_tanks = gpd.read_file('/hpc/group/codeplus22-vis/synthetic_data/synthetic_ast/ast_synthetic.shp')
df_tanks.head(n=3)

Unnamed: 0,tile_name,minx_polyg,miny_polyg,maxx_polyg,maxy_polyg,nw_corner_,nw_corne_1,se_corner_,se_corne_1,object_cla,diameter (,merged_bbo,bbox_withi,Category1,Category2,Category3,Category4,Category5,state,geometry
0,m_3009139_ne_15_060_20190726,283,280,291,315,30.502086,-91.188319,30.501896,-91.188273,closed_roof_tank,4.8,0,0,0.0,0.0,0.0,0.0,0.0,Louisiana,"POLYGON ((-91.18832 30.50209, -91.18827 30.502..."
1,m_2909005_ne_15_060_20190707,7130,2364,7181,2414,29.990328,-90.396031,29.990051,-90.395721,closed_roof_tank,30.0,1,0,0.0,0.0,0.0,0.0,0.0,Louisiana,"POLYGON ((-90.39603 29.99033, -90.39572 29.990..."
2,m_3408350_ne_17_060_20191023,4910,5661,4944,5696,34.221846,-83.783836,34.221662,-83.783608,closed_roof_tank,20.4,1,0,0.0,0.0,0.0,0.0,0.0,Georgia,"POLYGON ((-83.78384 34.22185, -83.78361 34.221..."
