# Using DBSCAN to reduce the size of the spatial dataset

#### DBSCAN - https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

#### Nearest Neighbors - https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html

#### Overview of DBSCAN - https://medium.com/@agarwalvibhor84/lets-cluster-data-points-using-dbscan-278c5459bee5

#### DBSCAN article for the code below - https://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/

#### Notebook from above article - https://github.com/gboeing/2014-summer-travels/blob/master/clustering-scikitlearn.ipynb

------------------------------------------------------------------------------------------------
## Active fire data collected by NASA MODIS satellite

#### MODIS Collection 6: Temporal Coverage: 2003 - 2019
#### Data access: https://earthdata.nasa.gov/earth-observation-data/near-real-time/firms/c6-mcd14dl

In [None]:
# import libraries
import pandas as pd, numpy as np, matplotlib.pyplot as plt, time
import datetime
from sklearn.cluster import DBSCAN
from sklearn import metrics
from geopy.distance import great_circle
from shapely.geometry import MultiPoint
%matplotlib inline

In [None]:
# Load file into dataframe
M6df = pd.read_csv('/Users/nahidmacbook/Documents/DataScience/Data-Wildfire/fire_archive_M6_110066.csv')

# Add new columns: month, year, and day of year (DOY)
M6df['month'] = pd.DatetimeIndex(M6df['acq_date']).month
M6df['year'] = pd.DatetimeIndex(M6df['acq_date']).year
M6df['DOY'] = pd.DatetimeIndex(M6df['acq_date']).dayofyear

In [3]:
# function to encode DOY as a cyclical feature
def encode(data, col, max_val):
    data[col + '_sin'] = np.sin(2 * np.pi * data[col]/max_val)
    data[col + '_cos'] = np.cos(2 * np.pi * data[col]/max_val)
    return data

In [4]:
# encode DOY of M6df
M6df = encode(M6df, 'DOY', 365)
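
The reason for encoding DOY as a sine/cosine pair is that day 365 and day 1 are neighbors on the calendar but far apart as raw integers. The sketch below illustrates this with the standard library only; the helper names `encode_cyclical` and `cyclical_distance` are hypothetical and are not part of the notebook.

```python
import math

def encode_cyclical(value, max_val):
    """Return the (sin, cos) pair for a value on a cycle of length max_val."""
    angle = 2 * math.pi * value / max_val
    return math.sin(angle), math.cos(angle)

def cyclical_distance(a, b, max_val):
    """Euclidean distance between two encoded values on the unit circle."""
    sin_a, cos_a = encode_cyclical(a, max_val)
    sin_b, cos_b = encode_cyclical(b, max_val)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)

# Day 365 and day 1 are one day apart, and the encoding keeps them close.
gap_across_new_year = cyclical_distance(365, 1, 365)
# Days roughly half a year apart land on opposite sides of the circle.
gap_half_year = cyclical_distance(1, 183, 365)
print(gap_across_new_year, gap_half_year)
```

With the raw DOY column the first pair would differ by 364, whereas the encoded distance is about the same as between any two consecutive days.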

In [5]:
# Limit the geographical data points to the long/lat of the United States
M6df = M6df[(M6df.longitude > -161) & (M6df.longitude < -68) & (M6df.latitude > 19) & (M6df.latitude <65)]

### 1- Split data into yearly datasets and run the clustering function for the year
- To keep DBSCAN's run time manageable, feed it one year of data at a time.
- Repeat steps 1-4 for each year from 2003 through 2019.

In [6]:
'''
# use this code if running the notebook from the terminal
# (environment variables are strings, so convert the value to int)
import os
year = int(os.environ.get('ENV_VAR', 2020))
'''
year = 2011
M6df_yearly = M6df[(M6df.year == year)]
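
When the notebook is driven from the terminal, one way to pick the year is a small helper that reads and validates an environment variable. This is only a sketch: the variable name `FIRE_YEAR`, the helper `year_from_env`, and the default of 2011 are assumptions made for illustration, and the 2003-2019 bounds come from the temporal coverage stated above.

```python
import os

def year_from_env(var_name="FIRE_YEAR", default=2011, first=2003, last=2019):
    """Read the year to process from an environment variable (hypothetical name)."""
    raw = os.environ.get(var_name)
    # Environment variables are always strings, so convert before comparing.
    year = int(raw) if raw is not None else default
    if not first <= year <= last:
        raise ValueError(f"year {year} is outside the {first}-{last} coverage")
    return year

year = year_from_env()
print(year)
```

Validating the range up front avoids silently clustering an empty dataframe when the year is mistyped.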

In [7]:
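# NOTE: despite the variable name, 3956 is the Earth's radius in miles,
# so distances derived from it are in miles. Use 6371.0088 for kilometers.
kms_per_radian = 3956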
print(kms_per_radian)

3956


#### Compute DBSCAN

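- `eps` is the maximum distance between two points for them to count as neighbors. With the haversine metric it must be expressed in radians, which is why the physical distance is divided by the Earth's radius.
- `min_samples` is the minimum number of points needed to form a cluster; points that fall short are labeled noise. It is set to 1 so that every point belongs to a cluster and there is no noise.
- Extract the lat, lon columns into a NumPy array of coordinates, then convert to radians in the call to `fit`, as scikit-learn's haversine metric requires.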
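
To make the radians conversion concrete, here is a minimal standard-library sketch of the haversine formula. It is not scikit-learn's implementation; the function name `haversine_radians` and the sample coordinates are made up for illustration, and the radius is the 3956-mile value used in this notebook.

```python
import math

EARTH_RADIUS_MILES = 3956  # same constant the notebook uses

def haversine_radians(lat1, lon1, lat2, lon2):
    """Great-circle distance in radians between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * math.asin(math.sqrt(a))

# A 1.5-mile neighborhood radius expressed in radians.
eps_radians = 1.5 / EARTH_RADIUS_MILES

# Two sample points 0.01 degrees of latitude apart.
d = haversine_radians(40.0, -120.0, 40.01, -120.0)
miles = d * EARTH_RADIUS_MILES
print(miles, d <= eps_radians)
```

Multiplying the result by the radius recovers a physical distance, so the same comparison can be read either in radians or in miles.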

In [8]:
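# as_matrix() was removed from pandas; select the columns and call to_numpy() instead
coords = M6df_yearly[['latitude', 'longitude']].to_numpy()
# 1.5 distance units (miles, given the radius above) converted to radians
epsilon = 1.5 / kms_per_radian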



#### Defining the Clustering Functions

In [None]:
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
cluster_labels = db.labels_

# get the number of clusters
num_clusters = len(set(cluster_labels))

# all done, print the outcome
message = 'Clustered {:,} points down to {:,} clusters, for {:.1f}% compression in {:,.2f} seconds'
print(message.format(len(M6df_yearly), num_clusters, 100*(1 - float(num_clusters) / len(M6df_yearly)), time.time()-start_time))
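# Note: silhouette_score defaults to Euclidean distance, here on raw degrees rather than haversine,
# and it computes pairwise distances, so it can be slow on large inputs.
print('Silhouette coefficient: {:0.03f}'.format(metrics.silhouette_score(coords, cluster_labels)))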
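
With `min_samples=1` every point is a core point, so the clusters are simply the groups of points that can be linked through a chain of neighbors within `eps`. The sketch below shows that idea on toy planar points using a small union-find; it is a conceptual, quadratic-time illustration and not how scikit-learn implements DBSCAN. The function name `single_link_clusters` and the sample points are hypothetical.

```python
import math

def single_link_clusters(points, eps):
    """Label points by connected components of the 'within eps' graph."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Join every pair of points that lie within eps of each other.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dist = math.hypot(points[i][0] - points[j][0], points[i][1] - points[j][1])
            if dist <= eps:
                parent[find(i)] = find(j)

    # Number the components in order of first appearance.
    roots = {}
    labels = []
    for i in range(len(points)):
        labels.append(roots.setdefault(find(i), len(roots)))
    return labels

points = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0), (10.0, 10.0)]
labels = single_link_clusters(points, eps=1.0)
print(labels)
```

Note that the first and third points end up in the same cluster even though they are 2 units apart, because the middle point links them. The isolated point forms its own single-member cluster rather than being labeled noise.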


In [10]:
# turn the clusters into a pandas series, where each element is a cluster of points
clusters = pd.Series([coords[cluster_labels==n] for n in range(num_clusters)])

In [None]:
'''
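# Run this line only if min_samples > 1 and some points were labeled noise (-1):
# it drops the trailing empty entry that the noise label leaves in the series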
clusters.pop(num_clusters-1)
'''

### 2- Move the clustered datapoints into a dataframe

In [12]:
### Collect the points of every cluster in the Series 'clusters' into one dataframe.
def preparecluster(row, clusterseries):
    # row is the index of the first cluster to include (0 for all clusters)
    frames = []
    while row < len(clusterseries):
        # one small dataframe per cluster, tagged with its number, year and reference
        clusterDFTemp = pd.DataFrame(clusterseries[row]).assign(
            ClusterNum=row,
            Year=year,
            Cluster_Reference=str(row) + '_' + str(year))
        frames.append(clusterDFTemp)
        row = row + 1
    # DataFrame.append was removed in pandas 2.0; concatenate once at the end instead
    return pd.concat(frames, ignore_index=True)

In [13]:
### Consolidate all clusters into a single dataframe
groupedclusters = preparecluster(0,clusters)

In [16]:
# coords has two columns (latitude, longitude); the other columns were already named by assign()
mapping = {groupedclusters.columns[0]: 'Latitude', groupedclusters.columns[1]: 'Longitude'}
groupedclusters = groupedclusters.rename(columns=mapping)

In [None]:
# Store the clusters dataframe into a csv
groupedclusters.to_csv('/Users/nahidmacbook/Documents/DataScience/Data-Wildfire/TEST-ClusteredNASA-M6-'+str(year)+'.csv', encoding='utf-8')

### 3- Find the point in each cluster that is closest to its centroid

DBSCAN clusters may be non-convex. This technique just returns one representative point from each cluster. First get the lat,lon coordinates of the cluster's centroid (shapely represents the first coordinate in the tuple as x and the second as y, so lat is x and lon is y here). Then find the member of the cluster with the smallest great circle distance to the centroid.


In [None]:
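def get_centermost_point(cluster):
    # centroid of the cluster's points; x holds latitude and y holds longitude here
    centroid_point = MultiPoint(cluster).centroid
    centroid = (centroid_point.x, centroid_point.y)
    # pick the cluster member with the smallest great-circle distance to the centroid
    centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
    return tuple(centermost_point)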

centermost_points = clusters.map(get_centermost_point)
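
The same idea can be sketched without shapely or geopy, which may help when checking results by hand. This version assumes the centroid of a set of points is the plain mean of their coordinates and uses a hand-written haversine distance; the names `great_circle_radians` and `centermost_point` and the sample cluster are hypothetical.

```python
import math

def great_circle_radians(p, q):
    """Haversine distance in radians between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    dphi = phi2 - phi1
    dlambda = math.radians(q[1] - p[1])
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * math.asin(math.sqrt(a))

def centermost_point(cluster):
    """Return the member of the cluster closest to the mean of its coordinates."""
    lat_c = sum(lat for lat, _ in cluster) / len(cluster)
    lon_c = sum(lon for _, lon in cluster) / len(cluster)
    return min(cluster, key=lambda point: great_circle_radians(point, (lat_c, lon_c)))

cluster = [(40.0, -120.0), (40.1, -120.1), (40.2, -120.2), (41.0, -121.0)]
rep = centermost_point(cluster)
print(rep)
```

Because the result is always an actual member of the cluster, it can be matched back to a row of the original dataframe, which a synthetic centroid could not.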

In [None]:
# unzip the list of centermost points (lat, lon) tuples into separate latitude and longitude lists
latitude, longitude = zip(*centermost_points)

In [None]:
# from these lats/lons create a new df of one representative point for each cluster
rep_points = pd.DataFrame({'longitude':longitude, 'latitude':latitude})

### 4- Save the Cluster Points and their respective data fields into CSV

In [None]:
# pull row from original data set where lat/lon match the lat/lon of each row of representative points
# that way we get the full details like frp, brightness, and date from the original dataframe

rs = rep_points.apply(lambda row: M6df_yearly[(M6df_yearly['latitude']==row['latitude']) & (M6df_yearly['longitude']==row['longitude'])].iloc[0], axis=1)

In [None]:
# assign cluster_reference to the dataframe
rs['cluster_reference'] = rs.index.map(str)+'_'+rs['year'].map(str)

In [None]:
rs.head()

In [None]:
rs.to_csv('/Users/nahidmacbook/Documents/DataScience/Data-Wildfire/TEST-NASA-M6-DBSCAN-Clusters'+ str(year) +'.csv', encoding='utf-8')

In [None]:
# plot the final reduced set of coordinate points vs the original full set
fig, ax = plt.subplots(figsize=[10, 6])
df_scatter = ax.scatter(M6df_yearly['longitude'], M6df_yearly['latitude'], c='#FB7153', alpha=0.2, s=100)
rs_scatter = ax.scatter(rs['longitude'], rs['latitude'], c='k', edgecolor='None', alpha=1.0, s=3)
ax.set_title('Yearly NASA Active Fires - Full data set vs DBSCAN reduced set')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.legend([df_scatter, rs_scatter], ['Full set', 'Reduced set'], loc='upper right')
plt.show()