## UFO Sightings Data Preparation

The goal of this notebook is to analyze where Mr. K should build his extraterrestrial life facilities using the K-Means algorithm.

What we plan on accomplishing is the following:
1. [Load dataset onto Notebook instance from S3](#Step-1:-Loading-the-data-from-Amazon-S3)
2. [Cleaning, transforming, and preparing the data](#Step-2:-Cleaning,-transforming,-and-preparing-the-data)
3. [Create and train our model](#Step-3:-Create-and-train-our-model)
4. [Viewing the results](#Step-4:-Viewing-the-results)
5. Visualize using QuickSight

First let's go ahead and import all the needed libraries.

In [72]:
%matplotlib inline
import pandas as pd
import numpy as np
from datetime import datetime
import os

import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

import boto3
# from sagemaker import get_execution_role

## Step 1: Loading the data from Amazon S3
Next, let's get the UFO sightings data that is stored in S3.

In [2]:
# role = get_execution_role()
# bucket='<INSERT_BUCKET_NAME_HERE>'
# prefix = 'ufo_dataset'
# data_key = 'ufo_fullset.csv'
# data_location = 's3://{}/{}/{}'.format(bucket, prefix, data_key)

# df = pd.read_csv(data_location, low_memory=False)
df = pd.read_csv('/Users/brocktubre/Desktop/Projects/ufo-dataset-generator/ufo_fullset.csv', low_memory=False)

In [3]:
df.head()

Unnamed: 0,reportedTimestamp,eventDate,eventTime,shape,duration,witnesses,weather,firstName,lastName,latitude,longitude,sighting,physicalEvidence,contact,researchOutcome
0,1994-07-21T12:41:23.364Z,1994-07-14,06:16,box,45,1,stormy,Sylvia,Herman,29.38421,-98.581082,Y,N,N,explained
1,1989-10-29T16:37:18.047Z,1989-10-28,13:19,disk,4,1,rain,Carlee,Klein,29.38421,-98.581082,Y,N,N,explained
2,1996-04-15T00:21:54.064Z,1996-04-14,20:30,sphere,26,1,partly cloudy,Dudley,Welch,51.434722,-3.18,Y,N,N,unexplained
3,1981-07-12T07:17:43.824Z,1981-07-07,21:32,sphere,100,1,partly cloudy,Terence,Oberbrunner,29.38421,-98.581082,Y,N,N,explained
4,1971-06-13T22:55:50.423Z,1971-06-09,14:27,pyramid,89,1,partly cloudy,Cordie,Waelchi,30.294722,-82.984167,Y,N,N,explained


## Step 2: Cleaning, transforming, and preparing the data
Create another DataFrame with just the latitude and longitude attributes.

In [4]:
# Creates a DataFrame from original dataset with just two columns, latitude and longitude
df_geo = df[['latitude', 'longitude']]

In [5]:
# Helpful method to see information about our DataFrame
df_geo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300001 entries, 0 to 300000
Data columns (total 2 columns):
latitude     300001 non-null float64
longitude    300001 non-null float64
dtypes: float64(2)
memory usage: 4.6 MB


In [6]:
# Let's check to see if there are any missing values
print('Are there any missing values? {}'.format(df_geo.isnull().values.any()))

Are there any missing values? False


Next, let's transform the pandas DataFrame (our dataset) into a numpy.ndarray of 32-bit floats. According to the documentation, this is the form the SageMaker K-Means algorithm expects for training data; when the array is handed to the algorithm, each row is converted to a Record object. This array is what we will use as training data for our model.

[See the documentation for input training](https://sagemaker.readthedocs.io/en/stable/kmeans.html)

In [77]:
# This dataset will be used to train the model
# df_geo_small = df_geo[:18000]
data_train = df_geo.values.astype('float32')

In [57]:
# # Plot graph
# figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
# plt.scatter(x=df_geo['longitude'],y=df_geo['latitude'])
# plt.title('Scatter Plot Lat/Long')
# plt.xlabel('Long')
# plt.ylabel('Lat')
# plt.show()

## Step 3: Create and train our model
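In this step we will import and use the built-in SageMaker K-Means algorithm. We will set the number of clusters to 10 (for our 10 sensors), specify the instance type we want to train on, and the location where we want our model artifact to live. In this version of the notebook the SageMaker cells are commented out; the cells that actually run use scikit-learn's DBSCAN locally instead, with the parameters listed after the import cell.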

[See the documentation of parameters here](https://sagemaker.readthedocs.io/en/stable/kmeans.html)

In [58]:
from sklearn.cluster import DBSCAN
from shapely.geometry import MultiPoint
from geopy.distance import great_circle

Parameters:
- eps          (The maximum distance between two samples for them to be considered as in the same neighborhood)
- min_samples  (The number of samples in a neighborhood for a point to be considered as a core point.)
- metric       (The metric to use when calculating distance between instances in a feature array.)
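
To make the units of `eps` concrete: with `metric='haversine'`, scikit-learn works on (lat, lon) pairs in radians and measures distance as an angle on the unit sphere, so a radius in kilometres has to be divided by the Earth's radius. The sketch below uses only the standard library to illustrate that conversion. `haversine_radians` and the `nearby` point are hypothetical; the other two coordinates are taken from the `df.head()` output.

```python
import math

KMS_PER_RADIAN = 6371.0088  # mean Earth radius in km, the same constant the notebook uses


def haversine_radians(p, q):
    """Great-circle distance between two (lat, lon) points given in degrees,
    returned as an angle in radians on the unit sphere."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * math.asin(math.sqrt(a))


# A 1.5 km neighborhood radius expressed in radians
epsilon = 1.5 / KMS_PER_RADIAN

first_sighting = (29.38421, -98.581082)  # row 0 of the df.head() output
nearby = (29.39, -98.58)                 # hypothetical point well under 1.5 km away
third_sighting = (51.434722, -3.18)      # row 2 of the df.head() output

# Points closer than epsilon count as neighbors for DBSCAN
print(haversine_radians(first_sighting, nearby) <= epsilon)
print(haversine_radians(first_sighting, third_sighting) <= epsilon)
```

Multiplying the returned angle by `KMS_PER_RADIAN` gives the distance back in kilometres.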

In [None]:
kms_per_radian = 6371.0088  # mean Earth radius in km
epsilon = 1.5 / kms_per_radian  # 1.5 km neighborhood radius, expressed in radians
# The haversine metric expects (lat, lon) pairs in radians
db = DBSCAN(eps=epsilon, min_samples=10, algorithm='ball_tree', metric='haversine').fit(np.radians(data_train))
cluster_labels = db.labels_  # one label per row; -1 marks noise
num_clusters = 10
# Collect the points that belong to cluster labels 0 through num_clusters - 1
clusters = pd.Series([data_train[cluster_labels == n] for n in range(num_clusters)])
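
DBSCAN chooses the number of clusters itself and marks noise points with the label `-1`, so labels `0` through `9` are not guaranteed to be the ten largest clusters, or even to all exist. A minimal sketch of one alternative, picking the most populated labels, follows. `largest_cluster_labels` and the toy label list are hypothetical and not part of the notebook.

```python
from collections import Counter


def largest_cluster_labels(labels, k):
    """Return up to k cluster labels ordered from largest to smallest cluster,
    ignoring the DBSCAN noise label (-1)."""
    counts = Counter(label for label in labels if label != -1)
    return [label for label, _ in counts.most_common(k)]


# Toy labels standing in for db.labels_
toy_labels = [0, 0, 0, 1, -1, 2, 2, 2, 2, -1, 1, 3]
top_two = largest_cluster_labels(toy_labels, k=2)
print(top_two)
```

In the notebook, something like `largest_cluster_labels(db.labels_, num_clusters)` could then replace `range(num_clusters)` when building `clusters`, assuming the ten largest clusters are the ones of interest.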

In [68]:
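def get_centermost_point(cluster):
    # Rows are (lat, lon), so the centroid's x is latitude and its y is longitude
    centroid_point = MultiPoint(cluster).centroid
    centroid = (centroid_point.x, centroid_point.y)
    # Pick the cluster member closest to the centroid by great-circle distance
    centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
    return tuple(centermost_point)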

centermost_points = clusters.map(get_centermost_point)
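
The idea behind `get_centermost_point` can be seen on a tiny example without shapely or geopy: take the mean of the cluster's coordinates as the centroid, then return the member point closest to it. This is a simplified sketch rather than the notebook's implementation. `centermost_point`, `haversine_km`, and the toy coordinates are hypothetical, and a plain coordinate mean is assumed to be an adequate centroid for points that lie close together.

```python
import math


def haversine_km(p, q, radius_km=6371.0088):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def centermost_point(cluster):
    """Return the member of `cluster` that lies closest to the cluster's mean position."""
    centroid = (sum(lat for lat, _ in cluster) / len(cluster),
                sum(lon for _, lon in cluster) / len(cluster))
    return min(cluster, key=lambda point: haversine_km(point, centroid))


# Toy cluster of (lat, lon) pairs
toy_cluster = [(29.38, -98.58), (29.40, -98.60), (29.42, -98.49), (29.39, -98.57)]
print(centermost_point(toy_cluster))
```

Returning an actual member of the cluster, rather than the centroid itself, means each representative location corresponds to a place where a sighting was really reported.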

In [75]:
# unzip the list of centermost points (lat, lon) tuples into separate lat and lon lists
lats, lons = zip(*centermost_points)

# from these lats/lons create a new df of one representative point for each cluster
cluster_centroids = pd.DataFrame({'latitude':lats, 'longitude':lons})
cluster_centroids

Unnamed: 0,longitude,latitude
0,-98.581085,29.384211
1,-122.330833,47.606388
2,151.205475,-33.861481
3,-117.156387,32.715279
4,-118.242775,34.052223
5,-98.493332,29.423889
6,-122.675003,45.523613
7,-97.742775,30.266945
8,-74.006386,40.714169
9,-104.984169,39.739166


In [76]:
from io import StringIO

csv_buffer = StringIO()
cluster_centroids.to_csv(csv_buffer, index=False)
with open('ten_locations_DBSCAN.csv', 'w') as file:
    file.write(csv_buffer.getvalue())
# s3_resource = boto3.resource('s3')
# s3_resource.Object(bucket, 'results/ten_locations_.csv').put(Body=csv_buffer.getvalue())

In [None]:
# from sagemaker import KMeans

# num_clusters = 10
# output_location = 's3://' + bucket + '/model-artifacts/'

# kmeans = KMeans(role=role,
#                 train_instance_count=1,
#                 train_instance_type='ml.c4.xlarge',
#                 output_path=output_location,              
#                 k=num_clusters)

In [None]:
# # Create a training job name
# job_name = 'kmeans-geo-job-{}'.format(datetime.now().strftime("%Y%m%d%H%M%S"))
# print('Here is the job name {}'.format(job_name))

In [None]:
# %%time
# kmeans.fit(kmeans.record_set(data_train), job_name=job_name)

## Step 4: Viewing the results
In this step we will take a look at the model artifact SageMaker created for us and stored in S3. We have to do a few special things to see the latitude and longitude for our 10 clusters (and the center points of those clusters).

[See the documentation of parameters here](https://sagemaker.readthedocs.io/en/stable/kmeans.html)

At this point we need to "deserialize" the model artifact. Here we are going to open and review it in our notebook instance. We can extract the model artifact (model.tar.gz), which contains model_algo-1. This is just a serialized Apache MXNet object. From here we can load that serialized object into a numpy.ndarray and then extract the cluster centroids from the numpy.ndarray.

After we extract the results into a DataFrame of latitudes and longitudes, we can create a CSV with that data, upload it to S3, and then visualize it with QuickSight.
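
The unpacking step can also be done with Python's standard library instead of shell commands. The sketch below assumes, as the commented-out cell that runs `tar` and `unzip` suggests, that `model.tar.gz` contains a zip archive named `model_algo-1`. The function name `unpack_model_artifact` and the `unzipped` subdirectory are hypothetical choices, not part of the notebook.

```python
import tarfile
import zipfile
from pathlib import Path


def unpack_model_artifact(tar_path, dest_dir):
    """Extract model.tar.gz into dest_dir, then unzip the model_algo-1 archive
    found inside it into dest_dir/unzipped.

    Returns the list of file names contained in the inner zip archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Rough equivalent of `tar -zxvf model.tar.gz`
    with tarfile.open(tar_path, mode='r:gz') as tar:
        if hasattr(tarfile, 'data_filter'):
            # Newer Python versions support extraction filters for safer unpacking
            tar.extractall(dest, filter='data')
        else:
            tar.extractall(dest)

    # Rough equivalent of `unzip model_algo-1`, written to a separate folder so the
    # archive is never overwritten by one of its own members
    unzipped = dest / 'unzipped'
    with zipfile.ZipFile(dest / 'model_algo-1') as inner:
        inner.extractall(unzipped)
        return inner.namelist()
```

A call such as `unpack_model_artifact('model.tar.gz', 'model')` would leave the extracted files under `model/unzipped`, from where they could be loaded with MXNet as in the cells that follow.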

In [None]:
# # Let's grab the model artifact (model.tar.gz) from S3. 
# model_key = 'model-artifacts/' + job_name + '/output/model.tar.gz'

# # We can unzip our model artifact and store the result on our Notebook instance.
# boto3.resource('s3').Bucket(bucket).download_file(model_key, 'model.tar.gz')
# os.system('tar -zxvf model.tar.gz')
# os.system('unzip model_algo-1')

In [None]:
# # We actually need to install the MXNet Python libraries to use it
# !pip install mxnet

In [None]:
# # Load the serialized artifact into an mx.ndarray
# import mxnet as mx
# Kmeans_model_params = mx.ndarray.load('model_algo-1')

In [None]:
# cluster_centroids=pd.DataFrame(Kmeans_model_params[0].asnumpy())
# cluster_centroids.columns=df_geo.columns

In [None]:
# cluster_centroids

In [None]:
# from io import StringIO

# csv_buffer = StringIO()
# cluster_centroids.to_csv(csv_buffer, index=False)
# s3_resource = boto3.resource('s3')
# s3_resource.Object(bucket, 'results/ten_locations.csv').put(Body=csv_buffer.getvalue())
