## UFO Sightings K-Means Clustering
### Modeling Lab

The goal of this notebook is to use the K-Means algorithm to analyze where Mr. K should build his extraterrestrial life facilities.

What we plan on accomplishing is the following:
1. [Load dataset onto Notebook instance from S3](#Step-1:-Loading-the-data-from-Amazon-S3)
2. [Cleaning, transforming, and preparing the data](#Step-2:-Cleaning,-transforming,-and-preparing-the-data)
3. [Create and train our model](#Step-3:-Create-and-train-our-model)
4. [Viewing the results](#Step-4:-Viewing-the-results)
5. [Visualize using QuickSight](https://docs.aws.amazon.com/quicksight/latest/user/create-a-data-set-s3.html)

First, let's import the libraries we need.

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime

import boto3
from sagemaker import get_execution_role
import sagemaker.amazon.common as smac

## Step 1: Loading the data from Amazon S3
Next, let's get the UFO sightings data that is stored in S3.

In [2]:
role = get_execution_role()
bucket = 'ml-lab-hh'
prefix = 'ufo-dataset'
data_key = 'ufo_fullset.csv'
data_location = 's3://{}/{}/{}'.format(bucket, prefix, data_key)

df = pd.read_csv(data_location, low_memory=False)

In [3]:
df.head()

Unnamed: 0,reportedTimestamp,eventDate,eventTime,shape,duration,witnesses,weather,firstName,lastName,latitude,longitude,sighting,physicalEvidence,contact,researchOutcome
0,1977-04-04T04:02:23.340Z,1977-03-31,23:46,circle,4,1,rain,Ila,Bashirian,47.329444,-122.578889,Y,N,N,explained
1,1982-11-22T02:06:32.019Z,1982-11-15,22:04,disk,4,1,partly cloudy,Eriberto,Runolfsson,52.664913,-1.034894,Y,Y,N,explained
2,1992-12-07T19:06:52.482Z,1992-12-07,19:01,circle,49,1,clear,Miller,Watsica,38.951667,-92.333889,Y,N,N,explained
3,2011-02-24T21:06:34.898Z,2011-02-21,20:56,disk,13,1,partly cloudy,Clifton,Bechtelar,41.496944,-71.367778,Y,N,N,explained
4,1991-03-09T16:18:45.501Z,1991-03-09,11:42,circle,17,1,mostly cloudy,Jayda,Ebert,47.606389,-122.330833,Y,N,N,explained


In [4]:
df.shape

(18000, 15)

## Step 2: Cleaning, transforming, and preparing the data
Create another DataFrame with just the latitude and longitude attributes.

In [5]:
df_geo = df[['latitude', 'longitude']]

In [6]:
df_geo.head()

Unnamed: 0,latitude,longitude
0,47.329444,-122.578889
1,52.664913,-1.034894
2,38.951667,-92.333889
3,41.496944,-71.367778
4,47.606389,-122.330833


In [7]:
df_geo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18000 entries, 0 to 17999
Data columns (total 2 columns):
latitude     18000 non-null float64
longitude    18000 non-null float64
dtypes: float64(2)
memory usage: 281.3 KB


In [8]:
missing_values = df_geo.isnull().values.any()
print('Are there any missing values? {}'.format(missing_values))
if missing_values:
    # A bare expression inside an if block is not displayed, so print the rows explicitly
    print(df_geo[df_geo.isnull().any(axis=1)])

Are there any missing values? False
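
Beyond missing values, it can also be worth confirming that every coordinate is physically plausible before clustering. The following is a minimal sketch, not part of the original lab: the helper name `is_valid_coordinate` is hypothetical, and it only checks that latitude lies in [-90, 90] and longitude in [-180, 180].

```python
import math

def is_valid_coordinate(lat, lon):
    """Return True if (lat, lon) is a finite, in-range geographic coordinate."""
    if not (math.isfinite(lat) and math.isfinite(lon)):
        return False  # rejects NaN and infinities
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0

# First two rows of df_geo shown above, plus two deliberately bad rows
sample = [(47.329444, -122.578889), (52.664913, -1.034894),
          (95.0, 10.0), (float('nan'), 0.0)]
flags = [is_valid_coordinate(lat, lon) for lat, lon in sample]
print(flags)  # → [True, True, False, False]
```

With the notebook's DataFrame, the same check could be applied row by row, for example with `df_geo.apply(lambda r: is_valid_coordinate(r.latitude, r.longitude), axis=1)`.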


Next, let's transform the pandas DataFrame (our dataset) into a numpy.ndarray of float32 values. When this array is later passed to `record_set`, each row is converted to a Record object. According to the documentation, this is the format the K-Means algorithm expects as training data.

[See the documentation for input training](https://sagemaker.readthedocs.io/en/stable/kmeans.html)

In [9]:
data_train = df_geo.values.astype('float32') #K-Means expects float32 instead of float64
data_train

array([[  47.329445, -122.57889 ],
       [  52.664913,   -1.034894],
       [  38.951668,  -92.333885],
       ...,
       [  36.86639 ,  -83.888885],
       [  35.385834,  -94.39833 ],
       [  29.883055,  -97.94111 ]], dtype=float32)
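
The values printed above differ slightly from the originals (for example, 47.329444 appears as 47.329445) because float32 keeps only about seven significant decimal digits. As a rough illustration that uses the standard library's `struct` module rather than NumPy, the sketch below round-trips a few coordinates through a 32-bit float and measures the error. The helper name `to_float32` is hypothetical.

```python
import struct

def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Coordinates taken from df_geo.head()
coords = [47.329444, -122.578889, -92.333889]
errors = [abs(to_float32(c) - c) for c in coords]
for c, e in zip(coords, errors):
    print('{:>12.6f} -> {:>12.6f} (abs error {:.2e})'.format(c, to_float32(c), e))
```

At these magnitudes the rounding error stays below 1e-5 degrees, which is on the order of a metre on the ground, so the conversion should not materially affect where the cluster centers land.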

## Step 3: Create and train our model
In this step we will import and use the built-in SageMaker K-Means algorithm. We will set the number of clusters to 10 (for our 10 sensors), specify the instance type we want to train on, and specify the location where we want our model artifact to live.

[See the documentation of hyperparameters here](https://docs.aws.amazon.com/sagemaker/latest/dg/k-means-api-config.html)
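
To make the idea behind the algorithm concrete, here is a minimal pure-Python sketch of the classic Lloyd's iteration for K-Means. It is an illustration only: the function `lloyd_kmeans` and its parameters are hypothetical, the toy points are loosely based on values in this notebook, and the built-in SageMaker algorithm is a different, more scalable implementation.

```python
import random

def lloyd_kmeans(points, k, iterations=20, seed=0):
    """Cluster 2-D points with plain Lloyd's algorithm; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    for _ in range(iterations):
        # Assignment step: group each point with its nearest centroid
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                        + (p[1] - centroids[i][1]) ** 2)
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group
        # (an empty group keeps its previous centroid)
        centroids = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
            if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids

# Toy data: two well-separated groups of (latitude, longitude) pairs
toy = [(47.33, -122.58), (47.61, -122.33), (47.20, -122.46),
       (52.66, -1.03), (52.77, -1.68), (52.50, -1.20)]
centers = sorted(lloyd_kmeans(toy, k=2))
print(centers)
```

Each returned centroid is simply the mean latitude and longitude of the points assigned to it, which is the same kind of value the trained model produces for each of the 10 clusters.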

In [10]:
from sagemaker import KMeans

num_clusters = 10
output_location = 's3://' + bucket + '/model-artifacts'

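# Argument names follow version 1 of the SageMaker Python SDK;
# version 2 renamed them to instance_count and instance_type.
kmeans = KMeans(role=role,
                train_instance_count=1,
                train_instance_type='ml.c4.xlarge',
                output_path=output_location,
                k=num_clusters)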

In [11]:
job_name = 'kmeans-geo-job-{}'.format(datetime.now().strftime("%Y%m%d%H%M%S"))
print('Here is the job name {}'.format(job_name))

Here is the job name kmeans-geo-job-20200703172559
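
The job name matters later, because it becomes part of the S3 key of the model artifact in Step 4. As a small sketch (the helper names are hypothetical), the snippet below builds a name the same way as the cell above and checks it against the naming pattern for SageMaker training jobs. The pattern is an assumption taken from the CreateTrainingJob API reference: 1 to 63 alphanumeric characters and hyphens, with no leading or trailing hyphen.

```python
import re
from datetime import datetime

# Assumed pattern for training job names (from the CreateTrainingJob API reference)
JOB_NAME_PATTERN = re.compile(r'^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$')

def make_job_name(prefix, now):
    """Append a sortable timestamp to the prefix, as in the cell above."""
    return '{}-{}'.format(prefix, now.strftime('%Y%m%d%H%M%S'))

def is_valid_job_name(name):
    """Return True if the name matches the assumed training job name pattern."""
    return JOB_NAME_PATTERN.match(name) is not None

example = make_job_name('kmeans-geo-job', datetime(2020, 7, 3, 17, 25, 59))
print(example, is_valid_job_name(example))
```

Underscores are rejected by this pattern, so a prefix such as `kmeans_geo_job` would fail the check.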


In [12]:
%%time
kmeans.fit(kmeans.record_set(data_train), job_name=job_name)

2020-07-03 17:26:34 Starting - Starting the training job...
2020-07-03 17:26:37 Starting - Launching requested ML instances......
2020-07-03 17:27:53 Starting - Preparing the instances for training......
2020-07-03 17:28:59 Downloading - Downloading input data
2020-07-03 17:28:59 Training - Downloading the training image.....
Docker entrypoint called with argument(s): train
Running default environment configuration script
[07/03/2020 17:29:46 INFO 140542755837760] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'_enable_profiler': u'false', u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'local_lloyd_num_trials': u'auto', u'_log_level': u'info', u'_kvstore': u'auto', u'local_lloyd_init_method': u'kmeans++', u'force_dense': u'true', u'epochs': u'1', u'init_method': u'random', u'local_lloyd_tol': u'0.0001', u'local_lloyd_max_iter': u'300', u'_disable_wait_to_read': u'false', u'extra_cente


2020-07-03 17:29:56 Uploading - Uploading generated training model
2020-07-03 17:29:56 Completed - Training job completed
Training seconds: 74
Billable seconds: 74
CPU times: user 844 ms, sys: 22.5 ms, total: 866 ms
Wall time: 3min 42s


## Step 4: Viewing the results
In this step we will take a look at the model artifact SageMaker created for us and stored in S3. We have to do a few special things to see the latitude and longitude of the center points of our 10 clusters.

[See the documentation on deserialization here](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html#td-deserialization)

At this point we need to "deserialize" the model artifact so that we can open and review it in our notebook instance. Extracting the model artifact (model.tar.gz) yields a file named model_algo-1, which is a serialized Apache MXNet object. We can load that serialized object into a numpy.ndarray and then extract the cluster centroids from it.

After we extract the results into a DataFrame of latitudes and longitudes, we can create a CSV with that data, upload it to S3, and then visualize it with QuickSight.

In [13]:
import os
model_key = 'model-artifacts/' + job_name + '/output/model.tar.gz'

boto3.resource('s3').Bucket(bucket).download_file(model_key, 'model.tar.gz')
os.system('tar -zxvf model.tar.gz')
os.system('unzip model_algo-1')

2304
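
Note that `os.system` returns the command's wait status rather than its output, and the `2304` shown above is nonzero (exit code 9 once shifted right by 8 bits). The final `unzip` call therefore appears not to have succeeded. Since a later cell loads `model_algo-1` directly, the `tar` extraction seems to be the step that matters. As an alternative sketch that avoids shelling out, the standard library's `tarfile` module can extract a single member and raise an error on failure. The helper name `extract_member` is hypothetical, and the demo builds a small stand-in archive so that it runs without S3 access.

```python
import io
import os
import shutil
import tarfile
import tempfile

def extract_member(archive_path, member_name, dest_dir):
    """Copy one file out of a .tar.gz archive and return the path written."""
    dest_path = os.path.join(dest_dir, os.path.basename(member_name))
    with tarfile.open(archive_path, 'r:gz') as tar:
        src = tar.extractfile(member_name)  # raises KeyError if the member is absent
        if src is None:
            raise ValueError('{} is not a regular file'.format(member_name))
        with src, open(dest_path, 'wb') as dst:
            shutil.copyfileobj(src, dst)
    return dest_path

# Demo with a stand-in archive; in the notebook, archive_path would be 'model.tar.gz'
work_dir = tempfile.mkdtemp()
archive = os.path.join(work_dir, 'model.tar.gz')
payload = b'placeholder bytes, not a real model'
with tarfile.open(archive, 'w:gz') as tar:
    info = tarfile.TarInfo(name='model_algo-1')
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

extracted = extract_member(archive, 'model_algo-1', work_dir)
print(os.path.basename(extracted), os.path.getsize(extracted))
```

Because a missing member raises an exception here, a failed extraction stops the notebook at this cell instead of surfacing later as a confusing load error.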

In [14]:
!pip install mxnet

Collecting mxnet
  Downloading https://files.pythonhosted.org/packages/81/f5/d79b5b40735086ff1100c680703e0f3efc830fa455e268e9e96f3c857e93/mxnet-1.6.0-py2.py3-none-any.whl (68.7MB)
Collecting graphviz<0.9.0,>=0.8.1 (from mxnet)
  Downloading https://files.pythonhosted.org/packages/53/39/4ab213673844e0c004bed8a0781a0721a3f6bb23eb8854ee75c236428892/graphviz-0.8.4-py2.py3-none-any.whl
Collecting numpy<2.0.0,>1.16.0 (from mxnet)
  Downloading https://files.pythonhosted.org/packages/00/16/476826a84d545424084499763248abbbdc73d065168efed9aa71cdf2a7dc/numpy-1.19.0-cp36-cp36m-manylinux1_x86_64.whl (13.5MB)
Installing collected packages: graphviz, numpy, mxnet
  Found existing installation: numpy 1.14.

In [26]:
import mxnet as mx
Kmeans_model_params = mx.ndarray.load('model_algo-1')

In [27]:
cluster_centroids_kmeans = pd.DataFrame(Kmeans_model_params[0].asnumpy())
cluster_centroids_kmeans.columns=df_geo.columns
cluster_centroids_kmeans

Unnamed: 0,latitude,longitude
0,41.150658,-87.130943
1,38.418583,18.648918
2,35.240528,-116.742805
3,-6.737061,121.351768
4,30.753006,-82.076332
5,47.205036,-122.457748
6,35.560467,-98.089432
7,52.768196,-1.68261
8,41.314751,-74.881958
9,30.577702,-137.583374
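
A K-Means model assigns each observation to the cluster whose centroid is closest in Euclidean distance. As a rough check of the table above, the sketch below does this by hand for the first two sightings from `df.head()`. The centroid values are copied from the output above, and the helper name `nearest_centroid` is hypothetical.

```python
# Centroids copied from the table above, in index order 0..9
centroids = [
    (41.150658, -87.130943), (38.418583, 18.648918),
    (35.240528, -116.742805), (-6.737061, 121.351768),
    (30.753006, -82.076332), (47.205036, -122.457748),
    (35.560467, -98.089432), (52.768196, -1.68261),
    (41.314751, -74.881958), (30.577702, -137.583374),
]

def nearest_centroid(lat, lon, centers):
    """Return the index of the centroid with the smallest squared Euclidean distance."""
    return min(range(len(centers)),
               key=lambda i: (lat - centers[i][0]) ** 2 + (lon - centers[i][1]) ** 2)

# First two rows of df.head()
sightings = [(47.329444, -122.578889), (52.664913, -1.034894)]
assigned = [nearest_centroid(lat, lon, centroids) for lat, lon in sightings]
print(assigned)  # → [5, 7]
```

The distance here treats latitude and longitude as plain planar coordinates, which matches what the model was trained on; it is not a great-circle distance.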


Let's upload this dataset to S3 so that we can view it in QuickSight.

In [28]:
from io import StringIO

csv_buffer = StringIO()
cluster_centroids_kmeans.to_csv(csv_buffer, index=False)
s3_resource = boto3.resource('s3')
s3_resource.Object(bucket, 'results/ten_locations_kmeans.csv').put(Body=csv_buffer.getvalue())

{'ResponseMetadata': {'RequestId': 'D0B7321BFC4C57F2',
  'HostId': '+YPh9eAN5A/R+cKU5sjDvxsyNKRTLt5RrJp8FdZC5FSUjjthKO+VLKwL/dKuHd2dRZh2JzZpUSg=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': '+YPh9eAN5A/R+cKU5sjDvxsyNKRTLt5RrJp8FdZC5FSUjjthKO+VLKwL/dKuHd2dRZh2JzZpUSg=',
   'x-amz-request-id': 'D0B7321BFC4C57F2',
   'date': 'Fri, 03 Jul 2020 17:54:37 GMT',
   'etag': '"0533aec2d276f909b947dd4472bd286a"',
   'content-length': '0',
   'server': 'AmazonS3'},
  'RetryAttempts': 0},
 'ETag': '"0533aec2d276f909b947dd4472bd286a"'}

In [29]:
# Get Folium to create the map

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')

Solving environment: done


  current version: 4.5.12
  latest version: 4.8.3

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/ec2-user/anaconda3/envs/python3

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    fribidi-1.0.9              |       h516909a_0         113 KB  conda-forge
    matplotlib-3.1.0           |           py36_0           6 KB  conda-forge
    altair-4.1.0               |             py_1         614 KB  conda-forge
    libpng-1.6.37              |       hed695b0_1         308 KB  conda-forge
    krb5-1.17.1                |       h2fd8d38_0         1.5 MB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    cryptography-2.9.2         |   py36h45558ae_0         613 KB  conda-forge
    libssh2-1.9.0    

In [32]:
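# Create the map, centered on the mean of the centroids and zoomed out,
# since the centroids span several continents
mrk_map = folium.Map(
    location=[cluster_centroids_kmeans['latitude'].mean(),
              cluster_centroids_kmeans['longitude'].mean()],
    zoom_start=2)

# Add a marker at the latitude and longitude of each k-means centroid
for lat, lng in zip(cluster_centroids_kmeans['latitude'], cluster_centroids_kmeans['longitude']):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        color='purple',
        fill=True,
        fill_color='#9900FF',
        fill_opacity=0.7).add_to(mrk_map)

mrk_map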