<a href="https://colab.research.google.com/github/Tam1979/TATA-ML/blob/master/w2_3b_uber.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clustering : K-Means : Uber Pickups

This lab clusters Uber pickup locations in New York City.  
The data comes from this [Kaggle dataset](https://www.kaggle.com/fivethirtyeight/uber-pickups-in-new-york-city).

Sample data looks like this:
```
"Date_Time","Lat","Lon","Base"
"4/1/2014 0:11:00",40.769,-73.9549,"B02512"
"4/1/2014 0:17:00",40.7267,-74.0345,"B02512"
"4/1/2014 0:21:00",40.7316,-73.9873,"B02512"
"4/1/2014 0:28:00",40.7588,-73.9776,"B02512"
```

In [0]:
%matplotlib inline

import time
from matplotlib import pyplot
from sklearn.cluster import KMeans
import pandas as pd


## Step 1: Load the Data
We can also specify a schema to reduce loading time.

In [0]:
# file to read

## sample file with 10,000 records
data_file="https://s3.amazonaws.com/elephantscale-public/data/uber-nyc/uber-sample-10k.csv"

## larger file with about 500k records
# data_file = "https://s3.amazonaws.com/elephantscale-public/data/uber-nyc/uber-raw-data-apr14.csv.gz"


### Specify Schema
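
The read step below relies on pandas type inference. A minimal sketch of passing an explicit schema through the `dtype` argument of `pd.read_csv` might look like the following. The inline `sample_csv` text and the particular dtype choices are assumptions for illustration; the column names follow the sample rows shown at the top.

```python
import io
import pandas as pd

# Inline sample (copied from the sample rows above) so the sketch runs offline.
sample_csv = io.StringIO(
    '"Date_Time","Lat","Lon","Base"\n'
    '"4/1/2014 0:11:00",40.769,-73.9549,"B02512"\n'
    '"4/1/2014 0:17:00",40.7267,-74.0345,"B02512"\n'
)

# Assumed schema: coordinates as 64-bit floats, everything else as strings.
schema = {"Date_Time": "str", "Lat": "float64", "Lon": "float64", "Base": "str"}

sample_df = pd.read_csv(sample_csv, dtype=schema)
print(sample_df.dtypes)
```

With the real file, the same `dtype=schema` argument could be passed alongside `data_file`.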

### Read Data

In [0]:
uber_pickups = pd.read_csv(data_file)

records_count_total = len(uber_pickups)
uber_pickups

Unnamed: 0,Date_Time,Lat,Lon,Base
0,9/25/2014 15:28:00,40.7633,-73.9402,B02598
1,9/5/2014 3:50:00,40.7441,-74.0067,B02617
2,9/24/2014 13:39:00,40.7408,-73.9916,B02617
3,9/18/2014 0:31:00,40.7396,-74.0023,B02617
4,9/20/2014 11:22:00,40.7441,-73.9919,B02617
5,9/26/2014 17:03:00,40.6943,-73.9239,B02764
6,9/24/2014 18:59:00,40.8526,-73.8435,B02617
7,9/23/2014 16:33:00,40.7146,-74.0087,B02617
8,9/17/2014 22:16:00,40.7751,-73.9092,B02682
9,9/6/2014 11:13:00,40.7191,-73.9754,B02682


## Step 2: Cleanup data
Make sure our data is clean by dropping rows with missing `Lat` or `Lon` values.

In [0]:
uber_pickups_clean = uber_pickups.dropna(subset=['Lat', 'Lon'])
records_count_clean = len(uber_pickups_clean)

print ("cleaned records {:,},  dropped {:,}".format(records_count_clean,  (records_count_total - records_count_clean)))

cleaned records 9,999,  dropped 0


## Step 3 : Create Feature Vectors

In [0]:
## TODO : create feature vectors using the 'Lat' and 'Lon' attributes
# Select both columns at once so each row is a 2-D (Lat, Lon) point
featureVector = uber_pickups_clean[['Lat', 'Lon']]

## Step 4 : Run K-Means

In [0]:
## TODO : start with 4 clusters
num_clusters = 4
kmeans = KMeans(n_clusters=num_clusters, n_init=1)

# time.perf_counter() takes no arguments
t1 = time.perf_counter()
## TODO : fit (featureVector)
model = kmeans.fit(featureVector)
t2 = time.perf_counter()

# inertia_ is the within-cluster sum of squared errors (WSSSE)
wssse = model.inertia_


print("Kmeans : {} clusters computed in {:,.2f} ms".format( num_clusters,  ((t2-t1)*1000)))
print ("num_clusters = {},  WSSSE = {:,}".format(num_clusters, wssse))

## Step 5: Let's find the best K - Hyperparameter tuning

Let's try iterating and plotting over values of k, so we can practice using the elbow method.
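
The quantity tracked in the elbow method is the WSSSE, which scikit-learn exposes as `inertia_`. As a rough illustration of what that number measures, the sketch below recomputes it by hand on synthetic points. The centers, spread, and variable names are made up for the example and are not the Uber data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points around three made-up centers (not the Uber data).
rng = np.random.default_rng(0)
true_centers = np.array([[40.70, -74.00], [40.75, -73.98], [40.80, -73.95]])
points = np.vstack([c + 0.005 * rng.standard_normal((100, 2)) for c in true_centers])

demo_model = KMeans(n_clusters=3, n_init=1, random_state=0).fit(points)

# WSSSE by hand: squared distance from each point to its assigned center.
assigned = demo_model.cluster_centers_[demo_model.labels_]
manual_wssse = ((points - assigned) ** 2).sum()

print(manual_wssse, demo_model.inertia_)
```

The two printed values should agree up to floating-point rounding. Adding clusters can only shrink this sum, which is why we look for the point where the curve flattens rather than for its minimum.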


In [0]:
kvals = []
wssses = []

## TODO : loop over k values from 2 to 10
for k in range(2, 11):  # the end of range() is exclusive, so this covers k = 2..10
    kmeans = KMeans(n_clusters=k, n_init=1)
    t1 = time.perf_counter()
    model = kmeans.fit(featureVector)
    t2 = time.perf_counter()
    wssse = model.inertia_
    print ("k={},  wssse={},  time took {:,.2f} ms".format(k,wssse, ((t2-t1)*1000)))
    kvals.append(k)
    wssses.append(wssse)

In [0]:
# Tabulate WSSSE per k (pandas is already imported as pd above)
df = pd.DataFrame({'k': kvals, 'wssse': wssses})
df

In [0]:
pyplot.plot(kvals, wssses)

## Step 6 : Let's run K-Means with the best K we have chosen
From the graph above, choose a good K value. We will use that below.

In [0]:
## TODO : pick a K value
num_clusters = 4
kmeans = KMeans(n_clusters=num_clusters, n_init=1)

t1 = time.perf_counter()
model = kmeans.fit(featureVector)
t2 = time.perf_counter()

wssse = model.inertia_


print("Kmeans : {} clusters computed in {:,.2f} ms".format( num_clusters,  ((t2-t1)*1000)))
print ("num_clusters = {},  WSSSE = {:,}".format(num_clusters, wssse))


### Predict

In [0]:
t1 = time.perf_counter()
predicted = uber_pickups_clean.copy()  # copy so the cleaned frame is not modified
predicted['prediction'] = model.predict(featureVector)
t2 = time.perf_counter()

print ("{:,} records clustered in {:,.2f} ms".format(len(predicted), ((t2-t1)*1000) ))

predicted
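
`predict` can also assign points the model has not seen before. The sketch below, on synthetic data with made-up coordinates and names, suggests that the returned label is simply the index of the nearest cluster center.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic groups of (lat, lon) points; made-up values.
rng = np.random.default_rng(1)
group_a = np.array([40.70, -74.00]) + 0.002 * rng.standard_normal((50, 2))
group_b = np.array([40.80, -73.90]) + 0.002 * rng.standard_normal((50, 2))
demo_points = np.vstack([group_a, group_b])

demo_model = KMeans(n_clusters=2, n_init=1, random_state=0).fit(demo_points)

# A hypothetical new pickup close to the first group.
new_pickup = np.array([[40.701, -74.001]])
new_label = demo_model.predict(new_pickup)[0]

# Distance from the new pickup to every cluster center.
distances = np.linalg.norm(demo_model.cluster_centers_ - new_pickup, axis=1)
print(new_label, distances.argmin())
```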

## Step 7 : Print Cluster Center and Size

In [0]:
cluster_count = predicted.groupby("prediction").size()
cluster_count
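
To see the centers next to the sizes, one option is to put `cluster_centers_` and the group counts into a single table. This is a minimal sketch on synthetic data; `demo_df`, `summary`, and the column names `center_lat` and `center_lon` are illustrative, and it assumes the model was fitted on columns in (`Lat`, `Lon`) order.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic (Lat, Lon) points in two made-up groups of different sizes.
rng = np.random.default_rng(2)
demo_df = pd.DataFrame(
    np.vstack([
        np.array([40.70, -74.00]) + 0.002 * rng.standard_normal((30, 2)),
        np.array([40.80, -73.90]) + 0.002 * rng.standard_normal((70, 2)),
    ]),
    columns=["Lat", "Lon"],
)

demo_model = KMeans(n_clusters=2, n_init=1, random_state=0).fit(demo_df[["Lat", "Lon"]])
demo_df["prediction"] = demo_model.labels_

# One row per cluster: center coordinates plus the number of assigned points.
summary = pd.DataFrame(demo_model.cluster_centers_, columns=["center_lat", "center_lon"])
summary["size"] = demo_df.groupby("prediction").size()
print(summary)
```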



## Step 8 : Plotting time!
We are going to plot the results now.  
Since we are dealing with geographic coordinates, let's use Google Maps!  

Go to the following URL :  
[https://jsfiddle.net/sujee/omypetfu/](https://jsfiddle.net/sujee/omypetfu/)

- Run the code cell below
- Copy and paste the output into the JavaScript section of the JSFiddle editor (lower left)
- Click 'Run' (top nav bar)
- Click 'Tidy' (top nav bar) to clean up the code

See the following image:

<img src="../assets/images/kmeans_uber_trips_map.png" style="border: 5px solid grey ; max-width:100%;" />

You will be rewarded with a beautiful map of clusters on Google Maps.

<img src="../assets/images/Kmeans_uber_trips.png" style="border: 5px solid grey ; max-width:100%;" />

Optional
- You can 'fork' the snippet and keep tweaking

In [0]:
### generate Javascript
s1 = "var clusters = {"

s2 = ""

prediction_count = predicted.groupby("prediction").size()
total_count = 0
cluster_centers = model.cluster_centers_
for i in range(0, num_clusters):
    count = prediction_count[i]
    lat = cluster_centers[i][0]
    lng = cluster_centers[i][1]
    total_count = total_count + count
    if (i > 0):
        s2 = s2 + ","
    s2 = s2 + " {}: {{ center: {{ lat: {}, lng: {} }}, count: {} }}".\
        format(i, lat, lng, count)
    #s2 = s2 + "{}: {{  center: {{ }}, }}".format(i)

s3 = s1 + s2 + "};"

s4 = """
function initMap() {
  // Create the map.
  var map = new google.maps.Map(document.getElementById('map'), {
    zoom: 10,
    center: {
      lat: 40.77274573,
      lng: -73.94
    },
    mapTypeId: 'roadmap'
  });

  // Construct the circle for each entry in clusters.
  // Note: We scale the radius of the circle based on the cluster's share of pickups.
  for (var cluster in clusters) {
    // Add the circle for this cluster to the map.
    var cityCircle = new google.maps.Circle({
      strokeColor: '#FF0000',
      strokeOpacity: 0.8,
      strokeWeight: 2,
      fillColor: '#FF0000',
      fillOpacity: 0.35,
      map: map,
      center: clusters[cluster].center,
"""

s5 = "radius: clusters[cluster].count / {} * 100 * 300 }});  }}}}".format(total_count)

# final
s = s3 + s4 + s5

print(s)
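
The `clusters` literal above is assembled by string concatenation. An alternative, sketched here with hypothetical centers and counts standing in for the model output, is to build a plain dictionary and let `json.dumps` produce the literal.

```python
import json

# Hypothetical cluster centers and counts, standing in for the model output.
demo_centers = [(40.73, -73.99), (40.77, -73.96)]
demo_counts = [6000, 4000]

clusters = {
    i: {"center": {"lat": lat, "lng": lng}, "count": count}
    for i, ((lat, lng), count) in enumerate(zip(demo_centers, demo_counts))
}

# json.dumps emits a literal that is also valid JavaScript.
js_line = "var clusters = " + json.dumps(clusters) + ";"
print(js_line)
```

When feeding real model output into this pattern, NumPy integer counts may need converting with `int(...)` first, since `json.dumps` does not serialize them directly.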


## Step 9: Let's analyze some more data

- In Step 1, set `data_file` to the larger file:
```
data_file = "/data/uber-nyc/uber-raw-data-apr14.csv.gz"
```
- Then select 'Cell --> Run All' to execute all code blocks


## Step 10 : Running the script

**Use the download script**

```bash
cd   ~/data/uber-nyc
./download-data.sh
```

This will download more data.

As we run on larger datasets, execution will take longer and the Jupyter notebook might time out. So let's run this from the command line in script mode.

```bash
cd   ~/ml-labs-python/clustering
time  python  kmeans-uber.py 2> logs
```

Watch the output.
