In this notebook we perform cluster analysis on Foursquare dataset to find the most optimal locations to place banners of Carnival Cruise Line company. From Foursquare dataset, which is available by this link: https://archive.org/details/201309_foursquare_dataset_umn , we have used checkings.dat. Our aim is 20 closest to Carnival Cruise Line offices high density clusters.

In [2]:
from sklearn import cluster, model_selection

import numpy as np
import pandas as pd

In [3]:
data = pd.read_csv('checkins.dat', sep = '|', header = 0, skipinitialspace = True)
data.dropna(inplace = True)  #delete all rows with no latitude and longitude info
data = data.reset_index(drop=True)  #as we have deleted some rows, we should reset indexes in our dataframe
data.head()

  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,id,user_id,venue_id,latitude,longitude,created_at
0,984222,15824.0,5222.0,38.895112,-77.036366,2012-04-21 17:43:47
1,984234,44652.0,5222.0,33.800745,-84.41052,2012-04-21 17:43:43
2,984291,105054.0,5222.0,45.523452,-122.676207,2012-04-21 17:39:22
3,984318,2146539.0,5222.0,40.764462,-111.904565,2012-04-21 17:35:46
4,984232,93870.0,380645.0,33.448377,-112.074037,2012-04-21 17:38:18


In [4]:
print data.shape
print data.dtypes

(396634, 6)
id                   object
user_id             float64
venue_id            float64
latitude            float64
longitude           float64
created_at           object
dtype: object


I have chosen Mean Shift algorithm to solve this problem, because it is a centroid based algorithm and its clustering aims to discover blobs in a smooth density of samples. So, it's perfect to fit our dataset, where we want to discover users accumulations.

In [5]:
cluster_analyzer = cluster.MeanShift(bandwidth = 0.1)

To achieve correct answer in the examination system I only have to use first 100000 rows of dataset. Of course, it's possible to perform clustering on whole dataset, but it takes 1 hour vs 2 minutes.

I also dropped out first 3 columns of dataset as non-informative and last one because we need to analyze total visits of current place, not visits at different times.

In [6]:
data = data[:100000]
data = data.iloc[xrange(0, 100000), [3, 4]]
data.head()

Unnamed: 0,latitude,longitude
0,38.895112,-77.036366
1,33.800745,-84.41052
2,45.523452,-122.676207
3,40.764462,-111.904565
4,33.448377,-112.074037


In [7]:
cluster_analyzer.fit(data)

MeanShift(bandwidth=0.1, bin_seeding=False, cluster_all=True, min_bin_freq=1,
     n_jobs=None, seeds=None)

In [8]:
cluster_labels = cluster_analyzer.labels_  #this array contains labels of clusters to which our algorithm took our
                                           #datasets rows

Now it's time to calculate centres of clusters as optimal loactions to place banners.

In [9]:
#count how many elements we have in each cluster
numbers_of_el_in_clust = np.zeros(np.max(cluster_labels) + 1)
for i in cluster_labels:
    numbers_of_el_in_clust[i] += 1

centers_of_clusters = []
for i in xrange(0, max(cluster_labels) + 1):
    lat_sum = 0.
    long_sum = 0.
    for index, k in enumerate(cluster_labels):
        if k == i:
            lat_sum += data.iloc[index,0]
            long_sum += data.iloc[index,1]
    centers_of_clusters.append(tuple((lat_sum / numbers_of_el_in_clust[i], long_sum / numbers_of_el_in_clust[i])))

Define function, which should delete an array of elements from usual Python list by their indexes.

In [10]:
def multi_delete(list_, args):
    indexes = sorted(args, reverse=True)
    for index in indexes:
        del list_[index]
    return list_

Clusters, which have less than 16 points aren't interesting for us.

In [11]:
ind_to_del = []

for i in xrange(0, np.max(cluster_labels)):
    if numbers_of_el_in_clust[i] <= 15:
        ind_to_del.append(i)
        
centers_of_bigger_clusters = multi_delete(centers_of_clusters, ind_to_del)

I used mapcustomizer to visualize centres of clusters. Here you can see what we got:

https://www.mapcustomizer.com/map/cluster_centres

In [12]:
for coord in centers_of_bigger_clusters:
    print format(coord[0]) + ', ' + format(coord[1]) + '\n'

40.7187068678, -73.9920583378

33.4481210773, -112.073750349

33.4512308635, -111.917519658

41.8782437797, -87.6298433623

37.7186956146, -122.413579805

38.8865003439, -77.047520192

33.3570196815, -111.822739909

33.7670181941, -84.3958447009

42.3636725818, -71.0786283254

47.6062447174, -122.332043826

36.1171880912, -115.170948319

34.0521041243, -118.243687819

44.9800773692, -93.2637154264

30.267183617, -97.7431192813

40.7426274749, -73.8047918971

39.7387950389, -104.986707004

39.9521041075, -75.1634382844

34.0219947974, -118.458868348

32.9808933822, -117.078117978

32.8030205353, -96.7698974349

37.3361803361, -121.918048608

28.5379692463, -81.3792742065

32.7082197123, -117.151373994

32.2217131518, -110.926535153

34.1344681763, -118.359115764

29.7626977547, -95.3823137047

43.0395059286, -87.9063213048

33.8312803288, -117.899132299

37.3961366695, -122.097224695

25.7900492355, -80.2125575741

45.5234832146, -122.676280421

33.5461277671, -112.296685059

33.6690169

Define function, which calculates distances between points assigned in geographic coordinate system.

In [13]:
def distance(lat_office, long_office, lat_cl, long_cl):
    return np.arccos(np.sin(lat_office) * np.sin(lat_cl) + np.cos(lat_office) * np.cos(lat_cl) * np.cos(long_office - long_cl))

Lets calculate distances between each cluster centre and the closest Carnival Cruise Line office.

In [14]:
min_distances = []
offices_coord = [(33.751277, -118.188740),
                 (25.867736, -80.324116),
                 (51.503016, -0.075479),
                 (52.378894, 4.885084),
                 (39.366487, 117.036146),
                 (-33.868457, 151.205134)]

for coord in centers_of_bigger_clusters:
    distances = [] #distances between current cluster center and each of 6 offices
    for office_coord in offices_coord:
        distances.append(distance(office_coord[0], office_coord[1], coord[0], coord[1]))
    min_distances.append(min(distances))

We got a final solution of our task below. It's first 20 the closest to offices clusters centres.

In [15]:
min_distances = np.array(min_distances)
coord_to_place_banners = []
for i in min_distances.argsort()[:20]:
    coord_to_place_banners.append(centers_of_bigger_clusters[i])

https://www.mapcustomizer.com/map/clusters_centres_to_place_banners

In [16]:
for coord in coord_to_place_banners:
    print format(coord[0]) + ', ' + format(coord[1]) + '\n'

52.3729639903, 4.89231722258

-33.8606304286, 151.204775929

51.5027862709, -0.124192348031

25.8456722643, -80.3188905964

39.3565774429, -84.4973694143

41.5364839286, -87.4927188821

32.8795022, -111.7573521

21.158964, -86.845937

13.7953476, 100.630875855

33.8057742128, -118.155494097

39.4130433118, -77.4112680382

33.5778631, -101.8551665

39.165325, -86.5263857

33.1434524447, -96.8451658526

39.5872222, -78.8425

13.72643201, 100.501365816

33.1622098774, -96.6495485806

36.973905175, -122.020079637

26.6440687097, -81.8733119742

45.6387281, -122.6614861

