
# Banner advertising

Imagine that the international cruise agency Carnival Cruise Line has decided to advertise with banners and has turned to you for help. To test whether such banners have any effect, they will be placed in only 20 locations around the world. You need to choose those 20 locations.

The agency is large and has several offices around the world. It wants to place the banners near these offices, where it is easier to negotiate placement and check the result. The locations should also be popular with tourists.

In [0]:
import numpy as np
import pandas as pd

from sklearn.cluster import MeanShift

To find the best places, we will use data from Foursquare, the largest location-based social network.

We will use one of the open data sources, available here:
https://archive.org/details/201309_foursquare_dataset_umn

To make the file convenient to work with, we will convert it to CSV and delete the rows that do not contain coordinates, since they carry no information for this task. We will build a DataFrame with pandas and make sure that all 396634 rows with coordinates are read successfully.


In [2]:
checkins = pd.read_csv('https://raw.githubusercontent.com/OzmundSedler/100-Days-Of-ML-Code/master/week%204/datasets/checkins.dat',
                       sep='|', skipinitialspace=True, skiprows=[1], low_memory=False)
print(checkins.info())
checkins.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1021967 entries, 0 to 1021966
Data columns (total 6 columns):
id                  1021967 non-null object
user_id             1021966 non-null float64
venue_id            1021966 non-null float64
latitude            396634 non-null float64
longitude           396634 non-null float64
created_at          1021966 non-null object
dtypes: float64(4), object(2)
memory usage: 46.8+ MB
None


Unnamed: 0,id,user_id,venue_id,latitude,longitude,created_at
0,984301,2041916.0,5222.0,,,2012-04-21 17:39:01
1,984222,15824.0,5222.0,38.895112,-77.036366,2012-04-21 17:43:47
2,984315,1764391.0,5222.0,,,2012-04-21 17:37:18
3,984234,44652.0,5222.0,33.800745,-84.41052,2012-04-21 17:43:43
4,984249,2146840.0,5222.0,,,2012-04-21 17:42:58


In [3]:
checkins.columns = checkins.columns.str.strip()
checkins = checkins.dropna()

print(f'Shape without NaN: {checkins.shape}')

checkins.head()

Shape without NaN: (396634, 6)


Unnamed: 0,id,user_id,venue_id,latitude,longitude,created_at
1,984222,15824.0,5222.0,38.895112,-77.036366,2012-04-21 17:43:47
3,984234,44652.0,5222.0,33.800745,-84.41052,2012-04-21 17:43:43
7,984291,105054.0,5222.0,45.523452,-122.676207,2012-04-21 17:39:22
9,984318,2146539.0,5222.0,40.764462,-111.904565,2012-04-21 17:35:46
10,984232,93870.0,380645.0,33.448377,-112.074037,2012-04-21 17:38:18


Now you need to cluster the coordinates to reveal the hotspots. Since a banner has a relatively small area of effect, we need an algorithm that lets us limit the size of a cluster and does not require the number of clusters to be specified in advance.

This task is a good reason to get acquainted with the MeanShift algorithm, which we skipped in the main part of the lectures. Its description is available in the sklearn docs, and an additional video with an overview of this and some other clustering algorithms will appear a little later. Use MeanShift with bandwidth=0.1, which, converted from degrees to distance, corresponds to roughly 5 to 10 km at middle latitudes.
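As a rough check of the degrees-to-kilometers conversion, the sketch below estimates how far a bandwidth of 0.1 degrees reaches at a few latitudes. The helper `bandwidth_in_km` and the constant of about 111.32 km per degree of latitude are assumptions made for illustration (a spherical approximation), not part of the task.

```python
import math

# Approximate length of one degree of latitude in km (assumption: spherical Earth).
KM_PER_DEGREE_LAT = 111.32


def bandwidth_in_km(bandwidth_deg, latitude_deg):
    """Return (north-south km, east-west km) covered by bandwidth_deg at a latitude."""
    north_south = bandwidth_deg * KM_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


for lat in (30, 45, 60):
    ns, ew = bandwidth_in_km(0.1, lat)
    print(f'latitude {lat}: {ns:.1f} km north-south, {ew:.1f} km east-west')
```

The north-south extent stays slightly above 10 km, while the east-west extent drops to a little over 5 km at 60 degrees, so the effective cluster radius depends on where the check-ins are.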

Note: on all 396634 rows, clustering takes a long time. You are welcome to be patient, and the result will only improve. To pass the task, however, it is enough to use the first 100 thousand rows, which is a trade-off between quality and time spent. Fitting the algorithm takes about an hour on the whole dataset and about 2 minutes on 100 thousand rows, and the subset is enough to get correct results.
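Before starting the long fit, the effect of the bandwidth can be tried on a tiny set. The sketch below uses eight made-up points (hypothetical data, not taken from the check-ins) that form two tight groups about one degree apart. With bandwidth=0.1 each group should end up as its own cluster, and the number of clusters is never passed in.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Two tight, hypothetical groups of coordinates roughly one degree apart.
points = np.array([
    [0.00, 0.00], [0.01, 0.00], [0.00, 0.01], [0.01, 0.01],
    [1.00, 1.00], [1.01, 1.00], [1.00, 1.01], [1.01, 1.01],
])

toy_ms = MeanShift(bandwidth=0.1)
toy_ms.fit(points)

print(toy_ms.cluster_centers_)  # one center per group
print(toy_ms.labels_)           # cluster label of every point
```

MeanShift also accepts `bin_seeding=True`, which starts from a coarse grid of seeds and usually shortens the fit. Whether it changes the answer for this task has not been checked here.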

In [4]:
checkins_cl = checkins.iloc[:100000, :].loc[:, ['latitude', 'longitude']]
checkins_cl.head()

Unnamed: 0,latitude,longitude
1,38.895112,-77.036366
3,33.800745,-84.41052
7,45.523452,-122.676207
9,40.764462,-111.904565
10,33.448377,-112.074037


In [0]:
ms = MeanShift(bandwidth=0.1)
ms.fit(checkins_cl)

Some of the resulting clusters contain too few points and are of no interest to advertisers. We therefore need to determine which clusters contain, for example, more than 15 elements. The centers of these clusters are the candidate locations for placement.



In [0]:
labels = ms.labels_
cluster_centers = ms.cluster_centers_

labels_unique, labels_counts = np.unique(labels, return_counts=True)

print(f'number of estimated clusters : {len(labels_unique)}')

In [0]:
# Keep clusters with more than 15 check-ins. MeanShift labels index directly
# into cluster_centers_, so the labels themselves serve as center indexes.
center_indexes = labels_unique[labels_counts > 15]

If you want, you can visualize the data at https://www.mapcustomizer.com/ using the Bulk Entry function.
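A minimal sketch of preparing the centers for pasting is shown below. It assumes that Bulk Entry accepts one `latitude,longitude` pair per line, which should be checked against the site itself. The helper `to_bulk_entry` and the two example points are hypothetical.

```python
def to_bulk_entry(centers):
    """Format (latitude, longitude) pairs as one 'lat,lon' line each."""
    return '\n'.join(f'{lat:.6f},{lon:.6f}' for lat, lon in centers)


# Hypothetical centers; in the notebook these would be cluster_centers[center_indexes].
example_centers = [(38.895112, -77.036366), (33.800745, -84.41052)]
print(to_bulk_entry(example_centers))
```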

As we remember, 20 banners should be placed near the offices of the company. Let's find the addresses of all offices on Google Maps and create a data frame:

In [0]:
company_ofices = pd.DataFrame([(33.751277, -118.188740), (25.867736, -80.324116), (51.503016, -0.075479), (52.378894, 4.885084), (39.366487, 117.036146), (-33.868457, 151.205134)],
                              columns=['latitude', 'longitude'],
                              index=['Los Angeles', 'Miami', 'London', 'Amsterdam', 'Beijing', 'Sydney'])

It remains to determine the 20 cluster centers nearest to the offices. Let's calculate the distance to the nearest office for each cluster center and select the 20 with the lowest values.
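The nearest-office lookup can be illustrated without any extra dependency. The sketch below uses the haversine formula, a spherical approximation that is slightly less accurate than an ellipsoidal geodesic distance, with an assumed mean Earth radius of 6371 km. The helpers `haversine_km` and `nearest_office` are hypothetical names, and the query point is simply the first check-in shown earlier.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


# Office coordinates as listed in the notebook.
offices = {
    'Los Angeles': (33.751277, -118.188740),
    'Miami': (25.867736, -80.324116),
    'London': (51.503016, -0.075479),
    'Amsterdam': (52.378894, 4.885084),
    'Beijing': (39.366487, 117.036146),
    'Sydney': (-33.868457, 151.205134),
}


def nearest_office(lat, lon):
    """Return (office name, distance in km) for the office closest to the point."""
    return min(
        ((name, haversine_km(lat, lon, *coords)) for name, coords in offices.items()),
        key=lambda pair: pair[1],
    )


print(nearest_office(38.895112, -77.036366))
```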

In [0]:
import geopy.distance

rows = []

for center in cluster_centers[center_indexes]:
  # Find the office closest to this cluster center.
  nearest_office, min_distance = None, float('inf')

  for office, office_coords in company_ofices.iterrows():
    # geodesic replaces vincenty, which was removed in geopy 2.0.
    d = geopy.distance.geodesic(center, office_coords.values).km
    if d < min_distance:
      nearest_office = office
      min_distance = d

  rows.append((nearest_office, min_distance, center[0], center[1]))

# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows.
distances = pd.DataFrame(rows, columns=['nearest_office', 'distance (km)', 'center_lat', 'center_long'])

In [0]:
distances.head()

Sort by distance and keep the 20 closest cluster centers:

In [0]:
final_answer = distances.sort_values('distance (km)')[:20]
print(final_answer)