# Where to sell products
## Author: Luis Eduardo Ferro Diez, <a href="mailto:luis.ferro1@correo.icesi.edu.co">luis.ferro1@correo.icesi.edu.co</a>

This notebook contains the main work and contribution of the project: it uses geotagged social network data to predict where to sell specific products, based on the content that users post.


In [1]:
import pandas as pd

tweets_path = "../../datasets/tweets_parquet"
tweets = pd.read_parquet(tweets_path, engine="pyarrow")
tweets.head()

Unnamed: 0,id,tweet,lang,favorite_count,retweet_count,is_retweet,user_id,user_name,user_followers_count,user_following_count,...,place_full_name,country,country_code,place_type,place_url,is_spam,year,month,day,hour
0,374048987046637568,@fizziero ngareb beud! !,id,0.0,0.0,0.0,389276837,syarifsidi,47.0,63.0,...,"Jatinegara, Jakarta Timur",Indonesia,ID,city,https://api.twitter.com/1.1/geo/id/9e0e6d510fb...,0.0,2013,9,1,1
1,374048987046625280,"@shahshahrul11 nak buat acano,keto den lg pent...",id,0.0,0.0,0.0,184869610,AliffSadali,194.0,247.0,...,"Keratong, Rompin",Malaysia,MY,city,https://api.twitter.com/1.1/geo/id/1deede127b2...,0.0,2013,9,1,1
2,374048987034419200,@adambeyer234: I already miss Jace like hell,en,0.0,0.0,0.0,363516745,rilez_sharp,582.0,666.0,...,"New York, US",United States,US,admin,https://api.twitter.com/1.1/geo/id/94965b2c453...,0.0,2013,9,1,1
3,374048987050823680,"""this is us""...nunca me voy a artar de verla.....",es,0.0,0.0,0.0,446125189,LUCRECIALg,13.0,65.0,...,Argentina,Argentina,AR,country,https://api.twitter.com/1.1/geo/id/4d3b316fe2e...,0.0,2013,9,1,1
4,374048991241310208,Aquí en una reunión casual (@ Dhamy's Bar) [pi...,es,0.0,0.0,0.0,44793849,charal3x,66.0,77.0,...,"Veracruz, Veracruz de Ignacio de la Llave",México,MX,city,https://api.twitter.com/1.1/geo/id/6c67fe933a6...,0.0,2013,9,1,1


In [2]:
tweets.columns

Index(['id', 'tweet', 'lang', 'favorite_count', 'retweet_count', 'is_retweet',
       'user_id', 'user_name', 'user_followers_count', 'user_following_count',
       'user_location', 'created_timestamp', 'hashtags', 'user_mentions',
       'user_id_mentions', 'expanded_urls', 'location_geometry',
       'place_geometry', 'place_id', 'place_name', 'place_full_name',
       'country', 'country_code', 'place_type', 'place_url', 'is_spam', 'year',
       'month', 'day', 'hour'],
      dtype='object')

Let's visualize the tweets on a map.

First, we will create a GeoPandas `GeoDataFrame` from the original data.

In [3]:
from shapely import wkt
import geopandas as gpd

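# Some tweets might appear without place or location geometry; keep rows that have at least one.
# The explicit copy lets us reassign columns below without modifying a slice of the original frame.
tweets = tweets[(tweets.place_geometry.notnull()) | (tweets.location_geometry.notnull())].copy()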

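def parse_geometry(geom):
    # Parse a WKT string into a shapely geometry; missing or empty values become None
    if isinstance(geom, str) and geom:
        return wkt.loads(geom)
    return None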

tweets.location_geometry = tweets.location_geometry.apply(parse_geometry)
tweets.place_geometry = tweets.place_geometry.apply(parse_geometry)

# Let's work with the location geometry first
geo_tweets = gpd.GeoDataFrame(tweets, geometry='location_geometry')
geo_tweets.head()

Unnamed: 0,id,tweet,lang,favorite_count,retweet_count,is_retweet,user_id,user_name,user_followers_count,user_following_count,...,place_full_name,country,country_code,place_type,place_url,is_spam,year,month,day,hour
0,374048987046637568,@fizziero ngareb beud! !,id,0.0,0.0,0.0,389276837,syarifsidi,47.0,63.0,...,"Jatinegara, Jakarta Timur",Indonesia,ID,city,https://api.twitter.com/1.1/geo/id/9e0e6d510fb...,0.0,2013,9,1,1
1,374048987046625280,"@shahshahrul11 nak buat acano,keto den lg pent...",id,0.0,0.0,0.0,184869610,AliffSadali,194.0,247.0,...,"Keratong, Rompin",Malaysia,MY,city,https://api.twitter.com/1.1/geo/id/1deede127b2...,0.0,2013,9,1,1
2,374048987034419200,@adambeyer234: I already miss Jace like hell,en,0.0,0.0,0.0,363516745,rilez_sharp,582.0,666.0,...,"New York, US",United States,US,admin,https://api.twitter.com/1.1/geo/id/94965b2c453...,0.0,2013,9,1,1
4,374048991241310208,Aquí en una reunión casual (@ Dhamy's Bar) [pi...,es,0.0,0.0,0.0,44793849,charal3x,66.0,77.0,...,"Veracruz, Veracruz de Ignacio de la Llave",México,MX,city,https://api.twitter.com/1.1/geo/id/6c67fe933a6...,0.0,2013,9,1,1
5,374048991224160256,I really really hate texting unless we're talk...,en,0.0,0.0,0.0,542867684,ivonne_xoxo,367.0,310.0,...,"Chicago, IL",United States,US,city,https://api.twitter.com/1.1/geo/id/1d9a5370a35...,0.0,2013,9,1,1
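
The `parse_geometry` helper above relies on shapely's WKT reader. As a minimal sketch of what it returns, the example below parses two made-up WKT strings (they are illustrative values, not rows from the dataset). It assumes, as the marker code later in this notebook does, that `x` holds the longitude and `y` the latitude, so the pair has to be swapped before it is handed to a map widget that expects `(lat, lon)`.

```python
from shapely import wkt

# A hypothetical point geometry, written as WKT in (x, y) = (lon, lat) order
point = wkt.loads("POINT (8.0 52.3)")
marker_location = (point.y, point.x)  # (lat, lon), the order used for map markers

# A hypothetical polygon, e.g. a bounding box; its centroid gives one representative point
box = wkt.loads("POLYGON ((0 0, 2 0, 2 2, 0 2, 0 0))")
center = box.centroid

print(point.geom_type, marker_location)
print(box.geom_type, (center.y, center.x))
```

If `place_geometry` turns out to hold polygons, a centroid like the one above is one possible way to reduce each place to a single point; whether that is appropriate depends on how large the places are.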


Now let's visualize this information on a map.

In [94]:
import ipyleaflet as ipy

gtdf = geo_tweets[["id", "location_geometry"]]
geo_data = ipy.GeoData(geo_dataframe=gtdf,
                       style={'properties': {'marker-size': 'small'}},
                       icon="https://upload.wikimedia.org/wikipedia/commons/thumb/3/35/Location_dot_blue.svg/64px-Location_dot_blue.svg.png",
                       name="Geo Tweets")

m = ipy.Map(center=(52.3, 8.0), 
            zoom=3,
            scroll_wheel_zoom=True,
            basemap=ipy.basemaps.Esri.WorldTopoMap)

m.add_layer(geo_data)
m.add_control(ipy.LayersControl())
m

Map(basemap={'url': 'http://server.arcgisonline.com/ArcGIS/rest/services/World_Topo_Map/MapServer/tile/{z}/{y}…

The idea is to detect clusters among these geotagged tweets, then perform LDA topic detection on each cluster, and finally measure the relevance of each topic to a product or service, so that each geographic cluster can be characterized by its relationship to the product.

For this, we first need to compute the clusters. We use DBSCAN because it is density-based: it does not rely on a measure of central tendency (such as a centroid) and it does not constrain the shape of the clusters.

Since we only want to detect clusters based on geographic position, the geometry is all we need.

In [51]:
import numpy as np

points = geo_tweets[geo_tweets.location_geometry.notnull()].location_geometry.apply(lambda p: [p.x, p.y])
points = np.array(points.values.tolist())
points.shape

(171, 2)

In [71]:
from sklearn.cluster import DBSCAN
from sklearn import metrics

# Since this is world data and data can be sparse, we use a small eps to find the clusters
db = DBSCAN(eps=0.05, min_samples=10, metric="cosine").fit(points)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
#print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
#print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
#print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
#print("Adjusted Rand Index: %0.3f"
#      % metrics.adjusted_rand_score(labels_true, labels))
#print("Adjusted Mutual Information: %0.3f"
#      % metrics.adjusted_mutual_info_score(labels_true, labels,
#                                           average_method='arithmetic'))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(points, labels))

Estimated number of clusters: 3
Estimated number of noise points: 6
Silhouette Coefficient: 0.768
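
The cell above clusters raw longitude/latitude pairs with the cosine metric, and the silhouette score is computed with its default Euclidean metric. An alternative that may be worth comparing is the haversine metric, which measures great-circle distance and lets `eps` be expressed in kilometers. The sketch below is not the method used in this notebook: the coordinates are synthetic, and `eps_km`, `min_samples` and the variable names are assumptions chosen for illustration. Note that scikit-learn's haversine metric expects `[lat, lon]` in radians, whereas `points` above is built as `[lon, lat]` in degrees.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, used to convert km to radians

# Synthetic (lat, lon) pairs in degrees: two tight groups and one isolated point
rng = np.random.default_rng(0)
group_a = np.array([40.71, -74.00]) + rng.uniform(-0.01, 0.01, size=(5, 2))
group_b = np.array([41.88, -87.63]) + rng.uniform(-0.01, 0.01, size=(5, 2))
isolated = np.array([[0.0, 0.0]])
lat_lon_deg = np.vstack([group_a, group_b, isolated])

eps_km = 50.0  # assumed neighborhood radius in kilometers
db_hav = DBSCAN(
    eps=eps_km / EARTH_RADIUS_KM,  # haversine distances are in radians
    min_samples=3,                 # assumed; small because the toy data is small
    algorithm="ball_tree",
    metric="haversine",
).fit(np.radians(lat_lon_deg))

hav_labels = db_hav.labels_
n_hav_clusters = len(set(hav_labels)) - (1 if -1 in hav_labels else 0)
n_hav_noise = int(np.sum(hav_labels == -1))
print("clusters:", n_hav_clusters, "noise points:", n_hav_noise)
```

To try this on the notebook's data, the columns of `points` would have to be swapped and converted with `np.radians` first. The resulting clusters and scores would likely differ from the ones reported above.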


Now, let's visualize the clusters on the map.

In [96]:
# Take an explicit copy so that adding the "cluster" column does not raise SettingWithCopyWarning
loc_tweets = geo_tweets[geo_tweets.location_geometry.notnull()].copy()
loc_tweets["cluster"] = labels

# One color per cluster label; -1 is the DBSCAN noise label
colors = {0: "#FF0000", 1: "#0033FF", 2: "#00FF00", -1: "#000000"}

def create_marker(row):
    # shapely stores (x, y) = (lon, lat); ipyleaflet expects (lat, lon)
    lat_lon = (row["location_geometry"].y, row["location_geometry"].x)
    cluster = row["cluster"]
    color = colors[cluster]
    return ipy.CircleMarker(location=lat_lon,
                           draggable=False,
                           fill_color=color,
                           fill_opacity=0.5,
                           radius=2,
                           stroke=False)

markers = loc_tweets.apply(create_marker, axis=1)
layer_group = ipy.LayerGroup(layers=tuple(markers.values))
m = ipy.Map(center=(52.3, 8.0), 
            zoom=3,
            scroll_wheel_zoom=True,
            basemap=ipy.basemaps.Esri.WorldTopoMap)
m.add_layer(layer_group)
m


Map(basemap={'url': 'http://server.arcgisonline.com/ArcGIS/rest/services/World_Topo_Map/MapServer/tile/{z}/{y}…

Next, we need to run LDA over the tweet text aggregated by cluster, so that each cluster provides a corpus large enough to detect topics.
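
As a rough sketch of that step, the example below joins the tweets of each cluster into one document and fits scikit-learn's `LatentDirichletAllocation` on the resulting word counts. Everything here is an assumption made for illustration: the `toy` frame stands in for `loc_tweets` (only its `tweet` and `cluster` columns are used), the tweets are invented, and `n_topics` and the English stop-word list are placeholder choices. The real data is multilingual and would need its own preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for loc_tweets with invented tweets
toy = pd.DataFrame({
    "tweet": [
        "coffee and cake downtown",
        "best coffee shop ever",
        "football match tonight",
        "great goal in the football game",
        "random noise tweet",
    ],
    "cluster": [0, 0, 1, 1, -1],
})

# Drop DBSCAN noise (-1) and join each cluster's tweets into a single document
docs = (toy[toy.cluster != -1]
        .groupby("cluster")["tweet"]
        .apply(" ".join))

# Bag-of-words counts, one row per cluster
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

n_topics = 2  # assumed number of topics
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(counts)
doc_topics = lda.transform(counts)  # rows: clusters, columns: topic weights

# Map column indices back to words and list the heaviest words of each topic
index_to_word = {i: w for w, i in vectorizer.vocabulary_.items()}
for topic_id, weights in enumerate(lda.components_):
    top_words = [index_to_word[i] for i in weights.argsort()[::-1][:3]]
    print(topic_id, top_words)
```

Each row of `doc_topics` describes one cluster as a mixture of topics, which is the kind of representation that could later be compared against a product or service.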