In [1]:
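import numpy as np
import pandas as pd

from pyspark.ml import Pipeline
# Feature transformers and the Word2Vec estimator/model used later in the notebook
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, Word2Vec, Word2VecModel
from pyspark.sql.functions import *
from pyspark.sql.types import *

import folium
import html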

In [2]:
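data_path = '/home/osboxes/yelp-data/dataset/'
# Directory holding the saved Word2Vec model (matches the path used when the model is saved)
model_path = '/home/osboxes/yelp-data/'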

### Data Set

This project uses the Yelp dataset available at https://www.yelp.com/dataset

The data set contains:

- 4,700,000 reviews on 156,000 businesses in 12 metropolitan areas
- 1,000,000 tips by 1,100,000 users
- Over 1.2 million business attributes like hours, parking, availability, and ambience


The data files are supplied in two flavours: JSON and SQL (MySQL, Postgres). This project uses the JSON version, which has the following files:  

- __business.json__: Contains business data including location data, attributes, and categories.  
- __review.json__: Contains full review text data including the user_id that wrote the review and the business_id the review is written for.  
- __user.json__: User data including the user's friend mapping and all the metadata associated with the user.  
- __checkin.json__: Checkins on a business.  
- __tip.json__: Tips written by a user on a business. Tips are shorter than reviews and tend to convey quick suggestions.
- __photos__: (from the photos auxiliary file) This file is formatted as a JSON list of objects.  

Each file is composed of a single object type, one JSON object per line. The format is documented at https://www.yelp.com/dataset/documentation/json
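
Because each file holds one JSON object per line, it can be streamed record by record instead of being parsed as one large document. The following is a minimal sketch using only the standard library; the helper name `iter_json_lines` and the two sample records are illustrative assumptions, not part of the dataset.

```python
import io
import json

def iter_json_lines(lines):
    """Yield one dict per non-empty line of a JSON-lines stream (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Two made-up records standing in for lines of business.json
sample = io.StringIO(
    '{"business_id": "b1", "city": "Toronto", "stars": 4.0}\n'
    '{"business_id": "b2", "city": "Las Vegas", "stars": 3.5}\n'
)

records = list(iter_json_lines(sample))
toronto_ids = [r["business_id"] for r in records if r["city"] == "Toronto"]
print(toronto_ids)
```

For a real file, the same helper could be fed an open file handle in place of the in-memory sample.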

### Data Wrangling

Apache Drill (https://drill.apache.org/), which provides a SQL query interface to most non-relational datastores, will be used for preliminary analysis of the JSON files. It will then be used to extract the required data into Apache Parquet files (a columnar storage format), which are loaded into Spark dataframes for further analysis and machine learning modeling.

As the focus of this project is on building a recommendation engine, the core files used are business.json, review.json, and user.json. Additionally, given the limited hardware resources available, only data for the city of Toronto will be considered for building the recommendation engine.

#### Apache Drill Installation

Apache Drill version 1.11.0 is installed to run in embedded mode on a Linux machine (Ubuntu 16.10).   

For installation instructions and a brief introduction, please review https://drill.apache.org/docs/drill-in-10-minutes/  

The following console output shows that Apache Drill has started with the default configuration and is ready to run queries:


```
osboxes@osboxes:~/apache-drill-1.11.0$ java -version
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
osboxes@osboxes:~/apache-drill-1.11.0$ bin/drill-embedded
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Sep 22, 2017 8:16:02 PM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.11.0 
"drill baby drill"
0: jdbc:drill:zk=local> 
```

##### Check the number of businesses:

```
0: jdbc:drill:zk=local> SELECT COUNT(*) business_count
. . . . . . . . . . . > FROM dfs.`/home/osboxes/yelp-data/dataset/business.json`;
+-----------------+
| business_count  |
+-----------------+
| 156639          |
+-----------------+
1 row selected (2.99 seconds)
```

##### Check the number of reviews:

```
0: jdbc:drill:zk=local> SELECT COUNT(*) review_count
. . . . . . . . . . . > FROM dfs.`/home/osboxes/yelp-data/dataset/review.json`;
+---------------+
| review_count  |
+---------------+
| 4736897       |
+---------------+
1 row selected (115.95 seconds)
```

##### Check the number of users:

```
0: jdbc:drill:zk=local> SELECT COUNT(*) user_count
. . . . . . . . . . . > FROM dfs.`/home/osboxes/yelp-data/dataset/user.json`;
+-------------+
| user_count  |
+-------------+
| 1183362     |
+-------------+
1 row selected (37.628 seconds)
```

##### Check for top 10 cities by number of businesses:

```
0: jdbc:drill:zk=local> SELECT city, COUNT(*) as total_businesses
. . . . . . . . . . . > FROM dfs.`/home/osboxes/yelp-data/dataset/business.json`
. . . . . . . . . . . > GROUP BY city ORDER BY  COUNT(*) DESC LIMIT 10;
+-------------+-------------------+
|    city     | total_businesses  |
+-------------+-------------------+
| Las Vegas   | 24768             |
| Phoenix     | 15656             |
| Toronto     | 15483             |
| Charlotte   | 7557              |
| Scottsdale  | 7510              |
| Pittsburgh  | 5688              |
| Montréal    | 5175              |
| Mesa        | 5146              |
| Henderson   | 4130              |
| Tempe       | 3949              |
+-------------+-------------------+
10 rows selected (2.43 seconds)
```

##### Check for top 10 businesses by review counts:

```
0: jdbc:drill:zk=local> SELECT b.business_id, b.name, b.city, b.neighborhood, b.state, b.stars, r.total_reviews 
. . . . . . . . . . . > FROM dfs.`/home/osboxes/yelp-data/dataset/business.json` b
. . . . . . . . . . . > INNER JOIN
. . . . . . . . . . . > (
. . . . . . . . . . . > SELECT business_id, COUNT(*) total_reviews
. . . . . . . . . . . > FROM dfs.`/home/osboxes/yelp-data/dataset/review.json`
. . . . . . . . . . . > GROUP BY business_id ORDER BY  COUNT(*) DESC LIMIT 10
. . . . . . . . . . . > ) r ON b.business_id = r.business_id
. . . . . . . . . . . > ORDER BY r.total_reviews DESC;
+-------------------------+-------------------------+------------+---------------+--------+--------+----------------+
|       business_id       |          name           |    city    | neighborhood  | state  | stars  | total_reviews  |
+-------------------------+-------------------------+------------+---------------+--------+--------+----------------+
| 4JNXUYY8wbaaDmk3BPzlWw  | Mon Ami Gabi            | Las Vegas  | The Strip     | NV     | 4.0    | 6978           |
| RESDUcs7fIiihp38-d6_6g  | Bacchanal Buffet        | Las Vegas  | The Strip     | NV     | 4.0    | 6412           |
| K7lWdNUhCbcnEvI0NhGewg  | Wicked Spoon            | Las Vegas  | The Strip     | NV     | 3.5    | 5633           |
| cYwJA2A6I12KNkm2rtXd5g  | Gordon Ramsay BurGR     | Las Vegas  | The Strip     | NV     | 4.0    | 5431           |
| DkYS3arLOhA8si5uUEmHOw  | Earl of Sandwich        | Las Vegas  | The Strip     | NV     | 4.5    | 4790           |
| f4x1YBxkLrZg652xt2KR5g  | Hash House A Go Go      | Las Vegas  | The Strip     | NV     | 4.0    | 4371           |
| eoHdUeQDNgQ6WYEnP2aiRw  | Serendipity 3           | Las Vegas  | The Strip     | NV     | 3.0    | 3913           |
| 2weQS-RnoOBhb1KsHKyoSQ  | The Buffet              | Las Vegas  | The Strip     | NV     | 3.5    | 3873           |
| KskYqH1Bi7Z_61pH6Om8pg  | Lotus of Siam           | Las Vegas  | Eastside      | NV     | 4.0    | 3839           |
| ujHiaprwCQ5ewziu0Vi9rw  | The Buffet at Bellagio  | Las Vegas  | The Strip     | NV     | 3.5    | 3698           |
+-------------------------+-------------------------+------------+---------------+--------+--------+----------------+
10 rows selected (140.639 seconds)
```

##### Check for top 10 restaurants by review counts in Toronto:

```
0: jdbc:drill:zk=local> SELECT b.name, b.city, b.neighborhood, b.state, b.stars, r.total_reviews 
. . . . . . . . . . . > FROM dfs.`/home/osboxes/yelp-data/dataset/business.json` b
. . . . . . . . . . . > INNER JOIN
. . . . . . . . . . . > (
. . . . . . . . . . . > SELECT business_id, COUNT(*) total_reviews
. . . . . . . . . . . > FROM dfs.`/home/osboxes/yelp-data/dataset/review.json`
. . . . . . . . . . . > GROUP BY business_id
. . . . . . . . . . . > ) r ON b.business_id = r.business_id
. . . . . . . . . . . > WHERE true=repeated_contains(b.categories,'Restaurants') AND b.city='Toronto'
. . . . . . . . . . . > ORDER BY r.total_reviews DESC LIMIT 10;
+------------------------------------+----------+-------------------------+--------+--------+----------------+
|                name                |   city   |      neighborhood       | state  | stars  | total_reviews  |
+------------------------------------+----------+-------------------------+--------+--------+----------------+
| Pai Northern Thai Kitchen          | Toronto  | Entertainment District  | ON     | 4.5    | 1258           |
| Khao San Road                      | Toronto  | Niagara                 | ON     | 4.0    | 1150           |
| KINKA IZAKAYA ORIGINAL             | Toronto  | Downtown Core           | ON     | 4.0    | 1087           |
| Banh Mi Boys                       | Toronto  | Alexandra Park          | ON     | 4.0    | 937            |
| Seven Lives Tacos Y Mariscos       | Toronto  | Kensington Market       | ON     | 4.5    | 838            |
| Uncle Tetsu's Japanese Cheesecake  | Toronto  | Discovery District      | ON     | 3.5    | 806            |
| Salad King Restaurant              | Toronto  | Downtown Core           | ON     | 3.5    | 777            |
| Momofuku Noodle Bar                | Toronto  | Financial District      | ON     | 3.0    | 696            |
| Sansotei Ramen                     | Toronto  | Downtown Core           | ON     | 4.0    | 647            |
| Insomnia Restaurant & Lounge       | Toronto  |                         | ON     | 4.0    | 644            |
+------------------------------------+----------+-------------------------+--------+--------+----------------+
10 rows selected (137.955 seconds)
```

##### Check stars count / distribution:

```
0: jdbc:drill:zk=local> SELECT stars, COUNT(*) stars_count
. . . . . . . . . . . > FROM dfs.`/home/osboxes/yelp-data/dataset/review.json`
. . . . . . . . . . . > GROUP BY stars order by COUNT(*) DESC;
+--------+--------------+
| stars  | stars_count  |
+--------+--------------+
| 5      | 1988003      |
| 4      | 1135830      |
| 1      | 639849       |
| 3      | 570819       |
| 2      | 402396       |
+--------+--------------+
5 rows selected (104.346 seconds)
```
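
The raw counts are easier to compare as shares of all reviews. The short sketch below copies the counts from the Drill output above and converts them to percentages; as a sanity check, their sum should equal the total review count reported earlier (4,736,897).

```python
# Counts copied from the Drill query output above
star_counts = {5: 1988003, 4: 1135830, 1: 639849, 3: 570819, 2: 402396}

total = sum(star_counts.values())
shares = {stars: count / total for stars, count in star_counts.items()}

# Print the share of each rating, from 5 stars down to 1 star
for stars in sorted(shares, reverse=True):
    print("{} stars: {:.1%}".format(stars, shares[stars]))
```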

##### Check top 10 reviewers for restaurants in Toronto:

```
0: jdbc:drill:zk=local> SELECT u.user_id, u.name, u.yelping_since, u.average_stars, r.total_reviews 
. . . . . . . . . . . > FROM dfs.`/home/osboxes/yelp-data/dataset/user.json` u
. . . . . . . . . . . > INNER JOIN
. . . . . . . . . . . > (
. . . . . . . . . . . > SELECT user_id, COUNT(*) total_reviews
. . . . . . . . . . . > FROM dfs.`/home/osboxes/yelp-data/dataset/review.json`
. . . . . . . . . . . > WHERE business_id IN
. . . . . . . . . . . > (
. . . . . . . . . . . > SELECT business_id FROM dfs.`/home/osboxes/yelp-data/dataset/business.json`
. . . . . . . . . . . > WHERE true=repeated_contains(categories,'Restaurants') AND city='Toronto'
. . . . . . . . . . . > )
. . . . . . . . . . . > GROUP BY user_id
. . . . . . . . . . . > ) r ON u.user_id = r.user_id
. . . . . . . . . . . > ORDER BY r.total_reviews DESC LIMIT 10;
+-------------------------+-----------+----------------+----------------+----------------+
|         user_id         |   name    | yelping_since  | average_stars  | total_reviews  |
+-------------------------+-----------+----------------+----------------+----------------+
| CxDOIDnH8gp9KXzpBHJYXw  | Jennifer  | 2009-11-09     | 3.29           | 863            |
| Q9mA60HnY87C1TW5kjAZ6Q  | Evelyn    | 2010-08-29     | 4.05           | 421            |
| TbhyP24zYZqZ2VJZgu1wrg  | Lauren    | 2010-03-17     | 3.58           | 398            |
| 0BBUmH7Krcax1RZgbH4fSA  | Laura C   | 2010-03-18     | 3.58           | 361            |
| FREeRQtjdJU83AFtdETBBw  | Elle      | 2014-01-10     | 4.15           | 341            |
| 1fNQRju9gmoCEvbPQBSo7w  | Jared     | 2013-02-24     | 3.07           | 320            |
| tWBLn4k1M7PLBtAtwAg73g  | Jay       | 2010-10-14     | 3.6            | 299            |
| V4TPbscN8JsFbEFiwOVBKw  | Mariko    | 2010-02-07     | 3.6            | 294            |
| yT_QCcnq-QGipWWuzIpvtw  | Imran     | 2011-03-24     | 3.56           | 290            |
| gwIqbXEXijQNgdESVc07hg  | Elvis     | 2010-11-17     | 3.17           | 278            |
+-------------------------+-----------+----------------+----------------+----------------+
10 rows selected (154.378 seconds)
```

##### Extract Parquet Files

Configure a writable workspace in Apache Drill, e.g.:  


"tmp": {
      "location": "/tmp",
      "writable": true
       }
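
With a writable workspace in place, an extraction could be written as a Drill `CREATE TABLE ... AS SELECT` (CTAS) statement, whose output format is controlled by the `store.format` session option. The sketch below only composes such a statement as a string and is not the query that was actually run for this project. The helper `build_ctas`, the workspace name `dfs.tmp`, the table name, and the column list are all assumptions made for illustration.

```python
def build_ctas(table_name, source_file, columns, where=None, workspace="dfs.tmp"):
    """Compose a Drill CTAS statement as a string (hypothetical helper)."""
    sql = "CREATE TABLE {ws}.`{tbl}` AS\nSELECT {cols}\nFROM dfs.`{src}`".format(
        ws=workspace, tbl=table_name, cols=", ".join(columns), src=source_file)
    if where:
        sql += "\nWHERE " + where
    return sql

# Drill writes CTAS output in the format named by the store.format option
set_format = "ALTER SESSION SET `store.format` = 'parquet'"

business_ctas = build_ctas(
    table_name="business-small",  # assumed table name
    source_file="/home/osboxes/yelp-data/dataset/business.json",
    # assumed column subset, taken from the business schema shown later
    columns=["business_id", "name", "city", "stars", "review_count", "categories"],
    where="true=repeated_contains(categories,'Restaurants') AND city='Toronto'",
)

print(set_format + ";")
print(business_ctas + ";")
```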

##### Extract business data to Parquet file

##### Extract review data to Parquet file

##### Extract user data to Parquet file

### Data Exploration

In [3]:
business_df = spark.read.parquet(data_path + 'business-small.parquet')

In [4]:
business_df.printSchema()

root
 |-- business_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- neighborhood: string (nullable = true)
 |-- address: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- postal_code: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- stars: double (nullable = true)
 |-- review_count: long (nullable = true)
 |-- categories: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [4]:
business_df.describe()

DataFrame[summary: string, business_id: string, name: string, neighborhood: string, address: string, city: string, state: string, postal_code: string, latitude: string, longitude: string, stars: string, review_count: string]

In [5]:
business_df.show(2)

+--------------------+--------------------+------------+--------------------+-------+-----+-----------+----------+-----------+-----+------------+--------------------+
|         business_id|                name|neighborhood|             address|   city|state|postal_code|  latitude|  longitude|stars|review_count|          categories|
+--------------------+--------------------+------------+--------------------+-------+-----+-----------+----------+-----------+-----+------------+--------------------+
|qim0lD112TkDhm8Zy...|McCarthy's Irish Pub| Upper Beach|1801 Gerrard Stre...|Toronto|   ON|    M4L 2B5|43.6780488|-79.3147736|  4.0|           5|[Pubs, Restaurant...|
|Wf5C8Amv_SlhoYE3_...|         Oishi Sushi|            |    1325 Finch Ave W|Toronto|   ON|    M3J 2G5|43.7635097|-79.4907499|  2.0|          27|[Asian Fusion, Re...|
+--------------------+--------------------+------------+--------------------+-------+-----+-----------+----------+-----------+-----+------------+--------------------+

In [6]:
business_df.count()

6750

In [7]:
business_rdd = business_df.rdd

In [8]:
business_rdd.take(2)

[Row(business_id='qim0lD112TkDhm8ZyQlRnA', name="McCarthy's Irish Pub", neighborhood='Upper Beach', address='1801 Gerrard Street E', city='Toronto', state='ON', postal_code='M4L 2B5', latitude=43.6780488, longitude=-79.3147736, stars=4.0, review_count=5, categories=['Pubs', 'Restaurants', 'Bars', 'Irish', 'Nightlife']),
 Row(business_id='Wf5C8Amv_SlhoYE3_W66WQ', name='Oishi Sushi', neighborhood='', address='1325 Finch Ave W', city='Toronto', state='ON', postal_code='M3J 2G5', latitude=43.7635097, longitude=-79.4907499, stars=2.0, review_count=27, categories=['Asian Fusion', 'Restaurants', 'Sushi Bars'])]

In [9]:
business_df.createOrReplaceTempView("businesses")

In [10]:
query = """
SELECT * FROM businesses where review_count > 100 limit 10
"""

sqlContext.sql(query).toPandas()

|   | business_id | name | neighborhood | address | city | state | postal_code | latitude | longitude | stars | review_count | categories |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | kLw_FmSiEqYH-MtFhDIUFQ | Big Daddy's Bourbon Street Bistro & Oyster Bar | Entertainment District | 212 King Street W | Toronto | ON | M5H 1K5 | 43.647499 | -79.386471 | 3.5 | 132 | [Cajun/Creole, Restaurants, Seafood] |
| 1 | rxA9c0_XObabVL1WCTA4FA | Sneaky Dee's | Kensington Market | 431 College Street | Toronto | ON | M5T 1T1 | 43.656333 | -79.407487 | 3.5 | 362 | [Breakfast & Brunch, Nightlife, Dive Bars, Tex... |
| 2 | 769NudnrUxWFtJCGU66A_A | Thompson Diner | Niagara | 550 Wellington Street W | Toronto | ON | M5V 2V4 | 43.642914 | -79.402046 | 3.0 | 207 | [American (New), Restaurants, Canadian (New), ... |
| 3 | ofw8aDSEg1HoQdmCgvLtaQ | The Pie Commission | Etobicoke | 935 Queensway | Toronto | ON | M8Z 1P4 | 43.623881 | -79.512074 | 4.5 | 183 | [Canadian (New), Fast Food, Food, Do-It-Yourse... |
| 4 | hDy-uY7Vy_TZdGBzw59lhA | Saku Sushi | Alexandra Park | 478 Queen Street W | Toronto | ON | M5V 2B2 | 43.648071 | -79.400286 | 4.0 | 261 | [Japanese, Breakfast & Brunch, Restaurants, Su... |
| 5 | fK1oj0dk9Bc6KsBk5mMDxg | Playa Cabana Cantina | The Junction | 2883 Dundas Street W | Toronto | ON | M6P 1Y9 | 43.665303 | -79.465505 | 3.5 | 229 | [Restaurants, Mexican] |
| 6 | Vg4N2DsGrzzoam9jS1L3Wg | Johnny's Hamburgers | Scarborough | 2595 Victoria Park Avenue | Toronto | ON | M1T 1A4 | 43.774878 | -79.322278 | 3.5 | 166 | [Burgers, Restaurants] |
| 7 | bz07FlaDmxHV9ER-cF6XuA | Platito Filipino Soul Food | Downtown Core | 35 Baldwin Street | Toronto | ON | M5T 1L1 | 43.655859 | -79.393467 | 3.5 | 113 | [Filipino, Restaurants] |
| 8 | W2NzlS8OJzGfDfr9oRz11Q | Drake One Fifty | Financial District | 150 York Street | Toronto | ON | M5H 3S5 | 43.649354 | -79.384684 | 3.5 | 168 | [Cocktail Bars, Brasseries, Food, Canadian (Ne... |
| 9 | XmgdFa3G_CZVfjtQEJMZfQ | Caplansky's Delicatessen |  | 356 College Street | Toronto | ON | M5T 1S6 | 43.657207 | -79.404248 | 3.5 | 390 | [Restaurants, Delis, Caterers, Event Planning ... |


In [68]:
user_df = spark.read.parquet(data_path + 'user-small.parquet')
user_df.printSchema()

root
 |-- user_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- review_count: long (nullable = true)
 |-- yelping_since: string (nullable = true)
 |-- useful: long (nullable = true)
 |-- funny: long (nullable = true)
 |-- cool: long (nullable = true)
 |-- fans: long (nullable = true)
 |-- average_stars: double (nullable = true)



In [12]:
user_df.count()

66424

In [13]:
user_df.createOrReplaceTempView("users")

In [14]:
query = """
SELECT * FROM USERS 
limit 10
"""

sqlContext.sql(query).toPandas()


|   | user_id | name | review_count | yelping_since | useful | funny | cool | fans | average_stars |
|---|---|---|---|---|---|---|---|---|---|
| 0 | om5ZiponkpRqUNa3pVPiRg | Andrea | 2559 | 2006-01-18 | 83681 | 10882 | 40110 | 835 | 3.94 |
| 1 | Wc5L6iuvSNF5WGBlqIO8nw | Risa | 1122 | 2011-07-30 | 26395 | 4880 | 19108 | 435 | 4.1 |
| 2 | uxKSnOVAoEj4I6X9YhLBlg | Vivian | 73 | 2013-03-02 | 34 | 5 | 2 | 8 | 3.54 |
| 3 | s8bVHRqx6cI8F8HGf3A_og | Colleen | 32 | 2014-12-18 | 19 | 3 | 7 | 2 | 4.15 |
| 4 | xEajChTkzWIYTMLkYNoIIw | Di | 71 | 2012-09-26 | 31 | 8 | 4 | 3 | 3.29 |
| 5 | YJLlvBPtvB8iJg8_WKxVzQ | Casey | 72 | 2014-03-01 | 5 | 2 | 3 | 7 | 3.95 |
| 6 | YTdNcIWAt2nEzZ7NY-fniw | Jeff | 754 | 2011-05-16 | 151 | 105 | 125 | 68 | 3.74 |
| 7 | ZWD8UH1T7QXQr0Eq-mcWYg | Jason | 121 | 2013-11-13 | 192 | 29 | 54 | 33 | 3.91 |
| 8 | YSDzb8DnvKozByqBjYiS4w | Jarita | 68 | 2012-03-30 | 1 | 1 | 0 | 8 | 3.64 |
| 9 | ljdo6-BZlywsF5RiGd5e5A | Justina | 75 | 2014-06-18 | 7 | 2 | 0 | 5 | 3.82 |


In [15]:
review_df = spark.read.parquet(data_path + 'review-small.parquet')

In [16]:
review_df.printSchema()

root
 |-- review_id: string (nullable = true)
 |-- user_id: string (nullable = true)
 |-- business_id: string (nullable = true)
 |-- stars: long (nullable = true)
 |-- review_date: string (nullable = true)
 |-- review_text: string (nullable = true)
 |-- useful: long (nullable = true)
 |-- funny: long (nullable = true)
 |-- cool: long (nullable = true)



In [17]:
review_df.count()

276887

In [18]:
review_df.show(3)

+--------------------+--------------------+--------------------+-----+-----------+--------------------+------+-----+----+
|           review_id|             user_id|         business_id|stars|review_date|         review_text|useful|funny|cool|
+--------------------+--------------------+--------------------+-----+-----------+--------------------+------+-----+----+
|Z5l99h18E3_g1GLcD...|djpMXOA1ic5wv3FPt...|mr4FiPaXTWlJ3qGzp...|    3| 2009-07-21|I left Table 17 f...|     3|    0|   0|
|Z3Fw292i0Eg8liW0D...|-pXs08gJq9ExIk275...|mr4FiPaXTWlJ3qGzp...|    3| 2008-12-13|for the time bein...|     1|    0|   0|
|hsKINx1dIKeFTDe-Z...|PTj29rhujYETuFlAZ...|mr4FiPaXTWlJ3qGzp...|    5| 2013-10-12|Love this place. ...|     1|    0|   1|
+--------------------+--------------------+--------------------+-----+-----------+--------------------+------+-----+----+
only showing top 3 rows



In [19]:
review_df.createOrReplaceTempView("reviews")

In [20]:
query = """
SELECT
    business_id,
    COUNT(*) as 5_stars_count
FROM reviews
WHERE stars = 5
GROUP BY business_id 
ORDER BY COUNT(*) DESC
limit 10
"""

sqlContext.sql(query).toPandas()


|   | business_id | 5_stars_count |
|---|---|---|
| 0 | r_BrIgzYcwo1NAuG9dLbpg | 604 |
| 1 | aLcFhMe6DDJ430zelCpd2A | 462 |
| 2 | RtUvSWO_UZ8V3Wpj0n077w | 458 |
| 3 | iGEvDk6hsizigmXhDKs2Vg | 457 |
| 4 | N93EYZy9R0sdlEvubu94ig | 407 |
| 5 | Yl2TN9c23ZGLUBSD9ks5Uw | 279 |
| 6 | ZumOnWbstgsIE6bJlxw0_Q | 267 |
| 7 | mZRKH9ngRY92bI_irrHq6w | 267 |
| 8 | k6zmSLmYAquCpJGKNnTgSQ | 259 |
| 9 | JMiaNitMzMbJm6Kh0RbT5A | 247 |


In [21]:
query = """
SELECT
    review_text
FROM reviews
WHERE stars = 1
limit 10
"""

sqlContext.sql(query).show(2)


+--------------------+
|         review_text|
+--------------------+
|They messed up my...|
|#detox ...... wil...|
+--------------------+
only showing top 2 rows



In [22]:
reviews_text = spark.sql("SELECT business_id, review_text FROM reviews")

In [23]:
reviews_text.show(3)

+--------------------+--------------------+
|         business_id|         review_text|
+--------------------+--------------------+
|mr4FiPaXTWlJ3qGzp...|I left Table 17 f...|
|mr4FiPaXTWlJ3qGzp...|for the time bein...|
|mr4FiPaXTWlJ3qGzp...|Love this place. ...|
+--------------------+--------------------+
only showing top 3 rows



In [24]:
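reviews_text_rdd = reviews_text.rdd
# Concatenate all review texts of each business into a single document.
# A space separator keeps the last word of one review from fusing with the first word of the next.
reviews_by_business_rdd = reviews_text_rdd.map(tuple).reduceByKey(lambda a, b: a + ' ' + b)
reviews_by_business_df = spark.createDataFrame(reviews_by_business_rdd)
reviews_by_business_df = reviews_by_business_df \
                            .withColumnRenamed('_1', 'business_id') \
                            .withColumnRenamed('_2', 'text')
reviews_by_business_df.count()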

6750

In [25]:
reviews_by_business_df.show(3)

+--------------------+--------------------+
|         business_id|                text|
+--------------------+--------------------+
|bfR-vJvrjdOJaWsXG...|Attention allergy...|
|Dl2vgi5W_nbe-A97D...|I don't understan...|
|65ZGMedBm7TBpWv6f...|Food here is alwa...|
+--------------------+--------------------+
only showing top 3 rows



In [26]:

# gaps = False: the pattern matches the tokens themselves, not the separators between them
regexTokenizer = RegexTokenizer(gaps = False, pattern = r'\w+', inputCol = 'text', outputCol = 'token')

reviews_by_business_token_df = regexTokenizer.transform(reviews_by_business_df)
reviews_by_business_token_df.show(3)


+--------------------+--------------------+--------------------+
|         business_id|                text|               token|
+--------------------+--------------------+--------------------+
|bfR-vJvrjdOJaWsXG...|Attention allergy...|[attention, aller...|
|Dl2vgi5W_nbe-A97D...|I don't understan...|[i, don, t, under...|
|65ZGMedBm7TBpWv6f...|Food here is alwa...|[food, here, is, ...|
+--------------------+--------------------+--------------------+
only showing top 3 rows
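
With `gaps = False`, the pattern describes the tokens rather than the separators, and `RegexTokenizer` lowercases its input by default. That is why "don't" appears as `don, t` in the output above. The effect can be approximated in plain Python as sketched below; this uses the `re` module rather than Spark's implementation, and the sample sentence is made up.

```python
import re

def tokenize(text, pattern=r"\w+"):
    """Approximate RegexTokenizer(gaps=False): lowercase, then return every pattern match."""
    return re.findall(pattern, text.lower())

tokens = tokenize("I don't understand the hype!")
print(tokens)  # ['i', 'don', 't', 'understand', 'the', 'hype']
```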



In [27]:
stopWordsRemover = StopWordsRemover(inputCol = 'token', outputCol = 'nostopwrd')

reviews_by_business_token_nostopwrd_df = stopWordsRemover.transform(reviews_by_business_token_df)
reviews_by_business_token_nostopwrd_df.show(3)

+--------------------+--------------------+--------------------+--------------------+
|         business_id|                text|               token|           nostopwrd|
+--------------------+--------------------+--------------------+--------------------+
|bfR-vJvrjdOJaWsXG...|Attention allergy...|[attention, aller...|[attention, aller...|
|Dl2vgi5W_nbe-A97D...|I don't understan...|[i, don, t, under...|[understand, prev...|
|65ZGMedBm7TBpWv6f...|Food here is alwa...|[food, here, is, ...|[food, always, fr...|
+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows



#### The following step of creating the word2vec model is resource intensive and time consuming.  
#### Just load the previously trained model unless you need to rerun / refresh the existing model

In [36]:
#word2Vec = Word2Vec(vectorSize = 100, minCount = 5, inputCol = 'nostopwrd', outputCol = 'word_vec')
#word2Vec_model = word2Vec.fit(reviews_by_business_token_nostopwrd_df)

# save the word2vec model
#word2Vec_model.write().overwrite().save('/home/osboxes/yelp-data/word2Vec')

In [71]:
# load the word2vec trained model

word2Vec_mdl = Word2VecModel.load(model_path + 'word2Vec')

In [41]:
reviews_by_business_vec_df = word2Vec_mdl.transform(reviews_by_business_token_nostopwrd_df)

reviews_by_business_vec_df.show(3)

reviews_by_business_vec_df.select('word_vec').show(1, truncate = True)

+--------------------+--------------------+--------------------+--------------------+--------------------+
|         business_id|                text|               token|           nostopwrd|            word_vec|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|bfR-vJvrjdOJaWsXG...|Attention allergy...|[attention, aller...|[attention, aller...|[-0.0949216104917...|
|Dl2vgi5W_nbe-A97D...|I don't understan...|[i, don, t, under...|[understand, prev...|[-0.0657136221337...|
|65ZGMedBm7TBpWv6f...|Food here is alwa...|[food, here, is, ...|[food, always, fr...|[-0.0036732712861...|
+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows

+--------------------+
|            word_vec|
+--------------------+
|[-0.0949216104917...|
+--------------------+
only showing top 1 row
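
Spark's `Word2VecModel.transform` represents each document by the average of the vectors of its words, which is how every business ends up with a single `word_vec`. The toy sketch below illustrates that averaging in plain Python. The 2-dimensional embeddings are made up, and handling of words missing from the vocabulary is deliberately left out.

```python
def average_vector(tokens, word_vectors):
    """Average the vectors of `tokens`; every token is assumed to be in `word_vectors`."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for token in tokens:
        for i, value in enumerate(word_vectors[token]):
            total[i] += value
    return [value / len(tokens) for value in total]

# Made-up 2-dimensional embeddings, for illustration only
toy_vectors = {"spicy": [1.0, 0.0], "ramen": [0.0, 1.0], "broth": [0.5, 0.5]}

doc_vec = average_vector(["spicy", "ramen", "broth"], toy_vectors)
print(doc_vec)  # [0.5, 0.5]
```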



In [42]:
word2Vec_mdl.findSynonyms("good", 5).show()

+-------+------------------+
|   word|        similarity|
+-------+------------------+
| decent|0.7664605379104614|
|  great|0.6792967319488525|
|  tasty| 0.577859103679657|
|  solid|0.5670980215072632|
|amazing|0.5537719130516052|
+-------+------------------+



In [43]:
word2Vec_mdl.findSynonyms("chinese", 5).show() 

+----------+------------------+
|      word|        similarity|
+----------+------------------+
|     asian|0.8069467544555664|
|vietnamese|0.7198611497879028|
|  northern|0.7195839881896973|
|     hakka|0.7080489993095398|
|    korean|0.7080056667327881|
+----------+------------------+



In [44]:
word2Vec_mdl.findSynonyms("burger", 5).show()

+------------+------------------+
|        word|        similarity|
+------------+------------------+
|     burgers|0.8352599143981934|
|   hamburger|0.7846404910087585|
|       patty|0.7472882270812988|
|      priest|0.7470744252204895|
|cheeseburger|0.7308202981948853|
+------------+------------------+



In [45]:
def CosineSim(vec1, vec2): 
    return np.dot(vec1, vec2) / np.sqrt(np.dot(vec1, vec1)) / np.sqrt(np.dot(vec2, vec2)) 
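
`CosineSim` computes the dot product of the two vectors divided by the product of their lengths, so the score depends only on direction: 1 for the same direction, 0 for orthogonal vectors, and -1 for opposite directions. The standalone sketch below reproduces this without NumPy. The zero-vector guard is an added assumption and not part of the function above; it covers inputs that may arise when a document vector is all zeros.

```python
import math

def cosine_sim(vec1, vec2):
    """Cosine similarity of two equal-length sequences of floats."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumption: treat an all-zero vector as similar to nothing
    return dot / (norm1 * norm2)

print(cosine_sim([1.0, 0.0], [2.0, 0.0]))   # same direction: 1.0
print(cosine_sim([1.0, 0.0], [0.0, 3.0]))   # orthogonal: 0.0
print(cosine_sim([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: -1.0
```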

In [46]:
all_vecs = reviews_by_business_vec_df.select('business_id', 'word_vec').rdd.map(lambda x: (x[0], x[1])).collect()

In [47]:
all_vecs[0]

('bfR-vJvrjdOJaWsXGJgzPA',
 DenseVector([-0.0949, -0.0314, 0.0272, 0.0091, 0.056, -0.0003, 0.0116, 0.02, 0.0641, 0.0029, -0.0145, -0.0746, 0.0152, 0.0599, 0.0564, 0.0002, 0.0737, -0.0414, 0.0264, 0.0784, 0.0352, 0.0016, -0.0452, -0.0517, 0.0131, 0.0295, -0.0073, -0.0313, -0.0145, -0.0208, 0.0482, -0.0188, -0.0831, -0.124, -0.0201, 0.0666, -0.0406, 0.0484, -0.0362, -0.0267, -0.082, -0.0519, -0.011, 0.0132, 0.0178, 0.0408, -0.0015, -0.0176, 0.0003, -0.0212, 0.0747, 0.0396, -0.0601, -0.0949, 0.0083, -0.0494, -0.046, 0.0526, -0.0364, -0.0131, 0.0674, -0.0098, 0.0806, 0.0081, -0.0304, 0.0295, -0.0349, -0.0166, 0.0589, -0.0113, -0.0102, 0.0147, 0.1199, -0.0484, 0.0634, -0.0595, 0.0359, -0.0389, 0.0082, 0.0057, 0.0475, -0.0751, -0.03, -0.0067, 0.0143, -0.013, 0.0334, -0.0109, -0.0418, 0.0716, -0.0147, 0.061, 0.0054, -0.0138, -0.0156, -0.0109, -0.0053, 0.0462, -0.0234, 0.092]))

In [48]:
all_vecs[0][1]

DenseVector([-0.0949, -0.0314, 0.0272, 0.0091, 0.056, -0.0003, 0.0116, 0.02, 0.0641, 0.0029, -0.0145, -0.0746, 0.0152, 0.0599, 0.0564, 0.0002, 0.0737, -0.0414, 0.0264, 0.0784, 0.0352, 0.0016, -0.0452, -0.0517, 0.0131, 0.0295, -0.0073, -0.0313, -0.0145, -0.0208, 0.0482, -0.0188, -0.0831, -0.124, -0.0201, 0.0666, -0.0406, 0.0484, -0.0362, -0.0267, -0.082, -0.0519, -0.011, 0.0132, 0.0178, 0.0408, -0.0015, -0.0176, 0.0003, -0.0212, 0.0747, 0.0396, -0.0601, -0.0949, 0.0083, -0.0494, -0.046, 0.0526, -0.0364, -0.0131, 0.0674, -0.0098, 0.0806, 0.0081, -0.0304, 0.0295, -0.0349, -0.0166, 0.0589, -0.0113, -0.0102, 0.0147, 0.1199, -0.0484, 0.0634, -0.0595, 0.0359, -0.0389, 0.0082, 0.0057, 0.0475, -0.0751, -0.03, -0.0067, 0.0143, -0.013, 0.0334, -0.0109, -0.0418, 0.0716, -0.0147, 0.061, 0.0054, -0.0138, -0.0156, -0.0109, -0.0053, 0.0462, -0.0234, 0.092])

In [49]:
# test similarity by Business

b_id = 'RtUvSWO_UZ8V3Wpj0n077w'

bus_details_df = business_df.filter(col("business_id") == b_id) \
                            .select(['business_id', 'name', 'categories'])
print('Business details:')
bus_details_df.show(truncate = False) 

input_vec = reviews_by_business_vec_df.select('word_vec')\
            .filter(reviews_by_business_vec_df['business_id'] == b_id)\
            .collect()[0][0]

#all_vecs = reviews_by_business_vec_df.select('business_id', 'word_vec').rdd.map(lambda x: (x[0], x[1])).collect()

similar_business_rdd = sc.parallelize((i[0], float(CosineSim(input_vec, i[1]))) for i in all_vecs)

similar_business_df = spark.createDataFrame(similar_business_rdd).\
    withColumnRenamed('_1', 'business_id').\
    withColumnRenamed('_2', 'similarity_score').\
    orderBy("similarity_score", ascending = False)

a = similar_business_df.filter(col("business_id") != b_id).limit(10).alias("a")

b = business_df.alias("b")
j = a.join(b, col("a.business_id") == col("b.business_id"), 'inner')\
     .select([col('a.'+xx) for xx in a.columns] + [col('b.name'),col('b.categories'),
                                                   col('b.stars'),col('b.review_count'),
                                                   col('b.latitude'),col('b.longitude')])
print('Top 10 similar businesses:')
j.toPandas()

Business details:
+----------------------+----------------------+------------------------------------------------------------------------------+
|business_id           |name                  |categories                                                                    |
+----------------------+----------------------+------------------------------------------------------------------------------+
|RtUvSWO_UZ8V3Wpj0n077w|KINKA IZAKAYA ORIGINAL|[Pubs, Japanese, Restaurants, Bars, Nightlife, Tapas Bars, Tapas/Small Plates]|
+----------------------+----------------------+------------------------------------------------------------------------------+

Top 10 similar businesses:


|   | business_id | similarity_score | name | categories | stars | review_count | latitude | longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | CN5nuUQod0f8g3oh99qq0w | 0.993978 | KINKA IZAKAYA BLOOR | [Nightlife, Restaurants, Pubs, Japanese, Tapas... | 4.0 | 351 | 43.665157 | -79.410658 |
| 1 | CfxVkwEJk1NAqgqMSesLzA | 0.980039 | KINKA IZAKAYA NORTH YORK | [Bars, Nightlife, Restaurants, Tapas/Small Pla... | 3.5 | 209 | 43.76019 | -79.410112 |
| 2 | wpQsmMvdhefqIlxvRt_Jbg | 0.974047 | DonDon Izakaya | [Restaurants, Japanese, Tapas/Small Plates, Ta... | 3.0 | 225 | 43.655741 | -79.384625 |
| 3 | L82O1ZFFQfjJxF0_PYWPnA | 0.970267 | Guu Izakaya Toronto | [Tapas Bars, Izakaya, Japanese, Restaurants] | 4.0 | 50 | 43.641867 | -79.43109 |
| 4 | sYKB4nITCLLFcCZPn3QECQ | 0.961989 | Teppan Kenta | [Japanese, Restaurants, Food] | 3.5 | 58 | 43.665279 | -79.385945 |
| 5 | g6GXqg-QdDiQGLYMVqNOUw | 0.948832 | Hapa Izakaya | [Japanese, Restaurants] | 3.5 | 148 | 43.655264 | -79.414242 |
| 6 | 478TIlfHXfT3wvww54QsPg | 0.942653 | Ki Modern Japanese + Bar | [Sushi Bars, Restaurants, Japanese] | 3.5 | 169 | 43.647208 | -79.379381 |
| 7 | SjgeuBlgKER9yegpoxT99w | 0.941614 | Nomé Izakaya | [Bars, Nightlife, Restaurants, Lounges, Tapas ... | 4.0 | 374 | 43.76265 | -79.411469 |
| 8 | KxcQs2Lkm3FJiltVWXOz_Q | 0.938435 | Hashi Izakaya | [Tapas Bars, Nightlife, Japanese, Restaurants,... | 3.5 | 37 | 43.779256 | -79.415713 |
| 9 | 8J0NuWmoFfSGe5LuaiMfpg | 0.9376 | Sake Bar Kushi | [Sushi Bars, Japanese, Restaurants, Tapas Bars] | 4.0 | 67 | 43.704833 | -79.406917 |
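
The ranking step scores every business vector against the query vector and keeps the highest-scoring ones, excluding the query business itself. The plain-Python sketch below shows the same idea on toy data, using `heapq.nlargest` so that only the top `k` entries are kept rather than sorting the full list. The ids, vectors, and helper names are made up for illustration.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k_similar(query_id, vectors, k):
    """Return the k (business_id, score) pairs most similar to `query_id`, excluding itself."""
    query = vectors[query_id]
    scored = ((bid, cosine(query, vec)) for bid, vec in vectors.items() if bid != query_id)
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Made-up ids and 2-dimensional vectors, for illustration only
toy = {"izakaya_a": [1.0, 0.1], "izakaya_b": [0.9, 0.2], "burger_c": [0.1, 1.0]}

ranked = top_k_similar("izakaya_a", toy, k=2)
print([bid for bid, _ in ranked])  # ['izakaya_b', 'burger_c']
```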


In [50]:
def getBusinessDetails(in_bus):
    
    a = in_bus.alias("a")
    b = business_df.alias("b")
    
    return a.join(b, col("a.business_id") == col("b.business_id"), 'inner') \
             .select([col('a.'+xx) for xx in a.columns] + [col('b.name'),col('b.categories'),
                                                           col('b.stars'),col('b.review_count'),
                                                           col('b.latitude'),col('b.longitude')])
    

In [51]:
def getKeyWordsRecoms(key_words, sim_bus_count):
    
    print('Businesses similar to key words: "' + key_words + '"')
    
    input_words_df = sc.parallelize([(0, key_words)]).toDF(['business_id', 'key_words'])

    # Raw string so that \w reaches the regex engine as written
    regexToken = RegexTokenizer(gaps = False, pattern = r'\w+', inputCol = 'key_words', outputCol = 'token')
    stopWrdRem = StopWordsRemover(inputCol = 'token', outputCol = 'nostopwrd')


    # Build the pipeline
    pipeline = Pipeline(stages=[regexToken, stopWrdRem])


    mdl = pipeline.fit(input_words_df)
    input_words_token_nostopwrd_df = mdl.transform(input_words_df)

    input_vec_df = word2Vec_mdl.transform(input_words_token_nostopwrd_df)

    input_key_words_vec = input_vec_df.select('word_vec').collect()[0][0]

    # all_vecs must already exist at notebook scope; it can be rebuilt with:
    # all_vecs = reviews_by_business_vec_df.select('business_id', 'word_vec').rdd.map(lambda x: (x[0], x[1])).collect()

    similar_business_by_key_word_rdd = sc.parallelize((i[0], float(CosineSim(input_key_words_vec, i[1]))) \
                                                                                      for i in all_vecs)

    similar_business_by_key_word_df = spark.createDataFrame(similar_business_by_key_word_rdd).\
        withColumnRenamed('_1', 'business_id').\
        withColumnRenamed('_2', 'similarity_score').\
        orderBy("similarity_score", ascending = False)

    a = similar_business_by_key_word_df.limit(sim_bus_count)
    return getBusinessDetails(a)
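
The preprocessing inside `getKeyWordsRecoms` can be pictured without Spark. The sketch below is a plain-Python approximation, assuming that `RegexTokenizer` with `gaps = False` lowercases the text and keeps every `\w+` match. `STOP_WORDS` is a tiny hypothetical list used only for illustration; Spark's `StopWordsRemover` uses a much longer default English list.

```python
import re

# Hypothetical stop-word list for illustration only; Spark's
# StopWordsRemover ships a much longer default English list.
STOP_WORDS = {"a", "an", "and", "the", "with", "of"}

def tokenize(text):
    """Lowercase the text and keep every run of word characters."""
    return re.findall(r"\w+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = remove_stop_words(tokenize("Chicken and cheese burger with fries"))
print(tokens)
```

Only the surviving tokens are handed to the Word2Vec model, so a query made up entirely of stop words would leave nothing to embed.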


In [52]:
key_words = 'chicken cheese burger'

keywords_recom_df = getKeyWordsRecoms(key_words, 10)
keywords_recom_df.toPandas()

Businesses similar to key words: "chicken cheese burger"


Unnamed: 0,business_id,similarity_score,name,categories,stars,review_count,latitude,longitude
0,37joQpD9m5AIcrW1c8OBnQ,0.706561,Urban Smoke Fusion BBQ Food Truck,"[Desserts, Barbeque, Food, Restaurants, Food T...",4.0,8,43.718711,-79.470037
1,3Cu-af4en3uWCrAkkqfiHQ,0.698635,Epic Burgers and Waffles,"[Burgers, Food, Restaurants]",2.5,5,43.632351,-79.42128
2,nP87zXxeS-8got7IBvoAuA,0.657255,McCoy Burger Company,"[Local Flavor, Sandwiches, Restaurants, Poutin...",4.0,33,43.731511,-79.404081
3,DiCMYxT69I22-1nfsvYAJQ,0.651565,Gourmet Burger Co,"[Burgers, Restaurants]",3.5,37,43.664683,-79.368279
4,UN0UwUh7jaeX6Jg3lZImCg,0.638442,Holy Chuck,"[Food, Restaurants, Desserts, Poutineries, Bur...",3.0,43,43.665211,-79.384925
5,ZzF5098L4xg-0COjng2LVA,0.638042,Burgatory,"[Pubs, Burgers, Food Trucks, Nightlife, Bars, ...",3.0,9,43.655055,-79.418563
6,PkeaeQS8aJTeS8PS_Hl_-g,0.637751,Steak and Cheese Factory,"[Sandwiches, Cheesesteaks, Restaurants]",3.0,3,43.708213,-79.392367
7,ky9RbwLtChekSrqcYR39kw,0.635411,Big Smoke Burger,"[Burgers, Poutineries, Restaurants]",3.0,6,43.611289,-79.556867
8,ycAW6Q5quaCSDX5zwQ3tPg,0.630411,New York Fries,"[Canadian (New), Specialty Food, Food, Restaur...",3.5,8,43.776875,-79.256655
9,7UPTUpex3O1Gav3td7GOEw,0.625449,South St Burger Co,"[Burgers, Restaurants]",3.0,6,43.736442,-79.344201
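
The `similarity_score` column comes from `CosineSim`, which is defined earlier in the notebook. As a rough illustration of the ranking step, here is a plain-Python sketch that assumes `CosineSim` is the standard cosine similarity. `cosine_sim`, `top_k_similar`, and the toy vectors are hypothetical stand-ins, not the notebook's own code or data.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k_similar(query_vec, candidates, k):
    """Rank (id, vector) pairs by similarity to query_vec, highest first."""
    scored = [(cid, cosine_sim(query_vec, vec)) for cid, vec in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up 2-dimensional vectors standing in for the per-business word_vec values.
toy_vecs = [("burger_bar", [1.0, 0.1]), ("sushi_bar", [0.0, 1.0]), ("diner", [0.8, 0.6])]
ranking = top_k_similar([1.0, 0.0], toy_vecs, 2)
print(ranking)
```

Because cosine similarity depends only on the angle between vectors, a short keyword query and a long body of reviews can still score highly against each other if they point in a similar direction.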


In [53]:
def getSimilarBusinesses(b_ids, sim_bus_count):
    
    schema = StructType([
                            StructField("business_id", StringType(), True),
                            # Cosine similarities are floats, so the score column is a double
                            StructField("similarity_score", DoubleType(), True)
                        ])
    
    similar_businesses_df = spark.createDataFrame([], schema)
    
    for b_id in b_ids:
        
        print('Businesses similar to: ' + b_id)
        
        input_vec = reviews_by_business_vec_df.select('word_vec')\
                    .filter(reviews_by_business_vec_df['business_id'] == b_id)\
                    .collect()[0][0]

        # all_vecs must already exist at notebook scope; it can be rebuilt with:
        # all_vecs = reviews_by_business_vec_df.select('business_id', 'word_vec').rdd.map(lambda x: (x[0], x[1])).collect()

        similar_business_rdd = sc.parallelize((i[0], float(CosineSim(input_vec, i[1]))) for i in all_vecs)

        similar_business_df = spark.createDataFrame(similar_business_rdd) \
            .withColumnRenamed('_1', 'business_id') \
            .withColumnRenamed('_2', 'similarity_score') \
            .orderBy("similarity_score", ascending = False)
            
        # Drop the seed business itself and honour the requested neighbour count
        similar_business_df = similar_business_df.filter(col("business_id") != b_id).limit(sim_bus_count)
        similar_business_df.show()
        
        similar_businesses_df = similar_businesses_df.union(similar_business_df)
    
    return similar_businesses_df
    

In [54]:
def getContentRecoms(u_id, sim_bus_count=10):
    
    query = """
    SELECT distinct business_id FROM reviews  
    where stars >= 3.0 
    and user_id = "{}"
    """.format(u_id)

    usr_rev_bus = sqlContext.sql(query)

    # Use a random sample (about half) of those businesses, capped at 5, as seeds
    usr_rev_bus = usr_rev_bus.sample(False, 0.5).limit(5)

    usr_rev_bus_det = getBusinessDetails(usr_rev_bus)

    print('Businesses previously reviewed by user:')
    usr_rev_bus_det.select(['business_id', 'name', 'categories']).show(truncate = False)

    bus_list = [i.business_id for i in usr_rev_bus.collect()]

    sim_bus_df = getSimilarBusinesses(bus_list, sim_bus_count)

    s = sim_bus_df.alias("s")
    r = usr_rev_bus.alias("r")
    j = s.join(r, col("s.business_id") == col("r.business_id"), 'left_outer') \
         .where(col("r.business_id").isNull()) \
         .select([col('s.business_id'),col('s.similarity_score')])

    a = j.orderBy("similarity_score", ascending = False).limit(sim_bus_count)

    return getBusinessDetails(a)
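
The post-processing in `getContentRecoms` amounts to pooling the per-seed neighbour lists, dropping the seed businesses themselves, and keeping the highest-scoring rows. A plain-Python sketch with made-up ids and scores follows; `pool_recommendations` is a hypothetical helper name. Unlike the `union` used above, which keeps duplicate rows, this sketch also collapses a business that is close to several seeds into a single entry with its best score. That deduplication is an assumption added here, not something the notebook code does.

```python
def pool_recommendations(neighbours_by_seed, seed_ids, k):
    """Pool neighbour lists, drop the seeds, keep the k best-scoring businesses."""
    best = {}
    for neighbours in neighbours_by_seed.values():
        for business_id, score in neighbours:
            if business_id in seed_ids:
                continue  # seed businesses are never recommended back
            # Assumption: keep only the best score per business (dedup).
            if score > best.get(business_id, float("-inf")):
                best[business_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Made-up ids and scores for illustration.
neighbours_by_seed = {
    "seed_a": [("b1", 0.98), ("seed_b", 0.97), ("b2", 0.90)],
    "seed_b": [("b2", 0.95), ("b3", 0.93)],
}
recs = pool_recommendations(neighbours_by_seed, {"seed_a", "seed_b"}, k=2)
print(recs)
```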

     

In [64]:
u_id = 'ZWD8UH1T7QXQr0Eq-mcWYg'

content_recom_df = getContentRecoms(u_id)

print("Businesses recommended to the user based on their previously reviewed businesses:")
content_recom_df.toPandas()

Businesses previously reviewed by user:
+----------------------+-----------------------+--------------------------------------------------------+
|business_id           |name                   |categories                                              |
+----------------------+-----------------------+--------------------------------------------------------+
|73_UT7fZ7mzXcguX8-oSuQ|Amsterdam BrewHouse    |[Breweries, Nightlife, Burgers, Bars, Restaurants, Food]|
|i2Fd0dl39BZ8nVxBnSPsKg|Anchor Bar             |[Restaurants, Sandwiches, Salad, Pizza, Chicken Wings]  |
|JJ8ypBu3b--fy4HA5RB1gg|Morton's The Steakhouse|[Steakhouses, Restaurants]                              |
+----------------------+-----------------------+--------------------------------------------------------+

Businesses similar to: 73_UT7fZ7mzXcguX8-oSuQ
+--------------------+------------------+
|         business_id|  similarity_score|
+--------------------+------------------+
|MT-lkIacLNyC7KF2d...|0.9846081843757822|
|qz

Unnamed: 0,business_id,similarity_score,name,categories,stars,review_count,latitude,longitude
0,Q2ZNaN3p8s_-XXjBWyY2qA,0.98841,Ruth's Chris Steak House,"[Restaurants, Steakhouses, Party & Event Plann...",3.5,172,43.649612,-79.385306
1,MT-lkIacLNyC7KF2dY997A,0.984608,Mill Street Beer Hall,"[Canadian (New), Gastropubs, Nightlife, Bars, ...",3.0,118,43.650442,-79.358443
2,7ODXq--HE7QpzvWwgk5rMA,0.981036,Barberian's Steak House,"[Steakhouses, Restaurants]",4.0,195,43.657592,-79.382162
3,ZumOnWbstgsIE6bJlxw0_Q,0.980275,Jacobs & Co. Steakhouse,"[Restaurants, Steakhouses]",4.5,393,43.645371,-79.398011
4,qztZIyt2BMSKfX052OgKXQ,0.979035,Against the Grain,"[Canadian (New), Restaurants]",3.0,190,43.643598,-79.366496
5,0BW6h-igJinzbqc-prYUaQ,0.97544,Hy's Steakhouse & Cocktail Bar,"[Seafood, Nightlife, Bars, Steakhouses, Restau...",3.5,84,43.649731,-79.382955
6,klu0zF1rWAoNAhKPsFyUog,0.973063,LOCAL Public Eatery,"[American (Traditional), Canadian (New), Resta...",3.5,80,43.709888,-79.363548
7,4POPYEONJpkfhWOMx_PyGg,0.973019,Harbour Sixty,"[Seafood, Steakhouses, Restaurants]",3.5,170,43.642064,-79.378434
8,Vf_RHj0f1VViEF6OYnEfUA,0.970874,Quinn's Steakhouse & Irish Bar,"[Restaurants, Steakhouses, Irish]",3.5,84,43.651089,-79.382919
9,qXk2CS1-6jKTKpMUIgN9kQ,0.970591,O'Grady's Restaurant,"[Bars, Restaurants, Canadian (New), Cocktail B...",3.0,88,43.66447,-79.38068


In [65]:
showInMap(content_recom_df)

In [56]:
u_id = 'Wc5L6iuvSNF5WGBlqIO8nw'

content_recom_df = getContentRecoms(u_id)

print("Businesses recommended to the user based on their previously reviewed businesses:")
content_recom_df.toPandas()

Businesses previously reviewed by user:
+----------------------+----------------------------+-------------------------------------------------------------------------------------------+
|business_id           |name                        |categories                                                                                 |
+----------------------+----------------------------+-------------------------------------------------------------------------------------------+
|J4_q5iMukg-UnnLnT6ZwAA|Northern Belle              |[Nightlife, Food, Restaurants, Bars, Coffee & Tea, Cocktail Bars, Cafes]                   |
|F_oPMHJrH42R67xp5eKtQA|Yummy Korean Food Restaurant|[Korean, Restaurants]                                                                      |
|JmZj7wzAJ7_4ksjG9WXdqw|Gladstone Hotel             |[Hotels & Travel, Lounges, Restaurants, Bars, Nightlife, Event Planning & Services, Hotels]|
|9jYnZymuaW-XpMIS75YxgQ|The Beaver                  |[Canadian (New), Nightlife, Caf

Unnamed: 0,business_id,similarity_score,name,categories,stars,review_count,latitude,longitude
0,rO3lZpVSoRMhhd0AEJBjlg,0.986628,Sunrise House,"[Restaurants, Korean]",4.0,135,43.664068,-79.415668
1,rhyjGfqYlCJoi8Zeulg6QA,0.985523,Kimchi Korea House,"[Korean, Restaurants]",3.5,155,43.655256,-79.385475
2,j-Z_HAev26ZftdErMhIBuA,0.981557,Thumbs Up Korean Restaurant,"[Restaurants, Korean]",4.0,56,43.664451,-79.413786
3,_MA98TVmvVIy-XdI0poc7w,0.980885,Mom's Korean Food,"[Korean, Restaurants]",3.5,62,43.664686,-79.413785
4,SNkkuchbVtUzCwyENcai_g,0.980637,Danji,"[Restaurants, Chinese, Japanese, Korean]",3.5,57,43.6653,-79.384899
5,ShUh_MMkaVp_KXCtNjPvXA,0.976432,Universal Grill,"[American (Traditional), Canadian (New), Break...",3.5,45,43.670521,-79.42644
6,X6ZZksefmR_piQj2Gbnduw,0.975906,Paldo Gangsan,"[Restaurants, Korean]",4.0,47,43.663799,-79.417393
7,uChTCA6MsLAciDRklpO-Fw,0.973643,Makkal Chon,"[Greek, Restaurants, Korean]",4.0,210,43.744944,-79.296636
8,ZCrK07xb6w5Vi1vathV0NQ,0.973478,Bapbo Korean Restaurant,"[Korean, Japanese, Restaurants]",3.0,86,43.655606,-79.384966
9,oQylTvXwGIkKFdCjmafKVg,0.973024,Fire on the East Side,"[Southern, Restaurants, Breakfast & Brunch, Am...",3.5,119,43.666765,-79.384836


In [61]:
def showInMap(df):
    
    mp = folium.Map(location=[43.70011, -79.4163], zoom_start=12)

    for i, r in df.toPandas().iterrows():
        folium.Marker(
                    location =[r.latitude, r.longitude], 
                    popup = html.escape(r["name"]) + '<br>' + 'Stars: ' + str(r.stars) + '<br>' + 'Reviews: ' + str(r.review_count),    
                    icon = folium.Icon(color='green')).add_to(mp)
    return mp

In [62]:
showInMap(content_recom_df)
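
The popup text in `showInMap` is raw HTML, which is why the business name goes through `html.escape` first. The standalone sketch below isolates that step; `build_popup` is a hypothetical helper that mirrors the string built inside the loop, using one row from the recommendations above as input.

```python
import html

def build_popup(name, stars, review_count):
    """Return popup HTML with the business name escaped."""
    return (html.escape(name) + "<br>"
            + "Stars: " + str(stars) + "<br>"
            + "Reviews: " + str(review_count))

popup = build_popup("Jacobs & Co. Steakhouse", 4.5, 393)
print(popup)
```

Without the escaping, names that contain `&`, `<`, or quote characters could produce malformed popup markup.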