The project's primary objective is to build a Restaurant Recommender Application that helps users find restaurants matching their tastes. I gathered real restaurant reviews and used machine learning techniques to build a recommender system on top of them.

In this project, I analyzed restaurant reviews from Vancouver. I gathered data from the web, assembled it into a data frame, built a recommender system, and developed an application that makes the recommendations available to users.

__Please note: this is notebook 1 of 3.__

In this notebook, I collected restaurant names with their HTML links, ratings, reviews, and restaurant types.

To collect the data I used Selenium, an open-source framework for automating web browsers. It provides tools and libraries that let a script interact with web applications as if it were a real user. Selenium can be used for web scraping by navigating to web pages, extracting data, and automating interactions.

In [31]:
import pandas as pd
import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By

In [1]:
import session_info
session_info.show()

### Step 1. Collect restaurant names with urls
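
The listing is paginated through an offset in the URL: the cell below appends `-oa{offset}` to the base URL and advances the offset by 30 per page. As a minimal sketch of that scheme, the helper `listing_page_urls` (a hypothetical name, not part of the notebook) generates the page URLs up front. The step of 30 per page comes from the notebook's own comment and may no longer hold if the site changes its layout.

```python
def listing_page_urls(base_url, num_pages, page_size=30):
    """Return the listing URL for each page, using an '-oa<offset>' suffix.

    page_size=30 mirrors the notebook's assumption of 30 restaurants per page.
    """
    return [f"{base_url}-oa{page_size * page}" for page in range(num_pages)]


base_url = "https://www.tripadvisor.com/Restaurants-g154943"
urls = listing_page_urls(base_url, num_pages=35)
print(urls[0])    # first page, offset 0
print(urls[-1])   # last page, offset 30 * 34 = 1020
print(len(urls))  # 35 pages, so at most 35 * 30 = 1050 listings
```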


In [32]:
# Set up the web driver 
driver = webdriver.Chrome()

# Specify the location and base URL
location = "g154943"  # Vancouver's location ID on TripAdvisor
base_url = f"https://www.tripadvisor.com/Restaurants-{location}"

# Specify the number of pages you want to scrape 
num_pages = 35

# Find all restaurant names and URLs across multiple pages
restaurants_info = []

for page in range(1, num_pages + 1):
    # Construct the URL for the current page
    url = f"{base_url}-oa{30 * (page - 1)}"  # Each page shows 30 restaurants
    driver.get(url)

    # Find all anchor elements that hold a restaurant's name (link text) and URL (href)
    selector = "div > div > div.yJIls.z.y.M0 > header > div > div.jhsNf.N.G > div.VDEXx.u.Ff.K > div > a"
    elements = driver.find_elements(By.CSS_SELECTOR, selector)

    # Extract the name and URL from each anchor on the current page
    for element in elements:
        restaurants_info.append({
            "name": element.text,
            "url": element.get_attribute("href"),
        })

# Print all restaurant names and URLs
for restaurant in restaurants_info:
    print(f"Name: {restaurant['name']}")
    print(f"URL: {restaurant['url']}")
    print()



Name: Freshslice Pizza
URL: https://www.tripadvisor.com/Restaurant_Review-g154943-d8074088-Reviews-Freshslice_Pizza-Vancouver_British_Columbia.html

Name: 1. Hydra Estiatorio Mediterranean
URL: https://www.tripadvisor.com/Restaurant_Review-g154943-d15695553-Reviews-Hydra_Estiatorio_Mediterranean-Vancouver_British_Columbia.html

Name: 2. Black + Blue
URL: https://www.tripadvisor.com/Restaurant_Review-g154943-d2399483-Reviews-Black_Blue-Vancouver_British_Columbia.html

Name: 3. Alouette Bistro
URL: https://www.tripadvisor.com/Restaurant_Review-g154943-d23602211-Reviews-Alouette_Bistro-Vancouver_British_Columbia.html

Name: 4. Seaside Provisions
URL: https://www.tripadvisor.com/Restaurant_Review-g181717-d19407527-Reviews-Seaside_Provisions-North_Vancouver_British_Columbia.html

Name: 5. Salmon n' Bannock Bistro
URL: https://www.tripadvisor.com/Restaurant_Review-g154943-d1719718-Reviews-Salmon_n_Bannock_Bistro-Vancouver_British_Columbia.html

Name: Freshslice Pizza
URL: https://www.tripadv

In [22]:
# convert the list into a pandas dataframe
df_restaurants_info = pd.DataFrame(restaurants_info)
df_restaurants_info

Unnamed: 0,name,url
0,Freshslice Pizza,https://www.tripadvisor.com/Restaurant_Review-...
1,1. Hydra Estiatorio Mediterranean,https://www.tripadvisor.com/Restaurant_Review-...
2,2. Alouette Bistro,https://www.tripadvisor.com/Restaurant_Review-...
3,3. Black + Blue,https://www.tripadvisor.com/Restaurant_Review-...
4,4. Seaside Provisions,https://www.tripadvisor.com/Restaurant_Review-...
...,...,...
1150,1046. The Pint Public House,https://www.tripadvisor.com/Restaurant_Review-...
1151,1047. Boston Pizza,https://www.tripadvisor.com/Restaurant_Review-...
1152,1048. Mirchi Restaurant,https://www.tripadvisor.com/Restaurant_Review-...
1153,1049. Say Mercy!,https://www.tripadvisor.com/Restaurant_Review-...


In [23]:
# save df as csv
df_restaurants_info.to_csv('df_restaurants_info_1', index=False)
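
The scraped names carry a leading rank such as `1. `, and some listings (for example Freshslice Pizza) appear more than once in the output above. A minimal sketch of one way to tidy the list, assuming a leading `<number>. ` is always a rank rather than part of the real name and that a URL identifies a restaurant; `clean_restaurants` is a hypothetical helper, not part of the notebook:

```python
import re

# Matches a leading rank such as "12. " at the start of a scraped name.
RANK_PREFIX = re.compile(r"^\d+\.\s+")


def clean_restaurants(restaurants_info):
    """Strip rank prefixes from names and keep the first entry per URL."""
    seen_urls = set()
    cleaned = []
    for item in restaurants_info:
        url = item["url"]
        if url in seen_urls:
            continue  # same restaurant listed again on another page
        seen_urls.add(url)
        cleaned.append({"name": RANK_PREFIX.sub("", item["name"]), "url": url})
    return cleaned


# Small made-up sample in the same shape as restaurants_info.
sample = [
    {"name": "Freshslice Pizza", "url": "https://example.com/freshslice"},
    {"name": "1. Hydra Estiatorio Mediterranean", "url": "https://example.com/hydra"},
    {"name": "Freshslice Pizza", "url": "https://example.com/freshslice"},
]
cleaned = clean_restaurants(sample)
print(cleaned)
```

If a cleanup like this is used, it would need to run before the following steps so that every table refers to restaurants by the same name.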

### Step 2. Collect restaurant reviews

Using each restaurant's name and link, collect the reviews shown on its page.

In [24]:
review_list = []

for restaurant in restaurants_info:
    # go to restaurant page
    driver.get(restaurant['url'])

    review_elems = driver.find_elements(By.CLASS_NAME, 'partial_entry')
    

    for elem in review_elems:
        review_list.append({'restaurant': restaurant['name'], 'review': elem.text})

In [25]:
# convert to dataframe

df_review = pd.DataFrame(review_list)
df_review

Unnamed: 0,restaurant,review
0,Freshslice Pizza,Simar arora is very nice and sweet girl who se...
1,Freshslice Pizza,I went to the Granville and 2400 block Vancouv...
2,Freshslice Pizza,We were in a hurry to grab a snack. Pizza was ...
3,Freshslice Pizza,Freshslice Pizza is hands down one of the best...
4,Freshslice Pizza,"I met there with their staff girl named Kiran,..."
...,...,...
18635,1050. Beach Ave Bar and Grill,Stopped in for lunch while I was walking the S...
18636,1050. Beach Ave Bar and Grill,Great food and beer in a perfect spot looking ...
18637,1050. Beach Ave Bar and Grill,We dine here often. And I almost always get a ...
18638,1050. Beach Ave Bar and Grill,The Beach Ave bar and grill is lovely spot to ...


In [26]:
# save as csv

df_review.to_csv('df_review_1', index=False)

### Step 3. Collect information about restaurants

In [33]:
type_list = []

for restaurant in restaurants_info:
    # go to restaurant page
    driver.get(restaurant['url'])

    type_elems = driver.find_elements(By.CLASS_NAME, 'SrqKb')
    

    for elem in type_elems:
        type_list.append({'restaurant': restaurant['name'], 'type': elem.text})

In [37]:
# convert to dataframe

df_type = pd.DataFrame(type_list)
df_type

Unnamed: 0,restaurant,type
0,Freshslice Pizza,€2 - €5
1,Freshslice Pizza,Pizza
2,Freshslice Pizza,"Lunch, Dinner"
3,3. Alouette Bistro,€8 - €34
4,3. Alouette Bistro,French
...,...,...
3124,1050. Tokyo Joe's,"Lunch, Dinner"
3125,1050. Tokyo Joe's,Takeout
3126,Freshslice Pizza,€2 - €5
3127,Freshslice Pizza,Pizza


In [38]:
# save as csv

df_type.to_csv('df_type', index=False)
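
As the output above shows, each restaurant contributes several rows to `df_type` (a price range, a cuisine, meal types), all in the same `type` column. A minimal sketch of one way to collapse those rows into one record per restaurant, assuming exact repeats can be dropped; `group_types` is a hypothetical helper, not part of the notebook:

```python
from collections import defaultdict


def group_types(type_list):
    """Collect all 'type' strings per restaurant, dropping exact repeats."""
    grouped = defaultdict(list)
    for row in type_list:
        values = grouped[row["restaurant"]]
        if row["type"] not in values:
            values.append(row["type"])
    return dict(grouped)


# Rows in the same shape as type_list, taken from the output above.
sample = [
    {"restaurant": "Freshslice Pizza", "type": "€2 - €5"},
    {"restaurant": "Freshslice Pizza", "type": "Pizza"},
    {"restaurant": "Freshslice Pizza", "type": "Lunch, Dinner"},
    {"restaurant": "3. Alouette Bistro", "type": "French"},
    {"restaurant": "Freshslice Pizza", "type": "Pizza"},  # repeated listing
]
grouped = group_types(sample)
print(grouped["Freshslice Pizza"])
```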

### Step 4. Collect restaurant ratings

In [34]:
rating_list = []

for restaurant in restaurants_info:
    # go to restaurant page
    driver.get(restaurant['url'])

    rating_elems = driver.find_elements(By.CLASS_NAME, 'ZDEqb')

    for elem in rating_elems:
        rating_list.append({'restaurant': restaurant['name'], 'rating': elem.text})

In [38]:
# convert to dataframe

df_rating = pd.DataFrame(rating_list)
df_rating

Unnamed: 0,restaurant,rating
0,Freshslice Pizza,5.0
1,1. Hydra Estiatorio Mediterranean,5.0
2,2. Black + Blue,4.5
3,3. Alouette Bistro,5.0
4,4. Seaside Provisions,5.0
...,...,...
2627,2396. Tacomio,5.0
2628,2397. Pho Edmonds,3.5
2629,2398. Midam,4.0
2630,2399. Nando's Peri-Peri,5.0


In [39]:
# save as csv

df_rating.to_csv('df_rating', index=False)
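
Steps 2 to 4 each load every restaurant page, so each page is fetched three times. A minimal sketch of a single-pass variant follows, assuming the same class names (`partial_entry`, `SrqKb`, `ZDEqb`) still match the page. The function `scrape_details`, its `delay` parameter, and the error handling are illustrative additions, not the method used in this notebook. The locator strategy is passed as the string `"class name"`, which is the value of Selenium's `By.CLASS_NAME`, so the function works with any object that exposes `get` and `find_elements`.

```python
import time

# Field name -> CSS class used in the notebook's Steps 2-4.
FIELDS = {"review": "partial_entry", "type": "SrqKb", "rating": "ZDEqb"}


def scrape_details(driver, restaurants_info, fields=FIELDS, delay=1.0):
    """Visit each restaurant page once and collect every field from it.

    Returns {field: [{"restaurant": name, field: text}, ...]} plus a list
    of (name, error message) pairs for pages that failed.
    """
    results = {field: [] for field in fields}
    failed = []
    for restaurant in restaurants_info:
        try:
            driver.get(restaurant["url"])
            for field, class_name in fields.items():
                for elem in driver.find_elements("class name", class_name):
                    results[field].append(
                        {"restaurant": restaurant["name"], field: elem.text}
                    )
        except Exception as exc:  # broad on purpose: keep the crawl going
            failed.append((restaurant["name"], str(exc)))
        time.sleep(delay)  # pause between pages to avoid hammering the site
    return results, failed
```

With the notebook's `driver`, a call such as `results, failed = scrape_details(driver, restaurants_info)` followed by `pd.DataFrame(results["review"])` would give a frame shaped like `df_review`, and `failed` lists the pages worth retrying.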

I have now collected all the information I need for further analysis.