# Tabelog Scraping: Tokyo sushi restaurants

**What is Tabelog?**  
Tabelog is an online restaurant information website. You can search for restaurants using filters and see information about them, such as menus, addresses, contact information, photos, scores (ratings), and reviews. (It is similar to OpenTable or Yelp.)

## Goal:

If you are interested in opening a sushi restaurant in Tokyo, this project is for you! We will acquire restaurant information from Tabelog to practice scraping. We focus on sushi restaurants in Tokyo!  
  
We will extract sushi restaurant information including:

1. Restaurant names  
2. Average scores  
3. Number of reviews  
4. Reviews (for dinner)  
  
At the end of this project, we will have two dataframes saved as CSV files. One is about restaurants, and the other is about reviews!

## Table of Contents:

1. Import the packages  
2. Get the restaurant names and the restaurant page URLs from the search result  
3. Get the average scores of the restaurants, the numbers of the reviews, and the URLs where the reviews are stored  
4. Make a restaurant dataframe  
5. Get the reviews of each restaurant    
6. Summary  

### 1. Import the packages

We will use the "requests" and "BeautifulSoup" libraries to extract the data. The "tqdm" library shows the progress of the for-loops.

In [1]:
# Import the packages
import pandas as pd
import numpy as np
from tqdm import tqdm
import requests
from bs4 import BeautifulSoup
# import re
# import csv

In [2]:
# Add a button that toggles the code cells on and off, so the notebook can be read without the code
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle the code on/off."></form>''')

### 2. Get the restaurant names and the restaurant page URLs from the search result

First, get the HTML of the restaurant list. Go to the Tabelog website (https://tabelog.com/) and search for "東京, 寿司" ("Tokyo, sushi"). A list of sushi restaurants appears, and the address bar shows its URL (https://tabelog.com/tokyo/rstLst/sushi/1/?sk=%E5%AF%BF%E5%8F%B8&svd=20210104&svt=1900&svps=2). We will use this URL to get the restaurant names and the restaurant page URLs. Also check how many pages the list has. In this case, there are 60 pages.
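
The `sk` parameter in that URL is the percent-encoded search keyword, and the page number is the path segment right after `sushi/`. As a minimal sketch, assuming the URL pattern shown above stays the same for every page, the 60 page URLs could be built like this (`build_page_url` is a hypothetical helper name, not part of Tabelog or any library):

```python
from urllib.parse import quote, unquote

# URL pieces copied from the search result URL above
BASE = 'https://tabelog.com/tokyo/rstLst/sushi/'
QUERY = '?sk=%E5%AF%BF%E5%8F%B8&svd=20210104&svt=1900&svps=2'


def build_page_url(page):
    """Return the search-result URL for one page (hypothetical helper)."""
    return f'{BASE}{page}/{QUERY}'


# The sk value decodes to the Japanese keyword for sushi
keyword = unquote('%E5%AF%BF%E5%8F%B8')
print(keyword)         # 寿司
print(quote(keyword))  # %E5%AF%BF%E5%8F%B8

# One URL per result page, pages 1 to 60
page_urls = [build_page_url(page) for page in range(1, 61)]
print(len(page_urls))  # 60
print(page_urls[0])
```

Building the URLs up front keeps the request loop short and makes it easy to check the list before sending anything.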

In [3]:
# Send the request and get the HTML
url = 'https://tabelog.com/tokyo/rstLst/sushi/1/?sk=%E5%AF%BF%E5%8F%B8&svd=20210104&svt=1900&svps=2'
response = requests.get(url)

# Set a parser
soup = BeautifulSoup(response.text, 'html.parser')

Look at the Tabelog page again to find where the restaurant names and the restaurant URLs are stored. They are in the `<a class='list-rst__rst-name-target'>` tags. Also, remember that the list has 60 pages.
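
To see what `find_all` returns before running it against the live site, here is a small offline sketch. The HTML snippet is made up for illustration: it only imitates the class name mentioned above and is not real Tabelog markup.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the structure described above (not real Tabelog markup)
sample_html = '''
<div>
  <a class="list-rst__rst-name-target" href="https://example.com/restaurant/1/">Sushi A</a>
  <a class="other-link" href="https://example.com/ignored/">Ignored</a>
  <a class="list-rst__rst-name-target" href="https://example.com/restaurant/2/">Sushi B</a>
</div>
'''

soup = BeautifulSoup(sample_html, 'html.parser')

# Only the <a> tags with the target class are returned
links = soup.find_all('a', class_='list-rst__rst-name-target')

# The link text is the restaurant name, and the href attribute is the page URL
names = [link.get_text(strip=True) for link in links]
urls = [link['href'] for link in links]

print(names)  # ['Sushi A', 'Sushi B']
print(urls)   # ['https://example.com/restaurant/1/', 'https://example.com/restaurant/2/']
```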

In [4]:
# Prepare lists to keep restaurant names and the URLs
restaurant_name_list = []
restaurant_url_list = []

# Send the request and get all of the restaurant names and the URLs from the 60 pages
for i in tqdm(range(1, 61)): 
    url = 'https://tabelog.com/tokyo/rstLst/sushi/' + str(i) + '/?sk=%E5%AF%BF%E5%8F%B8&svd=20210104&svt=1900&svps=2'
    response = requests.get(url)

    # Set a parser
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Get the <a class='list-rst__rst-name-target'> tags as a list
    elements = soup.find_all('a', class_='list-rst__rst-name-target')
    
    # Store the name (link text) and the URL (href) of each restaurant
    for element in elements:
        restaurant_name_list.append(element.contents[0])
        restaurant_url_list.append(element.attrs['href'])

100%|█████████████████████████| 60/60 [02:37<00:00,  2.63s/it]


### 3. Get the average scores of the restaurants, the numbers of the reviews, and the URLs where the reviews are stored

Access each restaurant URL and get the average score, the number of reviews, and the URL where the reviews are stored.
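
One thing to keep in mind: `soup.find` returns `None` when no tag matches, so calling `.contents[0]` on the result would raise an error on a page with a different layout. The run in this notebook completed without that problem, so the following is only a precaution. It is a minimal sketch with a hypothetical helper, `text_or_default`, tried on made-up HTML that reuses the class names from this section:

```python
from bs4 import BeautifulSoup


def text_or_default(soup, name, default='-', **attrs):
    """Return the stripped text of the first matching tag, or a default value.

    Hypothetical helper for illustration; it is not part of BeautifulSoup.
    """
    element = soup.find(name, **attrs)
    if element is None:
        return default
    return element.get_text(strip=True)


# Made-up HTML that imitates a restaurant page without a review tab
sample_html = '''
<span class="rdheader-rating__score-val-dtl">3.28</span>
<em class="num">69</em>
'''
soup = BeautifulSoup(sample_html, 'html.parser')

score = text_or_default(soup, 'span', class_='rdheader-rating__score-val-dtl')
num_reviews = text_or_default(soup, 'em', class_='num')
# The <li id='rdnavi-review'> tag is missing here, so the default is returned
review_tab = text_or_default(soup, 'li', default=None, id='rdnavi-review')

print(score, num_reviews, review_tab)  # 3.28 69 None
```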

In [5]:
# Prepare lists to keep the average scores, the numbers of the reviews, and the review URLs
ave_score_list = []
num_review_list = []
review_url_list = []

# Access each of the restaurant URLs in restaurant_url_list
for res_url in tqdm(restaurant_url_list):
    # Send the request and get the HTML
    response = requests.get(res_url)
    
    # Set a parser
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Get the <span class='rdheader-rating__score-val-dtl'> tag to get the score
    element_score = soup.find('span', class_='rdheader-rating__score-val-dtl')
    ave_score_list.append(element_score.contents[0])
    
    # Get the <em class='num'> tag to get the number of reviews
    element_numReviews = soup.find('em', class_='num')
    num_review_list.append(element_numReviews.contents[0])
    
    # Get the <li id='rdnavi-review'> tag
    element_revURL = soup.find('li', id='rdnavi-review')
    # Get the URL from the <a> tag inside it
    review_url_list.append(element_revURL.a.get('href'))   

100%|█████████████████████| 1200/1200 [40:52<00:00,  2.04s/it]


### 4. Make a restaurant dataframe

Now we have all of the restaurant names, the average scores, and the numbers of the reviews. Make a dataframe with them.

In [6]:
# Create a dataframe having restaurant names, average scores, and the numbers of reviews
df_restaurants = pd.DataFrame({'restaurant_name': restaurant_name_list, 
                              'average_score': ave_score_list,
                              'num_reviews': num_review_list,
                              'restaurant_url': restaurant_url_list})

# Show the number of restaurants we got
print('The number of sushi restaurants:', len(df_restaurants))

# Show the first 20
print('The first 20 rows of df_restaurants')

df_restaurants[:20]

The number of sushi restaurants: 1200
The first 20 rows of df_restaurants


| | restaurant_name | average_score | num_reviews | restaurant_url |
|---|---|---|---|---|
| 0 | 七色てまりうた 新宿 | 3.28 | 69 | https://tabelog.com/tokyo/A1304/A130401/13022043/ |
| 1 | 弥栄 | 3.09 | 16 | https://tabelog.com/tokyo/A1303/A130302/13238158/ |
| 2 | すし尽誠 | 3.5 | 61 | https://tabelog.com/tokyo/A1311/A131101/13233806/ |
| 3 | 恵比寿 鮨 おぎ乃 | 3.36 | 61 | https://tabelog.com/tokyo/A1303/A130302/13223561/ |
| 4 | 鮨 ふくじゅ | 3.43 | 82 | https://tabelog.com/tokyo/A1301/A130101/13228975/ |
| 5 | プレミアムレストラン 東京 金のダイニング 鮪金 | 3.37 | 38 | https://tabelog.com/tokyo/A1301/A130101/13200206/ |
| 6 | 鮨屋 小野 | 3.18 | 9 | https://tabelog.com/tokyo/A1303/A130302/13045244/ |
| 7 | 博多魚助 丸の内店 | 3.22 | 7 | https://tabelog.com/tokyo/A1301/A130102/13227089/ |
| 8 | SUSHI B GINZA | 3.04 | 2 | https://tabelog.com/tokyo/A1301/A130101/13214841/ |
| 9 | 代官山 鮨 たけうち | 3.68 | 60 | https://tabelog.com/tokyo/A1303/A130303/13210936/ |


We got information on 1,200 sushi restaurants! Analyzing this dataframe will show which restaurants are popular and which have high average scores. A "-" in the average_score column means the restaurant has no average score yet, and a "-" in the num_reviews column means it has no reviews yet. If you want, you can also extract the restaurant addresses, areas, contact information, etc. in the same way.
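
Because the scraped scores and review counts are strings, and some of them are "-", they need converting before any ranking. The following is a minimal sketch of one way to do that with `pd.to_numeric`. The rows in `df_demo` are invented for illustration and only mimic the shape of df_restaurants.

```python
import pandas as pd

# Toy rows in the same shape as df_restaurants; the values are invented
df_demo = pd.DataFrame({
    'restaurant_name': ['A', 'B', 'C'],
    'average_score': ['3.28', '-', '3.50'],
    'num_reviews': ['69', '-', '61'],
})

# Turn the scraped strings into numbers; '-' cannot be parsed and becomes NaN
df_demo['average_score'] = pd.to_numeric(df_demo['average_score'], errors='coerce')
df_demo['num_reviews'] = pd.to_numeric(df_demo['num_reviews'], errors='coerce')

# Rank by score; restaurants without a score are left out
ranked = (df_demo.dropna(subset=['average_score'])
                 .sort_values('average_score', ascending=False))
print(ranked)
```

With `errors='coerce'`, the placeholder rows stay in the dataframe as missing values, so they can be counted or dropped explicitly rather than silently breaking a sort.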

Next, keep only the review URLs of the restaurants that have reviews.

In [7]:
# Collect the index and the review URL of each restaurant that has reviews
restaurants_having_reviews = []

for i, review_url in enumerate(review_url_list):
    # URLs containing 'lrvwlst' are treated as review pages
    if 'lrvwlst' in review_url:
        have_reviews = {'restaurant_id': i, 'review_url': review_url}
        restaurants_having_reviews.append(have_reviews)

### 5. Get the reviews of each restaurant

Next, get the reviews of each restaurant. We will collect only up to the 20 latest dinner reviews per restaurant, because collecting every review would take too long.

In [8]:
# Prepare lists to keep the restaurant ID and the review links
restaurant_id_list = []
full_review_url_list = []

# Access each of the review URLs in restaurants_having_reviews
for dic in tqdm(restaurants_having_reviews):
    
    # Get the review URL
    review_url = dic['review_url']
    
    # Get the URL of dinner reviews
    dinner_review_url = review_url + 'COND-2/smp1/?smp=1&lc=0&rvw_part=all'
    
    # Send the request and get the HTML
    response = requests.get(dinner_review_url)
    
    # Set a parser
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Review links
    # Get the <div class='rvw-item'> tags, which hold the review links
    element_reviews = soup.find_all('div', class_='rvw-item')
    
    # Get the full review URL if there are any reviews
    for elm in element_reviews:
        full_review_url = 'https://tabelog.com' + elm.get('data-detail-url')
        full_review_url_list.append(full_review_url)
            
    # Restaurant ID
    # Repeat the restaurant ID once per review found on this page
    restaurant_id_list.extend([dic['restaurant_id']] * len(element_reviews))

100%|█████████████████████| 1189/1189 [43:39<00:00,  2.20s/it]


In [9]:
# Check the number of elements in full_review_url_list
print('The number of reviews we will get:', len(full_review_url_list))

The number of reviews we will get: 15689


Now we have 15,689 links to the full reviews. Fetching each review means sending 15,689 requests to the website, which is a lot. To reduce the burden, I will split the requests into four batches.
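
The batches in the following cells are written out by hand with list slices. As an alternative, here is a minimal sketch of a reusable chunking helper that pauses between batches. `iter_chunks`, `process_in_chunks`, and the fake URLs are all hypothetical and only stand in for the real download step, so the sketch runs offline.

```python
import time


def iter_chunks(items, chunk_size):
    """Yield consecutive slices of items with at most chunk_size elements."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def process_in_chunks(items, chunk_size, handle, pause_seconds=0.0):
    """Apply handle to every item, sleeping between chunks (hypothetical helper)."""
    results = []
    chunks = list(iter_chunks(items, chunk_size))
    for index, chunk in enumerate(chunks):
        results.extend(handle(item) for item in chunk)
        # Pause after every chunk except the last one
        if index < len(chunks) - 1:
            time.sleep(pause_seconds)
    return results


# Stand-in data: 15,689 fake URLs instead of the real review links
fake_urls = [f'https://example.com/review/{i}' for i in range(15689)]
print([len(chunk) for chunk in iter_chunks(fake_urls, 4000)])  # [4000, 4000, 4000, 3689]

# len is a stand-in for the function that would download and parse one review
lengths = process_in_chunks(fake_urls, 4000, handle=len)
print(len(lengths))  # 15689
```

In a real run, `handle` would be a function that fetches one review, and `pause_seconds` would be set to whatever break between batches seems considerate toward the site.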

In [10]:
# Prepare list to keep the reviews
review_list = []

# To avoid a heavy burden, I divide the 15,689 requests into 4 parts
for url in tqdm(full_review_url_list[:4000]):
    # Send the request and get the HTML of reviews
    response_revs = requests.get(url)
    
    # Set a parser
    soup_revs = BeautifulSoup(response_revs.text, 'html.parser')
    
    # Get the <div class="rvw-item__rvw-comment"> tag
    element_review = soup_revs.find('div', class_='rvw-item__rvw-comment')
    # Get the review text from the first <p> inside it
    review_list.append(element_review.p.text.strip())

100%|███████████████████| 4000/4000 [2:07:18<00:00,  1.91s/it]


In [11]:
for url in tqdm(full_review_url_list[4000:8000]):
    # Send the request and get the HTML of reviews
    response_revs = requests.get(url)
    
    # Set a parser
    soup_revs = BeautifulSoup(response_revs.text, 'html.parser')
    
    # Get the <div class="rvw-item__rvw-comment"> tag
    element_review = soup_revs.find('div', class_='rvw-item__rvw-comment')
    # Get the review text from the first <p> inside it
    review_list.append(element_review.p.text.strip())

100%|███████████████████| 4000/4000 [2:04:36<00:00,  1.87s/it]


In [12]:
for url in tqdm(full_review_url_list[8000:12000]):
    # Send the request and get the HTML of reviews
    response_revs = requests.get(url)
    
    # Set a parser
    soup_revs = BeautifulSoup(response_revs.text, 'html.parser')
    
    # Get the <div class="rvw-item__rvw-comment"> tag
    element_review = soup_revs.find('div', class_='rvw-item__rvw-comment')
    # Get the review text from the first <p> inside it
    review_list.append(element_review.p.text.strip())

100%|███████████████████| 4000/4000 [2:04:19<00:00,  1.86s/it]


In [13]:
for url in tqdm(full_review_url_list[12000:]):
    # Send the request and get the HTML of reviews
    response_revs = requests.get(url)
    
    # Set a parser
    soup_revs = BeautifulSoup(response_revs.text, 'html.parser')
    
    # Get the <div class="rvw-item__rvw-comment"> tag
    element_review = soup_revs.find('div', class_='rvw-item__rvw-comment')
    # Get the review text from the first <p> inside it
    review_list.append(element_review.p.text.strip())

100%|███████████████████| 3689/3689 [1:57:44<00:00,  1.92s/it]


In [14]:
# Check the number of reviews
print('The total number of reviews:',len(review_list))

The total number of reviews: 15689


We got 15,689 reviews successfully! Let's make a dataframe.

In [15]:
# Create a dataframe having the restaurant IDs and the reviews
df_reviews = pd.DataFrame({'restaurant_id': restaurant_id_list, 
                           'reviews': review_list})

I cannot show the reviews here because of Tabelog's terms of use. This dataframe has two columns, 'restaurant_id' and 'reviews'. The 'restaurant_id' values are the row indexes of df_restaurants, so by combining the two dataframes you can tell which restaurant each review belongs to.
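
Combining the two dataframes can be sketched as follows. This is one possible approach using `DataFrame.merge`, shown on invented toy data (`df_restaurants_demo` and `df_reviews_demo` are placeholders, and the review texts are dummy strings).

```python
import pandas as pd

# Toy stand-ins for df_restaurants and df_reviews; all values are invented
df_restaurants_demo = pd.DataFrame({'restaurant_name': ['A', 'B', 'C'],
                                    'average_score': ['3.28', '3.09', '3.50']})
df_reviews_demo = pd.DataFrame({'restaurant_id': [0, 0, 2],
                                'reviews': ['review 1', 'review 2', 'review 3']})

# Match restaurant_id against the row index of the restaurant dataframe
df_joined = df_reviews_demo.merge(df_restaurants_demo,
                                  left_on='restaurant_id',
                                  right_index=True,
                                  how='left')
print(df_joined)
```

Each review row now carries the name and score of its restaurant. Restaurant B has no reviews in the toy data, so it does not appear in the joined result.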

In [16]:
# Save the two dataframes as csv files
df_restaurants.to_csv('df_restaurant.csv')
df_reviews.to_csv('df_reviews.csv')

### 6. Summary

We extracted information on 1,200 Tokyo sushi restaurants (names, average scores, and numbers of reviews) and 15,689 reviews by scraping with the requests and BeautifulSoup libraries. This data can be used to analyze which restaurants are popular and have high average scores, what people expect from sushi restaurants in Tokyo, what they like and dislike, and so on. I am planning to analyze this data in a separate project soon!