# <span style="font-family: 'Brush Script MT'; font-size: 100px; color: rgba(255, 0, 0, 0.7)">Capstone Project</span>

<div style="text-align: center; font-family: 'Comic Sans MS'; font-size: 40px; color: rgb(1, 77, 78); font-style: italic;">Restaurant Recommender System</div>

<div style="text-align: right; font-family: 'Comic Sans MS'; color: rgba(255, 0, 0, 0.7)">By Lim Zi Ming</div>

![](images/tokyo_city.jpg)

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import pickle
import h5py
from streamlit_lottie import st_lottie
import json
import os
import requests
from tqdm.notebook import tqdm
from google.cloud import translate_v2 as translate

## Importing Data

In [2]:
# Load the cleaned restaurant dataset
restaurants = pd.read_pickle('data/restaurants_cleaned.pkl')

In [3]:
restaurants = restaurants.sort_values(by='rating_val', ascending=False)
restaurants

Unnamed: 0,store_id,name,nearest_station,nearest_distance,genre,rating_val,review_cnt,save_cnt,budget_dinner,budget_lunch,address,municipalities_1,municipalities_2
10600,13018162,日本橋蛎殻町 すぎた,水天宮前駅,105.0,寿司,4.71,629.0,76295.0,expensive,expensive,東京都中央区日本橋蛎殻町1-33-6 ビューハイツ日本橋 B1F,中央区,日本橋蛎殻町
18145,13136847,新ばし 星野,御成門駅,412.0,日本料理,4.63,193.0,19869.0,expensive,unknown,東京都港区新橋5-31-3,港区,新橋
14202,13124391,松川,六本木一丁目駅,414.0,日本料理,4.63,396.0,44049.0,expensive,expensive,東京都港区赤坂1-11-6 赤坂テラスハウス １階,港区,赤坂
20709,13196420,東麻布 天本,赤羽橋駅,258.0,寿司,4.61,423.0,42946.0,expensive,unknown,東京都港区東麻布1-7-9 ザ・ソノビル 102,港区,東麻布
12301,13249117,アカ,三越前駅,146.0,スペイン料理、モダンスパニッシュ,4.61,263.0,32951.0,expensive,expensive,東京都中央区日本橋室町2-1-1 三井2号館,中央区,日本橋室町
...,...,...,...,...,...,...,...,...,...,...,...,...,...
35823,13016768,蓮玉庵,上野御徒町駅,171.0,そば,3.46,124.0,5291.0,unknown,cheap,東京都台東区上野2-8-7,台東区,上野
58787,13166438,ネパリコ 駒沢店,駒沢大学駅,172.0,ネパール料理、ダイニングバー、カレー（その他）,3.46,102.0,5130.0,average,cheap,東京都世田谷区上馬4-2-6 サンシティー東和101,世田谷区,上馬
15206,13155192,イタリアンダイニング ジリオン,竹芝駅,87.0,イタリアン,3.46,119.0,3825.0,expensive,average,東京都港区海岸1-16-2 ホテル インターコンチネンタル 東京ベイ,港区,海岸
72034,13094042,定食サトウ,神泉駅,105.0,定食・食堂,3.46,82.0,9546.0,cheap,cheap,東京都渋谷区円山町13-9 メゾン若林 B1F,渋谷区,円山町


We sort restaurants by rating because our primary goal is to recommend the best dining experiences to users. High ratings typically indicate positive customer experiences, so the highest-rated restaurants are considered first.

## Model Run

**1. Feature Selection and Processing:**  <u>(Take note that the translation cells have been changed to raw cells to prevent surpassing the permitted usage thresholds of the translation API.)</u>

**Feature Selection**:
In the dataset containing information about various restaurants, we have selected a subset of relevant features that are crucial for our analysis. The chosen columns are:
- `name`: The name of the restaurant.
- `genre`: The type or genre of food the restaurant serves (e.g., Japanese, Chinese, Italian).
- `rating_val`: The rating value of the restaurant, indicating its quality and customer satisfaction.
- `nearest_station`: The closest train or subway station to the restaurant, indicating its accessibility.
- `address`: The complete address of the restaurant, helping identify its exact location.

**Data Processing with Google Cloud Translation AI**: Because we draw on restaurant reviews from Japanese locals, the data is in Japanese. To make it accessible to foreign users visiting Japan, we use Google Cloud Translation AI to translate it into English. The `address` column is kept in Japanese so that translation cannot introduce errors into the address.

**Initialization**: Set up the Google Cloud environment and initialize the translation client.

In [4]:
#pip install --upgrade google-cloud-translate

**Translation Function**:
A function, `google_translate`, is defined to seamlessly convert Japanese text into English using the Google Cloud Translation API.
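
The translation cells are not shown in this notebook, so the following is only a minimal sketch of what such a helper could look like. It assumes the `translate_v2` client imported above and credentials supplied through the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. The `client`, `source`, and `target` parameters are assumptions added here so the function can be exercised without calling the API.

```python
import html


def google_translate(text, client=None, source="ja", target="en"):
    """Translate one string from `source` to `target` (sketch, not the original cell)."""
    # Leave missing or empty values untouched
    if not isinstance(text, str) or not text.strip():
        return text
    if client is None:
        # Imported lazily so the sketch can be tried with a stand-in client
        from google.cloud import translate_v2 as translate
        client = translate.Client()
    result = client.translate(text, source_language=source, target_language=target)
    # The v2 API returns HTML-escaped text by default (e.g. &#39; for an apostrophe)
    return html.unescape(result["translatedText"])
```

Unescaping HTML entities is a design choice of this sketch. Escaped entities such as `&#39;` contain `#` and `;`, which may be one source of the name anomalies filtered out in the cleaning step.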

**Chunk-Based Data Translation using Google Cloud Translation API**:

Given the size of the dataset, translating every row in a single request is neither feasible nor efficient. We therefore adopt a **chunk-based translation approach**: the data is split into chunks of approximately `100` rows, and each chunk is translated in turn, which makes it easier to track progress and manage API usage.
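
The chunking idea can be sketched as follows. This is an illustration under assumptions rather than the original cell: `translate_in_chunks` and its `translate_fn` parameter are hypothetical names, and a stand-in translator (`str.upper`) replaces the real API call so the snippet runs offline.

```python
def translate_in_chunks(values, translate_fn, chunk_size=100):
    """Translate a list of strings chunk by chunk, preserving the original order."""
    translated = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        translated.extend(translate_fn(text) for text in chunk)
        # A real run could save `translated` to disk here as a checkpoint
    return translated


# Usage with a stand-in translator instead of the real API
names = [f"name_{i}" for i in range(250)]
result = translate_in_chunks(names, str.upper)
```

With 250 values and a chunk size of 100, the loop processes three chunks (100, 100, and 50 rows), so a failure partway through loses at most one chunk of work if checkpoints are saved.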

**Further Cleaning Process**:
1. **Lowercasing Columns:** The `name`, `genre`, and `nearest_station` columns of the `translated_restaurants` dataframe have been converted to lowercase.
2. **Row Filtering:** Some translated restaurant names contain anomalies, so any row whose `name` column contains the characters `%`, `#`, or `;` is removed for data consistency and clarity.
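
A minimal sketch of these two cleaning steps is shown below. The three rows are hypothetical stand-ins for `translated_restaurants`, and the call to `reset_index` is an assumption of this sketch.

```python
import pandas as pd

# Toy stand-in for `translated_restaurants`; the rows are hypothetical
translated_restaurants = pd.DataFrame({
    "name": ["Shinbashi Hoshino", "Chef&#39;s Table", "La Vita"],
    "genre": ["Japanese Food", "French", "Italian"],
    "nearest_station": ["Onarimon Station", "Ebisu Station", "Yotsuya Sanchome Station"],
})

# 1. Lowercase the text columns
for col in ["name", "genre", "nearest_station"]:
    translated_restaurants[col] = translated_restaurants[col].str.lower()

# 2. Drop rows whose name contains %, # or ;
has_anomaly = translated_restaurants["name"].str.contains(r"[%#;]", regex=True)
translated_restaurants = translated_restaurants[~has_anomaly].reset_index(drop=True)
```

After both steps only the first and third rows remain, with every text column in lowercase.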


In [5]:
restaurants = pd.read_csv('data/translated_restaurants.csv')

In [6]:
restaurants

Unnamed: 0,name,genre,rating_val,nearest_station,address
0,nihonbashi kakigaracho suita,sushi,4.71,suitengumae station,東京都中央区日本橋蛎殻町1-33-6 ビューハイツ日本橋 B1F
1,shinbashi hoshino,japanese food,4.63,onarimon station,東京都港区新橋5-31-3
2,matsukawa,japanese food,4.63,roppongi itchome station,東京都港区赤坂1-11-6 赤坂テラスハウス １階
3,higashiazabu amamoto,sushi,4.61,akabanebashi station,東京都港区東麻布1-7-9 ザ・ソノビル 102
4,red,"spanish cuisine, modern spanish",4.61,mitsukoshimae station,東京都中央区日本橋室町2-1-1 三井2号館
...,...,...,...,...,...
9995,tamoiyanse,"local cuisine (others), chicken dishes, mizutaki",3.46,shinsen station,東京都渋谷区神泉町10-10 神泉ビル1F
9996,la vita,italian,3.46,yotsuya sanchome station,東京都新宿区四谷3-4-9 臼井ビル　１Ｆ
9997,professional super,cake,3.46,nakamurabashi station,東京都練馬区中村北4-8-30
9998,bambi yotsuya store,"western food, steak, hamburger",3.46,yotsuya station,東京都新宿区四谷1-3 津嶋ビル 1F


**2. Genre Processing:**  
- Each restaurant can have multiple genres, which are separated by the ',' character in the translated data. These genres are split and flattened to create a list of unique genre values.
- The frequency of each genre in the dataset is computed.

In [7]:
# Split genres on commas, flatten the genre list, strip spaces, and count each unique genre
genres = restaurants['genre'].str.split(',').explode().str.strip().value_counts()

**3. One-hot Encoding:**  
- To facilitate the recommendation process, genres are one-hot encoded. This means for each genre, a new column is added to the dataframe. If a restaurant belongs to that genre, it will have a '1' in the corresponding column, otherwise '0'.
- These one-hot encoded genre columns are then added back to the main `restaurants` dataframe.

In [8]:
# One-hot encode the genres
genre_dfs = []
for genre in genres.index:
    genre_df = restaurants['genre'].str.contains(genre, regex=False).astype(int).rename(genre)
    genre_dfs.append(genre_df)   

# Merge the one-hot encoded genre DataFrames with the original dataframe
genres_df = pd.concat(genre_dfs, axis=1)
restaurants = pd.concat([restaurants, genres_df], axis=1)
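
One caveat worth checking: `str.contains` matches substrings, so a genre whose name is contained in a longer genre name would also be flagged. The toy example below contrasts that behaviour with an exact-token encoding built from `pd.get_dummies`. The genre strings are hypothetical, not taken from the dataset, and this is a sketch of an alternative, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical genre strings chosen to show the overlap
toy = pd.DataFrame({"genre": ["ramen", "miso ramen, izakaya", "izakaya"]})

# Substring match, as in the loop above: "ramen" also flags the "miso ramen" row
substring_flag = toy["genre"].str.contains("ramen", regex=False).astype(int)

# Exact-token match: split on commas, strip spaces, then one-hot encode each token
tokens = toy["genre"].str.split(",").explode().str.strip()
exact = pd.get_dummies(tokens).groupby(level=0).max().astype(int)
```

Here `substring_flag` marks the first two rows, while `exact["ramen"]` marks only the first. Whether this matters depends on how many genre names in the real data overlap.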

In [9]:
genres_df.to_csv("data/genres_df.csv", index=False)

**4. Cosine Similarity:**  
Based on the genre features, the cosine similarity between each pair of restaurants is computed. This similarity gives a measure of how alike two restaurants are in terms of their genres.
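
For reference, the standard definition of cosine similarity between two genre vectors $\mathbf{a}$ and $\mathbf{b}$ is given below. Because the one-hot vectors contain only 0s and 1s, it reduces to the number of shared genres divided by the geometric mean of the two genre counts. $G_a$ and $G_b$ are notation introduced here for the sets of genres of each restaurant.

$$
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \frac{|G_a \cap G_b|}{\sqrt{|G_a| \, |G_b|}}
$$

For example, a restaurant tagged with two genres and another tagged with only one of them share one genre, giving $1 / \sqrt{2 \cdot 1} \approx 0.707$.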

In [10]:
# Compute Cosine Similarity based on genre features
cosine_sim = cosine_similarity(genres_df)

**5. Recommendation Function:**  
A function named `recommend_by_genre` is defined to recommend restaurants based on selected genres and, optionally, the nearest station. Here's how it works:
- The function first checks if the input genres exist in the dataset.
- A genre vector is created with a 1 for each selected genre. This vector is divided by the number of selected genres and reshaped into a single row.
- The cosine similarity scores for this genre vector against all restaurants are calculated. Only restaurants with a similarity score of at least 0.3 (the default `min_similarity`) are kept.
- Based on the similarity scores, the top restaurants are selected. If a nearest station is provided, the list is filtered to include only restaurants near that station.
- Finally, the top 10 restaurants are returned, sorted by their ratings.
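
To make the 0.3 threshold concrete, the sketch below scores three toy restaurants against a two-genre query. The four-column genre space and the restaurant vectors are assumptions for illustration, and the `cosine` helper is a hand-written stand-in for `cosine_similarity`.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity of two non-zero 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy genre space, columns: ramen, poultry dishes, izakaya, tsukemen
query = np.array([1.0, 1.0, 0.0, 0.0])   # user selects ramen + poultry dishes
scaled_query = query / 2                  # the division by len(input_genres)

ramen_only = np.array([1.0, 0.0, 0.0, 0.0])
ramen_plus_two = np.array([1.0, 0.0, 1.0, 1.0])
izakaya_only = np.array([0.0, 0.0, 1.0, 0.0])

scores = {
    "ramen_only": cosine(scaled_query, ramen_only),          # 1/sqrt(2), about 0.707
    "ramen_plus_two": cosine(scaled_query, ramen_plus_two),  # 1/sqrt(6), about 0.408
    "izakaya_only": cosine(scaled_query, izakaya_only),      # 0.0
}
kept = [name for name, score in scores.items() if score >= 0.3]
```

Two observations follow. Cosine similarity ignores vector length, so dividing the query by the number of genres does not change any score. And with two selected genres, a restaurant matching only one of them still passes the threshold, which is why the results contain restaurants serving either genre rather than both.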

In [11]:
# Function to recommend restaurants based on selected genres and optionally, a nearest station
def recommend_by_genre(input_genres, nearest_station=None, cosine_sim=cosine_sim, min_similarity=0.3):
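    # Guard against an empty selection, which would otherwise divide by zero below
    if not input_genres:
        return "Please select at least one genre"

    # Check if all selected genres exist in the dataset
    for genre in input_genres:
        if genre not in genres_df.columns:
            return f"Genre {genre} not found in the dataset"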

    # Create a combined genre vector for all selected genres
    genre_vector = np.zeros(genres_df.shape[1])
    for genre in input_genres:
        genre_index = list(genres_df.columns).index(genre)
        genre_vector[genre_index] = 1

    # Normalize the genre vector and reshape
    genre_vector /= len(input_genres)
    genre_vector = genre_vector.reshape(1, -1)

    # Compute the cosine similarity scores for the genre vector against all restaurants
    sim_scores = cosine_similarity(genre_vector, genres_df.values)
    sim_scores = [x for x in enumerate(sim_scores[0]) if x[1] >= min_similarity]
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # If a station is selected, filter by that station
    if nearest_station:
        station_filtered_indices = [i[0] for i in sim_scores if restaurants.iloc[i[0]]['nearest_station'] == nearest_station]
        top_10_indices = station_filtered_indices[:10]
    else:
        top_10_indices = [i[0] for i in sim_scores[:10]]

    # Get the top 10 restaurants and sort them by rating
    top_10_restaurants = restaurants.iloc[top_10_indices]
    sorted_restaurants = top_10_restaurants.sort_values(by='rating_val', ascending=False)
    return sorted_restaurants[['name', 'rating_val', 'nearest_station']]

This process recommends restaurants that match a user's genre preferences and, optionally, are located near a particular station.

---

Displaying all the genres and train stations available for the recommender system.

In [12]:
# Adjust display settings
pd.set_option('display.max_rows', None)

restaurants['genre'].str.split(',').explode().str.strip().value_counts()

ramen                                           1469
izakaya                                         1361
cafe                                             860
italian                                          704
tsukemen                                         673
french                                           537
japanese cuisine                                 496
wine bar                                         438
sushi                                            424
bread                                            418
soba                                             401
bistro                                           371
cake                                             358
seafood dishes                                   342
curry rice                                       337
yakiniku                                         335
chinese food                                     333
yakitori                                         304
western food                                  

In [13]:
restaurants['nearest_station'].value_counts()

ebisu station                                         254
ikebukuro station                                     199
shibuya station                                       193
roppongi station                                      185
shimbashi station                                     180
ginza station                                         167
tokyo station                                         158
iidabashi station                                     145
omotesando station                                    137
hiroo station                                         132
shinjuku sanchome station                             125
asakusa (tsukuba exp) station                         122
jimbocho station                                      121
kichijoji station                                     115
akasaka station                                       111
nakameguro station                                    109
ningyocho station                                     105
azabujuban sta

In [14]:
pd.reset_option('display.max_rows')

### Testing the Recommendation Function

To ensure that the `recommend_by_genre` function works as expected, a test is conducted using specific user preferences. Feel free to select any combination of genres and pair it with any one of the nearest stations listed above.

**Test Criteria:**  
- **Selected Genres:** `ramen` and `poultry dishes`
- **Nearest Station:** `ebisu station`

By running this test, the function should return <u>up to the top 10 restaurants</u> that serve either Ramen or Poultry Dishes and are located near the Ebisu Station. These restaurants are further sorted by their ratings to give the user the best recommendations based on their criteria.

In [15]:
# Test the recommendation function
recommend_by_genre(['ramen', 'poultry dishes'], nearest_station='ebisu station')

Unnamed: 0,name,rating_val,nearest_station
738,ozeki chinese noodles,3.75,ebisu station
1157,hand-made chicken chinese noodles ayagawa,3.72,ebisu station
1896,miso ramen kakitagawa hibari ebisu main store,3.67,ebisu station
2142,tether,3.65,ebisu station
2362,tsukumo ramen ebisu main store,3.64,ebisu station
2707,mentei shimada,3.62,ebisu station
3230,shionuki,3.6,ebisu station
3155,kouyu ramen chorori ebisu,3.6,ebisu station
4068,shuichi ebisu,3.56,ebisu station
5453,afuri ebisu,3.52,ebisu station


![](images/screenshot3.png)

---

**Test Criteria:**  
- **Selected Genres:** `ramen`
- **Nearest Station:** `zoshiki station`

Here we request ramen restaurants near Zoshiki Station, a lesser-known station, so only one recommendation is returned.

In [16]:
# Test the recommendation function
recommend_by_genre(['ramen'], nearest_station='zoshiki station')

Unnamed: 0,name,rating_val,nearest_station
154,shinjiko shijimi chinese soba kohaku tokyo mai...,3.96,zoshiki station


![](images/screenshot2.png)

---

**Test Criteria:**  
- **Selected Genres:** `spanish cuisine`
- **Nearest Station:** `kitano station`

This test shows that the function returns an empty result rather than arbitrary recommendations when nothing matches. Here, it correctly identified the absence of Spanish cuisine near Kitano Station.

In [17]:
# Test the recommendation function
recommend_by_genre(['spanish cuisine'], nearest_station='kitano station')

Unnamed: 0,name,rating_val,nearest_station


![](images/screenshot1.png)

---

By continually testing with different criteria, the reliability and accuracy of the recommendation system can be ensured.

## Exporting Files

In [18]:
# Creating Folder 'StreamLit'
if not os.path.exists('StreamLit'):
    os.makedirs('StreamLit')

In [19]:
#Export
with h5py.File('StreamLit/cosine_sim.h5', 'w') as hf:
    hf.create_dataset("cosine_similarity", data=cosine_sim)

restaurants.to_csv('StreamLit/restaurants_model.csv', index=False)

In [20]:
# File path
file_path = 'StreamLit/cosine_sim.h5'

# Open the file in read mode
with h5py.File(file_path, 'r') as file:
    # Print the names of the datasets in the file
    print("Keys: %s" % file.keys())
    
    # Access the 'cosine_similarity' dataset
    cosine_sim_dataset = file['cosine_similarity']
    
    # Print the shape of the dataset
    print("Shape of the dataset:", cosine_sim_dataset.shape)
    
    # Print the data type of the dataset
    print("Data type:", cosine_sim_dataset.dtype)
    
    # Check and print if the matrix is sparse
    if isinstance(cosine_sim_dataset, h5py.Group) and 'data' in cosine_sim_dataset and 'indices' in cosine_sim_dataset:
        print("The matrix is stored in a sparse format.")
    else:
        print("The matrix is stored in a dense format.")

# Get and print the file size
file_size = os.path.getsize(file_path) / (1024 * 1024)  # Convert from bytes to megabytes
print(f"File size: {file_size:.2f} MB")

Keys: <KeysViewHDF5 ['cosine_similarity']>
Shape of the dataset: (10000, 10000)
Data type: float64
The matrix is stored in a dense format.
File size: 762.94 MB


### File Size Considerations for GitHub

Upon inspection of our `cosine_similarity` dataset within the HDF5 file, we found that the file size is approximately `762.94 MB`, which exceeds GitHub's 100 MB per-file limit.

To accommodate this, we'll be compressing the dataset using the GZIP compression algorithm within HDF5. This approach should effectively reduce the file size, making it suitable for hosting on GitHub.

By using this method, we aim to ensure that the data retains its integrity while making it more accessible for collaboration and version control on GitHub.
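
The reported size follows directly from the matrix shape and dtype, as the quick check below shows. The `float32` figure is included only as an assumption about what a down-cast would cost in raw storage; the notebook itself keeps `float64` and relies on GZIP.

```python
n_restaurants = 10_000
bytes_per_value = {"float64": 8, "float32": 4}


def dense_size_mb(n, dtype):
    """Raw size of an n x n dense matrix in MB (1 MB = 1024 * 1024 bytes)."""
    return n * n * bytes_per_value[dtype] / (1024 * 1024)


size_f64 = dense_size_mb(n_restaurants, "float64")  # about 762.94
size_f32 = dense_size_mb(n_restaurants, "float32")  # about 381.47
```

The `float64` value matches the file size printed above, which confirms that almost all of the file is raw matrix data. How far GZIP shrinks it depends on how repetitive the similarity values are, so the compressed size is worth checking after the next cell runs.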

In [21]:
# Compress the cosine similarity matrix and save it
with h5py.File('StreamLit/compressed_cosine_sim.h5', 'w') as file:
    file.create_dataset('cosine_similarity', data=cosine_sim, compression="gzip")

# Delete the original, uncompressed file
os.remove('StreamLit/cosine_sim.h5')

## Saving files for Streamlit

### Lottie Animation Integration

Lottie is a popular animation library that lets you run animations in real-time, making them appear more live and dynamic than traditional GIFs or video formats. These animations are often driven by JSON files.

1. **Fetching the Lottie Animation**: We start by defining a function, `load_lottie_url`, that retrieves a Lottie animation in the form of a JSON using a specified URL. If the request is successful, it returns the JSON data; otherwise, it returns `None`.

2. **Storing the Lottie Animation**: Once we have the JSON data for the Lottie animation, we might want to save it locally for various reasons such as offline access or embedding it within our application. The function `save_lottie_to_file` is designed to save the Lottie JSON data to a file within the `StreamLit` directory.

3. **Implementation**: After defining our helper functions, we specify a Lottie animation URL and fetch its JSON data. If the fetch is successful, the Lottie JSON data is saved using the previously defined function, making it available for local access or embedding within the application.

In [22]:
# Function to load animation from Lottie
def load_lottie_url(url: str):
    r = requests.get(url)
    if r.status_code != 200:
        return None
    return r.json()

# Function to save Lottie JSON to a specified file within the 'StreamLit' directory
def save_lottie_to_file(lottie_data, filename='lottie_data.json'):
    filepath = os.path.join('StreamLit', filename)
    
    with open(filepath, 'w') as file:
        json.dump(lottie_data, file)

# Fetch the Lottie animation JSON from its URL
lottie_url = "https://lottie.host/5870e9fb-3411-4d1b-9eb9-53bf39503d08/wQBcaepzHp.json"
lottie_json = load_lottie_url(lottie_url)

# Save the Lottie JSON data
if lottie_json:
    save_lottie_to_file(lottie_json)

### Geocoding Utility

The ability to convert an address into geographical coordinates (latitude and longitude) is crucial for many location-based applications. This script provides a geocoding utility using an external API service.

**Functionality Breakdown**:

1. **API Configuration**: The geocoding service used here is provided by `geloky.com`. The function `get_coordinates_from_address` is responsible for making requests to this service. An API key is necessary for authenticating the requests.

2. **Default Coordinates**: As a fallback, if the geocoding service cannot resolve an address or the API call fails, the function returns default coordinates for Tokyo, so it always returns a pair of coordinates. A result equal to these defaults also signals that the service was unable to process the given address, which lets us evaluate how many addresses were geocoded correctly.

3. **API Call and Data Parsing**: When the function is called with an address:
    - It constructs the API request URL.
    - Makes an HTTP GET request to the service.
    - Parses the JSON response to extract the latitude and longitude.
    
4. **Error Handling**: If the API doesn't return successful results, or if the address doesn't match any known locations in the geocoding service's database, the function prints a relevant message and falls back to the default Tokyo coordinates.

**Usage**: This script can be integrated into applications where geographical data is required. It abstracts the complexities of interacting with the geocoding API and provides a straightforward method to get coordinates from an address.
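
Because failed lookups all collapse to the same default point, the share of fallback results can serve as a rough quality check. The sketch below assumes the same default coordinates as the script. The `is_fallback` helper and the sample coordinates are hypothetical.

```python
# Default coordinates used by the geocoding script as its fallback
TOKYO_LAT, TOKYO_LONG = 35.6895, 139.6917


def is_fallback(lat, lon, tol=1e-9):
    """Return True if a coordinate pair equals the default fallback point."""
    return abs(lat - TOKYO_LAT) < tol and abs(lon - TOKYO_LONG) < tol


# Hypothetical geocoding results for three addresses
coords = [(35.6467, 139.7101), (35.6895, 139.6917), (35.6595, 139.7005)]
fallback_rate = sum(is_fallback(lat, lon) for lat, lon in coords) / len(coords)
```

In this toy sample one of three results is a fallback. A high rate on real data would suggest that the service struggles with the address format.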

In [23]:
%%writefile StreamLit/geocoding.py

import streamlit as st
import requests

def get_coordinates_from_address(address):
    API_KEY = st.secrets["API_KEY"]
    url = "https://geloky.com/api/geo/geocode"
    # Pass the query as params so that requests URL-encodes the address safely
    params = {"address": address, "key": API_KEY, "format": "geloky"}

    TOKYO_LAT = 35.6895
    TOKYO_LONG = 139.6917

    response = requests.get(url, params=params)

    if response.status_code == 200:
        data = response.json()

        # Check if the list is empty
        if not data:
            print(f"No matching results for address: {address}")
            return TOKYO_LAT, TOKYO_LONG  # Default Tokyo coordinates

        result = data[0]  # Access the first dictionary in the list
        latitude = result["latitude"]
        longitude = result["longitude"]
        return latitude, longitude
    else:
        print(f"Request failed with status code {response.status_code}")
        return TOKYO_LAT, TOKYO_LONG  # Default Tokyo coordinates in case of an API failure

Overwriting StreamLit/geocoding.py


## Deployment on Streamlit

In [24]:
%%writefile StreamLit/recommender_app_genre.py

import streamlit as st
from streamlit_lottie import st_lottie
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import boto3
import io
import s3fs
import h5py
import requests
import json
from geocoding import get_coordinates_from_address


# Load pre-computed cosine similarities and the restaurant dataset
with h5py.File('compressed_cosine_sim.h5', 'r') as file:
    cosine_sim = file['cosine_similarity'][:]

restaurants = pd.read_csv('restaurants_model.csv')


# Drop unnecessary columns to leave only genre features
genres_df = restaurants.drop(columns=['name', 'genre', 'rating_val', 'nearest_station', 'address'])


# Load the Lottie JSON data from the local file
with open('lottie_data.json', 'r') as file:
    lottie_data_from_file = json.load(file)

    
# Display the Lottie animation using the loaded data
st_lottie(lottie_data_from_file, speed=1, width=800, height=500, key="initial")


# Function to recommend restaurants based on selected genres and optionally, a nearest station
def recommend_by_genre(input_genres, nearest_station=None, cosine_sim=cosine_sim, min_similarity=0.3):
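    # Guard against an empty selection, which would otherwise divide by zero below
    if not input_genres:
        return "Please select at least one genre"

    # Check if all selected genres exist in the dataset
    for genre in input_genres:
        if genre not in genres_df.columns:
            return f"Genre {genre} not found in the dataset"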

    # Create a combined genre vector for all selected genres
    genre_vector = np.zeros(genres_df.shape[1])
    for genre in input_genres:
        genre_index = list(genres_df.columns).index(genre)
        genre_vector[genre_index] = 1

    # Normalize the genre vector and reshape
    genre_vector /= len(input_genres)
    genre_vector = genre_vector.reshape(1, -1)

    # Compute the cosine similarity scores for the genre vector against all restaurants
    sim_scores = cosine_similarity(genre_vector, genres_df.values)
    sim_scores = [x for x in enumerate(sim_scores[0]) if x[1] >= min_similarity]
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # If a station is selected, filter by that station
    if nearest_station:
        station_filtered_indices = [i[0] for i in sim_scores if restaurants.iloc[i[0]]['nearest_station'] == nearest_station]
        top_10_indices = station_filtered_indices[:10]
    else:
        top_10_indices = [i[0] for i in sim_scores[:10]]

    # Get the top 10 restaurants and sort them by rating
    top_10_restaurants = restaurants.iloc[top_10_indices]
    sorted_restaurants = top_10_restaurants.sort_values(by='rating_val', ascending=False)
    return sorted_restaurants[['name', 'rating_val', 'nearest_station']]
    

# Streamlit UI: Select genres and a station
selected_genres = st.multiselect(
    "Select genres to get restaurant recommendations:", genres_df.columns
)
selected_station = st.selectbox("Choose a nearby station:", restaurants['nearest_station'].unique())


# Button to get recommendations
if st.button("Recommend"):
    recommended_data = recommend_by_genre(selected_genres, selected_station)
    
    # Check if the result is a string (indicating an error message) or an empty dataframe
    if isinstance(recommended_data, str):
        st.write(recommended_data)
    elif recommended_data.empty:
        st.write("No restaurants found based on the selected criteria. Please modify your selection.")
    else:
        recommended_data = recommended_data.rename(columns={"name": "Name", "rating_val": "Rating", "nearest_station": "Nearest Station"})
        
        # Ensure that 'address' column is present in recommended_data
        recommended_data = pd.merge(recommended_data, restaurants[['name', 'nearest_station', 'address']], 
                                left_on=['Name', 'Nearest Station'], 
                                right_on=['name', 'nearest_station'], 
                                how='left').drop_duplicates()

        # Geocode the addresses of the recommended restaurants only
        recommended_data['latitude'], recommended_data['longitude'] = zip(*recommended_data['address'].map(get_coordinates_from_address))

        st.table(recommended_data[['Name', 'Rating', 'Nearest Station']])

        # Display map
        st.map(recommended_data[['latitude', 'longitude']])

Overwriting StreamLit/recommender_app_genre.py


[Click here to access Tokyo Restaurants Recommender Streamlit](https://tokyo-restaurants.streamlit.app/)