In [1]:
#Importing Libraries
import nltk
nltk.download('stopwords')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import r2_score
import warnings
# Suppress library warnings for cleaner notebook output
warnings.filterwarnings('ignore')
import re
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\saksh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
df = pd.read_csv("zomato.csv")
df.head()

Unnamed: 0,url,address,name,online_order,book_table,rate,votes,phone,location,rest_type,dish_liked,cuisines,approx_cost(for two people),reviews_list,menu_item,listed_in(type),listed_in(city)
0,https://www.zomato.com/bangalore/jalsa-banasha...,"942, 21st Main Road, 2nd Stage, Banashankari, ...",Jalsa,Yes,Yes,4.1/5,775,080 42297555\r\n+91 9743772233,Banashankari,Casual Dining,"Pasta, Lunch Buffet, Masala Papad, Paneer Laja...","North Indian, Mughlai, Chinese",800,"[('Rated 4.0', 'RATED\n A beautiful place to ...",[],Buffet,Banashankari
1,https://www.zomato.com/bangalore/spice-elephan...,"2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ...",Spice Elephant,Yes,No,4.1/5,787,080 41714161,Banashankari,Casual Dining,"Momos, Lunch Buffet, Chocolate Nirvana, Thai G...","Chinese, North Indian, Thai",800,"[('Rated 4.0', 'RATED\n Had been here for din...",[],Buffet,Banashankari
2,https://www.zomato.com/SanchurroBangalore?cont...,"1112, Next to KIMS Medical College, 17th Cross...",San Churro Cafe,Yes,No,3.8/5,918,+91 9663487993,Banashankari,"Cafe, Casual Dining","Churros, Cannelloni, Minestrone Soup, Hot Choc...","Cafe, Mexican, Italian",800,"[('Rated 3.0', ""RATED\n Ambience is not that ...",[],Buffet,Banashankari
3,https://www.zomato.com/bangalore/addhuri-udupi...,"1st Floor, Annakuteera, 3rd Stage, Banashankar...",Addhuri Udupi Bhojana,No,No,3.7/5,88,+91 9620009302,Banashankari,Quick Bites,Masala Dosa,"South Indian, North Indian",300,"[('Rated 4.0', ""RATED\n Great food and proper...",[],Buffet,Banashankari
4,https://www.zomato.com/bangalore/grand-village...,"10, 3rd Floor, Lakshmi Associates, Gandhi Baza...",Grand Village,No,No,3.8/5,166,+91 8026612447\r\n+91 9901210005,Basavanagudi,Casual Dining,"Panipuri, Gol Gappe","North Indian, Rajasthani",600,"[('Rated 4.0', 'RATED\n Very good restaurant ...",[],Buffet,Banashankari


Here is a brief explanation of each column in the dataset:

1. url: The web address for the restaurant's page.

2. address: The physical address of the restaurant.

3. name: The name of the restaurant.

4. online_order: Whether the restaurant accepts online orders (Yes/No).

5. book_table: Whether table booking is available (Yes/No).

6. rate: The restaurant's rating, stored as a string out of 5 (e.g., '4.1/5'); some rows hold 'NEW' or '-' instead of a score.

7. votes: The number of votes or reviews the restaurant has received.

8. phone: The contact phone number for the restaurant.

9. location: The general area or neighborhood where the restaurant is located.

10. rest_type: The type of restaurant (e.g., Casual Dining, Cafe).

11. dish_liked: Popular dishes at the restaurant.

12. cuisines: Types of cuisines offered (e.g., Italian, Chinese).

13. approx_cost(for two people): The approximate cost for two people dining at the restaurant.

14. reviews_list: A list of reviews for the restaurant.

15. menu_item: Items available on the restaurant's menu.

16. listed_in(type): The listing category (e.g., Buffet).

17. listed_in(city): The locality under which the restaurant is listed (e.g., Banashankari).

Here's a structured breakdown of the notebook for handling and analyzing the restaurant dataset, focusing on loading the data, cleaning it, performing text preprocessing, and building a recommendation system:

1. Loading the Dataset
Objective: Load the data and import necessary libraries.
Steps:
Import essential libraries (e.g., pandas, numpy, matplotlib, seaborn, nltk, sklearn).
Load the dataset into a DataFrame using pandas.

2. Data Cleaning
Objective: Clean the data to ensure it is ready for analysis.
Steps:
Delete Redundant Columns: Remove columns that are not needed for analysis.
Rename Columns: Rename columns to make them more readable and consistent.
Drop Duplicates: Remove duplicate rows to ensure each entry is unique.
Clean Individual Columns: Clean specific columns to remove inconsistencies and standardize formats.
Remove NaN Values: Handle missing values by either filling them in or removing the rows/columns.

3. Text Preprocessing
Objective: Preprocess the text data, particularly reviews, for further analysis or modeling.
Steps:
Clean Unnecessary Words in Reviews: Remove stop words and other irrelevant words.
Remove Links and Other Unnecessary Items: Strip out URLs, HTML tags, and other extraneous content.
Remove Symbols: Remove special characters and symbols to clean the text.

4. Some Transformations
Objective: Perform additional transformations on the data as needed.
Steps:
This can include encoding categorical variables, normalizing numerical data, or other data transformations to prepare for modeling.

5. Recommendation System
Objective: Build a recommendation system based on the cleaned and preprocessed data.
Steps:
Use content-based filtering: represent each restaurant's cleaned reviews as TF-IDF vectors.
Compute pairwise cosine similarity between the vectors and return the most similar restaurants.
Inspect the recommendations for a sample restaurant.

In [3]:
df.shape

(51717, 17)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51717 entries, 0 to 51716
Data columns (total 17 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   url                          51717 non-null  object
 1   address                      51717 non-null  object
 2   name                         51717 non-null  object
 3   online_order                 51717 non-null  object
 4   book_table                   51717 non-null  object
 5   rate                         43942 non-null  object
 6   votes                        51717 non-null  int64 
 7   phone                        50509 non-null  object
 8   location                     51696 non-null  object
 9   rest_type                    51490 non-null  object
 10  dish_liked                   23639 non-null  object
 11  cuisines                     51672 non-null  object
 12  approx_cost(for two people)  51371 non-null  object
 13  reviews_list                 51

In [5]:
# Delete redundant columns
zomato = df.drop(['url', 'dish_liked', 'phone', 'address', 'menu_item'], axis=1)

In [6]:
#Removing the Duplicates
zomato.duplicated().sum()
zomato.drop_duplicates(inplace=True)

In [7]:
# Count missing values per column
zomato.isnull().sum()

name                              0
online_order                      0
book_table                        0
rate                           7757
votes                             0
location                         21
rest_type                       227
cuisines                         45
approx_cost(for two people)     345
reviews_list                      0
listed_in(type)                   0
listed_in(city)                   0
dtype: int64

In [8]:
zomato.dropna(how='any',inplace=True)

In [9]:
zomato.columns

Index(['name', 'online_order', 'book_table', 'rate', 'votes', 'location',
       'rest_type', 'cuisines', 'approx_cost(for two people)', 'reviews_list',
       'listed_in(type)', 'listed_in(city)'],
      dtype='object')

In [10]:
# Rename columns to shorter names
zomato = zomato.rename(columns={'approx_cost(for two people)':'cost','listed_in(type)':'type',
                                  'listed_in(city)':'city'})
zomato.columns

Index(['name', 'online_order', 'book_table', 'rate', 'votes', 'location',
       'rest_type', 'cuisines', 'cost', 'reviews_list', 'type', 'city'],
      dtype='object')

In [11]:
zomato['cost'].unique()

array(['800', '300', '600', '700', '550', '500', '450', '650', '400',
       '900', '200', '750', '150', '850', '100', '1,200', '350', '250',
       '950', '1,000', '1,500', '1,300', '199', '80', '1,100', '160',
       '1,600', '230', '130', '1,700', '1,400', '1,350', '2,200', '2,000',
       '1,800', '1,900', '180', '330', '2,500', '2,100', '3,000', '2,800',
       '3,400', '50', '40', '1,250', '3,500', '4,000', '2,400', '2,600',
       '1,450', '70', '3,200', '560', '240', '360', '6,000', '1,050',
       '2,300', '4,100', '120', '5,000', '3,700', '1,650', '2,700',
       '4,500'], dtype=object)

In [12]:
# Remove commas and convert to float
zomato['cost'] = zomato['cost'].str.replace(',', '').astype(float)
zomato['cost'].unique()

array([ 800.,  300.,  600.,  700.,  550.,  500.,  450.,  650.,  400.,
        900.,  200.,  750.,  150.,  850.,  100., 1200.,  350.,  250.,
        950., 1000., 1500., 1300.,  199.,   80., 1100.,  160., 1600.,
        230.,  130., 1700., 1400., 1350., 2200., 2000., 1800., 1900.,
        180.,  330., 2500., 2100., 3000., 2800., 3400.,   50.,   40.,
       1250., 3500., 4000., 2400., 2600., 1450.,   70., 3200.,  560.,
        240.,  360., 6000., 1050., 2300., 4100.,  120., 5000., 3700.,
       1650., 2700., 4500.])

In [13]:
# Convert votes to integer
zomato['votes'] = zomato['votes'].astype(int)

In [14]:
zomato['rate'].unique()

array(['4.1/5', '3.8/5', '3.7/5', '3.6/5', '4.6/5', '4.0/5', '4.2/5',
       '3.9/5', '3.1/5', '3.0/5', '3.2/5', '3.3/5', '2.8/5', '4.4/5',
       '4.3/5', 'NEW', '2.9/5', '3.5/5', '2.6/5', '3.8 /5', '3.4/5',
       '4.5/5', '2.5/5', '2.7/5', '4.7/5', '2.4/5', '2.2/5', '2.3/5',
       '3.4 /5', '-', '3.6 /5', '4.8/5', '3.9 /5', '4.2 /5', '4.0 /5',
       '4.1 /5', '3.7 /5', '3.1 /5', '2.9 /5', '3.3 /5', '2.8 /5',
       '3.5 /5', '2.7 /5', '2.5 /5', '3.2 /5', '2.6 /5', '4.5 /5',
       '4.3 /5', '4.4 /5', '4.9/5', '2.1/5', '2.0/5', '1.8/5', '4.6 /5',
       '4.9 /5', '3.0 /5', '4.8 /5', '2.3 /5', '4.7 /5', '2.4 /5',
       '2.1 /5', '2.2 /5', '2.0 /5', '1.8 /5'], dtype=object)

In [15]:
# Remove '/5', treat 'NEW' and '-' as missing, and convert to float
zomato['rate'] = zomato['rate'].str.replace('/5', '', regex=False).str.strip().replace(['NEW', '-'], np.nan).astype(float)
zomato['rate'].unique()

array([4.1, 3.8, 3.7, 3.6, 4.6, 4. , 4.2, 3.9, 3.1, 3. , 3.2, 3.3, 2.8,
       4.4, 4.3, nan, 2.9, 3.5, 2.6, 3.4, 4.5, 2.5, 2.7, 4.7, 2.4, 2.2,
       2.3, 4.8, 4.9, 2.1, 2. , 1.8])
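
A minimal alternative sketch of the same cleaning step, assuming the raw values look like the ones listed above ('4.1/5', '3.8 /5', 'NEW', '-'). It relies on `pd.to_numeric(..., errors='coerce')`, so any non-numeric token becomes NaN without being listed by hand. The small Series `raw_rate` is hypothetical and only stands in for `zomato['rate']`.

```python
import pandas as pd

# Hypothetical sample mimicking the raw 'rate' values shown above
raw_rate = pd.Series(['4.1/5', '3.8 /5', 'NEW', '-', '2.0 /5'])

# Drop the '/5' suffix literally (regex=False), trim whitespace,
# then coerce anything non-numeric ('NEW', '-') to NaN
clean_rate = pd.to_numeric(
    raw_rate.str.replace('/5', '', regex=False).str.strip(),
    errors='coerce',
)
print(clean_rate.tolist())
```

The coerced NaN values can then be dropped with `dropna`, as the notebook does a few cells later.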

In [16]:
# Standardize columns
zomato['location'] = zomato['location'].str.strip().str.title()
zomato['rest_type'] = zomato['rest_type'].str.strip().str.title()
zomato['type'] = zomato['type'].str.strip().str.title()
zomato['city'] = zomato['city'].str.strip().str.title()
zomato['online_order'] = zomato['online_order'].map({'Yes': True, 'No': False})
zomato['book_table'] = zomato['book_table'].map({'Yes': True, 'No': False})

In [17]:
zomato['cuisines']

0                     North Indian, Mughlai, Chinese
1                        Chinese, North Indian, Thai
2                             Cafe, Mexican, Italian
3                         South Indian, North Indian
4                           North Indian, Rajasthani
                            ...                     
51709                      North Indian, Continental
51711    Andhra, South Indian, Chinese, North Indian
51712                                    Continental
51715                                    Finger Food
51716         Finger Food, North Indian, Continental
Name: cuisines, Length: 43480, dtype: object

In [18]:
# Split cuisines and strip whitespace
zomato['cuisines'] = zomato['cuisines'].apply(lambda x: [cuisine.strip() for cuisine in x.split(',')])
zomato['cuisines']

0                     [North Indian, Mughlai, Chinese]
1                        [Chinese, North Indian, Thai]
2                             [Cafe, Mexican, Italian]
3                         [South Indian, North Indian]
4                           [North Indian, Rajasthani]
                             ...                      
51709                      [North Indian, Continental]
51711    [Andhra, South Indian, Chinese, North Indian]
51712                                    [Continental]
51715                                    [Finger Food]
51716         [Finger Food, North Indian, Continental]
Name: cuisines, Length: 43480, dtype: object

In [19]:
# Drop the rows whose rate became NaN after cleaning ('NEW' or '-')
zomato.isnull().sum()
zomato.dropna(how='any',inplace=True)

In [20]:
# Computing Mean Rating
zomato['Mean Rating'] = zomato.groupby('name')['rate'].transform('mean')
zomato.head()

Unnamed: 0,name,online_order,book_table,rate,votes,location,rest_type,cuisines,cost,reviews_list,type,city,Mean Rating
0,Jalsa,True,True,4.1,775,Banashankari,Casual Dining,"[North Indian, Mughlai, Chinese]",800.0,"[('Rated 4.0', 'RATED\n A beautiful place to ...",Buffet,Banashankari,4.118182
1,Spice Elephant,True,False,4.1,787,Banashankari,Casual Dining,"[Chinese, North Indian, Thai]",800.0,"[('Rated 4.0', 'RATED\n Had been here for din...",Buffet,Banashankari,4.1
2,San Churro Cafe,True,False,3.8,918,Banashankari,"Cafe, Casual Dining","[Cafe, Mexican, Italian]",800.0,"[('Rated 3.0', ""RATED\n Ambience is not that ...",Buffet,Banashankari,3.8
3,Addhuri Udupi Bhojana,False,False,3.7,88,Banashankari,Quick Bites,"[South Indian, North Indian]",300.0,"[('Rated 4.0', ""RATED\n Great food and proper...",Buffet,Banashankari,3.7
4,Grand Village,False,False,3.8,166,Basavanagudi,Casual Dining,"[North Indian, Rajasthani]",600.0,"[('Rated 4.0', 'RATED\n Very good restaurant ...",Buffet,Banashankari,3.8


In [21]:
# Scaling Mean Rating to range (1, 5)
scaler = MinMaxScaler(feature_range=(1, 5))
zomato[['Mean Rating']] = scaler.fit_transform(zomato[['Mean Rating']]).round(2)
zomato.head()

Unnamed: 0,name,online_order,book_table,rate,votes,location,rest_type,cuisines,cost,reviews_list,type,city,Mean Rating
0,Jalsa,True,True,4.1,775,Banashankari,Casual Dining,"[North Indian, Mughlai, Chinese]",800.0,"[('Rated 4.0', 'RATED\n A beautiful place to ...",Buffet,Banashankari,3.99
1,Spice Elephant,True,False,4.1,787,Banashankari,Casual Dining,"[Chinese, North Indian, Thai]",800.0,"[('Rated 4.0', 'RATED\n Had been here for din...",Buffet,Banashankari,3.97
2,San Churro Cafe,True,False,3.8,918,Banashankari,"Cafe, Casual Dining","[Cafe, Mexican, Italian]",800.0,"[('Rated 3.0', ""RATED\n Ambience is not that ...",Buffet,Banashankari,3.58
3,Addhuri Udupi Bhojana,False,False,3.7,88,Banashankari,Quick Bites,"[South Indian, North Indian]",300.0,"[('Rated 4.0', ""RATED\n Great food and proper...",Buffet,Banashankari,3.45
4,Grand Village,False,False,3.8,166,Basavanagudi,Casual Dining,"[North Indian, Rajasthani]",600.0,"[('Rated 4.0', 'RATED\n Very good restaurant ...",Buffet,Banashankari,3.58
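
For reference, `MinMaxScaler(feature_range=(1, 5))` maps each value x to 1 + 4 * (x - min) / (max - min). The sketch below reproduces that by hand on a few mean ratings taken from the rows above. The extremes 1.8 and 4.9 are assumptions, chosen because they are consistent with the scaled values displayed above; the notebook does not report the actual minimum and maximum.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Mean ratings from the first rows above, plus assumed extremes (1.8 and 4.9)
mean_rating = np.array([[1.8], [3.7], [3.8], [4.1], [4.118182], [4.9]])

# Library result
scaled = MinMaxScaler(feature_range=(1, 5)).fit_transform(mean_rating)

# Same transformation written out by hand
lo, hi = mean_rating.min(), mean_rating.max()
manual = 1 + (mean_rating - lo) * (5 - 1) / (hi - lo)

print(np.round(scaled.ravel(), 2))
print(np.allclose(scaled, manual))
```

Note that the scaled column is a relative position between the lowest and highest mean rating, so it is no longer directly comparable to the original `rate` values.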


In [22]:
zomato['reviews_list']

0        [('Rated 4.0', 'RATED\n  A beautiful place to ...
1        [('Rated 4.0', 'RATED\n  Had been here for din...
2        [('Rated 3.0', "RATED\n  Ambience is not that ...
3        [('Rated 4.0', "RATED\n  Great food and proper...
4        [('Rated 4.0', 'RATED\n  Very good restaurant ...
                               ...                        
51709    [('Rated 4.0', 'RATED\n  Ambience- Big and spa...
51711    [('Rated 4.0', 'RATED\n  A fine place to chill...
51712    [('Rated 5.0', "RATED\n  Food and service are ...
51715    [('Rated 4.0', 'RATED\n  Nice and friendly pla...
51716    [('Rated 5.0', 'RATED\n  Great ambience , look...
Name: reviews_list, Length: 41221, dtype: object

In [23]:
# Build the stop-word set once rather than on every call
stop_words = set(stopwords.words('english'))

# Define the function to clean reviews
def clean_reviews(reviews):
    # Remove URLs
    reviews = re.sub(r"http\S+|www\S+|https\S+", '', reviews)

    # Remove punctuation and other special characters
    reviews = re.sub(r'[^\w\s]', '', reviews)

    # Convert to lowercase
    reviews = reviews.lower()

    # Remove extra whitespace
    reviews = ' '.join(reviews.split())

    # Remove stop words
    reviews = ' '.join(word for word in reviews.split() if word not in stop_words)

    return reviews

# Apply the function to the 'reviews_list' column
zomato['reviews_list'] = zomato['reviews_list'].apply(clean_reviews)


In [24]:
zomato[['reviews_list', 'cuisines']].sample(5)

Unnamed: 0,reviews_list,cuisines
44458,rated 30 ratedn ambience355nservice455nfood45n...,"[Continental, European, Mediterranean, Cafe, S..."
14759,rated 10 ratedn first make bad pizzas top staf...,[Pizza]
19732,rated 20 ratedn much salt rated 10 ratedn stra...,"[North Indian, Chinese, Biryani]"
23606,rated 40 ratedn place centre town food tasty p...,"[Continental, Italian, Chinese, North Indian]"
12534,rated 50 ratedn undoubtedly say best steak pla...,"[North Eastern, Asian, Naga, Steak, Momos]"
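
The sample above contains tokens such as `ratedn` and `rated 40`. They appear because each `reviews_list` entry is the string form of a list of tuples: the literal two-character `\n` escape loses its backslash, and `4.0` loses its dot, when punctuation is stripped. A minimal sketch of a variant that removes those pieces first is below. `clean_reviews_v2` and the small `STOP_WORDS` set are hypothetical; the set stands in for the NLTK list so the example runs on its own.

```python
import re

# Small stand-in stop-word set (assumption); the notebook uses NLTK's English list
STOP_WORDS = {'a', 'to', 'the', 'is', 'and', 'in'}

def clean_reviews_v2(reviews: str) -> str:
    # Drop the "Rated 4.0" score markers and the "RATED" prefix of each review
    reviews = re.sub(r"Rated \d(\.\d)?|RATED", ' ', reviews)
    # Replace literal escape sequences (a backslash followed by n, r or t)
    reviews = re.sub(r"\\[nrt]", ' ', reviews)
    # Remove URLs, then punctuation, then lowercase
    reviews = re.sub(r"https?://\S+|www\.\S+", ' ', reviews)
    reviews = re.sub(r"[^\w\s]", '', reviews).lower()
    # Collapse whitespace and drop stop words
    return ' '.join(w for w in reviews.split() if w not in STOP_WORDS)

# Raw string: the backslash-n below is two literal characters, as in the dataset
raw = r"[('Rated 4.0', 'RATED\n  A beautiful place to dine in.')]"
print(clean_reviews_v2(raw))
```

Whether the rating markers should be dropped or kept as features is a design choice; removing them keeps the similarity focused on the review wording.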


In [25]:
# RESTAURANT NAMES:
restaurant_names = list(zomato['name'].unique())
restaurant_names

['Jalsa',
 'Spice Elephant',
 'San Churro Cafe',
 'Addhuri Udupi Bhojana',
 'Grand Village',
 'Timepass Dinner',
 'Rosewood International Hotel - Bar & Restaurant',
 'Onesta',
 'Penthouse Cafe',
 'Smacznego',
 'CafÃ\x83Â\x83Ã\x82Â\x83Ã\x83Â\x82Ã\x82Â\x83Ã\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82Â© Down The Alley',
 'Cafe Shuffle',
 'The Coffee Shack',
 'Caf-Eleven',
 'Cafe Vivacity',
 'Catch-up-ino',
 "Kirthi's Biryani",
 'T3H Cafe',
 '360 Atoms Restaurant And Cafe',
 'The Vintage Cafe',
 'Woodee Pizza',
 'Cafe Coffee Day',
 'My Tea House',
 'Hide Out Cafe',
 'CAFE NOVA',
 'Coffee Tindi',
 'Sea Green Cafe',
 'Cuppa',
 "Srinathji's Cafe",
 'Redberrys',
 'Foodiction',
 'Sweet Truth',
 'Ovenstory Pizza',
 'Faasos',
 'Behrouz Biryani',
 'Fast And Fresh',
 'Szechuan Dragon',
 'Empire Restaurant',
 'Maruthi Davangere Benne Dosa',
 'Chaatimes',
 'Havyaka Mess',
 "McDonald's",
 "Domino's Pizza",
 'Hotboxit',
 'Kitchen Garden',
 'Recipe',
 'Beijing Bites',
 'Tasty Bytes',
 'Petoo',
 'Shree Cool Point'

In [26]:
zomato.info()

<class 'pandas.core.frame.DataFrame'>
Index: 41221 entries, 0 to 51716
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   name          41221 non-null  object 
 1   online_order  41221 non-null  bool   
 2   book_table    41221 non-null  bool   
 3   rate          41221 non-null  float64
 4   votes         41221 non-null  int32  
 5   location      41221 non-null  object 
 6   rest_type     41221 non-null  object 
 7   cuisines      41221 non-null  object 
 8   cost          41221 non-null  float64
 9   reviews_list  41221 non-null  object 
 10  type          41221 non-null  object 
 11  city          41221 non-null  object 
 12  Mean Rating   41221 non-null  float64
dtypes: bool(2), float64(3), int32(1), object(7)
memory usage: 3.7+ MB


In [27]:
# Convert cuisines to a single string for each restaurant
zomato['cuisines'] = zomato['cuisines'].apply(lambda x: ' '.join(x))

In [28]:
zomato.head()

Unnamed: 0,name,online_order,book_table,rate,votes,location,rest_type,cuisines,cost,reviews_list,type,city,Mean Rating
0,Jalsa,True,True,4.1,775,Banashankari,Casual Dining,North Indian Mughlai Chinese,800.0,rated 40 ratedn beautiful place dine inthe int...,Buffet,Banashankari,3.99
1,Spice Elephant,True,False,4.1,787,Banashankari,Casual Dining,Chinese North Indian Thai,800.0,rated 40 ratedn dinner family turned good choo...,Buffet,Banashankari,3.97
2,San Churro Cafe,True,False,3.8,918,Banashankari,"Cafe, Casual Dining",Cafe Mexican Italian,800.0,rated 30 ratedn ambience good enough pocket fr...,Buffet,Banashankari,3.58
3,Addhuri Udupi Bhojana,False,False,3.7,88,Banashankari,Quick Bites,South Indian North Indian,300.0,rated 40 ratedn great food proper karnataka st...,Buffet,Banashankari,3.45
4,Grand Village,False,False,3.8,166,Basavanagudi,Casual Dining,North Indian Rajasthani,600.0,rated 40 ratedn good restaurant neighbourhood ...,Buffet,Banashankari,3.58


In [100]:
zomato=zomato.drop(['rest_type', 'type', 'votes'],axis=1)

In [101]:
# Randomly sample 60% of the rows (fixed seed for reproducibility)
df_percent = zomato.sample(frac=0.6, random_state=42)

# Discard the old row labels and use 'name' as the index
df_percent.reset_index(drop=True, inplace=True)
df_percent.set_index('name', inplace=True)

# Print the sampled dataframe shape to verify
print(df_percent.shape)

(24733, 9)


In [102]:
df_percent.head()

name,online_order,book_table,rate,location,cuisines,cost,reviews_list,city,Mean Rating
MRA,True,False,3.9,Btm,Bakery,200.0,rated 50 ratedn lovely snacks savouries cookie...,Jp Nagar,3.55
S S Popular Frozen Stone,True,False,3.7,Bellandur,Beverages Ice Cream Sandwich Fast Food,250.0,rated 40 ratedn quantity sandwich less rated 5...,Bellandur,3.45
The Lassi Bar,False,False,3.7,Btm,Beverages Juices,300.0,rated 30 ratedn favs menu hard rock coffee ava...,Koramangala 7Th Block,3.45
Pizza Cart,True,False,3.8,Hsr,Pizza,800.0,rated 40 ratedn ever crave really nice pizza p...,Koramangala 5Th Block,3.58
Absolute Chinese,True,False,3.8,Rajajinagar,Chinese Fast Food Momos,450.0,rated 30 ratedn ordered scheswaan fried rice h...,Malleshwaram,3.58


In [111]:
# Initialize the TF-IDF Vectorizer
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=0.0, stop_words='english')

# Fit and transform the reviews_list column
tfidf_matrix = tfidf.fit_transform(df_percent['reviews_list'])

# Print the shape of the TF-IDF matrix to verify
print(tfidf_matrix.shape)

(24733, 1455411)
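
To make the vectorizer settings concrete, here is a small sketch on a hypothetical three-review corpus. With `ngram_range=(1, 2)` both single words and adjacent word pairs become features, which helps explain why the real matrix above has so many columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for df_percent['reviews_list']
docs = [
    "crispy dosa filter coffee",
    "crispy dosa chutney",
    "spicy biryani friendly staff",
]

tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words='english')
matrix = tfidf.fit_transform(docs)

# One row per document, one column per unigram or bigram
print(matrix.shape)
# vocabulary_ maps each feature (word or word pair) to its column index
print(sorted(tfidf.vocabulary_))
```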


In [112]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarities = cosine_similarity(tfidf_matrix, tfidf_matrix)

print(cosine_similarities.shape)

(24733, 24733)
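
A dense 24733 x 24733 matrix of float64 values occupies roughly 24733 * 24733 * 8 bytes, about 4.9 GB, so this step can be memory-hungry. Because `TfidfVectorizer` L2-normalizes each row by default, the plain dot product from `linear_kernel` (already imported in the first cell) gives the same values as `cosine_similarity` while skipping the extra normalization. The sketch below checks this on a hypothetical toy corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy corpus
docs = ["crispy dosa filter coffee", "crispy dosa chutney", "wood fired pizza"]
X = TfidfVectorizer().fit_transform(docs)  # rows are L2-normalized by default

cos = cosine_similarity(X, X)
lin = linear_kernel(X, X)
print(np.allclose(cos, lin))

# Rough memory estimate for the full dense matrix in the notebook
n = 24733
print(f"{n * n * 8 / 1e9:.1f} GB")
```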


In [113]:
def recommend(name, cosine_similarities, indices, df_percent):
    # Create a list to store the top restaurant names
    recommend_restaurant = []
    
    # Find the index of the restaurant entered
    idx = indices[indices == name].index[0]
    
    # Find the restaurants with a similar cosine-sim value and order them from highest to lowest
    score_series = pd.Series(cosine_similarities[idx]).sort_values(ascending=False)
    
    # Extract top 30 restaurant indexes with a similar cosine-sim value
    top30_indexes = list(score_series.iloc[1:31].index)  # Use iloc[1:31] to skip the restaurant itself
    
    # Names of the top 30 restaurants
    for each in top30_indexes:
        recommend_restaurant.append(df_percent.index[each])
    
    # Creating the new dataset to show similar restaurants
    df_new = pd.DataFrame(columns=['cuisines', 'Mean Rating', 'cost'])
    
    # Create the top 30 similar restaurants with some of their columns
    for each in recommend_restaurant:
        sample_df = df_percent[['cuisines', 'Mean Rating', 'cost']][df_percent.index == each]
        df_new = pd.concat([df_new, sample_df], ignore_index=True)
    
    # Drop duplicate entries, then keep the 10 rows with the highest Mean Rating
    df_new = df_new.drop_duplicates(subset=['cuisines', 'Mean Rating', 'cost'], keep='first')
    df_new = df_new.sort_values(by='Mean Rating', ascending=False).head(10)
    
    # Label the rows with the first len(df_new) candidate names; after
    # de-duplication and sorting these labels may not match their rows
    df_new.index = recommend_restaurant[:len(df_new)]
    df_new.index.name = 'Restaurant Name'
    df_new.reset_index(inplace=True)
    
    print('TOP %s RESTAURANTS LIKE %s WITH SIMILAR REVIEWS: ' % (str(len(df_new)), name))
    
    return df_new[['Restaurant Name', 'cuisines', 'Mean Rating', 'cost']]

# Example usage:
# result = recommend('Restaurant Name', cosine_similarities, indices, df_percent)
# print(result)


In [114]:
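# Map each row position to its restaurant name (required by recommend)
indices = pd.Series(df_percent.index)
recommend('Pai Vihar', cosine_similarities, indices, df_percent)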

TOP 10 RESTAURANTS LIKE Pai Vihar WITH SIMILAR REVIEWS: 


Unnamed: 0,Restaurant Name,cuisines,Mean Rating,cost
0,Pai Vihar,South Indian Chinese Street Food Juices,3.84,300.0
1,Pai Vihar,North Indian South Indian Chinese,3.71,850.0
2,Pai Vihar,Street Food Fast Food Rolls Desserts,3.6,200.0
3,Pai Vihar,North Indian Mithai,3.6,250.0
4,Pai Vihar,Street Food Beverages,3.6,150.0
5,Samosa Singh,Chettinad South Indian Biryani,3.58,700.0
6,Samosa Singh,South Indian Chinese North Indian,3.58,350.0
7,Samosa Singh,Fast Food North Indian South Indian,3.45,200.0
8,Kadai Crust - Amma Veetu Samayal,South Indian,3.45,150.0
9,Kadai Crust - Amma Veetu Samayal,Chinese North Indian South Indian,3.32,250.0
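
In the output above, the same name appears next to different cuisines and costs. That is consistent with `recommend` looking up every row that shares a name and then relabelling the de-duplicated, sorted rows with the first entries of the candidate list, so names are not guaranteed to line up with their rows. The sketch below is one possible alternative, not the notebook's method: `recommend_by_position` is a hypothetical helper that selects neighbours by row position with `iloc`, so each name stays attached to its own row. The toy frame and similarity matrix are made up for illustration.

```python
import numpy as np
import pandas as pd

def recommend_by_position(name, cosine_similarities, df, top_n=10, pool=30):
    # Position of the first row whose index label equals the query name
    idx = int(np.flatnonzero(df.index == name)[0])
    # Rank all rows by similarity to the query row, highest first
    order = np.argsort(-cosine_similarities[idx])
    # Drop the query row itself and keep a pool of nearest neighbours
    order = order[order != idx][:pool]
    out = df.iloc[order][['cuisines', 'Mean Rating', 'cost']]
    # Names travel with their rows, so de-duplication and sorting are safe
    out = out.rename_axis('Restaurant Name').reset_index().drop_duplicates()
    return out.sort_values('Mean Rating', ascending=False).head(top_n)

# Hypothetical toy data: 4 restaurants and a hand-made similarity matrix
toy = pd.DataFrame(
    {'cuisines': ['South Indian', 'South Indian', 'Pizza', 'Chinese'],
     'Mean Rating': [3.8, 4.2, 3.5, 3.9],
     'cost': [300.0, 250.0, 800.0, 450.0]},
    index=pd.Index(['A', 'B', 'C', 'D'], name='name'),
)
sims = np.array([[1.0, 0.9, 0.1, 0.4],
                 [0.9, 1.0, 0.2, 0.3],
                 [0.1, 0.2, 1.0, 0.5],
                 [0.4, 0.3, 0.5, 1.0]])
result = recommend_by_position('A', sims, toy, top_n=2, pool=3)
print(result)
```

With the notebook's objects, the equivalent call would be `recommend_by_position('Pai Vihar', cosine_similarities, df_percent)`; its output is not shown here because it has not been run against the dataset.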
