🔹 Step 1: Load the Dataset
Assuming your dataset is saved as FoodieBay.csv, load it with pandas:


In [1]:
import pandas as pd

# Load data
# Raw string so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r"E:\FoodieBay DataSet\data\FoodieBay.csv")

# Preview
df.head()


Unnamed: 0,url,address,name,phone,location,rest_type,cuisines,menu_item,listed_in_type,listed_in_city,online_order,book_table,ave_cost_for_two,dish_liked,votes,ave_review_ranking,rate
0,https://www.zomato.com/bangalore/d2v-cafe-1-ba...,"173/218, GF, Opposite Ranka Colony, Bannerghat...",D2V Cafe,+91 9886986111\n+91 8550051111,Bannerghatta Road,Cafe,Cafe,[],Cafes,JP Nagar,No,No,700.0,,13,4.75,3.6
1,https://www.zomato.com/bangalore/the-burger-pl...,"2nd Floor, MMR Plaza, Above DCB Bank, Sarjapur...",The Burger Place,+91 9108974600,Koramangala 1st Block,Quick Bites,"Burger, Continental, Fast Food",[],Dine-out,Koramangala 5th Block,Yes,No,400.0,,28,4.5,3.8
2,https://www.zomato.com/bangalore/millet-mama-b...,"Next To Surana College, South End Circle, Basa...",Millet Mama,+91 7411918648\n+91 9986975625,Basavanagudi,Quick Bites,"South Indian, Healthy Food",[],Delivery,Jayanagar,Yes,No,200.0,,18,4.0,3.9
3,https://www.zomato.com/bangalore/red-onion-sha...,"Money Chambers Double Road, Shanti Nagar, Bang...",Red Onion,+91 8867253669,Shanti Nagar,Casual Dining,"Chinese, North Indian, Biryani, Kebab","['Hyderabadi Biryani', 'Special Veg Combo', 'S...",Delivery,Brigade Road,Yes,Yes,1200.0,"Fish, Dumplings, Biryani, Paneer Tikka Masala,...",550,4.8,4.3
4,https://www.zomato.com/bangalore/chaiywaala-da...,"Shop 67, 69, 70, Inside Ramaiah Campus, New BE...",Chaiywaala Da Dhaba,+91 8217431260\n+91 7975991975,New BEL Road,Cafe,"Cafe, Tea",[],Dine-out,New BEL Road,Yes,No,250.0,"Ginger Chai, Pakoda, Tea, Paneer Thali, Chole ...",67,3.0,3.7


🔹 Step 2: Understand Columns and Missing Data

In [2]:
# Overview of data
df.info()

# Check missing values
df.isnull().sum()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40130 entries, 0 to 40129
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   url                 40130 non-null  object 
 1   address             40130 non-null  object 
 2   name                40130 non-null  object 
 3   phone               39246 non-null  object 
 4   location            40130 non-null  object 
 5   rest_type           40130 non-null  object 
 6   cuisines            40112 non-null  object 
 7   menu_item           40130 non-null  object 
 8   listed_in_type      40130 non-null  object 
 9   listed_in_city      40130 non-null  object 
 10  online_order        40130 non-null  object 
 11  book_table          40130 non-null  object 
 12  ave_cost_for_two    39890 non-null  float64
 13  dish_liked          17351 non-null  object 
 14  votes               40130 non-null  int64  
 15  ave_review_ranking  33751 non-null  float64
 16  rate

url                       0
address                   0
name                      0
phone                   884
location                  0
rest_type                 0
cuisines                 18
menu_item                 0
listed_in_type            0
listed_in_city            0
online_order              0
book_table                0
ave_cost_for_two        240
dish_liked            22779
votes                     0
ave_review_ranking     6379
rate                   8336
dtype: int64
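
Raw counts are easier to compare as a share of rows. For instance, dish_liked is missing in 22779 of 40130 rows, roughly 57%. A minimal sketch of that calculation, using a small made-up frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df
toy = pd.DataFrame({
    "phone": ["+91 1", None, "+91 3", "+91 4"],
    "rate": [3.6, np.nan, np.nan, 4.3],
    "votes": [13, 28, 18, 550],
})

# Share of missing values per column, as a percentage, largest first
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

On the real data the same expression would be applied to df instead of toy.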

🔹 Step 3: Clean Columns One-by-One
🧽 Clean rate and ave_review_ranking

In [4]:
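# Fill missing text fields with placeholders and numeric fields with the column median
df['cuisines'] = df['cuisines'].fillna('Unknown')
df['ave_cost_for_two'] = df['ave_cost_for_two'].fillna(df['ave_cost_for_two'].median())
df['dish_liked'] = df['dish_liked'].fillna('Not Specified')
df['ave_review_ranking'] = df['ave_review_ranking'].fillna(df['ave_review_ranking'].median())
# Assumes rate is already numeric here; if it still holds strings such as '4.1/5', convert it first
df['rate'] = df['rate'].fillna(df['rate'].median())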

In [5]:
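# Remove '/5', strip whitespace, convert to float
# errors='coerce' turns placeholders such as 'NEW' or '-' into NaN
# (Series.replace(value, None) can forward-fill instead on older pandas versions)
df['rate'] = df['rate'].astype(str).str.replace('/5', '', regex=False).str.strip()
df['rate'] = pd.to_numeric(df['rate'], errors='coerce')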

# Convert ave_review_ranking to float
df['ave_review_ranking'] = pd.to_numeric(df['ave_review_ranking'], errors='coerce')
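
To see how the rate cleanup treats different inputs, here is a small sketch on made-up values. The strings in `raw` are assumptions about what a raw rate column might contain; the real column may already be numeric. It also shows imputing after parsing, so that the median is computed on actual numbers:

```python
import pandas as pd

# Hypothetical raw values; the real column may already be numeric
raw = pd.Series(["4.1/5", " 3.8 /5", "NEW", "-", None])

# Strip the '/5' suffix and surrounding whitespace
cleaned = raw.astype(str).str.replace("/5", "", regex=False).str.strip()

# Anything that is not a number becomes NaN
parsed = pd.to_numeric(cleaned, errors="coerce")

# Impute only after parsing, so the median is computed on real numbers
filled = parsed.fillna(parsed.median())
print(filled)
```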


🔢 Convert votes, ave_cost_for_two

In [6]:
df['votes'] = pd.to_numeric(df['votes'], errors='coerce')
df['ave_cost_for_two'] = pd.to_numeric(df['ave_cost_for_two'], errors='coerce')


🔁 Handle Yes/No in online_order and book_table

In [7]:
df['online_order'] = df['online_order'].map({'Yes': 1, 'No': 0})
df['book_table'] = df['book_table'].map({'Yes': 1, 'No': 0})
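
Note that Series.map returns NaN for any value that is not a key in the dictionary, so a stray label would silently become a missing value. A minimal sketch of a more defensive mapping, with made-up inputs (`flags` is illustrative; the real columns may contain only Yes and No):

```python
import pandas as pd

# Hypothetical values, including inconsistent casing and an unexpected label
flags = pd.Series(["Yes", "No", "yes ", "N/A"])

# Normalise the text first, then map; anything unmapped becomes NaN
mapped = flags.str.strip().str.lower().map({"yes": 1, "no": 0})

# List the original labels that did not map, so they can be handled explicitly
unmapped = flags[mapped.isna()]
print(unmapped.tolist())
```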


🧹 Clean cuisines, dish_liked
These are text-based lists. You can either keep them as-is or extract top features later.

Example: Count most common cuisine styles:


In [8]:
# Example: Top 10 cuisines
# explode() gives one row per cuisine; summing Python lists with .sum() gets slow on 40k rows
all_cuisines = df['cuisines'].dropna().str.split(', ').explode()
cuisine_counts = all_cuisines.value_counts().head(10)


If needed, we can one-hot encode popular cuisines later.
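
One way to do that, sketched on made-up strings, is Series.str.get_dummies, which creates an indicator column per cuisine so that a restaurant with several cuisines gets several 1s. The variable names and the choice of top 2 are illustrative only:

```python
import pandas as pd

# Hypothetical cuisine strings in the same ", "-separated format
cuisines = pd.Series(["Cafe", "Burger, Fast Food", "Cafe, Tea", "Fast Food"])

# One indicator column per cuisine; a row can have several 1s
multi_hot = cuisines.str.get_dummies(sep=", ")

# Keep only the most frequent cuisines (top 2 here, purely for illustration)
top = multi_hot.sum().nlargest(2).index
multi_hot_top = multi_hot[top].add_prefix("cuisine_")
print(multi_hot_top)
```

Unlike keeping only the first listed cuisine, this retains every cuisine a restaurant lists, at the cost of more columns.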



🧹 Clean menu_item
Some entries are empty and others are massive arrays. You can either drop this column for modeling or extract only a count (e.g., the number of items) before dropping it:

In [9]:
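import ast

# literal_eval parses the list string without executing arbitrary code, unlike eval
df['menu_item_count'] = df['menu_item'].apply(lambda x: len(ast.literal_eval(x)) if isinstance(x, str) and x.startswith('[') else 0)
df.drop(columns=['menu_item'], inplace=True)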
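
If some menu strings turn out to be malformed, a one-line lambda will raise partway through the column. A slightly more defensive sketch is below; `count_items` is a hypothetical helper, and the sample strings are made up to cover the empty, valid, truncated, and missing cases:

```python
import ast

def count_items(value):
    """Return the number of items in a stringified list, or 0 if it cannot be parsed."""
    if not isinstance(value, str):
        return 0
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return 0
    return len(parsed) if isinstance(parsed, list) else 0

# Hypothetical inputs: empty list, real list, truncated string, missing value
samples = [
    "[]",
    "['Hyderabadi Biryani', 'Special Veg Combo']",
    "['Hyderabadi Biryani', 'S...",
    None,
]
counts = [count_items(s) for s in samples]
print(counts)
```

It could be applied with df['menu_item'].apply(count_items) in place of the lambda.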


🏷️ Encode Categorical Features
Columns like location, rest_type, listed_in_city, and listed_in_type can be one-hot or label encoded:

In [10]:
# Option 1: Label Encoding (for tree models)
from sklearn.preprocessing import LabelEncoder

encoders = {}  # one fitted encoder per column, so codes can be mapped back to labels later
for col in ['location', 'rest_type', 'listed_in_type', 'listed_in_city']:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col].astype(str))
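
For comparison, a second option is one-hot encoding, which avoids implying an order between categories and so tends to suit linear models better. A minimal sketch on a made-up frame (`toy` is illustrative; on the real data it would be applied to the original string columns instead of label encoding them):

```python
import pandas as pd

# Hypothetical toy frame; the labels mirror the preview rows
toy = pd.DataFrame({
    "listed_in_type": ["Cafes", "Dine-out", "Delivery", "Delivery"],
    "votes": [13, 28, 18, 550],
})

# Option 2: one indicator column per category, no implied ordering
encoded = pd.get_dummies(toy, columns=["listed_in_type"], dtype=int)
print(encoded)
```

The trade-off is width: location has 93 distinct codes in this dataset (0 to 92), so one-hot encoding it would add up to 93 columns.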


✅ Final Step: Drop Unused Columns

In [11]:
df.drop(columns=['url', 'address', 'phone', 'name', 'dish_liked'], inplace=True)



In [12]:
df.info()
df.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40130 entries, 0 to 40129
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   location            40130 non-null  int64  
 1   rest_type           40130 non-null  int64  
 2   cuisines            40130 non-null  object 
 3   listed_in_type      40130 non-null  int64  
 4   listed_in_city      40130 non-null  int64  
 5   online_order        40130 non-null  int64  
 6   book_table          40130 non-null  int64  
 7   ave_cost_for_two    40130 non-null  float64
 8   votes               40130 non-null  int64  
 9   ave_review_ranking  40130 non-null  float64
 10  rate                40130 non-null  float64
 11  menu_item_count     40130 non-null  int64  
dtypes: float64(3), int64(8), object(1)
memory usage: 3.7+ MB


Unnamed: 0,location,rest_type,listed_in_type,listed_in_city,online_order,book_table,ave_cost_for_two,votes,ave_review_ranking,rate,menu_item_count
count,40130.0,40130.0,40130.0,40130.0,40130.0,40130.0,40130.0,40130.0,40130.0,40130.0,40130.0
mean,35.374009,3.641964,2.749439,14.070994,0.611986,0.102018,506.059183,225.685547,3.610388,3.66615,26.936058
std,27.336681,1.703521,1.094749,8.244336,0.487304,0.302676,323.298238,598.568781,0.839315,0.384137,64.301264
min,0.0,0.0,0.0,0.0,0.0,0.0,40.0,0.0,1.0,1.8,0.0
25%,9.0,2.0,2.0,7.0,0.0,0.0,300.0,6.0,3.2,3.5,0.0
50%,32.0,5.0,2.0,15.0,1.0,0.0,400.0,36.0,3.727273,3.7,0.0
75%,55.0,5.0,4.0,20.0,1.0,0.0,600.0,174.0,4.0,3.9,0.0
max,92.0,6.0,6.0,29.0,1.0,1.0,2500.0,12121.0,5.0,4.9,715.0


In [13]:
df.head()

Unnamed: 0,location,rest_type,cuisines,listed_in_type,listed_in_city,online_order,book_table,ave_cost_for_two,votes,ave_review_ranking,rate,menu_item_count
0,3,1,Cafe,1,12,0,0,700.0,13,4.75,3.6,0
1,41,5,"Burger, Continental, Fast Food",4,17,1,0,400.0,28,4.5,3.8,0
2,4,5,"South Indian, Healthy Food",2,13,1,0,200.0,18,4.0,3.9,0
3,78,2,"Chinese, North Indian, Biryani, Kebab",2,5,1,1,1200.0,550,4.8,4.3,100
4,60,1,"Cafe, Tea",4,24,1,0,250.0,67,3.0,3.7,0


# Use top N cuisines (better, keeps more info)

In [15]:
# Count cuisines (take first cuisine from list for consistency)
df['primary_cuisine'] = df['cuisines'].apply(lambda x: x.split(',')[0].strip())

# Keep top 20 cuisines, rest as "Other"
top_cuisines = df['primary_cuisine'].value_counts().nlargest(20).index
df['primary_cuisine'] = df['primary_cuisine'].apply(lambda x: x if x in top_cuisines else 'Other')
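
The same top-N grouping can be written without a lambda. A small sketch on made-up labels, using Series.where with isin (the values and the choice of top 2 are illustrative; the cell above keeps 20):

```python
import pandas as pd

# Hypothetical primary cuisines
primary = pd.Series(["Cafe", "Burger", "Cafe", "Tea", "Burger", "Cafe", "Sushi"])

# Top 2 here purely for illustration
top = primary.value_counts().nlargest(2).index

# Keep labels that are in the top set, replace everything else with "Other"
grouped = primary.where(primary.isin(top), "Other")
print(grouped.value_counts())
```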


In [19]:
df = pd.get_dummies(df, columns=['primary_cuisine'], drop_first=True)
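
With drop_first=True, the first category in sorted order gets no column of its own and becomes the baseline: 20 top cuisines plus "Other" give 21 categories, hence 20 indicator columns. A minimal sketch on a made-up frame shows how baseline rows look afterwards:

```python
import pandas as pd

# Hypothetical toy frame with three categories
toy = pd.DataFrame({"primary_cuisine": ["Cafe", "Burger", "Other", "Cafe"]})

dummies = pd.get_dummies(toy, columns=["primary_cuisine"], drop_first=True)

# "Burger" sorts first, so its column is dropped and it becomes the baseline:
# a Burger row is one where every remaining indicator is False
baseline_rows = ~dummies.any(axis=1)
print(dummies.columns.tolist())
```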


In [20]:
df = df.drop(columns=['cuisines'])

In [21]:
df.head()

Unnamed: 0,location,rest_type,listed_in_type,listed_in_city,online_order,book_table,ave_cost_for_two,votes,ave_review_ranking,rate,...,primary_cuisine_Fast Food,primary_cuisine_Healthy Food,primary_cuisine_Italian,primary_cuisine_Kerala,primary_cuisine_North Indian,primary_cuisine_Other,primary_cuisine_Pizza,primary_cuisine_Seafood,primary_cuisine_South Indian,primary_cuisine_Street Food
0,3,1,1,12,0,0,700.0,13,4.75,3.6,...,False,False,False,False,False,False,False,False,False,False
1,41,5,4,17,1,0,400.0,28,4.5,3.8,...,False,False,False,False,False,False,False,False,False,False
2,4,5,2,13,1,0,200.0,18,4.0,3.9,...,False,False,False,False,False,False,False,False,True,False
3,78,2,2,5,1,1,1200.0,550,4.8,4.3,...,False,False,False,False,False,False,False,False,False,False
4,60,1,4,24,1,0,250.0,67,3.0,3.7,...,False,False,False,False,False,False,False,False,False,False


In [22]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40130 entries, 0 to 40129
Data columns (total 31 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   location                      40130 non-null  int64  
 1   rest_type                     40130 non-null  int64  
 2   listed_in_type                40130 non-null  int64  
 3   listed_in_city                40130 non-null  int64  
 4   online_order                  40130 non-null  int64  
 5   book_table                    40130 non-null  int64  
 6   ave_cost_for_two              40130 non-null  float64
 7   votes                         40130 non-null  int64  
 8   ave_review_ranking            40130 non-null  float64
 9   rate                          40130 non-null  float64
 10  menu_item_count               40130 non-null  int64  
 11  primary_cuisine_Arabian       40130 non-null  bool   
 12  primary_cuisine_Asian         40130 non-null  bool   
 13  p

Unnamed: 0,location,rest_type,listed_in_type,listed_in_city,online_order,book_table,ave_cost_for_two,votes,ave_review_ranking,rate,menu_item_count
count,40130.0,40130.0,40130.0,40130.0,40130.0,40130.0,40130.0,40130.0,40130.0,40130.0,40130.0
mean,35.374009,3.641964,2.749439,14.070994,0.611986,0.102018,506.059183,225.685547,3.610388,3.66615,26.936058
std,27.336681,1.703521,1.094749,8.244336,0.487304,0.302676,323.298238,598.568781,0.839315,0.384137,64.301264
min,0.0,0.0,0.0,0.0,0.0,0.0,40.0,0.0,1.0,1.8,0.0
25%,9.0,2.0,2.0,7.0,0.0,0.0,300.0,6.0,3.2,3.5,0.0
50%,32.0,5.0,2.0,15.0,1.0,0.0,400.0,36.0,3.727273,3.7,0.0
75%,55.0,5.0,4.0,20.0,1.0,0.0,600.0,174.0,4.0,3.9,0.0
max,92.0,6.0,6.0,29.0,1.0,1.0,2500.0,12121.0,5.0,4.9,715.0


In [25]:
# Convert all boolean columns to int
df = df.astype({col: 'int' for col in df.columns if df[col].dtype == 'bool'})


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40130 entries, 0 to 40129
Data columns (total 31 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   location                      40130 non-null  int64  
 1   rest_type                     40130 non-null  int64  
 2   listed_in_type                40130 non-null  int64  
 3   listed_in_city                40130 non-null  int64  
 4   online_order                  40130 non-null  int64  
 5   book_table                    40130 non-null  int64  
 6   ave_cost_for_two              40130 non-null  float64
 7   votes                         40130 non-null  int64  
 8   ave_review_ranking            40130 non-null  float64
 9   rate                          40130 non-null  float64
 10  menu_item_count               40130 non-null  int64  
 11  primary_cuisine_Arabian       40130 non-null  int64  
 12  primary_cuisine_Asian         40130 non-null  int64  
 13  p

In [27]:
df.to_csv(r"E:\FoodieBay DataSet\data\foodiebay_cleaned.csv", index=False)
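
Before moving on to modeling, it can be worth reloading the saved file and checking that it is fully numeric with no missing values. A sketch under stated assumptions: the frame `cleaned` and the file name are made up, and a temporary directory stands in for the real output path:

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned frame standing in for df
cleaned = pd.DataFrame({"votes": [13, 28], "rate": [3.6, 3.8], "online_order": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned.csv")  # illustrative file name
    cleaned.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded frame should have no missing values and only numeric columns
has_nulls = bool(reloaded.isnull().any().any())
non_numeric = reloaded.select_dtypes(exclude="number").columns.tolist()
print(has_nulls, non_numeric)
```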