# Generate Cleaned Data, Numerical Data, and Encoder Files

This notebook:
1. Cleans `swiggy.csv` and saves `cleaned_data.csv`.
2. Extracts numeric columns (`rating`, `rating_count`, `cost`) to `numerical_data.csv`.
3. Fits a `OneHotEncoder` on `name`, `city`, and `cuisine` and saves it as `encoder.pkl`.

In [1]:
import pandas as pd
import numpy as np
import pickle
from sklearn.preprocessing import OneHotEncoder

In [2]:
df = pd.read_csv("C:/Users/sinwa/Desktop/swiggy_project/data/raw_data/swiggy.csv")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148541 entries, 0 to 148540
Data columns (total 11 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            148541 non-null  int64 
 1   name          148455 non-null  object
 2   city          148541 non-null  object
 3   rating        148455 non-null  object
 4   rating_count  148455 non-null  object
 5   cost          148410 non-null  object
 6   cuisine       148442 non-null  object
 7   lic_no        148312 non-null  object
 8   link          148541 non-null  object
 9   address       148455 non-null  object
 10  menu          148541 non-null  object
dtypes: int64(1), object(10)
memory usage: 12.5+ MB


In [4]:
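# Treat placeholder strings as missing values.
df.replace(["--", "Too Few Ratings"], np.nan, inplace=True)
# Keep only the digits in the count and cost strings; numeric conversion happens in the next cell.
df["rating_count"] = df["rating_count"].astype(str).str.replace(r"[^0-9]", "", regex=True)
df["cost"] = df["cost"].astype(str).str.replace(r"[^0-9]", "", regex=True)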

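To make the effect of the cleaning cell concrete, here is a minimal sketch on a few made-up values. The sample strings (`"50+ ratings"`, `"₹ 200"`) are assumptions about what the raw columns look like, not values taken from `swiggy.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values; the real file may use different formats.
sample = pd.DataFrame({
    "rating_count": ["50+ ratings", "Too Few Ratings", "100+ ratings"],
    "cost": ["₹ 200", "₹ 150", "--"],
})

# Same steps as the cleaning cell: placeholders -> NaN, then keep digits only.
sample = sample.replace(["--", "Too Few Ratings"], np.nan)
for col in ["rating_count", "cost"]:
    digits = sample[col].astype(str).str.replace(r"[^0-9]", "", regex=True)
    # Missing values end up with no digits (or stay missing), so
    # to_numeric(..., errors="coerce") yields NaN for them.
    sample[col] = pd.to_numeric(digits, errors="coerce")

print(sample)
```

One caveat of stripping every non-digit character: an abbreviated count such as `1K+ ratings` (hypothetical) would become `1`, so values of that kind would need separate handling if they occur in the data.
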
In [5]:
# Convert the cleaned strings to numbers; anything unparseable becomes NaN.
# Then fill the gaps with the column median, assigning the result back
# instead of calling fillna(inplace=True) on a column selection.
for col in ["rating", "rating_count", "cost"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")
    df[col] = df[col].fillna(df[col].median())

# Drop exact duplicate rows and rows missing any categorical key.
df_cleaned = df.drop_duplicates().dropna(subset=["name", "city", "cuisine"]).reset_index(drop=True)
df_cleaned.to_csv("cleaned_data.csv", index=False)
df_cleaned.head()

|  | id | name | city | rating | rating_count | cost | cuisine | lic_no | link | address | menu |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 567335 | AB FOODS POINT | Abohar | 4.0 | 50.0 | 200.0 | Beverages,Pizzas | 22122652000138 | https://www.swiggy.com/restaurants/ab-foods-po... | AB FOODS POINT, NEAR RISHI NARANG DENTAL CLINI... | Menu/567335.json |
| 1 | 531342 | Janta Sweet House | Abohar | 4.4 | 50.0 | 200.0 | Sweets,Bakery | 12117201000112 | https://www.swiggy.com/restaurants/janta-sweet... | Janta Sweet House, Bazar No.9, Circullar Road,... | Menu/531342.json |
| 2 | 158203 | theka coffee desi | Abohar | 3.8 | 100.0 | 100.0 | Beverages | 22121652000190 | https://www.swiggy.com/restaurants/theka-coffe... | theka coffee desi, sahtiya sadan road city | Menu/158203.json |
| 3 | 187912 | Singh Hut | Abohar | 3.7 | 20.0 | 250.0 | Fast Food,Indian | 22119652000167 | https://www.swiggy.com/restaurants/singh-hut-n... | Singh Hut, CIRCULAR ROAD NEAR NEHRU PARK ABOHAR | Menu/187912.json |
| 4 | 543530 | GRILL MASTERS | Abohar | 4.0 | 50.0 | 250.0 | Italian-American,Fast Food | 12122201000053 | https://www.swiggy.com/restaurants/grill-maste... | GRILL MASTERS, ADA Heights, Abohar - Hanumanga... | Menu/543530.json |

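The imputation step fills missing numeric values with each column's median. A minimal sketch on toy data (the values are invented), assigning the filled result back rather than calling `fillna(..., inplace=True)` on a column selection:

```python
import numpy as np
import pandas as pd

# Invented values, only to illustrate the technique.
toy = pd.DataFrame({
    "rating": [4.0, np.nan, 3.0, 5.0],
    "cost": [200.0, 100.0, np.nan, 300.0],
})

# Compute all medians at once (NaN is skipped by default),
# then fill each column with its own median.
medians = toy.median()
toy = toy.fillna(medians)

print(medians.to_dict())
print(toy)
```

Passing a Series to `DataFrame.fillna` matches its index against the column names, so each column receives its own median in a single call.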

In [6]:
# Save numerical data
numerical_cols = ["rating", "rating_count", "cost"]
df_numerical_data = df_cleaned[numerical_cols]
df_numerical_data.to_csv("numerical_data.csv", index=False)
df_numerical_data.head()

|  | rating | rating_count | cost |
| --- | --- | --- | --- |
| 0 | 4.0 | 50.0 | 200.0 |
| 1 | 4.4 | 50.0 | 200.0 |
| 2 | 3.8 | 100.0 | 100.0 |
| 3 | 3.7 | 20.0 | 250.0 |
| 4 | 4.0 | 50.0 | 250.0 |


In [8]:
# Fit and save OneHotEncoder
categorical_cols = ["name", "city", "cuisine"]
encoder = OneHotEncoder(sparse_output=True, handle_unknown="ignore")
encoder.fit(df_cleaned[categorical_cols])
with open("encoder.pkl", "wb") as f:
    pickle.dump(encoder, f)
print("encoder.pkl saved")

encoder.pkl saved

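As an illustration of how the pickled encoder might be used later, the sketch below fits an encoder on a toy frame, round-trips it through `pickle` in memory, and transforms a row containing an unseen city. The toy rows and the `query` frame are invented; only the column names and `handle_unknown="ignore"` come from the notebook.

```python
import pickle

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for df_cleaned[categorical_cols]; values are invented.
train = pd.DataFrame({
    "name": ["A Cafe", "B Diner", "C Grill"],
    "city": ["Abohar", "Abohar", "Delhi"],
    "cuisine": ["Beverages", "Indian", "Fast Food"],
})

encoder = OneHotEncoder(handle_unknown="ignore")  # sparse output by default
encoder.fit(train)

# Round-trip through pickle in memory, mirroring the dump/load to encoder.pkl.
restored = pickle.loads(pickle.dumps(encoder))

# "Pune" was never seen during fit, so its city columns are all zeros.
query = pd.DataFrame({"name": ["A Cafe"], "city": ["Pune"], "cuisine": ["Indian"]})
encoded = restored.transform(query)

print(encoded.shape)      # one row, one column per fitted category
print(encoded.toarray())
```

Because `cuisine` holds comma-separated strings such as `Beverages,Pizzas`, each distinct combination is treated as its own category, and encoding `name` adds one column per distinct restaurant name. Keeping the output sparse helps when the number of categories is large.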

In [9]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148442 entries, 0 to 148441
Data columns (total 11 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            148442 non-null  int64  
 1   name          148442 non-null  object 
 2   city          148442 non-null  object 
 3   rating        148442 non-null  float64
 4   rating_count  148442 non-null  float64
 5   cost          148442 non-null  float64
 6   cuisine       148442 non-null  object 
 7   lic_no        148299 non-null  object 
 8   link          148442 non-null  object 
 9   address       148442 non-null  object 
 10  menu          148442 non-null  object 
dtypes: float64(3), int64(1), object(7)
memory usage: 12.5+ MB


In [10]:
print("Cleaned shape:", df_cleaned.shape)
print("\nMissing Values:\n", df_cleaned.isnull().sum())

Cleaned shape: (148442, 11)

Missing Values:
 id                0
name              0
city              0
rating            0
rating_count      0
cost              0
cuisine           0
lic_no          143
link              0
address           0
menu              0
dtype: int64


In [11]:
df_cleaned = df_cleaned.drop_duplicates().reset_index(drop=True)
print("After removing duplicates:", df_cleaned.shape)

After removing duplicates: (148442, 11)

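`drop_duplicates` returns a new DataFrame and leaves the caller untouched unless the result is assigned (or `inplace=True` is passed). A small sketch on invented data:

```python
import pandas as pd

toy = pd.DataFrame({"name": ["A", "A", "B"], "city": ["X", "X", "Y"]})

# The result is discarded here, so toy still has all three rows.
toy.drop_duplicates()
print(toy.shape)

# Assigning the result keeps the de-duplicated frame.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped.shape)
```

In this notebook the shape stays at 148442 rows either way, because duplicates were already removed when `df_cleaned` was built.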

In [13]:
df_cleaned.isnull().sum()

id                0
name              0
city              0
rating            0
rating_count      0
cost              0
cuisine           0
lic_no          143
link              0
address           0
menu              0
dtype: int64

In [14]:
# Drop lic_no, the only column that still has missing values, and overwrite the saved file.
df_cleaned = pd.read_csv("cleaned_data.csv")
df_cleaned = df_cleaned.drop(columns=["lic_no"])
df_cleaned.to_csv("cleaned_data.csv", index=False)
