# Data Preprocessing Pipeline

Data preprocessing is a critical step in data science: it transforms raw data into a clean, organized, and structured format suitable for analysis. A data preprocessing pipeline automates this series of steps, so that data professionals can preprocess diverse datasets efficiently and consistently.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

In [2]:
def Data_Preprocessing_Pipeline(data):
    # Split the columns into numeric and categorical (object-typed) features.
    numeric_features = data.select_dtypes(include=['float', 'int']).columns
    categorical_features = data.select_dtypes(include=['object']).columns

    # Fill missing numeric values with each column's mean.
    data[numeric_features] = data[numeric_features].fillna(data[numeric_features].mean())

    # Replace outliers, detected with the 1.5 * IQR rule, by the column mean.
    for feature in numeric_features:
        Q1 = data[feature].quantile(0.25)
        Q3 = data[feature].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - (1.5 * IQR)
        upper_bound = Q3 + (1.5 * IQR)
        data[feature] = np.where((data[feature] < lower_bound) | (data[feature] > upper_bound),
                                 data[feature].mean(), data[feature])

    # Standardize numeric features: fit_transform learns each column's mean and
    # standard deviation and applies the scaling in a single call.
    scaler = StandardScaler()
    data[numeric_features] = scaler.fit_transform(data[numeric_features])

    # Fill missing categorical values with each column's most frequent value (mode).
    data[categorical_features] = data[categorical_features].fillna(data[categorical_features].mode().iloc[0])

    return data
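
To make the outlier step concrete, here is a minimal sketch of the same 1.5 * IQR rule on a small, made-up series. The values and the variable names (`values`, `is_outlier`, `cleaned`) are assumptions for illustration only and are not taken from the Zomato data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; 100.0 is the obvious outlier.
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Flag values outside the bounds and replace them with the column mean,
# mirroring what the pipeline does for each numeric feature.
is_outlier = (values < lower_bound) | (values > upper_bound)
cleaned = pd.Series(np.where(is_outlier, values.mean(), values))

print(lower_bound, upper_bound)
print(cleaned.tolist())
```

One thing this toy case makes visible: the mean is computed with the outlier still included, so the replacement value can itself lie outside the bounds. Depending on the data, replacing with the median or clipping to the bounds may be worth considering instead.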

In [3]:
data = pd.read_csv("zomato.csv", encoding='latin-1')

In [4]:
data.head()

Unnamed: 0,Restaurant ID,Restaurant Name,Country Code,City,Address,Locality,Locality Verbose,Longitude,Latitude,Cuisines,...,Currency,Has Table booking,Has Online delivery,Is delivering now,Switch to order menu,Price range,Aggregate rating,Rating color,Rating text,Votes
0,6317637,Le Petit Souffle,162,Makati City,"Third Floor, Century City Mall, Kalayaan Avenu...","Century City Mall, Poblacion, Makati City","Century City Mall, Poblacion, Makati City, Mak...",121.027535,14.565443,"French, Japanese, Desserts",...,Botswana Pula(P),Yes,No,No,No,3,4.8,Dark Green,Excellent,314
1,6304287,Izakaya Kikufuji,162,Makati City,"Little Tokyo, 2277 Chino Roces Avenue, Legaspi...","Little Tokyo, Legaspi Village, Makati City","Little Tokyo, Legaspi Village, Makati City, Ma...",121.014101,14.553708,Japanese,...,Botswana Pula(P),Yes,No,No,No,3,4.5,Dark Green,Excellent,591
2,6300002,Heat - Edsa Shangri-La,162,Mandaluyong City,"Edsa Shangri-La, 1 Garden Way, Ortigas, Mandal...","Edsa Shangri-La, Ortigas, Mandaluyong City","Edsa Shangri-La, Ortigas, Mandaluyong City, Ma...",121.056831,14.581404,"Seafood, Asian, Filipino, Indian",...,Botswana Pula(P),Yes,No,No,No,4,4.4,Green,Very Good,270
3,6318506,Ooma,162,Mandaluyong City,"Third Floor, Mega Fashion Hall, SM Megamall, O...","SM Megamall, Ortigas, Mandaluyong City","SM Megamall, Ortigas, Mandaluyong City, Mandal...",121.056475,14.585318,"Japanese, Sushi",...,Botswana Pula(P),No,No,No,No,4,4.9,Dark Green,Excellent,365
4,6314302,Sambo Kojin,162,Mandaluyong City,"Third Floor, Mega Atrium, SM Megamall, Ortigas...","SM Megamall, Ortigas, Mandaluyong City","SM Megamall, Ortigas, Mandaluyong City, Mandal...",121.057508,14.58445,"Japanese, Korean",...,Botswana Pula(P),Yes,No,No,No,4,4.8,Dark Green,Excellent,229


In [5]:
data.isna().sum()  # The Cuisines column has 9 NaN values; every other column has none.

Restaurant ID           0
Restaurant Name         0
Country Code            0
City                    0
Address                 0
Locality                0
Locality Verbose        0
Longitude               0
Latitude                0
Cuisines                9
Average Cost for two    0
Currency                0
Has Table booking       0
Has Online delivery     0
Is delivering now       0
Switch to order menu    0
Price range             0
Aggregate rating        0
Rating color            0
Rating text             0
Votes                   0
dtype: int64
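
Since the pipeline fills numeric and categorical columns differently, it can help to see which columns have gaps together with their dtypes. The following is a small sketch on a hypothetical toy frame (`toy` and `report` are illustrative names, not part of the notebook); the same two lines could be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "Cuisines": ["Japanese", np.nan, "French"],
    "Votes": [314, 591, 270],
})

missing = toy.isna().sum()
missing = missing[missing > 0]  # keep only the columns that have gaps

# Pair each incomplete column with its dtype to see which imputation applies.
report = pd.DataFrame({
    "missing": missing,
    "dtype": toy.dtypes[missing.index].astype(str),
})
print(report)
```

In the output above only Cuisines has missing values, and it is a text column, so it would be handled by the mode-based fill rather than the mean-based one.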

In [6]:
# Run the data preprocessing pipeline
cleaned_data = Data_Preprocessing_Pipeline(data)
print("Processed Data")
print(cleaned_data)

Processed Data
      Restaurant ID           Restaurant Name  Country Code              City  \
0         -0.310940          Le Petit Souffle      3.102262       Makati City   
1         -0.312458          Izakaya Kikufuji      3.102262       Makati City   
2         -0.312946    Heat - Edsa Shangri-La      3.102262  Mandaluyong City   
3         -0.310841                      Ooma      3.102262  Mandaluyong City   
4         -0.311319               Sambo Kojin      3.102262  Mandaluyong City   
...             ...                       ...           ...               ...   
9546      -0.356658              NamlÛ± Gurme      3.102262         ÛÁstanbul   
9547      -0.357452             Ceviz AÛôacÛ±      3.102262         ÛÁstanbul   
9548      -0.356649                     Huqqa      3.102262         ÛÁstanbul   
9549      -0.356614              Aôôk Kahve      3.102262         ÛÁstanbul   
9550      -0.355330  Walter's Coffee Roastery      3.102262         ÛÁstanbul   

            

In [7]:
cleaned_data.head()

Unnamed: 0,Restaurant ID,Restaurant Name,Country Code,City,Address,Locality,Locality Verbose,Longitude,Latitude,Cuisines,...,Currency,Has Table booking,Has Online delivery,Is delivering now,Switch to order menu,Price range,Aggregate rating,Rating color,Rating text,Votes
0,-0.31094,Le Petit Souffle,3.102262,Makati City,"Third Floor, Century City Mall, Kalayaan Avenu...","Century City Mall, Poblacion, Makati City","Century City Mall, Poblacion, Makati City, Mak...",-1.972128,-1.950434,"French, Japanese, Desserts",...,Botswana Pula(P),Yes,No,No,No,1.869044,2.62823,Dark Green,Excellent,3.25123
1,-0.312458,Izakaya Kikufuji,3.102262,Makati City,"Little Tokyo, 2277 Chino Roces Avenue, Legaspi...","Little Tokyo, Legaspi Village, Makati City","Little Tokyo, Legaspi Village, Makati City, Ma...",-1.972128,-1.950434,Japanese,...,Botswana Pula(P),Yes,No,No,No,1.869044,2.114224,Dark Green,Excellent,1.184898
2,-0.312946,Heat - Edsa Shangri-La,3.102262,Mandaluyong City,"Edsa Shangri-La, 1 Garden Way, Ortigas, Mandal...","Edsa Shangri-La, Ortigas, Mandaluyong City","Edsa Shangri-La, Ortigas, Mandaluyong City, Ma...",-1.972128,-1.950434,"Seafood, Asian, Filipino, Indian",...,Botswana Pula(P),Yes,No,No,No,0.189293,1.942889,Green,Very Good,2.672463
3,-0.310841,Ooma,3.102262,Mandaluyong City,"Third Floor, Mega Fashion Hall, SM Megamall, O...","SM Megamall, Ortigas, Mandaluyong City","SM Megamall, Ortigas, Mandaluyong City, Mandal...",-1.972128,-1.950434,"Japanese, Sushi",...,Botswana Pula(P),No,No,No,No,0.189293,2.799565,Dark Green,Excellent,1.184898
4,-0.311319,Sambo Kojin,3.102262,Mandaluyong City,"Third Floor, Mega Atrium, SM Megamall, Ortigas...","SM Megamall, Ortigas, Mandaluyong City","SM Megamall, Ortigas, Mandaluyong City, Mandal...",-1.972128,-1.950434,"Japanese, Korean",...,Botswana Pula(P),Yes,No,No,No,0.189293,2.62823,Dark Green,Excellent,2.133158
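
The numeric columns above no longer hold their original values because `StandardScaler` subtracts each column's mean and divides by its population standard deviation. Here is a minimal sketch of that arithmetic, using the five vote counts from the first `head()` output as a toy column. Scaling only these five values will not reproduce the numbers in the table, which were computed over the full dataset after outlier replacement.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy column: the five vote counts shown in the original head() output.
votes = pd.DataFrame({"Votes": [314.0, 591.0, 270.0, 365.0, 229.0]})

# Scale with scikit-learn.
scaled = StandardScaler().fit_transform(votes)

# The same computation by hand: (x - mean) / population std (ddof=0).
manual = (votes["Votes"] - votes["Votes"].mean()) / votes["Votes"].std(ddof=0)

print(scaled.ravel())
print(manual.to_numpy())
```

After scaling, the column has mean 0 and standard deviation 1, which is a quick sanity check that can also be run on the numeric columns of `cleaned_data`.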


In [8]:
cleaned_data.isna().sum()  # The pipeline has run successfully: no NaN values remain in the processed dataset.

Restaurant ID           0
Restaurant Name         0
Country Code            0
City                    0
Address                 0
Locality                0
Locality Verbose        0
Longitude               0
Latitude                0
Cuisines                0
Average Cost for two    0
Currency                0
Has Table booking       0
Has Online delivery     0
Is delivering now       0
Switch to order menu    0
Price range             0
Aggregate rating        0
Rating color            0
Rating text             0
Votes                   0
dtype: int64
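
One caveat: the pipeline assigns into the DataFrame it receives, so `cleaned_data` and `data` refer to the same object and the raw values are overwritten. The sketch below shows the effect with a hypothetical stand-in function (`fill_numeric_mean` is not part of the notebook) and one possible way around it, passing a copy.

```python
import numpy as np
import pandas as pd

def fill_numeric_mean(frame):
    """Hypothetical stand-in for the pipeline: assigns into the frame it receives."""
    numeric = frame.select_dtypes(include="number").columns
    frame[numeric] = frame[numeric].fillna(frame[numeric].mean())
    return frame

# Passing the frame directly: the same object comes back, already modified.
raw = pd.DataFrame({"rating": [4.8, np.nan, 4.4]})
cleaned_in_place = fill_numeric_mean(raw)
same_object = cleaned_in_place is raw
raw_has_nan_after = bool(raw["rating"].isna().any())

# Passing a copy: the original frame keeps its missing value.
raw2 = pd.DataFrame({"rating": [4.8, np.nan, 4.4]})
cleaned_copy = fill_numeric_mean(raw2.copy())
raw2_has_nan_after = bool(raw2["rating"].isna().any())

print(same_object, raw_has_nan_after, raw2_has_nan_after)
```

Applied to this notebook, that would mean calling `Data_Preprocessing_Pipeline(data.copy())` if the unprocessed `data` is still needed afterwards.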