
## Problem Statement:
FoodHub Data Analysis and Prediction
## Business Context:
The number of restaurants in New York is increasing day by day. Lots of students and busy professionals rely on those restaurants due to their hectic lifestyles. Online food delivery service is a great option for them. It provides them with good food from their favorite restaurants. A food aggregator company FoodHub offers access to multiple restaurants through a single smartphone app.

The app allows the restaurants to receive a direct online order from a customer. The app assigns a delivery person from the company to pick up the order after it is confirmed by the restaurant. The delivery person then uses the map to reach the restaurant and waits for the food package. Once the food package is handed over to the delivery person, he/she confirms the pick-up in the app and travels to the customer's location to deliver the food. The delivery person confirms the drop-off in the app after delivering the food package to the customer. The customer can rate the order in the app. The food aggregator earns money by collecting a fixed margin of the delivery order from the restaurants.
## Project Objective:
The food aggregator company has stored the data of the different orders made by the registered customers in their online portal. They want to analyze the data to get a fair idea about the demand of different restaurants which will help them in enhancing their customer experience.

1. Exploratory Data Analysis (EDA):

2. Classification Task (Binary Classification):

3. Regression Task (Predictive Modeling):

4. Model Evaluation:

5. Model Deployment:


#### Importing the necessary libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


#### Loading the dataset

In [None]:
df = pd.read_csv('/content/2-foodhub_order_New.csv')

## 1- Basic Exploration of data
* 1.1 Checking the top 5 rows
* 1.2 Checking the shape of dataset
* 1.3 Checking the info of dataset
* 1.4 Checking the Statistical summary

#### 1.1 Checking the top 5 rows

In [None]:
df.head()

###### Observations:
1. Some orders have no rating; the `rating` column records these as "Not given".
2. The `delivery_time` column contains invalid entries such as "?".

#### 1.2 Checking the shape (no.of rows and columns in dataset)

In [None]:
df.shape

In [None]:
df.columns

###### Result:
There are **1898** rows and **9** columns in dataset

#### 1.3 Checking the info of dataset

In [None]:
df.info()

###### Observations:
1. All 1898 observations are complete.
2. The data types are suitable for each respective variable

In [None]:
df[df['cost_of_the_order']>40]

#### 1.4 Statistical summary

In [None]:
df.describe()

#### Observations:
* 1 - We need to check the record where cost_of_the_order is 121920.
* 2 - There are some missing values in food_preparation_time.

## 2-Exploratory Data Analysis (EDA)
* 2.1 Checking the duplicate rows and fetching them
* 2.2 Checking Null Values (column-wise, percentage-wise in columns, and row-wise)
* 2.3 Outliers Analysis
* 2.4 Univariate Analysis
* 2.5 Bivariate Analysis

#### 2.1 Checking the duplicate rows

In [None]:
# total number of duplicate rows
df.duplicated().sum()

In [None]:
# Fetching duplicate rows
df[df.duplicated()]

### 2.2 Checking Null Values

In [None]:
# Column-wise null values
df.isnull().sum()

In [None]:
# Percentage wise null values in columns
df.isnull().sum()/len(df)*100

In [None]:
# row-wise null values
df.isnull().sum(axis=1).sort_values(ascending=False)

In [None]:
df['rating'].value_counts()

###### Results:

* Out of the 1898 orders, 736 do not have a rating.



### 2.3 Outliers Analysis
* 2.3.1 Visualizing outliers
* 2.3.2 Finding the no. of outliers in each column

In [None]:
# Split the columns into categorical and numerical subsets
cat_variables= df.select_dtypes('object')
num_variables= df.select_dtypes(['int','float'])

##### 2.3.1 Visualizing outliers

In [None]:
for i in cat_variables.columns:
    plt.figure(figsize=(6,3))
    sns.countplot(data=df, x=i)
    plt.show()

In [None]:
for i in num_variables:
    plt.figure(figsize=(7,3))
    sns.boxplot(data=df, x=i)


##### 2.3.2 Finding the no. of outliers in each column

In [None]:
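for i in num_variables:
    # Series.quantile skips NaN; np.quantile returns NaN if any value is missing
    Q1 = df[i].quantile(0.25)
    Q3 = df[i].quantile(0.75)
    IQR = Q3 - Q1
    lower_limit = Q1 - 1.5*IQR
    upper_limit = Q3 + 1.5*IQR
    # Count values beyond either fence, not only the upper one
    n_outliers = ((df[i] < lower_limit) | (df[i] > upper_limit)).sum()
    print("Number of outliers in ", i, ":", n_outliers)
    print('-----------------------')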

### 2.4 Univariate Analysis

##### 2.4.1. Categorical variables visual analysis:

In [None]:
df.head()

In [None]:
for i in cat_variables.columns:
    plt.figure(figsize=(5,2))
    if i== "cuisine_type":
        sns.countplot(data=df, y=i)
    else:
        sns.countplot(data=df, x=i)
    plt.show()

In [None]:
plt.figure(figsize=(5,2))
sns.countplot(data=df, x='delivery_time')
plt.title('Countplot of delivery_time')

In [None]:
for i in num_variables.columns:
    print(df[i].nunique())
    print(df[i].value_counts())
    print('-------------')

In [None]:
for i in cat_variables.columns:
    print(df[i].nunique())
    print(df[i].value_counts())
    print('-------------')

##### 2.4.2. Numerical variables analysis:

In [None]:
for i in num_variables:
    plt.figure(figsize=(6,3))
    sns.histplot(data=df,x=i);

### 2.5 Bi-variate analysis
* During model building, `rating` and `delivery_time` will be the target variables, so the bivariate analysis focuses on these two features.

In [None]:
for i in cat_variables.columns:
    print(i)
    print(df.groupby(i)['cost_of_the_order'].mean())
    print('--------------------')

In [None]:
num_variables=df.select_dtypes(['int','float'])

In [None]:
# pairplot creates its own figure, so a preceding plt.figure() would only add an empty one
sns.pairplot(num_variables, height=2);

In [None]:
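for i in cat_variables.columns:
    if i == 'delivery_time':
        continue  # the target itself is still stored as text here (it contains "?")
    plt.figure(figsize=(10, 5))
    top_15 = df[i].value_counts().nlargest(15).index
    # Convert delivery_time to numbers; "?" entries become NaN and are ignored by the mean
    df_top = df[df[i].isin(top_15)].assign(
        delivery_time=lambda d: pd.to_numeric(d['delivery_time'], errors='coerce'))
    sns.barplot(data=df_top, x=i, y='delivery_time', errorbar=None, order=top_15)

    # No invert_yaxis() call here: calling it would flip the bars upside down
    plt.xticks(rotation=45, ha='right')
    plt.title(f"{i} vs Delivery Time")
    plt.tight_layout()
    plt.show()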


#### 2.5.2. Relationship of 'delivery_time' with other numerical features
* A few plots we can use:
  * Pairplot
  * Jointplot
  * scatterplot

In [None]:
sns.barplot(data=df, x='rating',y='delivery_time');

In [None]:
for i in num_variables.columns:
    plt.figure(figsize=(5,2))
    sns.scatterplot(data=df, y='rating',x=i);

##### Checking the correlation with heatmap to check above observation

In [None]:
corr= num_variables.corr()
plt.figure(figsize=(5,2))
sns.heatmap(corr,annot=True);

#### 2.5.3 Relationship of 'delivery_time' with all other features
* We can use the following plots and tables:
    * Countplot with hue
    * crosstab
    * df.plot.bar(stacked=True)

### 2.5.3.1 Visual analysis of "delivery_time" with other categorical variables

In [None]:
for i in cat_variables.columns:
    plt.figure(figsize=(15,5))
    sns.countplot(data=df, x=i, hue='delivery_time');

# 3- Data Cleaning & Preprocessing
* 1. Dropping duplicate rows
* 2. Replacing wrong entries
* 3. Missing values imputation (SimpleImputer, fillna())
* 4. Handling outliers (IQR, Z-score method)
* 5. Encoding
* 6. Data splitting
* 7. Feature scaling: StandardScaler, MinMaxScaler
* 8. Feature selection: based on correlation, domain knowledge, or model-based methods


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df=df.drop(['restaurant_name','order_id','customer_id'], axis=1)

In [None]:
df.head()

In [None]:
df_copy= df.copy()
cat_variables= df_copy.select_dtypes('object')
num_variables= df_copy.select_dtypes(['int','float'])

#### 3.1 Dropping duplicates

In [None]:
print(" No. of rows before dropping duplicates :", df_copy.shape[0])
df_copy.drop_duplicates(inplace=True)
print(" No. of rows after dropping duplicates :", df_copy.shape[0])

#### 3.2 Replacing wrong entries

##### 3.2.1 rating

In [None]:
# Before Cleaning
print("Unique categories in feature : ",df_copy['rating'].unique())
print("Number of Unique categories in feature : ",df_copy['rating'].nunique())
print("Count of Unique categories in feature : ",df_copy['rating'].value_counts())

In [None]:
df_copy2 = df_copy.copy()


In [None]:
df_copy2['rating'] = df_copy2['rating'].astype(str)
df_copy2['rating'] = df_copy2['rating'].apply(lambda x: 'Given' if x.strip().isdigit() else 'Not given')
# Confirm both classes exist
print(df_copy2['rating'].value_counts())




In [None]:
# After Cleaning (the binary labels were written to df_copy2; df_copy keeps the original ratings)
print("Unique categories in feature : ",df_copy2['rating'].unique())
print("Number of Unique categories in feature : ",df_copy2['rating'].nunique())
print("Count of Unique categories in feature : ",df_copy2['rating'].value_counts())

In [None]:
print(df_copy['rating'].value_counts())
print(df_copy['rating'].dtypes)


##### 3.2.2 delivery_time

In [None]:
# Before Cleaning
print("Unique categories in feature : ",df_copy['delivery_time'].unique())
print("Number of Unique categories in feature : ",df_copy['delivery_time'].nunique())
print("Count of Unique categories in feature : ",df_copy['delivery_time'].value_counts())

In [None]:
# Fetching the records where delivery_time is "?"
df_copy[df_copy['delivery_time']=='?']

In [None]:
df_copy['delivery_time']=df_copy['delivery_time'].replace('?',df_copy['delivery_time'].mode()[0])

In [None]:
# After Cleaning
print("Unique categories in feature : ",df_copy['delivery_time'].unique())
print("Number of Unique categories in feature : ",df_copy['delivery_time'].nunique())
print("Count of Unique categories in feature : ",df_copy['delivery_time'].value_counts())

###### How to replace two identical wrong entries with two separate values

In [None]:
df_copy['delivery_time'].value_counts()

In [None]:
df_copy[df_copy['delivery_time']=='?']

#### 3.3 Missing values Treatment

| Acronym  | Full Form                    | Meaning                                                                        | Bias Introduced | Example                                                                        |
| -------- | ---------------------------- | ------------------------------------------------------------------------------ | --------------- | ------------------------------------------------------------------------------ |
| **MCAR** | Missing Completely At Random | The missingness has **no relation** to any data, observed or missing.          | ❌ No            | A sensor randomly fails and misses temperature readings.                       |
| **MAR**  | Missing At Random            | The missingness is **related to observed data**, not the missing value itself. | ✅ Yes (mild)    | Older respondents are less likely to report income, and age is recorded.       |
| **MNAR** | Missing Not At Random        | The missingness is related to the **missing value itself**.                    | ✅ High          | People with very low income tend to skip the income question.                  |


##### Example
| Name  | Age | Income |
| ----- | --- | ------ |
| Alice | 25  | 50k    |
| Bob   | 30  | NaN    |
| Carol | NaN | 70k    |
| David | 40  | NaN    |


* MCAR: Missing income for Bob and David is due to random system error.
* MAR: Missing income depends on age (older people don't report income), but income itself doesn't influence missingness.
* MNAR: Income is missing because it's very high or very low, and people choose not to report it.

| Type | Can You Impute?                           | Need Advanced Methods?                         |
| ---- | ----------------------------------------- | ---------------------------------------------- |
| MCAR | ✅ Yes (Mean/Median Imputation)            | ❌ No                                           |
| MAR  | ✅ Yes (Advanced Imputers: KNN, Iterative) | ⚠️ Maybe                                       |
| MNAR | ❌ Not reliably                            | ✅ Yes (Model-based or domain knowledge needed) |


##### Null value treatment: general guideline
* Check the datatype of feature:
  * If datatype== Categorical ; replace null values with mode
  * If datatype== Numerical:
    * Check for outliers:
      * If outliers are present; replace null values with median
      * If outliers are NOT present; replace null values with mean
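
The guideline above can be sketched as a small helper. This is an illustration under stated assumptions, not the notebook's own method: `impute_by_guideline` and the toy columns are hypothetical, and "outliers are present" is taken to mean that at least one value lies outside the 1.5×IQR fences.

```python
import numpy as np
import pandas as pd


def impute_by_guideline(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill nulls with the mode (categorical) or the median/mean (numerical)."""
    out = frame.copy()
    for col in out.columns:
        if out[col].isnull().sum() == 0:
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            # Series.quantile ignores NaN, so the fences use the observed values only
            q1, q3 = out[col].quantile([0.25, 0.75])
            iqr = q3 - q1
            has_outliers = ((out[col] < q1 - 1.5 * iqr) | (out[col] > q3 + 1.5 * iqr)).any()
            fill_value = out[col].median() if has_outliers else out[col].mean()
        else:
            fill_value = out[col].mode()[0]
        out[col] = out[col].fillna(fill_value)
    return out


# Hypothetical toy data (not the FoodHub file)
toy = pd.DataFrame({
    "cuisine": ["Thai", "Thai", None, "Korean", "Thai", "Indian"],
    "prep_time": [20.0, 22.0, np.nan, 24.0, 21.0, 23.0],   # no outliers, so the mean is used
    "cost": [10.0, 12.0, 11.0, np.nan, 13.0, 500.0],       # 500 is an outlier, so the median is used
})
filled = impute_by_guideline(toy)
print(filled)
```

The median is preferred when outliers are present because a single extreme value (500 here) pulls the mean far away from the typical order.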

In [None]:
df.isnull().sum()

In [None]:
# Filling null values using fillna- cuisine_type, food_preparation_time

In [None]:
df_copy['cuisine_type'].mode()[0]

In [None]:
df_copy['cuisine_type'] = df_copy['cuisine_type'].fillna(df_copy['cuisine_type'].mode()[0])

In [None]:
df_copy['food_preparation_time'].mean()

In [None]:
df_copy['food_preparation_time'] = df_copy['food_preparation_time'].fillna(df_copy['food_preparation_time'].mean())

In [None]:
df_copy['delivery_time'].mode()[0]

In [None]:
df_copy['delivery_time'] = df_copy['delivery_time'].fillna(df_copy['delivery_time'].mode()[0])

In [None]:
df_copy.isnull().sum()

In [None]:
df_copy.to_csv('df_copy.csv', index=False)

In [None]:
df_copy.delivery_time.value_counts()

In [None]:
sns.pointplot(data=df_copy, x='day_of_the_week',y=df_copy['delivery_time'].astype('int'));

##### Other popular methods

| Method                                 | Description                                              |
| -------------------------------------- | -------------------------------------------------------- |
| `ffill()` / `bfill()`                  | Forward or backward fill values                          |
| Mode/Median Imputation                 | Use most frequent / median value                         |
| KNN Imputer (`KNNImputer`)             | Predict null values using nearest neighbors              |
| Iterative Imputer (`IterativeImputer`) | Uses regression models to predict missing values         |
| Drop missing (`dropna`)                | Drop rows/columns with missing values (when appropriate) |
| Domain-specific value                  | E.g., fill age with 0 only if 0 means "unknown"          |
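
As a rough sketch of two rows of this table, the snippet below compares median imputation with `KNNImputer` on a hypothetical two-column frame (the values are made up, and `n_neighbors=2` is an arbitrary choice for such a tiny sample).

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy data: two clusters of orders (cheap/fast and expensive/slow)
toy = pd.DataFrame({
    "cost": [10.0, 12.0, 11.0, 30.0, 31.0, 29.0],
    "prep_time": [20.0, 21.0, np.nan, 35.0, np.nan, 34.0],
})

# Median imputation: every missing value receives the same column median
median_filled = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(toy), columns=toy.columns
)

# KNN imputation: each missing value is the average of its 2 nearest rows
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(toy), columns=toy.columns
)

print(median_filled["prep_time"].tolist())
print(knn_filled["prep_time"].tolist())
```

The median fill ignores the `cost` column, whereas the KNN fill uses it to borrow values from similar orders, which is why the two missing rows end up with different estimates.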


## 3.4 Outliers Treatment

#### 1. Popular Methods for Outlier Detection & Treatment

| Method                           | Type         | Handles   | Robustness  | When to Use                   |
| -------------------------------- | ------------ | --------- | ----------- | ----------------------------- |
| **IQR (Interquartile Range)**    | Univariate   | Numerical | ✅ Robust    | Simple, small-medium datasets |
| **Z-Score**                      | Univariate   | Numerical | ❌ Sensitive | Normal-like distributions     |
| **Percentile Capping**           | Univariate   | Numerical | ✅ Robust    | Quick wins, business rules    |
| **Isolation Forest**             | Multivariate | Numerical | ✅ Good      | Large, high-dimensional data  |
| **DBSCAN (Clustering)**          | Multivariate | All       | ✅ Moderate  | Cluster-shaped datasets       |
| **Boxplots / Visual Inspection** | Univariate   | Numerical | Manual      | For EDA or small data         |
| **LOF (Local Outlier Factor)**   | Multivariate | All       | ✅ High      | Density-based outliers        |
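
The Z-score row is not demonstrated elsewhere in this notebook, so here is a minimal sketch on made-up order costs. The cut-off of 3 standard deviations is a common convention, not a rule taken from this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical order costs with one obviously wrong entry (400.0)
cost = pd.Series([12.0, 14.5, 16.0, 13.2, 15.8, 11.9, 14.1, 13.7, 15.2, 12.8, 14.9, 400.0])

# Z-score: distance from the mean in units of standard deviation
z = (cost - cost.mean()) / cost.std(ddof=0)

# Flag values more than 3 standard deviations away from the mean
z_outliers = cost[np.abs(z) > 3]
print(z_outliers)
```

Because the mean and standard deviation are themselves inflated by the extreme value, the Z-score is listed as "Sensitive" in the table; on skewed data the IQR fences are usually the safer first check.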


In [None]:
df_copy2= pd.read_csv('df_copy.csv')

In [None]:
for i in df_copy2.select_dtypes(['int','float']).columns:
    plt.figure(figsize=(6,3))
    sns.boxplot(data=df_copy2,x=i);

In [None]:
# A.IQR Method (Interquartile Range)
Q1 = df_copy2['cost_of_the_order'].quantile(0.25)
Q3 = df_copy2['cost_of_the_order'].quantile(0.75)
IQR = Q3 - Q1
# Define bounds
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
median = df_copy2['cost_of_the_order'].median()
lower,upper

In [None]:
# Show the rows flagged as outliers (at or beyond the IQR fences)
df_copy2[(df_copy2['cost_of_the_order'] <= lower) | (df_copy2['cost_of_the_order'] >= upper)]

In [None]:
df_copy2['cost_of_the_order'].quantile(0.99)

In [None]:
df_copy2['cost_of_the_order'].describe()

In [None]:
# Replace values at or beyond either fence with the median computed before any replacement
is_outlier = (df_copy2['cost_of_the_order'] <= lower) | (df_copy2['cost_of_the_order'] >= upper)
df_copy2['cost_of_the_order'] = np.where(is_outlier, median, df_copy2['cost_of_the_order'])

In [None]:
df_copy2[(df_copy2['cost_of_the_order'] <= lower) | (df_copy2['cost_of_the_order'] >= upper)]

In [None]:
df_copy2['cost_of_the_order'] = np.where(df_copy2['cost_of_the_order'] <= lower, median, df_copy2['cost_of_the_order'])


In [None]:
df_copy2['rating'].describe()

In [None]:
sns.boxplot(data=df_copy2, x='rating');

In [None]:
df_copy2['rating'] =np.where(df_copy2['rating']==0,df_copy2['rating'].mode()[0],df_copy2['rating'])

In [None]:
df_copy2.to_csv('df_copy3.csv', index=False)


In [None]:
sns.barplot(data=df_copy2, x='rating',y='delivery_time');


#### C.Percentile Capping (Winsorization)
from scipy.stats.mstats import winsorize

Winsorize at 5th and 95th percentile

df['capped_age'] = winsorize(df['Age'], limits=[0.05, 0.05])
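
The snippet above uses a placeholder `Age` column that does not exist in the FoodHub data. A self-contained sketch on made-up cost values (20 values, so each 5% limit caps exactly one value per tail) could look like this:

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# Hypothetical costs 1..19 plus one extreme entry
cost = pd.Series(np.arange(1.0, 21.0))
cost.iloc[-1] = 1000.0

# Cap the lowest 5% and the highest 5% of values at the nearest remaining value
capped = pd.Series(np.asarray(winsorize(cost.to_numpy(), limits=[0.05, 0.05])))

print(cost.min(), cost.max())
print(capped.min(), capped.max())
```

Unlike dropping rows, winsorization keeps every order in the dataset and only pulls the extremes in to the nearest non-extreme value.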

#### D. Isolation Forest (Multivariate)
from sklearn.ensemble import IsolationForest

clf = IsolationForest(contamination=0.1)

df['outlier'] = clf.fit_predict(df[['Age']])

df_filtered = df[df['outlier'] == 1]

print(df_filtered)
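
The same idea as a runnable sketch on synthetic data (the `Age` column above is only a placeholder). The values, `contamination=0.1`, and `random_state=0` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical data: 49 ordinary costs between 10 and 20, plus one extreme entry
rng = np.random.default_rng(0)
orders = pd.DataFrame(
    {"cost_of_the_order": np.append(rng.uniform(10, 20, size=49), 1000.0)}
)

# fit_predict returns 1 for inliers and -1 for outliers
iso = IsolationForest(contamination=0.1, random_state=0)
orders["outlier"] = iso.fit_predict(orders[["cost_of_the_order"]])

inliers = orders[orders["outlier"] == 1]
print(orders[orders["outlier"] == -1])
```

Note that `contamination=0.1` forces roughly 10% of the rows to be flagged, so some ordinary orders are labelled as outliers too; the value should reflect how many bad records are actually expected.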

#### E.Local Outlier Factor (LOF)
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20)

df['outlier'] = lof.fit_predict(df[['Age']])

df_filtered = df[df['outlier'] == 1]

print(df_filtered)
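
A runnable LOF sketch on synthetic data follows (again, `Age` above is a placeholder column). The data and `n_neighbors=20` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical data: 49 ordinary costs between 10 and 20, plus one extreme entry
rng = np.random.default_rng(0)
orders = pd.DataFrame(
    {"cost_of_the_order": np.append(rng.uniform(10, 20, size=49), 1000.0)}
)

# LOF compares each point's local density with that of its 20 nearest neighbours
lof = LocalOutlierFactor(n_neighbors=20)
orders["outlier"] = lof.fit_predict(orders[["cost_of_the_order"]])  # 1 = inlier, -1 = outlier

# More negative scores mean a stronger outlier
orders["lof_score"] = lof.negative_outlier_factor_
print(orders.sort_values("lof_score").head(3))
```

In its default mode `LocalOutlierFactor` only labels the data it was fitted on, which is why `fit_predict` is used here rather than separate `fit` and `predict` calls.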

##### Comparison Summary
| Method           | Speed     | Interpretability | Works on Multivariate | Scikit-learn Support |
| ---------------- | --------- | ---------------- | --------------------- | -------------------- |
| IQR              | ✅ Fast    | ✅ Easy           | ❌ No                  | ❌ No                 |
| Z-Score          | ✅ Fast    | ✅ Easy           | ❌ No                  | ❌ No                 |
| Winsorization    | ✅ Fast    | ✅ Easy           | ❌ No                  | ❌ No (in `scipy`)    |
| Isolation Forest | ⚠️ Slower | ✅ Moderate       | ✅ Yes                 | ✅ Yes                |
| LOF              | ⚠️ Slower | ⚠️ Hard          | ✅ Yes                 | ✅ Yes                |


## 3.5 Encoding

Encoding is the process of converting categorical variables (text labels or categories) into a numerical format, so they can be used in machine learning models (which require numerical input).

Types of Categorical Variables
* Nominal – No natural order. E.g., Gender, Color, Country
* Ordinal – Has a meaningful order. E.g., Size (Small < Medium < Large), Rating (Low < Medium < High)
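
To make the distinction concrete, here is a small sketch with a hypothetical frame: the ordinal column is mapped to integers in an order we define ourselves, while the nominal column is one-hot encoded so that no order is implied.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data (column names are illustrative only)
toy = pd.DataFrame({
    "size": ["Small", "Large", "Medium", "Small"],      # ordinal
    "cuisine": ["Thai", "Korean", "Thai", "Indian"],    # nominal
})

# Ordinal: pass the category order explicitly, otherwise it would be alphabetical
ordinal = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
toy["size_encoded"] = ordinal.fit_transform(toy[["size"]]).ravel()

# Nominal: one indicator column per category
encoded = pd.get_dummies(toy, columns=["cuisine"], dtype=int)
print(encoded)
```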

In [None]:
df_3= pd.read_csv('df_copy3.csv')

In [None]:
df_copy2['cuisine_type'] = df_copy2['cuisine_type'].astype('category').cat.codes

In [None]:
df.head()

In [None]:
cat_variables= df_3.select_dtypes('object')
num_variables= df_3.select_dtypes(['int','float'])


In [None]:
cat_variables.columns

In [None]:
cat_variables.head(3)

In [None]:
# 2. Nominal encoding
nominal_features=['cuisine_type']
for i in nominal_features:
    print(i,";")
    print(pd.Categorical(cat_variables[i],ordered= False))
    print(pd.Categorical(cat_variables[i],ordered= False).codes)
    print('----------------------------------------')
    cat_variables[i]=pd.Categorical(cat_variables[i],ordered= False).codes

In [None]:
cat_variables.head()

In [None]:
# 3. One Hot encoding
cat_variables_encoded= pd.get_dummies(cat_variables, columns=['rating','day_of_the_week'], dtype=int)
df_encoded=pd.concat([num_variables,cat_variables_encoded], axis=1,)
df_encoded.head()

In [None]:
df_encoded.head()

In [None]:
cat_variables['day_of_the_week'].value_counts()

In [None]:
cat_variables['cuisine_type'].value_counts()

In [None]:
df_encoded.to_csv('df_encoded', index=False)

##### Common encoding techniques:
| Encoding Method      | Type of Data | Pros                  | Cons                             |
| -------------------- | ------------ | --------------------- | -------------------------------- |
| Label Encoding       | Ordinal      | Simple                | Imposes order on nominal data    |
| One-Hot Encoding     | Nominal      | No order imposed      | High dimensionality              |
| Ordinal Encoding     | Ordinal      | Preserves order       | You must define the order        |
| Frequency Encoding   | Nominal      | Simple, compact       | May mislead the model            |
| Target/Mean Encoding | Nominal      | Can boost performance | Risk of overfitting/data leakage |
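
Frequency encoding is not used elsewhere in this notebook, so here is a minimal sketch with made-up cuisine labels. The frequencies are learned from the training portion only, and giving unseen categories a frequency of 0 is an assumption of this sketch.

```python
import pandas as pd

# Hypothetical training and validation labels
train_cuisine = pd.Series(["American", "American", "Japanese", "Italian"])
val_cuisine = pd.Series(["Japanese", "Thai"])  # "Thai" never appears in training

# Share of each category in the training data
freq = train_cuisine.value_counts(normalize=True)

train_encoded = train_cuisine.map(freq)
val_encoded = val_cuisine.map(freq).fillna(0.0)  # unseen category -> 0

print(train_encoded.tolist())
print(val_encoded.tolist())
```

Two different categories with the same frequency (Japanese and Italian here) receive the same code, which is the "may mislead the model" drawback noted in the table.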


#### Difference between `pd.Categorical` and scikit-learn's `LabelEncoder` / `OrdinalEncoder`
| Feature                    | `pd.Categorical`      | `LabelEncoder` / `OrdinalEncoder` |
| -------------------------- | --------------------- | --------------------------------- |
| Built into Pandas          | ✅ Yes                 | ❌ No                              |
| Easy for quick exploration | ✅ Very                | ➖ Slightly more verbose           |
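| Custom category ordering   | ✅ Yes (`categories=`) | `OrdinalEncoder`: ✅ Yes (`categories=`); `LabelEncoder`: ❌ No (sorted order) |
| Part of sklearn pipelines  | ❌ Not natively        | ✅ `OrdinalEncoder` integrates well; `LabelEncoder` is meant for target labels |
| Handles unknown values     | ⚠️ Assigns -1 when the categories are fixed | `OrdinalEncoder`: ✅ via `handle_unknown`; `LabelEncoder`: ❌ raises an error |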
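
As a sketch of the last row, `OrdinalEncoder` can map categories that were not seen during `fit` to a sentinel value through its `handle_unknown` and `unknown_value` parameters. The frames below are hypothetical, and -1 is an arbitrary sentinel.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical train/validation frames
train = pd.DataFrame({"cuisine_type": ["American", "Japanese", "Italian"]})
val = pd.DataFrame({"cuisine_type": ["Japanese", "Thai"]})  # "Thai" is unseen

# Unknown categories are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)
val_codes = encoder.transform(val).ravel()

# pd.Categorical gives the same codes once the training categories are fixed
cat_codes = pd.Categorical(val["cuisine_type"], categories=encoder.categories_[0]).codes

print(val_codes, cat_codes)
```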


## 3.6 Train Test Split (Required when building model for predictions)
#### What is train_test_split?
`train_test_split` is a scikit-learn function that splits a dataset into two parts:

Training set: used to fit the model

Validation set: held out to evaluate the model's performance

#### Why is it required?
We train the model on one portion of the data and evaluate it on data it has not seen, to check how well it generalizes.

Holding out data does not prevent overfitting by itself, but it exposes it (a large gap between training and validation scores) and gives a fairer estimate of model performance.
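
A small sketch of why the held-out set matters, using synthetic data (the feature `x`, the noise level, and the seeds are assumptions): an unrestricted decision tree memorizes the training rows, and only the validation score reveals this.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a linear signal plus noise
rng = np.random.default_rng(0)
X = pd.DataFrame({"x": rng.uniform(0, 10, size=100)})
y = 2 * X["x"] + rng.normal(0, 1, size=100)

# 80% of the rows for training, 20% held out for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=20)

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
train_r2 = tree.score(X_train, y_train)  # R² on rows the tree has seen
val_r2 = tree.score(X_val, y_val)        # R² on unseen rows

print(X_train.shape, X_val.shape)
print(train_r2, val_r2)
```

The same pattern shows up later in this notebook, where the Decision Tree has the best training score and the worst validation score.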

In [None]:
df_TrainTestSplit= pd.read_csv('df_encoded')

In [None]:
df_TrainTestSplit.head()

#### Separating Features and Target

In [None]:
X = df_TrainTestSplit.drop('delivery_time', axis=1)
y = df_TrainTestSplit['delivery_time']


In [None]:
X.head()

In [None]:
y

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y,test_size=0.2,random_state=20)

In [None]:
X_train.head()

## 3.7 Scaling
* Why Feature Scaling is Important:

Many machine learning algorithms (like KNN, SVM, Gradient Descent-based models, Neural Networks) compute distances or rely on the magnitude of features. If one feature has a large range and another has a small range, the model might become biased toward the feature with the larger range.

##### Most common techniques:
![image.png](attachment:e545afc2-0cd6-44c5-a944-9699a82fbb30.png)
![image.png](attachment:78d23bc0-1915-42a6-b681-beae43ee6649.png)
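
Assuming the standard definitions of the two techniques, x' = (x - min) / (max - min) for Min-Max scaling and z = (x - mean) / std for standardization, the formulas can be checked by hand against scikit-learn on a made-up column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature with four values
x = np.array([[10.0], [20.0], [30.0], [40.0]])

# Min-Max scaling: rescale to the [0, 1] range
minmax_manual = (x - x.min()) / (x.max() - x.min())

# Standardization: subtract the mean, divide by the population standard deviation
z_manual = (x - x.mean()) / x.std()

minmax_sklearn = MinMaxScaler().fit_transform(x)
z_sklearn = StandardScaler().fit_transform(x)

print(minmax_manual.ravel())
print(z_manual.ravel())
```

`StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default, so the two results agree.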

In [None]:
X_train.columns

In [None]:
cat_cols=['cuisine_type','rating_3', 'rating_4', 'rating_5', 'rating_Not given','day_of_the_week_Weekday', 'day_of_the_week_Weekend']
num_cols= ['cost_of_the_order', 'food_preparation_time']

In [None]:
df.head()

In [None]:
# 1. Min-Max Scaling
from sklearn.preprocessing import MinMaxScaler
MMscaler = MinMaxScaler()
Xtrain_scaled_MinMax = MMscaler.fit_transform(X_train[num_cols])
Xval_scaled_MinMax = MMscaler.transform(X_val[num_cols])

In [None]:
Xtrain_scaled_MinMax=pd.DataFrame(Xtrain_scaled_MinMax,columns=X_train[num_cols].columns)
Xval_scaled_MinMax=pd.DataFrame(Xval_scaled_MinMax,columns=X_val[num_cols].columns)

In [None]:
Xtrain_scaled_MinMax.head(2)

In [None]:
# 2. z-score Scaling (standardization)
from sklearn.preprocessing import StandardScaler
SSscaler = StandardScaler()
Xtrain_scaled_StandardScalar = SSscaler.fit_transform(X_train[num_cols])
Xval_scaled_StandardScalar = SSscaler.transform(X_val[num_cols])

In [None]:
Xtrain_scaled_StandardScalar=pd.DataFrame(Xtrain_scaled_StandardScalar,columns=X_train[num_cols].columns)
Xval_scaled_StandardScalar=pd.DataFrame(Xval_scaled_StandardScalar,columns=X_val[num_cols].columns)

In [None]:
Xtrain_scaled_StandardScalar.head()

In [None]:
cat_cols = [col for col in cat_cols if col in X_train.columns]


In [None]:
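# Caution: the scaled frames carry a fresh 0..n-1 index while X_train/X_val keep their
# original shuffled index, so this concat misaligns rows and introduces NaNs.
# The reset_index(drop=True) version a few cells below is the one to use.
scaled_train=pd.concat([Xtrain_scaled_StandardScalar,X_train[cat_cols]],axis=1)
scaled_test=pd.concat([Xval_scaled_StandardScalar,X_val[cat_cols]],axis=1)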

In [None]:
round(Xtrain_scaled_StandardScalar.describe(),2)

In [None]:
X_train[num_cols].head(2)

In [None]:
Xtrain_scaled_StandardScalar.head(2)

In [None]:
Xtrain_scaled_StandardScalar.shape

In [None]:
X_train[cat_cols].shape

In [None]:
X_train[cat_cols].reset_index(drop=True)

In [None]:
scaled_train=pd.concat([Xtrain_scaled_StandardScalar,X_train[cat_cols].reset_index(drop=True)],axis=1)
scaled_test=pd.concat([Xval_scaled_StandardScalar,X_val[cat_cols].reset_index(drop=True)],axis=1)

In [None]:
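# index=False avoids writing the row index as an extra "Unnamed: 0" column
scaled_train.to_csv('Scaled_data_train', index=False)
scaled_test.to_csv('Scaled_data_val', index=False)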

#### Summary table of the most popular methods
| Method          | Range                       | Affected by Outliers     | Use Case                            |
| --------------- | --------------------------- | ------------------------ | ----------------------------------- |
| Min-Max Scaling | \[0, 1]                     | ✅ Yes                    | Image data, bounded values          |
| Standardization | Unbounded (mean 0, std 1)   | ⚠️ Yes, less than Min-Max | Linear models, SVM, Neural Networks |
| Robust Scaling  | Unbounded (median 0, IQR 1) | ❌ Largely no             | Data with outliers                  |
| MaxAbs Scaling  | \[-1, 1]                    | ✅ Yes                    | Sparse data (e.g., NLP features)    |


# 4-Model Building (Regression), Evaluation & Tuning

### 4.1 Regression algorithms
* Linear Regression
* KNN
* Decision Trees (CART)
* Random Forest
* Boosting
    * AdaBoost
    * Gradient Boosting
    * XGBoost
### 4.2 Model Evaluation: Regression metrics: R² & RMSE
1. R-squared (R²) — Coefficient of Determination
    * What it means:
        * Measures how well the model explains the variability in the target variable.
        * Usually lies between 0 and 1; it is at most 1 and becomes negative when the model performs worse than always predicting the mean.
    * Interpretation:
        * R² = 1 → perfect prediction
        * R² = 0 → model is no better than the average
        * Higher is better
          ![image.png](attachment:e23679fd-3fe1-4ef2-b9e9-1dfed173e585.png)

2. RMSE — Root Mean Squared Error
    * What it means:
        * Measures average prediction error in the same units as the target variable.
        * It gives more weight to larger errors.
    * Interpretation:
        * Lower is better
        * Easy to interpret because it’s in the same unit as the target variable
          ![image.png](attachment:65110f95-03fd-4549-be04-631970e6b345.png)
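
Both metrics can be computed by hand from their standard definitions, R² = 1 - SS_res / SS_tot and RMSE = sqrt(mean(residual²)), and checked against scikit-learn. The delivery times below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted delivery times (minutes)
y_true = np.array([20.0, 25.0, 30.0, 35.0])
y_pred = np.array([22.0, 24.0, 29.0, 37.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual (same unit as the target)
rmse = np.sqrt(np.mean(residuals ** 2))

# R²: 1 minus the ratio of unexplained variation to total variation around the mean
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
print(r2, r2_score(y_true, y_pred))
```

If `ss_res` exceeds `ss_tot`, the model is worse than predicting the mean and R² turns negative, as happens for some models on the validation set later in this notebook.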
### 4.3 Model Tuning
* GridSearchCV
* Hyperparameter tuning


In [None]:
import numpy as np  # needed for np.sqrt in the metric calculations below
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
# pip install xgboost
from xgboost import XGBRegressor

In [None]:
# loading the encoded data
df_TrainTestSplit= pd.read_csv('df_encoded')

# separating the target variable from the rest of the data
X = df_TrainTestSplit.drop('delivery_time', axis=1)
y = df_TrainTestSplit['delivery_time']

# Splitting the data into train & validation set
X_train, X_val, y_train, y_val = train_test_split(X, y,test_size=0.2,random_state=20)

cat_cols=['cuisine_type','rating_3', 'rating_4', 'rating_5', 'rating_Not given','day_of_the_week_Weekday', 'day_of_the_week_Weekend']
num_cols= ['cost_of_the_order', 'food_preparation_time']

from sklearn.preprocessing import StandardScaler
SSscaler = StandardScaler()
Xtrain_scaled_StandardScalar = SSscaler.fit_transform(X_train[num_cols])
Xval_scaled_StandardScalar = SSscaler.transform(X_val[num_cols])

Xtrain_scaled_StandardScalar=pd.DataFrame(Xtrain_scaled_StandardScalar,columns=X_train[num_cols].columns)
Xval_scaled_StandardScalar=pd.DataFrame(Xval_scaled_StandardScalar,columns=X_val[num_cols].columns)

scaled_train=pd.concat([Xtrain_scaled_StandardScalar,X_train[cat_cols].reset_index(drop=True)],axis=1)
scaled_val=pd.concat([Xval_scaled_StandardScalar,X_val[cat_cols].reset_index(drop=True)],axis=1)

In [None]:
# loading the datasets
df_TrainTestSplit= pd.read_csv('df_encoded')
X = df_TrainTestSplit.drop('delivery_time', axis=1)
y = df_TrainTestSplit['delivery_time']
X_train, X_val, y_train, y_val = train_test_split(X, y,test_size=0.2,random_state=20)
x_trainFinal= pd.read_csv('Scaled_data_train')
x_valFinal= pd.read_csv('Scaled_data_val')

In [None]:
# Building a Linear regression model
LR = LinearRegression()
LR.fit(scaled_train,y_train)

y_train_pred = LR.predict(scaled_train)
y_val_pred = LR.predict(scaled_val)

rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
r2_train = r2_score(y_train, y_train_pred)

rmse_val = np.sqrt(mean_squared_error(y_val, y_val_pred))
r2_val = r2_score(y_val, y_val_pred)

print("Train RMSE",rmse_train,"| Train R2",r2_train)
print('Test RMSE',rmse_val,'| Test R2',r2_val)

Train RMSE 4.21002122072206 | Train R2 0.289213740241521
Test RMSE 4.167355729845208 | Test R2 0.26801189334127273


In [None]:
# Building a knn model
knn = KNeighborsRegressor()
knn.fit(scaled_train,y_train)
y_train_pred = knn.predict(scaled_train)
y_val_pred = knn.predict(scaled_val)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
r2_train = r2_score(y_train, y_train_pred)
rmse_val = np.sqrt(mean_squared_error(y_val, y_val_pred))
r2_val = r2_score(y_val, y_val_pred)
print("Train RMSE",rmse_train,"| Train R2",r2_train)
print('Test RMSE',rmse_val,'| Test R2',r2_val)

Train RMSE 3.8128553096922286 | Train R2 0.4169965779217023
Test RMSE 4.523436048427183 | Test R2 0.13757807993984772


In [None]:
models = {
    "Linear Regression": LinearRegression(),
    "KNN": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(),
    "AdaBoost": AdaBoostRegressor(),
    "Gradient Boosting": GradientBoostingRegressor(),
    "XGBoost": XGBRegressor()
}
results = []
for name, model in models.items():
    model.fit(scaled_train, y_train)

    # Predictions
    y_train_pred = model.predict(scaled_train)
    y_val_pred = model.predict(scaled_val)

    # Metrics
    rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
    r2_train = r2_score(y_train, y_train_pred)

    rmse_val = np.sqrt(mean_squared_error(y_val, y_val_pred))
    r2_val = r2_score(y_val, y_val_pred)

    # Store result
    results.append({
        "Model": name,
        "Train_RMSE": rmse_train,
        "Train_R²": r2_train,
        "Val_RMSE": rmse_val,
        "Val_R²": r2_val
    })

# Convert to DataFrame
results_df = pd.DataFrame(results)

# Show results (one row per model, in the order the models were defined)
results_df

|   | Model             | Train_RMSE | Train_R² | Val_RMSE | Val_R²    |
| - | ----------------- | ---------- | -------- | -------- | --------- |
| 0 | Linear Regression | 4.210021   | 0.289214 | 4.167356 | 0.268012  |
| 1 | KNN               | 3.812855   | 0.416997 | 4.523436 | 0.137578  |
| 2 | Decision Tree     | 0.638657   | 0.983643 | 5.727932 | -0.382861 |
| 3 | Random Forest     | 1.781177   | 0.872771 | 4.545759 | 0.129045  |
| 4 | AdaBoost          | 4.183148   | 0.298259 | 4.176035 | 0.26496   |
| 5 | Gradient Boosting | 3.89456    | 0.391743 | 4.343002 | 0.205008  |
| 6 | XGBoost           | 1.78013    | 0.872921 | 4.966863 | -0.039794 |


In [None]:
knn = KNeighborsRegressor()

In [None]:
from sklearn.model_selection import cross_val_score, KFold
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Store results
results = []

for name, model in models.items():
    # Cross-validation scores (scikit-learn reports RMSE negated, so higher is better)
    neg_rmse_scores = cross_val_score(model, scaled_train, y_train, scoring='neg_root_mean_squared_error', cv=cv)
    r2_scores = cross_val_score(model, scaled_train, y_train, scoring='r2', cv=cv)

    cv_rmse_mean = -np.mean(neg_rmse_scores)
    cv_r2_mean = np.mean(r2_scores)

    # Train model on full training data
    model.fit(scaled_train, y_train)

    # Predict on validation set
    y_val_pred = model.predict(scaled_val)

    val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
    val_r2 = r2_score(y_val, y_val_pred)

    results.append({
        "Model": name,
        "CV_RMSE_(Train)": cv_rmse_mean,
        "Val_RMSE": val_rmse,
        "CV_R²_(Train)": cv_r2_mean,
        "Val_R²": val_r2
    })

# Display final results
results_df = pd.DataFrame(results)
results_df

Unnamed: 0,Model,CV_RMSE_(Train),Val_RMSE,CV_R²_(Train),Val_R²
0,Linear Regression,4.228807,4.167356,0.279375,0.268012
1,KNN,4.646412,4.523436,0.129763,0.137578
2,Decision Tree,6.123984,5.710505,-0.474075,-0.374459
3,Random Forest,4.551052,4.512561,0.163101,0.14172
4,AdaBoost,4.219715,4.131701,0.277189,0.280484
5,Gradient Boosting,4.272118,4.342638,0.264227,0.205141
6,XGBoost,5.0453,4.966863,-0.027037,-0.039794


In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Define the model
ridge_model = Ridge()

# Define hyperparameter grid for Ridge
param_grid = {
    "alpha": [0.001, 0.01, 0.1, 1, 10, 100]  # Regularization strength
}

# GridSearchCV setup
grid_search = GridSearchCV(
    estimator=ridge_model,
    param_grid=param_grid,
    scoring='neg_root_mean_squared_error',
    cv=5,
    n_jobs=-1,
    verbose=1
)

# Fit on training data
grid_search.fit(scaled_train, y_train)

# Best model and parameters
best_ridge = grid_search.best_estimator_
best_params = grid_search.best_params_

# Train performance
y_train_pred = best_ridge.predict(scaled_train)
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
train_r2 = r2_score(y_train, y_train_pred)

# Validation performance
y_val_pred = best_ridge.predict(scaled_val)
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
val_r2 = r2_score(y_val, y_val_pred)

# Print results
print("Best Hyperparameters for Ridge Regression:")
print(best_params)

print("\nPerformance Metrics:")
print(f"Train RMSE: {train_rmse:.4f}")
print(f"Train R²   : {train_r2:.4f}")
print(f"Val RMSE   : {val_rmse:.4f}")
print(f"Val R²     : {val_r2:.4f}")


Fitting 5 folds for each of 6 candidates, totalling 30 fits
Best Hyperparameters for Ridge Regression:
{'alpha': 10}

Performance Metrics:
Train RMSE: 4.2102
Train R²   : 0.2891
Val RMSE   : 4.1656
Val R²     : 0.2686


# *Pending work

In [None]:
# Define model
xgb_model = XGBRegressor(objective='reg:squarederror', random_state=42)

# Define parameter grid
param_grid_xgb = {
    'n_estimators': [50, 100],
    'learning_rate': [0.05, 0.1, 0.2],
    'max_depth': [3, 5, 7]
}

# Grid search
grid_xgb = GridSearchCV(
    estimator=xgb_model,
    param_grid=param_grid_xgb,
    scoring='neg_root_mean_squared_error',
    cv=5,
    n_jobs=-1,
    verbose=1
)

# Fit model
grid_xgb.fit(scaled_train, y_train)

# Best model and params
best_xgb = grid_xgb.best_estimator_
best_xgb_params = grid_xgb.best_params_

# Train evaluation
y_train_pred = best_xgb.predict(scaled_train)
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
train_r2 = r2_score(y_train, y_train_pred)

# Validation evaluation
y_val_pred = best_xgb.predict(scaled_val)
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
val_r2 = r2_score(y_val, y_val_pred)

# Output
print(" Best Hyperparameters for XGBoost:")
print(best_xgb_params)

print("\n XGBoost Performance:")
print(f"Train RMSE: {train_rmse:.4f}")
print(f"Train R²   : {train_r2:.4f}")
print(f"Val RMSE   : {val_rmse:.4f}")
print(f"Val R²     : {val_r2:.4f}")


Fitting 5 folds for each of 18 candidates, totalling 90 fits
 Best Hyperparameters for XGBoost:
{'learning_rate': 0.05, 'max_depth': 3, 'n_estimators': 50}

 XGBoost Performance:
Train RMSE: 4.1315
Train R²   : 0.3155
Val RMSE   : 4.1552
Val R²     : 0.2723


# 5-Model Building (Classification), Evaluation & Tuning
* Data splitting
* Classification algorithms
    * Logistic Regression
    * Naive Bayes, KNN
    * Decision Trees (CART)
    * Random Forest
    * Boosting - AdaBoost, Gradient Boosting, XGBoost
* Model Evaluation
    * Classification metrics: Accuracy, Precision, Recall, F1-score, Confusion matrix, ROC Curve, AUC
* Model Tuning
    * GridSearchCV
    * RandomizedSearchCV
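
As a reference for the classification metrics listed above, here is a minimal plain-Python sketch of how accuracy, precision, recall and F1 follow from the four confusion-matrix counts in the binary case. The labels are made up and the helper `binary_metrics` is hypothetical. scikit-learn's `accuracy_score`, `precision_score`, `recall_score` and `f1_score` compute the same quantities for binary labels.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix counts and the metrics derived from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels, purely for illustration
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With these labels the counts are TP=3, FP=2, FN=1, TN=2, so precision (0.6) and recall (0.75) diverge even though accuracy is a single number (0.625).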

In [None]:
df_3= pd.read_csv('df_copy3.csv')

In [None]:
df.head()

Unnamed: 0,cuisine_type,cost_of_the_order,day_of_the_week,rating,food_preparation_time,delivery_time
0,Korean,30.75,Weekend,Not given,25.0,20
1,Japanese,12.08,Weekend,Not given,25.0,?
2,Mexican,12.23,Weekday,5,23.0,28
3,American,29.2,Weekend,3,25.0,15
4,American,11.59,Weekday,4,25.0,24


In [None]:
cat_variables= df_3.select_dtypes('object')
num_variables= df_3.select_dtypes(['int','float'])

In [None]:
# 2. Nominal encoding
nominal_features=['cuisine_type']
for i in nominal_features:
    print(i,";")
    print(pd.Categorical(cat_variables[i],ordered= False))
    print(pd.Categorical(cat_variables[i],ordered= False).codes)
    print('----------------------------------------')
    cat_variables[i]=pd.Categorical(cat_variables[i],ordered= False).codes

cuisine_type ;
['Korean', 'Japanese', 'Mexican', 'American', 'American', ..., 'Mexican', 'American', 'Japanese', 'Mediterranean', 'Japanese']
Length: 1898
Categories (14, object): ['American', 'Chinese', 'French', 'Indian', ..., 'Southern', 'Spanish', 'Thai',
                          'Vietnamese']
[6 5 8 ... 5 7 5]
----------------------------------------
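
The integer codes printed above are positions in the alphabetically sorted list of distinct categories, which is how `pd.Categorical` orders string categories when none are given explicitly. A minimal plain-Python sketch of that mapping, applied only to the five cuisine values visible at the start of the output (the helper `category_codes` is hypothetical):

```python
def category_codes(values):
    """Map each value to its index in the sorted list of distinct values."""
    categories = sorted(set(values))
    lookup = {category: code for code, category in enumerate(categories)}
    return categories, [lookup[value] for value in values]

# First five cuisine values shown in the output above
sample = ['Korean', 'Japanese', 'Mexican', 'American', 'American']

categories, codes = category_codes(sample)
print(categories)  # ['American', 'Japanese', 'Korean', 'Mexican']
print(codes)       # [2, 1, 3, 0, 0]
```

The codes here differ from the notebook's (6, 5, 8, ...) because only four categories are present in the sample instead of fourteen. In both cases the numbers are arbitrary labels: 'Mexican' is not "greater than" 'Korean', so models that treat the column as numeric may read an ordering that does not exist.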


In [None]:
# 3. One Hot encoding
cat_variables_encoded= pd.get_dummies(cat_variables, columns=['rating','day_of_the_week'], dtype=int)
df_encoded=pd.concat([num_variables,cat_variables_encoded], axis=1,)
df_encoded.head()

In [None]:
from sklearn.metrics import classification_report,confusion_matrix, accuracy_score, precision_score,recall_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier
from xgboost import XGBClassifier

In [None]:
# Loading the previously cleaned, outlier-treated dataset
df_3= pd.read_csv('df_copy3.csv')

cat_variables= df_3.select_dtypes('object')
num_variables= df_3.select_dtypes(['int','float'])


# 2. Nominal encoding
nominal_features=['cuisine_type']
for i in nominal_features:
    cat_variables[i]=pd.Categorical(cat_variables[i],ordered= False).codes

# 3. One-hot encoding
cat_variables_encoded= pd.get_dummies(cat_variables, columns=['rating','day_of_the_week'], dtype=int)

# Concatenating the encoded part and the numerical variables
df_encoded=pd.concat([num_variables,cat_variables_encoded], axis=1)
df_encoded.to_csv('df_encoded_classification', index=False)

In [None]:
# Convert rating to an integer column, treating 'Not given' as 0
df_copy['rating'] = df_copy['rating'].astype(str).str.strip()
df_copy['rating'] = df_copy['rating'].replace('Not given', '0')
df_copy['rating'] = df_copy['rating'].astype(int)


In [None]:
df_TrainTestSplit= pd.read_csv('df_encoded_classification')
df_TrainTestSplit.head()

Unnamed: 0,cost_of_the_order,food_preparation_time,delivery_time,cuisine_type,rating_3,rating_4,rating_5,rating_Not given,day_of_the_week_Weekday,day_of_the_week_Weekend
0,30.75,25.0,20,6,0,0,0,1,0,1
1,12.08,25.0,24,5,0,0,0,1,0,1
2,12.23,23.0,28,8,0,0,1,0,1,0
3,29.2,25.0,15,0,1,0,0,0,0,1
4,11.59,25.0,24,0,0,1,0,0,1,0


In [None]:
# Loading the previously cleaned, outlier-treated dataset
df_3= pd.read_csv('df_copy3.csv')

cat_variables= df_3.select_dtypes('object')
num_variables= df_3.select_dtypes(['int','float'])


# 2. Nominal encoding
nominal_features=['cuisine_type']
for i in nominal_features:
    cat_variables[i]=pd.Categorical(cat_variables[i],ordered= False).codes

# 3. One-hot encoding (rating is left as-is because it is the classification target)
cat_variables_encoded= pd.get_dummies(cat_variables, columns=['day_of_the_week'], dtype=int)

# Concatenating the encoded part and the numerical variables
df_encoded=pd.concat([num_variables,cat_variables_encoded], axis=1)
df_encoded.to_csv('df1_encoded_classification', index=False)

In [None]:
# Convert rating to an integer column, treating 'Not given' as 0
df_copy['rating'] = df_copy['rating'].astype(str).str.strip()
df_copy['rating'] = df_copy['rating'].replace('Not given', '0')
df_copy['rating'] = df_copy['rating'].astype(int)

In [None]:
df_TrainTestSplit= pd.read_csv('df1_encoded_classification')
df_TrainTestSplit.head()

Unnamed: 0,cost_of_the_order,food_preparation_time,delivery_time,cuisine_type,rating,day_of_the_week_Weekday,day_of_the_week_Weekend
0,30.75,25.0,20,6,Not given,0,1
1,12.08,25.0,24,5,Not given,0,1
2,12.23,23.0,28,8,5,1,0
3,29.2,25.0,15,0,3,0,1
4,11.59,25.0,24,0,4,1,0


In [None]:
df_TrainTestSplit= pd.read_csv('df1_encoded_classification')

In [None]:
# loading the encoded data
df_TrainTestSplit= pd.read_csv('df1_encoded_classification')

# Separating the target variable from the rest of the data
X = df_TrainTestSplit.drop(['rating'], axis=1)
y = df_TrainTestSplit['rating']

# Splitting the data into train & validation sets
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y,test_size=0.2,random_state=20)

cat_cols=['cuisine_type','day_of_the_week_Weekday', 'day_of_the_week_Weekend']
num_cols= ['cost_of_the_order', 'food_preparation_time','delivery_time']

from sklearn.preprocessing import StandardScaler
SSscaler = StandardScaler()
Xtrain_scaled_StandardScalar = SSscaler.fit_transform(X_train[num_cols])
Xval_scaled_StandardScalar = SSscaler.transform(X_val[num_cols])

Xtrain_scaled_StandardScalar=pd.DataFrame(Xtrain_scaled_StandardScalar,columns=X_train[num_cols].columns)
Xval_scaled_StandardScalar=pd.DataFrame(Xval_scaled_StandardScalar,columns=X_val[num_cols].columns)

scaled_train=pd.concat([Xtrain_scaled_StandardScalar,X_train[cat_cols].reset_index(drop=True)],axis=1)
scaled_val=pd.concat([Xval_scaled_StandardScalar,X_val[cat_cols].reset_index(drop=True)],axis=1)
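
The cell above fits the scaler on the training split only and reuses those statistics for the validation split. A minimal plain-Python sketch of that idea, with made-up cost values and hypothetical helper names (`fit_standardizer`, `standardize`). It uses the population standard deviation (division by n), which is what `StandardScaler` uses.

```python
import math

def fit_standardizer(values):
    """Return the mean and population standard deviation of the training values."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

def standardize(values, mean, std):
    """Apply (value - mean) / std using statistics learned elsewhere."""
    return [(v - mean) / std for v in values]

# Made-up order costs, purely for illustration
train_cost = [10.0, 20.0, 30.0]
val_cost = [20.0, 40.0]

mean, std = fit_standardizer(train_cost)                # learned from train only
scaled_train_cost = standardize(train_cost, mean, std)
scaled_val_cost = standardize(val_cost, mean, std)      # reuses train statistics

print(scaled_train_cost)
print(scaled_val_cost)
```

The scaled training values are centred on zero by construction, while the validation values need not be. That is expected: refitting on the validation split would leak information about it into preprocessing.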

In [None]:
gb= GradientBoostingClassifier()
gb.fit(scaled_train,y_train)
pred_train= gb.predict(scaled_train)
pred_val= gb.predict(scaled_val)
print("Classification report for Train Dataset")
print(classification_report(y_train, pred_train))
print("-------------------------------------------------")
print("Classification report for Validation Dataset")
print(classification_report(y_val, pred_val))

Classification report for Train Dataset
              precision    recall  f1-score   support

           3       1.00      0.22      0.36       152
           4       0.82      0.25      0.38       328
           5       0.58      0.55      0.57       461
   Not given       0.52      0.86      0.65       577

    accuracy                           0.57      1518
   macro avg       0.73      0.47      0.49      1518
weighted avg       0.65      0.57      0.54      1518

-------------------------------------------------
Classification report for Validation Dataset
              precision    recall  f1-score   support

           3       0.00      0.00      0.00        36
           4       0.11      0.03      0.05        58
           5       0.25      0.24      0.25       127
   Not given       0.38      0.55      0.45       159

    accuracy                           0.32       380
   macro avg       0.19      0.21      0.19       380
weighted avg       0.26      0.32      0.28       380
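
The `macro avg` and `weighted avg` rows can be reproduced from the per-class rows: the macro average is the plain mean over classes, while the weighted average weights each class by its support. A minimal sketch using the precision and support figures copied from the second report above (the one computed on `y_val`). Because the inputs are already rounded to two decimals, the results match the report only up to rounding.

```python
# Per-class precision and support, copied from the second report above
precision = {'3': 0.00, '4': 0.11, '5': 0.25, 'Not given': 0.38}
support = {'3': 36, '4': 58, '5': 127, 'Not given': 159}

# Macro average: every class counts equally
macro_avg = sum(precision.values()) / len(precision)

# Weighted average: each class counts in proportion to its number of samples
total_support = sum(support.values())
weighted_avg = sum(precision[c] * support[c] for c in precision) / total_support

print(total_support)           # 380
print(f"{macro_avg:.3f}")      # 0.185
print(f"{weighted_avg:.3f}")   # 0.259
```

The weighted average is higher than the macro average here because the largest class ('Not given', 159 samples) also has the highest precision, while the smallest class ('3', 36 samples) contributes zero.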

