# **NYC Restaurants**


### Loading necessary packages

In [None]:
# Importing the required libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

# Set pandas display options
pd.set_option('display.max_columns', None)  # Display all columns
pd.set_option('display.max_rows', None)  # Display all rows
pd.set_option('display.width', 1000)  # Adjust the width of the display

# Set seaborn style first: sns.set() overwrites several matplotlib rcParams
# (font size, grid, face color), so the overrides below must come after it
sns.set(style="whitegrid")

# Configure matplotlib settings
plt.rcParams['figure.figsize'] = (10, 6)  # Set default figure size
plt.rcParams['axes.grid'] = True  # Enable grid by default
plt.rcParams['font.size'] = 14  # Set default font size
plt.rcParams['figure.facecolor'] = '#00000000'  # Transparent figure background

# Optional: Set numpy display options
np.set_printoptions(threshold=np.inf)  # Display entire numpy arrays

**Problem Statement:**

The dataset provides information on customer orders from various restaurants, including details like the cost of the order, food preparation time, delivery time, and customer ratings. The objective is to predict a key outcome based on these features, such as:

1. **Predicting the total cost of the order** based on factors such as cuisine type, day of the week, food preparation time, and delivery time.
2. **Predicting delivery time** given the food preparation time, day of the week, restaurant type, and customer ratings.
3. **Predicting customer ratings** based on the order’s characteristics, including cost, food preparation time, delivery time, and the type of restaurant or cuisine.

**Goal:**  
By understanding the relationships between these features, we aim to create a model that can help optimize restaurant operations, improve delivery times, predict customer satisfaction (ratings), or even estimate total order costs in real-time. This would be particularly useful for decision-making, improving customer experience, and enhancing restaurant efficiency.

**Key challenges:**  
- Categorical data (restaurant names, cuisine types, and days of the week) needs to be appropriately encoded for use in machine learning models.
- Handling missing values, such as "Not given" in the ratings column, is essential.
- Scaling and normalizing numerical data such as cost, food preparation time, and delivery time are necessary to improve model performance. 

The model can help predict key business metrics or identify factors that significantly influence customer satisfaction or operational efficiency.
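
The three challenges listed above can be sketched on a small toy frame. This is only an illustration under stated assumptions: the rows in `toy` are made up, and the choice of one-hot encoding and z-score standardization is one reasonable option, not necessarily the approach used later in this notebook.

```python
import pandas as pd

# Toy rows mimicking the dataset's columns (values are invented for illustration)
toy = pd.DataFrame({
    "cuisine_type": ["Korean", "Japanese", "American", "Japanese"],
    "day_of_the_week": ["Weekend", "Weekday", "Weekend", "Weekend"],
    "cost_of_the_order": [30.75, 12.08, 12.23, 29.20],
    "food_preparation_time": [25, 25, 23, 25],
    "delivery_time": [20, 23, 28, 15],
    "rating": ["Not given", "5", "4", "Not given"],
})

# 1. Treat the "Not given" placeholder as a missing value (it becomes NaN)
toy["rating"] = pd.to_numeric(toy["rating"], errors="coerce")

# 2. One-hot encode the categorical columns
encoded = pd.get_dummies(toy, columns=["cuisine_type", "day_of_the_week"])

# 3. Standardize the numeric columns to zero mean and unit variance
numeric_cols = ["cost_of_the_order", "food_preparation_time", "delivery_time"]
encoded[numeric_cols] = (
    encoded[numeric_cols] - encoded[numeric_cols].mean()
) / encoded[numeric_cols].std()

print(encoded.head())
```

For a real model, the scaling statistics would normally be computed on the training split only and then applied to the test split.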

## **Loading the Dataset**

In [None]:
# Read the dataset from the CSV file
df = pd.read_csv("food_order.csv")

# Display the first 5 rows of the dataframe to get an overview of the data
df.head()

**Dataset Description:**

The dataset consists of **1898 records** related to customer orders from various restaurants. Each record includes the following attributes:

- **order_id**: A unique identifier for each order.
- **customer_id**: A unique identifier for the customer who placed the order.
- **restaurant_name**: The name of the restaurant where the order was placed.
- **cuisine_type**: The type of cuisine offered by the restaurant (e.g., Korean, Japanese, American, Mexican).
- **cost_of_the_order**: The total monetary value of the order (float).
- **day_of_the_week**: The day on which the order was placed (e.g., Weekend, Weekday).
- **rating**: The rating given by the customer, though some ratings are marked as "Not given."
- **food_preparation_time**: The time (in minutes) it took to prepare the food.
- **delivery_time**: The time (in minutes) it took to deliver the order to the customer.

This dataset provides a comprehensive view of customer interactions with various restaurants and includes both numerical and categorical data. The data is clean, with no missing values for the core features, though the **rating** field includes instances where the rating was not provided ("Not given"). This dataset can be used for tasks such as predictive modeling, analysis of customer satisfaction, operational optimization, and understanding factors affecting delivery times and costs.

## **Exploratory Data Analysis (EDA)**

In [None]:
# List the column names in this dataset
df.columns

In [None]:
# Display the shape of the dataframe to understand its dimensions
print("\nShape of the dataset (number of rows and columns):")
print(df.shape)

**There are 1898 rows and 9 columns in this dataset.**

In [None]:
# To find the duplicate values in the dataset
print(f'Number of duplicate rows: {df.duplicated().sum()}')

**Note**: No Duplicate Rows Detected

During the data preprocessing stage, it was confirmed that the dataset contains no duplicate rows. This ensures the integrity and uniqueness of the data entries, providing a robust foundation for the machine learning model's training and evaluation processes. Having a dataset free from duplicates helps in maintaining the accuracy of the model’s predictions and avoids potential biases or redundancies.
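
`df.duplicated()` only flags rows that match in every column. As a complementary check, a sketch like the following looks for repeated keys instead. The toy frame and its values are hypothetical; `order_id` is used as the key because the dataset description calls it a unique identifier.

```python
import pandas as pd

# Toy frame: order_id 2 appears twice, but with different costs
toy = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "cost_of_the_order": [10.0, 12.5, 13.0, 8.0],
})

# Full-row duplicates: none, because the two order_id == 2 rows differ in cost
full_row_dupes = int(toy.duplicated().sum())

# Key duplicates: one repeated order_id
id_dupes = int(toy.duplicated(subset="order_id").sum())

# Keep the first occurrence of each order_id
deduped = toy.drop_duplicates(subset="order_id", keep="first")

print(full_row_dupes, id_dupes, len(deduped))
```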

In [None]:
# Check for missing values
print("Missing values:\n", df.isna().sum())

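**Note**: Missing Values

`isna()` reports no null values in any column. The **rating** column, however, uses the string "Not given" as a placeholder, which `isna()` does not count as missing.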
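
Because a null check does not see string placeholders, one option is to declare the placeholder when reading the file. The sketch below uses a small in-memory sample instead of `food_order.csv`; the two rows are invented, and only the `na_values` parameter of `pd.read_csv` is the point of the example.

```python
import io
import pandas as pd

# Hypothetical two-row sample with a subset of the dataset's columns
sample_csv = io.StringIO(
    "order_id,rating,delivery_time\n"
    "1,Not given,20\n"
    "2,5,23\n"
)

# na_values tells pandas to parse the placeholder string as NaN
sample = pd.read_csv(sample_csv, na_values=["Not given"])

missing_per_column = sample.isna().sum()
print(missing_per_column)
```

With the placeholder parsed as NaN, the rating column is read as a numeric column and the missing ratings show up in `isna().sum()`.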

In [None]:
df.info()

In [None]:
df.describe().T

## **Observations**:

### Shape of the Dataset:
- **Number of Rows (Data points)**: 1898
- **Number of Columns (Features)**: 9

### Missing Values:
The dataset has no null values in any column. However, the **rating** column contains instances of "Not given," which represent missing data. These can be handled by imputing them, by creating a binary feature that records the missing status, or by dropping the affected rows; later in this notebook the rows are dropped before the rating analysis.
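
As a minimal sketch of the indicator and imputation options, assuming median imputation (the notebook does not commit to a specific strategy) and using made-up ratings:

```python
import pandas as pd

# Toy ratings column in the same string format as the dataset
ratings = pd.DataFrame({"rating": ["5", "Not given", "3", "4", "Not given"]})

# Binary flag recording whether the customer left no rating
ratings["rating_missing"] = (ratings["rating"] == "Not given").astype(int)

# Convert to numeric; "Not given" becomes NaN
ratings["rating_num"] = pd.to_numeric(ratings["rating"], errors="coerce")

# Impute missing ratings with the median of the observed ratings
median_rating = ratings["rating_num"].median()
ratings["rating_imputed"] = ratings["rating_num"].fillna(median_rating)

print(ratings)
```

Keeping the `rating_missing` flag alongside the imputed value lets a model distinguish real ratings from filled-in ones.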

### Target Variable:
- The target variable could be **rating** if we are predicting customer satisfaction or **cost_of_the_order**, **food_preparation_time**, or **delivery_time** if the focus is on predicting operational or financial outcomes. Depending on the task, the appropriate target variable will be chosen for model training. 

## **Feature Engineering**

In [None]:
df['total_order_time'] = df['food_preparation_time'] + df['delivery_time']
df.head()

In [None]:
print(f'Number of unique orders {df.order_id.nunique()}')

In [None]:
n_customer = df.customer_id.nunique()
print(f'Number of unique customers {n_customer}')

In [None]:
n_cuisine_type= df.cuisine_type.nunique()
print(f'Number of unique type of Cuisine {n_cuisine_type}')

In [None]:
print(f"Unique rating values: {df['rating'].unique()}")

In [None]:
(df['rating'] == 'Not given').sum()

In [None]:
df.head()

In [None]:
# Filter the rows where rating is 'Not given'
not_given_df = df[df['rating'] == 'Not given']

# Display the filtered rows
not_given_df.sample(10)


## **Visualize the data** 

In [None]:
# Set plot style
sns.set(style="whitegrid")

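# Plot the distribution of ratings. The column still holds strings (including "Not given"),
# so a count plot is used instead of a numeric histogram with a KDE.
plt.figure(figsize=(10, 6))
sns.countplot(x=df['rating'], order=sorted(df['rating'].unique()))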
plt.title('Distribution of Restaurant Ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.show()

# Plot the distribution of total order time
plt.figure(figsize=(10, 6))
sns.histplot(df['total_order_time'], bins=20, kde=True)
plt.title('Distribution of Total Order Time')
plt.xlabel('Total Order Time (Minutes)')
plt.ylabel('Frequency')
plt.show()


In [None]:
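# Boxplot for ratings by cuisine type ("Not given" is coerced to NaN and left out)
plt.figure(figsize=(10, 6))
sns.boxplot(x='cuisine_type', y='rating',
            data=df.assign(rating=pd.to_numeric(df['rating'], errors='coerce')))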
plt.title('Ratings by Cuisine Type')
plt.xlabel('Cuisine Type')
plt.ylabel('Rating')
plt.xticks(rotation=45)
plt.show()

# Boxplot for total order time by cuisine type
plt.figure(figsize=(10, 6))
sns.boxplot(x='cuisine_type', y='total_order_time', data=df)
plt.title('Total Order Time by Cuisine Type')
plt.xlabel('Cuisine Type')
plt.ylabel('Total Order Time (Minutes)')
plt.xticks(rotation=45)
plt.show()


In [None]:
# Add the number of orders placed at each restaurant as a per-row feature
df['order_count'] = df.groupby('restaurant_name')['order_id'].transform('count')

# Now check the dataframe to ensure `order_count` is added
print(df[['restaurant_name', 'order_count']].head())


In [None]:
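# Correlation heatmap. The rating column is still a string column here, so it is
# converted to numeric first ("Not given" becomes NaN and is excluded pairwise).
corr = df[['rating', 'total_order_time', 'order_count']].assign(
    rating=lambda d: pd.to_numeric(d['rating'], errors='coerce')
).corr()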
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Between Key Variables')
plt.show()


In [None]:
# Extract unique total order times
total_orders = not_given_df['total_order_time'].unique()

# Create a new figure
plt.figure(figsize=(10, 6))

# Plotting the unique total order times
plt.plot(total_orders, marker='o')

# Adding labels and title
plt.xlabel('Index')
plt.ylabel('Total Order Time')
plt.title('Unique Total Order Times for Ratings Not Given')

# Show the plot
plt.grid(True)
plt.show()


In [None]:
# Sum of the distinct cost values only (not the total spend, which is computed in the next cell)
not_given_df['cost_of_the_order'].unique().sum()

In [None]:
not_given_df['cost_of_the_order'].sum()

In [None]:
# Extract cost of the orders
costs = not_given_df['cost_of_the_order'].unique()

# Create a new figure
plt.figure(figsize=(10, 6))

# Plotting the cost of the orders
plt.plot(costs, marker='o', linestyle='-', color='b')

# Adding labels and title
plt.xlabel('Index')
plt.ylabel('Cost of the Order')
plt.title('Cost of the Orders for Ratings Not Given')

# Show the plot
plt.grid(True)
plt.show()

In [None]:
#delete the rows that have 'not given' in the rating column
df.drop(df.index[df['rating']=='Not given'], inplace= True)

In [None]:
df.head(10)

In [None]:
print(f'Number of different cuisine type {df.cuisine_type.nunique()}')

In [None]:
n_restaurants = df.restaurant_name.nunique()
print(f'Number of unique restaurants {n_restaurants}')

In [None]:
# Count the occurrences of each cuisine type for weekdays and weekends
cuisine_counts = df.groupby(['cuisine_type', 'day_of_the_week']).size().unstack(fill_value=0)

# Create a bar plot
cuisine_counts.plot(kind='bar', figsize=(12, 6), width=0.8)

# Adding labels and title
plt.xlabel('Cuisine Type')
plt.ylabel('Number of Orders')
plt.title('Cuisine Sold on Weekends vs Weekdays')
plt.legend(title='Day of the Week')
plt.grid(True)

# Show the plot
plt.show()

In [None]:
df['cuisine_type'].value_counts()

In [None]:
fig = px.pie(values=df['cuisine_type'].value_counts().values, names=df['cuisine_type'].value_counts().index, title="Cuisine Type vs Order Count")
fig.show()

## **Observations**:

In examining the data for cuisine preferences on both weekends and weekdays, it is evident that **American cuisine** is the most popular choice, with **368** orders accounting for approximately **31.7%** of the total. Following closely behind, **Japanese cuisine** is preferred by around **23.5%** of the customers, with **273** orders. **Italian cuisine** also holds a significant share, with **172** orders representing about **14.8%** of the total. **Chinese cuisine** has **133** orders, making up about **11.4%** of the total orders. Other cuisines, including **Indian, Mexican, Middle Eastern, Mediterranean, Southern, French, Thai, Korean, Spanish, and Vietnamese**, are ordered less frequently and constitute the remaining percentage of orders.
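
Order counts and percentage shares like those quoted above can be derived in one step with `value_counts`. The helper name `cuisine_share` and the toy series are hypothetical, chosen only to show the calculation:

```python
import pandas as pd

def cuisine_share(cuisines: pd.Series) -> pd.DataFrame:
    """Return order counts and percentage share for each cuisine type."""
    counts = cuisines.value_counts()
    share = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"orders": counts, "share_pct": share})

# Toy data: 10 orders across three cuisines
toy_cuisines = pd.Series(["American"] * 5 + ["Japanese"] * 3 + ["Italian"] * 2)

summary = cuisine_share(toy_cuisines)
print(summary)
```

On the notebook's data, the equivalent call would be `cuisine_share(df['cuisine_type'])`.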

In [None]:
# Create box plot
fig = px.box(df, y="cost_of_the_order", x="cuisine_type")
fig.show()

In [None]:
# Group by cuisine type and day of the week, and sum the cost of the orders
grouped_df = df.groupby(['cuisine_type', 'day_of_the_week'])['cost_of_the_order'].sum().unstack(fill_value=0)

# Create a grouped bar plot
grouped_df.plot(kind='bar', figsize=(12, 6))

# Adding labels and title
plt.xlabel('Cuisine Type')
plt.ylabel('Total Cost of the Orders')
plt.title('Total Cost Spent on Each Cuisine Type by Day of the Week')
plt.legend(title='Day of the Week')
plt.grid(True)

# Show the plot
plt.show()

In [None]:
# Create box plot
fig = px.box(df, y="cost_of_the_order", x="cuisine_type", color="day_of_the_week", 
             title="Cost of the Order by Cuisine Type on Weekdays and Weekends")
fig.show()

In [None]:
# Create box plot for food preparation time vs cuisine type
fig = px.box(df, y="food_preparation_time", x="cuisine_type", 
             title="Distribution of Food Preparation Time by Cuisine Type")
fig.show()

In [None]:
# Convert the rating column to numeric type
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')

# Calculate the average rating for each cuisine type
average_ratings = df.groupby('cuisine_type')['rating'].mean()

# Create a line plot for average ratings by cuisine type
plt.figure(figsize=(12, 6))
average_ratings.plot(kind='line', marker='o', linestyle='-', color='skyblue')

# Adding labels and title
plt.xlabel('Cuisine Type')
plt.ylabel('Average Rating')
plt.title('Average Rating by Cuisine Type')
plt.grid(True)
plt.show()

In [None]:
fig = px.box(df, y="delivery_time", x="day_of_the_week", title='Delivery Time - Day of Week')
fig.show()

In [None]:
df.head()

In [None]:
# Create box plot for food preparation time by restaurant name
fig = px.box(df, y="food_preparation_time", x="restaurant_name", 
             title="Food Preparation Time by Restaurant Name")
fig.show()

In [None]:
# Calculate the average food preparation time for each restaurant
avg_prep_time = df.groupby('restaurant_name')['food_preparation_time'].mean()

# Create a line plot
plt.figure(figsize=(12, 6))
avg_prep_time.plot(kind='line', marker='o', linestyle='-', color='skyblue')

# Adding labels and title
plt.xlabel('Restaurant Name')
plt.ylabel('Average Food Preparation Time (minutes)')
plt.title('Average Food Preparation Time by Restaurant')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
df.head()

In [None]:
# Calculate the average delivery time for each restaurant
avg_delivery_time = df.groupby('restaurant_name')['delivery_time'].mean().reset_index()

# Sort the restaurants by average delivery time
sorted_delivery_time = avg_delivery_time.sort_values(by='delivery_time')

# Select the 15 restaurants with the least delivery time
least_delivery_time = sorted_delivery_time.head(15)

# Select the 15 restaurants with the most delivery time
most_delivery_time = sorted_delivery_time.tail(15)

# Plot the per-restaurant averages directly; plotting the raw order rows would stack
# every order into one bar and show summed rather than average delivery times
fig_least = px.bar(least_delivery_time, y="restaurant_name", x="delivery_time",
                   title="15 Restaurants with the Lowest Average Delivery Time", orientation="h")
fig_most = px.bar(most_delivery_time, y="restaurant_name", x="delivery_time",
                  title="15 Restaurants with the Highest Average Delivery Time", orientation="h")

# Show the plots
fig_least.show()
fig_most.show()

In [None]:
# Calculate the average cost of the order for each restaurant
avg_cost = df.groupby('restaurant_name')['cost_of_the_order'].mean().reset_index()

# Create a bar plot for average cost by restaurant name
fig = px.bar(avg_cost, y="restaurant_name", x="cost_of_the_order", 
             title="Average Cost of Cuisine by Restaurant", orientation="h")

# Update layout for better readability
fig.update_layout(
    height=1200, 
    width=1400, 
    margin=dict(l=300, r=50, b=100, t=100), 
    xaxis_title="Average Cost of the Order",
    yaxis_title="Restaurant Name"
)

fig.show()

In [None]:
# Create box plot for cost of the order based on the day of the week
fig = px.box(df, y="cost_of_the_order", x="day_of_the_week", 
             title='Amount Spent on Orders by Day of the Week')
fig.show()

In [None]:
# Bar plot for day_of_the_week variable
sns.countplot(data=df, x='day_of_the_week');
plt.show()

In [None]:
# Create box plot for delivery time by restaurant name and day of the week
fig = px.box(df, y="restaurant_name", x="delivery_time", color="day_of_the_week", 
             title="Delivery Time by Restaurant and Day of the Week", orientation="h")

fig.show()

In [None]:
# Filter data for weekends
weekend_df = df[df['day_of_the_week'] == 'Weekend']

# Create a bar plot for weekends
fig_weekend = px.bar(weekend_df, y="restaurant_name", x="delivery_time", 
                     title="Delivery Time by Restaurant (Weekend)", orientation="h", color_discrete_sequence=['red'])

# Increase the figure size and adjust margins
fig_weekend.update_layout(
    height=1200,
    width=1400,
    margin=dict(l=300, r=50, b=100, t=100),
    xaxis_title="Delivery Time (minutes)",
    yaxis_title="Restaurant Name"
)

fig_weekend.show()

In [None]:
# Filter data for weekdays
weekday_df = df[df['day_of_the_week'] == 'Weekday']

# Create a bar plot for weekdays
fig_weekday = px.bar(weekday_df, y="restaurant_name", x="delivery_time", 
                     title="Delivery Time by Restaurant (Weekday)", orientation="h", color_discrete_sequence=['blue'])

# Increase the figure size and adjust margins
fig_weekday.update_layout(
    height=1200,
    width=1400,
    margin=dict(l=300, r=50, b=100, t=100),
    xaxis_title="Delivery Time (minutes)",
    yaxis_title="Restaurant Name"
)

fig_weekday.show()

In [None]:
# Count the number of orders for each restaurant
order_counts = df['restaurant_name'].value_counts().reset_index()
order_counts.columns = ['restaurant_name', 'number_of_orders']

# Create a bar plot for the number of orders made from each restaurant
fig = px.bar(order_counts, x='restaurant_name', y='number_of_orders', 
             title='Number of Orders Made from Each Restaurant', 
             labels={'restaurant_name':'Restaurant Name', 'number_of_orders':'Number of Orders'})

# Increase the figure size and adjust margins
fig.update_layout(
    height=800,
    width=1000,
    margin=dict(l=50, r=50, b=100, t=100)
)

fig.show()

In [None]:
# Count the number of orders for each cuisine type
cuisine_counts = df['cuisine_type'].value_counts().reset_index()
cuisine_counts.columns = ['cuisine_type', 'number_of_orders']

# Create a bar plot for the number of orders made for each cuisine type
fig = px.bar(cuisine_counts, x='cuisine_type', y='number_of_orders', 
             title='Number of Orders by Cuisine Type', 
             labels={'cuisine_type':'Cuisine Type', 'number_of_orders':'Number of Orders'})

# Increase the figure size and adjust margins
fig.update_layout(
    height=800,
    width=1000,
    margin=dict(l=50, r=50, b=100, t=100)
)

fig.show()

In [None]:
# Scatter plot of rating vs. restaurant_name
fig = px.scatter(df, x='restaurant_name', y='rating', 
                 title='Customer Ratings', 
                 labels={'restaurant_name': 'Restaurant Name', 'rating': 'Rating'})

# Adjust figure size and margins
fig.update_layout(
    height=800,
    width=1000,
    margin=dict(l=50, r=50, b=100, t=100)
)

# Show the plot
fig.show()

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px

# Creating the subplots layout
fig = make_subplots(
    rows=2, cols=1, shared_xaxes=True, subplot_titles=['Restaurant Rating', 'Restaurant Rating Counts']
)

# Add Box plot for ratings by restaurant name
fig.add_trace(
    px.box(df[df['restaurant_name'].isin(df.head(10)['restaurant_name'])],
           y="rating", 
           x="restaurant_name")['data'][0],
    row=1, col=1
)

# Add bar plot of the number of ratings per restaurant
rating_counts = df.groupby('restaurant_name')['rating'].count().reset_index(name='rating_count')
fig.add_trace(
    go.Bar(x=rating_counts['restaurant_name'], y=rating_counts['rating_count']),
    row=2, col=1
)

# Update layout for better appearance
fig.update_layout(
    height=800,
    width=1000,
    margin=dict(l=50, r=50, b=100, t=100),
    showlegend=False
)

# Show the plot
fig.show()


In [None]:
# Aggregate data to count the number of orders per restaurant
agg_data = df.groupby('restaurant_name').size().reset_index(name='number_of_orders')

# Create a bar plot for the number of orders per restaurant
import plotly.express as px

fig = px.bar(agg_data, 
             x='restaurant_name', 
             y='number_of_orders', 
             title='Number of Orders by Restaurant', 
             labels={'restaurant_name': 'Restaurant Name', 'number_of_orders': 'Number of Orders'})

fig.show()


In [None]:
import plotly.express as px

# Create a histogram of ratings
fig = px.histogram(df, x="rating", nbins=50, title='Distribution of Restaurant Ratings')

# Show the plot
fig.show()


In [None]:
fig = px.histogram(df, x="cost_of_the_order", nbins=50, title='Distribution of Order Costs')

# Show the plot
fig.show()


In [None]:
# Calculate the average rating for each restaurant
avg_ratings = df.groupby('restaurant_name')['rating'].mean().reset_index()

# Sort the restaurants by average rating in descending order
sorted_ratings = avg_ratings.sort_values(by='rating', ascending=False)

# Select the top 100 restaurants with the highest average rating
top_100_ratings = sorted_ratings.head(100)

# Create a bar plot for the top 100 restaurants with the highest average rating
fig = px.bar(top_100_ratings, x='restaurant_name', y='rating', 
             title='Top 100 Restaurants with Highest Average Rating', 
             labels={'restaurant_name':'Restaurant Name', 'rating':'Average Rating'})

# Increase the figure size and adjust margins
fig.update_layout(
    height=800,
    width=1000,
    margin=dict(l=50, r=50, b=100, t=100)
)

fig.show()

In [None]:
import plotly.express as px

# Aggregate by restaurant_name and sum the order costs
agg_df = df.groupby('restaurant_name')['cost_of_the_order'].sum().reset_index()

# Create a bar plot for aggregated cost_by_restaurant
fig = px.bar(agg_df, x='restaurant_name', y='cost_of_the_order', 
             title='Total Cost Spent at Each Restaurant', 
             labels={'restaurant_name': 'Restaurant Name', 'cost_of_the_order': 'Total Cost of Orders'},
             color='cost_of_the_order',  # Adding color for better visual appeal
             color_continuous_scale=px.colors.sequential.Viridis)  # Using a color palette

# Adjusting the layout
fig.update_layout(
    height=800,
    width=1000,
    margin=dict(l=50, r=50, b=100, t=100),
    xaxis_title='Restaurant Name',
    yaxis_title='Total Cost of Orders',
    xaxis=dict(
        tickangle=-45,  # Rotating the x-axis labels for better readability
        tickmode='array',  # Use an array of specific tick labels
        showticklabels=True  # Display labels more cleanly
    ),
    yaxis=dict(tickmode='linear'),
    plot_bgcolor='rgba(0,0,0,0)',  # Making the background transparent
    paper_bgcolor='rgba(0,0,0,0)',  # Making the background transparent
    showlegend=False  # Hiding the legend if not necessary
)

# Adding light gridlines
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='LightGray')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='LightGray')

# Optionally remove range selector for a cleaner view
fig.update_xaxes(rangeslider=dict(visible=False))

fig.show()


In [None]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Create subplots with 2 rows and 1 column
fig = make_subplots(rows=2, cols=1, subplot_titles=['Total Cost of Orders', 'Number of Orders'])

# Add the total cost bar plot
fig.add_trace(go.Bar(x=agg_df['restaurant_name'], y=agg_df['cost_of_the_order'], name='Total Cost'), row=1, col=1)

# Add the number of orders bar plot
num_orders = df.groupby('restaurant_name').size().reset_index(name='num_orders')
fig.add_trace(go.Bar(x=num_orders['restaurant_name'], y=num_orders['num_orders'], name='Number of Orders'), row=2, col=1)

# Adjust layout to increase the size of the plot
fig.update_layout(
    title='Restaurant Order Insights',
    height=800,  # Increase the height
    width=1000,  # Increase the width
    showlegend=False  # Optionally, hide legend if not necessary
)

fig.show()


In [None]:
# Create the scatter plot
plt.figure(figsize=(10, 6))
sns.scatterplot(
    data=df,
    x="total_order_time",
    y="rating",
    hue="cuisine_type",
    style="cuisine_type",
    s=100
)

# Add labels and title
plt.title("Ratings vs. Total Order Time by Cuisine Type", fontsize=14)
plt.xlabel("Total Order Time (minutes)", fontsize=12)
plt.ylabel("Rating", fontsize=12)
plt.legend(title="Cuisine Type", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.show()

In [None]:
# Calculate average ratings by cuisine type
avg_ratings = df.groupby("cuisine_type")["rating"].mean().sort_values(ascending=False)

# Create the bar chart
plt.figure(figsize=(10, 6))
avg_ratings.plot(kind="bar", color="skyblue", edgecolor="black")

# Add labels and title
plt.title("Average Ratings by Cuisine Type", fontsize=14)
plt.xlabel("Cuisine Type", fontsize=12)
plt.ylabel("Average Rating", fontsize=12)
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()

# Display the plot
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x="cuisine_type", palette="pastel", order=df['cuisine_type'].value_counts().index)
plt.title("Cuisine Type Distribution", fontsize=14)
plt.xlabel("Cuisine Type", fontsize=12)
plt.ylabel("Frequency", fontsize=12)
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x="cuisine_type", y="total_order_time", palette="pastel")
plt.title("Total Order Time by Cuisine Type", fontsize=14)
plt.xlabel("Cuisine Type", fontsize=12)
plt.ylabel("Total Order Time (minutes)", fontsize=12)
plt.xticks(rotation=45)
plt.show()

In [None]:
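# Note: nlargest/nsmallest work on individual orders, so these are the five
# highest- and lowest-rated orders (ties broken by row order), not restaurant averages
top_restaurants = df.nlargest(5, "rating")
least_restaurants = df.nsmallest(5, "rating")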

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(data=top_restaurants, x="restaurant_name", y="rating", palette="Blues_d", ci=None)
plt.title("Top 5 Highest-Rated Restaurants", fontsize=14)
plt.xlabel("Restaurant Name", fontsize=12)
plt.ylabel("Rating", fontsize=12)
plt.xticks(rotation=45)
plt.show()

In [None]:
# Least-rated restaurants
plt.figure(figsize=(10, 6))
sns.barplot(data=least_restaurants, x="restaurant_name", y="rating", palette="Reds_d", ci=None)
plt.title("Bottom 5 Least-Rated Restaurants", fontsize=14)
plt.xlabel("Restaurant Name", fontsize=12)
plt.ylabel("Rating", fontsize=12)
plt.xticks(rotation=45)
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x="rating", kde=True, bins=10, color="purple", alpha=0.7)
plt.title("Rating Distribution", fontsize=14)
plt.xlabel("Rating", fontsize=12)
plt.ylabel("Frequency", fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

In [None]:
# Heatmap: average total order time by cuisine type (a mean per cuisine, not a correlation)
heatmap_data = df.pivot_table(values="total_order_time", index="cuisine_type", aggfunc="mean")
plt.figure(figsize=(10, 6))
sns.heatmap(heatmap_data, annot=True, cmap="YlGnBu", cbar=True)
plt.title("Average Total Order Time by Cuisine Type", fontsize=14)
plt.xlabel("")
plt.ylabel("Cuisine Type", fontsize=12)
plt.show()

In [None]:
df.columns

In [None]:
# Count the number of orders for each restaurant
order_counts = df.groupby(["restaurant_name", "cuisine_type"]).size().reset_index(name="order_count")

# Sort by order count
sorted_orders = order_counts.sort_values(by="order_count", ascending=False)

# Top 15 restaurants
top_15 = sorted_orders.head(15)

# Bottom 15 restaurants
bottom_15 = sorted_orders.tail(15)

In [None]:
# Plot the top 15 restaurants
plt.figure(figsize=(12, 6))
plt.barh(top_15["restaurant_name"], top_15["order_count"], color="skyblue", edgecolor="black")
plt.title("Top 15 Restaurants by Number of Orders", fontsize=14)
plt.xlabel("Number of Orders", fontsize=12)
plt.ylabel("Restaurant Name", fontsize=12)
plt.gca().invert_yaxis()  # Invert y-axis to display the highest at the top
plt.grid(axis='x', linestyle="--", alpha=0.7)
plt.tight_layout()
plt.show()

In [None]:
# Plot the bottom 15 restaurants
plt.figure(figsize=(12, 6))
plt.barh(bottom_15["restaurant_name"], bottom_15["order_count"], color="salmon", edgecolor="black")
plt.title("Bottom 15 Restaurants by Number of Orders", fontsize=14)
plt.xlabel("Number of Orders", fontsize=12)
plt.ylabel("Restaurant Name", fontsize=12)
plt.gca().invert_yaxis()  # Invert y-axis to display the highest at the top
plt.grid(axis='x', linestyle="--", alpha=0.7)
plt.tight_layout()
plt.show()

In [None]:
import matplotlib.pyplot as plt

# Calculate the count of orders for each restaurant
order_count = df['restaurant_name'].value_counts()

# Limit to top N restaurants (e.g., top 10)
top_n = 10
order_count_top_n = order_count.head(top_n)

# Plot the result as a bar chart
plt.figure(figsize=(10, 6))
order_count_top_n.plot(kind='bar', color='skyblue')

# Add titles and labels
plt.title(f'Top {top_n} Restaurants by Order Count')
plt.xlabel('Restaurant Name')
plt.ylabel('Order Count')

# Rotate the x-axis labels for better readability
plt.xticks(rotation=45, ha='right')

# Display the plot
plt.tight_layout()
plt.show()


In [None]:
# Calculate the count of orders for each restaurant
order_count = df['restaurant_name'].value_counts()

# Filter out restaurants with fewer than 5 orders (or set your own threshold)
filtered_order_count = order_count[order_count >= 5]

# Plot the result as a bar chart
plt.figure(figsize=(10, 6))
filtered_order_count.plot(kind='bar', color='skyblue')

# Add titles and labels
plt.title('Restaurants with 5 or more Orders')
plt.xlabel('Restaurant Name')
plt.ylabel('Order Count')

# Rotate the x-axis labels for better readability
plt.xticks(rotation=45, ha='right')

# Display the plot
plt.tight_layout()
plt.show()


In [None]:
# Group by restaurant_name and get aggregate statistics
result = df.groupby('restaurant_name').agg(
    order_count=('order_id', 'count'),  # Counting orders per restaurant
    average_rating=('rating', 'mean'),  # Average rating
    average_order_time=('total_order_time', 'mean')  # Average order time
).reset_index()

# Display result
print(result.head())


In [None]:
# Sort by order count and display the top 10 restaurants
top_ordered_restaurants = result.sort_values('order_count', ascending=False).head(10)
print("Top 10 Ordered Restaurants:")
print(top_ordered_restaurants)

# Sort by rating and display the top 10 highest-rated restaurants
top_rated_restaurants = result.sort_values('average_rating', ascending=False).head(10)
print("Top 10 Rated Restaurants:")
print(top_rated_restaurants)


In [None]:
# Top 10 ordered restaurants
plt.figure(figsize=(10, 6))
sns.barplot(x='order_count', y='restaurant_name', data=top_ordered_restaurants)
plt.title('Top 10 Ordered Restaurants')
plt.xlabel('Order Count')
plt.ylabel('Restaurant Name')
plt.show()

# Top 10 rated restaurants
plt.figure(figsize=(10, 6))
sns.barplot(x='average_rating', y='restaurant_name', data=top_rated_restaurants)
plt.title('Top 10 Rated Restaurants')
plt.xlabel('Average Rating')
plt.ylabel('Restaurant Name')
plt.show()


In [None]:
# Top 10 Ordered Restaurants Plot
top_ordered = result.nlargest(10, 'order_count')
plt.figure(figsize=(10, 6))
sns.barplot(x='order_count', y='restaurant_name', data=top_ordered, palette='viridis')
plt.title('Top 10 Ordered Restaurants')
plt.xlabel('Order Count')
plt.ylabel('Restaurant Name')
plt.show()

# Top 10 Rated Restaurants Plot
top_rated = result.nlargest(10, 'average_rating')
plt.figure(figsize=(10, 6))
sns.barplot(x='average_rating', y='restaurant_name', data=top_rated, palette='coolwarm')
plt.title('Top 10 Rated Restaurants')
plt.xlabel('Average Rating')
plt.ylabel('Restaurant Name')
plt.show()



In [None]:
def encourage_ratings(order):
    # After the numeric conversion above, a missing rating is NaN rather than None
    if pd.isna(order['rating']):
        print(f"Order {order['order_id']} - Encourage the customer to leave a rating by offering a discount on their next order.")
    else:
        print(f"Order {order['order_id']} - Rating received: {order['rating']}")

# Apply this function to each order
for index, order in df.iterrows():
    encourage_ratings(order)
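
The row-by-row loop above prints one line per order, which gets slow and noisy on a large frame. As a minimal sketch, the same check can be done with a boolean mask. The frame `sample_orders` is a hypothetical stand-in for `df` and assumes missing ratings are stored as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; in the notebook this would be `df`
sample_orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "rating": [5.0, np.nan, 3.0, np.nan],
})

# Boolean mask: True where the rating is missing (NaN)
unrated_mask = sample_orders["rating"].isna()

# Order IDs that could be targeted with a "please rate" incentive
unrated_ids = sample_orders.loc[unrated_mask, "order_id"].tolist()

# Fraction of orders that have no rating
share_unrated = unrated_mask.mean()

print(f"Unrated orders: {unrated_ids}")
print(f"Share of orders without a rating: {share_unrated:.0%}")
```

The mask also makes it easy to count unrated orders per restaurant with a `groupby`, if that is of interest.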

In [None]:
def calculate_charges(order_value):
    if 5 <= order_value < 15:
        return order_value * 0.15
    elif 15 <= order_value < 25:
        return order_value * 0.20
    elif order_value >= 25:
        return order_value * 0.25
    else:
        return 0

df['charges'] = df['cost_of_the_order'].apply(calculate_charges)
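
The tiered rule in `calculate_charges` can also be written without a per-row `apply`. The sketch below uses `np.select` on a hypothetical series `sample_costs`. The thresholds and rates are copied from the function above, and everything else is illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical order costs chosen to hit every tier boundary
sample_costs = pd.Series(
    [4.0, 5.0, 14.99, 15.0, 24.99, 25.0, 40.0], name="cost_of_the_order"
)

# One boolean condition per tier, in the same order as the rates below
conditions = [
    (sample_costs >= 5) & (sample_costs < 15),
    (sample_costs >= 15) & (sample_costs < 25),
    sample_costs >= 25,
]
rates = [0.15, 0.20, 0.25]

# Orders matching no condition (cost below 5) get a rate of 0
rate = np.select(conditions, rates, default=0.0)
charges = sample_costs * rate

print(pd.DataFrame({"cost": sample_costs, "rate": rate, "charges": charges}))
```

Keeping the rate as its own column makes it simple to check which tier each order fell into.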


In [None]:
df.head()

### Identifying Top 15 Restaurants with Minimal Order Times

In [None]:
# Sort the dataset by total_order_time
sorted_df = df.sort_values(by="total_order_time")

# Select the 15 rows (individual orders) with the shortest total order time
top_restaurants = sorted_df.head(15)[["restaurant_name", "cuisine_type", "total_order_time", "rating"]]

top_restaurants

### **Based on this data, here are some observations:**

1. **Popular Cuisine Types**: Middle Eastern, American, and Chinese cuisines are frequently represented among the restaurants with lower total order times, suggesting efficiency in these categories.

2. **Restaurant Efficiency**: Restaurants like "RedFarm Broadway" (Chinese) and "Westville Hudson" (American) exhibit the lowest `total_order_time` of 35 minutes, making them stand out for their speed.

3. **Notable Repetitions**: Some restaurants, such as "Blue Ribbon Sushi" (Japanese), appear multiple times with consistent total order times, indicating reliability in their performance.

4. **Range of Order Times**: The listed total order times range narrowly between 35 and 37 minutes, showing relatively consistent preparation and delivery efficiency across these restaurants.

A useful next step is to look for efficiency patterns within a specific cuisine type, or to examine how these times correlate with ratings and other factors in the dataset.
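
The cuisine pattern noted in the observations above comes from only 15 rows. A rough way to check it across all orders is to aggregate order time per cuisine. The sketch below uses a hypothetical frame `sample` with made-up values, and only the column names match the notebook.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be `df`
sample = pd.DataFrame({
    "cuisine_type": ["American", "American", "Chinese", "Chinese", "Japanese", "Japanese"],
    "total_order_time": [35, 45, 36, 50, 37, 60],
})

# Median order time and number of orders per cuisine, fastest first
cuisine_speed = (
    sample.groupby("cuisine_type")["total_order_time"]
    .agg(median_time="median", order_count="count")
    .sort_values("median_time")
)

print(cuisine_speed)
```

The median is used here because it is less sensitive to a few very slow orders than the mean, and `order_count` shows how much data backs each figure.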

### **Displaying the Restaurants with the Longest Total Order Times**

In [None]:
# Sort the dataset by total_order_time in descending order
sorted_df = df.sort_values(by="total_order_time", ascending=False)

# After a descending sort the longest times are at the top, so take the first 15 rows
# (tail(15) would return the 15 shortest times instead)
last_restaurants = sorted_df.head(15)[["restaurant_name", "cuisine_type", "total_order_time", "rating"]]

last_restaurants

Here are some key observations. All of the order times cited below fall between 35 and 37 minutes, so these points concern the fastest orders in the dataset:

1. **High Ratings and Total Order Time**:
   - Restaurants like "Jack's Wife Freda," "Vanessa's Dumplings," and "Blue Ribbon Fried Chicken" maintain a perfect rating of 5 even with a total order time of 37 minutes. This suggests that customers may prioritize food quality and satisfaction over time in some cases.

2. **Efficiency Coupled with Quality**:
   - "Westville Hudson" and "RedFarm Broadway" achieve a balance, with the shortest total order time of 35 minutes and perfect ratings of 5 for "Westville Hudson" and a slightly lower rating (3) for one entry of "RedFarm Broadway."

3. **Consistently High Performers**:
   - Restaurants such as "ilili Restaurant," "Rubirosa," and "Blue Ribbon Sushi" consistently achieve a perfect rating of 5, indicating a strong customer preference, despite the total order time being 36 minutes.

4. **Variation Among Same Restaurants**:
   - A notable variation can be seen for "Jack's Wife Freda," which has ratings of both 5 and 3 for the same total order time (37 minutes). This suggests differing customer experiences even with similar efficiency.

5. **Cuisine Breakdown**:
   - Several cuisine types appear repeatedly:
     - **American**: Strong performers like "Schnipper's Quality Kitchen" and "Westville Hudson."
     - **Chinese**: "RedFarm Broadway" and "Vanessa's Dumplings" stand out with mixed ratings.
     - **Middle Eastern**: "ilili Restaurant" and "Cafe Mogador" show differing ratings despite similar total order times.
     - **Japanese**: "Blue Ribbon Sushi" has consistently good ratings (4 or 5).

6. **Focus for Improvement**:
   - Restaurants with lower ratings (e.g., "Jack's Wife Freda" with 3) might focus on enhancing the customer experience or food quality to match other high-rated competitors within the same total order time range.

This mix of ratings and time efficiency offers a useful perspective on customer satisfaction and performance trends across these restaurants. A natural follow-up is to examine how ratings vary by cuisine type.
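
The observation that one restaurant can receive both high and low ratings can be quantified by looking at the spread of ratings per restaurant. This is a minimal sketch on a hypothetical frame `sample`, and the restaurant names in it are placeholders.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be `df`
sample = pd.DataFrame({
    "restaurant_name": ["Restaurant A"] * 3 + ["Restaurant B"] * 3,
    "rating": [5, 3, 4, 5, 5, 5],
})

# Count, mean, minimum and maximum rating for each restaurant
rating_spread = sample.groupby("restaurant_name")["rating"].agg(
    n_ratings="count", mean_rating="mean", min_rating="min", max_rating="max"
)

# A large range means customers had very different experiences
rating_spread["rating_range"] = rating_spread["max_rating"] - rating_spread["min_rating"]

print(rating_spread.sort_values("rating_range", ascending=False))
```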

### **Top 5 Highest-Rated Restaurants with Total Order Times**

In [None]:
# Sort the dataset by rating in descending order
sorted_df = df.sort_values(by="rating", ascending=False)

# Select the first 5 rows (individual orders) with the highest ratings
top_rated_restaurants = sorted_df.head(5)[["restaurant_name", "cuisine_type", "rating", "total_order_time"]]

top_rated_restaurants

Here are some observations based on the given data:

1. **Perfect Ratings Across All Restaurants**: All listed restaurants have a perfect rating of 5, indicating exceptional customer satisfaction and quality of service.

2. **Cuisine Diversity**: The top-rated restaurants feature a variety of cuisines:
   - **Mediterranean**: "Jack's Wife Freda"
   - **Mexican**: "Cafe Habana" and "Chipotle Mexican Grill $1.99 Delivery"
   - **American**: "The Smile"
   - **Japanese**: "TAO"

   This showcases high performance across different culinary traditions.

3. **Rating versus Total Order Time**:
   - Despite having perfect ratings, these restaurants exhibit varying `total_order_time` values, ranging from 42 minutes ("TAO") to 54 minutes ("Jack's Wife Freda").
   - In these five rows, a longer wait did not prevent a perfect rating, so factors such as food quality and overall experience may matter more to these customers than speed. Five rows are too few to establish a correlation.

4. **Mexican Cuisine**: Two of the five restaurants serve Mexican cuisine. With only five rows shown, this hints at, but does not establish, strong customer satisfaction for that cuisine.

5. **Potential for Improvement**: While all restaurants scored perfectly in ratings, reducing their total order times could further enhance their appeal and competitiveness.

These insights could help pinpoint what drives high ratings across different cuisines and operational styles. Further exploration could look for patterns in customer preferences or operational efficiency.
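
Sorting individual orders by rating returns single orders, and many orders can tie at the top score. A restaurant-level view is one alternative: average the ratings per restaurant and require a minimum number of rated orders. In this sketch the frame `sample` and the threshold `MIN_RATED_ORDERS` are hypothetical.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be `df`
sample = pd.DataFrame({
    "restaurant_name": ["A", "A", "A", "B", "C", "C", "C"],
    "rating": [5, 4, 5, 5, 4, 4, 3],
    "total_order_time": [42, 50, 46, 54, 40, 44, 48],
})

MIN_RATED_ORDERS = 3  # assumed cut-off, tune to the dataset

# Per-restaurant summary: number of rated orders, mean rating, mean order time
summary = sample.groupby("restaurant_name").agg(
    rated_orders=("rating", "count"),
    average_rating=("rating", "mean"),
    average_order_time=("total_order_time", "mean"),
)

# Keep restaurants with enough ratings, then rank by average rating
top_rated = summary[summary["rated_orders"] >= MIN_RATED_ORDERS].sort_values(
    "average_rating", ascending=False
)

print(top_rated.round(2))
```

Restaurant "B" has a perfect score from a single order, so the threshold excludes it. Whether that is the desired behaviour depends on how much weight a lone rating should carry.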

### Bottom 5 Least-Rated Restaurants Based on Customer Feedback

In [None]:
# Sort the dataset by rating in ascending order
sorted_df = df.sort_values(by="rating", ascending=True)

# Select the first 5 rows (individual orders), which hold the lowest ratings
least_rated_restaurants = sorted_df.head(5)[["restaurant_name", "cuisine_type", "rating", "total_order_time"]]

least_rated_restaurants

Here are some observations based on the provided data of the least-rated restaurants:

1. **Low Ratings Across All Restaurants**:
   - All listed restaurants have a low rating of 3, which indicates room for improvement in terms of customer satisfaction or overall experience.

2. **Cuisine Type Distribution**:
   - The cuisines represented include:
     - **Chinese**: "RedFarm Broadway"
     - **American**: "brgr" and "Shake Shack"
     - **Italian**: "The Meatball Shop"
     - **Japanese**: "Sushi of Gari 46"
   - American cuisine appears twice in the list, suggesting potential issues with consistency or customer expectations in this category.

3. **High Total Order Times**:
   - The total order times for these restaurants range from 48 minutes ("Sushi of Gari 46") to 63 minutes ("RedFarm Broadway"), with the longest being significantly higher than the shortest.
   - High order times might contribute to customer dissatisfaction, potentially impacting the ratings.

4. **Focus Areas for Improvement**:
   - For "RedFarm Broadway," both its low rating and the highest total order time (63 minutes) stand out, suggesting that reducing wait time could enhance customer satisfaction.
   - Similarly, "Shake Shack" and "The Meatball Shop" might benefit from addressing both their operational efficiency and overall service quality.

5. **Japanese Cuisine as an Outlier**:
   - Despite typically being associated with high quality, "Sushi of Gari 46" has the lowest total order time (48 minutes) among the group but still received a low rating, possibly pointing to issues beyond service speed—such as taste or value.

Improving operational aspects and analyzing customer feedback for these restaurants could lead to better ratings in the future.
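
The suggestion that long order times lower ratings can be checked across all rated orders rather than five rows. The sketch below compares the mean order time at each rating level and computes a Pearson correlation. The frame `sample` is hypothetical and its values are illustrative only.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be the rated rows of `df`
sample = pd.DataFrame({
    "rating": [3, 3, 4, 4, 5, 5],
    "total_order_time": [63, 48, 55, 45, 54, 42],
})

# Mean order time and number of orders at each rating level
time_by_rating = sample.groupby("rating")["total_order_time"].agg(["mean", "count"])

# Pearson correlation; a negative value means slower orders tend to rate lower
corr = sample["rating"].corr(sample["total_order_time"])

print(time_by_rating)
print(f"Correlation between rating and total order time: {corr:.2f}")
```

A correlation near zero on the real data would suggest that order time alone does not explain the low ratings.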

### **Aggregate Information by Restaurant:**

In [None]:

# Calculate the count of orders for each restaurant
order_count = df['restaurant_name'].value_counts()

# Display the result
order_count

In [None]:
# Group by restaurant_name and get aggregate information
result = df.groupby('restaurant_name').agg(
    order_count=('order_id', 'count'),  # Assuming you want to count orders
    cuisine_type=('cuisine_type', 'first'),  # Take the first cuisine type for each restaurant
    rating=('rating', 'mean'),  # Calculate the average rating
    total_order_time=('total_order_time', 'mean')  # Calculate the average total order time
).reset_index()

# Round the numerical columns to 1 decimal place
result['rating'] = result['rating'].round(1)  # Round rating to 1 decimal place
result['total_order_time'] = result['total_order_time'].round(1)  # Round total_order_time to 1 decimal place

# Display the result
result.head()


In [None]:
# Descriptive statistics for numerical columns
df.describe()

### **Restaurant-Based Recommendation:**

In [None]:

# Restaurant-based recommendation: Sort by rating first, then by total order time
recommendations = df.sort_values(by=['rating', 'total_order_time'], ascending=[False, True])

print("Restaurant-Based Recommendations:")
result = recommendations[['restaurant_name', 'rating', 'total_order_time']]
result.head()
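
The cell above ranks individual orders, so the same restaurant can appear several times. A restaurant-level variant is sketched below, assuming the same sort keys (rating descending, total order time ascending) applied to per-restaurant averages. The function `recommend_restaurants`, its `cuisine` and `top_n` parameters, and the frame `sample_orders` are all hypothetical.

```python
import pandas as pd

def recommend_restaurants(orders, cuisine=None, top_n=5):
    """Rank restaurants by mean rating (descending), then mean total order time (ascending)."""
    if cuisine is not None:
        # Optional filter on a single cuisine type
        orders = orders[orders["cuisine_type"] == cuisine]
    summary = orders.groupby("restaurant_name", as_index=False).agg(
        cuisine_type=("cuisine_type", "first"),
        order_count=("order_id", "count"),
        rating=("rating", "mean"),
        total_order_time=("total_order_time", "mean"),
    )
    ranked = summary.sort_values(
        by=["rating", "total_order_time"], ascending=[False, True]
    )
    return ranked.head(top_n).reset_index(drop=True)

# Hypothetical sample; in the notebook this would be `df`
sample_orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "restaurant_name": ["Restaurant A", "Restaurant A", "Restaurant B",
                        "Restaurant B", "Restaurant C", "Restaurant C"],
    "cuisine_type": ["Japanese", "Japanese", "Japanese",
                     "Japanese", "American", "American"],
    "rating": [5, 5, 5, 5, 4, 3],
    "total_order_time": [40, 50, 40, 44, 30, 32],
})

print(recommend_restaurants(sample_orders))
print(recommend_restaurants(sample_orders, cuisine="Japanese", top_n=1))
```

Restaurants A and B tie on rating here, so the faster one (B) is listed first.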

### Dataset Explanation and Trend Analysis

The dataset provided contains information about customer orders, including details about **restaurants**, **orders**, **food preparation time**, **delivery time**, **total order time**, and **customer ratings**. The dataset is rich with various performance metrics that help us understand the dynamics of food ordering.

#### Key Columns in the Dataset:
1. **Restaurant Name**: The name of the restaurant from which food is ordered.
2. **Order Count**: The number of times a particular restaurant has been ordered from.
3. **Cuisine Type**: The type of cuisine offered by the restaurant (e.g., American, Japanese, Italian).
4. **Rating**: The average customer rating for the restaurant (on a scale of 1 to 5).
5. **Total Order Time**: The total time (in minutes) taken to process the order, including both food preparation and delivery time.
6. **Cost of the Order**: The total cost paid by the customer for the order.
7. **Food Preparation Time**: The time taken by the restaurant to prepare the food.
8. **Delivery Time**: The time taken for the food to be delivered to the customer.
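9. **Charges**: Additional charges related to the order (e.g., delivery fee, service charges). In this notebook the column is derived by `calculate_charges` as 15%, 20%, or 25% of the order cost depending on the order value, and 0 for orders below 5.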

#### General Observations:
- **Order Count and Popularity**: 
   - Restaurants such as **Blue Ribbon Sushi**, **Bareburger**, and **Shake Shack** have the highest order counts, indicating these are some of the most popular options in the dataset.
   - Popular restaurants tend to attract many customers, which is often a sign of good customer retention and satisfaction.

- **Cuisine Preferences**:
   - The **American** cuisine category is the most frequently ordered, with several American restaurants such as **Bareburger**, **Shake Shack**, and **Five Guys Burgers and Fries** topping the list.
   - **Japanese** cuisine also seems to be popular, with restaurants like **Blue Ribbon Sushi** and **Sushi of Gari** performing well in terms of ratings and order frequency.
   - **Indian** and **Mexican** cuisines show a good balance of popularity and customer satisfaction, although **Mexican** restaurants tend to have a slightly lower average rating than other cuisines.

- **Ratings Distribution**:
   - Most restaurants have ratings close to 4, indicating that customers generally have positive experiences.
   - However, a few restaurants, like **Sarabeth’s West**, **Hampton Chutney Co.**, and **Pepe Giallo**, have lower ratings, which points to possible customer satisfaction issues related to food quality, service, or other factors.
   - Restaurants with higher ratings often show consistency in their order counts, as customers tend to favor those that have proven to offer quality and timely services.

- **Delivery and Preparation Times**:
   - The **total order time** varies widely across restaurants, with some restaurants, such as **Blue Ribbon Sushi Izakaya**, having longer total order times (e.g., 53.2 minutes), while others like **Amma** and **Sushi Choshi** maintain shorter times.
   - Restaurants that manage quicker delivery times (under 50 minutes) tend to have higher customer satisfaction, as reflected in their ratings.
   - **Food preparation time** and **delivery time** seem to correlate with overall customer satisfaction. Faster delivery and preparation times tend to align with higher ratings.

- **Cost Insights**:
   - The average order cost in the dataset is approximately **$16.76**, but there is variation in pricing across restaurants.
   - Some restaurants, like **Cipriani Le Specialita**, have very high order costs (around $65.00), which indicates they might be offering premium or luxury services. Despite the high cost, they tend to maintain high ratings, suggesting that customers are satisfied with the value they are receiving.
   - As for the charges, they typically range between **`$2 to $4`** in additional fees, which can include delivery charges, service fees, or other associated costs. These charges are generally lower compared to the base cost of the order but still contribute to the overall total.



### Restaurant-Based Recommendations

Now that we have an understanding of the dataset trends, let's proceed to the **Restaurant-Based Recommendations**.

#### Key Insights and Recommendations:
Based on the analysis of order count, ratings, and total order time, we can offer the following restaurant-based recommendations:

1. **Top Restaurants by Popularity**:
   - **Blue Ribbon Sushi**: With a high number of orders (73) and consistent ratings (4.2), it is a top contender for a reliable dining option.
   - **Bareburger**: A consistent performer with a rating of 4.1. At 17 orders it is ordered from less often than the other two, but it remains popular.
   - **Shake Shack**: With 133 orders and a rating of 4.3, this American burger joint is a go-to option for many customers.

2. **Restaurants with High Ratings**:
   - **Amma** (Indian) and **Anjappar Chettinad** (Indian) stand out with perfect ratings (5.0) across multiple orders, making them prime recommendations for customers seeking quality.
   - **Kori Restaurant and Bar** (Korean) also maintains a strong rating (5.0) despite having fewer orders, suggesting they might provide an outstanding experience.

3. **Fast Delivery and Preparation**:
   - Restaurants like **Ravagh Persian Grill** (Middle Eastern) offer excellent delivery times (67 minutes) and preparation times, which may make them ideal for customers looking for speed.
   - **Grand Sichuan International** (Chinese) and **Taro Sushi** (Japanese) are also known for their quick service and consistent quality.

4. **Cuisines with Balanced Performance**:
   - **Mexican restaurants** such as **Cafe Habana** and **Dos Caminos** provide a balanced combination of popularity and customer satisfaction, offering both good ratings and reasonable order times.

### Final Conclusion:

In this project, we built a **Restaurant-Based Recommendation System** using available data to help customers discover high-quality restaurants based on order count, customer ratings, and total order time. 

The dataset provided valuable insights into restaurant performance, revealing trends such as the popularity of **American** and **Japanese** cuisines, customer satisfaction through high ratings, and the importance of **timely service**.

Despite some challenges with implementing a **Customer-Based Recommendation System**, which faced technical issues with dynamic customer preferences, the **Restaurant-Based Recommendation System** provided a solid framework for suggesting top-performing restaurants. 

### Recommendations:
- **For Customers**: Rely on the **restaurant-based recommendations** to discover top-rated and high-traffic restaurants for a great dining experience.
- **For Restaurant Owners**: Focusing on maintaining high ratings, reducing delivery times, and ensuring a consistent quality of service is key to attracting repeat customers.
- **For Future Work**: The customer-based recommendations could be further enhanced by implementing personalized filters and preferences, ensuring the system adapts dynamically to individual customer tastes.

By refining both systems, the recommendation engine can offer a more personalized and comprehensive dining experience for all customers.