<a href="https://colab.research.google.com/github/Faz-Fz/final_ml/blob/main/Pizza_sales_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
ulrikthygepedersen_pizza_place_sales_path = kagglehub.dataset_download('ulrikthygepedersen/pizza-place-sales')

print('Data source import complete.')


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<h1>Importing modules and dataset</h1>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('/kaggle/input/pizza-place-sales/pizzaplace.csv')
df.head()

In [None]:
df.info()

The dataset contains **49,574 rows and 8 columns.** Seven of them are described below; the eighth, **'Unnamed: 0'**, appears to be a leftover row index from the CSV export and is dropped before modelling.
* **'id'** is an object (string) column, likely an identifier for each entry.
* **'date'** is an object (string) column holding the date of the sale.
* **'time'** is an object (string) column holding the time of day of the sale.
* **'name'** is an object (string) column, likely the name of the pizza.
* **'size'** is an object (string) column, likely the size of the pizza.
* **'type'** is an object (string) column, likely the type or category of the pizza.
* **'price'** is a float64 column holding the price of the pizza.
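
Because the date and time columns load as strings, they need parsing before any calendar feature can be derived. The sketch below shows one way to do that on a tiny made-up frame; the ISO date format, the `HH:MM:SS` time format, and the `hour` column are assumptions for illustration and are not taken from the real file.

```python
import pandas as pd

# Tiny stand-in for the real data; all values are made up for illustration.
sample = pd.DataFrame({
    "date": ["2015-01-01", "2015-01-02"],
    "time": ["11:38:36", "12:05:10"],
    "price": [13.25, 16.00],
})

# Parse the date strings into datetime64 values.
sample["date"] = pd.to_datetime(sample["date"])

# Parse the time strings (assumed HH:MM:SS) and keep only the hour.
sample["hour"] = pd.to_datetime(sample["time"], format="%H:%M:%S").dt.hour

print(sample[["date", "hour"]])
```

An hour column like this could serve as an extra feature, although the notebook itself drops `time` before modelling.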

In [None]:
df.isna().sum() # no null values

In [None]:
df.describe()

In [None]:
df['size'].value_counts()

In [None]:
df['name'].value_counts()

<h2>Visualizing</h2>

In [None]:
#Visualizing the distribution of pizza types
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='type')
plt.title('Pizza Type Distribution')
plt.xlabel('Pizza Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
# The plot shows four pizza types and how many pizzas were sold of each

In [None]:
# Visualize the distribution of pizza sizes
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='size')
plt.title('Pizza Size Distribution')
plt.xlabel('Pizza Size')
plt.ylabel('Count')
plt.show()
# The most commonly bought size is L

In [None]:
# Visualize the relationship between price and pizza type using a boxplot
plt.figure(figsize=(12, 8))
sns.boxplot(data=df, x='type', y='price')
plt.title('Price Distribution by Pizza Type')
plt.xlabel('Pizza Type')
plt.ylabel('Price')
plt.xticks(rotation=45)
plt.show()
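# A boxplot summarizes five values: minimum, first quartile (Q1), median (the middle line),
# third quartile (Q3), and maximum; points beyond the whiskers are drawn as outliers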
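
The quantities a boxplot draws can also be computed directly. The following is a minimal sketch on made-up prices, assuming the default whisker rule used by seaborn and matplotlib (whiskers reach the most extreme points within 1.5 times the interquartile range of the box).

```python
import numpy as np

# Made-up prices, not taken from the dataset.
prices = np.array([9.75, 12.0, 12.5, 16.0, 16.5, 17.95, 20.75, 35.95])

# The box spans Q1 to Q3; the line inside it is the median.
q1, median, q3 = np.percentile(prices, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box decide what counts as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that lie inside the fences.
whisker_low = prices[prices >= lower_fence].min()
whisker_high = prices[prices <= upper_fence].max()
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print("Q1, median, Q3:", q1, median, q3)
print("Whiskers:", whisker_low, whisker_high)
print("Outliers:", outliers)
```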

In [None]:
df['date'] = pd.to_datetime(df['date'])

# Extract the day of the week (0 = Monday, 6 = Sunday)
df['day_of_week'] = df['date'].dt.dayofweek

days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Series holding the number of rows per weekday (index 0-6)
daily_orders = df['day_of_week'].value_counts().sort_index().reindex(range(7), fill_value=0)

plt.figure(figsize=(10, 6))
sns.barplot(x=days_of_week, y=daily_orders.values)
plt.title('Number of Orders by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Number of Orders')
plt.xticks(rotation=45)
plt.show()
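
The count above is a count of rows. If `id` identifies an order and one order can span several rows (an assumption, not confirmed by the notebook), the number of distinct orders per weekday may differ from the number of rows. A small sketch on made-up data:

```python
import pandas as pd

# Made-up data: order "A" has two rows, "B" one, "C" three.
sample = pd.DataFrame({
    "id": ["A", "A", "B", "C", "C", "C"],
    "date": pd.to_datetime([
        "2015-01-05", "2015-01-05", "2015-01-05",
        "2015-01-06", "2015-01-06", "2015-01-06",
    ]),
})
sample["day_of_week"] = sample["date"].dt.dayofweek  # 0 = Monday

# Rows per weekday (what value_counts() on day_of_week measures).
rows_per_day = sample.groupby("day_of_week").size()

# Distinct order ids per weekday.
orders_per_day = sample.groupby("day_of_week")["id"].nunique()

print(rows_per_day)
print(orders_per_day)
```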

In [None]:
# Compute total sales (sum of prices) per calendar month
df['date'] = pd.to_datetime(df['date'])
df['month'] = df['date'].dt.month
monthly_sales = df.groupby('month')['price'].sum()

In [None]:
plt.figure(figsize=(10, 6))
sns.lineplot(x=monthly_sales.index, y=monthly_sales.values)
plt.title('Monthly Sales Over Time')
plt.xlabel('Month')
plt.ylabel('Total Sales')
plt.xticks(range(1, 13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.show()
# Each point is the summed price of all pizzas sold in that month

**Which is the favorite pizza of customers (most ordered pizza)?**

In [None]:
favorite_pizza = df.groupby(['name', 'size'])['id'].count().idxmax()
print("Favorite Pizza:", favorite_pizza)
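
Grouping by both `name` and `size` answers "which name and size combination is ordered most", which is not necessarily the same as "which pizza name is ordered most". The sketch below, on made-up rows, shows how the two answers can differ.

```python
import pandas as pd

# Made-up rows: "hawaiian" is spread over four sizes, "bbq_ckn" is mostly L.
sample = pd.DataFrame({
    "name": ["hawaiian", "hawaiian", "hawaiian", "hawaiian",
             "bbq_ckn", "bbq_ckn", "bbq_ckn"],
    "size": ["S", "M", "L", "XL", "L", "L", "M"],
})

# Most frequent (name, size) combination; idxmax returns a tuple.
favorite_name_size = sample.groupby(["name", "size"]).size().idxmax()

# Most frequent name regardless of size.
favorite_name = sample["name"].value_counts().idxmax()

print("By name and size:", favorite_name_size)
print("By name only:", favorite_name)
```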

<h2>Regression</h2>

In [None]:
cat_cols=df.select_dtypes(include=['object']).columns

In [None]:
cat_cols

In [None]:
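from sklearn.preprocessing import LabelEncoder

# Replace each string column with integer codes (one code per distinct value).
# The same encoder is refit for every column, so only the last mapping is retained.
en = LabelEncoder()
for col in cat_cols:
    df[col] = en.fit_transform(df[col])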
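
Reusing one `LabelEncoder` across columns means the code-to-label mapping of every column except the last is lost. A possible alternative, sketched here on made-up data, keeps one encoder per column in a dictionary (the name `encoders` is hypothetical) so that codes can be translated back later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up categorical data for illustration.
sample = pd.DataFrame({
    "size": ["S", "M", "L", "M"],
    "type": ["classic", "veggie", "classic", "supreme"],
})

# Fit a separate encoder per column and remember it.
encoders = {}
for col in ["size", "type"]:
    enc = LabelEncoder()
    sample[col] = enc.fit_transform(sample[col])
    encoders[col] = enc

# LabelEncoder assigns codes in sorted order of the labels.
print(sample)

# Translate codes back to the original labels.
decoded_sizes = list(encoders["size"].inverse_transform([0, 2]))
print(decoded_sizes)
```

With the mappings kept, an encoded input such as the one used for the final prediction can be built from readable labels via `encoders[col].transform(...)`.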

In [None]:
# Drop the columns that are not used as features
columns_to_drop = ['Unnamed: 0', 'id', 'date', 'time']
df = df.drop(columns=columns_to_drop)

In [None]:
df.head()

In [None]:
X= df.drop('price',axis=1)
y = df['price']


In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=40)

In [None]:
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

In [None]:
lr = LinearRegression()
lr.fit(X_train,y_train)

In [None]:
rf  = RandomForestRegressor()
rf.fit(X_train,y_train)

In [None]:

gbr = GradientBoostingRegressor()
gbr.fit(X_train,y_train)

In [None]:
xg = XGBRegressor()
xg.fit(X_train,y_train)

In [None]:
# Create a DecisionTreeRegressor instance
tree_reg = DecisionTreeRegressor()

# Fit the model to the training data
tree_reg.fit(X_train, y_train)




<h3>Prediction on Test Data</h3>

In [None]:
y_pred1 = lr.predict(X_test)
y_pred2 = rf.predict(X_test)
y_pred3 = gbr.predict(X_test)
y_pred4 = xg.predict(X_test)
y_pred5= tree_reg.predict(X_test)

<h3>Evaluating the Models</h3>

In [None]:
from sklearn import metrics
score1 = metrics.r2_score(y_test,y_pred1) #linear regression
score2 = metrics.r2_score(y_test,y_pred2) #random forest
score3 = metrics.r2_score(y_test,y_pred3) #gbr
score4 = metrics.r2_score(y_test,y_pred4) #xg
score5 = metrics.r2_score(y_test,y_pred5) #dt

In [None]:
print(score1,score2,score3,score4,score5)
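
For reference, the R-squared values printed above follow the definition 1 - SS_res / SS_tot. The sketch below recomputes it by hand on made-up numbers (the helper `r2_by_hand` is hypothetical) and shows the three regimes: a perfect fit, a mean-only baseline, and a model worse than that baseline.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

# Made-up target values.
y_true = [10.0, 12.0, 16.0, 20.0]

perfect = r2_by_hand(y_true, y_true)                  # predictions equal targets
mean_only = r2_by_hand(y_true, [14.5] * 4)            # always predict the mean
worse = r2_by_hand(y_true, [20.0, 16.0, 12.0, 10.0])  # reversed predictions

print(perfect, mean_only, worse)
```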


In [None]:
# Collect the scores in a DataFrame for plotting
final_data = pd.DataFrame({'Models':['LR','RF','GB','XGR','DT'],
             'R2_SCORE':[score1,score2,score3,score4,score5]})

In [None]:
import seaborn as sns
sns.barplot(x=final_data['Models'],y=final_data['R2_SCORE'])

* **Linear Regression (score1):** An R-squared score of 0.5209 means the Linear Regression model explains about 52.09% of the variance in the test prices, which is a moderate fit.

* **Random Forest Regressor (score2):** An R-squared score of 1.0 indicates a perfect fit on the test data.

* **Gradient Boosting Regressor (score3):** An R-squared score of 0.9977 means the Gradient Boosting Regressor explains about 99.77% of the variance in the test data, a very strong result.

* **XGBoost Regressor (score4):** An R-squared score of 1.0, the same as the Random Forest.

* **Decision Tree Regressor (score5):** Its score is the last value printed above and the 'DT' bar in the chart.

* **Support Vector Machine Regressor:** An R-squared score of -0.0002 was recorded for an SVM regressor, which is not among the five models fitted in this notebook. A score this close to zero means the model does no better than always predicting the mean price, and a negative score means it does worse than that baseline.

The perfect and near-perfect scores of the tree-based models are plausible if, as seems likely for a menu, the price is fixed by the pizza's name and size. In that case the models only need to learn a lookup table, and the same combinations appear in both the training and test sets.
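
A perfect R-squared usually deserves a sanity check. One possible explanation, assumed here rather than verified against the real file, is that the price is fully determined by the pizza's name and size. The sketch below shows how that could be checked, using made-up rows.

```python
import pandas as pd

# Made-up rows in which every (name, size) pair always has the same price.
sample = pd.DataFrame({
    "name": ["hawaiian", "hawaiian", "hawaiian", "bbq_ckn", "bbq_ckn"],
    "size": ["S", "S", "M", "L", "L"],
    "price": [10.5, 10.5, 13.25, 20.75, 20.75],
})

# Number of distinct prices observed for each (name, size) combination.
prices_per_combo = sample.groupby(["name", "size"])["price"].nunique()

# True when each combination maps to exactly one price (a pure lookup table).
is_lookup_table = bool((prices_per_combo == 1).all())

print(prices_per_combo)
print("Price determined by name and size:", is_lookup_table)
```

If the same check returns `True` on the real data before encoding, tree-based models can reproduce the prices exactly, and the R-squared says little about how they would handle a pizza they have not seen.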

In [None]:
df.head()

<h3>Final model</h3>

In [None]:
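# One label-encoded sample, with values in the column order of X
custom_values = [[3, 2, 1, 3, 3]]

# Wrap the values in a DataFrame so the feature names match those seen during training
custom_input = pd.DataFrame(custom_values, columns=X.columns)
y_pred = tree_reg.predict(custom_input)

# Print the predicted price
print("Predicted Values:", y_pred)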