## **Advanced Pune Cafe Sales (2000 Records) Regression Project**

In [2]:
# Importing necessary libraries

import pandas as pd     # for data cleaning, filtering and manipulation
import numpy as np           # for numerical computing
import seaborn as sns           # for advanced statistical visualizations
import matplotlib.pyplot as plt     # for creating visualizations
%matplotlib inline 
from ydata_profiling import ProfileReport     # for generating a profile report
import plotly.express as px         # interactive, web-based plotting library
import warnings                 # for suppressing warning messages during code execution
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)   # for displaying all columns

#### **Necessary Libraries**
- **`Pandas`** - Pandas is used for data analysis and handling tabular data like Excel.
It provides easy tools to load, clean, and explore data using DataFrames.
- **`Numpy`** - NumPy is a Python library used for fast mathematical and array operations.
It helps in working with large datasets using arrays, matrices, and functions.
- **`Seaborn`** - Seaborn is a Python library for making beautiful and simple statistical charts.
It is built on top of Matplotlib and works well with Pandas.
- **`Matplotlib`** - Matplotlib is used to create basic graphs like line, bar, and pie charts.
It helps in visualizing data in static and simple formats.
- **`Plotly`** - Plotly is a library for creating interactive and web-based charts.
It is used when you want zoomable and clickable graphs.
- **`%matplotlib inline`** - This is used in Jupyter Notebook to show plots right below your code cell.
It helps display Matplotlib charts inside the notebook.
- **`import warnings`** - The `warnings` module is used to manage or ignore warning messages in Python.
It helps keep output clean and readable during code execution.
- **`ydata_profiling`** - ydata_profiling quickly creates an automatic EDA report from your data.
It gives summaries, graphs, and insights in one HTML file.

In [4]:
# Loading dataset using .read_csv() method of pandas

df = pd.read_csv("C:\\Users\\HP\\OneDrive\\Documents\\Advanced_Pune_Cafe_Sales_2000.csv")

In [5]:
# Displaying first five rows of the dataframe

df.head()

Unnamed: 0,order_id,date,outlet_location,outlet_manager,customer_type,item_category,item_name,quantity_sold,price_per_unit,total_bill,cost_price,profit,payment_mode,time_of_day,temperature_c,day_of_week,customer_rating,discount_percent,final_amount,special_event,staff_id,day_type,cumulative_sales_outlet
0,ORD51036,2025-10-19,Hinjewadi,Priya Nair,Regular,Snack,Cheese Garlic Bread,4,245,980,771.03,208.97,Cash,Morning,34.9,Sunday,3.1,15,833.0,Weekend Special,ST11,Weekend,833.0
1,ORD94971,2025-10-09,Kothrud,Arjun Mehta,Tourist,Beverage,Cappuccino,3,134,402,267.03,134.97,Card,Evening,25.3,Thursday,3.0,0,402.0,,ST4,Weekday,402.0
2,ORD82709,2025-10-07,Viman Nagar,Rohit Deshmukh,Tourist,Beverage,Green Tea,3,243,729,519.16,209.84,Cash,Evening,29.0,Tuesday,4.8,5,692.55,Weekend Special,ST22,Weekday,692.55
3,ORD61541,2025-10-22,Viman Nagar,Rohit Deshmukh,Tourist,Beverage,Cappuccino,1,174,174,105.44,68.56,Card,Evening,25.7,Wednesday,4.8,0,174.0,Diwali Offer,ST22,Weekday,866.55
4,ORD20674,2025-10-07,Kothrud,Arjun Mehta,New,Dessert,Cheesecake,4,175,700,559.49,140.51,Card,Afternoon,26.8,Tuesday,4.4,15,595.0,Coffee Fest,ST4,Weekday,997.0


In [6]:
# Displaying last five rows of the dataframe

df.tail()

Unnamed: 0,order_id,date,outlet_location,outlet_manager,customer_type,item_category,item_name,quantity_sold,price_per_unit,total_bill,cost_price,profit,payment_mode,time_of_day,temperature_c,day_of_week,customer_rating,discount_percent,final_amount,special_event,staff_id,day_type,cumulative_sales_outlet
1995,ORD88057,2025-10-03,Viman Nagar,Rohit Deshmukh,New,Beverage,Hot Chocolate,4,169,676,468.12,207.88,Cash,Evening,25.8,Friday,4.9,15,574.6,Weekend Special,ST14,Weekday,188004.3
1996,ORD71910,2025-10-26,Viman Nagar,Rohit Deshmukh,Tourist,Dessert,Brownie,1,171,171,113.79,57.21,Cash,Night,32.5,Sunday,3.8,0,171.0,Coffee Fest,ST20,Weekend,188175.3
1997,ORD13597,2025-10-14,Kothrud,Arjun Mehta,Tourist,Snack,Pasta,4,162,648,516.21,131.79,Card,Afternoon,32.7,Tuesday,3.1,5,615.6,,ST16,Weekday,175690.5
1998,ORD37000,2025-10-18,Koregaon Park,Amit Sharma,New,Snack,French Fries,3,234,702,515.19,186.81,Card,Morning,30.0,Saturday,3.2,10,631.8,,ST10,Weekend,190355.85
1999,ORD86916,2025-10-22,Kothrud,Arjun Mehta,New,Beverage,Hot Chocolate,4,224,896,704.06,191.94,Card,Evening,25.2,Wednesday,3.0,5,851.2,,ST11,Weekday,176541.7


In [7]:
# Displaying list of column names

print(df.columns.tolist())

['order_id', 'date', 'outlet_location', 'outlet_manager', 'customer_type', 'item_category', 'item_name', 'quantity_sold', 'price_per_unit', 'total_bill', 'cost_price', 'profit', 'payment_mode', 'time_of_day', 'temperature_c', 'day_of_week', 'customer_rating', 'discount_percent', 'final_amount', 'special_event', 'staff_id', 'day_type', 'cumulative_sales_outlet']


#### **Column Description**
- **`order_id`** – Unique id for each order.
- **`date`** – Date when the order was placed.
- **`outlet_location`** – Location or branch of the outlet where the sale occurred.
- **`outlet_manager`** – Name of the manager handling that outlet.
- **`customer_type`** – Type of customer (Regular, New or Tourist).
- **`item_category`** – Category to which the sold item belongs.
- **`item_name`** – Name of the product or item sold.
- **`quantity_sold`** – Number of units of the item sold in the order.
- **`price_per_unit`** – Selling price of one unit of the item.
- **`total_bill`** – Total amount before discount and taxes.
- **`cost_price`** – Original cost of the item to the outlet.
- **`profit`** – Profit earned from the order or item sale.
- **`payment_mode`** – Mode of payment used (e.g., cash, card, UPI).
- **`time_of_day`** – Time slot of the sale (e.g., morning, afternoon, evening).
- **`temperature_c`** – Temperature in Celsius at the time of the sale.
- **`day_of_week`** – Day on which the sale occurred (e.g., Monday, Friday).
- **`customer_rating`** – Rating given by the customer after purchase.
- **`discount_percent`** – Discount percentage applied on the bill.
- **`final_amount`** – Final payable amount after discount.
- **`special_event`** – Indicates if the sale occurred during any special event or festival.
- **`staff_id`** – Identifier of the staff member who processed the order.
- **`day_type`** – Type of day (Weekday or Weekend).
- **`cumulative_sales_outlet`** – Running total of all sales for the outlet up to that date.
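
The descriptions above suggest simple arithmetic links between the billing columns. The sketch below checks three such links on the first three rows shown in `df.head()`. The formulas are assumptions inferred from those sample rows, not a documented schema, and the `sample` frame is a hand-copied subset used only for illustration.

```python
import numpy as np
import pandas as pd

# First three rows from df.head(), billing columns only
sample = pd.DataFrame({
    "quantity_sold": [4, 3, 3],
    "price_per_unit": [245, 134, 243],
    "total_bill": [980, 402, 729],
    "cost_price": [771.03, 267.03, 519.16],
    "profit": [208.97, 134.97, 209.84],
    "discount_percent": [15, 0, 5],
    "final_amount": [833.0, 402.0, 692.55],
})

# Assumed relationships (inferred from the sample, not from a data dictionary)
bill_ok = np.isclose(sample["quantity_sold"] * sample["price_per_unit"],
                     sample["total_bill"])
final_ok = np.isclose(sample["total_bill"] * (1 - sample["discount_percent"] / 100),
                      sample["final_amount"])
profit_ok = np.isclose(sample["total_bill"] - sample["cost_price"],
                       sample["profit"])

print(bill_ok.all(), final_ok.all(), profit_ok.all())
```

If the same checks pass on the full dataframe, several of these columns are near-deterministic functions of one another, which is worth keeping in mind when reading the correlation heatmap later on.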


In [None]:
# Generating profile report before cleaning the dataset

Profile = ProfileReport(df)
Profile


In [None]:
# Displaying some random samples from the dataset

df.sample(6)

In [None]:
# Displaying datatypes of columns

df.dtypes

In [None]:
# Here datatype of date is object, so we will convert it to datetime 

df['date'] = pd.to_datetime(df['date'])

In [None]:
# Extracting new columns from the date

# Creating new features Year, Month and Day from 'date', and IsWeekend from 'day_of_week'
df['Year'] = df['date'].dt.year       # year from the date
df['Month'] = df['date'].dt.month          # month from the date
df['Day'] = df['date'].dt.day          # day number from the date
df['IsWeekend'] = df['day_of_week'].isin(['Saturday', 'Sunday']).astype(int)     # 1 if the order falls on a weekend, else 0
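
The cell above builds `IsWeekend` from the text column `day_of_week`. As a cross-check, the same flag can be derived from the parsed date itself. This is a minimal sketch on two dates taken from `df.head()`; the `features` frame is a hypothetical name used only here.

```python
import pandas as pd

# Two dates from df.head(): a Sunday and a Thursday
dates = pd.to_datetime(pd.Series(["2025-10-19", "2025-10-09"]))

features = pd.DataFrame({
    "Year": dates.dt.year,
    "Month": dates.dt.month,
    "Day": dates.dt.day,
    # dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
    "IsWeekend": (dates.dt.dayofweek >= 5).astype(int),
})
print(features)
```

Comparing this flag with the one built from `day_of_week` is a quick way to spot rows where the two source columns disagree.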

In [None]:
# Removing order_id, staff_id and outlet_manager: they are identifiers and are not expected to help the regression models predict the target column
# The 'date' column is also dropped because Year, Month and Day have already been extracted from it

df.drop(columns=['order_id', 'staff_id', 'outlet_manager', 'date'], inplace=True)

In [None]:
# Displaying shape of the dataframe

df.shape

**There are 2000 records and 23 columns present in the dataset.**

In [None]:
# Checking total number of missing values present in every column of the dataset

df.isnull().sum()

In [None]:
# There are 1407 missing values in the 'special_event' column, so we fill them with the keyword 'Unknown'

df['special_event'] = df['special_event'].fillna('Unknown')

In [None]:
# Checking missing values again after filling them

df.isnull().sum()

In [None]:
# Checking total number of duplicate records present in the dataset

df.duplicated().sum()

**The dataset now has no missing values and no duplicate records.**

In [None]:
# Getting statistical summary of numerical columns

df.describe()

In [None]:
# Getting statistical summary of categorical columns

df.describe(include=[object])

### **Summary of categorical columns -**
#### **`Outlet Location`**
- It has 5 unique categories.
- The top category is 'Koregaon Park' with a frequency of 428.
#### **`Customer Type`**
- It has 3 unique categories.
- The top category is 'Tourist' with a frequency of 675.
#### **`Item Category`**
- It has 3 unique categories.
- The top category is 'Beverage' with a frequency of 696.
#### **`Item Name`**
- It has 17 unique categories.
- The top category is 'Ice Cream' with a frequency of 145.
#### **`Payment Mode`**
- It has 3 unique categories.
- The top category is 'Card' with a frequency of 683.
#### **`Time of day`**
- It has 4 unique categories.
- The top category is 'Morning' with a frequency of 524.
#### **`Day of week`**
- It has 7 unique categories.
- The top category is 'Wednesday' with a frequency of 324.
#### **`Special Event`**
- It has 4 unique categories.
- The top category is 'Unknown' with a frequency of 1407.
#### **`Day Type`**
- It has 2 unique categories.
- The top category is 'Weekday' with a frequency of 1483.
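
The unique count, top category and frequency listed above are the `unique`, `top` and `freq` rows of `df.describe(include=[object])`. A minimal sketch of how the same three numbers can be rebuilt by hand, using a made-up `toy` frame rather than the real data:

```python
import pandas as pd

# Made-up data, only to illustrate the computation
toy = pd.DataFrame({
    "payment_mode": ["Card", "Cash", "Card", "UPI", "Card"],
    "day_type": ["Weekday", "Weekday", "Weekend", "Weekday", "Weekday"],
})

summary = pd.DataFrame({
    "unique": toy.nunique(),                                 # number of distinct categories
    "top": toy.mode().iloc[0],                               # most frequent category
    "freq": toy.apply(lambda s: s.value_counts().iloc[0]),   # how often it occurs
})
print(summary)
```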

In [None]:
# Creating two lists one for all categorical columns and another for numerical columns

# List of categorical columns
cat_col = ['outlet_location', 'customer_type', 'item_category', 'item_name', 'payment_mode', 'time_of_day', 'special_event', 'day_type', 'day_of_week','Year', 'Month', 'Day', 'IsWeekend']

# List of numerical columns
num_col = ['quantity_sold', 'price_per_unit', 'total_bill', 'cost_price', 'profit', 'customer_rating', 'discount_percent', 'final_amount', 'cumulative_sales_outlet']   # each column listed once
        

In [None]:
# Getting the number of unique values in each categorical column

for col in cat_col:
    print(f'\nNo of unique values in {col} column are :')
    print(df[col].nunique())

In [None]:
# Getting the number of unique values in each numerical column

for col in num_col:
    print(f'\nNo of unique values in {col} column are :')
    print(df[col].nunique())

In [None]:
# getting unique values in categorical column

for col in cat_col:
    print(f'\nUnique values in {col} column are :')
    print(df[col].unique())

In [None]:
# getting value counts of categorical columns

for col in cat_col:
    counts = df[col].value_counts()
    print(f'\nValue_counts for {col} column :')
    print(counts)

In [None]:
# getting highest value for each numerical column

for column in num_col:
    print(f'\nMax value of {column} column is :')
    print(df[column].max())

In [None]:
# getting lowest value for each numerical column

for column in num_col:
    print(f'\nMin value of {column} column is :')
    print(df[column].min())

In [None]:
# Getting the top (most frequent) value of each categorical feature

for col in cat_col:
    print(f'\nTop value of {col} column is :')
    print(df[col].mode()[0])

In [None]:
# Checking skewness or shape of the data

for col in num_col:
    skew = df[col].skew()
    print(f"The skewness of column '{col}' is {round(skew,3)}\n")
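
To help read the numbers printed by the loop above: a skewness near 0 indicates a roughly symmetric column, a positive value a longer right tail, and a negative value a longer left tail. The sketch below illustrates this on two made-up series; the values are not taken from the dataset.

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])          # evenly spread around the mean
right_skewed = pd.Series([1, 1, 2, 2, 3, 20])   # one large value stretches the right tail

print(round(symmetric.skew(), 3))
print(round(right_skewed.skew(), 3))
```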

In [None]:
# Plotting pie charts for the categorical columns

# List of categorical columns
cat_col = ['outlet_location', 'customer_type', 'item_category', 'item_name', 'payment_mode', 'time_of_day', 'special_event', 'day_type', 'day_of_week']

custom_colors = ['#008080', '#66ffff', '#00cccc','#009999']

# Creating a grid of 3 rows and 3 columns of subplots
fig, axes = plt.subplots(3, 3, figsize=(22, 18))

# Change axes to 1D list so loop is easy
axes = axes.flatten()

# Go through each column in list
for i, col in enumerate(cat_col):
    # value_counts
    counts = df[col].value_counts()

    # Push out the biggest slice
    explode = [0.05 if v == counts.max() else 0 for v in counts]

    # Make pie chart
    axes[i].pie(counts,
                labels=counts.index,     # labels for each slice
                autopct='%1.1f%%',       # show percentage upto one decimal point
                colors=custom_colors,   
                explode=explode)         # push biggest slice

    # Title for pie chart
    axes[i].set_title(f'Distribution of {col}', fontsize=22, fontweight='bold')

    # Keep the pie chart circular
    axes[i].axis('equal')

# Hide extra empty plots
for j in range(len(cat_col), len(axes)):
    axes[j].axis('off')

# Adjust space between plots and overlapping
plt.tight_layout()

# Show all charts
plt.show()

In [None]:
# Using Interquartile range to check outliers 

for col in num_col:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = df[(df[col] < lower) | (df[col] > upper)]
    if not outliers.empty:
        print(f"Column '{col}' has {len(outliers)} outliers.")
    else:
        print(f"Column '{col}' has no outliers.")
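
The rule used in the loop above flags any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. A small worked sketch on made-up values; `iqr_bounds` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the lower and upper IQR fences for a numeric series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

values = pd.Series([1, 2, 3, 4, 100])   # made-up data with one extreme value
lower, upper = iqr_bounds(values)       # Q1 = 2, Q3 = 4, IQR = 2
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers.tolist())
```

Here the fences are -1 and 7, so only the value 100 is flagged.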

In [None]:
# Plotting boxplots to detect outliers 

# Create a figure with 5 rows and 2 columns of subplots
# Each axes object is an individual plot that holds one boxplot
fig, axes = plt.subplots(5, 2, figsize=(15, 20))

# Flatten the 2D array of axes into a 1D array so we can loop easily
axes = axes.flatten()

# Iterate through each column with its position, which selects the matching subplot
for i, col in enumerate(num_col):
    sns.boxplot(x=df[col], ax=axes[i],palette="Blues")
    axes[i].set_title(col)

# Adjusting the layout so that the plots don't overlap with each other
plt.tight_layout()
plt.show()

#### **`Distribution of quantity sold`** -
- The middle 50% of quantity sold values lie between about 1 and 3.
- The median quantity sold falls within that range.
- The lower whisker is 1 and the upper whisker is 4.
- There are no outliers in the quantity sold column.

#### **`Distribution of price per unit`** -
- The middle 50% of price per unit values lie between about 150 and 250.
- The median price per unit is near 200.
- Since no points fall beyond the whiskers, they end at the column's minimum and maximum.
- There are no outliers in the price per unit column.

#### **`Distribution of total bill`** -
- The middle 50% of total bill values lie between about 300 and 700.
- The median total bill is near 450.
- The lower whisker is near 100 and the upper whisker is near 1200.
- There are no outliers in the total bill column.

#### **`Distribution of cost price`** -
- The middle 50% of cost price values lie between about 200 and 500.
- The median cost price is near 300.
- The lower whisker is near 100 and the upper whisker is near 900.
- **There are a few high outliers in the cost price column.**

#### **`Distribution of profit`** -
- The middle 50% of profit values lie between about 80 and 210.
- The median profit is near 130.
- The lower whisker is 0 and the upper whisker is near 400.
- **There are extreme high outliers in the profit column.**

**`This dataset has 3 outliers in the cost_price column and 36 outliers in the profit column.`**

In [None]:
# Removing the outliers, because only a small number of rows are affected

for col in num_col:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    # Keep only the rows that lie inside the IQR fences for this column
    df = df[(df[col] >= lower) & (df[col] <= upper)]
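
Because the loop above filters `df` one column at a time, the quartiles for later columns are computed on already-filtered rows, so the result can depend on column order. One possible alternative, sketched here under that assumption, computes every fence on the unfiltered data and applies a single mask. `drop_iqr_outliers` is a hypothetical helper and `toy` is made-up data.

```python
import pandas as pd

def drop_iqr_outliers(frame: pd.DataFrame, columns: list[str], k: float = 1.5) -> pd.DataFrame:
    """Drop rows outside the IQR fences, with every fence computed on the unfiltered data."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # A row is kept only if every listed column lies within its own fences
    keep = ((frame[columns] >= lower) & (frame[columns] <= upper)).all(axis=1)
    return frame[keep]

toy = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})
cleaned = drop_iqr_outliers(toy, ["a", "b"])
print(cleaned)
```

Whether the sequential or the single-pass version is preferable depends on the analysis; the point is only that the two can remove different rows.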

In [None]:
# Plotting boxplots again to check the columns after outlier removal 

# Create a figure with 5 rows and 2 columns of subplots
# Each axes object is an individual plot that holds one boxplot
fig, axes = plt.subplots(5, 2, figsize=(15, 20))

# Flatten the 2D array of axes into a 1D array so we can loop easily
axes = axes.flatten()

# Iterate through each column with its position, which selects the matching subplot
for i, col in enumerate(num_col):
    sns.boxplot(x=df[col], ax=axes[i], palette="Blues")
    axes[i].set_title(col)

# Adjusting the layout so that the plots don't overlap with each other
plt.tight_layout()
plt.show()

In [None]:
# Using a log transformation to handle the remaining outliers in cost_price and profit
# A log transformation compresses large values, which reduces the effect of extreme values (outliers) and makes right-skewed data closer to symmetric

df['cost_price'] = np.log1p(df['cost_price'])
df['profit'] = np.log1p(df['profit'])
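
A minimal sketch of what `np.log1p` does to a right-skewed column, using made-up profit values. It also shows that `np.expm1` reverses the transform, which matters if the log-scaled columns ever need to be read back in their original units.

```python
import numpy as np
import pandas as pd

profit = pd.Series([10.0, 20.0, 30.0, 40.0, 400.0])  # made-up values with one large entry

logged = np.log1p(profit)     # log(1 + x), defined at x = 0
restored = np.expm1(logged)   # inverse transform, recovers the original values

print(round(profit.skew(), 3), round(logged.skew(), 3))
```

The skewness drops after the transform but does not vanish, so it is still worth re-running the IQR check afterwards, as the notebook does.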

In [None]:
# Using Interquartile range to check outliers after handling 

for col in num_col:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = df[(df[col] < lower) | (df[col] > upper)]
    if not outliers.empty:
        print(f"Column '{col}' has {len(outliers)} outliers.")
    else:
        print(f"Column '{col}' has no outliers.")

**`The dataset now has no missing values, no duplicate records and no outliers.`**

In [None]:
# Generating profile report after cleaning the dataset

Profile = ProfileReport(df)
Profile

In [None]:
# Displaying count plots for the categorical columns (item_name is plotted separately)

# List of categorical columns
cat_col = ['outlet_location', 'customer_type', 'item_category', 'payment_mode', 'time_of_day', 'special_event', 'day_type', 'day_of_week','Year', 'Month', 'Day', 'IsWeekend']

# Create subplots: 6 rows, 2 columns
fig, axes = plt.subplots(6, 2 , figsize=(22, 40)) 

# Flatten the 2D array of axes to 1D
axes = axes.flatten()

# Plot each categorical column
for i, col in enumerate(cat_col):
    sns.countplot(data=df, x=col, palette="Blues", ax=axes[i])
    axes[i].set_title(f'{col} Count', fontweight='bold', fontsize=14)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('Count', fontsize=12)
    axes[i].tick_params(axis='x', rotation=50)   # rotate tick labels on every subplot, not only the last one
plt.tight_layout()
plt.show()

#### **`Outlet Location Count`**
- There are five unique outlets, and each has more than 250 orders.
- The top category is Koregaon Park, with more than 400 orders.
#### **`Customers Type Count`**
- There are three types of customers (Regular, Tourist, New), and each has a count above 600.
#### **`Item Category Count`**
- There are three item categories (Snack, Beverage and Dessert), and each has a count above 600.
- The top category is Beverage, with a count near 650.
#### **`Payment Mode Count`**
- There are three unique payment modes (Cash, Card, UPI).
- Customers paid using all three modes, each with a count above 600.
- The top mode is Card, with a count near 650.
#### **`Time of the day Count`**
- Customers visit the cafes in the morning, afternoon, evening and at night, and each time slot has a count above 450.
- The top time of day is Morning, with more than 500 orders.
#### **`Special Event Count`**
- More than 1300 orders have no recorded special event (filled in as 'Unknown').
- Weekend Special, Diwali Offer and Coffee Fest each have a count below 200.
#### **`Day Type Count`**
- Nearly 500 orders fall on weekends and more than 1400 fall on weekdays.
#### **`Day of Week Count`**
- Thursday, Wednesday and Friday each have more than 300 orders.
- Sunday, Tuesday, Monday and Saturday each have fewer than 250 orders.
#### **`Year Count`**
- All of the records are from the year 2025.
#### **`Month Count`**
- All of the records are from the month of October.
#### **`Days Count`**
- Each of the 31 days has more than 60 orders.

In [None]:
# Countplot for food items

plt.figure(figsize=(15, 6))
sns.countplot(data=df, x='item_name', palette="Blues")
plt.xticks(rotation=50)
plt.title('Count of Food Items',fontweight='bold', fontsize=14)
plt.tight_layout()
plt.show()

#### **`Count of Food item`**
- Ice Cream and Ice Latte each appear in more than 140 orders.
- Every food item appears in more than 60 orders.

In [None]:
df.head()

In [None]:
# Plotting scatterplot of numerical columns with target column

# List of numerical columns
num_col =  ['quantity_sold', 'price_per_unit', 'total_bill','cost_price','temperature_c' ,'profit','customer_rating', 'discount_percent', 'final_amount']
        
fig, axes = plt.subplots(3,3, figsize=(15, 10))  
axes = axes.flatten() 

for i, col in enumerate(num_col):
    sns.scatterplot(x=df[col], y=df['cumulative_sales_outlet'], ax=axes[i])
    axes[i].set_title(f'{col} vs cumulative sales outlet')

plt.tight_layout()  # Adjust spacing
plt.show()

In [None]:
from sklearn.preprocessing import LabelEncoder  # import LabelEncoder from sklearn.preprocessing
le = LabelEncoder()  # initialize label encoder

# List of categorical columns
cat_col = ['outlet_location', 'customer_type','item_name', 'item_category', 'payment_mode', 'time_of_day', 'special_event', 'day_type', 'day_of_week','Year', 'Month', 'Day', 'IsWeekend']

# Apply label encoding to each categorical column
for col in cat_col:
    df[col] = le.fit_transform(df[col])  # Convert categories to numbers
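
`LabelEncoder` maps each category to an integer in sorted order, which gives nominal features such as `payment_mode` an artificial ranking. Tree-based models are usually tolerant of this, while a linear model may read the codes as magnitudes. The sketch below shows the codes produced on made-up values and, as one possible alternative that the notebook does not use, one-hot encoding with `pd.get_dummies`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

modes = pd.Series(["Cash", "Card", "UPI", "Card"])   # made-up values

le = LabelEncoder()
codes = le.fit_transform(modes)
print(list(le.classes_), codes.tolist())   # classes are stored in sorted order

# One-hot alternative: one 0/1 column per category, no implied order
one_hot = pd.get_dummies(modes, prefix="payment_mode", dtype=int)
print(one_hot.columns.tolist())
```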

In [None]:
df.head()    # now categorical columns are converted into numerical

In [None]:
df.columns

In [None]:
# Displaying correlation heatmap between all numerical columns

plt.figure(figsize=(22, 20))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='Blues')
plt.title('Correlation Heatmap')
plt.show()

### **`Top Positive Correlations`**
- Cost price is highly correlated with final amount, quantity sold, price per unit, profit and total bill.
- Total bill and cost price are highly correlated with final amount.
- Profit is highly correlated with price per unit.
- Quantity sold is highly correlated with total bill, cost price and profit.
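
Reading the strongest pairs off a large heatmap by eye is error-prone. A minimal sketch, on a made-up `toy` frame, of listing the pairs in descending order of correlation directly from the correlation matrix:

```python
import numpy as np
import pandas as pd

# Made-up data, only to illustrate the ranking step
toy = pd.DataFrame({
    "quantity_sold": [1, 2, 3, 4, 2],
    "total_bill": [150, 310, 440, 620, 290],
    "customer_rating": [4.5, 3.0, 4.8, 3.2, 4.0],
})

corr = toy.corr()
# Keep the strict upper triangle so each pair appears once and the diagonal is dropped
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack().dropna().sort_values(ascending=False)
print(pairs)
```

Applied to the notebook's dataframe, the same steps would use `df.corr(numeric_only=True)` in place of `toy.corr()`.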
                                                                

In [None]:
# Using robust scaling, which centers on the median and scales by the IQR, so any remaining extreme values have less influence on the scaled features

from sklearn.preprocessing import RobustScaler   # importing robust scaler from sklearn.preprocessing
scaler = RobustScaler()                     # initializing robust scaler

# List of categorical columns
# cat_col = ['outlet_location', 'customer_type','item_name', 'item_category', 'payment_mode', 'time_of_day', 'special_event', 'day_type', 'day_of_week','Year', 'Month', 'Day', 'IsWeekend']

# List of columns with numerical values
cols = ['quantity_sold','price_per_unit','total_bill','cost_price','profit','final_amount']

# Applying robust scaler to each numerical columns
df[cols] = scaler.fit_transform(df[cols])
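
With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range. The sketch below reproduces that by hand on made-up values to make the formula concrete.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # made-up column with one extreme value

scaled = RobustScaler().fit_transform(x)

# Same computation by hand: subtract the median, divide by the IQR
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

print(scaled.ravel())
print(manual.ravel())
```

The extreme value stays extreme after scaling, but it does not distort the scale of the other points. A common refinement is to fit the scaler on the training split only and reuse it on the test split, so that test-set statistics do not enter preprocessing.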

In [None]:
df.head()

In [None]:
from sklearn.model_selection import train_test_split  # import train_test_split to split the dataset into testing and training datasets

# Defining X and y
X = df.drop('cumulative_sales_outlet', axis=1)  # all columns except the target column
y = df['cumulative_sales_outlet']               # only target column

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=42)

In [None]:
# Gradient Boosting Regressor

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.ensemble import GradientBoostingRegressor
gb_model = GradientBoostingRegressor()
gb_model.fit(X_train, y_train)
y_pred = gb_model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae}, MSE: {mse}, RMSE: {rmse}")
print(f"R² Score: {r2}")
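
For reference, the four metrics printed above can be written out by hand. The sketch below uses made-up targets and predictions and compares the manual formulas with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])    # made-up targets
y_hat = np.array([2.0, 5.0, 8.0, 10.0])    # made-up predictions

residuals = y_true - y_hat
mae_manual = np.mean(np.abs(residuals))    # mean absolute error
mse_manual = np.mean(residuals ** 2)       # mean squared error
rmse_manual = np.sqrt(mse_manual)          # root mean squared error, in target units
# R² = 1 - (residual sum of squares) / (total sum of squares around the mean)
r2_manual = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae_manual, mse_manual, rmse_manual, r2_manual)
```

MAE and RMSE are in the same units as the target, while R² is unitless, so R² is the easier number to compare across differently scaled targets.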

In [None]:
# Gradient Boosting Regressor with manually chosen hyperparameters

gbr_model = GradientBoostingRegressor(
    n_estimators=150,    
    learning_rate=0.05, 
    max_depth=8,            # Maximum depth of each tree
    min_samples_split=10,   # Minimum samples required to split a node
    min_samples_leaf=5,     # Minimum samples required at a leaf
    random_state=42
)
gbr_model.fit(X_train, y_train)
y_pred_gbr = gbr_model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred_gbr)
mse = mean_squared_error(y_test, y_pred_gbr)
rmse = mse ** 0.5
r2 = r2_score(y_test, y_pred_gbr)
print(f"Gradient Boosting Regressor Results:")
print(f"MAE: {mae:.4f}, MSE: {mse:.4f}, RMSE: {rmse:.4f}, R²: {r2:.4f}")

In [None]:
# XGBoost Regressor

from xgboost import XGBRegressor
xgb_model = XGBRegressor()
xgb_model.fit(X_train, y_train)
y_pred = xgb_model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae}, MSE: {mse}, RMSE: {rmse}")
print(f"R² Score: {r2}")

In [None]:
# Decision Tree Regressor

from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred_dt)
mse = mean_squared_error(y_test, y_pred_dt)
rmse = mse ** 0.5
r2 = r2_score(y_test, y_pred_dt)
print(f"Decision Tree Regressor Results:")
print(f"MAE: {mae:.4f}, MSE: {mse:.4f}, RMSE: {rmse:.4f}, R²: {r2:.4f}")

In [None]:
# Multiple linear Regression

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae}, MSE: {mse}, RMSE: {rmse}")
print(f"R² Score: {r2}")

In [None]:
# Initializing all models
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "XGBoost": XGBRegressor(random_state=42)
}

# Training and evaluating each model
results = {}
for model_name, model in models.items():
    # Training  model
    model.fit(X_train, y_train)
    
    # Making predictions
    y_pred_model = model.predict(X_test)
    
    # Evaluating metrics
    mae = mean_absolute_error(y_test, y_pred_model)
    mse = mean_squared_error(y_test, y_pred_model)
    rmse = mse ** 0.5
    r2 = r2_score(y_test, y_pred_model)
    
    # Storing results
    results[model_name] = {
        "MAE": mae,
        "MSE": mse,
        "RMSE": rmse,
        "R²": r2
    }

# Displaying results
results_df = pd.DataFrame(results).T
results_df.sort_values(by="R²", ascending=False)
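
The table above ranks the models on a single 80/20 split, so the ordering can shift with a different `random_state`. One way to get a steadier comparison is k-fold cross-validation. This is a sketch on synthetic data from `make_regression`; `X_demo`, `y_demo` and `candidates` are hypothetical names, and the notebook's own `X` and `y` would take their place.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the notebook's X and y
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Mean R² over 5 folds for each candidate model
cv_r2 = {
    name: cross_val_score(est, X_demo, y_demo, cv=5, scoring="r2").mean()
    for name, est in candidates.items()
}
print(cv_r2)
```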