# Step 4: Pre-processing and Training Data

In [1]:
import pandas as pd

In [2]:
df = pd.read_excel('Cleaned_data.xlsx')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1330 entries, 0 to 1329
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Unnamed: 0      1330 non-null   int64         
 1   Region          1330 non-null   object        
 2   Country         1330 non-null   object        
 3   Item_Type       1330 non-null   object        
 4   Sales_Channel   1330 non-null   object        
 5   Order_Priority  1330 non-null   object        
 6   Order_Date      1330 non-null   datetime64[ns]
 7   Order_ID        1330 non-null   int64         
 8   Ship_Date       1330 non-null   datetime64[ns]
 9   Units_Sold      1330 non-null   int64         
 10  Unit_Price      1330 non-null   float64       
 11  Unit_Cost       1330 non-null   float64       
 12  Total_Revenue   1330 non-null   float64       
 13  Total_Cost      1330 non-null   float64       
 14  Total_Profit    1330 non-null   float64       
 15  Ship

# Dropping the Unnecessary Column

In [4]:
# Dropping the 'Unnamed' column as it is just an index column
df = df.drop(columns=['Unnamed: 0'])

df.columns

Index(['Region', 'Country', 'Item_Type', 'Sales_Channel', 'Order_Priority',
       'Order_Date', 'Order_ID', 'Ship_Date', 'Units_Sold', 'Unit_Price',
       'Unit_Cost', 'Total_Revenue', 'Total_Cost', 'Total_Profit',
       'Shipping_Days'],
      dtype='object')

# Creating Dummy Variables for Categorical Columns

In [5]:
# List of categorical columns to encode
categorical_cols = ['Region', 'Country', 'Item_Type', 'Sales_Channel', 'Order_Priority']

# One-hot encoding the categorical columns into numerical dummy columns
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)


df_encoded.columns          # new column names after encoding

Index(['Order_Date', 'Order_ID', 'Ship_Date', 'Units_Sold', 'Unit_Price',
       'Unit_Cost', 'Total_Revenue', 'Total_Cost', 'Total_Profit',
       'Shipping_Days', 'Country_Andorra', 'Country_Armenia',
       'Country_Austria', 'Country_Belarus', 'Country_Belgium',
       'Country_Bosnia and Herzegovina', 'Country_Bulgaria', 'Country_Croatia',
       'Country_Cyprus', 'Country_Czech Republic', 'Country_Denmark',
       'Country_Estonia', 'Country_Finland', 'Country_France',
       'Country_Georgia', 'Country_Germany', 'Country_Greece',
       'Country_Hungary', 'Country_Iceland', 'Country_Ireland',
       'Country_Italy', 'Country_Kosovo', 'Country_Latvia',
       'Country_Liechtenstein', 'Country_Lithuania', 'Country_Luxembourg',
       'Country_Macedonia', 'Country_Malta', 'Country_Moldova ',
       'Country_Monaco', 'Country_Montenegro', 'Country_Netherlands',
       'Country_Norway', 'Country_Poland', 'Country_Portugal',
       'Country_Romania', 'Country_Russia', 'Country_San Mar

### Code overview:

I used **pd.get_dummies()** to convert the categorical columns into numerical dummy (0/1) variables.

**drop_first=True** drops the first category of each column as a baseline, which avoids perfect multicollinearity among the dummy columns.
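
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hypothetical three-row frame (the values are illustrative and not taken from `Cleaned_data.xlsx`). It also shows a side effect: a column with only one distinct value produces no dummy columns at all. That may be why no `Region_` columns appear in the output above, assuming `Region` holds a single value in this dataset.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only
toy = pd.DataFrame({
    "Region": ["Europe", "Europe", "Europe"],        # single category
    "Sales_Channel": ["Online", "Offline", "Online"],
})

cols = ["Region", "Sales_Channel"]
full = pd.get_dummies(toy, columns=cols)                      # one column per category
reduced = pd.get_dummies(toy, columns=cols, drop_first=True)  # first category dropped

print(list(full.columns))
print(list(reduced.columns))
```

With `drop_first=True`, a row of all zeros in the remaining dummy columns stands for the dropped baseline category.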

# Standardizing the Numerical Features

In [6]:
from sklearn.preprocessing import StandardScaler            # library for standard Scaling

# List of numerical columns for standardization
numeric_cols = ['Units_Sold', 'Unit_Price', 'Unit_Cost', 'Total_Revenue', 'Total_Cost', 'Total_Profit', 'Shipping_Days']

# Initializing the StandardScaler as scaler
scaler = StandardScaler()

# Applying standardization to the numerical columns
df_encoded[numeric_cols] = scaler.fit_transform(df_encoded[numeric_cols])

# Displaying a summary of the standardized data
print("Standardized numeric features.")

df_encoded[numeric_cols].describe()

Standardized numeric features.


|       | Units_Sold    | Unit_Price   | Unit_Cost     | Total_Revenue | Total_Cost    | Total_Profit | Shipping_Days |
|-------|---------------|--------------|---------------|---------------|---------------|--------------|---------------|
| count | 1330.0        | 1330.0       | 1330.0        | 1330.0        | 1330.0        | 1330.0       | 1330.0        |
| mean  | -5.609548e-17 | 1.335607e-18 | -7.746519e-17 | -1.028417e-16 | -2.737994e-17 | 2.938335e-17 | 6.410912e-17  |
| std   | 1.000376      | 1.000376     | 1.000376      | 1.000376      | 1.000376      | 1.000376     | 1.000376      |
| min   | -1.704131     | -1.176402    | -1.024045     | -0.9791042    | -0.9385844    | -1.075377    | -1.700192     |
| 25%   | -0.8916129    | -0.8431323   | -0.8598134    | -0.7620871    | -0.7506134    | -0.8277662   | -0.8767197    |
| 50%   | 0.003060859   | -0.5101853   | -0.5099975    | -0.3574975    | -0.4217324    | -0.2758673   | 0.01537562    |
| 75%   | 0.8644933     | 0.7931554    | 0.4320634     | 0.4443894     | 0.3844298     | 0.5253938    | 0.8388482     |
| max   | 1.739532      | 1.856809     | 1.917816      | 2.254104      | 2.086995      | 2.555134     | 1.730944      |


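### Code overview:

I used StandardScaler so that every numerical column has a mean of 0 and a standard deviation of 1. The summary above confirms this: the means are zero up to floating-point error, and the standard deviations read 1.000376 rather than exactly 1 only because `describe()` reports the sample standard deviation (dividing by n - 1), whereas StandardScaler scales by the population standard deviation (dividing by n).

**Note:**

Overall, this step is crucial for many machine learning algorithms that are sensitive to the scale of the data.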
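
The small gap between the reported standard deviation (1.000376) and 1 can be reproduced with a short sketch. The data below is synthetic; only the row count (1330) is taken from the notebook. The column name `x` and the distribution parameters are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic column with the same number of rows as the dataset (1330)
rng = np.random.default_rng(0)
values = pd.DataFrame({"x": rng.normal(loc=50, scale=10, size=1330)})

scaled = StandardScaler().fit_transform(values)[:, 0]

pop_std = scaled.std(ddof=0)      # population std: what StandardScaler sets to 1
sample_std = scaled.std(ddof=1)   # sample std: what DataFrame.describe() reports
expected = np.sqrt(1330 / 1329)   # ratio between the two definitions

print(pop_std, sample_std, expected)
```

The sample standard deviation equals sqrt(1330 / 1329), which rounds to the 1.000376 shown in the summary table.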

# Splitting Data into Training and Testing Sets

In [7]:
from sklearn.model_selection import train_test_split           # library to generate/split data into test and train
import numpy as np

# Step 1: Drop unnecessary columns
df = df.drop(columns=['Order_ID','Ship_Date', 'Total_Revenue', 'Total_Cost'])


# Step 2: Extract features from 'Order_Date'
df['Year'] = df['Order_Date'].dt.year
df['Month'] = df['Order_Date'].dt.month



# Step 3: One-hot encode categorical columns
categorical_cols = ['Region', 'Country', 'Item_Type', 'Sales_Channel', 'Order_Priority']
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)


# Step 4: Standardize numerical features
numeric_cols = ['Units_Sold', 'Unit_Price', 'Unit_Cost', 'Shipping_Days', 'Year', 'Month']
scaler = StandardScaler()
df_encoded[numeric_cols] = scaler.fit_transform(df_encoded[numeric_cols])


# Step 5: Define features (X) and target (y)
X = df_encoded.drop(columns=['Total_Profit', 'Order_Date'])     # Use all features except the target 'Total_Profit' and the raw 'Order_Date'
y = df_encoded['Total_Profit']

# Step 6: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=14)

# Display the shapes of the datasets
print("Training features shape:", X_train.shape)
print("Testing features shape:", X_test.shape)
print("Training target shape:", y_train.shape)
print("Testing target shape:", y_test.shape)


Training features shape: (1064, 68)
Testing features shape: (266, 68)
Training target shape: (1064,)
Testing target shape: (266,)


## Explanation:

### Step 1:

Removing Total_Revenue and Total_Cost to avoid data leakage, as they can be derived from other columns.

Removing Ship_Date since it is redundant when Order_Date and Shipping_Days are already included.

Removing the Order_ID column, as it is just a unique identifier for each order.

**By excluding these redundant columns, I have reduced the risk of overfitting and ensured that the model learns meaningful patterns rather than relying on direct relationships.**

### Step 2:

Extracting the year and month from the Order_Date column so that the model can capture seasonal patterns and trends, which can be important for sales predictions.

### Step 3:

Creating dummy variables by one-hot encoding the categorical columns (Region, Country, Item_Type, Sales_Channel, Order_Priority).

### Step 4:

Standardizing the numerical features, which rescales every numerical column to a mean of 0 and a standard deviation of 1. In simple words, all numerical columns are put on the same scale regardless of their original units.
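
One caveat about the cell above: the scaler is fit on all rows before the split, so statistics from the test rows influence the scaling. A common alternative is to split first and fit the scaler on the training rows only. The following is a minimal sketch of that ordering on synthetic data. The column names are borrowed from the dataset, but the values and the target are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (illustrative values)
rng = np.random.default_rng(14)
X = pd.DataFrame({
    "Units_Sold": rng.integers(1, 10_000, size=200).astype(float),
    "Unit_Price": rng.uniform(10, 700, size=200),
})
y = pd.Series(rng.normal(size=200), name="target")

# Split first, so the test rows stay unseen during preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=14)

# Fit the scaler on the training rows only, then reuse it for the test rows
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X.columns, index=X_test.index)

print(X_train_scaled.mean().round(6).tolist())
print(X_test_scaled.shape)
```

For the three models used later this ordering should matter little, since tree-based models do not depend on feature scale and ordinary least squares predictions are unaffected by rescaling the inputs. It becomes important for scale-sensitive models.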

### Step 5:

Defining the feature and target columns and saving them in X and y respectively.

### Step 6:

Finally, splitting the data into training and test sets, with 80% of the rows (1,064) used for training and 20% (266) for testing. Setting random_state=14 fixes the random seed, so the same split is reproduced on every run.

**NOTE:** You can change random_state as per your choice; a different value produces a different split and slightly different metrics.
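
The role of `random_state` can be checked with a short sketch. It splits a plain array of 1330 integers (a stand-in for the row indices, not the real data) twice with the same seed and confirms that both splits are identical and have the sizes reported above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 1330 row indices of the dataset
rows = np.arange(1330)

# Two splits with the same seed
a_train, a_test = train_test_split(rows, test_size=0.2, random_state=14)
b_train, b_test = train_test_split(rows, test_size=0.2, random_state=14)

print(len(a_train), len(a_test))          # sizes of the 80/20 split
print(np.array_equal(a_test, b_test))     # same seed gives the same test rows
```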

# Step 5: Data Modeling

### Importing essential libraries for model building

In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import GridSearchCV

## Developing a function that calculates the metrics

In [9]:
# function for calculating R² and RMSE
def calculate_metrics(y_true, y_pred):                  ## inputs: True target value as y_true and predicted value as y_pred
    r2 = r2_score(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return r2, rmse
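
As a sanity check on what these two metrics measure, the sketch below computes them by hand on four made-up values and compares the results with scikit-learn. R² is 1 minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared residual. The arrays are illustrative only.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

ss_res = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
r2_manual = 1 - ss_res / ss_tot
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantities from scikit-learn
r2_sklearn = r2_score(y_true, y_pred)
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(r2_manual, r2_sklearn)
print(rmse_manual, rmse_sklearn)
```

Because RMSE is expressed in the units of the target, the values reported below are in the same units as Total_Profit.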


## Model 1: Simple Linear Regression Model

In [10]:
linear_model = LinearRegression()               # creating the model object

# Fitting and predicting
linear_model.fit(X_train, y_train)
y_pred_train_linear = linear_model.predict(X_train)
y_pred_test_linear = linear_model.predict(X_test)

# Evaluating model metrics using the calculate_metrics function defined above
r2_train_linear, rmse_train_linear = calculate_metrics(y_train, y_pred_train_linear)
r2_test_linear, rmse_test_linear = calculate_metrics(y_test, y_pred_test_linear)

print("Model 1: Linear Regression")
print(f"R² (Train): {r2_train_linear:.2f}, R² (Test): {r2_test_linear:.2f}")
print(f"RMSE (Train): {rmse_train_linear:.2f}, RMSE (Test): {rmse_test_linear:.2f}\n")

Model 1: Linear Regression
R² (Train): 0.85, R² (Test): 0.87
RMSE (Train): 135484.49, RMSE (Test): 120776.40



## Model 2: Random Forest Regressor with Hyperparameter Tuning

In [11]:
rf_model = RandomForestRegressor(random_state=14)  ## creating model object at random state 14

# Defining hyperparameter grid
param_grid_rf = {
    'n_estimators': [5, 10],  # setting number of trees
    'max_depth': [3],        # setting depth
    'min_samples_split': [10, 15]  #  setting samples split
}

# Performing Grid Search for the best hyperparameters
grid_search_rf = GridSearchCV(estimator=rf_model, param_grid=param_grid_rf, cv=3, scoring='r2', n_jobs=-1)
grid_search_rf.fit(X_train, y_train)

# Best estimator from Grid Search
best_rf_model = grid_search_rf.best_estimator_
print(f"Best Random Forest Parameters: {grid_search_rf.best_params_}")

# Predicting with the best model (GridSearchCV has already refit it on the full training set)
y_pred_train_rf = best_rf_model.predict(X_train)
y_pred_test_rf = best_rf_model.predict(X_test)

# Evaluating model metrics using the calculate_metrics function
r2_train_rf, rmse_train_rf = calculate_metrics(y_train, y_pred_train_rf)
r2_test_rf, rmse_test_rf = calculate_metrics(y_test, y_pred_test_rf)

## final metric results
print("Model 2: Random Forest Regressor")
print(f"R² (Train): {r2_train_rf:.2f}, R² (Test): {r2_test_rf:.2f}")
print(f"RMSE (Train): {rmse_train_rf:.2f}, RMSE (Test): {rmse_test_rf:.2f}\n")


Best Random Forest Parameters: {'max_depth': 3, 'min_samples_split': 10, 'n_estimators': 10}
Model 2: Random Forest Regressor
R² (Train): 0.91, R² (Test): 0.90
RMSE (Train): 104655.93, RMSE (Test): 106013.93
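
The grid above contains only four parameter combinations (2 × 1 × 2), so it can be useful to look at how close their cross-validated scores are instead of reading `best_params_` alone. The following is a minimal sketch using the same grid on synthetic data from `make_regression`. The data and the resulting scores are illustrative and say nothing about the sales dataset.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data as a stand-in for X_train / y_train
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=14)

param_grid = {"n_estimators": [5, 10], "max_depth": [3], "min_samples_split": [10, 15]}
search = GridSearchCV(RandomForestRegressor(random_state=14), param_grid, cv=3, scoring="r2")
search.fit(X_demo, y_demo)

# One row per parameter combination, ordered from best to worst
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
```

If the mean scores differ by less than their standard deviation across folds, the choice between those combinations is probably not significant.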



## Model 3: Decision Tree Regressor

In [12]:
# Limiting the depth to avoid overfitting; same depth as in Model 2
dt_model = DecisionTreeRegressor(max_depth=3, random_state=15)

# Fitting and predicting
dt_model.fit(X_train, y_train)
y_pred_train_dt = dt_model.predict(X_train)
y_pred_test_dt = dt_model.predict(X_test)

# Evaluating model
r2_train_dt, rmse_train_dt = calculate_metrics(y_train, y_pred_train_dt)
r2_test_dt, rmse_test_dt = calculate_metrics(y_test, y_pred_test_dt)


## observing metrics
print("Model 3: Decision Tree Regressor")
print(f"R² (Train): {r2_train_dt:.2f}, R² (Test): {r2_test_dt:.2f}")
print(f"RMSE (Train): {rmse_train_dt:.2f}, RMSE (Test): {rmse_test_dt:.2f}\n")


Model 3: Decision Tree Regressor
R² (Train): 0.89, R² (Test): 0.88
RMSE (Train): 113483.33, RMSE (Test): 115176.42



# Final Report

The goal of this project was to build predictive models to estimate **Total Profit** based on various features from a sales dataset. I explored three regression models: **Linear Regression**, **Random Forest Regressor**, and **Decision Tree Regressor**. Each model was evaluated on both the training and testing datasets using **R²** (coefficient of determination) and **RMSE** (Root Mean Squared Error) as evaluation metrics.

### **Model Performance Summary**

| Model                     | R² (Train) | R² (Test) | RMSE (Train) | RMSE (Test) |
|---------------------------|------------|-----------|--------------|-------------|
| Linear Regression         | 0.85       | 0.87      | 135,484.49   | 120,776.40  |
| Random Forest Regressor   | 0.91       | 0.90      | 104,655.93   | 106,013.93  |
| Decision Tree Regressor   | 0.89       | 0.88      | 113,483.33   | 115,176.42  |
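
The same summary can be held in a DataFrame to rank the models and to compare training and test error. The numbers below are copied from the table above. The `rmse_gap` column (test RMSE minus training RMSE) is a helper introduced here for illustration.

```python
import pandas as pd

# Metrics copied from the Model Performance Summary table
summary = pd.DataFrame(
    {
        "r2_train": [0.85, 0.91, 0.89],
        "r2_test": [0.87, 0.90, 0.88],
        "rmse_train": [135484.49, 104655.93, 113483.33],
        "rmse_test": [120776.40, 106013.93, 115176.42],
    },
    index=["Linear Regression", "Random Forest Regressor", "Decision Tree Regressor"],
)

# Positive gap: the model does worse on unseen data than on training data
summary["rmse_gap"] = summary["rmse_test"] - summary["rmse_train"]

ranked = summary.sort_values("rmse_test")
print(ranked[["r2_test", "rmse_test", "rmse_gap"]])
```

The two tree-based models show a small positive gap, while Linear Regression has a lower RMSE on the test set than on the training set, which can happen by chance with a single random split.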

**Important Note: I used the same depth of 3 for both the Random Forest and Decision Tree models to make a fair comparison between them. You can set a different depth to see how it affects the accuracy of both models.**


### Analysis of Each Model

1. **Linear Regression**:
   - This model served as the baseline. With an R² score of **0.87** on the test data, it performed well in capturing the linear relationships between the features and the target variable.
   - The **RMSE** of **120,776.40** on the test data indicates the typical size of the error in predicting Total Profit, expressed in the same units as the target.
   - The model's lower complexity makes it easy to interpret, but it may not capture non-linear relationships effectively.

2. **Random Forest Regressor**:
   - The Random Forest model provided the best performance, with an R² score of **0.90** on the test data.
   - The **RMSE** of **106,013.93** was the lowest among the three models, indicating more accurate predictions.
   - This model benefited from **hyperparameter tuning** (`max_depth=3`, `min_samples_split=10`, `n_estimators=10`), which helped prevent overfitting while still capturing complex patterns in the data.

3. **Decision Tree Regressor**:
   - The Decision Tree model achieved an R² score of **0.88** on the test data, slightly lower than Random Forest but still competitive.
   - The **RMSE** was **115,176.42**, which is higher than that of the Random Forest but better than the baseline Linear Regression model.
   - Limiting the tree depth to `max_depth=3` (the same as in the Random Forest Regressor) helped reduce overfitting, but a single tree is still more prone to capturing noise in the data than ensemble methods like Random Forest.
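
To see how tree depth trades training fit against generalization, the sketch below sweeps `max_depth` for a single decision tree on synthetic data from `make_regression`. The data, the depth values, and the resulting scores are assumptions for illustration and do not reproduce the results in this report.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only)
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=14)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=14)

# Train and test R² for each depth; None lets the tree grow fully
depth_scores = {}
for depth in [2, 3, 5, 8, None]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=14).fit(X_tr, y_tr)
    depth_scores[depth] = (
        r2_score(y_tr, tree.predict(X_tr)),
        r2_score(y_te, tree.predict(X_te)),
    )

for depth, (train_r2, test_r2) in depth_scores.items():
    print(f"max_depth={depth}: R² train={train_r2:.2f}, test={test_r2:.2f}")
```

A fully grown tree fits the training data almost perfectly, so a widening gap between training and test R² as depth increases is the usual sign of overfitting.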


### Best Model Selection


- **Best Model**: The **Random Forest Regressor** is the best-performing model based on both R² and RMSE scores. It balances bias and variance well, providing accurate predictions without overfitting.
- **Second Choice**: The **Decision Tree Regressor** is a good alternative if interpretability is important, as it offers a more transparent decision-making process compared to Random Forest.
- **Baseline Model**: The **Linear Regression** model serves as a useful baseline, demonstrating strong performance while being simple and interpretable.
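
The interpretability of a shallow decision tree can be illustrated by printing its rules as text with `export_text`. This is a minimal sketch on synthetic data. The feature names are borrowed from the dataset purely as labels, so the printed thresholds have no relation to the real model.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data; the names below are hypothetical labels for the three columns
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=15)
feature_names = ["Units_Sold", "Unit_Price", "Unit_Cost"]

tree = DecisionTreeRegressor(max_depth=3, random_state=15).fit(X_demo, y_demo)

# Each line is a split condition or a leaf with its predicted value
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

With `max_depth=3` the tree has at most eight leaves, so the full set of rules fits on one screen. A Random Forest with ten such trees has no equally compact summary.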


### Analysis of Questions

**Does my data involve a time series or forecasting? If so, am I splitting the train and test data appropriately?**


  - Yes, the dataset includes a time-based feature (`Order_Date`), from which I extracted **Year** and **Month** to capture potential seasonal patterns.
  - However, this project did not explicitly use a time series or forecasting approach (e.g., time-based splitting or a rolling forecasting origin). Instead, I used a random split (`train_test_split`) to divide the data into training and testing sets.
  - This method is appropriate in this context because:
    - I included time-based features in the model (Year and Month), allowing it to capture some time-related patterns.
    - The task has no explicit sequential dependence (e.g., future predictions based purely on historical data), so a random split was chosen.
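
If the model were later used to predict profit for dates after everything in the training data, a chronological split would be the more conservative check. The following is a minimal sketch of such a split, assuming a frame with an `Order_Date` column. The dates and profit values are synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic orders, shuffled to mimic rows that are not stored in date order
rng = np.random.default_rng(14)
orders = pd.DataFrame({
    "Order_Date": pd.date_range("2015-01-01", periods=100, freq="D"),
    "Total_Profit": rng.normal(size=100),
}).sample(frac=1, random_state=14)

# Sort by date, then train on the earliest 80% and test on the latest 20%
orders_sorted = orders.sort_values("Order_Date")
cutoff = int(len(orders_sorted) * 0.8)
train_part = orders_sorted.iloc[:cutoff]
test_part = orders_sorted.iloc[cutoff:]

print(train_part["Order_Date"].max(), test_part["Order_Date"].min())
```

Every test row is then later than every training row, which mirrors how a forecasting model would be used in practice.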


**Is my response variable continuous or categorical?**


  - The response variable, **Total Profit**, is a **continuous numeric variable**. It appears to represent a dollar amount that can take on any value within a certain range, making this a **regression problem**.
  - Given the continuous nature of the target variable, I chose regression models (Linear Regression, Random Forest Regressor, Decision Tree Regressor) to predict it.


# Conclusion

1. The **Random Forest Regressor** is the recommended model, as it provided the best balance between high accuracy (R² of 0.90) and low error (RMSE of 106,013.93).
2. The features used in the model (including time-based features extracted from `Order_Date`) helped capture trends and patterns, contributing to improved performance.
