# <b> Feature Scaling and Data Modeling </b>

Feature scaling transforms the variables in a dataset to a common scale so that no variable dominates a machine learning model simply because of its units or magnitude. This matters because some algorithms, such as K-Nearest Neighbors or Support Vector Machines, rely on distances between points and are therefore sensitive to differences in scale between variables. The `StandardScaler` used in this notebook rescales each feature to zero mean and unit variance, so that all features have comparable ranges.
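As an illustration of what standardization does, the sketch below applies the formula (value minus column mean, divided by column standard deviation) by hand and compares it with `StandardScaler`. The array `X` is a hypothetical toy example, not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales (not from the dataset).
X = np.array([[1.0, 20000.0],
              [2.0, 35000.0],
              [3.0, 50000.0],
              [4.0, 65000.0]])

# Standardization: subtract each column's mean, divide by its standard deviation.
# StandardScaler uses the population standard deviation (ddof=0), as np.std does by default.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

# Both approaches should agree, and each scaled column has mean 0 and std 1.
print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, both columns are expressed in units of standard deviations, so neither one outweighs the other in a distance computation.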

Data modeling, on the other hand, is the process of creating and training a machine learning model using a dataset. The goal of data modeling is to create a model that can predict an outcome or classify data based on input features. There are many different types of machine learning models, including linear regression, decision trees, random forests, neural networks, and others.

In [2]:
# Import the necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [3]:
data = pd.read_csv('Dataset/Processed.csv')

In [4]:
# Separate the features (x) from the target (y)

x = data.drop(columns=['Weekly_Sales'])
y = data['Weekly_Sales']

In [5]:
scaler = StandardScaler()

In [6]:
scaled_data = scaler.fit_transform(x)

In [7]:
# Split the data into training (80%) and test (20%) sets

x_train, x_test, y_train, y_test = train_test_split(scaled_data, y, test_size=0.2, random_state=50)
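Note that the cells above fit the scaler on the full dataset before splitting, so the test rows influence the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on synthetic data; `X_demo`, `y_demo`, and the other names are hypothetical and chosen so they do not clash with the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration, not the sales dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

# 1. Split first, so the test rows stay unseen.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=50
)

# 2. Fit the scaler on the training rows only.
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)

# 3. Reuse the training mean and standard deviation for the test rows.
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

For the models in this notebook the difference is likely small, since ordinary least squares and tree-based models are largely insensitive to this kind of rescaling, but the split-first order keeps the evaluation honest for scale-sensitive models as well.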

<b> Model Training and Evaluation </b>

Model Used : Linear Regression

Linear Regression is a statistical method for modeling the linear relationship between a dependent variable and one or more independent variables. It fits a line (a hyperplane when there are several features) that minimizes the sum of squared differences between the observed values and the fitted values. The fitted model can then be used to predict the dependent variable from values of the independent variable(s).
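To make the "minimize the squared differences" idea concrete, the following sketch solves the least-squares problem directly with NumPy and checks that `LinearRegression` recovers the same coefficients. The data (`X_toy`, `y_toy`) is synthetic and the true coefficients are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3 + 2*x0 - 1*x1 + small noise (assumed for illustration).
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 2))
y_toy = 3.0 + 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=100)

# Least squares by hand: prepend a column of ones for the intercept,
# then find the coefficients that minimize the sum of squared residuals.
A = np.column_stack([np.ones(len(X_toy)), X_toy])
beta, *_ = np.linalg.lstsq(A, y_toy, rcond=None)

# The same fit with scikit-learn.
model = LinearRegression().fit(X_toy, y_toy)

print("lstsq  :", beta)
print("sklearn:", model.intercept_, model.coef_)
```

Both approaches should agree up to floating-point error, with estimates close to the assumed intercept 3 and slopes 2 and -1.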

In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [12]:
lr = LinearRegression()

In [13]:
lr.fit(x_train,y_train)

In [14]:
y_predict = lr.predict(x_test)

In [15]:
# Calculating the mean squared error

mse = mean_squared_error(y_test,y_predict)

In [16]:
# Calculating root mean squared error
rmse = np.sqrt(mse)

In [17]:
print('MSE',mse)
print('RMSE',rmse)

MSE 265739234814.2515
RMSE 515499.0153378098


In [18]:
r2_score(y_test,y_predict)

0.1315480049548715

<b> Inference : </b> Linear regression explains only about 13% of the variance in Weekly_Sales on the test set (r2_score ≈ 0.13), so it fits this data poorly.
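For reference, the metrics used above can be computed by hand. The sketch below does so on a small hypothetical pair of arrays (`y_true`, `y_hat`) and compares the results with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical observed and predicted values (not from the dataset).
y_true = np.array([10.0, 12.0, 15.0, 11.0, 18.0])
y_hat = np.array([11.0, 11.5, 14.0, 12.5, 16.0])

# MSE: mean of squared residuals. RMSE: its square root, in the target's units.
mse_manual = np.mean((y_true - y_hat) ** 2)
rmse_manual = np.sqrt(mse_manual)

# R2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(rmse_manual)
print(r2_manual, r2_score(y_true, y_hat))

# A model that always predicts the mean of y_true scores R2 = 0.
baseline = np.full_like(y_true, y_true.mean())
print(r2_score(y_true, baseline))
```

Read this way, an R2 of 0.13 means the model removes only about 13% of the squared error that a constant mean prediction would make.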

Model Used : Decision Tree Regressor

The decision tree regressor is a type of machine learning model that is used for regression problems, where the goal is to predict a continuous numerical value. It works by constructing a tree-like structure, where each internal node represents a decision based on the values of one of the input features, and each leaf node represents a predicted output value.

The decision tree regressor algorithm works by recursively partitioning the feature space into regions. At each step it selects the feature and threshold whose split most reduces the variance of the target within the resulting child nodes (for regression, this variance reduction plays the role that information gain plays in classification). The process continues until a stopping criterion is reached, such as a maximum depth or a minimum number of samples per leaf node.
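The split selection described above can be sketched for a single feature as follows. This is a simplified illustration, not scikit-learn's implementation; the helper `best_split` and the toy arrays are hypothetical.

```python
import numpy as np

def best_split(feature, target):
    """Return (threshold, weighted child variance) for the best split on one feature."""
    order = np.argsort(feature)
    f, t = feature[order], target[order]
    best_threshold, best_score = None, np.inf
    for i in range(1, len(f)):
        if f[i] == f[i - 1]:
            continue  # no threshold can separate identical feature values
        left, right = t[:i], t[i:]
        # Size-weighted average of the target variance in the two child nodes.
        score = (len(left) * left.var() + len(right) * right.var()) / len(t)
        if score < best_score:
            # Place the threshold midway between the two neighbouring values.
            best_threshold, best_score = (f[i] + f[i - 1]) / 2.0, score
    return best_threshold, best_score

# Toy data with an obvious jump in the target between feature values 3 and 10.
feature = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
target = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.5])

threshold, score = best_split(feature, target)
print("threshold:", threshold)
print("variance before:", target.var(), "after:", score)
```

A full tree repeats this search over every feature at every node and then recurses into the two child regions until the stopping criterion is met.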



In [19]:
# importing the model

from sklearn.tree import DecisionTreeRegressor

In [20]:
tree_reg = DecisionTreeRegressor(max_depth=3)

In [24]:
tree_reg.fit(x_train, y_train)

# Predict the output for the test data
y_predict = tree_reg.predict(x_test)

# Evaluate the performance of the model using mean squared error
mse = mean_squared_error(y_test, y_predict)
print("Mean squared error: ", mse)

Mean squared error:  167122809756.95074


In [25]:
print(r2_score(y_test,y_predict))

0.45383248487028593


<b> Inference : </b> The Decision Tree Regressor fits the data better than Linear Regression: its <b>r2_score</b> (≈ 0.45) is higher than that of <b>LR</b> (≈ 0.13), though it still explains less than half of the variance.

Model Used : Random Forest 

Random forest regression is an ensemble learning method that combines multiple decision trees to create a more accurate and stable model for regression tasks. Each tree is trained on a "bootstrap sample", a random sample of the training data drawn with replacement, and can consider a randomly selected subset of features when choosing each split. The final prediction is obtained by averaging the predictions of all the decision trees in the forest.

Random forest regression has several advantages over a single decision tree regressor. Averaging many trees reduces variance, so the forest is less prone to overfitting while still capturing nonlinear relationships between the features and the target variable, and it copes well with high-dimensional data. Additionally, it can provide estimates of feature importance, which can be useful for feature selection and understanding the underlying relationships in the data.
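The averaging step and the feature importance estimates can be inspected directly on a fitted forest. The sketch below uses synthetic data (`X_syn`, `y_syn`, both assumptions for illustration) in which the third feature has no effect on the target.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic nonlinear data: the target depends on columns 0 and 1 only;
# column 2 is unrelated to the target (assumed setup for illustration).
rng = np.random.default_rng(2)
X_syn = rng.uniform(-3.0, 3.0, size=(300, 3))
y_syn = np.sin(X_syn[:, 0]) + 0.5 * X_syn[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=25, random_state=42).fit(X_syn, y_syn)

# The forest's prediction is the mean of its individual trees' predictions.
per_tree = np.stack([tree.predict(X_syn[:5]) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)
print(np.allclose(averaged, forest.predict(X_syn[:5])))

# Impurity-based importances sum to 1; the unrelated column should rank lowest.
importances = forest.feature_importances_
print(importances)
```

On the notebook's own model, `rf_reg.feature_importances_` could be paired with `x.columns` in the same way to see which features drive the weekly sales predictions.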


In [26]:
# importing the model

from sklearn.ensemble import RandomForestRegressor

In [29]:
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(x_train, y_train)

# Predict the output for the test data
y_pred = rf_reg.predict(x_test)

# Evaluate the performance of the model using mean squared error and R2 score
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean squared error: ", mse)
print("R2 score: ", r2)

Mean squared error:  13832158045.8536
R2 score:  0.9547956655361873


<b> Inference : </b> Random Forest Regressor is the best-fitting of the three models: it explains about 95% of the variance in the test set (r2_score ≈ 0.95).

# <b> Final Observations </b>

| **Model** | **R2 score (test set)** | **MSE (test set)** |
| --- | --- | --- |
| *Random Forest Regressor* | 0.9548 | 1.383 × 10^10 |
| *Decision Tree Regressor* | 0.4538 | 1.671 × 10^11 |
| *Linear Regression* | 0.1315 | 2.657 × 10^11 |