### **Regression Metrics on the Global Superstore Dataset**

* Metrics are the benchmarks used to quantify a model's performance.
* In the context of regression, R-squared, Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) are common evaluation metrics that help assess the performance of a model.

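* **R-squared (R²)**, also known as the coefficient of determination, is a statistical measure that indicates how well the regression model fits the data. It is the proportion of the variance in the target that the model explains.
* Formula: $R^2 = 1 - \dfrac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}$, where $y_i$ are the actual values, $\hat{y}_i$ the predicted values, and $\bar{y}$ the mean of the actual values.
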
* **Mean Absolute Error (MAE)** measures the average magnitude of errors in the predictions, without considering their direction. It is the average of the absolute differences between the predicted and actual values.

* **Mean Squared Error (MSE)** measures the average squared difference between the predicted values and the actual values. It penalizes larger errors more heavily than MAE due to the squaring of differences.

* **Root Mean Squared Error (RMSE)** is the square root of the MSE, so it is expressed in the same units as the target variable. It gives a sense of the typical size of the model's prediction errors, with larger errors weighted more heavily.
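
The four definitions above can be checked by computing each metric directly with NumPy and comparing against scikit-learn. This is a minimal sketch; the `y_true` and `y_pred` arrays are made-up toy values, not taken from the Global Superstore data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values chosen only for illustration (not from the dataset).
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred

# MAE: mean of absolute errors.
mae = np.mean(np.abs(errors))
# MSE: mean of squared errors.
mse = np.mean(errors ** 2)
# RMSE: square root of MSE, in the units of the target.
rmse = np.sqrt(mse)

# R²: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# The hand-computed values should agree with scikit-learn.
print("MAE :", mae, mean_absolute_error(y_true, y_pred))
print("MSE :", mse, mean_squared_error(y_true, y_pred))
print("RMSE:", rmse)
print("R2  :", r2, r2_score(y_true, y_pred))
```

For these toy values the MAE is 0.5 and the MSE is 0.375, which shows how the single error of size 1 dominates the squared metric.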

**Step 1 : Import Necessary Libraries**

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import xlrd  # not called directly; pandas needs it installed to read .xls files

**Step 2 : Load the Dataset**

In [2]:
df = pd.read_excel("E:\\Machine Learning\\global_superstore\\Global Superstore.xls")
df.head()

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,City,State,...,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit,Shipping Cost,Order Priority
0,32298,CA-2012-124891,2012-07-31,2012-07-31,Same Day,RH-19495,Rick Hansen,Consumer,New York City,New York,...,TEC-AC-10003033,Technology,Accessories,Plantronics CS510 - Over-the-Head monaural Wir...,2309.65,7,0.0,762.1845,933.57,Critical
1,26341,IN-2013-77878,2013-02-05,2013-02-07,Second Class,JR-16210,Justin Ritter,Corporate,Wollongong,New South Wales,...,FUR-CH-10003950,Furniture,Chairs,"Novimex Executive Leather Armchair, Black",3709.395,9,0.1,-288.765,923.63,Critical
2,25330,IN-2013-71249,2013-10-17,2013-10-18,First Class,CR-12730,Craig Reiter,Consumer,Brisbane,Queensland,...,TEC-PH-10004664,Technology,Phones,"Nokia Smart Phone, with Caller ID",5175.171,9,0.1,919.971,915.49,Medium
3,13524,ES-2013-1579342,2013-01-28,2013-01-30,First Class,KM-16375,Katherine Murray,Home Office,Berlin,Berlin,...,TEC-PH-10004583,Technology,Phones,"Motorola Smart Phone, Cordless",2892.51,5,0.1,-96.54,910.16,Medium
4,47221,SG-2013-4320,2013-11-05,2013-11-06,Same Day,RH-9495,Rick Hansen,Consumer,Dakar,Dakar,...,TEC-SHA-10000501,Technology,Copiers,"Sharp Wireless Fax, High-Speed",2832.96,8,0.0,311.52,903.04,Critical


In [3]:
df.shape

(51290, 24)

In [4]:
print("Dataset_columns:", df.columns)

Dataset_columns: Index(['Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Ship Mode',
       'Customer ID', 'Customer Name', 'Segment', 'City', 'State', 'Country',
       'Postal Code', 'Market', 'Region', 'Product ID', 'Category',
       'Sub-Category', 'Product Name', 'Sales', 'Quantity', 'Discount',
       'Profit', 'Shipping Cost', 'Order Priority'],
      dtype='object')


**Step 3 : Preprocessing the data**

In [5]:
df.isnull().sum()

Row ID                0
Order ID              0
Order Date            0
Ship Date             0
Ship Mode             0
Customer ID           0
Customer Name         0
Segment               0
City                  0
State                 0
Country               0
Postal Code       41296
Market                0
Region                0
Product ID            0
Category              0
Sub-Category          0
Product Name          0
Sales                 0
Quantity              0
Discount              0
Profit                0
Shipping Cost         0
Order Priority        0
dtype: int64
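
`Postal Code` is missing in 41,296 of the 51,290 rows (about 80.5%), which is why it is dropped in a later cell. A small sketch of how that ratio could be computed per column is shown below; the `missing_fraction` helper, the toy frame, and the 50% threshold are all illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_fraction(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return frame.isnull().mean().sort_values(ascending=False)


# Toy frame standing in for the real dataset (values are illustrative).
toy = pd.DataFrame({
    "Postal Code": [10024.0, np.nan, np.nan, np.nan, np.nan],
    "Sales": [2309.65, 3709.395, 5175.171, 2892.51, 2832.96],
})

fractions = missing_fraction(toy)

# Hypothetical rule of thumb: flag columns that are more than half empty.
mostly_missing = fractions[fractions > 0.5].index.tolist()

print(fractions)
print("Columns above 50% missing:", mostly_missing)
```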

In [6]:
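# Preview only: without inplace=True (or reassignment), drop() returns a new DataFrame and df itself is unchanged
df.drop(columns = ['Row ID','Order ID','Product ID','Customer ID'])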

Unnamed: 0,Order Date,Ship Date,Ship Mode,Customer Name,Segment,City,State,Country,Postal Code,Market,Region,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit,Shipping Cost,Order Priority
0,2012-07-31,2012-07-31,Same Day,Rick Hansen,Consumer,New York City,New York,United States,10024.0,US,East,Technology,Accessories,Plantronics CS510 - Over-the-Head monaural Wir...,2309.650,7,0.0,762.1845,933.570,Critical
1,2013-02-05,2013-02-07,Second Class,Justin Ritter,Corporate,Wollongong,New South Wales,Australia,,APAC,Oceania,Furniture,Chairs,"Novimex Executive Leather Armchair, Black",3709.395,9,0.1,-288.7650,923.630,Critical
2,2013-10-17,2013-10-18,First Class,Craig Reiter,Consumer,Brisbane,Queensland,Australia,,APAC,Oceania,Technology,Phones,"Nokia Smart Phone, with Caller ID",5175.171,9,0.1,919.9710,915.490,Medium
3,2013-01-28,2013-01-30,First Class,Katherine Murray,Home Office,Berlin,Berlin,Germany,,EU,Central,Technology,Phones,"Motorola Smart Phone, Cordless",2892.510,5,0.1,-96.5400,910.160,Medium
4,2013-11-05,2013-11-06,Same Day,Rick Hansen,Consumer,Dakar,Dakar,Senegal,,Africa,Africa,Technology,Copiers,"Sharp Wireless Fax, High-Speed",2832.960,8,0.0,311.5200,903.040,Critical
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51285,2014-06-19,2014-06-19,Same Day,Katrina Edelman,Corporate,Kure,Hiroshima,Japan,,APAC,North Asia,Office Supplies,Fasteners,"Advantus Thumb Tacks, 12 Pack",65.100,5,0.0,4.5000,0.010,Medium
51286,2014-06-20,2014-06-24,Standard Class,Zuschuss Carroll,Consumer,Houston,Texas,United States,77095.0,US,Central,Office Supplies,Appliances,Hoover Replacement Belt for Commercial Guardsm...,0.444,1,0.8,-1.1100,0.010,Medium
51287,2013-12-02,2013-12-02,Same Day,Laurel Beltran,Home Office,Oxnard,California,United States,93030.0,US,West,Office Supplies,Envelopes,"#10- 4 1/8"" x 9 1/2"" Security-Tint Envelopes",22.920,3,0.0,11.2308,0.010,High
51288,2012-02-18,2012-02-22,Standard Class,Ross Baird,Home Office,Valinhos,São Paulo,Brazil,,LATAM,South,Office Supplies,Binders,"Acco Index Tab, Economy",13.440,2,0.0,2.4000,0.003,Medium


In [7]:
df.drop(columns = ['Postal Code'], inplace = True)

In [8]:
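# Drop identifier and date columns in place; 'Row ID' is not in this list, so it remains in df as a feature
df.drop(columns = ['Order ID','Product ID','Customer ID','Order Date', 'Ship Date'], inplace = True)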

**Step 4 : Encoding the categorical columns**

In [9]:
categorical_cols = df.select_dtypes(include = ['object']).columns
df = pd.get_dummies(df, columns = categorical_cols, drop_first = True)

In [10]:
df.fillna(value = 0, inplace = True)
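
To see what `pd.get_dummies(..., drop_first=True)` does, here is a minimal sketch on a toy frame (the values are illustrative, not the real data). Each categorical column becomes one indicator column per distinct value, minus the first category, which is dropped to avoid redundant columns. Free-text columns with many distinct values, such as `Customer Name` or `Product Name`, therefore expand into a very large number of columns.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "Ship Mode": ["Same Day", "First Class", "Second Class", "First Class"],
    "Sales": [10.0, 20.0, 30.0, 40.0],
})

# Three distinct categories produce two indicator columns with drop_first=True.
n_categories = toy["Ship Mode"].nunique()
encoded = pd.get_dummies(toy, columns=["Ship Mode"], drop_first=True)

print("Distinct categories:", n_categories)
print(encoded.columns.tolist())
```

Categories are ordered alphabetically, so `First Class` is the dropped baseline here.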

**Step 5 : Scaling the data**

In [11]:
features = df.drop(columns=['Profit'], errors='ignore')

In [12]:
features = features.apply(pd.to_numeric)

In [13]:
columns_to_scale = ['Sales', 'Quantity', 'Discount', 'Shipping Cost']
standard_scaler = StandardScaler()
df_standard_scaled = features.copy()
df_standard_scaled[columns_to_scale] = standard_scaler.fit_transform(features[columns_to_scale])

**Step 6 : Splitting the data into train and test sets**

In [14]:
target = df['Profit']

In [15]:
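# Caution: df still contains the target column 'Profit', so df_final includes the target among its features (target leakage).
# Building from `features` instead of `df`, i.e. features.drop(columns=columns_to_scale), would keep 'Profit' out.
df_final = pd.concat([df.drop(columns=columns_to_scale), df_standard_scaled[columns_to_scale]], axis=1)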

In [16]:
X_train, X_test, y_train, y_test = train_test_split(df_final, target, test_size=0.2, random_state=42)

In [17]:
X_train = X_train.astype(np.float64)
y_train = y_train.astype(np.float64)
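
In the cells above, the scaler is fitted on the full dataset before the split, so the scaling statistics also reflect the test rows. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; the column names mirror the notebook, but the values are randomly generated assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (values are random, for illustration).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Sales": rng.normal(250.0, 100.0, size=200),
    "Quantity": rng.integers(1, 10, size=200).astype(float),
    "Profit": rng.normal(30.0, 15.0, size=200),
})

# Separate features and target before anything else.
X = toy.drop(columns=["Profit"])
y = toy["Profit"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X.columns, index=X_test.index
)

print(X_train_scaled.mean().round(6))
print(X_test_scaled.shape)
```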

**Step 7 : Performing Linear Regression**

In [30]:
X_train_small = X_train[:1000]  # use first 1000 samples
y_train_small = y_train[:1000]
x_test_small = X_test[:1000]
y_test_small = y_test[:1000]

model = LinearRegression()
model.fit(X_train_small, y_train_small)


**Step 8 : Predicting with the model**

In [31]:
y_pred = model.predict(x_test_small)

**Step 9 : Evaluating the model**

In [33]:
r2 = r2_score(y_test_small, y_pred)
mae = mean_absolute_error(y_test_small, y_pred)
mse = mean_squared_error(y_test_small, y_pred)
rmse = np.sqrt(mse)

**Step 10 : Printing evaluation metrics**

In [34]:
print(f"R-squared: {r2}")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")


R-squared: 0.9999999999999484
Mean Absolute Error (MAE): 1.8421420541493324e-05
Mean Squared Error (MSE): 1.2211304001814016e-09
Root Mean Squared Error (RMSE): 3.4944676278102816e-05
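
Scores this close to perfect (R² of almost exactly 1 and an RMSE around 3.5e-05) usually mean the target has leaked into the features. That is the case here: `df_final` in In [15] is built from `df.drop(columns=columns_to_scale)`, and `df` still contains `Profit`, so the model can read the answer directly. The sketch below reproduces the effect on synthetic data; the data, coefficients, and the `holdout_r2` helper are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: Profit depends on two features plus substantial noise.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "Sales": rng.normal(size=n),
    "Discount": rng.normal(size=n),
})
toy["Profit"] = (
    2.0 * toy["Sales"] - 1.0 * toy["Discount"] + rng.normal(scale=2.0, size=n)
)


def holdout_r2(feature_frame: pd.DataFrame, target: pd.Series) -> float:
    """Fit a linear regression and return R² on a held-out 20% split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        feature_frame, target, test_size=0.2, random_state=42
    )
    model = LinearRegression().fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))


# Leaky: the feature frame still contains the 'Profit' column.
r2_leaky = holdout_r2(toy, toy["Profit"])
# Clean: 'Profit' is removed from the features.
r2_clean = holdout_r2(toy.drop(columns=["Profit"]), toy["Profit"])

print("R2 with target in features   :", r2_leaky)
print("R2 with target removed       :", r2_clean)
```

With the target among the features the score is essentially 1 regardless of noise, while the clean run reflects how much of the variance the real features explain.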


**Step 11 : Lasso Regression**

In [35]:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1, random_state=42)  # alpha is the regularization strength
lasso.fit(X_train_small, y_train_small)

# Predict on the test subset
y_pred = lasso.predict(x_test_small)


r2 = r2_score(y_test_small, y_pred)

In [36]:
r2

0.9999999999222979
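
The `alpha` parameter controls how strongly Lasso shrinks coefficients toward zero. A minimal sketch on synthetic data (an assumption, not the Superstore data) shows the effect: only the first feature carries signal, and a larger `alpha` sets the irrelevant coefficients to exactly zero while also shrinking the useful one.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first of five features influences y.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Compare a weak and a strong regularization strength.
for alpha in (0.01, 0.5):
    fitted = Lasso(alpha=alpha).fit(X, y)
    print(f"alpha={alpha}: coefficients = {np.round(fitted.coef_, 3)}")

strong = Lasso(alpha=0.5).fit(X, y)
n_nonzero = int(np.count_nonzero(strong.coef_))
print("Non-zero coefficients at alpha=0.5:", n_nonzero)
```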