# BigMart Sales Prediction!
***Sales Prediction for Big Mart Outlets***

**The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and predict the sales of each product at a particular outlet.**

# Data Dictionary

**The data comprise a train set (8523 rows) and a test set (5681 rows). The train set contains both the input variables and the output variable; the task is to predict the sales for the test set.**


# Evaluation Metric
**Model performance will be evaluated on your sales predictions for the test data (test.csv), which contains the same kind of data points as the train data except for the sales values to be predicted.**

**Your submission needs to be in the format shown in the sample submission.**

**We will use the Root Mean Square Error (RMSE) to judge your response.**
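
The Root Mean Square Error named above is the square root of the mean squared difference between actual and predicted sales: RMSE = sqrt(mean((y_i - yhat_i)^2)). The following is a minimal sketch of that calculation; the function name `rmse` and the sales figures are made up for illustration and are not BigMart data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical sales figures, for illustration only
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

score = rmse(actual, predicted)
print(f"RMSE: {score:.4f}")
```

Because the errors are squared before averaging, a single large miss (the 30-unit error here) contributes more to the score than several small ones.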

# We were given a **train file (8523 rows)** and a **test file (5681 rows)**

In [1]:
# Import all the necessary libraries
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Load datasets
train_df = pd.read_csv(r'C:\Users\likit\Desktop\BigMart_Sales\train_data.csv')
test_df = pd.read_csv(r'C:\Users\likit\Desktop\BigMart_Sales\test_data.csv')

# Fix inconsistent category labels in 'Item_Fat_Content'
train_df['Item_Fat_Content'] = train_df['Item_Fat_Content'].str.lower().replace({'lf': 'low fat', 'reg': 'regular'})
test_df['Item_Fat_Content'] = test_df['Item_Fat_Content'].str.lower().replace({'lf': 'low fat', 'reg': 'regular'})

# Handle missing values in the numeric columns using KNNImputer.
# KNNImputer fills each missing value with the mean of that feature over the 5 nearest rows,
# where nearness is measured on the features that both rows have observed.
imputer = KNNImputer(n_neighbors=5)
numeric_cols = train_df.select_dtypes(include=['number']).columns
numeric_cols = numeric_cols.drop('Item_Outlet_Sales')  # Exclude target variable
train_df[numeric_cols] = imputer.fit_transform(train_df[numeric_cols])
test_df[numeric_cols] = imputer.transform(test_df[numeric_cols])

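# Feature engineering: add new columns
# Outlet_Age is the outlet's age relative to 2025; the sales data are from 2013, so this differs from the age at sale time only by a constant offset.
# Item_Weight_MRP multiplies weight by price to form a single interaction feature.
# The log1p transform compresses the right tail of Item_Weight_MRP, damping the influence of large values.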
train_df['Outlet_Age'] = 2025 - train_df['Outlet_Establishment_Year']
test_df['Outlet_Age'] = 2025 - test_df['Outlet_Establishment_Year']
train_df['Item_Weight_MRP'] = train_df['Item_Weight'] * train_df['Item_MRP']
test_df['Item_Weight_MRP'] = test_df['Item_Weight'] * test_df['Item_MRP']
train_df['Item_Weight_MRP'] = np.log1p(train_df['Item_Weight_MRP'])
test_df['Item_Weight_MRP'] = np.log1p(test_df['Item_Weight_MRP'])

# Define features and target: split the train data into X and y
X = train_df.drop(columns=['Item_Outlet_Sales'])
y = train_df['Item_Outlet_Sales']
X_test = test_df.copy()

# Label Encoding for categorical features
label_encoders = {}
for col in X.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])
    if col in X_test.columns:
        X_test[col] = le.transform(X_test[col])
    else:
        X_test[col] = 0  # Column is absent from the test set: fill it with a constant 0
    label_encoders[col] = le

# Align test dataset columns with the train dataset so that X_test
# has the same columns, in the same order, as X.
X_test = X_test.reindex(columns=X.columns, fill_value=0)

# Standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_test_scaled = scaler.transform(X_test)

# Split train data into train and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Hyperparameter tuning using GridSearchCV
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)
param_grid = {
    'n_estimators': [1000, 1500],
    'learning_rate': [0.01, 0.03, 0.05],
    'max_depth': [6, 8, 10],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9]
}
grid_search = GridSearchCV(xgb_model, param_grid, scoring='neg_root_mean_squared_error', cv=3, verbose=2)
grid_search.fit(X_train, y_train)
best_xgb_model = grid_search.best_estimator_

# Predict and evaluate
valid_preds = best_xgb_model.predict(X_valid)
rmse = np.sqrt(mean_squared_error(y_valid, valid_preds))
print(f"Optimized RMSE: {rmse:.4f}")

# Predict on test data
predictions = best_xgb_model.predict(X_test_scaled)
predictions = np.maximum(0, predictions)  # Ensure no negative sales

# Save submission file
submission = test_df[['Item_Identifier', 'Outlet_Identifier']].copy()
submission['Item_Outlet_Sales'] = predictions
submission.to_csv('submission_optimized_File1.csv', index=False)

print("Submission file saved as submission_optimized_File1.csv")


Fitting 3 folds for each of 162 candidates, totalling 486 fits
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=6, n_estimators=1000, subsample=0.7; total time=   1.8s
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=6, n_estimators=1000, subsample=0.7; total time=   1.4s
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=6, n_estimators=1000, subsample=0.7; total time=   1.6s
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=6, n_estimators=1000, subsample=0.8; total time=   2.0s
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=6, n_estimators=1000, subsample=0.8; total time=   1.6s
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=6, n_estimators=1000, subsample=0.8; total time=   1.5s
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=6, n_estimators=1000, subsample=0.9; total time=   1.6s
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=6, n_estimators=1000, subsample=0.9; total time=   1.4s
[...]

[CV] END colsample_bytree=0.7, learning_rate=0.03, max_depth=6, n_estimators=1500, subsample=0.9; total time=   2.9s
[CV] END colsample_bytree=0.7, learning_rate=0.03, max_depth=6, n_estimators=1500, subsample=0.9; total time=   2.9s
[CV] END colsample_bytree=0.7, learning_rate=0.03, max_depth=8, n_estimators=1000, subsample=0.7; total time=   4.1s
[CV] END colsample_bytree=0.7, learning_rate=0.03, max_depth=8, n_estimators=1000, subsample=0.7; total time=   4.0s
[CV] END colsample_bytree=0.7, learning_rate=0.03, max_depth=8, n_estimators=1000, subsample=0.7; total time=   4.0s
[CV] END colsample_bytree=0.7, learning_rate=0.03, max_depth=8, n_estimators=1000, subsample=0.8; total time=   3.8s
[CV] END colsample_bytree=0.7, learning_rate=0.03, max_depth=8, n_estimators=1000, subsample=0.8; total time=   4.4s
[CV] END colsample_bytree=0.7, learning_rate=0.03, max_depth=8, n_estimators=1000, subsample=0.8; total time=   4.3s
[CV] END colsample_bytree=0.7, learning_rate=0.03, max_depth=8, [...]

[CV] END colsample_bytree=0.7, learning_rate=0.05, max_depth=8, n_estimators=1500, subsample=0.8; total time=   6.1s
[CV] END colsample_bytree=0.7, learning_rate=0.05, max_depth=8, n_estimators=1500, subsample=0.9; total time=   5.7s
[CV] END colsample_bytree=0.7, learning_rate=0.05, max_depth=8, n_estimators=1500, subsample=0.9; total time=   6.0s
[CV] END colsample_bytree=0.7, learning_rate=0.05, max_depth=8, n_estimators=1500, subsample=0.9; total time=   6.6s
[CV] END colsample_bytree=0.7, learning_rate=0.05, max_depth=10, n_estimators=1000, subsample=0.7; total time=   5.5s
[CV] END colsample_bytree=0.7, learning_rate=0.05, max_depth=10, n_estimators=1000, subsample=0.7; total time=   6.0s
[CV] END colsample_bytree=0.7, learning_rate=0.05, max_depth=10, n_estimators=1000, subsample=0.7; total time=   7.3s
[CV] END colsample_bytree=0.7, learning_rate=0.05, max_depth=10, n_estimators=1000, subsample=0.8; total time=   5.6s
[CV] END colsample_bytree=0.7, learning_rate=0.05, max_depth [...]

[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, n_estimators=1500, subsample=0.8; total time=   6.9s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, n_estimators=1500, subsample=0.8; total time=   7.2s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, n_estimators=1500, subsample=0.8; total time=   7.3s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, n_estimators=1500, subsample=0.9; total time=   6.7s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, n_estimators=1500, subsample=0.9; total time=   7.1s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, n_estimators=1500, subsample=0.9; total time=   7.2s
[CV] END colsample_bytree=0.8, learning_rate=0.03, max_depth=6, n_estimators=1000, subsample=0.7; total time=   1.1s
[CV] END colsample_bytree=0.8, learning_rate=0.03, max_depth=6, n_estimators=1000, subsample=0.7; total time=   0.9s
[CV] END colsample_bytree=0.8, learning_rate=0.03, max_dep [...]

[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=6, n_estimators=1500, subsample=0.7; total time=   1.8s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=6, n_estimators=1500, subsample=0.7; total time=   1.9s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=6, n_estimators=1500, subsample=0.8; total time=   1.5s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=6, n_estimators=1500, subsample=0.8; total time=   1.7s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=6, n_estimators=1500, subsample=0.8; total time=   1.6s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=6, n_estimators=1500, subsample=0.9; total time=   1.9s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=6, n_estimators=1500, subsample=0.9; total time=   2.2s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=6, n_estimators=1500, subsample=0.9; total time=   1.7s
[CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=8, [...]

[CV] END colsample_bytree=0.9, learning_rate=0.01, max_depth=8, n_estimators=1000, subsample=0.9; total time=   2.5s
[CV] END colsample_bytree=0.9, learning_rate=0.01, max_depth=8, n_estimators=1500, subsample=0.7; total time=   3.4s
[CV] END colsample_bytree=0.9, learning_rate=0.01, max_depth=8, n_estimators=1500, subsample=0.7; total time=   3.9s
[CV] END colsample_bytree=0.9, learning_rate=0.01, max_depth=8, n_estimators=1500, subsample=0.7; total time=   3.7s
[CV] END colsample_bytree=0.9, learning_rate=0.01, max_depth=8, n_estimators=1500, subsample=0.8; total time=   3.6s
[CV] END colsample_bytree=0.9, learning_rate=0.01, max_depth=8, n_estimators=1500, subsample=0.8; total time=   3.5s
[CV] END colsample_bytree=0.9, learning_rate=0.01, max_depth=8, n_estimators=1500, subsample=0.8; total time=   3.9s
[CV] END colsample_bytree=0.9, learning_rate=0.01, max_depth=8, n_estimators=1500, subsample=0.9; total time=   3.4s
[CV] END colsample_bytree=0.9, learning_rate=0.01, max_depth=8, [...]

[CV] END colsample_bytree=0.9, learning_rate=0.03, max_depth=10, n_estimators=1000, subsample=0.9; total time=   5.0s
[CV] END colsample_bytree=0.9, learning_rate=0.03, max_depth=10, n_estimators=1000, subsample=0.9; total time=   5.0s
[CV] END colsample_bytree=0.9, learning_rate=0.03, max_depth=10, n_estimators=1000, subsample=0.9; total time=   5.1s
[CV] END colsample_bytree=0.9, learning_rate=0.03, max_depth=10, n_estimators=1500, subsample=0.7; total time=   7.8s
[CV] END colsample_bytree=0.9, learning_rate=0.03, max_depth=10, n_estimators=1500, subsample=0.7; total time=   7.9s
[CV] END colsample_bytree=0.9, learning_rate=0.03, max_depth=10, n_estimators=1500, subsample=0.7; total time=   8.1s
[CV] END colsample_bytree=0.9, learning_rate=0.03, max_depth=10, n_estimators=1500, subsample=0.8; total time=   8.1s
[CV] END colsample_bytree=0.9, learning_rate=0.03, max_depth=10, n_estimators=1500, subsample=0.8; total time=   8.1s
[CV] END colsample_bytree=0.9, learning_rate=0.03, max_d [...]
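
The verbose log above reports only fit times, so it does not show which parameter combination scored best. Once `fit` finishes, `GridSearchCV` exposes `best_params_`, `best_score_`, and `cv_results_`, which can be summarized as a table. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_regression`, a deliberately tiny grid, and scikit-learn's `GradientBoostingRegressor` standing in for `XGBRegressor` so that it runs without xgboost. The names `X_demo`, `y_demo`, `demo_grid`, and `demo_search` are hypothetical and do not appear in the notebook.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the notebook would use X_train and y_train instead
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

# Deliberately tiny grid (4 candidates) so the sketch runs in seconds
demo_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
demo_search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    demo_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# scikit-learn negates RMSE so that higher is better; flip the sign back
results = pd.DataFrame(demo_search.cv_results_)
results["mean_rmse"] = -results["mean_test_score"]
summary = results[
    ["params", "mean_rmse", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(summary.to_string(index=False))
print("Best parameters:", demo_search.best_params_)
print(f"Best CV RMSE: {-demo_search.best_score_:.4f}")
```

Applied to the notebook's own `grid_search` object, the same three attributes would give the winning combination among the 162 candidates. `GridSearchCV` also accepts `n_jobs=-1` to run the fits in parallel, which may shorten a 486-fit search considerably.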