# 📊 E-Commerce Sales Prediction (ML & Data Analytics)
## 🚀 Machine Learning Model for Sales Forecasting
This project analyzes historical e-commerce transaction data and trains regression models to predict the purchase amount of each transaction.

## 📌 Step 1: Importing Necessary Libraries
We use NumPy and pandas for data handling, Matplotlib and seaborn for visualization, and scikit-learn for preprocessing, modeling, and evaluation.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

## 📌 Step 2: Load and Explore Data
We load the Walmart sales dataset and preview its first five rows to understand its structure.

In [4]:
df = pd.read_csv('/kaggle/input/walmart-sales-dataset/walmart.csv')
df.head()

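| | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000001 | P00069042 | F | 0-17 | 10 | A | 2 | 0 | 3 | 8370 |
| 1 | 1000001 | P00248942 | F | 0-17 | 10 | A | 2 | 0 | 1 | 15200 |
| 2 | 1000001 | P00087842 | F | 0-17 | 10 | A | 2 | 0 | 12 | 1422 |
| 3 | 1000001 | P00085442 | F | 0-17 | 10 | A | 2 | 0 | 12 | 1057 |
| 4 | 1000002 | P00285442 | M | 55+ | 16 | C | 4+ | 0 | 8 | 7969 |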
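
A quick structural check can complement `df.head()`. The sketch below is illustrative only: it inlines the five preview rows shown above so that it runs without the Kaggle file, then reports shape, dtypes, and per-column cardinality. The names `SAMPLE_CSV`, `sample`, and `cardinality` are assumptions made for this example and do not appear in the notebook.

```python
import io

import pandas as pd

# The five preview rows from above, inlined so the sketch needs no external file.
SAMPLE_CSV = """User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category,Purchase
1000001,P00069042,F,0-17,10,A,2,0,3,8370
1000001,P00248942,F,0-17,10,A,2,0,1,15200
1000001,P00087842,F,0-17,10,A,2,0,12,1422
1000001,P00085442,F,0-17,10,A,2,0,12,1057
1000002,P00285442,M,55+,16,C,4+,0,8,7969
"""

sample = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Shape, dtypes, and the number of distinct values per column give a
# compact overview before any encoding decisions are made.
print(sample.shape)
print(sample.dtypes)
cardinality = sample.nunique()
print(cardinality)
```

On the full dataset the same three calls would be applied to `df`; low-cardinality text columns are the natural candidates for the encoding done in Step 4.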


## 📌 Step 3: Exploratory Data Analysis (EDA)
Inspect column types, summary statistics, and missing values. The dataset has 550,068 rows and 10 columns, with no missing values in any column.

In [5]:
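df.info()          # prints column dtypes and non-null counts
df.describe()      # computed but not rendered: a cell only displays its last expression
df.isnull().sum()  # missing values per column (shown as the cell result)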

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 10 columns):
 #   Column                      Non-Null Count   Dtype 
---  ------                      --------------   ----- 
 0   User_ID                     550068 non-null  int64 
 1   Product_ID                  550068 non-null  object
 2   Gender                      550068 non-null  object
 3   Age                         550068 non-null  object
 4   Occupation                  550068 non-null  int64 
 5   City_Category               550068 non-null  object
 6   Stay_In_Current_City_Years  550068 non-null  object
 7   Marital_Status              550068 non-null  int64 
 8   Product_Category            550068 non-null  int64 
 9   Purchase                    550068 non-null  int64 
dtypes: int64(5), object(5)
memory usage: 42.0+ MB


User_ID                       0
Product_ID                    0
Gender                        0
Age                           0
Occupation                    0
City_Category                 0
Stay_In_Current_City_Years    0
Marital_Status                0
Product_Category              0
Purchase                      0
dtype: int64
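
Because a notebook cell renders only the value of its last expression, the summary statistics do not appear in the output above. A minimal sketch of one way to show every summary, assuming a small made-up frame `demo` that reuses the five `Purchase` values from the preview (the names `demo`, `summary`, and `missing` are illustrative):

```python
import pandas as pd

# Made-up miniature frame; the Purchase values are the five preview rows.
demo = pd.DataFrame(
    {
        "Gender": ["F", "F", "F", "F", "M"],
        "Purchase": [8370, 15200, 1422, 1057, 7969],
    }
)

demo.info()  # info() prints by itself and returns None

# Keep the other summaries in variables and print them explicitly,
# so none of them depends on being the last expression in the cell.
summary = demo.describe()
missing = demo.isnull().sum()
print(summary)
print(missing)
```

In a notebook, `display(summary)` would render the same table with rich formatting instead of plain text.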

## 📌 Step 4: Feature Engineering
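Label-encode the categorical features and derive per-user and per-product aggregate features. Note that these aggregates are computed from `Purchase`, the prediction target, over the full dataset before the train/test split, so they carry target information from test rows into the features; the scores reported in Step 6 are therefore likely optimistic.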

In [6]:
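df_encoded = df.copy()
label_encoders = {}

# Label-encode each categorical column; LabelEncoder assigns integer codes
# following the sorted order of the labels
for col in ['Gender', 'Age', 'City_Category', 'Stay_In_Current_City_Years']:
    le = LabelEncoder()
    df_encoded[col] = le.fit_transform(df_encoded[col])
    label_encoders[col] = le  # keep the fitted encoder to map codes back later

# Aggregates of the target, broadcast back to every row of the group
df_encoded['Total_Spending'] = df_encoded.groupby('User_ID')['Purchase'].transform('sum')
df_encoded['Avg_Product_Purchase'] = df_encoded.groupby('Product_ID')['Purchase'].transform('mean')
df_encoded['Purchase_Count'] = df_encoded.groupby('User_ID')['Purchase'].transform('count')

# Drop the raw identifiers once the aggregates are derived
df_final = df_encoded.drop(columns=['User_ID', 'Product_ID'])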

df_final.head()

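| | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category | Purchase | Total_Spending | Avg_Product_Purchase | Purchase_Count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 10 | 0 | 2 | 0 | 3 | 8370 | 334093 | 11870.863436 | 35 |
| 1 | 0 | 0 | 10 | 0 | 2 | 0 | 1 | 15200 | 334093 | 16304.030981 | 35 |
| 2 | 0 | 0 | 10 | 0 | 2 | 0 | 12 | 1422 | 334093 | 1237.892157 | 35 |
| 3 | 0 | 0 | 10 | 0 | 2 | 0 | 12 | 1057 | 334093 | 1455.140762 | 35 |
| 4 | 1 | 6 | 16 | 2 | 4 | 0 | 8 | 7969 | 810472 | 7692.763547 | 77 |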
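
The aggregate features in this step are built from `Purchase` across all rows, including rows that later land in the test set. One possible alternative, sketched here on made-up data, is to split first and learn the aggregates from the training rows only. This is not the notebook's method: the frame `toy`, the fallback values for unseen users or products, and all variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up transactions; only the column names mirror the notebook.
toy = pd.DataFrame(
    {
        "User_ID": [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
        "Product_ID": ["a", "b", "a", "b", "c", "a", "c", "b", "a", "c"],
        "Purchase": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    }
)

# Split BEFORE deriving any feature that depends on the target.
train, test = train_test_split(toy, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()

# Learn the aggregates from the training rows only.
user_total = train.groupby("User_ID")["Purchase"].sum()
product_mean = train.groupby("Product_ID")["Purchase"].mean()
global_mean = train["Purchase"].mean()

# Apply the learned lookups to both splits. Users or products that never
# occur in training get a fallback value (an assumption of this sketch).
for part in (train, test):
    part["Total_Spending"] = part["User_ID"].map(user_total).fillna(0)
    part["Avg_Product_Purchase"] = (
        part["Product_ID"].map(product_mean).fillna(global_mean)
    )

print(test[["User_ID", "Product_ID", "Total_Spending", "Avg_Product_Purchase"]])
```

Even in this variant, each training row's own purchase still contributes to its aggregates, so a stricter setup might compute them out-of-fold; the sketch only removes the leakage from test rows.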


## 📌 Step 5: Data Splitting
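Split the dataset into training (80%) and testing (20%) sets, then standardize the features. The scaler is fitted on the training set only and then applied to the test set, so test statistics do not influence the scaling.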

In [7]:
X = df_final.drop(columns=['Purchase'])
y = df_final['Purchase']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
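
The fit-on-train, transform-on-test pattern above can also be bundled into a scikit-learn pipeline, which applies the scaler consistently at both fit and predict time. A minimal sketch on synthetic data follows; `X_demo`, `y_demo`, and `pipe` are illustrative names, and the noiseless linear target is an assumption chosen so the result is easy to verify.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with an exactly linear target (assumption for the sketch).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 3.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() fits the scaler on the training data only; predict() and score()
# reuse those training statistics to transform whatever they are given.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

score = pipe.score(X_te, y_te)  # R^2 on the held-out rows
print(f"Test R^2: {score:.4f}")
```

Keeping the scaler inside the pipeline also makes cross-validation safer, since each fold refits the scaler on its own training portion.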

## 📌 Step 6: Train Machine Learning Models
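Train a linear regression on the standardized features and a random forest (50 trees, maximum depth 10) on the unscaled features, then compare MAE, RMSE, and R² on the test set. On this split the two models score almost identically (R² ≈ 0.735).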

In [8]:
def evaluate_model(y_test, y_pred, model_name):
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = mse ** 0.5
    r2 = r2_score(y_test, y_pred)
    return {
        'Model': model_name,
        'MAE': mae,
        'RMSE': rmse,
        'R2 Score': r2
    }

# Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)
y_pred_lr = lr_model.predict(X_test_scaled)

# Random Forest (trained on unscaled features; tree splits are insensitive to feature scaling)
rf_model = RandomForestRegressor(n_estimators=50, max_depth=10, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

# Model Evaluation
lr_results = evaluate_model(y_test, y_pred_lr, 'Linear Regression')
rf_results = evaluate_model(y_test, y_pred_rf, 'Random Forest')

model_comparison = pd.DataFrame([lr_results, rf_results])
print(model_comparison)

               Model          MAE         RMSE  R2 Score
0  Linear Regression  1905.981103  2579.579380  0.735167
1      Random Forest  1902.074300  2577.193871  0.735657
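
The three metrics returned by `evaluate_model` can be checked by hand on a tiny example. The sketch below uses made-up values (`y_true` and `y_hat` are assumptions, not notebook data) and computes MAE, RMSE, and R² directly from their definitions, next to the scikit-learn results.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions, small enough to verify by hand.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 330.0, 360.0])

errors = y_true - y_hat  # [-10, 10, -30, 40]

# MAE: mean of absolute errors -> (10 + 10 + 30 + 40) / 4 = 22.5
mae = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error -> sqrt(2700 / 4) = sqrt(675)
rmse = np.sqrt(np.mean(errors**2))

# R^2: 1 - SS_res / SS_tot -> 1 - 2700 / 50000 = 0.946
ss_res = np.sum(errors**2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mean_absolute_error(y_true, y_hat))
print(rmse, mean_squared_error(y_true, y_hat) ** 0.5)
print(r2, r2_score(y_true, y_hat))
```

MAE and RMSE are in the units of `Purchase`, while R² is unitless; RMSE weights large errors more heavily than MAE, which is why it is the larger of the two in the results table above.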


## 📌 Step 7: Export Processed Data
Save the encoded dataset, including the derived features and the `Purchase` target, to a CSV file for further analysis.

In [9]:
df_final.to_csv('processed_ecommerce_data.csv', index=False)
print('Dataset successfully saved!')

Dataset successfully saved!
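
The `index=False` argument matters when the file is read back: without it, pandas writes the row index as an unnamed first column, which reappears as `Unnamed: 0` on reload. A small sketch with a made-up two-row frame and temporary files (the frame `frame` and the file names are assumptions for illustration):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up two-row frame standing in for the processed dataset.
frame = pd.DataFrame({"Gender": [0, 1], "Purchase": [8370, 7969]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = Path(tmp) / "with_index.csv"
    without_index = Path(tmp) / "without_index.csv"

    frame.to_csv(with_index)                   # default: index is written
    frame.to_csv(without_index, index=False)   # index is omitted

    cols_with = list(pd.read_csv(with_index).columns)
    cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)     # ['Unnamed: 0', 'Gender', 'Purchase']
print(cols_without)  # ['Gender', 'Purchase']
```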
