# Task 7: Walmart Sales Forecasting

This notebook covers exploratory data analysis (EDA), feature engineering, model training, and forecasting for the Walmart sales data. It ends with instructions for running the interactive dashboard app.

**Contents:**
- Data Loading & Merging
- Exploratory Data Analysis (EDA)
- Feature Engineering (time features, lags)
- Model Training (Random Forest, XGBoost)
- Forecasting & Evaluation
- Plotting Results
- How to Run the Dashboard

---

## 5. Push to Remote Repository

Push the committed changes to GitHub:
```bash
git push origin main
```

## 4. Commit Changes

Stage the notebook and other Task 7 files using:
```bash
git add "Task 7 - Sales Forecasting"
```
Then commit:
```bash
git commit -m "Task 7: Sales Forecasting (notebook, dashboard app, requirements, README)"
```

## 3. Restore Notebook to Previous State

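If you need to restore the notebook to a previous state, use version control. With Git, `git restore <file>` (Git 2.23 or later) discards unstaged changes to a file, and `git checkout <commit> -- <file>` brings back the version from an earlier commit.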

## 2. Recreate the Notebook

If needed, delete or rename the original notebook and create a new one with the same name. Add new or updated content as required.

In [None]:
# Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import xgboost as xgb
import warnings
warnings.filterwarnings('ignore')

# Task 7: Sales Forecasting

This notebook performs EDA and builds forecasting models for Walmart sales data. It is structured for clarity, reproducibility, and easy review.

---

## 1. Create a New Notebook

We start by creating a new notebook and adding initial content.

In [None]:
# Extract text from the PDF with PyPDF2
import PyPDF2

pdf_path = 'Machine Learning Tasks.pdf'
text = ''
with open(pdf_path, 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    for page in reader.pages:
        text += page.extract_text() or ''

# Show the first 2000 characters to inspect the instructions
print(text[:2000])

In [2]:
# Save all extracted text from the PDF to a .txt file for full inspection
with open('Machine_Learning_Tasks_extracted.txt', 'w', encoding='utf-8') as f:
    f.write(text)
print('Full PDF text extracted to Machine_Learning_Tasks_extracted.txt')

Full PDF text extracted to Machine_Learning_Tasks_extracted.txt


In [None]:
# 1. Data Loading & Merging
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, TimeSeriesSplit
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import xgboost as xgb
import plotly.express as px
import plotly.graph_objs as go
import warnings
warnings.filterwarnings('ignore')

# Load data
sales = pd.read_csv('train.csv', parse_dates=['Date'])
features = pd.read_csv('features.csv', parse_dates=['Date'])
stores = pd.read_csv('stores.csv')

# Merge datasets
sales = sales.merge(features, on=['Store', 'Date', 'IsHoliday'], how='left')
sales = sales.merge(stores, on='Store', how='left')

print('Data loaded and merged. Shape:', sales.shape)
sales.head()
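
A left merge silently keeps rows whose keys have no match, and it duplicates rows when the right-hand keys are not unique. The sketch below uses small made-up frames rather than the real CSVs to show how the `validate` and `indicator` options of `merge` can surface both problems. The toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for train.csv and features.csv (values are made up)
sales_toy = pd.DataFrame({
    'Store': [1, 1, 2],
    'Date': pd.to_datetime(['2010-02-05', '2010-02-12', '2010-02-05']),
    'IsHoliday': [False, True, False],
    'Weekly_Sales': [100.0, 120.0, 80.0],
})
features_toy = pd.DataFrame({
    'Store': [1, 1],
    'Date': pd.to_datetime(['2010-02-05', '2010-02-12']),
    'IsHoliday': [False, True],
    'Temperature': [42.3, 38.5],
})

# validate='many_to_one' raises MergeError if the right-hand keys are duplicated;
# indicator=True adds a '_merge' column marking rows with no match on the right
merged = sales_toy.merge(
    features_toy,
    on=['Store', 'Date', 'IsHoliday'],
    how='left',
    validate='many_to_one',
    indicator=True,
)

unmatched = int((merged['_merge'] == 'left_only').sum())
print('Rows:', len(merged), '| rows without feature data:', unmatched)
```

If the row count after the merge differs from the row count of the sales frame before it, or `unmatched` is non-zero, the join keys are worth a closer look before modeling.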

In [None]:
# 2. Exploratory Data Analysis (EDA)
# Basic info
print(sales.info())
print(sales.describe())

# Missing values
missing = sales.isnull().mean().sort_values(ascending=False)
print('Missing values (fraction):')
print(missing[missing > 0])

# Sales over time
plt.figure(figsize=(12,5))
sns.lineplot(data=sales.groupby('Date')['Weekly_Sales'].sum().reset_index(), x='Date', y='Weekly_Sales')
plt.title('Total Weekly Sales Over Time')
plt.show()

# Sales by Store Type
plt.figure(figsize=(8,4))
sns.boxplot(data=sales, x='Type', y='Weekly_Sales')
plt.title('Weekly Sales by Store Type')
plt.show()

# Sales by Department
plt.figure(figsize=(12,4))
sns.boxplot(data=sales, x='Dept', y='Weekly_Sales')
plt.title('Weekly Sales by Department')
plt.xticks([], [])
plt.show()
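
Box plots show spread but not exact numbers. As a complement, this small sketch tabulates the same comparison with `groupby` and repeats the missing-value calculation from the cell above. The frame is toy data, not the Walmart set.

```python
import pandas as pd

# Toy data standing in for the merged sales frame (values are made up)
toy = pd.DataFrame({
    'Type': ['A', 'A', 'B', 'B', 'C'],
    'Weekly_Sales': [200.0, 300.0, 100.0, 140.0, 50.0],
    'CPI': [None, 211.1, None, None, 210.5],
})

# Median, mean, and row count of weekly sales per store type
summary = toy.groupby('Type')['Weekly_Sales'].agg(['median', 'mean', 'count'])
print(summary)

# Fraction of missing values per column, keeping only columns with gaps
missing_frac = toy.isnull().mean()
print(missing_frac[missing_frac > 0])
```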

In [None]:
# 3. Feature Engineering (time features, lags)
sales['Year'] = sales['Date'].dt.year
sales['Month'] = sales['Date'].dt.month
# isocalendar() returns a nullable UInt32 column; cast to plain int for the models
sales['Week'] = sales['Date'].dt.isocalendar().week.astype(int)
sales['Day'] = sales['Date'].dt.day
sales['IsHoliday'] = sales['IsHoliday'].astype(int)

# Lag features (previous week's sales per Store/Dept)
sales = sales.sort_values(['Store', 'Dept', 'Date'])
sales['Lag_1'] = sales.groupby(['Store', 'Dept'])['Weekly_Sales'].shift(1)
sales['Lag_2'] = sales.groupby(['Store', 'Dept'])['Weekly_Sales'].shift(2)

# Fill NA lags with 0 (or could use mean/median)
sales[['Lag_1', 'Lag_2']] = sales[['Lag_1', 'Lag_2']].fillna(0)

sales[['Year', 'Month', 'Week', 'Day', 'Lag_1', 'Lag_2']].head()
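
To make the lag logic concrete, this sketch builds `Lag_1` on a toy frame with two Store/Dept series. It also shows one alternative to filling with 0: dropping the rows that have no history, since the model may read a 0 lag as a real week of zero sales. The toy values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'Store': [1, 1, 1, 2, 2],
    'Dept': [1, 1, 1, 1, 1],
    'Date': pd.to_datetime(['2010-02-05', '2010-02-12', '2010-02-19',
                            '2010-02-05', '2010-02-12']),
    'Weekly_Sales': [10.0, 20.0, 30.0, 5.0, 7.0],
})

# Sort so that shift(1) really means "previous week" within each series
toy = toy.sort_values(['Store', 'Dept', 'Date'])
toy['Lag_1'] = toy.groupby(['Store', 'Dept'])['Weekly_Sales'].shift(1)

# The first week of each Store/Dept series has no history, so Lag_1 is NaN there
print(toy[['Store', 'Date', 'Weekly_Sales', 'Lag_1']])

# Alternative to fillna(0): keep only rows with a real previous-week value
with_history = toy.dropna(subset=['Lag_1'])
print(len(toy), '->', len(with_history), 'rows')
```

Which option is better depends on the data: dropping rows loses the first weeks of every series, while filling with 0 keeps them at the cost of a misleading feature value.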

In [None]:
# 4. Model Training (Random Forest & XGBoost)
# Predict each week's sales from store, calendar, economic, and lagged-sales features

# Select features
features_cols = ['Store', 'Dept', 'Type', 'Size', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment',
                 'IsHoliday', 'Year', 'Month', 'Week', 'Day', 'Lag_1', 'Lag_2']
# Convert categorical
sales['Type'] = sales['Type'].map({'A': 0, 'B': 1, 'C': 2})

# Drop rows with missing target
model_data = sales.dropna(subset=['Weekly_Sales'])

# Train/test split by date (time-aware)
cutoff_date = model_data['Date'].quantile(0.85)
train = model_data[model_data['Date'] <= cutoff_date]
test = model_data[model_data['Date'] > cutoff_date]

X_train = train[features_cols]
y_train = train['Weekly_Sales']
X_test = test[features_cols]
y_test = test['Weekly_Sales']

# Random Forest
rf = RandomForestRegressor(n_estimators=50, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# XGBoost
xgb_model = xgb.XGBRegressor(n_estimators=50, random_state=42, n_jobs=-1)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)

# Evaluation
print('Random Forest RMSE:', np.sqrt(mean_squared_error(y_test, y_pred_rf)))
print('XGBoost RMSE:', np.sqrt(mean_squared_error(y_test, y_pred_xgb)))
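
An RMSE value is easier to judge next to a baseline. The sketch below, on synthetic data, repeats the date-based split and scores a naive forecast that simply reuses last week's sales (`Lag_1`). A trained model should beat this number to justify its complexity. The data and the `rmse` helper are assumptions for illustration, not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

def rmse(y_true, y_pred):
    """Root mean squared error computed with NumPy."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic weekly series for a single store/department
rng = np.random.default_rng(42)
dates = pd.date_range('2010-02-05', periods=40, freq='7D')
toy = pd.DataFrame({
    'Date': dates,
    'Weekly_Sales': 100 + np.arange(40) + rng.normal(0, 5, 40),
})
toy['Lag_1'] = toy['Weekly_Sales'].shift(1)
toy = toy.dropna(subset=['Lag_1'])

# Time-aware split: everything after the 85% date quantile is held out
cutoff = toy['Date'].quantile(0.85)
train_toy = toy[toy['Date'] <= cutoff]
test_toy = toy[toy['Date'] > cutoff]

# Naive baseline: predict this week's sales with last week's value
baseline_rmse = rmse(test_toy['Weekly_Sales'], test_toy['Lag_1'])
print('Naive (Lag_1) RMSE:', round(baseline_rmse, 2))
```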

In [None]:
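# 5. Forecasting & Evaluation Plots
# Plot actual vs predicted for XGBoost, summed over all stores and departments per date.
# Plotting the raw rows would draw one zigzagging line across every Store/Dept series.
plot_df = pd.DataFrame({'Date': test['Date'].values,
                        'Actual': y_test.values,
                        'Predicted': y_pred_xgb})
plot_df = plot_df.groupby('Date')[['Actual', 'Predicted']].sum()

plt.figure(figsize=(12,5))
plt.plot(plot_df.index, plot_df['Actual'], label='Actual', alpha=0.7)
plt.plot(plot_df.index, plot_df['Predicted'], label='Predicted (XGBoost)', alpha=0.7)
plt.title('Total Actual vs Predicted Weekly Sales (XGBoost)')
plt.xlabel('Date')
plt.ylabel('Weekly Sales')
plt.legend()
plt.show()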

# Feature importance
xgb.plot_importance(xgb_model, max_num_features=10)
plt.title('Top 10 Feature Importances (XGBoost)')
plt.show()

## How to Run the Interactive Dashboard

To launch the interactive dashboard for further exploration and visualization:

1. Open a terminal in this folder.
2. Run:
   ```bash
   python app.py
   ```
3. Open your browser and go to [http://127.0.0.1:8050/](http://127.0.0.1:8050/)

The dashboard allows you to filter by store, department, and date, and visualize sales trends interactively.
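
The filtering described above could look roughly like this in pandas. `filter_sales` is a hypothetical helper written for illustration; it is not taken from `app.py`, whose implementation is not shown in this notebook.

```python
import pandas as pd

def filter_sales(df, store=None, dept=None, start=None, end=None):
    """Return rows matching the optional store, department, and date range."""
    mask = pd.Series(True, index=df.index)
    if store is not None:
        mask &= df['Store'] == store
    if dept is not None:
        mask &= df['Dept'] == dept
    if start is not None:
        mask &= df['Date'] >= pd.Timestamp(start)
    if end is not None:
        mask &= df['Date'] <= pd.Timestamp(end)
    return df[mask]

# Toy data (values are made up)
toy = pd.DataFrame({
    'Store': [1, 1, 2, 2],
    'Dept': [1, 2, 1, 1],
    'Date': pd.to_datetime(['2010-02-05', '2010-02-12',
                            '2010-02-05', '2010-02-12']),
    'Weekly_Sales': [10.0, 20.0, 30.0, 40.0],
})

# Filter to store 1 from 2010-02-12 onward, then build the per-date trend
subset = filter_sales(toy, store=1, start='2010-02-12')
trend = subset.groupby('Date')['Weekly_Sales'].sum()
print(trend)
```

Leaving an argument as `None` skips that filter, so the same function covers any combination of store, department, and date range.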