![image.png](https://i.imgur.com/a3uAqnb.png)

# Linear Regression for Walmart Sales Forecasting - Homework Assignment

In this homework, you will implement a **Linear Regression model** to predict weekly sales for Walmart stores using historical data. This project will help you understand the fundamentals of linear regression applied to time series forecasting.

## 📌 Project Overview
- **Task**: Predict weekly sales for Walmart stores
- **Algorithm**: Linear Regression
- **Dataset**: Walmart Store Sales dataset (provided)
- **Goal**: Build an accurate regression model using scikit-learn or a custom implementation

## 📚 Learning Objectives
By completing this assignment, you will:
- Understand linear regression for forecasting problems
- Learn data preprocessing and feature engineering for time series
- Practice holiday feature creation and encoding
- Learn about regression metrics and model evaluation
- Understand train/test splitting for time series data

## Dataset Overview

You are provided with a CSV file named **`Walmart_Store_sales.csv`**, containing weekly sales data from **February 5, 2010** to **November 1, 2012**. The dataset includes the following columns:

1. **Store** — Store ID number
2. **Date** — Date identifying the sales week
3. **Weekly\_Sales** — Weekly sales amount for the given store
4. **Holiday\_Flag** — Whether the week included a major holiday

   * `1` = Holiday week
   * `0` = Non-holiday week
5. **Temperature** — Temperature on the day of sale
6. **Fuel\_Price** — Cost of fuel in the region
7. **CPI** — Consumer Price Index
8. **Unemployment** — Unemployment rate in the region

## 1️⃣ Initial Setup and Library Installation

**Task**: Set up the environment and install necessary libraries.

In [2]:
from IPython.display import clear_output

In [3]:
# In case you run this notebook outside Colab (where the libraries aren't pre-installed)

# %pip install pandas
# %pip install numpy
# %pip install scikit-learn
# %pip install matplotlib
# %pip install seaborn
# %pip install kagglehub  # used in section 3 to download the dataset

clear_output()

## 2️⃣ Import Libraries and Configuration

**Task**: Import all necessary libraries and set up configuration parameters.

**Requirements**:
- Import data processing libraries (pandas, numpy)
- Import scikit-learn modules for regression and metrics
- Import visualization libraries
- Set random seeds for reproducibility
- Configure parameters for data splitting and model training

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

In [6]:
# Set random seed for reproducibility
np.random.seed(42)

# Configuration parameters
TEST_SIZE = 0.2              # Test set size (80:20 split)
RANDOM_STATE = 42           # Random state for reproducibility
SCALE_FEATURES = True       # Whether to scale features
PLOT_STYLE = 'seaborn-v0_8'  # Plotting style ('seaborn' was renamed to 'seaborn-v0_8' in matplotlib 3.6)

## 3️⃣ Data Loading and Exploration

**Task**: Load the Walmart sales dataset and explore its structure.

**Requirements**:
- Download and load the dataset
- Display basic information about the data
- Check for missing values and data types
- Understand the features and target variable
- Explore the date range and holiday patterns

In [7]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("yasserh/walmart-dataset")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/yasserh/walmart-dataset?dataset_version_number=1...
Extracting files...
Path to dataset files: /home/ali/.cache/kagglehub/datasets/yasserh/walmart-dataset/versions/1

In [None]:
# TODO: Load the dataset
data_df = None

# TODO: Display basic information about the dataset

In [None]:
# TODO: Check the shape of the dataset

# TODO: Display first few rows

# TODO: Check data types and info

In [None]:
# TODO: Check for missing values

# TODO: Display basic statistics

# TODO: Check unique stores and date range
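
The exploration steps above can be sketched on a tiny in-memory table before you run them on the real file. This is only an illustration: the rows are synthetic, `toy_df` is a hypothetical name, and the day-month-year date format is an assumption you should verify against the actual CSV.

```python
import io
import pandas as pd

# Synthetic stand-in for the CSV: same column names, made-up values.
csv_text = """Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
1,05-02-2010,1500000.0,0,42.3,2.57,211.1,8.1
1,12-02-2010,1600000.0,1,38.5,2.55,211.2,8.1
2,05-02-2010,2100000.0,0,40.2,2.57,210.8,8.3
2,12-02-2010,,1,38.5,2.55,210.9,8.3
"""
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)         # (rows, columns)
print(toy_df.dtypes)        # Date is still a plain string column here
print(toy_df.isna().sum())  # missing values per column

# Assumption: dates are stored as day-month-year strings.
dates = pd.to_datetime(toy_df["Date"], format="%d-%m-%Y")
print("Stores:", toy_df["Store"].nunique())
print("Date range:", dates.min().date(), "to", dates.max().date())
```

Passing an explicit `format` avoids pandas guessing month-first for ambiguous dates such as `05-02-2010`.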

## 4️⃣ Data Preprocessing and Feature Engineering

**Task**: Clean and prepare the data for linear regression training.

**Requirements**:
- Handle missing values if any
- Convert date column to datetime format
- Create holiday type features based on specific dates
- Extract time-based features from dates
- Handle categorical variables
- Prepare features and target variables

In [None]:
# TODO: Create a copy of the data for preprocessing
processed_data = None

# TODO: Convert Date column to datetime format

# TODO: Handle any missing values

In [None]:
# TODO: Create holiday mapping function
def get_holiday_type(date):
    """
    Map dates to specific holiday types based on the given holiday dates:
    - Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
    - Labour Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13  
    - Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
    - Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13
    """
    # TODO: Implement holiday mapping logic
    pass

# TODO: Apply holiday mapping to create new feature
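
One possible way to structure the holiday mapping is a dictionary lookup keyed by date. The sketch below uses the dates from the docstring above; the names `HOLIDAY_DATES`, `DATE_TO_HOLIDAY`, and `lookup_holiday`, and the `"No Holiday"` label for ordinary weeks, are illustrative choices rather than part of the assignment.

```python
import pandas as pd

# Holiday weeks as listed in the docstring above (day-month-year).
HOLIDAY_DATES = {
    "Super Bowl": ["12-02-2010", "11-02-2011", "10-02-2012", "08-02-2013"],
    "Labour Day": ["10-09-2010", "09-09-2011", "07-09-2012", "06-09-2013"],
    "Thanksgiving": ["26-11-2010", "25-11-2011", "23-11-2012", "29-11-2013"],
    "Christmas": ["31-12-2010", "30-12-2011", "28-12-2012", "27-12-2013"],
}

# Invert the mapping so each date points to its holiday name.
DATE_TO_HOLIDAY = {
    pd.to_datetime(d, format="%d-%m-%Y"): name
    for name, dates in HOLIDAY_DATES.items()
    for d in dates
}

def lookup_holiday(date, default="No Holiday"):
    """Return the holiday name for a date, or `default` for ordinary weeks."""
    return DATE_TO_HOLIDAY.get(pd.Timestamp(date).normalize(), default)

# Usage on a small synthetic series of week dates.
toy_dates = pd.to_datetime(
    pd.Series(["05-02-2010", "12-02-2010", "26-11-2010"]), format="%d-%m-%Y"
)
toy_types = toy_dates.map(lookup_holiday)
print(toy_types.tolist())
```

An exact-date lookup only works if the dataset's week dates coincide with the listed holiday dates, so it is worth checking that the rows with `Holiday_Flag == 1` all receive a holiday name.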

In [None]:
# TODO: Create additional time-based features (optional)
# Examples: month, quarter, week of year, etc.

# TODO: Encode categorical variables (holiday types)

# TODO: Remove unnecessary columns (like original Date if converted)
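
A minimal sketch of time-based features and one-hot encoding, shown on a synthetic three-row frame. The `Holiday_Type` column name and the `Holiday` prefix are assumptions; adapt them to whatever your mapping step produced.

```python
import pandas as pd

# Synthetic rows; Holiday_Type is a hypothetical column from the mapping step.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12", "2010-11-26"]),
    "Holiday_Type": ["No Holiday", "Super Bowl", "Thanksgiving"],
    "Weekly_Sales": [1.5e6, 1.6e6, 2.0e6],
})

# Calendar features extracted through the .dt accessor.
toy["Year"] = toy["Date"].dt.year
toy["Month"] = toy["Date"].dt.month
toy["Quarter"] = toy["Date"].dt.quarter
toy["WeekOfYear"] = toy["Date"].dt.isocalendar().week.astype(int)

# One-hot encode the holiday type, then drop the raw datetime column,
# which LinearRegression cannot consume directly.
toy = pd.get_dummies(toy, columns=["Holiday_Type"], prefix="Holiday", dtype=int)
toy = toy.drop(columns=["Date"])

print(toy.columns.tolist())
```

One-hot encoding treats holiday types as unordered categories, whereas `LabelEncoder` would impose an arbitrary numeric order on them. With an intercept in the model, passing `drop_first=True` to `pd.get_dummies` avoids perfectly collinear dummy columns.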

## 5️⃣ Data Splitting and Feature Scaling

**Task**: Split the data and prepare features for linear regression.

**Requirements**:
- Separate features from target variable
- Split data into training and testing sets (80:20)
- Scale features if necessary
- Ensure proper data types for regression

In [None]:
# TODO: Separate features and target
# Target variable: Weekly_Sales
# Features: All other relevant columns

X = None  # Features
y = None  # Target

# TODO: Display feature names and target info

In [None]:
# TODO: Split the data into training and testing sets

X_train, X_test, y_train, y_test = None, None, None, None

# TODO: Display split information
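
The sketch below contrasts two ways to obtain an 80:20 split on synthetic weekly data. `train_test_split` shuffles rows by default, so future weeks can end up in the training set. A chronological cutoff is a common alternative for time series. Which one the assignment expects is not stated here, so treat the second option as a suggestion.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic weekly data: 10 consecutive weeks with random values.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Date": pd.date_range("2010-02-05", periods=10, freq="7D"),
    "Temperature": rng.uniform(30, 90, size=10),
    "Weekly_Sales": rng.uniform(1e6, 2e6, size=10),
})
X_toy = toy[["Temperature"]]
y_toy = toy["Weekly_Sales"]

# Option 1: random 80:20 split (rows are shuffled before splitting).
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Option 2: chronological 80:20 split (train on the past, test on the future).
toy_sorted = toy.sort_values("Date")
cutoff = int(len(toy_sorted) * 0.8)
train_part = toy_sorted.iloc[:cutoff]
test_part = toy_sorted.iloc[cutoff:]

print("Random split sizes:", len(X_tr), len(X_te))
print("Chronological test period starts:", test_part["Date"].min().date())
```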

In [None]:
# TODO: Scale features if needed (optional but recommended)
scaler = None

# TODO: Apply scaling to training and test sets if using scaler

# TODO: Display shapes of final datasets
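
If you do scale, a common pattern is to fit the scaler on the training set only and reuse its statistics for the test set, so no information from the test rows leaks into preprocessing. A minimal sketch on synthetic arrays (all names here are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrices: two columns, e.g. temperature and fuel price.
X_train_toy = np.array([[30.0, 2.5], [50.0, 3.0], [70.0, 3.5]])
X_test_toy = np.array([[60.0, 3.2]])

toy_scaler = StandardScaler()
X_train_scaled = toy_scaler.fit_transform(X_train_toy)  # learn mean/std from train
X_test_scaled = toy_scaler.transform(X_test_toy)        # reuse the train statistics

print("Train means after scaling:", X_train_scaled.mean(axis=0))
print("Shapes:", X_train_scaled.shape, X_test_scaled.shape)
```

Scaling does not change the predictions of an ordinary least-squares model, but it puts the coefficients on a comparable scale, which helps when interpreting them later.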

## 6️⃣ Model Training

**Task**: Train a linear regression model on the prepared data.

**Requirements**:
- Initialize Linear Regression model
- Train the model on training data
- Display model coefficients and intercept

In [None]:
# TODO: Initialize Linear Regression model
model = None

# TODO: Train the model

# TODO: Display model information (coefficients, intercept)
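
As a sanity check on the API, the sketch below fits `LinearRegression` to synthetic data generated from a known linear rule, so the recovered coefficients can be compared with the true ones. The feature names `x1` and `x2` are placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated exactly from y = 3*x1 - 2*x2 + 5.
X_toy = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 5.0

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)

# One coefficient per feature column, plus a single intercept.
for name, coef in zip(["x1", "x2"], toy_model.coef_):
    print(f"{name}: {coef:.3f}")
print(f"intercept: {toy_model.intercept_:.3f}")
```

On the real data, pairing `model.coef_` with the feature column names in the same way makes the output much easier to read than the raw array.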

## 7️⃣ Model Evaluation

**Task**: Evaluate the trained model using appropriate regression metrics.

**Requirements**:
- Make predictions on both training and test sets
- Calculate Mean Absolute Error (MAE)
- Calculate Root Mean Squared Error (RMSE)
- Calculate R-squared score
- Compare training vs test performance
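
For reference, the standard definitions of the three metrics are given below, with $y_i$ the actual weekly sales, $\hat{y}_i$ the prediction, $\bar{y}$ the mean of the actual values, and $n$ the number of samples:

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
$$

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}
$$

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}
$$

MAE and RMSE are in the same units as `Weekly_Sales`, while $R^2$ is unitless.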

In [None]:
# TODO: Make predictions on training and test sets
y_train_pred = None
y_test_pred = None

In [None]:
# TODO: Calculate evaluation metrics for training set
train_mae = None
train_rmse = None
train_r2 = None

# TODO: Calculate evaluation metrics for test set  
test_mae = None
test_rmse = None
test_r2 = None

# TODO: Display results in a formatted way
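
A small sketch of the three metrics on synthetic values. scikit-learn has no required-use RMSE helper in this notebook's imports, so the example takes the square root of `mean_squared_error`. The arrays and variable names are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic actual and predicted values.
y_true_toy = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_toy = np.array([110.0, 190.0, 310.0, 380.0])

toy_mae = mean_absolute_error(y_true_toy, y_pred_toy)
toy_rmse = np.sqrt(mean_squared_error(y_true_toy, y_pred_toy))  # RMSE = sqrt(MSE)
toy_r2 = r2_score(y_true_toy, y_pred_toy)

print(f"MAE:  {toy_mae:,.2f}")
print(f"RMSE: {toy_rmse:,.2f}")
print(f"R²:   {toy_r2:.4f}")
```

Computing the same three numbers for the training and test sets and printing them side by side makes a gap between the two (a sign of overfitting) easy to spot.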

## 8️⃣ Visualization and Analysis

**Task**: Create visualizations to analyze model performance and results.

**Requirements**:
- Plot predicted vs actual values
- Visualize feature importance (coefficients)
- Show sample predictions

In [None]:
# TODO: Create predicted vs actual scatter plot for test set

# TODO: Add perfect prediction line for reference
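
One way to draw a predicted-vs-actual plot with a reference line is sketched below on synthetic numbers. Points on the dashed diagonal are perfect predictions; points above it are over-predictions. The data and styling choices are assumptions, not requirements.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic "actual" sales and noisy "predictions" for illustration only.
rng = np.random.default_rng(0)
actual = rng.uniform(5e5, 2e6, size=50)
predicted = actual + rng.normal(0, 1e5, size=50)

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(actual, predicted, alpha=0.6, label="Predictions")

# Perfect-prediction reference: the line y = x across the full data range.
lims = [min(actual.min(), predicted.min()), max(actual.max(), predicted.max())]
ax.plot(lims, lims, "r--", label="Perfect prediction (y = x)")

ax.set_xlabel("Actual Weekly_Sales")
ax.set_ylabel("Predicted Weekly_Sales")
ax.set_title("Predicted vs actual (synthetic data)")
ax.legend()
fig.tight_layout()
```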

In [None]:
# TODO: Visualize feature importance (model coefficients)

# TODO: Create bar plot of feature coefficients
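
A possible layout for the coefficient plot is a horizontal bar chart sorted by absolute value. The coefficient values below are invented placeholders. In your notebook they would come from `model.coef_` paired with the feature names.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical coefficients for illustration; replace with your model's values.
toy_coefs = pd.Series({
    "Temperature": -1200.0,
    "Fuel_Price": 800.0,
    "CPI": -45000.0,
    "Unemployment": -30000.0,
})

# Sort by magnitude so the most influential features appear at the top.
ordered = toy_coefs.reindex(toy_coefs.abs().sort_values().index)

fig, ax = plt.subplots(figsize=(7, 4))
ax.barh(ordered.index, ordered.values)
ax.axvline(0, color="black", linewidth=0.8)  # separates negative from positive effects
ax.set_xlabel("Coefficient value")
ax.set_title("Linear regression coefficients (synthetic values)")
fig.tight_layout()
```

Coefficient magnitudes are only comparable across features when the features were scaled to a common range before training.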

In [None]:
# TODO: Show sample predictions vs actual values

# TODO: Display first 10 predictions with actual values in a nice format
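
A compact way to show sample predictions is a small comparison table. The sketch uses synthetic numbers, and the column names `Actual`, `Predicted`, `Error`, and `Abs_Pct_Error` are illustrative choices.

```python
import numpy as np
import pandas as pd

# Synthetic values: every prediction is 5% above the actual figure.
actual = np.linspace(1.0e6, 2.1e6, 12)
predicted = actual * 1.05

comparison = pd.DataFrame({"Actual": actual, "Predicted": predicted})
comparison["Error"] = comparison["Predicted"] - comparison["Actual"]
comparison["Abs_Pct_Error"] = (
    comparison["Error"].abs() / comparison["Actual"] * 100
).round(2)

# Show only the first 10 rows, with thousands separators for readability.
first_ten = comparison.head(10)
print(first_ten.to_string(float_format=lambda v: f"{v:,.2f}"))
```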

## 📝 Evaluation Criteria

Your homework will be evaluated based on:

1. **Implementation Correctness (50%)**
   - Proper data preprocessing and cleaning
   - Correct holiday feature engineering
   - Working linear regression implementation
   - Appropriate train/test splitting

2. **Model Performance (25%)**
   - Reasonable regression metrics (MAE, RMSE, R²)
   - Proper model evaluation and interpretation
   - Good generalization to test set

3. **Code Quality and Analysis (25%)**
   - Clean, readable code with comments
   - Meaningful visualizations and analysis
   - Proper interpretation of results
   - Good coding practices and documentation