# Task 3: Feature Engineering – Retail Sales Forecasting

Feature engineering is a crucial stage in any machine learning pipeline.
In this task, cleaned retail data is transformed into model-ready
features intended to improve predictive performance.

This notebook covers:
- Loading cleaned retail data
- Encoding categorical variables
- Scaling numerical features
- Time-based & retail domain feature creation
- Correlation analysis for feature selection
- Exporting the final engineered dataset


In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder, StandardScaler
import matplotlib.pyplot as plt


In [2]:
df = pd.read_csv("cleaned_retail_data.csv")
df.head()


,date,store_type,product_category,region,brand,sales,inventory,price,discount
0,2023-01-01,Urban,Electronics,North,Brand_A,120,30,15000,0.1
1,2023-01-02,Urban,Grocery,North,Brand_B,300,200,150,0.05
2,2023-01-03,Rural,Electronics,South,Brand_A,90,25,15000,0.15
3,2023-01-04,Rural,Clothing,South,Brand_C,200,60,1200,0.2
4,2023-01-05,Urban,Grocery,East,Brand_B,350,220,150,0.0


In [3]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   date              15 non-null     object 
 1   store_type        15 non-null     object 
 2   product_category  15 non-null     object 
 3   region            15 non-null     object 
 4   brand             15 non-null     object 
 5   sales             15 non-null     int64  
 6   inventory         15 non-null     int64  
 7   price             15 non-null     int64  
 8   discount          15 non-null     float64
dtypes: float64(1), int64(3), object(5)
memory usage: 1.2+ KB


In [4]:
df['date'] = pd.to_datetime(df['date'])


In [5]:
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['week'] = df['date'].dt.isocalendar().week.astype(int)
df['weekday'] = df['date'].dt.dayofweek
df['quarter'] = df['date'].dt.quarter
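
The integer calendar features above treat Sunday (6) and Monday (0), or December and January, as far apart even though they are adjacent in time. One common remedy is a sine/cosine encoding. The following is a minimal sketch on a hypothetical `demo` frame (not the notebook's `df`); the `*_sin` and `*_cos` column names are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's `df`; only a date column is needed.
demo = pd.DataFrame({"date": pd.date_range("2023-01-01", periods=14, freq="D")})
demo["month"] = demo["date"].dt.month
demo["weekday"] = demo["date"].dt.dayofweek  # Monday=0 ... Sunday=6

# Map each periodic value onto the unit circle so the last and first values
# of a cycle (e.g. Sunday and Monday) end up next to each other.
demo["weekday_sin"] = np.sin(2 * np.pi * demo["weekday"] / 7)
demo["weekday_cos"] = np.cos(2 * np.pi * demo["weekday"] / 7)
demo["month_sin"] = np.sin(2 * np.pi * (demo["month"] - 1) / 12)
demo["month_cos"] = np.cos(2 * np.pi * (demo["month"] - 1) / 12)
```

Whether this helps depends on the model: tree-based models often cope with the raw integers, while linear models and neural networks tend to benefit from the cyclical form.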


In [7]:
df['is_weekend'] = np.where(df['weekday'] >= 5, 1, 0)


In [8]:
def get_season(month):
    if month in [12, 1, 2]:
        return "Winter"
    elif month in [3, 4, 5]:
        return "Summer"
    elif month in [6, 7, 8]:
        return "Monsoon"
    else:
        return "Autumn"

df['season'] = df['month'].apply(get_season)


In [9]:
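# Fit one LabelEncoder per column so each fitted mapping (classes_) is kept
# and can be reused to encode new data or decode values later.
encoders = {}
for col in ['store_type', 'region', 'brand']:
    encoders[col] = LabelEncoder()
    df[f'{col}_enc'] = encoders[col].fit_transform(df[col])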
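
`LabelEncoder` is designed for target labels and raises an error on categories it has not seen. For input features, scikit-learn's `OrdinalEncoder` can encode several columns at once and assign a fallback code to unseen values. A minimal sketch, assuming hypothetical `train` and `new_rows` frames and an arbitrary fallback code of `-1`:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training data and a new row containing an unseen region.
train = pd.DataFrame({
    "store_type": ["Urban", "Rural", "Urban"],
    "region": ["North", "South", "East"],
    "brand": ["Brand_A", "Brand_B", "Brand_C"],
})
new_rows = pd.DataFrame({
    "store_type": ["Rural"],
    "region": ["West"],  # not present when the encoder was fitted
    "brand": ["Brand_A"],
})

cat_cols = ["store_type", "region", "brand"]

# Unknown categories are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train[cat_cols])

encoded_new = encoder.transform(new_rows[cat_cols])
```

The fitted categories remain available in `encoder.categories_`, so the same mapping can be applied consistently at prediction time.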


In [10]:
df = pd.get_dummies(
    df,
    columns=['product_category', 'season'],
    drop_first=True
)
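
One caveat with `pd.get_dummies`: it only creates columns for categories present in the data, so a sample that covers a single season yields no season columns once `drop_first=True` is applied. A possible workaround, sketched here on a hypothetical `demo` frame, is to declare the full category list up front; the `all_seasons` list is an assumption based on the labels returned by `get_season`.

```python
import pandas as pd

# Hypothetical sample in which only one season occurs.
demo = pd.DataFrame({"season": ["Winter", "Winter", "Winter"]})

# Declaring every possible category keeps the dummy columns stable,
# regardless of which seasons happen to appear in the sample.
all_seasons = ["Autumn", "Monsoon", "Summer", "Winter"]
demo["season"] = pd.Categorical(demo["season"], categories=all_seasons)

# drop_first removes the first declared category ("Autumn") as the baseline.
dummies = pd.get_dummies(demo, columns=["season"], drop_first=True)
```

With this approach the training and inference frames share the same column layout even when some categories are missing from one of them.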


In [11]:
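df = df.sort_values('date')

# Windows are counted in rows (not calendar days) and include the current
# row's sales, so these averages contain the value being forecast.
df['rolling_7day_sales'] = df['sales'].rolling(7, min_periods=1).mean()
df['rolling_30day_sales'] = df['sales'].rolling(30, min_periods=1).mean()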
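
Because the rolling windows above include the current row, each average already contains the sales value a forecasting model would be asked to predict. A minimal sketch of a lagged variant, using a hypothetical `demo` frame and an illustrative column name `rolling_3day_sales_lagged`:

```python
import pandas as pd

# Hypothetical daily sales series, sorted chronologically.
demo = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=6, freq="D"),
    "sales": [10, 20, 30, 40, 50, 60],
}).sort_values("date")

# shift(1) moves every value down one row, so the window ending at row t
# only contains sales from rows strictly before t.
demo["rolling_3day_sales_lagged"] = (
    demo["sales"].shift(1).rolling(3, min_periods=1).mean()
)
```

The first row has no history and is therefore `NaN`; it can be dropped or imputed before modeling. If the data holds several stores or categories per date, the same pattern would typically be applied within a `groupby`.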


In [12]:
# Derived from the target (sales), so it is only known after the sale has happened.
df['discount_impact'] = df['sales'] * df['discount']


In [13]:
# Units sold per unit of price: a simple ratio, not an elasticity in the economic sense.
df['price_elasticity'] = df['sales'] / df['price']
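
The `sales / price` ratio is a convenient proxy, but price elasticity of demand usually refers to the percentage change in quantity per percentage change in price. Under a constant-elasticity assumption it can be estimated as the slope of a log-log fit. The sketch below uses synthetic, hypothetical numbers purely to illustrate the calculation.

```python
import numpy as np

# Synthetic, hypothetical data: demand follows sales = 1000 * price ** -1.5.
price = np.array([50.0, 100.0, 150.0, 200.0, 400.0])
sales = 1000.0 * price ** -1.5

# In a constant-elasticity model, log(sales) is linear in log(price),
# and the slope of that line is the elasticity.
slope, intercept = np.polyfit(np.log(price), np.log(sales), deg=1)
elasticity = slope
```

On real data the estimate is noisy and would normally be computed per product or category with enough price variation to make the fit meaningful.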


In [14]:
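scaler = StandardScaler()

# Note: 'sales' is the forecasting target and is standardized here too,
# and the scaler is fitted on the full dataset rather than a training split.
num_cols = [
    'sales', 'inventory', 'price', 'discount',
    'rolling_7day_sales', 'rolling_30day_sales',
    'discount_impact', 'price_elasticity'
]

df[num_cols] = scaler.fit_transform(df[num_cols])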
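
Fitting the scaler on all rows lets statistics from later dates influence earlier ones. When the data is later split for model evaluation, a common alternative is to fit on the training portion only and reuse those statistics on the rest. A minimal sketch with a hypothetical `demo` frame and an assumed 80/20 chronological split:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical feature frame, sorted chronologically.
demo = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=10, freq="D"),
    "price": np.arange(10, dtype=float) * 100.0,
    "discount": np.linspace(0.0, 0.45, 10),
}).sort_values("date")

feature_cols = ["price", "discount"]
split = int(len(demo) * 0.8)  # first 80% of dates for training (assumed ratio)
train, test = demo.iloc[:split].copy(), demo.iloc[split:].copy()

# Learn mean and standard deviation from the training rows only,
# then apply the same transformation to the held-out rows.
scaler = StandardScaler()
train[feature_cols] = scaler.fit_transform(train[feature_cols])
test[feature_cols] = scaler.transform(test[feature_cols])
```

The target column is left out of `feature_cols` here so that predictions stay in the original sales units.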


In [15]:
corr_matrix = df[num_cols].corr().abs()
corr_matrix


,sales,inventory,price,discount,rolling_7day_sales,rolling_30day_sales,discount_impact,price_elasticity
sales,1.0,0.955678,0.869612,0.775344,0.568467,0.545366,0.083957,0.904475
inventory,0.955678,1.0,0.714396,0.722963,0.560006,0.571863,0.265229,0.988911
price,0.869612,0.714396,1.0,0.66585,0.539679,0.496375,0.304943,0.605537
discount,0.775344,0.722963,0.66585,1.0,0.283325,0.276809,0.372277,0.67697
rolling_7day_sales,0.568467,0.560006,0.539679,0.283325,1.0,0.916451,0.039641,0.53015
rolling_30day_sales,0.545366,0.571863,0.496375,0.276809,0.916451,1.0,0.0032,0.556128
discount_impact,0.083957,0.265229,0.304943,0.372277,0.039641,0.0032,1.0,0.369412
price_elasticity,0.904475,0.988911,0.605537,0.67697,0.53015,0.556128,0.369412,1.0


In [16]:
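# Keep only the upper triangle so each pair of columns is compared once.
# Note: num_cols includes the target 'sales', so features that correlate
# strongly with the target are flagged here as well.
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

to_drop = [col for col in upper.columns if any(upper[col] > 0.85)]
to_drop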


['inventory', 'price', 'rolling_30day_sales', 'price_elasticity']
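
In the matrix above, `inventory` and `price` exceed the 0.85 threshold only through their correlation with `sales`, which is the target rather than a competing feature. If the goal is to reduce multicollinearity among predictors, one option is to leave the target out of the comparison. A minimal sketch on synthetic data; the `feat_*` column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic, hypothetical data: one near-duplicate feature pair and a target
# that is strongly driven by feat_a.
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
demo = pd.DataFrame({
    "sales": 3 * a + rng.normal(scale=0.1, size=n),       # target
    "feat_a": a,
    "feat_a_copy": a + rng.normal(scale=0.01, size=n),    # near-duplicate of feat_a
    "feat_b": rng.normal(size=n),
})

target = "sales"

# Compare features with each other only; the target is excluded.
feature_corr = demo.drop(columns=[target]).corr().abs()
mask = np.triu(np.ones(feature_corr.shape, dtype=bool), k=1)
upper = feature_corr.where(mask)

to_drop = [col for col in upper.columns if (upper[col] > 0.85).any()]
```

Here only the redundant copy is removed, while `feat_a` is kept even though it correlates strongly with the target, which is usually a desirable property in a predictor.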

In [17]:
df.drop(columns=to_drop, inplace=True)


In [18]:
df.drop(columns=['date', 'store_type', 'region', 'brand'], inplace=True)


In [19]:
# Only the last expression in a cell is displayed, so inspect the structure here.
df.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 16 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   sales                         15 non-null     float64
 1   discount                      15 non-null     float64
 2   year                          15 non-null     int32  
 3   month                         15 non-null     int32  
 4   week                          15 non-null     int64  
 5   weekday                       15 non-null     int32  
 6   quarter                       15 non-null     int32  
 7   is_weekend                    15 non-null     int64  
 8   store_type_enc                15 non-null     int64  
 9   region_enc                    15 non-null     int64  
 10  brand_enc                     15 non-null     int64  
 11  product_category_Electronics  15 non-null     bool   
 12  product_category_Furniture    15 non-null     bool   
 13  product

In [20]:
df.to_csv("engineered_retail_data.csv", index=False)


## Conclusion

In this task, cleaned retail data was transformed into
a machine learning–ready dataset using feature engineering techniques.

Key achievements:
- Time-based and seasonal feature extraction
- Categorical encoding and numerical scaling
- Retail domain feature creation
- Multicollinearity reduction
- Export of final engineered dataset

The exported dataset can now be used for forecasting and predictive modeling.
Note that `sales`, the target, was standardized along with the features.
