Plan for developing the anomaly detection model for your weekly sales data:

### Plan Outline:

1. **Data Preprocessing**:
   - Load the data and convert the 'Date' column to a datetime format.
   - Sort the data by state and date to ensure the time series is in the correct order.
   - Handle missing values if necessary (e.g., fill or remove).

2. **Feature Engineering**:
   - Group the data by state, since lag features must be computed within each state's own series.
   - Create lag features for 'Weekly_Sales' to capture previous sales trends (consider several weeks back).
   - Consider the impact of 'Holiday_Flag' in feature creation to capture holiday sales effects.

3. **Model Selection**:
   - Given the need for high precision and good recall, consider both unsupervised anomaly detectors and supervised classifiers trained on the labelled 'Anomaly' column. Potential candidates include:
     - **Isolation Forest**
     - **Random Forest Classifier**
     - **Gradient Boosting Machines (e.g., XGBoost)**

4. **Training and Validation**:
   - Split the data chronologically into training (60%), validation (20%), and test (20%) sets, so that each model is validated and tested only on data that comes after the data it was trained on.
   - Train the models on the training set.
   - Evaluate each model on the validation set using accuracy, precision, recall, and F1-score.

5. **Final Model Selection**:
   - Select the model that best achieves the balance between precision and recall as specified.
   - Retrain this model on combined training and validation sets.
   - Perform a final evaluation on the test set.

6. **Model Deployment**:
   - Prepare the model and the preprocessing steps to be applied to new or unseen future data for predictions.
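
The chronological split in step 4 can be sketched as below. This is a minimal illustration on made-up data, not the notebook's own split: the helper name `chronological_split` and its parameters are assumptions introduced here, and it splits globally by date rather than per state.

```python
import pandas as pd

def chronological_split(df, date_col="Date", train_frac=0.6, val_frac=0.2):
    """Split df into train/validation/test by time, oldest rows first (hypothetical helper)."""
    ordered = df.sort_values(date_col).reset_index(drop=True)
    n = len(ordered)
    train_end = int(train_frac * n)
    val_end = train_end + int(val_frac * n)
    return ordered.iloc[:train_end], ordered.iloc[train_end:val_end], ordered.iloc[val_end:]

# Toy weekly series standing in for the real data (values are made up)
toy = pd.DataFrame({
    "Date": pd.date_range("2021-01-01", periods=10, freq="W-FRI"),
    "Weekly_Sales": range(10),
})
train, validate, test = chronological_split(toy)
```

Because the rows are sorted before slicing, every validation date is later than every training date, and every test date is later than every validation date.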



In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import (accuracy_score, classification_report, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from xgboost import XGBClassifier

import warnings
# Suppress all warnings
warnings.filterwarnings("ignore")

In [2]:
base_folder = 'C:\\Geeta\\learning\\projects\\AnomalyDetectionSXM\\Notebooks\\Datasets\\Pipeline'
dataset_name = 'Walmart_Weekly'
train_file = base_folder + '/train/' + dataset_name +'_train.csv'
inference_file = base_folder + '/inference/' + dataset_name +'_inference.csv'


In [3]:
# Load the data
file_path = train_file
data = pd.read_csv(file_path)

# Convert 'Date' to datetime
# format='mixed' infers the format per value (requires pandas >= 2.0)
data['Date'] = pd.to_datetime(data['Date'], format='mixed')

# Sort data by state and date
data.sort_values(by=['State', 'Date'], inplace=True)

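# Handle missing values (simple forward fill as a placeholder; after the sort
# above it can carry one state's last value into the next state's first row)
data.ffill(inplace=True)

# Calendar features
data['month'] = data['Date'].dt.month
data['year'] = data['Date'].dt.year

# Drop any rows that still contain NaN (lag features are created later)
data.dropna(inplace=True)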
data

Unnamed: 0,Date,Weekly_Sales,Temperature,Fuel_Price,CPI,Unemployment,Holiday_Flag,Anomaly,Sales_Amount_Upper,Sales_Amount_Lower,State,month,year
142,2021-02-05,12535669.50,32.53,2.77,150.50,8.84,0,0,13672907.31,11591637.96,California,2,2021
143,2021-02-12,11788052.22,32.71,2.75,150.58,8.84,1,0,12967716.55,10886447.20,California,2,2021
144,2021-02-19,11969069.14,37.00,2.73,150.62,8.84,0,0,12691307.91,10610038.55,California,2,2021
145,2021-02-26,11131071.10,35.52,2.73,150.66,8.84,0,0,12804766.72,10723497.37,California,2,2021
146,2021-03-05,11680802.02,41.10,2.77,150.69,8.84,0,0,12610512.80,10529243.45,California,3,2021
...,...,...,...,...,...,...,...,...,...,...,...,...,...
705,2023-09-22,6851231.75,70.84,3.90,175.44,7.26,0,0,7771717.32,6550932.76,Virginia,9,2023
706,2023-09-29,6801804.41,70.62,3.84,175.57,7.26,0,0,7538640.58,6317856.02,Virginia,9,2023
707,2023-10-06,7354905.93,68.47,3.84,175.69,6.96,0,0,7571680.67,6350896.11,Virginia,10,2023
708,2023-10-13,7149626.82,61.82,3.92,175.82,6.96,0,0,7860594.00,6639809.44,Virginia,10,2023
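
The forward fill above runs over the whole frame, so a gap in the first row of one state would be filled from the previous state. A possible alternative, shown here on made-up rows, is to fill within each state. This is a sketch under that assumption, not a change the notebook makes.

```python
import numpy as np
import pandas as pd

# Toy frame: Virginia's first week is missing (values are made up)
toy = pd.DataFrame({
    "State": ["California", "California", "Virginia", "Virginia"],
    "Date": pd.to_datetime(["2021-02-05", "2021-02-12", "2021-02-05", "2021-02-12"]),
    "Weekly_Sales": [100.0, 110.0, np.nan, 70.0],
})
toy = toy.sort_values(["State", "Date"])

# A plain ffill copies California's last value into Virginia's first row
leaky = toy["Weekly_Sales"].ffill()

# Filling within each state leaves that gap as NaN instead
per_state = toy.groupby("State")["Weekly_Sales"].ffill()
```

Rows that are still NaN after a per-state fill can then be dropped or imputed explicitly.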


In [4]:
def preprocess_date(df):
    """Return a date-sorted copy of df with Year/Month/Day columns in place of 'Date'."""
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Convert 'Date' to datetime format
    df['Date'] = pd.to_datetime(df['Date'])

    # Extract year, month and day from 'Date'
    df['Year'] = df['Date'].dt.year
    df['Month'] = df['Date'].dt.month
    df['Day'] = df['Date'].dt.day

    # Sort by date, then drop 'Date' as it is no longer needed
    df = df.sort_values('Date')
    return df.drop('Date', axis=1)

In [5]:

# Group by 'State' column
grouped = data.groupby('State')

# Initialize a scaler and an encoder
# scaler = MinMaxScaler()
encoder = LabelEncoder()

processed_dfs = []

for name, group in grouped:
    # Separate target variable 'Anomaly'
    target = group['Anomaly']
    group = group.drop('Anomaly', axis=1)

    # Numeric feature columns (scaling is currently disabled)
    numeric_cols = group.select_dtypes(include=['float64', 'int64']).columns
    # group[numeric_cols] = scaler.fit_transform(group[numeric_cols])
    # group['Weekly_Sales'] = scaler.fit_transform(group[['Weekly_Sales']])
    
    # Score features with SelectKBest; k='all' keeps every column, so nothing is dropped yet
    selector = SelectKBest(score_func=f_classif, k='all')
    selected_features = selector.fit_transform(group[numeric_cols], target)

    # Recombine the selected numeric features with the remaining columns
    group = pd.concat([pd.DataFrame(selected_features, columns=numeric_cols, index=group.index), group.drop(columns=numeric_cols)], axis=1)
    
    
    #Re-assigning 'Anomaly' to group
    group['Anomaly'] = target
    
    # Create lagged features
    for i in range(52, 0, -1):
        group['Weekly_sales_lag_'+str(i)] = group['Weekly_Sales'].shift(i)

    # # Handle any remaining NaN values
    # group = group.dropna()
        
    # Append the result to the list
    processed_dfs.append(group)

# Concatenate all processed dfs
data_preprocessed = pd.concat(processed_dfs)

# Apply label encoding to the 'State' column of the processed frame
data_preprocessed['State'] = encoder.fit_transform(data_preprocessed['State'])

data_preprocessed = preprocess_date(data_preprocessed)

data_preprocessed.dropna(inplace=True)
data_preprocessed.reset_index(inplace=True, drop=True)

data_preprocessed.shape

(450, 67)
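
The per-state lag loop above can also be written with `groupby(...).shift(...)`, which keeps lags from crossing state boundaries without an explicit loop over groups. The sketch below uses made-up data; the helper name `add_lag_features` and the column naming `Weekly_Sales_lag_i` are assumptions and differ slightly from the notebook's `Weekly_sales_lag_i`.

```python
import pandas as pd

def add_lag_features(df, n_lags, group_col="State", value_col="Weekly_Sales"):
    """Return a copy of df with per-group lag columns of value_col (hypothetical helper)."""
    grouped = df.groupby(group_col)[value_col]
    lags = {f"{value_col}_lag_{i}": grouped.shift(i) for i in range(1, n_lags + 1)}
    # A single concat avoids inserting many columns one at a time
    return pd.concat([df, pd.DataFrame(lags, index=df.index)], axis=1)

# Toy data: two states with three weeks each, already in date order (values are made up)
toy = pd.DataFrame({
    "State": ["A", "A", "A", "B", "B", "B"],
    "Weekly_Sales": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
lagged = add_lag_features(toy, n_lags=2)
```

The first rows of each state have NaN lags, which is why the notebook drops NaN rows after building the features.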

In [6]:
data_preprocessed.sample(3)

Unnamed: 0,Weekly_Sales,Temperature,Fuel_Price,CPI,Unemployment,Holiday_Flag,Sales_Amount_Upper,Sales_Amount_Lower,State,month,...,Weekly_sales_lag_7,Weekly_sales_lag_6,Weekly_sales_lag_5,Weekly_sales_lag_4,Weekly_sales_lag_3,Weekly_sales_lag_2,Weekly_sales_lag_1,Year,Month,Day
87,11421205.89,68.15,3.87,153.68,8.36,0.0,12020871.82,9939602.47,0,6,...,10860567.44,11980295.3,10772361.35,11277845.75,10835830.55,10573794.19,10909456.28,2022,6,3
227,8464546.97,41.87,3.33,173.06,8.01,0.0,8704213.78,7483429.21,4,12,...,6760843.59,7253659.47,7089488.58,6845337.79,9119292.99,7180298.03,7960013.13,2022,12,16
417,11715668.14,73.35,3.95,165.71,7.31,1.0,12555119.33,9704251.54,2,9,...,11212867.28,10711004.76,11501434.14,11713115.39,11593601.41,11978527.86,11813627.25,2023,9,8


In [7]:
data_preprocessed.columns

Index(['Weekly_Sales', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment',
       'Holiday_Flag', 'Sales_Amount_Upper', 'Sales_Amount_Lower', 'State',
       'month', 'year', 'Anomaly', 'Weekly_sales_lag_52',
       'Weekly_sales_lag_51', 'Weekly_sales_lag_50', 'Weekly_sales_lag_49',
       'Weekly_sales_lag_48', 'Weekly_sales_lag_47', 'Weekly_sales_lag_46',
       'Weekly_sales_lag_45', 'Weekly_sales_lag_44', 'Weekly_sales_lag_43',
       'Weekly_sales_lag_42', 'Weekly_sales_lag_41', 'Weekly_sales_lag_40',
       'Weekly_sales_lag_39', 'Weekly_sales_lag_38', 'Weekly_sales_lag_37',
       'Weekly_sales_lag_36', 'Weekly_sales_lag_35', 'Weekly_sales_lag_34',
       'Weekly_sales_lag_33', 'Weekly_sales_lag_32', 'Weekly_sales_lag_31',
       'Weekly_sales_lag_30', 'Weekly_sales_lag_29', 'Weekly_sales_lag_28',
       'Weekly_sales_lag_27', 'Weekly_sales_lag_26', 'Weekly_sales_lag_25',
       'Weekly_sales_lag_24', 'Weekly_sales_lag_23', 'Weekly_sales_lag_22',
       'Weekly_sales_lag_21', '

In [8]:
# Feature scaling
# scaler = StandardScaler()
# features = ['Weekly_Sales', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment', 'lag1', 'lag2', 'lag3', 'lag4']
# data[features] = scaler.fit_transform(data[features])

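# Split the dataset respecting its time-series nature.
# Only the last of the five folds is kept: train_data is the earliest 5/6 of the rows
# and test_data (the most recent 1/6) is not used below.
splitter = TimeSeriesSplit(n_splits=5)
for train_index, test_index in splitter.split(data_preprocessed):
    train_data, test_data = data_preprocessed.iloc[train_index], data_preprocessed.iloc[test_index]

# Chronological 60/20/20 split of train_data into train, validation and test sets
train_end, validate_end = int(.6 * len(train_data)), int(.8 * len(train_data))
train = train_data.iloc[:train_end]
validate = train_data.iloc[train_end:validate_end]
test = train_data.iloc[validate_end:]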

# Separate features and target
X_train = train.drop(['Anomaly'], axis=1)
y_train = train['Anomaly']
X_validate = validate.drop(['Anomaly'], axis=1)
y_validate = validate['Anomaly']
X_test = test.drop(['Anomaly'], axis=1)
y_test = test['Anomaly']

In [9]:
X_train.shape, X_validate.shape, X_test.shape

((225, 66), (75, 66), (75, 66))
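
Since the loop above keeps only the final fold, one option is to score a model on every `TimeSeriesSplit` fold and look at how the metric varies over time. The following is a rough sketch on synthetic data; the features, labels and the choice of recall as the metric are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))        # synthetic features, assumed to be in time order
y = (X[:, 0] > 0.8).astype(int)      # synthetic anomaly labels

fold_recalls = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    # Every training index precedes every test index in each fold
    assert train_idx.max() < test_idx.min()
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    fold_recalls.append(recall_score(y[test_idx], preds, zero_division=0))
```

A large spread in `fold_recalls` would suggest the single 60/20/20 split gives an unstable picture of model quality.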

# Train models:

In [10]:
# Function to evaluate models
def evaluate_model(model, X, y, model_type=None):
    """Return a classification report for the model's predictions on (X, y)."""
    predictions = model.predict(X)
    if model_type == 'isolation_forest':
        # IsolationForest.predict returns -1 for outliers and 1 for inliers;
        # map these to 1 (anomaly) and 0 (normal) to match the 'Anomaly' labels
        predictions = np.where(predictions == -1, 1, 0)
    return classification_report(y, predictions)
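
When models need to be compared programmatically rather than by reading a text report, the anomaly-class metrics can be returned as numbers. This is a small sketch; the helper name `anomaly_metrics` is an assumption, and `zero_division=0` is used so that a model predicting no anomalies scores 0 instead of raising a warning.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

def anomaly_metrics(y_true, y_pred):
    """Precision, recall and F1 for the anomaly class (label 1). Hypothetical helper."""
    return {
        "precision": precision_score(y_true, y_pred, pos_label=1, zero_division=0),
        "recall": recall_score(y_true, y_pred, pos_label=1, zero_division=0),
        "f1": f1_score(y_true, y_pred, pos_label=1, zero_division=0),
    }

# Made-up labels: one true positive, one false positive, two false negatives
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 0, 0]
metrics = anomaly_metrics(y_true, y_pred)
```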

In [11]:
y_validate.value_counts()

Anomaly
0    55
1    20
Name: count, dtype: int64

In [12]:
# Training Isolation Forest
iso_forest = IsolationForest(n_estimators=100, contamination=float(np.mean(y_train)))
iso_forest.fit(X_train)
iso_metrics = evaluate_model(iso_forest, X_validate, y_validate, 'isolation_forest')
print("Isolation Forest Metrics Classification Report:\n", iso_metrics)


Isolation Forest Metrics Classification Report:
               precision    recall  f1-score   support

           0       0.79      0.91      0.85        55
           1       0.58      0.35      0.44        20

    accuracy                           0.76        75
   macro avg       0.69      0.63      0.64        75
weighted avg       0.74      0.76      0.74        75
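
The recall of 0.35 for class 1 reflects the fixed cut-off implied by `contamination`. One way to explore the precision/recall trade-off is to threshold the raw scores from `score_samples` directly. The sketch below uses synthetic points; the helper `flag_lowest` and the chosen fractions are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(200, 2))       # synthetic "normal" points
outliers = rng.normal(6, 0.5, size=(10, 2))    # synthetic far-away points
X = np.vstack([normal, outliers])

iso = IsolationForest(n_estimators=100, random_state=42).fit(X)
scores = iso.score_samples(X)                  # lower score = more anomalous

def flag_lowest(scores, fraction):
    """Flag the given fraction of lowest-scoring rows as anomalies (1)."""
    threshold = np.quantile(scores, fraction)
    return (scores <= threshold).astype(int)

strict = flag_lowest(scores, 0.02)   # fewer flags, typically higher precision
loose = flag_lowest(scores, 0.10)    # more flags, typically higher recall
```

In practice the fraction would be tuned on the validation set rather than fixed in advance.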



In [13]:
# Training Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
rf_metrics = evaluate_model(rf_classifier, X_validate, y_validate)
print("Random Forest Metrics Classification Report:\n", rf_metrics)

Random Forest Metrics Classification Report:
               precision    recall  f1-score   support

           0       0.73      1.00      0.85        55
           1       0.00      0.00      0.00        20

    accuracy                           0.73        75
   macro avg       0.37      0.50      0.42        75
weighted avg       0.54      0.73      0.62        75
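
The report above shows the classifier predicting no anomalies at all on the validation set. Two common adjustments for imbalanced labels are class weighting and a lower probability threshold. The sketch below demonstrates both on synthetic data; the 0.3 threshold and the data are assumptions, and nothing here implies these settings would fix the results above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))                                  # synthetic features
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 1.0).astype(int)   # imbalanced synthetic labels

# class_weight="balanced" reweights classes inversely to their frequency
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
clf.fit(X[:200], y[:200])

# Probability of the anomaly class for the held-out rows
proba = clf.predict_proba(X[200:])[:, 1]
default_flags = (proba >= 0.5).astype(int)   # the usual 0.5 cut-off
lenient_flags = (proba >= 0.3).astype(int)   # lower cut-off trades precision for recall
```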



In [14]:
# Training XGBoost Classifier
# use_label_encoder is deprecated in recent XGBoost releases, so it is omitted here
xgb_classifier = XGBClassifier(eval_metric='logloss')
xgb_classifier.fit(X_train, y_train)
xgb_metrics = evaluate_model(xgb_classifier, X_validate, y_validate)
print("XGBoost Metrics Classification Report:\n", xgb_metrics)

XGBoost Metrics Classification Report:
               precision    recall  f1-score   support

           0       0.74      1.00      0.85        55
           1       1.00      0.05      0.10        20

    accuracy                           0.75        75
   macro avg       0.87      0.53      0.47        75
weighted avg       0.81      0.75      0.65        75
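
Step 5 of the plan (retrain the chosen model on training plus validation data, then evaluate once on the test set) could look roughly like this. The data is synthetic and `RandomForestClassifier` is only a stand-in for whichever model is selected.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))        # synthetic features, assumed to be in time order
y = (X[:, 0] > 0.7).astype(int)      # synthetic anomaly labels

# Chronological 60/20/20 split
X_train, y_train = X[:120], y[:120]
X_validate, y_validate = X[120:160], y[120:160]
X_test, y_test = X[160:], y[160:]

# Retrain on training + validation data, then evaluate once on the untouched test set
final_model = RandomForestClassifier(n_estimators=100, random_state=42)
final_model.fit(np.vstack([X_train, X_validate]), np.concatenate([y_train, y_validate]))
report = classification_report(y_test, final_model.predict(X_test), zero_division=0)
```

Keeping the test set out of every earlier decision is what makes this last report an honest estimate.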

