- Overview
- Architecture & Data Flow
- Core Components
- Technical Implementation
- Features & Capabilities
- Installation & Setup
- Usage Guide
- API Integration
- Model Performance
- Troubleshooting
This is a comprehensive sales forecasting application built with Streamlit that leverages machine learning (LightGBM) and AI (Google Gemini) to provide accurate sales predictions. The application can handle various data formats, automatically map columns, engineer features, and generate forecasts for different business scenarios.
- Universal Data Handling: Works with any CSV format through intelligent column mapping
- AI-Powered Automation: Uses Google Gemini for column mapping, feature selection, and hyperparameter optimization
- Multiple Model Approaches: Compares original target, log-transformed, and AI-pruned models
- Interactive Predictions: Supports various prediction modes (items, stores, combinations)
- Top-Selling Analysis: Identifies best-performing products for strategic planning
- Advanced Hyperparameter Optimization: Data-driven parameter tuning with domain-specific guidance
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   CSV Upload    │────▶│ Column Mapping  │────▶│ Data Processing │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                         │
                                                         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Predictions   │◀────│ Model Training  │◀────│  Feature Eng.   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
Raw CSV → Column Detection → Mapping → Timeseries Conversion → Encoding
Steps:
- File Upload: User uploads a CSV file through the Streamlit interface
- Column Detection:
  - Gemini AI analyzes sample data for intelligent mapping
  - Falls back to smart keyword-based mapping if AI fails
- Data Standardization: Converts to the standard format (date, store, item, sales)
- Encoding: Converts categorical variables to numeric codes with mapping dictionaries (see the sketch after this list)
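A minimal sketch of the encoding step, assuming pandas factorization with a kept reverse-mapping dictionary (the helper name is illustrative, not the app's actual function):

```python
import pandas as pd

def encode_column(df: pd.DataFrame, col: str):
    """Map each category to a numeric code, keeping the reverse mapping for display."""
    codes, uniques = pd.factorize(df[col].fillna('unknown'))
    df[col + '_encoded'] = codes + 1  # 1-based codes, matching the output structure later in this section
    mapping = {code + 1: value for code, value in enumerate(uniques)}
    return df, mapping
```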
Standardized Data → Lag Features → Rolling Features → Calendar Features → Original Features
Feature Categories:
- Lag Features: Previous sales values (1, 2, 3, 7, 14, 30 days)
- Rolling Features: Moving averages, std, min, max (3, 7, 14, 30 windows)
- Calendar Features: Year, month, day, dayofweek, seasonal indicators
- Trigonometric Features: Cyclical encoding for time patterns
- Expanding Features: Cumulative statistics
- Original Features: Preserved from input data (numeric/categorical)
Feature Set → Multiple Models → Performance Comparison → Best Model Selection
Model Variants:
- Original Target: Direct sales prediction
- Log-Transformed: Fits the model on a log-transformed target to handle skewed sales distributions (see the sketch below)
- AI-Pruned: Gemini-selected optimal feature subset
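As an illustration of the log-transformed variant, a common pattern is to fit on log1p(sales) and invert with expm1 at prediction time (a sketch under that assumption; the app's exact transform is not shown here):

```python
import lightgbm as lgb
import numpy as np

def train_log_transformed(X_train, y_train):
    """Fit on log1p(target) to tame right-skewed sales distributions."""
    model = lgb.LGBMRegressor()
    model.fit(X_train, np.log1p(y_train))
    return model

def predict_original_scale(model, X):
    # Invert the log1p transform back to sales units
    return np.expm1(model.predict(X))
```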
Best Model → Recursive Forecasting → Multi-step Predictions → Results Display
def ask_gemini_for_column_mapping_from_sample(df_sample, api_key):
    """Send sample data to Gemini with a column-mapping prompt.
    Returns a mapping dictionary: {'date': 'Date', 'store': 'Store', ...}
    """

def smart_column_mapping(df):
    """Keyword-based column detection that handles common naming conventions.
    Returns a standardized mapping.
    """
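For illustration, keyword-based detection could look like the following sketch (the keyword lists here are assumptions, not the app's exact ones):

```python
def smart_column_mapping_sketch(df):
    """Guess standard roles from common column-name keywords."""
    keywords = {
        'date': ('date', 'day', 'time', 'timestamp'),
        'store': ('store', 'shop', 'location', 'branch'),
        'item': ('item', 'product', 'sku'),
        'sales': ('sales', 'qty', 'quantity', 'amount', 'revenue'),
    }
    mapping = {}
    for role, words in keywords.items():
        for col in df.columns:
            if any(w in col.lower() for w in words):
                mapping[role] = col
                break
    return mapping
```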
Process:
- Store Handling: Creates numeric codes for stores, handles missing values
- Item Handling: Creates numeric codes for items, handles missing values
- Date Processing: Converts to datetime, creates default range if missing
- Sales Calculation: Handles direct columns or formulas (e.g., "Quantity * Price"; see the sketch after this list)
- Feature Preservation: Keeps all original columns as additional features
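One way a formula such as "Quantity * Price" could be evaluated is with pandas' eval; a sketch (the app's actual formula parsing may differ):

```python
import pandas as pd

def compute_sales(df: pd.DataFrame, formula: str = 'Quantity * Price') -> pd.Series:
    # pandas.DataFrame.eval evaluates simple arithmetic over column names
    return df.eval(formula)

df = pd.DataFrame({'Quantity': [2, 5], 'Price': [3.0, 1.5]})
print(compute_sales(df).tolist())  # [6.0, 7.5]
```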
Output Structure:
{
    'store': [1, 2, 3, ...],                  # Numeric store codes
    'item': [1, 2, 3, ...],                   # Numeric item codes
    'date': [datetime objects],               # Standardized dates
    'sales': [float values],                  # Sales quantities
    'orig_price': [float values],             # Preserved original features
    'orig_category_encoded': [1, 2, 3, ...]   # Encoded categorical features
}
Generated Features (see the pandas sketch after this list):

Lag Features: sales_lag_1, sales_lag_2, sales_lag_3, sales_lag_7, sales_lag_14, sales_lag_30
- Purpose: Previous sales values for trend analysis

Rolling Features: sales_rolling_mean_3/7/14/30, sales_rolling_std_3/7/14/30, sales_rolling_min_3/7/14/30, sales_rolling_max_3/7/14/30
- Purpose: Moving-window statistics for pattern recognition

Calendar Features: year, month, day, dayofweek, quarter, is_weekend, is_month_start, is_month_end, is_summer, is_winter, is_holiday_season
- Purpose: Time-based indicators for seasonal patterns

Trigonometric Features: month_sin, month_cos, dayofweek_sin, dayofweek_cos
- Purpose: Cyclical time encoding for smooth seasonal transitions

Expanding Features: sales_expanding_mean, sales_price_ratio
- Purpose: Cumulative statistics and price relationships
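A sketch of how the lag, rolling, and cyclical features above could be generated with pandas (column names follow the conventions in this list; only a subset of windows is shown):

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Generate lag, rolling, calendar, and cyclical features per (store, item) series."""
    df = df.sort_values(['store', 'item', 'date']).copy()
    by_series = df.groupby(['store', 'item'])['sales']

    # Lag features: previous sales values
    for lag in (1, 2, 3, 7, 14, 30):
        df[f'sales_lag_{lag}'] = by_series.shift(lag)

    # Rolling features: shift(1) first so the current day never leaks into its own window
    for window in (3, 7, 14, 30):
        df[f'sales_rolling_mean_{window}'] = by_series.transform(
            lambda s, w=window: s.shift(1).rolling(w).mean()
        )

    # Calendar features
    df['month'] = df['date'].dt.month
    df['dayofweek'] = df['date'].dt.dayofweek
    df['is_weekend'] = (df['dayofweek'] >= 5).astype(int)

    # Trigonometric (cyclical) encodings
    df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
    df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
    return df
```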
# Data-driven hyperparameter range suggestion
param_ranges = gemini_suggest_lgbm_param_ranges(df_sample, features, api_key)
# Optuna performs Bayesian optimization
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=30)
Enhanced Hyperparameter Guidance:
The system now provides data-driven hyperparameter optimization with:
- Dataset Size Analysis: Adapts parameters based on small/medium/large datasets
- Sales Statistics Analysis: Considers mean, std, range, coefficient of variation
- Sparsity Detection: Adjusts regularization based on zero sales ratio
- Domain-Specific Guidance: Sales forecasting specific parameter recommendations
- Computational Efficiency: Balances accuracy with training time
Parameter Categories:
- Tree Complexity: num_leaves, max_depth (scaled by dataset size)
- Learning Control: learning_rate, n_estimators (based on data variance)
- Regularization: min_data_in_leaf, lambda_l1, lambda_l2 (based on sparsity)
- Sampling: feature_fraction, bagging_fraction (based on feature count)
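A minimal sketch of how such an Optuna objective could be wired to the AI-suggested ranges (the param_ranges dict of (low, high) tuples and the chronological split are assumptions for illustration):

```python
import lightgbm as lgb
import numpy as np
import optuna
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def make_objective(X, y, param_ranges):
    """Build an Optuna objective that samples within the AI-suggested ranges."""
    # Chronological split: no shuffling for time series
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, shuffle=False)

    def objective(trial):
        params = {
            'num_leaves': trial.suggest_int('num_leaves', *param_ranges['num_leaves']),
            'max_depth': trial.suggest_int('max_depth', *param_ranges['max_depth']),
            'learning_rate': trial.suggest_float('learning_rate', *param_ranges['learning_rate'], log=True),
            'lambda_l1': trial.suggest_float('lambda_l1', *param_ranges['lambda_l1']),
            'lambda_l2': trial.suggest_float('lambda_l2', *param_ranges['lambda_l2']),
        }
        model = lgb.LGBMRegressor(**params)
        model.fit(X_tr, y_tr)
        return np.sqrt(mean_squared_error(y_val, model.predict(X_val)))  # minimize RMSE

    return objective
```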
# Train three model variants
models = {
    'Original': train_original_target(),
    'Log-Transformed': train_log_transformed(),
    'AI-Pruned': train_ai_pruned()
}

# Select best based on R² score
best_model = max(models, key=lambda x: models[x]['r2_score'])
Algorithm:
- Initial State: Use the last available observation
- Iterative Prediction: For each forecast step:
  - Predict the next value using current features
  - Update lag features with the new prediction
  - Update rolling features with the new window
  - Update expanding features with cumulative data
- Return: List of predictions for the requested horizon (see the sketch below)
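A simplified sketch of this recursive loop (feature names follow the conventions above; the app's actual recursive_forecast() also refreshes calendar, rolling, and expanding features):

```python
import numpy as np

def recursive_forecast_sketch(model, last_features, history, horizon, feature_order):
    """Illustrative multi-step forecast: each prediction is fed back as a new lag.

    last_features: dict of feature name -> value for the most recent observation
    history:       list of recent sales values, newest last
    feature_order: column order the model was trained with
    """
    preds = []
    features = dict(last_features)
    for _ in range(horizon):
        X = np.array([[features[f] for f in feature_order]])
        y_hat = float(model.predict(X)[0])
        preds.append(y_hat)

        # Feed the prediction back: update lag features from the growing history
        history.append(y_hat)
        for lag in (1, 2, 3, 7, 14, 30):
            if len(history) >= lag:
                features[f'sales_lag_{lag}'] = history[-lag]

        # Refresh one rolling statistic from the updated window
        features['sales_rolling_mean_7'] = float(np.mean(history[-7:]))
    return preds
```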
# Core ML & Data Processing
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.feature_selection import RFE
from sklearn.model_selection import ParameterSampler
# AI Integration
import google.generativeai as genai
# Optimization
import optuna # Optional: for hyperparameter optimization
# Web Framework
import streamlit as st
- robust_parse_dates(): Handles various date formats with error recovery
- convert_to_timeseries(): Main data conversion pipeline
- manual_feature_engineering(): Comprehensive feature generation

- ask_gemini_for_column_mapping_from_sample(): Intelligent column mapping
- gemini_suggest_lgbm_param_ranges(): Enhanced data-driven hyperparameter optimization
- gemini_select_best_features(): Feature importance analysis and selection

- get_default_lgbm_params(): Default LightGBM configuration
- recursive_forecast(): Multi-step prediction algorithm
- save_predictions_to_csv(): Export functionality
# AI failure fallbacks (pseudocode)
if gemini_mapping_fails:
    use_smart_column_mapping()
if gemini_params_fail:
    use_default_parameters()
if optuna_not_available:
    use_grid_search_or_defaults()
# Handle missing/invalid data
df = df.replace([np.inf, -np.inf], np.nan)
df = df.fillna(0) # or appropriate defaults
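In practice these fallbacks are naturally expressed as try/except blocks; a hedged sketch using the function names above:

```python
def map_columns(df, api_key=None):
    """Try AI-based mapping first; fall back to keyword heuristics on any failure."""
    if api_key:
        try:
            sample = df.head(20)
            return ask_gemini_for_column_mapping_from_sample(sample, api_key)
        except Exception:
            pass  # network error, quota exhaustion, malformed response, ...
    return smart_column_mapping(df)
```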
- Flexible Column Mapping: Works with any CSV structure
- Automatic Type Detection: Handles numeric, categorical, and datetime data
- Missing Value Handling: Robust imputation strategies
- Formula Support: Can calculate sales from multiple columns (e.g., "Quantity * Price")
- Intelligent Column Mapping: Gemini AI analyzes data structure
- Feature Selection: AI-driven feature importance analysis
- Advanced Hyperparameter Optimization: Data-driven parameter tuning with domain-specific guidance
- Business Logic Integration: AI considers domain knowledge
- Item-Specific: Forecast for selected products
- Store-Specific: Forecast for selected locations
- Combination: Item-store specific predictions
- Random Sampling: Quick insights with random items
- Top-Selling Analysis: Strategic product ranking
- Model Comparison: Multiple approaches with performance metrics
- Feature Importance: Understanding what drives predictions
- Error Analysis: Prediction accuracy assessment
- Trend Analysis: Seasonal and cyclical pattern detection
# Python 3.8+ required
python --version
# Install required packages
pip install streamlit pandas numpy lightgbm scikit-learn google-generativeai
# Optional: For advanced hyperparameter optimization
pip install optuna
# Clone repository
git clone <repository-url>
cd sales-forecasting-app
# Install dependencies
pip install -r requirements.txt
# Set up API key (optional)
export GEMINI_API_KEY="your-api-key-here"
# Start Streamlit app
streamlit run app2.py
# Access at http://localhost:8501
- Format: CSV file with sales data
- Required Columns: At least one of date, store, item, sales
- Data Quality: Clean data preferred, but app handles missing values
- Gemini API Key: Enable AI features (optional but recommended)
- Feature Toggles: Control which AI features to use
- Model Settings: Adjust training parameters
- Upload Data: Select CSV file
- Review Mapping: Check column mapping results
- Monitor Training: Watch model training progress
- Analyze Results: Review model performance
- Generate Predictions: Use interactive prediction modes
- Export Results: Download predictions as CSV
Item-Specific:
- Select one or more items
- Choose the forecast horizon (1-30 days)
- Get detailed predictions with confidence intervals

Store-Specific:
- Select stores to analyze
- View all items in the selected stores
- Compare performance across locations

Top-Selling Analysis:
- Forecast all items for 2 months
- Rank by total predicted sales
- Strategic planning insights
# Configuration
MODEL_NAME = "gemini-1.5-flash"
MAX_RETRIES = 3
RETRY_DELAY = 2
# Usage
genai.configure(api_key=api_key)
model = genai.GenerativeModel(MODEL_NAME)
response = model.generate_content(prompt)
- Security: API key stored securely in Streamlit
- Fallback: Graceful degradation when API unavailable
- Rate Limiting: Built-in retry logic with delays (sketched below)
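A minimal retry wrapper consistent with the MAX_RETRIES and RETRY_DELAY settings above (a sketch, not necessarily the app's exact logic):

```python
import time

def generate_with_retries(model, prompt, max_retries=3, retry_delay=2):
    """Call Gemini with simple retry and backoff on transient failures."""
    for attempt in range(1, max_retries + 1):
        try:
            return model.generate_content(prompt)
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(retry_delay * attempt)  # linear backoff between attempts
```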
- MSE (Mean Squared Error): Overall prediction accuracy
- RMSE (Root Mean Squared Error): Error in original units
- R² Score: Model fit quality (0-1, higher is better)
# Compare multiple approaches
model_scores = {
    'Original': calculate_metrics(original_model),
    'Log-Transformed': calculate_metrics(log_model),
    'AI-Pruned': calculate_metrics(ai_pruned_model)
}

# Select best based on R² score
best_model = max(model_scores, key=lambda x: model_scores[x]['r2'])
- Early Stopping: Prevents overfitting
- Cross-Validation: Robust performance estimation
- Feature Selection: Reduces noise and improves speed
- Advanced Hyperparameter Tuning: Data-driven optimization with domain-specific guidance
Symptoms: Incorrect column identification
Solutions:
- Check data format and column names
- Use manual mapping if AI fails
- Verify data quality and completeness
Symptoms: Poor performance or training errors
Solutions:
- Increase training data size
- Adjust hyperparameters
- Check for data quality issues
- Try different model variants
Symptoms: Invalid or unrealistic predictions
Solutions:
- Verify feature engineering
- Check for data leakage
- Validate input data format
- Review model performance metrics
Symptoms: Gemini features not working
Solutions:
- Verify API key validity
- Check internet connectivity
- Review API usage limits
- Use fallback methods
# Enable debug information
st.write("## DEBUG: All Available Features")
st.write(f"**Total Features:** {len(all_feature_cols)}")

# Show detailed feature breakdown
with st.expander("View ALL Features (Click to expand)"):
    for i, feature in enumerate(all_feature_cols, 1):
        st.write(f"{i:2d}. {feature}")
- Training Time: Monitor model training duration
- Memory Usage: Track resource consumption
- Prediction Speed: Measure inference time
- Accuracy Metrics: Regular performance assessment
- Real-time Data Integration: Connect to live data sources
- Advanced Visualization: Interactive charts and dashboards
- Ensemble Methods: Combine multiple model predictions
- Anomaly Detection: Identify unusual sales patterns
- Automated Reporting: Generate business intelligence reports
- Model Persistence: Save and load trained models
- Batch Processing: Handle larger datasets efficiently
- API Endpoints: RESTful API for external integration
- Cloud Deployment: Scalable cloud infrastructure
- Real-time Predictions: Streaming prediction capabilities
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
For support and questions, please open an issue in the repository or contact the development team.
This project uses a .env file to store sensitive information such as the Gemini API key. Do not commit your .env file to version control.

- Create a file named .env in the project root.
- Add your Gemini API key: GEMINI_API_KEY=your_actual_key_here
- The application will automatically load this key at runtime.
- The .env file is included in .gitignore and will not be committed to GitHub.
- Each user should provide their own API key in their local .env file.