#### Check for inconsistencies and rare categories in qualitative features


In [23]:
qualitative_summary = {}

for column in qualitative_features:
    # Get unique categories and their counts
    value_counts = df_relevant_features[column].value_counts()
    
    # Identify rare categories
    rare_categories = value_counts[value_counts < 200].index.tolist()  # Threshold can be adjusted
    
    # Store summary
    qualitative_summary[column] = {
        "unique_categories": value_counts.index.tolist(),
        "rare_categories": rare_categories,
        "category_counts": value_counts.to_dict()
    }

# Print summary of qualitative feature checks
print("Qualitative Feature Summary:")
for column, summary in qualitative_summary.items():
    print(f"\nColumn: {column}")
    print(f"Unique Categories: {summary['unique_categories']}")
    print(f"Rare Categories: {summary['rare_categories']}")
    print(f"Category Counts: {summary['category_counts']}")

Qualitative Feature Summary:

Column: Type
Unique Categories: ['DEBIT', 'TRANSFER', 'PAYMENT', 'CASH']
Rare Categories: []
Category Counts: {'DEBIT': 69295, 'TRANSFER': 49883, 'PAYMENT': 41725, 'CASH': 19616}

Column: Delivery Status
Unique Categories: ['Late delivery', 'Advance shipping', 'Shipping on time', 'Shipping canceled']
Rare Categories: []
Category Counts: {'Late delivery': 98977, 'Advance shipping': 41592, 'Shipping on time': 32196, 'Shipping canceled': 7754}

Column: Category Name
Unique Categories: ['Cleats', "Men's Footwear", "Women's Apparel", 'Indoor/Outdoor Games', 'Fishing', 'Water Sports', 'Camping & Hiking', 'Cardio Equipment', 'Shop By Sport', 'Electronics', 'Accessories', 'Golf Balls', "Girls' Apparel", 'Golf Gloves', 'Trade-In', 'Video Games', "Children's Clothing", "Women's Clothing", 'Baseball & Softball', 'Hockey', 'Cameras ', 'Toys', 'Golf Shoes', 'Pet Supplies', 'Garden', 'Crafts', 'DVDs', 'Computers', 'Golf Apparel', 'Hunting & Shooting', 'Music', 'Consumer
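
The trailing space in `'Cameras '` in the output above is the kind of inconsistency a frequency count alone will not surface. A minimal sketch of a check for it follows, assuming category labels are plain strings; the helper name `find_label_issues` and the sample list are hypothetical and only illustrate the idea. It uses the standard library only, so any iterable of labels (for example `value_counts.index`) can be passed in.

```python
from collections import defaultdict


def find_label_issues(labels):
    """Return (labels with stray whitespace, groups that collide once normalized)."""
    # Labels that carry leading or trailing whitespace
    padded = [label for label in labels if label != label.strip()]

    # Group labels that become identical after collapsing whitespace and case
    groups = defaultdict(list)
    for label in labels:
        key = " ".join(label.split()).casefold()
        groups[key].append(label)
    collisions = [group for group in groups.values() if len(group) > 1]
    return padded, collisions


# 'Cameras ', 'Toys' and 'Golf Shoes' appear in the output above;
# 'cameras' is a hypothetical entry added to demonstrate a collision.
sample_labels = ['Cameras ', 'Toys', 'Golf Shoes', 'cameras']
padded, collisions = find_label_issues(sample_labels)
print("Labels with stray whitespace:", padded)
print("Labels that collide after normalization:", collisions)
```

If such labels need cleaning, `df_relevant_features[column].str.strip()` would be one way to normalize them before counting.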

An inspection of the values shows that these are actually related to business cases and are not data entry errors, except for the Customer State values '95758' and '91732'. The records with those values are checked below.


In [25]:
filtered_data = df_relevant_features[df_relevant_features['Customer State'].isin(['95758', '91732'])]

# Display the filtered data
print(filtered_data)

          Type  Benefit per order  Sales per customer   Delivery Status  Late_delivery_risk         Category Name Customer City Customer Country Customer Segment Customer State Department Name        Market Order City Order Country order date (DateOrders)  Order Item Discount  \
35704    DEBIT          66.379997          189.660004     Late delivery                   1  Consumer Electronics            CA          EE. UU.         Consumer          95758      Technology        Europe    Valence       Francia         11/2/2017 18:31            63.220001   
46440  PAYMENT          10.910000           38.959999  Shipping on time                   0           Video Games            CA          EE. UU.        Corporate          95758      Discs Shop  Pacific Asia     Manila     Filipinas        12/10/2017 15:18             0.800000   
82511    DEBIT          59.990002          299.959992  Shipping on time                   0   Children's Clothing            CA          EE. UU.         Consume

With only 3 rows affected, this would not have any significant impact, so the values are left as they are.
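
Should the same check be needed on other data, ZIP-like values in a state column can be flagged automatically rather than spotted by eye. The sketch below assumes that a valid state value is exactly two uppercase letters, which is an assumption about this dataset rather than something confirmed above; the helper name `invalid_state_codes` is hypothetical, and apart from '95758' and '91732' the sample values are illustrative.

```python
import re

# Assumption: a valid state value is exactly two uppercase letters
STATE_CODE = re.compile(r"[A-Z]{2}")


def invalid_state_codes(values):
    """Return the sorted, de-duplicated values that do not look like state codes."""
    return sorted(
        {str(v) for v in values if not STATE_CODE.fullmatch(str(v).strip())}
    )


# '95758' and '91732' come from the check above; the other values are illustrative
sample_states = ['CA', 'PR', '95758', 'NY', '91732', 'CA']
flagged = invalid_state_codes(sample_states)
print(flagged)
```

In the notebook this could be called as `invalid_state_codes(df_relevant_features['Customer State'].unique())`.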

### Forecasting

In [26]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder
import joblib

# Ensure the saved_models directory exists
os.makedirs('saved_models', exist_ok=True)

# Assuming df_treebased_models_features is your features DataFrame
# and df['Sales'] is your target variable

# Convert date columns to datetime objects
df_treebased_models_features['order date (DateOrders)'] = pd.to_datetime(df_treebased_models_features['order date (DateOrders)'])
df_treebased_models_features['shipping date (DateOrders)'] = pd.to_datetime(df_treebased_models_features['shipping date (DateOrders)'])

# Extract features from date columns
df_treebased_models_features['order_year'] = df_treebased_models_features['order date (DateOrders)'].dt.year
df_treebased_models_features['order_month'] = df_treebased_models_features['order date (DateOrders)'].dt.month
df_treebased_models_features['order_day'] = df_treebased_models_features['order date (DateOrders)'].dt.day
df_treebased_models_features['order_weekday'] = df_treebased_models_features['order date (DateOrders)'].dt.weekday

df_treebased_models_features['shipping_year'] = df_treebased_models_features['shipping date (DateOrders)'].dt.year
df_treebased_models_features['shipping_month'] = df_treebased_models_features['shipping date (DateOrders)'].dt.month
df_treebased_models_features['shipping_day'] = df_treebased_models_features['shipping date (DateOrders)'].dt.day
df_treebased_models_features['shipping_weekday'] = df_treebased_models_features['shipping date (DateOrders)'].dt.weekday

# Calculate delivery time as the difference in days between shipping date and order date
df_treebased_models_features['delivery_time_days'] = (df_treebased_models_features['shipping date (DateOrders)'] - df_treebased_models_features['order date (DateOrders)']).dt.days

# Drop original date columns after extracting features
df_treebased_models_features.drop(columns=['order date (DateOrders)', 'shipping date (DateOrders)'], inplace=True)

# Apply label encoding to categorical columns
for col in df_treebased_models_features.select_dtypes(include=['object']).columns:
    df_treebased_models_features[col] = LabelEncoder().fit_transform(df_treebased_models_features[col])

# Define features and target
X = df_treebased_models_features
y = df['Sales']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize models
models = {
    'RandomForest': RandomForestRegressor(random_state=42),
    'XGBoost': XGBRegressor(random_state=42, objective='reg:squarederror'),
    'DecisionTree': DecisionTreeRegressor(random_state=42)
}

# Train and evaluate each model
results = pd.DataFrame(columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2'])

for name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Evaluate the model
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)
    
    # Append results as a new row; assigning via .loc avoids the FutureWarning
    # that pd.concat raises when one of the frames is empty
    results.loc[len(results)] = [name, mae, mse, rmse, r2]
    
    # Print evaluation metrics
    print(f'{name} Evaluation:')
    print(f'Mean Absolute Error (MAE): {mae:.4f}')
    print(f'Mean Squared Error (MSE): {mse:.4f}')
    print(f'Root Mean Squared Error (RMSE): {rmse:.4f}')
    print(f'R-squared (R²): {r2:.4f}\n')
    
    # Save the model to the saved_models directory
    joblib.dump(model, f'saved_models/{name}_model.pkl')

# Display results sorted by R-squared
results = results.sort_values(by='R2', ascending=False)
print(results)




RandomForest Evaluation:
Mean Absolute Error (MAE): 0.0156
Mean Squared Error (MSE): 0.6864
Root Mean Squared Error (RMSE): 0.8285
R-squared (R²): 1.0000

XGBoost Evaluation:
Mean Absolute Error (MAE): 0.1463
Mean Squared Error (MSE): 0.2087
Root Mean Squared Error (RMSE): 0.4568
R-squared (R²): 1.0000

DecisionTree Evaluation:
Mean Absolute Error (MAE): 0.0167
Mean Squared Error (MSE): 6.9588
Root Mean Squared Error (RMSE): 2.6380
R-squared (R²): 0.9996

          Model       MAE       MSE      RMSE        R2
1       XGBoost  0.146284  0.208674  0.456808  0.999988
0  RandomForest  0.015601  0.686434  0.828513  0.999960
2  DecisionTree  0.016689  6.958838  2.637961  0.999599
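
As a sanity check on the table above, R² and MSE are linked through the variance of the test target: for a single target, `r2_score` computes R² = 1 − MSE / Var(y_test), so Var(y_test) = MSE / (1 − R²). The sketch below applies that identity to the reported numbers; the helper name `implied_target_variance` is hypothetical, and because R² is printed to six decimals the results are only approximate.

```python
# Metric values copied from the results table above
reported = {
    "XGBoost": {"mse": 0.208674, "r2": 0.999988},
    "RandomForest": {"mse": 0.686434, "r2": 0.999960},
    "DecisionTree": {"mse": 6.958838, "r2": 0.999599},
}


def implied_target_variance(mse, r2):
    """Rearrange R^2 = 1 - MSE / Var(y) to recover Var(y)."""
    return mse / (1.0 - r2)


implied = {name: implied_target_variance(**m) for name, m in reported.items()}

for name, variance in implied.items():
    print(f"{name}: implied Var(y) ~ {variance:,.0f}, std ~ {variance ** 0.5:.1f}")
```

All three models imply a target variance of roughly 17,000, which is consistent with them being scored on the same test set. It also explains why an RMSE of 2.6 still yields an R² above 0.999: the error is small relative to the spread of `Sales`.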


### **Interpretation of the Results**

The results for the tree-based regression models indicate strong performance overall. Interpreting the metrics for each model:


### Model Evaluations:

#### 1. RandomForest
- **Mean Absolute Error (MAE):** 0.0156. Very low, indicating minimal average deviation from actual values. The lowest of the three models: marginally below DecisionTree (0.0167) and well below XGBoost (0.1463).

- **Mean Squared Error (MSE):** 0.6864. Low in absolute terms, so large prediction errors are rare, but about three times XGBoost's MSE (0.2087), so large errors weigh more heavily here than in XGBoost.

- **Root Mean Squared Error (RMSE):** 0.8285. Reflects a small typical error magnitude. Higher than XGBoost (0.4568) but much better than DecisionTree (2.6380).

- **R-squared (R²):** 1.0000 (0.99996 before rounding). Excellent. Shows that the model explains almost all the variability in the data. Matches XGBoost at four decimal places and is better than DecisionTree.

#### 2. XGBoost
- **Mean Absolute Error (MAE):** 0.1463. Roughly nine times the MAE of RandomForest and DecisionTree, so its typical prediction is further off than theirs, although the error is still small in absolute terms.

- **Mean Squared Error (MSE):** 0.2087. Excellent. Indicates very few large errors. Lowest among the models.

- **Root Mean Squared Error (RMSE):** 0.4568. Excellent. The smallest of the three models, indicating that XGBoost is the best at avoiding large errors.

- **R-squared (R²):** 1.0000 (0.999988 before rounding). Excellent. Indicates a near-perfect fit. Matches RandomForest at four decimal places and surpasses DecisionTree.

#### 3. DecisionTree
- **Mean Absolute Error (MAE):** 0.0167. Low MAE suggests minimal deviation from actual values. Similar to RandomForest and better than XGBoost.

- **Mean Squared Error (MSE):** 6.9588. Poor relative to the other models. Combined with the low MAE, this points to a small number of very large errors rather than frequent ones. Much higher than both XGBoost and RandomForest.

- **Root Mean Squared Error (RMSE):** 2.6380. Poor relative to the other models. Highest among the three, driven by the same large errors.

- **R-squared (R²):** 0.9996. Still a strong fit, but slightly lower than RandomForest and XGBoost.
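
A model can have a low MAE and a high RMSE at the same time, because squaring gives a few large misses far more weight than many small ones. The sketch below illustrates this with synthetic error vectors; the numbers are made up for illustration and are not taken from the models above.

```python
import math


def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Synthetic residuals (illustrative only, not model output):
# 'spread' is slightly off on every prediction,
# 'spiky' is exact on 999 predictions and badly wrong on one.
spread = [0.02] * 1000
spiky = [0.0] * 999 + [20.0]

print(f"spread: MAE={mae(spread):.4f}  RMSE={rmse(spread):.4f}")
print(f"spiky:  MAE={mae(spiky):.4f}  RMSE={rmse(spiky):.4f}")
```

Both vectors share the same MAE, yet the single large miss makes the second RMSE many times larger. The DecisionTree figures (MAE 0.0167, RMSE 2.6380) show the same pattern.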

### Summary:
- **XGBoost**: Stands out as the best model due to its lowest MSE and RMSE values. It is the most effective at minimizing large errors, although its MAE is roughly nine times that of the other two models.
- **RandomForest**: Offers strong performance, with the lowest MAE and an R² that rounds to 1.0000, making it a reliable choice.
- **DecisionTree**: While it has a good MAE, its much higher MSE and RMSE suggest occasional large errors, making it less reliable for precise predictions than the other models.

Overall, **XGBoost** is the top-performing model for this task, with **RandomForest** also being a strong contender, while **DecisionTree** lags in precision and consistency.