# Task
Generate a synthetic dataset with at least 5000 samples for predicting 'Next_Hour_Load_MW' based on 'Time', 'Current_Load_MW', 'Temperature_C', 'Area_Type', and 'Consumers', ensuring specified logical relationships. Then, prepare the data, train and evaluate Linear Regression, Random Forest Regressor, and XGBoost Regressor models, select the best one, and implement a prediction function that calculates 'Utilization' and 'Risk' based on new inputs. Finally, visualize model performance, predictions vs. actuals, and feature importance, and provide a summary of the model development with an example prediction.

## Generate Synthetic Dataset

### Subtask:
Generate a synthetic dataset with at least 5000 samples, including columns 'Time', 'Current_Load_MW', 'Temperature_C', 'Area_Type', 'Consumers', and 'Next_Hour_Load_MW'. Ensure the data reflects the specified logical relationships: Commercial and Industrial areas have higher load, higher temperature slightly increases load, and evening hours have peak demand.


**Reasoning**:
The first step is to import the necessary libraries as instructed, which include `pandas` for data manipulation, `numpy` for numerical operations, `random` for specific random choices, and `datetime` for time-related features.



In [None]:
import pandas as pd
import numpy as np
import random
import datetime

print("Libraries imported successfully.")

Libraries imported successfully.


**Reasoning**:
The next step is to generate the synthetic dataset with at least 5000 samples, incorporating all specified columns and logical relationships. This involves defining sample size, generating values for 'Time', 'Current_Load_MW', 'Temperature_C', 'Area_Type', and 'Consumers', and then calculating 'Next_Hour_Load_MW' based on the given rules. Finally, all this data will be assembled into a pandas DataFrame.



In [None]:
n_samples = 5000

# Categories for the 'Area_Type' feature
area_types = ['Residential', 'Commercial', 'Industrial']

# Initialize lists to store generated data
data = {
    'Time': [],
    'Current_Load_MW': [],
    'Temperature_C': [],
    'Area_Type': [],
    'Consumers': [],
    'Next_Hour_Load_MW': []
}

# Generate synthetic data
for _ in range(n_samples):
    # a. 'Time': Generate hourly timestamps (0-23)
    hour = random.randint(0, 23)
    data['Time'].append(hour)

    # b. 'Current_Load_MW': Base load with variation
    current_load = random.uniform(500, 2000) # Base load range
    data['Current_Load_MW'].append(current_load)

    # c. 'Temperature_C': Realistic temperature values
    temperature = random.uniform(5, 35) # Realistic temperature range
    data['Temperature_C'].append(temperature)

    # d. 'Area_Type': Randomly select area type
    area_type = random.choice(area_types)
    data['Area_Type'].append(area_type)

    # e. 'Consumers': Generate consumers based on area type
    if area_type == 'Residential':
        consumers = random.randint(1000, 5000)
    elif area_type == 'Commercial':
        consumers = random.randint(3000, 10000)
    else: # Industrial
        consumers = random.randint(5000, 15000)
    data['Consumers'].append(consumers)

    # Calculate 'Next_Hour_Load_MW' from the logical relationships
    # Baseline: current load scaled by a random factor of ±10%
    next_hour_load = current_load * random.uniform(0.9, 1.1)

    # Area_Type influence (Commercial and Industrial have higher load)
    if area_type == 'Commercial':
        next_hour_load += random.uniform(100, 300)
    elif area_type == 'Industrial':
        next_hour_load += random.uniform(200, 500)

    # Temperature_C influence (higher temperature slightly increases load)
    next_hour_load += temperature * random.uniform(2, 5) # Positive coefficient

    # Time influence (evening hours have peak demand)
    if 18 <= hour <= 22: # Evening peak
        next_hour_load += random.uniform(150, 400)
    elif 6 <= hour <= 9: # Morning peak
        next_hour_load += random.uniform(50, 200)
    elif 0 <= hour <= 5: # Low-demand night hours: scale down by up to 20%
        next_hour_load *= random.uniform(0.8, 1.0)

    # Consumers influence (more consumers, more load)
    next_hour_load += consumers * random.uniform(0.05, 0.15)

    # Add random noise
    next_hour_load += random.uniform(-50, 50)

    # Floor the load at 50 MW so it can never be negative or near zero
    next_hour_load = max(50, next_hour_load)

    data['Next_Hour_Load_MW'].append(next_hour_load)

# Assemble into a Pandas DataFrame
df_synthetic = pd.DataFrame(data)

print(f"Synthetic dataset created with {len(df_synthetic)} samples.")
print(df_synthetic.head())


Synthetic dataset created with 5000 samples.
   Time  Current_Load_MW  Temperature_C    Area_Type  Consumers  \
0    20      1070.289162      16.268644  Residential       4827   
1    11       783.853269      11.229392   Commercial       6089   
2    21      1351.323380      10.886659  Residential       3333   
3    20      1524.933377       6.147333   Industrial       6702   
4    16      1185.600150      33.883281   Industrial       8719   

   Next_Hour_Load_MW  
0        2145.826962  
1        1856.835088  
2        1918.915530  
3        3138.209283  
4        2641.733058  


**Reasoning**:
The synthetic dataset has been generated. To ensure its quality and characteristics align with the requirements, it's crucial to inspect its structure and descriptive statistics. This will help verify data types, check for any anomalies, and confirm that the generated values broadly follow the intended distributions and relationships.



In [None]:
print('DataFrame Info:')
df_synthetic.info()
print('\nDataFrame Descriptive Statistics:')
df_synthetic.describe(include='all')

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Time               5000 non-null   int64  
 1   Current_Load_MW    5000 non-null   float64
 2   Temperature_C      5000 non-null   float64
 3   Area_Type          5000 non-null   object 
 4   Consumers          5000 non-null   int64  
 5   Next_Hour_Load_MW  5000 non-null   float64
dtypes: float64(3), int64(2), object(1)
memory usage: 234.5+ KB

DataFrame Descriptive Statistics:


| | Time | Current_Load_MW | Temperature_C | Area_Type | Consumers | Next_Hour_Load_MW |
|---|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000 | 5000.0 | 5000.0 |
| unique | | | | 3 | | |
| top | | | | Residential | | |
| freq | | | | 1694 | | |
| mean | 11.363 | 1253.969911 | 20.056235 | | 6482.3612 | 2196.333988 |
| std | 6.974823 | 435.889717 | 8.625976 | | 3558.597155 | 690.736398 |
| min | 0.0 | 500.089487 | 5.006003 | | 1003.0 | 607.507543 |
| 25% | 5.0 | 872.662518 | 12.537859 | | 3618.5 | 1701.221101 |
| 50% | 11.0 | 1253.202975 | 19.954888 | | 5886.0 | 2159.220173 |
| 75% | 17.0 | 1639.676026 | 27.563705 | | 8966.0 | 2650.984339 |
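
Descriptive statistics alone do not show whether the intended relationships made it into the data. One way to check, sketched here under the assumption that group means and a simple correlation are enough evidence, is to compare the mean next-hour load by area type and by evening versus other hours. The helper name `summarize_relationships` and the toy frame are made up for illustration; in the notebook the function would be called on `df_synthetic`.

```python
import pandas as pd


def summarize_relationships(df):
    """Return mean load by area type, mean load by period, and temperature correlation."""
    by_area = df.groupby('Area_Type')['Next_Hour_Load_MW'].mean()

    # Evening peak window matches the generator: hours 18 through 22 inclusive
    period = df['Time'].between(18, 22).map({True: 'evening', False: 'other'})
    by_period = df.groupby(period)['Next_Hour_Load_MW'].mean()

    temp_corr = df['Temperature_C'].corr(df['Next_Hour_Load_MW'])
    return by_area, by_period, temp_corr


# Toy data with made-up values, only to show the call
toy = pd.DataFrame({
    'Time': [3, 12, 19, 20, 8, 21],
    'Temperature_C': [10, 20, 30, 15, 25, 35],
    'Area_Type': ['Residential', 'Commercial', 'Industrial',
                  'Residential', 'Commercial', 'Industrial'],
    'Next_Hour_Load_MW': [900, 1500, 2600, 1400, 1700, 2800],
})

by_area, by_period, temp_corr = summarize_relationships(toy)
print(by_area)
print(by_period)
print(f"Temperature/load correlation: {temp_corr:.2f}")
```

If the generation rules hold, Commercial and Industrial means should exceed the Residential mean, the evening mean should exceed the mean for other hours, and the temperature correlation should be positive (though small, since the temperature term is minor relative to the other terms).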


## Prepare Data for Modeling

### Subtask:
Split the generated dataset into training and testing sets (80% train, 20% test) and encode categorical features like 'Area_Type', preparing the data for model training.


**Reasoning**:
This subtask splits the data into training and testing sets and encodes the categorical feature. The cell below imports the required classes, defines the features and target, performs an 80/20 split, one-hot encodes 'Area_Type', and reassembles the processed data into new DataFrames. The encoder is fit on the training set only and then applied to the test set, so nothing about the test set leaks into preprocessing.



In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

print("sklearn.model_selection.train_test_split and sklearn.preprocessing.OneHotEncoder imported successfully.")

# Define features (X) and target (y)
X = df_synthetic.drop('Next_Hour_Load_MW', axis=1)
y = df_synthetic['Next_Hour_Load_MW']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Data split into training and testing sets: X_train shape {X_train.shape}, X_test shape {X_test.shape}")

# Identify the categorical features
categorical_features = ['Area_Type']

# Initialize OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Fit the encoder to the categorical features in X_train and transform both X_train and X_test
X_train_categorical_encoded = encoder.fit_transform(X_train[categorical_features])
X_test_categorical_encoded = encoder.transform(X_test[categorical_features])

# Create DataFrames for the one-hot encoded features
encoded_feature_names = encoder.get_feature_names_out(categorical_features)
X_train_encoded_df = pd.DataFrame(X_train_categorical_encoded, columns=encoded_feature_names, index=X_train.index)
X_test_encoded_df = pd.DataFrame(X_test_categorical_encoded, columns=encoded_feature_names, index=X_test.index)

# Drop the original 'Area_Type' column from X_train and X_test
X_train_numerical = X_train.drop(categorical_features, axis=1)
X_test_numerical = X_test.drop(categorical_features, axis=1)

# Concatenate the numerical features with their respective one-hot encoded DataFrames
X_train_processed = pd.concat([X_train_numerical, X_train_encoded_df], axis=1)
X_test_processed = pd.concat([X_test_numerical, X_test_encoded_df], axis=1)

print(f"X_train_processed shape: {X_train_processed.shape}")
print(f"X_test_processed shape: {X_test_processed.shape}")
print("First 5 rows of X_train_processed:")
print(X_train_processed.head())

sklearn.model_selection.train_test_split and sklearn.preprocessing.OneHotEncoder imported successfully.
Data split into training and testing sets: X_train shape (4000, 5), X_test shape (1000, 5)
X_train_processed shape: (4000, 7)
X_test_processed shape: (1000, 7)
First 5 rows of X_train_processed:
      Time  Current_Load_MW  Temperature_C  Consumers  Area_Type_Commercial  \
4227     7      1319.530865       7.833202       9739                   1.0   
4676    14       732.264196      27.934641       4327                   1.0   
800      4      1341.173973      11.249824       4616                   0.0   
3671    20       668.614528      10.843448       7863                   0.0   
4193     2      1688.920746      16.506578       4611                   1.0   

      Area_Type_Industrial  Area_Type_Residential  
4227                   0.0                    0.0  
4676                   0.0                    0.0  
800                    0.0                    1.0  
3671              
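
The manual encode, drop, and concatenate steps above can also be bundled into a single object. The following is a sketch of an alternative, not the notebook's method: it assumes scikit-learn's `ColumnTransformer` and `Pipeline` are acceptable here, and the helper name `build_pipeline` and the toy values are invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def build_pipeline(categorical_features):
    """One-hot encode the given columns, pass the rest through, then fit a model."""
    preprocess = ColumnTransformer(
        transformers=[
            ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_features),
        ],
        remainder='passthrough',   # keep numeric columns unchanged
        sparse_threshold=0.0,      # always return a dense array
    )
    return Pipeline([('preprocess', preprocess), ('model', LinearRegression())])


# Toy rows with made-up values, only to show the call
X_toy = pd.DataFrame({
    'Time': [7, 14, 4, 20, 2, 18, 11, 22],
    'Current_Load_MW': [1300.0, 730.0, 1340.0, 670.0, 1690.0, 900.0, 1100.0, 1500.0],
    'Temperature_C': [8.0, 28.0, 11.0, 11.0, 16.5, 30.0, 22.0, 12.0],
    'Area_Type': ['Commercial', 'Commercial', 'Residential', 'Industrial',
                  'Commercial', 'Residential', 'Industrial', 'Residential'],
    'Consumers': [9700, 4300, 4600, 7800, 4600, 2000, 12000, 3500],
})
y_toy = pd.Series([2600.0, 1500.0, 1700.0, 2100.0, 2300.0, 1600.0, 2900.0, 2300.0])

pipe = build_pipeline(['Area_Type'])
pipe.fit(X_toy, y_toy)

# An unseen category is encoded as all zeros because handle_unknown='ignore'
new_row = pd.DataFrame({
    'Time': [19], 'Current_Load_MW': [1200.0], 'Temperature_C': [25.0],
    'Area_Type': ['Rural'], 'Consumers': [5000],
})
print(pipe.predict(new_row))
```

A design note: because the encoder lives inside the pipeline, raw rows with the original five columns can be passed straight to `predict`, which avoids repeating the encoding steps by hand for new inputs.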

## Train and Evaluate Models

### Subtask:
Train Linear Regression, Random Forest Regressor, and XGBoost Regressor models on the training data. Evaluate each model's performance on the test set using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R² score.


**Reasoning**:
The first step is to import all the necessary classes and functions for model training and evaluation, including `LinearRegression`, `RandomForestRegressor`, `XGBRegressor`, and evaluation metrics like `mean_absolute_error`, `mean_squared_error`, and `r2_score`. This sets up the environment for the subsequent model training and assessment.



In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print("LinearRegression, RandomForestRegressor, XGBRegressor, and evaluation metrics imported successfully.")

LinearRegression, RandomForestRegressor, XGBRegressor, and evaluation metrics imported successfully.


**Reasoning**:
Now that the necessary libraries are imported, the next step is to instantiate, train, and evaluate each of the three models (Linear Regression, Random Forest Regressor, and XGBoost Regressor) as per the instructions. This involves fitting the models to the training data, making predictions on the test data, and then calculating and storing their respective MAE, RMSE, and R² scores.



In [None]:
model_performance = {}

# Instantiate and train a Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train_processed, y_train)

# Make predictions on X_test_processed using the trained Linear Regression model
y_pred_lr = lr_model.predict(X_test_processed)

# Calculate MAE, RMSE, and R² scores for the Linear Regression model
mae_lr = mean_absolute_error(y_test, y_pred_lr)
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
r2_lr = r2_score(y_test, y_pred_lr)

print("\n--- Linear Regression Performance ---")
print(f"MAE: {mae_lr:.2f}")
print(f"RMSE: {rmse_lr:.2f}")
print(f"R-squared: {r2_lr:.2f}")

model_performance['Linear Regression'] = {'MAE': mae_lr, 'RMSE': rmse_lr, 'R2': r2_lr}

# Instantiate and train a RandomForestRegressor model
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train_processed, y_train)

# Make predictions on X_test_processed using the trained Random Forest model
y_pred_rf = rf_model.predict(X_test_processed)

# Calculate MAE, RMSE, and R² scores for the Random Forest model
mae_rf = mean_absolute_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)

print("\n--- Random Forest Regressor Performance ---")
print(f"MAE: {mae_rf:.2f}")
print(f"RMSE: {rmse_rf:.2f}")
print(f"R-squared: {r2_rf:.2f}")

model_performance['Random Forest'] = {'MAE': mae_rf, 'RMSE': rmse_rf, 'R2': r2_rf}

# Instantiate and train an XGBRegressor model
xgb_model = XGBRegressor(random_state=42)
xgb_model.fit(X_train_processed, y_train)

# Make predictions on X_test_processed using the trained XGBoost model
y_pred_xgb = xgb_model.predict(X_test_processed)

# Calculate MAE, RMSE, and R² scores for the XGBoost model
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred_xgb))
r2_xgb = r2_score(y_test, y_pred_xgb)

print("\n--- XGBoost Regressor Performance ---")
print(f"MAE: {mae_xgb:.2f}")
print(f"RMSE: {rmse_xgb:.2f}")
print(f"R-squared: {r2_xgb:.2f}")

model_performance['XGBoost'] = {'MAE': mae_xgb, 'RMSE': rmse_xgb, 'R2': r2_xgb}

# Display stored model performance for comparison
print("\n--- All Model Performance Summary ---")
for model_name, metrics in model_performance.items():
    print(f"{model_name}: MAE={metrics['MAE']:.2f}, RMSE={metrics['RMSE']:.2f}, R2={metrics['R2']:.2f}")



--- Linear Regression Performance ---
MAE: 201.64
RMSE: 255.00
R-squared: 0.86

--- Random Forest Regressor Performance ---
MAE: 203.77
RMSE: 266.04
R-squared: 0.85

--- XGBoost Regressor Performance ---
MAE: 199.23
RMSE: 264.88
R-squared: 0.85

--- All Model Performance Summary ---
Linear Regression: MAE=201.64, RMSE=255.00, R2=0.86
Random Forest: MAE=203.77, RMSE=266.04, R2=0.85
XGBoost: MAE=199.23, RMSE=264.88, R2=0.85
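
To make the three reported metrics concrete, they can be computed by hand from their standard definitions. This is a small sketch with NumPy only; the function name `regression_metrics` and the four sample values are made up for illustration and are unrelated to the notebook's data.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and R² from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred

    mae = np.mean(np.abs(err))                       # mean absolute error
    rmse = np.sqrt(np.mean(err ** 2))                # root of the mean squared error
    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


mae, rmse, r2 = regression_metrics([100, 200, 300, 400], [110, 190, 320, 380])
print(f"MAE={mae:.2f}, RMSE={rmse:.2f}, R2={r2:.2f}")
# MAE=15.00, RMSE=15.81, R2=0.98
```

Because RMSE squares the errors before averaging, it penalizes large misses more heavily than MAE, which is why two models can rank differently under the two metrics.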


## Select Best Model

### Subtask:
Identify and select the best-performing model among Linear Regression, Random Forest, and XGBoost based on the evaluation metrics (e.g., lowest RMSE or MAE). This model will be used for predictions and feature importance analysis.


**Reasoning**:
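The next step is to programmatically identify the best-performing model from the `model_performance` dictionary, using RMSE as the selection criterion. This involves iterating through the dictionary, comparing RMSE values, and storing the name and the model object of the best-performing model. The choice of metric matters here: in the results above, Linear Regression has the lowest RMSE (255.00), while XGBoost has the lowest MAE (199.23 versus 201.64), and all three models are within about 11 MW of each other on RMSE.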



In [None]:
best_model_name = None
lowest_rmse = float('inf')
best_model = None

# Create a dictionary mapping model names to their objects
models = {
    'Linear Regression': lr_model,
    'Random Forest': rf_model,
    'XGBoost': xgb_model
}

# Iterate through the model_performance dictionary to find the best model
for model_name, metrics in model_performance.items():
    if metrics['RMSE'] < lowest_rmse:
        lowest_rmse = metrics['RMSE']
        best_model_name = model_name

# Retrieve the best model object
best_model = models[best_model_name]

print(f"\n--- Best Performing Model: {best_model_name} ---")
print(f"MAE: {model_performance[best_model_name]['MAE']:.2f}")
print(f"RMSE: {model_performance[best_model_name]['RMSE']:.2f}")
print(f"R-squared: {model_performance[best_model_name]['R2']:.2f}")
print(f"Best model object type: {type(best_model)}")


--- Best Performing Model: Linear Regression ---
MAE: 201.64
RMSE: 255.00
R-squared: 0.86
Best model object type: <class 'sklearn.linear_model._base.LinearRegression'>
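
Since the three models are close on a single 80/20 split, the ranking may depend on which rows land in the test set. One way to check how stable the choice is, sketched here as an optional extra rather than a step from the original notebook, is k-fold cross-validation. The helper name `cv_rmse` and the demo data are invented for illustration; in the notebook it would be called with `X_train_processed`, `y_train`, and fresh instances of the three models.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score


def cv_rmse(models, X, y, n_splits=5, seed=42):
    """Return {name: (mean RMSE, std of RMSE)} across k folds."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = {}
    for name, model in models.items():
        # scikit-learn reports negated RMSE so that larger is better
        scores = cross_val_score(model, X, y, cv=cv,
                                 scoring='neg_root_mean_squared_error')
        results[name] = (-scores.mean(), scores.std())
    return results


# Demo data with a purely linear signal (made up for illustration)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(0, 1, 300)

demo_models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=42),
}

cv_results = cv_rmse(demo_models, X_demo, y_demo)
best_name = min(cv_results, key=lambda name: cv_results[name][0])

for name, (mean_rmse, std_rmse) in cv_results.items():
    print(f"{name}: RMSE={mean_rmse:.2f} (+/- {std_rmse:.2f})")
print(f"Lowest mean RMSE: {best_name}")
```

If the gap between two models' mean RMSE is smaller than the spread across folds, the simpler model is usually a reasonable default.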
