# Capstone Project Problem Statement: Forecasting Power Consumption in Tetuan City Using Environmental and Solar Radiation Indicators

## Business Context:
Energy consumption is a critical concern for modern organizations aiming to reduce operational costs, optimize energy usage, and support environmental sustainability. With advancements in sensor technologies and IoT, real-time data on environmental conditions (like temperature, humidity, wind speed, and solar radiation) can be collected alongside energy usage statistics. Leveraging this data enables organizations to uncover patterns, forecast energy demands, and implement smart energy-saving strategies across various zones of a facility or campus.

This project analyzes multivariate environmental and energy data to uncover insights and build predictive models to aid in efficient energy management across three distinct zones. The ultimate goal is to enable data-driven decisions to reduce waste and improve energy efficiency.

# Project Description: Tetuan City Energy Consumption Analysis

This dataset contains hourly records of environmental conditions and electricity usage from Tetuan City. It includes variables such as temperature, humidity, wind speed, solar radiation (general and diffuse flows), and power consumption across three distinct zones. Using this dataset, I aim to perform exploratory data analysis to uncover patterns and relationships between weather conditions and energy usage. The project also involves building machine learning models to solve both regression and classification problems—predicting actual power consumption values and classifying usage levels as high or low. This end-to-end workflow will serve as my capstone project, showcasing how data science can be applied to optimize energy efficiency and support smart city initiatives. If time permits, the project may also include deployment of the predictive model for real-time use.

# Project Objective:

The project aims to perform a comprehensive analysis of zone-wise power consumption using weather and radiation parameters. It includes multiple tasks from data understanding to building predictive models:

a. Exploratory Data Analysis (EDA):
* Understand the distribution and relationships of variables such as Temperature, Humidity, Wind Speed, and the two solar radiation measures (general diffuse flows and diffuse flows).
* Visualize and analyze power consumption patterns in Zone 1, Zone 2, and Zone 3 over time.
* Identify outliers, missing values, trends, seasonality, and correlations between environmental conditions and power consumption.

b. Classification Task (Binary Classification):
* Objective: Classify high vs. low power consumption for a selected zone (e.g., Zone 1).
* Method:
   * Define a binary target variable: e.g., Power consumption above a certain threshold = 1 (high), else = 0 (low).
   * Use features like temperature, humidity, wind speed, and radiation data as inputs.
   * Apply classification models such as Logistic Regression, Random Forest, or XGBoost.
   * Evaluate class balance and apply resampling if necessary.

c. Regression Task (Predictive Modeling):
* Objective: Predict actual power consumption (e.g., Zone 1 Power Consumption) using the environmental variables.
* Method:
   * Use models such as Linear Regression, Decision Trees, Random Forest, or Gradient Boosting Regressors.
   * Compare models using metrics like RMSE, MAE, and R².
   * Perform feature importance analysis to determine key contributors to energy usage.

d. Model Evaluation:
* Evaluate both classification and regression models using appropriate metrics:
  * Classification: Accuracy, Precision, Recall, F1-Score, ROC-AUC.
  * Regression: R² Score, Mean Squared Error (MSE), Mean Absolute Error (MAE), RMSE.
* Use cross-validation and hyperparameter tuning (Grid Search/CV) to ensure model robustness and generalizability.

e. Model Deployment:
* If time permits, deploy the best-performing model:
  * Serve it through a Flask or FastAPI web application.
  * Use Streamlit for an interactive visualization and prediction interface.
  * Export the model via joblib or pickle for real-time inference.
  * Demonstrate how to input new environmental data and predict power consumption or the classification outcome.

## Data Description
The dataset contains the following columns:

* DateTime: Timestamp of the observation (hourly).
* Temperature: Ambient temperature (°C).
* Humidity: Humidity percentage (%).
* Wind Speed: Wind speed at the time of observation (m/s).
* General Diffuse Flows: Total diffuse solar radiation measured.
* Diffuse Flows: Diffuse component of solar radiation.
* Zone 1 Power Consumption: Power usage in Zone 1 (kW).
* Zone 2 Power Consumption: Power usage in Zone 2 (kW).
* Zone 3 Power Consumption: Power usage in Zone 3 (kW).
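
When the CSV is read, the DateTime column arrives as text, so any time-based analysis needs an explicit parse first. The sketch below is a minimal illustration, assuming timestamps look like `1/1/2017 0:00` (month/day/year). The sample rows and the derived column names `hour`, `month`, and `day_of_week` are made up for the example and are not part of the dataset.

```python
import pandas as pd

# Hypothetical sample rows; the real file may use a different timestamp format.
sample = pd.DataFrame({
    "DateTime": ["1/1/2017 0:00", "1/1/2017 13:00", "7/15/2017 18:00"],
    "Temperature": [6.5, 12.1, 29.3],
})

# Parse the text timestamps (the format string is an assumption).
sample["DateTime"] = pd.to_datetime(sample["DateTime"], format="%m/%d/%Y %H:%M")

# Derive calendar features that can later serve as model inputs.
sample["hour"] = sample["DateTime"].dt.hour
sample["month"] = sample["DateTime"].dt.month
sample["day_of_week"] = sample["DateTime"].dt.dayofweek  # Monday = 0

print(sample)
```

Features such as hour of day often capture daily usage cycles that the weather variables alone cannot explain.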

# 1. Exploratory Data Analysis

# EDA Steps:
* Load data ✅
* Check structure ✅
* Summary stats ✅
* Missing values ✅
* Duplicates ✅
* Univariate analysis ✅
* Bivariate analysis ✅
* Correlation ✅
* Outliers ✅
* Feature engineering (optional) ✅

# Importing Libraries


In [None]:
import numpy as np
import pandas as pd
import matplotlib as mlt
import seaborn as sns
import matplotlib.pyplot as plt

#### Observation:
##### This cell imports the Python libraries needed for data manipulation (numpy, pandas) and data visualization (matplotlib, seaborn). These are essential for handling datasets and plotting graphs during exploratory data analysis (EDA).

# Reading the Dataset


In [None]:
df = pd.read_csv('data/Tetuan City power consumption.csv')

#### Observation:
##### This line reads the CSV file named "Tetuan City power consumption.csv" into a pandas DataFrame named df. This is the primary dataset that will be used for analysis.

# Displaying Initial Records of the Dataset

In [None]:
df.head(3)

#### Observation:
##### This cell displays the first 3 rows of the DataFrame df.

# Displaying Last 5 Records of the Dataset

In [None]:
df.tail(5)

#### Observation:
##### This cell displays the last 5 rows of the DataFrame df.

# Checking the Shape of the Dataset

In [None]:
df.shape

#### Observation:
##### This cell returns the dimensions of the DataFrame df in the form (rows, columns).

# Listing All Column Names

In [None]:
df.columns

#### Observation:
##### This cell displays the names of all columns in the DataFrame df.

# Displaying Dataset Information

In [None]:
df.info()

#### Observation:
* This cell provides a summary of the DataFrame’s structure, including:
  * The number of entries (rows).
  * Each column name and its data type.
  * The count of non-null (non-missing) values in each column.
  * The memory usage of the dataset.

# Generating Descriptive Statistics

In [None]:
df.describe()

#### Observation:
* This cell generates summary statistics for all numeric columns in the dataset. It includes:
  * Count: Number of non-null entries.
  * Mean: Average value.
  * Standard deviation (std): Spread of data.
  * Min and Max: Minimum and maximum values.
  * 25%, 50%, 75%: Quartile values (useful for detecting skewness or outliers).

# Checking for Missing Values

In [None]:
df.isnull().sum()

#### Observation:
* This cell checks for missing (null/NaN) values in each column of the DataFrame by:
  * Using isnull() to create a boolean mask where True indicates missing values.
  * Using sum() to count the total True values in each column.
* This is a crucial step in data cleaning, helping to:
  * Detect columns that require imputation or removal.
  * Ensure the dataset is complete before modeling or analysis.
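
To show what this check reports when gaps do exist, here is a small sketch on a made-up frame (the values are illustrative, not taken from the dataset). It also adds the percentage of missing rows per column, which is often easier to act on than the raw count.

```python
import numpy as np
import pandas as pd

# Made-up frame with two gaps, for illustration only.
toy = pd.DataFrame({
    "Temperature": [20.1, np.nan, 22.4, 21.0],
    "Humidity": [55.0, 60.2, np.nan, 58.3],
    "Wind Speed": [0.08, 0.08, 4.9, 0.07],
})

missing_count = toy.isnull().sum()        # number of gaps per column
missing_pct = toy.isnull().mean() * 100   # share of rows missing, in percent

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)
```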

# Checking for Duplicate Rows

In [None]:
df.duplicated().sum()

#### Observation:
* This cell checks for duplicate records in the dataset:
  * df.duplicated() returns a Boolean Series indicating which rows are duplicates.
  * sum() counts the number of True values, i.e., the total number of duplicate rows.

# Visualizing Missing Data Using Missingno

In [None]:
import missingno as msno
import matplotlib.pyplot as plt

msno.matrix(df)
plt.show()

#### Observation:
* This cell uses the missingno library to visually inspect missing data in the DataFrame using a matrix plot:
  * Each column is shown as a vertical bar.
  * White lines (if any) indicate missing entries.
  * The plot also shows a sparkline indicating data density across rows.

* This graphical tool helps to:
  * Quickly identify patterns or blocks of missing data.
  * Detect if missing values are randomly distributed or occur in chunks.
  * Complement the numeric check from df.isnull().sum().

# Univariate Analysis
### Histogram and KDE Plot for Temperature

In [None]:
sns.histplot(df['Temperature'],kde = True)
plt.title("Temp distribution")
plt.show()

#### Observation:
* This is a univariate analysis of the Temperature column using a histogram with a Kernel Density Estimate (KDE) overlay:
  * The histogram shows the frequency distribution of temperature values.
  * The KDE curve estimates the probability density, giving a smooth distribution shape.

# Univariate Analysis
### Average Power Consumption by Zone

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import re

# Normalize column names: remove extra spaces
df.columns = [re.sub(' +', ' ', col.strip()) for col in df.columns]

# Plot average power consumption
df[['Zone 1 Power Consumption', 'Zone 2 Power Consumption', 'Zone 3 Power Consumption']].mean().plot(kind='bar')
plt.title("Average Power Consumption by Zone")
plt.ylabel("Power Consumption")
plt.show()

#### Observation:
* This cell performs column name cleaning using regex to remove extra spaces, ensuring reliable column access.
* It then performs univariate analysis by plotting the average power consumption for each zone (Zone 1, Zone 2, and Zone 3) using a bar chart.

# Bivariate Analysis
### Humidity Bins vs Zone 1 Power Consumption (Boxplot)

In [None]:
df['Humidity Bin'] = pd.cut(df['Humidity'], bins=5)
sns.boxplot(x='Humidity Bin', y='Zone 1 Power Consumption', data=df)
plt.xticks(rotation=45)
plt.title('Humidity Bins vs. Zone 1 Power Consumption')
plt.show()

#### Observation:
* This cell performs a bivariate analysis between Humidity (binned) and Zone 1 Power Consumption using a boxplot.
* pd.cut() divides the humidity values into 5 equal-width bins to simplify analysis.
* The boxplot shows the distribution (median, quartiles, and outliers) of power consumption within each humidity range.
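
The binning step can be checked numerically on a few made-up readings. This sketch assumes nothing about the real data; the `Power` values are invented. It shows the intervals `pd.cut` creates and the per-bin median, which is the center line each box draws.

```python
import pandas as pd

# Made-up humidity readings and power values, for illustration only.
toy = pd.DataFrame({
    "Humidity": [10.0, 25.0, 40.0, 55.0, 70.0, 85.0, 100.0],
    "Power": [30.0, 32.0, 31.0, 35.0, 36.0, 33.0, 34.0],
})

# bins=5 splits the observed range (10 to 100) into five intervals of equal width.
# pandas widens the first edge slightly so that the minimum value is included.
toy["Humidity Bin"] = pd.cut(toy["Humidity"], bins=5)

# Median power per bin: the statistic a boxplot shows as its center line.
median_by_bin = toy.groupby("Humidity Bin", observed=False)["Power"].median()
print(median_by_bin)
```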

# Bivariate Analysis
### Temperature vs Zone 1 Power Consumption (Scatter Plot)

In [None]:
sns.scatterplot(x='Temperature', y='Zone 1 Power Consumption', data=df)
plt.title('Temperature vs. Zone 1 Power Consumption')
plt.show()

#### Observation:
* This scatter plot performs a bivariate analysis between Temperature and Zone 1 Power Consumption.
* Each point represents a record in the dataset with:
  * X-axis: Temperature value
  * Y-axis: Corresponding Zone 1 power usage

# Bivariate Analysis
### Wind Speed vs. High/Low Zone 1 Power Consumption (Stacked Bar Chart)

In [None]:
df['Wind Speed Bin'] = pd.cut(df['Wind Speed'], bins=4)
df['High Power'] = df['Zone 1 Power Consumption'] > df['Zone 1 Power Consumption'].median()
pd.crosstab(df['Wind Speed Bin'], df['High Power']).plot(kind='bar', stacked=True)
plt.title('Wind Speed vs. High/Low Power Consumption')
plt.ylabel('Count')
plt.show()

#### Observation:
* This cell analyzes how Zone 1 power consumption varies with wind speed, using a stacked bar chart.
* Wind speed is binned into 4 intervals.
* Power consumption is classified as High or Low based on whether it is above the median.
* pd.crosstab() creates a contingency table, and the chart visualizes the count of high vs low consumption in each wind speed bin.

# Multivariate Analysis
### Correlation Matrix of Numerical Features (Heatmap)

In [None]:
corr = df.corr(numeric_only=True)

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()

#### Observation:
* This heatmap visualizes the pairwise correlation between all numeric columns in the dataset.
* Correlation values range from -1 to 1:
   * +1: Perfect positive correlation
   * -1: Perfect negative correlation
   * 0: No correlation
* annot=True displays the exact correlation values inside the heatmap cells.
* coolwarm color palette helps easily distinguish positive (warm) and negative (cool) relationships.
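
A heatmap is easy to scan, but for modeling it often helps to rank features by their correlation with the target. The sketch below uses synthetic stand-in data (the numbers are generated, not real measurements) in which power depends on temperature and not on wind speed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data: power rises with temperature and ignores wind speed.
temperature = rng.normal(20, 5, n)
wind_speed = rng.normal(2, 1, n)
power = 1000 * temperature + rng.normal(0, 2000, n)

toy = pd.DataFrame({
    "Temperature": temperature,
    "Wind Speed": wind_speed,
    "Zone 1 Power Consumption": power,
})

# Correlation of each feature with the target, strongest (by absolute value) first.
target = "Zone 1 Power Consumption"
corr_with_target = (
    toy.corr(numeric_only=True)[target]
    .drop(target)
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_target)
```

Correlation only measures linear association, so a low value does not rule out a nonlinear relationship.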

# Outlier Detection Using Boxplots and the IQR Method

In [None]:
numeric_cols = df.select_dtypes(include='number').columns

# Boxplots for each numeric column to visually inspect outliers
for col in numeric_cols:
    plt.figure(figsize=(6, 1.5))
    sns.boxplot(x=df[col])
    plt.title(f'Boxplot for {col}')
    plt.show()

# Detect outliers using the IQR method
outlier_summary = {}
for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    outlier_summary[col] = len(outliers)

# Print number of outliers in each column
for col, count in outlier_summary.items():
    print(f"{col}: {count} outliers")

#### Observation:
* Numeric Column Selection:
  * Selected only the numeric columns from the DataFrame for outlier analysis.
* Boxplot Visualization:
  * Generated boxplots for each numeric column to visually inspect the spread and detect potential outliers.
* IQR Calculation:
  * Computed the Interquartile Range (IQR) for each column using Q1 (25th percentile) and Q3 (75th percentile).
* Outlier Detection Logic:
  * Defined outliers as data points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
* Outlier Counting:
  * Counted and stored the number of outliers for each numeric column in a summary dictionary.
* Output Summary:
  * Printed the number of detected outliers for every numeric column, helping to identify which features contain extreme values.
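
The IQR rule described above can be wrapped in a small helper and verified on a handful of values. The function name `iqr_bounds` and the sample readings are hypothetical, introduced only for this sketch.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up readings with one obviously extreme value.
readings = pd.Series([10, 12, 11, 13, 12, 11, 10, 50])

lower, upper = iqr_bounds(readings)
outliers = readings[(readings < lower) | (readings > upper)]

print(f"fences: ({lower}, {upper})")  # fences: (8.5, 14.5)
print(outliers.tolist())              # [50]
```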

# 3. Data Cleaning & Preprocessing
0. Dropping unwanted columns
1. Dropping duplicate rows
2. Replacing wrong entries
3. Missing values imputation (SimpleImputer, fillna())
4. Handling outliers (IQR, Z-score method)
5. Encoding
6. Data splitting
7. Feature scaling: StandardScaler, MinMaxScaler
8. Feature selection: based on correlation, domain knowledge, or model-based methods


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('/content/Tetuan City power consumption.csv')

In [None]:
# Make a copy of the original dataset
df_copy = df.copy()
# Select categorical variables (object dtype)
cat_variables = df_copy.select_dtypes(include='object')
# Select numerical variables (all integer and float dtypes)
num_variables = df_copy.select_dtypes(include='number')

# Print results
print("Categorical variables:")
print(cat_variables.head())
print("\nNumerical variables:")
print(num_variables.head())

### 3.0 Dropping unwanted columns:
* Suppose we identify some unwanted columns (example: 'Unnamed: 0' or any irrelevant column)


In [None]:
# print columns of dataset
print('columns in data set')
print(df_copy.columns)

# List the columns to drop; leave the list empty to keep every column
unwanted_columns = []

# Keep only the names that actually exist in the DataFrame
columns_to_drop = [col for col in unwanted_columns if col in df_copy.columns]

# Drop them (a no-op when the list is empty)
df_copy = df_copy.drop(columns=columns_to_drop)

### 3.1 Dropping duplicate rows:

In [None]:
# Find all duplicate rows (mask and frame must come from the same DataFrame)
duplicates = df_copy[df_copy.duplicated()]

# Show the number of duplicated rows
print("Number of duplicate rows:", duplicates.shape[0])

# preview duplicates, if any
duplicates.head()

In [None]:
# remove duplicates
df_cleaned = df.drop_duplicates()

# check new shape
print("New shape after removing:",df_cleaned.shape)

### 3.2 Replacing wrong entries, if any  

* Missing or NaN

* Outliers or unrealistic values

* Wrong data types

* Duplicate timestamps or logic issues

* Inconsistent column formats or typos

## Steps:
1. Load the dataset (already done in EDA)
2. Detect wrong entries
3. Replace or fix wrong entries
4. Save the cleaned file


##### 3.2.1 Load dataset:

In [None]:
# import pandas as pd

# # Load the CSV file
# df = pd.read_csv("Tetuan City power consumption.csv")

# Show first 5 rows
# df.head(5)

##### 3.2.2 detect wrong entries:

In [None]:
# Check for Null/Missing Values
df.isnull().sum()

In [None]:
# Check Data Types
df.dtypes

#### Detect invalid (negative) values:

In [None]:
# Detect invalid values:

# Describe stats to spot anomalies
# df.describe()

# Check negative values in power consumption (if not expected)
invalid_power = df[(df["Zone 1 Power Consumption"] < 0) |
                   (df["Zone 2  Power Consumption"] < 0) |
                   (df["Zone 3  Power Consumption"] < 0)]
print(invalid_power)

##### 3.2.3 Replace or fix wrong entries:

In [None]:
# Replace Negative Values with Mean of the Column
for col in ["Zone 1 Power Consumption", "Zone 2  Power Consumption", "Zone 3  Power Consumption"]:
    mean_val = df[df[col] >= 0][col].mean()
    df[col] = df[col].apply(lambda x: mean_val if x < 0 else x)


In [None]:
# Fill missing values with forward fill
# (fillna(method='ffill') is deprecated in recent pandas; ffill() is the direct equivalent)
df = df.ffill()


##### Replace Strings or Wrong Entries
 * (Example: replace misspelled entries in a "City" column)

In [None]:
# df["City"] = df["City"].replace({"Tetuon": "Tetuan"})

In [None]:
# Save the Cleaned File
df.to_csv("Cleaned_Tetuan_City_power.csv", index=False)

### 3.3 Missing values imputation (SimpleImputer, fillna())
Missing data is common in real-world datasets. If not handled, it can lead to:

* Errors in model training
* Biased or incomplete analysis
* Failures in ML algorithms (most don’t accept nulls)



In [None]:
# Make a copy before imputing
df_original = df.copy()

In [None]:
# Perform imputation...
from sklearn.impute import SimpleImputer
import pandas as pd

# Select numeric columns
numeric_cols = df.select_dtypes(include=['float64', 'int64'])

# Create imputer: strategy = mean, median, most_frequent, or constant
imputer = SimpleImputer(strategy='mean')

# Fit and transform
df[numeric_cols.columns] = imputer.fit_transform(numeric_cols)


In [None]:
#Then compare
print("Before:\n", df_original[numeric_cols.columns].isnull().sum())
print("After:\n", df[numeric_cols.columns].isnull().sum())


In [None]:
# If you want to check if any missing values remain:
print(df.isnull().sum())        # Shows number of missing values in each column

### 3.4 Handle outliers (IQR method):
* IQR = Q3 − Q1

* Q1 = 25th percentile
* Q3 = 75th percentile

* Lower bound = Q1 − 1.5 × IQR

* Upper bound = Q3 + 1.5 × IQR

In [None]:
import pandas as pd

# Example for one column
col = 'Temperature'  # replace with your column name

Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Select the rows that fall outside the bounds
outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
print("Outliers:\n", outliers)

# Optionally remove them:
df_cleaned = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
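
Removing rows is not the only option. For time-ordered data, dropping rows leaves gaps in the series, so capping extreme values at the IQR fences is a common alternative. This is a sketch on made-up temperatures, not a recommendation for this dataset; the column name `Temperature_capped` is hypothetical.

```python
import pandas as pd

# Made-up temperatures with one extreme value.
toy = pd.DataFrame({"Temperature": [18.0, 19.0, 20.0, 21.0, 22.0, 60.0]})

q1 = toy["Temperature"].quantile(0.25)
q3 = toy["Temperature"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Cap values at the fences instead of deleting rows, so the row count is unchanged.
toy["Temperature_capped"] = toy["Temperature"].clip(lower=lower_bound, upper=upper_bound)
print(toy)
```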


#### Z-score Method
* Z-score = (x − mean) / std

* Outliers have Z-scores > 3 or < -3



In [None]:
import pandas as pd
from scipy.stats import zscore

# Load your dataset
df = pd.read_csv("Tetuan City power consumption.csv")

# Select only numeric columns
numeric_df = df.select_dtypes(include=['float64', 'int64'])

# Compute Z-scores
z_scores = zscore(numeric_df)

# Convert back to DataFrame with same column names
z_df = pd.DataFrame(z_scores, columns=numeric_df.columns)

# Show Z-score output
print(z_df.head())
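
Z-scores only help with outlier handling once the cutoff is applied. The sketch below flags rows with |z| > 3 on a synthetic column with one injected extreme value. It computes the score directly with pandas using the population standard deviation (`ddof=0`), which matches the default of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in column with one injected extreme value at row 10.
values = rng.normal(loc=20.0, scale=2.0, size=200)
values[10] = 60.0
toy = pd.DataFrame({"Temperature": values})

# Z-score = (x - mean) / std, using the population std (ddof=0).
col = toy["Temperature"]
z = (col - col.mean()) / col.std(ddof=0)

# Flag rows whose absolute z-score exceeds 3.
outlier_mask = z.abs() > 3
flagged = toy.index[outlier_mask].tolist()
print("flagged rows:", flagged)

# Keep only the rows within the cutoff.
toy_no_outliers = toy[~outlier_mask]
```

The z-score rule assumes a roughly bell-shaped distribution; for strongly skewed columns the IQR rule is usually the safer choice.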



### 3.5 Encoding:
* Load the dataset
* Identify Categorical Columns
* Option A: Label Encoding (for Ordinal/Ordered data)
* Option B: One-Hot Encoding (for Nominal/Unordered data)


In [None]:
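# Show non-numeric (categorical) columns.
# DateTime is excluded: it is read as text but is a timestamp, not a category,
# and one-hot encoding it would create one column per timestamp.
categorical_cols = df.select_dtypes(include=['object']).columns.drop('DateTime', errors='ignore')
print("Categorical Columns:", categorical_cols.tolist())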


##### Label Encoding (for Ordinal/Ordered data)

In [None]:
from sklearn.preprocessing import LabelEncoder

# Create a label encoder object
le = LabelEncoder()

# Apply label encoding to all categorical columns
for col in categorical_cols:
    df[col + '_label'] = le.fit_transform(df[col])

# Show result
df[[*categorical_cols, *[col + '_label' for col in categorical_cols]]].head()


##### One-Hot Encoding (for Nominal/Unordered data)

In [None]:
# One-hot encode categorical columns (remove first column to avoid dummy variable trap)
df_onehot = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Show result
df_onehot.head()
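
One caveat for the label-encoding option: `LabelEncoder` numbers the classes in sorted order, which for text labels is alphabetical, so the codes do not necessarily follow a meaningful ranking. When the order matters, it can be stated explicitly. The sketch below uses a hypothetical `Usage Level` column that does not exist in the dataset.

```python
import pandas as pd

# Hypothetical ordered category; the dataset itself has no such column.
toy = pd.DataFrame({"Usage Level": ["low", "high", "medium", "low", "high"]})

# State the order explicitly so the integer codes respect low < medium < high.
levels = ["low", "medium", "high"]
toy["Usage Level_code"] = pd.Categorical(
    toy["Usage Level"], categories=levels, ordered=True
).codes

print(toy)
```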


### 3.6 Data Splitting:

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Step 1: Load your dataset
df = pd.read_csv("Tetuan City power consumption.csv")
# We are predicting "Zone 1 Power Consumption"
target_column = "Zone 1 Power Consumption"
# X = all columns except the target
X = df.drop(columns=[target_column])

# y = only the target column
y = df[target_column]
X_train, X_test, y_train, y_test = train_test_split(
    X,         # input features
    y,         # target values
    test_size=0.2,      # 20% for test, 80% for training
    random_state=42     # ensures same split every time (reproducibility)
)

# Multiple outputs neatly
{
    "X_train": X_train.shape,
    "X_test": X_test.shape,
    "y_train": y_train.shape,
    "y_test": y_test.shape
}
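
Because the observations are time-stamped, a random split lets the model train on rows that come after some of the test rows. If the goal is forecasting, a chronological split is a reasonable alternative to consider. The sketch below uses a synthetic stand-in frame; the values and the names `train_set` and `test_set` are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset: 100 hourly rows (values are made up).
toy = pd.DataFrame({
    "DateTime": pd.Timestamp("2017-01-01") + pd.to_timedelta(np.arange(100), unit="h"),
    "Temperature": np.linspace(5.0, 25.0, 100),
    "Zone 1 Power Consumption": np.linspace(20000.0, 40000.0, 100),
})

# Sort by time, then hold out the final 20% as the test set.
toy = toy.sort_values("DateTime").reset_index(drop=True)
split_idx = int(len(toy) * 0.8)
train_set, test_set = toy.iloc[:split_idx], toy.iloc[split_idx:]

print(train_set.shape, test_set.shape)
# Every training timestamp precedes every test timestamp.
print(train_set["DateTime"].max() < test_set["DateTime"].min())
```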



### 3.7 Feature Scaling

In [None]:
# If X_train is a DataFrame (not NumPy array), drop the DateTime column
X_train = X_train.drop(columns=['DateTime'], errors='ignore')
X_test = X_test.drop(columns=['DateTime'], errors='ignore')

# Now apply scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# StandardScaler
standard_scaler = StandardScaler()
X_train_standard = standard_scaler.fit_transform(X_train)
X_test_standard = standard_scaler.transform(X_test)

# MinMaxScaler
minmax_scaler = MinMaxScaler()
X_train_minmax = minmax_scaler.fit_transform(X_train)
X_test_minmax = minmax_scaler.transform(X_test)

print("✅ Scaling done!")
print("Standard Scaled X_train shape:", X_train_standard.shape)
print("MinMax Scaled X_train shape:", X_train_minmax.shape)
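
The reason the scaler is fitted on the training split only, and then reused on the test split, can be seen on a tiny made-up example. The variable names ending in `_toy` are introduced just for this sketch.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data.
X_train_toy = np.array([[10.0], [20.0], [30.0]])
X_test_toy = np.array([[40.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)  # learns mean = 20 and std of about 8.165
X_test_scaled = scaler.transform(X_test_toy)        # reuses the training mean and std

print(X_train_scaled.ravel())  # centered on 0 with unit variance
print(X_test_scaled.ravel())   # (40 - 20) / 8.165, roughly 2.449
```

Fitting the scaler on the full dataset would leak information about the test rows into training, which makes validation scores look better than they are.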


### 3.8 Feature selection

##### 3.8.1 Based on Correlation (Filter method)


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Compute correlation matrix
corr_matrix = X_train.corr()

# Plot heatmap (optional)
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()

# Drop features highly correlated with others (e.g., corr > 0.9)
threshold = 0.9
upper = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)
to_drop = [column for column in upper.columns if any(upper[column].abs() > threshold)]

# Drop from dataset
X_train_corr = X_train.drop(columns=to_drop)
X_test_corr = X_test.drop(columns=to_drop)
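
The upper-triangle trick is easier to follow on a small synthetic example. In the sketch below the columns `a`, `b`, and `c` are invented: `b` is almost a copy of `a`, and `c` is independent, so only `b` should be dropped.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=200)

# Synthetic features: "b" is almost a copy of "a"; "c" is independent.
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=200),
    "c": rng.normal(size=200),
})

corr_matrix = toy.corr().abs()

# Keep only the upper triangle so each pair is inspected once.
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# A column is dropped if it correlates above the threshold with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)  # ['b']
```

Because only the later column of each pair is dropped, the column order decides which of two correlated features survives.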


##### 3.8.2 Model-Based Selection (Wrapper or Embedded methods)

In [None]:
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

# Fit a model
model = RandomForestRegressor()
model.fit(X_train, y_train)

# Get feature importances
importances = pd.Series(model.feature_importances_, index=X_train.columns)
importances.sort_values(ascending=False).plot(kind='barh')
plt.title("Feature Importance (Random Forest)")
plt.show()

# Select top N features
top_features = importances.sort_values(ascending=False).head(5).index.tolist()
X_train_model = X_train[top_features]
X_test_model = X_test[top_features]

# 4. Model Building (Regression), Evaluation & Tuning

### 4.1 Regression algorithms
* Linear Regression
* KNN
* Decision Trees (CART)
* Random Forest
* Boosting
  * AdaBoost
  * Gradient Boosting
  * XGBoost
### 4.2 Model Evaluation: Regression metrics: R² & RMSE
1. R-squared (R²): Coefficient of Determination
    * What it means:
        * Measures how much of the variability in the target variable the model explains.
        * Usually lies between 0 and 1, but is negative when the model performs worse than always predicting the mean.
    * Interpretation:
        * R² = 1 → perfect prediction
        * R² = 0 → model is no better than predicting the mean
        * Higher is better
    * Formula:

      $$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

2. RMSE: Root Mean Squared Error
    * What it means:
        * Measures the typical prediction error in the same units as the target variable.
        * Squaring the errors before averaging gives more weight to larger errors.
    * Interpretation:
        * Lower is better
        * Easy to interpret because it is in the same unit as the target variable
    * Formula:

      $$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
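
Both metrics can be computed by hand from their standard definitions: R² is one minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared error. The sketch below uses made-up actual and predicted values.

```python
import numpy as np

# Made-up actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 10.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares = 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares = 20.0

r2 = 1 - ss_res / ss_tot
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(f"R² = {r2:.3f}, RMSE = {rmse:.3f}")  # R² = 0.925, RMSE = 0.612
```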
### 4.3 Model Tuning
* GridSearchCV
* Hyperparameter tuning


In [None]:
# 📌 Import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# Models
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor

# 📌 Load the dataset
df = pd.read_csv("Tetuan City power consumption.csv")

# 📌 Select features and target
features = ['Temperature', 'Humidity', 'Wind Speed']
target = 'Zone 1 Power Consumption'

X = df[features]
y = df[target]

# 📌 Train-Test Split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# 📌 Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# 📌 Define models
models = {
    "Linear Regression": LinearRegression(),
    "KNN": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(),
    "AdaBoost": AdaBoostRegressor(),
    "Gradient Boosting": GradientBoostingRegressor(),
    "XGBoost": XGBRegressor(objective='reg:squarederror')
}

# 📌 Train and evaluate models
results = []

for name, model in models.items():
    model.fit(X_train_scaled, y_train)

    y_train_pred = model.predict(X_train_scaled)
    y_val_pred = model.predict(X_val_scaled)

    rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
    r2_train = r2_score(y_train, y_train_pred)
    rmse_val = np.sqrt(mean_squared_error(y_val, y_val_pred))
    r2_val = r2_score(y_val, y_val_pred)

    results.append({
        "Model": name,
        "Train RMSE": rmse_train,
        "Train R²": r2_train,
        "Val RMSE": rmse_val,
        "Val R²": r2_val
    })

results_df = pd.DataFrame(results)
display(results_df.sort_values("Val RMSE"))
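
The cell above compares models with default settings; the tuning step from section 4.3 could look like the following sketch. It runs on synthetic stand-in data so it is self-contained, and the parameter grid is an illustrative assumption, not a set of tuned recommendations. In the notebook, `X_toy` and `y_toy` would be replaced by the scaled training features and target.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(0)

# Synthetic stand-in for three weather features and a power target.
X_toy = rng.normal(size=(200, 3))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=200)

# Illustrative grid; the values are assumptions, not tuned recommendations.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# 5-fold cross-validation; scikit-learn maximizes scores, hence the negated RMSE.
search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_root_mean_squared_error",
)
search.fit(X_toy, y_toy)

print("best params:", search.best_params_)
print("best CV RMSE:", -search.best_score_)
```

The best estimator is refit on all of the supplied data and is available as `search.best_estimator_` for a final check on the held-out validation set.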
