<a href="https://colab.research.google.com/github/Samiimasmoudii/ML-Course-/blob/main/Housing%20Energy%20Consumption.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. Importing Libraries:
The code imports essential libraries for data manipulation (pandas), visualization (matplotlib, seaborn), machine learning models (sklearn, catboost), and feature engineering.

2. Data Loading:
The training and test datasets are loaded from CSV files using pandas.

3. Data Preprocessing:
The Preprocess function splits the `random load mesures` column into separate Cooling and Lights features and expands the WWR column into four features (WWR_1, WWR_2, WWR_3, WWR_4). Unneeded columns are then dropped from both the training and test sets. Two derived features are added: cool_ratio (Lights minus Cooling, which despite its name is a difference, not a ratio) and Total Area (Total Floors Area divided by Number of Floors, i.e. the average area per floor).

4. Scaling:
The features in both the training and test datasets are scaled using StandardScaler to normalize the data.

5. Exploratory Data Analysis (EDA):
A correlation matrix is computed to examine the relationship between features and the target variable (Operational Energy). A heatmap is plotted to visualize these correlations.

6. Feature Selection:
Using SelectKBest, the code selects the 10 most important features based on statistical relevance to the target variable using the f_regression method.

7. Model Training and Evaluation:
The CatBoostRegressor model is used for regression, with specific hyperparameters like iterations, learning rate, and depth. The model is trained on a subset of the data, and early stopping is applied to prevent overfitting. After training, predictions are made on the test dataset.

8. Submission Generation:
The final predictions are formatted into a submission file. A new submission_id is created by concatenating the building's ID and Town for each record, and the predictions are saved to a CSV file.



**Importing the Libraries**


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

**Importing the Dataset**

In [None]:
data = pd.read_csv("/content/AI_HACK/Train.csv")
test= pd.read_csv("/content/AI_HACK/Test.csv")

# Data_Fix

In [None]:
def Preprocess(data):
  Cooling=[]
  Lights=[]
  for i in data['random load mesures']:
      c=i.split(',')[0]
      l=i.split(',')[1]
      l=l.split(' ')[2]
      c=c.split(' ')[1]
      Cooling.append(c[1:-3])
      Lights.append(l[1:-4])
  # Convert the extracted text to numbers; float() avoids executing arbitrary code, unlike eval()
  Cooling = [float(i) for i in Cooling]
  Lights = [float(i) for i in Lights]
  data.insert(28,"Cooling",Cooling)
  data.insert(28,'Lights',Lights)
  data=data.drop(["random load mesures","building","File"],axis=1)
  wwr_columns = data['WWR'].str.strip('()').str.split(',')
  data[['WWR_1', 'WWR_2', 'WWR_3', 'WWR_4']] = pd.DataFrame(wwr_columns.tolist(), index=data.index)
  data.drop('WWR', axis=1, inplace=True)
  wwr1=data['WWR_1']
  wwr2=data["WWR_2"]
  wwr3=data["WWR_3"]
  wwr4=data["WWR_4"]
  data=data.drop(["WWR_1","WWR_2","WWR_3","WWR_4"],axis=1)
  data.insert(3,"WWR_1",wwr1)
  data.insert(3,"WWR_2",wwr2)
  data.insert(3,"WWR_3",wwr3)
  data.insert(3,"WWR_4",wwr4)
  data['WWR_1'] = data['WWR_1'].astype(float)
  data['WWR_2'] = data['WWR_2'].astype(float)
  data['WWR_3'] = data['WWR_3'].astype(float)
  data['WWR_4'] = data['WWR_4'].astype(float)

  return data
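
To make the `WWR` handling in `Preprocess` concrete, the sketch below applies the same strip-and-split steps to a toy frame. The sample strings are made up for illustration and only assume the `(a, b, c, d)` layout implied by the code above; they are not taken from the competition data.

```python
import pandas as pd

# Toy frame; the WWR strings are hypothetical examples of the "(a, b, c, d)" layout
toy = pd.DataFrame({"WWR": ["(0.77, 0.43, 0.57, 0.19)", "(0.45, 0.06, 0.27, 0.83)"]})

# Remove the surrounding parentheses, then split on commas (one list per row)
parts = toy["WWR"].str.strip("()").str.split(",")

# Expand the lists into four columns and convert the text to floats
wwr_cols = ["WWR_1", "WWR_2", "WWR_3", "WWR_4"]
toy[wwr_cols] = pd.DataFrame(parts.tolist(), index=toy.index).astype(float)
toy = toy.drop(columns="WWR")

print(toy)
```

Converting all four columns in one `astype(float)` call is equivalent to the four separate conversions in the function.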

**Use the preprocess function on data and test**

In [None]:
data=Preprocess(data)
test=Preprocess(test)


In [None]:
data.info()

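In [None]:
# Extract the target first; the optional three-column subset below does not contain it
y = data['Operational Energy']

In [None]:
# Optional experiment: keep only three features.
# Skip this cell for the full pipeline, because later cells need 'Lights', 'Cooling' and 'Operational Energy'.
data=data[["Total Floors Area","EUI","Number of Floors"]]
test=test[["Total Floors Area","EUI","Number of Floors"]]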

# Step 1: Data Cleaning and Preprocessing

**Create the engineered features (training set)**

In [None]:
# (Disabled) Create new feature for total heat gain
#data['Total Heat Gain'] = data['Equipment Heat Gain'] + data['Light Heat Gain']
# (Disabled) Create new feature for total thermal resistance
#data['Total Thermal Resistance'] = data['Internal Wall Rt'] + data['Internal Floor Rt'] + data['Ground Floor Rt'] + data['Windows Rt'] + data['Wall Rt'] + data['Roof Rt']
# Difference between the lighting and cooling loads
data["cool_ratio"]= data['Lights'] - data['Cooling']
# Average area per floor
data['Total Area'] = data['Total Floors Area'] / data['Number of Floors']
# Drop the original features used to create the new features
data.drop(['Total Floors Area', 'Number of Floors',"Cooling","Lights"], axis=1, inplace=True)
# (Disabled) One-hot encode the 'Building' feature
#data = pd.get_dummies(data, columns=['Building'])

Test part

In [None]:
# (Disabled) Create new feature for total heat gain
#test['Total Heat Gain'] = test['Equipment Heat Gain'] + test['Light Heat Gain']
# (Disabled) Create new feature for total thermal resistance
#test['Total Thermal Resistance'] = test['Internal Wall Rt'] + test['Internal Floor Rt'] + test['Ground Floor Rt'] + test['Windows Rt'] + test['Wall Rt'] + test['Roof Rt']
# Difference between the lighting and cooling loads
test["cool_ratio"]= test['Lights'] - test['Cooling']
# Average area per floor
test['Total Area'] = test['Total Floors Area'] / test['Number of Floors']
# Drop the original features used to create the new features
test.drop(['Total Floors Area', 'Number of Floors',"Cooling","Lights"], axis=1, inplace=True)
#test.drop(['Equipment Heat Gain', 'Light Heat Gain', 'Internal Wall Rt', 'Internal Floor Rt', 'Ground Floor Rt', 'Windows Rt', 'Wall Rt', 'Roof Rt', 'Total Floors Area', 'Number of Floors'], axis=1, inplace=True)

In [None]:
# Separate the features from the target so the target is neither scaled nor used as an input
features = data.drop('Operational Energy', axis=1)
y = data['Operational Energy']
#features=features.drop(["Internal Mass","Town","WWR_4","Start Time","Heating COP","Boiler Efficiency","WWR_1","WWR_2","WWR_3","Height","Light Heat Gain","Heating Setpoint","Operating Hours"],axis=1)
scaler = StandardScaler()
# Fit the scaler on the training features only
X = scaler.fit_transform(features)
#test=test.drop(["Internal Mass","Town","WWR_4","Start Time","Heating COP","Boiler Efficiency","WWR_1","WWR_2","WWR_3","Height","Light Heat Gain","Heating Setpoint","Operating Hours"],axis=1)
# Reuse the training statistics on the test set (same columns, same order) instead of refitting
test = scaler.transform(test[features.columns])
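
The usual convention with `StandardScaler` is to learn the mean and standard deviation from the training data and reuse them for any later data, so that both sets are expressed on the same scale. The sketch below illustrates this on tiny made-up arrays; `X_train_demo` and `X_test_demo` are hypothetical stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the training and test feature matrices
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_demo = np.array([[2.0, 40.0]])

scaler_demo = StandardScaler()
# fit_transform learns the per-column mean and standard deviation from the training rows
X_train_scaled = scaler_demo.fit_transform(X_train_demo)
# transform reuses those training statistics; nothing is re-estimated from the test rows
X_test_scaled = scaler_demo.transform(X_test_demo)

print(scaler_demo.mean_)   # per-column training means
print(X_test_scaled)       # test row expressed in training-set z-scores
```

Refitting on the test set would instead center each test column on its own mean, so the same raw value could map to different scaled values in the two sets.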


## Step 2: Exploratory Data Analysis

A correlation coefficient of 0.49 would be considered a moderate correlation, while coefficients of 0.85 and 0.75 would be considered strong correlations. The strength of a correlation is judged by the absolute value of the coefficient: values closer to 1 indicate a stronger linear relationship, and the sign only gives its direction.
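
As a rough illustration of reading coefficients by absolute value, the helper below labels a coefficient as weak, moderate, or strong. The function name and the 0.4 and 0.7 cut-offs are assumptions chosen to agree with the examples above; there is no single standard set of thresholds.

```python
def correlation_strength(r, moderate=0.4, strong=0.7):
    """Label a correlation coefficient by its absolute value.

    The default cut-offs are illustrative assumptions, not a standard.
    """
    magnitude = abs(r)
    if magnitude >= strong:
        return "strong"
    if magnitude >= moderate:
        return "moderate"
    return "weak"

# The three coefficients mentioned in the text, plus a negative one
for r in (0.49, 0.75, 0.85, -0.75):
    print(r, correlation_strength(r))
```

A coefficient of -0.75 is labeled the same as 0.75, since only the magnitude determines the strength.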

In [None]:
# data is already a pandas DataFrame; wrap it as df for the analysis
df = pd.DataFrame(data)
# Show full cell contents without truncation
pd.set_option('display.max_colwidth', None)

# Show all columns when displaying a DataFrame
pd.set_option('display.max_columns', None)

# Calculate the correlation matrix
corr_matrix = df.corr()
# A 100x100-inch figure is very slow to render; a smaller canvas keeps the annotations readable
fig, ax = plt.subplots(figsize=(20,20))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', ax=ax)
plt.show()


**Correlation with the target only, not between the features themselves**

In [None]:
# Compute the correlation coefficients between the "Operational Energy" column and all other columns
correlations = df.corr()['Operational Energy']

# Print the correlations in descending order
print(correlations.sort_values(ascending=False))

Operational Energy     1.000000
Total Floors Area      0.785652
Lights                 0.774495
Cooling                0.614184
Number of Floors       0.512663
EUI                    0.209727
Permeability           0.196442
Light Heat Gain        0.147437
Heating Setpoint       0.141919
Operating Hours        0.137906
Height                 0.133712
WWR_1                  0.030362
WWR_2                  0.029752
WWR_3                  0.027290
Internal Mass          0.006585
WWR_4                  0.005898
Start Time             0.003386
Internal Wall Rt       0.003001
Internal Floor Rt     -0.002452
Town                  -0.004624
Heating COP           -0.014947
Ground Floor Rt       -0.016440
Boiler Efficiency     -0.019678
Roof Rt               -0.026664
windows g-value       -0.030619
Wall Rt               -0.037066
Cooling Setpoint      -0.039054
Windows Rt            -0.045664
Cooling COP           -0.059204
Occupancy             -0.088590
Equipment Heat Gain   -0.127682
Infiltra

**# Step 3: Feature Selection**

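We aim to identify the features most strongly related to the target variable and remove irrelevant or redundant ones, which simplifies the model and can improve its performance.

One common technique for feature selection is Recursive Feature Elimination (RFE). This notebook instead uses SelectKBest with f_regression, a univariate filter that scores each feature independently against the target and keeps the 10 highest-scoring ones.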
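
The sketch below shows how `SelectKBest` with `f_regression` behaves on synthetic data where only two of five columns drive the target. The arrays, the `_demo` names, and `k=2` are assumptions for illustration; the notebook itself uses `k=10` on the real features.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: the target depends on columns 0 and 3 only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

# Score each column against the target and keep the two best
selector_demo = SelectKBest(f_regression, k=2)
X_selected = selector_demo.fit_transform(X_demo, y_demo)

# get_support reports which original columns were kept
kept = selector_demo.get_support(indices=True)
print(kept)

# New data must pass through the same fitted selector to get matching columns
X_new = rng.normal(size=(10, 5))
X_new_selected = selector_demo.transform(X_new)
print(X_new_selected.shape)
```

Any data passed to the model later, including the test set, needs the same `transform` call so that its columns line up with the ones the model was trained on.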

In [None]:
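selector = SelectKBest(f_regression, k=10)
X = selector.fit_transform(X, y)
# Keep the same 10 columns in the test set so it matches the training features
test = selector.transform(test)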

**Feature Engineering**

In [None]:
data.head()

Unnamed: 0,Cooling Setpoint,EUI,Cooling COP,WWR_4,WWR_3,WWR_2,WWR_1,Operating Hours,Infiltration,Occupancy,...,Start Time,windows g-value,Boiler Efficiency,Internal Mass,Permeability,Lights,Cooling,Operational Energy,Total Thermal Resistance,Total Area
0,26.804565,37.155511,4.430542,0.193317,0.574652,0.436005,0.771637,12.166667,0.21,23.231812,...,8.778931,0.381354,0.92189,44.441528,3.942261,10475.824219,6743.000738,135322.229618,16.577374,91051.25
1,25.219604,64.131327,2.855347,0.831842,0.278021,0.061151,0.451276,9.166667,0.222,19.02771,...,8.13147,0.556696,0.908726,41.702271,5.122925,3108.639378,6485.849597,168155.546796,17.09109,65551.25
2,26.69104,31.992473,2.863892,0.300433,0.695172,0.669464,0.324164,10.166667,0.221,20.664673,...,7.406128,0.572552,0.911157,24.244995,4.495239,4866.355447,3861.976104,108467.281862,21.201095,54246.4
3,25.468384,40.932114,3.922485,0.766254,0.570367,0.924347,0.772406,10.333333,0.245,19.004517,...,7.831909,0.520404,0.929751,24.655151,6.311646,7679.311052,10788.092245,177465.271945,19.143548,108390.0
4,25.152832,57.792356,2.828613,0.449683,0.077466,0.238745,0.353882,10.833333,0.313,18.679199,...,8.794434,0.475854,0.94748,29.20166,7.530762,5904.960314,4314.221007,205165.175932,17.919507,56800.64


**# Step 4: Model Selection and Tuning**

We use the CatBoost regression model (CatBoostRegressor).
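
The training cell in this step passes `early_stopping_rounds=40`, which stops training once the validation metric has not improved for 40 consecutive iterations. The pure-Python sketch below illustrates that patience rule on a made-up list of validation losses; the function name is hypothetical and this is not CatBoost's internal implementation.

```python
def best_iteration_with_patience(val_losses, patience):
    """Return (best_index, stop_index) under a patience-based early-stopping rule."""
    best_idx = 0
    best_loss = float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember where it happened
            best_loss = loss
            best_idx = i
        elif i - best_idx >= patience:
            # No improvement for `patience` iterations: stop here
            return best_idx, i
    # Ran out of iterations without triggering the rule
    return best_idx, len(val_losses) - 1

# Made-up validation losses: improvement stalls after index 2
losses = [10.0, 8.0, 7.0, 7.2, 7.1, 7.3, 6.9]
print(best_iteration_with_patience(losses, patience=3))
```

With `patience=3` the loop stops at index 5 and never reaches the lower loss at index 6, which shows the trade-off: a small patience saves time but can stop before a late improvement.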

In [None]:
pip install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
ERROR: Operation cancelled by user

In [None]:
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split

# X and y are the selected features and the target defined in the earlier cells

# split into train/validation set
X_train, X_val, y_train, y_val = train_test_split(X, y)

# initialize the model
catboost = CatBoostRegressor(iterations=5000, learning_rate=0.05, depth=8, loss_function='RMSE')

# train the model
catboost.fit(X_train, y_train, eval_set=(X_val, y_val),early_stopping_rounds=40, verbose=100)

# make predictions on test set
y_pred = catboost.predict(test)


0:	learn: 62115.4427257	test: 62640.8944231	best: 62640.8944231 (0)	total: 130ms	remaining: 10m 49s
100:	learn: 4398.6585001	test: 4453.7496375	best: 4453.7496375 (100)	total: 4.25s	remaining: 3m 26s
200:	learn: 2008.3297552	test: 2029.7241460	best: 2029.7241460 (200)	total: 5.42s	remaining: 2m 9s
300:	learn: 1499.1472541	test: 1537.9497435	best: 1537.9497435 (300)	total: 6.17s	remaining: 1m 36s
400:	learn: 1268.0979513	test: 1322.3910294	best: 1322.3910294 (400)	total: 6.93s	remaining: 1m 19s
500:	learn: 1143.1406610	test: 1207.5315661	best: 1207.5315661 (500)	total: 7.71s	remaining: 1m 9s
600:	learn: 1064.9351279	test: 1137.0791399	best: 1137.0791399 (600)	total: 8.47s	remaining: 1m 2s
700:	learn: 1008.5923626	test: 1089.5484189	best: 1089.5484189 (700)	total: 9.24s	remaining: 56.7s
800:	learn: 963.5586070	test: 1052.6036916	best: 1052.6036916 (800)	total: 10s	remaining: 52.5s
900:	learn: 929.7619025	test: 1024.9314913	best: 1024.9314913 (900)	total: 10.8s	remaining: 49s
1100:	learn:
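
Because the model is trained with `loss_function='RMSE'`, the `learn` and `test` values in the log are root mean squared errors. The notebook imports `mean_squared_error` and `r2_score` but does not call them; the sketch below shows how they could be used on validation predictions. The arrays are made-up numbers, not results from this model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions for three validation rows
y_true_demo = np.array([100.0, 200.0, 300.0])
y_pred_demo = np.array([110.0, 190.0, 310.0])

# RMSE is the square root of the mean squared error, in the target's own units
rmse = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))
# R^2 is the share of the target's variance explained by the predictions
r2 = r2_score(y_true_demo, y_pred_demo)

print(rmse, r2)
```

In the notebook, the same two calls could be applied to `y_val` and `catboost.predict(X_val)` to summarize validation performance.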

In [None]:
len(test[0])

31

# Submitting_Code

In [None]:
test= pd.read_csv("/content/AI_HACK/Test.csv")
building_IDs=test["building"]
building_Town=test["Town"]



In [None]:
submission_id=[]
for i,j in enumerate(building_IDs):
  sub=j+'_'+'Town'+'_'+str(building_Town[i])
  submission_id.append(sub)
test.insert(0,'submission_id',submission_id)
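
The same IDs can also be built without an explicit loop by concatenating the columns as strings. This is a minimal sketch on a made-up two-row frame that only assumes the `building` and `Town` column layout used above.

```python
import pandas as pd

# Hypothetical rows following the layout of the 'building' and 'Town' columns
demo = pd.DataFrame({"building": ["Building_1", "Building_100"], "Town": [1, 1]})

# Element-wise string concatenation; Town is cast to text first
demo_ids = demo["building"] + "_Town_" + demo["Town"].astype(str)

print(demo_ids.tolist())
```

The result has the same `Building_<id>_Town_<n>` shape as the IDs produced by the loop.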

In [None]:
sub=pd.DataFrame({'submission id':submission_id,'Operational Energy':y_pred})

In [None]:
sub

Unnamed: 0,submission id,Operational Energy
0,Building_1_Town_1,73485.732566
1,Building_100_Town_1,106804.940480
2,Building_1000_Town_2,89399.901946
3,Building_10000_Town_0,344392.384781
4,Building_10005_Town_2,239689.448349
...,...,...
23395,Building_9989_Town_1,137547.806604
23396,Building_999_Town_2,114398.124684
23397,Building_9996_Town_0,229606.512112
23398,Building_9997_Town_1,58980.409742


In [None]:
sub.to_csv(r'submit09.csv',index=False)