## Algorithm Selection & Justification

### 1. Random Forest Regressor
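- Suits rental price prediction because it works with a mix of numerical and categorical features (in scikit-learn, categorical columns must first be encoded as numbers).
- Reduces overfitting by averaging the predictions of many decision trees.
- Trains quickly and needs no feature scaling, since tree splits depend only on the ordering of values.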

### 2. Support Vector Machine (SVM) Regressor
- Useful for capturing non-linear relationships between features.
- Works well with smaller structured datasets.
- Requires feature scaling because kernel methods are distance-based; with a non-linear kernel such as RBF it can model interactions that linear models miss.

### Comparison:
| Feature | Random Forest | SVM |
|---------|--------------|-----|
| Handles categorical data | ✅ Yes | ❌ No |
| Handles missing values | ✅ Yes | ❌ No |
| Overfitting resistance | ✅ Yes | ✅ Yes |
| Computational efficiency | ⚡ Fast | 🐌 Slower |
| Non-linear modeling | ⚡ Good | ✅ Excellent |

These algorithms will be tested, and their performance compared to determine the best model for predicting rental prices.
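
The scaling difference between the two algorithms can be sketched as follows. This is a minimal illustration on synthetic data, not the rental dataset: the feature names, the synthetic target, and the `*_demo` variables are assumptions made for the example. Wrapping `SVR` in a pipeline with `StandardScaler` keeps the scaling tied to the model, while the forest is fit on raw values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data (assumption): two features on very different scales.
rng = np.random.default_rng(0)
area = rng.uniform(20, 3000, size=300)   # large-scale feature
rating = rng.uniform(5, 10, size=300)    # small-scale feature
X_demo = np.column_stack([area, rating])
y_demo = np.log1p(0.2 * area + 50 * rating + rng.normal(0, 5, size=300))

# Tree ensembles split on thresholds, so feature scale does not matter.
rf_demo = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# SVR with an RBF kernel is distance-based, so scaling happens inside the pipeline.
svr_demo = make_pipeline(StandardScaler(), SVR(kernel="rbf")).fit(X_demo, y_demo)

rf_pred = rf_demo.predict(X_demo[:5])
svr_pred = svr_demo.predict(X_demo[:5])
print(rf_pred.shape, svr_pred.shape)
```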


In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Prep the data

In [None]:
!pip install scikit-learn xgboost



## Import necessary libraries and load data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.svm import SVR
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# Load dataset
df = pd.read_csv("https://github.com/157nouraalhumaid/SW485-Project-Group4/raw/refs/heads/main/Dataset/real_estate_rental_prices.csv")  # Adjust path if needed

## Preprocess data

In [None]:
# Drop unnecessary columns
df = df.drop(columns=["Unnamed: 0", "الرقم"], errors="ignore")

# Convert 'سعر الليلة' (Price per Night) to numeric
df["سعر الليلة"] = df["سعر الليلة"].str.replace(r"[^\d]", "", regex=True).astype(float)

# Handle missing values
df = df.dropna()

# Remove outliers using Interquartile Range (IQR)
Q1 = df["سعر الليلة"].quantile(0.25)
Q3 = df["سعر الليلة"].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

df = df[(df["سعر الليلة"] >= lower_bound) & (df["سعر الليلة"] <= upper_bound)]

# Log transformation of price
df["log_سعر الليلة"] = np.log1p(df["سعر الليلة"])

# Calculate price per square meter (nightly price / area)
df["سعر للمتر"] = df["سعر الليلة"] / df["المساحة"]
df["سعر للمتر"] = df["سعر للمتر"].replace([np.inf, -np.inf], np.nan)  # Division by a zero area yields inf
df = df.dropna(subset=["سعر للمتر"])  # Drop rows where "سعر للمتر" is NaN

# Display the cleaned dataset
print("\n📌 Cleaned Dataset Sample:")
print(df.head())


📌 Cleaned Dataset Sample:
   التقييم  عدد المقيمين  المساحة                        اسم العقار  \
0     10.0             7       40          استديو بسرير ماستر وجلسة   
1      9.2             6     3000  استديو بسريرين فردية وبأثاث بسيط   
2     10.0            43     1000       شقة بغرفة معيشة وغرفتين نوم   
3      9.4             4      400    استراحة بصالة جلوس وغرفتين نوم   
4      9.6            29     3000          شقة بغرفة جلوس وغرفة نوم   

            الحي  سعر الليلة المدينة  التصنيف  log_سعر الليلة  سعر للمتر  
0    حي العزيزية       250.0   العلا   استديو        5.525453   6.250000  
1         العذيب       280.0   العلا   استديو        5.638355   0.093333  
2    حي العزيزية       400.0   العلا      شقة        5.993961   0.400000  
3     حي المعتدل       799.0   العلا  استراحة        6.684612   1.997500  
4  جنوب المستشفى       550.0   العلا      شقة        6.311735   0.183333  
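
To make the IQR filter and the log transform concrete, here is a small worked sketch on a toy price series. The helper `iqr_bounds` and the toy values are hypothetical and only mirror the steps used in the cell above; `np.expm1` is the inverse of `np.log1p`, which is how a log-scale prediction would be mapped back to a nightly price.

```python
import numpy as np
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier bounds using the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy prices (assumption): five typical values and one extreme value.
prices = pd.Series([250.0, 280.0, 400.0, 550.0, 799.0, 25000.0])

low, high = iqr_bounds(prices)
kept = prices[(prices >= low) & (prices <= high)]

# log1p compresses the range; expm1 restores the original prices.
log_prices = np.log1p(kept)
restored = np.expm1(log_prices)

print(f"bounds: [{low}, {high}]")
print(f"kept {len(kept)} of {len(prices)} rows")
```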


## Encode Categorical Variables

In [None]:
categorical_cols = ["اسم العقار", "الحي", "المدينة", "التصنيف"]
label_encoders = {}

for col in categorical_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le  # Store encoder for later use

# Display transformed dataset
print("\n📌 Encoded Dataset Sample:")
print(df.head())



📌 Encoded Dataset Sample:
   التقييم  عدد المقيمين  المساحة  اسم العقار  الحي  سعر الليلة  المدينة  \
0     10.0             7       40         189   567       250.0       10   
1      9.2             6     3000         224   277       280.0       10   
2     10.0            43     1000        2357   567       400.0       10   
3      9.4             4      400         497   657       799.0       10   
4      9.6            29     3000        2344   326       550.0       10   

   التصنيف  log_سعر الليلة  سعر للمتر  
0        0        5.525453   6.250000  
1        0        5.638355   0.093333  
2        4        5.993961   0.400000  
3        1        6.684612   1.997500  
4        4        6.311735   0.183333  
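
The stored encoders can later map integer codes back to the original labels. The sketch below shows that round trip on a toy column; the `demo` frame is an assumption for illustration. It also shows `pd.get_dummies` as one possible alternative for low-cardinality columns, since integer codes imply an ordering between categories that does not exist in the data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (assumption) reusing one of the notebook's column names.
demo = pd.DataFrame({"التصنيف": ["استديو", "شقة", "استراحة", "شقة"]})

encoder = LabelEncoder()
demo["code"] = encoder.fit_transform(demo["التصنيف"])

# inverse_transform recovers the original labels from the integer codes.
decoded = encoder.inverse_transform(demo["code"])

# One-hot encoding avoids implying an order between categories.
one_hot = pd.get_dummies(demo["التصنيف"])

print(demo)
print(list(decoded))
print(one_hot.shape)
```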


## Define features and normalize data

In [None]:
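# Define features (X) and target variable (y)
# Note: only the raw price is dropped here, so X still contains
# "log_سعر الليلة" (the target) and "سعر للمتر" (derived from the price).
X = df.drop(columns=["سعر الليلة"])
y = df["log_سعر الليلة"]  # Target: log-transformed price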

# Normalize numerical features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Display shape of datasets
print("Training Set:", X_train.shape, y_train.shape)
print("Testing Set:", X_test.shape, y_test.shape)


Training Set: (12552, 9) (12552,)
Testing Set: (3139, 9) (3139,)
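
The 9 feature columns reported above are the 10 dataframe columns minus `سعر الليلة`, so the log price and the price per meter are still among the features. A possible leakage-free variant is sketched below on a toy frame; the toy data and the names `leaky_cols`, `X_clean`, and `scaler_clean` are assumptions, not part of the original notebook. It drops every price-derived column from the features and fits the scaler on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy frame (assumption) with a subset of the notebook's column names.
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "التقييم": rng.uniform(5, 10, n),
    "المساحة": rng.integers(20, 3000, n),
    "سعر الليلة": rng.uniform(100, 1500, n),
})
toy["log_سعر الليلة"] = np.log1p(toy["سعر الليلة"])
toy["سعر للمتر"] = toy["سعر الليلة"] / toy["المساحة"]

# Every column computed from the price is excluded from the features.
leaky_cols = ["سعر الليلة", "log_سعر الليلة", "سعر للمتر"]
X_clean = toy.drop(columns=leaky_cols)
y_clean = toy["log_سعر الليلة"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_clean, y_clean, test_size=0.2, random_state=42
)

# Fit the scaler on training rows only, then apply it to both splits.
scaler_clean = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_clean.transform(X_tr)
X_te_scaled = scaler_clean.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```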


# Random Forest

In [None]:
# Train Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict on test set
y_pred_rf = rf_model.predict(X_test)

# Evaluate Random Forest model
mae_rf = mean_absolute_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)

# Display results
print("\n📌 Random Forest Model Performance:")
print(f"Mean Absolute Error (MAE): {mae_rf}")
print(f"Root Mean Squared Error (RMSE): {rmse_rf}")
print(f"R² Score: {r2_rf}")


📌 Random Forest Model Performance:
Mean Absolute Error (MAE): 0.0001303402649249048
Root Mean Squared Error (RMSE): 0.0011706240470063319
R² Score: 0.9999962112301307
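
Near-perfect scores like these are worth a sanity check. One possible diagnostic, sketched here on synthetic data, is to inspect `feature_importances_`: when a copy of the target sits among the features, it tends to absorb almost all of the importance. The feature names and data below are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_rows = 500

# Synthetic data (assumption): two uninformative features plus the target itself.
noise_a = rng.normal(size=n_rows)
noise_b = rng.normal(size=n_rows)
target = rng.normal(loc=6.0, scale=0.5, size=n_rows)  # stands in for a log price

feature_names = ["noise_a", "noise_b", "leaked_target"]
X_leaky = np.column_stack([noise_a, noise_b, target])

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_leaky, target)

# Map each feature name to its impurity-based importance.
importances = dict(zip(feature_names, forest.feature_importances_))
top_feature = max(importances, key=importances.get)

print(top_feature, round(importances[top_feature], 3))
```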


# SVM

In [None]:
# Display dataset info
print("📌 Dataset Info:")
df.info()


📌 Dataset Info:
<class 'pandas.core.frame.DataFrame'>
Index: 15691 entries, 0 to 16912
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   التقييم         15691 non-null  float64
 1   عدد المقيمين    15691 non-null  int64  
 2   المساحة         15691 non-null  int64  
 3   اسم العقار      15691 non-null  int64  
 4   الحي            15691 non-null  int64  
 5   سعر الليلة      15691 non-null  float64
 6   المدينة         15691 non-null  int64  
 7   التصنيف         15691 non-null  int64  
 8   log_سعر الليلة  15691 non-null  float64
 9   سعر للمتر       15691 non-null  float64
dtypes: float64(4), int64(6)
memory usage: 1.3 MB


In [None]:
# Train SVM model
svm_model = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1)
svm_model.fit(X_train, y_train)

# Predict on test set
y_pred_svm = svm_model.predict(X_test)

# Evaluate SVM model
mae_svm = mean_absolute_error(y_test, y_pred_svm)
mse_svm = mean_squared_error(y_test, y_pred_svm)
rmse_svm = np.sqrt(mse_svm)
r2_svm = r2_score(y_test, y_pred_svm)

# Display results
print("\n📌 SVM Model Performance:")
print(f"MAE: {mae_svm}")
print(f"RMSE: {rmse_svm}")
print(f"R² Score: {r2_svm}")



📌 SVM Model Performance:
MAE: 0.04162460187130798
RMSE: 0.0535536697411771
R² Score: 0.9920705727920054


# Performance Comparison & Results Interpretation

Since we're working on a regression problem (predicting rental prices), classification metrics such as accuracy, precision, recall, and F1-score are not applicable. Instead, we compare the models using regression metrics, all computed on the log-transformed price (`log_سعر الليلة`):

* Mean Absolute Error (MAE) → Lower is better.
* Root Mean Squared Error (RMSE) → Lower is better.
* R² Score (Coefficient of Determination) → Higher is better.
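
Because the metrics are computed on the log scale, it can help to report the error in price units as well. The helper below is a hypothetical sketch: `regression_report` is not part of the notebook, the true prices are taken from the sample rows shown earlier, and the prediction errors are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true_log, y_pred_log):
    """Compute log-scale metrics plus MAE in original price units."""
    y_true_price = np.expm1(y_true_log)  # inverse of np.log1p
    y_pred_price = np.expm1(y_pred_log)
    return {
        "mae_log": mean_absolute_error(y_true_log, y_pred_log),
        "rmse_log": np.sqrt(mean_squared_error(y_true_log, y_pred_log)),
        "r2_log": r2_score(y_true_log, y_pred_log),
        "mae_price": mean_absolute_error(y_true_price, y_pred_price),
    }

# True prices from the sample rows; the +/-0.04 log errors are assumptions.
y_true_log = np.log1p(np.array([250.0, 280.0, 400.0, 799.0, 550.0]))
y_pred_log = y_true_log + np.array([0.04, -0.04, 0.04, -0.04, 0.04])

report = regression_report(y_true_log, y_pred_log)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```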



### Comparison
| Metric | Random Forest | SVM |
|--------|---------------|-----|
| MAE | 0.00013 | 0.0416 |
| RMSE | 0.00117 | 0.0535 |
| R² Score | 0.999996 | 0.9921 |
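
The figures in this table come from a single 80/20 split. Cross-validation is one way to check that a ranking like this is stable across splits; the sketch below uses synthetic data and smaller models for speed, so its numbers say nothing about the rental dataset. The SVR hyperparameters are copied from the notebook, and everything else is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data (assumption).
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = 5 + X_demo[:, 0] + np.sin(3 * X_demo[:, 1]) + rng.normal(0, 0.05, size=200)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=30, random_state=42),
    "SVM": make_pipeline(
        StandardScaler(), SVR(kernel="rbf", C=100, gamma=0.1, epsilon=0.1)
    ),
}

# Collect the mean and standard deviation of the MAE over 5 folds.
cv_mae = {}
for name, model in models.items():
    scores = cross_val_score(
        model, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
    )
    cv_mae[name] = (-scores.mean(), scores.std())

for name, (mean_mae, std_mae) in cv_mae.items():
    print(f"{name}: MAE {mean_mae:.4f} (+/- {std_mae:.4f})")
```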


## Results Interpretation
### 1. Random Forest Model

- MAE (0.00013):

  The MAE is extremely low. Because the target is the log-transformed price, this corresponds to an average error of roughly 0.01% of the nightly price.

- RMSE (0.00117):

  The RMSE is also very low, so large individual errors are rare.

- R² Score (0.999996):

  The R² score is almost 1, meaning the model explains nearly all the variance in the target. A score this close to perfect should be read with caution: the feature matrix drops only `سعر الليلة`, so it still contains `log_سعر الليلة` (the target itself) and `سعر للمتر` (derived from the price), and the model can recover the target directly from them.


### 2. SVM Model

- MAE (0.0416):

  The MAE is low. On the log scale, 0.0416 corresponds to an average error of roughly 4% of the nightly price.

- RMSE (0.0535):

  The RMSE is only slightly above the MAE, which suggests that the errors are fairly uniform in size and large misses are uncommon.

- R² Score (0.9921):

  The R² score is very close to 1, indicating that the model explains about 99% of the variance in the target variable.

![](https://drive.google.com/uc?export=view&id=19oSX5vpPJ6ygVe4-_aXac92C5bGKVWtk)

## Conclusion

Both models score highly on the held-out test set. Random Forest is the best-performing model on every metric (MAE 0.00013 vs. 0.0416, RMSE 0.00117 vs. 0.0535, R² 0.999996 vs. 0.9921). The SVM errors are considerably larger in relative terms but still small in absolute terms, which makes SVM a strong contender.