## **Step 1: Loading the Dataset**

In [1]:
# First, we're importing pandas, it's the library we need to work with the dataset
import pandas as pd

# Now, we load the dataset from the CSV file using pandas
df = pd.read_csv(r"../Dataset/Cleaned_Averages.csv")

# We display the first few rows of the dataset to understand its structure and confirm that the data loaded correctly
df.head()


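| Unnamed: 0 | اسم_المدرسة | المنطقة_الإدارية | الإدارة_التعليمية | المكتب_التعليمي | السلطة | نوع_التعليم | الجنس | تخصص_الاختبار | متوسط_أداء_الطلبة_في_المدرسة | ترتيب_المدرسة_على_مستوى_المدارس |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28 | 11 | 45 | 294 | 3 | 6 | 0 | 0 | 1.0 | 1 |
| 1 | 2496 | 3 | 38 | 71 | 0 | 6 | 0 | 1 | 0.87271 | 1 |
| 2 | 3050 | 11 | 34 | 218 | 3 | 6 | 0 | 0 | 0.991601 | 2 |
| 3 | 1276 | 3 | 38 | 71 | 3 | 0 | 0 | 1 | 0.85085 | 2 |
| 4 | 3938 | 4 | 31 | 245 | 0 | 6 | 0 | 0 | 0.982336 | 3 |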


**Loading the Dataset:** In this step, we import the pandas library and use **pd.read_csv()** to load the CSV file into a pandas DataFrame. We then call **df.head()** to display the first five rows and confirm that the data loaded correctly. The output shows that the categorical columns appear to hold integer codes already, and that an **Unnamed: 0** column is present, which typically means a row index was saved into the CSV file.
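The `Unnamed: 0` column in the output above usually appears when a DataFrame index was written to the CSV file. The following is a minimal sketch, assuming that is what happened here, of how to read that column as the index and run a few sanity checks. It uses a small in-memory CSV with hypothetical English column names (`school`, `region`, `avg_score`, `rank`) rather than the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: the first, unnamed column is a saved index.
csv_text = """,school,region,avg_score,rank
0,28,11,1.0,1
1,2496,3,0.87271,1
2,3050,11,0.991601,2
"""

# index_col=0 reads the first column as the index instead of creating "Unnamed: 0".
df_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Quick sanity checks: shape, column types, and missing values per column.
print(df_demo.shape)
print(df_demo.dtypes)
print(df_demo.isna().sum())
```

In this notebook the extra column is harmless because it is never selected as a feature, but reading it as the index keeps the DataFrame tidy.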

## **Step 2: Splitting the Data into Training and Testing Sets**

In [2]:
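# We select the input features (X): administrative region, education office, authority,
# education administration, education type, gender, and average student performance in the school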
X = df[['المنطقة_الإدارية', 'المكتب_التعليمي', 'السلطة', 'الإدارة_التعليمية', 'نوع_التعليم', 'الجنس', 'متوسط_أداء_الطلبة_في_المدرسة']]

# The target (y) is what we want to predict: the school's rank among all schools
y = df['ترتيب_المدرسة_على_مستوى_المدارس']  

# Now, we split the data into training and testing sets (80% for training, 20% for testing)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Checking the size of the sets
print(f"Training set size: {X_train.shape}, Test set size: {X_test.shape}")



Training set size: (5364, 7), Test set size: (1342, 7)


**Splitting the Data:** After selecting the features and the target variable, we divide our data into training and test sets using **train_test_split()** from sklearn. The training set will consist of 80% of the data, and the remaining 20% will be used for testing the model. This is an important step because we want to train the model on one set of data and evaluate it on another to check how well it performs on unseen data.
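To make the split sizes concrete, here is a small sketch on synthetic data with the same number of rows as the notebook output implies (5364 + 1342 = 6706). The arrays `X_demo` and `y_demo` are random placeholders, not the school dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same row count as the notebook's data (5364 + 1342 = 6706).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(6706, 7))
y_demo = rng.normal(size=6706)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The test set gets ceil(0.2 * 6706) = 1342 rows and the remaining 5364 go to training.
print(X_tr.shape, X_te.shape)

# Using the same random_state reproduces exactly the same split.
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
print(np.array_equal(X_te, X_te2))
```

Fixing `random_state` matters here because all models in the later steps are compared on the same test rows.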

## **Step 3: Random Forest Algorithm**

In [3]:
# First, we import RandomForestRegressor and some metrics we need for evaluation
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Now, we create the Random Forest model
model_rf = RandomForestRegressor(n_estimators=100, random_state=42)

# Training the model using the training data
model_rf.fit(X_train, y_train)

# Making predictions with the test data
y_pred_rf = model_rf.predict(X_test)

# Calculating RMSE (Root Mean Squared Error) to evaluate the model
rmse_rf = mean_squared_error(y_test, y_pred_rf)**0.5
# Calculating MAE (Mean Absolute Error) to evaluate the model
mae_rf = mean_absolute_error(y_test, y_pred_rf)
# Calculating R² (R-squared) to evaluate the model
r2_rf = r2_score(y_test, y_pred_rf)

# Displaying the results
print(f"RMSE (Random Forest): {rmse_rf}")
print(f"MAE (Random Forest): {mae_rf}")
print(f"R² (Random Forest): {r2_rf}")


RMSE (Random Forest): 276.6023270118728
MAE (Random Forest): 174.3432041728763
R² (Random Forest): 0.8053589667131565


**Random Forest Algorithm:** In this step, we use the Random Forest algorithm to build our model. We set **n_estimators=100** to use 100 trees in the forest and **random_state=42** for reproducibility. After training the model, we use it to predict the school rankings on the test set.

**We calculate several evaluation metrics:**

- **RMSE (Root Mean Squared Error):** The square root of the average squared error. Because errors are squared before averaging, large mistakes weigh more heavily. A lower RMSE indicates better performance.

- **MAE (Mean Absolute Error):** The average absolute difference between predicted and actual rankings, expressed in the same units as the target (rank positions).

- **R² (R-squared):** The share of the variance in the target variable that the model explains. A value closer to 1 means the model fits the data better.
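As a worked example of the three metrics above, the sketch below computes them by hand with NumPy on four made-up values and checks the results against the scikit-learn functions used in this notebook. The arrays `y_true` and `y_hat` are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 39.0])

errors = y_true - y_hat  # [-2, 2, -3, 1]

# RMSE: square the errors, average them, then take the square root.
rmse_manual = np.sqrt(np.mean(errors ** 2))  # sqrt(18 / 4), about 2.12

# MAE: average of the absolute errors.
mae_manual = np.mean(np.abs(errors))  # (2 + 2 + 3 + 1) / 4 = 2.0

# R²: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(errors ** 2)                    # 18
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 500
r2_manual = 1 - ss_res / ss_tot                 # 0.964

# The manual values agree with scikit-learn's implementations.
print(np.isclose(rmse_manual, mean_squared_error(y_true, y_hat) ** 0.5))
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

RMSE (about 2.12) is larger than MAE (2.0) here because the single error of 3 is weighted more heavily once squared.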

## **Step 4: XGBoost Algorithm**

In [4]:
# We import the XGBoost model and the metrics for evaluation
import xgboost as xgb
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Creating the XGBoost model
xgb_model = xgb.XGBRegressor(random_state=42)

# Training the model using the training data
xgb_model.fit(X_train, y_train)

# Making predictions using the test data
y_pred_xgb = xgb_model.predict(X_test)

# Calculating RMSE (Root Mean Squared Error) for XGBoost
rmse_xgb = mean_squared_error(y_test, y_pred_xgb)**0.5
# Calculating MAE (Mean Absolute Error) for XGBoost
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
# Calculating R² (R-squared) for XGBoost
r2_xgb = r2_score(y_test, y_pred_xgb)

# Displaying the results
print(f"RMSE (XGBoost): {rmse_xgb}")
print(f"MAE (XGBoost): {mae_xgb}")
print(f"R² (XGBoost): {r2_xgb}")


RMSE (XGBoost): 288.62883009585164
MAE (XGBoost): 180.7254638671875
R² (XGBoost): 0.7880652546882629


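- **XGBoost Algorithm:** In this step, we use XGBoost, a gradient boosting library that builds trees sequentially so that each new tree corrects the errors of the previous ones. We train the model with its default hyperparameters on the training data and make predictions on the test data.
- Again, we calculate RMSE, MAE, and R² to measure performance. With default settings, XGBoost scores slightly worse than Random Forest on all three metrics (for example, RMSE 288.63 versus 276.60).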
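Steps 3 and 4 repeat the same fit, predict, and score sequence. One possible way to avoid the duplication is a small helper, sketched below. The function name `evaluate_regressor` is hypothetical, the data is synthetic, and scikit-learn's `GradientBoostingRegressor` stands in for XGBoost so that the sketch runs without the `xgboost` package; `xgb.XGBRegressor` follows the same `fit`/`predict` interface.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_regressor(model, X_train, X_test, y_train, y_test):
    """Fit the model, predict on the test set, and return RMSE, MAE, and R²."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "rmse": mean_squared_error(y_test, y_pred) ** 0.5,
        "mae": mean_absolute_error(y_test, y_pred),
        "r2": r2_score(y_test, y_pred),
    }


# Synthetic regression data with 7 features, standing in for the school dataset.
X_syn, y_syn = make_regression(n_samples=300, n_features=7, noise=10.0, random_state=42)

# train_test_split returns X_train, X_test, y_train, y_test, matching the helper's argument order.
splits = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

results = {
    "Random Forest": evaluate_regressor(
        RandomForestRegressor(n_estimators=50, random_state=42), *splits
    ),
    "Gradient Boosting (stand-in)": evaluate_regressor(
        GradientBoostingRegressor(random_state=42), *splits
    ),
}

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Scoring every model through the same function guarantees that all of them are evaluated on identical test rows with identical metric definitions.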


## **Step 5: Optimizing the Models**

### **Random Forest Optimization**

In [5]:
# Importing necessary libraries for tuning
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Creating the Random Forest model
rf_model = RandomForestRegressor(random_state=42)

# Here we define the hyperparameters we want to test
param_grid_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]
}

# Performing grid search with cross-validation
grid_search_rf = GridSearchCV(estimator=rf_model, param_grid=param_grid_rf, cv=3, scoring='neg_mean_squared_error')
grid_search_rf.fit(X_train, y_train)

# Getting the best parameters for Random Forest
best_rf_params = grid_search_rf.best_params_
print(f"Best parameters for Random Forest: {best_rf_params}")

# Retraining Random Forest with the best parameters
optimized_rf_model = RandomForestRegressor(n_estimators=best_rf_params['n_estimators'],
                                           max_depth=best_rf_params['max_depth'],
                                           min_samples_split=best_rf_params['min_samples_split'],
                                           random_state=42)
optimized_rf_model.fit(X_train, y_train)

# Making predictions with the optimized model
y_pred_rf_optimized = optimized_rf_model.predict(X_test)

# Calculating RMSE, MAE, and R² for the optimized model
rmse_rf_optimized = mean_squared_error(y_test, y_pred_rf_optimized) ** 0.5
mae_rf_optimized = mean_absolute_error(y_test, y_pred_rf_optimized)
r2_rf_optimized = r2_score(y_test, y_pred_rf_optimized)

print(f"Optimized RMSE (Random Forest): {rmse_rf_optimized}")
print(f"Optimized MAE (Random Forest): {mae_rf_optimized}")
print(f"Optimized R² (Random Forest): {r2_rf_optimized}")


Best parameters for Random Forest: {'max_depth': 10, 'min_samples_split': 5, 'n_estimators': 300}
Optimized RMSE (Random Forest): 257.9967886088908
Optimized MAE (Random Forest): 173.9562655541574
Optimized R² (Random Forest): 0.8306632022025562


- **Hyperparameter Grid:** We define a range of hyperparameters (**n_estimators**, **max_depth**, **min_samples_split**) to optimize the model. This is to find the best combination of parameters that improves performance.
- **Grid Search:** We use **GridSearchCV** to automatically search through the specified hyperparameters and evaluate the model's performance using cross-validation. It helps find the best parameters.
- **Re-training:** After obtaining the best parameters, we re-train the Random Forest model on the full training set with these settings, so the final model uses the configuration that scored best in cross-validation.
- **Evaluation:** After training the optimized model, we calculate RMSE, MAE, and R² on the test set to see how much it improved. RMSE fell from 276.60 to 258.00 and R² rose from 0.805 to 0.831, while MAE stayed almost unchanged (174.34 versus 173.96).
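Two details of `GridSearchCV` are easy to miss: with `scoring='neg_mean_squared_error'` the reported `best_score_` is a negated MSE, and with the default `refit=True` the search already retrains the best configuration on all the data passed to `fit()`. The sketch below illustrates both on synthetic data with a deliberately small grid; the grid values are chosen for speed and are not the ones used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the training set.
X_syn, y_syn = make_regression(n_samples=200, n_features=7, noise=10.0, random_state=42)

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [20, 40], "max_depth": [3, 6]},  # small grid, for speed only
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_syn, y_syn)

# best_score_ is the mean negated MSE across folds (higher is better),
# so flip the sign before taking the square root to get an RMSE-scale number.
cv_rmse = (-search.best_score_) ** 0.5
print(search.best_params_, cv_rmse)

# With refit=True (the default), best_estimator_ is already fitted with the best parameters.
best_model = search.best_estimator_
print(best_model.predict(X_syn[:3]))
```

Under these defaults, `grid_search_rf.best_estimator_` could be used directly for prediction; rebuilding the model from `best_params_`, as the notebook does, is expected to give an equivalent model when the same `random_state` is used.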

### **XGBoost Optimization**

In [6]:
# Importing necessary libraries for tuning
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Creating the XGBoost model
xgb_model = xgb.XGBRegressor(random_state=42)

# Setting up the hyperparameter grid for XGBoost
param_grid_xgb = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0]
}

# Performing grid search with cross-validation
grid_search_xgb = GridSearchCV(estimator=xgb_model, param_grid=param_grid_xgb, cv=3, scoring='neg_mean_squared_error')
grid_search_xgb.fit(X_train, y_train)

# Getting the best parameters for XGBoost
best_xgb_params = grid_search_xgb.best_params_
print(f"Best parameters for XGBoost: {best_xgb_params}")

# Retraining XGBoost with the best parameters
optimized_xgb_model = xgb.XGBRegressor(n_estimators=best_xgb_params['n_estimators'],
                                       learning_rate=best_xgb_params['learning_rate'],
                                       max_depth=best_xgb_params['max_depth'],
                                       subsample=best_xgb_params['subsample'],
                                       random_state=42)
optimized_xgb_model.fit(X_train, y_train)

# Making predictions with the optimized model
y_pred_xgb_optimized = optimized_xgb_model.predict(X_test)

# Calculating RMSE, MAE, and R² for the optimized model
rmse_xgb_optimized = mean_squared_error(y_test, y_pred_xgb_optimized) ** 0.5
mae_xgb_optimized = mean_absolute_error(y_test, y_pred_xgb_optimized)
r2_xgb_optimized = r2_score(y_test, y_pred_xgb_optimized)

print(f"Optimized RMSE (XGBoost): {rmse_xgb_optimized}")
print(f"Optimized MAE (XGBoost): {mae_xgb_optimized}")
print(f"Optimized R² (XGBoost): {r2_xgb_optimized}")


Best parameters for XGBoost: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200, 'subsample': 1.0}
Optimized RMSE (XGBoost): 253.71276666488424
Optimized MAE (XGBoost): 176.1708221435547
Optimized R² (XGBoost): 0.8362401723861694


- **Hyperparameter Grid:** In this section, we specify the hyperparameters (**n_estimators**, **learning_rate**, **max_depth**, **subsample**) that we will use to tune the XGBoost model.
- **Grid Search:** Just like with Random Forest, we use **GridSearchCV** to tune the hyperparameters. Scoring each combination on held-out folds gives an estimate of how well it generalizes, which reduces the risk of picking settings that overfit the training data.
- **Re-training:** After identifying the best parameters, we train the model with them on the full training set before evaluating it on the test set.
- **Evaluation:** RMSE, MAE, and R² are used again here to measure the model’s performance. These metrics give us insights into the model’s accuracy and goodness of fit.
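To illustrate why cross-validation gives a more honest estimate than the training error, the sketch below compares the two for a boosted model on synthetic data. As an assumption made for portability, scikit-learn's `GradientBoostingRegressor` stands in for `xgb.XGBRegressor`, using the hyperparameter values that the grid search above selected.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# Noisy synthetic regression data standing in for the training set.
X_syn, y_syn = make_regression(n_samples=300, n_features=7, noise=20.0, random_state=42)

# Stand-in for the tuned XGBoost model (same n_estimators, learning_rate, max_depth).
model = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)

# RMSE measured on held-out folds: each fold is scored by a model that never saw it.
cv_scores = cross_val_score(model, X_syn, y_syn, cv=3, scoring="neg_root_mean_squared_error")
cv_rmse = -cv_scores.mean()

# RMSE measured on the same data the model was trained on.
model.fit(X_syn, y_syn)
train_rmse = mean_squared_error(y_syn, model.predict(X_syn)) ** 0.5

print(f"Training RMSE: {train_rmse:.2f}")
print(f"Cross-validated RMSE: {cv_rmse:.2f}")
```

The training error is optimistic because the model has already seen those rows, which is why the grid search ranks hyperparameters by the cross-validated score instead.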

## **Conclusion**
### **Why These Algorithms?**  
Since our problem is a regression task, we chose Random Forest and XGBoost for their strong performance in numerical predictions.

- Random Forest is a robust ensemble method that reduces overfitting and handles complex relationships in data by averaging multiple decision trees.
- XGBoost is an optimized gradient boosting algorithm that improves accuracy by learning from previous mistakes, making it highly effective for structured data.

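Both models work efficiently with tabular datasets and include mechanisms that limit overfitting, making them well suited to predicting school rankings. XGBoost also handles missing values natively, while missing-value support in scikit-learn's Random Forest depends on the library version.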
 




---

### **Performance Comparison**  
To evaluate how well our models predict school rankings, we used the following metrics:  

- **RMSE (Root Mean Squared Error)** – Measures how far the predictions are from actual rankings. A lower RMSE means fewer large mistakes. We chose RMSE because it penalizes big errors more, which is important when ranking schools.  

- **MAE (Mean Absolute Error)** – Tells us the average size of the errors in ranking predictions. Unlike RMSE, it treats all errors equally, making it useful for understanding overall accuracy.  

- **R² (R-Squared)** – Measures how well the model explains the variation in school rankings. A higher R² means the model is learning meaningful patterns instead of guessing.  

These metrics let us quantify model performance and compare both models on the same test set.  

---

### **Results**  

| **Model**  | **RMSE (Lower is Better)** | **MAE (Lower is Better)** | **R² (Higher is Better)** |
|------------|--------------------------|--------------------------|--------------------------|
| **Random Forest** | 258.00 | **173.96** | 0.831 |
| **XGBoost** | **253.71** | 176.17 | **0.836** |

---

### **Final Decision: Best Model for Our Data**  

Based on our evaluation, the tuned **XGBoost** model performed best overall:  

- **Lowest RMSE:** 253.71 versus 258.00 for Random Forest, so it makes fewer large errors in its ranking predictions.  
- **Highest R²:** 0.836 versus 0.831, so it explains slightly more of the variance in the rankings.  
- **MAE caveat:** Random Forest has a marginally lower MAE (173.96 versus 176.17), so the two models are close and XGBoost's advantage is small.  
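To put the differences in perspective, the short calculation below expresses them as relative RMSE reductions. The numbers are copied from the cell outputs in Steps 3 to 5; the helper name `relative_reduction` is only illustrative.

```python
# Test-set RMSE values copied from the notebook outputs above.
rmse = {
    "Random Forest": {"default": 276.6023270118728, "tuned": 257.9967886088908},
    "XGBoost": {"default": 288.62883009585164, "tuned": 253.71276666488424},
}


def relative_reduction(before, after):
    """Return the percentage by which `after` is lower than `before`."""
    return 100 * (before - after) / before


# How much did tuning help each model?
for name, scores in rmse.items():
    gain = relative_reduction(scores["default"], scores["tuned"])
    print(f"{name}: tuning reduced RMSE by {gain:.2f}%")

# How far apart are the two tuned models?
gap = relative_reduction(rmse["Random Forest"]["tuned"], rmse["XGBoost"]["tuned"])
print(f"Tuned XGBoost RMSE is {gap:.2f}% lower than tuned Random Forest")
```

Tuning helped XGBoost considerably more than Random Forest (roughly 12% versus 7% lower RMSE), while the gap between the two tuned models is under 2%.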

