## **Step 1: Loading the Dataset**

In [1]:
# First, we're importing pandas, it's the library we need to work with the dataset
import pandas as pd

# Now, we load the dataset from the CSV file using pandas
df = pd.read_csv(r"../Dataset/Cleaned_Averages.csv")

# We display the first few rows of the dataset to understand its structure and confirm that the data loaded correctly
df.head()


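| Unnamed: 0 | اسم_المدرسة | المنطقة_الإدارية | الإدارة_التعليمية | المكتب_التعليمي | السلطة | نوع_التعليم | الجنس | تخصص_الاختبار | متوسط_أداء_الطلبة_في_المدرسة | ترتيب_المدرسة_على_مستوى_المدارس |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 28 | 11 | 45 | 294 | 3 | 6 | 0 | 0 | 1.0 | 1 |
| 1 | 2496 | 3 | 38 | 71 | 0 | 6 | 0 | 1 | 0.87271 | 1 |
| 2 | 3050 | 11 | 34 | 218 | 3 | 6 | 0 | 0 | 0.991601 | 2 |
| 3 | 1276 | 3 | 38 | 71 | 3 | 0 | 0 | 1 | 0.85085 | 2 |
| 4 | 3938 | 4 | 31 | 245 | 0 | 6 | 0 | 0 | 0.982336 | 3 |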


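**Loading the Dataset:** In this step, we import pandas and use **pd.read_csv()** to load the CSV file into a DataFrame. Calling **df.head()** displays the first five rows so we can confirm that the file loaded correctly and inspect its structure. The categorical columns appear to be already encoded as integers. In English, the column names are, in order after **Unnamed: 0**: school name, administrative region, education directorate, education office, authority, education type, gender, test specialization, average student performance in the school, and school rank among all schools.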
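The **Unnamed: 0** column in the output usually appears when a CSV was saved together with its row index. The sketch below shows one way to avoid it with `index_col=0`. The tiny CSV and its column names (`region`, `score`, `rank`) are hypothetical stand-ins, not the real dataset.

```python
import io
import pandas as pd

# Hypothetical two-row sample that mimics a CSV saved with its index column.
sample_csv = io.StringIO(
    ",region,score,rank\n"
    "0,11,1.0,1\n"
    "1,3,0.87271,1\n"
)

# index_col=0 tells pandas to use the first column as the row index,
# so it does not show up as a separate "Unnamed: 0" column.
sample_df = pd.read_csv(sample_csv, index_col=0)

print(sample_df.columns.tolist())
print(sample_df.shape)
```

Whether this applies to `Cleaned_Averages.csv` depends on how that file was written, so treat it as an option to check rather than a required fix.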

## **Step 2: Splitting the Data into Training and Testing Sets**

In [2]:
# We select the input features (X) 
X = df[['المنطقة_الإدارية', 'المكتب_التعليمي', 'السلطة', 'الإدارة_التعليمية', 'نوع_التعليم', 'الجنس', 'متوسط_أداء_الطلبة_في_المدرسة']]

# The target (y) is what we want to predict
y = df['ترتيب_المدرسة_على_مستوى_المدارس']  

# Now, we split the data into training and testing sets (80% for training, 20% for testing)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Checking the size of the sets
print(f"Training set size: {X_train.shape}, Test set size: {X_test.shape}")



Training set size: (5364, 7), Test set size: (1342, 7)


**Splitting the Data:** After selecting the features and the target variable, we divide the data into training and test sets using **train_test_split()** from sklearn. With **test_size=0.2**, 80% of the rows go to the training set and the remaining 20% are held out for testing, and **random_state=42** makes the split reproducible. This step matters because a model must be evaluated on data it did not see during training; otherwise the metrics would overstate how well it generalizes.
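As a quick check of the sizes printed above, the following sketch repeats the split on placeholder arrays. The row count of 6706 is inferred from the printed shapes (5364 + 1342); the arrays are dummies, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the notebook's data
# (6706 rows inferred from the printed output, 7 features).
X_demo = np.zeros((6706, 7))
y_demo = np.arange(6706)  # unique values so overlap can be checked

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)
# The two subsets share no rows.
print(len(set(y_tr) & set(y_te)))
```

The test set gets 1342 rows rather than 1341 because sklearn rounds the test size up.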

## **Step 3: Random Forest Algorithm**

In [3]:
# First, we import RandomForestRegressor and some metrics we need for evaluation
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Now, we create the Random Forest model
model_rf = RandomForestRegressor(n_estimators=100, random_state=42)

# Training the model using the training data
model_rf.fit(X_train, y_train)

# Making predictions with the test data
y_pred_rf = model_rf.predict(X_test)

# Calculating RMSE (Root Mean Squared Error) to evaluate the model
rmse_rf = mean_squared_error(y_test, y_pred_rf)**0.5
# Calculating MAE (Mean Absolute Error) to evaluate the model
mae_rf = mean_absolute_error(y_test, y_pred_rf)
# Calculating R² (R-squared) to evaluate the model
r2_rf = r2_score(y_test, y_pred_rf)

# Displaying the results
print(f"RMSE (Random Forest): {rmse_rf}")
print(f"MAE (Random Forest): {mae_rf}")
print(f"R² (Random Forest): {r2_rf}")


RMSE (Random Forest): 276.6023270118728
MAE (Random Forest): 174.3432041728763
R² (Random Forest): 0.8053589667131565


**Random Forest Algorithm:** In this step, we use the Random Forest algorithm to build our model. We set **n_estimators=100** to use 100 trees in the forest and **random_state=42** for reproducibility. After training the model, we use it to predict the school rankings on the test set.
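To illustrate what **random_state** buys us, here is a small sketch on synthetic data (the data and the helper `fit_predict` are made up for this example). Two forests built with the same seed should give identical predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, only for illustration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

def fit_predict(seed):
    """Train a small forest with the given seed and predict five rows."""
    model = RandomForestRegressor(n_estimators=20, random_state=seed)
    model.fit(X_demo, y_demo)
    return model.predict(X_demo[:5])

# Same seed, same data: the predictions match exactly.
same_seed_match = np.array_equal(fit_predict(42), fit_predict(42))
print(same_seed_match)
```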

**We calculate several evaluation metrics:**

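- **RMSE (Root Mean Squared Error):** The square root of the average squared error, expressed in the same units as the target (rank positions). Because errors are squared before averaging, large mistakes are penalized more heavily. A lower RMSE indicates better performance.

- **MAE (Mean Absolute Error):** The average absolute difference between the predicted and actual values. It is less sensitive to a few large errors than RMSE. A lower MAE indicates better performance.

- **R² (R-squared):** The proportion of the variance in the target variable that the model explains. A value of 1 means perfect predictions, and a value near 0 means the model does no better than always predicting the mean.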
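The three metrics can also be written out directly from their definitions. The sketch below is a plain-Python illustration with made-up numbers; the function name `regression_metrics` is ours, and the notebook itself uses the sklearn functions.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) computed from their textbook definitions."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

# Made-up values: three small errors and one large one.
y_true_demo = [10, 20, 30, 40]
y_pred_demo = [12, 18, 30, 48]
rmse_demo, mae_demo, r2_demo = regression_metrics(y_true_demo, y_pred_demo)
print(rmse_demo, mae_demo, r2_demo)
```

Here the MAE is 3.0 while the RMSE is about 4.24, which shows how a single large error pulls RMSE up more than MAE.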

## **Step 4: SVM Algorithm**

In [4]:
# We import the SVM model and the necessary metrics for evaluation
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Now, let's create the SVM model
model_svr = SVR()

# Training the model with the training data
model_svr.fit(X_train, y_train)

# Making predictions using the test data
y_pred_svr = model_svr.predict(X_test)

# Calculating RMSE (Root Mean Squared Error) for SVM
rmse_svr = mean_squared_error(y_test, y_pred_svr)**0.5
# Calculating MAE (Mean Absolute Error) for SVM
mae_svr = mean_absolute_error(y_test, y_pred_svr)
# Calculating R² (R-squared) for SVM
r2_svr = r2_score(y_test, y_pred_svr)

# Displaying the results
print(f"RMSE (SVM): {rmse_svr}")
print(f"MAE (SVM): {mae_svr}")
print(f"R² (SVM): {r2_svr}")


RMSE (SVM): 621.4546953787732
MAE (SVM): 530.3119733589066
R² (SVM): 0.017479343136376113


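- **SVM Algorithm:** In this step, we implement a Support Vector Machine for regression using **SVR** from sklearn with its default settings. SVR fits a function that tries to keep the training points within a margin of width epsilon around its predictions, and a kernel (RBF by default) lets it model non-linear relationships. After training, we predict the rankings of schools in the test data.
- We calculate the same evaluation metrics as before: RMSE, MAE, and R². With an R² of about 0.017, the untuned SVR is barely better than always predicting the mean rank.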
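The margin idea behind SVR can be made concrete with its epsilon-insensitive loss: errors smaller than epsilon cost nothing, and larger errors are penalized linearly. The sketch below is a hand-written illustration of that loss (the function name is ours), not a call into sklearn.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero for errors up to epsilon, then growing linearly with the excess."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

# An error of 0.05 falls inside the margin and is ignored.
inside_margin = epsilon_insensitive_loss(100.0, 100.05)
# An error of 3.0 exceeds the margin by 2.9.
outside_margin = epsilon_insensitive_loss(100.0, 103.0)
print(inside_margin, outside_margin)
```

The default `epsilon=0.1` matches sklearn's `SVR` default, which is very narrow for a target measured in rank positions.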

## **Step 5: XGBoost Algorithm**

In [5]:
# We import the XGBoost model and the metrics for evaluation
import xgboost as xgb
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Creating the XGBoost model
xgb_model = xgb.XGBRegressor(random_state=42)

# Training the model using the training data
xgb_model.fit(X_train, y_train)

# Making predictions using the test data
y_pred_xgb = xgb_model.predict(X_test)

# Calculating RMSE (Root Mean Squared Error) for XGBoost
rmse_xgb = mean_squared_error(y_test, y_pred_xgb)**0.5
# Calculating MAE (Mean Absolute Error) for XGBoost
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
# Calculating R² (R-squared) for XGBoost
r2_xgb = r2_score(y_test, y_pred_xgb)

# Displaying the results
print(f"RMSE (XGBoost): {rmse_xgb}")
print(f"MAE (XGBoost): {mae_xgb}")
print(f"R² (XGBoost): {r2_xgb}")


RMSE (XGBoost): 288.62883009585164
MAE (XGBoost): 180.7254638671875
R² (XGBoost): 0.7880652546882629


- **XGBoost Algorithm:** In this step, we use XGBoost, a gradient-boosted decision tree library. It builds trees sequentially, with each new tree trained to correct the errors of the trees before it. We train the model with its default settings on the training data and make predictions on the test data.
- Again, we calculate RMSE, MAE, and R² to measure the performance.


## **Step 6: Optimizing the Models**

### **Random Forest Optimization**

In [6]:
# Importing necessary libraries for tuning
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Creating the Random Forest model
rf_model = RandomForestRegressor(random_state=42)

# Here we define the hyperparameters we want to test
param_grid_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]
}

# Performing grid search with cross-validation
grid_search_rf = GridSearchCV(estimator=rf_model, param_grid=param_grid_rf, cv=3, scoring='neg_mean_squared_error')
grid_search_rf.fit(X_train, y_train)

# Getting the best parameters for Random Forest
best_rf_params = grid_search_rf.best_params_
print(f"Best parameters for Random Forest: {best_rf_params}")

# Retraining Random Forest with the best parameters
optimized_rf_model = RandomForestRegressor(n_estimators=best_rf_params['n_estimators'],
                                           max_depth=best_rf_params['max_depth'],
                                           min_samples_split=best_rf_params['min_samples_split'],
                                           random_state=42)
optimized_rf_model.fit(X_train, y_train)

# Making predictions with the optimized model
y_pred_rf_optimized = optimized_rf_model.predict(X_test)

# Calculating RMSE, MAE, and R² for the optimized model
rmse_rf_optimized = mean_squared_error(y_test, y_pred_rf_optimized) ** 0.5
mae_rf_optimized = mean_absolute_error(y_test, y_pred_rf_optimized)
r2_rf_optimized = r2_score(y_test, y_pred_rf_optimized)

print(f"Optimized RMSE (Random Forest): {rmse_rf_optimized}")
print(f"Optimized MAE (Random Forest): {mae_rf_optimized}")
print(f"Optimized R² (Random Forest): {r2_rf_optimized}")


Best parameters for Random Forest: {'max_depth': 10, 'min_samples_split': 5, 'n_estimators': 300}
Optimized RMSE (Random Forest): 257.9967886088908
Optimized MAE (Random Forest): 173.9562655541574
Optimized R² (Random Forest): 0.8306632022025562


- **Hyperparameter Grid:** We define candidate values for three hyperparameters (**n_estimators**, **max_depth**, **min_samples_split**). The goal is to find the combination that gives the lowest cross-validated error.
- **Grid Search:** We use **GridSearchCV** to try every combination in the grid. Each combination is scored with 3-fold cross-validation on the training set, using negative mean squared error as the score.
- **Re-training:** After obtaining the best parameters, we re-train the Random Forest model with these settings on the full training set.
- **Evaluation:** We then calculate RMSE, MAE, and R² on the test set. Compared with the default model, RMSE drops from 276.60 to 258.00 and R² rises from 0.805 to 0.831, while MAE is almost unchanged (174.34 versus 173.96).
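The grid-search steps above can be sketched on a small synthetic problem. The data and the reduced grid below are placeholders chosen to run quickly. The sketch also shows that, with the default `refit=True`, `best_estimator_` is already re-trained on the full training data, so a separate re-training step is optional.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data, only for illustration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=150)

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [10, 20], "max_depth": [3, 5]},
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# best_score_ is the mean negative MSE across folds; flip the sign and take
# the square root to read it on an RMSE-like scale.
cv_rmse = (-search.best_score_) ** 0.5

# With refit=True (the default), best_estimator_ is ready for prediction.
demo_predictions = search.best_estimator_.predict(X_demo[:3])
print(search.best_params_, cv_rmse, demo_predictions.shape)
```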

### **XGBoost Optimization**

In [7]:
# Importing necessary libraries for tuning
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Creating the XGBoost model
xgb_model = xgb.XGBRegressor(random_state=42)

# Setting up the hyperparameter grid for XGBoost
param_grid_xgb = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0]
}

# Performing grid search with cross-validation
grid_search_xgb = GridSearchCV(estimator=xgb_model, param_grid=param_grid_xgb, cv=3, scoring='neg_mean_squared_error')
grid_search_xgb.fit(X_train, y_train)

# Getting the best parameters for XGBoost
best_xgb_params = grid_search_xgb.best_params_
print(f"Best parameters for XGBoost: {best_xgb_params}")

# Retraining XGBoost with the best parameters
optimized_xgb_model = xgb.XGBRegressor(n_estimators=best_xgb_params['n_estimators'],
                                       learning_rate=best_xgb_params['learning_rate'],
                                       max_depth=best_xgb_params['max_depth'],
                                       subsample=best_xgb_params['subsample'],
                                       random_state=42)
optimized_xgb_model.fit(X_train, y_train)

# Making predictions with the optimized model
y_pred_xgb_optimized = optimized_xgb_model.predict(X_test)

# Calculating RMSE, MAE, and R² for the optimized model
rmse_xgb_optimized = mean_squared_error(y_test, y_pred_xgb_optimized) ** 0.5
mae_xgb_optimized = mean_absolute_error(y_test, y_pred_xgb_optimized)
r2_xgb_optimized = r2_score(y_test, y_pred_xgb_optimized)

print(f"Optimized RMSE (XGBoost): {rmse_xgb_optimized}")
print(f"Optimized MAE (XGBoost): {mae_xgb_optimized}")
print(f"Optimized R² (XGBoost): {r2_xgb_optimized}")


Best parameters for XGBoost: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200, 'subsample': 1.0}
Optimized RMSE (XGBoost): 253.71276666488424
Optimized MAE (XGBoost): 176.1708221435547
Optimized R² (XGBoost): 0.8362401723861694


- **Hyperparameter Grid:** In this section, we specify candidate values for the hyperparameters (**n_estimators**, **learning_rate**, **max_depth**, **subsample**) that we use to tune the XGBoost model.
- **Grid Search:** Just like with Random Forest, we use **GridSearchCV** to search the grid. Cross-validation scores each combination on several held-out folds, which gives a more reliable estimate of generalization than a single split and reduces the risk of picking parameters that overfit.
- **Re-training:** After identifying the best parameters, we train the model with them on the full training set.
- **Evaluation:** RMSE, MAE, and R² are computed on the test set again. Compared with the default XGBoost model, RMSE drops from 288.63 to 253.71, MAE from 180.73 to 176.17, and R² rises from 0.788 to 0.836.
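It helps to know how much work a grid implies before running it. The sketch below counts the combinations in the XGBoost grid above using only the standard library; the variable names are ours.

```python
from itertools import product

# Same candidate values as param_grid_xgb in the cell above.
param_grid_demo = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.01, 0.1, 0.2],
    "max_depth": [3, 5, 7],
    "subsample": [0.8, 1.0],
}
cv_folds = 3

# Every combination of one value per hyperparameter.
combinations = list(product(*param_grid_demo.values()))
n_candidates = len(combinations)

# Each candidate is trained once per fold during the search.
n_cv_fits = n_candidates * cv_folds
print(n_candidates, n_cv_fits)
```

That is 54 candidates and 162 model fits during the search, before the final re-training.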

### **SVM Optimization**

In [None]:
# Importing necessary libraries for tuning
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Creating the SVM model
svm_model = SVR()

# Setting up the hyperparameter grid for SVM
param_grid_svm = {
    'C': [1, 10, 100],
    'epsilon': [0.1, 0.2, 0.3],
    'kernel': ['linear', 'rbf']
}

# Performing grid search with cross-validation
grid_search_svm = GridSearchCV(estimator=svm_model, param_grid=param_grid_svm, cv=3, scoring='neg_mean_squared_error')
grid_search_svm.fit(X_train, y_train)

# Getting the best parameters for SVM
best_svm_params = grid_search_svm.best_params_
print(f"Best parameters for SVM: {best_svm_params}")

# Retraining SVM with the best parameters
optimized_svm_model = SVR(C=best_svm_params['C'], epsilon=best_svm_params['epsilon'], kernel=best_svm_params['kernel'])
optimized_svm_model.fit(X_train, y_train)

# Making predictions with the optimized model
y_pred_svm_optimized = optimized_svm_model.predict(X_test)

# Calculating RMSE, MAE, and R² for the optimized model
rmse_svm_optimized = mean_squared_error(y_test, y_pred_svm_optimized) ** 0.5
mae_svm_optimized = mean_absolute_error(y_test, y_pred_svm_optimized)
r2_svm_optimized = r2_score(y_test, y_pred_svm_optimized)

print(f"Optimized RMSE (SVM): {rmse_svm_optimized}")
print(f"Optimized MAE (SVM): {mae_svm_optimized}")
print(f"Optimized R² (SVM): {r2_svm_optimized}")


- **Hyperparameter Grid:** Here we define which hyperparameters of the SVM model to tune: **C** controls how strongly errors outside the margin are penalized, **epsilon** sets the width of the margin within which errors are ignored, and **kernel** determines the shape of the fitted function.
- **Grid Search:** As with the other models, we use **GridSearchCV** to evaluate every combination with 3-fold cross-validation and select the one with the lowest error.
- **Re-training:** After the best parameters are found, we re-train the SVM model with those parameters and make predictions on the test data.
- **Evaluation:** RMSE, MAE, and R² on the test data tell us how far off the predictions are and how well the model fits. This cell has no recorded output in the notebook (it is marked **In [None]**), so tuned SVM metrics are not available here.
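SVR is sensitive to feature scale, so one option worth trying is to standardize the features inside a pipeline and tune the SVR step through it. This is an assumption on our part, not something the notebook does, and the data below is synthetic. The sketch mainly shows the `<step>__<parameter>` naming that `GridSearchCV` expects for pipeline steps.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Two synthetic features on very different scales, only for illustration.
rng = np.random.default_rng(0)
X_demo = np.column_stack(
    [rng.normal(size=120), rng.normal(scale=1000.0, size=120)]
)
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=120)

# The scaler is fitted inside each cross-validation fold, so no information
# from the validation fold leaks into the scaling.
svr_pipeline = Pipeline([("scaler", StandardScaler()), ("svr", SVR())])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
svr_search = GridSearchCV(
    svr_pipeline,
    param_grid={"svr__C": [1, 10], "svr__kernel": ["linear", "rbf"]},
    cv=3,
    scoring="neg_mean_squared_error",
)
svr_search.fit(X_demo, y_demo)

svr_predictions = svr_search.predict(X_demo[:4])
print(svr_search.best_params_, svr_predictions.shape)
```

Whether scaling improves the SVM results on the school dataset would need to be verified by running it.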

## **Conclusion:**
### **Why These Algorithms?**

We chose Random Forest, XGBoost, and SVM because all three are widely used for regression tasks. Random Forest and XGBoost are ensemble methods: they combine many decision trees to improve predictive accuracy, which makes them well suited to datasets with complex interactions between features. Random Forest averages independently trained trees, while XGBoost builds trees sequentially. SVM, on the other hand, can be a solid option for regression when the dataset is smaller and the relationships between the features are non-linear.


### **Performance Comparison:**
To evaluate the effectiveness of these algorithms, we compared them based on the following metrics: RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and R² (Coefficient of Determination). Lower RMSE and MAE values indicate better predictive accuracy, while a higher R² suggests a better fit to the data. These metrics help us understand how well each model performs when predicting the school rankings.

### **Results:**

**Random Forest (optimized):**

- RMSE: 258.00
- MAE: 173.96
- R²: 0.831

**XGBoost (optimized):**

- RMSE: 253.71
- MAE: 176.17
- R²: 0.836

**SVM (baseline only; the optimization cell has no recorded output):**

- RMSE: 621.45
- MAE: 530.31
- R²: 0.017

### **Best Model:**
Based on the performance comparison, XGBoost emerges as the best model for predicting school rankings. Here's why:

- **Lowest RMSE:** XGBoost has the lowest RMSE (253.71), slightly ahead of Random Forest (258.00).
- **Highest R²:** XGBoost also has the highest R² (0.836), meaning it explains the largest share of the variance in the rankings.
- **MAE:** XGBoost's MAE (176.17) is slightly higher than Random Forest's (173.96), but both are far below the baseline SVM's MAE of 530.31. Tuned SVM results are not available for comparison.
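The comparison can be reproduced from the printed outputs. The sketch below copies the test-set metrics from the notebook output above (rounded to four decimals) into a dictionary and picks the best model per metric. The SVM row uses the untuned baseline because the tuned run has no output.

```python
# Test-set metrics copied from the notebook's printed output (rounded).
reported = {
    "Random Forest (tuned)": {"rmse": 257.9968, "mae": 173.9563, "r2": 0.8307},
    "XGBoost (tuned)": {"rmse": 253.7128, "mae": 176.1708, "r2": 0.8362},
    "SVM (baseline)": {"rmse": 621.4547, "mae": 530.3120, "r2": 0.0175},
}

# Lower is better for RMSE and MAE; higher is better for R².
best_by_rmse = min(reported, key=lambda name: reported[name]["rmse"])
best_by_mae = min(reported, key=lambda name: reported[name]["mae"])
best_by_r2 = max(reported, key=lambda name: reported[name]["r2"])
print(best_by_rmse, best_by_mae, best_by_r2)
```

XGBoost wins on RMSE and R², and Random Forest wins on MAE, which matches the reasoning in the list above.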
  
### **Summary:**
In summary, the optimized XGBoost model has the lowest RMSE (253.71) and the highest R² (0.836), while the optimized Random Forest has a marginally lower MAE. The gap between the two is small, but on these metrics XGBoost is the recommended model for this task.