# Capstone 2: Erasmus Program Mobility

## Modeling

## Table of Contents

* [Load the Datasets](#load_datasets)
    * [Load X_train (no scaling)](#load_x_no_scaling)
    * [Load X_train (scaled)](#load_x_scaled)
    * [Load X_test](#x_test)
    * [Load y_train](#y_train)
    * [Load y_test](#y_test)
* [Linear Regression](#models_hyperparameters)
    * [Define the Models and Hyperparameters](#models_hyperparameters)
    * [Tune the Hyperparameters](#tune_hyperparameters)
        * [Fit without Scaling](#tune_hyperparameters)
        * [Fit with Scaling](#tune_hyperparameters)
* [Random Forest Regressor](#random_forest)
    * [Define the Models and Hyperparameters](#models_hyperparameters)
    * [Tune the Hyperparameters](#tune_hyperparameters)
        * [Fit with Scaling](#tune_hyperparameters)
        * [Fit without Scaling](#tune_hyperparameters)
* [Evaluate the Models](#evaluate_models)
* [Compare and Select the Best Model](#compare_select)
* [Run the Analysis](#run_analysis)
* [Summary](#summary)


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import TimeSeriesSplit
from scipy.stats import randint
import scipy.sparse


## Load the Datasets <a id="load_datasets"></a>

### Load X_train (no scaling) <a id="load_x_no_scaling"></a>


In [None]:
# Load the sparse matrix (X_train, no scaling)
sparse_matrix_X_no_scaling = scipy.sparse.load_npz('C:/Users/midol/Documents/Springboard/Springboard/Capstone_2/Erasmus/X_train_sparse_no_scaling.npz')

# Load the columns/index
columns = pd.read_csv('C:/Users/midol/Documents/Springboard/Springboard/Capstone_2/Erasmus/X_train_sparse_no_scaling_columns.csv', header=None)
index = pd.read_csv('C:/Users/midol/Documents/Springboard/Springboard/Capstone_2/Erasmus/X_train_sparse_no_scaling_index.csv', header=None)

# Convert columns/index to lists
columns = columns.iloc[:, 0].tolist()
index = index.iloc[:, 0].tolist()

# Print lengths of columns and index
print("Number of columns:", len(columns))
print("Number of rows:", len(index))

# Convert the sparse matrix back to a sparse DataFrame
X_train_no_scaling = pd.DataFrame.sparse.from_spmatrix(sparse_matrix_X_no_scaling, index=index, columns=columns)


Number of columns: 589
Number of rows: 2885062


### Load X_train (scaled) <a id="load_x_scaled"></a>


In [None]:
# Load the sparse matrix (X_train_scaled)
sparse_matrix_X_train_scaled = scipy.sparse.load_npz('C:/Users/midol/Documents/Springboard/Springboard/Capstone_2/Erasmus/X_train_sparse_scaled.npz')

# Load the columns/index
columns = pd.read_csv('C:/Users/midol/Documents/Springboard/Springboard/Capstone_2/Erasmus/X_train_sparse_scaled_columns.csv', header=None)
index = pd.read_csv('C:/Users/midol/Documents/Springboard/Springboard/Capstone_2/Erasmus/X_train_sparse_scaled_index.csv', header=None)

# Convert columns/index to lists
columns = columns.iloc[:, 0].tolist()
index = index.iloc[:, 0].tolist()

# Print lengths of columns and index
print("Number of columns:", len(columns))
print("Number of rows:", len(index))

# Convert the sparse matrix back to a sparse DataFrame
X_train_scaled = pd.DataFrame.sparse.from_spmatrix(sparse_matrix_X_train_scaled, index=index, columns=columns)


Number of columns: 589
Number of rows: 2885062


### Load X_test <a id="x_test"></a>


In [None]:
# Load the sparse matrix (X_test)
sparse_matrix_X_test = scipy.sparse.load_npz('C:/Users/midol/Documents/Springboard/Springboard/Capstone_2/Erasmus/X_test.npz')

# Load the columns/index
columns = pd.read_csv('C:/Users/midol/Documents/Springboard/Springboard/Capstone_2/Erasmus/X_test_columns.csv', header=None)
index = pd.read_csv('C:/Users/midol/Documents/Springboard/Springboard/Capstone_2/Erasmus/X_test_index.csv', header=None)

# Convert columns/index to lists
columns = columns.iloc[:, 0].tolist()
index = index.iloc[:, 0].tolist()

# Print lengths of columns and index
print("Number of columns:", len(columns))
print("Number of rows:", len(index))

# Convert the sparse matrix back to a sparse DataFrame
X_test = pd.DataFrame.sparse.from_spmatrix(sparse_matrix_X_test, index=index, columns=columns)


Number of columns: 589
Number of rows: 577012
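
The three loading cells above repeat the same steps with different file names. A minimal sketch of a shared helper follows. The function name `load_sparse_frame` is hypothetical, and the `<stem>_columns.csv` / `<stem>_index.csv` naming is an assumption modeled on the files used above. The demo writes a tiny throwaway matrix to a temporary directory rather than touching the real data.

```python
import tempfile
from pathlib import Path

import pandas as pd
import scipy.sparse


def load_sparse_frame(data_dir, stem):
    """Rebuild a sparse DataFrame from <stem>.npz plus its column and index CSVs."""
    data_dir = Path(data_dir)
    matrix = scipy.sparse.load_npz(str(data_dir / f"{stem}.npz"))
    columns = pd.read_csv(data_dir / f"{stem}_columns.csv", header=None).iloc[:, 0].tolist()
    index = pd.read_csv(data_dir / f"{stem}_index.csv", header=None).iloc[:, 0].tolist()
    # Fail early if the saved labels do not match the matrix shape
    if matrix.shape != (len(index), len(columns)):
        raise ValueError(f"{stem}: matrix shape {matrix.shape} does not match labels")
    return pd.DataFrame.sparse.from_spmatrix(matrix, index=index, columns=columns)


# Demo on a tiny throwaway matrix (hypothetical data, not the Erasmus files)
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    demo = scipy.sparse.csr_matrix([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]])
    scipy.sparse.save_npz(str(tmp / "X_demo.npz"), demo)
    pd.Series(["a", "b", "c"]).to_csv(tmp / "X_demo_columns.csv", index=False, header=False)
    pd.Series([10, 11]).to_csv(tmp / "X_demo_index.csv", index=False, header=False)
    demo_frame = load_sparse_frame(tmp, "X_demo")

print(demo_frame.shape)
```

With such a helper, each dataset would load in one call, and a mismatch between the matrix and its saved labels would surface immediately instead of at fit time.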


### Load y_train <a id="y_train"></a>

In [None]:
# Load the dataset
y_train = pd.read_csv('C:/Users/midol/Documents/Springboard/Springboard/Capstone_2/Erasmus/y_train.csv')

len(y_train)

2885062

### Load y_test <a id="y_test"></a>

In [None]:
# Load the dataset
y_test = pd.read_csv('C:/Users/midol/Documents/Springboard/Springboard/Capstone_2/Erasmus/y_test.csv')

len(y_test)

577012

## Linear Regression <a id="linear_regression"></a>

### Define the Models and Hyperparameters <a id="models_hyperparameters"></a>

Ideally, we would tune all of the following parameters:
-  `'fit_intercept': [True, False]`
-  `'copy_X': [True, False]`
-  `'n_jobs': [None, 1, 2, -1]`

However, the processing time required is excessive, so we reduce the search to the following:
- `fit_intercept`: This is the only one of the three that changes the fitted model (it controls whether an intercept term is estimated), so we keep both True and False.
- `copy_X`: This only controls whether X is copied before fitting rather than possibly overwritten; it does not change the fitted coefficients, so we only use True.
- `n_jobs`: This parameter influences computation time but not the learning process, so we use -1 to make use of all available CPU cores.

In [None]:
# Linear Regression
lr = LinearRegression()
lr_params = {
    'fit_intercept': [True, False],
    'copy_X': [True],
    'n_jobs': [-1]
}
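
To see how small the reduced search spaces are, the sketch below counts the distinct candidates with scikit-learn's `ParameterGrid`. The name `rf_params` is a stand-in for the Random Forest grid defined later in this notebook, and `n_splits = 5` mirrors the `TimeSeriesSplit` setting used for tuning. RandomizedSearchCV cannot sample more distinct candidates than a list-only grid contains.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the Linear Regression grid in this notebook
lr_params = {'fit_intercept': [True, False], 'copy_X': [True], 'n_jobs': [-1]}
# Stand-in for the Random Forest grid defined later in this notebook
rf_params = {'n_estimators': [100, 200], 'max_features': ['log2'], 'max_depth': [10, 20]}
n_splits = 5  # folds used by TimeSeriesSplit during tuning

lr_candidates = len(ParameterGrid(lr_params))
rf_candidates = len(ParameterGrid(rf_params))

# Each sampled candidate is fitted once per cross-validation fold
print("Linear Regression candidates:", lr_candidates, "-> max CV fits:", lr_candidates * n_splits)
print("Random Forest candidates:", rf_candidates, "-> max CV fits:", rf_candidates * n_splits)
```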

### Tune the Hyperparameters <a id="tune_hyperparameters"></a>

Given the size of our dataset, we use RandomizedSearchCV for hyperparameter tuning with cross-validation.


In [None]:
# Instantiate TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)

# Linear Regression randomized search with cross-validation
# n_iter=2 because lr_params contains only two distinct combinations
lr_random_search = RandomizedSearchCV(estimator=lr, param_distributions=lr_params, n_iter=2, cv=tscv, scoring='neg_mean_squared_error', random_state=42, n_jobs=-1)
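
As an illustration of what `TimeSeriesSplit(n_splits=5)` does, the sketch below splits 12 placeholder rows (an assumption standing in for time-ordered records). Each fold trains on an expanding prefix of the rows and validates on the block that follows, so the splitter is only meaningful if the training rows are stored in chronological order.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 placeholder rows standing in for time-ordered records
X_demo = np.arange(12).reshape(-1, 1)
tscv_demo = TimeSeriesSplit(n_splits=5)

# Collect the row positions used for training and validation in each fold
folds = [(train_idx.tolist(), test_idx.tolist()) for train_idx, test_idx in tscv_demo.split(X_demo)]

for fold, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {fold}: train rows 0-{train_idx[-1]}, validation rows {test_idx}")
```

No validation row ever precedes a training row in the same fold, which is what prevents later records from leaking into the model being validated.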


#### Fit without Scaling <a id="tune_hyperparameters"></a>

In [None]:
# Non-scaled
lr_random_search.fit(X_train_no_scaling, y_train)

# Output the best parameters
lr_best_params_no_scaling = lr_random_search.best_params_
print("Best Parameters from RandomizedSearchCV for lr NO SCALING", lr_best_params_no_scaling)

#### Fit with Scaling <a id="tune_hyperparameters"></a>

In [None]:
# Scaled
lr_random_search.fit(X_train_scaled, y_train)

# Output the best parameters
lr_best_params_scaled = lr_random_search.best_params_
print("Best Parameters from RandomizedSearchCV lr SCALED", lr_best_params_scaled)
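
Note that both fits above reuse the same `lr_random_search` object, so the second `fit` replaces the results of the first (`best_estimator_`, `cv_results_`), and only the saved `best_params_` dictionaries survive. A minimal sketch of one way to keep both searches uses `sklearn.base.clone` to give each dataset its own unfitted copy. The synthetic data, `base_search`, and the `searches` dictionary are assumptions for illustration only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from sklearn.preprocessing import StandardScaler

# Small synthetic regression problem (not the Erasmus data)
rng = np.random.default_rng(42)
X_raw = rng.normal(size=(60, 3)) * np.array([1.0, 10.0, 100.0])
y = X_raw @ np.array([2.0, 0.5, 0.01]) + 3.0 + rng.normal(scale=0.1, size=60)
X_std = StandardScaler().fit_transform(X_raw)

base_search = RandomizedSearchCV(
    estimator=LinearRegression(),
    param_distributions={'fit_intercept': [True, False]},
    n_iter=2,
    cv=TimeSeriesSplit(n_splits=5),
    scoring='neg_mean_squared_error',
    random_state=42,
)

# clone() returns an unfitted copy, so each dataset keeps its own results
searches = {}
for name, X in {'no_scaling': X_raw, 'scaled': X_std}.items():
    searches[name] = clone(base_search).fit(X, y)

for name, search in searches.items():
    print(name, search.best_params_, search.best_score_)
```

Keeping the fitted searches separate means the estimator evaluated later can be matched to the version of the features it was trained on.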

## Random Forest Regressor <a id="random_forest"></a>

### Define the Models and Hyperparameters <a id="models_hyperparameters"></a>

We define the Random Forest Regression model and set up the hyperparameters.



Likewise, we would have preferred a more thorough search over:
-  `'n_estimators': [100, 200, 300]`
-  `'max_depth': [None, 10, 20, 30]`
-  `'min_samples_split': [2, 5, 10]`

However, the processing time required is excessive, so we reduce the search space to the following:

- `n_estimators`: Controls the number of trees in the forest. We reduce to 100 and 200 for quicker evaluations.
- `max_depth`: Controls the maximum depth of each tree. `None` grows each tree until its leaves are pure or hold fewer than `min_samples_split` samples, which is often accurate, but we search only 10 and 20 to limit overfitting and training cost.
- `max_features`: Fixed at `'log2'`, which limits each split to a random subset of about log2(n_features) features.
- `min_samples_split`: Dropped from the search and left at its default of 2.


In [None]:
# Random Forest Regressor
rf = RandomForestRegressor(random_state=42)
random_params = {
    'n_estimators': [100, 200],
    'max_features': ['log2'],
    'max_depth': [10, 20],
}

### Tune the Hyperparameters <a id="tune_hyperparameters"></a>

Given the size of our dataset, we use RandomizedSearchCV for hyperparameter tuning with cross-validation.

In [None]:
# Instantiate TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)

# Random Forest Regressor randomized search with cross-validation
rf_random_search = RandomizedSearchCV(estimator=rf, param_distributions=random_params, n_iter=2, cv=tscv, scoring='neg_mean_squared_error', random_state=42, n_jobs=-1)


#### Fit with Scaling <a id="tune_hyperparameters"></a>

In [None]:
# Scaled
rf_random_search.fit(X_train_scaled, y_train)

# Output the best parameters
random_best_params_scaled = rf_random_search.best_params_
print("Best Parameters from RandomizedSearchCV random SCALED:", random_best_params_scaled)

#### Fit without Scaling <a id="tune_hyperparameters"></a>

In [None]:
# Non-scaled
rf_random_search.fit(X_train_no_scaling, y_train)

# Output the best parameters
random_best_params_no_scaling = rf_random_search.best_params_
print("Best Parameters from RandomizedSearchCV random NO SCALING:", random_best_params_no_scaling)

## Evaluate the Models <a id="evaluate_models"></a>

Then we evaluate the models using the best hyperparameters on the test set.

### Linear Regression <a id="linear_regression3"></a>

In [None]:
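# Linear Regression Evaluation
# best_estimator_ comes from the most recent fit of lr_random_search (the scaled run above)
best_lr = lr_random_search.best_estimator_
y_pred_lr = best_lr.predict(X_test)

# Evaluation Metrics
mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

print("Linear Regression Performance:")
print("MSE:", mse_lr)
print("R^2 Score:", r2_lr)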


### Random Forest Regressor <a id="random_forest3"></a>

In [None]:
# Random Forest Evaluation
best_rf = rf_random_search.best_estimator_
y_pred_rf = best_rf.predict(X_test)

# Evaluation Metrics
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print("Random Forest Performance:")
print("MSE:", mse_rf)
print("R^2 Score:", r2_rf)
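
To make the two metrics concrete, here is a small worked example on four made-up values (not Erasmus data). It also reports the root mean squared error, which is an addition to the notebook's metrics and is expressed in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Four made-up targets and predictions
y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

mse_demo = mean_squared_error(y_true_demo, y_pred_demo)  # mean of squared errors
rmse_demo = np.sqrt(mse_demo)                             # same units as the target
r2_demo = r2_score(y_true_demo, y_pred_demo)              # 1 - SS_res / SS_tot

print(f"MSE:  {mse_demo:.4f}")   # 0.3750
print(f"RMSE: {rmse_demo:.4f}")  # 0.6124
print(f"R^2:  {r2_demo:.4f}")    # 0.9486
```

Here the squared errors are 0.25, 0.25, 0 and 1, so the MSE is 1.5 / 4 = 0.375, and R² is 1 - 1.5 / 29.1875, where 29.1875 is the total sum of squares around the mean of the true values.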

## Compare and Select the Best Model <a id="compare_select"></a>

Finally, we compare the performance metrics and choose the best model.

In [None]:
print("\nComparison:")
if mse_lr < mse_rf:
    print("Best Model: Linear Regression")
    best_model = best_lr
    best_mse = mse_lr
    best_r2 = r2_lr
else:
    print("Best Model: Random Forest")
    best_model = best_rf
    best_mse = mse_rf
    best_r2 = r2_rf

print("\nBest Model Performance:")
print("MSE:", best_mse)
print("R^2 Score:", best_r2)

## Run the Analysis <a id="run_analysis"></a>

In [None]:
# Use the model selected in the comparison step above.
# RandomizedSearchCV (refit=True by default) has already refit the best
# estimator on the full training data it was given, so no further fit is needed.

# Predict on the test set
y_pred = best_model.predict(X_test)

In [None]:
# Calculate and print the mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error on test set: {mse:.4f}")

## Summary <a id="summary"></a>


Unfortunately, the memory required to run this final notebook is more than my equipment can handle. The following options were investigated and attempted unsuccessfully:
- Run on Google Colab
- Use Dask
- Convert data to sparse format
- Reduce hyperparameters to a minimum
- Run hyperparameter tuning on one split of the data only
- Use RandomizedSearchCV rather than GridSearchCV
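
One further option, which is not in the list above and has not been tried on this dataset, is to pass the SciPy matrix returned by `scipy.sparse.load_npz` straight to the estimator instead of first wrapping it in a pandas DataFrame. scikit-learn's `LinearRegression` and `RandomForestRegressor` both accept SciPy sparse input. The sketch below uses small synthetic data (an assumption, not the Erasmus files) to show the call and to compare the sparse footprint with a dense array of the same shape. Whether this would be enough to fit the full dataset in memory remains an open question.

```python
import numpy as np
import scipy.sparse
from sklearn.linear_model import LinearRegression

# Synthetic sparse features: 1000 rows, 50 columns, about 5% non-zero entries
X_sparse = scipy.sparse.random(1000, 50, density=0.05, format='csr', random_state=0)
rng = np.random.default_rng(0)
true_coef = rng.normal(size=50)
y_demo = X_sparse @ true_coef + 1.0  # noiseless linear target with intercept 1.0

# Fit directly on the CSR matrix, with no DataFrame wrapper
model = LinearRegression().fit(X_sparse, y_demo)

# Compare the memory held by the CSR arrays with a dense array of the same shape
sparse_bytes = X_sparse.data.nbytes + X_sparse.indices.nbytes + X_sparse.indptr.nbytes
dense_bytes = X_sparse.shape[0] * X_sparse.shape[1] * X_sparse.dtype.itemsize

print("R^2 on training data:", model.score(X_sparse, y_demo))
print("sparse bytes:", sparse_bytes, "dense bytes:", dense_bytes)
```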

We loaded our data, previously split using TimeSeriesSplit since our data has a temporal element.
We defined the models and hyperparameters for two models: Linear Regression and Random Forest Regressor.
We prepared to tune the hyperparameters of both models using RandomizedSearchCV, on both the scaled and unscaled versions of X_train.
Finally, we prepared to evaluate the models using mean squared error and R² as evaluation metrics. Based on these results, we would have chosen the best model and run our analysis.


## Recommendations <a id="recommendations"></a>

Despite the inability to run the final analysis, we can propose recommendations on how the client could utilize the results if a conclusive analysis had been possible:

**1. Strategic Allocation Planning**:
- By identifying which countries are most likely to need or receive future funds, the Erasmus program can strategically plan its budget allocations. Countries receiving consistent funding could be highlighted for more detailed reviews to ensure alignment with Erasmus objectives.
- Recognizing underfunded countries with potential growth can lead to proactive support and development initiatives, ensuring a balanced distribution of educational opportunities across Europe.

**2. Targeted Program Development**:
- With insights into participant demographics and project types, Erasmus can design targeted programs addressing specific needs, such as increasing male participation or supporting countries with lower project durations.
- By understanding the funding patterns and areas of high impact, Erasmus can develop specialized workshops, training sessions, or partnerships tailored to enhance the projects' effectiveness in those regions.

**3. Policy and Engagement Strategies**:
- Leveraging the trends and predictive analytics, policymakers can be better equipped to advocate for continuous or increased funding in certain regions, potentially adjusting policy to support high-engagement programs.
- Engagement strategies could be refined by focusing marketing and informational campaigns in countries identified to have growing or high participant interest. Highlighting successful projects from previous years can serve as inspiration and motivation for new participants and organizations.

**Conclusion**

Although we faced technical limitations that prevented us from performing the final analysis, the preparatory work and exploratory analysis provided valuable insights into Erasmus program funding and participation trends. By addressing memory constraints, future iterations of this analysis could provide far more detailed and actionable insights.

By formalizing these steps, Erasmus can harness data-driven decision-making to support and expand transnational education, training, youth, and sport initiatives effectively.
