# Capstone 2: Erasmus Program Mobility 

## Preprocessing and Training Data Development

## Table of Contents
* [Feature Engineering](#feature_engineering) 
    *  [Country-year aggregations](#country_year)
    *  [Trends over time](#trends_time)
    *  [Target variable creation](#target_variable)
    
* [Feature Transformation](#feature_transformation) 
    *  [Feature selection](#feature_selection)
    *  [One-hot encoding](#one_hot) 

* [Train-Test Split](#train_test_split)

* [Summary](#summary)

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from calendar import month_name
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

In [2]:
# Load the projects dataset
erasmus = pd.read_csv('C:/Users/midol/Documents/Springboard/Springboard/Capstone_2/Erasmus/erasmus_eda.csv')

In [3]:
erasmus.head()

Unnamed: 0,participant_id,project_id,academic_year,start_month,end_month,duration_in_days,activity,field_of_education,participant_nationality,education_level,...,end_month_dt,start_year_month,end_year_month,start_year,end_year,start_month_num,end_month_num,start_month_name,end_month_name,age_group
0,1,2014-1-MT01-KA104-000103,2014-2015,2014-05,2014-06,42,Staff training abroad,Architecture and construction,MT,ISCED-6 - First cycle / Bachelor’s or equivale...,...,2014-06-01,2014-05,2014-06,2014,2014,5,6,May,June,30-39
1,2,2014-1-AT01-KA102-000262,2014-2015,2014-06,2014-06,13,Training/teaching assignments abroad,Business and administration,AT,ISCED-7 - Second cycle / Master’s or equivalen...,...,2014-06-01,2014-06,2014-06,2014,2014,6,6,June,June,50-59
2,3,2014-1-AT01-KA102-000262,2014-2015,2014-06,2014-06,15,VET learners traineeships in vocational instit...,Business and administration,AT,ISCED-5 - Short-cycle within the first cycle /...,...,2014-06-01,2014-06,2014-06,2014,2014,6,6,June,June,
3,4,2014-1-AT01-KA102-000262,2014-2015,2014-06,2014-06,15,VET learners traineeships in vocational instit...,Business and administration,AT,ISCED-5 - Short-cycle within the first cycle /...,...,2014-06-01,2014-06,2014-06,2014,2014,6,6,June,June,
4,5,2014-1-AT01-KA102-000262,2014-2015,2014-06,2014-06,15,VET learners traineeships in vocational instit...,Business and administration,AT,ISCED-7 - Second cycle / Master’s or equivalen...,...,2014-06-01,2014-06,2014-06,2014,2014,6,6,June,June,


In [4]:
erasmus.columns

Index(['participant_id', 'project_id', 'academic_year', 'start_month',
       'end_month', 'duration_in_days', 'activity', 'field_of_education',
       'participant_nationality', 'education_level', 'participant_gender',
       'participant_role', 'special_needs', 'fewer_opportunities',
       'groupleader', 'participant_age', 'sending_country', 'sending_city',
       'receiving_country', 'receiving_city', '# of participants',
       'key_action', 'action_type', 'call_year', 'grant_in_euros',
       'country_type', 'country_pairs', 'allocated_grant', 'start_month_dt',
       'end_month_dt', 'start_year_month', 'end_year_month', 'start_year',
       'end_year', 'start_month_num', 'end_month_num', 'start_month_name',
       'end_month_name', 'age_group'],
      dtype='object')

## Feature Engineering <a id="feature_engineering"></a>

### Country-year aggregations <a id="country_year"></a>

To consider the temporal aspect of the data, we aggregate the data for country-year granularity. 

In [5]:
# Aggregate data to country-year level
df_agg = erasmus.groupby(['receiving_country', 'call_year']).agg({
       'allocated_grant': 'sum',
       'participant_id': 'nunique',  # Number of unique participants
       'duration_in_days': 'mean',
       '# of participants': 'mean'
   }).reset_index()

In [6]:
df_agg.head()

Unnamed: 0,receiving_country,call_year,allocated_grant,participant_id,duration_in_days,# of participants
0,AF,2015,5871.111111,2,5.0,1.0
1,AF,2016,19353.3,6,7.333333,1.5
2,AF,2017,5058.695652,2,5.0,1.0
3,AL,2014,274999.537457,294,24.006803,1.221088
4,AL,2015,817178.602912,388,28.492268,1.082474


### Trends over time <a id="trends_time"></a>

We also create features that capture trends over time.
- **Lag Feature**: These are values from previous years.
- **Rolling Statistics**: These could help capture trends and smooth out yearly fluctuations.

In [7]:
# Add lag feature
df_agg['previous_year_allocated_grant'] = erasmus.groupby('receiving_country')['allocated_grant'].shift(1)

In [8]:
# Add rolling statistics feature - for funding
df_agg['rolling_avg_grant_3yr'] = erasmus.groupby('receiving_country')['allocated_grant'].rolling(3).mean().shift(1).reset_index(0,drop=True)

# Add rolling statistics feature - for project count
# Count the occurrences of project_id per year for each receiving_country
counts_per_year = erasmus.groupby(['receiving_country', 'call_year']).size().reset_index(name='project_count')

# Compute the rolling average on the yearly counts
df_agg['rolling_avg_projects_3yr'] = counts_per_year.groupby('receiving_country')['project_count'].rolling(3).mean().shift(1).reset_index(0, drop=True)


### Target variable creation <a id="target_variable"></a>

Because we want to predict multi-year future allocations, we create a more suitable target variable than simply `allocated_grant`. To do this, we extend the target variable to cover predictions for multiple future years. The `cumulative_future_allocations` variable will now be our target feature.

In [9]:
df_agg['cumulative_future_allocations'] = erasmus.groupby('receiving_country')['allocated_grant'].transform(lambda x: x.shift(-1).fillna(0) + x.shift(-2).fillna(0) + x.shift(-3).fillna(0))

In [10]:
df_agg.head()

Unnamed: 0,receiving_country,call_year,allocated_grant,participant_id,duration_in_days,# of participants,previous_year_allocated_grant,rolling_avg_grant_3yr,rolling_avg_projects_3yr,cumulative_future_allocations
0,AF,2015,5871.111111,2,5.0,1.0,,5087.821545,,1383.333333
1,AF,2016,19353.3,6,7.333333,1.5,,6371.258065,,8703.0
2,AF,2017,5058.695652,2,5.0,1.0,2901.0,,,7602.333333
3,AL,2014,274999.537457,294,24.006803,1.221088,2901.0,,3.333333,6431.261905
4,AL,2015,817178.602912,388,28.492268,1.082474,2901.0,2901.0,,5260.190476


We handle any missing values that may have been introduced by the shift operations.

In [11]:
df_agg.fillna(0, inplace=True)

In [12]:
df_agg.shape

(685, 10)

We add these new features to our original dataframe.

In [13]:
df_merged = erasmus.merge(df_agg, on=['receiving_country', 'call_year'], how='left', suffixes=('', '_agg'))

In [14]:
df_merged.shape

(3462074, 47)

In [15]:
df_merged.columns

Index(['participant_id', 'project_id', 'academic_year', 'start_month',
       'end_month', 'duration_in_days', 'activity', 'field_of_education',
       'participant_nationality', 'education_level', 'participant_gender',
       'participant_role', 'special_needs', 'fewer_opportunities',
       'groupleader', 'participant_age', 'sending_country', 'sending_city',
       'receiving_country', 'receiving_city', '# of participants',
       'key_action', 'action_type', 'call_year', 'grant_in_euros',
       'country_type', 'country_pairs', 'allocated_grant', 'start_month_dt',
       'end_month_dt', 'start_year_month', 'end_year_month', 'start_year',
       'end_year', 'start_month_num', 'end_month_num', 'start_month_name',
       'end_month_name', 'age_group', 'allocated_grant_agg',
       'participant_id_agg', 'duration_in_days_agg', '# of participants_agg',
       'previous_year_allocated_grant', 'rolling_avg_grant_3yr',
       'rolling_avg_projects_3yr', 'cumulative_future_allocations'],
   

## Feature Transformation <a id="feature_transformation"></a>

### Feature selection <a id="feature_selection"></a>

Now we create a new DataFrame containing only those features to be used in our modelling.

In [16]:
# List of columns to include in the final DataFrame
columns_to_include = [
    'participant_nationality', 'participant_gender', 'participant_age',
    'participant_role', 'special_needs', 'fewer_opportunities', 'groupleader', 
    'age_group', 'academic_year', 'call_year', 'start_year_month', 'end_year_month',
    'start_year', 'end_year', 'duration_in_days', 'activity', 'field_of_education',
    'sending_country', 'receiving_country', 'country_type', 
    'allocated_grant', 'allocated_grant_agg', 'previous_year_allocated_grant', 'rolling_avg_grant_3yr', 'rolling_avg_projects_3yr',
    'cumulative_future_allocations', '# of participants', '# of participants_agg', 'participant_id_agg', 
    'duration_in_days_agg']

# Create the final DataFrame with the selected columns
df_merged = df_merged[columns_to_include]
df_merged.shape

(3462074, 30)

### One-hot encoding <a id="one_hot"></a>

We encode the categorical variables into numerical format using one-hot encoding.
Due to memory limitations, this list was minimized.

In [17]:
# Perform one-hot encoding on selected columns
columns_to_encode = ['participant_gender', 'participant_age',
    'participant_role', 'special_needs', 'fewer_opportunities', 'groupleader', 
    'age_group', 'academic_year', 'call_year', 'activity', 'receiving_country']
df_final = pd.get_dummies(df_merged, columns=columns_to_encode)


In [18]:
df_final.columns

Index(['participant_nationality', 'start_year_month', 'end_year_month',
       'start_year', 'end_year', 'duration_in_days', 'field_of_education',
       'sending_country', 'country_type', 'allocated_grant',
       ...
       'receiving_country_UK', 'receiving_country_US', 'receiving_country_UY',
       'receiving_country_UZ', 'receiving_country_VE', 'receiving_country_VN',
       'receiving_country_XK', 'receiving_country_ZA', 'receiving_country_ZM',
       'receiving_country_ZW'],
      dtype='object', length=330)

In [19]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3462074 entries, 0 to 3462073
Columns: 330 entries, participant_nationality to receiving_country_ZW
dtypes: bool(311), float64(10), int64(3), object(6)
memory usage: 1.5+ GB


Since we don't intend to use learning models that rely on distance-based metrics, we will not scale the numeric data.

## Train-Test Split <a id="train_test_split"></a>

Split the data into training and testing sets.

In [20]:
X = df_final.drop(columns=['cumulative_future_allocations'])
y = df_final['cumulative_future_allocations']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [21]:
df_final.to_csv('C:/Users/midol/Documents/Springboard/Springboard/Capstone_2/Erasmus/erasmus_preprocess.csv', index=False)

## Summary <a id="summary"></a>

Original Problem Statement: Based on past funding from 2014 to 2019, analyze previous patterns and factors to identify trends and predict which countries are most likely to receive future allocations.

To prepare for modeling, we did some feature engineering. We created several country-year aggregations by summing funding, counting participants, and calculating the mean project duration in days and number of participants.

We also created features that would capture trends over time such as a lag feature for country-funding. To smooth out yearly fluctuations, rolling statistics were also created for funding mean and project counts, and a 3-year rolling average for project counts. Because we want to predict multi-year future allocations, we created a more suitable target variable that extends over multiple future years: `cumulative_future_allocations`.

We chose the features that would be included in the modeling and hot-encoded the categorical variables into numerical format.

Finally, we split our data into training and testing sets.


**For further exploration:**

Consider integrating more features such as:
- Economic indicators
- Political stability
- Project success rates