# Solution for Predicting Olympic Medal Count
We have a historical dataset on the modern Olympic Games, covering every Games from Athens 1896 to Rio 2016. We want to predict the number of medals for the 2016 Olympic Games. The dataset is divided into three tables: the ground truth, the countries at each Olympics, and the individual events with their medals.\
Here are the attributes of our tables.

<p><c>
    <img src="dataset.png" alt="Olympic Logo" width=500/>
</c></p>

Here we provide a solution for predicting the Olympic medal count in 2016. Your task is to **follow the provided solution and select a subset of features** that you think will help predict the medal count at the upcoming 2020 Olympic Games in Tokyo. Complete the tasks below:
### Steps:

#### 1. Learn about the dataset and the task of predicting the Olympic medal count in the first cell, called Load data (input, ground truth, data, regression model);
#### 2. Load the suggested features from experts and automatic algorithms and <span style="color:blue">check the features in the 3rd cell by clicking the button on the right of the cell</span>.
#### 3. Check the information about the features in the table view, select a subset of features by ticking the checkboxes, and submit your selection by clicking the <span style="color:blue">blue Submit Selection button in the view</span>.
#### 4. Copy the feature ids and evaluate these features in the evaluation cell by pasting the id list into <span style="color:blue">auto_human_feature_matrix.iloc[:, [feature id list]]</span>.

#### You can select and evaluate the features iteratively or create new features from the three tables by yourself.

## Load data

In [1]:
# load data and ground truth
import os
import pandas as pd
import utils_original as utils
import foldCode1 as load_df
import foldCode2 as our_model

DATA_DIR = os.path.join(os.getcwd(),"data/olympic_games_data")
es = utils.load_entityset(data_dir=DATA_DIR)
groundtruth_table, countries_table, events_table, cutoff_times_gt, dates, labels = load_df.loadData(es, DATA_DIR)

You can add new cells to print the 3 tables (groundtruth_table, countries_table, events_table) and the input (dates) and labels.
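
A new cell for that inspection might look like the sketch below. The `summarize` helper is hypothetical (it is not part of `utils_original` or `foldCode1`), and the small stand-in frame with made-up column names and values is there only so the example runs on its own. In the notebook you would pass `groundtruth_table`, `countries_table`, and `events_table` instead.

```python
import pandas as pd


def summarize(name, df, n=3):
    """Print the shape, column dtypes, and first n rows of a table (hypothetical helper)."""
    print(f"{name}: {df.shape[0]} rows x {df.shape[1]} columns")
    print(df.dtypes.to_string())
    print(df.head(n).to_string())
    return df.shape


# Stand-in data with made-up values; replace with the tables loaded in the first cell.
demo_table = pd.DataFrame({
    "Country": ["AAA", "BBB", "CCC"],
    "Year": [2008, 2012, 2016],
    "Medals": [10, 4, 7],
})
shape = summarize("demo_table", demo_table)
```

Calling the same helper on `dates` and `labels` would need a small adjustment if they are Series or arrays rather than DataFrames.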

## <span style="color:brown">Generate Features</span> (important part)

In [2]:
select_human_feature_names, select_human_features = load_df.human_features() #7
all_features = pd.read_csv("features/natural_features_all.csv")

# check feature matrix
# all_features

# you can also check single feature:
# all_features.iloc[:, #id]

In [3]:
#check features

# your feature selection: enter the feature IDs in the list below
feature_id = [2,3]
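
Before evaluating a selection, it can help to confirm which feature names the chosen IDs actually point to. The sketch below assumes the IDs are positional column indices, as the `iloc` calls in this notebook suggest. The `feature_names_for` helper and the stand-in frame are hypothetical; in the notebook you would pass `all_features` and your own `feature_id` list.

```python
import pandas as pd


def feature_names_for(feature_matrix, ids):
    """Return the column names at the given positional IDs (hypothetical helper)."""
    n_cols = feature_matrix.shape[1]
    bad = [i for i in ids if not 0 <= i < n_cols]
    if bad:
        raise IndexError(f"feature IDs out of range 0..{n_cols - 1}: {bad}")
    return [feature_matrix.columns[i] for i in ids]


# Stand-in feature matrix with made-up names and values.
demo_features = pd.DataFrame({
    "feat_a": [1.0, 2.0],
    "feat_b": [0.5, 0.1],
    "feat_c": [3.0, 4.0],
    "feat_d": [7.0, 8.0],
})
feature_id = [2, 3]

selected_names = feature_names_for(demo_features, feature_id)
selected = demo_features.iloc[:, feature_id]  # same positional selection used for evaluation
print(selected_names)
```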

## Evaluate features with a regression model (random forest)

In [11]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
import utils_original as utils

# parameters of the regression model
pipeline_preprocessing = [
    ("imputer", SimpleImputer()),
    ("scaler", RobustScaler(with_centering=True)),
]
# train on data up to and including 2012
splitter = utils.TimeSeriesSplitByDate(dates=dates, earliest_date=pd.Timestamp('01/01/2012'))

# input values
# all_X = all_features.values   # you can edit here to evaluate feature subsets
all_X = all_features.iloc[:, feature_id].values
rf_regressor = RandomForestRegressor(
    n_estimators=200,
    random_state=50,
)
pipeline_reg = Pipeline(pipeline_preprocessing + [('rf_reg', rf_regressor)])
regression_score = utils.fit_and_score(all_X, labels, splitter, pipeline_reg)

print(regression_score)

         r2        mse Olympics Year
0  0.722891  91.911598    2016-08-05
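
To help read the two score columns above, the sketch below computes both by hand on made-up numbers. It assumes that `utils.fit_and_score` reports the standard coefficient of determination and mean squared error; the helper's source is not shown here, so treat that as an assumption. The function names `mse_by_hand` and `r2_by_hand` are illustrative only.

```python
def mse_by_hand(y_true, y_pred):
    """Mean squared error: the average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up medal counts for four countries and a model's predictions for them.
y_true = [10, 0, 4, 6]
y_pred = [8, 1, 4, 7]

mse = mse_by_hand(y_true, y_pred)
r2 = r2_by_hand(y_true, y_pred)
print(mse, r2)
```

Under that assumption, a higher `r2` and a lower `mse` both indicate a better feature subset, and the square root of the `mse` gives a rough sense of the error in medals.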


In [17]:
# check the importance score of selected features
feature_imp = utils.get_feature_importances(
    pipeline_reg,
    # all_features,
    all_features.iloc[:, feature_id],  # edit here to check the score of selected features
    labels, splitter, 300)
 
test_date = pd.Timestamp('08/05/2016')
feature_imp[test_date].reset_index(drop=True)

   Importance  Feature
0    0.415657  MEAN(countries_at_olympic_games.NUM_UNIQUE(med...
1    0.371758  MEAN(countries_at_olympic_games.COUNT(medals_w...
2     0.06382  ABSOLUTE(MEAN(medals_won.Height))
3    0.054523  MEAN(medals_won.Height)
4    0.049019  MIN(countries_at_olympic_games.MEAN(medals_won...
5    0.045224  MAX(countries_at_olympic_games.MIN(medals_won....
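
How `utils.get_feature_importances` derives these scores is not shown in this notebook. As a rough illustration of what such a ranking means, the sketch below uses scikit-learn's built-in impurity-based `feature_importances_` on synthetic data as a stand-in; the column names and data are made up, and the method may differ from the one the helper uses.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on "informative" only; "noise" is unrelated.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "informative": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y = 3.0 * X["informative"] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=50)
model.fit(X, y)

# Rank the features by importance, highest first, in the same layout as the table above.
importances = (
    pd.DataFrame({"Importance": model.feature_importances_, "Feature": X.columns})
    .sort_values("Importance", ascending=False)
    .reset_index(drop=True)
)
print(importances)
```

Features that rank near the bottom of such a table are natural candidates to drop when you iterate on your selection.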
