# Solution for Predicting Olympic Medal Count
We have a historical dataset on the modern Olympic Games, covering all the Games from Athens 1896 to Rio 2016. We want to predict the number of medals for the 2016 Olympic Games. The dataset can be divided into three tables: the ground truth, the countries at each Olympics, and the specific events with medals.\
Here are the attributes of our tables.

<p><c>
    <img src="dataset.png" alt="Olympic Logo" width=500/>
</c></p>

Here we provide a solution for predicting the Olympic Medal Count in 2016. Now you need to **follow the provided solution and select a subset of features** you think are helpful for predicting the Medal Count of the coming Olympic Games in Tokyo. You are required to finish the tasks below:
### Steps:

#### 1. Learn about the dataset and the task of predicting the Olympic Medal Count in the first cell, called Load data (input, ground truth, data, regressive model).
#### 2. Load the suggested features from experts and automatic algorithms, and <span style="color:blue">check the features in the 3rd cell by clicking the button on the right of the cell</span>.
#### 3. Check the information on the features in the table view, select a subset of features by ticking the checkboxes, and submit your selection by clicking the <span style="color:blue">blue button Submit Selection in the view</span>.
#### 4. Copy the feature ids and evaluate these features in the evaluation cell by pasting the id list into <span style="color:blue">auto_human_feature_matrix.iloc[:,[feature id list]]</span>.

#### You can select and evaluate the features iteratively or create new features from the three tables by yourself.
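
Step 4 relies on selecting columns by integer position. The following is a minimal sketch of that idea on a toy DataFrame; the column names and the ids are made up for illustration and are not taken from the real feature matrix.

```python
import pandas as pd

# Toy stand-in for the feature matrix (hypothetical column names).
toy_features = pd.DataFrame({
    "feat_a": [1.0, 2.0, 3.0],
    "feat_b": [0.1, 0.2, 0.3],
    "feat_c": [10, 20, 30],
    "feat_d": [5, 5, 5],
})

feature_id = [0, 2]  # positions of the columns to keep

# iloc[:, ids] keeps every row and only the columns at the listed positions.
selected = toy_features.iloc[:, feature_id]

# Mapping ids back to names helps to double-check a selection.
selected_names = list(toy_features.columns[feature_id])
print(selected_names)
```

Printing the names next to the ids is a cheap way to confirm that the pasted id list refers to the features you meant to pick.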

## Load data

In [139]:
# load data and ground truth
import os
import pandas as pd
import utils_original as utils
import foldCode1 as load_df
import foldCode2 as our_model

DATA_DIR = os.path.join(os.getcwd(),"data/olympic_games_data")
es = utils.load_entityset(data_dir=DATA_DIR)
groundtruth_table, countries_table, events_table, cutoff_times_gt, dates, labels = load_df.loadData(es, DATA_DIR)



You can add new cells to print the 3 tables (groundtruth_table, countries_table, events_table) and the input (dates) and labels.
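
As a rough sketch of such a cell, the helper below prints the shape, the column types and the first rows of a table. It assumes the three tables are pandas DataFrames (the notebook does not state this explicitly), and `summarize_table` is a hypothetical helper name, demonstrated here on a toy table.

```python
import pandas as pd

def summarize_table(name, df, n_rows=3):
    """Print the shape, the column dtypes, and the first rows of a DataFrame."""
    print(f"{name}: {df.shape[0]} rows x {df.shape[1]} columns")
    print(df.dtypes)
    print(df.head(n_rows))
    return df.shape

# Toy table standing in for one of the three real tables.
toy_table = pd.DataFrame({"Country": ["A", "B"], "Medals": [3, 0]})
shape = summarize_table("toy_table", toy_table)
```

In the notebook, the same call could be repeated for `groundtruth_table`, `countries_table` and `events_table`, provided they are DataFrames.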

## <span style="color:brown">Generate Features</span> (Edit)

In [140]:
select_human_feature_names, select_human_features = load_df.human_features() #7
all_features = pd.read_csv("features/natural_features_all_815.csv")

# Check feature matrix
all_features

Unnamed: 0,MEAN(countries_at_olympic_games.NUM_UNIQUE(medals_won.Event)),ABSOLUTE(MEAN(medals_won.Height)),MIN(countries_at_olympic_games.MEAN(medals_won.Height)),MEAN(medals_won.Height),MEAN(countries_at_olympic_games.COUNT(medals_won)),MEAN(countries_at_olympic_games.SUM(medals_won.Height)),MAX(countries_at_olympic_games.MIN(medals_won.Age)),MEAN(countries_at_olympic_games.SUM(medals_won.Weight)),"PERCENTILE(TREND(medals_won.Height, Year))",MIN(countries_at_olympic_games.SKEW(medals_won.Age)),...,COUNT(medaling_athletes WHERE athlete.Gender = Men),"TREND(countries_at_olympic_games.MAX(medals_won.Age), Year)",ABSOLUTE(SKEW(medals_won.Age)),PERCENTILE(COUNT(medals_won)),SUM(countries_at_olympic_games.MIN(medals_won.Age)),SKEW(countries_at_olympic_games.MAX(medals_won.Age)),SKEW(countries_at_olympic_games.MEAN(medals_won.Weight)),"RATIO(COUNT(athletes WHERE Gender = Men), COUNT(athletes WHERE Gender = Women))","RATIO(COUNT(athletes WHERE Gender = Women), COUNT(athletes.Athelete))",PERCENTILE(MEAN(medals_won.Height))
0,,,,,,,,,,,...,6,,,0.502273,0.0,,,10.0,0.0,
1,,,,,,,,,,,...,52,,,0.502273,0.0,,,10.0,0.0,
2,,,,,,,,,,,...,33,,,0.502273,0.0,,,10.0,0.0,
3,,,,,,,,,,,...,6,,,0.502273,0.0,,,10.0,0.0,
4,,,,,,,,,,,...,7,,,0.502273,0.0,,,10.0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1146,1.857143,173.846154,166.0,173.846154,1.857143,322.857143,31.0,107.142857,0.318681,-1.732051,...,1,-0.000156,0.073284,0.702273,177.0,-0.710342,-0.050544,10.0,0.0,0.341085
1147,2.285714,173.250000,158.0,173.250000,2.285714,396.000000,24.0,155.428571,0.538462,-0.585583,...,18,-0.000120,0.221115,0.720455,146.0,-0.178474,0.059369,10.0,0.0,0.294574
1148,1.444444,171.916667,157.0,171.916667,1.444444,229.222222,30.0,90.555556,0.395604,,...,0,0.000147,1.194443,0.702273,233.0,0.838972,0.687104,0.0,1.0,0.240310
1149,1.000000,183.000000,183.0,183.000000,1.000000,183.000000,23.0,90.000000,,,...,1,,,0.429545,23.0,,,10.0,0.0,0.899225
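
The preview above shows that many features are missing (NaN) for the earliest rows. One possible way to quantify this before selecting features is sketched below on toy data; the column names and the 0.5 threshold are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Toy feature matrix with some missing values (hypothetical column names).
toy_features = pd.DataFrame({
    "feat_a": [np.nan, np.nan, 1.5, 2.0],
    "feat_b": [6, 52, 1, 18],
    "feat_c": [np.nan, 0.3, 0.5, 0.2],
})

# isna() marks missing cells; the column mean of that mask is the share of missing rows.
missing_share = toy_features.isna().mean().sort_values(ascending=False)
print(missing_share)

# Keep the names of columns that are missing in fewer than half of the rows.
mostly_present = missing_share[missing_share < 0.5].index.tolist()
```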


In [145]:
#check features


In [186]:
# OPTIONAL: write new features
# all_features.insert()

# OPTIONAL: you can check single feature:
# all_features.iloc[:, #id]

# MUST EDIT: feature selection: input feature ID in the list below:
#  current best: 18,54,70,96,97,30,20 --- decisionTree 0.856124  47.72093
# 18,54,70,96,97,30,20,28 --- ard 0.865194  44.712577
# [18,54,70,96,97,30,20,28,55] --- ard 0.865194  44.712383

# human: 18,54,70,96,97
feature_id = [18,54,70,96,97,30,20,28,55]
# overlap: 4,30
# feature_id = [4,30]
# auto:0,3,7,10,18
# feature_id = [0,3,7,10,18]

# feature_id = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,19,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,38,39,40,41,42,43,45,46,47,48,49,50,51,52,53,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,90,91,92,93,94,95,18,20,37,44,54,70,89,96,97,98]




## Evaluate features with given models  

In [184]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import ExtraTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import LinearSVC
from sklearn.svm import LinearSVR
from sklearn.linear_model import ARDRegression
import utils_original as utils

# parameters of the regressive model
pipeline_preprocessing = [("imputer",
                            SimpleImputer()),
                            ("scaler", RobustScaler(with_centering=True))]
splitter = utils.TimeSeriesSplitByDate(dates=dates, earliest_date=pd.Timestamp('01/01/2012')) 

# If you want to input all features
# all_X = all_features.values  

# Take your selection as input
all_X = all_features.iloc[:, feature_id].values
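
To make the preprocessing steps concrete, here is a minimal sketch on toy data. The date-based split is an assumption about what a splitter such as `utils.TimeSeriesSplitByDate` might do (train on earlier Games, test on later ones); its real behaviour is not shown in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Toy data: one row per Games date, with one missing value.
toy_dates = pd.Series(pd.to_datetime(
    ["2004-08-13", "2008-08-08", "2012-07-27", "2016-08-05"]))
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])

# Assumed split rule: rows before the cutoff train the model, later rows test it.
cutoff = pd.Timestamp("2016-01-01")
train_mask = (toy_dates < cutoff).to_numpy()
X_train, X_test = X[train_mask], X[~train_mask]

preprocess = Pipeline([
    ("imputer", SimpleImputer()),                   # default strategy: column mean
    ("scaler", RobustScaler(with_centering=True)),  # centre on median, scale by IQR
])

# Statistics are learned from the training rows only, then reused on the test rows.
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)
print(X_test_t)
```

Fitting the imputer and scaler on the training rows only keeps information from the test Games out of the preprocessing statistics.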

### Run Models <span style="color:brown">(No changes to code are required)</span>
If you don't want to use a model, you can comment out the corresponding lines of code.
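
As an alternative to commenting blocks in and out, the models could be kept in a dictionary and evaluated in a loop. The sketch below uses synthetic data and a plain fit and score on a fixed split instead of `utils.fit_and_score`, whose internals are not shown here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: the target depends on the first two columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=60)
X_train, X_test, y_train, y_test = X[:45], X[45:], y[:45], y[45:]

models = {
    "linear_reg": LinearRegression(),
    "decisionTree_reg": DecisionTreeRegressor(random_state=0),
    # Comment out (or delete) an entry to skip that model.
}

scores = {}
for name, model in models.items():
    pipe = Pipeline([("scaler", RobustScaler()), (name, model)])
    pipe.fit(X_train, y_train)
    scores[name] = pipe.score(X_test, y_test)  # R^2 on the held-out rows
print(scores)
```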

In [185]:
# Random Forest Regression
rf_regressor = RandomForestRegressor(n_estimators=300, random_state=50)  
pipeline_reg0 = Pipeline(pipeline_preprocessing + [('rf_reg', rf_regressor)])
regression_score = utils.fit_and_score(all_X, labels, splitter, pipeline_reg0)
print("RandomForest Reg:", regression_score)
# --------------------------------------------------------------------------------------------------------------------

# Linear Regression
# linear_reg = LinearRegression()
# pipeline_reg1 = Pipeline(pipeline_preprocessing + [('linear_reg', linear_reg)])
# regression_score1 = utils.fit_and_score(all_X, labels, splitter, pipeline_reg1)
# print("Linear Reg: ", regression_score1)
# --------------------------------------------------------------------------------------------------------------------

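# Logistic Regression (note: LogisticRegression is a classifier, so each distinct medal count is treated as a class label)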
logistic_reg = LogisticRegression()
pipeline_reg2 = Pipeline(pipeline_preprocessing + [('logistic_reg', logistic_reg)])
regression_score2 = utils.fit_and_score(all_X, labels, splitter, pipeline_reg2)
print("Logistic Reg: ", regression_score2)
# --------------------------------------------------------------------------------------------------------------------

# SGD Regression
sgd_reg = SGDRegressor()
pipeline_reg3 = Pipeline(pipeline_preprocessing + [('sgd_reg', sgd_reg)])
regression_score3 = utils.fit_and_score(all_X, labels, splitter, pipeline_reg3)
print("SGD Reg: ", regression_score3) 
# --------------------------------------------------------------------------------------------------------------------

# KNN Regression
# knn_reg = KNeighborsRegressor()
# pipeline_reg4 = Pipeline(pipeline_preprocessing + [('knn_reg', knn_reg)])
# regression_score4 = utils.fit_and_score(all_X, labels, splitter, pipeline_reg4)
# print("KNN Reg: ", regression_score4)
# --------------------------------------------------------------------------------------------------------------------

# MLP Regression
mlp_reg = MLPRegressor()
pipeline_reg5 = Pipeline(pipeline_preprocessing + [('mlp_reg', mlp_reg)])
regression_score5 = utils.fit_and_score(all_X, labels, splitter, pipeline_reg5)
print("MLP Reg: ", regression_score5)
# --------------------------------------------------------------------------------------------------------------------

# DecisionTree Regression
decisionTree_reg = DecisionTreeRegressor()
pipeline_reg6 = Pipeline(pipeline_preprocessing + [('decisionTree_reg', decisionTree_reg)])
regression_score6 = utils.fit_and_score(all_X, labels, splitter, pipeline_reg6)
print("decisionTree Reg: ", regression_score6)
# --------------------------------------------------------------------------------------------------------------------

# ExtTree Regression
extTree_reg = ExtraTreeRegressor()
pipeline_reg7 = Pipeline(pipeline_preprocessing + [('extTree_reg', extTree_reg)])
regression_score7 = utils.fit_and_score(all_X, labels, splitter, pipeline_reg7)
print("ExtTree Reg: ", regression_score7)
# --------------------------------------------------------------------------------------------------------------------

# GradientBoosting Regression
gb_reg = GradientBoostingRegressor()
pipeline_reg8 = Pipeline(pipeline_preprocessing + [('gb_reg', gb_reg)])
regression_score8 = utils.fit_and_score(all_X, labels, splitter, pipeline_reg8)
print("GradientBoosting Reg: ", regression_score8)
# --------------------------------------------------------------------------------------------------------------------

# LinearSVC Regression
# svc_reg = LinearSVC()
# pipeline_reg9 = Pipeline(pipeline_preprocessing + [('svc_reg', svc_reg)])
# regression_score9 = utils.fit_and_score(all_X, labels, splitter, pipeline_reg9)
# print("svc Reg: ", regression_score9)
# --------------------------------------------------------------------------------------------------------------------

# LinearSVR Regression
svr_reg = LinearSVR()
pipeline_reg10 = Pipeline(pipeline_preprocessing + [('svr_reg', svr_reg)])
regression_score10 = utils.fit_and_score(all_X, labels, splitter, pipeline_reg10)
print("svr Reg: ", regression_score10)
# --------------------------------------------------------------------------------------------------------------------

# ARD Regression
ard_reg = ARDRegression()
pipeline_reg11 = Pipeline(pipeline_preprocessing + [('ard_reg', ard_reg)])
regression_score11 = utils.fit_and_score(all_X, labels, splitter, pipeline_reg11)
print("ard Reg: ", regression_score11)
# --------------------------------------------------------------------------------------------------------------------

RandomForest Reg:          r2        mse Olympics Year
0  0.818305  60.264804    2016-08-05


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Logistic Reg:          r2         mse Olympics Year
0  0.23245  254.581395    2016-08-05
SGD Reg:               r2           mse Olympics Year
0 -2.794621e+25  9.269214e+27    2016-08-05




MLP Reg:           r2        mse Olympics Year
0  0.813609  61.822099    2016-08-05
decisionTree Reg:          r2        mse Olympics Year
0  0.78061  72.767442    2016-08-05
ExtTree Reg:           r2        mse Olympics Year
0  0.792004  68.988372    2016-08-05
GradientBoosting Reg:           r2        mse Olympics Year
0  0.769455  76.467151    2016-08-05
svr Reg:           r2        mse Olympics Year
0  0.812374  62.231732    2016-08-05
ard Reg:           r2        mse Olympics Year
0  0.865194  44.712383    2016-08-05
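
The two reported columns, `r2` and `mse`, can be reproduced on toy numbers with scikit-learn's metric functions. This is only a sketch of the metrics themselves; how `utils.fit_and_score` computes them internally is an assumption.

```python
from sklearn.metrics import mean_squared_error, r2_score

# Toy medal counts and two sets of predictions.
y_true = [10, 0, 5, 20]
y_good = [8, 1, 5, 18]      # close to the truth
y_bad = [100, 100, 100, 100]  # far worse than predicting the mean

mse_good = mean_squared_error(y_true, y_good)  # mean of squared errors
r2_good = r2_score(y_true, y_good)             # 1 - SS_res / SS_tot
r2_bad = r2_score(y_true, y_bad)

print(mse_good, r2_good, r2_bad)
```

An R² below zero means the model does worse than always predicting the mean of the true values, which is how a diverged model such as the SGD run above can produce a hugely negative score.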




### Check the importance score of selected features (Select a model and get the score of features)

In [182]:
feature_imp = utils.get_feature_importances(pipeline_reg0,                     # select the regressive model for computing importance scores of features (Default: RandomForest)
#                                               all_features,                  # Compute the score of all features
                                            all_features.iloc[:, feature_id], # Compute the score of selected features: 
                                            labels, splitter, 100)
 
test_date = pd.Timestamp('08/05/2016')
feature_imp[test_date].reset_index(drop=True)

Unnamed: 0,Importance,Feature
0,0.907871,SUM(medaling_athletes.NUM_UNIQUE(athletes.Athe...
1,0.017035,TREND(countries_at_olympic_games.NUM_UNIQUE(me...
2,0.015862,MAX(countries_at_olympic_games.COUNT(medals_won))
3,0.015481,MEAN(countries_at_olympic_games.MAX(medals_won...
4,0.011639,COUNT(medals_won.Medal WHERE sports.Sport = Gy...
5,0.007288,"RATIO(COUNT(athletes WHERE Gender = Women), CO..."
6,0.007159,SKEW(countries_at_olympic_games.NUM_UNIQUE(med...
7,0.006498,MIN(atheletes.Age)
8,0.006438,"RATIO(COUNT(athletes WHERE Gender = Men), COUN..."
9,0.00473,COUNT(medaling_athletes WHERE athlete.Gender =...
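
For intuition about where such scores can come from, the sketch below reads the impurity-based `feature_importances_` attribute of a fitted random forest on synthetic data. Whether `utils.get_feature_importances` uses this attribute is an assumption; the column names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends strongly on one column and weakly on another.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["strong", "weak", "noise"])
y = 5.0 * X["strong"] + 0.5 * X["weak"] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=100, random_state=50).fit(X, y)

# Impurity-based importances are non-negative and sum to 1 across features.
importance = (
    pd.DataFrame({"Importance": forest.feature_importances_, "Feature": X.columns})
    .sort_values("Importance", ascending=False)
    .reset_index(drop=True)
)
print(importance)
```

A single dominant feature, as in the table above, often suggests that most of the remaining features add little for that particular model.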
