# Solution for Predicting Olympic Medal Count
We have a historical dataset on the modern Olympic Games, covering all the Games from Athens 1896 to Rio 2016. We want to predict the number of medals for the 2016 Olympic Games. The dataset is divided into three tables: the ground truth, the countries at each Olympics, and the specific events with medals.\
Here are the attributes of our tables.

<p><c>
    <img src="dataset.png" alt="Olympic Logo" width=500/>
</c></p>

Here we provide a solution for predicting the Olympic medal count in 2016. Now you need to **follow the provided solution and select a subset of features** that you think is helpful for predicting the medal count of the coming Olympic Games in Tokyo. You are required to finish the tasks below:
### Steps:

#### 1. Learn about the dataset and the task of predicting the Olympic medal count in the first cell, called Load data (input, ground truth, data, regression model).
#### 2. Load the suggested features from experts and automatic algorithms and <span style="color:blue">check the features in the 3rd cell by clicking the button on the right of the cell</span>.
#### 3. Check the information about the features in the table view and select a subset of features by ticking the checkboxes; submit your selection by clicking the <span style="color:blue">blue Submit Selection button in the view</span>.
#### 4. Copy the feature IDs and evaluate these features in the evaluation cell by pasting the ID list into <span style="color:blue">auto_human_feature_matrix.iloc[:,[feature id list]]</span>.

#### You can select and evaluate the features iteratively or create new features from the three tables by yourself.

## Load data

In [95]:
# load data and ground truth
import os
import pandas as pd
import utils_original as utils
import foldCode1 as load_df
import foldCode2 as our_model

DATA_DIR = os.path.join(os.getcwd(),"data/olympic_games_data")
es = utils.load_entityset(data_dir=DATA_DIR)
groundtruth_table, countries_table, events_table, cutoff_times_gt, dates, labels = load_df.loadData(es, DATA_DIR)

In [113]:
events_table

Unnamed: 0,Year,City,Sport,Athlete,Country,Gender,Medal,Age,Height,Weight
0,1896,Athens,Aquatics,"HAJOS, Alfred",HUN,Men,Gold,18.0,,
1,1896,Athens,Aquatics,"HERSCHMANN, Otto",AUT,Men,Silver,19.0,,
2,1896,Athens,Aquatics,"DRIVAS, Dimitrios",GRE,Men,Bronze,,,
3,1896,Athens,Aquatics,"MALOKINIS, Ioannis",GRE,Men,Gold,,,
4,1896,Athens,Aquatics,"CHASAPIS, Spiridon",GRE,Men,Silver,,,
...,...,...,...,...,...,...,...,...,...,...
33183,2016,Rio de Janeiro,Athletics,Zhang Wenxiu,CHN,Women,Silver,30.0,183.0,105.0
33184,2016,Rio de Janeiro,Weightlifting,Zhazira Abdrakhmanovna Zhapparkul,KAZ,Women,Silver,22.0,155.0,69.0
33185,2016,Rio de Janeiro,Wrestling,Valeriya Sergeyevna Zholobova-Koblova,RUS,Women,Silver,23.0,164.0,58.0
33186,2016,Rio de Janeiro,Volleyball,Bojana ivkovi,SRB,Women,Silver,28.0,186.0,72.0


You can add new cells to print the 3 tables (groundtruth_table, countries_table, events_table) and the input (dates) and labels.
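
As a minimal sketch of such an inspection cell, the snippet below builds a tiny stand-in frame with the same columns as `events_table` (the rows are copied from the output above) and prints its shape, column types, and missing-value counts. `sample_events` is a hypothetical name used only here; in the notebook you would call the same pandas methods on `groundtruth_table`, `countries_table`, and `events_table` directly.

```python
import pandas as pd

# Stand-in for events_table: three rows copied from the output above.
sample_events = pd.DataFrame(
    {
        "Year": [1896, 1896, 2016],
        "City": ["Athens", "Athens", "Rio de Janeiro"],
        "Sport": ["Aquatics", "Aquatics", "Athletics"],
        "Athlete": ["HAJOS, Alfred", "DRIVAS, Dimitrios", "Zhang Wenxiu"],
        "Country": ["HUN", "GRE", "CHN"],
        "Gender": ["Men", "Men", "Women"],
        "Medal": ["Gold", "Bronze", "Silver"],
        "Age": [18.0, None, 30.0],
        "Height": [None, None, 183.0],
        "Weight": [None, None, 105.0],
    }
)

print(sample_events.shape)    # (number of rows, number of columns)
print(sample_events.dtypes)   # type of each column

# Count missing values per column; early Games lack Age/Height/Weight.
missing_per_column = sample_events.isna().sum()
print(missing_per_column)
```

The missing-value counts matter for the later cells because the evaluation pipeline imputes them with `SimpleImputer` before fitting.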

## <span style="color:brown">Generate Features</span> (Edit)

In [96]:
select_human_feature_names, select_human_features = load_df.human_features() #7
# feature matrix used by the evaluation cells below
all_features = pd.read_csv("features/natural_features_all_815.csv")
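# feature description table; on_bad_lines="skip" (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
test = pd.read_csv("../source/natural_814.csv", on_bad_lines="skip")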
# check feature matrix
# all_features
test

Unnamed: 0,Checkbox,ID,Creator,Feature,Input,Description
0,check,0,auto,MEAN(countries_at_olympic_games.NUM_UNIQUE(med...,Sport,The average of the number of unique elements i...
1,check,1,auto,ABSOLUTE(MEAN(medals_won.Height)),Athelete.height,"The absolute value of the average of the ""Heig..."
2,check,2,auto,MIN(countries_at_olympic_games.MEAN(medals_won...,Athelete.height,"The minimum of the average of the ""Height"" of ..."
3,check,3,auto,MEAN(medals_won.Height),Athelete.height,"The average of the ""Height"" of all instances o..."
4,check,4,human / auto,MEAN(countries_at_olympic_games.COUNT(medals_w...,Medal,Human: The average number of Olympic Medals wo...
...,...,...,...,...,...,...
94,check,94,auto,SKEW(countries_at_olympic_games.MAX(medals_won...,Athelete.age,"The skewness of the maximum of the ""Age"" of al..."
95,check,95,auto,SKEW(countries_at_olympic_games.MEAN(medals_wo...,Athelete.weight,"The skewness of the average of the ""Weight"" of..."
96,check,96,human,RATIO(COUNT(athletes WHERE Gender = Men)-COUNT...,Athelete.gender,The ratio of male to female athletes.
97,check,97,human,RATIO(COUNT(athletes WHERE Gender = Women)-COU...,Athelete.gender,The proportion of female athletes in the compe...
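
Instead of ticking checkboxes, a list of feature IDs can also be derived from the description table with ordinary pandas filtering. The sketch below is only an illustration: `feature_table` is a hypothetical stand-in with a few rows copied from the output above (spelling of the `Input` values kept as in the data), and the real table in the notebook is the one loaded into `test`.

```python
import pandas as pd

# Stand-in for the feature description table; rows mirror the output above.
feature_table = pd.DataFrame(
    {
        "ID": [0, 1, 4, 96, 97],
        "Creator": ["auto", "auto", "human / auto", "human", "human"],
        "Input": ["Sport", "Athelete.height", "Medal",
                  "Athelete.gender", "Athelete.gender"],
    }
)

# IDs of every feature with a human creator (this includes "human / auto").
is_human = feature_table["Creator"].str.contains("human")
human_ids = feature_table.loc[is_human, "ID"].tolist()

# IDs of purely automatic features.
auto_ids = feature_table.loc[feature_table["Creator"] == "auto", "ID"].tolist()

print(human_ids)
print(auto_ids)
```

Either list could then be assigned to `feature_id` in the selection cell.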


In [None]:
#check features

In [121]:
# OPTIONAL: write new features

# all_features.insert()

# OPTIONAL: you can check single feature:
# all_features.iloc[:, #id]

# MUST EDIT1: feature selection: enter the feature IDs in the list below:
feature_id = list(range(98))  # all features, IDs 0 to 97 inclusive
#feature_id = [18,20,37,44,54,70,89,96,97]
#feature_id = [0,1,2,3,5,6,7,8,9,10,11,12,13,14,15,16,17,19,21,22,23,24,25,26,27,28,29,31,32,33,34,35,36,38,39,40,41,42,43,45,46,47,48,49,50,51,52,53,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,90,91,92,93,94,95,98]
#feature_id = [70,54,89,15,16,30]
#feature_id = [70]
#feature_id = [4,30,18,20,37,44,54,70,89,96,97,0,59,62,83,92]
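
For the optional step of writing new features, one possible starting point is a group-by aggregation over the events table. The sketch below assumes only the `Year`, `Country`, `Gender`, and `Medal` columns shown in the `events_table` output; `sample_events`, `medal_rows`, and `women_share` are hypothetical names, and the rows are a small stand-in rather than the real data.

```python
import pandas as pd

# Stand-in for events_table with only the columns this sketch needs.
sample_events = pd.DataFrame(
    {
        "Year": [1896, 1896, 1896, 1896, 2016, 2016],
        "Country": ["HUN", "GRE", "GRE", "GRE", "CHN", "KAZ"],
        "Gender": ["Men", "Men", "Men", "Men", "Women", "Women"],
        "Medal": ["Gold", "Bronze", "Gold", "Silver", "Silver", "Silver"],
    }
)

# Per country and Games: number of medal rows and share won by women.
new_feature = (
    sample_events
    .assign(is_women=sample_events["Gender"].eq("Women"))
    .groupby(["Country", "Year"])
    .agg(medal_rows=("Medal", "size"), women_share=("is_women", "mean"))
    .reset_index()
)
print(new_feature)
```

Each row of the events table is one medalist, so `medal_rows` counts medalists rather than medal events. To be usable as a predictor, such a column would still have to be aligned with the rows of `all_features` and computed only from Games before the cutoff date; that alignment is not shown here.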

## Evaluate features with the given model

In [122]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
import utils_original as utils

# parameters of the regression model
pipeline_preprocessing = [("imputer", SimpleImputer()),
                          ("scaler", RobustScaler(with_centering=True))]
# use data up to and including 2012 for training
splitter = utils.TimeSeriesSplitByDate(dates=dates, earliest_date=pd.Timestamp('01/01/2012'))

# default all features
# all_X = all_features.values  

# MUST EDIT2: selection input, you can edit here to evaluate feature subsets
all_X = all_features.iloc[:, feature_id].values

rf_regressor = RandomForestRegressor(
        n_estimators=200,
        random_state=50
    )
pipeline_reg = Pipeline(pipeline_preprocessing + [('rf_reg', rf_regressor)])
regression_score = utils.fit_and_score(all_X, labels, splitter, pipeline_reg)

print(regression_score)

# all feature, r2 = 0.81
# human_only, r2 = 0.80
# auto_only, r2 = 0.76
# pick some, r2 = 0.79
# some feature, r2 = 0.74



# [4,30,18,20,37,44,54,70,89,96,97,0,59,62,83,92]
# 0  0.810814  62.749402    2016-08-05

         r2        mse Olympics Year
0  0.810696  62.788378    2016-08-05
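
The helpers `utils.TimeSeriesSplitByDate` and `utils.fit_and_score` are not shown in this notebook. As a rough sketch of what such an evaluation may look like, the snippet below trains the same kind of pipeline on all Games up to 2012 and reports R² and MSE on 2016. Everything in it is an assumption made for illustration: the data is synthetic, the split is a plain year mask, and the real helpers may work differently.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 40 rows per Games for 2004, 2008, 2012 and 2016.
years = np.repeat([2004, 2008, 2012, 2016], 40)
X = rng.normal(size=(years.size, 3))
X[rng.random(X.shape) < 0.05] = np.nan  # sprinkle in a few missing values
y = 10 + 5 * np.nan_to_num(X[:, 0]) + rng.normal(scale=0.5, size=years.size)

# Time-based split: train on Games up to and including 2012, test on 2016.
train_mask = years <= 2012
test_mask = years == 2016

model = Pipeline([
    ("imputer", SimpleImputer()),
    ("scaler", RobustScaler(with_centering=True)),
    ("rf_reg", RandomForestRegressor(n_estimators=50, random_state=50)),
])
model.fit(X[train_mask], y[train_mask])
pred = model.predict(X[test_mask])

# Same two metrics as in the output table above.
score = pd.DataFrame({
    "r2": [r2_score(y[test_mask], pred)],
    "mse": [mean_squared_error(y[test_mask], pred)],
})
print(score)
```

Splitting by date rather than at random keeps information from the test Games out of training, which is the point of evaluating on 2016 only.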


In [100]:
# check the importance scores of the selected features
feature_imp = utils.get_feature_importances(
    pipeline_reg,
    # MUST EDIT3: edit here to check the scores of the selected features
    # (pass all_features instead to score every feature)
    all_features.iloc[:, feature_id],
    labels, splitter, 300)
 
test_date = pd.Timestamp('08/05/2016')
feature_imp[test_date].reset_index(drop=True)

Unnamed: 0,Importance,Feature
0,0.824960,SUM(medaling_athletes.NUM_UNIQUE(athletes.Athe...
1,0.053980,COUNT(medaling_athletes WHERE athlete.Gender =...
2,0.007792,MEAN(countries_at_olympic_games.NUM_UNIQUE(med...
3,0.005464,MEAN(countries_at_olympic_games.SUM(medals_won...
4,0.005082,MAX(countries_at_olympic_games.MIN(medals_won....
...,...,...
93,0.000458,MAX(countries_at_olympic_games.SUM(medals_won....
94,0.000439,ABSOLUTE(MAX(medals_won.Age))
95,0.000425,TREND(countries_at_olympic_games.MIN(medals_wo...
96,0.000421,"ABSOLUTE(TREND(medals_won.Age, Year))"


### Optional: Try multiple regression models

In [123]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import ExtraTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import LinearSVC
from sklearn.svm import LinearSVR
from sklearn.linear_model import ARDRegression

linear_reg = LinearRegression()
logistic_reg = LogisticRegression()  # note: a classifier, not a regressor; each distinct label value is treated as a class
sgd_reg = SGDRegressor()
knn_reg = KNeighborsRegressor()
mlp_reg = MLPRegressor()
decisionTree_reg = DecisionTreeRegressor()
extTree_reg = ExtraTreeRegressor()
gb_reg = GradientBoostingRegressor()
svc_reg = LinearSVC()  # note: a classifier, not a regressor; each distinct label value is treated as a class
svr_reg = LinearSVR()
ard_reg = ARDRegression()

In [124]:
pipeline_reg1 = Pipeline(pipeline_preprocessing + [('linear_reg', linear_reg)])
regression_score1 = utils.fit_and_score(all_X, labels, splitter, pipeline_reg1)
print("linear_reg: ")
print(regression_score1)

pipeline_reg2 = Pipeline(pipeline_preprocessing + [('logistic_reg', logistic_reg)])
regression_score2 = utils.fit_and_score(all_X, labels, splitter, pipeline_reg2)
print("logistic_reg: ")
print(regression_score2)

pipeline_reg3 = Pipeline(pipeline_preprocessing + [('sgd_reg', sgd_reg)])
regression_score3 = utils.fit_and_score(all_X, labels, splitter, pipeline_reg3)
print("sgd_reg: ") 
print(regression_score3)

pipeline_reg4 = Pipeline(pipeline_preprocessing + [('knn_reg', knn_reg)])
regression_score4 = utils.fit_and_score(all_X, labels, splitter, pipeline_reg4)
print("knn_reg: ")
print(regression_score4)

pipeline_reg5 = Pipeline(pipeline_preprocessing + [('mlp_reg', mlp_reg)])
regression_score5 = utils.fit_and_score(all_X, labels, splitter, pipeline_reg5)
print("mlp_reg: ")
print(regression_score5)

pipeline_reg6 = Pipeline(pipeline_preprocessing + [('decisionTree_reg', decisionTree_reg)])
regression_score6 = utils.fit_and_score(all_X, labels, splitter, pipeline_reg6)
print("decisionTree_reg: ")
print(regression_score6)

pipeline_reg7 = Pipeline(pipeline_preprocessing + [('extTree_reg', extTree_reg)])
regression_score7 = utils.fit_and_score(all_X, labels, splitter, pipeline_reg7)
print("extTree_reg: ")
print(regression_score7)

pipeline_reg8 = Pipeline(pipeline_preprocessing + [('gb_reg', gb_reg)])
regression_score8 = utils.fit_and_score(all_X, labels, splitter, pipeline_reg8)
print("gb_reg: ")
print(regression_score8)

pipeline_reg9 = Pipeline(pipeline_preprocessing + [('svc_reg', svc_reg)])
regression_score9 = utils.fit_and_score(all_X, labels, splitter, pipeline_reg9)
print("svc_reg: ")
print(regression_score9)

pipeline_reg10 = Pipeline(pipeline_preprocessing + [('svr_reg', svr_reg)])
regression_score10 = utils.fit_and_score(all_X, labels, splitter, pipeline_reg10)
print("svr_reg: ")
print(regression_score10)

pipeline_reg11 = Pipeline(pipeline_preprocessing + [('ard_reg', ard_reg)])
regression_score11 = utils.fit_and_score(all_X, labels, splitter, pipeline_reg11)
print("ard_reg: ")
print(regression_score11)

linear_reg: 
         r2        mse Olympics Year
0  0.767589  77.086065    2016-08-05


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


logistic_reg: 
        r2         mse Olympics Year
0  0.15031  281.825581    2016-08-05
sgd_reg: 
             r2           mse Olympics Year
0 -2.267728e+25  7.521612e+27    2016-08-05
knn_reg: 
         r2        mse Olympics Year
0  0.714993  94.531163    2016-08-05




mlp_reg: 
         r2        mse Olympics Year
0  0.701377  99.047554    2016-08-05
decisionTree_reg: 
         r2        mse Olympics Year
0  0.804344  64.895349    2016-08-05
extTree_reg: 
         r2   mse Olympics Year
0  0.817596  60.5    2016-08-05
gb_reg: 
        r2        mse Olympics Year
0  0.80071  66.100621    2016-08-05




svc_reg: 
         r2        mse Olympics Year
0  0.504707  164.27907    2016-08-05




svr_reg: 
         r2        mse Olympics Year
0  0.762559  78.754593    2016-08-05
ard_reg: 
        r2        mse Olympics Year
0  0.75362  81.719286    2016-08-05
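
The comparison above repeats the same three lines for every model. A more compact pattern is to loop over a dictionary of estimators and collect the scores in one table. The sketch below is one possible way to do this, under stated assumptions: it uses synthetic data and a fixed train/test split in place of `all_X`, `labels`, and `splitter`, and it calls scikit-learn's metrics directly rather than `utils.fit_and_score`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data with a linear relationship plus noise.
X = rng.normal(size=(160, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.3, size=160)
X_train, X_test, y_train, y_test = X[:120], X[120:], y[:120], y[120:]

candidates = {
    "linear_reg": LinearRegression(),
    "knn_reg": KNeighborsRegressor(),
    "gb_reg": GradientBoostingRegressor(random_state=50),
}

rows = []
for name, estimator in candidates.items():
    # Same preprocessing steps as pipeline_preprocessing in the notebook.
    pipe = Pipeline([
        ("imputer", SimpleImputer()),
        ("scaler", RobustScaler(with_centering=True)),
        (name, estimator),
    ])
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    rows.append({"model": name,
                 "r2": r2_score(y_test, pred),
                 "mse": mean_squared_error(y_test, pred)})

# One row per model, best R² first.
comparison = (
    pd.DataFrame(rows)
    .sort_values("r2", ascending=False)
    .reset_index(drop=True)
)
print(comparison)
```

Only regressors are included here. `LogisticRegression` and `LinearSVC` are classifiers that treat every distinct label value as a separate class, which may explain their lower scores in the output above.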


In [None]:
# check the importance score of selected features
feature_imp = utils.get_feature_importances(pipeline_reg8, # EDIT: model name
                                             all_features,
#                                             all_features.iloc[:, feature_id], # edit here to check the score of selected features
                                            labels, splitter, 300)
 
test_date = pd.Timestamp('08/05/2016')
feature_imp[test_date].reset_index(drop=True)