<a name="toc"></a>


# Table of Contents
<a href="#introduction">Introduction</a>

<a href="#adding_total_times">Adding Total Times</a>

<a href="#functions">Prediction Functions</a>

<a href="#eval_functions">Functions for Evaluation</a>

<a href="#normal_functions">Normal functions</a>

<a href = "#comparisons">Comparing Weights Take 1</a>

<a href = "#comparisons2">Comparing Weights Take 2</a>

<a href = "#combivars">Considering Combined Variables</a>

<a href = "#conclusions">Conclusions</a>

<!--<a href="#weighted_season">Weighted season predictions</a>-->

<!--<a href="#model_comparisons">Model comparisons</a>-->

<!--<a href="#complications">Complications</a>-->

In [1]:
# First a cell to prepare the notebook for the stuff that I might need to use. 
# More imports can be added as necessary.

# special IPython command to prepare the notebook for matplotlib
%matplotlib inline 
%load_ext memory_profiler

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
from pattern import web
import seaborn as sns
import math

# And the additional modules that I've used

import fnmatch
import os
import pickle
from PyPDF2 import PdfFileReader
from tabula import read_pdf
import urllib
import random
import sklearn
import scipy.stats as stats
import re

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

from sklearn.model_selection import GridSearchCV
from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.linear_model import ElasticNet

import joblib

import matplotlib as mpl
sns.set(color_codes=True)

<a name="introduction"></a>

# Introduction

In this notebook, we continue the process that we began in Biathlon Report A and continued in Biathlon Report B. At this point, we have collected data on which predictor variables seem to correlate most strongly with each of the pieces that make up the total time for a biathlon sprint race. For most of these pieces, we have two or three candidate variables, so even if we choose only a single variable for each piece, we still end up with 24 possible weight combinations. To compare the effectiveness of these combinations, we run a small number (100) of trials on a subset of races using each combination and see which weight combinations seem to produce the best predictions.
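
To make the count concrete: when each piece of the race time gets exactly one candidate variable, the number of combinations is the product of the number of candidates per piece. The sketch below uses hypothetical candidate lists. The assignment of variables to pieces and the split 3 x 2 x 2 x 2 x 1 are assumptions chosen only so that the product comes out to 24; the real candidates come from the earlier reports.

```python
from itertools import product

# Hypothetical candidates per piece of the race; the real lists come from
# the earlier reports and may differ.
candidates = {
    'speed': ['altitude', 'maximum_climb', 'quant_snow'],
    'prone_accuracy': ['wind_c', 'quant_weather'],
    'standing_accuracy': ['wind_c', 'quant_weather'],
    'prone_range': ['season', 'event'],
    'standing_range': ['season'],
}

# One combination = one choice per piece, in a fixed order of pieces.
weight_combinations = list(product(*candidates.values()))

print(len(weight_combinations))   # 3 * 2 * 2 * 2 * 1 = 24
print(weight_combinations[0])
```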

To begin with, there is some data that needs to be loaded in.

<a href="#toc">Table of Contents</a>

In [2]:
with open('best_variables1.pickle', 'rb') as handle:
    best_variables = pickle.load(handle)

In [3]:
absolute_mens_speed = pd.read_pickle('absolute_mens_speed.pkl')
absolute_mens_prone_range = pd.read_pickle('absolute_mens_prone_range.pkl')
absolute_mens_prone_shooting = pd.read_pickle('absolute_mens_prone_shooting.pkl')
absolute_mens_standing_range = pd.read_pickle('absolute_mens_standing_range.pkl')
absolute_mens_standing_shooting = pd.read_pickle('absolute_mens_standing_shooting.pkl')


In [4]:
wc_quant_snow_similarities = pd.read_pickle('wc_quant_snow_similarities.pkl')
ibu_quant_snow_similarities = pd.read_pickle('ibu_quant_snow_similarities.pkl')

wc_quant_weather_similarities = pd.read_pickle('wc_quant_weather_similarities.pkl')
ibu_quant_weather_similarities = pd.read_pickle('ibu_quant_weather_similarities.pkl')

wc_altitude_similarities = pd.read_pickle('wc_altitude_similarities.pkl')
ibu_altitude_similarities = pd.read_pickle('ibu_altitude_similarities.pkl')

wc_wind_c_similarities = pd.read_pickle('wc_wind_c_similarities.pkl')
ibu_wind_c_similarities = pd.read_pickle('ibu_wind_c_similarities.pkl')

wc_season_similarities = pd.read_pickle('wc_season_similarities.pkl')
ibu_season_similarities = pd.read_pickle('ibu_season_similarities.pkl')

wc_maximum_climb_similarities = pd.read_pickle('wc_maximum_climb_similarities.pkl')
ibu_maximum_climb_similarities = pd.read_pickle('ibu_maximum_climb_similarities.pkl')

wc_event_similarities = pd.read_pickle('wc_event_similarities.pkl')
ibu_event_similarities = pd.read_pickle('ibu_event_similarities.pkl')


<a name="adding_total_times"></a>

# Adding total times

One of our first steps here is to add the total race time for each racer to our sprint dataframes. To do this, we use two (slightly) different functions, one for World Cup sprint competitions and the other for IBU Cup sprint competitions. Those functions are
1. <a href="#add_total_times">```add_total_times```</a>: This function loads the pickled dataframe associated with a competition, adds a column that is the sum of the ski time and the prone and standing range times for each racer, and repickles the resulting dataframe.
2. <a href="#ibu_add_total_times">```ibu_add_total_times```</a>: This function does exactly the same thing as ```add_total_times```, but is slightly adapted to reflect the fact that a single IBU Cup event can include more than one sprint competition.
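
As a small illustration of what the new column contains, the sketch below builds a toy dataframe and adds the total column-wise rather than row by row. The racer names and times are invented. Only the column names ```Total Ski```, ```prone range```, ```standing range``` and ```Total Time``` are taken from the functions below.

```python
import pandas as pd

# Toy stand-in for one competition's dataframe; the values are invented
# and only the column names match the notebook.
event_data = pd.DataFrame({
    'Name': ['Racer A', 'Racer B'],
    'Total Ski': [1400.0, 1425.5],
    'prone range': [50.0, 75.2],
    'standing range': [48.5, 52.3],
})

# Column-wise sum: one total per racer, no explicit loop over rows.
event_data['Total Time'] = (event_data['Total Ski']
                            + event_data['prone range']
                            + event_data['standing range'])

print(event_data[['Name', 'Total Time']])
```

The column-wise form gives the same totals as looping over the rows, and it does not depend on the dataframe having a default 0, 1, 2, ... index.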

<a href="#toc">Table of Contents</a>

<a name="add_total_times"></a>

<a href="#adding_total_times">Back to Adding times</a>

In [5]:
"""
Function
--------

add_total_times : for a given world cup sprint race, loads the saved dataframe, computes 
                  the total race times as the sum of the ski times and the prone and 
                  standing range times, adds them to the dataframe, and then pickles
                  the resulting dataframe 

Parameters
----------

year : a string that codes the season under consideration. It is of the form y1y2 where
         y1 is the last two digits of the year in which the season started, and y2 is the 
         last two digits of the year in which the season ended.
event : a string that codes the event under consideration. Possible values are 'CP01', 
        'CP02', 'CP03', 'CP04', 'CP05', 'CP06', 'CP07', 'CP08', 'CP09', 'CH__', 'OG__'

Returns
-------

Stores an altered pickle file on the hard drive

Examples
--------
"""


def add_total_times(year, event):
    
    filename = 'companal_SMSP_%(year)s_%(event)s.pkl' %{'year': year, 'event' : event}
    event_data = pd.read_pickle(filename)

    for i in range(len(event_data)):
        total = (event_data.loc[i,'Total Ski'] + event_data.loc[i,'prone range'] 
                         + event_data.loc[i,'standing range'])
        event_data.loc[i,'Total Time'] = total
        
    event_data.to_pickle(filename)
    #return event_data

In [6]:
seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617','1718']
events = ['CP01','CP02','CP03','CP04','CP05','CP06','CP07','CP08','CP09','CH__','OG__']

for season in seasons:
    for event in events:
        try:
            add_total_times(season, event)
        except FileNotFoundError:  # race doesn't exist
            pass

<a name="ibu_add_total_times"></a>

<a href="#adding_total_times">Back to Adding times</a>

In [7]:
"""
Function
--------

ibu_add_total_times : for a given ibu cup sprint race, loads the saved dataframe, computes 
                      the total race times as the sum of the ski times and the prone and 
                      standing range times, adds them to the dataframe, and then pickles
                      the resulting dataframe 

Parameters
----------

year : a string that codes the season under consideration. It is of the form y1y2 where
         y1 is the last two digits of the year in which the season started, and y2 is the 
         last two digits of the year in which the season ended.
event : a string that codes the event under consideration. Possible values are 'CP01', 
        'CP02', 'CP03', 'CP04', 'CP05', 'CP06', 'CP07', 'CP08', 'CP09', 'CH__', 'OG__'
code : a string that codes the sprint competition at the given event. This is necessary
       because the ibu cup often has two different sprint races at a single event, which
       contrasts with the world cup, which never has more than one. Possible values are
       'SMSP' and 'SMSPS'

Returns
-------

Stores a pickle object on the hard drive

Examples
--------
"""


def ibu_add_total_times(year, event,code):
    
    filename = ('ibu_%(code)s_%(year)s_%(event)s.pkl'
                        %{'code' : code, 'year': year, 'event' : event})
    event_data = pd.read_pickle(filename)

    for i in range(len(event_data)):
        total = (event_data.loc[i,'Total Ski'] + event_data.loc[i,'prone range']
                         + event_data.loc[i,'standing range'])
        event_data.loc[i,'Total Time'] = total
        
    event_data.to_pickle(filename)


In [8]:
seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617','1718']
events = ['CP01','CP02','CP03','CP04','CP05','CP06','CP07','CP08','CP09','CH__','OG__']
codes = ['SMSP','SMSPS']

for season in seasons:
    for event in events:
        for code in codes:
            try:
                ibu_add_total_times(season, event, code)
            except FileNotFoundError:  # race doesn't exist
                pass

<a name="functions"></a>

# Prediction Functions

Now that total times have been added to our dataframes, we build a handful of functions that use the data from previous races to make predictions about the outcomes of later races, so that we have models to compare. Because these functions are taken directly from the previous notebook (Biathlon Report C-a), I'm going to write fairly minimally here. The five functions found below are:
1. <a href="#adjust_times">```adjust_times```</a>: It appears from looking at the data for individual racers that racer speeds generally increase somewhat linearly over time. The function adjust_times allows early career race speeds to be adjusted based on a linear fit of racer speed over all previous seasons (performed on a season by season basis) in order to try to mitigate the effects of time on speeds.
2. <a href="#build_racer_speed_distribution">```build_racer_speed_distribution```</a>: Creates a list of racer speeds by taking all of a given racer's speeds and then repeating them with a multiplicity that is dependent on the similarities between the conditions for the current competition and the individual prior competitions. A prior race with a similarity value of 1 (all predictor variables under consideration have identical or nearly identical values) would have its speed appear 10 times in the list. A prior race with a similarity value of 0.5 (the distance between the predictor variables under consideration is roughly half of the total spread for that variable) would have its speed appear 5 times on the list. ```build_racer_speed_distribution``` calls ```adjust_times``` to adjust the speeds for events that were held before the season under consideration to reflect general improvements in speed.
3. <a href="#build_racer_pr_distribution">```build_racer_pr_distribution```</a>:  Creates a list of n predicted total range times for a given racer in a given event. This is a fairly complicated process that involves the following steps:
    1. For the given racer, determine which previous races (of both types) that racer has competed in. Collect range and accuracy (either prone or standing, depending on circumstance) into a pair of lists.
    2. Using the weights associated to range times, produce a pair of weighted lists for range times and accuracy. Take a paired bootstrap sample (use the same ordered list of indices for both lists) and perform a linear regression on the results. The intercept is taken as the shooting time, and the slope is taken as the penalty loop time for any missed shots.
    3. Using the weights associated to accuracy, produce a weighted list for accuracy. The mean of this list (or rather $a = (5-\bar{x})/5$) will be taken to be the expected probability of making a particular shot. Five random values in the interval [0,1] are then computed. For each value, if the value is below $a$, the shot is made. Otherwise, the shot is missed.
    4. The total range time is calculated as the sum of the shooting time and the product of the number of missed shots and the penalty loop time.
4. <a href="#racer_time_predict">```racer_time_predict```</a>: For a given racer, this function 
    1. first calls ```build_racer_speed_distribution```. Then, for each of the n desired predictions, it randomly selects a sample of size 10 from the returned list and computes the average.
    2. next calls ```build_racer_pr_distribution``` twice, once for prone range times and once for standing range times. 
    3. finally, adds together the times in the resulting lists to produce a single list of n predicted race times for the given biathlete in the given race
5. <a href="#race_time_predictions">```race_time_predictions```</a>: calls ```racer_time_predict``` for each of the racers competing in a given race and stitches the outcomes together to form a single dataframe.
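
The range-time procedure described in item 3 can be sketched on toy data as follows. This is only an illustration under stated assumptions: the miss counts and range times are invented, `np.polyfit` stands in for ```LinearRegression```, the redraw guard is added only for the toy example, and I am reading from the formula $a = (5-\bar{x})/5$ that the accuracy data count missed shots per stage.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented history for one racer: misses per stage (out of 5 shots) and the
# matching range times. Here each miss adds exactly 25 to a base of 30.
misses = np.array([0, 1, 2, 1, 0, 3, 1, 2])
range_times = 30.0 + 25.0 * misses

# Paired bootstrap: one set of resampled indices is applied to both arrays,
# so each miss count stays attached to its own range time.
idx = rng.integers(0, len(misses), size=len(misses))
while misses[idx].max() == misses[idx].min():
    # Guard for the toy example only: a line cannot be fitted if every
    # sampled miss count is the same, so draw again.
    idx = rng.integers(0, len(misses), size=len(misses))

# Straight-line fit: slope ~ penalty loop time, intercept ~ shooting time.
loop_time, shot_time = np.polyfit(misses[idx], range_times[idx], 1)

# Each of the 5 shots is missed independently with probability
# mean(misses) / 5, i.e. hit with probability (5 - mean(misses)) / 5.
p_miss = misses.mean() / 5
n_missed = int((rng.random(5) < p_miss).sum())

# Total range time = shooting time + missed shots * penalty loop time.
predicted_range_time = shot_time + n_missed * loop_time
print(round(loop_time, 2), round(shot_time, 2), n_missed)
```

Because the toy range times are exactly linear in the number of misses, every bootstrap sample recovers the same slope and intercept. With real data, both vary from sample to sample, which is where the spread in the predictions comes from.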


<a href="#toc">Table of Contents</a>

<a name = 'adjust_times'></a>
<a href="#functions">Back to Prediction Functions</a>
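
As a rough illustration of the adjustment that ```adjust_times``` makes, the sketch below fits a line to invented speeds for a single racer and shifts each speed to a target year. The data, the variable names, and the use of `np.polyfit` in place of ```LinearRegression``` are assumptions made for the example; it is not the function itself.

```python
import numpy as np
import pandas as pd

# Invented speeds for one racer, indexed like the notebook's columns
# ('<circuit>:<season>:<event>'); the numbers are for illustration only.
speeds = pd.Series(
    [6.0, 6.1, 6.2, 6.3],
    index=['wc:0910:CP01', 'wc:1011:CP01', 'wc:1112:CP01', 'wc:1213:CP01'],
)

# Year in which each season ended, from the last two digits of the season code.
years = np.array([2000.0 + int(label.split(':')[1][2:4])
                  for label in speeds.index])

# Slope of the least-squares line of speed against year.
slope, intercept = np.polyfit(years, speeds.to_numpy(), 1)

# Shift every past speed forward to the target year along that slope.
target_year = 2014.0
adjusted = speeds + slope * (target_year - years)

print(round(slope, 3))
print(adjusted.round(2).tolist())
```

In this toy case the speeds lie exactly on a line, so every adjusted speed lands on the same value for the target year.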

In [9]:
"""
Function
--------

adjust_times : Takes a racer and his event speeds over the course of a career, finds the
               best fit line through the data, and adjusts early speeds to reflect what 
               they would be predicted to be if the race were run in the season under 
               consideration.

Parameters
----------

racer : a pandas Series of one racer's speeds, indexed by race codes of the form
        'circuit:season:event' (for example 'wc:1314:CP01')
season : a string containing the four-digit year in which the season under 
         consideration ended (for example '2014')

Returns
-------

adjusted_racer : a copy of the Series with each speed adjusted for the season under 
                 consideration

Examples
--------
"""


# To be called inside build_racer_speed_distribution

def adjust_times(racer, season):
    
    indices = racer.index.tolist()
    years = [item.split(':')[1] for item in indices]
    years = [''.join(['20',item[2:4]]) for item in years]
    years = [float(item) for item in years]
    years = np.array(years).reshape(-1,1)
    speeds = np.array(racer.tolist()).reshape(-1,1)
    
    linreg = LinearRegression()
    linreg.fit(years , speeds)
    
    coef = linreg.coef_[0][0]
    
    # Now that we know the slope of the best fit line here, I want to create a 
    # time adjusted version of the speeds
    
    adjusted_racer = racer.copy()
    
    for i in range(len(racer)):
        time_delta = float(season)- years[i][0]
        # .iloc makes the positional access explicit (the index holds race codes)
        adjusted_racer.iloc[i] = adjusted_racer.iloc[i] + coef*time_delta
        
    return adjusted_racer

<a name = 'build_racer_speed_distribution'></a>
<a href="#functions">Back to Prediction Functions</a>
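
The weighting by multiplicity can be seen in isolation in the sketch below. The speeds and similarity scores are invented. The rounding rule (round to one decimal place, then multiply by 10) is the one used in the function that follows.

```python
# Invented prior speeds and similarity scores for one racer.
prior_speeds = [6.2, 6.0, 6.4]
similarities = [1.0, 0.5, 0.04]

# Same rounding rule as the notebook: round to one decimal, multiply by 10.
multiplicities = [int(round(s, 1) * 10) for s in similarities]

# Repeat each speed according to its multiplicity.
weighted_speeds = []
for speed, count in zip(prior_speeds, multiplicities):
    weighted_speeds.extend([speed] * count)

print(multiplicities)
print(len(weighted_speeds))
```

A race with a similarity below 0.05 rounds to a multiplicity of zero and so drops out of the list entirely.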

In [10]:
"""
Function
--------

build_racer_speed_distribution : takes a racer, season, event, and dataframes containing
                                 similarity data about world cup vs world cup and world cup
                                 vs ibu cup race conditions, and returns a list of speeds
                                 from previous races, where the multiplicity of the speed
                                 is determined by the degree of similarity between the race
                                 of interest and the race from which the speed was taken

Parameters
----------

racer : a string containing the name of a biathlete
season : a string that codes the season under consideration. It is of the form y1y2 where
         y1 is the last two digits of the year in which the season started, and y2 is the 
         last two digits of the year in which the season ended.
event : a string that codes the event under consideration. Possible values are 'CP01', 
        'CP02', 'CP03', 'CP04', 'CP05', 'CP06', 'CP07', 'CP08', 'CP09', 'CH__', 'OG__'
wc_sim : a dataframe containing values between 0 and 1 which codes the degree of similarity 
         between world cup (wc) races in a pairwise fashion. Pairs with a similarity value
         of 1 were run under nearly identical conditions, while those with a similarity
         value of 0 were run under extremely different conditions
ibu_sim : a dataframe containing values between 0 and 1 which codes the degree of similarity 
          between world cup (wc) races and ibu cup races in a pairwise fashion. 

Returns
-------

racer_predict : a list of speeds derived from the speeds of the racer's previous events. Each 
                prior speed is in the list with multiplicity n, where n is the rounded
                product of 10 and the similarity score for the pairing between the current
                event and the prior event

Examples
--------
"""


def build_racer_speed_distribution(racer, season, event, wc_sim, ibu_sim):

    col_name = ':'.join(['wc',season, event])
    racer_data = absolute_mens_speed.loc[racer, :col_name]
    name = racer
    actual = float(racer_data[col_name])
    short_racer_data = racer_data[2:-1].copy()
    short_racer_data.dropna(inplace = True)
    
    # To drop later
    #short_racer_data = short_racer_data[-20:]
    
    predictors = len(short_racer_data)
    
    year = ''.join(['20',season[2:4]])
    short_racer_data = adjust_times(short_racer_data, year)
    indices = short_racer_data.index.tolist()

    race_weights = []
    for index in indices:
        split_index = index.split(':')
        try:
            if split_index[0] == 'wc':
                race_weights.append(wc_sim.loc[':'.join([season,event]),
                                               ':'.join([split_index[1],split_index[2]])])
            else:
                race_weights.append(ibu_sim.loc[':'.join([season,event]),
                                                ':'.join([split_index[1],split_index[2]])])
        except:
            race_weights.append(0.0)
    race_weights_rounded = [int(round(item,1)*10) for item in race_weights]

    # Making the weighted data list
    
    racer_predict = []

    for i in range(len(short_racer_data)): 
        for j in range(race_weights_rounded[i]):
            racer_predict.append(float(short_racer_data.iloc[i]))

    return racer_predict

<a name = 'build_racer_pr_distribution'></a>
<a href="#functions">Back to Prediction Functions</a>

In [11]:
"""
Function
--------

build_racer_pr_distribution : takes a racer, season, event, and dataframes containing
                              similarity data about world cup vs world cup and world cup
                              vs ibu cup race conditions, and returns a list of predicted
                              range times for the competition under consideration

Parameters
----------

racer : a string containing the name of a biathlete
season : a string that codes the season under consideration. It is of the form y1y2 where
         y1 is the last two digits of the year in which the season started, and y2 is the 
         last two digits of the year in which the season ended.
event : a string that codes the event under consideration. Possible values are 'CP01', 
        'CP02', 'CP03', 'CP04', 'CP05', 'CP06', 'CP07', 'CP08', 'CP09', 'CH__', 'OG__'
wc_acc_sim : a dataframe containing values between 0 and 1 which codes the degree of 
             similarity between world cup (wc) races in a pairwise fashion. Pairs with 
             a similarity value of 1 were run under nearly identical conditions, while
             those with a similarity value of 0 were run under extremely different 
             conditions. Chosen for predictive power for shooting accuracy
ibu_acc_sim : a dataframe containing values between 0 and 1 which codes the degree of 
              similarity between world cup (wc) races and ibu cup races in a pairwise 
              fashion. Chosen for predictive power for shooting accuracy
wc_range_sim : a dataframe containing values between 0 and 1 which codes the degree of 
               similarity between world cup (wc) races in a pairwise fashion. Chosen for
               predictive power for range time (shooting time together with penalty time)
ibu_range_sim : a dataframe containing values between 0 and 1 which codes the degree of 
                similarity between world cup (wc) races and ibu cup races in a pairwise 
                fashion. Chosen for predictive power for range time
n : the length of the list of predicted range times that is returned

Returns
-------

racer_predict : a list of predicted range times

Examples
--------
"""


def build_racer_pr_distribution(racer, season, event, wc_acc_sim, ibu_acc_sim,
                                wc_range_sim, ibu_range_sim,n):

    col_name = ':'.join(['wc',season, event])
    racer_time_data = absolute_mens_prone_range.loc[racer, :col_name]
    racer_shot_data = absolute_mens_prone_shooting.loc[racer, :col_name]
    name = racer
    actual = float(racer_time_data[col_name])
    short_racer_time_data = racer_time_data[2:-1].copy()
    short_racer_shot_data = racer_shot_data[2:-1].copy()
    short_racer_time_data.dropna(inplace = True)
    short_racer_shot_data.dropna(inplace = True)
    predictors = len(short_racer_shot_data)
    
    # To drop later
    
    short_racer_time_data = short_racer_time_data[-20:]
    short_racer_shot_data = short_racer_shot_data[-20:]

    
    year = ''.join(['20',season[2:4]])
    indices = short_racer_shot_data.index.tolist()
    
    # Separate weights for shooting accuracy and range times
    
    # Accuracy
    accuracy_weights = []
    for index in indices:
        split_index = index.split(':')
        if split_index[0] == 'wc':
            accuracy_weights.append(wc_acc_sim.loc[':'.join([season,event]),
                                                   ':'.join([split_index[1],split_index[2]])])
        else:
            try:
                accuracy_weights.append(ibu_acc_sim.loc[':'.join([season,event]),
                                                    ':'.join([split_index[1],split_index[2]])])
            except: # There is missing data
                accuracy_weights.append(0.0)
    accuracy_weights_rounded = [int(round(item,1)*10) for item in accuracy_weights]

    # Range times
    range_weights = []
    for index in indices:
        split_index = index.split(':')
        if split_index[0] == 'wc':
            range_weights.append(wc_range_sim.loc[':'.join([season,event]),
                                                  ':'.join([split_index[1],split_index[2]])])
        else:
            try:
                range_weights.append(ibu_range_sim.loc[':'.join([season,event]),
                                                    ':'.join([split_index[1],split_index[2]])])
            except: # There is missing data
                range_weights.append(0.0)
    range_weights_rounded = [int(round(item,1)*10) for item in range_weights]

    # And I'm going to have to split now to build my distributions
    
        # Making the weighted data list
    
    racer_accuracy = []
    racer_shot = []
    racer_time = []
    racer_predict = []

    for i in range(len(short_racer_shot_data)): 
        for j in range(accuracy_weights_rounded[i]):
            racer_accuracy.append(float(short_racer_shot_data.iloc[i]))
            
    for i in range(len(short_racer_time_data)):
        for j in range(range_weights_rounded[i]):
            racer_shot.append(float(short_racer_shot_data.iloc[i]))
            racer_time.append(float(short_racer_time_data.iloc[i]))
            
    # Build a model for shooting time and penalty loop time
    
    for k in range(n): 
        index_sample = np.random.choice(range(len(racer_shot)), 
                                                len(racer_shot), replace = True)
        racer_shot_sample = [racer_shot[i] for i in index_sample]
        racer_time_sample = [racer_time[i] for i in index_sample]
        
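        # Mean misses per stage divided by 5 shots: despite the name, this is the
        # per-shot probability of a miss, so count below tallies missed shots
        accuracy = np.mean(racer_accuracy)/5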
        
        linreg = LinearRegression()
        linreg.fit(np.array(racer_shot_sample).reshape(-1,1), 
                           np.array(racer_time_sample).reshape(-1,1))
        loop = linreg.coef_
        shot_time = linreg.intercept_
        
    # And predict number of missed shots

        shooting = np.random.sample(5)
        count = 0
        for m in range(5):
            if shooting[m] < accuracy:
                count += 1
        range_time = shot_time + count*loop
        racer_predict.append(range_time[0][0])
          
    return racer_predict



<a name = 'racer_time_predict'></a>
<a href="#functions">Back to Prediction Functions</a>
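
The ski-time step of ```racer_time_predict``` amounts to drawing ten speeds from the weighted list, averaging them, and dividing the course length by that average. The sketch below shows just that step with an invented speed list and course length. The units (metres per second and metres) are assumptions, and `rng.choice` is used where the function below calls `np.random.choice`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented weighted speed list and course length (units are assumptions:
# metres per second and metres).
speed_distribution = [6.0] * 10 + [6.2] * 5 + [6.4] * 8
length = 10000.0
n = 4

ski_time_predictions = []
for _ in range(n):
    # Draw 10 speeds with replacement and average them.
    sample = rng.choice(speed_distribution, size=10)
    ski_time_predictions.append(length / np.mean(sample))

print([round(t, 1) for t in ski_time_predictions])
```

Averaging ten draws narrows the spread of predicted ski times compared with using a single sampled speed per prediction.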

In [27]:
"""
Function
--------

racer_time_predict : produces a list of predicted times for a given racer in a given 
                     competition

Parameters
----------

racer : a string containing the name of a biathlete
year : a string that codes the season under consideration. It is of the form y1y2 where
         y1 is the last two digits of the year in which the season started, and y2 is the 
         last two digits of the year in which the season ended.
event : a string that codes the event under consideration. Possible values are 'CP01', 
        'CP02', 'CP03', 'CP04', 'CP05', 'CP06', 'CP07', 'CP08', 'CP09', 'CH__', 'OG__'
n : an integer giving the desired number of times predicted
length : a float giving the length of the ski portion of the course for the competition
         under consideration
weight_type : a list of five keys into the global weights object, in the order speed,
              prone accuracy, standing accuracy, prone range, standing range; each
              weights[key] holds the world cup and ibu similarity dataframes, in 
              that order

Returns
-------

race_predictions : an n element list containing predicted total times for the given
                   racer in the given competition

Examples
--------
"""



def racer_time_predict(racer,year,event,n,length, weight_type):
    
    # First up, produce the speed estimates
    
    weightings = weight_type
    #print weights[weightings[0]][0]
    #print weights[weightings[0]][1]
    
    speed_distribution = build_racer_speed_distribution(racer, year, event, 
                                    weights[weightings[0]][0], weights[weightings[0]][1])
    
    ski_time_predictions = []
    for i in range(n):
        speed_sample = np.random.choice(speed_distribution, 10)
        ski_time_predictions.append(length/np.mean(speed_sample))
        
    prone_range = build_racer_pr_distribution(racer, year, event, weights[weightings[1]][0], 
                                    weights[weightings[1]][1],weights[weightings[3]][0], 
                                              weights[weightings[3]][1],n)
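    # Note: build_racer_pr_distribution reads the prone range and prone shooting
    # dataframes, so this call applies the standing weights to prone data
    standing_range = build_racer_pr_distribution(racer,year,event, weights[weightings[2]][0],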
                                    weights[weightings[2]][1],weights[weightings[4]][0],
                                                 weights[weightings[4]][1],n)
    
    time_predictions = [x+y+z for x,y,z in zip(ski_time_predictions, 
                                               prone_range, standing_range)]
    
    race_predictions = [racer]
    race_predictions.extend(time_predictions)
    
    return race_predictions

<a name = 'race_time_predictions'></a>
<a href="#functions">Back to Prediction Functions</a>

In [28]:
"""
Function
--------

race_time_predictions : calls racer_time_predict repeatedly to create a dataframe of 
                        time predictions for all of the racers in a given competition

Parameters
----------

year : a string that codes the season under consideration. It is of the form y1y2 where
         y1 is the last two digits of the year in which the season started, and y2 is the 
         last two digits of the year in which the season ended.
event : a string that codes the event under consideration. Possible values are 'CP01', 
        'CP02', 'CP03', 'CP04', 'CP05', 'CP06', 'CP07', 'CP08', 'CP09', 'CH__', 'OG__'
n : an integer indicating the total number of time predictions to be made for each racer
weight_type : a list of five keys into the global weights object, passed unchanged to
              racer_time_predict

Returns
-------

predicted_times : a dataframe whose index consists of those racers who competed
                  in the given race, and whose rows are the n predicted times for those
                  racers
problem_racers : a list of racers for whom racer_time_predict was unable to be executed,
                 generally due to lack of prior race data

Examples
--------
"""


def race_time_predictions(year,event,n, weight_type):
    
    # Find the length of the race
    course_url = 'course_summary_%(year)s_M.pkl' %{'year' : year}
    course_data = pd.read_pickle(course_url)
    length = course_data.loc[course_data['Event'] == event]['Length'].tolist()[0]
        
    # Get the list of racers
    
    race_code = ':'.join(['wc', year, event])
    racer_indices = absolute_mens_speed[race_code].dropna().index.tolist()

    # Make the predictions
    
    predicted_times = []
    problem_racers = []
    
    for racer in racer_indices[:-1]:
        try:
            predicted_times.append(racer_time_predict(racer,year,event,n, length, weight_type))
        except:
            problem_racers.append(racer)
    
    predicted_times = pd.DataFrame(predicted_times)
    predicted_times.set_index(0, drop = True, inplace = True)
    
    return predicted_times, problem_racers

<a name="eval_functions"></a>

# Evaluation Functions

To evaluate the outcomes produced by our models, we need a number of functions that allow us to compare outcomes across multiple models. We define the following additional functions.
1. <a href="#place_counts">```place_counts```</a>: The function <a href='#race_time_predictions'>```race_time_predictions```</a> produces a dataframe containing one row for each competitor in a competition (with the exception of those racers who have at most a single prior race) and with $n$ columns of predicted times. The function ```place_counts``` treats each column as a running of the race, and determines, for each column, the order of finish predicted for each racer. It then aggregates this data to produce a dataframe which again has a row for each competitor, but whose columns contain the number of predicted first place finishes, second place finishes, third place finishes, etc.  
2. <a href="#finding_percentiles">```finding_percentiles```</a>: The function <a href='#race_time_predictions'>```race_time_predictions```</a> produces a dataframe containing one row for each competitor in a competition (with the exception of those racers who have at most a single prior race) and with $n$ columns of predicted times. The function ```finding_percentiles``` calculates, for each individual racer, various attributes of the distribution of times predicted for that racer, among them mean, median, 25th and 75th percentiles, and the difference between the racer's actual time and the mean of their predicted times. It then returns a dataframe which contains one row for each racer and columns for each attribute of their time distributions.
3. <a href="#evaluating_percentiles">```evaluating_percentiles```</a>: This function takes the dataframe that is output by ```finding_percentiles``` and, for each racer, determines where in that racer's distribution of time predictions their actual time falls. The results are then returned as a two column dataframe containing the racers' names in the first column and the code for the location of their actual times within their distributions as the second. The codes for the different parts of the distribution range are as follows:
    - 0 : actual time is faster than the minimum predicted time
    - 1 : actual time is between the minimum and the 5th percentile of the predicted times
    - 2 : actual time is between the 5th percentile and the 10th percentile of the predicted times
    - 3 : actual time is between the 10th percentile and the 25th percentile of the predicted times
    - 4 : actual time is between the 25th percentile and the median of the predicted times
    - 5 : actual time is between the median and the 75th percentile of the predicted times
    - 6 : actual time is between the 75th percentile and the 90th percentile of the predicted times
    - 7 : actual time is between the 90th percentile and the 95th percentile of the predicted times
    - 8 : actual time is between the 95th percentile and the maximum of the predicted times
    - 9 : actual time is slower than the maximum predicted time
4. <a href="#evaluating_place_counts">```evaluating_place_counts```</a>: This function takes the dataframe that is output by ```place_counts``` and, for each racer, determines with what frequency the predicted place is correct, within one place of being correct, within two places, within three places, within five places, within ten places, and within twenty places. (For example, a racer who actually finished 35th would have all predicted finishes between 30th and 40th counted when determining the value for being within five places of being correct.) It then returns a dataframe with one row for each competitor and one column for each of the seven categories under consideration, along with the racer's name and actual place among the racers considered.
5. <a href="#evaluating_dist_from_mean">```evaluating_dist_from_mean```</a>: This function takes the ```diff from mean``` column of the dataframe output by ```finding_percentiles```. It uses these differences to determine for what percentage of the biathletes the mean of their predicted time distribution was within 10, 25, 50, 100, 150, and 200 seconds of their actual time. The result is then returned as a dataframe with only a single column.
<!--6. <a href="#average_from_mean">```average_from_mean```</a>: This function takes the ```diff from mean``` column from the dataframe output by ```finding_percentiles``` and calculates both the mean of the values in the column and the mean of the absolute values of the entries in the column and returns both values as floats. The first computation gives us some indication of how balanced our errors in prediction are with respect to the actual times, since positive and negative values will cancel each other out. The second computation gives us an indication of overall error. In the case that the absolute values of these two numbers are the same (or nearly the same), it is an indication that the values being predicted tend to fall consistantly too high or too low.-->
<!--7. <a href="#check_predictions">```check_predictions```</a>: This function takes the dataframe produced by ```finding_percentiles``` and returns a dataframe that, for each racer, indicates (via ```True``` or ```False```) whether that racer's actual time fell in the middle 50% of his predicted time distribution and whether his actual time fell in the middle 90% of his predicted time distribution.-->
<!--8. <a href="#inside_outside">```inside_outside```</a>: This function takes the dataframe produced by ```check_predictions``` and returns two floats. The first is the percentage of the racers for whom the actual time fell inside of the middle 50% of their distribution, and the second is the percentage of the racers for whom the actual time fell inside of the middle 90% of their distribution.-->

<a href="#toc">Table of Contents</a>


<a name="place_counts"></a>
<a href="#eval_functions">Back to Evaluation Functions</a>

In [12]:
"""
Function
--------

place_counts : treats each column of a race_time_predictions dataframe as an instance
               of a race, and determines the finishing places for each racer within 
               that race

Parameters
----------

times : a dataframe that is the output of a call to race_time_predictions

Returns
-------

place_counts : a dataframe with one row per racer and one column per finishing place,
               whose entries count how many of the predicted races put that racer in
               that place

Examples
--------
"""



def place_counts(times):
    
    predicted_places = times.copy()

    for j in range(predicted_places.shape[1]):
        predicted_places.sort_values(j+1,inplace = True)
        for i in range(len(predicted_places)):
            predicted_places.iloc[i,j] = i+1

    place_counts = pd.DataFrame(columns = range(1, len(times)+1), index = times.index)
    
    for racer in place_counts.index:
        for i in place_counts.columns:
            count = len([item for item in predicted_places.loc[racer] if item == i])
            place_counts.loc[racer,i] = count
            
    return place_counts
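
The ranking-and-counting idea behind ```place_counts``` can be illustrated on a small example. The following is only a sketch on made-up data, not a replacement for the function above: the names ```toy_times``` and ```place_counts_sketch``` are hypothetical, and ties are broken by row order, which is an assumption of the sketch.

```python
import pandas as pd

# Toy stand-in for the output of race_time_predictions (values are made up):
# rows are racers, columns 1..n are simulated runnings of the race.
toy_times = pd.DataFrame(
    {1: [1500.0, 1510.0, 1490.0],
     2: [1505.0, 1495.0, 1520.0],
     3: [1480.0, 1515.0, 1500.0]},
    index=['A', 'B', 'C'])


def place_counts_sketch(times):
    # Rank within each column: 1 is the fastest predicted time in that running.
    # method='first' breaks ties by row order (an assumption of this sketch).
    places = times.rank(axis=0, method='first').astype(int)
    # For each possible place, count the runnings in which each racer got it.
    return pd.DataFrame({place: (places == place).sum(axis=1)
                         for place in range(1, len(times) + 1)})


toy_counts = place_counts_sketch(toy_times)
```

Each row of ```toy_counts``` sums to the number of simulated runnings, which is a quick sanity check that also applies to the output of ```place_counts```.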

<a name="finding_percentiles"></a>
<a href="#eval_functions">Back to Evaluation Functions</a>

In [13]:
"""
Function
--------

finding_percentiles : takes the output of a call to race_time_predictions and returns a
                      dataframe containing information about the distribution of predicted
                      times for each racer

Parameters
----------

times : a dataframe that is the output of a call to race_time_predictions
year : a string that codes the season under consideration. It is of the form y1y2 where
         y1 is the last two digits of the year in which the season started, and y2 is the 
         last two digits of the year in which the season ended.
event : a string that codes the event under consideration. Possible values are 'CP01', 
        'CP02', 'CP03', 'CP04', 'CP05', 'CP06', 'CP07', 'CP08', 'CP09', 'CH__', 'OG__'

Returns
-------

racer_percentiles : a dataframe containing the name, mean time, standard deviation of time,
                    minimum time, fifth percentile, tenth percentile, twenty-fifth percentile,
                    median, seventy-fifth percentile, ninetieth percentile, ninety-fifth
                    percentile, maximum time, actual time, and difference between the actual
                    time and the mean predicted time for each racer

Examples
--------
"""



def finding_percentiles(times, year, event):
    
    racer_percentiles = []
    
    filename = 'companal_SMSP_%(year)s_%(event)s.pkl' %{'year': year, 'event' : event}
    event_data = pd.read_pickle(filename)
    
    for i in range(len(times)):
        name = times.index[i]
        racer = [name]
        racer_data = times.iloc[i,:]
        mean = np.mean(racer_data)
        stdev = np.std(racer_data)
        minimum = min(racer_data)
        per5 = np.percentile(racer_data,5)
        per10 = np.percentile(racer_data, 10)
        per25 = np.percentile(racer_data,25)
        median = np.median(racer_data)
        per75 = np.percentile(racer_data,75)
        per90 = np.percentile(racer_data,90)
        per95 = np.percentile(racer_data,95)
        maximum = max(racer_data)
        actual = event_data.loc[event_data['Name'] == name]['Total Time'].tolist()[0]
        difference = actual - mean
        racer.extend([mean,stdev,minimum,per5, per10, per25,median,per75,per90,per95,maximum,
                      actual, difference])
        racer_percentiles.append(racer)
        
    racer_percentiles = pd.DataFrame(racer_percentiles, columns = ['Name','mean','deviation',
                                    'min', '5th per','10th per','25th per','median',
                                    '75th per','90th per', '95th per','maximum','actual time', 
                                    'diff from mean'])
    
    return racer_percentiles
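
For readers who want to see the shape of this summary without the pickled event files, here is a minimal sketch on made-up data. It assumes the actual times are supplied directly as a Series (the hypothetical ```toy_actual```) rather than read from disk, and it computes only a subset of the columns that ```finding_percentiles``` returns.

```python
import pandas as pd

# Made-up predicted times: one row per racer, one column per simulated race.
toy_times = pd.DataFrame(
    [[1500.0, 1510.0, 1520.0, 1530.0, 1540.0],
     [1600.0, 1600.0, 1600.0, 1600.0, 1600.0]],
    index=['A', 'B'], columns=[1, 2, 3, 4, 5])
# Made-up actual times, supplied directly instead of read from a pickle.
toy_actual = pd.Series({'A': 1525.0, 'B': 1590.0})


def percentile_summary(times, actual):
    # Row-wise quantiles; the default linear interpolation matches np.percentile.
    quantiles = [0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95]
    summary = times.quantile(quantiles, axis=1).T
    summary.columns = ['5th per', '10th per', '25th per', 'median',
                       '75th per', '90th per', '95th per']
    summary.insert(0, 'mean', times.mean(axis=1))
    summary['actual time'] = actual
    summary['diff from mean'] = summary['actual time'] - summary['mean']
    return summary


toy_summary = percentile_summary(toy_times, toy_actual)
```

A positive ```diff from mean``` means the racer was slower than the mean of his predicted times; a negative value means he was faster.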

<a name="evaluating_percentiles"></a>
<a href="#eval_functions">Back to Evaluation Functions</a>

In [14]:
"""
Function
--------

evaluating_percentiles : takes a dataframe that is the output of a call to 
                         finding_percentiles and returns a dataframe that indicates for each 
                         racer into what part of that racer's predicted times distribution
                         his actual time falls

Parameters
----------

percentiles : a dataframe that is the output of a call to finding_percentiles

Returns
-------

evaluation : a dataframe containing, for each racer, a code indicating in which portion 
             of that racer's predicted distribution his actual time falls.

Examples
--------
"""



def evaluating_percentiles(percentiles):
    
    evaluation = []
    for i in range(len(percentiles)):
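        racer_data = percentiles.iloc[i,:]
        # Use explicit positional access: column 0 is the name, the second-to-last
        # column is the actual time
        name = racer_data.iloc[0]
        actual = racer_data.iloc[-2]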
        if actual < percentiles.iloc[i,3]:
            loc = 0
        elif percentiles.iloc[i,3] <= actual < percentiles.iloc[i,4]:
            loc = 1
        elif percentiles.iloc[i,4] <= actual < percentiles.iloc[i,5]:
            loc = 2
        elif percentiles.iloc[i,5] <= actual < percentiles.iloc[i,6]:
            loc = 3
        elif percentiles.iloc[i,6] <= actual < percentiles.iloc[i,7]:
            loc = 4
        elif percentiles.iloc[i,7] <= actual < percentiles.iloc[i,8]:
            loc = 5
        elif percentiles.iloc[i,8] <= actual < percentiles.iloc[i,9]:
            loc = 6
        elif percentiles.iloc[i,9] <= actual < percentiles.iloc[i,10]:
            loc = 7
        elif percentiles.iloc[i,10] <= actual < percentiles.iloc[i,11]:
            loc = 8
        else:
            loc = 9

        evaluation.append([name,loc])
        
    evaluation = pd.DataFrame(evaluation, columns = ['Name','Location'])
    
    return evaluation
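
The chain of comparisons above amounts to counting how many of the nine cut points (minimum, seven percentiles, maximum) are less than or equal to the actual time. A compact way to express that, offered only as a sketch with the hypothetical name ```location_code```, uses ```np.searchsorted``` on the cut points of a single racer's predictions.

```python
import numpy as np


def location_code(predicted, actual):
    # Cut points: min, 5th, 10th, 25th, 50th, 75th, 90th, 95th percentiles, max.
    cuts = np.percentile(predicted, [0, 5, 10, 25, 50, 75, 90, 95, 100])
    # side='right' counts the cut points that are <= actual, which reproduces
    # the codes 0..9 (0: faster than the minimum, 9: at or beyond the maximum).
    return int(np.searchsorted(cuts, actual, side='right'))


# Made-up predictions: 101 evenly spaced times from 1500 to 1600 seconds.
toy_predicted = np.arange(1500.0, 1601.0)
toy_codes = [location_code(toy_predicted, t)
             for t in (1499.0, 1502.0, 1550.5, 1597.0, 1600.0)]
```

If the predicted distributions were well calibrated, roughly half of the racers would receive code 4 or 5, since those two codes cover the middle 50% of each distribution.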

<a name="evaluating_place_counts"></a>
<a href="#eval_functions">Back to Evaluation Functions</a>

In [None]:
"""
Function
--------

evaluating_place_counts : takes a dataframe that is the output of place_counts and, for each
                          racer, determines how many of the trial races had the racer placed
                          correctly, within one place of his actual finish place, within two 
                          places of his actual finish place, etc 

Parameters
----------

place_data : a dataframe that is the output of place_counts
year : a string that codes the season under consideration. It is of the form y1y2 where
       y1 is the last two digits of the year in which the season started, and y2 is the 
       last two digits of the year in which the season ended.
event : a string that codes the event under consideration. Possible values are 'CP01', 
        'CP02', 'CP03', 'CP04', 'CP05', 'CP06', 'CP07', 'CP08', 'CP09', 'CH__', 'OG__'

Returns
-------

place_evaluations : a dataframe containing a row for each racer in the competition. Each
                    row in turn contains the number of predicted finish places that were
                    correct, within one place of the true place, within two places of the 
                    true place, within three places of the true place, within five places 
                    of the true place, within ten places of the true place, and within
                    twenty places of the true place.

Examples
--------
"""



def evaluating_place_counts(place_data,year,event):
    
    # Get the actual places
    filename = 'companal_SMSP_%(year)s_%(event)s.pkl' %{'year' : year, 'event' : event}
    finish_place = pd.read_pickle(filename)[['Name','Total Time']]
    finish_place.set_index('Name', inplace = True, drop = True)
    finish_place = finish_place.loc[place_data.index]
    finish_place.sort_values('Total Time', inplace = True)
    finish_place.reset_index(inplace = True)
    finish_place.reset_index(inplace = True)
    finish_place.columns = ['Place','Name','Total Time']
    finish_place.set_index('Name', inplace = True, drop = True)
    
    place_evaluations = []
    
    for racer in place_data.index.tolist():
        racer_places = place_data.loc[racer]
        racer_actual = finish_place.loc[racer,'Place']+1
        racer_evaluation = [racer, racer_actual]
        
        for i in [0, 1, 2, 3, 5, 10, 20]:
            
            racer_range = range(racer_actual- i, racer_actual+i+1)
            possible_range = set(range(1,len(racer_places)+1))
            check_range = set(possible_range.intersection(racer_range))
            count = 0
            for j in check_range:
                count = count + racer_places[j]
            racer_evaluation.append(count)
            
        place_evaluations.append(racer_evaluation)
        
    place_evaluations = pd.DataFrame(place_evaluations)
    place_evaluations.columns = ['Name','Place','Correct','Within 1','Within 2','Within 3',
                                'Within 5','Within 10','Within 20']
    
    place_evaluations.sort_values('Place',inplace = True)
    return place_evaluations
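
To make the "within $k$ places" bookkeeping concrete, the sketch below applies it to a single racer with made-up counts. The function name ```within_k_counts``` and the toy numbers are hypothetical; the windows match the list used in ```evaluating_place_counts```.

```python
import numpy as np
import pandas as pd


def within_k_counts(racer_counts, actual_place, windows=(0, 1, 2, 3, 5, 10, 20)):
    # racer_counts is indexed by place (1..N) and holds the number of
    # simulated races in which the racer finished in that place.
    distance = np.abs(racer_counts.index.values - actual_place)
    return {k: int(racer_counts[distance <= k].sum()) for k in windows}


# Made-up counts for one racer in a 45-racer field; unlisted places are zero.
toy_counts = pd.Series(0, index=range(1, 46))
for place, count in [(29, 1), (30, 2), (34, 3), (35, 4), (36, 1), (40, 5), (41, 6)]:
    toy_counts[place] = count

toy_result = within_k_counts(toy_counts, actual_place=35)
```

For a racer who actually finished 35th, the window for $k = 5$ covers places 30 through 40, so the predicted finishes in 29th and 41st only enter the totals for $k = 10$ and $k = 20$.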

<a name="evaluating_dist_from_mean"></a>
<a href="#eval_functions">Back to Evaluation Functions</a>

In [16]:
"""
Function
--------

evaluating_dist_from_mean : takes the 'diff from mean' column produced by finding_percentiles
                            and returns a dataframe that indicates how well the centers of
                            the individual racers' time distributions align with their
                            actual times

Parameters
----------

diff_from_mean : the 'diff from mean' column (indexed 0, 1, 2, ...) of the dataframe that
                 is the output of a call to finding_percentiles

Returns
-------

evaluated_distances : a dataframe containing a single column indicating what percentage
                      of differences from the mean (the difference between the mean of a
                      racer's time distribution and his actual time) for a given competition
                      were within 10 seconds of the actual time, within 25 seconds of the 
                      actual time, within 50 seconds of the actual time, within 100 seconds
                      of the actual time, within 150 seconds of the actual time, and within
                      200 seconds of the actual time.

Examples
--------
"""

def evaluating_dist_from_mean(diff_from_mean):
    
    evaluated_distances = []
    
    for i in [10,25,50,100,150,200]:
        count = 0
        for j in range(len(diff_from_mean)):
            if abs(diff_from_mean[j])<= i:
                count += 1
        evaluated_distances.append(float(count)/len(diff_from_mean)*100)
        
    evaluated_distances = pd.DataFrame(evaluated_distances, 
                            index = ['in 10', 'in 25', 'in 50', 'in 100', 'in 150','in 200'])
    
    return evaluated_distances
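
The same percentages can be computed without explicit loops over racers. The following sketch, with the hypothetical name ```pct_within``` and made-up differences, is meant only to show the calculation on a small input.

```python
import numpy as np
import pandas as pd


def pct_within(diff_from_mean, thresholds=(10, 25, 50, 100, 150, 200)):
    # Absolute distance between each racer's actual time and predicted mean.
    abs_diff = np.abs(np.asarray(diff_from_mean, dtype=float))
    # The mean of a boolean array is the fraction of True entries.
    pct = [100.0 * np.mean(abs_diff <= t) for t in thresholds]
    return pd.DataFrame(pct, index=['in %d' % t for t in thresholds])


# Made-up differences (in seconds) for eight racers.
toy_diffs = [-5, 12, 30, -80, 120, 250, 8, -160]
toy_pct = pct_within(toy_diffs)
```

Because the thresholds are nested, the resulting percentages can never decrease from one row to the next.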

<a name="normal_functions"></a>

# Normal Predictions

In order to evaluate the quality of the predictions made by a model of this sort, it seems that there are two possibilities. The first, and most obvious, is to simply compare the outcomes of these models to the actual race times. Using this method of evaluation, we might say that if a racer's time for a particular race is 1500 seconds, and it is predicted to be 1501 seconds, that the model is good, while if the racer's time is predicted to be 1500 seconds but is actually 1530 seconds, the model is bad. Similarly, if a racer's most likely finish is predicted to be in the top 5, and that racer finishes third, we might say that the model is good, while if the racer finishes seventh, we might say that the model is bad. The problem here is that, given the nature of biathlon, if these are the requirements to find a good model, it may well be impossible to find anything but a bad model.

The second option is to compare these models with a naive model of what biathlon times might look like. That might be constructed in the following way: 

1. Observe that, for each racer, we find a wide variety of total times for the sprint race. For instance, for the last season's sprint races for Tarjei Boe, we find total times ranging from 1360 seconds to 1642 seconds, with a mean time of 1519 seconds and a standard deviation of 82.53 seconds. As a result, any naive model we choose should reflect this uncertainty.
2. Observe that it seems reasonable to think of these values as being drawn from some sort of distribution, and that the most obvious distribution is the Gaussian or normal distribution. 
3. Decide that a reasonable (naive) model would be to assume that the total times for each racer are drawn from a normal distribution, and that, further, the mean and standard deviation of total times for a racer's most recent 20 races (to account for speed drift over time) are good estimates of the actual mean and standard deviation of these normal distributions.
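
The three steps above can be sketched in a few lines. This is a minimal illustration on made-up times, not the implementation used later in this notebook: the name ```naive_normal_draws``` is hypothetical, and the fixed seed is an assumption made only so that the example is reproducible.

```python
import numpy as np


def naive_normal_draws(past_times, n, window=20, rng=None):
    # A seeded generator keeps the sketch reproducible (an assumption).
    rng = np.random.RandomState(0) if rng is None else rng
    # Keep only the most recent `window` races to allow for speed drift.
    recent = np.asarray(past_times, dtype=float)[-window:]
    mean = recent.mean()
    stdev = recent.std()  # population standard deviation, as np.std by default
    # Draw n predicted times from the fitted normal distribution.
    return rng.normal(mean, stdev, n)


# Made-up history: ten old races followed by twenty identical recent ones.
toy_history = [0.0] * 10 + [1500.0] * 20
toy_draws = naive_normal_draws(toy_history, 5)
```

In the toy history the twenty most recent times are identical, so the fitted standard deviation is zero and every draw equals 1500 seconds, which confirms that the ten older races are ignored.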

Note here that this approach requires first creating a dataframe of all the racers for whom we have data and all of the races for which we have data, ordered chronologically, in order to determine, given a racer and a race, what his 20 most recent competitions were. Thus, we begin with a <a href="#event_order">cell</a><a name="event_order_back"></a> to interleave IBU Cup and World Cup events for each season (since racers often bounce back and forth between the two levels) and <a href="#season_order">another</a><a name = "season_order_back"></a> to put the individual seasons in order. A third <a href="#collecting_times">cell</a><a name = "collecting_times_back"></a> then loops through all of the events in order, merging the subdataframes containing racer names and total times in order to create a single large dataframe, which is then pickled.
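
The merging step can be illustrated with two made-up races. The sketch below assumes each event is already available as a small dataframe with ```Name``` and ```Total Time``` columns; the racer names, times, and the variable ```toy_events``` are hypothetical, while the column labels follow the ```cup:season:event``` pattern used in this notebook.

```python
import pandas as pd

# Two made-up events, already in chronological order.
race_1 = pd.DataFrame({'Name': ['A', 'B'], 'Total Time': [1500.0, 1520.0]})
race_2 = pd.DataFrame({'Name': ['B', 'C'], 'Total Time': [1490.0, 1550.0]})
toy_events = [('wc:0405:CP01', race_1), ('wc:0405:CP02', race_2)]

all_times = None
for colname, df in toy_events:
    # Label each event's times with its own column name before merging.
    renamed = df.rename(columns={'Total Time': colname})
    if all_times is None:
        all_times = renamed
    else:
        # An outer merge keeps racers who appear in only some events.
        all_times = all_times.merge(renamed, how='outer', on='Name')

all_times = all_times.set_index('Name')
```

Racers who skipped an event end up with a missing value in that event's column, which is why missing values have to be dropped before a racer's most recent races are selected.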

Given these assumptions, we build our model using the following functions:
1. <a href="#racer_from_normal">```racer_from_normal```</a>: This function takes the name of a racer and the codes (year and event) specifying a particular sprint competition, and finds the mean and standard deviation of the racer's times for his most recent 20 sprint competitions (or all sprint competitions, in the event that he has fewer than 20). These are then used to define a normal distribution from which $n$ predicted times are randomly drawn.
2. <a href="#predictions_from_normal">```predictions_from_normal```</a>: This function calls ```racer_from_normal``` for each competitor in a given race and returns a dataframe with a row for each competitor and $n$ columns, one for each predicted time (for a given competitor).
<!--3. <a href="#normal_evaluate_on_season">```normal_evaluate_on_season```</a>: This function calls ```predictions_from_normal``` for each event in the given season. It then returns four objects (which are the same as the objects returned by <a href="#evaluate_on_season">```evaluate_on_season```</a> in <a href="#weighted_season">weighted season prediction</a>).-->
<!--    1. median_places_correct : a dataframe containing the concatenated outputs of the medians of the results of <a href="#evaluating_place_counts">```evaluating_place_counts```</a> for the events in the given season. In other words, a dataframe which contains one column for each event in the season, whose entries are the medians of the columns of the dataframes produced by ```evaluating_place_counts```. -->
 <!--   2. distance_percentages : a dataframe containing the concatenated outputs of all the results of <a href="#evaluating_dist_from_mean">```evaluating_dist_from_mean```</a> for the events in the given season.-->
 <!--   3. averages_from_mean : an array containing the outputs of <a href="#average_from_mean">```average_from_mean```</a> for each event in the given season.-->
 <!--   4. percentages_in_center : an array containing the outputs of <a href="#inside_outside">```inside_outside```</a> for each event of the season-->




<a href="#toc">Table of Contents</a>

<a name="event_order"></a>
<a href="#event_order_back">Back to Normal Functions</a>

In [17]:
events_0405 = [['ibu', 'CP01'], ['ibu', 'CP02'], ['companal', 'CP01'], ['companal', 'CP02'], 
               ['ibu', 'CP03'], ['companal', 'CP03'], ['ibu', 'CP04'], ['companal', 'CP04'], 
               ['ibu', 'CP05'], ['companal', 'CP05'], ['ibu', 'CP06'], ['companal', 'CP06'], 
               ['companal', 'CP07'], ['companal', 'CP08'], ['ibu', 'CH__'], 
               ['companal', 'CH__'], ['ibu', 'CP07'], ['ibu', 'CP08'], ['companal', 'CP09']]

events_0506 = [['companal', 'CP01'], ['ibu', 'CP01'], ['companal', 'CP02'], ['ibu', 'CP02'], 
               ['ibu', 'CP03'], ['companal', 'CP03'], ['companal', 'CP04'], 
               ['companal', 'CP05'], ['ibu', 'CP04'], ['ibu', 'CP05'], ['companal', 'CP06'],
               ['companal', 'OG__'], ['ibu', 'CH__'], ['companal', 'CP07'], ['ibu', 'CP06'], 
               ['companal', 'CP08'], ['ibu', 'CP07'], ['companal', 'CP09']]

events_0607 = [['companal', 'CP01'], ['companal', 'CP02'], ['ibu', 'CP01'], 
               ['companal', 'CP03'], ['ibu', 'CP02'], ['companal', 'CP04'], 
               ['companal', 'CP05'], ['ibu', 'CP03'], ['companal', 'CP06'], 
               ['ibu', 'CP04'], ['companal', 'CH__'], ['ibu', 'CP06'], ['ibu', 'CH__'],
               ['companal', 'CP07'], ['companal', 'CP08'], ['ibu', 'CP07'], 
               ['companal', 'CP09'], ['ibu', 'CP08']]

events_0708 = [['ibu', 'CP01'], ['companal', 'CP01'], ['ibu', 'CP02'], ['companal', 'CP02'],
               ['ibu', 'CP03'], ['companal', 'CP03'], ['ibu', 'CP04'], ['companal', 'CP04'],
               ['companal', 'CP05'], ['ibu', 'CP05'], ['companal', 'CP06'], ['ibu', 'CP06'],
               ['companal', 'CH__'], ['ibu', 'CH__'], ['companal', 'CP07'], 
               ['companal', 'CP08'], ['ibu', 'CP07'], ['companal', 'CP09'], ['ibu', 'CP08']]

events_0809 = [['ibu', 'CP01'], ['companal', 'CP01'], ['companal', 'CP02'], ['ibu', 'CP02'],
               ['companal', 'CP03'], ['ibu', 'CP03'], ['companal', 'CP04'], ['ibu', 'CP04'],
               ['companal', 'CP05'], ['ibu', 'CP05'], ['companal', 'CP06'], ['ibu', 'CP06'], 
               ['companal', 'CH__'], ['ibu', 'CP07'], ['ibu', 'CH__'], ['companal', 'CP07'], 
               ['ibu', 'CP08'], ['companal', 'CP08'], ['companal', 'CP09']]

events_0910 = [['ibu', 'CP01'], ['companal', 'CP01'], ['companal', 'CP02'], ['ibu', 'CP02'],
               ['companal', 'CP03'], ['ibu', 'CP03'], ['companal', 'CP04'], ['ibu', 'CP04'],
               ['companal', 'CP05'], ['ibu', 'CP05'], ['companal', 'CP06'], ['ibu', 'CP06'],
               ['ibu', 'CP07'], ['companal', 'OG__'], ['ibu', 'CH__'], ['companal', 'CP07'],
               ['ibu', 'CP08'], ['companal', 'CP08'], ['companal', 'CP09']]

events_1011 = [['ibu', 'CP01'], ['companal', 'CP01'], ['companal', 'CP02'], ['ibu', 'CP02'],
               ['companal', 'CP03'], ['ibu', 'CP03'], ['companal', 'CP04'], ['ibu', 'CP04'],
               ['companal', 'CP05'], ['ibu', 'CP05'], ['companal', 'CP06'], 
               ['companal', 'CP07'], ['ibu', 'CP06'], ['companal', 'CP08'], ['ibu', 'CP07'], 
               ['ibu', 'CH__'], ['companal', 'CH__'], ['ibu', 'CP08'], ['companal', 'CP09']]

events_1112 = [['ibu', 'CP01'], ['companal', 'CP01'], ['companal', 'CP02'], ['ibu', 'CP02'],
               ['companal', 'CP03'], ['ibu', 'CP03'], ['companal', 'CP04'], ['ibu', 'CP04'], 
               ['companal', 'CP05'], ['ibu', 'CP05'], ['companal', 'CP06'], ['ibu', 'CH__'],
               ['companal', 'CP07'], ['companal', 'CP08'], ['ibu', 'CP06'], ['ibu', 'CP07'], 
               ['companal', 'CH__'], ['ibu', 'CP08'], ['companal', 'CP09']]

events_1213 = [['companal', 'CP01'], ['ibu', 'CP01'], ['ibu', 'CP02'], ['companal', 'CP02'], 
               ['companal', 'CP03'], ['ibu', 'CP03'], ['companal', 'CP04'], ['ibu', 'CP04'],
               ['companal', 'CP05'], ['ibu', 'CP05'], ['companal', 'CP06'], ['ibu', 'CP06'], 
               ['companal', 'CH__'], ['ibu', 'CP07'], ['ibu', 'CH__'], ['companal', 'CP07'],
               ['companal', 'CP08'], ['companal', 'CP09']]

events_1314 = [['companal', 'CP01'], ['ibu', 'CP01'], ['ibu', 'CP02'], ['companal', 'CP02'], 
               ['companal', 'CP03'], ['ibu', 'CP03'], ['companal', 'CP04'], ['ibu', 'CP04'],
               ['companal', 'CP05'], ['ibu', 'CP05'], ['companal', 'CP06'], ['ibu', 'CP06'],
               ['ibu', 'CH__'], ['companal', 'OG__'], ['ibu', 'CP07'], ['companal', 'CP07'],
               ['companal', 'CP08'], ['ibu', 'CP08'], ['companal', 'CP09']]

events_1415 = [['companal', 'CP01'], ['ibu', 'CP01'], ['companal', 'CP02'], ['ibu', 'CP02'],
               ['companal', 'CP03'], ['ibu', 'CP03'], ['companal', 'CP04'], ['ibu', 'CP04'], 
               ['companal', 'CP05'], ['ibu', 'CP05'], ['companal', 'CP06'], ['ibu', 'CH__'], 
               ['companal', 'CP07'], ['ibu', 'CP06'], ['companal', 'CP08'], ['ibu', 'CP07'], 
               ['companal', 'CH__'], ['ibu', 'CP08'], ['companal', 'CP09']]

events_1516 = [['companal', 'CP01'], ['ibu', 'CP01'], ['companal', 'CP02'], ['ibu', 'CP02'], 
               ['companal', 'CP03'], ['ibu', 'CP03'], ['companal', 'CP04'], ['ibu', 'CP04'], 
               ['companal', 'CP05'], ['ibu', 'CP05'], ['companal', 'CP06'], ['ibu', 'CP06'], 
               ['companal', 'CP07'], ['companal', 'CP08'], ['ibu', 'CP07'], ['ibu', 'CH__'], 
               ['companal', 'CH__'], ['ibu', 'CP08'], ['companal', 'CP09']]

events_1617 = [['companal', 'CP01'], ['ibu', 'CP01'], ['companal', 'CP02'], ['ibu', 'CP02'], 
               ['companal', 'CP03'], ['ibu', 'CP03'], ['companal', 'CP04'], ['ibu', 'CP04'], 
               ['companal', 'CP05'], ['ibu', 'CP05'], ['companal', 'CP06'], ['ibu', 'CH__'], 
               ['ibu', 'CP06'], ['companal', 'CH__'], ['companal', 'CP07'], ['ibu', 'CP07'], 
               ['companal', 'CP08'], ['ibu', 'CP08'], ['companal', 'CP09']]

events_1718 = [['ibu', 'CP01'], ['companal', 'CP01'], ['ibu', 'CP02'], ['companal', 'CP02'],
               ['companal', 'CP03'], ['ibu', 'CP03'], ['companal', 'CP04'], ['ibu', 'CP04'],
               ['companal', 'CP05'], ['ibu', 'CP05'], ['companal', 'CP06'], ['ibu', 'CH__'], 
               ['ibu', 'CP06'], ['companal', 'OG__'], ['companal', 'CP07'], ['ibu', 'CP07'],
               ['companal', 'CP08'], ['ibu', 'CP08'], ['companal', 'CP09']              ]

<a name="season_order"></a>
<a href="#season_order_back">Back to Normal Functions</a>

In [18]:
ordered_events ={'0405' : events_0405, '0506' : events_0506, '0607' : events_0607,
                 '0708' : events_0708, '0809' : events_0809, '0910' : events_0910, 
                 '1011' : events_1011, '1112' : events_1112, '1213' : events_1213, 
                 '1314' : events_1314, '1415' : events_1415, '1516' : events_1516, 
                 '1617' : events_1617, '1718' : events_1718}

<a name="collecting_times"></a>
<a href="#collecting_times_back">Back to Normal Functions</a>

In [19]:
# And collecting the total time data

absolute_mens_time = pd.DataFrame(columns = ['Name'])

seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617','1718']

for season in seasons:
    events = ordered_events[season]
    for event in events:
        if event[0] == 'ibu':
            filename1 = ('%(cup)s_SMSP_%(season)s_%(event)s.pkl' 
                             %{'cup' : event[0], 'season' : season, 'event' : event[1]})
            filename2 = ('%(cup)s_SMSPS_%(season)s_%(event)s.pkl' 
                             %{'cup' : event[0], 'season' : season, 'event' : event[1]})

            colname1 = ':'.join(['ibu',season,event[1]])
            colname2 = ':'.join(['ibuS',season,event[1]])
            try:
                df = pd.read_pickle(filename1)[['Name','Total Time']]
                df.columns = ['Name',colname1]
                absolute_mens_time = absolute_mens_time.merge(df, how = 'outer', on = 'Name')
            
            except:
                #print season, event, 'has no companal file'
                pass
            try:
                df = pd.read_pickle(filename2)[['Name','Total Time']]
                df.columns = ['Name',colname2]
                absolute_mens_time = absolute_mens_time.merge(df, how = 'outer', on = 'Name')
            
            except:
                #print season, event, 'has no companal file'
                pass
        else:
            filename = ('%(cup)s_SMSP_%(season)s_%(event)s.pkl' 
                            %{'cup' : event[0], 'season' : season, 'event' : event[1]})
            colname = ':'.join(['wc',season,event[1]])
            
            try:
                df = pd.read_pickle(filename)[['Name','Total Time']]
                df.columns = ['Name',colname]
                absolute_mens_time = absolute_mens_time.merge(df, how = 'outer', on = 'Name')
            
            except:
                print('%s %s has no companal file' % (season, event))
                #pass

last_row = len(absolute_mens_time)

for col in absolute_mens_time.columns.tolist():
    absolute_mens_time.loc[last_row, col] = "".join(['20', col[2:4]])
    
for i in range(len(absolute_mens_time)):
    absolute_mens_time.loc[i,'count'] = absolute_mens_time.loc[i].count() - 1
    
absolute_mens_time.loc[last_row,'Name'] = 'Year'

absolute_mens_time.set_index('Name', drop=True, inplace=True)

0607 ['companal', 'CP08'] has no companal file
0607 ['companal', 'CP09'] has no companal file
0809 ['companal', 'CH__'] has no companal file
1314 ['companal', 'CP05'] has no companal file
1516 ['companal', 'CP05'] has no companal file
1617 ['companal', 'CP06'] has no companal file
1718 ['companal', 'CP05'] has no companal file


In [20]:
absolute_mens_time.to_pickle('absolute_mens_time.pkl')

<a name="racer_from_normal"></a>
<a href="#normal_functions">Back to Normal Functions</a>

In [21]:
"""
Function
--------

racer_from_normal : predicts n race times for a given racer in a given event based on the
                    premise that the racer's times are normally distributed and that the 
                    distribution is well represented by the racer's times on his previous 20
                    races

Parameters
----------

racer : a string containing the name of a biathlete
year : a string that codes the season under consideration. It is of the form y1y2 where
         y1 is the last two digits of the year in which the season started, and y2 is the 
         last two digits of the year in which the season ended.
event : a string that codes the event under consideration. Possible values are 'CP01', 
        'CP02', 'CP03', 'CP04', 'CP05', 'CP06', 'CP07', 'CP08', 'CP09', 'CH__', 'OG__'
n : an integer giving the number of predicted times desired for the given racer.

Returns
-------

racer_predict : a list of n predicted race times

Examples
--------
"""


def racer_from_normal(racer, year, event, n):
    
    col_name = ':'.join(['wc',year, event])
    racer_data = absolute_mens_time.loc[racer, :col_name]
    name = racer
    actual = float(racer_data[col_name])
    short_racer_data = racer_data[2:-1].copy()
    short_racer_data.dropna(inplace = True)
    
    short_racer_data = short_racer_data[-20:]
    predictors = len(short_racer_data)
    
    mean = np.mean(short_racer_data)
    stdev = np.std(short_racer_data)
    
    predictions = np.random.normal(mean, stdev, n)
    
    racer_predict = [name]#, actual]
    
    racer_predict.extend(predictions)

    return racer_predict

<a name="predictions_from_normal"></a>
<a href="#normal_functions">Back to Normal Functions</a>

In [22]:
"""
Function
--------

predictions_from_normal : for a given competition, repeats racer_from_normal for all
                          competitors and returns the time predictions in the form of
                          a dataframe

Parameters
----------

year : a string that codes the season under consideration. It is of the form y1y2 where
         y1 is the last two digits of the year in which the season started, and y2 is the 
         last two digits of the year in which the season ended.
event : a string that codes the event under consideration. Possible values are 'CP01', 
        'CP02', 'CP03', 'CP04', 'CP05', 'CP06', 'CP07', 'CP08', 'CP09', 'CH__', 'OG__'
n : an integer giving the number of predicted times desired for the given race.

Returns
-------

predicted_times : a dataframe containing all n of the predicted times for each racer.
problem_racers : a list of racers for whom racer_from_normal failed to execute. This
                 is typically due to a lack of prior races for a given competitor.

Examples
--------
"""



def predictions_from_normal(year, event, n):
    
    # Get the list of racers
    
    race_code = ':'.join(['wc', year, event])
    racer_indices = absolute_mens_speed[race_code].dropna().index.tolist()

    # Make the predictions
    
    predicted_times = []
    problem_racers = []
    
    for racer in racer_indices[:-1]:
        try:
            predicted_times.append(racer_from_normal(racer, year,event,n))
        except Exception:  # record the racer and move on; do not swallow interrupts
            problem_racers.append(racer)
    
    predicted_times = pd.DataFrame(predicted_times)
    predicted_times.set_index(0, drop = True, inplace = True)
    
    return predicted_times, problem_racers

<a name = "comparisons"></a>

# Comparing Weights Take 1

We begin here by carrying over the predictor variables that seemed to have the most impact on the various pieces of our model in our previous notebooks. This leaves us with a total of $2 \times 1 \times 2 \times 3 \times 2 = 24$ combinations to check, in contrast with the $13^5 = 371293$ possibilities if we were to consider every possible combination of 5 weights (and far more than that if we were to include multiple different variables for a single piece of the model). From there, a series of nested ```for``` loops creates an array of weight combinations to try.

Of course, in order to choose a weight combination, we need some way of comparing what comes out of the prediction functions. We use two functions here:
1. <a href = "#compare_weightings">```compare_weightings```</a>: this function takes a competition and a collection of weight combinations and, for each combination in the collection, predicts a fixed number of times for each racer. It then calls <a href = "#evaluating_place_counts"> ```evaluating_place_counts```</a> and <a href="#evaluating_dist_from_mean">```evaluating_dist_from_mean```</a> to measure the goodness of the results for each combination under consideration.
2. <a href = "#pluck_best_weights">```pluck_best_weights```</a>: this function takes either of the two dataframes output by ```compare_weightings``` and assigns scores to each weight combination based on how it fares in comparison with the other weight combinations in each of the evaluation categories.

Next, I randomly selected ten events from the 2009-2010 to 2013-2014 seasons and ran ```compare_weightings``` and ```pluck_best_weights``` on each of them using a fairly small value of $n$ (100 trials), storing the outputs in the dataframes ```race_place_scores``` and ```race_distance_scores```. At that point, I had information about which of the weightings under consideration were best for each individual race, but the idea was to find the weighting (or weightings) that would fare best across all ten of these events, under the theory that a weighting that fared well across all ten randomly chosen events would likely fare well more generally. For this, I added two more functions:
3. <a href = "#score_of_scores">```score_of_scores```</a>: takes a dataframe produced by concatenating the outputs of ```pluck_best_weights``` via ```compare_weightings``` over a collection of events and, for each column, gives a score of 3 for every value at or above the 83rd percentile for that column, a score of 2 for every remaining value at or above the 75th percentile, and a score of -1.5 for every value at or below the 17th percentile. The scores are then summed across columns to give one total per weighting.
4. <a href = "#rank_weightings">```rank_weightings```</a>: takes two dataframes produced by concatenating the outputs of ```pluck_best_weights``` via ```compare_weightings``` over a collection of events and  applies ```score_of_scores``` to each of them. It then considers four results:
    1. the output of ```score_of_scores``` applied to the race distances dataframe,
    2. the output of ```score_of_scores``` applied to the race places dataframe,
    3. the sum (across columns) of the race distances dataframe, and
    4. the sum (across columns) of the race places dataframe,
    
and, for each category, gives a positive point if the value is in the top quartile and a negative point if it is in the bottom quartile.


I initially chose to keep all weightings for which the score was at least 3. That left me with only 3 weightings, and running only 100 trials on each weighting meant that my results were perhaps less stable than I would have liked, so I lowered the threshold score to 2. This gave me 8 weightings to carry forward to the second round of comparing weights.

<a href="#toc">Table of Contents</a>

In [23]:
weights = {"quant_snow" : [wc_quant_snow_similarities, ibu_quant_snow_similarities],
"quant_weather" : [wc_quant_weather_similarities, ibu_quant_weather_similarities],
"altitude" : [wc_altitude_similarities, ibu_altitude_similarities],
"wind_c" : [wc_wind_c_similarities, ibu_wind_c_similarities],
"quant_season" : [wc_season_similarities, ibu_season_similarities],
"max_climb" : [wc_maximum_climb_similarities, ibu_maximum_climb_similarities],
"quant_event" : [wc_event_similarities, ibu_event_similarities]}


peak memory: 201.68 MiB, increment: 0.00 MiB


In [24]:
best_variables

{'prone_acc': ['quant_weather'],
 'prone_range': ['quant_weather', 'quant_snow', 'quant_event'],
 'speed': ['wind_c', 'quant_snow'],
 'standing_acc': ['altitude', 'wind_c'],
 'standing_range': ['quant_weather', 'quant_snow']}

In [25]:
weight_combos = []
for speed in best_variables['speed']:
    for pronea in best_variables['prone_acc']:
        for standinga in best_variables['standing_acc']:
            for proner in best_variables['prone_range']:
                for standingr in best_variables['standing_range']:
                    weight_list = [speed, pronea, standinga, proner, standingr]
                    weight_combos.append(weight_list)
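
The nested loops above build the Cartesian product of the candidate variables for each piece of the model. As a hedged alternative sketch, the same list can be produced with ```itertools.product```; the ```order``` list is a name introduced here for illustration, and ```best_variables``` is copied from the cell output above so that the block runs on its own.

```python
from itertools import product

# Copied from the best_variables output above so the sketch is self-contained
best_variables = {
    'prone_acc': ['quant_weather'],
    'prone_range': ['quant_weather', 'quant_snow', 'quant_event'],
    'speed': ['wind_c', 'quant_snow'],
    'standing_acc': ['altitude', 'wind_c'],
    'standing_range': ['quant_weather', 'quant_snow'],
}

# Same nesting order as the loops: the outermost loop comes first
order = ['speed', 'prone_acc', 'standing_acc', 'prone_range', 'standing_range']

# product varies the last position fastest, like the innermost loop
weight_combos = [list(combo)
                 for combo in product(*(best_variables[key] for key in order))]
```

Because ```product``` varies its last argument fastest, the combinations come out in the same order as the nested loops, so positional indices into ```weight_combos``` keep the same meaning.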
                    

<a name = "compare_weightings"></a>
<a href = "#comparisons">Back to Comparing Weights Take 1</a>

In [29]:
"""
Function
--------

compare_weightings : for a given competition and each of a collection of weights, predicts
                     a set number of times for each racer, and then uses 
                     evaluating_place_counts and evaluating_dist_from_mean to measure
                     the goodness of results for each weight under consideration

Parameters
----------

year : a string that codes the season under consideration. It is of the form y1y2 where
         y1 is the last two digits of the year in which the season started, and y2 is the 
         last two digits of the year in which the season ended.
event : a string that codes the event under consideration. Possible values are 'CP01', 
        'CP02', 'CP03', 'CP04', 'CP05', 'CP06', 'CP07', 'CP08', 'CP09', 'CH__', 'OG__'
n : an integer indicating the total number of time predictions to be made for each racer
weights : a list of lists containing the world cup and ibu similarity weightings
          for the speed prediction and prone and standing range predictions


Returns
-------

race_places1 : A dataframe that contains one column for each weight that was considered 
               for a given race, and whose rows consist of the median number of correct 
               places (averaged over all of the competitors in the race), the median number 
               of times the prediction was within one place of the actual race, and so on 
               for two places, three places, five places, ten places, and twenty places.
race_distances1 : A dataframe that contains one column for each weight that was considered 
                  for a given race, and whose rows consist of the percentage of racers for 
                  whom the mean of the distribution is within 10 seconds of their actual 
                  times, within 25 seconds of their actual times, and so on for 50 seconds, 
                  100 seconds, 150 seconds, and 200 seconds

Examples
--------
"""


def compare_weightings(year, event, n, weights):

    race_places1 = pd.DataFrame()
    race_distances1 = pd.DataFrame()

    for i in range(len(weights)): 
        #print 'Beginning cycle', i
        weightings = weights[i]
        print weightings
        predicted_times, problem_racers = race_time_predictions(year,event,
                                                n, weightings)
                                        
        places = place_counts(predicted_times)
                    
        found_percentiles = finding_percentiles(predicted_times, year,event)
                    
        place_order_evaluations = evaluating_place_counts(places,year,event)
        place_order = pd.DataFrame(place_order_evaluations.median(axis = 'rows'), 
                                               columns = [i])
        race_places1 = race_places1.merge(place_order, how='outer', left_index=True,
                                                    right_index=True)
                    
        distances = evaluating_dist_from_mean(found_percentiles['diff from mean'])
        distances.columns = [i]
        race_distances1 = race_distances1.merge(distances, how='outer', 
                                                          left_index=True, right_index=True)
    
    return race_places1, race_distances1

In [30]:
race_places, race_distances = compare_weightings('1314', 'CP02', 10,weight_combos)

['wind_c', 'quant_weather', 'altitude', 'quant_weather', 'quant_weather']
['wind_c', 'quant_weather', 'altitude', 'quant_weather', 'quant_snow']
['wind_c', 'quant_weather', 'altitude', 'quant_snow', 'quant_weather']
['wind_c', 'quant_weather', 'altitude', 'quant_snow', 'quant_snow']
['wind_c', 'quant_weather', 'altitude', 'quant_event', 'quant_weather']
['wind_c', 'quant_weather', 'altitude', 'quant_event', 'quant_snow']
['wind_c', 'quant_weather', 'wind_c', 'quant_weather', 'quant_weather']
['wind_c', 'quant_weather', 'wind_c', 'quant_weather', 'quant_snow']
['wind_c', 'quant_weather', 'wind_c', 'quant_snow', 'quant_weather']
['wind_c', 'quant_weather', 'wind_c', 'quant_snow', 'quant_snow']
['wind_c', 'quant_weather', 'wind_c', 'quant_event', 'quant_weather']
['wind_c', 'quant_weather', 'wind_c', 'quant_event', 'quant_snow']
['quant_snow', 'quant_weather', 'altitude', 'quant_weather', 'quant_weather']
['quant_snow', 'quant_weather', 'altitude', 'quant_weather', 'quant_snow']
['quant_s

<a name = "pluck_best_weights"></a>
<a href = "#comparisons">Back to Comparing Weights Take 1</a>

In [31]:
"""
Function
--------

pluck_best_weights : takes a dataframe output by compare_weightings and assigns
                     scores based on how each weighting fares in comparison with
                     the others in each of the categories


Parameters
----------

df : a dataframe produced by compare_weightings (either race_places or race_distances)

Returns
-------

score : a pandas series of integers calculated by assigning 4 points for each time
        a particular weighting scores the best of all weightings, 2 points for each
        time it scores in the top quartile, and 1 point for each time it scores in the
        top half and then summing these values

Examples
--------
"""


def pluck_best_weights(df):
    
    score = pd.DataFrame(columns = df.columns.tolist(), index = df.index.tolist())
    if len(df) >= 8:
        for i in range(0,len(df)):
            #print df.iloc[i]
            max_value = np.percentile(df.iloc[i,:],100)
            quartile_3 = np.percentile(df.iloc[i,:],75)
            median = np.percentile(df.iloc[i,:],50)
        #print i, max_value,quartile_3, median
            for j in range(df.shape[1]):
                if df.iloc[i,j] == max_value:
                    score.iloc[i,j] = 4
                elif quartile_3 <= df.iloc[i,j] < max_value:
                    score.iloc[i,j] = 2
                elif median <= df.iloc[i,j] < quartile_3:
                    score.iloc[i,j] = 1
                else:
                    score.iloc[i,j] = 0
    else:
        for i in range(0,len(df)):
            #print df.iloc[i]
            max_value = np.percentile(df.iloc[i,:],100)
            quartile_3 = np.percentile(df.iloc[i,:],200/len(df))
            median = np.percentile(df.iloc[i,:],50)
        #print i, max_value,quartile_3, median
            for j in range(df.shape[1]):
                if df.iloc[i,j] == max_value:
                    score.iloc[i,j] = 4
                elif quartile_3 <= df.iloc[i,j] < max_value:
                    score.iloc[i,j] = 2
                elif median <= df.iloc[i,j] < quartile_3:
                    score.iloc[i,j] = 1
                else:
                    score.iloc[i,j] = 0

        
                
    score = score.sum(axis = 'rows')
    

    return score
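
To make the scoring rule concrete, here is a small sketch that applies the 4 / 2 / 1 / 0 rule to a toy dataframe. It mirrors only the branch used when the dataframe has at least eight rows (maximum, 75th percentile, median); the helper name ```score_row``` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def score_row(values):
    """Score one row: 4 for the maximum, 2 for the rest of the top
    quartile, 1 for the rest of the top half, 0 otherwise."""
    values = np.asarray(values, dtype=float)
    top = values.max()
    q3 = np.percentile(values, 75)
    med = np.percentile(values, 50)
    points = np.zeros(len(values), dtype=int)
    # Later assignments overwrite earlier ones, so the best category wins
    points[values >= med] = 1
    points[values >= q3] = 2
    points[values == top] = 4
    return points


# Toy data: rows are evaluation categories, columns are weightings
toy = pd.DataFrame([[10, 20, 30, 40, 50],
                    [5, 5, 1, 9, 7]],
                   index=['category A', 'category B'],
                   columns=[0, 1, 2, 3, 4])

per_row = pd.DataFrame([score_row(row) for row in toy.values],
                       index=toy.index, columns=toy.columns)
# One total per weighting, summed over the categories
totals = per_row.sum(axis=0)
```

In this toy case weightings 3 and 4 each total 6 points (a 2 and a 4 in some order), while the others total 1.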

In [32]:
# Draw 10 random events from the seasons 2009-10, 2010-11, 2011-12, 2012-13, 2013-14

seasons = ['0910', '1011', '1112', '1213', '1314']
events = ['CP01','CP02','CP03','CP04','CP05','CP06','CP07','CP08','CP09','CH__']

chosen_events = []
for i in range(10):
    redraw = 0
    while redraw == 0:
        competition = [random.choice(seasons), random.choice(events)]
        # In the 0910 and 1314 seasons the 'CH__' slot is coded 'OG__'. Rename
        # before the duplicate check so the same event cannot be drawn twice.
        if competition[0] in ['0910', '1314']:
            if competition[1] == 'CH__':
                competition[1] = 'OG__'
        redraw = 1
        if competition in chosen_events:
            redraw = 0
        if competition[0] == '1314':
            if competition[1] == 'CP05':
                redraw = 0

    chosen_events.append(competition)
    
chosen_events

[['1213', 'CP08'],
 ['1213', 'CP06'],
 ['1314', 'CP09'],
 ['0910', 'CP08'],
 ['1011', 'CP04'],
 ['1213', 'CH__'],
 ['1011', 'CP01'],
 ['1112', 'CP05'],
 ['1213', 'CP04'],
 ['1112', 'CP03']]
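
As a hedged alternative to the redraw loop, the events can be sampled without replacement from the full list of season/event pairs, which guarantees ten distinct competitions in a single call. The function name ```draw_events``` and the ```seed``` argument are assumptions introduced for this sketch.

```python
import random
from itertools import product

seasons = ['0910', '1011', '1112', '1213', '1314']
events = ['CP01', 'CP02', 'CP03', 'CP04', 'CP05',
          'CP06', 'CP07', 'CP08', 'CP09', 'CH__']


def draw_events(k, seed=None):
    """Sample k distinct [season, event] pairs, excluding 1314 CP05."""
    rng = random.Random(seed)
    pool = [[s, e] for s, e in product(seasons, events)
            if (s, e) != ('1314', 'CP05')]
    chosen = rng.sample(pool, k)
    # In the 0910 and 1314 seasons the 'CH__' slot is coded 'OG__'
    for competition in chosen:
        if competition[0] in ('0910', '1314') and competition[1] == 'CH__':
            competition[1] = 'OG__'
    return chosen


chosen = draw_events(10, seed=1)
```

Since each pair appears in the pool exactly once, no duplicate check is needed, and renaming after sampling cannot introduce a repeat.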

OK, so now what I want to do is to run through these events using all of the weightings, but I need some way of keeping track of how well each weighting does for each race...

In [33]:
race_place_scores = pd.DataFrame()
race_distance_scores = pd.DataFrame()

for competition in chosen_events:
    year = competition[0]
    event = competition[1]
    print year, event
    race_places, race_distances = compare_weightings(year, event, 100,weight_combos)
    
    distances = pluck_best_weights(race_distances)
    places = pluck_best_weights(race_places)
    
    race_place_scores = pd.concat([race_place_scores, places], axis = 'columns')
    race_distance_scores = pd.concat([race_distance_scores, distances], axis = 'columns')

1213 CP08
['wind_c', 'quant_weather', 'altitude', 'quant_weather', 'quant_weather']
['wind_c', 'quant_weather', 'altitude', 'quant_weather', 'quant_snow']
['wind_c', 'quant_weather', 'altitude', 'quant_snow', 'quant_weather']
['wind_c', 'quant_weather', 'altitude', 'quant_snow', 'quant_snow']
['wind_c', 'quant_weather', 'altitude', 'quant_event', 'quant_weather']
['wind_c', 'quant_weather', 'altitude', 'quant_event', 'quant_snow']
['wind_c', 'quant_weather', 'wind_c', 'quant_weather', 'quant_weather']
['wind_c', 'quant_weather', 'wind_c', 'quant_weather', 'quant_snow']
['wind_c', 'quant_weather', 'wind_c', 'quant_snow', 'quant_weather']
['wind_c', 'quant_weather', 'wind_c', 'quant_snow', 'quant_snow']
['wind_c', 'quant_weather', 'wind_c', 'quant_event', 'quant_weather']
['wind_c', 'quant_weather', 'wind_c', 'quant_event', 'quant_snow']
['quant_snow', 'quant_weather', 'altitude', 'quant_weather', 'quant_weather']
['quant_snow', 'quant_weather', 'altitude', 'quant_weather', 'quant_snow']

['quant_snow', 'quant_weather', 'wind_c', 'quant_weather', 'quant_weather']
['quant_snow', 'quant_weather', 'wind_c', 'quant_weather', 'quant_snow']
['quant_snow', 'quant_weather', 'wind_c', 'quant_snow', 'quant_weather']
['quant_snow', 'quant_weather', 'wind_c', 'quant_snow', 'quant_snow']
['quant_snow', 'quant_weather', 'wind_c', 'quant_event', 'quant_weather']
['quant_snow', 'quant_weather', 'wind_c', 'quant_event', 'quant_snow']
1213 CH__
['wind_c', 'quant_weather', 'altitude', 'quant_weather', 'quant_weather']
['wind_c', 'quant_weather', 'altitude', 'quant_weather', 'quant_snow']
['wind_c', 'quant_weather', 'altitude', 'quant_snow', 'quant_weather']
['wind_c', 'quant_weather', 'altitude', 'quant_snow', 'quant_snow']
['wind_c', 'quant_weather', 'altitude', 'quant_event', 'quant_weather']
['wind_c', 'quant_weather', 'altitude', 'quant_event', 'quant_snow']
['wind_c', 'quant_weather', 'wind_c', 'quant_weather', 'quant_weather']
['wind_c', 'quant_weather', 'wind_c', 'quant_weather', '

['quant_snow', 'quant_weather', 'altitude', 'quant_weather', 'quant_weather']
['quant_snow', 'quant_weather', 'altitude', 'quant_weather', 'quant_snow']
['quant_snow', 'quant_weather', 'altitude', 'quant_snow', 'quant_weather']
['quant_snow', 'quant_weather', 'altitude', 'quant_snow', 'quant_snow']
['quant_snow', 'quant_weather', 'altitude', 'quant_event', 'quant_weather']
['quant_snow', 'quant_weather', 'altitude', 'quant_event', 'quant_snow']
['quant_snow', 'quant_weather', 'wind_c', 'quant_weather', 'quant_weather']
['quant_snow', 'quant_weather', 'wind_c', 'quant_weather', 'quant_snow']
['quant_snow', 'quant_weather', 'wind_c', 'quant_snow', 'quant_weather']
['quant_snow', 'quant_weather', 'wind_c', 'quant_snow', 'quant_snow']
['quant_snow', 'quant_weather', 'wind_c', 'quant_event', 'quant_weather']
['quant_snow', 'quant_weather', 'wind_c', 'quant_event', 'quant_snow']


In [34]:
race_place_scores

Unnamed: 0,0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9
0,10.0,17.0,13.0,14.0,12.0,19.0,7.0,22.0,9.0,16.0
1,14.0,24.0,11.0,10.0,22.0,12.0,8.0,20.0,21.0,16.0
2,12.0,17.0,9.0,22.0,30.0,14.0,20.0,22.0,6.0,19.0
3,9.0,16.0,14.0,24.0,18.0,19.0,17.0,9.0,19.0,17.0
4,17.0,6.0,19.0,9.0,18.0,8.0,12.0,10.0,18.0,11.0
5,15.0,7.0,17.0,17.0,24.0,13.0,14.0,15.0,4.0,12.0
6,13.0,20.0,21.0,23.0,17.0,9.0,13.0,16.0,19.0,15.0
7,12.0,12.0,14.0,23.0,16.0,13.0,16.0,22.0,13.0,15.0
8,12.0,7.0,20.0,17.0,16.0,7.0,15.0,20.0,17.0,19.0
9,12.0,15.0,22.0,19.0,17.0,10.0,13.0,12.0,11.0,10.0


In [35]:
race_distance_scores

Unnamed: 0,0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9
0,4.0,10.0,12.0,10.0,12.0,6.0,10.0,16.0,12.0,16.0
1,12.0,12.0,14.0,12.0,16.0,4.0,8.0,18.0,14.0,12.0
2,8.0,16.0,12.0,10.0,16.0,6.0,12.0,14.0,12.0,16.0
3,12.0,14.0,14.0,12.0,18.0,2.0,14.0,12.0,16.0,14.0
4,6.0,14.0,8.0,10.0,14.0,6.0,16.0,10.0,10.0,12.0
5,8.0,16.0,12.0,6.0,10.0,8.0,10.0,12.0,12.0,10.0
6,6.0,10.0,14.0,8.0,16.0,8.0,10.0,12.0,14.0,14.0
7,10.0,10.0,16.0,12.0,14.0,10.0,8.0,16.0,18.0,16.0
8,6.0,14.0,12.0,10.0,14.0,8.0,8.0,16.0,14.0,16.0
9,4.0,10.0,10.0,14.0,10.0,12.0,10.0,16.0,12.0,14.0


<a name = "score_of_scores"></a>
<a href = "#comparisons">Back to Comparing Weights Take 1</a>

In [36]:
"""
Function
--------

score_of_scores : takes a dataframe and, for each column, gives a score of 3 for every
                  value at or above the 83rd percentile for that column, a score of 2 
                  for every remaining value at or above the 75th percentile, and a score 
                  of -1.5 for every value at or below the 17th percentile

Parameters
----------

df : a dataframe that is the output of pluck_best_weights applied to one of the outputs
    of compare_weightings

Returns
-------

scores : a dataframe that contains a score for each of the weightings under consideration

Examples
--------
"""


def score_of_scores(df):
    
    scores = pd.DataFrame(index = df.index.tolist(), columns = ['scores'])
    scores.fillna(0, inplace = True)
    for j in range(df.shape[1]):
        # find the 83rd percentile
        best = np.percentile(df.iloc[:,j], 83)
        # find the 75th percentile
        better = np.percentile(df.iloc[:,j],75)
        # find the 17th percentile
        bad = np.percentile(df.iloc[:,j],17)
        for i in range(len(df)):
        # score 3 for everything above the 83rd percentile
            if df.iloc[i,j] >= best:
                scores.loc[i,'scores'] = scores.loc[i,'scores'] + 3
        # score 2 for everything above the 75th percentile
            elif better <= df.iloc[i,j] < best:
                scores.loc[i,'scores'] = scores.loc[i,'scores'] + 2
        # score -1.5 for everything below the 17th percentile
            elif df.iloc[i,j] <= bad:
                scores.loc[i,'scores'] = scores.loc[i,'scores'] - 1.5
        
    return scores
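
The percentile thresholds can also be applied a whole column at a time. The following is a sketch of that idea rather than a drop-in replacement: the names ```score_column``` and ```score_of_scores_sketch``` and the toy data are assumptions, and the sketch returns a series instead of a one-column dataframe.

```python
import numpy as np
import pandas as pd


def score_column(col):
    """Score one column: 3, 2, -1.5 or 0, checked in that order."""
    col = np.asarray(col, dtype=float)
    best, better, bad = np.percentile(col, [83, 75, 17])
    # np.select uses the first condition that holds, like an if/elif chain
    return np.select([col >= best, col >= better, col <= bad],
                     [3.0, 2.0, -1.5], default=0.0)


def score_of_scores_sketch(df):
    """Sum the per-column scores across columns: one total per row."""
    # Positional access keeps this working when column labels repeat
    per_column = np.column_stack(
        [score_column(df.iloc[:, j]) for j in range(df.shape[1])])
    return pd.Series(per_column.sum(axis=1), index=df.index)


# Toy data: 11 weightings scored over two races
toy = pd.DataFrame({'race_a': list(range(10, 120, 10)),
                    'race_b': list(range(110, 0, -10))})
toy_scores = score_of_scores_sketch(toy)
```

Positional column access matters here because concatenating unnamed series, as the loop over ```chosen_events``` does, produces columns that all share the same label.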

OK, so now I have some way of calculating which weightings fare the best when it comes to predictions. In fact, I have four different values: the sum of race distance scores, the sum of race place scores, and the score of scores for both race distance scores and race place scores. I want to give each weighting a point for each time it scores in the top quartile (and a negative point for each time it scores in the bottom quartile), and then choose the top 5-6 weightings.

<a name = "rank_weightings"></a>
<a href = "#comparisons">Back to Comparing Weights Take 1</a>

In [37]:
"""
Function
--------

rank_weightings : takes two dataframes produced by concatenating the outputs of 
                  pluck_best_weights via compare_weightings over a collection of 
                  events, applies score_of_scores to each of them, and, for
                  each weighting and each of the four resulting measures, gives a 
                  positive point if the weighting is in the top quartile and a 
                  negative point if it is in the bottom quartile

Parameters
----------

df1 : a dataframe that is the output of pluck_best_weights applied to one of the outputs
      of compare_weightings
df2 : a dataframe that is the output of pluck_best_weights applied to the other output
      of compare_weightings

Returns
-------

scores : a dataframe with an integer value between -4 and 4 for each weighting under
         consideration

Examples
--------
"""


def rank_weightings(df1, df2):
    
    scores = pd.DataFrame(index = df1.index.tolist(), columns = ['scores']).fillna(0)
    
    # Deal with df1
    
    score_scores = score_of_scores(df1)
    sum_of_scores = df1.sum(axis = 'columns')
    
    #return score_scores
    for i in range(len(df1)):
        if score_scores.loc[i,'scores'] >= np.percentile(score_scores, 75):
            scores.loc[i,'scores'] = scores.loc[i,'scores'] + 1
        elif score_scores.loc[i,'scores'] <= np.percentile(score_scores,25):
            scores.loc[i,'scores'] = scores.loc[i,'scores'] - 1
            
    for i in range(len(df1)):
        if sum_of_scores.loc[i] >= np.percentile(sum_of_scores, 75):
            scores.loc[i,'scores'] = scores.loc[i,'scores'] + 1
        elif sum_of_scores.loc[i] <= np.percentile(sum_of_scores,25):
            scores.loc[i,'scores'] = scores.loc[i,'scores'] - 1

    # Deal with df2
    
    score_scores = score_of_scores(df2)
    sum_of_scores = df2.sum(axis = 'columns')
    
    for i in range(len(df1)):
        if score_scores.loc[i,'scores'] >= np.percentile(score_scores, 75):
            scores.loc[i,'scores'] = scores.loc[i,'scores'] + 1
        elif score_scores.loc[i,'scores'] <= np.percentile(score_scores,25):
            scores.loc[i,'scores'] = scores.loc[i,'scores'] - 1
            
    for i in range(len(df1)):
        if sum_of_scores.loc[i] >= np.percentile(sum_of_scores, 75):
            scores.loc[i,'scores'] = scores.loc[i,'scores'] + 1
        elif sum_of_scores.loc[i] <= np.percentile(sum_of_scores,25):
            scores.loc[i,'scores'] = scores.loc[i,'scores'] - 1

    return scores
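
The four loops in ```rank_weightings``` all apply the same rule, so the idea can be sketched with a single helper that is applied to each measure in turn. The names ```quartile_points``` and ```rank_weightings_sketch``` and the toy values are assumptions made for illustration; in the notebook the four measures would be the two ```score_of_scores``` columns and the two row sums.

```python
import numpy as np
import pandas as pd


def quartile_points(values):
    """+1 at or above the 75th percentile, -1 at or below the 25th, else 0."""
    values = pd.Series(values, dtype=float)
    upper = np.percentile(values, 75)
    lower = np.percentile(values, 25)
    points = pd.Series(0, index=values.index)
    points[values <= lower] = -1
    # Assigned last so the top quartile wins if the two cutoffs coincide,
    # matching the if/elif order in rank_weightings
    points[values >= upper] = 1
    return points


def rank_weightings_sketch(measures):
    """Add up the quartile points over any number of measures."""
    return sum(quartile_points(m) for m in measures)


# Toy measure for eight weightings, and the same values in reverse order
sums = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
example = quartile_points(sums)
combined = rank_weightings_sketch([sums, sums[::-1].reset_index(drop=True)])
```

With two measures that rank the weightings in exactly opposite orders, the points cancel and every weighting ends up with a combined score of zero.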

In [38]:
weight_rankings = rank_weightings(race_place_scores, race_distance_scores)

In [39]:
weight_combos_to_keep = [i for i in weight_rankings.index.tolist() 
                                     if weight_rankings.loc[i,'scores']>=3]
weight_combos_to_keep

[13, 15, 19]

Hmm, this is rather fewer of these than I might like. What happens if I look at the outputs?

In [40]:
weight_rankings

Unnamed: 0,scores
0,-4
1,0
2,1
3,0
4,-4
5,-4
6,-2
7,0
8,-3
9,-4


In [41]:
weight_combos_to_keep = [i for i in weight_rankings.index.tolist() 
                                     if weight_rankings.loc[i,'scores']>=2]
weight_combos_to_keep

[12, 13, 14, 15, 16, 18, 19, 21]

In [42]:
kept_weight_combos = []
for i in range(len(weight_combos)):
    if i in weight_combos_to_keep:
        kept_weight_combos.append(weight_combos[i])
        
kept_weight_combos

[['quant_snow', 'quant_weather', 'altitude', 'quant_weather', 'quant_weather'],
 ['quant_snow', 'quant_weather', 'altitude', 'quant_weather', 'quant_snow'],
 ['quant_snow', 'quant_weather', 'altitude', 'quant_snow', 'quant_weather'],
 ['quant_snow', 'quant_weather', 'altitude', 'quant_snow', 'quant_snow'],
 ['quant_snow', 'quant_weather', 'altitude', 'quant_event', 'quant_weather'],
 ['quant_snow', 'quant_weather', 'wind_c', 'quant_weather', 'quant_weather'],
 ['quant_snow', 'quant_weather', 'wind_c', 'quant_weather', 'quant_snow'],
 ['quant_snow', 'quant_weather', 'wind_c', 'quant_snow', 'quant_snow']]

So now I have 8 weightings that seem to be the best of the best. It's probably worth looking at what they are, and then I want to run cycles of 1000 trials on them to see how they fare.

<a name = "comparisons2"></a>


# Comparing Weights Take 2

Having reduced the number of weight combinations under consideration to 8, I wanted to compare them for a larger number of trials over a new list of 10 events. My initial evaluation of their relative merits failed to show that any weightings were as much stronger than the others as I would have <a href = "#too_weak">liked</a>. As a result, I put together three additional functions for comparing the outcomes that I got:
1. <a href = "#score_the_scores_v2">```score_the_scores_v2```</a>: this function takes a dataframe produced by concatenating outputs of <a href = "#pluck_best_weights">```pluck_best_weights```</a> and, for each column, assigns a score of 4 for each entry that is the maximal value for that column, a score of 2 for each other entry that is in the top quartile of values for that column, and a score of -1.5 for each entry that is in the bottom quartile for that column. Scores are then summed along rows in order to produce a single score for each row (corresponding to a weight combination) in the dataframe.
2. <a href = "#ranking_points">```ranking_points```</a>: this function takes a one-column dataframe and assigns points to each index based on the order of the values in the column (first place gets 1 point, second place gets 2, and so on; tied values get the points for their highest ranking).
3. <a href = "#ranking_by_places">```ranking_by_places```</a>: this function takes two dataframes df1 and df2 obtained by concatenating the outputs of ```pluck_best_weights``` applied to the results of <a href = "#compare_weightings">```compare_weightings```</a> over a selection of races. It then uses ```score_the_scores_v2``` and ```ranking_points``` to assign four different place values to each weight combination. The sums of these four values are then returned in a dataframe.
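
The tie rule described for ```ranking_points``` (tied values share their highest ranking) corresponds to what pandas calls ```method='min'``` ranking. The following is only a sketch of that rule under two assumptions: that larger values rank first, and that the input is a series rather than a one-column dataframe. The name ```ranking_points_sketch``` and the example totals are illustrative.

```python
import pandas as pd


def ranking_points_sketch(column):
    """1 point for the largest value, 2 for the next, and so on.
    Tied values all receive the best (smallest) rank among them."""
    return column.rank(method='min', ascending=False).astype(int)


# Illustrative totals for four weightings, indexed by weighting number
totals = pd.Series([168.0, 164.0, 164.0, 150.0], index=[7, 1, 4, 3])
points = ranking_points_sketch(totals)
```

Under this rule the two weightings tied at 164 both receive 2 points, and the next weighting receives 4 rather than 3, so a lower point total indicates a better overall placing.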



<a href = "#toc">Table of Contents</a>

Next, I set up 500 trials per weighting on a new randomly selected batch of 10 races from the seasons 2009-2010 to 2013-2014 to see what comes out.

In [43]:
seasons = ['0910', '1011', '1112', '1213', '1314']
events = ['CP01','CP02','CP03','CP04','CP05','CP06','CP07','CP08','CP09','CH__']

chosen_events = []
for i in range(10):
    redraw = 0
    while redraw == 0:
        competition = [random.choice(seasons), random.choice(events)]
        # In the 0910 and 1314 seasons the 'CH__' slot is coded 'OG__'. Rename
        # before the duplicate check so the same event cannot be drawn twice.
        if competition[0] in ['0910', '1314']:
            if competition[1] == 'CH__':
                competition[1] = 'OG__'
        redraw = 1
        if competition in chosen_events:
            redraw = 0
        if competition[0] == '1314':
            if competition[1] == 'CP05':
                redraw = 0

    chosen_events.append(competition)
    
chosen_events

[['1213', 'CH__'],
 ['1314', 'CP08'],
 ['1314', 'CP09'],
 ['1112', 'CP06'],
 ['1112', 'CH__'],
 ['1314', 'CP06'],
 ['1213', 'CP05'],
 ['1314', 'CP03'],
 ['0910', 'CP02'],
 ['1213', 'CP03']]

In [44]:
race_place_scores = pd.DataFrame()
race_distance_scores = pd.DataFrame()

for competition in chosen_events:
    year = competition[0]
    event = competition[1]
    print year, event
    race_places, race_distances = compare_weightings(year, event, 500,kept_weight_combos)
    
    distances = pluck_best_weights(race_distances)
    places = pluck_best_weights(race_places)
    
    race_place_scores = pd.concat([race_place_scores, places], axis = 'columns')
    race_distance_scores = pd.concat([race_distance_scores, distances], axis = 'columns')

1213 CH__
['quant_snow', 'quant_weather', 'altitude', 'quant_weather', 'quant_weather']
['quant_snow', 'quant_weather', 'altitude', 'quant_weather', 'quant_snow']
['quant_snow', 'quant_weather', 'altitude', 'quant_snow', 'quant_weather']
['quant_snow', 'quant_weather', 'altitude', 'quant_snow', 'quant_snow']
['quant_snow', 'quant_weather', 'altitude', 'quant_event', 'quant_weather']
['quant_snow', 'quant_weather', 'wind_c', 'quant_weather', 'quant_weather']
['quant_snow', 'quant_weather', 'wind_c', 'quant_weather', 'quant_snow']
['quant_snow', 'quant_weather', 'wind_c', 'quant_snow', 'quant_snow']
1314 CP08
['quant_snow', 'quant_weather', 'altitude', 'quant_weather', 'quant_weather']
['quant_snow', 'quant_weather', 'altitude', 'quant_weather', 'quant_snow']
['quant_snow', 'quant_weather', 'altitude', 'quant_snow', 'quant_weather']
['quant_snow', 'quant_weather', 'altitude', 'quant_snow', 'quant_snow']
['quant_snow', 'quant_weather', 'altitude', 'quant_event', 'quant_weather']
['quant_s

In [45]:
race_distance_scores

Unnamed: 0,0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9
0,12.0,12.0,12.0,18.0,12.0,14.0,20.0,20.0,16.0,20.0
1,10.0,20.0,20.0,20.0,16.0,18.0,16.0,14.0,14.0,16.0
2,10.0,16.0,14.0,16.0,16.0,18.0,18.0,14.0,16.0,20.0
3,12.0,20.0,20.0,18.0,12.0,10.0,16.0,16.0,8.0,18.0
4,14.0,18.0,12.0,18.0,18.0,16.0,20.0,14.0,14.0,20.0
5,18.0,14.0,14.0,16.0,16.0,12.0,14.0,16.0,12.0,18.0
6,18.0,14.0,16.0,18.0,12.0,12.0,16.0,20.0,12.0,18.0
7,18.0,16.0,18.0,16.0,12.0,16.0,22.0,20.0,14.0,16.0


In [46]:
race_place_scores

Unnamed: 0,0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9
0,18.0,11.0,11.0,23.0,11.0,9.0,24.0,13.0,11.0,9.0
1,11.0,9.0,20.0,20.0,11.0,21.0,6.0,13.0,15.0,10.0
2,22.0,19.0,21.0,15.0,5.0,12.0,15.0,10.0,23.0,17.0
3,7.0,7.0,17.0,7.0,12.0,7.0,11.0,13.0,9.0,11.0
4,6.0,15.0,12.0,10.0,18.0,5.0,10.0,8.0,7.0,14.0
5,16.0,12.0,20.0,11.0,9.0,16.0,17.0,13.0,19.0,10.0
6,19.0,20.0,9.0,11.0,10.0,22.0,14.0,26.0,6.0,15.0
7,7.0,10.0,12.0,10.0,22.0,8.0,15.0,9.0,11.0,20.0


<a name = "too_weak"></a>
<a href = "#comparisons2">Back to Comparing Weights Take 2</a>

In [47]:
rank_weightings(race_distance_scores, race_place_scores)

Unnamed: 0,scores
0,1
1,2
2,2
3,-3
4,0
5,-1
6,1
7,1


In [48]:
race_distance_scores.sum(axis = 'columns').sort_values()

3    150.0
5    150.0
0    156.0
6    156.0
2    158.0
1    164.0
4    164.0
7    168.0
dtype: float64

In [50]:
score_of_scores(race_distance_scores).sort_values('scores')

Unnamed: 0,scores
5,-1.5
6,1.5
3,3.0
7,4.5
2,7.5
1,9.0
4,9.0
0,10.5


In [51]:
race_place_scores.sum(axis = 'columns').sort_values()

3    101.0
4    105.0
7    124.0
1    136.0
0    140.0
5    143.0
6    152.0
2    159.0
dtype: float64

In [52]:
score_of_scores(race_place_scores).sort_values('scores')

Unnamed: 0,scores
4,-6.0
3,-3.0
7,1.5
0,6.0
1,7.5
5,9.0
6,9.0
2,13.5


Looking at the scoring, it is unclear whether any weightings are significantly better than the others, so I'm going to go back to the dataframes ```race_place_scores``` and ```race_distance_scores``` and see what I can tease out from those. In particular, I'm going to change ```score_of_scores``` slightly in order to weight having the best result in a particular race more heavily (from 3 to 4), while leaving the other score values alone. The cutoffs also change: the top score now goes only to the column maximum, and the other two scores use the third and first quartiles. This gives the following function.

<a name = "score_the_scores_v2"></a>
<a href = "#comparisons2">Back to Comparing Weights Take 2</a>

In [57]:
"""
Function
--------

score_the_scores_v2 : takes a dataframe and, for each column, assigns a score of 4 for
                      each entry that is the maximal value for that column, a score of
                      2 for each entry that is in the top quartile of values for that 
                      column, and a score of -1.5 for each entry that is in the bottom
                      quartile for that column. Scores are then summed along rows in 
                      order to produce a single score for each row (which correspond
                      to weight combinations) in the dataframe.

Parameters
----------

df : a dataframe that is the output of running compare_weights and pluck_best_weights
     over a selection of races. It could be either the dataframe of distance measures
     or the dataframe of place accuracy

Returns
-------

summed_scores : a pandas Series containing the summed scores assigned to each weight
                combination (index value)

Examples
--------
"""

def score_the_scores_v2(df):
    
    scores = pd.DataFrame(index = df.index.tolist(), columns = range(0,df.shape[1]))
    
    for i in range(df.shape[1]):
        max_value = max(df.iloc[:,i])
        #print max_value
        quart_3 = np.percentile(df.iloc[:,i],75)
        #print quart_3
        #print max_value - quart_3
        quart_1 = np.percentile(df.iloc[:,i],25)
        
        for j in range(len(df)):
            if df.iloc[j,i] >= max_value:
                scores.iloc[j,i] = 4
            elif max_value > df.iloc[j,i] >= quart_3:
                scores.iloc[j,i] = 2
            elif df.iloc[j,i] <= quart_1:
                scores.iloc[j,i] = -1.5
            else:
                scores.iloc[j,i] = 0
    scores = pd.DataFrame(scores)
    summed_scores = scores.sum(axis = 'columns')
    
    return summed_scores
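
To make the thresholds concrete, here is a minimal sketch that scores a single column by hand. The values are the first column of ```race_place_scores``` as printed in cell [46]. The helper name ```score_column``` is hypothetical, and it uses ```np.select``` in place of the explicit loop, on the assumption that picking the first matching condition mirrors the ```if```/```elif``` order above.

```python
import numpy as np

def score_column(values):
    """Score one column with the same thresholds as score_the_scores_v2.

    Hypothetical helper for illustration only. np.select returns the choice
    belonging to the first condition that holds, like the if/elif chain.
    """
    values = np.asarray(values, dtype=float)
    max_value = values.max()
    quart_3 = np.percentile(values, 75)
    quart_1 = np.percentile(values, 25)
    return np.select(
        [values >= max_value, values >= quart_3, values <= quart_1],
        [4.0, 2.0, -1.5],
        default=0.0,
    )

# First column of race_place_scores, copied from the output of cell [46].
first_race = [18.0, 11.0, 22.0, 7.0, 6.0, 16.0, 19.0, 7.0]
column_scores = score_column(first_race)
# column_scores -> [0, 0, 4, -1.5, -1.5, 0, 2, -1.5]
```

For this column the first quartile is 7 and the third quartile is 18.25, so the value 18 earns nothing, 19 earns 2 points, 22 earns the 4 points for the maximum, and the entries 6 and 7 each lose 1.5 points.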

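And then, in order to rank the weightings, I want a somewhat different function from the one I was using before. In particular, for each of the 4 different measures that we had, I was giving a weighting one point for being in the top 25% and removing a point for being in the bottom 25%. Now I want to rank the weightings by the place that they actually take in each measure.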

<a name = "ranking_points"></a>
<a href = "#comparisons2">Back to Comparing Weights Take 2</a>

In [58]:
"""
Function
--------

ranking_points : takes a pandas Series (a single column of values) and assigns a place to
                 each index based on the descending order of its values (first place gets
                 1 point, second place gets 2, etc.). Tied values all receive the points
                 for their highest shared ranking.

Parameters
----------

df : a pandas Series of values, indexed by weight combination

Returns
-------

scores : a single column dataframe consisting of an index of weight combinations and
         a column of places (scores)

Examples
--------
"""

def ranking_points(df):
    
    df = df.sort_values(ascending = False)
    
    scores = []
    current_score = 0
    current_value = None  # sentinel: no value seen yet, so the first entry always starts a new place
    
    for i in range(len(df)):
        if df.iloc[i] == current_value:
            scores.append([df.index[i],current_score])
        else:
            current_value = df.iloc[i]
            current_score = i+1
            scores.append([df.index[i],current_score])
    
    scores = pd.DataFrame(scores, columns = ['weight','place'])
    scores.set_index('weight', drop = True, inplace = True)
    return scores
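
The tie rule in the docstring is what pandas calls the ```min``` ranking method, so the loop above should agree with ```Series.rank(method="min", ascending=False)```. The sketch below swaps in that built-in for the hand-written loop (it is not the function used in this notebook) and applies it to the row sums of ```race_distance_scores``` printed in cell [48].

```python
import pandas as pd

# Row sums of race_distance_scores, copied from the output of cell [48].
distance_totals = pd.Series(
    [156.0, 164.0, 158.0, 150.0, 164.0, 150.0, 156.0, 168.0],
    index=range(8),
)

# method="min" gives tied values the best place they share; ascending=False
# makes the largest total first place.
places = distance_totals.rank(method="min", ascending=False).astype(int)
# places -> 0: 5, 1: 2, 2: 4, 3: 7, 4: 2, 5: 7, 6: 5, 7: 1
```

Weightings 1 and 4 tie on 164 and both take second place, so third place is skipped and weighting 2 comes in fourth.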

<a name = "ranking_by_places"></a>
<a href = "#comparisons2">Back to Comparing Weights Take 2</a>

In [59]:
"""
Function
--------

ranking_by_places : takes two dataframes df1 and df2 obtained by running compare_weights
                    and pluck_best_weights over a selection of races. It then uses
                    score_the_scores_v2 and ranking_points to assign four different 
                    place values to each weight combination. The sums of these four values
                    are then returned in a dataframe.

Parameters
----------

df1 : a dataframe that is the output of running compare_weights and pluck_best_weights
      over a selection of races. It could be either the dataframe of distance measures
      or the dataframe of place accuracy
df2 : the dataframe that is the pair to df1. It is distance measures where df1 is place
      accuracy, or place accuracy where df1 is distance measures

Returns
-------

ranking : a dataframe containing the weight combinations as the index and a score
          as the sole column

Examples
--------
"""

def ranking_by_places(df1,df2):
    
    points_1 = ranking_points(df1.sum(axis = 'columns'))
    points_2 = ranking_points(score_the_scores_v2(df1))
    points_3 = ranking_points(df2.sum(axis = 'columns'))
    points_4 = ranking_points(score_the_scores_v2(df2))
    
    ranking = points_1+points_2+points_3+points_4
    
    return ranking

In [60]:
ranking_by_places(F, race_distance_scores)

weight,place
0,14
1,13
2,9
3,28
4,21
5,22
6,16
7,17


Looking at both this and the result of cell [47], it appears that the strongest weightings here are numbers 2 and 1 (they have the lowest place totals above, where lower is better, and share the highest score in cell [47]), with 2 the stronger of the two. These correspond to numbers 13 and 14 in our original list of weight combinations, which have as weights

In [61]:
print weight_combos[13]
print weight_combos[14]

['quant_snow', 'quant_weather', 'altitude', 'quant_weather', 'quant_snow']
['quant_snow', 'quant_weather', 'altitude', 'quant_snow', 'quant_weather']


<a name = "combivars"></a>
# Considering Combined Variables

Looking at the two best weight combinations above, we see that they agree on the first three of the five weight inputs and swap the values of the last two (the range time inputs). As a result, it seems reasonable to consider these two weightings alongside a few combined weightings, namely those that replace the prone range weighting, the standing range weighting, or both with the geometric mean of ```quant_weather``` and ```quant_snow```. 

From there, we again choose 10 random events from our five seasons and run 500 trials for each combination of event and weighting. We evaluate the results using both <a href = "#rank_weightings">```rank_weightings```</a> and <a href = "#ranking_by_places">```ranking_by_places```</a>, and we find that one weighting is clearly the best performer over these races: number 4 of the weightings considered here, which corresponds to number 14 in the original list of weight combinations. Its <a href = "#ranking_by_ranks_score">```rank_weightings```</a> result is 4, which indicates that it is in the top quartile for all four measures of goodness of fit, and its <a href = "#ranking_by_places_score">```ranking_by_places```</a> result is 5, which means that it was the best weight combination, or tied for best, in three of the four measures and the second best in the fourth. 

<a href = "#toc">Table of Contents</a>

In [63]:
wc_product = (wc_quant_weather_similarities*wc_quant_snow_similarities)**(0.5)
ibu_product = (ibu_quant_weather_similarities*ibu_quant_snow_similarities)**(0.5)
weights['product'] = [wc_product, ibu_product]
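
As a small illustration of what the ```product``` weight does, the sketch below takes the elementwise geometric mean of two arrays. The numbers are made up for illustration and are not taken from the notebook's similarity data; the array names are likewise hypothetical.

```python
import numpy as np

# Illustrative similarity values only; these are not the notebook's data.
weather_similarity = np.array([0.25, 1.00, 0.00])
snow_similarity = np.array([1.00, 0.25, 0.81])

# Elementwise geometric mean, the same form as (a * b) ** 0.5 in the cell above.
combined = np.sqrt(weather_similarity * snow_similarity)
# combined is approximately [0.5, 0.5, 0.0]
```

Unlike an arithmetic mean, the geometric mean drops to zero whenever either input is zero, so a combined weight of this form is only large when both of its inputs are large.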

In [64]:
more_combos_to_test = [
    ['quant_snow', 'quant_weather', 'altitude', 'quant_weather', 'quant_snow'],
    ['quant_snow', 'quant_weather', 'altitude', 'product', 'quant_snow'],
    ['quant_snow', 'quant_weather', 'altitude', 'quant_weather', 'product'],
    ['quant_snow', 'quant_weather', 'altitude', 'product', 'product'],
    ['quant_snow', 'quant_weather', 'altitude', 'quant_snow', 'quant_weather'],
    ['quant_snow', 'quant_weather', 'altitude', 'product', 'quant_weather'],
    ['quant_snow', 'quant_weather', 'altitude', 'quant_snow', 'product'],
]

And now I want to do the same thing that I did above, namely running tests of 500 trials over 10 different races to see which of these (if any) ends up being clearly the best.

In [65]:
seasons = ['0910', '1011', '1112', '1213', '1314']
events = ['CP01','CP02','CP03','CP04','CP05','CP06','CP07','CP08','CP09','CH__']

chosen_events = []
while len(chosen_events) < 10:
    competition = [random.choice(seasons), random.choice(events)]
    # In the 0910 and 1314 seasons the 'CH__' event is labelled 'OG__'. Relabel it
    # before the duplicate check so that a repeated draw is caught.
    if competition[0] in ['0910', '1314'] and competition[1] == 'CH__':
        competition[1] = 'OG__'
    # Redraw if the event was already chosen or is the excluded 1314 CP05 race.
    if competition in chosen_events or competition == ['1314', 'CP05']:
        continue
    chosen_events.append(competition)
    
chosen_events

[['1112', 'CP02'],
 ['1314', 'CP07'],
 ['1314', 'CP09'],
 ['1112', 'CP07'],
 ['0910', 'CP05'],
 ['1112', 'CP08'],
 ['1314', 'OG__'],
 ['1213', 'CP09'],
 ['1011', 'CP04'],
 ['0910', 'CP06']]

In [66]:
race_place_scores = pd.DataFrame()
race_distance_scores = pd.DataFrame()

for competition in chosen_events:
    year = competition[0]
    event = competition[1]
    print year, event
    race_places, race_distances = compare_weightings(year, event, 500,more_combos_to_test)
    
    distances = pluck_best_weights(race_distances)
    places = pluck_best_weights(race_places)
    
    race_place_scores = pd.concat([race_place_scores, places], axis = 'columns')
    race_distance_scores = pd.concat([race_distance_scores, distances], axis = 'columns')

1112 CP02
['quant_snow', 'quant_weather', 'altitude', 'quant_weather', 'quant_snow']
['quant_snow', 'quant_weather', 'altitude', 'product', 'quant_snow']
['quant_snow', 'quant_weather', 'altitude', 'quant_weather', 'product']
['quant_snow', 'quant_weather', 'altitude', 'product', 'product']
['quant_snow', 'quant_weather', 'altitude', 'quant_snow', 'quant_weather']
['quant_snow', 'quant_weather', 'altitude', 'product', 'quant_weather']
['quant_snow', 'quant_weather', 'altitude', 'quant_snow', 'product']
1314 CP07
['quant_snow', 'quant_weather', 'altitude', 'quant_weather', 'quant_snow']
['quant_snow', 'quant_weather', 'altitude', 'product', 'quant_snow']
['quant_snow', 'quant_weather', 'altitude', 'quant_weather', 'product']
['quant_snow', 'quant_weather', 'altitude', 'product', 'product']
['quant_snow', 'quant_weather', 'altitude', 'quant_snow', 'quant_weather']
['quant_snow', 'quant_weather', 'altitude', 'product', 'quant_weather']
['quant_snow', 'quant_weather', 'altitude', 'quant_sn

In [67]:
race_place_scores

Unnamed: 0,0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9
0,12.0,7.0,11.0,20.0,16.0,7.0,16.0,19.0,23.0,25.0
1,7.0,13.0,21.0,18.0,17.0,14.0,9.0,8.0,14.0,14.0
2,25.0,15.0,25.0,8.0,6.0,8.0,10.0,6.0,12.0,13.0
3,7.0,10.0,12.0,10.0,13.0,19.0,14.0,17.0,20.0,14.0
4,10.0,22.0,11.0,11.0,25.0,15.0,23.0,24.0,9.0,13.0
5,19.0,8.0,18.0,19.0,6.0,13.0,4.0,9.0,10.0,13.0
6,13.0,18.0,9.0,8.0,17.0,25.0,11.0,17.0,23.0,8.0


In [69]:
race_distance_scores

Unnamed: 0,0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9
0,16.0,16.0,18.0,20.0,16.0,18.0,14.0,19.0,18.0,12.0
1,14.0,16.0,14.0,14.0,22.0,20.0,14.0,16.0,18.0,16.0
2,14.0,18.0,22.0,12.0,16.0,22.0,16.0,19.0,18.0,18.0
3,20.0,18.0,18.0,16.0,16.0,20.0,18.0,17.0,20.0,18.0
4,16.0,16.0,14.0,16.0,18.0,16.0,22.0,20.0,20.0,20.0
5,12.0,12.0,16.0,14.0,18.0,20.0,18.0,15.0,20.0,12.0
6,14.0,12.0,16.0,12.0,14.0,20.0,20.0,15.0,14.0,18.0


<a name = "ranking_by_places_score"></a>
<a href = "#combivars">Back to Combined Variables</a>

In [70]:
ranking_by_places(race_place_scores, race_distance_scores)

weight,place
0,11
1,21
2,18
3,11
4,5
5,25
6,20
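
As a cross-check on the table above, the sketch below recomputes the four place columns from the two tables printed in cells [67] and [69]. The helpers ```column_scores``` and ```places``` are hypothetical vectorized stand-ins for ```score_the_scores_v2``` and ```ranking_points```, not the functions defined earlier, so agreement with cell [70] is a consistency check rather than a re-run.

```python
import numpy as np
import pandas as pd

# Tables copied from the outputs of cells [67] and [69]; rows are weightings 0-6.
place_scores = pd.DataFrame([
    [12, 7, 11, 20, 16, 7, 16, 19, 23, 25],
    [7, 13, 21, 18, 17, 14, 9, 8, 14, 14],
    [25, 15, 25, 8, 6, 8, 10, 6, 12, 13],
    [7, 10, 12, 10, 13, 19, 14, 17, 20, 14],
    [10, 22, 11, 11, 25, 15, 23, 24, 9, 13],
    [19, 8, 18, 19, 6, 13, 4, 9, 10, 13],
    [13, 18, 9, 8, 17, 25, 11, 17, 23, 8],
], dtype=float)
distance_scores = pd.DataFrame([
    [16, 16, 18, 20, 16, 18, 14, 19, 18, 12],
    [14, 16, 14, 14, 22, 20, 14, 16, 18, 16],
    [14, 18, 22, 12, 16, 22, 16, 19, 18, 18],
    [20, 18, 18, 16, 16, 20, 18, 17, 20, 18],
    [16, 16, 14, 16, 18, 16, 22, 20, 20, 20],
    [12, 12, 16, 14, 18, 20, 18, 15, 20, 12],
    [14, 12, 16, 12, 14, 20, 20, 15, 14, 18],
], dtype=float)

def column_scores(df):
    """Hypothetical vectorized stand-in for score_the_scores_v2."""
    values = df.to_numpy()
    scores = np.select(
        [values >= values.max(axis=0),
         values >= np.percentile(values, 75, axis=0),
         values <= np.percentile(values, 25, axis=0)],
        [4.0, 2.0, -1.5],
        default=0.0,
    )
    return pd.Series(scores.sum(axis=1), index=df.index)

def places(series):
    """Hypothetical stand-in for ranking_points: 1 is best, ties share a place."""
    return series.rank(method="min", ascending=False).astype(int)

# One place column per measure, then the total that ranking_by_places reports.
measures = pd.DataFrame({
    "place_sum": places(place_scores.sum(axis=1)),
    "place_score": places(column_scores(place_scores)),
    "distance_sum": places(distance_scores.sum(axis=1)),
    "distance_score": places(column_scores(distance_scores)),
})
totals = measures.sum(axis=1)
```

Under these assumptions the totals come out as 11, 21, 18, 11, 5, 25 and 20, matching cell [70]. Weighting 4 takes places 1, 1, 2 and 1 across the four measures, sharing first place on the place-accuracy score with weighting 0.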


<a name = "ranking_by_ranks_score"></a>
<a href = "#combivars">Back to Combined Variables</a>

In [71]:
rank_weightings(race_distance_scores, race_place_scores)

Unnamed: 0,scores
0,2
1,-1
2,-2
3,2
4,4
5,-3
6,-1


<a name = "conclusions"></a>

# Conclusions

Based on the work in this notebook, we have found a single weight combination that, at least under the conditions we have applied, appears to be superior to the other combinations. It uses the weights
- speed : ```quant_snow```
- prone accuracy : ```quant_weather```
- standing accuracy : ```altitude```
- prone range time : ```quant_snow```
- standing range time : ```quant_weather```

So far, however, all of the work that we have done has considered the effectiveness of our model on races from the middle part of our data, chronologically. We now need to see how effective our model is on more recent races, and how it fares in comparison with a more naive model of these races. For that, we will move to a new notebook.

<a href = "#toc">Table of Contents</a>