
# Table of Contents

[Introduction](#Introduction)

[Collecting the Data](#Collecting-data-by-racer)

[Weighting the Races](#Weighting-the-Races)

[Isolating Variable Effects](#isolating_variable_effects)

[Pulling it all together](#pulling_together)

In [1]:
# First a cell to prepare the notebook for the stuff that I might need to use. More imports
# can be added as necessary. 

# A special IPython command to prepare the notebook for matplotlib
%matplotlib inline 

from fnmatch import fnmatch

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
from pattern import web
import seaborn as sns
import math

# And the additional modules that I've used

import fnmatch
import os
import pickle
from PyPDF2 import PdfFileReader
from tabula import read_pdf
import urllib
import random
import sklearn
import scipy.stats as stats
import re

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

from sklearn.model_selection import GridSearchCV
from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.linear_model import ElasticNet

import joblib

import matplotlib as mpl
sns.set(color_codes=True)


# Introduction

My goal for this notebook is to take the data that I collected in the previous notebook and explore how well each of our predictor variables can produce a reasonable distribution of predictions for each of the pieces that make up a sprint biathlon race. 

Next Section: [Collecting the Data](#Collecting-data-by-racer)

[Table of Contents](#Table-of-Contents)





# Collecting data by racer

Above, I divide the information about races into several pieces, each of which has different main influences. I hope to use this information to make predictions about both racer times in future events and likely rankings. Specifically, I hope to use this model to predict outcomes for the 2015-2016, 2016-2017, and 2017-2018 seasons, each time using only (and all of) the data from races that have previously occurred. To accomplish this, I first need to collate all of the various pieces (speed, missed shots, and range/penalty time) sorted by racer.

Before that, I need to correct some of the data that I have. In particular, biathletes whose names contain accents or are written in a non-Latin alphabet often have multiple spellings of their names. These spellings need to be made consistent in order to collect data over their entire careers. To do this, I systematically went through a list of all of the athletes that I had results for, looking for likely duplicates and checking them against the current (official) spellings used for each athlete by the International Biathlon Union. I then entered the data in an Excel spreadsheet (because it's sometimes better to see what you have), and stored it as a .csv file. From there, I used

```replace_names```: this function takes a competition analysis dataframe and ensures that all of the given names are the official IBU versions of those names, corrects those that are not, and returns a fixed version of the dataframe.

to replace the names. Next, I interleaved World Cup and IBU Cup races for the 14 seasons for which I had data, so that a biathlete who bounced back and forth between the two levels of competition for a season or two (not an unusual occurrence, particularly for early-career athletes) would have all of their events in the order in which they occurred. I then used this list to loop through all of the IBU Cup and World Cup races in order of their occurrence from the 2004-2005 season through the 2017-2018 season. The code for this can be found here:
1. [Collecting the speed data](#Collecting-the-speed-data)
2. [Collecting the accuracy data](#Collecting-the-accuracy-data)
3. [Collecting the Range and Penalty Data](#Collecting-the-range-and-penalty-data)

In each of these three cases, we loop through the events in chronological order and use an outer merge on the name column to collect the relevant data into a dataframe or dataframes. The resulting dataframes are then pickled so that they can be used in subsequent notebooks without redoing the steps above.
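
As a rough illustration of the outer merge described above, here is a minimal sketch. The two small dataframes and the racer names are made up for the example; the notebook itself reads each race from a pickled file.

```python
import pandas as pd

# Hypothetical per-race results (the real data comes from the pickled race files).
race_a = pd.DataFrame({'Name': ['Racer A', 'Racer B'], 'Speed': [6.1, 5.9]})
race_b = pd.DataFrame({'Name': ['Racer B', 'Racer C'], 'Speed': [6.0, 5.8]})

collected = pd.DataFrame(columns=['Name'])
for colname, race in [('wc:0405:CP01', race_a), ('wc:0405:CP02', race_b)]:
    # Rename the value column so that every race gets its own column.
    race = race.rename(columns={'Speed': colname})
    # An outer merge keeps every racer who appears in either dataframe;
    # races that a racer did not start show up as NaN.
    collected = collected.merge(race, how='outer', on='Name')

collected = collected.set_index('Name')
print(collected)
```

Each racer ends up with one row and each race with one column, so a racer's career can be read left to right in chronological order.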

Previous Section: [Introduction](#Introduction)


Next Section: [Weighting the Races](#Weighting-the-Races)



[Table of Contents](#Table-of-Contents)



In [2]:
name_equivalences = pd.read_csv('name_equivalences.csv', encoding = 'utf8')
name_equivalences.set_index('Name', inplace = True)
name_changes = name_equivalences.to_dict('dict')['Unnamed: 1']

In [3]:
"""
Function
--------

replace_names : takes a competition analysis dataframe and ensures that all of the given 
                names are the official IBU versions of those names

Parameters
----------

df : a dataframe containing the (filtered) competition analysis data for a world cup or
     ibu cup men's sprint race
name_changes : a dictionary giving the correspondence between old versions of a racer's 
               name and current (IBU official) versions

Returns
-------

df : the original dataframe, but with any old name versions replaced with current versions

Examples
--------
"""

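def replace_names(df, name_changes):
    # Swap any name found in name_changes for its official IBU spelling and leave
    # every other name untouched. Unlike a loop over df.loc[i, 'Name'], this does
    # not depend on the dataframe having a default 0..n-1 index.
    df['Name'] = df['Name'].map(lambda name: name_changes.get(name, name))

    return df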
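
To show what the function is expected to do, here is a small usage sketch. The names and the mapping are invented for illustration and are not taken from `name_equivalences.csv`; the function body is a compact stand-in for the one above.

```python
import pandas as pd

# Hypothetical mapping from an old spelling to the official one.
name_changes = {'IVANOV Sergej': 'IVANOV Sergey'}

def replace_names(df, name_changes):
    # Names missing from the dictionary are returned unchanged.
    df['Name'] = df['Name'].map(lambda name: name_changes.get(name, name))
    return df

results = pd.DataFrame({'Name': ['IVANOV Sergej', 'SMITH John'],
                        'Speed': [6.1, 5.9]})
fixed = replace_names(results, name_changes)
print(fixed['Name'].tolist())
```

Only the first name is rewritten; the second is not in the dictionary and passes through as it was.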

In [4]:
# Fixing the names 

seasons = ['0405','0506','0607','0708','0809','0910',
           '1011','1112','1213','1314','1415','1516','1617','1718']
events = ['CP01','CP02','CP03','CP04','CP05','CP06','CP07','CP08','CP09','CH__','OG__']

for season in seasons:
    print(season)
    for event in events:
        filename = ('companal_SMSP_%(season)s_%(event)s.pkl' 
                        %{'season' : season, 'event' : event})
        
        try:
            df = pd.read_pickle(filename)
            df = replace_names(df, name_changes)
            df.to_pickle(filename)
            
        except: # race has no competition analysis file
            pass

        filename1 = 'ibu_SMSP_%(season)s_%(event)s.pkl' %{'season' : season, 'event' : event}
        
        try:
            df = pd.read_pickle(filename1)
            df = replace_names(df, name_changes)
            df.to_pickle(filename1)
            
        except: # race has no competition analysis file
            pass

        filename2 = 'ibu_SMSPS_%(season)s_%(event)s.pkl' %{'season' : season, 'event' : event}
        
        try:
            df = pd.read_pickle(filename2)
            df = replace_names(df, name_changes)
            df.to_pickle(filename2)
            
        except: # race has no competition analysis file
            pass



0405
0506
0607
0708
0809
0910
1011
1112
1213
1314
1415
1516
1617
1718



### Interleaving World Cup and IBU Cup races



In [5]:
events_0405 = [['ibu', 'CP01'], ['ibu', 'CP02'], ['companal', 'CP01'], ['companal', 'CP02'], 
               ['ibu', 'CP03'], ['companal', 'CP03'], ['ibu', 'CP04'], ['companal', 'CP04'], 
               ['ibu', 'CP05'], ['companal', 'CP05'], ['ibu', 'CP06'], ['companal', 'CP06'], 
               ['companal', 'CP07'], ['companal', 'CP08'], ['ibu', 'CH__'], 
               ['companal', 'CH__'], ['ibu', 'CP07'], ['ibu', 'CP08'], ['companal', 'CP09']]

events_0506 = [['companal', 'CP01'], ['ibu', 'CP01'], ['companal', 'CP02'], ['ibu', 'CP02'], 
               ['ibu', 'CP03'], ['companal', 'CP03'], ['companal', 'CP04'], 
               ['companal', 'CP05'], ['ibu', 'CP04'], ['ibu', 'CP05'], ['companal', 'CP06'],
               ['companal', 'OG__'], ['ibu', 'CH__'], ['companal', 'CP07'], ['ibu', 'CP06'], 
               ['companal', 'CP08'], ['ibu', 'CP07'], ['companal', 'CP09']]

events_0607 = [['companal', 'CP01'], ['companal', 'CP02'], ['ibu', 'CP01'], 
               ['companal', 'CP03'], ['ibu', 'CP02'], ['companal', 'CP04'], 
               ['companal', 'CP05'], ['ibu', 'CP03'], ['companal', 'CP06'], 
               ['ibu', 'CP04'], ['companal', 'CH__'], ['ibu', 'CP06'], ['ibu', 'CH__'],
               ['companal', 'CP07'], ['companal', 'CP08'], ['ibu', 'CP07'], 
               ['companal', 'CP09'], ['ibu', 'CP08']]

events_0708 = [['ibu', 'CP01'], ['companal', 'CP01'], ['ibu', 'CP02'], ['companal', 'CP02'],
               ['ibu', 'CP03'], ['companal', 'CP03'], ['ibu', 'CP04'], ['companal', 'CP04'],
               ['companal', 'CP05'], ['ibu', 'CP05'], ['companal', 'CP06'], ['ibu', 'CP06'],
               ['companal', 'CH__'], ['ibu', 'CH__'], ['companal', 'CP07'], 
               ['companal', 'CP08'], ['ibu', 'CP07'], ['companal', 'CP09'], ['ibu', 'CP08']]

events_0809 = [['ibu', 'CP01'], ['companal', 'CP01'], ['companal', 'CP02'], ['ibu', 'CP02'],
               ['companal', 'CP03'], ['ibu', 'CP03'], ['companal', 'CP04'], ['ibu', 'CP04'],
               ['companal', 'CP05'], ['ibu', 'CP05'], ['companal', 'CP06'], ['ibu', 'CP06'], 
               ['companal', 'CH__'], ['ibu', 'CP07'], ['ibu', 'CH__'], ['companal', 'CP07'], 
               ['ibu', 'CP08'], ['companal', 'CP08'], ['companal', 'CP09']]

events_0910 = [['ibu', 'CP01'], ['companal', 'CP01'], ['companal', 'CP02'], ['ibu', 'CP02'],
               ['companal', 'CP03'], ['ibu', 'CP03'], ['companal', 'CP04'], ['ibu', 'CP04'],
               ['companal', 'CP05'], ['ibu', 'CP05'], ['companal', 'CP06'], ['ibu', 'CP06'],
               ['ibu', 'CP07'], ['companal', 'OG__'], ['ibu', 'CH__'], ['companal', 'CP07'],
               ['ibu', 'CP08'], ['companal', 'CP08'], ['companal', 'CP09']]

events_1011 = [['ibu', 'CP01'], ['companal', 'CP01'], ['companal', 'CP02'], ['ibu', 'CP02'],
               ['companal', 'CP03'], ['ibu', 'CP03'], ['companal', 'CP04'], ['ibu', 'CP04'],
               ['companal', 'CP05'], ['ibu', 'CP05'], ['companal', 'CP06'], 
               ['companal', 'CP07'], ['ibu', 'CP06'], ['companal', 'CP08'], ['ibu', 'CP07'], 
               ['ibu', 'CH__'], ['companal', 'CH__'], ['ibu', 'CP08'], ['companal', 'CP09']]

events_1112 = [['ibu', 'CP01'], ['companal', 'CP01'], ['companal', 'CP02'], ['ibu', 'CP02'],
               ['companal', 'CP03'], ['ibu', 'CP03'], ['companal', 'CP04'], ['ibu', 'CP04'], 
               ['companal', 'CP05'], ['ibu', 'CP05'], ['companal', 'CP06'], ['ibu', 'CH__'],
               ['companal', 'CP07'], ['companal', 'CP08'], ['ibu', 'CP06'], ['ibu', 'CP07'], 
               ['companal', 'CH__'], ['ibu', 'CP08'], ['companal', 'CP09']]

events_1213 = [['companal', 'CP01'], ['ibu', 'CP01'], ['ibu', 'CP02'], ['companal', 'CP02'], 
               ['companal', 'CP03'], ['ibu', 'CP03'], ['companal', 'CP04'], ['ibu', 'CP04'],
               ['companal', 'CP05'], ['ibu', 'CP05'], ['companal', 'CP06'], ['ibu', 'CP06'], 
               ['companal', 'CH__'], ['ibu', 'CP07'], ['ibu', 'CH__'], ['companal', 'CP07'],
               ['companal', 'CP08'], ['companal', 'CP09']]

events_1314 = [['companal', 'CP01'], ['ibu', 'CP01'], ['ibu', 'CP02'], ['companal', 'CP02'], 
               ['companal', 'CP03'], ['ibu', 'CP03'], ['companal', 'CP04'], ['ibu', 'CP04'],
               ['companal', 'CP05'], ['ibu', 'CP05'], ['companal', 'CP06'], ['ibu', 'CP06'],
               ['ibu', 'CH__'], ['companal', 'OG__'], ['ibu', 'CP07'], ['companal', 'CP07'],
               ['companal', 'CP08'], ['ibu', 'CP08'], ['companal', 'CP09']]

events_1415 = [['companal', 'CP01'], ['ibu', 'CP01'], ['companal', 'CP02'], ['ibu', 'CP02'],
               ['companal', 'CP03'], ['ibu', 'CP03'], ['companal', 'CP04'], ['ibu', 'CP04'], 
               ['companal', 'CP05'], ['ibu', 'CP05'], ['companal', 'CP06'], ['ibu', 'CH__'], 
               ['companal', 'CP07'], ['ibu', 'CP06'], ['companal', 'CP08'], ['ibu', 'CP07'], 
               ['companal', 'CH__'], ['ibu', 'CP08'], ['companal', 'CP09']]

events_1516 = [['companal', 'CP01'], ['ibu', 'CP01'], ['companal', 'CP02'], ['ibu', 'CP02'], 
               ['companal', 'CP03'], ['ibu', 'CP03'], ['companal', 'CP04'], ['ibu', 'CP04'], 
               ['companal', 'CP05'], ['ibu', 'CP05'], ['companal', 'CP06'], ['ibu', 'CP06'], 
               ['companal', 'CP07'], ['companal', 'CP08'], ['ibu', 'CP07'], ['ibu', 'CH__'], 
               ['companal', 'CH__'], ['ibu', 'CP08'], ['companal', 'CP09']]

events_1617 = [['companal', 'CP01'], ['ibu', 'CP01'], ['companal', 'CP02'], ['ibu', 'CP02'], 
               ['companal', 'CP03'], ['ibu', 'CP03'], ['companal', 'CP04'], ['ibu', 'CP04'], 
               ['companal', 'CP05'], ['ibu', 'CP05'], ['companal', 'CP06'], ['ibu', 'CH__'], 
               ['ibu', 'CP06'], ['companal', 'CH__'], ['companal', 'CP07'], ['ibu', 'CP07'], 
               ['companal', 'CP08'], ['ibu', 'CP08'], ['companal', 'CP09']]

events_1718 = [['ibu', 'CP01'], ['companal', 'CP01'], ['ibu', 'CP02'], ['companal', 'CP02'],
               ['companal', 'CP03'], ['ibu', 'CP03'], ['companal', 'CP04'], ['ibu', 'CP04'],
               ['companal', 'CP05'], ['ibu', 'CP05'], ['companal', 'CP06'], ['ibu', 'CH__'], 
               ['ibu', 'CP06'], ['companal', 'OG__'], ['companal', 'CP07'], ['ibu', 'CP07'],
               ['companal', 'CP08'], ['ibu', 'CP08'], ['companal', 'CP09']              ]

In [6]:
ordered_events ={'0405' : events_0405, '0506' : events_0506, '0607' : events_0607,
                 '0708' : events_0708, '0809' : events_0809, '0910' : events_0910, 
                 '1011' : events_1011, '1112' : events_1112, '1213' : events_1213, 
                 '1314' : events_1314, '1415' : events_1415, '1516' : events_1516, 
                 '1617' : events_1617, '1718' : events_1718}
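
The cells that follow all walk through `ordered_events` in the same way, building a filename and a column name for every race. A minimal sketch of that walk, assuming the naming pattern used in those cells and a shortened, made-up schedule (the helper name `race_files` is my own and is not used elsewhere in the notebook):

```python
# A shortened, hypothetical schedule in the same shape as ordered_events.
schedule = {'0405': [['ibu', 'CP01'], ['companal', 'CP01']],
            '0506': [['companal', 'CP01'], ['ibu', 'CP01']]}

def race_files(schedule, seasons):
    """Yield (filename, column name) pairs in chronological order."""
    for season in seasons:
        for cup, event in schedule[season]:
            if cup == 'ibu':
                # IBU Cup events can have two files, tagged SMSP and SMSPS.
                yield ('ibu_SMSP_%s_%s.pkl' % (season, event),
                       ':'.join(['ibu', season, event]))
                yield ('ibu_SMSPS_%s_%s.pkl' % (season, event),
                       ':'.join(['ibuS', season, event]))
            else:
                # World Cup races come from the competition analysis files.
                yield ('%s_SMSP_%s_%s.pkl' % (cup, season, event),
                       ':'.join(['wc', season, event]))

pairs = list(race_files(schedule, ['0405', '0506']))
for filename, colname in pairs:
    print(filename, colname)
```

Not every yielded file has to exist; the collecting cells simply skip the ones that cannot be read.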


## Collecting the speed data

[Collecting data by racer](#Collecting-data-by-racer)



In [7]:
# And collecting the speed data

absolute_mens_speed = pd.DataFrame(columns = ['Name'])

seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617','1718']

for season in seasons:
    events = ordered_events[season]
    for event in events:
        if event[0] == 'ibu':
            filename1 = ('%(cup)s_SMSP_%(season)s_%(event)s.pkl' 
                             %{'cup' : event[0], 'season' : season, 'event' : event[1]})
            filename2 = ('%(cup)s_SMSPS_%(season)s_%(event)s.pkl' 
                             %{'cup' : event[0], 'season' : season, 'event' : event[1]})

            colname1 = ':'.join(['ibu',season,event[1]])
            colname2 = ':'.join(['ibuS',season,event[1]])
            try:
                df = pd.read_pickle(filename1)[['Name','Speed']]
                df.columns = ['Name',colname1]
                absolute_mens_speed = absolute_mens_speed.merge(df, how = 'outer', on = 'Name')
            
            except: # race has no competition analysis file
                pass
            try:
                df = pd.read_pickle(filename2)[['Name','Speed']]
                df.columns = ['Name',colname2]
                absolute_mens_speed = absolute_mens_speed.merge(df, how = 'outer', on = 'Name')
            
            except: # race has no competition analysis file
                pass
        else:
            filename = ('%(cup)s_SMSP_%(season)s_%(event)s.pkl' 
                            %{'cup' : event[0], 'season' : season, 'event' : event[1]})
            colname = ':'.join(['wc',season,event[1]])
            
            try:
                df = pd.read_pickle(filename)[['Name','Speed']]
                df.columns = ['Name',colname]
                absolute_mens_speed = absolute_mens_speed.merge(df, how = 'outer', on = 'Name')
            
            except:
                print('%s %s has no companal file' % (season, event))

last_row = len(absolute_mens_speed)

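for col in absolute_mens_speed.columns.tolist():
    # Columns look like 'wc:0405:CP01'; col[2:4] is not the year, so parse the season code
    # (assumes its closing year was intended).
    parts = col.split(':')
    absolute_mens_speed.loc[last_row, col] = ('20' + parts[1][2:4]
                                              if len(parts) > 1 else 'Year')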
    
for i in range(len(absolute_mens_speed)):
    absolute_mens_speed.loc[i,'count'] = absolute_mens_speed.loc[i].count() - 1
    
absolute_mens_speed.loc[last_row,'Name'] = 'Year'

absolute_mens_speed.set_index('Name', drop=True, inplace=True)

0607 ['companal', 'CP08'] has no companal file
0607 ['companal', 'CP09'] has no companal file
0809 ['companal', 'CH__'] has no companal file
1314 ['companal', 'CP05'] has no companal file
1516 ['companal', 'CP05'] has no companal file
1617 ['companal', 'CP06'] has no companal file
1718 ['companal', 'CP05'] has no companal file


And in the interest of not having to do this again, let's pickle it.

In [8]:
absolute_mens_speed.to_pickle('absolute_mens_speed.pkl')


## Collecting the accuracy data

[Collecting data by racer](#Collecting-data-by-racer)



In [9]:
# Collecting the prone accuracy data

absolute_mens_prone_shooting = pd.DataFrame(columns = ['Name'])

seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617','1718']

for season in seasons:
    events = ordered_events[season]
    for event in events:
        if event[0] == 'ibu':
            filename1 = ('%(cup)s_SMSP_%(season)s_%(event)s.pkl' 
                             %{'cup' : event[0], 'season' : season, 'event' : event[1]})
            filename2 = ('%(cup)s_SMSPS_%(season)s_%(event)s.pkl' 
                             %{'cup' : event[0], 'season' : season, 'event' : event[1]})

            colname1 = ':'.join(['ibu',season,event[1]])
            colname2 = ':'.join(['ibuS',season,event[1]])
            try:
                df = pd.read_pickle(filename1)[['Name','P1']]
                df.columns = ['Name',colname1]
                absolute_mens_prone_shooting = absolute_mens_prone_shooting.merge(df, 
                                                                how = 'outer', on = 'Name')
            
            except: # race has no competition analysis file
                pass
            try:
                df = pd.read_pickle(filename2)[['Name','P1']]
                df.columns = ['Name',colname2]
                absolute_mens_prone_shooting = absolute_mens_prone_shooting.merge(df, 
                                                                how = 'outer', on = 'Name')
            
            except: # race has no competition analysis file
                pass
        else:
            filename = ('%(cup)s_SMSP_%(season)s_%(event)s.pkl' 
                            %{'cup' : event[0], 'season' : season, 'event' : event[1]})
            colname = ':'.join(['wc',season,event[1]])
            
            try:
                df = pd.read_pickle(filename)[['Name','P1']]
                df.columns = ['Name',colname]
                absolute_mens_prone_shooting = absolute_mens_prone_shooting.merge(df,
                                                                how = 'outer', on = 'Name')
            
            except: # race has no competition analysis file
                pass

last_row = len(absolute_mens_prone_shooting)

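for col in absolute_mens_prone_shooting.columns.tolist():
    # Columns look like 'wc:0405:CP01'; col[2:4] is not the year, so parse the season code
    # (assumes its closing year was intended).
    parts = col.split(':')
    absolute_mens_prone_shooting.loc[last_row, col] = ('20' + parts[1][2:4]
                                                       if len(parts) > 1 else 'Year')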
    
for i in range(len(absolute_mens_prone_shooting)):
    absolute_mens_prone_shooting.loc[i,'count'] = \
                                    absolute_mens_prone_shooting.loc[i].count() - 1    
absolute_mens_prone_shooting.loc[last_row,'Name'] = 'Year'

absolute_mens_prone_shooting.set_index('Name', drop=True, inplace=True)

In [10]:
# Collecting the standing accuracy data

absolute_mens_standing_shooting = pd.DataFrame(columns = ['Name'])

seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617','1718']

for season in seasons:
    events = ordered_events[season]
    for event in events:
        if event[0] == 'ibu':
            filename1 = ('%(cup)s_SMSP_%(season)s_%(event)s.pkl' 
                             %{'cup' : event[0], 'season' : season, 'event' : event[1]})
            filename2 = ('%(cup)s_SMSPS_%(season)s_%(event)s.pkl' 
                             %{'cup' : event[0], 'season' : season, 'event' : event[1]})

            colname1 = ':'.join(['ibu',season,event[1]])
            colname2 = ':'.join(['ibuS',season,event[1]])
            try:
                df = pd.read_pickle(filename1)[['Name','S1']]
                df.columns = ['Name',colname1]
                absolute_mens_standing_shooting = absolute_mens_standing_shooting.merge(df, 
                                                                 how = 'outer', on = 'Name')
            
            except: # race has no competition analysis file
                pass
            try:
                df = pd.read_pickle(filename2)[['Name','S1']]
                df.columns = ['Name',colname2]
                absolute_mens_standing_shooting = absolute_mens_standing_shooting.merge(df,
                                                                how = 'outer', on = 'Name')
            
            except: # race has no competition analysis file
                pass
        else:
            filename = ('%(cup)s_SMSP_%(season)s_%(event)s.pkl' 
                            %{'cup' : event[0], 'season' : season, 'event' : event[1]})
            colname = ':'.join(['wc',season,event[1]])
            
            try:
                df = pd.read_pickle(filename)[['Name','S1']]
                df.columns = ['Name',colname]
                absolute_mens_standing_shooting = absolute_mens_standing_shooting.merge(df, 
                                                                how = 'outer', on = 'Name')
            
            except: # race has no competition analysis file
                pass

last_row = len(absolute_mens_standing_shooting)

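for col in absolute_mens_standing_shooting.columns.tolist():
    # Columns look like 'wc:0405:CP01'; col[2:4] is not the year, so parse the season code
    # (assumes its closing year was intended).
    parts = col.split(':')
    absolute_mens_standing_shooting.loc[last_row, col] = ('20' + parts[1][2:4]
                                                          if len(parts) > 1 else 'Year')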
    
for i in range(len(absolute_mens_standing_shooting)):
    absolute_mens_standing_shooting.loc[i,'count'] = \
                                        absolute_mens_standing_shooting.loc[i].count() - 1
    
absolute_mens_standing_shooting.loc[last_row,'Name'] = 'Year'

absolute_mens_standing_shooting.set_index('Name', drop=True, inplace=True)

And then to pickle them both.

In [11]:
absolute_mens_prone_shooting.to_pickle('absolute_mens_prone_shooting.pkl')
absolute_mens_standing_shooting.to_pickle('absolute_mens_standing_shooting.pkl')

## Collecting the range and penalty data

[Collecting data by racer](#Collecting-data-by-racer)


In [12]:
# Collecting the prone range data (Note: this includes penalty loop times for prone shooting)

absolute_mens_prone_range = pd.DataFrame(columns = ['Name'])

seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617','1718']

for season in seasons:
    events = ordered_events[season]
    for event in events:
        if event[0] == 'ibu':
            filename1 = ('%(cup)s_SMSP_%(season)s_%(event)s.pkl' 
                             %{'cup' : event[0], 'season' : season, 'event' : event[1]})
            filename2 = ('%(cup)s_SMSPS_%(season)s_%(event)s.pkl' 
                             %{'cup' : event[0], 'season' : season, 'event' : event[1]})

            colname1 = ':'.join(['ibu',season,event[1]])
            colname2 = ':'.join(['ibuS',season,event[1]])
            try:
                df = pd.read_pickle(filename1)[['Name','prone range']]
                df.columns = ['Name',colname1]
                absolute_mens_prone_range = absolute_mens_prone_range.merge(df,
                                                                how = 'outer', on = 'Name')
            
            except: # race has no competition analysis file
                pass
            try:
                df = pd.read_pickle(filename2)[['Name','prone range']]
                df.columns = ['Name',colname2]
                absolute_mens_prone_range = absolute_mens_prone_range.merge(df, 
                                                                how = 'outer', on = 'Name')
            
            except: # race has no competition analysis file
                pass
        else:
            filename = ('%(cup)s_SMSP_%(season)s_%(event)s.pkl' 
                            %{'cup' : event[0], 'season' : season, 'event' : event[1]})
            colname = ':'.join(['wc',season,event[1]])
            
            try:
                df = pd.read_pickle(filename)[['Name','prone range']]
                df.columns = ['Name',colname]
                absolute_mens_prone_range = absolute_mens_prone_range.merge(df, 
                                                                how = 'outer', on = 'Name')
            
            except: # race has no competition analysis file
                pass

last_row = len(absolute_mens_prone_range)

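for col in absolute_mens_prone_range.columns.tolist():
    # Columns look like 'wc:0405:CP01'; col[2:4] is not the year, so parse the season code
    # (assumes its closing year was intended).
    parts = col.split(':')
    absolute_mens_prone_range.loc[last_row, col] = ('20' + parts[1][2:4]
                                                    if len(parts) > 1 else 'Year')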
    
for i in range(len(absolute_mens_prone_range)):
    absolute_mens_prone_range.loc[i,'count'] = absolute_mens_prone_range.loc[i].count() - 1
    
absolute_mens_prone_range.loc[last_row,'Name'] = 'Year'

absolute_mens_prone_range.set_index('Name', drop=True, inplace=True)

In [13]:
# Collecting the standing range times (including penalties)

absolute_mens_standing_range = pd.DataFrame(columns = ['Name'])

seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617','1718']

for season in seasons:
    events = ordered_events[season]
    for event in events:
        if event[0] == 'ibu':
            filename1 = ('%(cup)s_SMSP_%(season)s_%(event)s.pkl'
                             %{'cup' : event[0], 'season' : season, 'event' : event[1]})
            filename2 = ('%(cup)s_SMSPS_%(season)s_%(event)s.pkl'
                             %{'cup' : event[0], 'season' : season, 'event' : event[1]})

            colname1 = ':'.join(['ibu',season,event[1]])
            colname2 = ':'.join(['ibuS',season,event[1]])
            try:
                df = pd.read_pickle(filename1)[['Name','standing range']]
                df.columns = ['Name',colname1]
                absolute_mens_standing_range = absolute_mens_standing_range.merge(df, 
                                                                    how = 'outer', on = 'Name')
            
            except: # race has no competition analysis file
                pass
            try:
                df = pd.read_pickle(filename2)[['Name','standing range']]
                df.columns = ['Name',colname2]
                absolute_mens_standing_range = absolute_mens_standing_range.merge(df, 
                                                                    how = 'outer', on = 'Name')
            
            except: # race has no competition analysis file
                pass
        else:
            filename = ('%(cup)s_SMSP_%(season)s_%(event)s.pkl' 
                            %{'cup' : event[0], 'season' : season, 'event' : event[1]})
            colname = ':'.join(['wc',season,event[1]])
            
            try:
                df = pd.read_pickle(filename)[['Name','standing range']]
                df.columns = ['Name',colname]
                absolute_mens_standing_range = absolute_mens_standing_range.merge(df, 
                                                                    how = 'outer', on = 'Name')
            
            except: # race has no competition analysis file
                pass

last_row = len(absolute_mens_standing_range)

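for col in absolute_mens_standing_range.columns.tolist():
    # Columns look like 'wc:0405:CP01'; col[2:4] is not the year, so parse the season code
    # (assumes its closing year was intended).
    parts = col.split(':')
    absolute_mens_standing_range.loc[last_row, col] = ('20' + parts[1][2:4]
                                                       if len(parts) > 1 else 'Year')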
    
for i in range(len(absolute_mens_standing_range)):
    absolute_mens_standing_range.loc[i,'count'] = \
                                             absolute_mens_standing_range.loc[i].count() - 1    
absolute_mens_standing_range.loc[last_row,'Name'] = 'Year'

absolute_mens_standing_range.set_index('Name', drop=True, inplace=True)

And now to pickle them both.

In [14]:
absolute_mens_prone_range.to_pickle('absolute_mens_prone_range.pkl')
absolute_mens_standing_range.to_pickle('absolute_mens_standing_range.pkl')

<a name="WeightingtheRaces"></a>



# Weighting the Races 

One of the most important uses of the investigations above, into the impact of season, event, and so on upon attributes such as penalty loop times and speed, is to allow us to more accurately predict total times for individual racers in a particular race and, perhaps more usefully, to predict how racers are likely to fare relative to each other. To accomplish this, however, I need some way of determining whether, say, the third event of the 2016-2017 season is more similar to the third event of the 2015-2016 season or to the fifth event of the 2012-2013 season. Furthermore, given the dependencies that I found above, we may need several different similarity values, one for each of the pieces of the puzzle that we want to put together: speed, prone accuracy, standing accuracy, prone range time, standing range time, prone penalty loop time, and standing penalty loop time.

For each predictor variable, we start by creating a dataframe of size $n \times n$, where $n$ is the number of races for which we have data, such that the entry in row $i$ and column $j$ is the difference in the values of the predictor variable between races $i$ and $j$.
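
A minimal sketch of such a table, assuming made-up altitude values for three races and taking absolute differences (as the functions listed below are described as doing). This uses NumPy broadcasting rather than the explicit loops in the notebook's own functions.

```python
import numpy as np
import pandas as pd

# Hypothetical values of one predictor (altitude in metres) for three races.
altitude = pd.Series([700.0, 1600.0, 1000.0],
                     index=['wc:1516:CP03', 'wc:1617:CP03', 'wc:1213:CP05'])

values = altitude.to_numpy()
# Broadcasting a column against a row gives every pairwise difference at once.
diffs = np.abs(values[:, None] - values[None, :])
difference_table = pd.DataFrame(diffs, index=altitude.index,
                                columns=altitude.index)
print(difference_table)
```

The table is symmetric with zeros on the diagonal, since every race is identical to itself for any single predictor.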

In order to do this, I use the following functions:
1. ```course_similarities```: this function cycles through all pairings of events in a pair of World Cup seasons (possibly the same season twice), and determines the absolute value of the difference between the values of the given predictor variable for those two competitions. Here we restrict to variables that describe the course itself: 'Length', 'Height Diff', 'Max Climb', 'Total Climb', 'Altitude', 'Quant Year', and 'Quant Event'.
2. ```course_similarities_pred```: this function cycles through all pairs of World Cup seasons for a given course predictor and applies ```course_similarities``` to each pairing. It then divides the differences found by the total change in values for that variable, taken over all seasons, and rounds the obtained values to the nearest tenth. These values are then combined into a single dataframe, which is returned by the function.
3. ```ibu_course_similarities```: this function is the equivalent of ```course_similarities```, except that rather than finding similarities between two World Cup seasons, it finds similarities between one World Cup season and one IBU Cup season.
4. ```ibu_course_similarities_pred```: this function is the equivalent of ```course_similarities_pred```, except that rather than running through all pairings of two World Cup seasons, it runs through all pairings of one World Cup season and one IBU Cup season.
5. ```condition_similarities```: this function is the equivalent of ```course_similarities```, except that rather than restricting to variables describing the course, we restrict to variables describing the (weather) conditions: 'Air Temp C', 'Snow Temp C', 'Wind C', 'Humidity C', 'Quant Weather', and 'Quant Snow'.
6. ```condition_similarities_pred```: this function is the equivalent of ```course_similarities_pred```, except that it applies ```condition_similarities``` to each season pairing rather than ```course_similarities```.
7. ```ibu_condition_similarities```: this function is the equivalent of ```condition_similarities```, except that rather than finding similarities between two World Cup seasons, it finds similarities between one World Cup season and one IBU Cup season.
8. ```ibu_condition_similarities_pred```: this function is the equivalent of ```condition_similarities_pred```, except that rather than running through all pairings of two World Cup seasons, it runs through all pairings of one World Cup season and one IBU Cup season.
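
The scaling step described for the ```_pred``` functions can be sketched as follows. This is not the notebook's implementation: the helper name `normalised_differences` is my own, the race labels and values are made up, and I am assuming that "total change" means the largest value minus the smallest value of the predictor.

```python
import numpy as np
import pandas as pd

def normalised_differences(values):
    """Pairwise absolute differences, scaled to [0, 1] and rounded to tenths."""
    values = pd.Series(values, dtype=float)
    arr = values.to_numpy()
    diffs = np.abs(arr[:, None] - arr[None, :])
    # Assumption: total change = max - min over all races.
    # This would divide by zero if every race had the same value.
    total_change = arr.max() - arr.min()
    scaled = np.round(diffs / total_change, 1)
    return pd.DataFrame(scaled, index=values.index, columns=values.index)

table = normalised_differences({'race1': 700.0, 'race2': 1600.0, 'race3': 1000.0})
print(table)
```

After scaling, a value of 0 means two races are identical for that predictor and a value of 1 means they sit at opposite ends of its observed range, which makes tables for different predictors comparable.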

Previous Section: [Collecting the Data](#Collecting-data-by-racer)

Next Section: [Isolating Variable Effects](#isolating_variable_effects)

[Table of Contents](#Table-of-Contents)



In [15]:
weather_conditions = pd.read_pickle('weather_averages.pkl')
snow_conditions = pd.read_pickle('snow_averages.pkl')

In [16]:
seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617','1718']

for season in seasons:
    filename = 'weather_summary_%(season)s_M.pkl' %{'season' : season}
    df = pd.read_pickle(filename)

    for i in range(len(df)):
        df.loc[i,'Quant Weather'] = weather_conditions.loc[df.loc[i,'Weather C'], 
                                                                           'Weather Quant']
        df.loc[i, 'Quant Snow'] = snow_conditions.loc[df.loc[i, 'Snow Cond C'], 'Snow Quant']
        
    df.to_pickle(filename)
    
for season in seasons:
    filename1 = 'ibu_weather_summary_%(season)s_M.pkl' %{'season' : season}
    df1 = pd.read_pickle(filename1)
    
    for i in range(len(df1)):
        df1.loc[i,'Quant Weather'] = weather_conditions.loc[df1.loc[i,'Weather C'], 
                                                                            'Weather Quant']
        df1.loc[i, 'Quant Snow'] = snow_conditions.loc[df1.loc[i, 'Snow Cond C'], 
                                                                               'Snow Quant']
        
    df1.to_pickle(filename1)



In [17]:
"""
Function
--------

course_similarities : cycles through all pairings of events in a pair of seasons (possibly
                      the same), and determines the differences in the values 
                      taken by a given predictor variable for those two competitions

Parameters
----------

season1 : a string that codes one season under consideration. It is of the form y1y2 where
          y1 is the last two digits of the year in which the season started, and y2 is the 
          last two digits of the year in which the season ended.
season2 : a string that codes the other season under consideration. It is of the form y1y2
          where y1 is the last two digits of the year in which the season started, and y2
          is the last two digits of the year in which the season ended.
predictor : one of the pieces of information about the courses for which we have data. 
            Possibilities include 'Length', 'Height Diff','Max Climb','Total Climb',
            'Altitude','Quant Year', and 'Quant Event'

Returns
-------

similarities : a dataframe containing all of the season1 world cup events as rows and season2
               world cup events as columns. The entries are the absolute values of the 
               differences between the predictor variable values for the two events.

Examples
--------
"""

def course_similarities(season1, season2, predictor):
    
    filename1 = 'course_summary_%(season1)s_M.pkl' %{'season1' : season1}
    filename2 = 'course_summary_%(season2)s_M.pkl' %{'season2' : season2}
    
    df1 = pd.read_pickle(filename1)[['Year','Event',predictor]]
    df2 = pd.read_pickle(filename2)[['Year','Event',predictor]]
    
    similarities = pd.DataFrame()
    
    for i in range(len(df1)):
        rowname = ':'.join([season1,df1.loc[i,'Event']])
        for j in range(len(df2)):
            colname = ':'.join([season2,df2.loc[j,'Event']])
            similarities.loc[rowname,colname] = abs(float(df1.loc[i,predictor]) - 
                                                            float(df2.loc[j,predictor]))
    return similarities

In [18]:
"""
Function
--------

course_similarities_pred : cycles through all pairs of seasons for a given predictor and
                           applies course_similarities for each pairing of world cup seasons,
                           then converts them to a fraction of the total range taken by
                           the given predictor, and concatenates all of the results into a 
                           dataframe

Parameters
----------

predictor : one of the pieces of information about the courses for which we have data. 
            Possibilities include 'Length', 'Height Diff','Max Climb','Total Climb',
            'Altitude','Quant Year', and 'Quant Event'

Returns
-------

stores a pickled dataframe on the hard drive

Examples
--------
"""

def course_similarities_pred(predictor):
    
    seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
               '1516','1617','1718']

    wc_course_similarities = pd.DataFrame()

    for season1 in seasons:
        similarities = pd.DataFrame()
        for season2 in seasons:
            two_season_similarites = course_similarities(season1, season2, predictor)
            similarities = similarities.join(two_season_similarites, how='outer')
        wc_course_similarities = pd.concat([wc_course_similarities, similarities])
    
    max_value = np.max(np.array(wc_course_similarities))

    wc_course_similarities = (max_value - wc_course_similarities)/max_value
    
    filename = 'wc_%(pred)s_similarities.pkl' %{'pred' : predictor.lower().replace(" ", "_")}
    wc_course_similarities.to_pickle(filename)

In [19]:
"""
Function
--------

ibu_course_similarities : cycles through all pairings of events in one world cup season 
                          and one ibu cup season (possibly the same years), and determines
                          the differences in the values taken by a given predictor variable 
                          for those two competitions

Parameters
----------

season1 : a string that codes the world cup season under consideration. It is of the form 
          y1y2 where y1 is the last two digits of the year in which the season started, 
          and y2 is the last two digits of the year in which the season ended.
season2 : a string that codes the ibu cup season under consideration. It is of the form y1y2
          where y1 is the last two digits of the year in which the season started, and y2
          is the last two digits of the year in which the season ended.
predictor : one of the pieces of information about the courses for which we have data. 
            Possibilities include 'Length', 'Height Diff','Max Climb','Total Climb',
            'Altitude','Quant Year', and 'Quant Event'


Returns
-------

similarities : a dataframe containing all of the season1 world cup events as rows and season2
               ibu cup events as columns. The entries are the absolute values of the 
               differences between the predictor variable values for the two events.

Examples
--------
"""

# Note that this puts the ibu events as the columns and world cup events as the rows, since
# I'm interested in making predictions about world cup times

def ibu_course_similarities(season1, season2, predictor):
    
    filename1 = 'course_summary_%(season1)s_M.pkl' %{'season1' : season1}
    filename2 = 'ibu_course_summary_%(season2)s_M.pkl' %{'season2' : season2}
    
    df1 = pd.read_pickle(filename1)[['Year','Event',predictor]]
    df2 = pd.read_pickle(filename2)[['Year','Event','Race',predictor]]
    similarities = pd.DataFrame()
    
    for i in range(len(df1)):
        rowname = ':'.join([season1,df1.loc[i,'Event']])
        for j in range(len(df2)):
            if df2.loc[j,'Race'] == 'SMSP':
                colname = ':'.join([season2,df2.loc[j,'Event']])
            else:
                colname = ':'.join([season2,"".join([df2.loc[j,'Event'],'S'])])
            similarities.loc[rowname,colname] = abs(float(df1.loc[i,predictor]) - 
                                                                float(df2.loc[j,predictor]))
    return similarities

In [20]:
"""
Function
--------

ibu_course_similarities_pred : cycles through all pairings of one world cup season and one 
                               ibu cup season, applies ibu_course_similarities with the 
                               given predictor for each pairing, then converts them to a 
                               fraction of the total range taken by the given predictor, 
                               and concatenates all of the results into a dataframe

Parameters
----------

predictor : one of the pieces of information about the courses for which we have data. 
            Possibilities include 'Length', 'Height Diff','Max Climb','Total Climb',
            'Altitude','Quant Year', and 'Quant Event'

Returns
-------

saves a pickled dataframe on the hard drive

Examples
--------
"""

def ibu_course_similarities_pred(predictor):
    
    seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
               '1516','1617','1718']

    ibu_course_sims = pd.DataFrame()

    for season1 in seasons:
        similarities = pd.DataFrame()
        for season2 in seasons:
            two_season_similarites = ibu_course_similarities(season1, season2, predictor)
            similarities = similarities.join(two_season_similarites, how='outer')
        ibu_course_sims = pd.concat([ibu_course_sims, similarities])
    
    max_value = np.max(np.array(ibu_course_sims))

    ibu_course_sims = (max_value - ibu_course_sims)/max_value
    
    filename = 'ibu_%(pred)s_similarities.pkl' %{'pred' : predictor.lower().replace(" ", "_")}
    ibu_course_sims.to_pickle(filename)

In [21]:
predictors = ['Length', 'Height Diff','Max Climb','Total Climb','Altitude','Quant Year',
              'Quant Event']

for predictor in predictors:
    course_similarities_pred(predictor)
    ibu_course_similarities_pred(predictor)

In [22]:
"""
Function
--------

condition_similarities : cycles through all pairings of events in two world cup seasons
                         (possibly the same years), and determines the differences in the
                         values taken by a given predictor variable for those two competitions

Parameters
----------

season1 : a string that codes one world cup season under consideration. It is of the form 
          y1y2 where y1 is the last two digits of the year in which the season started, 
          and y2 is the last two digits of the year in which the season ended.
season2 : a string that codes another world cup season under consideration. It is of
          the form y1y2 where y1 is the last two digits of the year in which the season 
          started, and y2 is the last two digits of the year in which the season ended.
predictor : one of the pieces of information about the conditions for which we have data. 
            Possibilities include 'Air Temp C','Snow Temp C','Wind C','Humidity C',
            'Quant Weather', and 'Quant Snow'


Returns
-------

similarities : a dataframe containing all of the season1 world cup events as rows and season2
               world cup events as columns. The entries are the absolute values of the 
               differences between the predictor variable values for the two events.

Examples
--------
"""

def condition_similarities(season1, season2, predictor):
    
    filename1 = 'weather_summary_%(season1)s_M.pkl' %{'season1' : season1}
    filename2 = 'weather_summary_%(season2)s_M.pkl' %{'season2' : season2}
    
    df1 = pd.read_pickle(filename1)[['Year','Event',predictor]]
    df2 = pd.read_pickle(filename2)[['Year','Event',predictor]]
    
    similarities = pd.DataFrame()
    
    for i in range(len(df1)):
        rowname = ':'.join([season1,df1.loc[i,'Event']])
        for j in range(len(df2)):
            colname = ':'.join([season2,df2.loc[j,'Event']])
            similarities.loc[rowname,colname] = abs(float(df1.loc[i,predictor]) - 
                                                            float(df2.loc[j,predictor]))
    return similarities

In [23]:
"""
Function
--------

condition_similarities_pred : cycles through all pairings of two world cup seasons, applies
                              condition_similarities with the given predictor for each 
                              pairing, then converts them to a fraction of the total range 
                              taken by the given predictor, and concatenates all of the 
                              results into a dataframe

Parameters
----------

predictor : one of the pieces of information about the conditions for which we have data. 
            Possibilities include 'Air Temp C','Snow Temp C','Wind C','Humidity C',
            'Quant Weather', and 'Quant Snow'

Returns
-------

stores a pickled dataframe on the hard drive

Examples
--------
"""

def condition_similarities_pred(predictor):
    seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
               '1516','1617','1718']

    wc_condition_sims = pd.DataFrame()

    for season1 in seasons:
        similarities = pd.DataFrame()
        for season2 in seasons:
            two_season_similarites = condition_similarities(season1, season2, predictor)
            similarities = similarities.join(two_season_similarites, how='outer')
        wc_condition_sims = pd.concat([wc_condition_sims, similarities])
    
    max_value = np.max(np.array(wc_condition_sims))

    wc_condition_sims = (max_value - wc_condition_sims)/max_value
    
    filename = 'wc_%(pred)s_similarities.pkl' %{'pred' : predictor.lower().replace(" ", "_")}
    wc_condition_sims.to_pickle(filename)

In [24]:
"""
Function
--------

ibu_condition_similarities : cycles through all pairings of events in one world cup season 
                             and one ibu cup season (possibly the same years), and determines
                             the differences in the values taken by a given predictor variable 
                             for those two competitions

Parameters
----------

season1 : a string that codes the world cup season under consideration. It is of the form 
          y1y2 where y1 is the last two digits of the year in which the season started, 
          and y2 is the last two digits of the year in which the season ended.
season2 : a string that codes the ibu cup season under consideration. It is of
          the form y1y2 where y1 is the last two digits of the year in which the season 
          started, and y2 is the last two digits of the year in which the season ended.
predictor : one of the pieces of information about the conditions for which we have data. 
            Possibilities include 'Air Temp C','Snow Temp C','Wind C','Humidity C',
            'Quant Weather', and 'Quant Snow'


Returns
-------

similarities : a dataframe containing all of the season1 world cup events as rows and season2
               ibu cup events as columns. The entries are the absolute values of the 
               differences between the predictor variable values for the two events.

Examples
--------
"""

def ibu_condition_similarities(season1, season2, predictor):
    
    filename1 = 'weather_summary_%(season1)s_M.pkl' %{'season1' : season1}
    filename2 = 'ibu_weather_summary_%(season2)s_M.pkl' %{'season2' : season2}
    
    df1 = pd.read_pickle(filename1)[['Year','Event',predictor]]
    df2 = pd.read_pickle(filename2)[['Year','Event','Race',predictor]]
    
    similarities = pd.DataFrame()
    
    for i in range(len(df1)):
        rowname = ':'.join([season1,df1.loc[i,'Event']])
        for j in range(len(df2)):
            if df2.loc[j,'Race'] == 'SMSP':
                colname = ':'.join([season2,df2.loc[j,'Event']])
            else:
                colname = ':'.join([season2,"".join([df2.loc[j,'Event'],'S'])])
            similarities.loc[rowname,colname] = abs(float(df1.loc[i,predictor]) - 
                                                            float(df2.loc[j,predictor]))
    return similarities

In [25]:
"""
Function
--------

ibu_condition_similarities_pred : cycles through all pairings of one world cup season and 
                                  one ibu cup season, applies ibu_condition_similarities 
                                  with the given predictor for each pairing, then converts 
                                  them to a fraction of the total range taken by the given 
                                  predictor, and concatenates all of the results into a 
                                  dataframe

Parameters
----------

predictor : one of the pieces of information about the conditions for which we have data. 
            Possibilities include 'Air Temp C','Snow Temp C','Wind C','Humidity C',
            'Quant Weather', and 'Quant Snow'

Returns
-------

saves a pickled dataframe to the hard drive

Examples
--------
"""

def ibu_condition_similarities_pred(predictor):
    seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
               '1516','1617','1718']

    ibu_condition_sims = pd.DataFrame()

    for season1 in seasons:
        similarities = pd.DataFrame()
        for season2 in seasons:
            two_season_similarites = ibu_condition_similarities(season1, season2, predictor)
            similarities = similarities.join(two_season_similarites, how='outer')
        ibu_condition_sims = pd.concat([ibu_condition_sims, similarities])
    
    max_value = np.max(np.array(ibu_condition_sims))

    ibu_condition_sims = (max_value - ibu_condition_sims)/max_value
    
    filename = 'ibu_%(pred)s_similarities.pkl' %{'pred' : predictor.lower().replace(" ", "_")}
    ibu_condition_sims.to_pickle(filename)

In [26]:
# Each predictor is listed once; repeating one would only recompute and overwrite its file
predictors = ['Air Temp C','Snow Temp C','Wind C','Humidity C','Quant Weather',
              'Quant Snow']

for predictor in predictors:
    condition_similarities_pred(predictor)
    ibu_condition_similarities_pred(predictor)

<a name = "isolating_variable_effects"></a>



# Isolating the Variable Effects


I think that much of what follows is actually garbage and that I'll end up deleting it (and if that turns out to be wrong, that's why I made a copy of the file). One of the last things I did was to isolate each individual predictor and see what effect weighting by only that predictor had on the predictive accuracy of my model. To do that, I need a random set of races to draw from. Since I want to make predictions about the last four seasons, I use the first 10 seasons for testing. Furthermore, since my predictions depend heavily on a racer's performance in his previous races, I avoid sampling from the first 5 seasons, which leaves seasons 0910 through 1314. I think that 10 races are probably sufficient for comparison. I then consider each of the pieces of the total time separately. We have the following functions:

- [Variable Effects on Speed](#Variable-Effects-on-Speed)
    1. ```adjust_times```: Given a racer and a competition, this function takes his event speeds over all previous seasons in his career, finds the best fit line through the data (with ```quant_season``` as the predictor variable and ```speed``` as the response variable), and adjusts speeds from prior seasons to reflect what they would be predicted to be were the race run in the season under consideration.
    2. ```build_racer_speed_distribution```: Given a racer, season, and event as well as dataframes containing similarity data about world cup vs world cup and world cup vs ibu cup race conditions, this function creates a list of speeds from previous races, in which the multiplicity of the speed is determined by the degree of similarity between the race of interest and the race from which the speed was taken. It then returns a list containing the racer's name, the mean of the list of speeds, the standard deviation of the list, the 10th, 25th, 50th, 75th, and 90th percentiles, the racer's actual speed, and the number of prior races in which the racer had competed.
    3. ```check_predictions```: This function takes as input a dataframe created by concatenating the output of repeated calls to ```build_racer_*_distribution```. It returns a dataframe with one row for each racer which indicates whether or not that racer's actual *-value falls within the middle 50 percent and the middle 80 percent of the distributions obtained.
    4. ```inside_outside```: This function takes a dataframe output by ```check_predictions``` and returns the percentage of racers whose actual *-values fall within the middle 50 percent of their time distributions. 
    5. ```above_below```: This function takes as input a dataframe created by concatenating the output of repeated calls to ```build_racer_*_distribution```. For each racer, it determines whether that racer's actual *-value was above the mean of their distribution, and returns the percentage of racers for which this is true.
    6. ```evaluate_variable_impact```: This function pulls together ```build_racer_speed_distribution```, ```check_predictions```, ```inside_outside```, and ```above_below```. Given a competition (season and event) and a predictor variable, it calculates the percentage of racers whose actual times were within the middle 50% of their distributions and the percentage of racers whose actual times were above the mean of their distributions and returns the results in the form of a list.
    7. ```variable_score```: This function takes as input a dataframe produced by concatenating the results of repeated calls to ```evaluate_variable_impact```. It then sorts the dataframe event by event, assigning one point for the first place variable, two for the second place variable, etc as it goes. It then sums these values for each variable and returns a dataframe containing these sums.
    8. ```pluck_best_variables```: This function takes the two dataframes produced by concatenation of repeated calls to ```inside_outside``` and ```above_below```, calls ```variable_score``` on each, and returns all variables with both scores at least 70% of the maximum (for the percentages in the middle 50%) and at most 130% of the minimum (for the percentages above the mean).

- [Variable Effects on Prone Accuracy](#Variable-Effects-on-Prone-Accuracy)
    1. ```build_racer_pa_distribution```: Given a racer, season, and event as well as dataframes containing similarity data about world cup vs world cup and world cup vs ibu cup race conditions, this function creates a list of accuracies in the following manner. First, it creates a master list in the same manner used by ```build_racer_speed_distribution```, that is, by repeating the number of missed shots from each prior race according to the similarity between that race and the current race. Second, it takes samples of size $n$ (the length of the master list) and calculates each sample average. For each sample, it simulates 5 shots, uses the sample average to decide whether each shot was a miss, and appends the number of missed shots to the list of predictions. Finally, using this list of predictions, it returns a list containing the racer's name, the mean of the list of prone accuracies, the standard deviation of the list, the 10th, 25th, 50th, 75th, and 90th percentiles, the racer's actual accuracy, and the number of previous races for which the racer had data.
    2. ```evaluate_variable_impactPA```: This function pulls together ```build_racer_pa_distribution```, ```check_predictions```, ```inside_outside```, and ```above_below```. Given a competition (season and event) and a predictor variable, it calculates the percentage of racers whose actual prone accuracies were within the middle 50% of their distributions and the percentage of racers whose actual prone accuracies were above the mean of their distributions and returns the results in the form of a list.

- [Variable Effects on Standing Accuracy](#Variable-Effects-on-Standing-Accuracy)
    1. ```build_racer_sa_distribution```: this function is the equivalent of ```build_racer_pa_distribution``` applied to standing rather than prone shooting accuracy
    2. ```evaluate_variable_impactSA```: this function is the equivalent of ```evaluate_variable_impactPA``` but using ```build_racer_sa_distribution``` in place of ```build_racer_pa_distribution```


- [Variable Effects on Prone Range Times](#Variable-Effects-on-Prone-Range)
    1. ```build_racer_pr_distribution```: Given a racer and a competition (season and event) along with similarity dataframes, first builds weighted lists of range times and missed shots. It then creates bootstrap samples from these lists, and fits a linear regression (with missed shots as the predictor variable and range time as the response variable) to estimate the shooting time (intercept) and penalty loop time (slope). Using the same bootstrap sample, the function calculates the average prone shooting percentage, and uses it to simulate a series of five shots. The predicted range time is then calculated as
    
    *predicted\_range\_time* = _shooting time_ + _penalty loop time_ $\times$ _missed shots_ 
    
    and added to the list of predicted range times. The function returns a list containing the racer's name, the mean of the list of predicted prone range times, the standard deviation of the list, the 10th, 25th, 50th, 75th, and 90th percentiles, the racer's actual prone range time, and the number of previous races for which the racer had data.
    2. ```evaluate_variable_impactPR```: This function pulls together ```build_racer_pr_distribution```, ```check_predictions```, ```inside_outside```, and ```above_below```. Given a competition (season and event) and a predictor variable, it calculates the percentage of racers whose actual prone range times were within the middle 50% of their distributions and the percentage of racers whose actual prone range times were above the mean of their distributions and returns the results in the form of a list.


- [Variable Effects on Standing Range Times](#Variable-Effects-on-Standing-Range)
    1. ```build_racer_sr_distribution```: this function is the equivalent of ```build_racer_pr_distribution``` applied to standing rather than prone range times
    2. ```evaluate_variable_impactSR```: this function is the equivalent of ```evaluate_variable_impactPR``` but using ```build_racer_sr_distribution``` in place of ```build_racer_pr_distribution```
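
The weighting-by-multiplicity idea behind ```build_racer_speed_distribution```, together with the checks described for ```check_predictions``` and ```above_below```, can be sketched on made-up numbers. This is a minimal illustration, not the notebook's implementation: the speeds, weights, and the name `actual` are hypothetical, `np.repeat` stands in for the nested loops, and treating the percentile bounds as inclusive is an assumption.

```python
import numpy as np

# Hypothetical prior speeds and their similarity weights (each between 0 and 1).
prior_speeds = [6.0, 6.2, 6.4]
weights = [1.0, 0.5, 0.2]

# Turn each weight into a whole number of copies between 0 and 10.
copies = [int(round(w, 1) * 10) for w in weights]

# Repeat each speed according to its weight to build the weighted list.
weighted = np.repeat(prior_speeds, copies)

# Summaries of the weighted list.
tenth, q1, q3, ninetieth = np.percentile(weighted, [10, 25, 75, 90])
mean = weighted.mean()

# Hypothetical actual speed, checked against the distribution.
actual = 6.1
in_middle_50 = q1 <= actual <= q3          # assumed inclusive bounds
in_middle_80 = tenth <= actual <= ninetieth
above_mean = actual > mean

print(len(weighted), in_middle_50, in_middle_80, above_mean)
```

A race with similarity 1.0 contributes ten copies of its speed and one with similarity 0.2 contributes two, so the most similar races dominate the percentiles.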


\*-value: Here, we use the notation \*-value to reflect the fact that we use certain functions, for instance ```check_predictions``` and ```above_below``` among others, with the outputs of all of our build distribution functions. The notation \*-value can thus be interpreted as referring to the racer's actual speed, actual prone shooting accuracy, etc, depending on the context in which the function is being used. Similarly, the notation ```build_racer_*_distribution``` indicates that the function being described can take outputs from ```build_racer_speed_distribution```, ```build_racer_pa_distribution```, etc.
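
For the range-time functions, the bootstrap-and-regress step described for ```build_racer_pr_distribution``` can be sketched as follows. Everything here is an assumption made for illustration: the data are made up and noise-free, the number of bootstrap rounds (200) is arbitrary, `np.polyfit` stands in for whatever regression the notebook uses, and skipping degenerate resamples is my own choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weighted history: missed shots (out of 5) and range times in seconds.
misses = np.array([0, 1, 2, 0, 1, 3, 1, 0])
range_times = 50.0 + 25.0 * misses  # made-up: 50 s shooting, 25 s per penalty loop

predicted = []
for _ in range(200):
    # Bootstrap: resample the prior races with replacement.
    idx = rng.integers(0, len(misses), size=len(misses))
    m, t = misses[idx], range_times[idx]
    if m.max() == m.min():
        continue  # slope is undefined when every resampled race has the same misses

    # The slope estimates the penalty loop time, the intercept the shooting time.
    loop_time, shooting_time = np.polyfit(m, t, 1)

    # Simulate five shots using the miss rate of the same bootstrap sample.
    miss_rate = m.mean() / 5
    simulated_misses = rng.binomial(5, miss_rate)

    predicted.append(shooting_time + loop_time * simulated_misses)

predicted = np.array(predicted)
print(predicted.mean(), np.percentile(predicted, [10, 25, 50, 75, 90]))
```

Because the toy data lie exactly on a line, every prediction is close to 50 s plus a multiple of 25 s; with real data the fitted intercept and slope vary from one bootstrap sample to the next, which widens the predicted distribution.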

Previous Section: [Weighting the Races](#Weighting-the-Races)

Next Section: [Pulling it all together](#pulling_together)

[Table of Contents](#Table-of-Contents)



In [27]:
seasons = ['0910', '1011', '1112', '1213', '1314']
events = ['CP01','CP02','CP03','CP04','CP05','CP06','CP07','CP08','CP09','CH__']

chosen_events = []
for i in range(10):
    redraw = 0
    while redraw == 0:
        competition = [random.choice(seasons), random.choice(events)]
        redraw = 1
        if competition in chosen_events:
            redraw = 0
        if competition[0] == '1314':
            if competition[1] in ['CP05','CP06']:
                redraw = 0
    if competition[0] in ['0910', '1314']:
        if competition[1] == 'CH__':
            competition[1] = 'OG__'

    chosen_events.append(competition)
    
chosen_events

[['1112', 'CP07'],
 ['1112', 'CP09'],
 ['1011', 'CP07'],
 ['1314', 'CP09'],
 ['0910', 'CP04'],
 ['1011', 'CP09'],
 ['1112', 'CP02'],
 ['1314', 'CP03'],
 ['1112', 'CP03'],
 ['0910', 'CP07']]

In [28]:
weights = ["quant_snow", "quant_weather", "humidity_c", "wind_c", "snow_temp_c", "air_temp_c", 
           "quant_event", "quant_year", "altitude", "total_climb", "max_climb", "height_diff", 
           "length"]


## Variable Effects on Speed

[Isolating the Variable Effects](#Isolating-the-Variable-Effects)
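
Before the functions themselves, here is a small sketch of the trend adjustment that ```adjust_times``` performs. It is an illustration under assumptions: the years and speeds are made up, and `np.polyfit` is used in place of the notebook's `LinearRegression` to keep the example free of scikit-learn.

```python
import numpy as np

# Hypothetical speeds for one racer, keyed by the calendar year each season ended.
years = np.array([2010.0, 2011.0, 2012.0, 2013.0])
speeds = np.array([6.0, 6.1, 6.2, 6.3])

# Fit speed = slope * year + intercept (np.polyfit returns the highest degree first).
slope, intercept = np.polyfit(years, speeds, 1)

# Shift each past speed along the fitted trend to the season under consideration.
target_year = 2014.0
adjusted = speeds + slope * (target_year - years)

print(adjusted)
```

In this toy case the racer improves by 0.1 per season, so every prior speed is adjusted to roughly 6.4, the value the trend predicts for 2014. Only the slope is used; the intercept plays no role in the adjustment.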

In [29]:
"""
Function
--------

adjust_times : Takes a racer and his event speeds over the course of a career, finds the
               best fit line through the data, and adjusts early speeds to reflect what 
               they would be predicted to be if the race were run in the season under 
               consideration.

Parameters
----------

racer : a pandas Series containing the racer's speeds in his prior races, indexed by race 
        labels of the form circuit:season:event (for example 'wc:1112:CP07')
season : a string containing the four-digit calendar year in which the season under 
         consideration ended (for example '2014' for the season coded '1314')

Returns
-------

adjusted_racer : a copy of the Series in which each speed has been shifted along the 
                 fitted trend to the season under consideration

Examples
--------

"""

# To be called inside build_racer_speed_distribution

def adjust_times(racer, season):
    
    indices = racer.index.tolist()
    years = [item.split(':')[1] for item in indices]
    years = [''.join(['20',item[2:4]]) for item in years]
    years = [float(item) for item in years]
    years = np.array(years).reshape(-1,1)
    speeds = np.array(racer.tolist()).reshape(-1,1)
    
    linreg = LinearRegression()
    linreg.fit(years , speeds)
    
    coef = linreg.coef_[0][0]
    
    # creating a time adjusted version of the speeds
    
    adjusted_racer = racer.copy()
    
    # The Series is indexed by race labels, so use .iloc for positional access
    for i in range(len(racer)):
        time_delta = float(season) - years[i][0]
        adjusted_racer.iloc[i] = adjusted_racer.iloc[i] + coef*time_delta
        
    return adjusted_racer

In [30]:
"""
Function
--------

build_racer_speed_distribution : takes a racer, season, event, and dataframes containing
                                 similarity data about world cup vs world cup and world cup
                                 vs ibu cup race conditions, and finds a list of speeds
                                 from previous races, where the multiplicity of the speed
                                 is determined by the degree of similarity between the race
                                 of interest and the race from which the speed was taken. It
                                 then returns a list containing the racer's name, the mean of
                                 the list of speeds, the standard deviation of the list, the
                                 10th, 25th, 50th, 75th, and 90th percentiles, the racer's 
                                 actual speed, and the number of previous races for which the
                                 racer had data

Parameters
----------

racer : a string containing the name of a biathlete
season : a string that codes the season under consideration. It is of the form y1y2 where
         y1 is the last two digits of the year in which the season started, and y2 is the 
         last two digits of the year in which the season ended.
event : a string that codes the event under consideration. Possible values are 'CP01', 
        'CP02', 'CP03', 'CP04', 'CP05', 'CP06', 'CP07', 'CP08', 'CP09', 'CH__', 'OG__'
wc_sim : a dataframe containing values between 0 and 1 which codes the degree of similarity 
         between world cup (wc) races in a pairwise fashion. Pairs with a similarity value
         of 1 were run under nearly identical conditions, while those with a similarity
         value of 0 were run under extremely different conditions
ibu_sim : a dataframe containing values between 0 and 1 which codes the degree of similarity 
          between world cup (wc) races and ibu cup races in a pairwise fashion. 

Returns
-------

a list describing the distribution of the weighted prior data, in the order given above

Examples
--------
"""

def build_racer_speed_distribution(racer, season, event, wc_sim, ibu_sim):

    col_name = ':'.join(['wc',season, event])
    racer_data = absolute_mens_speed.loc[racer, :col_name]
    name = racer
    actual = float(racer_data[col_name])
    short_racer_data = racer_data[2:-1].copy()
    short_racer_data.dropna(inplace = True)
    predictors = len(short_racer_data)
    
    year = ''.join(['20',season[2:4]])
    short_racer_data = adjust_times(short_racer_data, year)
    indices = short_racer_data.index.tolist()

    race_weights = []
    for index in indices:
        split_index = index.split(':')
        if split_index[0] == 'wc':
            race_weights.append(wc_sim.loc[':'.join([season,event]),
                                               ':'.join([split_index[1],split_index[2]])])
        else:
            race_weights.append(ibu_sim.loc[':'.join([season,event]),
                                                ':'.join([split_index[1],split_index[2]])])
    # Convert each similarity in [0, 1] to an integer replication count between 0 and 10
    race_weights_rounded = [int(round(item,1)*10) for item in race_weights]

    # Making the weighted data list
    
    racer_predict = []

    for i in range(len(short_racer_data)): 
        for j in range(race_weights_rounded[i]):
            racer_predict.append(float(short_racer_data.iloc[i]))

    mean = np.mean(racer_predict)
    dev = np.std(racer_predict)
    tenth = np.percentile(racer_predict, 10)
    q1 = np.percentile(racer_predict, 25)
    med = np.median(racer_predict)
    q3 = np.percentile(racer_predict, 75)
    ninetieth = np.percentile(racer_predict, 90)
    
    return [name, mean, dev, tenth, q1, med, q3, ninetieth, actual, predictors]
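
To illustrate the weighting step in build_racer_speed_distribution, here is a minimal sketch with made-up speeds and similarity values (none of them come from the race data). Each similarity is rounded to one decimal place and multiplied by 10 to give a replication count, so a race with similarity 0.8 contributes eight copies of its speed and a race with similarity 0.04 contributes none. `np.repeat` is used here as a stand-in for the nested loop and should give the same weighted list.

```python
import numpy as np

# Hypothetical speeds from three prior races and their similarity to the target race
speeds = np.array([6.1, 5.8, 6.4])
similarities = [0.8, 0.04, 0.3]

# Same rounding rule as in build_racer_speed_distribution
counts = [int(round(s, 1) * 10) for s in similarities]

# Repeat each speed according to its count to form the weighted sample
weighted = np.repeat(speeds, counts)

# Summary statistics of the weighted sample
weighted_mean = weighted.mean()
q1, med, q3 = np.percentile(weighted, [25, 50, 75])
```

One consequence of this rule is that any race with a similarity below 0.05 drops out of the distribution entirely, even though it still counts toward `predictors`.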

In [31]:
"""
Function
--------

check_predictions : takes the output of repeated calls to build_racer_*_distribution
                    and returns a dataframe indicating which racers fall within the
                    middle 50% and the middle 80% of their predicted distributions

Parameters
----------

df : a dataframe that is the output of repeated calls to build_racer_*_distribution

Returns
-------

checked_predictions : a dataframe with one row for each racer in a given competition
                      with columns that indicate whether or not the actual speed of
                      the racer during the competition fell within the middle 50%
                      or the middle 80% of his speed distribution

Examples
--------
"""

def check_predictions(df):
    
    checked_predictions = []
    for i in range(len(df)):
        name = df.loc[i,'name']
        predictors = df.loc[i,'predictors']
        if df.loc[i,'q1'] <= df.loc[i,'actual'] <= df.loc[i,'q3']:
            middle_50 = True
        else:
            middle_50 = False

        if df.loc[i,'tenth'] <= df.loc[i,'actual'] <= df.loc[i,'ninetieth']:
            middle_80 = True
        else:
            middle_80 = False
            
        checked_predictions.append([name,middle_50, middle_80, predictors])
        
    checked_predictions = pd.DataFrame(checked_predictions, columns = ['name', 'middle_50',
                                                            'middle_80', 'predictors'])
    
    return checked_predictions

In [32]:
"""
Function
--------

inside_outside : takes a dataframe output by check_predictions and returns the number
                 of racers inside and outside the middle 50%, together with the 
                 percentage of racers inside the middle 50% 

Parameters
----------

df : a dataframe that is the output of a call to check_predictions

Returns
-------

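percent_inside_50 : a list of the form [inside_50, outside_50, percent], where inside_50 and
                    outside_50 are counts of racers and percent is a float giving the
                    percentage of racers in a given race whose actual value was in the
                    middle 50% of their distribution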

Examples
--------

"""

def inside_outside(df):
    
    inside_50 = 0
    outside_50 = 0

    for i in range(len(df)):
        if df.loc[i,'middle_50']:
            inside_50 += 1
        else:
            outside_50 += 1
    percent_inside_50 = [inside_50, outside_50, float(inside_50)/(inside_50 + outside_50)*100]


    return percent_inside_50
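
As a cross-check on check_predictions and inside_outside, the same flags and percentage can be computed with vectorized comparisons. This is only a sketch on an invented three-racer table laid out like the output of build_racer_*_distribution; the names and numbers are hypothetical.

```python
import pandas as pd

# Invented summary rows in the layout returned by build_racer_*_distribution
df = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'tenth': [5.8, 6.0, 5.5],
    'q1': [5.9, 6.1, 5.7],
    'q3': [6.2, 6.4, 6.0],
    'ninetieth': [6.3, 6.6, 6.2],
    'actual': [6.0, 6.5, 5.4],
    'predictors': [12, 7, 4],
})

# Inclusive bounds, matching the <= comparisons in check_predictions
middle_50 = (df['actual'] >= df['q1']) & (df['actual'] <= df['q3'])
middle_80 = (df['actual'] >= df['tenth']) & (df['actual'] <= df['ninetieth'])

checked = pd.DataFrame({'name': df['name'], 'middle_50': middle_50,
                        'middle_80': middle_80, 'predictors': df['predictors']})

# Percentage of racers inside the middle 50%, the third entry returned by inside_outside
percent_inside_50 = 100.0 * checked['middle_50'].mean()
```

If the distributions were well calibrated, roughly half of the racers would land inside the middle 50%, which is what the later scoring rewards.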

In [33]:
"""
Function
--------

above_below : takes a dataframe produced by repeated calls to build_racer_*_distribution
              and determines for each racer whether or not that racer's speed was above
              (faster than) the mean of their distribution. The percentage that have
              speeds above their means is returned.

Parameters
----------

predictions : a dataframe produced by repeated calls to build_racer_*_distribution

Returns
-------

portion_above : a float telling for what percentage of racers in a race, their actual
                speed was faster than the mean of their distribution

Examples
--------
"""

def above_below(predictions):
    
    count_above = 0
    count_below = 0
    
    for i in range(len(predictions)):
        if predictions.loc[i,'actual'] >= predictions.loc[i,'mean']:
            count_above += 1
        else:
            count_below += 1
            
    portion_above = (100.0*count_above)/(count_above + count_below)
    
    return portion_above

In [34]:
"""
Function
--------

evaluate_variable_impact : pulls together build_racer_speed_distribution, check_predictions,
                           inside_outside, and above_below for one predictor and, for a
                           single event, returns the percentage of racers whose actual speeds
                           were within the middle 50% of their distributions and the
                           percentage of racers whose actual speeds were at or above the
                           mean of their distributions.

Parameters
----------

competition : a list of the form [season, event], where season is a string that codes 
              the season under consideration, and event is a string that codes the event
              under consideration.
variable : a predictor variable. Possible values are "quant_snow", "quant_weather",
           "humidity_c", "wind_c", "snow_temp_c", "air_temp_c", "quant_event", 
           "quant_year", "altitude", "total_climb", "max_climb", "height_diff", and
           "length"

Returns
-------

portion_inside : a float giving the percentage of the racers for whom the actual speed
                 was in the middle 50% of their speed distribution
portion_above : a float giving the percentage of the racers for whom the actual speed
                was faster than the mean of their speed distribution

Examples
--------
"""

def evaluate_variable_impact(competition, variable):
    
    wc_filename = 'wc_%(variable)s_similarities.pkl' %{'variable' :variable}
    ibu_filename = 'ibu_%(variable)s_similarities.pkl' %{'variable' :variable}
    
    wc_speed_similarities = pd.read_pickle(wc_filename)
    ibu_speed_similarities = pd.read_pickle(ibu_filename)
    
    season = competition[0]
    event = competition[1]
    
    predictions = []
    race_code = ':'.join(['wc', season, event])
    racer_indices = absolute_mens_speed[race_code].dropna().index.tolist()
    for index in racer_indices:
        try:
            prediction = build_racer_speed_distribution(index,season,event, 
                                                 wc_speed_similarities, ibu_speed_similarities)
            if prediction[-1] > 2:
                predictions.append(prediction)

        except Exception:
            # Skip racers for whom a distribution cannot be built
            pass
    predictions = pd.DataFrame(predictions, columns = ['name','mean', 'dev', 'tenth', 'q1', 
                                            'med', 'q3', 'ninetieth', 'actual','predictors'])
    
    checked_predictions = check_predictions(predictions)
    inside_50 = inside_outside(checked_predictions)
    
    portion_inside = inside_50[2]
    
    portion_above = above_below(predictions)
    
    return portion_inside, portion_above

In [35]:
percents_in_50 = pd.DataFrame()
percents_above_mean = pd.DataFrame()

for variable in weights:
    for i in range(len(chosen_events)):
        inside, above = evaluate_variable_impact(chosen_events[i], variable)
        
        percents_in_50.loc[variable, i] = np.rint(inside)
        percents_above_mean.loc[variable, i] = np.rint(above)


In [36]:
"""
Function
--------

variable_score : takes a dataframe produced by agglomerating the results of repeated
                 calls to evaluate_variable_impact and, for each event in question,
                 sorts them and assigns one point for the first place variable, two 
                 for the second place variable, and so on. Returns a dataframe that 
                 contains the sum of these scores for each variable.

Parameters
----------

df : a dataframe that contains either the percents in the middle 50% of the distributions
     for a collection of events modeled with each of the available variables, or the 
     percents above the mean under the same conditions

Returns
-------

df_scores : a dataframe containing the summed scores over all the events for each variable

Examples
--------
"""

def variable_score(df):
    
    df_scores = {}

    df_scores[df.index.tolist()[0]] = 1

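    score = 1
    compare = 0

    # Note: column 0 is scored in the dataframe's current row order; unlike the
    # later columns, it is not sorted first
    for i in range(1,len(df)):
        if df[0].iloc[i] == df[0].iloc[compare]: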
            df_scores[df.index.tolist()[i]] = score
        else:
            df_scores[df.index.tolist()[i]] = i+1
            score = i+1
            compare = i

    for j in range(1, df.shape[1]):
    
        df.sort_values(j,axis = 'rows', inplace = True)
        df_scores[df.index.tolist()[0]] = df_scores[df.index.tolist()[0]] + 1

        score = 1
        compare = 0

        for i in range(1,len(df)):
            if df[j].iloc[i] == df[j].iloc[compare]:
                df_scores[df.index.tolist()[i]] = df_scores[df.index.tolist()[i]] + score
            else:
                df_scores[df.index.tolist()[i]] = df_scores[df.index.tolist()[i]] + (i + 1)
                score = i + 1
                compare = i
            
    df_scores = pd.DataFrame.from_dict(df_scores, orient='index')
        
    df_scores.columns = ['score']       
            
    return df_scores
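
The tie handling in variable_score amounts to what is often called competition ranking: after sorting, tied values share the position of the first member of the tie, and the next distinct value takes its own position. As a hedged cross-check on a single column of invented percentages, pandas' `Series.rank(method='min')` should give the same per-event scores as the manual loop.

```python
import pandas as pd

# Invented percentages for one event, indexed by predictor name
col = pd.Series({'wind_c': 40.0, 'altitude': 55.0, 'length': 40.0, 'quant_snow': 62.0})

# Manual scoring in the style of variable_score: sort ascending, ties share the first position
ordered = col.sort_values()
manual = {}
score = 1
for pos, (name, value) in enumerate(ordered.items()):
    if pos > 0 and value != ordered.iloc[pos - 1]:
        score = pos + 1
    manual[name] = score

# Built-in equivalent: the minimum rank within each group of ties
ranked = col.rank(method='min').astype(int)
```

Because the sort is ascending, the lowest percentage receives a score of 1, so larger summed scores correspond to larger values in the input dataframe.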

In [37]:
"""
Function
--------

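pluck_best_variables : takes the pair of dataframes produced by concatenation of repeated calls
                       to inside_outside and above_below and calls variable_score on each.
                       It shortlists the variables whose perc_in_50 score is at least 70%
                       of the maximum, those whose distance-from-50 score (from
                       perc_above_mean) is at most 130% of the minimum, and those whose
                       combined score is at most 130% of the minimum combined score. It
                       returns the variables that appear on at least two of the three
                       shortlists; a variable on all three is listed twice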

Parameters
----------

perc_in_50 : a dataframe produced by concatenation of repeated calls to inside_outside
perc_above_mean : a dataframe produced by concatenation of repeated calls to above_below

Returns
-------

predictors_to_keep : a list of predictor variables that fared well relative to the other 
                     variables for the current aspect under consideration

Examples
--------
"""

def pluck_best_variables(perc_in_50, perc_above_mean):
    
    percent_score = variable_score(perc_in_50).sort_values('score',ascending = False)
    max_percent = percent_score.iloc[0,0]
    
    dist_from_50 = abs(perc_above_mean - 50)
    distance_score = variable_score(dist_from_50).sort_values('score')
    min_distance = distance_score.iloc[0,0]
    
    overall_score = ((variable_score(dist_from_50)-min_distance) +
                        (max_percent-variable_score(perc_in_50))).sort_values('score')
    
    predictors_to_consider = []
    
    predictors_to_consider.append(percent_score.index.tolist()[0])
    for i in range(1, len(percent_score)):
        if percent_score.iloc[i,0] >= 0.70*percent_score.iloc[0,0]:
            predictors_to_consider.append(percent_score.index.tolist()[i])

    predictors_to_consider.append(distance_score.index.tolist()[0])
    for i in range(1, len(distance_score)):
        if distance_score.iloc[i,0] <= 1.30*distance_score.iloc[0,0]:
            predictors_to_consider.append(distance_score.index.tolist()[i])

    predictors_to_consider.append(overall_score.index.tolist()[0])
    for i in range(1, len(overall_score)):
        if overall_score.iloc[i,0] <= 1.30*overall_score.iloc[0,0]:
            predictors_to_consider.append(overall_score.index.tolist()[i])
    
    predictors = {}
    
    for pred in predictors_to_consider:
        if pred in predictors:
            predictors[pred] = predictors[pred] + 1
        else:
            predictors[pred] = 1
    
    predictors_to_keep = [item for item in predictors if predictors[item] >= 2]
    predictors_to_keep.extend([item for item in predictors if predictors[item] >=3])
    
    return predictors_to_keep

Now I want to append these to the list of variables that I had before.

In [38]:
with open('best_variables.pickle', 'rb') as handle:
    best_variables = pickle.load(handle)

# best_variables

In [39]:
speed_variables = pluck_best_variables(percents_in_50, percents_above_mean)
best_variables['speed'].extend(speed_variables)

speed_variables

['wind_c', 'quant_snow', 'wind_c']
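
The repeated 'wind_c' follows from the last two lines of pluck_best_variables: a predictor shortlisted by at least two of the three criteria is kept once, and a predictor shortlisted by all three is appended a second time. A minimal sketch with a hypothetical shortlist (not the actual intermediate values from this run) reproduces the pattern:

```python
from collections import Counter

# Hypothetical shortlist: one entry per criterion that selected the predictor
predictors_to_consider = ['wind_c', 'quant_snow', 'wind_c', 'altitude', 'wind_c', 'quant_snow']

# Count how many criteria selected each predictor
votes = Counter(predictors_to_consider)

# Same rule as in pluck_best_variables: keep at >= 2 votes, append again at >= 3
predictors_to_keep = [p for p in votes if votes[p] >= 2]
predictors_to_keep.extend([p for p in votes if votes[p] >= 3])
```

With this shortlist the result is ['wind_c', 'quant_snow', 'wind_c'], so a duplicate entry marks a predictor that passed every criterion.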


## Variable Effects on Prone Accuracy


[Isolating the Variable Effects](#Isolating-the-Variable-Effects)

In [40]:
"""
Function
--------

build_racer_pa_distribution : for a given racer and a given event and season, produces
                              a list of predicted prone shooting accuracies derived by
                              sampling from the list of prior data. It then returns a 
                              list containing the racer's name, the mean of the list of
                              predictions, the standard deviation of the list, the 10th,
                              25th, 50th, 75th, and 90th percentiles, the racer's actual
                              prone shooting result, and the number of previous races for
                              which the racer had data

Parameters
----------

racer : a string containing the name of a biathlete
season : a string that codes the season under consideration. It is of the form y1y2 where
         y1 is the last two digits of the year in which the season started, and y2 is the 
         last two digits of the year in which the season ended.
event : a string that codes the event under consideration. Possible values are 'CP01', 
        'CP02', 'CP03', 'CP04', 'CP05', 'CP06', 'CP07', 'CP08', 'CP09', 'CH__', 'OG__'
wc_sim : a dataframe containing values between 0 and 1 which codes the degree of similarity 
         between world cup (wc) races in a pairwise fashion. Pairs with a similarity value
         of 1 were run under nearly identical conditions, while those with a similarity
         value of 0 were run under extremely different conditions
ibu_sim : a dataframe containing values between 0 and 1 which codes the degree of similarity 
          between world cup (wc) races and ibu cup races in a pairwise fashion. 


Returns
-------

racer_predict : a list describing the distribution of the accuracy predictions 
                produced using weighted prior data

Examples
--------
"""

def build_racer_pa_distribution(racer, season, event, wc_sim, ibu_sim):

    col_name = ':'.join(['wc',season, event])
    racer_data = absolute_mens_prone_shooting.loc[racer, :col_name]
    name = racer
    actual = float(racer_data[col_name])
    short_racer_data = racer_data[2:-1].copy()
    short_racer_data.dropna(inplace = True)
    predictors = len(short_racer_data)
    
    year = ''.join(['20',season[2:4]])
    indices = short_racer_data.index.tolist()

    race_weights = []
    for index in indices:
        try:
            split_index = index.split(':')
            if split_index[0] == 'wc':
                race_weights.append(wc_sim.loc[':'.join([season,event]),
                                               ':'.join([split_index[1],split_index[2]])])
            else:
                race_weights.append(ibu_sim.loc[':'.join([season,event]),
                                                ':'.join([split_index[1],split_index[2]])])
        except Exception:
            # If the similarity lookup fails, give this race zero weight
            race_weights.append(0.0)
    # Convert each similarity in [0, 1] to an integer replication count between 0 and 10
    race_weights_rounded = [int(round(item,1)*10) for item in race_weights]

    # Making the weighted data list
    
    racer_accuracy = []
    racer_predict = []

    for i in range(len(short_racer_data)): 
        for j in range(race_weights_rounded[i]):
            racer_accuracy.append(float(short_racer_data.iloc[i]))
                        
    for k in range(len(racer_accuracy)): 
        shooting_sample = np.random.choice(racer_accuracy, len(racer_accuracy), 
                                                       replace = True)
        pred_percent = np.mean(shooting_sample)/5
        shooting = np.random.sample(5)
        count = 0
        for m in range(5):
            if shooting[m] < pred_percent:
                count += 1
        racer_predict.append(count)

    mean = np.mean(racer_predict)
    dev = np.std(racer_predict)
    tenth = np.percentile(racer_predict, 10)
    q1 = np.percentile(racer_predict, 25)
    med = np.median(racer_predict)
    q3 = np.percentile(racer_predict, 75)
    ninetieth = np.percentile(racer_predict, 90)
    
    return [name, mean, dev, tenth, q1, med, q3, ninetieth, actual, predictors]
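
The inner loop of build_racer_pa_distribution draws five uniform random numbers and counts how many fall below `pred_percent`, which is one draw from a binomial distribution with five trials. Here is a compact sketch of the same idea on invented prior results, using NumPy's `Generator.binomial` in place of the explicit loop (a substitute for illustration, not the code used in this notebook):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# Invented weighted prior results: shots out of 5 in earlier races
racer_accuracy = np.array([4, 5, 3, 4, 4, 5, 4])

predictions = []
for _ in range(len(racer_accuracy)):
    # Bootstrap resample of the prior results, then convert the mean to a per-shot probability
    sample = rng.choice(racer_accuracy, size=len(racer_accuracy), replace=True)
    p = sample.mean() / 5
    # One simulated stage of five shots
    predictions.append(int(rng.binomial(5, p)))

# Quartiles of the simulated results
q1, q3 = np.percentile(predictions, [25, 75])
```

Since the predictions are integers between 0 and 5, the quartiles are coarse, and many racers can share the same q1 and q3.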

In [41]:
"""
Function
--------

evaluate_variable_impactPA : pulls together build_racer_pa_distribution, check_predictions,
                             inside_outside, and above_below for one predictor and, for a
                             single event, returns the percentage of racers whose actual
                             prone accuracy was within the middle 50% of their distributions
                             and the percentage of racers whose actual prone accuracy was
                             at or above the mean of their distributions.

Parameters
----------

competition : a list of the form [season, event], where season is a string that codes 
              the season under consideration, and event is a string that codes the event
              under consideration.
variable : a predictor variable. Possible values are "quant_snow", "quant_weather",
           "humidity_c", "wind_c", "snow_temp_c", "air_temp_c", "quant_event", 
           "quant_year", "altitude", "total_climb", "max_climb", "height_diff", and
           "length"

Returns
-------

portion_inside : a float giving the percentage of the racers for whom the actual prone
                 accuracy was in the middle 50% of their predicted distribution
portion_above : a float giving the percentage of the racers for whom the actual prone
                accuracy was at or above the mean of their predicted distribution

Examples
--------
"""
def evaluate_variable_impactPA(competition, variable):
    
    wc_filename = 'wc_%(variable)s_similarities.pkl' %{'variable' :variable}
    ibu_filename = 'ibu_%(variable)s_similarities.pkl' %{'variable' :variable}
    
    wc_similarities = pd.read_pickle(wc_filename)
    ibu_similarities = pd.read_pickle(ibu_filename)
    
    season = competition[0]
    event = competition[1]
    
    predictions = []
    race_code = ':'.join(['wc', season, event])
    racer_indices = absolute_mens_prone_shooting[race_code].dropna().index.tolist()
    for index in racer_indices:
        try:
            prediction = build_racer_pa_distribution(index,season,event, 
                                                        wc_similarities, ibu_similarities)
            if prediction[-1] > 2:
                predictions.append(prediction)

        except Exception:
            # Skip racers for whom a distribution cannot be built
            pass
    predictions = pd.DataFrame(predictions, columns = ['name','mean', 'dev', 'tenth', 'q1',
                                            'med', 'q3', 'ninetieth', 'actual','predictors'])
    
    checked_predictions = check_predictions(predictions)
    inside_50 = inside_outside(checked_predictions)
    
    portion_inside = inside_50[2]
    
    portion_above = above_below(predictions)
    
    return portion_inside, portion_above

In [42]:
percents_in_50 = pd.DataFrame()
percents_above_mean = pd.DataFrame()

for variable in weights:
    for i in range(len(chosen_events)):

        inside, above = evaluate_variable_impactPA(chosen_events[i], variable)
        
        percents_in_50.loc[variable, i] = np.rint(inside)
        percents_above_mean.loc[variable, i] = np.rint(above)

            



In [44]:
prone_acc_variables = pluck_best_variables(percents_in_50, percents_above_mean)

best_variables['prone_acc'].extend(prone_acc_variables)
prone_acc_variables

['quant_year', 'quant_weather']


## Variable Effects on Standing Accuracy

[Isolating the Variable Effects](#Isolating-the-Variable-Effects)



In [45]:
"""
Function
--------

build_racer_sa_distribution : for a given racer and a given event and season, produces
                              a list of predicted standing shooting accuracies derived by
                              sampling from the list of prior data. It then returns a 
                              list containing the racer's name, the mean of the list of
                              predictions, the standard deviation of the list, the 10th,
                              25th, 50th, 75th, and 90th percentiles, the racer's actual
                              standing shooting result, and the number of previous races
                              for which the racer had data

Parameters
----------

racer : a string containing the name of a biathlete
season : a string that codes the season under consideration. It is of the form y1y2 where
         y1 is the last two digits of the year in which the season started, and y2 is the 
         last two digits of the year in which the season ended.
event : a string that codes the event under consideration. Possible values are 'CP01', 
        'CP02', 'CP03', 'CP04', 'CP05', 'CP06', 'CP07', 'CP08', 'CP09', 'CH__', 'OG__'
wc_sim : a dataframe containing values between 0 and 1 which codes the degree of similarity 
         between world cup (wc) races in a pairwise fashion. Pairs with a similarity value
         of 1 were run under nearly identical conditions, while those with a similarity
         value of 0 were run under extremely different conditions
ibu_sim : a dataframe containing values between 0 and 1 which codes the degree of similarity 
          between world cup (wc) races and ibu cup races in a pairwise fashion. 


Returns
-------

racer_predict : a list describing the distribution of the accuracy predictions 
                produced using weighted prior data

Examples
--------
"""

def build_racer_sa_distribution(racer, season, event, wc_sim, ibu_sim):

    col_name = ':'.join(['wc',season, event])
    racer_data = absolute_mens_standing_shooting.loc[racer, :col_name]
    name = racer
    actual = float(racer_data[col_name])
    short_racer_data = racer_data[2:-1].copy()
    short_racer_data.dropna(inplace = True)
    predictors = len(short_racer_data)
    
    year = ''.join(['20',season[2:4]])
    indices = short_racer_data.index.tolist()

    race_weights = []
    for index in indices:
        split_index = index.split(':')
        try:
            if split_index[0] == 'wc':
                race_weights.append(wc_sim.loc[':'.join([season,event]),
                                                   ':'.join([split_index[1],split_index[2]])])
            else:
                race_weights.append(ibu_sim.loc[':'.join([season,event]),
                                                    ':'.join([split_index[1],split_index[2]])])
        except Exception:
            # If the similarity lookup fails, give this race zero weight
            race_weights.append(0.0)
    # Convert each similarity in [0, 1] to an integer replication count between 0 and 10
    race_weights_rounded = [int(round(item,1)*10) for item in race_weights]

    # Making the weighted data list
    
    racer_accuracy = []
    racer_predict = []

    for i in range(len(short_racer_data)): 
        for j in range(race_weights_rounded[i]):
            racer_accuracy.append(float(short_racer_data.iloc[i]))
                        
    for k in range(len(racer_accuracy)): 
        shooting_sample = np.random.choice(racer_accuracy, len(racer_accuracy), replace = True)
        pred_percent = np.mean(shooting_sample)/5
        shooting = np.random.sample(5)
        count = 0
        for m in range(5):
            if shooting[m] < pred_percent:
                count += 1
        racer_predict.append(count)

    mean = np.mean(racer_predict)
    dev = np.std(racer_predict)
    tenth = np.percentile(racer_predict, 10)
    q1 = np.percentile(racer_predict, 25)
    med = np.median(racer_predict)
    q3 = np.percentile(racer_predict, 75)
    ninetieth = np.percentile(racer_predict, 90)
    
    return [name, mean, dev, tenth, q1, med, q3, ninetieth, actual, predictors]

In [46]:
"""
Function
--------

evaluate_variable_impactSA : pulls together build_racer_sa_distribution, check_predictions,
                             inside_outside, and above_below for one predictor and, for a
                             single event, returns the percentage of racers whose actual
                             standing accuracy was within the middle 50% of their
                             distributions and the percentage of racers whose actual
                             standing accuracy was at or above the mean of their
                             distributions.

Parameters
----------

competition : a list of the form [season, event], where season is a string that codes 
              the season under consideration, and event is a string that codes the event
              under consideration.
variable : a predictor variable. Possible values are "quant_snow", "quant_weather",
           "humidity_c", "wind_c", "snow_temp_c", "air_temp_c", "quant_event", 
           "quant_year", "altitude", "total_climb", "max_climb", "height_diff", and
           "length"

Returns
-------

portion_inside : a float giving the percentage of the racers for whom the actual standing
                 accuracy was in the middle 50% of their predicted distribution
portion_above : a float giving the percentage of the racers for whom the actual standing
                accuracy was at or above the mean of their predicted distribution

Examples
--------
"""
def evaluate_variable_impactSA(competition, variable):
    
    wc_filename = 'wc_%(variable)s_similarities.pkl' %{'variable' :variable}
    ibu_filename = 'ibu_%(variable)s_similarities.pkl' %{'variable' :variable}
    
    wc_speed_similarities = pd.read_pickle(wc_filename)
    ibu_speed_similarities = pd.read_pickle(ibu_filename)
    
    season = competition[0]
    event = competition[1]


    predictions = []
    race_code = ':'.join(['wc',season, event])
    racer_indices = absolute_mens_standing_shooting[race_code].dropna().index.tolist()
    for index in racer_indices:
        try:
            prediction = build_racer_sa_distribution(index,season,event, 
                                              wc_speed_similarities, ibu_speed_similarities)
            if prediction[-1] > 2:
                predictions.append(prediction)

        except Exception:
            # Skip racers for whom a distribution cannot be built
            pass
    predictions = pd.DataFrame(predictions, columns = ['name','mean', 'dev', 'tenth', 'q1',
                                            'med', 'q3', 'ninetieth', 'actual','predictors'])
    
    checked_predictions = check_predictions(predictions)
    inside_50 = inside_outside(checked_predictions)
    
    portion_inside = inside_50[2]
    
    portion_above = above_below(predictions)
    
    return portion_inside, portion_above

In [47]:

percents_in_50 = pd.DataFrame()
percents_above_mean = pd.DataFrame()

for variable in weights:
    for i in range(len(chosen_events)):

        inside, above = evaluate_variable_impactSA(chosen_events[i], variable)
        
        percents_in_50.loc[variable, i] = np.rint(inside)
        percents_above_mean.loc[variable, i] = np.rint(above)

            

In [48]:
standing_acc_variables = pluck_best_variables(percents_in_50, percents_above_mean)

best_variables['standing_acc'].extend(standing_acc_variables)
standing_acc_variables

['altitude', 'wind_c', 'altitude']


## Variable Effects on Prone Range

For the prone range time, what I want to do is:
1. randomly sample the weighted racer data
2. fit a linear regression of range time on shooting result to the sample
3. take the mean of the sampled shooting data to predict the number of missed shots
4. use the fitted intercept and coefficient to calculate the predicted range time.
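
The four steps can be sketched on invented data as follows. The missed-shot counts and range times are hypothetical, `np.polyfit` stands in for any least-squares line fit, and the predicted count is drawn as a binomial sample from the mean of the resampled shooting data. Treat it as an outline of the idea rather than the function defined in the next cell.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded only for repeatability

# Hypothetical weighted prior data: missed shots (out of 5) and range time in seconds
misses = np.array([0, 1, 0, 2, 1, 0, 3, 1])
times = np.array([50.0, 76.0, 52.0, 101.0, 74.0, 49.0, 128.0, 77.0])

predicted_times = []
for _ in range(100):
    # 1. Resample races with replacement, keeping each (misses, time) pair together
    idx = rng.integers(0, len(misses), size=len(misses))
    x, y = misses[idx], times[idx]
    if x.max() == x.min():
        continue  # a line cannot be fitted when every resampled x is identical

    # 2. Least-squares line: time = intercept + slope * misses
    slope, intercept = np.polyfit(x, y, 1)

    # 3. Mean of the resampled shooting data gives a per-shot probability
    count = rng.binomial(5, x.mean() / 5)

    # 4. Predicted range time for the simulated count
    predicted_times.append(intercept + slope * count)

# Spread of the predicted range times
tenth, ninetieth = np.percentile(predicted_times, [10, 90])
```

Resampling indices rather than values keeps each shooting result paired with the range time from the same race, which the regression depends on.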

[Isolating the Variable Effects](#Isolating-the-Variable-Effects)



In [49]:
"""
Function
--------

build_racer_pr_distribution : for a given racer and a given event and season, produces
                              a list of predicted prone shooting accuracies and associated
                              range times derived by using linear regression and 
                              sampling from the list of prior data. It then returns a 
                              list containing the racer's name, the mean of the list of
                              predicted range times, the standard deviation of the list,
                              the 10th, 25th, 50th, 75th, and 90th percentiles, the racer's
                              actual prone range time, and the number of previous races for
                              which the racer had data

Parameters
----------

racer : a string containing the name of a biathlete
season : a string that codes the season under consideration. It is of the form y1y2 where
         y1 is the last two digits of the year in which the season started, and y2 is the 
         last two digits of the year in which the season ended.
event : a string that codes the event under consideration. Possible values are 'CP01', 
        'CP02', 'CP03', 'CP04', 'CP05', 'CP06', 'CP07', 'CP08', 'CP09', 'CH__', 'OG__'
wc_sim : a dataframe containing values between 0 and 1 which codes the degree of similarity 
         between world cup (wc) races in a pairwise fashion. Pairs with a similarity value
         of 1 were run under nearly identical conditions, while those with a similarity
         value of 0 were run under extremely different conditions
ibu_sim : a dataframe containing values between 0 and 1 which codes the degree of similarity 
          between world cup (wc) races and ibu cup races in a pairwise fashion. 


Returns
-------

name : the name of the racer
mean : the mean of the distribution of predicted range times
dev : the standard deviation of the distribution of predicted range times
tenth : the 10th percentile of the distribution of predicted range times
q1 : the 25th percentile of the distribution of predicted range times
med : the median of the distribution of predicted range times
q3 : the 75th percentile of the distribution of predicted range times
ninetieth : the 90th percentile of the distribution of predicted range times
actual : the racer's actual prone range time
predictors : the number of prior sprint races in which the racer has participated

Examples
--------
"""

def build_racer_pr_distribution(racer, season, event, wc_sim, ibu_sim):

    col_name = ':'.join(['wc',season, event])
    racer_time_data = absolute_mens_prone_range.loc[racer, :col_name]
    racer_shot_data = absolute_mens_prone_shooting.loc[racer, :col_name]
    name = racer
    actual = float(racer_time_data[col_name])
    short_racer_time_data = racer_time_data[2:-1].copy()
    short_racer_shot_data = racer_shot_data[2:-1].copy()
    short_racer_time_data.dropna(inplace = True)
    short_racer_shot_data.dropna(inplace = True)
    predictors = len(short_racer_shot_data)
    
    year = ''.join(['20',season[2:4]])
    indices = short_racer_shot_data.index.tolist()

    race_weights = []
    for index in indices:
        split_index = index.split(':')
        if split_index[0] == 'wc':
            race_weights.append(wc_sim.loc[':'.join([season,event]),
                                           ':'.join([split_index[1],split_index[2]])])
        else:
            race_weights.append(ibu_sim.loc[':'.join([season,event]),
                                            ':'.join([split_index[1],split_index[2]])])
    race_weights_rounded = [int(round(item,1)*10) for item in race_weights]

    race_weights_rounded

    # Making the weighted data list
    
    racer_shot = []
    racer_time = []
    racer_predict = []

    # Repeat each prior race in proportion to its similarity weight
    for i in range(len(short_racer_shot_data)): 
        for j in range(race_weights_rounded[i]):
            racer_shot.append(float(short_racer_shot_data.iloc[i]))
            racer_time.append(float(short_racer_time_data.iloc[i]))
                        
    # Bootstrap: resample the weighted races 100 times, refitting the regression each time
    for k in range(100): 
        index_sample = np.random.choice(range(len(racer_shot)), len(racer_shot),
                                                replace = True)
        racer_shot_sample = [racer_shot[i] for i in index_sample]
        racer_time_sample = [racer_time[i] for i in index_sample]
        
        # Mean shooting value per shot (out of 5) in the resampled races
        accuracy = np.mean(racer_shot_sample)/5
        
        linreg = LinearRegression()
        linreg.fit(np.array(racer_shot_sample).reshape(-1,1), 
                   np.array(racer_time_sample).reshape(-1,1))
        loop = linreg.coef_
        shot_time = linreg.intercept_
        
        # Simulate five shots and count the draws that fall below accuracy
        shooting = np.random.sample(5)
        count = 0
        for m in range(5):
            if shooting[m] < accuracy:
                count += 1
        range_time = shot_time + count*loop
        racer_predict.append(range_time)
                                        
    mean = np.mean(racer_predict)
    dev = np.std(racer_predict)
    tenth = np.percentile(racer_predict, 10)
    q1 = np.percentile(racer_predict, 25)
    med = np.median(racer_predict)
    q3 = np.percentile(racer_predict, 75)
    ninetieth = np.percentile(racer_predict, 90)
    
    return [name,mean, dev, tenth, q1, med, q3, ninetieth, actual,predictors]
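
The weighting and resampling idea inside `build_racer_pr_distribution` is easier to see in isolation. Below is a minimal sketch on made-up numbers: the shooting values, range times, and similarity weights are all hypothetical, `np.repeat` stands in for the nested replication loops, and `np.polyfit` is swapped in for `LinearRegression` so that the cell needs only NumPy. It is an illustration of the technique, not a replacement for the function above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical prior races for one racer: shooting value (out of 5) and range time (seconds)
shots = np.array([5.0, 4.0, 3.0, 5.0, 4.0])
times = np.array([50.0, 52.0, 55.0, 49.5, 53.0])
# Hypothetical similarity of each prior race to the race being predicted
similarities = [0.94, 0.31, 0.0, 0.58, 1.0]

# Similarity -> integer replication count, using the same rounding as the function above
weights = [int(round(s, 1) * 10) for s in similarities]
weighted_shots = np.repeat(shots, weights)
weighted_times = np.repeat(times, weights)

predicted = []
for _ in range(100):
    # Bootstrap sample (with replacement) from the weighted races
    idx = rng.integers(0, len(weighted_shots), size=len(weighted_shots))
    sample_shots, sample_times = weighted_shots[idx], weighted_times[idx]

    accuracy = sample_shots.mean() / 5
    # Straight-line fit of range time against shooting value
    slope, intercept = np.polyfit(sample_shots, sample_times, 1)

    # Simulate five shots and turn the count into a predicted range time
    count = int((rng.random(5) < accuracy).sum())
    predicted.append(intercept + count * slope)

predicted = np.array(predicted)
summary = np.percentile(predicted, [10, 25, 50, 75, 90])
```

A race with similarity 0 is repeated zero times and so drops out of the sample entirely, while a race with similarity 1 appears ten times.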

In [50]:
"""
Function
--------

evaluate_variable_impactPR : pulls together build_racer_pr_distribution, check_predictions,
                             inside_outside, and above_below for every racer in a given
                             event, using the similarity weights for a given variable. 
                             Returns the percentage of racers whose actual times were
                             within the middle 50% of their distributions and the percentage
                             of racers whose actual times were above the mean of their
                             distributions.

Parameters
----------

competition : a list of the form [season, event], where season is a string that codes 
              the season under consideration, and event is a string that codes the event
              under consideration.
variable : a predictor variable. Possible values are "quant_snow", "quant_weather",
           "humidity_c", "wind_c", "snow_temp_c", "air_temp_c", "quant_event", 
           "quant_year", "altitude", "total_climb", "max_climb", "height_diff", and
           "length"

Returns
-------

portion_inside : a float giving the percentage of the racers for whom the actual prone
                 range time was in the middle 50% of their predicted distribution
portion_above : a float giving the percentage of the racers for whom the actual prone
                range time was above the mean of their predicted distribution

Examples
--------
"""



def evaluate_variable_impactPR(competition, variable):
    
    wc_filename = 'wc_%(variable)s_similarities.pkl' %{'variable' :variable}
    ibu_filename = 'ibu_%(variable)s_similarities.pkl' %{'variable' :variable}
    
    wc_speed_similarities = pd.read_pickle(wc_filename)
    ibu_speed_similarities = pd.read_pickle(ibu_filename)
    
    season = competition[0]
    event = competition[1]
    
    predictions = []
    race_code = ':'.join(['wc',season, event])
    racer_indices = absolute_mens_prone_range[race_code].dropna().index.tolist()
    for index in racer_indices:
        try:
            prediction = build_racer_pr_distribution(index,season,event, 
                                                 wc_speed_similarities, ibu_speed_similarities)
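            # Keep only racers with more than two prior races to learn from
            if prediction[-1] > 2:
                predictions.append(prediction)

        except Exception:
            # Skip racers for whom no distribution could be built
            pass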
    predictions = pd.DataFrame(predictions, columns = ['name','mean', 'dev', 'tenth', 'q1',
                                            'med', 'q3', 'ninetieth', 'actual','predictors'])
    
    checked_predictions = check_predictions(predictions)
    inside_50 = inside_outside(checked_predictions)
    
    portion_inside = inside_50[2]
    
    portion_above = above_below(predictions)
    
    return portion_inside, portion_above
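
The helpers `check_predictions`, `inside_outside`, and `above_below` are defined earlier in the notebook. As a rough illustration of what the two returned percentages measure, here is a sketch that computes them directly from a small, made-up predictions table; the racer names and numbers are hypothetical, and the real helpers may differ in detail.

```python
import pandas as pd

# Hypothetical predictions for four racers (times in seconds)
predictions = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'mean': [50.0, 52.0, 48.0, 55.0],
    'q1': [48.0, 50.0, 46.5, 53.0],
    'q3': [52.0, 54.0, 49.5, 57.0],
    'actual': [51.0, 55.5, 47.0, 54.0],
})

# Actual time between the 25th and 75th percentiles of the predicted distribution
inside = predictions['actual'].between(predictions['q1'], predictions['q3'])
# Actual time above the mean of the predicted distribution
above = predictions['actual'] > predictions['mean']

portion_inside = 100 * inside.mean()
portion_above = 100 * above.mean()
```

If the predicted distributions were well calibrated, we would expect both percentages to sit near 50, which is why they are useful for comparing the predictor variables.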

In [51]:

percents_in_50 = pd.DataFrame()
percents_above_mean = pd.DataFrame()

for variable in weights:
    for i in range(len(chosen_events)):
        
         
        inside, above = evaluate_variable_impactPR(chosen_events[i], variable)
        
        percents_in_50.loc[variable, i] = np.rint(inside)
        percents_above_mean.loc[variable, i] = np.rint(above)
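
The loop above grows the two result tables one cell at a time. An alternative is to collect one row per variable and build each table in a single step. The sketch below assumes a hypothetical stand-in, `fake_evaluate`, in place of `evaluate_variable_impactPR`, and made-up variable and event lists, so that it can run without the pickled similarity files.

```python
import numpy as np
import pandas as pd

def fake_evaluate(event, variable):
    """Hypothetical stand-in that returns (portion_inside, portion_above)."""
    # Seed from the inputs so that the made-up numbers are reproducible
    seed = sum(ord(c) for c in variable + ''.join(event))
    rng = np.random.default_rng(seed)
    return rng.uniform(0, 100), rng.uniform(0, 100)

variables = ['altitude', 'wind_c']
events = [['1718', 'CP01'], ['1718', 'CP02'], ['1819', 'CP03']]

inside_rows, above_rows = {}, {}
for variable in variables:
    results = [fake_evaluate(event, variable) for event in events]
    inside_rows[variable] = [np.rint(inside) for inside, _ in results]
    above_rows[variable] = [np.rint(above) for _, above in results]

# One row per variable, one column per event (columns numbered 0, 1, 2, ...)
percents_in_50 = pd.DataFrame.from_dict(inside_rows, orient='index')
percents_above_mean = pd.DataFrame.from_dict(above_rows, orient='index')
```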

            

In [52]:
prone_range_variables = pluck_best_variables(percents_in_50, percents_above_mean)

best_variables['prone_range'].extend(prone_range_variables)
prone_range_variables

['altitude', 'quant_event', 'quant_event']


## Variable Effects on Standing Range

[Isolating the Variable Effects](#Isolating-the-Variable-Effects)


In [53]:
"""
Function
--------

build_racer_sr_distribution : for a given racer and a given event and season, produces
                              a list of predicted standing shooting accuracies and 
                              associated range times derived by using linear regression
                              and sampling from the list of prior data. It then returns a 
                              list containing the racer's name, the mean of the list of
                              range times, the standard deviation of the list, the 10th,
                              25th, 50th, 75th, and 90th percentiles, the racer's actual
                              range time, and the number of previous races for which the
                              racer had data



Parameters
----------

racer : a string containing the name of a biathlete
season : a string that codes the season under consideration. It is of the form y1y2 where
         y1 is the last two digits of the year in which the season started, and y2 is the 
         last two digits of the year in which the season ended.
event : a string that codes the event under consideration. Possible values are 'CP01', 
        'CP02', 'CP03', 'CP04', 'CP05', 'CP06', 'CP07', 'CP08', 'CP09', 'CH__', 'OG__'
wc_sim : a dataframe containing values between 0 and 1 which codes the degree of similarity 
         between world cup (wc) races in a pairwise fashion. Pairs with a similarity value
         of 1 were run under nearly identical conditions, while those with a similarity
         value of 0 were run under extremely different conditions
ibu_sim : a dataframe containing values between 0 and 1 which codes the degree of similarity 
          between world cup (wc) races and ibu cup races in a pairwise fashion. 


Returns
-------

name : the name of the racer
mean : the mean of the distribution of predicted range times
dev : the standard deviation of the distribution of predicted range times
tenth : the 10th percentile of the distribution of predicted range times
q1 : the 25th percentile of the distribution of predicted range times
med : the median of the distribution of predicted range times
q3 : the 75th percentile of the distribution of predicted range times
ninetieth : the 90th percentile of the distribution of predicted range times
actual : the racer's actual standing range time
predictors : the number of prior sprint races in which the racer has participated

Examples
--------
"""

def build_racer_sr_distribution(racer, season, event, wc_sim, ibu_sim):

    col_name = ':'.join(['wc',season, event])
    racer_time_data = absolute_mens_standing_range.loc[racer, :col_name]
    racer_shot_data = absolute_mens_standing_shooting.loc[racer, :col_name]
    name = racer
    actual = float(racer_time_data[col_name])
    short_racer_time_data = racer_time_data[2:-1].copy()
    short_racer_shot_data = racer_shot_data[2:-1].copy()
    short_racer_time_data.dropna(inplace = True)
    short_racer_shot_data.dropna(inplace = True)
    predictors = len(short_racer_shot_data)
    
    year = ''.join(['20',season[2:4]])
    indices = short_racer_shot_data.index.tolist()

    race_weights = []
    for index in indices:
        split_index = index.split(':')
        try:
            if split_index[0] == 'wc':
                race_weights.append(wc_sim.loc[':'.join([season,event]),
                                               ':'.join([split_index[1],split_index[2]])])
            else:
                race_weights.append(ibu_sim.loc[':'.join([season,event]),
                                                ':'.join([split_index[1],split_index[2]])])
        except:
            race_weights.append(0.0)
    # Convert each similarity (0 to 1) into an integer replication count (0 to 10)
    race_weights_rounded = [int(round(item,1)*10) for item in race_weights]

    # Making the weighted data list
    
    racer_shot = []
    racer_time = []
    racer_predict = []

    # Repeat each prior race in proportion to its similarity weight
    for i in range(len(short_racer_shot_data)): 
        for j in range(race_weights_rounded[i]):
            racer_shot.append(float(short_racer_shot_data.iloc[i]))
            racer_time.append(float(short_racer_time_data.iloc[i]))
                                    
    # Bootstrap: resample the weighted races 100 times, refitting the regression each time
    for k in range(100): 
        index_sample = np.random.choice(range(len(racer_shot)), len(racer_shot), 
                                        replace = True)
        racer_shot_sample = [racer_shot[i] for i in index_sample]
        racer_time_sample = [racer_time[i] for i in index_sample]
        
        # Mean shooting value per shot (out of 5) in the resampled races
        accuracy = np.mean(racer_shot_sample)/5
        
        linreg = LinearRegression()
        linreg.fit(np.array(racer_shot_sample).reshape(-1,1), 
                               np.array(racer_time_sample).reshape(-1,1))
        loop = linreg.coef_
        shot_time = linreg.intercept_
        
        # Simulate five shots and count the draws that fall below accuracy
        shooting = np.random.sample(5)
        count = 0
        for m in range(5):
            if shooting[m] < accuracy:
                count += 1
        range_time = shot_time + count*loop
        racer_predict.append(range_time)
                                        
    mean = np.mean(racer_predict)
    dev = np.std(racer_predict)
    tenth = np.percentile(racer_predict, 10)
    q1 = np.percentile(racer_predict, 25)
    med = np.median(racer_predict)
    q3 = np.percentile(racer_predict, 75)
    ninetieth = np.percentile(racer_predict, 90)
    
    return [name,mean, dev, tenth, q1, med, q3, ninetieth, actual,predictors]
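
The inner loop that draws five uniform numbers and counts those below `accuracy` is a draw from a binomial distribution with five trials and success probability `accuracy`. The sketch below checks this on a made-up accuracy of 0.8 by comparing the loop-style count with `Generator.binomial`; the accuracy and the number of draws are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
accuracy = 0.8      # hypothetical per-shot probability
n_draws = 20000

# Loop-style simulation: five uniform draws per bout, count those below accuracy
loop_counts = (rng.random((n_draws, 5)) < accuracy).sum(axis=1)

# The same quantity drawn directly from a binomial distribution
binomial_counts = rng.binomial(5, accuracy, size=n_draws)

loop_mean = loop_counts.mean()
binomial_mean = binomial_counts.mean()
```

Both means should land close to 5 × 0.8 = 4, so a single binomial draw could replace the five-shot loop without changing the distribution of `count`.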


In [54]:
"""
Function
--------

evaluate_variable_impactSR : pulls together build_racer_sr_distribution, check_predictions,
                             inside_outside, and above_below for all predictors for a given
                             event returns the percentage of racers whose actual times were
                             within the middle 50% of their distributions and the percentage
                             of racers whose actual times were above the mean of their
                             distributions.

Parameters
----------

competition : a list of the form [season, event], where season is a string that codes 
              the season under consideration, and event is a string that codes the event
              under consideration.
variable : a predictor variable. Possible values are "quant_snow", "quant_weather",
           "humidity_c", "wind_c", "snow_temp_c", "air_temp_c", "quant_event", 
           "quant_year", "altitude", "total_climb", "max_climb", "height_diff", and
           "length"

Returns
-------

portion_inside : a float giving the percentage of the racers for whom the actual speed
                 was in the middle 50% of their speed distribution
portion_above : a float giving the percentage of the racers for whom the actual speed
                was faster than the mean of their speed distribution

Examples
--------
"""


def evaluate_variable_impactSR(competition, variable):
    
    wc_filename = 'wc_%(variable)s_similarities.pkl' %{'variable' :variable}
    ibu_filename = 'ibu_%(variable)s_similarities.pkl' %{'variable' :variable}
    
    wc_speed_similarities = pd.read_pickle(wc_filename)
    ibu_speed_similarities = pd.read_pickle(ibu_filename)
    
    season = competition[0]
    event = competition[1]
    
    predictions = []
    race_code = ':'.join(['wc', season, event])
    racer_indices = absolute_mens_standing_range[race_code].dropna().index.tolist()
    for index in racer_indices:
        try:
            prediction = build_racer_sr_distribution(index, season,event, 
                                                 wc_speed_similarities, ibu_speed_similarities)
            if prediction[-1] > 2:
                predictions.append(prediction)

        except:
            pass
    predictions = pd.DataFrame(predictions, columns = ['name','mean', 'dev', 'tenth', 'q1',
                                           'med', 'q3', 'ninetieth', 'actual','predictors'])
    
    checked_predictions = check_predictions(predictions)
    inside_50 = inside_outside(checked_predictions)
    
    portion_inside = inside_50[2]
    
    portion_above = above_below(predictions)
    
    return portion_inside, portion_above

In [55]:
percents_in_50 = pd.DataFrame()
percents_above_mean = pd.DataFrame()

for variable in weights:
    for i in range(len(chosen_events)):
         
        inside, above = evaluate_variable_impactSR(chosen_events[i], variable)
        
        percents_in_50.loc[variable, i] = np.rint(inside)
        percents_above_mean.loc[variable, i] = np.rint(above)

            

In [56]:
standing_range_variables = pluck_best_variables(percents_in_50, percents_above_mean)

best_variables['standing_range'].extend(standing_range_variables)
standing_range_variables

['quant_weather', 'quant_weather']

<a name = "pulling_together"></a>
# Pulling it all together

At this point, we have a dictionary whose keys are the various pieces of the biathlon sprint time (speed, prone accuracy, standing accuracy, prone range time, standing range time, prone loop time, and standing loop time), and whose value for each key is a list of the predictor variables that did well (relatively speaking) either in terms of their predictive power (this notebook) or in terms of their degree of correlation (the previous notebook). These variables will form the basis of the models that we will build in the next notebooks, but before they can be used, some housekeeping is needed. First, we use a translation dictionary to make the code that we use for each variable consistent. Second, we combine the variables for prone range time and prone loop time, as well as those for standing range time and standing loop time. Finally, we use <a href = "#count_frequency">```count_frequency```</a> (which simply counts the number of times each value appears in a list) to count the entries in the list associated with each piece, and we keep only those variables that appear at least twice. (That is to say, we keep only those variables that have both strong predictive power and relatively high correlation values.) The resulting dictionary is then stored as a pickle file so that later notebooks can load it.

Previous Section: [Isolating Variable Effects](#isolating_variable_effects)

<a href = "#toc">Table of Contents</a>
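
The three housekeeping steps described above can be sketched on a toy dictionary. The entries below are made up for illustration, and `collections.Counter` is used here in place of `count_frequency`; the cells that follow perform the same steps on the real `best_variables`.

```python
from collections import Counter

# Hypothetical entries: display-style names mixed with code-style names
best = {
    'prone_range': ['Quant Weather', 'quant_weather', 'altitude'],
    'prone_loop': ['Quant Snow', 'quant_snow'],
    'speed': ['Wind', 'quant_snow'],
}
translations = {'Quant Weather': 'quant_weather', 'Quant Snow': 'quant_snow',
                'Wind': 'wind_c'}

# Step 1: make the variable codes consistent
best = {key: [translations.get(name, name) for name in names]
        for key, names in best.items()}

# Step 2: fold the loop variables into the matching range entry
best['prone_range'] = best['prone_range'] + best.pop('prone_loop')

# Step 3: keep only the variables that appear at least twice
kept = {key: [name for name, n in Counter(names).items() if n >= 2]
        for key, names in best.items()}
```

In this toy case `speed` ends up empty, because neither of its variables was selected twice.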

In [57]:
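# Copy each list as well, so that later in-place edits to best_variables leave this backup untouched
best_variables1 = {key: list(values) for key, values in best_variables.items()}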

In [58]:
translations = {'Quant Weather' : 'quant_weather', 'Quant Snow' : 'quant_snow', 
                'Snow Temp' : 'snow_temp_c', 'Air Temp' : 'air_temp_c', 
                'Quant Event' : 'quant_event', 'Quant Year' : 'quant_year', 'Wind' : 'wind_c',
                'Max Climb' : 'max_climb'}

In [59]:
for i in ['prone_acc', 'prone_loop','prone_range','speed','standing_acc','standing_loop','standing_range']:
    variables = best_variables[i]
    for j in range(len(variables)):
        if variables[j] in translations:
            variables[j] = translations[variables[j]]
    


In [60]:
# combining range and loop variables

best_variables['prone_range'].extend(best_variables['prone_loop'])
best_variables['standing_range'].extend(best_variables['standing_loop'])

if 'prone_loop' in best_variables:
    del best_variables['prone_loop']
    
if 'standing_loop' in best_variables:
    del best_variables['standing_loop']

In [61]:
"""
Function
--------

count_frequency : takes a list containing (possibly) repeating values and returns a 
                  dictionary that maps each distinct value in that list to its 
                  frequency

Parameters
----------

data_list : a list with (possibly) repeating values 

Returns
-------

frequencies : a dictionary whose keys are the distinct values in data_list and whose
              values are the number of times each of them occurred

Examples
--------
"""

def count_frequency(data_list):
    
    frequencies = {}
    for item in data_list:
        # Unseen values start at zero; each occurrence adds one
        frequencies[item] = frequencies.get(item, 0) + 1
            
    return frequencies
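
As a quick usage example, here is `count_frequency` applied to the prone range result obtained earlier, followed by the same "at least twice" filter that is used in the next cell. The function is repeated in compact form only so that the snippet runs on its own.

```python
def count_frequency(data_list):
    # Compact copy of the function above, repeated so this snippet is self-contained
    frequencies = {}
    for item in data_list:
        frequencies[item] = frequencies.get(item, 0) + 1
    return frequencies

prone_range_variables = ['altitude', 'quant_event', 'quant_event']

frequencies = count_frequency(prone_range_variables)
var_to_keep = [item for item in frequencies if frequencies[item] >= 2]
```

Here `frequencies` is `{'altitude': 1, 'quant_event': 2}`, so only `quant_event` survives the filter.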

In [62]:
for i in ['prone_acc','prone_range','speed','standing_range','standing_acc']:
    frequencies = count_frequency(best_variables[i])
    var_to_keep = [item for item in frequencies if frequencies[item]>=2]
    best_variables[i] = var_to_keep
    
best_variables

{'prone_acc': ['quant_weather'],
 'prone_range': ['quant_weather', 'quant_snow', 'quant_event'],
 'speed': ['wind_c', 'quant_snow'],
 'standing_acc': ['altitude', 'wind_c'],
 'standing_range': ['quant_weather', 'quant_snow']}

In [63]:
# Pickle the dictionary of the best variables

with open('best_variables1.pickle', 'wb') as handle:
    pickle.dump(best_variables, handle, protocol=pickle.HIGHEST_PROTOCOL)
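
A later notebook can read the dictionary back with `pickle.load`. The sketch below shows the round trip on a small example dictionary written to a temporary directory; the file name `best_variables_example.pickle` and the dictionary contents are placeholders, chosen so that the real `best_variables1.pickle` is not overwritten.

```python
import os
import pickle
import tempfile

# Placeholder dictionary standing in for best_variables
example = {'prone_acc': ['quant_weather'], 'speed': ['wind_c', 'quant_snow']}

# Write to a temporary location rather than the working directory
path = os.path.join(tempfile.mkdtemp(), 'best_variables_example.pickle')
with open(path, 'wb') as handle:
    pickle.dump(example, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Reading it back: note the 'rb' mode, which mirrors the 'wb' mode used when writing
with open(path, 'rb') as handle:
    restored = pickle.load(handle)
```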

