
# Table of Contents

[Introduction](#Introduction)

[Collecting the Race Data](#Collecting-the-race-data)

[Collecting the IBU Cup Sprint Data](#Collecting-IBU-Cup-sprint-data)

[Deleting Bad Racer Data](#Deleting-bad-racer-data)

[Collecting the Course Data](#Collecting-the-course-data)

[Collecting the Weather Data](#Collecting-the-weather-data)

[Collecting IBU Cup course and weather data](#Collecting-IBU-Cup-course-and-weather-data)

[Adjustments to course and weather data](#Adjustments-to-course-and-weather-data)

[Exploring Relationships](#Exploring-Relationships)

[Effect of Conditions on Speed](#Effects-of-conditions-on-speed)

[Evaluating Impacts on Speed](#Evaluating-Impacts-on-Speed)

[Effect of Conditions on Shooting Accuracy](#Effects-of-conditions-on-shooting-accuracy)

[Evaluating Impacts on Shooting Accuracy](#Evaluating-Impacts-on-Accuracy)

[Effect of Conditions on Range and Penalty Times](#Effects-of-conditions-on-range-and-penalty-times)

[Evaluating Impacts on Range Times](#Evaluating-Impacts-on-Range-Times)

[Evaluating Impacts on Penalty Loop Times](#Evaluating-Impacts-on-Penalty-Loop-Times)

[Handing off to the next notebook](#Handing-off-to-the-next-notebook)



# Introduction

The biathlon is a winter individual sport that consists of alternating stints of cross country skiing and target shooting. There are four major event types (not including relays) which exist in both men's and women's versions: sprint, pursuit, individual, and mass start. Of these, the sprint is the most frequently run, and typically has the largest number of competitors. (Pursuit and mass start both limit the number of competitors in a race in order to avoid overcrowding at the shooting range, while individual is run only a few times per season.)

The men's sprint race consists of a roughly 3300m ski loop (due to the fact that these events take place outside, there is some variation in this distance) followed by 5 shots in a prone position, then another 3300m ski loop followed by 5 shots in a standing position, and finally another 3300m ski loop before the finish line. Each missed shot requires the biathlete to ski a 150m penalty lap before continuing with the course. For the sprint, racers leave the start gate at 30 second intervals, which means that they are not in direct competition with each other. 

There are two levels of international biathlon competition, the higher of the two is the World Cup circuit, the other is the IBU (International Biathlon Union) Cup circuit. Most biathletes start their careers on the IBU Cup circuit before the most successful of them move up to compete on the World Cup circut. In this project, I am be interested in making predictions about outcomes of World Cup events, in part because the athletes who don't move up to the World Cup circuit tend not to compete for more than a few seasons, however, I will be collecting data from IBU Cup races as well in order to have data about each racer from the beginning of their career.

In this project, I hope to explore the effects of different conditions (snow, temperature, amount of climb) on the results of men's sprint races, as well as to use the results of earlier races to make some predictions about the outcomes of later races. 

[Next Section: Collecting the Race Data](#Collecting-the-race-data)

[Table of Contents](#Table-of-Contents)

In [5]:
# First a cell to prepare the notebook.
# special IPython command to prepare the notebook for matplotlib
%matplotlib inline 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# And the additional modules that I've used

import pickle
from PyPDF2 import PdfFileReader
from tabula import read_pdf
import urllib
import random
import scipy.stats as stats
import re


from sklearn.linear_model import LinearRegression
sns.set(color_codes=True)

#import requests
#from pattern import web
#from fnmatch import fnmatch
#import math
#import fnmatch
#import os
#import sklearn

#from sklearn.model_selection import train_test_split
#from sklearn.ensemble import RandomForestRegressor
#from sklearn.datasets import make_regression
#from sklearn.ensemble import AdaBoostRegressor
#from sklearn.tree import DecisionTreeRegressor

#from sklearn.linear_model import Lasso
#from sklearn.linear_model import LassoCV
#from sklearn.linear_model import Ridge
#from sklearn.preprocessing import PolynomialFeatures

#from sklearn.model_selection import GridSearchCV
#from sklearn import linear_model
#from sklearn.ensemble import RandomForestClassifier
#from sklearn.datasets import make_classification
#from sklearn.linear_model import ElasticNet

#import joblib

#import matplotlib as mpl

<a id="CollectingData"></a>

# Collecting the race data

The first step, obviously, was to collect the data. As it turned out, this was more difficult than I had envisioned, because the results of the races were stored in pdf files which needed to be downloaded and read using [tabula-py](https://blog.chezo.uno/tabula-py-extract-table-from-pdf-into-python-dataframe-6c7acfa5f302). To further complicate things, the [International Biathlon Union's results website](https://biathlonresults.com) buries access to the pdf files in several layers of java code, which makes downloading the source code useless. Eventually I was able to discover that the pdf files of the competition results could be accessed directly using urls that looked like this:

https://ibu.blob.core.windows.net/docs/1617/BT/SWRL/CP04/SMSP/BT_C77B_1.0.pdf

or more generally, like:

https://ibu.blob.core.windows.net/season/BT/SWRL/event/race/BT_C77B_1.0.pdf

In order to retrieve these pdf files and store the data that I wanted from them in a form that was both accessible and useful for me, I used the following group of functions:
1. ```get_analysis_file```: This function takes the season, event, and type of a given World Cup race, and finds the url of the pdf file containing the competition analysis data. It then retrieves the pdf and stores it on the hard drive as biathlon.pdf.
2. ```flatten_row```: This function takes a row from the tabula-py reading of a biathlon.pdf file and returns it as a list of values.
3. ```competition_analysis_splitSP```: This function takes a the dataframe produced by reading a single page of a competition analysis pdf using tabula-py, along with the year and event in which the race was held. It then parses it as follows:
    - First, eliminate any racers found at the end of the page who were disqualified, abandonned the race partway through, or did not start. Since these racers are listed at the very end of the file, any subsequent pages of the pdf can be disregarded.
    - Find all rows in which the word "Cumulative" occurs. These serve as anchor rows that allow us to locate all the other needed information.
    - Get the names of the competitors on the given page. The rows containing this information can be found directly above the rows with "Cumulative" in them. These will make up the first column of the dataframe.
    - Get the shooting times and missed shots. The rows containing this information are found two rows after the anchor rows. Add these columns to the dataframe.
    - Get the times for the individual course loops. The rows containing this information are found four rows after the anchor rows. Add these columns to the dataframe.
    - Get the range times. The rows containing this information are found three rows after the anchor rows. Add these columns to the dataframe.
    - For seasons beginning with 2011-2012, get the penalty times. The rows containing this information are found five rows after the anchor rows. Add these columns to the dataframe. (Note that prior to the 2011-2012 season, the penalty times are included in the range times.)
4. ```competition_analysis_mergeSP```: This function uses tabula-py to read in each page of a biathlon.pdf file, sends the resulting page to ```competition_analysis_splitSP```, and then concatenates the resulting dataframes into a single large dataframe.
5. ```pickle_competition_analysisSP```: This function takes a takes a year, event, and race type and  calls ```get_analysis_file``` to download the relevant competition analysis file to the hard drive. It then calls ```competition_analysis_merge``` to extract the desired data and place it into a dataframe. Finally, it stores the resulting dataframe as a pickle file.
6. ```pickle_biathlon_seasonSP```: This function loops through all possible events for a given year and attempts to run ```pickle_competition_analysisSP``` on each. A try and except structure handles cases where there is not actually a race of the desired type (for this project, a men's sprint) held during the given event.

I then looped through all the seasons for which I could obtain the requisite World Cup data (there are no [competition data summary](https://ibu.blob.core.windows.net/docs/0910/BT/SWRL/CP01/SMSP/BT_C82_1.0.pdf) files available before the 2004-2005 season) using ```pickle_biathlon_season``` and stored the dataframes produced as pickle files on the hard drive. Unfortunately, using tabula-py effectively requires specifying the area of the table to be read, and, while in theory there should be a consistant layout for each season, in practice each host site seems to have the ability to design their own layouts. As a result, some of the competition analysis files were not correctly read. 

In order to read these files, I used a slightly altered version of ```competition_analysis_mergeSP``` together with ```pickle_competition_analysisSP``` to obtain the desired data for those files that were previously problematic.

Finally, because the Turin (2006) and Sochi (2014) ```Olympic Games``` used a completely different form of url for their competition analysis files, I used ```pickle_competition_analysisSP``` on manually obtained pdf files in order to get and store the necessary data.

Previous Section: [Introduction](#Introduction)

Next Section: [Collecting the IBU Cup Sprint Data](#Collecting-IBU-Cup-sprint-data)



[Table of Contents](#Table-of-Contents)


In [2]:
"""
Function
--------

get_analysis_file : takes the season, event, and type of a given World 
                    Cup race, and finds the url of the pdf file containing the 
                    competition analysis data, then retrieves it and stores
                    it as a pdf file named biathlon.pdf

Parameters
----------

year: the years of the biathlon season, in form y1y2 where y1 is the last 2 digits
      of the first year and y2 is the last two digits of the second year
event: a four character code specifying the event. Possibilities are “CP01”, “CP02”,
       . . ., “CP09”, “OG__”,”CH__”
race: a four character code specifying the type of race. Possibilities are SWIN, SWSP,
      SWPU, SWMS, SMIN, SMSP, SMPU, SMMS

Returns
-------

Stores a .pdf file on the hard drive as biathlon.pdf. 

Example
-------

"""

def get_analysis_file(year,event,race):
    codes = {"SWIN": "A", "SMIN": "A", "SWSP": "B", "SMSP": "B", "SWPU": "D", "SMPU": "D",
             "SWMS": "D", "SMMS": "D"}
    L = codes[race]
    
    if year in ["0102", "0203","0304"]:
        if race in ["SWMS", "SMMS"]:
            L = "D"
        url = (
        "https://ibu.blob.core.windows.net/docs/%(y)s/BT/SWRL/%(e)s/%(r)s/BT_O77%(L)s_1.0.pdf"
               %{"e": event, "r":race, "y":year, "L":L} )       
    
    else:
        url = (
        "https://ibu.blob.core.windows.net/docs/%(y)s/BT/SWRL/%(e)s/%(r)s/BT_C77%(L)s_1.0.pdf"
               %{"e": event, "r":race, "y":year, "L":L})
        
    urllib.urlretrieve(url,"biathlon.pdf")
    


In [3]:
"""
Function
--------

flatten_row : Takes a row from the tabula-py reading of a biathlon pdf file and returns it as
              a list of values

Parameters
----------

df : a dataframe produced by a tabula-py reading of a biathlon pdf file
index : the index of the row to be flattened

Returns
-------

flattened_row : the flattened version of the row

Examples
--------
"""

def flatten_row(df, index):
    
    row = df.iloc[index,:].tolist()
    
    flat_row = []

    for i in range(len(row)):
        flat_row.append(unicode(row[i]).split())

    flat_row = [item for sublist in flat_row for item in sublist]
    
    flattened = []

    for i in range(len(flat_row)): 
        if len(unicode(flat_row[i]).split('+')) > 1:
            dumb = unicode(flat_row[i]).split('+')
            dumber = []
            dumber.append(dumb[0])
            for i in range(1, len(dumb)):
                dumber.append("+".join(["", dumb[i]]))
            flattened.append(dumber)
        else:
            flattened.append(flat_row[i])
        
    flattened_row = []

    for item in flattened:
        if type(item) is list:
            for i in range(len(item)):
                flattened_row.append(item[i])
        else:
            flattened_row.append(item)

    flattened_row = [item for item in flattened_row if item != '' and item != 'X']
        
    return flattened_row

In [4]:
"""
Function
--------

competition_analysis_splitSP : takes a the dataframe produced by reading a single page 
                               of a competition analysis pdf using tabula-py, along with
                               the year and event in which the race was held and, for each
                               racer, extracts name, shooting times, missed shots, course
                               time, range time, and, where relevant, penalty time. It then 
                               returns them as a dataframe.

Parameters
----------

race: the data from a single page of a pdf file containing a competition analysis page
      for an individual race
year : the season code for the race

Returns
-------

df : a dataframe containing a subset of the data from the file
continuer : a boolean indicating whether or not the next page of the pdf file should be
            read or not (assuming that it exists)

Example
-------

"""

def competition_analysis_splitSP(race, year):
    
    df = pd.DataFrame()
    continuer = True
    
    # First drop any all NaN columns
    
    race.dropna(axis = 1, how = "all", inplace = True)
    
    # And then replace any remaining NaNs with X
    race.fillna("X", inplace = True)
    
    # Step 0 is to get rid of any bad data (non starters, non finishers, disqualifications)
    # at the end of the page
    
    data_end = -1
    for i in range(len(race)):
        if "Jury" in unicode(race.iloc[i,0]):
            data_end = i
            continuer = False
            break
        elif "Did" in unicode(race.iloc[i,0]):
            data_end = i
            continuer = False
            break
        elif "Disqualification" in unicode(race.iloc[i,0]):
            data_end = i
            continuer = False
            break
        elif "Lapped" in unicode(race.iloc[i,0]):
            data_end = i
            continuer = False
            break
    
    if data_end != -1:
        race.drop(race.index[data_end:], inplace = True)
    
    # Step 1 is to find the data. Because of the way that the data on the page is 
    # structured, one can use the locations of the rows containing the expression 
    # "Cumulative Time" as a means of locating all the needed rows.
    
    rows = []
    for i in range(len(race)):
        if "Cumulative" in race.iloc[i,0]:
            rows.append(i)

    # Now, I know that the names are in the rows just above the rows that have "Cumulative 
    # Time" in them
    names = []
    for row in rows:
            
        flat_name_row = flatten_row(race, row-1)
        names.append(" ".join(flat_name_row[2:len(flat_name_row)-5]))
        
    df = pd.DataFrame(names, columns = ['Name'])
    
    # Also, I know that the shooting information is three rows after the rows with "Cumulative
    # Time" I want to collect both the time spent shooting and the number of misses from this
    # row and create two arrays from it
    
    shooting_times = []
    shooting_misses = []
    for row in rows:
                
        flat_shooting_row = flatten_row(race, row+2)
        
        shot_row = [flat_shooting_row[1], flat_shooting_row[5], flat_shooting_row[9]]
        shooting_misses.append(shot_row)
        shot_row = [flat_shooting_row[2], flat_shooting_row[6], flat_shooting_row[10]]
        shooting_times.append(shot_row)
        
    shooting_times = pd.DataFrame(shooting_times, columns = ['sh1','sh2','tot sh'])
    shooting_misses = pd.DataFrame(shooting_misses, columns = ['P1','S1','T'])
    
    df = pd.concat([df, shooting_times, shooting_misses], axis = 1)
        
    # Next, I want to get the course time for each loop, which is in the 5th row after the
    # "Cumulative Time" rows
    
    loop_times = []
    for row in rows:
        flat_loop_row = flatten_row(race, row + 4)
        loop_time = [flat_loop_row[2], flat_loop_row[5], flat_loop_row[8], flat_loop_row[11]]
        loop_times.append(loop_time)
        
    loop_times = pd.DataFrame(loop_times, columns = ["L1","L2",'L3','Total Ski'])
    
    df = pd.concat([df, loop_times], axis = 1)
    
    # For the sake of completeness, I want to get the range times
    
    range_times = []
    for row in rows:
        flat_range_row = flatten_row(race, row + 3)
        range_time = [flat_range_row[2], flat_range_row[5], flat_range_row[8]]
        range_times.append(range_time)
        
    range_times = pd.DataFrame(range_times, columns = ['r1','r2','tot r'])
    
    df = pd.concat([df, range_times], axis = 1)
    
    
    if year in ["1112","1213","1314","1415","1516","1617",'1718']:
        penalty_times = []
        for row in rows:
            flat_penalty_row = flatten_row(race, row + 5)
            penalty_time = [flat_penalty_row[2], flat_penalty_row[3], flat_penalty_row[4]]
            penalty_times.append(penalty_time)
        
        penalty_times = pd.DataFrame(penalty_times, columns = ['pen1', 'pen2', 'tot pen'])
    
        df = pd.concat([df, penalty_times], axis = 1)

    if year in ["1112","1213","1314","1415","1516","1617",'1718']:
        cols = ['Name', 'sh1','sh2', 'P1','S1', "L1","L2",'L3','r1','r2', 'pen1', 'pen2', 
                'tot sh', 'T','Total Ski','tot r','tot pen']    
    else:
         cols = ['Name', 'sh1','sh2', 'P1','S1', "L1","L2",'L3', 'r1','r2', 'tot sh', 'T', 
                 'Total Ski','tot r']
    df = df.reindex_axis([cols], axis=1)
        
    return df, continuer

In [5]:
"""
Function
--------

competition_analysis_mergeSP : Uses tabula-py to read in each page of a biathlon.pdf file,
                               sends the resulting page to competition_analysis_splitSP, 
                               and then concatenates the resulting dataframes into a single
                               large dataframe.

Parameters
----------

file_name: a competition analysis file for a biathlon race
year: the four digit code for a biathlon season

Returns
-------

results: a dataframe which contains the combined data from all of the pages of the pdf file

Note
----

There are two sets of areas given for many of the years here. This is due to the fact that, 
although in theory the layout of the pdf pages should be consistant across any given season,
in practice this is not true, as the individual host sites seem to produce their own pdf 
files for each race. As a result, there are a handful of races for which it will be necessary
to run a competition_analysis_mergeSP a second time with a slightly different choice of areas.

Example
-------

"""

def competition_analysis_mergeSP(file_name, year):
    
        # each season has a different part of the page to find the data
    areas = {"0405": [(204, 37, 706, 559), (204, 37, 706, 559)], 
         # "0405": [(145, 19, 777, 575), (145, 19, 777, 575)], # special cases
         "0506": [(202, 37, 701, 559), (202, 37, 701, 559)], 
         # "0506": [(159, 20, 737, 575),(159, 20, 737, 575)], # special cases
         "0607": [(203, 37, 716, 559), (203, 37, 716, 559)],
         "0708": [(192, 37, 692, 559), (192, 37, 692, 559)],
         "0809": [(192, 37, 692, 559), (192, 37, 692, 559)], 
         #"0809": [(140, 28, 722, 584), (140, 28, 722, 584)],# special cases
         "0910": [(192, 37, 692, 559), (192, 37, 692, 559)], 
         #"0910": [(140, 28, 722, 584), (140, 28, 722, 584)],# special cases
         "1011": [(162, 27, 727, 572), (18, 26, 814, 568)],
         "1112": [(190, 27, 686, 568), (20, 27, 815, 586)],
         "1213": [(194, 26, 691, 571), (20, 26, 667, 571)], 
        # "1213": [(194, 26, 691, 571), (20, 26, 744, 571)], # special cases   
        # "1213": [(194, 26, 691, 571), (194, 26, 691, 571)],# special cases
         "1314": [(193, 26, 689, 571), (20, 26, 662, 571)], 
         # special cases"1314": [(136, 20, 717, 576),(136, 20, 717, 576)],# special cases
         "1415": [(137, 26, 704, 568), (20, 26, 662, 568)],
         "1516": [(137, 26, 705, 568), (20, 26, 663, 568)],
         #"1617": [(151, 20, 702, 574), (20, 20, 728, 574)], # special cases
         "1617": [(151, 20, 727, 576), (20, 20, 759, 576)], 
            "1718": [(150, 19, 748, 576), (20, 20, 759, 576)]}

    # Figure out how many pages are in the document
    
    with open(file_name,'r') as f:
        doc = PdfFileReader(f)
        pages = doc.getNumPages()
        
    # Get the data from the first page of the pdf
        
    race = read_pdf(file_name, pages = 1, area = areas[year][0], guess = False,
                    multiple_tables = False)
        
    try:
        data = competition_analysis_splitSP(race, year)
        results = data[0]
        to_continue = data[1]
    except:
        if year in ["1112","1213","1314","1415","1516","1617",'1718']:
            results = pd.DataFrame(columns = ['Name', 'sh1','sh2', 'P1','S1', "L1","L2",'L3',
                                              'r1','r2','pen1', 'pen2', 'tot sh', 'T',
                                              'Total Ski','tot r','tot pen'])
        else:
            results = pd.DataFrame(columns = ['Name', 'sh1','sh2', 'P1','S1', "L1","L2",'L3',
                                              'r1','r2', 'tot sh', 'T', 'Total Ski','tot r'])
        to_continue = True
    
    # Get the data from the remaining pages of the pdf 
    
    for i in range(1,pages):
        
        if to_continue is True:
        
            race = read_pdf(file_name, pages = i + 1, area = areas[year][1], guess = False,
                            multiple_tables = False)
            try:
                race == None
            except:
                try:
                    data = competition_analysis_splitSP(race, year)
                    result = data[0]
                    to_continue = data[1]
                    results = pd.concat([results, result], axis = 0)
                except:
                    pass
    results.reset_index(inplace = True)
    results.drop('index', axis = 1, inplace = True)


    
    return results

Note that the layout of the competition analysis pdf changes from season to season, and so the areas that are being read also need to change from season to season. Furthermore, due to minor changes in the layout of some of the competition analysis files, even within a single season, the code for competition_analysis_mergeSP contains commented out area values that can be used in cases where a single race is unable to be read correctly.


In [6]:
"""
Function
--------

pickle_competition_analysisSP : takes the url of a biathlon data web page, determines whether
                                the page is actually a pdf file or not, and it if it, extracts
                                the relevant data, puts it into a data frame, and then pickles
                                the resulting data.

Parameters
----------

year : the years of the biathlon season, in form y1y2 where y1 is the last 2 digits of the
        first year and y2 is the last two digits of the second year
event : a four character code specifying the event. Possibilities are “CP01”, “CP02”, . . .,
        “CP09”, “OG__”,”CH__”
race : a four character code specifying the race. Possibilities are SWIN, SWSP, SWPU, SWMS,
       SMIN, SMSP, SMPU, SMMS

Returns
-------

stores a pickled dataframe object on the hard drive

Examples
--------
"""

def pickle_competition_analysisSP(year, event, racecode):
    
    # get the biathlon pdf
    
    filename = "companal_%(race)s_%(year)s_%(event)s.pkl" %{"race": racecode, "year": year,
                                                            "event": event}
    get_analysis_file(year,event,racecode) #stored in biathlon.pdf
    
    # check whether you have a valid biathlon data file
    with open("biathlon.pdf","r") as source:
        begin = source.read(100)
        if "BlobNotFound" not in begin:
            data = competition_analysis_mergeSP("biathlon.pdf",year)
            data.to_pickle(filename)
            
            return data

In [7]:
"""
Function
--------

pickle_biathlon_seasonSP : loops through all sprint races in a given biathlon season, applies 
                            pickle_competition_analysisSP to each, and pickles the resulting
                            dataframes

Parameters
----------

season : the code of the biathlon season, in form y1y2 where y1 is the last 2 digits of the
        first year and y2 is the last two digits of the second year

Returns
-------

stores a pickled data frame object on the hard drive for each sprint event in the season.

Examples
--------
"""



def pickle_biathlon_seasonSP(season):
    
    failed_races = []

    races = ["SMSP"]
    events = ['CP01','CP02','CP03','CP04','CP05','CP06','CP07','CP08','CP09','CH__','OG__']

    for race in races:
        for event in events:
            try:
                pickle_competition_analysisSP(season, event, race)
            except (ValueError, AttributeError, IndexError, AssertionError, TypeError):
                failed_races.append([season, event, race])
    
    return failed_races

I then ran ```pickle_biathlon_seasonSP``` over each season from 2004-2005 to 2017-2018. Prior to the 2004-2005 season, I had access to performance data for the biathletes, but not to data on the race and weather conditions which I suspected might be impacting speeds and shooting accuracies. 

The code looked something like this:

In [8]:
# This has beginning and end prints for each season because, at least on my elderly mac, 
# it takes a couple of hours and it's nice to have some sense of where it is

failed_races = []

seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617','1718']

for season in seasons:
    print season
    failures = pickle_biathlon_seasonSP(season)
    failed_races.append(failures)


0405




0506
0607
0708
0809
0910
1011
1112
1213
1314
1415
1516
1617
1718


In [9]:
# Extending the list of failed races by a list of races for which the resulting dataframe
# was much shorter than expected

failed_races1 = [item for sublist in failed_races for item in sublist]
failed_races1.extend([['0405','CP07','SMSP'],['0809', 'CP07', 'SMSP'],['0809','CH__','SMSP'],
                      ['0910','OG__','SMSP'],['1213','CP02','SMSP']])


In [10]:
"""
Function
--------

competition_analysis_mergeSP

Parameters
----------

file_name: a competition analysis file for a biathlon race
year: the four digit code for a biathlon season

Returns
-------

results: a dataframe which contains the combined data from all of the pages of the pdf file

Note
----

Here I am using the alternative areas in order to obtain the sprint data for those races for
which the layout of the pdf file was somewhat different than the standard layout for the year

Example
-------

"""

def competition_analysis_mergeSP(file_name, year):
    
        # each season has a different part of the page to find the data
    areas = {#"0405": [(204, 37, 706, 559), (204, 37, 706, 559)], 
         "0405": [(145, 19, 777, 575), (145, 19, 777, 575)], # special cases
         #"0506": [(202, 37, 701, 559), (202, 37, 701, 559)], 
         "0506": [(159, 20, 737, 575),(159, 20, 737, 575)], # special cases
         "0607": [(203, 37, 716, 559), (203, 37, 716, 559)],
         "0708": [(192, 37, 692, 559), (192, 37, 692, 559)],
         #"0809": [(192, 37, 692, 559), (192, 37, 692, 559)], 
         "0809": [(140, 28, 722, 584), (140, 28, 722, 584)],# special cases
         #"0910": [(192, 37, 692, 559), (192, 37, 692, 559)], 
         "0910": [(140, 28, 722, 584), (140, 28, 722, 584)],# special cases
         "1011": [(162, 27, 727, 572), (18, 26, 814, 568)],
         "1112": [(190, 27, 686, 568), (20, 27, 815, 586)],
         #"1213": [(194, 26, 691, 571), (20, 26, 667, 571)], 
         "1213": [(194, 26, 691, 571), (20, 26, 744, 571)], # special cases   
        # "1213": [(194, 26, 691, 571), (194, 26, 691, 571)],# special cases
         #"1314": [(193, 26, 689, 571), (20, 26, 662, 571)], 
        "1314":[(143, 21, 727, 575),(143, 21, 727, 575)], # special cases
         "1415": [(137, 26, 704, 568), (20, 26, 662, 568)],
         "1516": [(137, 26, 705, 568), (20, 26, 663, 568)],
         "1617": [(151, 20, 702, 574), (20, 20, 728, 574)], # special cases
         #"1617": [(151, 20, 727, 576), (20, 20, 759, 576)] 
            "1718": [(150, 19, 748, 576), (20, 20, 759, 576)]}

    # Figure out how many pages are in the document
    
    with open(file_name,'r') as f:
        doc = PdfFileReader(f)
        pages = doc.getNumPages()
        
    # Get the data from the first page of the pdf
        
    race = read_pdf(file_name, pages = 1, area = areas[year][0], guess = False,
                    multiple_tables = False, encoding = 'utf8')
        
    try:
        data = competition_analysis_splitSP(race, year)
        results = data[0]
        to_continue = data[1]
    except:
        if year in ["1112","1213","1314","1415","1516","1617",'1718']:
            results = pd.DataFrame(columns = ['Name', 'sh1','sh2', 'P1','S1', "L1","L2",'L3',
                                              'r1','r2', 'pen1', 'pen2', 'tot sh', 'T',
                                              'Total Ski','tot r','tot pen'])
        else:
            results = pd.DataFrame(columns = ['Name', 'sh1','sh2', 'P1','S1', "L1","L2",'L3', 
                                              'r1','r2', 'tot sh', 'T', 'Total Ski','tot r'])
        to_continue = True
    
    # Get the data from the remaining pages of the pdf 
    
    for i in range(1,pages):
        
        if to_continue is True:
        
            race = read_pdf(file_name, pages = i + 1, area = areas[year][1], guess = False,
                            multiple_tables = False, encoding = 'utf8')
            try:
                race == None
            except:
                try:
                    data = competition_analysis_splitSP(race, year)
                    result = data[0]
                    to_continue = data[1]
                    results = pd.concat([results, result], axis = 0)
                except:
                    pass
    results.reset_index(inplace = True)
    results.drop('index', axis = 1, inplace = True)


    
    return results

In [11]:
still_failed = []
for i in range(len(failed_races1)):
    try:
        pickle_competition_analysisSP(failed_races1[i][0], failed_races1[i][1],
                                      failed_races1[i][2])
    except:
        still_failed.append(failed_races1[i])
        
still_failed

[]

In [12]:
url = "https://ibu.blob.core.windows.net/docs/1314/BT/SWRL/OG__/SMSP/BT_C77B_1.1.pdf"

urllib.urlretrieve(url,"biathlon.pdf")

results_sochi = competition_analysis_mergeSP('biathlon.pdf', '1314')



In [13]:
url = 'https://ibu.blob.core.windows.net/docs/0506/BT/SWRL/OG__/SMSP/BT_C77B_1.1.pdf'

urllib.urlretrieve(url, 'biathlon.pdf')

results_torino = competition_analysis_mergeSP('biathlon.pdf', '0506')



In [14]:
results_sochi.to_pickle('companal_SMSP_1314_OG__.pkl')
results_torino.to_pickle('companal_SMSP_0506_OG__.pkl')


# Collecting IBU Cup sprint data

In fact, the [International Biathlon Union](https://biathlonworld.com)(IBU) runs two separate cup seasons, in addition to various youth and junior competitions. The first is the World Cup, which has the top hundred or so racers. The races above are the men's sprint competions from these races. There is another cup though, the IBU cup. Although not all biathletes in the IBU cup will ever make the transition to the World Cup circuit, most World Cup racers have spent at least some time competing in the IBU cup circuit. As a result, their results on the IBU cup circuit might be useful for informing predictions about their results on the World Cup circuit. To that end, we will also collect the mens sprint data for the IBU cup races from 2004-2005 until the present season. The functions that we use here are analogous to those used for collecting the World Cup data, and so I will list them without further description. 

1. ```find_comp_url```: This function, together with ```get_comp_file```, is analogous to ```get_analysis_file```
2. ```get_comp_file```: This function, together with ```find_comp_url```, is analogous to ```get_analysis_file```
3. ```ibu_competition_analysis_splitSP```: This function is analogus to ```competition_analysis_splitSP```
4. ```ibu_competition_analysis_mergeSP```: This function is analogous to ```competition_analysis_mergeSP```
5. ```ibu_pickle_competition_analysisSP```: This function is analogous to ```pickle_competition_analysisSP```
6. ```ibu_pickle_biathlon_seasonSP```: This function is analogous to ```pickle_biathlon_seasonSP```

The IBU Cup races presented a few problems that were not encountered with the World Cup races.
1. Some of the IBU Cup events each year run not one but two men's sprint races. It is therefore necessary to ensure that the results of both races are taken into consideration.
2. Some of the IBU Cup files have shooting times included. Some of the IBU Cup files do not have shooting times included. Unfortunately, this does not seem to be a season by season thing but rather a race by race thing, due perhaps to the equipment set up at a given course. That means that in getting the shot data, it will be necessary to somehow distinguish between these two cases on the fly, rather than as a season by season thing.
3. Finally, there is a problem that is exclusive to the 2017-2018 season. The files for the 2017-2018 races have odd spacing for the IBU Cup races, which means that those racers who are slowest at the first shooting end up with an improperly parsed missed standing shots value. Here, the simplest fix seems to be to simply get rid of those rows for which this is an issue. To do this, we have the function
    - ```dump_bad_rows```: This function takes a ibu competition analysis dataframe and checks the column ```'S1'``` for entries that cannot be a number of missed shots (that is to say, values that do not belong to the set {0, 1, 2, 3, 4, 5} and eliminates those rows from the dataframe.

Previous Section: [Collecting the Race Data](#Collecting-the-race-data)


Next Section: [Deleting Bad Racer Data](#Deleting-bad-racer-data)


[Table of Contents](#Table-of-Contents)

In [15]:
"""
Function
--------

find_comp_url : Finds and return the url for the pdf of the competition analysis file 
                associated to a race in the ibu cup series

Parameters
----------

year : the season in which the race took place, given as a four digit string of the form y1y2,
        where y1 is the last two digits of the year in which the season started and y2 is the
        last two digits of the year in which the season ended.
event : a four character code indicating the event within a season with the possibilities CP01,
        CP02, CP03, CP04, CP05, CP06, CP07, CP08 (regular events), and CH__ (championships)
race : a four character code indicating the race within an event which has possible values
        SMSP, SMPU, SMIN, SWSP, SWPU, SWIN

Returns
-------

comp_url : the url at which the competition analysis pdf can be found

Example
-------

"""

# competition analysis urls

def find_comp_url(year,event,race):

    if (year in ['0405','0506','0607','0708','0809','0910','1011','1112']):
        cup = "SCEU"
    else:
        if event in ['CH__']:
            cup = 'SCEU'
        else:
            cup = "SIBU"
        
    comp_url = ("https://ibu.blob.core.windows.net/docs/%(y)s/BT/%(c)s/%(e)s/%(r)s/BT_C77%(L)s_1.0.pdf"
                %{"y" : year, "c" : cup, "e" : event, "r" : race, "L" : "B"})
    
    return comp_url

In [16]:
"""
Function
--------

get_comp_file : retrieves the competition analysis file from the web and stores it as
                a pdf file at "ibu_cup.pdf"

Parameters
----------

year : the years of the biathlon season, in form y1y2 where y1 is the last 2 digits of the
        first year and y2 is the last two digits of the second year
event : a four character code specifying the event. Possibilities are “CP01”, “CP02”, . . .,
        “CP09”, “OG__”,”CH__”
race : a four character code specifying the race. Possibilities are SWIN, SWSP, SWPU, SMIN,
        SMSP, SMPU

Returns
-------

Stores a .pdf file at ibu_cup.pdf. 

Example
-------

"""

def get_comp_file(year,event,race):
    url = find_comp_url(year, event, race)
        
    urllib.urlretrieve(url,"ibu_cup.pdf")
    

In [17]:
# Competition Analysis Areas, ibu cup races. Note that the first entry for each season
# is the data area on the first page of the pdf, while the second entry is the data area
# for all subsequent pages

competition_area = {'0405' : [(203, 37, 704, 560), (203, 37, 704, 560)] ,
                    '0506' : [(203, 37, 704, 560), (203, 37, 704, 560)],
                    '0607' : [(203, 37, 704, 560), (203, 37, 704, 560)],
                    '0708' : [(197, 38, 736, 559), (197, 38, 736, 559)],
                    '0809' : [(192, 38, 695, 558), (192, 38, 695, 558)],
                    '0910' : [(192, 38, 695, 558), (192, 38, 695, 558)],
                    '1011' : [(161, 26, 736, 569), (19, 26, 672, 569)],
                    '1112' : [(191, 26, 688, 572), (20, 26, 815, 572)],
                    '1213' : [(193, 26, 694, 572), (20, 26, 784, 572)],
                    '1314' : [(194, 26, 696, 572), (20, 26, 716, 572)],
                    '1415' : [(136, 26, 710, 572), (20, 26, 666, 572)],
                    '1516' : [(136, 26, 695, 572), (20, 26, 662, 572)],
                    '1617' : [(151, 20, 706, 576), (20, 20, 718, 576)],
                    '1718' : [(151, 20, 706, 576), (20, 20, 718, 576)]}

In [18]:
"""
Function
--------

ibu_competition_analysis_splitSP : takes a data frame produced by using read_pdf from the 
                                   tabula package to read a single page of a read in 
                                   competition analysis file, cleans it up, and returns the
                                   desired subset of the data in a format that is 
                                   straightforward to use.

Parameters
----------

race : the data from a single page of a pdf file containing a competition analysis page
year : the season code for the race

Returns
-------

df : a dataframe containing a subset of the data from the file

Example
-------

"""

def ibu_competition_analysis_splitSP(race, year):
    
    df = pd.DataFrame()
    continuer = True
    
    # First drop any all NaN columns
    
    race.dropna(axis = 1, how = "all", inplace = True)
    
    # And then replace any remaining NaNs with X
    race.fillna("X", inplace = True)
    
    # Step 0 is to get rid of any bad data (non starters, non finishers, disqualifications)
    # at the end of the page
    
    data_end = -1
    for i in range(len(race)):
        if "Jury" in unicode(race.iloc[i,0]):
            data_end = i
            continuer = False
            break
        elif "Did" in unicode(race.iloc[i,0]):
            data_end = i
            continuer = False
            break
        elif "Disqualification" in unicode(race.iloc[i,0]):
            data_end = i
            continuer = False
            break
        elif "Lapped" in unicode(race.iloc[i,0]):
            data_end = i
            continuer = False
            break
    
    if data_end != -1:
        race.drop(race.index[data_end:], inplace = True)
    
    # Step 1 is to find the data. Because of the way that the data on the page is structured,
    # one can use the locations of the rows containing the expression "Cumulative Time" as a
    # means of locating all the needed rows.
    
    rows = []
    for i in range(len(race)):
        if "Cumulative" in race.iloc[i,0]:
            rows.append(i)

    cols = ['Name', 'P1','S1','T', 'sh1','sh2','tot sh', "L1","L2",'L3','Total Ski',
            'r1','r2','tot r']
    cols_long = ['Name', 'P1','S1','T', 'sh1','sh2','tot sh', "L1","L2",'L3','Total Ski',
                    'r1','r2','tot r', 'pen1', 'pen2', 'tot pen' ]
    if year in ['1415','1516','1617','1718']:
        df = pd.DataFrame(columns = cols_long)
    else:
        df = pd.DataFrame(columns = cols)
    
    for row in rows:
        try:
            racer = []
            # Get the racers name
            flat_name_row = flatten_row(race, row-1)
            racer.append(" ".join(flat_name_row[2:len(flat_name_row)-5]))
            
            # Get the shot info
            flat_shooting_row = flatten_row(race, row+2)
            if len(flat_shooting_row)<6:
                shot_row = [flat_shooting_row[1],flat_shooting_row[2],flat_shooting_row[3]]
                racer.extend(shot_row)
                shot_row = ['X','X','X']
                racer.extend(shot_row)
            else:
                shot_row = [flat_shooting_row[1], flat_shooting_row[5], flat_shooting_row[9]]
                racer.extend(shot_row)
                shot_row = [flat_shooting_row[2], flat_shooting_row[6], flat_shooting_row[10]]
                racer.extend(shot_row)
            
            # Get the loop times
            flat_loop_row = flatten_row(race, row + 4)
            loop_time = [flat_loop_row[2], flat_loop_row[5], flat_loop_row[8], 
                                                                         flat_loop_row[11]]
            racer.extend(loop_time)
            
            # Get the range times
            flat_range_row = flatten_row(race, row + 3)
            range_time = [flat_range_row[2], flat_range_row[5], flat_range_row[8]]
            racer.extend(range_time)
            
            # If relevant, get the penalty times
            if year in ['1415','1516','1617','1718']:
                flat_penalty_row = flatten_row(race, row+5)
                penalty_time = [flat_penalty_row[2], flat_penalty_row[3], flat_penalty_row[4]]
                racer.extend(penalty_time)
            
            # Attach the column names
            if year in ['1415','1516','1617','1718']:
                racer = pd.DataFrame([racer], columns = cols_long)
                df = pd.concat([df,racer])
            else:
                racer = pd.DataFrame([racer], columns = cols)
                df = pd.concat([df, racer])
        except:
            pass
    if year in ['1415','1516','1617','1718']:
        new_cols = ['Name', 'sh1','sh2', 'P1','S1', "L1","L2",'L3','r1','r2', 'pen1', 'pen2', 
                'tot sh', 'T','Total Ski','tot r','tot pen'] 
    else:
        new_cols = ['Name', 'sh1','sh2', 'P1','S1', "L1","L2",'L3',
           'r1','r2', 'tot sh', 'T','Total Ski','tot r']
    df = df.reindex_axis([new_cols], axis=1)
    
    return df, continuer

In [19]:
"""
Function
--------

ibu_competition_analysis_mergeSP : takes a pdf file containing the competition analysis data
                                    for a given race and returns the desired subset of the 
                                    data in the form of a single dataframe

Parameters
----------

file_name: a competition analysis pdf file from a biathlon race
year : the four digit string indicating the season in which the competition was held

Returns
-------

results: a dataframe containing the combined data from all of the pages of the pdf file

Example
-------

"""

def ibu_competition_analysis_mergeSP(file_name, year):
    
        # each season has a different part of the page to find the data
    areas = {'0405' : [(203, 37, 704, 560), (203, 37, 704, 560)] ,
            '0506' : [(203, 37, 704, 560), (203, 37, 704, 560)],
            '0607' : [(203, 37, 704, 560), (203, 37, 704, 560)],
            '0708' : [(197, 38, 736, 559), (197, 38, 736, 559)],
            '0809' : [(192, 38, 695, 558), (192, 38, 695, 558)],
            '0910' : [(192, 38, 695, 558), (192, 38, 695, 558)],
            '1011' : [(161, 26, 736, 569), (19, 26, 784, 569)],
            '1112' : [(191, 26, 694, 573), (20, 26, 815, 573)],
            '1213' : [(193, 26, 694, 572), (20, 26, 784, 572)],
            '1314' : [(194, 26, 696, 572), (20, 26, 716, 572)],
            '1415' : [(136, 26, 710, 572), (20, 26, 666, 572)],
            '1516' : [(136, 26, 695, 572), (20, 26, 662, 572)],
            '1617' : [(151, 20, 706, 576), (20, 20, 718, 576)],
            '1718' : [(151, 20, 706, 576), (20, 20, 718, 576)]}

    # Figure out how many pages are in the document
    
    with open(file_name,'r') as f:
        doc = PdfFileReader(f)
        pages = doc.getNumPages()
        
    
    # Get the data from the first page of the pdf
    
    
    race = read_pdf(file_name, pages = 1, area = areas[year][0], guess = False, 
                    multiple_tables = False)
    
    data = ibu_competition_analysis_splitSP(race, year)
    results = data[0]
    to_continue = data[1]
    
    # Get the data from the remaining pages of the pdf 
    
    for i in range(1,pages):
        
        if to_continue is True:
    
    
            race = read_pdf(file_name, pages = i + 1, area = areas[year][1], guess = False, 
                            multiple_tables = False)
        
            try:
                race == None
            except:
                data = ibu_competition_analysis_splitSP(race, year)
                result = data[0]
                to_continue = data[1]
                results = pd.concat([results, result], axis = 0)
                                
    results.reset_index(inplace = True)
    results.drop('index', axis = 1, inplace = True)


    
    return results





In [20]:
"""
Function
--------

ibu_pickle_competition_analysisSP : takes the url of a biathlon data web page, determines 
                                    whether the page is actually a pdf file or not, and 
                                    in the case that it is, extracts the relevant data,
                                    puts it into a data frame, and then pickles the resulting 
                                    data.

Parameters
----------

year : the years of the biathlon season, in form y1y2 where y1 is the last 2 digits of the
        first year and y2 is the last two digits of the second year
event : a four character code specifying the event. Possibilities are “CP01”, “CP02”, . . .,
        “CP09”, “OG__”,”CH__”
race : a four character code specifying the race. Possibilities are SWIN, SWSP, SWPU, SMIN,
        SMSP, SMPU


Returns
-------

stores a pickled data frame object on the hard drive

Examples
--------
"""

def ibu_pickle_competition_analysisSP(year, event, racecode):
    
    # get the biathlon pdf
    
    filename = ("ibu_%(race)s_%(year)s_%(event)s.pkl" 
                %{"race": racecode, "year": year, "event": event})
    get_comp_file(year,event,racecode) #stored in ibu_cup.pdf
    
    
    # check whether you have a valid biathlon data file
    with open("ibu_cup.pdf","r") as source:
        begin = source.read(100)
        if "BlobNotFound" not in begin:
             # run biathlon_data to extract the data
            data = ibu_competition_analysis_mergeSP("ibu_cup.pdf",year)
            data.to_pickle(filename)
            
    return data

In [21]:
"""
Function
--------

ibu_pickle_biathlon_seasonSP : performs ibu_pickle_competition_analysisSP on every men's
                                sprint race in a given biathlon season

Parameters
----------

season : the years of the biathlon season, in form y1y2 where y1 is the last 2 digits of the
        first year and y2 is the last two digits of the second year

Returns
-------

failed_races : a list of races for which we were unable to pickle an object to save

and, in addition, stores a pickled dataframe to the hard drive for every event that is 
successfully pickled

Example
-------

"""

def ibu_pickle_biathlon_seasonSP(season):
    
    failed_races = []

    races = ["SMSP","SMSPS"]
    events = ['CP01','CP02','CP03','CP04','CP05','CP06','CP07','CP08','CH__']

    for race in races:
        for event in events:
            try:
                ibu_pickle_competition_analysisSP(season, event, race)
            except:
                failed_races.append([season, event, race])
    
    return failed_races

In [22]:
seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415','1516','1617','1718']

ibu_failures = []

for season in seasons:
    print season
    ibu_failures.extend(ibu_pickle_biathlon_seasonSP(season))

0405
0506
0607
0708
0809
0910
1011
1112
1213
1314
1415
1516
1617
1718


In [23]:
"""
Function
--------

dump_bad_rows : takes a ibu dataframe and eliminates those racer rows for which odd spacing
                in the pdf file lead to improperly parsed data being read in

Parameters
----------

year : the years of the biathlon season, in form y1y2 where y1 is the last 2 digits of the
        first year and y2 is the last two digits of the second year
event : a four character code specifying the event. Possibilities are “CP01”, “CP02”, . . .,
        “CP09”, “OG__”,”CH__”
race : a four character code specifying the race. Possibilities are SWIN, SWSP, SWPU, SMIN,
        SMSP, SMPU


Returns
-------

Nothing, the dataframe is changed in place rather than by making copies

Example
-------

"""

def dump_bad_rows(year, event, race):
    
    filename = 'ibu_%(race)s_%(year)s_%(event)s.pkl' %{'year' : year,
                                                       'event' : event, 'race' : race}
    df = pd.read_pickle(filename)

    bad_rows = []
    for i in range(len(df)):
        
        if df.loc[i,'S1'] not in [0, 1, 2, 3, 4, 5, '0','1','2','3','4','5',
                                  '0.0', '1.0', '2.0', '3.0', '4.0', '5.0']:
            bad_rows.append(i)

    df.drop(df.index[bad_rows], inplace=True)

    df.reset_index(drop = True, inplace = True)
    
    if len(df) > 10:
        df.to_pickle(filename)
    else:
        print 'check file for', year, event, race
    
    #return df

In [24]:
years = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
         '1516','1617','1718']
events = ['CP01','CP02','CP03','CP04','CP05','CP06','CP07','CP08','CH__']
races = ['SMSP','SMSPS']

missing_files = []
for year in years:
    for event in events:
        for race in races:
            try:
                dump_bad_rows(year, event, race)
            except:
                missing_files.append([year, event, race])
            

check file for 1415 CH__ SMSP


# Deleting bad racer data

In biathlon competitions, each missed shot requires that the athlete ski a 150 meter penalty loop (which takes in the neighborhood of 25 to 30 seconds). If an athlete fails to ski some or all of his penalty loops, he is then assigned a time penalty of two minutes for each penalty loop that he fails to ski. These time penalties are assessed relatively infrequently (around twice a year on average), but have the potential to severely skew the data when they occur, particularly in those cases where an athlete missed all five shots, failed to ski any penalty laps, and was thus assessed a ten minute penalty. As a result, I have chosen to drop those racers who were assessed penalties of this type from the races in which the infractions occurred.

(And, since reading the code for all of these deletions isn't very interesting, here's a [link](#Collecting-the-course-data) to jump to the next section.)

Previous Section: [Collecting the IBU Cup Sprint Data](#Collecting-IBU-Cup-sprint-data)


Next Section: [Collecting the Course Data](#Collecting-the-course-data)



[Table of Contents](#Table-of-Contents)

In [25]:
# 0405 CP02 ISA Hidenori
race_data = pd.read_pickle('companal_SMSP_0405_CP02.pkl')
race_data.drop(race_data[race_data.Name == u'ISA Hidenori'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('companal_SMSP_0405_CP02.pkl')

# 0405 CP04 KLETCHEROV Michail
race_data = pd.read_pickle('companal_SMSP_0405_CP04.pkl')
race_data.drop(race_data[race_data.Name == u'KLETCHEROV Michail'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('companal_SMSP_0405_CP04.pkl')

# 0405 CP09 VITEK Zdenek
race_data = pd.read_pickle('companal_SMSP_0405_CP09.pkl')
race_data.drop(race_data[race_data.Name == u'VITEK Zdenek'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('companal_SMSP_0405_CP09.pkl')

# 0506 CP01 MOYSEY Peter
race_data = pd.read_pickle('companal_SMSP_0506_CP01.pkl')
race_data.drop(race_data[race_data.Name == u'MOYSEY Peter'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('companal_SMSP_0506_CP01.pkl')

# 0506 CP02 OS Alexander, NOVIKOV Sergei
race_data = pd.read_pickle('companal_SMSP_0506_CP02.pkl')
race_data.drop(race_data[race_data.Name == u'OS Alexander'].index, inplace = True)
race_data.drop(race_data[race_data.Name == u'NOVIKOV Sergei'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('companal_SMSP_0506_CP02.pkl')

# 0708 CP08 SUMANN Christopher
race_data = pd.read_pickle('companal_SMSP_0708_CP08.pkl')
race_data.drop(race_data[race_data.Name == u'SUMANN Christopher'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('companal_SMSP_0708_CP08.pkl')

# 0708 CP09 YAROSHENKO Dmitri
race_data = pd.read_pickle('companal_SMSP_0708_CP09.pkl')
race_data.drop(race_data[race_data.Name == u'YAROSHENKO Dmitri'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('companal_SMSP_0708_CP09.pkl')

# 0809 CP04 TOBRELUTS Indrek
race_data = pd.read_pickle('companal_SMSP_0809_CP04.pkl')
race_data.drop(race_data[race_data.Name == u'TOBRELUTS Indrek'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('companal_SMSP_0809_CP04.pkl')

# 0910 CP02 LEGUELLEC Jean Philippe
race_data = pd.read_pickle('companal_SMSP_0910_CP02.pkl')
race_data.drop(race_data[race_data.Name == u'LEGUELLEC Jean Philippe'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('companal_SMSP_0910_CP02.pkl')

# 0910 CP03 KAUPPINEN Jarkko, KALDVEE Martten
race_data = pd.read_pickle('companal_SMSP_0910_CP03.pkl')
race_data.drop(race_data[race_data.Name == u'KAUPPINEN Jarkko'].index, inplace = True)
race_data.drop(race_data[race_data.Name == u'KALDVEE Martten'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('companal_SMSP_0910_CP03.pkl')

# 0910 CP05 KOSARAC Namanja
race_data = pd.read_pickle('companal_SMSP_0910_CP05.pkl')
race_data.drop(race_data[race_data.Name == u'KOSARAC Namanja'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('companal_SMSP_0910_CP05.pkl')

# 1011 CH__ LOPATIC Stefan
race_data = pd.read_pickle('companal_SMSP_1011_CH__.pkl')
race_data.drop(race_data[race_data.Name == u'LOPATIC Stefan'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('companal_SMSP_1011_CH__.pkl')

# 1112 CP07 EBERHARD Tobias
race_data = pd.read_pickle('companal_SMSP_1112_CP07.pkl')
race_data.drop(race_data[race_data.Name == u'EBERHARD Tobias'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('companal_SMSP_1112_CP07.pkl')

# 1213 CP01 KLETCHEROV Martin
race_data = pd.read_pickle('companal_SMSP_1213_CP01.pkl')
race_data.drop(race_data[race_data.Name == u'KLETCHEROV Martin'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('companal_SMSP_1213_CP01.pkl')

# 1213 CP06 FOURCADE Simon, OTCENAS Martin
race_data = pd.read_pickle('companal_SMSP_1213_CP06.pkl')
race_data.drop(race_data[race_data.Name == u'FOURCADE Simon'].index, inplace = True)
race_data.drop(race_data[race_data.Name == u'OTCENAS Martin'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('companal_SMSP_1213_CP06.pkl')

# 1213 CP09 CHRISTIANSEN Vetle Sjastad
race_data = pd.read_pickle('companal_SMSP_1213_CP09.pkl')
race_data.drop(race_data[race_data.Name == u'CHRISTIANSEN Vetle Sjastad'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('companal_SMSP_1213_CP09.pkl')

# 1314 CP07 FILLON MAILLET Quentin
race_data = pd.read_pickle('companal_SMSP_1314_CP07.pkl')
race_data.drop(race_data[race_data.Name == u'FILLON MAILLET Quentin'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('companal_SMSP_1314_CP07.pkl')

# 1314 CP09 BOGETVEIT Haavard
race_data = pd.read_pickle('companal_SMSP_1314_CP09.pkl')
race_data.drop(race_data[race_data.Name == u'BOGETVEIT Haavard'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('companal_SMSP_1314_CP09.pkl')

# 1415 CP04 EBERHARD Julian
race_data = pd.read_pickle('companal_SMSP_1415_CP04.pkl')
race_data.drop(race_data[race_data.Name == u'EBERHARD Julian'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('companal_SMSP_1415_CP04.pkl')

# 1415 CH__ PODKORYTOV Vassiliy
race_data = pd.read_pickle('companal_SMSP_1415_CH__.pkl')
race_data.drop(race_data[race_data.Name == u'PODKORYTOV Vassiliy'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('companal_SMSP_1415_CH__.pkl')

# 1415 CP09 STEPHAN Christoph
race_data = pd.read_pickle('companal_SMSP_1415_CP09.pkl')
race_data.drop(race_data[race_data.Name == u'STEPHAN Christoph'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('companal_SMSP_1415_CP09.pkl')


In [26]:
# Similarly, I have assessed time penalty data for IBU cup races

# 0506 CP02 EMONTS Ralph
race_data = pd.read_pickle('ibu_SMSP_0506_CP02.pkl')
race_data.drop(race_data[race_data.Name == u'EMONTS Ralph'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_0506_CP02.pkl')

# 0506 CP03 NAVARRO PEREZ Jose Ramon ,PRADES REVERTER Benjamin
race_data = pd.read_pickle('ibu_SMSP_0506_CP03.pkl')
race_data.drop(race_data[race_data.Name == u'NAVARRO PEREZ Jose Ramon'].index, inplace = True)
race_data.drop(race_data[race_data.Name == u'PRADES REVERTER Benjamin'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_0506_CP03.pkl')

# 0506 CP04 PERRAS Scott
race_data = pd.read_pickle('ibu_SMSP_0506_CP04.pkl')
race_data.drop(race_data[race_data.Name == u'PERRAS Scott'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_0506_CP04.pkl')

# 0506 CH__ DEBAYLE Yann, FOIDL Hans Peter
race_data = pd.read_pickle('ibu_SMSP_0506_CH__.pkl')
race_data.drop(race_data[race_data.Name == u'DEBAYLE Yann'].index, inplace = True)
race_data.drop(race_data[race_data.Name == u'FOIDL Hans Peter'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_0506_CH__.pkl')

# 0607 CP02 GJELLAND Egil, DAMJANOVSKI Darko
race_data = pd.read_pickle('ibu_SMSP_0607_CP02.pkl')
race_data.drop(race_data[race_data.Name == u'GJELLAND Egil'].index, inplace = True)
race_data.drop(race_data[race_data.Name == u'DAMJANOVSKI Darko'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_0607_CP02.pkl')


# 0607 CP03 FREI Thomas
race_data = pd.read_pickle('ibu_SMSP_0607_CP03.pkl')
race_data.drop(race_data[race_data.Name == u'FREI Thomas'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_0607_CP03.pkl')


# 0607 CP06(S) COOL Herbert
race_data = pd.read_pickle('ibu_SMSPS_0607_CP06.pkl')
race_data.drop(race_data[race_data.Name == u'COOL Herbert'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSPS_0607_CP06.pkl')


# 0708 CP07 ICOSKI Gjorgji, STOJANOSKI Dejan
race_data = pd.read_pickle('ibu_SMSP_0708_CP07.pkl')
race_data.drop(race_data[race_data.Name == u'ICOSKI Gjorgji'].index, inplace = True)
race_data.drop(race_data[race_data.Name == u'STOJANOSKI Dejan'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_0708_CP07.pkl')


# 0809 CP01 DRIFFILL Craig
race_data = pd.read_pickle('ibu_SMSP_0809_CP01.pkl')
race_data.drop(race_data[race_data.Name == u'DRIFFILL Craig'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_0809_CP01.pkl')


# 0809 CP02 TADEJEVIC Zvonimir
race_data = pd.read_pickle('ibu_SMSP_0809_CP02.pkl')
race_data.drop(race_data[race_data.Name == u'TADEJEVIC Zvonimir'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_0809_CP02.pkl')


# 0809 CP07 HODZIC Admir
race_data = pd.read_pickle('ibu_SMSP_0809_CP07.pkl')
race_data.drop(race_data[race_data.Name == u'HODZIC Admir'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_0809_CP07.pkl')


# 0809 CH__ OTCENAS Martin
race_data = pd.read_pickle('ibu_SMSP_0809_CH__.pkl')
race_data.drop(race_data[race_data.Name == u'OTCENAS Martin'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_0809_CH__.pkl')


# 0910 CP02 HRKALOVIC Emir
race_data = pd.read_pickle('ibu_SMSP_0910_CP02.pkl')
race_data.drop(race_data[race_data.Name == u'HRKALOVIC Emir'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_0910_CP02.pkl')


# 0910 CP03 CEPULIS Darius, RASTIC Ajlan, ANGELIS Apostolos
race_data = pd.read_pickle('ibu_SMSP_0910_CP03.pkl')
race_data.drop(race_data[race_data.Name == u'CEPULIS Darius'].index, inplace = True)
race_data.drop(race_data[race_data.Name == u'RASTIC Ajlan'].index, inplace = True)
race_data.drop(race_data[race_data.Name == u'ANGELIS Apostolos'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_0910_CP03.pkl')


# 0910 CP05 RASTIC Demir
race_data = pd.read_pickle('ibu_SMSP_0910_CP05.pkl')
race_data.drop(race_data[race_data.Name == u'RASTIC Demir'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_0910_CP05.pkl')


# 0910 CP06 NAUDI BERMON Miquel
race_data = pd.read_pickle('ibu_SMSP_0910_CP06.pkl')
race_data.drop(race_data[race_data.Name == u'NAUDI BERMON Miquel'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_0910_CP06.pkl')


# 1011 CP01(S) ICOSKI Gjorgji, GORBALD Paul
race_data = pd.read_pickle('ibu_SMSPS_1011_CP01.pkl')
race_data.drop(race_data[race_data.Name == u'ICOSKI Gjorgji'].index, inplace = True)
race_data.drop(race_data[race_data.Name == u'GORBALD Paul'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSPS_1011_CP01.pkl')


# 1011 CP02 EHRHART Ludwig
race_data = pd.read_pickle('ibu_SMSP_1011_CP02.pkl')
race_data.drop(race_data[race_data.Name == u'EHRHART Ludwig'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_1011_CP02.pkl')


# 1011 CP03 BURIC Matej
race_data = pd.read_pickle('ibu_SMSP_1011_CP03.pkl')
race_data.drop(race_data[race_data.Name == u'BURIC Matej'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_1011_CP03.pkl')


# 1011 CP04 SLOOF Lucien
race_data = pd.read_pickle('ibu_SMSP_1011_CP04.pkl')
race_data.drop(race_data[race_data.Name == u'SLOOF Lucien'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_1011_CP04.pkl')


# 1011 CP07 SHIPULIN Anton, PALEVSKI Radi, PETROV Andrej
race_data = pd.read_pickle('ibu_SMSP_1011_CP07.pkl')
race_data.drop(race_data[race_data.Name == u'SHIPULIN Anton'].index, inplace = True)
race_data.drop(race_data[race_data.Name == u'PALEVSKI Radi'].index, inplace = True)
race_data.drop(race_data[race_data.Name == u'PETROV Andrej'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_1011_CP07.pkl')


# 1112 CP04 DAMJANOVSKI Darko
race_data = pd.read_pickle('ibu_SMSP_1112_CP04.pkl')
race_data.drop(race_data[race_data.Name == u'DAMJANOVSKI Darko'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_1112_CP04.pkl')


# 1112 CP08 KEIFER Alexander
race_data = pd.read_pickle('ibu_SMSP_1112_CP08.pkl')
race_data.drop(race_data[race_data.Name == u'KEIFER Alexander'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_1112_CP08.pkl')


# 1213 CP02 SLETTEMARK Oystein
race_data = pd.read_pickle('ibu_SMSP_1213_CP02.pkl')
race_data.drop(race_data[race_data.Name == u'SLETTEMARK Oystein'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_1213_CP02.pkl')


# 1213 CP05 ARNAULT Clement
race_data = pd.read_pickle('ibu_SMSP_1213_CP05.pkl')
race_data.drop(race_data[race_data.Name == u'ARNAULT Clement'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_1213_CP05.pkl')


# 1213 CP06 MAEDA Ryo
race_data = pd.read_pickle('ibu_SMSP_1213_CP06.pkl')
race_data.drop(race_data[race_data.Name == u'MAEDA Ryo'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_1213_CP06.pkl')


# 1314 CP01 STANOESKI Tosho
race_data = pd.read_pickle('ibu_SMSP_1314_CP01.pkl')
race_data.drop(race_data[race_data.Name == u'STANOESKI Tosho'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_1314_CP01.pkl')


# 1314 CP02 CRNKOVIC Kresimir, CHENG Fangming
race_data = pd.read_pickle('ibu_SMSP_1314_CP02.pkl')
race_data.drop(race_data[race_data.Name == u'CRNKOVIC Kresimir'].index, inplace = True)
race_data.drop(race_data[race_data.Name == u'CHENG Fangming'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_1314_CP02.pkl')


# 1314 CP04 CUENOT Gaspard, KUEHN Johannes
race_data = pd.read_pickle('ibu_SMSP_1314_CP04.pkl')
race_data.drop(race_data[race_data.Name == u'CUENOT Gaspard'].index, inplace = True)
race_data.drop(race_data[race_data.Name == u'KUEHN Johannes'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_1314_CP04.pkl')


# 1314 CP05 ORHAN Erkan
race_data = pd.read_pickle('ibu_SMSP_1314_CP05.pkl')
race_data.drop(race_data[race_data.Name == u'ORHAN Erkan'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_1314_CP05.pkl')


# 1415 CP01 LEE Su-Young, REITER Michael
race_data = pd.read_pickle('ibu_SMSP_1415_CP01.pkl')
race_data.drop(race_data[race_data.Name == u'LEE Su-Young'].index, inplace = True)
race_data.drop(race_data[race_data.Name == u'REITER Michael'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_1415_CP01.pkl')


# 1415 CP03 MARSANIC Mihael
race_data = pd.read_pickle('ibu_SMSP_1415_CP03.pkl')
race_data.drop(race_data[race_data.Name == u'MARSANIC Mihael'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_1415_CP03.pkl')


# 1516 CP01(S) VAHTRA Eno
race_data = pd.read_pickle('ibu_SMSPS_1516_CP01.pkl')
race_data.drop(race_data[race_data.Name == u'VAHTRA Eno'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSPS_1516_CP01.pkl')


# 1516 CP02 GOMBOS Karoly
race_data = pd.read_pickle('ibu_SMSP_1516_CP02.pkl')
race_data.drop(race_data[race_data.Name == u'GOMBOS Karoly'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_1516_CP02.pkl')


# 1516 CP04 MILOSHEVSKI Velche
race_data = pd.read_pickle('ibu_SMSP_1516_CP04.pkl')
race_data.drop(race_data[race_data.Name == u'MILOSHEVSKI Velche'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_1516_CP04.pkl')


# 1516 CP04(S) PETROVIC Filip
race_data = pd.read_pickle('ibu_SMSPS_1516_CP04.pkl')
race_data.drop(race_data[race_data.Name == u'PETROVIC Filip'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSPS_1516_CP04.pkl')


# 1516 CP07 TALIHAERM Johan, PETROVIC Filip
race_data = pd.read_pickle('ibu_SMSP_1516_CP07.pkl')
race_data.drop(race_data[race_data.Name == u'TALIHAERM Johan'].index, inplace = True)
race_data.drop(race_data[race_data.Name == u'PETROVIC Filip'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_1516_CP07.pkl')


# 1516 CH__ REITER Michael
race_data = pd.read_pickle('ibu_SMSP_1516_CH__.pkl')
race_data.drop(race_data[race_data.Name == u'REITER Michael'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_1516_CH__.pkl')


# 1516 CP08 JOVANOSKI Borce
race_data = pd.read_pickle('ibu_SMSP_1516_CP08.pkl')
race_data.drop(race_data[race_data.Name == u'JOVANOSKI Borce'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_1516_CP08.pkl')


# 1516 CP08(S) STANOESKI Tosho
race_data = pd.read_pickle('ibu_SMSPS_1516_CP08.pkl')
race_data.drop(race_data[race_data.Name == u'STANOESKI Tosho'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSPS_1516_CP08.pkl')


# 1617 CP01 GOND Balazs
race_data = pd.read_pickle('ibu_SMSP_1617_CP01.pkl')
race_data.drop(race_data[race_data.Name == u'GOND Balazsh'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_1617_CP01.pkl')


# 1617 CP04 KRISTEJN Lukas, RAZQUIN MANGADO Alejandro
race_data = pd.read_pickle('ibu_SMSP_1617_CP04.pkl')
race_data.drop(race_data[race_data.Name == u'KRISTEJN Lukas'].index, inplace = True)
race_data.drop(race_data[race_data.Name == u'RAZQUIN MANGADO Alejandro'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_1617_CP04.pkl')


# 1617 CP04(S) YEREMIN Roman
race_data = pd.read_pickle('ibu_SMSPS_1617_CP04.pkl')
race_data.drop(race_data[race_data.Name == u'YEREMIN Roman'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSPS_1617_CP04.pkl')


# 1617 CP08 PADDER Rameez Ahmad
race_data = pd.read_pickle('ibu_SMSP_1617_CP08.pkl')
race_data.drop(race_data[race_data.Name == u'PADDER Rameez Ahmad'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_1617_CP08.pkl')


# 1718 CP01(S) FLANAGAN Jeremy
race_data = pd.read_pickle('ibu_SMSPS_1718_CP01.pkl')
race_data.drop(race_data[race_data.Name == u'FLANAGAN Jeremy'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSPS_1718_CP01.pkl')


# 1718 CP08 TREIER Jan
race_data = pd.read_pickle('ibu_SMSP_1718_CP08.pkl')
race_data.drop(race_data[race_data.Name == u'TREIER Jan'].index, inplace = True)
race_data.reset_index(inplace = True, drop = True)
race_data.to_pickle('ibu_SMSP_1718_CP08.pkl')




# Collecting the course data

In order to consider the effects of conditions on race times and shooting accuracy, not to mention computing accurate speeds, it was necessary to collect a certain amount of data on the weather and course conditions. This data can be found at urls that look like:

https://ibu.blob.core.windows.net/docs/0809/BT/SWRL/CH__/SMSP/BT_C82_1.1.pdf

or more generally, like:

https://ibu.blob.core.windows.net/docs/<font color = red>season</font>/BT/SWRL/<font color = red>event</font>/<font color = red>race</font>/BT_C82_1.0.pdf

Due to the layout of these pages, it was simpler to collect all of the weather information over the course of a season and all of the course information over the course of a season separately, and then to pickle the resulting dataframes. As a result, we produce both a weather dataframe and a course dataframe for each season. We begin by collecting the course data.

In order to do this, we use the following functions:
1. ```get_summary_file```: This function takes a year, event, and race type, finds the url of the associated competition data summary, then downloads the file and stores it at summary.pdf.
2. ```extract_distance```: This function takes in a string  containing a distance and strips any letters from the string. It then converts any distances given in kilometers into meters and returns the relevant distances in meters.
3. ```extract_values```: This function takes a string containing a value and strips non numerical information from it.
4. ```course_layout_split```: This function takes as input the contents of the cell containing the course layout, which can be of many forms. It calls ```extract_distance``` to remove extraneous information and to convert all lengths to meters and returns the lengths of the individual laps.
5. ```get_course_info```: This function reads a pdf file containing the competition data summary for a race and returns the course information in the form of a dataframe containing a single line.
6. ```collect_course_info```: This function calls ```get_course_info``` for each men's sprint race in a world cup season, concatenates the returned dataframes, and stores the resulting dataframe as a pickle file.

I ran ```collect_course_info``` over all the seasons for which competition data summary files existed (from 2004-2005 on). Once again, some of the Olympic files posed a problem, once because of a non standard url (as above), and once because of a nonstandard pdf layout. In both cases, I added the necessary information to the relevent course file manually.

Previous Section: [Deleting Bad Racer Data](#Deleting-bad-racer-data)


Next Section: [Collecting the Weather Data](#Collecting-the-weather-data)



[Table of Contents](#Table-of-Contents)

In [27]:
"""
Function
--------

get_summary_file : downloads a biathlon results competition data summary file to the hard 
                    drive 

Parameters
----------

year : the years of the biathlon season, in form y1y2 where y1 is the last 2 digits of the
        first year and y2 is the last two digits of the second year
event : a four character code specifying the event. Possibilities are “CP01”, “CP02”, . . .,
        “CP09”, “OG__”,”CH__”
race : a four character code specifying the race. Possibilities are SWIN, SWSP, SWPU, SMIN,
        SMSP, SMPU


Returns
-------

Stores a .pdf file at summary.pdf. 

Example
-------

"""

def get_summary_file(year,event,race):

    url = ("https://ibu.blob.core.windows.net/docs/%(y)s/BT/SWRL/%(e)s/%(r)s/BT_C82_1.0.pdf" 
           %{"e": event, "r":race, "y":year})
        
    urllib.urlretrieve(url,"summary.pdf")
    


In [28]:
"""
Function 
--------

extract_distance : takes in a string of containing a distance, strips excess data, and returns 
                    the relevant distances in meters

Parameters
----------

string : a string which contains a distance, possibly in meters, possibly in kilometers, as 
        well as a possible unit or course description

Returns
-------

dist : the distance, stripped of any other words or symbols and converted into meters if
        necessary

Examples
--------

"""


def extract_distance(string):
    
    multiple = False
    # First, split the string to allow extraction.

    split_string = string.split()
    
    # Next, get rid of any pieces that are just color, unit, etc
    
    to_strip = ['yellow', 'green', 'blue', 'brown', 'red', 'course', 'km', 'm', 'Yellow',
                'Green','Blue','Brown','Red','Course']
    digits = ['0','1','2','3','4','5','6','7','8','9','0','.']
    
    to_keep = []
    for k in split_string:
        if k not in to_strip:
            to_keep.append(k)
            
    to_keep = "".join(to_keep)
    
            
    # For what is left
    
    if 'x' in to_keep:
        # We have multiple notation
        multiple = True
        
        to_keep = "".join(to_keep)
        for i in range(len(to_keep)):
            if to_keep[i] == 'x':
                b = i
                
        to_keep = to_keep[b+1:]
        
        number = []
        for j in range(len(to_keep)):
            if to_keep[j] in digits:
                number.append(to_keep[j])
        dist = "".join(number) 
        
        
    else:
        
        number = []
        for j in range(len(to_keep)):
            if to_keep[j] in digits:
                number.append(to_keep[j])

        dist = "".join(number) 
        
    if float(dist) < 100:
        dist = 1000*float(dist)
    else:
        dist = float(dist)


    return dist

In [29]:
"""
Function
--------

extract_values : takes a string containing a value and strips non numerical information
                from it, e.g., HD, MT, km, etc.


Parameters
----------

string : a string which contains a number and may contain other symbols 

Returns
-------

value : the number that remains when extraneous data has been stripped.

Examples
--------

"""

def extract_values(string):
    
    split_string = string.split()
    
    to_strip = ["(HD)", "HD", "(MM)", "MM", "(MT)", "MT", "(Length)", "Length", "km",
                "m", "Height", "Difference:", "Max.", "Max", "Climb:", "Total", "Course",
                "Length:", 'nan']
    
    to_keep = []
    for k in split_string:
        if str(k) not in to_strip:
            to_keep.append(str(k))
            
    to_keep = "".join(to_keep)
    
    number = []
    for j in range(len(to_keep)):
        try:
            number.append(str(int(to_keep[j])))
        except:
            pass
    value = "".join(number) 
    
    return value

In [30]:
"""
Function
--------

course_layout_split : takes the contents of the cell containing the course layout and 
                    returns the individual lap length

Parameters
----------

string : a string containing the (possibly) concatenated loop lengths for a sprint race
race : the code for the race (SWSP for a women's sprint, SMSP for a men's sprint)

Returns
-------

distance : a list containing the lengths of the three laps of the sprint race.

Examples
--------
"""

def course_layout_split(string, race):
    
    distance = []
            
    split = string.split("+")
    if len(split) == 3: # then each of the loop distances is in here separately
        for i in range(3):
            distance.append(extract_distance(split[i]))
        
    elif len(split) == 4 : # They're hopefully skiing the two middle laps together
        distance.append(extract_distance(split[0]))
        distance.append(extract_distance(split[1]) + extract_distance(split[2]))
        distance.append(extract_distance(split[3]))
        
    else: # The same distance is skied for each lap
        distance.extend([extract_distance(split[0]),extract_distance(split[0]),
                         extract_distance(split[0])])
        
                    
    return distance

In [31]:
"""
Function
--------

get_course_info : reads a pdf file containing the competition data summary for a race and 
                    returns the course information

Parameters
----------

filename : A file containing a pdf of competition data summary
year : The season of the race (necessary to determine which portion of the page to read)

Returns
-------

df : The relevant course data in a single line dataframe

Example
-------
"""

def get_course_info(filename, year, event, race):
    
    summary_areas = {"0405": [(290, 37, 395, 557),(400, 20, 588, 557)],
                "0506": [(290, 37, 395, 557),(400, 20, 588, 557)],
                "0607": [(290, 37, 395, 557),(400, 20, 588, 557)],
                "0708": [(273, 37, 380, 557),(400, 35, 560, 560)],
                "0809": [(273, 37, 380, 557),(400, 20, 590, 557)],
                "0910": [(273, 37, 380, 557),(273, 20, 590, 586)],
                "1011": [(242, 26, 350, 568),(380, 26, 550, 568)],
                "1112": [(270, 26, 375, 568),(400, 26, 590, 568)],
                "1213": [(275, 26, 381, 568),(400, 26, 590, 568)],
                "1314": [(275, 26, 381, 568),(275, 26, 600, 568)],
                "1415": [(217, 26, 323, 568),(350, 26, 530, 568)],
                "1516": [(221, 26, 326, 568),(300, 26, 580, 568)],
                "1617": [(252, 19, 386, 577),(400, 19, 590, 577)],
                "1718": [(252, 19, 386, 577),(400, 19, 590, 577)]}
    
    summary_areas_OG = {"0506" : [(250, 21, 355, 572), (434, 21, 515, 572)]}
    

    data = read_pdf(filename,area = summary_areas[year][1], guess = False, encoding = 'utf8')
    
    
    cols = ['Year', 'Event', 'Race','Loop 1', 'Loop 2', 'Loop 3', 
            'Length', 'Height Diff', 'Max Climb', 'Total Climb']
    
    info = [year, event, race]
    
    # First, I want to find where the Course Information cell is, because my stuff is there
    
    for i in range(len(data)):
        if "Course Information" in unicode(data.iloc[i,0]):
            b = i
        
    # The important information should be in rows b+1 to b+5
    
    # Get the loop information and add it to info
    
    distance = course_layout_split(data.iloc[b+1, 0], race)
    info.extend(distance)
    
    # Get the other information and add it to the info
    
    for i in range(2,6):
        stuff = data.iloc[b + i, :]
        stuff = [str(item) for item in stuff]
        string = " ".join(stuff)
        value = extract_values(string)
        info.append(value)
    
    info = pd.DataFrame([info], columns = cols)
    
    return info


In [32]:
"""
Function
--------

collect_course_info : collects the course information for all of the men's or women's 
                        sprint races for an entire world cup season

Parameters
----------

year : The code for the season under consideration
gender : M or W

Returns
-------

data : A data frame containing the data summary for all the men's or women's races in the 
        given season

Examples
--------

"""

def collect_course_info(year, gender):
    
    if gender == 'M':
        racecodes = ['SMSP']
    else:
        racecodes = ['SWSP']
        
    events = ['CP01', 'CP02', 'CP03', 'CP04', 'CP05', 'CP06', 'CP07', 'CP08', 'CP09', 'CH__',
              'OG__']
    
    df = pd.DataFrame(columns = ['Year', 'Event', 'Race','Loop 1', 'Loop 2', 'Loop 3', 
                                 'Length', 'Height Diff', 'Max Climb', 'Total Climb'])

    for event in events:
        for race in racecodes:
            get_summary_file(year,event,race)
            try:
                data = get_course_info("summary.pdf", year, event, race)
                df = pd.concat([df, data])
            except:
                pass
            
    pickle = "course_summary_%(year)s_%(gender)s.pkl" %{"year": year, "gender": gender}
            
    df.reset_index(drop = True, inplace = True)    
        
    df.to_pickle(pickle)
    
    return df

In [34]:
seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617','1718']

for season in seasons:
    collect_course_info(season, 'M')
    


#### Problems with the Sochi  and PyeongChang Olympic Games

Due to a nonstandardized choice of url for the competition summary files for the Sochi Olympics and then nonstandardized choice of layout for the PyeongChang Olympics, I judged that it would be simpler to simply add the relevant data for these events by hand rather than to try to write a function to put it in. The code for that is:



In [35]:
# For the Sochi Olympics

df = pd.read_pickle("course_summary_1314_M.pkl")

data = [['1314','OG__','SMSP',3393,3388,3494,10275,57,28,381]]

df1 = pd.DataFrame(data,columns = ['Year', 'Event', 'Race','Loop 1', 'Loop 2', 'Loop 3', 
                                 'Length', 'Height Diff', 'Max Climb', 'Total Climb'])

df = pd.concat([df, df1])
df.reset_index(drop = True, inplace = True)

df.to_pickle("course_summary_1314_M.pkl")

In [36]:
# For the PyeongChang Olympics

df = pd.read_pickle("course_summary_1718_M.pkl")

data = [['1718','OG__','SMSP',3159,3424,3230,9813,37,37,348]]

df1 = pd.DataFrame(data,columns = ['Year', 'Event', 'Race','Loop 1', 'Loop 2', 'Loop 3', 
                                 'Length', 'Height Diff', 'Max Climb', 'Total Climb'])

df = pd.concat([df, df1])
df.reset_index(drop = True, inplace = True)

df.to_pickle("course_summary_1718_M.pkl")


# Collecting the weather data

The weather data for the races (temperature, humidity, wind, etc) can be found on the same pdf as the course data. In order to keep things managable, I collected the weather data separately, and stored it in a separate dataframe from the course data.

The functions that were used for collecting the weather information are as follows:
1. ```get_weather_info```: This function takes a pdf file containing a competition data summary and uses tabula-py to read in the portion containing data about weather conditions. It then collects information about weather, snow conditions, air temperature, snow temperature, humidity, and wind speed and returns that information in the form of a dataframe containing a single row.
2. ```collect_weather```: This function calls ```get_weather_info``` for each men's sprint race in a world cup season, concatenates the returned dataframes, and stores the resulting dataframe as a pickle file.

Once again, we run into problems with the Sochi (url) and PyeongChang (layout) Olympic Games. I again judged that the simplest solution was to enter the missing information manually.

Previous Section: [Collecting the Course Data](#Collecting-the-course-data)


Next Section: [Collecting IBU Cup course and weather data](#Collecting-IBU-Cup-course-and-weather-data)


[Table of Contents](#Table-of-Contents)

In [37]:
"""
Function
--------

get_weather_info : takes a pdf file containing a competition data summary returns information 
                    about the weather and snow conditions during the given race

Parameters
----------

filename : A file containing a pdf of competition data summary
year : the years of the biathlon season, in form y1y2 where y1 is the last 2 digits of the
        first year and y2 is the last two digits of the second year
event : a four character code specifying the event. Possibilities are “CP01”, “CP02”, . . .,
        “CP09”, “OG__”,”CH__”
race : a four character code specifying the race. Possibilities are SWIN, SWSP, SWPU, SMIN,
        SMSP, SMPU

Returns
-------

df : The relevant weather data in a single line dataframe

Example
-------
"""

def get_weather_info(filename, year, event, race):
    
    summary_areas = {"0405": [(200, 20, 428, 576),(474, 37, 554, 557)],
                "0506": [(244, 20, 428, 576),(474, 37, 554, 557)],
                "0607": [(244, 20, 428, 576),(474, 37, 554, 557)],
                "0708": [(244, 20, 428, 576),(464, 37, 540, 557)],
                "0809": [(244, 20, 428, 576),(474, 37, 554, 557)],
                "0910": [(244, 20, 428, 576),(474, 37, 554, 557)],
                "1011": [(200, 26, 400, 568),(442, 26, 520, 568)],
                "1112": [(200, 26, 425, 568),(471, 26, 548, 568)],
                "1213": [(200, 26, 425, 568),(475, 26, 552, 568)],
                "1314": [(200, 26, 425, 568),(475, 26, 552, 568)],
                "1415": [(180, 26, 400, 568),(417, 26, 493, 568)],
                "1516": [(180, 26, 400, 568),(420, 26, 498, 568)],
                "1617": [(200, 19, 450, 577),(488, 19, 576, 577)],
                "1718": [(200, 19, 450, 577),(488, 19, 576, 577)]}
    
    data = read_pdf(filename,area = summary_areas[year][0], guess = False)
            
    cols = ['Year', 'Event', 'Race', 'Weather A', 'Weather B', 'Weather C', 'Weather D',
            'Snow Cond A', 'Snow Cond B', 'Snow Cond C', 'Snow Cond D', 'Snow Temp A',
            'Snow Temp B', 'Snow Temp C', 'Snow Temp D', 'Air Temp A', 'Air Temp B',
            'Air Temp C','Air Temp D','Humidity A','Humidity B', 'Humidity C','Humidity D',
            'Wind A', 'Wind B', 'Wind C', 'Wind D']
    
    # Find the upper left corner of the data, which is the word 'Weather'
    
    info = [year, event, race]
    
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            if unicode(data.iloc[i,j]) == 'Weather':
                row = i
                column = j
        
    # Drop all extraneous rows
    
    data.drop(data.index[row+8:], inplace = True)
        
    data.drop(data.index[:row], inplace = True)
        
    data.reset_index(drop = True, inplace = True)
    
    # Drop any columns that are now entirely nan values
    
    data.dropna(axis = 1, how = "all", inplace = True)
    
    for i in range(len(data)):
        if data.iloc[i,0] == "Weather":
            row1 = i
        if data.iloc[i,0] == "Humidity":
            row2 = i

    difference = row2 - row1
    
    if difference > 4:
        race = data.values.tolist()
        race_1 = []
        for i in range(len(race[1])):
            race_1.append(" ".join([unicode(race[1][i]),unicode(race[3][i])]))
        new_race = [race[0], race_1, race[4],race[5], race[6], race[7]]
        data = pd.DataFrame(new_race)
    
    # Fill the dataframe row
    
    if data.shape[1] > 5:
    
        data = data.values.tolist()
    
        cond = [item for item in data[0] if unicode(item) not in ['nan','NaN']]
        for i in range(1,5):
            info.append(cond[i])

        cond = [item for item in data[1] if unicode(item) not in ['nan','NaN']]            
        for i in range(1,5):
            info.append(cond[i])
        
        cond = [item for item in data[2] if unicode(item) not in ['nan','NaN']]
        for i in range(1,5):
            info.append(unicode(cond[i]).split()[0])
        
        cond = [item for item in data[3] if unicode(item) not in ['nan','NaN']]
        for i in range(1,5):
            info.append(unicode(cond[i]).split()[0])
        
        cond = [item for item in data[4] if unicode(item) not in ['nan','NaN']]
        for i in range(1,5):
            info.append(unicode(cond[i]).split()[0])
        
        cond = [item for item in data[5] if unicode(item) not in ['nan','NaN']]
        for i in range(1,5):
            length = len(unicode(cond[i]).split())
            info.append(unicode(cond[i]).split()[length-2])
            
    else:
        
        data = data.values.tolist()
        
        cond = data[0]
        for i in range(1,5):
            info.append(cond[i])
            
        cond = data[1]
        for i in range(1,5):
            info.append(cond[i])
            
        cond = data[2]
        for i in range(1,5):
            info.append(unicode(cond[i]).split()[0])
            
        cond = data[3]
        for i in range(1,5):
            info.append(unicode(cond[i]).split()[0])
            
        cond = data[4]
        for i in range(1,5):
            info.append(unicode(cond[i]).split()[0])
            
        cond = data[5]
        for i in range(1,5):
            length = len(unicode(cond[i]).split())
            info.append(unicode(cond[i]).split()[length-2])
    # Make it a dataframe
    
    df = pd.DataFrame([info], columns = cols)
            
    return df

In [38]:
"""
Function
--------

collect_weather : passes through all of the sprint races for a given season and collects 
                    the weather data into a single data frame 

Parameters
----------

year : the years of the biathlon season, in form y1y2 where y1 is the last 2 digits of the
        first year and y2 is the last two digits of the second year
gender : M or W

Returns
-------

data : A data frame containing the data summary for all the men's or women's races in the 
        given season

NB: columns coded 'A' give conditions 30 minutes before start time, columns coded 'B' 
    give conditions at start time, columns coded 'C' give conditions 30 minutes after
    start time, and columns coded 'D' give conditions at the finish.

Examples
--------

"""

def collect_weather(year, gender):
    
    if gender == 'M':
        racecodes = ['SMSP']
    else:
        racecodes = ['SWSP']
        
    events = ['CP01', 'CP02', 'CP03', 'CP04', 'CP05', 'CP06', 'CP07', 'CP08', 'CP09', 
              'CH__', 'OG__']
    
    df = pd.DataFrame(columns = ['Year', 'Event', 'Race', 'Weather A', 'Weather B', 
                                 'Weather C', 'Weather D', 'Snow Cond A', 'Snow Cond B',
                                 'Snow Cond C', 'Snow Cond D', 'Snow Temp A', 'Snow Temp B',
                                 'Snow Temp C', 'Snow Temp D', 'Air Temp A', 'Air Temp B',
                                 'Air Temp C', 'Air Temp D', 'Humidity A', 'Humidity B',
                                 'Humidity C', 'Humidity D', 'Wind A', 'Wind B', 'Wind C',
                                 'Wind D'])
    
    for event in events:
        for race in racecodes:
            get_summary_file(year,event,race)
            try:
                data = get_weather_info("summary.pdf", year, event, race)
                df = pd.concat([df, data])
            except:
                pass
            
    pickle = "weather_summary_%(year)s_%(gender)s.pkl" %{"year": year, "gender": gender}
            
    df.reset_index(drop = True, inplace = True)   
        
    df.to_pickle(pickle)
    
    course_file = "course_summary_%(year)s_%(gender)s.pkl" %{"year": year, "gender": gender}
    course = pd.read_pickle(course_file)
    
    if len(course) > len(df):
        print "You have", len(course), "course files and only", len(df), """weather files \
        for the""", gender, year, "season."
    elif len(df) > len(course):
        print "You have", len(df), "weather files and only", len(course), """course files \
        for the""", gender, year, "season."
    else:
        print """You have the same number of weather and course files for \
        the""", gender, year, "season."

    
    return df

In [39]:
seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415','1516','1617','1718']

for season in seasons:
    collect_weather(season, 'M')

You have the same number of weather and course files for         the M 0405 season.
You have the same number of weather and course files for         the M 0506 season.
You have the same number of weather and course files for         the M 0607 season.
You have the same number of weather and course files for         the M 0708 season.
You have the same number of weather and course files for         the M 0809 season.
You have the same number of weather and course files for         the M 0910 season.
You have the same number of weather and course files for         the M 1011 season.
You have the same number of weather and course files for         the M 1112 season.
You have the same number of weather and course files for         the M 1213 season.
You have 9 course files and only 8 weather files         for the M 1314 season.
You have the same number of weather and course files for         the M 1415 season.
You have the same number of weather and course files for         the M 1516 seas


## Problems with the Olympic Games


In [40]:
data = pd.read_pickle("weather_summary_1314_M.pkl")

df = [['1314','OG__', 'SMSP', 'Sky Clear', 'Partly Cloudy', 'Partly Cloudy', 'Partly Cloudy',
       'Packed', 'Packed', 'Packed','Packed', '-0.1', '-0.1', '-0.1', '-0.1', '1.9', '0.9',
       '0.4', '0.3', '47', '54', '56', '54', '0.1', '0.0', '0.0', '0.0']]
      
df = pd.DataFrame(df, columns = ['Year', 'Event', 'Race', 'Weather A', 'Weather B', 
                               'Weather C', 'Weather D', 'Snow Cond A', 'Snow Cond B', 
                               'Snow Cond C', 'Snow Cond D', 'Snow Temp A', 'Snow Temp B',
                               'Snow Temp C', 'Snow Temp D', 'Air Temp A', 'Air Temp B',
                               'Air Temp C', 'Air Temp D', 'Humidity A', 'Humidity B', 
                               'Humidity C', 'Humidity D', 'Wind A', 'Wind B', 'Wind C',
                               'Wind D'])

data = pd.concat([data,df])
data.reset_index(drop = True, inplace = True)

data.to_pickle('weather_summary_1314_M.pkl')

In [41]:
data = pd.read_pickle("weather_summary_1718_M.pkl")

df = [['1718', 'OG__', 'SMSP', u'Light snow', u'Mostly cloudy', u'Mostly cloudy', 
       u'Mostly cloudy', u'Compact', u'Compact', u'Compact', u'Compact', u'-12.9', u'-14.3',
       u'-14.3', u'-13.8', u'-11.0', u'-10.7', u'-11.0', u'-11.3', u'68', u'63', u'64',
       u'64', u'1.9', u'2.7', u'1.4', u'2.3']]
      
df = pd.DataFrame(df, columns = ['Year', 'Event', 'Race', 'Weather A', 'Weather B', 
                               'Weather C', 'Weather D', 'Snow Cond A', 'Snow Cond B', 
                               'Snow Cond C', 'Snow Cond D', 'Snow Temp A', 'Snow Temp B',
                               'Snow Temp C', 'Snow Temp D', 'Air Temp A', 'Air Temp B',
                               'Air Temp C', 'Air Temp D', 'Humidity A', 'Humidity B', 
                               'Humidity C', 'Humidity D', 'Wind A', 'Wind B', 'Wind C',
                               'Wind D'])

data = pd.concat([data,df])
data.reset_index(drop = True, inplace = True)

data.to_pickle('weather_summary_1718_M.pkl')


# Collecting IBU Cup course and weather data

In order to collect course and weather data for IBU Cup races, I used essentially the same functions as I used to get the same data for the World Cup races.

The functions that are used here are:
1. ```ibu_find_course_url```: This function, together with ```ibu_get_summary_file``` is analogous to ```get_summary_file```.
2. ```ibu_get_summary_file```: This function, together with ```ibu_find_course_url``` is analogous to ```get_summary_file```).
3. ```ibu_extract_distance```: This function is analogous to ```extract_distance```.
4. ```ibu_extract_values```: This function is analogous to ```extract_values```.
5. ```ibu_course_layout_split```: This function is analogous to ```course_layout_split```.
6. ```ibu_get_course_info```: This function is analogous to ```get_course_info```.
7. ```ibu_collect_course_info```: This function is analogous to ```collect_course_info```.
8. ```ibu_get_weather_info```: This function is analogous to ```get_weather_info```.
9. ```ibu_get_weather_info```: This function is analogous to ```get_weather_info```.

Previous Section: [Collecting the Weather Data](#Collecting-the-weather-data)

Next Section: [Adjustments to course and weather data](#Adjustments-to-course-and-weather-data)



[Table of Contents](#Table-of-Contents)

In [42]:
"""
Function
--------

ibu_find_course_url : given the year, event, and race code of a particular competition, 
                        returns the url for the competition data summary

Parameters
----------

year : the years of the biathlon season, in form y1y2 where y1 is the last 2 digits of the
        first year and y2 is the last two digits of the second year
event : a four character code specifying the event. Possibilities are “CP01”, “CP02”, . . .,
        “CP09”, “OG__”,”CH__”
race : a four character code specifying the race. Possibilities are SWIN, SWSP, SWPU, SMIN,
        SMSP, SMPU

Returns
-------

course_url : a string containing the url of the competition data summary pdf

Example
-------

"""

def ibu_find_course_url(year, event, race):
    
    if year in ['0405','0506','0607','0708','0809','0910','1011','1112']:
        cup = "SCEU"
    elif event in ['CH__']:
        cup = "SCEU"
    else:
        cup = "SIBU"
    
    course_url = (
    "https://ibu.blob.core.windows.net/docs/%(y)s/BT/%(c)s/%(e)s/%(r)s/BT_C82_1.0.pdf"
        %{"y" : year, "c" : cup, "e" : event, "r" : race})
    
    return course_url

In [43]:
# Course and weather areas

ibu_course_area = {'0405' : [(268, 34, 453, 560), (453, 34, 594, 560)],
               '0506' : [(355, 37, 451, 558), (451, 37, 700, 558)],
               '0607' : [(266, 37, 431, 558), (431, 37, 586, 558)],
               '0708' : [(266, 37, 439, 558), (439, 37, 593, 558)],
               '0809' : [(243, 37, 428, 558), (428, 37, 571, 558)],
               '0910' : [(241, 37, 419, 558), (419, 37, 589, 558)],
               '1011' : [(210, 25, 416, 568), (416, 25, 567, 568)],
               '1112' : [(232, 25, 431, 568), (431, 25, 573, 568)],
               '1213' : [(239, 25, 438, 568), (438, 25, 576, 568)],
               '1314' : [(249, 25, 440, 568), (440, 25, 579, 568)],
               '1415' : [(182, 25, 382, 568), (382, 25, 529, 568)],
               '1516' : [(183, 25, 383, 568), (383, 25, 530, 568)],
               '1617' : [(210, 20, 440, 572), (440, 20, 609, 572)],
               '1718' : [(210, 20, 440, 572), (440, 20, 609, 572)]}

In [44]:
"""
Function
--------

ibu_get_summary_file : retrieves a competition data summary pdf file from the web
                        and stores it on the hard drive

Parameters
----------

year : the years of the biathlon season, in form y1y2 where y1 is the last 2 digits of the
        first year and y2 is the last two digits of the second year
event : a four character code specifying the event. Possibilities are “CP01”, “CP02”, . . .,
        “CP09”, “OG__”,”CH__”
race : a four character code specifying the race. Possibilities are SWIN, SWSP, SWPU, SMIN,
        SMSP, SMPU


Returns
-------

Stores a .pdf file at summary.pdf. 

Example
-------

"""

def ibu_get_summary_file(year,event,race):

    url = ibu_find_course_url(year, event, race)
    
    
    urllib.urlretrieve(url,"summary.pdf")
    

In [45]:
"""
Function 
--------

ibu_extract_distance : takes in a string of containing a distance, strips excess data, and
                        returns the relevant distances in meters

Parameters
----------

string : a string which contains a distance, possibly in meters, possibly in kilometers, as 
        well as a possible unit or course description

Returns
-------

dist : the distance, stripped of any other words or symbols and converted into meters if
        necessary

Examples
--------


"""

def ibu_extract_distance(string):
    
    multiple = False
    # First, split the string to allow extraction.

    split_string = string.split()
    
    # Next, get rid of any pieces that are just color, unit, etc
    
    to_strip = ['yellow', 'green', 'blue', 'brown', 'red', 'course', 'km', 'm', 'Yellow',
                'Green','Blue','Brown','Red', 'Course']
    digits = ['0','1','2','3','4','5','6','7','8','9','0','.']
    
    to_keep = []
    for k in split_string:
        if k not in to_strip:
            to_keep.append(k)
            
    to_keep = "".join(to_keep)
    
            
    # For what is left
    
    if 'x' in to_keep:
        # We have multiple notation
        multiple = True
        
        to_keep = "".join(to_keep)
        for i in range(len(to_keep)):
            if to_keep[i] == 'x':
                b = i
                
        to_keep = to_keep[b+1:]
        
        number = []
        for j in range(len(to_keep)):
            if to_keep[j] in digits:
                number.append(to_keep[j])
        dist = "".join(number) 
        
        
    else:
        
        number = []
        for j in range(len(to_keep)):
            if to_keep[j] in digits:
                number.append(to_keep[j])

        dist = "".join(number) 
        
    if float(dist) < 100:
        dist = 1000*float(dist)
    else:
        dist = float(dist)


    return dist

In [46]:
"""
Function
--------

ibu_extract_values : takes a string containing a value and strips non numerical information
                from it, e.g., HD, MT, km, etc.


Parameters
----------

string : a string which contains a number and may contain other symbols 

Returns
-------

value : the number that remains when extraneous data has been stripped.

Examples
--------

"""

def ibu_extract_values(string):
    
    split_string = string.split()
    
    to_strip = ["(HD)", "HD", "(MM)", "MM", "(MT)", "MT", "(Length)", "Length", "km", "m", "Height", "Difference:",
               "Max.", "Max", "Climb:", "Total", "Course", "Length:", 'nan']
    
    to_keep = []
    for k in split_string:
        if str(k) not in to_strip:
            to_keep.append(str(k))
            
    to_keep = "".join(to_keep)
    
    number = []
    for j in range(len(to_keep)):
        try:
            number.append(str(int(to_keep[j])))
        except:
            pass
    value = "".join(number) 
    
    return value

In [47]:
"""
Function
--------

ibu_course_layout_split : takes the contents of the cell containing the course layout and 
                    returns the individual lap length

Parameters
----------

string : a string containing the (possibly) concatenated loop lengths for a sprint race
race : the code for the race (SWSP for a women's sprint, SMSP for a men's sprint)

Returns
-------

distance : a list containing the lengths of the three laps of the sprint race.

Examples
--------

"""

def ibu_course_layout_split(string, race):
    
    distance = []
    
    if race in ['SWSP', 'SMSP','SWSPS','SMSPS']: # (race is a sprint)
        
        split = string.split("+")
        if len(split) == 3: # then each of the loop distances is in here separately
            for i in range(3):
                distance.append(ibu_extract_distance(split[i]))
        elif len(split) == 4 : # They're hopefully skiing the two middle laps together
            distance.append(ibu_extract_distance(split[0]))
            distance.append(ibu_extract_distance(split[1]) + ibu_extract_distance(split[2]))
            distance.append(ibu_extract_distance(split[3]))
        else: # The same distance is skied for each lap
            distance.extend([ibu_extract_distance(split[0]),ibu_extract_distance(split[0]),
                             ibu_extract_distance(split[0])])
        
    return distance

In [48]:
"""
Function
--------

ibu_get_course_info : reads a pdf file containing the competition data summary for a race and 
                    returns the course information

Parameters
----------

filename : A file containing a pdf of competition data summary
year : The season of the race (necessary to determine which portion of the page to read)

Returns
-------

df : The relevant course data in a single line dataframe

Example
-------

"""

def ibu_get_course_info(filename, year, event, race):
    
    summary_areas = {"0405": [(290, 37, 395, 557),(400, 20, 588, 557)],
                "0506": [(290, 37, 395, 557),(400, 20, 588, 557)],
                "0607": [(290, 37, 395, 557),(400, 20, 588, 557)],
                "0708": [(273, 37, 380, 557),(400, 35, 560, 560)],
                "0809": [(273, 37, 380, 557),(400, 20, 590, 557)],
                "0910": [(273, 37, 380, 557),(273, 20, 590, 586)],
                "1011": [(242, 26, 350, 568),(380, 26, 550, 568)],
                "1112": [(270, 26, 375, 568),(400, 26, 590, 568)],
                "1213": [(275, 26, 381, 568),(400, 26, 590, 568)],
                "1314": [(275, 26, 381, 568),(275, 26, 600, 568)],
                "1415": [(217, 26, 323, 568),(350, 26, 530, 568)],
                "1516": [(221, 26, 326, 568),(300, 26, 580, 568)],
                "1617": [(252, 19, 386, 577),(400, 19, 590, 577)],
                "1718": [(252, 19, 386, 577),(400, 19, 590, 577)]}
        

    data = read_pdf(filename,area = summary_areas[year][1], guess = False)
        
    cols = ['Year', 'Event', 'Race','Loop 1', 'Loop 2', 'Loop 3', 
            'Length', 'Height Diff', 'Max Climb', 'Total Climb']
    
    info = [year, event, race]
    
    # First, I want to find where the Course Information cell is, because my stuff is there
    
    for i in range(len(data)):
        if "Course Information" in unicode(data.iloc[i,0]):
            b = i
        
    # The important information should be in rows b+1 to b+5
    
    # Get the loop information and add it to info
    
    distance = ibu_course_layout_split(data.iloc[b+1, 0], race)
    info.extend(distance)
    
    # Get the other information and add it to the info
    
    for i in range(2,6):
        stuff = data.iloc[b + i, :]
        stuff = [str(item) for item in stuff]
        string = " ".join(stuff)
        value = ibu_extract_values(string)
        info.append(value)
    
    info = pd.DataFrame([info], columns = cols)
    
    return info
    

In [49]:
"""
Function
--------

ibu_collect_course_info : collects the course information for all of the men's or women's 
                        sprint races for an entire IBU world cup season

Parameters
----------

year : The code for the season under consideration
gender : M or W

Returns
-------

data : A data frame containing the data summary for all the men's or women's races in the 
        given season

Examples
--------


"""

def ibu_collect_course_info(year, gender):
    
    racecodes = ['SMSP','SMSPS']
        
    events = ['CP01', 'CP02', 'CP03', 'CP04', 'CP05', 'CP06', 'CP07', 'CP08', 'CH__', 'OG__']
    
    df = pd.DataFrame(columns = ['Year', 'Event', 'Race','Loop 1', 'Loop 2', 'Loop 3', 
                                 'Length', 'Height Diff', 'Max Climb', 'Total Climb'])

    for event in events:
        for race in racecodes:
            ibu_get_summary_file(year,event,race)
            try:
                data = ibu_get_course_info("summary.pdf", year, event, race)
                df = pd.concat([df, data])
            except:
                pass
            
    pickle = "ibu_course_summary_%(year)s_%(gender)s.pkl" %{"year": year, "gender": gender}
            
    df.reset_index(drop = True, inplace = True)    
        
    df.to_pickle(pickle)
    
    return df

In [50]:
seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617','1718']

for season in seasons:
    ibu_collect_course_info(season, 'M')

In [51]:
"""
Function
--------

ibu_get_weather_info : takes a pdf file containing a competition data summary returns
                    information about the weather and snow conditions during the given race

Parameters
----------

filename : A file containing a pdf of competition data summary
year : the years of the biathlon season, in form y1y2 where y1 is the last 2 digits of the
        first year and y2 is the last two digits of the second year
event : a four character code specifying the event. Possibilities are “CP01”, “CP02”, . . .,
        “CP09”, “OG__”,”CH__”
race : a four character code specifying the race. Possibilities are SWIN, SWSP, SWPU, SMIN,
        SMSP, SMPU

Returns
-------

df : The relevant weather data in a single line dataframe

Example
-------

"""

def ibu_get_weather_info(filename, year, event, race):
    
    summary_areas = {"0405": [(200, 20, 428, 576),(474, 37, 554, 557)],
                "0506": [(244, 20, 428, 576),(474, 37, 554, 557)],
                "0607": [(244, 20, 428, 576),(474, 37, 554, 557)],
                "0708": [(244, 20, 428, 576),(464, 37, 540, 557)],
                "0809": [(244, 20, 428, 576),(474, 37, 554, 557)],
                "0910": [(244, 20, 428, 576),(474, 37, 554, 557)],
                "1011": [(200, 26, 400, 568),(442, 26, 520, 568)],
                "1112": [(200, 26, 425, 568),(471, 26, 548, 568)],
                "1213": [(200, 26, 425, 568),(475, 26, 552, 568)],
                "1314": [(200, 26, 425, 568),(475, 26, 552, 568)],
                "1415": [(180, 26, 400, 568),(417, 26, 493, 568)],
                "1516": [(180, 26, 400, 568),(420, 26, 498, 568)],
                "1617": [(200, 19, 450, 577),(488, 19, 576, 577)],
                "1718": [(200, 19, 450, 577),(488, 19, 576, 577)]}
    
    data = read_pdf(filename,area = summary_areas[year][0], guess = False)
        
    
    cols = ['Year', 'Event', 'Race', 'Weather A', 'Weather B', 'Weather C', 'Weather D', 
            'Snow Cond A', 'Snow Cond B', 'Snow Cond C', 'Snow Cond D', 'Snow Temp A', 
            'Snow Temp B', 'Snow Temp C', 'Snow Temp D', 'Air Temp A', 'Air Temp B', 
            'Air Temp C', 'Air Temp D', 'Humidity A', 'Humidity B','Humidity C', 
            'Humidity D', 'Wind A', 'Wind B', 'Wind C', 'Wind D']
    
    # Find the upper left corner of the data, which is the word 'Weather'
    
    info = [year, event, race]
    
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            if unicode(data.iloc[i,j]) == 'Weather':
                row = i
                column = j
        
    # Drop all extraneous rows
    
    data.drop(data.index[row+8:], inplace = True)
        
    data.drop(data.index[:row], inplace = True)
        
    data.reset_index(drop = True, inplace = True)
    
    # Drop any columns that are now entirely nan values
    
    data.dropna(axis = 1, how = "all", inplace = True)
    
    for i in range(len(data)):
        if data.iloc[i,0] == "Weather":
            row1 = i
        if data.iloc[i,0] == "Humidity":
            row2 = i

    difference = row2 - row1
    
    if difference > 4:
        race = data.values.tolist()
        race_1 = []
        for i in range(len(race[1])):
            race_1.append(" ".join([unicode(race[1][i]),unicode(race[3][i])]))
        new_race = [race[0], race_1, race[4],race[5], race[6], race[7]]
        data = pd.DataFrame(new_race)
    
    # Fill the dataframe row
    
    if data.shape[1] > 5:
    
        data = data.values.tolist()
    
        cond = [item for item in data[0] if unicode(item) not in ['nan', 'NaN']]
        for i in range(1,5):
            info.append(cond[i])

        cond = [item for item in data[1] if unicode(item) not in ['nan', 'NaN']]            
        for i in range(1,5):
            info.append(cond[i])
        
        cond = [item for item in data[2] if unicode(item) not in ['nan', 'NaN']]
        for i in range(1,5):
            info.append(unicode(cond[i]).split()[0])
        
        cond = [item for item in data[3] if unicode(item) not in ['nan', 'NaN']]
        for i in range(1,5):
            info.append(unicode(cond[i]).split()[0])
        
        cond = [item for item in data[4] if unicode(item) not in ['nan', 'NaN']]
        for i in range(1,5):
            info.append(unicode(cond[i]).split()[0])
        
        cond = [item for item in data[5] if unicode(item) not in ['nan', 'NaN']]
        for i in range(1,5):
            length = len(unicode(cond[i]).split())
            info.append(unicode(cond[i]).split()[length-2])
            
    else:
        
        data = data.values.tolist()
        
        cond = data[0]
        for i in range(1,5):
            info.append(cond[i])
            
        cond = data[1]
        for i in range(1,5):
            info.append(cond[i])
            
        cond = data[2]
        for i in range(1,5):
            info.append(unicode(cond[i]).split()[0])
            
        cond = data[3]
        for i in range(1,5):
            info.append(unicode(cond[i]).split()[0])
            
        cond = data[4]
        for i in range(1,5):
            info.append(unicode(cond[i]).split()[0])
            
        cond = data[5]
        for i in range(1,5):
            length = len(unicode(cond[i]).split())
            info.append(unicode(cond[i]).split()[length-2])
    # Make it a dataframe
    
    df = pd.DataFrame([info], columns = cols)
            
    return df

In [52]:
"""
Function
--------

ibu_collect_weather : passes through all of the sprint races for a given season and collects 
                    the weather data into a single data frame 

Parameters
----------

year : the years of the biathlon season, in form y1y2 where y1 is the last 2 digits of the
        first year and y2 is the last two digits of the second year
gender : M or W

Returns
-------

data : A data frame containing the data summary for all the men's or women's races in the 
        given season

NB: columns coded 'A' give conditions 30 minutes before start time, columns coded 'B' 
    give conditions at start time, columns coded 'C' give conditions 30 minutes after
    start time, and columns coded 'D' give conditions at the finish.

Examples
--------


"""

def ibu_collect_weather(year, gender):
    
    racecodes = ['SMSP', 'SMSPS']
            
    events = ['CP01', 'CP02', 'CP03', 'CP04', 'CP05', 'CP06', 'CP07', 'CP08', 'CP09', 
              'CH__', 'OG__']
    
    df = pd.DataFrame(columns = ['Year', 'Event', 'Race', 'Weather A', 'Weather B', 
                                 'Weather C', 'Weather D', 'Snow Cond A', 'Snow Cond B',
                                 'Snow Cond C', 'Snow Cond D', 'Snow Temp A', 'Snow Temp B',
                                 'Snow Temp C', 'Snow Temp D', 'Air Temp A', 'Air Temp B', 
                                 'Air Temp C', 'Air Temp D', 'Humidity A', 'Humidity B',
                                 'Humidity C', 'Humidity D', 'Wind A', 'Wind B', 'Wind C',
                                 'Wind D'])
    
    for event in events:
        for race in racecodes:
            ibu_get_summary_file(year,event,race)
            try:
                data = ibu_get_weather_info("summary.pdf", year, event, race)
                df = pd.concat([df, data])
            except:
                pass
            
    pickle = "ibu_weather_summary_%(year)s_%(gender)s.pkl" %{"year": year, "gender": gender}
            
    df.reset_index(drop = True, inplace = True)   
        
    df.to_pickle(pickle)
    
    course_file = "ibu_course_summary_%(year)s_%(gender)s.pkl" %{"year": year, "gender": gender}
    course = pd.read_pickle(course_file)
    
    if len(course) > len(df):
        print "You have", len(course), "course files and only", len(df), """weather files \
        for the""", gender, year, "season."
    elif len(df) > len(course):
        print "You have", len(df), "weather files and only", len(course), """course files \
        for the""", gender, year, "season."
    else:
        print """You have the same number of weather and course files for \
        the""", gender, year, "season."
    
    return df

In [53]:
for season in seasons:
    ibu_collect_weather(season, 'M')

You have the same number of weather and course files for         the M 0405 season.
You have the same number of weather and course files for         the M 0506 season.
You have the same number of weather and course files for         the M 0607 season.
You have the same number of weather and course files for         the M 0708 season.
You have the same number of weather and course files for         the M 0809 season.
You have the same number of weather and course files for         the M 0910 season.
You have the same number of weather and course files for         the M 1011 season.
You have the same number of weather and course files for         the M 1112 season.
You have the same number of weather and course files for         the M 1213 season.
You have the same number of weather and course files for         the M 1314 season.
You have the same number of weather and course files for         the M 1415 season.
You have the same number of weather and course files for         the M 1516 


# Adjustments to course and weather data

Although at this point we have collected course and weather data for both the World Cup and IBU Cup races, there are still a few things to be done with these data files. In particular, I want to
1. Convert quantitative variables from strings to float values. In some cases, this will require converting numbers given in European format. In order to do this, I use the function
    - ```euro_to_float```: This function takes as input either a string containing a number in either decimal or European format or a number (float or integer). It then converts the input into a float and returns it. (Obviously, inputs that are actually floats are returned as is.) 
2. Add altitudes to for each race. The altitudes are currently stored in a .csv file, and were obtained by using the website [elevationmap.net](https://elevationmap.net/old/) together with the names of the biathlon stadia (available in the headers of the competition analysis files). 
3. Add columns for quantitative year and quantitative event (which will be useful later).
4. Make a handful of corrections for badly read or inaccurate data.

We then perform the same steps for the IBU Cup data

Previous Section: [Collecting IBU Cup course and weather data](#Collecting-IBU-Cup-course-and-weather-data)


Next Section: [Exploring Relationships](#Exploring-Relationships)


[Table of Contents](#Table-of-Contents)

In [54]:
"""
Function
--------

euro_to_float : converts numbers from European format (with commas as decimal 
                separators) to standard decimal format

Parameters
----------

string : a string containing a number, possibly in decimal format, possibly in European 
        format. Could also be an integer or a real number

Returns
-------

string2 : a real number (float) derived from the input string

Example
-------

"""

def euro_to_float(string):
    
    try:
        string2 = float(string)
    except:    
        string1 = string.split(',')
        if len(string1) != 2:
            string2 = string
        else:
            try:
                string2 = float(".".join(string1))
            except:
                string2 = string
    
    return string2

In [55]:
seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617','1718']

for season in seasons:
    filename = 'weather_summary_%(season)s_M.pkl' %{'season' : season}
    filename1 = 'ibu_weather_summary_%(season)s_M.pkl' %{'season' : season}
    
    df = pd.read_pickle(filename)
    
    columns = df.columns.tolist()

    for column in columns:
        for i in range(len(df)):
            df.loc[i,column] = euro_to_float(df.loc[i,column])
        
    df.to_pickle(filename)
    
    df1 = pd.read_pickle(filename1)
    
    columns = df1.columns.tolist()

    for column in columns:
        for i in range(len(df1)):
            df1.loc[i,column] = euro_to_float(df1.loc[i,column])
        
    df1.to_pickle(filename1)

In [56]:
seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617','1718']

for season in seasons:
    filename = 'course_summary_%(season)s_M.pkl' %{'season' : season}
    filename1 = 'ibu_course_summary_%(season)s_M.pkl' %{'season' : season}
    
    df = pd.read_pickle(filename)
    
    columns = df.columns.tolist()

    for column in columns:
        for i in range(len(df)):
            df.loc[i,column] = euro_to_float(df.loc[i,column])
        
    df.to_pickle(filename)
    
    df1 = pd.read_pickle(filename1)
    
    columns = df1.columns.tolist()

    for column in columns:
        for i in range(len(df1)):
            df1.loc[i,column] = euro_to_float(df1.loc[i,column])
        
    df1.to_pickle(filename1)

## Adding altitudes to the course summaries


In [57]:
# Adding altitudes to the dataframe (choosing to use course_summaries here)

altitudes = pd.read_csv('site_altitudes.csv')
altitudes.set_index('Unnamed: 0', inplace = True, drop = True)

seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617','1718']

for season in seasons:
    filename = 'course_summary_%(season)s_M.pkl' %{'season' : season}
    df = pd.read_pickle(filename)
    
    for i in range(len(df)):
        season = int(df.loc[i,'Year'])
        event = df.loc[i,'Event']
    
        df.loc[i,'Altitude'] = altitudes.loc[season, event]

    df.to_pickle(filename)


## Quantifying the seasons and events



In [58]:
quant_seasons = {'0405' : 1, '0506' : 2, '0607' : 3, '0708' : 4, '0809' : 5, '0910' : 6,
                 '1011' : 7, '1112' : 8, '1213' : 9, '1314' : 10, '1415' : 11, '1516' : 12,
                 '1617' : 13,'1718' : 14}
quant_events = {'CP01' : [1,1], 'CP02' : [2,2], 'CP03' : [3,3], 'CP04' : [4,4], 
                'CP05' : [5,5], 'CP06' : [6,6], 'CP07' : [7,8], 'CP08' : [8,9], 
                'CP09':[10,10], 'CH__' : [9,7], 'OG__' : [7,7]}

# note that the order of events in a given season is dependent on the season

In [59]:
# Adding quantitative values for season and event to the dataframes 
# (choosing to use course_summaries here)

seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617','1718']

for season in seasons:
    filename = 'course_summary_%(season)s_M.pkl' %{'season' : season}
    df = pd.read_pickle(filename)

    for i in range(len(df)):
        season = str(int(df.loc[i,'Year']))
        if len(season) < 4:
            season = "".join(['0',season])
        event = df.loc[i,'Event']
    
        if season in ['0405','1011','1112','1415','1516']:
            df.loc[i,'Quant Event'] = quant_events[event][0]
        else:
            df.loc[i,'Quant Event'] = quant_events[event][1]
        df.loc[i,'Quant Year'] = quant_seasons[season]
        
    df.to_pickle(filename)


## Correcting inconsistancies in capitalization and bad data

To be entirely clear here, bad data refers to cases where the data was misread from the pdf files, in which case it is being replaced by the original data from the pdf file, and cases where the given values are logically impossible (a maximum climb that exceeds the height differential of the course, for example) in which case it is replaced by the value from other races held at the same site.



In [60]:
# Making a few alterations to snow and weather conditions due to inconsistancies in using
# capitalization

seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617','1718']

for season in seasons:
    filename = 'weather_summary_%(season)s_M.pkl' %{'season' : season}
    df = pd.read_pickle(filename)
    
    df.replace('Mostly cloudy','Mostly Cloudy', inplace = True)
    df.replace('Partly cloudy','Partly Cloudy', inplace = True)
    df.replace('Partly cloudy night','Partly Cloudy Night', inplace = True)
    
    df.replace('Packed powder','Packed Powder', inplace = True)
    df.replace('Wet and powder', 'Wet & Powder', inplace = True)
    df.replace('Wet & powder', 'Wet & Powder', inplace = True)
    df.replace('Wet and Powder', 'Wet & Powder', inplace = True)
    df.replace('Light snow','Light Snow', inplace = True)
    df.replace('Hard packed variable','Hard Packed Variable', inplace = True)
    df.replace('Hard packed','Hard Packed', inplace = True)
    df.replace('Spring conditions', 'Spring Conditions', inplace = True)
    df.replace('powder', 'Powder', inplace = True)

    df.to_pickle(filename)

In [61]:
# Correcting for misread or missing data 

# PyeongChang 1718 OG__

df = pd.read_pickle('weather_summary_1718_M.pkl')

row_z = np.where(df['Event'] == 'OG__')[0][0]
df.loc[row_z] = [1718.0, 'OG__', 'SMSP', 'Light Snow', 'Mostly Cloudy', 'Mostly Cloudy',
                 'Mostly Cloudy', u'Compact', u'Compact', u'Compact', u'Compact', -12.9,
                 -14.3, -14.3, -13.8, -11.0, -10.7, -11.0, -11.3, 68.0, 63.0, 64.0, 64.0, 
                 1.9, 2.7, 1.4, 2.3]

df.to_pickle('weather_summary_1718_M.pkl')

# Hochfilzen 1718 CH02

df = pd.read_pickle('course_summary_1718_M.pkl')

row_y = np.where(df['Event'] == 'CP02')[0][0]
df.loc[row_y, 'Height Diff'] = 39.2

df.to_pickle('course_summary_1718_M.pkl')

# Nove Mesto 1617 CP03 
## NOTE for this race and the next one, it would probably be better to fill the 
# dataframe completely

df = pd.read_pickle('weather_summary_1617_M.pkl')

row_a = np.where((df['Event'] == 'CP03'))[0][0]
df.loc[row_a,'Weather C'] = 'Mostly Cloudy Night'
df.loc[row_a,'Snow Cond C'] = 'Granular'
df.loc[row_a,'Snow Temp C'] = -0.5
df.loc[row_a,'Air Temp C'] = -0.2
df.loc[row_a,'Humidity C'] = 67
df.loc[row_a,'Wind C'] = 0.5

# PyeongChang 1617 CP07

row_b = np.where((df['Event'] == 'CP07'))[0][0]
df.loc[row_b,'Weather C'] = 'Partly Cloudy Night'
df.loc[row_b,'Snow Cond C'] = 'Hard Packed Variable'
df.loc[row_b,'Snow Temp C'] = 0.0
df.loc[row_b,'Air Temp C'] = 1.2
df.loc[row_b,'Humidity C'] = 53
df.loc[row_b,'Wind C'] = 1.3

df.to_pickle('weather_summary_1617_M.pkl')
# 1617 Championship (European style decimal was misconverted)

df = pd.read_pickle('course_summary_1617_M.pkl')
row_c = np.where((df['Event'] == 'CH__'))[0][0]
df.loc[row_c, 'Height Diff'] = 39.2
df.to_pickle('course_summary_1617_M.pkl')

# 0607 CP02 Total Climb (Note that the given value here, 3335m is unreasonable. I replace it
# by the value given for other races at the same site)

df = pd.read_pickle('course_summary_0607_M.pkl')
row_d = np.where((df['Event'] == 'CP02'))[0][0]
df.loc[row_d,'Total Climb'] = 381
df.to_pickle('course_summary_0607_M.pkl')

# And replacing the Max Climb by Max Climb from another event at the same site, because 
# the given max climb exceeds the height differential

df = pd.read_pickle('course_summary_0405_M.pkl')
row_e = np.where((df['Event'] == 'CP06'))[0][0]
df.loc[row_e,'Max Climb'] = 29
df.to_pickle('course_summary_0405_M.pkl')

df = pd.read_pickle('course_summary_0506_M.pkl')
row_f = np.where( (df['Event'] == 'CP06'))[0][0]
df.loc[row_f,'Max Climb'] = 29
df.to_pickle('course_summary_0506_M.pkl')




## Performing the same steps for the ibu summary dataframes



In [62]:
# Adding altitudes to the dataframe (choosing to use course_summaries here)

altitudes = pd.read_csv('ibu_site_elevations.csv')
altitudes.set_index('Year',inplace = True, drop = True)

seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617', '1718']

for season in seasons:
    filename = 'ibu_course_summary_%(season)s_M.pkl' %{'season' : season}
    df = pd.read_pickle(filename)
    
    for i in range(len(df)):
        season = int(df.loc[i,'Year'])
        event = df.loc[i,'Event']
    
        df.loc[i,'Altitude'] = altitudes.loc[season, event]

    df.to_pickle(filename)

In [63]:
quant_seasons = {'0405' : 1, '0506' : 2, '0607' : 3, '0708' : 4, '0809' : 5, '0910' : 6, 
                 '1011' : 7, '1112' : 8, '1213' : 9, '1314' : 10, '1415' : 11, '1516' : 12,
                 '1617' : 13, '1718' : 14}
quant_events = {'CP01' : [1,1,1,1], 'CP02' : [2,2,2,2], 'CP03' : [3,3,3,3], 
                'CP04' : [4,4,4,4], 'CP05' : [5,5,9,5], 'CP06' : [6,7,5,6], 
                'CP07' : [8,8,7,7], 'CP08' : [9,9,8,9], 'CH__' : [7,6,6,8]}

# note that the order of events in a given season is dependent on the season

In [64]:
# Adding quantitative values for season and event to the dataframes 
# (choosing to use course_summaries here)

seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617','1718']

for season in seasons:
    filename = 'ibu_course_summary_%(season)s_M.pkl' %{'season' : season}
    df = pd.read_pickle(filename)

    for i in range(len(df)):
        season = str(int(df.loc[i,'Year']))
        if len(season) < 4:
            season = "".join(['0',season])
        event = df.loc[i,'Event']
    
        if season in ['0405','0708','1314']:
            df.loc[i,'Quant Event'] = quant_events[event][0]
        elif season in ['0506','1112','1415','1617','1718']:
            df.loc[i,'Quant Event'] = quant_events[event][1]
        elif season in ['0607']:
            df.loc[i,'Quant Event'] = quant_events[event][2]
        else: # season in ['0809','0910','1011','1213','1516']
            df.loc[i,'Quant Event'] = quant_events[event][3]
        df.loc[i,'Quant Year'] = quant_seasons[season]
        
    df.to_pickle(filename)



In [65]:
# Correcting for misread or missing data 

# Altenberg 0506 CP04

df = pd.read_pickle('ibu_course_summary_0506_M.pkl')

row_z = np.where((df['Event'] == 'CP04'))[0][0]
df.loc[row_z,'Max Climb'] = 37

df.to_pickle('ibu_course_summary_0506_M.pkl')

# Forni Avoltri 0607 CP04 

df = pd.read_pickle('ibu_course_summary_0607_M.pkl')

row_a = np.where((df['Event'] == 'CP04'))[0][0]
df.loc[row_a,'Length'] = 10000

df.to_pickle('ibu_course_summary_0607_M.pkl')

# Brezno Osrblie 1011 CP06 

df = pd.read_pickle('ibu_course_summary_1011_M.pkl')

row_b = np.where((df['Event'] == 'CP06'))[0][0]
df.loc[row_b,'Height Diff'] = 46.5
df.loc[row_b,'Max Climb'] = 19.5
df.loc[row_b,'Total Climb'] = 122.5

df.to_pickle('ibu_course_summary_1011_M.pkl')

# Brezno Osrblie 1617 CP06

df = pd.read_pickle('ibu_course_summary_1617_M.pkl')

row_c = np.where((df['Event'] == 'CP06'))[0][0]
df.loc[row_c, 'Height Diff'] = 46.8
df.loc[row_c, 'Max Climb'] = 25.07
df.loc[row_c, 'Total Climb'] = 381.38

df.to_pickle('ibu_course_summary_1617_M.pkl')

# Martell-Val Martello 1314 CP08 

df = pd.read_pickle('ibu_weather_summary_1314_M.pkl')

row_d = np.where((df['Event'] == 'CP08'))[0][0]

df.loc[row_d,'Snow Cond A'] = 'Hard Packed Variable'
df.loc[row_d,'Snow Cond B'] = 'Hard Packed Variable'

df.to_pickle('ibu_weather_summary_1314_M.pkl')

# 1617 weather problems

df = pd.read_pickle('ibu_weather_summary_1617_M.pkl')

row_e = np.where((df['Event'] == 'CP07'))[0][0]
df.loc[row_e, 'Weather A'] = 'Mostly Cloudy'
df.loc[row_e, 'Weather B'] = 'Mostly Cloudy'

row_f = np.where((df['Event'] == 'CP08') & (df['Race'] == 'SMSP'))[0][0]
df.loc[row_f, 'Snow Cond A'] = 'Packed Powder'
df.loc[row_f, 'Snow Cond B'] = 'Packed Powder'

row_g = np.where((df['Event'] == 'CP08') & (df['Race'] == 'SMSPS'))[0][0]
df.loc[row_g, 'Snow Cond A'] = 'Packed Powder'
df.loc[row_g, 'Snow Cond B'] = 'Packed Powder'

df.to_pickle('ibu_weather_summary_1617_M.pkl')

# 1718 weather problems

df = pd.read_pickle('ibu_weather_summary_1718_M.pkl')

row_h = np.where(df['Event'] == 'CP06')[0][0]
df.loc[row_h, 'Snow Cond A'] = 'Packed Powder'
df.loc[row_h, 'Snow Cond B'] = 'Packed Powder'

row_i = np.where(df['Event'] == 'CP07')[0][0]
df.loc[row_i, 'Weather A'] = 'Partly Cloudy'
df.loc[row_i,'Weather B'] = 'Partlly Cloudy'

row_j = np.where(df['Event'] == 'CP08')[0][0]
df.loc[row_j, 'Snow Cond A'] = 'Packed Powder'
df.loc[row_j, 'Snow Cond B'] = 'Packed Powder'

df.to_pickle('ibu_weather_summary_1718_M.pkl')


In [66]:
# Making a few alterations to snow and weather conditions due to inconsistancies in 
# using capitalization

seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415','1516',
           '1617','1718']

for season in seasons:
    filename = 'ibu_weather_summary_%(season)s_M.pkl' %{'season' : season}
    df = pd.read_pickle(filename)
    
    df.replace('Mostly cloudy','Mostly Cloudy', inplace = True)
    df.replace('Partly cloudy','Partly Cloudy', inplace = True)
    df.replace('Partly cloudy night','Partly Cloudy Night', inplace = True)
    
    df.replace('Packed powder','Packed Powder', inplace = True)
    df.replace('Wet and powder', 'Wet & Powder', inplace = True)
    df.replace('Wet & powder', 'Wet & Powder', inplace = True)
    df.replace('Wet and Powder', 'Wet & Powder', inplace = True)
    df.replace('Light snow','Light Snow', inplace = True)
    df.replace('Hard packed variable','Hard Packed Variable', inplace = True)
    df.replace('Hard packed','Hard Packed', inplace = True)
    df.replace('Spring conditions', 'Spring Conditions', inplace = True)
    df.replace('powder', 'Powder', inplace = True)


    df.to_pickle(filename)



# Exploring Relationships

Next we wish to explore the relationships between various course attributes and weather conditions and speed, shooting accuracy, etc on an individual basis. In particular, for each aspect of the total biathlon time (speed, prone shooting accuracy, standing shooting accuracy, prone shooting time, standing shooting time, prone penalty loop time, and standing penalty loop time), we wish to determine which of the predictor variables correlate most closely with that aspect. In doing this, we recognize that, due to the fact that we have fourteen different predictor variables, it is reasonable to expect that none of the correlation values will be particularly high.

Previous Section: [Adjustments to course and weather data](#Adjustments-to-course-and-weather-data)


Next Section: [Effect of Conditions on Speed](#Effects-of-conditions-on-speed)


[Table of Contents](#Table-of-Contents)




# Effects of conditions on speed

#### Objective
To successfully model the average speed and standard deviation of speed for a race as a function of course characteristics and conditions. 

#### Step 0 : Convert 'Total Ski' into seconds

I find the average ski time, and then use that to compute the average speed.  In order to do this, the first step is going to have to be to convert all of the ```Total Ski``` columns in these files from a minutes and seconds format to a seconds format.
1. ```convert_to_seconds```: This function takes a string of the form "h:mm:ss.s" or "mm:ss.s", converts the elapsed time into seconds, and returns that number as a float.
2. ```skitimes_to_seconds```: This function loops through all of the rows in a given competition analysis dataframe and replaces the string time entries for ```'Ski Time'``` with the time converted to seconds
3. ```compute_speeds```: This function takes as input a season and event, retrieves the length of the men's sprint race for that event from the course summary file, and calculates the speed (in m/s) for every racer in the competition. It then adds these values to the competition analysis file as a dataframe column, and returns the  file to pickle.
4. ```ibu_compute_speeds```: This function is analogous to ```compute_speeds```.
 
#### Step 1: Collect the data

The obvious first step here is to assemble the data. In theory, this has already been collected from the pdf pages and saved to my laptop, so what remains is to pull the pieces of it that I want and assemble it into a useable dataframe. Here, I should be able to use data all the way back to the 2004-2005 season, which will, hopefully, increase my ability to get an accurate model. There aren't actually any functions to describe here, just a lot of finding means and then putting the results into a dataframe.

#### Step 2 : Add weather and snow condition data
The next step is to add information about the weather and snow conditions to our dataframe, and then to add information about the course layout to our dataframe.
1. ```add_course_data```: This function adds data about the course to the output of a dataframe containing \*-response variable data
2. ```add_weather_data```: This function adds data about the weather conditions to the output of a dataframe containing \*-response variable and course data.

#### Step 3: Quantifying snow and weather conditions

In order to explore effects of each predictor variable on the response variables, it seemed useful to be able to treat all the variables, including the categorical variables, as quantitative variables. Of our four categorical variables, two, season and event, are relatively easy to quantify because their values are temporal. The other two are less easy to quantify, however. On the one hand, it seems clear that the condition of the snow will have an effect on the speed at which the racers are able to ski. On the other hand, I don't have, a priori, a very good sense of what that effect might be. Similarly, it seems likely that weather conditions might effect shooting accuracy, but again, it is unclear from just looking at the possible values taken by weather what exactly those effects might be. Ideally we would be able to rank all of the values for snow conditions and all of the values for weather from most difficult to least difficult and we would be able to use those rankings to quantify the variables.

The question then becomes how to go about quantifying these categorical variables. My initial approach was to search for descriptions of snow conditions on skiing sites, looking in particular at those that gave an _ease of skiing score_ for each of their listed snow conditions. However, I ran into problems because the snow condition values that I had did not always match the snow condition values available on the sites, because they quite often gave ranges of values for ease of skiing, and because the sites were tailored for Alpine rather than Nordic skiers. Furthermore, there was nothing that was even remotely useful of that nature for weather condition data. My next approach was to simply try to group snow conditions (and weather conditions) based on which variables seemed to be most similar to me. Here the biggest problem is that I didn't know enough to have a sense of what differences are important and which are unimportant.

Ultimately, I decided that the most straightforward approach was to simply use the performance data that I had to attempt to rank values taken by weather and by snow conditions. The first thing to acknowledge here is that this risks being a bit like the tail wagging the dog. In other words, by using the performance data that I have to rank the values and then using the rankings to make predictions about performance data, I risk demonstrating precisely nothing. As a result, I took a couple of precautions:
1. Since my ultimate was to make predictions about the total race times for races in the last four seasons of my available data (from 2014-15 to 2017-18), I used only performance data from World Cup races in the first ten seasons of available data (from 2004-05 to 2013-14) to rank the values. This meant that there were a few values for both snow conditions and weather that appeared only in later races and/or ibu races. In these cases, I decided which of the values that I did have scores for they were the most like and assigned the same scores to the unranked values.
2. I based the ranking of snow condition and weather values entirely on speed performance, and decided that, if the quantitative versions of snow conditions and weather had no impact on any other aspect, that I would eliminate them and do the best that I could with the categorical variables as categorical variables. (In fact, though their quantification is based entirely on speed, they end up on the list of most significant variables for all seven aspects of the total time.)

In order to quantify the categorical variable snow conditions, I did the following:
1. For each value taken by snow conditions during the seasons 2004-05 to 2013-14, I calculated the mean of the average speeds.
2. I ranked them in order from fastest average to slowest average. 
3. The fastest average was given a score of 10; the slowest was given a score of 0. 
4. The range of the averages was divided into 10 equal pieces, and each remaining value was assigned a score (from 0 to 9) based on which piece of the range it was in.
5. Any values for snow conditions that were not found in the first ten seasons of world cup data were added manually by deciding which of the known conditions they were most like.

The quantification of the categorical variable weather proceeded in much the same way.

Once these quantifications were complete, the variables ```quant_snow``` and ```quant_weather``` were added to the speed dataframe.

#### Step 4: Exploring correlation and impact

The other thing that I want to do here is to compute the correlation coefficients and total changes explained for each of the quantitative predictor variables.

1. ```corr_and_change```: Given a predictor variable x and a response variable y in a dataframe df, this function determines the strength of the linear dependance of y on x and what fraction of the total change in y is predicted based on the range of values taken by x.
2. ```collect_best_variables```: This function takes a dataframe produced by ```corr_and_change``` and chooses the predictor variables with the highest coefficients of determination and the highest percent change predicted (in both cases, with values at least half as large as the maximal values)

\*-response: Because these functions (```add_course_data``` and ```add_weather_data```) are used to add condition data to dataframes containing speed, accuracy, \*-response variable indicates that a variety of possible response variables can by entered in this part of the description.

Previous Section: [Exploring Relationships](#Exploring-Relationships)


Next Section: [Evaluating Impacts on Speed](#Evaluating-Impacts-on-Speed)


[Table of Contents](#Table-of-Contents)


#### Step 0


In [67]:
"""
Function
--------

convert_to_seconds : takes a time string of the form "h:mm:ss.s" or "mm:ss.s" and returns 
                     the total number of seconds as a float

Parameters
----------

string : a time string of the form "h:mm:ss.s" or "mm:ss.s"

Returns
-------

seconds : the elapsed time given by the time string in seconds

Example
-------

"""

def convert_to_seconds(string):
    
    try:
        split_string = string.split(":")
    
        if len(split_string) == 1:
            try:
                seconds = float(string)
            except:
                seconds = string
        elif len(split_string) == 2:
            seconds = float(split_string[0]) * 60 + float(split_string[1])
        elif len(split_string) == 3:
            seconds = (float(split_string[0]) * 3600 + float(split_string[1]) * 60
                       + float(split_string[2]))
        else:
            seconds = string
    except:
        seconds = string
        
    return seconds

In [68]:
"""
Function
--------

skitimes_to_seconds : converts all of the ski times for a given competition to seconds

Parameters
----------

df : a competition analysis dataframe

Returns
-------

modifies the given dataframe in place

Example
-------

"""

def skitimes_to_seconds(df):
    
    for i in df.index.tolist():
        df.loc[i,'Total Ski'] = convert_to_seconds(df.loc[i,'Total Ski'])
        
    return df

In [69]:
seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617','1718']
events = ['CP01','CP02','CP03','CP04','CP05','CP06','CP07','CP08','CP09','CH__','OG__']

for season in seasons:
    for event in events:
        filename = ('companal_SMSP_%(season)s_%(event)s.pkl' 
                    %{'season' : season, 'event' : event})
        
        try:
            race_data = pd.read_pickle(filename)
            race_data_new = skitimes_to_seconds(race_data)
            race_data_new.to_pickle(filename)
        except: # the race does not exist or does not have data
            pass

In [70]:
"""
Function
--------

compute_speeds : takes as input a season and event, retrieves the length of the men's sprint
                 race for that event from the course summary file, then calculates speed 
                 (in m/s) for every racer in the competition (length/time), adds them to 
                 the companal file as a dataframe column, and returns the companal file to 
                 pickle

Parameters
----------

season : the years of the biathlon season, in form y1y2 where y1 is the last 2 digits of the
        first year and y2 is the last two digits of the second year
event : a four character code specifying the event. Possibilities are “CP01”, “CP02”, . . .,
        “CP09”, “OG__”,”CH__”

Returns
-------

stores a pickle file to the hard drive

Example
-------

"""

def compute_speeds(season, event):
    
    course_filename = "course_summary_%(season)s_M.pkl" %{'season': season}
    course_data = pd.read_pickle(course_filename)
    course_data.set_index('Event', drop=True, inplace=True)

    filename = "companal_SMSP_%(season)s_%(event)s.pkl" %{'season' : season, 'event' : event}
    
    race_data = pd.read_pickle(filename)
    length = float(course_data.loc[event, 'Length'])
    for i in range(len(race_data)):
        race_data.loc[i, 'Speed'] = length/race_data.loc[i,'Total Ski']
                
    race_data.to_pickle(filename)

    

In [71]:
# Adding speed columns to our event dataframes

seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617','1718']
events = ['CP01','CP02','CP03','CP04','CP05','CP06','CP07','CP08','CP09','CH__','OG__']

failures = []

for season in seasons:
    for event in events:
        try:
            compute_speeds(season, event)
        except:
            failures.append([season, event])

#### And adding speeds for the IBU cup races



In [72]:
seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617','1718']
events = ['CP01','CP02','CP03','CP04','CP05','CP06','CP07','CP08','CP09','CH__','OG__']

for season in seasons:
    for event in events:
        filename = 'ibu_SMSP_%(season)s_%(event)s.pkl' %{'season' : season, 'event' : event}
        
        try:
            race_data = pd.read_pickle(filename)
            race_data_new = skitimes_to_seconds(race_data)
            race_data_new.to_pickle(filename)
        except: # the race does not exist or does not have data
            pass
        
        filename = 'ibu_SMSPS_%(season)s_%(event)s.pkl' %{'season' : season, 'event' : event}
        
        try:
            race_data = pd.read_pickle(filename)
            race_data_new = skitimes_to_seconds(race_data)
            race_data_new.to_pickle(filename)
        except: # the race does not exist or does not have data
            pass

In [73]:
"""
Function
--------

ibu_compute_speeds : takes as input a season and event, retrieves the length of the men's
                     sprint race for that event from the course summary file, then calculates
                     speed (in m/s) for every racer in the competition (length/time), adds
                     them to the ibu competition analysis file as a dataframe column, and 
                     returns the competition analysis file to pickle

Parameters
----------

season : the years of the biathlon season, in form y1y2 where y1 is the last 2 digits of the
        first year and y2 is the last two digits of the second year
event : a four character code specifying the event. Possibilities are “CP01”, “CP02”, . . .,
        “CP09”, “OG__”,”CH__”

Returns
-------

stores a pickle file to the hard drive

Example
-------


"""

def ibu_compute_speeds(season, event):
    
    course_filename = "ibu_course_summary_%(season)s_M.pkl" %{'season': season}
    course_data = pd.read_pickle(course_filename)
    course_data.set_index('Event', drop=True, inplace=True)

    filename = "ibu_SMSP_%(season)s_%(event)s.pkl" %{'season' : season, 'event' : event}
    filenameS = "ibu_SMSPS_%(season)s_%(event)s.pkl" %{'season' : season, 'event' : event}
    
    try:
        race_data = pd.read_pickle(filename)
        length = float(course_data.loc[event, 'Length'])
        for i in range(len(race_data)):
            race_data.loc[i, 'Speed'] = length/race_data.loc[i,'Total Ski']
    
        race_data.to_pickle(filename)
    except:
        pass
    
    try:
        race_data = pd.read_pickle(filenameS)
        length = float(course_data.loc[event, 'Length'])
        for i in range(len(race_data)):
            race_data.loc[i, 'Speed'] = length/race_data.loc[i,'Total Ski']
    
        race_data.to_pickle(filenameS)
    except:
        pass
    

In [74]:
# Adding speed columns to our event dataframes

seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415'
           ,'1516','1617','1718']
events = ['CP01','CP02','CP03','CP04','CP05','CP06','CP07','CP08','CH__']

failures = []

for season in seasons:
    for event in events:
        try:
            ibu_compute_speeds(season, event)
        except:
            failures.append([season, event])

#### Step 1: Collect the data

The obvious first step here is to assemble the data. In theory, this has already been collected from the pdf pages and saved to my laptop, so what remains is to pull the pieces of it that I want and assemble it into a useable dataframe. Here, I should be able to use data all the way back to the 2004-2005 season, which will, hopefully, increase my ability to get an accurate model.



In [75]:
# Building a dataframe of mean speed and standard deviation

seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617','1718']
events = ['CP01','CP02','CP03','CP04','CP05','CP06','CP07','CP08','CP09','CH__','OG__']

race_times = []
failures = []

for season in seasons:
    course_filename = "course_summary_%(season)s_M.pkl" %{'season': season}
    course_data = pd.read_pickle(course_filename)
    course_data.set_index('Event', drop=True, inplace=True)

    for event in events:
        filename = ("companal_SMSP_%(season)s_%(event)s.pkl" 
                    %{'season' : season, 'event' : event})
        
        try:
            race_data = pd.read_pickle(filename)
            if len(race_data) > 50:
                length = float(course_data.loc[event, 'Length'])
                race_info = [season, event, length/np.mean(race_data['Total Ski']), 
                             np.std(race_data['Speed'])]
                race_times.append(race_info)
            else:
                failures.append([season,event])
        except: # This race has no companal file
            failures.append([season,event])          
mean_mens_sprint_speed = pd.DataFrame(race_times, 
                                      columns = ['Year','Event','Mean Speed', 'StDev Speed'])




#### Step 2 : Add weather and snow condition data
The next step is to add information about the weather and snow conditions to our dataframe, and then to add information about the course layout to our dataframe.



In [76]:
"""
Function
--------

add_course_data : adds data about the course to the output of a dataframe containing speed
                  data

Parameters
----------

df : a dataframe containing biathlon speed data
row : the dataframe row that is under consideration (a particular competition)

Returns
-------

df : the original dataframe with course information added for the relevant row

Example
-------

"""

def add_course_data(df, row):
    
    # Get the identifying info for the race
    season = df.loc[row,'Year']
    event = df.loc[row,'Event']

    # Figure out the filename where the relevant course data is stored
    filename = "course_summary_%(season)s_M.pkl" %{'season': season}

    # Load that file
    
    course_data = pd.read_pickle(filename)
    
    # Figure out which row in that filename corresponds to your row
    
    race_row = -1
    
    for i in range(len(course_data)):
        if (course_data.loc[i,'Event'] == event) & (course_data.loc[i,'Race'] == 'SMSP'):
            race_row = i
        
    # Next, grab the data and put it in the file
    
    if race_row > -1:
        df.loc[row,'Length'] = float(course_data.loc[race_row,'Length'])
        df.loc[row,'Height Diff'] = float(course_data.loc[race_row,'Height Diff'])
        df.loc[row,'Max Climb'] = float(course_data.loc[race_row,'Max Climb'])
        df.loc[row,'Total Climb'] = float(course_data.loc[race_row,'Total Climb'])
        df.loc[row, 'Altitude'] = float(course_data.loc[race_row,'Altitude'])
        df.loc[row, 'Quant Year'] = float(course_data.loc[race_row, 'Quant Year'])
        df.loc[row, 'Quant Event'] = float(course_data.loc[race_row, 'Quant Event'])
        
    else:
        df.loc[row,'Length'] = np.nan
        df.loc[row,'Height Diff'] = np.nan
        df.loc[row,'Max Climb'] = np.nan
        df.loc[row,'Total Climb'] = np.nan
        df.loc[row,'Altitude'] = np.nan
        df.loc[row, 'Quant Year'] = np.nan
        df.loc[row, 'Quant Event'] = np.nan
        
    return df


In [77]:
for row in range(len(mean_mens_sprint_speed)):
    mean_mens_sprint_speed = add_course_data(mean_mens_sprint_speed, row)

In [78]:
"""
Function
--------

add_weather_data : adds data about the weather conditions to the output of a dataframe
                   containing speed and course data

Parameters
----------

df : a dataframe containing biathlon speed  and course data
row : the dataframe row that is under consideration (a particular competition)

Returns
-------

df : the original dataframe with weather information added to the relevant row

Example
-------

"""

def add_weather_data(df, row):
    
    # Get the identifying info for the race
    
    season = df.loc[row,'Year']
    event = df.loc[row,'Event']

    # Figure out the filename where the relevant course data is stored
    
    filename = "weather_summary_%(season)s_M.pkl" %{'season': season}

    # Load that file
    
    weather_data = pd.read_pickle(filename)
    
    # Figure out which row in that filename corresponds to your row
    
    race_row = -1
    for i in range(len(weather_data)):
        if (weather_data.loc[i,'Event'] == event) & (weather_data.loc[i,'Race'] == 'SMSP'):
            race_row = i
     
    # Next, grab the data and put it in the file
    
    if race_row > -1:
        df.loc[row,'Weather'] = weather_data.loc[race_row,'Weather C']
        df.loc[row,'Snow Cond'] = weather_data.loc[race_row,'Snow Cond C']
        df.loc[row,'Snow Temp'] = euro_to_float(weather_data.loc[race_row,'Snow Temp C'])
        df.loc[row,'Air Temp'] = euro_to_float(weather_data.loc[race_row,'Air Temp C'])
        df.loc[row,'Humidity'] = euro_to_float(weather_data.loc[race_row,'Humidity C'])
        df.loc[row,'Wind'] = euro_to_float(weather_data.loc[race_row,'Wind C'])



        
    else:
        df.loc[row,'Weather'] = np.nan
        df.loc[row,'Snow Cond'] = np.nan
        df.loc[row,'Snow Temp'] = np.nan
        df.loc[row,'Air Temp'] = np.nan
        df.loc[row,'Humidity'] = np.nan
        df.loc[row,'Wind'] = np.nan
        
    return df

In [79]:
for row in range(len(mean_mens_sprint_speed)):
    mean_mens_sprint_speed = add_weather_data(mean_mens_sprint_speed, row)
    



#### Quantifying snow and weather conditions

While most of the data that I have here is quantitative, some of it is categorical. For the categorical data, I would like to introduce some sort of quantification. In the case of season and event, this was quite simple, as the seasons have a natural ordering, and then events within the seasons also have a natural ordering. For the snow conditions and weather this is somewhat more difficult. While it seems obvious that snow conditions will have an effect on speed, it isn't obvious to me (as a non cross country skier) how this is going to play out with respect the the descriptions that are given in the files. As a result, I've decided to assign values between 0 and 10 to each of the possible weather and snow condition values. To do this, I'm going to compute the average speeds associated to each of the different snow conditions and weather conditions over the first 10 seasons of my data. The fastest weather and snow conditions will be assigned a value of 10. The slowest will be assigned a value of 0. The difference between them will be divided in tenths, and each remaining condition will be assigned the value that corresponds to it's slice in the distribution. Because I ultimately want to use this to try to make predictions about more recent seasons, I'm going to restrict the data that I consider here to the first 10 seasons for which I have data (which leaves 4 seasons untouched here).



In [80]:
predictor_data = mean_mens_sprint_speed.iloc[:96]

weather_averages = []

for weather in set(predictor_data['Weather']):
    weather_data = predictor_data['Weather'] == weather
    weather_averages.append([weather,np.mean(predictor_data[weather_data]['Mean Speed'])])
    
weather_averages = pd.DataFrame(weather_averages, columns = ['Condition', 'Speed'])
weather_averages.set_index('Condition', inplace = True)

max_weather = max(weather_averages['Speed'])
min_weather = min(weather_averages['Speed'])
step = (max_weather-min_weather)/10

for weather in weather_averages.index.tolist():
    weather_averages.loc[weather, 'Weather Quant'] = int((weather_averages.loc[weather,'Speed']
                                                          -min_weather)/step)
    
weather_averages

Unnamed: 0_level_0,Speed,Weather Quant
Condition,Unnamed: 1_level_1,Unnamed: 2_level_1
Partly Cloudy Night,7.223154,10.0
Mostly Cloudy,6.868781,5.0
Partly Cloudy,6.880993,6.0
Clear,6.836928,5.0
Sunny,6.715308,4.0
Snow,6.611454,3.0
Cloudy,6.939226,6.0
Sky Clear,6.867468,5.0
Fog,6.700835,4.0
Rain,6.430744,0.0


Problem : 'Mostly Cloudy Night' appears in later data, but not in the 10 seasons that we have of World Cup data. That means that I'll have to assign it a value. I'm going to assume that it's most like Partly Cloudy Night, and assign it a value of 10. 'Light Snowfall' is also not in the first 10 seasons. This seems to be clearly very similar to 'Light Snow', and I'll assign it a value accordingly

Similarly, there are conditions that appear in the IBU race data, that don't appear here. Again, I'll assign values based on which condition with a computed value it seems to be closest to.


In [81]:
weather_averages.loc['Mostly Cloudy Night', 'Weather Quant'] = 10.0
weather_averages.loc['Light Snowfall', 'Weather Quant'] = 4.0
weather_averages.loc['Rain Snow', 'Weather Quant'] = 3
weather_averages.loc['Windy', 'Weather Quant'] = 0

In [82]:
predictor_data = mean_mens_sprint_speed.iloc[:96]

snow_averages = []

for snow in set(predictor_data['Snow Cond']):
    snow_data = predictor_data['Snow Cond'] == snow
    snow_averages.append([snow,np.mean(predictor_data[snow_data]['Mean Speed'])])
    
snow_averages = pd.DataFrame(snow_averages, columns = ['Condition', 'Speed'])
snow_averages.set_index('Condition', inplace = True)

max_snow = max(snow_averages['Speed'])
min_snow = min(snow_averages['Speed'])
step = (max_snow-min_snow)/10

for snow in snow_averages.index.tolist():
    snow_averages.loc[snow, 'Snow Quant'] = int((snow_averages.loc[snow,'Speed']
                                                 - min_snow)/step)
    
snow_averages

Unnamed: 0_level_0,Speed,Snow Quant
Condition,Unnamed: 1_level_1,Unnamed: 2_level_1
Compact,7.141269,10.0
Wet & Powder,6.063496,0.0
Packed Powder,6.736725,6.0
Hard Packed,6.942572,8.0
Hard,6.744447,6.0
Granular,6.724335,6.0
Powder,6.583711,4.0
Wet,6.712424,6.0
Fresh,6.641513,5.0
Packed,6.933795,8.0


In [83]:
snow_averages.loc['Spring Conditions', 'Snow Quant'] = 6
snow_averages.loc['Powder', 'Snow Quant'] = 4
snow_averages.loc['Soft', 'Snow Quant'] = 5

In [84]:
weather_averages.to_pickle('weather_averages.pkl')
snow_averages.to_pickle('snow_averages.pkl')

In [85]:
for i in range(len(mean_mens_sprint_speed)):
    mean_mens_sprint_speed.loc[i,'Quant Weather'] = weather_averages.loc[
                                mean_mens_sprint_speed.loc[i, 'Weather'], 'Weather Quant']
    mean_mens_sprint_speed.loc[i, 'Quant Snow'] = snow_averages.loc[
                                mean_mens_sprint_speed.loc[i, 'Snow Cond'], 'Snow Quant']

In [86]:
# And pickle it

mean_mens_sprint_speed.to_pickle("mean_mens_sprint_speed.pkl")


#### Step 3: Exploring correlation and impact

The other thing that I want to do here is to compute the correlation coefficients and total changes explained for each of the quantitative predictor variables.


In [8]:
"""
Function
--------

corr_and_change : for a given pair of variables x and y in a dataframe df, determines the
                  strength of their linear relationship and how much of a change in y is
                  predicted based on the range given for x

Parameters
----------

x : the column to be treated as the predictor variable
y : the column to be treated as the response variables
df : the dataframe containing the data

Returns
-------

[r, r_sq, change] : a list containing the correlation coefficient, the coefficient of 
                    determination, and the percent of the total change in y  predicted 
                    over the given range of x by the best fit line through the data

Example
-------

"""

def corr_and_change(x, y, df):
    
    correlation_output = stats.linregress(df[x],df[y])
    
    r = correlation_output[2]
    r_sq = correlation_output[2]**2
    
    max_x = max(df[x])
    min_x = min(df[x])
    
    change = max_x * correlation_output[0] - min_x * correlation_output[0]
    total_change = max(df[y]) - min(df[y])
    change = abs(change)/total_change

    return [r, r_sq, change]

In [88]:
corr_and_changes_speed = []

x_values = ['Length', 'Height Diff', 'Max Climb', 'Total Climb',  'Snow Temp', 'Air Temp', 
            'Humidity', 'Wind', 'Altitude', 'Quant Year', 'Quant Event', 'Quant Weather',
            'Quant Snow']

y_values = ['Mean Speed']

for y in y_values:
    for x in x_values:
        row_data = [y,x]
        row_data.extend(corr_and_change(x,y,mean_mens_sprint_speed.iloc[:96]))
        corr_and_changes_speed.append(row_data)
        
corr_and_changes_speed = pd.DataFrame(corr_and_changes_speed, columns = ['Response',
                     'Predictor','Correlation', 'Determination','Percent Change Predicted'])

corr_and_changes_speed

Unnamed: 0,Response,Predictor,Correlation,Determination,Percent Change Predicted
0,Mean Speed,Length,0.076656,0.005876,0.074319
1,Mean Speed,Height Diff,-0.191723,0.036758,0.177647
2,Mean Speed,Max Climb,-0.095647,0.009148,0.104154
3,Mean Speed,Total Climb,-0.275609,0.07596,0.227985
4,Mean Speed,Snow Temp,-0.006824,4.7e-05,0.008645
5,Mean Speed,Air Temp,-0.077797,0.006052,0.096905
6,Mean Speed,Humidity,-0.208916,0.043646,0.177597
7,Mean Speed,Wind,-0.109666,0.012027,0.133893
8,Mean Speed,Altitude,-0.162561,0.026426,0.117719
9,Mean Speed,Quant Year,0.231685,0.053678,0.15973


From here, I want to pluck off the variables that do the best job of predicting...


In [89]:
"""
Function
--------

collect_best_variables : takes a dataframe produced by corr_and_change and chooses the
                         predictor variables with the highest coefficients of determination
                         and the highest percent change predicted (in both cases, with 
                         values at least half as large as the maximal values)
Parameters
----------

df : the dataframe containing the data

Returns
-------

best_variables : a list of the variables that seem to best predict the response variable

Example
-------

"""

def collect_best_variables(df):
    
    best_variables = []
    
    df.sort_values('Determination', inplace = True, ascending = False)
    
    max_deter = df.iloc[0,3]
    for i in range(len(df)):
        if df.iloc[i,3] >= 0.5*max_deter:
            best_variables.append(df.iloc[i,1])
            
    df.sort_values('Percent Change Predicted', inplace = True, ascending = False)
    
    max_change = df.iloc[0,4]
    for i in range(len(df)):
        if df.iloc[i,4] >= 0.5*max_change:
            best_variables.append(df.iloc[i,1])

    best_variables = list(set(best_variables))
    
    return best_variables

In [90]:
best_variables_speed = collect_best_variables(corr_and_changes_speed)


# Evaluating Impacts on Speed

We observe here that the variables that do the best jobs individually of predicting speed (assuming that none of the other variables are available for consideration) are Quant(itative) Snow which has an $r$ value of 0.452142 and which explains roughly 60% of the total change in speed, and Quant(itative) Weather which has an $r$ value of 0.401326 and which explains roughly 54% of the total change in speed. There are, of course, some caveats.
1. We defined the variables Quant(itative) Snow and Quant(itative) Weather by looking at the average speeds associated with each value taken by the variables Weather and Snow Conditions. While it is clear that these assignments did not yield perfect alignment (since if they did, we would expect the $r$ values to be much closer to 1 than they are), it is not surprising that that they did well relative to the other variables.
2. The variables quant_snow in particular likely has some dependence on the other weather related variables, in particular on air temperature and snow temperature.

If we eliminate those two variables from consideration, then our best variables are Total Climb with an $r$ value of -0.275609, Humidity with and $r$ value of -0.208916, and Height Diff(erential) with an $r$ value of -0.191723. Note that only the first of these values is even half as large as the values for Quant(itative) Snow and Quant(itative) Weather.

Previous Section: [Effect of Conditions on Speed](#Effects-of-conditions-on-speed)


Next Section: [Effect of Conditions on Shooting Accuracy](#Effects-of-conditions-on-shooting-accuracy)


[Table of Contents](#Table-of-Contents)

# Effects of conditions on shooting accuracy

#### Objective
To understand the impact of individual course and weather condition variables on prone and standing shooting accuracy.

#### Step 1 : Collecting the data

Because I want to include a certain amount of uncertainty in this shooting data, I'm going to use bootstrap sampling to collect 100 samples (with replacement) from the shooting data for each race. For each of these, I'm going to calculate a mean, and I'm going to use these means to estimate the parameters (mean and standard deviation) of the prone and standing shooting accuracy for each race.

1. ```bootstrap_sample_dist```: This function takes a competition analysis dataframe and the name of a column containing shooting data, then draws  100 bootstrap samples and finds their means. The function then converts these values from an average number of missed shots to a percentage of accurate shots. Finally, it returns the mean and standard deviation of this list.

Now, we loop through all the World Cup men's sprint races, using ```bootstrap_sample_dist``` to estimate means and standard deviations for both prone and standing shooting. We then create a dataframe with the following columns: Year, Event, Prone Accuracy, Prone Dev, Standing Accuracy, and Standing Dev. As above, we use ```add_course_data``` and ```add_weather_data``` to add information about course conditions for each event, and then use the values we found when quantifying the snow conditions and weather to add Quant Snow and Quant Weather values to our dataframe.

#### Step 2: Correlations

In order to compare the effects of the various predictor variables on prone shooting we loop through all quantitative predictors (including the quantified versions of the categorical variables), applying ```corr_and_change``` each time. The results are then stored in a dataframe. We repeat this to explore the effects of the various predictor variables on standing shooting. Finally, we use the function ```collect_best_variables``` on each of the resulting dataframes to select which of the predictor variables seem to be most closely tied to prone shooting accuracy and standing shooting accuracy.

Previous Section: [Evaluating Impacts on Speed](#Evaluating-Impacts-on-Speed)

Next Section: [Evaluating Impacts on Shooting Accuracy](#Evaluating-Impacts-on-Accuracy)


[Table of Contents](#Table-of-Contents)



#### Step 1 : Collecting the data


In [92]:
"""
Function
--------

bootstrap_sample_dist : takes a companal dataframe and column name, then draws and finds
                        the means of 100 bootstrap samples and returns the mean and standard
                        deviation of the list of sample means

Parameters
----------

df : a competition analysis dataframe
shooting : the code for a column containing shot data ('P1' or 'S1')

Returns
-------

average : the mean of the bootstrap sample means
stdev : the standard deviation of the bootstrap sample means

Example
-------

"""

def bootstrap_sample_dist(df, shooting):
    
    bootstrap_means = []
    
    for i in range(100):
        bootstrap_sample = np.random.choice((df[shooting].astype(float)), len(df))
        bootstrap_means.append((5 - np.mean(bootstrap_sample))/5)
        
    average = np.mean(bootstrap_means)
    stdev = np.std(bootstrap_means, ddof = 1)
    
    return average, stdev

In [93]:
# Collecting the shooting data into a single dataframe

shooting_accuracy = []

seasons = ['0405','0506','0607','0708','0809','0910', '1011','1112','1213','1314','1415',
           '1516','1617','1718']
events = ['CP01','CP02','CP03','CP04','CP05','CP06','CP07','CP08','CP09','CH__','OG__']

for season in seasons:
    for event in events:
        filename = ("companal_SMSP_%(season)s_%(event)s.pkl" 
                    %{'season' : season, 'event' : event})
        
        try:
            race_data = pd.read_pickle(filename)
            if len(race_data) > 50:
                shooting = [season, event]
                shooting.extend(bootstrap_sample_dist(race_data, 'P1'))
                shooting.extend(bootstrap_sample_dist(race_data, 'S1'))
                shooting_accuracy.append(shooting)
            
        except: # there is no file for this race
            pass
            
shooting_accuracy = pd.DataFrame(shooting_accuracy, columns = ['Year','Event', 
                        'Prone Accuracy', 'Prone Dev', 'Standing Accuracy','Standing Dev']) 

Now, I want to add the weather and course data to this. 

In [94]:
for row in range(len(shooting_accuracy)):
    shooting_accuracy = add_course_data(shooting_accuracy, row)
    
for row in range(len(shooting_accuracy)):
    shooting_accuracy = add_weather_data(shooting_accuracy, row)


In [95]:
for i in range(len(mean_mens_sprint_speed)):
    shooting_accuracy.loc[i,'Quant Weather'] = weather_averages.loc[
        shooting_accuracy.loc[i, 'Weather'], 'Weather Quant']
    shooting_accuracy.loc[i, 'Quant Snow'] = snow_averages.loc[
        shooting_accuracy.loc[i, 'Snow Cond'], 'Snow Quant']

In [96]:
shooting_accuracy.to_pickle('shooting_accuracy.pkl')


#### Step 2: Correlation



In [97]:
corr_and_changes_prone_accuracy = []

x_values = ['Length', 'Height Diff', 'Max Climb', 'Total Climb',  'Snow Temp', 'Air Temp', 
            'Humidity', 'Wind', 'Altitude', 'Quant Year', 'Quant Event', 'Quant Weather', 
            'Quant Snow']

y_values = ['Prone Accuracy']

for y in y_values:
    for x in x_values:
        row_data = [y,x]
        row_data.extend(corr_and_change(x,y,shooting_accuracy.iloc[:96]))
        corr_and_changes_prone_accuracy.append(row_data)
        
corr_and_changes_prone_accuracy = pd.DataFrame(corr_and_changes_prone_accuracy, 
                                         columns = ['Response','Predictor','Correlation',
                                                'Determination','Percent Change Predicted'])


corr_and_changes_prone_accuracy

Unnamed: 0,Response,Predictor,Correlation,Determination,Percent Change Predicted
0,Prone Accuracy,Length,-0.181564,0.032965,0.154681
1,Prone Accuracy,Height Diff,-0.081244,0.006601,0.06615
2,Prone Accuracy,Max Climb,-0.298844,0.089308,0.285957
3,Prone Accuracy,Total Climb,-0.058446,0.003416,0.042483
4,Prone Accuracy,Snow Temp,-0.059271,0.003513,0.065983
5,Prone Accuracy,Air Temp,-0.226165,0.051151,0.24755
6,Prone Accuracy,Humidity,-0.171227,0.029319,0.127906
7,Prone Accuracy,Wind,-0.249846,0.062423,0.268048
8,Prone Accuracy,Altitude,-0.094331,0.008898,0.060026
9,Prone Accuracy,Quant Year,0.305661,0.093428,0.185176


In [98]:
corr_and_changes_standing_accuracy = []

x_values = ['Length', 'Height Diff', 'Max Climb', 'Total Climb',  'Snow Temp', 'Air Temp', 
            'Humidity', 'Wind', 'Altitude', 'Quant Year', 'Quant Event', 'Quant Weather', 
            'Quant Snow']

y_values = ['Standing Accuracy']

for y in y_values:
    for x in x_values:
        row_data = [y,x]
        row_data.extend(corr_and_change(x,y,shooting_accuracy.iloc[:96]))
        corr_and_changes_standing_accuracy.append(row_data)
        
corr_and_changes_standing_accuracy = pd.DataFrame(corr_and_changes_standing_accuracy, 
                                         columns = ['Response','Predictor','Correlation',
                                                 'Determination','Percent Change Predicted'])


corr_and_changes_standing_accuracy

Unnamed: 0,Response,Predictor,Correlation,Determination,Percent Change Predicted
0,Standing Accuracy,Length,-0.230931,0.053329,0.143778
1,Standing Accuracy,Height Diff,-0.077414,0.005993,0.046064
2,Standing Accuracy,Max Climb,-0.209144,0.043741,0.146253
3,Standing Accuracy,Total Climb,-0.116389,0.013546,0.061827
4,Standing Accuracy,Snow Temp,0.039142,0.001532,0.031844
5,Standing Accuracy,Air Temp,0.05272,0.002779,0.042171
6,Standing Accuracy,Humidity,-0.129259,0.016708,0.070564
7,Standing Accuracy,Wind,-0.411383,0.169236,0.322544
8,Standing Accuracy,Altitude,-0.101569,0.010316,0.047233
9,Standing Accuracy,Quant Year,0.128357,0.016476,0.056829


In [99]:
best_variables_prone_accuracy = collect_best_variables(corr_and_changes_prone_accuracy)

In [100]:
best_variables_standing_accuracy = collect_best_variables(corr_and_changes_standing_accuracy)


# Evaluating Impacts on Accuracy

Beginning with the impacts of our predictor variables on prone accuracy, we find that the variables with the greatest impact are
- Quant Weather, which has an $r$ value of 0.499915 and which explains roughly 59% of the total change in prone accuracy,
- Max Climb, which has an $r$ value of -0.298844 and which explains roughly 29% of the total change in prone accuracy,
- Wind, which has an $r$ value of -0.249846 and which explains roughly 27% of the total change in prone accuracy, and
- Air Temp, which has an $r$ value of -0.226165 and which explains roughly 25% of the total change in prone accuracy.

In considering the impacts of our predictor variables on standing accuracy, we find that the variables with the greatest impact are
- Quant Weather, which has an $r$ value of 0.459026 and which explains roughly 39% of the total change in standing accuracy, and 
- Wind, which has an $r$ value of -0.411383 and which explains roughly 32% of the total change in standing accuracy.

One observation here is that, despite the fact that shooting accuracy was not considered in any way when assigning numeric values to the different values taken by Weather, Quant Weather remains the variable that is the single best predictor of prone accuracy by a significant amount, and the best predictor of standing accuracy, though it is not significantly larger than Wind in this case.

Previous Section: [Effect of Conditions on Shooting Accuracy](#Effects-of-conditions-on-shooting-accuracy)


Next Section:[Effect of Conditions on Range and Penalty Times](#Effects-of-conditions-on-range-and-penalty-times)


[Table of Contents](#Table-of-Contents)



# Effects of conditions on range and penalty times

#### Objective
To investigate the effect of race conditions on average range times and penalty loop times.

#### Step 0 : Calculate prone and standing range + penalty times in seconds

Due to the fact that before the 2011-2012 season there was a single value given for each racer that encompassed both range and penalty times, whereas the two are split from the 2011-2012 season on, I combine range and penalty time into a single value, in seconds, for each of prone and standing shooting.

1. ```rangetimes_to_seconds```: This function takes a competition analysis dataframe and uses ```convert_to_seconds``` to convert the prone and standing range times (for seasons through 1011) or the sum of the prone and standing range and penalty times (for seasons after 1112) into seconds. It records these values in the new columns Prone Range and Standing Range in the original dataframe, and then returns that dataframe.
2. ```ibu_rangetimes_to_seconds```: This function is analogous to ```rangetimes_to_seconds```, with the main difference being that, since penalty times were not recorded separately for ibu races until the 2014-15 season, it converts the range times alone until the beginning of the 2014-15 season.



#### Step 1: Collect the data

Next, we wish to find values for four different things. The values that we have found above for prone range time and standing range time include all the time elapsed from the point when the racer enters the shooting area until the point where the racer leaves the penalty zone. During that time, the racer
- skis to the shooting location (a short distance for standing shooting, a somewhat longer distance for prone shooting)
- gets in position
- takes 5 shots
- skis to the entrance of the penalty zone
- skis any penalty loops that are necessary due to missed shots
- skis out of the penalty zone

What we would like to do is estimate the time spent skiing penalty loops (which is dependent on the number of shots missed) and to isolate it from all the other pieces of the range time. To this end, we will henceforth use the term _range time_ to refer to everything that happens outside of the penalty loops, and _loop time_ to refer to the time required (on average) to ski a single penalty loop. 

(NB: This raises the question: If you're going to go and split up the range time and the penalty time now, why did you add them together up there? The answer is fairly simple. For the seasons where the penalty time is split off from the range time, the penalty time includes a short section of track/range that every racer needs to ski (which takes about 7 or 8 seconds, on average), regardless of whether or not they have missed shots. As a result, if we estimate loop time for those races by dividing the penalty times by the number of missed shots, our estimates will be higher than the actual loop times by 25-30%.)

In order to estimate the average range times and penalty loop times for a given race, we use

3. ```calculate_penalty_lines```: This function uses bootstrap sampling to select a collection of missed prone shots and prone range time (and  missed standing shots and standing range time). It then uses linear regression on each sample  to estimate range time and penalty loop time for a given competition. The function returns a list containing the means and standard deviations of the estimates found for prone range time, prone penalty loop time, standing range time, and standing penalty loop time.

We then loop through all of the World Cup races, applying ```calculate_penalty_lines``` to each, and store the results in the form of a dataframe with the following columns: Year, Event, prone range mean, prone range dev, prone loop mean, prone loop dev, standing range mean, standing range dev, standing loop mean, and standing loop dev. Finally, as above, we use add_course_data and add_weather_data to add information about course conditions for each event, and then use the values we found when quantifying the snow conditions and weather to add Quant Snow and Quant Weather values to our dataframe.

#### Step 2: Correlations for range times

In order to compare the effects of the various predictor variables on prone range times we loop through all quantitative predictors (including the quantified versions of the categorical variables), applying corr_and_change each time. The results are then stored in a dataframe. We repeat this to explore the effects of the various predictor variables on standing range times. Finally, we use the function collect_best_variables on each of the resulting dataframes to select which of the predictor variables seem to be most closely tied to prone range times and standing range times.

#### Step 3: Correlations for penalty loop times

In order to compare the effects of the various predictor variables on prone penalty loop times we loop through all quantitative predictors (including the quantified versions of the categorical variables), applying corr_and_change each time. The results are then stored in a dataframe. We repeat this to explore the effects of the various predictor variables on standing penalty loop times. Finally, we use the function collect_best_variables on each of the resulting dataframes to select which of the predictor variables seem to be most closely tied to prone penalty loop times and standing penalty loop times.

Previous Section: [Evaluating Impacts on Shooting Accuracy](#Evaluating-Impacts-on-Accuracy)


Next Section: [Evaluating Impacts on Range Times](#Evaluating-Impacts-on-Range-Times)


[Table of Contents](#Table-of-Contents)




#### Step 0 : Calculate prone and standing range + penalty times in seconds



In [101]:
"""
Function
--------

rangetimes_to_seconds : takes a competition analysis dataframe and converts the range times
                        (for seasons through 1011) or the sum of the range and penalty times
                        (for seasons after 1112) into seconds

Parameters
----------

df : a competition analysis dataframe
year : the season in which the competition was held

Returns
-------

df : the original dataframe with range times given in seconds 

Example
-------

"""

def rangetimes_to_seconds(df, year):
    
    if year in ['0405','0506','0607','0708','0809','0910','1011']:
        for i in df.index.tolist():
            df.loc[i,'prone range'] = convert_to_seconds(df.loc[i,'r1'])
            df.loc[i, 'standing range'] = convert_to_seconds(df.loc[i,'r2'])
    else:
        for i in df.index.tolist():
            df.loc[i,'prone range'] = (convert_to_seconds(df.loc[i,'r1']) + 
                                                   convert_to_seconds(df.loc[i,'pen1']))
            df.loc[i,'standing range'] = (convert_to_seconds(df.loc[i,'r2']) + 
                                                   convert_to_seconds(df.loc[i,'pen2']))

    return df

In [102]:
# Adding range times in seconds to each event dataframe

seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617','1718']
events = ['CP01','CP02','CP03','CP04','CP05','CP06','CP07','CP08','CP09','CH__','OG__']

for season in seasons:
    for event in events:
        filename = ('companal_SMSP_%(season)s_%(event)s.pkl'
                    %{'season' : season, 'event' : event})
        
        try:
            race_data = pd.read_pickle(filename)
            race_data_new = rangetimes_to_seconds(race_data,season)
            race_data_new.to_pickle(filename)
        except: # the race does not exist or does not have data
            pass

In [103]:
"""
Function
--------

ibu_rangetimes_to_seconds : takes an ibu competition analysis dataframe and converts the
                            range times (for seasons through 1314) or the sum of the range
                            and penalty times (for seasons after 1415) into seconds

Parameters
----------

df : a competition analysis dataframe
year : the season in which the competition was held

Returns
-------

df : the original dataframe with range times given in seconds 

Example
-------


"""

def ibu_rangetimes_to_seconds(df, year):
    
    if year in ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314']:
        for i in df.index.tolist():
            df.loc[i,'prone range'] = convert_to_seconds(df.loc[i,'r1'])
            df.loc[i, 'standing range'] = convert_to_seconds(df.loc[i,'r2'])
    else:
        for i in df.index.tolist():
            df.loc[i,'prone range'] = (convert_to_seconds(df.loc[i,'r1']) + 
                                                   convert_to_seconds(df.loc[i,'pen1']))
            df.loc[i,'standing range'] = (convert_to_seconds(df.loc[i,'r2']) + 
                                                      convert_to_seconds(df.loc[i,'pen2']))

    return df

In [104]:
# Adding range times in seconds to each event dataframe

seasons = ['0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415',
           '1516','1617','1718']
events = ['CP01','CP02','CP03','CP04','CP05','CP06','CP07','CP08','CH__']

for season in seasons:
    for event in events:
        filename = 'ibu_SMSP_%(season)s_%(event)s.pkl' %{'season' : season, 'event' : event}
        
        try:
            race_data = pd.read_pickle(filename)
            race_data_new = ibu_rangetimes_to_seconds(race_data,season)
            race_data_new.to_pickle(filename)
        except: # the race does not exist or does not have data
            pass
        
        filename = 'ibu_SMSPS_%(season)s_%(event)s.pkl' %{'season' : season, 'event' : event}
        
        try:
            race_data = pd.read_pickle(filename)
            race_data_new = ibu_rangetimes_to_seconds(race_data,season)
            race_data_new.to_pickle(filename)
        except: # the race does not exist or does not have data
            pass


#### Step 1 : Collect the data



In [105]:
"""
Function
--------

calculate_penalty_lines : uses bootstrap sampling and linear regression to estimate range time
                            and penalty loop time for a given competition

Parameters
----------

season : the years of the biathlon season, in form y1y2 where y1 is the last 2 digits of the
        first year and y2 is the last two digits of the second year
event : a four character code specifying the event. Possibilities are “CP01”, “CP02”, . . .,
        “CP09”, “OG__”,”CH__”

Returns
-------

penalty_data : a list containing the averages and standard deviations of the predicted
                range and loop times for prone and standing shooting for the given
                competition

Example
-------

"""

def calculate_penalty_lines(season,event):
    
    filename = 'companal_SMSP_%(season)s_%(event)s.pkl' %{'event' : event, 'season' : season}
    
    race_data = pd.read_pickle(filename)
    
    bootstrap_data = []
    
    linreg_prone = LinearRegression()
    linreg_stand = LinearRegression()
    
    for i in range(100):
        try: # because it's possible that somehow all of the racers in a sample will 
             #have the same missed shots
            bootstrap_sample = np.random.choice(race_data.index.tolist(), 
                                                            len(race_data)).tolist()
            df = race_data.loc[bootstrap_sample]
        
            linreg_prone.fit(df['P1'].values.reshape(-1, 1), 
                                         df['prone range'].values.reshape(-1, 1))
            linreg_stand.fit(df['S1'].values.reshape(-1, 1), 
                                         df['standing range'].values.reshape(-1, 1))
        
            fit_data = [linreg_prone.intercept_.tolist()[0], 
                            linreg_prone.coef_.tolist()[0][0], 
                                linreg_stand.intercept_.tolist()[0],
                                    linreg_stand.coef_.tolist()[0][0]]

            bootstrap_data.append(fit_data)
        except: # If all racers in the bootstrap sample missed the same number of shots
            pass
        
    bootstrap_data = pd.DataFrame(bootstrap_data, columns = ['prone range','prone loop',
                                                             'standing range','standing loop'])
    penalty_data = [np.mean(bootstrap_data['prone range']), 
                            np.std(bootstrap_data['prone range']),
                   np.mean(bootstrap_data['prone loop']), 
                            np.std(bootstrap_data['prone loop']),
                   np.mean(bootstrap_data['standing range']), 
                            np.std(bootstrap_data['standing range']),
                   np.mean(bootstrap_data['standing loop']), 
                            np.std(bootstrap_data['standing loop'])]
    
    return penalty_data
                                            

In [106]:
# Collecting the range and penalty data into a single dataframe

penalties = []

seasons = ['0405', '0506', '0607', '0708', '0809', '0910', '1011', '1112', '1213', '1314',
           '1415', '1516', '1617', '1718']
events = ['CP01', 'CP02', 'CP03', 'CP04', 'CP05', 'CP06', 'CP07', 'CP08', 'CP09', 
          'CH__', 'OG__']

for season in seasons:
    for event in events:
        try:
            filename = ('companal_SMSP_%(season)s_%(event)s.pkl'
                        %{'event' : event, 'season' : season})
            if len(pd.read_pickle(filename)) > 50:
                penalty = [season, event]
                penalty.extend(calculate_penalty_lines(season, event))
                penalties.append(penalty)
            
        except: # there is no file for this race
            pass
            
penalty_info = pd.DataFrame(penalties, columns = ['Year', 'Event', 'prone range mean', 
                                                  'prone range dev', 'prone loop mean',
                                                  'prone loop dev', 'standing range mean', 
                                                  'standing range dev', 'standing loop mean',
                                                  'standing loop dev'])           

Now I want to add the weather and course information to this.

In [107]:
for row in range(len(penalty_info)):
    penalty_info = add_course_data(penalty_info, row)
    
for row in range(len(penalty_info)):
    penalty_info = add_weather_data(penalty_info, row)



In [108]:
for i in range(len(penalty_info)):
    penalty_info.loc[i,'Quant Weather'] = (weather_averages.loc[penalty_info.loc[i, 'Weather'],
                                                               'Weather Quant'])
    penalty_info.loc[i, 'Quant Snow'] = (snow_averages.loc[penalty_info.loc[i, 'Snow Cond'],
                                                          'Snow Quant'])

In [109]:
# And, just in case there are some NaN values, let's get rid of them

penalty_info.dropna(how = 'any', axis = 'rows', inplace = True)
penalty_info.reset_index(drop = True, inplace = True)

In [110]:
penalty_info.to_pickle('penalty_info.pkl')


#### Step 2: Correlations for range times


In [111]:
corr_and_changes_prone_range = []

x_values = ['Length', 'Height Diff', 'Max Climb', 'Total Climb',  'Snow Temp', 'Air Temp', 
            'Humidity', 'Wind', 'Altitude', 'Quant Year', 'Quant Event', 'Quant Weather',
            'Quant Snow']

y_values = ['prone range mean']

for y in y_values:
    for x in x_values:
        row_data = [y,x]
        row_data.extend(corr_and_change(x,y,penalty_info.iloc[:96]))
        corr_and_changes_prone_range.append(row_data)
        
corr_and_changes_prone_range = pd.DataFrame(corr_and_changes_prone_range, 
                                      columns = ['Response','Predictor','Correlation',
                                                'Determination','Percent Change Predicted'])

corr_and_changes_prone_range

Unnamed: 0,Response,Predictor,Correlation,Determination,Percent Change Predicted
0,prone range mean,Length,-0.028273,0.000799,0.027397
1,prone range mean,Height Diff,0.123728,0.015309,0.114583
2,prone range mean,Max Climb,0.03033,0.00092,0.033009
3,prone range mean,Total Climb,0.065217,0.004253,0.053919
4,prone range mean,Snow Temp,-0.246668,0.060845,0.312329
5,prone range mean,Air Temp,-0.232323,0.053974,0.28923
6,prone range mean,Humidity,0.096898,0.009389,0.082328
7,prone range mean,Wind,0.113187,0.012811,0.138119
8,prone range mean,Altitude,-0.005786,3.3e-05,0.004188
9,prone range mean,Quant Year,-0.239406,0.057315,0.164966


In [112]:
corr_and_changes_standing_range = []

x_values = ['Length', 'Height Diff', 'Max Climb', 'Total Climb',  'Snow Temp', 'Air Temp', 
            'Humidity', 'Wind', 'Altitude', 'Quant Year', 'Quant Event', 'Quant Weather',
            'Quant Snow']

y_values = ['standing range mean']

for y in y_values:
    for x in x_values:
        row_data = [y,x]
        row_data.extend(corr_and_change(x,y,penalty_info.iloc[:96]))
        corr_and_changes_standing_range.append(row_data)
        
corr_and_changes_standing_range = pd.DataFrame(corr_and_changes_standing_range, 
                                      columns = ['Response','Predictor','Correlation',
                                                  'Determination','Percent Change Predicted'])

corr_and_changes_standing_range

Unnamed: 0,Response,Predictor,Correlation,Determination,Percent Change Predicted
0,standing range mean,Length,-0.018747,0.000351,0.015428
1,standing range mean,Height Diff,0.074434,0.00554,0.058545
2,standing range mean,Max Climb,-0.004905,2.4e-05,0.004534
3,standing range mean,Total Climb,0.06523,0.004255,0.045803
4,standing range mean,Snow Temp,-0.257711,0.066415,0.27714
5,standing range mean,Air Temp,-0.207546,0.043075,0.219449
6,standing range mean,Humidity,0.143971,0.020728,0.10389
7,standing range mean,Wind,0.130585,0.017052,0.135336
8,standing range mean,Altitude,0.096176,0.00925,0.05912
9,standing range mean,Quant Year,-0.194929,0.037997,0.114078


In [113]:
best_variables_prone_range = collect_best_variables(corr_and_changes_prone_range)

In [114]:
best_variables_standing_range = collect_best_variables(corr_and_changes_standing_range)


#### Step 3: Correlations for penalty loop times





In [9]:
corr_and_changes_prone_loop = []

x_values = ['Length', 'Height Diff', 'Max Climb', 'Total Climb',  'Snow Temp', 'Air Temp', 
            'Humidity', 'Wind', 'Altitude', 'Quant Year','Quant Event', 'Quant Weather', 
            'Quant Snow']

y_values = ['prone loop mean']

for y in y_values:
    for x in x_values:
        row_data = [y,x]
        row_data.extend(corr_and_change(x,y,penalty_info.iloc[:96]))
        corr_and_changes_prone_loop.append(row_data)
        
corr_and_changes_prone_loop = pd.DataFrame(corr_and_changes_prone_loop, 
                                           columns = ['Response','Predictor','Correlation',
                                                 'Determination','Percent Change Predicted'])

corr_and_changes_prone_loop

Unnamed: 0,Response,Predictor,Correlation,Determination,Percent Change Predicted
0,prone loop mean,Length,-0.004694,2.2e-05,0.004107
1,prone loop mean,Height Diff,0.043059,0.001854,0.036008
2,prone loop mean,Max Climb,0.069018,0.004763,0.067828
3,prone loop mean,Total Climb,0.10154,0.01031,0.075805
4,prone loop mean,Snow Temp,-0.103892,0.010794,0.118784
5,prone loop mean,Air Temp,-0.06747,0.004552,0.075847
6,prone loop mean,Humidity,0.114134,0.013027,0.087564
7,prone loop mean,Wind,0.180649,0.032634,0.199051
8,prone loop mean,Altitude,0.078575,0.006174,0.051352
9,prone loop mean,Quant Year,-0.171745,0.029496,0.106861


In [116]:
corr_and_changes_standing_loop = []

x_values = ['Length', 'Height Diff', 'Max Climb', 'Total Climb',  'Snow Temp', 'Air Temp', 
            'Humidity', 'Wind', 'Altitude', 'Quant Year','Quant Event', 'Quant Weather', 
            'Quant Snow']

y_values = ['standing loop mean']

for y in y_values:
    for x in x_values:
        row_data = [y,x]
        row_data.extend(corr_and_change(x,y,penalty_info.iloc[:96]))
        corr_and_changes_standing_loop.append(row_data)
        
corr_and_changes_standing_loop = pd.DataFrame(corr_and_changes_standing_loop, 
                                           columns = ['Response','Predictor','Correlation',
                                                  'Determination','Percent Change Predicted'])

corr_and_changes_standing_loop

Unnamed: 0,Response,Predictor,Correlation,Determination,Percent Change Predicted
0,standing loop mean,Length,-0.012883,0.000166,0.008422
1,standing loop mean,Height Diff,0.097568,0.009519,0.060958
2,standing loop mean,Max Climb,0.213295,0.045495,0.156611
3,standing loop mean,Total Climb,0.099593,0.009919,0.05555
4,standing loop mean,Snow Temp,-0.006103,3.7e-05,0.005213
5,standing loop mean,Air Temp,-0.021165,0.000448,0.017776
6,standing loop mean,Humidity,0.128615,0.016542,0.073722
7,standing loop mean,Wind,0.159715,0.025509,0.131483
8,standing loop mean,Altitude,0.143037,0.02046,0.069842
9,standing loop mean,Quant Year,-0.059052,0.003487,0.027451


In [117]:
best_variables_prone_loop = collect_best_variables(corr_and_changes_prone_loop)

In [118]:
best_variables_standing_loop = collect_best_variables(corr_and_changes_standing_loop)


# Evaluating Impacts on Range Times

Beginning with the impacts of our predictor variables on prone range times, we find that the variables with the greatest impact are
<!--, which has an $r$ value of 0.499915 and which explains roughly 59% of the total change in prone accuracy,-->
- Quant Weather, which has an $r$ value of -0.278661 and which explains roughly 37% of the total change in prone range times,
- Snow Temp, which has an $r$ value of -0.246668 and which explains roughly 31% of the total change in prone range times,
- Quant Year, which has an $r$ value of -0.239406 and which explains roughly 16% of the total change in prone range times,
- Air Temp, which has an $r$ value of -0.232323 and which explains roughly 29% of the total change in prone range times,
- Quant Event, which has an $r$ value of -0.228858 and which explains roughly 16% of the total change in prone range times, and
- Quant Snow, which has an $r$ value of -0.220391 and which explains roughly 29% of the total change in prone range times.
  
In considering the impacts of our predictor variables on standing range times, we find that the variables with the greatest impact are
- Quant Weather, which has an $r$ value of -0.420419 and which explains roughly 47% of the total change in standing range times,
- Quant Snow, which has an $r$ value of -0.282148 and which explains roughly 32% of the total change in standing range times, and
- Snow Temp, which has an $r$ value of -0.257711 and which explains roughly 28% of the total change in standing range times.
  
One observation here is that, despite the fact that range times were not considered in any way when assigning numeric values to the different values taken by Weather and Snow Conditions, Quant Weather and Quant Snow remain variables that are highly (relatively speaking) correlated with both prone and standing range times.

Previous Section: [Effect of Conditions on Range and Penalty Times](#Effects-of-conditions-on-range-and-penalty-times)

Next Section: [Evaluating Impacts on Penalty Loop Times](#Evaluating-Impacts-on-Penalty-Loop-Times)


[Table of Contents](#Table-of-Contents)




# Evaluating Impacts on Penalty Loop Times

Beginning with the impacts of our predictor variables on prone penalty loop times, we find that the variables with the greatest impact are
<!--, which has an $r$ value of 0.499915 and which explains roughly 59% of the total change in prone accuracy,-->
- Quant Weather, which has an $r$ value of -0.359824 and which explains roughly 43% of the total change in prone loop times, and
- Quant Snow, which has an $r$ value of -0.299457 and which explains roughly 36% of the total change in prone loop times.

In considering the impacts of our predictor variables on standing range times, we find that the variables with the greatest impact are
- Quant Weather, which has an $r$ value of -0.287663 and which explains roughly 26% of the total change in standing loop times
- Quant Snow, which has an $r$ value of -0.279078 and which explains roughly 25% of the total change in standing loop times
- Max Climb, which has an $r$ value of 0.213295 and which explains roughly 16% of the total change in standing loop times
- Wind, which has an $r$ value of 0.159715 and which explains roughly 13% of the total change in standing loop times

One observation here is that, despite the fact that penalty loop times were not considered in any way when assigning numeric values to the different values taken by Weather and Snow Conditions, Quant Weather and Quant Snow remain variables that are highly (relatively speaking) correlated with both prone and standing penalty loop times.

Previous Section: [Evaluating Impacts on Range Times](#Evaluating-Impacts-on-Range-Times)


Next Section: [Handing off to the next notebook](#Handing-off-to-the-next-notebook)


[Table of Contents](#Table-of-Contents)




# Handing off to the next notebook

In order to prevent individual notebooks from becoming too unwieldy, I'm choosing to split this project over multiple notebooks. At this point, I have completed the following tasks:
1. Read in the competition analysis files for all World Cup and IBU Cup men's sprint races from 2004-05 to 2017-18, parsed their contents, and stored the desired data in dataframes.
2. Read in the competition data summary files for all World Cup and IBU Cup men's sprint races over the same period, parsed their contents, and stored the desired data in dataframes.
3. Investigated the impact of each of the recorded conditions (length, wind, snow temperature, etc) on the various aspects of the race (speed, accuracy, range times, penalty loop times) and isolated the variables that seemed to have the most individual impact on each aspect.

In the next notebook, I intend to use build models for each aspect, which will use results of previous races to predict performance in the race of interest. In each case, I will use the similarity between conditions in races to weight the importance of each previous race for predicting the outcome of the current race. Furthermore, since it appears from the results found in this notebook that some condition variables are far more important than others for making these predictions, I'm going to isolate individual variables when considering these weightings. In other words, I'm going to test my model with a weighting that is based entirely on how similar the course lengths are, then test it with a weighting based entirely on altitude, and so on until I've used each individual predictor variable as a basis for a weighting. I'm then going to determine which of the variables did the best job of predicting outcomes, and save those for future use.

The last thing that I need to do before finishing up this notebook is to collect all of my best variables in a dictionary and then to pickle them so that they will available for use in the following and subsequent notebooks.

Previous Section: [Evaluating Impacts on Penalty Loop Times](#Evaluating-Impacts-on-Penalty-Loop-Times)



[Table of Contents](#Table-of-Contents)



In [120]:
best_variables = {'speed' : best_variables_speed, 'prone_acc' : best_variables_prone_accuracy,
                  'standing_acc' : best_variables_standing_accuracy, 
                  'prone_range' : best_variables_prone_range,
                  'standing_range' : best_variables_standing_range,
                  'prone_loop' : best_variables_prone_loop,
                  'standing_loop' : best_variables_standing_loop}

In [None]:
with open('best_variables.pickle', 'wb') as handle:
    pickle.dump(best_variables, handle, protocol=pickle.HIGHEST_PROTOCOL)


In [122]:
best_variables

{'prone_acc': ['Quant Weather'],
 'prone_loop': ['Quant Weather', 'Quant Snow'],
 'prone_range': ['Snow Temp',
  'Quant Snow',
  'Air Temp',
  'Quant Event',
  'Quant Year',
  'Quant Weather'],
 'speed': ['Quant Snow', 'Quant Weather'],
 'standing_acc': ['Quant Weather', 'Wind'],
 'standing_loop': ['Quant Weather', 'Quant Snow', 'Max Climb', 'Wind'],
 'standing_range': ['Quant Weather', 'Quant Snow', 'Snow Temp']}