# **Hello!**

Here are the data cleaning, pre-processing, feature engineering and modeling steps of our proposed solution. We did our best to make our process, assumptions and decisions explicit while remaining concise.

Thank you for your time in evaluating our submission!

Cynthia & Juliette

## **1. Data cleaning**

- **Removing Redundant Columns**: We identified and removed redundant columns from the dataset to simplify and optimize it for analysis. The 'GoingID' and 'GoingAbbrev' columns were removed because 'Wetness' already captures the same information.

- **Type Casting**: To ensure data consistency and proper analysis, we performed type casting on certain columns.

- **Updating values**: To enhance the clarity and readability of the data, we updated values in some columns (e.g. in 'RacingSubType').

- **Handling missing values**: We applied the following strategies to address missing values.
  - 'RaceGroup': Blank values were replaced with 'Other' to provide meaningful information.
  - 'CourseIndicator', 'Barrier', 'RaceOverallTime', 'NoFrontCover', 'WideOffRail', 'BeatenMargin': Missing, blank, or abnormal values (-9, 999) were replaced with null values to ensure data accuracy.
  - 'HandicapType': Missing values were replaced with 'None' to maintain consistency.
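
The missing-value strategies above can be sketched in pandas as follows. This is a minimal illustration on a made-up toy frame, not the exact notebook code: the sample values are invented, only two of the numeric columns are shown, and the real column dtypes are an assumption.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (all values are made up).
df = pd.DataFrame({
    "RaceGroup": ["G1", "", None],
    "HandicapType": [None, "A", "B"],
    "Barrier": [3, -9, 999],
    "BeatenMargin": [1.5, 999, 0.0],
})

# Blank or missing race groups become an explicit 'Other' category.
df["RaceGroup"] = df["RaceGroup"].replace("", np.nan).fillna("Other")

# Missing handicap types become the explicit category 'None'.
df["HandicapType"] = df["HandicapType"].fillna("None")

# Sentinel codes (-9, 999) are treated as missing in the numeric columns.
sentinel_cols = ["Barrier", "BeatenMargin"]
df[sentinel_cols] = df[sentinel_cols].replace([-9, 999], np.nan)
```

Keeping the sentinel list in one place makes it easy to extend the same rule to the other affected columns.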


## **2. Pre-processing**

- encoding categorical variables
  - For

- scaling numerical variables

- maybe that should be after the feature engineering section, we'll see

## **3. Feature engineering**

- 'Season' variable
  - Rationale: The time of year when a race takes place affects race conditions and horse performance through weather patterns, track conditions, and seasonal horse fitness levels.
  - Method: The 'Season' variable is derived from the 'RaceStartTime' by mapping dates to their corresponding seasons.
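
A minimal sketch of the date-to-season mapping. The month boundaries below assume Southern Hemisphere meteorological seasons, which is an assumption on our part for illustration; the mapping dictionary should be adjusted to wherever the races actually take place.

```python
import pandas as pd

# Assumption: Southern Hemisphere meteorological seasons, keyed by month.
MONTH_TO_SEASON = {
    12: "Summer", 1: "Summer", 2: "Summer",
    3: "Autumn", 4: "Autumn", 5: "Autumn",
    6: "Winter", 7: "Winter", 8: "Winter",
    9: "Spring", 10: "Spring", 11: "Spring",
}

# Made-up start times for illustration.
df = pd.DataFrame(
    {"RaceStartTime": ["2021-01-15 14:30", "2021-07-04 19:05", "2021-10-31 12:00"]}
)
df["RaceStartTime"] = pd.to_datetime(df["RaceStartTime"])

# Map each race's month to its season label.
df["Season"] = df["RaceStartTime"].dt.month.map(MONTH_TO_SEASON)
```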

- Age Restrictions Variables (max_age, min_age)
  - These variables are derived from 'AgeRestriction' by parsing the age limits and applying them to the respective columns, aiding in filtering horses eligible for each race.

- Started_race and Finished_race Variables
  - Rationale: Tracking whether a horse started or finished a race can signal its reliability and potential issues that may affect its future performance.
  - Model Impact: The 'Started_race' and 'Finished_race' boolean variables help to differentiate between horses that consistently compete and those that may have health or behavioral issues.
  - Method: The 'Started_race' variable is true for all horses not marked with 'NP' in 'FinishPosition', while 'Finished_race' is true for horses that have a numerical finish position, thus capturing participation and completion rates.
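
The two flags can be derived along these lines. The sketch assumes 'FinishPosition' is stored as strings that mix numeric positions with letter codes; the sample values are made up.

```python
import pandas as pd

df = pd.DataFrame({"FinishPosition": ["1", "2", "NP", "UN", "10"]})

# Any horse not marked 'NP' counts as having started the race.
df["Started_race"] = df["FinishPosition"] != "NP"

# A finish position that parses as a number means the horse completed the race.
df["Finished_race"] = pd.to_numeric(df["FinishPosition"], errors="coerce").notna()
```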

- Specific Status Boolean Variables
  - Rationale: Specific incidents during a race such as breaking stride, not finishing, pulling up, or falling can significantly impact a horse's immediate and future performances.
  - Model Impact: Boolean variables for each incident type ('Broke_Stride', 'UN', 'Pulled_Up', 'Fell') provide granular data points for race analytics, enabling the model to account for non-standard race outcomes.
  - Method: These variables are flags set to true when the corresponding incident code appears in the race data, offering detailed insights into each horse's race history.

- Placed Variable
  - Model Impact: The 'Placed' variable considers horses in the top positions, which typically receive prize money, thus highlighting consistently successful competitors and adding more information than only looking at the 'WIN' variable.
  - Method: This variable is true for horses finishing in positions 1-7, recognizing those that achieve a position that usually awards prize money.

- Relative_age Variable
  - Rationale: The age of a horse relative to its competitors can be a factor in its performance, with younger horses potentially being less experienced and older horses possibly past their peak.
  - Model Impact: The 'Relative_age' variable contextualizes a horse's age within the specific race field, potentially identifying advantageous age-related performance trends.
  - Method: This variable scales a horse's age against the youngest and oldest horses in the race, with 0 representing the youngest and 1 the oldest.
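
A minimal sketch of the within-race min-max scaling. How to treat a field in which every horse has the same age is not specified above; setting the value to 0.0 in that case is an assumption made here to avoid dividing by zero.

```python
import pandas as pd

# Made-up ages for two races.
df = pd.DataFrame({
    "RaceID": [1, 1, 1, 2, 2],
    "HorseAge": [3, 5, 7, 4, 4],
})

age = df.groupby("RaceID")["HorseAge"]
age_min = age.transform("min")
age_range = age.transform("max") - age_min

# Min-max scale within each race: 0 = youngest horse, 1 = oldest horse.
# Assumption: if all horses share the same age, Relative_age is set to 0.0.
df["Relative_age"] = (
    (df["HorseAge"] - age_min) / age_range.where(age_range > 0)
).fillna(0.0)
```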

- Updated FinishPosition Variable
  - Model Impact: Assigning the last position to horses with a non-numeric 'FinishPosition' gives every horse a numeric finishing position.
  - Method: The 'FinishPosition' is updated to reflect the order of completion.

- WIN Variable Modification
  - Rationale: In our dataset, prize money is a more accurate reflection of the significance of the race and the performance of the horses.
  - Model Impact: Determining the 'WIN' variable based on the highest 'Prizemoney' allows for the identification of winners in cases where disqualification might occur, offering a more nuanced view of race outcomes.
  - Method: The winning horse is identified for each race by selecting the horse with the highest 'Prizemoney' amongst those not disqualified.

- Performance Score Enhancement
  - Rationale: A performance score based on finishing position offers a quantitative measure of a horse's success in a race, with adjustments for non-finishers providing clearer distinctions in performance.
  - Model Impact: The new performance score accounts for all race outcomes in a way that directly reflects each horse's placement relative to others, offering a more differentiated assessment of performance.
  - Method: We first updated the 'FinishPosition' variable for horses that did not start or did not finish the race, identified by letters in the finish position. These horses were assigned last place + 1 to differentiate them from horses that actually finished. The performance score is calculated as the reciprocal of the updated 'FinishPosition' and then normalized within each 'RaceID' to ensure comparability.
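
The performance score steps can be sketched as below. Two points are assumptions made for this illustration: "last place" is taken to be the highest numeric position recorded in the race, and the normalization divides each reciprocal by the race total so that scores sum to 1 per race. The column names 'FinishPosition_updated' and 'performance_score' are hypothetical.

```python
import pandas as pd

# One made-up race with three finishers and one non-finisher.
df = pd.DataFrame({
    "RaceID": [1, 1, 1, 1],
    "FinishPosition": ["1", "2", "3", "UN"],
})

# Letter codes become NaN; numeric positions are kept.
pos = pd.to_numeric(df["FinishPosition"], errors="coerce")

# Assumption: "last place" is the highest numeric position in the race.
last_place = pos.groupby(df["RaceID"]).transform("max")
df["FinishPosition_updated"] = pos.fillna(last_place + 1)

# Reciprocal of the position, then normalized so each race sums to 1 (assumption).
reciprocal = 1.0 / df["FinishPosition_updated"]
df["performance_score"] = reciprocal / reciprocal.groupby(df["RaceID"]).transform("sum")
```

With this choice a winner always scores highest in its race, and non-finishers score just below the last horse to finish.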

### **Features on horses', trainers', and jockeys' past participations**

**add details**

- Consistency of Horse's Finish Position
  - Rationale: A horse that consistently finishes in top positions is likely to have better training, conditioning, and inherent ability, making it a strong contender in future races.
  - Model Impact: By quantifying finish position consistency the model can better assess the reliability of a horse's performance. This feature can help distinguish consistently good performers from horses that might have had a few random wins.
  - Method: We created two new variables: 'horse_consistency_position' is the standard deviation of the horse's last five finish positions, and 'horse_consistency' is the standard deviation of the horse's last five normalized scores.
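
A rolling standard deviation over past races might look like the sketch below. The 'HorseID' column name is hypothetical, and shifting by one race so that the current result is excluded is an assumption made here to avoid leaking the outcome into the feature; the text above does not say whether the current race is included.

```python
import pandas as pd

# Made-up history for one horse, sorted chronologically.
df = pd.DataFrame({
    "HorseID": [7, 7, 7, 7],
    "RaceStartTime": pd.to_datetime(
        ["2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01"]
    ),
    "FinishPosition": [1, 3, 5, 2],
}).sort_values(["HorseID", "RaceStartTime"])

# shift(1) drops the current race, so each row only sees earlier results.
# The window covers up to five past races and needs at least two to give a value.
df["horse_consistency_position"] = df.groupby("HorseID")["FinishPosition"].transform(
    lambda s: s.shift(1).rolling(window=5, min_periods=2).std()
)
```

The same pattern applies to 'horse_consistency' by swapping 'FinishPosition' for the normalized score column.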

- Average Beaten Margin
  - Rationale: The average margin by which a horse is beaten in races can provide insights into how competitive the horse is, even when it does not win. A smaller beaten margin could indicate that the horse is often a strong competitor but may have faced some minor issues preventing a win.
  - Model Impact: This feature helps to differentiate between horses that barely lose and those that are often non-competitive. In close races, horses with a history of smaller beaten margins might be given more favorable odds by the model. This gives us additional information in determining the horse's win probability than only looking at their past finish position.

- Field Competitiveness
  - Rationale: The level of competition in previous races can be a significant factor in interpreting a horse's past performance. Winning a high-stakes race with strong competitors can be more indicative of a horse's ability than winning among a weak field.
  - Model Impact: By incorporating measures of field competitiveness the model can weigh a horse's past wins or places more accurately.
  - Method: We calculated this variable from 'horse_past_perf_score', 'jockey_past_perf_score', 'trainer_past_perf_score', and 'race_prizemoney_score'. We assigned a weight to each of these variables, multiplied each variable by its weight, and summed the results. We then grouped the dataset by 'RaceID' and took the average of this score for each race as the 'field_competitiveness'.
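
The weighted sum and per-race average can be sketched as follows. The weights are hypothetical placeholders, since the actual values are not listed here, and the input scores are made up.

```python
import pandas as pd

# Hypothetical weights for illustration only.
WEIGHTS = {
    "horse_past_perf_score": 0.4,
    "jockey_past_perf_score": 0.2,
    "trainer_past_perf_score": 0.2,
    "race_prizemoney_score": 0.2,
}

df = pd.DataFrame({
    "RaceID": [1, 1, 2],
    "horse_past_perf_score": [0.5, 1.0, 0.0],
    "jockey_past_perf_score": [0.5, 0.0, 1.0],
    "trainer_past_perf_score": [0.5, 0.0, 0.0],
    "race_prizemoney_score": [0.5, 0.5, 1.0],
})

# Multiply each score column by its weight and sum across columns, per horse.
weights = pd.Series(WEIGHTS)
df["composite_past_perf_score"] = df[weights.index].mul(weights, axis=1).sum(axis=1)

# Average the composite score over all horses in the same race.
df["field_competitiveness"] = (
    df.groupby("RaceID")["composite_past_perf_score"].transform("mean")
)
```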

 **'composite_past_perf_score' -> derived as we're calculating for the field competitiveness -> are we keeping it? as I believe it was the top feature for some of the models**

- Dam and Sire Past Performance Score
  - Rationale: Pedigree information can be an important indicator of latent potential and performance capacity. A horse's dam (mother) and sire (father) can provide insights into the genetic quality and inherited traits that may influence a horse's capabilities.
  - Model Impact: By quantifying the average performance of siblings from the same dam, we can adjust predictions to account for genetic factors that may not be immediately observable, potentially improving the accuracy of the model when assessing a horse's prospects.
  - Method: We calculated the 'Dam Performance Score' by taking the average of 'horse_averagescore_scaled' over all offspring of the same dam. This score is then assigned to each horse as an inherited performance metric. We did the same for 'Sire_performance_score'.
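
A minimal sketch of the pedigree averages, assuming one row per horse and hypothetical 'DamID' and 'SireID' identifier columns (the real column names may differ). Note that in this simple form each horse's own score is part of its dam's and sire's average.

```python
import pandas as pd

# Made-up horses: horses 1 and 2 share a dam, horses 2 and 3 share a sire.
df = pd.DataFrame({
    "HorseID": [1, 2, 3],
    "DamID": [10, 10, 11],
    "SireID": [20, 21, 21],
    "horse_averagescore_scaled": [0.2, 0.6, 0.9],
})

# Average offspring score per parent, broadcast back to every offspring row.
for parent_col, out_col in [
    ("DamID", "Dam_performance_score"),
    ("SireID", "Sire_performance_score"),
]:
    df[out_col] = df.groupby(parent_col)["horse_averagescore_scaled"].transform("mean")
```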

- Age and Race Prize Money Interaction Term
  - Rationale: The interaction between a horse's age and the prize money of a race could signal the maturity and experience of a horse relative to the level of competition it has faced.
  - Model Impact: This interaction term can help the model to discern whether a horse is in its prime and competing successfully in higher-stakes races, which is often a strong predictor of future performance.
  - Method: We created the interaction term by multiplying the 'HorseAge' variable with 'RacePrizemoney', enabling the model to assess the combined effect of these variables on race outcomes.


- Combined Performance Score (Horse * Jockey * Trainer)
  - Rationale: Combining the past performance scores of the horse, jockey, and trainer can give us a fuller picture of a horse's likely performance.
  - Model Impact: A combined performance score can quantify the synergy of the three parties, offering a holistic view of the team behind a horse, and may correlate strongly with winning outcomes.
  - Method: We compute three 'Combined Performance Score' variables by multiplying, respectively, the win rates, the average scores, and the last-five-race scores of the horse, jockey, and trainer. Each composite score is designed to capture the cumulative effect of each contributor's recent performance.

- Rest Period
  - Rationale: The duration of a horse's rest period before a race can greatly affect its performance. Both extended rest and quick turnarounds have implications for a horse's readiness and physical condition.
  - Model Impact: Including the 'Rest Period' as a variable helps to ensure that the model accounts for potential effects of fatigue or well-restedness, which can be particularly influential in a horse's race day performance.
  - Method: The 'Rest Period' is calculated as the number of days since a horse's last race. This period is considered in the context of each horse's historical performance trends post-rest to evaluate its potential impact on the upcoming race.
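
The days-since-last-race calculation can be sketched as below. 'HorseID' and 'rest_period_days' are hypothetical column names; a horse's first recorded race has no previous start, so its value stays missing.

```python
import pandas as pd

# Made-up race dates for two horses.
df = pd.DataFrame({
    "HorseID": [1, 1, 2, 1],
    "RaceStartTime": pd.to_datetime(
        ["2021-01-01", "2021-01-15", "2021-01-10", "2021-03-01"]
    ),
}).sort_values(["HorseID", "RaceStartTime"])

# Difference between consecutive starts of the same horse, in whole days.
df["rest_period_days"] = df.groupby("HorseID")["RaceStartTime"].diff().dt.days
```

Sorting by horse and start time first matters: `diff()` compares each row with the previous row of the same group.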


etc. -> in this section, we can go through all engineered features that we end up using in the final model and briefly explain what they mean and how we get them

## **4. Model training**

- **Train-test split**: The data was split chronologically into a training set (races up to 31 Oct 2021) and a test set (races from 1 Nov 2021 onward).
- **Model selection**: We tested different classification algorithms, beginning with Logistic Regression as our baseline model. Additionally, we tried Naive Bayes, Decision Trees, Random Forests, Support Vector Machines (SVM) and XGBoost to assess their initial performance and identify the most suitable approach. We selected XGBoost, which offered the best performance given limited computational resources (as we were training models on our personal laptops, SVM and neural networks were difficult to work with).
- **Feature selection**: (add rationale of leaving all engineered features in instead of minimizing features, because we're optimizing the log loss for the purposes of the competition)
- **Parameter tuning**:
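
For the train-test split step, a date-based split can be sketched as below. It assumes the cutoff is 1 November 2021 with earlier races used for training and later races for testing; the toy rows are made up.

```python
import pandas as pd

# Assumption: races before this date are training data, the rest are test data.
SPLIT_DATE = pd.Timestamp("2021-11-01")

df = pd.DataFrame({
    "RaceID": [1, 2, 3],
    "RaceStartTime": pd.to_datetime(
        ["2021-10-31 20:00", "2021-11-01 09:00", "2022-01-05 13:00"]
    ),
})

train = df[df["RaceStartTime"] < SPLIT_DATE]
test = df[df["RaceStartTime"] >= SPLIT_DATE]
```

Splitting on time rather than at random keeps features built from past races from seeing future results.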

## **5. Output probabilities post-processing**

explain process of scaling probabilities at RaceID level
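
One common way to scale probabilities at the 'RaceID' level, shown here as an assumption rather than as the exact procedure used, is to divide each horse's raw win probability by the sum of raw probabilities in its race, so that the probabilities in every race sum to 1. The column names 'raw_win_prob' and 'win_prob' are hypothetical.

```python
import pandas as pd

# Made-up classifier outputs for two races.
df = pd.DataFrame({
    "RaceID": [1, 1, 1, 2, 2],
    "raw_win_prob": [0.2, 0.1, 0.1, 0.3, 0.9],
})

# Rescale so that the win probabilities within each race sum to 1.
race_total = df.groupby("RaceID")["raw_win_prob"].transform("sum")
df["win_prob"] = df["raw_win_prob"] / race_total
```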

## **6. Performance metrics**

- Accuracy: 0.893
- Precision: 0.295
- Recall: 0.295
- Log loss score:  0.231
- Number of races for which the winner is correctly predicted: 631 out of 2140 (= 29.5% of races)
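
The race-level winner count and the log loss can be computed along these lines. This is a sketch on made-up data, assuming a hypothetical 'win_prob' column of predicted probabilities and the binary 'WIN' target; it is not the notebook's evaluation code.

```python
import numpy as np
import pandas as pd

# Two made-up races with two horses each.
df = pd.DataFrame({
    "RaceID": [1, 1, 2, 2],
    "WIN": [1, 0, 0, 1],
    "win_prob": [0.7, 0.3, 0.6, 0.4],
})

# The predicted winner of each race is the horse with the highest probability.
top_pick = df.loc[df.groupby("RaceID")["win_prob"].idxmax()]
races_correct = int(top_pick["WIN"].sum())

# Binary log loss over all horse-race rows, clipped to avoid log(0).
p = df["win_prob"].clip(1e-15, 1 - 1e-15)
log_loss = float(-(df["WIN"] * np.log(p) + (1 - df["WIN"]) * np.log(1 - p)).mean())
```

When exactly one horse per race is predicted to win and each race has exactly one winner, precision and recall both equal the share of races called correctly, which is consistent with the matching 0.295 figures above.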

# Testing our model with new data

- Maybe we need to add instructions for testing the model with new data, and add commented-out cells in the notebook that they can uncomment when running their test with their new data,
e.g. add a cell that imports a Parquet or CSV file and creates a new column "Set" where the new data = "test" and the original dataset = "train", then add a cell at the train-test split step for them to switch from the date-based split to splitting on the "Set" value. I think that's all.
- Or upload another version of the notebook where all they have to do is add the path to the new dataset (because the new data needs to be merged with the original dataset, cleaned, and run through the feature engineering process etc. before going through the model).
  - Maybe make up a race's worth of synthetic data just to make sure that process works (e.g. ensure the dataset gets sorted by RaceStartTime after merging the new data with the original data, so that our cumulative computations are not corrupted).
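
The "Set" column idea could look like the sketch below. The frames are made-up stand-ins; in practice the new data would be loaded with `pd.read_parquet` or `pd.read_csv` from a path supplied by the evaluator.

```python
import pandas as pd

# Stand-ins for the original dataset and the evaluator's new data.
original = pd.DataFrame({
    "RaceID": [1, 2],
    "RaceStartTime": pd.to_datetime(["2021-10-01", "2021-12-01"]),
})
new_data = pd.DataFrame({
    "RaceID": [3],
    "RaceStartTime": pd.to_datetime(["2021-11-15"]),
})

# Tag each source, merge, and restore chronological order so that
# cumulative (past-performance) features are computed in the right sequence.
original["Set"] = "train"
new_data["Set"] = "test"
combined = (
    pd.concat([original, new_data], ignore_index=True)
    .sort_values("RaceStartTime", kind="stable")
    .reset_index(drop=True)
)

# After cleaning and feature engineering, split on the tag instead of the date.
train = combined[combined["Set"] == "train"]
test = combined[combined["Set"] == "test"]
```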

# Environment setup: requirements

```
# preprocessing
import pandas as pd
import numpy as np
import plotly.express as px
import calendar
from datetime import datetime
```

