# Horse Race Predictions

In [8]:
from IPython.display import Image, display

# Provide the path to your local image
image_path = "C:/Users/masin/OneDrive/Desktop/Horse Race.jpg"
display(Image(filename=image_path))


<IPython.core.display.Image object>

<a id="cont"></a>

## Table of Contents

* [1. Project Overview](#chapter1)
  * [1.1 Introduction](#section_1_1)
      * [1.1.1 Problem Statement](#sub_section_1_1_1)
      * [1.1.2 Aim](#sub_section_1_1_2)
      * [1.1.3 Objectives](#sub_section_1_1_3)      
* [2. Importing Packages](#chapter2)
* [3. Data Loading](#chapter3)
* [4. Data Cleaning](#chapter4)
* [5. Exploratory Data Analysis (EDA)](#chapter5)

The dataset selected for this project is the Race Details dataset, which contains information about horse racing results.

## 1.1 Introduction
Horse racing is a competitive sport in which horses, guided by jockeys, race to finish first. Factors such as the horse’s physical condition and age, the track type, and the jockey’s skill can all affect performance. The races_2020.csv dataset includes detailed information on horse races, including race distances, track conditions, prize money, and race results, which provide valuable insights into the dynamics of horse racing.

This project aims to analyze the 2020 horse racing data to uncover key factors that influence race outcomes. By exploring historical race data, we seek to better understand the variables that contribute to success in races, allowing for improved race strategies and performance insights.

### 1.1.1 Problem Statement
Despite the availability of detailed horse racing data, predicting race outcomes remains challenging due to the complex interplay of factors such as race distance, track conditions, and horse attributes. Traditionally, horse racing enthusiasts and trainers have relied on intuition and surface-level analysis to make predictions. However, a more structured and data-driven approach using the races_2020.csv dataset could offer deeper insights into what makes certain horses perform better in races.

This project aims to fill this gap by analyzing horse racing data from 2020. We will examine factors like race distance, track conditions, horse age group, and country to determine their influence on race outcomes. The goal is to better understand the determinants of race performance and apply these findings to optimize future racing strategies.

### 1.1.2 Aim
The aim of this project is to perform a detailed analysis of horse racing data from 2020, with a focus on identifying key factors that impact race outcomes. We will analyze features such as race distance, track conditions, age groups, and prize money to uncover patterns that influence a horse’s success in a race.

Through data-driven analysis, the project will help trainers, jockeys, and race organizers gain insights into the most influential factors for winning races, leading to improved race strategies and training programs.

### 1.1.3 Objectives
* Data Analysis: To collect and analyze horse racing data from the races_2020.csv dataset, identifying key trends, performance factors, and patterns in race outcomes.
* Examine Performance Factors: To explore the impact of variables such as race distance, track type, age group, and track conditions on horse performance.
* Impact Assessment: To assess how different features (e.g., race distance, track conditions, and horse age) affect the likelihood of a horse winning or placing in the top ranks.
* Recommendations: To provide actionable recommendations for trainers, jockeys, and race organizers on optimizing strategies for better performance, based on the most critical predictors of success identified in the 2020 data.



# <font color=red>2. Importing Packages</font>

<div class="alert alert-block alert-info">
<b>Package Imports </b>  Package imports refer to the process of including external libraries and modules in your code. These packages provide additional functionality and tools that are not available in the standard library, enabling more efficient and effective coding for tasks such as data manipulation, visualization, and machine learning.
</div>

In [3]:
# Importing packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
import seaborn as sns
import warnings
from sklearn.preprocessing import StandardScaler

# Suppress all warnings
warnings.filterwarnings('ignore')



# <font color=red>3. Data Loading </font>

<div class="alert alert-block alert-info">
<b>Data Loading </b>  refers to the process of importing data into a workspace to make it ready for analysis. It involves reading data from various sources, such as files, databases, or APIs, and converting it into a format suitable for processing and analysis.
</div>

In [5]:
# Load the CSV file
race_df = pd.read_csv("races_2020.csv")

# Display the first few rows of the DataFrame to verify
print(race_df.head())

     rid            course   time      date  \
0  10312          Fakenham  02:55  20/01/01   
1  10896        Cheltenham  03:50  20/01/01   
2  23038     Tramore (IRE)  02:55  20/01/01   
3  23986  Fairyhouse (IRE)  02:40  20/01/01   
4  25123  Fairyhouse (IRE)  02:05  20/01/01   

                                               title   rclass    band  ages  \
0               Happy New Year Maiden Hurdle (Div I)  Class 4     NaN  4yo+   
1  EBF Stallions &amp; Cheltenham Pony Club (A St...  Class 1     NaN   4yo   
2        Jerry O'Donovan Memorial Rated Novice Chase      NaN     NaN  5yo+   
3  Follow Fairyhouse On Social Media Beginners Chase      NaN     NaN  5yo+   
4  Fairyhouse Launches New Brand In 2020 Handicap...      NaN  80-109  4yo+   

  distance     condition     hurdles  \
0       2m  Good To Soft   9 hurdles   
1     1m6f          Soft         NaN   
2       2m          Soft   12 fences   
3     2m5f      Yielding   13 fences   
4       3m      Yielding  13 hurdles   
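The `date` column is read as plain text (values such as `20/01/01`). A minimal sketch of converting it to a proper datetime type is shown below. It assumes the layout is two-digit year, month, day, which is plausible for a 2020 file but cannot be confirmed from the rows shown; `parse_race_dates` is a hypothetical helper name, not part of the original notebook.

```python
import pandas as pd

def parse_race_dates(dates: pd.Series) -> pd.Series:
    """Parse strings such as "20/01/01" as year/month/day (assumed layout)."""
    return pd.to_datetime(dates, format="%y/%m/%d")

# Values copied from the first rows printed above.
sample = pd.Series(["20/01/01", "20/01/01"], name="date")
parsed = parse_race_dates(sample)
print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

Passing an explicit `format` avoids pandas guessing between day-first and year-first layouts, which matters for ambiguous strings like these.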

 

# <font color=red>4. Data Cleaning </font>

<div class="alert alert-block alert-info">
<b>Data cleaning</b>  refers to the process of identifying and correcting errors, inconsistencies, and inaccuracies in a dataset to improve its quality and reliability for analysis. It involves several steps, including handling missing or incomplete data, correcting data format issues, removing duplicate records, and dealing with outliers or anomalies.
</div>

In [6]:
# Display a concise summary of the DataFrame
race_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14794 entries, 0 to 14793
Data columns (total 19 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   rid          14794 non-null  int64  
 1   course       14794 non-null  object 
 2   time         14794 non-null  object 
 3   date         14794 non-null  object 
 4   title        14794 non-null  object 
 5   rclass       7196 non-null   object 
 6   band         5333 non-null   object 
 7   ages         14794 non-null  object 
 8   distance     14794 non-null  object 
 9   condition    14794 non-null  object 
 10  hurdles      3067 non-null   object 
 11  prizes       14794 non-null  object 
 12  winningTime  14794 non-null  float64
 13  prize        14794 non-null  int64  
 14  metric       14794 non-null  float64
 15  countryCode  14794 non-null  object 
 16  ncond        14794 non-null  int64  
 17  class        14794 non-null  int64  
 18  currency     9281 non-null   object 
dtypes: f

In [7]:
# Check for missing values in each column
missing_values = race_df.isnull().sum()

# Display the missing values count for each column
print("Missing values in each column:")
print(missing_values)

Missing values in each column:
rid                0
course             0
time               0
date               0
title              0
rclass          7598
band            9461
ages               0
distance           0
condition          0
hurdles        11727
prizes             0
winningTime        0
prize              0
metric             0
countryCode        0
ncond              0
class              0
currency        5513
dtype: int64
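Raw counts are easier to judge as a share of the 14794 rows: `hurdles` is missing in about 79.3% of races, `band` in about 64.0%, `rclass` in about 51.4%, and `currency` in about 37.3%. The sketch below shows one way to produce such a report. `missing_report` is a hypothetical helper, and the toy frame is illustrative data rather than rows from races_2020.csv.

```python
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages, largest first."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns that actually have gaps.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Toy data for illustration only.
toy = pd.DataFrame({
    "rclass": ["Class 4", None, None, "Class 1"],
    "hurdles": [None, None, None, "9 hurdles"],
    "course": ["A", "B", "C", "D"],
})
report = missing_report(toy)
print(report)
```

On the real data this would be called as `missing_report(race_df)`.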


In [9]:
# Remove the 'hurdles' column from the dataframe
race_df = race_df.drop(columns=['hurdles'])


In [10]:
# Verify the changes
race_df.head()

|   | rid | course | time | date | title | rclass | band | ages | distance | condition | prizes | winningTime | prize | metric | countryCode | ncond | class | currency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10312 | Fakenham | 02:55 | 20/01/01 | Happy New Year Maiden Hurdle (Div I) | Class 4 | NaN | 4yo+ | 2m | Good To Soft | [5198.4, 1526.4, 763.2, 381.6] | 253.88 | 7869 | 3218.0 | GB | 10 | 4 | NaN |
| 1 | 10896 | Cheltenham | 03:50 | 20/01/01 | EBF Stallions &amp; Cheltenham Pony Club (A St... | Class 1 | NaN | 4yo | 1m6f | Soft | [14237.5, 5342.5, 2675.0, 1332.5, 670.0, 335.0] | 206.55 | 24592 | 2815.0 | GB | 5 | 1 | NaN |
| 2 | 23038 | Tramore (IRE) | 02:55 | 20/01/01 | Jerry O'Donovan Memorial Rated Novice Chase | NaN | NaN | 5yo+ | 2m | Soft | [7387.5, 2387.5, 1137.5, 512.5, 262.5, 137.5] | 266.4 | 11826 | 3218.0 | IE | 5 | 0 | NaN |
| 3 | 23986 | Fairyhouse (IRE) | 02:40 | 20/01/01 | Follow Fairyhouse On Social Media Beginners Chase | NaN | NaN | 5yo+ | 2m5f | Yielding | [8274.0, 2674.0, 1274.0, 574.0, 294.0, 154.0] | 340.9 | 13244 | 4223.0 | IE | 6 | 0 | NaN |
| 4 | 25123 | Fairyhouse (IRE) | 02:05 | 20/01/01 | Fairyhouse Launches New Brand In 2020 Handicap... | NaN | 80-109 | 4yo+ | 3m | Yielding | [7092.0, 2292.0, 1092.0, 492.0, 252.0, 132.0] | 389.0 | 11352 | 4827.0 | IE | 6 | 0 | NaN |
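The `prizes` column has dtype `object`: each cell is the text of a list, not a list of numbers. A minimal sketch of converting it is shown below, using `ast.literal_eval` from the standard library. `parse_prizes` is a hypothetical helper, and the two sample rows are copied from the output above. In those rows the `prize` column equals the sum of `prizes` with the fraction dropped; that relationship is an observation on the rows shown, not a documented property of the dataset.

```python
import ast
import pandas as pd

def parse_prizes(text: str) -> list:
    """Convert a string such as "[5198.4, 1526.4]" into a list of floats."""
    return [float(value) for value in ast.literal_eval(text)]

# Two rows copied from the head() output above.
sample = pd.DataFrame({
    "prizes": ["[5198.4, 1526.4, 763.2, 381.6]",
               "[14237.5, 5342.5, 2675.0, 1332.5, 670.0, 335.0]"],
    "prize": [7869, 24592],
})
sample["prize_list"] = sample["prizes"].apply(parse_prizes)
sample["prize_sum"] = sample["prize_list"].apply(sum)
print(sample[["prize", "prize_sum"]])
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text read from a file.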


In [12]:
# Fill missing values in 'rclass' and 'band' columns with 'Unknown'
# (assign the result back rather than using inplace=True on a column selection)
race_df['rclass'] = race_df['rclass'].fillna('Unknown')
race_df['band'] = race_df['band'].fillna('Unknown')

In [14]:
# Verify that the missing values are handled
missing_values_after_fill = race_df[['rclass', 'band']].isnull().sum()
missing_values_after_fill

rclass    0
band      0
dtype: int64

In [16]:
# Find the most common value in the 'currency' column
most_common_currency = race_df['currency'].mode()[0]

# Fill missing values in 'currency' with the most common currency
race_df['currency'] = race_df['currency'].fillna(most_common_currency)

In [17]:
# Verify that the missing values are handled
missing_values_currency_after_fill = race_df['currency'].isnull().sum()
missing_values_currency_after_fill

0

In [18]:
# Check for missing values in each column
missing_values = race_df.isnull().sum()

# Display the missing values count for each column
print("Missing values in each column:")
print(missing_values)

Missing values in each column:
rid            0
course         0
time           0
date           0
title          0
rclass         0
band           0
ages           0
distance       0
condition      0
prizes         0
winningTime    0
prize          0
metric         0
countryCode    0
ncond          0
class          0
currency       0
dtype: int64
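With missing values handled, format issues remain: `distance` is stored as text in miles and furlongs (for example `2m5f`), while `metric` appears to hold the same distance in metres. The sketch below converts the text form to metres. The factors 1609 and 201 are assumptions chosen because they reproduce the `metric` values in the rows shown earlier; `distance_to_metres` is a hypothetical helper, and strings in any other form (yards, fractions) return `None` rather than a guess.

```python
import re
from typing import Optional

# Assumed conversion factors: they reproduce the `metric` values in the
# sample rows (e.g. "2m" -> 3218.0), but are not documented by the dataset.
METRES_PER_MILE = 1609
METRES_PER_FURLONG = 201

# Optional whole miles followed by optional whole furlongs, e.g. "2m5f", "3m", "7f".
_DISTANCE_RE = re.compile(r"^(?:(\d+)m)?(?:(\d+)f)?$")

def distance_to_metres(text: str) -> Optional[float]:
    """Convert a miles/furlongs string to metres, or None if unrecognised."""
    match = _DISTANCE_RE.match(text.strip())
    if not match or not any(match.groups()):
        return None
    miles = int(match.group(1) or 0)
    furlongs = int(match.group(2) or 0)
    return float(miles * METRES_PER_MILE + furlongs * METRES_PER_FURLONG)

# Distances taken from the first five rows of the dataset.
for raw in ["2m", "1m6f", "2m5f", "3m"]:
    print(raw, distance_to_metres(raw))
```

Comparing the result against `metric` for every row would be a quick consistency check on both columns.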


In [19]:
# Check for duplicate rows in the dataframe
duplicate_values = race_df.duplicated().sum()

# Display the number of duplicate rows
print(f"\nNumber of duplicate rows: {duplicate_values}")



Number of duplicate rows: 0
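The renaming step that follows assigns a list of names by position, which silently mislabels columns if their order ever changes. A sketch of a more defensive alternative is to map old names to new ones with `DataFrame.rename`. Only three of the pairs are shown here, and the toy frame deliberately uses a different column order to show that the mapping does not depend on position.

```python
import pandas as pd

# Subset of the old-to-new name pairs implied by the positional list.
column_map = {
    "rid": "Race_ID",
    "course": "Race_course",
    "rclass": "Race_class",
}

# Toy frame with columns in a different order from the real file.
toy = pd.DataFrame({
    "course": ["Fakenham"],
    "rid": [10312],
    "rclass": ["Class 4"],
})
renamed = toy.rename(columns=column_map)
print(renamed.columns.tolist())
```

Columns not listed in the mapping keep their original names, so the mapping can be built up incrementally.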


In [22]:
# Define the new column names
new_column_names = ['Race_ID', 'Race_course', 'Time', 'Date', 'Race_title', 'Race_class', 'Performance_bands', 'Age_group', 
                    'Race_distance', 'Track_condition', 'Prize_info', 'Winning_time', 'Prize_amount', 'Race_speed', 'Country_code', 
                    'Temp_condition', 'Class_indicator', 'Currency']

# Assign the new column names to the dataframe
race_df.columns = new_column_names

# Verify the changes
print(race_df.head())


   Race_ID       Race_course   Time      Date  \
0    10312          Fakenham  02:55  20/01/01   
1    10896        Cheltenham  03:50  20/01/01   
2    23038     Tramore (IRE)  02:55  20/01/01   
3    23986  Fairyhouse (IRE)  02:40  20/01/01   
4    25123  Fairyhouse (IRE)  02:05  20/01/01   

                                          Race_title Race_class  \
0               Happy New Year Maiden Hurdle (Div I)    Class 4   
1  EBF Stallions &amp; Cheltenham Pony Club (A St...    Class 1   
2        Jerry O'Donovan Memorial Rated Novice Chase    Unknown   
3  Follow Fairyhouse On Social Media Beginners Chase    Unknown   
4  Fairyhouse Launches New Brand In 2020 Handicap...    Unknown   

  Performance_bands Age_group Race_distance Track_condition  \
0           Unknown      4yo+            2m    Good To Soft   
1           Unknown       4yo          1m6f            Soft   
2           Unknown      5yo+            2m            Soft   
3           Unknown      5yo+          2m5f       