# `clean_races_dates.ipynb`

### Author: Anthony Hein

#### Last updated: 10/19/2021

# Overview:

After cleaning the datasets, we found one type of error that had been missed: some dates carry an extraneous ` 00:00` at the end, and some have a year written with only one digit, like `9/12/30`, which doesn't work in some formulas and should instead be `09/12/30`. We fix both here.
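As a minimal sketch of why these two forms cause trouble, the check below assumes the dates are meant to be read as two-digit year/month/day (as in `97/01/09`) and parsed with Python's `datetime.strptime`. The format string `DATE_FORMAT` and the helper `parses` are illustrative assumptions, not part of the original notebook.

```python
from datetime import datetime

# Assumption: dates are year/month/day with a two-digit year, as in '97/01/09'.
DATE_FORMAT = "%y/%m/%d"


def parses(date: str) -> bool:
    """Return True if `date` parses under DATE_FORMAT (hypothetical helper)."""
    try:
        datetime.strptime(date, DATE_FORMAT)
        return True
    except ValueError:
        return False


# One well-formed date, the two malformed variants, and the repaired form.
for candidate in ["97/01/09", "9/12/30", "09/12/30 00:00", "09/12/30"]:
    print(candidate, parses(candidate))
```

Under this assumption, the one-digit year and the trailing ` 00:00` both raise `ValueError`, while the padded, time-free form parses.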

---

## Setup

In [1]:
import git
import os
from typing import List
from tqdm import tqdm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
BASE_DIR = git.Repo(os.getcwd(), search_parent_directories=True).working_dir
BASE_DIR

'/Users/anthonyhein/Desktop/SML310/project'

---

## Load `races_aticnmi.csv`

In [4]:
races_aticnmi = pd.read_csv(f"{BASE_DIR}/data/csv/races_aticnmi.csv", low_memory=False) 
races_aticnmi.head()

| Unnamed: 0 | rid | course | time | date | hurdles | prizes | winningTime | metric | countryCode | ncond | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 302858 | Thurles (IRE) | 01:15 | 97/01/09 | | [] | 277.2 | 3821.0 | IE | 1 | 0 |
| 1 | 291347 | Punchestown (IRE) | 03:40 | 97/02/16 | | [] | 447.2 | 5229.0 | IE | 5 | 0 |
| 2 | 377929 | Leopardstown (IRE) | 03:00 | 97/05/11 | | [] | 106.4 | 1609.0 | IE | 4 | 0 |
| 3 | 275117 | Curragh (IRE) | 03:35 | 97/05/25 | | [] | 125.9 | 2011.0 | IE | 4 | 0 |
| 4 | 66511 | Leopardstown (IRE) | 04:30 | 97/06/02 | | [] | 116.3 | 1810.0 | IE | 1 | 0 |

In [5]:
races_aticnmi.shape

(19510, 11)

---

## Fix Date

In [10]:
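def fix_date(date: str) -> str:
    """Drop a ' 00:00' time suffix and zero-pad a one-digit leading field."""
    # Hack: strip the stray time component here until it is fixed upstream.
    idx = date.find(' 00:00')
    if idx >= 0:
        date = date[:idx]
    # Hack: prepend '0' when the first field has one character, e.g. '9/12/30'.
    # The length check avoids an IndexError on empty or one-character strings.
    if len(date) > 1 and date[1] == '/':
        date = '0' + date
    return date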

In [11]:
fix_date('6/01/20 00:00')

'06/01/20'

In [13]:
races_aticnmi['date'] = races_aticnmi['date'].map(fix_date)
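
A quick sanity check after the `map` can confirm that every date now has the expected shape. The sketch below is an assumption-laden illustration rather than part of the original notebook: the pattern `CLEAN_DATE` (exactly two digits per field) and the helper `find_malformed` are hypothetical names, and the sample list stands in for the real column.

```python
import re
from typing import Iterable, List

# Assumption: a clean date has exactly two digits per field, e.g. '09/12/30'.
CLEAN_DATE = re.compile(r"\d{2}/\d{2}/\d{2}")


def find_malformed(dates: Iterable[str]) -> List[str]:
    """Return the values that do not fully match CLEAN_DATE (hypothetical helper)."""
    return [d for d in dates if not CLEAN_DATE.fullmatch(d)]


# Small stand-in sample: two clean dates and the two known error types.
sample = ["97/01/09", "06/01/20", "9/12/30", "09/12/30 00:00"]
print(find_malformed(sample))
```

Because the helper accepts any iterable of strings, it could be called as `find_malformed(races_aticnmi['date'])`. An empty result would indicate that every date has the two-digit form; a non-empty one would surface cases `fix_date` does not cover, such as a one-digit month or day.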

---

## Save Dataframes

In [14]:
races_aticnmi.to_csv(f"{BASE_DIR}/data/csv/races_aticnmid.csv", index=False)

---