# Galopp-Preprocessor

For this step of the project I use a Jupyter notebook, because it makes visualization much faster and data preprocessing more flexible. (The exploratory data analysis will also be done in a Jupyter notebook.)

In [None]:
print("Preprocessing...")

### Imports

In [1]:
import pandas as pd
import numpy as np

### Load dataset

In [2]:
galopp = pd.read_csv("csvs/all_races.csv")
galopp.sample(5)

Unnamed: 0,Date,Location,Distance,Prize,Category,Class,Ground_state,Horses
5619,05. August 2017,Bad Doberan,1900 m,7000 €,E,III,Boden: gut,"[' 1. ', ' Del C..."
2887,15. Juni 2015,Hoppegarten,1800 m,8000 €,,III,Boden: gut,"[' 1. ', ' Cruci..."
8056,31. August 2019,Baden-Baden,1600 m,11000 €,D,III,Boden: gut,"[' 1. ', ' Anpak..."
8127,15. September 2019,Dortmund (Turf),1600 m,5100 €,D,,Boden: gut,"[' 1. ', ' Palao..."
6977,16. September 2018,Hannover,2400 m,4000 €,F,IV,Boden: gut,"[' 1. ', ' The T..."


### Look at general information

In [3]:
galopp.columns

Index(['Date', 'Location', 'Distance', 'Prize', 'Category', 'Class',
       'Ground_state', 'Horses'],
      dtype='object')

In [4]:
print(galopp.isna().sum())
print("")
print(galopp.isna().sum()/len(galopp)*100)

Date              15
Location          15
Distance           0
Prize              0
Category        3115
Class           4245
Ground_state       0
Horses             0
dtype: int64

Date             0.162725
Location         0.162725
Distance         0.000000
Prize            0.000000
Category        33.792580
Class           46.051204
Ground_state     0.000000
Horses           0.000000
dtype: float64
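
The two printouts above can also be combined into a single table. A minimal sketch on a toy frame (the frame `toy` and the column labels `missing` and `percent` are made up for illustration):

```python
import pandas as pd

# Toy frame standing in for galopp (values are made up)
toy = pd.DataFrame({
    "Date": ["05. August 2017", None, "15. Juni 2015", "31. August 2019"],
    "Category": ["E", None, None, "D"],
    "Distance": ["1900 m", "1800 m", "1600 m", "2400 m"],
})

# Count and percentage of missing values per column, side by side
missing = toy.isna().sum()
summary = pd.DataFrame({
    "missing": missing,
    "percent": missing / len(toy) * 100,
})
print(summary)
```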


In [5]:
galopp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9218 entries, 0 to 9217
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Date          9203 non-null   object
 1   Location      9203 non-null   object
 2   Distance      9218 non-null   object
 3   Prize         9218 non-null   object
 4   Category      6103 non-null   object
 5   Class         4973 non-null   object
 6   Ground_state  9218 non-null   object
 7   Horses        9218 non-null   object
dtypes: object(8)
memory usage: 576.2+ KB


The first thing I notice is that there are some NaN values: 15 each in _Date_ and _Location_ (only 0.16% of the whole set). Because so few are missing, and these columns are not really important to my goal, I fill them with the mode (not normally appropriate for _Date_, but it should be sufficient here).

The columns _Category_ and _Class_ are missing 3115 and 4245 values, which is ~34% and ~46%. Because they are categorical, they could normally be filled with the mode, but with so many values missing, and since they are not important for my goal, I decide to drop them.

Next, _Distance_ and _Prize_ should be converted to int. Before that, the 'm' in _Distance_ and the '€' in _Prize_ have to be removed. (Every entry in both columns also contains a '\xa0', a non-breaking space, which has to be removed before converting to int as well.) Finally, empty strings are filled with 0, because they would otherwise raise a ValueError when converting to int.

The _Ground_state_ column always contains the prefix 'Boden: ', which can also be removed.

In [6]:
# Drop columns
galopp.drop(columns=["Category", "Class"], inplace=True)

# Fill missing dates and locations with the mode
galopp["Date"].fillna(galopp["Date"].mode(), inplace=True)
galopp["Location"].fillna(galopp["Location"].mode(), inplace=True)

# Remove units
galopp["Distance"] = galopp["Distance"].apply(lambda x: x.replace("m", ""))
galopp["Prize"] = galopp["Prize"].apply(lambda x: x.replace("€", ""))

# Remove non-breaking spaces (\xa0)
galopp["Distance"] = galopp["Distance"].apply(lambda x: x.replace("\xa0", ""))
galopp["Prize"] = galopp["Prize"].apply(lambda x: x.replace("\xa0", ""))

# Fill empty strings with 0
galopp["Distance"] = galopp["Distance"].apply(lambda x: "0" if len(x) == 0 else x)
galopp["Prize"] = galopp["Prize"].apply(lambda x: "0" if len(x) == 0 else x)

# Change datatype to int
galopp["Distance"] = galopp["Distance"].astype(int)
galopp["Prize"] = galopp["Prize"].astype(int)

# Remove 'Boden: ' prefix
galopp["Ground_state"] = galopp["Ground_state"].apply(lambda x: x.replace("Boden: ", ""))
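
The same cleaning steps can also be sketched with pandas' vectorised string methods instead of `apply`. The toy values below are made up, the helper name `to_int_column` is hypothetical, and the extra `str.strip()` is an addition that is not part of the cell above:

```python
import pandas as pd

def to_int_column(series, unit):
    """Strip a unit and non-breaking spaces, then convert to int (empty -> 0)."""
    cleaned = (
        series.str.replace(unit, "", regex=False)
              .str.replace("\xa0", "", regex=False)
              .str.strip()
    )
    # Empty strings would raise a ValueError in astype(int), so map them to "0"
    cleaned = cleaned.replace("", "0")
    return cleaned.astype(int)

# Toy frame standing in for galopp (values are made up)
toy = pd.DataFrame({
    "Distance": ["1900\xa0m", "1800\xa0m", "\xa0m"],
    "Prize": ["7000\xa0€", "\xa0€", "11000\xa0€"],
})
toy["Distance"] = to_int_column(toy["Distance"], "m")
toy["Prize"] = to_int_column(toy["Prize"], "€")
print(toy)
```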

Looking at the dataframe again:

In [7]:
galopp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9218 entries, 0 to 9217
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Date          9203 non-null   object
 1   Location      9203 non-null   object
 2   Distance      9218 non-null   int64 
 3   Prize         9218 non-null   int64 
 4   Ground_state  9218 non-null   object
 5   Horses        9218 non-null   object
dtypes: int64(2), object(4)
memory usage: 432.2+ KB
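
_Date_ and _Location_ still show 9203 non-null values in the output above. A likely explanation is that `mode()` returns a Series rather than a scalar, so `fillna` aligns it on the index and only fills labels that appear in both. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Toy column with two gaps (values are made up)
s = pd.Series(["Hoppegarten", None, "Hamburg", "Hoppegarten", None])

# mode() returns a Series (here with the single label 0)
mode_series = s.mode()

# Filling with the Series aligns on the index: labels 1 and 4 stay empty
filled_by_series = s.fillna(mode_series)

# Filling with the scalar first mode fills every gap
filled_by_scalar = s.fillna(mode_series[0])

print(filled_by_series.isna().sum())
print(filled_by_scalar.isna().sum())
```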


In [8]:
galopp.sample(10)

Unnamed: 0,Date,Location,Distance,Prize,Ground_state,Horses
8427,16. Februar 2020,Dortmund (Sand),1700,3100,nass,"[' 1. ', ' Archi..."
5569,26. Juli 2017,Hoppegarten,1800,7500,weich,"[' 1. ', ' Kings..."
1507,14. Juni 2014,Dresden,1500,0,gut,"[' 1. ', ' Al Qu..."
211,28. April 2013,Ffm,1600,0,weich,"[' 1. ', ' Rabia..."
1516,15. Juni 2014,Hoppegarten,1000,0,gut,"[' 1. ', ' Alcoh..."
7855,21. Juli 2019,Düsseldorf,1400,10000,gut,"[' 1. ', ' Hayla..."
8474,09. Mai 2020,Mülheim,1300,3000,gut,"[' 1. ', ' Big B..."
3730,30. Januar 2016,Neuss (Sand),1500,5000,nass,"[' 1. ', ' Oroto..."
6604,01. Juli 2018,Hamburg,2000,8200,gut,"[' 1. ', ' Zinda..."
7363,14. April 2019,Zweibrücken,2950,2600,gut,"[' 1. ', ' Pearl..."


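The datatypes now fit, and the samples look good as well. One caveat: `info()` still reports 9203 non-null values for _Date_ and _Location_, so the mode fill did not take effect; `mode()` returns a Series, and filling with its first entry (`mode()[0]`) would close the gaps.
Apart from that, these columns are done. Let's go on with the horses per race. For this, I intend to:
- Get the list of horses and make a separate dataframe from it
- Clean this dataset (it has no column names or datatypes yet)
- Return the cleaned dataframe as a list and replace the original entry with it
- Save each horse participation in another dataframe / CSV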

### Clean races and generate a participants dataframe

In [9]:
def clean_placement_string(x):
    
    if "NS" in x: # Treat "Nichtstarter", horses who didn't start the race
        x = -1
    else:
        x = x.replace(".","")
        x = x.replace("'","")
        x = x.replace("[","")
        x = x.replace("]", "")
        x = x.strip()
    
    return x

def clean_horse_name_string(x):
    x = x.replace("'", "")
    x = x.strip()
    x = x.lower()
    return x

def clean_jockey_name_string(x):
    x = x.replace("'", "")
    x = x.strip()
    x = x.lower()

    if "." in x:
        x = x.split(".")[-1]  # Get surname: everything after the last dot
    elif len(x.split()) == 2:
        x = x.split()[1]  # Get surname when both names are in the name string

    return x

def clean_trainer_name_string(x):
    x = x.replace("'", "")
    x = x.strip()
    x = x.lower()

    if "." in x:
        x = x.split(".")[-1]  # Get surname: everything after the last dot
    elif len(x.split()) == 2:
        x = x.split()[1]  # Get surname when both names are in the name string

    return x

def clean_weight_string(x):
    x = x.replace("'", "")
    x = x.replace(",",".")
    x = x.replace("]", "")
    x = x.strip()
    return x
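
The jockey and trainer functions share the same surname logic, so it could live in a single helper. The sketch below uses a hypothetical name, `extract_surname`, and additionally strips the whitespace that can remain after a dot; the example names are placeholders, not taken from the dataset:

```python
def extract_surname(raw):
    """Return a lower-case surname from strings like " 'M. Mustermann'"."""
    name = raw.replace("'", "").strip().lower()
    if "." in name:
        # Keep everything after the last dot, e.g. "m. mustermann" -> "mustermann"
        name = name.split(".")[-1].strip()
    elif len(name.split()) == 2:
        # "first last" -> "last"
        name = name.split()[1]
    return name

for raw in [" 'M. Mustermann'", "'Max Mustermann'", " 'Mustermann' "]:
    print(extract_surname(raw))
```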

In [10]:
# Load, clean, and replace each race
races = []
horses = []
columns = ["Place", "Horse_name", "Jockey_name", "Trainer_name", "Weight"]

for row in galopp["Horses"]:
    
    # Load row as a separate dataset and make it a dataframe for easier editing
    split = row.split(", ")
    try:
        row_reshaped = np.array(split).reshape((-1, 5))
        race_df = pd.DataFrame(data=row_reshaped, columns=columns)

        # Clean dataset (and save a version with the races)
        race_df["Place"] = race_df["Place"].apply(clean_placement_string)
        race_df["Horse_name"] = race_df["Horse_name"].apply(clean_horse_name_string)
        race_df["Jockey_name"] = race_df["Jockey_name"].apply(clean_jockey_name_string)
        race_df["Trainer_name"] = race_df["Trainer_name"].apply(clean_trainer_name_string)
        race_df["Weight"] = race_df["Weight"].apply(clean_weight_string)

        # Add each participant to a list
        for horse in race_df.values:
            horses.append(horse)

        # Add the whole race to a list
        races.append(race_df.values)
    except ValueError:
        races.append("DELETE THIS ROW") # Some rows don't split into 5 columns... delete these rows afterwards!
    
# Save participations for further inspection
all_participations_df = pd.DataFrame(data=horses, columns=columns)
all_participations_df.to_csv("csvs/participations.csv", index=False)
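
Splitting on `", "` breaks whenever a name itself contains a comma followed by a space, which may be one reason some rows do not reshape into five columns. Assuming the `Horses` strings are valid Python list literals (as the sample output suggests), `ast.literal_eval` from the standard library could parse them instead. A sketch with a made-up row and a hypothetical helper `parse_race`:

```python
import ast

import numpy as np
import pandas as pd

COLUMNS = ["Place", "Horse_name", "Jockey_name", "Trainer_name", "Weight"]

def parse_race(row):
    """Parse one Horses string into a DataFrame, or return None if it does not fit."""
    try:
        fields = ast.literal_eval(row)
    except (ValueError, SyntaxError):
        return None  # not a valid list literal
    if not isinstance(fields, list) or len(fields) % len(COLUMNS) != 0:
        return None  # cannot be split into rows of five fields
    data = np.array(fields, dtype=object).reshape(-1, len(COLUMNS))
    return pd.DataFrame(data, columns=COLUMNS)

# Made-up row: the second trainer name contains a comma
row = ("[' 1. ', ' Alpha ', ' A. Rider ', ' B. Coach ', ' 57,0 ', "
       "' 2. ', ' Beta ', ' C. Rider ', ' Coach, D. ', ' 55,5 ']")
race = parse_race(row)
print(race)
print(parse_race("[' 1. ', ' Alpha ']"))  # too few fields for one participant
```

The returned frame still holds the raw strings, so the cleaning functions from above would be applied to it afterwards.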

In [11]:
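# Replace the old races in the dataframe with the cleaned races
# Each race is flattened to a single row; rows that could not be parsed become []
# An object array is filled element by element, because np.array() rejects
# ragged input in recent NumPy versions
flattened_races = np.empty(len(races), dtype=object)
for i, race in enumerate(races):
    if isinstance(race, np.ndarray):
        flattened_races[i] = race.reshape(1, -1)
    else:
        flattened_races[i] = []

galopp["Horses"] = flattened_races
galopp.to_csv("csvs/galopp_cleaned.csv", index=False)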
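
Rows that could not be parsed end up with an empty `Horses` entry and still have to be removed. A minimal sketch of that step on a toy frame (the values are made up and the name `has_horses` is hypothetical):

```python
import pandas as pd

# Toy frame: the second race could not be parsed and holds an empty list
toy = pd.DataFrame({
    "Location": ["Hoppegarten", "Hamburg", "Dresden"],
    "Horses": [["1", "alpha"], [], ["1", "beta"]],
})

# Keep only rows whose Horses entry contains at least one element
has_horses = toy["Horses"].apply(lambda entry: len(entry) > 0)
toy = toy[has_horses].reset_index(drop=True)
print(toy["Location"].tolist())
```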