# Galopp-Preprocessor

For this step in the project I will use jupyter notebook, because it is way faster to visualize and more flexible on doin data preprocessing. (The exploratory data analysis part will also be with jupyter notebook)

### Imports

In [1]:
import pandas as pd

### Load dataset

In [2]:
galopp = pd.read_csv("all_races.csv")
galopp.sample(5)

Unnamed: 0,Date,Location,Distance,Prize,Category,Class,Ground_state,Horses
2504,06. April 2015,Köln,1850 m,4000 €,,,Boden: weich ...,"[' 1. ', ' Balti..."
4446,13. August 2016,Hoppegarten,1600 m,5100 €,D,3yo,Boden: gut,"[' 1. ', ' Calan..."
5110,01. Mai 2017,Leipzig,1850 m,5100 €,D,III,Boden: gut,"[' 1. ', ' Bursc..."
3900,23. April 2016,Mülheim,1400 m,7000 €,D,III,Boden: gut,"[' 1. ', ' My Ma..."
5509,11. Juli 2017,München,1300 m,6000 €,E,IV,Boden: gut,"[' 1. ', ' Arine..."


In [3]:
galopp.columns

Index(['Date', 'Location', 'Distance', 'Prize', 'Category', 'Class',
       'Ground_state', 'Horses'],
      dtype='object')

In [4]:
print(galopp.isna().sum())
print("")
print(galopp.isna().sum()/len(galopp)*100)

Date              15
Location          15
Distance           0
Prize              0
Category        3115
Class           4245
Ground_state       0
Horses             0
dtype: int64

Date             0.162725
Location         0.162725
Distance         0.000000
Prize            0.000000
Category        33.792580
Class           46.051204
Ground_state     0.000000
Horses           0.000000
dtype: float64


In [5]:
galopp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9218 entries, 0 to 9217
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Date          9203 non-null   object
 1   Location      9203 non-null   object
 2   Distance      9218 non-null   object
 3   Prize         9218 non-null   object
 4   Category      6103 non-null   object
 5   Class         4973 non-null   object
 6   Ground_state  9218 non-null   object
 7   Horses        9218 non-null   object
dtypes: object(8)
memory usage: 576.2+ KB


First thing I see is that there are some NaN values. 15 each on _Date_ and _Location_ (which are only 0.16% of the whole set). Because only a few are missing, and they are not really important to my goal, I fill them with the mode (normally not really applicable for the _Date_ , but it should be sufficient).

Also, the columns 'Category' and 'Class' are missing 3115 and 4245, which are ~33% and ~46%. Normally, because they are categorical, they could be filled with the mode, but so many values are missing, so I decide to drop them, as they are not important for my goal.

Next, the datatypes should be changed for _Distance_ and _Prize_ to int. But before that, the 'm' in _Distance_ and the '€' in _Prize_ have to be removed. (Also, there are '\xa0's in both columns for each entry! These have to be removed too before converting to int.). And, at last, fill the empty strings with a 1, otherwise this would lead to ValueErrors when converting to int. These can be set to the mode (because it was originally a categorical entry) after the columns are int.

The _Ground_state_ always contains the prefix 'Boden: ', this can also be removed.

In [6]:
galopp["Date"].fillna(galopp["Date"].mode(), inplace=True)
galopp["Location"].fillna(galopp["Location"], inplace=True)
galopp.drop(columns=["Category", "Class"], inplace=True)

galopp["Distance"] = galopp["Distance"].apply(lambda x: x.replace("m", ""))
galopp["Prize"] = galopp["Prize"].apply(lambda x: x.replace("€", ""))

galopp["Distance"] = galopp["Distance"].apply(lambda x: x.replace("\xa0", ""))
galopp["Prize"] = galopp["Prize"].apply(lambda x: x.replace("\xa0", ""))

galopp["Distance"] = galopp["Distance"].apply(lambda x: "1" if len(x) == 0 else x)
galopp["Prize"] = galopp["Prize"].apply(lambda x: "1" if len(x) == 0 else x)

galopp["Distance"].astype(int)
galopp["Prize"].astype(int)

galopp["Distance"] = galopp["Distance"].apply(lambda x: galopp["Distance"].mode() if x == 1 else x)
galopp["Prize"] = galopp["Prize"].apply(lambda x: galopp["Distance"].mode() if x == 1 else x)

galopp["Ground_state"] = galopp["Ground_state"].apply(lambda x: x.replace("Boden: ", ""))

In [7]:
galopp["Prize"].unique()

array(['0', '6000', '6200', '5000', '8000', '6450', '5350', '4000',
       '5600', '6400', '9000', '4800', '5100', '8500', '5200', '8750',
       '10000', '3500', '3600', '4850', '4500', '2000', '3200', '2600',
       '7000', '8200', '27000', '6500', '25000', '2200', '3100', '55000',
       '5500', '3000', '2050', '5450', '9500', '12000', '4350', '3400',
       '4200', '2500', '1600', '5550', '70000', '3800', '7500', '52000',
       '22500', '10500', '2700', '8950', '153000', '6600', '3450',
       '125000', '10400', '6100', '13500', '7700', '9250', '7200',
       '35000', '10200', '15000', '16000', '3950', '7150', '8700', '9700',
       '650000', '4444', '20000', '4644', '6666', '37000', '155000',
       '6750', '500000', '5300', '45000', '50000', '175000', '4400',
       '4600', '4250', '5750', '102500', '9750', '11000', '250000',
       '12500', '6050', '4700', '85000', '200000', '8150', '105000',
       '3350', '2800', '5950', '2850', '3150', '4550', '9400', '1800',
       '5850', 