# Galopp-Preprocessor

For this step of the project I use a Jupyter notebook, because it makes visualization faster and data preprocessing more flexible. (The exploratory data analysis will also be done in a Jupyter notebook.)

### Imports

In [1]:
import pandas as pd

### Load dataset

In [2]:
galopp = pd.read_csv("all_races.csv")
galopp.sample(5)

| | Date | Location | Distance | Prize | Category | Class | Ground_state | Horses |
|---|---|---|---|---|---|---|---|---|
| 5754 | 03. September 2017 | Baden-Baden | 1500 m | 6000 € | D | | Boden: gut | [' 1. ', ' Kilde... |
| 3200 | 22. August 2015 | Magdeburg | 1800 m | 3500 € | F | IV | Boden: gut | [' 1. ', ' Shinn... |
| 8650 | 13. Juni 2020 | Dresden | 1400 m | 4000 € | D | III | Boden: gut | [' 1. ', ' Miste... |
| 8290 | 31. Oktober 2019 | Halle | 1750 m | 4000 € | E | | Boden: gut stellenweise w... | [' 1. ', ' Newto... |
| 3575 | 14. November 2015 | Bremen | 1600 m | 5100 € | D | 3yo | Boden: w-s | [' 1. ', ' Moone... |


### Look at general information

In [3]:
galopp.columns

Index(['Date', 'Location', 'Distance', 'Prize', 'Category', 'Class',
       'Ground_state', 'Horses'],
      dtype='object')

In [4]:
print(galopp.isna().sum())
print("")
print(galopp.isna().sum()/len(galopp)*100)

Date              15
Location          15
Distance           0
Prize              0
Category        3115
Class           4245
Ground_state       0
Horses             0
dtype: int64

Date             0.162725
Location         0.162725
Distance         0.000000
Prize            0.000000
Category        33.792580
Class           46.051204
Ground_state     0.000000
Horses           0.000000
dtype: float64
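
The two printouts above can also be combined into a single table. The following is a minimal sketch on a made-up toy frame (the column values are illustrative, not taken from `all_races.csv`); it relies on `isna().mean()` giving the share of missing values per column.

```python
import pandas as pd

# Toy frame standing in for `galopp`; the values are made up for illustration.
toy = pd.DataFrame({
    "Date": ["01. Mai 2016", None, "28. Mai 2016", "10. Juni 2015"],
    "Class": [None, None, "III", None],
    "Distance": ["1800 m", "1600 m", "2000 m", "1400 m"],
})

# isna().sum() counts missing values per column,
# isna().mean() gives the missing share, scaled here to percent.
missing = pd.DataFrame({
    "missing": toy.isna().sum(),
    "percent": toy.isna().mean() * 100,
})

# Show the columns with the most missing values first.
print(missing.sort_values("percent", ascending=False))
```

Sorting by the percentage makes candidates for dropping stand out immediately.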


In [5]:
galopp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9218 entries, 0 to 9217
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Date          9203 non-null   object
 1   Location      9203 non-null   object
 2   Distance      9218 non-null   object
 3   Prize         9218 non-null   object
 4   Category      6103 non-null   object
 5   Class         4973 non-null   object
 6   Ground_state  9218 non-null   object
 7   Horses        9218 non-null   object
dtypes: object(8)
memory usage: 576.2+ KB


The first thing I notice is that there are some NaN values: 15 each in _Date_ and _Location_ (only 0.16% of the whole set). Because so few are missing, and they are not really important to my goal, I fill them with the mode (normally not a sensible choice for _Date_, but it should be sufficient here).

The columns _Category_ and _Class_ are missing 3115 and 4245 values, which is ~34% and ~46%. Because they are categorical, they could normally be filled with the mode, but with so many values missing I decide to drop them instead, as they are not important for my goal.
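
The two decisions above (fill the rare gaps with the mode, drop the sparse columns) can be sketched on toy data as follows. The frame and its values are made up for illustration. One detail worth knowing: `Series.mode()` returns a Series rather than a single value, so the sketch takes its first entry before passing it to `fillna`.

```python
import pandas as pd

# Toy frame standing in for `galopp`; the values are made up for illustration.
toy = pd.DataFrame({
    "Date": ["01. Mai 2016", None, "01. Mai 2016", "10. Juni 2015"],
    "Location": ["Hannover", "Hannover", None, "Dresden"],
    "Category": ["D", None, None, "F"],
})

# Drop a sparsely populated column instead of imputing it.
toy = toy.drop(columns=["Category"])

# Series.mode() returns a Series (there can be ties), so take its first entry
# to get one scalar that fillna applies to every missing row.
for col in ["Date", "Location"]:
    toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(toy.isna().sum())
```

Assigning the result back to the column avoids relying on `inplace=True` on a column selection.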

Next, _Distance_ and _Prize_ should be converted to int. Before that, the 'm' in _Distance_ and the '€' in _Prize_ have to be removed. (Every entry in both columns also contains a '\xa0', a non-breaking space, which has to be removed before converting to int.) Finally, empty strings are replaced with 0, because they would otherwise raise a ValueError during the conversion.

_Ground_state_ always contains the prefix 'Boden: ' (German for "ground"), which can be removed as well.

In [6]:
# Drop columns
galopp.drop(columns=["Category", "Class"], inplace=True)

# Fill dates and locations with the mode
galopp["Date"].fillna(galopp["Date"].mode(), inplace=True)
galopp["Location"].fillna(galopp["Location"].mode(), inplace=True)

# Remove units
galopp["Distance"] = galopp["Distance"].apply(lambda x: x.replace("m", ""))
galopp["Prize"] = galopp["Prize"].apply(lambda x: x.replace("€", ""))

# Remove non-breaking spaces (\xa0)
galopp["Distance"] = galopp["Distance"].apply(lambda x: x.replace("\xa0", ""))
galopp["Prize"] = galopp["Prize"].apply(lambda x: x.replace("\xa0", ""))

# Fill empty strings with 0
galopp["Distance"] = galopp["Distance"].apply(lambda x: "0" if len(x) == 0 else x)
galopp["Prize"] = galopp["Prize"].apply(lambda x: "0" if len(x) == 0 else x)

# Change datatype to int
galopp["Distance"] = galopp["Distance"].astype(int)
galopp["Prize"] = galopp["Prize"].astype(int)

# Remove 'Boden: ' prefix
galopp["Ground_state"] = galopp["Ground_state"].apply(lambda x: x.replace("Boden: ", ""))
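
As an alternative to the `apply` calls in the cell above, the same cleaning can be written with pandas' vectorized `.str` methods. This is a minimal sketch on a toy frame; the exact raw strings (where the `\xa0` sits, what an empty entry looks like) are assumptions made for illustration.

```python
import pandas as pd

# Toy frame; the raw string layout is an assumption, not copied from the data.
toy = pd.DataFrame({
    "Distance": ["1500\xa0m", "1800\xa0m", "\xa0m"],
    "Prize": ["6000\xa0€", "\xa0€", "3500\xa0€"],
    "Ground_state": ["Boden: gut", "Boden: w-s", "Boden: Sand"],
})

for col, unit in [("Distance", "m"), ("Prize", "€")]:
    cleaned = (
        toy[col]
        .str.replace("\xa0", "", regex=False)  # drop non-breaking spaces
        .str.replace(unit, "", regex=False)    # drop the unit
        .str.strip()                           # drop leftover whitespace
        .replace("", "0")                      # empty entries become "0"
    )
    toy[col] = cleaned.astype(int)

# Remove the constant prefix from the ground description.
toy["Ground_state"] = toy["Ground_state"].str.replace("Boden: ", "", regex=False)

print(toy)
```

Passing `regex=False` keeps the replacements literal, so no character is interpreted as a pattern.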

Looking at the dataframe again:

In [7]:
galopp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9218 entries, 0 to 9217
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Date          9203 non-null   object
 1   Location      9203 non-null   object
 2   Distance      9218 non-null   int64 
 3   Prize         9218 non-null   int64 
 4   Ground_state  9218 non-null   object
 5   Horses        9218 non-null   object
dtypes: int64(2), object(4)
memory usage: 432.2+ KB


In [8]:
galopp.sample(10)

| | Date | Location | Distance | Prize | Ground_state | Horses |
|---|---|---|---|---|---|---|
| 2857 | 10. Juni 2015 | Mons | 2300 | 5600 | Sand | [' 1. ', ' Princ... |
| 4107 | 28. Mai 2016 | Baden-Baden | 1800 | 10000 | gut | [' 1. ', ' Bearh... |
| 3954 | 01. Mai 2016 | Leipzig | 1850 | 10000 | gut | [' 1. ', ' Miss ... |
| 7899 | 28. Juli 2019 | Erbach | 2150 | 5000 | gut | [' 1. ', ' Smoke... |
| 9099 | 04. Oktober 2020 | Hannover | 2000 | 52000 | gut stellenweise weich ... | [' 1. ', ' Sky E... |
| 800 | 20. Oktober 2013 | Baden-Baden | 2200 | 0 | schwer | [' 1. ', ' Talit... |
| 6125 | 27. Februar 2018 | Neuss (Sand) | 1900 | 4000 | normal | [' 1. ', ' Simin... |
| 509 | 14. Juli 2013 | München | 1600 | 0 | gut | [' 1. ', ' Quick... |
| 5804 | 17. September 2017 | Hannover | 1400 | 5100 | gut | [' 1. ', ' Amora... |
| 1162 | 21. April 2014 | Hannover | 1900 | 0 | g-w | [' 1. ', ' Invad... |


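The datatypes now fit and the samples look good. However, `info()` still reports 9203 non-null entries for _Date_ and _Location_, so the 15 NaNs in each were not actually filled. The most likely reason is that `mode()` returns a Series, and `fillna` aligns a Series by index instead of applying one value to every missing row; passing the first entry of the mode would fix this.
Apart from that, everything that needs to be done for those columns is done. Let's go on with the horses per race. For this, I intend to:
- Get the list of horses and build a separate dataframe from it
- Clean this dataframe (it has no column names or datatypes yet)
- Return the cleaned dataframe as a list and replace the original entry with it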
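
The three steps above might look roughly like the sketch below. Since `read_csv` loads _Horses_ as text, the sketch assumes each cell is the string form of a Python list and parses it with `ast.literal_eval`. The cell value, the helper names `horses_to_frame` and `frame_to_list`, and the number of fields per horse are all hypothetical, because the real entries are truncated in the outputs above.

```python
import ast

import pandas as pd

# Made-up cell value; the items and the row width are assumptions.
raw = "[' 1. ', ' Horse A ', ' 2. ', ' Horse B ']"
FIELDS_PER_HORSE = 2  # hypothetical: position and name only


def horses_to_frame(cell, width):
    """Parse one Horses cell and lay it out as one row per horse."""
    items = [item.strip() for item in ast.literal_eval(cell)]
    rows = [items[i:i + width] for i in range(0, len(items), width)]
    return pd.DataFrame(rows)


def frame_to_list(frame):
    """Turn the cleaned dataframe back into a nested list."""
    return frame.values.tolist()


# Step 1: load the list into a dataframe.
frame = horses_to_frame(raw, FIELDS_PER_HORSE)

# Step 2: clean it, here by turning the position "1." into an integer.
frame[0] = frame[0].str.rstrip(".").astype(int)

# Step 3: return the cleaned rows as a list.
cleaned = frame_to_list(frame)
print(cleaned)
```

Applied to the real data, the width and the cleaning rules would have to be adapted to whatever fields each horse entry actually contains.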

### Load the list in a dataset

### Clean

### Return