# **Racing Data Cleaning Project**


## **1\. Introduction**

The goal of this project is to clean and prepare race data for later analysis. Main tasks include:
- Removing unnecessary data and duplicates.
- Handling missing values.
- Correcting data formats.

## **2\. Load libraries**

In [1]:
import os
import re
import pandas as pd
from pandas.api.types import CategoricalDtype

## **3\. Load data**

Here we load the data from various CSV files. Each file corresponds to a specific race.
We combine them into one DataFrame to simplify analysis.

In [2]:
def create_dataframe_from_events(base_path, years, event_type):
    if event_type == "Races":
        required_columns = ['Pos', 'Car #', 'Class', 'Drivers', 'Team', 'Car', 'Time', 'Laps', 'Gap']
        folder_name = "Races"
    elif event_type == "Qualifications":
        required_columns = ['Pos', 'Car #', 'Class', 'Drivers', 'Team', 'Car', 'Time', 'Laps', 'Gap']
        folder_name = "Qualifications"
    else:
        raise ValueError("Invalid event type provided. Choose 'Races' or 'Qualifications'.")

    all_data = []  # List to hold all data rows

    for year in years:  # Process each year in the list
        year_path = os.path.join(base_path, str(year))  # Path to the specific year

        if not os.path.exists(year_path):
            print(f"The directory for year {year} does not exist.")
            continue  # Skip to the next year if the directory doesn't exist

        for meeting in os.listdir(year_path):  # Iterate through all meetings in the year
            meeting_path = os.path.join(year_path, meeting)
            event_path = os.path.join(meeting_path, folder_name)  # Path to the 'Races' or 'Qualifications' folder

            if os.path.exists(event_path):
                for event_file in os.listdir(event_path):  # Iterate through all files in the specified folder
                    file_path = os.path.join(event_path, event_file)
                    try:
                        df = pd.read_csv(file_path)
                        if set(required_columns).issubset(df.columns):
                            df['Season'] = year
                            df['Meeting'] = meeting.replace("_", " ")
                            df['Event name'] = event_file.replace(".csv", "").replace("_", " ")
                            all_data.append(df[['Season', 'Meeting', 'Event name'] + required_columns])
                        else:
                            print(f"Skipping {file_path} due to missing required columns.")
                    except Exception as e:
                        print(f"Error reading {file_path}: {e}")

    # Concatenate all data into a single DataFrame
    final_df = pd.concat(all_data, ignore_index=True) if all_data else pd.DataFrame()
    return final_df

In [3]:
path = os.path.join(".", "data_csv")  # avoids the invalid "\d" escape in ".\data_csv"
years = [2021, 2022, 2023]
all_races = create_dataframe_from_events(path, years, "Races")
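
The loader above assumes a directory layout of `<base>/<year>/<meeting>/<Races|Qualifications>/<event>.csv` and derives `Season`, `Meeting` and `Event name` from the path. A minimal sketch of that derivation follows; `describe_event_file` and the example path are hypothetical and used only for illustration.

```python
from pathlib import PurePath


def describe_event_file(file_path):
    """Derive Season, Meeting and Event name from a path laid out as
    <base>/<year>/<meeting>/<event type>/<event>.csv (hypothetical helper)."""
    parts = PurePath(file_path).parts
    # The last four path components carry all the metadata we need
    year, meeting, _event_type, file_name = parts[-4:]
    return {
        "Season": int(year),
        "Meeting": meeting.replace("_", " "),
        "Event name": PurePath(file_name).stem.replace("_", " "),
    }


# Illustrative path, not taken from the real data folder
example = describe_event_file("data_csv/2021/Brands_Hatch/Races/Race_1.csv")
print(example)
```

Using `stem` instead of `replace(".csv", "")` only strips the final extension, which is slightly safer if a file name happens to contain ".csv" elsewhere.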

## **4\. Initial data exploration**

After loading the data, it's important to explore the first few entries. This helps us understand the structure of the data and identify any obvious issues such as missing values.

In [4]:
all_races.head()

Unnamed: 0,Season,Meeting,Event name,Pos,Car #,Class,Drivers,Team,Car,Time,Laps,Gap
0,2021,Barcelona,Main Race,1,88,Pro Cup,"Raffaele Marciello, Felipe Fraga, Jules Gounon",AKKA ASP,Mercedes-AMG GT3,1:47.211,95.0,
1,2021,Barcelona,Main Race,2,54,Pro Cup,"Klaus Bachler, Christian Engelhart, Matteo Cai...",Dinamic Motorsport,Porsche 911 GT3-R (991.II),1:47.148,95.0,2.174
2,2021,Barcelona,Main Race,3,32,Pro Cup,"Dries Vanthoor, Robin Frijns, Charles Weerts",Team WRT,Audi R8 LMS GT3,1:47.612,95.0,4.036
3,2021,Barcelona,Main Race,4,63,Pro Cup,"Mirko Bortolotti, Marco Mapelli, Andrea Caldar...",Orange 1 FFF Racing Team,Lamborghini Huracan GT3 Evo,1:47.027,95.0,9.511
4,2021,Barcelona,Main Race,5,4,Pro Cup,"Maro Engel, Luca Stolz, Nico Bastian",HRT,Mercedes-AMG GT3,1:47.588,95.0,9.984


The dataset contains information about various races, with the following columns:
1. `Season`: The year the race took place;
2. `Meeting`: The location of the meeting;
3. `Event name`: The name of the race;
4. `Pos`: The finishing position in the race;
5. `Car #`: The car number;
6. `Class`: The racing class in which a team competes in a race;
7. `Drivers`: The names of the drivers;
8. `Team`: The name of the team;
9. `Car`: The make and model of the car;
10. `Time`: Best lap time;
11. `Laps`: The number of laps completed;
12. `Gap`: Represents the time difference between each car and the car ahead of it, with a slight twist: <br>
    * If a car is on the same lap as the race leader, `Gap` shows the actual time difference. <br>
    * If a car is one or more laps behind, `Gap` resets and shows the time difference to the car ahead on the same lap, not to the overall race leader.
    * `Gap` also contains negative values. My assumption is that these mark participants who dropped out of the race.


### **4.1. Data types and missing values**

Let's get information about missing values and data types.

The data types should be:
1. `Season`: a numerical, discrete variable; it should be represented as **integer**;
2. `Meeting`: a categorical, nominal variable; it should be represented as **object**;
3. `Event name`: a categorical, nominal variable; it should be represented as **object**;
4. `Pos`: a categorical, ordinal variable; it should be represented as **integer**;
5. `Car #`: a categorical, discrete variable; it should be represented as **integer**;
6. `Class`: a categorical, ordinal variable; it should be represented as **category**;
7. `Drivers`: a categorical, nominal variable; it should be represented as **object**;
8. `Team`: a categorical, nominal variable; it should be represented as **object**;
9. `Car`: a categorical, nominal variable; it should be represented as **object**;
10. `Time`: a numerical, continuous variable; it should be represented as **timedelta**;
11. `Laps`: a numerical, discrete variable; it should be represented as **integer**;
12. `Gap`: a numerical, continuous variable; it should be represented as **timedelta**.

In [5]:
all_races.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7606 entries, 0 to 7605
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Season      7606 non-null   int64  
 1   Meeting     7606 non-null   object 
 2   Event name  7606 non-null   object 
 3   Pos         7606 non-null   object 
 4   Car #       7606 non-null   int64  
 5   Class       7606 non-null   object 
 6   Drivers     7606 non-null   object 
 7   Team        7606 non-null   object 
 8   Car         7606 non-null   object 
 9   Time        7467 non-null   object 
 10  Laps        7533 non-null   float64
 11  Gap         7389 non-null   object 
dtypes: float64(1), int64(2), object(9)
memory usage: 713.2+ KB


#### **4.1.1. Columns With Inappropriate Data Types**

3\. `Pos` is an object, should be an int;<br>
5\. `Class` is an object, should be a category;<br>
9\. `Time` is an object, should be a timedelta;<br>
10\. `Laps` is a float64, should be an integer;<br>
11\. `Gap` is an object, should be a timedelta.<br>

#### **4.1.2. Columns With Missing Values**

If a column's non-null count is lower than the number of entries in the RangeIndex (7606 in our case), the column contains missing values.

9\. `Time` contains 7467 non-null rows;<br>
10\. `Laps` contains 7533 non-null rows;<br>
11\. `Gap` contains 7389 non-null rows.<br>
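
Instead of comparing non-null counts by eye, the missing values can be counted directly. The sketch below uses a small toy DataFrame (its values are illustrative, not the real dataset) and the standard `isna().sum()` idiom.

```python
import numpy as np
import pandas as pd

# Toy data that mimics the shape of the problem
toy = pd.DataFrame({
    "Pos": ["1", "2", "3"],
    "Time": ["1:47.211", None, "1:47.612"],
    "Laps": [95.0, np.nan, 95.0],
    "Gap": [None, "2.174", "4.036"],
})

# Number of missing values per column
missing_counts = toy.isna().sum()

# Keep only the columns that actually have missing values
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)
```

On the real data the same two lines would be applied to `all_races`.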

#### **4.1.3. Duplicates**

In [6]:
all_races.duplicated().value_counts()

False    7606
Name: count, dtype: int64

All 7606 rows are unique, so there are no duplicates to remove.
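
`duplicated()` without arguments only flags rows that are identical in every column. A stricter check, sketched below on toy data, looks for repeated keys. The assumption that a car number appears at most once per season, meeting and event is mine and not something the source states.

```python
import pandas as pd

# Toy data: car 88 appears twice in the same event with different positions
toy = pd.DataFrame({
    "Season": [2021, 2021, 2021],
    "Meeting": ["Monza", "Monza", "Monza"],
    "Event name": ["Main Race", "Main Race", "Main Race"],
    "Car #": [88, 54, 88],
    "Pos": ["1", "2", "3"],
})

# Assumed key: one row per car per event
key_columns = ["Season", "Meeting", "Event name", "Car #"]

# Rows identical in every column
exact_duplicates = int(toy.duplicated().sum())

# Rows sharing the same key, even if other columns differ
key_duplicates = toy[toy.duplicated(subset=key_columns, keep=False)]
print(exact_duplicates, len(key_duplicates))
```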

## **5\. Checking and correcting columns**

### **5.1. Pos column**

The `Pos` (Position) column is expected to contain numeric values representing the finishing position of competitors in a race. However, it's currently of type object, suggesting the presence of non-numeric entries.

#### **5.1.1. Identifying Non-numeric Entries**

Firstly, we identify any non-numeric entries that may indicate special cases such as disqualifications.

In [7]:
# Display unique non-numeric entries in the 'Pos' column
# (raw string so that \D is passed to the regex engine unchanged)
all_races[all_races['Pos'].str.contains(r'\D', regex=True, na=False)]['Pos'].value_counts().reset_index()

Unnamed: 0,Pos,count
0,NC,286
1,DSQ,1


This returns the following special entries:

- 'NC' (Not Classified): Refers to competitors who did not meet the classification criteria.
- 'DSQ' (Disqualified): Refers to competitors who were disqualified.

#### **5.1.2. Handling Special Cases**
Next, we handle these entries by removing rows with 'NC' and 'DSQ'. It's crucial to remove these to maintain the integrity of the numeric data, plus we don't need this data for our analysis purposes.

In [8]:
# Remove rows with 'NC' and 'DSQ' from the dataset;
# .copy() avoids SettingWithCopyWarning on later column assignments
all_races = all_races.loc[~all_races['Pos'].isin(['NC', 'DSQ'])].copy()
all_races.reset_index(drop=True, inplace=True)

#### **5.1.3. Converting Data Type**

Finally, we convert the `Pos` column to an integer type.

In [9]:
# Convert the 'Pos' column to an integer data type

all_races['Pos'] = pd.to_numeric(all_races['Pos'], errors='coerce')

all_races['Pos'].dtype

dtype('int64')
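
Note that `errors='coerce'` silently turns anything non-numeric into `NaN`, so it is worth checking what would be coerced before relying on the result. A minimal sketch on a toy Series (illustrative values only):

```python
import pandas as pd

pos = pd.Series(["1", "2", "NC", "DSQ", "3"])

# Non-numeric labels become NaN, which forces a float dtype
numeric_pos = pd.to_numeric(pos, errors="coerce")

# Inspect the labels that could not be parsed
unclassified = pos[numeric_pos.isna()]

# Once those rows are dropped, the values can be stored as plain integers
classified = numeric_pos.dropna().astype("int64")
print(unclassified.tolist(), classified.tolist())
```

In the notebook the 'NC' and 'DSQ' rows are removed first, which is why the conversion yields `int64` directly.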

### **5.2. Class column**
The `Class` column in our dataset categorizes the racing classes but is currently stored as an object, the generic dtype pandas uses for strings. For clarity and consistency, it's beneficial to convert this column to a categorical type. This not only saves memory but also sets the stage for any order-based analysis we might want to perform later.

#### **5.2.1. Identifying Unique Classes**
We start by examining the unique values within the `Class` column:

In [10]:
# Display the unique values in the 'Class' column and their counts
all_races['Class'].value_counts().reset_index()

Unnamed: 0,Class,count
0,Pro Cup,2614
1,Silver Cup,1870
2,Pro-AM Cup,1042
3,Gold Cup,827
4,Bronze Cup,823
5,Pro-Am Cup,72
6,AM Cup,44
7,Bronze,21
8,Invitational,6


This reveals various classes, including 'Pro Cup', 'Silver Cup', and others. However, we notice some inconsistencies in naming conventions which could lead to misinterpretation of the data.

#### **5.2.2. Renaming for Consistency**

To address these inconsistencies, we proceed to normalize the naming convention across the different classes:

In [11]:
# Standardize the class names for uniformity
all_races['Class'] = all_races['Class'].replace({
    'Pro-AM Cup': 'Pro-Am Cup', 
    'AM Cup': 'Am Cup', 
    'Bronze': 'Bronze Cup'
})

all_races['Class'].value_counts().reset_index()

Unnamed: 0,Class,count
0,Pro Cup,2614
1,Silver Cup,1870
2,Pro-Am Cup,1114
3,Bronze Cup,844
4,Gold Cup,827
5,Am Cup,44
6,Invitational,6


Now the class names are uniform, making the data cleaner and the 'Class' column easier to work with.

#### **5.2.3. Setting a Logical Order**

Some analyses might require an understanding of the hierarchy or order of the classes. For this reason, we define a logical order for the categories, listed from least to most prestigious (in an ordered categorical the first category is the lowest), and apply this ordering to our data:

In [12]:
# Define the category order and convert the 'Class' column to a categorical type
cat_type = CategoricalDtype(categories=[
    'Invitational', 'Am Cup', 'Bronze Cup', 'Pro-Am Cup', 'Silver Cup', 'Gold Cup', 'Pro Cup'
], ordered=True)

all_races['Class'] = all_races['Class'].astype(cat_type)
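
An ordered categorical makes comparisons and sorting follow the declared order rather than the alphabet. The sketch below illustrates this on a toy Series. It also checks for `NaN`, because `astype` silently converts any label that is not in the category list into a missing value.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Same order as in the notebook: the first category is the lowest
class_order = CategoricalDtype(
    categories=["Invitational", "Am Cup", "Bronze Cup", "Pro-Am Cup",
                "Silver Cup", "Gold Cup", "Pro Cup"],
    ordered=True,
)

classes = pd.Series(["Pro Cup", "Am Cup", "Gold Cup", "Silver Cup"]).astype(class_order)

# Labels outside the category list would have become NaN here
has_unknown_labels = bool(classes.isna().any())

# Comparisons use the category order, not alphabetical order
top_classes = classes[classes >= "Gold Cup"]

# Sorting also follows the category order (lowest first)
sorted_classes = classes.sort_values()
print(top_classes.tolist(), sorted_classes.tolist())
```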

#### **5.2.4. Flagging Special Categories**
For further clarity, we flag 'Invitational' as a special class since it may follow different rules or standards:

In [13]:
# Create a new column to indicate special class status
all_races.insert(loc=6, column='Special Class', value=False)
all_races.loc[all_races['Class'] == 'Invitational', 'Special Class'] = True

### **5.3. Time column**

The `Time` column contains the time of the fastest lap for each participant. Missing values in this column indicate that a fastest lap time was not recorded, possibly due to early race retirements or other issues.

#### **5.3.1. Investigating Missing Values**
First, we explore the missing values to understand their nature:

In [14]:
# Display the first rows where 'Time' is missing
missing_time_rows = all_races[all_races['Time'].isnull()]
missing_time_rows.head()

Unnamed: 0,Season,Meeting,Event name,Pos,Car #,Class,Special Class,Drivers,Team,Car,Time,Laps,Gap
633,2021,Monza,Main Race after 1.30 hour,41,25,Pro Cup,False,"Adrien Tambay, Alexandre Cougnaud, Christopher...",Sainteloc Racing,Audi R8 LMS GT3,,2.0,-1:18:14.532
634,2021,Monza,Main Race after 1.30 hour,42,70,Pro-Am Cup,False,"Oliver Millroy, Brendan Iribe",Inception Racing,McLaren 720S GT3,,1.0,-1:29:07.011
675,2021,Monza,Main Race after 2.30 hours,41,25,Pro Cup,False,"Adrien Tambay, Alexandre Cougnaud, Christopher...",Sainteloc Racing,Audi R8 LMS GT3,,2.0,-2:16:56.451
676,2021,Monza,Main Race after 2.30 hours,42,70,Pro-Am Cup,False,"Oliver Millroy, Brendan Iribe",Inception Racing,McLaren 720S GT3,,1.0,-2:27:48.930
2246,2022,Barcelona,Main Race after 1.30 hour,47,19,Pro Cup,False,"Leo Roussel, Giacomo Altoe, Arthur Rougier",Emil Frey Racing,Lamborghini Huracan GT3 Evo,,2.0,-1:27:06.233


#### **5.3.2. Filling Missing Values**

We'll fill missing `Time` values with a placeholder '0.000'. This value indicates no recorded fastest lap. 

In [15]:
# Fill missing 'Time' values with '0.000'
all_races['Time'] = all_races['Time'].fillna('0.000')

#### **5.3.3. Creating Indicator for Best Lap Set**

To differentiate between participants who set a fastest lap and those who didn't, we create a new column, `Best lap set`. This column will be True if a time was recorded and False otherwise:

In [16]:
# Create a 'Best lap set' column that will be False if 'Time' is '0.000'
all_races.insert(10, 'Best lap set', all_races['Time'] != '0.000')

#### **5.3.4. Converting Time for Calculations**

For precise calculations that involve time data, we will convert the `Time` values to a timedelta data type in a new `Time timedelta` column. This will allow for accurate and efficient time-based operations while keeping the original `Time` column in its readable object format:

In [17]:
def convert_to_timedelta(value):
    try:
        # Directly return a timedelta of zero for '0.000'
        if value == '0.000' or value == '0':
            return pd.to_timedelta('00:00:00.000')
        
        # Check the number of parts when split by ':'
        parts = value.split(':')
        if len(parts) == 1:  # Only seconds (and possibly milliseconds)
            return pd.to_timedelta(float(value), unit='s')
        elif len(parts) == 2:  # Minutes and seconds
            return pd.to_timedelta('00:' + value)
        elif len(parts) == 3:  # Hours, minutes, and seconds
            return pd.to_timedelta(value)
        # Any other shape would otherwise fall through and return None
        raise ValueError("unexpected number of ':' separators")
    except Exception as e:
        print(f"Error converting value: {value}, Error: {e}")
        # Return a default timedelta in case of unexpected format
        return pd.Timedelta(0)

# Apply the conversion to each value in 'Time'
time_timedelta_values = all_races['Time'].astype(str).apply(convert_to_timedelta)

# Insert the new 'Time timedelta' column into the DataFrame
all_races.insert(loc=12, column='Time timedelta', value=time_timedelta_values)
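
Because missing lap times are stored as a zero timedelta, aggregations such as `min()` need to exclude the placeholder rows, otherwise the "fastest lap" is always zero. A small sketch on toy data (illustrative values) of masking with `Best lap set`:

```python
import pandas as pd

laps = pd.DataFrame({
    "Time timedelta": pd.to_timedelta(["00:01:47.211", "00:00:00", "00:01:47.148"]),
    "Best lap set": [True, False, True],
})

# The placeholder wins when it is not masked out
naive_fastest = laps["Time timedelta"].min()

# Restrict the aggregation to rows where a lap time was actually recorded
fastest = laps.loc[laps["Best lap set"], "Time timedelta"].min()
print(naive_fastest, fastest)
```

An alternative design would be to keep missing times as `NaT`, which pandas skips in aggregations by default. The notebook instead uses a placeholder plus an indicator column, so the mask is needed.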

### **5.4. Laps Column**

The Laps column quantifies the total number of laps completed by each participant during a race. However, this column is currently formatted as a float, which is not suitable for a variable that should inherently be an integer.

#### **5.4.1. Investigating Missing Values**

In [18]:
# Display the first rows where 'Laps' is missing
missing_laps_rows = all_races[all_races['Laps'].isnull()]
missing_laps_rows.head()

Unnamed: 0,Season,Meeting,Event name,Pos,Car #,Class,Special Class,Drivers,Team,Car,Best lap set,Time,Time timedelta,Laps,Gap


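No rows are returned: the rows with missing `Laps` were among the 'NC' and 'DSQ' rows removed earlier. Since there are no missing values left, we can simply change the data type.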

#### **5.4.2. Converting Floats to Integers**

In [19]:
# Change data type of 'Laps' from float to int
all_races['Laps'] = all_races['Laps'].astype('int64')
all_races['Laps'].dtype

dtype('int64')
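
Casting floats with `astype('int64')` truncates fractional parts and fails on `NaN`, so a defensive version can check both conditions first. The helper `laps_to_integer` below is a hypothetical sketch, not part of the notebook.

```python
import pandas as pd


def laps_to_integer(laps):
    """Convert a float Series of lap counts to an integer dtype (hypothetical helper)."""
    # Refuse to truncate values such as 94.5 silently
    if not (laps.dropna() % 1 == 0).all():
        raise ValueError("Laps contains fractional values")
    # 'Int64' (capital I) is pandas' nullable integer dtype and keeps missing values
    if laps.isna().any():
        return laps.astype("Int64")
    return laps.astype("int64")


complete = laps_to_integer(pd.Series([95.0, 94.0, 2.0]))
with_gap = laps_to_integer(pd.Series([95.0, None, 2.0]))
print(complete.dtype, with_gap.dtype)
```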

### **5.5. Gap Column** 
`Gap` represents the time difference between each car and the car ahead of it, with a slight twist:<br> 
If a car is on the same lap as the race leader, `Gap` shows the actual time difference.<br> 
If a car is one or more laps behind, `Gap` resets and shows the time difference to the car ahead on the same lap, not to the overall race leader.

However, the column has several issues that need addressing: it contains missing values for race leaders (since they have no gap to the leader), and negative values which may indicate cars that dropped out of the race.


#### **5.5.1. Inspecting Gap Column**

In [20]:
null_gap_rows = all_races[pd.isnull(all_races['Gap'])]
null_gap_rows['Pos'].value_counts().reset_index()

Unnamed: 0,Pos,count
0,1,144


#### **5.5.2. Handling Missing Values**

Missing values are present in cases where the car is leading. Since there's no gap to measure against themselves, these entries are logically missing. We will fill these with a placeholder:

In [21]:
# Fill missing 'Gap' values with '0.000' to indicate no gap
all_races['Gap'] = all_races['Gap'].fillna('0.000')

#### **5.5.3. Interpreting Negative Values**

Negative values in the `Gap` column could indicate that a car has dropped out of the race. We therefore create a boolean `Dropped off the Race` column: True if the `Gap` value starts with '-', False otherwise.

In [22]:
# Function to determine if a participant has dropped off of the race
def check_dropped_off(value):
    if pd.isnull(value):
        return False
    return bool(re.match(r"^-", value))

# Apply the function to create a new boolean column
all_races['Dropped off the Race'] = all_races['Gap'].apply(check_dropped_off)
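
The same flag can be computed without a Python-level function by using the vectorized string accessor. A minimal sketch on toy values (illustrative only):

```python
import pandas as pd

gap = pd.Series(["0.000", "2.174", "-1:18:14.532", None])

# na=False treats missing values as "not dropped off",
# mirroring the pd.isnull() guard in check_dropped_off
dropped = gap.str.startswith("-", na=False)
print(dropped.tolist())
```

Both approaches give the same result. The vectorized form is usually shorter and faster on large columns.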

#### **5.5.4. Delete All '-' in Gap Column**

In [23]:
all_races['Gap'] = all_races['Gap'].str.replace('-', '', regex=False)

#### **5.5.5. Changing the Data Type**

Now add a new `Gap Timedelta` column with the same values as in `Gap` but with a different data type for calculations.

In [24]:
# Apply the conversion to each value in 'Gap'
gaps_timedelta_values = all_races['Gap'].astype(str).apply(convert_to_timedelta)

# Insert the new 'Gap Timedelta' column into the DataFrame
all_races.insert(loc=15, column='Gap Timedelta', value=gaps_timedelta_values)

### **5.6. The Work is Mostly Completed**

In [25]:
all_races.reset_index(drop=True).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7319 entries, 0 to 7318
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype          
---  ------                --------------  -----          
 0   Season                7319 non-null   int64          
 1   Meeting               7319 non-null   object         
 2   Event name            7319 non-null   object         
 3   Pos                   7319 non-null   int64          
 4   Car #                 7319 non-null   int64          
 5   Class                 7319 non-null   category       
 6   Special Class         7319 non-null   bool           
 7   Drivers               7319 non-null   object         
 8   Team                  7319 non-null   object         
 9   Car                   7319 non-null   object         
 10  Best lap set          7319 non-null   bool           
 11  Time                  7319 non-null   object         
 12  Time timedelta        7319 non-null   timedelta64[ns]
 13  Lap

Now we don't have missing values and all data types are appropriate. To be thorough, let's manually review the `Meeting`, `Event name` and `Car` columns, as there may be errors that initially went unnoticed, such as inconsistent names or superfluous spaces.

#### **5.6.1. Inspecting 'Meeting'**

In [26]:
all_races['Meeting'].value_counts()

Meeting
TotalEnergies 24 Hours of Spa    2619
CrowdStrike 24 Hours of Spa      1659
Circuit Paul Ricard 1000Km        932
Barcelona                         378
Monza                             317
Nürburgring                       286
Hockenheim                        207
Imola                             187
Misano                            169
Valencia                          162
Brands Hatch                      156
Zandvoort                         149
Magny-Cours                        98
Name: count, dtype: int64

'TotalEnergies 24 Hours of Spa' and 'CrowdStrike 24 Hours of Spa' refer to the same event, the 24 Hours of Spa, under different title sponsors. Let's rename both to `Circuit de Spa-Francorchamps`, and rename 'Circuit Paul Ricard 1000Km' to `Circuit Paul Ricard` for consistency.

In [27]:
all_races['Meeting'] = all_races['Meeting'].replace({
    'TotalEnergies 24 Hours of Spa': 'Circuit de Spa-Francorchamps',
    'CrowdStrike 24 Hours of Spa': 'Circuit de Spa-Francorchamps',
    'Circuit Paul Ricard 1000Km' : 'Circuit Paul Ricard'
})

#### **5.6.2. Inspecting 'Event name'**

In [28]:
all_races['Event name'].value_counts()

Event name
Main Race after 1.30 hour     655
Main Race                     638
Main Race after 2.30 hours    535
Race 2                        402
Race 1                        400
Main Race after 30 mins       263
Main Race after 9 hours       194
Main Race after 8 hours       194
Main Race after 7 hours       194
Main Race after 6 hours       194
Main Race after 5 hours       194
Main Race after 4 hours       194
Main Race after 3 hours       194
Main Race after 21 hours      194
Main Race after 20 hours      194
Main Race after 19 hours      194
Main Race after 18 hours      194
Main Race after 14 hours      194
Main Race after 13 hours      194
Main Race after 12 hours      194
Main Race after 15 hours      194
Main race after 4.30 hours    152
Main Race after 3.30 hours    152
Main Race after 16 hours      136
Main Race after 23 hours      136
Main Race after 22 hours      128
Main Race after 2 hours       128
Main Race after 17 hours      128
Main Race after 1 hour        128
Mai

There is a naming inconsistency in the event names. Let's rename 'Main Race after 22 Hours' to 'Main Race after 22 hours'.

#### **5.6.2.1. Fix Naming Error** 

In [29]:
all_races['Event name'] = all_races['Event name'].replace({
    'Main Race after 22 Hours': 'Main Race after 22 hours'
})
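
Spotting such variants by eye in a long `value_counts()` listing is error-prone. One possible approach, sketched below on toy names (illustrative, not the full list), is to group the labels by a case-folded key and report keys that map to more than one spelling.

```python
import pandas as pd

event_names = pd.Series([
    "Main Race after 22 hours", "Main Race after 22 Hours",
    "Main Race after 4.30 hours", "Main race after 4.30 hours",
    "Race 1",
])

# Normalized key: lower-cased and stripped of surrounding whitespace
normalized = event_names.str.casefold().str.strip()

# Count distinct original spellings per normalized key
spellings_per_key = event_names.groupby(normalized).nunique()
inconsistent_keys = spellings_per_key[spellings_per_key > 1].index

# Original labels that need to be unified
to_review = event_names[normalized.isin(inconsistent_keys)].unique()
print(list(to_review))
```

This only catches differences in case and surrounding whitespace. Variants such as "1.30 hour" versus "1.30 hours" would still need a manual review.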

#### **5.6.3. Inspecting 'Car'**


In [30]:
all_races['Car'].value_counts()

Car
Mercedes-AMG GT3                 1467
Porsche 911 GT3-R (991.II)        728
Lamborghini Huracan GT3 Evo       647
Audi R8 LMS evo II GT3            614
Ferrari 488 GT3                   537
Porsche 911 GT3 R (992)           470
BMW M4 GT3                        450
Audi R8 LMS GT3 EVO II            443
Audi R8 LMS GT3                   370
Lamborghini Huracan GT3 EVO 2     368
McLaren 720S GT3                  278
McLaren 720S GT3 EVO              202
Aston Martin Vantage AMR GT3      200
Ferrari 296 GT3                   167
BMW M6 GT3                        107
Mercedes-AMG GT3 EVO               84
Bentley Continental GT3            75
Audi R8 LMS GT3 EVO2               46
Aston Martin Vantage GT3           38
Mercedes-AMG GT GT3 2020           10
Honda NSX GT3 EVO 2                10
Honda NSX GT3                       8
Name: count, dtype: int64

There are some naming inconsistencies. Let's rename 'Audi R8 LMS evo II GT3', 'Audi R8 LMS GT3 EVO II' and 'Audi R8 LMS GT3 EVO2' to 'Audi R8 LMS GT3 EVO 2', and 'Lamborghini Huracan GT3 Evo' to 'Lamborghini Huracan GT3 EVO'.


In [31]:
all_races['Car'] = all_races['Car'].replace({
    'Audi R8 LMS evo II GT3': 'Audi R8 LMS GT3 EVO 2',
    'Audi R8 LMS GT3 EVO II': 'Audi R8 LMS GT3 EVO 2',
    'Audi R8 LMS GT3 EVO2': 'Audi R8 LMS GT3 EVO 2',
    'Lamborghini Huracan GT3 Evo': 'Lamborghini Huracan GT3 EVO'
})

## **6. The Work is Done**

We now have the appropriate data types and no missing values. Our work is complete and we have now prepared a dataset for analysis.

In [32]:
all_races.reset_index(drop=True).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7319 entries, 0 to 7318
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype          
---  ------                --------------  -----          
 0   Season                7319 non-null   int64          
 1   Meeting               7319 non-null   object         
 2   Event name            7319 non-null   object         
 3   Pos                   7319 non-null   int64          
 4   Car #                 7319 non-null   int64          
 5   Class                 7319 non-null   category       
 6   Special Class         7319 non-null   bool           
 7   Drivers               7319 non-null   object         
 8   Team                  7319 non-null   object         
 9   Car                   7319 non-null   object         
 10  Best lap set          7319 non-null   bool           
 11  Time                  7319 non-null   object         
 12  Time timedelta        7319 non-null   timedelta64[ns]
 13  Lap

## **7. Save Cleaned Dataset**

The Parquet format is well suited for storing data with complex structures and data types. It compresses data efficiently and supports most pandas data types.

In [33]:
all_races.to_csv('.\\cleaned_data\\race_data.csv', index=False)
all_races.to_parquet('.\\cleaned_data\\race_data.parquet', index=False)
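
The CSV copy is convenient for inspection, but CSV stores plain text only, so the `category` and `timedelta64[ns]` dtypes are not restored automatically on reload. The sketch below demonstrates this with an in-memory round trip on toy data. Note also that `to_parquet` needs an engine such as `pyarrow` or `fastparquet` installed.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "Class": pd.Categorical(["Pro Cup", "Gold Cup"]),
    "Time timedelta": pd.to_timedelta(["00:01:47.211", "00:01:47.148"]),
})

# Write to an in-memory CSV and read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Compare the dtypes before and after the round trip
print(toy.dtypes)
print(restored.dtypes)
```

Anyone loading the CSV version would therefore have to repeat the type conversions from section 5.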