<a href="https://colab.research.google.com/github/Milind1505/AI-Powered-Data-Quality-Monitoring-for-F1-Telemetry-Data/blob/main/AI_Powered_Data_Quality_Monitoring_System_for_F1_Telemetry_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

AI-Powered Data Quality Monitoring System for F1 Telemetry Data

Introduction

This project demonstrates an AI-powered pipeline to monitor and flag anomalies in Formula 1 telemetry data for George Russell. We use the fastf1 library to access telemetry data, build a rules engine, and employ an Isolation Forest model for anomaly detection. The results are visualized with Plotly for both static and interactive exploration.

Data Source

The project utilizes the fastf1 library to access Formula 1 telemetry data. This library provides a convenient way to retrieve lap times, speed, throttle, brake, RPM, and other sensor data from past races. The data for George Russell in the 2023 Bahrain Grand Prix is used as an example in this project.

Step-by-Step Code with Explanations

1. Install Libraries:

This installs the required libraries:

fastf1: For accessing F1 telemetry data

pandas: For data manipulation and analysis

numpy: For numerical operations

plotly: For interactive visualizations

scikit-learn: For machine learning, including Isolation Forest

In [9]:
!pip install fastf1 pandas numpy plotly scikit-learn




2. Load and Prepare Data:

The fastf1 library is used to load the telemetry data for George Russell's fastest lap in the 2023 Bahrain Grand Prix.
Relevant features, such as speed, throttle, brake, RPM, and gear, are selected for anomaly detection.
The data is split into training and testing sets (70-30 split) for model training and evaluation.

In [12]:
import fastf1
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
import os

# Create the cache directory if it doesn't exist
cache_dir = 'cache'
if not os.path.exists(cache_dir):
    os.makedirs(cache_dir)

# Enable caching for fastf1
fastf1.Cache.enable_cache('cache')

# Load data for George Russell in the 2023 Bahrain Grand Prix
session = fastf1.get_session(2023, 'Bahrain', 'R')
session.load()
russell = session.laps.pick_drivers('RUS').pick_fastest()
telemetry = russell.get_telemetry()

# Select relevant features for anomaly detection
features = ['Speed', 'Throttle', 'Brake', 'RPM', 'nGear']
data = telemetry[features]

core           INFO 	Loading data for Bahrain Grand Prix - Race [v3.5.3]
INFO:fastf1.fastf1.core:Loading data for Bahrain Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
INFO:fastf1.fastf1.req:Using cached data for session_info
req            INFO 	Using cached data for driver_info
INFO:fastf1.fastf1.req:Using cached data for driver_info
req            INFO 	Using cached data for session_status_data
INFO:fastf1.fastf1.req:Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
INFO:fastf1.fastf1.req:Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
INFO:fastf1.fastf1.req:Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
INFO:fastf1.fastf1.req:Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
INFO:fastf1.fastf1.req:Using cached data for timing_app_data
core         

Rules Engine:

A rules engine is implemented to flag potential anomalies based on predefined rules.
These rules capture basic scenarios where anomalies are likely to occur, such as high speed with low throttle or hard braking while accelerating.
The rules engine adds an 'Anomaly_Rule' column to the data, indicating whether a data point violates any rules.

In [13]:
def check_rules(df):
    """Applies basic rules to flag potential anomalies."""
    df['Anomaly_Rule'] = 0
    df.loc[(df['Speed'] > 350) & (df['Throttle'] < 0.2), 'Anomaly_Rule'] = 1 # High speed, low throttle
    df.loc[(df['Brake'] > 0.8) & (df['Throttle'] > 0.8), 'Anomaly_Rule'] = 1 # Hard braking while accelerating
    return df

data = check_rules(data)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



3. Anomaly Detection with Isolation Forest:

An Isolation Forest model is trained on the training data to detect anomalies.
This model is well-suited for anomaly detection as it isolates anomalous data points that are far away from the normal data distribution.
The model is used to predict anomalies on the testing data and adds an 'Anomaly_Prediction' column.

In [17]:
# 'Distance' in the features list BEFORE anomaly detection
features = ['Speed', 'Throttle', 'Brake', 'RPM', 'nGear', 'Distance'] # Add 'Distance' here
data = telemetry[features]

# Apply rules and anomaly detection with the updated features
data = check_rules(data)

# Split data into training and testing sets (70-30 split)
train_data = data.sample(frac=0.7, random_state=42)
test_data = data.drop(train_data.index)

# Train Isolation Forest model
model = IsolationForest(contamination=0.05, random_state=42) # Adjust contamination as needed
model.fit(train_data[features])

# Predict anomalies on the test set
test_data['Anomaly_Score'] = model.decision_function(test_data[features])
test_data['Anomaly_Prediction'] = model.predict(test_data[features])
test_data['Anomaly_Prediction'] = test_data['Anomaly_Prediction'].map({1: 0, -1: 1}) # Map -1 to 1 for anomaly



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



4. Visualization with Plotly:

Plotly is used to create two visualizations:
Static Plot: Speed vs. Distance with Anomalies Highlighted: This plot shows the overall trend of speed changes and highlights anomalies detected by the rules engine and the Isolation Forest model.
Interactive Plot: 3D Scatter Plot of Speed, Throttle, and Brake: This plot allows users to explore the relationships between these three variables and identify anomalies in a 3-dimensional space.

In [18]:
import plotly.express as px
import plotly.graph_objects as go


# Static Plot: Speed vs. Time with Anomalies Highlighted
fig = px.line(test_data, x='Distance', y='Speed', color='Anomaly_Prediction', title='Speed vs. Distance with Anomalies')
fig.show()

# Interactive Plot: 3D Scatter Plot of Speed, Throttle, and Brake
fig = go.Figure(data=[go.Scatter3d(
    x=test_data['Speed'],
    y=test_data['Throttle'],
    z=test_data['Brake'],
    mode='markers',
    marker=dict(
        size=5,
        color=test_data['Anomaly_Prediction'],
        colorscale='Viridis',
        opacity=0.8
    )
)])
fig.update_layout(scene = dict(
                    xaxis_title='Speed',
                    yaxis_title='Throttle',
                    zaxis_title='Brake'),
                    title='3D Scatter Plot of Telemetry Data')
fig.show()

1. Static Plot: Speed vs. Distance

This plot acts as a foundational overview of the data. It shows the general trend of speed variations as the car covers distance. By highlighting anomalies in this plot, we can quickly identify specific regions or instances where unusual behavior occurred. This overview helps us understand:

Overall Speed Profile: We can see the general pattern of speed changes, such as acceleration, deceleration, and periods of constant speed.
Potential Anomaly Locations: The highlighted anomalies pinpoint areas where the speed deviates significantly from the expected pattern. These areas require further investigation using the interactive plot.
2. Interactive 3D Scatter Plot: Speed, Throttle, and Brake

This plot provides a deeper, multi-dimensional exploration of the data. It allows us to investigate the relationships between speed, throttle, and brake, which are crucial factors in understanding car performance. By color-coding the data points based on anomaly prediction, we can:

Identify Anomaly Clusters: If anomalies cluster together in specific regions of the 3D space, it suggests a systemic issue or a particular scenario where anomalies are likely to occur.
Analyze Anomaly Characteristics: By examining the values of speed, throttle, and brake for the anomalous points, we can understand the specific nature of the anomaly. For example, an anomaly with high speed and low throttle might indicate a sensor error or an unusual driving behavior.
Explore Data Relationships: The interactive nature of the plot allows us to rotate and zoom, gaining different perspectives on the data relationships. This exploration can reveal hidden patterns or correlations that might be missed in the static plot.
Comprehensive View: The Synergy

The static plot acts as a starting point, directing our attention to potential anomaly locations. The interactive plot then allows us to delve deeper into those areas, exploring the underlying factors contributing to the anomalies. This combination provides a comprehensive view because:

Contextualization: The static plot provides context by showing the overall speed profile and highlighting anomaly locations within that context.
Detailed Investigation: The interactive plot allows for a detailed investigation of the anomalies, revealing their characteristics and relationships with other variables.
Pattern Recognition: By exploring the data from different perspectives in the interactive plot, we can identify patterns and clusters that might not be apparent in the static plot.
This synergy between the two visualizations enhances our understanding of the F1 telemetry data and helps us effectively detect and analyze potential anomalies. The static plot directs our attention, and the interactive plot allows for in-depth exploration, leading to a more comprehensive and insightful analysis.

**Alternative Anomaly Detection Models**

Besides Isolation Forest, here are a few other models that could be suitable for this project:

Local Outlier Factor (LOF): LOF is a density-based anomaly detection method that measures the local deviation of a data point with respect to its neighbors. It can be effective in identifying outliers in datasets with varying densities.

One-Class SVM (OCSVM): OCSVM is a novelty detection method that learns a decision boundary around the normal data points. It can be useful for detecting anomalies that are significantly different from the training data.

Autoencoders: Autoencoders are neural networks that are trained to reconstruct their input data. They can be used for anomaly detection by comparing the reconstruction error of a data point with a threshold.

In [19]:
from sklearn.neighbors import LocalOutlierFactor

# Train LOF model
lof = LocalOutlierFactor(contamination=0.05)  # Adjust contamination as needed
lof_predictions = lof.fit_predict(data[features])
data['Anomaly_LOF'] = lof_predictions
data['Anomaly_LOF'] = data['Anomaly_LOF'].map({1: 0, -1: 1})  # Map -1 to 1 for anomaly

# Visualize LOF results with Plotly
fig = px.scatter(data, x='Distance', y='Speed', color='Anomaly_LOF', title='Anomaly Detection with LOF')
fig.show()




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [20]:
from sklearn.svm import OneClassSVM

# Train OCSVM model
ocsvm = OneClassSVM(nu=0.05)  # Adjust nu as needed
ocsvm_predictions = ocsvm.fit_predict(data[features])
data['Anomaly_OCSVM'] = ocsvm_predictions
data['Anomaly_OCSVM'] = data['Anomaly_OCSVM'].map({1: 0, -1: 1})  # Map -1 to 1 for anomaly

# Visualize OCSVM results with Plotly
fig = px.scatter(data, x='Distance', y='Speed', color='Anomaly_OCSVM', title='Anomaly Detection with OCSVM')
fig.show()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



Both LOF and OCSVM are used to identify anomalies in the F1 telemetry data, and their results are visualized using Plotly scatter plots. Let's break down the interpretations of these visualizations:

1. Local Outlier Factor (LOF):

Visualization Type: Scatter plot
Purpose: To highlight data points that are considered outliers based on their local density compared to their neighbors.
Data:
x-axis: 'Distance' - Represents the distance traveled by the car.
y-axis: 'Speed' - Represents the speed of the car.
Color: 'Anomaly_LOF' - Indicates whether a data point is flagged as an anomaly (1) or not (0) by the LOF model.
Interpretation:
Normal Data Points: Data points that are clustered together and have similar densities are considered normal and are typically represented by one color (e.g., blue).
Anomalies/Outliers: Data points that are far away from their neighbors or have significantly lower densities are flagged as anomalies and are highlighted with a different color (e.g., red).
Density-Based Anomaly Detection: LOF focuses on identifying data points that deviate from the local density patterns in the dataset. Anomalies are those points that are in sparsely populated regions or have significantly different densities compared to their surroundings.
2. One-Class SVM (OCSVM):

Visualization Type: Scatter plot
Purpose: To identify data points that fall outside the learned decision boundary, indicating they are likely anomalies.
Data:
x-axis: 'Distance' - Represents the distance traveled by the car.
y-axis: 'Speed' - Represents the speed of the car.
Color: 'Anomaly_OCSVM' - Indicates whether a data point is flagged as an anomaly (1) or not (0) by the OCSVM model.
Interpretation:
Normal Data Points: Data points that fall within the learned decision boundary are considered normal and are typically represented by one color (e.g., blue).
Anomalies/Outliers: Data points that fall outside the decision boundary are flagged as anomalies and are highlighted with a different color (e.g., red).
Novelty Detection: OCSVM focuses on learning a boundary around the normal data points and identifying anything outside this boundary as an anomaly. This is useful for detecting novel or unseen data patterns that deviate from the training data.

In [21]:
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler

# Scale the data
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data[features])

# Define the autoencoder model
input_dim = scaled_data.shape[1]
encoding_dim = 10 # Adjust as needed

autoencoder = tf.keras.models.Sequential([
    tf.keras.layers.Dense(encoding_dim, activation='relu', input_shape=(input_dim,)),
    tf.keras.layers.Dense(input_dim, activation='sigmoid')
])

# Compile the model
autoencoder.compile(optimizer='adam', loss='mse')

# Train the model
autoencoder.fit(scaled_data, scaled_data, epochs=50, batch_size=32) # Adjust epochs and batch size as needed

# Get reconstruction errors
reconstructions = autoencoder.predict(scaled_data)
reconstruction_errors = np.mean(np.square(scaled_data - reconstructions), axis=1)

# Set a threshold for anomaly detection
threshold = np.percentile(reconstruction_errors, 95) # Adjust percentile as needed

# Identify anomalies
anomalies = reconstruction_errors > threshold
data['Anomaly_Autoencoder'] = anomalies.astype(int)

# Visualize anomalies with Plotly
fig = px.scatter(data, x='Distance', y='Speed', color='Anomaly_Autoencoder', title='Anomaly Detection with Autoencoder')
fig.show()

Epoch 1/50



Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.



[1m23/23[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - loss: 0.1267   
Epoch 2/50
[1m23/23[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - loss: 0.1168  
Epoch 3/50
[1m23/23[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - loss: 0.1042 
Epoch 4/50
[1m23/23[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - loss: 0.0993 
Epoch 5/50
[1m23/23[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - loss: 0.0938 
Epoch 6/50
[1m23/23[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - loss: 0.0879 
Epoch 7/50
[1m23/23[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - loss: 0.0854 
Epoch 8/50
[1m23/23[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - loss: 0.0798 
Epoch 9/50
[1m23/23[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - loss: 0.0800 
Epoch 10/50
[1m23/23[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - loss: 0.0755 
Epoch 1



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



Okay, let's explain the visualization generated by the autoencoder for anomaly detection in F1 telemetry data:

Visualization Explanation: Autoencoder

The autoencoder visualization is a scatter plot that highlights anomalies based on reconstruction errors. Here's a breakdown of the interpretation:

Visualization Type: Scatter plot
Purpose: To identify data points that have high reconstruction errors, indicating they are likely anomalies.
Data:
x-axis: 'Distance' - Represents the distance traveled by the car.
y-axis: 'Speed' - Represents the speed of the car.
Color: 'Anomaly_Autoencoder' - Indicates whether a data point is flagged as an anomaly (1) or not (0) by the autoencoder.
Interpretation:
Normal Data Points: Data points that have low reconstruction errors are considered normal and are typically represented by one color (e.g., blue). These points are well-represented by the autoencoder's learned patterns.
Anomalies/Outliers: Data points that have high reconstruction errors are flagged as anomalies and are highlighted with a different color (e.g., red). These points deviate significantly from the autoencoder's learned patterns, suggesting unusual behavior.
Reconstruction Error-Based Anomaly Detection: Autoencoders work by learning a compressed representation of the normal data. When presented with an anomaly, the autoencoder struggles to reconstruct it accurately, resulting in a high reconstruction error. This error is used as an indicator of anomalous behavior.
Understanding the Scatter Plot:

Clusters and Outliers: The scatter plot can reveal clusters of anomalies or individual outliers. Clusters might indicate systemic issues or specific scenarios where anomalies are likely to occur. Isolated outliers could represent sensor errors or unique events.
Relationship with Distance and Speed: By examining the positions of anomalies on the scatter plot, you can understand how they relate to distance and speed. For example, anomalies clustered at high speeds might indicate issues with the car's aerodynamics or engine performance.
Threshold Impact: The threshold for anomaly detection (defined as a percentile of reconstruction errors) influences the number of data points flagged as anomalies. A higher threshold will result in fewer anomalies being detected, while a lower threshold will increase the sensitivity to potential anomalies.

Findings and Conclusions

This project demonstrates an AI-powered pipeline for monitoring and flagging anomalies in Formula 1 telemetry data for George Russell. By combining a rules engine with unsupervised anomaly detection models like Isolation Forest, Local Outlier Factor (LOF), One-Class SVM (OCSVM), and autoencoders, we identified potential areas of concern in the telemetry data.

Key Findings:

Rules Engine: The rules engine effectively flagged basic anomalies based on predefined thresholds for speed, throttle, and brake. This provides a first-level filter for identifying obvious deviations from expected behavior.

Isolation Forest: The Isolation Forest model successfully identified anomalies that deviate from the normal data distribution. The visualization highlighted these anomalies, allowing for further investigation.

Local Outlier Factor (LOF): LOF identified anomalies based on local density deviations, revealing potential outliers in sparsely populated regions of the data space. The visualization highlighted these points, suggesting unusual behavior in specific driving scenarios.

One-Class SVM (OCSVM): OCSVM learned a decision boundary around the normal data and flagged points outside this boundary as anomalies. This approach is useful for detecting novel or unseen data patterns that might indicate unexpected car behavior.

Autoencoders: The autoencoder model detected anomalies based on reconstruction errors, highlighting data points that deviate significantly from the learned patterns. The visualization revealed clusters of anomalies and individual outliers, suggesting potential systemic issues or sensor errors.

Overall Conclusions:

The combination of different anomaly detection models and visualizations provides a comprehensive view of potential anomalies in the F1 telemetry data.
Each model offers a unique perspective on anomaly detection, highlighting different types of deviations from expected behavior.
Visualizations are crucial for interpreting the results of the models and understanding the underlying patterns in the data.
The system can be used to identify potential issues, improve data quality, and enhance the performance of F1 cars.
Potential Applications and Extensions:

Real-time Anomaly Detection: Integrate the pipeline into a real-time data stream for live monitoring during races.
Predictive Maintenance: Use the anomaly detection models to predict potential failures in car components.
Driver Performance Analysis: Identify unusual driving patterns or behaviors for performance improvement.
Data Quality Monitoring in Other Domains: Adapt the pipeline for other engineering applications.
Limitations:

The rules engine relies on predefined thresholds that may need to be adjusted for different tracks or driving styles.
The unsupervised anomaly detection models require careful tuning of parameters to achieve optimal performance.
The system may not be able to detect all types of anomalies, especially those that are subtle or previously unseen.
By combining multiple models and visualizations, this project demonstrates a robust and insightful approach to data quality monitoring in F1 telemetry data. With further development and refinement, this system has the potential to significantly enhance the performance, reliability, and safety of F1 cars

Potential Applications and Extensions

This project can be extended and applied to various scenarios, as discussed previously.

By presenting the code cells alongside the textual explanations, the Kaggle project becomes more engaging and easier to follow. Users can readily understand the code's purpose and how it contributes to the overall pipeline. This presentation style also makes the project more reproducible, as users can easily copy and execute the code in their own Kaggle environments. I hope this refined presentation further enhances the value and accessibility of the project.