## Analysis for Predicting Lap Time Using Multiple Linear Regression

`Melbourne` (Australian Grand Prix) is a street circuit located in Albert Park, `Melbourne`. At 5303 meters long, it is a mixed track that combines high-speed sections with technical corners. Although the circuit runs clockwise, its most difficult zones are those with multiple changes of direction, which demand a high level of control and precision.

The weather in `Melbourne` is unpredictable, with sudden temperature changes and possible rain that complicate team strategies. The high-speed corners and relatively short straights make pit stops crucial for the drivers. Good tyre management is therefore essential, especially in the zones where the asphalt is more abrasive. Braking and acceleration performance at turns 1 and 3 is key to setting fast times.


![Melbourne](../img/interlagos.jpg)

Let us now study a multiple linear regression model with `FinalRaceTime` as the dependent variable on the Interlagos circuit.


In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm

data = pd.read_csv("formula1_interlagos_df_final.csv")

# Print column names
print("Available columns:")
print(data.columns.tolist())


# circuit_name = 'Interlagos'
    
circuit_data = data.copy()  
# circuit_data = data[data['Circuit'] == circuit_name].copy()  

# if circuit_data.empty:
#         print(f"No data found for circuit {circuit_name}")
#         print("Available circuits:")
#         print(data['Circuit'])  
# else:
#         print(f"Data for circuit {circuit_name}:")
#         # print(circuit_data.head()) 


# print(f"\nAnalyzing data for circuit: {circuit_name}")
# print(f"Number of races: {len(circuit_data)}")

# Select relevant features
# features = ['MaxSpeed','DriverSkill','Age','PitStopTime','ReactionTime',
#            'FinalPosition','Experience','DNF','Points',
#             'Overtakes','TyreWear','Experience','DriverSkill','CarPerformance',
#             'TrackFamiliarity','EngineMode','FuelConsumption','DownforceLevel','TrackTemperature',
#             'WeatherCondition_Mixed','WeatherCondition_Wet','TyreCompound_Medium','TyreCompound_Soft','TrackGrip_Low','TrackGrip_Medium']

# features = ['MaxSpeed','DriverSkill',
#            'Experience','Points',
#             'Experience','DriverSkill',
#             'TrackTemperature',
#             'WeatherCondition_Mixed','WeatherCondition_Wet','TyreCompound_Medium','TyreCompound_Soft','TrackGrip_Low','TrackGrip_Medium']


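# NOTE: 'Experience' and 'DriverSkill' are listed twice. The repeated columns are
# perfectly collinear, so they add no information, and in the recorded run each
# copy receives the same coefficient.
features = ['MaxSpeed', 'DriverSkill',
            'Experience',
            'Experience', 'DriverSkill',
            'TrackTemperature',
            'WeatherCondition_Mixed', 'WeatherCondition_Wet', 'TyreCompound_Medium', 'TyreCompound_Soft', 'TrackGrip_Low', 'TrackGrip_Medium']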



# features = ['MaxSpeed','PitStopTime','ReactionTime',
#            'FinalPosition','Experience','DNF','Points',
#             'Overtakes','TyreWear','Experience','CarPerformance',
#             'DownforceLevel',
#             'WeatherCondition_Mixed','TyreCompound_Medium']

# features = ['MaxSpeed','PitStopTime','ReactionTime',
#            'FinalPosition','DNF','Points',
#             'Overtakes','TyreWear',
#             'DownforceLevel',
#             'TyreCompound_Medium']

# features = ['MaxSpeed','ReactionTime',
#            'FinalPosition','DNF','Points',
#             'Overtakes','TyreWear',
#             'DownforceLevel',
#             'TyreCompound_Medium']

target = 'FinalRaceTime'

# Keep only the requested features that exist in the dataset; fail early if the target is missing
features = [f for f in features if f in data.columns]
if target not in data.columns:
    raise ValueError(f"Target column '{target}' not found in the dataset")

print(f"\nUsing features: {features}")
print(f"Number of features : {len(features)}")
print(f"Target: {target}")

# Coerce the feature columns to numeric and drop rows with missing values
for feature in features:
    circuit_data[feature] = pd.to_numeric(circuit_data[feature], errors='coerce')
circuit_data = circuit_data.dropna(subset=features + [target])


# Build the feature matrix and the target vector (all rows of the circuit dataset)
X = circuit_data[features]
y = circuit_data[target]


# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features (the scaler is fit on the training split only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)



# Create and train the model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Print the equation of the hyperplane (coefficients refer to the standardized features)
coefficients = model.coef_
intercept = model.intercept_

print("\nEquation of the hyperplane:")
equation = f"FinalRaceTime = {intercept:.2f}"
for feature, coef in zip(features, coefficients):
    equation += f" + ({coef:.2f} * {feature})"
print(equation)

# Calculate R-squared on the held-out test set
r_squared = model.score(X_test_scaled, y_test)
print(f"\nR-squared: {r_squared:.4f}")


Available columns:
['Date', 'Driver', 'Age', 'Team', 'Circuit', 'PitStopTime', 'ReactionTime', 'FinalPosition', 'DNF', 'Points', 'MaxSpeed', 'Overtakes', 'Experience', 'DriverSkill', 'CarPerformance', 'TrackFamiliarity', 'EngineMode', 'QualifyingPosition', 'TyreWear', 'FuelConsumption', 'DownforceLevel', 'FinalRaceTime', 'TrackTemperature', 'WeatherCondition_Mixed', 'WeatherCondition_Wet', 'TyreCompound_Medium', 'TyreCompound_Soft', 'TrackGrip_Low', 'TrackGrip_Medium']

Using features: ['MaxSpeed', 'DriverSkill', 'Experience', 'Experience', 'DriverSkill', 'TrackTemperature', 'WeatherCondition_Mixed', 'WeatherCondition_Wet', 'TyreCompound_Medium', 'TyreCompound_Soft', 'TrackGrip_Low', 'TrackGrip_Medium']
Number of features : 12
Target: FinalRaceTime

Equation of the hyperplane:
FinalRaceTime = 266.99 + (-1.19 * MaxSpeed) + (0.60 * DriverSkill) + (-1.34 * Experience) + (-1.34 * Experience) + (0.60 * DriverSkill) + (-1.68 * TrackTemperature) + (2.50 * WeatherCondition_Mixed) + (5.64 * Wea
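
Because the model is fit on `StandardScaler` output, each coefficient in the printed equation is the change in `FinalRaceTime` per one standard deviation of the feature, not per raw unit. The following is a minimal sketch of how such coefficients could be mapped back to the original units. The data are synthetic, and the helper name `to_original_units` and the two stand-in features are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def to_original_units(coef_scaled, intercept_scaled, mean, scale):
    """Map coefficients fitted on (x - mean) / scale back to raw-unit coefficients."""
    coef_raw = coef_scaled / scale
    intercept_raw = intercept_scaled - np.sum(coef_scaled * mean / scale)
    return coef_raw, intercept_raw

# Synthetic data (assumption): two made-up features with a known linear relation.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(300.0, 10.0, 200),  # stand-in for a speed-like feature
    rng.normal(35.0, 5.0, 200),    # stand-in for a temperature-like feature
])
y = 270.0 - 0.1 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0.0, 0.5, 200)

# Standardize by hand with the population standard deviation (ddof=0),
# the same convention StandardScaler uses.
mean, scale = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mean) / scale

# Ordinary least squares on the scaled design matrix with an intercept column.
A = np.column_stack([np.ones(len(X_scaled)), X_scaled])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept_scaled, coef_scaled = beta[0], beta[1:]

coef_raw, intercept_raw = to_original_units(coef_scaled, intercept_scaled, mean, scale)
print("per-standard-deviation coefficients:", coef_scaled)
print("per-raw-unit coefficients:", coef_raw)
```

With a fitted scikit-learn pipeline, the same helper would take `model.coef_`, `model.intercept_`, `scaler.mean_` and `scaler.scale_` as inputs; both forms of the equation give identical predictions.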

In [None]:

# Add a constant column for the intercept
X_train_const = sm.add_constant(X_train_scaled)

# Fit the regression model with statsmodels
model_sm = sm.OLS(y_train, X_train_const).fit()

# Show the summary, including the p-value of each coefficient
print(model_sm.summary())

# Create a DataFrame for coefficients and p-values
summary_df = pd.DataFrame({
    'Feature': model_sm.params.index,
    'Coefficient': model_sm.params.values,
    'P-value': model_sm.pvalues.values
})

# Keep only the coefficients that are not significant at the 5% level (p-value > 0.05)
summary_df = summary_df[summary_df['P-value'] > 0.05]

# Sort the DataFrame by p-value in descending order
summary_df = summary_df.sort_values(by='P-value', ascending=False)


# Print the sorted DataFrame
print("\nSorted Coefficients and P-values (from highest to lowest p-value):")
print(summary_df)

                            OLS Regression Results                            
Dep. Variable:          FinalRaceTime   R-squared:                       0.690
Model:                            OLS   Adj. R-squared:                  0.682
Method:                 Least Squares   F-statistic:                     86.64
Date:                Sun, 16 Feb 2025   Prob (F-statistic):           1.54e-92
Time:                        19:46:51   Log-Likelihood:                -1190.8
No. Observations:                 400   AIC:                             2404.
Df Residuals:                     389   BIC:                             2448.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        266.9915      0.241   1108.690      0.0
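
The summary reports `Df Model: 10` even though 12 feature names were passed to the model. This is consistent with the two repeated names (`Experience` and `DriverSkill`) contributing no additional rank to the design matrix. A small sketch of how the repetition can be detected follows; the feature list is the one printed above, while the random columns and the helper `duplicated_names` are assumptions used only for illustration.

```python
import numpy as np

def duplicated_names(names):
    """Return the names that occur more than once, in the order they first repeat."""
    seen, dupes = set(), []
    for name in names:
        if name in seen and name not in dupes:
            dupes.append(name)
        seen.add(name)
    return dupes

features = ['MaxSpeed', 'DriverSkill', 'Experience', 'Experience', 'DriverSkill',
            'TrackTemperature', 'WeatherCondition_Mixed', 'WeatherCondition_Wet',
            'TyreCompound_Medium', 'TyreCompound_Soft', 'TrackGrip_Low', 'TrackGrip_Medium']

repeated = duplicated_names(features)
unique_features = list(dict.fromkeys(features))  # order-preserving de-duplication
print(repeated)
print(len(features), len(unique_features))

# Synthetic illustration (assumption): one random column per unique name,
# with repeated names mapped to identical copies of the same column.
rng = np.random.default_rng(0)
base = {name: rng.normal(size=50) for name in unique_features}
X = np.column_stack([base[name] for name in features])
rank = np.linalg.matrix_rank(X)
print(rank)
```

Passing `unique_features` instead of `features` would leave the fitted values unchanged while giving each variable a single coefficient.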

Here we started by considering all the study variables in our dataset; the dataset has even been modified to

                            OLS Regression Results                            
Dep. Variable:          FinalRaceTime   R-squared:                       0.669
Model:                            OLS   Adj. R-squared:                  0.495
Method:                 Least Squares   F-statistic:                     3.857
Date:                Sun, 16 Feb 2025   Prob (F-statistic):            0.00383
Time:                        01:17:42   Log-Likelihood:                -94.490
No. Observations:                  33   AIC:                             213.0
Df Residuals:                      21   BIC:                             230.9
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        270.3552      0.925    292.246      0.0
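
The two summaries have similar R-squared values (0.690 and 0.669) but very different adjusted R-squared values (0.682 and 0.495). The gap comes from the number of observations relative to the number of regressors: 400 observations with 10 model degrees of freedom versus 33 observations with 11. The short check below recomputes the adjusted values from the reported figures using the standard formula for a model with an intercept; the function name `adjusted_r2` is an illustrative assumption.

```python
def adjusted_r2(r2, n_obs, df_model):
    """Adjusted R-squared for a model with an intercept.

    adj = 1 - (1 - R2) * (n - 1) / (n - p - 1), with p = df_model.
    """
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - df_model - 1)

# Figures taken from the two OLS summaries above.
large_sample = adjusted_r2(0.690, n_obs=400, df_model=10)
small_sample = adjusted_r2(0.669, n_obs=33, df_model=11)

print(f"400 observations: {large_sample:.3f}")
print(f"33 observations:  {small_sample:.3f}")
```

Both results agree with the summaries up to the rounding of the reported R-squared, which suggests that the 33-observation fit explains much less once the penalty for its 11 regressors is applied.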

In [32]:
# Convert categorical variables into dummies.
# The CSV loaded above already contains one-hot columns (e.g. WeatherCondition_Mixed), and
# passing the raw names unconditionally raised:
# KeyError: "None of [Index(['TyreCompound', 'WeatherCondition', 'EngineMode', 'TrackGrip'], dtype='object')] are in the [columns]"
# so only the categorical columns that are still present are encoded.
categorical_features = ['TyreCompound', 'WeatherCondition', 'EngineMode', 'TrackGrip']
categorical_features = [c for c in categorical_features if c in circuit_data.columns]
if categorical_features:
    circuit_data = pd.get_dummies(circuit_data, columns=categorical_features, drop_first=True)

# Select the features again after encoding: every numeric or boolean column except the target
numeric_columns = circuit_data.select_dtypes(include=['number', 'bool']).columns
features = [col for col in numeric_columns if col != target]
circuit_data = circuit_data.dropna(subset=features + [target])

# Define X and y
X = circuit_data[features]
y = circuit_data[target]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create and train the model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Show the regression equation
coefficients = model.coef_
intercept = model.intercept_

print("\nEquation of the hyperplane:")
equation = f"FinalRaceTime = {intercept:.2f}"
for feature, coef in zip(features, coefficients):
    equation += f" + ({coef:.2f} * {feature})"
print(equation)

# Calculate R-squared on the test set
r_squared = model.score(X_test_scaled, y_test)
print(f"\nR-squared: {r_squared:.4f}")