**1. Frame and Analyze the Problem**
Goal: predict whether the winner is a "top" driver (more than X wins in the historical record)
Type: binary classification (1 = top driver, 0 = regular driver)
Challenges: historical dataset, inconsistent driver names, variation across eras
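
One of the challenges listed is inconsistent driver names. The notebook does not say what form the inconsistency takes, so the sketch below assumes it is limited to accents, letter case, and stray whitespace. The helper `normalize_name` and the sample spellings are hypothetical, not taken from the dataset.

```python
import unicodedata
from collections import Counter


def normalize_name(name: str) -> str:
    """Strip accents, collapse whitespace, and lowercase a driver name."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.split()).casefold()


# Invented spelling variants, for illustration only
raw_names = ["Sébastien Loeb", "Sebastien  Loeb ", "SÉBASTIEN LOEB", "Ott Tänak"]
counts = Counter(normalize_name(n) for n in raw_names)
print(counts)
```

With a DataFrame, the same helper could be applied as `df['winner_name'].map(normalize_name)` before counting wins, so that variants of one driver are counted together.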

In [2]:
# IMPORTS

# Data loading
import pandas as pd
# Preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Train/test split
from sklearn.model_selection import train_test_split
# Model definition and training
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
# Validation
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import cross_val_score
# Saving
import joblib


**2. Get the Data**

In [3]:
df = pd.read_csv("winners_rally_1950_2025.csv")

print(df.shape)
df.head()

(228, 9)


| | rally_event | winner_name | team | year | continent | distance_km | time_seconds | winner_win_count | top_driver |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Rally 1950-1 | Petter Solberg | Team Subaru | 1950 | Asia | 335 | 3519.27 | 12 | 1 |
| 1 | Rally 1950-2 | Colin McRae | Citroën Racing | 1950 | North America | 457 | 5334.57 | 23 | 1 |
| 2 | Rally 1950-3 | Richard Burns | Hyundai Rally | 1950 | South America | 431 | 3977.33 | 13 | 1 |
| 3 | Rally 1951-1 | Carlos Sainz | Hyundai Rally | 1951 | North America | 376 | 4170.8 | 12 | 1 |
| 4 | Rally 1951-2 | Colin McRae | Toyota Gazoo | 1951 | Asia | 497 | 5750.93 | 23 | 1 |


**3. Explore the Data**

In [4]:
# General info
print(df.info())

# Most frequent values per column
for col in df.columns:
    print(f"\nColumn: {col}")
    print(df[col].value_counts().head(5))

# Numeric statistics
print(df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 228 entries, 0 to 227
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   rally_event       228 non-null    object 
 1   winner_name       228 non-null    object 
 2   team              228 non-null    object 
 3   year              228 non-null    int64  
 4   continent         228 non-null    object 
 5   distance_km       228 non-null    int64  
 6   time_seconds      228 non-null    float64
 7   winner_win_count  228 non-null    int64  
 8   top_driver        228 non-null    int64  
dtypes: float64(1), int64(4), object(4)
memory usage: 16.2+ KB
None

Column: rally_event
rally_event
Rally 1950-1    1
Rally 1950-2    1
Rally 1950-3    1
Rally 1951-1    1
Rally 1951-2    1
Name: count, dtype: int64

Column: winner_name
winner_name
Colin McRae       23
Sébastien Loeb    22
Ari Vatanen       20
Ott Tänak         18
Tommi Mäkinen     18
Name: count, dtype: int64

**4. Data Treatment (Feature Engineering + Target)**

In [7]:
# Count wins per driver
win_counts = df['winner_name'].value_counts().to_dict()
df['winner_win_count'] = df['winner_name'].map(win_counts)

# Set threshold X (example: 20 wins); a driver needs strictly more than X wins to be "top"
X_threshold = 20
df['top_driver'] = (df['winner_win_count'] > X_threshold).astype(int)

df[['winner_name','winner_win_count','top_driver']].head(20)

| | winner_name | winner_win_count | top_driver |
|---|---|---|---|
| 0 | Petter Solberg | 12 | 0 |
| 1 | Colin McRae | 23 | 1 |
| 2 | Richard Burns | 13 | 0 |
| 3 | Carlos Sainz | 12 | 0 |
| 4 | Colin McRae | 23 | 1 |
| 5 | Sébastien Ogier | 17 | 0 |
| 6 | Colin McRae | 23 | 1 |
| 7 | Ari Vatanen | 20 | 0 |
| 8 | Tommi Mäkinen | 18 | 0 |
| 9 | Colin McRae | 23 | 1 |
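
The target uses a strict comparison, so a driver with exactly 20 wins (Ari Vatanen in the output above) is labelled 0. The sketch below contrasts `>` with `>=` on four win counts copied from the exploration output. The dictionary names `strict` and `inclusive` are illustrative only.

```python
# Win counts copied from the value_counts output for winner_name
win_counts = {
    "Colin McRae": 23,
    "Sébastien Loeb": 22,
    "Ari Vatanen": 20,
    "Ott Tänak": 18,
}
X_threshold = 20

# Strict rule, as used in the notebook: more than X wins
strict = {name: int(wins > X_threshold) for name, wins in win_counts.items()}
# Inclusive alternative: X wins or more
inclusive = {name: int(wins >= X_threshold) for name, wins in win_counts.items()}

for name in win_counts:
    print(f"{name}: strict={strict[name]}, inclusive={inclusive[name]}")
```

Only the driver sitting exactly on the threshold changes class, which is worth checking whenever X is tuned.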


**5. Split the Dataset into Arrays**

In [9]:
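# Selected features
# Note: 'winner_win_count' is the column 'top_driver' was derived from in step 4,
# so a model given this feature can recover the label from it directly.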
features = ['year','time_seconds','winner_win_count','continent']

X = df[features]
y = df['top_driver']
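
Since `top_driver` is computed from `winner_win_count` by a threshold, keeping that column among the features lets a model read the label off a single input, which may explain very high scores. The sketch below illustrates the effect on synthetic data (not the rally dataset): a depth-1 decision tree, used here only as a convenient probe, separates the classes perfectly from the count alone.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for winner_win_count; values are invented for illustration
win_count = rng.integers(1, 25, size=200)
# Same labelling rule as the notebook: strictly more than 20 wins
label = (win_count > 20).astype(int)

# A single split on the count is enough to reproduce the label
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(win_count.reshape(-1, 1), label)
leaky_accuracy = stump.score(win_count.reshape(-1, 1), label)
print(f"Accuracy using only the win count: {leaky_accuracy:.4f}")
```

One way to test whether the other features carry any signal would be to drop `winner_win_count` from `features` and rerun the later steps.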

**6. Preprocessing Techniques**

In [13]:
# Define numeric and categorical columns
num_features = ['year','time_seconds','winner_win_count']
cat_features = ['continent']

# Pipelines
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features)
])
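
To make the effect of this preprocessor concrete, the sketch below fits an equivalent transformer on a tiny invented frame. It shows the output width (scaled numeric columns plus one one-hot column per category seen at fit time) and how `handle_unknown='ignore'` encodes an unseen continent as all zeros. The frame `toy`, the helper `to_dense`, and the "Europe" row are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def to_dense(matrix):
    """Return a NumPy array whether the transformer output is sparse or dense."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)


# Tiny invented frame with one missing year
toy = pd.DataFrame({
    "year": [1950, 1951, np.nan, 1953],
    "time_seconds": [3519.27, 5334.57, 3977.33, 4170.80],
    "continent": ["Asia", "North America", "South America", "Asia"],
})

pre = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["year", "time_seconds"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["continent"]),
])

encoded = to_dense(pre.fit_transform(toy))
print(encoded.shape)  # 2 numeric columns + 3 one-hot columns

# A continent not seen during fit is encoded as zeros in every one-hot column
unseen = pd.DataFrame(
    {"year": [1960], "time_seconds": [4000.0], "continent": ["Europe"]}
)
unseen_row = to_dense(pre.transform(unseen))
print(unseen_row[0, 2:])
```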

**7. Split the Dataset into Train and Test**

In [11]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)

(182, 4) (46, 4)
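
The `stratify=y` argument keeps the class ratio roughly equal in both splits. The sketch below reproduces this on labels alone, assuming 45 positives out of 228 rows. That count is an inference from the value counts shown earlier (only the drivers with 23 and 22 wins exceed the threshold of 20), not a figure stated in the notebook.

```python
from sklearn.model_selection import train_test_split

# Assumed class balance: 45 top-driver rows and 183 regular rows
y_all = [1] * 45 + [0] * 183
rows = list(range(len(y_all)))

rows_train, rows_test, y_tr, y_te = train_test_split(
    rows, y_all, test_size=0.2, random_state=42, stratify=y_all
)

print(len(y_tr), len(y_te))
print(f"Positive share, train: {sum(y_tr) / len(y_tr):.3f}")
print(f"Positive share, test:  {sum(y_te) / len(y_te):.3f}")
```

Without stratification, a test set of 46 rows could by chance contain noticeably more or fewer positives, which makes per-class metrics unstable.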


**8. Define Several Models and Train Them**

In [14]:
models = {
    "LogisticRegression": LogisticRegression(max_iter=500),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5)
}

pipelines = {name: Pipeline([
    ('preprocessor', preprocessor),
    ('model', model)
]) for name, model in models.items()}

for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)
    print(f"{name} trained successfully.")

LogisticRegression trained successfully.
RandomForest trained successfully.
KNN trained successfully.


**9. Validate the Model**

In [15]:
results = {}

for name, pipe in pipelines.items():
    y_pred = pipe.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"\n{name} - Accuracy: {acc:.4f}")
    print(classification_report(y_test, y_pred))

    cv_scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
    print(f"{name} - CV accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

    results[name] = acc

best_model_name = max(results, key=results.get)
best_model = pipelines[best_model_name]
print("\nBest model:", best_model_name)


LogisticRegression - Accuracy: 1.0000
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        37
           1       1.00      1.00      1.00         9

    accuracy                           1.00        46
   macro avg       1.00      1.00      1.00        46
weighted avg       1.00      1.00      1.00        46

LogisticRegression - CV accuracy: 0.9867 (+/- 0.0267)

RandomForest - Accuracy: 1.0000
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        37
           1       1.00      1.00      1.00         9

    accuracy                           1.00        46
   macro avg       1.00      1.00      1.00        46
weighted avg       1.00      1.00      1.00        46

RandomForest - CV accuracy: 1.0000 (+/- 0.0000)

KNN - Accuracy: 0.9348
              precision    recall  f1-score   support

           0       0.93      1.00      0.96        37
           1       1.00      0.67      
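
The selection step picks the model with the highest test accuracy, and `max` returns the first key it meets when several models tie. The sketch below shows this with the two tied scores from the output above, then uses the cross-validation mean as a tie-breaker. KNN is left out because its CV line is not shown, and the tie-breaking rule is a suggestion rather than the notebook's method.

```python
# Scores copied from the printed results for the two tied models
test_acc = {"LogisticRegression": 1.0000, "RandomForest": 1.0000}
cv_mean = {"LogisticRegression": 0.9867, "RandomForest": 1.0000}

# Test accuracy only: ties resolve to the first key in insertion order
by_test_only = max(test_acc, key=test_acc.get)

# Test accuracy first, then cross-validation mean as tie-breaker
by_test_then_cv = max(test_acc, key=lambda name: (test_acc[name], cv_mean[name]))

print("Test accuracy only:", by_test_only)
print("With CV tie-breaker:", by_test_then_cv)
```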

**10. Save the Solution**

In [16]:
joblib.dump(best_model, "best_model_f1_driver.pkl")
print("Model saved to best_model_f1_driver.pkl")

Model saved to best_model_f1_driver.pkl
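
A saved pipeline can be restored with `joblib.load` and used for prediction without refitting. The sketch below round-trips a small stand-in pipeline through a temporary file and checks that the predictions are unchanged. The toy data and the file name `toy_model.pkl` are invented for illustration.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the fitted pipeline; the rows are invented
X_toy = [[1950, 3500.0], [1960, 5200.0], [1970, 3900.0], [1980, 5600.0]]
y_toy = [0, 1, 0, 1]

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=500)),
])
pipe.fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")  # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline carries its preprocessing, so raw rows can be passed in
same = bool((restored.predict(X_toy) == pipe.predict(X_toy)).all())
print("Predictions match after reload:", same)
```

Because the whole pipeline is saved, the preprocessing fitted on the training data travels with the model. Pickle files should only be loaded from trusted sources, and ideally with the same library versions used to save them.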
