# Example of a valid submission for the challenge

This notebook is a starting point for the competition: it does not present best practices or the best algorithm for winning it. Let's start by importing the data.

In [None]:
import numpy as np
import pandas as pd

df = pd.read_csv("train.csv")
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
df = df.sort_values(by="date")
df.head()

Unnamed: 0,ticker,commodity,date,open,high,low,volume,ID,target
0,CL=F,Crude Oil,2000-08-23,31.950001,32.799999,31.950001,79385,0,31.629999
1,CL=F,Crude Oil,2000-08-24,31.9,32.240002,31.4,72978,1,32.049999
2,CL=F,Crude Oil,2000-08-25,31.700001,32.099998,31.32,44601,2,32.869999
3,CL=F,Crude Oil,2000-08-28,32.040001,32.919998,31.860001,46770,3,32.720001
4,CL=F,Crude Oil,2000-08-29,32.82,33.029999,32.560001,49131,4,33.400002


The data must be made usable by a machine learning algorithm. We split the training data into a training set and a validation set. In doing so, we must preserve temporal order, keeping in mind that the date range is not the same for every commodity:

In [None]:
for ticker in df["ticker"].value_counts().index:
    temp = df.loc[df["ticker"] == ticker, ]
    print(ticker, ":", np.min(temp["date"]), np.max(temp["date"]))

CL=F : 2000-08-23 00:00:00 2019-08-29 00:00:00
NG=F : 2000-08-30 00:00:00 2019-08-29 00:00:00
HO=F : 2000-09-01 00:00:00 2019-08-30 00:00:00
RB=F : 2000-11-01 00:00:00 2019-09-12 00:00:00
BZ=F : 2007-07-30 00:00:00 2021-01-28 00:00:00
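
The per-ticker date ranges printed above can also be collected in a single table with a group-by. This is a minimal sketch on a small hypothetical frame (`toy` stands in for `train.csv`; its rows mirror two of the ranges shown above):

```python
import pandas as pd

# Hypothetical toy data standing in for train.csv.
toy = pd.DataFrame({
    "ticker": ["CL=F", "CL=F", "BZ=F", "BZ=F"],
    "date": pd.to_datetime(["2000-08-23", "2019-08-29", "2007-07-30", "2021-01-28"]),
})

# One row per ticker with its first and last observation date.
date_ranges = toy.groupby("ticker")["date"].agg(["min", "max"])
print(date_ranges)
```

A table like this makes it easier to compare the ranges before choosing a cutoff date.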


We therefore build the training set from the quotes up to the end of 2017 and use the rest as the validation set.

In [None]:
train, validation = df.loc[df["date"] < "2018-01-01", ], df.loc[df["date"] >= "2018-01-01", ]
X_train = train.drop(columns=["ticker", "commodity", "date", "ID", "target"])
X_valid = validation.drop(columns=["ticker", "commodity", "date", "ID", "target"])
y_train = train["target"]
y_valid = validation["target"]
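
The date-based split can be wrapped in a small helper that also checks the two parts do not overlap in time. This is only a sketch: `temporal_split` is a hypothetical helper name, and the toy frame below is illustrative data, not rows from `train.csv`.

```python
import pandas as pd

def temporal_split(frame, cutoff, date_col="date"):
    """Split a frame into rows strictly before `cutoff` and rows on or after it."""
    cutoff = pd.Timestamp(cutoff)
    before = frame.loc[frame[date_col] < cutoff]
    after = frame.loc[frame[date_col] >= cutoff]
    return before, after

# Illustrative data around the cutoff used in this notebook.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2017-12-28", "2017-12-29", "2018-01-02", "2018-01-03"]),
    "target": [60.0, 60.4, 60.2, 61.6],
})
train_part, valid_part = temporal_split(toy, "2018-01-01")

# Every training date must precede every validation date.
assert train_part["date"].max() < valid_part["date"].min()
print(len(train_part), len(valid_part))
```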

We can now train an algorithm and measure its performance.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_valid)

RMSE = lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true=y_true, y_pred=y_pred))
performance = RMSE(y_valid, y_pred)
print(f"RMSE: {performance:.4}")
print(y_pred)

RMSE: 0.9557
[ 3.0457442   2.08960474 60.68800317 ... 55.84959569 55.72807279
 56.05454931]
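
The RMSE reported above is the square root of the mean squared difference between predictions and true values, so it is expressed in the same unit as `target`. As a small worked sketch (the function name `rmse` and the numbers are illustrative, not taken from the data):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred) ** 2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# The errors are 0, 0 and 2, so the mean squared error is 4/3.
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Because the errors are squared before averaging, a single large miss weighs more than several small ones.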


We have a working model. On to the test set! For a submission to be valid, it must contain the *ID* column and a column named *predicted* holding the value predicted by the algorithm.

In [None]:
test = pd.read_csv("test.csv")
X_test = test.drop(columns=["ticker", "commodity", "date", "ID"])
y_pred = model.predict(X_test)
print(y_pred)
submission = pd.DataFrame({
    "ID": test["ID"],
    "predicted": y_pred
})

submission.to_csv("submission.csv", index=False)
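
Before uploading, the file format can be checked locally. The sketch below only encodes the two required columns stated above; the checks for missing predictions and for matching IDs are additional assumptions, and `check_submission` is a hypothetical helper name.

```python
import pandas as pd

def check_submission(submission, expected_ids):
    """Return a list of problems found in a submission frame (empty if none)."""
    missing = {"ID", "predicted"} - set(submission.columns)
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    problems = []
    # Assumption: a prediction is expected for every row.
    if submission["predicted"].isna().any():
        problems.append("'predicted' contains missing values")
    # Assumption: the IDs must be exactly those of the test set.
    if sorted(submission["ID"]) != sorted(expected_ids):
        problems.append("IDs do not match the test set")
    return problems

good = pd.DataFrame({"ID": [10, 11, 12], "predicted": [50.1, 49.8, 51.3]})
bad = pd.DataFrame({"ID": [10, 11, 12], "predicted": [50.1, None, 51.3]})
print(check_submission(good, [10, 11, 12]))
print(check_submission(bad, [10, 11, 12]))
```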

All that remains is to upload the file to Kaggle!

# Random Forest

## Test 1

In [None]:
import numpy as np
import pandas as pd

df = pd.read_csv("train.csv")
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
df = df.sort_values(by="date")
df.head()

Unnamed: 0,ticker,commodity,date,open,high,low,volume,ID,target
0,CL=F,Crude Oil,2000-08-23,31.950001,32.799999,31.950001,79385,0,31.629999
1,CL=F,Crude Oil,2000-08-24,31.9,32.240002,31.4,72978,1,32.049999
2,CL=F,Crude Oil,2000-08-25,31.700001,32.099998,31.32,44601,2,32.869999
3,CL=F,Crude Oil,2000-08-28,32.040001,32.919998,31.860001,46770,3,32.720001
4,CL=F,Crude Oil,2000-08-29,32.82,33.029999,32.560001,49131,4,33.400002


In [None]:
for ticker in df["ticker"].value_counts().index:
    temp = df.loc[df["ticker"] == ticker, ]
    print(ticker, ":", np.min(temp["date"]), np.max(temp["date"]))

CL=F : 2000-08-23 00:00:00 2019-08-29 00:00:00
NG=F : 2000-08-30 00:00:00 2019-08-29 00:00:00
HO=F : 2000-09-01 00:00:00 2019-08-30 00:00:00
RB=F : 2000-11-01 00:00:00 2019-09-12 00:00:00
BZ=F : 2007-07-30 00:00:00 2021-01-28 00:00:00


In [None]:
train, validation = df.loc[df["date"] < "2018-01-01", ], df.loc[df["date"] >= "2018-01-01", ]
X_train = train.drop(columns=["ticker", "commodity", "date", "ID", "target"])
X_valid = validation.drop(columns=["ticker", "commodity", "date", "ID", "target"])
y_train = train["target"]
y_valid = validation["target"]

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_valid)

# Calculate the RMSE
RMSE = lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true=y_true, y_pred=y_pred))
performance = RMSE(y_valid, y_pred)
print(f"RMSE: {performance:.4f}")

RMSE: 1.0410


In [None]:
test = pd.read_csv("test.csv")
X_test = test.drop(columns=["ticker", "commodity", "date", "ID"])
y_pred = model.predict(X_test)
submission = pd.DataFrame({
    "ID": test["ID"],
    "predicted": y_pred
})

submission.to_csv("submission.csv", index=False)

## Test 2

In [None]:
import numpy as np
import pandas as pd

# Load the training data
df = pd.read_csv("train.csv")
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
df = df.sort_values(by="date")

# Ensure all columns except 'date', 'commodity' and 'ticker' are numeric
for col in df.columns:
    if col not in ["date", "commodity", "ticker"]:
        df[col] = pd.to_numeric(df[col], errors='coerce')

# Handling missing values by filling them with the mean of the column
df.fillna(df.mean(numeric_only=True), inplace=True)

df.head()


Unnamed: 0,ticker,commodity,date,open,high,low,volume,ID,target
0,CL=F,Crude Oil,2000-08-23,31.950001,32.799999,31.950001,79385,0,31.629999
1,CL=F,Crude Oil,2000-08-24,31.9,32.240002,31.4,72978,1,32.049999
2,CL=F,Crude Oil,2000-08-25,31.700001,32.099998,31.32,44601,2,32.869999
3,CL=F,Crude Oil,2000-08-28,32.040001,32.919998,31.860001,46770,3,32.720001
4,CL=F,Crude Oil,2000-08-29,32.82,33.029999,32.560001,49131,4,33.400002


In [None]:
# Feature Engineering: Adding rolling mean and price difference
df['rolling_mean_3'] = df.groupby('ticker')['target'].transform(lambda x: x.rolling(window=3).mean())
df['price_diff'] = df.groupby('ticker')['target'].transform(lambda x: x.diff())

# Fill any newly introduced NaNs
df.fillna(df.mean(numeric_only=True), inplace=True)

df.head()


Unnamed: 0,ticker,commodity,date,open,high,low,volume,ID,target,rolling_mean_3,price_diff
0,CL=F,Crude Oil,2000-08-23,31.950001,32.799999,31.950001,79385,0,31.629999,26.784241,5.9e-05
1,CL=F,Crude Oil,2000-08-24,31.9,32.240002,31.4,72978,1,32.049999,26.784241,0.42
2,CL=F,Crude Oil,2000-08-25,31.700001,32.099998,31.32,44601,2,32.869999,32.183332,0.82
3,CL=F,Crude Oil,2000-08-28,32.040001,32.919998,31.860001,46770,3,32.720001,32.546666,-0.149998
4,CL=F,Crude Oil,2000-08-29,32.82,33.029999,32.560001,49131,4,33.400002,32.996667,0.68
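
The features above are rolling statistics of `target`, whereas the test cells in this notebook compute the same two columns from `open`, because `target` is not available at prediction time. One possible alternative, shown here as a sketch rather than the method used in this notebook, is to build both features from `open` with a single helper so that training and test data share the same definition. `add_open_features` is a hypothetical name and the toy frame is illustrative.

```python
import pandas as pd

def add_open_features(frame):
    """Add a per-ticker 3-row rolling mean and first difference of `open`."""
    out = frame.sort_values(["ticker", "date"]).copy()
    grouped = out.groupby("ticker")["open"]
    out["rolling_mean_3"] = grouped.transform(lambda s: s.rolling(window=3).mean())
    out["price_diff"] = grouped.transform(lambda s: s.diff())
    return out

toy = pd.DataFrame({
    "ticker": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"] * 2),
    "open": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
features = add_open_features(toy)
print(features[["ticker", "open", "rolling_mean_3", "price_diff"]])
```

The first rows of each ticker stay empty because the window is not yet full; how to fill them (mean, interpolation, or dropping the rows) remains a separate choice.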


In [None]:
# Split the data into training and validation sets
train, validation = df.loc[df["date"] < "2018-01-01", ], df.loc[df["date"] >= "2018-01-01", ]
X_train = train.drop(columns=["ticker", "commodity", "date", "ID", "target"])
X_valid = validation.drop(columns=["ticker", "commodity", "date", "ID", "target"])
y_train = train["target"]
y_valid = validation["target"]

X_train.shape, X_valid.shape


((19929, 6), (2453, 6))

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Normalizing the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)

# Train a Random Forest Regressor model
model = RandomForestRegressor(n_estimators=1000, random_state=10).fit(X_train, y_train)
y_pred = model.predict(X_valid)

# Calculate the RMSE
RMSE = lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true=y_true, y_pred=y_pred))
performance = RMSE(y_valid, y_pred)
print(f"RMSE: {performance:.4f}")


RMSE: 0.3586


In [None]:
# Load the test data
test = pd.read_csv("test.csv")

# Ensure all columns except 'date', 'commodity' and 'ticker' are numeric
for col in test.columns:
    if col not in ["date", "commodity", "ticker"]:
        test[col] = pd.to_numeric(test[col], errors='coerce')

# Feature Engineering: Adding rolling mean and price difference for test data without using 'target'
test['rolling_mean_3'] = test.groupby('ticker')['open'].transform(lambda x: x.rolling(window=3).mean())
test['price_diff'] = test.groupby('ticker')['open'].transform(lambda x: x.diff())
test.fillna(test.mean(numeric_only=True), inplace=True)

# Build the test features and apply the scaler fitted on the training data
X_test = test.drop(columns=["ticker", "commodity", "date", "ID"])
X_test = scaler.transform(X_test)
y_pred = model.predict(X_test)

# Prepare the submission file
submission = pd.DataFrame({
    "ID": test["ID"],
    "predicted": y_pred
})

# Save the submission file
submission.to_csv("submissionV1.3.csv", index=False)

print("Submission file saved as 'submissionV1.3.csv'")


Submission file saved as 'submissionV1.3.csv'


## Test 3 FAIL

In [None]:
import numpy as np
import pandas as pd

# Load the training data
df = pd.read_csv("train.csv")
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
df = df.sort_values(by="date")

# Ensure all columns except 'date', 'commodity' and 'ticker' are numeric
for col in df.columns:
    if col not in ["date", "commodity", "ticker"]:
        df[col] = pd.to_numeric(df[col], errors='coerce')

# Handling missing values by filling them with the mean of the column
df.fillna(df.mean(numeric_only=True), inplace=True)


In [None]:
# Feature Engineering: Adding rolling mean and price difference
df['rolling_mean_3'] = df.groupby('ticker')['target'].transform(lambda x: x.rolling(window=3).mean())
df['price_diff'] = df.groupby('ticker')['target'].transform(lambda x: x.diff())

# Fill any newly introduced NaNs
df.fillna(df.mean(numeric_only=True), inplace=True)

# Remove outliers
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    df = df[~((df[column] < (Q1 - 1.5 * IQR)) | (df[column] > (Q3 + 1.5 * IQR)))]
    return df

# Remove outliers from all relevant numeric columns
for col in df.columns:
    if col not in ["date", "commodity", "ticker", "ID"]:
        df = remove_outliers(df, col)

df.shape

(12447, 11)
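
The filter above keeps a row only if its value lies within 1.5 interquartile ranges of the first and third quartiles, and it is applied one column after another, so each pass can remove more rows. A small sketch of the rule on illustrative numbers (`iqr_bounds` is a hypothetical helper name):

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
lower, upper = iqr_bounds(values)

# Values on the fences are kept, as in the filter above.
kept = values[values.between(lower, upper)]
print(lower, upper, kept.tolist())
```

Here the quartiles are 2 and 4, so the fences are -1 and 7 and only the value 100 is dropped.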

In [None]:
# Split the data into training and validation sets
train, validation = df.loc[df["date"] < "2018-01-01", ], df.loc[df["date"] >= "2018-01-01", ]
X_train = train.drop(columns=["ticker", "commodity", "date", "ID", "target"])
X_valid = validation.drop(columns=["ticker", "commodity", "date", "ID", "target"])
y_train = train["target"]
y_valid = validation["target"]

X_train.shape, X_valid.shape

((19929, 4), (2453, 4))

In [None]:
from sklearn.preprocessing import StandardScaler

# Normalizing the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
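
`StandardScaler` estimates one mean and one standard deviation per column in `fit_transform` and reuses them in `transform`. The sketch below reproduces that idea in plain NumPy on illustrative numbers, assuming the population standard deviation (`ddof=0`); the array names are hypothetical.

```python
import numpy as np

X_fit = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([[2.0, 40.0]])

# Statistics are estimated on the training rows only...
mean = X_fit.mean(axis=0)
std = X_fit.std(axis=0)

# ...and reused unchanged for any later data (validation or test).
X_fit_scaled = (X_fit - mean) / std
X_new_scaled = (X_new - mean) / std
print(X_new_scaled)
```

Any data passed to a model trained on scaled features needs the same transformation, using the statistics learned on the training set.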

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_features': [1.0, 'sqrt', 'log2'],  # 1.0 = all features (formerly 'auto')
    'max_depth': [10, 20, 30, 40, 50, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize the RandomForestRegressor
rf = RandomForestRegressor(random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=3, n_jobs=-1, verbose=2, scoring='neg_mean_squared_error')

# Fit the model
grid_search.fit(X_train, y_train)

# Get the best estimator
best_rf = grid_search.best_estimator_
print(grid_search.best_params_)

Fitting 3 folds for each of 810 candidates, totalling 2430 fits


KeyboardInterrupt: 
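
The size of an exhaustive grid search grows multiplicatively with each parameter list, which explains the 810 candidates and 2430 fits reported in the log above. A quick way to estimate the cost before launching a search, using only the standard library (the grid below copies the shape of the one above for illustration):

```python
import math

# Same number of values per parameter as in the grid above.
param_grid = {
    "n_estimators": [100, 200, 300, 400, 500],
    "max_features": [1.0, "sqrt", "log2"],
    "max_depth": [10, 20, 30, 40, 50, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
cv_folds = 3

# Number of combinations = product of the list lengths.
n_candidates = math.prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)
```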

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

# Define the parameter grid
param_dist = {
    'n_estimators': [100, 200, 300],
    'max_features': [1.0, 'sqrt', 'log2'],  # 1.0 = all features (formerly 'auto')
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize the RandomForestRegressor
rf = RandomForestRegressor(random_state=42)

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_dist,
                                   n_iter=50, cv=3, n_jobs=-1, verbose=2, scoring='neg_mean_squared_error',
                                   random_state=42)

# Fit the model
random_search.fit(X_train, y_train)

# Get the best estimator
best_rf = random_search.best_estimator_
print(random_search.best_params_)


Fitting 3 folds for each of 50 candidates, totalling 150 fits


KeyboardInterrupt: 

In [None]:
from sklearn.metrics import mean_squared_error

# Predict on the validation set
y_pred = best_rf.predict(X_valid)

# Calculate the RMSE
RMSE = lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true=y_true, y_pred=y_pred))
performance = RMSE(y_valid, y_pred)
print(f"RMSE: {performance:.4f}")

RMSE: 0.0129


In [None]:
# Load the test data
test = pd.read_csv("test.csv")

# Ensure all columns except 'date', 'commodity' and 'ticker' are numeric
for col in test.columns:
    if col not in ["date", "commodity", "ticker"]:
        test[col] = pd.to_numeric(test[col], errors='coerce')

# Feature Engineering: Adding rolling mean and price difference for test data without using 'target'
test['rolling_mean_3'] = test.groupby('ticker')['open'].transform(lambda x: x.rolling(window=3).mean())
test['price_diff'] = test.groupby('ticker')['open'].transform(lambda x: x.diff())
test.fillna(test.mean(numeric_only=True), inplace=True)

# Normalize the test data
X_test = test.drop(columns=["ticker", "commodity", "date", "ID"])
X_test = scaler.transform(X_test)

# Predict on the test set
y_pred = best_rf.predict(X_test)

# Prepare the submission file
submission = pd.DataFrame({
    "ID": test["ID"],
    "predicted": y_pred
})

# Save the submission file
submission.to_csv("submissionrfV2.csv", index=False)

print("Optimized submission file saved as 'submissionrfV2.csv'")


Optimized submission file saved as 'submissionrfV2.csv'


## Test 4


# XGBoost

## Test 1


In [None]:
import numpy as np
import pandas as pd

# Load the training data
df = pd.read_csv("train.csv")
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
df = df.sort_values(by="date")

# Ensure all columns except 'date', 'commodity' and 'ticker' are numeric
for col in df.columns:
    if col not in ["date", "commodity", "ticker"]:
        df[col] = pd.to_numeric(df[col], errors='coerce')

# Handling missing values by filling them with the mean of the column
df.fillna(df.mean(numeric_only=True), inplace=True)

# Feature Engineering: Adding rolling mean and price difference
df['rolling_mean_3'] = df.groupby('ticker')['target'].transform(lambda x: x.rolling(window=3).mean())
df['price_diff'] = df.groupby('ticker')['target'].transform(lambda x: x.diff())

# Fill any newly introduced NaNs
df.fillna(df.mean(numeric_only=True), inplace=True)

# Remove outliers
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    df = df[~((df[column] < (Q1 - 1.5 * IQR)) | (df[column] > (Q3 + 1.5 * IQR)))]
    return df

# Remove outliers from all relevant numeric columns
for col in df.columns:
    if col not in ["date", "commodity", "ticker", "ID"]:
        df = remove_outliers(df, col)

# Split the data into training and validation sets
train, validation = df.loc[df["date"] < "2018-01-01", ], df.loc[df["date"] >= "2018-01-01", ]
X_train = train.drop(columns=["ticker", "commodity", "date", "ID", "target"])
X_valid = validation.drop(columns=["ticker", "commodity", "date", "ID", "target"])
y_train = train["target"]
y_valid = validation["target"]

# Normalizing the data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)


In [None]:
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error

# Initialize the XGBoost Regressor
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)

# Define the parameter grid
param_dist = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=xg_reg, param_distributions=param_dist,
                                   n_iter=50, cv=3, n_jobs=-1, verbose=2, scoring='neg_mean_squared_error',
                                   random_state=42)

# Fit the model
random_search.fit(X_train, y_train)

# Get the best estimator
best_xg_reg = random_search.best_estimator_
print(random_search.best_params_)


Fitting 3 folds for each of 50 candidates, totalling 150 fits
{'subsample': 0.6, 'n_estimators': 200, 'max_depth': 9, 'learning_rate': 0.2, 'colsample_bytree': 0.6}


In [None]:
# Predict on the validation set
y_pred = best_xg_reg.predict(X_valid)

# Calculate the RMSE
RMSE = lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true=y_true, y_pred=y_pred))
performance = RMSE(y_valid, y_pred)
print(f"Optimized RMSE with XGBoost: {performance:.4f}")
print(y_pred)
# Load the test data
test = pd.read_csv("test.csv")

# Ensure all columns except 'date', 'commodity' and 'ticker' are numeric
for col in test.columns:
    if col not in ["date", "commodity", "ticker"]:
        test[col] = pd.to_numeric(test[col], errors='coerce')

# Feature Engineering: Adding rolling mean and price difference for test data without using 'target'
test['rolling_mean_3'] = test.groupby('ticker')['open'].transform(lambda x: x.rolling(window=3).mean())
test['price_diff'] = test.groupby('ticker')['open'].transform(lambda x: x.diff())
test.fillna(test.mean(numeric_only=True), inplace=True)

# Normalize the test data
X_test = test.drop(columns=["ticker", "commodity", "date", "ID"])
X_test = scaler.transform(X_test)

# Predict on the test set
y_pred = best_xg_reg.predict(X_test)

# Prepare the submission file
submission = pd.DataFrame({
    "ID": test["ID"],
    "predicted": y_pred
})

# Save the submission file
submission.to_csv("submissionxgboostV1.csv", index=False)

print("Optimized submission file saved as 'submissionxgboostV1.csv'")


Optimized RMSE with XGBoost: 0.0147
Optimized submission file saved as 'submissionxgboostV1.csv'


## Test 2 FAIL

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from scipy import stats

# Load the training data
df = pd.read_csv("train.csv")
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
df = df.sort_values(by="date")

# Ensure all columns except 'date', 'commodity' and 'ticker' are numeric
for col in df.columns:
    if col not in ["date", "commodity", "ticker"]:
        df[col] = pd.to_numeric(df[col], errors='coerce')

# Handle missing values with linear interpolation
df.interpolate(method='linear', inplace=True)

# Feature engineering
df['rolling_mean_3'] = df.groupby('ticker')['target'].transform(lambda x: x.rolling(window=3).mean())
df['price_diff'] = df.groupby('ticker')['target'].transform(lambda x: x.diff())
df['rolling_mean_7'] = df.groupby('ticker')['target'].transform(lambda x: x.rolling(window=7).mean())
df['rolling_std_3'] = df.groupby('ticker')['target'].transform(lambda x: x.rolling(window=3).std())
df['rolling_std_7'] = df.groupby('ticker')['target'].transform(lambda x: x.rolling(window=7).std())
df['price_diff_7'] = df.groupby('ticker')['target'].transform(lambda x: x.diff(7))

# Fill the newly introduced NaNs
df.interpolate(method='linear', inplace=True)

# Z-score outlier filter (|z| < 3 on every numeric column).
# Only `df` is filtered here; `df_no_outliers` is an unfiltered copy,
# and it is the frame the split below actually uses.
df_no_outliers = df.copy()
z_scores = np.abs(stats.zscore(df.select_dtypes(include=[np.number])))
df = df[(z_scores < 3).all(axis=1)]

# Split the data into training and validation sets
train, validation = df_no_outliers.loc[df_no_outliers["date"] < "2018-01-01", ], df_no_outliers.loc[df_no_outliers["date"] >= "2018-01-01", ]
X_train = train.drop(columns=["ticker", "commodity", "date", "ID", "target"])
X_valid = validation.drop(columns=["ticker", "commodity", "date", "ID", "target"])
y_train = train["target"]
y_valid = validation["target"]

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)

# Define the parameter distributions for RandomizedSearchCV
param_dist = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}

# Initialize the XGBoost model
xg_reg = XGBRegressor(objective='reg:squarederror', random_state=42)

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=xg_reg, param_distributions=param_dist,
                                   n_iter=50, cv=3, n_jobs=-1, verbose=2, scoring='neg_mean_squared_error',
                                   random_state=42)

# Fit the model
random_search.fit(X_train, y_train)

# Get the best estimator
best_xg_reg = random_search.best_estimator_
print(random_search.best_params_)

# Predict on the validation set
y_pred = best_xg_reg.predict(X_valid)
performance = np.sqrt(mean_squared_error(y_valid, y_pred))
print(f"Optimized RMSE with XGBoost: {performance:.4f}")

# Load the test data
test = pd.read_csv("test.csv")

# Ensure all columns except 'date', 'commodity' and 'ticker' are numeric
for col in test.columns:
    if col not in ["date", "commodity", "ticker"]:
        test[col] = pd.to_numeric(test[col], errors='coerce')

# Feature engineering for the test data (built from 'open', since 'target' is unavailable)
test['rolling_mean_3'] = test.groupby('ticker')['open'].transform(lambda x: x.rolling(window=3).mean())
test['price_diff'] = test.groupby('ticker')['open'].transform(lambda x: x.diff())
test['rolling_mean_7'] = test.groupby('ticker')['open'].transform(lambda x: x.rolling(window=7).mean())
test['rolling_std_3'] = test.groupby('ticker')['open'].transform(lambda x: x.rolling(window=3).std())
test['rolling_std_7'] = test.groupby('ticker')['open'].transform(lambda x: x.rolling(window=7).std())
test['price_diff_7'] = test.groupby('ticker')['open'].transform(lambda x: x.diff(7))

# Fill the newly introduced NaNs
test.interpolate(method='linear', inplace=True)

# Standardize the test data with the scaler fitted on the training set
X_test = test.drop(columns=["ticker", "commodity", "date", "ID"])
X_test = scaler.transform(X_test)

# Predict on the test set
y_test_pred = best_xg_reg.predict(X_test)

# Build the submission file
submission = pd.DataFrame({
    "ID": test["ID"],
    "predicted": y_test_pred
})

# Save the submission file
submission.to_csv("submission_xgboost_optimized.csv", index=False)
print("Optimized submission file saved as 'submission_xgboost_optimized.csv'")


Fitting 3 folds for each of 50 candidates, totalling 150 fits
{'subsample': 0.6, 'n_estimators': 300, 'max_depth': 3, 'learning_rate': 0.2, 'colsample_bytree': 0.6}
Optimized RMSE with XGBoost: 0.5391
Optimized submission file saved as 'submission_xgboost_optimized.csv'


## Test 3

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from scipy import stats

# Load the training data
df = pd.read_csv("train.csv")
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
df = df.sort_values(by="date")

# Ensure all columns except 'date', 'commodity' and 'ticker' are numeric
for col in df.columns:
    if col not in ["date", "commodity", "ticker"]:
        df[col] = pd.to_numeric(df[col], errors='coerce')

# Handle missing values with linear interpolation
df.interpolate(method='linear', inplace=True)

# Feature engineering
df['rolling_mean_3'] = df.groupby('ticker')['target'].transform(lambda x: x.rolling(window=3).mean())
df['price_diff'] = df.groupby('ticker')['target'].transform(lambda x: x.diff())


# Fill the newly introduced NaNs
df.interpolate(method='linear', inplace=True)

# Split the data into training and validation sets
train, validation = df.loc[df["date"] < "2018-01-01", ], df.loc[df["date"] >= "2018-01-01", ]
X_train = train.drop(columns=["ticker", "commodity", "date", "ID", "target"])
X_valid = validation.drop(columns=["ticker", "commodity", "date", "ID", "target"])
y_train = train["target"]
y_valid = validation["target"]

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)

# Define the parameter distributions for RandomizedSearchCV
param_dist = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}

# Initialize the XGBoost model
xg_reg = XGBRegressor(objective='reg:squarederror', random_state=42)

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=xg_reg, param_distributions=param_dist,
                                   n_iter=50, cv=3, n_jobs=-1, verbose=2, scoring='neg_mean_squared_error',
                                   random_state=42)

# Fit the model
random_search.fit(X_train, y_train)

# Get the best estimator
best_xg_reg = random_search.best_estimator_
print(random_search.best_params_)

# Predict on the validation set
y_pred = best_xg_reg.predict(X_valid)
performance = np.sqrt(mean_squared_error(y_valid, y_pred))
print(f"Optimized RMSE with XGBoost: {performance:.4f}")

# Load the test data
test = pd.read_csv("test.csv")

# Ensure all columns except 'date', 'commodity' and 'ticker' are numeric
for col in test.columns:
    if col not in ["date", "commodity", "ticker"]:
        test[col] = pd.to_numeric(test[col], errors='coerce')

# Feature engineering for the test data (built from 'open', since 'target' is unavailable)
test['rolling_mean_3'] = test.groupby('ticker')['open'].transform(lambda x: x.rolling(window=3).mean())
test['price_diff'] = test.groupby('ticker')['open'].transform(lambda x: x.diff())


# Fill the newly introduced NaNs
test.interpolate(method='linear', inplace=True)

# Standardize the test data with the scaler fitted on the training set
X_test = test.drop(columns=["ticker", "commodity", "date", "ID"])
X_test = scaler.transform(X_test)

# Predict on the test set
y_test_pred = best_xg_reg.predict(X_test)

# Build the submission file
submission = pd.DataFrame({
    "ID": test["ID"],
    "predicted": y_test_pred
})

# Save the submission file
submission.to_csv("submission_xgboostV1.csv", index=False)
print("Optimized submission file saved as 'submission_xgboostV1.csv'")


Fitting 3 folds for each of 50 candidates, totalling 150 fits
{'subsample': 0.8, 'n_estimators': 200, 'max_depth': 9, 'learning_rate': 0.2, 'colsample_bytree': 1.0}
Optimized RMSE with XGBoost: 0.4875
Optimized submission file saved as 'submission_xgboostV1.csv'


In [None]:
print(y_test_pred)

[108.246056 107.54986  130.89052  ... 131.57074  129.39757  132.27306 ]


# Linear Regression

## Test 1


In [None]:
import numpy as np
import pandas as pd

# Load the training data
df = pd.read_csv("train.csv")
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
df = df.sort_values(by="date")

# Ensure all columns except 'date', 'commodity' and 'ticker' are numeric
for col in df.columns:
    if col not in ["date", "commodity", "ticker"]:
        df[col] = pd.to_numeric(df[col], errors='coerce')

# Handling missing values by filling them with the mean of the column
df.fillna(df.mean(numeric_only=True), inplace=True)

df.head()


Unnamed: 0,ticker,commodity,date,open,high,low,volume,ID,target
0,CL=F,Crude Oil,2000-08-23,31.950001,32.799999,31.950001,79385,0,31.629999
1,CL=F,Crude Oil,2000-08-24,31.9,32.240002,31.4,72978,1,32.049999
2,CL=F,Crude Oil,2000-08-25,31.700001,32.099998,31.32,44601,2,32.869999
3,CL=F,Crude Oil,2000-08-28,32.040001,32.919998,31.860001,46770,3,32.720001
4,CL=F,Crude Oil,2000-08-29,32.82,33.029999,32.560001,49131,4,33.400002


In [None]:
# Feature Engineering: Adding rolling mean and price difference
df['rolling_mean_3'] = df.groupby('ticker')['target'].transform(lambda x: x.rolling(window=3).mean())
df['price_diff'] = df.groupby('ticker')['target'].transform(lambda x: x.diff())

# Fill any newly introduced NaNs
df.fillna(df.mean(numeric_only=True), inplace=True)

df.head()


Unnamed: 0,ticker,commodity,date,open,high,low,volume,ID,target,rolling_mean_3,price_diff
0,CL=F,Crude Oil,2000-08-23,31.950001,32.799999,31.950001,79385,0,31.629999,26.784241,5.9e-05
1,CL=F,Crude Oil,2000-08-24,31.9,32.240002,31.4,72978,1,32.049999,26.784241,0.42
2,CL=F,Crude Oil,2000-08-25,31.700001,32.099998,31.32,44601,2,32.869999,32.183332,0.82
3,CL=F,Crude Oil,2000-08-28,32.040001,32.919998,31.860001,46770,3,32.720001,32.546666,-0.149998
4,CL=F,Crude Oil,2000-08-29,32.82,33.029999,32.560001,49131,4,33.400002,32.996667,0.68


In [None]:
# Split the data into training and validation sets
train, validation = df.loc[df["date"] < "2018-01-01", ], df.loc[df["date"] >= "2018-01-01", ]
X_train = train.drop(columns=["ticker", "commodity", "date", "ID", "target"])
X_valid = validation.drop(columns=["ticker", "commodity", "date", "ID", "target"])
y_train = train["target"]
y_valid = validation["target"]

X_train.shape, X_valid.shape


((19929, 6), (2453, 6))

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_valid)

RMSE = lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true=y_true, y_pred=y_pred))
performance = RMSE(y_valid, y_pred)
print(f"RMSE: {performance:.4}")
print(y_pred)


RMSE: 0.2814
[ 2.95751294  2.10340423 61.87572263 ... 55.75251777 55.45065541
 56.40384017]


In [None]:
# Load the test data
test = pd.read_csv("test.csv")

# Ensure all columns except 'date', 'commodity' and 'ticker' are numeric
for col in test.columns:
    if col not in ["date", "commodity", "ticker"]:
        test[col] = pd.to_numeric(test[col], errors='coerce')

# Feature Engineering: Adding rolling mean and price difference for test data without using 'target'
test['rolling_mean_3'] = test.groupby('ticker')['open'].transform(lambda x: x.rolling(window=3).mean())
test['price_diff'] = test.groupby('ticker')['open'].transform(lambda x: x.diff())
test.fillna(test.mean(numeric_only=True), inplace=True)


# Build the test feature matrix (no scaling: the model was trained on raw features)
X_test = test.drop(columns=["ticker", "commodity", "date", "ID"])
y_pred = model.predict(X_test)

print(y_pred)
# Prepare the submission file
submission = pd.DataFrame({
    "ID": test["ID"],
    "predicted": y_pred
})

# Save the submission file
submission.to_csv("submissionRegLinV1.csv", index=False)

print("Submission file saved as 'submissionRegLinV1.csv'")


[51.82767588 48.9028474  55.04148914 ... 80.9566664  81.50954471
 81.52734599]
Submission file saved as 'submissionRegLinV1.csv'


# New section

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Load the train dataset
df = pd.read_csv("train.csv")
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
df = df.sort_values(by="date")

# Data Cleansing: Handle missing values and duplicates
df = df.drop_duplicates()
df = df.dropna()

# Feature Engineering: Create additional features
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
df['volatility'] = (df['high'] - df['low']) / df['open']

# Moving averages of the opening price, computed separately for each ticker.
# 'target' is not used here: it is the value being predicted and test.csv does not provide it.
def add_moving_averages(frame):
    """Add ma7/ma14/ma30 of 'open' per ticker; rows must already be sorted by date."""
    for window in (7, 14, 30):
        # min_periods=1 avoids NaN at the start of each ticker, so no row has to be dropped
        frame[f'ma{window}'] = frame.groupby('ticker')['open'].transform(
            lambda s: s.rolling(window=window, min_periods=1).mean())
    return frame

df = add_moving_averages(df)

# Split the data into training and validation sets
train, validation = df.loc[df["date"] < "2018-01-01"], df.loc[df["date"] >= "2018-01-01"]
X_train = train.drop(columns=["ticker", "commodity", "date", "ID", "target"])
X_valid = validation.drop(columns=["ticker", "commodity", "date", "ID", "target"])
y_train = train["target"]
y_valid = validation["target"]
feature_cols = X_train.columns.tolist()  # reused to build X_test with the same columns, in the same order

# Data Normalization (the scaler is fitted on the training set only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)

# Model Selection and Hyperparameter Tuning
models = {
    'RandomForest': RandomForestRegressor(),
    'GradientBoosting': GradientBoostingRegressor()
}

params = {
    'RandomForest': {
        'n_estimators': [100, 200],
        'max_depth': [None, 10, 20],
    },
    'GradientBoosting': {
        'n_estimators': [100, 200],
        'learning_rate': [0.01, 0.1],
        'max_depth': [3, 5, 7]
    }
}

best_model = None
best_score = float('inf')

for model_name in models:
    # n_jobs=-1 evaluates the grid on all available cores
    grid_search = GridSearchCV(models[model_name], params[model_name], cv=5,
                               scoring='neg_mean_squared_error', n_jobs=-1)
    grid_search.fit(X_train, y_train)
    model = grid_search.best_estimator_
    y_pred = model.predict(X_valid)
    rmse = np.sqrt(mean_squared_error(y_valid, y_pred))
    print(f"{model_name} RMSE: {rmse:.4f}")
    # Keep the model with the lowest RMSE on the validation set
    if rmse < best_score:
        best_score = rmse
        best_model = model

print(f"Best model: {best_model}")

# Load the test dataset
test = pd.read_csv("test.csv")
test["date"] = pd.to_datetime(test["date"], format="%Y-%m-%d")
test = test.sort_values(by="date")

# Feature Engineering on test data
test['day_of_week'] = test['date'].dt.dayofweek
test['month'] = test['date'].dt.month
test['year'] = test['date'].dt.year
test['volatility'] = (test['high'] - test['low']) / test['open']

# Moving averages: same per-ticker computation as for the training data
test = add_moving_averages(test)

# Fill gaps instead of dropping rows, so that every test ID keeps a prediction
X_test = test[feature_cols]
X_test = X_test.fillna(X_test.mean())
X_test = scaler.transform(X_test)

# Predict on test data
y_pred_test = best_model.predict(X_test)

# Create a submission dataframe
submission = pd.DataFrame({
    "ID": test["ID"],
    "predicted": y_pred_test
})

# Save the submission dataframe to a CSV file
submission.to_csv("submissiontest1.csv", index=False)
submission.head()


KeyboardInterrupt: 
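
The grid search above passes `cv=5`, which for a regressor means an unshuffled K-fold split. On rows sorted by date, some of those folds train on observations that come after the ones they validate on. A chronological alternative is scikit-learn's `TimeSeriesSplit`, sketched below on synthetic data; the data, the small grid and the names `X_demo`, `y_demo` and `search` are assumptions chosen to keep the example fast, not the notebook's actual setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic data (made up); row order stands in for chronological order.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Each fold trains on an initial segment and validates on the rows that follow it.
cv = TimeSeriesSplit(n_splits=3)
folds_are_chronological = [
    train_idx.max() < valid_idx.min() for train_idx, valid_idx in cv.split(X_demo)
]
print(folds_are_chronological)

# The splitter is passed to GridSearchCV in place of an integer.
search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid={"max_depth": [3, None]},
    cv=cv,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)
print(search.best_params_, -search.best_score_)
```

A smaller grid or fewer estimators, as used here, is also one way to shorten a search that would otherwise run long enough to be interrupted.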

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFECV
from sklearn.metrics import mean_squared_error
import xgboost as xgb

# Load the train dataset
df = pd.read_csv("train.csv")
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
df = df.sort_values(by="date")

# Data Cleansing: Handle missing values and duplicates
df = df.drop_duplicates()
df = df.dropna()

# Feature Engineering: Create additional features
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
df['volatility'] = (df['high'] - df['low']) / df['open']

# Moving averages and lag features of the opening price, computed separately for each ticker.
# 'target' is not used here: it is the value being predicted and test.csv has no such column
# (accessing test['target'] raises KeyError: 'target').
def add_price_history_features(frame):
    """Add ma7/ma14/ma30 and lag1..lag3 of 'open' per ticker; rows must be sorted by date."""
    for window in (7, 14, 30):
        frame[f'ma{window}'] = frame.groupby('ticker')['open'].transform(
            lambda s: s.rolling(window=window, min_periods=1).mean())
    for lag in (1, 2, 3):
        frame[f'lag{lag}'] = frame.groupby('ticker')['open'].shift(lag)
    return frame

df = add_price_history_features(df)

# The first rows of each ticker have no lagged value; drop them from the training data
df = df.dropna()

# Split the data into training and validation sets
train, validation = df.loc[df["date"] < "2018-01-01"], df.loc[df["date"] >= "2018-01-01"]
X_train = train.drop(columns=["ticker", "commodity", "date", "ID", "target"])
X_valid = validation.drop(columns=["ticker", "commodity", "date", "ID", "target"])
y_train = train["target"]
y_valid = validation["target"]
feature_cols = X_train.columns.tolist()  # reused to build X_test with the same columns, in the same order

# Data Normalization (the scaler is fitted on the training set only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)

# Feature Selection using Recursive Feature Elimination
model = xgb.XGBRegressor()
rfe = RFECV(estimator=model, step=1, scoring='neg_mean_squared_error', cv=5)
rfe.fit(X_train, y_train)
X_train_rfe = rfe.transform(X_train)
X_valid_rfe = rfe.transform(X_valid)

# Hyperparameter Tuning using GridSearchCV
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9]
}

grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train_rfe, y_train)
best_model = grid_search.best_estimator_

# Predict on validation data
y_pred = best_model.predict(X_valid_rfe)
rmse = np.sqrt(mean_squared_error(y_valid, y_pred))
print(f"Optimized RMSE: {rmse:.4f}")

# Load the test dataset
test = pd.read_csv("test.csv")
test["date"] = pd.to_datetime(test["date"], format="%Y-%m-%d")
test = test.sort_values(by="date")

# Feature Engineering on test data
test['day_of_week'] = test['date'].dt.dayofweek
test['month'] = test['date'].dt.month
test['year'] = test['date'].dt.year
test['volatility'] = (test['high'] - test['low']) / test['open']

# Moving averages and lag features: same per-ticker computation as for the training data
test = add_price_history_features(test)

# Fill gaps instead of dropping rows, so that every test ID keeps a prediction
X_test = test[feature_cols]
X_test = X_test.fillna(X_test.mean())
X_test = scaler.transform(X_test)
X_test_rfe = rfe.transform(X_test)

# Predict on test data
y_pred_test = best_model.predict(X_test_rfe)

# Create a submission dataframe
submission = pd.DataFrame({
    "ID": test["ID"],
    "predicted": y_pred_test
})

# Save the submission dataframe to a CSV file
submission.to_csv("submissiontest2.csv", index=False)
submission.head()


Optimized RMSE: 1.0208


KeyError: 'target'
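
The `KeyError` above most likely arises because `test.csv` has no `target` column, so any feature built from `target` cannot be reproduced at prediction time. One way around this, sketched below on made-up data, is to derive lag features from a column present in both files, such as `open`, and to shift within each ticker. The helper `add_lag_features` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

def add_lag_features(frame, column="open", lags=(1, 2, 3)):
    """Return a sorted copy with per-ticker lagged values of `column` (hypothetical helper)."""
    out = frame.sort_values(["ticker", "date"]).copy()
    for lag in lags:
        # shift() inside groupby never pulls a value from another ticker
        out[f"lag{lag}"] = out.groupby("ticker")[column].shift(lag)
    return out

# Toy data (made up): two tickers interleaved by date, without any 'target' column.
toy = pd.DataFrame({
    "ticker": ["A", "B", "A", "B", "A"],
    "date": pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03"]
    ),
    "open": [1.0, 10.0, 2.0, 20.0, 3.0],
})

lagged = add_lag_features(toy)
print(lagged)
```

The same function can then be applied unchanged to the training and test frames, which keeps their feature columns identical.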