# Introduction to AutoML with Sweetviz and H2O AutoML

In this project, we applied automated machine learning (AutoML) to predict used car prices. We used two main tools:

1. **Sweetviz** for data exploration.
2. **H2O AutoML** for building regression models.

While these tools can save a lot of time, there are important trade-offs to consider, especially when it comes to understanding the process behind the results.

## Sweetviz: Automated Data Exploration

Sweetviz is a Python library that quickly generates a visual report of your dataset. It helps you get a snapshot of the data’s structure, including distributions, correlations, and missing values.

### Benefits:
- It’s very fast and easy to use.
- Provides a detailed report with minimal effort.

### Downsides:
- Limited customization options.
- Doesn't offer deep analysis beyond basic data exploration.

## H2O AutoML: Automated Model Building

H2O AutoML is an open-source tool that automates the process of creating and tuning machine learning models. It trains a range of models within a time or model-count budget and ranks them on a leaderboard by a chosen performance metric.

### Benefits:
- Saves time by automating the entire modeling process, from data preparation to model selection and tuning.
- Can be used by people with limited machine learning experience.

### Downsides:
- **Black box problem**: AutoML can feel like a black box. It builds models automatically, but you don’t always know how or why certain decisions were made (e.g., which features were most important).
- **Less control**: You don’t have full control over the process, which can be frustrating if you want to fine-tune or understand the model in more detail.

## Why AutoML Isn’t Always the Best Choice

Even though tools like H2O AutoML save time, I don’t recommend using them in every situation for the following reasons:

1. **Lack of transparency**: Since AutoML handles everything for you, it’s hard to understand how the model was built or what decisions were made during the process.
   
2. **Limited flexibility**: AutoML may not give you the customization you need for complex or highly specific tasks.

3. **Over-reliance on automation**: Relying too much on these tools can prevent you from learning or applying the important concepts behind machine learning. In some cases, a deeper understanding is needed, especially when explaining or adapting the model for real-world applications.

## Conclusion

Sweetviz and H2O AutoML are great tools for quickly exploring data and building machine learning models. However, because they work like a "black box," you should use them with caution in cases where it’s important to understand or explain how a model works. AutoML is a good starting point but shouldn’t replace manual model-building in situations that require more control and transparency.


In [3]:
# Import
import numpy as np
import pandas as pd
from datetime import datetime
#from sklearn.preprocessing import RobustScaler
#from sklearn.linear_model import Lasso, Ridge
#from sklearn.ensemble import RandomForestRegressor
#from catboost import CatBoostRegressor
#from lightgbm import LGBMRegressor
#from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
#from sklearn.metrics import mean_squared_error
#import matplotlib.pyplot as plt
#import seaborn as sns
import warnings
warnings.simplefilter("ignore")

In [4]:
# Load CSV files (train and test)
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
df_train = df_train.drop(columns=['id'])
df=df_train.copy()

In [5]:
print("train data : {}".format(df.shape))
print("test data : {}".format(df_test.shape))
df.head()

train data : (188533, 12)
test data : (125690, 12)


| | brand | model | model_year | milage | fuel_type | engine | transmission | ext_col | int_col | accident | clean_title | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MINI | Cooper S Base | 2007 | 213000 | Gasoline | 172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel | A/T | Yellow | Gray | None reported | Yes | 4200 |
| 1 | Lincoln | LS V8 | 2002 | 143250 | Gasoline | 252.0HP 3.9L 8 Cylinder Engine Gasoline Fuel | A/T | Silver | Beige | At least 1 accident or damage reported | Yes | 4999 |
| 2 | Chevrolet | Silverado 2500 LT | 2002 | 136731 | E85 Flex Fuel | 320.0HP 5.3L 8 Cylinder Engine Flex Fuel Capab... | A/T | Blue | Gray | None reported | Yes | 13900 |
| 3 | Genesis | G90 5.0 Ultimate | 2017 | 19500 | Gasoline | 420.0HP 5.0L 8 Cylinder Engine Gasoline Fuel | Transmission w/Dual Shift Mode | Black | Black | None reported | Yes | 45000 |
| 4 | Mercedes-Benz | Metris Base | 2021 | 7388 | Gasoline | 208.0HP 2.0L 4 Cylinder Engine Gasoline Fuel | 7-Speed A/T | Black | Beige | None reported | Yes | 97500 |


In [6]:
df.dtypes

brand           object
model           object
model_year       int64
milage           int64
fuel_type       object
engine          object
transmission    object
ext_col         object
int_col         object
accident        object
clean_title     object
price            int64
dtype: object

In [7]:
df.duplicated().sum()

0

In [8]:
# Percentage of null values per column
(df.isnull().sum())/df.shape[0]*100

brand            0.000000
model            0.000000
model_year       0.000000
milage           0.000000
fuel_type        2.696080
engine           0.000000
transmission     0.000000
ext_col          0.000000
int_col          0.000000
accident         1.300568
clean_title     11.360876
price            0.000000
dtype: float64

**Any manipulation applied to the training set must also be applied to the test set.**

In [9]:
# Treat a missing clean_title as its own category.
# Assigning the result back avoids relying on inplace=True on a column selection.
df['clean_title'] = df['clean_title'].fillna('Unknown')
df_test['clean_title'] = df_test['clean_title'].fillna('Unknown')

# Fill nulls in fuel_type and accident with the most frequent value within each
# (model, model_year) group; fall back to 'Unknown' when the group has no mode
def fill_with_group_mode(x):
    mode = x.mode()
    return x.fillna(mode[0] if not mode.empty else 'Unknown')

for frame in (df, df_test):
    for col in ('fuel_type', 'accident'):
        frame[col] = frame.groupby(['model', 'model_year'])[col].transform(fill_with_group_mode)
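
The cell above computes the group modes separately inside the training and test frames. A possible alternative, sketched here on toy data, is to learn the modes from the training frame only and reuse them on the test frame, so that both sets are filled with the same values. The helper names `fit_group_modes` and `apply_group_modes` and the toy frames are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def fit_group_modes(train, keys, col):
    """Most frequent non-null value of `col` per group of `keys`, learned from `train`."""
    modes = (
        train.dropna(subset=[col])
        .groupby(keys)[col]
        .agg(lambda s: s.mode().iloc[0])
    )
    return modes.rename(f"{col}_mode").reset_index()

def apply_group_modes(frame, modes, keys, col, fallback="Unknown"):
    """Fill nulls in `col` from the learned table; groups unseen in training get `fallback`."""
    merged = frame.merge(modes, on=keys, how="left")  # left join keeps the row order of `frame`
    learned = pd.Series(merged[f"{col}_mode"].values, index=frame.index)
    return frame[col].fillna(learned).fillna(fallback)

# Toy data (hypothetical), using the same column names as the notebook
toy_train = pd.DataFrame({
    "model": ["A", "A", "A", "B"],
    "model_year": [2020, 2020, 2020, 2019],
    "fuel_type": ["Gasoline", "Gasoline", None, "Diesel"],
})
toy_test = pd.DataFrame({
    "model": ["A", "B", "C"],
    "model_year": [2020, 2019, 2021],
    "fuel_type": [None, None, None],
})

keys = ["model", "model_year"]
modes = fit_group_modes(toy_train, keys, "fuel_type")
toy_train["fuel_type"] = apply_group_modes(toy_train, modes, keys, "fuel_type")
toy_test["fuel_type"] = apply_group_modes(toy_test, modes, keys, "fuel_type")
print(toy_test)
```

Whether to learn the statistics from the training data alone or from each set separately is a design choice; the sketch only illustrates the first option.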

In [10]:
import sweetviz as sv
report = sv.analyze(df)
report.show_html()  # generate html report



Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


In [11]:
from IPython.display import IFrame
IFrame(src='SWEETVIZ_REPORT.html', width=1000, height=600)

In [13]:
import h2o
from h2o.automl import H2OAutoML

# Initialize the H2O server
h2o.init()

# Load your pandas data into a DataFrame (df is your cleaned pandas DataFrame)
# Example: df = pd.read_csv('your_file.csv')

# Convert to H2O Frame
h2o_df = h2o.H2OFrame(df)

# train test split 
train, test = h2o_df.split_frame(ratios=[0.8], seed=1234)

# Features are every column except the last; the target (price) is the last column
x = train.columns[:-1]
y = train.columns[-1]

# Initialize AutoML with a 10-minute budget and sort the leaderboard by RMSE
aml = H2OAutoML(max_runtime_secs=600, sort_metric="RMSE")

# training 
aml.train(x=x, y=y, training_frame=train)

# print the leaderboard of best model sorted by RMSE
lb = aml.leaderboard
print(lb)

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 25.421-b09, mixed mode)
  Starting server from C:\Users\sisqo32\anaconda3\Lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\sisqo32\AppData\Local\Temp\tmpd0k_rrm8
  JVM stdout: C:\Users\sisqo32\AppData\Local\Temp\tmpd0k_rrm8\h2o_sisqo32_started_from_python.out
  JVM stderr: C:\Users\sisqo32\AppData\Local\Temp\tmpd0k_rrm8\h2o_sisqo32_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime:          02 secs
H2O_cluster_timezone:        UTC
H2O_data_parsing_timezone:   UTC
H2O_cluster_version:         3.46.0.5
H2O_cluster_version_age:     22 days
H2O_cluster_name:            H2O_from_python_sisqo32_d7e4g0
H2O_cluster_total_nodes:     1
H2O_cluster_free_memory:     3.492 Gb
H2O_cluster_total_cores:     8
H2O_cluster_allowed_cores:   8


Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
AutoML progress: |
20:01:36.49: AutoML: XGBoost is not available; skipping it.

███████████████████████████████████████████████████████████████| (done) 100%
model_id                                                    rmse          mse      mae       rmsle    mean_residual_deviance
StackedEnsemble_AllModels_3_AutoML_1_20240920_200136     70579.1  4.98141e+09  19084.2    0.53634                4.98141e+09
StackedEnsemble_BestOfFamily_4_AutoML_1_20240920_200136  70585.3  4.98228e+09  19120.2    0.53742                4.98228e+09
StackedEnsemble_AllModels_2_AutoML_1_20240920_200136     70587.2  4.98255e+09  19120.8    0.537644               4.98255e+09
StackedEnsemble_BestOfFamily_3_AutoML_1_20240920_200136  70589.8  4.98293e+09  19134.5  nan                      4.98293e+09
DeepLearning_1_AutoML_1_20240920_200136                  70783.5  5.01031e+09  19237.7  nan                      5.01031e+0

### Explanation of H2O AutoML Results

#### 1. **Starting the H2O Server**
The H2O AutoML framework requires a local Java-based server to function. When the process begins, H2O checks if there is an instance running on `http://localhost:54321`. If not, it attempts to start a local H2O server. 

- **"Attempting to start a local H2O server"**: H2O automatically starts the server.
- **"Server is running at http://127.0.0.1:54321"**: The server is successfully running and connected.

#### 2. **H2O Cluster Information**
After starting, H2O provides the configuration details for the cluster:
- **Cluster Version**: The version of H2O running (here, 3.46.0.5).
- **Free Memory**: The amount of free memory available to the cluster, here 3.492 GB.
- **Total Cores**: The server has 8 CPU cores, and all 8 are allowed for use.
- **Total Nodes**: The cluster runs on a single node.

#### 3. **Data Parsing**
H2O parses the provided data into its own internal format for analysis. This is indicated by:
- **"Parse progress: (done) 100%"**: The dataset is fully parsed and ready for model training.

#### 4. **AutoML Process and Results**
The H2O AutoML feature trains and evaluates multiple models automatically. After completion, a leaderboard of models is displayed with evaluation metrics such as **RMSE** (Root Mean Squared Error), **MSE** (Mean Squared Error), **MAE** (Mean Absolute Error), and **RMSLE** (Root Mean Squared Log Error).

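Example of leaderboard entries:
- **StackedEnsemble_AllModels_3_AutoML**: The best-performing model, with an RMSE of 70,579.1 and an MSE of 4.98e+09.
- **DeepLearning_1_AutoML**: The best non-ensemble model shown, with an RMSE of 70,783.5 and similar performance.

The leaderboard is sorted by RMSE because `sort_metric="RMSE"` was passed to `H2OAutoML`; lower values are better. Since no separate leaderboard frame was supplied, these scores are most likely cross-validated estimates computed on the training frame rather than scores on the 20% holdout split.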
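
As a quick sanity check on the leaderboard printed above, the figures for the top row can be compared against each other. The short sketch below uses only the RMSE, MSE, and MAE values shown for `StackedEnsemble_AllModels_3`; the variable names are illustrative.

```python
import math

# Values copied from the first leaderboard row in the output above
rmse = 70579.1
mse = 4.98141e9
mae = 19084.2

# RMSE is the square root of MSE, so the two columns should agree
rmse_from_mse = math.sqrt(mse)
print("sqrt(MSE):", round(rmse_from_mse, 1))

# RMSE is never smaller than MAE; a large ratio means a few very large errors
# dominate the squared-error metrics
ratio = rmse / mae
print("RMSE / MAE:", round(ratio, 2))
```

An RMSE several times larger than the MAE suggests that a small number of cars, probably very expensive ones, are predicted with very large errors, while the typical error is much smaller.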

#### 5. **Model Interpretation**
- **RMSE**: The square root of the mean squared error. It is expressed in the same unit as the target (here, price) and penalizes large errors more heavily than small ones. The lower, the better.
- **MSE**: The mean of the squared differences between predictions and true values.
- **MAE**: The mean of the absolute differences between predicted and true values.
- **RMSLE**: The root mean squared difference between the logarithms of predicted and true values. It penalizes relative rather than absolute errors, which is useful when the target spans several orders of magnitude.
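
The four metrics can be written out in a few lines. The sketch below uses plain Python so that it runs without dependencies; the true prices are taken from the first five rows shown earlier, while the predicted prices are made up for illustration. RMSLE is computed with `log(1 + x)`, a common convention and, as far as I know, the one H2O uses.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return RMSE, MSE, MAE and RMSLE for two equal-length sequences."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # log1p(x) = log(1 + x); it raises ValueError for values <= -1,
    # so RMSLE is only defined when predictions are not strongly negative
    log_errors = [math.log1p(p) - math.log1p(t) for t, p in zip(y_true, y_pred)]
    rmsle = math.sqrt(sum(e * e for e in log_errors) / n)
    return {"rmse": math.sqrt(mse), "mse": mse, "mae": mae, "rmsle": rmsle}

y_true = [4200, 4999, 13900, 45000, 97500]   # prices from the first five rows
y_pred = [5000, 4500, 15000, 40000, 60000]   # hypothetical predictions

metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The restriction on negative inputs is one plausible reason why some leaderboard rows show `nan` for RMSLE: a model that predicts a negative price has no valid logarithmic error.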

The **StackedEnsemble** models occupy the top four positions of the leaderboard in this run, which is consistent with their design: they combine the predictions of several base models.
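
To illustrate the idea behind stacking, the sketch below combines two noisy predictors with a linear meta-learner fitted by least squares. It is a toy illustration on synthetic data, not H2O's implementation: H2O trains its metalearner on cross-validated base-model predictions, whereas this example fits and evaluates on the same data for brevity. All names and numbers are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic target and two imperfect base-model predictions
y = rng.normal(size=200)
pred_a = y + rng.normal(scale=0.5, size=200)
pred_b = y + rng.normal(scale=0.5, size=200)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Meta-learner: a linear combination of base predictions plus an intercept
X = np.column_stack([pred_a, pred_b, np.ones_like(y)])
weights, *_ = np.linalg.lstsq(X, y, rcond=None)
stacked = X @ weights

print("RMSE model A:", rmse(y, pred_a))
print("RMSE model B:", rmse(y, pred_b))
print("RMSE stacked:", rmse(y, stacked))
```

Because the two base models make partly independent errors, the weighted combination averages some of that noise away. On training data the least-squares combination can never be worse than either input; on unseen data the gain is usually smaller and is not guaranteed.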


In [22]:
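# Two versions of the competition test set: the raw CSV and the cleaned pandas frame
h2o_test = h2o.import_file('test.csv')
hdf_test = h2o.H2OFrame(df_test)

# Evaluate the leader on `test`, the 20% holdout split taken from the training data
predictions = aml.leader.predict(test)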

rmse = aml.leader.model_performance(test).rmse()
print("Test RMSE:", rmse)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%
Test RMSE: 80725.09848945898


In [23]:
# Predict on the cleaned test frame (hdf_test), which received the same imputation
# as the training data, and convert the result to a pandas DataFrame
predictions = aml.leader.predict(hdf_test).as_data_frame()

# Build the submission file
submissionH2O = pd.DataFrame({
    'id': df_test['id'],  # Use the ID from the test data
    'target': predictions['predict']  # Use the prediction column
})

# Save the file in CSV format
submissionH2O.to_csv('submissionH2O.csv', index=False)

stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%
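
Before uploading a submission, a few basic checks can catch misaligned rows or missing predictions. The helper below is a hypothetical sketch (`check_submission` is not part of the notebook or of any library) and is demonstrated on a small made-up frame.

```python
import pandas as pd

def check_submission(submission, expected_ids, value_col):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if len(submission) != len(expected_ids):
        problems.append("row count differs from the test set")
    elif list(submission["id"]) != list(expected_ids):
        problems.append("ids are missing or out of order")
    if submission[value_col].isna().any():
        problems.append("some predictions are missing")
    if (submission[value_col] < 0).any():
        problems.append("some predicted prices are negative")
    return problems

# Made-up example: one prediction is missing
expected_ids = [188533, 188534, 188535]
toy_submission = pd.DataFrame({
    "id": expected_ids,
    "target": [18500.0, None, 52000.0],
})

print(check_submission(toy_submission, expected_ids, "target"))
```

In the notebook, the equivalent call would pass `submissionH2O`, `df_test['id']`, and `'target'`.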





In [20]:
# index=False keeps the pandas row index out of the submission file
submissionH2O.to_csv('submissionH2O.csv', index=False)