# Tabular-Playground-Series - Sep 2021 👀🔥💥🎉

# AutoML  🎉

1. **Import libraries**
2. **Read the data**
3. **Identify the null data and apply SimpleImputer**
4. **KFold - 5 splits**
5. **Correlation of the input data**
6. **Preprocess the input data**
7. **Build the model**
8. **AutoML (H2O)** --> first attempt
9. **Submit the predicted output**


In [None]:
#import necessary libraries
import numpy as np
import pandas as pd
import matplotlib 
from matplotlib import pyplot as plt
%matplotlib inline
matplotlib.rcParams["figure.figsize"] = (12,10)
import seaborn as sns
#import model
from sklearn.preprocessing import StandardScaler,OrdinalEncoder,LabelEncoder
from sklearn import model_selection
from sklearn import preprocessing
from sklearn.model_selection import train_test_split,cross_val_score,KFold
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer


# Read the data

In [None]:

df = pd.read_csv("../input/tabular-playground-series-sep-2021/train.csv")
df_t = pd.read_csv("../input/tabular-playground-series-sep-2021/test.csv")
sample_submission = pd.read_csv("../input/tabular-playground-series-sep-2021/sample_solution.csv")

print(df.shape,df_t.shape)



# Identify null data and fix the issues

### **Apply SimpleImputer:**
#### SimpleImputer is a scikit-learn class that fills in missing values; by default it replaces each missing value with the mean of its column.

[https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)
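
A minimal sketch of what mean imputation does, on made-up toy arrays (the values below are hypothetical, not competition data). It assumes the usual pattern of learning the column means on the training rows and reusing them on the test rows:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values, not from the competition)
train = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan]])
test = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy="mean")      # "mean" is the default strategy
train_filled = imputer.fit_transform(train)   # learn the column means from train
test_filled = imputer.transform(test)         # reuse the train means on test

print(imputer.statistics_)  # the per-column means learned from train
print(train_filled)
print(test_filled)
```

Other strategies such as `"median"` or `"most_frequent"` can be passed through the same `strategy` argument.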


In [None]:
# identify the null values
df.isnull().sum()

In [None]:
#fix the null values issues
#impute the feature columns only; "id" and the "claim" target are left untouched
feature_cols = [c for c in df_t.columns if c != "id"]
imputer = SimpleImputer()
#fit on the training data, then reuse the same column means on the test data
df[feature_cols] = imputer.fit_transform(df[feature_cols])
df_t[feature_cols] = imputer.transform(df_t[feature_cols])

# KFold - 5 Splits

In [None]:
#add a kfold column (-1 means "not assigned yet")
df['kfold'] = -1
#assign each row to exactly one validation fold
kfold = KFold(n_splits = 5,shuffle=True,random_state = 42)
for fold, (tr_i,va_i) in enumerate(kfold.split(X=df)):
    df.loc[va_i,'kfold'] = fold
    
print(df.kfold.value_counts())
df.to_csv("folds_5.csv",index=False)
print("folds saved successfully")
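
To see what the fold assignment does, here is a small sketch on 10 hypothetical rows (the row count and the `fold_of_row` name are made up for illustration). Each row lands in the validation part of exactly one fold:

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 10                          # hypothetical toy size
fold_of_row = np.full(n_rows, -1)    # -1 means "not assigned yet"

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(kf.split(np.arange(n_rows))):
    # valid_idx holds the rows used for validation in this fold
    fold_of_row[valid_idx] = fold

counts = np.bincount(fold_of_row)    # number of rows per fold
print(fold_of_row)
print(counts)
```

With 10 rows and 5 splits, every fold gets 2 validation rows and no row is left at -1.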

In [None]:
sns.heatmap(df.corr(),cmap='PiYG_r')
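
With this many columns the heatmap can be hard to read. One possible follow-up, sketched here on a made-up toy frame (the `toy` data and its two features are hypothetical), is to rank the features by their absolute correlation with the `claim` column:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): f1 drives the target, f2 is pure noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({"f1": rng.normal(size=200),
                    "f2": rng.normal(size=200)})
toy["claim"] = (toy["f1"] + 0.1 * rng.normal(size=200) > 0).astype(int)

# Absolute correlation of every feature with the target, strongest first
target_corr = (toy.corr()["claim"]
                  .drop("claim")
                  .abs()
                  .sort_values(ascending=False))
print(target_corr)
```

On the real data the same lines would work with `df` in place of `toy`, after dropping helper columns such as `id` and `kfold`.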

In [None]:
df.columns

# **Features**

In [None]:
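#reload the data with the fold assignments
df = pd.read_csv("./folds_5.csv")

#features used for training (everything except the id, the fold label and the target)
features = [f for f in df.columns if f not in("id","kfold","claim")]
#numeric feature columns (their names contain 'f')
num_cols = [cols for cols in features if 'f' in cols]

test= df_t[features]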


# Build the Model 💥

## Basic model

**XGBRegressor**

In [None]:
prediction = []
score = []

for fold in range (5):
    xtrain = df[df.kfold != fold].reset_index(drop=True)
    xvalid = df[df.kfold == fold].reset_index(drop=True)
    xtest = test.copy()
    
    ytrain = xtrain.claim
    yvalid = xvalid.claim
    
    xtrain = xtrain[features]
    xvalid = xvalid[features]
    
    lE = StandardScaler()
    xtrain[num_cols] = lE.fit_transform(xtrain[num_cols])
    xvalid[num_cols] = lE.transform(xvalid[num_cols])
    xtest[num_cols] = lE.transform(xtest[num_cols])
    
    #xgboost model
    xgb_m = XGBRegressor(learning_rate=0.01,
                         n_estimators=5000,
                        random_state=42,
                        gpu_id=0,
                        tree_method='gpu_hist',
                        predictor='gpu_predictor')
    xgb_m.fit(xtrain,ytrain,early_stopping_rounds=100,eval_set=[(xvalid,yvalid)],verbose=1000)
    predict_valid = xgb_m.predict(xvalid)
    test_predict = xgb_m.predict(xtest)
    prediction.append(test_predict)
    
    #Root_mean_square
    rms = mean_squared_error(yvalid,predict_valid,squared=False)
    
    #Score 
    score.append(rms)
    print(f"fold|split:{fold},rmse:{rms}")
    
print(np.mean(score),np.std(score))


#average the test predictions across the 5 folds
final_predict = np.mean(np.column_stack(prediction),axis=1)
print(final_predict)
sample_submission.claim = final_predict
sample_submission.to_csv("First_submission_xgb.csv",index=False)
print("XGBoost submission file saved")
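
The averaging step can be checked on a tiny example. This sketch uses three made-up prediction arrays (hypothetical values) in place of the per-fold test predictions:

```python
import numpy as np

# Hypothetical test predictions from 3 folds, for 3 test rows each
fold_predictions = [
    np.array([0.2, 0.8, 0.5]),
    np.array([0.4, 0.6, 0.5]),
    np.array([0.3, 0.7, 0.5]),
]

# column_stack puts one fold per column: shape (n_rows, n_folds)
stacked = np.column_stack(fold_predictions)

# the mean over axis=1 averages across folds, one value per test row
averaged = stacked.mean(axis=1)
print(stacked.shape)
print(averaged)
```

Each test row ends up with the mean of its per-fold predictions, which is what goes into the `claim` column of the submission.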

    

**End of the XGBRegressor section**

**Next, try AutoML**

# AutoML-H2O

**H2O is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform that allows you to build machine learning models on big data and provides easy productionalization of those models in an enterprise environment.**



## Import packages and start H2O

In [None]:
import h2o
from h2o.automl import H2OAutoML
#initialize the H2O cluster
h2o.init() 

In [None]:
df= pd.read_csv("../input/tabular-playground-series-sep-2021/train.csv")
df_test = pd.read_csv("../input/tabular-playground-series-sep-2021/test.csv")
sample = pd.read_csv("../input/tabular-playground-series-sep-2021/sample_solution.csv")

In [None]:
#fix the null values issues
#impute the feature columns only; "id" and the "claim" target are left untouched
feature_cols = [c for c in df_test.columns if c != "id"]
imputer = SimpleImputer()
#fit on the training data, then reuse the same column means on the test data
df[feature_cols] = imputer.fit_transform(df[feature_cols])
df_test[feature_cols] = imputer.transform(df_test[feature_cols])

**Convert to the H2O frame format**

In [None]:
#convert to H2O frame format
train = h2o.H2OFrame(df) 
test =  h2o.H2OFrame(df_test)

1. **Identify predictors and response**
2. **For binary classification, the response should be a factor**

In [None]:
x= train.columns
y= "claim"
x.remove(y)
#"id" is a row identifier, not a predictor
x.remove("id")
#convert the response to a factor for binary classification
train[y] = train[y].asfactor() 

# Train AutoML model

In [None]:
automl = H2OAutoML(nfolds=5,#folds
                  seed=42,#random seed
                  max_models=20,
                  include_algos = ["XGBoost","StackedEnsemble","GBM"],
                  max_runtime_secs = 3600*3,#time in sec's
                  stopping_metric = 'AUC'
                  )
automl.train(x=x,y=y,training_frame=train)

# Compare the trained models
***View the leaderboard***

In [None]:
l_b = automl.leaderboard
l_b

**Correlation between the models' predictions**

In [None]:
corr = automl.model_correlation_heatmap(train)

**Learning curve**

In [None]:
#pick the model in the third row of the leaderboard (row index 2)
model = h2o.get_model(l_b[2,"model_id"])
#Learning curve
model.learning_curve_plot()

In [None]:
model.varimp_plot()

# Test data

In [None]:
#predict on the test data with the best model
test_pred = automl.leader.predict(test)

# Submission 👀

In [None]:
sub = pd.DataFrame({
    'id':test['id'].as_data_frame().id,
    'claim':test_pred.as_data_frame().predict
})
sub.to_csv("Mysubmission_H2o.csv",index=False)
print("AutoML submission file saved")

In [None]:
sub

#### **Reference:**😍

**@Abhishek Thakur**

**Day 1**, version 2: simple baseline with KFold and XGBoost

**Day 2**, version 3: SimpleImputer applied to the null data (mean, median)

## **AutoML** 👀🔥💥🎉
**Reference:**

[https://docs.h2o.ai/h2o/latest-stable/h2o-docs/welcome.html](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/welcome.html)

[https://www.kaggle.com/mhslearner/starter-simple-eda-h2oautoml/comments](https://www.kaggle.com/mhslearner/starter-simple-eda-h2oautoml/comments)

[https://www.kaggle.com/sudalairajkumar/getting-started-with-h2o](https://www.kaggle.com/sudalairajkumar/getting-started-with-h2o)😍

## **Thank you**