# **Tabular Playground Series Aug 2021** ✨

In this notebook we are going to train a regression model to predict the `loss` variable. We will also explore the data set visually.

If you find this helpful, please do **upvote**. It motivates me to publish new notebooks.

P.S.: This notebook takes some time to train. I will optimize it in future versions.

Happy learning :)

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:white;
          text-align:center;">
Loading libraries 📚📚
    
             
</p>
</div>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV

pd.set_option("display.max_columns", 102)

In [None]:
train = pd.read_csv('../input/tabular-playground-series-aug-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-aug-2021/test.csv')
sample_submission = pd.read_csv('../input/tabular-playground-series-aug-2021/sample_submission.csv')

In [None]:
print(f'Train data shape: {train.shape}')
print(f'Test data shape: {test.shape}')

In [None]:
train.head()

In [None]:
train.columns

In [None]:
train.info()

In [None]:
train.describe()

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px"> 

<p style="padding: 10px;
              color:white;
          text-align:center;">
EDA 📊📈
             
</p>
</div>

### Let's look at the loss variable

In [None]:
f = plt.figure(figsize=(15,8))

ax = f.add_subplot()
sns.countplot(x=train['loss'], ax=ax)  # pass the column as x and draw on the axes created above
ax.set_title('Count plot of loss')

**NOTE**:
- 📌 `loss` is the target variable we need to model; its distribution is shown above
- 📌 `loss` has values ranging from 0 to 42
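
The range quoted above can be checked directly rather than read off the plot. Here is a minimal sketch, assuming a small made-up frame `demo` in place of `train` (the values are illustrative, not competition data):

```python
import pandas as pd

# Made-up stand-in for the competition's train frame; only the `loss` column matters here
demo = pd.DataFrame({'loss': [0, 0, 1, 3, 3, 3, 7, 42]})

# Smallest and largest target values
loss_min, loss_max = demo['loss'].min(), demo['loss'].max()
print(f'loss range: {loss_min} to {loss_max}')

# Frequency of each distinct value, the same information a count plot draws
loss_counts = demo['loss'].value_counts().sort_index()
print(loss_counts)
```

Running the same calls on `train['loss']` should give the actual range and frequencies.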

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:white;
          text-align:center;">
Model 🤖
             
</p>
</div>

In [None]:
X = np.array(train.drop(['id', 'loss'], axis = 1).copy())
y = train['loss'].values  # regression target

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
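rf_model = RandomForestRegressor(n_estimators=200, max_depth=8, random_state=42)

# Hyperparameter grid to search over
params = {
    'n_estimators': [200, 300],
    # 1.0 uses all features; it replaces 'auto', which newer scikit-learn versions no longer accept
    'max_features': [1.0, 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8]
}

grid = GridSearchCV(rf_model, param_grid=params, cv=3, n_jobs=-1, verbose=500)
# The search must be fitted before best_params_ and best_score_ exist
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_score_)  # mean cross-validated R^2, the default scorer for regressors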
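
`GridSearchCV` ranks regressors by R² unless told otherwise, while the validation cell in this notebook reports RMSE. Below is a minimal sketch of searching on RMSE directly, assuming synthetic stand-in data and a deliberately tiny grid (`X_demo`, `y_demo`, `demo_params`, and `demo_grid` are illustrative names, not part of the notebook):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (hypothetical): 200 rows, 5 features, noisy linear target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo[:, 0] * 3.0 + rng.normal(scale=0.5, size=200)

# Deliberately tiny grid so the sketch runs in seconds
demo_params = {'n_estimators': [20, 40], 'max_depth': [2, 4]}

demo_grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid=demo_params,
    cv=3,
    scoring='neg_root_mean_squared_error',  # scorers are maximized, so RMSE is negated
)
demo_grid.fit(X_demo, y_demo)

print(demo_grid.best_params_)
print(f'CV RMSE: {-demo_grid.best_score_:.3f}')
```

Flipping the sign of `best_score_` recovers an ordinary RMSE that can be compared with the validation score.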

In [None]:
rf_model.fit(X_train, y_train)

In [None]:
print(f'Val RMSE: {np.sqrt(mean_squared_error(y_val, rf_model.predict(X_val)))}')
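
The validation score above is a root mean squared error. As a small worked example of what that number measures, here it is computed by hand on made-up values (`y_true` and `y_pred` are illustrative, not model output) and compared with the scikit-learn route:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions, purely for illustration
y_true = np.array([0.0, 2.0, 5.0, 10.0])
y_pred = np.array([1.0, 2.0, 3.0, 14.0])

# RMSE by hand: square the errors, average them, take the square root
errors = y_pred - y_true
rmse_manual = np.sqrt(np.mean(errors ** 2))

# Same quantity through scikit-learn
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, the single miss of 4 contributes 16 of the 21 total squared error, so large misses dominate the score.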

In [None]:
X_test = np.array(test.drop('id', axis=1))  # plain array, matching how X was built for training

In [None]:
sample_submission['loss'] = rf_model.predict(X_test).astype(int)
sample_submission.to_csv('submission.csv', index=False)

In [None]:
sample_submission.head()