<img src="https://storage.googleapis.com/kaggle-competitions/kaggle/6565/media/daimler-mercedes%20V02.jpg"/>

# Competition Summary
Daimler’s engineers have developed a robust testing system to ensure the safety and reliability of each and every unique car configuration. But optimizing the speed of their testing system for so many possible feature combinations is complex and time-consuming.

# Load Competition Dataset

The competition dataset is located in "/kaggle/input", the path Kaggle defines for accessing the competition files. We walk this directory and pick out two input files: the training file and the test file.

In [14]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        path=os.path.join(dirname, filename)
        if 'train' in path:
            __training_path=path
        elif 'test' in path:
            __test_path=path
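
The loop above leaves `__training_path` or `__test_path` undefined if a file is missing. As a minimal sketch (the helper name `find_competition_files` and the demo file names are hypothetical, not part of the original notebook), the same search can be wrapped in a function that fails loudly:

```python
import os
import tempfile

def find_competition_files(root):
    """Return (train_path, test_path) found under root, or raise if either is missing."""
    train_path, test_path = None, None
    for dirname, _, filenames in os.walk(root):
        for filename in filenames:
            # Match on the file name only, so a directory called "train"
            # cannot cause a false match.
            if 'train' in filename:
                train_path = os.path.join(dirname, filename)
            elif 'test' in filename:
                test_path = os.path.join(dirname, filename)
    if train_path is None or test_path is None:
        raise FileNotFoundError(f'train/test files not found under {root!r}')
    return train_path, test_path

# Demo on a throwaway directory with hypothetical file names
with tempfile.TemporaryDirectory() as demo_root:
    for name in ('train.csv', 'test.csv'):
        open(os.path.join(demo_root, name), 'w').close()
    demo_train, demo_test = find_competition_files(demo_root)
    print(os.path.basename(demo_train), os.path.basename(demo_test))
```

On Kaggle the function would be called with `'/kaggle/input'` as `root`.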

## Check training and test path

In [3]:
#loaded files
print(f'Training path:{__training_path}\nTest path:{__test_path}')

In [1]:
# Kaggle Environment Preparation
# update kaggle env
import sys
# you may need to update the environment so that the whole notebook runs
!{sys.executable} -m pip install --upgrade scikit-learn=="0.24.2"

In [2]:
#record this information if you need to run the Kernel internally
import sklearn; sklearn.show_versions() 

# Exploratory Data Analysis (EDA)
## General Structure
The Mercedes-Benz Greener Manufacturing training set includes <b>4209</b> rows and <b>378</b> columns.
There are <b>3</b> different data types: int64, float64 and object.
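
Figures like these can be read straight off a DataFrame. The sketch below uses a small toy frame with illustrative values (not the competition data) to show the two calls involved:

```python
import pandas as pd

# Toy frame standing in for the training data (values are illustrative only)
toy = pd.DataFrame({
    'ID': [1, 2, 3],
    'y': [100.0, 90.5, 110.2],
    'X0': ['k', 'k', 'az'],
})

n_rows, n_cols = toy.shape                           # (rows, columns)
dtype_counts = toy.dtypes.astype(str).value_counts() # columns per data type

print(f'{n_rows} rows, {n_cols} columns')
print(dtype_counts.to_dict())
```

Running the same two calls on `__train_dataset` after loading gives the structure of the real training set.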

# Finding Interesting Datapoints
Let's examine the frequency histogram of each field's values and check whether there are any interesting data points.

There are 4 interesting values, one in each of the following columns:

The table below shows each flagged <b>Value</b> of each <b>Field</b> with its total <b>Frequency</b>. <b>Lower</b> and <b>Upper</b> are the lower and upper frequency bounds expected under a normal distribution, and <b>Criteria</b> shows whether the frequency crossed the <b>Upper bound</b> or the <b>Lower bound</b>.
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Field</th>
      <th>Value</th>
      <th>Frequency</th>
      <th>Lower</th>
      <th>Upper</th>
      <th>Criteria</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>X1</td>
      <td>aa</td>
      <td>833</td>
      <td>3.0000</td>
      <td>832.3890</td>
      <td>Upper</td>
    </tr>
    <tr>
      <th>1</th>
      <td>X2</td>
      <td>as</td>
      <td>1659</td>
      <td>1.0000</td>
      <td>1653.9991</td>
      <td>Upper</td>
    </tr>
    <tr>
      <th>2</th>
      <td>X3</td>
      <td>c</td>
      <td>1942</td>
      <td>57.0636</td>
      <td>1941.4804</td>
      <td>Upper</td>
    </tr>
    <tr>
      <th>3</th>
      <td>X4</td>
      <td>d</td>
      <td>4205</td>
      <td>1.0000</td>
      <td>4203.7391</td>
      <td>Upper</td>
    </tr>
  </tbody>
</table>
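
The notebook does not show how the bounds in this table were computed. As a rough sketch only, one plausible rule is to flag values whose frequency falls outside the mean plus or minus `k` standard deviations of all value frequencies in the column. The function name `flag_unusual_frequencies`, the choice `k=2.0`, and the toy column are assumptions made for illustration, so the numbers will not reproduce the table above.

```python
import pandas as pd

def flag_unusual_frequencies(column, k=2.0):
    """Flag values whose frequency lies outside mean +/- k * std of the value frequencies."""
    counts = column.value_counts()
    mean, std = counts.mean(), counts.std(ddof=0)
    lower = max(mean - k * std, 0.0)   # a frequency cannot be negative
    upper = mean + k * std
    rows = []
    for value, freq in counts.items():
        if freq > upper:
            rows.append((column.name, value, int(freq), lower, upper, 'Upper'))
        elif freq < lower:
            rows.append((column.name, value, int(freq), lower, upper, 'Lower'))
    return pd.DataFrame(
        rows, columns=['Field', 'Value', 'Frequency', 'Lower', 'Upper', 'Criteria'])

# Toy column: 'a' appears 100 times, ten other values appear once each
toy_column = pd.Series(['a'] * 100 + list('bcdefghijk'), name='X_demo')
flagged = flag_unusual_frequencies(toy_column)
print(flagged)
```

In this toy column only the dominant value `a` is flagged, with `Criteria` set to `Upper`.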

# Input Dataset

In [None]:
def __load__data(__training_path, __test_path, concat=False):
    """Load the training and test datasets.
    params: __training_path: path of the training dataset
    params: __test_path: path of the test dataset
    params: concat: intended to concatenate the training and test datasets; currently unused
    returns: the training and test datasets as two separate DataFrames
    """
    # LOAD DATA
    import pandas as pd
    __train_dataset = pd.read_csv(__training_path, delimiter=',')
    __test_dataset = pd.read_csv(__test_path, delimiter=',')
    return __train_dataset, __test_dataset
# concat is not passed because the function ignores it
__train_dataset, __test_dataset = __load__data(__training_path, __test_path)
__train_dataset.head()

In [None]:
# STORE SUBMISSION RELEVANT COLUMNS
__test_dataset_submission_columns = __test_dataset['ID']

### DISCARD IRRELEVANT COLUMNS
In the given input dataset there is <b>1</b> column that can be removed: *ID*.

In [None]:
# DISCARD IRRELEVANT COLUMNS
__train_dataset.drop(['ID'], axis=1, inplace=True)
__test_dataset.drop(['ID'], axis=1, inplace=True)

## Encoding Ordinal Categorical Features
Let's transform the categorical features into an integer array.
We will use the Ordinal Encoder as explained [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html).

In the given input dataset there are <b>8</b> columns that can be converted to integers: *X0, X1, X2, X3, X4, X5, X6, X8*.
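
To make the encoding concrete, here is a small sketch on toy data (the column name `X_demo` and its values are made up). It shows how `OrdinalEncoder` assigns integer codes and how `unknown_value=-1` handles a category that appears only in the test data:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy_train = pd.DataFrame({'X_demo': ['a', 'b', 'c', 'a']})
toy_test = pd.DataFrame({'X_demo': ['b', 'z']})  # 'z' never appears in training

encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
train_codes = encoder.fit_transform(toy_train[['X_demo']])  # learn categories, then encode
test_codes = encoder.transform(toy_test[['X_demo']])        # reuse the learned categories

print(train_codes.ravel().tolist())  # categories are sorted, so a=0, b=1, c=2
print(test_codes.ravel().tolist())   # the unseen 'z' becomes -1
```

Note that the integer codes impose an order on the categories that the original letters may not have; tree-based models usually tolerate this, but it is worth keeping in mind.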

In [None]:
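# PREPROCESSING-1
from sklearn.preprocessing import OrdinalEncoder
_CATEGORICAL_COLS = ['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8']
# Note: despite its name, _ohe is an ordinal encoder, not a one-hot encoder.
# Categories seen only in the test set are encoded as -1.
_ohe = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
# Fit on the training data only, then reuse the same mapping for the test data.
# Passing index= keeps the encoded rows aligned with the original rows.
__train_dataset[_CATEGORICAL_COLS] = pd.DataFrame(_ohe.fit_transform(__train_dataset[_CATEGORICAL_COLS]), columns=_CATEGORICAL_COLS, index=__train_dataset.index)
__test_dataset[_CATEGORICAL_COLS] = pd.DataFrame(_ohe.transform(__test_dataset[_CATEGORICAL_COLS]), columns=_CATEGORICAL_COLS, index=__test_dataset.index)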

### Target Column
The target column is the value we need to predict.
Therefore, we need to detach the target column from the features before training.
Note that if we don't drop this field, the model would learn from the answer itself: it would score very well on the training data but could not be applied to the test data, which has no target values.
Target column: "y"
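
A quick sanity check after detaching the target is that the training features and the test features have the same columns in the same order. The toy frames below are illustrative stand-ins, not the competition data:

```python
import pandas as pd

toy_train = pd.DataFrame({'X0': [1, 2, 3], 'X1': [0, 1, 0], 'y': [90.0, 101.5, 88.2]})
toy_test = pd.DataFrame({'X0': [2, 1], 'X1': [1, 1]})  # the test data has no 'y'

features_train = toy_train.drop(columns=['y'])  # everything except the target
target_train = toy_train['y']                   # the target on its own

# The model can only score the test set if both frames share
# the same columns in the same order.
columns_match = list(features_train.columns) == list(toy_test.columns)
print(columns_match)
```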

In [None]:
# DETACH TARGET
__feature_train = __train_dataset.drop(['y'], axis=1)
__target_train =__train_dataset['y']
__feature_test = __test_dataset

# Training Model and Prediction
First, we will train a model based on preprocessed values of training data set.
Second, let's predict test values based on the trained model.

## CatBoostRegressor
We will use CatBoostRegressor from CatBoost, a fast, scalable, high-performance library for gradient boosting on decision trees. The library supports ranking, classification, regression and other ML tasks.
Installation instructions for CatBoost can be found [here](https://catboost.ai/docs/installation/python-installation-method-pip-install).
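
The notebook trains on all of the training data and does not estimate model quality before submitting. As a hedged sketch, a simple hold-out split can give a rough score first. The helper `holdout_r2` is hypothetical, R² is used as the score by assumption, and `LinearRegression` on synthetic data stands in for the real model so the example runs without CatBoost installed:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def holdout_r2(model, features, target, test_size=0.2, seed=0):
    """Fit on one part of the data and return R^2 on the held-out part."""
    X_fit, X_val, y_fit, y_val = train_test_split(
        features, target, test_size=test_size, random_state=seed)
    model.fit(X_fit, y_fit)
    return r2_score(y_val, model.predict(X_val))

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(200, 3))
toy_target = toy_features @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

score = holdout_r2(LinearRegression(), toy_features, toy_target)
print(round(score, 3))
```

Any estimator with `fit` and `predict` can be passed in the same way, for example `holdout_r2(CatBoostRegressor(verbose=0), __feature_train, __target_train)`.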

In [None]:
# MODEL
import numpy as np
from catboost import CatBoostRegressor
__model = CatBoostRegressor()
__model.fit(__feature_train, __target_train)
__y_pred = __model.predict(__feature_test)

# Submission File
The submission file pairs each test "ID" with its predicted target "y". It is saved as "kaggle_submission.csv" and submitted as our prediction results.

In [None]:
# SUBMISSION
submission = pd.DataFrame(columns=['ID'], data=__test_dataset_submission_columns)
submission['y'] = __y_pred
submission.head()

In [None]:
# save submission file
submission.to_csv("kaggle_submission.csv", index=False)
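
Before uploading, it can help to check the shape of the submission. The sketch below uses a hypothetical helper, `check_submission`, and assumes the expected layout is exactly the two columns `ID` and `y` with one row per test ID:

```python
import pandas as pd

def check_submission(submission, expected_ids):
    """Return a list of problems found in the submission; an empty list means none."""
    problems = []
    if list(submission.columns) != ['ID', 'y']:
        problems.append('columns must be exactly ID, y')
    elif submission['y'].isna().any():
        problems.append('y contains missing predictions')
    if len(submission) != len(expected_ids):
        problems.append('row count differs from the test set')
    return problems

# Toy submission with illustrative values
toy_submission = pd.DataFrame({'ID': [1, 2, 3], 'y': [90.1, 100.2, 95.0]})
problems = check_submission(toy_submission, expected_ids=[1, 2, 3])
print(problems)
```

In the notebook this would be called as `check_submission(submission, __test_dataset_submission_columns)`.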