# Validation Layer

## **Description:**

- Create a validation schema and build a simple model that runs on it:
    - Create a validation class
    - Create a simple model
    - Create a simple feature extraction step
- No extensive modeling or feature extraction is required; we just need to ensure that our modeling step is trustworthy:
    - We don't have any target leakage
    - Our validation results are very close to the production behaviour
    - We have enough validation data to rank our model results
- Prepare a simple picture that describes your schema:
    - Declare Train/Validation/Test splits
    - Define the validation approach
    - Define how you extract features without target or data leakage
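
One way to picture the Train/Validation/Test requirement is a time-ordered, expanding-window split. The sketch below is only an illustration under stated assumptions: it assumes `date_block_num` is a monthly time index, the helper name `expanding_window_splits` and the toy data are hypothetical, and it is not the validation class from `scripts/validate.py`.

```python
import pandas as pd


def expanding_window_splits(df, time_col="date_block_num", n_splits=3):
    """Yield (train_index, val_index) pairs ordered in time.

    Each fold validates on one time block and trains only on earlier blocks,
    so no rows from the validation period end up in the training set.
    """
    blocks = sorted(df[time_col].unique())
    if n_splits >= len(blocks):
        raise ValueError("n_splits must be smaller than the number of time blocks")
    # Each of the last `n_splits` blocks serves once as the validation period.
    for val_block in blocks[-n_splits:]:
        train_index = df.index[df[time_col] < val_block]
        val_index = df.index[df[time_col] == val_block]
        yield train_index, val_index


# Toy frame standing in for the real training data (values are made up).
toy = pd.DataFrame({
    "date_block_num": [0, 0, 1, 1, 2, 2, 3, 3],
    "item_cnt_month": [1, 0, 2, 1, 0, 3, 1, 2],
})

folds = list(expanding_window_splits(toy, n_splits=2))
for train_index, val_index in folds:
    print(len(train_index), "train rows ->", len(val_index), "validation rows")
```

The same rule applies to feature extraction: a lag feature for a given month should be computed only from earlier months, otherwise the validation score will look better than production behaviour.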

In [1]:
import numpy as np
import pandas as pd

import sys
sys.path.append('../')
import scripts.validate as validate # validate.py module

## Load and transform **train, test** datasets

In [2]:
train_df = pd.read_csv('../data/result_train.csv')
test_df = pd.read_csv('../data/result_test.csv')

In [6]:
# Group columns by kind so each group can be cast to a compact dtype
float_columns = train_df.select_dtypes(include=np.number).columns.tolist()
object_columns = train_df.select_dtypes(include=object).columns.tolist()

train_df = validate.transform_df_types(
    train_df, int_columns=[], float_columns=float_columns, object_columns=object_columns
)
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1608724 entries, 0 to 1608723
Data columns (total 31 columns):
 #   Column                  Non-Null Count    Dtype   
---  ------                  --------------    -----   
 0   date_block_num          1608724 non-null  float32 
 1   shop_id                 1608724 non-null  float32 
 2   item_id                 1608724 non-null  float32 
 3   item_cnt_month          1608724 non-null  float32 
 4   item_price              1608724 non-null  float32 
 5   month                   1608724 non-null  float32 
 6   year                    1608724 non-null  float32 
 7   item_name               1608724 non-null  category
 8   item_category_id        1608724 non-null  float32 
 9   item_category_name      1608724 non-null  category
 10  shop_name               1608724 non-null  category
 11  months_since_last_sale  1608724 non-null  float32 
 12  revenue                 1608724 non-null  float32 
 13  revenue_lag_1           1608724 non-null  
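
The source of `validate.transform_df_types` is not shown in this notebook. Judging only from the `info()` output above (numeric columns become `float32`, text columns become `category`), it might behave roughly like the sketch below. The function name `transform_df_types_sketch`, the `int32` choice for integer columns, and the toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def transform_df_types_sketch(df, int_columns, float_columns, object_columns):
    """Hypothetical stand-in for validate.transform_df_types (real code not shown)."""
    df = df.copy()  # leave the caller's frame untouched
    for col in int_columns:
        df[col] = df[col].astype(np.int32)  # assumed integer width
    for col in float_columns:
        df[col] = df[col].astype(np.float32)
    for col in object_columns:
        df[col] = df[col].astype("category")
    return df


# Toy frame with made-up values; column names mirror the real data.
toy = pd.DataFrame({
    "shop_id": [2, 5, 5],
    "item_price": [199.0, 349.5, 99.0],
    "shop_name": ["A", "B", "B"],
})

converted = transform_df_types_sketch(
    toy,
    int_columns=[],
    float_columns=["shop_id", "item_price"],
    object_columns=["shop_name"],
)
print(converted.dtypes)
```

Note that `float32` represents integers exactly only up to 2**24 (16,777,216), so storing identifier columns such as `shop_id` or `item_id` as `float32` is safe only while their values stay below that bound.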

In [7]:
# Apply the same dtype conversion to the test set so both frames stay consistent
float_columns = test_df.select_dtypes(include=np.number).columns.tolist()
object_columns = test_df.select_dtypes(include=object).columns.tolist()

test_df = validate.transform_df_types(
    test_df, int_columns=[], float_columns=float_columns, object_columns=object_columns
)
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214284 entries, 0 to 214283
Data columns (total 27 columns):
 #   Column                  Non-Null Count   Dtype   
---  ------                  --------------   -----   
 0   ID                      214284 non-null  float32 
 1   shop_id                 214284 non-null  float32 
 2   item_id                 214284 non-null  float32 
 3   item_name               214284 non-null  category
 4   item_category_id        214284 non-null  float32 
 5   item_category_name      214284 non-null  category
 6   shop_name               214284 non-null  category
 7   months_since_last_sale  214284 non-null  float32 
 8   revenue_lag_1           214284 non-null  float32 
 9   revenue_lag_2           214284 non-null  float32 
 10  revenue_lag_3           214284 non-null  float32 
 11  revenue_lag_6           214284 non-null  float32 
 12  revenue_lag_12          214284 non-null  float32 
 13  item_cnt_month_lag_1    214284 non-null  float32 
 14  item