### **Libraries**

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
import joblib


First, we load the DataFrame saved in the data preprocessing section.

In [2]:
# Restore the DataFrame stored by the preprocessing notebook
%store -r df

### **Declare Features and Target**

In this stage, we split the data into training and validation sets. First, we declare the features and the target.

In [3]:
X = df.drop('Price', axis=1)
y = df['Price']

### **Data Splitting (Train, Val)**

In [4]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 42)
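
As a quick sanity check on what `test_size = 0.2` and `random_state = 42` do, here is a minimal sketch on made-up data. The `toy` frame and its values are assumptions for illustration only; just the `Price` and `MinDayNights` column names come from this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for df (values are invented)
toy = pd.DataFrame({
    "MinDayNights": range(100),
    "Price": [50 + i for i in range(100)],
})

X_toy = toy.drop("Price", axis=1)
y_toy = toy["Price"]

# test_size=0.2 holds out 20% of the rows for validation
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_va2, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(len(X_tr), len(X_va))
print(X_va.index.equals(X_va2.index))
```

Features and target keep matching index labels after the split, which is worth preserving in later steps.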

### **Data Scaling**

In this step, we scale the numerical columns so that the same scaling can be applied to the training, validation, and test data: `MinDayNights`, `CountReview`, `AvgReview`, `TotalHostListings`, `DayAvailability`, `Year`.

* Normal or slightly skewed distribution → StandardScaler
* Highly skewed distribution → RobustScaler
* Neural networks → MinMaxScaler

In [None]:
columns_to_scale = ['MinDayNights', 'CountReview', 'AvgReview', 'TotalHostListings', 'DayAvailability', 'Year']

# Initialize scaler for selected columns
scaler = ColumnTransformer([
    ('scaler', StandardScaler(), columns_to_scale)
], remainder='passthrough')


['../../artifacts/Numerical_Scaler.pkl']

Now we can fit the scaler on the training data and apply it to both the training and validation data.

In [6]:
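# Fit the scaler on the training data only, then transform it
X_train_scaled = scaler.fit_transform(X_train)
# ColumnTransformer returns the scaled columns first, then the passthrough columns in their original order
all_columns = columns_to_scale + [col for col in X_train.columns if col not in columns_to_scale]
# Keep the original index so the rows stay aligned with y_train
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=all_columns, index=X_train.index)
X_train = X_train_scaled_df.copy()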

In [7]:
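# Transform the validation data with the scaler fitted on the training data (no refitting)
X_val_scaled = scaler.transform(X_val)
all_columns = columns_to_scale + [col for col in X_val.columns if col not in columns_to_scale]
# Keep the original index so the rows stay aligned with y_val
X_val_scaled_df = pd.DataFrame(X_val_scaled, columns=all_columns, index=X_val.index)
X_val = X_val_scaled_df.copy()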
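
The `Numerical_Scaler.pkl` output shown earlier looks like the return value of `joblib.dump`, which suggests the scaler is persisted for later reuse. A minimal sketch of that round trip follows, assuming the transformer is saved after it has been fitted. The toy frames, the `OtherFeature` column, and the temporary directory are assumptions; the notebook's own path is `../../artifacts/Numerical_Scaler.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; only the 'Year' column name comes from the notebook
toy_train = pd.DataFrame({"Year": [2018.0, 2019.0, 2020.0], "OtherFeature": [1, 0, 1]})
toy_val = pd.DataFrame({"Year": [2019.0, 2021.0], "OtherFeature": [0, 1]})

ct = ColumnTransformer([("scaler", StandardScaler(), ["Year"])], remainder="passthrough")
ct.fit(toy_train)  # fit before saving so the learned mean and scale are stored

# A temporary directory is used here so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Numerical_Scaler.pkl")
    joblib.dump(ct, path)
    restored = joblib.load(path)

# The restored transformer should reproduce the original transformation
same = np.allclose(restored.transform(toy_val), ct.transform(toy_val))
print(same)
```

Saving the fitted object means later data, such as a test set, can be transformed with the statistics learned from the training data.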

### **CHECKPOINT**

In [12]:
%store X_train
%store y_train

%store X_val
%store y_val

Stored 'X_train' (DataFrame)
Stored 'y_train' (Series)
Stored 'X_val' (DataFrame)
Stored 'y_val' (Series)
