### Load the Dataset

First, we load the raw banknote dataset with pandas. The file path normally comes from our global constants (it is defined inline here), and we read the Feather file rather than a CSV because Feather is a binary columnar format that loads faster and preserves column dtypes.

In [1]:
import pandas as pd
# from src.utils.constants import RAW_FEATHER_DATA_PATH
RAW_FEATHER_DATA_PATH = '../../data/banknote_net.feather'

df = pd.read_feather(RAW_FEATHER_DATA_PATH)
df.head()

| Unnamed: 0 | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | ... | v_248 | v_249 | v_250 | v_251 | v_252 | v_253 | v_254 | v_255 | Currency | Denomination |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.423395 | 0.327657 | 2.568988 | 3.166228 | 4.801421 | 5.531792 | 2.458083 | 1.218453 | 0.0 | 1.116785 | ... | 0.0 | 2.273451 | 5.790633 | 0.0 | 0.0 | 0.0 | 5.6354 | 0.0 | AUD | 100_1 |
| 1 | 1.158823 | 1.669602 | 3.638447 | 2.823524 | 4.83989 | 2.777757 | 0.75335 | 0.764005 | 0.347871 | 1.928572 | ... | 0.0 | 2.329623 | 3.516146 | 0.0 | 0.0 | 0.0 | 2.548191 | 1.05341 | AUD | 100_1 |
| 2 | 0.0 | 0.958235 | 4.706119 | 1.688242 | 3.312702 | 4.516483 | 0.0 | 1.876461 | 2.250795 | 1.883192 | ... | 0.811282 | 5.591417 | 1.879267 | 0.641139 | 0.571079 | 0.0 | 1.861483 | 2.172145 | AUD | 100_1 |
| 3 | 0.920511 | 1.820294 | 3.939334 | 3.206829 | 6.253655 | 0.942557 | 2.952453 | 0.0 | 2.064298 | 1.367196 | ... | 1.764936 | 3.415151 | 2.518404 | 0.582229 | 1.105192 | 0.0 | 1.566918 | 0.533945 | AUD | 100_1 |
| 4 | 0.331918 | 0.0 | 3.330771 | 3.023437 | 4.369099 | 5.177336 | 1.499362 | 0.590646 | 0.553625 | 1.405708 | ... | 0.0 | 4.615945 | 4.825463 | 0.302261 | 0.378229 | 0.0 | 2.710654 | 0.325945 | AUD | 100_1 |

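Before cleaning, it can help to confirm which columns are features and which are not. The sketch below is only an illustration: `toy_df` is a made-up stand-in with the same column layout as the preview (it is not the real file), and `feature_cols` and `other_cols` are hypothetical helper names.

```python
import pandas as pd

# Tiny stand-in for the loaded frame (made-up values, same column layout).
toy_df = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],
    'v_0': [0.42, 1.16, 0.0],
    'v_1': [0.33, 1.67, 0.96],
    'Currency': ['AUD', 'AUD', 'EUR'],
    'Denomination': ['100_1', '100_1', '5_2'],
})

# Feature columns follow the v_<index> naming seen in the preview.
feature_cols = [c for c in toy_df.columns if c.startswith('v_')]
# Everything else is a label column or a leftover index-like column.
other_cols = [c for c in toy_df.columns if c not in feature_cols]

print(len(feature_cols), other_cols)
print(toy_df['Currency'].value_counts().to_dict())
```

Running the same two list comprehensions on the real `df` would show whether any non-feature column besides `Currency` and `Denomination` is present.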

### Clean Data

Next, we drop rows that contain missing values and remove duplicate rows, so that incomplete or repeated samples do not affect training.

In [2]:
df = df.dropna()
df = df.drop_duplicates()

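To see what these two calls do, here is a minimal sketch on made-up rows (the values and the name `toy_df` are assumptions, not the real data). It also illustrates one caveat: when a unique index-like column such as `Unnamed: 0` is present, no two rows compare equal, so `drop_duplicates()` only finds repeats if that column is excluded via `subset`.

```python
import numpy as np
import pandas as pd

# Made-up rows: row 1 repeats row 0's features, row 2 has a missing value.
toy_df = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2, 3],
    'v_0': [0.5, 0.5, np.nan, 1.2],
    'v_1': [1.0, 1.0, 0.3, 0.7],
    'Currency': ['AUD', 'AUD', 'EUR', 'EUR'],
})

no_missing = toy_df.dropna()  # drops the row containing NaN

# With the unique index-like column included, every row is distinct.
dedup_all = no_missing.drop_duplicates()

# Comparing only the informative columns catches the repeated sample.
compare_cols = [c for c in no_missing.columns if c != 'Unnamed: 0']
dedup_subset = no_missing.drop_duplicates(subset=compare_cols)

print(len(toy_df), len(no_missing), len(dedup_all), len(dedup_subset))
# prints: 4 3 3 2
```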
### Encode Labels

Next, we encode the label column, `Currency`, as integer class indices for modeling.

In [6]:
from sklearn.preprocessing import LabelEncoder

y = df['Currency']
currency_encoder = LabelEncoder()
y = currency_encoder.fit_transform(y)

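As a small illustration of the mapping `LabelEncoder` produces, the sketch below uses a made-up list of currency codes (the list and the names `labels`, `encoder`, and `codes` are assumptions for this example). The encoder stores the classes in sorted order, and `inverse_transform` recovers the original strings from predicted indices.

```python
from sklearn.preprocessing import LabelEncoder

labels = ['USD', 'AUD', 'EUR', 'AUD']  # made-up sample labels

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# Classes are stored in sorted order; each label maps to its position.
print(list(encoder.classes_))   # ['AUD', 'EUR', 'USD']
print(codes.tolist())           # [2, 0, 1, 0]

# Map integer predictions back to currency codes.
print(encoder.inverse_transform([0, 2]).tolist())  # ['AUD', 'USD']
```

Because the mapping lives in the fitted encoder, the same object (here it would be `currency_encoder`) is needed later to turn model outputs back into currency names.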
### Feature Scaling

Next, we standardize each feature column to zero mean and unit variance using `StandardScaler`.

In [None]:
from sklearn.preprocessing import StandardScaler

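# Drop the label columns, plus `Unnamed: 0`, which appears to be a leftover
# row index in the preview above and should not be used as a feature
X = df.drop(columns=['Unnamed: 0', 'Currency', 'Denomination'])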
scaler = StandardScaler()
X = scaler.fit_transform(X)

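The effect of standardization can be checked on a small made-up array (the `features` values below are assumptions for illustration). The sketch verifies the zero-mean, unit-variance property and shows that the fitted scaler reuses its stored statistics on new rows, which is why the scaler itself has to be kept for later use.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features: the second column is constant (all zeros).
features = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])

demo_scaler = StandardScaler()
scaled = demo_scaler.fit_transform(features)

print(demo_scaler.mean_)     # per-column means learned during fit
print(scaled.mean(axis=0))   # approximately 0 for every column
print(scaled[:, 0].std())    # approximately 1 for the non-constant column

# A constant column has zero variance; it is centered but left unscaled,
# so it stays at 0 instead of producing a division by zero.
print(scaled[:, 1])

# New data is transformed with the statistics learned during fit.
new_row = demo_scaler.transform(np.array([[2.5, 0.0]]))
print(new_row)
```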
### Split Data

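Now, we split the data into training, validation, and test sets. We first hold out 20% of the data for testing, then take 10% of the remaining 80% for validation, which gives roughly 72%, 8%, and 20% respectively. Both splits are stratified so that each set keeps the same class proportions.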

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=420, stratify=y
)

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.1, random_state=420, stratify=y_train
)

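The resulting proportions follow from applying the second `test_size` to what is left after the first split. A minimal sketch on made-up data (1000 samples in four balanced classes; `three_way_sizes` is a hypothetical helper) shows the sizes for the value used here, and for `0.125`, which would give an exact 70/10/20 split if that is preferred.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 1000
X_demo = np.arange(n_samples).reshape(-1, 1)        # made-up features
y_demo = np.repeat([0, 1, 2, 3], n_samples // 4)    # four balanced classes


def three_way_sizes(val_fraction_of_remainder):
    """Return (train, val, test) sizes for a 20% test hold-out followed by
    a validation split taken from the remaining 80%."""
    X_rest, X_te, y_rest, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=420, stratify=y_demo
    )
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_rest, y_rest, test_size=val_fraction_of_remainder,
        random_state=420, stratify=y_rest
    )
    return len(X_tr), len(X_va), len(X_te)


print(three_way_sizes(0.1))    # (720, 80, 200)  -> 72% / 8% / 20%
print(three_way_sizes(0.125))  # (700, 100, 200) -> 70% / 10% / 20%
```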
### Save Processed Data and Scaler

Finally, we save the processed splits as `.npy` files and the fitted scaler as a `.pkl` file for use in model training and evaluation.

In [9]:
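import os

import numpy as np
import joblib

SAVED_FILES_DIR = '../../saved/'
SAVED_MODELS_DIR = SAVED_FILES_DIR + 'models/'
SAVED_DATASETS_DIR = SAVED_FILES_DIR + 'processed/'

# Create the output directories if needed; np.save and joblib.dump do not create them
os.makedirs(SAVED_MODELS_DIR, exist_ok=True)
os.makedirs(SAVED_DATASETS_DIR, exist_ok=True)

joblib.dump(scaler, SAVED_MODELS_DIR + 'scaler.pkl')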

np.save(SAVED_DATASETS_DIR + 'X_train.npy', X_train)
np.save(SAVED_DATASETS_DIR + 'X_val.npy', X_val)
np.save(SAVED_DATASETS_DIR + 'X_test.npy', X_test)
np.save(SAVED_DATASETS_DIR + 'y_train.npy', y_train)
np.save(SAVED_DATASETS_DIR + 'y_val.npy', y_val)
np.save(SAVED_DATASETS_DIR + 'y_test.npy', y_test)
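
For reference, loading these artifacts back could look like the sketch below. It is a self-contained round trip on made-up data written to a temporary directory, so the paths and the names `X_small`, `demo_scaler`, `X_loaded`, and `scaler_loaded` are assumptions for illustration rather than the project's actual loading code.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X_small = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # made-up data
demo_scaler = StandardScaler().fit(X_small)

with tempfile.TemporaryDirectory() as tmp_dir:
    array_path = os.path.join(tmp_dir, 'X_train.npy')
    scaler_path = os.path.join(tmp_dir, 'scaler.pkl')

    # Save in the same way as above.
    np.save(array_path, X_small)
    joblib.dump(demo_scaler, scaler_path)

    # Load both artifacts back.
    X_loaded = np.load(array_path)
    scaler_loaded = joblib.load(scaler_path)

# The array and the scaler's learned statistics survive the round trip.
print(np.array_equal(X_small, X_loaded))
print(np.allclose(demo_scaler.mean_, scaler_loaded.mean_))
```

Note that a `.pkl` file should only be loaded from a trusted source, since unpickling can execute arbitrary code.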