# Feature Engineering Notebook

In this notebook, we apply the data transformations suggested by the previous step, Exploratory Data Analysis (EDA). We also derive new features from existing ones where that helps us understand the dataset. The outcome of this phase is a refined dataset ready for our machine learning anomaly detection models.

#### Import libraries section

In [1]:
import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler

### 1. Loading the staged data

In [2]:
consolidated_df = pd.read_csv("../Data/Staged/staged_consolidated_data.csv")
filtered_df = pd.read_csv("../Data/Staged/staged_filtered_data.csv")

In [3]:
consolidated_df.head()

Unnamed: 0,end_date,year,ticker,form,cluster_6,cluster_10,cluster_15,cluster_14,cluster_4,cluster_2,...,cluster_3,cluster_9,cluster_5,cluster_1,cluster_13,cluster_0,cluster_11,cluster_7,cluster_16,anomaly
0,2008-01-01,2008,CAKE,10-K,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.475738,0.0,0.0,0.0,-0.2981119,0.0,0.0,0.0,0.0,0
1,2008-01-31,2008,MDLZ,10-K,0.0,-0.038235,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,2008-02-02,2008,TGT,10-K,1.373144,6.981317,0.0,1.181053,0.0,4.596254,...,2.330604,0.593854,-0.341399,0.0,-1.04e-08,-0.088416,17.912556,0.0,2.593812,1
3,2008-02-02,2008,TGT,10-K/A,1.373144,6.981317,0.0,1.181053,0.0,4.596254,...,2.330604,0.593854,-0.341399,0.0,-1.04e-08,-0.088416,17.912556,0.0,2.593812,1
4,2008-02-02,2008,TGT,10-Q,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.160446,0.0,0.0,0.0,0.0,-0.088416,0.0,0.0,0.0,1


In [4]:
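# The first four columns (end_date, year, ticker, form) are identifiers and the last one is the anomaly label
numerical_cols_consolidated_df = consolidated_df.columns.tolist()[4:-1]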

In [5]:
filtered_df.head()

Unnamed: 0,end_date,year,ticker,form,Assets,EarningsPerShareBasic,EarningsPerShareDiluted,LiabilitiesAndStockholdersEquity,NetIncomeLoss,RetainedEarningsAccumulatedDeficit,StockholdersEquity,anomaly
0,2008-01-01,2008,CAKE,10-K,,,,,,,,0
1,2008-01-31,2008,MDLZ,10-K,,,,,,,,0
2,2008-02-02,2008,TGT,10-K,,0.853175,0.881148,,,,,1
3,2008-02-02,2008,TGT,10-K/A,,0.853175,0.881148,,,,,1
4,2008-02-02,2008,TGT,10-Q,,,,,,,,1


In [6]:
# Same layout as above: skip the four identifier columns and the trailing anomaly label
numerical_cols_filtered_df = filtered_df.columns.tolist()[4:-1]
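
Positional slicing works as long as the column order never changes. As a hedged alternative, the sketch below picks the feature columns by dtype and checks that the result matches the positional slice. The toy frame and the `non_feature_cols` set are assumptions made for illustration; the numeric values are made up.

```python
import pandas as pd

# Toy frame mimicking the layout of filtered_df (values are made up).
toy_df = pd.DataFrame({
    "end_date": ["2008-01-01", "2008-01-31"],
    "year": [2008, 2008],
    "ticker": ["CAKE", "MDLZ"],
    "form": ["10-K", "10-K"],
    "Assets": [0.1, 0.2],
    "NetIncomeLoss": [0.3, 0.4],
    "anomaly": [0, 0],
})

# Numeric columns that are not model inputs (assumed: the year and the label).
non_feature_cols = {"year", "anomaly"}

numeric_cols = [
    col for col in toy_df.select_dtypes(include="number").columns
    if col not in non_feature_cols
]

# The positional slice used in the notebook, for comparison.
positional_cols = toy_df.columns.tolist()[4:-1]

print(numeric_cols == positional_cols)
```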

### 2. Handling NaNs
Let's replace the NaN values in each column with the column mean, using the SimpleImputer from scikit-learn.

The first (consolidated) dataset doesn't need this step: it was already scaled with RobustScaler, so its features are already centred on a median of 0.

In [7]:
# (2) Filtered dataset
imputer = SimpleImputer(strategy='mean')
# Impute the numerical columns only, then reattach the identifier columns and the label
temp_df = pd.DataFrame(
    imputer.fit_transform(filtered_df[numerical_cols_filtered_df]),
    columns=imputer.get_feature_names_out(),
)
filtered_df = pd.concat([filtered_df.iloc[:, :4], temp_df, filtered_df.iloc[:, -1:]], axis=1)
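
To make the effect of mean imputation concrete, here is a minimal sketch on a toy frame. The frame, its values, and the names `toy_df`, `toy_numeric_cols` and `toy_imputed` are assumptions for illustration, not part of the staged data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for filtered_df (values are made up).
toy_df = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "Assets": [1.0, np.nan, 3.0],
    "NetIncomeLoss": [np.nan, 4.0, 8.0],
})
toy_numeric_cols = ["Assets", "NetIncomeLoss"]

imputer = SimpleImputer(strategy="mean")

# Work on a copy so the original frame keeps its NaNs for comparison.
toy_imputed = toy_df.copy()
toy_imputed[toy_numeric_cols] = imputer.fit_transform(toy_df[toy_numeric_cols])

# Each NaN is now the mean of the observed values in its column:
# Assets -> (1 + 3) / 2, NetIncomeLoss -> (4 + 8) / 2.
print(toy_imputed)
```

Assigning the result back in place assumes no column is entirely NaN, because SimpleImputer drops such columns by default and the shapes would no longer match.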

### 3. Handling outliers

Capping (clipping): we limit each column to the range between its 5th and 95th percentiles, which reduces the impact of extreme values.

Both datasets need this step.

In [8]:
# (1) Consolidated dataset

for col in numerical_cols_consolidated_df:
    lower_bound = consolidated_df[col].quantile(0.05)
    upper_bound = consolidated_df[col].quantile(0.95)
    consolidated_df[col] = np.clip(consolidated_df[col], lower_bound, upper_bound)

In [9]:
# (2) Filtered dataset

for col in numerical_cols_filtered_df:
    lower_bound = filtered_df[col].quantile(0.05)
    upper_bound = filtered_df[col].quantile(0.95)
    filtered_df[col] = np.clip(filtered_df[col], lower_bound, upper_bound)
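
The two loops above repeat the same logic, so it could be factored into a helper. The following is a minimal sketch, assuming a hypothetical function name `clip_to_quantiles` and a made-up toy column; it is one possible way to write the step, not the notebook's own code.

```python
import pandas as pd

def clip_to_quantiles(df, cols, lower_q=0.05, upper_q=0.95):
    """Return a copy of df with each column in cols clipped to its own quantile range."""
    clipped = df.copy()
    lower = clipped[cols].quantile(lower_q)  # one bound per column
    upper = clipped[cols].quantile(upper_q)
    # axis=1 aligns the per-column bounds with the column labels.
    clipped[cols] = clipped[cols].clip(lower=lower, upper=upper, axis=1)
    return clipped

# Toy column: the values 1..20 plus one extreme outlier.
toy_df = pd.DataFrame({"x": [float(v) for v in range(1, 21)] + [1000.0]})
toy_clipped = clip_to_quantiles(toy_df, ["x"])

print(toy_df["x"].max(), toy_clipped["x"].max())
```

Returning a copy instead of modifying the frame in place keeps the unclipped values available for comparison.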

### 4. Log transformation
This step reduces the skewness observed in the numerical features.

It is applied only to the second dataset.

In [10]:
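# (2) Filtered dataset
# Note: np.log1p(x) already computes log(1 + x), so the constant below only nudges the argument.
# Inputs at or below -(1 + small_constant) would yield -inf or NaN.
small_constant = 1e-6
# Log Transformation to reduce skewness
filtered_df[numerical_cols_filtered_df] = filtered_df[numerical_cols_filtered_df].apply(lambda x: np.log1p(x+small_constant))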
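
`np.log1p` is only defined for arguments greater than -1, which matters for financial features that can be negative. Whether the clipped features here ever reach that range is not stated, so the following is only a hedged sketch of one alternative: a symmetric log transform. The name `signed_log1p` and the toy values are assumptions, and this is a different transformation from the one used in the cell above.

```python
import numpy as np
import pandas as pd

def signed_log1p(values):
    """Symmetric log transform: sign(x) * log(1 + |x|), defined for every real x."""
    return np.sign(values) * np.log1p(np.abs(values))

toy = pd.Series([-5.0, -1.0, 0.0, 1.0, 5.0])

# Plain log1p: NaN below -1 and -inf at exactly -1 (warnings silenced for the demo).
with np.errstate(divide="ignore", invalid="ignore"):
    plain = np.log1p(toy)

# The signed variant stays finite and keeps the sign of the input.
signed = signed_log1p(toy)

print(pd.DataFrame({"x": toy, "log1p": plain, "signed_log1p": signed}))
```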

### 5. Standardize features
We use RobustScaler, which centres each feature on its median and scales it by its interquartile range, so it is less sensitive to outliers than mean and variance standardization.

Only the second dataset (filtered_df) needs this step.

In [11]:
# (2) Filtered dataset

# Standardize the features
scaler = RobustScaler()
X_scaled = scaler.fit_transform(filtered_df[numerical_cols_filtered_df])
filtered_df_scaled = pd.DataFrame(X_scaled, columns=numerical_cols_filtered_df)
filtered_df_scaled["anomaly"]= filtered_df["anomaly"]
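
As a sanity check on what RobustScaler computes with its default settings, the sketch below compares its output with the manual formula (x - median) / IQR on a toy column. The array `toy_X` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column with one large outlier (values are made up).
toy_X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(toy_X)

# Manual equivalent: subtract the median, divide by the interquartile range.
median = np.median(toy_X, axis=0)                # 3.0
q1, q3 = np.percentile(toy_X, [25, 75], axis=0)  # 2.0 and 4.0
manual = (toy_X - median) / (q3 - q1)

print(np.allclose(scaled, manual))
```

The outlier does not affect the median or the interquartile range here, so the other four points keep a sensible scale.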

In [12]:
filtered_df_scaled

Unnamed: 0,Assets,EarningsPerShareBasic,EarningsPerShareDiluted,LiabilitiesAndStockholdersEquity,NetIncomeLoss,RetainedEarningsAccumulatedDeficit,StockholdersEquity,anomaly
0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0
1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0
2,0.000000,0.388380,0.383475,0.000000,0.000000,0.000000,0.000000,1
3,0.000000,0.388380,0.383475,0.000000,0.000000,0.000000,0.000000,1
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1
...,...,...,...,...,...,...,...,...
5554,-1.318369,-2.231086,-2.211543,-1.338515,-1.199695,-1.318712,-1.524128,0
5555,-0.587707,0.000728,0.016145,-0.566868,0.771510,-1.190864,-0.056229,0
5556,-0.777642,0.491430,0.494138,-0.767359,-0.178854,1.087179,-1.433650,0
5557,-1.018638,-0.327735,-0.302533,-1.021834,-0.593425,-1.459085,-1.532717,0


### 6. Feature Selection
In this step we select the features and the target for modeling, based on the EDA conclusions.

We only consider the numerical columns. In the second dataset, we also drop EarningsPerShareDiluted and LiabilitiesAndStockholdersEquity, the two columns that the correlation analysis between the inputs flagged as redundant.

In [13]:
# (1) Consolidated dataset

# Drop the four identifier columns; keep the numerical features and the anomaly label
consolidated_df = consolidated_df.iloc[:,4:]

In [14]:
# (2) Filtered dataset

filtered_df_scaled.drop(["EarningsPerShareDiluted", "LiabilitiesAndStockholdersEquity"], axis=1, inplace=True)
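
The correlation check behind this choice was done in the EDA notebook and is not reproduced here. As a hedged illustration of the idea, the sketch below lists feature pairs whose absolute Pearson correlation exceeds a threshold. The helper name `highly_correlated_pairs`, the 0.95 threshold, and the synthetic columns are all assumptions.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.95):
    """Return the column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold]

# Synthetic data: eps_diluted is almost a copy of eps_basic, net_income is independent.
rng = np.random.default_rng(0)
eps_basic = rng.normal(size=200)
toy_df = pd.DataFrame({
    "eps_basic": eps_basic,
    "eps_diluted": 0.98 * eps_basic + rng.normal(scale=0.01, size=200),
    "net_income": rng.normal(size=200),
})

found = highly_correlated_pairs(toy_df)
print(found)
```

One column from each reported pair can then be dropped, as done above for the filtered dataset.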

### 7. Saving the processed datasets

In [15]:
# (1) Consolidated dataset

# Save the dataset for modeling
output_file_path = '../data/processed/processed_consolidated_data.csv'
consolidated_df.to_csv(output_file_path, index=False)
print(f"Processed dataframe saved to {output_file_path}")

Processed dataframe saved to ../data/processed/processed_consolidated_data.csv


In [16]:
# (2) Filtered dataset

# Save the dataset for modeling
output_file_path = '../data/processed/processed_filtered_data.csv'
filtered_df_scaled.to_csv(output_file_path, index=False)
print(f"Processed dataframe saved to {output_file_path}")

Processed dataframe saved to ../data/processed/processed_filtered_data.csv
