# FEATURE ENGINEERING

This notebook logically follows the preprocessing step and performs feature engineering for modeling or further analysis.

✔ What it does:

- Loads the merged, cleaned data produced by the preprocessing step (data/PREPROCESSING/merged_stock_income.csv).
- Creates new features:
  - Price Change %
  - Target (binary label: will the price rise the next day?)
  - Date features: weekday, month, quarter
  - Rolling averages (MA_5, MA_10)
  - Daily return and log returns of the Close price


In [1]:
import os
import pandas as pd
import numpy as np

file_path = "data/PREPROCESSING/merged_stock_income.csv"  # forward slashes avoid invalid backslash escapes and work on every OS

df_merged = pd.read_csv(file_path)

df_merged.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4502544 entries, 0 to 4502543
Data columns (total 32 columns):
 #   Column                                    Dtype  
---  ------                                    -----  
 0   Ticker                                    object 
 1   Date                                      object 
 2   SimFinId                                  int64  
 3   Open                                      float64
 4   High                                      float64
 5   Low                                       float64
 6   Close                                     float64
 7   Adj. Close                                float64
 8   Volume                                    int64  
 9   Shares Outstanding                        float64
 10  Currency                                  object 
 11  Fiscal Year                               float64
 12  Fiscal Period                             object 
 13  Publish Date                              object 
 14  Re

In [2]:
df_merged.head()

Unnamed: 0,Ticker,Date,SimFinId,Open,High,Low,Close,Adj. Close,Volume,Shares Outstanding,...,Operating Income (Loss),Non-Operating Income (Loss),"Interest Expense, Net","Pretax Income (Loss), Adj.",Abnormal Gains (Losses),Pretax Income (Loss),"Income Tax (Expense) Benefit, Net",Income (Loss) from Continuing Operations,Net Income,Net Income (Common)
0,GOOG,2019-04-25,18,63.24,63.37,62.6,63.17,62.87,22145900,13884880000.0,...,35928000000.0,5394000000.0,0.0,41322000000.0,-1697000000.0,39625000000.0,-5282000000.0,34343000000.0,34343000000.0,34343000000.0
1,HEES,2019-04-25,6767693,30.0,31.77,29.26,30.26,25.06,676844,35787000.0,...,211318000.0,-47424000.0,-54033000.0,163894000.0,16836000.0,180730000.0,-47036000.0,133694000.0,132170000.0,132170000.0
2,MANH,2019-04-25,105128,66.52,67.0,64.24,65.44,65.44,917088,64594510.0,...,115924000.0,153000.0,715000.0,116077000.0,7252000.0,116077000.0,-30315000.0,85762000.0,85762000.0,85762000.0
3,ERII,2019-04-25,6767695,9.6,9.64,9.43,9.46,9.46,113678,54116000.0,...,10364000.0,1892000.0,2010000.0,12256000.0,-2332000.0,12256000.0,-1343000.0,10913000.0,10913000.0,10913000.0
4,CIM,2019-04-25,6767699,57.39,57.39,56.43,56.76,28.84,383803,62364580.0,...,505294000.0,939000.0,2300000.0,505294000.0,169225000.0,674519000.0,-4405000.0,670114000.0,670114000.0,596350000.0


New Columns: 

✅ Price Change % → Daily percentage change in Close (1-period lag).

✅ Target → Binary label for market-movement prediction: 1 if the next Close is higher than the current one, otherwise 0.

✅ Weekday (Weekday) → Categorical feature holding the day name (Monday-Sunday) for classification models.

✅ Month (Month) → Month number (1-12), treated as a categorical feature for classification models.

✅ Quarter (Quarter) → Quarter number (1-4), treated as a categorical feature for classification models.

✅ Daily Returns (Daily_Return) → Percentage change in the Close price, useful for financial modeling.

✅ Moving Averages (MA_5, MA_10) → 5-day and 10-day moving averages of Close for trend detection.

✅ Log Returns (Log_Returns) → Natural log of the ratio of consecutive Close values; log returns add up over time, which suits time series models.


In [3]:
df_merged["Price Change %"] = df_merged["Close"].pct_change() * 100  # % change in Close versus the previous row

df_merged.head()

Unnamed: 0,Ticker,Date,SimFinId,Open,High,Low,Close,Adj. Close,Volume,Shares Outstanding,...,Non-Operating Income (Loss),"Interest Expense, Net","Pretax Income (Loss), Adj.",Abnormal Gains (Losses),Pretax Income (Loss),"Income Tax (Expense) Benefit, Net",Income (Loss) from Continuing Operations,Net Income,Net Income (Common),Price Change %
0,GOOG,2019-04-25,18,63.24,63.37,62.6,63.17,62.87,22145900,13884880000.0,...,5394000000.0,0.0,41322000000.0,-1697000000.0,39625000000.0,-5282000000.0,34343000000.0,34343000000.0,34343000000.0,
1,HEES,2019-04-25,6767693,30.0,31.77,29.26,30.26,25.06,676844,35787000.0,...,-47424000.0,-54033000.0,163894000.0,16836000.0,180730000.0,-47036000.0,133694000.0,132170000.0,132170000.0,-52.097515
2,MANH,2019-04-25,105128,66.52,67.0,64.24,65.44,65.44,917088,64594510.0,...,153000.0,715000.0,116077000.0,7252000.0,116077000.0,-30315000.0,85762000.0,85762000.0,85762000.0,116.259088
3,ERII,2019-04-25,6767695,9.6,9.64,9.43,9.46,9.46,113678,54116000.0,...,1892000.0,2010000.0,12256000.0,-2332000.0,12256000.0,-1343000.0,10913000.0,10913000.0,10913000.0,-85.54401
4,CIM,2019-04-25,6767699,57.39,57.39,56.43,56.76,28.84,383803,62364580.0,...,939000.0,2300000.0,505294000.0,169225000.0,674519000.0,-4405000.0,670114000.0,670114000.0,596350000.0,500.0
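
In the output above, each row is compared with the row directly before it, and that row belongs to a different ticker (HEES at 30.26 against GOOG at 63.17 gives -52.10%). If the intent is a day-over-day change per stock, a grouped computation might look like the sketch below. The toy frame `prices` and its values are made up for illustration; only the column names `Ticker`, `Date`, and `Close` come from the notebook.

```python
import pandas as pd

# Hypothetical toy data: two tickers, three trading days each.
prices = pd.DataFrame({
    "Ticker": ["AAA", "BBB", "AAA", "BBB", "AAA", "BBB"],
    "Date": pd.to_datetime(["2019-04-25", "2019-04-25",
                            "2019-04-26", "2019-04-26",
                            "2019-04-29", "2019-04-29"]),
    "Close": [100.0, 50.0, 110.0, 45.0, 99.0, 54.0],
})

# Sort so that each ticker's rows are contiguous and in chronological order.
prices = prices.sort_values(["Ticker", "Date"]).reset_index(drop=True)

# Percentage change within each ticker; the first row of every ticker is NaN.
prices["Price Change %"] = prices.groupby("Ticker")["Close"].pct_change() * 100

print(prices)
```

With this layout the first row of every ticker is NaN, so any fill value would apply once per stock rather than once for the whole frame.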


Create the Target Variable for ML
To predict whether the price will go up or down, shift the Close column by one row and compare it with the current value:

In [4]:
df_merged["Target"] = (df_merged["Close"].shift(-1) > df_merged["Close"]).astype(int)  # 1 if price rises, 0 if falls
df_merged.head()

Unnamed: 0,Ticker,Date,SimFinId,Open,High,Low,Close,Adj. Close,Volume,Shares Outstanding,...,"Interest Expense, Net","Pretax Income (Loss), Adj.",Abnormal Gains (Losses),Pretax Income (Loss),"Income Tax (Expense) Benefit, Net",Income (Loss) from Continuing Operations,Net Income,Net Income (Common),Price Change %,Target
0,GOOG,2019-04-25,18,63.24,63.37,62.6,63.17,62.87,22145900,13884880000.0,...,0.0,41322000000.0,-1697000000.0,39625000000.0,-5282000000.0,34343000000.0,34343000000.0,34343000000.0,,0
1,HEES,2019-04-25,6767693,30.0,31.77,29.26,30.26,25.06,676844,35787000.0,...,-54033000.0,163894000.0,16836000.0,180730000.0,-47036000.0,133694000.0,132170000.0,132170000.0,-52.097515,1
2,MANH,2019-04-25,105128,66.52,67.0,64.24,65.44,65.44,917088,64594510.0,...,715000.0,116077000.0,7252000.0,116077000.0,-30315000.0,85762000.0,85762000.0,85762000.0,116.259088,0
3,ERII,2019-04-25,6767695,9.6,9.64,9.43,9.46,9.46,113678,54116000.0,...,2010000.0,12256000.0,-2332000.0,12256000.0,-1343000.0,10913000.0,10913000.0,10913000.0,-85.54401,1
4,CIM,2019-04-25,6767699,57.39,57.39,56.43,56.76,28.84,383803,62364580.0,...,2300000.0,505294000.0,169225000.0,674519000.0,-4405000.0,670114000.0,670114000.0,596350000.0,500.0,0
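
As computed above, `shift(-1)` looks at the next row of the frame, not at the same ticker's next trading day. A per-ticker label could be sketched as follows. The toy frame `prices` is hypothetical, and marking the last row of each ticker as missing (instead of 0) is one possible design choice, not the notebook's.

```python
import pandas as pd

# Hypothetical toy data: two tickers, three trading days each.
prices = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "Date": pd.to_datetime(["2019-04-25", "2019-04-26", "2019-04-29"] * 2),
    "Close": [100.0, 110.0, 99.0, 50.0, 45.0, 54.0],
})
prices = prices.sort_values(["Ticker", "Date"]).reset_index(drop=True)

# Next trading day's close for the same ticker (NaN on each ticker's last row).
next_close = prices.groupby("Ticker")["Close"].shift(-1)

# 1 if the next close is higher, 0 otherwise, stored as a nullable integer.
prices["Target"] = (next_close > prices["Close"]).astype("Int64")

# The last row of each ticker has no "next day", so its label is unknown.
prices.loc[next_close.isna(), "Target"] = pd.NA

print(prices[["Ticker", "Date", "Close", "Target"]])
```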


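Handling Null Values: .pct_change() leaves NaN in the first row, so fill it with 0. Target is already an integer column without missing values, so filling it changes nothing and serves only as a safeguard: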

In [5]:
# Assign the result back; fillna(inplace=True) on a selected column is deprecated in pandas
df_merged["Price Change %"] = df_merged["Price Change %"].fillna(0)
df_merged["Target"] = df_merged["Target"].fillna(0)
df_merged.head()


Unnamed: 0,Ticker,Date,SimFinId,Open,High,Low,Close,Adj. Close,Volume,Shares Outstanding,...,"Interest Expense, Net","Pretax Income (Loss), Adj.",Abnormal Gains (Losses),Pretax Income (Loss),"Income Tax (Expense) Benefit, Net",Income (Loss) from Continuing Operations,Net Income,Net Income (Common),Price Change %,Target
0,GOOG,2019-04-25,18,63.24,63.37,62.6,63.17,62.87,22145900,13884880000.0,...,0.0,41322000000.0,-1697000000.0,39625000000.0,-5282000000.0,34343000000.0,34343000000.0,34343000000.0,0.0,0
1,HEES,2019-04-25,6767693,30.0,31.77,29.26,30.26,25.06,676844,35787000.0,...,-54033000.0,163894000.0,16836000.0,180730000.0,-47036000.0,133694000.0,132170000.0,132170000.0,-52.097515,1
2,MANH,2019-04-25,105128,66.52,67.0,64.24,65.44,65.44,917088,64594510.0,...,715000.0,116077000.0,7252000.0,116077000.0,-30315000.0,85762000.0,85762000.0,85762000.0,116.259088,0
3,ERII,2019-04-25,6767695,9.6,9.64,9.43,9.46,9.46,113678,54116000.0,...,2010000.0,12256000.0,-2332000.0,12256000.0,-1343000.0,10913000.0,10913000.0,10913000.0,-85.54401,1
4,CIM,2019-04-25,6767699,57.39,57.39,56.43,56.76,28.84,383803,62364580.0,...,2300000.0,505294000.0,169225000.0,674519000.0,-4405000.0,670114000.0,670114000.0,596350000.0,500.0,0


In [6]:
# Set option to display all columns
pd.set_option('display.max_columns', None)

# Show DataFrame info
df_merged.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4502544 entries, 0 to 4502543
Data columns (total 34 columns):
 #   Column                                    Dtype  
---  ------                                    -----  
 0   Ticker                                    object 
 1   Date                                      object 
 2   SimFinId                                  int64  
 3   Open                                      float64
 4   High                                      float64
 5   Low                                       float64
 6   Close                                     float64
 7   Adj. Close                                float64
 8   Volume                                    int64  
 9   Shares Outstanding                        float64
 10  Currency                                  object 
 11  Fiscal Year                               float64
 12  Fiscal Period                             object 
 13  Publish Date                              object 
 14  Re

In [7]:
df_merged.columns

Index(['Ticker', 'Date', 'SimFinId', 'Open', 'High', 'Low', 'Close',
       'Adj. Close', 'Volume', 'Shares Outstanding', 'Currency', 'Fiscal Year',
       'Fiscal Period', 'Publish Date', 'Restated Date', 'Shares (Basic)',
       'Shares (Diluted)', 'Revenue', 'Cost of Revenue', 'Gross Profit',
       'Operating Expenses', 'Selling, General & Administrative',
       'Operating Income (Loss)', 'Non-Operating Income (Loss)',
       'Interest Expense, Net', 'Pretax Income (Loss), Adj.',
       'Abnormal Gains (Losses)', 'Pretax Income (Loss)',
       'Income Tax (Expense) Benefit, Net',
       'Income (Loss) from Continuing Operations', 'Net Income',
       'Net Income (Common)', 'Price Change %', 'Target'],
      dtype='object')

Weekday Column

In [8]:
df_merged['Date'] = pd.to_datetime(df_merged['Date'])  # ensure "Date" is a datetime column before using the .dt accessor

In [9]:
# 1. Weekday (Categorical for Classification)
df_merged['Weekday'] = df_merged['Date'].dt.day_name()  # 'Monday', 'Tuesday', etc.

Month Column

In [10]:
# 2. Month (Categorical for Classification)
df_merged['Month'] = df_merged['Date'].dt.month  # 1 (Jan) to 12 (Dec)

Quarter Column

In [11]:
# 3. Quarter (Categorical for Classification)
df_merged['Quarter'] = df_merged['Date'].dt.quarter  # 1 to 4
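
The date-derived columns above are stored as plain strings and integers. If a downstream classifier needs them as explicit categorical inputs, one option is an ordered categorical dtype followed by one-hot encoding. The sketch below uses a hypothetical toy frame `dates`; the choice of `pd.get_dummies` is an assumption, not something the notebook prescribes.

```python
import pandas as pd

# Hypothetical toy data: three calendar dates.
dates = pd.DataFrame({"Date": pd.to_datetime(
    ["2019-04-25", "2019-04-26", "2019-04-29"])})

dates["Weekday"] = dates["Date"].dt.day_name()
dates["Month"] = dates["Date"].dt.month
dates["Quarter"] = dates["Date"].dt.quarter

# An ordered categorical keeps Monday < Tuesday < ... for sorting and plotting.
weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]
dates["Weekday"] = pd.Categorical(dates["Weekday"],
                                  categories=weekday_order, ordered=True)

# One-hot encode the weekday; every declared category gets its own column.
encoded = pd.get_dummies(dates, columns=["Weekday"], prefix="Weekday")
print(encoded.columns.tolist())
```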

Daily Returns (Percentage Change)

In [12]:
df_merged['Daily_Return'] = df_merged['Close'].pct_change() * 100  # Convert to %

Moving Averages (5-day and 10-day)

In [13]:
df_merged['MA_5'] = df_merged['Close'].rolling(window=5, min_periods=1).mean()
df_merged['MA_10'] = df_merged['Close'].rolling(window=10, min_periods=1).mean()
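
Because the frame interleaves tickers, each rolling window above averages prices of different stocks: the `MA_5` of 46.715 shown for HEES in the later `df_merged.head()` output is the mean of GOOG's 63.17 and HEES's 30.26. A per-ticker moving average could be sketched as below. The toy frame `prices`, its values, and the shortened window of 3 are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data: two tickers, four consecutive days each.
prices = pd.DataFrame({
    "Ticker": ["AAA"] * 4 + ["BBB"] * 4,
    "Date": list(pd.date_range("2019-04-22", periods=4, freq="D")) * 2,
    "Close": [10.0, 12.0, 14.0, 16.0, 100.0, 90.0, 80.0, 70.0],
})
prices = prices.sort_values(["Ticker", "Date"]).reset_index(drop=True)

# Rolling mean computed separately for each ticker.
# min_periods=1 means early rows average over fewer than 3 observations.
prices["MA_3"] = (
    prices.groupby("Ticker")["Close"]
          .transform(lambda s: s.rolling(window=3, min_periods=1).mean())
)

print(prices)
```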

Log Returns for Time Series Modeling

In [14]:
# Compute daily log-returns
df_merged['Log_Returns'] = np.log(df_merged['Close'] / df_merged['Close'].shift(1))


In [15]:
df_merged.set_index('Date', inplace=True)

In [16]:
df_merged.head()

Date,Ticker,SimFinId,Open,High,Low,Close,Adj. Close,Volume,Shares Outstanding,Currency,Fiscal Year,Fiscal Period,Publish Date,Restated Date,Shares (Basic),Shares (Diluted),Revenue,Cost of Revenue,Gross Profit,Operating Expenses,"Selling, General & Administrative",Operating Income (Loss),Non-Operating Income (Loss),"Interest Expense, Net","Pretax Income (Loss), Adj.",Abnormal Gains (Losses),Pretax Income (Loss),"Income Tax (Expense) Benefit, Net",Income (Loss) from Continuing Operations,Net Income,Net Income (Common),Price Change %,Target,Weekday,Month,Quarter,Daily_Return,MA_5,MA_10,Log_Returns
2019-04-25,GOOG,18,63.24,63.37,62.6,63.17,62.87,22145900,13884880000.0,USD,2019.0,FY,2020-02-04,2022-02-02,13851920000.0,14901660000.0,161857000000.0,-71896000000.0,89961000000.0,-54033000000.0,-28015000000.0,35928000000.0,5394000000.0,0.0,41322000000.0,-1697000000.0,39625000000.0,-5282000000.0,34343000000.0,34343000000.0,34343000000.0,0.0,0,Thursday,4,2,,63.17,63.17,
2019-04-25,HEES,6767693,30.0,31.77,29.26,30.26,25.06,676844,35787000.0,USD,2022.0,FY,2023-02-22,2025-02-21,35943000.0,36089000.0,1244518000.0,-689355000.0,555163000.0,-343845000.0,-343845000.0,211318000.0,-47424000.0,-54033000.0,163894000.0,16836000.0,180730000.0,-47036000.0,133694000.0,132170000.0,132170000.0,-52.097515,1,Thursday,4,2,-52.097515,46.715,46.715,-0.736003
2019-04-25,MANH,105128,66.52,67.0,64.24,65.44,65.44,917088,64594510.0,USD,2019.0,FY,2020-02-10,2022-02-07,64397000.0,65103000.0,617949000.0,-284967000.0,332982000.0,-217058000.0,-121463000.0,115924000.0,153000.0,715000.0,116077000.0,7252000.0,116077000.0,-30315000.0,85762000.0,85762000.0,85762000.0,116.259088,0,Thursday,4,2,116.259088,52.956667,52.956667,0.771307
2019-04-25,ERII,6767695,9.6,9.64,9.43,9.46,9.46,113678,54116000.0,USD,2019.0,FY,2020-03-06,2022-02-24,54740000.0,56067000.0,86942000.0,-20335000.0,66607000.0,-56243000.0,-32266000.0,10364000.0,1892000.0,2010000.0,12256000.0,-2332000.0,12256000.0,-1343000.0,10913000.0,10913000.0,10913000.0,-85.54401,1,Thursday,4,2,-85.54401,42.0825,42.0825,-1.934061
2019-04-25,CIM,6767699,57.39,57.39,56.43,56.76,28.84,383803,62364580.0,USD,2021.0,FY,2022-02-17,2024-02-29,77915700.0,81824120.0,937546000.0,-326628000.0,610918000.0,-105624000.0,-22246000.0,505294000.0,939000.0,2300000.0,505294000.0,169225000.0,674519000.0,-4405000.0,670114000.0,670114000.0,596350000.0,500.0,0,Thursday,4,2,500.0,45.018,45.018,1.791759


In [18]:
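# Drop rows whose log return is NaN; dropna(inplace=True) on a selected column would leave df_merged unchanged
df_merged = df_merged.dropna(subset=['Log_Returns'])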

In [19]:
df_merged['Log_Returns'].describe()


count    4.502517e+06
mean              NaN
std               NaN
min              -inf
25%     -1.417481e+00
50%      0.000000e+00
75%      1.417047e+00
max               inf
Name: Log_Returns, dtype: float64
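
The `inf` and `-inf` extremes above are what `np.log` returns for an infinite or zero ratio, which presumably comes from rows where `Close` is 0: the row itself yields a ratio of 0, and the row after it divides by 0. Their sum is undefined, which is why the mean and standard deviation show NaN. A minimal reproduction on a made-up series `close`:

```python
import numpy as np
import pandas as pd

# Hypothetical prices with a single zero in the second position.
close = pd.Series([10.0, 0.0, 5.0, 5.5])

# Silence NumPy's divide-by-zero warning for this demonstration only.
with np.errstate(divide="ignore"):
    log_ret = np.log(close / close.shift(1))

# First value is NaN (no previous row), then -inf, then +inf, then a finite return.
print(log_ret.tolist())

# Turn the infinities into NaN and drop them, leaving only finite returns.
cleaned = log_ret.replace([np.inf, -np.inf], np.nan).dropna()
print(cleaned)
```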

In [20]:
# Replace inf/-inf with NaN (assign back; inplace on a selected column is deprecated in pandas)
df_merged['Log_Returns'] = df_merged['Log_Returns'].replace([np.inf, -np.inf], np.nan)

# Drop missing values
df_merged.dropna(subset=['Log_Returns'], inplace=True)


In [21]:
df_merged['Log_Returns'].describe()

count    4.483847e+06
mean    -6.131044e-05
std      2.295275e+00
min     -1.798928e+01
25%     -1.410339e+00
50%      0.000000e+00
75%      1.409907e+00
max      1.866754e+01
Name: Log_Returns, dtype: float64

In [22]:
# Sanity checks on Close; the notebook displays only the last expression (the count of zeros)
df_merged['Close'].isna().sum()
df_merged['Close'].eq(0).sum()

0

In [23]:
# Save enriched dataset
os.makedirs("data/ENRICH", exist_ok=True)
df_merged.to_csv("data/ENRICH/merged_stock_income.csv", index=True) 
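
Since `Date` is written as the index column, a later step that reads the file back may want to restore it as a datetime index. The round trip below is a sketch on a hypothetical two-row frame and a temporary file named `enriched.csv`; neither is part of the notebook's data.

```python
import os
import tempfile
import pandas as pd

# Hypothetical frame with a Date index, mirroring the layout saved above.
frame = pd.DataFrame(
    {"Ticker": ["AAA", "BBB"], "Close": [10.0, 20.0]},
    index=pd.to_datetime(["2019-04-25", "2019-04-26"]).rename("Date"),
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "enriched.csv")  # hypothetical file name
    frame.to_csv(out_path, index=True)

    # index_col restores the index; parse_dates turns the strings back into datetimes.
    restored = pd.read_csv(out_path, index_col="Date", parse_dates=["Date"])

print(type(restored.index).__name__)
```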