# 4_Enrico_Enriched_Draft.ipynb FEATURE ENGINEERING

This notebook follows the preprocessing step and performs feature engineering for modeling or further analysis. It uses the cleaned dataset produced by 3_Enrico_Pre-Processing_Draft.ipynb.

✔ What it does:

- Loads the cleaned data from the previous step (pre-processed.csv).
- Creates new features:
  - Price Change %
  - Target (binary label: will the price rise the next day?)
  - Date features: weekday, month, quarter
  - Rolling averages (MA_5, MA_10)
  - Daily return and log of the adjusted close price


In [None]:
import pandas as pd
import numpy as np

file_path = "/Users/enricotajanlangit/Desktop/pre-processed.csv" # - ADAPT TO YOUR LOCAL ENVIRONMENT

df_merged = pd.read_csv(file_path)

df_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11953125 entries, 0 to 11953124
Data columns (total 32 columns):
 #   Column                                    Dtype  
---  ------                                    -----  
 0   Ticker                                    object 
 1   Date                                      object 
 2   SimFinId                                  int64  
 3   Open                                      float64
 4   High                                      float64
 5   Low                                       float64
 6   Close                                     float64
 7   Adj. Close                                float64
 8   Volume                                    int64  
 9   Shares Outstanding                        float64
 10  Currency                                  object 
 11  Fiscal Year                               int64  
 12  Fiscal Period                             object 
 13  Publish Date                              object 
 14  
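The `info()` output lists `Date` as `object`, so the dates are read as plain strings. A minimal sketch of parsing them at load time follows. `parse_dates` and `dtype` are standard `pd.read_csv` parameters; the in-memory CSV and the choice of a categorical `Ticker` are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for pre-processed.csv (illustrative only).
csv_text = (
    "Ticker,Date,Adj. Close\n"
    "A,2019-04-15,77.22\n"
    "UFPT,2019-04-15,36.19\n"
)

# parse_dates converts the column while reading; a categorical Ticker is an
# assumed memory saving for a frame with millions of repeated symbols.
sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Date"],
    dtype={"Ticker": "category"},
)
print(sample.dtypes)
```

With the real file, the same keyword arguments would be passed alongside `file_path`.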

In [2]:
df_merged.head()

Unnamed: 0,Ticker,Date,SimFinId,Open,High,Low,Close,Adj. Close,Volume,Shares Outstanding,...,Operating Income (Loss),Non-Operating Income (Loss),"Interest Expense, Net","Pretax Income (Loss), Adj.",Abnormal Gains (Losses),Pretax Income (Loss),"Income Tax (Expense) Benefit, Net",Income (Loss) from Continuing Operations,Net Income,Net Income (Common)
0,A,2019-04-15,45846,81.0,81.13,79.91,80.4,77.22,1627268,317515869.0,...,941000000.0,-22000000.0,-38000000.0,919000000.0,,919000000.0,152000000.0,1071000000,1071000000,1071000000
1,UFPT,2019-04-15,6767729,36.5,36.82,35.9,36.19,36.19,24772,7402000.0,...,44464000.0,-2682000.0,-2763000.0,41782000.0,10936000.0,52718000.0,-10929000.0,41789000,41789000,41789000
2,UFPT,2019-04-15,6767729,36.5,36.82,35.9,36.19,36.19,24772,7402000.0,...,21634000.0,-13000.0,-39000.0,21621000.0,-416000.0,21205000.0,-5319000.0,15886000,15886000,15886000
3,UFPT,2019-04-15,6767729,36.5,36.82,35.9,36.19,36.19,24772,7402000.0,...,17191000.0,-449000.0,-83000.0,16742000.0,-459000.0,16283000.0,-2914000.0,13369000,13369000,13369000
4,AXSM,2019-04-15,10383750,14.02,14.8,13.62,14.31,14.31,1774954,33052468.0,...,-124707000.0,-5696000.0,-5696000.0,-130403000.0,-5674333.0,-130403000.0,-3405000.0,-130403000,-130403000,-130403000
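The preview above shows UFPT three times for 2019-04-15 with different income-statement values, so a ticker can have several rows per trading day. A small sketch of how such repeats could be counted follows. The toy frame copies a few values from the preview; the variable names are hypothetical, and how to resolve the repeats is a modelling decision this sketch does not make.

```python
import pandas as pd

# Toy frame mirroring the (Ticker, Date) pattern in the preview.
toy = pd.DataFrame({
    "Ticker": ["A", "UFPT", "UFPT", "UFPT", "AXSM"],
    "Date": ["2019-04-15"] * 5,
    "Adj. Close": [77.22, 36.19, 36.19, 36.19, 14.31],
})

# keep=False flags every member of a repeated (Ticker, Date) group.
dup_mask = toy.duplicated(subset=["Ticker", "Date"], keep=False)
n_dup = int(dup_mask.sum())

# Number of rows per (Ticker, Date) pair.
counts = toy.groupby(["Ticker", "Date"]).size()
print(n_dup)
print(counts)
```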


New Columns:

✅ Price Change % → Daily percentage change in adjusted close price relative to the previous row (1-row lag).

✅ Target → Binary label for market movement prediction: 1 if the next row's adjusted close is higher, otherwise 0.

✅ Weekday → Categorical day name (Monday-Sunday) for classification models.

✅ Month → Categorical month number (1-12) for classification models.

✅ Quarter → Categorical quarter number (1-4) for classification models.

✅ Daily Returns (Daily_Return) → Percentage change in adjusted close price, computed the same way as Price Change %; useful for financial modeling.

✅ Moving Averages (MA_5, MA_10) → 5-row and 10-row moving averages of adjusted close price for trend detection.

✅ Log Transformation (Log_Close) → Natural log of adjusted close price, which helps normalize price data for time series models.


In [3]:
df_merged["Price Change %"] = df_merged["Adj. Close"].pct_change() * 100  # % change vs. the previous row

df_merged.head()

Unnamed: 0,Ticker,Date,SimFinId,Open,High,Low,Close,Adj. Close,Volume,Shares Outstanding,...,Non-Operating Income (Loss),"Interest Expense, Net","Pretax Income (Loss), Adj.",Abnormal Gains (Losses),Pretax Income (Loss),"Income Tax (Expense) Benefit, Net",Income (Loss) from Continuing Operations,Net Income,Net Income (Common),Price Change %
0,A,2019-04-15,45846,81.0,81.13,79.91,80.4,77.22,1627268,317515869.0,...,-22000000.0,-38000000.0,919000000.0,,919000000.0,152000000.0,1071000000,1071000000,1071000000,
1,UFPT,2019-04-15,6767729,36.5,36.82,35.9,36.19,36.19,24772,7402000.0,...,-2682000.0,-2763000.0,41782000.0,10936000.0,52718000.0,-10929000.0,41789000,41789000,41789000,-53.133903
2,UFPT,2019-04-15,6767729,36.5,36.82,35.9,36.19,36.19,24772,7402000.0,...,-13000.0,-39000.0,21621000.0,-416000.0,21205000.0,-5319000.0,15886000,15886000,15886000,0.0
3,UFPT,2019-04-15,6767729,36.5,36.82,35.9,36.19,36.19,24772,7402000.0,...,-449000.0,-83000.0,16742000.0,-459000.0,16283000.0,-2914000.0,13369000,13369000,13369000,0.0
4,AXSM,2019-04-15,10383750,14.02,14.8,13.62,14.31,14.31,1774954,33052468.0,...,-5696000.0,-5696000.0,-130403000.0,-5674333.0,-130403000.0,-3405000.0,-130403000,-130403000,-130403000,-60.45869
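In the preview above, the UFPT and AXSM rows show changes of roughly -53% and -60% because `.pct_change()` compares each row with the row before it, even when that row belongs to a different ticker. A minimal sketch of a per-ticker variant follows. It assumes rows should only be compared within the same `Ticker`, in `Date` order. The toy values are made up, and this is an alternative to consider rather than the notebook's own method.

```python
import pandas as pd

# Made-up prices for two tickers, interleaved by date like the real frame.
toy = pd.DataFrame({
    "Ticker": ["A", "B", "A", "B", "A", "B"],
    "Date": pd.to_datetime([
        "2019-04-15", "2019-04-15", "2019-04-16",
        "2019-04-16", "2019-04-17", "2019-04-17",
    ]),
    "Adj. Close": [100.0, 50.0, 110.0, 45.0, 99.0, 54.0],
})

# Sort so each ticker's rows are consecutive and chronological.
toy = toy.sort_values(["Ticker", "Date"]).reset_index(drop=True)

# Previous close within the same ticker; NaN on each ticker's first row.
prev_close = toy.groupby("Ticker")["Adj. Close"].shift(1)

# Percentage change relative to the same ticker's previous row.
toy["Price Change %"] = (toy["Adj. Close"] / prev_close - 1) * 100
print(toy)
```

Each ticker's first row stays NaN because there is no earlier price to compare against.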


Create the Target Variable for ML
To predict whether the price will go up or down, shift the adjusted close price up by one row so that each row can be compared with the price that follows it:

In [4]:
df_merged["Target"] = (df_merged["Adj. Close"].shift(-1) > df_merged["Adj. Close"]).astype(int)  # 1 if the next row's price is higher, 0 otherwise (flat, lower, or no next row)
df_merged.head()

Unnamed: 0,Ticker,Date,SimFinId,Open,High,Low,Close,Adj. Close,Volume,Shares Outstanding,...,"Interest Expense, Net","Pretax Income (Loss), Adj.",Abnormal Gains (Losses),Pretax Income (Loss),"Income Tax (Expense) Benefit, Net",Income (Loss) from Continuing Operations,Net Income,Net Income (Common),Price Change %,Target
0,A,2019-04-15,45846,81.0,81.13,79.91,80.4,77.22,1627268,317515869.0,...,-38000000.0,919000000.0,,919000000.0,152000000.0,1071000000,1071000000,1071000000,,0
1,UFPT,2019-04-15,6767729,36.5,36.82,35.9,36.19,36.19,24772,7402000.0,...,-2763000.0,41782000.0,10936000.0,52718000.0,-10929000.0,41789000,41789000,41789000,-53.133903,0
2,UFPT,2019-04-15,6767729,36.5,36.82,35.9,36.19,36.19,24772,7402000.0,...,-39000.0,21621000.0,-416000.0,21205000.0,-5319000.0,15886000,15886000,15886000,0.0,0
3,UFPT,2019-04-15,6767729,36.5,36.82,35.9,36.19,36.19,24772,7402000.0,...,-83000.0,16742000.0,-459000.0,16283000.0,-2914000.0,13369000,13369000,13369000,0.0,0
4,AXSM,2019-04-15,10383750,14.02,14.8,13.62,14.31,14.31,1774954,33052468.0,...,-5696000.0,-130403000.0,-5674333.0,-130403000.0,-3405000.0,-130403000,-130403000,-130403000,-60.45869,1
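`shift(-1)` on the whole frame compares each row with the next row regardless of ticker, and the final row is labelled 0 because a comparison against NaN is False. A hedged sketch of a per-ticker label follows. It assumes the next row within the same `Ticker` is the next trading day, and it drops rows with no following price instead of labelling them 0. The toy values and the `Next Close` helper column are hypothetical.

```python
import pandas as pd

# Made-up prices, already sorted by ticker and date.
toy = pd.DataFrame({
    "Ticker": ["A", "A", "A", "B", "B", "B"],
    "Date": pd.to_datetime([
        "2019-04-15", "2019-04-16", "2019-04-17",
        "2019-04-15", "2019-04-16", "2019-04-17",
    ]),
    "Adj. Close": [100.0, 110.0, 99.0, 50.0, 45.0, 54.0],
})

# Next close within the same ticker; NaN on each ticker's last row.
toy["Next Close"] = toy.groupby("Ticker")["Adj. Close"].shift(-1)

# Keep only rows that actually have a following price, then label them.
labeled = toy.dropna(subset=["Next Close"]).copy()
labeled["Target"] = (labeled["Next Close"] > labeled["Adj. Close"]).astype(int)
print(labeled[["Ticker", "Date", "Adj. Close", "Target"]])
```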


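Handling Null Values: .pct_change() leaves NaN in the first row, which has no previous value to compare against, so fill it with 0. Target is already an integer column without NaN (a comparison against a missing shifted value evaluates to False, which becomes 0), so filling it is only a safeguard: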

In [5]:
# Assign the result back; fillna(inplace=True) on a selected column triggers a pandas FutureWarning
df_merged["Price Change %"] = df_merged["Price Change %"].fillna(0)
df_merged["Target"] = df_merged["Target"].fillna(0)
df_merged.head()


Unnamed: 0,Ticker,Date,SimFinId,Open,High,Low,Close,Adj. Close,Volume,Shares Outstanding,...,"Interest Expense, Net","Pretax Income (Loss), Adj.",Abnormal Gains (Losses),Pretax Income (Loss),"Income Tax (Expense) Benefit, Net",Income (Loss) from Continuing Operations,Net Income,Net Income (Common),Price Change %,Target
0,A,2019-04-15,45846,81.0,81.13,79.91,80.4,77.22,1627268,317515869.0,...,-38000000.0,919000000.0,,919000000.0,152000000.0,1071000000,1071000000,1071000000,0.0,0
1,UFPT,2019-04-15,6767729,36.5,36.82,35.9,36.19,36.19,24772,7402000.0,...,-2763000.0,41782000.0,10936000.0,52718000.0,-10929000.0,41789000,41789000,41789000,-53.133903,0
2,UFPT,2019-04-15,6767729,36.5,36.82,35.9,36.19,36.19,24772,7402000.0,...,-39000.0,21621000.0,-416000.0,21205000.0,-5319000.0,15886000,15886000,15886000,0.0,0
3,UFPT,2019-04-15,6767729,36.5,36.82,35.9,36.19,36.19,24772,7402000.0,...,-83000.0,16742000.0,-459000.0,16283000.0,-2914000.0,13369000,13369000,13369000,0.0,0
4,AXSM,2019-04-15,10383750,14.02,14.8,13.62,14.31,14.31,1774954,33052468.0,...,-5696000.0,-130403000.0,-5674333.0,-130403000.0,-3405000.0,-130403000,-130403000,-130403000,-60.45869,1


In [6]:
# Set option to display all columns
pd.set_option('display.max_columns', None)

# Show DataFrame info
df_merged.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11953125 entries, 0 to 11953124
Data columns (total 34 columns):
 #   Column                                    Dtype  
---  ------                                    -----  
 0   Ticker                                    object 
 1   Date                                      object 
 2   SimFinId                                  int64  
 3   Open                                      float64
 4   High                                      float64
 5   Low                                       float64
 6   Close                                     float64
 7   Adj. Close                                float64
 8   Volume                                    int64  
 9   Shares Outstanding                        float64
 10  Currency                                  object 
 11  Fiscal Year                               int64  
 12  Fiscal Period                             object 
 13  Publish Date                              object 
 14  

In [7]:
df_merged.columns

Index(['Ticker', 'Date', 'SimFinId', 'Open', 'High', 'Low', 'Close',
       'Adj. Close', 'Volume', 'Shares Outstanding', 'Currency', 'Fiscal Year',
       'Fiscal Period', 'Publish Date', 'Restated Date', 'Shares (Basic)',
       'Shares (Diluted)', 'Revenue', 'Cost of Revenue', 'Gross Profit',
       'Operating Expenses', 'Selling, General & Administrative',
       'Operating Income (Loss)', 'Non-Operating Income (Loss)',
       'Interest Expense, Net', 'Pretax Income (Loss), Adj.',
       'Abnormal Gains (Losses)', 'Pretax Income (Loss)',
       'Income Tax (Expense) Benefit, Net',
       'Income (Loss) from Continuing Operations', 'Net Income',
       'Net Income (Common)', 'Price Change %', 'Target'],
      dtype='object')

Weekday Column

In [None]:
df_merged['Date'] = pd.to_datetime(df_merged['Date'])  # Convert "Date" from object to datetime so the .dt accessors below work

In [9]:
# 1. Weekday (Categorical for Classification)
df_merged['Weekday'] = df_merged['Date'].dt.day_name()  # 'Monday', 'Tuesday', etc.

Month Column

In [10]:
# 2. Month (Categorical for Classification)
df_merged['Month'] = df_merged['Date'].dt.month  # 1 (Jan) to 12 (Dec)

Quarter Column

In [11]:
# 3. Quarter (Categorical for Classification)
df_merged['Quarter'] = df_merged['Date'].dt.quarter  # 1 to 4
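Weekday holds day names as strings, while Month and Quarter hold integers that a model may read as ordered magnitudes. One common way to feed all three to a classifier as categories is one-hot encoding. The sketch below uses `pd.get_dummies` on a made-up three-row frame. Choosing one-hot encoding here is an assumption, not something the notebook prescribes.

```python
import pandas as pd

# Three made-up dates: a Monday, a Tuesday, and a Friday in a later quarter.
toy = pd.DataFrame(
    {"Date": pd.to_datetime(["2019-04-15", "2019-04-16", "2019-07-05"])}
)
toy["Weekday"] = toy["Date"].dt.day_name()
toy["Month"] = toy["Date"].dt.month
toy["Quarter"] = toy["Date"].dt.quarter

# One 0/1 indicator column per distinct value found in each feature.
encoded = pd.get_dummies(toy, columns=["Weekday", "Month", "Quarter"], dtype=int)
print(encoded.columns.tolist())
```

Only values that occur in the data get a column, so a training and a test split may need their indicator columns aligned.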

Daily Returns (Percentage Change)

In [12]:
df_merged['Daily_Return'] = df_merged['Adj. Close'].pct_change() * 100  # Same calculation as "Price Change %", in %

Moving Averages (5-day and 10-day)

In [13]:
# min_periods=1 averages over however many rows are available, so the first rows are not NaN
df_merged['MA_5'] = df_merged['Adj. Close'].rolling(window=5, min_periods=1).mean()
df_merged['MA_10'] = df_merged['Adj. Close'].rolling(window=10, min_periods=1).mean()
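A rolling window over the whole frame averages across whichever rows happen to be adjacent, so prices from different tickers can land in the same window. A minimal sketch of a per-ticker moving average follows. It assumes the frame is sorted by `Ticker` and `Date`, and it uses a 2-row window only to keep the made-up example small; the name `MA_2` is hypothetical.

```python
import pandas as pd

# Made-up prices, sorted by ticker and date.
toy = pd.DataFrame({
    "Ticker": ["A", "A", "A", "B", "B", "B"],
    "Adj. Close": [100.0, 110.0, 99.0, 50.0, 45.0, 54.0],
})

# transform applies the rolling mean inside each ticker and returns a
# result aligned with the original rows.
toy["MA_2"] = toy.groupby("Ticker")["Adj. Close"].transform(
    lambda s: s.rolling(window=2, min_periods=1).mean()
)
print(toy)
```

The first row of ticker B is 50.0 here, not an average that includes ticker A's last price.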

Log Transformation for Time Series Modeling

In [14]:
df_merged['Log_Close'] = np.log(df_merged['Adj. Close'])  # Log of Adjusted Close Price
df_merged.head()



Unnamed: 0,Ticker,Date,SimFinId,Open,High,Low,Close,Adj. Close,Volume,Shares Outstanding,Currency,Fiscal Year,Fiscal Period,Publish Date,Restated Date,Shares (Basic),Shares (Diluted),Revenue,Cost of Revenue,Gross Profit,Operating Expenses,"Selling, General & Administrative",Operating Income (Loss),Non-Operating Income (Loss),"Interest Expense, Net","Pretax Income (Loss), Adj.",Abnormal Gains (Losses),Pretax Income (Loss),"Income Tax (Expense) Benefit, Net",Income (Loss) from Continuing Operations,Net Income,Net Income (Common),Price Change %,Target,Weekday,Month,Quarter,Daily_Return,MA_5,MA_10,Log_Close
0,A,2019-04-15,45846,81.0,81.13,79.91,80.4,77.22,1627268,317515869.0,USD,2019,FY,2019-12-19,2021-12-17,314000000.0,318000000.0,5163000000.0,-2358000000.0,2805000000.0,-1864000000.0,-1460000000.0,941000000.0,-22000000.0,-38000000.0,919000000.0,,919000000.0,152000000.0,1071000000,1071000000,1071000000,0.0,0,Monday,4,2,,77.22,77.22,4.346658
1,UFPT,2019-04-15,6767729,36.5,36.82,35.9,36.19,36.19,24772,7402000.0,USD,2022,FY,2023-03-16,2025-03-03,7564000.0,7663000.0,353792000.0,-263532000.0,90260000.0,-45796000.0,-45796000.0,44464000.0,-2682000.0,-2763000.0,41782000.0,10936000.0,52718000.0,-10929000.0,41789000,41789000,41789000,-53.133903,0,Monday,4,2,-53.133903,56.705,56.705,3.588783
2,UFPT,2019-04-15,6767729,36.5,36.82,35.9,36.19,36.19,24772,7402000.0,USD,2021,FY,2022-03-14,2024-02-29,7524000.0,7615000.0,206320000.0,-155206000.0,51114000.0,-29480000.0,-29480000.0,21634000.0,-13000.0,-39000.0,21621000.0,-416000.0,21205000.0,-5319000.0,15886000,15886000,15886000,0.0,0,Monday,4,2,0.0,49.866667,49.866667,3.588783
3,UFPT,2019-04-15,6767729,36.5,36.82,35.9,36.19,36.19,24772,7402000.0,USD,2020,FY,2021-03-12,2023-03-16,7484000.0,7568000.0,179373000.0,-134689000.0,44684000.0,-27493000.0,-27493000.0,17191000.0,-449000.0,-83000.0,16742000.0,-459000.0,16283000.0,-2914000.0,13369000,13369000,13369000,0.0,0,Monday,4,2,0.0,46.4475,46.4475,3.588783
4,AXSM,2019-04-15,10383750,14.02,14.8,13.62,14.31,14.31,1774954,33052468.0,USD,2021,FY,2022-03-01,2024-02-23,37618599.0,37618599.0,737671000.0,-439937000.0,356587000.0,-124707000.0,-66646000.0,-124707000.0,-5696000.0,-5696000.0,-130403000.0,-5674333.0,-130403000.0,-3405000.0,-130403000,-130403000,-130403000,-60.45869,1,Monday,4,2,-60.45869,40.02,40.02,2.660959
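`np.log` emits a runtime warning and returns -inf for zero and NaN for negative inputs. If the price column can contain such values (an assumption; the notebook does not check), one option is to mask them before taking the log. The sketch below uses a made-up series and treats non-positive prices as missing.

```python
import numpy as np
import pandas as pd

# Made-up prices including a zero and a negative value.
prices = pd.Series([77.22, 0.0, 36.19, -1.0], name="Adj. Close")

# where() keeps positive prices and replaces the rest with NaN, so np.log
# never receives a non-positive input.
log_close = np.log(prices.where(prices > 0))
print(log_close)
```

The masked rows can then be dropped or inspected, depending on why the prices were non-positive.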
